AI as a Compression Problem

A recent article in The Atlantic makes the case that very large language models effectively contain much of the material they're trained on. The Atlantic piece is an attempt to popularize the insights of the recent academic paper Extracting books from production language models by Ahmed et al. The paper's authors demonstrate convincingly that well-known copyrighted textual material can be extracted from the chatbot interfaces of popular commercial LLM services.

The Atlantic article quotes a podcast remark about the Stable Diffusion AI image-generator model: "We took 100,000 gigabytes of images and compressed it to a two-gigabyte file that can re-create any of those and iterations of those". By analogy, this suggests we might think of LLMs (which work on text rather than the images Stable Diffusion handles) as a form of lossy textual compression.

The entire text of Moby Dick, the canonical Big American Novel, is merely 1.2 MiB uncompressed (and less than 0.4 MiB losslessly compressed with bzip2 -9). It's not hard to imagine that a model with hundreds of billions of parameters might contain copies of works like this.
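
That bzip2 figure is easy to check for yourself. Here's a minimal sketch in Python (the Project Gutenberg URL is an assumption; any plain-text copy of the novel will do):

```python
import bz2
import urllib.request

# Plain-text Moby Dick from Project Gutenberg (URL assumed; adjust as needed).
URL = "https://www.gutenberg.org/files/2701/2701-0.txt"

raw = urllib.request.urlopen(URL).read()
compressed = bz2.compress(raw, compresslevel=9)  # roughly equivalent to bzip2 -9

print(f"uncompressed: {len(raw) / 2**20:.2f} MiB")
print(f"bzip2 -9:     {len(compressed) / 2**20:.2f} MiB")
```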

Warning: The next paragraph contains fuzzy math with no real concrete engineering practice behind it!

Consider a hypothetical model with 100 billion parameters, where each parameter is stored as a 16-bit floating-point value. The model weights would take 200 GB of storage. If you were to fill that parameter space only with losslessly compressed copies of books like Moby Dick, you could fit roughly half a million of them, more than anyone can read in a lifetime. And lossy compression typically produces files orders of magnitude smaller than lossless compression, so we're talking about millions of works effectively encoded, accepting that some artifacts will be injected into the output.
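
For the curious, the back-of-envelope arithmetic above looks like this (same assumed figures as in the paragraph, nothing more):

```python
# Back-of-envelope only: these figures mirror the fuzzy math above,
# not any real model's storage layout.
params = 100e9                      # hypothetical 100-billion-parameter model
bytes_per_param = 2                 # 16-bit floats
weights_bytes = params * bytes_per_param
print(f"weights: {weights_bytes / 1e9:.0f} GB")              # ~200 GB

book_bytes = 0.4 * 2**20            # one losslessly compressed Moby Dick (~0.4 MiB)
print(f"books that fit: {weights_bytes / book_bytes:,.0f}")   # roughly half a million
```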

I first encountered this "compression" view of AI nearly three years ago, in Ted Chiang's insightful ChatGPT Is a Blurry JPEG of the Web. I was surprised that The Atlantic article didn't cite Chiang's piece. If you haven't read Ted Chiang, I strongly recommend his work, and this piece is a great place to start.

Chiang aside, the more recent writing that focuses on the idea of compressed works being "contained" in the model weights seems aimed at wielding some sort of copyright claim against the AI companies that maintain or provide access to these models. There are many, many problems with AI today, but attacking AI companies on copyright grounds seems a bit like going after Al Capone for tax evasion.

We should be much more concerned with the effect these projects have on cultural homogeneity, mental health, labor rights, privacy, and social control than whether they're violating copyright in some specific instance.