LLMs and Artificial General Intelligence, Part V: Counter-arguments: The Argument from Design and Ted Chiang’s “Blurry JPEG of the Web” Argument

Adam Morse
Jun 12, 2023


Prior Essays:
LLMs and Reasoning, Part I: The Monty Hall Problem
LLMs and Reasoning, Part II: Novel Practical Reasoning Problems
LLMs and Reasoning, Part III: Defining a Programming Problem and Having GPT 4 Solve It
LLMs and Artificial General Intelligence, Part IV: Counter-arguments: Searle’s Chinese Room and Its Successors

After presenting a series of arguments that LLMs can demonstrate reasoning and understanding, I have turned to responding to major counter-arguments. Friday’s essay addressed Searle’s Chinese Room hypothetical and similar arguments. Today, I respond to arguments based on design and to Ted Chiang’s influential “blurry JPEG of the Web” argument, which I view as closely related.

Arguments based on design

A second category of arguments that LLMs cannot have understanding and intelligence stems from their design and construction. According to this argument, LLMs are designed simply as autocomplete on steroids, identifying the most probable subsequent token. Because they are built to calculate the next token, not to think or reason, no amount of apparent reasoning should lead to a conclusion of intelligence. That’s simply not part of their design.
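To make the premise concrete, here is a deliberately toy sketch of what “identify the most probable subsequent token” means. Everything in it (the tiny vocabulary, the probability table, the function name) is invented for illustration; a real LLM computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens rather than looking them up in a hand-written table.

```python
# Toy sketch of "predict the most probable next token." This is an
# illustration of the objective, not the code of any real LLM: real models
# score every token in a large vocabulary with a trained neural network.

# Hypothetical next-token probabilities conditioned on the previous token.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.4, "dog": 0.35, "idea": 0.25},
    "cat": {"sat": 0.6, "ran": 0.3, "slept": 0.1},
    "sat": {"on": 0.7, "quietly": 0.3},
}

def most_probable_next_token(prev_token: str) -> str:
    """Greedy decoding: return the single most probable continuation."""
    candidates = NEXT_TOKEN_PROBS.get(prev_token, {})
    return max(candidates, key=candidates.get) if candidates else "<end>"

tokens = ["the"]
while tokens[-1] != "<end>" and len(tokens) < 8:
    tokens.append(most_probable_next_token(tokens[-1]))

print(" ".join(tokens))  # prints: the cat sat on <end>
```

The design argument treats this objective as a ceiling on what such a system can be; the response below is that the objective by itself says little about what capabilities emerge when the probability estimator is a very large network trained on an enormous corpus.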

This argument makes the error of assuming that devices — especially complex ones — can only serve the purpose for which they are designed. The claim is not that LLMs were built to think and reason, but that the ability to think and reason may be an emergent property of their architecture and training process. What the designers intended is less important than the reality of what they built.

The standard non-religious explanation of human intelligence is that it arose without any design or intent at all. Animals followed instinctive, biologically driven patterns of behavior to reproduce their genes. Random mutations led to the emergence of some animals with more impressive nervous systems, which gave those animals competitive advantages and more reproductive success (i.e., more descendants). Eventually, the processes of natural selection and evolution resulted in sapience — in intelligence in the human-like sense that I am discussing. None of that involved any actual design or intent, notwithstanding that some of it may have been driven by sexual selection for more intelligent mates. Intelligence doesn’t require design behind it, and the absence of an intent to design for intelligence — or limits on which components of intelligence were being designed for — doesn’t preclude its emergence.

These arguments from design strike me as reversed versions of the philosophical arguments for the existence of God based on the apparent evidence of intelligent design — often called the watchmaker argument, after a standard analogy in the philosophical literature. Those arguments point to complex systems that look as though they were engineered by an almighty power and conclude that God must exist, ignoring or rejecting the possibility that those systems could have arisen through natural processes like physics (for the formation of the Solar System) or biology (for the evolution of complicated organisms and the development of intelligence). Arguments from design about LLMs run in the opposite direction: because we know the engineered purpose of LLMs, the argument goes, they cannot, despite their complexity, have emergently developed other capabilities. Just as the watchmaker argument is unconvincing to most people who do not already believe in God’s existence, the argument that LLMs are incapable of reasoning because reasoning was not part of their design should be viewed as unconvincing.

“A Blurry JPEG of the Web”

Ted Chiang argues that the best way to understand LLMs is that they produce “a blurry JPEG of the Web.”1 His New Yorker article has become one of the most celebrated critiques of LLMs’ capabilities, cited by many people as providing an insightful account of what LLMs are and why they are a threat to creative work. Under his analysis, LLMs are really just databases that store the data they were trained on. Rather than storing the data in a lossless format that allows the exact inputs to be reproduced precisely, they store it in a lossy way, extracting some version of the underlying meaning of the input and then regurgitating that upon request. However, because their storage of the data is imperfect — giving up accuracy and fidelity for manageability and storage advantages — the versions they produce differ from the originals, incorporating errors and “hallucinations.” Therefore, he suggests, LLMs are unlikely to be as useful as their proponents would have people believe: given access to the originals on the Internet, using the reconstructed, less accurate versions produced by LLMs gives up accuracy for little gain, while replacing actual human production of knowledge with lossy, computer-driven content mills.

His argument bears a close resemblance to critiques of ChatGPT based on its frequent inaccuracies.2 GPT often gives responses that look and feel right but are not in fact accurate — blithely weaving together true statements, inaccurate statements, and outright fabrications. Some of my academic friends find that asking ChatGPT to describe their professional work is an amusing party game. Its responses frequently mention and describe actual publications they have written, attribute works by other scholars to them, and, most humorously, make up entirely new titles and descriptions of works that do not exist. Often, this has the feeling of being truthy without being true: the works it describes seem like the sort of thing those scholars would write, even though they did not in fact write them.3 Similarly, if you ask ChatGPT to analyze song lyrics, it will frequently include “quotations” that do not actually appear in the lyrics, and if you ask it to summarize a scholarly area, it will provide citations to books and articles that do not exist, sometimes with ersatz URLs that lead nowhere. This has led to some humorously appalling examples of ChatGPT Worst Practices, such as the Texas A&M-Commerce professor who concluded that his entire class was cheating with ChatGPT because, when he entered snippets of their text into ChatGPT and asked whether it had written them, it erroneously told him that it had.4 An even worse example ensued when a lawyer used ChatGPT to generate a brief that included citations to cases that do not, in fact, exist. When ordered by the court to provide the full text of those opinions, he asked ChatGPT to produce them and ended up submitting entirely fabricated opinions to the court5 — conduct so egregious that I would consider anything short of outright disbarment to be merciful.

These hallucinations represent enormous weaknesses of LLMs, of course, but concluding that LLMs are nothing more than imperfect photocopiers of the Internet is a category error. It identifies a flaw or limitation, treats it as the defining feature of what LLMs do, and then concludes that they do that thing poorly. This is like complaining that a camera is a poor substitute for a pencil and a notepad because it is hard to use it to compose text or to sketch an idea; that’s true, but it misses the point that a camera offers very different capabilities, some better, some worse, and some merely different. Instead, we should look at different capabilities separately and thus be open to seeing what LLMs can do well, as well as what they still do badly. LLMs can do some things extraordinarily well — for example, their ability to produce grammatical, coherent text is nothing short of amazing. LLMs’ ability to reason is also remarkable and has improved rapidly over the past few generations of models. Conversely, LLMs’ ability to use their training data as a database of knowledge is mixed. They demonstrate what in a human we would view as an impressive breadth of knowledge, but they produce outputs that read like a bullshitting human — stating as plain fact things that aren’t and providing quotations and citations that don’t exist to create the appearance of deeper and more complete knowledge than they actually have. One recommendation some people offer for getting better results from LLMs with active web access is to instruct the model to do a web search before providing any answer, precisely to ensure that it is not over-relying on the imperfect results of its training.6 Likewise, when using an LLM to write material, some people recommend assuming that any fact the user did not supply is incorrect.
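For what it is worth, here is a minimal sketch of what that advice might look like in practice. The instruction wording, the idea of sending it as a “system” message, and the helper function are my own illustrative assumptions rather than anything from Mollick’s guide or a specific product; how, and whether, the model can actually run a search depends on the tool you are using.

```python
# A sketch of the "search before you answer" prompting advice. The wording
# and the system/user message structure are illustrative assumptions; the
# resulting list would be passed to whichever chat model or tool-enabled
# endpoint you use, which is outside the scope of this sketch.
SEARCH_FIRST_INSTRUCTION = (
    "Before answering, run a web search on the key factual claims in the "
    "question and base your answer on what the search returns, citing the "
    "pages you relied on. If you cannot verify a fact, say so rather than "
    "guessing."
)

def build_messages(user_question: str) -> list[dict]:
    """Assemble a chat-style message list with the instruction up front."""
    return [
        {"role": "system", "content": SEARCH_FIRST_INSTRUCTION},
        {"role": "user", "content": user_question},
    ]

print(build_messages("What has Adam Morse published about LLMs and reasoning?"))
```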

In many ways, Chiang’s argument reminds me of a misguided version of the argument from design. His argument amounts to the assertion that LLMs are designed to store the text of the Internet, boiled down to a smaller representation, and then to try to rehydrate the tablets that were once articles and web pages and books. That design doesn’t leave room for actual intelligence, nor does it do a great job of recreating the original text. But that’s not really what LLMs are doing at all. They are not storing a JPEG, a lossy, smaller version of an existing image. Rather, vast amounts of text are used to train a neural network — a stack of transformer modules — that can decode prompts, develop appropriate responses, and output those responses as text.
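To show the difference in concrete terms, here is a minimal sketch of what generation actually looks like, using the open-source Hugging Face transformers library and the small, public GPT-2 model as a stand-in (my choice for illustration; it is nothing like GPT-4 in scale or capability). The point is structural: the prompt is turned into tokens, and the model predicts a continuation one token at a time from its trained weights. Nothing in the loop looks up a stored copy of a web page and decompresses it.

```python
# A sketch of autoregressive generation with a small public model. The model
# choice (gpt2) and prompt are illustrative; large proprietary LLMs add far
# more parameters, training data, and fine-tuning, but the generation loop
# is the same in kind: predict the next token, append it, repeat.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode the prompt into token ids.
inputs = tokenizer("The Monty Hall Problem is", return_tensors="pt")

# Greedy decoding: at each step, take the most probable next token.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Setting do_sample=False makes the run deterministic (greedy decoding), which keeps the sketch simple; deployed chat systems typically sample from the predicted distribution instead, but either way the continuation comes from the learned weights, not from retrieving a stored document.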

The reason LLMs can succeed at some new tasks that involve reasoning is that this architecture and implementation, at the scale of current systems, produces an emergent ability to respond well to prompts — not by finding an appropriate example and reconstituting it from a database, but by constructing a novel response to a novel query. A blurry photocopier AI system can identify that someone is asking about the Monty Hall Problem and then produce a blurred copy of Wikipedia’s article on it. But a current-generation LLM with an adequate number of parameters and an adequately enormous training corpus can read a prompt asking about Karsten’s bizarre goatherder variant, recognize that it is superficially like the Monty Hall Problem but that the correct answer in this variant is different, and provide a response that explains both the Monty Hall Problem and the correct answer under the strange variant presented. As a database storing and retrieving Wikipedia’s article on the Monty Hall Problem, GPT 4 is at best flawed. But as an LLM capable of reasoning about a novel variant, it is remarkable.

1Ted Chiang, “ChatGPT Is a Blurry JPEG of the Web,” New Yorker, available at https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web (Feb. 9, 2023).

2See, e.g., Gary Marcus, “GPT-4’s Successes, and GPT-4’s Failures,” The Road to AI We Can Trust, Substack, available at https://garymarcus.substack.com/p/gpt-4s-successes-and-gpt-4s-failures (March 15, 2023).

3This game has become less fun recently: likely to reduce liability, ChatGPT now refuses some requests to generate biographical information about insufficiently notable people, requests it used to answer erroneously.

4Pranshu Verma, “A Professor Accused his Class of Using ChatGPT, Putting Diplomas in Jeopardy,” Washington Post, available at https://www.washingtonpost.com/technology/2023/05/18/texas-professor-threatened-fail-class-chatgpt-cheating/ (May 18, 2023).

5See Mata v. Avianca, Order to Show Cause, 22 CV 1461 (PKC) (S.D.N.Y. May 4, 2023), available at https://storage.courtlistener.com/recap/gov.uscourts.nysd.575368/gov.uscourts.nysd.575368.31.0_1.pdf.

6Ethan Mollick, “A Guide to Prompting AI (for what it is worth)”, One Useful Thing, available at https://www.oneusefulthing.org/p/a-guide-to-prompting-ai-for-what (April 26, 2023). In general, Prof. Mollick’s work is useful for thinking about what ChatGPT and similar tools can do, as opposed to only what they can’t do.
