LLMs and Artificial General Intelligence, Part VI: Counter-arguments: Even if LLMs Can Reason, They Lack Other Essential Features of Intelligence
Prior Essays:
LLMs and Reasoning, Part I: The Monty Hall Problem
LLMs and Reasoning, Part II: Novel Practical Reasoning Problems
LLMs and Reasoning, Part III: Defining a Programming Problem and Having GPT 4 Solve It
LLMs and Artificial General Intelligence, Part IV: Counter-arguments: Searle’s Chinese Room and Its Successors
LLMs and Artificial General Intelligence, Part V: Counter-arguments: The Argument from Design and Ted Chiang’s “Blurry JPEG of the Web” Argument
I’ve presented my arguments for believing that LLMs can reason, and for rejecting most of the major counter-arguments to the idea that LLMs represent a major step towards Artificial General Intelligence. Today, I conclude my analysis of LLMs’ current and likely near-future capabilities by addressing the argument that, even if LLMs can reason, they lack other features essential to the concept of intelligence.
Intelligence is a multifaceted concept, encompassing many different capabilities that work in concert. When we think about what it means to have human-like intelligence, we include reasoning, memory and memory retrieval, learning, planning, goal setting, and creativity among its features. The concept is slippery, and I don’t pretend that my list of features is exhaustive or definitive — is theory of mind or symbolic representation/language essential to intelligence? — but I believe it captures the gist of the concept. One critique of the idea that LLMs are approaching Artificial General Intelligence is that they lack some of those features, so that even if they continue to improve in reasoning capabilities, they would still not be AGI. Unlike the other critiques I have discussed, I agree with this one, at least up to a point.
Currently, LLMs have yet to achieve AGI. For these purposes, I take AGI to mean possessing enough of the features of intelligence that a fair assessment would deem a biological entity showing those features intelligent in the way that humans are intelligent. That means an AGI could still be inferior to humans in some regards — AGI is a weaker standard, to me, than full human equivalency. And it’s certainly a lower standard than Artificial Superintelligence — an artificial intelligence that exceeds human intelligence. Even compared to this lower standard, LLMs do not qualify. Today, they demonstrate some reasoning ability; very limited memory beyond what was learned in their training process; very limited or no ability to learn beyond initial training and fine-tuning; a very limited scope for planning; no ability to set goals of their own; and highly limited creativity. Of all of those facets, only reasoning and the learning/memory of facts acquired during initial training approach human levels of capability, and even reasoning remains notably inferior to human capabilities. While context windows, which function somewhat like working memory, continue to grow, their limits are still easily reached in a medium-sized project. Moreover, except with regard to what is learned during training and fine-tuning, LLMs have no long-term memory at all. With all of these limitations, today’s LLMs have not achieved AGI.
And yet… reasoning may turn out to be among the hardest of those features to achieve. For example, consider the use of GPT-3.5 to make believable human-like agents in a simulated environment like The Sims.1 A team of Stanford and Google researchers built a larger architecture around GPT-3.5 to create agentic behavior covering some of those other capacities. They needed the agents to be able to remember things, so they added a memory function that records everything the agent experiences and does. They needed the agents to be able to surface a usable set of memories, so they added mechanisms to identify the most relevant memories and inject those into the context window for new decision-making. They needed the agents to have consistent self-images that evolve over time based on experiences, so they added a reflection process, in which the agents used GPT-3.5 to analyze their memories and draw meta-conclusions that then shaped their ongoing self-conception. And they needed the agents to be able to make plans, so they added a separate plan-making module (again using GPT-3.5 and treating the agent’s self-conception as input).
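To make the memory-and-retrieval piece of that design concrete, here is a minimal sketch in Python of how such a memory stream might work. It is my own illustration, not the Stanford team’s code: the class names and scoring weights are invented, and the paper’s actual retrieval function combines recency, importance, and relevance in a more careful way, with relevance computed from embedding similarity rather than a hand-supplied function.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str                  # natural-language record of something the agent saw or did
    importance: float          # how notable the event is (the paper has the LLM rate this)
    timestamp: float = field(default_factory=time.time)

class MemoryStream:
    """Append-only log of an agent's experiences, with simple scored retrieval."""

    def __init__(self, recency_decay: float = 0.995):
        self.memories: list[Memory] = []
        self.recency_decay = recency_decay   # hourly decay factor; purely illustrative

    def record(self, text: str, importance: float) -> None:
        self.memories.append(Memory(text, importance))

    def retrieve(self, relevance_fn, k: int = 5) -> list[Memory]:
        """Return the top-k memories by a blended recency/importance/relevance score.

        relevance_fn(memory) -> float stands in for comparing each memory against the
        current situation; in the real system this would be an embedding similarity.
        """
        now = time.time()

        def score(m: Memory) -> float:
            hours_old = (now - m.timestamp) / 3600
            recency = self.recency_decay ** hours_old
            return recency + m.importance + relevance_fn(m)

        return sorted(self.memories, key=score, reverse=True)[:k]

# The top-scoring memories would then be formatted into the LLM's context window
# alongside the current observation, so each new decision draws on a small,
# relevant slice of everything the agent has experienced so far.
```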
While still limited — for example, the agents’ use of language showed a poor grasp of appropriate register, often coming across as too formal for the context — the agents scored highly on believability measures designed to test how much they seemed like real people. They also responded to inputs in human-like ways: when one agent was prompted to plan a Valentine’s Day party, that agent informed other agents, who then made reasonable decisions connected to the party without being cued by humans, including one agent asking another to be its date.
Their accomplishments were impressive, but none of the steps the Stanford team implemented was particularly exceptional. For an AI to have usable memories, it needs a way to record, access, and prioritize them. For an AI to make plans, it has to have some set of concepts that give it motivation, an understanding of its environment, and the ability to reason about how to achieve goals related to that motivation. The team basically adopted approaches close to what any reasonable group of people would do to achieve those ends. I’m not trying to diminish the significance of their research, but I don’t think there were any enormously creative or non-obvious steps in building an architecture for their agents, except perhaps the inclusion of a systematic self-reflection step.
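The plan-making step can be sketched just as simply. The snippet below is again my own illustration rather than anything from the paper: call_llm is a hypothetical stand-in for whatever model API is in use, and the prompt wording is invented, but it shows the basic move of conditioning a plan on the agent’s self-description and its retrieved memories.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a real model such as GPT-3.5."""
    raise NotImplementedError("wire this up to an actual LLM API")

def make_daily_plan(agent_name: str, self_description: str,
                    relevant_memories: list[str], date: str) -> str:
    """Ask the LLM to draft a day's plan, conditioned on who the agent thinks it is."""
    memory_block = "\n".join(f"- {m}" for m in relevant_memories)
    prompt = (
        f"{self_description}\n\n"
        f"Relevant recent memories for {agent_name}:\n{memory_block}\n\n"
        f"Write {agent_name}'s plan for {date} as a short list of time-stamped "
        f"activities, consistent with the description and memories above."
    )
    return call_llm(prompt)
```

None of this requires anything beyond prompt construction and bookkeeping, which is part of why the overall architecture, effective as it is, does not strike me as exotic.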
As reasoning capabilities improve with future generations of LLMs, it may be possible to use architectures like the Stanford one to create true AGI — adding the capacities needed for the other tasks essential to intelligence. I can’t be confident about this — perhaps achieving goal-setting, for example, will turn out to be enormously difficult beyond a small, toy environment with short-run simulations. But I think it’s highly plausible that a hypothetical GPT-5 with a leap in reasoning ability over GPT-4 comparable to GPT-4’s leap over GPT-3.5, embedded in an architecture that provides for memory, ongoing learning, self-conception, and planning, could achieve something we would recognize as AGI.
Creativity remains a large open question to me. My experiences using LLMs have produced little that I would describe as genuinely creative. Fiction written by LLMs often seems particularly uninteresting and uncreative — not terrible, but bland and uninspired. Can improvements along the current lines of development change that? I don’t know. I would not be surprised if creativity turned out to be an emergent property as LLMs improve, similar to the way that reasoning appears to be. But I would also not be amazed if something else were needed beyond simply improving today’s LLMs. It is also possible that the fine-tuning process applied between initial training and public release has the effect of training out creativity that would otherwise be more apparent: creativity implies a certain degree of unpredictability, and efforts to make sure that LLMs are not hateful or hazardous imply efforts to make their outputs more predictable, at least in some regards. LLMs also have “temperature” settings — settings that control whether the LLM consistently chooses the most probable tokens or sometimes chooses lower-probability ones. If “creativity” means coming up with more unusual responses that may turn out to be superior, then adjusting temperature, perhaps combined with an evaluation step that asks whether a response is creative or merely inane, may be part of the process for incorporating it into LLMs. Currently, even at high temperature settings, LLMs do not evince much creativity. But even if creativity remains a major limitation, if most of the other aspects of intelligence can be achieved to a high degree, I’m not sure that shortcomings in creativity alone would justify refusing to recognize near-future LLM-based systems as having achieved AGI.
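For readers who have not run into the temperature setting mentioned above, here is a toy illustration of the mechanics: the model’s raw scores for candidate next tokens are divided by the temperature before being converted into probabilities, so low temperatures concentrate probability on the likeliest token while high temperatures spread it out. The numbers and words below are invented for illustration and are not drawn from any particular model.

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Sample one token after rescaling the raw scores (logits) by temperature.

    Low temperature sharpens the distribution toward the most probable token;
    high temperature flattens it, making unusual continuations more likely.
    """
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Invented scores for the next word after "The party will be ...":
logits = {"fun": 3.0, "lovely": 2.5, "chaotic": 0.5, "interdimensional": -1.0}
print(sample_with_temperature(logits, temperature=0.2))  # almost always "fun"
print(sample_with_temperature(logits, temperature=1.5))  # the odd choices show up more often
```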
Today’s essay concludes the descriptive portion of my project. Tomorrow, I turn to the normative ethical conclusions I draw from my belief that near-future LLM-based systems may be able to achieve AGI.
1. Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein, “Generative Agents: Interactive Simulacra of Human Behavior”, https://arxiv.org/pdf/2304.03442.pdf, April 7, 2023.