LLMs and Reasoning, Part III: Defining a Programming Problem and Having GPT 4 Solve It
LLMs and Reasoning, Part I: The Monty Hall Problem
LLMs and Reasoning, Part II: Novel Practical Reasoning Problems
In my previous essays, I discussed GPT 4’s performance at various tests of reasoning: a perverse variant on the Monty Hall Problem, a problem involving navigating an environment with wet paint, and a spatial reasoning problem following a tennis ball. In this essay, I turn to a different category of tasks: computer programming. I believe that GPT 4’s successes in programming also provide evidence of emergent reasoning capabilities. I also have some concluding remarks on the evidence supporting LLMs’ ability to reason before I turn to counter-arguments.
Asking GPT 4 to Write a Novel Computer Program
I recently had a problem that would be relatively easy to solve with a computer program. I had a hypothesis I wanted to test: that in a system of soccer/association football leagues with promotion and relegation between them, Elo ratings would tend to overestimate the strength of teams in higher leagues and underestimate the strength of teams in lower leagues. Elo ratings are a system for rating and ranking competitors that Arpad Elo, a physicist and avid chess player, developed for rating chess players. In an Elo rating system, each competitor carries a rating, and the difference between two competitors’ ratings represents a probability of victory in a head-to-head match. After each rated match, both competitors’ ratings are updated, increasing the rating of the winner and decreasing the rating of the loser according to a mathematical formula that sets the size of the adjustment based in part on how unlikely the outcome was — a major upset produces a relatively large change, while an expected drubbing of a much less skilled opponent produces a small adjustment. The system has since been applied to many competitions beyond chess, and I was curious whether the promotion and relegation of teams in English association football would produce systematic inaccuracies in Elo ratings: differences between the measured Elo rating and a hypothetical accurate Elo rating. This is reasonably straightforward to simulate, but a bit of a hassle to code for someone with rusty coding skills. I’m not a professional programmer, and while I have at times been a pretty capable amateur, I don’t do much coding these days.
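For readers who haven’t seen the Elo formulas written out, here is a minimal sketch of the standard expected-score and update calculations in Perl (the language I chose for the project). This is only an illustration of the formula, not the code GPT 4 produced, and the K-factor of 32 is a common convention rather than anything from my prompt:

    # A minimal sketch of the standard Elo calculation (not GPT 4's code).
    # The expected score follows from the rating difference, and the
    # post-game adjustment is proportional to how surprising the result was.
    use strict;
    use warnings;

    # Expected score (win probability, counting a draw as half a win)
    # for a competitor rated $ra against a competitor rated $rb.
    sub expected_score {
        my ($ra, $rb) = @_;
        return 1 / (1 + 10 ** (($rb - $ra) / 400));
    }

    # Update both ratings given the actual score for competitor A
    # (1 for a win, 0.5 for a draw, 0 for a loss). $k is the K-factor,
    # which caps how large a single adjustment can be.
    sub update_elo {
        my ($ra, $rb, $score_a, $k) = @_;
        $k //= 32;    # a common default; my prompt didn't specify one
        my $expected_a = expected_score($ra, $rb);
        my $new_ra = $ra + $k * ($score_a - $expected_a);
        my $new_rb = $rb + $k * ((1 - $score_a) - (1 - $expected_a));
        return ($new_ra, $new_rb);
    }

    # Example: a 1400-rated team upsets a 1700-rated team.
    my ($winner, $loser) = update_elo(1400, 1700, 1);
    printf "Upset winner rises to %.1f, favorite falls to %.1f\n", $winner, $loser;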
Instead of doing it on my own, I asked GPT 4 to write the program for me, providing it with a paragraph-long description of what I wanted, but without giving it any instruction as to how to do it beyond the choice of language:
Prompt: I’d like to create a simulation in Perl of an Elo ranking system for a set of soccer leagues with promotion and relegation. Each league should have 20 teams in it. Each team should have a true Elo rating (how strong it actually is, which doesn’t change), and an observed Elo rating (an Elo rating calculated using the Elo formula based on the results of its games and the observed Elo rating going into the game). In each season, each team should play all the other teams in its league twice. At the end of the season the top two teams by points (3 for a win, 1 for a draw, and 0 for a loss) get promoted to the league above (except for the top league) and the bottom two teams by points get relegated to the league below (except for the bottom league. There should be 4 total leagues. After each game, the observed Elo ratings of the teams should be recalculated. All of the teams in the top league at the beginning of the simulation should have an observed Elo rating of 1700. All of the teams in the second league should begin with an observed Elo rating of 1600, with the lower two leagues having observed ratings of 1500 and 1400. Within each league, one team should have an actual Elo rating 45 points above its observed Elo rating, one 40 points above its observed Elo rating, and so forth, down to one 50 points below its observed Elo rating. The simulation should run for a season, then output a list of all the teams by current league, along with the average observed Elo for each league after promotion and relegation, then repeat for seasons 2, 3, 4, through season 10.
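For a sense of the shape of the program this prompt implies, here is a rough sketch of the data structures and season loop. It is an illustration only, not the code GPT 4 actually produced (that is in the chat transcripts linked in the footnote); the team names and the play_game helper are invented placeholders:

    # A rough sketch of the structure the prompt describes. GPT 4's actual
    # code (linked in the footnote) differs in its details; the team names
    # and the play_game helper here are invented placeholders.
    use strict;
    use warnings;

    # Four leagues of 20 teams each. Each team carries a fixed "true" rating
    # and an "observed" rating that the Elo formula updates after every game.
    my @leagues;
    my @base = (1700, 1600, 1500, 1400);    # starting observed ratings per league
    for my $l (0 .. 3) {
        my @teams;
        for my $t (0 .. 19) {
            my $offset = 45 - 5 * $t;       # +45, +40, ... down to -50
            push @teams, {
                name     => "League" . ($l + 1) . "_Team" . ($t + 1),
                observed => $base[$l],
                true     => $base[$l] + $offset,
                points   => 0,
            };
        }
        push @leagues, \@teams;
    }

    # Each season: a double round-robin in each league, then promotion and
    # relegation of the top and bottom two teams by points.
    for my $season (1 .. 10) {
        for my $league (@leagues) {
            for my $i (0 .. $#$league) {
                for my $j (0 .. $#$league) {
                    next if $i == $j;
                    # play_game($league->[$i], $league->[$j]);
                    # (decide the result from the true ratings, then update the
                    #  observed ratings with the Elo formula and award points)
                }
            }
        }
        # ...then sort each league by points, swap the bottom two of each league
        # with the top two of the league below, reset points, and print the
        # average observed Elo per league...
    }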
The process wasn’t seamless: ChatGPT 4 initially gave me only half a program, though when I asked for the second half it filled that in. Once I had a full program, I noticed that it simply printed what it asserted were the final results of the simulation, which of course made it impossible to see whether it was actually performing the right intermediate steps (at least without analyzing the underlying code in detail). I asked it to add some code to print out those intermediate steps. This was useful, because I then identified an error: the code wasn’t tracking wins and losses properly. Once I pointed out the error to GPT 4, it provided new code that fixed it, which let us identify the next mistake, and so forth. Because the back-and-forth process was quite extended, I’m not including it in the text of this essay, but it’s all available in the links in the footnote.1
The process took quite a while, with many iterations back and forth. In fact, I periodically had to start new chat threads, beginning by providing the initial design request and the code we had at the end of the last thread, to get around the limits of ChatGPT’s “context window” — the total number of tokens it can work with at a time. But here’s the point: GPT 4 was able to write a program for the task I provided, do all of the coding work to fix bugs, and ultimately produce a working simulation. I needed to examine the outputs and see whether they looked correct, and in a few cases look at the code to see whether it would do what it was supposed to do. But my approach to fixing problems was uniformly to say “this is what it’s supposed to be doing, and it’s not” and then let GPT 4 figure out how to do what it was supposed to do, and GPT 4 stepped up and solved the problem. All of the data structures, algorithmic work, and the approach to solving the problem were supplied by GPT 4.
In case anyone is curious, the simulation supported my hypothesis: teams in the top league end up with measured Elo ratings about 100 points above their actual ratings, and teams in the bottom league end up with measured Elo ratings about 100 points below their actual ratings, with the positive and negative effects roughly cancelling out for the intermediate leagues. In other words: Man City, you’re not as far ahead of Wrexham A.F.C. as you appear to be, and we’re coming for you! C’mon you Red Dragons!
Getting back to discussing LLMs, this is a task that involved a fair amount of what seems like reasoning. And while it’s certainly possible that someone somewhere has done this before, there’s little evidence to support the assumption that GPT 4 was simply recapitulating prior work. Instead, it appears to take a request in plain language, decode that into its semantic meaning, and construct a complicated response with tokens in the correct order to fit that request. Or, to use the language we would use about a person doing the same tasks, it understands a request, reasons through a solution to the request, and then writes code to implement the solution that it has reasoned through.
Conclusion
I have presented evidence from several different domains to show that LLMs, especially GPT 4, can engage in reasoning.
GPT 4 and similar LLMs continue to have many limitations. They still suffer from many challenges with mathematics, often demonstrating severe innumeracy. While I believe that they demonstrate clear reasoning ability, their reasoning is highly imperfect. They make mistakes. As a project grows in complexity, the likelihood grows that some parts of the implementation will be shortcuts, approximations, or simply wrong. Their attempts at fiction writing are flat and pedestrian, hitting somewhat reasonable story beats with banal language and lacking the sparkle that a good human writer would bring to the same stories. And, of course, they hallucinate “facts” on a regular basis, including a particular predilection for inventing plausible but fake citations and quotations.
None of those limitations negates the ability to reason, nor even the more general concept of intelligence. Indeed, every one of those flaws exists among humans — many humans make elementary reasoning mistakes, especially but not only when young. Our ability to understand numbers can be limited — for some people, any numbers at all produce fear and confusion. Larger projects require careful planning and execution to avoid losing the thread or incorporating errors. While writing distinctive fiction that feels evocative and compelling is something some humans can do, it’s certainly not a universal talent, and many people’s best efforts remain plodding and flat. That’s even true of professional writers, although I will forbear from naming names. And who among us hasn’t had a disagreement about a memory with someone who remembered the same events differently, or tried to dredge up a famous quote only to produce a paraphrase?
While I don’t believe that today’s LLMs represent Artificial General Intelligence, possessed of intelligence in the full, broad, and multifaceted scope of that concept, I believe that they can in fact reason, albeit not at a fully human level. I further think that reasoning ability is a key component of intelligence. I agree with the description by a team of Microsoft AI researchers that LLMs demonstrate a “spark” of intelligence — that they have not fully achieved AGI, but that they have reached a point where we can see signs of intelligence.2 Many of the other components of intelligence beyond reasoning can likely be implemented through relatively obvious combinations of other technologies and architectures. For example, a team of AI researchers at Stanford was able to achieve substantially improved believability in simulated humans by linking GPT 3.5 to additional resources that record and retrieve memories (to get around the highly limited context window), reflect on memories to form a self-image and to extract higher-level generalizations from those memories, and create and maintain plans.3 By using multiple instances of GPT 3.5 in concert, together with some additional resources, they achieved results that a single instance alone could not. I’m confident that a similar structure powered by GPT 4 would be substantially more human-like. I believe that using techniques like that, combined with continuing to improve and refine LLMs, is likely to create true AGI in the near future — not necessarily at or exceeding the human level of intelligence in all areas, but nonetheless recognizably more like an intelligence than like an unintelligent machine, and with a meaningful chance of being fully superior to human intelligence within the next decade.
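To make that architecture a little more concrete, here is a schematic sketch, in the same Perl used above, of the kind of loop the Stanford paper describes. It is an illustration of the general idea rather than their actual system: query_llm is a hypothetical stand-in for a call to a model, the retrieval step is far cruder than the recency-, importance-, and relevance-weighted retrieval the paper uses, and planning is omitted entirely.

    # A schematic sketch of the kind of loop the generative-agents paper
    # describes: a model with no persistent memory of its own is wrapped in
    # components that store and retrieve memories and periodically reflect
    # on them. This is an illustration, not the Stanford team's code;
    # query_llm is a hypothetical stand-in for a call to a model.
    use strict;
    use warnings;

    my @memory_stream;    # every observation, action, and reflection, in order

    sub remember { push @memory_stream, { text => $_[0], time => time() } }

    # Crude retrieval: just the most recent handful of memories. The paper
    # instead scores memories by recency, importance, and relevance.
    sub retrieve {
        my ($n) = @_;
        my $start = @memory_stream > $n ? @memory_stream - $n : 0;
        return map { $_->{text} } @memory_stream[$start .. $#memory_stream];
    }

    sub query_llm { return "(model response to: $_[0])" }    # placeholder only

    sub agent_step {
        my ($observation) = @_;
        remember("Observed: $observation");
        my $context = join("\n", retrieve(10));
        my $action  = query_llm("Given these memories:\n$context\nWhat do you do next?");
        remember("Did: $action");

        # Periodic reflection: ask the model for higher-level generalizations
        # about recent memories and store them as new memories of their own.
        if (@memory_stream % 20 == 0) {
            remember("Reflection: " . query_llm("What follows from:\n$context"));
        }
        return $action;
    }

    print agent_step("A neighbor waves hello."), "\n";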
Tomorrow, I turn to presenting and responding to counter-arguments that LLMs are not meaningful steps towards true intelligence.
1The original chat thread I had with ChatGPT 4 asking it to write this program is available at https://chat.openai.com/share/a22f755e-3f9f-4e91-8150-6faece91a4f5 (originally generated April 2023, link generated June 8, 2023). I later started a new thread, beginning by describing the task and then providing the current state of the code and the current problems, at https://chat.openai.com/share/03f3eb4f-6a72-4bc0-b088-a06cef69d8e3 (originally generated April 2023, link generated June 8, 2023). When that chat thread started failing to successfully generate new responses, I started a third thread, again beginning with the current state of the code and the known problems, at https://chat.openai.com/share/6700cef3-f25b-4783-aef8-1c7bdff58c89 (originally generated April 2023, link generated June 8, 2023).
2Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang, Microsoft Research, “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, https://arxiv.org/pdf/2303.12712.pdf, April 13, 2023.
3Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein, “Generative Agents: Interactive Simulacra of Human Behavior”, https://arxiv.org/pdf/2304.03442.pdf, April 7, 2023.