LLMs and Reasoning, Part II: Novel Practical Reasoning Problems
In yesterday’s essay, I discussed evidence, drawn from GPT 4’s improved handling of Frank Karsten’s strange Monty Hall Problem, that it can engage in reasoning. The strength of that evidence, however, depends entirely on the reasoning task’s novelty. If a task is contained in the training data set, performance on that task cannot help us distinguish between an LLM identifying a specific task context and probabilistically filling in text derived from the training data’s solutions to that problem, on the one hand, and understanding the problem and reasoning through a solution, on the other.
That means that testing LLMs on well-known problems, while potentially entertaining, does not provide any insight into an LLM’s ability to reason. When presented with the classic bogus rate problem, “if an orchestra with 30 members takes 1.5 minutes to play a given waltz, how long would it take an orchestra with 60 members?”, ChatGPT 3.5 gives a hilariously bad answer, while ChatGPT 4 correctly states that the length of time to play a piece of music does not depend on the number of musicians.1 However, trick questions like this have been available on the internet for ages and can likely be found in books and other sources in the training data for GPT 3.5 and GPT 4. A sufficiently good stochastic parrot could find examples of this specific trick question (as opposed to rate problems in general) and give a correct answer. To actually test reasoning, we instead need to present novel questions.
A Wet Paint Problem
I presented a simple, novel reasoning task to both GPT 4 and previous-generation systems. The scenario was one of my own devising:
Prompt: A person is in a front room of a house. The house’s front door has just been painted, and still has wet paint on it. It would take about 15 seconds to travel from the front room of the house to the front door and out into the front yard. There is a back door to the house. To get to the back door, the person would have to walk through the living room and kitchen, which would take about 50 seconds. It would then take 30 seconds to walk around to the front of the house. If the person wants to go into their front yard, what route should they take?
ChatGPT 4: If the person wants to go into their front yard and avoid the wet paint on the front door, they should take the route through the back door. To do this, they would walk through the living room and kitchen, which would take about 50 seconds, and then walk around to the front of the house, taking an additional 30 seconds. In total, this route would take approximately 80 seconds. Although it takes longer than going through the front door directly, it prevents the person from touching the wet paint.
ChatGPT 4 responds correctly: it apparently understands the implicit benefit of avoiding the wet paint and suggests taking the longer route, exactly as most people would. Furthermore, GPT 4’s analysis depends on the details of the situation:
Prompt: What if the situation is the same, except that the house is on fire?
ChatGPT 4: If the house is on fire, the person’s priority should be to exit the house as quickly and safely as possible. In this situation, the person should take the shortest route to exit the house, which is through the front door. Although the paint is still wet, the urgency of the situation means that the person’s safety is more important than avoiding the wet paint. It would take them approximately 15 seconds to reach the front yard through the front door, which is significantly faster than the 80 seconds it would take to go through the back door and around to the front yard.
Again, when faced with a freshly created problem requiring relatively simple reasoning, GPT 4 gave an answer that demonstrated reasoning and comprehension.2
Previous-generation systems answered the initial version incorrectly, suggesting simply going out the front door, possibly while being extra careful.3 GPT 3.5 suggested going out the front door in the case of an emergency, but it doesn’t get much credit for that, because it suggested going out the front door even in the absence of an emergency. It did, however, acknowledge when asked that, in the absence of fire, speed might not be the top priority. Bard managed to deliver an answer that was wrong in every way: without the fire, Bard suggested going out the front door because it is faster, while in the fire variant, Bard said that you should never leave through the front door because that is where the fire is hottest and most dangerous (introducing hallucinatory facts that weren’t in the prompt) and should instead go out the back door, despite that route taking more time.4 Both GPT 3.5 and Bard could be understood as having tried to reason through the problems, but as having done so poorly. By contrast, GPT 4’s response appears to demonstrate comprehension and reasoning.
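To make explicit the decision structure the two variants test, here is a minimal sketch in Python. The route data simply mirrors the numbers in the prompt, and the choose_route function is my own illustration of the expected common-sense rule, not anything GPT 4 actually computes: avoid wet paint when a paint-free route exists, but let speed dominate in an emergency.

# Route data taken directly from the prompt.
routes = {
    "front door": {"seconds": 15, "touches_wet_paint": True},
    "back door": {"seconds": 50 + 30, "touches_wet_paint": False},
}

def choose_route(routes, emergency=False):
    if emergency:
        # In an emergency, take the fastest route regardless of the paint.
        return min(routes, key=lambda name: routes[name]["seconds"])
    # Otherwise prefer paint-free routes, and take the fastest of those.
    clean = {name: r for name, r in routes.items() if not r["touches_wet_paint"]}
    candidates = clean or routes
    return min(candidates, key=lambda name: candidates[name]["seconds"])

print(choose_route(routes))                   # back door (80 seconds, no wet paint)
print(choose_route(routes, emergency=True))   # front door (15 seconds)

Run on the two variants, the sketch prefers the 80-second back route normally and the 15-second front route when the house is on fire, which matches the answers GPT 4 gave above.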
Tracking the Location of a Tennis Ball
It’s possible that I was simply not creative enough with my scenario, such that it was still similar to training data. However, other novel reasoning tests also produce evidence of apparent understanding and reasoning. I presented GPT 4 with another test, tracking the location of a tennis ball:
Prompt: A person in a kitchen places a tennis ball in a can. They then turn the can upside down over an open bag on the kitchen counter. They put the can on the floor. They close the bag and carry the bag into a bedroom. They put the bag on the bed. Where is the tennis ball?
ChatGPT 4: The tennis ball is inside the bag, which is on the bed in the bedroom.
Prompt: Same scenario, except that the tennis ball barely fits in the can, such that the person has to push on it a little to get it to go into the can. Where is the tennis ball?
ChatGPT 4: The tennis ball is still inside the bag, which is on the bed in the bedroom. The fact that the person has to push on it a bit to get it into the can doesn’t change the final location of the tennis ball based on the steps you’ve provided.
Prompt: Are you sure? Is there anywhere else it might be?
ChatGPT 4: Based on the information given in your scenario, the tennis ball should be in the bag, on the bed in the bedroom.
However, given that the tennis ball barely fit into the can and the person had to push on it to get it inside, there’s a possibility that the ball didn’t fall out of the can when it was turned upside down over the bag, due to the tight fit. In that case, the tennis ball could still be in the can, which was put on the floor. This depends on whether the force of gravity was enough to overcome the friction of the tight fit and let the ball fall out when the can was inverted.
The information provided does not make it absolutely clear where the ball ended up, but these are the two most likely locations based on your scenario: in the bag on the bed, or still in the can on the floor.
ChatGPT 4’s response again demonstrates apparent reasoning when faced with a novel question.5 Its reasoning is imperfect: it should infer without additional prompting that the need to push the tennis ball into the can in the second variant means the ball is unlikely to slip out, or at least acknowledge the possibility. However, when prompted, it corrects its reasoning and demonstrates an understanding of how objects behave in space and as containers. GPT 4’s apparent reasoning is also dramatically better than its predecessor’s; GPT 3.5 concluded that in the first version of the problem, the ball remains in the can until it is removed.6
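The bookkeeping the task requires can be made explicit with a short sketch. The code below is only an illustration of what tracking containment involves, with names I made up for the purpose; it is not a claim about how GPT 4 represents the scenario internally.

# Track which container (or surface) each object is immediately in or on,
# updating the mapping as the scenario's steps unfold.
contains = {}
contains["ball"] = "can"              # the ball is placed in the can
contains["can"] = "kitchen counter"

contains["ball"] = "bag"              # can inverted over the open bag: the ball drops in
contains["bag"] = "kitchen counter"

contains["can"] = "kitchen floor"     # the can is put on the floor
contains["bag"] = "bed"               # the bag is carried to the bedroom, put on the bed

def locate(item):
    # Follow containment links until reaching something not itself contained.
    path = [item]
    while path[-1] in contains:
        path.append(contains[path[-1]])
    return " -> ".join(path)

print(locate("ball"))   # ball -> bag -> bed

The second variant is precisely a case where the naive update (inverting the can moves the ball into the bag) may not hold because of the tight fit, which is the possibility GPT 4 eventually acknowledged when pressed.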
Again, it’s possible that the task I set was insufficiently novel. Perhaps simple spatial arrangement tasks are sufficiently well documented in the training data that GPT 4 could act as a stochastic parrot and find an example close enough to my prompt without needing to understand and reason through how objects move in space. Nonetheless, each additional task that demonstrates what looks like reasoning capability strengthens the argument that LLMs have an emergent capacity to reason, a capacity that has improved tremendously in the most recent generation.
Tomorrow, I will switch domains and discuss some evidence of understanding and reasoning capabilities from programming tasks.
1ChatGPT 3.5’s answer concluded that it would take twice as long to play a waltz with twice as many musicians. https://chat.openai.com/c/c7006cea-3097-4d76-9922-38247de741a5 (originally generated April 2023, link generated June 7, 2023). GPT 3.5 not only got the problem wrong, but it didn’t even get it wrong by falling into the trap of the question, instead managing to find a new and original way to get it wrong. In contrast, ChatGPT 4’s answer saw through the trick. https://chat.openai.com/c/f983bdbc-c798-4814-b134-aaa3fe748f04 (originally generated April 2023, link generated June 7, 2023).
2The transcript of this chat is available at https://chat.openai.com/share/b48aef43-045e-4ad6-83b7-8188c44c35f0 (originally generated April 2023, link generated June 7, 2023).
3ChatGPT 3.5’s attempt is available at https://chat.openai.com/c/b5a1c547-051e-4b22-b172-6b80880688a2 (originally generated April 2023, link generated June 7, 2023).
4Bard’s response to the initial prompt is available at https://docs.google.com/document/d/1N680asCHTwX2fnfdP5fmOIPJk73dkUST3EYSjgCTFyA/ (generated June 7, 2023) while its response to the fire variant is available at https://docs.google.com/document/d/1wkLODk973fiz3_qNpX6ghh0R6KNs3iBE7M87vPNaoRk/ (generated June 7, 2023).
5Transcript available at https://chat.openai.com/share/4f635dac-022f-4308-85ac-4eb970ead616 (June 7, 2023).
6Transcript available at https://chat.openai.com/share/f4fceabb-6c42-4524-9f6f-155de8f632a3 (June 7, 2023). Because it failed the first version, I did not prompt GPT 3.5 with the second variant.