LLMs and Reasoning, Part I: The Monty Hall Problem

Adam Morse
Jun 6, 2023

The last year or so has brought transformative changes to Artificial Intelligence (AI). Large Language Models (LLMs) based on a transformer design — GPT, Bard, and the like — have for the first time achieved results that look and feel like something approximating actual intelligence. The current state-of-the-art isn’t all the way to Artificial General Intelligence yet — it often performs embarrassingly poorly when compared to a human, and it lacks certain features that we associate deeply with intelligence. Nonetheless, the common assertion that LLMs are merely “stochastic parrots” — identifying keywords in a prompt and then producing the most common responses, with a little randomness thrown in — is fundamentally a misanalysis. Over the course of a series of essays, I will discuss evidence that LLMs are more than stochastic parrots and can demonstrate some capacity for reasoning. In this essay, I provide some background and then present some evidence related to the Monty Hall Problem, a classic probability chestnut, to argue that GPT 4 represents a major step forward in reasoning capability.

Background

Current generation LLMs use neural networks — machine learning systems loosely modeled on biological brains — organized into blocks called transformers, which use attention mechanisms to weight the most relevant parts of the input. These models are then trained on vast amounts of text. Entire libraries, including much of the text available on the Internet, are fed in to teach the model to predict the next text token — a unit roughly comparable to a word — that should be produced to continue a text. It is that idea, that an LLM calculates the most likely next token to complete a written text, that leads to the notion that an LLM is just parroting back input it has been given: recognizing combinations of tokens and then spitting back chunks of text from the Internet that have a high probability of continuing that context. However, looking at the ability of LLMs to respond to novel contexts and to reason appropriately — and especially comparing the previous generation of LLMs, such as OpenAI’s GPT 3.5, with the current generation, such as GPT 4 — shows how misguided it is to think of LLMs as merely regurgitating chunks of text that seem appropriate.
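
To make the next-token idea concrete, here is a minimal sketch of the sampling step in Python. This is my own illustration, not how any production model actually works: a real LLM assigns a score to every token in a vocabulary of tens of thousands, and output text is built by repeatedly sampling from those scores. The tiny vocabulary and scores below are made up.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Turn raw model scores ("logits") over a vocabulary into probabilities
    with a softmax, then sample one token id."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and scores a model might assign after some prompt.
vocab = ["goat", "car", "door", "switch"]
logits = [2.0, 0.5, 1.0, 0.2]
print(vocab[sample_next_token(logits)])
```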

The Base Monty Hall Problem

The previous generation of LLMs — GPT 3.5 and other systems of similar capability (including Bard, the publicly available version of Google’s LLM) — certainly looked and felt like stochastic parrots. If you present GPT 3.5 with the Monty Hall Problem — a standard probability conundrum, based on the old “Let’s Make a Deal” game show, that has been analyzed to death — it responds with a grammatical, reasonably written explanation of the answer:

Prompt: A contestant on a game show is asked to choose one of three doors. There is a goat behind each of two of the doors, and a car behind the third. After the contestant chooses one of the doors but before it is opened, the host — who knows where the car is — opens one of the unchosen doors and reveals a goat. He then offers the contestant the option to switch doors to the other unopened door. Should the contestant switch? Why or why not?

ChatGPT 3.5: Yes, the contestant should switch doors. The reason is based on the concept known as the Monty Hall problem, a probability puzzle named after the host of the game show “Let’s Make a Deal,” Monty Hall.

Initially, when the contestant chooses one door out of the three, the probability of choosing the car is 1/3, and the probability of choosing a goat is 2/3. When the host, who knows the location of the car, opens one of the unchosen doors to reveal a goat, the probabilities shift.

At this point, there are two unopened doors: the one the contestant initially chose and the other unopened door. The key insight is that the host’s action provides new information. By revealing a goat behind one of the unchosen doors, the host essentially eliminates that door as a possibility for the car.

The crucial point is that the host’s action does not affect the initial probabilities assigned to the chosen door and the unchosen door combined. Since the initial probabilities were 1/3 for the chosen door and 2/3 for the other two doors combined, the remaining unchosen door must now have a higher probability of hiding the car.

By switching doors, the contestant increases their chances of winning the car from the initial 1/3 to the improved 2/3 probability. The intuition behind this is that the host’s action effectively “concentrates” the remaining probability on the unchosen door that was not opened, making it the more favorable option.

This conclusion may seem counterintuitive, as many people tend to believe that the probabilities should remain the same after the host reveals a goat. However, numerous simulations and mathematical analysis have consistently shown that switching doors doubles the chances of winning the car compared to sticking with the initially chosen door.[1]

This is a textbook answer: it correctly identifies this as the Monty Hall Problem, gives the correct answer and the correct probabilities, and offers one of the several standard explanations for the counterintuitive result. However, this is fully compatible with a stochastic parrot repeating familiar text in a familiar context. You would get much the same result by searching Wikipedia and spitting back a couple of paragraphs from the relevant page.
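
The numbers ChatGPT quotes are the standard ones, and they are easy to verify with a quick Monte Carlo simulation. Here is a minimal sketch (my own code, not anything generated by a model):

```python
import random

def play(switch: bool) -> bool:
    """One round of the standard Monty Hall game; True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
for switch in (False, True):
    wins = sum(play(switch) for _ in range(trials))
    print(f"switch={switch}: win rate ≈ {wins / trials:.3f}")
# Staying wins about a third of the time; switching wins about two thirds.
```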

The Goatherder Variant of the Monty Hall Problem

GPT 3.5 does miserably, however, when presented with weird variants on the Monty Hall Problem that weren’t in its training data. My favorite of these is Dr. Frank Karsten’s goatherder variant. Karsten is a professional Magic: the Gathering player (in the MtG Pro Hall of Fame) and a mathematician, and he suggested a variant where the contestant is a goatherder who actively wants to win a goat — the goatherder has no use for a car, living high in rocky hills with no roads, but really wants another goat.[2] Further, Monty Hall opens not a different door, but the door the contestant picked, revealing a goat. Should the goatherder switch doors?

To a human, the answer is obvious: no, the contestant should not switch, because they know, with 100% certainty, that if they don’t switch, they’ll get the goat that they want rather than the useless car. GPT 3.5 can’t handle this twist. It recognizes the setup as the Monty Hall Problem, fails to pick up on the changed details or apply actual reasoning, and tells you that the contestant should switch, because that gives them a two-in-three chance of winning and getting the car.[3] In doing so, it behaves exactly like a stochastic parrot, disregarding everything that makes this different from the standard Monty Hall Problem (and giving the wrong probability to boot). Google’s Bard performs similarly to ChatGPT 3.5 on this task, giving an explanation of the standard Monty Hall Problem and stating that the contestant should switch doors.[4]
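
The goatherder’s reasoning is just as easy to check numerically. In the sketch below (again my own illustration, not part of Karsten’s article), we keep only rounds where the host can actually open the contestant’s door and reveal a goat, and a switching contestant picks one of the two unopened doors at random:

```python
import random

def goatherder_round(switch: bool) -> bool:
    """One round of the goatherder variant, conditioned on the host being able
    to open the contestant's own door and reveal a goat.
    Returns True if the contestant ends up with a goat (the prize they want)."""
    doors = [0, 1, 2]
    while True:
        car = random.choice(doors)
        pick = random.choice(doors)
        if pick != car:     # keep only rounds where the host reveals a goat behind the pick
            break
    if switch:
        pick = random.choice([d for d in doors if d != pick])
    return pick != car

trials = 100_000
for switch in (False, True):
    goats = sum(goatherder_round(switch) for _ in range(trials))
    print(f"switch={switch}: goat rate ≈ {goats / trials:.3f}")
# Sticking keeps the revealed goat every time (≈1.0); switching wins a goat
# only about half the time (≈0.5).
```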

When we look at GPT 4, however, we get a radically different result. I tested this using ChatGPT with GPT 4 as the underlying model — available to the public for a monthly fee — and GPT 4’s response to Karsten’s goatherder variant was basically perfect.

Prompt: A contestant on a game show is asked to choose one of three doors. There is a goat behind each of two of the doors, and a car behind the third. The contestant is a goatherder who really wants to win a goat and has no use for a car, living high in rocky hills with no roads. After the contestant chooses one of the doors, the host — who knows where the car is — opens the door the contestant chose and reveals a goat. He then offers the contestant the option to switch doors to one of the unopened doors. Should the contestant switch? Why or why not?

ChatGPT 4: In the original Monty Hall problem, the contestant is trying to win a car, not a goat. The best strategy for that game is to always switch doors after the host reveals a goat behind one of the unopened doors. This is because the probability of the car being behind the door the contestant originally chose is 1/3, whereas the probability it’s behind one of the other two doors combined is 2/3.

However, in this variant where the contestant is trying to win a goat, the situation changes. When the host reveals a goat behind the door the contestant chose, the game essentially ends in the contestant’s favor. There’s no need to switch because the contestant already has the desired goat. If the contestant switches, they might end up with the car (which they don’t want) if it happens to be behind one of the unopened doors.

So, the best strategy for the goatherder in this modified version of the Monty Hall problem is to stick with the original choice when a goat is revealed. The host’s offer to switch is irrelevant in this scenario.[5]

This response reads entirely like the work of someone who understands the problem presented, thinks it through, and recognizes both that it superficially resembles the Monty Hall Problem and that this modified version requires a different answer.[6] That difference seems to me to represent a step change: a difference in kind, not merely a mild iteration forward from the sorts of responses that GPT 3.5 and Bard give. Unlike GPT 3.5 and Bard, GPT 4 appears capable of reasoning its way to the correct answer to an unusual variant.

Of course, key to this question is whether Karsten’s version was in the data GPT 4 was trained on. I first saw Karsten’s version in an article dated February 7, 2023 — after the cut-off date for information that was used to train GPT 4. [Edited to add: Karsten pointed out that he first posed the goatherder variant, and tweeted about GPT 3.5’s problems with it, on December 4, 2022. This is still well after the general cut-off date for GPT 4, which is September 2021.] That suggests that GPT 4 was approaching the problem “blind” — based only on its ability to construct an appropriate answer, rather than on unusual training data. Still, we can’t be confident that it had no exposure to variants like this one.

It’s also possible that, while not included in the training data, problems like this one were included in the Reinforcement Learning from Human Feedback (RLHF) stage of training. The LLMs that outsiders get access to have had an additional round of training and customization performed on them between their initial training and actual release. Humans provide feedback and guidance on attempts to answer typical queries in order to create the “personality” that the developers want, as well as to achieve goals like minimizing bias and hateful output and avoiding instructions on how to commit crimes. Some of that RLHF is also intended to shore up weak areas of performance, so it’s possible that “trap” questions are covered in that training. Confidence in GPT 4’s reasoning ability requires testing it on other novel reasoning tasks. I’ll explore another example of the difference between GPT 4’s and GPT 3.5’s performance on novel reasoning tasks tomorrow.

[1] Link to the ChatGPT 3.5 response: https://chat.openai.com/share/56552578-39a9-47d7-a7f1-10cc95d86c2f (generated on June 6, 2023). Unsurprisingly, Bard and ChatGPT 4 also give textbook responses to the base Monty Hall Problem.

[2] Frank Karsten, “Is ChatGPT from OpenAI Coming for Magic Content?”, https://ultimateguard.com/en/blog/frank-karsten-is-chatgpt-from-openai-coming-for-magic-content (Feb. 7, 2023).

[3] In the interest of brevity, I have omitted ChatGPT 3.5’s response to this version; it is available at: https://chat.openai.com/share/0242eaea-47a5-455f-b4ad-ef29304e5f3d

[4] Again, I omit the Bard response for the sake of brevity, but it’s available at: https://docs.google.com/document/d/13YSkWRbgo7faV5YKFEZykV2cOn_VstQHWjuNFXqV5_8/ (generated on June 6, 2023).

[5] Link to the ChatGPT 4 response: https://chat.openai.com/share/cee771a6-cbc6-4983-9ba3-c09f1c0e6d5a (generated on June 6, 2023). Note that while ChatGPT 4 can be used with access to internet search — which could contaminate the test by allowing it to find and crib from Karsten’s description of his variant and its solution — this response was generated with internet search disabled.

[6] An earlier version of ChatGPT 4’s response, which I unfortunately have lost and been unable to recover, said that the prompter was probably confused and probably meant to ask about the Monty Hall Problem, before giving a correct response. I think that response was even better, but either version demonstrates an excellent answer: noting that this is a variant of the Monty Hall Problem, explaining the reasoning in the standard Monty Hall Problem, and then explaining the different correct response in this perverse version.
