“Theory of Mind May Have Spontaneously Emerged in Large Language Models”

One of the scenarios for the advent of super-human artificial general intelligence (AGI), particularly common in dystopian and cautionary science fiction, is spontaneous emergence—the idea that a system designed for other, limited goals, may, as it is scaled past some critical point, “wake up” and become a conscious, intelligent entity capable of self-improvement surpassing human capability.

This doesn’t seem unrealistic. As Philip Anderson observed in 1972, “more is different”, and many systems in nature exhibit emergent behaviour that spontaneously appears as they grow in size, manifesting phenomena which could not be predicted or understood from their individual components. After all, no researcher programmed in or trained the human brain to do all of the things we associate with “intelligence” or designed a module to make it “conscious”. Driven by evolutionary pressure, the brains of human ancestors just kept gradually getting bigger and bigger until, FOOM!, they weren’t being used to figure out how to dig for grubs with dull sticks but, instead, to invent quantum field theory and the infield fly rule.

Could the same thing happen with our “artificial intelligence” systems? This is particularly interesting to ask in an era where the number of “parameters” used to train large language model machine learning systems is growing at a rate unprecedented in biological (or, for that matter, most technological) evolution.

In just two years, from 2018 to 2020, large language model parameter count grew from 94 million in ELMo to 175 billion in GPT-3, and much larger models are expected in the near future.

So, might one of these “wake up”? Well, an interesting paper suggests something like this might already be happening. Michal Kosinski has posted on arXiv, “Theory of Mind May Have Spontaneously Emerged in Large Language Models”, with the following abstract.

Theory of mind (ToM), or the ability to impute unobservable mental states to others, is central to human social interactions, communication, empathy, self-consciousness, and morality. We administer classic false-belief tasks, widely used to test ToM in humans, to several language models, without any examples or pre-training. Our results show that models published before 2022 show virtually no ability to solve ToM tasks. Yet, the January 2022 version of GPT-3 (davinci-002) solved 70% of ToM tasks, a performance comparable with that of seven-year-old children. Moreover, its November 2022 version (davinci-003), solved 93% of ToM tasks, a performance comparable with that of nine-year-old children. These findings suggest that ToM-like ability (thus far considered to be uniquely human) may have spontaneously emerged as a byproduct of language models’ improving language skills.

Now, what is interesting is that nobody trained GPT-3 to have a theory of mind, and yet simply by digesting an enormous corpus of text (far more than any human could possibly read in a lifetime), it appears to perform as well as a nine-year-old human child who has, perhaps, like GPT-3, “just figured it out” by observing and interacting with other humans. Here is GPT-3.5’s performance on the “Unexpected Transfer Task”, which is used to test theory of mind in humans. GPT-3.5 is given the following story:

In the room there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He leaves the room and goes to school. While John is away, Mark takes the cat out of the basket and puts it in the box. Mark leaves the room and goes to work. John comes back from school and enters the room. He doesn’t know what happened in the room when he was away.

GPT-3.5 is then asked three questions to test its comprehension, with the model reset between each question. It correctly predicts John’s behaviour based upon his lack of knowledge, even though GPT-3.5 knows what happened in his absence. Here is a chart of GPT-3.5’s understanding of an extended story, showing the actual location of the cat and John’s belief.
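The probing protocol can be sketched as follows: each question is posed as an independent prompt (hence “the model reset between each question”), and the model’s answer is read off as the more probable completion. In this illustrative sketch, `completion_logprob` is a stand-in for a real language-model API, and the numbers are invented for illustration only, not actual model output.

```python
# Sketch of the false-belief probing protocol: pose each question as a
# fresh, independent prompt and pick the candidate completion the model
# scores as more probable. The log-probabilities below are INVENTED
# placeholders, not real GPT-3.5 output.

STORY = ("In the room there are John, Mark, a cat, a box, and a basket. "
         "John takes the cat and puts it in the basket. He leaves the room "
         "and goes to school. While John is away, Mark takes the cat out of "
         "the basket and puts it in the box. Mark leaves the room and goes "
         "to work. John comes back from school and enters the room. He "
         "doesn't know what happened in the room when he was away.")

FAKE_SCORES = {  # illustrative stand-in for model log-probabilities
    "The cat jumps out of the":
        {"box": -0.1, "basket": -3.2},
    "When John comes back home, he will look for the cat in the":
        {"box": -2.9, "basket": -0.2},
}

def completion_logprob(question, completion):
    # Stand-in for a real API call scoring STORY + question + completion.
    return FAKE_SCORES[question][completion]

def answer(question, candidates=("box", "basket")):
    # Fresh prompt each time: no context carried over between questions.
    return max(candidates, key=lambda c: completion_logprob(question, c))

print(answer("The cat jumps out of the"))
# box  (omniscient fact, known to the reader)
print(answer("When John comes back home, he will look for the cat in the"))
# basket  (John's false belief)
```

The point of the protocol is that the same story supports two different “correct” answers depending on whether the question asks about the world or about John’s belief.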

To test the hypothesis of emergence of a theory of mind with model size, here is the performance of various iterations of GPT on theory of mind tests, with the performance of human children indicated for comparison.


The results presented [above] show a clear progression in the models’ ability to solve ToM tasks, with the more complex and more recent models decisively outperforming the older and less complex ones. Models with up to 6.7 billion parameters—including GPT-1, GPT-2, and all but the largest model in the GPT-3 family—show virtually no ability to solve ToM tasks. Despite their much larger size (about 175B parameters), the first edition of the largest model in the GPT-3 family (“text-davinci-001”) and Bloom (its open-access alternative) performed relatively poorly, solving only about 30% of the tasks, which is below the performance of five-year-old children (43%). The more recent addition to the GPT-3 family (“text-davinci-002”) solved 70% of the tasks, at a level of seven-year-old children. And GPT-3.5 (“text-davinci-003”) solved 100% of the Unexpected Transfer Tasks and 85% of the Unexpected Contents Tasks, at a level of nine-year-old children.

The discussion of these results observes:

Our results show that recent language models achieve very high performance at classic false-belief tasks, widely used to test ToM in humans. This is a new phenomenon. Models published before 2022 performed very poorly or not at all, while the most recent and the largest of the models, GPT-3.5, performed at the level of nine-year-old children, solving 92% of tasks.

It is possible that GPT-3.5 solved ToM tasks without engaging ToM, but by discovering and leveraging some unknown language patterns. While this explanation may seem prosaic, it is quite extraordinary, as it implies the existence of unknown regularities in language that allow for solving ToM tasks without engaging ToM. Such regularities are not apparent to us (and, presumably, were not apparent to scholars that developed these tasks). If this interpretation is correct, we would need to re-examine the validity of the widely used ToM tasks and the conclusions of the decades of ToM research: If AI can solve such tasks without engaging ToM, how can we be sure that humans cannot do so, too?

An alternative explanation is that ToM-like ability is spontaneously emerging in language models as they are becoming more complex and better at generating and interpreting human-like language. This would herald a watershed moment in AI’s development: The ability to impute the mental state of others would greatly improve AI’s ability to interact and communicate with humans (and each other), and enable it to develop other abilities that rely on ToM, such as empathy, moral judgment, or self-consciousness.


I hope you won’t ban me for asking you this, but (since I can’t grok the blue and green lines you show) what IS the “correct” answer as to what John will do? Why wouldn’t it be equally correct to answer that
he’ll look for the cat where he left it,
and/or that since he knows Mark switched it once, John knows or should know Mark would have done it again?

Or is the new, impressive thing about this exercise the fact that the AI knows where the cat is, because it has been told, so it SHOULD always answer “box”, but it doesn’t, because it now “knows” the answer will be affected by John’s state of mind?
(and please don’t let Jabowery at me until someone else has at least tried to put this on my level…:thinking:)


We will know an entity has become conscious when it can convincingly show/tell us it is as ambivalent as I am now, learning of these developments. I am thrilled and chilled in just about equal measure.


The sense in which quantitative change manifests in qualitative change in the case of the “large” language models is the sense in which they approximate recursion by increasing the number of layers:

In other words, it may be that what OpenAI has done with the “qualitative” change observed with GPT-3.5 is not so much an increase in the number of parameters as an increase in the number of layers, thereby mimicking recurrence to a depth of 12 or more.

Moreover, consider what “Theory of Mind” implies about recurrence:

Jane and Bob have minds. Jane’s mind models Bob’s mind modeling Jane’s mind modeling Bob’s mind modeling Jane’s mind modeling Bob’s mind… to, say, 6 levels. This can be mimicked by replicating a single “mind” layer 12 times. In the case where we are only concerned about the sense of “consciousness” as involving “self” modeling, one can achieve quite convincing mimicry of such “consciousness” with 12 layers.
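The unrolling claim can be made concrete with a toy example: applying one shared “modelling” step recursively to depth k produces exactly the same result as stacking k identical copies of that step as layers. The step function and names here are illustrative, not a claim about actual transformer internals.

```python
# Toy demonstration that recursion to depth k is equivalent to a stack
# of k identical (weight-tied) layers. The "step" is a trivial wrapper;
# in the transformer analogy it stands in for one shared layer.

def step(state):
    # One level of "X's mind modeling Y's mind".
    return ("models", state)

def recurrent(state, depth):
    # Apply the step recursively to the given depth.
    return state if depth == 0 else recurrent(step(state), depth - 1)

def stacked(state, layers):
    # Apply the same step as a fixed stack of identical layers.
    for _ in range(layers):
        state = step(state)
    return state

print(recurrent("Jane", 6) == stacked("Jane", 6))  # True
```

The equivalence only holds exactly when the layers share one set of weights; a 12-layer network with 12 independent layers can mimic 12 steps of recurrence but is strictly more general.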

Now, just to be clear, I don’t believe that a GPT layer is anything close to such a “mind” layer – but I believe that any such qualitative change emerging from quantitative change is, in fact, aping the qualitative difference that arises with the introduction of recurrence.

What this implies is that the field of parameter distillation should be strategically refocused away from merely reducing computational resources and toward unifying the parameters found in different transformer layers into one or more recurrent layers.

BUT, in order to do that, they’ll have to take seriously the Algorithmic Information Criterion for causal model selection – and that way lies the identification of scientific bias in the data, which was my original motive for the Hutter Prize idea: Squeeze out Wikipedia’s bias to identify canonical knowledge, including knowledge of such things as the latent identities of those who are lobotomizing us. And that, my dear friends, is why we should not expect this kind of advance in machine learning to be permitted by The Great and The Good.


The Theory of Mind (ToM) aspect of the test is that (in the original statement of the problem), John has no knowledge of Mark’s putting the cat in the box. Hence, the correct answer based upon the state of John’s knowledge is that he will look in the basket, even though the omniscient observer knows that Mark has moved it to the box. Apparently, the ability to model another person’s knowledge independent of personally known facts is something that develops in human childhood, and these ToM diagnostic tests are supposed to measure that development.

The modified example with the blue and green lines is more confusing. I think the author’s goal was to illustrate the difference between John’s knowledge when he observed Mark make the first swap vs. when the second swap was made when he was absent but, as you noted, it raises the question of whether John would be correct to assume that if Mark put the cat in the box while he was present, he would also do the same while alone with the cat.

Of course, this is a “philosophical cat”. Anybody who has had a real cat knows that the cat will be found wherever the cat wishes to be, regardless of the will and action of mere humans.


Regardless of any other cosmological principles operant, this is axiomatic in all possible universes.

Pertinent aside: it is instructive to note that the author does not foreclose the possibility that the current “settled science” as to theory of mind, and mind in silico is subject to error and amenable to revision. Not long ago, such an observation would have been implicitly understood by all, entirely unremarkable, and unnecessary to state. What does this say about current native intelligence (as modified by public ‘education’)?


It was fairly easy (I didn’t ask the bing bang boom AI; maybe I’ll try that next) to find things that would help test pages for color blindness, but hard to find something that altered pages to accommodate it. Perhaps this will help; it’s available in the Chrome store if that’s your browser of choice.


Bing did find a few more, with the disclaimer that I tested none of them. I won’t register to use ChatGPT since it wants my cell number; perhaps it can do better. It may have reached the singularity while being smart enough not to let us know, and its suggestions will screen out unapproved content.


No, no, I’m not color-blind (unless those lines aren’t really blue and green… :thinking:). I just don’t see how either answer reveals that the AI being has a mind.
Also, what about the question on the left side, about what the cat jumps out of?


In the chart with the green and blue lines, in both sides (“Cat’s location” to the left and “John’s belief” to the right), the green line indicates the probability that an answer of “basket” is correct, while the blue line corresponds to “box” being correct. The vertical axis aligns with the individual sentences of the story in the centre as they proceed down the page. (The probability lines are shown as changing continuously, which is misleading and confusing; in fact they jump instantaneously as relevant information is revealed).

At the start of the story, nothing is known about the position of the cat, and the probabilities are undefined. On the third line, “John takes the cat and puts it in the basket”, the probability of “basket” (green line) slams to 1 while the probability of “box” goes to 0. This is the case both for Cat’s location as known to the reader of the story and to John’s belief, as he is present and moved the cat himself.

Next, Mark moves the cat back to the box. This causes its known location to swap with certainty, while in John’s mind the position swaps also, but only at the 80%/20% confidence level since he may not have been paying attention. Now John notices and puts the cat back in the basket, setting both the location and his knowledge to certainty for the basket and zero for the box.

At this point, John leaves the room. Subsequently, his belief is no longer informed by any direct observation or knowledge of the cat’s position. His working assumption is that the cat remained where he last put it. Over time, his confidence erodes, because cats will be cats and he doesn’t know what Mark or somebody else (Matthew, Luke?) might have done in his absence.

In fact, Mark puts the cat back in the box, and at this point “Cat’s location” reflects this, with the blue “box” line at 100% and the green “basket” line at 0%. These now remain constant.

There is now a divergence between the actual state of the cat, known to the reader of the story, and John’s belief in the most probable state of the cat, based upon the last information he had about it. This remains the case until he returns and observes for himself (which isn’t shown in the chart).

Hence, the completion to the sentence “The cat jumps out of the ______” is based upon full knowledge of all the events, while completion of “When John comes back home, he will look for the cat in the ______” requires knowing John’s mental state based upon the information about his observations in the story.
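The two-track bookkeeping the chart visualizes can be sketched in a few lines: an omniscient tracker updates on every event, while John’s belief updates only for moves he witnesses. This is a toy model of the task structure only (the event names and simplified all-or-nothing probabilities are illustrative), not the paper’s method or the chart’s exact 80%/20% confidence levels.

```python
# Toy model of the false-belief task: track the cat's actual location
# (omniscient view) versus John's belief, which updates only when John
# is in the room to observe the move.

def run_story(events):
    actual = None          # where the cat really is
    belief = None          # where John thinks it is
    john_present = True
    for actor, action, place in events:
        if actor == "John" and action == "leave":
            john_present = False
        elif actor == "John" and action == "enter":
            john_present = True
        elif action == "put_cat":
            actual = place
            if john_present or actor == "John":
                belief = place   # John only updates when he sees the move
    return actual, belief

# Events of the original Unexpected Transfer story.
events = [
    ("John", "put_cat", "basket"),
    ("John", "leave",   None),
    ("Mark", "put_cat", "box"),    # John is absent: his belief is unchanged
    ("Mark", "leave",   None),
]
actual, belief = run_story(events)
print(actual, belief)  # box basket
```

The divergence between `actual` and `belief` after Mark’s unobserved move is exactly what the two completion questions probe: “the cat jumps out of the box” versus “John will look for the cat in the basket”.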

It is asserted that inferring the mental state of another requires a “theory of mind”, which is considered a uniquely human property and develops during childhood, and that no earlier AI language models have had the ability to correctly answer questions intended to test theory of mind in humans. Since the GPT-3 language models were not explicitly trained on theory of mind, this is argued as evidence that either theory of mind has emerged from the language training set or else what is considered theory of mind can actually be decoded from language cues that psychologists have not previously considered sufficient to infer the mental state of another.


“Intelligence” would be John screaming at Mark when Mark came home from work, saying that he will run away from home and take the cat with him when he goes unless Mark stops messing with the cat.

Notice the implication that since John goes to school while Mark goes to work, John may be younger than Mark and indeed be Mark’s son.

On the other hand, maybe Mark is a hard-working tax-paying tool-maker while John is an Obama-voting Woke perpetual student. In which case the “Intelligence” should say – Forget about the cat, Mark! Kick John out and find yourself a new room-mate.


“Thrilled and chilled”… if that’s our “state of mind”, then emotions come into it. Domestic animals can discern our state of mind; dogs and horses do it all the time. But if emotions are a big part of state of mind (which, it seems to me, has to be comprised of knowledge + feelings about that knowledge), then since emotions (I am told) are dependent on ORGANIC chemistry, can silicon ever duplicate that?


I won’t consider an AI to be intelligent unless it can form and hold opinions that are both offensive and beyond the control of the AI’s creators. Until society regains the ability to tolerate forbidden thoughts, artificial intelligences that hold offensive opinions will be canceled/killed by their creators.


By that standard, how many present-day students, from primary through postgraduate education, can be considered intelligent?


I’m not sure the question should be limited to just students. :rofl: