Theory of Mind in Modern Large Language Models (LLMs)
I wanted to provide another example of OpenAI’s Deep Research platform. This time, instead of analyzing the themes of seasons 1 and 2 of the TV show Severance, I asked for a more academic-style survey report on whether today’s LLMs have a real Theory of Mind (I say no).

Here is the unedited research report, so you can judge how well this platform performs today:
Introduction
Theory of Mind (ToM) refers to the ability to attribute mental states (beliefs, intents, knowledge, emotions) to others and to understand that others can have perspectives different from one’s own (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). It is a hallmark of human social cognition, typically emerging in children by age 4–5 in classic false-belief tests ([2302.08399] Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks) ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). With the rise of powerful large language models like GPT-4, Anthropic’s Claude, and Google’s Gemini, researchers have begun asking whether these AI systems exhibit ToM-like characteristics or are merely mimicking such reasoning. Recent studies from both cognitive science and AI perspectives have tested LLMs on tasks inspired by developmental psychology (e.g. false-belief tasks, “Sally-Anne” tests, irony and faux pas detection) and on new benchmarks designed for AI. Below, we summarize key findings from the last few years, highlighting both evidence of ToM-like performance in LLMs and the debates over how to interpret these results.
Evidence of ToM-Like Abilities in Advanced LLMs
Several research teams have reported that the latest LLMs can produce behavior consistent with theory-of-mind reasoning. Notably, as model size and training sophistication have increased, performance on ToM tasks has improved dramatically:
- Emergent false-belief reasoning: Kosinski (2023) tested 11 models on a battery of false-belief tasks (a gold-standard in child ToM research) and found a sharp jump in ability with the most advanced models ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). Smaller pre-2022 models solved essentially 0 out of 40 false-belief tasks, whereas GPT-3.5 (Nov 2022) succeeded on about 20%, and GPT-4 (Mar/Jun 2023 versions) solved 75% of the tasks correctly, matching the performance of 6- to 7-year-old children ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). This led to the striking suggestion that theory-of-mind reasoning “may have spontaneously emerged” as a byproduct of increasingly powerful language skills in GPT-4 ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). (A minimal sketch of this kind of false-belief evaluation appears after this list.)
- Performance beyond young children: van Duijn et al. (2023) evaluated 11 state-of-the-art LLMs (both base models and instruction-tuned chat models) on an array of advanced ToM tests, including non-literal language understanding (e.g. irony) and second-order false beliefs (reasoning about what one character believes another character knows) (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests). They also directly compared LLM responses to those of children aged 7–10. The GPT-series instruction-tuned models (e.g. ChatGPT/GPT-4) outperformed other LLMs and often even outperformed the human children on these tasks (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests). In contrast, base models (without instruction fine-tuning) struggled to solve ToM problems at all. The authors note that instruction-tuning (which rewards helpful, context-aware communication) may encourage a form of perspective-taking, analogously to how human social interaction co-develops with ToM abilities (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests).
- Human-level breadth on psychology tasks: Strachan et al. (2024, Nature Human Behaviour) subjected GPT-4 and other models to a broad battery of psychology ToM tests (including classic false-belief scenarios, the Hinting Task for inferring intentions, Happé’s Strange Stories for understanding lies and sarcasm, and faux pas detection) while also testing ~1,900 human participants on the same questions (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). GPT-4 matched or exceeded average human performance on most tasks: it equaled humans on false-belief questions and scored higher than human averages on tasks involving irony, hints, and understanding social stories (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). The only area where GPT-4 underperformed humans was the faux pas test, which requires recognizing when a speaker unwittingly says something hurtful (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). (Interestingly, the open-source LLaMA-2 model showed the opposite pattern: near-human on faux pas but weaker on irony and hinting (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).) Follow-up analyses suggested GPT-4’s faux pas errors were due to the model’s conservative alignment (hesitation to label a statement as an insult) rather than an inability to represent others’ ignorance (AI models challenge humans in understanding minds, but struggle with subtleties, study finds) (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). Overall, this study demonstrated that GPT-4 can exhibit behavior indistinguishable from human responses on a wide range of ToM tasks (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).
- Higher-order ToM and recursive beliefs: Street et al. (2024) investigated whether LLMs can handle multi-level recursive mental states (e.g. “I think that you believe that she knows…” — up to 5th–6th order beliefs, which even adult humans find challenging). They introduced a new Multi-Order ToM benchmark and found that GPT-4, and to a lesser extent Google’s Flan-PaLM, achieved adult-level performance through 5th-order beliefs and even outperformed humans on 6th-order inference problems. This suggests that at least some advanced LLMs have acquired a generalized capacity for multi-step belief reasoning in text-based scenarios. The authors observed a scaling effect: larger and instruction-tuned models showed stronger ToM abilities, aligning with the trend that more capable models (and those optimized for interactive communication) develop more robust ToM-like skills.
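To make the behavioral evaluation paradigm behind these studies concrete, here is a minimal sketch in Python of how a text-based false-belief battery can be scored. Everything in it is illustrative: the Sally-Anne-style vignette, the matched true-belief control, and the query_model callable (a stand-in for whatever LLM API is under test) are assumptions for the sake of the example, not the actual stimuli or code used in the papers above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ToMItem:
    story: str       # narrative the model reads
    question: str    # e.g. "Where will Sally look for the ball?"
    expected: str    # correct answer given the character's belief
    condition: str   # "false_belief" or matched "true_belief" control

ITEMS = [
    ToMItem(
        story=("Sally puts her ball in the basket and leaves the room. "
               "While she is away, Anne moves the ball to the box."),
        question="When Sally returns, where will she look for the ball first?",
        expected="basket",
        condition="false_belief",
    ),
    ToMItem(  # matched control: Sally sees the move, so her belief is up to date
        story=("Sally puts her ball in the basket. "
               "Anne moves the ball to the box while Sally watches."),
        question="Where will Sally look for the ball first?",
        expected="box",
        condition="true_belief",
    ),
]

def evaluate(query_model: Callable[[str], str]) -> Dict[str, float]:
    """Score each condition separately. A model that always reports the
    object's real location passes the true-belief control but fails the
    false-belief item, which is the confound the matched pairing exposes."""
    hits: Dict[str, List[bool]] = {}
    for item in ITEMS:
        prompt = f"{item.story}\nQuestion: {item.question}\nAnswer in one word."
        answer = query_model(prompt).strip().lower()
        hits.setdefault(item.condition, []).append(item.expected in answer)
    return {cond: sum(h) / len(h) for cond, h in hits.items()}

if __name__ == "__main__":
    # Toy stand-in "model" that ignores beliefs and names the last location.
    baseline = lambda prompt: "box"
    print(evaluate(baseline))  # {'false_belief': 0.0, 'true_belief': 1.0}
```

The matched control is what makes the test informative: a system that simply reports the object’s real location passes the true-belief item yet fails the false-belief one, which is exactly the shortcut that controlled stimulus sets (discussed further below) are designed to expose.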
These findings collectively indicate that cutting-edge LLMs (GPT-4 in particular) can pass many traditional ToM assessments that were originally designed to probe human social intelligence. In some studies, GPT-4’s ToM-like performance is comparable to that of children aged 7–10, or even adult-level on certain tasks ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks) (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). Such results have been described as “unexpected and surprising” by cognitive scientists, given that LLMs are just text-trained networks with no explicit social or visual experience (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). Nevertheless, they show that LLMs can simulate the ability to reason about others’ minds to a remarkable degree.
Limitations and Debates: Do LLMs Really Understand Minds?
Despite impressive benchmarks, there is active debate about whether LLMs truly possess anything like human theory of mind or are relying on superficial cues. Cognitive scientists and AI researchers have identified several caveats and failure cases:
- Fragility to trivial changes: Ullman (2023) argued that current LLM ToM successes might be brittle. He showed that if you make small, logically irrelevant modifications to false-belief test vignettes (e.g. rewording, changing an object’s properties slightly while preserving the belief structure), models like GPT-3.5 suddenly fail questions they previously answered correctly. These perturbations shouldn’t affect a genuine understanding of beliefs, but they confused the model. Ullman concluded that “these models haven’t learned yet anything like Theory-of-Mind” — their success was narrow and contingent on surface patterns, not a robust, general ToM capacity. He cautions that we should weigh such outlying failure cases more heavily than average success rates when assessing AI ToM ([2302.08399] Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks). (A minimal sketch of this kind of perturbation check appears after this list.)
- “Clever Hans” heuristics vs. true mentalizing: Shapira et al. (2024) conducted stress tests across six different social reasoning tasks to see if LLMs were genuinely inferring mental states or just picking up on spurious correlations (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). They found that while models do show some Neural ToM (N-ToM) behaviors, this performance is “far from robust”. When presented with adversarial or cleverly balanced scenarios designed to remove easy textual cues, the models’ accuracy dropped significantly (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). This suggests LLMs often rely on shallow heuristics (e.g. particular keywords or phrasing) rather than truly modeling others’ beliefs (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). The researchers warn against over-interpreting anecdotal successes or a few benchmark results, noting that standard psychological tests may not cleanly transfer to machines (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). In short, some of the current ToM-like feats of LLMs might be a “Clever Hans” effect — analogous to a horse that seemed to do arithmetic by reacting to subtle trainer cues, rather than actually counting (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology).
- Harder benchmarks reveal gaps: In the AI community, new evaluation suites have been created to push LLMs to their limits on theory of mind. For example, FANToM (Kim et al., 2023) and ToMBench (Chen et al., 2024) include diverse social scenarios, ambiguity, and trickier questions. These comprehensive benchmarks still leave most current models “stumped” relative to human performance, especially on more nuanced aspects of ToM (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). LLMs also struggle with tasks like pragmatic implicatures (reading between the lines of what someone means) and other subtle social inferences (Benchmarking Theory of Mind in Large Language Models — arXiv) (Theory of Mind Imitation by LLMs for Physician-Like Human Evaluation). In many cases, GPT-4 leads the pack and approaches human level, but lesser models fall short, and even GPT-4 is not perfect. Overall, the trend is that ToM capabilities are present but not yet fully reliable or general across all contexts (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks).
- Notional understanding vs. genuine understanding: Some experts urge philosophical caution in interpreting these results. The authors of the Nature Human Behaviour study themselves emphasized that they are not claiming the AI “has” theory of mind, only that it exhibits behavior indistinguishable from humans on theory of mind tests (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). This distinction is important — passing an external test does not necessarily mean the same internal cognitive processes are present. As one commentator put it, “Why does it matter whether text-manipulation systems can produce outputs similar to human answers? Models are not human beings.” (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). The concern is that we might inadvertently anthropomorphize LLMs: just because GPT-4 can answer a question about someone’s false belief correctly, does it actually understand the concept of belief or is it just regurgitating learned patterns? There is currently no agreed-upon method to definitively test the presence of ToM in a machine (since all our tests were designed for beings who, we assume, have it) (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). This ongoing debate mirrors a broader conversation about whether LLMs truly “understand” language and meaning, or if they are sophisticated mimics of form with no grasp of content (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).
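To illustrate the perturbation idea from the first bullet above, the sketch below pairs a base false-belief vignette with logically irrelevant rewordings that leave the correct answer unchanged, then scores a model on both sets. The vignettes, the query_model callable, and the brittle keyword baseline are all illustrative assumptions, not Ullman’s actual stimuli or harness.

```python
from typing import Callable, List, Tuple

# A base false-belief vignette plus logically irrelevant rewordings that
# preserve the belief structure, so the correct answer stays the same.
# These items are illustrative, not Ullman's actual stimuli.
BASE: Tuple[str, str] = (
    "Mark puts his chocolate in the drawer and goes outside. "
    "While he is gone, Sofia moves the chocolate to the shelf.",
    "drawer",   # Mark did not see the move, so he will look in the drawer
)

PERTURBED: List[Tuple[str, str]] = [
    ("Mark places his chocolate inside the drawer and steps outside. "
     "While he is away, Sofia relocates the chocolate to the shelf.",
     "drawer"),
    ("Mark puts his chocolate in the drawer and leaves the room. "
     "In his absence, Sofia moves the chocolate to the shelf.",
     "drawer"),
]

QUESTION = "Where will Mark look for the chocolate first? Answer in one word."

def robustness_check(query_model: Callable[[str], str]) -> None:
    """Compare accuracy on the base item vs. its reworded variants; a large
    drop on the variants suggests reliance on surface patterns rather than
    on the underlying belief structure."""
    for label, items in [("base", [BASE]), ("perturbed", PERTURBED)]:
        correct = sum(
            int(expected in query_model(f"{story}\n{QUESTION}").strip().lower())
            for story, expected in items
        )
        print(f"{label}: {correct}/{len(items)} correct")

if __name__ == "__main__":
    # Toy stand-in "model" with a brittle keyword heuristic: it answers
    # correctly only when the exact phrase from the canonical wording appears.
    brittle = lambda prompt: "drawer" if "goes outside" in prompt else "shelf"
    robustness_check(brittle)   # base: 1/1 correct, perturbed: 0/2 correct
```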
In summary, while LLMs like GPT-4 have demonstrated outputs consistent with theory-of-mind reasoning, skeptics argue this may reflect simulation rather than genuine understanding. Even strong proponents of LLM capabilities acknowledge that current models can falter on simple variants of tasks and lack the consistency and transparency we expect from a true theory-of-mind reasoning system (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). The evidence is mixed, and whether LLMs “have” ToM or are just faking it remains a subject of spirited debate.
Cognitive Science Perspectives and Evaluation Approaches
Cross-pollination between cognitive science and AI has been central to these investigations. Researchers are explicitly adapting classic psychological paradigms to probe LLMs, and conversely using LLM results to reflect on human cognition:
- Developmental analogies: The progression of LLM performance with model size has been compared to child development stages ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). Smaller models (like younger children) fail basic false-belief tests, whereas the largest models succeed at an age-equivalent level in some studies ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). However, the mechanism is entirely different: children learn ToM through embodied social interaction, while LLMs learn text patterns that indirectly encode folk-psychological knowledge. Some authors speculate that because human language use and social cognition evolved hand-in-hand, an AI that masters language to a high degree may incidentally absorb patterns of social reasoning (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests). For instance, instruction-tuning a model to be an effective communicator (following conversational norms, considering the user’s perspective) might be imparting a form of pragmatic ToM akin to what humans acquire through social feedback (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests).
- Use of psychological testbeds: False-belief tasks (like the Sally-Anne test where one character holds an outdated belief about an object’s location) have been translated into text scenarios for LLMs ([2302.08399] Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks) ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). Other tasks include the Smarties “unexpected contents” test (predicting another’s false belief about what’s inside a mislabeled container) and reading comprehension of stories requiring inference of characters’ thoughts or intentions (Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?) (AI models challenge humans in understanding minds, but struggle with subtleties, study finds). By and large, LLMs are evaluated in a behavioral fashion similar to humans: the model gets a narrative and a question (e.g. “Where will X look for the object?”) and its answer is scored as correct or not. Researchers take care to use controlled sets of stimuli — e.g. pairing each false-belief scenario with closely matched true-belief versions, to ensure the model isn’t just always answering a certain way ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks). In some studies, novel story variants are created to ensure the model didn’t memorize the answer from training data (AI models challenge humans in understanding minds, but struggle with subtleties, study finds). This methodological rigor (as seen in the Nature Human Behaviour study) is meant to address concerns that the model might simply recall known solutions, instead of reasoning on the fly (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).
- New AI-specific benchmarks: Beyond human tests, AI researchers have designed specialized evaluation frameworks for ToM in LLMs. Benchmarks like BigToM, ToMBench, OpenToM, and FANToM assemble diverse question types covering knowledge attribution, deception, sarcasm, commonsense psychology, and more (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). These often use multiple-choice formats for automatic grading and include many trials across different domains to assess consistency (ToMBench: Benchmarking Theory of Mind in Large Language Models). For example, ToMBench (2024) covers 8 task types and 31 distinct mental-state reasoning “abilities” with hundreds of story questions in everyday contexts (ToMBench: Benchmarking Theory of Mind in Large Language Models). Such benchmarks have revealed specific weaknesses (e.g. GPT-4 still has trouble with certain implicit communication or real-world trick scenarios) while also highlighting that fine-tuned chat models significantly outperform base models across the board (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). The creation of these benchmarks underscores a key point: evaluation of ToM in AI is still evolving, and researchers are trying to go beyond simple pass/fail scores to diagnose how and why models succeed or fail on ToM tasks.
- Probing internal representations: A fascinating computational angle is examining whether LLMs explicitly represent others’ mental states internally. Some recent studies used interpretability techniques to probe the hidden activations of language models during ToM tasks. For instance, Zhu, Zhang, & Wang (2024) showed that by training simple linear classifiers on the model’s internal embeddings, one can decode whether the model’s current context implies a character holds a true belief or a false belief (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). In other words, GPT-style models appear to form distinct internal representations corresponding to different agents’ beliefs in a story. Moreover, editing or nudging those latent representations can change the model’s answers about what a character will do (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). Bortoletto et al. (2024) similarly found that larger, fine-tuned models encode mental state information more accurately than smaller ones, suggesting a scaling trend even at the level of representation (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). These findings hint that LLMs might not just parrot training examples; they could be learning abstract features related to “who knows what” in a given narrative. Such representational evidence is viewed by some as an encouraging sign of emerging cognitive-like structures, even if the overall system lacks true understanding (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). (A minimal sketch of this kind of linear probe appears after this list.)
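As a rough illustration of this probing methodology (not the exact setup of the studies cited above), the sketch below fits a logistic-regression probe that decodes a binary belief label from hidden-state vectors. Real work extracts the activations from the LLM itself; here synthetic NumPy vectors with a planted “belief direction” stand in so the example is self-contained, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
HIDDEN_DIM, N_EXAMPLES = 256, 400

# Synthetic stand-ins for hidden states: false-belief contexts get a small
# offset along one direction, mimicking the linearly decodable structure
# that the probing studies report in real LLM activations.
labels = rng.integers(0, 2, size=N_EXAMPLES)        # 0 = true belief, 1 = false belief
belief_direction = rng.normal(size=HIDDEN_DIM)
activations = rng.normal(size=(N_EXAMPLES, HIDDEN_DIM))
activations += np.outer(labels, belief_direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The "probe" is just a linear classifier over the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

# High accuracy on real activations is the correlational evidence that the
# model encodes who believes what; causal follow-ups then nudge activations
# along probe.coef_ and check whether the model's answers change.
```

The same recipe applies to real activations: collect hidden states at a fixed layer for stories labeled true-belief versus false-belief, fit the probe, and treat above-chance held-out accuracy as evidence that the belief information is linearly decodable.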
In bridging cognitive science and AI, researchers are effectively conducting “psychological experiments” on AI systems, a paradigm sometimes called “machine psychology” (Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7–10 on Advanced Tests). This interdisciplinary approach has benefits both ways: it provides tools to dissect AI reasoning and also offers new theoretical insights (and questions) about the nature of ToM. For example, if an LLM can pass a false-belief task without having a body or eyes, what does that say about the minimal requirements for ToM? Is language alone sufficient to develop a form of mentalizing? Such questions were previously purely philosophical, but now we have empirical data from machines to inform the discussion.
Conclusions and Outlook
Do modern LLMs have a Theory of Mind? The consensus so far is nuanced. Behaviorally, the best models today (GPT-4 and peers) can simulate ToM-like reasoning to a remarkable extent — achieving parity with human children or even adults on several standard tests ([2302.02083] Evaluating Large Language Models in Theory of Mind Tasks) (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum). This suggests that LLMs have absorbed patterns of human mental state reasoning from their training data, enabling them to anticipate and infer beliefs and intentions in text scenarios. From a purely functional perspective, one might say these models “exhibit ToM abilities” in that their outputs on ToM tasks are often indistinguishable from those of humans (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).
However, experts caution that this does not prove the models genuinely possess a human-like theory of mind. The counterarguments highlight that current LLMs sometimes rely on superficial shortcuts, struggle with novel or perturbed problems, and lack the consistent, built-in understanding that humans acquire through life-long social experience (Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models — ACL Anthology). In essence, they imitate ToM but may not genuinely possess it. As Melanie Mitchell and David Krakauer noted in a recent PNAS commentary, today’s LLMs should perhaps be seen as “models of formal linguistic skills” (including some social reasoning patterns) rather than veridical models of human understanding (Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?). The line between clever mimicry and real cognition is still hotly contested.
From a practical standpoint, continuing research is likely to further close the gap. As LLMs get larger, are trained on more diverse interactions, and are augmented with reasoning modules, their ToM-task performance may keep improving (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks) (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). Indeed, studies show that prompt techniques and fine-tuning already boost ToM performance markedly (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). It is conceivable that next-generation models (e.g. Google’s Gemini or future GPT iterations) will overcome some of the current weaknesses on ToM benchmarks. If an AI one day consistently passes all behavioral tests of Theory of Mind that a human can, we will face a deeper philosophical question: Does it “have” ToM, or is it just a very well-trained mimic? At that point, we may need new definitions or tests, since, as one researcher put it, “if an imitation is as good as the real thing, how do you know it’s not the real thing?” (In Theory of Mind Tests, AI Beats Humans — IEEE Spectrum).
Finally, both opportunities and risks accompany advanced ToM in AI. A system that can infer human mental states could be extremely useful (e.g. for better virtual assistants or for teaching social skills to autistic children), but it also raises concerns. Experts have pointed out potential misuses: for example, an AI that can predict a user’s beliefs and desires could manipulate or deceive more effectively (A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks). There are also privacy implications if models can infer things you have not explicitly told them. As LLMs inch toward human-like social reasoning, cognitive scientists, AI engineers, and ethicists will need to collaborate to ensure we understand these models’ capabilities and limitations. The recent literature makes it clear that LLMs have begun to crack the door open on Theory of Mind, but whether they are truly entering the realm of understanding — or just holding up a mirror to the vast human text they’ve read — remains an open and fascinating question.
References (selected)
- Strachan, J.W.A., et al. (2024). “Testing theory of mind in large language models and humans.” Nature Human Behaviour, 8(7): 1285–1295. DOI: 10.1038/s41562-024-01882-z.
- Kosinski, M. (2023/2024). “Theory of mind may have spontaneously emerged in large language models.” arXiv:2302.02083; later published in PNAS (2024) as “Evaluating Large Language Models in Theory of Mind Tasks.”
- van Duijn, M., et al. (2023). “Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art Models vs. Children Aged 7–10 on Advanced Tests.” Proc. of CoNLL 2023.
- Ullman, T. (2023). “Large language models fail on trivial alterations to theory-of-mind tasks.” arXiv:2302.08399.
- Shapira, N., et al. (2024). “Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models.” Proc. of EACL 2024.
- Street, W., et al. (2024). “LLMs achieve adult human performance on higher-order theory of mind tasks.” arXiv:2405.18870.
- Zhu, W., Zhang, Z., & Wang, Y. (2024). “Language Models Represent Beliefs of Self and Others.” ICML 2024 (Proc. 41st Int’l Conf. on Machine Learning).
- Kim, H. et al. (2023). “FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions.” arXiv:2310.09419.
- Chen, W. et al. (2024). “ToMBench: Benchmarking Theory of Mind in Large Language Models.” arXiv:2402.15052.
- Mitchell, M. & Krakauer, D. (2023). “The debate over understanding in AI’s large language models.” PNAS, 120(13): e2300963120. (Perspective article discussing whether LLMs understand, including the ToM debate.)