OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

University of Central Florida
Comparison of endpoint QA vs. belief-structure reconstruction on a false-belief story.
Endpoint QA vs. Explicit Belief Reconstruction. A model can answer "Where will Bob look?" correctly without ever tracking the belief states that make the answer valid. OmniToM makes the hidden mental-state representation explicit and directly measurable.

Abstract

Social reasoning requires tracking how information is distributed across actors, not only what happened in the world. Existing Theory of Mind (ToM) benchmarks evaluate this ability through endpoint question answering: models are scored by whether they produce the correct final answer, leaving the underlying mental-state representation entirely unobserved. A model can answer "Where will Bob look?" correctly while failing to track the belief states that make the answer valid.

We introduce OmniToM, a benchmark that closes this gap through explicit belief-structure modeling. Rather than probing endpoints, OmniToM requires a model to reconstruct the full multi-actor belief representation of a story, then label each belief proposition along seven dimensions grounded in ATOMS, a literature-derived taxonomy of Theory of Mind abilities.

OmniToM comprises 895 stories and 22,343 labeled belief propositions developed through over 1,000 person-hours of human annotation effort. Zero-shot evaluation across nine open- and closed-source models identifies a consistent actor-specific information-tracking bottleneck: Stage 1 belief extraction peaks at 57.69% macro F1, while Stage 2 belief labeling reaches 85.95% accuracy. Errors concentrate on Knowledge Access and Representation, the dimensions that require a model to determine who could know or share a belief, and whether it is explicitly stated or inferred from context.

Benchmark at a Glance

Benchmark Stories: 895
Labeled Beliefs: 22K+
Schema Labels: 156K
Story Categories: 7
Models Evaluated: 9
Person-Hours: 1K+

Seven-Dimensional Schema (grounded in ATOMS)

Order, Truth Status, Knowledge Access, Representation, Content Type, Mental Source, Context
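As a rough illustration of what one labeled belief proposition might look like in code, the sketch below pairs an (Actor, Belief, Order) tuple with a label slot for each of the seven dimensions. The class name, field names, the use of `None` as the actor for world facts, and the example belief are all hypothetical. The benchmark's actual label vocabulary is not listed here, so no real label values are shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The seven schema dimensions named in the benchmark description.
SCHEMA_DIMENSIONS = (
    "Order", "Truth Status", "Knowledge Access", "Representation",
    "Content Type", "Mental Source", "Context",
)


@dataclass
class BeliefProposition:
    """Hypothetical container for one belief proposition and its labels."""
    actor: Optional[str]  # assumption: None could stand for an Order 0 world fact
    belief: str
    order: int
    labels: Dict[str, str] = field(default_factory=dict)

    def missing_dimensions(self) -> List[str]:
        # Dimensions that have not been assigned a label yet.
        return [d for d in SCHEMA_DIMENSIONS if d not in self.labels]

    def is_fully_labeled(self) -> bool:
        return not self.missing_dimensions()


# Invented example belief, loosely following the "Where will Bob look?" story.
prop = BeliefProposition(actor="Bob", belief="The ball is in the basket.", order=1)
prop.labels["Order"] = "1"  # placeholder value, not the benchmark's real encoding

print(prop.is_fully_labeled())
print(prop.missing_dimensions())
```

Keeping the labels in a dictionary keyed by dimension name makes it easy to check that a proposition carries a complete seven-dimensional label vector before it is scored.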

Two-Stage Evaluation Pipeline

OmniToM evaluates belief-structure modeling in two stages, both run under zero-shot TELeR Level 3 prompts (task directive + stepwise sub-tasks, no in-context examples):

Stage 1, Belief Extraction: Given a story, the model extracts all relevant (Actor, Belief, Order) tuples. A GPT-5 semantic judge scores extraction by precision, recall, and macro F1 over semantically matched propositions.

Stage 2, Belief Labeling: Given a story and the benchmark belief table, the model assigns a seven-dimensional schema label vector to each belief. Scored by exact-match accuracy per dimension, averaged over all propositions.
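The Stage 1 scoring described above can be sketched as follows. The sketch assumes the semantic judge has already produced a one-to-one matching between predicted and gold propositions, so only three counts are needed. The function name and the example counts are hypothetical, and the judge itself is not modeled.

```python
from typing import Tuple


def precision_recall_f1(num_matched: int, num_predicted: int, num_gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from match counts.

    Assumes each matched pair links exactly one predicted proposition
    to exactly one gold proposition.
    """
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts for a single story: 20 extracted tuples, 25 gold tuples,
# 12 of which the judge considers semantically equivalent.
p, r, f1 = precision_recall_f1(num_matched=12, num_predicted=20, num_gold=25)
print(f"P={p:.2f} R={r:.2f} F1={f1:.3f}")
```

Precision penalizes extracted beliefs with no gold counterpart, while recall penalizes gold beliefs the model never produced, so F1 drops when a model either over-generates or misses actor-specific beliefs.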

Figure 1. The OmniToM two-stage pipeline. Stage 1 extracts actor-specific belief propositions from a story; Stage 2 labels each proposition across seven schema dimensions. Both stages are evaluated zero-shot.

Annotation Pipeline

Benchmark construction required over 1,000 person-hours of human annotation to establish high-quality gold labels. From this human-calibrated foundation, a human-in-the-loop, LLM-assisted pipeline was finalized and then scaled to the full 895-story benchmark. Each story yields an average of 24.96 labeled belief propositions, spanning world facts (Order 0, 32.6%), actor beliefs (Order 1, 57.1%), and nested beliefs (Order 2+, 10.3%).
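The headline counts can be cross-checked with a few lines of arithmetic. The snippet below uses only figures stated on this page. Reading the "156K Schema Labels" figure as propositions times seven dimensions is an inference, not something the page states outright.

```python
stories = 895
propositions = 22_343
dimensions = 7

# Average number of labeled belief propositions per story.
avg_per_story = propositions / stories

# Assumption: every proposition carries one label per schema dimension.
total_labels = propositions * dimensions

# Reported share of propositions by belief order, in percent.
order_shares = {"Order 0": 32.6, "Order 1": 57.1, "Order 2+": 10.3}
share_total = sum(order_shares.values())

print(round(avg_per_story, 2))  # 24.96
print(total_labels)             # 156401
print(round(share_total, 1))    # 100.0
```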

Figure 2. The human-calibrated annotation pipeline used to produce labeled belief propositions at scale.
Figure 3. Distribution of schema labels across the benchmark. Knowledge Access and Representation are the most contested dimensions.

Benchmark Results

All results use zero-shot TELeR Level 3 prompts. GPT-5 serves as the semantic judge and is therefore omitted from Stage 1.

Stage 1, Belief Extraction F1 (%)

Model              | Params | AST   | FBT   | FPT   | HT    | PST   | SIT   | SST   | Overall
Closed-source models
Gemini-2.5 Flash   | N/A    | 42.40 | 56.48 | 57.78 | 50.34 | 58.55 | 62.91 | 56.31 | 54.97
GPT-5              | N/A    | judge model, excluded from Stage 1                    | N/A
Open-weight models
Gemma-3 27B        | 27B    | 48.72 | 72.39 | 56.05 | 45.46 | 56.72 | 68.76 | 55.77 | 57.69
Mistral-Small 24B  | 24B    | 52.97 | 54.58 | 59.79 | 48.32 | 56.97 | 66.20 | 53.17 | 56.00
Mistral-Large 123B | 123B   | 47.75 | 71.28 | 53.66 | 41.78 | 58.53 | 57.38 | 48.35 | 54.10
Qwen3 32B          | 32B    | 46.88 | 57.32 | 53.38 | 41.41 | 57.25 | 56.51 | 48.67 | 51.63
Llama-3.3 70B      | 70B    | 37.51 | 64.07 | 46.33 | 36.27 | 47.23 | 57.70 | 41.58 | 47.24
Qwen3 8B           | 8B     | 39.12 | 50.22 | 44.21 | 37.14 | 48.36 | 47.20 | 37.60 | 43.41
Llama-3.1 8B       | 8B     | 26.34 | 48.29 | 35.80 | 31.48 | 36.12 | 53.52 | 30.37 | 37.42
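The Overall column in the Stage 1 table appears consistent with an unweighted (macro) average of the seven story-category F1 scores. The check below recomputes it for three rows copied from the table. This reading is inferred from the numbers rather than stated on the page, and small differences in the last digit for other rows may come from rounding of the per-category values.

```python
from statistics import mean

# Category columns, in the order they appear in the Stage 1 table.
CATEGORIES = ["AST", "FBT", "FPT", "HT", "PST", "SIT", "SST"]

# Per-category F1 (%) copied from three table rows.
stage1_f1 = {
    "Gemini-2.5 Flash": [42.40, 56.48, 57.78, 50.34, 58.55, 62.91, 56.31],
    "Mistral-Small 24B": [52.97, 54.58, 59.79, 48.32, 56.97, 66.20, 53.17],
    "Llama-3.1 8B": [26.34, 48.29, 35.80, 31.48, 36.12, 53.52, 30.37],
}

# Overall values as reported in the same rows.
reported_overall = {
    "Gemini-2.5 Flash": 54.97,
    "Mistral-Small 24B": 56.00,
    "Llama-3.1 8B": 37.42,
}

macro = {model: mean(scores) for model, scores in stage1_f1.items()}
for model, value in macro.items():
    print(f"{model}: recomputed={value:.2f} reported={reported_overall[model]:.2f}")
```

Because the average is unweighted, a category with few stories influences the Overall score as much as a large one, which is worth keeping in mind when comparing models whose strengths differ by category.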

Stage 2, Belief-Labeling Accuracy (%)

Model              | Order | Truth Status | Knowledge Access | Representation | Content Type | Mental Source | Context | Overall
Closed-source models
Gemini-2.5 Flash   | 95.56 | 84.97        | 71.34            | 87.58          | 85.97        | 84.10         | 92.14   | 85.95
GPT-5              | 95.18 | 82.72        | 66.85            | 83.42          | 79.96        | 83.02         | 88.83   | 82.85
Open-weight models
Mistral-Large 123B | 97.25 | 86.53        | 74.14            | 72.87          | 82.83        | 86.32         | 92.97   | 84.70
Mistral-Small 24B  | 95.13 | 82.22        | 74.59            | 62.79          | 76.01        | 84.82         | 91.90   | 81.06
Llama-3.3 70B      | 92.74 | 83.55        | 67.41            | 72.43          | 72.35        | 76.71         | 91.69   | 79.55
Qwen3 32B          | 96.42 | 82.43        | 73.91            | 62.45          | 71.27        | 76.84         | 90.81   | 79.16
Gemma-3 27B        | 96.56 | 82.44        | 71.57            | 54.33          | 73.50        | 78.72         | 92.07   | 78.46
Qwen3 8B           | 73.38 | 67.17        | 57.94            | 63.77          | 51.43        | 61.49         | 74.62   | 64.26
Llama-3.1 8B       | 71.90 | 65.59        | 56.13            | 64.40          | 48.63        | 55.18         | 76.81   | 62.66
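A minimal sketch of Stage 2 scoring, assuming gold and predicted labels are available as one dictionary per proposition keyed by dimension name. The function names and the placeholder labels "a" and "b" are hypothetical and do not reflect the benchmark's real label values. Taking Overall as the mean of the seven per-dimension accuracies is an inference from the table, which the last lines check against the Gemini-2.5 Flash row.

```python
from typing import Dict, List

DIMENSIONS = [
    "Order", "Truth Status", "Knowledge Access", "Representation",
    "Content Type", "Mental Source", "Context",
]


def per_dimension_accuracy(gold: List[Dict[str, str]], pred: List[Dict[str, str]]) -> Dict[str, float]:
    """Exact-match accuracy (%) per dimension over aligned propositions."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return {
        dim: 100.0 * sum(g[dim] == p.get(dim) for g, p in zip(gold, pred)) / len(gold)
        for dim in DIMENSIONS
    }


def overall_accuracy(per_dim: Dict[str, float]) -> float:
    # Assumption: Overall is the unweighted mean of the per-dimension accuracies.
    return sum(per_dim.values()) / len(per_dim)


# Toy data: two propositions, one wrong Knowledge Access label.
gold = [{dim: "a" for dim in DIMENSIONS}, {dim: "b" for dim in DIMENSIONS}]
pred = [dict(gold[0]), dict(gold[1])]
pred[1]["Knowledge Access"] = "a"

acc = per_dimension_accuracy(gold, pred)
toy_overall = overall_accuracy(acc)
print(acc["Knowledge Access"], round(toy_overall, 2))

# Consistency check against the Gemini-2.5 Flash row of the table.
gemini_row = [95.56, 84.97, 71.34, 87.58, 85.97, 84.10, 92.14]
gemini_mean = sum(gemini_row) / len(gemini_row)
print(round(gemini_mean, 2))
```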

Key Finding

Actor-specific information tracking is the core bottleneck. Models can parse social stories, but they struggle to determine which information each actor has access to, how it is communicated or inferred, and how it becomes part of that actor's mental-state representation.

Stage 1 reveals the structural side of this bottleneck. F1 drops sharply as belief order increases: moving beyond Order 0 world facts requires the model to determine which facts each actor perceived, missed, remembered, was told, or could infer, and higher-order beliefs add a further layer of nested mental-state reasoning.

Stage 2 explains the same bottleneck at the schema-label level. Knowledge Access (56-75%) and Representation (54-88%) are the weakest dimensions across all models. The analysis by belief order shows that Order 1 actor beliefs have the lowest overall labeling accuracy (71.2%), with especially low accuracy for Knowledge Access (58.9%) and Representation (57.3%). These labels require deciding who could know or share a belief, and whether it is directly stated or inferred from perception, testimony, interaction, or context.

The gap between stages is itself diagnostic: models label provided belief propositions far better (up to 85.95%) than they extract those propositions from raw text (up to 57.69%). Current LLMs are much better at operating over an explicit belief structure once it is given than at constructing that structure directly from story text, a distinction invisible to endpoint QA benchmarks.
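To make the stage gap concrete, the sketch below subtracts each model's Stage 1 Overall F1 from its Stage 2 Overall accuracy, using the Overall values from the two results tables. The two metrics are not the same quantity (macro F1 versus exact-match accuracy), so the differences are indicative only. GPT-5 is left out because it has no Stage 1 score.

```python
# Overall scores (%) copied from the Stage 1 and Stage 2 results tables.
stage1_overall_f1 = {
    "Gemini-2.5 Flash": 54.97, "Gemma-3 27B": 57.69, "Mistral-Small 24B": 56.00,
    "Mistral-Large 123B": 54.10, "Qwen3 32B": 51.63, "Llama-3.3 70B": 47.24,
    "Qwen3 8B": 43.41, "Llama-3.1 8B": 37.42,
}
stage2_overall_acc = {
    "Gemini-2.5 Flash": 85.95, "Gemma-3 27B": 78.46, "Mistral-Small 24B": 81.06,
    "Mistral-Large 123B": 84.70, "Qwen3 32B": 79.16, "Llama-3.3 70B": 79.55,
    "Qwen3 8B": 64.26, "Llama-3.1 8B": 62.66,
}

# Per-model difference between labeling accuracy and extraction F1.
gaps = {m: stage2_overall_acc[m] - stage1_overall_f1[m] for m in stage1_overall_f1}
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:.2f} points")

# Difference between the best score in each stage.
best_gap = max(stage2_overall_acc.values()) - max(stage1_overall_f1.values())
print(f"best-vs-best gap: {best_gap:.2f} points")
```

Every model in this comparison scores higher on labeling a provided belief table than on extracting one, which matches the claim that constructing the belief structure is the harder step.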

Figure 4a. Stage 1 extraction F1 by belief-order bucket. All models show a consistent drop from Order 0 world facts to Order 1 actor beliefs.
Figure 4b. Stage 2 labeling accuracy by belief-order bucket and schema dimension. Order 1 beliefs are the hardest to label, especially for Knowledge Access and Representation.

BibTeX

@article{bawatneh2026omnitom,
  title   = {OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling},
  author  = {Bawatneh, Adam and Sapkota, Sagar and Bedi, Amrit Singh and
             Karmaker, Santu and Shah, Mubarak},
  journal = {arXiv preprint arXiv:2605.26322},
  year    = {2026}
}