OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

**Endpoint QA vs. Explicit Belief Reconstruction.** A model can answer *"Where will Bob look?"* correctly without ever tracking the belief states that make the answer valid. OmniToM makes the hidden mental-state representation explicit and directly measurable.

Abstract

Social reasoning requires tracking how information is distributed across actors, not only what happened in the world. Existing Theory of Mind (ToM) benchmarks evaluate this ability through endpoint question answering: models are scored by whether they produce the correct final answer, leaving the underlying mental-state representation entirely unobserved. A model can answer "Where will Bob look?" correctly while failing to track the belief states that make the answer valid.

We introduce OmniToM, a benchmark that closes this gap through explicit belief-structure modeling. Rather than probing endpoints, OmniToM requires a model to reconstruct the full multi-actor belief representation of a story, then label each belief proposition along seven dimensions grounded in ATOMS, a literature-derived taxonomy of Theory of Mind abilities.

OmniToM comprises 895 stories and 22,343 labeled belief propositions developed through over 1,000 person-hours of human annotation effort. Zero-shot evaluation across nine open- and closed-source models identifies a consistent actor-specific information-tracking bottleneck: Stage 1 belief extraction peaks at 57.69% macro F₁, while Stage 2 belief labeling reaches 85.95% accuracy. Errors concentrate on Knowledge Access and Representation, the dimensions that require a model to determine who could know or share a belief, and whether it is explicitly stated or inferred from context.

Benchmark at a Glance

Benchmark Stories: 895
Labeled Beliefs: 22K+
Schema Labels: 156K
Story Categories: 7
Models Evaluated: 9
Person-Hours: 1K+

Seven-Dimensional Schema (grounded in ATOMS)

Order, Truth Status, Knowledge Access, Representation, Content Type, Mental Source, Context
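As a rough illustration of what one labeled belief proposition might look like as a data structure, the sketch below pairs an (Actor, Belief, Order) tuple with a label vector over the seven dimensions. The class name `LabeledBelief`, its field names, and the placeholder label value are assumptions made for this example; the benchmark's actual file format and label vocabulary are not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict

# The seven schema dimensions named above.
SCHEMA_DIMENSIONS = (
    "Order",
    "Truth Status",
    "Knowledge Access",
    "Representation",
    "Content Type",
    "Mental Source",
    "Context",
)


@dataclass
class LabeledBelief:
    """Hypothetical container for one belief proposition and its labels."""

    actor: str
    belief: str
    order: int
    labels: Dict[str, str] = field(default_factory=dict)

    def is_fully_labeled(self) -> bool:
        # True only when exactly the seven schema dimensions are present.
        return set(self.labels) == set(SCHEMA_DIMENSIONS)


# Placeholder values only; the real label vocabulary is not reproduced here.
example = LabeledBelief(
    actor="Bob",
    belief="The item is where Bob last saw it.",
    order=1,
    labels={dim: "<label>" for dim in SCHEMA_DIMENSIONS},
)
print(example.is_fully_labeled())
```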

Two-Stage Evaluation Pipeline

OmniToM evaluates belief-structure modeling in two stages, both run under zero-shot TELeR Level 3 prompts (task directive + stepwise sub-tasks, no in-context examples):

Stage 1, Belief Extraction: Given a story, the model extracts all relevant (Actor, Belief, Order) tuples. A GPT-5 semantic judge scores extraction by precision, recall, and macro F₁ over semantically matched propositions.

Stage 2, Belief Labeling: Given a story and the benchmark belief table, the model assigns a seven-dimensional schema label vector to each belief. Labeling is scored by exact-match accuracy per dimension, averaged over all propositions.
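The two scoring rules can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: it assumes the semantic judge has already decided which predicted and gold propositions match, so Stage 1 reduces to counts, and it assumes Stage 2 labels arrive as one dictionary per proposition. The function names and the toy label values `"x"` and `"y"` are hypothetical.

```python
from typing import Dict, List, Tuple

DIMENSIONS: Tuple[str, ...] = (
    "Order", "Truth Status", "Knowledge Access", "Representation",
    "Content Type", "Mental Source", "Context",
)


def precision_recall_f1(matched_pred: int, n_pred: int,
                        matched_gold: int, n_gold: int) -> Tuple[float, float, float]:
    """Stage 1: precision, recall, and F1 from judge-matched counts."""
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_dimension_accuracy(gold: List[Dict[str, str]],
                           pred: List[Dict[str, str]]) -> Dict[str, float]:
    """Stage 2: exact-match accuracy per dimension over all propositions."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and aligned")
    return {
        dim: sum(g[dim] == p[dim] for g, p in zip(gold, pred)) / len(gold)
        for dim in DIMENSIONS
    }


# Toy Stage 1 case: 10 predicted tuples, 8 gold tuples, 6 matched on each side.
p, r, f1 = precision_recall_f1(6, 10, 6, 8)

# Toy Stage 2 case: two propositions, one wrong Knowledge Access label.
gold = [{d: "x" for d in DIMENSIONS}, {d: "y" for d in DIMENSIONS}]
pred = [dict(gold[0]), {**gold[1], "Knowledge Access": "x"}]
acc = per_dimension_accuracy(gold, pred)
print(round(f1, 4), acc["Knowledge Access"])
```

Keeping the match decision outside the metric functions means the same arithmetic applies whichever judge produces the matches.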

**Figure 1.** The OmniToM two-stage pipeline. Stage 1 extracts actor-specific belief propositions from a story; Stage 2 labels each proposition across seven schema dimensions. Both stages are evaluated zero-shot.

Annotation Pipeline

Benchmark construction required over 1,000 person-hours of human annotation to establish high-quality gold labels. From this human-calibrated foundation, a human-in-the-loop LLM-assisted pipeline was finalized and scaled to the full 895-story benchmark. Each story yields an average of 24.96 labeled belief propositions spanning world facts (Order 0, 32.6%), actor beliefs (Order 1, 57.1%), and nested beliefs (Order 2+, 10.3%).
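The headline counts are mutually consistent, which a few lines of arithmetic can confirm. The sketch below assumes that the "156K Schema Labels" figure is simply seven labels per belief proposition; that reading is an inference from the numbers, not something stated explicitly.

```python
N_STORIES = 895
N_BELIEFS = 22_343
N_DIMENSIONS = 7  # assumed: one label per schema dimension per belief

# Reported share of propositions at each belief order, in percent.
ORDER_SHARES = {"Order 0": 32.6, "Order 1": 57.1, "Order 2+": 10.3}

beliefs_per_story = N_BELIEFS / N_STORIES
schema_labels = N_BELIEFS * N_DIMENSIONS
share_total = sum(ORDER_SHARES.values())

print(f"beliefs per story: {beliefs_per_story:.2f}")
print(f"schema labels: {schema_labels:,}")
print(f"order shares sum to: {share_total:.1f}%")
```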

**Figure 3.** Distribution of schema labels across the benchmark. Knowledge Access and Representation are the most contested dimensions.

Benchmark Results

Zero-shot TELeR Level 3 prompts. Bold = best, underline = second-best. GPT-5 is the semantic judge and is omitted from Stage 1.

Stage 1, Belief Extraction F₁ (%)

Model	Params	AST	FBT	FPT	HT	PST	SIT	SST	Overall
Closed-source models
Gemini-2.5 Flash	N/A	42.40	56.48	57.78	50.34	58.55	62.91	56.31	54.97
GPT-5	N/A	judge model, excluded from Stage 1							N/A
Open-weight models
Gemma-3 27B	27B	48.72	72.39	56.05	45.46	56.72	68.76	55.77	57.69
Mistral-Small 24B	24B	52.97	54.58	59.79	48.32	56.97	66.20	53.17	56.00
Mistral-Large 123B	123B	47.75	71.28	53.66	41.78	58.53	57.38	48.35	54.10
Qwen3 32B	32B	46.88	57.32	53.38	41.41	57.25	56.51	48.67	51.63
Llama-3.3 70B	70B	37.51	64.07	46.33	36.27	47.23	57.70	41.58	47.24
Qwen3 8B	8B	39.12	50.22	44.21	37.14	48.36	47.20	37.60	43.41
Llama-3.1 8B	8B	26.34	48.29	35.80	31.48	36.12	53.52	30.37	37.42
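The Overall column appears to be an unweighted mean of the seven per-category F₁ scores. The sketch below recomputes it for three rows copied from the table; the agreement (to within about 0.01, presumably rounding of the per-category values) is an observation from the numbers rather than a documented definition, and the helper names are hypothetical.

```python
from statistics import mean

CATEGORIES = ("AST", "FBT", "FPT", "HT", "PST", "SIT", "SST")

# model -> (per-category F1 in CATEGORIES order, reported Overall)
STAGE1 = {
    "Gemini-2.5 Flash": ([42.40, 56.48, 57.78, 50.34, 58.55, 62.91, 56.31], 54.97),
    "Gemma-3 27B": ([48.72, 72.39, 56.05, 45.46, 56.72, 68.76, 55.77], 57.69),
    "Mistral-Small 24B": ([52.97, 54.58, 59.79, 48.32, 56.97, 66.20, 53.17], 56.00),
}


def macro_average(scores):
    """Unweighted mean over story categories."""
    return mean(scores)


def max_deviation(table):
    """Largest gap between the recomputed and the reported Overall."""
    return max(abs(macro_average(s) - reported) for s, reported in table.values())


for model, (scores, reported) in STAGE1.items():
    print(f"{model}: recomputed {macro_average(scores):.2f}, reported {reported:.2f}")
```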

Stage 2, Belief-Labeling Accuracy (%)

Model	Order	Status	Access	Repr	CType	Source	Context	Overall
Closed-source models
Gemini-2.5 Flash	95.56	84.97	71.34	87.58	85.97	84.10	92.14	85.95
GPT-5	95.18	82.72	66.85	83.42	79.96	83.02	88.83	82.85
Open-weight models
Mistral-Large 123B	97.25	86.53	74.14	72.87	82.83	86.32	92.97	84.70
Mistral-Small 24B	95.13	82.22	74.59	62.79	76.01	84.82	91.90	81.06
Llama-3.3 70B	92.74	83.55	67.41	72.43	72.35	76.71	91.69	79.55
Qwen3 32B	96.42	82.43	73.91	62.45	71.27	76.84	90.81	79.16
Gemma-3 27B	96.56	82.44	71.57	54.33	73.50	78.72	92.07	78.46
Qwen3 8B	73.38	67.17	57.94	63.77	51.43	61.49	74.62	64.26
Llama-3.1 8B	71.90	65.59	56.13	64.40	48.63	55.18	76.81	62.66
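To see which dimensions are weakest overall, one option is to average each column of the table across the nine models. The sketch below does this with the values copied from the table; weighting every model equally is an assumption made for illustration and may differ from the authors' own analysis.

```python
from statistics import mean

DIMS = ("Order", "Status", "Access", "Repr", "CType", "Source", "Context")

# Per-dimension labeling accuracy (%) in DIMS order, copied from the table.
STAGE2 = {
    "Gemini-2.5 Flash": [95.56, 84.97, 71.34, 87.58, 85.97, 84.10, 92.14],
    "GPT-5": [95.18, 82.72, 66.85, 83.42, 79.96, 83.02, 88.83],
    "Mistral-Large 123B": [97.25, 86.53, 74.14, 72.87, 82.83, 86.32, 92.97],
    "Mistral-Small 24B": [95.13, 82.22, 74.59, 62.79, 76.01, 84.82, 91.90],
    "Llama-3.3 70B": [92.74, 83.55, 67.41, 72.43, 72.35, 76.71, 91.69],
    "Qwen3 32B": [96.42, 82.43, 73.91, 62.45, 71.27, 76.84, 90.81],
    "Gemma-3 27B": [96.56, 82.44, 71.57, 54.33, 73.50, 78.72, 92.07],
    "Qwen3 8B": [73.38, 67.17, 57.94, 63.77, 51.43, 61.49, 74.62],
    "Llama-3.1 8B": [71.90, 65.59, 56.13, 64.40, 48.63, 55.18, 76.81],
}


def dimension_means(table):
    """Mean accuracy per dimension, weighting every model equally."""
    columns = zip(*table.values())
    return {dim: mean(col) for dim, col in zip(DIMS, columns)}


def weakest_dimensions(table, k=2):
    """The k dimensions with the lowest cross-model mean accuracy."""
    means = dimension_means(table)
    return sorted(means, key=means.get)[:k]


print(weakest_dimensions(STAGE2))
```

Individual rows can deviate from the aggregate picture: for the two 8B models, for example, the lowest single value in the row is CType rather than Access or Repr.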

Key Finding

Actor-specific information tracking is the core bottleneck. Models can parse social stories, but they struggle to determine which information each actor has access to, how it is communicated or inferred, and how it becomes part of that actor's mental-state representation.

Stage 1 reveals the structural side of this bottleneck. F₁ drops sharply as belief order increases: moving beyond Order 0 world facts requires the model to determine which facts each actor perceived, missed, remembered, was told, or could infer, and higher-order beliefs add a further layer of nested mental-state reasoning.

Stage 2 explains the same bottleneck at the schema-label level. Knowledge Access (56-75%) and Representation (54-88%) are the weakest dimensions when accuracy is averaged across models. The analysis by belief order shows that Order 1 actor beliefs have the lowest overall labeling accuracy (71.2%), with especially low accuracy for Knowledge Access (58.9%) and Representation (57.3%). These labels require deciding who could know or share a belief, and whether it is directly stated or inferred from perception, testimony, interaction, or context.

The gap between stages is itself diagnostic: models label provided belief propositions far better (up to 85.95%) than they extract those propositions from raw text (up to 57.69%). Current LLMs are much better at operating over an explicit belief structure once it is given than at constructing that structure directly from story text, a distinction invisible to endpoint QA benchmarks.

**Figure 4a.** Stage 1 extraction F₁ by belief-order bucket. All models show a consistent drop from Order 0 world facts to Order 1 actor beliefs.

**Figure 4b.** Stage 2 labeling accuracy by belief-order bucket and schema dimension. Order 1 beliefs are the hardest to label, especially for Knowledge Access and Representation.

BibTeX

@article{bawatneh2026omnitom,
  title   = {OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling},
  author  = {Bawatneh, Adam and Sapkota, Sagar and Bedi, Amrit Singh and
             Karmaker, Santu and Shah, Mubarak},
  journal = {arXiv preprint arXiv:2605.26322},
  year    = {2026}
}