OmniToM - Benchmarking Theory of Mind in LLMs

OmniToM is my arXiv preprint on evaluating Theory of Mind in large language models through explicit belief-structure modeling rather than endpoint question answering alone.

Figure: First page preview of the OmniToM arXiv preprint.

Paper

  • Title: OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
  • Authors: Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah
  • Preprint: arXiv:2605.26322

What OmniToM Measures

Most Theory of Mind benchmarks score the final answer to a social-reasoning question. OmniToM asks a deeper diagnostic question: can a model recover the underlying beliefs that each actor holds, including mistaken, inferred, or nested beliefs?

The benchmark uses a two-stage evaluation:

  • Stage 1, Belief Extraction: recover the actor-specific belief propositions needed to explain a story’s social dynamics.
  • Stage 2, Belief Labeling: assign each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, representation, content type, mental source, and context.
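As a rough illustration of what the two stages produce, the sketch below models one extracted belief and its seven-dimensional label in Python. The class names, field names, and every label value are hypothetical placeholders chosen for readability, not the schema vocabulary defined in the paper.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BeliefLabel:
    """One field per schema dimension; the values used below are illustrative guesses."""
    recursive_order: str
    truth_status: str
    knowledge_access: str
    representation: str
    content_type: str
    mental_source: str
    context: str


@dataclass(frozen=True)
class Belief:
    """Stage 1 output (actor + proposition) paired with its Stage 2 label."""
    actor: str
    proposition: str
    label: BeliefLabel


# Hypothetical example from a classic false-belief story.
belief = Belief(
    actor="Sally",
    proposition="The marble is in the basket.",
    label=BeliefLabel(
        recursive_order="first-order",
        truth_status="false",
        knowledge_access="absent-during-change",
        representation="explicit",
        content_type="object-location",
        mental_source="perception",
        context="physical",
    ),
)

# Every belief carries exactly seven label dimensions.
dimensions = list(asdict(belief.label).keys())
print(belief.actor, belief.proposition, dimensions)
```

Keeping the actor on the belief itself, rather than on the story, reflects the benchmark's focus on actor-specific belief tracking.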

Figure: OmniToM two-stage belief modeling workflow.

Dataset and Evaluation

  • Built from 895 ToMBench-derived stories.
  • Includes 22,343 labeled belief propositions.
  • Uses a human-calibrated LLM-assisted annotation pipeline.
  • Evaluates both open-weight and API model families under zero-shot prompting.

Main Finding

OmniToM reveals an actor-specific belief-tracking bottleneck. Current LLMs can often work with explicit belief structures once those structures are given, but they perform markedly worse when they must construct the structures from raw narrative text. In the paper’s zero-shot evaluation, belief labeling reaches 85.95% accuracy, while belief extraction peaks at 57.69% F1.
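To make the two reported metrics concrete, here is a minimal sketch of how an extraction F1 and a labeling accuracy could be computed. It assumes exact matching on normalized (actor, proposition) pairs and per-dimension label comparison; the paper's actual matching and aggregation procedure may differ, and both function names are hypothetical.

```python
def extraction_f1(predicted, gold):
    """F1 over (actor, proposition) pairs, using exact match after normalization.

    Assumption: exact string matching stands in for whatever matching
    procedure the benchmark actually uses.
    """
    def norm(pairs):
        return {(a.strip().lower(), p.strip().lower()) for a, p in pairs}

    pred_set, gold_set = norm(predicted), norm(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def labeling_accuracy(predicted, gold):
    """Fraction of gold label dimensions that the prediction matches exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must align one-to-one")
    total = correct = 0
    for pred_label, gold_label in zip(predicted, gold):
        for dimension, gold_value in gold_label.items():
            total += 1
            correct += int(pred_label.get(dimension) == gold_value)
    return correct / total if total else 0.0


# Hypothetical toy data: the model misses one nested belief.
gold_beliefs = [
    ("Sally", "The marble is in the basket"),
    ("Anne", "The marble is in the box"),
    ("Anne", "Sally thinks the marble is in the basket"),
]
predicted_beliefs = [
    ("Sally", "the marble is in the basket"),
    ("Anne", "the marble is in the box"),
]
f1 = extraction_f1(predicted_beliefs, gold_beliefs)

# Two of the seven dimensions shown for brevity.
gold_labels = [{"truth_status": "false", "recursive_order": "second-order"}]
predicted_labels = [{"truth_status": "false", "recursive_order": "first-order"}]
accuracy = labeling_accuracy(predicted_labels, gold_labels)

print(f"extraction F1 = {f1:.2f}, labeling accuracy = {accuracy:.2f}")
```

In this toy case the precision is perfect but one gold belief is missed, so recall drives the F1 down. The two metrics measure different things, so the gap between them is best read as a diagnostic signal rather than as a like-for-like comparison.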

Why It Matters

Endpoint QA can hide whether a model actually tracked who knew what, when they knew it, and how that information became part of an actor’s mental state. OmniToM makes those intermediate representations inspectable, giving researchers a sharper tool for diagnosing social reasoning failures in LLMs.