Paper · High credibility · AAAI 2024 · Li et al. · January 1, 2024

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark

Our summary

Evaluates LLMs on the StepGame spatial-reasoning benchmark, finding that they map language to spatial relations reasonably well but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.

Why it matters

Pins spatial and geometric reasoning failures to multi-hop composition over relations — a concrete, reproducible weak spot.
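
To make "multi-hop composition over relations" concrete, here is a toy sketch of chaining relative positions to infer the relation between two objects that are never described together. It is an illustration only, not the StepGame data format or the paper's method; the relation names, the unit offsets, and the `compose` and `describe` helpers are all assumptions introduced for this example.

```python
# Hypothetical unit offsets: "X is left of Y" means X sits at Y + (-1, 0).
OFFSETS = {
    "left": (-1, 0),
    "right": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
}


def compose(chain):
    """Sum the offsets along a chain of relations and return the net (dx, dy)."""
    dx = sum(OFFSETS[relation][0] for relation in chain)
    dy = sum(OFFSETS[relation][1] for relation in chain)
    return dx, dy


def describe(dx, dy):
    """Turn a net offset back into a coarse relation label."""
    horizontal = "left" if dx < 0 else "right" if dx > 0 else ""
    vertical = "below" if dy < 0 else "above" if dy > 0 else ""
    return "-".join(part for part in (vertical, horizontal) if part) or "overlap"


# Three hops: A is left of B, B is above C, C is left of D.
# The question "where is A relative to D?" needs all three facts.
print(describe(*compose(["left", "above", "left"])))  # above-left
```

Each hop here is trivial on its own; the answer only emerges from composing all of them, which is the step where the summary above says models degrade.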

Related findings (1)

Spatial and geometric reasoning errors · Low
Models struggle with relative positions, rotations, and simple geometric/visual reasoning.

Reasoning failure · Evals · Benchmarks

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark. Retrieved from https://labs.qlarify.fi/references/spatial-reasoning-stepgame-2024