GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
We are proud to release GTR-Bench, a new benchmark for geo-temporal reasoning about moving targets in a large-scale camera network.
Motivation
Spatial-temporal intelligence in Vision-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI, and Artificial General Intelligence. Existing benchmarks mainly focus on egocentric-perspective reasoning, or on geographic-perspective reasoning over graphics (map) context alone; they fail to assess VLMs' geographic spatial-temporal intelligence over combined image/video and graphics context.
Key Challenges
GTR-Bench is more challenging as it requires:
- Multiple perspective switches between maps and videos
- Joint reasoning across multiple videos with non-overlapping fields of view
- Inference over spatial-temporal regions unobserved by any video context
Key Findings
Evaluations of more than 10 popular VLMs reveal three primary deficiencies:
- VLMs’ reasoning is impaired by an imbalanced utilization of spatial-temporal context
- VLMs are weak at temporal forecasting, performing worse on temporally emphasized tasks
- VLMs lack proficiency in comprehending or aligning map data with multi-view video inputs
Even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%).
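As a rough illustration of how accuracy figures like those above are typically obtained, here is a minimal sketch of a multiple-choice scoring loop. The record structure (`qid` keys mapping to single-letter answers) is a hypothetical example, not GTR-Bench's actual annotation schema.

```python
# Hypothetical scoring sketch: compare model predictions against
# ground-truth answers for multiple-choice questions and report accuracy.
# The data layout below is illustrative, not GTR-Bench's actual format.

def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Return the fraction of questions answered correctly (0.0 if empty)."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, answer in ground_truth.items()
        # Case-insensitive match; missing predictions count as wrong.
        if predictions.get(qid, "").strip().upper() == answer.upper()
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    gt = {"q1": "A", "q2": "C", "q3": "B"}
    preds = {"q1": "A", "q2": "B", "q3": "b"}
    print(f"accuracy = {score(preds, gt):.2%}")  # 2 of 3 correct
```

Per-category breakdowns (e.g. spatial- vs. temporal-emphasized tasks, as in the findings above) follow the same pattern with the questions partitioned by category before scoring.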
- Paper: arXiv:2510.07791
- Code & Data: GitHub