GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
We are proud to release GTR-Bench, a new benchmark for geo-temporal reasoning about moving targets in a large-scale camera network.
Motivation
Spatial-temporal intelligence in Vision-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI, and Artificial General Intelligence. Existing benchmarks mainly focus on egocentric-perspective reasoning, or on geographic-perspective reasoning over graphics (map) context alone; they fail to assess VLMs' geographic spatial-temporal intelligence over combined image/video and graphics context.
Key Challenges
GTR-Bench is more challenging as it requires:
- Multiple perspective switches between maps and videos
- Joint reasoning across multiple videos with non-overlapping fields of view
- Inference over spatial-temporal regions unobserved by any video context
Key Findings
Evaluations of more than 10 popular VLMs reveal three primary deficiencies:
- VLMs’ reasoning is impaired by an imbalanced utilization of spatial-temporal context
- VLMs are weak at temporal forecasting, performing worse on temporally emphasized tasks
- VLMs lack proficiency in comprehending or aligning map data with multi-view video inputs
Even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%).
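As a rough illustration of how accuracy figures like those above are typically obtained, here is a minimal sketch of a multiple-choice scoring loop. The record structure (`qid` keys mapping to single-letter answers) is a hypothetical example, not GTR-Bench's actual annotation schema.

```python
# Hypothetical scoring sketch: compare model predictions against
# ground-truth answers for multiple-choice questions and report accuracy.
# The data layout below is illustrative, not GTR-Bench's actual format.

def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Return the fraction of questions answered correctly (0.0 if empty)."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, answer in ground_truth.items()
        # Case-insensitive match; missing predictions count as wrong.
        if predictions.get(qid, "").strip().upper() == answer.upper()
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    gt = {"q1": "A", "q2": "C", "q3": "B"}
    preds = {"q1": "A", "q2": "B", "q3": "b"}
    print(f"accuracy = {score(preds, gt):.2%}")  # 2 of 3 correct
```

Per-category breakdowns (e.g. spatial- vs. temporal-emphasized tasks, as in the findings above) follow the same pattern with the questions partitioned by category before scoring.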
- Paper: arXiv:2510.07791
- Code & Data: GitHub