DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
We introduce the Diverse Semantic Map (DSM) framework, a novel scene representation designed to enhance deep reasoning in 3D Visual Grounding tasks.
Problem
Existing methods for 3D Visual Grounding are often constrained: they either focus solely on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning.
Our Framework
The DSM framework enriches robust geometric models with a spectrum of VLM-derived semantics, including:
- Appearance attributes: color, patterns, texture
- Physical attributes: weight, material, surface properties
- Affordance attributes: functional aspects and operational methods
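As an illustration, a single map entry combining these attribute families might look like the following. This is a minimal sketch only; the field names and types are our assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DSMObject:
    """Hypothetical entry in a Diverse Semantic Map:
    geometry plus VLM-derived semantics (illustrative schema)."""
    object_id: int
    centroid: tuple                                  # 3D position (x, y, z) from the geometric model
    appearance: dict = field(default_factory=dict)   # e.g. {"color": "brown", "texture": "wooden"}
    physical: dict = field(default_factory=dict)     # e.g. {"material": "wood", "weight": "heavy"}
    affordance: list = field(default_factory=list)   # e.g. ["sit", "open"]

chair = DSMObject(
    object_id=7,
    centroid=(1.2, 0.4, 0.0),
    appearance={"color": "brown", "texture": "wooden"},
    physical={"material": "wood"},
    affordance=["sit"],
)
print(chair.affordance)  # -> ['sit']
```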
We construct the DSM online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model.
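The temporal fusion step can be sketched as below. This is a simplified stand-in under assumed interfaces: each per-frame observation is a dict of attribute labels per object ID, and labels are fused by majority vote within the window (the actual fusion strategy in DSM may differ).

```python
from collections import deque, Counter

class SlidingWindowFuser:
    """Sketch of temporal fusion: keep the last k per-frame observations
    and fuse attribute labels per object by majority vote (assumed strategy)."""

    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)  # oldest frames drop out automatically

    def add_frame(self, observations):
        # observations: {object_id: {"color": "red", ...}} from one viewpoint
        self.window.append(observations)

    def fused_map(self):
        votes = {}  # object_id -> attribute -> Counter of candidate labels
        for frame in self.window:
            for oid, attrs in frame.items():
                for key, label in attrs.items():
                    votes.setdefault(oid, {}).setdefault(key, Counter())[label] += 1
        # keep the most frequent label per attribute
        return {oid: {k: c.most_common(1)[0][0] for k, c in attrs.items()}
                for oid, attrs in votes.items()}

fuser = SlidingWindowFuser(window_size=3)
fuser.add_frame({1: {"color": "red"}})
fuser.add_frame({1: {"color": "maroon"}})
fuser.add_frame({1: {"color": "red"}})
print(fuser.fused_map())  # -> {1: {'color': 'red'}}
```

The `deque(maxlen=...)` gives the sliding window for free: appending a new frame silently evicts the oldest one.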
DSM-Grounding
Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability.
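To make the contrast with free-form VLM querying concrete, here is a toy sketch of structured grounding over such a map. We assume the referring expression has already been parsed into attribute constraints (that parsing would itself involve a VLM and is omitted here); grounding then reduces to matching constraints against map entries.

```python
def ground(dsm, constraints):
    """Return the IDs of map objects whose attributes satisfy every constraint.
    dsm: {object_id: {"category": ..., "color": ..., ...}}
    constraints: attribute-value pairs parsed from the query (assumed input)."""
    return [oid for oid, attrs in dsm.items()
            if all(attrs.get(k) == v for k, v in constraints.items())]

# Toy map with three objects (illustrative data, not from the paper)
dsm = {
    1: {"category": "chair", "color": "brown", "material": "wood"},
    2: {"category": "chair", "color": "black", "material": "metal"},
    3: {"category": "table", "color": "brown", "material": "wood"},
}

# Query: "the brown chair"
print(ground(dsm, {"category": "chair", "color": "brown"}))  # -> [1]
```

Because every step is an explicit attribute match rather than an opaque model call, a failed grounding can be traced to the exact constraint that eliminated each candidate, which is the interpretability benefit the paradigm targets.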
Results
- ScanRefer benchmark: 59.06% overall accuracy (IoU@0.5), surpassing prior methods by 10%
- Semantic segmentation: 67.93% F-mIoU, outperforming all baselines including privileged ones
- Successfully deployed on physical robots for navigation and grasping tasks
- Paper: arXiv:2504.08307
- Project Page: binicey.github.io/DSM