DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

We introduce the Diverse Semantic Map (DSM) framework, a novel scene representation designed to enhance deep reasoning in 3D Visual Grounding tasks.

Problem

Existing methods for 3D Visual Grounding are often constrained: they either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning.

Our Framework

The DSM framework enriches robust geometric models with a spectrum of VLM-derived semantics, including:

  • Appearance attributes: color, patterns, texture
  • Physical attributes: weight, material, surface properties
  • Affordance attributes: functional aspects and operational methods
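To make the attribute taxonomy concrete, here is a minimal sketch of what a per-object DSM entry might look like. The field names and schema are illustrative assumptions, not the paper's actual data structure.

```python
from dataclasses import dataclass, field

@dataclass
class DSMObject:
    """Hypothetical per-object entry in a Diverse Semantic Map (schema assumed)."""
    object_id: int
    label: str                        # open-vocabulary class, e.g. "armchair"
    centroid: tuple                   # 3D position (x, y, z) from the geometric model
    appearance: dict = field(default_factory=dict)  # color, patterns, texture
    physical: dict = field(default_factory=dict)    # weight, material, surface properties
    affordance: list = field(default_factory=list)  # functional aspects / operations

chair = DSMObject(
    object_id=0,
    label="armchair",
    centroid=(1.2, 0.4, 0.3),
    appearance={"color": "red", "texture": "fabric"},
    physical={"material": "fabric", "weight": "heavy"},
    affordance=["sit on", "push"],
)
print(chair.appearance["color"])  # -> red
```

Keeping the three attribute families in separate fields lets downstream reasoning query each dimension independently.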

We construct the DSM online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model.
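A temporal sliding window can be sketched as follows: per-frame VLM predictions for an object are kept for the last K frames and fused into a stable attribute value. The window size and the majority-vote fusion rule are assumptions for illustration; the paper's actual fusion may differ.

```python
from collections import deque, Counter

WINDOW = 5  # assumed window size, frames

def fuse_attribute(window_obs):
    """Fuse noisy per-frame predictions by majority vote (illustrative rule)."""
    return Counter(window_obs).most_common(1)[0][0]

observations = deque(maxlen=WINDOW)  # oldest frame falls out automatically
for frame_color in ["red", "red", "maroon", "red", "red", "red"]:
    observations.append(frame_color)

print(fuse_attribute(observations))  # -> red
```

The bounded window keeps the map responsive to scene changes while smoothing out single-frame VLM errors.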

DSM-Grounding

Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantically rich map, markedly improving accuracy and interpretability.
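The contrast with free-form querying can be illustrated with a toy example: a referring expression is decomposed into attribute constraints that are matched against map entries, rather than handed to a VLM as raw text. The scene entries and constraint format below are hypothetical.

```python
# Toy semantic map (flat dicts stand in for full DSM entries).
scene = [
    {"id": 0, "label": "chair", "color": "red",  "material": "fabric"},
    {"id": 1, "label": "chair", "color": "blue", "material": "wood"},
    {"id": 2, "label": "table", "color": "red",  "material": "wood"},
]

def ground(scene, constraints):
    """Return ids of map entries satisfying every attribute constraint."""
    return [obj["id"] for obj in scene
            if all(obj.get(k) == v for k, v in constraints.items())]

# "the red fabric chair" -> {label: chair, color: red, material: fabric}
print(ground(scene, {"label": "chair", "color": "red", "material": "fabric"}))  # -> [0]
```

Because each constraint is checked explicitly against map attributes, the match is interpretable: a failed grounding can be traced to the specific constraint that eliminated every candidate.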

Results

  • ScanRefer benchmark: 59.06% overall accuracy (IoU@0.5), surpassing prior methods by 10%
  • Semantic segmentation: 67.93% F-mIoU, outperforming all baselines, including those with privileged information
  • Successfully deployed on physical robots for navigation and grasping tasks

  • Paper: arXiv:2504.08307
  • Project Page: binicey.github.io/DSM