IEEE Robotics and Automation Letters (RA-L) 2026

DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

Qinghongbing Xie†1, Zijian Liang†1, Fuhao Li1, Long Zeng1*
1Tsinghua Shenzhen International Graduate School, Tsinghua University
†Equal contribution    *Corresponding author

Video

Abstract

Effective scene representation is critical to the reasoning ability required for 3D visual grounding. However, existing 3D visual grounding methods either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce a novel scene representation framework, the Diverse Semantic Map (DSM), which enriches a robust geometric map with a spectrum of VLM-derived semantics, including appearance, physical, and affordance attributes. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a comprehensive map of scene knowledge. Building on this foundation, we propose a new grounding paradigm, DSM-Grounding, which shifts grounding from free-form queries to a structured reasoning process in which the VLM operates over the semantics-rich map. Extensive evaluations validate our approach's superiority in both accuracy and interpretability. On ScanRefer, DSM-Grounding achieves a state-of-the-art overall accuracy of 59.06% at IoU@0.5, surpassing other methods by about 10%. In 3D semantic segmentation on Replica, our DSM attains an F-mIoU of 67.93%, outperforming all baselines. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.

Pipeline

DSM Framework Pipeline

Overview of the DSM framework. After receiving the user's query, the robot first collects time-continuous poses, depth images, and color images of the scene to build the DSM. Next, we extract visual and geometric information from each observation point and, in parallel, use a VLM to analyze object relations and semantic attributes, which are categorized into appearance, physical, and affordance attributes. Objects observed from multiple views are fused using a multimodal object fusion method in conjunction with the Geometry Sliding Window method for mapping. Finally, we identify candidates in the DSM based on object attributes and relations, and apply a multi-level observation method to precisely locate the target object. The framework also extends naturally to tasks such as robotic semantic navigation and semantic grasping.
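The multi-view fusion step above can be sketched as a simple loop that merges each frame's detections into a running object map by 3D overlap. This is a minimal illustration only: the function names, dict layout, and the IoU threshold are our own assumptions, not the paper's implementation, which additionally uses visual features and the sliding window of recent frames.

```python
def box_iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:           # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (vol(a) + vol(b) - inter)

def fuse_frame(map_objects, detections, iou_thresh=0.3):
    """Merge one frame's detections into the running object map.

    Each object is a dict {"box": 6-tuple, "attrs": {...}}. A detection that
    overlaps an existing map object above iou_thresh is merged (its VLM
    attributes are unioned into the map object); otherwise it starts a new
    map object. Threshold and schema are illustrative, not the paper's.
    """
    for det in detections:
        for obj in map_objects:
            if box_iou_3d(obj["box"], det["box"]) >= iou_thresh:
                obj["attrs"].update(det["attrs"])
                break
        else:
            map_objects.append(det)
    return map_objects
```

Calling `fuse_frame` once per incoming frame keeps the map incremental, which is what allows the DSM to be built online as the robot moves.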

DSM Example

DSM Scene from Replica

DSM Scene. Example of DSM Scene from Replica.

DSM Semantic Attributes. Example of DSM semantic attributes.

Attributes | Pillow | Stool
Appearance | A soft, square pillow with a floral design. | A small, rounded seat with a padded top, typically covered in a beige fabric. The design is simple yet stylish.
Physical | Filled with a soft material, providing compressibility and comfort. | Sturdy and stable, designed to support weight effectively. Lightweight for easy movement and positioning.
Affordance | Intended for support when sitting or lying down, enhancing comfort. | Serves as a seating option, footrest, or even a small table, making it a versatile piece of furniture.

DSM Relation. Example of a DSM relation between a target and an anchor object.

Object Type | Name | Spatial Rel. | Semantic Relation
Target | Pillow | close by | The pillow is an accessory placed on the sofa for comfort and support while sitting or lounging.
Anchor | Sofa | |
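The attribute and relation entries above can be read as a simple node/edge schema over the map. A minimal sketch follows; the class and field names are our illustration of the structure, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class DSMObject:
    """One node in the Diverse Semantic Map: a name plus VLM-derived attributes."""
    name: str
    appearance: str = ""   # what the object looks like
    physical: str = ""     # material / stability properties
    affordance: str = ""   # what the object can be used for

@dataclass
class DSMRelation:
    """One edge: a target-anchor pair with spatial and semantic relations."""
    target: str
    anchor: str
    spatial: str
    semantic: str

# The pillow/sofa example from the tables above:
pillow = DSMObject(
    "Pillow",
    appearance="A soft, square pillow with a floral design",
    physical="Filled with a soft material, providing compressibility and comfort",
    affordance="Intended for support when sitting or lying down",
)
rel = DSMRelation(
    target="Pillow", anchor="Sofa", spatial="close by",
    semantic="The pillow is an accessory placed on the sofa for comfort.",
)
```

Grounding then reduces to filtering nodes by attributes and following edges to anchors, which is what makes the reasoning process structured and interpretable.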

3D Semantic Segmentation

3D Semantic Segmentation on Replica Dataset.

Category | Method | mAcc (%) | F-mIoU (%)
Privileged | LSeg | 33.39 | 51.54
Privileged | OpenSeg | 41.19 | 53.74
Zero-shot | MaskCLIP | 4.53 | 0.94
Zero-shot | ConceptFusion + SAM | 31.53 | 38.70
Zero-shot | ConceptGraphs | 40.63 | 35.95
Zero-shot | Ours (Fixed) | 38.76 | 67.93
Zero-shot | Ours (w/ ATC) | 39.22 | 68.15
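F-mIoU in the table above is the frequency-weighted mean IoU: each class's IoU is weighted by how often that class appears in the ground truth. A minimal sketch of the standard definition (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def f_miou(conf):
    """Frequency-weighted mean IoU from a confusion matrix conf[gt_class, pred_class]."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                        # correctly labeled points per class
    gt = conf.sum(axis=1)                     # ground-truth points per class
    pred = conf.sum(axis=0)                   # predicted points per class
    iou = tp / np.maximum(gt + pred - tp, 1e-9)
    freq = gt / conf.sum()                    # weight each class by its frequency
    return float((freq * iou).sum())
```

Because of the frequency weighting, F-mIoU rewards accuracy on large, common classes more than plain mIoU does.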

3D Visual Grounding

ScanRefer. Comparisons of 3D visual grounding on ScanRefer dataset. Accuracy at 0.25 and 0.5 IoU thresholds.

Method | Venue | Supervision | LLMs | Unique (Acc@0.25 / Acc@0.5) | Multiple (Acc@0.25 / Acc@0.5) | Overall (Acc@0.25 / Acc@0.5)
ScanRefer | ECCV'20 | Fully | - | 67.60 / 46.20 | 32.10 / 21.30 | 39.00 / 26.10
Scene-Verse | ECCV'24 | Fully | - | 81.60 / 75.10 | 43.70 / 39.10 | 50.60 / 45.80
LIBA | AAAI'25 | Fully | - | 88.81 / 74.27 | 54.42 / 44.41 | 59.57 / 48.96
OpenScene | CVPR'23 | Fine-tuning | CLIP | 20.10 / 13.10 | 11.10 / 4.40 | 13.20 / 6.50
Chat-3D v2 | NeurIPS'24 | Fine-tuning | Vicuna1.5-7B | 61.20 / 57.60 | 25.20 / 22.60 | 35.90 / 30.40
Inst3D-LMM | CVPR'25 | Fine-tuning | Vicuna1.5-7B | 88.60 / 81.50 | 48.70 / 43.20 | 57.80 / 51.60
ConceptGraphs | ICRA'24 | Zero-Shot | GPT-4 | 16.50 / 10.32 | 9.57 / 7.69 | 13.28 / 9.31
ZSVG3D | CVPR'24 | Zero-Shot | GPT-4 turbo | 63.80 / 58.40 | 27.70 / 24.60 | 36.40 / 32.70
VLM-Grounder | CoRL'24 | Zero-Shot | GPT-4o | 66.00 / 29.80 | 48.30 / 33.50 | 51.60 / 32.80
SeeGround | CVPR'25 | Zero-Shot | Qwen2-VL-72B | 75.70 / 68.90 | 34.00 / 30.00 | 44.10 / 39.40
FreeQ-Graph | arXiv'25 | Zero-Shot | Qwen2-VL-72B | 83.10 / 79.40 | 50.16 / 39.13 | 56.13 / 49.41
Ours | - | Zero-Shot | GPT-4o-mini | 83.32 / 80.17 | 47.01 / 43.93 | 57.47 / 55.39
Ours | - | Zero-Shot | Qwen2.5-VL-72B | 85.71 / 85.71 | 56.98 / 53.65 | 61.56 / 59.06

Sr3D & Nr3D. Comparisons of 3D visual grounding on Sr3D and Nr3D. Top-1 accuracy using ground-truth boxes.

Method | Supervision | Nr3D (Overall / Easy / Hard / V-Dep. / V-Indep.) | Sr3D (Overall / Easy / Hard / V-Dep. / V-Indep.)
InstanceRefer | Fully | 38.80 / 46.00 / 31.80 / 34.50 / 41.90 | 48.00 / 51.10 / 40.50 / 45.80 / 48.10
LAR | Fully | 48.90 / 58.40 / 42.30 / 47.40 / 52.10 | 59.40 / 63.00 / 51.20 / 50.00 / 59.10
MVT | Fully | 59.50 / 67.40 / 52.70 / 59.10 / 60.30 | 64.50 / 66.90 / 58.80 / 58.40 / 58.40
ViL3DRel | Fully | 64.40 / 70.20 / 57.40 / 62.00 / 64.50 | 72.80 / 74.90 / 67.90 / 63.80 / 73.20
EDA | Fully | 52.10 / 58.20 / 46.10 / 50.20 / 53.10 | 68.10 / 70.30 / 62.90 / 54.10 / 68.70
3D-VisTA | Fully | 64.20 / 72.10 / 56.70 / 61.50 / 65.10 | 76.40 / 78.80 / 71.30 / 58.90 / 77.30
Scene-Verse | Fully | 64.90 / 72.50 / 57.80 / 56.90 / 67.90 | 77.50 / 80.10 / 71.60 / 62.80 / 78.20
ZSVG3D | Zero-Shot | 46.50 / 31.70 / 36.80 / 40.00 / 39.00 | - / - / - / - / -
VLM-Grounder | Zero-Shot | 48.00 / 55.20 / 39.50 / 45.80 / 49.40 | - / - / - / - / -
ConceptGraph | Zero-Shot | 38.20 / 39.40 / 32.60 / 42.10 / 38.70 | 43.60 / 44.30 / 41.90 / 38.40 / 49.70
SeeGround | Zero-Shot | 54.50 / 38.30 / 42.30 / 48.20 / 46.10 | 65.40 / 47.90 / 52.20 / 58.40 / 56.20
FreeQ-Graph | Zero-Shot | 61.80 / 61.40 / 57.80 / 60.90 / 67.10 | 70.90 / 79.30 / 63.90 / 64.10 / 76.50
Ours | Zero-Shot | 62.19 / 64.06 / 56.00 / 61.25 / 63.12 | 73.33 / 77.44 / 63.29 / 73.91 / 73.49

Robot Experiment

Robot Experiment Demo

Robot Demo. Navigation and grasping tasks in real-world and simulated environments.

Real-world Experiments. More complex scenarios are used to demonstrate the reliability of the DSM system.

Key Description | Goal / Res. | Time | Dist. / Pose
Navigation Task
Computer: electronic device w/ monitor, on flat desk. | Arrived | 10.2 s | 3.62 m
Toy: small, irregular shape, vibrant color, purple. | Arrived | 15.1 s | 5.42 m
Toolbox: plastic/cardboard, transparent, easy stacking. | Arrived | 17.5 s | 6.8 m
Bookshelf: rectangular, on blue desk, for organizing. | Arrived | 13.2 s | 4.98 m
Storage Bin: container for storage, sturdy construction. | Failed | - | 7.8 m
Grasping Task
Storage Box: lightweight, sturdy, for organizing spaces. | Succ. | 7.3 s | Upright
File Cabinet: for storing documents, near storage box. | Succ. | 8.2 s | Side
Organizer Case: straight edges, securely holds items. | Succ. | 6.2 s | Upright
Enclosure: can be opened/closed, protects contents. | Succ. | 7.5 s | Side
Monitor: stable device, displays digital content. | Failed | - | 30% Occ.

DSM Visualization (Replica)

Room0
Room1
Office0
Office2

DSM Visualization (AI2-THOR)

Plan10
Plan12
Plan14
Plan15

DSM-Grounding Quality Result


AI2-THOR FloorPlan10


Please find the wooden chair with a curved backrest that has a simple design.


Please locate the stove positioned on the counter, featuring multiple burners on top.


Please locate the box for organization and storage, on the countertop next to the stove.


Replica Room0


Please find the pillow for providing cushioning and support.


Please find the vase made from sturdy ceramic on the dining table.


Please locate the window above the desk organizer on the workspace.

BibTeX

@article{xie2025dsm,
  title     = {DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding},
  author    = {Xie, Qinghongbing and Liang, Zijian and Li, Fuhao and Zeng, Long},
  journal   = {arXiv preprint arXiv:2504.08307},
  year      = {2025},
}