IEEE Robotics and Automation Letters (RA-L) 2026

DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

Qinghongbing Xie†1, Zijian Liang†1, Fuhao Li1, Long Zeng1*
1Tsinghua Shenzhen International Graduate School, Tsinghua University
†Equal contribution    *Corresponding author

Video

Abstract

Effective scene representation is critical to the reasoning ability required for 3D visual grounding. However, existing 3D visual grounding methods either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce a novel scene representation framework, the Diverse Semantic Map (DSM), which enriches a robust geometric map with a spectrum of VLM-derived semantics, including appearance, physical, and affordance attributes. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a comprehensive map of scene knowledge. Building on this foundation, we propose a new grounding paradigm, DSM-Grounding, which shifts grounding from free-form queries to a structured reasoning process in which the VLM operates over the semantics-rich map. Extensive evaluations validate our approach's superiority in both accuracy and interpretability. On ScanRefer, DSM-Grounding achieves a state-of-the-art overall accuracy of 59.06% at IoU@0.5, surpassing other methods by about 10%. In 3D semantic segmentation on Replica, our DSM attains an F-mIoU of 67.93%, outperforming all baselines. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.

Pipeline

DSM Framework Pipeline

Overview of the DSM framework. After receiving the user's query, the robot first collects time-continuous poses, depth images, and color images of the scene to build the DSM. Next, we extract visual and geometric information from each observation point and, in parallel, use a VLM to analyze object relations and semantic attributes, which are categorized into appearance, physical, and affordance attributes. Objects observed from multiple views are fused using a multimodal object fusion method in conjunction with the Geometry Sliding Window method for mapping. Finally, we identify candidates in the DSM based on object attributes and relations, and apply a multi-level observation method to precisely locate the target object. The framework also extends naturally to tasks such as robotic semantic navigation and semantic grasping.
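The multi-view fusion step above can be sketched as a simple loop that merges each frame's detections into a running object map by 3D overlap. This is a minimal illustration only: the function names, dict layout, and the IoU threshold are our own assumptions, not the paper's implementation, which additionally uses visual features and the sliding window of recent frames.

```python
def box_iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:           # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (vol(a) + vol(b) - inter)

def fuse_frame(map_objects, detections, iou_thresh=0.3):
    """Merge one frame's detections into the running object map.

    Each object is a dict {"box": 6-tuple, "attrs": {...}}. A detection that
    overlaps an existing map object above iou_thresh is merged (its VLM
    attributes are unioned into the map object); otherwise it starts a new
    map object. Threshold and schema are illustrative, not the paper's.
    """
    for det in detections:
        for obj in map_objects:
            if box_iou_3d(obj["box"], det["box"]) >= iou_thresh:
                obj["attrs"].update(det["attrs"])
                break
        else:
            map_objects.append(det)
    return map_objects
```

Calling `fuse_frame` once per incoming frame keeps the map incremental, which is what allows the DSM to be built online as the robot moves.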

DSM Example

DSM Scene from Replica

DSM Scene. Example of DSM Scene from Replica.

DSM Semantic Attributes. Example of DSM semantic attributes.

Attributes | Pillow | Stool
Appearance | A soft, square pillow with a floral design. | A small, rounded seat with a padded top, typically covered in a beige fabric. The design is simple yet stylish.
Physical | Filled with a soft material, providing compressibility and comfort. | Sturdy and stable, designed to support weight effectively. Lightweight for easy movement and positioning.
Affordance | Intended for support when sitting or lying down, enhancing comfort. | Serves as a seating option, footrest, or even a small table, making it a versatile piece of furniture.

DSM Relation. Example of a DSM relation between a target and an anchor object.

Object Type | Name | Spatial Rel. | Semantic Relation
Target | Pillow | close by | The pillow is an accessory placed on the sofa for comfort and support while sitting or lounging.
Anchor | Sofa | |
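The attribute and relation entries above can be read as a simple node/edge schema over the map. A minimal sketch follows; the class and field names are our illustration of the structure, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class DSMObject:
    """One node in the Diverse Semantic Map: a name plus VLM-derived attributes."""
    name: str
    appearance: str = ""   # what the object looks like
    physical: str = ""     # material / stability properties
    affordance: str = ""   # what the object can be used for

@dataclass
class DSMRelation:
    """One edge: a target-anchor pair with spatial and semantic relations."""
    target: str
    anchor: str
    spatial: str
    semantic: str

# The pillow/sofa example from the tables above:
pillow = DSMObject(
    "Pillow",
    appearance="A soft, square pillow with a floral design",
    physical="Filled with a soft material, providing compressibility and comfort",
    affordance="Intended for support when sitting or lying down",
)
rel = DSMRelation(
    target="Pillow", anchor="Sofa", spatial="close by",
    semantic="The pillow is an accessory placed on the sofa for comfort.",
)
```

Grounding then reduces to filtering nodes by attributes and following edges to anchors, which is what makes the reasoning process structured and interpretable.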

3D Semantic Segmentation

3D Semantic Segmentation on Replica Dataset.

Category | Method | mAcc (%) | F-mIoU (%)
Privileged | LSeg | 33.39 | 51.54
Privileged | OpenSeg | 41.19 | 53.74
Zero-shot | MaskCLIP | 4.53 | 0.94
Zero-shot | ConceptFusion + SAM | 31.53 | 38.70
Zero-shot | ConceptGraphs | 40.63 | 35.95
Zero-shot | Ours (Fixed) | 38.76 | 67.93
Zero-shot | Ours (w/ ATC) | 39.22 | 68.15
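F-mIoU in the table above is the frequency-weighted mean IoU: each class's IoU is weighted by how often that class appears in the ground truth. A minimal sketch of the standard definition (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def f_miou(conf):
    """Frequency-weighted mean IoU from a confusion matrix conf[gt_class, pred_class]."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                        # correctly labeled points per class
    gt = conf.sum(axis=1)                     # ground-truth points per class
    pred = conf.sum(axis=0)                   # predicted points per class
    iou = tp / np.maximum(gt + pred - tp, 1e-9)
    freq = gt / conf.sum()                    # weight each class by its frequency
    return float((freq * iou).sum())
```

Because of the frequency weighting, F-mIoU rewards accuracy on large, common classes more than plain mIoU does.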

3D Visual Grounding

ScanRefer. Comparisons of 3D visual grounding on ScanRefer dataset. Accuracy at 0.25 and 0.5 IoU thresholds.

Method | Venue | Supervision | LLMs | Unique (Acc@0.25 / Acc@0.5) | Multiple (Acc@0.25 / Acc@0.5) | Overall (Acc@0.25 / Acc@0.5)
ScanRefer | ECCV'20 | Fully | - | 67.60 / 46.20 | 32.10 / 21.30 | 39.00 / 26.10
Scene-Verse | ECCV'24 | Fully | - | 81.60 / 75.10 | 43.70 / 39.10 | 50.60 / 45.80
LIBA | AAAI'25 | Fully | - | 88.81 / 74.27 | 54.42 / 44.41 | 59.57 / 48.96
OpenScene | CVPR'23 | Fine-tuning | CLIP | 20.10 / 13.10 | 11.10 / 4.40 | 13.20 / 6.50
Chat-3D v2 | NeurIPS'24 | Fine-tuning | Vicuna1.5-7B | 61.20 / 57.60 | 25.20 / 22.60 | 35.90 / 30.40
Inst3D-LMM | CVPR'25 | Fine-tuning | Vicuna1.5-7B | 88.60 / 81.50 | 48.70 / 43.20 | 57.80 / 51.60
ConceptGraphs | ICRA'24 | Zero-Shot | GPT-4 | 16.50 / 10.32 | 9.57 / 7.69 | 13.28 / 9.31
ZSVG3D | CVPR'24 | Zero-Shot | GPT-4 turbo | 63.80 / 58.40 | 27.70 / 24.60 | 36.40 / 32.70
VLM-Grounder | CoRL'24 | Zero-Shot | GPT-4o | 66.00 / 29.80 | 48.30 / 33.50 | 51.60 / 32.80
SeeGround | CVPR'25 | Zero-Shot | Qwen2-VL-72B | 75.70 / 68.90 | 34.00 / 30.00 | 44.10 / 39.40
FreeQ-Graph | arXiv'25 | Zero-Shot | Qwen2-VL-72B | 83.10 / 79.40 | 50.16 / 39.13 | 56.13 / 49.41
Ours | - | Zero-Shot | GPT-4o-mini | 83.32 / 80.17 | 47.01 / 43.93 | 57.47 / 55.39
Ours | - | Zero-Shot | Qwen2.5-VL-72B | 85.71 / 85.71 | 56.98 / 53.65 | 61.56 / 59.06

Sr3D & Nr3D. Comparisons of 3D visual grounding on Sr3D and Nr3D. Top-1 accuracy using ground-truth boxes.

Method | Supervision | Nr3D (Overall / Easy / Hard / V-Dep. / V-Indep.) | Sr3D (Overall / Easy / Hard / V-Dep. / V-Indep.)
InstanceRefer | Fully | 38.80 / 46.00 / 31.80 / 34.50 / 41.90 | 48.00 / 51.10 / 40.50 / 45.80 / 48.10
LAR | Fully | 48.90 / 58.40 / 42.30 / 47.40 / 52.10 | 59.40 / 63.00 / 51.20 / 50.00 / 59.10
MVT | Fully | 59.50 / 67.40 / 52.70 / 59.10 / 60.30 | 64.50 / 66.90 / 58.80 / 58.40 / 58.40
ViL3DRel | Fully | 64.40 / 70.20 / 57.40 / 62.00 / 64.50 | 72.80 / 74.90 / 67.90 / 63.80 / 73.20
EDA | Fully | 52.10 / 58.20 / 46.10 / 50.20 / 53.10 | 68.10 / 70.30 / 62.90 / 54.10 / 68.70
3D-VisTA | Fully | 64.20 / 72.10 / 56.70 / 61.50 / 65.10 | 76.40 / 78.80 / 71.30 / 58.90 / 77.30
Scene-Verse | Fully | 64.90 / 72.50 / 57.80 / 56.90 / 67.90 | 77.50 / 80.10 / 71.60 / 62.80 / 78.20
ZSVG3D | Zero-Shot | 46.50 / 31.70 / 36.80 / 40.00 / 39.00 | - / - / - / - / -
VLM-Grounder | Zero-Shot | 48.00 / 55.20 / 39.50 / 45.80 / 49.40 | - / - / - / - / -
ConceptGraph | Zero-Shot | 38.20 / 39.40 / 32.60 / 42.10 / 38.70 | 43.60 / 44.30 / 41.90 / 38.40 / 49.70
SeeGround | Zero-Shot | 54.50 / 38.30 / 42.30 / 48.20 / 46.10 | 65.40 / 47.90 / 52.20 / 58.40 / 56.20
FreeQ-Graph | Zero-Shot | 61.80 / 61.40 / 57.80 / 60.90 / 67.10 | 70.90 / 79.30 / 63.90 / 64.10 / 76.50
Ours | Zero-Shot | 62.19 / 64.06 / 56.00 / 61.25 / 63.12 | 73.33 / 77.44 / 63.29 / 73.91 / 73.49

Robot Experiment

Robot Experiment Demo

Robot Demo. Navigation and grasping tasks in real-world and simulated environments.

Real-world Experiments. More complex scenarios are used to demonstrate the reliability of the DSM system.

Key Description | Goal / Res. | Time | Dist. / Pose
Navigation Task
Computer: electronic device w/ monitor, on flat desk. | Arrived | 10.2 s | 3.62 m
Toy: small, irregular shape, vibrant color, purple. | Arrived | 15.1 s | 5.42 m
Toolbox: plastic/cardboard, transparent, easy stacking. | Arrived | 17.5 s | 6.8 m
Bookshelf: rectangular, on blue desk, for organizing. | Arrived | 13.2 s | 4.98 m
Storage Bin: container for storage, sturdy construction. | Failed | - | 7.8 m
Grasping Task
Storage Box: lightweight, sturdy, for organizing spaces. | Succ. | 7.3 s | Upright
File Cabinet: for storing documents, near storage box. | Succ. | 8.2 s | Side
Organizer Case: straight edges, securely holds items. | Succ. | 6.2 s | Upright
Enclosure: can be opened/closed, protects contents. | Succ. | 7.5 s | Side
Monitor: stable device, displays digital content. | Failed | - | 30% Occ.

DSM Visualization (Replica)

Room0
Room1
Office0
Office2

DSM Visualization (AI2-THOR)

Plan10
Plan12
Plan14
Plan15

DSM-Grounding Quality Result


AI2-THOR FloorPlan10


Please find the wooden chair with a curved backrest that has a simple design.


Please locate the stove positioned on the counter, featuring multiple burners on top.


Please locate the box for organization and storage, on the countertop next to the stove.


Replica Room0


Please find the pillow for providing cushioning and support.


Please find the vase made from sturdy ceramic on the dining table.


Please locate the window above the desk organizer on the workspace.

BibTeX

@article{xie2025dsm,
  title     = {DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding},
  author    = {Xie, Qinghongbing and Liang, Zijian and Li, Fuhao and Zeng, Long},
  journal   = {arXiv preprint arXiv:2504.08307},
  year      = {2025},
}