Effective scene representation is critical for visual grounding and the reasoning it requires. However, existing 3D visual grounding methods either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM), a novel scene representation framework that enriches a robust geometric map with a spectrum of VLM-derived semantics, including appearance, physical, and affordance attributes. The DSM is constructed online by fusing multi-view observations within a temporal sliding window, creating a comprehensive map of scene knowledge. Building on this foundation, we propose a new grounding paradigm, DSM-Grounding, that shifts grounding from free-form queries to a structured reasoning process in which a VLM reasons over the semantically rich map. Extensive evaluations show that our approach improves both accuracy and interpretability. On ScanRefer, DSM-Grounding achieves a state-of-the-art overall accuracy of 59.06% at IoU 0.5 (Acc@0.5), surpassing other methods by 10%. In semantic segmentation on Replica, DSM attains 67.93% F-mIoU, outperforming all baselines. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.
Overview of the DSM framework. After receiving the user's query, the robot first collects time-continuous poses, depth images, and color images of the scene to build a DSM. Next, we extract visual and geometric information from each observation point. At the same time, we use a VLM to analyze the objects' relations and semantic attributes, which are categorized into Appearance, Physical, and Affordance Attributes. We fuse objects across multiple views using a multimodal object fusion method together with the Geometry Sliding Window method for mapping. Finally, we identify candidates in the DSM based on the attributes and relations of objects, and use the multi-level observation method to precisely locate the target object. Additionally, our method can be broadly applied to tasks such as robotic semantic navigation and semantic grasping.
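
The caption above does not spell out the fusion rule, so the following is only a rough sketch of how detections from successive views might be merged within a sliding window of recent frames. It assumes a simple nearest-centroid association; `Track`, `fuse`, `window`, and `max_dist` are hypothetical names and values, not DSM's multimodal object fusion or Geometry Sliding Window method.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    """A fused map object: running-mean centroid plus bookkeeping (hypothetical)."""
    label: str
    centroid: tuple
    n_obs: int
    last_seen: int


def fuse(tracks, detections, frame_idx, window=5, max_dist=0.3):
    """Greedily merge per-frame detections (label, xyz) into existing tracks.

    A detection joins the nearest same-label track that was seen within the
    last `window` frames and lies closer than `max_dist` metres; otherwise it
    starts a new track. Both thresholds are illustrative assumptions.
    """
    for label, point in detections:
        best, best_d = None, max_dist
        for t in tracks:
            if t.label != label or frame_idx - t.last_seen > window:
                continue  # wrong class, or track has left the sliding window
            d = math.dist(t.centroid, point)
            if d < best_d:
                best, best_d = t, d
        if best is None:
            tracks.append(Track(label, tuple(point), 1, frame_idx))
        else:
            n = best.n_obs
            # Update the centroid as a running mean over all observations.
            best.centroid = tuple(
                (c * n + p) / (n + 1) for c, p in zip(best.centroid, point)
            )
            best.n_obs = n + 1
            best.last_seen = frame_idx
    return tracks


# Usage: two nearby views of the same pillow merge; a much later view does not.
tracks = []
fuse(tracks, [("pillow", (1.0, 0.0, 0.5))], frame_idx=0)
fuse(tracks, [("pillow", (1.2, 0.0, 0.5))], frame_idx=1)
fuse(tracks, [("pillow", (1.1, 0.0, 0.5))], frame_idx=10)
```

Restricting association to recently seen tracks keeps the matching cost bounded as the map grows, which is one plausible motivation for a windowed design.
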
DSM Scene. Example of DSM Scene from Replica.
DSM Semantic Attributes. Example of DSM Semantic Attributes.
| Attributes | Pillow | Stool |
|---|---|---|
| Appearance | A soft, square pillow with a floral design | A small, rounded seat with a padded top, typically covered in a beige fabric. The design is simple yet stylish. |
| Physical | Filled with a soft material, providing compressibility and comfort | Sturdy and stable, designed to support weight effectively. Lightweight for easy movement and positioning. |
| Affordance | Intended for support when sitting or lying down, enhancing comfort | Serves as a seating option, footrest, or even a small table, making it a versatile piece of furniture. |
DSM Relation. Example of DSM Relation
| Object Type | Name | Spatial Rel. | Semantic Relation |
|---|---|---|---|
| Target | Pillow | close by | The pillow is an accessory placed on the sofa for comfort and support while sitting or lounging. |
| Anchor | Sofa | | |
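
The attribute and relation tables above suggest one record per object plus one record per target-anchor pair. The sketch below shows how such records could be stored and filtered to shortlist grounding candidates. The schema (`DSMObject`, `DSMRelation`) and the `find_candidates` helper are hypothetical, and plain keyword matching stands in for the VLM reasoning step; this is not the DSM implementation.

```python
from dataclasses import dataclass


@dataclass
class DSMObject:
    """One mapped object with the three attribute groups (hypothetical schema)."""
    name: str
    appearance: str
    physical: str
    affordance: str


@dataclass
class DSMRelation:
    """Relation from a target object to an anchor object (hypothetical schema)."""
    target: str
    anchor: str
    spatial: str
    semantic: str


def find_candidates(objects, relations, keywords, anchor=None):
    """Return names of objects whose attribute text mentions every keyword.

    If `anchor` is given, keep only objects related to that anchor.
    Keyword matching is a placeholder for VLM reasoning, not the paper's method.
    """
    related = {r.target for r in relations if r.anchor == anchor}
    hits = []
    for obj in objects:
        text = " ".join([obj.appearance, obj.physical, obj.affordance]).lower()
        if not all(k.lower() in text for k in keywords):
            continue
        if anchor is None or obj.name in related:
            hits.append(obj.name)
    return hits


objects = [
    DSMObject(
        "Pillow",
        "A soft, square pillow with a floral design",
        "Filled with a soft material, providing compressibility and comfort",
        "Intended for support when sitting or lying down, enhancing comfort",
    ),
    DSMObject(
        "Stool",
        "A small, rounded seat with a padded top",
        "Sturdy and stable, designed to support weight effectively",
        "Serves as a seating option, footrest, or even a small table",
    ),
]
relations = [
    DSMRelation("Pillow", "Sofa", "close by",
                "The pillow is an accessory placed on the sofa."),
]

all_support = find_candidates(objects, relations, ["support"])
near_sofa = find_candidates(objects, relations, ["support"], anchor="Sofa")
```

Here the attribute filter alone keeps both objects, and adding the anchor constraint narrows the shortlist to the pillow.
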
3D Semantic Segmentation on Replica Dataset.
| Category | Method | mAcc (%) | F-mIoU (%) |
|---|---|---|---|
| Privileged | LSeg | 33.39 | 51.54 |
| Privileged | OpenSeg | 41.19 | 53.74 |
| Zero-shot | MaskCLIP | 4.53 | 0.94 |
| Zero-shot | ConceptFusion + SAM | 31.53 | 38.70 |
| Zero-shot | ConceptGraphs | 40.63 | 35.95 |
| Zero-shot | Ours (Fixed) | 38.76 | 67.93 |
| Zero-shot | Ours (w/ ATC) | 39.22 | 68.15 |
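
For readers unfamiliar with the two metrics in the table, the sketch below computes them from a per-point confusion matrix. It assumes that mAcc is the mean of per-class accuracies and that F-mIoU denotes frequency-weighted mean IoU (each class IoU weighted by its share of ground-truth points); the page does not define either, and `seg_metrics` is a hypothetical helper rather than the evaluation script used for these results.

```python
def seg_metrics(conf):
    """Return (mAcc, F-mIoU) as fractions in [0, 1].

    conf[i][j] is the number of points with ground-truth class i
    that were predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    accs, fw_iou = [], 0.0
    for c in range(n):
        tp = conf[c][c]
        gt = sum(conf[c])                         # points labelled c
        pred = sum(conf[r][c] for r in range(n))  # points predicted c
        if gt == 0:
            continue  # class absent from the ground truth: skip it
        union = gt + pred - tp
        accs.append(tp / gt)
        fw_iou += (gt / total) * (tp / union)
    return sum(accs) / len(accs), fw_iou


# Usage on a toy two-class confusion matrix.
m_acc, f_miou = seg_metrics([[8, 2], [1, 9]])
```

Under this definition frequent classes dominate F-mIoU while every class counts equally in mAcc, so the two metrics can rank methods differently.
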
ScanRefer. Comparisons of 3D visual grounding on ScanRefer dataset. Accuracy at 0.25 and 0.5 IoU thresholds.
| Method | Venue | Supervision | LLMs | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ScanRefer | ECCV'20 | Fully | - | 67.60 | 46.20 | 32.10 | 21.30 | 39.00 | 26.10 |
| Scene-Verse | ECCV'24 | Fully | - | 81.60 | 75.10 | 43.70 | 39.10 | 50.60 | 45.80 |
| LIBA | AAAI'25 | Fully | - | 88.81 | 74.27 | 54.42 | 44.41 | 59.57 | 48.96 |
| OpenScene | CVPR'23 | Fine-tuning | CLIP | 20.10 | 13.10 | 11.10 | 4.40 | 13.20 | 6.50 |
| Chat-3D v2 | NeurIPS'24 | Fine-tuning | Vicuna1.5-7B | 61.20 | 57.60 | 25.20 | 22.60 | 35.90 | 30.40 |
| Inst3D-LMM | CVPR'25 | Fine-tuning | Vicuna1.5-7B | 88.60 | 81.50 | 48.70 | 43.20 | 57.80 | 51.60 |
| ConceptGraphs | ICRA'24 | Zero-Shot | GPT-4 | 16.50 | 10.32 | 9.57 | 7.69 | 13.28 | 9.31 |
| ZSVG3D | CVPR'24 | Zero-Shot | GPT-4 turbo | 63.80 | 58.40 | 27.70 | 24.60 | 36.40 | 32.70 |
| VLM-Grounder | CoRL'24 | Zero-Shot | GPT-4o | 66.00 | 29.80 | 48.30 | 33.50 | 51.60 | 32.80 |
| SeeGround | CVPR'25 | Zero-Shot | Qwen2-VL-72B | 75.70 | 68.90 | 34.00 | 30.00 | 44.10 | 39.40 |
| FreeQ-Graph | Arxiv'25 | Zero-Shot | Qwen2-VL-72B | 83.10 | 79.40 | 50.16 | 39.13 | 56.13 | 49.41 |
| Ours | - | Zero-Shot | GPT-4o-mini | 83.32 | 80.17 | 47.01 | 43.93 | 57.47 | 55.39 |
| Ours | - | Zero-Shot | Qwen2.5-VL-72B | 85.71 | 85.71 | 56.98 | 53.65 | 61.56 | 59.06 |
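
As a reminder of what Acc@0.25 and Acc@0.5 measure, the sketch below scores predicted boxes against ground-truth boxes by 3D IoU. It assumes axis-aligned boxes in `(xmin, ymin, zmin, xmax, ymax, zmax)` form and counts a prediction as correct when its IoU is at least the threshold; `iou_3d` and `acc_at_iou` are hypothetical helpers, not the benchmark's official evaluation code.

```python
def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def acc_at_iou(preds, gts, threshold):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Usage: one exact match, one partial overlap, one near miss.
gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
gts = [gt, gt, gt]
preds = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),
    (0.0, 0.0, 0.0, 2.0, 2.0, 0.8),
    (1.0, 1.0, 1.0, 3.0, 3.0, 3.0),
]
acc_25 = acc_at_iou(preds, gts, 0.25)
acc_50 = acc_at_iou(preds, gts, 0.5)
```

The partially overlapping box passes the looser threshold but fails the stricter one, which is why Acc@0.5 never exceeds Acc@0.25 in the table.
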
Sr3D & Nr3D. Comparisons of 3D visual grounding on Sr3D and Nr3D. Top-1 accuracy using ground-truth boxes.
| Method | Supervision | Nr3D Overall | Nr3D Easy | Nr3D Hard | Nr3D V-Dep. | Nr3D V-Indep. | Sr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D V-Dep. | Sr3D V-Indep. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InstanceRefer | Fully | 38.80 | 46.00 | 31.80 | 34.50 | 41.90 | 48.00 | 51.10 | 40.50 | 45.80 | 48.10 |
| LAR | Fully | 48.90 | 58.40 | 42.30 | 47.40 | 52.10 | 59.40 | 63.00 | 51.20 | 50.00 | 59.10 |
| MVT | Fully | 59.50 | 67.40 | 52.70 | 59.10 | 60.30 | 64.50 | 66.90 | 58.80 | 58.40 | 58.40 |
| ViL3DRel | Fully | 64.40 | 70.20 | 57.40 | 62.00 | 64.50 | 72.80 | 74.90 | 67.90 | 63.80 | 73.20 |
| EDA | Fully | 52.10 | 58.20 | 46.10 | 50.20 | 53.10 | 68.10 | 70.30 | 62.90 | 54.10 | 68.70 |
| 3D-VisTA | Fully | 64.20 | 72.10 | 56.70 | 61.50 | 65.10 | 76.40 | 78.80 | 71.30 | 58.90 | 77.30 |
| Scene-Verse | Fully | 64.90 | 72.50 | 57.80 | 56.90 | 67.90 | 77.50 | 80.10 | 71.60 | 62.80 | 78.20 |
| ZSVG3D | Zero-Shot | 46.50 | 31.70 | 36.80 | 40.00 | 39.00 | - | - | - | - | - |
| VLM-Grounder | Zero-Shot | 48.00 | 55.20 | 39.50 | 45.80 | 49.40 | - | - | - | - | - |
| ConceptGraph | Zero-Shot | 38.20 | 39.40 | 32.60 | 42.10 | 38.70 | 43.60 | 44.30 | 41.90 | 38.40 | 49.70 |
| SeeGround | Zero-Shot | 54.50 | 38.30 | 42.30 | 48.20 | 46.10 | 65.40 | 47.90 | 52.20 | 58.40 | 56.20 |
| FreeQ-Graph | Zero-Shot | 61.80 | 61.40 | 57.80 | 60.90 | 67.10 | 70.90 | 79.30 | 63.90 | 64.10 | 76.50 |
| Ours | Zero-Shot | 62.19 | 64.06 | 56.00 | 61.25 | 63.12 | 73.33 | 77.44 | 63.29 | 73.91 | 73.49 |
Robot Demo. Navigation and grasping tasks in real-world and simulated environments.
Real-world Experiments. More complex scenarios are used to demonstrate the reliability of the DSM system.
| Key Description | Goal / Result | Time | Distance / Pose |
|---|---|---|---|
| Navigation Task | | | |
| Computer: electronic device w/ monitor, on flat desk. | Arrived | 10.2s | 3.62m |
| Toy: small, irregular shape, vibrant color, purple. | Arrived | 15.1s | 5.42m |
| Toolbox: plastic/cardboard, transparent, easy stacking. | Arrived | 17.5s | 6.8m |
| Bookshelf: rectangular, on blue desk, for organizing. | Arrived | 13.2s | 4.98m |
| Storage Bin: container for storage, sturdy construction. | Failed | - | 7.8m |
| Grasping Task | | | |
| Storage Box: lightweight, sturdy, for organizing spaces. | Succ. | 7.3s | Upright |
| File Cabinet: for storing documents, near storage box. | Succ. | 8.2s | Side |
| Organizer Case: straight edges, securely holds items. | Succ. | 6.2s | Upright |
| Enclosure: can be opened/closed, protects contents. | Succ. | 7.5s | Side |
| Monitor: stable device, displays digital content. | Failed | - | 30% Occ. |

AI2-THOR FloorPlan10

Please find the wooden chair with a curved backrest that has a simple design.

Please locate the stove positioned on the counter, featuring multiple burners on top.

Please locate the box for organization and storage, on the countertop next to the stove.

Replica Room0

Please find the pillow for providing cushioning and support.

Please find the vase made from sturdy ceramic on the dining table.

Please locate the window above the desk organizer on the workspace.
@article{xie2025dsm,
title = {DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding},
author = {Xie, Qinghongbing and Liang, Zijian and Li, Fuhao and Zeng, Long},
journal = {arXiv preprint arXiv:2504.08307},
year = {2025},
}