SpatialLM: Large Language Model for
Spatial Understanding

Manycore Research Team

SpatialLM reconstructs a 3D layout from a monocular RGB video using MASt3R-SLAM. Results are aligned to the video with ground-truth (GT) cameras for visualization.

Introduction

SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.

SpatialLM Pipeline

Given an RGB video, we use MASt3R-SLAM to reconstruct a dense 3D point cloud. SpatialLM then converts this point cloud into a structured representation: a point cloud encoder compresses it into compact features, and the LLM generates scene codes describing the scene, which are then converted into 3D structural layouts.
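The data flow above (point cloud → compact features → scene codes → structured layout) can be sketched as follows. This is a minimal illustration with stand-in functions; the function names, the scene-code strings, and the toy pooling step are assumptions for exposition, not SpatialLM's actual API.

```python
# Hypothetical sketch of the SpatialLM pipeline data flow; the real encoder
# and LLM are learned models with different interfaces.

def encode_point_cloud(points):
    # Stand-in encoder: pools N (x, y, z) points into one coarse feature
    # vector; the real encoder produces compact learned features.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]

def generate_scene_codes(features):
    # Stand-in for the LLM: emits textual "scene codes" describing the
    # scene (the strings below are illustrative, not SpatialLM's format).
    return ["wall_0=Wall(...)", "bbox_0=Bbox(sofa, ...)"]

def parse_layout(scene_codes):
    # Convert scene-code strings into structured layout records.
    return [{"id": c.split("=", 1)[0], "spec": c.split("=", 1)[1]}
            for c in scene_codes]

# Dense point cloud (here: a toy stand-in for MASt3R-SLAM output).
points = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0), (4.0, 1.0, 0.0)]
layout = parse_layout(generate_scene_codes(encode_point_cloud(points)))
```

The key design point is that the LLM's output is plain text, so the same generation machinery can target any structured layout vocabulary that a parser understands.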

Training Dataset

SpatialLM is trained on a large-scale, photo-realistic dataset. Walls and objects are realistically placed, accurately reflecting real-world scenarios and ensuring physical correctness.

Cross Platform

SpatialLM's prediction results are versatile and compatible across platforms. Outputs can be expressed in various formats, including structural layouts like 3D oriented bounding boxes, 2D floorplans, and industry-standard formats such as IFC (Industry Foundation Classes).
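As one concrete example of this cross-format versatility, a 3D oriented bounding box can be projected into a 2D floorplan footprint. The sketch below assumes an illustrative box representation (center, size, yaw about the vertical axis); it is not SpatialLM's output schema.

```python
import math
from dataclasses import dataclass

@dataclass
class OrientedBox:
    # Illustrative oriented bounding box (not SpatialLM's schema):
    # center (cx, cy), footprint size (w, d), yaw in radians about
    # the vertical axis.
    cx: float
    cy: float
    w: float
    d: float
    yaw: float

    def footprint(self):
        """Project the box onto the floor as a 2D polygon (4 corners, CCW)."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        corners = []
        for dx, dy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]:
            x, y = dx * self.w / 2, dy * self.d / 2
            # Rotate the local corner by yaw, then translate to the center.
            corners.append((self.cx + c * x - s * y,
                            self.cy + s * x + c * y))
        return corners

box = OrientedBox(cx=2.0, cy=3.0, w=1.0, d=2.0, yaw=math.pi / 2)
fp = box.footprint()
```

The same box parameters could equally be serialized to an IFC entity or rendered in 3D, which is what makes a single structured prediction reusable across platforms.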

Future Extension

Built on a state-of-the-art (SOTA) LLM and offering versatile output options, SpatialLM can be extended to more tasks in the future, such as interacting with humans as an intelligent assistant and empowering embodied agents to perform complex tasks in challenging environments.

BibTeX

@misc{spatiallm,
  title        = {SpatialLM: Large Language Model for Spatial Understanding},
  author       = {ManyCore Research Team},
  howpublished = {\url{https://github.com/manycore-research/SpatialLM}},
  year         = {2025}
}