TL;DR: Given a 3D semantic layout, SpatialGen can generate a photo-realistic 3D indoor scene conditioned on a text or image prompt.

Introduction

Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often struggle to balance visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset featuring 12,328 structured annotated scenes with 57,440 rooms and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi-view, multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantics (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. In our experiments, SpatialGen consistently outperforms previous methods. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.

SpatialGen Pipeline

SpatialGen is a multi-view, multi-modal diffusion model that generates view-consistent 3D indoor scenes from a semantic layout.
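
To make the pipeline concrete, below is a minimal PyTorch sketch of joint multi-view, multi-modal diffusion sampling. Everything in it -- the ToyDenoiser, latent shapes, channel counts, conditioning tensors, and noise schedule -- is an illustrative assumption, not the released SpatialGen architecture; the actual model synthesizes color, scene-coordinate, and semantic maps while keeping them consistent across views and modalities.

# Minimal sketch of joint multi-view, multi-modal diffusion sampling.
# All module names, shapes, and hyperparameters are illustrative
# assumptions, not the released SpatialGen implementation.
import torch
import torch.nn as nn

NUM_VIEWS, H, W = 8, 32, 32                # latent resolution per target view (assumed)
MODALITIES = ["color", "scene_coord", "semantic"]

class ToyDenoiser(nn.Module):
    """Stand-in for the multi-view, multi-modal denoising network (assumption)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, t, layout_cond, ref_cond):
        # A real model would attend across views and modalities;
        # here we only add the conditioning signals and convolve.
        return self.net(x + layout_cond + ref_cond)

@torch.no_grad()
def sample(denoiser, layout_cond, ref_cond, steps: int = 50):
    """Deterministic DDIM-style reverse diffusion over stacked modality latents."""
    c = 4 * len(MODALITIES)                # 4 latent channels per modality (assumed)
    x = torch.randn(NUM_VIEWS, c, H, W)    # one latent stack per target view
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((NUM_VIEWS,), i)
        eps = denoiser(x, t, layout_cond, ref_cond)          # predicted noise
        a_t = alphas_cum[i]
        a_prev = alphas_cum[i - 1] if i > 0 else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean latents
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    # Split the stack back into per-modality latents for decoding.
    return dict(zip(MODALITIES, x.chunk(len(MODALITIES), dim=1)))

if __name__ == "__main__":
    c = 4 * len(MODALITIES)
    denoiser = ToyDenoiser(c)
    layout_cond = torch.zeros(NUM_VIEWS, c, H, W)   # e.g. rendered semantic-layout maps
    ref_cond = torch.zeros(NUM_VIEWS, c, H, W)      # e.g. features from the reference image
    outputs = sample(denoiser, layout_cond, ref_cond)
    print({k: tuple(v.shape) for k, v in outputs.items()})

In this sketch, consistency comes simply from denoising all views and modalities together under the same layout and reference-image conditioning; the released model's mechanism for achieving this is described in the paper.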

SpatialGen Dataset

We introduce a large-scale synthetic dataset, the SpatialGen Dataset, featuring 12,328 structured annotated scenes with 57,440 rooms and 4.7M photorealistic 2D renderings. We create physically plausible camera trajectories that navigate smoothly through each scene while avoiding obstacles, and sample them at 0.5m intervals to ensure comprehensive spatial coverage. For each viewpoint, we generate photorealistic panoramic renderings using an industry-leading rendering engine, capturing color, depth, normal, semantic, and instance segmentation data.
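
For illustration, here is one way a per-viewpoint record and the 0.5m trajectory sampling could be represented in Python; the field names, dataclass, and file layout are assumptions rather than the released dataset schema.

# Illustrative sketch of a per-viewpoint record in the SpatialGen Dataset.
# Field names and file layout are assumptions; consult the released
# dataset for the actual schema.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PanoramaRecord:
    """All modalities rendered at a single trajectory viewpoint."""
    scene_id: str
    room_id: str
    position: tuple[float, float, float]   # camera position along the trajectory
    color: Path                            # photorealistic panorama
    depth: Path
    normal: Path
    semantic: Path                         # semantic segmentation map
    instance: Path                         # instance segmentation map

def trajectory_samples(start: float, end: float, spacing: float = 0.5):
    """Yield arc-length positions along a trajectory at a fixed spacing (0.5 m in the dataset)."""
    s = start
    while s <= end:
        yield s
        s += spacing

if __name__ == "__main__":
    # Hypothetical example: 3 m of trajectory sampled every 0.5 m.
    print(list(trajectory_samples(0.0, 3.0)))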

Text to Image to 3D Scene

Text to Image to 3D Scene (Style Variants)

Single Image to 3D Scene

BibTeX

@article{SpatialGen,
  title         = {SpatialGen: Layout-guided 3D Indoor Scene Generation},
  author        = {Fang, Chuan and Li, Heng and Liang, Yixu and Zheng, Jia and Mao, Yongsen and Liu, Yuan and Tang, Rui and Zhou, Zihan and Tan, Ping},
  journal       = {arXiv preprint},
  year          = {2025},
  eprint        = {2509.14981},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

Acknowledgements

This work was done during Chuan Fang's internship at Manycore Tech Inc. It was partially supported by the Key R&D Program of Zhejiang Province (2025C01001) and the HKUST project 24251090T019. We would like to thank the engineering team at Manycore Tech Inc. -- Yingqi Shen, Liangbin Hu, and Fuchun Dong -- for their exceptional efforts in building the large-scale SpatialGen Dataset. We are also grateful to Chenfeng Hou and Zhiwei Wang for developing the layout ControlNet. Additionally, we extend our thanks to Kunming Luo for his valuable suggestions regarding the paper figures.