3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation

Yunhong He*,†, Zhengqing Yuan2,*,
Zhengzhong Tu3, Yanfang Ye2, Lichao Sun1
1Lehigh University, 2University of Notre Dame, 3Texas A&M University
*Equal contribution. †Yunhong He is an independent undergraduate student, working remotely with Lichao Sun.

Abstract

We present DreamLand, a frontend visualization framework that enables real-time, multimodal interaction with 4D (spatiotemporal) scenes. While recent advances in vision and language models have enabled rich 3D content generation, existing WebGL-based systems remain limited in dynamic scene rendering, temporal control, and user interaction. DreamLand addresses these limitations by integrating WebGL with Supersplat rendering, combining intuitive editing tools, temporal navigation, and camera control in a unified interface. Given a single image and a textual description, DreamLand synthesizes a 4D scene via 3D reconstruction, image-to-video generation, and temporal decomposition, allowing users to explore and manipulate the generated content in real time. The system design is modular, scalable, and adaptable, with applications in VR/AR, gaming, and scientific simulation. We demonstrate that DreamLand offers an effective and accessible solution for interactive 4D visualization, bridging static 3D generation and dynamic, user-driven spatiotemporal interaction.
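To make the temporal-decomposition idea concrete, the sketch below maps a continuous playback time onto a discrete per-frame scene asset. The function and the frame-file naming scheme are hypothetical illustrations, not part of DreamLand's actual implementation.

```python
# Hypothetical sketch: map a continuous playback time to the discrete
# per-frame asset produced by temporal decomposition. The file-naming
# scheme ("frame_0045.splat") is illustrative only.

def frame_asset_for_time(t: float, fps: float, num_frames: int) -> str:
    """Return the name of the scene asset covering playback time t (seconds)."""
    index = int(t * fps) % num_frames  # wrap around for looping playback
    return f"frame_{index:04d}.splat"

print(frame_asset_for_time(1.5, fps=30.0, num_frames=60))  # frame_0045.splat
```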

Demo Video



Framework


Illustration of the processing pipeline. The system comprises four main modules: 3D scene reconstruction, image-to-video synthesis, video-to-frame decomposition, and 4D scene generation. The resulting 4D content is passed to the frontend for real-time user interaction.
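The four modules above form a simple sequential pipeline. The stub below sketches that control flow only; in the real system each stage invokes a heavyweight reconstruction or generation model, and all function names here are hypothetical placeholders.

```python
# Hypothetical control-flow sketch of the four-module pipeline.
# Each stage is a lightweight placeholder for the corresponding model.

def reconstruct_3d(image: str) -> dict:
    """Module 1: 3D scene reconstruction from a single input image."""
    return {"source": image, "kind": "3d_scene"}

def image_to_video(scene: dict, prompt: str, num_frames: int) -> list:
    """Module 2: image-to-video synthesis conditioned on a text prompt."""
    return [{"scene": scene["source"], "prompt": prompt, "frame": i}
            for i in range(num_frames)]

def decompose_video(video: list) -> list:
    """Module 3: video-to-frame decomposition (per-frame extraction)."""
    return [f["frame"] for f in video]

def build_4d_scene(frames: list, fps: float) -> dict:
    """Module 4: 4D scene generation; attach the temporal axis."""
    return {"frames": frames, "fps": fps, "duration": len(frames) / fps}

scene4d = build_4d_scene(
    decompose_video(
        image_to_video(reconstruct_3d("input.png"),
                       "a windmill turning", num_frames=48)),
    fps=24.0,
)
print(scene4d["duration"])  # 2.0
```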

Frontend Workflow


Workflow of our proposed controllable 4D scene generation framework. The system takes a single 2D image and textual description as input and sequentially processes them through 3D generation, panoramic rendering, and dynamic video synthesis. The resulting video is temporally reconstructed into a continuous 4D scene, which is then transmitted to the frontend for real-time, user-driven interaction. Users can intuitively control playback speed, temporal position, and spatial viewpoint to explore and edit the 4D environment.
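The playback-speed and timeline controls described above amount to maintaining a small temporal state on the frontend. DreamLand's actual frontend runs in WebGL/JavaScript; the Python sketch below (with hypothetical class and field names) only illustrates the core state-update rule, including wrap-around looping and timeline scrubbing.

```python
# Hypothetical playback controller for the frontend's temporal controls.
# Illustrates the update logic only, not DreamLand's real implementation.

from dataclasses import dataclass

@dataclass
class PlaybackState:
    t: float = 0.0         # current temporal position, seconds
    speed: float = 1.0     # playback-speed multiplier (e.g. 0.5x, 2x)
    duration: float = 2.0  # total length of the 4D scene, seconds

    def step(self, dt: float) -> None:
        """Advance by wall-clock dt, scaled by speed; loop at the end."""
        self.t = (self.t + self.speed * dt) % self.duration

    def scrub(self, t: float) -> None:
        """Jump directly to a temporal position (timeline dragging)."""
        self.t = max(0.0, min(t, self.duration))

state = PlaybackState(speed=2.0)
state.step(0.5)           # 0.5 s of wall clock at 2x speed
print(round(state.t, 3))  # 1.0
state.step(0.75)          # 1.0 + 2.0 * 0.75 = 2.5 wraps past the end
print(round(state.t, 3))  # 0.5
```

Spatial viewpoint control is handled analogously by a camera state (position and orientation) updated from mouse or touch input each frame.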