WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Ziyue Zhu1 · Zhanqian Wu2 · Zhenxin Zhu2 · Lijun Zhou2 · Haiyang Sun2† · Bing Wang2 · Kun Ma2 · Guang Chen2 · Hangjun Ye2 · Jin Xie3✉ · Jian Yang1✉
1 Nankai University · 2 Xiaomi EV · 3 Nanjing University, Suzhou · ✉ Corresponding Author · † Project Leader
📄 arXiv 💻 Code 📑 BibTeX

Abstract

We propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach generates consistent multi-track videos through two key steps:
(i) We introduce a 4D-aware latent diffusion model that integrates multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner.
(ii) We then refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model.
WorldSplat thereby generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
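
At a high level, inference follows the two stages above: generate 4D Gaussians once, then render and refine arbitrary trajectories. The sketch below only illustrates that flow; the module names (`generator`, `renderer`, `enhancer`) and their methods are hypothetical placeholders, not a released WorldSplat API.

```python
def worldsplat_inference(generator, renderer, enhancer,
                         context_frames, conditions, novel_trajectory):
    """Two-stage sketch of the feed-forward pipeline (hypothetical interfaces)."""
    # Step (i): a 4D-aware latent diffusion model denoises a multi-modal latent
    # (RGB + depth + dynamics) conditioned on the inputs, and a feed-forward
    # decoder turns the denoised latent into pixel-aligned 4D Gaussians.
    latent = generator.denoise(context_frames, conditions)
    gaussians_4d = generator.decode_gaussians(latent)

    # Step (ii): render the Gaussians along the requested novel ego trajectory,
    # then refine the rendered clip with an enhanced video diffusion model to
    # recover detail and temporal consistency.
    coarse_video = renderer(gaussians_4d, novel_trajectory)  # e.g. (T, V, 3, H, W)
    return enhancer.refine(coarse_video)
```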

Motivation


Comparison of different driving world models. Previous driving world models focus on video generation, whereas our method directly creates controllable 4D Gaussians in a feed-forward manner, enabling the production of novel-view videos (e.g., shifting the ego trajectory by ±N m) with spatiotemporal consistency.

Method Overview


Overview of our framework: (1) a 4D-aware diffusion model generates a multi-modal latent containing RGB, depth, and dynamic information; (2) a feed-forward latent decoder predicts pixel-aligned 3D Gaussians from the denoised latent, from which 3D semantic Gaussians are then decoded; (3) the 3D Gaussians are aggregated with dynamic-static decomposition to form 4D Gaussians, and novel-view videos are rendered from them; (4) an enhanced diffusion model improves the spatial resolution and temporal consistency of the rendered videos. The two arrow styles in the figure denote the train-only and inference paths, respectively.
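
A minimal sketch of the pixel-aligned Gaussian prediction in step (2), assuming the decoder emits one Gaussian per pixel whose center comes from unprojecting the predicted depth; the feature-to-parameter split below (scales, rotation, opacity, color, dynamic flag) is illustrative, not the paper's exact parameterization.

```python
import torch


def pixel_aligned_gaussians(depth, feats, K, cam_to_world):
    """Unproject per-pixel depth into Gaussian centers and read the remaining
    parameters from per-pixel decoder features (illustrative layout).

    depth:        (H, W) predicted metric depth (float tensor)
    feats:        (H, W, C) per-pixel decoder features, C >= 12 here
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3)

    # Back-project: X_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ torch.linalg.inv(K).T
    xyz_cam = rays * depth.unsqueeze(-1)

    # Camera -> world
    xyz_h = torch.cat([xyz_cam, torch.ones(H, W, 1)], dim=-1)          # (H, W, 4)
    means = (xyz_h @ cam_to_world.T)[..., :3]

    return {
        "means":     means.reshape(-1, 3),
        "scales":    feats[..., 0:3].exp().reshape(-1, 3),             # positive scales
        "rotations": feats[..., 3:7].reshape(-1, 4),                   # quaternions (normalize before use)
        "opacities": feats[..., 7:8].sigmoid().reshape(-1, 1),
        "colors":    feats[..., 8:11].sigmoid().reshape(-1, 3),
        "dynamic":   feats[..., 11:12].sigmoid().reshape(-1, 1) > 0.5, # dynamic/static flag
    }
```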

Gaussian Visualization


Visualization of our generated 4D Gaussian representation, which serves as the basis for rendering novel-trajectory videos.

Video Demo

Video demo of generated driving videos under a 2 m left–right ego shift.
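
The ±2 m setting can be reproduced conceptually by translating each ego camera pose along its own lateral axis before rendering the 4D Gaussians. The sketch below only builds the shifted poses; it assumes an x-right camera convention and an external Gaussian rasterizer, neither of which is specified here.

```python
import numpy as np


def shift_trajectory(cam_to_world_seq: np.ndarray, lateral_offset_m: float = 2.0) -> np.ndarray:
    """Translate every camera pose sideways by a fixed number of meters.

    cam_to_world_seq: (T, 4, 4) camera-to-world poses over time.
    Assumes the first rotation column is the camera's right axis (x-right).
    """
    shifted = cam_to_world_seq.copy()
    right_axis = cam_to_world_seq[:, :3, 0]              # world-space right vector per pose
    shifted[:, :3, 3] += lateral_offset_m * right_axis   # move the camera center laterally
    return shifted

# Usage idea (hypothetical renderer): for each timestep t, rasterize the 4D
# Gaussians at time t from shift_trajectory(poses, +2.0)[t], and likewise for -2.0.
```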

Comparison with Other Driving World Models


Comparison with MagicDrive and Panacea. The top row shows real frames, the second row the corresponding sketches and bounding-box controls. Red boxes highlight areas where our method achieves the most notable improvements.


Video generation comparison on the nuScenes validation set, with green and blue highlighting the best and second-best values, respectively.

Comparison with Optimization-based Urban Reconstruction Models


Qualitative comparison of our novel-view synthesis against the state-of-the-art urban reconstruction method OmniRe. We translate the ego vehicle by ±2 m to generate the novel viewpoints. Red boxes indicate where our method achieves the greatest improvements.


Quantitative results of novel-view synthesis, reporting FID and FVD under viewpoint shifts of ±1, ±2, and ±4 m. Baseline metrics are taken from DiST-4D.
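
For context, FID between rendered and real frames can be computed with an off-the-shelf implementation; the snippet below uses torchmetrics' FrechetInceptionDistance with random tensors standing in for real and shifted-view frames. FVD would be computed analogously with a video feature extractor, which torchmetrics does not provide.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-in batches of real and rendered frames, shape (N, 3, H, W), values in [0, 1].
real_frames = torch.rand(32, 3, 256, 256)
rendered_frames = torch.rand(32, 3, 256, 256)

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # normalize=True accepts floats in [0, 1]
fid.update(real_frames, real=True)
fid.update(rendered_frames, real=False)
print(f"FID: {fid.compute().item():.2f}")
```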

Benefit on Downstream Driving Tasks


Applications of our generated data to downstream 3D object detection and BEV map segmentation.
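
As a rough illustration of this use case, generated clips can simply be mixed with real data when training the downstream model; the placeholder datasets below stand in for nuScenes samples and generated novel-trajectory samples and are not taken from the released code.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder datasets: in practice each item would carry multi-view frames plus
# 3D boxes / BEV maps, from nuScenes and from generated novel-trajectory clips.
real_data = TensorDataset(torch.rand(100, 3, 224, 224), torch.zeros(100, dtype=torch.long))
generated_data = TensorDataset(torch.rand(50, 3, 224, 224), torch.zeros(50, dtype=torch.long))

train_set = ConcatDataset([real_data, generated_data])   # real + synthetic mix
loader = DataLoader(train_set, batch_size=8, shuffle=True)

for images, labels in loader:
    pass  # forward/backward pass of the 3D detector or BEV segmentation model
```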

BibTeX

```bibtex
@misc{zhu2026worldsplat,
  title  = {WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving},
  author = {Ziyue Zhu and Zhanqian Wu and Zhenxin Zhu and Lijun Zhou and Haiyang Sun and Bing Wang and Kun Ma and Guang Chen and Hangjun Ye and Jin Xie and Jian Yang},
  year   = {2026},
  note   = {Under review at the International Conference on Learning Representations (ICLR) 2026},
  url    = {https://github.com/wm-research/worldsplat}
}
```