We propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps:
(i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner.
(ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model.
WorldSplat effectively generates high-fidelity, temporally and spatially consistent, multi-track novel-view driving videos.
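To make the two-stage flow concrete, here is a minimal inference sketch in PyTorch-style pseudocode. Every module and method name below (diffusion_4d, gaussian_decoder, renderer, video_enhancer) is a hypothetical placeholder chosen for illustration, not the released WorldSplat API, and the tensor shapes are assumptions.

```python
import torch

@torch.no_grad()
def generate_multitrack_video(context_frames, camera_poses, novel_trajectory,
                              diffusion_4d, gaussian_decoder, renderer, video_enhancer):
    """context_frames: (V, T, 3, H, W) multi-view input clip;
    camera_poses: per-frame intrinsics/extrinsics;
    novel_trajectory: target ego poses (e.g. shifted laterally)."""
    # Step (i): the 4D-aware diffusion model denoises a multi-modal latent
    # (RGB + depth + dynamics); a feed-forward decoder turns it into
    # pixel-aligned Gaussians aggregated into a 4D scene.
    latent = diffusion_4d.sample(context_frames, camera_poses)
    gaussians_4d = gaussian_decoder(latent, camera_poses)

    # Render the requested novel trajectory directly from the 4D Gaussians.
    raw_video = renderer(gaussians_4d, novel_trajectory)

    # Step (ii): an enhancement video diffusion model refines the spatial
    # resolution and temporal consistency of the rendered frames.
    return video_enhancer(raw_video)
```

The essential design point is that the Gaussians come from a single feed-forward pass, so no per-scene optimization is needed before rendering a new trajectory.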
Comparison of different driving world models. Previous driving world models focus on video generation, while our method directly creates controllable 4D Gaussians in a feed-forward manner, enabling the production of novel-view videos (e.g., shifting the ego trajectory by ±N m) with spatiotemporal consistency.
Overview of our framework: (1) Employing a 4D-aware diffusion model to generate a multi-modal latent containing RGB, depth, and dynamic information. (2) Predicting pixel-aligned 3D Gaussians from the denoised latent using our feed-forward latent decoder; 3D semantic Gaussians are then decoded as well. (3) Aggregating the 3D Gaussians with dynamic-static decomposition to form 4D Gaussians and rendering novel-view videos. (4) Improving the spatial resolution and temporal consistency of the rendered videos with an enhanced diffusion model. The two arrow styles denote the train-only and inference paths, respectively.
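The pixel-aligned part of step (2) can be illustrated with a short depth-unprojection sketch: each latent pixel yields one Gaussian whose center comes from lifting the predicted depth into world space. The function below is an illustrative assumption, not the actual decoder; the remaining Gaussian attributes and the dynamic-static split are only indicated in comments.

```python
import torch

def lift_pixels_to_3d(depth, K, c2w):
    """depth: (H, W) predicted depth; K: (3, 3) intrinsics;
    c2w: (4, 4) camera-to-world pose. Returns (H*W, 3) Gaussian centers."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)    # homogeneous pixel coords
    rays = pix @ torch.linalg.inv(K).T                        # camera-frame rays
    cam_pts = rays * depth[..., None]                         # scale rays by depth
    world_pts = cam_pts @ c2w[:3, :3].T + c2w[:3, 3]          # camera -> world
    return world_pts.reshape(-1, 3)

# Per-pixel opacity, scale, rotation, and color would come from the latent
# decoder; Gaussians flagged as dynamic keep a per-timestep state, while
# static ones are merged across frames to form the 4D scene.
```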
Visualization of our generated 4D Gaussian representation, which serves as the basis for rendering novel-trajectory videos.
For each example, the upper video shows the generated result, and the lower video shows the corresponding real footage.
Video demos of generated driving videos under a 2 m left–right ego shift, along with their corresponding 3D Gaussian representations.
For each example: 3D Gaussians visualization and the corresponding generated novel-view video.
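For readers curious how such a lateral shift is realized, the sketch below moves each ego camera pose along its own lateral axis before rendering the 4D Gaussians. The pose convention (camera-to-world matrices, lateral direction in rotation column 0) and the commented render call are assumptions for illustration.

```python
import torch

def shift_trajectory(c2w_poses, offset_m=2.0, lateral_axis=0):
    """c2w_poses: (T, 4, 4) camera-to-world matrices for the ego camera.
    Moves every pose by `offset_m` metres along its own lateral axis."""
    shifted = c2w_poses.clone()
    # Column `lateral_axis` of the rotation block is the camera's lateral
    # direction expressed in world coordinates.
    lateral_dir = c2w_poses[:, :3, lateral_axis]
    shifted[:, :3, 3] += offset_m * lateral_dir
    return shifted

# Hypothetical usage: render the same 4D Gaussians from shifted tracks.
# left_video  = renderer(gaussians_4d, shift_trajectory(ego_poses, -2.0))
# right_video = renderer(gaussians_4d, shift_trajectory(ego_poses, +2.0))
```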
Qualitative comparison of our novel-view synthesis against the state-of-the-art urban reconstruction method OmniRe. We translate the ego vehicle by ±2 m to generate the novel viewpoints. Red boxes indicate where our method achieves the greatest improvements.
Quantitative results of novel-view synthesis, reporting FID and FVD under viewpoint shifts of ±1, ±2, and ±4 meters. Baseline metrics are taken from DiST-4D.
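As a rough guide, the FID part of such an evaluation could be computed with torchmetrics as sketched below (FVD requires a separate video feature extractor and is omitted). This is a minimal sketch, not the exact evaluation pipeline used for the table, and it assumes frames are provided as uint8 image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_frames, generated_frames):
    """Both inputs: (N, 3, H, W) uint8 tensors in [0, 255]."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)        # accumulate real statistics
    fid.update(generated_frames, real=False)  # accumulate generated statistics
    return fid.compute()
```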
Comparison with MagicDrive and Panacea. The top row shows real frames; the second row shows the corresponding sketches and bounding-box controls. Red boxes highlight areas where our method achieves the most notable improvements.
Video generation comparison on the nuScenes validation set, with green and blue highlighting the best and second-best values, respectively.
Applications of our generated data to downstream 3D object detection and BEV map segmentation.