We propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach generates consistent multi-track videos through two key steps:
(i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner.
(ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model.
WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
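To make the "pixel-aligned Gaussians" idea in step (i) concrete, the following is a minimal, hypothetical PyTorch sketch: each latent pixel is back-projected along its camera ray using a predicted depth and assigned Gaussian attributes. The function name, tensor shapes, and the placeholder scale/rotation/opacity values are illustrative assumptions, not the released WorldSplat code.

import torch

def unproject_pixel_gaussians(depth, colors, intrinsics, cam_to_world):
    """Back-project every pixel to a 3D Gaussian centre (illustrative sketch only).

    depth:        (H, W)    predicted metric depth per pixel
    colors:       (H, W, 3) per-pixel colour attribute
    intrinsics:   (3, 3)    camera matrix K
    cam_to_world: (4, 4)    camera pose
    Returns one Gaussian per pixel as a dict of flat tensors.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # homogeneous pixel coords
    rays = pix @ torch.linalg.inv(intrinsics).T                    # K^-1 [u, v, 1]^T per pixel
    pts_cam = rays * depth.unsqueeze(-1)                           # scale each ray by its depth
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)      # homogeneous camera-frame points
    means_world = (pts_h @ cam_to_world.T)[..., :3]                # transform to the world frame

    N = H * W
    return {
        "means": means_world.reshape(N, 3),
        "colors": colors.reshape(N, 3),
        # Placeholder attributes; in the paper a feed-forward latent decoder
        # predicts the Gaussian parameters from the denoised multi-modal latent.
        "scales": torch.full((N, 3), 0.05),
        "rotations": torch.tensor([1.0, 0.0, 0.0, 0.0]).repeat(N, 1),  # identity quaternion
        "opacities": torch.full((N, 1), 0.9),
    }

# Toy usage with random inputs (a single 360x640 camera frame).
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 180.0], [0.0, 0.0, 1.0]])
gaussians = unproject_pixel_gaussians(torch.rand(360, 640) * 50.0,
                                      torch.rand(360, 640, 3), K, torch.eye(4))
print(gaussians["means"].shape)  # torch.Size([230400, 3])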
Comparison of different driving world models. Previous driving world models focus on video generation, while our method directly creates controllable 4D Gaussians in a feed-forward manner, enabling the production of novel-view videos (e.g., shifting the ego trajectory by ±N m) with spatiotemporal consistency.
Overview of our framework: (1) Employing a 4D-aware diffusion model to generate a multi-modal latent containing RGB, depth, and dynamic information. (2) Predicting pixel-aligned 3D Gaussians from the denoised latent with our feed-forward latent decoder; 3D semantic Gaussians are then decoded. (3) Aggregating the 3D Gaussians with dynamic-static decomposition to form 4D Gaussians and rendering novel-view videos. (4) Improving the spatial resolution and temporal consistency of the rendered videos with an enhanced diffusion model. The two arrow styles denote the train-only and inference paths, respectively.
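A minimal sketch of step (3) under assumed data structures: static Gaussians are shared across all frames, dynamic Gaussians carry a leading time axis, and assembling the scene for a frame t reduces to a concatenation before splatting. Field names and the rasterizer hand-off are illustrative assumptions, not the released implementation.

import torch

def gaussians_at_time(static_g, dynamic_g, t):
    """Assemble the 4D scene into a flat 3D Gaussian set for timestep t.

    static_g:  dict of tensors shared by all frames, e.g. {"means": (Ns, 3), ...}
    dynamic_g: dict of tensors with a leading time axis, e.g. {"means": (T, Nd, 3), ...}
    Returns (Ns + Nd) Gaussians ready to be splatted from any camera pose at frame t.
    """
    return {key: torch.cat([static_g[key], dynamic_g[key][t]], dim=0) for key in static_g}

# Toy usage: 1000 static background Gaussians plus 200 dynamic ones over 8 frames.
T, Ns, Nd = 8, 1000, 200
static_g  = {"means": torch.randn(Ns, 3), "colors": torch.rand(Ns, 3)}
dynamic_g = {"means": torch.randn(T, Nd, 3), "colors": torch.rand(T, Nd, 3)}
frame_3 = gaussians_at_time(static_g, dynamic_g, t=3)
print(frame_3["means"].shape)  # torch.Size([1200, 3])

In the full pipeline, the assembled set would then be rendered with a differentiable 3D Gaussian splatting rasterizer for each novel camera pose before the enhancement stage in (4).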
Visualization of our generated 4D Gaussian representation, which serves as the basis for rendering novel-trajectory videos.
Video demo of generated driving videos under a 2 m left–right ego shift.
Comparison with MagicDrive and Panacea. The top row shows real frames, the second row the corresponding sketches and bounding-box controls. Red boxes highlight areas where our method achieves the most notable improvements.
Video generation comparison on the nuScenes validation set, with green and blue highlighting the best and second-best values, respectively.
Qualitative comparison of our novel-view synthesis against the state-of-the-art urban reconstruction method OmniRe. We translate the ego vehicle by ±2 m to generate the novel viewpoints. Red boxes indicate where our method achieves the greatest improvements.
Quantitative results of novel-view synthesis, reporting FID and FVD under viewpoint shifts of ±1, ±2, and ±4 meters. Baseline metrics are taken from DiST-4D.
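For readers reproducing the table, here is a hedged sketch of how FID on rendered novel-view frames could be computed with the off-the-shelf torchmetrics implementation; it illustrates the metric only, not the authors' exact evaluation script, and the tensors below are random placeholders for nuScenes frames and their shifted-view renderings.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated frames.
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # normalize=True expects floats in [0, 1]

# Placeholders; in practice: ground-truth nuScenes validation frames vs. frames
# rendered under a lateral ego shift (e.g. +/-2 m), accumulated over the whole set.
real_frames = torch.rand(16, 3, 256, 448)
fake_frames = torch.rand(16, 3, 256, 448)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print(f"FID: {fid.compute().item():.2f}")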
Applications of our generated data to downstream 3D object detection and BEV map segmentation.
@misc{zhu2026worldsplat,
  title  = {WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving},
  author = {Ziyue Zhu and Zhanqian Wu and Zhenxin Zhu and Lijun Zhou and Haiyang Sun and Bing Wang and Kun Ma and Guang Chen and Hangjun Ye and Jin Xie and Jian Yang},
  year   = {2026},
  note   = {Under review at the International Conference on Learning Representations (ICLR) 2026},
  url    = {https://github.com/wm-research/worldsplat}
}