We propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps:
(i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner.
(ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model.
WorldSplat effectively generates high-fidelity, temporally and spatially consistent, multi-track novel-view driving videos.
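To make the two-stage flow concrete, here is a minimal inference sketch in PyTorch-style pseudocode. Every module and method name below (diffusion_4d, gaussian_decoder, renderer, video_enhancer) is a hypothetical placeholder chosen for illustration, not the released WorldSplat API, and the tensor shapes are assumptions.

```python
import torch

@torch.no_grad()
def generate_multitrack_video(context_frames, camera_poses, novel_trajectory,
                              diffusion_4d, gaussian_decoder, renderer, video_enhancer):
    """context_frames: (V, T, 3, H, W) multi-view input clip;
    camera_poses: per-frame intrinsics/extrinsics;
    novel_trajectory: target ego poses (e.g. shifted laterally)."""
    # Step (i): the 4D-aware diffusion model denoises a multi-modal latent
    # (RGB + depth + dynamics); a feed-forward decoder turns it into
    # pixel-aligned Gaussians aggregated into a 4D scene.
    latent = diffusion_4d.sample(context_frames, camera_poses)
    gaussians_4d = gaussian_decoder(latent, camera_poses)

    # Render the requested novel trajectory directly from the 4D Gaussians.
    raw_video = renderer(gaussians_4d, novel_trajectory)

    # Step (ii): an enhancement video diffusion model refines the spatial
    # resolution and temporal consistency of the rendered frames.
    return video_enhancer(raw_video)
```

The essential design point is that the Gaussians come from a single feed-forward pass, so no per-scene optimization is needed before rendering a new trajectory.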
Comparison of different driving world models. Previous driving world models focus on video generation, while our method directly creates controllable 4D Gaussians in a feed-forward manner, enabling the production of novel-view videos (e.g., shifting the ego trajectory by ±N m) with spatiotemporal consistency.
Overview of our framework: (1) Employing a 4D-aware diffusion model to generate a multi-modal latent containing RGB, depth, and dynamic information. (2) Predicting pixel-aligned 3D Gaussians from the denoised latent using our feed-forward latent decoder; 3D semantic Gaussians are then decoded as well. (3) Aggregating the 3D Gaussians with dynamic-static decomposition to form 4D Gaussians and rendering novel-view videos. (4) Improving the spatial resolution and temporal consistency of the rendered videos with an enhanced diffusion model. The two arrow styles denote the train-only and inference paths, respectively.
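The pixel-aligned part of step (2) can be illustrated with a short depth-unprojection sketch: each latent pixel yields one Gaussian whose center comes from lifting the predicted depth into world space. The function below is an illustrative assumption, not the actual decoder; the remaining Gaussian attributes and the dynamic-static split are only indicated in comments.

```python
import torch

def lift_pixels_to_3d(depth, K, c2w):
    """depth: (H, W) predicted depth; K: (3, 3) intrinsics;
    c2w: (4, 4) camera-to-world pose. Returns (H*W, 3) Gaussian centers."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)    # homogeneous pixel coords
    rays = pix @ torch.linalg.inv(K).T                        # camera-frame rays
    cam_pts = rays * depth[..., None]                         # scale rays by depth
    world_pts = cam_pts @ c2w[:3, :3].T + c2w[:3, 3]          # camera -> world
    return world_pts.reshape(-1, 3)

# Per-pixel opacity, scale, rotation, and color would come from the latent
# decoder; Gaussians flagged as dynamic keep a per-timestep state, while
# static ones are merged across frames to form the 4D scene.
```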
Visualization of our generated 4D Gaussian representation, which serves as the basis for rendering novel-trajectory videos.
For each example, the upper video shows the generated result, and the lower video shows the corresponding real footage.
Video demos of generated driving videos under a 2 m left–right ego shift, along with their corresponding 3D Gaussian representations.
For each example: 3D Gaussians visualization and the corresponding generated novel-view video.
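For readers curious how such a lateral shift is realized, the sketch below moves each ego camera pose along its own lateral axis before rendering the 4D Gaussians. The pose convention (camera-to-world matrices, lateral direction in rotation column 0) and the commented render call are assumptions for illustration.

```python
import torch

def shift_trajectory(c2w_poses, offset_m=2.0, lateral_axis=0):
    """c2w_poses: (T, 4, 4) camera-to-world matrices for the ego camera.
    Moves every pose by `offset_m` metres along its own lateral axis."""
    shifted = c2w_poses.clone()
    # Column `lateral_axis` of the rotation block is the camera's lateral
    # direction expressed in world coordinates.
    lateral_dir = c2w_poses[:, :3, lateral_axis]
    shifted[:, :3, 3] += offset_m * lateral_dir
    return shifted

# Hypothetical usage: render the same 4D Gaussians from shifted tracks.
# left_video  = renderer(gaussians_4d, shift_trajectory(ego_poses, -2.0))
# right_video = renderer(gaussians_4d, shift_trajectory(ego_poses, +2.0))
```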
Qualitative comparison of our novel-view synthesis against the state-of-the-art urban reconstruction method OmniRe. We translate the ego vehicle by ±2 m to generate the novel viewpoints. Red boxes indicate where our method achieves the greatest improvements.
Quantitative results of novel-view synthesis, reporting FID and FVD under viewpoint shifts of ±1, ±2, and ±4 meters. Baseline metrics are taken from DiST-4D.
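As a rough guide, the FID part of such an evaluation could be computed with torchmetrics as sketched below (FVD requires a separate video feature extractor and is omitted). This is a minimal sketch, not the exact evaluation pipeline used for the table, and it assumes frames are provided as uint8 image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_frames, generated_frames):
    """Both inputs: (N, 3, H, W) uint8 tensors in [0, 255]."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)        # accumulate real statistics
    fid.update(generated_frames, real=False)  # accumulate generated statistics
    return fid.compute()
```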
Comparison with MagicDrive and Panacea. The top row shows real frames; the second row shows the corresponding sketches and bounding-box controls. Red boxes highlight areas where our method achieves the most notable improvements.
Video generation comparison on the nuScenes validation set, with green and blue highlighting the best and second-best values, respectively.
Applications of our generated data to downstream 3D object detection and BEV map segmentation.