DriveLaW:

Unifying Planning and Video Generation in a Latent Driving World

¹Huazhong University of Science & Technology  ²Xiaomi EV

*Equal contribution. †Project leader. ‡Corresponding author.

Abstract

World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: even ostensibly unified architectures still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By injecting the latent representation of its video generator directly into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, a powerful world model that generates high-fidelity future predictions with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latents of DriveLaW-Video; both components are optimized with a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results on both tasks: DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also sets a new record on the NAVSIM planning benchmark.


Motivation

VGM (Video Generation Model) latent features can serve as more efficient and informative conditions for action learning. We visualize and compare three types of latent representations: BEV (bird's-eye-view), VLM (vision-language model), and VGM features. For each, we apply PCA (Principal Component Analysis) to project the representation onto 3 principal components mapped to RGB channels, then upsample all maps to 1280×704. The visualization clearly shows that BEV and VLM features are diffuse, unstable, and exhibit irregular focus patterns. In contrast, VGM features are sharper, less noisy, and show superior semantic coherence with strong spatial-structure awareness, even under challenging driving conditions. This suggests that VGM features provide a more suitable representation for action learning in autonomous driving.
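For concreteness, here is a minimal sketch of the PCA-to-RGB visualization procedure described above, assuming the latent map has already been extracted as a [C, H, W] array; the function name and the bilinear upsampling choice are illustrative, not taken from the released code.

```python
# Minimal sketch of the PCA-to-RGB feature visualization described above.
# `features` is assumed to be a latent map of shape [C, H, W]; how the
# VGM/BEV/VLM features are extracted depends on the respective models.
import numpy as np
from sklearn.decomposition import PCA
from PIL import Image

def visualize_latents(features: np.ndarray, out_size=(1280, 704)) -> Image.Image:
    c, h, w = features.shape
    # Flatten spatial dims: each pixel becomes one C-dimensional sample.
    flat = features.reshape(c, h * w).T               # [H*W, C]
    # Project every pixel's feature vector onto 3 principal components.
    pcs = PCA(n_components=3).fit_transform(flat)     # [H*W, 3]
    # Normalize each component to [0, 1] and map it to an RGB channel.
    pcs -= pcs.min(axis=0)
    pcs /= pcs.max(axis=0) + 1e-8
    rgb = (pcs.reshape(h, w, 3) * 255).astype(np.uint8)
    # Upsample all maps to the common comparison resolution (1280x704).
    return Image.fromarray(rgb).resize(out_size, Image.BILINEAR)
```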

Framework

DriveLaW is a unified framework composed of DriveLaW-Video and DriveLaW-Act. The video model first encodes past driving frames with a spatiotemporal VAE and textual prompts with a text encoder. A stack of Video DiT blocks then performs latent-space denoising, and the VAE decoder reconstructs the future video. In parallel, action noise, ego status, and high-level commands are encoded and fed into the action model. Video latents from the Video DiT serve as conditioning signals that guide the Action DiT to output the final trajectory. The Video DiT and Action DiT are chained and trained to learn driving representations from large-scale video generation, providing a shared basis for planning.
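The sketch below outlines this two-branch forward pass under the assumptions above. All module names (VideoDiT, ActionDiT, traj_head) and call signatures are hypothetical stand-ins for the paper's components, not the released implementation; block counts, dimensions, and the exact conditioning mechanism follow the paper.

```python
# Schematic sketch of the DriveLaW forward pass described above.
# Module names and signatures are illustrative placeholders.
import torch
import torch.nn as nn

class DriveLaW(nn.Module):
    def __init__(self, vae, text_encoder, video_dit, action_dit, traj_head):
        super().__init__()
        self.vae = vae                    # spatiotemporal VAE
        self.text_encoder = text_encoder  # encodes textual prompts
        self.video_dit = video_dit        # DriveLaW-Video: latent denoiser
        self.action_dit = action_dit      # DriveLaW-Act: diffusion planner
        self.traj_head = traj_head        # maps action tokens to waypoints

    def forward(self, past_frames, prompt_ids, noisy_video, t_video,
                action_noise, ego_status, command, t_action):
        # Video branch: encode past frames and prompt, denoise future latents.
        ctx = self.vae.encode(past_frames)        # [B, T, C, h, w] latents
        txt = self.text_encoder(prompt_ids)       # prompt embeddings
        video_latents = self.video_dit(noisy_video, t_video, ctx, txt)

        # Action branch: denoise action tokens, conditioned on ego status,
        # the high-level command, and the Video DiT latents.
        cond = torch.cat([ego_status, command], dim=-1)
        act_tokens = self.action_dit(action_noise, t_action,
                                     context=video_latents, cond=cond)
        return self.traj_head(act_tokens)         # planned trajectory
```

At inference, the decoded video (via the VAE decoder) and the planned trajectory come from the same denoised latents, which is what keeps generation and planning consistent.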

Common Video Generation on nuScenes

[Video clips 1–4]

Rainy Video Generation on nuScenes

[Video clips 1–2]

Qualitative results on the Navtest benchmark

We present representative cases from the Navtest split, highlighting DriveLaW's ability to predict future trajectories while ensuring safety and smoothness.

Video Generation Results

Quantitative evaluation of video generation on the nuScenes validation set. Our method outperforms prior single-view state-of-the-art methods in generation quality.

Planning Results

Performance comparison on NAVSIM Navtest using closed-loop metrics. Methods are grouped by whether they employ an explicit world model: Traditional End-to-End Methods and World Model Methods. † denotes methods trained with the same flow-matching objective.

BibTeX

@article{xia2025drivelaw,
  title={DriveLaW: Unifying Planning and Video Generation in a Latent Driving World},
  author={Xia, Tianze and Li, Yongkang and Zhou, Lijun and Yao, Jingfeng and Xiong, Kaixin and Sun, Haiyang and Wang, Bing and Ma, Kun and Ye, Hangjun and Liu, Wenyu and others},
  journal={arXiv preprint arXiv:2512.23421},
  year={2025}
}