DriveLaW:

Unifying Planning and Video Generation in a Latent Driving World

¹Huazhong University of Science & Technology  ²Xiaomi EV

*Equal contribution. †Project leader. ‡Corresponding author.

Abstract

World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: even ostensibly unified architectures still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By injecting the latent representation of its video generator directly into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, a powerful world model that generates high-fidelity future predictions with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latents of DriveLaW-Video; both components are optimized with a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results on both tasks: DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also sets a new record on the NAVSIM planning benchmark.


Motivation

VGM (Video Generation Model) latent features can serve as more efficient and informative conditions for action learning. We visualize and compare three types of latent representations: BEV (bird's-eye-view), VLM (vision-language model), and VGM features. For each, we apply PCA (Principal Component Analysis) to project the representation onto 3 principal components mapped to RGB channels, then upsample all maps to 1280×704. The visualization clearly shows that BEV and VLM features are diffuse, unstable, and exhibit irregular focus patterns. In contrast, VGM features are sharper, less noisy, and show superior semantic coherence with strong spatial-structure awareness, even under challenging driving conditions. This suggests that VGM features provide a more suitable representation for action learning in autonomous driving.
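For concreteness, here is a minimal sketch of the PCA-to-RGB visualization procedure described above, assuming the latent map has already been extracted as a [C, H, W] array; the function name and the bilinear upsampling choice are illustrative, not taken from the released code.

```python
# Minimal sketch of the PCA-to-RGB feature visualization described above.
# `features` is assumed to be a latent map of shape [C, H, W]; how the
# VGM/BEV/VLM features are extracted depends on the respective models.
import numpy as np
from sklearn.decomposition import PCA
from PIL import Image

def visualize_latents(features: np.ndarray, out_size=(1280, 704)) -> Image.Image:
    c, h, w = features.shape
    # Flatten spatial dims: each pixel becomes one C-dimensional sample.
    flat = features.reshape(c, h * w).T               # [H*W, C]
    # Project every pixel's feature vector onto 3 principal components.
    pcs = PCA(n_components=3).fit_transform(flat)     # [H*W, 3]
    # Normalize each component to [0, 1] and map it to an RGB channel.
    pcs -= pcs.min(axis=0)
    pcs /= pcs.max(axis=0) + 1e-8
    rgb = (pcs.reshape(h, w, 3) * 255).astype(np.uint8)
    # Upsample all maps to the common comparison resolution (1280x704).
    return Image.fromarray(rgb).resize(out_size, Image.BILINEAR)
```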

Framework

DriveLaW is a unified framework composed of DriveLaW-Video and DriveLaW-Act. The video model first encodes past driving frames with a spatiotemporal VAE and textual prompts with a text encoder. A stack of Video DiT blocks then performs latent-space denoising, and the VAE decoder reconstructs the future video. In parallel, action noise, ego status, and high-level commands are encoded and fed into the action model. Video latents from the Video DiT serve as conditioning signals that guide the Action DiT to output the final trajectory. The Video DiT and Action DiT are chained and trained to learn driving representations from large-scale video generation, providing a shared basis for planning.
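The sketch below outlines this two-branch forward pass under the assumptions above. All module names (VideoDiT, ActionDiT, traj_head) and call signatures are hypothetical stand-ins for the paper's components, not the released implementation; block counts, dimensions, and the exact conditioning mechanism follow the paper.

```python
# Schematic sketch of the DriveLaW forward pass described above.
# Module names and signatures are illustrative placeholders.
import torch
import torch.nn as nn

class DriveLaW(nn.Module):
    def __init__(self, vae, text_encoder, video_dit, action_dit, traj_head):
        super().__init__()
        self.vae = vae                    # spatiotemporal VAE
        self.text_encoder = text_encoder  # encodes textual prompts
        self.video_dit = video_dit        # DriveLaW-Video: latent denoiser
        self.action_dit = action_dit      # DriveLaW-Act: diffusion planner
        self.traj_head = traj_head        # maps action tokens to waypoints

    def forward(self, past_frames, prompt_ids, noisy_video, t_video,
                action_noise, ego_status, command, t_action):
        # Video branch: encode past frames and prompt, denoise future latents.
        ctx = self.vae.encode(past_frames)        # [B, T, C, h, w] latents
        txt = self.text_encoder(prompt_ids)       # prompt embeddings
        video_latents = self.video_dit(noisy_video, t_video, ctx, txt)

        # Action branch: denoise action tokens, conditioned on ego status,
        # the high-level command, and the Video DiT latents.
        cond = torch.cat([ego_status, command], dim=-1)
        act_tokens = self.action_dit(action_noise, t_action,
                                     context=video_latents, cond=cond)
        return self.traj_head(act_tokens)         # planned trajectory
```

At inference, the decoded video (via the VAE decoder) and the planned trajectory come from the same denoised latents, which is what keeps generation and planning consistent.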

Common Video Generation on nuScenes

[Video clips 1–4]

Rainy Video Generation on nuScenes

[Video clips 1–2]

Qualitative results on the Navtest benchmark

We present representative cases from the Navtest split, highlighting DriveLaW's ability to predict future trajectories while ensuring safety and smoothness.

Video Generation Results

Quantitative evaluation of video generation on the nuScenes validation set. Our method outperforms prior single-view state-of-the-art methods in generation quality.

Planning Results

Performance comparison on NAVSIM Navtest using closed-loop metrics. Methods are grouped by whether they employ an explicit world model: Traditional End-to-End Methods and World Model Methods. † denotes methods trained with the same flow-matching objective.

BibTeX

@article{xia2025drivelaw,
  title={DriveLaW: Unifying Planning and Video Generation in a Latent Driving World},
  author={Xia, Tianze and Li, Yongkang and Zhou, Lijun and Yao, Jingfeng and Xiong, Kaixin and Sun, Haiyang and Wang, Bing and Ma, Kun and Ye, Hangjun and Liu, Wenyu and others},
  journal={arXiv preprint arXiv:2512.23421},
  year={2025}
}