Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

1Peking University 2Xiaomi EV 3Huazhong University of Science & Technology

*Equal Contribution. Intern at Xiaomi EV. Project Leader. Corresponding Author.

Abstract

Recent advances in driving world models enable controllable generation of high-quality RGB or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability, but they often overlook evaluation on downstream perception tasks, which is what ultimately matters for autonomous driving. These methods typically adopt a training strategy that first pretrains on synthetic data and then fine-tunes on real data, resulting in twice as many training epochs as the baseline trained on real data only. When we double the epochs of the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed to enhance downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce edited, multi-view photorealistic videos, which can be used to train downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner-case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. Comprehensive experiments show that Dream4Drive effectively boosts the performance of downstream perception models under various training epochs.


Motivation

Previous methods (e.g., Panacea, SubjectDrive) often employ a training strategy that first pretrains on synthetic data and then fine-tunes on real data, resulting in double the training epochs compared to using only real data. We find that when the total number of training epochs is kept the same, large amounts of synthetic data provide little to no advantage and can even hurt performance relative to using only real data. As shown in the figure below, under the 2× epoch setting, models trained exclusively on real data achieve higher mAP and NDS than those trained on both real and synthetic data.

Framework

To reassess the value of synthetic data, we introduce Dream4Drive, a novel 3D-aware synthetic data generation framework designed for downstream perception tasks. The core idea of Dream4Drive is to first decompose the input video into several 3D-aware guidance maps and subsequently render 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train downstream perception models. Consequently, we can incorporate various assets with different trajectories (e.g., viewpoints, poses, and distances) into the same scene, significantly improving the geometric and appearance diversity of the synthetic data while ensuring consistency between annotations and videos. As shown in the figure above, under identical training epochs (1×, 2×, or 3×), our method requires only 420 synthetic samples (less than 2% of the real samples) to outperform prior augmentation methods.
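To make the trajectory control concrete, below is a minimal numpy sketch (our own illustration, not code from the paper) of how an inserted asset's trajectory could be specified as per-keyframe object-to-world poses derived from standard 3D box annotations (center plus heading yaw); the same boxes can then serve directly as labels for the downstream perception models.

import numpy as np

def box_to_pose(center_xyz, yaw):
    """Build an object-to-world SE(3) pose from a 3D box center and heading yaw,
    so an inserted asset's trajectory can be specified keyframe by keyframe."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = center_xyz
    return T

# Hypothetical "cut-in truck" trajectory: the asset starts about 25 m ahead of the
# ego vehicle and drifts laterally toward the ego lane over 8 keyframes.
trajectory = [
    box_to_pose(np.array([2.0 - 0.3 * t, 25.0 - 1.5 * t, 0.0]), yaw=np.pi / 2)
    for t in range(8)
]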

3D-aware scene editing

Given the input images, we first obtain the depth map, normal map, and edge map for the background. For a target 3D asset, we position it within the 3D space of the original video based on the provided 3D bounding boxes. For each frame and each view, we then use calibrated camera intrinsics and extrinsics to render the target 3D asset. This process yields the object image and object mask, which are then used to edit the original depth, normal, and edge maps.
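The projection and map-editing step can be sketched as follows; this is a simplified numpy illustration under our own naming, and it assumes the asset has already been rasterized into per-pixel obj_depth, obj_normal, obj_edge, and obj_mask buffers by an off-the-shelf renderer.

import numpy as np

def project_asset_points(points_obj, T_obj_to_world, T_world_to_cam, K):
    """Project the asset's 3D points (object frame) into one camera view using the
    calibrated extrinsics and intrinsics; returns pixel coordinates and depths."""
    pts_h = np.concatenate([points_obj, np.ones((len(points_obj), 1))], axis=1)  # Nx4
    pts_cam = (T_world_to_cam @ T_obj_to_world @ pts_h.T).T[:, :3]               # camera frame
    depth = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                             # perspective divide
    return uv, depth

def edit_guidance_maps(bg_depth, bg_normal, bg_edge,
                       obj_depth, obj_normal, obj_edge, obj_mask):
    """Overwrite the background depth/normal/edge maps wherever the rendered asset
    is visible: inside its mask and closer to the camera than the scene depth."""
    visible = obj_mask & (obj_depth < bg_depth)
    depth = np.where(visible, obj_depth, bg_depth)
    normal = np.where(visible[..., None], obj_normal, bg_normal)
    edge = np.where(visible, obj_edge, bg_edge)
    return depth, normal, edge, visible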

3D-aware video rendering

Once we have obtained the depth map, normal map, and edge map for the background, as well as the rendered object image and object mask for the target 3D asset, we utilize a fine-tuned driving world model to render the edited video conditioned on these 3D-aware guidance maps. This 3D-aware scene editing pipeline effectively exploits the accurate pose, geometry, and texture information provided by 3D assets, ensuring geometric consistency in the generated results. Notably, our method does not depend on 3D bounding box embeddings to control object placement. Instead, we edit directly in 3D space, offering a more intuitive and reliable form of control.
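As a rough illustration of how the five guidance signals might be packed into a single conditioning input for the fine-tuned world model (shapes, channel ordering, and function names are our assumptions, not the released interface):

import torch

def build_condition(depth, normal, edge, obj_rgb, obj_mask):
    """Stack the five 3D-aware guidance maps channel-wise into one conditioning
    tensor per clip, shaped (B, T, V, C, H, W) for a multi-view video backbone."""
    maps = [depth, normal, edge, obj_rgb, obj_mask]       # each (B, T, V, H, W, C_i)
    cond = torch.cat(maps, dim=-1)                        # concatenate along channels
    return cond.permute(0, 1, 2, 5, 3, 4).contiguous()

# Example for a 6-camera clip of 8 frames (tiny spatial size, just for illustration):
B, T, V, H, W = 1, 8, 6, 64, 112
cond = build_condition(torch.rand(B, T, V, H, W, 1),   # depth
                       torch.rand(B, T, V, H, W, 3),   # normal
                       torch.rand(B, T, V, H, W, 1),   # edge
                       torch.rand(B, T, V, H, W, 3),   # rendered object image
                       torch.rand(B, T, V, H, W, 1))   # object mask
print(cond.shape)  # torch.Size([1, 8, 6, 9, 64, 112])

Channel-wise concatenation is only one plausible conditioning scheme; the key point is that every frame and view receives spatially aligned 3D-aware guidance.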

DriveObj3D

To support large-scale downstream driving tasks, we propose a simple 3D asset generation pipeline and construct a diverse asset dataset, DriveObj3D, covering a wide range of categories in driving scenarios to support insertion tasks.

The inference process of original inpainting and asset insertion

From top to bottom: the five types of 3D-aware guidance maps (depth, normal, edge, object image, object mask), the original condition images, and finally the resulting video generated by the driving world model.

Original Inpainting

Asset Insertion

Synthetic rare corner cases

A large truck is inserted at close range in front

A barrier is inserted in front and is about to collide with the vehicle

Videos of multiple asset class insertions

Insert the barrier on the left

Insert the traffic cone at left rear

Insert the bus from behind

Insert the construction vehicle from behind

Naive asset insertion vs. our generative approach

Effectiveness for Downstream Tasks

Comparison of detection under different training epochs. * indicates the evaluation of WoVoGen is only on the vehicle classes of cars, trucks, and buses. Bold and underline indicate the best and second best.
Comparison of tracking under different training epochs. Bold and underline indicate the best and second best.

Effectiveness for Various Resolutions

Detection performance under different training epochs (1x, 2x, 3x). "Naive Insert" denotes the direct projection of 3D assets into the original scene. Results are reported at 512×768 resolution.

Detailed AP Metrics

We provide a detailed breakdown of the AP metrics across different training epochs at a resolution of 512×768.
AP comparison across categories between Real and Ours (+420) under the 1× training epoch setting.
AP comparison across categories between Real and Ours (+420) under the 2× training epoch setting.
AP comparison across categories between Real and Ours (+420) under the 3× training epoch setting.

FID and FVD

Quantitative comparison of video generation quality with other methods. Our method achieves the best FVD and FID scores.

BibTeX

@article{Dream4Drive,
  title={Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks},
  author={Kai Zeng and Zhanqian Wu and Kaixin Xiong and Xiaobao Wei and Xiangyu Guo and Zhenxin Zhu and Kalok Ho and Lijun Zhou and Bohan Zeng and Ming Lu and Haiyang Sun and Bing Wang and Guang Chen and Hangjun Ye and Wentao Zhang},
  journal={arXiv preprint arXiv:},
  year={2025}
}