[Chinese introduction]
Welcome to the OpenDWM project! This open-source initiative focuses on autonomous driving video generation. Our mission is to provide a high-quality, controllable tool for generating autonomous driving videos with the latest technology. We aim to build a codebase that is both user-friendly and highly reusable, and we hope to keep improving the project with the help of the community.
The driving world models generate multi-view images or videos of autonomous driving scenes based on text and road environment layout conditions. You can adjust the environment, weather conditions, vehicle type, and driving path according to your needs.
The highlights are as follows:
- Significant improvement in environmental diversity. Training on multiple datasets substantially improves the model's generalization ability. For example, in a generation task controlled by layout conditions, scenes such as a snowy city street or a lakeside highway with distant snow-capped mountains are out of reach for generative models trained on a single dataset.
- Greatly improved generation quality. Support for popular model architectures (SD 2.1, 3.5) makes it easier to use the pre-trained generation capabilities available in the community. Training techniques, including multitasking and self-supervision, let the model use the information in autonomous driving video data more effectively.
- Convenient evaluation. Evaluation follows the popular framework torchmetrics, which is easy to configure, develop, and integrate into the pipeline. Public configurations (such as FID and FVD on the nuScenes validation set) are provided to align with other research works.
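As a rough illustration of the accumulate-then-compute pattern that torchmetrics-style metrics follow, here is a plain-Python sketch. The class `RunningMeanMetric` is hypothetical: it does not depend on torchmetrics or on this project's metric classes, and only mirrors the `update()` / `compute()` / `reset()` method names.

```python
class RunningMeanMetric:
    """Hypothetical stand-in that mirrors the update/compute/reset pattern."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the accumulated state before a new evaluation run.
        self.total = 0.0
        self.count = 0

    def update(self, values):
        # Accumulate one batch of per-sample scores.
        for value in values:
            self.total += float(value)
            self.count += 1

    def compute(self):
        # Reduce the accumulated state to the final metric value.
        if self.count == 0:
            raise ValueError("compute() called before any update()")
        return self.total / self.count
```

An evaluation loop would call `update()` once per batch and `compute()` once at the end, which is what makes this style easy to plug into a pipeline.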
Furthermore, our code modules are designed with high reusability in mind, for easy application in other projects.
Currently, the project has implemented the following papers:
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen1,2, Zehuan Wu2, Yichen Liu2, Yuxin Guo2, Jingcheng Ni2, Haifeng Xia1, Siyu Xia1
1Southeast University 2SenseTime Research
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
SenseTime Research
Hardware requirement:
- Training and testing multi-view image generation or short video (<= 6 frames per iteration) generation requires 32GB GPU memory (e.g. V100)
- Training and testing multi-view long video (6 ~ 40 frames per iteration) generation requires 80GB GPU memory (e.g. A100, H100)
Software requirement:
- git (>= 2.25)
- python (>= 3.9)
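The version requirements above can be checked with a short script. This is a minimal sketch; the helper names `parse_version`, `meets_minimum`, and `check_environment` are made up for illustration and are not part of the project.

```python
import re
import subprocess
import sys


def parse_version(text):
    """Return the first dotted version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def meets_minimum(found, required):
    """Compare version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(required))
    padded_found = tuple(found) + (0,) * (width - len(found))
    padded_required = tuple(required) + (0,) * (width - len(required))
    return padded_found >= padded_required


def check_environment():
    """Print whether git and Python satisfy the minimums listed above."""
    git_output = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    ).stdout
    checks = {
        "git": (parse_version(git_output), (2, 25)),
        "python": (tuple(sys.version_info[:3]), (3, 9)),
    }
    for name, (found, required) in checks.items():
        status = "ok" if meets_minimum(found, required) else "too old"
        print(f"{name}: {'.'.join(map(str, found))} ({status})")
```

Calling `check_environment()` prints one line per tool; it requires `git` to be on the `PATH`.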
Install PyTorch >= 2.5:
python -m pip install torch==2.5.1 torchvision==0.20.1
Clone the repository, then install the dependencies.
cd DWM
git submodule update --init --recursive
python -m pip install -r requirements.txt
Our cross-view temporal SD (CTSD) pipeline supports loading the pretrained SD 2.1, 3.0, 3.5, or the checkpoints we trained on the autonomous driving datasets.
Base model | Text conditioned driving generation | Text and layout (box, map) conditioned driving generation |
---|---|---|
SD 2.1 | Config, Download | Config, Download |
SD 3.0 | UniMLVG Config, Released by 2025-2-1 | |
SD 3.5 | Config, Download | Config, Released by 2025-2-1 |
Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, edit the paths and prompts in the JSON config, then run this command.
PYTHONPATH=src python examples/ctsd_generation_example.py -c examples/ctsd_35_6views_image_generation.json -o output/ctsd_35_6views_image_generation
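Editing the paths by hand works fine. If many entries in a config share a common path prefix, a small script can rewrite them in one pass. The sketch below is an assumption-laden helper, not part of the project: the function names are hypothetical, and it makes no claim about the key names used in the real configs. It simply replaces a substring in every string value of the JSON tree.

```python
import json


def replace_in_strings(node, old, new):
    """Recursively replace `old` with `new` in every string value of a JSON tree."""
    if isinstance(node, dict):
        return {key: replace_in_strings(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [replace_in_strings(item, old, new) for item in node]
    if isinstance(node, str):
        return node.replace(old, new)
    # Numbers, booleans, and null are returned unchanged.
    return node


def rewrite_config(src_path, dst_path, old, new):
    """Load a JSON config, rewrite matching substrings, and save a copy."""
    with open(src_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(replace_in_strings(config, old, new), f, indent=4)
```

Writing to a separate `dst_path` keeps the shipped example config untouched, so it can still be compared against the edited copy.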
- Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, and edit the path in the JSON config.
- Download the layout resource package nuscenes_scene-0627_package.zip and unzip it to the {RESOURCE_PATH}. Then edit the meta path as {RESOURCE_PATH}/data.json in the JSON config.
- Run this command to generate the video.
PYTHONPATH=src python src/dwm/preview.py -c examples/ctsd_21_6views_video_generation_with_layout.json -o output/ctsd_21_6views_video_generation_with_layout
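The unzip step for the layout resource package can also be scripted. This is a minimal sketch assuming `data.json` sits at the top level of the archive, as the meta path `{RESOURCE_PATH}/data.json` suggests; the function name `unpack_layout_package` is hypothetical.

```python
import zipfile
from pathlib import Path


def unpack_layout_package(zip_path, resource_path):
    """Extract the package into resource_path and return the meta path."""
    resource_dir = Path(resource_path)
    resource_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(resource_dir)
    # Assumption: the archive places data.json directly under resource_path.
    meta_path = resource_dir / "data.json"
    if not meta_path.is_file():
        raise FileNotFoundError(f"{meta_path} not found after extraction")
    return meta_path
```

The returned path is the value to put into the meta path entry of the JSON config.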
Preparation:
- Download the base models.
- Download and process datasets.
- Edit the configuration file (mainly the path of the model and dataset under the user environment).
Once the config file is updated with the correct model and data information, launch training by:
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
Or distributed training by:
OMP_NUM_THREADS=1 TOKENIZERS_PARALLELISM=false PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python -m torch.distributed.run --nnodes $WORLD_SIZE --nproc-per-node 8 --node-rank $RANK --master-addr $MASTER_ADDR --master-port $MASTER_PORT src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
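For job schedulers that launch a Python entry point, the distributed command above can be assembled programmatically. This is a sketch only: `build_distributed_command` is a hypothetical helper, and it reproduces the command line shown above rather than any project API. Note that the command passes `WORLD_SIZE` as the node count and `RANK` as the node rank.

```python
import os
import shlex

# Same PYTHONPATH value as in the training command above.
PYTHONPATH = "src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd"


def build_distributed_command(config, workspace, env=None, nproc_per_node=8):
    """Return the distributed training command line as a single string."""
    env = os.environ if env is None else env
    required = ["WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT"]
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    args = [
        "python", "-m", "torch.distributed.run",
        "--nnodes", env["WORLD_SIZE"],
        "--nproc-per-node", str(nproc_per_node),
        "--node-rank", env["RANK"],
        "--master-addr", env["MASTER_ADDR"],
        "--master-port", env["MASTER_PORT"],
        "src/dwm/train.py", "-c", config, "-o", f"output/{workspace}",
    ]
    prefix = f"OMP_NUM_THREADS=1 TOKENIZERS_PARALLELISM=false PYTHONPATH={PYTHONPATH}"
    # shlex.join quotes any argument that contains shell-special characters.
    return prefix + " " + shlex.join(args)
```

Failing early on missing environment variables avoids launching a job that would otherwise hang while waiting for the master node.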
Then you can check the preview under output/{YOUR_WORKSPACE}/preview, and get the checkpoint files from output/{YOUR_WORKSPACE}/checkpoints.
Some training tasks require multiple stages (for the configurations named train_warmup.json and train.json). Fill the path of the checkpoint saved by the previous stage into the config of the following stage (for example), then launch the training of the following stage.
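When chaining stages, it can help to locate the newest checkpoint from the previous stage automatically. The sketch below picks the most recently modified entry in a checkpoints directory. The checkpoint file naming is not described here, so the helper `latest_checkpoint` is hypothetical and relies on modification time alone.

```python
from pathlib import Path


def latest_checkpoint(checkpoints_dir):
    """Return the most recently modified entry in checkpoints_dir."""
    folder = Path(checkpoints_dir)
    # Skip hidden files such as editor or filesystem metadata.
    candidates = [p for p in folder.iterdir() if not p.name.startswith(".")]
    if not candidates:
        raise FileNotFoundError(f"no checkpoints under {folder}")
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, `latest_checkpoint("output/{YOUR_WORKSPACE}/checkpoints")`, with the placeholder replaced by the workspace name, returns the path to fill into the next stage's config.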
We have integrated FID and FVD metric evaluation into the pipeline. To use it, fill in the validation set (source, sampling interval) and the evaluation parameters (for example, the number of frames of each video segment to be measured in FVD) in the configuration file.
Run the evaluation as follows.
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/evaluate.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
Or run distributed evaluation with torch.distributed.run, similar to the distributed training.
- configs: The config files for data and pipeline with different arguments.
- examples: The inference code and configurations.
- externals: The dependency projects.
- src/dwm: The shared components of this project.
  - datasets implements torch.utils.data.Dataset for our training pipeline by reading multi-view, LiDAR and temporal data, with optional text, 3D box, HD map, pose, camera parameters as conditions.
  - fs provides flexible access methods following fsspec to the data stored in ZIP blobs, or in the S3 compatible storage services.
  - metrics implements torchmetrics compatible classes for quantitative evaluation.
  - models implements generation models and their building blocks.
  - pipelines implements the training logic for different models.
  - tools provides dataset and file processing scripts for faster initialization and reading.
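To make the idea of a multi-view, temporal dataset concrete, here is a small sketch of a map-style index that exposes `__len__` and `__getitem__`. It is not the project's dataset code: the class `MultiViewClipIndex` and its input layout (a dict of time-aligned frame IDs per view) are assumptions for illustration, and it deliberately avoids importing torch so it runs with the standard library alone.

```python
class MultiViewClipIndex:
    """Hypothetical map-style index over fixed-length multi-view clips."""

    def __init__(self, frames_per_view, clip_length, stride=1):
        # frames_per_view maps a view name to its ordered frame IDs.
        # All views are assumed to be time-aligned and equally long.
        lengths = {len(frames) for frames in frames_per_view.values()}
        if len(lengths) != 1:
            raise ValueError("all views must have the same number of frames")
        self.frames_per_view = frames_per_view
        self.clip_length = clip_length
        total = lengths.pop()
        # Start offsets of every clip that fits entirely in the sequence.
        self.starts = list(range(0, total - clip_length + 1, stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, index):
        start = self.starts[index]
        views = list(self.frames_per_view)
        # The result is indexed as [time][view]: one frame ID per view per step.
        return [
            [self.frames_per_view[view][t] for view in views]
            for t in range(start, start + self.clip_length)
        ]
```

A real dataset would load images and conditions for each frame ID instead of returning the IDs themselves.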
See the introductions to the file system and the dataset.