
Open Driving World Models (OpenDWM)

[中文简介] (Chinese introduction)

Demo video: demo2_3_out2.mp4 (video link)

Welcome to the OpenDWM project! This is an open-source initiative focused on autonomous driving video generation. Our mission is to provide a high-quality, controllable tool for generating autonomous driving videos with the latest technology. We aim to build a codebase that is both user-friendly and highly reusable, and we hope to keep improving the project through the collective wisdom of the community.

The driving world models generate multi-view images or videos of autonomous driving scenes based on text and road environment layout conditions. Whether it's the environment, weather conditions, vehicle type, or driving path, you can adjust them according to your needs.

The highlights are as follows:

  1. Significantly improved environmental diversity. Training on multiple datasets enhances the model's generalization ability far beyond what a single dataset allows. Take layout-conditioned generation as an example: scenes such as a snowy city street, or a lakeside highway with distant snow mountains, are impossible for generative models trained on a single dataset.

  2. Greatly improved generation quality. Support for popular model architectures (SD 2.1, 3.5) makes it easy to leverage the community's advanced pre-trained generation capabilities. Various training techniques, including multitasking and self-supervision, allow the model to exploit the information in autonomous driving video data more effectively.

  3. Convenient evaluation. Evaluation follows the popular torchmetrics framework, which is easy to configure, extend, and integrate into the pipeline. Public configurations (such as FID and FVD on the nuScenes validation set) are provided for alignment with other research works.

Furthermore, our code modules are designed with high reusability in mind, for easy application in other projects.

Currently, the project has implemented the following papers:

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen (1,2), Zehuan Wu (2), Yichen Liu (2), Yuxin Guo (2), Jingcheng Ni (2), Haifeng Xia (1), Siyu Xia (1)
(1) Southeast University, (2) SenseTime Research

MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
SenseTime Research

Setup

Hardware requirements:

  • Training and testing multi-view image or short video (<= 6 frames per iteration) generation requires 32 GB of GPU memory (e.g. V100)
  • Training and testing multi-view long video (6 to 40 frames per iteration) generation requires 80 GB of GPU memory (e.g. A100, H100)

Software requirements:

  • git (>= 2.25)
  • python (>= 3.9)

Install PyTorch >= 2.5:

python -m pip install torch==2.5.1 torchvision==0.20.1
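
An optional sanity check (plain PyTorch API, not an OpenDWM script) confirms the installed versions and the visible GPU memory against the hardware requirements above:

import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.5.1 and 0.20.1
if torch.cuda.is_available():
    # Compare against the 32 GB / 80 GB requirements listed above.
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 2**30), "GiB")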

Clone the repository, then install the dependencies.

cd DWM
git submodule update --init --recursive
python -m pip install -r requirements.txt

Models

Our cross-view temporal SD (CTSD) pipeline supports loading the pretrained SD 2.1, 3.0, or 3.5 weights, or the checkpoints we trained on the autonomous driving datasets (see the sketch after the table for the components a base model provides).

Base model | Text conditioned driving generation | Text and layout (box, map) conditioned driving generation
SD 2.1 | Config, Download | Config, Download
SD 3.0 | | UniMLVG Config, Released by 2025-2-1
SD 3.5 | Config, Download | Config, Released by 2025-2-1
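
For orientation, the sketch below illustrates which reusable components a base model supplies (the VAE, text encoders, and scheduler config referenced in the examples). It uses Hugging Face diffusers with an assumed SD 2.1 path; OpenDWM's own CTSD pipeline reads these components through its JSON config rather than through this code.

# Illustrative sketch only, not how OpenDWM loads models; the path below is an
# assumption and can point at a locally downloaded SD 2.1 folder instead.
from diffusers import AutoencoderKL, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")                     # image latent codec
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")         # text tokenizer
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")   # text conditioning
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")         # noise schedule config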

Examples

T2I, T2V generation with CTSD pipeline

Download the base model (for the VAE, text encoders, and scheduler config) and the driving generation model checkpoint, edit the paths and prompts in the JSON config, then run this command.

PYTHONPATH=src python examples/ctsd_generation_example.py -c examples/ctsd_35_6views_image_generation.json -o output/ctsd_35_6views_image_generation

Layout conditioned T2V generation with CTSD pipeline

  1. Download the base model (for the VAE, text encoders, and scheduler config) and the driving generation model checkpoint, and edit the paths in the JSON config.
  2. Download the layout resource package nuscenes_scene-0627_package.zip and unzip it to {RESOURCE_PATH}. Then set the meta path to {RESOURCE_PATH}/data.json in the JSON config.
  3. Run this command to generate the video.
PYTHONPATH=src python src/dwm/preview.py -c examples/ctsd_21_6views_video_generation_with_layout.json -o output/ctsd_21_6views_video_generation_with_layout

Train

Preparation:

  1. Download the base models.
  2. Download and process datasets.
  3. Edit the configuration file (mainly the path of the model and dataset under the user environment).

Once the config file is updated with the correct model and data information, launch training by:

PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}

Or distributed training by:

OMP_NUM_THREADS=1 TOKENIZERS_PARALLELISM=false PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python -m torch.distributed.run --nnodes $WORLD_SIZE --nproc-per-node 8 --node-rank $RANK --master-addr $MASTER_ADDR --master-port $MASTER_PORT src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}

Then you can check the preview under output/{YOUR_WORKSPACE}/preview, and get the checkpoint files from output/{YOUR_WORKSPACE}/checkpoints.

Some training tasks require multiple stages (the configurations named train_warmup.json and train.json). Fill the path of the checkpoint saved by the previous stage into the configuration of the following stage (for example), then launch the training of that following stage.
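
A minimal sketch of wiring the stages together is shown below; the field name "pretrained_model_path" is hypothetical (check the actual train.json of your configuration for the real key), and the paths are placeholders in the same {NAME} style as the commands above.

import json

stage2_config = "{YOUR_STAGE2_CONFIG}"                            # the train.json of the following stage
stage1_checkpoint = "output/{YOUR_WORKSPACE}/checkpoints/{STEP}"  # saved by the warmup stage

with open(stage2_config) as f:
    config = json.load(f)

config["pretrained_model_path"] = stage1_checkpoint  # hypothetical key name
with open(stage2_config, "w") as f:
    json.dump(config, f, indent=4)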

Evaluation

FID and FVD metric evaluation is integrated into the pipeline. To use it, fill in the validation set (source, sampling interval) and the evaluation parameters (for example, the number of frames of each video segment to be measured by FVD) in the configuration file.
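
For reference, the torchmetrics convention that the evaluation follows looks roughly like the sketch below (illustrative only; in practice the metrics are declared in the JSON config and computed by src/dwm/evaluate.py):

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# normalize=True accepts float images in [0, 1]; the random tensors are
# stand-ins for validation frames and sampled frames.
fid = FrechetInceptionDistance(feature=2048, normalize=True)
real_frames = torch.rand(8, 3, 256, 256)
generated_frames = torch.rand(8, 3, 256, 256)
fid.update(real_frames, real=True)
fid.update(generated_frames, real=False)
print(fid.compute())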

Run the evaluation as follows.

PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/evaluate.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}

Or run distributed evaluation with torch.distributed.run, similar to distributed training.

Development

Folder structure

  • configs The config files for data and pipeline with different arguments.
  • examples The inference code and configurations.
  • externals The dependency projects.
  • src/dwm The shared components of this project.
    • datasets implements torch.utils.data.Dataset for our training pipeline by reading multi-view, LiDAR and temporal data, with optional text, 3D box, HD map, pose, camera parameters as conditions.
    • fs provides flexible, fsspec-style access to data stored in ZIP blobs or in S3-compatible storage services (see the sketch after this list).
    • metrics implements torchmetrics compatible classes for quantitative evaluation.
    • models implements generation models and their building blocks.
    • pipelines implements the training logic for different models.
    • tools provides dataset and file processing scripts for faster initialization and reading.
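
As an example of the fsspec-style access that the fs module builds on (a minimal sketch with placeholder paths, not the project's actual data layout), fsspec URL chaining can read a member file straight out of a ZIP blob:

import json
import fsspec

# Read data.json directly from inside the layout resource package without unpacking it.
with fsspec.open("zip://data.json::{RESOURCE_PATH}/nuscenes_scene-0627_package.zip", "r") as f:
    meta = json.load(f)
print(type(meta))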

For more details, see the introductions to the file system and the dataset.
