Generating human motions from text descriptions, analogous to text-to-image models that generate new images from text prompts.
conda create -n text2motion python=3.9
conda activate text2motion
# Clone repository recursively
git clone https://github.com/Developer-Zer0/MoDDM-Text-to-Motion-Synthesis-Using-Discrete-Diffusion.git --recurse-submodules
# Install PyTorch 1.10.0 (**CUDA 11.1**)
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install required packages
pip install -r requirements.txt
# Install DetUtil
cd DetUtil
python setup.py develop
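Optionally, a quick sanity check that PyTorch and CUDA were picked up correctly (a minimal sketch, not part of the original setup):

```bash
# Print the installed torch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```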
API to run single-sample inference using the trained model on HumanML3D. Edit `sample_description.txt` to any text description of your choice. Inference does not require a GPU and runs entirely on CPU within about 15 seconds. The first run can take additional time to load CLIP.

- You need to set up FFMPEG for .mp4 generation. Follow the instructions at LINK. After installation, add the path to ffmpeg.exe (inside the bin folder) in `.env` (rename `.env.example`).
- Download the autoencoder checkpoint and the discrete diffusion checkpoint. Store them under `checkpoints/` (create it if it doesn't exist).
- You will also need to download SMPL_DATA and Deps for the human skeleton transformations and animations. Extract them and store them under `data/` (create it if it doesn't exist): `data/Deps`, `data/SMPL_DATA`.
- Run the following script and your human motion .mp4 will be stored in `generations/` (see the setup sketch after the command below).
python sample_generation.py
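A minimal end-to-end sketch of the steps above, assuming the checkpoints and the SMPL_DATA/Deps archives have already been downloaded (the prompt text is just an example):

```bash
# Create the expected folders if they don't exist yet
mkdir -p checkpoints data generations
# Place the downloaded checkpoints under checkpoints/ and the extracted
# SMPL_DATA and Deps folders under data/SMPL_DATA and data/Deps.

# Write a text description of your choice, then generate
echo "a person walks forward and waves with the right hand" > sample_description.txt
python sample_generation.py   # the resulting .mp4 is written to generations/
```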
To get both the HumanML3D and KIT-ML datasets, follow the instructions at https://github.com/EricGuo5513/HumanML3D. Once downloaded, store them under `data/` (create it if it doesn't exist). For training and evaluation, you will also need SMPL_DATA and Deps from Step 3 of single-sample inference.
HumanML3D is the default dataset in all experiments. To use the KIT-ML dataset, add `datamodule=guo-kit-ml.yaml` as a parameter in the command scripts, as in the sketch below.
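For example, a stage-1 training run on KIT-ML could look like this (the base command is the one from the training section below, with only the datamodule override added):

```bash
# Sketch: train on KIT-ML instead of the default HumanML3D
python src/train.py --config-name=train model=vq_vae.yaml model.do_evaluation=false \
    trainer.devices=[1] trainer.max_epochs=500 datamodule=guo-kit-ml.yaml
```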
You can skip this step by using a pre-trained autoencoder checkpoint. If you want to skip it, copy `autoencoder_finest.ckpt` into the same location and rename it to `autoencoder_trained.ckpt`.
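The copy-and-rename might look like this (a sketch, assuming `autoencoder_finest.ckpt` already sits in `checkpoints/`):

```bash
# Sketch: reuse the provided autoencoder checkpoint instead of training stage 1
cp checkpoints/autoencoder_finest.ckpt checkpoints/autoencoder_trained.ckpt
```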
Train the VQ-VAE reconstruction model on HumanML3D (or KIT-ML) by running the following script. All outputs and checkpoints will be stored in `logs/`.
python src/train.py --config-name=train model=vq_vae.yaml model.do_evaluation=false trainer.devices=[1] trainer.max_epochs=500
Setting `model.do_evaluation=True` will run the evaluator after every epoch to store FID and R-Precision. However, the evaluator is a pre-trained model from the work at https://github.com/EricGuo5513/TM2T, so you will need to download the pre-trained models from LINK. For the HumanML3D evaluator, you need `t2m/text_mot_match/model/finest.tar`. Store it at `checkpoints/t2m/text_mot_match/model/finest.tar`.
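A sketch of placing the HumanML3D evaluator checkpoint and then training with evaluation enabled (assuming the downloaded file is named `finest.tar` and lies in the current directory):

```bash
# Sketch: put the TM2T evaluator checkpoint where the code expects it
mkdir -p checkpoints/t2m/text_mot_match/model
mv finest.tar checkpoints/t2m/text_mot_match/model/finest.tar

# Stage-1 training with per-epoch evaluation (FID, R-Precision)
python src/train.py --config-name=train model=vq_vae.yaml model.do_evaluation=True \
    trainer.devices=[1] trainer.max_epochs=500
```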
The KIT-ML pre-trained models are from the above work as well and can be found at LINK. For the KIT-ML evaluator, you need `kit/text_mot_match/model/finest.tar`. Store it at `checkpoints/kit/text_mot_match/model/finest.tar`. Also include `eval_ckpt=checkpoints/kit/text_mot_match/model/finest.tar` as a parameter in the script.
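Combining the KIT-ML overrides, an evaluation-enabled stage-1 run might look like this (a sketch, not a prescribed command):

```bash
# Sketch: KIT-ML training with the KIT evaluator checkpoint
python src/train.py --config-name=train model=vq_vae.yaml model.do_evaluation=True \
    trainer.devices=[1] trainer.max_epochs=500 \
    datamodule=guo-kit-ml.yaml eval_ckpt=checkpoints/kit/text_mot_match/model/finest.tar
```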
Discrete diffusion training on HumanML3D (or KIT-ML). Copy the trained autoencoder checkpoint from the step above and paste it directly into `checkpoints/`. Rename the .ckpt file to `autoencoder_trained.ckpt` so that stage 2 can load it (a sketch of this follows the command below). All outputs and checkpoints will be stored in `logs/`. Three checkpoints will be created, corresponding to the epochs with the best validation FID, best validation R-Precision, and best validation loss. Run the following command.
python src/train.py --config-name=train model=vq_diffusion.yaml model.do_evaluation=false trainer.devices=[1] trainer.max_epochs=500
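The checkpoint copy-and-rename mentioned above might look like this; `<path-to-stage1-checkpoint>` is a placeholder for whichever stage-1 .ckpt you pick from `logs/`:

```bash
# Sketch: expose the trained stage-1 autoencoder to stage 2
cp <path-to-stage1-checkpoint> checkpoints/autoencoder_trained.ckpt
```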
Similar to stage-1 training, setting `model.do_evaluation=True` will run the evaluator after every epoch to store metrics. Follow the steps above to download the pre-trained evaluator models for HumanML3D (or KIT-ML). Set `logger=tensorboard` to get loss and metric plots across epochs.
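For instance, a stage-2 run with TensorBoard logging, followed by the standard TensorBoard viewer (a sketch; the exact subdirectory layout under `logs/` depends on the run):

```bash
# Sketch: stage-2 training with TensorBoard logging enabled
python src/train.py --config-name=train model=vq_diffusion.yaml model.do_evaluation=false \
    trainer.devices=[1] trainer.max_epochs=500 logger=tensorboard

# Inspect the logged curves
tensorboard --logdir logs/
```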
We compare our model to four methods: Seq2Seq, Language2Pose, TM2T, and Motion Diffusion Model (MDM). Seq2Seq and Language2Pose are deterministic motion generation baselines. TM2T utilizes a VQ-VAE and recurrent models for the text-to-motion synthesis task. MDM uses a conditional diffusion model on raw motions and has shown promising motion results.
| HumanML3D | KIT-ML |
| --- | --- |