feat: hunyuan t2v finetune #22

Open · wants to merge 2 commits into `main`
60 changes: 59 additions & 1 deletion README.md
@@ -363,7 +363,7 @@ After downloading, the model checkpoints should be placed as [Checkpoint Structure]

|Task|Model|Command|Length (#frames)|Resolution|Inference Time (s)|GPU Memory (GiB)|
|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
|T2V|HunyuanVideo|`bash shscripts/inference_hunyuan_src.sh`|129|720x1280|1920|59.15|
|T2V|Mochi|`bash shscripts/inference_mochi.sh`|84|480x848|109.0|26|
|I2V|CogVideoX-5b-I2V|`bash shscripts/inference_cogVideo_i2v_diffusers.sh`|49|480x720|310.4|4.78|
|T2V|CogVideoX-2b|`bash shscripts/inference_cogVideo_t2v_diffusers.sh`|49|480x720|107.6|2.32|
@@ -433,6 +433,64 @@ bash shscripts/train_opensorav10.sh
bash configs/train/000_videocrafter2ft/run.sh
``` -->

#### 4. HunyuanVideo Fine-tuning

Please follow the steps below:
1) Install the environment
``` shell
conda create --name videotuna-hunyuan python=3.10 -y
conda activate videotuna-hunyuan
pip install -r requirements-hunyuan-finetune.txt
```
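
Optionally, a quick import check — a minimal sketch, not part of this PR — can confirm that the pinned stack from `requirements-hunyuan-finetune.txt` resolved correctly:

``` python
# Minimal sanity check (not part of this PR): verify the pinned packages
# import cleanly and that CUDA is visible before training.
import torch
import diffusers
import transformers

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__, "| transformers:", transformers.__version__)
```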

2) Prepare the dataset
``` shell
# `huggingface-cli` is provided by `huggingface_hub`, installed in step 1
huggingface-cli download \
--repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
--local-dir Dataset/video-dataset-disney
```

<details>
<summary>

**Customized Dataset Preparation**
</summary>

If you would like to use a custom dataset, create a `prompt.txt` file containing one prompt per line. Note that the prompts must be in English; we recommend the [prompt refinement script](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py) for better prompts. Alternatively, you can use [CogVideo-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) for data annotation:

```
A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
A black and white animated sequence on a ship’s deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language...
...
```

You will also need a `videos.txt` file listing one video file path per line. The paths must be relative to the `--data_root` directory. The format is as follows:

```
videos/00000.mp4
videos/00001.mp4
...
```

The training video resolution must be divisible by 32 (for example, `512 x 320` or `960 x 544`), and the frame count (length) must be `4k` or `4k + 1` (for example: 16, 32, 49, 81). A quick check is sketched below.
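
These constraints can be scripted as a guard against silent shape errors. The following is an illustrative sketch (the `check_dataset` helper is hypothetical, not part of this PR) using `decord`, which is already in `requirements-hunyuan-finetune.txt`:

``` python
# Illustrative sketch (not part of this PR): validate a custom dataset
# against the resolution and frame-count constraints above.
import os
import decord

def check_dataset(data_root: str) -> None:
    with open(os.path.join(data_root, "prompt.txt")) as f:
        prompts = [line.strip() for line in f if line.strip()]
    with open(os.path.join(data_root, "videos.txt")) as f:
        videos = [line.strip() for line in f if line.strip()]
    assert len(prompts) == len(videos), "prompt.txt and videos.txt must align line by line"

    for rel_path in videos:
        reader = decord.VideoReader(os.path.join(data_root, rel_path))
        height, width, _ = reader[0].shape
        assert height % 32 == 0 and width % 32 == 0, f"{rel_path}: {width}x{height} not divisible by 32"
        assert len(reader) % 4 in (0, 1), f"{rel_path}: frame count {len(reader)} is not 4k or 4k+1"

check_dataset("Dataset/video-dataset-disney")
```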
</details>
3) Download the checkpoints

``` shell
huggingface-cli download hunyuanvideo-community/HunyuanVideo --local-dir ./checkpoints/hunyuan/HunyuanVideo
```

Please note that HunyuanVideo training requires about 60 GB of GPU memory, and inference requires 40 GB or more. You can reduce the video size and frame count to save memory.
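
If that budget is tight, `diffusers` offers standard memory savers. A hedged sketch follows — this PR's inference script enables VAE tiling but uses `pipe.to("cuda")` rather than offloading:

``` python
# Hedged sketch, not this PR's script: load the pipeline with diffusers'
# standard memory savers when 40 GB+ of GPU memory is unavailable.
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "checkpoints/hunyuan/HunyuanVideo", torch_dtype=torch.float16
)
pipe.vae.enable_tiling()           # decode video latents tile by tile
pipe.enable_model_cpu_offload()    # keep idle submodules on CPU (needs `accelerate`)
```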
4) Train the LoRA
``` shell
bash shscripts/train_hunyuan_lora.sh
```
5) Run the inference
``` shell
bash shscripts/inference_hunyuan_diffusers.sh
```
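
With the default settings in this PR, the output video is written to `results/hunyuan/output_with_lora.mp4` (see `shscripts/inference_hunyuan_diffusers.sh` later in this diff).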


#### Finetuning for enhanced language understanding


17 changes: 17 additions & 0 deletions configs/006_hunyuanVideo/hunyuan_t2v_lora.yaml
@@ -0,0 +1,17 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
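
This file is a standard single-GPU `accelerate` launch config. The diff does not show how `shscripts/train_hunyuan_lora.sh` consumes it; a plausible usage (the flags below are an assumption, not taken from this PR) is:

``` shell
# Assumed usage, not shown in this diff: pass the config to `accelerate launch`,
# followed by whatever trainer arguments `src.finetrainers.parse_arguments` expects.
accelerate launch --config_file configs/006_hunyuanVideo/hunyuan_t2v_lora.yaml \
    scripts/train_hunyuan_lora.py
```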
17 changes: 17 additions & 0 deletions requirements-hunyuan-finetune.txt
@@ -0,0 +1,17 @@
torch==2.5.1
torchvision==0.20.1
torchao>=0.5.0
accelerate
bitsandbytes
diffusers>=0.30.3
transformers>=4.45.2
huggingface_hub
hf_transfer>=0.1.8
peft>=0.12.0
decord>=0.6.0
wandb
pandas
sentencepiece>=0.2.0
imageio-ffmpeg>=0.5.1
numpy>=1.26.4
opencv-python
91 changes: 43 additions & 48 deletions scripts/inference_hunyuan_diffusers.py
@@ -1,60 +1,55 @@
import torch
import argparse
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video


# Function to parse arguments
def parse_args():
    parser = argparse.ArgumentParser(description="Hyperparameters for Hunyuan Video Inference")
    parser.add_argument('--ckpt-path', type=str, required=True, help="Path to the checkpoint directory")
    parser.add_argument('--lora-path', type=str, required=True, help="Path to the LoRA weights")
    parser.add_argument('--lora-weight', type=float, required=True, help="Weight for the LoRA model")
    parser.add_argument('--prompt', type=str, required=True, help="Prompt for the video generation")
    parser.add_argument('--video-size', type=int, nargs=2, required=True,
                        help="Height and width of the generated video")
    parser.add_argument('--video-frame-length', type=int, required=True, help="Number of frames in the video")
    parser.add_argument('--video-fps', type=int, required=True, help="Frames per second for the output video")
    parser.add_argument('--infer-steps', type=int, required=True, help="Number of inference steps")
    parser.add_argument('--output-path', type=str, required=True, help="Path to save the output video")

    return parser.parse_args()


# Main function
def main():
    # Parse arguments
    args = parse_args()
    print(args)

    # Load model and pipeline
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(args.ckpt_path, subfolder="transformer",
                                                                 torch_dtype=torch.bfloat16)
    pipe = HunyuanVideoPipeline.from_pretrained(args.ckpt_path, transformer=transformer, torch_dtype=torch.float16)

    # Load LoRA weights
    pipe.load_lora_weights(args.lora_path, adapter_name="hunyuanvideo-lora")
    pipe.set_adapters(["hunyuanvideo-lora"], [args.lora_weight])

    # Enable tiling and move to GPU
    pipe.vae.enable_tiling()
    pipe.to("cuda")

    # Generate video frames
    output = pipe(
        prompt=args.prompt,
        height=args.video_size[0],
        width=args.video_size[1],
        num_frames=args.video_frame_length,
        num_inference_steps=args.infer_steps,
    ).frames[0]

    # Export to video
    export_to_video(output, args.output_path, fps=args.video_fps)


if __name__ == "__main__":
    main()
61 changes: 61 additions & 0 deletions scripts/inference_hunyuan_src.py
@@ -0,0 +1,61 @@
import os
import sys
import time
from pathlib import Path
from loguru import logger
from datetime import datetime

sys.path.insert(0, os.getcwd())
sys.path.insert(1, f'{os.getcwd()}/src')
from src.hyvideo.utils.file_utils import save_videos_grid
from src.hyvideo.config import parse_args
from src.hyvideo.inference import HunyuanVideoSampler


def main():
    args = parse_args()
    print(args)
    models_root_path = Path(args.model_base)
    if not models_root_path.exists():
        raise ValueError(f"`models_root` does not exist: {models_root_path}")

    # Create the folder to save the samples
    save_path = args.save_path if args.save_path_suffix == "" else f'{args.save_path}_{args.save_path_suffix}'
    os.makedirs(save_path, exist_ok=True)

    # Load models
    hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)

    # Get the updated args
    args = hunyuan_video_sampler.args

    # Start sampling
    # TODO: batch inference check
    outputs = hunyuan_video_sampler.predict(
        prompt=args.prompt,
        height=args.video_size[0],
        width=args.video_size[1],
        video_length=args.video_length,
        seed=args.seed,
        negative_prompt=args.neg_prompt,
        infer_steps=args.infer_steps,
        guidance_scale=args.cfg_scale,
        num_videos_per_prompt=args.num_videos,
        flow_shift=args.flow_shift,
        batch_size=args.batch_size,
        embedded_guidance_scale=args.embedded_cfg_scale
    )
    samples = outputs['samples']

    # Save samples
    for i, sample in enumerate(samples):
        sample = sample.unsqueeze(0)
        time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%H:%M:%S")
        # Use a per-sample path so later iterations do not nest inside the first file's path
        cur_save_path = f"{save_path}/{time_flag}_seed{outputs['seeds'][i]}_{outputs['prompts'][i][:100].replace('/', '')}.mp4"
        save_videos_grid(sample, cur_save_path, fps=24)
        logger.info(f'Sample saved to: {cur_save_path}')


if __name__ == "__main__":
    main()
50 changes: 50 additions & 0 deletions scripts/train_hunyuan_lora.py
@@ -0,0 +1,50 @@
import os
import sys
import logging
import traceback

sys.path.insert(0, os.getcwd())
sys.path.insert(1, f'{os.getcwd()}/src')
from src.finetrainers import Trainer, parse_arguments

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

logger = logging.getLogger("finetrainers")
logger.setLevel(LOG_LEVEL)


def main():
    try:
        import multiprocessing

        multiprocessing.set_start_method("fork")
    except Exception as e:
        logger.error(
            f'Failed to set multiprocessing start method to "fork". This can lead to poor performance, high memory usage, or crashes. '
            f"See: https://pytorch.org/docs/stable/notes/multiprocessing.html\n"
            f"Error: {e}"
        )

    try:
        args = parse_arguments()
        trainer = Trainer(args)

        trainer.prepare_dataset()
        trainer.prepare_models()
        trainer.prepare_precomputations()
        trainer.prepare_trainable_parameters()
        trainer.prepare_optimizer()
        trainer.prepare_for_training()
        trainer.prepare_trackers()
        trainer.train()
        # trainer.evaluate()

    except KeyboardInterrupt:
        logger.info("Received keyboard interrupt. Exiting...")
    except Exception as e:
        logger.error(f"An error occurred during training: {e}")
        logger.error(traceback.format_exc())


if __name__ == "__main__":
    main()
26 changes: 16 additions & 10 deletions shscripts/inference_hunyuan_diffusers.sh
@@ -1,12 +1,18 @@
export TOKENIZERS_PARALLELISM=false

HunyuanCKPTPath="checkpoints/hunyuan/HunyuanVideo"
LoRAPath="results/hunyuan/hunyuan-video-loras/your-experiment-name/checkpoint-500/pytorch_lora_weights.safetensors"
LoRAweight=0.6
Prompt="A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions."
OutputPath="results/hunyuan/output_with_lora.mp4"

python scripts/inference_hunyuan_diffusers.py \
    --ckpt-path $HunyuanCKPTPath \
    --lora-path $LoRAPath \
    --lora-weight $LoRAweight \
    --prompt "$Prompt" \
    --video-size 320 512 \
    --video-frame-length 61 \
    --video-fps 15 \
    --infer-steps 50 \
    --output-path $OutputPath
12 changes: 12 additions & 0 deletions shscripts/inference_hunyuan_src.sh
@@ -0,0 +1,12 @@
# You can increase the --video-size to <720 1280> if your GPU has about 60GB memory. The current setting requires about 45GB GPU memory.
python scripts/inference_hunyuan_src.py \
--video-size 544 960 \
--video-length 129 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results/hunyuan \
--model-base ./checkpoints/hunyuan \
--dit-weight ./checkpoints/hunyuan/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \
--seed 43 # You may change the seed to get different results using the same prompt