feat: hunyuan t2v finetune #22

Open · wants to merge 2 commits into `main`
60 changes: 59 additions & 1 deletion README.md
@@ -363,7 +363,7 @@ After downloading, the model checkpoints should be placed as [Checkpoint Structure]

|Task|Model|Command|Length (#frames)|Resolution|Inference Time (s)|GPU Memory (GiB)|
|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
|T2V|HunyuanVideo|`bash shscripts/inference_hunyuan_src.sh`|129|720x1280|1920|59.15|
|T2V|Mochi|`bash shscripts/inference_mochi.sh`|84|480x848|109.0|26|
|I2V|CogVideoX-5b-I2V|`bash shscripts/inference_cogVideo_i2v_diffusers.sh`|49|480x720|310.4|4.78|
|T2V|CogVideoX-2b|`bash shscripts/inference_cogVideo_t2v_diffusers.sh`|49|480x720|107.6|2.32|
@@ -433,6 +433,64 @@ bash shscripts/train_opensorav10.sh
bash configs/train/000_videocrafter2ft/run.sh
``` -->

#### 4. HunyuanVideo Fine-tuning

Please follow the steps below:
1) Install the environment
``` shell
conda create --name videotuna-hunyuan python=3.10 -y
conda activate videotuna-hunyuan
pip install -r requirements-hunyuan-finetune.txt
```
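
Optionally, a quick import check — a minimal sketch, not part of this PR — can confirm that the pinned stack from `requirements-hunyuan-finetune.txt` resolved correctly:

``` python
# Minimal sanity check (not part of this PR): verify the pinned packages
# import cleanly and that CUDA is visible before training.
import torch
import diffusers
import transformers

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__, "| transformers:", transformers.__version__)
```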

2) Prepare the dataset
``` shell
# `huggingface-cli` is provided by `huggingface_hub`, installed in step 1
huggingface-cli download \
--repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
--local-dir Dataset/video-dataset-disney
```

<details>
<summary>

**Customized Dataset Preparation**
</summary>

If you would like to use a custom dataset, create a `prompt.txt` file containing one prompt per line. Note that the prompts must be in English; we recommend the [prompt refinement script](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py) for better prompts. Alternatively, you can use [CogVideo-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) for data annotation:

```
A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
A black and white animated sequence on a ship’s deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language...
...
```

You will also need a `videos.txt` file listing one video file path per line. The paths must be relative to the `--data_root` directory. The format is as follows:

```
videos/00000.mp4
videos/00001.mp4
...
```

The training video resolution must be divisible by 32 (for example, `512 x 320` or `960 x 544`), and the frame count (length) must be `4k` or `4k + 1` (for example: 16, 32, 49, 81). A quick check is sketched below.
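
These constraints can be scripted as a guard against silent shape errors. The following is an illustrative sketch (the `check_dataset` helper is hypothetical, not part of this PR) using `decord`, which is already in `requirements-hunyuan-finetune.txt`:

``` python
# Illustrative sketch (not part of this PR): validate a custom dataset
# against the resolution and frame-count constraints above.
import os
import decord

def check_dataset(data_root: str) -> None:
    with open(os.path.join(data_root, "prompt.txt")) as f:
        prompts = [line.strip() for line in f if line.strip()]
    with open(os.path.join(data_root, "videos.txt")) as f:
        videos = [line.strip() for line in f if line.strip()]
    assert len(prompts) == len(videos), "prompt.txt and videos.txt must align line by line"

    for rel_path in videos:
        reader = decord.VideoReader(os.path.join(data_root, rel_path))
        height, width, _ = reader[0].shape
        assert height % 32 == 0 and width % 32 == 0, f"{rel_path}: {width}x{height} not divisible by 32"
        assert len(reader) % 4 in (0, 1), f"{rel_path}: frame count {len(reader)} is not 4k or 4k+1"

check_dataset("Dataset/video-dataset-disney")
```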
</details>
3) Download the checkpoints

``` shell
huggingface-cli download hunyuanvideo-community/HunyuanVideo --local-dir ./checkpoints/hunyuan/HunyuanVideo
```

Please note that HunyuanVideo training requires about 60 GB of GPU memory, and inference requires 40 GB or more. You can reduce the video size and frame count to save memory.
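
If that budget is tight, `diffusers` offers standard memory savers. A hedged sketch follows — this PR's inference script enables VAE tiling but uses `pipe.to("cuda")` rather than offloading:

``` python
# Hedged sketch, not this PR's script: load the pipeline with diffusers'
# standard memory savers when 40 GB+ of GPU memory is unavailable.
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "checkpoints/hunyuan/HunyuanVideo", torch_dtype=torch.float16
)
pipe.vae.enable_tiling()           # decode video latents tile by tile
pipe.enable_model_cpu_offload()    # keep idle submodules on CPU (needs `accelerate`)
```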
4) Train the LoRA
``` shell
bash shscripts/train_hunyuan_lora.sh
```
5) Run the inference
``` shell
bash shscripts/inference_hunyuan_diffusers.sh
```
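
With the default settings in this PR, the output video is written to `results/hunyuan/output_with_lora.mp4` (see `shscripts/inference_hunyuan_diffusers.sh` later in this diff).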


#### Finetuning for enhanced language understanding


17 changes: 17 additions & 0 deletions configs/006_hunyuanVideo/hunyuan_t2v_lora.yaml
@@ -0,0 +1,17 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
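
This file is a standard single-GPU `accelerate` launch config. The diff does not show how `shscripts/train_hunyuan_lora.sh` consumes it; a plausible usage (the flags below are an assumption, not taken from this PR) is:

``` shell
# Assumed usage, not shown in this diff: pass the config to `accelerate launch`,
# followed by whatever trainer arguments `src.finetrainers.parse_arguments` expects.
accelerate launch --config_file configs/006_hunyuanVideo/hunyuan_t2v_lora.yaml \
    scripts/train_hunyuan_lora.py
```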
17 changes: 17 additions & 0 deletions requirements-hunyuan-finetune.txt
@@ -0,0 +1,17 @@
torch==2.5.1
torchvision==0.20.1
torchao>=0.5.0
accelerate
bitsandbytes
diffusers>=0.30.3
transformers>=4.45.2
huggingface_hub
hf_transfer>=0.1.8
peft>=0.12.0
decord>=0.6.0
wandb
pandas
sentencepiece>=0.2.0
imageio-ffmpeg>=0.5.1
numpy>=1.26.4
opencv-python
91 changes: 43 additions & 48 deletions scripts/inference_hunyuan_diffusers.py
@@ -1,60 +1,55 @@
import torch
import argparse
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video


# Function to parse arguments
def parse_args():
    parser = argparse.ArgumentParser(description="Hyperparameters for Hunyuan Video Inference")
    parser.add_argument('--ckpt-path', type=str, required=True, help="Path to the checkpoint directory")
    parser.add_argument('--lora-path', type=str, required=True, help="Path to the LoRA weights")
    parser.add_argument('--lora-weight', type=float, required=True, help="Weight for the LoRA model")
    parser.add_argument('--prompt', type=str, required=True, help="Prompt for the video generation")
    parser.add_argument('--video-size', type=int, nargs=2, required=True,
                        help="Height and width of the generated video")
    parser.add_argument('--video-frame-length', type=int, required=True, help="Number of frames in the video")
    parser.add_argument('--video-fps', type=int, required=True, help="Frames per second for the output video")
    parser.add_argument('--infer-steps', type=int, required=True, help="Number of inference steps")
    parser.add_argument('--output-path', type=str, required=True, help="Path to save the output video")

    return parser.parse_args()


# Main function
def main():
    # Parse arguments
    args = parse_args()
    print(args)

    # Load model and pipeline
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(args.ckpt_path, subfolder="transformer",
                                                                 torch_dtype=torch.bfloat16)
    pipe = HunyuanVideoPipeline.from_pretrained(args.ckpt_path, transformer=transformer, torch_dtype=torch.float16)

    # Load LoRA weights
    pipe.load_lora_weights(args.lora_path, adapter_name="hunyuanvideo-lora")
    pipe.set_adapters(["hunyuanvideo-lora"], [args.lora_weight])

    # Enable tiling and move to GPU
    pipe.vae.enable_tiling()
    pipe.to("cuda")

    # Generate video frames
    output = pipe(
        prompt=args.prompt,
        height=args.video_size[0],
        width=args.video_size[1],
        num_frames=args.video_frame_length,
        num_inference_steps=args.infer_steps,
    ).frames[0]

    # Export to video
    export_to_video(output, args.output_path, fps=args.video_fps)


if __name__ == "__main__":
    main()
61 changes: 61 additions & 0 deletions scripts/inference_hunyuan_src.py
@@ -0,0 +1,61 @@
import os
import sys
import time
from pathlib import Path
from loguru import logger
from datetime import datetime

sys.path.insert(0, os.getcwd())
sys.path.insert(1, f'{os.getcwd()}/src')
from src.hyvideo.utils.file_utils import save_videos_grid
from src.hyvideo.config import parse_args
from src.hyvideo.inference import HunyuanVideoSampler


def main():
    args = parse_args()
    print(args)
    models_root_path = Path(args.model_base)
    if not models_root_path.exists():
        raise ValueError(f"`models_root` does not exist: {models_root_path}")

    # Create the folder to save the samples
    save_path = args.save_path if args.save_path_suffix == "" else f'{args.save_path}_{args.save_path_suffix}'
    os.makedirs(save_path, exist_ok=True)

    # Load models
    hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)

    # Get the updated args
    args = hunyuan_video_sampler.args

    # Start sampling
    # TODO: batch inference check
    outputs = hunyuan_video_sampler.predict(
        prompt=args.prompt,
        height=args.video_size[0],
        width=args.video_size[1],
        video_length=args.video_length,
        seed=args.seed,
        negative_prompt=args.neg_prompt,
        infer_steps=args.infer_steps,
        guidance_scale=args.cfg_scale,
        num_videos_per_prompt=args.num_videos,
        flow_shift=args.flow_shift,
        batch_size=args.batch_size,
        embedded_guidance_scale=args.embedded_cfg_scale
    )
    samples = outputs['samples']

    # Save samples
    for i, sample in enumerate(samples):
        sample = sample.unsqueeze(0)
        time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%H:%M:%S")
        # Use a per-sample path so later iterations do not nest inside the first file's path
        cur_save_path = f"{save_path}/{time_flag}_seed{outputs['seeds'][i]}_{outputs['prompts'][i][:100].replace('/', '')}.mp4"
        save_videos_grid(sample, cur_save_path, fps=24)
        logger.info(f'Sample saved to: {cur_save_path}')


if __name__ == "__main__":
    main()
50 changes: 50 additions & 0 deletions scripts/train_hunyuan_lora.py
@@ -0,0 +1,50 @@
import os
import sys
import logging
import traceback

sys.path.insert(0, os.getcwd())
sys.path.insert(1, f'{os.getcwd()}/src')
from src.finetrainers import Trainer, parse_arguments

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

logger = logging.getLogger("finetrainers")
logger.setLevel(LOG_LEVEL)


def main():
    try:
        import multiprocessing

        multiprocessing.set_start_method("fork")
    except Exception as e:
        logger.error(
            f'Failed to set multiprocessing start method to "fork". This can lead to poor performance, high memory usage, or crashes. '
            f"See: https://pytorch.org/docs/stable/notes/multiprocessing.html\n"
            f"Error: {e}"
        )

    try:
        args = parse_arguments()
        trainer = Trainer(args)

        trainer.prepare_dataset()
        trainer.prepare_models()
        trainer.prepare_precomputations()
        trainer.prepare_trainable_parameters()
        trainer.prepare_optimizer()
        trainer.prepare_for_training()
        trainer.prepare_trackers()
        trainer.train()
        # trainer.evaluate()

    except KeyboardInterrupt:
        logger.info("Received keyboard interrupt. Exiting...")
    except Exception as e:
        logger.error(f"An error occurred during training: {e}")
        logger.error(traceback.format_exc())


if __name__ == "__main__":
    main()
26 changes: 16 additions & 10 deletions shscripts/inference_hunyuan_diffusers.sh
@@ -1,12 +1,18 @@
export TOKENIZERS_PARALLELISM=false

HunyuanCKPTPath="checkpoints/hunyuan/HunyuanVideo"
LoRAPath="results/hunyuan/hunyuan-video-loras/your-experiment-name/checkpoint-500/pytorch_lora_weights.safetensors"
LoRAweight=0.6
Prompt="A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions."
OutputPath="results/hunyuan/output_with_lora.mp4"

python scripts/inference_hunyuan_diffusers.py \
    --ckpt-path $HunyuanCKPTPath \
    --lora-path $LoRAPath \
    --lora-weight $LoRAweight \
    --prompt "$Prompt" \
    --video-size 320 512 \
    --video-frame-length 61 \
    --video-fps 15 \
    --infer-steps 50 \
    --output-path $OutputPath
12 changes: 12 additions & 0 deletions shscripts/inference_hunyuan_src.sh
@@ -0,0 +1,12 @@
# You can increase the --video-size to <720 1280> if your GPU has about 60GB memory. The current setting requires about 45GB GPU memory.
python scripts/inference_hunyuan_src.py \
--video-size 544 960 \
--video-length 129 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results/hunyuan \
--model-base ./checkpoints/hunyuan \
--dit-weight ./checkpoints/hunyuan/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \
--seed 43 # You may change the seed to get different results using the same prompt