OOM issue #27

Open · Yaqing2023 opened this issue Dec 12, 2024 · 13 comments

@Yaqing2023

The current implementation in inference_basic.py has

video_frames = pipeline(
    image=validation_image,
    image_pose=validation_control_images,
    ...
)

which passes all pose images to a single pipeline call. This requires a significant amount of GPU memory and limits the length of the video. I tried splitting validation_control_images into smaller chunks (e.g., 40 images per sublist) and running the pipeline in a for loop. This works well for me and the final video quality looks good. I'm not sure whether this affects ID consistency, but from what I tried the results are good.

@nitinmukesh

@Yaqing2023

Could you please share the updated code with batch processing? It would be useful.

@Yaqing2023
Author

In inference_basic.py, first add a function:

def split_into_sublists(image_list, max_images_per_list):
    sublists = [image_list[i:i + max_images_per_list] for i in range(0, len(image_list), max_images_per_list)]

    # If the last sublist has fewer than 10 images, merge it into the previous one
    if len(sublists[-1]) < 10 and len(sublists) > 1:
        sublists[-2].extend(sublists[-1])
        sublists.pop()  # Remove the now-merged last sublist

    return sublists
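
For example, 90 pose frames split into chunks of [40, 40, 10], while a tail shorter than 10 frames is merged into the previous chunk:

chunks = split_into_sublists(list(range(90)), 40)
print([len(c) for c in chunks])   # [40, 40, 10]

# An 85-frame list would leave a 5-frame tail, which gets merged:
print([len(c) for c in split_into_sublists(list(range(85)), 40)])  # [40, 45]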

then change the pipeline call to the following:

# Hard-code the sublist size to 40 images
max_images_per_list = 40
pose_sublists = split_into_sublists(validation_control_images, max_images_per_list)
all_video_frames = []
print("Total sublist groups: ", len(pose_sublists))
for pose_sublist in pose_sublists:
    sub_num_frames = len(pose_sublist)
    video_frames = pipeline(
        image=validation_image,
        image_pose=pose_sublist,
        height=args.height,
        width=args.width,
        num_frames=sub_num_frames,
        tile_size=args.tile_size,
        tile_overlap=args.frames_overlap,
        decode_chunk_size=decode_chunk_size,
        motion_bucket_id=127.,
        fps=7,
        min_guidance_scale=args.guidance_scale,
        max_guidance_scale=args.guidance_scale,
        noise_aug_strength=args.noise_aug_strength,
        num_inference_steps=args.num_inference_steps,
        generator=generator,
        output_type="pil",
        validation_image_id_ante_embedding=validation_image_id_ante_embedding,
    ).frames[0]
    all_video_frames.extend(video_frames)

out_file = os.path.join(args.output_dir, "animation_video.mp4")

# Convert the accumulated PIL frames to numpy arrays for export
for i in range(len(all_video_frames)):
    all_video_frames[i] = np.array(all_video_frames[i])

png_out_file = os.path.join(args.output_dir, "animated_images")
os.makedirs(png_out_file, exist_ok=True)
export_to_gif(all_video_frames, out_file, 8)
save_frames_as_png(all_video_frames, png_out_file)

@nitinmukesh

@Yaqing2023

Thank you very much.
Could you attach the updated file? I am not a developer, just trying out these AI apps.

@Yaqing2023
Author

inference_basic.py.txt

@Francis-Rings
Owner

> In inference_basic.py, first add a function split_into_sublists(...), then change the pipeline call accordingly (full code quoted from the comment above).

Cool! Thank you for your contribution!

@nitinmukesh

@Yaqing2023

Appreciate your help, thank you.

@Francis-Rings

The existing code works fine on low VRAM at 512 x 512. If the changes contributed by Yaqing can be incorporated into the code behind a --batch_processing True flag, it will help many.

@Francis-Rings
Owner

> The existing code works fine on low VRAM at 512 x 512. If the changes contributed by Yaqing can be incorporated into the code behind a --batch_processing True flag, it will help many.

Sure, I will check whether the modified pipeline affects the performance of StableAnimator. If it maintains the original performance while reducing GPU memory consumption, I’ll incorporate it into the inference code.
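
As a rough sketch of what that flag could look like (hypothetical argument names, assuming inference_basic.py already builds an argparse parser; store_true is the idiomatic form of --batch_processing True):

parser.add_argument("--batch_processing", action="store_true",
                    help="Split the pose frames into sublists to reduce peak VRAM")
parser.add_argument("--max_images_per_list", type=int, default=40,
                    help="Pose frames per sublist when --batch_processing is set")
args = parser.parse_args()

if args.batch_processing:
    pose_sublists = split_into_sublists(validation_control_images, args.max_images_per_list)
else:
    pose_sublists = [validation_control_images]  # original single-batch behavior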

@nitinmukesh

It increases the processing time tenfold.

@Yaqing2023
Author

It is indeed a trade-off between low memory and execution time. Many of us may have only a single 4090 GPU, which can generate just a few seconds of video. This change removes that limit.
The other memory bottleneck I removed later is the video_frames list. The current code stores all frames and runs tensor2vid on them in one batch, which can also cause OOM. I changed that to small batches and write the frames to disk immediately, so I don't need to keep the huge video_frames list in memory. This way I can handle driving videos of basically any length without memory limitations.
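
A minimal sketch of that second change, with a hypothetical helper (not the author's exact code): save each batch's frames as soon as the pipeline returns them, then drop the references and clear the CUDA cache so the full frame list never lives in memory:

import os
import torch

def save_batch_frames(frames, out_dir, start_index):
    # Write one batch of PIL frames to disk; return the next global frame index
    os.makedirs(out_dir, exist_ok=True)
    for offset, frame in enumerate(frames):
        frame.save(os.path.join(out_dir, f"frame_{start_index + offset:04d}.png"))
    return start_index + len(frames)

frame_index = 0
for pose_sublist in pose_sublists:
    video_frames = pipeline(
        image=validation_image,
        image_pose=pose_sublist,
        num_frames=len(pose_sublist),
        # ... remaining arguments as in the loop above ...
        output_type="pil",
    ).frames[0]
    frame_index = save_batch_frames(video_frames, "basic_infer/animated_images", frame_index)
    del video_frames              # no growing all_video_frames list
    torch.cuda.empty_cache()      # release cached GPU memory between batches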

@nitinmukesh

nitinmukesh commented Dec 14, 2024

and here I am, using this on a 4060 with 8 GB VRAM + 8 GB shared. :)
Please share the updated code if you can; I would love to try it.
The approach of writing to disk is also good. I guess that if each image is saved to disk as soon as it is processed, and the torch cache is cleared, VRAM usage will drop significantly. The frames can then be merged using ffmpeg, which takes only a few seconds. Unfortunately, I am not a developer, so I can't code these approaches myself.

Here is the benchmark for the video I am trying
Duration : 12 s 423 ms
Frame rate : 30.000 FPS
Resolution : 512 x 512

Inference time
100%|██████████████| 775/775 [35:14<00:00, 2.73s/it]

[Screenshot 2024-12-14 133418]

After inference it does some additional processing (I will post the total processing time). I think this is where saving each image to disk would save a lot of VRAM.

[Screenshot 2024-12-14 140100]

Total Inference Time: 0:53:49.142157

@Yaqing2023
Author

inference_basic.py.txt
@nitinmukesh this generates images in basic_infer/animated_images/ directly for each batch of frames, instead of waiting until the end. You can then run ffmpeg manually to create an mp4 file from the frames in basic_infer/animated_images/.
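
For example, assuming the frames are written as zero-padded PNGs like frame_0000.png (hypothetical naming; adjust the filename pattern and -framerate to match the actual output):

ffmpeg -framerate 8 -i basic_infer/animated_images/frame_%04d.png -c:v libx264 -pix_fmt yuv420p animation_video.mp4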

@nitinmukesh

Thank you very much, @Yaqing2023. Appreciate your help.
Will try it now.

@Yaqing2023
Author

The intention is not to speed up the process but to save memory, so that less powerful machines can run this project.
