# FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
[arXiv] [Project Page]
FrameFusion reduces the number of tokens in Large Vision-Language Models (LVLMs) by combining similarity-based merging with importance-based pruning. It achieves a 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups with minimal performance impact.
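To give a feel for the two stages, here is a simplified, self-contained sketch. It is not the library's actual implementation: the function names, the running-average merge, and the top-k pruning rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens, similarity_lower_bound=0.6):
    """Similarity stage (sketch): fold each vision token into its temporal
    predecessor via a running average when their cosine similarity
    exceeds the threshold."""
    merged, counts = [tokens[0]], [1]
    for tok in tokens[1:]:
        if F.cosine_similarity(tok, merged[-1], dim=0) > similarity_lower_bound:
            merged[-1] = (merged[-1] * counts[-1] + tok) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            merged.append(tok)
            counts.append(1)
    return torch.stack(merged)

def prune_by_importance(tokens, importance, keep_ratio=0.3):
    """Importance stage (sketch): keep only the top-k tokens ranked by an
    importance score, e.g. attention received from the text tokens."""
    k = max(1, int(keep_ratio * tokens.size(0)))
    idx = importance.topk(k).indices.sort().values  # preserve temporal order
    return tokens[idx]

# toy run: 100 vision tokens of dimension 16 with random importance scores
tokens = torch.randn(100, 16)
tokens = merge_similar_tokens(tokens)
tokens = prune_by_importance(tokens, torch.rand(tokens.size(0)))
print(tokens.shape)
```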
## Installation

Create a new environment:

```bash
conda create -n framefusion python=3.10
conda activate framefusion
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Install FrameFusion:

```bash
pip install -e .
```
To use the LLaVA-Video LVLM, you also need to install its dependencies. We recommend cloning the official repository and then installing it with `pip install -e .` from the repository root.
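For example (the repository URL below is our assumption of where LLaVA-Video currently lives, inside the LLaVA-NeXT codebase; check the official project page to be sure):

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
cd LLaVA-NeXT
pip install -e .
```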
## Quick Start

We provide an example with the LLaVA-Video-7B model that runs inference on a video, with and without FrameFusion, in `script/playground/example_llava.py`:

```bash
python script/playground/example_llava.py
```
You can apply FrameFusion to any Hugging Face model that supports the interface with a few lines of code. Here is an example:

```python
from llava.model.builder import load_pretrained_model
from framefusion.interface import apply_framefusion

# set attn_implementation to sdpa
tokenizer, model, image_processor, max_length = load_pretrained_model(
    "lmms-lab/LLaVA-Video-7B-Qwen2",
    None,
    "llava_qwen",
    torch_dtype="bfloat16",
    attn_implementation="sdpa",
    device_map="auto",
)

# apply FrameFusion
apply_framefusion(model, cost=0.3, similarity_lower_bound=0.6, ratio_lower_bound=0.1)

# use the model as usual
```
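After `apply_framefusion`, generation works exactly as it does for the unmodified model. Below is a minimal sketch of video inference; the preprocessing helpers and `generate` arguments follow the lmms-lab LLaVA codebase but are assumptions here, so treat `script/playground/example_llava.py` as the authoritative version.

```python
import torch
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX

# `video` is assumed to be a (num_frames, H, W, C) uint8 array loaded beforehand
frames = image_processor.preprocess(video, return_tensors="pt")["pixel_values"]
frames = frames.to(model.device, dtype=torch.bfloat16)

prompt = "<image>\nDescribe this video in detail."
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids, images=[frames], modalities=["video"], max_new_tokens=128
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```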
## Code Structure

- `framefusion/`: The main package for FrameFusion.
  - `models/`: The adapters for different models.
  - `main.py`: The main implementation of FrameFusion.
  - `interface.py`: The interface for applying FrameFusion.
- `scripts/`: Scripts for running experiments.
  - `evaluate/`: Scripts for evaluating model performance.
  - `playground/`: Scripts for running misc experiments.
- `example/`: Example input videos.
## Apply FrameFusion to New Models

- Add a new model adapter in `framefusion/models/`; it applies FrameFusion after the attention module. Three model functions are required: `llm_forward`, `decoder_forward`, and `attention_forward`. The forward functions are easily adapted from the corresponding `modeling_<MODEL>.py` functions in Hugging Face Transformers, and all modifications are marked with `###` comments. For the LLM, see `framefusion/models/qwen2/modeling_qwen2.py` as an example. A schematic of the adapter layout is sketched after this list.
- Register the model in `framefusion/interface.py`; this applies FrameFusion to the correct model class.
- Add a new example in `script/playground/` showing how to apply FrameFusion to the model.
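As a rough illustration of the adapter layout, here is a schematic, not a working adapter: the signatures are abbreviated, and the `self.framefusion` hook is a placeholder for however the adapter wires FrameFusion in. Copy the real signatures from the corresponding `modeling_<MODEL>.py`.

```python
# Schematic only: real signatures come from huggingface's modeling_<MODEL>.py.

def llm_forward(self, input_ids=None, inputs_embeds=None, **kwargs):
    ### MODIFIED: track which positions are vision tokens so FrameFusion
    ### knows what it may merge and prune
    ...

def decoder_forward(self, hidden_states, attention_mask=None, **kwargs):
    hidden_states, attn_weights = self.self_attn(hidden_states, **kwargs)
    ### MODIFIED: FrameFusion runs right after the attention module
    hidden_states = self.framefusion(hidden_states, attn_weights)  # placeholder hook
    ...

def attention_forward(self, hidden_states, **kwargs):
    ### MODIFIED: expose the attention scores used as the importance signal
    ...
```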
If you have any questions about applying FrameFusion to a new model, please feel free to open an issue. We are happy to help and to expand the adapters to more models.