
Chinese Text2video retrieval support? #201

Open
KeyaoZhao opened this issue Oct 23, 2024 · 7 comments

@KeyaoZhao

Thank you for contributing such outstanding work. Does InternVideo2 support text-to-video retrieval with Chinese text queries? Which models should I replace the VisionEncoder and TextEncoder with? Or how should we modify our fine-tuning? Thank you very much.

@leexinhao
Collaborator

You could use https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8; it supports Chinese text search!

@KeyaoZhao
Author

KeyaoZhao commented Oct 25, 2024

> You could use https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8; it supports Chinese text search!

Thanks for your reply. I use 'InternVideo2-stage2_1b-224p-f4.pt'+'1B_clip.pth' as the vision encoder, 'chinese_alpaca_lora_7b' as the tokenizer, and 'internvl_c_13b_224px.pth' as the text encoder, but I get this error:

RuntimeError: Error(s) in loading state_dict for InternVideo2_Stage2: size mismatch for text_proj.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 768]).

I found that the vision feature is linearly projected to 512-d, but the text feature is 768-d and cannot be projected to 512-d with these weights, so how can the two matrices be multiplied? Did I do something wrong?
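
For readers hitting the same error, here is a minimal sketch of how such a mismatch could be located before calling `load_state_dict`. It compares parameter shapes from a checkpoint against those of the constructed model. The helper `find_shape_mismatches` and the `other.weight` entry are hypothetical; only the `text_proj.weight` shapes come from the error message. With PyTorch, the two dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (ckpt_shape, model_shape)} for parameters that exist
    in both mappings but have different shapes."""
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# text_proj.weight shapes are copied from the error message;
# "other.weight" is an illustrative entry, not a real parameter name.
ckpt_shapes = {"text_proj.weight": (512, 1024), "other.weight": (4, 4)}
model_shapes = {"text_proj.weight": (512, 768), "other.weight": (4, 4)}

for name, (ckpt_shape, model_shape) in find_shape_mismatches(
    ckpt_shapes, model_shapes
).items():
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Since an `nn.Linear` weight has shape (out_features, in_features), a checkpoint shape of [512, 1024] against a model shape of [512, 768] suggests that the checkpoint's text projection expects 1024-d inputs while the constructed model was configured with a 768-d text encoder.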

@leexinhao
Collaborator

How do you set the ckpt_path? [image]
If you set the InternVideo2_Stage2 checkpoint as vision_ckpt_path, you shouldn't hit a size mismatch on text_proj.weight.

@zoezhu

zoezhu commented Oct 29, 2024

> How do you set the ckpt_path? [image] If you set the InternVideo2_Stage2 checkpoint as vision_ckpt_path, you shouldn't hit a size mismatch on text_proj.weight.

Can you explain this in more detail, please? I am confused about how to load the model with these parameters. Which class should I use to initialize the model?

@KeyaoZhao
Author

KeyaoZhao commented Oct 30, 2024

> How do you set the ckpt_path? [image] If you set the InternVideo2_Stage2 checkpoint as vision_ckpt_path, you shouldn't hit a size mismatch on text_proj.weight.

Thanks, I have already solved the mismatch bug. I now use "InternVideo2_clip" to initialize the model, but the logger prints the following message. Am I loading the model correctly? I ask because I get a different score every time /(ㄒoㄒ)/~~

2024-10-30T09:44:18 | models.internvideo2_clip: Load vision_encoder checkpoint from /root/.cache/huggingface/hub/models--OpenGVLab--InternVideo2-Stage2_1B-224p-f4/snapshots/4362e1f88a992e7edbfd7696f7f78b7f79426dfd/InternVideo2-stage2_1b-224p-f4.pt
2024-10-30T09:44:19 | models.internvideo2_clip: Load text_encoder checkpoint from /workspace/InternVideo/InternVideo/InternVideo2/multi_modality/pretrained/internvl_c_13b_224px.pth
2024-10-30T09:44:34 | models.internvideo2_clip: _IncompatibleKeys(missing_keys=['temp', 'text_encoder.transformer.layers.0.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.0.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.1.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.1.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.2.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.2.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.3.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.3.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.4.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.4.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.5.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.5.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.6.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.6.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.7.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.7.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.8.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.8.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.9.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.9.self_attn.v_proj.base_layer.weight',
'text_encoder.transformer.layers.10.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.10.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.11.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.11.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.12.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.12.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.13.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.13.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.14.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.14.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.15.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.15.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.16.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.16.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.17.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.17.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.18.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.18.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.19.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.19.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.20.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.20.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.21.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.21.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.22.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.22.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.23.self_attn.q_proj.base_layer.weight', 
'text_encoder.transformer.layers.23.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.24.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.24.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.25.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.25.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.26.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.26.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.27.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.27.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.28.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.28.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.29.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.29.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.30.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.30.self_attn.v_proj.base_layer.weight', 'text_encoder.transformer.layers.31.self_attn.q_proj.base_layer.weight', 'text_encoder.transformer.layers.31.self_attn.v_proj.base_layer.weight'], unexpected_keys=['text_encoder.transformer.layers.0.self_attn.q_proj.weight', 'text_encoder.transformer.layers.0.self_attn.v_proj.weight', 'text_encoder.transformer.layers.1.self_attn.q_proj.weight', 'text_encoder.transformer.layers.1.self_attn.v_proj.weight', 'text_encoder.transformer.layers.2.self_attn.q_proj.weight', 'text_encoder.transformer.layers.2.self_attn.v_proj.weight', 'text_encoder.transformer.layers.3.self_attn.q_proj.weight', 'text_encoder.transformer.layers.3.self_attn.v_proj.weight', 'text_encoder.transformer.layers.4.self_attn.q_proj.weight', 'text_encoder.transformer.layers.4.self_attn.v_proj.weight', 'text_encoder.transformer.layers.5.self_attn.q_proj.weight', 'text_encoder.transformer.layers.5.self_attn.v_proj.weight', 
'text_encoder.transformer.layers.6.self_attn.q_proj.weight', 'text_encoder.transformer.layers.6.self_attn.v_proj.weight', 'text_encoder.transformer.layers.7.self_attn.q_proj.weight', 'text_encoder.transformer.layers.7.self_attn.v_proj.weight', 'text_encoder.transformer.layers.8.self_attn.q_proj.weight', 'text_encoder.transformer.layers.8.self_attn.v_proj.weight', 'text_encoder.transformer.layers.9.self_attn.q_proj.weight', 'text_encoder.transformer.layers.9.self_attn.v_proj.weight', 'text_encoder.transformer.layers.10.self_attn.q_proj.weight', 'text_encoder.transformer.layers.10.self_attn.v_proj.weight', 'text_encoder.transformer.layers.11.self_attn.q_proj.weight', 'text_encoder.transformer.layers.11.self_attn.v_proj.weight', 'text_encoder.transformer.layers.12.self_attn.q_proj.weight', 'text_encoder.transformer.layers.12.self_attn.v_proj.weight', 'text_encoder.transformer.layers.13.self_attn.q_proj.weight', 'text_encoder.transformer.layers.13.self_attn.v_proj.weight', 'text_encoder.transformer.layers.14.self_attn.q_proj.weight', 'text_encoder.transformer.layers.14.self_attn.v_proj.weight', 'text_encoder.transformer.layers.15.self_attn.q_proj.weight', 'text_encoder.transformer.layers.15.self_attn.v_proj.weight', 'text_encoder.transformer.layers.16.self_attn.q_proj.weight', 'text_encoder.transformer.layers.16.self_attn.v_proj.weight', 'text_encoder.transformer.layers.17.self_attn.q_proj.weight', 'text_encoder.transformer.layers.17.self_attn.v_proj.weight', 'text_encoder.transformer.layers.18.self_attn.q_proj.weight', 'text_encoder.transformer.layers.18.self_attn.v_proj.weight', 'text_encoder.transformer.layers.19.self_attn.q_proj.weight', 'text_encoder.transformer.layers.19.self_attn.v_proj.weight', 'text_encoder.transformer.layers.20.self_attn.q_proj.weight', 'text_encoder.transformer.layers.20.self_attn.v_proj.weight', 'text_encoder.transformer.layers.21.self_attn.q_proj.weight', 'text_encoder.transformer.layers.21.self_attn.v_proj.weight', 
'text_encoder.transformer.layers.22.self_attn.q_proj.weight', 'text_encoder.transformer.layers.22.self_attn.v_proj.weight', 'text_encoder.transformer.layers.23.self_attn.q_proj.weight', 'text_encoder.transformer.layers.23.self_attn.v_proj.weight', 'text_encoder.transformer.layers.24.self_attn.q_proj.weight', 'text_encoder.transformer.layers.24.self_attn.v_proj.weight', 'text_encoder.transformer.layers.25.self_attn.q_proj.weight', 'text_encoder.transformer.layers.25.self_attn.v_proj.weight', 'text_encoder.transformer.layers.26.self_attn.q_proj.weight', 'text_encoder.transformer.layers.26.self_attn.v_proj.weight', 'text_encoder.transformer.layers.27.self_attn.q_proj.weight', 'text_encoder.transformer.layers.27.self_attn.v_proj.weight', 'text_encoder.transformer.layers.28.self_attn.q_proj.weight', 'text_encoder.transformer.layers.28.self_attn.v_proj.weight', 'text_encoder.transformer.layers.29.self_attn.q_proj.weight', 'text_encoder.transformer.layers.29.self_attn.v_proj.weight', 'text_encoder.transformer.layers.30.self_attn.q_proj.weight', 'text_encoder.transformer.layers.30.self_attn.v_proj.weight', 'text_encoder.transformer.layers.31.self_attn.q_proj.weight', 'text_encoder.transformer.layers.31.self_attn.v_proj.weight'])
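
In the log above, every missing key ends in `.base_layer.weight` while every unexpected key ends in plain `.weight`, for `q_proj` and `v_proj` in layers 0 to 31. The sketch below checks whether the two lists differ only by that infix, which would be consistent with a model whose projections are wrapped by LoRA adapters being loaded from a checkpoint that stores plain weights. The helper `pair_lora_renamed_keys` is hypothetical, and the key lists are rebuilt from the pattern in the log rather than read from a real checkpoint.

```python
def pair_lora_renamed_keys(missing_keys, unexpected_keys, infix=".base_layer"):
    """Pair each unexpected checkpoint key with the missing model key that
    differs from it only by `infix` inserted before the last name component."""
    missing = set(missing_keys)
    pairs = {}
    for key in unexpected_keys:
        stem, dot, leaf = key.rpartition(".")
        candidate = f"{stem}{infix}.{leaf}"
        if dot and candidate in missing:
            pairs[key] = candidate
    # Missing keys that no checkpoint key can account for.
    unmatched_missing = sorted(missing - set(pairs.values()))
    return pairs, unmatched_missing


# Key lists rebuilt from the pattern visible in the log.
prefix = "text_encoder.transformer.layers"
projections = ("q_proj", "v_proj")
missing_keys = ["temp"] + [
    f"{prefix}.{i}.self_attn.{p}.base_layer.weight"
    for i in range(32)
    for p in projections
]
unexpected_keys = [
    f"{prefix}.{i}.self_attn.{p}.weight" for i in range(32) for p in projections
]

pairs, unmatched = pair_lora_renamed_keys(missing_keys, unexpected_keys)
print(len(pairs), unmatched)  # 64 ['temp']
```

When a state dict is loaded non-strictly in PyTorch, parameters reported under `missing_keys` keep whatever values they had at construction. If those values are randomly initialized, that is one plausible explanation for scores that change from run to run.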

@ge35tay

ge35tay commented Dec 5, 2024


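There is a problem with the provided weights: the internVL_13B checkpoint that is provided does not include LoRA, but LLaMA is loaded with use_lora, so all the LoRA-related weights are missing. I hope the authors can check this.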

@leexinhao
Collaborator

We additionally provide a LoRA weight: https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8. Did you load it?
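
One way to answer that question for a given file is to look at the parameter names it contains. The sketch below assumes the adapters follow the PEFT naming convention, where LoRA matrices appear under `lora_A` and `lora_B` name components; that assumption, the helper `split_lora_keys`, and the example key names are all illustrative and not taken from the InternVideo2 code.

```python
def split_lora_keys(keys, markers=("lora_A", "lora_B")):
    """Split state-dict key names into (lora_keys, other_keys), treating a key
    as LoRA-related if any dot-separated component equals one of `markers`."""
    lora_keys, other_keys = [], []
    for key in keys:
        parts = key.split(".")
        if any(marker in parts for marker in markers):
            lora_keys.append(key)
        else:
            other_keys.append(key)
    return lora_keys, other_keys


# Hypothetical key names for illustration only; real checkpoints may differ.
plain_ckpt = ["transformer.layers.0.self_attn.q_proj.weight"]
lora_ckpt = [
    "transformer.layers.0.self_attn.q_proj.lora_A.weight",
    "transformer.layers.0.self_attn.q_proj.lora_B.weight",
]

for name, keys in (("plain", plain_ckpt), ("lora", lora_ckpt)):
    lora_keys, _ = split_lora_keys(keys)
    print(name, "contains LoRA weights:", bool(lora_keys))
```

With PyTorch, the key names could be obtained from the loaded checkpoint dictionary, for example `list(state_dict.keys())`.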
