llava_trainer.py: Type and Attribute Error #8

Open

asad14053 opened this issue Oct 29, 2024 · 10 comments

@asad14053 commented Oct 29, 2024

Traceback (most recent call last):
  File "/Video-XL/videoxl/videoxl/train/llava_trainer.py", line 252, in compute_loss
    if "retrieval_span" in inputs:
TypeError: argument of type 'NoneType' is not iterable

Traceback (most recent call last):
  File "Video-XL/videoxl/videoxl/train/llava_trainer.py", line 268, in _prepare_inputs
    inputs.pop("length", None)
AttributeError: 'NoneType' object has no attribute 'pop'

Traceback (most recent call last):
  File "Video-XL/videoxl/videoxl/model/language_model/llava_qwen.py", line 1694, in forward
    if input_ids.shape[1] != 1:
AttributeError: 'NoneType' object has no attribute 'shape'

I tried three DeepSpeed ZeRO configs (zero3.json, zero3_offload.json, and zero2_offload.json), and the errors still occur.

May I know the fix for these errors?
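
All three tracebacks point to the trainer receiving inputs=None, so one way to confirm is to pull a single batch from the dataloader (a minimal sketch; trainer stands in for the LLaVATrainer instance built by the training script):

# Hypothetical sanity check: grab one batch from the training dataloader
# and confirm it is a dict with a real "input_ids" tensor, not None.
loader = trainer.get_train_dataloader()
batch = next(iter(loader))
print(type(batch))                     # expect a dict-like mapping, not NoneType
if batch is not None:
    print(batch.get("input_ids", None))  # expect a LongTensor, not None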

@shuyansy (Collaborator)

It seems like your input_ids is None. I guess there must be something wrong with your input. Could you check your training data format, or show me a sample of your training data?
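
If it helps, a quick structural check along these lines could flag malformed samples (a minimal sketch, assuming the file holds a list of records in the train_example.json layout; the path is a placeholder):

import json

# Hypothetical validator for the train_example.json layout: every record
# needs "id", "video", and a non-empty "conversations" list of turns.
with open("train_data.json") as f:           # placeholder path
    samples = json.load(f)

for i, sample in enumerate(samples):
    for key in ("id", "video", "conversations"):
        if key not in sample:
            print(f"sample {i}: missing key {key!r}")
    if not sample.get("conversations"):
        print(f"sample {i}: empty conversations list")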

@asad14053 (Author)

> It seems like your input_ids is None. I guess there must be something wrong with your input. Could you check your training data format, or show me a sample of your training data?

This is my input JSON format; I exactly followed the format in https://github.com/VectorSpaceLab/Video-XL/blob/main/assets/train_example.json.

{
    "id": "1",
    "video": "chicken_1.avi",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nQuestion: What are the main activities that take place in the video?"
        },
        {
            "from": "gpt",
            "value": "A video featuring a surveillance camera monitoring three chicken cages...."
        }
    ]
}

Are there any other input formats?

@shuyansy (Collaborator)

Thanks for providing the sample. Could you check whether "\n" exists in the {"from": "human", "value": ...} pairs?

@asad14053 (Author) commented Oct 30, 2024

> Thanks for providing the sample. Could you check whether "\n" exists in the {"from": "human", "value": ...} pairs?

Yes, '\n' exists only in the {"from": "human", "value": ...} pairs for every input. Is that what you mean? If not, could you please provide an example of the ideal input format for a video?

And the error still occurs. Do I need to put '\n' at the end of every value, like "value": ".......\n"?

Please also confirm whether I have to use '\n' in the {"from": "gpt", "value": ...} pairs.

@shuyansy (Collaborator)

Actually, I mean the image token, like this. Meanwhile, did you use all the data I provided?
[Screenshot 2024-10-31: example of the image-token format]
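
For reference, a human turn with the image token would look like this (a reconstruction following the train_example.json convention):

{
    "from": "human",
    "value": "<image>\nQuestion: What are the main activities that take place in the video?"
}

And a quick way to scan a data file for the prefix (a minimal sketch; the path is a placeholder):

import json

# Hypothetical check: flag samples whose first human turn does not start
# with the "<image>\n" prefix expected by the LLaVA-style preprocessing.
with open("train_data.json") as f:           # placeholder path
    samples = json.load(f)

for i, sample in enumerate(samples):
    human_turns = [t for t in sample.get("conversations", [])
                   if t.get("from") == "human"]
    if human_turns and not human_turns[0].get("value", "").startswith("<image>\n"):
        print(f"sample {i}: first human turn is missing the '<image>\\n' prefix")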

@asad14053 (Author) commented Oct 31, 2024

> Actually, I mean the image token, like this. Meanwhile, did you use all the data I provided? [Screenshot 2024-10-31: example of the image-token format]

Thanks for your prompt reply.
Yes, I applied this "<image>\n" format to all input pairs ({"from": "human", "value": ...}) during my fine-tuning.

I am fine-tuning on a custom dataset. Could you please elaborate on what you mean by "using all the data"?
I know it may sound dumb, but did you get a chance to recheck your installation and fine-tuning steps on another computer or test environment?

@shuyansy (Collaborator)

OK, I will try fine-tuning on other machines, and I will upload some of the data I used in the coming days. Then you can first train on my data; if that works, you can debug what is wrong with your custom data.

@asad14053 (Author)

> OK, I will try fine-tuning on other machines, and I will upload some of the data I used in the coming days. Then you can first train on my data; if that works, you can debug what is wrong with your custom data.

Thanks for your reply.
Can you explain once more the input format for video-data fine-tuning?

@shuyansy (Collaborator)

I think you can check this part of the code (videoxl/videoxl/model/llava_arch.py). For video input, the video tensor shape should be (N, 3, 144, 144), where N is the number of frames.

def encode_multimodals(self, videos_or_images, video_idx_in_batch, split_sizes=None):
    # Encode all frames/images through the vision tower in one flat batch.
    videos_or_images_features = self.get_model().get_vision_tower()(videos_or_images)
    # Split the flat batch back into per-video / per-image feature chunks.
    per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0)  # tuple, (dim_1, 576, 4096)
    all_videos_or_images_features = []

    for idx, feat in enumerate(per_videos_or_images_features):
        # Project vision features into the language model's embedding space.
        feat = self.get_model().mm_projector(feat)
        # Post pooling: only video features are 2D-pooled to cut token count.
        if idx in video_idx_in_batch:
            feat = self.get_2dPool(feat)
        all_videos_or_images_features.append(feat)
    return all_videos_or_images_features
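
To make the expected shape concrete, a dummy video input could be built like this (a minimal sketch; the frame count is a placeholder):

import torch

# Hypothetical shape check: one video clip of N sampled frames,
# 3 RGB channels, 144x144 resolution, i.e. the (N, 3, 144, 144) contract.
num_frames = 32                               # placeholder frame count
video = torch.randn(num_frames, 3, 144, 144)
assert video.shape == (num_frames, 3, 144, 144)

# When videos and images share one batch, split_sizes records how many
# frames belong to each item so torch.split can separate them again.
split_sizes = [num_frames]                    # a batch containing one video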

@shuyansy (Collaborator)

Hi, I am not sure whether the issue has been settled. I have released some training data, which you can use to fine-tune the model. Thanks once again for your patient waiting.
