llava_trainer.py: Type and Attribute Error #8

Open

asad14053 opened this issue Oct 29, 2024 · 10 comments

@asad14053 commented Oct 29, 2024

Traceback (most recent call last):
  File "/Video-XL/videoxl/videoxl/train/llava_trainer.py", line 252, in compute_loss
    if "retrieval_span" in inputs:
TypeError: argument of type 'NoneType' is not iterable

Traceback (most recent call last):
  File "Video-XL/videoxl/videoxl/train/llava_trainer.py", line 268, in _prepare_inputs
    inputs.pop("length", None)
AttributeError: 'NoneType' object has no attribute 'pop'

Traceback (most recent call last):
  File "Video-XL/videoxl/videoxl/model/language_model/llava_qwen.py", line 1694, in forward
    if input_ids.shape[1] != 1:
AttributeError: 'NoneType' object has no attribute 'shape'

I tried three DeepSpeed ZeRO configs (zero3.json, zero3_offload.json, and zero2_offload.json), and the errors still occur.

May I know the fix for these errors?
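
All three tracebacks point to the trainer receiving inputs=None, so one way to confirm is to pull a single batch from the dataloader (a minimal sketch; trainer stands in for the LLaVATrainer instance built by the training script):

# Hypothetical sanity check: grab one batch from the training dataloader
# and confirm it is a dict with a real "input_ids" tensor, not None.
loader = trainer.get_train_dataloader()
batch = next(iter(loader))
print(type(batch))                     # expect a dict-like mapping, not NoneType
if batch is not None:
    print(batch.get("input_ids", None))  # expect a LongTensor, not None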

@shuyansy (Collaborator)

It seems like your input_ids is None. I guess there must be something wrong with your input. Could you check your training data format, or show me a sample of your training data?
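
If it helps, a quick structural check along these lines could flag malformed samples (a minimal sketch, assuming the file holds a list of records in the train_example.json layout; the path is a placeholder):

import json

# Hypothetical validator for the train_example.json layout: every record
# needs "id", "video", and a non-empty "conversations" list of turns.
with open("train_data.json") as f:           # placeholder path
    samples = json.load(f)

for i, sample in enumerate(samples):
    for key in ("id", "video", "conversations"):
        if key not in sample:
            print(f"sample {i}: missing key {key!r}")
    if not sample.get("conversations"):
        print(f"sample {i}: empty conversations list")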

@asad14053 (Author)

> It seems like your input_ids is None. I guess there must be something wrong with your input. Could you check your training data format, or show me a sample of your training data?

This is my input JSON format; I exactly followed the format in https://github.com/VectorSpaceLab/Video-XL/blob/main/assets/train_example.json.

{
    "id": "1",
    "video": "chicken_1.avi",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nQuestion: What are the main activities that take place in the video?"
        },
        {
            "from": "gpt",
            "value": "A video featuring a surveillance camera monitoring three chicken cages...."
        }
    ]
}

Are there any other input formats?

@shuyansy (Collaborator)

Thanks for providing the sample. Could you check whether "\n" exists in the {"from": "human", "value": ...} pairs?

@asad14053 (Author) commented Oct 30, 2024

> Thanks for providing the sample. Could you check whether "\n" exists in the {"from": "human", "value": ...} pairs?

Yes, '\n' exists only in the {"from": "human", "value": ...} pairs for every input. Is that what you mean? If not, could you please provide an example of the ideal input format for a video?

And the error still occurs. Do I need to put '\n' at the end of every value, like "value": ".......\n"?

Please also confirm whether I have to use '\n' in the {"from": "gpt", "value": ...} pairs.

@shuyansy (Collaborator)

Actually, I mean the image token, like this. Meanwhile, did you use all the data I provided?
[Screenshot 2024-10-31: example of the image-token format]
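
For reference, a human turn with the image token would look like this (a reconstruction following the train_example.json convention):

{
    "from": "human",
    "value": "<image>\nQuestion: What are the main activities that take place in the video?"
}

And a quick way to scan a data file for the prefix (a minimal sketch; the path is a placeholder):

import json

# Hypothetical check: flag samples whose first human turn does not start
# with the "<image>\n" prefix expected by the LLaVA-style preprocessing.
with open("train_data.json") as f:           # placeholder path
    samples = json.load(f)

for i, sample in enumerate(samples):
    human_turns = [t for t in sample.get("conversations", [])
                   if t.get("from") == "human"]
    if human_turns and not human_turns[0].get("value", "").startswith("<image>\n"):
        print(f"sample {i}: first human turn is missing the '<image>\\n' prefix")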

@asad14053 (Author) commented Oct 31, 2024

> Actually, I mean the image token, like this. Meanwhile, did you use all the data I provided? [Screenshot 2024-10-31: example of the image-token format]

Thanks for your prompt reply.
Yes, I applied this "<image>\n" format to all input pairs ({"from": "human", "value": ...}) during my fine-tuning.

I am fine-tuning on a custom dataset. Could you please elaborate on what you mean by "using all the data"?
I know it may sound dumb, but did you get a chance to recheck your installation and fine-tuning steps on another computer or test environment?

@shuyansy (Collaborator)

OK, I will try fine-tuning on other machines, and I will upload some of the data I used in the coming days. Then you can first train on my data; if that works, you can debug what is wrong with your custom data.

@asad14053 (Author)

> OK, I will try fine-tuning on other machines, and I will upload some of the data I used in the coming days. Then you can first train on my data; if that works, you can debug what is wrong with your custom data.

Thanks for your reply.
Can you explain once more the input format for video-data fine-tuning?

@shuyansy (Collaborator)

I think you can check this part of the code (videoxl/videoxl/model/llava_arch.py). For video input, the video tensor shape should be (N, 3, 144, 144), where N is the number of frames.

def encode_multimodals(self, videos_or_images, video_idx_in_batch, split_sizes=None):
    # Encode all frames/images through the vision tower in one flat batch.
    videos_or_images_features = self.get_model().get_vision_tower()(videos_or_images)
    # Split the flat batch back into per-video / per-image feature chunks.
    per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0)  # tuple, (dim_1, 576, 4096)
    all_videos_or_images_features = []

    for idx, feat in enumerate(per_videos_or_images_features):
        # Project vision features into the language model's embedding space.
        feat = self.get_model().mm_projector(feat)
        # Post pooling: only video features are 2D-pooled to cut token count.
        if idx in video_idx_in_batch:
            feat = self.get_2dPool(feat)
        all_videos_or_images_features.append(feat)
    return all_videos_or_images_features
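
To make the expected shape concrete, a dummy video input could be built like this (a minimal sketch; the frame count is a placeholder):

import torch

# Hypothetical shape check: one video clip of N sampled frames,
# 3 RGB channels, 144x144 resolution, i.e. the (N, 3, 144, 144) contract.
num_frames = 32                               # placeholder frame count
video = torch.randn(num_frames, 3, 144, 144)
assert video.shape == (num_frames, 3, 144, 144)

# When videos and images share one batch, split_sizes records how many
# frames belong to each item so torch.split can separate them again.
split_sizes = [num_frames]                    # a batch containing one video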

@shuyansy (Collaborator)

Hi, I am not sure whether the issue has been settled. I have released some training data, which you can use to fine-tune the model. Thanks once again for your patient waiting.
