Thanks for this amazing work. In the Mantis paper, the importance of image denotation with numbering and ordering is highlighted as below:
"Interleaving Text-Image: A proper text-image interleaving format can help acquire multi-image understanding and reasoning ability. We contend that a good text-image interleaving format should: (1) mark boundaries between images clearly, and (2) denote the serial number of images. Following this principle, we designed our interleaving format as follows: "(image {i}: <BOI><image><EOI>)", where <BOI> is the begin of image token and <EOI> is the end of image token. <image> is the placeholder for image patches. This format adds clear separators between images, and gives serialized information of the image through "image {i}". In practice, we set <BOI> and <EOI> to be and respectively"
However, I am unable to find such serial number based denotation in the processed prompt for Mantis-idefics2:
'<s> User:<fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image> What cities image 1, image 2, and image 3 belong to respectively? Answer me in order.<end_of_utterance> \nAssistant:'
Kindly clarify this understanding.
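For reference, the interleaving format described in the quoted passage can be sketched roughly as below. This is only an illustration of the "(image {i}: <BOI><image><EOI>)" template, not the Mantis code itself. The function name `interleave_images` is hypothetical, and the `boi` / `eoi` defaults are literal placeholders, since the actual token strings are not visible in the quote above.

```python
def interleave_images(num_images, question, boi="<BOI>", eoi="<EOI>",
                      image_placeholder="<image>"):
    """Build a prompt following the "(image {i}: <BOI><image><EOI>)" template.

    boi / eoi are placeholder strings (assumption): substitute whatever
    begin-of-image and end-of-image tokens the model was trained with.
    """
    # One numbered, clearly delimited segment per image, 1-indexed as in the paper.
    segments = [
        f"(image {i}: {boi}{image_placeholder}{eoi})"
        for i in range(1, num_images + 1)
    ]
    return " ".join(segments) + " " + question


if __name__ == "__main__":
    print(interleave_images(3, "What cities do image 1, image 2, and image 3 belong to?"))
```

Each segment carries both an explicit boundary and a serial number, which is the part I cannot find in the Mantis-idefics2 prompt.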
Since Idefics2 is already a well-trained model, we did not add those image denotations during training. Actually, Idefics2 has <fake_token_around_image>, which can serve the same role.
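As a rough way to check this on a processed prompt, the sketch below counts the image blocks that <fake_token_around_image> separates. It assumes each image shows up as one contiguous run of <image> placeholders, as in the prompt posted above. The helper names (`image_block_lengths`, `build_sample_prompt`) and the tiny run length used in the demo are made up for illustration and are not part of Idefics2 or Mantis.

```python
import re

FAKE = "<fake_token_around_image>"
IMAGE = "<image>"


def image_block_lengths(prompt):
    """Return the number of <image> placeholders in each contiguous run.

    Assumption: one run corresponds to one image, with FAKE tokens
    (or other text) separating consecutive runs.
    """
    runs = re.findall(r"(?:" + re.escape(IMAGE) + r")+", prompt)
    return [run.count(IMAGE) for run in runs]


def build_sample_prompt(num_images, placeholders_per_image, question):
    """Build a toy prompt shaped like the one in this thread (hypothetical helper)."""
    body = FAKE + "".join(IMAGE * placeholders_per_image + FAKE
                          for _ in range(num_images))
    return f"<s> User:{body} {question}<end_of_utterance> \nAssistant:"


if __name__ == "__main__":
    prompt = build_sample_prompt(3, 4, "What cities are these?")
    lengths = image_block_lengths(prompt)
    print(len(lengths), lengths)
```

The order of the runs gives the image order, so "image 1", "image 2", and so on in the question map onto the blocks positionally.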