Image Denotation for Mantis-Idefics2 #24

schwarzwalder93 · 2025-02-11T17:52:39Z

Hi Team,

Thanks for this amazing work. In the Mantis paper, the importance of image denotation with numbering and ordering is highlighted as below:

"Interleaving Text-Image: A proper text-image interleaving format can help acquire multi-image understanding and reasoning ability. We contend that a good text-image interleaving format should: (1) mark boundaries between images clearly, and (2) denote the serial number of images. Following this principle, we designed our interleaving format as follows: "(image {i}: <BOI><image><EOI>)", where <BOI> is the begin of image token and <EOI> is the end of image token. <image> is the placeholder for image patches. This format adds clear separators between images, and gives serialized information of the image through "image {i}". In practice, we set <BOI> and <EOI> to be and respectively"
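For illustration only, the format described in the quote can be sketched as below. This is not taken from the Mantis code base: the function name `format_image_slots` is hypothetical, the images are assumed to be numbered from 1 (matching "image 1, image 2, and image 3" in the question), and the literal strings `<BOI>` and `<EOI>` stand in for the actual tokens, which are not shown in the quoted text.

```python
def format_image_slots(num_images, boi="<BOI>", eoi="<EOI>", placeholder="<image>"):
    """Build one "(image {i}: <BOI><image><EOI>)" slot per image.

    Assumptions (not confirmed by the source): numbering starts at 1, and
    the boi/eoi defaults are literal stand-ins for the real boundary tokens.
    """
    return [
        f"(image {i}: {boi}{placeholder}{eoi})"
        for i in range(1, num_images + 1)
    ]


# Three numbered image slots followed by a question that refers to them.
slots = format_image_slots(3)
prompt = " ".join(slots) + " What cities do image 1, image 2, and image 3 belong to?"
print(prompt)
```

Each slot carries both a boundary (the begin and end tokens) and a serial number, which are the two properties the paper asks of an interleaving format.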

However, I am unable to find any such serial-number-based denotation in the processed prompt for Mantis-Idefics2:

'<s> User:<fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image> What cities image 1, image 2, and image 3 belong to respectively? Answer me in order.<end_of_utterance> \nAssistant:'

Could you kindly clarify whether my understanding is correct?

jdf-prog · 2025-02-13T23:19:32Z

Since Idefics2 is already a well-trained model, we did not add image denotations of that kind during training. Actually, Idefics2 has <fake_token_around_image>, which can serve the same role.
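As a rough sketch of how <fake_token_around_image> marks image boundaries in a processed prompt like the one above, the snippet below counts the `<image>` placeholders in each delimited block. The helper `image_block_sizes` is hypothetical, and the toy prompt uses only two placeholders per image to keep it short. It inspects the string only and says nothing about how the model itself consumes these tokens.

```python
import re

FAKE = "<fake_token_around_image>"
IMAGE = "<image>"


def image_block_sizes(prompt):
    """Return the number of <image> placeholders in each image block.

    A block is a run of <image> placeholders enclosed by the fake token on
    both sides. Adjacent images share one fake token, so the closing token
    is matched with a lookahead and left available for the next block.
    """
    pattern = (
        re.escape(FAKE)
        + "((?:" + re.escape(IMAGE) + ")+)"
        + "(?=" + re.escape(FAKE) + ")"
    )
    return [m.group(1).count(IMAGE) for m in re.finditer(pattern, prompt)]


# Shortened toy prompt: three images, two placeholders each.
toy_prompt = (
    "User:" + (FAKE + IMAGE * 2) * 3 + FAKE
    + " What cities image 1, image 2, and image 3 belong to respectively?"
)
sizes = image_block_sizes(toy_prompt)
print(len(sizes), sizes)
```

The number of blocks gives the number of images, and their order in the list gives the position of each image, even though no explicit "image {i}" label appears in the prompt.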
