Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen @ TIGER-Lab
- [2025/3/4] Release of the ABC Paper, along with the first release of our 🤗 Model and Datasets on Hugging Face (more to come, stay tuned!).
ABC's Design
We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions.
ABC is designed to give the user maximum control over how images are represented in embeddings. If you need to use natural language to specify which aspects of an image you want emphasized and represented, ABC is the perfect model for you!
The key behind ABC's training is that we pretrain the model on a large dataset of difficult embedding samples, where each batch contains many candidates that are relevant but not quite correct. The pretrained model is therefore able to generate embeddings that capture subtle differences. After a short finetuning stage, the model is ideal for tasks like VQA, where differences in user instructions result in different correct answers.
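Concretely, this pretraining objective is the usual in-batch contrastive setup. The snippet below is a minimal, illustrative InfoNCE-style sketch (not the repository's actual training code), assuming precomputed query and candidate embeddings where the mined hard negatives are simply the other candidates in the batch:

```python
# Minimal sketch of an in-batch contrastive (InfoNCE-style) objective, where the mined
# hard negatives in each batch act as the "relevant but not quite correct" candidates.
# Illustrative only -- not the repository's actual training code.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, cand_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # query_emb, cand_emb: (batch, dim); row i of cand_emb is the positive for query i,
    # and every other row in the batch acts as a (hard) negative.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    logits = q @ c.T / temperature                      # (batch, batch) cosine similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Random tensors standing in for model outputs:
loss = info_nce_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```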
ABC produces high-quality embeddings: it achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on zero-shot classification and VQA tasks in the Massive Multimodal Embedding Benchmark.
Model | Supports Instructions | Base Model | Training Dataset |
---|---|---|---|
ABC-Qwen2VL-Instruct | ✅ | ABC-Qwen2VL-Pretrain | TIGER-Lab/ABC-VG-Instruct |
ABC-Qwen2VL-Pretrain | ❌ | Qwen2VL-Instruct | TIGER-Lab/ABC-Pretrain |
- ABC-VG-Instruct: A custom dataset for multimodal finetuning. Contains multiple instructions per image, each corresponding to a different aspect of that image.
- ABC-Pretrain: Multimodal pretraining dataset with mined negatives. Both datasets can be pulled from the Hugging Face Hub, as sketched below.
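If the datasets are served from the Hub under the IDs listed in the table above, loading them should look roughly like the following sketch (the exact splits and column layout may differ; check the dataset cards):

```python
# Sketch: loading the datasets from the Hugging Face Hub with the `datasets` library.
# Split names and column layout are assumptions -- see the dataset cards for the real schema.
from datasets import load_dataset

pretrain = load_dataset("TIGER-Lab/ABC-Pretrain")      # pretraining pairs with mined negatives
instruct = load_dataset("TIGER-Lab/ABC-VG-Instruct")   # instruction finetuning data

print(pretrain)   # inspect the available splits and columns
print(instruct)
```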
Install Dependencies:
git clone <repository-url>
cd ABC
pip install -r requirements.txt
Start making multimodal embeddings!
python -i ./quick_start.py
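Once you have embeddings, downstream use is just nearest-neighbor search over them. Purely as an illustration (the vectors below are random stand-ins and `rank_by_cosine` is a hypothetical helper, not part of the repository), a ranking step might look like:

```python
# Hypothetical retrieval sketch: rank candidate embeddings against a query embedding
# by cosine similarity. The random vectors stand in for real model outputs;
# quick_start.py shows how to produce actual ABC embeddings.
import numpy as np

def rank_by_cosine(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))                  # candidate indices, best match first

query = np.random.randn(1024)                    # e.g. an instruction-conditioned image embedding
candidates = np.random.randn(100, 1024)          # e.g. caption embeddings
print(rank_by_cosine(query, candidates)[:5])     # indices of the top-5 candidates
```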
Check out our paper for additional evaluations!
🚧
I'm currently figuring out the best way to make this as easy as possible (some of our data is quite large [~300 GB of images]).
Please check back in a few days! Hopefully I can serve the whole dataset off Hugging Face.
🚧
🚧
CtrlBench is a benchmark we constructed to measure how well a model can interleave visual and natural language features. We found that many existing "multimodal" tasks are solvable by looking at only the text or only the image. CtrlBench is designed as a retrieval task that requires combining modalities to solve, measuring the model's ability to output truly "multimodal" embeddings. See our paper for a more detailed description of the design and motivations behind CtrlBench.
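As a hedged illustration of the kind of metric such a retrieval setup implies (this is not CtrlBench's official evaluation code), recall@k over paired query/candidate embeddings could be computed like this:

```python
# Sketch of a recall@k computation for an embedding retrieval benchmark, where the
# correct candidate for query i is candidate i. Illustrative only -- see the paper
# and repository for CtrlBench's official data and evaluation protocol.
import torch
import torch.nn.functional as F

def recall_at_k(query_emb: torch.Tensor, cand_emb: torch.Tensor, k: int = 1) -> float:
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    sims = q @ c.T                                    # (num_queries, num_candidates)
    topk = sims.topk(k, dim=-1).indices               # k closest candidates per query
    targets = torch.arange(q.size(0)).unsqueeze(-1)   # ground-truth index per query
    return (topk == targets).any(dim=-1).float().mean().item()

# Random stand-ins for (image + instruction) query embeddings and text candidates:
print(recall_at_k(torch.randn(50, 1024), torch.randn(50, 1024), k=5))
```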
🚧
If you find this work helpful, please consider citing:
@misc{schneider2025abcachievingbettercontrol,
title={ABC: Achieving Better Control of Multimodal Embeddings using VLMs},
author={Benjamin Schneider and Florian Kerschbaum and Wenhu Chen},
year={2025},
eprint={2503.00329},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.00329},
}