MME-CoT 🔥🕵️: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
Official repository for "MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency".
🌟 For more details, please refer to the project page with dataset exploration and visualization tools.
[🍓Project Page] [📖 Paper] [📊 Huggingface Dataset] [🏆 Leaderboard] [👁️ Visualization]
- [2025.02.14] 🌟 We are very proud to launch MME-CoT, the first-ever comprehensive CoT evaluation benchmark for LMMs in Visual Reasoning! We release the arXiv paper and all data samples in the Hugging Face dataset.
- Coming this week: MME-CoT evaluation with VLMEvalKit
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation.
In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level.
Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: (1) Models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest quality results; (2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; (3) Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
To calculate the six metrics (precision, recall, efficacy, stability, relevance rate, reflection quality), please follow these steps:
- Install the required packages.
pip install -r requirements.txt
- Format the model answer following the example in `results/json`. The file should be in JSONL format, with each answer to a question on one line; all other fields of the question from the dataset should be preserved in the line. The suffix `_cot.json` denotes answering with the CoT prompt, and `_dir.json` denotes answering with the direct prompt (a minimal sketch of writing such a file is shown after this list).
- Run the evaluation script.
You can either run the metrics one by one. For example, to evaluate recall:
bash scripts/recall.sh
Or you can run all the metrics for all the models in one directory with:
python batch_scripts/run_all.py --result_dir results/json
- Calculate the metrics.
We cache the evaluation results of all questions in the cache directory. This step reads the results from the cache directory and calculates the final metrics (a hypothetical aggregation sketch is shown after this list).
For example, to calculate recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
- The structure of the `scripts` directory:
  - scripts
    - recall.sh  # evaluate recall
    - precision.sh  # evaluate precision
    - reflection_quality.sh  # evaluate reflection quality
    - relevance_rate.sh  # evaluate relevance rate
    - extract.sh  # Step 1 of direct evaluation (for robustness): extract the final answer from the model answer
    - judge.sh  # Step 2 of direct evaluation (for robustness): judge the correctness of the extracted answer
🚨 The leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!
To contribute your model to the leaderboard, please email the prediction files of four tasks to 📫[email protected].
We release the MME-CoT data and evaluation prompts for benchmarking on the leaderboard.
You can download the dataset from 🤗 Hugging Face with the following command (make sure that you have installed the related packages):
from datasets import load_dataset
dataset = load_dataset("CaraJ/MME-CoT")
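To quickly check what was downloaded, you can print the dataset object and peek at the fields of the first example (nothing beyond the dataset ID is assumed here):

```python
from datasets import load_dataset

dataset = load_dataset("CaraJ/MME-CoT")
print(dataset)  # shows the available splits and the number of examples

first_split = next(iter(dataset))          # name of the first split
print(dataset[first_split][0].keys())      # field names of the first example
```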
If you find MME-CoT useful for your research and applications, please kindly cite using this BibTeX:
@article{jiang2025mme,
title={MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency},
author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Li, Yanwei and Qi, Yu and Chen, Xinyan and Wang, Liuhui and Jin, Jianhan and Guo, Claire and Yan, Shen and others},
journal={arXiv preprint arXiv:2502.09621},
year={2025}
}
Explore our additional research on Vision-Language Large Models:
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [MMSearch] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [ImageBind-LLM] ImageBind-LLM: Multi-modality Instruction Tuning
- [SPHINX] The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal LLMs
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [PerSAM] Personalize Segment Anything Model with One Shot
- [CoMat] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching