MME-CoT 🔥🕵️: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
Official repository for "MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency".
🌟 For more details, please refer to the project page with dataset exploration and visualization tools.
[🍓Project Page] [📖 Paper] [📊 Huggingface Dataset] [🏆 Leaderboard] [👁️ Visualization]
- [2025.02.14] 🌟 We are very proud to launch MME-CoT, the first-ever comprehensive CoT evaluation benchmark for LMMs in Visual Reasoning! We release the arXiv paper and all data samples in the Hugging Face dataset.
- Coming this week: MME-CoT evaluation with VLMEvalKit
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation.
In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level.
Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: (1) Models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest quality results; (2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; (3) Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
To calculate the six metrics (precision, recall, efficacy, stability, relevance rate, reflection quality), please follow these steps:
- Install the required packages.
pip install -r requirements.txt
- Format the model answer following the example in `results/json`. The file should be in JSONL format, with each answer to a question on one line; all other fields of the question from the dataset should be preserved in the line. The suffix `_cot.json` denotes answering with the CoT prompt, and `_dir.json` denotes answering with the direct prompt (a minimal sketch of writing such a file is shown after this list).
- Run the evaluation script.
You can either run the metrics one by one. For example, to evaluate recall:
bash scripts/recall.sh
Or you can run all the metrics for all the models in one directory with:
python batch_scripts/run_all.py --result_dir results/json
- Calculate the metrics.
We cache the evaluation results of all questions in the cache directory. This step reads the results from the cache directory and calculates the final metrics (a hypothetical aggregation sketch is shown after this list).
For example, to calculate recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
- The structure of the `scripts` directory:
  - scripts
    - recall.sh  # evaluate recall
    - precision.sh  # evaluate precision
    - reflection_quality.sh  # evaluate reflection quality
    - relevance_rate.sh  # evaluate relevance rate
    - extract.sh  # Step 1 of direct evaluation (for robustness): extract the final answer from the model answer
    - judge.sh  # Step 2 of direct evaluation (for robustness): judge the correctness of the extracted answer
🚨 The leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!
To contribute your model to the leaderboard, please email the prediction files of four tasks to 📫[email protected].
We release the MME-CoT data and evaluation prompts for benchmarking on the leaderboard.
You can download the dataset from 🤗 Hugging Face with the following command (make sure that you have installed the related packages):
from datasets import load_dataset
dataset = load_dataset("CaraJ/MME-CoT")
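To quickly check what was downloaded, you can print the dataset object and peek at the fields of the first example (nothing beyond the dataset ID is assumed here):

```python
from datasets import load_dataset

dataset = load_dataset("CaraJ/MME-CoT")
print(dataset)  # shows the available splits and the number of examples

first_split = next(iter(dataset))          # name of the first split
print(dataset[first_split][0].keys())      # field names of the first example
```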
If you find MME-CoT useful for your research and applications, please kindly cite using this BibTeX:
@article{jiang2025mme,
title={MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency},
author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Li, Yanwei and Qi, Yu and Chen, Xinyan and Wang, Liuhui and Jin, Jianhan and Guo, Claire and Yan, Shen and others},
journal={arXiv preprint arXiv:2502.09621},
year={2025}
}
Explore our additional research on Vision-Language Large Models:
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [MMSearch] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [ImageBind-LLM] ImageBind-LLM: Multi-modality Instruction Tuning
- [SPHINX] The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal LLMs
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [PerSAM] Personalize Segment Anything Model with One Shot
- [CoMat] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching