
MLVU aggregate function is incorrect #534

Open
SCZwangxiao opened this issue Feb 11, 2025 · 2 comments

Comments

@SCZwangxiao

According to the results in their leaderboard, the overall score is calculated as the average of the per-subtype accuracies (a macro average), not the accuracy over all samples. The current code computes the latter:

```python
total_correct = 0
total_answered = 0
for k, v in category2score.items():
    total_correct += v["correct"]
    total_answered += v["answered"]
eval_logger.info(f"Overall Performance: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
```
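For comparison, a minimal sketch of the macro-averaged aggregation the leaderboard appears to use, assuming `category2score` has the same `"correct"`/`"answered"` structure as above (the example counts are made up for illustration):

```python
# Hypothetical example data, mirroring the category2score structure
# used in the snippet above; the counts here are illustrative only.
category2score = {
    "needle": {"correct": 80, "answered": 100},
    "order": {"correct": 50, "answered": 100},
}

# Macro average: compute accuracy per subtype first, then average
# across subtypes, instead of pooling all samples (micro average).
subtype_accuracies = [
    v["correct"] / v["answered"]
    for v in category2score.values()
    if v["answered"] > 0
]
overall = 100 * sum(subtype_accuracies) / len(subtype_accuracies) if subtype_accuracies else 0
print(f"Overall Performance: {overall:.1f}%")  # 65.0% for the data above
```

With these example counts the micro average would also be 65.0% because both subtypes have equal sample counts; the two averages diverge whenever subtype sizes differ.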

@kcz358
Collaborator

kcz358 commented Feb 13, 2025

Hi @shuyansy , do you mind taking a look at whether this logic is correct? Thanks!

@kcz358 kcz358 mentioned this issue Feb 27, 2025
@kcz358
Collaborator

kcz358 commented Feb 27, 2025

Hi @SCZwangxiao , this issue should have been fixed in #555
