readme

mlfoundations · Jan 23, 2025 · 2c5291e
1 parent d19763a
commit 2c5291e

Showing 2 changed files with 9 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -53,8 +53,10 @@ huggingface-cli login
   - **IFEval**: [Instruction following capability evaluation](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
   - **AlpacaEval**: [Instruction following evaluation](https://github.com/tatsu-lab/alpaca_eval)
   - **HumanEval**: [Code generation and problem solving](https://github.com/openai/human-eval)
+  - **HumanEvalPlus**: [HumanEval with more test cases](https://github.com/evalplus/evalplus)
   - **ZeroEval**: [Logical reasoning and problem solving](https://github.com/WildEval/ZeroEval)
   - **MBPP**: [Python programming benchmark](https://github.com/google-research/google-research/tree/master/mbpp)
+  - **MBPPPlus**: [MBPP with more test cases](https://github.com/evalplus/evalplus)
   - **BigCodeBench:** [Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://arxiv.org/abs/2406.15877)
 
     > **🚨 Warning:** for BigCodeBench evaluation, we strongly recommend using a Docker container since the execution of LLM generated code on a machine can lead to destructive outcomes. More info is [here](eval/chat_benchmarks/BigCodeBench/README.md).

diff --git a/reproduced_benchmarks.md b/reproduced_benchmarks.md
@@ -68,4 +68,10 @@
 |             |         | meta-llama/Meta-Llama-3.1-8B-Instruct   | instruct (pass@1)             | 30.7        | 32.8             |                                     |
 |             |         |                                         | complete (pass@1)             | 41.9        | 40.5             |                                     |
 |             |         | Qwen/Qwen2.5-7B-Instruct                | instruct (pass@1)             | 35.2        | 37.6             |                                     |
-|             |         |                                         | complete (pass@1)             | 46.7        | 46.1             |                                     |
+|             |         |                                         | complete (pass@1)             | 46.7        | 46.1             |                                     |
+|HumanEvalPlus| Sedrick | mistralai/Mistral-7B-Instruct-v0.2      | accuracy (pass@1)             | 27.44       | 36.0             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+|             |         | meta-llama/Llama-3.1-8B-Instruct        | accuracy (pass@1)             | 62.2        | 62.8             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+|             |         | google/codegemma-7b-it                  | accuracy (pass@1)             | 36.6        | 51.8             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| MBPPPlus    | Sedrick | mistralai/Mistral-7B-Instruct-v0.2      | accuracy (pass@1)             | 43.9        | 37.0             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+|             |         | meta-llama/Llama-3.1-8B-Instruct        | accuracy (pass@1)             | 58.7        | 55.6             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+|             |         | google/codegemma-7b-it                  | accuracy (pass@1)             | 56.6        | 56.9             | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |