diff --git a/README.md b/README.md
index 0d9a137..5d54e57 100644
--- a/README.md
+++ b/README.md
@@ -53,8 +53,10 @@ huggingface-cli login
   - **IFEval**: [Instruction following capability evaluation](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
   - **AlpacaEval**: [Instruction following evaluation](https://github.com/tatsu-lab/alpaca_eval)
   - **HumanEval**: [Code generation and problem solving](https://github.com/openai/human-eval)
+  - **HumanEvalPlus**: [HumanEval with more test cases](https://github.com/evalplus/evalplus)
   - **ZeroEval**: [Logical reasoning and problem solving](https://github.com/WildEval/ZeroEval)
   - **MBPP**: [Python programming benchmark](https://github.com/google-research/google-research/tree/master/mbpp)
+  - **MBPPPlus**: [MBPP with more test cases](https://github.com/evalplus/evalplus)
   - **BigCodeBench:** [Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://arxiv.org/abs/2406.15877)
   > **🚨 Warning:** for BigCodeBench evaluation, we strongly recommend using a Docker container since the execution of LLM generated code on a machine can lead to destructive outcomes. More info is [here](eval/chat_benchmarks/BigCodeBench/README.md).
diff --git a/reproduced_benchmarks.md b/reproduced_benchmarks.md
index d32ca35..c284cbe 100644
--- a/reproduced_benchmarks.md
+++ b/reproduced_benchmarks.md
@@ -68,4 +68,10 @@
 | | | meta-llama/Meta-Llama-3.1-8B-Instruct | instruct (pass@1) | 30.7 | 32.8 | |
 | | | | complete (pass@1) | 41.9 | 40.5 | |
 | | | Qwen/Qwen2.5-7B-Instruct | instruct (pass@1) | 35.2 | 37.6 | |
-| | | | complete (pass@1) | 46.7 | 46.1 | |
\ No newline at end of file
+| | | | complete (pass@1) | 46.7 | 46.1 | |
+| HumanEvalPlus | Sedrick | mistralai/Mistral-7B-Instruct-v0.2 | accuracy (pass@1) | 27.44 | 36.0 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| | | meta-llama/Llama-3.1-8B-Instruct | accuracy (pass@1) | 62.2 | 62.8 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| | | google/codegemma-7b-it | accuracy (pass@1) | 36.6 | 51.8 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| MBPPPlus | Sedrick | mistralai/Mistral-7B-Instruct-v0.2 | accuracy (pass@1) | 43.9 | 37.0 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| | | meta-llama/Llama-3.1-8B-Instruct | accuracy (pass@1) | 58.7 | 55.6 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
+| | | google/codegemma-7b-it | accuracy (pass@1) | 56.6 | 56.9 | [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html) |
\ No newline at end of file
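All of the new rows above report the `pass@1` metric. For context, a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (the same metric family EvalPlus reports); the function name `pass_at_k` is my own, not something defined in this repository:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were generated per task and c of them
    passed all tests (HumanEval paper, Chen et al. 2021)."""
    if n - c < k:
        # Fewer failures than k: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding (n=1, k=1), pass@1 reduces to the plain
# fraction of tasks solved, which is what the table percentages are.
```

With `n=10` samples of which `c=5` pass, `pass_at_k(10, 5, 1)` gives 0.5, i.e. the per-task pass probability; the benchmark score averages this over all tasks.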