Benchmarking Future Work #165

Open
deboer-tim opened this issue Dec 6, 2024 · 0 comments
Labels
📐 benchmark Benchmarking granite

This epic lists future ideas for model benchmarking from the onsite meeting.

Benchmarks

Investigate the following benchmarks, which are easy for the model team to run. The expectation is that we can start by adapting BigCode with our own tests, and use the others as necessary if BigCode can't handle some of the specific tests below.

Tests

Aspects to test:

  • chat
  • code completion
  • file editing
  • multi-file editing
  • multiple languages (not within one test)
  • capture memory use and performance for each test
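Capturing memory use and performance for each test could be done with a thin wrapper around every test function. A minimal sketch, assuming a Python test harness; `run_with_metrics` and the stand-in workload are hypothetical, and `tracemalloc` only tracks Python-heap allocations (model/GPU memory would need a separate probe):

```python
import time
import tracemalloc

def run_with_metrics(test_fn, *args, **kwargs):
    """Run one benchmark test; capture wall-clock time and peak Python-heap memory.

    `test_fn` stands in for any of the test kinds above (chat, code
    completion, file editing, ...). Note: tracemalloc only sees Python
    allocations, not native/GPU memory used by the model itself.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = test_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_mib": peak_bytes / 2**20}

# Stand-in workload in place of a real model call.
result, metrics = run_with_metrics(lambda: sum(range(1_000_000)))
```

Recording the metrics dict alongside each test's pass/fail result would give the per-test memory and performance data called for above.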

Phases

  • automate existing tests
  • be able to evaluate more/new models more quickly
  • continuously add tests over time to match what we're adding/testing in the extension
  • run once on multiple laptops/OSes/GPUs (using what the team has) to set a performance baseline, confirm the spec cutoff, and determine where we run which model (or multiple models)
  • automate pipeline with fixed set of hardware
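The first two phases (automating existing tests, then adding tests over time) could be served by a simple registry that new tests plug into as the extension grows. A minimal sketch; the decorator, registry, and test name below are illustrative assumptions, not the project's actual suite:

```python
# Hypothetical test registry; names and structure are illustrative only.
TESTS = {}

def benchmark_test(name):
    """Register a test function under `name` so the runner can discover it."""
    def register(fn):
        TESTS[name] = fn
        return fn
    return register

@benchmark_test("code-completion/fibonacci")
def complete_fibonacci():
    # Stand-in: a real test would prompt the model through the extension
    # and score its completion against an expected answer.
    return True

def run_all():
    """Run every registered test, returning name -> pass/fail."""
    return {name: bool(fn()) for name, fn in TESTS.items()}

results = run_all()  # {"code-completion/fibonacci": True}
```

Because tests self-register, adding one to match new extension behavior is a single decorated function, and the same `run_all` entry point can later be driven by a pipeline on fixed hardware.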