Run r1 evals for `games` datasets #227

joesharratt1229 · 2025-02-26T12:36:17Z

Given the comparatively high latency of evaluating R1, it makes sense to split the work by running evaluations per dataset category. This sub-issue covers running R1 evals for datasets in the `games` category.

