
[Evals] Refactor tasks, add linting and CI #47

Merged (49 commits) · Feb 1, 2025

Conversation

@SumanthRH (Collaborator) commented Jan 24, 2025

What does this PR do?

Big PR refactoring tasks and adding linters. Should close #23.

Goals

With the evaluation suite adding more and more tasks and supporting more models, there's a need for modularization. Some of the goals for a good eval suite are:

  1. Easy to support new tasks.
  2. Easy to maintain a large number of tasks.
  3. Easy to tweak configurations for a given task.
  4. Easy to support new models.
  5. Easy to debug/inspect results for a given task.
  6. Easy to scale evaluations for a given task.

This PR focuses only on goals 1, 2, and 3.

Task Refactoring Proposal

While considering the new design, it is helpful to look at how other eval suites like lm_eval are organized. lm_eval takes a very configuration-centric approach, with almost all formatting, preprocessing, and response postprocessing logic living in Jinja-templated YAMLs (example: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).

Design choices in this PR:

  • Logic (conditional logic, how templating parameters are prepended/appended) lives in code, i.e. in {task}_handlers.py. This avoids the awkwardness of having complicated branching and processing logic all in Jinja-templated YAMLs like lm_eval (a minimal sketch follows this list).
  • Prompt strings, dataset names, few-shot examples, etc. are best left in YAMLs. Crucially, we should avoid bare string concatenation in code and instead use f-string-style templates defined in YAMLs. This will let us easily iterate on the right formats and also catch bugs with missed spaces, newlines, etc.
  • Complex string processing - including filtering, regex matching, etc. - lives in code. This is another difference from lm_eval.
  • The handler and task configs for each task live in their own folder, for modularity. Developers are expected to go into the code to understand the exact parsing logic used (another contrast with lm_eval - this is a good thing, because YAMLs can be limiting).
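To make the split concrete, here is a minimal sketch (hypothetical handler and template names; the real handlers in this PR differ) of a handler whose branching lives in Python while the strings come from the task YAML:

import yaml

# Hypothetical task YAML contents: only strings live here, no logic.
RAW_CONFIG = """
templating_parameters:
  template_1: "First string to add {prompt}"
  template_2: "Second string to add {prompt}"
"""

class ExampleTaskHandler:
    def __init__(self):
        self.config = yaml.safe_load(RAW_CONFIG)

    def generate_prompt(self, problem: dict) -> str:
        # Conditional logic stays in code, not in Jinja-templated YAML.
        params = self.config["templating_parameters"]
        key = "template_1" if problem.get("use_first") else "template_2"
        return params[key].format(prompt=problem["question"])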

Current Organization

Here is the high-level organization of the current code:

Folders:

 |-util
 | |-taco
 | | |-pyext2.py
 | | |-testing_util.py 
 | |-livecodebench
 | | |-testing_util.py
 | |-prompts.py 
 | |-apps
 | | |-testing_util.py
 | |-math
 | | |-testing_util.py
 | |-model_utils.py # <--- system prompts
 | |-task_handlers.py # <--- all handlers in one file
 | |-common.py  
 |-eval.py
 |-inference_and_check.py
 |-requirements.txt

Proposed Organization

├── tasks
│   ├── aime
│   │   ├── aime_handler.py
│   │   └── aime.yaml
│   │   └── aime_8shot.yaml # different few shot configurations can live in separate yamls
│   ├── apps
│   ├── livecodebench
│   ├── math500
│   └── taco
└── util
   ├── common.py
   ├── math_parsing_util.py
   ├── model_utils.py
   ├── prompts.py

Each task YAML will look like:

handler: arc_c  # name of the handler for this task
dataset_path: <repo_id>  # repo ID on Hugging Face or a local path
dataset_source: null  # which subset on Hugging Face
question_key: <question_key>
templating_parameters:
  template_1: "First string to add {prompt}"
  template_2: "Second string to add {prompt}"
# Optional. Not implemented yet
fewshot_config:
  - question: ...
    target: ...
num_fewshot: n
This PR doesn't try to force any structure on prompt templates, because templates vary quite a bit across datasets. To really know how a template is used, you need to read the corresponding handler.py implementation - but templates remain easy to modify and maintain.
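For illustration, loading and validating such a YAML might look like the following (a sketch: the TaskConfig here is a stand-in for whatever config model the handlers actually use, with fields matching the schema above):

from pathlib import Path
from typing import Any, Dict, List, Optional

import yaml
from pydantic import BaseModel, Field

class TaskConfig(BaseModel):
    handler: str
    dataset_path: str
    dataset_source: Optional[str] = None
    question_key: str
    templating_parameters: Dict[str, str] = Field(default_factory=dict)
    # Optional, unused for now
    fewshot_config: List[Dict[str, Any]] = Field(default_factory=list)
    num_fewshot: int = 0

raw = yaml.safe_load(Path("tasks/aime/aime.yaml").read_text())
config = TaskConfig(**raw)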

Linting and Code formatting

I've added pre-commit hooks for the tools/ folder only, because I'm not sure we're at a state to clean up the training-related code in train/. I wanted to keep the two separate so that those working on training can proceed as usual for now.

High-level reorganization

The tools/ folder is now renamed to skythought_evals, which is the main package associated with the repo. So setup.py will install skythought_evals when you run pip install -e . Those developing locally can continue running scripts inside skythought/skythought_evals as usual.
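For example (a sketch; the submodule path is hypothetical, inferred from the proposed tree above):

# After `pip install -e .` at the repo root, the package resolves from anywhere:
import skythought_evals
from skythought_evals.util import common  # hypothetical submodule path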

TODO:

  • Improve interface for eval.py

@caoshiyi (Member) commented:

Thanks, @SumanthRH for the efforts! This is great!

@caoshiyi caoshiyi self-requested a review January 24, 2025 21:39
@tyler-griggs tyler-griggs self-requested a review January 26, 2025 00:26
@tyler-griggs (Collaborator) commented:

+1, this will be very helpful!

@kouroshHakha (Collaborator) left a comment:

@sumanth can you run the linter PR before merging this one? The refactoring effort is hidden under the linter changes.

Collaborator:

We should put this linter setup at the top level of the repo. Did you have any reason for not putting it there?

Collaborator:

If you don't want to apply these to the train/ sub-folder, let's just exclude it from the pre-commit/format runs.

Collaborator Author:

LMK if this doesn't make sense - the main reason I kept it separate is that the train/ folder is a fork of the LLaMA-Factory repo. It looks like skythought/tools will be a package in itself focused on evals.

@SumanthRH (Collaborator Author) commented Jan 27, 2025:

Ok, so I moved this to the top-level repo. I also added a setup.py which installs our evaluation package at skythought/skythought_evals (renamed from tools) as skythought_evals.

Collaborator:

Maybe not in this PR, but for a better codebase structure we should take these scripts and put them in the following refactored structure:

skythought/... # reusable python modules go in here. 
project_scripts/
    skythought-t1/
        data_preparation/
            combine_data.py
        train_yamls/
            ...
    skythought-t1-flash/
        ...
tests/
    skythought/... # internal unittests

@SumanthRH (Collaborator Author) commented Jan 27, 2025:

Yeah, agreed - let's postpone this higher-level folder re-org for later.

I've now moved the linting and pre-commit hooks to the top-level repo and ignored the train/ folder for now. I've added basic tests in tests/tools for skythought/tools. Also, since we need a package name to use the module in tests, I wrote a small setup.py that sets up the package skythought_evals - this is just an alias for skythought.tools.
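A minimal setup.py along those lines might look like this (a sketch only - it assumes the alias is done with a package_dir mapping; the version and subpackage list are placeholders, and the actual file may differ):

from setuptools import setup

setup(
    name="skythought_evals",
    version="0.0.1",  # placeholder
    # Map the importable package name onto the existing source directory,
    # so `import skythought_evals` resolves to skythought/tools.
    package_dir={"skythought_evals": "skythought/tools"},
    # Subpackages (e.g. skythought_evals.util) would need to be listed too.
    packages=["skythought_evals"],
)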

# Excerpt - fields on the task config model
# (requires: from typing import Any, Dict, List; from pydantic import Field)
templating_parameters: Dict[str, str] = Field(default_factory=dict)
# Optional, unused for now
fewshot_config: List[Dict[str, Any]] = Field(default_factory=list)
num_fewshot: int = 0
Collaborator:

Do you want to absorb this within the fewshot_config?


class AIMETaskHandler(MathTaskHandler):
    def generate_prompt(self, problem: Dict, model):
        if MODEL_TO_NAME[model] == "Sky-T1-32B-Preview":
Member:

Is it possible to avoid hardcoding this here?

@SumanthRH (Collaborator Author) commented Jan 28, 2025:

Hello! Yeah, there's model-related cleanup that is pending - this PR didn't do any cleanup on that front.

Here's my plan for the next PR:

  • Move model_utils.py to a YAML.
  • Clean up some of the model-specific logic in the handlers (like passing model for AIME, and system prompt handling).
  • Make a separate task YAML if there's a different template for a model.

So, for example, instead of

# aime.yaml
handler: aime
...
templating_parameters:
    regular_template: "....."
    sky_template: "......."

I will have two YAMLs, aime.yaml and aime_sky.yaml:

# aime.yaml
handler: aime
templating_parameters:
    template: <regular template> 
# aime_sky.yaml
handler: aime
templating_parameters:
    template: <sky template> 

This way, you can even evaluate other models with the template for Sky-T1 (e.g., fine-tunes of Sky-T1).
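Under that scheme, the handler would no longer need to branch on the model. A hypothetical sketch (attribute names like task_config and question_key are illustrative, not from this PR):

from typing import Dict

class AIMETaskHandler(MathTaskHandler):
    def generate_prompt(self, problem: Dict):
        # The template comes from whichever YAML was selected
        # (aime.yaml or aime_sky.yaml); no model check needed.
        template = self.task_config.templating_parameters["template"]
        return template.format(prompt=problem[self.question_key])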

Wdyt?

This PR already has quite a few refactoring, linting, and CI changes, so I left that for later.

Member:

Sounds good! Let's leave this to the next PR.

):
    result_file = os.path.join(
        args.result_dir,
        f"{MODEL_TO_NAME[args.model]}_{args.task}_{args.split}_{args.source}_{args.filter_difficulty}_{args.start}_{args.end}_{args.math_difficulty_lower_bound}_{args.math_difficulty_upper_bound}.json",
Member:

Can we have some default behavior for MODEL_TO_NAME, so that if args.model is not in MODEL_TO_NAME some default value is used?
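For instance, something along these lines (illustrative only; the fallback shown is hypothetical, not code from this PR):

# Fall back to the last path component of the model name when unmapped:
model_name = MODEL_TO_NAME.get(args.model, args.model.split("/")[-1])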

Collaborator Author:

Seems suited for the next PR on model cleanup? See #47 (comment).

    OpenAI()
    if args.model.startswith("openai")
    else LLM(model=args.model, tensor_parallel_size=args.tp)
)
system_prompt = SYSTEM_PROMPT[args.model]
Member:

Same here - we should have some default behavior, or allow the user to pass in a new system prompt if args.model is currently not in SYSTEM_PROMPT.
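For instance (a sketch: args.system_prompt and DEFAULT_SYSTEM_PROMPT are hypothetical, not part of the current CLI):

# Prefer a user-supplied prompt, then the per-model mapping, then a default:
system_prompt = args.system_prompt or SYSTEM_PROMPT.get(args.model, DEFAULT_SYSTEM_PROMPT)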

Collaborator Author:

Hmm, this also looks best suited for the follow-up PR? See #47 (comment).

@SumanthRH changed the title from "[Evals] Refactor tasks and add linting" to "[Evals] Refactor tasks and add linting and CI" on Jan 28, 2025
@SumanthRH changed the title from "[Evals] Refactor tasks and add linting and CI" to "[Evals] Refactor tasks, add linting and CI" on Jan 28, 2025
@tyler-griggs (Collaborator) left a comment:

LGTM, thank you. I tried it out on a few evals and it's much easier to manage, create scripts for, etc. Great work.

Also +1 to Shiyi's comments for TODOs after this PR is merged.

@lynnliu030 lynnliu030 self-requested a review January 29, 2025 00:20
@SumanthRH SumanthRH merged commit f943eb9 into NovaSky-AI:main Feb 1, 2025