
[Evals] Refactor tasks, add linting and CI #47

Merged (49 commits) · Feb 1, 2025

Conversation

@SumanthRH (Collaborator) commented Jan 24, 2025

What does this PR do?

Big PR refactoring tasks and adding linters. Should close #23.

Goals

With the evaluation suite adding more and more tasks and supporting more models, there's a need for modularization. Some of the goals for a good eval suite are:

  1. Easy to support new tasks.
  2. Easy to maintain a large number of tasks.
  3. Easy to tweak configurations for a given task.
  4. Easy to support new models.
  5. Easy to debug/inspect results for a given task.
  6. Easy to scale evaluations for a given task.

This PR focuses only on goals 1, 2, and 3.

Task Refactoring Proposal

While considering the new design, it is helpful to look at how other eval suites like lm_eval are organized. lm_eval takes a very configuration-centric approach, with almost all formatting, preprocessing, and response postprocessing logic living in Jinja-templated YAMLs (example: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).

Design choices in this PR:

  • Logic (conditional logic, how templating parameters are prepended/appended) lives in code, i.e. in {task}_handlers.py. This avoids the awkwardness of having complicated branching and processing logic all in Jinja-templated YAMLs like lm_eval (a minimal sketch follows this list).
  • Prompt strings, dataset names, few-shot examples, etc. are best left in YAMLs. Crucially, we should avoid bare string concatenation in code and instead use f-string-style templates defined in YAMLs. This will let us easily iterate on the right formats and also catch bugs with missed spaces, newlines, etc.
  • Complex string processing - including filtering, regex matching, etc. - lives in code. This is another difference from lm_eval.
  • The handler and task configs for each task live in their own folder, for modularity. Developers are expected to go into the code to understand the exact parsing logic used (another contrast with lm_eval - this is a good thing, because YAMLs can be limiting).
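To make the split concrete, here is a minimal sketch (hypothetical handler and template names; the real handlers in this PR differ) of a handler whose branching lives in Python while the strings come from the task YAML:

import yaml

# Hypothetical task YAML contents: only strings live here, no logic.
RAW_CONFIG = """
templating_parameters:
  template_1: "First string to add {prompt}"
  template_2: "Second string to add {prompt}"
"""

class ExampleTaskHandler:
    def __init__(self):
        self.config = yaml.safe_load(RAW_CONFIG)

    def generate_prompt(self, problem: dict) -> str:
        # Conditional logic stays in code, not in Jinja-templated YAML.
        params = self.config["templating_parameters"]
        key = "template_1" if problem.get("use_first") else "template_2"
        return params[key].format(prompt=problem["question"])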

Current Organization

Here is the high-level organization of the current code:

Folders:

 |-util
 | |-taco
 | | |-pyext2.py
 | | |-testing_util.py 
 | |-livecodebench
 | | |-testing_util.py
 | |-prompts.py 
 | |-apps
 | | |-testing_util.py
 | |-math
 | | |-testing_util.py
 | |-model_utils.py # <--- system prompts
 | |-task_handlers.py # <--- all handlers in one file
 | |-common.py  
 |-eval.py
 |-inference_and_check.py
 |-requirements.txt

Proposed Organization

├── tasks
│   ├── aime
│   │   ├── aime_handler.py
│   │   └── aime.yaml
│   │   └── aime_8shot.yaml # different few shot configurations can live in separate yamls
│   ├── apps
│   ├── livecodebench
│   ├── math500
│   └── taco
└── util
   ├── common.py
   ├── math_parsing_util.py
   ├── model_utils.py
   ├── prompts.py

Each task YAML will look like:

handler: arc_c  # name of the handler for this task
dataset_path: <repo_id>  # repo ID on Hugging Face or a local path
dataset_source: null  # which subset on Hugging Face
question_key: <question_key>
templating_parameters:
  template_1: "First string to add {prompt}"
  template_2: "Second string to add {prompt}"
# Optional. Not implemented yet
fewshot_config:
  - question: ...
    target: ...
num_fewshot: n
This PR doesn't try to force any structure on prompt templates, because templates vary quite a bit across datasets. To really know how a template is used, you need to read the corresponding handler.py implementation - but templates remain easy to modify and maintain.
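For illustration, loading and validating such a YAML might look like the following (a sketch: the TaskConfig here is a stand-in for whatever config model the handlers actually use, with fields matching the schema above):

from pathlib import Path
from typing import Any, Dict, List, Optional

import yaml
from pydantic import BaseModel, Field

class TaskConfig(BaseModel):
    handler: str
    dataset_path: str
    dataset_source: Optional[str] = None
    question_key: str
    templating_parameters: Dict[str, str] = Field(default_factory=dict)
    # Optional, unused for now
    fewshot_config: List[Dict[str, Any]] = Field(default_factory=list)
    num_fewshot: int = 0

raw = yaml.safe_load(Path("tasks/aime/aime.yaml").read_text())
config = TaskConfig(**raw)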

Linting and Code formatting

I've added pre-commit hooks for the tools/ folder only, because I'm not sure we're at a state to clean up the training-related code in train/. I wanted to keep the two separate so that those working on training can proceed as usual for now.

High-level reorganization

The tools/ folder is now renamed to skythought_evals, which is the main package associated with the repo. So setup.py will install skythought_evals when you run pip install -e . Those developing locally can continue running scripts inside skythought/skythought_evals as usual.
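For example (a sketch; the submodule path is hypothetical, inferred from the proposed tree above):

# After `pip install -e .` at the repo root, the package resolves from anywhere:
import skythought_evals
from skythought_evals.util import common  # hypothetical submodule path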

TODO:

  • Improve interface for eval.py

@caoshiyi (Member) commented:

Thanks, @SumanthRH for the efforts! This is great!

@caoshiyi caoshiyi self-requested a review January 24, 2025 21:39
@tyler-griggs tyler-griggs self-requested a review January 26, 2025 00:26
@tyler-griggs (Collaborator) commented:

+1, this will be very helpful!

@kouroshHakha (Collaborator) left a comment:

@sumanth can you run the linter PR before merging this one? The refactoring effort is hidden under the linter changes.

Collaborator:

We should put this linter setup at the top level of the repo. Did you have any reason for not putting it there?

Collaborator:

If you don't want to apply these to the train/ sub-folder, let's just exclude it from the pre-commit/format runs.

Collaborator Author:

LMK if this doesn't make sense - the main reason I kept it separate is that the train/ folder is a fork of the LLaMA-Factory repo. It looks like skythought/tools will be a package in itself focused on evals.

@SumanthRH (Collaborator Author) commented Jan 27, 2025:

Ok, so I moved this to the top-level repo. I also added a setup.py which installs our evaluation package at skythought/skythought_evals (renamed from tools) as skythought_evals.

Collaborator:

Maybe not in this PR, but for a better codebase structure we should take these scripts and put them in the following refactored structure:

skythought/... # reusable python modules go in here. 
project_scripts/
    skythought-t1/
        data_preparation/
            combine_data.py
        train_yamls/
            ...
    skythought-t1-flash/
        ...
tests/
    skythought/... # internal unittests

@SumanthRH (Collaborator Author) commented Jan 27, 2025:

Yeah, agreed - let's postpone this higher-level folder re-org for later.

I've now moved the linting and pre-commit hooks to the top-level repo and ignored the train/ folder for now. I've added basic tests in tests/tools for skythought/tools. Also, since we need a package name to use the module in tests, I wrote a small setup.py that sets up the package skythought_evals - this is just an alias for skythought.tools.
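A minimal setup.py along those lines might look like this (a sketch only - it assumes the alias is done with a package_dir mapping; the version and subpackage list are placeholders, and the actual file may differ):

from setuptools import setup

setup(
    name="skythought_evals",
    version="0.0.1",  # placeholder
    # Map the importable package name onto the existing source directory,
    # so `import skythought_evals` resolves to skythought/tools.
    package_dir={"skythought_evals": "skythought/tools"},
    # Subpackages (e.g. skythought_evals.util) would need to be listed too.
    packages=["skythought_evals"],
)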

# Excerpt - fields on the task config model
# (requires: from typing import Any, Dict, List; from pydantic import Field)
templating_parameters: Dict[str, str] = Field(default_factory=dict)
# Optional, unused for now
fewshot_config: List[Dict[str, Any]] = Field(default_factory=list)
num_fewshot: int = 0
Collaborator:

Do you want to absorb this within the fewshot_config?


class AIMETaskHandler(MathTaskHandler):
    def generate_prompt(self, problem: Dict, model):
        if MODEL_TO_NAME[model] == "Sky-T1-32B-Preview":
Member:

Is it possible to avoid hardcoding this here?

@SumanthRH (Collaborator Author) commented Jan 28, 2025:

Hello! Yeah, there's model-related cleanup that is pending - this PR didn't do any cleanup on that front.

Here's my plan for the next PR:

  • Move model_utils.py to a YAML.
  • Clean up some of the model-specific logic in the handlers (like passing model for AIME, and system prompt handling).
  • Make a separate task YAML if there's a different template for a model.

So, for example, instead of

# aime.yaml
handler: aime
...
templating_parameters:
    regular_template: "....."
    sky_template: "......."

I will have two YAMLs, aime.yaml and aime_sky.yaml:

# aime.yaml
handler: aime
templating_parameters:
    template: <regular template> 
# aime_sky.yaml
handler: aime
templating_parameters:
    template: <sky template> 

This way, you can even evaluate other models with the template for Sky-T1 (e.g., fine-tunes of Sky-T1).
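Under that scheme, the handler would no longer need to branch on the model. A hypothetical sketch (attribute names like task_config and question_key are illustrative, not from this PR):

from typing import Dict

class AIMETaskHandler(MathTaskHandler):
    def generate_prompt(self, problem: Dict):
        # The template comes from whichever YAML was selected
        # (aime.yaml or aime_sky.yaml); no model check needed.
        template = self.task_config.templating_parameters["template"]
        return template.format(prompt=problem[self.question_key])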

Wdyt?

This PR already has quite a few refactoring, linting, and CI changes, so I left that for later.

Member:

Sounds good! Let's leave this to the next PR.

):
    result_file = os.path.join(
        args.result_dir,
        f"{MODEL_TO_NAME[args.model]}_{args.task}_{args.split}_{args.source}_{args.filter_difficulty}_{args.start}_{args.end}_{args.math_difficulty_lower_bound}_{args.math_difficulty_upper_bound}.json",
Member:

Can we have some default behavior for MODEL_TO_NAME, so that if args.model is not in MODEL_TO_NAME some default value is used?
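For instance, something along these lines (illustrative only; the fallback shown is hypothetical, not code from this PR):

# Fall back to the last path component of the model name when unmapped:
model_name = MODEL_TO_NAME.get(args.model, args.model.split("/")[-1])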

Collaborator Author:

Seems suited for the next PR on model cleanup? See #47 (comment).

    OpenAI()
    if args.model.startswith("openai")
    else LLM(model=args.model, tensor_parallel_size=args.tp)
)
system_prompt = SYSTEM_PROMPT[args.model]
Member:

Same here - we should have some default behavior, or allow the user to pass in a new system prompt if args.model is currently not in SYSTEM_PROMPT.
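For instance (a sketch: args.system_prompt and DEFAULT_SYSTEM_PROMPT are hypothetical, not part of the current CLI):

# Prefer a user-supplied prompt, then the per-model mapping, then a default:
system_prompt = args.system_prompt or SYSTEM_PROMPT.get(args.model, DEFAULT_SYSTEM_PROMPT)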

Collaborator Author:

Hmm, this also looks best suited for the follow-up PR? See #47 (comment).

@SumanthRH changed the title from "[Evals] Refactor tasks and add linting" to "[Evals] Refactor tasks and add linting and CI" on Jan 28, 2025
@SumanthRH changed the title from "[Evals] Refactor tasks and add linting and CI" to "[Evals] Refactor tasks, add linting and CI" on Jan 28, 2025
@tyler-griggs (Collaborator) left a comment:

LGTM, thank you. I tried it out on a few evals and it's much easier to manage, create scripts for, etc. Great work.

Also +1 to Shiyi's comments for TODOs after this PR is merged.

@lynnliu030 lynnliu030 self-requested a review January 29, 2025 00:20
@SumanthRH SumanthRH merged commit f943eb9 into NovaSky-AI:main Feb 1, 2025