[Evals] Refactor tasks, add linting and CI #47
Conversation
Signed-off-by: SumanthRH <[email protected]>
Thanks, @SumanthRH, for the efforts! This is great!
+1, this will be very helpful!
@sumanth can you run the linter PR before merging this one? The refactoring effort is hidden under the linter changes.
We should put this linter stuff at the top level of the repo. Did you have any reason for not putting it there?
If you don't want to apply these to the train/ sub-folder, let's just exclude it from the pre-commit / format runs.
I've added pre-commit hooks for the tools/ folder only, because I'm not sure we're at a state to clean up the training-related code in train/. I wanted to separate them out for now so that those working on training can proceed as usual.
LMK if this doesn't make sense. The main reason I kept it separate is that the train/ folder is a LlamaFactory repo fork. It looks like skythought/tools will be a package in itself, focused on evals.
Ok, so I moved this to the top level of the repo. I also added a setup.py which installs our evaluation package at skythought/skythought_evals (renamed from tools) as skythought_evals.
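For reference, a minimal setup.py along these lines would make the package installable. This is a sketch only: the metadata, version, and package discovery in the actual PR may differ.

```python
# Hypothetical minimal setup.py; the real file in the PR may differ.
from setuptools import find_packages, setup

setup(
    name="skythought_evals",  # package installed by `pip install -e .`
    version="0.0.1",          # placeholder version
    packages=find_packages(include=["skythought*"]),
)
```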
skythought/tools/combine_data.py (outdated)
Maybe not in this PR, but for better structure of the codebase, we should take these scripts and put them in the following refactored structure:

    skythought/...            # reusable python modules go in here
    project_scripts/
        skythought-t1/
            data_preparation/
                combine_data.py
            train_yamls/
            ...
        skythought-t1-flash/
            ...
    tests/
        skythought/...        # internal unittests
Yeah, agreed. Let's postpone this higher-level folder re-org for later.
I've now moved the linting and pre-commit hooks to the top level of the repo and ignored the train/ folder for now. I've added basic tests in tests/tools for skythought/tools. Also, since we need to give the package a name to make use of the module in tests, I wrote a small setup.py where the package skythought_evals is set up. This is just an alias for skythought.tools.
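For concreteness, a pre-commit config scoped this way might look like the following. This is a sketch: the hook repo, pinned version, and the exact path of the train/ fork are assumptions, not the PR's actual config.

```yaml
# .pre-commit-config.yaml (hypothetical sketch)
exclude: ^skythought/train/   # assumed path of the LlamaFactory fork
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.9
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
```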
skythought/tools/tasks/base.py (outdated)
    templating_parameters: Dict[str, str] = Field(default_factory=dict)
    # Optional, unused for now
    fewshot_config: List[Dict[str, Any]] = Field(default_factory=list)
    num_fewshot: int = 0
Do you want to absorb this within the fewshot_config?
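One hypothetical way to absorb it, sketched with pydantic. The model and field names here are illustrative, not the PR's actual schema:

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class FewshotConfig(BaseModel):
    # Hypothetical nested model folding num_fewshot into the few-shot settings
    examples: List[Dict[str, Any]] = Field(default_factory=list)
    num_fewshot: int = 0


class TaskConfig(BaseModel):
    templating_parameters: Dict[str, str] = Field(default_factory=dict)
    # Optional, unused for now
    fewshot_config: FewshotConfig = Field(default_factory=FewshotConfig)


cfg = TaskConfig()
print(cfg.fewshot_config.num_fewshot)  # → 0
```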
    class AIMETaskHandler(MathTaskHandler):
        def generate_prompt(self, problem: Dict, model):
            if MODEL_TO_NAME[model] == "Sky-T1-32B-Preview":
is it possible to avoid hardcoding this here?
Hello! Yeah, so there's model-related cleanup that is pending; this PR didn't do any cleanup related to that.
Here's my plan for the next PR:
- Move models_utils.py to a YAML.
- Clean up some of the model-specific logic in the handlers (like passing model for AIME, and system prompt handling).
- Make a separate task YAML if there's a different template for a model.
So for example, instead of

    # aime.yaml
    handler: aime
    ...
    templating_parameters:
      regular_template: "....."
      sky_template: "......."
I will have two YAMLs, aime.yaml and aime_sky.yaml:

    # aime.yaml
    handler: aime
    templating_parameters:
      template: <regular template>

    # aime_sky.yaml
    handler: aime
    templating_parameters:
      template: <sky template>
This way, you can even evaluate other models with the template for Sky-T1 (for example, fine-tunes of Sky-T1). Wdyt?
This PR already includes quite a few refactoring, linting, and CI changes, so I left that for later.
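To illustrate the two-YAML scheme, the loading side could be as small as this. A sketch only: the helper name and template text are made up, and the real YAMLs would live on disk rather than inline.

```python
import yaml  # PyYAML


def load_task_config(text: str) -> dict:
    """Parse a task YAML like aime.yaml / aime_sky.yaml (hypothetical helper)."""
    return yaml.safe_load(text)


# Inline stand-in for aime_sky.yaml; the template string is invented.
AIME_SKY = """
handler: aime
templating_parameters:
  template: "Solve the following AIME problem: {problem}"
"""

cfg = load_task_config(AIME_SKY)
prompt = cfg["templating_parameters"]["template"].format(problem="2+2=?")
print(prompt)  # → Solve the following AIME problem: 2+2=?
```

Evaluating a Sky-T1 fine-tune with the Sky template then just means pointing at a different YAML, with no handler changes.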
Sounds good! Let's leave this to the next PR.
    ):
        result_file = os.path.join(
            args.result_dir,
            f"{MODEL_TO_NAME[args.model]}_{args.task}_{args.split}_{args.source}_{args.filter_difficulty}_{args.start}_{args.end}_{args.math_difficulty_lower_bound}_{args.math_difficulty_upper_bound}.json",
Can we have some default behavior for MODEL_TO_NAME, so that if args.model is not in MODEL_TO_NAME, some default value is used?
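One possible default is to fall back to the last path component of the model id. A sketch, not the repo's code; the mapping entry shown is illustrative:

```python
# Hypothetical excerpt; the real MODEL_TO_NAME mapping lives in the repo.
MODEL_TO_NAME = {"NovaSky-AI/Sky-T1-32B-Preview": "Sky-T1-32B-Preview"}


def model_display_name(model: str) -> str:
    # For models missing from the mapping, fall back to the last path component.
    return MODEL_TO_NAME.get(model, model.split("/")[-1])


print(model_display_name("Qwen/QwQ-32B-Preview"))  # → QwQ-32B-Preview
```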
Seems suited for the next PR on models cleanup? See #47 (comment)
        OpenAI()
        if args.model.startswith("openai")
        else LLM(model=args.model, tensor_parallel_size=args.tp)
    )
    system_prompt = SYSTEM_PROMPT[args.model]
Same here. We should have some default behavior, or allow the user to pass in a new system prompt, if args.model is not currently in SYSTEM_PROMPT.
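A sketch of that default-plus-override behavior. The dict contents, default text, and function name are assumptions, not the repo's code:

```python
from typing import Optional

# Hypothetical excerpt; the real SYSTEM_PROMPT mapping lives in the repo.
SYSTEM_PROMPT = {"NovaSky-AI/Sky-T1-32B-Preview": "Your role as an assistant..."}
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def resolve_system_prompt(model: str, override: Optional[str] = None) -> str:
    # A user-supplied prompt (e.g. via a --system-prompt CLI flag) wins;
    # otherwise use the per-model entry, falling back to a generic default.
    if override is not None:
        return override
    return SYSTEM_PROMPT.get(model, DEFAULT_SYSTEM_PROMPT)


print(resolve_system_prompt("some/unknown-model"))  # → You are a helpful assistant.
```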
Hmm, this looks best suited for the planned follow-up PR?
LGTM, thank you. I tried it out on a few evals and it's much easier to manage, create scripts for, etc. Great work.
Also +1 to Shiyi's comments for TODOs after this PR is merged.
What does this PR do?
Big PR refactoring tasks and adding linters. Should close #23 .
Goals
With the Evaluation suite adding more and more tasks and supporting more models, there's a need for modularization. Some of the goals for a good eval suite are:
This PR is focused only on the road to 1., 2., and 3.
Task Refactoring Proposal
While considering the new design, it is helpful to look at how other eval suites like lm_eval are organized. lm_eval takes a very configuration-centric approach, with almost all formatting, preprocessing, and response post-processing logic in Jinja-templated YAMLs (example: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).

Design choices in this PR:
This is another difference from lm_eval: logic that lm_eval keeps in YAML lives in Python handler code here (this is a good thing, because YAMLs can be limiting).

Current Organization
Here is the high-level organization of the current code:
Folders:
Proposed Organization
Each task YAML will look like:
This PR doesn't try to force any structure on prompt templates, because templates vary quite a bit across datasets. To really know how a template is used, you need to see the corresponding handler.py implementation. But it's easy to modify and maintain.
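For illustration, a task YAML of the shape discussed in the aime.yaml sketches above might look like this (field values are placeholders, not the PR's actual contents):

```yaml
# Hypothetical task YAML; values are placeholders.
handler: aime
templating_parameters:
  template: "<prompt template with a {problem} placeholder>"
```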
Linting and Code formatting
I've added pre-commit hooks for the tools/ folder only, because I'm not sure we're at a state to clean up the training-related code in train/. I wanted to separate them out for now so that those working on training can proceed as usual.

High-level reorganization
The tools/ folder is now renamed to skythought_evals, which is the main package associated with the repo. So setup.py will install skythought_evals when you run `pip install -e .`. For those developing locally and running scripts inside skythought/skythought_evals, they can continue running scripts as usual.

TODO: