Refactor model-specific configs and move data curation scripts (#60)
Another refactor PR on top of #23, now focused on model-specific configurations and data generation.

- Model-specific system prompts, user templates, etc. are best kept in a YAML file.
- `TaskHandler` should be model agnostic, since we want consistent evaluation logic for all tasks.
- Data curation scripts for different Sky-T1 models should live outside the `skythought_evals` package. These are mostly scripts focused on a particular data curation task like filtering, rewriting, etc. The proposal is to place common scripts in `scripts/`. A guide for obtaining the final training data, plus training commands for the different Sky-T1 models, should be placed in `recipes/`. For now, all data curation scripts are in the `scripts` folder.
- Adds a new `system-prompt-template` CLI flag. Users can leverage available templates, like those for Sky-T1, Qwen, etc., for a different model during evaluation.
Showing 38 changed files with 634 additions and 363 deletions.
# Sky-T1-32B-Flash

[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Flash) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_preference_data_10k) | [Blog](https://novasky-ai.github.io/posts/reduce-overthinking/)
For a detailed breakdown of the data curation steps and training methodology, refer to the [blog](https://novasky-ai.github.io/posts/reduce-overthinking/).

## Setup
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands are run from the root directory of the repo.

## Stage 1: Data Generation
We used `Sky-T1-32B-Preview` to generate responses to the 12K questions in the `PRM800K` dataset. For each question, we used a temperature of 1.0 and generated 8 responses to create a diversity of response lengths. We then formed preference pairs that contrast "verbose" vs. "concise" solutions: from the generated responses, we picked the shortest correct response as the positive example and the longest correct response as the negative example. We discarded the rest of the generated responses, along with any questions that did not produce at least two correct responses. We also incorporated a small number of coding preference pairs, which simultaneously boosts coding accuracy and further reduces coding generation lengths.
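The pair-selection step above can be sketched as follows (a minimal illustration of the described logic, not the actual pipeline code; the response texts are made up):

```python
def make_preference_pair(responses):
    """responses: list of (text, is_correct) tuples for one question.
    Returns (positive, negative) or None if fewer than two correct responses."""
    correct = [text for text, ok in responses if ok]
    if len(correct) < 2:
        return None  # discard questions without at least two correct responses
    positive = min(correct, key=len)  # shortest correct -> "concise" example
    negative = max(correct, key=len)  # longest correct  -> "verbose" example
    return positive, negative

# Example: four sampled responses, three of them correct
pair = make_preference_pair([
    ("short proof", True),
    ("a much longer, verbose proof with extra steps", True),
    ("medium-length proof", True),
    ("wrong answer", False),
])
print(pair)  # ('short proof', 'a much longer, verbose proof with extra steps')
```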
## Stage 2: Response Rewriting

The file `response_rewrite.py` provides a pipeline for filtering and rewriting responses generated with `inference_and_check.py`. We use `response_rewrite.py` to create preference pairs for preference optimization (e.g., DPO, SimPO); however, the logic can be edited for alternative filtering and rewriting steps. Details of the implemented logic can be found in `response_rewrite.py` or in [this blog post](https://novasky-ai.github.io/posts/reduce-overthinking).
To use our preference optimization pipeline, first generate and score multiple responses using `inference_and_check.py`. For example:
```shell
python -m skythought_evals.inference_and_check --inference --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
python -m skythought_evals.inference_and_check --check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
```
Then, use `response_rewrite.py` to process the responses into preference pairs. By default, the shortest correct responses will be used as positive examples and the longest correct responses will be used as negative examples. The argument `--SILC` can be used to also include short incorrect responses as negative examples and long correct responses as positive examples.
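Conceptually, the `--SILC` selection could look like this (a hypothetical sketch of the described behavior, not the script's actual code):

```python
def silc_pairs(responses):
    """responses: list of (text, is_correct). Returns (positive, negative) pairs."""
    correct = sorted([t for t, ok in responses if ok], key=len)
    incorrect = sorted([t for t, ok in responses if not ok], key=len)
    pairs = []
    if len(correct) >= 2:
        # default pair: shortest correct (positive) vs. longest correct (negative)
        pairs.append((correct[0], correct[-1]))
    if correct and incorrect:
        # SILC pair: long correct (positive) vs. short incorrect (negative)
        pairs.append((correct[-1], incorrect[0]))
    return pairs

pairs = silc_pairs([("ok", True), ("a very long correct answer", True), ("no", False)])
print(len(pairs))  # 2
```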
```shell
python scripts/response_rewrite.py --SILC --rewrite-model meta-llama/Meta-Llama-3-8B-Instruct --target-model NovaSky-AI/Sky-T1-32B-Preview --dataset [PATH_TO_GENERATED_RESPONSES] --result-dir ./ --checkpoint --tp 8
```
The `--checkpoint` argument can optionally be used to save intermediate files of the processed data between steps, in case of failure.
The resulting `.json` files can be used to train a model with preference optimization algorithms. See the `/train/` directory for more details.
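The exact schema of the preference files depends on the training framework; as an illustration (the field names here are hypothetical and may not match what `response_rewrite.py` actually emits), a DPO-style record might look like:

```python
import json

# Illustrative DPO-style preference record (hypothetical field names).
record = {
    "prompt": "Solve: 2 + 2 = ?",
    "chosen": "The answer is 4.",  # e.g. the shortest correct response
    "rejected": "Let me think about this very carefully, step by step... so the answer is 4.",  # e.g. the longest correct response
}

print(json.dumps(record, indent=2))
```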
# Sky-T1-32B-Preview
[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Preview) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k) | [Blog](https://novasky-ai.github.io/posts/sky-t1/)
Given below are the instructions to replicate the data preprocessing and training steps for Sky-T1-32B-Preview.
## Setup
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands are run from the root directory of the repo.

Set the environment variable `SKYT_HOME` to the directory for the final dataset.
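For example (the directory path is just an illustration; pick any writable location):

```shell
# Example only: point SKYT_HOME at a writable directory for the final dataset
export SKYT_HOME=$HOME/skyt1
mkdir -p "$SKYT_HOME/data"
```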
## Training Data Curation
To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and apply a rejection sampling procedure to improve the data quality. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
The final data contains (1) 5k coding data from APPs and TACO, (2) 10k math data from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, and (3) 1k science and puzzle data from STILL-2.
### Step 0 (Only for NUMINA math dataset): Label Math Difficulty from NUMINA
We provide the labelled NUMINA dataset used for training here: https://huggingface.co/datasets/NovaSky-AI/labeled_numina_difficulty. To replicate the labeling yourself, follow the steps below.
Put one or more OpenAI API keys in a file, e.g. `keys.txt` (one per line). If there is more than one key, the script will use them in round-robin fashion to speed up generation. Label math difficulty using GPT-4o-mini:

#### Example usage:
```shell
python scripts/label_math_difficulty.py --source [amc_aime, math, olympiads] --keys keys.txt
```
The expected output is `labeled_source_0_-1.json`. We also provide instructions to download these files under the `labeled_numina_difficulty` folder (download from Hugging Face).
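The round-robin key usage described above can be sketched as follows (a minimal illustration using `itertools.cycle`; the script's actual implementation may differ, and the key strings are made up):

```python
import itertools

def load_keys(path):
    """Read one API key per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

keys = ["sk-key-1", "sk-key-2", "sk-key-3"]  # normally: load_keys("keys.txt")
key_cycle = itertools.cycle(keys)

# Each request takes the next key in turn, spreading load across accounts.
assigned = [next(key_cycle) for _ in range(5)]
print(assigned)  # ['sk-key-1', 'sk-key-2', 'sk-key-3', 'sk-key-1', 'sk-key-2']
```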
### Step 1: Data Inference
Run inference with QwQ on several datasets. For the preview version, we use data from the following datasets.
```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --difficulty MEDIUM --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source olympiads --end 20000 --filter-difficulty --result-dir $SKYT_HOME/data --inference
```
### Step 2: Format the response
After obtaining the list files for training data, convert them to a unified format. (Note: this step uses GPT-4o-mini to rewrite the responses. The output is long, and processing our preview data cost about $100.)
```shell
python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
```
### Step 3: Rejection Sampling on the Formatted Data (example usage with the previous script)
```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --subset all --result-dir $SKYT_HOME/data --check
```
Repeat similarly for the other datasets.
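The rejection-sampling idea amounts to keeping only the samples whose formatted response was graded correct by the `--check` pass. A sketch (the field names are illustrative, not the repo's actual schema):

```python
def reject_sample(records):
    """Keep only records whose response was graded correct."""
    return [r for r in records if r.get("correct")]

records = [
    {"question": "q1", "response": "a1", "correct": True},
    {"question": "q2", "response": "a2", "correct": False},
]
kept = reject_sample(records)
print(len(kept))  # 1
```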
### Convert to ShareGPT format for training
After obtaining multiple converted files, merge them together and convert them to the ShareGPT format to perform training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download their portion of the data and simply concatenate it with the data obtained above.
```shell
python scripts/convert_to_data.py --input_dir $SKYT_HOME/data --output $SKYT_HOME/data/train_data.json
```
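For reference, a single ShareGPT-style training record pairs each question with its response as a conversation. A minimal sketch of the conversion (the helper and its field defaults are hypothetical; `convert_to_data.py`'s real logic may differ):

```python
def to_sharegpt(question, response, system_prompt="You are a helpful assistant."):
    """Wrap one (question, response) pair as a ShareGPT-style conversation."""
    return {
        "system": system_prompt,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": response},
        ],
    }

example = to_sharegpt("What is 2 + 2?", "4")
print(example["conversations"][0]["from"])  # human
```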
Renamed: `skythought/skythought_evals/combine_data.py` → `scripts/combine_data.py` (2 changes: 1 addition & 1 deletion)