Refactor model-specific configs and move data curation scripts (#60)
Another refactor PR on top of #23 now focused on model-specific configurations and data generation.

- Model-specific system prompts, user templates, etc. are best kept in a YAML file.
- TaskHandler should be model-agnostic, since we want consistent evaluation logic for all tasks.
- Data curation scripts for the different Sky-T1 models should live outside the `skythought_evals` package. These are mostly scripts focused on a particular data curation task like filtering, rewriting, etc. My proposal is to place common scripts in `scripts/`. A guide for obtaining the final training data plus training commands for the different Sky-T1 models should be placed in `recipes/`. For now, all data curation scripts are in the `scripts` folder.
- Adds a new `system-prompt-template` CLI flag. Users can apply available templates, like those for Sky-T1 or Qwen, to a different model during evaluation (see the example below).
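
A hypothetical invocation (the template name and flag values here are illustrative; see the evals README for exact usage):

```shell
# Hypothetical example: evaluate a different model while reusing an available
# system prompt template (the template name "sky-t1" is illustrative)
python -m skythought_evals.inference_and_check --inference --task math500 \
    --model Qwen/Qwen2-7B-Instruct --system-prompt-template sky-t1 \
    --tp 4 --max_tokens 4096 --result-dir ./
```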
SumanthRH authored Feb 5, 2025
1 parent a399909 commit cb45c81
Showing 38 changed files with 634 additions and 363 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
# NOTE (sumanthrh): Many of the files excluded here are used for validating code generation, and linters do not recognize some of the logic in these files. skythought/train is excluded for now because it's a fork of Llamafactory
exclude: (^skythought/train|skythought_evals/tasks/taco/pyext2\.py|skythought_evals/tasks/taco/taco_util\.py|skythought_evals/tasks/apps/apps_util\.py|skythought_evals/util/prompts\.py|skythought_evals/util/model_utils\.py)$
exclude: (^skythought/train|skythought_evals/tasks/taco/pyext2\.py|skythought_evals/tasks/taco/taco_util\.py|skythought_evals/tasks/apps/apps_util\.py|scripts/prompts\.py)$


# Black needs to be run after ruff with --fix
20 changes: 18 additions & 2 deletions README.md
@@ -34,8 +34,8 @@
# Getting Started

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.
- ``/data``: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
- ``skythought/skythought_evals``: Our data generation and evaluation library. To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning and apply a rejection sampling procedure to improve the data quality.
- ``recipes``: Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash` and `Sky-T1-32B-Preview`.
- ``skythought/skythought_evals``: Our data generation and evaluation library.
- ``skythought/train``: Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Our model training was completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offloading, costing approximately $450 as per Lambda Cloud pricing.


@@ -54,6 +54,22 @@ conda activate eval
pip install -e .
```

We support a wide variety of datasets in mathematics, science and coding:

- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K

For running evaluation, please refer to [skythought_evals/README.md](skythought/skythought_evals/README.md).


2 changes: 0 additions & 2 deletions data/.gitattributes

This file was deleted.

9 changes: 0 additions & 9 deletions data/README.md

This file was deleted.

36 changes: 36 additions & 0 deletions recipes/sky-t1-flash/README.md
@@ -0,0 +1,36 @@

# Sky-T1-32B-Flash

[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Flash) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_preference_data_10k) | [Blog](https://novasky-ai.github.io/posts/reduce-overthinking/)

For a detailed breakdown of the data curation steps and training methodology, refer to the [blog](https://novasky-ai.github.io/posts/reduce-overthinking/).

## Setup

Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands should be run from the root directory of the repo.


## Stage 1: Data Generation

We used `Sky-T1-32B-Preview` to generate responses to the 12K questions in the `PRM800K` dataset. For each question, we used a temperature of 1.0 and generated 8 responses to create a diversity of response lengths. We then formed preference pairs to contrast “verbose” vs. “concise” solutions. Specifically, from the generated responses, we picked the shortest correct response as the positive example and the longest correct response as the negative example. We discarded the rest of the generated responses, and discarded any questions that did not produce at least two correct responses. We also incorporated a small number of coding preference pairs, which simultaneously boosts coding accuracy and further reduces coding generation lengths.
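
A minimal sketch of the pair-construction logic described above (illustrative only; the actual implementation lives in `response_rewrite.py`, and the field names here are assumptions):

```python
# Illustrative sketch of preference-pair construction. The field names
# ("responses", "correct", "text", "problem") are assumptions, not the actual
# schema used by response_rewrite.py.
def make_preference_pair(question: dict):
    correct = [r for r in question["responses"] if r["correct"]]
    if len(correct) < 2:
        return None  # discard questions without at least two correct responses
    by_length = sorted(correct, key=lambda r: len(r["text"]))
    return {
        "prompt": question["problem"],
        "chosen": by_length[0]["text"],    # shortest correct response -> positive
        "rejected": by_length[-1]["text"],  # longest correct response -> negative
    }
```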

## Stage 2: Response Rewriting
The file `response_rewrite.py` provides a pipeline for filtering and rewriting responses generated with `inference_and_check.py`. We use `response_rewrite.py` to create preference pairs for preference optimization (e.g., DPO, SimPO); however, the logic can be edited for alternative filtering and rewriting steps. Details of the implemented logic can be found in `response_rewrite.py` or in [this blog post](https://novasky-ai.github.io/posts/reduce-overthinking).

To use our preference optimization pipeline, first generate and score multiple responses using `inference_and_check.py`. For example:

```shell
python -m skythought_evals.inference_and_check --inference --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
python -m skythought_evals.inference_and_check --check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
```

Then, use `response_rewrite.py` to process the responses into preference pairs. By default, the shortest correct responses will be used as positive examples and the longest correct responses will be used as negative examples. The argument `--SILC` can be used to also include short incorrect responses as negative examples and long correct responses as positive examples.

```shell
python scripts/response_rewrite.py --SILC --rewrite-model meta-llama/Meta-Llama-3-8B-Instruct --target-model NovaSky-AI/Sky-T1-32B-Preview --dataset [PATH_TO_GENERATED_RESPONSES] --result-dir ./ --checkpoint --tp 8
```

The `--checkpoint` argument can optionally be used to save intermediate files of the processed data between steps, in case of failure.

The resulting `.json` files can be used to train a model with preference optimization algorithms. See the `/train/` directory for more details.

63 changes: 63 additions & 0 deletions recipes/sky-t1-preview/README.md
@@ -0,0 +1,63 @@
# Sky-T1-32B-Preview

[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Preview) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k) | [Blog](https://novasky-ai.github.io/posts/sky-t1/)

Given below are the instructions to replicate the data preprocessing and training steps for Sky-T1-32B-Preview.

## Setup

Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands should be run from the root directory of the repo.
Set the environment variable `SKYT_HOME` to the directory for the final dataset.

## Training Data Curation

To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning and apply a rejection sampling procedure to improve the data quality. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).

The final data contains (1) 5k coding data from APPS and TACO, (2) 10k math data from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, and (3) 1k science and puzzle data from STILL-2.

### Step 0 (Only for NUMINA math dataset): Label Math Difficulty from NUMINA

We provide the labelled NUMINA dataset used for training here: https://huggingface.co/datasets/NovaSky-AI/labeled_numina_difficulty . For replication, read on below.

Put one or more OpenAI API keys in a file, e.g., `keys.txt` (one per line). If there is more than one key, the script will use them in a round-robin fashion to speed up generation. Label math difficulty using GPT-4o-mini:
#### Example usage:
```
python scripts/label_math_difficulty.py --source [amc_aime, math, olympiads] --keys keys.txt
```
The expected output is `labeled_source_0_-1.json`. We also provide instructions to download these files under the `labeled_numina_difficulty` folder (download from Hugging Face).
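
A minimal sketch of the round-robin key usage described above (illustrative only; the actual logic lives in the labeling script):

```python
from itertools import cycle

# Read one API key per line from keys.txt and rotate through them so that
# successive requests use different keys. This mirrors the round-robin
# behaviour described above; it is a sketch, not the actual script.
with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

key_pool = cycle(keys)

def next_key() -> str:
    """Return the next API key in round-robin order."""
    return next(key_pool)

# Usage (illustrative): create a client with a fresh key per request, e.g.
# client = openai.OpenAI(api_key=next_key())
```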

### Step 1: Data Inference
Run inference with QwQ on several datasets. For the preview version, we use data from the following datasets:

```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --difficulty MEDIUM --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source olympiads --end 20000 --filter-difficulty --result-dir $SKYT_HOME/data --inference
```

### Step 2: Format the response
After obtaining a list file for the training data, convert it to a unified format (note: this uses GPT-4o-mini to rewrite; the output is long and costs ~$100 for our preview data).
```shell
python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
```

### Step 3: Rejection Sampling on the formatted data (example usage with the previous script)
```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --subset all --result-dir $SKYT_HOME/data --check
```
Repeat similarly for the other datasets.
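
Conceptually, this rejection sampling step keeps only the samples whose checked answers are correct; a minimal sketch, assuming a `correct` field in the saved results (the field name is an assumption, not the actual schema):

```python
import json

# Illustrative rejection-sampling filter: keep only samples that passed the
# correctness check from inference_and_check. The "correct" field name is an
# assumption about the saved results.
def filter_correct(results_path: str, output_path: str) -> None:
    with open(results_path) as f:
        samples = json.load(f)
    kept = [s for s in samples if s.get("correct")]
    with open(output_path, "w") as f:
        json.dump(kept, f, indent=2)
```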

### Convert to ShareGPT format for training
After obtaining multiple converted files, merge them together and convert to the ShareGPT format to perform training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download their portion of the data and simply concatenate it to the data obtained above.
```shell
python scripts/convert_to_data.py --input_dir $SKYT_HOME/data --output $SKYT_HOME/data/train_data.json
```
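
For reference, ShareGPT-style records generally look like the following sketch (the exact fields written by `convert_to_data.py` may differ):

```python
# A single ShareGPT-style record: an optional system prompt plus alternating
# human/assistant turns. This shows the general shape of the format only; the
# exact fields produced by convert_to_data.py may differ.
example_record = {
    "system": "Your role as an assistant involves thoroughly exploring questions ...",
    "conversations": [
        {"from": "human", "value": "Return your final response within \\boxed{}. <problem>"},
        {
            "from": "gpt",
            "value": "<|begin_of_thought|> ... <|end_of_thought|>\n"
                     "<|begin_of_solution|> ... <|end_of_solution|>",
        },
    ],
}
```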

Empty file added scripts/__init__.py
Empty file.
@@ -1,7 +1,7 @@
import json
import random

from skythought_evals.util.prompts import system_prompt
from .prompts import system_prompt

still2_jsonl_file = "../../data/public_long_form_thought_data_5k.jsonl"
code_json_file = "../../data/converted_apps_long_form_thought_data_5k.json"
@@ -6,9 +6,10 @@
from itertools import cycle

import openai
from skythought_evals.util.prompts import convert_prompt, convert_prompt_example
from tqdm import tqdm

from .prompts import convert_prompt, convert_prompt_example

global args


@@ -2,7 +2,7 @@
import json
import os

from skythought_evals.util.prompts import system_prompt
from .prompts import system_prompt


def main():
@@ -9,9 +9,10 @@

import openai
from datasets import load_dataset
from skythought_evals.util.prompts import aops_criteria, grading_prompt
from tqdm import tqdm

from .prompts import aops_criteria, grading_prompt


# Function to set the OpenAI API key
def set_openai_key(api_key):
File renamed without changes.
@@ -3,15 +3,72 @@
import os
import random

from skythought_evals.models import ModelConfig
from skythought_evals.util.math_parsing_util import strip_answer_string
from skythought_evals.util.model_utils import (
SUBPROBLEM_SPLIT_PROMPT,
SUBSOLUTION_EXTRACTION_PROMPT,
SYSTEM_PROMPT,
)
from tqdm import tqdm
from vllm import LLM, SamplingParams

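# Prompt for splitting a reasoning trace into separate lines of thought,
# inserting '#####' between them and immediately before the final solution block.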
SUBPROBLEM_SPLIT_PROMPT = """
You are given a reasoning sequence that attempts to solve a math problem.
This sequence contains multiple proposed solutions, then provides the final solution.
Each proposed solution within the sequence follows a different line of thought, usually to double check the answer.
Your objective is to identify these separate lines of thought and add the separator string '#####' between the separate lines of thought.
This is important: Your response should be the original unchanged reasoning sequence, except for '#####' injected into the sequence between distinct lines of thought.
Do NOT summarize portions of the reasoning sequence with '...'.
Please keep the sequence that starts with '<|begin_of_solution|>' and ends with '<|end_of_solution|>' as
one single sequence with no '#####' inside of the sequence. Add the separator '#####' immediately before '<|begin_of_solution|>'.
Importantly, only use '#####' if a line of thought presents an answer.
If the line of thought does not include an answer, it cannot be considered a separate line of thought, and should not be separated.
For example, if the input is:
<|begin_of_thought|>The answer to 2+3 is 5. But wait, let me double check this.
If I have two apples and I am given three more apples, I now have 5 apples, so 5 seems like the right answer.
Alternatively, 2+3 is the same as 3+2, which is also 5.<|end_of_thought|>
<|begin_of_solution|>The answer is 5<|end_of_solution|>.
Your output should be:
<|begin_of_thought|>The answer to 2+3 is 5.
#####
But wait, let me double check this.
If I have two apples and I am given three more apples, I now have 5 apples, so 5 seems like the right answer.
#####
Alternatively, 2+3 is the same as 3+2, which is also 5.<|end_of_thought|>
#####
<|begin_of_solution|>The answer is 5<|end_of_solution|>.
""" # noqa: E501

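# Prompt for judging whether the final proposed answer in a response matches
# the ground-truth answer given after '#####'; the model should reply "True" or "False".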
SUBSOLUTION_EXTRACTION_PROMPT = """
You are given the text of an attempt to solve a math problem. The text contains a final proposed answer to the math problem.
The text also contains a string '#####' and after this string the ground truth answer is presented.
Your objective is to determine whether the final proposed answer is equivalent to the ground truth answer.
The proposed answer and ground truth answer may be in slightly different formats. For example, the proposed answer may be '1/2' but the ground truth is '0.5'.
Equivalent answers in different formats should be treated as equivalent.
If the text contains multiple proposed answers, use the final proposed answer.
You should return only "True" if the proposed answer is equivalent to the ground truth answer and "False" if there is no proposed answer or if the proposed answer is not equivalent to the ground truth.
Do NOT respond with anything at all except "True" or "False".
For example, if you are given:
I believe 2+3 equals 5.
#####
The ground truth answer is five.
Your response should be:
True
Another example, if you are given:
I believe 2+2 equals 4. But wait, it is actually 5.
#####
The ground truth answer is five.
Your response should be:
True
""" # noqa: E501


def load_dataset(dataset_path: str):
data = {}
@@ -450,7 +507,7 @@ def main():
variants_dataset, ["fcs", "fcs_plus1", "fcs_reflection"], llm
)

system_prompt = SYSTEM_PROMPT[args.target_model]
system_prompt = ModelConfig.from_model_id(args.target_model).system_prompt

# Generate conversation format for each variant, which can be used in SimPO/DPO/etc.
fcs_convo = make_preference_conversations(final_dataset, "fcs", system_prompt)
File renamed without changes.
3 changes: 0 additions & 3 deletions skythought/skythought_evals/.gitattributes

This file was deleted.
