Refactor model-specific configs and move data curation scripts (#60)
Another refactor PR on top of #23 now focused on model-specific configurations and data generation.

- Model-specific system prompts, user templates, etc. are best kept in a YAML file.
- TaskHandler should be model-agnostic, since we want consistent evaluation logic for all tasks.
- Data curation scripts for the different Sky-T1 models should live outside the `skythought_evals` package. These are mostly scripts focused on a particular data curation task like filtering, rewriting, etc. My proposal is to place common scripts in `scripts/`. A guide for obtaining the final training data plus training commands for the different Sky-T1 models should be placed in `recipes/`. For now, all data curation scripts are in the `scripts` folder.
- Adds a new `system-prompt-template` CLI flag. Users can apply available templates, like those for Sky-T1 or Qwen, to a different model during evaluation (see the example below).
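
A hypothetical invocation (the template name and flag values here are illustrative; see the evals README for exact usage):

```shell
# Hypothetical example: evaluate a different model while reusing an available
# system prompt template (the template name "sky-t1" is illustrative)
python -m skythought_evals.inference_and_check --inference --task math500 \
    --model Qwen/Qwen2-7B-Instruct --system-prompt-template sky-t1 \
    --tp 4 --max_tokens 4096 --result-dir ./
```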
SumanthRH authored Feb 5, 2025
1 parent a399909 commit cb45c81
Showing 38 changed files with 634 additions and 363 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
# NOTE (sumanthrh): Many of the files excluded here are used for validating code generation, and linters do not recognize some of the logic in these files. skythought/train is excluded for now because it's a fork of Llamafactory
exclude: (^skythought/train|skythought_evals/tasks/taco/pyext2\.py|skythought_evals/tasks/taco/taco_util\.py|skythought_evals/tasks/apps/apps_util\.py|skythought_evals/util/prompts\.py|skythought_evals/util/model_utils\.py)$
exclude: (^skythought/train|skythought_evals/tasks/taco/pyext2\.py|skythought_evals/tasks/taco/taco_util\.py|skythought_evals/tasks/apps/apps_util\.py|scripts/prompts\.py)$


# Black needs to be run after ruff with --fix
20 changes: 18 additions & 2 deletions README.md
@@ -34,8 +34,8 @@
# Getting Started

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.
- ``/data``: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
- ``skythought/skythought_evals``: Our data generation and evaluation library. To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning and apply a rejection sampling procedure to improve the data quality.
- ``recipes``: Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash` and `Sky-T1-32B-Preview`.
- ``skythought/skythought_evals``: Our data generation and evaluation library.
- ``skythought/train``: Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Our model training was completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offloading, costing approximately $450 as per Lambda Cloud pricing.


@@ -54,6 +54,22 @@ conda activate eval
pip install -e .
```

We support a wide variety of datasets in mathematics, science and coding:

- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K

For running evaluation, please refer to [skythought_evals/README.md](skythought/skythought_evals/README.md).


2 changes: 0 additions & 2 deletions data/.gitattributes

This file was deleted.

9 changes: 0 additions & 9 deletions data/README.md

This file was deleted.

36 changes: 36 additions & 0 deletions recipes/sky-t1-flash/README.md
@@ -0,0 +1,36 @@

# Sky-T1-32B-Flash

[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Flash) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_preference_data_10k) | [Blog](https://novasky-ai.github.io/posts/reduce-overthinking/)

For a detailed breakdown of the data curation steps and training methodology, refer to the [blog](https://novasky-ai.github.io/posts/reduce-overthinking/).

## Setup

Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands should be run from the root directory of the repo.


## Stage 1: Data Generation

We used `Sky-T1-32B-Preview` to generate responses to the 12K questions in the `PRM800K` dataset. For each question, we used a temperature of 1.0 and generated 8 responses to create a diversity of response lengths. We then formed preference pairs to contrast “verbose” vs. “concise” solutions. Specifically, from the generated responses, we picked the shortest correct response as the positive example and the longest correct response as the negative example. We discarded the rest of the generated responses, and discarded any questions that did not produce at least two correct responses. We also incorporated a small number of coding preference pairs, which simultaneously boosts coding accuracy and further reduces coding generation lengths.
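
A minimal sketch of the pair-construction logic described above (illustrative only; the actual implementation lives in `response_rewrite.py`, and the field names here are assumptions):

```python
# Illustrative sketch of preference-pair construction. The field names
# ("responses", "correct", "text", "problem") are assumptions, not the actual
# schema used by response_rewrite.py.
def make_preference_pair(question: dict):
    correct = [r for r in question["responses"] if r["correct"]]
    if len(correct) < 2:
        return None  # discard questions without at least two correct responses
    by_length = sorted(correct, key=lambda r: len(r["text"]))
    return {
        "prompt": question["problem"],
        "chosen": by_length[0]["text"],    # shortest correct response -> positive
        "rejected": by_length[-1]["text"],  # longest correct response -> negative
    }
```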

## Stage 2: Response Rewriting
The file `response_rewrite.py` provides a pipeline for filtering and rewriting responses generated with `inference_and_check.py`. We use `response_rewrite.py` to create preference pairs for preference optimization (e.g., DPO, SimPO); however, the logic can be edited for alternative filtering and rewriting steps. Details of the implemented logic can be found in `response_rewrite.py` or in [this blog post](https://novasky-ai.github.io/posts/reduce-overthinking).

To use our preference optimization pipeline, first generate and score multiple responses using `inference_and_check.py`. For example:

```shell
python -m skythought_evals.inference_and_check --inference --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
python -m skythought_evals.inference_and_check --check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --result-dir ./ --temperatures 0.7 --n 8
```

Then, use `response_rewrite.py` to process the responses into preference pairs. By default, the shortest correct responses will be used as positive examples and the longest correct responses will be used as negative examples. The argument `--SILC` can be used to also include short incorrect responses as negative examples and long correct responses as positive examples.

```shell
python scripts/response_rewrite.py --SILC --rewrite-model meta-llama/Meta-Llama-3-8B-Instruct --target-model NovaSky-AI/Sky-T1-32B-Preview --dataset [PATH_TO_GENERATED_RESPONSES] --result-dir ./ --checkpoint --tp 8
```

The `--checkpoint` argument can optionally be used to save intermediate files of the processed data between steps, in case of failure.

The resulting `.json` files can be used to train a model with preference optimization algorithms. See the `/train/` directory for more details.

63 changes: 63 additions & 0 deletions recipes/sky-t1-preview/README.md
@@ -0,0 +1,63 @@
# Sky-T1-32B-Preview

[Model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Preview) | [Dataset](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k) | [Blog](https://novasky-ai.github.io/posts/sky-t1/)

Given below are the instructions to replicate the data preprocessing and training steps for Sky-T1-32B-Preview.

## Setup

Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md). All data curation commands should be run from the root directory of the repo.
Set the environment variable `SKYT_HOME` to the directory for the final dataset.

## Training Data Curation

To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning and apply a rejection sampling procedure to improve the data quality. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).

The final data contains (1) 5k coding data from APPS and TACO, (2) 10k math data from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, and (3) 1k science and puzzle data from STILL-2.

### Step 0 (Only for NUMINA math dataset): Label Math Difficulty from NUMINA

We provide the labelled NUMINA dataset used for training here: https://huggingface.co/datasets/NovaSky-AI/labeled_numina_difficulty . For replication, read on below.

Put one or more OpenAI API keys in a file, e.g., `keys.txt` (one per line). If there is more than one key, the script will use them in a round-robin fashion to speed up generation. Label math difficulty using GPT-4o-mini:
#### Example usage:
```
python scripts/label_math_difficulty.py --source [amc_aime, math, olympiads] --keys keys.txt
```
The expected output is `labeled_source_0_-1.json`. We also provide instructions to download these files under the `labeled_numina_difficulty` folder (download from Hugging Face).
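
A minimal sketch of the round-robin key usage described above (illustrative only; the actual logic lives in the labeling script):

```python
from itertools import cycle

# Read one API key per line from keys.txt and rotate through them so that
# successive requests use different keys. This mirrors the round-robin
# behaviour described above; it is a sketch, not the actual script.
with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

key_pool = cycle(keys)

def next_key() -> str:
    """Return the next API key in round-robin order."""
    return next(key_pool)

# Usage (illustrative): create a client with a fresh key per request, e.g.
# client = openai.OpenAI(api_key=next_key())
```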

### Step 1: Data Inference
Run inference with QwQ on several datasets. For the preview version, we use data from the following datasets:

```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --difficulty MEDIUM --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task taco --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --difficulty all --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir $SKYT_HOME/data --inference

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --source olympiads --end 20000 --filter-difficulty --result-dir $SKYT_HOME/data --inference
```

### Step 2: Format the response
After obtaining a list file for the training data, convert it to a unified format (note: this uses GPT-4o-mini to rewrite; the output is long and costs ~$100 for our preview data).
```shell
python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
```

### Step 3: Rejection Sampling on the formatted data (example usage with the previous script)
```shell
python -m skythought_evals.inference_and_check --task apps --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split test --subset all --result-dir $SKYT_HOME/data --check
```
Repeat similarly for the other datasets.
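
Conceptually, this rejection sampling step keeps only the samples whose checked answers are correct; a minimal sketch, assuming a `correct` field in the saved results (the field name is an assumption, not the actual schema):

```python
import json

# Illustrative rejection-sampling filter: keep only samples that passed the
# correctness check from inference_and_check. The "correct" field name is an
# assumption about the saved results.
def filter_correct(results_path: str, output_path: str) -> None:
    with open(results_path) as f:
        samples = json.load(f)
    kept = [s for s in samples if s.get("correct")]
    with open(output_path, "w") as f:
        json.dump(kept, f, indent=2)
```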

### Convert to ShareGPT format for training
After obtaining multiple converted files, merge them together and convert to the ShareGPT format to perform training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download their portion of the data and simply concatenate it to the data obtained above.
```shell
python scripts/convert_to_data.py --input_dir $SKYT_HOME/data --output $SKYT_HOME/data/train_data.json
```
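
For reference, ShareGPT-style records generally look like the following sketch (the exact fields written by `convert_to_data.py` may differ):

```python
# A single ShareGPT-style record: an optional system prompt plus alternating
# human/assistant turns. This shows the general shape of the format only; the
# exact fields produced by convert_to_data.py may differ.
example_record = {
    "system": "Your role as an assistant involves thoroughly exploring questions ...",
    "conversations": [
        {"from": "human", "value": "Return your final response within \\boxed{}. <problem>"},
        {
            "from": "gpt",
            "value": "<|begin_of_thought|> ... <|end_of_thought|>\n"
                     "<|begin_of_solution|> ... <|end_of_solution|>",
        },
    ],
}
```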

Empty file added scripts/__init__.py
Empty file.
@@ -1,7 +1,7 @@
import json
import random

from skythought_evals.util.prompts import system_prompt
from .prompts import system_prompt

still2_jsonl_file = "../../data/public_long_form_thought_data_5k.jsonl"
code_json_file = "../../data/converted_apps_long_form_thought_data_5k.json"
@@ -6,9 +6,10 @@
from itertools import cycle

import openai
from skythought_evals.util.prompts import convert_prompt, convert_prompt_example
from tqdm import tqdm

from .prompts import convert_prompt, convert_prompt_example

global args


@@ -2,7 +2,7 @@
import json
import os

from skythought_evals.util.prompts import system_prompt
from .prompts import system_prompt


def main():
@@ -9,9 +9,10 @@

import openai
from datasets import load_dataset
from skythought_evals.util.prompts import aops_criteria, grading_prompt
from tqdm import tqdm

from .prompts import aops_criteria, grading_prompt


# Function to set the OpenAI API key
def set_openai_key(api_key):
File renamed without changes.
@@ -3,15 +3,72 @@
import os
import random

from skythought_evals.models import ModelConfig
from skythought_evals.util.math_parsing_util import strip_answer_string
from skythought_evals.util.model_utils import (
SUBPROBLEM_SPLIT_PROMPT,
SUBSOLUTION_EXTRACTION_PROMPT,
SYSTEM_PROMPT,
)
from tqdm import tqdm
from vllm import LLM, SamplingParams

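# Prompt for splitting a reasoning trace into separate lines of thought,
# inserting '#####' between them and immediately before the final solution block.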
SUBPROBLEM_SPLIT_PROMPT = """
You are given a reasoning sequence that attempts to solve a math problem.
This sequence contains multiple proposed solutions, then provides the final solution.
Each proposed solution within the sequence follows a different line of thought, usually to double check the answer.
Your objective is to identify these separate lines of thought and add the separator string '#####' between the separate lines of thought.
This is important: Your response should be the original unchanged reasoning sequence, except for '#####' injected into the sequence between distinct lines of thought.
Do NOT summarize portions of the reasoning sequence with '...'.
Please keep the sequence that starts with '<|begin_of_solution|>' and ends with '<|end_of_solution|>' as
one single sequence with no '#####' inside of the sequence. Add the separator '#####' immediately before '<|begin_of_solution|>'.
Importantly, only use '#####' if a line of thought presents an answer.
If the line of thought does not include an answer, it cannot be considered a separate line of thought, and should not be separated.
For example, if the input is:
<|begin_of_thought|>The answer to 2+3 is 5. But wait, let me double check this.
If I have two apples and I am given three more apples, I now have 5 apples, so 5 seems like the right answer.
Alternatively, 2+3 is the same as 3+2, which is also 5.<|end_of_thought|>
<|begin_of_solution|>The answer is 5<|end_of_solution|>.
Your output should be:
<|begin_of_thought|>The answer to 2+3 is 5.
#####
But wait, let me double check this.
If I have two apples and I am given three more apples, I now have 5 apples, so 5 seems like the right answer.
#####
Alternatively, 2+3 is the same as 3+2, which is also 5.<|end_of_thought|>
#####
<|begin_of_solution|>The answer is 5<|end_of_solution|>.
""" # noqa: E501

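# Prompt for judging whether the final proposed answer in a response matches
# the ground-truth answer given after '#####'; the model should reply "True" or "False".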
SUBSOLUTION_EXTRACTION_PROMPT = """
You are given the text of an attempt to solve a math problem. The text contains a final proposed answer to the math problem.
The text also contains a string '#####' and after this string the ground truth answer is presented.
Your objective is to determine whether the final proposed answer is equivalent to the ground truth answer.
The proposed answer and ground truth answer may be in slightly different formats. For example, the proposed answer may be '1/2' but the ground truth is '0.5'.
Equivalent answers in different formats should be treated as equivalent.
If the text contains multiple proposed answers, use the final proposed answer.
You should return only "True" if the proposed answer is equivalent to the ground truth answer and "False" if there is no proposed answer or if the proposed answer is not equivalent to the ground truth.
Do NOT respond with anything at all except "True" or "False".
For example, if you are given:
I believe 2+3 equals 5.
#####
The ground truth answer is five.
Your response should be:
True
Another example, if you are given:
I believe 2+2 equals 4. But wait, it is actually 5.
#####
The ground truth answer is five.
Your response should be:
True
""" # noqa: E501


def load_dataset(dataset_path: str):
data = {}
@@ -450,7 +507,7 @@ def main():
variants_dataset, ["fcs", "fcs_plus1", "fcs_reflection"], llm
)

system_prompt = SYSTEM_PROMPT[args.target_model]
system_prompt = ModelConfig.from_model_id(args.target_model).system_prompt

# Generate conversation format for each variant, which can be used in SimPO/DPO/etc.
fcs_convo = make_preference_conversations(final_dataset, "fcs", system_prompt)
File renamed without changes.
3 changes: 0 additions & 3 deletions skythought/skythought_evals/.gitattributes

This file was deleted.
