[Evals] Refactor tasks, add linting and CI #47

Merged: 49 commits, Feb 1, 2025

Commits
aa47191
refactoring init
SumanthRH Jan 23, 2025
ee8d0f4
add a bunch of stuff
SumanthRH Jan 24, 2025
0cf3548
add a bunch of stuff
SumanthRH Jan 24, 2025
f48b5dc
check
SumanthRH Jan 24, 2025
3e624af
check
SumanthRH Jan 24, 2025
859f023
check
SumanthRH Jan 24, 2025
d877bee
more refactoring
SumanthRH Jan 24, 2025
faaf293
x
SumanthRH Jan 24, 2025
c18ad7b
extra large commit
SumanthRH Jan 24, 2025
b7d9532
x
SumanthRH Jan 24, 2025
8b9c67e
minor linting changes
SumanthRH Jan 24, 2025
c767708
minor
SumanthRH Jan 24, 2025
995f1a5
more more more
SumanthRH Jan 24, 2025
7bafae1
mid commit
SumanthRH Jan 24, 2025
829cb4c
x
SumanthRH Jan 25, 2025
484df2d
Merge remote-tracking branch 'upstream/main' into refactor-evals
SumanthRH Jan 25, 2025
3188aeb
add answer_key
SumanthRH Jan 25, 2025
612ecaf
x
SumanthRH Jan 27, 2025
77b8ba3
x
SumanthRH Jan 27, 2025
52a4ce7
fixing pre-commit
SumanthRH Jan 27, 2025
d21d68e
x
SumanthRH Jan 27, 2025
2069df7
move some stuff; init tests; init package skyevals
SumanthRH Jan 27, 2025
05b1653
Merge branch 'refactor-evals' of github.com:sumanthrh/skythought into…
SumanthRH Jan 27, 2025
55b670c
rm llama factory change
SumanthRH Jan 27, 2025
f2d88e9
merge issues
SumanthRH Jan 27, 2025
a9a9380
more linting
SumanthRH Jan 27, 2025
d3dde5d
x
SumanthRH Jan 27, 2025
9d64fd1
more comments
SumanthRH Jan 27, 2025
d55d2a8
x
SumanthRH Jan 28, 2025
840006f
test workflows
SumanthRH Jan 28, 2025
4cdeab0
x
SumanthRH Jan 28, 2025
8d564a3
x
SumanthRH Jan 28, 2025
fc1087e
x
SumanthRH Jan 28, 2025
6e2e979
it's time to fight the CI
SumanthRH Jan 28, 2025
7449117
I might have won the fight:
SumanthRH Jan 28, 2025
04ead2a
CI please
SumanthRH Jan 28, 2025
84c9617
set up permissions
SumanthRH Jan 28, 2025
e032b47
test ci setup
SumanthRH Jan 28, 2025
7e99d60
x
SumanthRH Jan 28, 2025
3f5ff02
x
SumanthRH Jan 28, 2025
79d12a2
update to two workflows
SumanthRH Jan 28, 2025
c8a8d63
update to later vllm; needed for some tokenizer_revision fixes
SumanthRH Jan 28, 2025
aa87124
x
SumanthRH Jan 28, 2025
eab138a
x
SumanthRH Jan 28, 2025
07c21f9
small update
SumanthRH Jan 31, 2025
3d6942f
reworking args
SumanthRH Feb 1, 2025
c2944fe
x
SumanthRH Feb 1, 2025
8a39701
x
SumanthRH Feb 1, 2025
503ea62
x
SumanthRH Feb 1, 2025
10 changes: 10 additions & 0 deletions skythought/tools/.githooks/pre-commit
Collaborator:

We should put this linter stuff at the top level of the repo. Did you have any reason for not putting it there?

Collaborator:

If you don't want to apply these to the train sub-folder, let's just exclude it from the pre-commit / format runs.

Author:

I've added pre-commit hooks for the tools/ folder only, because I'm not sure we're at a state where we can clean up the training-related code in train/. I wanted to separate them out so that those working on training can proceed as usual for now.

LMK if this doesn't make sense. The main reason I kept it separate is that the train/ folder is a fork of the Llamafactory repo, while skythought/tools looks like it will be a package in itself, focused on evals.

Author (SumanthRH), Jan 27, 2025:

Ok, so I moved this to the top level of the repo. I also added a setup.py which installs our evaluation package at skythought/skythought_evals (renamed from tools) as skythought_evals.

@@ -0,0 +1,10 @@
set -e

# Get tools directory path relative to git root
TOOLS_DIR=$(git rev-parse --show-toplevel)/skythought/tools
# Only run pre-commit if changes are in tools/
# Run pre-commit from tools/ directory to use linting rules in this directory
if git diff --cached --name-only | grep "^skythought/tools/"; then
cd $TOOLS_DIR;
pre-commit run --files $(git diff --cached --name-only | grep "^skythought/tools/") --config .pre-commit-config.yaml
fi
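
(Note: since this hook lives in a custom .githooks/ directory rather than .git/hooks/, it presumably only runs once Git is pointed at that directory, e.g. via git config core.hooksPath skythought/tools/.githooks, or once the script is copied into .git/hooks/. That setup step isn't shown in this diff.)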
12 changes: 12 additions & 0 deletions skythought/tools/.pre-commit-config.yaml
@@ -0,0 +1,12 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.9
hooks:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]

# Black needs to be run after ruff with --fix
- repo: https://github.com/psf/black
rev: 23.3.0
hooks:
- id: black
23 changes: 13 additions & 10 deletions skythought/tools/combine_data.py
Collaborator:

Maybe not in this PR, but for a better-structured codebase we should take these scripts and put them in the following refactored structure:

skythought/... # reusable python modules go in here. 
project_scripts/
    skythought-t1/
        data_preparation/
            combine_data.py
        train_yamls/
            ...
    skythought-t1-flash/
        ...
tests/
    skythought/... # internal unittests

Author (SumanthRH), Jan 27, 2025:

Yeah, agreed. Let's postpone this higher-level folder re-org for later.

I've now moved the linting and pre-commit hooks to the top-level repo and ignored the train/ folder for now. I've added basic tests in tests/tools for skythought/tools. Also, since we need a package name to make use of the module in tests, I wrote a small setup.py that sets up the package skythought_evals; this is just an alias for skythought.tools.
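
For reference, a minimal sketch of what such a setup.py could look like (hypothetical; the actual file in the PR may differ):

from setuptools import setup

setup(
    name="skythought_evals",
    version="0.0.1",
    # Assumed layout: expose the skythought/tools directory under the alias
    # package name skythought_evals so tests can `import skythought_evals`.
    package_dir={"skythought_evals": "skythought/tools"},
    packages=["skythought_evals"],
)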

@@ -1,5 +1,6 @@
import json
import random

from util.prompts import system_prompt

still2_jsonl_file = "../../data/public_long_form_thought_data_5k.jsonl"
@@ -25,14 +26,11 @@
# Create the conversation format
conversations = [
{"from": "user", "value": question},
{"from": "assistant", "value": combined_text}
{"from": "assistant", "value": combined_text},
]

# Prepare the final structure
cur_data = {
"system": system_prompt,
"conversations": conversations
}
cur_data = {"system": system_prompt, "conversations": conversations}
all_data.append(cur_data)
else:
code_num += 1
@@ -43,14 +41,19 @@
# print(code_data[0])

all_data.extend(code_data)
print(f"First item slice before shuffle: {all_data[0]['conversations'][-1]['value'][-50:-1]}")
print(
f"First item slice before shuffle: {all_data[0]['conversations'][-1]['value'][-50:-1]}"
)
random.shuffle(all_data)
print(f"First item slice after shuffle: {all_data[0]['conversations'][-1]['value'][-50:-1]}")
print(
f"First item slice after shuffle: {all_data[0]['conversations'][-1]['value'][-50:-1]}"
)
print(len(all_data))

# Save the converted data to the output file
with open(output_file, "w") as f:
json.dump(all_data, f, indent=4)

print(f"Conversion completed. The data has been saved to {output_file} with {len(all_data)} data.")

print(
f"Conversion completed. The data has been saved to {output_file} with {len(all_data)} data."
)
51 changes: 34 additions & 17 deletions skythought/tools/convert_format.py
@@ -1,23 +1,28 @@
import json
import argparse
from tqdm import tqdm
import json
import multiprocessing as mp
import openai
from itertools import cycle
import time
import os
import time
from itertools import cycle

import openai
from tqdm import tqdm

from util.prompts import convert_prompt, convert_prompt_example

global args


# Function to set the OpenAI API key
def set_openai_key(api_key):
openai.api_key = api_key


# GPT API processing function with retry logic
def process_content(content, api_key):
# Set the OpenAI key for this request
set_openai_key(api_key)

# GPT prompt
prompt = convert_prompt.format(example=convert_prompt_example, content=content)

@@ -28,44 +33,54 @@ def process_content(content, api_key):
response = openai.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a solution format convertor."},
{"role": "user", "content": prompt}
{
"role": "system",
"content": "You are a solution format convertor.",
},
{"role": "user", "content": prompt},
],
max_tokens=16384,
temperature=0.7
temperature=0.7,
)
return response.choices[0].message.content
except openai.RateLimitError:
retries -= 1
if retries == 0:
return "Error: Rate limit reached and retries exhausted."
print(f"Sleep for 5 seconds for API limit.")
print("Sleep for 5 seconds for API limit.")
time.sleep(5)
except Exception as e:
return f"Error processing content: {e}"


# Function for multiprocessing
def process_entry(entry, api_key_cycle):
key, values = entry
content = values["responses"]["0.7"]["content"]

# Get the next API key from the cycle
api_key = next(api_key_cycle)

processed = process_content(content, api_key)
values["responses"]["0.7"]["processed_content"] = processed

return key, values


# Wrapper function for multiprocessing
def process_entry_wrapper(args):
return process_entry(*args)


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Process content and save results.")
parser.add_argument("--input_dir", type=str, help="Input directory containing JSON files.")
parser.add_argument("--keys", type=str, help="File containing OpenAI API keys (one per line).")

parser.add_argument(
"--input_dir", type=str, help="Input directory containing JSON files."
)
parser.add_argument(
"--keys", type=str, help="File containing OpenAI API keys (one per line)."
)

global args
args = parser.parse_args()

@@ -90,7 +105,9 @@ def process_entry_wrapper(args):
results = []
with mp.Pool(os.cpu_count()) as pool:
tasks = [(entry, api_key_cycle) for entry in data.items()]
for result in tqdm(pool.imap(process_entry_wrapper, tasks), total=len(data)):
for result in tqdm(
pool.imap(process_entry_wrapper, tasks), total=len(data)
):
results.append(result)

# Aggregate and write results in the main process
26 changes: 19 additions & 7 deletions skythought/tools/convert_to_data.py
@@ -1,11 +1,15 @@
import os
import json
import argparse
import json
import os

from util.prompts import system_prompt


def main():
parser = argparse.ArgumentParser(description="Convert JSON data for processing.")
parser.add_argument("--input_dir", type=str, help="Directory containing input JSON files.")
parser.add_argument(
"--input_dir", type=str, help="Directory containing input JSON files."
)
parser.add_argument("--output", type=str, help="Output JSON file.")
args = parser.parse_args()

@@ -24,27 +28,35 @@ def main():

for cur_temp, cur_temp_response in response_data.items():
# Only support 0.7 for this version
assert cur_temp == "0.7", "Only support a single temperature=0.7 now."
assert (
cur_temp == "0.7"
), "Only support a single temperature=0.7 now."
# Accept this data
if cur_temp_response["correctness"]:
# Create the conversation format
conversations = [
{"from": "user", "value": prompt},
{"from": "assistant", "value": cur_temp_response["processed_content"]}
{
"from": "assistant",
"value": cur_temp_response["processed_content"],
},
]

# Prepare the final structure
cur_data = {
"system": system_prompt,
"conversations": conversations
"conversations": conversations,
}
all_data.append(cur_data)

# Save the converted data to the output file
with open(args.output, "w") as f:
json.dump(all_data, f, indent=4)

print(f"Conversion completed. The data has been saved to {args.output} with {len(all_data)} data.")
print(
f"Conversion completed. The data has been saved to {args.output} with {len(all_data)} data."
)


if __name__ == "__main__":
main()
80 changes: 51 additions & 29 deletions skythought/tools/eval.py
@@ -1,32 +1,45 @@
import argparse
import json
import subprocess
import os
import json

# Define eval to split mapping
eval_to_split = {
"MATH500": "test",
"AIME": "train",
"GPQADiamond": "train",
"MMLU": "test",
"MMLUPro": "test",
"LiveCodeBench": "test",
"GSM8K": "test",
"ARC-C": "test",
"AMC23": "train",
}
from skythought.tools.tasks.task_util import get_tasks

module_dir = os.path.dirname(os.path.abspath(__file__))
TASK_NAMES_TO_YAML = get_tasks(os.path.join(module_dir, "tasks"))

def parse_arguments():
parser = argparse.ArgumentParser(description="Process model path, prompt format, and evals to run.")
parser = argparse.ArgumentParser(
description="Process model path, prompt format, and evals to run."
)
parser.add_argument("--model", required=True, type=str, help="Path to the model.")
parser.add_argument("--evals", required=True, type=str, help="Comma-separated list of evals to run (no spaces).")
parser.add_argument(
"--evals",
required=True,
type=str,
help="Comma-separated list of evals to run (no spaces).",
)
parser.add_argument("--tp", type=int, default=8, help="Tensor Parallelism Degree")
parser.add_argument("--filter-difficulty", action="store_true", help="Filter difficulty.")
parser.add_argument(
"--filter-difficulty", action="store_true", help="Filter difficulty."
)
parser.add_argument("--source", type=str, help="Source for the dataset.")
parser.add_argument("--output_file", required=True, type=str, help="Output file to write results to.")
parser.add_argument("--temperatures", type=float, nargs="+", default=[0], help="Temperature for sampling.")
parser.add_argument(
"--output_file",
required=True,
type=str,
help="Output file to write results to.",
)
parser.add_argument(
"--temperatures",
type=float,
nargs="+",
default=[0],
help="Temperature for sampling.",
)
return parser.parse_args()


def extract_accuracy_from_output(output):
# Iterate through all lines from the end to the beginning
lines = output.splitlines()[::-1]
@@ -37,9 +50,10 @@ def extract_accuracy_from_output(output):
if "acc" in data:
return data["acc"]
except json.JSONDecodeError:
continue
continue
return None


def write_logs_to_file(logs, output_file):
try:
with open(output_file, "w") as file:
@@ -48,6 +62,7 @@ def write_logs_to_file(logs, output_file):
except IOError as e:
print(f"Failed to write logs to file {output_file}: {e}")


def main():
args = parse_arguments()

@@ -60,22 +75,26 @@ def main():

script_path = "inference_and_check.py"

# Hold all logs
# Hold all logs
all_logs = ""
results = {}

# Run the Python command for each eval and collect logs
for eval_name in evals:
eval_name = eval_name.lower()
command = [
"python", script_path,
"--model", model_path,
"--dataset", eval_name,
"--split", eval_to_split[eval_name],
"--tp", str(tp),
"--temperatures"
"python",
script_path,
"--model",
model_path,
"--dataset",
eval_name,
"--tp",
str(tp),
"--temperatures",
]
command.extend(temperatures) # Add temperatures as separate arguments

if args.filter_difficulty:
assert args.source != "", "No source passed for filtering difficulty."
command.append("--filter-difficulty")
@@ -84,7 +103,9 @@ def main():
print(f"Running eval {eval_name} with command {command}")
all_logs += f"\nRunning eval: {eval_name} with command {command}\n"
try:
with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True) as proc:
with subprocess.Popen(
command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
) as proc:
output_lines = []
for line in proc.stdout:
print(line, end="") # Stream output to the console
@@ -110,5 +131,6 @@ def main():
print("Results:")
print(results)


if __name__ == "__main__":
main()
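
This diff drops the hard-coded eval_to_split mapping in favor of task configs discovered on disk via get_tasks. A rough sketch of what such a helper might do (hypothetical; the real implementation in skythought/tools/tasks/task_util.py may differ):

import os

def get_tasks(task_root_dir):
    # Assumed behavior: walk the tasks/ directory and map each task name
    # (the YAML filename without its extension) to the config's full path.
    names_to_yaml = {}
    for root, _dirs, files in os.walk(task_root_dir):
        for fname in files:
            if fname.endswith(".yaml"):
                name = os.path.splitext(fname)[0]
                names_to_yaml[name] = os.path.join(root, fname)
    return names_to_yaml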