Refactor #5

Open · wants to merge 65 commits into base: main

Changes from all commits (65 commits)

94f3984
move outside
tcapelle Oct 30, 2024
650d4d9
move to notebooks
tcapelle Oct 30, 2024
af7c3c5
add data
tcapelle Oct 30, 2024
1292407
enable git lfs
tcapelle Oct 30, 2024
f9fb498
test and publish
tcapelle Oct 30, 2024
92dc585
make it installable
tcapelle Oct 30, 2024
c64552a
amazon reviews
tcapelle Oct 30, 2024
a367d2b
add json to LFS
tcapelle Oct 30, 2024
96fe2cb
add amazons reviews
tcapelle Oct 30, 2024
c272b52
amazon nb
tcapelle Oct 30, 2024
bcbc136
fix category
tcapelle Oct 30, 2024
c60ec70
update nb
tcapelle Oct 30, 2024
8367d77
helpful reviews
tcapelle Oct 30, 2024
6c2242d
save load JSONL
tcapelle Oct 30, 2024
3565670
listify
tcapelle Oct 30, 2024
6e1d0e7
update nb
tcapelle Oct 30, 2024
db6d421
dataset creation
tcapelle Oct 31, 2024
5df5381
annotate reviews
tcapelle Oct 31, 2024
d34dc6b
some renames
tcapelle Oct 31, 2024
54e873b
move main to test
tcapelle Oct 31, 2024
02ef5fa
add tests to utils
tcapelle Oct 31, 2024
c6a0554
test and refactor formatter
tcapelle Oct 31, 2024
b3af223
refactor LLM
tcapelle Oct 31, 2024
768ca2c
fix reqs
tcapelle Oct 31, 2024
b238d86
fix litellm integration
tcapelle Oct 31, 2024
fa8289f
add autopep
tcapelle Oct 31, 2024
1c08677
lint
tcapelle Oct 31, 2024
d6122bd
move prompts to file
tcapelle Nov 5, 2024
56bc141
refactor to use BaseModel instead of Tuple
tcapelle Nov 5, 2024
8bcaab8
refactor logging
tcapelle Nov 5, 2024
472dd34
rename fit
tcapelle Nov 5, 2024
9cb136f
refactor prompts and assertions
tcapelle Nov 5, 2024
3e40021
add mini example for debug
tcapelle Nov 5, 2024
6bdfd0f
rename default LLM
tcapelle Nov 5, 2024
1004d06
add a better dataset
tcapelle Nov 6, 2024
0f93bd1
working refactor!
tcapelle Nov 6, 2024
0662005
large dataset!
tcapelle Nov 6, 2024
1930524
latest workiong nb
tcapelle Nov 6, 2024
9afb1d9
nice logging
tcapelle Nov 7, 2024
6a985f1
2 working examples
tcapelle Nov 7, 2024
58ab3bd
add CLI, delete unused files
tcapelle Nov 7, 2024
2d93a5e
improve readme
tcapelle Nov 7, 2024
0b84abd
auto detect data format
tcapelle Nov 8, 2024
4fc9f42
add mini data
tcapelle Nov 8, 2024
b5bba8a
add review eval py script
morganmcg1 Nov 8, 2024
63fa45c
add review eval python script
morganmcg1 Nov 8, 2024
246af10
Merge branch 'cape' of https://github.com/wandb/evalForge into cape
morganmcg1 Nov 8, 2024
dc3cc20
add bulk review evals python script.py
morganmcg1 Nov 8, 2024
d1e6208
missing deps
tcapelle Nov 8, 2024
60c1c44
lint
tcapelle Nov 8, 2024
f115005
more lint
tcapelle Nov 8, 2024
b207233
only lint on push
tcapelle Nov 8, 2024
bb9480c
python 3.11
tcapelle Nov 8, 2024
36b6055
let's blackify them
tcapelle Nov 8, 2024
2fb0dba
fix tests
tcapelle Nov 8, 2024
344caaa
missing rich dep, remove tqdm
tcapelle Nov 8, 2024
c13887e
pass wandb key
tcapelle Nov 8, 2024
c3ac7fb
50k reviews evals
morganmcg1 Nov 8, 2024
12b8241
fix async test dep
tcapelle Nov 8, 2024
1d515f8
rename
tcapelle Nov 8, 2024
75c5665
Merge branch 'cape' of https://github.com/wandb/evalForge into cape
morganmcg1 Nov 8, 2024
47deea8
add args to review evals script
morganmcg1 Nov 8, 2024
2eb5236
push 50k review evals
morganmcg1 Nov 8, 2024
79bf3b9
cassetes...
tcapelle Nov 8, 2024
6869fff
update to mini
tcapelle Nov 19, 2024
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
*.pdf filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
*.jsonl filter=lfs diff=lfs merge=lfs -text
29 changes: 29 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,29 @@
name: Lint Code

on: push

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install black ruff

      - name: Run Black
        continue-on-error: true
        run: black --check .

      - name: Run Ruff
        continue-on-error: true
        run: ruff .
29 changes: 29 additions & 0 deletions .github/workflows/pypi.yml
@@ -0,0 +1,29 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload to PyPI

on:
  release:
    types: [created]

jobs:
  pipy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -U setuptools wheel twine build
      - name: Build and publish
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
        run: |
          python -m build
          twine upload dist/*
32 changes: 32 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,32 @@
name: Run Test

on:
  push:
    branches: [ "main" ]
    paths:
      - '**.py'
  pull_request:
    branches: [ "main" ]
    paths:
      - '**.py'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
          python -m pip install .
      - name: Test with pytest
        run: |
          pytest . -v tests/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
Binary file added 2404.12272v1.pdf
Binary file not shown.
237 changes: 222 additions & 15 deletions README.md
@@ -1,18 +1,109 @@
# 🚀 EvalForge Project

EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.

## 🛠️ Setup

1. **Environment Variables**

Create a `.env` file in the project root with the following variables:

```
WANDB_EMAIL=your_wandb_email
WANDB_API_KEY=your_wandb_api_key
OPENAI_API_KEY=your_openai_api_key
```

2. **Install Dependencies**

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Or, install directly via pip:

```bash
pip install git+https://github.com/wandb/evalforge.git
```

## 🏃‍♂️ Quick Start with Command-Line Interface

EvalForge now includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.

### Basic Usage

```bash
evalforge.forge --data path/to/data.json
```

### Available Arguments

You can customize EvalForge using various command-line arguments corresponding to the `EvalForge` class attributes:

- `--data`: *(Required)* Path to the training data file (JSON or CSV).
- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
- `--alignment_threshold`: Threshold for selecting best criteria (default: `0.4`).
- `--num_criteria`: Number of best criteria to select (default: `3`).
- `--batch_size`: Batch size for data processing (default: `4`).
- *And more...*

Use the `--help` flag to see all available options:

```bash
evalforge.forge --help
```

### Example

```bash
evalforge.forge \
--data train_data.json \
--llm_model "gpt-4" \
--num_criteria_to_generate 5 \
--batch_size 2
```

## 📄 Data Format

EvalForge expects data in JSON or CSV format (a small normalization sketch follows the examples below). Each data point should include at least the following fields:

- `input` or `input_data`: Input provided to the model.
- `output` or `output_data`: Output generated by the model.
- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
- `note`: *(Optional)* Additional context or notes about the data point.
- `human_description`: *(Optional)* Human-provided task description.

**JSON Example:**

```json
[
  {
    "input": "What is 2 + 2?",
    "output": "4",
    "annotation": 1,
    "note": "Simple arithmetic",
    "human_description": "Basic math questions"
  },
  {
    "question": "Translate 'Hello' to French.",
    "answer": "Bonjour",
    "annotation": 1,
    "note": "Basic translation",
    "human_description": "Simple language translation"
  }
]
```

**CSV Example:**

```csv
input,output,annotation,note,human_description
"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
```
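
The key names can vary between records (the second JSON entry above uses `question`/`answer` rather than `input`/`output`). Below is a minimal sketch of how such records could be normalized into `DataPoint` objects; the helper name and the alias lists are illustrative assumptions, not the actual `load_data` implementation.

```python
# Illustrative only: evalforge.data_utils.load_data may handle this differently.
from evalforge.data_utils import DataPoint

INPUT_KEYS = ("input", "input_data", "question")   # assumed aliases
OUTPUT_KEYS = ("output", "output_data", "answer")  # assumed aliases


def to_datapoint(record: dict) -> DataPoint:
    """Map a raw record with flexible key names onto a DataPoint."""
    input_value = next(record[key] for key in INPUT_KEYS if key in record)
    output_value = next(record[key] for key in OUTPUT_KEYS if key in record)
    return DataPoint(
        input_data={"text": str(input_value)},
        output_data={"text": str(output_value)},
        annotation=int(record["annotation"]),
        note=record.get("note", ""),
    )


example = {"question": "Translate 'Hello' to French.", "answer": "Bonjour", "annotation": 1}
print(to_datapoint(example))
```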

## 🏃‍♂️ Running the Annotation App

@@ -24,14 +115,26 @@ python main.py

This will launch a web interface for annotating your dataset.

## 🧠 Creating an LLM Judge Programmatically

You can create an LLM judge programmatically using the `EvalForge` class.

### Example Usage

```python
import asyncio
from evalforge.forge import EvalForge
from evalforge.data_utils import load_data

# Load data
train_data = load_data('path/to/train_data.json')

# Create an EvalForge instance with custom configurations
forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)

# Run the fit method asynchronously
asyncio.run(forge.fit(train_data))
```

## 🔍 Running the Generated Judge

@@ -42,12 +145,116 @@ To load and run the generated judge:

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.

Alternatively, you can use the `forge_mini.py` script as an example:

```python:forge_mini.py
import asyncio
from evalforge.utils import logger
from evalforge.forge import EvalForge
from evalforge.data_utils import DataPoint
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics

train_ds_formatted = [
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "2"},
        annotation=1,
        note="Correct summation",
    ),
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect summation",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
]

eval_ds_formatted = [
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect square root",
    ),
]

LLM_MODEL = "gpt-4"

forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
results = asyncio.run(forger.fit(train_ds_formatted))
forged_judge = results["forged_judges"]["judge"]

logger.rule("Running assertions and calculating metrics", color="blue")


async def run_assertions_and_calculate_metrics(forger, judge, data):
    all_data_forged_judge_assertion_results = await forger.run_assertions(judge, data)
    all_data_metrics = calculate_alignment_metrics(all_data_forged_judge_assertion_results)
    format_alignment_metrics(all_data_metrics)
    return


asyncio.run(
    run_assertions_and_calculate_metrics(forger, forged_judge, eval_ds_formatted)
)
```

## 📊 Key Components

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
- `cli.py`: Command-line interface for EvalForge
- `evalforge/`: Core library code
- `forge_mini.py`: Example script demonstrating programmatic usage

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.
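
If your own script needs to initialize Weave tracking explicitly (EvalForge may already do this internally; treat this as an assumption), a minimal setup looks like the sketch below, with a placeholder project name:

```python
# Minimal sketch, assuming explicit initialization is wanted; "evalforge-demo" is a placeholder.
import weave

weave.init("evalforge-demo")  # subsequent traced calls are logged to this Weave project
```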

Happy evaluating! 🎉
## 📝 What's New

- **Modular Codebase**: Refactored `EvalForge` class and added helper methods for better modularity.
- **Command-Line Interface**: Added `cli.py` using `simple_parsing` for easy configuration via CLI (a rough sketch of this wiring follows this list).
- **Flexible Data Loading**: Enhanced `DataPoint` class and `load_data` function to handle various data formats.
- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
- **Error Handling**: Improved exception handling in both the CLI and core classes.
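
As a rough illustration of the `simple_parsing` wiring mentioned above, a CLI entry point could look like the sketch below. This is not the actual `cli.py` from this PR; the dataclass fields mirror the documented command-line arguments, and the hand-off to `EvalForge` is assumed.

```python
# Hypothetical sketch of a simple_parsing-based CLI; the real cli.py may differ.
from dataclasses import dataclass

from simple_parsing import ArgumentParser


@dataclass
class ForgeConfig:
    data: str = ""                      # path to training data (JSON or CSV)
    llm_model: str = "gpt-3.5-turbo"    # language model to use
    num_criteria_to_generate: int = 3   # number of evaluation criteria to generate
    alignment_threshold: float = 0.4    # threshold for selecting best criteria
    num_criteria: int = 3               # number of best criteria to select
    batch_size: int = 4                 # batch size for data processing


def main() -> None:
    parser = ArgumentParser()
    parser.add_arguments(ForgeConfig, dest="config")
    args = parser.parse_args()
    config = args.config
    print(config)  # the real CLI would hand these values to EvalForge and run fit()


if __name__ == "__main__":
    main()
```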

## 🛠️ Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository on GitHub.
2. Create a new branch for your feature or bug fix.
3. Make your changes and ensure that tests pass.
4. Submit a pull request with a detailed description of your changes.

## 📄 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## 📞 Contact

For questions or suggestions, feel free to reach out to the authors:

- **Alex Volkov**: [[email protected]](mailto:[email protected])
- **Anish Shah**: [[email protected]](mailto:[email protected])
- **Thomas Capelle**: [[email protected]](mailto:[email protected])

## 🙏 Acknowledgments

- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
- [Weave](https://github.com/wandb/weave) for providing modeling infrastructure.
- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
- All contributors and users who provided feedback and suggestions.

3 changes: 3 additions & 0 deletions data/clothes_review_10k.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/clothes_review_filtered.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/find_helpful_reviews_data.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/helpful_reviews_annotations_500.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/mini_data.jsonl
Git LFS file not shown
Binary file removed evalforge/2404.12272v1.pdf
Binary file not shown.
5 changes: 5 additions & 0 deletions evalforge/__init__.py
@@ -0,0 +1,5 @@
from evalforge.forge import EvalForge
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics


__all__ = ["EvalForge", "calculate_alignment_metrics", "format_alignment_metrics"]
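
Given these re-exports, package-level imports should work as in the short usage sketch below (illustrative, not part of this diff; the constructor argument is taken from the README examples, and other attributes fall back to their documented defaults):

```python
# Sketch: top-level imports enabled by evalforge/__init__.py.
from evalforge import EvalForge, calculate_alignment_metrics, format_alignment_metrics

forge = EvalForge(llm_model="gpt-4")
```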