Refactor #5

Open · wants to merge 65 commits into base: main

Changes from all commits (65 commits)

94f3984
move outside
tcapelle Oct 30, 2024
650d4d9
move to notebooks
tcapelle Oct 30, 2024
af7c3c5
add data
tcapelle Oct 30, 2024
1292407
enable git lfs
tcapelle Oct 30, 2024
f9fb498
test and publish
tcapelle Oct 30, 2024
92dc585
make it installable
tcapelle Oct 30, 2024
c64552a
amazon reviews
tcapelle Oct 30, 2024
a367d2b
add json to LFS
tcapelle Oct 30, 2024
96fe2cb
add amazons reviews
tcapelle Oct 30, 2024
c272b52
amazon nb
tcapelle Oct 30, 2024
bcbc136
fix category
tcapelle Oct 30, 2024
c60ec70
update nb
tcapelle Oct 30, 2024
8367d77
helpful reviews
tcapelle Oct 30, 2024
6c2242d
save load JSONL
tcapelle Oct 30, 2024
3565670
listify
tcapelle Oct 30, 2024
6e1d0e7
update nb
tcapelle Oct 30, 2024
db6d421
dataset creation
tcapelle Oct 31, 2024
5df5381
annotate reviews
tcapelle Oct 31, 2024
d34dc6b
some renames
tcapelle Oct 31, 2024
54e873b
move main to test
tcapelle Oct 31, 2024
02ef5fa
add tests to utils
tcapelle Oct 31, 2024
c6a0554
test and refactor formatter
tcapelle Oct 31, 2024
b3af223
refactor LLM
tcapelle Oct 31, 2024
768ca2c
fix reqs
tcapelle Oct 31, 2024
b238d86
fix litellm integration
tcapelle Oct 31, 2024
fa8289f
add autopep
tcapelle Oct 31, 2024
1c08677
lint
tcapelle Oct 31, 2024
d6122bd
move prompts to file
tcapelle Nov 5, 2024
56bc141
refactor to use BaseModel instead of Tuple
tcapelle Nov 5, 2024
8bcaab8
refactor logging
tcapelle Nov 5, 2024
472dd34
rename fit
tcapelle Nov 5, 2024
9cb136f
refactor prompts and assertions
tcapelle Nov 5, 2024
3e40021
add mini example for debug
tcapelle Nov 5, 2024
6bdfd0f
rename default LLM
tcapelle Nov 5, 2024
1004d06
add a better dataset
tcapelle Nov 6, 2024
0f93bd1
working refactor!
tcapelle Nov 6, 2024
0662005
large dataset!
tcapelle Nov 6, 2024
1930524
latest workiong nb
tcapelle Nov 6, 2024
9afb1d9
nice logging
tcapelle Nov 7, 2024
6a985f1
2 working examples
tcapelle Nov 7, 2024
58ab3bd
add CLI, delete unused files
tcapelle Nov 7, 2024
2d93a5e
improve readme
tcapelle Nov 7, 2024
0b84abd
auto detect data format
tcapelle Nov 8, 2024
4fc9f42
add mini data
tcapelle Nov 8, 2024
b5bba8a
add review eval py script
morganmcg1 Nov 8, 2024
63fa45c
add review eval python script
morganmcg1 Nov 8, 2024
246af10
Merge branch 'cape' of https://github.com/wandb/evalForge into cape
morganmcg1 Nov 8, 2024
dc3cc20
add bulk review evals python script.py
morganmcg1 Nov 8, 2024
d1e6208
missing deps
tcapelle Nov 8, 2024
60c1c44
lint
tcapelle Nov 8, 2024
f115005
more lint
tcapelle Nov 8, 2024
b207233
only lint on push
tcapelle Nov 8, 2024
bb9480c
python 3.11
tcapelle Nov 8, 2024
36b6055
let's blackify them
tcapelle Nov 8, 2024
2fb0dba
fix tests
tcapelle Nov 8, 2024
344caaa
missing rich dep, remove tqdm
tcapelle Nov 8, 2024
c13887e
pass wandb key
tcapelle Nov 8, 2024
c3ac7fb
50k reviews evals
morganmcg1 Nov 8, 2024
12b8241
fix async test dep
tcapelle Nov 8, 2024
1d515f8
rename
tcapelle Nov 8, 2024
75c5665
Merge branch 'cape' of https://github.com/wandb/evalForge into cape
morganmcg1 Nov 8, 2024
47deea8
add args to review evals script
morganmcg1 Nov 8, 2024
2eb5236
push 50k review evals
morganmcg1 Nov 8, 2024
79bf3b9
cassetes...
tcapelle Nov 8, 2024
6869fff
update to mini
tcapelle Nov 19, 2024
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
*.pdf filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
*.jsonl filter=lfs diff=lfs merge=lfs -text
29 changes: 29 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,29 @@
name: Lint Code

on: push

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install black ruff

      - name: Run Black
        continue-on-error: true
        run: black --check .

      - name: Run Ruff
        continue-on-error: true
        run: ruff .
29 changes: 29 additions & 0 deletions .github/workflows/pypi.yml
@@ -0,0 +1,29 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload to PyPI

on:
  release:
    types: [created]

jobs:
  pipy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -U setuptools wheel twine build
      - name: Build and publish
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
        run: |
          python -m build
          twine upload dist/*
32 changes: 32 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,32 @@
name: Run Test

on:
  push:
    branches: [ "main" ]
    paths:
      - '**.py'
  pull_request:
    branches: [ "main" ]
    paths:
      - '**.py'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
          python -m pip install .
      - name: Test with pytest
        run: |
          pytest . -v tests/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
Binary file added 2404.12272v1.pdf
Binary file not shown.
237 changes: 222 additions & 15 deletions README.md
@@ -1,18 +1,109 @@
# 🚀 EvalForge Project

EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.

## 🛠️ Setup

1. **Environment Variables**

Create a `.env` file in the project root with the following variables:

```
WANDB_EMAIL=your_wandb_email
WANDB_API_KEY=your_wandb_api_key
OPENAI_API_KEY=your_openai_api_key
```

2. **Install Dependencies**

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Or, install directly via pip:

```bash
pip install git+https://github.com/wandb/evalforge.git
```

## 🏃‍♂️ Quick Start with Command-Line Interface

EvalForge now includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.

### Basic Usage

```bash
evalforge.forge --data path/to/data.json
```

### Available Arguments

You can customize EvalForge using various command-line arguments corresponding to the `EvalForge` class attributes:

- `--data`: *(Required)* Path to the training data file (JSON or CSV).
- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
- `--alignment_threshold`: Threshold for selecting best criteria (default: `0.4`).
- `--num_criteria`: Number of best criteria to select (default: `3`).
- `--batch_size`: Batch size for data processing (default: `4`).
- *And more...*

Use the `--help` flag to see all available options:

```bash
evalforge.forge --help
```

### Example

```bash
evalforge.forge \
--data train_data.json \
--llm_model "gpt-4" \
--num_criteria_to_generate 5 \
--batch_size 2
```

## 📄 Data Format

EvalForge expects data in JSON or CSV format (a small normalization sketch follows the examples below). Each data point should include at least the following fields:

- `input` or `input_data`: Input provided to the model.
- `output` or `output_data`: Output generated by the model.
- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
- `note`: *(Optional)* Additional context or notes about the data point.
- `human_description`: *(Optional)* Human-provided task description.

**JSON Example:**

```json
[
  {
    "input": "What is 2 + 2?",
    "output": "4",
    "annotation": 1,
    "note": "Simple arithmetic",
    "human_description": "Basic math questions"
  },
  {
    "question": "Translate 'Hello' to French.",
    "answer": "Bonjour",
    "annotation": 1,
    "note": "Basic translation",
    "human_description": "Simple language translation"
  }
]
```

**CSV Example:**

```csv
input,output,annotation,note,human_description
"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
```
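
The key names can vary between records (the second JSON entry above uses `question`/`answer` rather than `input`/`output`). Below is a minimal sketch of how such records could be normalized into `DataPoint` objects; the helper name and the alias lists are illustrative assumptions, not the actual `load_data` implementation.

```python
# Illustrative only: evalforge.data_utils.load_data may handle this differently.
from evalforge.data_utils import DataPoint

INPUT_KEYS = ("input", "input_data", "question")   # assumed aliases
OUTPUT_KEYS = ("output", "output_data", "answer")  # assumed aliases


def to_datapoint(record: dict) -> DataPoint:
    """Map a raw record with flexible key names onto a DataPoint."""
    input_value = next(record[key] for key in INPUT_KEYS if key in record)
    output_value = next(record[key] for key in OUTPUT_KEYS if key in record)
    return DataPoint(
        input_data={"text": str(input_value)},
        output_data={"text": str(output_value)},
        annotation=int(record["annotation"]),
        note=record.get("note", ""),
    )


example = {"question": "Translate 'Hello' to French.", "answer": "Bonjour", "annotation": 1}
print(to_datapoint(example))
```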

## 🏃‍♂️ Running the Annotation App

@@ -24,14 +115,26 @@ python main.py

This will launch a web interface for annotating your dataset.

## 🧠 Creating an LLM Judge Programmatically

You can create an LLM judge programmatically using the `EvalForge` class.

### Example Usage

```python
import asyncio
from evalforge.forge import EvalForge
from evalforge.data_utils import load_data

# Load data
train_data = load_data('path/to/train_data.json')

# Create an EvalForge instance with custom configurations
forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)

# Run the fit method asynchronously
asyncio.run(forge.fit(train_data))
```

## 🔍 Running the Generated Judge

@@ -42,12 +145,116 @@ To load and run the generated judge:

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.

Alternatively, you can use the `forge_mini.py` script as an example:

```python:forge_mini.py
import asyncio
from evalforge.utils import logger
from evalforge.forge import EvalForge
from evalforge.data_utils import DataPoint
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics

train_ds_formatted = [
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "2"},
        annotation=1,
        note="Correct summation",
    ),
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect summation",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
]

eval_ds_formatted = [
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect square root",
    ),
]

LLM_MODEL = "gpt-4"

forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
results = asyncio.run(forger.fit(train_ds_formatted))
forged_judge = results["forged_judges"]["judge"]

logger.rule("Running assertions and calculating metrics", color="blue")


async def run_assertions_and_calculate_metrics(forger, judge, data):
    all_data_forged_judge_assertion_results = await forger.run_assertions(judge, data)
    all_data_metrics = calculate_alignment_metrics(all_data_forged_judge_assertion_results)
    format_alignment_metrics(all_data_metrics)
    return


asyncio.run(
    run_assertions_and_calculate_metrics(forger, forged_judge, eval_ds_formatted)
)
```

## 📊 Key Components

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
- `cli.py`: Command-line interface for EvalForge
- `evalforge/`: Core library code
- `forge_mini.py`: Example script demonstrating programmatic usage

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.
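
If your own script needs to initialize Weave tracking explicitly (EvalForge may already do this internally; treat this as an assumption), a minimal setup looks like the sketch below, with a placeholder project name:

```python
# Minimal sketch, assuming explicit initialization is wanted; "evalforge-demo" is a placeholder.
import weave

weave.init("evalforge-demo")  # subsequent traced calls are logged to this Weave project
```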

Happy evaluating! 🎉
## 📝 What's New

- **Modular Codebase**: Refactored `EvalForge` class and added helper methods for better modularity.
- **Command-Line Interface**: Added `cli.py` using `simple_parsing` for easy configuration via CLI (a rough sketch of this wiring follows this list).
- **Flexible Data Loading**: Enhanced `DataPoint` class and `load_data` function to handle various data formats.
- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
- **Error Handling**: Improved exception handling in both the CLI and core classes.
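
As a rough illustration of the `simple_parsing` wiring mentioned above, a CLI entry point could look like the sketch below. This is not the actual `cli.py` from this PR; the dataclass fields mirror the documented command-line arguments, and the hand-off to `EvalForge` is assumed.

```python
# Hypothetical sketch of a simple_parsing-based CLI; the real cli.py may differ.
from dataclasses import dataclass

from simple_parsing import ArgumentParser


@dataclass
class ForgeConfig:
    data: str = ""                      # path to training data (JSON or CSV)
    llm_model: str = "gpt-3.5-turbo"    # language model to use
    num_criteria_to_generate: int = 3   # number of evaluation criteria to generate
    alignment_threshold: float = 0.4    # threshold for selecting best criteria
    num_criteria: int = 3               # number of best criteria to select
    batch_size: int = 4                 # batch size for data processing


def main() -> None:
    parser = ArgumentParser()
    parser.add_arguments(ForgeConfig, dest="config")
    args = parser.parse_args()
    config = args.config
    print(config)  # the real CLI would hand these values to EvalForge and run fit()


if __name__ == "__main__":
    main()
```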

## 🛠️ Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository on GitHub.
2. Create a new branch for your feature or bug fix.
3. Make your changes and ensure that tests pass.
4. Submit a pull request with a detailed description of your changes.

## 📄 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## 📞 Contact

For questions or suggestions, feel free to reach out to the authors:

- **Alex Volkov**: [[email protected]](mailto:[email protected])
- **Anish Shah**: [[email protected]](mailto:[email protected])
- **Thomas Capelle**: [[email protected]](mailto:[email protected])

## 🙏 Acknowledgments

- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
- [Weave](https://github.com/wandb/weave) for providing modeling infrastructure.
- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
- All contributors and users who provided feedback and suggestions.

3 changes: 3 additions & 0 deletions data/clothes_review_10k.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/clothes_review_filtered.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/find_helpful_reviews_data.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/helpful_reviews_annotations_500.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions data/mini_data.jsonl
Git LFS file not shown
Binary file removed evalforge/2404.12272v1.pdf
Binary file not shown.
5 changes: 5 additions & 0 deletions evalforge/__init__.py
@@ -0,0 +1,5 @@
from evalforge.forge import EvalForge
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics


__all__ = ["EvalForge", "calculate_alignment_metrics", "format_alignment_metrics"]
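
Given these re-exports, package-level imports should work as in the short usage sketch below (illustrative, not part of this diff; the constructor argument is taken from the README examples, and other attributes fall back to their documented defaults):

```python
# Sketch: top-level imports enabled by evalforge/__init__.py.
from evalforge import EvalForge, calculate_alignment_metrics, format_alignment_metrics

forge = EvalForge(llm_model="gpt-4")
```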