Fennec is a tool designed for automating the evaluation of conversational data, offering judgements across multiple dimensions and granularities.
Fennec aims to address two key challenges:
- Multidimensional Evaluation: assessing whether responses to multi-intent queries are both comprehensive and accurate.
- Scaling Evaluation Capabilities: extending evaluation abilities to encompass a wider array of scenarios and usage examples.
Fennec provides a step-by-step framework for evaluating conversational responses using a branching mechanism:
- Evaluation Criteria: Offers users multiple evaluation dimensions for their queries.
- Scoring Guidelines: Provides scoring rules (1-5 points) for each evaluation dimension.
- Judgements: Scores each response based on the evaluation criteria and scoring guidelines.
- Correction: Addresses issues identified within the conversations accordingly.
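The four steps above can be pictured as a small data flow. The following is only an illustrative sketch, not Fennec's implementation: the `Branch` class, the `judge`, `overall`, and `to_correct` helpers, the mean aggregation, and the correction threshold are all hypothetical names and choices introduced here, and the `scorer` callable stands in for the model call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass
class Branch:
    """One evaluation branch: a criterion plus its 1-5 scoring guideline."""
    criterion: str
    guideline: Dict[int, str]
    score: Optional[int] = None


def judge(branches: List[Branch], scorer: Callable[[Branch], int]) -> None:
    """Fill in a 1-5 score for every branch using the supplied scorer."""
    for branch in branches:
        score = scorer(branch)
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} for {branch.criterion!r} is outside 1-5")
        branch.score = score


def overall(branches: List[Branch]) -> float:
    """Aggregate branch scores; a plain mean is an assumption of this sketch."""
    return mean(b.score for b in branches)


def to_correct(branches: List[Branch], threshold: int = 3) -> List[str]:
    """Return the criteria scoring below the (hypothetical) threshold."""
    return [b.criterion for b in branches if b.score < threshold]


# Toy example: two criteria, with a lookup table standing in for the model.
branches = [
    Branch("helpfulness", {1: "unhelpful", 5: "fully resolves the query"}),
    Branch("accuracy", {1: "mostly wrong", 5: "no factual errors"}),
]
toy_scores = {"helpfulness": 4, "accuracy": 2}
judge(branches, lambda b: toy_scores[b.criterion])
print(overall(branches), to_correct(branches))
```

Keeping each criterion in its own branch means a low score points directly at the dimension that the correction step would need to address.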
Model | Parameters | Datasets | Agreement ⬆ | Consistency ⬆ |
---|---|---|---|---|
GPT-4 | - | - | 62.28 | 86.28 |
GPT-3.5 | - | - | 42.74 | 62.43 |
Auto-J | 13B | Auto-J | 54.96 | 83.41 |
Fennec | 7B | Fennec | 56.63 | 86.32 |
Fennec | 7B | Fennec-bridging | 57.40 | 87.00 |
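As a rough guide to reading the two metric columns, the sketch below assumes that Agreement is the percentage of pairwise verdicts matching the human label, and that Consistency is the percentage of verdicts that stay the same when the two responses are presented in swapped order. These definitions, the label encoding, and the function names are assumptions made for illustration; the repository computes its metrics with scikit-learn, while this sketch stays in plain Python so it runs without extra packages.

```python
from typing import Sequence

# Assumed label encoding: 0 = first response wins, 1 = second wins, 2 = tie.
FLIP = {0: 1, 1: 0, 2: 2}


def agreement(predicted: Sequence[int], human: Sequence[int]) -> float:
    """Percentage of predicted verdicts that equal the human verdicts."""
    if not human or len(predicted) != len(human):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predicted, human))
    return 100.0 * matches / len(human)


def consistency(original: Sequence[int], swapped: Sequence[int]) -> float:
    """Percentage of verdicts unchanged after swapping the response order.

    `swapped` holds verdicts produced with the two responses exchanged, so
    each one is flipped back before being compared with the original.
    """
    if not original or len(original) != len(swapped):
        raise ValueError("inputs must be non-empty and of equal length")
    stable = sum(o == FLIP[s] for o, s in zip(original, swapped))
    return 100.0 * stable / len(original)


# Toy data: four comparisons judged in both presentation orders.
human = [0, 1, 2, 0]
pred = [0, 1, 0, 0]
pred_swapped = [1, 0, 1, 0]
print(agreement(pred, human), consistency(pred, pred_swapped))
```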
- The current version has minimal third-party dependencies:
    pip install SQLAlchemy    # Used for data caching.
    pip install scikit-learn  # Used for computing evaluation metrics.
    pip install loguru        # Used for logging.
- Fennec uses vLLM to launch inference services and currently supports vLLM >= 0.2.1.
    pip install vllm
  or build from source:
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -e .
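Before launching anything, it can help to confirm that the packages above are installed and that vLLM meets the stated minimum version. The check below is a minimal sketch using only the standard library; the helper names (`parse_version`, `installed_versions`, `vllm_ok`) are hypothetical and not part of the repository.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

REQUIRED = ["SQLAlchemy", "scikit-learn", "loguru", "vllm"]
MIN_VLLM = (0, 2, 1)  # minimum vLLM version stated above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.2.1' or '0.2.1.post1' into (0, 2, 1), stopping at non-numeric parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def vllm_ok(version: Optional[str]) -> bool:
    """True when a vLLM version string is present and at least MIN_VLLM."""
    return version is not None and parse_version(version) >= MIN_VLLM


if __name__ == "__main__":
    versions = installed_versions(REQUIRED)
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    print("vLLM meets >= 0.2.1:", vllm_ok(versions["vllm"]))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "0.10.0" would sort before "0.2.1".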
Using Fennec involves two steps:
- Launch the vLLM server for Fennec evaluation:
    bash scripts/run_vllm_server.sh
- EVAL_PARALLEL: enables concurrent inference across multiple GPUs when they are available.
- MODEl_NAME: the location where the downloaded model is stored.
- Evaluate a benchmark dataset (or a custom dataset):
    bash scripts/fennec_eval.sh
- -a -p {number}: run inference in parallel, where {number} sets the degree of parallelism.
We provide more detailed Recipes on how to use the current repo.