Renamed criterias in LLM-as-a-Judge metrics to criteria. #1545

Merged: 18 commits, Jan 26, 2025
Changes from all commits
2 changes: 1 addition & 1 deletion docs/catalog.py
@@ -175,7 +175,7 @@ def make_content(artifact, label, all_labels):

# Replacement function
html_for_dict = re.sub(pattern, r"\1\2\3", html_for_dict)
source_link = f"""<a class="reference external" href="https://github.com/IBM/unitxt/blob/main/src/unitxt/catalog/{catalog_id.replace('.','/')}.json"><span class="viewcode-link"><span class="pre">[source]</span></span></a>"""
source_link = f"""<a class="reference external" href="https://github.com/IBM/unitxt/blob/main/src/unitxt/catalog/{catalog_id.replace(".", "/")}.json"><span class="viewcode-link"><span class="pre">[source]</span></span></a>"""
html_for_dict = f"""<div class="admonition note">
<p class="admonition-title">{catalog_id}</p>
<div class="highlight-json notranslate">
95 changes: 30 additions & 65 deletions docs/docs/examples.rst
@@ -133,96 +133,61 @@ Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Form
LLM as Judges
--------------

Evaluate an existing dataset using a predefined LLM as judge
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Using LLM as judge for direct comparison using a predefined criteria
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs and leveraging a predefine LLM as a judge metric.
This example demonstrates how to use LLM-as-a-Judge with a predefined criteria, in this case *answer_relevance*. The unitxt catalog has more than 40 predefined criteria for direct evaluators.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_by_llm_as_judge.py>`__
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge_direct_predefined_criteria.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.
Related documentation: :ref:`Using LLM as a Judge in Unitxt`
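A minimal sketch of how this is wired up, assembled from the snippets that appear later in this PR (the linked example may organize the code differently, and the prediction string here is only illustrative):

.. code-block:: python

from unitxt import create_dataset, evaluate

data = [
    {"question": "What is a good low cost of living city in the US?"},
]

# A predefined direct criteria from the catalog, attached to a judge metric.
criteria = "metrics.llm_as_judge.direct.criteria.answer_relevance"
metrics = [
    f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]"
]

dataset = create_dataset(
    task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
)

# Predictions would normally come from the model under evaluation.
predictions = ["Austin, Texas is often cited as a relatively low cost of living city."]
results = evaluate(predictions=predictions, data=dataset)
print(results.global_scores.summary)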

Evaluate a custom dataset using a custom LLM as Judge
+++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate a user QA answering dataset in a standalone file using a user-defined task and template. In addition, it shows how to define an LLM as a judge metric, specify the template it uses to produce the input to the judge, and select the judge model and platform.
Using LLM as judge for direct comparison using a custom criteria
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

`Example code <https://github.com/IBM/unitxt/blob/main/examples/standalone_evaluation_llm_as_judge.py>`__
The user can also specify a bespoke criteria that the judge model uses as a guide to evaluate the responses.
This example demonstrates how to use LLM-as-a-Judge with a user-defined criteria. The criteria must define its options and an option_map; a rough sketch of such a criteria follows below.

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py>`__

Evaluate an existing dataset from the catalog comparing two custom LLM as judges
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Related documentation: :ref:`Creating a custom criteria`
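As a rough sketch, a user-defined criteria bundles a description, named options, and an option_map from option names to scores. The class names and import path below follow the linked example but are an assumption here; check that example for the exact API:

.. code-block:: python

# Assumed helpers (see the linked example for the exact import path and signatures).
from unitxt.llm_as_judge import CriteriaOption, CriteriaWithOptions

conciseness = CriteriaWithOptions(
    name="conciseness",
    description="Is the response short, direct, and free of filler?",
    options=[
        CriteriaOption(name="Yes", description="The response is concise."),
        CriteriaOption(name="No", description="The response is verbose or repetitive."),
    ],
    # option_map assigns the score returned when the judge picks each option.
    option_map={"Yes": 1.0, "No": 0.0},
)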

This example demonstrates how to evaluate a document summarization dataset by defining an LLM as a judge metric, specifying the template it uses to produce the input to the judge, and selecting the judge model and platform.
The example adds two LLM judges, one that uses the ground truth (references) from the dataset and one that does not.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_summarization_dataset_llm_as_judge.py>`__

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`.

Evaluate the quality of an LLM as judge
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an LLM as judge by checking its scores using the gold references of a dataset.
It checks if the judge consistently prefers correct outputs over clearly wrong ones.
Note that to check the the ability of the LLM as judge to discern suitable differences between
partially correct answers requires more refined tests and corresponding labeled data.
The example shows an 8b llama based judge is not a good judge for a summarization task,
while the 70b model performs much better.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge.py>`__

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.


Evaluate your model on the Arena Hard benchmark using a custom LLMaJ
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate a user model on the Arena Hard benchmark, using an LLMaJ other than the GPT4.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py>`__

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <dir_catalog.cards.arena_hard>`, :ref:`Inference Engines <inference>`.

Evaluate a judge model performance judging the Arena Hard Benchmark
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate the capabilities of a user model, to act as a judge on the Arena Hard benchmark.
The model is evaluated on its capability to give a judgment that is in correlation with GPT4 judgment on the benchmark.
Evaluate an existing dataset using an LLM-as-a-Judge for direct comparison
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_judge_model_capabilities_on_arena_hard.py>`__
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs and leveraging a predefined criteria for direct evaluation.
Note that here we also showcase unitxt's ability to evaluate the dataset on multiple criteria, namely *answer_relevance*, *coherence*, and *conciseness*; a condensed sketch of the metric list follows below.

Related documentation: :ref:`Evaluate a Model on Arena Hard Benchmark <dir_catalog.cards.arena_hard>`, :ref:`Inference Engines <inference>`.
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_by_llm_as_judge_direct.py>`__

Evaluate using ensemble of LLM as a judge metrics
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Related documentation: :ref:`End to end Direct example`
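A condensed sketch of the multi-criteria metric list, taken from the End to end Direct example later in this PR; the resulting metrics are then passed to load_dataset (or create_dataset) as usual:

.. code-block:: python

# One judge metric per criteria; score_prefix keeps the per-criteria scores apart.
criteria = ["answer_relevance", "coherence", "conciseness"]
metrics = [
    "metrics.llm_as_judge.direct.rits.llama3_1_70b"
    "[context_fields=[context,question],"
    f"criteria=metrics.llm_as_judge.direct.criteria.{criterion},"
    f"score_prefix={criterion}_]"
    for criterion in criteria
]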

This example demonstrates how to create a metric which is an ensemble of LLM as a judge metrics.
The example shows how to ensemble two judges which uses different templates.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_using_metrics_ensemble.py>`__
Using LLM as a judge for pairwise comparison using a predefined criteria
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.
This example demonstrates how to use LLM-as-a-Judge for pairwise comparison using a predefined criteria from the catalog. The unitxt catalog has 7 predefined criteria for pairwise evaluators.
We also showcase that the criteria does not need to be the same across the entire dataset and that the framework can handle different criteria for each datapoint.

Evaluate predictions of models using pre-trained ensemble of LLM as judges
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge_pairwise_predefined_criteria.py>`__

This example demonstrates how to use a pre-trained ensemble model or an off-the-shelf LLM as judge to assess multi-turn conversation quality of models on a set of pre-defined metrics.
This example demonstrates using LLM-as-a-Judge for pairwise comparison with a single predefined criteria for the entire dataset.

Topicality: Response of the model only contains information that is related to and helpful for the user inquiry.
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_llm_as_judge_pairwise_criteria_from_dataset.py>`__
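When the criteria should vary per datapoint, it can travel with the data and be picked up through criteria_field, as in this trimmed sketch of the End to end Pairwise example shown later in this PR (the full version also loads the criteria objects with a LoadCriteria preprocessing step):

.. code-block:: python

# Each instance carries its own criteria; the metric reads it via criteria_field.
data = {
    "test": [
        {
            "question": "How is the weather?",
            "criteria": "metrics.llm_as_judge.pairwise.criteria.temperature_in_celsius_and_fahrenheit",
        },
        {
            "question": "Tell me a joke about cats",
            "criteria": "metrics.llm_as_judge.pairwise.criteria.funny_joke",
        },
    ]
}

metrics = [
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria]"
]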

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_ensemble_judge.py>`__

Groundedness: Every substantial claim in the response of the model is derivable from the content of the document
Evaluate an existing dataset using an LLM-as-a-Judge for pairwise comparison
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_grounded_ensemble_judge.py>`__
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs and leveraging a predefined criteria for pairwise evaluation.
Note that here we also showcase unitxt's ability to evaluate the dataset on multiple criteria, namely *answer_relevance*, *coherence*, and *conciseness*.

IDK: Does the model response say I don't know?
`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_by_llm_as_judge_direct.py>`__

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_idk_judge.py>`__
Related documentation: :ref:`End to end Pairwise example`

Related documentation: :ref:`LLM as a Judge Metrics Guide <llm_as_judge>`, :ref:`Inference Engines <inference>`.

RAG
---
105 changes: 84 additions & 21 deletions docs/docs/llm_as_judge.rst
@@ -46,11 +46,11 @@ An LLM as a Judge metric consists of several essential components:
1. The judge model, such as *Llama-3-8B-Instruct* or *gpt-3.5-turbo*, which evaluates the performance of other models.
2. The platform responsible for executing the judge model, such as Huggingface, OpenAI API and IBM's deployment platforms such as WatsonX and RITS.
Many of these model and platform combinations are already predefined in our catalog. The metric names are prefixed by metrics.llm_as_judge.direct, followed by the platform and the model name.
For instance, metrics.llm_as_judge.direct.rits.llama3_1_70b refers to llama3 70B model that uses RITS deployment service.
For instance, *metrics.llm_as_judge.direct.rits.llama3_1_70b* refers to the *Llama 3.1 70B* model served through the RITS deployment service.

3. The criteria to evaluate the model's response. There are predefined criteria in the catalog and the user can also define a custom criteria.
Each criteria specifies fine-grained options that help steer the model to evaluate the response more precisely.
For instance the critertion "metrics.llm_as_judge.direct.criterias.answer_relevance" quantifies how much the model's response is relevant to the user's question.
For instance, the criterion *metrics.llm_as_judge.direct.criteria.answer_relevance* quantifies how relevant the model's response is to the user's question.
It has four options that the model can choose from: excellent, acceptable, could be improved, and bad. Each option has its own description and an associated score.
The model uses these descriptions to identify which option the given response is closest to and returns that option.
The user can also specify their own custom criteria. An example of this is included under the section **Creating a custom criteria**.
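Putting the naming scheme together: a judge metric string combines the catalog entry for the judge and platform with a criteria and the context fields the judge should see, using the bracket syntax that appears in the snippets below:

.. code-block:: python

# <family>.<platform>.<model> from the catalog, plus a criteria and context fields.
judge = "metrics.llm_as_judge.direct.rits.llama3_1_70b"
criteria = "metrics.llm_as_judge.direct.criteria.answer_relevance"
metric = f"{judge}[criteria={criteria}, context_fields=[question]]"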
@@ -72,7 +72,7 @@ To accomplish this evaluation, we require the following:

1. The questions that were input to the model
2. The judge model and its deployment platform
3. The pre-defined criteria, which in this case is metrics.llm_as_judge.direct.criterias.answer_relevance.
3. The pre-defined criteria, which in this case is metrics.llm_as_judge.direct.criteria.answer_relevance.

We pass the criteria to the judge model's metric as criteria and the question as the context fields.

@@ -84,15 +84,11 @@ We pass the criteria to the judge model's metric as criteria and the question as
{"question": "What is a good low cost of living city in the US?"},
]
criteria = "metrics.llm_as_judge.direct.criterias.answer_relevance"
criteria = "metrics.llm_as_judge.direct.criteria.answer_relevance"
metrics = [
f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]"
]
dataset = create_dataset(
task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
)
Once the metric is created, a dataset is created for the appropriate task.

.. code-block:: python
@@ -156,20 +152,20 @@ Below is an example where the user mandates that the model respond with the temp
)
End to end example
--------------------------------------------
End to end Direct example
----------------------------
Unitxt can also obtain a model's responses for a given dataset and then run LLM-as-a-Judge evaluation on those responses.
Here, we will get llama-3.2 1B instruct's responses and then evaluate them for answer relevance, coherence and conciseness using llama3_1_70b judge model
Here, we will get *llama-3.2 1B Instruct*'s responses and then evaluate them for answer relevance, coherence, and conciseness using the *llama3_1_70b* judge model.

.. code-block:: python
criterias = ["answer_relevance", "coherence", "conciseness"]
criteria = ["answer_relevance", "coherence", "conciseness"]
metrics = [
"metrics.llm_as_judge.direct.rits.llama3_1_70b"
"[context_fields=[context,question],"
f"criteria=metrics.llm_as_judge.direct.criterias.{criteria},"
f"score_prefix={criteria}_]"
for criteria in criterias
f"criteria=metrics.llm_as_judge.direct.criteria.{criterion},"
f"score_prefix={criterion}_]"
for criterion in criteria
]
dataset = load_dataset(
card="cards.squad",
@@ -210,22 +206,22 @@ We use CrossProviderInferenceEngine for inference.
],
)
for criteria in criterias:
logger.info(f"Scores for criteria '{criteria}'")
for criterion in criteria:
logger.info(f"Scores for criteria '{criterion}'")
gold_answer_scores = [
instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"]
instance["score"]["instance"][f"{criterion}_llm_as_a_judge_score"]
for instance in evaluated_gold_answers
]
gold_answer_position_bias = [
int(instance["score"]["instance"][f"{criteria}_positional_bias"])
int(instance["score"]["instance"][f"{criterion}_positional_bias"])
for instance in evaluated_gold_answers
]
prediction_scores = [
instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"]
instance["score"]["instance"][f"{criterion}_llm_as_a_judge_score"]
for instance in evaluated_predictions
]
prediction_position_bias = [
int(instance["score"]["instance"][f"{criteria}_positional_bias"])
int(instance["score"]["instance"][f"{criterion}_positional_bias"])
for instance in evaluated_predictions
]
@@ -263,3 +259,70 @@ We use CrossProviderInferenceEngine for inference.
Scores of predicted answers: 0.34 +/- 0.47609522856952335
Positional bias occurrence on gold answers: 0.03
Positional bias occurrence on predicted answers: 0.01
End to end Pairwise example
----------------------------

So far we have showcased direct (pointwise) evaluators, where the judge model takes the responses of a single model and scores them. Unitxt also supports pairwise evaluation, where the judge model takes responses from two or more models and ranks them according to the specified criteria.
The winrate metric reports how often the current model's response was judged better than the other models' responses according to the criteria. As with the direct evaluators, pairwise evaluators also detect positional bias.
Below is an example where we compare the responses of three models on two questions, each with a different criteria to evaluate against; the judge model is *Llama 3.1 70B*.

.. code-block:: python
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import LoadCriteria
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate

# Each instance carries the name of the criteria it should be judged against.
data = {
    "test": [
        {
            "question": "How is the weather?",
            "criteria": "metrics.llm_as_judge.pairwise.criteria.temperature_in_celsius_and_fahrenheit",
        },
        {
            "question": "Tell me a joke about cats",
            "criteria": "metrics.llm_as_judge.pairwise.criteria.funny_joke",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        # Resolve the per-instance criteria names into criteria objects.
        LoadCriteria(field="criteria", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria]"
        ],
        default_template=NullTemplate(),
    ),
)

dataset = load_dataset(card=card, split="test")

# One list of responses per instance, one response per system being compared.
predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)
2 changes: 1 addition & 1 deletion examples/evaluate_batched_multiclass_classification.py
@@ -41,7 +41,7 @@ class EnumeratedListSerializer(SingleTypeSerializer):
serialized_type = EnumeratedList

def serialize(self, value: EnumeratedList, instance: Dict[str, Any]) -> str:
return "\n".join([f"{i+1}. {v}" for i, v in enumerate(value)])
return "\n".join([f"{i + 1}. {v}" for i, v in enumerate(value)])


task = Task(