Feat: Fix for Instance Based Metrics and Updated docs #1827

Merged
merged 8 commits on Jan 9, 2025
67 changes: 50 additions & 17 deletions docs/concepts/metrics/available_metrics/general_purpose.md
@@ -103,26 +103,59 @@ Output

## Instance Specific rubrics criteria scoring

Instance-specific evaluation is a rubric-based metric used to evaluate each item in a dataset individually: every instance is annotated with its own rubric, i.e. a set of descriptions for each score (typically ranging from 1 to 5). The LLM then evaluates and scores the response using the descriptions in that instance's rubric. The metric has reference-free and reference-based variations, and it is useful when each instance in your dataset requires highly customized evaluation criteria. To use it, provide a rubric along with each item you want to evaluate.

!!! note
    This differs from the `Rubric Based Criteria Scoring Metric`, where a single rubric is applied to uniformly evaluate all items in the dataset. With the `Instance-Specific Evaluation Metric`, you decide which rubric to use for each item. It's like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific); a minimal sketch contrasting the two follows this note.

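To make the contrast concrete, here is a minimal sketch: the uniform, rubric-based setup amounts to attaching the same rubric to every row, while the instance-specific setup attaches a different rubric to each row, and `InstanceRubrics` scores every row against whatever rubric it carries. As elsewhere on this page, `evaluator_llm` is assumed to be an already configured evaluator LLM.

```python
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

rows = [
    {"user_input": "What is 2 + 2?", "response": "4"},
    {"user_input": "Name a prime number.", "response": "4"},
]

# Uniform ("rubric-based") setup: the same rubric is attached to every row.
shared_rubric = {
    "score0_description": "The response does not answer the question correctly.",
    "score1_description": "The response answers the question correctly.",
}
uniform_dataset = EvaluationDataset.from_list(
    [{**row, "rubrics": shared_rubric} for row in rows]
)

# Instance-specific setup: each row carries its own rubric.
per_row_rubrics = [
    {
        "score0_description": "The arithmetic answer is wrong.",
        "score1_description": "The arithmetic answer is exactly right.",
    },
    {
        "score0_description": "The number given is not prime.",
        "score1_description": "The number given is prime.",
    },
]
instance_dataset = EvaluationDataset.from_list(
    [{**row, "rubrics": rubric} for row, rubric in zip(rows, per_row_rubrics)]
)

uniform_result = evaluate(dataset=uniform_dataset, metrics=[InstanceRubrics(llm=evaluator_llm)], llm=evaluator_llm)
instance_result = evaluate(dataset=instance_dataset, metrics=[InstanceRubrics(llm=evaluator_llm)], llm=evaluator_llm)
```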
#### Example
```python
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

# `evaluator_llm` is assumed to be an evaluator LLM wrapper configured earlier in these docs.

dataset = [
    # Relevance to Query
    {
        "user_input": "How do I handle exceptions in Python?",
        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
        "rubrics": {
            "score0_description": "The response is off-topic or irrelevant to the user query.",
            "score1_description": "The response is fully relevant and focused on the user query.",
        },
    },
    # Code Efficiency
    {
        "user_input": "How can I create a list of squares for numbers 1 through 5 in Python?",
        "response": """
# Using a for loop
squares = []
for i in range(1, 6):
    squares.append(i ** 2)
print(squares)
""",
        "reference": """
# Using a list comprehension
squares = [i ** 2 for i in range(1, 6)]
print(squares)
""",
        "rubrics": {
            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
        },
    },
]


evaluation_dataset = EvaluationDataset.from_list(dataset)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[InstanceRubrics(llm=evaluator_llm)],
    llm=evaluator_llm,
)

result
```
Output

```
{'instance_rubrics': 0.5000}
```
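
For quick, one-off checks the same metric can also score a single sample directly; the rubric is attached to the `SingleTurnSample` itself and `single_turn_ascore` is awaited from an async context (for example a notebook cell). This is a minimal sketch, with `evaluator_llm` again assumed to be configured.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubrics

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    rubrics={
        "score1": "The response is completely incorrect or unrelated to the question.",
        "score2": "The response is partially correct but vague or wrong in key aspects.",
        "score3": "The response gives the correct location but contains factual inaccuracies or awkward phrasing.",
        "score4": "The response is accurate but lacks precision or extra context.",
        "score5": "The response is entirely accurate and clearly states that the tower is in Paris.",
    },
)

scorer = InstanceRubrics(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```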
4 changes: 2 additions & 2 deletions src/ragas/metrics/_instance_specific_rubrics.py
@@ -38,13 +38,13 @@ class MultiTurnInputWithRubric(MultiTurnInputWithoutRubric):


class SingleTurnPrompt(PydanticPrompt[SingleTurnInputWithRubric, ScoreFeedback]):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Your task is to assign an appropriate score and provide feedback to the inputs based solely on the scoring criteria passed in the input."
    input_model = SingleTurnInputWithRubric
    output_model = ScoreFeedback


class MultiTurnPrompt(PydanticPrompt[MultiTurnInputWithRubric, ScoreFeedback]):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Your task is to assign an appropriate score and provide feedback to the inputs based solely on the scoring criteria passed in the input."
    input_model = MultiTurnInputWithRubric
    output_model = ScoreFeedback

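As a quick illustration of the change above, the prompt classes now ship with a meaningful default instruction instead of an empty placeholder filled in by the metric constructor. A minimal sketch (the module is private, so the import path is an implementation detail and may change):

```python
from ragas.metrics._instance_specific_rubrics import SingleTurnPrompt

prompt = SingleTurnPrompt()
print(prompt.instruction)
# "Your task is to assign an appropriate score and provide feedback to the inputs
#  based solely on the scoring criteria passed in the input."
```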
4 changes: 2 additions & 2 deletions src/ragas/metrics/_simple_criteria.py
@@ -124,7 +124,7 @@ def __init__(
        self.multi_turn_prompt = multi_turn_prompt or MultiTurnSimpleCriteriaPrompt()

        # update the instruction for the prompts with the definition
-        instruction = f"Evaluate the Input based on the criterial defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
+        instruction = f"Evaluate the input based on the criteria defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
        self.single_turn_prompt.instruction = instruction
        self.multi_turn_prompt.instruction = instruction

@@ -145,7 +145,7 @@ def definition(self) -> str:
    def definition(self, value: str) -> None:
        self._definition = value
        # Update the instruction for both prompts with the new definition
-        instruction = f"Evaluate the Input based on the criterial defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
+        instruction = f"Evaluate the input based on the criteria defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
        self.single_turn_prompt.instruction = instruction
        self.multi_turn_prompt.instruction = instruction

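A minimal usage sketch of the updated instruction wording, assuming the metric defined in `_simple_criteria.py` is exported from `ragas.metrics` as `SimpleCriteriaScore` and that `evaluator_llm` is an already configured evaluator LLM; assigning to `definition` goes through the setter above and rebuilds the instruction for both prompts:

```python
from ragas.metrics import SimpleCriteriaScore

metric = SimpleCriteriaScore(
    name="clarity",
    definition="Score 0 to 5 for how clearly the response answers the question.",
    llm=evaluator_llm,  # assumed to be configured elsewhere
)

# Re-assigning the definition triggers the property setter shown above, which
# regenerates the instruction for both the single-turn and multi-turn prompts.
metric.definition = "Score 0 to 5 for the factual correctness of the response."
print(metric.single_turn_prompt.instruction)
```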