
Do the scores for rubric have to be numeric and in a 1-5 range? #1800

Closed
alexcpop opened this issue Dec 27, 2024 · 7 comments · Fixed by #1821
Labels
question: Further information is requested

Comments

alexcpop commented Dec 27, 2024

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I want to put together my own custom rubric. I experimented with the score range (0-3, 1-3, and a categorical scale: Met, Partially Met, Not Met). In each of these cases the output scores ignored my scoring criteria and defaulted to 1-5. My questions are:

  • Do the scores have to be between 1 and 5?
  • Do they have to be numeric? I have use cases where a categorical rubric would work better.

Code Examples

import json

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

# evaluator_llm and assessment_criteria are defined elsewhere
rubric = {
    "adherence_guidance": {
        "description": "Does the LLM output align with the provided information (e.g., tech guidance, domain-specific standards)?",
        "scoring_criteria": """
            1: Does not align with key aspects of the guidance.
            2: Partially aligns but has notable omissions or misinterpretations.
            3: Fully aligns with the instructions.
        """
    },
    "factual_accuracy_source_relevance": {
        "description": "Are the claims made in the output factually correct and supported by the input content (e.g., references, documents)?",
        "scoring_criteria": """
            1: Unsupported or fabricated claims.
            2: Some claims are backed by input content, but others are incorrect or unsupported.
            3: All claims are accurate and appropriately cited.
        """
    }
}

scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)

async def score_business_case(row):
    sample = SingleTurnSample(
        user_input=row["Summary"],
        response=json.dumps(row.to_dict()),
        reference=assessment_criteria
    )

    scores = {}
    # every rubric key is scored with the same scorer and the same sample
    for metric in rubric.keys():
        scores[metric] = await scorer.single_turn_ascore(sample)
    return scores

Example outputs:

adherence_guidance_score  factual_accuracy_source_relevance_score
                        5                                        5
                        4                                        3
                        3                                        3
                        4                                        4
                        1                                        1
alexcpop added the question label on Dec 27, 2024
jjmachan (Member) commented Jan 3, 2025

Here is how you could create the rubric scores:

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)

This will score the adherence rubric you mentioned. You can create a separate scorer for the other rubric in the same way (see the sketch below).

Can you let me know if that helps? If not, could you share a bit more about your use case and I'll help you out 🙂
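
As a rough sketch (the variable names are just illustrative, with evaluator_llm as before), the second scorer could reuse the factual-accuracy wording from your original rubric:

from ragas.metrics import RubricsScore

# Illustrative second scorer built from the "factual_accuracy_source_relevance" criteria
factual_rubric = {
    "score1_description": "Unsupported or fabricated claims.",
    "score2_description": "Some claims are backed by input content, but others are incorrect or unsupported.",
    "score3_description": "All claims are accurate and appropriately cited.",
}
factual_accuracy_scorer = RubricsScore(rubrics=factual_rubric, llm=evaluator_llm)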

alexcpop (Author) commented Jan 6, 2025

Hi @jjmachan, many thanks for your reply! I tested the code above and the result is 5 (even though my rubric only has 3 levels). I tried 2 scenarios and both print out 5. Code:

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    user_input="What is the capital of Spain?",
    response="The capital of Spain is Madrid.",
    reference="The capital of Spain is Madrid.",
)

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_score = RubricsScore(rubrics=rubric, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
)

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_score = RubricsScore(rubrics=rubric, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

tuan3w commented Jan 7, 2025

I have the same problem. I got a score of 7 even though I only specified a 5-category rubric. I am using the latest version of the library.

sahusiddharth (Collaborator) commented

Hi @alexcpop @tuan3w,

I’ve looked into the issue and made the necessary changes to address it. Please let me know if you continue to encounter any problems. You can also use the code snippet below for reference.

import os
from dotenv import load_dotenv

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from ragas.metrics import RubricsScore
from ragas import EvaluationDataset

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

test = [
    {
        "response": "The Earth is flat and does not orbit the Sun.",
        "reference": "Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
    },
    {
        "response": "The moon is made entirely of green cheese.",
        "reference": "The moon is a natural satellite composed primarily of rock and metal. This has been confirmed by lunar samples brought back by the Apollo missions and is well-supported by scientific studies.",
    },
    {
        "response": "Albert Einstein developed the theory of relativity in 1905.",
        "reference": "Albert Einstein's theory of relativity was published in 1905, and it fundamentally changed the understanding of space, time, and gravity. This is a key milestone in modern physics.",
    },
    {
        "response": "The Amazon rainforest produces 10% of the world’s oxygen.",
        "reference": "The Amazon rainforest is often cited as a major oxygen producer, but studies show that it contributes far less to the Earth's oxygen supply than commonly stated. It does, however, play a critical role in carbon dioxide absorption.",
    },
    {
        "response": "Shakespeare wrote 'To Kill a Mockingbird'.",
        "reference": "'To Kill a Mockingbird' was written by Harper Lee and published in 1960. William Shakespeare, an English playwright, lived in the 16th century and is famous for works such as 'Romeo and Juliet' and 'Hamlet'.",
    },
]

evaluation_dataset = EvaluationDataset.from_list(test)

rubrics = {
    "score1_description": "The response does not align with key aspects of the reference or instructions.",
    "score2_description": "The response partially aligns with the reference or instructions, but has notable omissions or misinterpretations.",
    "score3_description": "The response fully aligns with the reference or instructions, accurately addressing all key points."
}

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[RubricsScore(llm=evaluator_llm, rubrics=rubrics)],
    llm=evaluator_llm,
)

result
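
To check the individual scores rather than just the aggregate, one option is to convert the result to a DataFrame, for example:

# Inspect per-sample scores to confirm they stay within the 1-3 rubric range
results_df = result.to_pandas()
results_df.head()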

jjmachan linked a pull request on Jan 8, 2025 that will close this issue
jjmachan pushed a commit that referenced this issue on Jan 8, 2025
heoun commented Jan 12, 2025

@jjmachan - Are you sure this is complete? I just tried the sample example and the scores were not limited to the 1-3 rubric: the output was 1, 1, 8, 3, 0. It's either hallucinating the 8 or the instructions are still off.

@sahusiddharth - if you do

results_df = result.to_pandas()
results_df.head()

you'll find that it still rated the Albert Einstein example an 8.

heoun commented Jan 12, 2025

@sahusiddharth

jjmachan (Member) commented

Hey @heoun, which tracing tool are you using? Maybe we can check the actual prompt going out to figure out why this is happening. Could you share that here?
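
If you don't have a tracing tool hooked up, a rough workaround is to attach a LangChain callback handler to the evaluator LLM so that every prompt gets printed. This is only a sketch (PromptLogger is not part of ragas), but it should surface what RubricsScore sends to the model:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Hypothetical debugging handler: prints every prompt sent to the chat model
class PromptLogger(BaseCallbackHandler):
    def on_chat_model_start(self, serialized, messages, **kwargs):
        for conversation in messages:
            for message in conversation:
                print(f"[{message.type}] {message.content}")

llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[PromptLogger()])
evaluator_llm = LangchainLLMWrapper(llm)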
