
Do the scores for rubric have to be numeric and in a 1-5 range? #1800

Closed
alexcpop opened this issue Dec 27, 2024 · 7 comments · Fixed by #1821
Labels
question: Further information is requested

Comments

alexcpop commented Dec 27, 2024

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I want to put together my own custom rubric. I experimented with the score range (0-3, 1-3, and a categorical scale: Met, Partially Met, Not Met). In each of these cases the output scores ignored my scoring criteria and defaulted to 1-5. My questions are:

  • Do the scores have to be between 1 and 5?
  • Do they have to be numeric? I have use cases where a categorical rubric would work better.

Code Examples

import json

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

# evaluator_llm and assessment_criteria are defined elsewhere
rubric = {
    "adherence_guidance": {
        "description": "Does the LLM output align with the provided information (e.g., tech guidance, domain-specific standards)?",
        "scoring_criteria": """
            1: Does not align with key aspects of the guidance.
            2: Partially aligns but has notable omissions or misinterpretations.
            3: Fully aligns with the instructions.
        """
    },
    "factual_accuracy_source_relevance": {
        "description": "Are the claims made in the output factually correct and supported by the input content (e.g., references, documents)?",
        "scoring_criteria": """
            1: Unsupported or fabricated claims.
            2: Some claims are backed by input content, but others are incorrect or unsupported.
            3: All claims are accurate and appropriately cited.
        """
    }
}

scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)

async def score_business_case(row):
    sample = SingleTurnSample(
        user_input=row["Summary"],
        response=json.dumps(row.to_dict()),
        reference=assessment_criteria
    )

    scores = {}
    # every rubric key is scored with the same scorer and the same sample
    for metric in rubric.keys():
        scores[metric] = await scorer.single_turn_ascore(sample)
    return scores

Example outputs:

adherence_guidance_score  factual_accuracy_source_relevance_score
                        5                                        5
                        4                                        3
                        3                                        3
                        4                                        4
                        1                                        1
alexcpop added the question label on Dec 27, 2024
jjmachan (Member) commented Jan 3, 2025

Here is how you could create the rubric scores:

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)

This will score the adherence rubric you mentioned. You can create a separate scorer for the other rubric in the same way (see the sketch below).

Can you let me know if that helps? If not, could you share a bit more about your use case and I'll help you out 🙂
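
As a rough sketch (the variable names are just illustrative, with evaluator_llm as before), the second scorer could reuse the factual-accuracy wording from your original rubric:

from ragas.metrics import RubricsScore

# Illustrative second scorer built from the "factual_accuracy_source_relevance" criteria
factual_rubric = {
    "score1_description": "Unsupported or fabricated claims.",
    "score2_description": "Some claims are backed by input content, but others are incorrect or unsupported.",
    "score3_description": "All claims are accurate and appropriately cited.",
}
factual_accuracy_scorer = RubricsScore(rubrics=factual_rubric, llm=evaluator_llm)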

alexcpop (Author) commented Jan 6, 2025

Hi @jjmachan, many thanks for your reply! I tested the code above and the result is 5 (even though my rubric only has 3 levels). I tried 2 scenarios and both print out 5. Code:

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    user_input="What is the capital of Spain?",
    response="The capital of Spain is Madrid.",
    reference="The capital of Spain is Madrid.",
)

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_score = RubricsScore(rubrics=rubric, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
)

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_score = RubricsScore(rubrics=rubric, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

tuan3w commented Jan 7, 2025

I have the same problem. I got a score of 7 even though I only specified a 5-category rubric. I am using the latest version of the library.

sahusiddharth (Collaborator) commented

Hi @alexcpop @tuan3w,

I’ve looked into the issue and made the necessary changes to address it. Please let me know if you continue to encounter any problems. You can also use the code snippet below for reference.

import os
from dotenv import load_dotenv

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from ragas.metrics import RubricsScore
from ragas import EvaluationDataset

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

test = [
    {
        "response": "The Earth is flat and does not orbit the Sun.",
        "reference": "Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
    },
    {
        "response": "The moon is made entirely of green cheese.",
        "reference": "The moon is a natural satellite composed primarily of rock and metal. This has been confirmed by lunar samples brought back by the Apollo missions and is well-supported by scientific studies.",
    },
    {
        "response": "Albert Einstein developed the theory of relativity in 1905.",
        "reference": "Albert Einstein's theory of relativity was published in 1905, and it fundamentally changed the understanding of space, time, and gravity. This is a key milestone in modern physics.",
    },
    {
        "response": "The Amazon rainforest produces 10% of the world’s oxygen.",
        "reference": "The Amazon rainforest is often cited as a major oxygen producer, but studies show that it contributes far less to the Earth's oxygen supply than commonly stated. It does, however, play a critical role in carbon dioxide absorption.",
    },
    {
        "response": "Shakespeare wrote 'To Kill a Mockingbird'.",
        "reference": "'To Kill a Mockingbird' was written by Harper Lee and published in 1960. William Shakespeare, an English playwright, lived in the 16th century and is famous for works such as 'Romeo and Juliet' and 'Hamlet'.",
    },
]

evaluation_dataset = EvaluationDataset.from_list(test)

rubrics = {
    "score1_description": "The response does not align with key aspects of the reference or instructions.",
    "score2_description": "The response partially aligns with the reference or instructions, but has notable omissions or misinterpretations.",
    "score3_description": "The response fully aligns with the reference or instructions, accurately addressing all key points."
}

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[RubricsScore(llm=evaluator_llm, rubrics=rubrics)],
    llm=evaluator_llm,
)

result
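
To check the individual scores rather than just the aggregate, one option is to convert the result to a DataFrame, for example:

# Inspect per-sample scores to confirm they stay within the 1-3 rubric range
results_df = result.to_pandas()
results_df.head()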

jjmachan linked a pull request on Jan 8, 2025 that will close this issue
jjmachan pushed a commit that referenced this issue on Jan 8, 2025
heoun commented Jan 12, 2025

@jjmachan - Are you sure this is complete? I just tried the sample example and the scores were not limited to the 1-3 rubric: the output was 1, 1, 8, 3, 0. It's either hallucinating the 8 or the instructions are still off.

@sahusiddharth - if you do

results_df = result.to_pandas()
results_df.head()

you'll find that it still rated the Albert Einstein example an 8.

heoun commented Jan 12, 2025

@sahusiddharth

jjmachan (Member) commented

Hey @heoun, which tracing tool are you using? Maybe we can check the actual prompt going out to figure out why this is happening. Could you share that here?
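
If you don't have a tracing tool hooked up, a rough workaround is to attach a LangChain callback handler to the evaluator LLM so that every prompt gets printed. This is only a sketch (PromptLogger is not part of ragas), but it should surface what RubricsScore sends to the model:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Hypothetical debugging handler: prints every prompt sent to the chat model
class PromptLogger(BaseCallbackHandler):
    def on_chat_model_start(self, serialized, messages, **kwargs):
        for conversation in messages:
            for message in conversation:
                print(f"[{message.type}] {message.content}")

llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[PromptLogger()])
evaluator_llm = LangchainLLMWrapper(llm)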
