Do the scores for rubric have to be numeric and in a 1-5 range? #1800
Comments
How you could create the rubric scores is as follows:

```python
rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}

adherence_scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)
```

This will score the adherence rubric you mentioned, and you can create the same for the other rubric too. Can you let me know if that helps? If not, could you share a bit more about your use case and I'll help you out 🙂
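For reference, a minimal end-to-end sketch of scoring one sample with that rubric could look like the following. The sample text and the gpt-4o-mini model here are just placeholder assumptions, so swap in your own data and evaluator:

```python
# Minimal sketch (assumed sample text and model; adjust to your setup).
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import RubricsScore

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

rubric = {
    "score1_description": "Does not align with key aspects of the guidance.",
    "score2_description": "Partially aligns but has notable omissions or misinterpretations.",
    "score3_description": "Fully aligns with the instructions.",
}
adherence_scorer = RubricsScore(rubrics=rubric, llm=evaluator_llm)

# Hypothetical sample; with a three-level rubric the score should fall in 1-3.
sample = SingleTurnSample(
    user_input="Summarise the onboarding guidance.",
    response="New hires should complete the security training in their first week.",
    reference="The guidance requires security training within the first week of onboarding.",
)

score = asyncio.run(adherence_scorer.single_turn_ascore(sample))
print(score)
```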
Hi @jjmachan, many thanks for your reply! I tested the code above and the result is 5 (even though I only have the 3 options). I tried 2 scenarios and both print out 5. Code:

```python
from ragas.dataset_schema import SingleTurnSample

rubric = {
```
I have the same problem. I got a score of 7 even though I only specified 5 categories. I am using the latest version of the library.
I’ve looked into the issue and made the necessary changes to address it. Please let me know if you continue to encounter any problems. You can also use the code snippet below for reference.

```python
import os

from dotenv import load_dotenv
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from ragas.metrics import RubricsScore
from ragas import EvaluationDataset

load_dotenv()

# Wrap the LangChain chat model so ragas can use it as the evaluator LLM.
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

test = [
    {
        "response": "The Earth is flat and does not orbit the Sun.",
        "reference": "Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
    },
    {
        "response": "The moon is made entirely of green cheese.",
        "reference": "The moon is a natural satellite composed primarily of rock and metal. This has been confirmed by lunar samples brought back by the Apollo missions and is well-supported by scientific studies.",
    },
    {
        "response": "Albert Einstein developed the theory of relativity in 1905.",
        "reference": "Albert Einstein's theory of relativity was published in 1905, and it fundamentally changed the understanding of space, time, and gravity. This is a key milestone in modern physics.",
    },
    {
        "response": "The Amazon rainforest produces 10% of the world's oxygen.",
        "reference": "The Amazon rainforest is often cited as a major oxygen producer, but studies show that it contributes far less to the Earth's oxygen supply than commonly stated. It does, however, play a critical role in carbon dioxide absorption.",
    },
    {
        "response": "Shakespeare wrote 'To Kill a Mockingbird'.",
        "reference": "'To Kill a Mockingbird' was written by Harper Lee and published in 1960. William Shakespeare, an English playwright, lived in the 16th century and is famous for works such as 'Romeo and Juliet' and 'Hamlet'.",
    },
]

evaluation_dataset = EvaluationDataset.from_list(test)

# A three-level rubric: the returned scores should stay within 1-3.
rubrics = {
    "score1_description": "The response does not align with key aspects of the reference or instructions.",
    "score2_description": "The response partially aligns with the reference or instructions, but has notable omissions or misinterpretations.",
    "score3_description": "The response fully aligns with the reference or instructions, accurately addressing all key points.",
}

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[RubricsScore(llm=evaluator_llm, rubrics=rubrics)],
    llm=evaluator_llm,
)
result
```
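If you want to see the per-sample scores rather than just the aggregate, you can convert the result to a dataframe with its to_pandas() helper (the exact column name for the rubric metric may differ depending on your ragas version):

```python
# Inspect per-sample scores; with the three-level rubric above, each value should be 1, 2, or 3.
df = result.to_pandas()
print(df)
```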
@jjmachan - Are you sure this is complete? I just attempted the sample example and the scores were not confined to the expected range: the output was 1, 1, 8, 3, 0. It's either hallucinating the 8 or the instructions are still off. @sahusiddharth - if you run it you'll find that it still rated the Albert Einstein example an 8.
Hey @heoun, which tracing tool are you using? Maybe we can check the actual prompt being sent to figure out why this is happening. Could you share that here?
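If you don't have a tracing tool set up, one rough way to see the raw prompts is to pass a LangChain callback into evaluate(). This is only a sketch: it assumes the ConsoleCallbackHandler import path from langchain_core and that your ragas version forwards LangChain callbacks via the callbacks parameter.

```python
# Rough sketch: print the underlying LLM calls (including prompts) to the console.
from langchain_core.tracers import ConsoleCallbackHandler

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[RubricsScore(llm=evaluator_llm, rubrics=rubrics)],
    llm=evaluator_llm,
    callbacks=[ConsoleCallbackHandler()],  # assumption: callbacks are forwarded to the LLM calls
)
```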
[x] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
I want to put together my own custom rubric. I played around with the score range (0-3, 1-3, and categorical: Met, Partially Met, Not Met). I noticed that in these cases the output scores would discard my scoring criteria and default to 1-5. My questions are:
Code Examples
Example outputs: