How to write custom prompt for faithfulness with PydanticPrompt #1729

pratikchhapolika opened this issue Dec 4, 2024 · 2 comments


pratikchhapolika commented Dec 4, 2024

How can I change the default prompt used in the faithfulness metric?
Here is an example that uses the default faithfulness prompt.

from ragas import EvaluationDataset, SingleTurnSample, evaluate, RunConfig
from ragas.metrics import ResponseRelevancy, faithfulness
from ragas.prompt import PydanticPrompt
from langchain_openai.chat_models import AzureChatOpenAI
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from langchain_core.callbacks import BaseCallbackHandler
from datasets import Dataset
from pydantic import BaseModel, Field
import os

# Callback that prints every prompt sent to the LLM and the raw LLM response
class TestCallback(BaseCallbackHandler):

    def on_llm_start(self, serialized, prompts, **kwargs):
        print(f"**********Prompts*********:\n {prompts[0]}\n\n")

    def on_llm_end(self, response, **kwargs):
        print(f"**********Response**********:\n {response}\n\n")

data_samples = {
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ],
}
dataset = Dataset.from_dict(data_samples)
# azure_model / azure_embeddings are pre-configured AzureChatOpenAI / AzureOpenAIEmbeddings instances (setup not shown)
score = evaluate(dataset, metrics=[faithfulness],
                 llm=azure_model,
                 embeddings=azure_embeddings,
                 raise_exceptions=True,
                 callbacks=[TestCallback()],
                 run_config=RunConfig(timeout=10, max_retries=1, max_wait=60, max_workers=1)
                )
score.to_pandas()
**********Prompts*********:
 Human: Given a question, an answer, and sentences from the answer analyze the complexity of each sentence given under 'sentences' and break down each sentence into one or more fully understandable statements while also ensuring no pronouns are used in each statement. Format the outputs in JSON.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'$defs': {'SentenceComponents': {'properties': {'sentence_index': {'description': 'The index of the sentence', 'title': 'Sentence Index', 'type': 'integer'}, 'simpler_statements': {'description': 'A list of simpler statements that can be directly inferred from the context', 'items': {'type': 'string'}, 'title': 'Simpler Statements', 'type': 'array'}}, 'required': ['sentence_index', 'simpler_statements'], 'title': 'SentenceComponents', 'type': 'object'}}, 'properties': {'sentences': {'description': 'A list of sentences and their simpler versions', 'items': {'$ref': '#/$defs/SentenceComponents'}, 'title': 'Sentences', 'type': 'array'}}, 'required': ['sentences'], 'title': 'SentencesSimplified', 'type': 'object'}

--------EXAMPLES-----------
Example 1
Input: {
    "question": "Who was Albert Einstein and what is he best known for?",
    "answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.",
    "sentences": {
        "0": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time.",
        "1": "He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics."
    }
}
Output: {
    "sentences": [
        {
            "sentence_index": 0,
            "simpler_statements": [
                "Albert Einstein was a German-born theoretical physicist.",
                "Albert Einstein is recognized as one of the greatest and most influential physicists of all time."
            ]
        },
        {
            "sentence_index": 1,
            "simpler_statements": [
                "Albert Einstein was best known for developing the theory of relativity.",
                "Albert Einstein also made important contributions to the development of the theory of quantum mechanics."
            ]
        }
    ]
}
-----------------------------

Now perform the same with the following input
input: {
    "question": "When was the first super bowl?",
    "answer": "The first superbowl was held on Jan 15, 1967",
    "sentences": {}
}
Output: 


**********Response**********:
 generations=[[ChatGeneration(text='```json\n{\n    "sentences": []\n}\n```', generation_info={'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'protected_material_code': {'filtered': False, 'detected': False}, 'protected_material_text': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, message=AIMessage(content='```json\n{\n    "sentences": []\n}\n```', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 12, 'prompt_tokens': 600, 'total_tokens': 612, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_04751d0b65', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'protected_material_code': {'filtered': False, 'detected': False}, 'protected_material_text': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-b5aad4d8-0ad7-4296-8338-ccdb36a20add-0', usage_metadata={'input_tokens': 600, 'output_tokens': 12, 'total_tokens': 612, 'input_token_details': {}, 'output_token_details': {}}))]] llm_output={'token_usage': {'completion_tokens': 12, 'prompt_tokens': 600, 'total_tokens': 612, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_04751d0b65', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}]} run=None type='LLMResult'


**********Prompts*********:
 Human: Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{'$defs': {'StatementFaithfulnessAnswer': {'properties': {'statement': {'description': 'the original statement, word-by-word', 'title': 'Statement', 'type': 'string'}, 'reason': {'description': 'the reason of the verdict', 'title': 'Reason', 'type': 'string'}, 'verdict': {'description': 'the verdict(0/1) of the faithfulness.', 'title': 'Verdict', 'type': 'integer'}}, 'required': ['statement', 'reason', 'verdict'], 'title': 'StatementFaithfulnessAnswer', 'type': 'object'}}, 'properties': {'statements': {'items': {'$ref': '#/$defs/StatementFaithfulnessAnswer'}, 'title': 'Statements', 'type': 'array'}}, 'required': ['statements'], 'title': 'NLIStatementOutput', 'type': 'object'}

--------EXAMPLES-----------
Example 1
Input: {
    "context": "John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.",
    "statements": [
        "John is majoring in Biology.",
        "John is taking a course on Artificial Intelligence.",
        "John is a dedicated student.",
        "John has a part-time job."
    ]
}
Output: {
    "statements": [
        {
            "statement": "John is majoring in Biology.",
            "reason": "John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
            "verdict": 0
        },
        {
            "statement": "John is taking a course on Artificial Intelligence.",
            "reason": "The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
            "verdict": 0
        },
        {
            "statement": "John is a dedicated student.",
            "reason": "The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
            "verdict": 1
        },
        {
            "statement": "John has a part-time job.",
            "reason": "There is no information given in the context about John having a part-time job.",
            "verdict": 0
        }
    ]
}

Example 2
Input: {
    "context": "Photosynthesis is a process used by plants, algae, and certain bacteria to convert light energy into chemical energy.",
    "statements": [
        "Albert Einstein was a genius."
    ]
}
Output: {
    "statements": [
        {
            "statement": "Albert Einstein was a genius.",
            "reason": "The context and statement are unrelated",
            "verdict": 0
        }
    ]
}
-----------------------------

Now perform the same with the following input
input: {
    "context": "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,",
    "statements": []
}
Output: 

I assume it uses 2 steps to calculate faithfulness: the first prompt breaks the answer into simpler statements, and the second prompt judges each statement against the context to produce a verdict.
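
If that is right, the final score should just be the share of generated statements that get verdict 1. A rough sketch of the arithmetic only (my reading of the metric, not the actual ragas code):

# faithfulness = statements judged faithful / total statements generated from the answer
verdicts = [1, 0, 1]                    # verdicts returned by the second (NLI) prompt
score = sum(verdicts) / len(verdicts)   # -> 0.67 (2 of 3 statements supported)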

How can I use the same two-step prompt style, calling the LLM twice, with my own custom prompts?

Here is what I tried:

# Step 1: Define custom input and output schemas
class CustomInput(BaseModel):
    question: str = Field(description="The question to be answered")
    answer: str = Field(description="The generated answer to the question")
    contexts: list[str] = Field(description="Relevant context documents")

class CustomOutput(BaseModel):
    score: float = Field(description="Faithfulness score between the answer and contexts")

# Step 2: Create a custom prompt
class CustomFaithfulnessPrompt(PydanticPrompt[CustomInput, CustomOutput]):
    instruction = "Evaluate how faithful the answer is to the provided contexts."
    input_model = CustomInput
    output_model = CustomOutput
    examples = [
        (
            CustomInput(
                question="What is the capital of France?",
                answer="The capital of France is Paris.",
                contexts=[
                    "France is a country in Europe. Its capital city is Paris, known for its landmarks like the Eiffel Tower."
                ]
            ),
            CustomOutput(score=1.0)
        )
    ]

# Step 3: Define the dataset
data_samples = {
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts': [
        ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,']
    ],
}
dataset = Dataset.from_dict(data_samples)
# print(dir(faithfulness))
faithfulness.long_form_answer_prompt = CustomFaithfulnessPrompt()
print("Custom prompt\n", faithfulness.long_form_answer_prompt.to_string())

# Step 4: Use evaluate with the custom prompt
score = evaluate(
    dataset,
    metrics=[faithfulness],
    llm=azure_model,
    embeddings=azure_embeddings,
    raise_exceptions=True,
    callbacks=[TestCallback()],
    run_config=RunConfig(timeout=10, max_retries=1, max_wait=60, max_workers=1),
)

# Convert the results to a DataFrame
print(score.to_pandas())

But it is not working the same way as the default prompt.

pratikchhapolika added the question (Further information is requested) label on Dec 4, 2024
pratikchhapolika (Author) commented:

@jjmachan would need your help on this

sahusiddharth added the module-metrics (this is part of metrics module) label on Jan 11, 2025
sahusiddharth (Collaborator) commented Jan 11, 2025

Hi @pratikchhapolika,

We have a section in our docs that explains how to change the prompt for the Ragas metrics. You can also have a look at the example below:

from ragas.metrics import Faithfulness

scorer = Faithfulness(llm=evaluator_llm)
scorer.get_prompts()

Output

{'n_l_i_statement_prompt': NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a ..
'statement_generator_prompt': StatementGeneratorPrompt(instruction=Given a question, an answer, and sentences from ... }

Changing the prompt...

statement_prompt = scorer.get_prompts()["statement_generator_prompt"]  # breaks the answer into statements
verdict_prompt = scorer.get_prompts()["n_l_i_statement_prompt"]        # judges the faithfulness of each statement

statement_prompt.instruction = "New statement breaking prompt"
statement_prompt.examples = []  # new examples must conform to the example class of the statement generator prompt

verdict_prompt.instruction = "New verdict generation prompt"
verdict_prompt.examples = []  # new examples must conform to the example class of the NLI statement prompt

scorer.set_prompts(
    **{
        "statement_generator_prompt": statement_prompt,
        "n_l_i_statement_prompt": verdict_prompt,
    }
)

scorer.get_prompts()

Output

{'n_l_i_statement_prompt': NLIStatementPrompt(instruction=New verdict generation prompt
'statement_generator_prompt': StatementGeneratorPrompt(instruction=New statement breaking prompt}
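
If you also want to replace the few-shot examples and not just the instruction, you can build them from the prompt object's own input_model / output_model so they match the schema the metric expects. A rough sketch for the NLI prompt, with field names taken from the prompt logged above (treat it as a sketch and adapt it to your ragas version):

nli_prompt = scorer.get_prompts()["n_l_i_statement_prompt"]

# The prompt object exposes the Pydantic models it validates against,
# so example pairs can be built without importing ragas internals.
ExampleInput = nli_prompt.input_model    # context + statements, as in the logged schema
ExampleOutput = nli_prompt.output_model  # NLIStatementOutput: statement, reason, verdict

example_input = ExampleInput(
    context="The Eiffel Tower is located in Paris and was completed in 1889.",
    statements=["The Eiffel Tower is in Berlin."],
)
example_output = ExampleOutput(
    statements=[
        {
            "statement": "The Eiffel Tower is in Berlin.",
            "reason": "The context places the Eiffel Tower in Paris, not Berlin.",
            "verdict": 0,
        }
    ]
)

nli_prompt.examples = [(example_input, example_output)]
scorer.set_prompts(**{"n_l_i_statement_prompt": nli_prompt})

The same pattern applies to statement_generator_prompt, whose models follow the first logged schema (question, answer, sentences in; sentence_index, simpler_statements out). Make sure you then pass this scorer instance to evaluate(metrics=[scorer], ...) so the customized prompts are the ones actually used.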

sahusiddharth added the answered 🤖 (The question has been answered. Will be closed automatically if no new comments) label on Jan 11, 2025
sahusiddharth self-assigned this on Jan 11, 2025