
The context_precision_score always returns 0 when using Bedrock with Claude instant. #288

Closed
Pauldevillers opened this issue Nov 16, 2023 · 12 comments

@Pauldevillers
Contributor

Pauldevillers commented Nov 16, 2023

Describe the bug
The context_precision_score always returns 0 when using Bedrock with Claude instant.

I loaded the Amazon Bedrock documentation into OpenSearch as a vector store, and I am evaluating the retrieval pipeline with Ragas.

The context_precision metric evaluates each chunk retrieved from the vector search with a prompt that asks whether the information in the given context is useful for answering the question; the LLM is expected to answer with a plain yes/no.

The problem is that the LLM does not answer with only yes or no; it also includes its reasoning, as shown below:

' Yes\n\nThe context provided is the Amazon Bedrock User Guide which contains information about the supported regions for Amazon Bedrock on page 3. This would be useful in answering the question "Where Amazon bedrock is available?" since it directly lists out the available regions.'

The issue lies in the context_precision.py file, specifically in the following line:

response = [int("Yes" in resp) for resp in response]

This line checks whether the string "Yes" appears in each response in the grouped_responses list. In the example above, the score comes out as 0 even though the answer clearly contains a yes.

I have four items inside my grouped_responses, and the error occurs because the condition int("Yes" in resp) expects a strict "Yes" and does not cope with the verbose, free-form replies the model returns, which leads to the problem.

Python version and packages
Ragas version: 0.0.19
Python version: 3.9.6

Options to Resolve the Error:

  • Option 1: Add few-shot examples to the context_precision prompt so that the output follows a fixed schema.
  • Option 2: Instead of using [int("Yes" in resp) for resp in response], use a regex to detect the yes/no verdict in the reply (see the sketch after this list).

Please provide guidance on implementing one of the suggested options to resolve the error.
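
For reference, here is a minimal sketch of what Option 2 could look like; the parse_verdict helper is hypothetical and not part of Ragas, and it assumes the verdict appears at the start of the reply:

import re

def parse_verdict(resp: str) -> int:
    """Return 1 if the reply opens with an affirmative verdict, else 0."""
    # Accept "yes"/"no" in any capitalization, skip leading whitespace,
    # and tolerate any free-form reasoning that follows the verdict.
    match = re.match(r"\s*(yes|no)\b", resp, flags=re.IGNORECASE)
    return int(bool(match) and match.group(1).lower() == "yes")

# The verbose Claude Instant reply quoted above would now count as a yes:
# parse_verdict(' Yes\n\nThe context provided is the Amazon Bedrock User Guide ...') == 1

The list comprehension in context_precision.py would then become response = [parse_verdict(resp) for resp in response].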

Code to Reproduce

import boto3
from langchain.chains import RetrievalQA
from langchain.llms.bedrock import Bedrock
from ragas.langchain import RagasEvaluatorChain
from ragas.llms import LangchainLLM
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# PROMPT, opensearch_vector_search_client and question are defined elsewhere.

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

bedrock_llm = Bedrock(
    model_id="anthropic.claude-instant-v1",
    client=bedrock_client,
    model_kwargs={"temperature": 0},
)

# Set Bedrock as the LangChain wrapper so Ragas can call it
bedrock_wrapper = LangchainLLM(llm=bedrock_llm)

qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    chain_type="stuff",
    retriever=opensearch_vector_search_client.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT, "verbose": True},
    verbose=True,
)

response = qa(question, return_only_outputs=False)

# Point the context_precision metric at the Bedrock LLM
metrics = [context_precision]
for m in metrics:
    m.llm = bedrock_wrapper

# Build the evaluation chains and print each score
eval_chains = {
    m.name: RagasEvaluatorChain(metric=m) for m in metrics
}

for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(response)[score_name]}")

Expected behavior

context_precision_score should return a non-zero value.

Additional context

Here are the four elements inside `grouped_responses`:

' No\n\nThe context provided is the copyright information and terms of use for an Amazon Bedrock user guide. It does not contain any information about where Amazon bedrock is available. So this context is not useful in answering the given question.'

' Yes\n\nThe context provided is the Amazon Bedrock User Guide which contains information about the supported regions for Amazon Bedrock on page 3. This would be useful in answering the question "Where Amazon bedrock is available?" since it directly lists out the available regions.'

' Yes, the context contains information that is useful in answering the question "Where Amazon bedrock is available?". \n\nThe context talks about Amazon Bedrock and mentions that it is a fully managed service that makes base models from Amazon and third-party model providers accessible through an API. It also lists the supported regions on page 3, which would contain the answer to where Amazon Bedrock is available.\n\nSo the context provides relevant information to answer the given question, therefore the answer is Yes.'

' No\n\nThe given context does not contain any information about where Amazon bedrock is available. The context only mentions "Amazon Bedrock User Guide" but does not provide any details about locations. Hence the context is not useful in answering the given question about locations of Amazon bedrock availability.'
@shahules786
Member

Hi @Pauldevillers, this is a widespread issue across the different prompts in the framework. As a first step in tackling it, we are going to change the prompts so that the output follows a specific structure, such as JSON, that can be verified easily. Thanks for bringing this to our notice; you can expect a fix very soon :)
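
As a rough illustration of that direction (not the actual Ragas implementation), the verdict could be requested and parsed as JSON along these lines; the instruction text and the extract_verdict helper are made up for the example:

import json
import re

# Hypothetical instruction appended to the context_precision prompt.
JSON_INSTRUCTION = (
    'Respond strictly with JSON, either {"verdict": "yes"} or {"verdict": "no"}, '
    "and nothing else."
)

def extract_verdict(resp: str) -> int:
    """Pull the first JSON object out of the reply and read its verdict."""
    match = re.search(r"\{.*?\}", resp, flags=re.DOTALL)
    if match is None:
        return 0
    try:
        verdict = json.loads(match.group(0)).get("verdict", "")
    except json.JSONDecodeError:
        return 0
    return int(str(verdict).strip().lower() == "yes")

print(extract_verdict('{"verdict": "yes"}'))                        # 1
print(extract_verdict('Sure! {"verdict": "no"} Hope that helps.'))  # 0

Even if the model wraps the JSON in extra prose, the verdict can still be recovered, which is what makes a structured output easier to verify.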

@IgnacioPascale

I second this, @Pauldevillers. Using Llama-2-7b-chat-hf, my grouped_responses object looks as follows:

[[['  Yes. The information']],
 [['  Yes, the information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes. The information']],
 [['  Yes.\n\n']],
 [['  Yes. The information']],
 [['  Yes.\n\n']],
 [['  Yes. The information']]]

So I get 0 in all examples.

Thanks for looking into it @shahules786

@Pauldevillers
Contributor Author

Pauldevillers commented Nov 16, 2023

Hello @shahules786, I just submitted a PR to tackle this issue: #289

@shahules786
Member

Thank you @Pauldevillers :)

@ferdinandl007
Contributor

The same happens with Vertex AI text-bison.

@kandakji

Same issue here.

@mohade09

Now I am getting NaN.

@jjmachan jjmachan added this to the v0.1.0 milestone Nov 29, 2023
@jjmachan jjmachan added bug Something isn't working enhancement New feature or request labels Nov 29, 2023
@jjmachan
Member

This is tricky because it is both a bug and a long-term problem. I think we can start off with dedicated prompts for Bedrock and Vertex AI, maybe @shahules786?
It would be an extension of what @tinomaxthayil and Austin (from AWS) were working on, which should be faster to merge in.

@kishoreiitd

I am also getting the same error while using Claude v2 via Bedrock.

@shahules786
Member

Hi @kishoreiitd, can you share the Ragas version? This should be fixed by #364.

@kishoreiitd

Thanks @shahules786, this has been fixed. I was using version 0.0.21, but I had tried it before #364 was merged.

@jjmachan
Member

jjmachan commented Jan 8, 2024

Addressed with #364. Please raise a new issue if this still persists.

@jjmachan jjmachan closed this as completed Jan 8, 2024