feat: Knowledge Graph evaluator (#37)

* added evaluation module using llama_index for graph_rag
* added graph_rag_test_data generation module using ragas
* added docstrings for each file and formatted using black
* updated requirements.txt
* added evaluation using ragas script
* add initial Readme file for the module
* updated README.MD and added some artifacts in random folder
* Updated Readme.md of evaluation module for demo of test data generation
* Updated Readme.md of evaluation module
* Updated Readme.md of evaluation module
* updated README.MD for evaluation using llama-index

1 parent 9cf1b43, commit 1f1ee9f
Showing 9 changed files with 593 additions and 1 deletion.

# Knowledge Graph Evaluation

This module provides methods to evaluate the performance of GraphRag. The following integrations are available for evaluation:

- **Llama-Index Evaluation Pack**
- **Ragas Evaluation Pack**

Additionally, this module includes scripts for creating custom test datasets to benchmark and evaluate GraphRag.

## Getting Started

This section demonstrates how to use the functions provided in the module:

---

### 1. QA Generation and Critique

This module offers tools to generate question-answer (QA) pairs from input documents using a language model and critique them based on various criteria like groundedness, relevance, and standalone quality.

> #### Generate and Critique QA Pairs
To use this module, follow these steps:

#### 1. Generate QA Pairs

First, we prepare our dataset for generating QA pairs. In this example, we'll use the Keras-IO documentation and Llama-Index's `SimpleDirectoryReader` to obtain `Document` objects.

```python
!git clone https://github.com/keras-team/keras-io.git

from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter


def get_data(input_dir="path/to/keras-io/templates"):
    reader = SimpleDirectoryReader(
        input_dir,
        recursive=True,
        exclude=["path/to/keras-io/templates/examples"],
    )
    docs = reader.load_data()

    # Split the documents into ~300-token chunks and rewrap each chunk as a Document
    splitter = SentenceSplitter(
        chunk_size=300,
        chunk_overlap=20,
    )
    nodes = splitter.get_nodes_from_documents(docs)
    documents = [Document(text=node.text, metadata=node.metadata) for node in nodes]

    return documents


# load the documents
documents = get_data()
```

Use the `qa_generator` function to generate QA pairs from your input documents.

```python
from evaluation.ragas_evaluation.QA_graphrag_testdataset import qa_generator

N_GENERATIONS = 20

# Generate the QA pairs
qa_pairs = qa_generator(documents, N_GENERATIONS)
```
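
Each generated pair is a plain dictionary with `context`, `question`, `answer`, and `source_doc` keys (see `QA_graphrag_testdataset.py` further down in this commit), so it is easy to spot-check a few pairs before critiquing them. A minimal sketch:

```python
# Inspect the first few generated pairs before moving on to the critique step
for pair in qa_pairs[:3]:
    print(pair["question"].strip())
    print("->", pair["answer"].strip())
    print()

# qa_generator also writes the full set to QA.csv in the working directory
```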

#### 2. Critique the Generated QA Pairs

Once you have generated the QA pairs, critique them using the `critique_qa` function.

```python
from evaluation.ragas_evaluation.QA_graphrag_testdataset import critique_qa

# Critique the generated QA pairs
critiqued_qa_pairs = critique_qa(qa_pairs)

# The critiqued pairs will include scores and evaluations for groundedness, relevance, and standalone quality
```
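
To keep only the strongest pairs for a benchmark, you can filter on the critique scores, mirroring the filtering `critique_qa` itself applies before writing `generated_questions.csv`. A minimal sketch:

```python
import pandas as pd

# Keep only pairs rated >= 4 on every criterion (same thresholds the module uses)
df = pd.DataFrame(critiqued_qa_pairs)
filtered = df[
    (df["groundedness_score"] >= 4)
    & (df["relevance_score"] >= 4)
    & (df["standalone_score"] >= 4)
]
print(f"Kept {len(filtered)} of {len(df)} QA pairs")
```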

---
### 2. Evaluating Your Knowledge Graph with Llama-Index Evaluator Pack

This section demonstrates how to evaluate the performance of your query engine using the Llama-Index RAG evaluator pack.

> #### Evaluate Your Knowledge Graph with llama-index
To evaluate your query engine, follow these steps:

```shell
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```

```python
from evaluation.evaluation_llama_index import evaluate

# Path to your labeled RAG dataset
RAG_DATASET = "./data/rag_dataset.json"

# Your query engine instance
from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine

index = get_index_from_pickle("path/to/graphIndex.pkl")
query_engine = get_query_engine(index)

# Evaluate the dataset. evaluate() builds the judge LLM (Ollama) and the
# Hugging Face embedding model internally from the model names passed here.
evaluation_results = evaluate(
    RAG_DATASET,
    query_engine,
    ollama_model="llama2",
    embedd_model="microsoft/codebert-base",
)

# Review the results
print(evaluation_results)
```

| Metrics | RAG | Base RAG |
|---------|-----|----------|
| **Mean Correctness Score** | 3.340909 | 0.934 |
| **Mean Relevancy Score** | 0.750000 | 4.239 |
| **Mean Faithfulness Score** | 0.386364 | 0.977 |
| **Mean Context Similarity Score** | 0.948765 | 0.977 |

This example shows how to quickly evaluate your query engine's performance using the Llama-Index RAG evaluator pack.
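
Since `evaluate` returns a pandas DataFrame of benchmark scores, it can be persisted for later comparison. A small sketch (the output filename here is just an example):

```python
# Save the benchmark table for later comparison against other runs
evaluation_results.to_csv("llama_index_benchmark.csv")
```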

---
### 3. Evaluating Your Knowledge Graph with the Ragas backend

You can easily evaluate the performance of your query engine using this module.

> #### Load and Evaluate Your Dataset with ragas
Use the `load_test_dataset` function to load your dataset and directly evaluate it using the `evaluate` function. This method handles all necessary steps, including batching the data.

```python
from evaluation.ragas_evaluation.evaluation_ragas import load_test_dataset, evaluate

# Step 1: Load the dataset from a pickle file
dataset_path = "/content/keras_docs_embedded.pkl"
test_dataset = load_test_dataset(dataset_path)
```

> **Note:** `test_dataset` is a list of Llama-Index `Document` objects.

```python
# Step 2: Define the language model and embedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

llm = Ollama(base_url="http://localhost:11434", model="codellama")
embedding = HuggingFaceEmbedding(model_name="microsoft/codebert-base")

# Step 3: Specify the metrics for evaluation
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

metrics = [faithfulness, answer_relevancy, context_precision, context_recall]

# Step 4: Load the query engine (Llama-Index)
from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine

index = get_index_from_pickle("path/to/graphIndex.pkl")
query_engine = get_query_engine(index)

# Step 5: Evaluate the dataset
evaluation_results = evaluate(
    query_engine=query_engine,
    dataset=test_dataset,
    llm=llm,
    embeddings=embedding,
    metrics=metrics,
    # Default batch size is 4
)
```

**Output:**
```python
{'faithfulness': 0.0333, 'answer_relevancy': 0.9834, 'context_precision': 0.2000, 'context_recall': 0.8048}
```

The results can also be exported for further analysis:
```python
rdf = evaluation_results.to_pandas()
rdf.to_csv("results.csv", index=False)
```

---
**Detailed Result:**

| question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall |
|----------|----------|--------|--------------|--------------|------------------|-------------------|----------------|
| What is mixed precision in computing? | [Examples GPT-2 text generation Parameter…] | Mixed precision is a technique used to improve… | A combination of different numerical precision… | 0.166667 | 0.981859 | 0.0 | 0.666667 |
| What is the title of the guide discussed in th... | [Available guides… Hyperparameter T…] | The title of the guide discussed in the given… | How to distribute training | 0.000000 | 1.000000 | 0.0 | 1.000000 |
| What is Keras 3? | [No relationships found.] | Keras 3 is a new version of the popular deep l… | A deep learning framework that works with Tensor… | 0.000000 | 0.974711 | 0.0 | 0.500000 |
| What was the percentage boost in StableDiffusion... | [A first example: A MNIST convnet…] | The percentage boost in StableDiffusion traini… | Over 150% | 0.000000 | 0.970565 | 1.0 | 1.000000 |
| What are some examples of pretrained models av... | [No relationships found.] | Some examples of pre-trained models available… | BERT, OPT, Whisper, T5, StableDiffusion, YOLOv8… | 0.000000 | 0.989769 | 0.0 | 0.857143 |
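
To dig into individual failures, the per-question DataFrame produced by `to_pandas()` can be sorted by any metric. A minimal sketch using the column names shown above:

```python
# Surface the least faithful answers for manual inspection
worst = rdf.sort_values("faithfulness").head(5)
print(worst[["question", "faithfulness", "context_recall"]])
```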

---

**`graph_rag/evaluation/evaluation_llama_index.py`** (49 additions, 0 deletions)
""" | ||
This script evaluates a RagDataset using a RagEvaluatorPack, which assesses query engines by benchmarking against | ||
labeled data using LLMs and embeddings. | ||
Functions: | ||
- evaluate: Evaluates the query engine using a labeled RAG dataset and specified models for both the LLM and embeddings. | ||
""" | ||
|
||
from llama_index.core.llama_dataset import LabelledRagDataset | ||
from llama_index.core.llama_pack import download_llama_pack | ||
from llama_index.llms.ollama import Ollama | ||
from llama_index.embeddings.huggingface import HuggingFaceEmbedding | ||
|
||
|
||
|
||
|
||
|
||
def evaluate( | ||
RAG_DATASET: str, | ||
query_engine: object, | ||
ollama_model: str = "llama3", | ||
embedd_model: str = "microsoft/codebert-base", | ||
): | ||
""" | ||
Evaluates a RAG dataset by using a query engine and benchmarks it using LLM and embedding models. | ||
Args: | ||
RAG_DATASET: Path to the JSON file containing the labeled RAG dataset. | ||
query_engine: The query engine to evaluate. | ||
ollama_model: The LLM model to use for evaluation (default: "llama3"). | ||
embedd_model: The Hugging Face embedding model to use for evaluation (default: "microsoft/codebert-base"). | ||
Returns: | ||
A DataFrame containing the benchmarking results, including LLM calls and evaluations. | ||
""" | ||
|
||
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack") | ||
rag_dataset = LabelledRagDataset.from_json(RAG_DATASET) | ||
rag_evaluator_pack = RagEvaluatorPack( | ||
rag_dataset=rag_dataset, | ||
query_engine=query_engine, | ||
judge_llm=Ollama(base_url="http://localhost:11434", model=ollama_model), | ||
embed_model=HuggingFaceEmbedding(model_name=embedd_model), | ||
) | ||
benchmark_df = await rag_evaluator_pack.arun( | ||
batch_size=5, # batches the number of llm calls to make | ||
sleep_time_in_seconds=1, # seconds to sleep before making an api call | ||
) | ||
return benchmark_df |

---

**`graph_rag/evaluation/ragas_evaluation/QA_graphrag_testdataset.py`** (130 additions, 0 deletions)
""" | ||
This script contains functions to generate question-answer pairs from input documents using a language model, | ||
and critique them based on various criteria like groundedness, relevance, and standalone quality. | ||
Functions: | ||
- get_response: Sends a request to a language model API to generate responses based on a provided prompt. | ||
- qa_generator: Generates a specified number of question-answer pairs from input documents. | ||
- critique_qa: Critiques the generated QA pairs based on groundedness, relevance, and standalone quality. | ||
""" | ||
|
||
from prompts import * | ||
import pandas as pd | ||
import random | ||
from tqdm.auto import tqdm | ||
import requests | ||
|
||
|
||
def get_response( | ||
prompt: str, url: str = "http://localhost:11434/api/generate", model: str = "llama3" | ||
): | ||
""" | ||
Sends a prompt ollama API and retrieves the generated response. | ||
Args: | ||
prompt:The text input that the model will use to generate a response. | ||
url: The API endpoint for the model (default: "http://localhost:11434/api/generate"). | ||
model: The model to be used for generation (default: "llama3"). | ||
Returns: | ||
The generated response from the language model as a string. | ||
""" | ||
|
||
payload = {"model": model, "prompt": prompt, "stream": False} | ||
response = requests.post(url, json=payload) | ||
resp = response.json() | ||
return resp["response"] | ||
|
||
|
||
def qa_generator( | ||
documents: object, | ||
N_GENERATIONS: int = 20, | ||
): | ||
""" | ||
Generates a specified number of question-answer pairs from the provided documents. | ||
Args: | ||
documents: A collection of document objects to generate QA pairs from. | ||
N_GENERATIONS: The number of question-answer pairs to generate (default: 20). | ||
Returns: | ||
A list of dictionaries, each containing the generated context, question, answer, and source document metadata. | ||
""" | ||
print(f"Generating {N_GENERATIONS} QA couples...") | ||
|
||
outputs = [] | ||
for sampled_context in tqdm(random.sample(documents, N_GENERATIONS)): | ||
# Generate QA couple | ||
output_QA_couple = get_response( | ||
QA_generation_prompt.format(context=sampled_context.text) | ||
) | ||
try: | ||
question = output_QA_couple.split("Factoid question: ")[-1].split( | ||
"Answer: " | ||
)[0] | ||
answer = output_QA_couple.split("Answer: ")[-1] | ||
assert len(answer) < 300, "Answer is too long" | ||
outputs.append( | ||
{ | ||
"context": sampled_context.text, | ||
"question": question, | ||
"answer": answer, | ||
"source_doc": sampled_context.metadata, | ||
} | ||
) | ||
except: | ||
continue | ||
df = pd.DataFrame(outputs) | ||
df.to_csv("QA.csv") | ||
return outputs | ||
|
||
|
||
def critique_qa( | ||
outputs: list, | ||
): | ||
""" | ||
Critiques the generated question-answer pairs based on groundedness, relevance, and standalone quality. | ||
Args: | ||
outputs: A list of dictionaries containing generated QA pairs to be critiqued. | ||
Returns: | ||
The critiqued QA pairs with additional fields for groundedness, relevance, and standalone quality scores and evaluations. | ||
""" | ||
print("Generating critique for each QA couple...") | ||
for output in tqdm(outputs): | ||
evaluations = { | ||
"groundedness": get_response( | ||
question_groundedness_critique_prompt.format( | ||
context=output["context"], question=output["question"] | ||
), | ||
), | ||
"relevance": get_response( | ||
question_relevance_critique_prompt.format(question=output["question"]), | ||
), | ||
"standalone": get_response( | ||
question_standalone_critique_prompt.format(question=output["question"]), | ||
), | ||
} | ||
try: | ||
for criterion, evaluation in evaluations.items(): | ||
score, eval = ( | ||
int(evaluation.split("Total rating: ")[-1].strip()), | ||
evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1], | ||
) | ||
output.update( | ||
{ | ||
f"{criterion}_score": score, | ||
f"{criterion}_eval": eval, | ||
} | ||
) | ||
except Exception as e: | ||
continue | ||
generated_questions = pd.DataFrame.from_dict(outputs) | ||
generated_questions = generated_questions.loc[ | ||
(generated_questions["groundedness_score"] >= 4) | ||
& (generated_questions["relevance_score"] >= 4) | ||
& (generated_questions["standalone_score"] >= 4) | ||
] | ||
generated_questions.to_csv("generated_questions.csv") | ||
return outputs |