feat: Knowledge Graph evaluator (#37)
* added evaluation module using llama_index for graph_rag

* added graph_rag_test_data generation module using ragas

* added docstrings for each file and formatted using black

* updated requirements.txt

* added evaluation using ragas script

* add initial Readme file for the module

* updated README.MD and added some artifacts in random folder

* Updated Readme.md of evaluation module for demo of test data generation

* Updated Readme.md of evaluation module

* Updated Readme.md of evaluation module

* updated README.MD for evaluation using llama-index
debrupf2946 authored Aug 25, 2024
1 parent 9cf1b43 commit 1f1ee9f
Showing 9 changed files with 593 additions and 1 deletion.
196 changes: 196 additions & 0 deletions graph_rag/evaluation/README.MD
@@ -0,0 +1,196 @@

# Knowledge Graph Evaluation

This module provides methods to evaluate the performance of GraphRag. The following integrations are available for evaluation:

- **Llama-Index Evaluation Pack**
- **Ragas Evaluation Pack**

Additionally, this module includes scripts for creating custom test datasets to benchmark and evaluate GraphRag.

## Getting Started
This section demonstrates how to use the functions provided in the module:

---

### 1. QA Generation and Critique

This module offers tools to generate question-answer (QA) pairs from input documents using a language model and critique them based on various criteria like groundedness, relevance, and standalone quality.

> #### Generate and Critique QA Pairs
To use this module, follow these steps:

#### 1. Generate QA Pairs

First, we prepare our dataset for generating QA pairs. In this example, we'll use Keras-IO documentation and Llama-Index's `SimpleDirectoryReader` to obtain `Document` objects.

```python
!git clone https://github.com/keras-team/keras-io.git

from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

def get_data(input_dir="path/to/keras-io/templates"):
    reader = SimpleDirectoryReader(
        input_dir,
        recursive=True,
        exclude=["path/to/keras-io/templates/examples"],
    )
    docs = reader.load_data()

    splitter = SentenceSplitter(
        chunk_size=300,
        chunk_overlap=20,
    )
    nodes = splitter.get_nodes_from_documents(docs)
    documents = [Document(text=node.text, metadata=node.metadata) for node in nodes]

    return documents

# Load the split documents
documents = get_data()
```

Use the `qa_generator` function to generate QA pairs from your input documents.

```python
from evaluation.ragas_evaluation.QA_graphrag_testdataset import qa_generator

N_GENERATIONS = 20

# Generate the QA pairs
qa_pairs = qa_generator(documents, N_GENERATIONS)
```
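
The generator returns a list of plain dictionaries (and also writes a `QA.csv` copy to the working directory), so the pairs are easy to spot-check before critiquing. A minimal sketch, assuming `qa_pairs` from the step above:

```python
# Each entry is a dict with "context", "question", "answer", and "source_doc" keys.
for pair in qa_pairs[:3]:
    print("Q:", pair["question"].strip())
    print("A:", pair["answer"].strip())
    print("Source:", pair["source_doc"])
    print("-" * 40)
```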

#### 2. Critique the Generated QA Pairs

Once you have generated the QA pairs, critique them using the `critique_qa` function.

```python
from evaluation.ragas_evaluation.QA_graphrag_testdataset import critique_qa

# Critique the generated QA pairs
critiqued_qa_pairs = critique_qa(qa_pairs)

# The critiqued pairs will include scores and evaluations for groundedness, relevance, and standalone quality
```
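
The test-data script keeps only pairs that score at least 4 on every criterion before saving them; to apply the same cut-off to the returned list yourself, a minimal sketch (the output filename is illustrative):

```python
import pandas as pd

# Keep only pairs rated >= 4 on all three critique criteria,
# mirroring the filtering done inside the module.
df = pd.DataFrame(critiqued_qa_pairs)
filtered = df.loc[
    (df["groundedness_score"] >= 4)
    & (df["relevance_score"] >= 4)
    & (df["standalone_score"] >= 4)
]
filtered.to_csv("filtered_qa_pairs.csv", index=False)
print(f"Kept {len(filtered)} of {len(df)} QA pairs")
```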

---
### 2. Evaluating Your Knowledge Graph with Llama-Index Evaluator Pack

This section demonstrates how to evaluate the performance of your query engine using the Llama-Index RAG evaluator pack.

> #### Evaluate Your Knowledge Graph with llama-index
To evaluate your query engine, first download a labeled RAG dataset (here, the Paul Graham essay dataset), then run the evaluation:
```shell
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```

```python
from evaluation.evaluation_llama_index import evaluate


# Path to your labeled RAG dataset
RAG_DATASET = "./data/rag_dataset.json"

# Define the language model and embedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

llm = Ollama(base_url="http://localhost:11434", model="llama2")
embedding = HuggingFaceEmbedding(model_name="microsoft/codebert-base")

# Your query engine instance
from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine

index = get_index_from_pickle("path/to/graphIndex.pkl")
query_engine = get_query_engine(index)

# Evaluate the dataset
evaluation_results = evaluate(RAG_DATASET, query_engine)

# Review the results
print(evaluation_results)
```
| Metrics | RAG | Base RAG |
|------------------------------|------------|-----------|
| **Mean Correctness Score** | 3.340909 | 0.934 |
| **Mean Relevancy Score** | 0.750000 | 4.239 |
| **Mean Faithfulness Score** | 0.386364 | 0.977 |
| **Mean Context Similarity Score** | 0.948765 | 0.977 |



This example shows how to quickly evaluate your query engine's performance using the Llama-Index RAG evaluator pack.
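
Since `evaluate` returns the benchmark as a pandas DataFrame, the scores above can be persisted and compared across runs. A small sketch (the output filename is illustrative):

```python
# Save the benchmark so runs against different graph indexes can be compared.
evaluation_results.to_csv("llama_index_benchmark.csv")
```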


---
### 3. Evaluating Your Knowledge Graph with Ragas backend

You can easily evaluate the performance of your query engine using this module.

> #### Load and Evaluate Your Dataset with ragas
Use the `load_test_dataset` function to load your dataset and directly evaluate it using the `evaluate` function. This method handles all necessary steps, including batching the data.

```python
from evaluation.ragas_evaluation.evaluation_ragas import load_test_dataset, evaluate

# Step 1: Load the dataset from a pickle file
dataset_path = "/content/keras_docs_embedded.pkl"
test_dataset = load_test_dataset(dataset_path)
```

> **Note:** `test_dataset` is a list of Llama-Index `Document` objects.
```python
# Step 2: Define the language model and embedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

llm = Ollama(base_url="http://localhost:11434", model="codellama")
embedding = HuggingFaceEmbedding(model_name="microsoft/codebert-base")

# Step 3: Import and specify the metrics for evaluation
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

metrics = [faithfulness, answer_relevancy, context_precision, context_recall]

# Step 4: Load the query engine (Llama-Index)
from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine

index = get_index_from_pickle("path/to/graphIndex.pkl")
query_engine = get_query_engine(index)

# Step 5: Evaluate the dataset
evaluation_results = evaluate(
    query_engine=query_engine,
    dataset=test_dataset,
    llm=llm,
    embeddings=embedding,
    metrics=metrics,
    # Default batch size is 4
)
```

**Output:**
```python
{'faithfulness': 0.0333, 'answer_relevancy': 0.9834, 'context_precision': 0.2000, 'context_recall': 0.8048}
```

```python
rdf = evaluation_results.to_pandas()
rdf.to_csv("results.csv", index=False)
```
---
**Detailed Result:**

| question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall |
|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|--------------|------------------|-------------------|----------------|
| What is mixed precision in computing? | [Examples GPT-2 text generation Parameter…] | Mixed precision is a technique used to improve… | A combination of different numerical precision… | 0.166667 | 0.981859 | 0.0 | 0.666667 |
| What is the title of the guide discussed in th... | [Available guides… Hyperparameter T…] | The title of the guide discussed in the given… | How to distribute training | 0.000000 | 1.000000 | 0.0 | 1.000000 |
| What is Keras 3? | [No relationships found.] | Keras 3 is a new version of the popular deep l… | A deep learning framework that works with Tensor… | 0.000000 | 0.974711 | 0.0 | 0.500000 |
| What was the percentage boost in StableDiffusion... | [A first example: A MNIST convnet…] | The percentage boost in StableDiffusion traini… | Over 150% | 0.000000 | 0.970565 | 1.0 | 1.000000 |
| What are some examples of pretrained models av... | [No relationships found.] | Some examples of pre-trained models available… | BERT, OPT, Whisper, T5, StableDiffusion, YOLOv8… | 0.000000 | 0.989769 | 0.0 | 0.857143 |
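
Because `rdf` is a regular pandas DataFrame with one row per question, low-scoring cases are easy to pull out for inspection. A minimal sketch using the column names shown above:

```python
# Inspect the weakest answers: questions with low faithfulness, worst first.
low_faithfulness = rdf[rdf["faithfulness"] < 0.5].sort_values("faithfulness")
print(low_faithfulness[["question", "faithfulness", "context_recall"]].head())
```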





49 changes: 49 additions & 0 deletions graph_rag/evaluation/evaluation_llama_index.py
@@ -0,0 +1,49 @@
"""
This script evaluates a RagDataset using a RagEvaluatorPack, which assesses query engines by benchmarking against
labeled data using LLMs and embeddings.
Functions:
- evaluate: Evaluates the query engine using a labeled RAG dataset and specified models for both the LLM and embeddings.
"""

import asyncio

from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


def evaluate(
    RAG_DATASET: str,
    query_engine: object,
    ollama_model: str = "llama3",
    embedd_model: str = "microsoft/codebert-base",
):
    """
    Evaluates a RAG dataset by using a query engine and benchmarks it using LLM and embedding models.
    Args:
        RAG_DATASET: Path to the JSON file containing the labeled RAG dataset.
        query_engine: The query engine to evaluate.
        ollama_model: The LLM model to use for evaluation (default: "llama3").
        embedd_model: The Hugging Face embedding model to use for evaluation (default: "microsoft/codebert-base").
    Returns:
        A DataFrame containing the benchmarking results, including LLM calls and evaluations.
    """
    RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
    rag_dataset = LabelledRagDataset.from_json(RAG_DATASET)
    rag_evaluator_pack = RagEvaluatorPack(
        rag_dataset=rag_dataset,
        query_engine=query_engine,
        judge_llm=Ollama(base_url="http://localhost:11434", model=ollama_model),
        embed_model=HuggingFaceEmbedding(model_name=embedd_model),
    )
    # RagEvaluatorPack.arun() is a coroutine; drive it to completion here so that
    # evaluate() can be called as a regular synchronous function.
    benchmark_df = asyncio.run(
        rag_evaluator_pack.arun(
            batch_size=5,  # batches the number of LLM calls to make
            sleep_time_in_seconds=1,  # seconds to sleep before making an API call
        )
    )
    return benchmark_df
130 changes: 130 additions & 0 deletions graph_rag/evaluation/ragas_evaluation/QA_graphrag_testdataset.py
@@ -0,0 +1,130 @@
"""
This script contains functions to generate question-answer pairs from input documents using a language model,
and critique them based on various criteria like groundedness, relevance, and standalone quality.
Functions:
- get_response: Sends a request to a language model API to generate responses based on a provided prompt.
- qa_generator: Generates a specified number of question-answer pairs from input documents.
- critique_qa: Critiques the generated QA pairs based on groundedness, relevance, and standalone quality.
"""

from prompts import *
import pandas as pd
import random
from tqdm.auto import tqdm
import requests


def get_response(
    prompt: str, url: str = "http://localhost:11434/api/generate", model: str = "llama3"
):
    """
    Sends a prompt to the Ollama API and retrieves the generated response.
    Args:
        prompt: The text input that the model will use to generate a response.
        url: The API endpoint for the model (default: "http://localhost:11434/api/generate").
        model: The model to be used for generation (default: "llama3").
    Returns:
        The generated response from the language model as a string.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    response = requests.post(url, json=payload)
    resp = response.json()
    return resp["response"]


def qa_generator(
    documents: object,
    N_GENERATIONS: int = 20,
):
    """
    Generates a specified number of question-answer pairs from the provided documents.
    Args:
        documents: A collection of document objects to generate QA pairs from.
        N_GENERATIONS: The number of question-answer pairs to generate (default: 20).
    Returns:
        A list of dictionaries, each containing the generated context, question, answer, and source document metadata.
    """
    print(f"Generating {N_GENERATIONS} QA couples...")

    outputs = []
    for sampled_context in tqdm(random.sample(documents, N_GENERATIONS)):
        # Generate QA couple
        output_QA_couple = get_response(
            QA_generation_prompt.format(context=sampled_context.text)
        )
        try:
            question = output_QA_couple.split("Factoid question: ")[-1].split(
                "Answer: "
            )[0]
            answer = output_QA_couple.split("Answer: ")[-1]
            assert len(answer) < 300, "Answer is too long"
            outputs.append(
                {
                    "context": sampled_context.text,
                    "question": question,
                    "answer": answer,
                    "source_doc": sampled_context.metadata,
                }
            )
        except Exception:
            # Skip generations that do not follow the expected format
            continue
    df = pd.DataFrame(outputs)
    df.to_csv("QA.csv")
    return outputs


def critique_qa(
    outputs: list,
):
    """
    Critiques the generated question-answer pairs based on groundedness, relevance, and standalone quality.
    Args:
        outputs: A list of dictionaries containing generated QA pairs to be critiqued.
    Returns:
        The critiqued QA pairs with additional fields for groundedness, relevance, and standalone quality scores and evaluations.
    """
    print("Generating critique for each QA couple...")
    for output in tqdm(outputs):
        evaluations = {
            "groundedness": get_response(
                question_groundedness_critique_prompt.format(
                    context=output["context"], question=output["question"]
                ),
            ),
            "relevance": get_response(
                question_relevance_critique_prompt.format(question=output["question"]),
            ),
            "standalone": get_response(
                question_standalone_critique_prompt.format(question=output["question"]),
            ),
        }
        try:
            for criterion, evaluation in evaluations.items():
                score, eval_text = (
                    int(evaluation.split("Total rating: ")[-1].strip()),
                    evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
                )
                output.update(
                    {
                        f"{criterion}_score": score,
                        f"{criterion}_eval": eval_text,
                    }
                )
        except Exception:
            # Skip critiques that do not follow the expected "Evaluation: ... Total rating: ..." format
            continue
    generated_questions = pd.DataFrame.from_dict(outputs)
    generated_questions = generated_questions.loc[
        (generated_questions["groundedness_score"] >= 4)
        & (generated_questions["relevance_score"] >= 4)
        & (generated_questions["standalone_score"] >= 4)
    ]
    generated_questions.to_csv("generated_questions.csv")
    return outputs