[MODULE] RAG Module #135

119 changes: 119 additions & 0 deletions 8_rag/1_naive_rag.md
# **Basic RAG (Retrieval-Augmented Generation)**

## **Overview**

Retrieval-Augmented Generation (RAG) is a powerful framework for combining **retrieval** (fetching relevant context) with **generation** (producing coherent and contextually accurate responses) to create intelligent, factual, and context-aware systems.

A RAG pipeline leverages **retrieval** to access external knowledge and **generation** to produce fluent, natural responses, making it an essential architecture for modern AI systems.

This guide focuses on implementing a basic RAG pipeline using **Haystack**, a relatively lightweight yet feature-rich LLM orchestration framework that simplifies the process of building and customizing such systems. It would be straightforward to implement the same pipeline with other frameworks such as LangChain or LlamaIndex, which provide the same core functionality required for RAG.

## **Core Concepts in RAG**

### **Indexing Pipeline**

The **Indexing Pipeline** prepares your knowledge base by preprocessing raw documents, splitting them into manageable chunks, and embedding them into vectors. These vectors are stored in an efficient database to support fast retrieval.

#### **Steps in the Indexing Pipeline**

1. **Document Collection**:
- Source raw documents from relevant repositories, such as Wikipedia, internal databases, or research articles.
- Examples: `.txt`, `.pdf`, `.docx`, JSON, or other formats.

2. **Document Cleaning**:
- Use tools like Haystack's `DocumentCleaner` to remove noise, boilerplate text, and irrelevant sections.
- Focus on retaining meaningful content.

3. **Document Splitting**:
- Split large documents into smaller, coherent chunks (e.g., paragraphs or sentences).
- Use Haystack's `DocumentSplitter` to define chunk size and overlap for better retrieval performance.

4. **Embedding Generation**:
- Convert text chunks into dense vector representations using pre-trained models (e.g., `SentenceTransformersDocumentEmbedder`).
- Embeddings capture semantic meaning, enabling similarity-based search.

5. **Document Indexing**:
- Store embeddings and metadata in a vector database or document store, such as `InMemoryDocumentStore` or `FAISS`.

**Indexing Workflow**

```plaintext
[Raw Documents]
|
v
[Document Cleaner] -- Removes noise
|
v
[Document Splitter] -- Splits text into chunks
|
v
[Document Embedder] -- Converts chunks into vector embeddings
|
v
[Document Store] -- Stores embeddings for fast retrieval
```
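
Below is a minimal sketch of this indexing workflow, assuming Haystack 2.x; the embedding model `sentence-transformers/all-MiniLM-L6-v2`, the chunking parameters, and the sample document are illustrative choices, not requirements of this module:

```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # swap for FAISS or another store in production

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
indexing.add_component("writer", DocumentWriter(document_store=document_store))

# Wire the components in the same order as the diagram above.
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "embedder")
indexing.connect("embedder", "writer")

docs = [Document(content="Paris is the capital of France. It is known for the Eiffel Tower.")]
indexing.run({"cleaner": {"documents": docs}})
print(document_store.count_documents())  # number of indexed chunks
```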


### **Retrieve + Generate Pipeline**

The **Retrieve + Generate Pipeline** processes user queries by retrieving relevant knowledge and generating context-aware responses using retrieved content.

#### **Steps in the Retrieve + Generate Pipeline**

1. **Query Embedding**:
- Convert the user query into a dense vector representation using a model like `SentenceTransformersTextEmbedder`.

2. **Document Retrieval**:
- Perform a similarity search in the document store to retrieve the top-k most relevant chunks based on the query embedding.

3. **Prompt Construction**:
- Combine the user query and retrieved documents into a structured prompt for the generative model.
- Ensure clarity and relevance by organizing context logically.

4. **Response Generation**:
- Use a text generation model (e.g., GPT-3, SmolLM2) to generate a coherent and factual response based on the constructed prompt.

**Retrieve + Generate Workflow**

```plaintext
[User Query]
|
v
[Query Embedder] -- Converts query into vector
|
v
[Document Retriever] -- Finds top-k relevant documents
|
v
[Prompt Builder] -- Combines query + retrieved documents into a prompt
|
v
[Text Generator] -- Produces contextually grounded response
```
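
A matching query-side sketch, again assuming Haystack 2.x and reusing the `document_store` from the indexing sketch above; the prompt template, the SmolLM2 model id, and `top_k=3` are illustrative assumptions:

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

prompt_template = """
Answer the question using only the context below.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

rag = Pipeline()
rag.add_component("query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=3))
rag.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag.add_component("generator", HuggingFaceLocalGenerator(model="HuggingFaceTB/SmolLM2-1.7B-Instruct"))

rag.connect("query_embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "generator.prompt")

question = "What is the capital of France?"
result = rag.run({"query_embedder": {"text": question}, "prompt_builder": {"question": question}})
print(result["generator"]["replies"][0])
```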


## **Evaluation**

To ensure the RAG system performs well, evaluate both retrieval and generation components using appropriate metrics:

- **BLEU** (Bilingual Evaluation Understudy) focuses on **precision** and evaluates how much of the generated text matches reference text n-grams.
- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) focuses on **recall** and evaluates how much of the reference text's n-grams are captured by the generated text, making it ideal for summarization and text generation tasks.
- **MRR** (Mean Reciprocal Rank) evaluates retrieval effectiveness in tasks like question answering by considering the rank of the first relevant result (a small worked example follows below).

Feedback may be used to iteratively improve embeddings, retrieval thresholds, or prompt formatting.
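
As a concrete illustration of MRR, here is a small framework-free helper; the function name and the example ranks are hypothetical:

```python
# Compute Mean Reciprocal Rank from the 1-based rank at which the first
# relevant document appears for each query; None means nothing relevant was retrieved.
def mean_reciprocal_rank(first_relevant_ranks):
    reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Three queries: first relevant hit at rank 1, rank 3, and not retrieved at all.
print(mean_reciprocal_rank([1, 3, None]))  # (1.0 + 1/3 + 0.0) / 3 ≈ 0.444
```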

## **Example: Basic RAG System**

1. **Setup Knowledge Base**:
- Collect documents and preprocess them using the **Indexing Pipeline**.

2. **Integrate Query Handling**:
- Implement the **Retrieve + Generate Pipeline** to handle user inputs.

3. **Evaluate and Adjust**:
- Evaluate the pipeline and monitor retrieval and generation quality. Incorporate feedback for adjustment.

⏩ Try the [Basic RAG Tutorial](./notebooks/naive_rag_haystack_example.ipynb) to implement a Naive RAG pipeline.

## **Resources**

54 changes: 54 additions & 0 deletions 8_rag/2_advanced_rag.md
# Advanced RAG Strategies

Advanced Retrieval-Augmented Generation (RAG) techniques address the challenges faced by naive RAG. These strategies enhance document retrieval and improve the quality of answers generated by large language models (LLMs). Each step of the pipeline can be optimized, but the retrieval stage is the primary focus.

The strategies are thus divided into pre-retrieval, retrieval, and post-retrieval stages to address the following challenges:

- **How to achieve accurate semantic representations of documents and queries?**
- **What methods align the semantic spaces of queries and documents (chunks)?**
- **How to align the retriever’s output with the preferences of the LLM?**


## Pre-Retrieval Strategies

Efficient data indexing is essential for improving the retrieval performance in a RAG system. Key pre-retrieval strategies include:

- **Improve Data Quality**: Remove irrelevant information, resolve ambiguity in entities and terms, confirm factual accuracy, maintain context, and update outdated information.
- **Optimize Index Structure**: Adjust chunk sizes to capture relevant context or incorporate graph structures to represent relationships between entities.
- **Add Metadata**: Enhance data filtering by adding relevant metadata such as dates, chapters, subsections, and purposes to document chunks.
- **Chunk Optimization**: Tune chunk size to avoid chunks that are too large or too small, improving the embedding process.

### Key Pre-Retrieval Techniques:

- **Sliding Window**: Chunking method with overlap between adjacent text blocks (see the chunking sketch after this list).
- **Auto-Merging Retrieval**: Starts with small text blocks and later provides larger, related text blocks for the LLM.
- **Abstract Embedding**: Focuses on Top-K retrieval based on document abstracts for a comprehensive document context.
- **Metadata Filtering**: Leverages document metadata for enhanced filtering.
- **Graph Indexing**: Converts entities and relationships into nodes and connections to improve relevance.
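
As referenced above, here is a minimal sliding-window chunking and metadata sketch using Haystack's `DocumentSplitter`; the chunk sizes, metadata fields, and filter values are illustrative assumptions:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Sliding window: 50-word chunks with a 10-word overlap between neighbours.
splitter = DocumentSplitter(split_by="word", split_length=50, split_overlap=10)

doc = Document(
    content="Retrieval-Augmented Generation combines retrieval with generation. " * 30,
    meta={"source": "annual_report_2023.pdf", "chapter": "Finance"},  # metadata is copied onto every chunk
)
chunks = splitter.run(documents=[doc])["documents"]
print(len(chunks), chunks[0].meta)

# At query time the same metadata supports filtering, e.g. passed to a retriever:
# filters={"field": "meta.chapter", "operator": "==", "value": "Finance"}
```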

## Retrieval Strategies

During the retrieval phase, the goal is to identify the document chunks most relevant to the query. This requires optimizing the embedding models used to represent both the query and the chunks.

- **Domain Knowledge Fine-Tuning**: Fine-tune embedding models using domain-specific datasets to capture the unique aspects of the RAG system.
- **Similarity Metrics**: Select an appropriate metric to measure the similarity between the query and chunk embeddings. Common metrics include:
- Cosine Similarity
- Euclidean Distance (L2)
- Dot Product
- L2 Squared Distance
- Manhattan Distance

Several vector databases support multiple similarity metrics, allowing further customization and optimization; the sketch below compares a few of them on toy embedding vectors.
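
```python
import numpy as np

# Two toy embedding vectors standing in for a query and a chunk.
query = np.array([0.2, 0.1, 0.7])
chunk = np.array([0.3, 0.0, 0.6])

dot_product = float(query @ chunk)
cosine_similarity = dot_product / (np.linalg.norm(query) * np.linalg.norm(chunk))
l2_distance = float(np.linalg.norm(query - chunk))       # Euclidean distance
l2_squared = l2_distance ** 2                             # L2 squared distance
manhattan = float(np.sum(np.abs(query - chunk)))          # Manhattan (L1) distance

print(f"dot={dot_product:.3f} cos={cosine_similarity:.3f} "
      f"L2={l2_distance:.3f} L2^2={l2_squared:.3f} L1={manhattan:.3f}")
```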

## Post-Retrieval Strategies

After retrieving context data (chunks) from a vector database, the next step is to process this information and pass it to the LLM. However, some retrieved chunks may be irrelevant, noisy, or repeated, impacting the LLM’s ability to generate accurate answers.

### Strategies to Address Post-Retrieval Issues:

- **Reranking**: Prioritize the most relevant chunks by reranking the retrieved results, as shown in the sketch below. This ensures the LLM is given only the top-K most pertinent context, reducing performance issues caused by excessive context. Reranking components are offered by libraries such as LlamaIndex, LangChain, and Haystack.
- **Prompt Compression**: Filter out irrelevant context and shorten the prompt before inputting it to the LLM. Techniques such as mutual information or perplexity estimation, along with summarization, help in reducing context length and noise.
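
As referenced in the reranking item above, here is a minimal reranking sketch using Haystack's cross-encoder ranker; the model id, query, and documents are illustrative assumptions:

```python
from haystack import Document
from haystack.components.rankers import TransformersSimilarityRanker

# Cross-encoder reranker that scores each (query, document) pair jointly.
ranker = TransformersSimilarityRanker(model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3)
ranker.warm_up()

retrieved = [
    Document(content="RAG combines retrieval with generation."),
    Document(content="The Eiffel Tower is in Paris."),
    Document(content="Reranking reorders retrieved chunks by relevance to the query."),
]
result = ranker.run(query="How does reranking improve RAG?", documents=retrieved)
for doc in result["documents"]:
    print(round(doc.score, 3), doc.content)
```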

⏩ Try the [Improved RAG Tutorial](./notebooks/improved_rag_haystack_example.ipynb) to implement improved RAG pipelines.

## **Resources**
79 changes: 79 additions & 0 deletions 8_rag/3_modular_agenic_rag.md
# Modular (Agentic) RAG

## Why Modular RAG is Needed

**Vanilla RAG** faces two main challenges:

1. **Single retrieval step**: If the retrieved documents are irrelevant, the generated answer will be poor.
2. **Semantic mismatch**: The user’s query might differ in form from the document’s content, making semantic similarity-based retrieval suboptimal.

**Modular RAG** addresses these limitations by introducing an agent that can:

- **Formulate the query** to optimize document retrieval.
- **Critique and re-retrieve** if necessary, improving retrieval accuracy and ensuring better answers.

## Key Components of Modular RAG
### 1. **Module Components**

In a modular RAG system, the terms module, tool, and agent are often used interchangeably, though they represent different levels of abstraction. These components are the different parts of the system that work together to enhance its functionality.

- **Search Module**: Expands retrieval by integrating data from various external sources like search engines, tabular data, and knowledge graphs, enhancing the relevance of context during retrieval.
- **Memory Module**: Stores past interactions (queries and answers) for ongoing context awareness, supporting dynamic tasks and conversations.
- **Custom Function Tool Module**: Executes advanced workflows, such as database queries or system commands, allowing the agent to interact with external systems.
- **Code Module (Agent)**: Specializes in coding tasks like analysis, generation, refactoring, and testing, enabling the agent to handle software development tasks.

### 2. **Other Components**

- **Fusion**: Performs parallel retrieval on original and expanded queries, intelligently reranking and merging results for optimal context.
- **Routing**: Directs the next action based on the query, such as summarization or searching specific databases, ensuring appropriate responses.
- **Orchestration Agent**: Coordinates the flow of information between modules, optimizing the efficiency and effectiveness of the overall RAG system.

## AI Agent

AI agents are modular systems where the output of LLMs controls the workflow, enabling interaction with external tools, programs, or systems. They provide the necessary "agency" for LLMs to autonomously navigate tasks and processes. The agent's role is to translate LLM outputs into executable actions, bridging the gap between the language model and the real world.

AI agents bring an additional layer of intelligent orchestration, improving how different modules work together dynamically rather than relying on static, predefined processes. Indeed, agency in AI agents exists on a spectrum, with the LLM's control over the workflow increasing at each level (a minimal sketch of the tool-caller level follows the table):

| Agency Level | Description | Example Pattern |
| --- | --- | --- |
| ☆☆☆ | LLM output has no impact on program flow | Simple Processor (`process_llm_output(llm_response)`) |
| ★☆☆ | LLM output triggers an if/else switch | Router (`if llm_decision(): path_a() else: path_b()`) |
| ★★☆ | LLM output determines function execution | Tool Caller (`run_function(llm_chosen_tool, llm_chosen_args)`) |
| ★★★ | LLM output controls iteration | Multi-step Agent (`while llm_should_continue(memory): execute_next_step()`) |
| ★★★ | One agent starts another agentic workflow | Multi-Agent (`if llm_trigger(): execute_agent()`) |
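
The sketch below makes the ★★☆ "Tool Caller" row concrete; `call_llm`, the tool registry, and the query are all hypothetical stand-ins rather than a real agent framework:

```python
# A stand-in for a real model call: in practice the LLM would read the prompt
# and return the name of the tool it wants to execute.
def call_llm(prompt: str) -> str:
    return "get_weather"

TOOLS = {
    "get_weather": lambda q: "Sunny, 22°C",
    "search_docs": lambda q: f"Top result for '{q}'",
}

def tool_caller(user_query: str) -> str:
    # ★★☆: the LLM output determines which function is executed.
    chosen_tool = call_llm(f"Pick a tool from {list(TOOLS)} for: {user_query}")
    return TOOLS[chosen_tool](user_query)

print(tool_caller("What is the weather in Paris?"))
```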


![Agent](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/open-source-llms-as-agents/ReAct.png)


**When to Use Agents**

Agents are useful when flexibility is needed in the workflow. If tasks are too complex for predefined steps or criteria, an agent can adapt and determine the necessary actions. For simple tasks with a predictable workflow, agents may be unnecessary.


### `smolagents` Library

The **smolagents** library provides a simple yet powerful framework for building AI agents. While you can manually write code for simple chaining or routing agents, more complex behaviors such as tool calling and multi-step agent workflows require predefined abstractions to work effectively. Here's why **smolagents** is helpful:

1. **Tool Calling**: When an agent needs to call a tool (e.g., fetching weather data), the output format from the LLM should be predefined, such as:
`Thought: I should call tool 'get_weather'. Action: get_weather(Paris).`
This ensures the LLM’s output can be parsed and executed by a system function.

2. **Multi-Step Agents**: If the agent’s output controls a loop (e.g., iterating over a series of tasks), a different prompt may be needed for each iteration based on memory. This requires integrating memory into the system.

Given these needs, **smolagents** provides essential building blocks that enable seamless orchestration:

- An LLM engine that powers the system
- A list of available tools the agent can use
- A parser that extracts tool calls from LLM output
- A memory system that stores relevant information
- A system prompt synced with the parser

Additionally, since agents are powered by LLMs, error logging and retry mechanisms are essential for ensuring robustness and reliability. **smolagents** handles these elements, making it easier to build complex workflows that are reliable, flexible, and adaptive.
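
Here is a minimal smolagents sketch, assuming the package is installed and a Hugging Face token is configured for `HfApiModel`; the model id, the web-search tool, and `max_steps=4` are illustrative choices (see the RAG example linked below for an agent over your own document store):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# LLM engine plus one tool; the agent parses tool calls from the model's output
# and iterates until it can return a final answer (or max_steps is reached).
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model, max_steps=4)

answer = agent.run("Summarize what Retrieval-Augmented Generation is in two sentences.")
print(answer)
```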


### Resources

- https://huggingface.co/docs/smolagents/index
- https://huggingface.co/docs/smolagents/examples/rag
- https://huggingface.co/docs/smolagents/conceptual_guides/intro_agents