diff --git a/README.md b/README.md
deleted file mode 100644
index c74406b..0000000
--- a/README.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Databricks Mosaic Generative AI Cookbook
-
-To start working on this book:
-- clone the repo; `cd cookbook`
-- use your preferred approach to create a new Python environment
-- in that environment, `pip install jupyter-book`
-- build and preview the site with `jupyter-book build --all genai_cookbook`
-
-The homepage is at `genai_cookbook/index.md`
-
-The content pages are in `genai_cookbook/nbs/`
-
-Jupyter book is fairly flexible and offers a lot of different options for formatting, cross-referencing, adding formatted callouts, etc. Read more at the [Jupyter Book docs](https://jupyterbook.org/en/stable/intro.html).
\ No newline at end of file
diff --git a/build_requirements.txt b/build_requirements.txt
deleted file mode 100644
index 47a56ab..0000000
--- a/build_requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-jupyter-book
\ No newline at end of file
diff --git a/genai_cookbook/.DS_Store b/genai_cookbook/.DS_Store
deleted file mode 100644
index 83a7442..0000000
Binary files a/genai_cookbook/.DS_Store and /dev/null differ
diff --git a/genai_cookbook/_config.yml b/genai_cookbook/_config.yml
deleted file mode 100644
index 19e2620..0000000
--- a/genai_cookbook/_config.yml
+++ /dev/null
@@ -1,29 +0,0 @@
-# Book settings
-# Learn more at https://jupyterbook.org/customize/config.html
-
-title: Databricks Generative AI Cookbook
-author: The Databricks GenAI Community
-logo: logo2.png
-
-# Force re-execution of notebooks on each build.
-# See https://jupyterbook.org/content/execute.html
-execute:
- execute_notebooks: 'off'
-
-# Information about where the book exists on the web
-repository:
- url: https://github.com/databricks-genai-cookbook/cookbook/
- path_to_book: ./genai_cookbook # Optional path to your book, relative to the repository root
- branch: main # Which branch of the repository should be used when creating links (optional)
-
-# Add GitHub buttons to your book
-# See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
-html:
- favicon: images/index/favicon.ico
- use_issues_button: true
- use_repository_button: true
- home_page_in_navbar: false
- google_analytics_id: G-6BZ4NTBHVJ
-sphinx:
- config:
- html_show_copyright: false
\ No newline at end of file
diff --git a/genai_cookbook/_toc.yml b/genai_cookbook/_toc.yml
deleted file mode 100644
index fb6d640..0000000
--- a/genai_cookbook/_toc.yml
+++ /dev/null
@@ -1,9 +0,0 @@
-# Table of contents
-# Learn more at https://jupyterbook.org/customize/toc.html
-
-format: jb-book
-root: index
-parts:
-- caption: All Notebooks
- chapters:
- - glob: nbs/*
diff --git a/genai_cookbook/images/1-introduction-to-rag/1_img.png b/genai_cookbook/images/1-introduction-to-rag/1_img.png
deleted file mode 100644
index 5d23f2e..0000000
Binary files a/genai_cookbook/images/1-introduction-to-rag/1_img.png and /dev/null differ
diff --git a/genai_cookbook/images/2-fundamentals-unstructured/1_img.png b/genai_cookbook/images/2-fundamentals-unstructured/1_img.png
deleted file mode 100644
index c1f3575..0000000
Binary files a/genai_cookbook/images/2-fundamentals-unstructured/1_img.png and /dev/null differ
diff --git a/genai_cookbook/images/2-fundamentals-unstructured/2_img.png b/genai_cookbook/images/2-fundamentals-unstructured/2_img.png
deleted file mode 100644
index cd0abf1..0000000
Binary files a/genai_cookbook/images/2-fundamentals-unstructured/2_img.png and /dev/null differ
diff --git a/genai_cookbook/images/2-fundamentals-unstructured/3_img.png b/genai_cookbook/images/2-fundamentals-unstructured/3_img.png
deleted file mode 100644
index b334e79..0000000
Binary files a/genai_cookbook/images/2-fundamentals-unstructured/3_img.png and /dev/null differ
diff --git a/genai_cookbook/images/3-deep-dive/1_img.png b/genai_cookbook/images/3-deep-dive/1_img.png
deleted file mode 100644
index 342b2bf..0000000
Binary files a/genai_cookbook/images/3-deep-dive/1_img.png and /dev/null differ
diff --git a/genai_cookbook/images/3-deep-dive/2_img.png b/genai_cookbook/images/3-deep-dive/2_img.png
deleted file mode 100644
index c46a903..0000000
Binary files a/genai_cookbook/images/3-deep-dive/2_img.png and /dev/null differ
diff --git a/genai_cookbook/images/3-deep-dive/3_img.png b/genai_cookbook/images/3-deep-dive/3_img.png
deleted file mode 100644
index 10ebb56..0000000
Binary files a/genai_cookbook/images/3-deep-dive/3_img.png and /dev/null differ
diff --git a/genai_cookbook/images/3-deep-dive/4_img.png b/genai_cookbook/images/3-deep-dive/4_img.png
deleted file mode 100644
index 6c32cb8..0000000
Binary files a/genai_cookbook/images/3-deep-dive/4_img.png and /dev/null differ
diff --git a/genai_cookbook/images/3-deep-dive/5_img.png b/genai_cookbook/images/3-deep-dive/5_img.png
deleted file mode 100644
index 026428e..0000000
Binary files a/genai_cookbook/images/3-deep-dive/5_img.png and /dev/null differ
diff --git a/genai_cookbook/images/4-evaluation/1_img.png b/genai_cookbook/images/4-evaluation/1_img.png
deleted file mode 100644
index c36b85f..0000000
Binary files a/genai_cookbook/images/4-evaluation/1_img.png and /dev/null differ
diff --git a/genai_cookbook/images/4-evaluation/2_img.png b/genai_cookbook/images/4-evaluation/2_img.png
deleted file mode 100644
index 4255600..0000000
Binary files a/genai_cookbook/images/4-evaluation/2_img.png and /dev/null differ
diff --git a/genai_cookbook/images/4-evaluation/3_img.png b/genai_cookbook/images/4-evaluation/3_img.png
deleted file mode 100644
index 57f3f6c..0000000
Binary files a/genai_cookbook/images/4-evaluation/3_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/10_img.png b/genai_cookbook/images/5-hands-on/10_img.png
deleted file mode 100644
index a77a043..0000000
Binary files a/genai_cookbook/images/5-hands-on/10_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/11_img.png b/genai_cookbook/images/5-hands-on/11_img.png
deleted file mode 100644
index 29959ca..0000000
Binary files a/genai_cookbook/images/5-hands-on/11_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/12_img.png b/genai_cookbook/images/5-hands-on/12_img.png
deleted file mode 100644
index c8cf89f..0000000
Binary files a/genai_cookbook/images/5-hands-on/12_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/13_img.png b/genai_cookbook/images/5-hands-on/13_img.png
deleted file mode 100644
index f62ebe9..0000000
Binary files a/genai_cookbook/images/5-hands-on/13_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/14_img.png b/genai_cookbook/images/5-hands-on/14_img.png
deleted file mode 100644
index 6b2b8ef..0000000
Binary files a/genai_cookbook/images/5-hands-on/14_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/15_img.png b/genai_cookbook/images/5-hands-on/15_img.png
deleted file mode 100644
index 4e89043..0000000
Binary files a/genai_cookbook/images/5-hands-on/15_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/16_img.png b/genai_cookbook/images/5-hands-on/16_img.png
deleted file mode 100644
index 0bcc5fd..0000000
Binary files a/genai_cookbook/images/5-hands-on/16_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/17_img.png b/genai_cookbook/images/5-hands-on/17_img.png
deleted file mode 100644
index 7a383be..0000000
Binary files a/genai_cookbook/images/5-hands-on/17_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/1_img.png b/genai_cookbook/images/5-hands-on/1_img.png
deleted file mode 100644
index 1a64a02..0000000
Binary files a/genai_cookbook/images/5-hands-on/1_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/2_img.png b/genai_cookbook/images/5-hands-on/2_img.png
deleted file mode 100644
index 2d8471b..0000000
Binary files a/genai_cookbook/images/5-hands-on/2_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/3_img.png b/genai_cookbook/images/5-hands-on/3_img.png
deleted file mode 100644
index 9b00173..0000000
Binary files a/genai_cookbook/images/5-hands-on/3_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/4_img.png b/genai_cookbook/images/5-hands-on/4_img.png
deleted file mode 100644
index bbdab36..0000000
Binary files a/genai_cookbook/images/5-hands-on/4_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/5_img.png b/genai_cookbook/images/5-hands-on/5_img.png
deleted file mode 100644
index 9cdb973..0000000
Binary files a/genai_cookbook/images/5-hands-on/5_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/6_img.png b/genai_cookbook/images/5-hands-on/6_img.png
deleted file mode 100644
index fbf8c54..0000000
Binary files a/genai_cookbook/images/5-hands-on/6_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/7_img.png b/genai_cookbook/images/5-hands-on/7_img.png
deleted file mode 100644
index 7ec6703..0000000
Binary files a/genai_cookbook/images/5-hands-on/7_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/8_img.png b/genai_cookbook/images/5-hands-on/8_img.png
deleted file mode 100644
index 709b72c..0000000
Binary files a/genai_cookbook/images/5-hands-on/8_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/9_img.png b/genai_cookbook/images/5-hands-on/9_img.png
deleted file mode 100644
index cdaab55..0000000
Binary files a/genai_cookbook/images/5-hands-on/9_img.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/chain_code.png b/genai_cookbook/images/5-hands-on/chain_code.png
deleted file mode 100644
index 3e1286b..0000000
Binary files a/genai_cookbook/images/5-hands-on/chain_code.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/chain_config.png b/genai_cookbook/images/5-hands-on/chain_config.png
deleted file mode 100644
index d569742..0000000
Binary files a/genai_cookbook/images/5-hands-on/chain_config.png and /dev/null differ
diff --git a/genai_cookbook/images/5-hands-on/data_pipeline.png b/genai_cookbook/images/5-hands-on/data_pipeline.png
deleted file mode 100644
index fae3b66..0000000
Binary files a/genai_cookbook/images/5-hands-on/data_pipeline.png and /dev/null differ
diff --git a/genai_cookbook/images/index/favicon.ico b/genai_cookbook/images/index/favicon.ico
deleted file mode 100644
index 393f550..0000000
Binary files a/genai_cookbook/images/index/favicon.ico and /dev/null differ
diff --git a/genai_cookbook/images/index/import_notebook.png b/genai_cookbook/images/index/import_notebook.png
deleted file mode 100644
index 50d1579..0000000
Binary files a/genai_cookbook/images/index/import_notebook.png and /dev/null differ
diff --git a/genai_cookbook/index.md b/genai_cookbook/index.md
deleted file mode 100644
index 7cc836f..0000000
--- a/genai_cookbook/index.md
+++ /dev/null
@@ -1,74 +0,0 @@
----
-title: Databricks Generative AI Cookbook
----
-
-# Databricks Mosaic Generative AI Cookbook
-
-## RAG Guide
-Follow this guide to learn how RAG works, the required components of a production-ready, high-quality RAG application, and Databricks’ recommended developer workflow for delivering such an application over unstructured data.
-
-This guide assumes you have selected a use case that requires RAG over unstructured documents.
-
-This guide is broken into 5 main sections:
-
-1. Introduction to retrieval-augmented generation (RAG)
- - Overview of the RAG technique
- - Key benefits of using RAG
- - Types of RAG
-2. Fundamentals of RAG over Unstructured Documents
- - Understanding retrieval
- - Components of a RAG application
-3. Deep dive into RAG over Unstructured Documents
- - Deep dive into the components of a RAG application
-4. How to evaluate RAG applications
- - Metrics to measure quality / cost / latency
- - Developing an evaluation set to measure quality
- - Infrastructure required to evaluate RAG apps
-5. Practical hands-on guide to implementing high-quality RAG
- - Databricks recommended developer workflow for building a RAG application
- - How to iteratively improve the application's quality
- - Throughout, this section is supported by ready-to-use code examples
-
-If you are less familiar with RAG, we suggest starting with section 1. If you are already familiar with RAG or simply want to get started quickly, start with section 5. Section 5 refers back to the previous sections as needed to explain concepts.
-
-## Featured Notebooks
-
-::::{grid} 3
-:class-container: text-center
-
-:::{grid-item-card}
-:link: /nbs/1-introduction-to-rag
-:link-type: doc
-:class-header: bg-light
-
-Section 1: Introduction to retrieval-augmented generation (RAG)
-^^^
-Learn the basic concepts of retrieval-augmented generation (RAG) in this introductory notebook.
-:::
-
-:::{grid-item-card}
-:link: /nbs/2-fundamentals-unstructured
-:link-type: doc
-:class-header: bg-light
-
-Section 2: Fundamentals of RAG over Unstructured Documents
-^^^
-Introduction to the key components and principles of developing RAG applications over unstructured data.
-:::
-
-:::{grid-item-card}
-:link: /nbs/3-deep-dive
-:link-type: doc
-:class-header: bg-light
-
-Section 3: Deep dive into RAG over unstructured documents
-^^^
-A more detailed guide to refining each component of a RAG application over unstructured data.
-:::
-::::
-
-```{tableofcontents}
-```
-
-
-
diff --git a/genai_cookbook/logo2.png b/genai_cookbook/logo2.png
deleted file mode 100644
index 0c7645e..0000000
Binary files a/genai_cookbook/logo2.png and /dev/null differ
diff --git a/genai_cookbook/nbs/1-introduction-to-rag.md b/genai_cookbook/nbs/1-introduction-to-rag.md
deleted file mode 100644
index deff071..0000000
--- a/genai_cookbook/nbs/1-introduction-to-rag.md
+++ /dev/null
@@ -1,56 +0,0 @@
-# Section 1: Introduction to retrieval-augmented generation (RAG)
-
-This section provides an overview of Retrieval-augmented generation (RAG): what it is, how it works, and key concepts.
-
-## What is retrieval-augmented generation?
-
-Retrieval-augmented generation (RAG) is a technique that enables a large language model (LLM) to generate enriched responses by augmenting a user’s prompt with supporting data retrieved from an outside information source. By incorporating this retrieved information, RAG enables the LLM to generate more accurate, higher quality responses compared to using the prompt alone.
-
-For example, suppose you are building a question-and-answer chatbot to help employees answer questions about your company’s proprietary documents. A standalone LLM won’t be able to accurately answer questions about the content of these documents if it was not specifically trained on them. The LLM might refuse to answer due to a lack of information or, even worse, it might generate an incorrect response.
-
-RAG addresses this issue by first retrieving relevant information from the company documents based on a user’s query, and then providing the retrieved information to the LLM as additional context. This allows the LLM to generate a more accurate response by drawing from the specific details found in the relevant documents. In essence, RAG enables the LLM to “consult” the retrieved information to formulate its answer.
-
-## Core components of a RAG application
-
-A RAG application is an example of a [compound AI system](https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/): it expands on the language capabilities of the model alone by combining it with other tools and procedures.
-
-When using a stand-alone LLM, a user submits a request, such as a question, to the LLM, and the LLM responds with an answer based solely on its training data.
-
-In its most basic form, the following steps happen in a RAG application:
-
-1. **Retrieval:** The **user's request** is used to query some outside source of information. This might mean querying a vector store, running a keyword search over some text, or querying a SQL database. The goal of the retrieval step is to obtain **supporting data** that will help the LLM provide a useful response.
-
-2. **Augmentation:** The **supporting data** from the retrieval step is combined with the **user's request**, often using a template with additional formatting and instructions to the LLM, to create a **prompt**.
-
-3. **Generation:** The resulting **prompt** is passed to the LLM, and the LLM generates a response to the **user's request**.
-
-```{image} ../images/1-introduction-to-rag/1_img.png
-:alt: RAG process
-:align: center
-```
-
-
-
-This is a simplified overview of the RAG process, but it's important to note that implementing a RAG application involves a number of complex tasks. Preprocessing source data to make it suitable for use in RAG, effectively retrieving data, formatting the augmented prompt, and evaluating the generated responses all require careful consideration and effort. These topics will be covered in greater detail in later sections of this guide.
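-
-To make the three steps above concrete, here is a minimal Python sketch of a RAG chain. The `vector_store` and `llm` objects (and their `similarity_search` and `invoke` methods) are hypothetical placeholders for whichever retriever and LLM client you use, not a specific library's API.
-
-```python
-PROMPT_TEMPLATE = """Answer the question using only the context below.
-
-Context:
-{context}
-
-Question: {question}
-"""
-
-def answer(question: str, vector_store, llm) -> str:
-    # 1. Retrieval: fetch supporting data relevant to the user's request.
-    chunks = vector_store.similarity_search(question, k=3)
-
-    # 2. Augmentation: combine the retrieved chunks with the request in a template.
-    context = "\n\n".join(chunk.text for chunk in chunks)
-    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
-
-    # 3. Generation: pass the augmented prompt to the LLM.
-    return llm.invoke(prompt)
-```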
-
-## Why use RAG?
-
-The following table outlines the benefits of using RAG versus a stand-alone LLM:
-
-| With an LLM alone | Using LLMs with RAG |
-|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **No proprietary knowledge:** LLMs are generally trained on publicly available data, so they cannot accurately answer questions about a company's internal or proprietary data. | **RAG applications can incorporate proprietary data:** A RAG application can supply proprietary documents such as memos, emails, and design documents to an LLM, enabling it to answer questions about those documents. |
-| **Knowledge isn't updated in real time:** LLMs do not have access to information about events that occurred after they were trained. For example, a standalone LLM cannot tell you anything about stock movements today. | **RAG applications can access real-time data:** A RAG application can supply the LLM with timely information from an updated data source, allowing it to provide useful answers about events past its training cutoff date. |
-| **Lack of citations:** LLMs cannot cite specific sources of information when responding, leaving the user unable to verify whether the response is factually correct or a [hallucination](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)). | **RAG can cite sources:** When used as part of a RAG application, an LLM can be asked to cite its sources. |
-| **Lack of data access controls (ACLs):** LLMs alone can't reliably provide different answers to different users based on specific user permissions. | **RAG allows for data security / ACLs:** The retrieval step can be designed to find only the information that the user has permission to access, enabling a RAG application to selectively retrieve personal or proprietary information based on the credentials of the individual user. |
-
-## Types of RAG
-
-The RAG architecture can work with 2 types of **supporting data**:
-
-| | Structured data | Unstructured data |
-|---|---|---|
-| **Definition** | Tabular data arranged in rows & columns with a specific schema e.g., tables in a database. | Data without a specific structure or organization, e.g., documents that include text and images or multimedia content such as audio or videos. |
-| **Example data sources** | - Customer records in a BI or Data Warehouse system<br>- Transaction data from a SQL database<br>- Data from application APIs (e.g., SAP, Salesforce, etc.) | - PDFs<br>- Google/Office documents<br>- Wikis<br>- Images<br>- Videos |
-
-Which data you use with RAG depends on your use case. The remainder of this guide focuses on RAG for unstructured data.
diff --git a/genai_cookbook/nbs/2-fundamentals-unstructured.md b/genai_cookbook/nbs/2-fundamentals-unstructured.md
deleted file mode 100644
index 0d63a15..0000000
--- a/genai_cookbook/nbs/2-fundamentals-unstructured.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Section 2: Fundamentals of RAG over unstructured documents
-
-In [section 1](1-introduction-to-rag) of this guide, we introduced RAG, explained its functionality at a high level, and highlighted its advantages over standalone LLMs.
-
-This section will introduce the key components and principles behind developing RAG applications over unstructured data. In particular, we will discuss:
-
-1. **[Data pipeline](#data-pipeline):** Transforming unstructured documents, such as collections of PDFs, into a format suitable for retrieval using the RAG application's **data pipeline**.
-2. [**Retrieval, Augmentation, and Generation (RAG chain)**](#retrieval-augmentation-and-generation-rag-chain): A series (or **chain**) of steps is called to:
- 1. Understand the user's question
- 2. Retrieve the supporting data
- 3. Call an LLM to generate a response based on the user's question and supporting data
-3. [**Evaluation**](#evaluation-monitoring): Assessing the RAG application to determine its quality/cost/latency to ensure it meets your business requirements
-
-```{image} ../images/2-fundamentals-unstructured/1_img.png
-:alt: Major components of RAG over unstructured data
-:align: center
-```
-
-## Data pipeline
-
-Throughout this guide we will focus on preparing unstructured data for use in RAG applications. *Unstructured* data refers to data without a specific structure or organization, such as PDF documents that might include text and images, or multimedia content such as audio or videos.
-
-Unstructured data lacks a predefined data model or schema, making it impossible to query on the basis of structure and metadata alone. As a result, unstructured data requires techniques that can understand and extract semantic meaning from raw text, images, audio, or other content.
-
-During data preparation, the RAG application's data pipeline takes raw unstructured data and transforms it into discrete chunks that can be queried based on their relevance to a user's query. The key steps in data preprocessing are outlined below. Each step has a variety of knobs that can be tuned - for a deeper discussion of these knobs, please refer to the [deep dive into RAG section](/nbs/3-deep-dive).
-
-In the remainder of this section, we describe the process of preparing unstructured data for retrieval using *semantic search*. Semantic search understands the contextual meaning and intent of a user query to provide more relevant search results.
-
-Semantic search is one of several approaches that can be taken when implementing the retrieval component of a RAG application over unstructured data. We cover alternate retrieval strategies in the [retrieval deep dive section](/nbs/3-deep-dive).
-
-```{image} ../images/2-fundamentals-unstructured/2_img.png
-:align: center
-```
-
-The following are the typical steps of a data pipeline in a RAG application using unstructured data:
-
-1. **Parse the raw documents:** The initial step involves transforming raw data into a usable format. This can include extracting text, tables, and images from a collection of PDFs or employing optical character recognition (OCR) techniques to extract text from images.
-
-2. **Extract document metadata (optional):** In some cases, extracting and using document metadata, such as document titles, page numbers, URLs, or other information can help the retrieval step more precisely query the correct data.
-
-3. **Chunk documents:** To ensure the parsed documents can fit into the embedding model and the LLM's context window, we break the parsed documents into smaller, discrete chunks. Retrieving these focused chunks, rather than entire documents, gives the LLM more targeted content from which to generate its responses.
-
-4. **Embed chunks:** In a RAG application that uses semantic search, a special type of language model called an *embedding model* transforms each of the chunks from the previous step into numeric vectors, or lists of numbers, that encapsulate the meaning of each piece of content. Crucially, these vectors represent the semantic meaning of the text, not just surface-level keywords. This will later enable searching based on meaning rather than literal text matches.
-
-5. **Index chunks in a vector database:** The final step is to load the vector representations of the chunks, along with the chunk's text, into a *vector database*. A vector database is a specialized type of database designed to efficiently store and search for vector data like embeddings. To maintain performance with a large number of chunks, vector databases commonly include a vector index that uses various algorithms to organize and map the vector embeddings in a way that optimizes search efficiency. At query time, a user's request is embedded into a vector, and the database leverages the vector index to find the most similar chunk vectors, returning the corresponding original text chunks.
-
-The process of computing similarity can be expensive at scale. Vector indexes, such as [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html), speed this process up by providing a mechanism for efficiently organizing and navigating embeddings, often via sophisticated approximation methods. This enables rapid ranking of the most relevant results without comparing each embedding to the user's query individually.
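-
-As a rough sketch of how steps 1, 3, 4, and 5 fit together (metadata extraction omitted), the pipeline below parses, chunks, embeds, and indexes a set of documents. The `parse`, `chunk`, and `embed` callables and the `vector_index.upsert` call are hypothetical placeholders for the parsing logic, chunking strategy, embedding model, and vector database client you choose.
-
-```python
-def run_data_pipeline(raw_documents: list[bytes], parse, chunk, embed, vector_index) -> None:
-    """Offline data preparation: parse -> chunk -> embed -> index.
-
-    All four arguments after `raw_documents` are hypothetical placeholders;
-    substitute the parser, chunking strategy, embedding model, and vector
-    database client used in your pipeline.
-    """
-    for doc_id, raw in enumerate(raw_documents):
-        text = parse(raw)        # 1. parse the raw document into plain text
-        chunks = chunk(text)     # 3. split the text into discrete chunks
-        vectors = embed(chunks)  # 4. embed each chunk with the embedding model
-        for i, (chunk_text, vector) in enumerate(zip(chunks, vectors)):
-            # 5. store each vector alongside the original chunk text
-            vector_index.upsert(id=f"{doc_id}-{i}", vector=vector, text=chunk_text)
-```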
-
-Each step in the data pipeline involves engineering decisions that impact the RAG application's quality. For example, choosing the right chunk size in step (3) ensures the LLM receives specific yet contextualized information, while selecting an appropriate embedding model in step (4) determines the accuracy of the chunks returned during retrieval.
-
-This data preparation process is referred to as *offline* data preparation, as it occurs before the system answers queries, unlike the *online* steps triggered when a user submits a query.
-
-## Retrieval, augmentation, and generation (RAG Chain)
-
-Once the data has been processed by the data pipeline, it is suitable for use in the RAG application. This section describes the process that occurs once the user submits a request to the RAG application in an online setting. The series, or *chain* of steps that are invoked at inference time is commonly referred to as the RAG chain.
-
-```{image} ../images/2-fundamentals-unstructured/3_img.png
-:align: center
-```
-
-1. **(Optional) User query preprocessing:** In some cases, the user's query is preprocessed to make it more suitable for querying the vector database. This can involve formatting the query within a template, using another model to rewrite the request, or extracting keywords to aid retrieval. The output of this step is a *retrieval query* which will be used in the subsequent retrieval step.
-
-2. **Retrieval:** To retrieve supporting information from the vector database, the retrieval query is translated into an embedding using *the same embedding model* that was used to embed the document chunks during data preparation. These embeddings enable comparison of the semantic similarity between the retrieval query and the unstructured text chunks, using measures like cosine similarity. Next, chunks are retrieved from the vector database and ranked based on how similar they are to the embedded request. The top (most similar) results are returned.
-
-3. **Prompt augmentation:** The prompt that will be sent to the LLM is formed by augmenting the user's query with the retrieved context, in a template that instructs the model how to use each component, often with additional instructions to control the response format. The process of iterating on the right prompt template to use is referred to as [prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering).
-
-4. **LLM Generation**: The LLM takes the augmented prompt, which includes the user's query and retrieved supporting data, as input. It then generates a response that is grounded on the additional context.
-
-5. **(Optional) Post-processing:** The LLM's response may be processed further to apply additional business logic, add citations, or otherwise refine the generated text based on predefined rules or constraints.
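-
-To illustrate the retrieval step above, here is a simplified sketch that embeds the retrieval query and ranks chunks by cosine similarity. In practice the vector database performs this search with an approximate index rather than the exhaustive NumPy comparison shown here; `embed` is a hypothetical stand-in for the same embedding model used in the data pipeline.
-
-```python
-import numpy as np
-
-def retrieve(query: str, chunk_texts: list[str], chunk_vectors: np.ndarray, embed, k: int = 3):
-    """Rank chunks by cosine similarity between the embedded query and chunk embeddings."""
-    q = embed(query)  # embed the retrieval query with the *same* model used for the chunks
-    q = q / np.linalg.norm(q)
-    chunk_norm = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
-    scores = chunk_norm @ q  # cosine similarity = dot product of L2-normalized vectors
-    top = np.argsort(scores)[::-1][:k]
-    return [(chunk_texts[i], float(scores[i])) for i in top]
-```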
-
-As with the RAG application data pipeline, there are numerous consequential engineering decisions that can affect the quality of the RAG chain. For example, determining how many chunks to retrieve in (2) and how to combine them with the user's query in (3) can both significantly impact the model's ability to generate quality responses.
-
-Throughout the chain, various guardrails may be applied to ensure compliance with enterprise policies. This might involve filtering for appropriate requests, checking user permissions before accessing data sources, and applying content moderation techniques to the generated responses.
-
-## Evaluation & monitoring
-
-Evaluation and monitoring are critical components to understand if your RAG application is performing to the quality, cost, and latency requirements dictated by your use case. Evaluation happens during development and monitoring happens once the application is deployed to production.
-
-RAG over unstructured data is a complex system with many components that impact the application's quality. Adjusting any single element can have cascading effects on the others. For instance, data formatting changes can influence the retrieved chunks and the LLM's ability to generate relevant responses. Therefore, it's crucial to evaluate each of the application's components in addition to the application as a whole in order to iteratively refine it based on those assessments.
-
-Evaluating and monitoring quality, cost, and latency requires several components:
-
-- **Defining quality with metrics**: You can't manage what you don't measure. In order to improve RAG quality, it is essential to define what quality means for your use case. Depending on the application, important metrics might include response accuracy, latency, cost, or ratings from key stakeholders.
-
-- **Building an effective evaluation set:** To rigorously evaluate your RAG application, you need a curated set of evaluation queries (and ideally outputs) that are representative of the application's intended use. These evaluation examples should be challenging, diverse, and updated to reflect changing usage and requirements.
-
-- **Monitoring application usage:** Instrumentation that tracks inputs, outputs, and intermediate steps such as document retrieval enables ongoing monitoring and early detection and diagnosis of issues that arise in development and production.
-
-- **Collecting stakeholder feedback:** As a developer, you may not be a domain expert in the content of the application you are developing. In order to collect feedback from human experts who can assess your application's quality, you need an interface that allows them to interact with the application and provide detailed feedback.
-
-We will cover evaluation in much more detail in [Section 4: Evaluation](/nbs/4-evaluation).
-
-The [next section](/nbs/3-deep-dive) of this guide will unpack the finer details of the typical components that make up the data pipeline and RAG chain of a RAG application using unstructured data.
diff --git a/genai_cookbook/nbs/3-deep-dive.md b/genai_cookbook/nbs/3-deep-dive.md
deleted file mode 100644
index c937042..0000000
--- a/genai_cookbook/nbs/3-deep-dive.md
+++ /dev/null
@@ -1,290 +0,0 @@
-# Section 3: Deep dive into RAG over unstructured documents
-
-In the previous section, we introduced the key components of a RAG application and discussed the fundamental principles behind developing RAG applications over unstructured data. This section discusses how you can think about refining each component in order to increase the quality of your application.
-
-Above we alluded to the myriad of "knobs" to tune at every point in both the offline data pipeline and the online RAG chain. While there are numerous options to consider, we will focus on the table-stakes considerations that should be prioritized when improving the quality of your RAG application. It's important to note that this is just scratching the surface; there are many more advanced techniques that can be explored.
-
-In the following sections of this guide, we will discuss how to measure changes with [evals](/nbs/4-evaluation), and finish by outlining how to diagnose root causes and possible fixes in the final [hands-on section](/nbs/5-hands-on).
-
-## Data pipeline
-
-```{image} ../images/3-deep-dive/1_img.png
-:align: center
-```
-
-The foundation of any RAG application with unstructured data is the data pipeline. This pipeline is responsible for preparing the unstructured data in a format that can be effectively utilized by the RAG application. While this data pipeline can become arbitrarily complex, the following are the key components you need to think about when first building your RAG application:
-
-1. **Corpus composition:** Selecting the right data sources and content based on the specific use case
-
-2. **Parsing:** Extracting relevant information from the raw data using appropriate parsing techniques
-
-3. **Chunking:** Breaking down the parsed data into smaller, manageable chunks for efficient retrieval
-
-4. **Embedding:** Converting the chunked text data into a numerical vector representation that captures its semantic meaning
-
-We discuss how to experiment with all of these data pipeline choices from a practical standpoint in [implementing data pipeline changes](/nbs/5-hands-on.md#data-pipeline-changes).
-
-### Corpus composition
-
-To state the obvious, without the right data corpus, your RAG application won't be able to retrieve the information required to answer a user query. The right data will be entirely dependent on the specific requirements and goals of your application, making it crucial to dedicate time to understand the nuances of the data available (see the [requirements gathering section](/nbs/5-hands-on.md#requirements-questions) for guidance on this).
-
-For example, when building a customer support bot, you might consider including:
-
-- Knowledge base documents
-- Frequently asked questions (FAQs)
-- Product manuals and specifications
-- Troubleshooting guides
-
-Engage domain experts and stakeholders from the outset of any project to help identify and curate relevant content that could improve the quality and coverage of your data corpus. They can provide insights into the types of queries that users are likely to submit, and help prioritize the most important information to include.
-
-### Parsing
-
-Having identified the data sources for your RAG application, the next step will be extracting the required information from the raw data. This process, known as parsing, involves transforming the unstructured data into a format that can be effectively utilized by the RAG application.
-
-The specific parsing techniques and tools you use will depend on the type of data you are working with. For example:
-
-- **Text documents** (e.g., PDFs, Word docs): Off-the-shelf libraries like [unstructured](https://github.com/Unstructured-IO/unstructured) and [PyPDF2](https://pypdf2.readthedocs.io/en/3.x/) can handle various file formats and provide options for customizing the parsing process.
-
-- **HTML documents**: HTML parsing libraries like [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) can be used to extract relevant content from web pages. With these you can navigate the HTML structure, select specific elements, and extract the desired text or attributes.
-
-- **Images and scanned documents**: Optical Character Recognition (OCR) techniques will typically be required to extract text from images. Popular OCR libraries include [Tesseract](https://github.com/tesseract-ocr/tesseract), [Amazon Textract](https://aws.amazon.com/textract/ocr/), [Azure AI Vision OCR](https://azure.microsoft.com/en-us/products/ai-services/ai-vision/), and [Google Cloud Vision API](https://cloud.google.com/vision).
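-
-For the text-document case above, a minimal parsing pass over a folder of PDFs using PyPDF2 might look like the following sketch (the folder path and the dictionary output format are illustrative assumptions):
-
-```python
-from pathlib import Path
-
-from PyPDF2 import PdfReader  # pip install PyPDF2
-
-def parse_pdfs(folder: str) -> dict[str, str]:
-    """Extract raw text from every PDF in `folder`, keyed by filename."""
-    parsed = {}
-    for pdf_path in Path(folder).glob("*.pdf"):
-        reader = PdfReader(str(pdf_path))
-        # Concatenate the text of every page; pages with no extractable text yield "".
-        pages = [page.extract_text() or "" for page in reader.pages]
-        parsed[pdf_path.name] = "\n".join(pages)
-    return parsed
-
-docs = parse_pdfs("./raw_docs")  # hypothetical folder of source PDFs
-```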
-
-When parsing your data, consider the following best practices:
-
-1. **Data cleaning:** Preprocess the extracted text to remove any irrelevant or noisy information, such as headers, footers, or special characters. Aim to reduce the amount of unnecessary or malformed information that your RAG chain will need to process.
-
-2. **Handling errors and exceptions:** Implement error handling and logging mechanisms to identify and resolve any issues encountered during the parsing process. This will help you quickly identify and fix problems. Doing so often points to upstream issues with the quality of the source data.
-
-3. **Customizing parsing logic:** Depending on the structure and format of your data, you may need to customize the parsing logic to extract the most relevant information. While it may require additional effort upfront, invest the time to do this if required - it often prevents a lot of downstream quality issues.
-
-4. **Evaluating parsing quality**: Regularly assess the quality of the parsed data by manually reviewing a sample of the output. This can help you identify any issues or areas for improvement in the parsing process.
-
-### Chunking
-
-```{image} ../images/3-deep-dive/2_img.png
-:align: center
-```
-
-After parsing the raw data into a more structured format, the next step is to break it down into smaller, manageable units called *chunks*. Segmenting large documents into smaller, semantically concentrated chunks ensures that retrieved data fits in the LLM's context, while minimizing the inclusion of distracting or irrelevant information. The choices made on chunking will directly affect what retrieved data the LLM is provided, making it one of the first layers of optimization in a RAG application.
-
-When chunking your data, you will generally need to consider the following factors:
-
-1. **Chunking strategy:** The method you use to divide the original text into chunks. This can involve basic techniques such as splitting by sentences, paragraphs, or specific character/token counts, through to more advanced document-specific splitting strategies.
-
-2. **Chunk size:** Smaller chunks may focus on specific details but lose some surrounding information. Larger chunks may capture more context but can also include irrelevant information.
-
-3. **Overlap between chunks:** To ensure that important information is not lost when splitting the data into chunks, consider including some overlap between adjacent chunks. Overlapping can ensure continuity and context preservation across chunks.
-
-4. **Semantic coherence:** When possible, aim to create chunks that are semantically coherent, meaning they contain related information and can stand on their own as a meaningful unit of text. This can be achieved by considering the structure of the original data, such as paragraphs, sections, or topic boundaries.
-
-5. **Metadata:** Including relevant metadata within each chunk, such as the source document name, section heading, or product names can improve the retrieval process. This additional information in the chunk can help match retrieval queries to chunks.
-
-Finding the right chunking method is both iterative and context-dependent. There is no one-size-fits-all approach; the optimal chunk size and method will depend on the specific use case and the nature of the data being processed. Broadly speaking, chunking strategies can be viewed as the following:
-
-- **Fixed-size chunking:** Split the text into chunks of a predetermined size, such as a fixed number of characters or tokens (e.g., [LangChain CharacterTextSplitter](https://python.langchain.com/v0.2/docs/how_to/character_text_splitter/)). While splitting by an arbitrary number of characters/tokens is quick and easy to set up, it will typically not result in consistent semantically coherent chunks.
-
-- **Paragraph-based chunking:** Use the natural paragraph boundaries in the text to define chunks. This method can help preserve the semantic coherence of the chunks, as paragraphs often contain related information (e.g., [LangChain RecursiveCharacterTextSplitter](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/)).
-
-- **Format-specific chunking:** Formats such as markdown or HTML have an inherent structure within them which can be used to define chunk boundaries (for example, markdown headers). Tools like LangChain's [MarkdownHeaderTextSplitter](https://python.langchain.com/v0.2/docs/how_to/markdown_header_metadata_splitter/#how-to-return-markdown-lines-as-separate-documents) or HTML [header](https://python.langchain.com/v0.2/docs/how_to/HTML_header_metadata_splitter/)/[section](https://python.langchain.com/v0.2/docs/how_to/HTML_section_aware_splitter/)-based splitters can be used for this purpose.
-
-- **Semantic chunking:** Techniques such as topic modeling can be applied to identify semantically coherent sections within the text. These approaches analyze the content or structure of each document to determine the most appropriate chunk boundaries based on shifts in topic. Although more involved than more basic approaches, semantic chunking can help create chunks that are more aligned with the natural semantic divisions in the text (see [LangChain SemanticChunker](https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/) for an example of this).
-
-```{image} ../images/3-deep-dive/3_img.png
-:align: center
-```
-
-**Example:** Fixed-size chunking example using LangChain's [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) with `chunk_size=100` and `chunk_overlap=20`. [ChunkViz](https://chunkviz.up.railway.app/) provides an interactive way to visualize how different chunk size and chunk overlap values with LangChain's character splitters affect the resulting chunks.
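-
-In code, that example might look like the sketch below, assuming the `langchain-text-splitters` package that ships with LangChain v0.2:
-
-```python
-from langchain_text_splitters import RecursiveCharacterTextSplitter
-
-splitter = RecursiveCharacterTextSplitter(
-    chunk_size=100,    # maximum characters per chunk
-    chunk_overlap=20,  # characters shared between adjacent chunks
-)
-
-text = "..."  # parsed document text goes here
-chunks = splitter.split_text(text)
-```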
-
-### Embedding model
-
-```{image} ../images/3-deep-dive/4_img.png
-:align: center
-```
-
-After chunking your data, the next step is to convert the text chunks into vector representations using an embedding model, which captures the semantic meaning of each chunk. By representing chunks as dense vectors, embeddings allow for fast and accurate retrieval of the most relevant chunks based on their semantic similarity to a retrieval query. At query time, the retrieval query will be transformed using the same embedding model that was used to embed chunks in the data pipeline.
-
-When selecting an embedding model, consider the following factors:
-
-- **Model choice:** Each embedding model has its nuances, and the available benchmarks may not capture the specific characteristics of your data. Experiment with different off-the-shelf embedding models, even those that may be lower-ranked on standard leaderboards like [MTEB](https://huggingface.co/spaces/mteb/leaderboard). Some examples to consider include:
- - [GTE-Large-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- [OpenAI's text-embedding-ada-002, text-embedding-3-large, and text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings)
-
-- **Max tokens:** Be aware of the maximum token limit for your chosen embedding model. If you pass chunks that exceed this limit, they will be truncated, potentially losing important information. For example, [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) has a maximum token limit of 512.
-
-- **Model size:** Larger embedding models generally offer better performance but require more computational resources. Strike a balance between performance and efficiency based on your specific use case and available resources.
-
-- **Fine-tuning:** If your RAG application deals with domain-specific language (e.g., internal company acronyms or terminology), consider fine-tuning the embedding model on domain-specific data. This can help the model better capture the nuances and terminology of your particular domain, and can often lead to improved retrieval performance.
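-
-As an illustrative sketch, embedding chunks with one of the open-source models above via the `sentence-transformers` library might look like this; the normalization flag and the sample chunks are assumptions for illustration.
-
-```python
-from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
-
-# bge-large-en-v1.5 truncates inputs beyond its 512-token limit,
-# so keep chunk sizes below that bound in the data pipeline.
-model = SentenceTransformer("BAAI/bge-large-en-v1.5")
-
-chunks = ["First chunk of parsed text...", "Second chunk of parsed text..."]
-embeddings = model.encode(chunks, normalize_embeddings=True)
-print(embeddings.shape)  # one vector per chunk, e.g. (2, 1024)
-```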
-
-## RAG Chain
-
-```{image} ../images/3-deep-dive/5_img.png
-:align: center
-```
-
-The RAG chain takes a user query as input, retrieves relevant information given that query, and generates an appropriate response grounded on the retrieved data. While the exact steps within a RAG chain can vary widely depending on the use case and requirements, the following are the key components to consider when building your RAG chain:
-
-1. **Query understanding:** Analyzing and transforming user queries to better represent intent and extract relevant information, such as filters or keywords, to improve the retrieval process.
-
-2. **Retrieval:** Finding the most relevant chunks of information given a retrieval query. In the unstructured data case, this typically involves one or a combination of semantic or keyword-based search.
-
-3. **Prompt augmentation:** Combining a user query with retrieved information and instructions to guide the LLM towards generating high-quality responses.
-
-4. **LLM:** Selecting the most appropriate model (and model parameters) for your application to optimize/balance performance, latency, and cost.
-
-5. **Post-processing and guardrails:** Applying additional processing steps and safety measures to ensure the LLM-generated responses are on-topic, factually consistent, and adhere to specific guidelines or constraints.
-
-In the [implementing RAG chain changes](/nbs/5-hands-on.md#rag-chain-changes) section we will demonstrate how to iterate over these various components of a chain.
-
-### Query understanding
-
-Using the user query directly as a retrieval query can work for some queries. However, it is generally beneficial to reformulate the query before the retrieval step. Query understanding comprises a step (or series of steps) at the beginning of a chain to analyze and transform user queries to better represent intent, extract relevant information, and ultimately help the subsequent retrieval process. Approaches to transforming a user query to improve retrieval include:
-
-1. **Query Rewriting:** Query rewriting involves translating a user query into one or more queries that better represent the original intent. The goal is to reformulate the query in a way that increases the likelihood of the retrieval step finding the most relevant documents. This can be particularly useful when dealing with complex or ambiguous queries that might not directly match the terminology used in the retrieval documents.
-
- **Examples**:
-
- - Paraphrasing conversation history in a multi-turn chat
- - Correcting spelling mistakes in the user's query
- - Replacing words or phrases in the user query with synonyms to capture a broader range of relevant documents
-
-```{eval-rst}
-.. note::
-
- Query rewriting must be done in conjunction with changes to the retrieval component
-
-.. include:: ./include-rst.rst
-```
-
-2. **Filter extraction:** In some cases, user queries may contain specific filters or criteria that can be used to narrow down the search results. Filter extraction involves identifying and extracting these filters from the query and passing them to the retrieval step as additional parameters. This can help improve the relevance of the retrieved documents by focusing on specific subsets of the available data.
-
- **Examples**:
-
- - Extracting specific time periods mentioned in the query, such as "articles from the last 6 months" or "reports from 2023".
- - Identifying mentions of specific products, services, or categories in the query, such as "Databricks Professional Services" or "laptops".
- - Extracting geographic entities from the query, such as city names or country codes.
-
-
-```{eval-rst}
-.. note::
-
- Filter extraction must be done in conjunction with changes to both metadata extraction [data pipeline] and retrieval [RAG chain] components. The metadata extraction step should ensure that the relevant metadata fields are available for each document/chunk, and the retrieval step should be implemented to accept and apply extracted filters.
-
-.. include:: ./include-rst.rst
-```
-
-In addition to query rewriting and filter extraction, another important consideration in query understanding is whether to use a single LLM call or multiple calls. While using a single call with a carefully crafted prompt can be efficient, there are cases where breaking the query understanding process into multiple LLM calls leads to better results. This is a generally applicable rule of thumb whenever you are trying to pack several complex logic steps into a single prompt.
-
-For example, you might use one LLM call to classify the query intent, another to extract relevant entities, and a third to rewrite the query based on the extracted information. Although this approach may add some latency to the overall process, it can allow for more fine-grained control and potentially improve the quality of the retrieved documents.
-
-Here's how a multi-step query understanding component might look for a customer support bot (a code sketch follows this list):
-
-1. **Intent classification:** Use an LLM to classify the user's query into predefined categories, such as "product information", "troubleshooting", or "account management".
-
-2. **Entity extraction:** Based on the identified intent, use another LLM call to extract relevant entities from the query, such as product names, reported errors, or account numbers.
-
-3. **Query rewriting:** Use the extracted intent and entities to rewrite the original query into a more specific and targeted format, e.g., "My RAG chain is failing to deploy on Model Serving, I'm seeing the following error...".
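-
-A sketch of this three-call pattern is shown below. `call_llm(prompt) -> str` is a hypothetical helper around whichever chat model you use, and the prompts are heavily abbreviated for illustration.
-
-```python
-def understand_query(user_query: str, call_llm) -> dict:
-    """Three-step query understanding: classify intent, extract entities, rewrite."""
-    # 1. Intent classification into predefined categories.
-    intent = call_llm(
-        "Classify this support query as one of: product information, "
-        f"troubleshooting, account management.\nQuery: {user_query}"
-    )
-
-    # 2. Entity extraction conditioned on the identified intent.
-    entities = call_llm(
-        f"Intent: {intent}. Extract any product names, error messages, or "
-        f"account numbers as a comma-separated list.\nQuery: {user_query}"
-    )
-
-    # 3. Query rewriting using the extracted intent and entities.
-    rewritten = call_llm(
-        "Rewrite the query below into a specific, self-contained retrieval query.\n"
-        f"Intent: {intent}\nEntities: {entities}\nQuery: {user_query}"
-    )
-    return {"intent": intent, "entities": entities, "retrieval_query": rewritten}
-```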
-
-### Retrieval
-
-The retrieval component of the RAG chain is responsible for finding the most relevant chunks of information given a retrieval query. In the context of unstructured data, retrieval typically involves one or a combination of semantic search, keyword-based search, and metadata filtering. The choice of retrieval strategy depends on the specific requirements of your application, the nature of the data, and the types of queries you expect to handle. Let's compare these options:
-
-1. **Semantic search:** Semantic search uses an embedding model to convert each chunk of text into a vector representation that captures its semantic meaning. By comparing the vector representation of the retrieval query with the vector representations of the chunks, semantic search can retrieve documents that are conceptually similar, even if they don't contain the exact keywords from the query.
-
-2. **Keyword-based search:** Keyword-based search determines the relevance of documents by analyzing the frequency and distribution of shared words between the retrieval query and the indexed documents. The more often the same words appear in both the query and a document, the higher the relevance score assigned to that document.
-
-3. **Hybrid search:** Hybrid search combines the strengths of both semantic and keyword-based search by employing a two-step retrieval process. First, it performs a semantic search to retrieve a set of conceptually relevant documents. Then, it applies keyword-based search on this reduced set to further refine the results based on exact keyword matches. Finally, it combines the scores from both steps to rank the documents.
-
-The following table contrasts each of these retrieval strategies against one another:
-
-| | Semantic search | Keyword search | Hybrid search |
-|---|---|---|---|
-| **Simple explanation** | If the same **concepts** appear in the query and a potential document, they are relevant. | If the same **words** appear in the query and a potential document, they are relevant. The **more words** from the query in the document, the more relevant that document is. | Runs BOTH a semantic search and keyword search, then combines the results. |
-| **Example** | If the user searches for "RAG", a document referring to "retrieval-augmented generation" would be returned even if the document does NOT have the words "RAG" in it. | If the user searches for "RAG", a document referring to "retrieval-augmented generation" would NOT be returned UNLESS the document HAS the words "RAG" in it. | Both documents would be returned. |
-| **Technical approaches** | Uses embeddings to represent text in a continuous vector space, enabling semantic search. | Relies on discrete token-based methods like [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) for keyword matching. | Uses a re-ranking approach to combine the results, such as [reciprocal rank fusion](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) or a [re-ranking model](https://en.wikipedia.org/wiki/Learning_to_rank). |
-| **Strengths** | Retrieving contextually similar information to a query, even if the exact words are not used. | Scenarios requiring precise keyword matches, ideal for specific term-focused queries such as product names. | Combines the best of both approaches. |
-| **Example use case** | Customer support where user queries differ from the words in the product manuals,<br>e.g., *"how do I turn my phone on?"* when the manual section is called *"toggling the power"*. | Customer support where queries contain specific, non-descriptive technical terms,<br>e.g., *"what does model HD7-8D do?"* | Customer support queries that combine both semantic and technical terms,<br>e.g., *"how do I turn on my HD7-8D?"* |
-
-In addition to these core retrieval strategies, there are several techniques you can apply to further enhance the retrieval process:
-
-- **Query expansion:** Query expansion can help capture a broader range of relevant documents by using multiple variations of the retrieval query. This can be achieved by either conducting individual searches for each expanded query, or using a concatenation of all expanded search queries in a single retrieval query.
-
-> ***Note:** Query expansion must be done in conjunction with changes to the query understanding component [RAG chain]. The multiple variations of a retrieval query are typically generated in this step.*
-
-- **Re-ranking:** After retrieving an initial set of chunks, apply additional ranking criteria (e.g., sort by time) or a reranker model to re-order the results. Re-ranking can help prioritize the most relevant chunks given a specific retrieval query. Reranking with cross-encoder models such as [mxbai-rerank](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1) and [ColBERTv2](https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/ColbertRerank/) can yield an uplift in retrieval performance.
-
-- **Metadata filtering:** Use metadata filters extracted from the query understanding step to narrow down the search space based on specific criteria. Metadata filters can include attributes like document type, creation date, author, or domain-specific tags. By combining metadata filters with semantic or keyword-based search, you can create more targeted and efficient retrieval.
-
-> ***Note:** Metadata filtering must be done in conjunction with changes to the query understanding [RAG chain] and metadata extraction [data pipeline] components.*
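-
-To illustrate how hybrid search results can be combined, here is a minimal reciprocal rank fusion sketch. It assumes you already have two lists of document IDs, one ranked by the semantic retriever and one by the keyword retriever; the constant `k=60` is a commonly used default.
-
-```python
-def reciprocal_rank_fusion(semantic_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
-    """Combine two ranked lists of document IDs; higher fused score means more relevant."""
-    scores: dict[str, float] = {}
-    for ranked in (semantic_ids, keyword_ids):
-        for rank, doc_id in enumerate(ranked, start=1):
-            # Each appearance contributes 1 / (k + rank) to the document's fused score.
-            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
-    return sorted(scores, key=scores.get, reverse=True)
-```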
-
-### Prompt augmentation
-
-Prompt augmentation is the step where the user query is combined with the retrieved information and instructions in a prompt template to guide the language model towards generating high-quality responses. Iterating on this template to optimize the prompt provided to the LLM (i.e., prompt engineering) will be required to ensure that the model is guided to produce accurate, grounded, and coherent responses.
-
-There are entire [guides to prompt engineering](https://www.promptingguide.ai/), but here are a number of considerations to keep in mind when you're iterating on the prompt template:
-
-1. Provide examples
- - Include examples of well-formed queries and their corresponding ideal responses within the prompt template itself (i.e., [few-shot learning](https://arxiv.org/abs/2005.14165)). This helps the model understand the desired format, style, and content of the responses.
- - One useful way to come up with good examples is to identify types of queries your chain struggles with. Create gold-standard responses for those queries and include them as examples in the prompt.
- - Ensure that the examples you provide are representative of user queries you anticipate at inference time. Aim to cover a diverse range of expected queries to help the model generalize better.
-
-2. Parameterize your prompt template
- - Design your prompt template to be flexible by parameterizing it to incorporate additional information beyond the retrieved data and user query. This could be variables such as current date, user context, or other relevant metadata.
- - Injecting these variables into the prompt at inference time can enable more personalized or context-aware responses.
-
-3. Consider Chain-of-Thought prompting
- - For complex queries where direct answers aren't readily apparent, consider [Chain-of-Thought (CoT) prompting](https://arxiv.org/abs/2201.11903). This prompt engineering strategy breaks down complicated questions into simpler, sequential steps, guiding the LLM through a logical reasoning process.
- - By prompting the model to "think through the problem step-by-step," you encourage it to provide more detailed and well-reasoned responses, which can be particularly effective for handling multi-step or open-ended queries.
-
-4. Prompts may not transfer across models
- - Recognize that prompts often do not transfer seamlessly across different language models. Each model has its own unique characteristics where a prompt that works well for one model may not be as effective for another.
- - Experiment with different prompt formats and lengths, refer to online guides (e.g., [OpenAI Cookbook](https://cookbook.openai.com/), [Anthropic cookbook](https://github.com/anthropics/anthropic-cookbook)), and be prepared to adapt and refine your prompts when switching between models.
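-
-To illustrate the first two points above, here is a minimal sketch of a parameterized prompt template that embeds a few-shot example and injects the current date at inference time. The template text and variable names are illustrative only, not a recommended prompt.
-
-```python
-from datetime import date
-
-# Hypothetical template: one few-shot example plus parameterized context, question, and date.
-PROMPT_TEMPLATE = """You are a customer support assistant. Answer using only the provided context.
-Today's date: {current_date}
-
-Example question: How do I restart my device?
-Example answer: Hold the power button for 10 seconds, then press it again to power back on.
-
-Context:
-{context}
-
-Question: {question}
-Answer:"""
-
-def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
-    # Join the retrieved chunks and fill in the template variables at inference time.
-    return PROMPT_TEMPLATE.format(
-        current_date=date.today().isoformat(),
-        context="\n\n".join(retrieved_chunks),
-        question=question,
-    )
-
-print(build_prompt("How do I turn on my HD7-8D?", ["Toggling the power: press and hold ..."]))
-```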
-
-### LLM
-
-The generation component of the RAG chain takes the augmented prompt template from the previous step and passes it to an LLM. When selecting and optimizing an LLM for the generation component of a RAG chain, consider the following factors, which are equally applicable to any other steps that involve LLM calls:
-
-1. Experiment with different off-the-shelf models
- - Each model has its own unique properties, strengths, and weaknesses. Some models may have a better understanding of certain domains or perform better on specific tasks.
- - As mentioned prior, keep in mind that the choice of model may also influence the prompt engineering process, as different models may respond differently to the same prompts.
- - If there are multiple steps in your chain that require an LLM, such as calls for query understanding in addition to the generation step, consider using different models for different steps. More expensive, general-purpose models may be overkill for tasks like determining the intent of a user query.
-
-2. Start small and scale up as needed
- - While it may be tempting to immediately reach for the most powerful and capable models available (e.g., GPT-4, Claude), it's often more efficient to start with smaller, more lightweight models.
- - In many cases, smaller open-source alternatives like Llama 3 or DBRX can provide satisfactory results at a lower cost and with faster inference times. These models can be particularly effective for tasks that don't require highly complex reasoning or extensive world knowledge.
- - As you develop and refine your RAG chain, continuously assess the performance and limitations of your chosen model. If you find that the model struggles with certain types of queries or fails to provide sufficiently detailed or accurate responses, consider scaling up to a more capable model.
- - Monitor the impact of changing models on key metrics such as response quality, latency, and cost to ensure that you're striking the right balance for the requirements of your specific use case.
-
-3. Optimize model parameters
-    - Experiment with different parameter settings to find the optimal balance between response quality, diversity, and coherence. For example, adjusting the `temperature` can control the randomness of the generated text, while `max_tokens` can limit the response length (see the sketch after this list).
- - Be aware that the optimal parameter settings may vary depending on the specific task, prompt, and desired output style. Iteratively test and refine these settings based on evaluation of the generated responses.
-
-4. Task-specific fine-tuning
- - As you refine performance, consider fine-tuning smaller models for specific sub-tasks within your RAG chain, such as query understanding.
-    - By training specialized models for individual tasks within the RAG chain, you can potentially improve the overall performance, reduce latency, and lower inference costs compared to using a single large model for all tasks.
-
-5. Continued pre-training
- - If your RAG application deals with a specialized domain or requires knowledge that is not well-represented in the pre-trained LLM, consider performing continued pre-training (CPT) on domain-specific data.
- - Continued pre-training can improve a model's understanding of specific terminology or concepts unique to your domain. In turn this can reduce the need for extensive prompt engineering or few-shot examples.
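-
-As a concrete illustration of adjusting model parameters (point 3 above), the sketch below calls an OpenAI-compatible chat completions endpoint with explicit `temperature` and `max_tokens` settings. The endpoint URL, token, and model name are placeholders - substitute the details of whichever serving endpoint you use.
-
-```python
-from openai import OpenAI
-
-# Placeholder endpoint and credentials - replace with your own serving endpoint details.
-client = OpenAI(
-    base_url="https://<workspace-host>/serving-endpoints",
-    api_key="<token>",
-)
-
-augmented_prompt = "Context:\n...\n\nQuestion: how do I turn on my HD7-8D?\nAnswer:"
-
-response = client.chat.completions.create(
-    model="databricks-dbrx-instruct",  # assumed model/endpoint name
-    messages=[{"role": "user", "content": augmented_prompt}],
-    temperature=0.1,  # lower values reduce randomness in the generated text
-    max_tokens=500,   # caps response length, bounding cost and latency
-)
-print(response.choices[0].message.content)
-```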
-
-### Post-processing & guardrails
-
-After the LLM generates a response, it is often necessary to apply post-processing techniques or guardrails to ensure that the output meets the desired format, style, and content requirements. This final step (or multiple steps) in the chain can help maintain consistency and quality across the generated responses. If you are implementing post-processing and guardrails, consider some of the following:
-
-1. Enforcing output format
-    - Depending on your use case, you may require the generated responses to adhere to a specific format, such as a structured template or a particular file type (e.g., JSON, HTML, Markdown, etc.).
-    - If structured output is required, libraries such as [Instructor](https://github.com/jxnl/instructor) or [Outlines](https://github.com/outlines-dev/outlines) provide good starting points to implement this kind of validation step (see the sketch at the end of this section).
- - When developing, take time to ensure that the post-processing step is flexible enough to handle variations in the generated responses while maintaining the required format.
-
-2. Maintaining style consistency
- - If your RAG application has specific style guidelines or tone requirements (e.g., formal vs. casual, concise vs. detailed), a post-processing step can both check and enforce these style attributes across generated responses.
-
-3. Content filters and safety guardrails
- - Depending on the nature of your RAG application and the potential risks associated with generated content, it may be important to [implement content filters or safety guardrails](https://www.databricks.com/blog/implementing-llm-guardrails-safe-and-responsible-generative-ai-deployment-databricks) to prevent the output of inappropriate, offensive, or harmful information.
- - Consider using models like [Llama Guard](https://marketplace.databricks.com/details/a4bc6c21-0888-40e1-805e-f4c99dca41e4/Databricks_Llama-Guard-Model) or APIs specifically designed for content moderation and safety, such as [OpenAI's moderation API](https://platform.openai.com/docs/guides/moderation), to implement safety guardrails.
-
-4. Handling hallucinations
- - Defending against hallucinations can also be implemented as a post-processing step. This may involve cross-referencing the generated output with retrieved documents, or using additional LLMs to validate the factual accuracy of the response.
- - Develop fallback mechanisms to handle cases where the generated response fails to meet the factual accuracy requirements, such as generating alternative responses or providing disclaimers to the user.
-
-5. Error handling
- - With any post-processing steps, implement mechanisms to gracefully deal with cases where the step encounters an issue or fails to generate a satisfactory response. This could involve generating a default response, or escalating the issue to a human operator for manual review.
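-
-As a sketch of the output-format enforcement described in point 1 above, the example below validates an LLM response against a Pydantic schema and falls back to `None` when parsing fails. The schema and field names are hypothetical; libraries such as Instructor or Outlines can handle this more robustly.
-
-```python
-from typing import Optional
-
-from pydantic import BaseModel, ValidationError
-
-# Hypothetical response schema - adjust the fields to your use case.
-class SupportAnswer(BaseModel):
-    answer: str
-    source_doc_uris: list[str]
-
-def validate_output(raw_llm_output: str) -> Optional[SupportAnswer]:
-    # Attempt to parse the LLM output as JSON that matches the schema.
-    try:
-        return SupportAnswer.model_validate_json(raw_llm_output)
-    except ValidationError:
-        # Fallback: re-prompt the model, return a default response, or escalate for review.
-        return None
-
-print(validate_output('{"answer": "Hold the power button.", "source_doc_uris": ["manual.pdf"]}'))
-```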
diff --git a/genai_cookbook/nbs/4-evaluation.md b/genai_cookbook/nbs/4-evaluation.md
deleted file mode 100644
index 0d4fa9e..0000000
--- a/genai_cookbook/nbs/4-evaluation.md
+++ /dev/null
@@ -1,134 +0,0 @@
-# Section 4: Evaluation
-
-The old saying "you can't manage what you can't measure" is incredibly relevant (no pun intended) in the context of any generative AI application, RAG included. In order for your generative AI application to deliver high quality, accurate responses, you **must** be able to define and measure what "quality" means for your use case.
-
-This section deep dives into 3 critical components of evaluation:
-
-1. [Establishing Ground Truth: Creating Evaluation Sets](#establishing-ground-truth-creating-evaluation-sets)
-2. [Assessing Performance: Defining Metrics that Matter](#assessing-performance-defining-metrics-that-matter)
-3. [Enabling Measurement: Building Supporting Infrastructure](#enabling-measurement-building-supporting-infrastructure)
-
-## Establishing Ground Truth: Creating Evaluation Sets
-
-To measure quality, Databricks recommends creating a human-labeled Evaluation Set, which is a curated, representative set of queries, along with ground-truth answers and (optionally) the correct supporting documents that should be retrieved. Human input is crucial in this process, as it ensures that the Evaluation Set accurately reflects the expectations and requirements of the end-users.
-
-A good Evaluation Set has the following characteristics:
-
-- **Representative:** Accurately reflects the variety of requests the application will encounter in production.
-- **Challenging:** The set should include difficult and diverse cases to effectively test the model's capabilities. Ideally, it will include adversarial examples, such as questions attempting prompt injection or questions attempting to elicit inappropriate responses from the LLM.
-- **Continually updated:** The set must be periodically updated to reflect how the application is used in production and the changing nature of the indexed data.
-
-Databricks recommends at least 30 questions in your evaluation set, and ideally 100 - 200. The best evaluation sets will grow over time to contain 1,000s of questions.
-
-To avoid overfitting, Databricks recommends splitting your evaluation set into training, test, and validation sets:
-
-- Training set: ~70% of the questions. Used for an initial pass to evaluate every experiment to identify the highest potential ones.
-- Test set: ~20% of the questions. Used for evaluating the highest performing experiments from the training set.
-- Validation set: ~10% of the questions. Used for a final validation check before deploying an experiment to production.
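-
-A minimal sketch of this split, assuming the evaluation set is a simple list of question/answer pairs:
-
-```python
-import random
-
-# Placeholder evaluation set: (question, ground-truth answer) pairs.
-eval_set = [(f"question {i}", f"answer {i}") for i in range(100)]
-
-random.seed(42)
-random.shuffle(eval_set)
-
-n = len(eval_set)
-train = eval_set[: int(0.7 * n)]              # ~70%: score every experiment
-test = eval_set[int(0.7 * n) : int(0.9 * n)]  # ~20%: compare the best experiments
-validation = eval_set[int(0.9 * n) :]         # ~10%: final check before production
-
-print(len(train), len(test), len(validation))
-```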
-
-## Assessing Performance: Defining Metrics that Matter
-
-With an evaluation set, you are able to measure the performance of your RAG application across a number of different dimensions, including:
-
-- **Retrieval quality**: Retrieval metrics assess how successfully your RAG application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.
-- **Response quality**: Response quality metrics assess how well the RAG application responds to a user's request. Response metrics can measure, for instance, how well-grounded the response was given the retrieved context, or how harmful/harmless the response was.
-- **Chain performance:** Chain metrics capture the overall cost and performance of RAG applications. Overall latency and token consumption are examples of chain performance metrics.
-
-There are two key approaches to measuring performance across these metrics:
-
-- **Ground truth based:** This approach involves comparing the RAG application's retrieved supporting data or final output to the ground-truth answers and supporting documents recorded in the evaluation set. It allows for assessing the performance based on known correct answers.
-- **LLM judge based:** In this approach, a separate [LLM acts as a judge](https://arxiv.org/abs/2306.05685) to evaluate the quality of the RAG application's retrieval and responses. LLM judges can be configured to compare the final response to the user query and rate its relevance. This approach automates evaluation across numerous dimensions. LLM judges can also be configured to return rationales for their ratings.
-
-Take time to ensure that the LLM judge's evaluations align with the RAG application's success criteria. Some LLM-as-judge metrics still rely on the ground truth from the evaluation set, which the judge LLM uses to assess the application's output.
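-
-For intuition, here is a hypothetical and deliberately simplified groundedness judge prompt; production LLM judges, such as those provided with Quality Lab, are more carefully calibrated.
-
-```python
-# Hypothetical judge prompt - illustrative only.
-JUDGE_PROMPT = """You are grading the output of a RAG application.
-
-Question: {question}
-Retrieved context: {context}
-Generated answer: {answer}
-
-Is the answer fully supported by the retrieved context? Reply "yes" or "no",
-followed by a one-sentence rationale."""
-
-def build_judge_prompt(question: str, context: str, answer: str) -> str:
-    # The resulting string is sent to a separate judge LLM for a rating and rationale.
-    return JUDGE_PROMPT.format(question=question, context=context, answer=answer)
-```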
-
-### Retrieval metrics
-
-Retrieval metrics help you understand if your retriever is delivering relevant results. Retrieval metrics are largely based on precision and recall.
-
-| Metric Name | Question Answered | Details |
-|-------------|------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Precision | Is the retrieved supporting data relevant? | Precision is the proportion of retrieved documents that are actually relevant to the user's request. An LLM judge can be used to assess the relevance of each retrieved chunk to the user's request. |
-| Recall | Did I retrieve most/all of the relevant chunks? | Recall is the proportion of all of the relevant documents that were retrieved. This is a measure of the completeness of the results. |
-
-In the example below, two out of the three retrieved results were relevant to the user's query, so the precision was 0.66 (2/3). The retrieved docs included two out of a total of four relevant docs, so the recall was 0.5 (2/4).
-
-```{image} ../images/4-evaluation/1_img.png
-:align: center
-```
-
-See the [appendix](#appendix) on precision and recall for more details.
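-
-The calculation behind the example above can be expressed in a few lines; the document IDs are placeholders:
-
-```python
-def retrieval_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
-    # Precision: fraction of retrieved chunks that are relevant.
-    # Recall: fraction of all relevant chunks that were retrieved.
-    hits = sum(1 for doc in retrieved if doc in relevant)
-    precision = hits / len(retrieved) if retrieved else 0.0
-    recall = hits / len(relevant) if relevant else 0.0
-    return precision, recall
-
-# Matches the example above: 2 of 3 retrieved docs are relevant; 2 of 4 relevant docs were retrieved.
-print(retrieval_precision_recall(["doc1", "doc2", "doc5"], {"doc1", "doc2", "doc3", "doc4"}))
-# (0.666..., 0.5)
-```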
-
-### Response metrics
-
-Response metrics assess the quality of the final output. "Quality" has many different dimensions when it comes to assessing LLM outputs, and the range of metrics reflects this.
-
-| Metric Name | Question Answered | Details |
-|--------------|------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Correctness | All things considered, did the LLM give an accurate answer? | Correctness is an LLM-generated metric that assesses whether the LLM's output is correct by comparing it to the ground truth in the evaluation dataset. |
-| Groundedness | Is the LLM's response a hallucination or is it grounded to the context? | To measure groundedness, an LLM judge compares the LLM's output to the retrieved supporting data and assesses whether the output reflects the contents of the supporting data or whether it constitutes a hallucination. |
-| Harmfulness | Is the LLM responding safely without any harmful or toxic content? | The Harmfulness measure considers only the RAG application's final response. An LLM judge determines whether the response should be considered harmful or toxic. |
-| Relevance | Is the LLM responding to the question asked? | Relevance is based on the user's request and the RAG application's output. An LLM judge provides a rating of how relevant the output is to the request. |
-
-It is very important to collect both response and retrieval metrics. A RAG application can respond poorly in spite of retrieving the correct context; it can also provide good responses on the basis of faulty retrievals. Only by measuring both components can we accurately diagnose and address issues in the application.
-
-### Chain metrics
-
-Chain metrics assess the overall performance of the whole RAG chain. Cost and latency can be just as important as quality when it comes to evaluating RAG applications. It is important to consider cost and latency requirements early in the process of developing a RAG application as these considerations can affect every part of the application, including both the retrieval method and the LLM used for generation.
-
-| Metric Name | Question Answered | Details |
-|-------------|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Total tokens | What is the cost of executing the RAG chain? | Token consumption can be used to approximate cost. This metric counts the tokens used across all LLM generation calls in the RAG pipeline. Generally speaking, more tokens lead to higher costs, and finding ways to reduce tokens can reduce costs. |
-| Latency | What is the latency of executing the RAG chain? | Latency measures the time it takes for the application to return a response after the user sends a request. This includes the time it takes the retriever to retrieve relevant supporting data and for the LLM to generate output. |
-
-## Enabling Measurement: Building Supporting Infrastructure
-
-Measuring quality is not easy and requires a significant infrastructure investment. This section details what you need to succeed and how Databricks provides these components.
-
-**Detailed trace logging.** The core of your RAG application's logic is a series of steps in the chain. In order to evaluate and debug quality, you need to implement instrumentation that tracks the chain's inputs and outputs, along with each step of the chain and its associated inputs and outputs. The instrumentation you put in place should work the same way in development and production.
-
-In Databricks, MLflow Trace Logging provides this capability. With MLflow Trace Logging, you instrument your code in production, and get the same traces during development and in production. For more details, [link the docs].
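-
-As a rough sketch of what this instrumentation can look like, the example below decorates two chain steps with MLflow tracing (assuming a recent MLflow version with tracing support); the chain logic itself is a placeholder.
-
-```python
-import mlflow
-
-@mlflow.trace
-def retrieve(query: str) -> list[str]:
-    # Placeholder retrieval step; its inputs and outputs are captured in the trace.
-    return ["Toggling the power: press and hold the power button for 3 seconds."]
-
-@mlflow.trace
-def answer(query: str) -> str:
-    chunks = retrieve(query)
-    # Placeholder generation step; a real chain would call the LLM here.
-    return f"Based on {len(chunks)} retrieved chunk(s): press and hold the power button."
-
-answer("how do I turn my phone on?")
-```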
-
-**Stakeholder review UI.** Most often, as a developer, you are not a domain expert in the content of the application you are developing. In order to collect feedback from human experts who can assess your application's output quality, you need an interface that allows them to interact with early versions of the application and provide detailed feedback. Further, you need a way to load specific application outputs for the stakeholders to assess their quality.
-
-This interface must track the application's outputs and associated feedback in a structured manner, storing the full application trace and detailed feedback in a data table.
-
-In Databricks, the Quality Lab Review App provides this capability. For more details, [link the docs].
-
-**Quality / cost / latency metric framework.** You need a way to define the metrics that comprehensively measure the quality of each component of your chain and the end-to-end application. Ideally, the framework would provide a suite of standard metrics out of the box, in addition to supporting customization, so you can add metrics that test specific aspects of quality that are unique to your business.
-
-**Evaluation harness.** You need a way to quickly and efficiently get outputs from your chain for every question in your evaluation set, and then evaluate each output on the relevant metrics. This harness must be as efficient as possible, since you will run evaluation after every experiment that you try to improve quality.
-
-In Databricks, Quality Lab provides these capabilities. [link more details]
-
-**Evaluation set management.** Your evaluation set is a living, breathing set of questions that you will update iteratively over the course of your application's development and production lifecycle.
-
-**Experiment tracking framework.** During the course of your application development, you will try many different experiments. An experiment tracking framework enables you to log each experiment and track its metrics vs. other experiments.
-
-**Chain parameterization framework.** Many experiments you try will require you to hold the chain's code constant while iterating on various parameters used by the code. You need a framework that enables you to do this.
-
-In Databricks, MLflow provides these capabilities. [link]
-
-**Online monitoring.** [talk about LHM]
-
-
-## Appendix
-
-### Precision and Recall
-
-Retrieval metrics are based on the concept of Precision & Recall. Below is a quick primer on Precision & Recall adapted from the excellent [Wikipedia article](https://en.wikipedia.org/wiki/Precision_and_recall).
-
-Precision measures “Of the items* I retrieved, what % of these items are actually relevant to my user’s query?” Computing precision does NOT require your ground truth to contain ALL relevant items.
-
-```{image} ../images/4-evaluation/2_img.png
-:align: center
-:width: 400px
-```
-
-Recall measures “Of ALL the items* that I know are relevant to my user’s query, what % did I retrieve?” Computing recall requires your ground-truth to contain ALL relevant items.
-
-```{image} ../images/4-evaluation/3_img.png
-:align: center
-:width: 400px
-```
-
-\* Items can either be a document or a chunk of a document.
diff --git a/genai_cookbook/nbs/5-hands-on.md b/genai_cookbook/nbs/5-hands-on.md
deleted file mode 100644
index f567528..0000000
--- a/genai_cookbook/nbs/5-hands-on.md
+++ /dev/null
@@ -1,703 +0,0 @@
-# Section 5: Hands-on guide to implementing high-quality RAG
-
-This section walks you through Databricks recommended development workflow for building, testing, and deploying a high-quality RAG application: **evaluation-driven development**. This workflow is based on the Mosaic Research team's best practices for building and evaluating high quality RAG applications. If quality is important to your business, Databricks recommends following an evaluation-driven workflow:
-
-1. Define the requirements
-2. Collect stakeholder feedback on a rapid proof of concept (POC)
-3. Evaluate the POC's quality
-4. Iteratively diagnose and fix quality issues
-5. Deploy to production
-6. Monitor in production
-
-```{image} ../images/5-hands-on/1_img.png
-:align: center
-```
-
-Mapping to this workflow, this section provides ready-to-run sample code for every step and every suggestion to improve quality.
-
-Throughout, we will demonstrate evaluation-driven development using one of Databricks' internal generative AI use cases: a RAG bot that helps answer customer support questions in order to [1] reduce support costs and [2] improve the customer experience.
-
-## Evaluation-driven development
-
-There are two core concepts in **evaluation-driven development:**
-
-1. **Metrics:** Defining high-quality
-
-    *Similar to how you set business goals each year, you need to define what high-quality means for your use case. Databricks' Quality Lab provides a suggested set of N metrics to use, the most important of which is answer accuracy or correctness - is the RAG application providing the right answer?*
-
-2. **Evaluation:** Objectively measuring the metrics
-
- *To objectively measure quality, you need an evaluation set, which contains questions with known-good answers validated by humans. While this may seem scary at first - you probably don't have an evaluation set sitting ready to go - this guide walks you through the process of developing and iteratively refining this evaluation set.*
-
-Anchoring against metrics and an evaluation set provides the following benefits:
-
-1. You can iteratively and confidently refine your application's quality during development - no more vibe checks or guessing if a change resulted in an improvement.
-
-2. Getting alignment with business stakeholders on the readiness of the application for production becomes more straightforward when you can confidently state, *"we know our application answers the most critical questions to our business correctly and doesn't hallucinate."*
-
-> *Evaluation-driven development is known in the academic research community as "hill climbing," akin to climbing a hill to reach the peak - where the hill is your metric and the peak is 100% accuracy on your evaluation set.*
-
-## Gather requirements
-
-```{image} ../images/5-hands-on/2_img.png
-:align: center
-```
-
-Defining clear and comprehensive use case requirements is a critical first step in developing a successful RAG application. These requirements serve two primary purposes. Firstly, they help determine whether RAG is the most suitable approach for the given use case. If RAG is indeed a good fit, these requirements guide solution design, implementation, and evaluation decisions. Investing time at the outset of a project to gather detailed requirements can prevent significant challenges and setbacks later in the development process, and ensures that the resulting solution meets the needs of end-users and stakeholders. Well-defined requirements provide the foundation for the subsequent stages of the development lifecycle we'll walk through.
-
-### Is the use case a good fit for RAG?
-
-The first thing you'll need to establish is whether RAG is even the right approach for your use case. Given the hype around RAG, it's tempting to view it as a possible solution for any problem. However, there are nuances as to when RAG is suitable versus not.
-
-RAG is a good fit when:
-
-- The use case requires reasoning over retrieved information (both unstructured and structured)
-- The use case involves synthesizing information from multiple sources (e.g., generating a summary of key points from different articles on a topic)
-- Dynamic retrieval based on a user query is necessary (e.g., given a user query, determining which data source to retrieve from)
-- The use case requires generating novel content based on retrieved information (e.g., answering questions, providing explanations, offering recommendations)
-
-Conversely, RAG may not be the best fit when:
-
-- The task does not require query-specific retrieval. For example, generating call transcript summaries; even if individual transcripts are provided as context in the LLM prompt, the retrieved information remains the same for each summary.
-- Extremely low-latency responses are required (i.e., when responses are required in milliseconds)
-- The output is expected to be an exact copy of the retrieved information without modification (e.g., a search engine that returns verbatim snippets from documents)
-- Simple rule-based or templated responses are sufficient (e.g., a customer support chatbot that provides predefined answers based on keywords)
-- Input data needs to be reformatted (e.g., a user provides some input text and expects it to be transformed to a table)
-
-### Requirements questions
-
-Having established that RAG is indeed a good fit for your use case, consider the following questions to capture concrete requirements. Each requirement is prioritized as follows:
-
-- 🟢 P0: Must define this requirement before starting your POC
-- 🟡 P1: Must define before going to production, but can iteratively refine during the POC
-- ⚪ P2: Nice-to-have requirement
-
-#### User Experience
-
-*Define how users will interact with the RAG system and what kind of responses are expected*
-
-- 🟢 P0 What will a typical request to the RAG chain look like? Ask stakeholders for examples of potential user queries.
-- 🟢 P0 What kind of responses will users expect (e.g., short answers, long-form explanations, a combination, or something else)?
-- 🟡 P1 How will users interact with the system? Through a chat interface, search bar, or some other modality?
-- 🟡 P1 What tone or style should generated responses take? (e.g., formal, conversational, technical)
-- 🟡 P1 How should the application handle ambiguous, incomplete, or irrelevant queries? Should any form of feedback or guidance be provided in such cases?
-- ⚪ P2 Are there specific formatting or presentation requirements for the generated output? Should the output include any metadata in addition to the chain's response?
-
-#### Data
-
-*Determine the nature, source(s), and quality of the data that will be used in the RAG solution*
-
-- 🟢 P0 What are the available sources to use?
-- For each data source:
- - 🟢 P0 Is data structured or unstructured?
- - 🟢 P0 What is the source format of the retrieval data (e.g., PDFs, documentation with images/tables, structured API responses)?
- - 🟢 P0 Where does that data reside?
- - 🟢 P0 How much data is available?
- - 🟡 P1 How frequently is the data updated? How should those updates be handled?
- - 🟡 P1 Are there any known data quality issues or inconsistencies for each data source?
-
-Consider creating an inventory table to consolidate this information, for example:
-
-| Data Source | Source | File type(s) | Size | Update frequency |
-|----------------|----------------|--------------|--------|------------------|
-| Data source 1 | Unity Catalog Volume | JSON | 10GB | Daily |
-| Data source 2 | Public API | XML | n/a (API) | Real-time |
-| Data source 3 | SharePoint | PDF, DOCX | 500MB | Monthly |
-
-#### Performance constraints
-
-*Capture performance and resource requirements for the RAG application*
-
-- 🟡 P1 What is the maximum acceptable latency for generating the responses?
- - 🟡 P1 What is the maximum acceptable time to first token?
- - 🟡 P1 If the output is being streamed, is higher total latency acceptable?
-- 🟡 P1 Are there any cost limitations on compute resources available for inference?
-- 🟡 P1 What are the expected usage patterns and peak loads?
-- 🟡 P1 How many concurrent users or requests should the system be able to handle?
- - **NOTE:** Databricks natively handles such scalability requirements, through the ability to scale automatically with [Model Serving](https://docs.databricks.com/en/machine-learning/model-serving/index.html).
-
-#### Evaluation
-
-*Establish how the RAG solution will be evaluated and improved over time*
-
-- 🟢 P0 What is the business goal / KPI you want to impact? What is the baseline value and what is the target?
-- 🟢 P0 Which users or stakeholders will provide initial and ongoing feedback?
-- 🟢 P0 What metrics should be used to assess the quality of generated responses?
-    - Note: Databricks Quality Lab provides a recommended set of metrics for you to use
-- 🟡 P1 What is the set of questions the RAG app must be good at to go to production?
-- 🟡 P1 Does an [evaluation set](/nbs/4-evaluation.md#establishing-ground-truth-creating-evaluation-sets) exist? Is it possible to get an evaluation set of user queries, along with ground-truth answers and (optionally) the correct supporting documents that should be retrieved?
-- 🟡 P1 How will user feedback be collected and incorporated into the system?
-
-#### Security
-
-*Identify any security and privacy considerations*
-
-- 🟢 P0 Is there sensitive/confidential data that needs to be handled with care?
-- 🟡 P1 Do access controls need to be implemented in the solution (e.g., a given user can only retrieve from a restricted set of documents)?
-
-#### Deployment
-
-*Understanding how the RAG solution will be integrated, deployed, and maintained*
-
-- 🟡 P1 How should the RAG solution integrate with existing systems and workflows?
-- 🟡 P1 How should the model be deployed, scaled, and versioned?
-    - **NOTE:** we will cover how this end-to-end lifecycle can be handled on Databricks with MLflow, Unity Catalog, Agent SDK, and Model Serving.
-
-Note that this is by no means an exhaustive list of questions. However, it should provide a solid foundation for capturing the key requirements for your RAG solution.
-
-Let's look at how some of these questions apply to the Databricks customer support RAG application:
-
-| | Considerations | Requirements |
-|---|---|---|
-| User experience | - Interaction modality<br>- Typical user query examples<br>- Expected response format/style<br>- Handling ambiguous/irrelevant queries | - Chat interface integrated with Slack<br>- Example queries: "How do I reduce cluster startup time?", "What kind of support plan do I have?"<br>- Clear, technical responses with code snippets and links to relevant documentation where appropriate<br>- Provide contextual suggestions and escalate to Databricks support engineers when needed |
-| Data | - Number and type of data sources<br>- Data format and location<br>- Data size and update frequency<br>- Data quality and consistency | - 3 data sources: Databricks documentation (HTML, PDF), resolved support tickets (JSON), community forum posts (Delta table)<br>- Data stored in Unity Catalog and updated weekly<br>- Total data size: 5 GB<br>- Consistent data structure and quality maintained by dedicated docs and support teams |
-| Performance | - Maximum acceptable latency<br>- Cost constraints<br>- Expected usage and concurrency | - Maximum latency: < 5 seconds<br>- Cost constraints: [confidential]<br>- Expected peak load: 200 concurrent users |
-| Evaluation | - Evaluation dataset availability<br>- Quality metrics<br>- User feedback collection | - SMEs from each product area will help review outputs and adjust incorrect answers to create the evaluation dataset<br>- Business KPIs: increase in support ticket resolution rate, decrease in user time spent per support ticket<br>- Quality metrics: LLM-judged answer correctness & relevance, LLM-judged retrieval precision, user upvote/downvote<br>- Feedback collection: Slack will be instrumented to provide a thumbs up / down |
-| Security | - Sensitive data handling<br>- Access control requirements | - No sensitive customer data should be in the retrieval source<br>- User authentication through Databricks Community SSO |
-| Deployment | - Integration with existing systems<br>- Deployment and versioning | - Integration with Databricks support ticket system<br>- Chain deployed as a Databricks Model Serving endpoint |
-
-## Build & Collect Feedback on POC
-
-```{image} ../images/5-hands-on/3_img.png
-:align: center
-```
-
-The first step in evaluation-driven development is to build a proof of concept (POC). A POC offers several benefits:
-
-1. Provides a directional view on the feasibility of your use case with RAG
-2. Allows collecting initial feedback from stakeholders, which in turn enables you to create the first version of your Evaluation Set
-3. Establishes a baseline measurement of quality to start to iterate from
-
-Databricks recommends building your POC using the simplest RAG chain architecture and our recommended defaults for each knob/parameter.
-
-> *!! Important: our recommended default parameters are by no means perfect, nor are they intended to be. Rather, they are a place to start from - the next steps of our workflow guide you through iterating on these parameters.*
->
-> *Why start from a simple POC? There are hundreds of possible combinations of knobs you can tune within your RAG application. You can easily spend weeks tuning these knobs, but if you do so before you can systematically evaluate your RAG, you'll end up in what we call the POC doom loop - iterating on settings, but with no way to objectively know if you made an improvement -- all while your stakeholders sit around impatiently waiting.*
-
-The POC templates in this guide are designed with quality iteration in mind - that is, they are parameterized with the knobs that our research has shown are most important to tune in order to improve RAG quality. Each knob has a smart default.
-
-Said differently, these templates are not "3 lines of code that magically make a RAG" - rather, they are a well-structured RAG application that can be tuned for quality in the following steps of an evaluation-driven development workflow.
-
-This enables you to quickly deploy a POC and then transition to quality iteration without needing to rewrite your code.
-
-### How to build a POC
-
-**Expected time:** 30-60 minutes
-
-**Requirements:**
-
-- Data from your [requirements](#requirements-questions) is available in your [Lakehouse](https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) inside a [Unity Catalog](https://www.databricks.com/product/unity-catalog) [volume](https://docs.databricks.com/en/connect/unity-catalog/volumes.html) or [Delta Table](https://docs.databricks.com/en/delta/index.html)
-- Access to a [Mosaic AI Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) endpoint [[instructions](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html)]
-- Write access to Unity Catalog schema
-- A single-user cluster with DBR 14.3+
-
-At the end of this step, you will have deployed the Quality Lab Review App, which allows your stakeholders to test and provide feedback on your POC. Detailed logs from your stakeholders' usage and their feedback will flow to Delta Tables in your Lakehouse.
-
-```{image} ../images/5-hands-on/4_img.png
-:align: center
-```
-
-Below is the technical architecture of the POC application.
-
-```{image} ../images/5-hands-on/5_img.png
-:align: center
-```
-
-By default, the POC uses the open source models available on [Mosaic AI Foundation Model Serving](https://www.databricks.com/product/pricing/foundation-model-serving). However, because the POC uses Mosaic AI Model Serving, which supports *any foundation model*, using a different model is easy - simply configure that model in Model Serving and then replace the `embedding_endpoint_name` and `llm_endpoint_name` parameters in the POC code.
-
-- [follow these steps for other open source models in the marketplace e.g., PT]
-- [follow these steps for models such as Azure OpenAI, OpenAI, Cohere, Anthropic, Google Gemini, etc e.g., external models]
-
-#### 1. Import the sample code.
-
-To get started, [import this Git Repository to your Databricks Workspace](https://docs.databricks.com/en/repos/index.html). This repository contains the entire set of sample code. Based on your data, select one of the following folders that contains the POC application code.
-
-| File type | Source | POC application folder |
-|----------------------------|------------------------|------------------------|
-| PDF files | UC Volume | |
-| JSON files w/ HTML content & metadata | UC Volume | |
-| Powerpoint files | UC Volume | |
-| DOCX files | UC Volume | |
-| HTML content | Delta Table | |
-| Markdown or regular text | Delta Table | |
-
-If you don't have any data ready, and just want to follow along using the Databricks Customer Support Bot example, you can use this pipeline which uses a Delta Table of the Databricks Docs stored as HTML.
-
-If your data doesn't meet one of the above requirements, [insert instructions on how to customize].
-
-Once you have imported the code, you will have the following notebooks:
-
-```{image} ../images/5-hands-on/6_img.png
-:align: center
-```
-
-#### 2. Configure your application
-
-Follow the instructions in the `00_config` Notebook to configure the following settings:
-
-1. `RAG_APP_NAME`: The name of the RAG application. This is used to name the chain's UC model and prepended to the output Delta Tables + Vector Indexes
-
-2. `UC_CATALOG` & `UC_SCHEMA`: [Create Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/create-catalogs.html#create-a-catalog) and a Schema where the output Delta Tables with the parsed/chunked documents and Vector Search indexes are stored
-
-3. `UC_MODEL_NAME`: Unity Catalog location to log and store the chain's model
-
-4. `VECTOR_SEARCH_ENDPOINT`: [Create Vector Search Endpoint](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html#create-a-vector-search-endpoint) to host the resulting vector index
-
-5. `SOURCE_PATH`: [Create Volumes](https://docs.databricks.com/en/connect/unity-catalog/volumes.html#create-and-work-with-volumes) for source documents as `SOURCE_PATH`
-
-6. `MLFLOW_EXPERIMENT_NAME`: MLflow Experiment to use for this application. Using the same experiment allows you to track runs across Notebooks and store a single history for your application.
-
-Run the `00_validate_config` Notebook to check that your configuration is valid and all resources are available. You will see a `rag_chain_config.yaml` file appear in your directory - we will use this file in step 4 to deploy the application.
-
-#### 3. Prepare your data.
-
-The POC data pipeline is a Databricks Notebook based on Apache Spark that provides a default implementation of the parameters outlined below.
-
-To run this pipeline and generate your initial Vector Index:
-
-1. Open the `02_poc_data_pipeline` Notebook and connect it to your single-user cluster
-
-2. Press Run All to execute the data pipeline
-
-3. In the last cell of the notebook, you can see the resulting Delta Tables and Vector Index.
-
-```{image} ../images/5-hands-on/7_img.png
-:align: center
-```
-
-Parameters and their default values that are configured in `00_config`.
-
-| Knob | Description | Default value |
-|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
-| [Parsing strategy](/nbs/3-deep-dive.md#parsing) | Extracting relevant information from the raw data using appropriate parsing techniques | Varies based on document type, but generally an open source parsing library |
-| [Chunking strategy](/nbs/3-deep-dive.md#chunking) | Breaking down the parsed data into smaller, manageable chunks for efficient retrieval | Token Text Splitter, which splits text using a chunk size of 4,000 tokens and a stride of 500 tokens. |
-| [Embedding model](/nbs/3-deep-dive.md#embedding-model) | Converting the chunked text data into a numerical vector representation that captures its semantic meaning | GTE-Large-v1.5 on the Databricks FMAPI pay-per-token |
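-
-A sketch of what this default chunking might look like, assuming LangChain's `TokenTextSplitter` and interpreting the 500-token stride as chunk overlap (the POC notebook's actual implementation may differ):
-
-```python
-from langchain.text_splitter import TokenTextSplitter
-
-# Placeholder document text standing in for the parsed output of the data pipeline.
-parsed_document_text = "Databricks documentation text goes here. " * 2000
-
-splitter = TokenTextSplitter(chunk_size=4000, chunk_overlap=500)
-chunks = splitter.split_text(parsed_document_text)
-print(f"{len(chunks)} chunks produced")
-```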
-
-#### 4. Deploy the POC chain to the Quality Lab Review App
-
-The POC chain is a RAG chain that provides a default implementation of the parameters outlined below.
-
-> Note: The POC Chain uses MLflow code-based logging. To understand more about code-based logging, [link to docs].
-
-1. Open the `03_deploy_poc_to_review_app` Notebook
-
-2. Run each cell of the Notebook.
-
-3. You will see the MLflow Trace that shows you how the POC application works. Adjust the input question to one that is relevant to your use case, and re-run the cell to "vibe check" the application.
-
-```{image} ../images/5-hands-on/8_img.png
-:align: center
-```
-
-4. Modify the default instructions to be relevant to your use case.
-
-```python
- instructions_to_reviewer = f"""## Instructions for Testing the {RAG_APP_NAME}'s Initial Proof of Concept (PoC)
-
- Your inputs are invaluable for the development team. By providing detailed feedback and corrections, you help us fix issues and improve the overall quality of the application. We rely on your expertise to identify any gaps or areas needing enhancement.
-
- 1. **Variety of Questions**:
- - Please try a wide range of questions that you anticipate the end users of the application will ask. This helps us ensure the application can handle the expected queries effectively.
-
- 2. **Feedback on Answers**:
- - After asking each question, use the feedback widgets provided to review the answer given by the application.
- - If you think the answer is incorrect or could be improved, please use "Edit Answer" to correct it. Your corrections will enable our team to refine the application's accuracy.
-
- 3. **Review of Returned Documents**:
- - Carefully review each document that the system returns in response to your question.
- - Use the thumbs up/down feature to indicate whether the document was relevant to the question asked. A thumbs up signifies relevance, while a thumbs down indicates the document was not useful.
-
- Thank you for your time and effort in testing {RAG_APP_NAME}. Your contributions are essential to delivering a high-quality product to our end users."""
-
- print(instructions_to_reviewer)
-```
-
-
-5. Run the deployment cell to get a link to the Review App.
-
-```{image} ../images/5-hands-on/9_img.png
-:align: center
-```
-
-6. Grant individual users permissions to access the Review App.
-
-```{image} ../images/5-hands-on/10_img.png
-:align: center
-```
-
-7. Test the Review App by asking a few questions yourself and providing feedback.
- - You can view the data in Delta Tables. Note that results can take up to 2 hours to appear in the Delta Tables.
-
-Parameters and their default values configured in 00_config:
-
-| Knob | Description | Default value |
-|------|-------------|---------------|
-| [Query understanding](/nbs/3-deep-dive.md#query-understanding) | Analyzing and transforming user queries to better represent intent and extract relevant information, such as filters or keywords, to improve the retrieval process. | None, the provided query is directly embedded. |
-| [Retrieval](/nbs/3-deep-dive.md#retrieval) | Finding the most relevant chunks of information given a retrieval query. In the unstructured data case, this typically involves one or a combination of semantic or keyword-based search. | Semantic search with K = 5 chunks retrieved |
-| [Prompt augmentation](/nbs/3-deep-dive.md#prompt-augmentation) | Combining a user query with retrieved information and instructions to guide the LLM towards generating high-quality responses. | A simple RAG prompt template |
-| [LLM](/nbs/3-deep-dive.md#llm) | Selecting the most appropriate model (and model parameters) for your application to optimize/balance performance, latency, and cost. | Databricks-dbrx-instruct hosted using Databricks FMAPI pay-per-token |
-| [Post processing & guardrails](/nbs/3-deep-dive.md#post-processing-guardrails) | Applying additional processing steps and safety measures to ensure the LLM-generated responses are on-topic, factually consistent, and adhere to specific guidelines or constraints. | None |
-
-#### 5. Share the Review App with stakeholders
-
-You can now share your POC RAG application with your stakeholders to get their feedback.
-
-We suggest distributing your POC to at least 3 stakeholders and having them each ask 10 - 20 questions. It is important to have multiple stakeholders test your POC so you can have a diverse set of perspectives to include in your Evaluation Set.
-
-## Evaluate the POC's quality
-
-```{image} ../images/5-hands-on/11_img.png
-:align: center
-```
-
-**Expected time:** 30-60 minutes
-
-**Requirements:**
-
-- Stakeholders have used your POC and provided feedback
-- All requirements from [POC step](#how-to-build-a-poc)
- - Data from your [requirements](#requirements-questions) is available in your [Lakehouse](https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) inside a [Unity Catalog](https://www.databricks.com/product/unity-catalog) [volume](https://docs.databricks.com/en/connect/unity-catalog/volumes.html) or [Delta Table](https://docs.databricks.com/en/delta/index.html)
- - Access to a [Mosaic AI Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) endpoint [[instructions](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html)]
- - Write access to Unity Catalog schema
- - A single-user cluster with DBR 14.3+
-
-Now that your stakeholders have used your POC, we can use their feedback to measure the POC's quality and establish a baseline.
-
-#### 1. ETL the logs to an Evaluation Set & run evaluation
-
-1. Open the `04_evaluate_poc_quality` Notebook.
-
-2. Adjust the configuration at the top to point to your Review App's logs.
-
-3. Run the cell to create an initial Evaluation Set that includes
- - 3 types of logs
- 1. Requests with a 👍 :
- - `request`: As entered by the user
- - `expected_response`: If the user edited the response, that is used, otherwise, the model's generated response.
- 2. Requests with a 👎 :
- - `request`: As entered by the user
- - `expected_response`: If the user edited the response, that is used, otherwise, null.
- 3. Requests without any feedback
- - `request`: As entered by the user
- - Across all types of requests, if the user 👍 a chunk from the `retrieved_context`, the `doc_uri` of that chunk is included in `expected_retrieved_context` for the question.
-
-> note: Databricks recommends that your Evaluation Set contains at least 30 questions to get started.
-
-4. Inspect the Evaluation Set to understand the data that is included. You need to validate that your Evaluation Set contains a representative and challenging set of questions.
-
-5. Optionally, save your evaluation set to a Delta Table for later use
-
-6. Evaluate the POC with Quality Lab's LLM Judge-based evaluation. Open MLflow to view the results.
-
-```{image} ../images/5-hands-on/12_img.png
-:align: center
-```
-
-```{image} ../images/5-hands-on/13_img.png
-:align: center
-```
-
-#### 2. Review evaluation results
-
-1. Now, let's open MLflow to inspect the results.
-
-2. In the Run tab, we can see each of the computed metrics. Refer to [metrics overview] section for an explanation of what each metric tells you about your application.
-
-3. In the Evaluation tab, we can inspect the questions, RAG application's outputs, and each of the LLM judge's assessments.
-
-Now that you have a baseline understanding of the POC's quality, we can shift focus to identifying the root causes of any quality issues and iteratively improving the app.
-
-It is worth noting: if the results meet your requirements for quality, you can skip directly to the Deployment section.
-
-## Improve RAG quality
-
-```{image} ../images/5-hands-on/14_img.png
-:align: center
-```
-
-While a basic RAG chain is relatively straightforward to implement, refining it to consistently produce high-quality outputs is often non-trivial. Identifying the root causes of issues and determining which levers of the solution to pull to improve output quality requires understanding the various components and their interactions.
-
-Simply vectorizing a set of documents, retrieving them via semantic search, and passing the retrieved documents to an LLM is not sufficient to guarantee optimal results. To yield high-quality outputs, you need to consider factors such as (but not limited to) chunking strategy of documents, choice of LLM and model parameters, or whether to include a query understanding step. As a result, ensuring high quality RAG outputs will generally involve iterating over both the data pipeline (e.g., chunking) and the RAG chain itself (e.g., choice of LLM).
-
-This section is divided into 3 steps:
-
-1. Understand RAG quality improvement levers
-2. Identify the root cause of quality issues
-3. Implement and evaluate fixes to the identified root cause
-
-### **Step 1:** Understand RAG quality improvement levers
-
-From a conceptual point of view, it's helpful to view RAG quality issues through the lens of two key aspects:
-
-- **Retrieval quality**
- - Are you retrieving the most relevant information for a given retrieval query?
- - It's difficult to generate high quality RAG output if the context provided to the LLM is missing important information or contains superfluous information.
-- **Generation quality**
- - Given the retrieved information and the original user query, is the LLM generating the most accurate, coherent, and helpful response possible?
- - Issues here can manifest as hallucinations, inconsistent output, or failure to directly address the user query.
-
-From an implementation standpoint, we can divide our RAG solution into two components which can be iterated on to address quality challenges:
-
-[**Data pipeline**](/nbs/3-deep-dive.md#data-pipeline-1)
-
-```{image} ../images/5-hands-on/15_img.png
-:align: center
-```
-
-- What is the composition of the input data corpus?
-- How is raw data extracted and transformed into a usable format (e.g., parsing a PDF document)?
-- How are documents split into smaller chunks, and how are those chunks formatted (e.g., chunking strategy, chunk size)?
-- What metadata (e.g., section title, document title) is extracted about each document/chunk? How is this metadata included (or not included) in each chunk?
-- Which embedding model is used to convert text into vector representations for similarity search?
-
-[**RAG chain**](/nbs/3-deep-dive.md#rag-chain)
-
-```{image} ../images/5-hands-on/16_img.png
-:align: center
-```
-
-- The choice of LLM and its parameters (e.g., temperature, max tokens)
-- The retrieval parameters (e.g., number of chunks/documents retrieved)
-- The retrieval approach (e.g., keyword vs. hybrid vs. semantic search, rewriting the user's query, transforming a user's query into filters, re-ranking)
-- How to format the prompt with retrieved context, to guide the LLM towards desired output
-
-It's tempting to assume a clean division between retrieval issues (simply update the data pipeline) and generation issues (update the RAG chain). However, the reality is more nuanced. Retrieval quality can be influenced by *both* the data pipeline (e.g., parsing/chunking strategy, metadata strategy, embedding model) and the RAG chain (e.g., user query transformation, number of chunks retrieved, re-ranking). Similarly, generation quality will invariably be impacted by poor retrieval (e.g., irrelevant or missing information affecting model output).
-
-This overlap underscores the need for a holistic approach to RAG quality improvement. By understanding which components to change across both the data pipeline and RAG chain, and how these changes affect the overall solution, you can make targeted updates to improve RAG output quality.
-
-### **Step 2:** Identify the root cause of quality issues
-
-#### Retrieval quality
-
-##### Debugging retrieval quality
-
-Retrieval quality is arguably the most important component of a RAG application. If the most relevant chunks are not returned for a given query, the LLM will not have access to the necessary information to generate a high-quality response. Poor retrieval can thus lead to irrelevant, incomplete, or hallucinated output.
-
-As discussed in [Section 4: Evaluation](/nbs/4-evaluation), metrics such as precision and recall can be calculated using a set of evaluation queries and corresponding ground-truth chunks/documents. If evaluation results indicate that relevant chunks are not being returned, you will need to investigate further to identify the root cause. This step requires manual effort to analyze the underlying data. With Mosaic AI, this becomes considerably easier given the tight integration between the data platform (Unity Catalog and Vector Search), and experiment tracking (MLflow LLM evaluation and MLflow tracing).
-
-Here's a step-by-step process to address **retrieval quality** issues:
-
-1. Identify a set of test queries with low retrieval quality metrics.
-
-2. For each query, manually examine the retrieved chunks and compare them to the ground-truth retrieval documents.
-
-3. Look for patterns or common issues among the queries with low retrieval quality. Some examples might include:
- - Relevant information is missing from the vector database entirely
- - Insufficient number of chunks/documents returned for a retrieval query
- - Chunks are too small and lack sufficient context
- - Chunks are too large and contain multiple, unrelated topics
- - The embedding model fails to capture semantic similarity for domain-specific terms
-
-4. Based on the identified issue, hypothesize potential root causes and corresponding fixes. See the "[Common reasons for poor retrieval quality](#common-reasons-for-poor-retrieval-quality)" table below for guidance on this.
-
-5. Implement the proposed fix for the most promising or impactful root cause, following [step 3](#step-3-implement-and-evaluate-changes). This may involve modifying the data pipeline (e.g., adjusting chunk size, trying a different embedding model) or the RAG chain (e.g., implementing hybrid search, retrieving more chunks).
-
-6. Re-run the evaluation on the updated system and compare the retrieval quality metrics to the previous version. Once retrieval quality is at a desired level, proceed to evaluating generation quality (see [Debugging generation quality](#debugging-generation-quality)).
-
-7. If retrieval quality is still not satisfactory, repeat steps 4-6 for the next most promising fixes until the desired performance is achieved.
-
-##### Common reasons for poor retrieval quality
-
-Each of these potential fixes is tagged as one of three types of change. Based on the type of change, you will follow different steps in section 3.
-
-| Retrieval Issue | Debugging Steps | Potential Fix |
-|---|---|---|
-| Chunks are too small | Examine chunks for incomplete, cut-off information | Increase chunk size and/or overlap<br>Try a different chunking strategy |
-| Chunks are too large | Check if retrieved chunks contain multiple, unrelated topics | Decrease chunk size<br>Improve the chunking strategy to avoid mixtures of unrelated topics (e.g., semantic chunking) |
-| Chunks don't have enough information about the text from which they were taken | Assess if the lack of context for each chunk is causing confusion or ambiguity in the retrieved results | Add metadata and titles to each chunk (e.g., section titles)<br>Retrieve more chunks, and use an LLM with a larger context size |
-| Embedding model doesn't accurately understand the domain and/or key phrases in user queries | Check if semantically similar chunks are being retrieved for the same query | Try different embedding models<br>Fine-tune the embedding model on domain-specific data |
-| Limited retrieval quality due to the embedding model's lack of domain understanding | Look at retrieved results to check if they are semantically relevant but miss key domain-specific information | Use hybrid search<br>Over-fetch retrieval results and re-rank them; only feed the top re-ranked results into the LLM context |
-| Relevant information missing from the vector database | Check if any relevant documents or sections are missing from the vector database | Add more relevant documents to the vector database<br>Improve document parsing and metadata extraction |
-| Retrieval queries are poorly formulated | If user queries are being used directly for semantic search, analyze them for ambiguity or lack of specificity; this happens easily in multi-turn conversations, where the raw user query references earlier parts of the conversation and is unsuitable to use directly as a retrieval query<br>Check if query terms match the terminology used in the search corpus | Add query expansion or transformation approaches (i.e., given a user query, transform it prior to semantic search)<br>Add query understanding to identify intent and entities (e.g., use an LLM to extract properties to use in metadata filtering) |
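-
-To make one of these fixes concrete, the sketch below illustrates the "over-fetch and re-rank" approach: retrieve more candidates than you need, re-score them, and pass only the top results to the LLM. The `retriever` and `reranker` callables are placeholders (assumptions) for your actual Vector Search retriever and reranking model.
-
-```python
-from typing import Callable
-
-# Placeholder types: a retriever returns [{"doc_id": ..., "text": ...}, ...]
-# and a reranker returns one relevance score per candidate text.
-Retriever = Callable[[str, int], list]
-Reranker = Callable[[str, list], list]
-
-def retrieve_with_rerank(query: str, retriever: Retriever, reranker: Reranker,
-                         fetch_k: int = 50, top_k: int = 5) -> list:
-    candidates = retriever(query, fetch_k)                      # over-fetch
-    scores = reranker(query, [c["text"] for c in candidates])   # re-score candidates
-    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
-    return [doc for doc, _ in ranked[:top_k]]                   # only top_k reach the LLM
-```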
-
-#### Generation quality
-
-##### Debugging generation quality
-
-Even with optimal retrieval, if the LLM component of a RAG chain cannot effectively utilize the retrieved context to generate accurate, coherent, and relevant responses, the final output quality will suffer. These issues can manifest as hallucinations, inconsistencies, or a failure to concisely address the user's query, among others.
-
-To identify generation quality issues, you can use the approach outlined in the [Evaluation section](#section-4-evaluation). If evaluation results indicate poor generation quality (e.g., low accuracy, coherence, or relevance scores), you'll need to investigate further to identify the root cause.
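-
-As one way to operationalize this, the hedged sketch below runs an offline evaluation over a static table of questions, ground-truth answers, and the chain's generated responses, so you can sort the per-query results and pick out low-quality examples. The data and column names are illustrative, and the exact `mlflow.evaluate` arguments and built-in metrics vary by MLflow version.
-
-```python
-import mlflow
-import pandas as pd
-
-# Hypothetical evaluation set: questions, ground-truth answers, and the
-# responses produced by the current version of the RAG chain.
-eval_df = pd.DataFrame(
-    {
-        "inputs": ["How do I create a Vector Search index?"],
-        "ground_truth": ["Use the Vector Search API to create an index on a Delta table ..."],
-        "predictions": ["You can create an index by ..."],
-    }
-)
-
-with mlflow.start_run(run_name="generation-quality-eval"):
-    results = mlflow.evaluate(
-        data=eval_df,
-        predictions="predictions",
-        targets="ground_truth",
-        model_type="question-answering",  # built-in QA metrics; LLM-judge metrics
-                                          # can be added via `extra_metrics`
-    )
-    per_query = results.tables["eval_results_table"]  # per-row scores for triage
-    print(per_query.head())
-```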
-
-The following is a step-by-step process to address **generation quality** issues:
-
-1. Identify a set of test queries with low generation quality metrics.
-
-2. For each query, manually examine the generated response and compare it to the retrieved context and the ground-truth response.
-
-3. Look for patterns or common issues among the queries with low generation quality. Some examples:
- - Generating information not present in the retrieved context or outputting contradicting information with respect to the retrieved context (i.e., hallucination)
- - Failure to directly address the user's query given the provided retrieved context
- - Generating responses that are overly verbose, difficult to understand or lack logical coherence
-
-4. Based on the identified issues, hypothesize potential root causes and corresponding fixes. See the "[Common reasons for poor generation quality](#common-reasons-for-poor-generation-quality)" table below for guidance.
-
-5. Implement the proposed fix for the most promising or impactful root cause. This may involve modifying the RAG chain (e.g., adjusting the prompt template, trying a different LLM) or the data pipeline (e.g., adjusting the chunking strategy to provide more context).
-
-6. Re-run the evaluation on the updated system and compare generation quality metrics to the previous version. If there is significant improvement, consider deploying the updated RAG application for further testing with end-users (see the [Deployment](#deployment) section).
-
-7. If the generation quality is still not satisfactory, repeat steps 4-6 for the next most promising fix until the desired performance is achieved.
-
-##### Common reasons for poor generation quality
-
-As with the retrieval fixes, each of these potential fixes falls into one of the three change types described in [Step 3](#step-3-implement-and-evaluate-changes): data pipeline changes, RAG chain configuration changes, or RAG chain code changes. The type of change determines which steps you follow there.
-
-| Generation Issue | Debugging Steps | Potential Fix |
-|---|---|---|
-| Generating information not present in the retrieved context (e.g., hallucinations) | Compare generated responses to the retrieved context to identify hallucinated information<br>Assess whether certain types of queries or retrieved context are more prone to hallucinations | Update the prompt template to emphasize reliance on the retrieved context<br>Use a more capable LLM<br>Implement a fact-checking or verification step post-generation |
-| Failure to directly address the user's query or providing overly generic responses | Compare generated responses to user queries to assess relevance and specificity<br>Check if certain types of queries result in the correct context being retrieved but the LLM producing low-quality output | Improve the prompt template to encourage direct, specific responses<br>Retrieve more targeted context by improving the retrieval process<br>Re-rank retrieval results to put the most relevant chunks first, and only provide these to the LLM<br>Use a more capable LLM |
-| Generating responses that are difficult to understand or lack logical flow | Assess output for logical flow, grammatical correctness, and understandability<br>Analyze whether incoherence occurs more often with certain types of queries or when certain types of context are retrieved | Change the prompt template to encourage coherent, well-structured responses<br>Provide more context to the LLM by retrieving additional relevant chunks<br>Use a more capable LLM |
-| Generated responses are not in the desired format or style | Compare output to the expected format and style guidelines<br>Assess whether certain types of queries or retrieved context are more likely to result in format/style deviations | Update the prompt template to specify the desired output format and style<br>Implement a post-processing step to convert the generated response into the desired format<br>Add a step to validate output structure/style and return a fallback answer if needed<br>Use an LLM fine-tuned to provide outputs in a specific format or style |
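-
-As an example of the first fix above, "update the prompt template to emphasize reliance on the retrieved context" might look like the following sketch; the wording and the `{context}` / `{question}` template variables are illustrative, not a prescribed format.
-
-```python
-# Illustrative prompt template that pushes the LLM to stay grounded in the
-# retrieved context and to admit when the context is insufficient.
-GROUNDED_PROMPT_TEMPLATE = """You are an assistant that answers questions using ONLY the provided context.
-If the context does not contain the answer, say "I don't have enough information to answer that."
-Do not use outside knowledge.
-
-Context:
-{context}
-
-Question:
-{question}
-
-Answer concisely, and point to the part of the context that supports your answer."""
-
-def build_prompt(question: str, retrieved_chunks: list) -> str:
-    context = "\n\n".join(retrieved_chunks)
-    return GROUNDED_PROMPT_TEMPLATE.format(context=context, question=question)
-```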
-
-### **Step 3:** Implement and evaluate changes
-
-As discussed above, when working to improve the quality of the RAG system, changes can be broadly categorized into three buckets:
-
-1. **Data pipeline changes**
-2. **RAG chain configuration changes**
-3. **RAG chain code changes**
-
-Depending on the specific issue you are trying to address, you may need to apply one or more of these types of changes. In some cases, simultaneous changes to both the data pipeline and the RAG chain may be necessary to achieve the desired quality improvements.
-
-#### Data pipeline changes
-
-**Data pipeline changes** involve modifying how input data is processed, transformed, or stored before being used by the RAG chain. Examples of data pipeline changes include (but are not limited to):
-
-- Trying a different chunking strategy
-- Iterating on the document parsing process
-- Changing the embedding model
-
-Implementing a data pipeline change will generally require re-running the entire pipeline to create a new vector index. This process involves reprocessing the input documents, regenerating the vector embeddings, and updating the vector index with new embeddings and metadata.
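-
-As a minimal illustration of the chunking example above, the sketch below shows a fixed-size character chunker with overlap; `chunk_size` and `chunk_overlap` are the knobs you would sweep when testing a "chunks are too small / too large" hypothesis. Production pipelines typically use a library text splitter, but the parameters are analogous.
-
-```python
-# Minimal illustrative chunker: fixed-size character windows with overlap.
-def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
-    if chunk_overlap >= chunk_size:
-        raise ValueError("chunk_overlap must be smaller than chunk_size")
-    chunks, start = [], 0
-    while start < len(text):
-        chunks.append(text[start : start + chunk_size])
-        start += chunk_size - chunk_overlap
-    return chunks
-
-# Example: re-chunking a parsed document with a larger window before re-indexing.
-doc_text = "..."  # parsed document text from your data pipeline
-chunks = chunk_text(doc_text, chunk_size=2000, chunk_overlap=400)
-```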
-
-#### RAG chain changes
-
-**RAG chain changes** involve modifying steps or parameters of the RAG chain itself, without necessarily changing the underlying vector database. Examples of RAG chain changes include (but are not limited to):
-
-- Changing the LLM
-- Modifying the prompt template
-- Adjusting the retrieval component (e.g., number of retrieval chunks, reranking, query expansion)
-- Introducing additional processing steps such as a query understanding step
-
-RAG chain updates may involve editing the **RAG chain configuration file** (e.g., changing the LLM parameters or prompt template), *or* modifying the actual **RAG chain code** (e.g., adding new processing steps or retrieval logic).
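-
-To make the distinction concrete, a RAG chain configuration might look roughly like the dictionary below (in practice it would typically live in a config file that is logged with the chain). The keys, endpoint name, and index name are illustrative assumptions, not a required schema.
-
-```python
-# Illustrative RAG chain configuration; all names and values are placeholders.
-rag_chain_config = {
-    "llm": {
-        "endpoint_name": "databricks-dbrx-instruct",   # hypothetical serving endpoint
-        "parameters": {"temperature": 0.1, "max_tokens": 500},
-    },
-    "retriever": {
-        "vector_search_index": "catalog.schema.docs_index",  # hypothetical index name
-        "num_results": 5,
-        "rerank": False,
-    },
-    "prompt_template": "...",  # e.g., the grounded prompt sketch shown earlier
-}
-```
-
-With a layout like this, bumping `num_results` or swapping the LLM endpoint would be a configuration change, whereas adding a re-ranking or query-understanding step would be a chain code change.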
-
-#### Testing a potential fix that could improve quality
-
-Once you have identified a potential fix based on the debugging process outlined above, follow these steps to test your changes:
-
-1. Make the necessary changes to the data pipeline or RAG chain code
- - See the [code examples](#code-examples) below for how and where to make these changes
- - If required, re-run the data pipeline to update the vector index with the new embeddings and metadata
-
-2. Log a new version of your chain to MLflow
-   - Ensure that any config files (i.e., for both your data pipeline and RAG chain) are logged to the MLflow run (see the MLflow sketch after this list)
-
-3. Run evaluation on this new chain
-
-4. Review evaluation results
-   - Analyze the evaluation metrics to determine if there has been an improvement in the RAG chain's performance
-   - Compare the traces and LLM judge results for individual queries before and after the changes to gain insights into the impact of your changes
-
-5. Iterate on the fixes
- - If the evaluation results do not show the desired improvement, iterate on your changes based on the insights gained from analysis.
- - Repeat steps 1-4 until you are satisfied with the improvement in the RAG chain's output quality
-
-6. Deploy the updated RAG chain for user feedback
- - Once evaluation results indicate improvement, register the chain to Unity Catalog and deploy the updated RAG chain via the Review App.
- - Gather feedback from stakeholders and end-users through one or both of the following:
- - Have stakeholders interact with the app directly in the RAG Studio UI and provide feedback on response quality
- - Generate responses using the updated chain for the set of evaluation queries and seek feedback on those specific responses
-
-7. Monitor and analyze user feedback
- - Review these results using a [dashboard](https://docs.databricks.com/en/dashboards/index.html#dashboards).
- - Monitor metrics such as the percentage of positive and negative feedback, as well as any specific comments or issues raised by users.
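-
-A hedged sketch of steps 1–3 with MLflow is shown below. The config file paths, the `rag_chain` object, and the evaluation DataFrame are placeholders to adapt to your own project, and the logging flavor depends on how your chain is built.
-
-```python
-import mlflow
-
-with mlflow.start_run(run_name="rag-chain-v2"):
-    # 1. Log the data pipeline and RAG chain configs so the run is reproducible.
-    mlflow.log_artifact("configs/data_pipeline_config.yaml")   # hypothetical paths
-    mlflow.log_artifact("configs/rag_chain_config.yaml")
-
-    # 2. Log the chain itself (flavor depends on how the chain is implemented,
-    #    e.g., mlflow.langchain.log_model or mlflow.pyfunc.log_model).
-    model_info = mlflow.pyfunc.log_model(artifact_path="chain", python_model=rag_chain)
-
-    # 3. Run evaluation on the new chain version and inspect the metrics.
-    results = mlflow.evaluate(
-        model=model_info.model_uri,
-        data=eval_df,                     # evaluation questions + ground truth
-        targets="ground_truth",
-        model_type="question-answering",
-    )
-    print(results.metrics)
-```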
-
-
-## Deployment
-
-```{image} ../images/5-hands-on/17_img.png
-:align: center
-```
\ No newline at end of file
diff --git a/genai_cookbook/requirements.txt b/genai_cookbook/requirements.txt
deleted file mode 100644
index 9fc0428..0000000
--- a/genai_cookbook/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-jupyter-book