From 91393e6bb51fd6c1a4a095e22066aae48a9e4a55 Mon Sep 17 00:00:00 2001 From: Siddharth Sahu <112792547+sahusiddharth@users.noreply.github.com> Date: Fri, 10 Jan 2025 12:56:10 +0530 Subject: [PATCH] docs: Fixed most of the broken links (#1830) --- docs/concepts/components/eval_dataset.md | 2 +- docs/concepts/index.md | 7 +- docs/concepts/metrics/overview/index.md | 8 +- docs/concepts/test_data_generation/rag.md | 3 +- docs/extra/components/choose_evaluator_llm.md | 4 +- docs/extra/components/choose_generator_llm.md | 4 +- docs/getstarted/evals.md | 4 +- docs/getstarted/rag_eval.md | 2 +- docs/getstarted/rag_testset_generation.md | 19 +- docs/howtos/applications/_cost.md | 60 +-- .../howtos/applications/_metrics_llm_calls.md | 57 +-- docs/howtos/applications/compare_llms.md | 2 +- docs/howtos/customizations/_run_config.md | 2 +- .../metrics/_modifying-prompts-metrics.md | 71 +-- .../metrics/_write_your_own_metric.md | 43 +- .../_write_your_own_metric_advanced.md | 2 +- .../_testgen-custom-single-hop.md | 95 ++-- .../testgenerator/_testgen-customisation.md | 75 +-- docs/howtos/index.md | 2 +- docs/howtos/integrations/_llamaindex.md | 4 +- docs/howtos/integrations/_zeno.md | 2 +- docs/howtos/integrations/index.md | 2 +- docs/howtos/integrations/langchain.md | 2 +- .../migrations/migrate_from_v01_to_v02.md | 2 +- docs/howtos/observability.md | 467 ++++++++++++++++++ 25 files changed, 663 insertions(+), 278 deletions(-) create mode 100644 docs/howtos/observability.md diff --git a/docs/concepts/components/eval_dataset.md b/docs/concepts/components/eval_dataset.md index 5f5a40a46b..09fd924edd 100644 --- a/docs/concepts/components/eval_dataset.md +++ b/docs/concepts/components/eval_dataset.md @@ -20,7 +20,7 @@ An evaluation dataset consists of: - **Define Clear Objectives**: Identify the specific aspects of the AI application that you want to evaluate and the scenarios you want to test. Collect data samples that reflect these objectives. -- **Collect Representative Data**: Ensure that the dataset covers a diverse range of scenarios, user inputs, and expected responses to provide a comprehensive evaluation of the AI application. This can be achieved by collecting data from various sources or [generating synthetic data](). +- **Collect Representative Data**: Ensure that the dataset covers a diverse range of scenarios, user inputs, and expected responses to provide a comprehensive evaluation of the AI application. This can be achieved by collecting data from various sources or [generating synthetic data](./../../howtos/customizations/index.md#testset-generation). - **Quality and Size**: Aim for a dataset that is large enough to provide meaningful insights but not so large that it becomes unwieldy. Ensure that the data is of high quality and accurately reflects the real-world scenarios you want to evaluate. diff --git a/docs/concepts/index.md b/docs/concepts/index.md index 0f7dfb3507..92405e5e08 100644 --- a/docs/concepts/index.md +++ b/docs/concepts/index.md @@ -9,7 +9,8 @@ Discover the various components used within Ragas. - Components like [Prompt Object](components/index.md#prompt-object), [Evaluation Dataset](components/index.md#evaluation-dataset) and [more..](components/index.md) + Components like [Prompt Object](components/prompt.md), [Evaluation Dataset](components/eval_dataset.md) and [more..](components/index.md) + - ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md) @@ -17,7 +18,7 @@ Explore available metrics and understand how they work. 
- Metrics for evaluating [RAG](metrics/index.md/#retrieval-augmented-generation), [Agentic workflows](metrics/index.md/#agents-or-tool-use-cases) and [more..](metrics/index.md/#list-of-available-metrics). + Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics). - :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md) @@ -25,7 +26,7 @@ Generate high-quality datasets for comprehensive testing. - Algorithms for synthesizing data to test [RAG](test_data_generation/index.md#retrieval-augmented-generation), [Agentic workflows](test_data_generation/index.md#agents-or-tool-use-cases) + Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md) - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md) diff --git a/docs/concepts/metrics/overview/index.md b/docs/concepts/metrics/overview/index.md index 063b0b1c10..889b6622db 100644 --- a/docs/concepts/metrics/overview/index.md +++ b/docs/concepts/metrics/overview/index.md @@ -18,14 +18,14 @@ A metric is a quantitative measure used to evaluate the performance of a AI appl      **LLM-based metrics**: These metrics use LLM underneath to do the evaluation. There might be one or more LLM calls that are performed to arrive at the score or result. These metrics can be somewhat non deterministic as the LLM might not always return the same result for the same input. On the other hand, these metrics has shown to be more accurate and closer to human evaluation. -All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a [LLM]() object to be set before scoring. +All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a LLM object to be set before scoring. ```python from ragas.metrics import FactualCorrectness scorer = FactualCorrectness(llm=evaluation_llm) ``` -Each LLM based metrics also will have prompts associated with it written using [Prompt Object](). +Each LLM based metrics also will have prompts associated with it written using [Prompt Object](./../../components/prompt.md).      **Non-LLM-based metrics**: These metrics do not use LLM underneath to do the evaluation. These metrics are deterministic and can be used to evaluate the performance of the AI application without using LLM. These metrics rely on traditional methods to evaluate the performance of the AI application, such as string similarity, BLEU score, etc. Due to the same, these metrics are known to have a lower correlation with human evaluation. @@ -34,7 +34,7 @@ All LLM based metrics in ragas are inherited from `Metric` class. **Metrics can be broadly classified into two categories based on the type of data they evaluate**: -     **Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. All metrics in ragas that supports single turn evaluation are inherited from `SingleTurnMetric` class and scored using `single_turn_ascore` method. It also expects a [Single Turn Sample]() object as input. +     **Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. 
All metrics in ragas that supports single turn evaluation are inherited from [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] class and scored using `single_turn_ascore` method. It also expects a [Single Turn Sample][ragas.dataset_schema.SingleTurnSample] object as input. ```python from ragas.metrics import FactualCorrectness @@ -43,7 +43,7 @@ scorer = FactualCorrectness() await scorer.single_turn_ascore(sample) ``` -     **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that supports multi turn evaluation are inherited from `MultiTurnMetric` class and scored using `multi_turn_ascore` method. It also expects a [Multi Turn Sample]() object as input. +     **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that supports multi turn evaluation are inherited from [MultiTurnMetric][ragas.metrics.base.MultiTurnMetric] class and scored using `multi_turn_ascore` method. It also expects a [Multi Turn Sample][ragas.dataset_schema.MultiTurnSample] object as input. ```python from ragas.metrics import AgentGoalAccuracy diff --git a/docs/concepts/test_data_generation/rag.md b/docs/concepts/test_data_generation/rag.md index 2b93e72efa..78d623c4bd 100644 --- a/docs/concepts/test_data_generation/rag.md +++ b/docs/concepts/test_data_generation/rag.md @@ -103,7 +103,7 @@ graph TD ### Extractors -Different extractors are used to extract information from each nodes that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractor that can be used are entity extractor to extract the entities like Company Name, Keyphrase extractor to extract important key phrases present in each node, etc. You can write your own [custom extractors]() to extract the information that is relevant to your domain. +Different extractors are used to extract information from each nodes that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractor that can be used are entity extractor to extract the entities like Company Name, Keyphrase extractor to extract important key phrases present in each node, etc. You can write your own custom extractors to extract the information that is relevant to your domain. Extractors can be LLM based which are inherited from `LLMBasedExtractor` or rule based which are inherited from `Extractor`. @@ -165,6 +165,7 @@ graph TD The extracted information is used to establish the relationship between the nodes. For example, in the case of financial documents, the relationship can be established between the nodes based on the entities present in the nodes. You can write your own [custom relationship builder]() to establish the relationship between the nodes based on the information that is relevant to your domain. +# Link missing above #### Example diff --git a/docs/extra/components/choose_evaluator_llm.md b/docs/extra/components/choose_evaluator_llm.md index 524b48c946..aacab8455a 100644 --- a/docs/extra/components/choose_evaluator_llm.md +++ b/docs/extra/components/choose_evaluator_llm.md @@ -126,7 +126,7 @@ evaluator_llm = LangchainLLMWrapper(your_llm_instance) ``` - For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models/). 
+ For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models.md). If you using LlamaIndex, you can use the `LlamaIndexLLMWrapper` to wrap your LLM so that it can be used with ragas. @@ -135,6 +135,6 @@ evaluator_llm = LlamaIndexLLMWrapper(your_llm_instance) ``` - For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](../../howtos/integrations/_llamaindex/). + For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](./../../howtos/integrations/_llamaindex.md). If your still not able use Ragas with your favorite LLM provider, please let us know by by commenting on this [issue](https://github.com/explodinggradients/ragas/issues/1617) and we'll add support for it πŸ™‚. \ No newline at end of file diff --git a/docs/extra/components/choose_generator_llm.md b/docs/extra/components/choose_generator_llm.md index 5047394442..fc3f967bd1 100644 --- a/docs/extra/components/choose_generator_llm.md +++ b/docs/extra/components/choose_generator_llm.md @@ -125,7 +125,7 @@ generator_llm = LangchainLLMWrapper(your_llm_instance) ``` - For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models/). + For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models.md). If you using LlamaIndex, you can use the `LlamaIndexLLMWrapper` to wrap your LLM so that it can be used with ragas. @@ -134,6 +134,6 @@ generator_llm = LlamaIndexLLMWrapper(your_llm_instance) ``` - For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](../../howtos/integrations/_llamaindex/). + For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](./../../howtos/integrations/_llamaindex.md). If your still not able use Ragas with your favorite LLM provider, please let us know by by commenting on this [issue](https://github.com/explodinggradients/ragas/issues/1617) and we'll add support for it πŸ™‚. \ No newline at end of file diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md index f5d9dfc6f8..8c06753571 100644 --- a/docs/getstarted/evals.md +++ b/docs/getstarted/evals.md @@ -7,7 +7,7 @@ The purpose of this guide is to illustrate a simple workflow for testing and eva In this guide, you will evaluate a **text summarization pipeline**. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information. -`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. +`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/available_metrics/index.md). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. ### Evaluating using a Non-LLM Metric @@ -203,7 +203,7 @@ To fix these results, ragas provides a way to align the metric with your prefere 2. **Download**: Save the annotated data using the `Annotated JSON` button in [app.ragas.io](https://app.ragas.io/). 3. **Train**: Use the annotated data to train your custom metric. 
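Before training, it can help to sanity-check the annotated export. The minimal sketch below assumes the sample file linked just after this step (swap in your own download from [app.ragas.io](https://app.ragas.io/)) and makes no assumption about the export's field layout.

```python
import json

# Sample annotated export linked below; replace with your own download from app.ragas.io.
with open("sample_annotated_summary.json") as f:
    annotations = json.load(f)

# Quick sanity check before training: what shape is the export, and how many records does it hold?
# (Field names depend on the exported format, so nothing here assumes a schema.)
print(type(annotations), len(annotations))
```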
-To learn more about this, refer to how to [train your own metric guide](../howtos/customizations/metrics/train_your_own_metric.md) +To learn more about this, refer to how to [train your own metric guide](./../howtos/customizations/metrics/train_your_own_metric.md) [Download sample annotated JSON](../_static/sample_annotated_summary.json) diff --git a/docs/getstarted/rag_eval.md b/docs/getstarted/rag_eval.md index d1bd2ba894..e2c398de91 100644 --- a/docs/getstarted/rag_eval.md +++ b/docs/getstarted/rag_eval.md @@ -157,7 +157,7 @@ evaluation_dataset = EvaluationDataset.from_list(dataset) ## Evaluate -We have successfully collected the evaluation data. Now, we can evaluate our RAG system on the collected dataset using a set of commonly used RAG evaluation metrics. You may choose any model as [evaluator LLM](/docs/howtos/customizations/customize_models.md) for evaluation. +We have successfully collected the evaluation data. Now, we can evaluate our RAG system on the collected dataset using a set of commonly used RAG evaluation metrics. You may choose any model as [evaluator LLM](./../howtos/customizations/customize_models.md) for evaluation. ```python from ragas import evaluate diff --git a/docs/getstarted/rag_testset_generation.md b/docs/getstarted/rag_testset_generation.md index af8a8a36e5..101f2892d5 100644 --- a/docs/getstarted/rag_testset_generation.md +++ b/docs/getstarted/rag_testset_generation.md @@ -31,7 +31,7 @@ docs = loader.load() ### Choose your LLM -You may choose to use any [LLM of your choice](../howtos/customizations/customize_models.md) +You may choose to use any [LLM of your choice](./../howtos/customizations/customize_models.md) --8<-- choose_generator_llm.md --8<-- @@ -55,9 +55,10 @@ Once you have generated a testset, you would want to view it and select the quer dataset.to_pandas() ``` +Output ![testset](./testset_output.png) -You can also use other tools like [app.ragas.io](https://app.ragas.io/) or any other similar tools available for you in the [Integrations](../howtos/integrations/index.md) section. +You can also use other tools like [app.ragas.io](https://app.ragas.io/) or any other similar tools available for you in the [Integrations](./../howtos/integrations/index.md) section. In order to use the [app.ragas.io](https://app.ragas.io/) dashboard, you need to have an account on [app.ragas.io](https://app.ragas.io/). If you don't have one, you can sign up for one [here](https://app.ragas.io/login). You will also need to have a [Ragas APP token](https://app.ragas.io/settings/api-keys). @@ -93,6 +94,7 @@ from ragas.testset.graph import KnowledgeGraph kg = KnowledgeGraph() ``` +Output ``` KnowledgeGraph(nodes: 0, relationships: 0) ``` @@ -110,6 +112,7 @@ for doc in docs: ) ) ``` +Output ``` KnowledgeGraph(nodes: 10, relationships: 0) ``` @@ -137,6 +140,8 @@ kg.save("knowledge_graph.json") loaded_kg = KnowledgeGraph.load("knowledge_graph.json") loaded_kg ``` + +Output ``` KnowledgeGraph(nodes: 48, relationships: 605) ``` @@ -158,11 +163,13 @@ from ragas.testset.synthesizers import default_query_distribution query_distribution = default_query_distribution(generator_llm) ``` + +Output ``` [ - (SingleHopSpecificQuerySynthesizer(llm=llm), 0.5), - (MultiHopAbstractQuerySynthesizer(llm=llm), 0.25), - (MultiHopSpecificQuerySynthesizer(llm=llm), 0.25), + (SingleHopSpecificQuerySynthesizer(llm=llm), 0.5), + (MultiHopAbstractQuerySynthesizer(llm=llm), 0.25), + (MultiHopSpecificQuerySynthesizer(llm=llm), 0.25), ] ``` @@ -172,5 +179,5 @@ Now we can generate the testset. 
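# Note: generate() samples scenarios from the knowledge graph and hands them to the
# synthesizers in roughly the 0.5 / 0.25 / 0.25 proportions set by query_distribution,
# so testset_size=10 yields about five single-hop and five multi-hop queries.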
testset = generator.generate(testset_size=10, query_distribution=query_distribution) testset.to_pandas() ``` - +Output ![testset](./testset_output.png) diff --git a/docs/howtos/applications/_cost.md b/docs/howtos/applications/_cost.md index 8c895b352c..40db296884 100644 --- a/docs/howtos/applications/_cost.md +++ b/docs/howtos/applications/_cost.md @@ -24,12 +24,10 @@ from ragas.cost import get_token_usage_for_openai get_token_usage_for_openai(llm_result) ``` - - - - - TokenUsage(input_tokens=9, output_tokens=9, model='') - +Output +``` +TokenUsage(input_tokens=9, output_tokens=9, model='') +``` You can define your own or import parsers if they are defined. If you would like to suggest parser for LLM providers or contribute your own ones please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) πŸ™‚. @@ -47,9 +45,10 @@ dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3") eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"]) ``` - - Repo card metadata block was not found. Setting CardData to empty. - +Output +``` +Repo card metadata block was not found. Setting CardData to empty. +``` You can pass in the parser to the `evaluate()` function and the cost will be calculated and returned in the `Result` object. @@ -67,21 +66,19 @@ result = evaluate( token_usage_parser=get_token_usage_for_openai, ) ``` - - - Evaluating: 0%| | 0/20 [00:00, - 'single_turn_prompt': } - +Output +``` +{'multi_turn_prompt': , + 'single_turn_prompt': } +``` @@ -29,9 +27,10 @@ scorer.get_prompts() prompts = scorer.get_prompts() print(prompts["single_turn_prompt"].to_string()) ``` - - Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context. - +Output +``` +Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context. +``` ### Modifying instruction in default prompt It is highly likely that one might want to modify the prompt to suit ones needs. Ragas provides `set_prompts` methods to allow you to do so. Let's change the one of the prompts used in `FactualCorrectness` metrics @@ -40,23 +39,21 @@ It is highly likely that one might want to modify the prompt to suit ones needs. ```python prompt = scorer.get_prompts()["single_turn_prompt"] prompt.instruction += "\nOnly output valid JSON." -``` - - -```python scorer.set_prompts(**{"single_turn_prompt": prompt}) ``` - +Output +``` Let's check if the prompt's instruction has actually changed - +``` ```python print(scorer.get_prompts()["single_turn_prompt"].instruction) ``` - - Given a input, system response and reference. Evaluate and score the response against the reference only using the given criteria. - Only output valid JSON. - +Output +``` +Given a input, system response and reference. Evaluate and score the response against the reference only using the given criteria. +Only output valid JSON. +``` ### Modifying examples in default prompt Few shot examples can greatly influence the outcome of any LLM. It is highly likely that the examples in default prompt may not reflect your domain or use-case. So it's always an good practice to modify with your custom examples. 
Let's do one here @@ -64,16 +61,14 @@ Few shot examples can greatly influence the outcome of any LLM. It is highly lik ```python prompt = scorer.get_prompts()["single_turn_prompt"] - prompt.examples ``` - - - - [(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the director of Los Alamos Laboratory?', response='Einstein was the director of Los Alamos Laboratory.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='The director of Los Alamos Laboratory was J. Robert Oppenheimer.'), - SimpleCriteriaOutput(reason='The response and reference have two very different answers.', score=0))] - +Output +``` +[(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the director of Los Alamos Laboratory?', response='Einstein was the director of Los Alamos Laboratory.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='The director of Los Alamos Laboratory was J. Robert Oppenheimer.'), + SimpleCriteriaOutput(reason='The response and reference have two very different answers.', score=0))] +``` @@ -82,10 +77,7 @@ from ragas.metrics._simple_criteria import ( SingleTurnSimpleCriteriaWithReferenceInput, SimpleCriteriaOutput, ) -``` - -```python new_example = [ ( SingleTurnSimpleCriteriaWithReferenceInput( @@ -100,25 +92,16 @@ new_example = [ ), ) ] -``` - -```python prompt.examples = new_example -``` - - -```python scorer.set_prompts(**{"single_turn_prompt": prompt}) -``` - - -```python print(scorer.get_prompts()["single_turn_prompt"].examples) ``` - [(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the first president of the United States?', response='Thomas Jefferson was the first president of the United States.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='George Washington was the first president of the United States.'), SimpleCriteriaOutput(reason='The response incorrectly states Thomas Jefferson instead of George Washington. While both are significant historical figures, the answer does not match the reference.', score=2))] - +Output +``` +[(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the first president of the United States?', response='Thomas Jefferson was the first president of the United States.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='George Washington was the first president of the United States.'), SimpleCriteriaOutput(reason='The response incorrectly states Thomas Jefferson instead of George Washington. 
While both are significant historical figures, the answer does not match the reference.', score=2))] +``` Let's now view and verify the full new prompt with modified instruction and examples diff --git a/docs/howtos/customizations/metrics/_write_your_own_metric.md b/docs/howtos/customizations/metrics/_write_your_own_metric.md index 0df90e4469..83826fffcf 100644 --- a/docs/howtos/customizations/metrics/_write_your_own_metric.md +++ b/docs/howtos/customizations/metrics/_write_your_own_metric.md @@ -61,12 +61,10 @@ hallucinations_binary = AspectCritic( await hallucinations_binary.single_turn_ascore(eval_dataset[0]) ``` - - - - - 0 - +Output +``` +0 +``` ## Domain Specific Metrics or Rubric based Metrics @@ -98,12 +96,10 @@ hallucinations_rubric = RubricsScore( await hallucinations_rubric.single_turn_ascore(eval_dataset[0]) ``` - - - - - 3 - +Output +``` +3 +``` ## Custom Metrics @@ -173,11 +169,10 @@ hallucinations_metric = HallucinationsMetric(llm=evaluator_llm) await hallucinations_metric.single_turn_ascore(eval_dataset[0]) ``` - - - - 0.5 - +Output +``` +0.5 +``` Now let's evaluate the entire dataset with the metrics we have created. @@ -194,14 +189,11 @@ results = evaluate( ```python -results -``` - - - - - {'hallucinations_metric': 0.5932, 'hallucinations_rubric': 3.1500, 'hallucinations_binary': 0.1000} +Output +``` +{'hallucinations_metric': 0.5932, 'hallucinations_rubric': 3.1500, 'hallucinations_binary': 0.1000} +``` @@ -210,8 +202,7 @@ results_df = results.to_pandas() results_df.head() ``` - - +Output
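Once the scores are in a DataFrame, plain pandas is enough for a first error-analysis pass. The sketch below assumes the column names match the metric names shown above (and that, as the name suggests, a higher `hallucinations_metric` score means more hallucination); adjust them to whatever `results_df.columns` reports for your run.

```python
# Keep the five samples the custom metric flagged most strongly, for manual review.
worst_cases = results_df.sort_values("hallucinations_metric", ascending=False).head(5)
print(worst_cases[["user_input", "hallucinations_metric", "hallucinations_binary"]])

# Persist the full per-sample table so this run can be compared against later ones.
results_df.to_csv("hallucination_eval_results.csv", index=False)
```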