From 91393e6bb51fd6c1a4a095e22066aae48a9e4a55 Mon Sep 17 00:00:00 2001 From: Siddharth Sahu <112792547+sahusiddharth@users.noreply.github.com> Date: Fri, 10 Jan 2025 12:56:10 +0530 Subject: [PATCH] docs: Fixed most of the broken links (#1830) --- docs/concepts/components/eval_dataset.md | 2 +- docs/concepts/index.md | 7 +- docs/concepts/metrics/overview/index.md | 8 +- docs/concepts/test_data_generation/rag.md | 3 +- docs/extra/components/choose_evaluator_llm.md | 4 +- docs/extra/components/choose_generator_llm.md | 4 +- docs/getstarted/evals.md | 4 +- docs/getstarted/rag_eval.md | 2 +- docs/getstarted/rag_testset_generation.md | 19 +- docs/howtos/applications/_cost.md | 60 +-- .../howtos/applications/_metrics_llm_calls.md | 57 +-- docs/howtos/applications/compare_llms.md | 2 +- docs/howtos/customizations/_run_config.md | 2 +- .../metrics/_modifying-prompts-metrics.md | 71 +-- .../metrics/_write_your_own_metric.md | 43 +- .../_write_your_own_metric_advanced.md | 2 +- .../_testgen-custom-single-hop.md | 95 ++-- .../testgenerator/_testgen-customisation.md | 75 +-- docs/howtos/index.md | 2 +- docs/howtos/integrations/_llamaindex.md | 4 +- docs/howtos/integrations/_zeno.md | 2 +- docs/howtos/integrations/index.md | 2 +- docs/howtos/integrations/langchain.md | 2 +- .../migrations/migrate_from_v01_to_v02.md | 2 +- docs/howtos/observability.md | 467 ++++++++++++++++++ 25 files changed, 663 insertions(+), 278 deletions(-) create mode 100644 docs/howtos/observability.md diff --git a/docs/concepts/components/eval_dataset.md b/docs/concepts/components/eval_dataset.md index 5f5a40a46b..09fd924edd 100644 --- a/docs/concepts/components/eval_dataset.md +++ b/docs/concepts/components/eval_dataset.md @@ -20,7 +20,7 @@ An evaluation dataset consists of: - **Define Clear Objectives**: Identify the specific aspects of the AI application that you want to evaluate and the scenarios you want to test. Collect data samples that reflect these objectives. -- **Collect Representative Data**: Ensure that the dataset covers a diverse range of scenarios, user inputs, and expected responses to provide a comprehensive evaluation of the AI application. This can be achieved by collecting data from various sources or [generating synthetic data](). +- **Collect Representative Data**: Ensure that the dataset covers a diverse range of scenarios, user inputs, and expected responses to provide a comprehensive evaluation of the AI application. This can be achieved by collecting data from various sources or [generating synthetic data](./../../howtos/customizations/index.md#testset-generation). - **Quality and Size**: Aim for a dataset that is large enough to provide meaningful insights but not so large that it becomes unwieldy. Ensure that the data is of high quality and accurately reflects the real-world scenarios you want to evaluate. diff --git a/docs/concepts/index.md b/docs/concepts/index.md index 0f7dfb3507..92405e5e08 100644 --- a/docs/concepts/index.md +++ b/docs/concepts/index.md @@ -9,7 +9,8 @@ Discover the various components used within Ragas. - Components like [Prompt Object](components/index.md#prompt-object), [Evaluation Dataset](components/index.md#evaluation-dataset) and [more..](components/index.md) + Components like [Prompt Object](components/prompt.md), [Evaluation Dataset](components/eval_dataset.md) and [more..](components/index.md) + - ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md) @@ -17,7 +18,7 @@ Explore available metrics and understand how they work. 
- Metrics for evaluating [RAG](metrics/index.md/#retrieval-augmented-generation), [Agentic workflows](metrics/index.md/#agents-or-tool-use-cases) and [more..](metrics/index.md/#list-of-available-metrics). + Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics). - :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md) @@ -25,7 +26,7 @@ Generate high-quality datasets for comprehensive testing. - Algorithms for synthesizing data to test [RAG](test_data_generation/index.md#retrieval-augmented-generation), [Agentic workflows](test_data_generation/index.md#agents-or-tool-use-cases) + Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md) - :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md) diff --git a/docs/concepts/metrics/overview/index.md b/docs/concepts/metrics/overview/index.md index 063b0b1c10..889b6622db 100644 --- a/docs/concepts/metrics/overview/index.md +++ b/docs/concepts/metrics/overview/index.md @@ -18,14 +18,14 @@ A metric is a quantitative measure used to evaluate the performance of a AI appl      **LLM-based metrics**: These metrics use LLM underneath to do the evaluation. There might be one or more LLM calls that are performed to arrive at the score or result. These metrics can be somewhat non deterministic as the LLM might not always return the same result for the same input. On the other hand, these metrics has shown to be more accurate and closer to human evaluation. -All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a [LLM]() object to be set before scoring. +All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a LLM object to be set before scoring. ```python from ragas.metrics import FactualCorrectness scorer = FactualCorrectness(llm=evaluation_llm) ``` -Each LLM based metrics also will have prompts associated with it written using [Prompt Object](). +Each LLM based metrics also will have prompts associated with it written using [Prompt Object](./../../components/prompt.md).      **Non-LLM-based metrics**: These metrics do not use LLM underneath to do the evaluation. These metrics are deterministic and can be used to evaluate the performance of the AI application without using LLM. These metrics rely on traditional methods to evaluate the performance of the AI application, such as string similarity, BLEU score, etc. Due to the same, these metrics are known to have a lower correlation with human evaluation. @@ -34,7 +34,7 @@ All LLM based metrics in ragas are inherited from `Metric` class. **Metrics can be broadly classified into two categories based on the type of data they evaluate**: -     **Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. All metrics in ragas that supports single turn evaluation are inherited from `SingleTurnMetric` class and scored using `single_turn_ascore` method. It also expects a [Single Turn Sample]() object as input. +     **Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. 
All metrics in ragas that supports single turn evaluation are inherited from [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] class and scored using `single_turn_ascore` method. It also expects a [Single Turn Sample][ragas.dataset_schema.SingleTurnSample] object as input. ```python from ragas.metrics import FactualCorrectness @@ -43,7 +43,7 @@ scorer = FactualCorrectness() await scorer.single_turn_ascore(sample) ``` -     **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that supports multi turn evaluation are inherited from `MultiTurnMetric` class and scored using `multi_turn_ascore` method. It also expects a [Multi Turn Sample]() object as input. +     **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that supports multi turn evaluation are inherited from [MultiTurnMetric][ragas.metrics.base.MultiTurnMetric] class and scored using `multi_turn_ascore` method. It also expects a [Multi Turn Sample][ragas.dataset_schema.MultiTurnSample] object as input. ```python from ragas.metrics import AgentGoalAccuracy diff --git a/docs/concepts/test_data_generation/rag.md b/docs/concepts/test_data_generation/rag.md index 2b93e72efa..78d623c4bd 100644 --- a/docs/concepts/test_data_generation/rag.md +++ b/docs/concepts/test_data_generation/rag.md @@ -103,7 +103,7 @@ graph TD ### Extractors -Different extractors are used to extract information from each nodes that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractor that can be used are entity extractor to extract the entities like Company Name, Keyphrase extractor to extract important key phrases present in each node, etc. You can write your own [custom extractors]() to extract the information that is relevant to your domain. +Different extractors are used to extract information from each nodes that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractor that can be used are entity extractor to extract the entities like Company Name, Keyphrase extractor to extract important key phrases present in each node, etc. You can write your own custom extractors to extract the information that is relevant to your domain. Extractors can be LLM based which are inherited from `LLMBasedExtractor` or rule based which are inherited from `Extractor`. @@ -165,6 +165,7 @@ graph TD The extracted information is used to establish the relationship between the nodes. For example, in the case of financial documents, the relationship can be established between the nodes based on the entities present in the nodes. You can write your own [custom relationship builder]() to establish the relationship between the nodes based on the information that is relevant to your domain. +# Link missing above #### Example diff --git a/docs/extra/components/choose_evaluator_llm.md b/docs/extra/components/choose_evaluator_llm.md index 524b48c946..aacab8455a 100644 --- a/docs/extra/components/choose_evaluator_llm.md +++ b/docs/extra/components/choose_evaluator_llm.md @@ -126,7 +126,7 @@ evaluator_llm = LangchainLLMWrapper(your_llm_instance) ``` - For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models/). 
+ For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models.md). If you using LlamaIndex, you can use the `LlamaIndexLLMWrapper` to wrap your LLM so that it can be used with ragas. @@ -135,6 +135,6 @@ evaluator_llm = LlamaIndexLLMWrapper(your_llm_instance) ``` - For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](../../howtos/integrations/_llamaindex/). + For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](./../../howtos/integrations/_llamaindex.md). If your still not able use Ragas with your favorite LLM provider, please let us know by by commenting on this [issue](https://github.com/explodinggradients/ragas/issues/1617) and we'll add support for it πŸ™‚. \ No newline at end of file diff --git a/docs/extra/components/choose_generator_llm.md b/docs/extra/components/choose_generator_llm.md index 5047394442..fc3f967bd1 100644 --- a/docs/extra/components/choose_generator_llm.md +++ b/docs/extra/components/choose_generator_llm.md @@ -125,7 +125,7 @@ generator_llm = LangchainLLMWrapper(your_llm_instance) ``` - For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models/). + For a more detailed guide, checkout [the guide on customizing models](../../howtos/customizations/customize_models.md). If you using LlamaIndex, you can use the `LlamaIndexLLMWrapper` to wrap your LLM so that it can be used with ragas. @@ -134,6 +134,6 @@ generator_llm = LlamaIndexLLMWrapper(your_llm_instance) ``` - For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](../../howtos/integrations/_llamaindex/). + For more information on how to use LlamaIndex, please refer to the [LlamaIndex Integration guide](./../../howtos/integrations/_llamaindex.md). If your still not able use Ragas with your favorite LLM provider, please let us know by by commenting on this [issue](https://github.com/explodinggradients/ragas/issues/1617) and we'll add support for it πŸ™‚. \ No newline at end of file diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md index f5d9dfc6f8..8c06753571 100644 --- a/docs/getstarted/evals.md +++ b/docs/getstarted/evals.md @@ -7,7 +7,7 @@ The purpose of this guide is to illustrate a simple workflow for testing and eva In this guide, you will evaluate a **text summarization pipeline**. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information. -`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. +`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/available_metrics/index.md). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. ### Evaluating using a Non-LLM Metric @@ -203,7 +203,7 @@ To fix these results, ragas provides a way to align the metric with your prefere 2. **Download**: Save the annotated data using the `Annotated JSON` button in [app.ragas.io](https://app.ragas.io/). 3. **Train**: Use the annotated data to train your custom metric. 
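Before training, it can help to sanity-check the annotated export. The minimal sketch below assumes the sample file linked just after this step (swap in your own download from [app.ragas.io](https://app.ragas.io/)) and makes no assumption about the export's field layout.

```python
import json

# Sample annotated export linked below; replace with your own download from app.ragas.io.
with open("sample_annotated_summary.json") as f:
    annotations = json.load(f)

# Quick sanity check before training: what shape is the export, and how many records does it hold?
# (Field names depend on the exported format, so nothing here assumes a schema.)
print(type(annotations), len(annotations))
```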
-To learn more about this, refer to how to [train your own metric guide](../howtos/customizations/metrics/train_your_own_metric.md) +To learn more about this, refer to how to [train your own metric guide](./../howtos/customizations/metrics/train_your_own_metric.md) [Download sample annotated JSON](../_static/sample_annotated_summary.json) diff --git a/docs/getstarted/rag_eval.md b/docs/getstarted/rag_eval.md index d1bd2ba894..e2c398de91 100644 --- a/docs/getstarted/rag_eval.md +++ b/docs/getstarted/rag_eval.md @@ -157,7 +157,7 @@ evaluation_dataset = EvaluationDataset.from_list(dataset) ## Evaluate -We have successfully collected the evaluation data. Now, we can evaluate our RAG system on the collected dataset using a set of commonly used RAG evaluation metrics. You may choose any model as [evaluator LLM](/docs/howtos/customizations/customize_models.md) for evaluation. +We have successfully collected the evaluation data. Now, we can evaluate our RAG system on the collected dataset using a set of commonly used RAG evaluation metrics. You may choose any model as [evaluator LLM](./../howtos/customizations/customize_models.md) for evaluation. ```python from ragas import evaluate diff --git a/docs/getstarted/rag_testset_generation.md b/docs/getstarted/rag_testset_generation.md index af8a8a36e5..101f2892d5 100644 --- a/docs/getstarted/rag_testset_generation.md +++ b/docs/getstarted/rag_testset_generation.md @@ -31,7 +31,7 @@ docs = loader.load() ### Choose your LLM -You may choose to use any [LLM of your choice](../howtos/customizations/customize_models.md) +You may choose to use any [LLM of your choice](./../howtos/customizations/customize_models.md) --8<-- choose_generator_llm.md --8<-- @@ -55,9 +55,10 @@ Once you have generated a testset, you would want to view it and select the quer dataset.to_pandas() ``` +Output ![testset](./testset_output.png) -You can also use other tools like [app.ragas.io](https://app.ragas.io/) or any other similar tools available for you in the [Integrations](../howtos/integrations/index.md) section. +You can also use other tools like [app.ragas.io](https://app.ragas.io/) or any other similar tools available for you in the [Integrations](./../howtos/integrations/index.md) section. In order to use the [app.ragas.io](https://app.ragas.io/) dashboard, you need to have an account on [app.ragas.io](https://app.ragas.io/). If you don't have one, you can sign up for one [here](https://app.ragas.io/login). You will also need to have a [Ragas APP token](https://app.ragas.io/settings/api-keys). @@ -93,6 +94,7 @@ from ragas.testset.graph import KnowledgeGraph kg = KnowledgeGraph() ``` +Output ``` KnowledgeGraph(nodes: 0, relationships: 0) ``` @@ -110,6 +112,7 @@ for doc in docs: ) ) ``` +Output ``` KnowledgeGraph(nodes: 10, relationships: 0) ``` @@ -137,6 +140,8 @@ kg.save("knowledge_graph.json") loaded_kg = KnowledgeGraph.load("knowledge_graph.json") loaded_kg ``` + +Output ``` KnowledgeGraph(nodes: 48, relationships: 605) ``` @@ -158,11 +163,13 @@ from ragas.testset.synthesizers import default_query_distribution query_distribution = default_query_distribution(generator_llm) ``` + +Output ``` [ - (SingleHopSpecificQuerySynthesizer(llm=llm), 0.5), - (MultiHopAbstractQuerySynthesizer(llm=llm), 0.25), - (MultiHopSpecificQuerySynthesizer(llm=llm), 0.25), + (SingleHopSpecificQuerySynthesizer(llm=llm), 0.5), + (MultiHopAbstractQuerySynthesizer(llm=llm), 0.25), + (MultiHopSpecificQuerySynthesizer(llm=llm), 0.25), ] ``` @@ -172,5 +179,5 @@ Now we can generate the testset. 
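# Note: generate() samples scenarios from the knowledge graph and hands them to the
# synthesizers in roughly the 0.5 / 0.25 / 0.25 proportions set by query_distribution,
# so testset_size=10 yields about five single-hop and five multi-hop queries.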
testset = generator.generate(testset_size=10, query_distribution=query_distribution) testset.to_pandas() ``` - +Output ![testset](./testset_output.png) diff --git a/docs/howtos/applications/_cost.md b/docs/howtos/applications/_cost.md index 8c895b352c..40db296884 100644 --- a/docs/howtos/applications/_cost.md +++ b/docs/howtos/applications/_cost.md @@ -24,12 +24,10 @@ from ragas.cost import get_token_usage_for_openai get_token_usage_for_openai(llm_result) ``` - - - - - TokenUsage(input_tokens=9, output_tokens=9, model='') - +Output +``` +TokenUsage(input_tokens=9, output_tokens=9, model='') +``` You can define your own or import parsers if they are defined. If you would like to suggest parser for LLM providers or contribute your own ones please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) πŸ™‚. @@ -47,9 +45,10 @@ dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3") eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"]) ``` - - Repo card metadata block was not found. Setting CardData to empty. - +Output +``` +Repo card metadata block was not found. Setting CardData to empty. +``` You can pass in the parser to the `evaluate()` function and the cost will be calculated and returned in the `Result` object. @@ -67,21 +66,19 @@ result = evaluate( token_usage_parser=get_token_usage_for_openai, ) ``` - - - Evaluating: 0%| | 0/20 [00:00, - 'single_turn_prompt': } - +Output +``` +{'multi_turn_prompt': , + 'single_turn_prompt': } +``` @@ -29,9 +27,10 @@ scorer.get_prompts() prompts = scorer.get_prompts() print(prompts["single_turn_prompt"].to_string()) ``` - - Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context. - +Output +``` +Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context. +``` ### Modifying instruction in default prompt It is highly likely that one might want to modify the prompt to suit ones needs. Ragas provides `set_prompts` methods to allow you to do so. Let's change the one of the prompts used in `FactualCorrectness` metrics @@ -40,23 +39,21 @@ It is highly likely that one might want to modify the prompt to suit ones needs. ```python prompt = scorer.get_prompts()["single_turn_prompt"] prompt.instruction += "\nOnly output valid JSON." -``` - - -```python scorer.set_prompts(**{"single_turn_prompt": prompt}) ``` - +Output +``` Let's check if the prompt's instruction has actually changed - +``` ```python print(scorer.get_prompts()["single_turn_prompt"].instruction) ``` - - Given a input, system response and reference. Evaluate and score the response against the reference only using the given criteria. - Only output valid JSON. - +Output +``` +Given a input, system response and reference. Evaluate and score the response against the reference only using the given criteria. +Only output valid JSON. +``` ### Modifying examples in default prompt Few shot examples can greatly influence the outcome of any LLM. It is highly likely that the examples in default prompt may not reflect your domain or use-case. So it's always an good practice to modify with your custom examples. 
Let's do one here @@ -64,16 +61,14 @@ Few shot examples can greatly influence the outcome of any LLM. It is highly lik ```python prompt = scorer.get_prompts()["single_turn_prompt"] - prompt.examples ``` - - - - [(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the director of Los Alamos Laboratory?', response='Einstein was the director of Los Alamos Laboratory.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='The director of Los Alamos Laboratory was J. Robert Oppenheimer.'), - SimpleCriteriaOutput(reason='The response and reference have two very different answers.', score=0))] - +Output +``` +[(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the director of Los Alamos Laboratory?', response='Einstein was the director of Los Alamos Laboratory.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='The director of Los Alamos Laboratory was J. Robert Oppenheimer.'), + SimpleCriteriaOutput(reason='The response and reference have two very different answers.', score=0))] +``` @@ -82,10 +77,7 @@ from ragas.metrics._simple_criteria import ( SingleTurnSimpleCriteriaWithReferenceInput, SimpleCriteriaOutput, ) -``` - -```python new_example = [ ( SingleTurnSimpleCriteriaWithReferenceInput( @@ -100,25 +92,16 @@ new_example = [ ), ) ] -``` - -```python prompt.examples = new_example -``` - - -```python scorer.set_prompts(**{"single_turn_prompt": prompt}) -``` - - -```python print(scorer.get_prompts()["single_turn_prompt"].examples) ``` - [(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the first president of the United States?', response='Thomas Jefferson was the first president of the United States.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='George Washington was the first president of the United States.'), SimpleCriteriaOutput(reason='The response incorrectly states Thomas Jefferson instead of George Washington. While both are significant historical figures, the answer does not match the reference.', score=2))] - +Output +``` +[(SingleTurnSimpleCriteriaWithReferenceInput(user_input='Who was the first president of the United States?', response='Thomas Jefferson was the first president of the United States.', criteria='Score responses in range of 0 (low) to 5 (high) based similarity with reference.', reference='George Washington was the first president of the United States.'), SimpleCriteriaOutput(reason='The response incorrectly states Thomas Jefferson instead of George Washington. 
While both are significant historical figures, the answer does not match the reference.', score=2))] +``` Let's now view and verify the full new prompt with modified instruction and examples diff --git a/docs/howtos/customizations/metrics/_write_your_own_metric.md b/docs/howtos/customizations/metrics/_write_your_own_metric.md index 0df90e4469..83826fffcf 100644 --- a/docs/howtos/customizations/metrics/_write_your_own_metric.md +++ b/docs/howtos/customizations/metrics/_write_your_own_metric.md @@ -61,12 +61,10 @@ hallucinations_binary = AspectCritic( await hallucinations_binary.single_turn_ascore(eval_dataset[0]) ``` - - - - - 0 - +Output +``` +0 +``` ## Domain Specific Metrics or Rubric based Metrics @@ -98,12 +96,10 @@ hallucinations_rubric = RubricsScore( await hallucinations_rubric.single_turn_ascore(eval_dataset[0]) ``` - - - - - 3 - +Output +``` +3 +``` ## Custom Metrics @@ -173,11 +169,10 @@ hallucinations_metric = HallucinationsMetric(llm=evaluator_llm) await hallucinations_metric.single_turn_ascore(eval_dataset[0]) ``` - - - - 0.5 - +Output +``` +0.5 +``` Now let's evaluate the entire dataset with the metrics we have created. @@ -194,14 +189,11 @@ results = evaluate( ```python -results -``` - - - - - {'hallucinations_metric': 0.5932, 'hallucinations_rubric': 3.1500, 'hallucinations_binary': 0.1000} +Output +``` +{'hallucinations_metric': 0.5932, 'hallucinations_rubric': 3.1500, 'hallucinations_binary': 0.1000} +``` @@ -210,8 +202,7 @@ results_df = results.to_pandas() results_df.head() ``` - - +Output
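Once the scores are in a DataFrame, plain pandas is enough for a first error-analysis pass. The sketch below assumes the column names match the metric names shown above (and that, as the name suggests, a higher `hallucinations_metric` score means more hallucination); adjust them to whatever `results_df.columns` reports for your run.

```python
# Keep the five samples the custom metric flagged most strongly, for manual review.
worst_cases = results_df.sort_values("hallucinations_metric", ascending=False).head(5)
print(worst_cases[["user_input", "hallucinations_metric", "hallucinations_binary"]])

# Persist the full per-sample table so this run can be compared against later ones.
results_df.to_csv("hallucination_eval_results.csv", index=False)
```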