Commit

docs: migrating over to mkdocs (explodinggradients#1301)
Moving the existing documentation over to mkdocs with the Material theme

- started using Tabs for defining LLMs and embeddings from different
providers

docs-site: [Ragas](https://ragas--1301.org.readthedocs.build/en/1301/)
reference repo:
[joelk9895/ragasDocs](https://github.com/joelk9895/ragasDocs)
jjmachan authored Sep 23, 2024
1 parent b845a62 commit 95dc939
Showing 59 changed files with 966 additions and 732 deletions.
17 changes: 4 additions & 13 deletions .readthedocs.yml
@@ -1,25 +1,16 @@
version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.11"
# You can also specify other tool versions:
# nodejs: "20"
# rust: "1.70"
# golang: "1.20"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
configuration: ./docs/conf.py
# You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
# builder: "dirhtml"
# Fail on all warnings to avoid broken references
# fail_on_warning: true
mkdocs:
configuration: mkdocs.yml

python:
install:
- requirements: ./requirements/docs.txt
- method: pip
path: .
extra_requirements:
- docs
8 changes: 2 additions & 6 deletions Makefile
@@ -33,12 +33,8 @@ test-e2e: ## Run end2end tests
@pytest --nbmake tests/e2e -s

# Docs
docs-site: ## Build and serve documentation
@sphinx-build -nW --keep-going -j 4 -b html $(GIT_ROOT)/docs/ $(GIT_ROOT)/docs/_build/html
@python -m http.server --directory $(GIT_ROOT)/docs/_build/html
watch-docs: ## Build and watch documentation
rm -rf $(GIT_ROOT)/docs/_build/{html, jupyter_execute}
sphinx-autobuild docs docs/_build/html --watch $(GIT_ROOT)/src/ --ignore "_build" --open-browser
docsite: ## Build and serve documentation
@mkdocs serve --dirty
rewrite-docs: ## Use GPT4 to rewrite the documentation
@echo "Rewriting the documentation in directory $(DIR)..."
@python $(GIT_ROOT)/docs/python alphred.py --directory $(DIR)
19 changes: 19 additions & 0 deletions docs/_static/js/mathjax.js
@@ -0,0 +1,19 @@
window.MathJax = {
tex: {
inlineMath: [["\\(", "\\)"]],
displayMath: [["\\[", "\\]"]],
processEscapes: true,
processEnvironments: true
},
options: {
ignoreHtmlClass: ".*|",
processHtmlClass: "arithmatex"
}
};

document$.subscribe(() => {
MathJax.startup.output.clearCache()
MathJax.typesetClear()
MathJax.texReset()
MathJax.typesetPromise()
})
10 changes: 1 addition & 9 deletions docs/community/index.md
@@ -1,17 +1,9 @@
(community)=
# ❤️ Community

**"Alone we can do so little; together we can do so much." - Helen Keller**
> "Alone we can do so little; together we can do so much." - Helen Keller
Our project thrives on the vibrant energy, diverse skills, and shared passion of our community. It's not just about code; it's about people coming together to create something extraordinary. This space celebrates every contribution, big or small, and features the amazing people who make it all happen.

:::{note}
**📅 Upcoming Events**

- [Greg Loughnane's](https://www.youtube.com/@AI-Makerspace) YT live event on RAG eval with LangChain and RAGAS on [Feb 7](https://lu.ma/theartofrag)
:::


## **🌟  Contributors**

Meet some of our outstanding members who made significant contributions!
@@ -1,11 +1,10 @@
(mdd)=
# Metrics-Driven Development
# Evaluation Driven Development

While creating a fundamental LLM application may be straightforward, the challenge lies in its ongoing maintenance and continuous enhancement. Ragas' vision is to facilitate the continuous improvement of LLM and RAG applications by embracing the ideology of Metrics-Driven Development (MDD).
While creating a fundamental LLM application may be straightforward, the challenge lies in its ongoing maintenance and continuous enhancement. Ragas' vision is to facilitate the continuous improvement of LLM and RAG applications by embracing the ideology of Evaluation Driven Development (EDD).

MDD is a product development approach that relies on data to make well-informed decisions. This approach entails the ongoing monitoring of essential metrics over time, providing valuable insights into an application's performance.
EDD is a product development approach that relies on data to make well-informed decisions. This approach entails the ongoing monitoring of essential metrics over time, providing valuable insights into an application's performance.

Our mission is to establish an open-source standard for applying MDD to LLM and RAG applications.
Our mission is to establish an open-source standard for applying EDD to LLM and RAG applications.

- [**Evaluation**](../getstarted/evaluation.md): This enables you to assess LLM applications and conduct experiments in a metric-assisted manner, ensuring high dependability and reproducibility.

1 change: 0 additions & 1 deletion docs/concepts/feedback.md
@@ -1,4 +1,3 @@
(user-feedback)=
# Utilizing User Feedback

User feedback can often be noisy and challenging to harness effectively. However, within the feedback, valuable signals exist that can be leveraged to iteratively enhance your LLM and RAG applications. These signals have the potential to be amplified effectively, aiding in the detection of specific issues within the pipeline and preventing recurring errors. Ragas is equipped to assist you in the analysis of user feedback data, enabling the discovery of patterns and making it a valuable resource for continual improvement.
69 changes: 23 additions & 46 deletions docs/concepts/index.md
@@ -1,16 +1,4 @@
(core-concepts)=
# 📚 Core Concepts
:::{toctree}
:caption: Concepts
:hidden:

metrics_driven
metrics/index
prompts
prompt_adaptation
testset_generation
feedback
:::

Ragas aims to create an open standard, providing developers with the tools and techniques to leverage continual learning in their RAG applications. With Ragas, you would be able to

@@ -20,46 +8,35 @@ Ragas aims to create an open standard, providing developers with the tools and t
4. Use these insights to iterate and improve your application.


(what-is-rag)=
:::{dropdown} what is RAG and continual learning?
```{rubric} RAG
```
## What is RAG and continual learning?
### RAG

Retrieval augmented generation (RAG) is a paradigm for augmenting LLM with custom data. It generally consists of two stages:

- indexing stage: preparing a knowledge base, and

- querying stage: retrieving relevant context from the knowledge to assist the LLM in responding to a question
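
A minimal sketch of the querying stage, just to make the flow concrete; `retriever` and `llm` below are placeholder objects for your own components, not Ragas APIs:

```python
# Illustrative-only sketch of a RAG querying step: retrieve context, then
# generate an answer grounded in it. `retriever.search` and `llm.generate`
# are placeholders, not real library calls.
def answer_with_rag(question, retriever, llm, top_k=3):
    contexts = retriever.search(question, top_k=top_k)  # querying stage: retrieval
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(contexts)
        + f"\n\nQuestion: {question}"
    )
    return llm.generate(prompt)                         # querying stage: generation
```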

```{rubric} Continual Learning
```
### Continual Learning

Continual learning is a concept used in machine learning that aims to learn, iterate, and improve ML pipelines over their lifetime using insights derived from a continuous stream of data points. In LLM and RAG applications, this can be applied by iterating on and improving each component of the application using insights derived from production and feedback data.
:::

::::{grid} 2

:::{grid-item-card} Metrics Driven Development
:link: mdd
:link-type: ref
What is MDD?
:::

:::{grid-item-card} Ragas Metrics
:link: ragas-metrics
:link-type: ref
What metrics are available? How do they work?
:::

:::{grid-item-card} Synthetic Test Data Generation
:link: testset-generation
:link-type: ref
How to create more datasets to test on?
:::

:::{grid-item-card} Utilizing User Feedback
:link: user-feedback
:link-type: ref
How to leverage the signals from user to improve?
:::
::::

<div class="grid cards" markdown>

- [Evaluation Driven Development](evaluation_driven.md)

What is EDD?

- [Ragas Metrics](metrics/index.md)

What metrics are available? How do they work?

- [Synthetic Test Data Generation](testset_generation.md)

How to create more datasets to test on?

- [Utilizing User Feedback](feedback.md)

How to leverage the signals from user to improve?

</div>
18 changes: 8 additions & 10 deletions docs/concepts/metrics/answer_correctness.md
@@ -5,21 +5,19 @@ The assessment of Answer Correctness involves gauging the accuracy of the genera
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a 'threshold' value to round the resulting score to binary, if desired.
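
As a rough illustration of the weighted scheme and optional threshold described above (the 0.75/0.25 split and the helper below are assumptions for illustration, not Ragas internals):

```python
# Illustrative sketch of combining factual and semantic scores into a single
# answer-correctness value. The default weights and the threshold handling
# are assumptions, not the library's internal implementation.
def combine_answer_correctness(factuality, semantic_similarity,
                               weights=(0.75, 0.25), threshold=None):
    score = weights[0] * factuality + weights[1] * semantic_similarity
    if threshold is not None:
        return float(score >= threshold)  # round to binary if a threshold is given
    return score

print(combine_answer_correctness(0.8, 0.9))                 # 0.825
print(combine_answer_correctness(0.8, 0.9, threshold=0.9))  # 0.0
```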


```{hint}
Ground truth: Einstein was born in 1879 in Germany.
!!! example
**Ground truth**: Einstein was born in 1879 in Germany.

High answer correctness: In 1879, Einstein was born in Germany.
**High answer correctness**: In 1879, Einstein was born in Germany.

Low answer correctness: Einstein was born in Spain in 1879.
**Low answer correctness**: Einstein was born in Spain in 1879.

```

## Example

```{code-block} python
:caption: Answer correctness with custom weights for each variable
```python
from datasets import Dataset
from ragas.metrics import faithfulness, answer_correctness
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
@@ -50,9 +48,9 @@ In the second example:
Now, we can use the formula for the F1 score to quantify correctness based on the number of statements in each of these lists:


```{math}
$$
\text{F1 Score} = {|\text{TP}| \over {(|\text{TP}| + 0.5 \times (|\text{FP}| + |\text{FN}|))}}
```
$$
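
To make the formula concrete, a quick sketch with made-up statement counts (the values below are illustrative, not the ones from the collapsed example above):

```python
# Illustrative F1-style factual score from statement counts.
tp = 1  # statements present in both the answer and the ground truth
fp = 1  # statements in the answer but not supported by the ground truth
fn = 1  # ground-truth statements missing from the answer

f1 = tp / (tp + 0.5 * (fp + fn))
print(f1)  # 0.5
```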

Next, we calculate the semantic similarity between the generated answer and the ground truth. Read more about it [here](./semantic_similarity.md).

28 changes: 12 additions & 16 deletions docs/concepts/metrics/answer_relevance.md
@@ -4,12 +4,13 @@ The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the

The Answer Relevancy is defined as the mean cosine similarity of the original `question` to a number of artificial questions, which were generated (reverse engineered) based on the `answer`:

```{math}
$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} cos(E_{g_i}, E_o)
````
```{math}
$$

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
````
$$

Where:

@@ -19,25 +20,21 @@ Where:

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, since cosine similarity ranges from -1 to 1.

:::{note}
This is a reference-free metric. If you're looking to compare the ground truth answer with the generated answer, refer to [answer_correctness](./answer_correctness.md)
:::
!!! note
This is a reference-free metric. If you're looking to compare the ground truth answer with the generated answer, refer to [answer_correctness](./answer_correctness.md)

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
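
A rough sketch of this calculation; `embed` is a placeholder embedding function, and in practice the generated questions come from prompting the LLM with the answer:

```python
import numpy as np

# Illustrative sketch: mean cosine similarity between the original question's
# embedding and the embeddings of questions generated from the answer.
# `embed` is a placeholder for your embedding model.
def mean_question_similarity(original_question, generated_questions, embed):
    e_o = np.asarray(embed(original_question), dtype=float)
    sims = []
    for q in generated_questions:
        e_g = np.asarray(embed(q), dtype=float)
        sims.append(e_g @ e_o / (np.linalg.norm(e_g) * np.linalg.norm(e_o)))
    return float(np.mean(sims))
```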

```{hint}
Question: Where is France and what is its capital?
!!! example
Question: Where is France and what is its capital?

Low relevance answer: France is in western Europe.
Low relevance answer: France is in western Europe.

High relevance answer: France is in western Europe and Paris is its capital.
```
High relevance answer: France is in western Europe and Paris is its capital.

## Example

```{code-block} python
:caption: Answer relevancy
```python
from datasets import Dataset
from ragas.metrics import answer_relevancy
from ragas import evaluate
@@ -51,7 +48,6 @@ data_samples = {
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_relevancy])
score.to_pandas()
```

## Calculation
32 changes: 14 additions & 18 deletions docs/concepts/metrics/context_entities_recall.md
@@ -4,23 +4,21 @@ This metric gives the measure of recall of the retrieved context, based on the n

To compute this metric, we use two sets, $GE$ and $CE$, defined as the set of entities present in `ground_truths` and the set of entities present in `contexts` respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements present in $GE$, given by the formula:

```{math}
:label: context_entity_recall
$$
\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
````
$$

```{hint}
**Ground truth**: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.
!!! example
**Ground truth**: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.

**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.

**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

````

## Example

```{code-block} python
```python
from datasets import Dataset
from ragas.metrics import context_entity_recall
from ragas import evaluate
@@ -45,19 +43,17 @@ Let us consider the ground truth and the contexts given above.
- Entities in context (CE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
- Entities in context (CE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity-recall
```{math}
:label: context_entity_recall
\text{context entity recall - 1} = \frac{| CE1 \cap GE |}{| GE |}

$$
\text{context entity recall 1} = \frac{| CE1 \cap GE |}{| GE |}
= 4/6
= 0.666
```
$$

```{math}
:label: context_entity_recall
\text{context entity recall - 2} = \frac{| CE2 \cap GE |}{| GE |}
$$
\text{context entity recall 2} = \frac{| CE2 \cap GE |}{| GE |}
= 1/6
= 0.166
```
$$

We can see that the first context had a higher entity recall, because it has better entity coverage given the ground truth. If these two contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
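
The two scores above can be reproduced with plain set arithmetic; the ground-truth entity set below is an illustrative assumption chosen to be consistent with the 4/6 and 1/6 results (the actual extracted list is collapsed in this diff):

```python
# Reproducing the context entity recall calculation with set arithmetic.
# `ge` is an assumed ground-truth entity set consistent with the worked
# example; `ce1` and `ce2` are copied from the steps above.
ge = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
ce1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
ce2 = {"Taj Mahal", "UNESCO", "India"}

recall_1 = len(ce1 & ge) / len(ge)  # 4/6
recall_2 = len(ce2 & ge) / len(ge)  # 1/6
print(round(recall_1, 3), round(recall_2, 3))  # 0.667 0.167
```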

