Commit

docs: migrating over to mkdocs (explodinggradients#1301)
Moving the existing documentation over to mkdocs with the Material theme

- started using Tabs for defining LLMs and embeddings from different
providers

docs-site: [Ragas](https://ragas--1301.org.readthedocs.build/en/1301/)
reference repo:
[joelk9895/ragasDocs](https://github.com/joelk9895/ragasDocs)
jjmachan authored Sep 23, 2024
1 parent b845a62 commit 95dc939
Showing 59 changed files with 966 additions and 732 deletions.
17 changes: 4 additions & 13 deletions .readthedocs.yml
@@ -1,25 +1,16 @@
version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.11"
# You can also specify other tool versions:
# nodejs: "20"
# rust: "1.70"
# golang: "1.20"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
configuration: ./docs/conf.py
# You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
# builder: "dirhtml"
# Fail on all warnings to avoid broken references
# fail_on_warning: true
mkdocs:
configuration: mkdocs.yml

python:
install:
- requirements: ./requirements/docs.txt
- method: pip
path: .
extra_requirements:
- docs
8 changes: 2 additions & 6 deletions Makefile
@@ -33,12 +33,8 @@ test-e2e: ## Run end2end tests
@pytest --nbmake tests/e2e -s

# Docs
docs-site: ## Build and serve documentation
@sphinx-build -nW --keep-going -j 4 -b html $(GIT_ROOT)/docs/ $(GIT_ROOT)/docs/_build/html
@python -m http.server --directory $(GIT_ROOT)/docs/_build/html
watch-docs: ## Build and watch documentation
rm -rf $(GIT_ROOT)/docs/_build/{html, jupyter_execute}
sphinx-autobuild docs docs/_build/html --watch $(GIT_ROOT)/src/ --ignore "_build" --open-browser
docsite: ## Build and serve documentation
@mkdocs serve --dirty
rewrite-docs: ## Use GPT4 to rewrite the documentation
@echo "Rewriting the documentation in directory $(DIR)..."
@python $(GIT_ROOT)/docs/python alphred.py --directory $(DIR)
19 changes: 19 additions & 0 deletions docs/_static/js/mathjax.js
@@ -0,0 +1,19 @@
window.MathJax = {
tex: {
inlineMath: [["\\(", "\\)"]],
displayMath: [["\\[", "\\]"]],
processEscapes: true,
processEnvironments: true
},
options: {
ignoreHtmlClass: ".*|",
processHtmlClass: "arithmatex"
}
};

document$.subscribe(() => {
MathJax.startup.output.clearCache()
MathJax.typesetClear()
MathJax.texReset()
MathJax.typesetPromise()
})
10 changes: 1 addition & 9 deletions docs/community/index.md
@@ -1,17 +1,9 @@
(community)=
# ❤️ Community

**"Alone we can do so little; together we can do so much." - Helen Keller**
> "Alone we can do so little; together we can do so much." - Helen Keller
Our project thrives on the vibrant energy, diverse skills, and shared passion of our community. It's not just about code; it's about people coming together to create something extraordinary. This space celebrates every contribution, big or small, and features the amazing people who make it all happen.

:::{note}
**📅 Upcoming Events**

- [Greg Loughnane's](https://www.youtube.com/@AI-Makerspace) YT live event on RAG eval with LangChain and RAGAS on [Feb 7](https://lu.ma/theartofrag)
:::


## **🌟  Contributors**

Meet some of our outstanding members who made significant contributions!
@@ -1,11 +1,10 @@
(mdd)=
# Metrics-Driven Development
# Evaluation Driven Development

While creating a fundamental LLM application may be straightforward, the challenge lies in its ongoing maintenance and continuous enhancement. Ragas' vision is to facilitate the continuous improvement of LLM and RAG applications by embracing the ideology of Metrics-Driven Development (MDD).
While creating a fundamental LLM application may be straightforward, the challenge lies in its ongoing maintenance and continuous enhancement. Ragas' vision is to facilitate the continuous improvement of LLM and RAG applications by embracing the ideology of Evaluation Driven Development (EDD).

MDD is a product development approach that relies on data to make well-informed decisions. This approach entails the ongoing monitoring of essential metrics over time, providing valuable insights into an application's performance.
EDD is a product development approach that relies on data to make well-informed decisions. This approach entails the ongoing monitoring of essential metrics over time, providing valuable insights into an application's performance.

Our mission is to establish an open-source standard for applying MDD to LLM and RAG applications.
Our mission is to establish an open-source standard for applying EDD to LLM and RAG applications.

- [**Evaluation**](../getstarted/evaluation.md): This enables you to assess LLM applications and conduct experiments in a metric-assisted manner, ensuring high dependability and reproducibility.

1 change: 0 additions & 1 deletion docs/concepts/feedback.md
@@ -1,4 +1,3 @@
(user-feedback)=
# Utilizing User Feedback

User feedback can often be noisy and challenging to harness effectively. However, within the feedback, valuable signals exist that can be leveraged to iteratively enhance your LLM and RAG applications. These signals have the potential to be amplified effectively, aiding in the detection of specific issues within the pipeline and preventing recurring errors. Ragas is equipped to assist you in the analysis of user feedback data, enabling the discovery of patterns and making it a valuable resource for continual improvement.
69 changes: 23 additions & 46 deletions docs/concepts/index.md
@@ -1,16 +1,4 @@
(core-concepts)=
# 📚 Core Concepts
:::{toctree}
:caption: Concepts
:hidden:

metrics_driven
metrics/index
prompts
prompt_adaptation
testset_generation
feedback
:::

Ragas aims to create an open standard, providing developers with the tools and techniques to leverage continual learning in their RAG applications. With Ragas, you would be able to

@@ -20,46 +8,35 @@ Ragas aims to create an open standard, providing developers with the tools and t
4. Use these insights to iterate and improve your application.


(what-is-rag)=
:::{dropdown} what is RAG and continual learning?
```{rubric} RAG
```
## What is RAG and continual learning?
### RAG

Retrieval augmented generation (RAG) is a paradigm for augmenting LLM with custom data. It generally consists of two stages:

- indexing stage: preparing a knowledge base, and

- querying stage: retrieving relevant context from the knowledge to assist the LLM in responding to a question
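
A minimal sketch of the querying stage, just to make the flow concrete; `retriever` and `llm` below are placeholder objects for your own components, not Ragas APIs:

```python
# Illustrative-only sketch of a RAG querying step: retrieve context, then
# generate an answer grounded in it. `retriever.search` and `llm.generate`
# are placeholders, not real library calls.
def answer_with_rag(question, retriever, llm, top_k=3):
    contexts = retriever.search(question, top_k=top_k)  # querying stage: retrieval
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(contexts)
        + f"\n\nQuestion: {question}"
    )
    return llm.generate(prompt)                         # querying stage: generation
```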

```{rubric} Continual Learning
```
### Continual Learning

Continual learning is a concept used in machine learning that aims to learn, iterate, and improve ML pipelines over their lifetime using insights derived from a continuous stream of data points. In LLM and RAG applications, this can be applied by iterating on and improving each component of the application using insights derived from production and feedback data.
:::

::::{grid} 2

:::{grid-item-card} Metrics Driven Development
:link: mdd
:link-type: ref
What is MDD?
:::

:::{grid-item-card} Ragas Metrics
:link: ragas-metrics
:link-type: ref
What metrics are available? How do they work?
:::

:::{grid-item-card} Synthetic Test Data Generation
:link: testset-generation
:link-type: ref
How to create more datasets to test on?
:::

:::{grid-item-card} Utilizing User Feedback
:link: user-feedback
:link-type: ref
How to leverage the signals from user to improve?
:::
::::

<div class="grid cards" markdown>

- [Evaluation Driven Development](evaluation_driven.md)

What is EDD?

- [Ragas Metrics](metrics/index.md)

What metrics are available? How do they work?

- [Synthetic Test Data Generation](testset_generation.md)

How to create more datasets to test on?

- [Utilizing User Feedback](feedback.md)

How to leverage the signals from user to improve?

</div>
18 changes: 8 additions & 10 deletions docs/concepts/metrics/answer_correctness.md
@@ -5,21 +5,19 @@ The assessment of Answer Correctness involves gauging the accuracy of the genera
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a 'threshold' value to round the resulting score to binary, if desired.
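
As a rough illustration of the weighted scheme and optional threshold described above (the 0.75/0.25 split and the helper below are assumptions for illustration, not Ragas internals):

```python
# Illustrative sketch of combining factual and semantic scores into a single
# answer-correctness value. The default weights and the threshold handling
# are assumptions, not the library's internal implementation.
def combine_answer_correctness(factuality, semantic_similarity,
                               weights=(0.75, 0.25), threshold=None):
    score = weights[0] * factuality + weights[1] * semantic_similarity
    if threshold is not None:
        return float(score >= threshold)  # round to binary if a threshold is given
    return score

print(combine_answer_correctness(0.8, 0.9))                 # 0.825
print(combine_answer_correctness(0.8, 0.9, threshold=0.9))  # 0.0
```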


```{hint}
Ground truth: Einstein was born in 1879 in Germany.
!!! example
**Ground truth**: Einstein was born in 1879 in Germany.

High answer correctness: In 1879, Einstein was born in Germany.
**High answer correctness**: In 1879, Einstein was born in Germany.

Low answer correctness: Einstein was born in Spain in 1879.
**Low answer correctness**: Einstein was born in Spain in 1879.

```

## Example

```{code-block} python
:caption: Answer correctness with custom weights for each variable
```python
from datasets import Dataset
from ragas.metrics import faithfulness, answer_correctness
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
@@ -50,9 +48,9 @@ In the second example:
Now, we can use the formula for the F1 score to quantify correctness based on the number of statements in each of these lists:


```{math}
$$
\text{F1 Score} = {|\text{TP}| \over {(|\text{TP}| + 0.5 \times (|\text{FP}| + |\text{FN}|))}}
```
$$
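
To make the formula concrete, a quick sketch with made-up statement counts (the values below are illustrative, not the ones from the collapsed example above):

```python
# Illustrative F1-style factual score from statement counts.
tp = 1  # statements present in both the answer and the ground truth
fp = 1  # statements in the answer but not supported by the ground truth
fn = 1  # ground-truth statements missing from the answer

f1 = tp / (tp + 0.5 * (fp + fn))
print(f1)  # 0.5
```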

Next, we calculate the semantic similarity between the generated answer and the ground truth. Read more about it [here](./semantic_similarity.md).

28 changes: 12 additions & 16 deletions docs/concepts/metrics/answer_relevance.md
@@ -4,12 +4,13 @@ The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the

The Answer Relevancy is defined as the mean cosine similarity of the original `question` to a number of artificial questions, which were generated (reverse engineered) based on the `answer`:

```{math}
$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} cos(E_{g_i}, E_o)
````
```{math}
$$

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
````
$$

Where:

@@ -19,25 +20,21 @@ Where:

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, since cosine similarity ranges from -1 to 1.

:::{note}
This is a reference-free metric. If you're looking to compare the ground truth answer with the generated answer, refer to [answer_correctness](./answer_correctness.md)
:::
!!! note
This is a reference-free metric. If you're looking to compare the ground truth answer with the generated answer, refer to [answer_correctness](./answer_correctness.md)

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
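
A rough sketch of this calculation; `embed` is a placeholder embedding function, and in practice the generated questions come from prompting the LLM with the answer:

```python
import numpy as np

# Illustrative sketch: mean cosine similarity between the original question's
# embedding and the embeddings of questions generated from the answer.
# `embed` is a placeholder for your embedding model.
def mean_question_similarity(original_question, generated_questions, embed):
    e_o = np.asarray(embed(original_question), dtype=float)
    sims = []
    for q in generated_questions:
        e_g = np.asarray(embed(q), dtype=float)
        sims.append(e_g @ e_o / (np.linalg.norm(e_g) * np.linalg.norm(e_o)))
    return float(np.mean(sims))
```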

```{hint}
Question: Where is France and what is its capital?
!!! example
Question: Where is France and what is its capital?

Low relevance answer: France is in western Europe.
Low relevance answer: France is in western Europe.

High relevance answer: France is in western Europe and Paris is its capital.
```
High relevance answer: France is in western Europe and Paris is its capital.

## Example

```{code-block} python
:caption: Answer relevancy
```python
from datasets import Dataset
from ragas.metrics import answer_relevancy
from ragas import evaluate
@@ -51,7 +48,6 @@ data_samples = {
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_relevancy])
score.to_pandas()
```

## Calculation
32 changes: 14 additions & 18 deletions docs/concepts/metrics/context_entities_recall.md
@@ -4,23 +4,21 @@ This metric gives the measure of recall of the retrieved context, based on the n

To compute this metric, we use two sets, $GE$ and $CE$, defined as the set of entities present in `ground_truths` and the set of entities present in `contexts` respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements present in $GE$, given by the formula:

```{math}
:label: context_entity_recall
$$
\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
````
$$

```{hint}
**Ground truth**: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.
!!! example
**Ground truth**: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.

**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.

**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

````

## Example

```{code-block} python
```python
from datasets import Dataset
from ragas.metrics import context_entity_recall
from ragas import evaluate
@@ -45,19 +43,17 @@ Let us consider the ground truth and the contexts given above.
- Entities in context (CE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
- Entities in context (CE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity-recall
```{math}
:label: context_entity_recall
\text{context entity recall - 1} = \frac{| CE1 \cap GE |}{| GE |}

$$
\text{context entity recall 1} = \frac{| CE1 \cap GE |}{| GE |}
= 4/6
= 0.666
```
$$

```{math}
:label: context_entity_recall
\text{context entity recall - 2} = \frac{| CE2 \cap GE |}{| GE |}
$$
\text{context entity recall 2} = \frac{| CE2 \cap GE |}{| GE |}
= 1/6
= 0.166
```
$$

We can see that the first context had a higher entity recall, because it has better entity coverage given the ground truth. If these two contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
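
The two scores above can be reproduced with plain set arithmetic; the ground-truth entity set below is an illustrative assumption chosen to be consistent with the 4/6 and 1/6 results (the actual extracted list is collapsed in this diff):

```python
# Reproducing the context entity recall calculation with set arithmetic.
# `ge` is an assumed ground-truth entity set consistent with the worked
# example; `ce1` and `ce2` are copied from the steps above.
ge = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
ce1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
ce2 = {"Taj Mahal", "UNESCO", "India"}

recall_1 = len(ce1 & ge) / len(ge)  # 4/6
recall_2 = len(ce2 & ge) / len(ge)  # 1/6
print(round(recall_1, 3), round(recall_2, 3))  # 0.667 0.167
```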

