[ci skip] iter f18abd6
glemaitre committed Apr 19, 2024
1 parent 019d17a commit 3fd451f
Showing 9 changed files with 80 additions and 7 deletions.
Binary file added _images/api_doc_generated_html.png
Binary file added _images/naive_chunks.png
2 changes: 1 addition & 1 deletion _sources/user_guide/index.rst.txt
@@ -28,7 +28,7 @@ In this proof-of-concept (POC), we are interested in a "zero-shot" setting. It means
that we expect our user to formulate a question in natural language and the LLM will
generate an answer.

The way to query the LLM can be done in two ways: (i) through an API such when using
The way to query the LLM can be done in two ways: (i) through an API such as when using
GPT-* from OpenAI or (ii) by locally running the model using open-weight models such
as Mistral or LLama.

47 changes: 45 additions & 2 deletions _sources/user_guide/text_scraping.rst.txt
@@ -4,12 +4,55 @@
Text Scraping
=============

The scraping module provides some simple estimator that extract meaningful
documentation from the documentation website.
In a Retrieval Augmented Generation (RAG) framework, the "documents" retrieved and
provided to the Large Language Model (LLM) to generate an answer correspond to chunks
extracted from the documentation.

The first important aspect to be aware of is that the context window of the LLM is
limited. Therefore, we need to provide chunks of documentation that are short and
focused enough not to exceed this limit.

The most common strategy is therefore to extract chunks of text with a given number
of tokens and an overlap between consecutive chunks.

.. image:: /_static/img/diagram/naive_chunks.png
:width: 100%
:align: center
:class: transparent-image
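
The fixed-size chunking strategy described above can be sketched in a few lines. This
is a minimal, dependency-free illustration: it tokenizes on whitespace for simplicity,
whereas a real pipeline would use the tokenizer of the embedding model or LLM, and the
function name and default sizes are only illustrative.

```python
def naive_chunks(text, chunk_size=100, overlap=20):
    """Split ``text`` into chunks of ``chunk_size`` tokens, with ``overlap``
    tokens shared between consecutive chunks.

    Whitespace tokenization keeps the sketch self-contained; it is not what a
    production RAG pipeline would use.
    """
    tokens = text.split()
    step = chunk_size - overlap
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, max(len(tokens) - overlap, 1), step)
    ]

# Each chunk holds 4 tokens and shares 1 token with the next chunk.
chunks = naive_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
```
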

Most tutorials on building RAG models use this strategy. While it is a fast way to
get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
-----------------

By "API documentation", we refer to the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.

It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings
of the classes and functions. These docstrings follow the `numpydoc` formatting.
As an example, we show a generated HTML page containing the documentation of a
scikit-learn estimator:

.. image:: /_static/img/diagram/api_doc_generated_html.png
:width: 100%
:align: center
:class: transparent-image
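
To give an idea of how a numpydoc-formatted docstring decomposes into sections, here
is a deliberately simplified, stdlib-only stand-in for the parsing that `numpydoc`
itself performs: a section header is a line immediately followed by a dashed
underline of the same length. The function name and the docstring below are invented
for illustration; the actual extractor relies on `numpydoc` directly.

```python
def split_numpydoc_sections(docstring):
    """Split a numpydoc-formatted docstring into named sections.

    Simplified sketch: a section header is a line followed by a dashed
    underline of the same length; everything else belongs to the current
    section. Text before the first header goes into "Summary".
    """
    lines = docstring.strip().splitlines()
    sections, current = {"Summary": []}, "Summary"
    i = 0
    while i < len(lines):
        header = lines[i].strip()
        underline = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if header and underline == "-" * len(header):
            current = header
            sections[current] = []
            i += 2  # skip the header and its underline
        else:
            sections[current].append(lines[i])
            i += 1
    return {name: "\n".join(body).strip() for name, body in sections.items()}

# A made-up docstring following the numpydoc layout.
docstring = """Summary of a hypothetical estimator.

Parameters
----------
strategy : str, default="prior"
    Strategy used to generate predictions.

Returns
-------
self : object
    Fitted estimator.
"""
sections = split_numpydoc_sections(docstring)
```

Each section can then serve as the basis for a focused, self-contained chunk instead
of an arbitrary fixed-size window of tokens.
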

Before diving into the chunking mechanism, it is worth thinking about the type of
queries that such documentation can help answer. These documentation pages are
intended to provide information about class or function parameters, short code usage
snippets, and related classes or functions. The narration on these pages is
relatively short, and further discussion is generally provided in the user guide
instead. We would therefore expect these documentation chunks to be useful for
answering questions such as:

- What are the parameters of `LogisticRegression`?
- What are the values of the `strategy` parameter in a dummy classifier?

:class:`~ragger_duck.scraping.APINumPyDocExtractor` is a more advanced scraper
that uses `numpydoc` and its scraper to extract the documentation. Indeed, the
`numpydoc` scraper will parse the different sections and we build meaningful
Binary file added _static/img/diagram/api_doc_generated_html.png
Binary file added _static/img/diagram/naive_chunks.png
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion user_guide/index.html
@@ -453,7 +453,7 @@
<p>In this proof-of-concept (POC), we are interested in a “zero-shot” setting. It means
that we expect our user to formulate a question in natural language and the LLM will
generate an answer.</p>
<p>The way to query the LLM can be done in two ways: (i) through an API such when using
<p>The way to query the LLM can be done in two ways: (i) through an API such as when using
GPT-* from OpenAI or (ii) by locally running the model using open-weight models such
as Mistral or LLama.</p>
<p>Now, let’s introduce the RAG framework.</p>
34 changes: 32 additions & 2 deletions user_guide/text_scraping.html
@@ -444,10 +444,40 @@

<section id="text-scraping">
<span id="id1"></span><h1>Text Scraping<a class="headerlink" href="#text-scraping" title="Link to this heading">#</a></h1>
<p>The scraping module provides some simple estimator that extract meaningful
documentation from the documentation website.</p>
<p>In a Retrieval Augmented Generation (RAG) framework, the “documents” retrieved and
provided to the Large Language Model (LLM) to generate an answer correspond to chunks
extracted from the documentation.</p>
<p>The first important aspect to be aware of is that the context window of the LLM is
limited. Therefore, we need to provide chunks of documentation that are short and
focused enough not to exceed this limit.</p>
<p>The most common strategy is therefore to extract chunks of text with a given number of
tokens and an overlap between consecutive chunks.</p>
<a class="transparent-image reference internal image-reference" href="../_images/naive_chunks.png"><img alt="../_images/naive_chunks.png" class="transparent-image align-center" src="../_images/naive_chunks.png" style="width: 100%;" /></a>
<p>Most tutorials on building RAG models use this strategy. While it is a fast
way to get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies specifically
designed for certain portions of the scikit-learn documentation.</p>
<section id="api-documentation">
<h2>API documentation<a class="headerlink" href="#api-documentation" title="Link to this heading">#</a></h2>
<p>By “API documentation”, we refer to the following documentation entry point:
<a class="reference external" href="https://scikit-learn.org/stable/modules/classes.html">https://scikit-learn.org/stable/modules/classes.html</a>.</p>
<p>It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings
of the classes and functions. These docstrings follow the <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> formatting.
As an example, we show a generated HTML page containing the documentation of a
scikit-learn estimator:</p>
<a class="transparent-image reference internal image-reference" href="../_images/api_doc_generated_html.png"><img alt="../_images/api_doc_generated_html.png" class="transparent-image align-center" src="../_images/api_doc_generated_html.png" style="width: 100%;" /></a>
<p>Before diving into the chunking mechanism, it is worth thinking about the type
of queries that such documentation can help answer. These documentation
pages are intended to provide information about class or function parameters, short
code usage snippets, and related classes or functions. The narration on these pages
is relatively short, and further discussion is generally provided in the user guide
instead. We would therefore expect these documentation chunks to be useful for
answering questions such as:</p>
<ul class="simple">
<li><p>What are the parameters of <code class="docutils literal notranslate"><span class="pre">LogisticRegression</span></code>?</p></li>
<li><p>What are the values of the <code class="docutils literal notranslate"><span class="pre">strategy</span></code> parameter in a dummy classifier?</p></li>
</ul>
<p><a class="reference internal" href="../references/generated/ragger_duck.scraping.APINumPyDocExtractor.html#ragger_duck.scraping.APINumPyDocExtractor" title="ragger_duck.scraping.APINumPyDocExtractor"><code class="xref py py-class docutils literal notranslate"><span class="pre">APINumPyDocExtractor</span></code></a> is a more advanced scraper
that uses <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> and its scraper to extract the documentation. Indeed, the
<code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> scraper will parse the different sections and we build meaningful
