[ci skip] iter f18abd6
glemaitre committed Apr 19, 2024
1 parent 019d17a commit 3fd451f
Showing 9 changed files with 80 additions and 7 deletions.
Binary file added _images/api_doc_generated_html.png
Binary file added _images/naive_chunks.png
2 changes: 1 addition & 1 deletion _sources/user_guide/index.rst.txt
@@ -28,7 +28,7 @@ In this proof-of-concept (POC), we are interested in a "zero-shot" setting. It means
that we expect our user to formulate a question in natural language and the LLM will
generate an answer.

The way to query the LLM can be done in two ways: (i) through an API such when using
The way to query the LLM can be done in two ways: (i) through an API such as when using
GPT-* from OpenAI or (ii) by locally running the model using open-weight models such
as Mistral or LLama.

47 changes: 45 additions & 2 deletions _sources/user_guide/text_scraping.rst.txt
@@ -4,12 +4,55 @@
Text Scraping
=============

The scraping module provides some simple estimator that extract meaningful
documentation from the documentation website.
In a Retrieval Augmented Generation (RAG) framework, the "documents" retrieved and
provided to the Large Language Model (LLM) to generate an answer correspond to chunks
extracted from the documentation.

The first important aspect to be aware of is that the context window of the LLM is
limited. Therefore, we need to provide chunks of documentation that are short and
focused enough not to exceed this limit.

The most common strategy is therefore to extract chunks of text with a given number
of tokens and an overlap between consecutive chunks.

.. image:: /_static/img/diagram/naive_chunks.png
:width: 100%
:align: center
:class: transparent-image
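
The fixed-size chunking strategy described above can be sketched in a few lines. This
is a minimal, dependency-free illustration: it tokenizes on whitespace for simplicity,
whereas a real pipeline would use the tokenizer of the embedding model or LLM, and the
function name and default sizes are only illustrative.

```python
def naive_chunks(text, chunk_size=100, overlap=20):
    """Split ``text`` into chunks of ``chunk_size`` tokens, with ``overlap``
    tokens shared between consecutive chunks.

    Whitespace tokenization keeps the sketch self-contained; it is not what a
    production RAG pipeline would use.
    """
    tokens = text.split()
    step = chunk_size - overlap
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, max(len(tokens) - overlap, 1), step)
    ]

# Each chunk holds 4 tokens and shares 1 token with the next chunk.
chunks = naive_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
```
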

Most tutorials on building RAG models use this strategy. While it is a fast way to
get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
-----------------

By "API documentation", we refer to the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.

It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings
of the classes and functions. These docstrings follow the `numpydoc` formatting.
As an example, we show a generated HTML page containing the documentation of a
scikit-learn estimator:

.. image:: /_static/img/diagram/api_doc_generated_html.png
:width: 100%
:align: center
:class: transparent-image
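
To give an idea of how a numpydoc-formatted docstring decomposes into sections, here
is a deliberately simplified, stdlib-only stand-in for the parsing that `numpydoc`
itself performs: a section header is a line immediately followed by a dashed
underline of the same length. The function name and the docstring below are invented
for illustration; the actual extractor relies on `numpydoc` directly.

```python
def split_numpydoc_sections(docstring):
    """Split a numpydoc-formatted docstring into named sections.

    Simplified sketch: a section header is a line followed by a dashed
    underline of the same length; everything else belongs to the current
    section. Text before the first header goes into "Summary".
    """
    lines = docstring.strip().splitlines()
    sections, current = {"Summary": []}, "Summary"
    i = 0
    while i < len(lines):
        header = lines[i].strip()
        underline = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if header and underline == "-" * len(header):
            current = header
            sections[current] = []
            i += 2  # skip the header and its underline
        else:
            sections[current].append(lines[i])
            i += 1
    return {name: "\n".join(body).strip() for name, body in sections.items()}

# A made-up docstring following the numpydoc layout.
docstring = """Summary of a hypothetical estimator.

Parameters
----------
strategy : str, default="prior"
    Strategy used to generate predictions.

Returns
-------
self : object
    Fitted estimator.
"""
sections = split_numpydoc_sections(docstring)
```

Each section can then serve as the basis for a focused, self-contained chunk instead
of an arbitrary fixed-size window of tokens.
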

Before diving into the chunking mechanism, it is worth thinking about the type of
queries that such documentation can help answer. These documentation pages are
intended to provide information about class or function parameters, short code usage
snippets, and related classes or functions. The narration on these pages is
relatively short, and further discussion is generally provided in the user guide
instead. We would therefore expect these documentation chunks to be useful for
answering questions such as:

- What are the parameters of `LogisticRegression`?
- What are the values of the `strategy` parameter in a dummy classifier?

:class:`~ragger_duck.scraping.APINumPyDocExtractor` is a more advanced scraper
that uses `numpydoc` and its scraper to extract the documentation. Indeed, the
`numpydoc` scraper will parse the different sections and we build meaningful
Binary file added _static/img/diagram/api_doc_generated_html.png
Binary file added _static/img/diagram/naive_chunks.png
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion user_guide/index.html
@@ -453,7 +453,7 @@
<p>In this proof-of-concept (POC), we are interested in a “zero-shot” setting. It means
that we expect our user to formulate a question in natural language and the LLM will
generate an answer.</p>
<p>The way to query the LLM can be done in two ways: (i) through an API such when using
<p>The way to query the LLM can be done in two ways: (i) through an API such as when using
GPT-* from OpenAI or (ii) by locally running the model using open-weight models such
as Mistral or LLama.</p>
<p>Now, let’s introduce the RAG framework.</p>
34 changes: 32 additions & 2 deletions user_guide/text_scraping.html
@@ -444,10 +444,40 @@

<section id="text-scraping">
<span id="id1"></span><h1>Text Scraping<a class="headerlink" href="#text-scraping" title="Link to this heading">#</a></h1>
<p>The scraping module provides some simple estimator that extract meaningful
documentation from the documentation website.</p>
<p>In a Retrieval Augmented Generation (RAG) framework, the “documents” retrieved and
provided to the Large Language Model (LLM) to generate an answer correspond to chunks
extracted from the documentation.</p>
<p>The first important aspect to be aware of is that the context window of the LLM is
limited. Therefore, we need to provide chunks of documentation that are short and
focused enough not to exceed this limit.</p>
<p>The most common strategy is therefore to extract chunks of text with a given number of
tokens and an overlap between consecutive chunks.</p>
<a class="transparent-image reference internal image-reference" href="../_images/naive_chunks.png"><img alt="../_images/naive_chunks.png" class="transparent-image align-center" src="../_images/naive_chunks.png" style="width: 100%;" /></a>
<p>Most tutorials on building RAG models use this strategy. While it is a fast
way to get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies specifically
designed for certain portions of the scikit-learn documentation.</p>
<section id="api-documentation">
<h2>API documentation<a class="headerlink" href="#api-documentation" title="Link to this heading">#</a></h2>
<p>By “API documentation”, we refer to the following documentation entry point:
<a class="reference external" href="https://scikit-learn.org/stable/modules/classes.html">https://scikit-learn.org/stable/modules/classes.html</a>.</p>
<p>It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings
of the classes and functions. These docstrings follow the <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> formatting.
As an example, we show a generated HTML page containing the documentation of a
scikit-learn estimator:</p>
<a class="transparent-image reference internal image-reference" href="../_images/api_doc_generated_html.png"><img alt="../_images/api_doc_generated_html.png" class="transparent-image align-center" src="../_images/api_doc_generated_html.png" style="width: 100%;" /></a>
<p>Before diving into the chunking mechanism, it is worth thinking about the type
of queries that such documentation can help answer. These documentation
pages are intended to provide information about class or function parameters, short
code usage snippets, and related classes or functions. The narration on these pages
is relatively short, and further discussion is generally provided in the user guide
instead. We would therefore expect these documentation chunks to be useful for
answering questions such as:</p>
<ul class="simple">
<li><p>What are the parameters of <code class="docutils literal notranslate"><span class="pre">LogisticRegression</span></code>?</p></li>
<li><p>What are the values of the <code class="docutils literal notranslate"><span class="pre">strategy</span></code> parameter in a dummy classifier?</p></li>
</ul>
<p><a class="reference internal" href="../references/generated/ragger_duck.scraping.APINumPyDocExtractor.html#ragger_duck.scraping.APINumPyDocExtractor" title="ragger_duck.scraping.APINumPyDocExtractor"><code class="xref py py-class docutils literal notranslate"><span class="pre">APINumPyDocExtractor</span></code></a> is a more advanced scraper
that uses <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> and its scraper to extract the documentation. Indeed, the
<code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> scraper will parse the different sections and we build meaningful
