[ci skip] correct english b605ccd
glemaitre committed Apr 19, 2024
1 parent 3fd451f commit 5683d27
Showing 5 changed files with 351 additions and 109 deletions.
56 changes: 28 additions & 28 deletions _sources/user_guide/index.rst.txt
User Guide
==========

Before we dive into the implementation details of the different components of our
Retrieval Augmented Generation (RAG) framework, we provide a high-level overview of the
main components.

.. _intro_rag:

The graphic below represents the interaction between our user and the LLM.

In this proof-of-concept (POC), we are interested in a "zero-shot" setting. It means
that we expect our user to formulate a question in natural language, and the LLM will
generate an answer.

The LLM can be queried in two ways: (i) through an API, such as when using GPT-* from
OpenAI, or (ii) by running the model locally using open-weight models such as Mistral or
Llama.
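
As an illustration, here is a minimal sketch of the first option using the `openai`
Python client; the model name, the question, and the client setup are illustrative
placeholders and are not part of this project::

    # Minimal sketch of a "zero-shot" query through a hosted API.
    from openai import OpenAI

    client = OpenAI()  # the API key is read from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "How do I standardize features in scikit-learn?"}
        ],
    )
    print(response.choices[0].message.content)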

Now, let's introduce the RAG framework.


The major difference from the previous framework is an additional step that consists of
retrieving relevant information from a given source of information before answering the
user's query. The retrieved information is provided as context to the LLM during
prompting, and the LLM therefore generates an answer conditioned on this context.
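
To make this concrete, the sketch below injects retrieved chunks into the prompt before
querying the same hosted API as above; the `retrieve` function is a hard-coded stand-in
for a real retriever and is not part of this project::

    # Minimal sketch of RAG prompting: the retrieved chunks are injected into the
    # prompt so that the generated answer is conditioned on them.
    from openai import OpenAI

    def retrieve(query):
        # Stand-in for a real retriever: return hard-coded documentation chunks.
        return [
            "StandardScaler standardizes features by removing the mean and "
            "scaling to unit variance.",
        ]

    question = "How do I standardize features in scikit-learn?"
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )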

It should be noted that information retrieval is not a new concept and has been
extensively studied in the past. It is also closely related to search engines. In the
next section, we will go into more detail about the information retrieval components
when used in a RAG framework.

.. _intro_info_retrieval:

Information retrieval
=====================

Concepts
--------

Before explaining how a retriever is trained, we will first show the main components of
such a retriever.

.. image:: /_static/img/diagram/retrieval_phase.png
:width: 100%
:align: center
:class: transparent-image

A retriever has two main components: (i) an algorithm to transform natural text into a
mathematical vector representation and (ii) a database containing vectors and their
corresponding natural text. This database is also capable of finding the most similar
vectors to a given query vector.

During the training phase, a source of information containing natural text is used to
build a set of vector representations. These vectors are used to populate the database.

During the retrieval phase, a user's query is passed to the algorithm to create a vector
representation. Then, the most similar vectors are found in the database, and the
corresponding natural texts are returned. These documents are then used as context for
the LLM in the RAG framework.
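
As a toy illustration of both phases, the sketch below uses `sentence-transformers` for
the vector representation and scikit-learn's `NearestNeighbors` as a stand-in for the
vector database; the model name and documents are placeholders, and this is not the
retriever implemented in this project::

    from sentence_transformers import SentenceTransformer
    from sklearn.neighbors import NearestNeighbors

    documents = [
        "StandardScaler standardizes features by removing the mean.",
        "LogisticRegression implements regularized logistic regression.",
        "DummyClassifier makes predictions that ignore the input features.",
    ]

    # Training phase: build vector representations and populate the "database".
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(documents)
    database = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)

    # Retrieval phase: embed the query and return the closest documents.
    query_embedding = encoder.encode(["How can I standardize my features?"])
    _, indices = database.kneighbors(query_embedding)
    retrieved_documents = [documents[idx] for idx in indices[0]]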

Details regarding the retrievers
--------------------------------

In this section, we will provide a couple of details regarding the retrievers. However,
our reader can refer to the following comprehensive review for more details [1]_.

Without going into the details, we can distinguish two types of retrievers: (i) lexical
retrievers based on the Bag-of-Words (BoW) model and (ii) semantic retrievers based on
neural networks.

Lexical retrievers are based on word counts in documents and queries. They are simple
but lack the ability to capture the meaning of words. Several approaches have been
proposed to improve the performance of these retrievers, such as expanding queries or
documents, or modeling topics. These retrievers create a sparse representation that
can be leveraged to find the most similar documents through inverted indexes.
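
For illustration only, a minimal lexical retriever can be sketched with scikit-learn's
`TfidfVectorizer`; the sparse matrix it produces plays the role of the representation
stored in an inverted index, and the documents are made-up examples::

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "DummyClassifier makes constant or random predictions.",
        "SimpleImputer replaces missing values according to a given strategy.",
    ]

    vectorizer = TfidfVectorizer()
    document_vectors = vectorizer.fit_transform(documents)  # sparse representation

    query_vector = vectorizer.transform(["how to impute missing values"])
    scores = cosine_similarity(query_vector, document_vectors)[0]
    best_match = documents[scores.argmax()]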

Semantic retrievers are based on neural networks and project the text into a continuous
vector space.

Implementation details
======================

In the previous sections, we presented the general ideas behind the RAG framework.
However, the devil is in the details. In the following sections, we will present some
implementation details regarding some inner steps of the RAG framework that are
important if you want to obtain meaningful results.

.. toctree::
:maxdepth: 2
177 changes: 151 additions & 26 deletions _sources/user_guide/text_scraping.rst.txt

The first important aspect is to be aware that the context of the LLM is limited.
Therefore, we need to provide chunks of documentation that are small and focused enough
not to exceed the context limit.

The most common strategy is to extract chunks of text with a given number of tokens and
an overlap between chunks.

.. image:: /_static/img/diagram/naive_chunks.png
:width: 100%
:align: center
:class: transparent-image
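
As an illustration, the sketch below implements this naive strategy, splitting on
whitespace "tokens" for simplicity; a real implementation would rely on the tokenizer of
the target LLM::

    def naive_chunks(text, chunk_size=100, overlap=20):
        """Split ``text`` into chunks of ``chunk_size`` tokens where consecutive
        chunks share ``overlap`` tokens."""
        tokens = text.split()  # whitespace splitting stands in for a real tokenizer
        step = chunk_size - overlap
        return [
            " ".join(tokens[start:start + chunk_size])
            for start in range(0, max(len(tokens) - overlap, 1), step)
        ]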

The various tutorials to build RAG models use this strategy. While it is a fast way to
get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
=================

We refer to "API documentation" to the following documentation entry point:
We refer to "API documentation" as the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.

It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings of the
classes and functions. These docstrings follow the `numpydoc` formatting. As an example,
we show a generated HTML page containing the documentation of a scikit-learn estimator:

.. image:: /_static/img/diagram/api_doc_generated_html.png
:width: 100%
:align: center
:class: transparent-image

Before diving into the chunking mechanism, it is interesting to think about the type of
queries that such documentation can help answer. Indeed, these documentation pages are
intended to provide information about class or function parameters, short usage snippets
of code, and related classes or functions. The narration on these pages is relatively
short, and further discussions are generally provided in the user guide instead. So we
would expect the chunks of documentation to be useful to answer questions such as:

- What are the parameters of `LogisticRegression`?
- What are the values of the `strategy` parameter in a dummy classifier?

Now that we have better framed our expectations, we can think about chunk extraction.
We could go forward with the naive approach described above. However, it will fall
short of helping the LLM answer these questions. Let's walk through an example to
illustrate this point.

Consider the second question above: "What are the values of the `strategy` parameter in
a dummy classifier?" While our retrievers (:ref:`information_retrieval`) are able to get
the association between the `DummyClassifier` and the strategy parameter, the LLM will
not be able to get this link if the chunk retrieved does not contain this relationship.
Indeed, the naive approach will provide a chunk where strategy could be mentioned, but
it might not belong to the `DummyClassifier` class.

For instance, we could retrieve the following three chunks that are relatively relevant
to the query:

**Chunk #1**::

strategy : {"most_frequent", "prior", "stratified", "uniform", \
"constant"}, default="prior"
Strategy to use to generate predictions.

* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.

**Chunk #2**::

strategy : {"mean", "median", "quantile", "constant"}, default="mean"
Strategy to use to generate predictions.

* "mean": always predicts the mean of the training set
* "median": always predicts the median of the training set
* "quantile": always predicts a specified quantile of the training set,
provided with the quantile parameter.
* "constant": always predicts a constant value that is provided by
the user.

**Chunk #3**::

strategy : str, default='mean'
The imputation strategy.

- If "mean", then replace missing values using the mean along
each column. Can only be used with numeric data.
- If "median", then replace missing values using the median along
each column. Can only be used with numeric data.
- If "most_frequent", then replace missing using the most frequent
value along each column. Can be used with strings or numeric data.
If there is more than one such value, only the smallest is returned.
- If "constant", then replace missing values with fill_value. Can be
used with strings or numeric data.

Therefore, all three chunks are relevant to a `strategy` parameter, but they belong to
the `DummyClassifier`, `DummyRegressor`, and `SimpleImputer` classes, respectively.

If we provide such information to a human who is not familiar with the scikit-learn API,
they will not be able to determine which of the above chunks is relevant to answer the
query. An expert, on the other hand, might use their prior knowledge to select the
relevant chunk.

So when it comes to an LLM, you should not expect more than from a human: if the LLM
has been trained on similar queries, then it might be able to use the relevant
information, but otherwise it will not. For example, the Mistral 7b model would only
summarize the information in the chunks and provide an unhelpful answer.

As a straightforward solution to the above problem, we should go beyond the naive
chunking strategy. For instance, if our chunk contains the class or function associated
with the parameter description, then we can disambiguate the information and thus help
our LLM answer the question.

As previously stated, scikit-learn uses the `numpydoc` formalism to document the classes
and functions. This library comes with a parser that structures the docstring
information, such that you know about the sections, the parameters, the types, etc. We
implemented :class:`~ragger_duck.scraping.APINumPyDocExtractor`, which leverages this
information to build meaningful chunks of documentation. The chunk size in this case is
not controlled, but because of the nature of the documentation, we know that it will
never be too large.
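
To illustrate the idea, the sketch below uses `numpydoc`'s public parser to build one
chunk per parameter of :class:`~sklearn.dummy.DummyClassifier`; the chunk format is only
indicative and does not exactly match what
:class:`~ragger_duck.scraping.APINumPyDocExtractor` produces::

    from numpydoc.docscrape import ClassDoc
    from sklearn.dummy import DummyClassifier

    doc = ClassDoc(DummyClassifier)

    chunks = []
    for param in doc["Parameters"]:
        # One chunk per parameter, always restating the owning class so that the
        # parameter/class relationship is preserved inside the chunk itself.
        chunks.append(
            f"Parameter {param.name} of sklearn.dummy.DummyClassifier. "
            f"{param.name} is described as '{' '.join(param.desc)}' "
            f"and has the following type(s): {param.type}"
        )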

For example, a chunk created by this extractor that is relevant to the previous query is
the following::

source: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
content: Parameter strategy of sklearn.dummy.DummyClassifier.
strategy is described as 'Strategy to use to generate predictions.

* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.

.. versionchanged:: 0.24
The default value of `strategy` has changed to "prior" in version
0.24.' and has the following type(s): {"most_frequent", "prior", "stratified",
"uniform", "constant"}, default="prior"

By providing chunks that maintain the relationship between the parameter and its
corresponding class, we enable the Mistral 7b model to disambiguate the information and
provide a relevant answer.

User Guide documentation
========================
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

