[ci skip] correct english b605ccd
glemaitre committed Apr 19, 2024
1 parent 3fd451f commit 5683d27
Showing 5 changed files with 351 additions and 109 deletions.
56 changes: 28 additions & 28 deletions _sources/user_guide/index.rst.txt
User Guide
==========

Before we dive into the implementation details of the different components of our
Retrieval Augmented Generation (RAG) framework, we provide a high-level overview of the
main components.

.. _intro_rag:

The graphic below represents the interaction between our user and the LLM.

In this proof-of-concept (POC), we are interested in a "zero-shot" setting. It means
that we expect our user to formulate a question in natural language, and the LLM will
generate an answer.

The LLM can be queried in two ways: (i) through an API, such as when using GPT-* from
OpenAI, or (ii) by running the model locally using open-weight models such as Mistral or
Llama.
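
As an illustration, here is a minimal sketch of the first option using the `openai`
Python client; the model name, the question, and the client setup are illustrative
placeholders and are not part of this project::

    # Minimal sketch of a "zero-shot" query through a hosted API.
    from openai import OpenAI

    client = OpenAI()  # the API key is read from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "How do I standardize features in scikit-learn?"}
        ],
    )
    print(response.choices[0].message.content)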

Now, let's introduce the RAG framework.


The major difference from the previous framework is an additional step that consists of
retrieving relevant information from a given source of information before answering the
user's query. The retrieved information is provided as context to the LLM during
prompting, and the LLM therefore generates an answer conditioned on this context.
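
To make this concrete, the sketch below injects retrieved chunks into the prompt before
querying the same hosted API as above; the `retrieve` function is a hard-coded stand-in
for a real retriever and is not part of this project::

    # Minimal sketch of RAG prompting: the retrieved chunks are injected into the
    # prompt so that the generated answer is conditioned on them.
    from openai import OpenAI

    def retrieve(query):
        # Stand-in for a real retriever: return hard-coded documentation chunks.
        return [
            "StandardScaler standardizes features by removing the mean and "
            "scaling to unit variance.",
        ]

    question = "How do I standardize features in scikit-learn?"
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )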

It should be noted that information retrieval is not a new concept and has been
extensively studied in the past. It is also closely related to search engines. In the
next section, we will go into more detail about the information retrieval components
when used in a RAG framework.

.. _intro_info_retrieval:

Information retrieval
=====================

Concepts
--------

Before explaining how a retriever is trained, we will first show the main components of
such a retriever.

.. image:: /_static/img/diagram/retrieval_phase.png
:width: 100%
:align: center
:class: transparent-image

A retriever has two main components: (i) an algorithm to transform natural text into a
mathematical vector representation and (ii) a database containing vectors and their
corresponding natural text. This database is also capable of finding the most similar
vectors to a given query vector.

During the training phase, a source of information containing natural text is used to
build a set of vector representations. These vectors are used to populate the database.

During the retrieval phase, a user's query is passed to the algorithm to create a vector
representation. Then, the most similar vectors are found in the database, and the
corresponding natural texts are returned. These documents are then used as context for
the LLM in the RAG framework.
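
As a toy illustration of both phases, the sketch below uses `sentence-transformers` for
the vector representation and scikit-learn's `NearestNeighbors` as a stand-in for the
vector database; the model name and documents are placeholders, and this is not the
retriever implemented in this project::

    from sentence_transformers import SentenceTransformer
    from sklearn.neighbors import NearestNeighbors

    documents = [
        "StandardScaler standardizes features by removing the mean.",
        "LogisticRegression implements regularized logistic regression.",
        "DummyClassifier makes predictions that ignore the input features.",
    ]

    # Training phase: build vector representations and populate the "database".
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(documents)
    database = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)

    # Retrieval phase: embed the query and return the closest documents.
    query_embedding = encoder.encode(["How can I standardize my features?"])
    _, indices = database.kneighbors(query_embedding)
    retrieved_documents = [documents[idx] for idx in indices[0]]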

Details regarding the retrievers
--------------------------------

In this section, we will provide a couple of details regarding the retrievers. However,
our reader can refer to the following comprehensive review for more details [1]_.

Without going into the details, we can distinguish two types of retrievers: (i) lexical
retrievers based on the Bag-of-Words (BoW) model and (ii) semantic retrievers based on
neural networks.

Lexical retrievers are based on word counts in documents and queries. They are simple
but lack the ability to capture the meaning of words. Several approaches have been
proposed to improve the performance of these retrievers, such as expanding queries or
documents, or modeling topics. These retrievers create a sparse representation that
can be leveraged to find the most similar documents through inverted indexes.
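
For illustration only, a minimal lexical retriever can be sketched with scikit-learn's
`TfidfVectorizer`; the sparse matrix it produces plays the role of the representation
stored in an inverted index, and the documents are made-up examples::

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "DummyClassifier makes constant or random predictions.",
        "SimpleImputer replaces missing values according to a given strategy.",
    ]

    vectorizer = TfidfVectorizer()
    document_vectors = vectorizer.fit_transform(documents)  # sparse representation

    query_vector = vectorizer.transform(["how to impute missing values"])
    scores = cosine_similarity(query_vector, document_vectors)[0]
    best_match = documents[scores.argmax()]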

Semantic retrievers are based on neural networks and project the text into a continuous
vector space.

Implementation details
======================

In the previous sections, we presented the general ideas behind the RAG framework.
However, the devil is in the details. In the following sections, we will present some
implementation details regarding some inner steps of the RAG framework that are
important if you want to obtain meaningful results.

.. toctree::
:maxdepth: 2
177 changes: 151 additions & 26 deletions _sources/user_guide/text_scraping.rst.txt

The first important aspect is to be aware that the context of the LLM is limited.
Therefore, we need to provide chunks of documentation that are small and focused enough
not to exceed the context limit.

The most common strategy is to extract chunks of text with a given number of tokens and
an overlap between chunks.

.. image:: /_static/img/diagram/naive_chunks.png
:width: 100%
:align: center
:class: transparent-image
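
As an illustration, the sketch below implements this naive strategy, splitting on
whitespace "tokens" for simplicity; a real implementation would rely on the tokenizer of
the target LLM::

    def naive_chunks(text, chunk_size=100, overlap=20):
        """Split ``text`` into chunks of ``chunk_size`` tokens where consecutive
        chunks share ``overlap`` tokens."""
        tokens = text.split()  # whitespace splitting stands in for a real tokenizer
        step = chunk_size - overlap
        return [
            " ".join(tokens[start:start + chunk_size])
            for start in range(0, max(len(tokens) - overlap, 1), step)
        ]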

The various tutorials to build RAG models use this strategy. While it is a fast way to
get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
=================

We refer to "API documentation" to the following documentation entry point:
We refer to "API documentation" as the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.

It corresponds to the documentation of each class and function implemented in
scikit-learn. This documentation is automatically generated from the docstrings of the
classes and functions. These docstrings follow the `numpydoc` formatting. As an example,
we show a generated HTML page containing the documentation of a scikit-learn estimator:

.. image:: /_static/img/diagram/api_doc_generated_html.png
:width: 100%
:align: center
:class: transparent-image

Before diving into the chunking mechanism, it is interesting to think about the type of
queries that such documentation can help answer. Indeed, these documentation pages are
intended to provide information about class or function parameters, short usage snippets
of code, and related classes or functions. The narration on these pages is relatively
short, and further discussions are generally provided in the user guide instead. So we
would expect the chunks of documentation to be useful to answer questions such as:

- What are the parameters of `LogisticRegression`?
- What are the values of the `strategy` parameter in a dummy classifier?

Now that we have better framed our expectations, we can think about chunk extraction.
We could go forward with the naive approach described above. However, it will fall
short of helping the LLM answer these questions. Let's walk through an example to
illustrate this point.

Consider the second question above: "What are the values of the `strategy` parameter in
a dummy classifier?" While our retrievers (:ref:`information_retrieval`) are able to get
the association between the `DummyClassifier` and the strategy parameter, the LLM will
not be able to get this link if the chunk retrieved does not contain this relationship.
Indeed, the naive approach will provide a chunk where strategy could be mentioned, but
it might not belong to the `DummyClassifier` class.

For instance, we could retrieve the following three chunks that are relatively relevant
to the query:

**Chunk #1**::

strategy : {"most_frequent", "prior", "stratified", "uniform", \
"constant"}, default="prior"
Strategy to use to generate predictions.

* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.

**Chunk #2**::

strategy : {"mean", "median", "quantile", "constant"}, default="mean"
Strategy to use to generate predictions.

* "mean": always predicts the mean of the training set
* "median": always predicts the median of the training set
* "quantile": always predicts a specified quantile of the training set,
provided with the quantile parameter.
* "constant": always predicts a constant value that is provided by
the user.

**Chunk #3**::

strategy : str, default='mean'
The imputation strategy.

- If "mean", then replace missing values using the mean along
each column. Can only be used with numeric data.
- If "median", then replace missing values using the median along
each column. Can only be used with numeric data.
- If "most_frequent", then replace missing using the most frequent
value along each column. Can be used with strings or numeric data.
If there is more than one such value, only the smallest is returned.
- If "constant", then replace missing values with fill_value. Can be
used with strings or numeric data.

Therefore, all three chunks are relevant to a `strategy` parameter, but they belong to
the `DummyClassifier`, `DummyRegressor`, and `SimpleImputer` classes, respectively.

If we provide such information to a human who is not familiar with the scikit-learn API,
they will not be able to determine which of the above chunks is relevant to answer the
query. An expert, on the other hand, might use their prior knowledge to select the
relevant chunk.

So when it comes to an LLM, you should not expect more than from a human: if the LLM
has been trained on similar queries, then it might be able to use the relevant
information, but otherwise it will not. For example, the Mistral 7b model would only
summarize the information in the chunks and provide an unhelpful answer.

As a straightforward solution to the above problem, we should go beyond the naive
chunking strategy. For instance, if our chunk contains the class or function associated
with the parameter description, then we can disambiguate the information and thus help
our LLM answer the question.

As previously stated, scikit-learn uses the `numpydoc` formalism to document the classes
and functions. This library comes with a parser that structures the docstring
information, such that you know about the sections, the parameters, the types, etc. We
implemented :class:`~ragger_duck.scraping.APINumPyDocExtractor`, which leverages this
information to build meaningful chunks of documentation. The chunk size in this case is
not controlled, but because of the nature of the documentation, we know that it will
never be too large.
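
To illustrate the idea, the sketch below uses `numpydoc`'s public parser to build one
chunk per parameter of :class:`~sklearn.dummy.DummyClassifier`; the chunk format is only
indicative and does not exactly match what
:class:`~ragger_duck.scraping.APINumPyDocExtractor` produces::

    from numpydoc.docscrape import ClassDoc
    from sklearn.dummy import DummyClassifier

    doc = ClassDoc(DummyClassifier)

    chunks = []
    for param in doc["Parameters"]:
        # One chunk per parameter, always restating the owning class so that the
        # parameter/class relationship is preserved inside the chunk itself.
        chunks.append(
            f"Parameter {param.name} of sklearn.dummy.DummyClassifier. "
            f"{param.name} is described as '{' '.join(param.desc)}' "
            f"and has the following type(s): {param.type}"
        )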

For example, a chunk created by this extractor that is relevant to the previous query is
the following::

source: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
content: Parameter strategy of sklearn.dummy.DummyClassifier.
strategy is described as 'Strategy to use to generate predictions.

* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.

.. versionchanged:: 0.24
The default value of `strategy` has changed to "prior" in version
0.24.' and has the following type(s): {"most_frequent", "prior", "stratified",
"uniform", "constant"}, default="prior"

By providing chunks that maintain the relationship between the parameter and its
corresponding class, we enable the Mistral 7b model to disambiguate the information and
provide a relevant answer.

User Guide documentation
========================
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

