[ci skip] iter dcb5366
glemaitre committed Apr 19, 2024
1 parent f359688 commit cb677c4
Showing 12 changed files with 349 additions and 21 deletions.
Binary file added _images/example_tutorial_html.png
Binary file added _images/example_usage_html.png
Binary file added _images/user_guide_html.png
190 changes: 183 additions & 7 deletions _sources/user_guide/text_scraping.rst.txt
@@ -25,8 +25,10 @@ get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
=================
.. _api_doc_scraping:

API documentation scraper
=========================

We refer to "API documentation" as the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.
@@ -189,9 +191,183 @@ By providing chunks that maintain the relationship between the parameter and its
corresponding class, we enable the Mistral 7b model to disambiguate the information and
provide a relevant answer.
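
As a rough sketch of how this scraper can be driven (all scrapers follow the
scikit-learn transformer API described in the "Scraper API" section below; the path
shown here and the fact that it is passed to `fit_transform` are assumptions made for
illustration only)::

    from ragger_duck.scraping import APINumPyDocExtractor

    extractor = APINumPyDocExtractor()
    # Hypothetical location of the autogenerated HTML pages of the public API
    # in a local scikit-learn documentation build.
    chunks = extractor.fit_transform("doc/_build/html/modules/generated")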

User Guide documentation
========================
Chunk formatting leveraging `numpydoc`
--------------------------------------

In this section, we provide detailed information regarding the formatting used to create
the chunks for classes and functions by leveraging the `numpydoc` formalism. You can
refer to `the numpydoc documentation
<https://numpydoc.readthedocs.io/en/latest/format.html>`_ for more information
regarding this formalism.
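
To give a concrete feeling of this formalism, here is a minimal sketch that uses
numpydoc's public `docscrape` module to parse the docstring of a scikit-learn function
into structured sections (this illustrates numpydoc itself, not the exact code used by
the scraper)::

    from numpydoc.docscrape import FunctionDoc
    from sklearn.feature_extraction.image import extract_patches_2d

    doc = FunctionDoc(extract_patches_2d)
    # Each section is exposed as structured data; for instance, the
    # "Parameters" section is a list of (name, type, description) entries.
    for param in doc["Parameters"]:
        print(param.name, "--", param.type)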

We are creating individual chunks for the following sections:

- class signature with default parameters
- class short and extended summary
- class parameters description
- class attributes description
- associated class or function in "See Also" section
- class note section
- class example usage
- class references

For each of these sections, we create a chunk of text in natural language to summarize
the information. A similar approach is used for functions and methods of a class. We
provide an example of chunks extracted for the
`sklearn.feature_extraction.image.extract_patches_2d
<https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.image.extract_patches_2d.html>`_:

::

sklearn.feature_extraction.image.extract_patches_2d
The parameters of extract_patches_2d with their default values when known are:
image, patch_size, max_patches (default=None), random_state (default=None).
The description of the extract_patches_2d is as follow.
Reshape a 2D image into a collection of patches.
The resulting patches are allocated in a dedicated array.
Read more in the :ref:`User Guide <image_feature_extraction>`.

::

Parameter image of sklearn.feature_extraction.image.extract_patches_2d.
image is described as 'The original image data. For color images, the last dimension
specifies
the channel: a RGB image would have `n_channels=3`.' and has the following type(s):
ndarray of shape (image_height, image_width) or
(image_height, image_width, n_channels)

::

Parameter patch_size of sklearn.feature_extraction.image.extract_patches_2d.
patch_size is described as 'The dimensions of one patch.' and has the following
type(s): tuple of int (patch_height, patch_width)

::

Parameter max_patches of sklearn.feature_extraction.image.extract_patches_2d.
max_patches is described as 'The maximum number of patches to extract. If
`max_patches` is a float between 0 and 1, it is taken to be a proportion of the
total number of patches. If `max_patches` is None it corresponds to the total number
of patches that can be extracted.' and has the following type(s): int or float,
default=None

::

Parameter random_state of sklearn.feature_extraction.image.extract_patches_2d.
random_state is described as 'Determines the random number generator used for
random sampling when `max_patches` is not None. Use an int to make the randomness
deterministic.
See :term:`Glossary <random_state>`.' and has the following type(s): int,
RandomState instance, default=None

::

patches is returned by sklearn.feature_extraction.image.extract_patches_2d.
patches is described as 'The collection of patches extracted from the image, where
`n_patches` is either `max_patches` or the total number of patches that can be
extracted.' and has the following type(s): array of shape
(n_patches, patch_height, patch_width) or
(n_patches, patch_height, patch_width, n_channels)

::

sklearn.feature_extraction.image.extract_patches_2d
Here is a usage example of extract_patches_2d:
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the first image in this dataset:
>>> one_image = load_sample_image("china.jpg")
>>> print('Image shape: {}'.format(one_image.shape))
Image shape: (427, 640, 3)
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print('Patches shape: {}'.format(patches.shape))
Patches shape: (272214, 2, 2, 3)
>>> # Here are just two of these patches:
>>> print(patches[1])
[[[174 201 231]
[174 201 231]]
[[173 200 230]
[173 200 230]]]
>>> print(patches[800])
[[[187 214 243]
[188 215 244]]
[[187 214 243]
[188 215 244]]]

User Guide documentation scraper
================================

We refer to "User Guide documentation" to the narrative documentation that is
handwritten and provides a detailed explanation of the concepts of machine learning
concept and how those translate into scikit-learn usage. The HTML generated pages are
available at https://scikit-learn.org/stable/user_guide.html. Each page have the
following look:

.. image:: /_static/img/diagram/user_guide_html.png
:width: 100%
:align: center
:class: transparent-image

Here, we observe that the information is not structured as in the API documentation,
so the naive chunking approach is more appropriate.
:class:`~ragger_duck.scraping.UserGuideDocExtractor` is a scraper that chunks the
documentation in this manner. It relies on `beautifulsoup4` to parse the HTML content
and recursively chunk it.

It provides two main parameters, `chunk_size` and `chunk_overlap`, to control the
chunking process. It is important to avoid overly large chunks so that the number of
tokens does not exceed the limit of the retriever; otherwise, the embedding model will
simply truncate the input. Also, a small overlap seems beneficial to avoid retrieving
the same information multiple times.

Here, we can foresee an improvement: parsing the documentation at the section level
and performing the chunking within each section. This improvement could be done in the
future.

The class also provides the parameter `folders_to_exclude` to exclude files or folders
that we do not want to incorporate into our index.
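
A minimal usage sketch is shown below. It assumes that the parameters above are
constructor arguments and that the path to a locally built HTML documentation tree is
passed to `fit_transform`; both the path and the excluded folder name are hypothetical::

    from ragger_duck.scraping import UserGuideDocExtractor

    extractor = UserGuideDocExtractor(
        chunk_size=1_000,      # keep chunks below the retriever token limit
        chunk_overlap=100,     # small overlap to limit duplicated retrievals
        folders_to_exclude=["auto_examples"],  # hypothetical folder name
    )
    chunks = extractor.fit_transform("doc/_build/html")  # hypothetical path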

Example gallery scraper
=======================

The last type of documentation in scikit-learn is the gallery of examples. It
corresponds to a set of Python examples that show usage cases or tutorial-like
examples. These examples are written following the `sphinx-gallery` formalism. The
generated HTML pages are available at
https://scikit-learn.org/stable/auto_examples/index.html.

We mainly have two types of examples in scikit-learn. The first type is closer to a
usage example, as shown below:

.. image:: /_static/img/diagram/example_usage_html.png
:width: 100%
:align: center
:class: transparent-image

These examples have a title and a description followed by a single block of code.

The second type of example is more tutorial-like: it has sections with titles and
interleaves code blocks with text. An example is shown below:

.. image:: /_static/img/diagram/example_tutorial_html.png
:width: 100%
:align: center
:class: transparent-image

:class:`~ragger_duck.scraping.GalleryExampleExtractor` is a scraper that chunks these
two types of examples. In the first case, it chunks the title and the description as
an individual block and chunks the code block separately. In the second case, it
first parses the sections of the example and creates a block for each section; each
block is then chunked separately. The idea behind this strategy is that a section of
text is usually an introduction to, or a description of, the code that follows it.
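
The per-section strategy can be pictured with the small sketch below (plain Python,
independent from the actual implementation): each section of a tutorial-like example
becomes its own block, and the chunking happens within that block so that the
narrative text stays attached to the code it describes::

    def chunk_example(sections, chunk_size=1_000, chunk_overlap=100):
        """Illustrative only: chunk a gallery example section by section.

        `sections` is assumed to be a list of strings, one per section of the
        example (title, narrative text, and code already concatenated).
        """
        chunks = []
        for section in sections:
            start = 0
            while start < len(section):
                chunks.append(section[start:start + chunk_size])
                start += chunk_size - chunk_overlap
        return chunks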

Scraper API
===========

The different scraper classes share a common API, which is the scikit-learn
transformer API: they all implement the methods `fit`, `transform`, and
`fit_transform`. The scrapers are stateless; only parameter validation is done during
`fit`, and all the processing happens when calling `transform`.

:class:`~ragger_duck.scraping.UserGuideDocExtractor` is a scraper that extracts
documentation from the user guide. It is a simple scraper that extracts text
information from the web pages; additionally, this text can be chunked.
This API makes it possible to leverage the scikit-learn `Pipeline`, for instance to
combine a scraper and a retriever in a single Python object.
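
The sketch below shows this pattern. The retriever used here is a dummy stand-in
defined on the spot, since the actual retriever classes are documented elsewhere in
this user guide; the scraper parameters and the input passed to `fit` are also
assumptions made for illustration::

    from sklearn.base import BaseEstimator
    from sklearn.pipeline import Pipeline

    from ragger_duck.scraping import UserGuideDocExtractor

    class DummyRetriever(BaseEstimator):
        """Stand-in for a real retriever, used only to show the pattern."""

        def fit(self, X, y=None):
            # A real retriever would build an index from the chunks here.
            self.documents_ = list(X)
            return self

    pipeline = Pipeline(steps=[
        ("scraper", UserGuideDocExtractor(chunk_size=1_000, chunk_overlap=100)),
        ("retriever", DummyRetriever()),
    ])
    pipeline.fit("doc/_build/html")  # hypothetical input: path to the built HTML
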
Binary file added _static/img/diagram/example_tutorial_html.png
Binary file added _static/img/diagram/example_usage_html.png
Binary file added _static/img/diagram/user_guide_html.png
Binary file modified objects.inv
Binary file not shown.
@@ -468,8 +468,13 @@ <h1>APINumPyDocExtractor<a class="headerlink" href="#apinumpydocextractor" title
<dt class="sig sig-object py" id="ragger_duck.scraping.APINumPyDocExtractor">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">ragger_duck.scraping.</span></span><span class="sig-name descname"><span class="pre">APINumPyDocExtractor</span></span><a class="headerlink" href="#ragger_duck.scraping.APINumPyDocExtractor" title="Link to this definition">#</a></dt>
<dd><p>Extract text from the API documentation using <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code>.</p>
<p>This function can process classes and functions. It extracts the information using
<code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> templates.</p>
<p>To discover the classes and functions, one should provide the path containing the
HTML autogenerated pages. Usually, only public API is present in the documentation.
For scikit-learn, the documentation is available in the folder <code class="docutils literal notranslate"><span class="pre">modules/generated</span></code>.</p>
<p>We leverage the structured information provided by <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> to create meaningful
chunks of information. Notably, every chunk contains the associated class or function
import name.</p>
<p>Read more in the <a class="reference internal" href="../../user_guide/text_scraping.html#api-doc-scraping"><span class="std std-ref">User Guide</span></a>.</p>
<p class="rubric">Methods</p>
<table class="autosummary longtable table autosummary">
<tbody>
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions user_guide/index.html
@@ -524,8 +524,10 @@ <h2>Implementation details<a class="headerlink" href="#implementation-details" t
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="text_scraping.html">Text Scraping</a><ul>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#api-documentation">API documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#user-guide-documentation">User Guide documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#api-documentation-scraper">API documentation scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#user-guide-documentation-scraper">User Guide documentation scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#example-gallery-scraper">Example gallery scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#scraper-api">Scraper API</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="information_retrieval.html">Retriever</a><ul>

0 comments on commit cb677c4
