[ci skip] iter dcb5366
glemaitre committed Apr 19, 2024
1 parent f359688 commit cb677c4
Showing 12 changed files with 349 additions and 21 deletions.
Binary file added _images/example_tutorial_html.png
Binary file added _images/example_usage_html.png
Binary file added _images/user_guide_html.png
190 changes: 183 additions & 7 deletions _sources/user_guide/text_scraping.rst.txt
@@ -25,8 +25,10 @@ get started, it is not the best strategy to get the most out of the scikit-learn
documentation. In the subsequent sections, we present different strategies
specifically designed for certain portions of the scikit-learn documentation.

API documentation
=================
.. _api_doc_scraping:

API documentation scraper
=========================

We refer to "API documentation" as the following documentation entry point:
https://scikit-learn.org/stable/modules/classes.html.
@@ -189,9 +191,183 @@ By providing chunks that maintain the relationship between the parameter and its
corresponding class, we enable the Mistral 7b model to disambiguate the information and
provide a relevant answer.
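
As a rough sketch of how this scraper can be driven (all scrapers follow the
scikit-learn transformer API described in the "Scraper API" section below; the path
shown here and the fact that it is passed to `fit_transform` are assumptions made for
illustration only)::

    from ragger_duck.scraping import APINumPyDocExtractor

    extractor = APINumPyDocExtractor()
    # Hypothetical location of the autogenerated HTML pages of the public API
    # in a local scikit-learn documentation build.
    chunks = extractor.fit_transform("doc/_build/html/modules/generated")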

User Guide documentation
========================
Chunk formatting leveraging `numpydoc`
--------------------------------------

In this section, we provide detailed information regarding the formatting used to create
the chunks for classes and functions by leveraging the `numpydoc` formalism. You can
refer to `the numpydoc documentation
<https://numpydoc.readthedocs.io/en/latest/format.html>`_ for more information
regarding this formalism.
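
To give a concrete feeling of this formalism, here is a minimal sketch that uses
numpydoc's public `docscrape` module to parse the docstring of a scikit-learn function
into structured sections (this illustrates numpydoc itself, not the exact code used by
the scraper)::

    from numpydoc.docscrape import FunctionDoc
    from sklearn.feature_extraction.image import extract_patches_2d

    doc = FunctionDoc(extract_patches_2d)
    # Each section is exposed as structured data; for instance, the
    # "Parameters" section is a list of (name, type, description) entries.
    for param in doc["Parameters"]:
        print(param.name, "--", param.type)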

We are creating individual chunks for the following sections:

- class signature with default parameters
- class short and extended summary
- class parameters description
- class attributes description
- associated class or function in "See Also" section
- class note section
- class example usage
- class references

For each of these sections, we create a chunk of text in natural language to summarize
the information. A similar approach is used for functions and methods of a class. We
provide an example of chunks extracted for the
`sklearn.feature_extraction.image.extract_patches_2d
<https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.image.extract_patches_2d.html>`_:

::

sklearn.feature_extraction.image.extract_patches_2d
The parameters of extract_patches_2d with their default values when known are:
image, patch_size, max_patches (default=None), random_state (default=None).
The description of the extract_patches_2d is as follow.
Reshape a 2D image into a collection of patches.
The resulting patches are allocated in a dedicated array.
Read more in the :ref:`User Guide <image_feature_extraction>`.

::

Parameter image of sklearn.feature_extraction.image.extract_patches_2d.
image is described as 'The original image data. For color images, the last dimension
specifies
the channel: a RGB image would have `n_channels=3`.' and has the following type(s):
ndarray of shape (image_height, image_width) or
(image_height, image_width, n_channels)

::

Parameter patch_size of sklearn.feature_extraction.image.extract_patches_2d.
patch_size is described as 'The dimensions of one patch.' and has the following
type(s): tuple of int (patch_height, patch_width)

::

Parameter max_patches of sklearn.feature_extraction.image.extract_patches_2d.
max_patches is described as 'The maximum number of patches to extract. If
`max_patches` is a float between 0 and 1, it is taken to be a proportion of the
total number of patches. If `max_patches` is None it corresponds to the total number
of patches that can be extracted.' and has the following type(s): int or float,
default=None

::

Parameter random_state of sklearn.feature_extraction.image.extract_patches_2d.
random_state is described as 'Determines the random number generator used for
random sampling when `max_patches` is not None. Use an int to make the randomness
deterministic.
See :term:`Glossary <random_state>`.' and has the following type(s): int,
RandomState instance, default=None

::

patches is returned by sklearn.feature_extraction.image.extract_patches_2d.
patches is described as 'The collection of patches extracted from the image, where
`n_patches` is either `max_patches` or the total number of patches that can be
extracted.' and has the following type(s): array of shape
(n_patches, patch_height, patch_width) or
(n_patches, patch_height, patch_width, n_channels)

::

sklearn.feature_extraction.image.extract_patches_2d
Here is a usage example of extract_patches_2d:
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the first image in this dataset:
>>> one_image = load_sample_image("china.jpg")
>>> print('Image shape: {}'.format(one_image.shape))
Image shape: (427, 640, 3)
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print('Patches shape: {}'.format(patches.shape))
Patches shape: (272214, 2, 2, 3)
>>> # Here are just two of these patches:
>>> print(patches[1])
[[[174 201 231]
[174 201 231]]
[[173 200 230]
[173 200 230]]]
>>> print(patches[800])
[[[187 214 243]
[188 215 244]]
[[187 214 243]
[188 215 244]]]

User Guide documentation scraper
================================

We refer to "User Guide documentation" to the narrative documentation that is
handwritten and provides a detailed explanation of the concepts of machine learning
concept and how those translate into scikit-learn usage. The HTML generated pages are
available at https://scikit-learn.org/stable/user_guide.html. Each page have the
following look:

.. image:: /_static/img/diagram/user_guide_html.png
:width: 100%
:align: center
:class: transparent-image

Here, we observe that the information is not structured as in the API documentation,
so the naive chunking approach is more appropriate.
:class:`~ragger_duck.scraping.UserGuideDocExtractor` is a scraper that chunks the
documentation in this manner. It relies on `beautifulsoup4` to parse the HTML content
and recursively chunk it.

It provides two main parameters, `chunk_size` and `chunk_overlap`, to control the
chunking process. It is important to avoid overly large chunks so that the number of
tokens does not exceed the limit of the retriever; otherwise, the embedding model will
simply truncate the input. Also, a small overlap seems beneficial to avoid retrieving
the same information multiple times.

Here, we can foresee an improvement: parsing the documentation at the section level
and performing the chunking within each section. This improvement could be done in the
future.

The class also provides the parameter `folders_to_exclude` to exclude files or folders
that we do not want to incorporate into our index.
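
A minimal usage sketch is shown below. It assumes that the parameters above are
constructor arguments and that the path to a locally built HTML documentation tree is
passed to `fit_transform`; both the path and the excluded folder name are hypothetical::

    from ragger_duck.scraping import UserGuideDocExtractor

    extractor = UserGuideDocExtractor(
        chunk_size=1_000,      # keep chunks below the retriever token limit
        chunk_overlap=100,     # small overlap to limit duplicated retrievals
        folders_to_exclude=["auto_examples"],  # hypothetical folder name
    )
    chunks = extractor.fit_transform("doc/_build/html")  # hypothetical path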

Example gallery scraper
=======================

The last type of documentation in scikit-learn is the gallery of examples. It
corresponds to a set of Python examples that show usage cases or tutorial-like
examples. These examples are written following the `sphinx-gallery` formalism. The
generated HTML pages are available at
https://scikit-learn.org/stable/auto_examples/index.html.

We mainly have two types of examples in scikit-learn. The first type is closer to a
usage example, as shown below:

.. image:: /_static/img/diagram/example_usage_html.png
:width: 100%
:align: center
:class: transparent-image

These examples have a title and a description followed by a single block of code.

The second type of example is more tutorial-like: it has sections with titles and
interleaves code blocks with text. An example is shown below:

.. image:: /_static/img/diagram/example_tutorial_html.png
:width: 100%
:align: center
:class: transparent-image

:class:`~ragger_duck.scraping.GalleryExampleExtractor` is a scraper that chunks these
two types of examples. In the first case, it chunks the title and the description as
an individual block and chunks the code block separately. In the second case, it
first parses the sections of the example and creates a block for each section; each
block is then chunked separately. The idea behind this strategy is that a section of
text is usually an introduction to, or a description of, the code that follows it.
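
The per-section strategy can be pictured with the small sketch below (plain Python,
independent from the actual implementation): each section of a tutorial-like example
becomes its own block, and the chunking happens within that block so that the
narrative text stays attached to the code it describes::

    def chunk_example(sections, chunk_size=1_000, chunk_overlap=100):
        """Illustrative only: chunk a gallery example section by section.

        `sections` is assumed to be a list of strings, one per section of the
        example (title, narrative text, and code already concatenated).
        """
        chunks = []
        for section in sections:
            start = 0
            while start < len(section):
                chunks.append(section[start:start + chunk_size])
                start += chunk_size - chunk_overlap
        return chunks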

Scraper API
===========

The different scraper classes share a common API, which is the scikit-learn
transformer API: they all implement the methods `fit`, `transform`, and
`fit_transform`. The scrapers are stateless; only parameter validation is done during
`fit`, and all the processing happens when calling `transform`.

:class:`~ragger_duck.scraping.UserGuideDocExtractor` is a scraper that extracts
documentation from the user guide. It is a simple scraper that extracts text
information from the web pages; additionally, this text can be chunked.
This API makes it possible to leverage the scikit-learn `Pipeline`, for instance to
combine a scraper and a retriever in a single Python object.
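
The sketch below shows this pattern. The retriever used here is a dummy stand-in
defined on the spot, since the actual retriever classes are documented elsewhere in
this user guide; the scraper parameters and the input passed to `fit` are also
assumptions made for illustration::

    from sklearn.base import BaseEstimator
    from sklearn.pipeline import Pipeline

    from ragger_duck.scraping import UserGuideDocExtractor

    class DummyRetriever(BaseEstimator):
        """Stand-in for a real retriever, used only to show the pattern."""

        def fit(self, X, y=None):
            # A real retriever would build an index from the chunks here.
            self.documents_ = list(X)
            return self

    pipeline = Pipeline(steps=[
        ("scraper", UserGuideDocExtractor(chunk_size=1_000, chunk_overlap=100)),
        ("retriever", DummyRetriever()),
    ])
    pipeline.fit("doc/_build/html")  # hypothetical input: path to the built HTML
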
Binary file added _static/img/diagram/example_tutorial_html.png
Binary file added _static/img/diagram/example_usage_html.png
Binary file added _static/img/diagram/user_guide_html.png
Binary file modified objects.inv
Binary file not shown.
@@ -468,8 +468,13 @@ <h1>APINumPyDocExtractor<a class="headerlink" href="#apinumpydocextractor" title
<dt class="sig sig-object py" id="ragger_duck.scraping.APINumPyDocExtractor">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">ragger_duck.scraping.</span></span><span class="sig-name descname"><span class="pre">APINumPyDocExtractor</span></span><a class="headerlink" href="#ragger_duck.scraping.APINumPyDocExtractor" title="Link to this definition">#</a></dt>
<dd><p>Extract text from the API documentation using <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code>.</p>
<p>This function can process classes and functions. It extracts the information using
<code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> templates.</p>
<p>To discover the classes and functions, one should provide the path containing the
HTML autogenerated pages. Usually, only public API is present in the documentation.
For scikit-learn, the documentation is available in the folder <code class="docutils literal notranslate"><span class="pre">modules/generated</span></code>.</p>
<p>We leverage the structured information provided by <code class="docutils literal notranslate"><span class="pre">numpydoc</span></code> to create meaningful
chunks of information. Notably, every chunk contains the associated class or function
import name.</p>
<p>Read more in the <a class="reference internal" href="../../user_guide/text_scraping.html#api-doc-scraping"><span class="std std-ref">User Guide</span></a>.</p>
<p class="rubric">Methods</p>
<table class="autosummary longtable table autosummary">
<tbody>
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions user_guide/index.html
@@ -524,8 +524,10 @@ <h2>Implementation details<a class="headerlink" href="#implementation-details" t
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="text_scraping.html">Text Scraping</a><ul>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#api-documentation">API documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#user-guide-documentation">User Guide documentation</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#api-documentation-scraper">API documentation scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#user-guide-documentation-scraper">User Guide documentation scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#example-gallery-scraper">Example gallery scraper</a></li>
<li class="toctree-l2"><a class="reference internal" href="text_scraping.html#scraper-api">Scraper API</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="information_retrieval.html">Retriever</a><ul>

0 comments on commit cb677c4
