
Issue #1080 : Fixing Parsing code for documentation retrieval in Python 3.13. #1082

Merged (8 commits, Jan 18, 2025)

Conversation

raphaelrubrice (Contributor)

Fixes the parsing code so that the documentation from NeuralNet is properly processed when adapting the documentation for the NeuralNetClassifier, NeuralNetBinaryClassifier, and NeuralNetRegressor classes in Python 3.13 and below.
See Issue #1080 for more details.

raphaelrubrice and others added 3 commits December 27, 2024 03:48
replaced old parsing code to retrieve criterion section of NeuralNet documentation from (\n\s+)(criterion .*\n)(\s.+){1,99} to (\n\s+)(criterion .*\n)(\s.+|.){1,99}. This ensures proper parsing in both 3.13 and previous Python versions.
replaced old parsing code to retrieve criterion section of NeuralNet documentation from "(\n\s+)(criterion .*\n)(\s.+){1,99}" to "(\n\s+)(criterion .*\n)(\s.+|.){1,99}". This ensures proper parsing in both 3.13 and previous Python versions.
…was previously cutting out the criterion section, so even with the proper regexp for 3.13 and below, the pattern was not matching. Now works fine.
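For reference, here is a contrived, self-contained illustration (not the real skorch docstring or parsing code) of what the added |. alternative changes: with the old pattern, a line after the criterion line that does not start with whitespace prevents any match, while the new pattern still consumes it character by character.

import re

old = re.compile(r"(\n\s+)(criterion .*\n)(\s.+){1,99}")
new = re.compile(r"(\n\s+)(criterion .*\n)(\s.+|.){1,99}")

# Toy text: the line after "criterion ..." has no leading whitespace,
# as can happen once a docstring has been dedented.
text = "\n criterion : foo\nbar\n"

print(old.search(text))  # None: every (\s.+) repetition needs leading whitespace
print(new.search(text))  # matches "\n criterion : foo\nbar"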
@BenjaminBossan (Collaborator)

Thanks for the PR. I checked the docstring based on your branch with Python 3.13 and found that some paragraphs are missing. This is what I get:

>>> print(NeuralNetClassifier.__doc__[:1000])
NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.

  modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

optimizer : torch optim (class, default=torch.optim.SGD)
  The uninitialized optimizer (update rule) used to optimize the
  module

lr : float (default=0.01)
  Learning rate passed to the optimizer. You may use ``lr`` instead
  of using ``optimizer__lr``, which would result in the sa

Is it the same for you?

What I would expect to see is:

>>> print(NeuralNetClassifier.__doc__[:2000])
NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.

    In addition to the parameters listed below, there are parameters
    with specific prefixes that are handled separately. To illustrate
    this, here is an example:

    >>> net = NeuralNet(
    ...    ...,
    ...    optimizer=torch.optimizer.SGD,
    ...    optimizer__momentum=0.95,
    ...)

    This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
    will take care of setting the ``momentum`` parameter to 0.95.

    (Note that the double underscore notation in
    ``optimizer__momentum`` means that the parameter ``momentum``
    should be set on the object ``optimizer``. This is the same
    semantic as used by sklearn.)

    Furthermore, this allows to change those parameters later:

    ``net.set_params(optimizer__momentum=0.99)``

    This can be useful when you want to change certain parameters
    using a callback, when using the net in an sklearn grid search,
    etc.

    By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
    both training and validation datasets), and :class:`.PrintLog`
    callbacks are added for convenience.

    Parameters
    ----------
    module : torch module (class or instance)
      A PyTorch :class:`~torch.nn.Module`. In general, the
      uninstantiated class should be passed, although instantiated
      modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty 

Do you see the same results?

Apart from that, we could try to fix the inconsistent indentation for Python 3.13 using textwrap.dedent, but let's focus on the missing paragraphs first.

@raphaelrubrice (Contributor, Author) commented Jan 4, 2025

Thanks for catching that. This stemmed from the fact that in Python 3.13 there is one "\n " which becomes a "\n" with no trailing whitespace, which led to one return statement being skipped and to the missing criterion block, as discussed in Issue #1080.

Setting the argument back to "\n" solved that. However, the :class:`.NeuralNetRegressor`. fragment remained at the end of the NeuralNet documentation after splitting with .split("\n", 4). This was fixed by changing it to .split("\n", 5).

I then proceeded to use textwrap, as you suggested, to fix the inconsistent indentation.

Once again I tested imports from both Python 3.13 and Python 3.12, and both work with correct documentation retrieval.
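To make the root cause concrete, here is a minimal illustration (a toy class, not skorch code) of the Python 3.13 behaviour referred to above: starting with 3.13, the compiler strips the common leading whitespace from docstrings, so the same source yields a differently indented __doc__.

class Example:
    """Summary.

    criterion : some description
      continuation line
    """

# On Python <= 3.12, __doc__ keeps the source indentation
# ("    criterion : some description" / "      continuation line").
# On Python >= 3.13, the common leading whitespace is stripped, so the same
# lines come out as "criterion : some description" / "  continuation line",
# and, as noted above, a "\n " can become a plain "\n".
print(repr(Example.__doc__))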

@BenjaminBossan (Collaborator) left a comment

Thanks for the update. I verified locally that the docstrings are now identical for Python 3.12 and 3.13.

Regarding the dedent/indent roundtrip, as it may not be quite obvious when reading the code, could you please add a comment on why that is done?

Furthermore, I noticed that the functions are using 5 spaces in the body, could you please revert that to 4 spaces?

…extwrap.dedent in documentation retrieval for NeuralNetClassifier, NeuralNetBinaryClassifier, NeuralNetRegressor
…ious commit is wrong about the use of textwrap.dedent, it is indeed necessary for correct functionality in both Python 3.13 and 3.12. Reverted to version with textwrap.dedent in documentation retrieval.
@raphaelrubrice (Contributor, Author)

Okay, so I corrected the function bodies.

About the roundtrip: after reading your comment I originally forgot why I had done that, went on to test in Python 3.13, found that it seemed unnecessary in that version, and committed. That was a mistake, so the following section explains why we actually need this roundtrip and why I made 2 commits (the last one reverted the removal of textwrap.dedent). This is why:

When using only a simple doc.split without textwrap for indentation or dedentation, the documentation is not indented as you would expect. However, when wrapping the result of doc.split in textwrap.indent, the documentation is correctly indented in Python 3.13 but no longer in Python 3.12:

NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.


        In addition to the parameters listed below, there are parameters
        with specific prefixes that are handled separately. To illustrate
        this, here is an example:

        >>> net = NeuralNet(
        ...    ...,
        ...    optimizer=torch.optimizer.SGD,
        ...    optimizer__momentum=0.95,
        ...)

        This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
        will take care of setting the ``momentum`` parameter to 0.95.

        (Note that the double underscore notation in
        ``optimizer__momentum`` means that the parameter ``momentum``
        should be set on the object ``optimizer``. This is the same
        semantic as used by sklearn.)

        Furthermore, this allows to change those parameters later:

        ``net.set_params(optimizer__momentum=0.99)``

        This can be useful when you want to change certain parameters
        using a callback, when using the net in an sklearn grid search,
        etc.

        By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
        both training and validation datasets), and :class:`.PrintLog`
        callbacks are added for convenience.

        Parameters
        ----------
        module : torch module (class or instance)
          A PyTorch :class:`~torch.nn.Module`. In general, the
          uninstantiated class should be passed, although instantiated
          modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

        optimizer : torch optim (class, default=torch.optim.SGD)
          The uninitialized optimizer (update rule) used to optimize the
          module

        lr : float (default=0.01)
          Learning rate passed to the optimizer. You may use ``lr`` instead
          of using ``optimizer__lr``, which would result in the same outcome.

So, to ensure correct indentation in both versions, a dedentation is performed before the indentation is applied. When using the roundtrip:

NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.


    In addition to the parameters listed below, there are parameters
    with specific prefixes that are handled separately. To illustrate
    this, here is an example:

    >>> net = NeuralNet(
    ...    ...,
    ...    optimizer=torch.optimizer.SGD,
    ...    optimizer__momentum=0.95,
    ...)

    This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
    will take care of setting the ``momentum`` parameter to 0.95.

    (Note that the double underscore notation in
    ``optimizer__momentum`` means that the parameter ``momentum``
    should be set on the object ``optimizer``. This is the same
    semantic as used by sklearn.)

    Furthermore, this allows to change those parameters later:

    ``net.set_params(optimizer__momentum=0.99)``

    This can be useful when you want to change certain parameters
    using a callback, when using the net in an sklearn grid search,
    etc.

    By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
    both training and validation datasets), and :class:`.PrintLog`
    callbacks are added for convenience.

    Parameters
    ----------
    module : torch module (class or instance)
      A PyTorch :class:`~torch.nn.Module`. In general, the
      uninstantiated class should be passed, although instantiated
      modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

    optimizer : torch optim (class, default=torch.optim.SGD)
      The uninitialized optimizer (update rule) used to optimize the
      module

    lr : float (default=0.01)
      Learning rate passed to the optimizer. You may use ``lr`` instead
      of using ``optimizer__lr``, which would result in the same outcome.

This is the rationale behind the use of the roundtrip.
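For future readers, here is a minimal sketch of the roundtrip described above (the helper name and target indentation are illustrative, not the exact skorch code): textwrap.dedent first removes whatever common indentation the running interpreter left in the fragment, and textwrap.indent then re-applies one known indentation, so Python < 3.13 and >= 3.13 end up with identical docstrings.

import textwrap

indentation = "    "  # assumed target indentation

def normalize_doc(fragment):
    # Drop the version-dependent leading whitespace, then re-indent uniformly.
    return textwrap.indent(textwrap.dedent(fragment), indentation)

# Differently indented inputs (3.13-style dedented vs 3.12-style indented)
# normalize to the same string:
a = normalize_doc("criterion : loss\n  description\n")
b = normalize_doc("    criterion : loss\n      description\n")
assert a == b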

@BenjaminBossan (Collaborator) left a comment

Thanks for the further explanation. To summarize: Since the change in Python 3.13 is basically that the docstring is automatically dedented, it makes sense to do dedent + indent to normalize the indentation for <3.13 and >= 3.13. Let's add this as a comment for future readers.

start, end = pattern.search(doc).span()
doc = doc[:start] + neural_net_clf_additional_text + doc[end:]
doc = doc + neural_net_clf_additional_attribute
doc = doc + textwrap.indent(neural_net_clf_additional_attribute, indentation)
Collaborator:

Here the doc is indented again but for the other classes, this step is missing. I think it can be removed here, right?

@raphaelrubrice (Contributor, Author) commented Jan 11, 2025

Actually it cannot. Here's why.

As it is currently defined, neural_net_clf_additional_attribute is not indented, whereas for the other additional texts (for the NeuralNetRegressor and NeuralNetBinaryClassifier objects) the indentation is applied when the text to add is defined.
You would expect the 3.13 automatic dedentation to also affect the other objects' additional texts, but it does not. I therefore suppose that the automatic dedentation we identified for doc.split might stem from how the .split method generates the text and how it is then processed by Python 3.13's new default string handling?

Here are some outputs :

3.13 with indentation of neural_net_clf_additional_attribute:

    _optimizers : list of str
      List of names of all optimizers. This list is collected dynamically when
      the net is initialized. Typically, there is no reason for a user to modify
      this list.

    classes_ : array, shape (n_classes, )
          A list of class labels known to the classifier.

3.13 without indentation of neural_net_clf_additional_attribute:

    _optimizers : list of str
      List of names of all optimizers. This list is collected dynamically when
      the net is initialized. Typically, there is no reason for a user to modify
      this list.

classes_ : array, shape (n_classes, )
      A list of class labels known to the classifier.

Same for Python 3.12.

Collaborator:

I see, thanks for explaining. The real issue then is how we define the neural_net_clf_additional_attribute constant:

neural_net_clf_additional_attribute = """classes_ : array, shape (n_classes, )
A list of class labels known to the classifier.
"""

In contrast to the other doc snippets, this one lacks the newlines and indentation at the beginning. So how about we add those and then remove the textwrap.indent call here? That way, all three classes handle the __doc__ in a consistent manner.
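A hypothetical sketch of that suggestion (the exact wording that was merged may differ): bake the leading newlines and indentation into the constant itself, so the extra textwrap.indent call is no longer needed.

# Hypothetical redefinition with the newlines and indentation included:
neural_net_clf_additional_attribute = """

    classes_ : array, shape (n_classes, )
      A list of class labels known to the classifier.
"""

# The concatenation then no longer needs textwrap.indent:
# doc = doc + neural_net_clf_additional_attribute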

@raphaelrubrice (Contributor, Author) commented Jan 14, 2025

Yes indeed, it has now been corrected.

Should we consider the discussion resolved?

…of textwrap.indent as a consequence of that.
@raphaelrubrice changed the title from "Issue #1080 : Fixing Pasing code for documentation retrieval." to "Issue #1080 : Fixing Parsing code for documentation retrieval." on Jan 15, 2025
@raphaelrubrice changed the title from "Issue #1080 : Fixing Parsing code for documentation retrieval." to "Issue #1080 : Fixing Parsing code for documentation retrieval in Python 3.13." on Jan 15, 2025
@BenjaminBossan (Collaborator) left a comment

Thanks a lot for fixing the docstring issue with Python 3.13 and ensuring consistency between the versions. I tested it locally and the docstrings are identical.

@BenjaminBossan merged commit bb1bac4 into skorch-dev:master on Jan 18, 2025
16 checks passed