
Issue #1080 : Fixing Parsing code for documentation retrieval in Python 3.13. #1082

Merged (8 commits, Jan 18, 2025)

Conversation

raphaelrubrice (Contributor)

Fixes the parsing code so that the documentation from NeuralNet is properly processed when adapting the documentation for the NeuralNetClassifier, NeuralNetBinaryClassifier, and NeuralNetRegressor classes in Python 3.13 and below.
See Issue #1080 for more details.

raphaelrubrice and others added 3 commits December 27, 2024 03:48
replaced old parsing code to retrieve criterion section of NeuralNet documentation from (\n\s+)(criterion .*\n)(\s.+){1,99} to (\n\s+)(criterion .*\n)(\s.+|.){1,99}. This ensures proper parsing in both 3.13 and previous Python versions.
replaced old parsing code to retrieve criterion section of NeuralNet documentation from "(\n\s+)(criterion .*\n)(\s.+){1,99}" to "(\n\s+)(criterion .*\n)(\s.+|.){1,99}". This ensures proper parsing in both 3.13 and previous Python versions.
…was previously cutting out the criterion section, so even with the proper regexp for 3.13 and below, the pattern was not matching. Now works fine.
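For reference, here is a contrived, self-contained illustration (not the real skorch docstring or parsing code) of what the added |. alternative changes: with the old pattern, a line after the criterion line that does not start with whitespace prevents any match, while the new pattern still consumes it character by character.

import re

old = re.compile(r"(\n\s+)(criterion .*\n)(\s.+){1,99}")
new = re.compile(r"(\n\s+)(criterion .*\n)(\s.+|.){1,99}")

# Toy text: the line after "criterion ..." has no leading whitespace,
# as can happen once a docstring has been dedented.
text = "\n criterion : foo\nbar\n"

print(old.search(text))  # None: every (\s.+) repetition needs leading whitespace
print(new.search(text))  # matches "\n criterion : foo\nbar"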
@BenjaminBossan (Collaborator)

Thanks for the PR. I checked the docstring based on your branch with Python 3.13 and found that some paragraphs are missing. This is what I get:

>>> print(NeuralNetClassifier.__doc__[:1000])
NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.

  modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

optimizer : torch optim (class, default=torch.optim.SGD)
  The uninitialized optimizer (update rule) used to optimize the
  module

lr : float (default=0.01)
  Learning rate passed to the optimizer. You may use ``lr`` instead
  of using ``optimizer__lr``, which would result in the sa

Is it the same for you?

What I would expect to see is:

>>> print(NeuralNetClassifier.__doc__[:2000])
NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.

    In addition to the parameters listed below, there are parameters
    with specific prefixes that are handled separately. To illustrate
    this, here is an example:

    >>> net = NeuralNet(
    ...    ...,
    ...    optimizer=torch.optimizer.SGD,
    ...    optimizer__momentum=0.95,
    ...)

    This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
    will take care of setting the ``momentum`` parameter to 0.95.

    (Note that the double underscore notation in
    ``optimizer__momentum`` means that the parameter ``momentum``
    should be set on the object ``optimizer``. This is the same
    semantic as used by sklearn.)

    Furthermore, this allows to change those parameters later:

    ``net.set_params(optimizer__momentum=0.99)``

    This can be useful when you want to change certain parameters
    using a callback, when using the net in an sklearn grid search,
    etc.

    By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
    both training and validation datasets), and :class:`.PrintLog`
    callbacks are added for convenience.

    Parameters
    ----------
    module : torch module (class or instance)
      A PyTorch :class:`~torch.nn.Module`. In general, the
      uninstantiated class should be passed, although instantiated
      modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty 

Do you see the same results?

Apart from that, we could try to fix the inconsistent indentation for Python 3.13 using textwrap.dedent, but let's focus on the missing paragraphs first.

@raphaelrubrice (Contributor, Author) commented Jan 4, 2025

Thanks for catching that. This stemmed from the fact that in Python 3.13 there is one "\n " which becomes a "\n" with no trailing whitespace, which led to one return statement being skipped and to the missing criterion block, as discussed in Issue #1080.

Setting the argument back to "\n" solved that. However, the :class:`.NeuralNetRegressor`. fragment remained at the end of the NeuralNet documentation after splitting with .split("\n", 4). This was fixed by changing it to .split("\n", 5).

I then proceeded to use textwrap, as you suggested, to fix the inconsistent indentation.

Once again I tested imports from both Python 3.13 and Python 3.12, and both work with correct documentation retrieval.
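To make the root cause concrete, here is a minimal illustration (a toy class, not skorch code) of the Python 3.13 behaviour referred to above: starting with 3.13, the compiler strips the common leading whitespace from docstrings, so the same source yields a differently indented __doc__.

class Example:
    """Summary.

    criterion : some description
      continuation line
    """

# On Python <= 3.12, __doc__ keeps the source indentation
# ("    criterion : some description" / "      continuation line").
# On Python >= 3.13, the common leading whitespace is stripped, so the same
# lines come out as "criterion : some description" / "  continuation line",
# and, as noted above, a "\n " can become a plain "\n".
print(repr(Example.__doc__))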

@BenjaminBossan (Collaborator) left a comment

Thanks for the update. I verified locally that the docstrings are now identical for Python 3.12 and 3.13.

Regarding the dedent/indent roundtrip, as it may not be quite obvious when reading the code, could you please add a comment on why that is done?

Furthermore, I noticed that the functions are using 5 spaces in the body, could you please revert that to 4 spaces?

…extwrap.dedent in documentation retrieval for NeuralNetClassifier, NeuralNetBinaryClassifier, NeuralNetRegressor
…ious commit is wrong about the use of textwrap.dedent, it is indeed necessary for correct functionality in both Python 3.13 and 3.12. Reverted to version with textwrap.dedent in documentation retrieval.
@raphaelrubrice (Contributor, Author)

Okay, so I corrected the function bodies.

About the roundtrip: after reading your comment I originally forgot why I had done that, went on to test in Python 3.13, found that it seemed unnecessary in that version, and committed. That was a mistake, so the following section explains why we actually need this roundtrip and why I made 2 commits (the last one reverted the removal of textwrap.dedent). This is why:

When using only a simple doc.split without textwrap for indentation or dedentation, the documentation is not indented as you would expect. However, when wrapping the result of doc.split in textwrap.indent, the documentation is correctly indented in Python 3.13 but no longer in Python 3.12:

NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.


        In addition to the parameters listed below, there are parameters
        with specific prefixes that are handled separately. To illustrate
        this, here is an example:

        >>> net = NeuralNet(
        ...    ...,
        ...    optimizer=torch.optimizer.SGD,
        ...    optimizer__momentum=0.95,
        ...)

        This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
        will take care of setting the ``momentum`` parameter to 0.95.

        (Note that the double underscore notation in
        ``optimizer__momentum`` means that the parameter ``momentum``
        should be set on the object ``optimizer``. This is the same
        semantic as used by sklearn.)

        Furthermore, this allows to change those parameters later:

        ``net.set_params(optimizer__momentum=0.99)``

        This can be useful when you want to change certain parameters
        using a callback, when using the net in an sklearn grid search,
        etc.

        By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
        both training and validation datasets), and :class:`.PrintLog`
        callbacks are added for convenience.

        Parameters
        ----------
        module : torch module (class or instance)
          A PyTorch :class:`~torch.nn.Module`. In general, the
          uninstantiated class should be passed, although instantiated
          modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

        optimizer : torch optim (class, default=torch.optim.SGD)
          The uninitialized optimizer (update rule) used to optimize the
          module

        lr : float (default=0.01)
          Learning rate passed to the optimizer. You may use ``lr`` instead
          of using ``optimizer__lr``, which would result in the same outcome.

So, to ensure correct indentation in both versions, a dedentation is performed before the indentation is applied. When using the roundtrip:

NeuralNet for classification tasks

    Use this specifically if you have a standard classification task,
    with input data X and target y.


    In addition to the parameters listed below, there are parameters
    with specific prefixes that are handled separately. To illustrate
    this, here is an example:

    >>> net = NeuralNet(
    ...    ...,
    ...    optimizer=torch.optimizer.SGD,
    ...    optimizer__momentum=0.95,
    ...)

    This way, when ``optimizer`` is initialized, :class:`.NeuralNet`
    will take care of setting the ``momentum`` parameter to 0.95.

    (Note that the double underscore notation in
    ``optimizer__momentum`` means that the parameter ``momentum``
    should be set on the object ``optimizer``. This is the same
    semantic as used by sklearn.)

    Furthermore, this allows to change those parameters later:

    ``net.set_params(optimizer__momentum=0.99)``

    This can be useful when you want to change certain parameters
    using a callback, when using the net in an sklearn grid search,
    etc.

    By default an :class:`.EpochTimer`, :class:`.BatchScoring` (for
    both training and validation datasets), and :class:`.PrintLog`
    callbacks are added for convenience.

    Parameters
    ----------
    module : torch module (class or instance)
      A PyTorch :class:`~torch.nn.Module`. In general, the
      uninstantiated class should be passed, although instantiated
      modules will also work.

    criterion : torch criterion (class, default=torch.nn.NLLLoss)
      Negative log likelihood loss. Note that the module should return
      probabilities, the log is applied during ``get_loss``.

    classes : None or list (default=None)
      If None, the ``classes_`` attribute will be inferred from the
      ``y`` data passed to ``fit``. If a non-empty list is passed,
      that list will be returned as ``classes_``. If the initial
      skorch behavior should be restored, i.e. raising an
      ``AttributeError``, pass an empty list.

    optimizer : torch optim (class, default=torch.optim.SGD)
      The uninitialized optimizer (update rule) used to optimize the
      module

    lr : float (default=0.01)
      Learning rate passed to the optimizer. You may use ``lr`` instead
      of using ``optimizer__lr``, which would result in the same outcome.

This is the rationale behind the use of the roundtrip.
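For future readers, here is a minimal sketch of the roundtrip described above (the helper name and target indentation are illustrative, not the exact skorch code): textwrap.dedent first removes whatever common indentation the running interpreter left in the fragment, and textwrap.indent then re-applies one known indentation, so Python < 3.13 and >= 3.13 end up with identical docstrings.

import textwrap

indentation = "    "  # assumed target indentation

def normalize_doc(fragment):
    # Drop the version-dependent leading whitespace, then re-indent uniformly.
    return textwrap.indent(textwrap.dedent(fragment), indentation)

# Differently indented inputs (3.13-style dedented vs 3.12-style indented)
# normalize to the same string:
a = normalize_doc("criterion : loss\n  description\n")
b = normalize_doc("    criterion : loss\n      description\n")
assert a == b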

@BenjaminBossan (Collaborator) left a comment

Thanks for the further explanation. To summarize: Since the change in Python 3.13 is basically that the docstring is automatically dedented, it makes sense to do dedent + indent to normalize the indentation for <3.13 and >= 3.13. Let's add this as a comment for future readers.

start, end = pattern.search(doc).span()
doc = doc[:start] + neural_net_clf_additional_text + doc[end:]
doc = doc + neural_net_clf_additional_attribute
doc = doc + textwrap.indent(neural_net_clf_additional_attribute, indentation)
Collaborator:

Here the doc is indented again but for the other classes, this step is missing. I think it can be removed here, right?

@raphaelrubrice (Contributor, Author) commented Jan 11, 2025

Actually it cannot. Here's why.

As it is currently defined, neural_net_clf_additional_attribute is not indented, whereas for the other additional texts (for the NeuralNetRegressor and NeuralNetBinaryClassifier objects) the indentation is applied when the text to add is defined.
You would expect the 3.13 automatic dedentation to also affect the other objects' additional texts, but it does not. I therefore suppose that the automatic dedentation we identified for doc.split might stem from how the .split method generates the text and how it is then processed by Python 3.13's new default string handling?

Here are some outputs :

3.13 with indentation of neural_net_clf_additional_attribute:

    _optimizers : list of str
      List of names of all optimizers. This list is collected dynamically when
      the net is initialized. Typically, there is no reason for a user to modify
      this list.

    classes_ : array, shape (n_classes, )
          A list of class labels known to the classifier.

3.13 without indentation of neural_net_clf_additional_attribute:

    _optimizers : list of str
      List of names of all optimizers. This list is collected dynamically when
      the net is initialized. Typically, there is no reason for a user to modify
      this list.

classes_ : array, shape (n_classes, )
      A list of class labels known to the classifier.

Same for Python 3.12.

Collaborator:

I see, thanks for explaining. The real issue then is how we define the neural_net_clf_additional_attribute constant:

neural_net_clf_additional_attribute = """classes_ : array, shape (n_classes, )
A list of class labels known to the classifier.
"""

In contrast to the other doc snippets, this one lacks the newlines and indentation at the beginning. So how about we add those and then remove the textwrap.indent call here? That way, all three classes handle the __doc__ in a consistent manner.
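A hypothetical sketch of that suggestion (the exact wording that was merged may differ): bake the leading newlines and indentation into the constant itself, so the extra textwrap.indent call is no longer needed.

# Hypothetical redefinition with the newlines and indentation included:
neural_net_clf_additional_attribute = """

    classes_ : array, shape (n_classes, )
      A list of class labels known to the classifier.
"""

# The concatenation then no longer needs textwrap.indent:
# doc = doc + neural_net_clf_additional_attribute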

@raphaelrubrice (Contributor, Author) commented Jan 14, 2025

Yes indeed, it has now been corrected.

Should we consider the discussion resolved?

…of textwrap.indent as a consequence of that.
@raphaelrubrice changed the title from "Issue #1080 : Fixing Pasing code for documentation retrieval." to "Issue #1080 : Fixing Parsing code for documentation retrieval." on Jan 15, 2025
@raphaelrubrice changed the title from "Issue #1080 : Fixing Parsing code for documentation retrieval." to "Issue #1080 : Fixing Parsing code for documentation retrieval in Python 3.13." on Jan 15, 2025
@BenjaminBossan (Collaborator) left a comment

Thanks a lot for fixing the docstring issue with Python 3.13 and ensuring consistency between the versions. I tested it locally and the docstrings are identical.

@BenjaminBossan merged commit bb1bac4 into skorch-dev:master on Jan 18, 2025
16 checks passed