
Add the PARSeq model for Scene Text Recognition #2036

Open · wants to merge 15 commits into master
Conversation

@gowthamkpr (Collaborator) commented Jan 7, 2025

Adds the PARSeq OCR model for scene text recognition (STR): https://arxiv.org/abs/2207.06966

Features:

  • iterative autoregressive and non-autoregressive decoding and refinement, fully performed within Keras' computation graph
  • training based on permutation sequences, as described in the paper
  • ViT-based recognition model
  • well-tested model with state-of-the-art performance
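
To make the permutation-based training objective concrete: PARSeq trains on a set of decoding orders rather than only left-to-right. The sketch below is an illustrative stand-alone version (the function name and sampling details are mine, not the PR's); it draws distinct position permutations while always keeping the canonical autoregressive order, which the paper also includes in every sampled set:

```python
import random

def sample_permutations(seq_len, num_perms, seed=None):
    """Sample distinct decoding-order permutations for permutation
    language modeling. The canonical left-to-right order is always
    included first. (The paper additionally pairs each sampled
    permutation with its reverse; omitted here for brevity.)"""
    rng = random.Random(seed)
    perms = [list(range(seq_len))]  # canonical AR order first
    while len(perms) < num_perms:
        p = list(range(seq_len))
        rng.shuffle(p)
        if p not in perms:  # keep the sampled orders distinct
            perms.append(p)
    return perms
```

During training, each sampled order defines the attention mask used for one forward pass, so the same parameters learn to decode in arbitrary orders.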

This is (so far) a 1:1 translation of PaddleOCR's implementation (Apache 2.0-licensed), in particular of

Next steps:

  • Port PaddleOCR weights
  • Adaptations to meet KerasHub style
  • Pre/postprocessing, and a dedicated Task for OCR
  • Add tests

@mattdangerw (Member)
Thanks! When this is done, could you attach a colab showing basic usage? Or even just a code snippet that shows the full usage including data input formats would be helpful.

@gowthamkpr (Collaborator, Author) commented Jan 21, 2025

A few comments on the code layout of this first version:

  • There's a strange bug where tests for all backends pass when run on their own, but fail when run after other tests. This occurs because Keras rewrites the class hierarchy of Backbone as soon as another model is instantiated with __init__(inputs=..., outputs=...) (see here); the expected __init__ arguments therefore change, and the inputs and outputs arguments always have to be provided.
    If this is a known problem with a known solution, please let me know; otherwise I'll investigate further and document exactly where the rewrite happens. Due to the way inference and training work with PARSeq, we require a call method here.

  • Compared to the existing ViT implementation, the ViT implementation I've ported from PaddleOCR contains some additional logic for Drop Path (stochastic depth) regularisation, but the pretrained weights do not appear to use it. If the goal is only to run the pretrained weights, we might instead just reuse the existing ViT implementation, if that is preferred.
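
    For context, the core of Drop Path is small. Here is a stand-alone NumPy sketch of the idea (not the PR's Keras layer): entire residual branches are zeroed per sample during training, and survivors are rescaled so the expected value is unchanged:

    ```python
    import numpy as np

    def drop_path(x, drop_prob, rng, training=True):
        """Stochastic depth ("Drop Path"): randomly zero the residual
        branch for whole samples and rescale the survivors. Stand-alone
        NumPy illustration, not the PR's Keras implementation."""
        if not training or drop_prob == 0.0:
            return x
        keep_prob = 1.0 - drop_prob
        # One Bernoulli draw per sample, broadcast over all other axes.
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = rng.binomial(1, keep_prob, size=shape).astype(x.dtype)
        return x / keep_prob * mask
    ```

    Since the pretrained weights were apparently trained without it, the layer would be inert at inference time either way (at drop_prob=0 it is the identity).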

  • For now, I've implemented a simple detokenize method. Alternatively, we could add a full-fledged Tokenizer that supports the model's character set, either by extending the existing ByteTokenizer to accept a given character set or by creating a new one.
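
    As an illustration of what such a simple detokenize step amounts to, here is a stand-alone sketch; the character set, id layout, and EOS handling below are assumptions for the example, not the PR's actual mapping:

    ```python
    # Hypothetical character set and special-token layout: id 0 is EOS,
    # ids 1..len(CHARSET) index into CHARSET. The PR's mapping may differ.
    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"
    EOS_ID = 0

    def detokenize(token_ids):
        """Map predicted token ids to a string, stopping at the first EOS."""
        chars = []
        for tid in token_ids:
            if tid == EOS_ID:
                break
            chars.append(CHARSET[tid - 1])
        return "".join(chars)
    ```

    A dedicated Tokenizer would mainly add the inverse direction (tokenize) plus configuration of the character set, which is what extending ByteTokenizer would buy.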

@gowthamkpr gowthamkpr marked this pull request as ready for review January 21, 2025 16:01