Updating the text granularity extension to support additional values (Section, Table, ...) #2329

Open
brittnylapierre opened this issue Jan 7, 2025 · 0 comments


The Text Granularity Extension reference can be found here. Currently supported values:

  • Page: A page in a paginated document
  • Block: An arbitrary region of text
  • Paragraph: A paragraph
  • Line: A topographic line
  • Word: A single word
  • Glyph: A single glyph or symbol
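
For context, here is a minimal sketch of how one of these values ends up on an annotation. It assumes the extension's property is named `textGranularity` and that the two `@context` URLs below are correct; both come from my reading of the extension and should be checked against the reference. The helper name `make_text_annotation` and the example IDs are hypothetical.

```python
import json

# The six values listed above, lowercased as they would appear in JSON
# (assumption: the extension uses lowercase strings).
SUPPORTED_GRANULARITIES = {"page", "block", "paragraph", "line", "word", "glyph"}


def make_text_annotation(anno_id, text, granularity, target):
    """Build a IIIF-style annotation dict carrying a text granularity value.

    Hypothetical helper for illustration; it rejects any value that is not
    in the currently supported set.
    """
    if granularity not in SUPPORTED_GRANULARITIES:
        raise ValueError(f"unsupported text granularity: {granularity!r}")
    return {
        # Context URLs are assumptions; verify against the extension reference.
        "@context": [
            "http://iiif.io/api/extension/text-granularity/context.json",
            "http://iiif.io/api/presentation/3/context.json",
        ],
        "id": anno_id,
        "type": "Annotation",
        "motivation": "supplementing",
        "textGranularity": granularity,
        "body": {"type": "TextualBody", "value": text},
        "target": target,
    }


if __name__ == "__main__":
    anno = make_text_annotation(
        "https://example.org/anno/1",
        "Hello",
        "word",
        "https://example.org/canvas/1#xywh=10,10,80,20",
    )
    print(json.dumps(anno, indent=2))
```

With the current set, a layout element such as a section or a table has no matching value, so `make_text_annotation(..., "section", ...)` raises `ValueError`.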

Having read the extension's documentation, I wonder whether it would be useful to add values that common AI layout analysis APIs return but the extension does not yet cover. Below I list the return values for Azure and AWS that the extension doesn't already cover.

AWS

  • Title: The main title of the document.
  • Header: Text located in the top margin of the document.
  • Footer: Text located in the bottom margin of the document.
  • Section Title: The titles below the main title that represent sections in the document.
  • Page Number: The page number of the document.
  • List: Any information grouped together in list form.
  • Figure: Indicates the location of an image in a document.
  • Table: Indicates the location of a table in the document.
  • Key Value: Indicates the location of form key-value pairs in a document.
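
To make the gap concrete, here is a rough sketch of how a converter might map AWS Textract block types to granularity values. The `LAYOUT_*` names are the block types I understand Textract's layout feature to return. The values in `PROPOSED` are hypothetical names for values that do not exist in the extension today, and `granularity_for` is an illustrative helper, not part of any library.

```python
# Textract block types that already have a counterpart in the extension.
EXISTING = {
    "PAGE": "page",
    "LINE": "line",
    "WORD": "word",
}

# Textract layout block types with no counterpart today.
# The right-hand values are hypothetical candidates, not part of the extension.
PROPOSED = {
    "LAYOUT_TITLE": "title",
    "LAYOUT_HEADER": "header",
    "LAYOUT_FOOTER": "footer",
    "LAYOUT_SECTION_HEADER": "sectionTitle",
    "LAYOUT_PAGE_NUMBER": "pageNumber",
    "LAYOUT_LIST": "list",
    "LAYOUT_FIGURE": "figure",
    "LAYOUT_TABLE": "table",
    "LAYOUT_KEY_VALUE": "keyValue",
}


def granularity_for(block_type):
    """Return (value, is_existing) for a Textract block type.

    value is None when neither mapping knows the block type.
    """
    if block_type in EXISTING:
        return EXISTING[block_type], True
    if block_type in PROPOSED:
        return PROPOSED[block_type], False
    return None, False


if __name__ == "__main__":
    for bt in ("LINE", "LAYOUT_TABLE", "CELL"):
        print(bt, granularity_for(bt))
```

Keeping the two dictionaries separate makes it easy to see which values a converter can emit today and which would depend on the extension being updated.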

Azure

  • Tables: Tabular content identified and extracted from the document. Tables relate to tables identified by the pretrained layout model. Content labeled as tables is extracted as structured fields in the documents object.
  • Figures: Figures (charts, images) identified and extracted from the document, providing visual representations that aid in the understanding of complex information.
  • Sections: Hierarchical document structure identified and extracted from the document. Section or subsection with the corresponding elements (paragraph, table, figure) attached to it.

I think a value that marks the start of a section would be the most valuable addition, but all of the AWS/tesseract values seem useful.

AI models are increasingly used for these kinds of tasks, which may strengthen the case for updating the extension.
