Updating the text granularity extension to support additional values (Section, Table, ...) #2329

brittnylapierre · 2025-01-07T15:45:36Z

The Text Granularity Extension reference can be found here. Currently supported values:

Page: A page in a paginated document
Block: An arbitrary region of text
Paragraph: A paragraph
Line: A topographic line
Word: A single word
Glyph: A single glyph or symbol
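
For context, here is a minimal sketch of how one of the values above ends up on an annotation. It assumes the extension's `textGranularity` property and lowercase value strings. The context URLs are written from memory and should be checked against the extension reference. The helper name `make_text_annotation` and the example.org identifiers are hypothetical.

```python
# Values currently defined by the extension, as listed above (lowercase form assumed).
SUPPORTED_GRANULARITIES = ("page", "block", "paragraph", "line", "word", "glyph")


def make_text_annotation(anno_id, target, text, granularity):
    """Build a text annotation dict carrying a textGranularity value.

    Hypothetical helper for illustration only; it rejects any value that is
    not in the currently supported list.
    """
    if granularity not in SUPPORTED_GRANULARITIES:
        raise ValueError(f"unsupported textGranularity: {granularity!r}")
    return {
        # Context URLs are assumptions; verify them against the extension reference.
        "@context": [
            "http://iiif.io/api/extension/text-granularity/context.json",
            "http://iiif.io/api/presentation/3/context.json",
        ],
        "id": anno_id,
        "type": "Annotation",
        "motivation": "supplementing",
        "textGranularity": granularity,
        "body": {"type": "TextualBody", "format": "text/plain", "value": text},
        "target": target,
    }


line_anno = make_text_annotation(
    "https://example.org/anno/1",
    "https://example.org/canvas/1#xywh=0,0,100,20",
    "Hello world",
    "line",
)
```

With the current list, a call such as `make_text_annotation(..., "section")` raises `ValueError`, which is the gap this issue is about.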

Looking at the extension's documentation, I wonder whether it would be useful to extend the text granularity extension with values it does not currently define, drawn from common AI layout analysis APIs. Below I list the return values from Azure and AWS that the extension does not already cover.

AWS

Title: The main title of the document.
Header: Text located in the top margin of the document.
Footer: Text located in the bottom margin of the document.
Section Title: The titles below the main title that represent sections in the document.
Page Number: The page number of the document.
List: Any information grouped together in list form.
Figure: Indicates the location of an image in a document.
Table: Indicates the location of a table in the document.
Key Value: Indicates the location of form key-value pairs in a document.
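
To make the AWS list concrete, here is a rough sketch of how a layout result could be mapped onto granularity values if the extension were extended. The `LAYOUT_*` names are the block types as I understand them from the AWS Textract layout documentation. The right-hand values ("title", "sectionTitle", ...) are purely hypothetical proposals and are not part of the extension today. The helper `granularity_for_block` is also made up for illustration.

```python
from typing import Optional

# Left: AWS layout block types (assumed from the Textract documentation).
# Right: HYPOTHETICAL granularity values that do not exist in the extension yet.
LAYOUT_TO_PROPOSED_GRANULARITY = {
    "LAYOUT_TITLE": "title",
    "LAYOUT_HEADER": "header",
    "LAYOUT_FOOTER": "footer",
    "LAYOUT_SECTION_HEADER": "sectionTitle",
    "LAYOUT_PAGE_NUMBER": "pageNumber",
    "LAYOUT_LIST": "list",
    "LAYOUT_FIGURE": "figure",
    "LAYOUT_TABLE": "table",
    "LAYOUT_KEY_VALUE": "keyValue",
}


def granularity_for_block(block: dict) -> Optional[str]:
    """Return the proposed granularity for a layout block, or None if unmapped."""
    return LAYOUT_TO_PROPOSED_GRANULARITY.get(block.get("BlockType"))


# Example input shaped like a trimmed-down layout response (fields omitted).
blocks = [
    {"BlockType": "LAYOUT_TITLE"},
    {"BlockType": "LAYOUT_SECTION_HEADER"},
    {"BlockType": "LAYOUT_TABLE"},
    {"BlockType": "WORD"},
]
proposed = [granularity_for_block(b) for b in blocks]
```

Block types outside the mapping (such as `WORD` here) return `None`, since they are either already covered by the existing values or out of scope for this proposal.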

Azure

Tables: Tabular content identified and extracted from the document. Tables relate to tables identified by the pretrained layout model. Content labeled as tables is extracted as structured fields in the documents object.
Figures: Figures (charts, images) identified and extracted from the document, providing visual representations that aid in the understanding of complex information.
Sections: Hierarchical document structure identified and extracted from the document. Section or subsection with the corresponding elements (paragraph, table, figure) attached to it.

I think a value indicating the start of a section would be the most valuable, but all of the AWS Textract values seem useful.

AI models are increasingly used for these kinds of tasks, which may strengthen the case for updating the extension.
