Formulas in a separate paragraph or in stand alone tags #1252

Open
panagiotis-tsolakis opened this issue Feb 24, 2025 · 3 comments
Labels
bug From Hemiptera and especially its suborder Heteroptera

Comments

@panagiotis-tsolakis

I have noticed that formulas tend to either be in a separate paragraph by themselves (even when they are part of a bigger paragraph) or be placed in a stand-alone formula tag (even when they are part of a sentence). This can create problems during the cleaning of the TEI file and its conversion to TXT.

Image

Image

Also, some text can go missing before or after formulas. The text in the following screenshot comes right after the text in the image above. The text highlighted in yellow is missing from the TEI file, and the formula is again in a stand-alone tag.

Image

Image

@lfoppiano
Collaborator

Hi @panagiotis-tsolakis, thanks again. I need to know which Grobid version you are using and, if possible, the PDF files for testing.

Thank you

@panagiotis-tsolakis
Author

I used Grobid 0.8.1.

Here's the PDF file:

zhang-etal-2024-quantized.pdf

@lfoppiano lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Mar 4, 2025
@lfoppiano
Collaborator

lfoppiano commented Mar 4, 2025

@panagiotis-tsolakis Thanks for the report.

Regarding the formula: the example is a mistake of the model, because the formula is part of the text and should be embedded inline in the paragraph. However, having equations/formulas between paragraphs is expected for equation/formula blocks, which usually carry a label such as (1) or (2).
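
Until the model handles this better, a post-processing workaround on the TEI side might look like the sketch below. It is only an illustration, not part of Grobid: the function names are made up, it assumes the formulas appear as `<formula>` children of a body `<div>` next to `<p>` elements, and it assumes that a formula without a `<label>` child belongs to the surrounding sentence, so the paragraph that follows it is treated as a continuation. That last assumption will be wrong whenever an unlabeled formula actually ends a paragraph.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used in the TEI output


def flatten(elem):
    """Return all text inside an element, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def merge_inline_formulas(div):
    """Turn the children of one <div> into a list of plain-text paragraphs.

    Heuristic (an assumption, not Grobid behaviour): a <formula> without a
    <label> child is glued to the previous paragraph, and the paragraph
    right after it is glued on as well. Labeled formulas stay separate.
    """
    paragraphs = []
    merge_next = False  # True right after an unlabeled formula was merged
    for child in div:
        text = flatten(child)
        if not text:
            continue
        if child.tag == TEI + "formula":
            unlabeled = child.find(TEI + "label") is None
            if unlabeled and paragraphs:
                paragraphs[-1] += " " + text
                merge_next = True
            else:
                paragraphs.append(text)
                merge_next = False
        elif child.tag == TEI + "p" and merge_next:
            paragraphs[-1] += " " + text
            merge_next = False
        else:
            # Ordinary paragraphs, headings, and anything else stay separate.
            paragraphs.append(text)
            merge_next = False
    return paragraphs
```

The function would be called once per `<div>` of the parsed TEI body, and the returned strings joined with blank lines to produce the TXT output.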

Regarding the second part of the issue, this is indeed a problem: the text that you identified is incorrectly classified as a figure caption. It is then correctly removed from the caption, but it gets tossed on the floor instead of being returned to the body text.
I'm going to fix this by pushing the discarded text back into the paragraph. It's a high-priority issue.
