Label offset not accurate in case of table #92
Comments
Would annotating
I can try, but that would miss the point of capturing the full list as a single segment, wouldn't it?
With the current implementation, annotations cover the area between an element's start and stop tag. In case of an in my opinion this is a use case where it makes more sense to capture the content of the
The issue with that solution is that I use inscriptis in an automated RT process that parses multiple different domains, so there is no way to identify it beforehand.
I see. At the moment I do not see how this use case could be supported with the current annotation design, which marks relevant text with a Your use case would require returning a tree structure outlining which
Hi,
I encountered this bug while trying to scrape a specific site:
`
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

page = """
rules = {'ul': ['ul'], 'table': ['table']}
output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n', 'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}
(start_index, end_index, annotation) = output['label'][1]
output['text'][start_index:end_index]
# ' * item1 * item5\n * item2 * item'
`
As can be seen, slicing the text with a label's offsets does not return that label's content when the lists sit inside a table: the slice for the first 'ul' label mixes items from both lists and is cut off mid-item.
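To illustrate why a single (start, end) pair cannot describe a list that is rendered next to another one, here is a minimal, library-independent sketch. The `render_columns` helper, the column width of 10 and the idea of keeping one span per line are assumptions made for illustration only; they are not part of inscriptis or its annotation format.

```python
def render_columns(left, right, width):
    """Render two cells side by side.

    Returns the rendered text plus one (start, end) span per line
    for the left cell and for the right cell.
    Assumes every left item is shorter than `width`.
    """
    text = ""
    left_spans, right_spans = [], []
    for l_item, r_item in zip(left, right):
        line_start = len(text)
        left_spans.append((line_start, line_start + len(l_item)))
        right_start = line_start + width
        right_spans.append((right_start, right_start + len(r_item)))
        # Pad the left item so the right item starts at a fixed column.
        text += l_item.ljust(width) + r_item + "\n"
    return text, left_spans, right_spans


left = ["* item1", "* item2", "* item3", "* item4"]
right = ["* item5", "* item6", "* item7", "* item8"]
text, left_spans, right_spans = render_columns(left, right, width=10)

# One contiguous span from the first to the last line of the left cell
# also swallows the right cell's text on the lines in between.
contiguous = text[left_spans[0][0]:left_spans[-1][1]]

# One span per line recovers exactly the left cell's content.
per_line = "\n".join(text[start:end] for start, end in left_spans)

print(repr(contiguous))
print(repr(per_line))
```

Under these assumptions, the contiguous slice contains items from both cells, while the per-line spans give back only the left list. This is roughly the kind of richer structure a caller would need in order to recover a full list from a table layout.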