Add Configurable HTML Parser Wrappers for BeautifulSoup and Resiliparse #47
Open: silentninja wants to merge 7 commits into commoncrawl:main from silentninja:html-parser
Commits:
f3e3cd7 Add bs4 and resiliparse html parsers (silentninja)
c6b4ac9 Accept html parser as arguments (silentninja)
864241a modify the index word count to use the html parsers (silentninja)
2a0fd51 Edit README.md to include instructions for the html parsers args (silentninja)
5ae6b86 Add Resiliparse to requirements.txt (silentninja)
9d0b263 Move the html parser config argument to the specific job which uses it (silentninja)
5ddbf5d Fix resiliparse version and add notes on compatibility with fastwarc (silentninja)
@@ -0,0 +1,41 @@
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector


class HTMLParser(object):
    """
    HTML parser using BeautifulSoup4
    """

    def html_to_text(self, html_tree: BeautifulSoup) -> str:
        """
        Convert HTML content to plain text using BeautifulSoup4.

        Returns:
            str: Extracted plain text with scripts and styles removed
        """
        for script in html_tree(['script', 'style']):
            script.extract()
        text = html_tree.get_text(' ', strip=True)
        return text

    def get_html_tree(self, page: bytes, encoding: str=None, features='lxml', **kwargs) -> BeautifulSoup:
        """
        Return the HTML tree object

        Args:
            page (bytes): Raw HTML content as bytes
            encoding (str, optional): Specific character encoding to use. If None, auto-detection is attempted
            features: Parser to be used (default='lxml'). Refer https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for supported parsers.
            **kwargs: Additional arguments passed to BeautifulSoup constructor.
                Refer here https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.BeautifulSoup for accepted arguments.

        Returns:
            BeautifulSoup: HTML tree object
        """
        if not encoding:
            for encoding in EncodingDetector(page, is_html=True).encodings:
                # take the first detected encoding
                break
        soup = BeautifulSoup(page, features, from_encoding=encoding, **kwargs)
        return soup
@@ -0,0 +1,36 @@
from resiliparse.extract.html2text import extract_plain_text
from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree


class HTMLParser(object):
    """
    HTML parser using Resiliparse
    """

    def html_to_text(self, tree, **kwargs) -> str:
        """
        Convert HTML content to plain text using Resiliparse.

        Returns:
            str: Extracted plain text with scripts and styles removed
        """
        text = extract_plain_text(tree, **kwargs)
        return text

    def get_html_tree(self, page: bytes, encoding: str=None, **kwargs) -> HTMLTree:
        """
        Get the HTML tree object

        Args:
            page (bytes): Raw HTML content as bytes
            encoding (str, optional): Specific character encoding to use. If None, auto-detection is attempted
            **kwargs: Additional arguments passed to HTMLTree.parse_from_bytes.
                Arguments for extract_plain_text (see https://resiliparse.chatnoir.eu/en/latest/api/extract/html2text.html#resiliparse.extract.html2text.extract_plain_text) belong to html_to_text instead.
        Returns:
            HTMLTree: Parsed HTML tree object
        """
        if not encoding:
            encoding = detect_encoding(page)
        tree = HTMLTree.parse_from_bytes(page, encoding, **kwargs)
        return tree
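
Both wrappers expose the same two methods, `get_html_tree` and `html_to_text`, so a job can pick one by name and treat them interchangeably. A rough sketch of that dispatch follows. The registry `PARSERS`, the functions `make_parser` and `extract_text`, and the standard-library stand-in `StdlibParser` are all hypothetical names used for illustration, not code from this PR; the stand-in exists only so the example runs without bs4 or Resiliparse installed.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Callable, Dict, List, Optional


class _TextCollector(StdlibHTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class StdlibParser:
    """Hypothetical stand-in with the same two methods as the wrappers above."""

    def get_html_tree(self, page: bytes, encoding: Optional[str] = None, **kwargs) -> _TextCollector:
        # Assumption: fall back to UTF-8 instead of detecting the encoding.
        collector = _TextCollector()
        collector.feed(page.decode(encoding or "utf-8", errors="replace"))
        collector.close()
        return collector

    def html_to_text(self, tree: _TextCollector, **kwargs) -> str:
        return " ".join(tree.chunks)


# Hypothetical registry: the bs4 and Resiliparse wrappers could be added
# here under their own names in the same way.
PARSERS: Dict[str, Callable[[], object]] = {"stdlib": StdlibParser}


def make_parser(name: str):
    """Instantiate the parser registered under `name`."""
    try:
        return PARSERS[name]()
    except KeyError:
        raise ValueError(
            f"unknown HTML parser {name!r}; choose from {sorted(PARSERS)}"
        ) from None


def extract_text(page: bytes, parser_name: str = "stdlib") -> str:
    """Parse `page` and return its plain text using the selected parser."""
    parser = make_parser(parser_name)
    return parser.html_to_text(parser.get_html_tree(page))
```

Because callers only depend on the two method names, swapping the parser is a matter of changing the string passed to `extract_text`.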
Have you verified that it is possible to install and use FastWARC and Resiliparse with different versions?
Hi @sebastian-nagel, thanks for the catch! Resiliparse has a strict version dependency on FastWARC, and installation fails with an error when the versions are incompatible. I will pin the tested version and add a comment in requirements.txt to highlight this.
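
The version coupling described in this reply can be inspected locally. Below is a minimal sketch, assuming Resiliparse declares its FastWARC requirement in its installed package metadata; the helper names `requirement_name` and `declared_requirement` are illustrative and not part of this PR.

```python
import re
from importlib import metadata
from typing import Optional


def requirement_name(requirement: str) -> str:
    """Return the normalized project name from a requirement string."""
    # Cut at the first version operator, extras bracket, marker, or space.
    name = re.split(r"[\s;<>=!~\[(]", requirement, maxsplit=1)[0]
    return name.lower().replace("_", "-")


def declared_requirement(dist: str, dependency: str) -> Optional[str]:
    """Return the requirement that `dist` declares for `dependency`.

    Returns None if `dist` is not installed or declares no such dependency.
    """
    try:
        requirements = metadata.requires(dist) or []
    except metadata.PackageNotFoundError:
        return None
    wanted = requirement_name(dependency)
    for requirement in requirements:
        if requirement_name(requirement) == wanted:
            return requirement
    return None


if __name__ == "__main__":
    # Prints the declared requirement string, or None if Resiliparse is
    # not installed in the current environment.
    print(declared_requirement("resiliparse", "fastwarc"))
```

Comparing the printed requirement with the versions listed in requirements.txt is one way to confirm that the two entries agree before installing.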