Allow to access WARC record filename and offset #6
Actually, accessing the record offset or length causes the entire record to be consumed, so it must be done after the record has been processed.
sebastian-nagel added a commit that referenced this issue on Jul 19, 2019:
from ArchiveIterator, implements #6
- introduce customizable method `iterate_records(warc_file_uri, archive_iterator)` which iterates over WARC records and calls `process_record(record)`
- document pitfall: accessing offset and length must be done after the WARC record is processed
Implemented with 7e2f67a: by overriding the method
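For illustration, the pattern described in the commit message might look roughly like the sketch below. It is not the code from the commit: `OffsetAwareJob` and `FakeArchiveIterator` are hypothetical names, and the fake iterator is only a toy stand-in so the example runs without any WARC file. It assumes the iterator exposes `get_record_offset()` and `get_record_length()`, as warcio's `ArchiveIterator` does. The key point is the ordering: the record is processed first, and offset and length are read afterwards.

```python
class FakeArchiveIterator:
    """Toy stand-in for an archive iterator (hypothetical, demonstration only)."""

    def __init__(self, payloads):
        self._payloads = payloads
        self._offset = 0
        self._length = 0
        self._next_offset = 0

    def __iter__(self):
        for payload in self._payloads:
            # Remember where the current record starts and how long it is.
            self._offset = self._next_offset
            self._length = len(payload)
            self._next_offset += len(payload)
            yield payload

    def get_record_offset(self):
        return self._offset

    def get_record_length(self):
        return self._length


class OffsetAwareJob:
    """Hypothetical job class with the two hooks named in the commit message."""

    def process_record(self, record):
        # Placeholder: real code would parse the record content here.
        return record

    def iterate_records(self, warc_file_uri, archive_iterator):
        for record in archive_iterator:
            # Process the record first ...
            result = self.process_record(record)
            # ... and only then ask for offset and length, because with a
            # real iterator this consumes the remainder of the record.
            offset = archive_iterator.get_record_offset()
            length = archive_iterator.get_record_length()
            yield warc_file_uri, offset, length, result


if __name__ == "__main__":
    job = OffsetAwareJob()
    records = FakeArchiveIterator([b"abc", b"defgh"])
    for item in job.iterate_records("example.warc.gz", records):
        print(item)
```

With a real iterator, swapping the order (reading offset or length before `process_record`) would leave nothing of the record content to process, which is the pitfall noted above.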
See this discussion: https://groups.google.com/d/topic/common-crawl/7MuqVmvajoA/discussion
Offset and length are not part of the ArcWarcRecord but are known only to the ArchiveIterator. Ideally, it should be possible to access WARC filename, record offset and length in the process_record method.