
The content in the warcRecord includes the trailing \r\n #30

Open
ikreymer opened this issue Jun 3, 2019 · 2 comments

@ikreymer

ikreymer commented Jun 3, 2019

The 'content' ArrayBuffer in the record appears to include the trailing \r\n.
I tested this with compressed WARCs; it may not be the case for uncompressed ones.
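
For context, the WARC format ends every record with two CRLF pairs after the content block, and the block's length is given by the Content-Length header. The sketch below is a minimal Python illustration of slicing the block by Content-Length so that the trailing bytes stay out of it. It is not this library's code: `split_record` is a hypothetical helper, the sample record is made up, and it assumes an already decompressed record with no folded header lines.

```python
from typing import Dict, Tuple

CRLF = b"\r\n"


def split_record(raw: bytes) -> Tuple[Dict[str, str], bytes, bytes]:
    """Split one uncompressed WARC record into (headers, block, trailer).

    Hypothetical helper for illustration only.
    """
    # The WARC header section ends at the first blank line.
    head, sep, rest = raw.partition(CRLF + CRLF)
    if not sep:
        raise ValueError("no blank line after the WARC headers")

    headers: Dict[str, str] = {}
    # The first line is the version line (e.g. WARC/1.0), so skip it.
    for line in head.split(CRLF)[1:]:
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip().lower()] = value.strip()

    # Take exactly Content-Length bytes; whatever follows is the trailer.
    length = int(headers["content-length"])
    return headers, rest[:length], rest[length:]


# Made-up record: 5 bytes of content followed by the two closing CRLFs.
raw = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
headers, block, trailer = split_record(raw)
```

Here `block` holds only the five content bytes and `trailer` holds the two closing CRLFs. For a `response` record the block would still contain the HTTP headers, so the payload needs a second split.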

@BubuAnabelas
Contributor

Can you attach an example, or say which record it happened in? That would help trace the error.

@fushihara

I created a WARC from the page below using wget. I made two files: an uncompressed WARC and a gzipped WARC.
https://archive-it.org/post/the-stack-warc-file/

wget \
  --page-requisites \
  --recursive \
  --level=1 \
  --no-parent \
  -e robots=off \
  --warc-file=output \
  --delete-after \
  --no-directories \
  "https://archive-it.org/post/the-stack-warc-file/"
For the uncompressed file, I ran the same command with --no-warc-compression added.

I extracted the image at the following URL from both files.
https://archive-it.org/wp-content/themes/archive-it_theme/images/facebook.png

When saved with curl, the content-length was 2395 bytes and the CRC-32 was 8BA873CD.
The same image extracted from the WARC had a CRC-32 of A3FA8781. This was the same for the uncompressed and compressed versions.
I am not sure whether this is a wget issue or an issue with this library.
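
One way to narrow this down is to check whether the extracted file is simply the curl file with extra bytes appended. The Python sketch below shows such a check. `crc32_hex` and `trailing_extra` are hypothetical helper names, and the byte strings are stand-ins, so it says nothing about what the attached files actually contain.

```python
import zlib
from typing import Optional


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as 8 uppercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08X")


def trailing_extra(reference: bytes, extracted: bytes) -> Optional[bytes]:
    """Return the bytes appended to reference, or None if the files diverge.

    An empty result means the two inputs are identical.
    """
    if extracted.startswith(reference):
        return extracted[len(reference):]
    return None


# With the real files, one would read them from disk instead, for example:
#   reference = open("facebook-curl.png", "rb").read()
#   extracted = open("facebook-warc-gz.png", "rb").read()
# Stand-in data (an assumption, not the actual image bytes):
reference = b"\x89PNG stand-in bytes"
extracted = reference + b"\r\n"

extra = trailing_extra(reference, extracted)
print(crc32_hex(reference), crc32_hex(extracted), extra)
```

If `trailing_extra` returns `b"\r\n"` for the real files, the mismatch would match the trailing \r\n described in the original report. A `None` result would point to a different cause.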

The attached zip file contains the following files:

facebook-curl.png
facebook-warc-gz.png
facebook-warc-plain.png
output.warc
output.warc.gz

wget was run from Ubuntu under WSL on Windows 10.

$ wget --version
GNU Wget 1.20.3 built on linux-gnu.

Attachment: warc-content-tail.zip
Images: facebook-curl (curl download), facebook-warc-gz (WARC export)
