WARC Parsing sometimes results in truncated records. #34

ikreymer · 2019-07-22T15:04:23Z

The WARC parsing sometimes results in records being truncated.

This might be because the parser keeps looking for newlines (reading one line at a time) even while parsing the content body, so the truncation may happen whenever a \r\n\r\n sequence appears in the body of the record.
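To make the suspected failure mode concrete, here is a toy illustration. It is not node-warc's actual parser: the record, the variable names, and the "naive" strategy are all assumptions made up for the example. It only shows how treating the next blank line as the end of a record cuts off a body that itself contains \r\n\r\n.

```javascript
// Toy illustration only; this is NOT node-warc's parsing code.
// The record below is hand-built and the variable names are hypothetical.
const body = Buffer.from('<html>\r\n\r\n<body>hello</body></html>');
const record = Buffer.concat([
  Buffer.from(`WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: ${body.length}\r\n\r\n`),
  body,
  Buffer.from('\r\n\r\n'), // the two CRLFs that terminate a WARC record
]);

const SEP = Buffer.from('\r\n\r\n');
const bodyStart = record.indexOf(SEP) + SEP.length;

// Naive strategy: the next blank line is assumed to end the record.
const naiveBody = record.subarray(bodyStart, record.indexOf(SEP, bodyStart));

// Length-based strategy: take exactly Content-Length bytes.
const fullBody = record.subarray(bodyStart, bodyStart + body.length);

console.log(JSON.stringify(naiveBody.toString('utf-8')));
console.log(JSON.stringify(fullBody.toString('utf-8')));
```

The naive slice stops at the blank line inside the HTML, while the length-based slice returns the whole body.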

The issue can be seen by running:

const { AutoWARCParser } = require('node-warc');
  
(async () => {
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();

With these example files:
test1.warc.gz
(last couple of bytes are cut-off)

test2.warc.gz
(most of the file is cut-off after initial comment)

For comparison, the warcio version prints the full record:

from warcio import ArchiveIterator
  
for record in ArchiveIterator(open('./test1.warc.gz', 'rb')):
    print(record.content_stream().read().decode('utf-8'))

comatory · 2019-09-01T13:15:45Z

I'm trying to parse the contents as well (especially HTML) and most of the sites return the HTML truncated.

chichicuervo · 2020-04-01T05:35:50Z

I'm experiencing this as well. The issue seems to be in warcRecord/builder.js, in the function consumeLine(): basically, you're using a lone CRLF as the boundary between WARC records. It seems to me that you should ignore CRLFs until you have read as many bytes as the WARC record's Content-Length specifies.
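A minimal sketch of the Content-Length approach suggested here, assuming the whole uncompressed WARC is already in memory as a Buffer. `readRecords` and `makeRecord` are hypothetical helpers written for this example; they are not part of node-warc, and a real streaming parser would also need to handle chunk boundaries and gzip members.

```javascript
// Hypothetical helpers, not node-warc's API: split an uncompressed WARC
// buffer into records by trusting Content-Length instead of scanning for CRLFs.
const CRLF2 = Buffer.from('\r\n\r\n');

function* readRecords(buf) {
  let pos = 0;
  while (pos < buf.length) {
    // The WARC header block ends at the first blank line.
    const headerEnd = buf.indexOf(CRLF2, pos);
    if (headerEnd === -1) break; // incomplete header, stop
    const header = buf.toString('utf-8', pos, headerEnd);
    const match = /^Content-Length:[ \t]*(\d+)/im.exec(header);
    if (!match) throw new Error('record has no Content-Length header');

    // Read exactly Content-Length bytes, whatever they contain.
    const bodyStart = headerEnd + CRLF2.length;
    const bodyEnd = bodyStart + Number(match[1]);
    yield { header, content: buf.subarray(bodyStart, bodyEnd) };

    // Only now skip the two CRLFs that separate records.
    pos = bodyEnd + CRLF2.length;
  }
}

// Usage with two hand-built records; the first body contains a blank line.
const makeRecord = (body) =>
  Buffer.from(
    `WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}\r\n\r\n`
  );

const warc = Buffer.concat([
  makeRecord('line one\r\n\r\nline two'),
  makeRecord('second record'),
]);
const records = [...readRecords(warc)];
console.log(records.map((r) => r.content.toString('utf-8')));
```

Because the body is sliced by byte count, a \r\n\r\n inside the content is never mistaken for a record boundary.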

chichicuervo · 2020-04-01T05:53:43Z

It seems that commenting out the line
this._parsingState = parsingStates.consumeCRLFContent2

inside the switch statement resolves this specific problem. Is there a specific protocol reason it's there, like trailers or something, or were you just trying to avoid trailing CRLFs?

N0taN3rd · 2020-04-05T04:30:39Z

The aim of consumeCRLFContent2 was to skip the trailing CRLFs between records.
For more details see the spec.

If that fixes the issue for all cases feel free to open a PR.
I am working on this, albeit slowly, on the updates branch.
Also feel free to add to the effort via a PR.
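For reference, the WARC grammar defines a record as the header, a CRLF, the content block, and then two CRLFs. A minimal sketch of skipping that trailer safely, assuming the parser already knows the offset where the content block ends (from Content-Length). `skipRecordTrailer` is a hypothetical helper, not a function in node-warc, and it is deliberately lenient about how many CR/LF bytes it consumes.

```javascript
// Hypothetical helper, not node-warc's API.
// Skips the CR/LF bytes that follow a record's content block.
// It must only be called at the offset where the content block ends;
// CR/LF bytes inside the block itself are data and are never examined here.
const CR = 0x0d;
const LF = 0x0a;

function skipRecordTrailer(buf, contentEnd) {
  let pos = contentEnd;
  while (pos < buf.length && (buf[pos] === CR || buf[pos] === LF)) {
    pos++;
  }
  return pos; // offset of the next record's "WARC/" version line, or buf.length
}

// Usage: a 4-byte content block followed by the trailer and the next record.
const sample = Buffer.from('body\r\n\r\nWARC/1.0\r\n');
const next = skipRecordTrailer(sample, 4);
console.log(sample.toString('utf-8', next, next + 8));
```

Since the next record always starts with the `WARC/` version line, consuming CR/LF bytes at this point cannot eat record data, which is not true when the same skipping runs while the body is still being read.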

N0taN3rd · 2020-04-05T04:30:56Z

Also @ikreymer still waiting on that PR fixing this you said you had

ikreymer · 2020-04-05T04:52:01Z

Hey, here is the fix that I have in builder.js:
ikreymer@e5bfd2f

I could isolate that and make it into a PR. I'm not 100% sure it fixes the issue, but I think so. I had a bunch of other changes on that branch, though.

ikreymer · 2020-04-14T23:32:49Z

@chichicuervo in case this helps you, I wanted to mention I've been working on a new WARC reader library focusing on streaming WARCs in the browser: https://github.com/ikreymer/warcio.js
It works comparably to the Python warcio library and should not have the truncation issue.

It's not as far along as this library and doesn't yet have support for writing WARCs, though.
