(protocol-okhttp) reliably annotate content truncated by length limit:
this happens mostly (100 out of 110) for pages with Content-Encoding: gzip
no truncation flag is added if the loop that reads the content chunk by chunk exits after reaching the content limit exactly: verify and open an issue to fix this in upstream Nutch (NUTCH-2729)
in 3 analyzed WARC files, all records flagged as "disconnect" have either "gzip" content encoding or "chunked" transfer encoding (or both), so the cause could also be a broken encoding rather than a "network disconnect". Note: we would also need to clarify how to annotate truncations due to protocol-level errors.
always add a Content-Length header to the HTTP headers in the WARC file, even if there wasn't one in the original HTTP response (e.g. for chunked transfer encoding). Implemented in 3663f35.
(along the way) upgrade to the latest okhttp library (NUTCH-2728)
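The "limit reached exactly" case from the list above can be sketched as follows. This is a minimal illustration, not the actual protocol-okhttp code: the class `TruncationSketch`, the method `readLimited`, and the result type `LimitedContent` are hypothetical names, and the sketch assumes a non-negative `maxContent`. When the read loop stops because the limit was hit, it probes for one more byte to decide whether the content really was truncated.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class TruncationSketch {

    /** Hypothetical result type: the bytes kept plus a truncation flag. */
    public static final class LimitedContent {
        public final byte[] bytes;
        public final boolean truncated;

        LimitedContent(byte[] bytes, boolean truncated) {
            this.bytes = bytes;
            this.truncated = truncated;
        }
    }

    /**
     * Reads at most maxContent bytes (assumed >= 0) chunk by chunk and
     * reports whether the stream held more data than was kept.
     */
    public static LimitedContent readLimited(InputStream in, int maxContent)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxContent;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                // End of stream before the limit: complete, not truncated.
                return new LimitedContent(out.toByteArray(), false);
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        // The limit was reached exactly. Probe for one more byte so that
        // content of exactly maxContent bytes is not flagged, while longer
        // content is.
        boolean truncated = in.read() != -1;
        return new LimitedContent(out.toByteArray(), truncated);
    }
}
```

The probe costs one extra read on the stream, which may block until the server sends data or closes the connection; whether that is acceptable in the fetcher is an open design question.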
There are some oddities in how truncated captures are recorded in WARC files. See also Henry Thompson's report and the discussion in the Common Crawl user group.
Content-Encoding: gzip