More detailed marking of truncated records due to "network disconnect" #13

Open
sebastian-nagel opened this issue Aug 30, 2019 · 0 comments


(see #10)

A detailed analysis of WARC records flagged as truncated with reason "disconnect" because of an IOException shows that these "disconnects" are also caused by various violations or buggy implementations of the HTTP protocol.

Exception message counts derived from analyzing crawler logs of 200+ million fetches:

  • network/socket (timeout)
19447   java.net.SocketTimeoutException: timeout
815     java.net.SocketTimeoutException: Read timed out
  • network/socket "disconnect" (?)
3186    java.net.SocketException: Connection reset
  • Content-Encoding: gzip
39068   java.io.IOException: gzip finished without exhausting source
9890    java.io.EOFException: source exhausted prematurely
5623    java.io.IOException: CRC: actual 0x........ != expected 0x........
3180    java.io.EOFException  (no further information, thrown in okio.GzipSource.consumeTrailer(...))
126     java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
71      java.io.IOException: ISIZE: actual 0x........ != expected 0x........
38      java.io.IOException: java.util.zip.DataFormatException: invalid block type
32      java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
2       java.io.IOException: java.util.zip.DataFormatException: too many length or distance symbols
1       java.io.IOException: java.util.zip.DataFormatException: invalid literal/length code
  • wrong Content-Length: ... header
5320    java.net.ProtocolException: unexpected end of stream  (thrown in okhttp3.internal.http1.Http1ExchangeCodec$FixedLengthSource.read(...))
  • Transfer-Encoding: chunked
3855    java.io.EOFException  (no further information, thrown in okio.RealBufferedSource.readHexadecimalUnsignedLong(...))
607     java.net.ProtocolException: unexpected end of stream  (thrown in okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(...))
151     java.io.EOFException: \n not found: limit=0 content=…
24      java.net.ProtocolException: Expected leading [0-9a-fA-F] character but was 0x..
2       java.net.ProtocolException: expected chunk size and optional extensions but was "..."
  • SSL errors
1590    javax.net.ssl.SSLException: SSL peer shut down incorrectly
  • unclassified by message alone (a detailed stack analysis would attribute these to chunked or gzip, see "thrown in" above)
6824    java.io.EOFException
5927    java.net.ProtocolException: unexpected end of stream
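
The grouping above could be automated roughly as follows. This is only a minimal sketch, not code from the crawler: the class `TruncationClassifier`, the enum `Cause` and its constants are hypothetical names, and the rules simply mirror the exception classes, message prefixes and "thrown in" frames listed above. Only the top stack frame is inspected, which is an assumption; a real implementation may need to look deeper into the stack.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.ProtocolException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.zip.DataFormatException;
import javax.net.ssl.SSLException;

/** Hypothetical helper: maps an IOException to one of the groups listed above. */
final class TruncationClassifier {

    /** Hypothetical category names, one per bullet in the list. */
    enum Cause { TIMEOUT, DISCONNECT, GZIP, CONTENT_LENGTH, CHUNKED, SSL, UNCLASSIFIED }

    /** Returns "declaringClass.method" of the frame that threw, or "" if there is no stack. */
    private static String topFrame(Throwable t) {
        StackTraceElement[] st = t.getStackTrace();
        return st.length == 0 ? "" : st[0].getClassName() + "." + st[0].getMethodName();
    }

    static Cause classify(IOException e) {
        String msg = e.getMessage() == null ? "" : e.getMessage();
        String frame = topFrame(e);

        // Network/socket and SSL problems are identified by exception class.
        if (e instanceof SocketTimeoutException) return Cause.TIMEOUT;
        if (e instanceof SSLException) return Cause.SSL;
        if (e instanceof SocketException && msg.startsWith("Connection reset")) return Cause.DISCONNECT;

        // Content-Encoding: gzip (message prefixes and frames taken from the list above).
        if (e.getCause() instanceof DataFormatException
                || msg.startsWith("gzip finished without exhausting source")
                || msg.startsWith("source exhausted prematurely")
                || msg.startsWith("CRC: ")
                || msg.startsWith("ISIZE: ")
                || frame.startsWith("okio.GzipSource.")) {
            return Cause.GZIP;
        }

        // Transfer-Encoding: chunked. "\\n" is a literal backslash followed by 'n'.
        if (msg.startsWith("\\n not found")
                || msg.startsWith("Expected leading [0-9a-fA-F] character")
                || msg.startsWith("expected chunk size")
                || frame.endsWith("$ChunkedSource.read")
                || frame.equals("okio.RealBufferedSource.readHexadecimalUnsignedLong")) {
            return Cause.CHUNKED;
        }

        // Body shorter than announced in the Content-Length header.
        if (e instanceof ProtocolException && frame.endsWith("$FixedLengthSource.read")) {
            return Cause.CONTENT_LENGTH;
        }

        // Bare EOFException / "unexpected end of stream" without a telling top frame.
        return Cause.UNCLASSIFIED;
    }

    public static void main(String[] args) {
        IOException e = new IOException("gzip finished without exhausting source");
        System.out.println(classify(e)); // prints "GZIP"
    }
}
```

Rules that depend on the throwing frame are what separate the otherwise identical `java.io.EOFException` and `unexpected end of stream` entries. Without a matching frame they fall through to `UNCLASSIFIED`, as in the last group of the list.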

First, we need to clarify how to annotate truncations caused by protocol-level errors. If the HTTP protocol is thought of as part of the network, marking them as "network disconnect" might be ok.
