A detailed analysis of WARC records flagged as truncated with reason "disconnect" because of an IOException shows that these "disconnects" are also caused by various violations or buggy implementations of the HTTP protocol.
Exception message counts derived from analyzing crawler logs of 200+ million fetches:
network/socket (timeout)
19447 java.net.SocketTimeoutException: timeout
815 java.net.SocketTimeoutException: Read timed out
network/socket "disconnect" (?)
3186 java.net.SocketException: Connection reset
Content-Encoding: gzip
39068 java.io.IOException: gzip finished without exhausting source
9890 java.io.EOFException: source exhausted prematurely
5623 java.io.IOException: CRC: actual 0x........ != expected 0x........
3180 java.io.EOFException (no further information, thrown in okio.GzipSource.consumeTrailer(...))
126 java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
71 java.io.IOException: ISIZE: actual 0x........ != expected 0x........
38 java.io.IOException: java.util.zip.DataFormatException: invalid block type
32 java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
2 java.io.IOException: java.util.zip.DataFormatException: too many length or distance symbols
1 java.io.IOException: java.util.zip.DataFormatException: invalid literal/length code
wrong Content-Length: ... header
5320 java.net.ProtocolException: unexpected end of stream (thrown in okhttp3.internal.http1.Http1ExchangeCodec$FixedLengthSource.read(...))
Transfer-Encoding: chunked
3855 java.io.EOFException (no further information, thrown in okio.RealBufferedSource.readHexadecimalUnsignedLong(...))
607 java.net.ProtocolException: unexpected end of stream (thrown in okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(...))
151 java.io.EOFException: \n not found: limit=0 content=…
24 java.net.ProtocolException: Expected leading [0-9a-fA-F] character but was 0x..
2 java.net.ProtocolException: expected chunk size and optional extensions but was "..."
SSL errors
1590 javax.net.ssl.SSLException: SSL peer shut down incorrectly
unclassified by message alone (a detailed stack analysis attributes these to chunked or gzip decoding, see the "thrown in" entries above)
6824 java.io.EOFException
5927 java.net.ProtocolException: unexpected end of stream
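The grouping above can be reproduced approximately by matching each logged exception message, together with the "thrown in" stack frame where one is available, against a small ordered rule table. The following is only a minimal sketch and not the script used for the counts above: the category names, the `RULES` table, and the `classify` and `tally` helpers are all hypothetical, and the input is assumed to be one string per failed fetch.

```python
import re
from collections import Counter

# Hypothetical rule table: (category, pattern) pairs, checked in order.
# The patterns are taken from the exception messages listed above.
RULES = [
    ("timeout", re.compile(r"SocketTimeoutException")),
    ("connection reset", re.compile(r"SocketException: Connection reset")),
    ("gzip", re.compile(
        r"gzip finished without exhausting source"
        r"|source exhausted prematurely"
        r"|CRC: actual|ISIZE: actual"
        r"|DataFormatException|GzipSource")),
    ("content-length", re.compile(r"FixedLengthSource")),
    ("chunked", re.compile(
        r"ChunkedSource|readHexadecimalUnsignedLong"
        r"|\\n not found"
        r"|Expected leading \[0-9a-fA-F\]"
        r"|expected chunk size")),
    ("ssl", re.compile(r"SSLException")),
]


def classify(message: str) -> str:
    """Return the first matching category, or "unclassified"."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return "unclassified"


def tally(messages) -> Counter:
    """Count messages per category."""
    return Counter(classify(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "java.net.SocketTimeoutException: timeout",
        "java.io.IOException: gzip finished without exhausting source",
        "java.net.ProtocolException: unexpected end of stream",
    ]
    for category, count in tally(sample).most_common():
        print(count, category)
```

A bare `java.io.EOFException` or `unexpected end of stream` message stays "unclassified" in this sketch unless the stack frame is appended to the string, which mirrors why the stack analysis was needed to attribute those cases.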
First, clarify how to annotate truncations caused by protocol-level errors. If the HTTP protocol is thought of as part of the network, marking them as "network disconnect" might be OK.
(see #10)