The marking of content trimmed by the content limit is not reliable and reproducibly fails for compressed or chunked content, or when there is no `Content-Length` header. In these cases, whether `http.trimmed: true` is set in the metadata depends on a check of whether okhttp's internal buffer holds more data than requested. For reliable detection we need to request one byte more than the configured `http.content.limit`. Especially for compressed content, okhttp's internal buffer tends to hold exactly the number of requested bytes.
One example, fetching a 9 MB sitemap with `http.content.limit: 1048576` and `http.store.headers: true`:
(see NUTCH-2729 and commoncrawl/nutch#10 for the same issue in Nutch)
In the example above, the content has exactly the size of the limit, but no trimming/truncation is marked.
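The "request one byte more" idea could look roughly like the sketch below. It is only an illustration using a plain `java.io.InputStream` instead of okhttp's buffered source, and the names `TrimDetection`, `Result` and `readWithLimit` are made up for this example and are not part of the existing code. It assumes a non-negative limit smaller than `Integer.MAX_VALUE`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not the actual protocol implementation.
public class TrimDetection {

    /** Holds the (possibly truncated) content and the trimmed flag. */
    public static final class Result {
        public final byte[] content;
        public final boolean trimmed;

        Result(byte[] content, boolean trimmed) {
            this.content = content;
            this.trimmed = trimmed;
        }
    }

    /**
     * Reads at most limit bytes of content, but asks the stream for
     * limit + 1 bytes: receiving the extra byte proves that the source
     * holds more data than the limit allows.
     * Assumes 0 <= limit < Integer.MAX_VALUE.
     */
    public static Result readWithLimit(InputStream in, int limit) throws IOException {
        // readNBytes (Java 11+) blocks until limit + 1 bytes are read
        // or the end of the stream is reached
        byte[] buf = in.readNBytes(limit + 1);
        boolean trimmed = buf.length > limit;
        // drop the probe byte so the stored content never exceeds the limit
        byte[] content = trimmed ? Arrays.copyOf(buf, limit) : buf;
        return new Result(content, trimmed);
    }
}
```

With this approach a body of exactly `limit` bytes is not marked as trimmed, while a body of `limit + 1` bytes or more is, independent of how many bytes happen to sit in any internal buffer.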