okhttp protocol: trimmed content because of content limit not reliably marked #756

Closed
sebastian-nagel opened this issue Sep 26, 2019 · 1 comment · Fixed by #757

Comments

@sebastian-nagel
Contributor

(see NUTCH-2729 and commoncrawl/nutch#10 for the same issue in Nutch)

The marking of content trimmed by the content limit is not reliable: it reproducibly fails for compressed or chunked content, or when there is no Content-Length header. In these cases, the decision to set http.trimmed: true in the metadata checks whether the internal buffer of okhttp holds more data than requested. Especially for compressed content, that buffer tends to hold exactly the number of requested bytes, so the check fails. For reliable detection we need to request one byte more than the configured http.content.limit.
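The "request one byte more" idea can be sketched independently of okhttp. The following is only an illustration over a plain java.io.InputStream, not the actual StormCrawler or okhttp code; the class name TrimDetectionSketch, the Result holder and the method readWithLimit are hypothetical, and the limit is assumed to be non-negative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch: names are illustrative, not taken from StormCrawler.
class TrimDetectionSketch {

    /** Holds the content (cut to the limit) and whether anything was cut off. */
    static final class Result {
        final byte[] content;
        final boolean trimmed;

        Result(byte[] content, boolean trimmed) {
            this.content = content;
            this.trimmed = trimmed;
        }
    }

    /**
     * Reads at most limit + 1 bytes. If the extra byte arrives, the source
     * had more data than the limit allows, so the content is marked as trimmed.
     * Assumes limit >= 0.
     */
    static Result readWithLimit(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        // long avoids overflow when limit is Integer.MAX_VALUE
        long remaining = (long) limit + 1;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before limit + 1 bytes
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        byte[] all = out.toByteArray();
        boolean trimmed = all.length > limit;
        // Drop the probe byte so the stored content never exceeds the limit
        byte[] content = trimmed ? Arrays.copyOf(all, limit) : all;
        return new Result(content, trimmed);
    }
}
```

With this approach a body of exactly limit bytes is not marked as trimmed, while a body of limit + 1 bytes or more is, regardless of how much data happens to sit in any internal buffer.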

One example, fetching a 9 MB sitemap with http.content.limit: 1048576 and http.store.headers: true:

> java -cp ... com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol ... http://localhost/sitemap.xml
http://localhost/sitemap.xml
date: Thu, 26 Sep 2019 14:03:25 GMT
server: Apache/2.4.29 (Ubuntu)
transfer-encoding: chunked
vary: Accept-Encoding
last-modified: Mon, 19 Mar 2018 07:05:39 GMT
keep-alive: timeout=5, max=100
_request.headers_: GET /sitemap.xml 
...
Accept-Encoding: gzip


...
_response.ip_: 127.0.0.1
_response.headers_: HTTP/1.1 200 OK
...
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: application/xml



status code: 200
content length: 1048576
fetched in : 157 msec

The content has exactly the size of the limit, but no trimming/truncation is marked.

@jnioche
Contributor

jnioche commented Sep 26, 2019

thanks @sebastian-nagel
