When wget2 is used with --recursive, it always looks for the robots.txt file. This happens even when the file is not present on the server, and also when --no-robots is used. A quick reproducer is below.
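For reference, a minimal local setup along these lines should reproduce it (the exact server used in the report isn't shown, so this is just a sketch; any static file server on port 8080 that lacks a robots.txt will do):

# mkdir /tmp/wget2-test && cd /tmp/wget2-test
# echo '<html><body><a href="http://redhat.com/">link</a></body></html>' > index.html
# python3 -m http.server 8080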
Test wget2 from a different directory or terminal window:
# wget -r --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398 bytes, 0 redirects, 1 errors
Test the same with --no-robots:
# wget -r --no-robots --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398 bytes, 0 redirects, 1 errors
Now, I understand that the --robots section of the man page has the following paragraph:
Whether enabled or disabled, the robots.txt file is downloaded and scanned for sitemaps. These are lists of pages / files available for download that are not necessarily available via recursive scanning.
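For illustration (not part of the original report), such a sitemap pointer would appear in robots.txt via the standard Sitemap directive, which is presumably what the scan looks for; the URL here is hypothetical:

User-agent: *
Disallow:
Sitemap: http://127.0.0.1:8080/sitemap.xml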
Does that mean that wget2 should always either fetch robots.txt or end with an error? I think that in both cases above wget2 should pass without errors. With --no-robots, the absence (or presence) of the file should be ignored altogether. And when the file does not exist, the run should also pass without an error (unless robots.txt is somehow mandatory on the server side?).
Thanks and regards,
Michal Ruprich
The summary line just shows raw stats. I hesitate to add logic that "tunes" the stats for some special cases. If there was a 4xx response, the error count is increased. What is wrong with that?
For scripting, you should take the exit status of wget2 into account. The exit status is 0 ("success") in your example, because a 404 for robots.txt is ignored.
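In other words, a script should test the exit code rather than parse the summary. With the reproducer above, a check along these lines reports success despite the "1 errors" in the stats (the 0 reflects the behavior described above):

# wget -r --no-robots --progress=none -nH http://127.0.0.1:8080/index.html
# echo $?
0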