No robots.txt always fires an error with --recursive #358

Open
mruprich opened this issue Nov 19, 2024 · 3 comments
Comments

@mruprich

When wget2 is used with --recursive, it always tries to fetch the robots.txt file. This happens even when the file is not present on the server, and also when --no-robots is used. A quick reproducer is below.

  1. Start an http server without robots.txt:
# tempdir=`mktemp -d`
# pushd $tempdir
# echo '<html><body><a href="http://redhat.com/">rht</a></body></html>' > index.html
# python3 -m http.server 8080 &
  2. Test wget2 from a different dir or different terminal window:
# wget -r --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors
  3. Test the same with --no-robots:
# wget -r --no-robots --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors

Now, I understand that the --robots section of the documentation has the following paragraph:

Whether enabled or disabled, the robots.txt file is downloaded and scanned for sitemaps. These are lists of pages / files available for download that not necessarily are available via recursive scanning.

Does that mean that wget2 should always either fetch robots.txt or end with an error? I think that in both cases above, wget2 should pass without errors. With --no-robots, the absence (or presence) of the file should be ignored altogether. When the file does not exist, the run should also pass without an error (unless robots.txt is mandatory on the server side?).
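
As a side note, a workaround sketch (not part of the original report; it assumes the $tempdir setup from step 1): serving an empty robots.txt makes the server answer with 200 instead of 404, so the summary line no longer counts an error:

# touch $tempdir/robots.txt
# wget -r --progress=none -nH http://127.0.0.1:8080/index.html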

Thanks and regards,
Michal Ruprich

@rockdaboot
Owner

The line

Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors

just shows stupid stats. I hesitate to add logic that "tunes" the stats for some special cases.

If there was a 4xx response, the error count is increased. What is wrong with that?

For scripting, you should take the exit status of wget2 into account. And the exit status is 0 ("success") in your example, because a 404 for "robots.txt" is ignored.
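
For example, a minimal shell check along those lines (a sketch; it reuses the reproducer URL from above as an assumption):

wget2 -r --progress=none -nH http://127.0.0.1:8080/index.html
status=$?
# exit status 0 means success; the 404 for robots.txt does not change it
if [ "$status" -ne 0 ]; then
    echo "wget2 failed with exit status $status" >&2
    exit "$status"
fi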

@mruprich
Author

mruprich commented Dec 3, 2024

Right, it is true that the return value is 0, but I would argue that, at least with --no-robots, it should not try to fetch robots.txt at all?

@rockdaboot
Owner

--no-robots still fetches the robots.txt file because it may contain sitemaps.

We can think of an option to disable sitemaps; in combination with --no-robots, robots.txt would then no longer need to be fetched.
