No robots.txt always fires an error with --recursive #358

Open
mruprich opened this issue Nov 19, 2024 · 3 comments
Comments

@mruprich

When wget2 is used with --recursive, it always tries to fetch the robots.txt file. This happens even when the file is not present on the server, and also when --no-robots is used. A quick reproducer is below.

  1. Start an http server without robots.txt:
# tempdir=`mktemp -d`
# pushd $tempdir
# echo '<html><body><a href="http://redhat.com/">rht</a></body></html>' > index.html
# python3 -m http.server 8080 &
  2. Test wget2 from a different dir or different terminal window:
# wget -r --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors
  3. Test the same with --no-robots:
# wget -r --no-robots --progress=none -nH http://127.0.0.1:8080/index.html
[0] Downloading 'http://127.0.0.1:8080/robots.txt' ...
HTTP ERROR response 404 File not found [http://127.0.0.1:8080/robots.txt]
[0] Downloading 'http://127.0.0.1:8080/index.html' ...
Saving 'index.html'
HTTP response 200 OK [http://127.0.0.1:8080/index.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Adding URL: http://redhat.com/
URL 'http://redhat.com/' not followed (no host-spanning requested)
Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors

Now, I understand that the --robots section of the documentation has the following paragraph:

Whether enabled or disabled, the robots.txt file is downloaded and scanned for sitemaps. These are lists of pages / files available for download that not necessarily are available via recursive scanning.

Does that mean that wget2 should always either fetch robots.txt or end with an error? I think that in both cases above, wget2 should pass without errors. With --no-robots, the absence (or presence) of the file should be ignored altogether. When the file does not exist, the run should also pass without an error (unless robots.txt is mandatory on the server side?).
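
As a side note, a workaround sketch (not part of the original report; it assumes the $tempdir setup from step 1): serving an empty robots.txt makes the server answer with 200 instead of 404, so the summary line no longer counts an error:

# touch $tempdir/robots.txt
# wget -r --progress=none -nH http://127.0.0.1:8080/index.html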

Thanks and regards,
Michal Ruprich

@rockdaboot
Owner

The line

Downloaded: 1 files, 398  bytes, 0 redirects, 1 errors

just shows stupid stats. I hesitate to add logic that "tunes" the stats for some special cases.

If there was a 4xx response, the error count is increased. What is wrong with that?

For scripting, you should take the exit status of wget2 into account. And the exit status is 0 ("success") in your example, because a 404 for "robots.txt" is ignored.
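
For example, a minimal shell check along those lines (a sketch; it reuses the reproducer URL from above as an assumption):

wget2 -r --progress=none -nH http://127.0.0.1:8080/index.html
status=$?
# exit status 0 means success; the 404 for robots.txt does not change it
if [ "$status" -ne 0 ]; then
    echo "wget2 failed with exit status $status" >&2
    exit "$status"
fi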

@mruprich
Author

mruprich commented Dec 3, 2024

Right, it is true that the return value is 0, but I would argue that, at least with --no-robots, it should not try to fetch robots.txt at all?

@rockdaboot
Owner

--no-robots still fetches the robots.txt file because it may contain sitemaps.

We can think of an option to disable sitemaps; in combination with --no-robots, robots.txt would then no longer need to be fetched.
