Improve extraction of host names and registered domains #26

sebastian-nagel · 2023-04-04T13:08:45Z

- No host name is extracted in the following situations:
  - The URL contains 4 slashes after the protocol: https:////example.org/. While java.net.URL extracts an empty hostname, Nutch's OkHttp-based protocol implementation seems to fetch the resource as if there were only two slashes.
  - Similarly, java.net.URL and OkHttp behave differently if there is an overlong (or even invalid?) userinfo before the hostname (scheme://userinfo@hostname/).
- IP addresses are not recognized as such if they end in a dot: https://123.123.123.123./robots.txt
- The extraction of registered domains (done by crawler-commons' EffectiveTldFinder) does not extract anything if the hostname is equal to a public suffix (gov.uk or kharkov.ua, for example).
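
To illustrate the trailing-dot case, here is a minimal sketch of how a host could be checked for being an IPv4 address while tolerating a single trailing dot. The class `HostNameSketch` and the helpers `stripTrailingDot` and `isIpv4Address` are hypothetical names made up for this example; they are not part of Nutch or crawler-commons, and the actual fix may look different.

```java
import java.util.regex.Pattern;

class HostNameSketch {

    // Dotted-decimal IPv4 address, each octet in the range 0..255.
    // Hypothetical helper pattern, not taken from Nutch or crawler-commons.
    private static final Pattern IPV4 = Pattern.compile(
        "^(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)"
        + "(\\.(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)){3}$");

    // Removes exactly one trailing dot (the DNS root label), if present.
    static String stripTrailingDot(String host) {
        if (host != null && host.length() > 1 && host.endsWith(".")) {
            return host.substring(0, host.length() - 1);
        }
        return host;
    }

    // Returns true if the host is an IPv4 address, with or without
    // a single trailing dot.
    static boolean isIpv4Address(String host) {
        return host != null && IPV4.matcher(stripTrailingDot(host)).matches();
    }

    public static void main(String[] args) {
        String[] hosts = {"123.123.123.123.", "123.123.123.123", "example.org."};
        for (String host : hosts) {
            System.out.println(host + " -> " + isIpv4Address(host));
        }
    }
}
```

Stripping only one dot is a deliberate choice in this sketch: a host ending in two or more dots is left unrecognized rather than silently repaired.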


sebastian-nagel mentioned this issue Apr 4, 2023

Consider normalizing host, domain names and TLDs #25

Closed
