Nutch uses instances of the class java.net.URL to represent the URLs being crawled, whereas WARC files require URIs for the WARC-Target-URI header. Converting a URL to a URI is unproblematic for most URLs, but there are some issues:
some instances of java.net.URL cannot be converted to java.net.URI because URL.toURI() throws a URISyntaxException. Note: these URLs were fetched successfully!
It would be good to have unit tests that reproduce and verify these issues, ideally together with "solutions" that make the conversion from URL to URI succeed. E.g.,
non-ASCII / Unicode components in URLs, including IDNs
encoding of white space in the URL path or query
encoding of characters invalid in URIs but valid in URLs
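As a starting point for such tests, here is a minimal sketch of one possible lenient conversion. It is not Nutch's actual implementation: the class name `UrlToUriSketch` and the method `toUri` are hypothetical. The sketch tries URL.toURI() first when the host is plain ASCII. Otherwise, or if the strict conversion fails, it rebuilds the URI from the URL's components using the multi-argument java.net.URI constructor (which quotes illegal characters) and java.net.IDN for the host name. One known limitation: that constructor always quotes `%`, so a URL that mixes already percent-encoded sequences with raw illegal characters would be double-encoded and needs extra handling.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, for illustration only (not part of Nutch).
final class UrlToUriSketch {

    /** Converts a URL to an ASCII-only URI, re-encoding components if needed. */
    static URI toUri(URL url) {
        String host = url.getHost() == null ? "" : url.getHost();
        boolean hostIsAscii = host.chars().allMatch(c -> c < 128);

        if (hostIsAscii) {
            try {
                // Strict conversion: succeeds for most URLs.
                return URI.create(url.toURI().toASCIIString());
            } catch (URISyntaxException e) {
                // Fall through to the component-wise rebuild below.
            }
        }

        try {
            // Internationalized host names are mapped to their ASCII (Punycode) form.
            String asciiHost = host.isEmpty() ? null : IDN.toASCII(host);
            // The multi-argument constructor quotes characters that are illegal
            // in a URI (e.g. spaces, '|'). Caveat: it also quotes '%'.
            URI rebuilt = new URI(url.getProtocol(), url.getUserInfo(), asciiHost,
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // toASCIIString() percent-encodes remaining non-ASCII characters as UTF-8.
            return URI.create(rebuilt.toASCIIString());
        } catch (URISyntaxException | IllegalArgumentException e) {
            throw new IllegalArgumentException("Cannot convert to URI: " + url, e);
        }
    }

    /** Convenience overload that wraps MalformedURLException as unchecked. */
    static URI toUri(String spec) {
        try {
            return toUri(new URL(spec));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }
}
```

Under these assumptions, `UrlToUriSketch.toUri("http://example.com/a b")` would yield `http://example.com/a%20b`. A unit test could assert such expected strings for each of the cases listed above.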