WARC writer: unit tests for conversion of URLs to URIs #21

Open
sebastian-nagel opened this issue Jul 12, 2023 · 0 comments

Nutch uses instances of the class java.net.URL to represent the URLs being crawled. WARC files require URIs for the WARC-Target-URI header. While the conversion to a URI is unproblematic for most URLs, there are some issues:

  1. Some instances of java.net.URL cannot be converted to java.net.URI, see URL.toURI(). Note: these URLs were successfully fetched!
  2. Converting a java.net.URI to an ASCII-only URI is not free of pitfalls (see WARC writer: use URI.toASCIIString() instead of URI.toString() #20).

It would be good to have unit tests that reproduce and verify these issues and, ideally, come with "solutions" that make the conversion from URL to URI succeed. The tests should cover, e.g.,

  • non-ASCII / Unicode components in URLs, including IDNs
  • encoding of white space in the URL path or query
  • encoding of characters invalid in URIs but valid in URLs
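
The cases listed above could be exercised along the following lines. This is only a minimal sketch, not Nutch code: the class `UrlToUriSketch` and its methods `toUri` and `toAsciiUri` are hypothetical names introduced for illustration, and the fallback strategy (rebuilding the URI from the URL's components) is an assumption about one possible "solution", not a vetted one.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class UrlToUriSketch {

    /**
     * Tries URL.toURI() first. If that fails, rebuilds the URI from the
     * URL's components using the multi-argument URI constructor, which
     * percent-encodes characters that are illegal in the given component.
     */
    static URI toUri(URL url) throws URISyntaxException {
        try {
            return url.toURI();
        } catch (URISyntaxException e) {
            String host = url.getHost();
            if (host == null || host.isEmpty()) {
                host = null; // no authority component
            } else {
                host = IDN.toASCII(host); // Punycode for IDNs, no-op for ASCII hosts
            }
            return new URI(url.getProtocol(), url.getUserInfo(), host, url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
        }
    }

    /** Converts a URL string to an ASCII-only URI string. */
    static String toAsciiUri(String urlString) {
        try {
            // new URL(String) is deprecated in recent JDKs but still available
            return toUri(new URL(urlString)).toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert to URI: " + urlString, e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/a b",          // white space in the path
            "http://example.com/search?q=a b", // white space in the query
            "http://example.com/a|b",          // '|' is not allowed in a URI
            "http://example.com/caf\u00e9"     // non-ASCII character in the path
        };
        for (String s : samples) {
            System.out.println(s + " -> " + toAsciiUri(s));
        }
        System.out.println(IDN.toASCII("m\u00fcnchen.de")); // IDN host name
    }
}
```

One caveat a real unit test would need to cover: the multi-argument URI constructors always quote the percent character, so a URL that already contains escapes such as `%20` and also triggers the fallback would end up double-encoded (`%2520`). The sketch does not attempt to solve that.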