Consider normalizing host, domain names and TLDs #25

Closed
sebastian-nagel opened this issue Mar 6, 2023 · 1 comment

@sebastian-nagel
Contributor

While the bulk of URLs in the crawls are normalized, this is not true for URLs stemming from redirects during fetching. As a result, the host names of non-normalized URLs may include:

  • Unicode IDNs (not normalized to their ASCII representation)
  • IP addresses in representations other than dot-numeric
  • host names in percent-encoding
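
The three cases above could be handled roughly as in the following Python sketch. It is only an illustration, not the code used in the referenced commits: `normalize_host` and `canonical_ipv4` are hypothetical names, the IPv4 parsing loosely follows the way browsers interpret hexadecimal, octal and shortened addresses, and Python's built-in `idna` codec (IDNA 2003) is an assumption that may differ from the IDN conversion applied in the crawler.

```python
import ipaddress
import re
from typing import Optional
from urllib.parse import unquote

# Accepted spellings of a single IPv4 part (assumption: browser-like parsing).
_HEX = re.compile(r"0[xX][0-9a-fA-F]*")
_OCT = re.compile(r"0[0-7]*")
_DEC = re.compile(r"[1-9][0-9]*")


def _parse_part(part: str) -> Optional[int]:
    """Parse one dot-separated part as hex, octal or decimal, else None."""
    if _HEX.fullmatch(part):
        return int(part[2:] or "0", 16)
    if _OCT.fullmatch(part):
        return int(part, 8)
    if _DEC.fullmatch(part):
        return int(part, 10)
    return None


def canonical_ipv4(host: str) -> Optional[str]:
    """Return the dot-numeric form if host is an IPv4 address, else None."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    numbers = [_parse_part(p) for p in parts]
    if any(n is None for n in numbers):
        return None
    *leading, last = numbers
    # Leading parts are single bytes; the last part fills the remaining bytes.
    if any(n > 255 for n in leading) or last >= 256 ** (4 - len(leading)):
        return None
    value = last
    for i, n in enumerate(leading):
        value += n << (8 * (3 - i))
    return str(ipaddress.IPv4Address(value))


def normalize_host(host: str) -> str:
    """Hypothetical helper: percent-decode, canonicalize IPv4, convert IDNs."""
    host = unquote(host)  # "ex%61mple.com" -> "example.com"
    ipv4 = canonical_ipv4(host)
    if ipv4 is not None:
        return ipv4
    try:
        # Unicode IDN -> ASCII ("xn--...") using the built-in IDNA 2003 codec.
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        # Leave hosts that cannot be converted unchanged.
        return host
```

For example, `normalize_host("0x7f.0.0.1")` and `normalize_host("2130706433")` both yield `127.0.0.1`, and `normalize_host("münchen.de")` yields `xn--mnchen-3ya.de`.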
sebastian-nagel added a commit that referenced this issue Apr 4, 2023
- canonicalize host names which are IPv4 addresses
  in non-canonical forms
sebastian-nagel added a commit that referenced this issue Apr 4, 2023
- decode percent encoded host names
- convert Unicode IDNs to the ASCII equivalents
@sebastian-nagel
Contributor Author

Fixed for CC-MAIN-2023-14. The number of dubious TLDs (digits only or containing a percent sign) has dropped:

count  crawl            subset
  134  CC-MAIN-2023-06  warc
  211  CC-MAIN-2023-06  robotstxt
 1188  CC-MAIN-2023-06  crawldiagnostics
    3  CC-MAIN-2023-14  crawldiagnostics
    3  CC-MAIN-2023-14  robotstxt

Counts were obtained using:

select count(*) as count, crawl, subset
from "ccindex"."ccindex"
where (crawl = 'CC-MAIN-2023-06' or crawl = 'CC-MAIN-2023-14')
  and regexp_like(url_host_tld, '^\d+$|%')
group by crawl, subset;
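
To make the filter in the query easier to read, here is a small Python sketch of the same pattern. It assumes that `regexp_like` performs a search (a match anywhere in the string), so the pattern flags a TLD that consists of digits only or that contains a percent sign; `is_dubious_tld` is a hypothetical name used only for illustration.

```python
import re

# Same pattern as in the query: digits only, or a percent sign anywhere.
DUBIOUS_TLD = re.compile(r"^\d+$|%")


def is_dubious_tld(tld: str) -> bool:
    """Return True if the TLD is digits only or contains a percent sign."""
    return DUBIOUS_TLD.search(tld) is not None
```

A host that is really an IP address has its last numeric part reported as the TLD (for instance `1`), and a percent-encoded host may yield something like `c%6fm`; both are flagged, while `com` is not.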

Some more work to do, see #26.
