While the bulk of URLs in the crawls are normalized, this is not true for URLs stemming from redirects during fetching. As a result, the host names of non-normalized URLs may include:
- Unicode IDNs (not normalized to their ASCII representation)
- IP addresses in a representation other than dot-numeric
- host names in percent-encoding
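
To illustrate the three cases above, here is a minimal Python sketch of what normalizing such host names could look like. It is not the crawler's actual normalization code: `normalize_host` is a hypothetical helper, it only handles IPv4 addresses written as a single decimal or hexadecimal integer, and it relies on Python's built-in `idna` codec (IDNA 2003), which may differ from what the crawler uses.

```python
import ipaddress
import re
from urllib.parse import unquote

# A host written as one integer, e.g. "3232235777" or "0xC0A80101".
# (Assumption: other exotic IPv4 notations, such as octal parts, are ignored here.)
SINGLE_INTEGER = re.compile(r"0x[0-9a-f]+|[0-9]+", re.ASCII)


def normalize_host(host: str) -> str:
    """Hypothetical helper: return an ASCII, lower-case form of a host name."""
    # 1. Undo percent-encoding, e.g. "ex%61mple.com" -> "example.com".
    host = unquote(host).lower()

    # 2. Convert a single-integer IPv4 address to dot-numeric notation.
    if SINGLE_INTEGER.fullmatch(host):
        number = int(host, 16) if host.startswith("0x") else int(host, 10)
        if number < 2**32:
            return str(ipaddress.IPv4Address(number))

    # 3. Convert Unicode IDNs to their ASCII (Punycode) representation.
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    for example in ["bücher.example", "0xC0A80101", "ex%61mple.com"]:
        print(example, "->", normalize_host(example))
```

Hosts that are already normalized, such as `1.2.3.4` or `example.com`, pass through unchanged.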
Fixed for CC-MAIN-2023-14. The number of dubious TLDs (digits only or containing a percent sign) has dropped:
| count | crawl | subset |
|------:|-----------------|------------------|
| 134 | CC-MAIN-2023-06 | warc |
| 211 | CC-MAIN-2023-06 | robotstxt |
| 1188 | CC-MAIN-2023-06 | crawldiagnostics |
| 3 | CC-MAIN-2023-14 | crawldiagnostics |
| 3 | CC-MAIN-2023-14 | robotstxt |
Counts obtained using:

```sql
select count(*) as count, crawl, subset
from "ccindex"."ccindex"
where (crawl = 'CC-MAIN-2023-06' or crawl = 'CC-MAIN-2023-14')
  and regexp_like(url_host_tld, '^\d+$|%')
group by crawl, subset;
```
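
For readers who want to try the filter outside of SQL, the following Python sketch applies the same pattern to individual strings. `is_dubious_tld` is a hypothetical name, and the sketch assumes that `regexp_like` matches the pattern anywhere in the value, which `re.search` mirrors here; `re.ASCII` restricts `\d` to the digits 0-9.

```python
import re

# Same pattern as in the query: digits only, or containing a percent sign.
DUBIOUS_TLD = re.compile(r"^\d+$|%", re.ASCII)


def is_dubious_tld(tld: str) -> bool:
    """Hypothetical helper: True if the TLD is digits only or contains '%'."""
    return DUBIOUS_TLD.search(tld) is not None


if __name__ == "__main__":
    for tld in ["com", "1", "c%6fm", "123abc"]:
        print(tld, is_dubious_tld(tld))
```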