Timsort comparison error for specific robots.txt URL #86

Open
anjackson opened this issue Nov 28, 2022 · 1 comment

anjackson commented Nov 28, 2022

From the DC crawl logs:

Nov 28, 2022 9:48:29 AM org.archive.modules.CrawlURI getPolitenessDelay
WARNING: politessDelay unset, returning default 5000 for https://www.english.op.org/robots.txt (in thread 'ToeThread #47: https://www.english.op.org/robots.txt')
Nov 28, 2022 9:48:35 AM org.archive.crawler.framework.ToeThread recoverableProblem
SEVERE: Problem java.lang.IllegalArgumentException: Comparison method violates its general contract! occurred when trying to process 'https://www.english.op.org/robots.txt' at step ABOUT_TO_BEGIN_PROCESSOR in 
 (in thread 'ToeThread #498: https://www.english.op.org/robots.txt')
java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at java.util.TimSort.mergeHi(TimSort.java:899)
	at java.util.TimSort.mergeAt(TimSort.java:516)
	at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
	at java.util.TimSort.sort(TimSort.java:254)
	at java.util.Arrays.sort(Arrays.java:1512)
	at java.util.ArrayList.sort(ArrayList.java:1464)
	at java.util.Collections.sort(Collections.java:177)
	at org.apache.http.impl.cookie.RFC6265CookieSpec.formatCookies(RFC6265CookieSpec.java:217)
	at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:187)
	at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:133)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:823)
	at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:679)
	at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
	at org.archive.modules.Processor.process(Processor.java:142)
	at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
	at org.archive.crawler.framework.ToeThread.run(ToeThread.java:147)

The robots.txt content (as seen in my web browser) appears to be:

# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:

Sitemap: https://www.english.op.org/sitemap_index.xml
# ---------------------------
# END YOAST BLOCK
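
For context on the exception in the trace above: TimSort throws "Comparison method violates its general contract!" when the comparator passed to the sort gives inconsistent answers. The sketch below is a generic illustration, not a diagnosis of this issue. It assumes a hypothetical comparator that orders millisecond timestamps by casting a `long` difference to `int`, which is one common way to break the contract. Whether the comparator behind the cookie sort in the trace behaves like this has not been established here. The class and field names are made up for the example.

```java
import java.util.Comparator;

public class ComparatorContractSketch {
    // Hypothetical comparator: truncates a long difference to int,
    // which overflows once the two values are 2^31 or more apart.
    static final Comparator<Long> BY_TRUNCATED_DIFF = (a, b) -> (int) (a - b);

    // Overflow-safe alternative from the standard library.
    static final Comparator<Long> BY_LONG_COMPARE = Long::compare;

    // Checks one clause of the Comparator contract for a single pair:
    // sgn(compare(x, y)) must equal -sgn(compare(y, x)).
    static boolean isAntisymmetric(Comparator<Long> cmp, Long x, Long y) {
        return Integer.signum(cmp.compare(x, y)) == -Integer.signum(cmp.compare(y, x));
    }

    public static void main(String[] args) {
        long early = 0L;
        long late = 1L << 31; // 2^31 ms, roughly 24.9 days later

        // Both compare(early, late) and compare(late, early) come out negative.
        System.out.println(isAntisymmetric(BY_TRUNCATED_DIFF, early, late)); // false
        System.out.println(isAntisymmetric(BY_LONG_COMPARE, early, late));   // true
    }
}
```

With a comparator like the first one, a sort may succeed or fail depending on the input size and ordering, which would be consistent with the error appearing only for specific hosts.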
anjackson changed the title from "Weird parsing error for specific URL" to "Timsort comparison error for specific robots.txt URL" on Jun 8, 2023

anjackson commented

Also hit this during DC2023 and it made the crawl very unhappy until I blocked the host.

https://www.ramirezmoto.es/robots.txt
