
Crawler pauses when processing large numbers of candidate URLs #54

Open
anjackson opened this issue Feb 27, 2020 · 0 comments

anjackson (Contributor) commented:

The whole disposition process is currently synchronised, and locks the frontier while the candidate URLs from a given CrawlURI are processed. i.e. in thread dumps we see lots of toe threads like:

Java Thread State: WAITING
Blocked/Waiting On: java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@cb14bb8 which is owned by [email protected](99)
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lockInterruptibly(ReentrantReadWriteLock.java:772)
    org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:455)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)

...while there's a single thread that has been:
ACTIVE for 32m59s225ms
step: ABOUT_TO_BEGIN_PROCESSOR for 32m51s640ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
java.util.regex.Pattern$CharProperty.match(Pattern.java:3790)
java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$Begin.match(Pattern.java:3539)
java.util.regex.Matcher.match(Matcher.java:1270)
java.util.regex.Matcher.matches(Matcher.java:604)
org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate(MatchesListRegexDecideRule.java:94)
org.archive.modules.deciderules.PredicatedDecideRule.innerDecide(PredicatedDecideRule.java:48)
org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
org.archive.modules.deciderules.DecideRuleSequence.innerDecide(DecideRuleSequence.java:113)
org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
org.archive.crawler.framework.Scoper.isInScope(Scoper.java:107)
org.archive.crawler.prefetch.CandidateScoper.innerProcessResult(CandidateScoper.java:45)
org.archive.modules.Processor.process(Processor.java:142)
org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
org.archive.modules.Processor.innerProcessResult(Processor.java:175)
org.archive.modules.Processor.process(Processor.java:142)
org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
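The traces above can be reproduced in miniature. This is a hypothetical sketch, not Heritrix code: with a fair `ReentrantReadWriteLock` standing in for the frontier's lock, one thread holding the write lock for a long candidate-processing pass parks every other thread's read-lock acquisition, just as `AbstractFrontier.next()` parks in the first trace.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of the contention pattern seen in the thread dumps above.
public class FrontierLockDemo {
    public static void main(String[] args) throws InterruptedException {
        // Fair mode, matching the ReentrantReadWriteLock$FairSync in the trace.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        lock.writeLock().lock(); // simulate the long disposition/candidate pass

        Thread toe = new Thread(() -> {
            lock.readLock().lock();   // parks here, like AbstractFrontier.next()
            System.out.println("got read lock");
            lock.readLock().unlock();
        });
        toe.start();

        Thread.sleep(200); // give the toe thread time to block
        System.out.println("toe state: " + toe.getState()); // WAITING (parked)

        lock.writeLock().unlock();    // candidate processing finishes
        toe.join();                   // reader now proceeds
    }
}
```

While the write lock is held, the blocked thread reports `WAITING` via `LockSupport.park`, which is exactly the `Java Thread State: WAITING` shown for the toe threads above.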


Because the current sitemap extraction avoids capping the number of outlinks (for completeness), very large sitemaps lead to long pauses (e.g. >30 mins) while all the candidates are processed. This is made worse by the fact that we need to refer to OutbackCDX to check if we need to revisit each URL.

We could consider capping the outlinks from sitemaps, but using reservoir sampling so we get a different random subset each time?
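The capping idea could look something like the following sketch using reservoir sampling (Algorithm R), which keeps a uniform random subset of at most k outlinks in a single pass without buffering the whole sitemap. The class and method names here are illustrative, not part of Heritrix:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: cap sitemap outlinks at k via reservoir sampling,
// so each crawl pass keeps a different uniform random subset.
public class OutlinkSampler {
    public static <T> List<T> sample(Iterable<T> outlinks, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T link : outlinks) {
            if (seen < k) {
                reservoir.add(link);            // fill the reservoir first
            } else {
                // Keep each later item with probability k/(seen+1).
                int j = rng.nextInt(seen + 1);
                if (j < k) reservoir.set(j, link);
            }
            seen++;
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            links.add("https://example.org/page/" + i);
        }
        List<String> kept = sample(links, 1000, new Random());
        System.out.println(kept.size());
    }
}
```

Because each subset is an independent uniform sample, repeated visits to the same sitemap would eventually cover most of its URLs while keeping any single disposition pass bounded.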
@anjackson anjackson changed the title Crawler hangs when processing large numbers of candidate URLs Crawler pauses when processing large numbers of candidate URLs Mar 5, 2020