The whole disposition process is currently synchronised, and locks the frontier while the candidate URLs from a given CrawlURI are processed, i.e. there are lots of threads like this:
Java Thread State: WAITING
Blocked/Waiting On: java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@cb14bb8 which is owned by [email protected](99)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lockInterruptibly(ReentrantReadWriteLock.java:772)
org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:455)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)
...while there's a single thread like this doing all the work:
ACTIVE for 32m59s225ms
step: ABOUT_TO_BEGIN_PROCESSOR for 32m51s640ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
java.util.regex.Pattern$CharProperty.match(Pattern.java:3790)
java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$Begin.match(Pattern.java:3539)
java.util.regex.Matcher.match(Matcher.java:1270)
java.util.regex.Matcher.matches(Matcher.java:604)
org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate(MatchesListRegexDecideRule.java:94)
org.archive.modules.deciderules.PredicatedDecideRule.innerDecide(PredicatedDecideRule.java:48)
org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
org.archive.modules.deciderules.DecideRuleSequence.innerDecide(DecideRuleSequence.java:113)
org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
org.archive.crawler.framework.Scoper.isInScope(Scoper.java:107)
org.archive.crawler.prefetch.CandidateScoper.innerProcessResult(CandidateScoper.java:45)
org.archive.modules.Processor.process(Processor.java:142)
org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
org.archive.modules.Processor.innerProcessResult(Processor.java:175)
org.archive.modules.Processor.process(Processor.java:142)
org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
Because the current sitemap extraction avoids capping the number of outlinks (for completeness), very large sitemaps lead to long pauses (e.g. >30 minutes) while all the candidates are processed. This is made worse by the fact that we need to query OutbackCDX to check whether we need to revisit each URL.
We could consider capping the outlinks from sitemaps, but using reservoir sampling so that we get a different random subset of the URLs each time the sitemap is processed.
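A minimal sketch of that idea, using Algorithm R reservoir sampling to cap the outlinks at `k` while keeping the subset uniformly random. The `OutlinkSampler` class and its `sample` method are hypothetical names for illustration, not part of Heritrix:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical helper (not a Heritrix class): caps a stream of sitemap
 * outlinks at k entries using Algorithm R reservoir sampling, so every
 * outlink has an equal chance of being kept regardless of its position.
 */
public class OutlinkSampler {

    public static List<String> sample(Iterable<String> outlinks, int k, Random rng) {
        List<String> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (String url : outlinks) {
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k outlinks.
                reservoir.add(url);
            } else {
                // Keep the (seen+1)-th outlink with probability k / (seen + 1),
                // evicting a uniformly chosen existing entry.
                int j = rng.nextInt(seen + 1);
                if (j < k) {
                    reservoir.set(j, url);
                }
            }
            seen++;
        }
        return reservoir;
    }
}
```

Because a fresh `Random` (or crawl-launch seed) is used each time, repeated visits to the same sitemap would cover different subsets of its URLs, rather than always truncating to the same leading entries.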
anjackson changed the title from "Crawler hangs when processing large numbers of candidate URLs" to "Crawler pauses when processing large numbers of candidate URLs" on Mar 5, 2020.