Unresolved Javascript Extraction Issues
Prompted by https://webarchive.jira.com/browse/LOC-345
Heritrix (ExtractorJS) has trouble finding links in JavaScript that are not hardcoded strings, such as URLs assembled from variables at runtime.
The flip side is that it extracts strings that the JavaScript code never uses as URIs, and fetching them often produces 404s that webmasters notice.
To help define the problem better, this page collects examples where incorrect JavaScript extraction is an issue.
https://webarchive.jira.com/browse/LOC-345
https://webarchive.jira.com/browse/HER-1300
https://webarchive.jira.com/browse/HER-1522
https://webarchive.jira.com/browse/HER-1523
A potential solution: search for location.replace calls and similar ways of navigating to a URL. If the arguments to such a call involve variables, attempt to resolve them to string values.
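The idea above can be sketched roughly as follows. This is a minimal illustration in Python, not how ExtractorJS works and not a proposed patch: the function names (`resolve`, `extract_navigation_targets`) and the regular expressions are hypothetical, and the sketch only handles `location.replace(...)` and `location.assign(...)` calls whose argument is a `+` concatenation of string literals and variables that were assigned a plain string literal. Anything it cannot resolve is skipped rather than guessed at.

```python
import re

# var/let/const name = "literal";  (only plain string literals are tracked)
ASSIGN_RE = re.compile(
    r"""\b(?:var|let|const)\s+([A-Za-z_$][\w$]*)\s*=\s*(["'])(.*?)\2\s*;"""
)

# location.replace(expr) or location.assign(expr), single argument only
CALL_RE = re.compile(
    r"""\blocation\.(?:replace|assign)\s*\(\s*([^(),;]+?)\s*\)"""
)

STRING_RE = re.compile(r"""^(["'])(.*)\1$""")
IDENT_RE = re.compile(r"^[A-Za-z_$][\w$]*$")


def resolve(expr, variables):
    """Resolve a '+' concatenation of literals and known variables.

    Returns the resulting string, or None if any term is unknown.
    Literals that themselves contain '+' are not handled and yield None.
    """
    parts = []
    for term in expr.split("+"):
        term = term.strip()
        literal = STRING_RE.match(term)
        if literal:
            parts.append(literal.group(2))
        elif IDENT_RE.match(term) and term in variables:
            parts.append(variables[term])
        else:
            return None  # unresolved term: do not emit a speculative URI
    return "".join(parts)


def extract_navigation_targets(js):
    """Return URLs passed to location.replace/assign that could be resolved."""
    variables = {m.group(1): m.group(3) for m in ASSIGN_RE.finditer(js)}
    targets = []
    for call in CALL_RE.finditer(js):
        url = resolve(call.group(1), variables)
        if url is not None:
            targets.append(url)
    return targets


if __name__ == "__main__":
    sample = """
    var base = "/archive";
    var id = getId();
    var label = "not/a/link";
    location.replace(base + "/index.html");
    location.replace(base + "/item/" + id);
    """
    print(extract_navigation_targets(sample))
```

In the sample, only the first call is resolved. The second depends on a runtime value (`id`), so it is dropped, and the string `"not/a/link"` is never emitted because it is not used in a navigation call. A real implementation would likely need a proper JavaScript parser rather than regular expressions, since scoping, reassignment, and function calls are all ignored here.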