forked from internetarchive/heritrix3
[pull] master from internetarchive:master #5
Open: pull wants to merge 241 commits into guorenxi:master from internetarchive:master
The LastModified field in the FetchHistory table is expected to be in seconds, but it is sometimes stored in milliseconds, which causes the parsing logic to produce a date in the future.
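One way to guard against mixed units like this is to normalize the stored value before parsing it. The following is only a minimal sketch, not the change made in this commit: the class name, the method name, and the cutoff heuristic are all assumptions chosen for illustration.

```java
/** Hypothetical helper; not part of Heritrix. */
final class LastModifiedNormalizer {
    // Assumption: any value at or above this cutoff is treated as milliseconds.
    // Read as seconds the cutoff is thousands of years in the future, while
    // read as milliseconds it falls in 1973, so realistic values do not collide.
    private static final long MILLIS_CUTOFF = 100_000_000_000L;

    /** Returns the stored timestamp as seconds since the epoch. */
    static long toEpochSeconds(long stored) {
        return stored >= MILLIS_CUTOFF ? stored / 1000L : stored;
    }

    public static void main(String[] args) {
        // Both calls describe the same instant, stored in different units.
        System.out.println(toEpochSeconds(1_650_000_000L));
        System.out.println(toEpochSeconds(1_650_000_000_000L));
    }
}
```

A heuristic like this only works while genuine second values stay far below the cutoff; storing the unit explicitly would be more robust.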
Fixes #473 "Unsupported class file major version 62".
Bumps [spring-beans](https://github.com/spring-projects/spring-framework) from 5.3.14 to 5.3.18. - [Release notes](https://github.com/spring-projects/spring-framework/releases) - [Commits](spring-projects/spring-framework@v5.3.14...v5.3.18) --- updated-dependencies: - dependency-name: org.springframework:spring-beans dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]>
This avoids an ownership conflict over /tmp/Crashpad.
Allows for custom browser configuration such as using a proxy server or increasing log verbosity.
Disables browser features like phishing protection, translation, crash and metrics reporting, OS keychain integration and such that can cause unnecessary network traffic or that can cause the browser to get stuck in a prompt. Some of these options may be unnecessary in headless mode, but we may as well keep them in case in future we want to support headed mode for debugging.
We use a static (global) counter so that if a job uses multiple instances of ExtractorChrome they don't clobber each other.
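The idea of a static counter shared across instances can be sketched as follows. This is an illustration only: the class name and the `resourceName` method are hypothetical, and what ExtractorChrome actually derives from its counter is not stated here.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a component that needs a per-instance resource name. */
final class BrowserInstanceNamer {
    // Static, so it is shared by every instance in the JVM and no two
    // instances are ever handed the same number, even across threads.
    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    private final int id = NEXT_ID.getAndIncrement();

    /** Returns a name unique to this instance (assumed use: a directory or port label). */
    String resourceName(String prefix) {
        return prefix + "-" + id;
    }
}
```

An instance field initialized to zero would give every instance the same name, which is the clobbering the commit message describes.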
Let's not waste time running the browser on error pages.
This usually means the browser has exited and so there's no window left to close.
- use the "DohResolver" from the dnsjava library to make DoH lookups
- to enable and configure it, add two new properties
  * "enableDnsOverHttpResolves" (boolean)
  * "dnsOverHttpServer" URL to the DoH Server
- as one use case for DoH is being located behind a firewall, also support using a proxy to access the DoH server; the proxy from the FetchHTTP bean is reused in that case
Fixes #211
- instead of "borrowing" the configured proxy from the fetchHttp bean, use proxy values defined via global options, to avoid interference with other jobs running in parallel (or at least make them explicit). The "fetchHttp" bean also uses these settings, if no bean specific settings are used.
- remove the "enableDnsOverHttpResolves", and rely on a non-empty "dnsOverHttpServer" value to signal that DoH should be used.
- scrap setter for "enableDnsOverHttpResolve", too
- docs: let's see if we can set a link to another chapter
This ensures that when we later compare the context in processEmbed() we don't need to deal with variants like srcSet or SRCSET. Note that we're already sometimes lowercasing it later in HTMLLinkContext.get(). Fixes #477.
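The effect of normalizing the case once, up front, can be sketched like this. The class and method names are hypothetical and this is not the Heritrix code; it only shows why later comparisons no longer need to handle `srcSet` or `SRCSET`.

```java
import java.util.Locale;

/** Hypothetical illustration of normalizing an attribute name before comparing it. */
final class AttributeContext {
    /** Lowercases with a fixed locale so the result does not depend on the default locale. */
    static String normalize(String attributeName) {
        return attributeName.toLowerCase(Locale.ROOT);
    }

    /** After normalization a single exact comparison covers every case variant. */
    static boolean isSrcset(String attributeName) {
        return "srcset".equals(normalize(attributeName));
    }
}
```

Using `Locale.ROOT` avoids surprises such as the Turkish dotless i when the JVM runs under a different default locale.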
This makes troubleshooting link extraction problems much easier.
This avoids creating browser processes for jobs that have it disabled. If a job has it disabled by default but enables it with a sheet we'll start it when needed. We still connect on start when enabled to provide early error feedback.
Bumps [spring-core](https://github.com/spring-projects/spring-framework) from 5.3.18 to 5.3.19. - [Release notes](https://github.com/spring-projects/spring-framework/releases) - [Commits](spring-projects/spring-framework@v5.3.18...v5.3.19) --- updated-dependencies: - dependency-name: org.springframework:spring-core dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]>
Useful for testing link extraction on pages that use a robots meta tag.
Bumps [gson](https://github.com/google/gson) from 2.8.6 to 2.8.9. - [Release notes](https://github.com/google/gson/releases) - [Changelog](https://github.com/google/gson/blob/master/CHANGELOG.md) - [Commits](google/gson@gson-parent-2.8.6...gson-parent-2.8.9) --- updated-dependencies: - dependency-name: com.google.code.gson:gson dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]>
…l for heritrix Browse Beans functionality. (cherry picked from commit 96e47a2)
This enables crawl configuration files to use Spring's [Groovy Bean Definition DSL] as an optional alternative to Spring XML. It uses the same bean configuration model but the syntax is more terse and human-readable. No more need for `&amp;` in seed URLs. :-)

```groovy
checkpointService(CheckpointService) {
    checkpointIntervalMinutes = 15
    checkpointsDir = 'checkpoints'
    forgetAllButLatest = true
}
```

It also enables some powerful scripting capabilities. For example, defining a custom DecideRule directly in the crawl scope:

```groovy
scope(DecideRuleSequence) {
    rules = [
        new RejectDecideRule(),
        // ACCEPT everything linked from a .pdf file
        new PredicatedDecideRule() {
            boolean evaluate(CrawlURI uri) {
                return uri.via?.path?.endsWith(".pdf")
            }
        },
        // ...
    ]
}
```

The main downsides are defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can't just manipulate the config files with an XML parser.

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a `crawler-beans.groovy` file in your job directory.

[Groovy Bean Definition DSL]: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy
fastutil is our largest dependency, consuming about a third of the total Heritrix distribution size, but we only use a couple of trivial classes from it. FPMergeUriUniqFilter (which I'm not sure anyone uses anyway) uses LongArrayList, so this change replaces it with a basic version that does just enough. The unsynchronized FastBufferedOutputStream usages are likely unnecessary these days thanks to the JVM's lock optimisations, and for the one in CrawlerJournal the GZIPOutputStream is still going to be synchronizing anyway.
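A "basic version that does just enough" of a growable list of primitive longs might look roughly like the sketch below. It is not the class added by this commit; the name `SimpleLongList` and the choice of methods are assumptions.

```java
import java.util.Arrays;

/** Hypothetical minimal growable list of primitive longs (no boxing). */
final class SimpleLongList {
    private long[] values = new long[16];
    private int size;

    /** Appends a value, doubling the backing array when it is full. */
    void add(long value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    /** Returns the value at the given position, rejecting out-of-range indexes. */
    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    /** Forgets all elements but keeps the backing array for reuse. */
    void clear() {
        size = 0;
    }
}
```

Storing `long` values directly in an array is the main thing a `List<Long>` from the standard library would not provide.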
Add Groovy crawl configs
When enabled, this option prevents regular links annotated with rel=nofollow from being extracted. This is useful for sites that use rel=nofollow to hint at crawler traps.
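The decision this option controls can be sketched as below. This is a hypothetical illustration rather than the ExtractorHTML implementation: the class and method names are invented, and only the option name `obeyRelNofollow` comes from the pull request.

```java
import java.util.Locale;

/** Hypothetical sketch of a rel=nofollow check for a link's rel attribute. */
final class RelNofollowCheck {
    /** True if the space-separated rel value contains the token "nofollow" (any case). */
    static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.toLowerCase(Locale.ROOT).equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** A link is skipped only when the option is on and the link is marked nofollow. */
    static boolean shouldExtract(String rel, boolean obeyRelNofollow) {
        return !(obeyRelNofollow && hasNofollow(rel));
    }
}
```

Matching whole tokens rather than substrings matters because rel can carry several values, for example `rel="noopener nofollow"`.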
ExtractorHTML: Add obeyRelNofollow option
These log messages started unhelpfully being copied into job.log in 3.6.0 due to the slf4j fix. They indicate the server sent a set-cookie header with an incorrect domain. It's correct for Heritrix to reject them. They're very common and not an indication of a problem with Heritrix itself so it's unnecessarily alarming to log them as job warnings. Fixes: 533d762 ("Include slf4j-jdk14 in heritrix-engine ...")
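For readers unfamiliar with how such messages can be silenced under java.util.logging, here is one possible approach. It is a sketch under assumptions, not necessarily what this commit does: the logger name is assumed to be the Apache HttpClient class that emits "Cookie rejected" warnings, and the helper class is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper that raises the threshold of one noisy logger. */
final class CookieWarningSuppressor {
    // Assumption: the warnings are emitted under this logger name.
    static final String COOKIE_LOGGER_NAME =
            "org.apache.http.client.protocol.ResponseProcessCookies";

    // Held in a static field because LogManager keeps only weak references,
    // so a configured logger could otherwise be garbage collected and reset.
    private static final Logger COOKIE_LOGGER = Logger.getLogger(COOKIE_LOGGER_NAME);

    /** After this call, WARNING records from that logger are dropped; SEVERE still passes. */
    static void suppressCookieWarnings() {
        COOKIE_LOGGER.setLevel(Level.SEVERE);
    }
}
```

The same effect can usually be had declaratively with a `<logger name>.level = SEVERE` line in a logging properties file.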
…-logging Suppress 'WARNING Cookie rejected' messages in job.log
Update dependencies for 3.7.0 release
Contrary to the example in the announcement it seems to be mandatory:
> Config validation error in build.os. Value build not found.
It seems to be failing because the placeholder is missing:
```
  File "/home/docs/checkouts/readthedocs.org/user_builds/heritrix/envs/latest/lib/python3.12/site-packages/sphinx/ext/extlinks.py", line 103, in role
    title = caption % part
            ~~~~~~~~^~~~~~
TypeError: not all arguments converted during string formatting
```
See Commits and Changes for more details.
Created by
pull[bot]