
[pull] master from internetarchive:master #5

Open · wants to merge 241 commits into master from internetarchive:master

Conversation

@pull pull bot commented Apr 1, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

Phani Dharmavarapu and others added 5 commits April 14, 2021 13:09
The LastModified field in the FetchHistory table is expected to be in seconds, but is sometimes stored in milliseconds, causing the parsing logic to produce a date in the future.
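The seconds-vs-milliseconds fix described above can be sketched as follows. This is an illustrative reconstruction, not the actual Heritrix code; the class name and the plausibility threshold are assumptions.

```java
// Illustrative sketch of the seconds-vs-milliseconds fix: any value too
// large to be a plausible epoch-seconds timestamp is assumed to be in
// milliseconds and scaled down. Threshold chosen for illustration only.
class FetchHistoryTime {
    // ~ year 2286 expressed in epoch seconds; larger values are treated as millis
    private static final long MAX_PLAUSIBLE_SECONDS = 9_999_999_999L;

    static long toSeconds(long lastModified) {
        return lastModified > MAX_PLAUSIBLE_SECONDS
                ? lastModified / 1000
                : lastModified;
    }
}
```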
Fixes #473 "Unsupported class file major version 62".
@pull pull bot added the ⤵️ pull label Apr 1, 2022
dependabot bot and others added 24 commits April 6, 2022 10:21
Bumps [spring-beans](https://github.com/spring-projects/spring-framework) from 5.3.14 to 5.3.18.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.14...v5.3.18)

---
updated-dependencies:
- dependency-name: org.springframework:spring-beans
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
This avoids an ownership conflict over /tmp/Crashpad.
Allows for custom browser configuration such as using a proxy server or
increasing log verbosity.
Disables browser features like phishing protection, translation, crash
and metrics reporting, OS keychain integration and such that can cause
unnecessary network traffic or that can cause the browser to get stuck
in a prompt. Some of these options may be unnecessary in headless mode,
but we may as well keep them in case we want to support headed mode for
debugging in the future.
We use a static (global) counter so that if a job uses multiple
instances of ExtractorChrome they don't clobber each other.
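The shared-counter idea above can be sketched with a static `AtomicInteger` (class and method names here are hypothetical, not the actual ExtractorChrome code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the static (global) counter idea: because the counter is a
// class-level AtomicInteger, every ExtractorChrome instance in the JVM
// draws from the same sequence, so two instances never reuse a value.
class ExtractorCounter {
    private static final AtomicInteger NEXT = new AtomicInteger();

    static int next() {
        return NEXT.getAndIncrement();
    }
}
```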
Let's not waste time running the browser on error pages.
This usually means the browser has exited and so there's no window left
to close.
- Use the "DohResolver" from the dnsjava library to make DoH lookups.
- To enable and configure it, add two new properties:
  * "enableDnsOverHttpResolves" (boolean)
  * "dnsOverHttpServer" (URL of the DoH server)
- Since one use case for DoH is crawling from behind a firewall, also
  support using a proxy to access the DoH server; the proxy from the
  FetchHTTP bean is reused in that case.

Fixes #211
- Instead of "borrowing" the configured proxy from the fetchHttp bean,
  use proxy values defined via global options, to avoid interference
  with other jobs running in parallel (or at least make it explicit).
  The "fetchHttp" bean also uses these settings if no bean-specific
  settings are given.
- Remove "enableDnsOverHttpResolves" and rely on a non-empty
  "dnsOverHttpServer" value to signal that DoH should be used.
- Scrap the setter for "enableDnsOverHttpResolve", too.
- Docs: let's see if we can set a link to another chapter.
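Based on the description above, enabling DoH might look something like this in a Groovy crawl configuration. The bean name and placement are hypothetical; only the `dnsOverHttpServer` property name comes from the commit message, and per the change above a non-empty value is what switches DoH on.

```groovy
// Hypothetical sketch (bean placement assumed): a non-empty
// dnsOverHttpServer value signals that DoH should be used.
fetchDns(FetchDNS) {
    dnsOverHttpServer = 'https://dns.example.org/dns-query'
}
```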
This ensures that when we later compare the context in processEmbed()
we don't need to deal with variants like srcSet or SRCSET. Note that
we're already sometimes lowercasing it later in HTMLLinkContext.get().

Fixes #477.
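The normalization described above amounts to lowercasing the attribute-based context once, up front. A minimal sketch (helper name hypothetical, not the actual HTMLLinkContext code):

```java
import java.util.Locale;

// Minimal sketch: lowercase the attribute name at creation time so later
// comparisons (e.g. in processEmbed()) never see variants like
// srcSet vs SRCSET. Locale.ROOT avoids locale-dependent case mapping.
class LinkContexts {
    static String normalize(String attr) {
        return attr == null ? null : attr.toLowerCase(Locale.ROOT);
    }
}
```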
This makes troubleshooting link extraction problems much easier.
This avoids creating browser processes for jobs that have it disabled.
If a job has it disabled by default but enables it with a sheet we'll
start it when needed.

We still connect on start when enabled to provide early error feedback.
Bumps [spring-core](https://github.com/spring-projects/spring-framework) from 5.3.18 to 5.3.19.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.18...v5.3.19)

---
updated-dependencies:
- dependency-name: org.springframework:spring-core
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Useful for testing link extraction on pages that use a robots meta tag.
Bumps [gson](https://github.com/google/gson) from 2.8.6 to 2.8.9.
- [Release notes](https://github.com/google/gson/releases)
- [Changelog](https://github.com/google/gson/blob/master/CHANGELOG.md)
- [Commits](google/gson@gson-parent-2.8.6...gson-parent-2.8.9)

---
updated-dependencies:
- dependency-name: com.google.code.gson:gson
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
…l for heritrix Browse Beans functionality.

(cherry picked from commit 96e47a2)
ato added 30 commits November 29, 2024 20:28
This enables crawl configuration files to use Spring's [Groovy Bean Definition DSL] as an optional alternative to Spring XML. It uses the same bean configuration model but the syntax is more terse and human-readable. No more need for `&amp;` in seed URLs. :-)

```groovy
checkpointService(CheckpointService) {
    checkpointIntervalMinutes = 15
    checkpointsDir = 'checkpoints'
    forgetAllButLatest = true
}
```

It also enables some powerful scripting capabilities. For example, defining a custom DecideRule directly in the crawl scope:

```groovy
scope(DecideRuleSequence) {
    rules = [
        new RejectDecideRule(),
        // ACCEPT everything linked from a .pdf file
        new PredicatedDecideRule() {
             boolean evaluate(CrawlURI uri) {
                 return uri.via?.path?.endsWith(".pdf")
             }
        },
        // ...
    ]
}
```

The main downsides are that defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can no longer manipulate the config files with an XML parser.

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a `crawler-beans.groovy` file in your job directory.

[Groovy Bean Definition DSL]: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy
fastutil is our largest dependency, consuming about a third of the
total Heritrix distribution size, but we only use a couple of trivial
classes from it.

FPMergeUriUniqFilter (which I'm not sure anyone uses anyway) uses
LongArrayList, so this change replaces it with a basic version that does
just enough.

The unsynchronized FastBufferedOutputStream usages are likely
unnecessary these days thanks to the JVM's lock optimisations and for
the one in CrawlerJournal, the GZIPOutputStream is still going to
be synchronizing anyway.
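A "basic version that does just enough" of a growable long array could look like the sketch below. This is illustrative only (class name and API surface assumed), not the actual replacement class:

```java
import java.util.Arrays;

// Minimal growable array of primitive longs: just enough API
// (add/get/size) to stand in for fastutil's LongArrayList in a
// consumer like FPMergeUriUniqFilter.
class SimpleLongArrayList {
    private long[] data = new long[16];
    private int size;

    void add(long value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // double capacity when full
        }
        data[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```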
When enabled, this option causes regular links annotated with rel=nofollow to not be extracted. This is useful for sites that use rel=nofollow to mark crawler traps.

ExtractorHTML: Add obeyRelNofollow option
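In a Groovy crawl configuration, turning the new option on might look like this (the bean name is an assumption for illustration; `obeyRelNofollow` is the option named above):

```groovy
// Sketch: enable the new ExtractorHTML option so links annotated with
// rel=nofollow are not extracted.
extractorHtml(ExtractorHTML) {
    obeyRelNofollow = true
}
```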
These log messages started being unhelpfully copied into job.log in
3.6.0 due to the slf4j fix. They indicate the server sent a set-cookie
header with an incorrect domain, and it's correct for Heritrix to reject
them. They're very common and not an indication of a problem with
Heritrix itself, so it's unnecessarily alarming to log them as
job warnings.

Fixes: 533d762 ("Include slf4j-jdk14 in heritrix-engine ...")
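One generic way to express this kind of suppression with `java.util.logging` is to raise the offending logger's level, as sketched below. The logger name and mechanism here are assumptions for illustration; the actual commit may suppress the messages differently.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Generic j.u.l sketch of the suppression idea: raise the level of the
// chatty cookie logger so rejected set-cookie headers no longer appear
// as warnings. The logger name is an assumption, not taken from Heritrix.
class CookieLogQuiet {
    static void apply() {
        Logger.getLogger("org.apache.http.client.protocol.ResponseProcessCookies")
              .setLevel(Level.SEVERE);
    }
}
```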
…-logging

Suppress 'WARNING Cookie rejected' messages in job.log
Contrary to the example in the announcement, it seems to be mandatory:
> Config validation error in build.os. Value build not found.
It seems to be failing because the placeholder is missing:

```
  File "/home/docs/checkouts/readthedocs.org/user_builds/heritrix/envs/latest/lib/python3.12/site-packages/sphinx/ext/extlinks.py", line 103, in role
    title = caption % part
            ~~~~~~~~^~~~~~
TypeError: not all arguments converted during string formatting
```