
[pull] master from internetarchive:master #5

Open · wants to merge 241 commits into master from internetarchive:master

Conversation

@pull pull bot commented Apr 1, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

Phani Dharmavarapu and others added 5 commits April 14, 2021 13:09
The LastModified field in the FetchHistory table is expected to be in seconds, but is sometimes stored in milliseconds, causing the parsing logic to produce a date in the future.
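The seconds-vs-milliseconds fix described above can be sketched as follows. This is an illustrative reconstruction, not the actual Heritrix code; the class name and the plausibility threshold are assumptions.

```java
// Illustrative sketch of the seconds-vs-milliseconds fix: any value too
// large to be a plausible epoch-seconds timestamp is assumed to be in
// milliseconds and scaled down. Threshold chosen for illustration only.
class FetchHistoryTime {
    // ~ year 2286 expressed in epoch seconds; larger values are treated as millis
    private static final long MAX_PLAUSIBLE_SECONDS = 9_999_999_999L;

    static long toSeconds(long lastModified) {
        return lastModified > MAX_PLAUSIBLE_SECONDS
                ? lastModified / 1000
                : lastModified;
    }
}
```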
Fixes #473 "Unsupported class file major version 62".
@pull pull bot added the ⤵️ pull label Apr 1, 2022
dependabot bot and others added 24 commits April 6, 2022 10:21
Bumps [spring-beans](https://github.com/spring-projects/spring-framework) from 5.3.14 to 5.3.18.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.14...v5.3.18)

---
updated-dependencies:
- dependency-name: org.springframework:spring-beans
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
This avoids an ownership conflict over /tmp/Crashpad.
Allows for custom browser configuration such as using a proxy server or
increasing log verbosity.
Disables browser features like phishing protection, translation, crash
and metrics reporting, OS keychain integration and such that can cause
unnecessary network traffic or that can cause the browser to get stuck
in a prompt. Some of these options may be unnecessary in headless mode,
but we may as well keep them in case we want to support headed mode for
debugging in the future.
We use a static (global) counter so that if a job uses multiple
instances of ExtractorChrome they don't clobber each other.
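The shared-counter idea above can be sketched with a static `AtomicInteger` (class and method names here are hypothetical, not the actual ExtractorChrome code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the static (global) counter idea: because the counter is a
// class-level AtomicInteger, every ExtractorChrome instance in the JVM
// draws from the same sequence, so two instances never reuse a value.
class ExtractorCounter {
    private static final AtomicInteger NEXT = new AtomicInteger();

    static int next() {
        return NEXT.getAndIncrement();
    }
}
```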
Let's not waste time running the browser on error pages.
This usually means the browser has exited and so there's no window left
to close.
- Use the "DohResolver" from the dnsjava library to make DoH lookups.
- To enable and configure it, add two new properties:
  * "enableDnsOverHttpResolves" (boolean)
  * "dnsOverHttpServer" (URL of the DoH server)
- Since one use case for DoH is crawling from behind a firewall, also
  support using a proxy to access the DoH server; the proxy from the
  FetchHTTP bean is reused in that case.

Fixes #211
- Instead of "borrowing" the configured proxy from the fetchHttp bean,
  use proxy values defined via global options, to avoid interference
  with other jobs running in parallel (or at least make it explicit).
  The "fetchHttp" bean also uses these settings if no bean-specific
  settings are given.
- Remove "enableDnsOverHttpResolves" and rely on a non-empty
  "dnsOverHttpServer" value to signal that DoH should be used.
- Scrap the setter for "enableDnsOverHttpResolve", too.
- Docs: let's see if we can set a link to another chapter.
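Based on the description above, enabling DoH might look something like this in a Groovy crawl configuration. The bean name and placement are hypothetical; only the `dnsOverHttpServer` property name comes from the commit message, and per the change above a non-empty value is what switches DoH on.

```groovy
// Hypothetical sketch (bean placement assumed): a non-empty
// dnsOverHttpServer value signals that DoH should be used.
fetchDns(FetchDNS) {
    dnsOverHttpServer = 'https://dns.example.org/dns-query'
}
```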
This ensures that when we later compare the context in processEmbed()
we don't need to deal with variants like srcSet or SRCSET. Note that
we're already sometimes lowercasing it later in HTMLLinkContext.get().

Fixes #477.
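The normalization described above amounts to lowercasing the attribute-based context once, up front. A minimal sketch (helper name hypothetical, not the actual HTMLLinkContext code):

```java
import java.util.Locale;

// Minimal sketch: lowercase the attribute name at creation time so later
// comparisons (e.g. in processEmbed()) never see variants like
// srcSet vs SRCSET. Locale.ROOT avoids locale-dependent case mapping.
class LinkContexts {
    static String normalize(String attr) {
        return attr == null ? null : attr.toLowerCase(Locale.ROOT);
    }
}
```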
This makes troubleshooting link extraction problems much easier.
This avoids creating browser processes for jobs that have it disabled.
If a job has it disabled by default but enables it with a sheet we'll
start it when needed.

We still connect on start when enabled to provide early error feedback.
Bumps [spring-core](https://github.com/spring-projects/spring-framework) from 5.3.18 to 5.3.19.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.18...v5.3.19)

---
updated-dependencies:
- dependency-name: org.springframework:spring-core
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Useful for testing link extraction on pages that use a robots meta tag.
Bumps [gson](https://github.com/google/gson) from 2.8.6 to 2.8.9.
- [Release notes](https://github.com/google/gson/releases)
- [Changelog](https://github.com/google/gson/blob/master/CHANGELOG.md)
- [Commits](google/gson@gson-parent-2.8.6...gson-parent-2.8.9)

---
updated-dependencies:
- dependency-name: com.google.code.gson:gson
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
…l for heritrix Browse Beans functionality.

(cherry picked from commit 96e47a2)
ato added 30 commits November 29, 2024 20:28
This enables crawl configuration files to use Spring's [Groovy Bean Definition DSL] as an optional alternative to Spring XML. It uses the same bean configuration model but the syntax is more terse and human-readable. No more need for `&amp;` in seed URLs. :-)

```groovy
checkpointService(CheckpointService) {
    checkpointIntervalMinutes = 15
    checkpointsDir = 'checkpoints'
    forgetAllButLatest = true
}
```

It also enables some powerful scripting capabilities. For example, defining a custom DecideRule directly in the crawl scope:

```groovy
scope(DecideRuleSequence) {
    rules = [
        new RejectDecideRule(),
        // ACCEPT everything linked from a .pdf file
        new PredicatedDecideRule() {
             boolean evaluate(CrawlURI uri) {
                 return uri.via?.path?.endsWith(".pdf")
             }
        },
        // ...
    ]
}
```

The main downsides are that defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can no longer manipulate the config files with an XML parser.

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a `crawler-beans.groovy` file in your job directory.

[Groovy Bean Definition DSL]: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy
fastutil is our largest dependency, consuming about a third of the
total Heritrix distribution size, but we only use a couple of trivial
classes from it.

FPMergeUriUniqFilter (which I'm not sure anyone uses anyway) uses
LongArrayList, so this change replaces it with a basic version that does
just enough.

The unsynchronized FastBufferedOutputStream usages are likely
unnecessary these days thanks to the JVM's lock optimisations and for
the one in CrawlerJournal, the GZIPOutputStream is still going to
be synchronizing anyway.
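A "basic version that does just enough" of a growable long array could look like the sketch below. This is illustrative only (class name and API surface assumed), not the actual replacement class:

```java
import java.util.Arrays;

// Minimal growable array of primitive longs: just enough API
// (add/get/size) to stand in for fastutil's LongArrayList in a
// consumer like FPMergeUriUniqFilter.
class SimpleLongArrayList {
    private long[] data = new long[16];
    private int size;

    void add(long value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // double capacity when full
        }
        data[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```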
When enabled, this option causes regular links annotated with rel=nofollow to not be extracted. This is useful for sites that use rel=nofollow to mark crawler traps.

ExtractorHTML: Add obeyRelNofollow option
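In a Groovy crawl configuration, turning the new option on might look like this (the bean name is an assumption for illustration; `obeyRelNofollow` is the option named above):

```groovy
// Sketch: enable the new ExtractorHTML option so links annotated with
// rel=nofollow are not extracted.
extractorHtml(ExtractorHTML) {
    obeyRelNofollow = true
}
```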
These log messages started being unhelpfully copied into job.log in
3.6.0 due to the slf4j fix. They indicate the server sent a set-cookie
header with an incorrect domain, and it's correct for Heritrix to reject
them. They're very common and not an indication of a problem with
Heritrix itself, so it's unnecessarily alarming to log them as
job warnings.

Fixes: 533d762 ("Include slf4j-jdk14 in heritrix-engine ...")
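One generic way to express this kind of suppression with `java.util.logging` is to raise the offending logger's level, as sketched below. The logger name and mechanism here are assumptions for illustration; the actual commit may suppress the messages differently.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Generic j.u.l sketch of the suppression idea: raise the level of the
// chatty cookie logger so rejected set-cookie headers no longer appear
// as warnings. The logger name is an assumption, not taken from Heritrix.
class CookieLogQuiet {
    static void apply() {
        Logger.getLogger("org.apache.http.client.protocol.ResponseProcessCookies")
              .setLevel(Level.SEVERE);
    }
}
```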
…-logging

Suppress 'WARNING Cookie rejected' messages in job.log
Contrary to the example in the announcement, it seems to be mandatory:
> Config validation error in build.os. Value build not found.
It seems to be failing because the placeholder is missing:

```
  File "/home/docs/checkouts/readthedocs.org/user_builds/heritrix/envs/latest/lib/python3.12/site-packages/sphinx/ext/extlinks.py", line 103, in role
    title = caption % part
            ~~~~~~~~^~~~~~
TypeError: not all arguments converted during string formatting
```