# Reports

Reports are found in the `reports` directory, which exists under the directory of a specific job. The locations of specific report files are listed in the "Configuration-referenced paths" section of the job page.
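For scripted access, the report files can be located on disk. Below is a minimal sketch, assuming a Heritrix 3 layout in which each launch of a job writes its own `reports` directory; the job path is hypothetical and should be adjusted to your installation:

```python
# Sketch: find and print the report files written for a job.
# The job path below is hypothetical; adjust it to your installation.
from pathlib import Path

job_dir = Path("/opt/heritrix/jobs/myjob")

# Each launch of a job writes its own "reports" directory, so search
# recursively rather than assuming one fixed layout.
for reports_dir in sorted(job_dir.glob("**/reports")):
    print(reports_dir)
    for report in sorted(reports_dir.glob("*.txt")):
        print("  ", report.name)
```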
## Crawl Summary Report (crawl-report.txt)

| Field Name | Description |
|---|---|
| Crawl Name | The user-defined name of the crawl. |
| Crawl Status | The status of the crawl, such as "Aborted" or "Finished." |
| Duration Time | The duration of the crawl to the nearest millisecond. |
| Total Seeds Crawled | The number of seeds that were successfully crawled. |
| Total Seeds Not Crawled | The number of seeds that were not successfully crawled. |
| Total Hosts Crawled | The number of hosts that were crawled. |
| Total URIs Processed | The number of URIs that were processed. |
| URIs Crawled Successfully | The number of URIs that were crawled successfully. |
| URIs Failed to Crawl | The number of URIs that could not be crawled. |
| URIs Disregarded | The number of URIs that were not selected for crawling. |
| Processed docs/sec | The average number of documents processed per second. |
| Bandwidth in Kbytes/sec | The average number of kilobytes processed per second. |
| Total Raw Data Size in Bytes | The total amount of data crawled. |
| Novel Bytes | The number of new bytes crawled since the last crawl. |
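The crawl summary is written as simple "name: value" lines, which makes it easy to read programmatically. A hedged sketch, assuming that colon-separated format (the file path is illustrative):

```python
# Sketch: parse crawl-report.txt into a dict, assuming "name: value" lines.
def parse_crawl_report(path):
    fields = {}
    with open(path) as f:
        for line in f:
            # Split on the first colon only; values may contain colons.
            name, sep, value = line.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
    return fields

summary = parse_crawl_report("reports/crawl-report.txt")  # illustrative path
print(summary.get("Crawl Status"))
```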
## Seeds Report (seeds-report.txt)

| Field Name | Description |
|---|---|
| code | The status code for the seed; a code of 0 means the seed was not crawled. |
| status | Human-readable description of whether the seed was crawled. For example, "CRAWLED." |
| seed | The seed URI. |
| redirect | The URI to which the seed redirected. |
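The seeds report is a whitespace-separated table with one row per seed, in the column order shown above. A sketch of listing the seeds that were not crawled, assuming that column order and a single header line (the file path is illustrative):

```python
# Sketch: list seeds that were not crawled (code 0), assuming the
# whitespace-separated column order: code, status, seed, [redirect].
with open("reports/seeds-report.txt") as f:
    next(f)  # skip the header line (assumed)
    for line in f:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "0":
            print(parts[2])
```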
## Hosts Report (hosts-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for the host. |
| #bytes | The number of bytes crawled for the host. |
| host | The hostname. |
| #robots | The number of URIs, for this host, excluded because of robots.txt restrictions. |
| #remaining | The number of URIs, for this host, that have not been crawled yet but are in the queue. |
| #novel-urls | The number of new URIs crawled for this host since the last crawl. |
| #novel-bytes | The number of new bytes crawled for this host since the last crawl. |
| #dup-by-hash-urls | The number of URIs, for this host, that had the same content hash as previously seen content and are essentially duplicates. |
| #dup-by-hash-bytes | The number of bytes of content, for this host, having the same content hash. |
| #not-modified-urls | The number of URIs, for this host, that returned a 304 status code. |
| #not-modified-bytes | The number of bytes of content, for this host, whose URIs returned a 304 status code. |
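Because the hosts report is also whitespace-separated, quick aggregations are straightforward. A sketch ranking hosts by bytes crawled, assuming the column order shown above and a single header line (the file path is illustrative):

```python
# Sketch: print the five largest hosts by bytes crawled, assuming the
# whitespace-separated column order: #urls, #bytes, host, ...
rows = []
with open("reports/hosts-report.txt") as f:
    next(f)  # skip the header line (assumed)
    for line in f:
        parts = line.split()
        if len(parts) >= 3:
            rows.append((int(parts[1]), parts[2]))

for nbytes, host in sorted(rows, reverse=True)[:5]:
    print(f"{host}: {nbytes} bytes")
```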
## SourceTags Report (source-report.txt)

| Field Name | Description |
|---|---|
| source | The seed. |
| host | The host that was accessed from the seed. |
| #urls | The number of URIs crawled for this seed-host combination. |
Note that the SourceTags report will only be generated if the `sourceTagsSeeds` property of the `TextSeedModule` bean is set to true:
```xml
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="sourceTagsSeeds">
    <value>true</value>
  </property>
</bean>
```
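With this setting enabled, each discovered URI carries a tag identifying the seed from which it was discovered, which is what makes the per-seed host breakdown above possible.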
## Mimetypes Report (mimetype-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for a given mime-type. |
| #bytes | The number of bytes crawled for a given mime-type. |
| mime-types | The mime-type. |
## ResponseCode Report (responsecode-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for a given response code. |
| rescode | The response code. |
## Processors Report (processors-report.txt)

This report shows the activity of each processor involved in the crawl. For example, the `FetchHTTP` processor is included in the report; for this processor, the number of URIs fetched is displayed. The report covers each chain (Candidate, Fetch, and Disposition) and each processor in each chain, in the order in which the processors are configured in the `crawler-beans.cxml` file.
## Frontier Summary Report

This link displays a report showing the hosts that are queued for capture. The hosts are distributed across multiple queues, and the details of each Frontier queue are reported.
## ToeThreads Report

This link displays a report showing the activity of each thread used by Heritrix. The length of time each thread has been running is displayed, along with the thread's state and its Blocked/Waiting status.