Alex Osborne edited this page Jul 4, 2018 · 3 revisions

Reports

Reports are written to the "reports" directory, which exists under the directory of a specific job.  The locations of individual report files are listed in the "Configuration-referenced paths" section of the job page.

Crawl Summary (crawl-report.txt)

| Field Name | Description |
| --- | --- |
| Crawl Name | The user-defined name of the crawl. |
| Crawl Status | The status of the crawl, such as "Aborted" or "Finished." |
| Duration Time | The duration of the crawl to the nearest millisecond. |
| Total Seeds Crawled | The number of seeds that were successfully crawled. |
| Total Seeds Not Crawled | The number of seeds that were not successfully crawled. |
| Total Hosts Crawled | The number of hosts that were crawled. |
| Total URIs Processed | The number of URIs that were processed. |
| URIs Crawled Successfully | The number of URIs that were crawled successfully. |
| URIs Failed to Crawl | The number of URIs that could not be crawled. |
| URIs Disregarded | The number of URIs that were not selected for crawling. |
| Processed docs/sec | The average number of documents processed per second. |
| Bandwidth in Kbytes/sec | The average number of kilobytes processed per second. |
| Total Raw Data Size in Bytes | The total amount of data crawled, in bytes. |
| Novel Bytes | The number of bytes that are new since the last crawl. |
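
For scripted checks, the summary can be read into a dictionary. The following is a minimal sketch that assumes each line of crawl-report.txt has the form "label: value"; the labels in the sample are taken from the field table above and are hypothetical, so compare them against an actual report before relying on them.

```python
def parse_crawl_summary(text):
    """Parse 'label: value' lines into a dict, skipping lines without a colon."""
    summary = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons survive.
        label, sep, value = line.partition(":")
        if sep:
            summary[label.strip()] = value.strip()
    return summary


# Hypothetical sample; the labels mirror the field table, not verified output.
sample = """\
Crawl Name: demo
Crawl Status: Finished
Total URIs Processed: 120
URIs Crawled Successfully: 100
URIs Failed to Crawl: 15
URIs Disregarded: 5
"""

summary = parse_crawl_summary(sample)
success_ratio = (
    int(summary["URIs Crawled Successfully"])
    / int(summary["Total URIs Processed"])
)
```

Values are kept as strings so that the caller decides which fields to convert to numbers.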

Seeds (seeds-report.txt)

| Field Name | Description |
| --- | --- |
| code | 0 = not crawled; 1 = crawled |
| status | Human-readable description of whether the seed was crawled.  For example, "CRAWLED." |
| seed | The seed URI. |
| redirect | The URI to which the seed redirected. |
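
As an illustration of how these fields might be consumed, the sketch below lists the seeds that were not crawled. It assumes whitespace-separated columns in the order shown in the table, a single-word status, and an optional redirect column; the sample lines, including the "NOTCRAWLED" status, are hypothetical.

```python
from typing import NamedTuple, Optional


class SeedRow(NamedTuple):
    code: int
    status: str
    seed: str
    redirect: Optional[str]


def parse_seeds_report(text):
    """Parse seed rows, skipping the header and any blank or malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A data row needs at least code, status, and seed, and a numeric code.
        if len(parts) < 3 or not parts[0].lstrip("-").isdigit():
            continue
        redirect = parts[3] if len(parts) > 3 else None
        rows.append(SeedRow(int(parts[0]), parts[1], parts[2], redirect))
    return rows


# Hypothetical sample; the column layout is an assumption based on the table.
sample = """\
[code] [status] [seed] [redirect]
1 CRAWLED http://example.com/ http://www.example.com/
0 NOTCRAWLED http://example.org/
"""

rows = parse_seeds_report(sample)
uncrawled = [row.seed for row in rows if row.code == 0]
```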

Hosts (hosts-report.txt)

| Field Name | Description |
| --- | --- |
| #urls | The number of URIs crawled for the host. |
| #bytes | The number of bytes crawled for the host. |
| host | The hostname. |
| #robots | The number of URIs, for this host, excluded because of robots.txt restrictions.  This number does not include URIs linked from the specifically excluded URIs. |
| #remaining | The number of URIs, for this host, that have not been crawled yet but are in the queue. |
| #novel-urls | The number of new URIs crawled for this host since the last crawl. |
| #novel-bytes | The number of new bytes crawled for this host since the last crawl. |
| #dup-by-hash-urls | The number of URIs, for this host, that had the same hash code and are essentially duplicates. |
| #dup-by-hash-bytes | The number of bytes of content, for this host, having the same hash code. |
| #not-modified-urls | The number of URIs, for this host, that returned a 304 status code. |
| #not-modified-bytes | The number of bytes of content, for this host, whose URIs returned a 304 status code. |
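
The per-host columns lend themselves to simple analysis, such as ranking hosts by volume. This is a rough sketch that assumes one whitespace-separated row per host, with the columns in the order listed in the table; the sample data is hypothetical.

```python
# Column order is assumed to match the field table above.
HOST_COLUMNS = [
    "#urls", "#bytes", "host", "#robots", "#remaining",
    "#novel-urls", "#novel-bytes",
    "#dup-by-hash-urls", "#dup-by-hash-bytes",
    "#not-modified-urls", "#not-modified-bytes",
]


def parse_hosts_report(text):
    """Return one dict per host row; the header and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(HOST_COLUMNS) or not parts[0].isdigit():
            continue
        rows.append({
            name: (value if name == "host" else int(value))
            for name, value in zip(HOST_COLUMNS, parts)
        })
    return rows


# Hypothetical sample data for two hosts.
sample = """\
[#urls] [#bytes] [host] [#robots] [#remaining] [#novel-urls] [#novel-bytes] [#dup-by-hash-urls] [#dup-by-hash-bytes] [#not-modified-urls] [#not-modified-bytes]
40 2000 example.com 2 5 30 1500 8 400 2 100
10 9000 example.org 0 0 10 9000 0 0 0 0
"""

rows = parse_hosts_report(sample)
# Hosts ordered from most to fewest bytes crawled.
by_bytes = sorted(rows, key=lambda row: row["#bytes"], reverse=True)
# Share of each host's bytes that duplicated earlier content by hash.
dup_share = {row["host"]: row["#dup-by-hash-bytes"] / row["#bytes"] for row in rows}
```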

SourceTags (source-report.txt)

| Field Name | Description |
| --- | --- |
| source | The seed. |
| host | The host that was accessed from the seed. |
| #urls | The number of URIs crawled for this seed-host combination. |

Note that the SourceTags report is generated only if the sourceTagSeeds property of the TextSeedModule bean is set to true.

<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="sourceTagSeeds">
    <value>true</value>
  </property>
</bean>

Mimetypes (mimetype-report.txt)

| Field Name | Description |
| --- | --- |
| #urls | The number of URIs crawled for a given mime-type. |
| #bytes | The number of bytes crawled for a given mime-type. |
| mime-types | The mime-type. |

ResponseCode (responsecode-report.txt)

| Field Name | Description |
| --- | --- |
| #urls | The number of URIs crawled for a given response code. |
| rescode | The response code. |
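
One way to get a quick overview from this report is to group the counts by status class. The sketch below assumes two whitespace-separated columns per row, the URI count followed by the response code, as in the table above. The sample is hypothetical, and any code outside the HTTP range 100 to 599 is placed in an "other" bucket.

```python
from collections import Counter


def tally_by_class(text):
    """Sum URI counts per status class: '2xx', '3xx', ..., or 'other'."""
    totals = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not a "count code" pair.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, code = int(parts[0]), int(parts[1])
        bucket = f"{code // 100}xx" if 100 <= code <= 599 else "other"
        totals[bucket] += count
    return totals


# Hypothetical sample; the column order is an assumption based on the table.
sample = """\
[#urls] [rescode]
90 200
6 404
3 301
1 -2
"""

totals = tally_by_class(sample)
```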

Processors (processors-report.txt)

This report shows the activity of each processor involved in the crawl.  For example, the entry for the FetchHTTP processor shows the number of URIs it fetched.  The report is organized by chain (Candidate, Fetch, and Disposition) and, within each chain, by processor.  Chains and processors appear in the order in which they are configured in the crawler-beans.cxml file.

FrontierSummary (frontier-summary-report.txt)

This report shows the hosts that are queued for capture.  The hosts are held in multiple queues, and the details of each Frontier queue are reported.

ToeThreads (threads-report.txt)

This report shows the activity of each thread used by Heritrix.  It displays how long each thread has been running, along with the thread state and the thread's Blocked/Waiting status.
