# Reports

Reports are found in the `reports` directory, which exists under the directory of a specific job. The locations of specific report files are listed in the "Configuration-referenced paths" section of the job page.
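For scripted access, the report files can be located on disk. Below is a minimal sketch, assuming a Heritrix 3 layout in which each launch of a job writes its own `reports` directory; the job path is hypothetical and should be adjusted to your installation:

```python
# Sketch: find and print the report files written for a job.
# The job path below is hypothetical; adjust it to your installation.
from pathlib import Path

job_dir = Path("/opt/heritrix/jobs/myjob")

# Each launch of a job writes its own "reports" directory, so search
# recursively rather than assuming one fixed layout.
for reports_dir in sorted(job_dir.glob("**/reports")):
    print(reports_dir)
    for report in sorted(reports_dir.glob("*.txt")):
        print("  ", report.name)
```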
## Crawl Summary Report (crawl-report.txt)

| Field Name | Description |
|---|---|
| Crawl Name | The user-defined name of the crawl. |
| Crawl Status | The status of the crawl, such as "Aborted" or "Finished." |
| Duration Time | The duration of the crawl to the nearest millisecond. |
| Total Seeds Crawled | The number of seeds that were successfully crawled. |
| Total Seeds Not Crawled | The number of seeds that were not successfully crawled. |
| Total Hosts Crawled | The number of hosts that were crawled. |
| Total URIs Processed | The number of URIs that were processed. |
| URIs Crawled Successfully | The number of URIs that were crawled successfully. |
| URIs Failed to Crawl | The number of URIs that could not be crawled. |
| URIs Disregarded | The number of URIs that were not selected for crawling. |
| Processed docs/sec | The average number of documents processed per second. |
| Bandwidth in Kbytes/sec | The average number of kilobytes processed per second. |
| Total Raw Data Size in Bytes | The total amount of data crawled. |
| Novel Bytes | The number of new bytes crawled since the last crawl. |
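The crawl summary is written as simple "name: value" lines, which makes it easy to read programmatically. A hedged sketch, assuming that colon-separated format (the file path is illustrative):

```python
# Sketch: parse crawl-report.txt into a dict, assuming "name: value" lines.
def parse_crawl_report(path):
    fields = {}
    with open(path) as f:
        for line in f:
            # Split on the first colon only; values may contain colons.
            name, sep, value = line.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
    return fields

summary = parse_crawl_report("reports/crawl-report.txt")  # illustrative path
print(summary.get("Crawl Status"))
```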
## Seeds Report (seeds-report.txt)

| Field Name | Description |
|---|---|
| code | The status code for the seed; a code of 0 means the seed was not crawled. |
| status | Human-readable description of whether the seed was crawled. For example, "CRAWLED." |
| seed | The seed URI. |
| redirect | The URI to which the seed redirected. |
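The seeds report is a whitespace-separated table with one row per seed, in the column order shown above. A sketch of listing the seeds that were not crawled, assuming that column order and a single header line (the file path is illustrative):

```python
# Sketch: list seeds that were not crawled (code 0), assuming the
# whitespace-separated column order: code, status, seed, [redirect].
with open("reports/seeds-report.txt") as f:
    next(f)  # skip the header line (assumed)
    for line in f:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "0":
            print(parts[2])
```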
## Hosts Report (hosts-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for the host. |
| #bytes | The number of bytes crawled for the host. |
| host | The hostname. |
| #robots | The number of URIs, for this host, excluded because of robots.txt restrictions. |
| #remaining | The number of URIs, for this host, that have not been crawled yet but are in the queue. |
| #novel-urls | The number of new URIs crawled for this host since the last crawl. |
| #novel-bytes | The number of new bytes crawled for this host since the last crawl. |
| #dup-by-hash-urls | The number of URIs, for this host, that had the same content hash as previously seen content and are essentially duplicates. |
| #dup-by-hash-bytes | The number of bytes of content, for this host, having the same content hash. |
| #not-modified-urls | The number of URIs, for this host, that returned a 304 status code. |
| #not-modified-bytes | The number of bytes of content, for this host, whose URIs returned a 304 status code. |
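Because the hosts report is also whitespace-separated, quick aggregations are straightforward. A sketch ranking hosts by bytes crawled, assuming the column order shown above and a single header line (the file path is illustrative):

```python
# Sketch: print the five largest hosts by bytes crawled, assuming the
# whitespace-separated column order: #urls, #bytes, host, ...
rows = []
with open("reports/hosts-report.txt") as f:
    next(f)  # skip the header line (assumed)
    for line in f:
        parts = line.split()
        if len(parts) >= 3:
            rows.append((int(parts[1]), parts[2]))

for nbytes, host in sorted(rows, reverse=True)[:5]:
    print(f"{host}: {nbytes} bytes")
```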
## SourceTags Report (source-report.txt)

| Field Name | Description |
|---|---|
| source | The seed. |
| host | The host that was accessed from the seed. |
| #urls | The number of URIs crawled for this seed-host combination. |
Note that the SourceTags report will only be generated if the `sourceTagsSeeds` property of the `TextSeedModule` bean is set to true:
```xml
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="sourceTagsSeeds">
    <value>true</value>
  </property>
</bean>
```
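With this setting enabled, each discovered URI carries a tag identifying the seed from which it was discovered, which is what makes the per-seed host breakdown above possible.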
## Mimetypes Report (mimetype-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for a given mime-type. |
| #bytes | The number of bytes crawled for a given mime-type. |
| mime-types | The mime-type. |
## ResponseCode Report (responsecode-report.txt)

| Field Name | Description |
|---|---|
| #urls | The number of URIs crawled for a given response code. |
| rescode | The response code. |
## Processors Report (processors-report.txt)

This report shows the activity of each processor involved in the crawl. For example, the `FetchHTTP` processor is included in the report; for this processor, the number of URIs fetched is displayed. The report covers each chain (Candidate, Fetch, and Disposition) and each processor in each chain, in the order in which the processors are configured in the `crawler-beans.cxml` file.
## Frontier Summary Report

This link displays a report showing the hosts that are queued for capture. The hosts are distributed across multiple queues, and the details of each Frontier queue are reported.
## ToeThreads Report

This link displays a report showing the activity of each thread used by Heritrix. The length of time each thread has been running is displayed, along with the thread's state and its Blocked/Waiting status.