WARC fields not populated in Kafka crawl log #90

Open
anjackson opened this issue Aug 29, 2023 · 0 comments
In #89 it was noted that the warc_filename and warc_offset fields appear to be null when they should not be.

Note the fields are fine in the actual file-based crawl log:

2023-08-29T09:45:48.367Z   301          0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}

So this is something to do with the Kafka version of the crawl log. We use:

    protected byte[] buildMessage(CrawlURI curi) {
        JSONObject jo = CrawlLogJsonBuilder.buildJson(curi, getExtraFields(), getServerCache());
        try {
            return jo.toString().getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

Which calls

https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15

So it should be working, but perhaps this is just an order-of-operations problem?

Yes, it looks like the Kafka crawl log message is written before the WARC record, for some complicated reasons that need picking apart.

https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
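
To illustrate the suspected ordering problem, here is a minimal sketch in plain Java. It does not use Heritrix at all: ProcessorOrderSketch, warcWriter, kafkaLogger, run and the example file name and offset are all hypothetical stand-ins, and the real chain in crawler-beans.cxml may behave differently. It only shows that a step which snapshots the URI's fields before the WARC-writing step has run will see nulls for the WARC fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a processor chain; none of these names are Heritrix classes.
class ProcessorOrderSketch {

    // Stand-in for the WARC writer: records where the record landed (made-up values).
    static Consumer<Map<String, Object>> warcWriter() {
        return uri -> {
            uri.put("warcFilename", "example.warc.gz");
            uri.put("warcFileOffset", 12345L);
        };
    }

    // Stand-in for the Kafka crawl log feed: snapshots the fields at the moment it runs.
    static Consumer<Map<String, Object>> kafkaLogger(List<String> sent) {
        return uri -> sent.add("warc_filename=" + uri.get("warcFilename")
                + ", warc_offset=" + uri.get("warcFileOffset"));
    }

    // Runs the two steps over one URI in the requested order and returns the logged message.
    static String run(boolean loggerFirst) {
        List<String> sent = new ArrayList<>();
        List<Consumer<Map<String, Object>>> chain = new ArrayList<>();
        if (loggerFirst) {
            chain.add(kafkaLogger(sent));
            chain.add(warcWriter());
        } else {
            chain.add(warcWriter());
            chain.add(kafkaLogger(sent));
        }
        Map<String, Object> uri = new LinkedHashMap<>();
        uri.put("url", "https://example.org/");
        for (Consumer<Map<String, Object>> step : chain) {
            step.accept(uri);
        }
        return sent.get(0);
    }

    public static void main(String[] args) {
        System.out.println(run(true));   // logger before writer: WARC fields are null
        System.out.println(run(false));  // writer before logger: WARC fields are populated
    }
}
```

If the real cause is the same, the fix would be along the lines of making sure the Kafka feed runs after the WARC writer, but that depends on why the chain is ordered the way it is.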

@anjackson anjackson added the bug label Aug 29, 2023