In #89 it was noted that the `warc_filename` and `warc_offset` fields appear to be `null` when they should not be.
Note the fields are fine in the actual file-based crawl log:
2023-08-29T09:45:48.367Z 301 0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}
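To check a file-based crawl log for these fields, something like the following Python sketch could be used. It is only an illustration, not part of ukwa-heritrix: the `extra_info` helper is hypothetical, and it assumes the JSON object is the last field on the line and that no earlier field contains the sequence ` {`.

```python
import json

# Sample line taken from the file-based crawl log quoted in this issue.
SAMPLE_LINE = (
    "2023-08-29T09:45:48.367Z 301 0 "
    "https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// "
    "LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 "
    "20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - "
    "ip:212.84.88.220,geo:GB "
    '{"contentSize":308,'
    '"warcFilename":"BL-NPLD-20230829033616191-116600-105'
    '~npld-dc-heritrix3-worker-1~8443.warc.gz",'
    '"warcFileOffset":918370261,'
    '"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule",'
    '"warcFileRecordLength":1339}'
)


def extra_info(line):
    """Return the trailing JSON object of a crawl log line as a dict.

    Hypothetical helper: assumes the JSON is the final field and that
    no earlier field contains the two-character sequence ' {'.
    Returns an empty dict when no JSON object is found.
    """
    _, sep, tail = line.partition(" {")
    if not sep:
        return {}
    # partition() consumed the opening brace, so put it back.
    return json.loads("{" + tail)


info = extra_info(SAMPLE_LINE)
print(info.get("warcFilename"), info.get("warcFileOffset"))
```

Running the same check over messages from the Kafka feed would show whether the WARC fields are missing only there.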
So this is something to do with the Kafka version of the crawl log. We use
ukwa-heritrix/src/main/java/uk/bl/wap/crawler/postprocessor/KafkaKeyedCrawlLogFeed.java
Lines 141 to 148 in 0c21b27
Which calls
https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15
So it should be working, but perhaps this is just an order-of-operations problem?
Yes, it looks like the Kafka crawl log feed is written before the WARC, for some complicated reasons that need picking apart.
https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
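The suspected ordering problem can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the Heritrix code: `warc_writer`, `make_kafka_feed`, `run_chain` and the example values are all hypothetical stand-ins. The only point is that a feed which serialises the URI's state before the writer has run cannot see the WARC fields.

```python
import json


def warc_writer(curi):
    # Stand-in for the WARC writer: records where the response was stored.
    curi["extraInfo"]["warcFilename"] = "example.warc.gz"  # hypothetical value
    curi["extraInfo"]["warcFileOffset"] = 1234  # hypothetical value


def make_kafka_feed(sent):
    def kafka_feed(curi):
        # Stand-in for the Kafka crawl log feed: serialises whatever
        # state the URI has at the moment this processor runs.
        extra = curi["extraInfo"]
        sent.append(json.dumps({
            "url": curi["url"],
            "warc_filename": extra.get("warcFilename"),
            "warc_offset": extra.get("warcFileOffset"),
        }))
    return kafka_feed


def run_chain(processors):
    # Apply each processor to a fresh URI, in the configured order.
    curi = {"url": "https://example.org/", "extraInfo": {}}
    for process in processors:
        process(curi)
    return curi


def emitted_message(feed_first):
    # Return the single message the feed emits for a given ordering.
    sent = []
    feed = make_kafka_feed(sent)
    order = [feed, warc_writer] if feed_first else [warc_writer, feed]
    run_chain(order)
    return json.loads(sent[0])


print(emitted_message(feed_first=True))   # feed runs before the writer
print(emitted_message(feed_first=False))  # feed runs after the writer
```

If the real cause is the same, moving the feed after the WARC writer in the processor chain (or having it emit later) should populate the fields, subject to whatever reasons led to the current order.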