Add option for JSONL output #299

anjackson · 2022-09-30T08:13:45Z

Ahead of #297, a first step that can run on older Hadoop is to just output a JSONL version of the records. This could then be sent to Solr or made into Parquet as needed.

This will be added to the Hadoop version first, following a similar pattern to the hadoop3 branch where an explicit Memento bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:

- Add JSONL output to Hadoop mode.
- Add 'all in' default configuration for dataset extraction and use that for JSONL output.
- Add the JSONL option to the non-Hadoop command-line version, using the 'all in' config.
- Set the default main class for both CLI and Hadoop jars.
- Make the Hadoop CLI better (e.g. no Solr and no-solr are contradictory), more consistent with the normal CLI.
- ~~Fill in source_file_path in local CLI mode (source_file == source_file_path???)~~ See "Do we still need source_file_path?" #308
- How to store the ssdeep hash?
- Store ssdeep hash in JSONL in the usual long form: blockSize:Hash:HashOf2xBlockSize:filename
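
As a rough illustration of the "bean drives the serialisation" idea described above, here is a minimal sketch in Python (the project itself is Java, and this is not its implementation). The class `MementoRecord`, its field names, and the helper functions are all hypothetical stand-ins; the real Memento bean covers the full Solr schema.

```python
import json
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class MementoRecord:
    """Hypothetical stand-in for the Memento bean; field names are illustrative only."""
    url: str
    content_type: Optional[str] = None
    source_file: Optional[str] = None
    ssdeep: Optional[str] = None


def to_jsonl_line(record: MementoRecord) -> str:
    """Serialise one record as a single-line JSON object, dropping unset fields."""
    data = {k: v for k, v in asdict(record).items() if v is not None}
    return json.dumps(data, ensure_ascii=False)


def schema_columns(cls) -> list:
    """The same class definition can also list column names for a columnar schema."""
    return [f.name for f in fields(cls)]


# Made-up example records, one JSON object per output line.
records = [
    MementoRecord(url="http://example.org/", content_type="text/html"),
    MementoRecord(url="http://example.org/a.png", source_file="example.warc.gz"),
]
jsonl = "\n".join(to_jsonl_line(r) for r in records)
print(jsonl)
```

The point of the sketch is only that a single typed record definition can feed both the line-per-record JSON output and a column list for something like Parquet.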

For further enhancements, see #307


anjackson · 2022-09-30T20:29:43Z

Verified as running fine on old Hadoop

$ hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-3.2.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -i files.txt -o jsonl-test --jsonl --no-solr -s http://null

The CLI could do with cleaning up, maybe making it more consistent with the non-Hadoop version.

anjackson · 2023-03-27T09:28:54Z

I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.

anjackson · 2023-03-30T12:13:46Z

Added ssdeep in 3a7013e
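
For anyone consuming the JSONL, a minimal Python sketch of splitting the long-form ssdeep value back into its parts might look like the following. It assumes the colon-separated layout quoted in the issue description (blockSize:Hash:HashOf2xBlockSize:filename); the function name, the tuple type, and the sample value are made up for illustration, and output from the ssdeep tool itself may be laid out differently.

```python
from typing import NamedTuple


class SsdeepParts(NamedTuple):
    block_size: int
    hash1: str      # hash computed at block_size
    hash2: str      # hash computed at 2 x block_size
    filename: str


def parse_ssdeep_long(value: str) -> SsdeepParts:
    """Split a 'blockSize:Hash:HashOf2xBlockSize:filename' string into its fields.

    Only the first three colons are treated as separators, so a filename
    that itself contains colons is kept intact.
    """
    parts = value.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    block_size, hash1, hash2, filename = parts
    return SsdeepParts(int(block_size), hash1, hash2, filename)


# Made-up sample value, not a real hash.
print(parse_ssdeep_long("3:abc:de:example.txt"))
```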

anjackson self-assigned this Sep 30, 2022

anjackson added the enhancement label Sep 30, 2022

anjackson added a commit that referenced this issue Sep 30, 2022

Tweak testing for #299

5043147

anjackson added a commit that referenced this issue Sep 30, 2022

Add JSONL output for #299

ad8b1f6

anjackson mentioned this issue Mar 27, 2023

Improve the JSONL output #307

Open

12 tasks

anjackson closed this as completed Mar 30, 2023
