Add option for JSONL output #299

Closed
7 of 8 tasks
anjackson opened this issue Sep 30, 2022 · 3 comments
@anjackson
Contributor

anjackson commented Sep 30, 2022

Ahead of #297, a first step that can run on older Hadoop is to output a JSONL version of the records. These could then be sent to Solr or converted to Parquet as needed.
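
As a rough illustration of what "a JSONL version of the records" means, here is a minimal Python sketch of writing and re-reading one JSON object per line. It is not the indexer's code (the indexer is a Java project); the `write_jsonl` and `read_jsonl` helpers and the sample `id` and `source_file` values are assumptions made up for this example.

```python
import io
import json


def write_jsonl(records, handle):
    """Write each record (a dict) as one JSON object per line."""
    for record in records:
        # Without an indent argument, json.dumps escapes embedded newlines,
        # so every record stays on a single physical line.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Yield one dict per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records: the field values are placeholders, not real output.
records = [
    {"id": "rec-1", "source_file": "example-1.warc.gz"},
    {"id": "rec-2", "source_file": "example-2.warc.gz"},
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = list(read_jsonl(buffer))
```

Because each line is a complete JSON document, a downstream loader (for Solr or Parquet) can stream the file line by line without parsing it as a whole.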

This will be added to the Hadoop version first, following a similar pattern to the hadoop3 branch where an explicit Memento bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:

  • Add JSONL output to Hadoop mode.
  • Add 'all in' default configuration for dataset extraction and use that for JSONL output.
  • Add the JSONL option to the non-Hadoop command-line version, using the 'all in' config.
  • Set the default main class for both CLI and Hadoop jars.
  • Make the Hadoop CLI better (e.g. no Solr and no-solr are contradictory), more consistent with the normal CLI.
  • Fill in source_file_path in local CLI mode (is source_file == source_file_path?). See #308, "Do we still need source_file_path?"
  • How to store the ssdeep hash?
  • Store ssdeep hash in JSONL in the usual long form: blockSize:Hash:HashOf2xBlockSize:filename
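
For anyone consuming that ssdeep field, the colon-separated long form quoted in the last item can be split back into its parts. The following Python sketch assumes exactly that layout; the `SsdeepHash` type and `parse_ssdeep` function are hypothetical names, and the sample value in the usage note is made up rather than a real hash.

```python
from typing import NamedTuple


class SsdeepHash(NamedTuple):
    block_size: int
    hash_at_block_size: str
    hash_at_double_block_size: str
    filename: str


def parse_ssdeep(value: str) -> SsdeepHash:
    """Split 'blockSize:Hash:HashOf2xBlockSize:filename' into its parts."""
    # Limit the split to three separators so that a filename containing
    # colons is kept intact as the final component.
    parts = value.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}")
    block_size, first_hash, second_hash, filename = parts
    return SsdeepHash(int(block_size), first_hash, second_hash, filename)
```

For example, `parse_ssdeep("3:AbC:dEf:example.warc.gz")` yields a block size of 3, the two hash strings, and the filename as separate fields.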

For further enhancements, see #307
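
The bean-driven pattern described before the task list, where one explicit record class drives both the JSON serialisation and a later Parquet schema, can be sketched roughly as follows. This is a Python analogy only, not the project's Java bean: `MementoRecord`, `to_json_line` and `column_names` are hypothetical names, and the four fields are an assumed subset chosen from names mentioned in this issue.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class MementoRecord:
    # Hypothetical subset of fields; the real bean covers the Solr schema.
    id: str
    source_file: str
    source_file_path: Optional[str] = None
    ssdeep: Optional[str] = None


def to_json_line(record: MementoRecord) -> str:
    # Drop unset fields so each line carries only populated values.
    populated = {k: v for k, v in asdict(record).items() if v is not None}
    return json.dumps(populated, sort_keys=True)


def column_names(record_type) -> list:
    # The same class definition can be walked to list its columns, which is
    # the starting point for deriving a columnar (e.g. Parquet) schema.
    return [f.name for f in fields(record_type)]
```

The point of the design is that the field list lives in one place: adding a field to the class changes both the JSONL output and the derived column list.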

@anjackson anjackson self-assigned this Sep 30, 2022
anjackson added a commit that referenced this issue Sep 30, 2022
anjackson added a commit that referenced this issue Sep 30, 2022
@anjackson
Contributor Author

Verified as running fine on old Hadoop

$ hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-3.2.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -i files.txt -o jsonl-test --jsonl --no-solr -s http://null

The CLI could do with cleaning up, maybe making it more consistent with the non-Hadoop version.

@anjackson
Contributor Author

I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.

@anjackson
Contributor Author

Added ssdeep in 3a7013e
