Ahead of #297 a first step that can run on older Hadoop is to just output a JSONL version of the records. This could then be sent to Solr or made into Parquet as needed.
This will be added to the Hadoop version first, following a similar pattern to the hadoop3 branch where an explicit Memento bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:
- Add JSONL output to Hadoop mode.
- Add an 'all in' default configuration for dataset extraction and use it for the JSONL output.
- Add the JSONL option to the non-Hadoop command-line version, using the 'all in' config.
- Set the default main class for both the CLI and Hadoop jars.
- Make the Hadoop CLI better (e.g. 'no Solr' and `no-solr` are contradictory) and more consistent with the normal CLI.
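As a rough sketch of the bean-driven JSONL output described above (the `Memento` bean comes from the hadoop3 branch, but the fields and the hand-rolled serialiser here are illustrative assumptions, not the project's actual schema; a real implementation would more likely use a JSON library such as Jackson):

```java
// Illustrative sketch: a minimal Memento-like bean whose fields mirror a
// subset of the Solr schema, written out as one JSON object per line (JSONL).
// Field names and the hand-rolled escaping are assumptions for illustration.
import java.util.List;

public class MementoJsonlSketch {

    // Hypothetical record bean covering a few Solr-style fields.
    static class Memento {
        String url;
        String timestamp;   // e.g. a 14-digit WARC-style timestamp
        String contentType;

        Memento(String url, String timestamp, String contentType) {
            this.url = url;
            this.timestamp = timestamp;
            this.contentType = contentType;
        }

        // Serialise this bean as a single JSON object (one JSONL line).
        String toJsonLine() {
            return String.format(
                "{\"url\":\"%s\",\"timestamp\":\"%s\",\"content_type\":\"%s\"}",
                escape(url), escape(timestamp), escape(contentType));
        }

        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    public static void main(String[] args) {
        List<Memento> records = List.of(
            new Memento("http://example.org/", "20200101120000", "text/html"),
            new Memento("http://example.org/img.png", "20200101120001", "image/png"));
        // Emit one JSON object per line -- the JSONL convention.
        for (Memento m : records) {
            System.out.println(m.toJsonLine());
        }
    }
}
```

The same bean could then also drive a Parquet schema definition, keeping the JSONL and Parquet outputs in sync.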
I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.
Other open questions:

- Fill in `source_file_path` in local CLI mode (is `source_file` == `source_file_path`???). See #308, "Do we still need source_file_path?".
- Add an `ssdeep` hash? The digest takes the form `blockSize:Hash:HashOf2xBlockSize:filename`.
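If an `ssdeep` hash were added, it could be carried as a single string in the record and split into its parts when needed. A minimal sketch of that split (the sample digest string below is invented for illustration; computing a real digest would need an ssdeep library binding):

```java
// Illustrative sketch: splitting an ssdeep-style digest string of the form
// blockSize:Hash:HashOf2xBlockSize:filename into its components.
// The sample digest in main() is invented, not a real hash.
public class SsdeepFieldSketch {

    static String[] splitDigest(String digest) {
        // Limit to 4 parts so any colons in a trailing filename stay intact.
        return digest.split(":", 4);
    }

    public static void main(String[] args) {
        String digest = "3072:abcDEFghi:JKLmnoPQR:example.warc.gz"; // invented example
        String[] parts = splitDigest(digest);
        System.out.println("blockSize = " + parts[0]);
        System.out.println("hash      = " + parts[1]);
        System.out.println("hash2x    = " + parts[2]);
        System.out.println("filename  = " + parts[3]);
    }
}
```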
For further enhancements, see #307