Refactor ingest/extract process #122
Dev Setup Instructions (updated 7/28/21)
For development on this ticket, it will be necessary to set up an NFS on your dev environment, similar to the shared NFS. To set up an NFS between two dev VMs, I used a method adapted from these instructions.
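Once the share is set up, it can help to verify that the NFS volume is actually mounted before running any ingest/extract code against it. A minimal sketch only (the mount point below is a made-up example, not the project's actual path):

```python
# Sanity check that the shared NFS volume is mounted on this dev VM.
# NFS_MOUNT is a hypothetical example path, not the project's real mount point.
import os
import sys

NFS_MOUNT = "/mnt/shared"

if not os.path.ismount(NFS_MOUNT):
    sys.exit(f"{NFS_MOUNT} is not mounted; set up the dev NFS share first")
print(f"{NFS_MOUNT} is mounted and ready")
```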
Other considerations:
- Spark DataFrame API (see the sketch below)
- Spark/pyspark version
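As a rough illustration of the DataFrame API option, a parametrized extract could look something like the following; the paths, column names, and filter values are hypothetical placeholders rather than the project's actual schema:

```python
# Minimal sketch of a parametrized extract using the Spark DataFrame API.
# Input/output paths and the "year" filter are invented placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("extract-job").getOrCreate()

# Read the ingested dataset from the shared volume.
df = spark.read.parquet("/mnt/shared/ingested/dataset")

# Apply user-supplied extract parameters as DataFrame filters.
extract = df.filter(F.col("year").between(1900, 1950))

# Write the extract back to the shared volume as compressed CSV.
extract.write.option("compression", "gzip").csv(
    "/mnt/shared/extracts/extract-1900-1950", header=True
)
```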
Regarding our use of the Elasticsearch Scroll API to retrieve the results for the user-generated extracts, please note that in more recent versions of Elasticsearch, the use of this API for deep pagination is discouraged.
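For reference, the pattern the newer Elasticsearch documentation recommends instead is search_after combined with a point in time (PIT). A minimal sketch, assuming an Elasticsearch 7.10+ cluster and the 7.x elasticsearch-py client; the index name and query are placeholders:

```python
# Deep pagination with search_after + point in time (PIT) instead of the Scroll API.
# Index name and query below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time so results stay consistent while paging.
pit_id = es.open_point_in_time(index="tokens", keep_alive="2m")["id"]

search_after = None
while True:
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        "pit": {"id": pit_id, "keep_alive": "2m"},
        "sort": [{"_shard_doc": "asc"}],  # tiebreaker sort required for paging
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = es.search(body=body)["hits"]["hits"]
    if not hits:
        break
    # ... process this page of hits ...
    search_after = hits[-1]["sort"]

es.close_point_in_time(body={"id": pit_id})
```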
See also #117 and #121.
Problem
Depending on the parameters, user-generated extracts from large datasets can take a long time (hours or days) and consume a lot of disk space (~1TB). We have implemented a check to prevent the creation of multiple versions of a full (non-parametrized) extract, but we have nothing in place to prevent users from creating multiple versions of the same parametrized extract.
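One possible way to close that gap is to fingerprint the canonical parameter set of each extract and refuse to start a job whose fingerprint already exists. This is a sketch only, not the project's implementation, and the parameter names are invented:

```python
# Detect duplicate parametrized extracts by hashing a canonical form of the parameters.
# Parameter names below are invented examples.
import hashlib
import json

def extract_fingerprint(params: dict) -> str:
    """Stable fingerprint for a set of extract parameters."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_duplicate(params: dict, existing: set) -> bool:
    """True if an extract with identical parameters was already created."""
    return extract_fingerprint(params) in existing

# Two requests that differ only in key order map to the same fingerprint.
a = {"corpus": "newspapers", "year_from": 1900, "year_to": 1950}
b = {"year_to": 1950, "corpus": "newspapers", "year_from": 1900}
assert extract_fingerprint(a) == extract_fingerprint(b)
```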
In addition, the extract jobs do not fail gracefully when interrupted by a critical error (such as a lack of disk space).
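For the disk-space case specifically, one option is to fail fast with an explicit check before (and periodically during) the job, rather than letting the write die partway through. A sketch under made-up assumptions about the output path and the required space:

```python
# Fail fast if the target volume does not have enough free space for the extract.
# The path and size estimate are made-up examples.
import shutil

REQUIRED_FREE_BYTES = 50 * 1024**3  # rough upper bound for the extract

def check_disk_space(path: str, required: int = REQUIRED_FREE_BYTES) -> None:
    """Raise a clear error before the job starts if the volume is too full."""
    free = shutil.disk_usage(path).free
    if free < required:
        raise RuntimeError(f"Only {free} bytes free on {path}; {required} required")

try:
    check_disk_space("/mnt/shared/extracts")
    # ... run the extract job ...
except (RuntimeError, OSError) as exc:
    # Mark the job as failed and clean up partial output instead of leaving
    # the extract in an inconsistent state.
    print(f"Extract aborted: {exc}")
```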
Questions
Proposal
Full extracts
mentions
datasets (currently disabled).
Custom extracts
Benefits
Workflow diagram
Further documentation (including testing) in this notebook.