
Refactor ingest/extract process #122

Open
dolsysmith opened this issue Jul 14, 2021 · 3 comments
dolsysmith commented Jul 14, 2021

See also #117 and #121.

Problem

Depending on the parameters, user-generated extracts from large datasets can take a long time (hours or days) and consume a lot of disk space (~1TB). We have implemented a check to prevent the creation of multiple versions of a full (non-parametrized) extract, but we have nothing in place to prevent users from creating multiple versions of the same parametrized extract.

In addition, the extract jobs do not fail gracefully when interrupted by a critical error (such as a lack of disk space).

Questions

  1. What use cases do these custom extracts satisfy?
  2. Would we still meet most use cases if the app placed a reasonable limit on the size of custom extracts, given that full extracts are also available?

Proposal

Full extracts
  • Created automatically at time of load (using a Spark job).
  • Include mentions datasets (currently disabled).
Custom extracts
  • Option 1: Users can generate a custom extract containing up to N Tweets.
  • Option 2: Users can generate CSV/JSON extracts containing up to N Tweets and can generate extracts of Tweet IDs of any size.
  • For the benefit of users wanting to filter the full extract themselves, we can provide documentation and a Jupyter notebook for doing this work locally with pyspark (see the sketch below).
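
For example, here is a minimal pyspark sketch of the kind of local filtering such a notebook could document. The paths and filter criteria are illustrative, and the column names assume the standard Tweet JSON schema; this is not actual TweetSets code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-full-extract").getOrCreate()

# Read the full (non-parametrized) extract produced at load time.
tweets = spark.read.json("/storage/dataset_loading/full_extract/*.json")

# Apply the same kinds of parameters a user might request in a custom extract.
filtered = tweets.filter(
    F.col("created_at").isNotNull()
    & F.col("full_text").contains("climate")
)

# Write the filtered subset out as JSON; select("id_str") plus .csv() would
# produce a Tweet ID extract instead.
filtered.write.json("/storage/dataset_loading/my_filtered_extract")
```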

Benefits

  • Save space on disk by eliminating very large user extracts.
  • Obviate the need to provide fault tolerance for long-running jobs. Custom user extracts would be completed within a reasonable timeframe and could use the current implementation (Celery task queue).
  • Save time on producing full extracts. Latency on retrieval from large Elasticsearch result sets is high; native Spark appears orders of magnitude more efficient when writing the full dataset -- but see the note below on the Spark DataFrame API. (On my local machine, for a 50 GB dataset, the longest extract -- full Tweet JSON -- took 8 minutes.)
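
As a rough sketch (not the actual implementation), the load job could write the full extracts straight from the DataFrame it already holds, before or alongside indexing into Elasticsearch via es-hadoop. The DataFrame name, paths, and index name here are assumptions:

```python
# Assumes `tweets` is the DataFrame the load job builds from the source files.

# Full Tweet JSON extract.
tweets.write.mode("overwrite").json("/storage/dataset_loading/extracts/full_json")

# Tweet ID extract (one ID per line).
tweets.select("id_str").write.mode("overwrite").csv("/storage/dataset_loading/extracts/ids")

# Then index into Elasticsearch with the es-hadoop connector as before.
tweets.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "elasticsearch:9200") \
    .mode("append") \
    .save("tweets")
```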

Workflow diagram

Further documentation (including testing) in this notebook.


dolsysmith commented Jul 22, 2021

Dev Setup Instructions

Updated 7/28/21

For development on this ticket, it will be necessary to set up an NFS share in your dev environment, similar to the shared /storage/dataset_loading directory on prod. A shared filesystem is necessary for persisting data to disk from a Spark cluster.

To set up an NFS share between two dev VMs, I used the following method (adapted from these instructions).

  1. Install the NFS server package on one VM (henceforth the NFS server):
    sudo apt update
    sudo apt install nfs-kernel-server
    
  2. Configure a directory to be exported. I'm using a directory on the /storage volume, which has adequate space for TweetSets testing.
    a. Create a directory to share: sudo mkdir /storage/dataset_loading
    b. Set yourself as the directory's owner: sudo chown dsmith:dsmith /storage/dataset_loading
    c. Edit the /etc/exports file: sudo nano /etc/exports
    d. Add a line at the bottom mapping a directory on the server to the client VM, using the client VM's IP address. (This should be the IP address internal to the WRLC network, not the external IP address. On my VMs, it's associated with the ens160 interface.) If your export options include anonuid and anongid, use the UID and GID associated with your user account on the server. (This will give the client the same privileges to the shared folder as those associated with your user account on the server.)
    /storage/dataset_loading         172.27.20.233(rw,sync,no_root_squash,no_subtree_check)
    
  3. Restart the NFS server: sudo systemctl restart nfs-kernel-server. (If you get an error from this command, you can view the service log by running sudo journalctl -u nfs-server.service).
  4. On the client VM, install the nfs-common package:
sudo apt update
sudo apt install nfs-common

  5. Create a mount point on the client (if it doesn't already exist) and mount the NFS directory (from the server) on it. In this case, I'm mounting the dataset_loading directory on the server VM to the same path on the client. (Mounting over an existing directory hides, rather than overwrites, any local data in it for as long as the mount is active.) Note that the IP address here is that of the server VM.
sudo mkdir /storage/dataset_loading
sudo mount 172.27.20.231:/storage/dataset_loading /storage/dataset_loading
  6. To persist this change on reboot, add the following line to the /etc/fstab file on the client VM:
172.27.20.231:/storage/dataset_loading      /storage/dataset_loading    nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0


dolsysmith commented Jul 22, 2021

Other considerations:

Spark DataFrame API

  • Moving from the Spark RDD API to the Spark DataFrame API (as documented in the notebook above) appears to provide significant performance benefits, both in loading datasets into Elasticsearch and in creating custom extracts. The improvement comes from leveraging Spark SQL for the transformations currently delegated to Python: Spark SQL expressions are executed natively inside the JVM-based Spark engine, so all ELT logic runs in Spark itself rather than round-tripping through Python workers (illustrated below).
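
To make the contrast concrete, here is an illustrative example (not TweetSets code; column names assume the standard Tweet JSON schema) of the same transformation expressed both ways:

```python
from pyspark.sql import functions as F

# RDD style: every record is deserialized into Python and run through a lambda,
# paying serialization overhead between the JVM and the Python workers.
hashtags_rdd = tweets.rdd.map(
    lambda t: [h["text"].lower() for h in (t["entities"]["hashtags"] or [])]
)

# DataFrame style: the same logic as a Spark SQL higher-order function
# (available in Spark >= 2.4), executed entirely inside the Spark engine.
hashtags_df = tweets.select(
    F.expr("transform(entities.hashtags, h -> lower(h.text))").alias("hashtags")
)
```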

Spark/pyspark version

  • The code above was written and tested with Spark/pyspark 2.4.7. The arrays_zip Spark function used in creating the mentions extracts requires version 2.4 (see the sketch after this list). TweetSets currently uses version 2.3.2; I believe we could upgrade without any breaking changes, but we would need to test.
  • Depending on how we choose to handle the creation of JSON extracts, we might benefit from upgrading to Spark v. 3, which changes the behavior when handling null values. It looks like version 3 is compatible with our ES version, but we would probably need to use a later version of the es-hadoop library as well.
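
For reference, a sketch of how arrays_zip could build the mentions extract, assuming the mention IDs and screen names arrive as parallel arrays; the input column names are hypothetical stand-ins for however those fields are named in the actual schema:

```python
from pyspark.sql import functions as F

# Pair element i of each parallel array, then emit one row per mention.
zipped = tweets.select(
    F.col("id_str").alias("tweet_id"),
    F.explode(
        F.arrays_zip("mention_user_ids", "mention_screen_names")
    ).alias("mention"),
)

# When top-level columns are zipped, the struct fields carry the column names.
mentions = zipped.select(
    "tweet_id",
    F.col("mention.mention_user_ids").alias("mentioned_user_id"),
    F.col("mention.mention_screen_names").alias("mentioned_screen_name"),
)
```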

@dolsysmith dolsysmith added this to the 2.2 milestone Jul 26, 2021
@dolsysmith dolsysmith changed the title Refactor extract process Refactor ingest/extract process Jul 26, 2021

dolsysmith commented Aug 2, 2021

Regarding our use of the Elasticsearch Scroll API to retrieve the results for the user-generated extracts, please note that in more recent versions of Elasticsearch, the use of this API for deep pagination is discouraged:

We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
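
For reference, a sketch of what the recommended search_after + PIT pattern looks like from Python. This assumes Elasticsearch >= 7.10 with a matching elasticsearch-py client; the index name and sort field are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time (PIT) so paging sees a consistent snapshot of the index.
pit_id = es.open_point_in_time(index="tweets", keep_alive="2m")["id"]

results, search_after = [], None
while True:
    body = {
        "size": 1000,
        "pit": {"id": pit_id, "keep_alive": "2m"},
        # Sort on a stable field, with _shard_doc as the tiebreaker.
        "sort": [{"created_at": "asc"}, {"_shard_doc": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = es.search(body=body)["hits"]["hits"]
    if not hits:
        break
    results.extend(hit["_source"] for hit in hits)  # e.g., write to the extract
    search_after = hits[-1]["sort"]

es.close_point_in_time(body={"id": pit_id})
```

Note that when searching with a PIT, the target index is encoded in the PIT ID, so no index is passed to the search call itself.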
