Currently on production:
MAX_PER_JSON_FILE is set to the default of 10,000,000
MAX_PER_CSV_FILE is set to the default of 250,000
Transfer bandwidth on the server side appears to be about 10 MB/s
gunicorn SERVER_TIMEOUT is set to 600 seconds
As a result, .jsonl.zip files are around 11 GB in size and take roughly 30-60 minutes to download, which far exceeds the timeout of 600 seconds (= 10 minutes); see the back-of-envelope estimate below.
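A quick back-of-envelope check, using the figures above (11 GB file size, ~10 MB/s bandwidth), shows that even the best case exceeds the timeout; observed downloads are slower still:

#!/bin/bash
# Lower bound on download time for an 11 GB .jsonl.zip at ~10 MB/s (figures from above)
size_mb=$((11 * 1024))      # 11 GB expressed in MB
bandwidth_mbps=10           # approximate server-side transfer rate in MB/s
seconds=$((size_mb / bandwidth_mbps))
echo "Minimum download time: ${seconds}s (~$((seconds / 60)) min) vs. the 600s timeout"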
Consistent with this, @dolsysmith notes some gunicorn errors in the log on production: [2021-08-13 10:54:40 -0400] [7] [CRITICAL] WORKER TIMEOUT (pid:64)
Suggested remediations:
Find out from WRLC whether bandwidth can be increased
Reduce MAX_PER_JSON_FILE (and strongly consider using the same value for this and MAX_PER_CSV_FILE, so that JSON and CSV files correspond to each other)
Adjust the timeout to ensure that JSON zip files (given the new MAX_PER_JSON_FILE value) can usually be downloaded without encountering a timeout error, and consider whether a higher timeout may have negative side effects (a sketch of these settings follows below).
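A minimal sketch of what the adjusted settings might look like, assuming they are supplied through the deployment's environment file (the file location and the 250,000 / 1,200 values below are illustrative placeholders, not recommendations):

# Hypothetical environment settings; variable names match those discussed above
MAX_PER_JSON_FILE=250000   # align with MAX_PER_CSV_FILE so JSON and CSV files correspond
MAX_PER_CSV_FILE=250000    # unchanged default
SERVER_TIMEOUT=1200        # gunicorn worker timeout in seconds; raise only after weighing side effects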
The Spark loader compresses the CSV files but does not archive them. For a large dataset (e.g., Coronavirus), archiving the gzipped CSV files can be done manually after loading via the following bash script (currently saved on production as /opt/TweetSets/chunk_csv.sh).
It yields zip files of roughly 1.2 GB, each of which, unzipped, contains 5 gzipped CSV files of at most 1 million rows each.
#!/bin/bash
# Zip csv.gz files in chunks of 5
# Command-line argument should be the directory containing the files to be zipped
ls "$1"/*.csv.gz > csvfiles
# Split the list into chunks of 5 file names: csvfiles00, csvfiles01, ...
split -d -l 5 - csvfiles < csvfiles
counter=1
for i in csvfiles[0-9][0-9]; do
    #cat "$i" # For testing
    # Get the parent directory of the target directory
    parentdir="$(dirname "$1")"
    # -j ("junk paths") stores files without their directory paths; -@ reads the file list from stdin
    zip -j "$parentdir/tweets-$((counter++)).csv.zip" -@ < "$i"
done
# Clean up the temporary list files
rm csvfiles
rm csvfiles[0-9][0-9]
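Example invocation (the dataset path below is hypothetical; point it at the directory holding the loader's csv.gz output):

bash /opt/TweetSets/chunk_csv.sh /path/to/dataset/csv

Because of the dirname call, the resulting tweets-N.csv.zip files are written to the parent of the supplied directory.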