Large file downloads are timing out #139

Open
kerchner opened this issue Aug 13, 2021 · 2 comments

@kerchner
Member

Currently on production:

  • MAX_PER_JSON_FILE is set to the default of 10,000,000
  • MAX_PER_CSV_FILE is set to the default of 250,000
  • Transfer bandwidth on the server side appears to be about 10 MBps
  • gunicorn SERVER_TIMEOUT is set to 600 seconds

As a result, .jsonl.zip files are around 11 GB in size and would take roughly 30-60 minutes to download, far exceeding the timeout of 600 seconds (= 10 minutes).
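
As a back-of-envelope check (a sketch only, using the figures quoted above; the 30-60 minute estimate presumably assumes effective throughput well below the nominal 10 MBps):

# Rough download-time check using the figures from this issue.
# Effective client throughput is often lower than the server-side bandwidth,
# so a couple of pessimistic values are included alongside the nominal 10 MB/s.
FILE_SIZE_GB = 11
TIMEOUT_S = 600

for throughput_mb_s in (10, 5, 3):
    download_s = FILE_SIZE_GB * 1000 / throughput_mb_s
    print(f"{throughput_mb_s} MB/s -> {download_s / 60:.0f} min "
          f"(timeout: {TIMEOUT_S / 60:.0f} min)")

Even at the full nominal 10 MB/s, an 11 GB archive needs roughly 18 minutes, so every download of this size hits the 600-second limit.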

Consistent with this, @dolsysmith notes some gunicorn errors in the log on production:

[2021-08-13 10:54:40 -0400] [7] [CRITICAL] WORKER TIMEOUT (pid:64)

Suggested remediations:

  • Find out from WRLC whether bandwidth can be increased
  • Reduce MAX_PER_JSON_FILE (and strongly consider using the same value for this and MAX_PER_CSV_FILE, so that corresponding JSON and CSV files contain the same tweets)
  • Adjust the timeout so that JSON zip files (given the new MAX_PER_JSON_FILE value) can usually be downloaded without hitting a timeout error; see the sketch after this list. Consider whether a higher timeout may have negative side effects.
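
Assuming SERVER_TIMEOUT ultimately feeds gunicorn's timeout setting (an assumption; check the TweetSets gunicorn/docker configuration), a minimal sketch of how a new value could be sized, written as a gunicorn.conf.py fragment with hypothetical sizing constants:

# gunicorn.conf.py fragment -- a sketch only; the real deployment sets
# SERVER_TIMEOUT via the environment rather than this file.

# Hypothetical sizing assumptions (not from this issue): archives capped at
# ~2 GB after reducing MAX_PER_JSON_FILE, and a conservative 3 MB/s of
# effective client throughput.
MAX_ARCHIVE_MB = 2000
MIN_THROUGHPUT_MB_S = 3

# Let the slowest expected download finish, with ~50% headroom.
timeout = int(MAX_ARCHIVE_MB / MIN_THROUGHPUT_MB_S * 1.5)  # ~1000 seconds

One known downside of simply raising the timeout is that, with gunicorn's default synchronous workers, a slow or stalled download ties up a worker for the entire window, reducing the workers available for other requests.
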
@dolsysmith added the bug label on Aug 13, 2021
@dolsysmith
Contributor

For full extracts, we can control the number of rows per file with the maxRecordsPerFile Spark parameter.
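
For illustration only (this is not the actual TweetSets loader code; the paths, column names, and the 250,000 figure are placeholders), the cap is applied as a write option on the DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweetsets-extract").getOrCreate()

# Placeholder input path and columns; the real loader flattens many more fields.
tweets = spark.read.json("hdfs:///tweetsets/coronavirus/*.jsonl")
flat = tweets.select("id_str", "created_at", "full_text")

# Cap each output part file at 250,000 records so that no single CSV
# (and hence no single download) grows without bound.
(flat.write
    .option("maxRecordsPerFile", 250000)
    .option("compression", "gzip")
    .csv("hdfs:///tweetsets/extracts/coronavirus-csv", header=True))

The same cap can also be set cluster-wide via the spark.sql.files.maxRecordsPerFile configuration rather than per write.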

@dolsysmith
Contributor

The Spark loader compresses the CSV files but does not archive them. For a large dataset (e.g., Coronavirus), archiving the gzipped CSV files can be done manually after loading via the following bash script (currently saved on production in /opt/TweetSets/chunk_csv.sh).

It yields zipped files of roughly 1.2 GB, each of which, when unzipped, contains up to 5 gzipped CSV files of at most 1M rows each.

#!/bin/bash
# Zip the .csv.gz files in a directory in chunks of 5.
# The single command-line argument is the directory containing the files to be zipped;
# the resulting tweets-N.csv.zip archives are written to that directory's parent.

# Build a list of the gzipped CSVs and split it into chunks of 5 file names
# (csvfiles00, csvfiles01, ...).
ls "$1"/*.csv.gz > csvfiles
split -d -l 5 - csvfiles < csvfiles

# Archives are written to the parent of the data directory.
parentdir="$(dirname "$1")"

counter=1
for i in csvfiles[0-9][0-9]; do
  # -j ("junk paths") stores only the file names; -@ reads the list of files from stdin.
  zip -j "$parentdir/tweets-$((counter++)).csv.zip" -@ < "$i"
done

# Clean up the temporary file lists.
rm csvfiles
rm csvfiles[0-9][0-9]
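
Note that zip does not remove its inputs, so the original .csv.gz files remain in place alongside the new tweets-N.csv.zip archives.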
