Deploy Sources BODS Combiner #265
Comments
A new EBS volume has been attached to EC2 server oo-prd0-register/bods-register at

Script

    #!/usr/bin/env bash
    set -Eeuo pipefail
    ds=(
      bods_v2
    )
    for d in "${ds[@]}"; do
      aws s3 sync --delete s3://oo-register-v2/"$d"/ /mnt/data/clones/oo-register-v2/"$d"/
    done

More directories can be synced if useful (e.g. to help in an investigation), but only

Script

    #!/usr/bin/env bash
    set -Eeuo pipefail
    aws s3 sync /mnt/data/exports/prd/ s3://oo-register-v2/exports/
    aws s3 sync /mnt/data/exports/prd/all/ s3://public-bods/exports/

S3 credentials have been configured so

Sources BODS has been configured to use that:

This exposes that data within the container:

    docker compose run sources-bods bash
    du -d1 -h data/

With this, the Combiner can be run for a single datasource:

    combine data/imports/source=PSC/ data/exports/prd/ psc

And the Combiner can be run after all datasources, to update the combined snapshot:

    combine-all data/exports/prd/

Keep an eye on the disk space for the EBS volume, since things will probably break if it runs out of space:

    df -h /mnt/data
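The manual `df -h /mnt/data` check could also be scripted so that a sync or Combiner run refuses to start when the volume is nearly full. The following is only a sketch: the function names `usage_percent` and `check_disk_space` and the default 90% threshold are assumptions for illustration, not part of Sources BODS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Sources BODS) for guarding against a full volume.

# Print the used-space percentage (0-100) of the filesystem holding the given path.
usage_percent() {
  # -P gives the POSIX output format: one line per filesystem, capacity in column 5.
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Return non-zero and warn on stderr if usage is at or above the threshold.
# The 90% default is an arbitrary assumption.
check_disk_space() {
  local path="$1" threshold="${2:-90}" used
  used="$(usage_percent "$path")"
  if [ -z "$used" ]; then
    echo "ERROR: could not read disk usage for $path" >&2
    return 2
  fi
  if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: $path is ${used}% full (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "$path is ${used}% full"
}
```

For example, running `check_disk_space /mnt/data 85` at the top of a sync script would stop the script early (under `set -e`) once the volume reaches 85% usage.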
Kinesis Firehose has been reconfigured to go back to the larger, non-recommended buffer max sizes and flush intervals. In fact, it has now been increased to 128MB/900s, up from the previous 64MB/900s (and from the 5MB/300s it has been running with for the last few weeks). That's because the Combiner is more efficient when working with large files, and the PSC streamers which recently went live result in a lot of small files, each containing just a few records. Whilst the Combiner works fine with these, processing time increases by multiples, because of how the indexes are built and deduplicated. So on balance, it's probably better to have longer intervals and larger sizes like before, at the cost of additional latency (which isn't important to us in this case).

Next month's bulk data import/export should use the Combiner on the EC2 server, rather than locally.
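To put the flush intervals mentioned above in perspective, here is a rough estimate of how many files a single low-volume stream could produce per day. It assumes the buffer is always flushed by the time limit rather than the size limit, and that each flush yields one file; the real number depends on traffic and on how Firehose delivers objects. The function name `max_flushes_per_day` is made up for this illustration.

```shell
#!/usr/bin/env bash
# Rough upper bound on interval-driven flushes per day for one stream,
# assuming every flush is triggered by the interval and writes one file.
max_flushes_per_day() {
  local interval_seconds="$1"
  echo $(( 86400 / interval_seconds ))
}

max_flushes_per_day 300   # 300s interval: prints 288
max_flushes_per_day 900   # 900s interval: prints 96
```

Under these assumptions, moving from 300s to 900s cuts the number of small files to a third, which is the effect the Combiner benefits from.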
Since the rewrite of Register Files Combiner (#213), it has been possible to run the combination process locally, rather than requiring AWS services directly. However, for this to run with acceptable performance, large directories from a couple of our S3 buckets need to be synced to local disk. This all works perfectly; however, it takes rather a lot of space (~300G at present), and bulk data export uploads take a while (~12G/month).
It would be convenient to deploy Sources BODS Combiner somewhere, in order to be able to download and upload files far more quickly, as well as to minimise the chances of accidental changes to files locally. The existing oo-prd0-register EC2 server would be sufficient for this, but would require an additional EBS volume to be attached.
This would take little effort and incur a moderate additional cost (around 33 USD/month, depending on configuration), but would save time and decrease risk in the monthly bulk data process.