Some customers may want to use Glow to work with genomic data.
This example shows how to package the Glow dependencies and run a job on EMR Serverless against 1000 Genomes data.
First, we'll need to build a virtualenv
that we can upload to S3. Similar to our base example, we'll use a Dockerfile to do this.
We'll also create an uber-jar directly from the Glow project.
- In this directory, run the following command. It builds the Docker image and exports `pyspark_glow.tar.gz` and `glow-spark3-assembly-1.1.2-SNAPSHOT.jar` to your local filesystem.
```bash
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build --output ./dependencies .
```
- Now copy those files to S3.
```bash
export S3_BUCKET=<YOUR_S3_BUCKET_NAME>
aws s3 sync ./dependencies/ s3://${S3_BUCKET}/artifacts/pyspark/glow/
```
- And copy this demo code as well.

```bash
aws s3 cp glow_demo.py s3://${S3_BUCKET}/code/pyspark/
```
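For reference, here is a minimal sketch of what an entrypoint script like this could look like. It assumes the Glow Python package is available from the packed virtualenv and points at a placeholder bgzipped VCF path; the actual `glow_demo.py` in this example may differ.

```python
# Illustrative sketch only; the real glow_demo.py may differ.
import glow
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glow-demo").getOrCreate()
spark = glow.register(spark)  # register Glow's data sources and SQL functions

# Placeholder path: a bgzipped VCF staged in your own bucket. The BGZF codec set
# via spark.hadoop.io.compression.codecs at submit time lets Spark read it.
vcf_path = "s3://<YOUR_S3_BUCKET_NAME>/data/1kg_sample.vcf.gz"

df = spark.read.format("vcf").load(vcf_path)
df.select("contigName", "start", "referenceAllele", "alternateAlleles").show(10)
```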
Make sure to follow the EMR Serverless getting started guide to spin up your Spark application.
- Define your EMR Serverless application ID and job role, then submit the job. Unfortunately, while in preview, EMR Serverless does not have outbound internet access, so we can't use the `--packages` option. Instead, we provide `--archives` for our PySpark virtualenv and `--jars` pointing to our Glow uber-jar on S3.
```bash
export APPLICATION_ID=<EMR_SERVERLESS_APPLICATION_ID>
export JOB_ROLE_ARN=<EMR_SERVERLESS_IAM_ROLE>

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/glow_demo.py",
            "entryPointArguments": ["pyspark_glow.tar.gz"],
            "sparkSubmitParameters": "--conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec --conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=3 --conf spark.executor.memory=4g --conf spark.executor.instances=20 --archives=s3://'${S3_BUCKET}'/artifacts/pyspark/glow/pyspark_glow.tar.gz --jars s3://'${S3_BUCKET}'/artifacts/pyspark/glow/glow-spark3-assembly-1.1.2-SNAPSHOT.jar"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'
```
- Set your job run ID and get the status of the job.
```bash
export JOB_RUN_ID=001234567890

aws emr-serverless get-job-run \
    --application-id $APPLICATION_ID \
    --job-run-id $JOB_RUN_ID
```
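If you'd rather poll from a script than re-run the CLI by hand, here's a small sketch using boto3 (assuming `APPLICATION_ID` and `JOB_RUN_ID` are exported as above) that waits until the job reaches a terminal state.

```python
# Sketch: poll an EMR Serverless job run until it finishes.
import os
import time

import boto3

client = boto3.client("emr-serverless")
application_id = os.environ["APPLICATION_ID"]
job_run_id = os.environ["JOB_RUN_ID"]

while True:
    job_run = client.get_job_run(applicationId=application_id, jobRunId=job_run_id)["jobRun"]
    state = job_run["state"]
    print(f"Job {job_run_id} is {state}")
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)
```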
- View the logs while the job is running.
```bash
aws s3 cp s3://${S3_BUCKET}/logs/applications/${APPLICATION_ID}/jobs/${JOB_RUN_ID}/SPARK_DRIVER/stdout.gz - | gunzip
```
One other alternative I came up with while researching this is to use a temporary image to fetch your dependencies. It uses an EMR on EKS image, so you'll need to authenticate before building.
```bash
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build -f Dockerfile.jars . --output jars
```
This will export ~100 jars to your local filesystem, which you can upload to S3 and specify via the `--jars` option as a comma-delimited list.
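As a sketch of that last step: assuming you sync the exported `jars/` directory to a prefix in your bucket, a few lines of Python can build the comma-delimited `--jars` value from the local files (the S3 prefix below is an assumption, not part of this example).

```python
# Sketch: turn the exported jars/ directory into a comma-delimited --jars value.
from pathlib import Path

# Assumption: you synced ./jars to this prefix, e.g. with
#   aws s3 sync ./jars s3://${S3_BUCKET}/artifacts/jars/
s3_prefix = "s3://<YOUR_S3_BUCKET_NAME>/artifacts/jars"

jar_uris = sorted(f"{s3_prefix}/{jar.name}" for jar in Path("jars").glob("*.jar"))
print(",".join(jar_uris))
```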