Can't achieve a significant performance/$ reduction on a benchmark. #11992
-
Hello, thanks for reaching out. I have not looked in detail at the data generation or the query you are running yet, but here are a few general comments about the setup:
These nodes are generally not the most cost-efficient or performant. We typically use smaller nodes like the g4dn.2xlarge, though the right choice depends on the use case; one thing to keep in mind is that the local disks get faster on the larger models.
For your test cases: the r6id.24xlarge (on demand in us-west (Oregon)) is $7.2576/hr for the CPU run. For the GPU run, one thing to note is that the CPU node r6id.24xlarge has 96 cores compared to the 48 on the g4dn.12xlarge, so you will have more parallelism there.

A few configs I would recommend changing:

spark.rapids.sql.concurrentGpuTasks=1 -> Generally we find 2 concurrent tasks is best on the g4dn type nodes; if using a g5 or g6 instance, set it to 3.

Also, how much data is each task reading from Parquet? Generally we find the GPU performs better with more data, so we recommend increasing the input size via spark.sql.files.maxPartitionBytes. If you look at the task input size on your runs, we aim for 128-256 MB per task; setting spark.sql.files.maxPartitionBytes to between 512m and 2g is where we recommend starting. It's interesting that you got 200 tasks on stage 1 (the Parquet read); generally this ends up being an odd number of tasks based on how much data each task reads and the default spark.sql.files.maxPartitionBytes=128m.

A lot of our recent benchmarks on the EC2 nodes compared the g5 or g6 nodes to the r6id ones. We run NDS (similar to TPC-DS), generally with larger data at scale factor 3K. So let's say you use 4 r6id.8xlarge ($2.4192 x 4 = $9.6768/hr) compared to 4 g6.8xlarge (4 x $2.0144 = $8.0576/hr). I realize that isn't half the cost if everything runs in the same time, but I would suggest starting with these, as I would expect the GPU run to be faster and thus less expensive in the end. After getting that as a baseline, if everything looks good performance-wise, I would adjust the node count or type, say 3 g6.8xlarge instead of 4.

Note that on the g5/g6.8xlarge nodes we run with spark.executor.cores=16 and spark.executor.memory=64G. Also, if you change to the g5 or g6, I would recommend setting spark.rapids.sql.multiThreadedRead.numThreads=60 as well; this will help with reading from S3.
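For reference, here is a minimal sketch of applying those suggested settings when building the session for a g5/g6.8xlarge run. It assumes the RAPIDS Accelerator jar is already on the classpath; the config values are the ones recommended above, everything else (app name, etc.) is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Sketch applying the settings recommended above for g5/g6.8xlarge executors.
// Assumes the RAPIDS Accelerator jar is already on the classpath; tune values for your job.
val spark = SparkSession.builder()
  .appName("rapids-etl-benchmark")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")           // enable the RAPIDS plugin
  .config("spark.executor.cores", "16")                            // g5/g6.8xlarge recommendation
  .config("spark.executor.memory", "64g")
  .config("spark.rapids.sql.concurrentGpuTasks", "3")              // 2 on g4dn, 3 on g5/g6
  .config("spark.sql.files.maxPartitionBytes", "512m")             // start between 512m and 2g
  .config("spark.rapids.sql.multiThreadedRead.numThreads", "60")   // helps S3 reads on g5/g6
  .getOrCreate()
```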
-
By the way, the 50% cost reduction target was an idea inspired by its mention in a few presentations, in particular here:
-
@tgravescs's reply covers most of the important configs. I understand you want to compare a 2x-cost CPU cluster against a 1x-cost GPU cluster; I assume that as long as the GPU run's performance equals the CPU run's performance in that situation, it meets your goal, right?

The r6id.24xlarge has 3x better local disk IO performance than the g4dn.12xlarge, and I believe that could be a key factor for this specific job: if the job is IO-heavy or spilling-heavy, that advantage will be more obvious. So the perf/$ ratio may depend heavily on the job's characteristics. If you can share the CPU and GPU run Spark event logs with us by sending them to spark-rapids-support [email protected], we can help take a look at what the most time-consuming portion of this job is.
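If event logging is not already turned on, here is a minimal sketch of enabling it for both runs so there are logs to share. spark.eventLog.* are standard Spark settings; the S3 path and app name are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable event logging for both the CPU and GPU runs so the logs can be shared.
// "s3a://my-bucket/spark-event-logs/" is a placeholder path.
val spark = SparkSession.builder()
  .appName("benchmark-with-event-logs")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://my-bucket/spark-event-logs/")
  .getOrCreate()
```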
-
Hello,
I wanted to ask if anybody can see any issue with a benchmark we did. Below is a description of the hardware used and the setup steps.
Goals
Achieve a 50% compute cost reduction for a basic ETL pipeline.
Means
Utilize AWS low-tier GPU instances (g4/g6) and compare them with modern CPU instances.
Benchmark setup
In our benchmark we compared a GPU g4dn.12xlarge against a CPU r6id.24xlarge, since the r6 is roughly twice as expensive as the chosen g4: $1,799.45 vs $3,329.68 monthly (about 1.85x) for a reserved Linux instance in US East.
Instances
1 master instance machine
1 worker instance machine
master:
m5dn.xlarge
4 cores / 16 GB RAM / 150 GB NVMe / 25 Gbps
worker for CPU runs:
r6id.24xlarge
96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps
worker for GPU runs:
g4dn.12xlarge
48 cores / 192 GB RAM / 4x T4 GPU / 900 GB NVMe / 50 Gbps
Instance setup
Data
Data Generator
Data used for benchmarks
Parameters used for data generation for benchmarks
topGroupsToGenerate = 300000
outputPartitions = 150
150 parquet files stored on S3
24 GB total (gzip-compressed)
40 million rows
Schema
18 string columns
top_level_group / second_level_group / third_level_group: non-null, non-unique, 30-50 random chars each
3 cols: non-null, unique, 30 random chars each
10 cols: unique, 50 random chars each, 85% nulls
2 cols: unique, 99.999% of values up to 2000 chars, 0.0001% of values up to 500000 chars, 85% nulls
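To make the data shape concrete, here is a hypothetical sketch of a generator producing rows roughly matching the schema above. Only a few representative columns are shown, and the column names, output path, and helper function are illustrative, not the actual data generator used.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical sketch of a generator for rows roughly matching the schema above.
// Only a few representative columns are shown; names, paths, and helpers are illustrative.
object BenchmarkDataGen {
  private def randomString(rng: Random, len: Int): String =
    rng.alphanumeric.take(len).mkString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("benchmark-data-gen").getOrCreate()
    import spark.implicits._

    val topGroupsToGenerate = 300000   // parameter from the benchmark description
    val outputPartitions    = 150      // parameter from the benchmark description

    val df = spark.range(0, 40000000L).as[Long]   // ~40 million rows
      .mapPartitions { ids =>
        val rng = new Random()
        ids.map { id =>
          val group = (id % topGroupsToGenerate).toString
          (
            "top_"    + group + "_" + randomString(rng, 30),              // top_level_group: non-null, non-unique
            "second_" + group + "_" + randomString(rng, 30),              // second_level_group
            id.toString + "_" + randomString(rng, 30),                    // unique, non-null column
            if (rng.nextDouble() < 0.85) null else randomString(rng, 50)  // sparse column, ~85% nulls
          )
        }
      }
      .toDF("top_level_group", "second_level_group", "unique_col", "sparse_col")

    df.repartition(outputPartitions)
      .write
      .option("compression", "gzip")
      .parquet("s3a://my-bucket/benchmark-data/")   // placeholder output path
  }
}
```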
Transformation used for benchmarks
Benchmark Runs
CPU run: 4 workers with 24 CPU cores each
GPU run: 4 workers with 1 GPU and 12 CPU cores each
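For the GPU run, here is a minimal sketch of the executor/GPU resource settings that would match 4 executors, each with 1 GPU and 12 cores, on the g4dn.12xlarge. The task GPU fraction, the discovery-script path, and memory sizing are assumptions, and how GPUs are exposed to Spark depends on the cluster manager in use.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of executor/GPU resource settings matching "4 workers, 1 GPU x 12 CPU cores".
// The task GPU fraction and discovery-script path are assumptions, not from this thread.
val spark = SparkSession.builder()
  .appName("gpu-benchmark-run")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")      // enable the RAPIDS plugin
  .config("spark.executor.instances", "4")                    // 4 executors on the g4dn.12xlarge
  .config("spark.executor.cores", "12")                       // 12 CPU cores per executor
  .config("spark.executor.resource.gpu.amount", "1")          // 1 GPU per executor
  .config("spark.task.resource.gpu.amount", "0.0833")         // assumption: ~1/12, so 12 tasks share a GPU
  .config("spark.executor.resource.gpu.discoveryScript",
          "/opt/sparkRapidsPlugin/getGpusResources.sh")       // assumed script location
  .getOrCreate()
```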