Can't achieve a significant performance/$ reduction on a benchmark. #11992
-
Hello, thanks for reaching out. I have not looked in detail at the data generation or the query you are running yet, but here are a few general comments about the setup:
These nodes are generally not the most cost-efficient or performant. We typically use smaller nodes like the g4dn.2xlarge, though the right choice depends on the use case; one thing to keep in mind is that the local disks get faster on the larger models.
For your test cases: the r6id.24xlarge (on demand in us-west (Oregon)) is $7.2576/hr for the CPU run. For the GPU run, one thing to note is that the CPU node r6id.24xlarge has 96 cores compared to the 48 on the g4dn.12xlarge, so you will have more parallelism there.

A few configs I would recommend changing:

spark.rapids.sql.concurrentGpuTasks=1 -> Generally we find 2 concurrent tasks is best on the g4dn type nodes; if using a g5 or g6 instance, set it to 3.

Also, how much data is each task reading from Parquet? Generally we find the GPU performs better with more data, so we recommend increasing the input size via spark.sql.files.maxPartitionBytes. If you look at the task input size on your runs, we aim for 128-256 MB per task; setting spark.sql.files.maxPartitionBytes to between 512m and 2g is where we recommend starting. It's interesting that you got 200 tasks on stage 1 (the Parquet read); generally this ends up being an odd number of tasks based on how much data each task reads and the default spark.sql.files.maxPartitionBytes=128m.

A lot of our recent benchmarks on the EC2 nodes compared the g5 or g6 nodes to the r6id ones. We run NDS (similar to TPC-DS), generally with larger data at scale factor 3K. So let's say you use 4 r6id.8xlarge ($2.4192 x 4 = $9.6768/hr) compared to 4 g6.8xlarge (4 x $2.0144 = $8.0576/hr). I realize that isn't half the cost if everything runs in the same time, but I would suggest starting with these, as I would expect the GPU run to be faster and thus less expensive in the end. After getting that as a baseline, if everything looks good performance-wise, I would adjust the node count or type, say 3 g6.8xlarge instead of 4.

Note that on the g5/g6.8xlarge nodes we run with spark.executor.cores=16 and spark.executor.memory=64G. Also, if you change to the g5 or g6, I would recommend setting spark.rapids.sql.multiThreadedRead.numThreads=60 as well; this will help with reading from S3.
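For reference, here is a minimal sketch of applying those suggested settings when building the session for a g5/g6.8xlarge run. It assumes the RAPIDS Accelerator jar is already on the classpath; the config values are the ones recommended above, everything else (app name, etc.) is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Sketch applying the settings recommended above for g5/g6.8xlarge executors.
// Assumes the RAPIDS Accelerator jar is already on the classpath; tune values for your job.
val spark = SparkSession.builder()
  .appName("rapids-etl-benchmark")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")           // enable the RAPIDS plugin
  .config("spark.executor.cores", "16")                            // g5/g6.8xlarge recommendation
  .config("spark.executor.memory", "64g")
  .config("spark.rapids.sql.concurrentGpuTasks", "3")              // 2 on g4dn, 3 on g5/g6
  .config("spark.sql.files.maxPartitionBytes", "512m")             // start between 512m and 2g
  .config("spark.rapids.sql.multiThreadedRead.numThreads", "60")   // helps S3 reads on g5/g6
  .getOrCreate()
```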
-
By the way, the 50% cost reduction target was an idea inspired by its mention in a few presentations, in particular here:
-
@tgravescs's reply covers most of the important configs. I understand you want to compare a 2x-cost CPU cluster against a 1x-cost GPU cluster; I assume that as long as the GPU run's performance equals the CPU run's performance in that situation, it meets your goal, right?

The r6id.24xlarge has 3x better local disk IO performance than the g4dn.12xlarge, and I believe that could be a key factor for this specific job: if the job is IO-heavy or spilling-heavy, that advantage will be more obvious. So the perf/$ ratio may depend heavily on the job's characteristics. If you can share the CPU and GPU run Spark event logs with us by sending them to spark-rapids-support [email protected], we can help take a look at what the most time-consuming portion of this job is.
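If event logging is not already turned on, here is a minimal sketch of enabling it for both runs so there are logs to share. spark.eventLog.* are standard Spark settings; the S3 path and app name are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable event logging for both the CPU and GPU runs so the logs can be shared.
// "s3a://my-bucket/spark-event-logs/" is a placeholder path.
val spark = SparkSession.builder()
  .appName("benchmark-with-event-logs")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://my-bucket/spark-event-logs/")
  .getOrCreate()
```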
-
Hello,
I wanted to ask if anybody can see any issue with a benchmark we did. Below is a description of the hardware used and the setup steps.
Goals
Achieve a 50% compute cost reduction for a basic ETL pipeline.
Means
Utilize AWS low-tier GPU instances (g4/g6) and compare them with modern CPU instances.
Benchmark setup
In our benchmark we compared a GPU g4dn.12xlarge against a CPU r6id.24xlarge, since the r6 is roughly twice as expensive as the chosen g4: $1,799.45 vs $3,329.68 monthly (about 1.85x) for a reserved Linux instance in US East.
Instances
1 master instance machine
1 worker instance machine
master:
m5dn.xlarge
4 cores / 16 GB RAM / 150 GB NVMe / 25 Gbps
worker for CPU runs:
r6id.24xlarge
96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps
worker for GPU runs:
g4dn.12xlarge
48 cores / 192 GB RAM / 4x T4 GPU / 900 GB NVMe / 50 Gbps
Instance setup
Data
Data Generator
Data used for benchmarks
Parameters used for data generation for benchmarks
topGroupsToGenerate = 300000
outputPartitions = 150
150 parquet files stored on S3
24 GB total (gzip-compressed)
40 million rows
Schema
18 string columns
top_level_group / second_level_group / third_level_group: non-null, non-unique, 30-50 random chars each
3 cols: non-null, unique, 30 random chars each
10 cols: unique, 50 random chars each, 85% nulls
2 cols: unique, 99.999% of values up to 2000 chars, 0.0001% of values up to 500000 chars, 85% nulls
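To make the data shape concrete, here is a hypothetical sketch of a generator producing rows roughly matching the schema above. Only a few representative columns are shown, and the column names, output path, and helper function are illustrative, not the actual data generator used.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical sketch of a generator for rows roughly matching the schema above.
// Only a few representative columns are shown; names, paths, and helpers are illustrative.
object BenchmarkDataGen {
  private def randomString(rng: Random, len: Int): String =
    rng.alphanumeric.take(len).mkString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("benchmark-data-gen").getOrCreate()
    import spark.implicits._

    val topGroupsToGenerate = 300000   // parameter from the benchmark description
    val outputPartitions    = 150      // parameter from the benchmark description

    val df = spark.range(0, 40000000L).as[Long]   // ~40 million rows
      .mapPartitions { ids =>
        val rng = new Random()
        ids.map { id =>
          val group = (id % topGroupsToGenerate).toString
          (
            "top_"    + group + "_" + randomString(rng, 30),              // top_level_group: non-null, non-unique
            "second_" + group + "_" + randomString(rng, 30),              // second_level_group
            id.toString + "_" + randomString(rng, 30),                    // unique, non-null column
            if (rng.nextDouble() < 0.85) null else randomString(rng, 50)  // sparse column, ~85% nulls
          )
        }
      }
      .toDF("top_level_group", "second_level_group", "unique_col", "sparse_col")

    df.repartition(outputPartitions)
      .write
      .option("compression", "gzip")
      .parquet("s3a://my-bucket/benchmark-data/")   // placeholder output path
  }
}
```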
Transformation used for benchmarks
Benchmark Runs
CPU run: 4 workers with 24 CPU cores each
GPU run: 4 workers with 1 GPU and 12 CPU cores each
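For the GPU run, here is a minimal sketch of the executor/GPU resource settings that would match 4 executors, each with 1 GPU and 12 cores, on the g4dn.12xlarge. The task GPU fraction, the discovery-script path, and memory sizing are assumptions, and how GPUs are exposed to Spark depends on the cluster manager in use.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of executor/GPU resource settings matching "4 workers, 1 GPU x 12 CPU cores".
// The task GPU fraction and discovery-script path are assumptions, not from this thread.
val spark = SparkSession.builder()
  .appName("gpu-benchmark-run")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")      // enable the RAPIDS plugin
  .config("spark.executor.instances", "4")                    // 4 executors on the g4dn.12xlarge
  .config("spark.executor.cores", "12")                       // 12 CPU cores per executor
  .config("spark.executor.resource.gpu.amount", "1")          // 1 GPU per executor
  .config("spark.task.resource.gpu.amount", "0.0833")         // assumption: ~1/12, so 12 tasks share a GPU
  .config("spark.executor.resource.gpu.discoveryScript",
          "/opt/sparkRapidsPlugin/getGpusResources.sh")       // assumed script location
  .getOrCreate()
```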