
[PROPOSAL] Introducing Red line testing using opensearch-benchmark and dynamic client ramp #731

rishabh6788 opened this issue Jan 16, 2025

Red line testing with opensearch-benchmark

This issue describes the client ramp-up feature that was recently added to opensearch-benchmark.
The feature is useful when you need to bring up search/index clients gradually to understand how the cluster behaves under increasing stress and at what point it goes under duress. It can help determine peak-load behavior for:

  • Vertical or horizontal scaling of a cluster.
  • Evaluating redline performance of new features, such as reader-writer separation.
  • Game-day testing to evaluate scaling needs.

How does it work?

opensearch-benchmark works in two different modes:

  • iteration-mode: Each task defines the number of warm-up and test iterations; the task is run for that many iterations and the results are calculated from them.
  • timed-mode: Each task defines a warm-up time and a time-period; the task is run for the declared duration, and the final results are calculated from the number of requests sent to the target cluster in that duration.

In timed mode without the ramp-up feature, if a task defines 20 clients, all 20 clients are brought up simultaneously during the benchmark run. This can overwhelm the target cluster quickly, and the user never learns at exactly what load the cluster went under duress. See the sample task definition below:

{
  "operation": "cardinality-agg-high",
  "warmup-time-period": 300,
  "time-period": 300,
  "target-throughput": 20,
  "clients": 20
}

To avoid this scenario, we have introduced a new task definition parameter called ramp-up-time-period. When this parameter is used, the search/index clients are brought up in a timed manner: each client's start is delayed by client_num * (ramp-up-time-period / total_clients). Taking the example below, with 20 clients and a total ramp-up time of 1800s, a new client is spawned every 90s. The 1st client (client_num 0) is spawned immediately on benchmark start, the 2nd client at 1 * (1800/20) = 90s, and so on, so that all 20 clients are up by the end of the 1800s ramp-up period (see the sketch after the example below).

If the warm-up time is the same as the ramp-up time, the benchmark treats the ramp-up period as warm-up and then moves on to the actual benchmarking phase (where it starts collecting metrics for the final results) for the duration defined in the time-period field.

{
  "operation": "cardinality-agg-high",
  "warmup-time-period": 1800,
  "ramp-up-time-period": 1800,
  "time-period": 300,
  "target-throughput": 20,
  "clients": 20
}
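
For illustration, here is a minimal Python sketch of the spawn schedule implied by the client_num * (ramp-up-time-period / total_clients) formula, assuming 0-indexed clients (spawn_offsets is a hypothetical helper, not OSB internals):

def spawn_offsets(total_clients: int, ramp_up_time_period: float) -> list[float]:
    """Return the second, relative to benchmark start, at which each client spawns."""
    interval = ramp_up_time_period / total_clients
    # client 0 starts immediately; each subsequent client is delayed by one interval
    return [client_num * interval for client_num in range(total_clients)]

offsets = spawn_offsets(total_clients=20, ramp_up_time_period=1800)
print(offsets[:3], "...", offsets[-1])  # [0.0, 90.0, 180.0] ... 1710.0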

How to use it?

We will be adding new test-procedures to workloads that run tasks in timed-mode and accept the desired parameters as workload params. For example, to use the timed-mode test procedure shown below, run the following command:

opensearch-benchmark execute-test --workload=big5 --pipeline=benchmark-only --target-hosts=<cluster-endpoint> --include-tasks=range --test-procedure=timed-mode-test-procedure --workload-params='{"warmup_time":1800,"ramp_up_time":1800,"target_throughput":20,"search_clients":20}'

{
  "name": "timed-mode-test-procedure",
  "schedule": [
    {
      "operation": "range",
      "warmup-time-period": {{ warmup_time | default(300) | tojson }},
      "ramp-up-time-period": {{ ramp_up_time | default(0) | tojson }},
      "time-period": {{ time_period | default(300) | tojson }},
      "target-throughput": {{ target_throughput | default(1) | tojson }},
      "clients": {{ search_clients | default(1) }}
    }
  ]
}

If you do not find a timed-mode test-procedure or task for the desired workload query, you can directly modify the task definition and use it. For example, open the /home/ec2-user/.benchmark/benchmarks/workloads/default/big5/test_procedures/common/big5-schedule.json file, modify the cardinality-agg-high task based on the example above, save the file, and use it.

How to define and/or achieve target-throughput?

If you want to achieve a specified throughput, you must pass the target-throughput parameter. When defined, it specifies the number of requests per second across all clients. For example, target-throughput: 1000 with 8 clients means each client issues 125 (= 1000 / 8) requests per second, so all clients together issue 1000 requests per second. If OSB reports less than the specified throughput in the final results, the OpenSearch cluster simply cannot reach it.

To test the best throughput that OpenSearch can achieve, it is advised to run the benchmark in unthrottled mode: set target-throughput to 0 or none, or leave it undefined. OSB then assumes this is a throughput benchmark and runs the task as fast as it can. This is mostly useful for batch-style operations where achieving the best throughput matters more than an acceptable latency.
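
As a quick illustration of the arithmetic above (a sketch only; per_client_rate is a hypothetical helper, not OSB code):

def per_client_rate(target_throughput: float | None, clients: int) -> float | None:
    """Split the overall target across clients; 0 or None means unthrottled."""
    if not target_throughput:
        return None  # throughput benchmark: each client runs as fast as it can
    return target_throughput / clients

print(per_client_rate(1000, 8))  # 125.0 requests per second per client
print(per_client_rate(0, 8))     # None -> unthrottled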

How to determine the maximum concurrent clients my cluster can handle?

At the moment opensearch-benchmark cannot predict the number of concurrent clients or the throughput at which the cluster reaches its redline without going under duress. However, if you want to find the exact point (the nth client) at which the cluster started returning errors, pass the --on-error=abort flag to the opensearch-benchmark command. This stops the benchmark the moment the cluster returns an error response; for example, in a managed domain the admission controller starts rejecting requests when data node CPU utilization hits 100% or the search queue is exhausted. You can then see from the console log at which client the benchmark stopped.
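
As a rough back-of-the-envelope aid (a hypothetical helper, not an OSB feature), the elapsed time at which the run aborted, combined with the spawn formula above, tells you how many clients were active:

def clients_active_at(elapsed_s: float, total_clients: int, ramp_up_time_period: float) -> int:
    """Estimate the number of active clients at a given second of the ramp-up."""
    interval = ramp_up_time_period / total_clients
    return min(total_clients, int(elapsed_s // interval) + 1)

# e.g. an abort 1000s into an 1800s ramp-up of 20 clients:
print(clients_active_at(1000, 20, 1800))  # 12 clients were active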

What do the final benchmark results mean?

In the current scenario, the client ramp-up happens during the warm-up period, meaning the metrics generated in this period are not evaluated when calculating the final benchmark results. After the ramp-up/warm-up period, the benchmark runs for the additional time declared in the time-period field. This period starts once all clients have ramped up to the desired level, and all samples for the final results are collected and evaluated for this period only.
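
In other words, assuming warmup-time-period equals ramp-up-time-period as in the example above (measurement_window is an illustrative helper, not OSB API):

def measurement_window(ramp_up_s: float, time_period_s: float) -> tuple[float, float]:
    """Samples before ramp_up_s are discarded; only this window feeds the final results."""
    return (ramp_up_s, ramp_up_s + time_period_s)

start, end = measurement_window(1800, 300)
print(f"metrics kept from {start:.0f}s to {end:.0f}s")  # metrics kept from 1800s to 2100s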

Demo with opensource OpenSearch

To test this feature and make sure it delivers what we expect from it, we ran a simple benchmark using the big5 workload on an open-source OpenSearch cluster.

Cluster config: 3 data nodes (c5.2xlarge), 1 client node (c5.2xlarge), and 3 master nodes (c5.large), all deployed using the https://github.com/opensearch-project/opensearch-cluster-cdk package.

Task definition and workload: We used the cardinality-agg-high query type, as it requires extensive compute capacity and can exhaust the data nodes pretty quickly. You can also pick low-latency queries like term or default to test with thousands of clients.
With target_throughput: 30 and search_clients: 30, the load starts at 1 req/s from a single client and increases gradually up to 30 req/s across 30 clients over the course of 3600s; a new client comes up every 120s (3600/30).

The big5 index had 3 shards with 12 segments per shard and 0 replicas.

{
  "operation": "cardinality-agg-high",
  "warmup-time-period": 3600,
  "ramp-up-time-period": 3600,
  "time-period": 600,
  "target-throughput": {{ target_throughput | default(2) | tojson }},
  "clients": {{ search_clients | default(1) }}
}

Install: Until this feature ships in the upcoming release, you can check out the opensearch-benchmark repo and run pip3 install -e . to install it on the host.

Command:

opensearch-benchmark execute-test --target-hosts=<cluster-endpoint>:<port> --pipeline=benchmark-only --workload=big5 --workload-params='{"target_throughput":30,"search_clients":30}' --include-tasks=cardinality-agg-high --telemetry=node-stats --telemetry-params="node-stats-include-indices-metrics:'search'"
 

We used the node-stats telemetry device to fetch granular CPU, JVM, and threadpool metrics. We also enabled indices-level stats, which provide search metrics (such as search rate).

Datastore: To process and chart the node stats and search-related metrics, you have to use an OpenSearch datastore instead of the default in-memory store, because granular metrics are not accessible in the in-memory setup. During the benchmark, granular metrics are pushed to the OpenSearch datastore at regular intervals, making it easy to track metrics such as query latency, throughput, CPU, JVM, threadpool usage, and search rate, and to visualize them.
Sample benchmark.ini config to enable the datastore:

[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = /home/ec2-user/.benchmark/benchmarks
src.root.dir = /home/ec2-user/.benchmark/benchmarks/src

[source]
remote.repo.url = https://github.com/opensearch-project/OpenSearch.git
opensearch.src.subdir = opensearch

[benchmarks]
local.dataset.cache = /home/ec2-user/.benchmark/benchmarks/data

[results_publishing]
datastore.type = opensearch
datastore.host = opense-clust-*******************-east-1.amazonaws.com
datastore.port = 80
datastore.secure = False
datastore.user =
datastore.password =


[workloads]
default.url = https://github.com/opensearch-project/opensearch-benchmark-workloads

[provision_configs]
default.dir = default-provision-config

[defaults]
preserve_benchmark_candidate = false

[distributions]
release.cache = true

Metrics and Observations

As the clients ramped up, I was able to visualize key metrics during the course of the run and see how the cluster behaved under increasing load.

Query latency: Up to about 5-7 concurrent clients sending 1 req/s each, query latency was stable (600-700ms), but as the clients kept ramping up, latency kept increasing, reaching about 4500ms by the end of the run.

[image: query latency chart]

CPU: At about 10-12 concurrent clients, CPU utilization on 2 of the 3 data nodes reached 100%. In the managed service this would have triggered the admission controller to start rejecting requests, but in open-source the responses just got slower.

[image: data node CPU utilization chart]

Heap usage: The JVM heap usage was pretty stable throughout the run.

[image: JVM heap usage chart]

Thread Pool Search Queue metric:

[image: thread pool search queue chart]

What next?

In our next milestone, we plan to make the client ramp-up dynamic and use it to predict the max QPS the target cluster can handle while operating within desired thresholds, by dynamically adding or removing clients. For more details and to share feedback, please refer to #729.

We will also update workloads to add a new test-procedure that accepts the new parameters and runs the benchmark in timed mode.
