[RFC] Dynamic client ramp up for redline and baseline testing #729

rishabh6788 · 2025-01-15T03:24:30Z

Is your feature request related to a problem? Please describe

We recently added the long-pending client ramp-up feature to opensearch-benchmark. When used, it gradually ramps up the clients based on the number of clients and the time period provided in the ramp-up-time-period field of the task definition. For example, with the task definition given below, each client starts after client-num*(ramp-up-time-period/total-clients) seconds, which is 0s for the 0th (i.e. first) client, 90s for the second, and so on.

{
  "operation": "cardinality-agg-high",
  "warmup-time-period": 1800,
  "ramp-up-time-period": 1800,
  "time-period": 300,
  "target-throughput": 20,
  "clients": 20
}
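To make the arithmetic concrete, here is a minimal sketch of the start-offset formula described above. The function name `ramp_up_start_offsets` is made up for illustration and is not part of opensearch-benchmark; it only reproduces client-num*(ramp-up-time-period/total-clients) for the values in the task definition.

```python
def ramp_up_start_offsets(clients: int, ramp_up_time_period: float) -> list[float]:
    """Return the start delay in seconds for each client, 0-indexed.

    Hypothetical helper: mirrors client-num * (ramp-up-time-period / total-clients).
    """
    if clients <= 0:
        raise ValueError("clients must be a positive integer")
    step = ramp_up_time_period / clients
    return [client_num * step for client_num in range(clients)]


# Values taken from the task definition: 20 clients, 1800s ramp-up.
offsets = ramp_up_start_offsets(clients=20, ramp_up_time_period=1800)
print(offsets[:3])   # [0.0, 90.0, 180.0]
print(offsets[-1])   # 1710.0
```

With these numbers the last client joins at 1710s, so all 20 clients are only active for the final 90s of the ramp-up window.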

Before this feature, opensearch-benchmark would start the benchmark with all 20 clients running in parallel from the beginning, which would quickly overwhelm the cluster, and the user would never be able to figure out the actual breaking point of the cluster.

The ramp-up feature helps alleviate this pain and provides a means to figure out how cluster performance, be it query latency, CPU or JVM utilization, is impacted as the load gradually increases.

The downside is that the user needs to have some idea of the number of clients, i.e. roughly how much load, at which cluster performance starts to be impacted. At the end of the run, opensearch-benchmark provides only a final result showing the final query latency and server-side throughput. Even though the user can use a dedicated OpenSearch cluster as a datastore and chart different metrics to figure out when the cluster started coming under duress, that still takes quite some effort and requires advanced dashboarding knowledge.

Describe the solution you'd like

It would be great to have a benchmark mode where the user just provides a target QPS to achieve along with certain constraints, for example: queries should stay within a certain latency threshold, max overall CPU utilization should not exceed 90%, or the query success rate should not go below 90%. Whenever a threshold is breached, the benchmark should auto-adjust the number of clients to keep performance within the given thresholds.
At the end of the run, the benchmark result should report the final QPS it was able to hit while maintaining all the constraints, along with metrics for query latency and success/error rate.
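As a rough illustration of the idea, the sketch below steps the client count up one at a time and stops at the first threshold breach, reporting the last load level that stayed within limits. Everything here is hypothetical: `Thresholds`, `Sample`, `find_redline` and the `measure` callback are invented names, not opensearch-benchmark APIs, and a real implementation would likely adjust clients in both directions rather than stop at the first violation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Thresholds:
    """User-supplied constraints (hypothetical field names)."""
    max_latency_ms: float
    max_cpu_percent: float
    min_success_rate: float


@dataclass
class Sample:
    """Metrics observed at one load level (hypothetical)."""
    clients: int
    qps: float
    latency_ms: float
    cpu_percent: float
    success_rate: float


def within_limits(sample: Sample, limits: Thresholds) -> bool:
    """True if every observed metric satisfies its threshold."""
    return (
        sample.latency_ms <= limits.max_latency_ms
        and sample.cpu_percent <= limits.max_cpu_percent
        and sample.success_rate >= limits.min_success_rate
    )


def find_redline(
    measure: Callable[[int], Sample],
    limits: Thresholds,
    max_clients: int,
) -> Optional[Sample]:
    """Increase clients until a threshold is breached.

    Returns the last sample that satisfied all constraints,
    or None if even a single client breaches them.
    """
    best = None
    for clients in range(1, max_clients + 1):
        sample = measure(clients)
        if not within_limits(sample, limits):
            break
        best = sample
    return best


# Stand-in for a real cluster: latency and CPU grow linearly with load.
def fake_measure(clients: int) -> Sample:
    return Sample(
        clients=clients,
        qps=5 * clients,
        latency_ms=10 * clients,
        cpu_percent=4 * clients,
        success_rate=1.0,
    )


limits = Thresholds(max_latency_ms=100, max_cpu_percent=90, min_success_rate=0.9)
redline = find_redline(fake_measure, limits, max_clients=20)
print(redline.clients, redline.qps)  # 10 50
```

In the fake model the latency limit is the first constraint to break (at 11 clients), so the reported redline is 10 clients at 50 QPS.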

Describe alternatives you've considered

No response

Additional context

No response

rishabh6788 · 2025-01-15T03:26:45Z

@bowenlan-amzn created an issue for a similar ask here.

OVI3D0 · 2025-01-15T21:04:00Z

This is a great idea. As we discussed offline, we can create a new actor similar to the current WorkerCoordinatorActor, which already calculates throughput every N seconds during a benchmark and reports back metrics like latency and error rates at the end of the benchmark.

This way, the user can select this redline benchmark mode/max_qps, and the actor system can instead spin up this new actor to coordinate the redline test, making use of your ramp up feature.

As for implementing the dynamic throughput control, we could consider using some sort of rate limiting algorithm like the Token Bucket Algorithm that was mentioned already.

getsaurabh02 · 2025-01-16T19:23:15Z

Thanks @rishabh6788. This is a super useful feature from a red-line TPS perspective for OpenSearch. Looking forward to the deeper proposal.

cc: @rishabhmaurya @msfroh @mch2 @Pallavi-AWS for thoughts and feedback.

rishabh6788 added enhancement New feature or request untriaged labels Jan 15, 2025

rishabh6788 added this to Engineering Effectiveness Board Jan 16, 2025

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Jan 16, 2025

rishabh6788 moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Jan 16, 2025

rishabh6788 assigned OVI3D0 and rishabh6788 Jan 16, 2025

getsaurabh02 changed the title ~~Dynamic client ramp up for redline and baseline testing~~ [RFC] Dynamic client ramp up for redline and baseline testing Jan 16, 2025

rishabh6788 mentioned this issue Jan 16, 2025

[PROPOSAL] Introducing Red line testing using opensearch-benchmark and dynamic client ramp #731

Open

IanHoang removed the untriaged label Jan 21, 2025
