Add documentation for OSB query randomization #8990

peteralfonsi · 2024-12-27T22:58:23Z

Description

Adds a documentation page describing what OSB query randomization is and how to use it. As @IanHoang suggested in opensearch-project/opensearch-benchmark#712, I've put this under the Optimizing Benchmarks section.

Issues Resolved

Closes #8989

Version

The whole page describes how the feature will work once the pending opensearch-project/opensearch-benchmark#712 is merged, which will likely happen in OSB 1.12. Most of the feature is present in OSB 1.3, and one flag was added in OSB 1.8. I can split this PR into separate chunks to be backported separately if that's the right way to do things.

Frontend features

N/A

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peter Alfonsi <[email protected]>

github-actions · 2024-12-27T22:58:33Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

peteralfonsi · 2024-12-31T17:10:23Z

Addressed most of the style guide complaints, but it seems to be upset by the proper noun "Zipf" as in "Zipf distribution".

Naarcha-AWS · 2025-01-02T18:42:08Z

Adding a blocked label to this for now until the development PR is merged. I'll go ahead and edit the content in the meantime.

Naarcha-AWS · 2025-01-07T19:21:33Z

_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md

+
+To get cache hits, we can't completely randomize this; we have to reuse the same values some of the time. To achieve this we generate some number `N` of value pairs for each randomized operation at the beginning of the benchmark. Each pair gets an index from 1 to `N`. 
+
+Every time a query is sent to OpenSearch, some fraction `rf` (short for "repeat-frequency") of the time, we draw a pair from this saved list. This pair may have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. In practice, this may or may not be a hit, depending on benchmark duration and cache size. 


@peteralfonsi: By pair do we mean "value pairs"?

Yes, a value pair such as `{"gte": 5, "lte": 15}`.
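
To make the mechanism described in the excerpt above concrete, here is a minimal Python sketch of the repeat-frequency idea. It is not OSB's implementation: the names `make_value_source`, `random_range_pair`, and `n_saved` are hypothetical, the saved pair is drawn uniformly (the excerpt does not say how the draw is distributed), and sending a freshly generated pair the remaining `1 - rf` of the time is an assumption.

```python
import random


def random_range_pair(rng):
    """Hypothetical generator for one value pair such as {"gte": 5, "lte": 15}."""
    low = rng.randint(0, 100)
    return {"gte": low, "lte": low + rng.randint(1, 20)}


def make_value_source(n_saved, rf, generate_pair, rng):
    """Return a function that yields value pairs, reusing saved ones a fraction rf of the time."""
    # Generate the N saved pairs once, up front.
    saved = [generate_pair(rng) for _ in range(n_saved)]

    def next_pair():
        if rng.random() < rf:
            # Reuse a saved pair; it may have been sent before, so it could hit the cache.
            return rng.choice(saved)
        # Assumption: otherwise send a freshly generated pair.
        return generate_pair(rng)

    return next_pair


rng = random.Random(0)
next_pair = make_value_source(n_saved=5, rf=0.7, generate_pair=random_range_pair, rng=rng)
sample = [next_pair() for _ in range(10)]
```

With `rf=0.7`, roughly 70% of calls return one of the five saved pairs, which is why 70% is an upper bound on the cache hit ratio and not a guarantee.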

Naarcha-AWS · 2025-01-07T19:41:53Z

_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md

+
+The first argument, `"bbox"`, is the operation name. The second argument, `"geo_bounding_box"`, is the query type name.
+
+The third argument is a list of lists: `[["top_left"], ["bottom_right"]]`. Each entry in the outer list represents one parameter name that will be randomized. It's a list because we may have multiple different versions of the same name that represent roughly the same thing. For example, `"gte"` or `"gt"`. In this case there's only one option for each parameter name. At least one version of each parameter name must be present in the original query for it to be randomized.


@peteralfonsi: Is there a second argument? Based on the code, I see only three arguments, but I want to make sure we don't miss anything.

There should be four arguments: `"bbox"`, `"geo_bounding_box"`, `[["top_left"], ["bottom_right"]]`, and `[]`. You may have missed the description of the second argument, `"geo_bounding_box"`, because it's in the same paragraph as the description of the first argument.
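
For readers following the thread, the rule for the third argument ("at least one version of each parameter name must be present in the original query") can be sketched as a small check. This is only an illustration: `can_randomize` is a hypothetical helper, not part of OSB, and the `["lte", "lt"]` alternatives are assumed by analogy with the `"gte"`/`"gt"` example in the excerpt.

```python
def can_randomize(query_params, parameter_name_options):
    """Return True if every parameter has at least one accepted name in the query.

    query_params: dict of the parameters found in the original query.
    parameter_name_options: list of lists, one inner list per parameter,
        holding the interchangeable names for that parameter.
    """
    return all(
        any(name in query_params for name in versions)
        for versions in parameter_name_options
    )


# The geo_bounding_box case from the excerpt: one option per parameter name.
bbox_options = [["top_left"], ["bottom_right"]]
bbox_query = {"top_left": [-74.1, 40.73], "bottom_right": [-71.12, 40.01]}

# A range-style case where each parameter has two interchangeable names (assumed).
range_options = [["gte", "gt"], ["lte", "lt"]]

bbox_ok = can_randomize(bbox_query, bbox_options)
range_ok = can_randomize({"gt": 5, "lte": 15}, range_options)
range_missing = can_randomize({"gte": 5}, range_options)
```

Here `range_missing` is false because neither `"lte"` nor `"lt"` appears in the query, so that query would be left as written.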

Naarcha-AWS · 2025-01-07T19:52:35Z

@peteralfonsi and @IanHoang: I added my edits on top of this. Can one or both of y'all take a look and make sure my adjustments are still technically accurate?

peteralfonsi · 2025-01-07T20:46:49Z

Looks good to me. I tweaked the new language in the Overview section slightly, to make it clearer why we need to reuse values, and also to make it clearer that the generation of saved value pairs is something that OSB does itself, rather than something the user has to do.

Add documentation for OSB query randomization

0f17a0e

Signed-off-by: Peter Alfonsi <[email protected]>

peteralfonsi requested review from kolchfa-aws, Naarcha-AWS, AMoo-Miki, natebower, dlvenable and epugh as code owners December 27, 2024 22:58

github-actions bot assigned kolchfa-aws Dec 27, 2024

kolchfa-aws assigned Naarcha-AWS and unassigned kolchfa-aws Dec 30, 2024

Address linter issues

543b91f

Signed-off-by: Peter Alfonsi <[email protected]>

Merge branch 'main' into osb-randomization-doc

1135203

Naarcha-AWS added Blocked PR: Cannot move forward without assistance benchmark labels Jan 2, 2025

Naarcha-AWS added the 3 - Tech review PR: Tech review in progress label Jan 2, 2025

Merge branch 'main' into osb-randomization-doc

c4c7af1

Naarcha-AWS removed the Blocked PR: Cannot move forward without assistance label Jan 7, 2025

Naarcha-AWS reviewed Jan 7, 2025

_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md

Naarcha-AWS reviewed Jan 7, 2025

Add writer edits

dd67e64

Signed-off-by: Naarcha-AWS <[email protected]>

Tweak overview section slightly

883a25a

Signed-off-by: Peter Alfonsi <[email protected]>

peteralfonsi force-pushed the osb-randomization-doc branch from 9fad08d to 883a25a Compare January 7, 2025 20:49

Naarcha-AWS reviewed Jan 8, 2025

_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md (outdated)

Naarcha-AWS reviewed Jan 8, 2025

_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md (outdated)

Apply suggestions from code review

a0f668a

Signed-off-by: Naarcha-AWS <[email protected]>

Naarcha-AWS added 5 - Editorial review PR: Editorial review in progress and removed 3 - Tech review PR: Tech review in progress labels Jan 9, 2025

Remove passive voice

578bf32

Signed-off-by: Naarcha-AWS <[email protected]>
