-
Notifications
You must be signed in to change notification settings - Fork 507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation for OSB query randomization #8990
base: main
Are you sure you want to change the base?
Add documentation for OSB query randomization #8990
Conversation
Signed-off-by: Peter Alfonsi <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review. |
Signed-off-by: Peter Alfonsi <[email protected]>
Addressed most of the style guide complaints, but it seems to be upset by the proper noun "Zipf" as in "Zipf distribution". |
Adding a blocked label to this for now until the development PR is merged. I'll go ahead and edit the content in the meantime. |
|
||
To get cache hits, we can't completely randomize this; we have to reuse the same values some of the time. To achieve this we generate some number `N` of value pairs for each randomized operation at the beginning of the benchmark. Each pair gets an index from 1 to `N`. | ||
|
||
Every time a query is sent to OpenSearch, some fraction `rf` (short for "repeat-frequency") of the time, we draw a pair from this saved list. This pair may have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. In practice, this may or may not be a hit, depending on benchmark duration and cache size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peteralfonsi: By pair
do we mean "value pairs"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a value pair like {"gte":5, "lte":15}
|
||
The first argument, `"bbox"`, is the operation name. The second argument, `"geo_bounding_box"`, is the query type name. | ||
|
||
The third argument is a list of lists: `[["top_left"], ["bottom_right"]]`. Each entry in the outer list represents one parameter name that will be randomized. It's a list because we may have multiple different versions of the same name that represent roughly the same thing. For example, `"gte"` or `"gt"`. In this case there's only one option for each parameter name. At least one version of each parameter name must be present in the original query for it to be randomized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a second argument? @peteralfonsi? Based on the code, I only see three arguments, but want to make sure we don't miss anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have 4 arguments - "bbox"
, "geo_bounding_box"
, [["top_left"], ["bottom_right"]]
, and []
. May have missed the description of the second argument "geo_bounding_box"
as it's in the same paragraph as the description of the first argument?
Signed-off-by: Naarcha-AWS <[email protected]>
@peteralfonsi and @IanHoang: I added my edits on top of this. Can one or both of y'all take a look and make sure my adjustments are still technically accurate? |
Looks good to me. I tweaked the new language in the Overview section slightly, to make it clearer why we need to reuse values, and also to make it clearer that the generation of saved value pairs is something that OSB does itself, rather than something the user has to do. |
Signed-off-by: Peter Alfonsi <[email protected]>
9fad08d
to
883a25a
Compare
_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md
Outdated
Show resolved
Hide resolved
_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md
Outdated
Show resolved
Hide resolved
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Description
Adds a documentation page describing what OSB query randomization is and how to use it. As @IanHoang suggested in opensearch-project/opensearch-benchmark#712 I've put this under the Optimizing Benchmarks section.
Issues Resolved
Closes #8989
Version
The whole page describes how the feature will work once the pending opensearch-project/opensearch-benchmark#712 goes in, which will likely go into OSB 1.12. Most of the feature is present in OSB 1.3 and one flag was added in OSB 1.8. I can split up this PR into separate chunks to be backported separately if that's the right way to do things.
Frontend features
N/A
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.