Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for OSB query randomization #8990

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

peteralfonsi
Copy link
Contributor

Description

Adds a documentation page describing what OSB query randomization is and how to use it. As @IanHoang suggested in opensearch-project/opensearch-benchmark#712 I've put this under the Optimizing Benchmarks section.

Issues Resolved

Closes #8989

Version

The whole page describes how the feature will work once the pending opensearch-project/opensearch-benchmark#712 goes in, which will likely go into OSB 1.12. Most of the feature is present in OSB 1.3 and one flag was added in OSB 1.8. I can split up this PR into separate chunks to be backported separately if that's the right way to do things.

Frontend features

N/A

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

Signed-off-by: Peter Alfonsi <[email protected]>
@peteralfonsi
Copy link
Contributor Author

Addressed most of the style guide complaints, but it seems to be upset by the proper noun "Zipf" as in "Zipf distribution".

@Naarcha-AWS Naarcha-AWS added Blocked PR: Cannot move forward without assistance benchmark labels Jan 2, 2025
@Naarcha-AWS
Copy link
Collaborator

Adding a blocked label to this for now until the development PR is merged. I'll go ahead and edit the content in the meantime.

@Naarcha-AWS Naarcha-AWS added the 3 - Tech review PR: Tech review in progress label Jan 2, 2025
@Naarcha-AWS Naarcha-AWS removed the Blocked PR: Cannot move forward without assistance label Jan 7, 2025

To get cache hits, we can't completely randomize this; we have to reuse the same values some of the time. To achieve this we generate some number `N` of value pairs for each randomized operation at the beginning of the benchmark. Each pair gets an index from 1 to `N`.

Every time a query is sent to OpenSearch, some fraction `rf` (short for "repeat-frequency") of the time, we draw a pair from this saved list. This pair may have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. In practice, this may or may not be a hit, depending on benchmark duration and cache size.
Copy link
Collaborator

@Naarcha-AWS Naarcha-AWS Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peteralfonsi: By pair do we mean "value pairs"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a value pair like {"gte":5, "lte":15}


The first argument, `"bbox"`, is the operation name. The second argument, `"geo_bounding_box"`, is the query type name.

The third argument is a list of lists: `[["top_left"], ["bottom_right"]]`. Each entry in the outer list represents one parameter name that will be randomized. It's a list because we may have multiple different versions of the same name that represent roughly the same thing. For example, `"gte"` or `"gt"`. In this case there's only one option for each parameter name. At least one version of each parameter name must be present in the original query for it to be randomized.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a second argument? @peteralfonsi? Based on the code, I only see three arguments, but want to make sure we don't miss anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have 4 arguments - "bbox", "geo_bounding_box", [["top_left"], ["bottom_right"]], and []. May have missed the description of the second argument "geo_bounding_box" as it's in the same paragraph as the description of the first argument?

Signed-off-by: Naarcha-AWS <[email protected]>
@Naarcha-AWS
Copy link
Collaborator

@peteralfonsi and @IanHoang: I added my edits on top of this. Can one or both of y'all take a look and make sure my adjustments are still technically accurate?

@peteralfonsi
Copy link
Contributor Author

Looks good to me. I tweaked the new language in the Overview section slightly, to make it clearer why we need to reuse values, and also to make it clearer that the generation of saved value pairs is something that OSB does itself, rather than something the user has to do.

Signed-off-by: Peter Alfonsi <[email protected]>
@peteralfonsi peteralfonsi force-pushed the osb-randomization-doc branch from 9fad08d to 883a25a Compare January 7, 2025 20:49
@Naarcha-AWS Naarcha-AWS added 5 - Editorial review PR: Editorial review in progress and removed 3 - Tech review PR: Tech review in progress labels Jan 9, 2025
Signed-off-by: Naarcha-AWS <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Editorial review PR: Editorial review in progress benchmark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Add documentation for OpenSearch Benchmarks query randomization
3 participants