Tweak overview section slightly
Signed-off-by: Peter Alfonsi <[email protected]>
Peter Alfonsi committed Jan 7, 2025
1 parent dd67e64 commit 883a25a
Showing 1 changed file with 8 additions and 6 deletions.
@@ -32,17 +32,17 @@ For example, modifying `"gte"` and `"lt"` in the following `nyc_taxis` operation
```


-You can’t completely randomize value to get different cache hits, since occasionally caches reuse the same values. To account for the same values while randomizing, generate a number `N` of value pairs for each randomized operation at the beginning of the benchmark. These values are stored in a saved list. Each pair gets an index from `1` to `N`.
+You can’t completely randomize the values, as then the cache would not get any hits. To get cache hits, the cache must sometimes see the same values. To account for the same values while randomizing, OSB generates a number `N` of value pairs for each randomized operation at the beginning of the benchmark. These values are stored in a saved list. Each pair gets an index from `1` to `N`.

-Every time a query is sent to OpenSearch, a fraction of the repeat frequency (`rf`) draws a pair of values from this saved list. This pair might have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. This may or may not be a hit, depending on benchmark duration and cache size.
+Every time a query is sent to OpenSearch, OSB decides whether to use a pair of values from this saved list in the query. It does this a configurable fraction of the time, called repeat frequency (`rf`). This pair might have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. This may or may not be a hit, depending on benchmark duration and cache size.

Saved value pairs are based on the `Zipf` distribution, which empirically matches usage traces for many real caches. Pair `i` is drawn with probability proportional to `1 / i^alpha`, where `alpha` is another parameter controlling how spread out the distribution is. Pairs with low indexes appear with a much higher frequency than those with high indexes.

Otherwise, the other `1-rf` fraction of the time, we generate a totally new random pair of values. Because the value pairs have not been seen before, it should miss the cache.
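
The sampling scheme described in this hunk can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OSB's implementation: the function names (`make_saved_pairs`, `zipf_weights`, `next_pair`), the integer value range, and the use of `random.Random` are all invented for the example.

```python
import random


def make_saved_pairs(n, rng):
    """Generate the N saved (low, high) pairs up front.

    The value range is arbitrary and only serves the illustration.
    """
    pairs = []
    for _ in range(n):
        low = rng.randint(0, 1_000_000)
        pairs.append((low, low + rng.randint(1, 1000)))
    return pairs


def zipf_weights(n, alpha):
    """Unnormalized Zipf weights: pair i (1-based) gets weight 1 / i**alpha."""
    return [1.0 / (i ** alpha) for i in range(1, n + 1)]


def next_pair(saved_pairs, weights, rf, rng):
    """Return (pair, repeated) for one query.

    With probability rf, draw a saved pair using the Zipf weights.
    Otherwise, generate a fresh random pair that is expected to miss the cache.
    """
    if rng.random() < rf:
        return rng.choices(saved_pairs, weights=weights, k=1)[0], True
    low = rng.randint(0, 1_000_000)
    return (low, low + rng.randint(1, 1000)), False


# Example run: the fraction of repeated draws should be close to rf.
rng = random.Random(42)
saved = make_saved_pairs(100, rng)
weights = zipf_weights(100, alpha=1.0)
draws = [next_pair(saved, weights, 0.7, rng) for _ in range(10_000)]
repeated_fraction = sum(1 for _, repeated in draws if repeated) / len(draws)
print(f"repeated fraction: {repeated_fraction:.3f}")
```

The repeated fraction is only an upper bound on the cache hit ratio: a saved pair drawn for the first time, or one evicted since its last use, still misses the cache.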

## Usage

-To use this feature on a workload you must make some changes to your workload's `workload.py` and supply
+To use this feature on a workload you must make some changes to your workload's `workload.py` and supply some CLI flags when running OSB.

### Modifying `workload.py`

@@ -103,13 +103,15 @@ You would register the following function in `workload.py`:
registry.register_query_randomization_info("bbox", "geo_bounding_box", [["top_left"], ["bottom_right"]], [])
```

-The first argument, `"bbox"`, is the operation's name. The second argument, `"geo_bounding_box"`, is the query type name.
+The first argument, `"bbox"`, is the operation's name.

-The second argument is a list of lists: `[[“top_left”], [“bottom_right”]]`. The outer list’s entries specify parameters for randomization, because there might be different versions of the same name that represent roughly the same parameters, for example, `"gte"` or `"gt"`. Here, there’s only one option for each parameter name. At least one version of each parameter's name must be present in the original query for it to be randomized.
+The second argument, `"geo_bounding_box"`, is the query type name.
+
+The third argument is a list of lists: `[[“top_left”], [“bottom_right”]]`. The outer list’s entries specify parameters for randomization, because there might be different versions of the same name that represent roughly the same parameters, for example, `"gte"` or `"gt"`. Here, there’s only one option for each parameter name. At least one version of each parameter's name must be present in the original query for it to be randomized.

The last argument is a list of optional parameters. If an optional parameter is present in the random standard value source, it will be put into the randomized version of the query. If it’s not in the source, it’s ignored. There are no optional parameters in this example, but the typical use case would be `"format"` in a range query.

-If there’s no registration, it uses the default;; equivalent to registering `registry.register_query_randomization_info(<operation_name>, "range", [["gte", "gt"], ["lte", "lt"]], ["format"])`.
+If there’s no registration, it uses the default; equivalent to registering `registry.register_query_randomization_info(<operation_name>, "range", [["gte", "gt"], ["lte", "lt"]], ["format"])`.
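
The rules for the third and fourth arguments can be illustrated with a small Python sketch. It is not OSB's code: `is_randomizable` and `randomized_params` are hypothetical helpers, and the choice of which name version gets written (whichever one the value source supplies) is an assumption of this sketch, not documented behavior.

```python
def is_randomizable(original_params, parameter_name_options):
    """True if at least one version of each parameter name is in the original query."""
    return all(
        any(name in original_params for name in versions)
        for versions in parameter_name_options
    )


def randomized_params(original_params, parameter_name_options,
                      optional_parameters, source_values):
    """Build the randomized parameter dict for one query (hypothetical helper).

    Returns the original parameters unchanged if the query is not randomizable.
    """
    if not is_randomizable(original_params, parameter_name_options):
        return dict(original_params)

    result = {}
    for versions in parameter_name_options:
        # Assumption: use whichever version of the name the value source supplies.
        name = next((v for v in versions if v in source_values), None)
        if name is None:
            raise KeyError(f"value source supplies none of {versions}")
        result[name] = source_values[name]

    # Optional parameters are copied only when the value source provides them.
    for name in optional_parameters:
        if name in source_values:
            result[name] = source_values[name]
    return result


# The default registration's third and fourth arguments.
default_options = [["gte", "gt"], ["lte", "lt"]]
default_optional = ["format"]

original = {"gte": 5, "lt": 15}
new_values = {"gte": 10, "lt": 20}  # what a standard value source might return
print(randomized_params(original, default_options, default_optional, new_values))
```

In this sketch, a query with only `"gte"` would be left untouched under the default options, because no version of the second parameter (`"lte"` or `"lt"`) is present.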

The `dict` returned by the standard value source should match the parameter names you are randomizing. For example, the standard value source for the earlier example is:

