Evaluating binomial & non-binomial metrics simultaneously with Experiment class & how to categorise metrics by type #106

tmann01 opened this issue Oct 31, 2024 · 2 comments

tmann01 commented Oct 31, 2024

I can't seem to find much information on how to evaluate both binomial and non-binomial metrics at the same time within a dataframe that is passed to the Experiment class.

It seems that, even with the method column specified, multiple_difference treats every metric as binomial. You would obviously need different inputs to perform a t-test, so how would I add and specify those columns, and how would I indicate them in Experiment?

Likewise, there's a really good paper you posted on your risk-aware product decision framework using multiple metrics, and I've seen mentions of success metrics in the repository and Q&A, but I couldn't find any documentation on how to specify success, deterioration, and guardrail metrics. I did see a method for the sample ratio, which is a form of quality metric, so I suspect the full framework has been considered, but it's hard to see how to implement the entire approach.

Do let me know if you need any further information. Thanks for your time!

iampelle (Contributor) commented Nov 1, 2024

Here's an example (that could be put in an example notebook or as a test case for the Experiment class):

import pandas as pd
import spotify_confidence
from IPython.display import display

columns = [
    "group_name",
    "num_user",
    "sum",
    "sum_squares",
    "method",
    "metric",
    "preferred_direction",
    "non_inferiority_margin",
]
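# One row per (group, metric). m1 and m2 are non-binomial metrics evaluated
# with a z-test from per-group sums and sums of squares; m3 is a binomial
# metric evaluated with a chi-squared test, with the numerator holding the
# number of successes. m1 also carries a non-inferiority margin (a guardrail).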
data = [
    ["Control", 6267728, 3240932, 52409321212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 3239706, 52397061212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 38600871, 12432573969, "z-test", "m2", None, None],
    ["Control", 6225728, 35963863, 18433512959, "z-test", "m2", None, None],
    ["Test", 62607, 26738, None, "chi-squared", "m3", "increase", None],
    ["Control", 62677, 16345, None, "chi-squared", "m3", "increase", None],
]
df = pd.DataFrame(columns=columns, data=data)
test = spotify_confidence.Experiment(
    data_frame=df,
    numerator_column="sum",
    numerator_sum_squares_column="sum_squares",
    denominator_column="num_user",
    categorical_group_columns="metric",
    interval_size=0.99,
    correction_method="bonferroni",
    metric_column="metric",
    treatment_column="group_name",
    method_column="method",
)

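# With non_inferiority_margins=True, the margins and preferred directions are
# read from the dataframe's non_inferiority_margin and preferred_direction columns.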
diff = test.multiple_difference(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
)

display(diff)
test.multiple_difference_plot(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
    use_adjusted_intervals=True,
    absolute=False,
).show('html')

[screenshot of the resulting difference table and multiple_difference_plot output]

Guardrail metrics are specified by providing a NIM (non-inferiority margin), as for the m1 metric. In the example we cannot conclude that m1 in Test is no worse than m1 in Control, since part of the metric's confidence interval falls below the NIM.

Success metrics can be one-sided or two-sided, as specified by preferred_direction. I guess the difference between success and deterioration metrics is that success metrics are what you're actually trying to improve, while deterioration metrics are ones you hope stay neutral, often related to performance, latency, number of crashes, etc.

ankargren (Collaborator) commented:
The way you’d implement the deterioration metrics is to take the same data as you’d use for your main results (like Pelle provided), but flip the preferred direction, not use any NIMs, and use a separate alpha.

For example, suppose you have one success metric that should improve and one guardrail metric (with a NIM) for which an increase is a good change. You would then set up the data as in Pelle's example, setting the preferred direction to increase and providing the NIM for the guardrail metric. Next, you would make a similar call but set the preferred direction to decrease and not set a NIM; that tests whether either of the two metrics has moved significantly in the wrong direction (see the sketch below). In the paper we also use a different alpha for this test, so it draws on a separate budget. You can also include a sample ratio mismatch test here via the chi-squared test.
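
For illustration, here is a minimal sketch of that deterioration call (my reading of the above, not an official example). It reuses df and the Experiment arguments from Pelle's comment; the interval_size of 0.999 is just a placeholder for the separate, stricter alpha, not a value from the paper:

# Deterioration test: same data, flipped preferred directions, no NIMs,
# and a separate alpha via interval_size.
det_df = df.copy()
# Flip the preferred direction so we test for movement in the wrong direction;
# metrics without a preferred direction stay two-sided.
det_df["preferred_direction"] = det_df["preferred_direction"].map(
    {"increase": "decrease", "decrease": "increase"}
)
# Deterioration tests do not use non-inferiority margins.
det_df["non_inferiority_margin"] = None

det_test = spotify_confidence.Experiment(
    data_frame=det_df,
    numerator_column="sum",
    numerator_sum_squares_column="sum_squares",
    denominator_column="num_user",
    categorical_group_columns="metric",
    interval_size=0.999,
    correction_method="bonferroni",
    metric_column="metric",
    treatment_column="group_name",
    method_column="method",
)
det_diff = det_test.multiple_difference(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
)
display(det_diff)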
