Evaluating binomial & non-binomial metrics simultaneously with Experiment class & how to categorise metrics by type #106

tmann01 opened this issue Oct 31, 2024 · 2 comments

tmann01 commented Oct 31, 2024

I can't seem to find much information on how to evaluate both binomial and non-binomial metrics at the same time within a dataframe that is passed to the Experiment class.

It seems that, even with the method column specified, multiple_difference treats every metric as binomial. You would obviously need different inputs to perform a t-test, so how would I add and specify those columns, and how would I indicate them in Experiment?

Likewise, there's a really good paper you posted on your risk-aware product decision framework using multiple metrics, and I've seen mentions of success metrics in the repository and Q&A, but I couldn't find any documentation on how to specify success, deterioration, and guardrail metrics. I did see a method for the sample ratio, which is a form of quality metric, so I suspect the full framework has been considered, but it's hard to see how to implement the entire approach.

Do let me know if you need any further information. Thanks for your time!

iampelle (Contributor) commented Nov 1, 2024

Here's an example (that could be put in an example notebook or as a test case for the Experiment class):

import pandas as pd
import spotify_confidence
from IPython.display import display

columns = [
    "group_name",
    "num_user",
    "sum",
    "sum_squares",
    "method",
    "metric",
    "preferred_direction",
    "non_inferiority_margin",
]
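# One row per (group, metric). m1 and m2 are non-binomial metrics evaluated
# with a z-test from per-group sums and sums of squares; m3 is a binomial
# metric evaluated with a chi-squared test, with the numerator holding the
# number of successes. m1 also carries a non-inferiority margin (a guardrail).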
data = [
    ["Control", 6267728, 3240932, 52409321212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 3239706, 52397061212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 38600871, 12432573969, "z-test", "m2", None, None],
    ["Control", 6225728, 35963863, 18433512959, "z-test", "m2", None, None],
    ["Test", 62607, 26738, None, "chi-squared", "m3", "increase", None],
    ["Control", 62677, 16345, None, "chi-squared", "m3", "increase", None],
]
df = pd.DataFrame(columns=columns, data=data)
test = spotify_confidence.Experiment(
    data_frame=df,
    numerator_column="sum",
    numerator_sum_squares_column="sum_squares",
    denominator_column="num_user",
    categorical_group_columns="metric",
    interval_size=0.99,
    correction_method="bonferroni",
    metric_column="metric",
    treatment_column="group_name",
    method_column="method",
)

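# With non_inferiority_margins=True, the margins and preferred directions are
# read from the dataframe's non_inferiority_margin and preferred_direction columns.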
diff = test.multiple_difference(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
)

display(diff)
test.multiple_difference_plot(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
    use_adjusted_intervals=True,
    absolute=False,
).show('html')

[screenshot of the resulting difference table and multiple_difference_plot output]

Guardrail metrics are specified by providing a NIM (non-inferiority margin), as for the m1 metric. In the example we cannot conclude that m1 in Test is no worse than m1 in Control, since part of the metric's confidence interval falls below the NIM.

Success metrics can be one-sided or two-sided, as specified by preferred_direction. I guess the difference between success and deterioration metrics is that success metrics are what you're actually trying to improve, while deterioration metrics are ones you hope stay neutral, often related to performance, latency, number of crashes, etc.

ankargren (Collaborator) commented:
The way you’d implement the deterioration metrics is to take the same data as you’d use for your main results (like Pelle provided), but flip the preferred direction, not use any NIMs, and use a separate alpha.

For example, suppose you have one success metric that should improve and one guardrail metric (with a NIM) for which an increase is a good change. You would then set up the data as in Pelle's example, setting the preferred direction to increase and providing the NIM for the guardrail metric. Next, you would make a similar call but set the preferred direction to decrease and not set a NIM; that tests whether either of the two metrics has moved significantly in the wrong direction (see the sketch below). In the paper we also use a different alpha for this test, so it draws on a separate budget. You can also include a sample ratio mismatch test here via the chi-squared test.
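
For illustration, here is a minimal sketch of that deterioration call (my reading of the above, not an official example). It reuses df and the Experiment arguments from Pelle's comment; the interval_size of 0.999 is just a placeholder for the separate, stricter alpha, not a value from the paper:

# Deterioration test: same data, flipped preferred directions, no NIMs,
# and a separate alpha via interval_size.
det_df = df.copy()
# Flip the preferred direction so we test for movement in the wrong direction;
# metrics without a preferred direction stay two-sided.
det_df["preferred_direction"] = det_df["preferred_direction"].map(
    {"increase": "decrease", "decrease": "increase"}
)
# Deterioration tests do not use non-inferiority margins.
det_df["non_inferiority_margin"] = None

det_test = spotify_confidence.Experiment(
    data_frame=det_df,
    numerator_column="sum",
    numerator_sum_squares_column="sum_squares",
    denominator_column="num_user",
    categorical_group_columns="metric",
    interval_size=0.999,
    correction_method="bonferroni",
    metric_column="metric",
    treatment_column="group_name",
    method_column="method",
)
det_diff = det_test.multiple_difference(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
)
display(det_diff)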
