
Allow static covariates in BGNBDModel #1390

Open · wants to merge 31 commits into base: main

Conversation

@PabloRoque (Contributor) commented Jan 16, 2025

Description

Allows static covariates in BetaGeoModel

NOTE: There seem to be convergence issues with the dropout-covariate-related params a|b, related to similar observations by @juanitorduz here. As a consequence, the last two assertions in test_distribution_method are a dubious hack.

Related Issue

Checklist

Modules affected

  • MMM
  • CLV
  • Customer Choice

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pymc-marketing--1390.org.readthedocs.build/en/1390/

codecov bot commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 92.48%. Comparing base (cf3eab2) to head (411df2f).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
pymc_marketing/clv/models/beta_geo.py 85.91% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1390      +/-   ##
==========================================
- Coverage   92.58%   92.48%   -0.11%     
==========================================
  Files          52       52              
  Lines        6043     6095      +52     
==========================================
+ Hits         5595     5637      +42     
- Misses        448      458      +10     


@wd60622 wd60622 changed the title [DRAFT]: Allow static covariates in BGNBDModel Allow static covariates in BGNBDModel Jan 17, 2025
@ColtAllen (Collaborator)

Merging #1375 has created a merge conflict in clv/distributions.py, but it shouldn't be hard to fix. It seems this branch was created off of that one.

@ColtAllen ColtAllen added the enhancement New feature or request label Jan 18, 2025
@ColtAllen ColtAllen added this to the 0.12.0 milestone Jan 18, 2025
@ColtAllen (Collaborator)

The _extract_predictive_variables internal method for covariates can probably be moved into the Base CLV class to reduce repetition, because static covariates can be included in all CLV models.

@PabloRoque PabloRoque marked this pull request as ready for review January 22, 2025 18:18
@PabloRoque PabloRoque requested a review from ColtAllen January 24, 2025 14:18
@ColtAllen (Collaborator)

The _extract_predictive_variables internal method for covariates can probably be moved into the Base CLV class to reduce repetition, because static covariates can be included in all CLV models.

I don't see a fast and easy way forward. Different models have different params, and different relationships between covariates and params. We could:

  • Implement some sort of config for these relationships
  • Iterate over the different RVs and perform some checks

I'd much rather address this in a separate PR. What do you think, @ColtAllen?

I agree it should be left to a separate PR, but before creating an issue let me give it some more thought. This would require refactoring to pass arguments into the internal method for the specific model param names, and probably conditionals on the child model class instance itself. I'm not sure an overly complex, hard-to-maintain base class method is worth eliminating boilerplate in the child classes.

@ColtAllen (Collaborator) left a review comment

Mostly minor comments around things like renaming variables, but the poor convergence in distribution_new_customers needs more investigation and plotting of visuals in a notebook.

On that note, a covariate example will need to be added to the notebooks for BGNBD and/or the Quickstart. You can work off the example in the PNBD notebook.

I have an idea on why distribution_new_customers may not be converging well. I'll create an issue for it.

@@ -140,6 +144,9 @@ class BetaGeoModel(CLVModel):
Error Problem." http://brucehardie.com/notes/027/bgnbd_num_error.pdf.
.. [4] Fader, P. S. & Hardie, B. G. (2019) "A Step-by-Step Derivation of the BG/NBD
Model." https://www.brucehardie.com/notes/039/bgnbd_derivation__2019-11-06.pdf
.. [5] Fader, Peter & G. S. Hardie, Bruce (2007).
@ColtAllen (Collaborator)

Can we add an in-line citation for this reference at the top level of the docstring?

@PabloRoque (Contributor, PR author)

Done

Comment on lines 217 to 219
purchase_coefficient_gamma1 = self.model_config[
    "purchase_coefficient_prior"
].create_variable("purchase_coefficient_gamma1")
Collaborator

Why is the gamma1 suffix being used here?

Contributor

Possible to use model coordinates instead?

Comment on lines 249 to 254
dropout_coefficient_gamma2 = self.model_config[
    "dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
    "dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
Collaborator

Can we change these _gamma suffixes to _a and _b? Gamma is a confusing term because it pertains to the purchasing process in the research.

@PabloRoque (Contributor, PR author)

I was trying to follow the convention here [screenshot of the model's parameter notation]. There is no beta, but a and b (I can call them coefficient_a and coefficient_b if you like).

  • We can rename them purchase_coefficient and dropout_coefficient to follow the implementation in ParetoNBDModel. But then the gamma2 and gamma3 coefficients must be equal. This is in fact how the implementation in R's CLVTools is done, btw.
  • I tried that implementation, and it did not help with the convergence.
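For reference, a minimal numpy sketch of the two options being discussed, assuming an exponential covariate link of the form a = a0 * exp(z @ gamma) (illustrative values only, not the actual model code):

```python
import numpy as np

a0, b0 = 1.2, 2.5         # baseline shape parameters (hypothetical)
z = np.array([0.8])       # one static covariate value
gamma2 = np.array([0.5])  # coefficient on a
gamma3 = np.array([-0.3]) # coefficient on b

# Separate coefficients: the a/b ratio, and hence E[p] = a/(a+b), shifts with z.
a = a0 * np.exp(z @ gamma2)
b = b0 * np.exp(z @ gamma3)
mean_separate = a / (a + b)

# CLVTools-style constraint gamma2 = gamma3: a and b are scaled by the same
# factor, so E[p] = a0/(a0+b0) no matter what the covariate value is.
gamma = np.array([0.5])
a_c = a0 * np.exp(z @ gamma)
b_c = b0 * np.exp(z @ gamma)
mean_constrained = a_c / (a_c + b_c)  # equals a0/(a0+b0) for every z
```

Under the constrained form the covariates can still sharpen or flatten the dropout distribution (by scaling a + b), but they can never move its mean.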

Collaborator

Oh my mistake - I meant _a and _b.

Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.

Also, which implementation is not helping with convergence?

@PabloRoque (Contributor, PR author) commented Jan 24, 2025

Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.

That is indeed the case. See here and here

Also, which implementation is not helping with convergence?

Using the same implementation as CLVTools, fixing gamma2=gamma3. It did not help with the test issues.

Collaborator

My best guess as to why the CLVTools developers did this is for easier interpretability and/or to speed up model fits. What's weird is that they went with it even though convergence is negatively impacted. This could be a good selling point for pymc-marketing compared to other open-source tools.

Explaining covariate impacts on overall dropout in terms of separate a and b coefficients will be tricky, but not impossible. Generally, if a > b, then p increases, and vice-versa. The greater both values are, the narrower the distribution.
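These Beta facts are easy to check with scipy (illustrative shape values):

```python
from scipy import stats

# Mean of Beta(a, b) is a / (a + b): a > b pushes p above 0.5, and vice-versa.
mean_high = stats.beta(4.0, 2.0).mean()  # 2/3
mean_low = stats.beta(2.0, 4.0).mean()   # 1/3

# Scaling a and b up together keeps the mean fixed but narrows the distribution.
wide = stats.beta(2.0, 2.0).std()
narrow = stats.beta(20.0, 20.0).std()    # same mean 0.5, much smaller spread
```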

Comment on lines 289 to 294
dropout_coefficient_gamma2 = self.model_config[
    "dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
    "dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
Collaborator

See previous comment.

This hierarchical pooling conditional block may be worth its own internal method because it appears in several models, but we can leave that as a separate PR.
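For reference, the pooling reparameterization in question maps a mean phi and concentration kappa to the Beta shape parameters via a = phi * kappa, b = (1 - phi) * kappa (as in the diff quoted further down this thread); a quick round-trip check with illustrative values:

```python
phi, kappa = 0.3, 10.0   # mean dropout probability, concentration (hypothetical)
a = phi * kappa
b = (1.0 - phi) * kappa

# Round trip: (a, b) recovers (phi, kappa), so the two forms are equivalent.
phi_back = a / (a + b)   # 0.3
kappa_back = a + b       # 10.0
```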

Comment on lines 965 to 975
# NOTE: These are the problematic tests due to poor convergence
# We would need to test e.g.:
# assert (res_zero["dropout"] < res_high["dropout"]).all()
# Instead we test "less than" within tolerance
assert (
    (
        res_zero["recency_frequency"].sel(obs_var="recency")
        - res_high["recency_frequency"].sel(obs_var="recency")
    )
    < 0.35
).all()
Collaborator

We may need to break this into multiple tests, because there are 9 separate assert statements happening here.

We should also investigate this particular assert in a notebook because the corresponding test in ParetoNBDModel doesn't have this problem.

@PabloRoque (Contributor, PR author)

Related to the notebook: I am drafting something in #1430. That PR is more focused on:

  • Fixing plotting to allow covariates
  • Introducing a new dataset, ApparelTrans, exported from CLVTools's covariates examples

I wanted to keep this in a separate PR, so I will only be adding a small gist with convergence plots here.

@PabloRoque (Contributor, PR author)

I replaced these final tests altogether, taking into account the properties of the Beta distribution. They should make sense now.

@ColtAllen (Collaborator)

Created an issue that may be related to the poor convergence: #1431

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@github-actions github-actions bot added the docs Improvements or additions to documentation label Feb 2, 2025
@PabloRoque (Contributor, PR author) commented Feb 2, 2025

@ColtAllen

OK, the plot kind of thickened.

the poor convergence in distribution_new_customers needs more investigation and plotting of visuals in a notebook.

I added docs/notebooks/dev/clv/dev/bg_nbg_covariates_test_issues.ipynb. Findings:

  • We were assuming the tests to be equivalent to the ones in ParetoNBD, but this should not be the case.
  • Whenever a = b, the Beta distribution is symmetric with E[X] = 0.5.
  • Under this condition, the only effect of introducing covariates is a narrowing of the distribution around 0.5.
  • I believe this has implications for the results from CLVTools. For them gamma2 = gamma3, and thus if a0 = b0 you basically have a fancy 50/50 coin toss discerning your dropout probability.

Generally if a >b, then p increases, and vice-versa. The greater both values are, the narrower the distribution.

This was the whole thing! I was not taking the properties of the Beta distribution into account, and was following the ParetoNBD implementation blindly. Working on the notebook, I thought I was going crazy: E["dropout"] ~ 0.5 always, regardless of the covariates I was using. Note that in the test setup we have equal coefficients: dropout_coefficient_a=np.array([3.0]), dropout_coefficient_b=np.array([3.0]). This was meant to be the case from the beginning.
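A minimal numpy sketch of the effect described above, assuming the exponential covariate link (illustrative values, not the actual test constants beyond the equal coefficients):

```python
import numpy as np

a0 = b0 = 2.0           # equal baseline shape parameters (hypothetical)
coef = np.array([3.0])  # dropout_coefficient_a == dropout_coefficient_b

means = []
for z in (np.array([-1.0]), np.array([0.0]), np.array([2.0])):
    a = a0 * np.exp(z @ coef)
    b = b0 * np.exp(z @ coef)
    means.append(float(a / (a + b)))

# a == b for every z, so the Beta dropout distribution stays symmetric:
# every entry of `means` is 0.5, whatever the covariate value.
```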

Note, however, a further finding. We are using a|b = pm.Flat("a|b") in distribution_new_customer. If we do not pass data to the function, we run into issues, because Flat and Beta don't get along too well.

@PabloRoque PabloRoque requested a review from ColtAllen February 2, 2025 11:32
@juanitorduz (Collaborator)

hey! are we missing something here? any blockers :) ?

@ColtAllen (Collaborator)

hey! are we missing something here? any blockers :) ?

Just haven't found time to look at it yet 🤔 Reviewing now.

@PabloRoque (Contributor, PR author)

hey! are we missing something here? any blockers :) ?

It is good to ship on my side, and would allow work on a clean branch for adding covariates to the ModifiedBetaGeoModel.

@ColtAllen requested changes a couple of weeks ago related to some tests. I believe I've addressed all the concerns he expressed (see the dev notebook), but it would be good to have the green light on his side.

@ColtAllen
Copy link
Collaborator

Note however some findings. We are using a|b = pm.Flat("a|b") in distribution_new_customer. If we are not to pass data to the function, we would be having issues because Flat and Beta don't get along too well.

distribution_new_customer is just boilerplate for sampling from the latent Beta dropout_rate and Gamma purchase_rate distributions. The choice of distributions for a and b is arbitrary because the fitted posteriors from self.fit_result are used for those parameters.


a = pm.Deterministic("a", phi_dropout * kappa_dropout)
b = pm.Deterministic("b", (1.0 - phi_dropout) * kappa_dropout)
if self.dropout_covariate_cols:
@ColtAllen (Collaborator) commented Feb 12, 2025

Have you tested the phi/kappa hierarchical pooling for a_scale and b_scale with covariates? Any major differences in results and/or fit times compared to specifying a prior for these params?

@PabloRoque (Contributor, PR author)

You can play with it in #1430, docs/source/notebooks/clv/dev/bg_nbd_covariates.ipynb, by removing the custom prior and using default_model_config, as in:

bgnbd = clv.BetaGeoModel(
    rfm_data,
)

It works. The model fits, but we get quite biased estimates.

Collaborator

So specifying a_prior and b_prior in the model config is recommended when using covariates? We should probably mention this somewhere.

@PabloRoque (Contributor, PR author)

It is better to introduce sensible priors with a small dataset (which is the case in #1430) than to use the default kappa, gamma priors. I did not try different priors for kappa, gamma.

I would defer advising on particular priors until we study default priors in the newly opened issue #1496.


def test_expectation_method(self):
"""Test that predictive methods work with covariates"""
# Higher covariates with positive coefficients -> higher chance of death, and vice-versa
Collaborator

Do your experiments in the dev notebook confirm this? This seems copy/pasted from the equivalent test for ParetoNBD model.

@ColtAllen (Collaborator) left a review comment

Code looks good! Just some clarifying questions regarding the notebook experiments, plus a request to add an additional test condition, and I think this will be good to merge!

Comment on lines +1013 to +1014
rtol=0.6,
)
Collaborator

Can you add an additional test to check for the nested prior config when a_prior and b_prior aren't specified?

Also, is this the smallest rtol you could get passing results with? The same test has an rtol of only 0.2 for ParetoNBDModel.

Labels: CLV, docs (Improvements or additions to documentation), enhancement (New feature or request), tests
4 participants