Allow static covariates in BGNBDModel #1390

Open: PabloRoque wants to merge 31 commits into main from BGNBD-static-covar
Changes shown from 21 of the 31 commits.

Commits
5f567eb
Implement ModifiedBetaGeoNBD, ModifiedBetaGeoNBDRV. Modify ModifiedBe…
PabloRoque Jan 15, 2025
1897d62
Add test_notimplemented_logp
PabloRoque Jan 15, 2025
042780e
Merge branch 'main' into ModifiedBetaGeoNBDRV
PabloRoque Jan 15, 2025
bc83ddc
Sample recency_frequency from the newly introduced RV block
PabloRoque Jan 15, 2025
f0cb25e
Add model coords in distribution_new_customers
PabloRoque Jan 15, 2025
59d6826
Allow covariates in BG/NBD
PabloRoque Jan 16, 2025
b33b56b
Merge branch 'main' into BGNBD-static-covar
PabloRoque Jan 17, 2025
e74745c
Merge branch 'main' into BGNBD-static-covar
ColtAllen Jan 18, 2025
ffced16
Add BetaGeoModel_extract_predictive_variables. Add TestBetaGeoModelWi…
PabloRoque Jan 20, 2025
02b7c11
Add test_logp
PabloRoque Jan 20, 2025
dc7be23
Introduce gamma2, gamma3 dropout coefficients
PabloRoque Jan 22, 2025
dedb36e
Adapt tests to 3 coefficients. Fix test_logp
PabloRoque Jan 22, 2025
91d0168
Fix test_expectation_method. Fix test_covariate_model_convergence
PabloRoque Jan 22, 2025
4af6482
Merge branch 'main' into BGNBD-static-covar
PabloRoque Jan 22, 2025
28c15c0
Revert explicit dims in RVs
PabloRoque Jan 22, 2025
6ef492f
Include dims. Fix test_distribution_method
PabloRoque Jan 22, 2025
77dad3e
Increase recency_frequency tolerance
PabloRoque Jan 22, 2025
857c885
Add tolerance to dropout_covariate tests
PabloRoque Jan 23, 2025
f796170
Revert non-centered priors
PabloRoque Jan 23, 2025
3877662
Merge branch 'main' into BGNBD-static-covar
PabloRoque Jan 23, 2025
2e83e3a
Merge branch 'main' into BGNBD-static-covar
PabloRoque Jan 24, 2025
8902431
Merge branch 'main' into BGNBD-static-covar
PabloRoque Feb 1, 2025
187176a
Rename gamma coefficients
PabloRoque Feb 1, 2025
b056f7a
Add inline citation for [5]
PabloRoque Feb 1, 2025
d71d2b5
Amend covariates assertions. Add notebook with investigation
PabloRoque Feb 2, 2025
74ae6a1
Add note on Gamma dist
PabloRoque Feb 2, 2025
cf8d67a
Increase covariate, so frequency distributions are further apart
PabloRoque Feb 2, 2025
a8c1328
Test on the mean, not each value
PabloRoque Feb 2, 2025
6186be0
Fix a_scale, b_scale dims
PabloRoque Feb 6, 2025
d13da11
Merge branch 'main' into BGNBD-static-covar
PabloRoque Feb 11, 2025
411df2f
Merge branch 'main' into BGNBD-static-covar
PabloRoque Feb 12, 2025
1 change: 1 addition & 0 deletions environment.yml
@@ -50,3 +50,4 @@ dependencies:
- blas
- mlflow
- hatch
- pyprojroot
260 changes: 232 additions & 28 deletions pymc_marketing/clv/models/beta_geo.py
@@ -63,6 +63,10 @@
* `b_prior`: Shape parameter of dropout process; defaults to `1-phi_dropout_prior` * `kappa_dropout_prior`
* `phi_dropout_prior`: Nested prior for a and b priors; defaults to `Prior("Uniform", lower=0, upper=1)`
* `kappa_dropout_prior`: Nested prior for a and b priors; defaults to `Prior("Pareto", alpha=1, m=1)`
* `purchase_coefficient_prior`: Prior on coefficients for purchase rate covariates; defaults to `Prior("Normal", mu=0, sigma=1)`
* `dropout_coefficient_prior`: Prior on coefficients for dropout covariates; defaults to `Prior("Normal", mu=0, sigma=1)`
* `purchase_covariate_cols`: List containing column names of covariates for customer purchase rates.
* `dropout_covariate_cols`: List containing column names of covariates for customer dropouts.
sampler_config : dict, optional
Dictionary of sampler parameters. Defaults to *None*.

@@ -140,6 +144,9 @@
Error Problem." http://brucehardie.com/notes/027/bgnbd_num_error.pdf.
.. [4] Fader, P. S. & Hardie, B. G. (2019) "A Step-by-Step Derivation of the BG/NBD
Model." https://www.brucehardie.com/notes/039/bgnbd_derivation__2019-11-06.pdf
.. [5] Fader, Peter S. & Hardie, Bruce G. S. (2007).
"Incorporating Time-Invariant Covariates into the Pareto/NBD and BG/NBD Models".
https://www.brucehardie.com/notes/019/time_invariant_covariates.pdf

Collaborator: Can we add an in-line citation for this reference in the top level of the docstring?

PabloRoque (Author): Done

""" # noqa: E501

@@ -151,15 +158,27 @@
model_config: dict | None = None,
sampler_config: dict | None = None,
):
self._validate_cols(
data,
required_cols=["customer_id", "frequency", "recency", "T"],
must_be_unique=["customer_id"],
)
super().__init__(
data=data,
model_config=model_config,
sampler_config=sampler_config,
non_distributions=["purchase_covariate_cols", "dropout_covariate_cols"],
)
self.purchase_covariate_cols = list(
self.model_config["purchase_covariate_cols"]
)
self.dropout_covariate_cols = list(self.model_config["dropout_covariate_cols"])
self.covariate_cols = self.purchase_covariate_cols + self.dropout_covariate_cols
self._validate_cols(
data,
required_cols=[
"customer_id",
"frequency",
"recency",
"T",
*self.covariate_cols,
],
must_be_unique=["customer_id"],
)

@property
@@ -170,34 +189,156 @@
"r_prior": Prior("HalfFlat"),
"phi_dropout_prior": Prior("Uniform", lower=0, upper=1),
"kappa_dropout_prior": Prior("Pareto", alpha=1, m=1),
"purchase_coefficient_prior": Prior("Normal", mu=0, sigma=1),
"dropout_coefficient_prior": Prior("Normal", mu=0, sigma=1),
"purchase_covariate_cols": [],
"dropout_covariate_cols": [],
}

def build_model(self) -> None: # type: ignore[override]
"""Build the model."""
coords = {
"purchase_covariate": self.purchase_covariate_cols,
"dropout_covariate": self.dropout_covariate_cols,
"customer_id": self.data["customer_id"],
"obs_var": ["recency", "frequency"],
}
with pm.Model(coords=coords) as self.model:
# purchase rate priors
alpha = self.model_config["alpha_prior"].create_variable("alpha")
r = self.model_config["r_prior"].create_variable("r")
if self.purchase_covariate_cols:
purchase_data = pm.Data(
"purchase_data",
self.data[self.purchase_covariate_cols],
dims=["customer_id", "purchase_covariate"],
)
self.model_config[
"purchase_coefficient_prior"
].dims = "purchase_covariate"
purchase_coefficient_gamma1 = self.model_config[
"purchase_coefficient_prior"
].create_variable("purchase_coefficient_gamma1")
Collaborator: Why is the gamma1 suffix being used here?

Contributor: Possible to use model coordinates instead?

alpha_scale = self.model_config["alpha_prior"].create_variable(
"alpha_scale"
)
alpha = pm.Deterministic(
"alpha",
(
alpha_scale
* pm.math.exp(
-pm.math.dot(purchase_data, purchase_coefficient_gamma1)
)
),
dims="customer_id",
)
else:
alpha = self.model_config["alpha_prior"].create_variable("alpha")

# dropout priors
if "a_prior" in self.model_config and "b_prior" in self.model_config:
a = self.model_config["a_prior"].create_variable("a")
b = self.model_config["b_prior"].create_variable("b")
if self.dropout_covariate_cols:
dropout_data = pm.Data(
"dropout_data",
self.data[self.dropout_covariate_cols],
dims=["customer_id", "dropout_covariate"],
)

self.model_config[
"dropout_coefficient_prior"
].dims = "dropout_covariate"
dropout_coefficient_gamma2 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
ColtAllen (Collaborator): Can we change these _gamma% suffixes to _alpha and _beta? Gamma is a confusing term because it pertains to the purchasing process in the research.

PabloRoque (Author): I was trying to follow the convention here: [screenshot of the notation in the research note]. There is no beta, but a and b (I can call them coefficient_a and coefficient_b if you like).

  • We can rename them purchase_coefficient and dropout_coefficient to follow the implementation in ParetoNBDModel, but then the gamma2 and gamma3 coefficients must be equal. This is in fact how the implementation in R's CLVTools is done.
  • I tried that implementation, and it does not help with convergence.

ColtAllen (Collaborator): Oh my mistake - I meant _a and _b.

Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.

Also, which implementation is not helping with convergence?

PabloRoque (Author, Jan 24, 2025): That is indeed the case; see the links in the original thread. And it was the CLVTools-style implementation, fixing gamma2 = gamma3, that did not help with the test issues.

ColtAllen (Collaborator): My best guess as to why the CLVTools developers did this was for easier interpretability and/or to speed up model fits. What's weird is why they still went with it if convergence is negatively impacted. This could be a good selling point for pymc-marketing compared to other open-source tools.

Explaining covariate impacts on overall dropout in terms of separate a and b coefficients will be tricky, but not impossible. Generally, if a > b, then p increases, and vice versa. The greater both values are, the narrower the distribution.

a_scale = self.model_config["a_prior"].create_variable("a_scale")
b_scale = self.model_config["b_prior"].create_variable("b_scale")
a = pm.Deterministic(
"a",
a_scale
* pm.math.exp(
pm.math.dot(dropout_data, dropout_coefficient_gamma2)
),
dims="customer_id",
)
b = pm.Deterministic(
"b",
b_scale
* pm.math.exp(
pm.math.dot(dropout_data, dropout_coefficient_gamma3)
),
dims="customer_id",
)
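The interpretability point raised in the review discussion about separate a and b coefficients comes down to Beta-distribution moments: the expected dropout probability is a / (a + b), so a > b pushes it above 0.5, and a larger a + b narrows the distribution around that mean. A small self-contained check (parameter values are arbitrary):

```python
# Mean and variance of Beta(a, b), illustrating the reviewer's remark:
# a > b raises the expected dropout probability p; growing both a and b
# at a fixed ratio shrinks the variance.
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_mean(3.0, 1.0))  # a > b: mean above 0.5
print(beta_mean(1.0, 3.0))  # a < b: mean below 0.5
print(beta_var(2.0, 2.0), beta_var(20.0, 20.0))  # same mean, narrower at larger scale
```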
else:
a = self.model_config["a_prior"].create_variable("a")
b = self.model_config["b_prior"].create_variable("b")
else:
# hierarchical pooling of dropout rate priors
phi_dropout = self.model_config["phi_dropout_prior"].create_variable(
"phi_dropout"
)
kappa_dropout = self.model_config[
"kappa_dropout_prior"
].create_variable("kappa_dropout")

a = pm.Deterministic("a", phi_dropout * kappa_dropout)
b = pm.Deterministic("b", (1.0 - phi_dropout) * kappa_dropout)
if self.dropout_covariate_cols:
ColtAllen (Collaborator, Feb 12, 2025): Have you tested the phi/kappa hierarchical pooling for a_scale and b_scale with covariates? Any major differences in results and/or fit times compared to specifying a prior for these params?

PabloRoque (Author): You can play with it in #1430 (docs/source/notebooks/clv/dev/bg_nbd_covariates.ipynb) by removing the custom prior and using default_model_config, as in:

bgnbd = clv.BetaGeoModel(
    rfm_data,
)

It works. The model fits, but we get quite biased estimates.

ColtAllen (Collaborator): So specifying a_prior and b_prior in the model config is recommended when using covariates? We should probably mention this somewhere.

PabloRoque (Author): With a small dataset (which is the case in #1430), introducing sensible priors works better than the default kappa, gamma priors. I did not try different priors for kappa, gamma.

I would defer advising on particular priors until we study default priors in the newly opened issue #1496.
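As a concrete companion to the exchange above, a model_config that sets a_prior and b_prior explicitly might look like the following. This is a hypothetical sketch: the chosen distributions and values are illustrative assumptions, not tuned recommendations, and the Prior import path may vary by package version.

```python
# Hypothetical model_config overriding the dropout priors directly, bypassing
# the phi/kappa hierarchical pooling, as discussed in the review thread.
from pymc_marketing.prior import Prior  # import path may differ by version

model_config = {
    "a_prior": Prior("HalfNormal", sigma=10),
    "b_prior": Prior("HalfNormal", sigma=10),
    "dropout_coefficient_prior": Prior("Normal", mu=0, sigma=1),
    "dropout_covariate_cols": ["channel"],  # hypothetical covariate column
}

# model = clv.BetaGeoModel(rfm_data, model_config=model_config)  # sketch only
```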

dropout_data = pm.Data(
"dropout_data",
self.data[self.dropout_covariate_cols],
dims=["customer_id", "dropout_covariate"],
)

self.model_config[
"dropout_coefficient_prior"
].dims = "dropout_covariate"
dropout_coefficient_gamma2 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
Collaborator: See previous comment. This hierarchical pooling conditional block may be worth its own internal method because it appears in several models, but we can leave that as a separate PR.


phi_dropout = self.model_config[
"phi_dropout_prior"
].create_variable("phi_dropout")
kappa_dropout = self.model_config[
"kappa_dropout_prior"
].create_variable("kappa_dropout")

a_scale = pm.Deterministic(
"a_scale", phi_dropout * kappa_dropout, dims="customer_id"
)
b_scale = pm.Deterministic(
"b_scale",
(1.0 - phi_dropout) * kappa_dropout,
dims="customer_id",
)

a = pm.Deterministic(
"a",
a_scale
* pm.math.exp(
pm.math.dot(dropout_data, dropout_coefficient_gamma2)
),
dims="customer_id",
)
b = pm.Deterministic(
"b",
b_scale
* pm.math.exp(
pm.math.dot(dropout_data, dropout_coefficient_gamma3)
),
dims="customer_id",
)

else:
phi_dropout = self.model_config[
"phi_dropout_prior"
].create_variable("phi_dropout")
kappa_dropout = self.model_config[
"kappa_dropout_prior"
].create_variable("kappa_dropout")

a = pm.Deterministic("a", phi_dropout * kappa_dropout)
b = pm.Deterministic("b", (1.0 - phi_dropout) * kappa_dropout)

# r remains unchanged with or without covariates
r = self.model_config["r_prior"].create_variable("r")

BetaGeoNBD(
name="recency_frequency",
@@ -237,13 +378,60 @@
required_cols=[
"customer_id",
*customer_varnames,
*self.purchase_covariate_cols,
*self.dropout_covariate_cols,
],
must_be_unique=["customer_id"],
)

a = self.fit_result["a"]
b = self.fit_result["b"]
alpha = self.fit_result["alpha"]
customer_id = data["customer_id"]
model_coords = self.model.coords
if self.purchase_covariate_cols:
purchase_xarray = xarray.DataArray(
data[self.purchase_covariate_cols],
dims=["customer_id", "purchase_covariate"],
coords=[customer_id, list(model_coords["purchase_covariate"])],
)
alpha_scale = self.fit_result["alpha_scale"]
purchase_coefficient_gamma1 = self.fit_result["purchase_coefficient_gamma1"]
alpha = alpha_scale * np.exp(
-xarray.dot(
purchase_xarray,
purchase_coefficient_gamma1,
dim="purchase_covariate",
)
)
alpha.name = "alpha"
else:
alpha = self.fit_result["alpha"]

if self.dropout_covariate_cols:
dropout_xarray = xarray.DataArray(
data[self.dropout_covariate_cols],
dims=["customer_id", "dropout_covariate"],
coords=[customer_id, list(model_coords["dropout_covariate"])],
)
a_scale = self.fit_result["a_scale"]
dropout_coefficient_gamma2 = self.fit_result["dropout_coefficient_gamma2"]
dropout_coefficient_gamma3 = self.fit_result["dropout_coefficient_gamma3"]

a = a_scale * np.exp(
xarray.dot(
dropout_xarray, dropout_coefficient_gamma2, dim="dropout_covariate"
)
)
a.name = "a"
b_scale = self.fit_result["b_scale"]
b = b_scale * np.exp(
xarray.dot(
dropout_xarray, dropout_coefficient_gamma3, dim="dropout_covariate"
)
)
b.name = "b"
else:
a = self.fit_result["a"]
b = self.fit_result["b"]

r = self.fit_result["r"]

customer_vars = to_xarray(
@@ -605,14 +793,30 @@
coords = self.model.coords.copy() # type: ignore
coords["customer_id"] = data["customer_id"]

with pm.Model(coords=coords):
a = pm.HalfFlat("a")
b = pm.HalfFlat("b")
alpha = pm.HalfFlat("alpha")
r = pm.HalfFlat("r")
with pm.Model(coords=coords) as pred_model:
if self.purchase_covariate_cols:
alpha = pm.Flat("alpha", dims=["customer_id"])
else:
alpha = pm.Flat("alpha")

pm.Beta("dropout", alpha=a, beta=b)
pm.Gamma("purchase_rate", alpha=r, beta=alpha)
if self.dropout_covariate_cols:
a = pm.Flat("a", dims=["customer_id"])
b = pm.Flat("b", dims=["customer_id"])
else:
a = pm.Flat("a")
b = pm.Flat("b")

r = pm.Flat("r")

pm.Beta(
"dropout", alpha=a, beta=b, dims=pred_model.named_vars_to_dims.get("a")
)
pm.Gamma(
"purchase_rate",
alpha=r,
beta=alpha,
dims=pred_model.named_vars_to_dims.get("alpha"),
)
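The dims bookkeeping here matters because with purchase covariates every new customer has their own alpha, so purchase_rate is a per-customer Gamma. A numpy sketch (r and alpha values are arbitrary) of the implied heterogeneous rates:

```python
import numpy as np

rng = np.random.default_rng(1)
r = 0.7
alpha = np.array([3.0, 6.0, 12.0])  # per-customer rate parameter under covariates

# Per-customer purchase rates lambda_i ~ Gamma(r, rate=alpha_i); numpy's
# Generator.gamma takes a scale, so scale = 1 / alpha, and E[lambda_i] = r / alpha_i.
lam = rng.gamma(shape=r, scale=1.0 / alpha, size=(10_000, 3))
print(lam.mean(axis=0))  # ≈ r / alpha
```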

BetaGeoNBD(
name="recency_frequency",