Add ParetoNBDModel #177

ColtAllen · 2023-02-24T16:41:10Z

This PR closes #127.

Outstanding Issues

mypy CI errors
https://github.com/ColtAllen/pymc-marketing/blob/clv_pareto_nbd/pymc_marketing/clv/models/pareto_nbd.py#L172
https://github.com/ColtAllen/pymc-marketing/blob/clv_pareto_nbd/pymc_marketing/clv/models/pareto_nbd.py#L526
Consensus on predictive method names
docs build errors
Revise docstring references

UML Diagram

Important Changes

I've marked this model as experimental because NUTS is having convergence errors due to pytensor.Tensor.hyp2f1 in the model logp. @ricardoV94 has been trying out some improvements for that function.

I renamed the predictive methods in this model to be more concise:
expected_purchases
expected_purchases_new_customer
expected_probability_alive
expected_purchase_probability

However, these are not consistent with the equivalent method names in BetaGeoModel. These names need to be standardized for plotting functionality, so it's important we're in agreement on the naming conventions.

xarray-einstats>=0.5.1 is now a required library dependency.

Customer input data arrays are also assigned to class attributes as default predictive arguments in this model, making life easier for end users unless they want to run predictions on new customers.
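
The default-argument behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `ToyCLVModel` is a hypothetical stand-in, and the arithmetic inside `expected_purchases` is a placeholder rather than the Pareto/NBD expectation.

```python
from typing import List, Optional, Sequence


class ToyCLVModel:
    """Hypothetical stand-in for a CLV model; not the ParetoNBDModel API."""

    def __init__(self, frequency: Sequence[float], T: Sequence[float]) -> None:
        # Keep the fitting data so predictive methods can reuse it by default.
        self.frequency = list(frequency)
        self.T = list(T)

    def expected_purchases(
        self,
        future_t: float,
        frequency: Optional[Sequence[float]] = None,
        T: Optional[Sequence[float]] = None,
    ) -> List[float]:
        # Fall back to the stored customer data when nothing is passed.
        freq_values = self.frequency if frequency is None else list(frequency)
        T_values = self.T if T is None else list(T)
        # Placeholder arithmetic (observed purchase rate times horizon),
        # used only to make the sketch runnable.
        return [future_t * x / t for x, t in zip(freq_values, T_values)]


model = ToyCLVModel(frequency=[2, 4], T=[10, 20])

# No data arguments: predictions use the customers the model was built with.
in_sample = model.expected_purchases(future_t=5)

# Explicit data arguments: predictions for customers the model has not seen.
new_customers = model.expected_purchases(future_t=5, frequency=[1], T=[10])
```

With this pattern, the common case needs only the time horizon, and new customers are handled by passing their arrays explicitly.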

I also added some additional utility methods which can be used in other CLV models if moved to CLVModel, and would go towards resolving #182.

Unlike the other CLV models, ParetoNBDModel uses a dedicated ParetoNBD distribution block for the logp. My rationale for this is that the distribution class will append an observed column for customer frequency and recency inputs to the returned arviz.InferenceData object, and enable additional diagnostic plots like the Bayesian p-value and PPC plots once those methods are added:

https://python.arviz.org/en/stable/examples/index.html#Model%20Checking

I've documented some future additions to this model in #183 since this PR already has enough to review.

codecov · 2023-02-24T16:58:35Z

Codecov Report

Merging #177 (e413ece) into main (7ebc194) will increase coverage by 0.05%.
The diff coverage is 95.65%.

❗ Current head e413ece differs from pull request most recent head 0b82010. Consider uploading reports for the commit 0b82010 to get more accurate results

@@            Coverage Diff             @@
##             main     #177      +/-   ##
==========================================
+ Coverage   94.12%   94.17%   +0.05%     
==========================================
  Files          17       19       +2     
  Lines         919     1168     +249     
==========================================
+ Hits          865     1100     +235     
- Misses         54       68      +14

Impacted Files	Coverage Δ
pymc_marketing/clv/__init__.py	`100.00% <ø> (ø)`
pymc_marketing/clv/models/pareto_nbd.py	`88.48% <88.48%> (ø)`
pymc_marketing/mmm/transformers.py	`94.44% <93.33%> (-5.56%)`	⬇️
pymc_marketing/mmm/base.py	`95.36% <98.18%> (+0.36%)`	⬆️
pymc_marketing/clv/distributions.py	`100.00% <100.00%> (ø)`
pymc_marketing/clv/models/__init__.py	`100.00% <100.00%> (ø)`
pymc_marketing/clv/models/basic.py	`98.52% <100.00%> (+0.41%)`	⬆️
pymc_marketing/clv/models/beta_geo.py	`100.00% <100.00%> (+1.07%)`	⬆️
pymc_marketing/clv/models/gamma_gamma.py	`98.01% <100.00%> (-0.13%)`	⬇️
pymc_marketing/clv/models/shifted_beta_geo.py	`100.00% <100.00%> (ø)`
... and 4 more

review-notebook-app · 2023-03-01T01:48:28Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

ColtAllen · 2023-03-23T01:37:25Z

I could use some help resolving an array broadcasting issue in the purchase_probability method (see Details in the CI tests). Once this is resolved, just a few tweaks to things like docstrings and this will be ready for review.

larryshamalama

Good progress @ColtAllen! Some comments for now and I can revisit once you mark this as ready for review

larryshamalama · 2023-03-09T17:38:18Z

pymc_marketing/clv/models/pareto_nbd.py

+    # TODO: Edit docstrings
+    def expected_purchases(
+        self,
+        future_t: Union[float, np.ndarray, pd.Series, TensorVariable],


I believe that we use t in the BetaGeoModel. Would future_t be more explicit? Not opposed to the nomenclature, but it would be good to also change the other methods in corresponding models.

larryshamalama · 2023-03-28T01:46:34Z

pymc_marketing/clv/models/pareto_nbd.py

+        self._customer_id = customer_id
+        self._frequency = frequency
+        self._recency = recency
+        self._T = T


I believe that our other models have the same instance variable names, but without the underscore:

pymc-marketing/pymc_marketing/clv/models/beta_geo.py

Lines 119 to 122 in d489a88

self.customer_id = customer_id

self.frequency = frequency

self.recency = recency

self.T = T

The underscores indicate these are internal class attributes. Not strictly required unless one is a stickler for Python PEP conventions.


juanitorduz

Hey! The mypy errors should be easy to fix. Most of them come from:

Need to add None inside Union if they can be None
Make sure signatures are consistent, especially if variables can (or should not) be None.
Explicitly set the variable types before unpacking them.

😄

ColtAllen · 2023-05-28T23:02:37Z

> Hey! The mypy errors should be easy to fix. Most of them come from:
>
> Need to add None inside Union if they can be None
>
> Make sure signatures are consistent, especially if variables can (or should not) be None.
>
> Explicitly set the variable types before unpacking them.
>
> 😄

@juanitorduz Thanks! I've whittled the mypy stuff down to two repeating errors:

pymc_marketing/clv/models/pareto_nbd.py:318: error: Need more than 1 value to unpack (3 expected)  [misc]

This comes up with x, t_x, T = self._process_customers(customer_id, frequency, recency, T)

pymc_marketing/clv/models/pareto_nbd.py:322: error: Argument 7 to "_logp" of "ParetoNBDModel" 
has incompatible type "Union[ndarray[Any, Any], Series[Any], DataArray, None]"; 
expected "DataArray"  [arg-type]

This comes up because the predictive methods call _logp internally, but the latter only accepts DataArray. Adding additional datatypes to _logp will raise a does not contain '.values' attribute error.


juanitorduz · 2023-05-29T20:31:06Z

Yay! mypy and test are green 🟢 🙌

ricardoV94 · 2023-06-19T11:30:26Z

@ColtAllen I changed the order of frequency and recency arguments. I think it's more intuitive to have frequency first. The developer notebook was actually passing them like this, and getting different results because of this.
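
One general way to guard against this kind of silent argument swap is to make the data arguments keyword-only. The sketch below is an assumption for illustration, not something this PR implements; `summarize` is a hypothetical helper.

```python
def summarize(*, frequency: float, recency: float, T: float) -> dict:
    """Hypothetical helper; the bare `*` makes every argument keyword-only."""
    if recency > T:
        raise ValueError("recency cannot exceed T")
    return {"frequency": frequency, "recency": recency, "T": T}


# Keyword order no longer matters, so the two calls are equivalent.
a = summarize(frequency=3, recency=20.0, T=30.0)
b = summarize(recency=20.0, frequency=3, T=30.0)

# Positional calls, where a swap could go unnoticed, are rejected outright.
try:
    summarize(3, 20.0, 30.0)  # type: ignore[misc]
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```

The trade-off is slightly more verbose call sites in exchange for calls that cannot pass `frequency` and `recency` in the wrong positions.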

I included the rewrite that speeds up the gradient evaluation when doing NUTS, making it usable. The developer notebook is now running NUTS (the pymc-experimental one, the manual impl still runs Slice), and I removed the Slice from the docstring examples. Please review carefully and let me know if you have any questions, or if I messed something up.

I think your PR may have picked commits from main that do not belong here (looking at the file diff on Github). If that's the case, the best is perhaps to open a new PR with the final changes in a single commit.

In the future I suggest you rebase from main instead of merging, to avoid this issue. It's also better practice to keep squashing useless commits instead of letting the PR grow to 100s of commits. Each commit in the end should correspond to one logical self-contained change (eg. all tests should in theory pass after each commit, or in other words, the state of the PR should not be broken in intermediate commits).

That's just a suggestion for future PRs of course. For now, do you mind opening a new PR with only the final changes? If I am mistaken and this PR is not picking extra commits from main just let me know.

And as always, great work!

ColtAllen · 2023-06-20T04:29:10Z

Thanks @ricardoV94! One last thing before I create a new PR - the CodeCov check is failing for the new additions, and I'm not quite sure how to go about writing unit tests for these pytensor functions.

ricardoV94 · 2023-06-20T05:44:51Z

Codecov is sometimes flaky; there's no way the new PyTensor stuff isn't being tested.

ColtAllen · 2023-06-20T19:28:50Z

> I think your PR may have picked commits from main that do not belong here (looking at the file diff on Github).

Could this be due to the pymc-labs repo containing a branch with the same name as this one? @juanitorduz created it while he was helping me with the mypy stuff, and my forked repo was showing diffs for that branch instead of main. That branch is deleted now.

> If I am mistaken and this PR is not picking extra commits from main just let me know.

Do you know which commits from main do not belong in particular? I'm not sure if simply closing this PR and opening another one will revert those (though it may clear the CodeCov check).

juanitorduz · 2023-06-20T20:16:01Z

OMG sorry if I messed up 😩! I think the suggested changes were small. ~~I could try to delete the branch.~~ Actually, I do not see the branch. Also, the changes were not merged anyway. Let me know if there is anything I can do to help.

ricardoV94 · 2023-06-22T08:58:33Z

> Do you know which commits from main do not belong in particular? I'm not sure if simply closing this PR and opening another one will revert those (though it may clear the CodeCov check).

In the file changes you can see stuff like this:

Squashing all your commits in a single commit and rebasing from main (and then force-push) should fix this.
Make sure to backup your branch before so you don't risk losing it :)

Added pytest fixtures for cdnow sample and master summaries removed adhoc dataset creation script added uml diagrams to docs/source/_static/ All ParetoNBDModel methods added and TODOs created Added ParetoNBDRV and renamed ParetoNBDAggregate Added ParetoNBD distribution class Tests added for ParetoNBD distro class FAIL flake8: Test boilerplate and TODOs added for draft PR Updated docstring references Removed UML files dev notebook WIP fixed logp in notebook Notebook commit Revert "Updated docstring references" This reverts commit 81aa384. Revert all commits on distributions.py Revert "Tests added for ParetoNBD distro class" This reverts commit 2984459. Reverting all distributions.py commits Revert "Added ParetoNBD distribution class" This reverts commit 05ef483. Revert all distributions.py commits Revert "Added ParetoNBDRV and renamed ParetoNBDAggregate" This reverts commit da6bd1c. Revert all distributions.py commits Update docstring references Updated todos and added future methods removed new methods for future PR WIP test framework Fix missing underscore prefix Rename control coordinate Fix Gamma-Gamma example Implement CLV base method `_check_prior_ndim` and `_process_prior` Add Individual Shifted Beta Geometric (sBG) model Relax matplotlib dependency (#179) * Relax matplotlib dependency * ensure no conflict with numpy * simplify requirements bump version to 0..04 Update Makefile syntax revisions rewrote probability_alive test framework edits Remove unnecessary dependency on pymc test util changed method names and WIP MAP testing MMM `data_df` param renamed to `data` (#186) Fixing minor typos in README (#195) Update README.md (#194) Rename fitting_method to fit_method Speedup BetaGeoModel tests Test MAP convergence for BetaGeoModel Make docs footer smaller and more responsive Use numerically stable logp for BG/NBD rename in gamma-gamma model WIP revised tests and priors min_max_scaler -> max_abs_scaler for target variable Change ROAS scale in MMM example 
Implement vectorized adstock transformations Add more detail to the guide for contributors (#229) Extend documentation Add rtd-link-preview github action Bump PyMC dependency Bump version to 0.1.0 Removed CDNOW datasets WIP purchase_probability Fix broadcasting issue in purchase_probability Add xarray-einstats as a dependency Add all-checks job to facilitate branch protection rules Test on oldest PyMC version Add links from CLV/MMM intros to relevant notebooks (#238) Updating PyMC requirement in pyproject.toml and ci Fix LaTeX representation tests w.r.t. PyMC and PyTensor updates add contribution curves over time (#247) * add contribution curves over time * remove unused variable * unit test for new breakdown plot * added example * black * another testcase * black add ref (#249) Install via conda-forge package Update installation instructions in docs scale contributions to original scale and allow custom colors add original scale flag mypy init improve hints mmm fix mmm types mmm last fixes update black exclude folders mypy add some clv type hints some clv type hints improvements utils types done base type hints final fixes add more type tests revert doctring changes fix test where param is none Improve installation instructions in README Fix the discrepancy introduced in #253 between README.md and docs/source/index.md fixed xarray-einstats dependency version in toml Added t=0 test case for purchase_probability Reduced redundant code with _process_customers and _logp Added test for unique customer_id mypy check importable from root clv module Added mypy type overrides to distro blocks Notebook testing notebook testing Cleaned up dev notebook WIP docstrings, tests, param names add yearly seasonality to mmm proposal update nb and fix imlpementation add colourful tests ;) fix tests add more test fit cases refactor component plots method re-run example nb fix mypy fix test to avoid unnecessary fitting improve nb fix bug plotting functon improve notebook Improve 
docstrings on CLV `freq` argument Set `freq="W"` in CLV quickstart notebook changed from _ to - in pypi.yml Fixed test_model_convergence Add PyMC Marketing logo to the repository and documentation (#261) * add logo to README * remove horizontal line * add light and dark logo to readthedocs fix wrong filename for logo in README (#262) Added example Pandas code to dev notebook Docstring edits Notebook and docstring edits, experimental warning Added lifetimes citations Removed resolved TODOs Docs build fix attempt method names, type hinting, docstring and test edits ignored edge case numerical errors in purchase_prob fixed most mypy errors docs code example attempted fix fixed merge conflicts in clv/distributions another docs fix attempt docstring edits and fixed mypy errors changed default prior params Reorder frequency and recency arguments to match other models Add rewrite to speedup Hyp2F1 gradient

ColtAllen · 2023-06-23T14:16:05Z

Ah, that was my crude resolution of a merge conflict. I can't make any changes until I fix my dev environment per #305, but I'll go ahead and get this PR re-opened.

ColtAllen · 2023-06-23T14:18:28Z

Seems I screwed up the rebase 😕

Here's what I did:

In the clv_pareto_nbd branch:

git rebase -i 6d5d98c
(change 'pick' into 'squash' a kajillion times via Vim, saved, then saved commit message)
git push --force-with-lease origin

@ricardoV94 Should I have done something with main in there somewhere? My rebase picked up a bunch of MMM commits.

ricardoV94 · 2023-06-23T15:04:06Z

I am not sure what you can do when you get to this stage. In the future to avoid getting here, my suggested workflow is:

Always squash useless commits in the PR (so you won't have a gazillion conflicts to fix when rebasing). You should only have separate commits for self-contained independent changes in the code.
Always rebase from main, not pull, to get the code up to date without "picking" changes that don't belong to the PR. Otherwise it's difficult to review on GitHub.

When you get to this stage, I suggest you just copy paste the changes in a fresh branch from main. Then you can open a new PR with that branch. If you didn't backup your branch, you can recover it with reflog: https://github.blog/2015-06-08-how-to-undo-almost-anything-with-git/

ricardoV94 · 2023-06-23T15:09:36Z

For git reflog: https://stackoverflow.com/a/10099285

ColtAllen · 2023-06-25T15:00:15Z

Let's go with the nuclear option then. I'll open a new PR with the backup branch I created, but if the changes still need to be copy/pasted over manually, I also created a pareto_nbd branch off the latest version of main that we can work with.

remote tracking test

6d5d98c

ColtAllen added the docs, enhancement, help wanted, CLV, and tests labels Feb 24, 2023

ColtAllen requested review from larryshamalama and ricardoV94 February 24, 2023 16:41

ColtAllen self-assigned this Feb 24, 2023


ColtAllen mentioned this pull request Feb 28, 2023

UML Diagrams #178

Closed

ColtAllen removed the docs label Feb 28, 2023

ColtAllen changed the title ~~Add ParetoNBDModel & distro class, UML diagrams, CDNOW test datasets~~ Add ParetoNBDModel & distro class, CDNOW test datasets Feb 28, 2023

ColtAllen mentioned this pull request Feb 28, 2023

Add BetaGeoBetaBinomModel #176

Closed

ColtAllen removed the tests label Mar 2, 2023

larryshamalama mentioned this pull request Mar 3, 2023

CLV Distribution RVs not Model-Specific #128

Open

ColtAllen mentioned this pull request Mar 3, 2023

Additional enhancements for ParetoNBDModel #183

Open

6 tasks

ColtAllen changed the title ~~Add ParetoNBDModel & distro class, CDNOW test datasets~~ Add ParetoNBDModel & CDNOW test datasets Mar 3, 2023

ColtAllen removed the help wanted label Mar 4, 2023

ColtAllen mentioned this pull request Mar 9, 2023

Add BetaGeoBetaBinomModel #188

Closed

larryshamalama reviewed Mar 28, 2023

ColtAllen mentioned this pull request Mar 28, 2023

Add Seasonality to CLV Models #219

Open

ricardoV94 reviewed Mar 29, 2023

pymc_marketing/clv/models/pareto_nbd.py (outdated)

ricardoV94 reviewed Mar 29, 2023

pymc_marketing/clv/models/pareto_nbd.py (outdated)

ricardoV94 reviewed Mar 29, 2023

pymc_marketing/clv/models/pareto_nbd.py (outdated)

ColtAllen changed the title ~~Add ParetoNBDModel & CDNOW test datasets~~ Add ParetoNBDModel Apr 8, 2023

juanitorduz reviewed May 10, 2023

pymc_marketing/clv/models/pareto_nbd.py (outdated)

juanitorduz reviewed May 10, 2023

pymc_marketing/clv/models/pareto_nbd.py