Add ParetoNBDModel #177

Closed
wants to merge 2 commits into from

ColtAllen
Collaborator

@ColtAllen ColtAllen commented Feb 24, 2023

This PR closes #127.

Outstanding Issues

UML Diagram

classes

Important Changes

I've marked this model as experimental because NUTS is having convergence errors due to pytensor.Tensor.hyp2f1 in the model logp. @ricardoV94 has been trying out some improvements for that function.
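For readers unfamiliar with the function, here is a minimal pure-Python sketch of the Gauss hypergeometric function summed from its power series. It is not the PyTensor implementation, and `hyp2f1_series` is a hypothetical helper introduced only for illustration. It shows that the number of terms needed grows quickly as z approaches 1, which gives a rough sense of why evaluating this function and its gradients can be costly.

```python
import math


def hyp2f1_series(a, b, c, z, tol=1e-12, max_terms=100_000):
    """Sum the Gauss series for 2F1(a, b; c; z), valid for |z| < 1.

    Returns (value, number_of_terms_used). Hypothetical helper, illustration only.
    """
    if abs(z) >= 1:
        raise ValueError("the plain series only converges for |z| < 1")
    term = 1.0   # the n = 0 term of the series
    total = 1.0
    for n in range(max_terms):
        # Ratio of consecutive terms: (a+n)(b+n) / ((c+n)(n+1)) * z
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            return total, n + 2
    raise RuntimeError("series did not converge within max_terms")


if __name__ == "__main__":
    # Known closed form used as a check: 2F1(1, 1; 2; z) = -ln(1 - z) / z
    for z in (0.5, 0.99):
        value, n_terms = hyp2f1_series(1.0, 1.0, 2.0, z)
        print(z, value, -math.log(1.0 - z) / z, n_terms)
```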

I renamed the predictive methods in this model to be more concise:
expected_purchases
expected_purchases_new_customer
expected_probability_alive
expected_purchase_probability

However, these are not consistent with the equivalent method names in BetaGeoModel. These names need to be standardized for plotting functionality, so it's important we're in agreement on the naming conventions.

xarray-einstats>=0.5.1 is now a required library dependency.

Customer input data arrays are also assigned to class attributes as default predictive arguments in this model, making life easier for end users unless they want to run predictions on new customers.
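A minimal sketch of that pattern, assuming a simplified stand-in class: `CustomerModelSketch` and its arithmetic are hypothetical and are not the ParetoNBDModel API or the Pareto/NBD formula. The point is only that the fitting data is stored on the instance and reused when a predictive method receives no data.

```python
from typing import List, Optional, Sequence


class CustomerModelSketch:
    """Hypothetical stand-in for a CLV model; not the ParetoNBDModel API."""

    def __init__(self, customer_id: Sequence[int], frequency: Sequence[float]):
        # Keep the fitting data so predictive methods can reuse it.
        self._customer_id = list(customer_id)
        self._frequency = list(frequency)

    def expected_purchases(
        self, future_t: float, frequency: Optional[Sequence[float]] = None
    ) -> List[float]:
        # Fall back to the stored data when the caller passes nothing.
        if frequency is None:
            frequency = self._frequency
        # Placeholder arithmetic, NOT the Pareto/NBD expectation.
        return [future_t * f for f in frequency]


model = CustomerModelSketch(customer_id=[1, 2], frequency=[2.0, 4.0])
print(model.expected_purchases(0.5))                    # uses the stored customers
print(model.expected_purchases(0.5, frequency=[10.0]))  # new customer data
```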

I also added some utility methods that could be used in other CLV models if moved to CLVModel, which would go towards resolving #182.

Unlike the other CLV models, ParetoNBDModel uses a dedicated ParetoNBD distribution block for the logp. My rationale for this is that the distribution class will append an observed column for customer frequency and recency inputs to the returned arviz.InferenceData object, and enable additional diagnostic plots like the Bayesian p-value and PPC plots once those methods are added:

https://python.arviz.org/en/stable/examples/index.html#Model%20Checking

I've documented some future additions to this model in #183 since this PR already has enough to review.

@ColtAllen added the docs, enhancement, help wanted, CLV, and tests labels Feb 24, 2023
@ColtAllen self-assigned this Feb 24, 2023
@codecov

codecov bot commented Feb 24, 2023

Codecov Report

Merging #177 (e413ece) into main (7ebc194) will increase coverage by 0.05%.
The diff coverage is 95.65%.

❗ Current head e413ece differs from pull request most recent head 0b82010. Consider uploading reports for the commit 0b82010 to get more accurate results

@@            Coverage Diff             @@
##             main     #177      +/-   ##
==========================================
+ Coverage   94.12%   94.17%   +0.05%     
==========================================
  Files          17       19       +2     
  Lines         919     1168     +249     
==========================================
+ Hits          865     1100     +235     
- Misses         54       68      +14     
| Impacted Files | Coverage Δ |
|---|---|
| `pymc_marketing/clv/__init__.py` | 100.00% <ø> (ø) |
| `pymc_marketing/clv/models/pareto_nbd.py` | 88.48% <88.48%> (ø) |
| `pymc_marketing/mmm/transformers.py` | 94.44% <93.33%> (-5.56%) ⬇️ |
| `pymc_marketing/mmm/base.py` | 95.36% <98.18%> (+0.36%) ⬆️ |
| `pymc_marketing/clv/distributions.py` | 100.00% <100.00%> (ø) |
| `pymc_marketing/clv/models/__init__.py` | 100.00% <100.00%> (ø) |
| `pymc_marketing/clv/models/basic.py` | 98.52% <100.00%> (+0.41%) ⬆️ |
| `pymc_marketing/clv/models/beta_geo.py` | 100.00% <100.00%> (+1.07%) ⬆️ |
| `pymc_marketing/clv/models/gamma_gamma.py` | 98.01% <100.00%> (-0.13%) ⬇️ |
| `pymc_marketing/clv/models/shifted_beta_geo.py` | 100.00% <100.00%> (ø) |
| ... and 4 more | |

@ricardoV94

This comment was marked as resolved.

@ColtAllen

This comment was marked as resolved.

@ColtAllen mentioned this pull request Feb 28, 2023
@ColtAllen removed the docs label Feb 28, 2023
@ColtAllen changed the title from "Add ParetoNBDModel & distro class, UML diagrams, CDNOW test datasets" to "Add ParetoNBDModel & distro class, CDNOW test datasets" Feb 28, 2023
@ColtAllen removed the tests label Mar 2, 2023
@ColtAllen changed the title from "Add ParetoNBDModel & distro class, CDNOW test datasets" to "Add ParetoNBDModel & CDNOW test datasets" Mar 3, 2023
@ColtAllen removed the help wanted label Mar 4, 2023
@ColtAllen
Collaborator Author

I could use some help resolving an array broadcasting issue in the purchase_probability method (see Details in the CI tests). Once this is resolved, just a few tweaks to things like docstrings and this will be ready for review.

Contributor

@larryshamalama larryshamalama left a comment

Good progress @ColtAllen! Some comments for now and I can revisit once you mark this as ready for review

# TODO: Edit docstrings
def expected_purchases(
self,
future_t: Union[float, np.ndarray, pd.Series, TensorVariable],
Contributor

I believe that we use t in the BetaGeoModel. Would future_t be more explicit? Not opposed to the nomenclature, but it would be good to also change the other methods in corresponding models.

self._customer_id = customer_id
self._frequency = frequency
self._recency = recency
self._T = T
Contributor

I believe that our other models have the same instance variable names, but without the underscore:

self.customer_id = customer_id
self.frequency = frequency
self.recency = recency
self.T = T

Collaborator Author

The underscores indicate that these are internal class attributes. They are not strictly required unless one is a stickler for Python PEP conventions.

@ColtAllen changed the title from "Add ParetoNBDModel & CDNOW test datasets" to "Add ParetoNBDModel" Apr 8, 2023
Collaborator

@juanitorduz juanitorduz left a comment

Hey! The mypy errors should be easy to fix. Most of them come from:

  • Need to add None inside Union if they can be None
  • Make sure signatures are consistent, especially if variables can (or should not) be None.
  • Explicitly set the variable types before unpacking them.

😄
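The three tips above can be sketched with standard-library types only. The function names `scale` and `split_customers` are hypothetical and do not appear in pymc-marketing; this is an illustration of the annotation patterns, not the fix applied in this PR.

```python
from typing import List, Optional, Tuple, Union


def scale(
    values: Union[List[float], None], factor: Optional[float] = None
) -> List[float]:
    # Tip 1: None is listed in the annotation because callers may pass it.
    if values is None:
        return []
    # Tip 2: the default (None) agrees with the annotation (Optional[float]).
    multiplier = 1.0 if factor is None else factor
    return [v * multiplier for v in values]


def split_customers(rows: List[Tuple[int, float]]) -> Tuple[List[int], List[float]]:
    # Tip 3: declare the types of the targets before unpacking into them.
    ids: List[int]
    spend: List[float]
    ids, spend = [r[0] for r in rows], [r[1] for r in rows]
    return ids, spend


print(scale([1.0, 2.0], factor=3.0))
print(split_customers([(1, 9.5), (2, 4.0)]))
```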

@ColtAllen
Collaborator Author

> Hey! The mypy errors should be easy to fix. Most of them come from:
>
>   • Need to add None inside Union if they can be None
>   • Make sure signatures are consistent, especially if variables can (or should not) be None.
>   • Explicitly set the variable types before unpacking them.
>
> 😄

@juanitorduz Thanks! I've whittled the mypy stuff down to two repeating errors:

pymc_marketing/clv/models/pareto_nbd.py:318: error: Need more than 1 value to unpack (3 expected)  [misc]

This comes up with x, t_x, T = self._process_customers(customer_id, frequency, recency, T)

pymc_marketing/clv/models/pareto_nbd.py:322: error: Argument 7 to "_logp" of "ParetoNBDModel" 
has incompatible type "Union[ndarray[Any, Any], Series[Any], DataArray, None]"; 
expected "DataArray"  [arg-type]

This comes up because the predictive methods call _logp internally, but _logp only accepts a DataArray. Widening its signature to the other data types raises a "does not contain '.values' attribute" error.
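One common way to address both kinds of error, sketched here with plain lists instead of DataArray: give the helper an explicit fixed-length tuple return type so the three-way unpacking can be checked, and narrow an optional argument to a single concrete type before handing it to the internal function. The names below (`process_customers`, `_logp_stub`, `predict`) are hypothetical stand-ins, and the arithmetic is a placeholder rather than the model's log-likelihood.

```python
from typing import List, Optional, Sequence, Tuple


def process_customers(
    frequency: Sequence[float], recency: Sequence[float], T: Sequence[float]
) -> Tuple[List[float], List[float], List[float]]:
    # A three-element Tuple[...] return type lets mypy verify `x, t_x, T = ...`.
    return list(frequency), list(recency), list(T)


def _logp_stub(x: List[float]) -> float:
    # Internal helper that accepts exactly one concrete type (placeholder math).
    return sum(x)


def predict(stored: List[float], frequency: Optional[Sequence[float]] = None) -> float:
    # Narrow Optional[...] to the single type the helper expects before calling it.
    data: List[float] = stored if frequency is None else list(frequency)
    return _logp_stub(data)


x, t_x, T = process_customers([2.0], [30.0], [40.0])
print(x, t_x, T)
print(predict(stored=[1.0, 2.0]))
print(predict(stored=[1.0, 2.0], frequency=(5.0, 5.0)))
```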

@juanitorduz
Collaborator

Yay! mypy and test are green 🟢 🙌

@ricardoV94
Contributor

ricardoV94 commented Jun 19, 2023

@ColtAllen I changed the order of frequency and recency arguments. I think it's more intuitive to have frequency first. The developer notebook was actually passing them like this, and getting different results because of this.
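Swapped positional arguments of the same type fail silently, which is how the notebook ended up with different results. One general guard, shown as a hypothetical sketch and not as something this PR adopted, is to make such arguments keyword-only so that every call site has to name them. The function `summarize` and its validation rule are assumptions for illustration.

```python
def summarize(*, frequency: float, recency: float, T: float) -> dict:
    # The bare `*` makes every argument keyword-only, so callers must name them.
    if recency > T:
        raise ValueError("recency cannot exceed T")
    return {"frequency": frequency, "recency": recency, "T": T}


print(summarize(frequency=2, recency=30.0, T=40.0))

try:
    summarize(2, 30.0, 40.0)  # positional call is rejected
except TypeError as err:
    print("TypeError:", err)
```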

I included the rewrite that speeds up the gradient evaluation when doing NUTS, making it usable. The developer notebook is now running NUTS (the pymc-experimental one; the manual implementation still runs Slice), and I removed the Slice from the docstring examples. Please review carefully and let me know if you have any questions, or if I messed something up.

I think your PR may have picked up commits from main that do not belong here (looking at the file diff on GitHub). If that's the case, the best option is perhaps to open a new PR with the final changes in a single commit.

In the future I suggest you rebase from main instead of merging, to avoid this issue. It's also better practice to keep squashing useless commits instead of letting the PR grow to 100s of commits. Each commit in the end should correspond to one logical self-contained change (eg. all tests should in theory pass after each commit, or in other words, the state of the PR should not be broken in intermediate commits).

That's just a suggestion for future PRs of course. For now, do you mind opening a new PR with only the final changes? If I am mistaken and this PR is not picking extra commits from main just let me know.

And as always, great work!

@ColtAllen
Collaborator Author

ColtAllen commented Jun 20, 2023

Thanks @ricardoV94! One last thing before I create a new PR - the CodeCov check is failing for the new additions, and I'm not quite sure how to go about writing unit tests for these pytensor functions.

@ricardoV94
Contributor

The codecov is sometimes flaky, there's no way the new PyTensor stuff isn't being tested.

@ColtAllen
Collaborator Author

> I think your PR may have picked commits from main that do not belong here (looking at the file diff on Github).

Could this be due to the pymc-labs repo containing a branch with the same name as this one? @juanitorduz created it while he was helping me with the mypy stuff, and my forked repo was showing diffs for that branch instead of main. That branch is deleted now.

> If I am mistaken and this PR is not picking extra commits from main just let me know.

Do you know which commits from main do not belong in particular? I'm not sure if simply closing this PR and opening another one will revert those (though it may clear the CodeCov check).

@juanitorduz
Collaborator

juanitorduz commented Jun 20, 2023

OMG sorry if I messed up 😩! I think the suggested changes were small. I could try to delete the branch. Actually, I do not see the branch. Also, the changes were not merged anyway. Let me know if there is anything I can do to help.

@ricardoV94
Contributor

ricardoV94 commented Jun 22, 2023

> Do you know which commits from main do not belong in particular? I'm not sure if simply closing this PR and opening another one will revert those (though it may clear the CodeCov check).

In the file changes you can see stuff like this:
image

Squashing all your commits in a single commit and rebasing from main (and then force-push) should fix this.
Make sure to backup your branch before so you don't risk losing it :)

Added pytest fixtures for cdnow sample and master summaries

removed adhoc dataset creation script

added uml diagrams to docs/source/_static/

All ParetoNBDModel methods added and TODOs created

Added ParetoNBDRV and renamed ParetoNBDAggregate

Added ParetoNBD distribution class

Tests added for ParetoNBD distro class

FAIL flake8: Test boilerplate and TODOs added for draft PR

Updated docstring references

Removed UML files

dev notebook WIP

fixed logp in notebook

Notebook commit

Revert "Updated docstring references"

This reverts commit 81aa384.

Revert all commits on distributions.py

Revert "Tests added for ParetoNBD distro class"

This reverts commit 2984459.

Reverting all distributions.py commits

Revert "Added ParetoNBD distribution class"

This reverts commit 05ef483.

Revert all distributions.py commits

Revert "Added ParetoNBDRV and renamed ParetoNBDAggregate"

This reverts commit da6bd1c.

Revert all distributions.py commits

Update docstring references

Updated todos and added future methods

removed new methods for future PR

WIP test framework

Fix missing underscore prefix

Rename control coordinate

Fix Gamma-Gamma example

Implement CLV base method `_check_prior_ndim` and `_process_prior`

Add Individual Shifted Beta Geometric (sBG) model

Relax matplotlib dependency (#179)

* Relax matplotlib dependency

* ensure no conflict with numpy

* simplify requirements

bump version to 0..04

Update Makefile

syntax revisions

rewrote probability_alive

test framework edits

Remove unnecessary dependency on pymc test util

changed method names and WIP MAP testing

MMM `data_df` param renamed to `data` (#186)

Fixing minor typos in README (#195)

Update README.md (#194)

Rename fitting_method to fit_method

Speedup BetaGeoModel tests

Test MAP convergence for BetaGeoModel

Make docs footer smaller and more responsive

Use numerically stable logp for BG/NBD

rename in gamma-gamma model

WIP revised tests and priors

min_max_scaler -> max_abs_scaler for target variable

Change ROAS scale in MMM example

Implement vectorized adstock transformations

Add more detail to the guide for contributors (#229)

Extend documentation

Add rtd-link-preview github action

Bump PyMC dependency

Bump version to 0.1.0

Removed CDNOW datasets

WIP purchase_probability

Fix broadcasting issue in purchase_probability

Add xarray-einstats as a dependency

Add all-checks job to facilitate branch protection rules

Test on oldest PyMC version

Add links from CLV/MMM intros to relevant notebooks (#238)

Updating PyMC requirement in pyproject.toml and ci

Fix LaTeX representation tests w.r.t. PyMC and PyTensor updates

add contribution curves over time (#247)

* add contribution curves over time

* remove unused variable

* unit test for new breakdown plot

* added example

* black

* another testcase

* black

add ref (#249)

Install via conda-forge package

Update installation instructions in docs

scale contributions to original scale and allow custom colors

add original scale flag

mypy init

improve hints mmm

fix mmm types

mmm last fixes

update black

exclude folders mypy

add some clv type hints

some clv type hints improvements

utils types done

base type hints

final fixes

add more type tests

revert doctring changes

fix test where param is none

Improve installation instructions in README

Fix the discrepancy introduced in #253 between README.md and docs/source/index.md

fixed xarray-einstats dependency version in toml

Added t=0 test case for purchase_probability

Reduced redundant code with _process_customers and _logp

Added test for unique customer_id

mypy check

importable from root clv module

Added mypy type overrides to distro blocks

Notebook testing

notebook testing

Cleaned up dev notebook

WIP docstrings, tests, param names

add yearly seasonality to mmm  proposal

update nb and fix imlpementation

add colourful tests ;)

fix tests

add more test fit cases

refactor component plots method

re-run example nb

fix mypy

fix test to avoid unnecessary fitting

improve nb

fix bug plotting functon

improve notebook

Improve docstrings on CLV `freq` argument

Set `freq="W"` in CLV quickstart notebook

changed from _ to - in pypi.yml

Fixed test_model_convergence

Add PyMC Marketing logo to the repository and documentation (#261)

* add logo to README

* remove horizontal line

* add light and dark logo to readthedocs

fix wrong filename for logo in README (#262)

Added example Pandas code to dev notebook

Docstring edits

Notebook and docstring edits, experimental warning

Added lifetimes citations

Removed resolved TODOs

Docs build fix attempt

method names, type hinting, docstring and test edits

ignored edge case numerical errors in purchase_prob

fixed most mypy errors

docs code example attempted fix

fixed merge conflicts in clv/distributions

another docs fix attempt

docstring edits and fixed mypy errors

changed default prior params

Reorder frequency and recency arguments to match other models

Add rewrite to speedup Hyp2F1 gradient
@ColtAllen
Collaborator Author

Ah, that was my crude resolution of a merge conflict. I can't make any changes until I fix my dev environment per #305, but I'll go ahead and get this PR re-opened.

@ColtAllen ColtAllen closed this Jun 23, 2023
@ColtAllen ColtAllen reopened this Jun 23, 2023
@ColtAllen
Collaborator Author

ColtAllen commented Jun 23, 2023

Seems I screwed up the rebase 😕

Here's what I did:

In the clv_pareto_nbd branch:

  • git rebase -i 6d5d98c
  • (change 'pick' into 'squash' a kajillion times via Vim, saved, then saved commit message)
  • git push --force-with-lease origin

@ricardoV94 Should I have done something with main in there somewhere? My rebase picked up a bunch of MMM commits.

@ricardoV94
Contributor

ricardoV94 commented Jun 23, 2023

I am not sure what you can do when you get to this stage. In the future to avoid getting here, my suggested workflow is:

  1. Always squash useless commits in the PR (so you won't have a gazillion conflicts to fix when rebasing). You should only have separate commits for self-contained independent changes in the code.
  2. Always rebase from main, not pull, to get the code up to date without "picking" changes that don't belong to the PR. Otherwise it's difficult to review on GH.

When you get to this stage, I suggest you just copy paste the changes in a fresh branch from main. Then you can open a new PR with that branch. If you didn't backup your branch, you can recover it with reflog: https://github.blog/2015-06-08-how-to-undo-almost-anything-with-git/

@ricardoV94
Contributor

For git reflog: https://stackoverflow.com/a/10099285

@ColtAllen
Collaborator Author

Let's go with the nuclear option then. I'll open a new PR with the backup branch I created, but if the changes still need to be copied over manually, I also created a pareto_nbd branch off the latest version of main that we can work with.

@ColtAllen ColtAllen closed this Jun 25, 2023
@ColtAllen ColtAllen mentioned this pull request Jun 25, 2023
@ColtAllen ColtAllen deleted the clv_pareto_nbd branch July 7, 2023 18:09
Labels
CLV, enhancement
Development

Successfully merging this pull request may close these issues.

Add Pareto/NBD Model
5 participants