---
title: "Retail Churn Models"
output:
md_document:
variant: markdown_github
---
**Author: Demetri Pananos**
```{r}
library(knitr)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, cache = FALSE, fig.cap = "", dpi = 400)
```
# Introduction
Retail churn is different from most other forms of churn, since every transaction could be that customer's last or just one of a long sequence of transactions. Churn is normally framed as a classification problem, but I don't think classification is appropriate in non-contractual settings. By way of example, suppose you run a retail hardware store. A competitor opens a location closer to your most loyal customers, offering them the benefit of saved time. Those loyal customers may churn without showing any signs. A typical classification algorithm would misclassify these customers, and that would cost your business in the long run.
When a retail customer churns, their between-transaction times become large. Perhaps so large that a retailer might think, "Wow, I haven't seen customer X in a long time". You could even say the time between transactions is *anomalously large*. Churn modelling in retail is therefore not a classification problem; it is an anomaly detection problem. To determine when your customers are churning, or are likely to churn, you need to know when they are displaying anomalously large between-transaction times.
We first need an idea of what "anomalously" means. I want to be able to make claims like "9 times out of 10, Customer X will make his next transaction within Y days". If Customer X does not make another transaction within Y days, we know that a gap this long happens only about 1 time in 10, and we can treat the behaviour as anomalous.
To do this, we need each customer's between-transaction time distribution. This may be difficult to estimate parametrically, especially if the distribution is multimodal or otherwise irregular. To avoid this difficulty, I'll take a non-parametric approach and use the Empirical Cumulative Distribution Function (ECDF) to approximate the quantiles of each customer's between-transaction time distribution. Once I have the ECDF, I can approximate the 90th percentile and obtain estimates of the kind described above.
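To make the idea concrete, here is a minimal sketch (using base R and a made-up vector of between-transaction gaps, not the retail data) of how an ECDF yields this kind of statement: the 90th percentile of the observed gaps is the number of days within which the customer made 9 out of 10 of their past repeat purchases.
```{r}
# Toy example: made-up gaps (in days) between one customer's transactions
gaps <- c(3, 5, 7, 8, 10, 12, 14, 21, 25, 40)

# Empirical CDF of the gaps
F_hat <- ecdf(gaps)

# Approximate 90th percentile: the gap the customer stayed under 9 times out of 10
quantile(gaps, probs = 0.9, type = 1)

# Equivalently, the fraction of past gaps that were at most 30 days
F_hat(30)
```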
To demonstrate my methodology, I'll use retail data obtained from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/online+retail).
Let's get started.
# Data Munging
The first thing we'll have to do is slurp in the data. Once we do that, we'll find that each row of the data describes a product within a transaction: how many units were bought (`Quantity`), the price per unit (`UnitPrice`), who bought the product (`CustomerID`), when the product was bought (`InvoiceDate`), and which transaction the product was bought under (`InvoiceNo`).
What I really need to know is who bought something and when. To get that, I can group by `InvoiceNo`, `CustomerID`, and `InvoiceDate`, which tells me when each customer made a distinct purchase. We'll also have to filter out returns, which appear as rows with `Quantity < 0`, so `filter` handles that easily. From there, we can determine the time between transactions for each customer.
```{r}
library(tidyverse)
library(lubridate)

theme_set(theme_minimal())

retail_data = read_csv('~/Documents/R/Churn/Online Retail.csv') # Read in the data

# The data lists the items of a single transaction across many rows.
# Group them so each row is one transaction for one customer.
txns <- retail_data %>%
  mutate(CustomerID = as.factor(CustomerID),
         InvoiceDate = ymd_hm(InvoiceDate)) %>%
  group_by(CustomerID, InvoiceNo, InvoiceDate) %>%
  summarise(Spend = sum(UnitPrice * Quantity)) %>%
  ungroup() %>%
  filter(Spend > 0,
         year(InvoiceDate) == 2011,
         month(InvoiceDate) < 10)

# Time between consecutive transactions for each customer
time_between <- txns %>%
  arrange(CustomerID, InvoiceDate) %>%
  group_by(CustomerID) %>%
  mutate(dt = as.numeric(InvoiceDate - lag(InvoiceDate), units = 'days')) %>%
  ungroup() %>%
  na.omit()

# Customers with a large number of transactions
Ntrans = txns %>%
  group_by(CustomerID) %>%
  summarise(N = n()) %>%
  filter(N > 35)
```
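Before plotting anything, a quick sanity check (not part of the original analysis, just a peek at the derived tables) confirms that `time_between` has one row per repeat transaction with `dt` holding the gap in days, and shows how many customers survive the `N > 35` filter.
```{r}
# Sanity check: inspect the derived tables
glimpse(time_between)     # one row per repeat transaction, dt = days since the previous one
summary(time_between$dt)  # distribution of gaps across all customers
nrow(Ntrans)              # customers kept by the N > 35 transaction filter
```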
Let's visualize the distributions for each customer (shown below). Some look as if they are exponential, which would be really nice because then we could model purchase incidence as a Poisson process. Others are more irregular. Modelling all of these distributions with a single, separately parameterized family would be very difficult. Our non-parametric method is way easier, as we will see.
```{r}
## Helper function to sample whole groups from a grouped data frame
sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # remember the grouping variables so we can regroup after the join
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # sample `size` of the distinct groups
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only the selected groups, then regroup (joins drop the grouping)
  tbl %>% right_join(keep, by = grps) %>% group_by(across(all_of(grps)))
}

## Draw a sample of 20 frequently transacting customers to plot
sample_users <- time_between %>%
  inner_join(Ntrans) %>%
  group_by(CustomerID) %>%
  sample_n_groups(20)

ggplot(data = time_between %>%
         inner_join(Ntrans) %>%
         filter(CustomerID %in% sample_users$CustomerID),
       aes(dt)) +
  geom_histogram(aes(y = ..count.. / sum(..count..)), bins = 15) +
  facet_wrap(~CustomerID) +
  labs(x = 'Time Since Last Transaction (Days)', y = 'Frequency')
```
# Computation of the ECDF
I've written a little function to compute the ECDF for each customer. Then, I can plot each ECDF and draw a line at 0.9. The time where the ECDF crosses the line is the approximate 90th percentile. So if the ECDF crosses our line at 23 days, that means that 9 times out of 10 the customer will make another transaction within 23 days.
Better yet, we can compute the approximate 90th percentile and display it in a dataframe.
```{r}
ecdf_df <- time_between %>%
  group_by(CustomerID) %>%
  arrange(dt) %>%
  mutate(e_cdf = row_number() / n())  # empirical CDF within each customer

ggplot(data = ecdf_df %>%
         inner_join(Ntrans) %>%
         filter(CustomerID %in% sample_users$CustomerID),
       aes(dt, e_cdf)) +
  geom_point(size = 0.5) +
  geom_line() +
  geom_hline(yintercept = 0.9, color = 'red') +
  facet_wrap(~CustomerID) +
  labs(x = 'Time Since Last Transaction (Days)')
```
```{r}
getq <- function(x, a = 0.9){
  # Approximate the a-th quantile of x from its ECDF via linear interpolation
  if (a > 1 | a < 0) {
    stop('Quantile must be between 0 and 1')
  }
  X <- sort(x)
  e_cdf <- 1:length(X) / length(X)
  aprx <- approx(e_cdf, X, xout = a)  # interpolate the ECDF at a
  return(aprx$y)
}

quantiles = time_between %>%
  inner_join(Ntrans) %>%
  filter(N > 5) %>%
  group_by(CustomerID) %>%
  summarise(percentile.90 = getq(dt)) %>%
  arrange(percentile.90)

head(quantiles, 10)
```
That's it! We now know the point when each customer will begin to act "anomalously".
# Discussion
Churn is very different for retailers, which means taking a different approach to modelling it. When a customer has churned, their time between transactions is anomalously large, so we need an idea of what "anomalously" means for each customer. Using the ECDF, we have estimated the 90th percentile of each customer's between-transaction time distribution in a non-parametric way. Now we can look at the last time a customer transacted, and if the time between then and now is near their 90th percentile (or any percentile you deem appropriate), we can call them "at risk of churn" and take appropriate action to keep them. Best of all, our approach improves as we collect more data, since the ECDF converges uniformly to the true CDF (the Glivenko-Cantelli theorem).
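As an illustration of how these percentiles could be put to work, here is a sketch of a flagging rule. The names `now`, `at_risk`, and `flag_churn_risk` are my own, and I'm using the end of the observation window to stand in for "today"; in production you would use the current date.
```{r}
# Hypothetical "today": the last date in the observation window (an assumption for this sketch)
now <- max(txns$InvoiceDate)

at_risk <- txns %>%
  group_by(CustomerID) %>%
  summarise(last_purchase = max(InvoiceDate)) %>%
  inner_join(quantiles, by = 'CustomerID') %>%
  mutate(days_since = as.numeric(now - last_purchase, units = 'days'),
         flag_churn_risk = days_since > percentile.90) %>%  # gap already beyond their 90th percentile
  arrange(desc(days_since))

head(at_risk, 10)
```
Anyone flagged here has already gone longer without purchasing than they do 9 times out of 10, and is a natural target for a retention offer.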
In this approach, I have not accounted for seasonality in transactions. In an implementation, the window over which we compute the ECDF would have to be kept appropriately short. Parametric methods for modelling between-transaction times also exist and would give better insight into the phenomenon as a whole; these methods are currently under investigation by the author.