SMARTboost has been superseded by HTBoost (Hybrid Tree Boosting)!

HTBoost extends SMARTboost to general loss functions (likelihoods), missing values, and categorical features. It can also deal much more effectively with irregular functions and highly non-Gaussian features. Please check HTBoost at:

HTBoost

SMARTboost


A Julia implementation of SMARTboost (Smooth Additive Regression Trees), as described in the paper SMARTboost learning for tabular data.

The R version (via JuliaConnectoR) is available here.

Currently only the L2 loss is supported, but extensions are planned.

Input features can be an Array{Float64} / Array{Float32} or a DataFrame.
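For example (a minimal sketch: the matrix call mirrors Example1 below, while the DataFrame call and the zero-argument SMARTparam() constructor are assumptions):

```julia
using SMARTboost, DataFrames

param = SMARTparam()
y, x  = randn(100), randn(Float32, 100, 3)
data  = SMARTdata(y, x, param)                    # features as a matrix
data  = SMARTdata(y, DataFrame(x, :auto), param)  # or as a DataFrame (assumed analogous)
```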

Installation

Latest:

```julia
pkg> add "https://github.com/PaoloGiordani/SMARTboost.jl"
```

Parameters (incomplete list; see the SMARTparam documentation for more). A short construction sketch follows the list.

  • loss [:L2] currently only :L2 is supported, but extensions are planned.
  • depth [4] tree depth. If not default, then typically cross-validated in SMARTfit.
  • lambda [0.2] learning rate
  • loglikdivide [1.0] with panel data, SMARTloglikdivide() can be used to set this parameter
  • overlap [0] number of overlaps. Typically overlap = h-1, where the target is y(t) = Y(t+h) - Y(t)
  • nfold [5] n in n-fold cv. Set nfold = 1 for a single validation set (the last sharevalidation share of the sample).
  • verbose [:Off] verbosity :On or :Off
  • T [Float32] Float32 is faster than Float64
  • randomizecv [false] default is purged-cv (see paper); a time series or panel structure is automatically detected (see SMARTdata)
  • subsamplesharevs [1.0] row sub-sampling; if < 1.0, only a randomly drawn (at each iteration) share of the sample is used to determine ι (which feature), μ, and τ.
  • subsampleshare_columns [1.0] column sub-sampling
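The sketch below sets several of these options at construction, as in the examples further down; the values simply restate the defaults listed above and are not recommendations.

```julia
# Illustrative SMARTparam settings using the parameters listed above
# (values restate the listed defaults and are examples only).
param = SMARTparam(
    depth                  = 4,        # tree depth (often cross-validated in SMARTfit)
    lambda                 = 0.2,      # learning rate
    nfold                  = 5,        # 5-fold purged cv; nfold = 1 for a single validation set
    verbose                = :Off,     # :On or :Off
    T                      = Float32,  # Float32 is faster than Float64
    randomizecv            = false,    # keep purged cv for time series / panel data
    subsamplesharevs       = 1.0,      # row sub-sampling share
    subsampleshare_columns = 1.0       # column sub-sampling share
)
```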

Figures: Example1 fit with n = 1_000 and with n = 1_000_000.

Example1

```julia
using SMARTboost   # single core

#=
number_workers  = 4  # desired number of workers
using Distributed
nprocs()<number_workers ? addprocs( number_workers - nprocs()  ) : addprocs(0)
@everywhere using SMARTboost
=#

using Random, Plots, JLD2

# Some options for SMARTboost
cvdepth   = false    # false to use the default depth, true to cross-validate depth
nfold     = 1        # nfold cv. 1 faster, default 5 is slower, but more accurate.

# options to generate data. y = sum of four additive nonlinear functions + Gaussian noise(0,stde^2)
n,p,n_test  = 10_000,5,100_000
stde        = 1.0

f_1(x,b)    = b*x
f_2(x,b)    = sin.(b*x)
f_3(x,b)    = b*x.^3
f_4(x,b)    = b./(1.0 .+ (exp.(4.0*x))) .- 0.5*b

b1,b2,b3,b4 = 1.5,2.0,0.5,2.0

# generate data
x,x_test = randn(n,p), randn(n_test,p)
f        = f_1(x[:,1],b1) + f_2(x[:,2],b2) + f_3(x[:,3],b3) + f_4(x[:,4],b4)
f_test   = f_1(x_test[:,1],b1) + f_2(x_test[:,2],b2) + f_3(x_test[:,3],b3) + f_4(x_test[:,4],b4)
y        = f + randn(n)*stde

# set up SMARTparam and SMARTdata, then fit and predict
param  = SMARTparam( nfold = nfold,verbose = :Off )
data   = SMARTdata(y,x,param,fdgp=f)

if cvdepth==false
    output = SMARTfit(data,param)                # default depth
else
    output = SMARTfit(data,param,paramfield=:depth,cv_grid=[1,2,3,4,5],stopwhenlossup=true)  # starts at depth = 1, stops as soon as loss increases
end

yf     = SMARTpredict(x_test,output.SMARTtrees)  # predict

println("\n depth = $(output.bestvalue), number of trees = $(output.ntrees) ")
println(" out-of-sample RMSE from truth ", sqrt(sum((yf - f_test).^2)/n_test) )

# save (load) fitted model
@save "output.jld2" output
#@load "output.jld2" output    # Note: key must be the same, e.g. @load "output.jld2" output2 is a KeyError

# feature importance, partial dependence plots and marginal effects
fnames,fi,fnames_sorted,fi_sorted,sortedindx = SMARTrelevance(output.SMARTtrees,data)
q,pdp  = SMARTpartialplot(data,output.SMARTtrees,[1,2,3,4])
qm,me  = SMARTmarginaleffect(data,output.SMARTtrees,[1,2,3,4],npoints = 40)

# To compute marginal effect at one point x0 rather than over a grid, set npoints = 1 and other_xs = x0 (a p vector, p the number of features), e.g.
# qm,me  = SMARTmarginaleffect(data,output.SMARTtrees,[1,2,3,4],other_xs=x0,npoints = 1)

# plot partial dependence (a minimal plotting sketch is given below the figure)
```

![](figures/Example1.png)
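The plotting code itself is not shown above. A minimal sketch with Plots.jl (already loaded in the example) might look like the following, assuming the q and pdp returned by SMARTpartialplot are matrices with one column per requested feature:

```julia
# Sketch: partial dependence plots for the four features requested above.
# Assumes q (grid points) and pdp (partial effects) have one column per feature.
plts = [plot(q[:, i], pdp[:, i], legend = false, title = "x$i") for i in 1:4]
plot(plts..., layout = (2, 2))
```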

Example2 (Global Equity data, parallelization via Distributed, CV and priors for panel data)

Refer to examples/Example2. Notice how we prepare the panel data for purged-CV, and calibrate loglikdivide; here the effective sample size ess is much smaller than n.

```julia
# Setting up parallelization
number_workers  = 4  # desired number of workers
using Distributed
nprocs()<number_workers ? addprocs( number_workers - nprocs()  ) : addprocs(0)
@everywhere using SMARTboost

# ....
# prepare data; sorting dataframe by :date is required by purged-CV.
sort!(df,:date)

# calibrate loglikdivide.
lld,ess =  SMARTloglikdivide(df,:excessret,:date,overlap=0)
param   = SMARTparam(loglikdivide = lld,overlap=0,stopwhenlossup=true) # in CV, stop as soon as loss increases
```

Refer to examples/Example2. Here we point out how to do a train-validation-test split in SMARTboost. In particular, notice that the default is to re-fit the model (with the cross-validated tuning parameters) on the entire data (train+validation), as is done in n-fold CV, even when nfold = 1. If you wish to skip this step (for speed, or for comparison with other methods), set nofullsample = true in SMARTfit. When nfold = 1, the default is to use the last 30% of the data as the validation set (approximately 30% in some cases because of purged validation). To change this default, set e.g. sharevalidation = 0.2 for the last 20% of the sample, or sharevalidation = 1000 for the last 1000 observations. Setting sharevalidation to an integer switches the default to nfold = 1.
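A minimal sketch of such a split, assuming (as the description above suggests) that sharevalidation is a SMARTparam field and nofullsample a SMARTfit keyword:

```julia
# Sketch of a train-validation-test split. The last n_test observations are held out
# as a test set; the last 20% of the training sample is used for validation.
n_test = 1_000
y_train, x_train   = y[1:end-n_test],       x[1:end-n_test, :]
y_test,  x_testset = y[end-n_test+1:end],   x[end-n_test+1:end, :]

param  = SMARTparam(nfold = 1, sharevalidation = 0.2)
data   = SMARTdata(y_train, x_train, param)
output = SMARTfit(data, param, nofullsample = true)     # fit on the training set only
yf     = SMARTpredict(x_testset, output.SMARTtrees)     # evaluate on the held-out test set
```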

Some suggestions for speeding up SMARTboost.

Approximate computing times for 100 trees, depth = 4, using 8 workers on an AMD EPYC 7542, with a linear dgp. (100 trees are sufficient in most applications.)

| n x p | minutes for 100 trees |
| --- | --- |
| 100k x 10 | 1.7 |
| 1m x 10 | 17 |
| 10m x 10 | 170 |
| 100k x 100 | 9 |
| 1m x 100 | 85 |
| 10m x 100 | 850 |

SMARTboost runs much faster (particularly with large n) with 4-8 cores than with one, after the initial one-off cost. If you are running on few cores, consider limiting depth to 3 or less.

With large n:

  • Use a single validation sample instead of the default 5-fold cv (param.nfold = 1). Additionally, setting nofullsample = true in SMARTfit further reduces computing time by roughly 60% (at the cost of a modest efficiency loss). nofullsample = true is also required for a train-validation-test split in which the model is fit on the training set only (by default, a validation set is used for CV and the model is then re-trained on train+validation at the optimal parameter values).
  • Computing time increases rapidly with param.depth in smooth trees. If you cross-validate tree depth, start at a low value and stop as soon as there is no sizable improvement (set stopwhenlossup = true in SMARTfit). With 8 workers, as a rule of thumb, computing time roughly doubles for each unit increase in depth.
  • Set stderulestop = 0.05 or 0.1 to stop boosting iterations when the loss is no longer decreasing sizably (at the cost of a small loss in performance).
  • Row and column sub-sampling are supported (subsamplesharevs and subsampleshare_columns); see the sketch below for a configuration combining these options.
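The sketch below combines several of the suggestions above for large n. Values are illustrative, and stderulestop is assumed here to be a SMARTparam field (check the SMARTparam documentation for its exact placement).

```julia
# Sketch of a speed-oriented configuration for large n (illustrative values).
param = SMARTparam(
    nfold                  = 1,     # single validation set instead of 5-fold cv
    depth                  = 3,     # low depth: computing time roughly doubles per extra unit
    stderulestop           = 0.1,   # stop when the loss stops decreasing sizably (assumed SMARTparam field)
    subsamplesharevs       = 0.5,   # row sub-sampling
    subsampleshare_columns = 0.5    # column sub-sampling
)
data   = SMARTdata(y, x, param)
output = SMARTfit(data, param, nofullsample = true)   # skip re-fitting on train+validation
```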