Codes for replicating some of the experiments of the paper Robust leave-one-out cross-validation for high-dimensional Bayesian models.
PyStan has undergone a significant upgrade from PyStan 2 to the newer PyStan 3 which is not backward compatible in a major way. The current codes where created with PyStan 2.9.1.1 and hence work only with that version. Also it can have trouble with Python >=3.9, hence to run the above code we suggest the following commands to be run in the terminal(assuming one has Anaconda).
conda create --name pystan python=3.6
conda activate pystan
pip install pystan==2.19.1.1
To run the codes place the PSIS and ParetoSmooth folders in the apporopriate locations.
-
For a pedagogical walk trough of the construction of the estimator read tutorial.md.
-
To test the estimator, together with some of the competing ones in the paper, in a toy-setting where closed form solutions of the quantities are available, one can run Gaussian_Simulated.ipynb. Note that in this notebook we leverage the module models.py which guarantes more robust numerical behaviours and efficiency. To create also the Pareto-smoothed estimator PSIS one must download the repository and place it in a folder accordingly.
-
The notebook Gaussian_Simulated_np.ipynb performs the experiments present in Section 4 of the paper. They are though pretty slow on Python, hence for extensive experiments we have create an apposite Julia code(see below).
- To replicate some of the results on the high-D logistic models in R refer to the folder R_codes. For some information regarding the code and the datasets available there look at Guide.md.
- For the more computationally heavy experiemnts of Section 4 of the paper we created a julia code which is significantly faster than the Python implementation. The code leverages multithreading and stores the result in a Python-usable format. Hence any type of analysis/visualization of the results can then be performed using the Python notebooks.
Not all the figures of the manuscript are directly reproducible from this folder, and not all the figures produced from the current folder are in the manuscript as some of them have more of a pedagocical scope. Here is some general information regarding the latter and some instruction to reproduce some of the former.
-
The plot at the end of Gaussian_Simulated.ipynb is just to show the different levels of accuracies of the estimators for the different addends of LOOCV for a single simulated dataset. It has not hence any counterpart in the manuscript.
-
Setting attempts=10000 at line 15 of the third cell block of Gaussian_Simulated_np.ipynb and the plotting the results using the penultimate cell block of the same will produce the right subfigure of Figure 2 at page 21 of the manuscript. As mentioned above though, such an experiment is extremely time consuming in Python, hence if one wishes to reproduce it the recommended workflow is the following:
- Run the Julia code hyper_highD.jl with as many threads as physical cores of your working machine; to do so run the following command
in the terminal substituting nthreads with the number of physical cores of your machine.
julia --threads nthreads hyper_highD.jl
- Run the first two cell blocks of Gaussian_Simulated_np.ipynb and the uncomment the fourth one while skipping the third one. Then normally run the penultimate one to obtain the plots.
- To obtain the left subfigure of Figure 2 of the manuscript in Python you have to change line 23 of the third block to
Nonetheless this has the same computational time issue as before. In Julia you have to change line 176 of hyper_highD.jl to
sigma_0 = np.identity(d+1)
sigma = Diagonal(1*ones(d+1))
- Run the Julia code hyper_highD.jl with as many threads as physical cores of your working machine; to do so run the following command
-
The plot at the end of Leukaemia.ipynb is not present in the paper as here it has more a pedagogical role. We hence opted for a figure which conveyed better the behaviour of infinite variance estimators compared to finite variance ones, while in_ Figure S.4_ of the supplement we chose for an experiment which gave a more quantitative result on the bias of the different estimators.
-
Running the cells of Stack_Loss.ipynb directly replicates Figure S.5 of the supplement.
The plot of Lppd.R just gives an element-wise comparison between the classical and mixture estimators and is not present in the paper, as in the paper we have no plots coming from R. It is meant to allow R users to get some quick results. The default dataset used is the Voice dataset analysed in Section 4.2 of the manuscript. To analyse the other datasets present in that section change line 16 of Lppd.R to the desired dataset. For example if one wished to get the estimators for the Parkinson dataset, change line 16 to
data = read.csv("./../../Data/Parkinson_preprocessed.csv")