Merge pull request #3 from isaacsas/fitting_tutorial

Fitting tutorial
isaacsas · Sep 10, 2024 · 5a240de · 5a240de
2 parents 8b86dc0 + dedc947
commit 5a240de
Show file tree

Hide file tree

Showing 5 changed files with 121 additions and 10 deletions.
diff --git a/docs/make.jl b/docs/make.jl
@@ -12,6 +12,7 @@ makedocs(
         "Home" => "index.md",
         "Forward Model" => "forward_simulation.md",
         "Surrogate Construction" => "surrogate_construction.md",
+        "Fitting Data" => "fitting.md",
         "API" => "SPRFitting_api.md"
     ],
     warnonly = [:missing_docs]

diff --git a/docs/src/fitting.md b/docs/src/fitting.md
@@ -0,0 +1,105 @@
+# Fitting to SPR Data
+In this tutorial we will illustrate how to fit the surrogate model to SPR data to estimate $k_{\text{on}}$, $k_{\text{off}}$, $k_{\on,b}
+serial on any computer, and our workflow for building large surrogates on
+clusters (specific to the Grid Engine queuing system based cluster we use).
+
+## Setup
+We begin by installing the packages we need in a clean environment:
+```julia
+using Pkg
+
+# create a new environment for the surrogate constructions
+Pkg.activate("fitting_env") 
+Pkg.add(url="https://github.com/isaacsas/SPRFittingPaper2023.jl.git")
+```
+
+## Estimating Best Fit Parameters to an SPR Dataset
+We wil work with the same SPR dataset as in the [Forward Model Simulation](@ref)
+tutorial. As described in [1], our general approach will be to use the XNES
+optimizer from BlackBoxOptim.jl to minimize the difference between the SPR data
+and the surrogate's predicted SPR traces. XNES is a stochastic natural evolution
+optimizer that we have found often gives the best fits among many optimizer
+choices. We use this optimizer via Optimization.jl, which allows for the easy
+selection of different optimizer choices. As XNES is stochastic, we will run the
+parameter estimation process `nfits` times, and select parameters from the run
+with the minimal loss as our final estimates. 
+
+We'll use the full surrogate from [1], which can be downloaded
+[here](https://doi.org/10.6084/m9.figshare.26936854).
+
+We being by defining some variables related to the fitting process:
+```julia
+nfits = 100    # how many fits to run and then take the minimum over
+nsims = 250    # number of simulations to average when plotting
+logCP_optrange = (1.0, 5.0)  # allowable log-space range of CP values 
+```
+
+Next we set the location of the surrogate:
+```julia
+surfile = "PATH_TO_DOWNLOADED_SURROGATE.jld"
+```
+
+We are now load the surrogate and SPR data:
+```julia
+# load the surrogate
+surrogate = Surrogate(surfile)
+sps = surrogate.surpars
+
+# range of CP values to use
+optpar_ranges = [logCP_optrange]
+
+# load the aligned SPR data from input CSV file
+datadir = joinpath(@__DIR__, "..", "..", "data")
+aligned_data_fname = joinpath(datadir, "Data_FC4_10-05-22_Protein07_FD-11A_RBD-13.8_aligned.csv")
+aligneddat = get_aligned_data(aligned_data_fname)
+```
+Finally, we now run the optimizer `nfits` times, taking as our parameter estimates the fit with the mimimal loss:
+```julia
+optsol, best_pars = fit_spr_data(surrogate, aligneddat, optpar_ranges)
+for i in 2:nfits
+    optsol_new, best_pars_new = fit_spr_data(surrogate, aligneddat, optpar_ranges)
+    if optsol_new.minimum < optsol.minimum
+        optsol = optsol_new
+        best_pars = best_pars_new
+    end
+end
+```
+We find the best fit parameters, `[kon, koff, konb, reach, CP]`, are
+```julia
+best_pars = [5.2859956892261876e-5, 0.04086857480653136, 0.7801815024260655, 31.898843844047246, 128.56923492479402]
+```
+We can now plot (and output) a figure comparing our fits to the SPR data
+```julia
+# number of simulations to average over in making the plot
+simpars.nsims = nsims
+
+# save a figure showing the fit to the SPR data
+OUTDIR = tempdir()
+figfile = joinpath(OUTDIR, "fit_curves.png")
+visualisefit(optsol, aligneddat, surrogate, simpars, figfile)
+```
+Which gives
+
+![spr_fit](./fitting_data/fit_curves.png)
+
+Here the SPR data is shown in black and the average predicted responses in
+color. Finally, we can save a spreadsheet with our fits and parameter estimates
+via
+```julia
+curvefile = joinpath(OUTDIR, "parameters.xlsx")
+savefit(optsol, aligneddat, surrogate, simpars, curvefile)
+```
+For this example the file [here](./fitting_data/parameters.xlsx_fit.xlsx) shows
+the resulting spreadsheet. Note that the second sheet within it shows the
+parameter estimates.
+
+## General Workflow 
+A more detailed workflow that processes multiple SPR inputs, includes monovalent
+fits, and systematically writes the output is presented in GIVE_LOCATION.
+
+
+## Bibliography
+1. A. Huhn, D. Nissley, ..., C. M. Deane, S. A. Isaacson, and O. Dushek,
+   *Analysis of emergent bivalent antibody binding identifies the molecular
+   reach as a critical determinant of SARS-CoV-2 neutralisation potency*, in
+   review, [available on bioRxiv](https://www.biorxiv.org/content/10.1101/2023.09.06.556503v2) (2024).
diff --git a/docs/src/fitting_data/fit_curves.png b/docs/src/fitting_data/fit_curves.png
diff --git a/docs/src/fitting_data/parameters.xlsx_fit.xlsx b/docs/src/fitting_data/parameters.xlsx_fit.xlsx
diff --git a/docs/src/forward_simulation.md b/docs/src/forward_simulation.md
@@ -10,7 +10,7 @@ using Pkg
 Pkg.activate("fwd_sim_env") 
 Pkg.add(url="https://github.com/isaacsas/SPRFittingPaper2023.jl.git")
 Pkg.add("Plots")
-Pkg.add("CSV""
+Pkg.add("CSV")
 ```
 
 ## Running Forward Simulations
@@ -59,8 +59,10 @@ simpars = SimParams(; antigenconcen = biopars.antigenconcen, tstop, dt, N,
 ```
 See the [`SimParams`](@ref) documentation for more on what the various arguments here mean.
 
-Finally, for each antibody concentration we will run one forward simulation and
-plot the ratio of the number of bound antibodies to the number of antigen, scaled by the fitted coefficient of proportionality, `CP`:
+Finally, for each antibody concentration we will run `simpars.nsims` forward
+simulations and plot the ratio of the average number of bound antibodies to the
+number of antigen, scaled by the fitted coefficient of proportionality, `CP`.
+Since we set no value for it, by default `simpars.nsims = 1000` are averaged:
 ```@example fwdsim
 plt = plot(; xlabel = "times (sec)", 
              ylabel = "SPR Response")
@@ -75,15 +77,18 @@ end
 
 plt
 ```
-Note that here we only ran one simulation for each parameter set, and hence
-`means` just returns the `CP` scaled number of antibodies bound to the surface at
+Here `means` finalizes calculating the average number of bound antibodies across
+the 1000 simulations that were run for each antibody concentration. It then
+returns the `CP` scaled average number of antibodies bound to the surface at
 each time in `tsave` divided by the number of antigen in the system (i.e. `N =
-1000` here). If we had wanted to run and average multiple samples we could have
-passed [`run_spr_sim!`](@ref) a terminator such as a
-[`SPRFittingPaper2023.VarianceTerminator`](@ref) or
-[`SPRFittingPaper2023.SimNumberTerminator`](@ref).
+1000` here). See the [`run_spr_sim!`](@ref) docs and the docs for the
+terminators [`SPRFittingPaper2023.VarianceTerminator`](@ref) or
+[`SPRFittingPaper2023.SimNumberTerminator`](@ref) for more information on
+terminating simulations in relation to the number / accuracy of the desired
+averages.
 
-Finally, let's load and plot the corresponding aligned SPR data that we originally estimated these parameters from to confirm the good fit. 
+Finally, let's load and plot the corresponding aligned SPR data that we
+originally estimated these parameters from to confirm the good fit. 
 ```@example fwdsim
 datadir = joinpath(@__DIR__, "..", "..", "data")
 fname = joinpath(datadir, "Data_FC4_10-05-22_Protein07_FD-11A_RBD-13.8_aligned.csv")