# Generating spatial sampling
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r, include=FALSE, message=FALSE, results='hide'}
## install (if missing) and load all packages used in this chapter:
pkgs <- c("rgdal", "raster", "plotKML", "spatstat", "ranger", "mlr", "forestError",
  "clhs", "xgboost", "glmnet", "matrixStats", "landmap", "LICORS", "h2o", "yardstick",
  "hexbin", "parallelMap", "Metrics", "fastSave", "fields", "devtools", "Cubist")
new.packages <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(pkgs, require, character.only = TRUE)
#load.pigz("sampling_data.RData")
source("sampling_functions.R")
```
## Spatial sampling
Sampling in statistics is done for the purpose of estimating population
parameters and/or testing hypotheses. If Observations and
Measurements (O&M) are collected in space, i.e. as geographical variables,
this is referred to as **spatial sampling** and is often materialized as
**a point map**, with the points representing locations of planned or
implemented O&M. Preparing a spatial sampling plan is a type of **design
of experiment**, hence it is important to do it right to avoid any
potential bias.
Producing and implementing spatial sampling designs is
common in various fields including physical geography, soil science,
geology, vegetation science, ecology and similar. Imagine an area that
potentially has problems with soil pollution by heavy metals. If we
collect enough samples, we can overlay the points with covariate layers,
train spatial interpolation / spatial prediction models, and produce
predictions of the target variable. For example, to map soil pollution
by heavy metals or soil organic carbon stock, we can collect soil
samples at e.g. a few hundred predefined locations, take the
samples to the lab, measure the individual values, and then interpolate them
to produce a map of concentrations. This is one of the most common
applications of **geostatistics**, where e.g. various [kriging methods](http://www.leg.ufpr.br/geor/)
are used to produce predictions of the target variable (see e.g.
@Bivand2013Springer).
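As a minimal illustration of that geostatistical workflow, consider the following
hedged sketch using the classic `meuse` example shipped with the **sp** / **gstat**
packages (which is not part of this tutorial's data, and gstat is not in this
chapter's package list):
```{r, eval=FALSE}
library(sp); library(gstat)
data(meuse); coordinates(meuse) <- ~x+y
data(meuse.grid); gridded(meuse.grid) <- ~x+y
## fit a variogram model to the point measurements:
v <- variogram(log(zinc) ~ 1, meuse)
vm <- fit.variogram(v, vgm("Sph"))
## interpolate to all grid cells using ordinary kriging:
zinc.ok <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = vm)
spplot(zinc.ok["var1.pred"])
```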
There are many sampling design algorithms that can be used to generate
spatial sampling locations. In principle, all spatial sampling approaches can be
grouped based on the following four aspects:
1. *How objective is it?* Here two groups exist:
    a. Objective sampling designs, which are either [***probability
       sampling***](https://towardsdatascience.com/an-introduction-to-probability-sampling-methods-7a936e486b5)
       or some experimental designs from spatial statistics;
    b. Subjective or **convenience sampling**, which means that the
       inclusion probabilities are unknown and are often based on
       convenience, e.g. distance to roads / accessibility;
2. *How identically distributed is it?* Here at least three groups
   exist:
    a. **[Independent Identically Distributed (IID)](https://xzhu0027.gitbook.io/blog/ml-system/sys-ml-index/learning-from-non-iid-data)**
       sampling designs;
    b. Clustered sampling, i.e. non-equal probability sampling;
    c. Censored sampling;
3. *Is it based on geographical or feature space?* Here at least three
   groups exist:
    a. Geographical sampling, i.e. taking into account only the geographical dimensions + time;
    b. **Feature-space sampling**, i.e. taking into account only the distribution of points in feature space;
    c. Hybrid sampling, i.e. taking both feature and geographical space into account;
4. *How optimized is it?* Here at least two groups exist:
    a. **Optimized sampling**, where the target optimization criterion
       reaches a minimum / maximum, i.e. it can be proven to be optimized;
    b. *Unoptimized sampling*, where the optimization criterion either cannot be tested or is unknown.
Sampling using objective sampling designs is important as it
allows us to test hypotheses and produce an **unbiased estimation** of
population parameters or similar. Many spatial statisticians argue that
only a previously prepared, strictly followed randomized probability
sampling can be used to provide an unbiased estimate of the accuracy of
the spatial predictions [@Brus2021sampling]. In the case
of probability sampling, the calculation of population parameters is derived
mathematically, i.e. the estimation process is unbiased and independent
of the spatial properties of the target variable (e.g. spatial
dependence structure and/or statistical distribution). For example, if
we generate sampling locations using **Simple Random Sampling (SRS)**,
this sampling design has the following properties:
1. It is IID, with each spatial location having exactly the same
   **inclusion probability**;
2. It is symmetrical in geographical space, meaning that about the same
   number of points can be found in each quadrant of the study area;
3. It can be used to derive population parameters (e.g. mean) and these
   estimates are by definition unbiased;
4. Any random subset of a SRS is also a SRS.
SRS is in principle both an objective and an IID sampling design, and can be
easily generated provided a polygon map representing the study area.
Two other, somewhat more complex, sampling algorithms with similar
properties to SRS are different versions of tessellated
sampling. The **generalized random tessellation stratified (GRTS)
design** was for example used in the USA to produce sampling locations
for the purpose of [geochemical mapping](https://pubs.usgs.gov/ds/801/);
a **multi-stage stratified random sampling** design was used to produce the [LUCAS soil monitoring
network of points](https://esdac.jrc.ec.europa.eu/projects/lucas).
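For completeness, a GRTS sample can be drawn e.g. via the **spsurvey** package
(a minimal sketch, assuming spsurvey >= 5.0, which is not part of this tutorial's
package list; the square polygon below is only a stand-in for a real study-area
boundary):
```{r, eval=FALSE}
library(spsurvey)
library(sf)
## a simple square polygon as a stand-in for the study-area boundary:
frame <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0,0), c(10,0), c(10,10), c(0,10), c(0,0))))))
st_crs(frame) <- "EPSG:31467"  ## any projected CRS will do for the sketch
## draw a spatially balanced GRTS sample of 100 points:
smp <- grts(sframe = frame, n_base = 100)
head(smp$sites_base)  ## the selected sites, incl. design weights
```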
Large point datasets representing observations and/or measurements
*in-situ* can be used to generate maps by fitting regression and
classification models using e.g. Machine Learning algorithms, then
applying those models to predict values at all pixels. This is referred
to as [***Predictive Mapping***](https://soilmapper.org). In reality,
many point datasets we use in Machine Learning for predictive mapping do
not have ideal properties, i.e. they are neither IID nor are the inclusion
probabilities known. Many are in fact **purposive** and/or **[convenience sampling](https://methods.sagepub.com/reference/encyclopedia-of-survey-research-methods/n105.xml)**
and hence potentially over-represent some geographic features, are
potentially censored, and can lead to significant bias in estimation.
## Response surface designs
If the objective of modeling is to build regression models (correlating
the target variable with a number of spatial layers representing e.g.
soil forming factors), then we are looking at a problem known in statistics
as **[response-surface experimental design](https://en.wikipedia.org/wiki/Response_surface_methodology)**.
Consider the following case of one target variable ($Y$) and one
covariate variable ($X$). Assuming that the two are correlated linearly
(i.e. $Y = b_0 + b_1 \cdot X$), one can easily prove that the optimal
experimental design is to: (a) determine the min and max of $X$, then (b) put half
of the points at $X_{min}$ and the other half at $X_{max}$ [@Hengl2004AJSR]. This design is called the
**[D1 optimal design](https://en.wikipedia.org/wiki/Optimal_design#D-optimality)** and
is indeed relatively simple to implement. The problem is that it
is the optimal design ONLY if the relationship between $Y$ and $X$ is
perfectly linear. If the relationship is in fact closer to quadratic, then
the D1 design is much worse than for example the D2 design [@Hengl2004AJSR].
```{r scheme-d1, echo=FALSE, fig.cap="Example of D1 design: (left) D1 design in 2D feature space, (right) the D1 design is optimal only for a linear model; if the model is curvilinear, it is in fact worse than simple random sampling.", out.width="90%"}
knitr::include_graphics("./img/Fig_D1_design_scheme.jpg")
```
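This can be illustrated with a small simulation (a minimal sketch; the quadratic
truth, noise level and sample size below are arbitrary choices, not taken from
@Hengl2004AJSR):
```{r, eval=FALSE}
set.seed(1)
n <- 20
x.d1 <- rep(c(0, 1), each = n/2)  ## D1: half of the points at each extreme of X
x.srs <- runif(n)                 ## points spread randomly over the range of X
y.true <- function(x){ 1 + 2*x - 3*x^2 }  ## the (unknown) curvilinear truth
sim <- function(x){
  y <- y.true(x) + rnorm(length(x), sd = 0.1)
  m <- lm(y ~ x)                  ## the analyst assumes a linear model
  ## absolute prediction error at the midpoint, where the curvature matters most:
  abs(predict(m, data.frame(x = 0.5)) - y.true(0.5))
}
mean(replicate(500, sim(x.d1)))   ## ca. 0.75: D1 cannot detect the curvature
mean(replicate(500, sim(x.srs)))  ## ca. 0.25: random spread in X does much better
```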
In practice we may not know the true nature of the relationship
between $Y$ and $X$, and we do not wish to take the risk of producing a
biased estimation. Hence we can assume that the relationship could be
curvilinear, and sample uniformly in the feature space instead.
## Spatial sampling algorithms of interest
This chapter reviews some common approaches for preparing point samples for a
study area that you are visiting for the first time and/or for which no previous
samples or models are available. We focus on the following spatial sampling methods:
- Subjective or convenience sampling,
- Simple Random Sampling (**SRS**) [@Bivand2013Springer;@Brus2021sampling],
- Latin Hypercube Sampling (**LHS**) and its variants, e.g. Conditioned LHS [@minasny2006conditioned;@Malone2019PeerJ],
- Feature Space Coverage Sampling (**FSCS**) [@Goerg2013] and fuzzy k-means clustering [@hastie2009elements],
- 2nd round sampling [@stumpf2017uncertainty].
Our interest in this tutorial is in producing predictions (maps) of the target
variable by employing regression / correlation between the target variable and
a multitude of features (raster layers), with various Machine Learning techniques
used for training and prediction.
Once we collect enough training points in an area we can overlay the points and GIS
layers to produce a **regression matrix** or **classification matrix**,
which can then be used to generate spatial predictions, i.e. produce maps. As
state-of-the-art we use the **mlr** framework for Ensemble Machine Learning as the key
Machine Learning framework for predictive mapping. For an introduction to Ensemble
Machine Learning for Predictive Mapping please refer to [this tutorial](https://gitlab.com/openlandmap/spatial-predictions-using-eml).
## Ebergotzen dataset
To test various sampling and mapping algorithms, we use the Ebergotzen
dataset, also available via the **plotKML** package [@hengl2015plotkml]:
```{r}
set.seed(100)
library(plotKML)
library(sp)
library(viridis)
library(raster)
library(ggplot2)
library(dplyr)
library(tidyr)
data("eberg_grid25")
gridded(eberg_grid25) <- ~x+y
proj4string(eberg_grid25) <- CRS("+init=epsg:31467")
```
This dataset is described in detail in @Bohner2008Hamburg. It is a soil survey
dataset with ground observations and measurements of soil properties including soil
types. The study area is a perfect square, 10×10 km in size.
We have previously derived a number of additional DEM parameters directly using SAGA GIS
[@Conrad2015], which we can add to the list of covariates:
```{r}
eberg_grid25 = cbind(eberg_grid25, readRDS("./extdata/eberg_dtm_25m.rds"))
names(eberg_grid25)
```
This gives a total of 11 layers at 25-m spatial resolution, of which two layers
(`HBTSOLx` and `PMTZONES`) are factors representing soil units and parent material
units. Next, for further analysis, and to reduce data overlap, we can convert all
primary variables to (numeric) principal components using:
```{r}
eberg_spc = landmap::spc(eberg_grid25[-c(2,3)])
```
which gives a total of 14 PCs. The patterns in the PC components reflect
a complex combination of terrain variables, lithological discontinuities (`PMTZONES`)
and surface vegetation (`NVILANx`):
```{r PCA, fig.width=10, fig.height=3.5, fig.cap = "Principal Components derived using Ebergotzen dataset."}
compare.map(df = stack(eberg_spc@predicted[1:3]), title = "Ebergotzen PC1, PC2 and PC3")
```
## Simple Random Sampling
To generate a SRS with e.g. 100 points we can use the `spsample` method from the **sp** package:
```{r}
rnd <- spsample(eberg_grid25[1], type="random", n=100)
rnd_pred <- as.data.frame(rnd, xy = TRUE)
```
```{r eberg-srs, echo = TRUE, out.width = "80%", fig.cap = "An example of a Simple Random Sample (SRS)."}
plot.map(df = raster(eberg_spc@predicted[1]), points = rnd_pred, title = "Simple Random Sample (SRS)")
```
This sample is generated purely based on the spatial domain; the feature space
is completely ignored / not taken into account. We can hence check how well
these points represent the feature space using a density plot:
```{r eberg-fs, echo = TRUE, out.width = "80%", fig.cap = "Distribution of the SRS points from the previous example in the feature space."}
ov.rnd = sp::over(rnd, eberg_spc@predicted[,1:2])
plot_hex(df=eberg_spc@predicted@data[,1:2], pts=ov.rnd, title = 'PCA Ebergotzen SRS')
```
Visually, we do not directly see from Fig. \@ref(fig:eberg-fs) that the generated SRS
misses any important part of the feature space; however, if we zoom in, we can notice that some
high-density parts of the feature space (in this specific randomization) are somewhat
under-represented. If we reduced the number of sampling points, we would run an
even higher risk of missing some areas of the feature space by using SRS.
Next, we are interested in evaluating the occurrence probability of the SRS points
based on the PCA components. To derive the occurrence probability we can use
the **maxlike** package method [@Royle2012]:
```{r, cache=TRUE}
## model occurrence probability of the SRS points as a function of the first four PCs:
fm.cov <- stats::as.formula(paste("~", paste(names(eberg_spc@predicted[1:4]), collapse="+")))
ml <- maxlike::maxlike(formula=fm.cov, rasters=raster::stack(eberg_spc@predicted[1:4]),
          points=rnd@coords, method="BFGS", removeDuplicates=TRUE, savedata=TRUE)
ml.prob <- predict(ml)
```
```{r eberg-maxlike, echo = TRUE, out.width = "80%", fig.cap = "Occurrence probability for SRS derived using the maxlike package."}
plot.map(df = ml.prob, points = rnd_pred, fill = terrain.colors(10),
title = "SRS occurrence probability using maxlike")
```
Note: for the sake of reducing the computing intensity we focus on the first four PCs.
In practice, feature space analysis can be quite computational and we recommend using
parallelized versions within a High Performance Computing environment to run such analysis.
Fig. \@ref(fig:eberg-maxlike) shows that, by accident, some parts of the feature
space might be somewhat under-represented (areas with low probability of occurrence on the map).
Note however that the occurrence probability for this dataset is overall _very low_ (<0.05),
indicating that the distribution of points is not strongly correlated with the features.
This specific SRS is thus probably satisfactory also for feature space analysis:
the SRS points do not seem to group (which would in this case be by accident) and
hence maxlike gives a very low probability of occurrence.
We can repeat SRS many times and then check whether the clustering of points becomes more
problematic but, as you can imagine, in practice for a large number of samples there is
a good chance that all features will be well represented also in the feature
space, even though we did not include feature space variables in the production of the
sampling plan.
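One way to quantify this (a minimal sketch, not part of the original analysis;
the PC1-decile coverage criterion is an arbitrary choice) is to repeat SRS for
several sample sizes and record which share of the PC1 deciles each sample hits:
```{r, eval=FALSE}
pc1 <- eberg_spc@predicted@data[,1]
breaks <- quantile(pc1, probs = seq(0, 1, 0.1))
## share of the ten PC1 deciles covered by one SRS of size n:
coverage <- function(n){
  pts <- spsample(eberg_grid25[1], type = "random", n = n)
  v <- sp::over(pts, eberg_spc@predicted[1])[,1]
  length(unique(cut(v, breaks, include.lowest = TRUE))) / 10
}
## average coverage over 50 replicates; it approaches 1 as n grows:
sapply(c(10, 25, 100), function(n){ mean(replicate(50, coverage(n))) })
```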
We can now also load the actual points collected for the Ebergotzen case study:
```{r}
data(eberg)
eberg.xy <- eberg[,c("X","Y")]
coordinates(eberg.xy) <- ~X+Y
proj4string(eberg.xy) <- CRS("+init=epsg:31467")
## keep only points that fall inside the study-area grid:
ov.xy = sp::over(eberg.xy, eberg_grid25[1])
eberg.xy = eberg.xy[!is.na(ov.xy$DEMTOPx),]
## random subset of 100 points for comparability with the SRS example:
sel <- sample.int(length(eberg.xy), 100)
eberg.smp = eberg.xy[sel,]
```
To quickly estimate the spread of points in geographical and feature space, we can
also use the function `spsample.prob`, which calls both the kernel density function
from the **[spatstat](https://rdrr.io/cran/spatstat.core/man/density.ppp.html)** package and derives the probability of occurrence using the
**[maxlike](https://rdrr.io/github/rbchan/maxlike/man/maxlike-package.html)** package:
```{r, cache=TRUE}
iprob <- landmap::spsample.prob(eberg.smp, eberg_spc@predicted[1:4])
```
In this specific case, the actual sampling points are much more clustered. If we plot
the two occurrence probability maps derived using maxlike next to each other (actual vs SRS) we get:
```{r eberg-maxlike2, echo = TRUE, out.width = "100%", fig.cap = "Comparison occurrence probability for actual and SRS samples derived using the maxlike package."}
occ_prob <- stack(raster(iprob$maxlike),ml.prob)
names(occ_prob) <- c("Actual","SRS")  ## rename layers from c("maxlike","ml.prob") for plotting
compare.map(df=occ_prob, points =rnd_pred , fill=rev(terrain.colors(5)), title = "Occurrence probability for actual and SRS samples using maxlike")
```
The map on the left clearly indicates that most of the sampling points are
preferentially located in the plain area, while the hilly areas are
systematically under-sampled. We can cross-check this against the
description of the dataset in @Bohner2008Hamburg:
- the Ebergotzen soil survey points focus on agricultural land only;
- no objective sampling design was used, hence some points are clustered.
This is also clearly visible in the _feature map_ plot, where one part of the feature
space seems to be completely omitted from sampling:
```{r eberg-fs2, echo = TRUE, out.width = "80%", fig.cap = "Distribution of the actual survey points from the previous example displayed in the feature space."}
ov2.rnd = sp::over(eberg.smp, eberg_spc@predicted[,1:2])
plot_hex(df=eberg_spc@predicted@data[,1:2], pts=ov2.rnd, title = 'PCA Ebergotzen actual')
```
## Latin Hypercube Sampling
In the previous example we have shown how to implement SRS and then evaluate it
against feature layers. Often a SRS represents the feature space well enough that it can be
directly used for Machine Learning without introducing too much bias
in the predictions. To avoid, however, the risk of missing out some parts of the feature space,
and also to try to optimize the allocation of points, we can generate a sample using the
**[Latin Hypercube Sampling](https://en.wikipedia.org/wiki/Latin_hypercube_sampling)** (LHS) method. In a nutshell, LHS methods are based on
dividing the **Cumulative Density Function** (CDF) of each feature into _n_ equal partitions, and
then choosing a random data point in each partition; consequently:
- the CDF of the LHS sample matches the CDF of the population (hence an unbiased representation),
- extrapolation in the feature space should be minimized.
Latin Hypercube and sampling optimization using LHS are explained in detail in @minasny2006conditioned
and @shields2016generalization. For @brus2015balanced, LHS is just a special case
of **[balanced sampling](http://www.antongrafstrom.se/balancedsampling/)**, i.e. sampling based on allocation in feature space.
The [lhs package](https://github.com/bertcarnell/lhs) also contains numerous examples of
how to implement LHS for (non-spatial) data.
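The stratified-CDF principle can be illustrated in one dimension (a minimal
base-R sketch; the standard normal is an arbitrary choice of target distribution):
```{r, eval=FALSE}
n <- 10
## one uniform draw inside each of the n equal-probability strata of the CDF:
u <- (sample(1:n) - runif(n)) / n
## map the draws through the inverse CDF of the target distribution:
x <- qnorm(u)
sort(round(x, 2))  ## exactly one sample per decile of the distribution
```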
Here we use an implementation of the LHS available in the **clhs** package [@Roudier2011]:
```{r, cache=TRUE}
library(clhs)
rnd.lhs = clhs::clhs(eberg_spc@predicted[1:4], size=100, iter=100, progress=FALSE, simple=FALSE)
```
This actually implements the so-called _“Conditioned LHS”_ [@minasny2006conditioned],
which can get quite computational for large stacks of rasters, hence we manually limit the
number of iterations to 100.
We can plot the LHS sampling plan:
```{r eberg-clhs, echo = TRUE, out.width = "80%", fig.cap = "An example of a Latin Hypercube Sample (LHS)."}
plot.map(df=raster(eberg_spc@predicted[1]),
points = as.data.frame(rnd.lhs$sampled_data@coords),
title="Latin Hypercube Sample (LHS)")
```
```{r eberg-lhs2, echo = TRUE, out.width = "80%", fig.cap = "Distribution of the LHS points from the previous example displayed in the feature space."}
plot_hex(df=eberg_spc@predicted@data[,1:2], pts=as.data.frame(rnd.lhs$sampled_data@data), title = 'PCA Ebergotzen LHS')
```
Although in principle we might not see any difference in the point pattern between
SRS and LHS, the feature space plot clearly shows that LHS covers the
feature space systematically, i.e. we run a relatively low risk of missing out some
important features as compared to Fig. \@ref(fig:eberg-fs2).
Thus the main advantages of LHS are:
- it ensures that the feature space is represented systematically, i.e. it is optimized
  for Machine Learning using the specific feature layers;
- it is an **[Independent Identically Distributed (IID)](https://xzhu0027.gitbook.io/blog/ml-system/sys-ml-index/learning-from-non-iid-data)** sampling design;
- thanks to the **clhs** package, a survey costs raster layer can also be integrated, to still
  keep the systematic spread but reduce survey costs as much as possible [@roudier2012conditioned] (see the sketch below).
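A cost-constrained variant could look as follows (a hedged sketch: the `cost`
column here is a mock west-to-east gradient standing in for a real survey-cost
surface such as distance to roads; `cost` is the name of the attribute passed
to `clhs()`):
```{r, eval=FALSE}
cost.df <- eberg_spc@predicted[1:4]
## mock survey cost increasing from west to east (stand-in for e.g. distance to roads):
cost.df$cost <- cost.df@coords[,1] - min(cost.df@coords[,1])
rnd.lhs.cost <- clhs::clhs(cost.df, size = 100, iter = 100,
                           cost = "cost", progress = FALSE, simple = FALSE)
```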
## Feature Space Coverage Sampling
**Feature Space Coverage Sampling** (FSCS) is described in detail in @BRUS2019464.
In a nutshell, FSCS aims at optimal coverage of the feature space, which is achieved
by minimizing the average distance of the population units (raster cells) to the
nearest sampling units in the feature space represented by the raster layers [@ma2020comparison].
To produce a FSCS point sample we can use the function `kmeanspp` of the package **LICORS** [@Goerg2013].
First we partition the feature space into e.g. 100 clusters. We then select the raster cells
with the shortest scaled Euclidean distance in covariate space to the centers of
the clusters as the sampling units:
```{r, eval = FALSE}
library(LICORS)
library(fields)
## cluster the feature space using k-means++:
fscs.clust <- kmeanspp(eberg_spc@predicted@data[,1:4], k=100, iter.max=100)
## distance of every raster cell to every cluster center in feature space:
D <- fields::rdist(x1 = fscs.clust$centers, x2 = eberg_spc@predicted@data[,1:4])
## for each cluster, pick the cell closest to its center:
units <- apply(D, MARGIN = 1, FUN = which.min)
rnd.fscs <- eberg_spc@predicted@coords[units,]
```
Note: the k-means++ algorithm is of most interest for small sample sizes: _“for large
sample sizes the extra time needed for computing the initial centers can become
substantial and may not outweigh the larger number of starts that can be afforded
with the usual k-means algorithm for the same computing time”_ [@Brus2021sampling].
The `kmeanspp` algorithm from the LICORS package is unfortunately quite computationally
intensive and is in principle not recommended for large grids / for generating large numbers of
samples. Instead, we recommend clustering the feature space using the `h2o.kmeans` function
from the [h2o
package](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html),
which is also suitable for larger datasets, with computing running in parallel:
```{r}
library(h2o)
h2o.init(nthreads = -1)
df.hex <- as.h2o(eberg_spc@predicted@data[,1:4], destination_frame = "df")
km.nut <- h2o.kmeans(training_frame=df.hex, k=100, keep_cross_validation_predictions = TRUE)
#km.nut
```
Note: in the example above, we have manually set the number of clusters to 100,
but the number of clusters could also be derived using [some optimization procedure](https://www.r-bloggers.com/2019/01/10-tips-for-choosing-the-optimal-number-of-clusters/).
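A simple heuristic is the within-cluster sum of squares “elbow” (a minimal sketch
on a random subsample for speed; the candidate values of k are arbitrary):
```{r, eval=FALSE}
sub <- eberg_spc@predicted@data[sample.int(nrow(eberg_spc@predicted@data), 5000), 1:4]
ks <- c(25, 50, 100, 200)
## total within-cluster sum of squares for each candidate k:
wss <- sapply(ks, function(k){ kmeans(sub, centers = k, iter.max = 50)$tot.withinss })
plot(ks, wss, type = "b", xlab = "number of clusters", ylab = "total within-cluster SS")
```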
Next, we predict the clusters:
```{r}
m.km <- as.data.frame(h2o.predict(km.nut, df.hex, na.action=na.pass))
```
We can save the class centers (if needed for any further analysis):
```{r}
class_df.c = as.data.frame(h2o.centers(km.nut))
names(class_df.c) = names(eberg_spc@predicted@data[,1:4])
str(class_df.c)
#write.csv(class_df.c, "NCluster_100_class_centers.csv")
```
To select sampling points we use the minimum distance to class centers [@Brus2021sampling]:
```{r, cache=TRUE}
## for each cluster center, select the raster cell closest in feature space:
D <- fields::rdist(x1 = class_df.c, x2 = eberg_spc@predicted@data[,1:4])
units <- apply(D, MARGIN = 1, FUN = which.min)
rnd.fscs <- eberg_spc@predicted@coords[units,]
```
```{r eberg-fscs, echo = TRUE, out.width = "80%", fig.cap = "An example of a Feature Space Coverage Sampling (FSCS)."}
plot.map(df=raster(eberg_spc@predicted[1]),
points = as.data.frame(rnd.fscs),
title="Feature Space Coverage Sampling (FSCS)" )
```
Visually, FSCS seems to add a higher spatial density of points in areas of
higher complexity. The `h2o.kmeans` algorithm stratifies the area into maximally
homogeneous units (in the example above, the large plains in the right
part of the study area are relatively homogeneous, hence the sampling intensity
in these areas drops significantly when visualized in geographical space),
and the points are then allocated per stratum.
```{r eberg-fscs-pnts, echo = TRUE, out.width = "80%", fig.cap = "Distribution of the FSCS points from the previous example displayed in the feature space."}
plot_hex(df=eberg_spc@predicted@data[,1:2], pts=eberg_spc@predicted@data[units,], title = 'PCA Ebergotzen FSCS')
```
The FSCS sampling pattern in feature space looks almost like grid sampling in feature
space. FSCS seems to put more effort into sampling at the edges of the feature space
in comparison to LHS and SRS, and hence can be compared to classical response
surface designs such as [D-optimal designs](https://en.wikipedia.org/wiki/Optimal_design) [@Hengl2004AJSR].
```{r}
h2o.shutdown(prompt = FALSE)
```
## Summary points
In the previous examples we have shown the differences between SRS, LHS and FSCS.
SRS and LHS are IID sampling methods, and as long as the number of samples is large
and the study area is complex, it is often difficult to notice or quantify any under- or
over-sampling. Depending on which variant of FSCS we implement, FSCS can result in
higher spatial spreading, especially if the study area consists of a combination
of relatively homogeneous and complex terrain units.
The Ebergotzen dataset (existing point samples) clearly shows that _“convenience surveys”_ can
exhibit significant clustering and under-representation of the feature space
(Fig. \@ref(fig:eberg-fs2)). Consequently, this sampling bias could lead to:
- _Bias in estimating regression parameters_, i.e. over-fitting;
- _Extrapolation problems_ due to under-representation of the feature space;
- _Bias_ in estimating the population parameters of the target variable.
The first step in dealing with these problems is to detect them; the second is to try
to implement a strategy that prevents model over-fitting. Some possible approaches
to deal with such problems are addressed in the second part of the tutorial.
LHS and FSCS are the recommended sampling methods if the purpose of sampling is to
build regression or classification models using a multitude of (terrain,
climate, land cover etc.) covariate layers. A generalization of LHS is balanced
sampling, where users can even select variable inclusion probabilities [@grafstrom2014efficient; @brus2015balanced].
@ma2020comparison compared LHS to FSCS for mapping soil types and concluded that
FSCS results in better mapping accuracy, most likely because FSCS spreads points
better in feature space, which in their case studies seems to have helped
with producing more accurate predictions. @yang2020evaluation also report that
LHS helps improve accuracy only for large sample sizes.
The `h2o.kmeans` algorithm is suited for large datasets, but generating ≫100
clusters from a large number of raster layers can nevertheless become RAM
consuming and is maybe not practical for operational sampling. An alternative
would be to reduce the number of clusters and select multiple points per cluster
(see the sketch below).
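For example (a minimal sketch, assuming simple random picks within each cluster,
which is only one of several possible within-cluster allocation rules):
```{r, eval=FALSE}
k <- 25; ppc <- 4  ## 25 clusters x 4 points per cluster = 100 samples
cl <- kmeans(eberg_spc@predicted@data[,1:4], centers = k, iter.max = 100)
## pick ppc cells at random within each cluster:
units2 <- unlist(lapply(1:k, function(i){
  idx <- which(cl$cluster == i)
  idx[sample.int(length(idx), min(ppc, length(idx)))]
}))
rnd.fscs2 <- eberg_spc@predicted@coords[units2,]
```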
If in doubt about whether to use LHS or FSCS, we recommend the following
simple rules of thumb:
- If your dataset contains relatively smaller-size rasters and the targeted number of sampling
  points is relatively small (e.g. ≪1000), we recommend using the FSCS algorithm;
- If your project requires a large number of sampling points (≫100), then you should probably
  consider using the LHS algorithm.
In general, as the number of sampling points grows, the differences between
SRS (no feature space) and LHS become minor, which can also be observed
visually (it basically becomes difficult to tell the difference between the two).
SRS could, however, by accident miss some important parts of the feature space,
so in principle it is still important to use either the LHS or FSCS algorithm to
prepare sampling locations where the objective is to fit regression and/or
classification models using ML algorithms.
To evaluate potential sampling clustering and pin-point under-represented areas
one can run multiple diagnostics:
1. In _geographical space_:
    (a) [kernel density analysis](https://rdrr.io/cran/spatstat.core/man/density.ppp.html) using the spatstat package, then determining whether some parts of the study area have systematically higher density;
    (b) testing for **Complete Spatial Randomness** [@schabenberger2005statistical] using e.g. [spatstat.core::mad.test](https://rdrr.io/cran/spatstat.core/man/dclf.test.html) and/or [dbmss::Ktest](https://rdrr.io/cran/dbmss/man/Ktest.html) (see the sketch after this list);
2. In _feature space_:
    (a) occurrence probability analysis using the [maxlike package](https://rdrr.io/github/rbchan/maxlike/man/maxlike-package.html);
    (b) **unsupervised clustering** of the feature space using e.g. `h2o.kmeans`, then determining whether any of the clusters are significantly under-represented / under-sampled;
    (c) estimating the **Area of Applicability** based on similarities between the training and prediction feature spaces [@meyer2021predicting].
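Diagnostics 1(a) and 1(b) could, for example, be run as follows (a hedged sketch
applied to the actual survey points from above; function homes follow the
spatstat versions linked above):
```{r, eval=FALSE}
library(spatstat)
bb <- sp::bbox(eberg_grid25)
## convert the survey points to a point pattern on the study-area window:
pp <- as.ppp(eberg.smp@coords, W = owin(bb[1,], bb[2,]))
plot(density(pp))        ## 1(a): kernel density surface
mad.test(pp, nsim = 99)  ## 1(b): Monte-Carlo test of Complete Spatial Randomness
```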
Plotting the generated sampling points both on a map and on a _feature space map_ helps
detect possible extrapolation problems in a sampling design (Fig. \@ref(fig:eberg-fs2)).
If you detect problems in feature space representation based on an existing point
sampling set, you can try to reduce those problems by adding additional samples, e.g. through
**covariate space infill sampling** [@Brus2021sampling] or through 2nd round
sampling and then re-analysis. These methods are discussed in further chapters.
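As a preview, a minimal sketch of cluster-based infill, reusing the h2o cluster
labels derived above: we look up which of the 100 feature-space clusters contain
no existing samples and add one point in each (allocating to the cell nearest
each cluster center would follow @Brus2021sampling more closely; a random cell
per cluster keeps the sketch short):
```{r, eval=FALSE}
## attach the h2o cluster labels to the grid:
eberg_spc@predicted$cluster <- m.km$predict
## clusters that contain none of the existing survey points:
smp.cl <- sp::over(eberg.smp, eberg_spc@predicted["cluster"])$cluster
empty <- setdiff(unique(eberg_spc@predicted$cluster), unique(smp.cl))
## add one infill point per unsampled cluster:
units.new <- sapply(empty, function(i){
  idx <- which(eberg_spc@predicted$cluster == i)
  idx[sample.int(length(idx), 1)]
})
infill.xy <- eberg_spc@predicted@coords[units.new,]
```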
```{r fig.height=5.5, fig.width=13.5, include=FALSE}
library(patchwork)
## shared theme for all panels: centered titles and a compact legend
thm <- theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  theme(legend.key.size = unit(0.3, 'cm'),
        legend.key.height = unit(0.3, 'cm'),
        legend.key.width = unit(0.5, 'cm'),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 9))
actual <- plot.map(df = raster(iprob$maxlike), points = rnd_pred, fill = rev(terrain.colors(5))) + theme_void() + thm
actual_hex <- plot_hex4(df = eberg_spc@predicted@data[,1:2], pts = ov2.rnd, title = 'PCA Ebergotzen actual') + thm
SRS <- plot.map(df = raster(eberg_spc@predicted[1]), points = rnd_pred) + theme_void() + thm
SRS_hex <- plot_hex4(df = eberg_spc@predicted@data[,1:2], pts = ov.rnd, title = 'PCA Ebergotzen SRS') + thm
LHS <- plot.map(df = raster(eberg_spc@predicted[1]), points = as.data.frame(rnd.lhs$sampled_data@coords)) + theme_void() + thm
LHS_hex <- plot_hex4(df = eberg_spc@predicted@data[,1:2], pts = as.data.frame(rnd.lhs$sampled_data@data), title = 'PCA Ebergotzen LHS') + thm
FSCS <- plot.map(df = raster(eberg_spc@predicted[1]), points = as.data.frame(rnd.fscs)) + theme_void() + thm
FSCS_hex <- plot_hex4(df = eberg_spc@predicted@data[,1:2], pts = eberg_spc@predicted@data[units,], title = 'PCA Ebergotzen FSCS') + thm
summary_plot <- actual + SRS + LHS + FSCS + actual_hex + SRS_hex + LHS_hex + FSCS_hex + plot_layout(ncol = 4, guides = "collect")
saveRDS(summary_plot, "./output/summary_plot.rds")
```