---
title: "Acquiring and Pre-Processing the WSIM-GLDAS Dataset"
author: "Josh Brinks"
date: "December, 6 2023"
---
## TO DO
* Add context for package selection
+ stars for raster management
+ sf for vector/shapefile/geojson
+ lubridate for date management
* More dataset context/explanation (geoBoundaries vs gadm).
* Citations and external links for data and packages.
* Decide on which wsim-gldas version to use.
+ Switch out the current for a 12-month integration anomaly.
* Write out the smaller pre-processed file to disk for potential use in binder workshop or subsequent lesson in the module.
* Introduce automated data acquisition with a SEDAC or Earthdata package?
## Introduction
WSIM-GLDAS can be downloaded from [SEDAC](https://sedac.ciesin.columbia.edu/data/set/water-wsim-gldas-v1). Downloads are organized by a combination of variable (composite surplus/deficit, temperature, PETmE, runoff, soil moisture, precipitation) and integration period (1, 3, 6, 12 months). Each variable-integration combination consists of a NetCDF raster file with a time dimension that contains a raster layer for each of the 804 months between January 1948 and December 2014. Some variables also contain multiple attributes, each with their own time dimension of 804 rasters. Hence, this is a large file that can take upwards of 2 hours to download and may cause memory issues on some systems. We will work with the composite anomalies integrated over 12-month periods.
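This lesson assumes you've downloaded the file manually, but if you have a direct link the download can be scripted. Below is a minimal sketch with a placeholder URL (SEDAC downloads generally sit behind an Earthdata Login, so the website may be the easier route):

```{r eval=FALSE}
# placeholder URL for illustration only; substitute the real SEDAC
# download link for the 12-month composite anomalies
url <- "https://example.com/composite_anom_12mo.nc"
destfile <- "composite_anom_12mo.nc"

# only download if the file isn't already in the working directory;
# mode = "wb" ensures the binary NetCDF isn't corrupted on Windows
if (!file.exists(destfile)) {
  download.file(url, destfile, mode = "wb")
}
```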
## Reading the Data
Once you've completed the download and placed the .nc file into your working directory, read it in with the `stars::read_stars()` function.
```{r}
# proxy = TRUE will limit memory usage but does
# not always work with certain downstream processing functions
wsim_gldas_anoms <- stars::read_stars("composite_anom_12mo.nc", proxy = FALSE)
print(wsim_gldas_anoms)
```
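As the comment above notes, `proxy = TRUE` defers reading pixel values into memory. A minimal sketch of that approach, in case `proxy = FALSE` exhausts your system's memory:

```{r eval=FALSE}
# read only the metadata now; pixel values are fetched lazily
wsim_gldas_proxy <- stars::read_stars("composite_anom_12mo.nc", proxy = TRUE)

# after subsetting the proxy down to a manageable size, realize it
# into an ordinary in-memory stars object
wsim_gldas_small <- stars::st_as_stars(wsim_gldas_proxy)
```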
The `print` command gives some basic information. The output tells us we have 5 attributes (deficit, deficit_cause, surplus, surplus_cause, both) and 3 dimensions. The first 2 dimensions are the spatial extents (x/y, i.e., longitude/latitude), and time is the 3rd dimension.
This means the total number of individual raster layers in this NetCDF is 4,020 (5 attributes x 804 time steps/months).
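You can verify that arithmetic directly: `dim()` on a stars object returns the size of each dimension, and the attribute names are the object's `names()`.

```{r eval=FALSE}
# dimension sizes: x, y, and time
dim(wsim_gldas_anoms)

# attributes x time steps = total raster layers (5 x 804 = 4020)
length(names(wsim_gldas_anoms)) * dim(wsim_gldas_anoms)[["time"]]
```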
## Attribute Selection
We can start paring this down by subsetting for just deficits (drought).
```{r}
names(wsim_gldas_anoms)
wsim_gldas_anoms <- wsim_gldas_anoms['deficit']
```
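The bracket selection also accepts a vector of names, so keeping several variables is just as easy. For example, applied to the original five-attribute object:

```{r eval=FALSE}
# keep both drought (deficit) and flood (surplus) variables;
# this assumes the object still holds all five attributes
wsim_deficit_surplus <- wsim_gldas_anoms[c("deficit", "surplus")]
```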
## Time Selection
Specifying a temporal range of interest will free up more space. We'll grab the December time step for every year from 2000 through 2014. This can be accomplished by generating a sequence of dates from December 2000 to December 2014 and then passing that vector to `dplyr::filter`.
```{r}
# generate a vector of dates for subsetting
keeps <- seq(lubridate::ymd("2000-12-01"),
             lubridate::ymd("2014-12-01"),
             by = "year")
# filter using that vector
wsim_gldas_anoms <- dplyr::filter(wsim_gldas_anoms, time %in% keeps)
print(wsim_gldas_anoms)
```
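To confirm the subset kept exactly the dates we asked for, we can pull the values of the time dimension:

```{r eval=FALSE}
# should list one December time step per year, 2000 through 2014
stars::st_get_dimension_values(wsim_gldas_anoms, "time")
```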
Now we're down to a single attribute ("deficit") with 15 time steps. We can take a look at the first 6 years by passing the object through `slice` and then into `plot`.
```{r warning=FALSE}
wsim_gldas_anoms |>
dplyr::slice(index = 1:6, along = "time") |>
plot(key.pos = 1, breaks = c(-50, -30, -20, -10, -5, 0), key.lab = "Deficit")
```
Although we've pared it down to a single attribute with a restricted time period of interest, we can take it a step further and reduce the spatial extent to a country or state of interest.
## Spatial Selection
We can spatially crop the raster stack with a few different methods. Options include using a bounding box (xmin, ymin, xmax, ymax), another raster object, or a vectorized boundary like a shapefile or geojson.
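As a quick illustration of the bounding-box route, `sf::st_crop()` accepts a `bbox` built from raw coordinates. The values below are rough, illustrative bounds for the contiguous United States:

```{r eval=FALSE}
# illustrative bounding box (approximate CONUS extent in degrees)
conus_bbox <- sf::st_bbox(c(xmin = -125, ymin = 24, xmax = -66, ymax = 50),
                          crs = sf::st_crs(wsim_gldas_anoms))

# crop the raster stack to that box
wsim_gldas_conus <- sf::st_crop(wsim_gldas_anoms, conus_bbox)
```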
Using a vector boundary is one of the more common geoprocessing tasks. In this example we'll pull a GeoJSON of the United States from the geoBoundaries API. You can also download vectorized boundaries directly from [geoboundaries.org](https://www.geoboundaries.org).
The call to the geoBoundaries API is pretty simple:
"https://www.geoboundaries.org/api/current/gbOpen/**ISO3C**/**LEVEL**/"
We adjust the bolded components of the URL to specify the country we want, using the 3-letter ISO country code (USA) and the desired administrative level (ADM1).
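If you expect to request several boundaries, a tiny helper (illustrative, not part of any package) can fill in the template:

```{r eval=FALSE}
# build a geoBoundaries API URL from an ISO3 code and admin level
gb_url <- function(iso3, level) {
  sprintf("https://www.geoboundaries.org/api/current/gbOpen/%s/%s/",
          iso3, level)
}

gb_url("USA", "ADM1")
```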
```{r}
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
```
After the `GET` call, we extract the body of the response with `httr::content()`.
```{r}
usa <- httr::content(usa)
names(usa)
```
The parsed content contains 32 components. Component 29 is a direct link to the GeoJSON file (`gjDownloadURL`). Read that in with `sf::st_read()` and plot it for a quick visual check.
```{r}
usa <- sf::st_read(usa$gjDownloadURL)
plot(sf::st_geometry(usa))
```
This looks good, but it includes all United States territories. For simplicity, we can get it down to only the contiguous United States.
```{r}
drops <-
  c("Alaska", "Hawaii",
    "American Samoa",
    "Puerto Rico",
    "Commonwealth of the Northern Mariana Islands",
    "Guam",
    "United States Virgin Islands")
usa <- usa[!(usa$shapeName %in% drops), ]
plot(sf::st_geometry(usa))
```
We can take this a step further and select a target state.
```{r}
texas <- usa[usa$shapeName == "Texas",]
plot(sf::st_geometry(texas))
```
From here we can crop the WSIM-GLDAS raster stack by indexing it with the stored boundary of Texas.
```{r}
wsim_gldas_anoms_tex <- wsim_gldas_anoms[texas]
```
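Indexing a stars object with an sf polygon should be equivalent to an explicit crop, so the line above can also be written as:

```{r eval=FALSE}
# explicit form of the indexing shorthand: crop to Texas' bounding
# box and mask cells outside the polygon
wsim_gldas_anoms_tex <- sf::st_crop(wsim_gldas_anoms, texas)
```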
For a final visual check, we'll take the last time step in the WSIM-GLDAS dataset (time step 15: December 2014) and plot it with an overlay of the Texas boundary.
```{r warning = FALSE}
wsim_gldas_anoms_tex |>
dplyr::slice(index = 15, along = "time") |>
plot(reset = FALSE, breaks = c(-5, -4, -3, -2, -1, 0))
plot(sf::st_geometry(texas),
add = TRUE,
lwd = 3,
col = NA,
border = 'purple')
```
The subsetted dataset can be written to disk and saved for use in future modules.
```{r}
stars::write_mdim(wsim_gldas_anoms_tex, "wsim_gldas_tex.nc")
```
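As a sanity check, the file can be read back with `stars::read_mdim()`, the counterpart to `write_mdim()`:

```{r eval=FALSE}
# confirm the attribute, time steps, and extent survived the round trip
print(stars::read_mdim("wsim_gldas_tex.nc"))
```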
The pre-processed dataset is 1.6 MB, compared to 1.7 GB for the original. This is much more manageable in cloud environments, workshops, and Git platforms.
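You can verify the on-disk size in the same session:

```{r eval=FALSE}
# file size in megabytes
file.size("wsim_gldas_tex.nc") / 1e6
```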