The objective of cc.data is to build and process datasets that are ready to use in a new Curbcut instance, or to update data in an existing Curbcut instance. cc.data offers functions to write the processed data to, and read it from, AWS services such as RDS (Amazon Aurora Serverless - MySQL) and S3 buckets.
You can install the development version of cc.data from GitHub with:
# install.packages("devtools")
devtools::install_github("Curbcut/cc.data")
For census data building and processing, cc.data creates and works from long lists of tibbles. For faster building, the package frequently uses functions from the future_*apply family of the future.apply package for any computationally intensive work. These implementations of the apply family can be resolved using any backend supported by future (parallelization). We therefore recommend defining a future backend at the beginning of the data building script.
future::plan(future::multisession())
Let’s also enable the global progression handler to get a progress bar on these more computationally intensive operations.
progressr::handlers(global = TRUE)
To provide the ability to download subsets of the pre-processed data, cc.data uploads the data into a MySQL serverless database on AWS. To do so, the builder/uploader must have a database administrator user account. To read the data, a user account with read-only privileges must be created. To get one, please contact Max at [email protected]. The Curbcut team will create a user account using the db_create_user function.
db_create_user(user, password)
Once a user account is created, credentials must be stored as environment variables in the .Renviron file, which can be opened using usethis::edit_r_environ().
# AWS Curbcut db user with read only privileges
CURBCUT_DB_USER = "user"
CURBCUT_DB_PASSWORD = "password"
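After saving the .Renviron file and restarting R, a quick sanity check (a sketch only, not required by cc.data) is to read the credentials back with base R:
# Both should return TRUE once R has been restarted
nchar(Sys.getenv("CURBCUT_DB_USER")) > 0
nchar(Sys.getenv("CURBCUT_DB_PASSWORD")) > 0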
Every Curbcut instance has its own data folder in which unique pre-computed datasets are stored for the live application (built through the cc.buildr package). To allow for easier sharing of this and any other folder, cc.data has write/read functions for AWS S3 buckets. To write to and read from these buckets, a user account must be created through the AWS console: Security credentials > Access management > Users > Add users. The credential type is Access key - Programmatic access, and the attached policy must be either AmazonS3FullAccess to allow writing, or AmazonS3ReadOnlyAccess to only allow reading. The access keys must then be saved in the .Renviron file.
# AWS Curbcut bucket access for the 'x' user
CURBCUT_BUCKET_ACCESS_ID = "access_id"
CURBCUT_BUCKET_ACCESS_KEY = "access_pw"
CURBCUT_BUCKET_DEFAULT_REGION = "us-east-1"
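As a quick sanity check, the keys can be tested against a bucket. This is only a sketch: it assumes the aws.s3 package (not a cc.data dependency) and the bucket name below is a placeholder.
# Should return TRUE if the credentials grant access to the bucket
library(aws.s3)
bucket_exists(
  bucket = "example-curbcut-bucket",  # placeholder bucket name
  key    = Sys.getenv("CURBCUT_BUCKET_ACCESS_ID"),
  secret = Sys.getenv("CURBCUT_BUCKET_ACCESS_KEY"),
  region = Sys.getenv("CURBCUT_BUCKET_DEFAULT_REGION")
)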
A set of already built tibbles, part of cc.data, informs the census building and processing functions by default:
- cc.data::census_years: All available census years, starting in 1996;
- cc.data::census_scales: All scales for which data should be built and processed (CSD, CT, DA);
- cc.data::census_vectors_table: A Curbcut selection of census variables or aggregates of census variables. It has a column for each census year holding the CensusMapper variable names of the census variables to be downloaded, as well as the “parent” variables (e.g. in the case where a Curbcut variable is a percentage), plus the title and explanation of the variables;
- cc.data::census_vectors: Only the variable codes of the census vectors. This is the vector fed to most function arguments asking for census vectors, and it retrieves and subsets census_vectors_table behind the scenes.
These are the only objects that need to be updated when a new census comes out, when we want to add a scale to the census data building, or when we want to add variables. The data building and processing functions use them by default.
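For example, these defaults can be inspected directly before launching a build:
# Built-in defaults driving the census building and processing functions
cc.data::census_years
cc.data::census_scales
head(cc.data::census_vectors_table)
cc.data::census_vectors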
Census data is built and processed through a depth-2 list. The first depth is the census scales, and the second depth is the census years for each census scale. For this reason, we can speed up operations by using two layers of futures. If memory and the number of threads allow, the optimal setup is an outer layer with as many workers as there are census scales, and an inner layer with as many workers as there are census years.
# Tweak futures
future::plan(list(future::tweak(future::multisession,
workers = length(cc.data::census_scales)),
future::tweak(future::multisession,
workers = length(cc.data::census_years))))
Here is the census data building and processing workflow:
empty_geometries <- census_empty_geometries()
data_raw <- census_data_raw(empty_geometries)
interpolated <- census_interpolate(data_raw)
normalized <- census_normalize(interpolated)
parent_dropped <- census_drop_parents(normalized)
processed_census <- census_reduce_years(parent_dropped)
- Download empty geometries for each census scale and each of their census years;
- Populate the empty geometries with raw data downloaded directly through the cancensus package, using the variable codes from the census_vectors_table as well as the parent variables;
- Interpolate all data of all years to the most recent census year;
- Normalize data depending on whether it is classed as a number or a percentage, using the parent variable;
- Drop the parent variables;
- Reduce all years into one dataframe. The output of the last function is a list with only one depth: a list containing as many sf data.frames as there are census scales in cc.data::census_scales.
To provide the ability to download subsets of the pre-processed data, we store the raw census data of dissemination areas and the processed census data of all scales in a MySQL database. The raw census data is used if a builder of a Curbcut instance needs to interpolate census data to custom boundaries. Interpolating using the raw census data provides the closest degree of accuracy possible. The two following functions are tailored exactly to the output of census_data_raw()$DA and of census_reduce_years().
db_write_processed_data(processed_census)
db_write_raw_data(DA_data_raw = data_raw$DA)
Most of the operations for building the dataset are performed iteratively on each province. The work is very spatial-feature heavy and can hog a lot of memory: each worker of the future backend will use between 16 and 20 GB of memory.
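As a rough guide (the worker count below is illustrative and should be adapted to the machine), cap the future plan before this step so total memory use stays within bounds:
# With roughly 16-20 GB per worker, a 64 GB machine can run about 3 workers
future::plan(future::multisession, workers = 3)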
Our building dataset is built using both OpenStreetMap and Microsoft’s Canadian Building Footprints for places where OSM doesn’t have great coverage. To free up memory between operations, the following function writes the building dataset for each province to a destination folder.
buildings_folder <- "buildings/"
buildings_sf(DA_processed_table = processed_census$DA,
dest_folder = buildings_folder,
OSM_cache = TRUE)
The reverse geocoding can be sped up by building our own docker container of Nominatim, which, when mounted using a .pbf (OpenStreetMap) file, can be used to find the nearest address of a latitude and longitude coordinate. This option is more than 100x faster than querying external servers to reverse geocode a large dataset like all the buildings in the country. The following will download and mount the docker image:
rev_geocode_create_local_nominatim()
Using a combination of the National Open Database of Addresses and the local instance of Nominatim, we reverse geocode all the buildings in the country. This takes an even harder toll on memory usage, as the local Nominatim instance also uses a lot of computational power to serve the reverse geocoding.
buildings_rev_geo <- rev_geocode_building(prov_folder = buildings_folder)
The processed building dataset is stored in the MySQL database under the buildings name, with the dissemination area identifier as the index. As the table is huge, we append data in waves of 50k buildings (last argument).
db_write_table(df = buildings_rev_geo,
               tb_name = "buildings",
               index = "DA_ID",
               rows_per_append_wave = 50000)
The function creating the streets dataset downloads the most up-to-date road network from the Canadian government’s website. It is then cleaned and street names are standardised. In cases where a street name can’t be pieced together, we use the local Nominatim instance created at the buildings step.
streets <- streets_sf(DA_processed_table = processed_census$DA)
db_write_table(df = streets,
tb_name = "streets",
index = "DA_ID",
rows_per_append_wave = 25000)
A csv containing country-wide postal codes is available in Curbcut’s S3 buckets. The following simply imports it, converts it to spatial features, and attaches a dissemination area identifier. It’s saved in the MySQL database with a DA ID index, so that only the postal codes of a particular territory under study can be imported.
postal_codes <- postal_codes_process(processed_census$DA)
db_write_table(df = postal_codes, tb_name = "postal_codes", index = "DA_ID",
rows_per_append_wave = 50000)
The travel time matrices are calculated using two different techniques depending on the mode. Walking, biking and driving are calculated using a local docker container of the modern C++ routing engine OSRM. The transit travel time matrices are calculated using a mix of OSRM and of General Transit Feed Specification (GTFS) files. They can all be solved with any backend supported by future (parallelization).
We start by getting the centroid of every dissemination area and specifying where the routing data will be downloaded and saved. Inside this folder, we create subfolders for foot, bike and car.
DA_table <- processed_census$DA["ID"]
DA_table <- suppressWarnings(sf::st_transform(DA_table, 3347))
DA_table <- suppressWarnings(sf::st_centroid(DA_table))
DA_table <- suppressWarnings(sf::st_transform(DA_table, 4326))
dest_folder <- "routing/"
dir.create(dest_folder)
foot_folder <- paste0(dest_folder, "foot")
bike_folder <- paste0(dest_folder, "bike")
car_folder <- paste0(dest_folder, "car")
dir.create(foot_folder)
dir.create(bike_folder)
dir.create(car_folder)
For every mode, we will have to create a local dockerized OSRM instance. For further information, read the Project-OSRM/osrm-backend documentation. The following code sets it up by downloading pbf files and opening a command-line terminal (only works on Windows right now).
tt_local_osrm(dest_folder = foot_folder, mode = "foot")
Once the docker container is properly set up, we calculate travel time matrices from every DA to every DA inside a 10 km radius (the maximum distance one could possibly walk in an hour). The routing can be calculated in parallel and, for each pair of dissemination areas, makes a call to the local OSRM container. The output is a named list of two-column dataframes: every dataframe is a DA, its first column lists all the other DAs it can reach in an hour, and its second column is the time in seconds it takes to reach them on foot. We then save each individual dataframe as its own table in the MySQL database.
foot <- tt_calculate(DA_table_centroids = DA_table, max_dist = 10000)
db_write_ttm(ttm = foot, mode = "foot")
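Before writing to the database, the structure described above can be sanity-checked with base R (shown for illustration; the exact column names come from tt_calculate):
# Each element is named after a DA; its dataframe lists the DAs reachable
# within an hour and the travel time in seconds
length(foot)
names(foot)[1:3]
head(foot[[1]])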
To keep things clean, we can then remove the container from the list of active docker containers. I suggest not doing so for the walking mode, as it will be reused when calculating travel times for transit.
shell(paste0('docker rm --force foot_osrm'))
The exact same process is done for the bike and car modes. We change the mode and update the maximum distance within which travel times between DAs should be calculated.
# Bike
tt_local_osrm(dest_folder = bike_folder, mode = "bicycle")
bicycle <- tt_calculate(DA_table_centroids = DA_table, max_dist = 30000,
routing_server = routing_server)
db_write_ttm(ttm = bicycle, mode = "bicycle")
shell(paste0('docker rm --force bicycle_osrm'))
# Car
tt_local_osrm(dest_folder = car_folder, mode = "car")
car <- tt_calculate(DA_table_centroids = DA_table, max_dist = 120000,
routing_server = routing_server)
db_write_ttm(ttm = car, mode = "car")
shell(paste0('docker rm --force car_osrm'))
TKTKTKTKTKTK TRANSIT
The following are simple one-variable datasets used in any version of Curbcut. Their code takes a csv file living in an S3 bucket, cleans it, and saves it to the MySQL database.
canale <- canale_process()
db_write_table(df = canale,
tb_name = "canale",
primary_key = "DA_ID",
index = "DA_ID")
canbics <- canbics_process()
db_write_table(df = canbics,
tb_name = "canbics",
primary_key = "DA_ID",
index = "DA_ID")