---
title: "Simplifying geospatial features in R with sf and rmapshaper"
author: "Markus Konrad"
date: "3/15/2021"
output: html_document
self_contained: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  dev = 'png',
  fig.path = 'figures/'
)
```
When working with geospatial data, memory consumption and computation time can become quite a problem, since these datasets are often very large. You may have very granular, high resolution data although that isn't really necessary for your use-case, for example when plotting large scale maps or when applying calculations at a spatial level for which lower granularity is sufficient. In such scenarios, it's best to first *simplify* the *geospatial features* (the sets of points, lines and polygons that represent geographic entities) in your data. By simplifying I mean that we generally reduce the amount of information that is used to represent these features: we remove complete features (e.g. small islands), we join features (e.g. we combine several overlapping or adjacent features) or we reduce the complexity of features (e.g. we remove vertices or holes in a feature). Since applying these operations comes with information loss, you should be very careful about how much you simplify and whether this in some way biases your results.
In R, we can apply this sort of simplification with a few functions from the packages [sf](https://cran.r-project.org/web/packages/sf/index.html) and, for some special cases explained below, [rmapshaper](https://cran.r-project.org/web/packages/rmapshaper/index.html). In the following, I will show how to apply them and what the effects of their parameters are.
First, let's load the required libraries. Besides the already mentioned sf package, we also load [ggplot2](https://ggplot2.tidyverse.org/) for visualization and [magrittr](https://magrittr.tidyverse.org/), because I will use the `%>%` pipe operator sometimes.
```{r, message=FALSE}
library(ggplot2)
library(magrittr)
library(sf)
```
I will also use a utility function `plot_map()` which is capable of plotting geospatial features, zooming into certain spots and highlighting differences between two features. The lengthy source code for this function is available in the [Rmarkdown document of the related GitHub repository](https://github.com/WZBSocialScienceCenter/r_simplify_features).
```{r, echo=FALSE}
plot_map <- function(data, diff_to = NULL, strokecolor = NA, fillcolor = c('#097FB3', '#A13675'),
                     alpha = 1, graticules = FALSE, zoom_to = NULL, zoom_level = NULL) {
  target_crs <- st_crs(data)

  if (!is.null(zoom_level)) {
    if (is.null(zoom_to)) {
      zoom_to_xy <- st_sfc(st_point(c(0, 0)), crs = target_crs)
    } else {
      zoom_to_xy <- st_transform(st_sfc(st_point(zoom_to), crs = 4326), crs = target_crs)
    }

    C <- 40075016.686   # ~ circumference of Earth in meters
    x_span <- C / 2^zoom_level
    y_span <- C / 2^(zoom_level + 1)

    disp_window <- st_sfc(
      st_point(st_coordinates(zoom_to_xy - c(x_span / 2, y_span / 2))),
      st_point(st_coordinates(zoom_to_xy + c(x_span / 2, y_span / 2))),
      crs = target_crs
    )

    xlim <- st_coordinates(disp_window)[, 'X']
    ylim <- st_coordinates(disp_window)[, 'Y']
  } else {
    xlim <- NULL
    ylim <- NULL
  }

  if (!is.null(diff_to)) {
    shapebase <- diff_to
    shapediff <- st_sym_difference(data, diff_to)
  } else {
    shapebase <- data
    shapediff <- NULL
  }

  p <- ggplot() +
    geom_sf(data = shapebase, color = strokecolor, fill = fillcolor[1], alpha = alpha)

  if (!is.null(shapediff)) {
    p <- p + geom_sf(data = shapediff, color = strokecolor, fill = fillcolor[2], alpha = alpha)
  }

  p +
    coord_sf(xlim = xlim,
             ylim = ylim,
             crs = target_crs,
             datum = ifelse(graticules, target_crs$input, NA)) +
    theme_bw()
}
```
## Loading and transforming our sample data
Let's load our first example dataset. It represents the German federal state [Mecklenburg-Vorpommern (or Mecklenburg-West Pomerania)](https://www.openstreetmap.org/relation/28322), which features a ragged coastline at the Baltic Sea that nicely shows the effects of feature simplification. I will abbreviate Mecklenburg-Vorpommern as MV to spare us the long name.
```{r}
mv <- st_read('data/mv.geojson') %>%
  st_geometry() %>%
  st_transform(crs = '+proj=aeqd +lat_0=53.6 +lon_0=12.7') # centered at input data for low distortion
```
The dataset contains only a single feature (a multi-polygon, i.e. a set of polygons) with some metadata from OpenStreetMap. I use `st_geometry()` to access this feature (i.e. dismiss the metadata) and `st_transform()` to transform it to an [Azimuthal Equidistant](http://www.radicalcartography.net/?projectionref) map projection. The latter is quite important. Most functions that I'll use don't work with spherical coordinates as used for example by the [WGS84 (GPS) standard](http://epsg.io/4326), where longitude and latitude in degrees describe a position on a sphere. We need to project these coordinates on a plane using a coordinate reference system (CRS) with a specific projection. All projections distort in some way, but we can choose a CRS that accurately represents certain measures. For example, I chose an equidistant CRS because it accurately represents distances up to 10,000 km from the center of projection. Another important aspect is that by using this CRS, we move from degrees to meters as units for our coordinates.
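To see the unit change in isolation, here is a minimal sketch (a toy point I made up, not part of the MV dataset): transforming a WGS84 point to the same Azimuthal Equidistant CRS yields coordinates in meters relative to the projection center.

```{r}
# a lon/lat point at the projection center (degrees, WGS84)
pt_wgs84 <- st_sfc(st_point(c(12.7, 53.6)), crs = 4326)

# after transformation, coordinates are meters from the center of projection
pt_aeqd <- st_transform(pt_wgs84, crs = '+proj=aeqd +lat_0=53.6 +lon_0=12.7')
st_coordinates(pt_aeqd)   # approximately (0, 0)
```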
Let's plot our projection of MV. Note the axes which represent the distance from the projection center (I chose the geographic center of MV for that) in meters.
```{r 01mvmap}
plot_map(mv, graticules = TRUE)
```
Let's zoom in on the island of Rügen with its ragged coastline (west of it is the peninsula Darß, east is a part of Usedom).
```{r 02mvmap_zoom}
plot_map(mv, graticules = TRUE, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
## Removing vertices with `st_simplify`
We can now continue to apply different simplification methods to the displayed features. We will start with – no surprise – [`st_simplify`](https://rdrr.io/cran/sf/man/geos_unary.html). This function removes vertices in lines or polygons to form simpler shapes. The implementation uses the [Douglas–Peucker algorithm](https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm) for simplification and accepts a parameter `dTolerance`, which, roughly speaking, specifies the distance within which any "wiggles" will be straightened. It's in the same unit as the input data, so in our case it's meters (the unit used in our CRS). There's also a parameter `preserveTopology` which, when set to TRUE, makes sure that polygons are not reduced to lines or removed entirely and that inner holes are not removed during the simplification process.
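The effect of `dTolerance` can be seen on a toy line (made-up coordinates, not the MV data): a zigzag whose wiggles deviate less than the tolerance from a straight line collapses to its endpoints.

```{r}
# zigzag line: 11 vertices, wiggling 1 unit around the x-axis
zigzag <- st_sfc(st_linestring(cbind(0:10, rep(c(0, 1), length.out = 11))))
zigzag_simpl <- st_simplify(zigzag, dTolerance = 2)

nrow(st_coordinates(zigzag))        # 11 vertices
nrow(st_coordinates(zigzag_simpl))  # 2: only the endpoints remain
```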
Let's apply `st_simplify` with a tolerance of 1km:
```{r 03mvsimpl}
mv_simpl <- st_simplify(mv, preserveTopology = FALSE, dTolerance = 1000)
plot_map(mv_simpl)
```
We can see that the geometry is much simpler, especially when zooming in:
```{r 04mvsimpl_zoom}
plot_map(mv_simpl, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
We can also confirm the memory usage is much lower for the simplified feature (shown in kilobytes):
```{r}
round(c(object.size(mv), object.size(mv_simpl)) / 1024)
```
The purple areas show the differences between the original and the simplified version. We can see, for example, that some smaller islands were removed entirely.
```{r 05mvsimpl_diff}
plot_map(mv, mv_simpl, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
When we set `preserveTopology` to TRUE, we can observe that the small islands are retained:
```{r 06mvpreservetopo}
mv_simpl <- st_simplify(mv, preserveTopology = TRUE, dTolerance = 1000)
plot_map(mv_simpl, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
Increasing the `dTolerance` parameter to 10km gives us something more akin to a Cubist painting:
```{r 07mvsimpl_coarse}
mv_simpl <- st_simplify(mv, preserveTopology = FALSE, dTolerance = 10000)
plot_map(mv_simpl)
```
```{r 08mvsimpl_coarse_diff}
plot_map(mv, mv_simpl)
```
## `st_simplify` limitations
Using `st_simplify` comes with a limitation when working with multiple, adjacent features, which I'd like to demonstrate using a map of the federal states in Germany.
```{r}
fedstates <- read_sf('data/germany_fedstates.geojson') %>%
  st_transform(crs = 5243) # ETRS89 / LCC Germany (E-N)
fedstates[c('name', 'geometry')]
```
This time, our spatial dataset contains 16 features which represent the boundaries of each federal state:
```{r 09fedstates}
plot_map(fedstates, graticules = TRUE, strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
Let's apply `st_simplify` with a tolerance of 10km and preserving feature topology:
```{r 10fedstates_simpl_fail}
fedstates_simpl <- st_simplify(fedstates, preserveTopology = TRUE, dTolerance = 10000)
plot_map(fedstates_simpl, strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
Oops! We can see that `st_simplify` produced gaps and overlapping features, i.e. shared borders were not handled correctly. The problem is that `st_simplify` simply doesn't consider the topological concept of shared borders between features (like federal states in this case). Setting `preserveTopology = TRUE` preserves each feature's *own* topology, but the topology *between* features is not taken into account.
Luckily, the package [rmapshaper](https://cran.r-project.org/web/packages/rmapshaper/index.html) contains a function, `ms_simplify`, that handles this correctly. Before you install the package, please note that it requires quite a few system dependencies.
`ms_simplify` uses [Visvalingam's algorithm](https://en.wikipedia.org/wiki/Visvalingam%E2%80%93Whyatt_algorithm), which lets us specify the portion of vertices that should be kept after simplification via the `keep` parameter. The parameter `keep_shapes` controls whether complete features such as islands may be removed in the simplification process.
```{r 11fedstates_simpl_rmapshaper}
library(rmapshaper)
fedstates_simpl2 <- ms_simplify(fedstates, keep = 0.001, keep_shapes = FALSE)
plot_map(fedstates_simpl2, strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
As can be seen, `ms_simplify` respects the shared borders. The result is also much smaller in terms of memory usage:
```{r}
round(c(object.size(fedstates), object.size(fedstates_simpl2)) / 1024)
```
## Expanding and shrinking with `st_buffer`
[`st_buffer`](https://rdrr.io/cran/sf/man/geos_unary.html) lets you expand (positive buffer distance) or shrink (negative buffer distance) polygon features. This may not per se result in a simpler feature form, but it allows you to close holes (by expanding) or remove solitary, small features (by shrinking).
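The hole-closing behavior is easiest to see on a toy polygon (made-up shapes, not the MV data): expanding by more than half the hole's width fills the hole, and shrinking back by the same distance restores roughly the original outline.

```{r}
# 10x10 square with a 2x2 hole in the middle
outer_ring <- rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))
hole_ring  <- rbind(c(4, 4), c(4, 6), c(6, 6), c(6, 4), c(4, 4))
donut <- st_sfc(st_polygon(list(outer_ring, hole_ring)))

# expand by 2 (fills the 2-unit-wide hole), then shrink by 2
donut_closed <- st_buffer(st_buffer(donut, 2), -2)

st_area(donut)         # 96: 100 minus the 2x2 hole
st_area(donut_closed)  # ~100: the hole is gone
```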
Let's try a first example. Again, the buffer distance is in the same unit as our CRS, which is meters in our case, and we'll expand by 1km. When the buffer is generated, new vertices are added to form the expanded shape with round edges. The parameter `nQuadSegs` specifies how many segments are used to approximate each quarter circle at these rounded edges.
```{r 12mvbuf}
mv_buf <- st_buffer(mv, dist = 1000, nQuadSegs = 1)
plot_map(mv_buf, graticules = TRUE)
```
```{r 13mvbuf_diff}
plot_map(mv, mv_buf, graticules = TRUE, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
Memory usage, however, is not reduced by much:
```{r}
round(c(object.size(mv), object.size(mv_buf)) / 1024)
```
Now we shrink MV by 1km:
```{r 14mvbuf2}
mv_buf <- st_buffer(mv, -1000, nQuadSegs = 1)
plot_map(mv_buf)
```
```{r 15mvbuf2_diff}
plot_map(mv, mv_buf, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
We can also combine both operations: first shrinking in order to remove small isolated features, then expanding again by the same distance so that the result has roughly the same extent as the original:
```{r 16mvbuf3}
mv_buf <- st_buffer(mv, -1000, nQuadSegs = 1) %>%
  st_buffer(1000, nQuadSegs = 1)
plot_map(mv_buf)
```
```{r 17mvbuf3_diff}
plot_map(mv, mv_buf, zoom_to = c(13.359, 54.413), zoom_level = 8)
```
Note how this is more aggressive than `st_simplify` but the result uses more memory. We could now additionally apply `st_simplify` to arrive at a very reduced feature.
```{r}
round(c(object.size(mv), object.size(mv_buf)) / 1024)
```
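This shrink-then-expand trick (a morphological "opening") can also be seen in isolation on toy shapes (made-up data, not MV): shrinking by more than a small feature's "radius" removes it entirely, and expanding back restores only the large feature.

```{r}
# one large and one small square, combined into a single multi-polygon
big_sq   <- st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))))
small_sq <- st_polygon(list(rbind(c(20, 0), c(21, 0), c(21, 1), c(20, 1), c(20, 0))))
shapes <- st_union(st_sfc(big_sq, small_sq))

# shrink by 1 (the 1x1 square vanishes), then expand by 1
shapes_opened <- st_buffer(st_buffer(shapes, -1), 1)

length(st_cast(shapes, 'POLYGON'))         # 2 parts
length(st_cast(shapes_opened, 'POLYGON'))  # 1: the small square was removed
```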
When we apply `st_buffer` to a dataset with multiple features like the federal states dataset, the buffering will be applied to every feature (i.e. every state's boundary) separately. This will result in either overlapping features (when expanding) or gaps between them (when shrinking).
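A toy sketch (hypothetical squares, not the federal states data) makes this concrete: two squares that only touch end up overlapping once each is buffered on its own.

```{r}
sq_left  <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
sq_right <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
squares <- st_sfc(sq_left, sq_right)

st_area(st_intersection(squares[1], squares[2]))   # 0: they only share an edge

squares_buf <- st_buffer(squares, 0.1)             # each feature is buffered separately
st_area(st_intersection(squares_buf[1], squares_buf[2]))  # > 0: they now overlap
```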
## Joining overlapping and adjacent features with `st_union`
The last function that we will look at is `st_union`. It can be used to join overlapping and adjacent features. With this, we could for example join all features of the federal state boundaries in Germany and get a single feature that represents the national boundary. But let's look at a more interesting example and load a new dataset. It contains a sample of regions in Berlin with street noise of 50dB or more at night.
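As a minimal self-contained illustration first (toy shapes, not the noise data): `st_union` merges overlapping polygons into a single feature.

```{r}
sq_a <- st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))
sq_b <- st_polygon(list(rbind(c(1, 1), c(3, 1), c(3, 3), c(1, 3), c(1, 1))))
both <- st_sfc(sq_a, sq_b)

merged <- st_union(both)
length(both)     # 2 features
length(merged)   # 1 feature
st_area(merged)  # 7: two 2x2 squares minus their 1x1 overlap
```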
```{r 18noise}
noise <- read_sf('data/noise_berlin_sample.geojson') %>%
  st_transform(crs = '+proj=aeqd +lat_0=52.51883 +lon_0=13.41537')
plot_map(noise, graticules = TRUE)
```
You may notice the ragged edges in the picture above. Furthermore, the dataset is quite large and contains a lot of rows (i.e. features):
```{r}
nrow(noise)
```
Why is that so? When we zoom in and also plot each feature's boundary, we will notice that we actually have some sort of vectorized raster data! The noise regions are tiny raster cells:
```{r 19noise_zoom}
plot_map(noise, strokecolor = '#097FB3', fillcolor = '#AED3E4', zoom_level = 15)
```
Most cells cover 100 m² and include the noise level at that spot (`L_N` column):
```{r}
summary(noise[c('Shape_Area', 'L_N')])
```
If we don't care for the specific noise levels and only want to form a single feature that represents noise levels with 50dB or more, we can apply `st_union` on the whole dataset:
```{r}
noise_union <- st_union(noise)
noise_union
```
As we can see in the output above, only a single feature is left (and all feature-specific "metadata" such as level of noise is gone!). We can confirm this visually:
```{r 20noise_union_zoom}
plot_map(noise_union, strokecolor = '#097FB3', fillcolor = '#AED3E4', zoom_level = 15)
```
```{r 21noise_union}
plot_map(noise_union, strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
Additionally applying `st_simplify` removes the ragged edges:
```{r 22noise_union_simpl}
noise_union_simpl <- st_simplify(noise_union, preserveTopology = FALSE, dTolerance = 50)
plot_map(noise_union_simpl, strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
Note how much the memory consumption was reduced from initially 42 *mega*bytes to 282 *kilo*bytes or even 18 kilobytes when further simplified:
```{r}
round(c(object.size(noise), object.size(noise_union), object.size(noise_union_simpl)) / 1024)
```
Applying `st_simplify` without a prior union would not have the desired effect, since `st_simplify` operates per feature:
```{r 23noise_union_fail}
st_simplify(noise, preserveTopology = FALSE, dTolerance = 10) %>%
  plot_map(strokecolor = '#097FB3', fillcolor = '#AED3E4')
```
If we want to retain some of the noise level information, we can quantize ("bin") the noise level data into different ordered categories and apply the union operation per category. I'm used to the [dplyr](https://dplyr.tidyverse.org/) package, so I'll first create a quantized version of the noise variable `L_N` with three equally spaced levels between 50dB and 80dB and then use `group_by` in conjunction with `group_modify`:
```{r, warning=FALSE, message=FALSE}
library(dplyr)

noise_per_lvl <-
  mutate(noise,
         level = cut(L_N,
                     breaks = seq(50, 80, by = 10),
                     right = FALSE,
                     ordered_result = TRUE)) %>%
  group_by(level) %>%
  group_modify(~ st_union(.x) %>% as_tibble()) %>%   # ".x" refers to the current group
  ungroup() %>%
  st_as_sf()   # convert back to an "sf" object since this information is lost during group_by()

noise_per_lvl
```
Again, the result is much smaller in terms of memory:
```{r}
round(c(object.size(noise), object.size(noise_per_lvl)) / 1024)
```
The following gives us a plot with the three noise levels:
```{r 24noise_levels}
ggplot(noise_per_lvl) +
  geom_sf(aes(color = level, fill = level)) +
  coord_sf(datum = NA) +
  theme_bw()
```
## Conclusion
All in all, it's wise to think about simplifying your spatial data before you continue to work with it. Spatial datasets are often very large and, depending on your use-case, you may not need a high level of detail. It may even be absolutely necessary to reduce the complexity of your data, e.g. when computational capacity is limited or when you produce interactive maps for the web. As shown, R has all the tools that you need to simplify spatial data with the packages sf and rmapshaper.