Skip to content

Commit

Permalink
First version with automated inference of result dimensions
Browse files Browse the repository at this point in the history
  • Loading branch information
mdales committed Feb 6, 2025
1 parent 8ae087e commit e788c79
Show file tree
Hide file tree
Showing 12 changed files with 394 additions and 132 deletions.
99 changes: 48 additions & 51 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,82 +2,79 @@

## Overview

Yirgacheffe is an attempt to wrap gdal datasets such that you can do computational work on them without having to worry about common tasks:
Yirgacheffe is an attempt to wrap raster and polygon geospatial datasets such that you can do computational work on them as a whole or at the pixel level, but without having to do a lot of the grunt work of working out where you need to be in rasters, or managing how much you can load into memory safely.

Example common use-cases:

* Do the datasets overlap? Yirgacheffe will let you define either the intersection or the union of a set of different datasets, scaling up or down the area as required.
* Rasterisation of vector layers: if you have a vector dataset then you can add that to your computation and yirgaceffe will rasterize it on demand, so you never need to store more data in memory than necessary.
* Do the raster layers get big and take up large amounts of memory? Yirgacheffe will let you do simple numerical operations with layers directly and then worry about the memory management behind the scenes for you.


## Basic layer usage
## Basic usage

They main unit of data in Yirgacheffe is a "layer", which wraps either a raster dataset or polygon data, and then you can do work on layers without having to worry (unless you choose to) about how they align - Yirgacheffe will work out all the details around overlapping

The motivation for Yirgacheffe layers is to make working with gdal data slightly easier - it just hides some common operations, such as incremental loading to save memory, or letting you align layers to generate either the intersection result or the union result.

For example, say we had three layers that overlapped and we wanted to know the
For example, if we wanted to do a simple [Area of Habitat](https://github.com/quantifyearth/aoh-calculator/) calculation, whereby we find the pixels where a species resides by combining its range polygon, its habitat preferences, and its elevation preferences, the code would be like this:

```python
from yirgacheffe.layer import Layer, UniformAreaLayer

elevation_layer = RasterLayer.layer_from_file('elecation.tiff')
area_layer = UniformAreaLayer('area.tiff')
validity_layer = RasterLayer.layer_from_file('validity.tiff')

# Work out the common subsection of all these and apply it to the layers
intersection = RasterLayer.find_intersection([elecation_layer, area_layer, validity_layer])
elevation_layer.set_window_for_intersection(intersection)
area_layer.set_window_for_intersection(intersection)
validity_layer.set_window_for_intersection(intersection)

# Create a layer that is just the size of the intersection to store the results
result = RasterLayer.empty_raster_layer(
intersection,
area_layer.pixel_scale,
area_layer.datatype,
"result.tif",
area_layer.projection
)

# Work out the area where the data is valid and over 3000ft
def is_munro(data):
return numpy.where(data > 3000.0, 0.0, 1.0)
calc = validity_layer * area_layer * elevation_layer.numpy_apply(is_munro)
from yirgacheffe.layer import RasterLayer, VectorLayer

calc.save(result)
result.close()
habitat_map = RasterLayer.layer_from_file("habitats.tif")
elevation_map = RasterLayer.layer_from_file('elevation.tif')
range_polygon = VectorLayer.layer_from_file('species123.geojson', raster_like=habitat_map)
area_per_pixel_map = RasterLayer.layer_from_file('area_per_pixel.tif')

refined_habitat = habitat_map.isin([...species habitat codes...])
refined_elevation = (elevation_map >= species_min) && (elevation_map <= species_max)

aoh = refined_habitat * refined_elevation * range_polygon * area_per_pixel_map

print(f'area for species 123: {aoh.sum()}')
```

If you want the union then you can simply swap do:
Similarly, you could save the result to a new raster layer:

```python
intersection = RasterLayer.find_union(elecation_layer, area_layer, validity_layer)
elevation_layer.set_window_for_union(intersection)
area_layer.set_window_for_union(intersection)
validity_layer.set_window_for_union(intersection)
with RasterLayer.empty_raster_layer_like(aoh, filename="result.tif") as result:
aoh.save(result)
```

If you want to work on the data in a layer directly you can call `read_array`, as you would with gdal. The data you access will be relative to the specified window - that is, if you've called either `set_window_for_intersection` or `set_window_for_union` then `read_array` will be relative to that and Yirgacheffe will clip or expand the data with zero values as necessary.
Yirgacheffe will automatically infer if you want to do an intersection of maps or a union of the maps based on the operators you use (see below for a full table). You can explicitly override that if you want.

### Lazy loading and evaluation

Yirgacheffe uses a technique from computer science called "lazy evaluation", which means that only when you resolve a calculation will Yirgacheffe do any work. So in the first code example given, the work is only calculated when you call the `sum()` method. All the other intermediary results such as `refined_habitat` and `refined_elevation` are not calculated either until that final `sum()` is called. You could easily call sum on those intermediaries if you wanted and get their results, and that would cause them to be evaluated then.

Similarly, when you load a layer, be it a raster layer or a vector layer from polygon data, the full data for the file isn't loaded until it's needed for a calculation, and even then only the part of the data necessary will be loaded or rasterized. Furthermore, Yirgacheffe will load the data in chunks, letting you work with rasters bigger than those that would otherwise fit within your computer's memory.

If you have set either the intersection window or union window on a layer and you wish to undo that restriction, then you can simply call `reset_window()` on the layer.

Layers also work with Python's `with` statement, which helps prevent you forgetting to close a layer, which is important both for ensuring results are written to disk when you want, and that memory used by layers is freed (Yirgacheffe doesn't cache data in memory, but it uses GDAL which does).
### Automatic expanding and contracting layers

So for example you can turn the above example to:
When you load raster layers that aren't of equal geographic area (that is, they have a different origin, size, or both)then Yirgacheffe will do all the math internally to ensure that it aligns the pixels geospatially when doing calculations.

If size adjustments are needed, then Yirgacheffe will infer from the calculations you're doing if it needs to either crop or enlarge layers. For instance, if you're summing two rasters it'll expand them to be the union of their two areas before adding them, filling in the missing parts with zeros. If you're multiplying or doing a logical AND of pixels then it'll find the intersection between the two rasters (as areas missing in one would cause the other layer to result in zero anyway).

Whilst it is hoped that the default behaviour makes sense in most cases, we can't anticipate all usages, and so if you want to be explicit about the result of any maps you can specify it yourself.

For example, to tell Yirgacheffe to make a union of a set of layers you'd write:

```python
with RasterLayer.empty_raster_layer(
intersection,
area_layer.pixel_scale,
area_layer.datatype,
"result.tif",
area_layer.projection
) as result:
# Work out the area where the data is valid and over 3000ft
def is_munro(data):
return numpy.where(data > 3000.0, 0.0, 1.0)
calc = validity_layer * area_layer * elevation_layer.numpy_apply(is_munro)
layers = [habitat_map, elevation_map, range_polygon]
union_area = YirgacheffeLayer.find_union(layers)
for layer in layers:
layer.set_window_for_union(union_area)
```

And you no longer need to explicitly call `result.close()` to ensure the result is written to disk.
There is a similar set of methods for using the intersection.

If you have set either the intersection window or union window on a layer and you wish to undo that restriction, then you can simply call `reset_window()` on the layer.

### Direct access to data

If doing per-layer operations isn't applicable for your application, you can read the pixel values for all layers (including VectorLayers) by calling `read_array` similarly to how you would for gdal. The data you access will be relative to the specified window - that is, if you've called either `set_window_for_intersection` or `set_window_for_union` then `read_array` will be relative to that and Yirgacheffe will clip or expand the data with zero values as necessary.


### Todo but not supported
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "yirgacheffe"
version = "0.9.4"
version = "1.0.0"
description = "Abstraction of gdal datasets for doing basic math operations"
readme = "README.md"
authors = [{ name = "Michael Dales", email = "[email protected]" }]
Expand Down
224 changes: 220 additions & 4 deletions tests/test_auto_windowing.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,15 @@
import os
import tempfile

import numpy as np

from helpers import gdal_dataset_with_data
from yirgacheffe.layers import RasterLayer
from helpers import gdal_dataset_with_data, make_vectors_with_mutlile_ids
from yirgacheffe.layers import ConstantLayer, RasterLayer, VectorLayer, area
from yirgacheffe.window import Area

def test_add_windows() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data2))
Expand All @@ -18,9 +22,16 @@ def test_add_windows() -> None:
assert calc.area == layer2.area
assert calc.window == layer2.window

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)

expected = np.array([[11, 22, 30, 40], [53, 64, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
actual = result.read_array(0, 0, 4, 4)
assert (expected == actual).all()

def test_multiply_windows() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data2))
Expand All @@ -32,3 +43,208 @@ def test_multiply_windows() -> None:

assert calc.area == layer1.area
assert calc.window == layer1.window

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)

expected = np.array([[10, 40], [150, 240]])
actual = result.read_array(0, 0, 2, 2)
assert (expected == actual).all()

def test_add_windows_offset() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))

assert layer1.area != layer2.area
assert layer1.window != layer2.window

calc = layer1 + layer2

assert calc.area == layer2.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)

expected = np.array([[10, 20, 30, 40], [50, 61, 72, 80], [90, 103, 114, 120], [130, 140, 150, 160]])
actual = result.read_array(0, 0, 4, 4)
assert (expected == actual).all()

def test_multiply_windows_offset() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))

assert layer1.area != layer2.area
assert layer1.window != layer2.window

calc = layer1 * layer2

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)

expected = np.array([[60, 140], [300, 440]])
actual = result.read_array(0, 0, 2, 2)
assert (expected == actual).all()

def test_add_windows_sum() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))

calc = layer1 + layer2
total = calc.sum()

expected = np.array([[10, 20, 30, 40], [50, 61, 72, 80], [90, 103, 114, 120], [130, 140, 150, 160]])
assert total == np.sum(expected)

def test_multiply_windows_sum() -> None:
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])

layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))

calc = layer1 * layer2
total = calc.sum()

expected = np.array([[60, 140], [300, 440]])
assert total == np.sum(expected)

def test_constant_layer_result_rhs_add() -> None:
data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
const_layer = ConstantLayer(1.0)

calc = layer1 + const_layer

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)
actual = result.read_array(0, 0, 4, 2)

expected = 1.0 + data1

assert (expected == actual).all()

def test_constant_layer_result_lhs_add() -> None:
data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
const_layer = ConstantLayer(1.0)
result = RasterLayer.empty_raster_layer_like(layer1)

intersection = RasterLayer.find_intersection([layer1, const_layer])
const_layer.set_window_for_intersection(intersection)
layer1.set_window_for_intersection(intersection)

calc = const_layer + layer1

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)
actual = result.read_array(0, 0, 4, 2)

expected = 1.0 + data1

assert (expected == actual).all()

def test_constant_layer_result_rhs_multiply() -> None:
data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
const_layer = ConstantLayer(2.0)

calc = layer1 * const_layer

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)
actual = result.read_array(0, 0, 4, 2)

expected = data1 * 2.0

assert (expected == actual).all()

def test_constant_layer_result_lhs_multiply() -> None:
data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
const_layer = ConstantLayer(2.0)
result = RasterLayer.empty_raster_layer_like(layer1)

intersection = RasterLayer.find_intersection([layer1, const_layer])
const_layer.set_window_for_intersection(intersection)
layer1.set_window_for_intersection(intersection)

calc = const_layer * layer1

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)
actual = result.read_array(0, 0, 4, 2)

expected = 2.0 * data1

assert (expected == actual).all()

def test_vector_layers_add() -> None:
data1 = np.array([[1, 2], [3, 4]])
layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 1.0, data1))

with tempfile.TemporaryDirectory() as tempdir:
path = os.path.join(tempdir, "test.gpkg")
areas = {
(Area(-10.0, 10.0, 0.0, 0.0), 42),
(Area(0.0, 0.0, 10, -10), 43)
}
make_vectors_with_mutlile_ids(areas, path)

burn_value = 2
layer2 = VectorLayer.layer_from_file(path, None, layer1.pixel_scale, layer1.projection, burn_value=burn_value)
layer2_total = layer2.sum()
assert layer2_total == ((layer2.window.xsize * layer2.window.ysize) / 2) * burn_value

calc = layer1 + layer2

assert calc.area == layer2.area

total = calc.sum()
assert total == layer2_total + np.sum(data1)

def test_vector_layers_multiply() -> None:
data1 = np.array([[1, 2], [3, 4]])
layer1 = RasterLayer(gdal_dataset_with_data((-1.0, 1.0), 1.0, data1))

with tempfile.TemporaryDirectory() as tempdir:
path = os.path.join(tempdir, "test.gpkg")
areas = {
(Area(-10.0, 10.0, 0.0, 0.0), 42),
(Area(0.0, 0.0, 10, -10), 43)
}
make_vectors_with_mutlile_ids(areas, path)

burn_value = 2
layer2 = VectorLayer.layer_from_file(path, None, layer1.pixel_scale, layer1.projection, burn_value=burn_value)
layer2_total = layer2.sum()
assert layer2_total == ((layer2.window.xsize * layer2.window.ysize) / 2) * burn_value

calc = layer1 * layer2

assert calc.area == layer1.area

result = RasterLayer.empty_raster_layer_like(calc)
calc.save(result)
actual = result.read_array(0, 0, 2, 2)

expected = np.array([[2, 0], [0, 8]])
assert (expected == actual).all()
4 changes: 2 additions & 2 deletions yirgacheffe/layers/area.py
Original file line number Diff line number Diff line change
Expand Up @@ -84,8 +84,8 @@ def __init__(self, dataset, name: Optional[str] = None, band: int = 1):
)
self._raster_xsize = self.window.xsize

def read_array(self, _xoffset, yoffset, _xsize, ysize) -> Any:
def read_array_with_window(self, xoffset: int, yoffset: int, xsize: int, ysize: int, window: Window) -> Any:
if ysize <= 0:
raise ValueError("Request dimensions must be positive and non-zero")
offset = self.window.yoff + yoffset
offset = window.yoff + yoffset
return self.databand[offset:offset + ysize]
Loading

0 comments on commit e788c79

Please sign in to comment.