First version with automated inference of result dimensions

quantifyearth · Feb 6, 2025 · e788c79 · e788c79
1 parent 8ae087e
commit e788c79
Show file tree

Hide file tree

Showing 12 changed files with 394 additions and 132 deletions.
diff --git a/README.md b/README.md
@@ -2,82 +2,79 @@
 
 ## Overview
 
-Yirgacheffe is an attempt to wrap gdal datasets such that you can do computational work on them without having to worry about common tasks:
+Yirgacheffe is an attempt to wrap raster and polygon geospatial datasets such that you can do computational work on them as a whole or at the pixel level, but without having to do a lot of the grunt work of working out where you need to be in rasters, or managing how much you can load into memory safely.
+
+Example common use-cases:
 
 * Do the datasets overlap? Yirgacheffe will let you define either the intersection or the union of a set of different datasets, scaling up or down the area as required.
 * Rasterisation of vector layers: if you have a vector dataset then you can add that to your computation and yirgaceffe will rasterize it on demand, so you never need to store more data in memory than necessary.
 * Do the raster layers get big and take up large amounts of memory? Yirgacheffe will let you do simple numerical operations with layers directly and then worry about the memory management behind the scenes for you.
 
 
-## Basic layer usage
+## Basic usage
+
+They main unit of data in Yirgacheffe is a "layer", which wraps either a raster dataset or polygon data, and then you can do work on layers without having to worry (unless you choose to) about how they align - Yirgacheffe will work out all the details around overlapping
 
 The motivation for Yirgacheffe layers is to make working with gdal data slightly easier - it just hides some common operations, such as incremental loading to save memory, or letting you align layers to generate either the intersection result or the union result.
 
-For example, say we had three layers that overlapped and we wanted to know the
+For example, if we wanted to do a simple [Area of Habitat](https://github.com/quantifyearth/aoh-calculator/) calculation, whereby we find the pixels where a species resides by combining its range polygon, its habitat preferences, and its elevation preferences, the code would be like this:
 
 ```python
-from yirgacheffe.layer import Layer, UniformAreaLayer
-
-elevation_layer = RasterLayer.layer_from_file('elecation.tiff')
-area_layer = UniformAreaLayer('area.tiff')
-validity_layer = RasterLayer.layer_from_file('validity.tiff')
-
-# Work out the common subsection of all these and apply it to the layers
-intersection = RasterLayer.find_intersection([elecation_layer, area_layer, validity_layer])
-elevation_layer.set_window_for_intersection(intersection)
-area_layer.set_window_for_intersection(intersection)
-validity_layer.set_window_for_intersection(intersection)
-
-# Create a layer that is just the size of the intersection to store the results
-result = RasterLayer.empty_raster_layer(
-    intersection,
-    area_layer.pixel_scale,
-    area_layer.datatype,
-    "result.tif",
-    area_layer.projection
-)
-
-# Work out the area where the data is valid and over 3000ft
-def is_munro(data):
-    return numpy.where(data > 3000.0, 0.0, 1.0)
-calc = validity_layer * area_layer * elevation_layer.numpy_apply(is_munro)
+from yirgacheffe.layer import RasterLayer, VectorLayer
 
-calc.save(result)
-result.close()
+habitat_map = RasterLayer.layer_from_file("habitats.tif")
+elevation_map = RasterLayer.layer_from_file('elevation.tif')
+range_polygon = VectorLayer.layer_from_file('species123.geojson', raster_like=habitat_map)
+area_per_pixel_map = RasterLayer.layer_from_file('area_per_pixel.tif')
+
+refined_habitat = habitat_map.isin([...species habitat codes...])
+refined_elevation = (elevation_map >= species_min) && (elevation_map <= species_max)
+
+aoh = refined_habitat * refined_elevation * range_polygon * area_per_pixel_map
+
+print(f'area for species 123: {aoh.sum()}')
 ```
 
-If you want the union then you can simply swap do:
+Similarly, you could save the result to a new raster layer:
 
 ```python
-intersection = RasterLayer.find_union(elecation_layer, area_layer, validity_layer)
-elevation_layer.set_window_for_union(intersection)
-area_layer.set_window_for_union(intersection)
-validity_layer.set_window_for_union(intersection)
+with RasterLayer.empty_raster_layer_like(aoh, filename="result.tif") as result:
+    aoh.save(result)
 ```
 
-If you want to work on the data in a layer directly you can call `read_array`, as you would with gdal. The data you access will be relative to the specified window - that is, if you've called either `set_window_for_intersection` or `set_window_for_union` then `read_array` will be relative to that and Yirgacheffe will clip or expand the data with zero values as necessary.
+Yirgacheffe will automatically infer if you want to do an intersection of maps or a union of the maps based on the operators you use (see below for a full table). You can explicitly override that if you want.
+
+### Lazy loading and evaluation
+
+Yirgacheffe uses a technique from computer science called "lazy evaluation", which means that only when you resolve a calculation will Yirgacheffe do any work. So in the first code example given, the work is only calculated when you call the `sum()` method. All the other intermediary results such as `refined_habitat` and `refined_elevation` are not calculated either until that final `sum()` is called. You could easily call sum on those intermediaries if you wanted and get their results, and that would cause them to be evaluated then.
+
+Similarly, when you load a layer, be it a raster layer or a vector layer from polygon data, the full data for the file isn't loaded until it's needed for a calculation, and even then only the part of the data necessary will be loaded or rasterized. Furthermore, Yirgacheffe will load the data in chunks, letting you work with rasters bigger than those that would otherwise fit within your computer's memory.
 
-If you have set either the intersection window or union window on a layer and you wish to undo that restriction, then you can simply call `reset_window()` on the layer.
 
-Layers also work with Python's `with` statement, which helps prevent you forgetting to close a layer, which is important both for ensuring results are written to disk when you want, and that memory used by layers is freed (Yirgacheffe doesn't cache data in memory, but it uses GDAL which does).
+### Automatic expanding and contracting layers
 
-So for example you can turn the above example to:
+When you load raster layers that aren't of equal geographic area (that is, they have a different origin, size, or both)then Yirgacheffe will do all the math internally to ensure that it aligns the pixels geospatially when doing calculations.
+
+If size adjustments are needed, then Yirgacheffe will infer from the calculations you're doing if it needs to either crop or enlarge layers. For instance, if you're summing two rasters it'll expand them to be the union of their two areas before adding them, filling in the missing parts with zeros. If you're multiplying or doing a logical AND of pixels then it'll find the intersection between the two rasters (as areas missing in one would cause the other layer to result in zero anyway).
+
+Whilst it is hoped that the default behaviour makes sense in most cases, we can't anticipate all usages, and so if you want to be explicit about the result of any maps you can specify it yourself.
+
+For example, to tell Yirgacheffe to make a union of a set of layers you'd write:
 
 ```python
-with RasterLayer.empty_raster_layer(
-    intersection,
-    area_layer.pixel_scale,
-    area_layer.datatype,
-    "result.tif",
-    area_layer.projection
-) as result:
-    # Work out the area where the data is valid and over 3000ft
-    def is_munro(data):
-        return numpy.where(data > 3000.0, 0.0, 1.0)
-    calc = validity_layer * area_layer * elevation_layer.numpy_apply(is_munro)
+layers = [habitat_map, elevation_map, range_polygon]
+union_area = YirgacheffeLayer.find_union(layers)
+for layer in layers:
+    layer.set_window_for_union(union_area)
 ```
 
-And you no longer need to explicitly call `result.close()` to ensure the result is written to disk.
+There is a similar set of methods for using the intersection.
+
+If you have set either the intersection window or union window on a layer and you wish to undo that restriction, then you can simply call `reset_window()` on the layer.
+
+### Direct access to data
+
+If doing per-layer operations isn't applicable for your application, you can read the pixel values for all layers (including VectorLayers) by calling `read_array` similarly to how you would for gdal. The data you access will be relative to the specified window - that is, if you've called either `set_window_for_intersection` or `set_window_for_union` then `read_array` will be relative to that and Yirgacheffe will clip or expand the data with zero values as necessary.
 
 
 ### Todo but not supported

diff --git a/pyproject.toml b/pyproject.toml
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "yirgacheffe"
-version = "0.9.4"
+version = "1.0.0"
 description = "Abstraction of gdal datasets for doing basic math operations"
 readme = "README.md"
 authors = [{ name = "Michael Dales", email = "[email protected]" }]

diff --git a/tests/test_auto_windowing.py b/tests/test_auto_windowing.py
@@ -1,11 +1,15 @@
+import os
+import tempfile
+
 import numpy as np
 
-from helpers import gdal_dataset_with_data
-from yirgacheffe.layers import RasterLayer
+from helpers import gdal_dataset_with_data, make_vectors_with_mutlile_ids
+from yirgacheffe.layers import ConstantLayer, RasterLayer, VectorLayer, area
+from yirgacheffe.window import Area
 
 def test_add_windows() -> None:
 	data1 = np.array([[1, 2], [3, 4]])
-	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
 
 	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
 	layer2 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data2))
@@ -18,9 +22,16 @@ def test_add_windows() -> None:
 	assert calc.area == layer2.area
 	assert calc.window == layer2.window
 
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+
+	expected = np.array([[11, 22, 30, 40], [53, 64, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
+	actual = result.read_array(0, 0, 4, 4)
+	assert (expected == actual).all()
+
 def test_multiply_windows() -> None:
 	data1 = np.array([[1, 2], [3, 4]])
-	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
 
 	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
 	layer2 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data2))
@@ -32,3 +43,208 @@ def test_multiply_windows() -> None:
 
 	assert calc.area == layer1.area
 	assert calc.window == layer1.window
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+
+	expected = np.array([[10, 40], [150, 240]])
+	actual = result.read_array(0, 0, 2, 2)
+	assert (expected == actual).all()
+
+def test_add_windows_offset() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
+
+	layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
+	layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))
+
+	assert layer1.area != layer2.area
+	assert layer1.window != layer2.window
+
+	calc = layer1 + layer2
+
+	assert calc.area == layer2.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+
+	expected = np.array([[10, 20, 30, 40], [50, 61, 72, 80], [90, 103, 114, 120], [130, 140, 150, 160]])
+	actual = result.read_array(0, 0, 4, 4)
+	assert (expected == actual).all()
+
+def test_multiply_windows_offset() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
+
+	layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
+	layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))
+
+	assert layer1.area != layer2.area
+	assert layer1.window != layer2.window
+
+	calc = layer1 * layer2
+
+	assert calc.area == layer1.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+
+	expected = np.array([[60, 140], [300, 440]])
+	actual = result.read_array(0, 0, 2, 2)
+	assert (expected == actual).all()
+
+def test_add_windows_sum() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
+
+	layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
+	layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))
+
+	calc = layer1 + layer2
+	total = calc.sum()
+
+	expected = np.array([[10, 20, 30, 40], [50, 61, 72, 80], [90, 103, 114, 120], [130, 140, 150, 160]])
+	assert total == np.sum(expected)
+
+def test_multiply_windows_sum() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	data2 = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]])
+
+	layer1 = RasterLayer(gdal_dataset_with_data((-0.02, 0.02), 0.02, data1))
+	layer2 = RasterLayer(gdal_dataset_with_data((-0.04, 0.04), 0.02, data2))
+
+	calc = layer1 * layer2
+	total = calc.sum()
+
+	expected = np.array([[60, 140], [300, 440]])
+	assert total == np.sum(expected)
+
+def test_constant_layer_result_rhs_add() -> None:
+	data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
+	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
+	const_layer = ConstantLayer(1.0)
+
+	calc = layer1 + const_layer
+
+	assert calc.area == layer1.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+	actual = result.read_array(0, 0, 4, 2)
+
+	expected = 1.0 + data1
+
+	assert (expected == actual).all()
+
+def test_constant_layer_result_lhs_add() -> None:
+	data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
+	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
+	const_layer = ConstantLayer(1.0)
+	result = RasterLayer.empty_raster_layer_like(layer1)
+
+	intersection = RasterLayer.find_intersection([layer1, const_layer])
+	const_layer.set_window_for_intersection(intersection)
+	layer1.set_window_for_intersection(intersection)
+
+	calc = const_layer + layer1
+
+	assert calc.area == layer1.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+	actual = result.read_array(0, 0, 4, 2)
+
+	expected = 1.0 + data1
+
+	assert (expected == actual).all()
+
+def test_constant_layer_result_rhs_multiply() -> None:
+	data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
+	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
+	const_layer = ConstantLayer(2.0)
+
+	calc = layer1 * const_layer
+
+	assert calc.area == layer1.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+	actual = result.read_array(0, 0, 4, 2)
+
+	expected = data1 * 2.0
+
+	assert (expected == actual).all()
+
+def test_constant_layer_result_lhs_multiply() -> None:
+	data1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
+	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 0.02, data1))
+	const_layer = ConstantLayer(2.0)
+	result = RasterLayer.empty_raster_layer_like(layer1)
+
+	intersection = RasterLayer.find_intersection([layer1, const_layer])
+	const_layer.set_window_for_intersection(intersection)
+	layer1.set_window_for_intersection(intersection)
+
+	calc = const_layer * layer1
+
+	assert calc.area == layer1.area
+
+	result = RasterLayer.empty_raster_layer_like(calc)
+	calc.save(result)
+	actual = result.read_array(0, 0, 4, 2)
+
+	expected = 2.0 * data1
+
+	assert (expected == actual).all()
+
+def test_vector_layers_add() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	layer1 = RasterLayer(gdal_dataset_with_data((0.0, 0.0), 1.0, data1))
+
+	with tempfile.TemporaryDirectory() as tempdir:
+		path = os.path.join(tempdir, "test.gpkg")
+		areas = {
+			(Area(-10.0, 10.0, 0.0, 0.0), 42),
+			(Area(0.0, 0.0, 10, -10), 43)
+		}
+		make_vectors_with_mutlile_ids(areas, path)
+
+		burn_value = 2
+		layer2 = VectorLayer.layer_from_file(path, None, layer1.pixel_scale, layer1.projection, burn_value=burn_value)
+		layer2_total = layer2.sum()
+		assert layer2_total == ((layer2.window.xsize * layer2.window.ysize) / 2) * burn_value
+
+		calc = layer1 + layer2
+
+		assert calc.area == layer2.area
+
+		total = calc.sum()
+		assert total == layer2_total + np.sum(data1)
+
+def test_vector_layers_multiply() -> None:
+	data1 = np.array([[1, 2], [3, 4]])
+	layer1 = RasterLayer(gdal_dataset_with_data((-1.0, 1.0), 1.0, data1))
+
+	with tempfile.TemporaryDirectory() as tempdir:
+		path = os.path.join(tempdir, "test.gpkg")
+		areas = {
+			(Area(-10.0, 10.0, 0.0, 0.0), 42),
+			(Area(0.0, 0.0, 10, -10), 43)
+		}
+		make_vectors_with_mutlile_ids(areas, path)
+
+		burn_value = 2
+		layer2 = VectorLayer.layer_from_file(path, None, layer1.pixel_scale, layer1.projection, burn_value=burn_value)
+		layer2_total = layer2.sum()
+		assert layer2_total == ((layer2.window.xsize * layer2.window.ysize) / 2) * burn_value
+
+		calc = layer1 * layer2
+
+		assert calc.area == layer1.area
+
+		result = RasterLayer.empty_raster_layer_like(calc)
+		calc.save(result)
+		actual = result.read_array(0, 0, 2, 2)
+
+		expected = np.array([[2, 0], [0, 8]])
+		assert (expected == actual).all()
diff --git a/yirgacheffe/layers/area.py b/yirgacheffe/layers/area.py
@@ -84,8 +84,8 @@ def __init__(self, dataset, name: Optional[str] = None, band: int = 1):
         )
         self._raster_xsize = self.window.xsize
 
-    def read_array(self, _xoffset, yoffset, _xsize, ysize) -> Any:
+    def read_array_with_window(self, xoffset: int, yoffset: int, xsize: int, ysize: int, window: Window) -> Any:
         if ysize <= 0:
             raise ValueError("Request dimensions must be positive and non-zero")
-        offset = self.window.yoff + yoffset
+        offset = window.yoff + yoffset
         return self.databand[offset:offset + ysize]