Add vector cubes #59

Merged
merged 15 commits into from
Jan 30, 2023
60 changes: 43 additions & 17 deletions documentation/1.0/datacubes.md
@@ -2,53 +2,79 @@

## What are Datacubes?

Datacubes are multidimensional arrays with one or more spatial or temporal dimension(s). They are the way in which data is represented in OpenEO. They provide a nice and tidy interface for spatiotemporal data as well as the operations you may want to execute on it. As they are arrays, it might be easiest to look at raster data as an example, even though datacubes can hold vector data as well. Our example data however consists of a 6x7 raster with 4 bands [`blue`, `green`, `red`, `near-infrared`] and 3 timesteps [`2020-10-01`, `2020-10-13`, `2020-10-25`], displayed here in an orderly, timeseries-like manner:
Data is represented as datacubes in openEO, which are multi-dimensional arrays with additional information about their dimensionality. Datacubes can provide a nice and tidy interface for spatiotemporal data as well as for the operations you may want to execute on them. As they are arrays, it might be easiest to look at raster data as an example, even though datacubes can hold vector data as well. Our example data however consists of a 6x7 raster with 4 bands [`blue`, `green`, `red`, `near-infrared`] and 3 timesteps [`2020-10-01`, `2020-10-13`, `2020-10-25`], displayed here in an orderly, timeseries-like manner:

<figure>
<img src="./datacubes/dc_timeseries.png" alt="Datacube timeseries: 12 imagery tiles are depicted, grouped by 3 dates along a timeline (time dimension). Each date has a blue, green, red and near-infrared band (bands dimension). Each single tile has the dimensions x and y (spatial dimensions).">
<figcaption>An exemplary datacube with 4 dimensions: x, y, bands and time.</figcaption>
<img src="./datacubes/dc_timeseries.png" alt="Raster datacube timeseries: 12 imagery tiles are depicted, grouped by 3 dates along a timeline (time dimension). Each date has a blue, green, red and near-infrared band (bands dimension). Each single tile has the dimensions x and y (spatial dimensions).">
<figcaption>An exemplary raster datacube with 4 dimensions: x, y, bands and time.</figcaption>
</figure>

It is important to understand that datacubes are designed to make things easier for us, and are not literally a cube, meaning that the above plot is just as good a representation as any other. That is why we can switch the dimensions around and display them in whatever way we want, including the view below:

<figure>
<img src="./datacubes/dc_flat.png" alt="Datacube flat representation: The 12 imagery tiles are now laid out flat as a 4 by 3 grid (bands by timesteps). All dimension labels are depicted (The timestamps, the band names and the x, y coordinates).">
<img src="./datacubes/dc_flat.png" alt="Raster datacube flat representation: The 12 imagery tiles are now laid out flat as a 4 by 3 grid (bands by timesteps). All dimension labels are depicted (The timestamps, the band names and the x, y coordinates).">
<figcaption>This is the 'raw' data collection that is our example datacube. The grayscale images are colored for understandability, and dimension labels are displayed.</figcaption>
</figure>

A vector cube on the other hand could look like this:

<figure>
<img src="./datacubes/vector.png" alt="Vector datacube: 2 geometries are depicted for the vector dimension, along with 3 timesteps along the time dimension and 4 bands.">
<figcaption>An exemplary vector datacube with 3 dimensions: 2 geometries are given for the vector dimension, along with 3 timesteps for the time dimension and 4 bands.</figcaption>
</figure>

[Vector data cubes](https://r-spatial.org/r/2022/09/12/vdc.html) and raster data cubes are common cases of data cubes in the EO domain.
A raster data cube has at least two spatial dimensions (e.g. `x` and `y`) and a vector data cube has at least a vector dimension (e.g. `geometry`).
These distinctions are just made so that it is easier to describe "special" cases of data cubes, but you can also define other types such as a temporal data cube that has at least a temporal dimension (e.g. `t`).

## Dimensions

A dimension refers to a certain axis of a datacube. This includes all variables (e.g. bands), which are represented as dimensions. Our exemplary raster datacube has the spatial dimensions `x` and `y`, and the temporal dimension `t`. Furthermore, it has a `bands` dimension, extending into the realm of _what kind of information_ is contained in the cube.

The following properties are usually available for dimensions:

* name
* type (`spatial`, `temporal`, `bands`, `vector` or `other`)
* axis / number
* type (spatial/temporal/bands/other)
* extents _or_ nominal dimension labels
* reference system / projections
* resolution
* labels (usually exposed in metadata as nominal values _or_ extents)
* reference system / projection
* resolution / step size
* unit (either explicitly specified or implicitly given by the reference system)

Here is an overview of the dimensions contained in our example datacube above:
Here is an overview of the dimensions contained in our example raster datacube above:

| # | dimension name | dimension labels | resolution |
|---|----------------|------------------| ---------- |
| 1 | `x` | `466380`, `466580`, `466780`, `466980`, `467180`, `467380` | 10m |
| 2 | `y` | `7167130`, `7166930`, `7166730`, `7166530`, `7166330`, `7166130`, `7165930` | 10m |
| 3 | `bands` | `blue`, `green`, `red`, `nir` | 4 bands |
| 4 | `t` | `2020-10-01`, `2020-10-13`, `2020-10-25` | 12 days |
| # | name | type | labels | resolution | reference system |
| - | ------- | -------- | --------------------------------------------------------------------------- | ---------- | ----------------------------------- |
| 1 | `x` | spatial | `466380`, `466580`, `466780`, `466980`, `467180`, `467380` | 200m | [EPSG:32627](https://epsg.io/32627) |
| 2 | `y` | spatial | `7167130`, `7166930`, `7166730`, `7166530`, `7166330`, `7166130`, `7165930` | 200m | [EPSG:32627](https://epsg.io/32627) |
| 3 | `bands` | bands | `blue`, `green`, `red`, `nir` | 4 bands | - |
| 4 | `t` | temporal | `2020-10-01`, `2020-10-13`, `2020-10-25` | 12 days | Gregorian calendar / UTC |
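The table above can be read as a small metadata structure. The following is a minimal sketch in plain Python, assuming a hypothetical `Dimension` class that is not part of any openEO client library; it only shows how the number of labels per dimension determines the shape of the example raster datacube.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional


@dataclass
class Dimension:
    """Hypothetical container for dimension metadata (not an openEO API)."""
    name: str
    type: str  # spatial, temporal, bands, vector or other
    labels: list
    reference_system: Optional[str] = None


# Values copied from the table of the example raster datacube.
dimensions = [
    Dimension("x", "spatial",
              [466380, 466580, 466780, 466980, 467180, 467380], "EPSG:32627"),
    Dimension("y", "spatial",
              [7167130, 7166930, 7166730, 7166530, 7166330, 7166130, 7165930],
              "EPSG:32627"),
    Dimension("bands", "bands", ["blue", "green", "red", "nir"]),
    Dimension("t", "temporal",
              ["2020-10-01", "2020-10-13", "2020-10-25"],
              "Gregorian calendar / UTC"),
]

# The shape of the cube is the number of labels along each dimension.
shape = tuple(len(d.labels) for d in dimensions)
cell_count = prod(shape)

print(shape)       # (6, 7, 4, 3)
print(cell_count)  # 504
```

Each of the 504 cells holds one scalar value; everything else (coordinates, band names, timestamps) lives in the dimension labels.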

Dimension labels are either numerical or text (also known as "strings"), which also includes textual representations of timestamps for example. Dimensions with a natural/inherent order are always sorted. These are usually all spatial and temporal dimensions. Dimensions without inherent order, `bands` in openEO for example, retain the order in which they have been defined in metadata or processes (e.g. through [`filter_bands`](https://processes.openeo.org/#filter_bands)), with new labels simply being appended to the existing labels.
Dimension labels are usually either numerical or text (also known as "strings"), which also includes textual representations of timestamps or vectors for example.
Usually, vector labels (geometries) are encoded as [Well-known Text (WKT)](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) and temporal labels are encoded as [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) compatible dates and/or times.
Member

We discussed this elsewhere I think, but are WKT strings a good choice for labels?

  • the WKT string can become very large (kilobytes or worse), which does not make it a handy label to work with in the different phases of a user workflow
  • there is quite some room for variation in WKT encoding of a geometry due to float representation/precision or vertex ordering.

Member

Side question: what does the "usually" refer to here: "usually in the remote sensing/GIS community", or "usually in openEO implementations"?

Member Author

Suggested change
Usually, vector labels (geometries) are encoded as [Well-known Text (WKT)](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) and temporal labels are encoded as [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) compatible dates and/or times.
For example, geometries (i.e. the labels of a geometry dimension) can be encoded in [Well-known Text (WKT)](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) or GeoJSON, while temporal labels are usually encoded as [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) compatible dates and/or times.

Member Author

The labels can be chosen by the user, could be ID, WKT2, any other attribute. Doesn't need to be unique, back-ends can have an internal UID (e.g. index)

Member

Do you mean that in the following sense?

  • When supported by the back-end implementation: the labels can be chosen by the user: could be ID, WKT2, any other attribute. Doesn't need to be unique.
  • As fallback, back-ends can use an internal (unique) id (e.g. auto-increment index, UUID, hash function of geometry, ....)

Member Author

I'm not sure whether back-ends should be able to not support it, but we also don't know yet at which point/how users can choose this. So this is to be discussed in openeo-processes.
I guess back-ends always need an internal unique id in addition to the representation.


OpenEO datacubes contain scalar values (e.g. strings, numbers or boolean values), with all other associated attributes stored in dimensions (e.g. coordinates or timestamps). Attributes such as the CRS or the sensor can also be turned into dimensions. Be advised that in such a case, the uniqueness of pixel coordinates may be affected. When usually, `(x, y)` refers to a unique location, that changes to `(x, y, CRS)` when `(x, y)` values are reused in other coordinate reference systems (e.g. two neighboring UTM zones).
Dimensions with a natural/inherent order are always sorted. These are usually all spatial and temporal dimensions. Dimensions without inherent order, in openEO `bands` for example, retain the order in which they have been defined in metadata or processes (e.g. through [`filter_bands`](https://processes.openeo.org/#filter_bands)), with new labels simply being appended to the existing labels.

A vector dimension is not included in the example raster datacube above and is not used in the following examples, but here is what a vector dimension with two polygons could look like:

| name | type | labels | reference system |
| ---------- | ------ | ------ | ---------------- |
| `geometry` | vector | `POLYGON((-122.4 37.6,-122.35 37.6,-122.35 37.64,-122.4 37.64,-122.4 37.6))`, `POLYGON((-122.51 37.5,-122.48 37.5,-122.48 37.52,-122.51 37.52,-122.51 37.5))` | [EPSG:4326](https://epsg.io/4326) |

Vector dimensions can consist of points, linestrings, polygons, multi points, multi linestrings and multi polygons or a mixture of those. Empty geometries (includes GeoJSON `null` geometries) are not allowed.
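To illustrate the vector dimension from the table, here is a rough sketch in plain Python. The helper `parse_polygon_wkt` is hypothetical and deliberately simplified (single-ring polygons only, no holes, no Z/M values); a real implementation would rely on a proper geometry library. It rejects empty geometries, in line with the rule above, and checks that each ring is closed.

```python
import re


def parse_polygon_wkt(wkt):
    """Parse a single-ring 'POLYGON((x y, ...))' string into (x, y) tuples.

    Simplified, hypothetical helper: raises ValueError for anything else,
    including 'POLYGON EMPTY' and polygons with holes.
    """
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt)
    if match is None:
        raise ValueError(f"unsupported or empty geometry: {wkt!r}")
    ring = [tuple(float(v) for v in pair.split())
            for pair in match.group(1).split(",")]
    if ring[0] != ring[-1]:
        raise ValueError("polygon ring is not closed")
    return ring


# Dimension metadata copied from the table; the dict layout is an assumption.
geometry_dimension = {
    "name": "geometry",
    "type": "vector",
    "reference_system": "EPSG:4326",
    "labels": [
        "POLYGON((-122.4 37.6,-122.35 37.6,-122.35 37.64,-122.4 37.64,-122.4 37.6))",
        "POLYGON((-122.51 37.5,-122.48 37.5,-122.48 37.52,-122.51 37.52,-122.51 37.5))",
    ],
}

rings = [parse_polygon_wkt(label) for label in geometry_dimension["labels"]]
print(len(rings))   # 2
print(rings[0][0])  # (-122.4, 37.6)
```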

openEO datacubes contain scalar values (e.g. strings, numbers or boolean values), with all other associated attributes stored in dimensions (e.g. coordinates or timestamps). Attributes such as the CRS or the sensor can also be turned into dimensions. Be advised that in such a case, the uniqueness of pixel coordinates may be affected. While `(x, y)` usually refers to a unique location, the unique key becomes `(x, y, CRS)` when `(x, y)` values are reused in other coordinate reference systems (e.g. two neighboring UTM zones).
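The effect on uniqueness can be shown with a few lines of plain Python. The two observations below are invented for illustration (the pixel values and the second EPSG code are assumptions, not data from the example cube): both share the same `(x, y)` numbers but belong to neighboring UTM zones.

```python
# Two hypothetical pixels with identical (x, y) values in neighboring UTM zones.
observations = [
    {"x": 466380, "y": 7167130, "crs": "EPSG:32627", "value": 0.12},
    {"x": 466380, "y": 7167130, "crs": "EPSG:32628", "value": 0.34},
]

# Keyed by (x, y) only, the second observation silently overwrites the first.
by_xy = {(o["x"], o["y"]): o["value"] for o in observations}

# Keyed by (x, y, CRS), both observations remain distinguishable.
by_xy_crs = {(o["x"], o["y"], o["crs"]): o["value"] for o in observations}

print(len(by_xy))      # 1
print(len(by_xy_crs))  # 2
```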

::: tip Be Careful with Data Types
As stated above, datacubes only contain scalar values. However, implementations may differ in their ability to handle or convert them. Implementations may also not allow mixing data types in a datacube. For example, returning a boolean value for a reducer on a numerical datacube may result in an error on some back-ends. The recommendation is to not change the data type of values in a datacube unless the back-end supports it explicitly.
:::

### Applying Processes on Dimensions

Some processes are typically applied "along a dimension". You can imagine said dimension as an arrow and whatever is happening as a parallel process to that arrow. It simply means: "we focus on _this_ dimension right now".
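As a rough sketch of what "along a dimension" means, the following plain-Python example reduces a tiny cube with the dimensions `t` and `bands` along `t`. The pixel values are made up, and `reduce_along_t` is a hypothetical helper rather than an openEO process; it only demonstrates that the reducer sees all values that differ solely in their `t` label.

```python
from statistics import mean

# Hypothetical values for a single pixel: rows are timesteps, columns are bands.
band_labels = ["blue", "green", "red", "nir"]
values = [
    [1.0, 2.0, 3.0, 4.0],  # 2020-10-01
    [2.0, 3.0, 4.0, 5.0],  # 2020-10-13
    [3.0, 4.0, 5.0, 6.0],  # 2020-10-25
]


def reduce_along_t(rows, reducer):
    """Apply `reducer` to each per-band time series, removing the t dimension."""
    # zip(*rows) groups the values that share the same band label.
    return [reducer(series) for series in zip(*rows)]


temporal_mean = dict(zip(band_labels, reduce_along_t(values, mean)))
print(temporal_mean)
# {'blue': 2.0, 'green': 3.0, 'red': 4.0, 'nir': 5.0}
```

The result no longer has a `t` dimension; only `bands` remains, with one value per band.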

### Resolution

The resolution of a dimension gives information about what interval lies between observations. This is most obvious with the temporal resolution, where the intervals depict how often observations were made. Spatial resolution gives information about the pixel spacing, meaning how many 'real world meters' are contained in a pixel. The number of bands and their wavelength intervals give information about the spectral resolution.
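For regularly spaced dimensions, the resolution can be derived from the labels themselves. The sketch below, in plain Python, assumes the labels of the example raster datacube and a hypothetical helper `constant_step`; it recovers the 200 m spacing of `x` and the 12-day spacing of `t`.

```python
from datetime import date


def constant_step(values):
    """Return the spacing between consecutive values, or None if it varies."""
    steps = {abs(b - a) for a, b in zip(values, values[1:])}
    return steps.pop() if len(steps) == 1 else None


x_labels = [466380, 466580, 466780, 466980, 467180, 467380]
t_labels = ["2020-10-01", "2020-10-13", "2020-10-25"]

# Spatial resolution in CRS units (metres for the UTM zone EPSG:32627).
x_resolution = constant_step(x_labels)

# Temporal resolution: parse the ISO 8601 dates, then compare ordinal days.
t_ordinals = [date.fromisoformat(label).toordinal() for label in t_labels]
t_resolution_days = constant_step(t_ordinals)

print(x_resolution)       # 200
print(t_resolution_days)  # 12
```

Irregular dimensions, for example a time series with missing acquisitions, would yield `None` here, which is why metadata usually states the resolution explicitly.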

### Coordinate Reference System as a Dimension
4 changes: 2 additions & 2 deletions documentation/1.0/datacubes/.scripts/datacube_plots.R
@@ -491,8 +491,8 @@ pl(b, 46.5, -3.5, m = vecM, pal = alpha("white", 0.9), border = 0)
print_vector_content(52.5, -1.5)
pl(b, 45, -2, m = vecM, pal = alpha("white", 0.9), border = 0)
print_vector_content(51, 0)
text(51.5, 15, "Line_1")
text(63, 15, "Polygon_1")
text(51.5, 15, "LINESTRING(...)") # e.g. LINESTRING(24.6 19, 24.6 17.4, 25.8 16.4, 27.9 16.1)
text(63, 15, "POLYGON(...)") # e.g. POLYGON((30 18.2, 32.3 17.6, 32.6 19.2, 31.9 19.7, 30 18.2))
text(57, 17.5, "Geometries", cex = 1.1)
text(42, 12, "blue")
text(42, 8, "green")
Binary file modified documentation/1.0/datacubes/dc_aggregate_space.png
Binary file added documentation/1.0/datacubes/vector.png
24 changes: 23 additions & 1 deletion documentation/1.0/glossary.md
@@ -33,7 +33,29 @@ In openEO, a back-end offers a set of collections to be processed. All collectio

## Spatial datacubes

A spatiotemporal datacube is a multidimensional array with one or more spatial or temporal dimensions. In the EO domain, it is common to be implicit about the temporal dimension and just refer to them as spatial datacubes in short. Special cases are raster and vector datacubes. Learn more about datacubes in the [datacube documentation](https://openeo.org/documentation/1.0/datacubes.html).
A spatiotemporal datacube is a multidimensional array with one or more spatial or temporal dimensions.
In the EO domain, it is common to be implicit about the temporal dimension and just refer to them as spatial datacubes in short.
Special cases are raster and [vector datacubes](https://r-spatial.org/r/2022/09/12/vdc.html).
Learn more about datacubes in the [datacube documentation](https://openeo.org/documentation/1.0/datacubes.html).

## Vector data

In general, **vector data** represent specific things (also called "features") in a space, e.g. on the surface of the Earth.

A **coordinate** represents a specific point in space.

A **feature** is a thing that has a geometry (e.g. the outline of an agricultural field, a forest or an urban area) and it may have additional properties assigned (e.g. a name, a color or a population).

**Geometries** consist of one or more coordinates that may be connected to form a specific type of geometry, e.g. two points can be connected to form a straight line and four straight lines can be connected to form a rectangle.

Commonly used types of geometries are:
- Point
- LineString (connected straight line pieces)
- Polygon (connected straight line pieces forming a closed ring, possibly with holes - for example a triangle or rectangle)

Multiple geometries of the same type can be combined into a group of geometries, e.g. a Multi Point or a Multi Polygon.

Features and geometries are specified by the OGC in the [Simple Feature Access specification](https://www.ogc.org/standards/sfa) (and ISO 19125). See the specification for more details.

## User-defined function (UDF)
