From b9d9cdb223a2d1b70847519447e9eaada470b6b8 Mon Sep 17 00:00:00 2001 From: Michael Johns Date: Thu, 25 Jan 2024 11:43:04 -0500 Subject: [PATCH] updated additional docs. --- README.md | 16 ++--- docs/source/api/raster-format-readers.rst | 7 +- docs/source/api/raster-functions.rst | 33 ++++----- docs/source/api/spatial-aggregations.rst | 4 +- docs/source/api/spatial-functions.rst | 61 ++++++++--------- docs/source/api/spatial-indexing.rst | 65 +++++++++--------- docs/source/api/spatial-predicates.rst | 8 +-- docs/source/api/vector-format-readers.rst | 27 +++++--- docs/source/index.rst | 17 ++--- .../usage/automatic-sql-registration.rst | 16 +++-- docs/source/usage/install-gdal.rst | 14 ++-- docs/source/usage/installation.rst | 67 ++++++++++++------- 12 files changed, 185 insertions(+), 150 deletions(-) diff --git a/README.md b/README.md index 128bfe118..b1a052839 100644 --- a/README.md +++ b/README.md @@ -147,14 +147,14 @@ __Note: Mosaic 0.4.x SQL bindings for DBR 13 not yet available in Unity Catalog Here are some example notebooks, check the language links for latest [[Python](/notebooks/examples/python/) | [Scala](/notebooks/examples/scala/) | [SQL](/notebooks/examples/sql/) | [R](/notebooks/examples/R/)]: -| Example | Description | Links | -| --- | --- | --- | -| __Quick Start__ | Example of performing spatial point-in-polygon joins on the NYC Taxi dataset | [python](/notebooks/examples/python/QuickstartNotebook.ipynb), [scala](notebooks/examples/scala/QuickstartNotebook.ipynb), [R](notebooks/examples/R/QuickstartNotebook.r), [SQL](notebooks/examples/sql/QuickstartNotebook.ipynb) | -| Shapefiles | Examples of reading multiple shapefiles | [python](notebooks/examples/python/Shapefiles/) | -| Spatial KNN | Runnable notebook-based example using Mosaic [SpatialKNN](https://databrickslabs.github.io/mosaic/models/spatial-knn.html) model | [python](notebooks/examples/python/SpatialKNN) | -| NetCDF | Read multiple NetCDFs, process through various data engineering steps 
before analyzing and rendering | [python](notebooks/examples/python/NetCDF/) | -| STS Transfers | Detecting Ship-to-Ship transfers at scale by leveraging Mosaic to process AIS data. | [python](notebooks/examples/python/Ship2ShipTransfers), [blog](https://medium.com/@timo.roest/ship-to-ship-transfer-detection-b370dd9d43e8) | -| EO Gridded STAC | End-to-end Earth Observation series showing downloading Sentinel-2 STAC assets for Alaska from [MSFT Planetary Computer](https://planetarycomputer.microsoft.com/), tiling them to H3 global grid, band stacking, NDVI, merging (mosaicing), clipping, and applying a [Segment Anything Model](https://huggingface.co/facebook/sam-vit-huge) | [python](notebooks/examples/python/EarthObservation/EOGriddedSTAC) | +| Example | Description | Links | +| --- | --- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| __Quick Start__ | Example of performing spatial point-in-polygon joins on the NYC Taxi dataset | [python](/notebooks/examples/python/Quickstart/QuickstartNotebook.ipynb), [scala](notebooks/examples/scala/QuickstartNotebook.ipynb), [R](notebooks/examples/R/QuickstartNotebook.r), [SQL](notebooks/examples/sql/QuickstartNotebook.ipynb) | +| Shapefiles | Examples of reading multiple shapefiles | [python](notebooks/examples/python/Shapefiles/) | +| Spatial KNN | Runnable notebook-based example using Mosaic [SpatialKNN](https://databrickslabs.github.io/mosaic/models/spatial-knn.html) model | [python](notebooks/examples/python/SpatialKNN) | +| NetCDF | Read multiple NetCDFs, process through various data engineering steps before analyzing and rendering | [python](notebooks/examples/python/NetCDF/) | +| STS Transfers | Detecting Ship-to-Ship transfers at scale by leveraging Mosaic to process AIS data. 
| [python](notebooks/examples/python/Ship2ShipTransfers), [blog](https://medium.com/@timo.roest/ship-to-ship-transfer-detection-b370dd9d43e8) | +| EO Gridded STAC | End-to-end Earth Observation series showing downloading Sentinel-2 STAC assets for Alaska from [MSFT Planetary Computer](https://planetarycomputer.microsoft.com/), tiling them to H3 global grid, band stacking, NDVI, merging (mosaicing), clipping, and applying a [Segment Anything Model](https://huggingface.co/facebook/sam-vit-huge) | [python](notebooks/examples/python/EarthObservation/EOGriddedSTAC) | You can import those examples in Databricks workspace using [these instructions](https://docs.databricks.com/en/notebooks/index.html). diff --git a/docs/source/api/raster-format-readers.rst b/docs/source/api/raster-format-readers.rst index 3e0c6443e..262ea066e 100644 --- a/docs/source/api/raster-format-readers.rst +++ b/docs/source/api/raster-format-readers.rst @@ -25,8 +25,9 @@ Mosaic provides spark readers for the following raster formats: Other formats are supported if the corresponding GDAL drivers are available. Mosaic provides two flavors of the readers: - * spark.read.format("gdal") for reading 1 file per spark task - * mos.read().format("raster_to_grid") reader that automatically converts raster to grid. + + * :code:`spark.read.format("gdal")` for reading 1 file per spark task + * :code:`mos.read().format("raster_to_grid")` reader that automatically converts the raster to a grid. spark.read.format("gdal") @@ -91,7 +92,7 @@ mos.read().format("raster_to_grid") *********************************** Reads a GDAL raster file and converts it to a grid. It uses a pattern similar to the standard spark.read.format(*).option(*).load(*) pattern. -The only difference is that it uses mos.read() instead of spark.read(). +The only difference is that it uses :code:`mos.read()` instead of :code:`spark.read()`. The raster pixels are converted to grid cells using the specified combiner operation (default is mean). 
If the raster pixels are larger than the grid cells, the cell values can be calculated using interpolation. The interpolation method used is Inverse Distance Weighting (IDW) where the distance function is a k_ring diff --git a/docs/source/api/raster-functions.rst b/docs/source/api/raster-functions.rst index bdad114e9..71eba6226 100644 --- a/docs/source/api/raster-functions.rst +++ b/docs/source/api/raster-functions.rst @@ -6,20 +6,22 @@ Intro ################ Raster functions are available in Mosaic if you have installed the optional dependency `GDAL`. Please see :doc:`Install and Enable GDAL with Mosaic ` for installation instructions. -Mosaic provides several unique raster functions that are not available in other Spark packages. -Mainly raster to grid functions, which are useful for reprojecting the raster data into a standard grid index system. -This is useful for performing spatial joins between raster data and vector data. -Mosaic also provides a scalable retiling function that can be used to retile raster data in case of bottlenecking due to large files. -All raster functions respect the \"rst\_\" prefix naming convention. -Mosaic is operating using raster tile objects only since 0.3.11. Tile objects are created using functions such as rst_fromfile(path_to_raster) -or rst_fromcontent(raster_bin, driver). These functions are used as places to start when working with initial data. -If you use spark.read.format("gdal") tiles are automatically generated for you. -Also, scala does not have a df.display method while python does. In practice you would most often call display(df) in -scala for a prettier output, but for brevity, we write df.show in scala. + + * Mosaic provides several unique raster functions that are not available in other Spark packages. + These are mainly raster-to-grid functions, which are useful for reprojecting the raster data into a standard grid index + system. This is useful for performing spatial joins between raster data and vector data. 
+ * Mosaic also provides a scalable retiling function that can be used to retile raster data in case of bottlenecking + due to large files. + * All raster functions respect the :code:`rst_` prefix naming convention. + * Since 0.3.11, Mosaic operates on raster tile objects only. Tile objects are created using functions such as + :code:`rst_fromfile` or :code:`rst_fromcontent`. These functions are the typical starting points when working with + initial data. If you use :code:`spark.read.format("gdal")`, tiles are automatically generated for you. + * Also, Scala does not have a :code:`df.display()` method while Python does. In practice you would most often call + :code:`display(df)` in Scala for a prettier output, but for brevity, we write :code:`df.show` in Scala. .. note:: For Mosaic versions > 0.4.0 you can use the revamped setup_gdal function or the new setup_fuse_install. - These functions will configure an init script in your preferred Workspace, Volume, or DBFS location to install GDAL on your cluster. - See :doc:`Install and Enable GDAL with Mosaic ` for more details. + These functions will configure an init script in your preferred Workspace, Volume, or DBFS location to install GDAL + on your cluster. See :doc:`Install and Enable GDAL with Mosaic ` for more details. rst_bandmetadata **************** @@ -190,7 +192,7 @@ rst_combineavg The output raster will have the same pixel type as the input rasters. The output raster will have the same pixel size as the input rasters. The output raster will have the same coordinate reference system as the input rasters. - Also, see :doc:`rst_combineavg_agg ` function. + Also, see :doc:`rst_combineavg_agg ` function. :param tiles: A column containing an array of raster tiles. :type tiles: Column (ArrayType(RasterTileType)) @@ -244,7 +246,7 @@ rst_derivedband The output raster will have the same pixel type as the input rasters. The output raster will have the same pixel size as the input rasters. 
The output raster will have the same coordinate reference system as the input rasters. - Also, see :doc:`rst_derivedband_agg ` function. + Also, see :doc:`rst_derivedband_agg ` function. :param tiles: A column containing an array of raster tiles. :type tiles: Column (ArrayType(RasterTileType)) @@ -298,6 +300,7 @@ rst_derivedband +----------------------------------------------------------------------------------------------------------------+ .. code-tab:: sql + SELECT rst_derivedband(array(tile1,tile2,tile3)) as tiles, """ @@ -876,7 +879,7 @@ rst_merge The output raster will have the same pixel type as the input rasters. The output raster will have the same pixel size as the highest resolution input rasters. The output raster will have the same coordinate reference system as the input rasters. - Also, see :doc:`rst_merge_agg ` function. + Also, see :doc:`rst_merge_agg ` function. :param tiles: A column containing an array of raster tiles. :type tiles: Column (ArrayType(RasterTileType)) diff --git a/docs/source/api/spatial-aggregations.rst b/docs/source/api/spatial-aggregations.rst index 9f806fec9..e1184cd2f 100644 --- a/docs/source/api/spatial-aggregations.rst +++ b/docs/source/api/spatial-aggregations.rst @@ -211,7 +211,7 @@ st_intersects_aggregate .. function:: st_intersects_agg(leftIndex, rightIndex) - Returns `true` if any of the `leftIndex` and `rightIndex` pairs intersect. + Returns :code:`true` if any of the :code:`leftIndex` and :code:`rightIndex` pairs intersect. :param leftIndex: Geometry :type leftIndex: Column @@ -301,7 +301,7 @@ st_intersection_agg .. function:: st_intersection_agg(leftIndex, rightIndex) - Computes the intersections of `leftIndex` and `rightIndex` and returns the union of these intersections. + Computes the intersections of :code:`leftIndex` and :code:`rightIndex` and returns the union of these intersections. 
:param leftIndex: Geometry :type leftIndex: Column diff --git a/docs/source/api/spatial-functions.rst b/docs/source/api/spatial-functions.rst index 350756e48..acf85943f 100644 --- a/docs/source/api/spatial-functions.rst +++ b/docs/source/api/spatial-functions.rst @@ -124,7 +124,7 @@ st_buffer .. function:: st_buffer(col, radius) - Buffer the input geometry by radius `radius` and return a new, buffered geometry. + Buffer the input geometry by radius :code:`radius` and return a new, buffered geometry. :param col: Geometry :type col: Column @@ -576,7 +576,7 @@ st_distance .. function:: st_distance(geom1, geom2) - Compute the euclidean distance between `geom1` and `geom2`. + Compute the euclidean distance between :code:`geom1` and :code:`geom2`. :param geom1: Geometry :type geom1: Column @@ -699,7 +699,8 @@ st_envelope .. function:: st_envelope(col) Returns the minimum bounding box of the input geometry, as a geometry. - This bounding box is defined by the rectangular polygon with corner points `(x_min, y_min)`, `(x_max, y_min)`, `(x_min, y_max)`, `(x_max, y_max)`. + This bounding box is defined by the rectangular polygon with corner points :code:`(x_min, y_min)`, + :code:`(x_max, y_min)`, :code:`(x_min, y_max)`, :code:`(x_max, y_max)`. :param col: Geometry :type col: Column @@ -869,14 +870,14 @@ st_hasvalidcoordinates .. function:: st_hasvalidcoordinates(col, crs, which) - Checks if all points in `geom` are valid with respect to crs bounds. + Checks if all points in :code:`geom` are valid with respect to crs bounds. CRS bounds can be provided either as bounds or as reprojected_bounds. :param col: Geometry :type col: Column :param crs: CRS name (EPSG ID), e.g. "EPSG:2192" :type crs: Column - :param which: Check against geographic `"bounds"` or geometric `"reprojected_bounds"` bounds. + :param which: Check against geographic :code:`"bounds"` or geometric :code:`"reprojected_bounds"` bounds. 
:type which: Column :rtype: Column: IntegerType @@ -928,8 +929,8 @@ st_intersection .. function:: st_intersection(geom1, geom2) - Returns a geometry representing the intersection of `left_geom` and `right_geom`. - Also, see :doc:`st_intersection_agg ` function. + Returns a geometry representing the intersection of :code:`left_geom` and :code:`right_geom`. + Also, see :doc:`st_intersection_agg ` function. :param geom1: Geometry :type geom1: Column @@ -985,7 +986,7 @@ st_isvalid .. function:: st_isvalid(col) - Returns `true` if the geometry is valid. + Returns :code:`true` if the geometry is valid. :param col: Geometry :type col: Column @@ -1066,10 +1067,6 @@ st_isvalid | false| +---------------+ -.. note:: Validity assertions will be dependent on the chosen geometry API. - The assertions used in the ESRI geometry API (JTS is the default) follow the definitions in the - "Simple feature access - Part 1" document (OGC 06-103r4) for each geometry type. - st_length ************ @@ -1135,7 +1132,7 @@ st_numpoints .. function:: st_numpoints(col) - Returns the number of points in `geom`. + Returns the number of points in :code:`geom`. :param col: Geometry :type col: Column @@ -1246,7 +1243,7 @@ st_rotate .. function:: st_rotate(col, td) - Rotates `geom` using the rotational factor `td`. + Rotates :code:`geom` using the rotational factor :code:`td`. :param col: Geometry :type col: Column @@ -1305,7 +1302,7 @@ st_scale .. function:: st_scale(col, xd, yd) - Scales `geom` using the scaling factors `xd` and `yd`. + Scales :code:`geom` using the scaling factors :code:`xd` and :code:`yd`. :param col: Geometry :type col: Column @@ -1363,11 +1360,11 @@ st_setsrid .. function:: st_setsrid(col, srid) - Sets the Coordinate Reference System well-known identifier (SRID) for `geom`. + Sets the Coordinate Reference System well-known identifier (SRID) for :code:`geom`. 
:param col: Geometry :type col: Column - :param srid: The spatial reference identifier of `geom`, expressed as an integer, e.g. `4326` for EPSG:4326 / WGS84 + :param srid: The spatial reference identifier of :code:`geom`, expressed as an integer, e.g. :code:`4326` for EPSG:4326 / WGS84 :type srid: Column (IntegerType) :rtype: Column @@ -1414,9 +1411,9 @@ st_setsrid +---------------------------------+ .. note:: - ST_SetSRID does not transform the coordinates of `geom`, + ST_SetSRID does not transform the coordinates of :code:`geom`, rather it tells Mosaic the SRID in which the current coordinates are expressed. - ST_SetSRID can only operate on geometries encoded in GeoJSON or the Mosaic internal format. + :ref:`st_setsrid` can only operate on geometries encoded in GeoJSON. st_simplify *********** @@ -1481,7 +1478,7 @@ st_srid .. function:: st_srid(col) - Looks up the Coordinate Reference System well-known identifier (SRID) for `geom`. + Looks up the Coordinate Reference System well-known identifier (SRID) for :code:`geom`. :param col: Geometry :type col: Column @@ -1534,7 +1531,7 @@ st_srid +------------+ .. note:: - ST_SRID can only operate on geometries encoded in GeoJSON or the Mosaic internal format. + ST_SRID can only operate on geometries encoded in GeoJSON. st_transform @@ -1542,11 +1539,13 @@ st_transform .. function:: st_transform(col, srid) - Transforms the horizontal (XY) coordinates of `geom` from the current reference system to that described by `srid`. + Transforms the horizontal (XY) coordinates of :code:`geom` from the current reference system to that described by :code:`srid`. + + :param col: Geometry :type col: Column - :param srid: Target spatial reference system for `geom`, expressed as an integer, e.g. `3857` for EPSG:3857 / Pseudo-Mercator + :param srid: Target spatial reference system for :code:`geom`, expressed as an integer, e.g. 
:code:`3857` for EPSG:3857 / Pseudo-Mercator :type srid: Column (IntegerType) :rtype: Column @@ -1557,7 +1556,7 @@ st_transform df = ( spark.createDataFrame([{'wkt': 'MULTIPOINT ((10 40), (40 30), (20 20), (30 10))'}]) - .withColumn('geom', st_setsrid(st_geomfromwkt('wkt'), lit(4326))) + .withColumn('geom', st_setsrid(st_asgeojson('wkt'), lit(4326))) ) df.select(st_astext(st_transform('geom', lit(3857)))).show(1, False) +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -1569,7 +1568,7 @@ st_transform .. code-tab:: scala val df = List("MULTIPOINT ((10 40), (40 30), (20 20), (30 10))").toDF("wkt") - .withColumn("geom", st_setsrid(st_geomfromwkt(col("wkt")), lit(4326))) + .withColumn("geom", st_setsrid(st_asgeojson(col("wkt")), lit(4326))) df.select(st_astext(st_transform(col("geom"), lit(3857)))).show(1, false) +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |convert_to(st_transform(geom, 3857)) | @@ -1589,7 +1588,7 @@ st_transform .. code-tab:: r R df <- createDataFrame(data.frame(wkt = "MULTIPOINT ((10 40), (40 30), (20 20), (30 10))")) - df <- withColumn(df, 'geom', st_setsrid(st_geomfromwkt(column('wkt')), lit(4326L))) + df <- withColumn(df, 'geom', st_setsrid(st_asgeojson(column('wkt')), lit(4326L))) >>> showDF(select(df, st_astext(st_transform(column('geom'), lit(3857L)))), truncate=F) +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -1599,8 +1598,10 @@ st_transform +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. 
note:: - If `geom` does not have an associated SRID, use ST_SetSRID to set this before calling ST_Transform. - + If :code:`geom` does not have an associated SRID, use :ref:`st_setsrid` to set this before calling :ref:`st_transform`. + **Changed in 0.4 series** :ref:`st_srid`, :ref:`st_setsrid`, and :ref:`st_transform` only operate on + GeoJSON (columnar) data, so be sure to call :ref:`/api/geometry-accessors#st_asgeojson` to convert from WKT and WKB. You can convert + back after the transform, e.g. using :ref:`/api/geometry-accessors#st_astext` or :ref:`/api/geometry-accessors#st_asbinary`. st_translate @@ -1608,7 +1609,7 @@ st_translate .. function:: st_translate(col, xd, yd) - Translates `geom` to a new location using the distance parameters `xd` and `yd`. + Translates :code:`geom` to a new location using the distance parameters :code:`xd` and :code:`yd`. :param col: Geometry :type col: Column @@ -1666,7 +1667,7 @@ st_union .. function:: st_union(left_geom, right_geom) Returns the point set union of the input geometries. - Also, see :doc:`st_union_agg ` function. + Also, see :doc:`st_union_agg ` function. :param left_geom: Geometry :type left_geom: Column diff --git a/docs/source/api/spatial-indexing.rst b/docs/source/api/spatial-indexing.rst index ba0a91132..b241dade6 100644 --- a/docs/source/api/spatial-indexing.rst +++ b/docs/source/api/spatial-indexing.rst @@ -5,16 +5,17 @@ Spatial grid indexing Spatial grid indexing is the process of mapping a geometry (or a point) to one or more cells (or cell ID) from the selected spatial grid. -The grid system can be specified by using the spark configuration `spark.databricks.labs.mosaic.index.system` +The grid system can be specified by using the spark configuration :code:`spark.databricks.labs.mosaic.index.system` before enabling Mosaic. 
The valid values are: - * `H3` - Good all-rounder for any location on earth - * `BNG` - Local grid system Great Britain (EPSG:27700) - * `CUSTOM(minX,maxX,minY,maxY,splits,rootCellSizeX,rootCellSizeY)` - Can be used with any local or global CRS - * `minX`,`maxX`,`minY`,`maxY` can be positive or negative integers defining the grid bounds - * `splits` defines how many splits are applied to each cell for an increase in resolution step (usually 2 or 10) - * `rootCellSizeX`,`rootCellSizeY` define the size of the cells on resolution 0 + + * :code:`H3` - Good all-rounder for any location on earth + * :code:`BNG` - Local grid system Great Britain (EPSG:27700) + * :code:`CUSTOM(minX,maxX,minY,maxY,splits,rootCellSizeX,rootCellSizeY)` - Can be used with any local or global CRS + * :code:`minX`, :code:`maxX`, :code:`minY`, :code:`maxY` can be positive or negative integers defining the grid bounds + * :code:`splits` defines how many splits are applied to each cell for an increase in resolution step (usually 2 or 10) + * :code:`rootCellSizeX`, :code:`rootCellSizeY` define the size of the cells on resolution 0 Example @@ -34,8 +35,8 @@ grid_longlatascellid .. function:: grid_longlatascellid(lon, lat, resolution) - Returns the `resolution` grid index associated with - the input `lon` and `lat` coordinates. + Returns the :code:`resolution` grid index associated with + the input :code:`lon` and :code:`lat` coordinates. :param lon: Longitude :type lon: Column: DoubleType @@ -114,8 +115,8 @@ grid_pointascellid .. function:: grid_pointascellid(geometry, resolution) - Returns the `resolution` grid index associated - with the input point geometry `geometry`. + Returns the :code:`resolution` grid index associated + with the input point geometry :code:`geometry`. :param geometry: Geometry :type geometry: Column @@ -187,17 +188,15 @@ grid_pointascellid - - grid_polyfill ************* .. 
function:: grid_polyfill(geometry, resolution) - Returns the set of grid indices of which centroid is contained in the input `geometry` at `resolution`. + Returns the set of grid indices whose centroid is contained in the input :code:`geometry` at :code:`resolution`. - When using `H3 ` index system, this is equivalent to the - `H3 polyfill ` method + When using the `H3 `_ index system, this is equivalent to the + `H3 polyfill `_ method :param geometry: Geometry :type geometry: Column @@ -385,21 +384,21 @@ grid_tessellate .. function:: grid_tessellate(geometry, resolution, ) - Cuts the original `geometry` into several pieces along the grid index borders at the specified `resolution`. + Cuts the original :code:`geometry` into several pieces along the grid index borders at the specified :code:`resolution`. - Returns an array of Mosaic chips **covering** the input `geometry` at `resolution`. + Returns an array of Mosaic chips **covering** the input :code:`geometry` at :code:`resolution`. A Mosaic chip is a struct type composed of: - - `is_core`: Identifies if the chip is fully contained within the geometry: Boolean + - :code:`is_core`: Identifies if the chip is fully contained within the geometry: Boolean - - `index_id`: Index ID of the configured spatial indexing (default H3): Integer + - :code:`index_id`: Index ID of the configured spatial indexing (default H3): Integer - - `wkb`: Geometry in WKB format equal to the intersection of the index shape and the original `geometry`: Binary + - :code:`wkb`: Geometry in WKB format equal to the intersection of the index shape and the original :code:`geometry`: Binary - In contrast to :ref:`grid_tessellateexplode`, `grid_tessellate` does not explode the list of shapes. + In contrast to :ref:`grid_tessellateexplode`, :ref:`grid_tessellate` does not explode the list of shapes. 
- In contrast to :ref:`grid_polyfill`, `grid_tessellate` fully covers the original `geometry` even if the index centroid + In contrast to :ref:`grid_polyfill`, :ref:`grid_tessellate` fully covers the original :code:`geometry` even if the index centroid falls outside of the original geometry. This makes it suitable to index lines as well. :param geometry: Geometry @@ -507,21 +506,21 @@ grid_tessellateexplode .. function:: grid_tessellateexplode(geometry, resolution, ) - Cuts the original `geometry` into several pieces along the grid index borders at the specified `resolution`. + Cuts the original :code:`geometry` into several pieces along the grid index borders at the specified :code:`resolution`. - Returns the set of Mosaic chips **covering** the input `geometry` at `resolution`. + Returns the set of Mosaic chips **covering** the input :code:`geometry` at :code:`resolution`. A Mosaic chip is a struct type composed of: - - `is_core`: Identifies if the chip is fully contained within the geometry: Boolean + - :code:`is_core`: Identifies if the chip is fully contained within the geometry: Boolean - - `index_id`: Index ID of the configured spatial indexing (default H3): Integer + - :code:`index_id`: Index ID of the configured spatial indexing (default H3): Integer - - `wkb`: Geometry in WKB format equal to the intersection of the index shape and the original `geometry`: Binary + - :code:`wkb`: Geometry in WKB format equal to the intersection of the index shape and the original :code:`geometry`: Binary - In contrast to :ref:`grid_tessellate`, `grid_tessellateexplode` generates one result row per chip. + In contrast to :ref:`grid_tessellate`, :ref:`grid_tessellateexplode` generates one result row per chip. 
- In contrast to :ref:`grid_polyfill`, `grid_tessellateexplode` fully covers the original `geometry` even if the index centroid + In contrast to :ref:`grid_polyfill`, :ref:`grid_tessellateexplode` fully covers the original :code:`geometry` even if the index centroid falls outside of the original geometry. This makes it suitable to index lines as well. :param geometry: Geometry @@ -678,8 +677,6 @@ grid_cellarea +--------------------+---------------+ - - grid_cellkring ************** @@ -858,7 +855,7 @@ grid_cell_intersection .. function:: grid_cell_intersection(left_chip, right_chip) Returns the chip representing the intersection of two chips based on the same grid cell. - Also, see :doc:`grid_cell_intersection_agg ` function. + Also, see :doc:`grid_cell_intersection_agg ` function. :param left_chip: Chip :type left_chip: Column: ChipType(LongType) @@ -914,7 +911,7 @@ grid_cell_union .. function:: grid_cell_union(left_chip, right_chip) Returns the chip representing the union of two chips based on the same grid cell. - Also, see :doc:`grid_cell_union_agg ` function. + Also, see :doc:`grid_cell_union_agg ` function. :param left_chip: Chip :type left_chip: Column: ChipType(LongType) diff --git a/docs/source/api/spatial-predicates.rst b/docs/source/api/spatial-predicates.rst index c1c3c8288..df3ce4951 100644 --- a/docs/source/api/spatial-predicates.rst +++ b/docs/source/api/spatial-predicates.rst @@ -8,7 +8,7 @@ st_contains .. function:: st_contains(geom1, geom2) - Returns `true` if `geom1` 'spatially' contains `geom2`. + Returns :code:`true` if :code:`geom1` 'spatially' contains :code:`geom2`. :param geom1: Geometry :type geom1: Column @@ -66,8 +66,8 @@ st_intersects .. function:: st_intersects(geom1, geom2) - Returns true if the geometry `geom1` intersects `geom2`. - Also, see :doc:`st_intersects_agg ` function. + Returns true if the geometry :code:`geom1` intersects :code:`geom2`. + Also, see :doc:`st_intersects_agg ` function. 
:param geom1: Geometry :type geom1: Column @@ -124,7 +124,7 @@ st_within .. function:: st_within(geom1, geom2) - Returns `true` if `geom1` 'spatially' is within `geom2`. + Returns :code:`true` if :code:`geom1` 'spatially' is within :code:`geom2`. :param geom1: Geometry :type geom1: Column diff --git a/docs/source/api/vector-format-readers.rst b/docs/source/api/vector-format-readers.rst index 6a58cb961..4c419a6b6 100644 --- a/docs/source/api/vector-format-readers.rst +++ b/docs/source/api/vector-format-readers.rst @@ -24,10 +24,15 @@ Here are some common useful file formats: For more information please refer to gdal documentation: https://gdal.org/drivers/vector/index.html +Mosaic provides two flavors of the general readers: -Mosaic provides two flavors of the readers: -* spark.read.format("ogr") for reading 1 file per spark task -* mos.read().format("multi_read_ogr") for reading file in parallel with multiple spark tasks + * :code:`spark.read.format("ogr")` for reading 1 file per spark task + * :code:`mos.read().format("multi_read_ogr")` for reading file in parallel with multiple spark tasks + +Additionally, for convenience, Mosaic provides specific readers for Shapefile and File Geodatabases: + + * :code:`spark.read.format("geo_db")` reader for GeoDB files natively in Spark. + * :code:`spark.read.format("shapefile")` reader for Shapefiles natively in Spark. spark.read.format("ogr") @@ -36,6 +41,7 @@ A base Spark SQL data source for reading GDAL vector data sources. The output of the reader is a DataFrame with inferred schema. The schema is inferred from both features and fields in the vector file. Each feature will be provided as 2 columns: + * geometry - geometry of the feature (GeometryType) * srid - spatial reference system identifier of the feature (StringType) @@ -50,7 +56,7 @@ The reader supports the following options: * layerNumber - number of the layer to read (IntegerType), zero-indexed -.. function:: read.format("ogr").load(path) +.. 
function:: spark.read.format("ogr").load(path) Loads a vector file and returns the result as a :class:`DataFrame`. @@ -100,12 +106,13 @@ Chunk size is the number of file rows that will be read per single task. The output of the reader is a DataFrame with inferred schema. The schema is inferred from both features and fields in the vector file. Each feature will be provided as 2 columns: -* geometry - geometry of the feature (GeometryType) -* srid - spatial reference system identifier of the feature (StringType) + + * geometry - geometry of the feature (GeometryType) + * srid - spatial reference system identifier of the feature (StringType) The fields of the feature will be provided as columns in the DataFrame. The types of the fields are coerced to most concrete type that can hold all the values. -ALL options should be passed as String as they are provided as a Map +ALL options should be passed as String as they are provided as a :code:`Map` and parsed into expected types on execution. The reader supports the following options: * driverName - GDAL driver name (StringType) @@ -116,7 +123,7 @@ and parsed into expected types on execution. The reader supports the following o * layerNumber - number of the layer to read (IntegerType), zero-indexed [pass as String] -.. function:: read.format("multi_read_ogr").load(path) +.. function:: mos.read().format("multi_read_ogr").load(path) Loads a vector file and returns the result as a :class:`DataFrame`. @@ -170,7 +177,7 @@ The reader supports the following options: * layerNumber - number of the layer to read (IntegerType), zero-indexed * vsizip - if the vector files are zipped files, set this to true (BooleanType) -.. function:: read.format("geo_db").load(path) +.. function:: spark.read.format("geo_db").load(path) Loads a GeoDB file and returns the result as a :class:`DataFrame`. 
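Since the multi-task reader above receives ALL of its options as strings, here is a rough pure-Python sketch of how such string options could be coerced into their expected types on execution. This is illustrative only, not Mosaic internals; the option names are the ones listed in this document.

```python
def coerce_reader_options(raw_options):
    """Illustrative sketch (not Mosaic source): options for
    mos.read().format("multi_read_ogr") arrive as strings and are
    parsed into their expected types when the read executes."""
    int_options = {"chunkSize", "layerNumber"}   # e.g. "5000" -> 5000
    bool_options = {"vsizip"}                    # e.g. "true" -> True
    coerced = {}
    for key, value in raw_options.items():
        if key in int_options:
            coerced[key] = int(value)
        elif key in bool_options:
            coerced[key] = value.strip().lower() == "true"
        else:
            coerced[key] = value                 # e.g. driverName stays a string
    return coerced

print(coerce_reader_options({"driverName": "GPKG", "chunkSize": "5000", "vsizip": "true"}))
```

The same pattern explains why :code:`layerNumber` must be passed "as String" even though it is documented as an IntegerType.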
@@ -223,7 +230,7 @@ The reader supports the following options: * layerNumber - number of the layer to read (IntegerType), zero-indexed * vsizip - if the vector files are zipped files, set this to true (BooleanType) -.. function:: read.format("shapefile").load(path) +.. function:: spark.read.format("shapefile").load(path) Loads a Shapefile and returns the result as a :class:`DataFrame`. diff --git a/docs/source/index.rst b/docs/source/index.rst index 56a9ce553..d7fb0ca44 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -47,17 +47,17 @@ this will leverage the Databricks H3 expressions when using H3 grid system. Mosaic provides: * easy conversion between common spatial data encodings (WKT, WKB and GeoJSON); * constructors to easily generate new geometries from Spark native data types; - * many of the OGC SQL standard `ST_` functions implemented as Spark Expressions for transforming, aggregating and joining spatial datasets; + * many of the OGC SQL standard :code:`ST_` functions implemented as Spark Expressions for transforming, aggregating and joining spatial datasets; * high performance through implementation of Spark code generation within the core Mosaic functions; * optimisations for performing point-in-polygon joins using an approach we co-developed with Ordnance Survey (`blog post `_); and * the choice of a Scala, SQL and Python API. .. note:: - For Mosaic versions < 0.4.0 please use the `0.3.x docs `_. + For Mosaic versions < 0.4 please use the `0.3 docs `_. -Version 0.4.0 -============= +Version 0.4.x Series +==================== We recommend using Databricks Runtime versions 13.3 LTS with Photon enabled. @@ -67,9 +67,10 @@ We recommend using Databricks Runtime versions 13.3 LTS with Photon enabled. **DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13. 
You can specify `%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.** -As of the 0.4.0 release, Mosaic issues the following ERROR when initialized on a cluster that is neither Photon Runtime nor Databricks Runtime ML `ADB `_ | `AWS `_ | `GCP `_ : +.. warning:: + Mosaic 0.4.x series issues the following ERROR on a standard, non-Photon cluster `ADB `_ | `AWS `_ | `GCP `_ : -**DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.** + **DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.** As of Mosaic 0.4.0 (subject to change in follow-on releases) * Mosaic SQL expressions cannot yet be registered with `Unity Catalog `_ due to API changes affecting DBRs >= 13. @@ -83,7 +84,6 @@ As of Mosaic 0.4.0 (subject to change in follow-on releases) * `Volumes `_ : Along the same principle of isolation, clusters (both assigned and shared access) can read Volumes via relevant built-in readers and writers or via custom python calls which do not involve any custom JVM code. - Version 0.3.x Series ==================== @@ -99,9 +99,6 @@ As of the 0.3.11 release, Mosaic issues the following WARNING when initialized o If you are receiving this warning in v0.3.11+, you will want to begin to plan for a supported runtime. The reason we are making this change is that we are streamlining Mosaic internals to be more aligned with future product APIs which are powered by Photon. Along this direction of change, Mosaic has standardized to JTS as its default and supported Vector Geometry Provider.
- - - Documentation ============= diff --git a/docs/source/usage/automatic-sql-registration.rst b/docs/source/usage/automatic-sql-registration.rst index 347949bd0..9c2d0fe64 100644 --- a/docs/source/usage/automatic-sql-registration.rst +++ b/docs/source/usage/automatic-sql-registration.rst @@ -10,12 +10,14 @@ An example of when this might be useful would be connecting a business intellige to your Spark / Databricks cluster to perform spatial queries or integrating Spark with a geospatial middleware component such as [Geoserver](https://geoserver.org/). +.. warning:: + Mosaic 0.4.x SQL bindings for DBR 13 not yet available in Unity Catalog due to API changes. + Pre-requisites ************** In order to use Mosaic, you must have access to a Databricks cluster running -Databricks Runtime 10.0 or higher (11.2 with photon or higher is recommended). -If you have cluster creation permissions in your Databricks +Databricks Runtime 13. If you have cluster creation permissions in your Databricks workspace, you can create a cluster using the instructions `here `__. @@ -30,8 +32,11 @@ Installation To install Mosaic on your Databricks cluster, take the following steps: -#. Upload Mosaic jar to a dedicated dbfs location. E.g. dbfs:/FileStore/mosaic/jars/. +#. Upload the Mosaic JAR to a dedicated fuse mount location. E.g. dbfs:/FileStore/mosaic/jars/. #. Create an init script that fetches the mosaic jar and copies it to databricks/jars. + You can also use the output from the (0.4 series) Python function :code:`setup_fuse_install`, e.g. + :code:`setup_fuse_install(, jar_copy=True)` which can help to copy the JAR used in + the init script below. .. code-block:: bash @@ -45,7 +50,7 @@ To install Mosaic on your Databricks cluster, take the following steps: #!/bin/bash # # File: mosaic-init.sh - # On cluster startup, this script will copy the Mosaic jars to the cluster's default jar directory. + # On cluster startup, this script will copy the Mosaic JAR to the cluster's default jar directory.
cp /dbfs/FileStore/mosaic/jars/*.jar /databricks/jars @@ -74,6 +79,9 @@ To test the installation, create a new Python notebook and run the following com You should see all the supported functions registered by Mosaic appear in the output. +.. note:: + You may see some :code:`ST_` functions from other libraries, so pay close attention to the provider. + .. warning:: Issue 317: https://github.com/databrickslabs/mosaic/issues/317 Mosaic jar needs to be installed via init script and not through the cluster UI. diff --git a/docs/source/usage/install-gdal.rst b/docs/source/usage/install-gdal.rst index f18b7eae8..c4bdf02f3 100644 --- a/docs/source/usage/install-gdal.rst +++ b/docs/source/usage/install-gdal.rst @@ -26,19 +26,19 @@ GDAL Installation Setup GDAL files and scripts **************************** Mosaic requires GDAL to be installed on the cluster. The easiest way to do this is to use the -the mos.setup_gdal() function. +:code:`setup_gdal` function. .. note:: (a) This is close in behavior to Mosaic < 0.4 series (prior to DBR 13), with new options to pip install Mosaic for either ubuntugis gdal (3.4.3) or jammy default (3.4.1). (b) 'to_fuse_dir' can be one of '/Volumes/..', '/Workspace/..', '/dbfs/..'; - however, you should consider setup_fuse_install()` for Volume based installs as that + however, you should consider :code:`setup_fuse_install()` for Volume based installs as that exposes more options, to include copying JAR and JNI Shared Objects. .. function:: setup_gdal() Generate an init script that will install GDAL native libraries on each worker node. - All of the listed parameters are optional. You can have even more control with setup_fuse_install function. + All of the listed parameters are optional. You can have even more control with the :code:`setup_fuse_install` function. :param to_fuse_dir: Path to write out the init script for GDAL installation; default is '/Workspace/Shared/geospatial/mosaic/gdal/jammy'.
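For orientation, a small sketch of driving this setup from a notebook follows. The fuse directory is the documented default for :code:`to_fuse_dir`; the Mosaic calls only run on a Databricks cluster, so they are shown commented out, and the :code:`with_ubuntugis` flag is taken from the install-gdal options described in these docs rather than invented here:

```python
# Sketch: generating the GDAL init script from a notebook (assumption-heavy).
# The default fuse directory comes from the parameter description above.
to_fuse_dir = "/Workspace/Shared/geospatial/mosaic/gdal/jammy"

# Databricks-only calls, shown for shape (commented out):
# import mosaic as mos
# mos.enable_mosaic(spark, dbutils)
# mos.setup_gdal(to_fuse_dir=to_fuse_dir)                       # jammy default GDAL 3.4.1
# mos.setup_gdal(to_fuse_dir=to_fuse_dir, with_ubuntugis=True)  # ubuntugis GDAL 3.4.3

print(to_fuse_dir)
```

After running :code:`setup_gdal`, the generated init script under :code:`to_fuse_dir` still has to be attached to the cluster, as described in the "Configure the init script" section.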
@@ -80,7 +80,7 @@ the mos.setup_gdal() function. Configure the init script ************************** -After the mos.setup_gdal() function has been run, you will need to configure the cluster to use the +After the :code:`setup_gdal` function has been run, you will need to configure the cluster to use the init script. The init script can be set by clicking on the "Edit" button on the cluster page and adding the following to the "Advanced Options" section: @@ -104,4 +104,8 @@ code at the top of the notebook: .. code-block:: text GDAL enabled. - GDAL 3.4.1, released 2021/12/27 \ No newline at end of file + GDAL 3.4.1, released 2021/12/27 + + .. note:: + You can switch the init script from the default Ubuntu GDAL (3.4.1) to the `ubuntugis ppa `__ (3.4.3) + with :code:`setup_gdal(with_ubuntugis=True)` \ No newline at end of file diff --git a/docs/source/usage/installation.rst b/docs/source/usage/installation.rst index 11263e11a..61d0a6114 100644 --- a/docs/source/usage/installation.rst +++ b/docs/source/usage/installation.rst @@ -5,67 +5,80 @@ Installation guide Supported platforms ################### +.. note:: + For Mosaic 0.4 series, we recommend DBR 13.3 LTS on Photon or ML Runtime clusters. + .. warning:: - From versions after 0.3.x, Mosaic will require either - * Databricks Runtime 11.2+ with Photon enabled - * Databricks Runtime for ML 11.2+ - - Mosaic 0.3 series does not support DBR 13 (coming soon); - also, DBR 10 is no longer supported in Mosaic. + Mosaic 0.4.x series only supports DBR 13.x. + If run on a different DBR, it will throw an exception: + + **DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13. You can specify `%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.** +..
warning:: + Mosaic 0.4.x series issues the following ERROR on a standard, non-Photon cluster `ADB `_ | `AWS `_ | `GCP `_ : + + **DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.** -We recommend using Databricks Runtime versions 11.3 LTS or 12.2 LTS with Photon enabled; -this will leverage the Databricks H3 expressions when using H3 grid system. -As of the 0.3.11 release, Mosaic issues the following warning when initialized on a cluster -that is neither Photon Runtime nor Databricks Runtime ML [`ADB `__ | `AWS `__ | `GCP `__]: +As of Mosaic 0.4.0 (subject to change in follow-on releases) + * Mosaic SQL expressions cannot yet be registered with `Unity Catalog `_ due to API changes affecting DBRs >= 13. + * `Assigned Clusters `_ : Mosaic Python, R, and Scala APIs. + * `Shared Access Clusters `_ : Mosaic Scala API (JVM) with Admin `allowlisting `_ ; Python bindings to Mosaic Scala APIs are blocked by Py4J Security on Shared Access Clusters. - DEPRECATION WARNING: Mosaic is not supported on the selected Databricks Runtime. Mosaic will stop working on this cluster after v0.3.x. Please use a Databricks Photon-enabled Runtime (for performance benefits) or Runtime ML (for spatial AI benefits). +.. note:: + As of Mosaic 0.4.0 (subject to change in follow-on releases) -If you are receiving this warning in v0.3.11+, you will want to begin to plan for a supported runtime. -The reason we are making this change is that we are streamlining Mosaic -internals to be more aligned with future product APIs which are powered by Photon. Along this direction -of change, Mosaic will be standardizing to JTS as its default and supported Vector Geometry Provider.
+ * `Unity Catalog `_ : Enforces process isolation which is difficult to accomplish with custom JVM libraries; as such only built-in (aka platform provided) JVM APIs can be invoked from other supported languages in Shared Access Clusters. + * `Volumes `_ : Along the same principle of isolation, clusters (both assigned and shared access) can read Volumes via relevant built-in readers and writers or via custom python calls which do not involve any custom JVM code. If you have cluster creation permissions in your Databricks workspace, you can create a cluster using the instructions -`here `__. +`here `_. You will also need "Can Manage" permissions on this cluster in order to attach the Mosaic library to your cluster. A workspace administrator will be able to grant these permissions and more information about cluster permissions can be found in our documentation -`here `__. +`here `_. Package installation #################### Installation from PyPI ********************** -Python users can install the library directly from `PyPI `__ -using the instructions `here `__ +Python users can install the library directly from `PyPI `_ +using the instructions `here `_ or from within a Databricks notebook using the :code:`%pip` magic command, e.g. .. code-block:: bash %pip install databricks-mosaic +If you need to install the Mosaic 0.3 series for DBR 12.2 LTS, e.g. + +.. code-block:: bash + + %pip install "databricks-mosaic<0.4,>=0.3" + +For Mosaic versions < 0.4 please use the `0.3 docs `_. + Installation from release artifacts *********************************** -Alternatively, you can access the latest release artifacts `here `__ +Alternatively, you can access the latest release artifacts `here `_ and manually attach the appropriate library to your cluster. Which artifact you choose to attach will depend on the language API you intend to use. -* For Python API users, choose the Python .whl file.
+* For Python API users, choose the Python .whl file (includes the JAR) * For Scala users, take the Scala JAR (packaged with all necessary dependencies). * For R users, download the Scala JAR and the R bindings library [see the sparkR readme](R/sparkR-mosaic/README.md). -Instructions for how to attach libraries to a Databricks cluster can be found `here `__. +Instructions for how to attach libraries to a Databricks cluster can be found `here `_. Automated SQL registration ************************** If you would like to use Mosaic's functions in pure SQL (in a SQL notebook, from a business intelligence tool, or via a middleware layer such as Geoserver, perhaps) then you can configure -"Automatic SQL Registration" using the instructions `here `__. +"Automatic SQL Registration" using the instructions `here `_. Enabling the Mosaic functions ############################# @@ -74,8 +87,8 @@ The mechanism for enabling the Mosaic functions varies by language: .. tabs:: .. code-tab:: py - from mosaic import enable_mosaic - enable_mosaic(spark, dbutils) + import mosaic as mos + mos.enable_mosaic(spark, dbutils) .. code-tab:: scala @@ -91,6 +104,8 @@ The mechanism for enabling the Mosaic functions varies by language: library(sparkrMosaic) enableMosaic() +.. note:: + We recommend :code:`import mosaic as mos` to namespace the python api and avoid any conflicts with other similar functions. SQL usage ********* @@ -106,3 +121,5 @@ register the Mosaic SQL functions in your SparkSession from a Scala notebook cel val mosaicContext = MosaicContext.build(H3, JTS) mosaicContext.register(spark) +.. warning:: + Mosaic 0.4.x SQL bindings for DBR 13 not yet available in Unity Catalog due to API changes. \ No newline at end of file
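As a closing illustration, the namespaced Python enablement recommended above can be sketched as follows; the Spark-dependent lines are commented out since they require a Databricks cluster, and the :code:`st_area` check is an illustrative assumption rather than an exhaustive registration test:

```python
# Sketch: enable Mosaic under the recommended `mos` namespace and
# sanity-check that its SQL functions were registered (Databricks-only
# lines commented out; the function name checked below is illustrative).
# import mosaic as mos
# mos.enable_mosaic(spark, dbutils)
# registered = {row["function"] for row in spark.sql("SHOW FUNCTIONS").collect()}
# assert "st_area" in registered

alias = "mos"  # conventional import alias used throughout these docs
print(alias)   # → mos
```

Namespacing everything under :code:`mos` keeps Mosaic's :code:`ST_` functions distinct from similarly named functions contributed by other libraries on the same cluster.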