diff --git a/docs/sphinx/source/usage/library/input.rst b/docs/sphinx/source/usage/library/input.rst index cc153b9b..6fce3b7f 100644 --- a/docs/sphinx/source/usage/library/input.rst +++ b/docs/sphinx/source/usage/library/input.rst @@ -3,7 +3,8 @@ Input Management =================== -Finch provides its own input management via the class :class:`finch.Input`. +When creating runtime experiments for parallel operators, one might often be interested in how loading input in different ways has an impact on runtime and scalability. +For this reason, finch provides its own input management via the class :class:`finch.Input`. An instance of this class describes a single input for some operator. An input can have multiple versions, described by the class :class:`finch.Input.Version`. Every such version contains the same data, but stores it differently. @@ -12,38 +13,49 @@ Creating a new Input -------------------- We can create a new input by creating a new instance of :class:`finch.data.Input`. -The constructor takes a source, which is a function taking a :class:`finch.data.Input.Version` object and returning a ``xarray.Dataset``. -The returned dataset serves as the input for some operator and its content must always be the same with every call to the source function. +The constructor requires a source, which is a function taking a :class:`finch.data.Input.Version` object and returning a ``xarray.Dataset``. +The returned dataset serves as the input for some operator and its content should always be the same with every call to the source function. The version argument will be passed when we want to create a new version of the input. This allows the source function to efficiently output the requested version of the input. -It is not expected that the output of the source function matches the requested version. -However, those version properties which won't match will be transformed in some default manner, which might be less efficient than what would be possible. 
+It is not expected that the output of the source function fully matches the requested version. +Finch itself ensures that the requested version is actually used. +However, it might be more efficient to directly enforce some version properties during source loading. For example, loading the correct chunks directly from some source file is more efficient than loading the data with some arbitrary chunking and rechunking afterwards. + +.. note:: While the input data should always be the same when running experiments, this property is not enforced. + In fact, in some cases it can be useful to ignore this requirement, for example when we want to test with random data. + However, in most cases you should stick to ensuring data invariance, because the runtime of an operator might depend on the input data. + Besides the actual source, we must also provide a (complete) version object, which describes the default output of the source. :: from finch.data import Version, Format - def foo_source(version): - return finch.data.load_grib("foo.grib", ["P", "T"], chunks=version.chunks) + def grib_source(version): + return finch.data.load_grib("data.grib", ["P", "T"], chunks=version.chunks) - foo_version = Input.Version( + grib_version = Input.Version( format=Format.GRIB, dim_order="zyx", - chunks={k:-1 for k in list("xyz")}, + chunks={"z": 1, "x": -1, "y": -1}, coords=True ) - foo = Input("foo", foo_source, foo_version) + foo = Input("foo", grib_source, grib_version) +If an input with the given name is already present in the :confval:`input_store` directory, no new input will be created. +Instead, finch will load the preexisting input along with its versions. +Note that the data of the versions won't be loaded into memory when loading an input. +Only the version properties will be loaded. Creating new Versions ---------------------- We can easily create a new version of our input with the function :func:`finch.data.Input.add_version`.
-The new version will be stored in our input's directory inside the finch input data store. -When adding a new version, finch will try to retrieve an already existing version which can be used to construct the new version. +The new version will be stored in our input's directory inside :confval:`input_store`. +When adding a new version, finch will try to retrieve an already existing version which can be used to efficiently construct the new version. If there is no such preexisting version, it will use the source to construct the new version. -The name of a new version will be automatically generated. However, the name is only really used for storing the data. +The name of a new version will be randomly generated by default. The name is only used for storing the version on disk. +:func:`finch.data.Input.add_version` will return the complete version which was actually added. :: zarr = Input.Version( format=Format.ZARR, @@ -52,7 +64,7 @@ The name of a new version will be automatically generated. However, the name is coords=False ) - foo.add_version(zarr) + zarr = foo.add_version(zarr) If you don't care about certain version properties, you can omit them in the constructor. They will then be set according to the version which was used for loading the data. @@ -63,37 +75,36 @@ They will then be set according to the version which was used for loading the da foo.add_version(netcdf) -Additionally, if you already have an input ready which you want to add, you can provide it with the ``data`` argument. -However, keep in mind that you are responsible yourself that your data matches the version you provide. -If you provide the data yourself, the version can no longer have any unset attributes. +Additionally, if you already have a version loaded in memory that you want to add, you can provide it via the ``data`` argument. +In this case, you are responsible for ensuring that your data matches the version you provide.
+If you provide the data yourself, all version properties must be explicitly set. :: netcdf_explicit = Input.Version( format=Format.NETCDF, - dim_order="xyz", + dim_order="zyx", chunks={"x": 10, "y": -1, "z": -1}, coords=False ) - data, _ = foo.get_version(netcdf_explicit) + data, _ = foo.get_version(zarr) + data = data.transpose(*list(netcdf_explicit.dim_order)) foo.add_version(version, data) Retrieving Versions ------------------- -As explained previously, finch stores its versions in a directory specified by the name of the input. -When we create a new :class:`finch.data.Input` object, finch will take a look at this directory, if it already exists, to collect previously added versions. -No data will be loaded at this step. -Afterwards, you can see which versions were loaded via :func:`finch.data.Input.list_versions`. +You can list the versions of an input with :func:`finch.data.Input.list_versions`. :: version_list = foo.list_versions() # version_list contains all previously added versions for inputs named "foo" In order to get access to the data via a ``xarray.Dataset``, you can request a specific version with the :func:`finch.data.Input.get_version`. Finch will then browse the existing versions and search for a match, which it will output as a dataset. -A :class:`finch.data.Input.Version` object is used for querying. Unset attributes won't be considered. -By default, no perfect match is required. Instead, finch will also find versions, whose chunks can be combined to the requested chunking configuration. -This mechanism removes the need for perfectly matching versions every time without any noticeable performance impact. +A :class:`finch.data.Input.Version` object is used for querying; you can omit properties whose values you don't care about. +By default, the requested version does not need to match an existing version perfectly. Instead, finch will also find versions whose chunks can be combined into the requested chunking configuration.
+This mechanism lets us easily experiment with different chunk sizes without having to store a new input version every time. +Combining chunks should not have any noticeable performance impact. :: netcdf_big = Input.Version( format=Format.NETCDF, @@ -105,7 +116,9 @@ This mechanism removes the need for perfectly matching versions every time witho data, out_version = foo.get_version(netcdf_big) assert out_version == netcdf_big -If finch didn't find a match, by default a new version will be created from the source (without adding it). +You can enforce perfect matches by setting ``weak_compare=False`` in :func:`finch.data.Input.get_version`. + +If finch doesn't find a match, a new version is created from the source (without adding it). :: transposed = Input.Version( dim_order="yxz" @@ -116,3 +129,8 @@ If finch didn't find a match, by default a new version will be created from the out_version.coords and \ out_version.chunks == foo_version.chunks and \ out_version.dim_order == "yxz" + +You can instruct finch to directly add the new version by setting ``add_if_not_exists=True``. + +If you don't want a new version to be created from the source when no match is found, set ``create_if_not_exists=False``. +The function will then return ``None`` instead.