Add transforms module with scale function #384

stellaprins · 2025-01-23T15:43:02Z

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
Enable users to convert data arrays expressed in pixels to SI units (like meters). Users can scale data to a known reference size (e.g. the size of a cage) and add appropriate units. This allows distances to be represented in standard units instead of just the number of pixels.

What does this PR do?
Adds a transforms module with a scale function. The scale function scales data (xarray.DataArray) by a given factor with an optional unit (str| None). Units can by any strings (e.g. "elephants") and will be added to xarray.DataArray.attrs["unit"]. Passing None as a unit (explicitely or by default) "dequantifies" the data (i.e. drops .attrs["unit"]).

I've looked at pint-array but as it stands units are simply strings. Another option using pint-array to what @niksirbi mentioned in #141 (i.e. something like data.pint.quantify({'distance': 'metres', 'time': 'seconds'}), could be to use the .attrs['units'] entry for each data variable (see below).

Unit-aware arithmetic in Xarray, via pint

Alternatively, we can quantify from the object's .attrs, automatically reading the metadata which xarray objects carry around. If nothing is passed to .quantify(), it will attempt to parse the .attrs['units'] entry for each data variable.

References

part of #366.

How has this PR been tested?

The scale function has been tested with various unit tests to ensure it works correctly. These tests include:

Correct scaling when passing different factors and units.
Last unit is used and product of scaling factors when same data is scaled twice.
Applying scaling to the correct dimension when data is transposed.
Applying scaling to the first matching dimension if multiple dimensions match the scaling factor's length.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Docstrings have been added to the module and all functions. No further documentation needed.

Checklist:

The code has been tested locally
Tests have been added to cover all new functionality
[x ] The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

codecov · 2025-01-23T16:24:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.79%. Comparing base (15b3f41) to head (412634b).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #384   +/-   ##
=======================================
  Coverage   99.79%   99.79%           
=======================================
  Files          14       15    +1     
  Lines         969      989   +20     
=======================================
+ Hits          967      987   +20     
  Misses          2        2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

sonarqubecloud · 2025-01-23T18:01:25Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

niksirbi

Thanks for your first movement contribution, @stellaprins!

This is really well done and thoroughly tested. I do have an alternative suggestion for the implementation, though:

I think the scale function (and any future linear transforms of this kind) should only work on data arrays with a space dimension (with Cartesian coordinates), and broadcasting should happen only along that dimension. Please see my specific comments for details.
I believe the new attribute should be called space_unit, mirroring the existing time_unit attribute we populate when loading a dataset. In the future, I’m inclined to merge these two into an attribute named units that accepts a dictionary mapping dimension names to units (as you mentioned in your PR description). However, that should be handled in a separate issue/PR, possibly in conjunction with the pint-xarray issue. For now, renaming unit to space_unit is perfectly fine.

On the same topic, since we are introducing a new attribute, I wonder if it would be worth populating it directly when a dataset is loaded from a file, as we do for time_unit. For most of our supported formats, space_unit would be "pixels", with the possible exception of "Anipose" (I need to double-check). I have opened an issue to keep track of this idea.

niksirbi · 2025-01-24T10:57:29Z

movement/transforms.py

+def scale(
+    data: xr.DataArray,
+    factor: float | np.ndarray = 1.0,
+    unit: str | None = None,


I would call this space_unit, because we already store a time_unit attribute, and it seems sensible to continue with e dimension_unit convention.
I won't point this out in all the other places where the unit appears.

niksirbi · 2025-01-24T13:26:48Z

movement/transforms.py

+    Parameters
+    ----------
+    data : xarray.DataArray
+        The input data to be scaled.


Suggested change

The input data to be scaled.

The input data array to be scaled.

niksirbi · 2025-01-24T13:48:19Z

movement/transforms.py

+    When the factor is a scalar (a single number), the scaling factor is
+    applied to all dimensions, while if the factor is a list or array, the
+    factor is broadcasted along the first matching dimension.


I suggest that we avoid broadcasting along the first matching dimension, and instead specifically broadcast along the space dimension. At the moment, the function is very permissive, so a user could pass an array of factors with a length equal to the time axis, resulting in each time point being scaled by a different factor (though I'm not sure why that would be useful). In general, I'm not aware of any use case for scaling across dimensions other than space.

Furthermore, in many situations, the number of spatial dimensions (2 or 3) may coincide with the number of individuals or keypoints. Consequently, relying on the first matching dimension is not robust if the dimensions are reordered, and it may be ambiguous.
We should make use of the dimension labels that xarray gives us.

In summary, I propose making this function more Cartesian-space-specific by:

Verifying that the input data array contains a space dimension with coordinates x, y or x, y, z. (This can be done using the existing validate_dims_coords() function—see compute_norm() for an example.)

Naming the new attribute space_unit.

Comparing the factor's shape specifically against data.sizes["space"].

Broadcasting specifically across space.

niksirbi · 2025-01-24T13:55:55Z

movement/transforms.py

+    factor : float or np.ndarray of floats
+        The scaling factor to apply to the data. If factor is a scalar, all
+        dimensions of the data array are scaled by the same factor. If factor
+        is a list or an 1D array, the length of the array must match the length


Here a list is mentioned as an accepted input, and it indeed is, because the function would coerce that into a numpy array.
However, only float and np.ndarray are mentioned in the type hints.
Technically, anything that can be coerced into a 1D numpy array is acceptable, right? Maybe there is and alternative type hint that could indicate that?

niksirbi · 2025-01-24T13:59:45Z