
Add transforms module with scale function #384

Open · stellaprins wants to merge 6 commits into main

Conversation

@stellaprins stellaprins commented Jan 23, 2025

Description

What is this PR

  • [ ] Bug fix
  • [x] Addition of a new feature
  • [ ] Other

Why is this PR needed?
Enable users to convert data arrays expressed in pixels to SI units (like meters). Users can scale data to a known reference size (e.g. the size of a cage) and add appropriate units. This allows distances to be represented in standard units instead of just the number of pixels.

What does this PR do?
Adds a transforms module with a scale function. The scale function scales data (xarray.DataArray) by a given factor with an optional unit (str | None). Units can be any string (e.g. "elephants") and will be added to xarray.DataArray.attrs["unit"]. Passing None as the unit (explicitly or by default) "dequantifies" the data (i.e. drops .attrs["unit"]).
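
For illustration, a minimal usage sketch of the behaviour described above (the movement.transforms import path and the example values are assumptions for illustration, not taken from the diff):

import xarray as xr

from movement.transforms import scale  # import path assumed

# Hypothetical position data in pixels: 2 time points x 2 spatial coordinates.
data = xr.DataArray(
    [[10.0, 20.0], [30.0, 40.0]],
    dims=["time", "space"],
    coords={"space": ["x", "y"]},
)

in_metres = scale(data, factor=0.01, unit="metres")
print(in_metres.attrs["unit"])  # "metres"

# Omitting the unit (i.e. unit=None) "dequantifies" the result.
dequantified = scale(in_metres, factor=100)
print("unit" in dequantified.attrs)  # False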

I've looked at pint-xarray, but as it stands, units are simply strings. Another option using pint-xarray, closer to what @niksirbi mentioned in #141 (i.e. something like data.pint.quantify({'distance': 'metres', 'time': 'seconds'})), could be to use the .attrs['units'] entry for each data variable (see the quote below).

Quoting "Unit-aware arithmetic in Xarray, via pint":

Alternatively, we can quantify from the object's .attrs, automatically reading the metadata which xarray objects carry around. If nothing is passed to .quantify(), it will attempt to parse the .attrs['units'] entry for each data variable.
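
For context, a small sketch of that pint-xarray mechanism (the accessor is registered by importing pint_xarray; the example array is made up):

import pint_xarray  # noqa: F401  (importing registers the .pint accessor)
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims="time", attrs={"units": "metres"})

# With no arguments, .quantify() parses the .attrs["units"] entry.
quantified = da.pint.quantify()
print(quantified.pint.units)  # metre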

References

Part of #366.

How has this PR been tested?

The scale function has been tested with various unit tests to ensure it works correctly. These tests include:

  • Correct scaling when passing different factors and units.
  • When the same data is scaled twice, the scaling factors multiply and the last unit passed is kept (see the snippet after this list).
  • Applying scaling to the correct dimension when data is transposed.
  • Applying scaling to the first matching dimension if multiple dimensions match the scaling factor's length.
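
For example, the second bullet corresponds to behaviour like this (a hypothetical snippet, reusing the data array and scale function from the sketch above):

scaled_once = scale(data, factor=2.0, unit="elephants")
scaled_twice = scale(scaled_once, factor=0.5, unit="metres")

# The factors multiply (2.0 * 0.5 == 1.0) and the last unit passed wins.
assert scaled_twice.attrs["unit"] == "metres"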

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Docstrings have been added to the module and all functions. No further documentation needed.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • [x] The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

codecov bot commented Jan 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.79%. Comparing base (15b3f41) to head (412634b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #384   +/-   ##
=======================================
  Coverage   99.79%   99.79%           
=======================================
  Files          14       15    +1     
  Lines         969      989   +20     
=======================================
+ Hits          967      987   +20     
  Misses          2        2           

@stellaprins stellaprins linked an issue Jan 24, 2025 that may be closed by this pull request
@stellaprins stellaprins requested a review from niksirbi January 24, 2025 09:09

@niksirbi niksirbi left a comment


Thanks for your first movement contribution, @stellaprins!

This is really well done and thoroughly tested. I do have an alternative suggestion for the implementation, though:

  • I think the scale function (and any future linear transforms of this kind) should only work on data arrays with a space dimension (with Cartesian coordinates), and broadcasting should happen only along that dimension. Please see my specific comments for details.
  • I believe the new attribute should be called space_unit, mirroring the existing time_unit attribute we populate when loading a dataset. In the future, I’m inclined to merge these two into an attribute named units that accepts a dictionary mapping dimension names to units (as you mentioned in your PR description). However, that should be handled in a separate issue/PR, possibly in conjunction with the pint-xarray issue. For now, renaming unit to space_unit is perfectly fine.

On the same topic, since we are introducing a new attribute, I wonder if it would be worth populating it directly when a dataset is loaded from a file, as we do for time_unit. For most of our supported formats, space_unit would be "pixels", with the possible exception of "Anipose" (I need to double-check). I have opened an issue to keep track of this idea.

def scale(
    data: xr.DataArray,
    factor: float | np.ndarray = 1.0,
    unit: str | None = None,


I would call this space_unit, because we already store a time_unit attribute, and it seems sensible to continue with a dimension_unit convention.
I won't point this out in all the other places where the unit appears.

Parameters
----------
data : xarray.DataArray
    The input data to be scaled.


Suggested change:
- The input data to be scaled.
+ The input data array to be scaled.

Comment on lines +40 to +42
When the factor is a scalar (a single number), the scaling factor is
applied to all dimensions, while if the factor is a list or array, the
factor is broadcasted along the first matching dimension.

@niksirbi niksirbi Jan 24, 2025


I suggest that we avoid broadcasting along the first matching dimension, and instead specifically broadcast along the space dimension. At the moment, the function is very permissive, so a user could pass an array of factors with a length equal to the time axis, resulting in each time point being scaled by a different factor (though I'm not sure why that would be useful). In general, I'm not aware of any use case for scaling across dimensions other than space.

Furthermore, in many situations, the number of spatial dimensions (2 or 3) may coincide with the number of individuals or keypoints. Consequently, relying on the first matching dimension is not robust if the dimensions are reordered, and it may be ambiguous.
We should make use of the dimension labels that xarray gives us.

In summary, I propose making this function more Cartesian-space-specific (see the sketch after this list) by:

  • Verifying that the input data array contains a space dimension with coordinates x, y or x, y, z. (This can be done using the existing validate_dims_coords() function—see compute_norm() for an example.)
  • Naming the new attribute space_unit.
  • Comparing the factor's shape specifically against data.sizes["space"].
  • Broadcasting specifically across space.
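
Here is a rough sketch of what I mean (assuming the existing validate_dims_coords helper; the import path, the 2D-only coordinate check and the error message are illustrative simplifications, not a final implementation):

import numpy as np
import xarray as xr

from movement.validators.arrays import validate_dims_coords  # path assumed


def scale(
    data: xr.DataArray,
    factor: float | np.ndarray = 1.0,
    space_unit: str | None = None,
) -> xr.DataArray:
    # Require a "space" dimension with Cartesian coordinates (2D case shown).
    validate_dims_coords(data, {"space": ["x", "y"]})
    factor = np.asarray(factor, dtype=float)
    if factor.ndim == 1:
        # Compare the factor's length specifically against data.sizes["space"]...
        if factor.shape[0] != data.sizes["space"]:
            raise ValueError(
                "A 1D factor must match the length of the space dimension."
            )
        # ...and broadcast along "space", wherever it sits in the array.
        factor = xr.DataArray(factor, dims=["space"])
    scaled = data * factor
    scaled.attrs = dict(data.attrs)  # xarray arithmetic drops attrs by default
    if space_unit is None:
        scaled.attrs.pop("space_unit", None)  # "dequantify"
    else:
        scaled.attrs["space_unit"] = space_unit
    return scaled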

factor : float or np.ndarray of floats
    The scaling factor to apply to the data. If factor is a scalar, all
    dimensions of the data array are scaled by the same factor. If factor
    is a list or a 1D array, the length of the array must match the length

@niksirbi niksirbi Jan 24, 2025


Here a list is mentioned as an accepted input, and it indeed is, because the function coerces it into a numpy array.
However, only float and np.ndarray are mentioned in the type hints.
Technically, anything that can be coerced into a 1D numpy array is acceptable, right? Maybe there is an alternative type hint that could indicate that?
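
One candidate is numpy.typing.ArrayLike, which covers scalars, lists, tuples and arrays, i.e. anything np.asarray accepts. A sketch (not what the PR currently does):

import numpy as np
from numpy.typing import ArrayLike


def scale(data, factor: ArrayLike = 1.0, unit: str | None = None):
    # np.asarray coerces floats, lists and tuples alike into an ndarray.
    factor = np.asarray(factor, dtype=float)
    ...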

pytest.param(
    {},
    data_array_with_dims_and_coords(nparray_0_to_23()),
    id="Do nothing",


Nicely done with the ids, I didn't know this was an option. Much better than adding comments (which I normally do).

assert scaled_data.attrs == expected_output.attrs


def test_scale_inverted_data():


This test should be modified if you adopt my suggestion about specifically broadcasting over the space dimension.
That would also be robust to re-ordering, because the space dimension would be identified by name (no matter where it sits in the array).
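
Something along these lines (hypothetical; it reuses the fixture names from this test module and assumes the space dimension has length 2):

def test_scale_transposed_data():
    data = data_array_with_dims_and_coords(nparray_0_to_23())
    expected = scale(data, factor=[0.5, 2.0], space_unit="metres")
    # Transposing first must give the same values, because the factor is
    # broadcast along "space" by name, not by position.
    scaled = scale(data.transpose(), factor=[0.5, 2.0], space_unit="metres")
    xr.testing.assert_identical(scaled.transpose(), expected)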

),
],
)
def test_scale(


excellent test, nicely done!

)


def test_scale_first_matching_axis():

@niksirbi niksirbi Jan 24, 2025


This test also needs to be modified or binned, if you adopt my suggestion about specifically broadcasting over space.

),
],
)
def test_scale_twice(


Great test!

Development

Successfully merging this pull request may close the following issue: Scale pixels to SI units.

3 participants