Add transforms module with scale function #384
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files (Coverage Diff):

|          | main   | #384   | +/- |
|----------|--------|--------|-----|
| Coverage | 99.79% | 99.79% |     |
| Files    | 14     | 15     | +1  |
| Lines    | 969    | 989    | +20 |
| Hits     | 967    | 987    | +20 |
| Misses   | 2      | 2      |     |

View full report in Codecov by Sentry.
Quality Gate passed.
Thanks for your first movement contribution, @stellaprins!
This is really well done and thoroughly tested. I do have an alternative suggestion for the implementation, though:
- I think the `scale` function (and any future linear transforms of this kind) should only work on data arrays with a `space` dimension (with Cartesian coordinates), and broadcasting should happen only along that dimension. Please see my specific comments for details.
- I believe the new attribute should be called `space_unit`, mirroring the existing `time_unit` attribute we populate when loading a dataset. In the future, I'm inclined to merge these two into an attribute named `units` that accepts a dictionary mapping dimension names to units (as you mentioned in your PR description). However, that should be handled in a separate issue/PR, possibly in conjunction with the `pint-xarray` issue. For now, renaming `unit` to `space_unit` is perfectly fine.

On the same topic, since we are introducing a new attribute, I wonder if it would be worth populating it directly when a dataset is loaded from a file, as we do for `time_unit`. For most of our supported formats, `space_unit` would be `"pixels"`, with the possible exception of "Anipose" (I need to double-check). I have opened an issue to keep track of this idea.
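To make the naming concrete, here is a tiny sketch of the attribute layout under discussion (the consolidated `units` dictionary is speculative and would live in a separate issue/PR):

```python
import xarray as xr

ds = xr.Dataset()

# Existing convention: time_unit is populated when a dataset is loaded.
ds.attrs["time_unit"] = "seconds"
# Proposed here: also record the spatial unit, typically "pixels"
# for most supported formats.
ds.attrs["space_unit"] = "pixels"

# Possible future consolidation (separate issue/PR, possibly via pint-xarray):
ds.attrs["units"] = {"time": "seconds", "space": "pixels"}
```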
def scale(
    data: xr.DataArray,
    factor: float | np.ndarray = 1.0,
    unit: str | None = None,
I would call this `space_unit`, because we already store a `time_unit` attribute, and it seems sensible to continue with the `dimension_unit` convention.
I won't point this out in all the other places where the unit appears.
Parameters
----------
data : xarray.DataArray
    The input data to be scaled.
Suggested change: "The input data to be scaled." → "The input data array to be scaled."
When the factor is a scalar (a single number), the scaling factor is
applied to all dimensions, while if the factor is a list or array, the
factor is broadcasted along the first matching dimension.
I suggest that we avoid broadcasting along the first matching dimension, and instead specifically broadcast along the `space` dimension. At the moment, the function is very permissive, so a user could pass an array of factors with a length equal to the time axis, resulting in each time point being scaled by a different factor (though I'm not sure why that would be useful). In general, I'm not aware of any use case for scaling across dimensions other than `space`.

Furthermore, in many situations, the number of spatial dimensions (2 or 3) may coincide with the number of individuals or keypoints. Consequently, relying on the first matching dimension is not robust if the dimensions are reordered, and it may be ambiguous. We should make use of the dimension labels that `xarray` gives us.

In summary, I propose making this function more Cartesian-space-specific by (see the sketch after this list):
- Verifying that the input data array contains a `space` dimension with coordinates `x, y` or `x, y, z`. (This can be done using the existing `validate_dims_coords()` function; see `compute_norm()` for an example.)
- Naming the new attribute `space_unit`.
- Comparing the factor's shape specifically against `data.sizes["space"]`.
- Broadcasting specifically across `space`.
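For concreteness, here is a minimal sketch of what such a space-specific `scale` could look like. It is an untested illustration rather than the PR's actual implementation: the import path and exact signature of `validate_dims_coords`, as well as the error handling, are assumptions on my part.

```python
import numpy as np
import xarray as xr

# Assumed import path and signature for movement's existing validator.
from movement.validators.arrays import validate_dims_coords


def scale(
    data: xr.DataArray,
    factor: float | np.ndarray = 1.0,
    space_unit: str | None = None,
) -> xr.DataArray:
    """Scale data along the space dimension (illustrative sketch only)."""
    # Require a Cartesian space dimension with at least x and y coordinates.
    validate_dims_coords(data, {"space": ["x", "y"]})

    factor = np.asarray(factor, dtype=float).squeeze()
    if factor.ndim == 0:
        # Scalar factor: scale all spatial coordinates uniformly.
        scaled = data * factor
    elif factor.ndim == 1 and factor.size == data.sizes["space"]:
        # Wrap the factor in a DataArray so xarray broadcasts it along
        # "space" by name, regardless of the order of dimensions.
        scaled = data * xr.DataArray(factor, dims=["space"])
    else:
        raise ValueError(
            "factor must be a scalar or a 1D array-like whose length "
            f"matches data.sizes['space'] ({data.sizes['space']})"
        )

    # Carry over existing attributes, then set or drop the spatial unit.
    scaled.attrs = dict(data.attrs)
    if space_unit is None:
        scaled.attrs.pop("space_unit", None)
    else:
        scaled.attrs["space_unit"] = space_unit
    return scaled
```

Wrapping the factor in a `DataArray` with a named `space` dimension is what makes the broadcasting independent of dimension order.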
factor : float or np.ndarray of floats
    The scaling factor to apply to the data. If factor is a scalar, all
    dimensions of the data array are scaled by the same factor. If factor
    is a list or an 1D array, the length of the array must match the length
Here a list is mentioned as an accepted input, and it indeed is, because the function would coerce it into a numpy array. However, only `float` and `np.ndarray` are mentioned in the type hints.
Technically, anything that can be coerced into a 1D numpy array is acceptable, right? Maybe there is an alternative type hint that could indicate that?
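One possibility (my own suggestion, not something settled in this thread) is numpy's `ArrayLike` alias, which documents that scalars, lists, tuples, and arrays are all accepted before being coerced:

```python
import numpy as np
import xarray as xr
from numpy.typing import ArrayLike


def scale(
    data: xr.DataArray,
    factor: ArrayLike = 1.0,  # scalar, list, tuple, or ndarray all type-check
    space_unit: str | None = None,
) -> xr.DataArray:
    factor = np.asarray(factor, dtype=float)  # coerce whatever was passed
    ...  # rest of the implementation as sketched above
```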
pytest.param(
    {},
    data_array_with_dims_and_coords(nparray_0_to_23()),
    id="Do nothing",
Nicely done with the `id`s, I didn't know this was an option. Much better than adding comments (which I normally do).
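For anyone else who hasn't used them, a minimal standalone illustration of `pytest.param` ids (the test and values here are made up):

```python
import pytest


@pytest.mark.parametrize(
    "factor, expected",
    [
        pytest.param(1.0, 2.0, id="identity scaling"),
        pytest.param(2.0, 4.0, id="double"),
    ],
)
def test_scale_value(factor, expected):
    # The ids above appear in the test report, e.g.
    # test_scale_value[identity scaling] and test_scale_value[double],
    # instead of auto-generated parameter reprs.
    assert 2.0 * factor == expected
```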
assert scaled_data.attrs == expected_output.attrs


def test_scale_inverted_data():
This test should be modified if you adopt my suggestion about specifically broadcasting over the `space` dimension.
That would also be robust to re-ordering, because the `space` dimension would be identified by name (no matter in which order it comes in the array).
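A possible replacement test along those lines, assuming the space-broadcasting `scale` sketched earlier (the module path is hypothetical):

```python
import numpy as np
import xarray as xr

from movement.transforms import scale  # assumed path, as in the sketch above


def test_scale_is_independent_of_dim_order():
    """Scaling along "space" should not depend on dimension order."""
    data = xr.DataArray(
        np.arange(6, dtype=float).reshape(3, 2),
        dims=["time", "space"],
        coords={"space": ["x", "y"]},
    )
    factor = [0.5, 2.0]
    scaled = scale(data, factor=factor)
    scaled_from_transposed = scale(data.transpose("space", "time"), factor=factor)
    # Identical results once both are back in the same dimension order.
    xr.testing.assert_allclose(
        scaled, scaled_from_transposed.transpose("time", "space")
    )
```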
        ),
    ],
)
def test_scale(
excellent test, nicely done!
)


def test_scale_first_matching_axis():
This test also needs to be modified or binned, if you adopt my suggestion about specifically broadcasting over `space`.
        ),
    ],
)
def test_scale_twice(
Great test!
Description
What is this PR
Why is this PR needed?
Enable users to convert data arrays expressed in pixels to SI units (like meters). Users can scale data to a known reference size (e.g. the size of a cage) and add appropriate units. This allows distances to be represented in standard units instead of just the number of pixels.
What does this PR do?
Adds a `transforms` module with a `scale` function. The `scale` function scales data (`xarray.DataArray`) by a given factor with an optional unit (`str | None`). Units can be any string (e.g. "elephants") and will be added to `xarray.DataArray.attrs["unit"]`. Passing `None` as a unit (explicitly or by default) "dequantifies" the data (i.e. drops `.attrs["unit"]`).
I've looked at `pint-xarray`, but as it stands units are simply strings. Another option, using `pint-xarray` along the lines of what @niksirbi mentioned in #141 (i.e. something like `data.pint.quantify({'distance': 'metres', 'time': 'seconds'})`), could be to use the `.attrs['units']` entry for each data variable (see below).
Unit-aware arithmetic in Xarray, via pint
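A usage sketch of the behaviour described above (the `movement.transforms` import path and the numbers are assumptions, and the signature may still change following the review):

```python
import numpy as np
import xarray as xr

from movement.transforms import scale  # module added in this PR (assumed path)

position = xr.DataArray(
    np.arange(6, dtype=float).reshape(3, 2),
    dims=["time", "space"],
    coords={"space": ["x", "y"]},
)

# Derive the factor from a known reference size, e.g. a 0.5 m cage
# that spans 250 pixels in the video.
metres_per_pixel = 0.5 / 250

position_m = scale(position, factor=metres_per_pixel, unit="metres")
print(position_m.attrs["unit"])  # "metres"

# Passing no unit (the default) "dequantifies" the result,
# i.e. drops the "unit" attribute.
rescaled = scale(position_m, factor=10)
print("unit" in rescaled.attrs)  # False
```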
References
part of #366.
How has this PR been tested?
The scale function has been tested with various unit tests to ensure it works correctly. These tests include:
Is this a breaking change?
No.
Does this PR require an update to the documentation?
Docstrings have been added to the module and all functions. No further documentation needed.
Checklist: