Site torch dataset update #82
Conversation
Thanks @Sukh-P great to push this forward. A few quick thoughts, and sorry if these seem obvious and have already been answered
@peterdudfield thanks, I have tried to answer these:
Yes, that's still the plan. We may still make a NumpyBatch if we want a more generic intermediary format, and yes, the code will probably be added here but then called in PVNet; I can make that clearer in the TODO list above.
I think the longer-term plan is to move towards one batch format (netcdf) and a common interface to batches through a batch object. That way things will be more generalised and we will avoid having a different way of doing things each time. This will need a bit more thought and can be improved after a working pipeline for sites is added; I can create an issue/discussion around this once we have one.
I think this is handled by having a function that stacks samples into a batch, like here, together with the Torch DataLoader, where you specify how many samples go into a batch.
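For illustration, here is a minimal sketch of such a stacking function (the name and dict layout are hypothetical, not the actual helper in this repo); torch's `DataLoader` would call a function like this via its `collate_fn` argument, passing it `batch_size` samples at a time:

```python
import numpy as np

def stack_np_samples_into_batch(samples):
    """Stack a list of per-sample dicts into one batched dict (hypothetical helper)."""
    return {key: np.stack([s[key] for s in samples]) for key in samples[0]}

# Four samples, each a dict of arrays; stacking adds a leading batch dimension
samples = [{"nwp": np.full((2, 3), i, dtype=np.float32)} for i in range(4)]
batch = stack_np_samples_into_batch(samples)
print(batch["nwp"].shape)  # (4, 2, 3)
```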
@@ -197,7 +197,7 @@ def test_select_time_slice_nwp_with_dropout_and_accum(da_nwp_like, t0_str):
     t0 = pd.Timestamp(f"2024-01-02 {t0_str}")
     interval_start = pd.Timedelta(-6, "h")
     interval_end = pd.Timedelta(3, "h")
-    freq = pd.Timedelta("1H")
+    freq = pd.Timedelta("1h")
To stop a pandas deprecation warning (for the uppercase frequency alias) from popping up
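For context, recent pandas versions warn on the uppercase hour alias while both spellings still resolve to the same offset; a quick sanity check (not project code):

```python
import pandas as pd

# Lowercase "h" is the non-deprecated hour alias in recent pandas
freq = pd.Timedelta("1h")
assert freq == pd.Timedelta(hours=1)
print(freq)  # 0 days 01:00:00
```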
if "sat" in dataset_dict:
    # Satellite is already in the range [0-1] so no need to standardise
    da_sat = dataset_dict["sat"]
Could you add a TODO here for normalizing satellite data?
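As a hypothetical sketch of why satellite data is skipped here: standardisation rescales values to zero mean and unit variance, which other modalities need, whereas satellite reflectances already sit in [0, 1] (helper name and values are illustrative, not project code):

```python
import numpy as np

def standardise(values, mean, std):
    # Hypothetical helper: zero-mean / unit-variance scaling
    return (values - mean) / std

nwp = np.array([10.0, 20.0, 30.0])
scaled = standardise(nwp, nwp.mean(), nwp.std())
print(scaled.mean(), scaled.std())  # ~0.0 and ~1.0

sat = np.array([0.1, 0.5, 0.9])  # already in [0, 1]; passed through unscaled
```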
One small comment could be added, but otherwise looks great
# Merge all prepared datasets
combined_dataset = xr.merge(datasets)

return combined_dataset
Out of curiosity, because I think I've missed a conversation somewhere: am I right in thinking that eventually the GSP version will also output netcdfs, we will merge it into this function, and move convert_to_numpy_batch into the training pipeline?
Yep, exactly, I think that is the current idea. The convert_to_numpy_batch logic can still live in this repo, and we can call it in PVNet in the training pipeline.
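A minimal sketch of what that training-time conversion step could look like, assuming samples arrive as xarray Datasets (the toy variable and dict layout are illustrative, not the actual NumpyBatch schema):

```python
import numpy as np
import xarray as xr

# A toy sample as it might be reloaded from a netcdf file
sample = xr.Dataset({"site": ("time", np.arange(3, dtype=np.float32))})

# Convert each data variable to a plain numpy array for the model
numpy_batch = {name: sample[name].values for name in sample.data_vars}
print(numpy_batch["site"])  # [0. 1. 2.]
```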
sample = process_and_combine_site_sample_dict(sample_dict, self.config)
sample = sample.compute()
nice!
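For context on `.compute()`: it materialises any lazy (e.g. dask-backed) arrays into memory so downstream steps don't re-trigger reads; on already-eager data it is a cheap no-op (toy example, not project code):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"sat": ("time", np.ones(4, dtype=np.float32))})
ds = ds.compute()  # loads lazy arrays; no-op for in-memory data
print(float(ds["sat"].sum()))  # 4.0
```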
Looks super good, thanks a lot for doing all this and the tests!
Pull Request
Description
PR to update the Site Torch Dataset to return samples as xarray Datasets, for easier conversion into netcdf files, which is the currently preferred format for saving samples.
This PR includes:
TODO for Site Dataset pipeline overall (won't be in this PR):