
Site torch dataset update #82

Merged: 5 commits merged into main on Dec 18, 2024
Conversation

@Sukh-P (Member) commented Nov 19, 2024

Pull Request

Description

PR to update the Site Torch Dataset to return samples as xarray Datasets for easier conversion into netCDF files, which is the currently preferred format for saving samples.

This PR includes:

  • Adding a new function to process the sample dict (a dict of xarray DataArrays) into one Dataset
  • Reordering when .compute() is called: since we now combine multiple DataArrays into a Dataset, compute can be called once after that combination (see the sketch after this list)
  • Removing unused site-specific parts from the original process-and-combine function
  • Updating unit tests now that the data type of the sample returned by the Torch Dataset has changed
  • Updating some time-interval syntax to stop a deprecation warning (unrelated to the above changes)
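
To make the combine-then-compute change concrete, here is a minimal sketch (the real function in this PR is process_and_combine_site_sample_dict; combine_sample_dict and the toy keys below are made up for illustration):

```python
import numpy as np
import xarray as xr

def combine_sample_dict(sample_dict: dict[str, xr.DataArray]) -> xr.Dataset:
    """Combine a dict of DataArrays into a single Dataset, one variable per key."""
    return xr.merge([da.to_dataset(name=key) for key, da in sample_dict.items()])

# Toy stand-ins for the real per-modality DataArrays
sample_dict = {
    "nwp": xr.DataArray(np.random.rand(4, 2, 2), dims=("step", "y", "x")),
    "site": xr.DataArray(np.random.rand(4), dims=("time_utc",)),
}

sample = combine_sample_dict(sample_dict)
# compute() is now called once on the combined Dataset rather than per DataArray
sample = sample.compute()
```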

TODO for Site Dataset pipeline overall (won't be in this PR):

  • Check this works by creating some samples and adding logic to PVNet to create samples ✅ PR here: Add sample saving for Site Dataset PVNet#290
  • Solar coordinate data is no longer saved in samples for now; the current idea is to use the numpy batch functions from this repo in PVNet to create this data when converting to a numpy batch (if that turns out to be messy, we may add logic here to add the solar position coordinates to the Dataset)
  • Add new functions to go from a Dataset to a NumpyBatch/TensorBatch (for passing to the PVNet model); a rough sketch of this step is shown after this list
  • Add logic to PVNet/ocf-data-sampler to read the netCDFs, convert them to NumpyBatches/TensorBatches, and then train a model
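
As a rough sketch of what the Dataset → NumpyBatch/TensorBatch step could look like (the function names, type aliases and key scheme below are assumptions, not the actual PVNet interface):

```python
import numpy as np
import torch
import xarray as xr

# Plain dicts stand in for NumpyBatch / TensorBatch; the real key scheme may differ
NumpyBatch = dict[str, np.ndarray]
TensorBatch = dict[str, torch.Tensor]

def dataset_to_numpy_batch(ds: xr.Dataset) -> NumpyBatch:
    """Pull each data variable of the sample Dataset out as a numpy array."""
    return {name: ds[name].values for name in ds.data_vars}

def numpy_batch_to_tensor_batch(batch: NumpyBatch) -> TensorBatch:
    """Convert a NumpyBatch into torch tensors ready to pass to the model."""
    return {key: torch.as_tensor(arr) for key, arr in batch.items()}
```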

@peterdudfield (Contributor) commented:

Thanks @Sukh-P, great to push this forward.

A few quick thoughts, and sorry if these seem obvious or have already been answered:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit here or in PVNet?
  2. For different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have a separate torch dataset for each, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about it / are going to next.
  3. This might be related to 2., but do you know where the samples-to-batch combining step fits in?

@Sukh-P (Member, Author) commented Nov 20, 2024

@peterdudfield thanks, I have tried to answer these:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit here or in PVNet?

Yes, that's still the plan, perhaps still making a NumpyBatch if we want a more generic intermediate format. And yes, the code will probably be added here but called in PVNet; I can make that clearer in the TODO list above.

  2. For different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have a separate torch dataset for each, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about it / are going to next.

I think the longer-term plan is to move towards one batch format (netCDF) and a common interface to batches through a batch object. That way things will be more generalised and we will have fewer cases where each process does things its own way. I imagine this will need a bit more thought and can be improved after a working pipeline for sites has been added; I can create an issue/discussion around this once we have that.

  3. This might be related to 2., but do you know where the samples-to-batch combining step fits in?

So I think this is handled by a function which stacks samples into a batch (like here), together with the Torch DataLoader, where you specify how many samples go into each batch; a rough sketch of that pattern is below.
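
A minimal sketch of that stacking-plus-DataLoader pattern, assuming samples are plain dicts of arrays (ToySiteDataset and stack_samples are illustrative names, not the actual ones in ocf-data-sampler):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ToySiteDataset(Dataset):
    """Stand-in for the site torch dataset; returns one sample dict per index."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, idx: int) -> dict[str, np.ndarray]:
        return {"site": np.random.rand(4), "nwp": np.random.rand(4, 2, 2)}

def stack_samples(samples: list[dict[str, np.ndarray]]) -> dict[str, torch.Tensor]:
    """Collate function: stack each key across samples to form a batch."""
    return {
        key: torch.as_tensor(np.stack([s[key] for s in samples]))
        for key in samples[0]
    }

# batch_size controls how many samples the DataLoader hands to the collate function
loader = DataLoader(ToySiteDataset(), batch_size=4, collate_fn=stack_samples)
batch = next(iter(loader))  # batch["site"].shape == torch.Size([4, 4])
```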

@Sukh-P marked this pull request as ready for review December 17, 2024 16:23
@@ -197,7 +197,7 @@ def test_select_time_slice_nwp_with_dropout_and_accum(da_nwp_like, t0_str):
     t0 = pd.Timestamp(f"2024-01-02 {t0_str}")
     interval_start = pd.Timedelta(-6, "h")
     interval_end = pd.Timedelta(3, "h")
-    freq = pd.Timedelta("1H")
+    freq = pd.Timedelta("1h")
@Sukh-P (Member, Author) commented:

To stop a deprecation warning from popping up.
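
For context (my understanding, not stated explicitly in the thread): recent pandas versions deprecate the uppercase "H" alias in favour of lowercase "h", so the lowercase form avoids the FutureWarning:

```python
import pandas as pd

# Lowercase "h" is the non-deprecated hour alias; "1H" triggers a FutureWarning on recent pandas
freq = pd.Timedelta("1h")
```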


if "sat" in dataset_dict:
# Satellite is already in the range [0-1] so no need to standardise
da_sat = dataset_dict["sat"]
Contributor:

Could you add a TODO here for normalisation of the satellite data?

Contributor:

#87

@peterdudfield (Contributor) left a comment:

One small comment could be added, but otherwise looks great

# Merge all prepared datasets
combined_dataset = xr.merge(datasets)

return combined_dataset
Contributor:

Out of curiosity, because I think I've missed a conversation somewhere: am I right in thinking that eventually the GSP version will also output netCDFs, and that we will merge it into this function and move convert_to_numpy_batch into the training pipeline?

@Sukh-P (Member, Author) commented Dec 18, 2024:

Yep, exactly, I think that is the current idea: the convert_to_numpy_batch logic can still live in this repo, and then we can call it from PVNet in the training pipeline.


sample = process_and_combine_site_sample_dict(sample_dict, self.config)
sample = sample.compute()
Contributor:

nice!

@AUdaltsova (Contributor) commented:

Looks super good, thanks a lot for doing all this and the tests!

@Sukh-P merged commit 88b310a into main on Dec 18, 2024
6 checks passed
@peterdudfield deleted the site-dataset-update branch on December 20, 2024 10:55