
Site torch dataset update #82

Merged: 5 commits merged into main on Dec 18, 2024
Conversation

@Sukh-P (Member) commented Nov 19, 2024

Pull Request

Description

PR to update the Site Torch Dataset to return samples as xarray Datasets for easier conversion into netCDF files, which is the currently preferred format for saving samples.

This PR includes:

  • Adding a new function to process the sample dict (a dict of xarray DataArrays) into one Dataset
  • Reordering when .compute() is called: since we now combine multiple DataArrays into a Dataset, compute can be called once after that combination (see the sketch after this list)
  • Removing unused site-specific parts from the original process-and-combine function
  • Updating unit tests now that the data type of the sample returned by the Torch Dataset has changed
  • Updating some time-interval syntax to stop a deprecation warning (unrelated to the above changes)
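
To make the combine-then-compute change concrete, here is a minimal sketch (the real function in this PR is process_and_combine_site_sample_dict; combine_sample_dict and the toy keys below are made up for illustration):

```python
import numpy as np
import xarray as xr

def combine_sample_dict(sample_dict: dict[str, xr.DataArray]) -> xr.Dataset:
    """Combine a dict of DataArrays into a single Dataset, one variable per key."""
    return xr.merge([da.to_dataset(name=key) for key, da in sample_dict.items()])

# Toy stand-ins for the real per-modality DataArrays
sample_dict = {
    "nwp": xr.DataArray(np.random.rand(4, 2, 2), dims=("step", "y", "x")),
    "site": xr.DataArray(np.random.rand(4), dims=("time_utc",)),
}

sample = combine_sample_dict(sample_dict)
# compute() is now called once on the combined Dataset rather than per DataArray
sample = sample.compute()
```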

TODO for Site Dataset pipeline overall (won't be in this PR):

  • Check this works by creating some samples and adding logic to PVNet to create samples ✅ PR here: Add sample saving for Site Dataset PVNet#290
  • Solar coordinate data is no longer saved in samples for now; the current idea is to use the numpy batch functions from this repo in PVNet to create this data when converting to a numpy batch (if that turns out to be messy, we may add logic here to add the solar position coordinates to the Dataset)
  • Add new functions to go from a Dataset to a NumpyBatch/TensorBatch (for passing to the PVNet model); a rough sketch of this step is shown after this list
  • Add logic to PVNet/ocf-data-sampler to read the netCDFs, convert them to NumpyBatches/TensorBatches, and then train a model
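
As a rough sketch of what the Dataset → NumpyBatch/TensorBatch step could look like (the function names, type aliases and key scheme below are assumptions, not the actual PVNet interface):

```python
import numpy as np
import torch
import xarray as xr

# Plain dicts stand in for NumpyBatch / TensorBatch; the real key scheme may differ
NumpyBatch = dict[str, np.ndarray]
TensorBatch = dict[str, torch.Tensor]

def dataset_to_numpy_batch(ds: xr.Dataset) -> NumpyBatch:
    """Pull each data variable of the sample Dataset out as a numpy array."""
    return {name: ds[name].values for name in ds.data_vars}

def numpy_batch_to_tensor_batch(batch: NumpyBatch) -> TensorBatch:
    """Convert a NumpyBatch into torch tensors ready to pass to the model."""
    return {key: torch.as_tensor(arr) for key, arr in batch.items()}
```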

@peterdudfield (Contributor) commented:

Thanks @Sukh-P, great to push this forward.

A few quick thoughts, and sorry if these seem obvious or have already been answered:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit here or in PVNet?
  2. For different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have a separate torch dataset for each, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about it / are going to next.
  3. This might be related to 2., but do you know where the samples-to-batch combining step fits in?

@Sukh-P (Member, Author) commented Nov 20, 2024

@peterdudfield thanks, I have tried to answer these:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit here or in PVNet?

Yes, that's still the plan, perhaps still making a NumpyBatch if we want a more generic intermediate format. And yes, the code will probably be added here but called in PVNet; I can make that clearer in the TODO list above.

  2. For different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have a separate torch dataset for each, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about it / are going to next.

I think the longer-term plan is to move towards one batch format (netCDF) and a common interface to batches through a batch object. That way things will be more generalised and we will have fewer cases where each process does things its own way. I imagine this will need a bit more thought and can be improved after a working pipeline for sites has been added; I can create an issue/discussion around this once we have that.

  3. This might be related to 2., but do you know where the samples-to-batch combining step fits in?

So I think this is handled by a function which stacks samples into a batch (like here), together with the Torch DataLoader, where you specify how many samples go into each batch; a rough sketch of that pattern is below.
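
A minimal sketch of that stacking-plus-DataLoader pattern, assuming samples are plain dicts of arrays (ToySiteDataset and stack_samples are illustrative names, not the actual ones in ocf-data-sampler):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ToySiteDataset(Dataset):
    """Stand-in for the site torch dataset; returns one sample dict per index."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, idx: int) -> dict[str, np.ndarray]:
        return {"site": np.random.rand(4), "nwp": np.random.rand(4, 2, 2)}

def stack_samples(samples: list[dict[str, np.ndarray]]) -> dict[str, torch.Tensor]:
    """Collate function: stack each key across samples to form a batch."""
    return {
        key: torch.as_tensor(np.stack([s[key] for s in samples]))
        for key in samples[0]
    }

# batch_size controls how many samples the DataLoader hands to the collate function
loader = DataLoader(ToySiteDataset(), batch_size=4, collate_fn=stack_samples)
batch = next(iter(loader))  # batch["site"].shape == torch.Size([4, 4])
```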

@Sukh-P marked this pull request as ready for review December 17, 2024 16:23
@@ -197,7 +197,7 @@ def test_select_time_slice_nwp_with_dropout_and_accum(da_nwp_like, t0_str):
     t0 = pd.Timestamp(f"2024-01-02 {t0_str}")
     interval_start = pd.Timedelta(-6, "h")
     interval_end = pd.Timedelta(3, "h")
-    freq = pd.Timedelta("1H")
+    freq = pd.Timedelta("1h")
@Sukh-P (Member, Author) commented:

To stop a deprecation warning from popping up.
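
For context (my understanding, not stated explicitly in the thread): recent pandas versions deprecate the uppercase "H" alias in favour of lowercase "h", so the lowercase form avoids the FutureWarning:

```python
import pandas as pd

# Lowercase "h" is the non-deprecated hour alias; "1H" triggers a FutureWarning on recent pandas
freq = pd.Timedelta("1h")
```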


if "sat" in dataset_dict:
# Satellite is already in the range [0-1] so no need to standardise
da_sat = dataset_dict["sat"]
Contributor:

Could you add a TODO here for normalisation of the satellite data?

Contributor:

#87

@peterdudfield (Contributor) left a comment:

One small comment could be added, but otherwise looks great

# Merge all prepared datasets
combined_dataset = xr.merge(datasets)

return combined_dataset
Contributor:

Out of curiosity, because I think I've missed a conversation somewhere: am I right in thinking that eventually the GSP version will also output netCDFs, and that we will merge it into this function and move convert_to_numpy_batch into the training pipeline?

@Sukh-P (Member, Author) commented Dec 18, 2024:

Yep, exactly, I think that is the current idea: the convert_to_numpy_batch logic can still live in this repo, and then we can call it from PVNet in the training pipeline.


sample = process_and_combine_site_sample_dict(sample_dict, self.config)
sample = sample.compute()
Contributor:

nice!

@AUdaltsova (Contributor) commented:

Looks super good, thanks a lot for doing all this and the tests!

@Sukh-P merged commit 88b310a into main on Dec 18, 2024
6 checks passed
@peterdudfield deleted the site-dataset-update branch on December 20, 2024 10:55