Improve performance in cfdm.read by caching any array values retrieved from disk #313

Open
davidhassell opened this issue Jan 7, 2025 · 3 comments
Labels
performance Relating to speed and memory performance

Comments

@davidhassell
Contributor

First and last array values are cached in NetCDFRead.read._cache_data_elements, and currently this can happen multiple times per variable.

It would be straightforward to cache these values, to save going back to the dataset more often than is required.

PR to follow.

@davidhassell added the performance label Jan 7, 2025
@sadielbartholomew
Member

I'm a bit confused by your description @davidhassell, so just to check: from re-reading a bit, my interpretation is that you mean we already cache first and last values for the array of data variables, but we can do this more often than we currently do? I would appreciate a clarification, on the following PR or here. Thanks.

@davidhassell
Contributor Author

Hi Sadie - sorry - the word "cache" is doing some heavy lifting here! I'll try to re-word (although you're already correct!)

Selected array values (typically the first and last in the array) are read from disk and cached inside the Data objects of the returned constructs. A given netCDF variable might provide the data for multiple constructs, in which case the selected values get read from disk multiple times. We therefore need a new cache of these values in NetCDFRead, so that we don't go to disk more often than we need to. I imagine that this new cache will look something like:

self.read_vars['_cached_data_elements'] = {
    <netCDF variable name>: <selected array values>,  
    <netCDF variable name>: <selected array values>,  
    ...
}

Then, each time we want to get the selected array values, we get them from the NetCDFRead cache, only going to disk if they're not there.
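
A minimal sketch of that lookup, to make the idea concrete. Only `read_vars` and the `'_cached_data_elements'` key come from the description above; the class name, the method names, the `disk_reads` counter and the `{0: first, -1: last}` layout of the cached values are all assumptions made for illustration, and an in-memory dictionary stands in for the dataset on disk. This is not the actual cfdm implementation.

```python
# Hypothetical sketch: apart from read_vars and '_cached_data_elements',
# every name here is an illustrative assumption, not cfdm API.


class NetCDFReadSketch:
    """Stand-in for NetCDFRead, reduced to the caching logic."""

    def __init__(self, dataset):
        # 'dataset' maps netCDF variable names to in-memory sequences,
        # standing in for arrays stored on disk.
        self.dataset = dataset
        self.disk_reads = 0  # counts simulated disk accesses
        self.read_vars = {"_cached_data_elements": {}}

    def _read_elements_from_disk(self, ncvar):
        """Simulate reading the first and last values from disk."""
        self.disk_reads += 1
        array = self.dataset[ncvar]
        # Assumed layout: selected values keyed by their array index.
        return {0: array[0], -1: array[-1]}

    def get_data_elements(self, ncvar):
        """Return the selected values, going to disk only on a cache miss."""
        cache = self.read_vars["_cached_data_elements"]
        if ncvar not in cache:
            cache[ncvar] = self._read_elements_from_disk(ncvar)
        return cache[ncvar]


reader = NetCDFReadSketch({"time": [0.0, 1.0, 2.0], "lat": [-90.0, 90.0]})
first = reader.get_data_elements("time")   # cache miss: reads from "disk"
second = reader.get_data_elements("time")  # cache hit: no further read
```

Here two constructs that both take their data from the `time` variable would trigger a single read. Keying the cache on the netCDF variable name means it only needs to live for the duration of one read operation.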

@sadielbartholomew
Member

sadielbartholomew commented Jan 7, 2025

Ah I see - thanks for clarifying, I guess I'll be reviewing the PR eventually so it's out of more than just curiosity 🙂
