Save and load in binary, compatible with NumPy/Matlab and others #486

Open
certik opened this issue Aug 18, 2021 · 18 comments
Labels
topic: IO Common input/output related features

Comments

@certik
Member

certik commented Aug 18, 2021

First requested here.

@milancurcic
Member

This seems useful and in scope. I often work with MAT files (of various versions) from colleagues, and I use scipy.io's loadmat and savemat to read and write them.

For my own interoperable binary data between Fortran and Python, I use NetCDF. I don't think any of the language-specific binary formats will beat it in terms of features, performance, or stability. The same goes for HDF5, which is suitable for unstructured data.

@certik
Member Author

certik commented Aug 26, 2021

Both NetCDF and HDF5 are great. The only issue with HDF5 is that there is literally only one library that can read and write it, and that library is not that easy to build and ship. It is also not easy to write an HDF5 writer in pure Fortran, for example. By contrast, it is easy for the .npy NumPy array format: I've done it in the past, although I can't find the code right now. :(

So that makes me hesitant to just depend on HDF5. However, it is worth investigating what it would take to support just a very small subset of HDF5, say for writing a set of double precision arrays. It might not be that difficult to write a writer for such a small subset in pure Fortran. Here is the format: https://portal.hdfgroup.org/display/HDF5/File+Format+Specification

The huge advantage of that would be no dependency on the hdf5 library, and using a widely supported format.
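
As a rough illustration of the remark above that a .npy writer is easy in pure Fortran, here is a minimal sketch restricted to rank-2 double precision arrays. It is not the code referred to in the comment and not the stdlib implementation. The names npy_sketch, npy_header and write_npy_r64_2d are made up for this example, and the sketch assumes a little-endian machine and a header shorter than 65536 bytes (format version 1.0).

```fortran
module npy_sketch
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   private
   public :: npy_header, write_npy_r64_2d   ! hypothetical names for this sketch

contains

   !> Build everything that precedes the raw data in a version 1.0 .npy file:
   !> magic string, version, header length, and the padded header dictionary.
   function npy_header(descr, dims) result(preamble)
      character(len=*), intent(in) :: descr   ! NumPy dtype string, e.g. '<f8'
      integer, intent(in) :: dims(:)          ! array shape
      character(len=:), allocatable :: preamble
      character(len=:), allocatable :: dict
      character(len=32) :: buf
      integer :: i, npad, hlen

      dict = "{'descr': '"//descr//"', 'fortran_order': True, 'shape': ("
      do i = 1, size(dims)
         write(buf, '(i0)') dims(i)
         dict = dict//trim(buf)//","
         if (i < size(dims)) dict = dict//" "
      end do
      dict = dict//"), }"

      ! 10 bytes precede the dictionary (6 magic + 2 version + 2 length).
      ! Pad with spaces so that the preamble, including the final newline,
      ! has a total length divisible by 64.
      npad = modulo(-(10 + len(dict) + 1), 64)
      hlen = len(dict) + npad + 1

      ! Assumption: hlen < 65536, stored as a little-endian 16-bit integer.
      preamble = char(147)//"NUMPY"//char(1)//char(0)// &
                 char(iand(hlen, 255))//char(ishft(hlen, -8))// &
                 dict//repeat(" ", npad)//char(10)
   end function npy_header

   !> Write a rank-2 real64 array to filename in .npy format.
   subroutine write_npy_r64_2d(filename, array)
      character(len=*), intent(in) :: filename
      real(dp), intent(in) :: array(:, :)
      integer :: unit

      open(newunit=unit, file=filename, access="stream", form="unformatted", &
           status="replace", action="write")
      write(unit) npy_header("<f8", shape(array))
      ! Column-major bytes as stored by Fortran, matching 'fortran_order': True.
      ! Assumption: the machine is little-endian, as the '<f8' descriptor claims.
      write(unit) array
      close(unit)
   end subroutine write_npy_r64_2d

end module npy_sketch
```

After `call write_npy_r64_2d("a.npy", a)`, the file should be readable from Python with numpy.load("a.npy"). Supporting other types, kinds and ranks would need one such procedure per combination, which is where most of the real work lies.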

@Beliavsky

There is

NPY for Fortran: allows saving numerical Fortran arrays in Numpy's .npy or .npz format, by MRedies

which I have not tried.

@MarDiehl
Contributor

MarDiehl commented Sep 3, 2021

There is already an HDF5 writer/reader which looks promising: https://github.com/geospace-code/h5fortran. It uses the Fortran bindings of the C library.
I think it is reasonable to keep HDF5 support out of stdlib; it is part of neither the C nor the Python standard library.

@awvwgk awvwgk added the topic: IO Common input/output related features label Sep 18, 2021
@awvwgk
Member

awvwgk commented Nov 28, 2021

I got the basic structure for reading and writing npy files implemented in #581. It needs some polishing, especially the reading, and many more unit tests to cover all the errors that loading can encounter.

@TejasAvinashShetty

libnpy seems to be a library that provides simple routines for saving a C or Fortran array to a data file using NumPy's own binary format.
Please see https://scipy-cookbook.readthedocs.io/items/InputOutput.html

Not my idea; see CAZT's first comment on CAZT's Stack Overflow answer.

@jvdp1
Member

jvdp1 commented Dec 2, 2021

There is already an HDF5 writer/reader which looks promising: https://github.com/geospace-code/h5fortran. It uses the Fortran bindings of the C library. I think it is reasonable to keep HDF5 support out of stdlib; it is part of neither the C nor the Python standard library.

I agree with @MarDiehl. I recently used @scivision's h5fortran and found it great and really easy to use. Therefore, I also think it is reasonable to keep HDF5 support out of stdlib for the moment.

@awvwgk
Member

awvwgk commented Dec 10, 2021

How do we want to handle the npz format? It is a zip archive with npy files. Probably, we have to develop a general interface for interacting with compressed archives first.

For the mat format I found a specification of the layout (linked in the description at the top). It should be straightforward to code up. I don't think I have a MATLAB version I could use to verify it, but I could try SciPy.

@ivan-pi
Member

ivan-pi commented Dec 12, 2021

For the mat format I found a specification of the layout (linked in the description at the top). It should be straightforward to code up. I don't think I have a MATLAB version I could use to verify it, but I could try SciPy.

Was your idea to implement the reader/writer entirely in Fortran based on the PDF document, or to call into the MATLAB C API to Read MAT-File Data? The latter requires that the client has the libmat shared run-time library, located in matlabroot/bin/arch.

@awvwgk
Member

awvwgk commented Dec 12, 2021

I was reading the specs; it sounds easy enough to implement this from scratch and verify using SciPy. Unfortunately, the data can be compressed, so we need an interface to zlib or similar first.

Having the possibility to dynamically load a library with dlopen, in case the MATLAB runtime libraries are around, would be another option. However, then we would first need an interface for dynamic loading.

@arjenmarkus
Member

arjenmarkus commented Dec 13, 2021 via email

@adenchfi
Contributor

adenchfi commented Feb 9, 2022

I see stdlib has save_npy and load_npy functionality! I tested it out and it works great! I was wondering, if possible, if dim(:,:,:,:) arrays could also be supported. I only see interfaces up to rank-3.

@awvwgk
Member

awvwgk commented Feb 9, 2022

They should be supported up to the maximum rank stdlib was configured for. The docs are only generated up to rank 3 to save space; the fpm version allows up to rank 4, and the CMake version can go up to rank 15.

@adenchfi
Contributor

adenchfi commented Feb 9, 2022

Oh, thanks! I should have tested first, I took the docs too literally.

@ivan-pi
Member

ivan-pi commented May 4, 2022

While scrolling through the ARCHER2 super-computing service documentation I learned there is a BSD-licensed library for MATLAB MAT files called matio. It also has a Fortran interface (help wanted tbeu/matio#51); however, it doesn't appear to use the standard Fortran/C interoperability.

As @awvwgk has remarked above, supporting MATLAB binary files would require a zlib interface and potentially also HDF5, both of which are available as C libraries. It looks more straightforward to just have a thin Fortran wrapper around a C/C++ implementation than to write an interface/implementation for zlib (and HDF5) first.

@ivan-pi
Member

ivan-pi commented May 4, 2022

How do we want to handle the npz format? It is a zip archive with npy files. Probably, we have to develop a general interface for interacting with compressed archives first.

In the case of compressed npz files created with numpy.savez_compressed, the NumPy documentation states that zipfile.ZIP_DEFLATED is used, which requires zlib behind the scenes.

Irrespective of how we manage to do the zipping/compression (either in C or Fortran), with respect to the zipped format a big question is how to replace positional and keyword arguments in Fortran, without getting overwhelmed by the combinatorial explosion of type/kind/rank + number of saved arrays.

@ivan-pi
Member

ivan-pi commented Jul 19, 2022

Since Fortran doesn't have positional or keyword arguments in the way Python does, for .npz files it seems more natural to adopt an API similar to the one in NPY for Fortran:

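subroutine add_npz(zipfile, var_name, array)
   character(len=*), intent(in) :: zipfile    ! path of the .npz archive
   character(len=*), intent(in) :: var_name   ! name of the array inside the archive
   ! Pseudocode: stands for one specific procedure per supported type and kind
   real|complex|integer, intent(in) :: array(..)
end subroutine add_npz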

Alternatively, we could have a handle based approach:

integer :: npz_unit
real :: A(2,2)
complex :: B(3,3)

call open_npz(newunit=npz_unit,filename="foo.npz")
call stage_npz(npz_unit,A,"A")
call stage_npz(npz_unit,B,"B")
call close_npz(npz_unit)

Since Fortran uses integer units as file handles, the concept should be familiar already.
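
To make the handle-based idea a little more concrete, here is a minimal sketch of how the bookkeeping behind such calls could look. It only records metadata for each staged array and writes nothing to disk; producing the actual zip archive is left out entirely. Everything beyond the three routine names used above is an assumption made for illustration: the module name npz_handle_sketch, the fixed table sizes, the helper functions staged_count and staged_descr, and the mapping of default real and complex to the dtype strings '<f4' and '<c8' (which presumes 4-byte default reals on a little-endian machine).

```fortran
module npz_handle_sketch
   implicit none
   private
   public :: open_npz, stage_npz, close_npz, staged_count, staged_descr

   integer, parameter :: max_handles = 8, max_entries = 16   ! arbitrary limits

   ! Metadata for one array queued for the archive (the data is not kept here).
   type :: entry_t
      character(len=:), allocatable :: name
      character(len=3) :: descr = "   "
      integer, allocatable :: dims(:)
   end type entry_t

   type :: archive_t
      logical :: in_use = .false.
      character(len=:), allocatable :: filename
      integer :: n = 0
      type(entry_t) :: entries(max_entries)
   end type archive_t

   ! The integer handle is simply an index into this table.
   type(archive_t), save :: table(max_handles)

   ! One specific procedure per type/kind/rank; only two are shown.
   interface stage_npz
      module procedure stage_real_2d, stage_complex_2d
   end interface stage_npz

contains

   subroutine open_npz(newunit, filename)
      integer, intent(out) :: newunit
      character(len=*), intent(in) :: filename
      integer :: i
      newunit = -1   ! stays negative if no free slot is found
      do i = 1, max_handles
         if (.not. table(i)%in_use) then
            table(i)%in_use = .true.
            table(i)%filename = filename
            table(i)%n = 0
            newunit = i
            return
         end if
      end do
   end subroutine open_npz

   subroutine stage_real_2d(unit, array, name)
      integer, intent(in) :: unit
      real, intent(in) :: array(:, :)
      character(len=*), intent(in) :: name
      call record(unit, name, "<f4", shape(array))
   end subroutine stage_real_2d

   subroutine stage_complex_2d(unit, array, name)
      integer, intent(in) :: unit
      complex, intent(in) :: array(:, :)
      character(len=*), intent(in) :: name
      call record(unit, name, "<c8", shape(array))
   end subroutine stage_complex_2d

   logical function valid(unit)
      integer, intent(in) :: unit
      valid = .false.
      if (unit >= 1 .and. unit <= max_handles) valid = table(unit)%in_use
   end function valid

   subroutine record(unit, name, descr, dims)
      integer, intent(in) :: unit, dims(:)
      character(len=*), intent(in) :: name, descr
      integer :: k
      if (.not. valid(unit)) return
      if (table(unit)%n >= max_entries) return
      k = table(unit)%n + 1
      table(unit)%n = k
      table(unit)%entries(k)%name = name
      table(unit)%entries(k)%descr = descr
      table(unit)%entries(k)%dims = dims
   end subroutine record

   integer function staged_count(unit)
      integer, intent(in) :: unit
      staged_count = 0
      if (valid(unit)) staged_count = table(unit)%n
   end function staged_count

   function staged_descr(unit, i) result(descr)
      integer, intent(in) :: unit, i
      character(len=3) :: descr
      descr = "   "
      if (i >= 1 .and. i <= staged_count(unit)) descr = table(unit)%entries(i)%descr
   end function staged_descr

   ! A real implementation would write the archive here before freeing the slot.
   subroutine close_npz(unit)
      integer, intent(in) :: unit
      if (valid(unit)) table(unit)%in_use = .false.
   end subroutine close_npz

end module npz_handle_sketch
```

With this module, the four calls shown above compile unchanged. The generic interface keeps the caller's side free of type-specific names, but the type/kind/rank combinations still have to be generated somewhere, for example with a preprocessor.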

@ivan-pi
Member

ivan-pi commented Jul 20, 2022

The .npz format is also useful for reading SciPy sparse matrix formats (CSC, CSR, BSR, DIA, COO). See scipy.sparse.save_npz for a description. The implementation can be found here. Note that the keywords in the dictionary creation specify the array names.
