
Use of "most rapidly varying" #530

Closed
ChrisBarker-NOAA opened this issue Jul 8, 2024 · 17 comments · Fixed by #535
Labels
change agreed: Issue accepted for inclusion in the next version and closed
defect: Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors
Milestone

Comments

@ChrisBarker-NOAA
Contributor

Clarify use of "most rapidly varying" dimension.

In (at least) three places in CF, we refer to the "most rapidly varying" dimension (and are thinking of adding a fourth, in the cell definition, discussed in #163).

I'm enough of a computer geek to know what this means, though I'm not (wasn't) sure quite how it applied to CF.

e.g. "'most rapidly varying' index to mean the one which varies by 1 for the addresses of adjacent locations in storage, i.e. the first index in Fortran, the last in C and CDL"

If, in fact, it's always the last in CDL (and in netCDF itself), then I think this language could not only be confusing to folks less familiar with the intricacies of array storage, but also send people down the wrong track if they are, e.g., writing a file with Fortran, and might think that "most rapidly varying" means the first index, as it is in Fortran.

The three places where I found "rapidly varying":

  • in 1.5 COARDS: "...COARDS restricts the axis (equivalently dimension) ordering to be longitude, latitude, vertical, and time (with longitude being the most rapidly varying dimension)."

  • in 2.2, in the discussion of strings: "... a variable of type string with n dimensions, or as a variable of type char with n+1 dimensions where the last (most rapidly varying)..."

  • in 7.1 in cell boundaries: "The additional dimension should be the most rapidly varying one"
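
To make the 2.2 and 7.1 passages concrete, here is a minimal Python sketch, assuming row-major (CDL) storage. The station names, bounds values, and helper name are made up for illustration and are not taken from the conventions. It shows that when the extra dimension is the last one, the characters of one string, or the two bounds of one cell, occupy adjacent storage locations.

```python
# Hypothetical data: three station names stored as a char variable with
# dimensions (station, strlen) in CDL order, so strlen is the last dimension.
names = ["ABCD", "EFGH", "IJKL"]
strlen = 4

# Row-major (CDL/C) storage: walking through storage, the last index changes fastest.
flat_chars = "".join(names)

def char_at(i_station, i_char):
    """Look up one character in the flat storage from its two indices."""
    return flat_chars[i_station * strlen + i_char]

# Hypothetical latitude bounds with dimensions (lat, bnds), bnds being last.
lat_bnds = [[0.0, 10.0], [10.0, 20.0], [20.0, 30.0]]
flat_bnds = [value for cell in lat_bnds for value in cell]

# Each name, and each pair of bounds, occupies adjacent storage locations.
print(flat_chars[4:8])   # characters of the second name
print(flat_bnds[2:4])    # bounds of the second cell
```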

Now that I've written this all out -- maybe the only thing to do is adjust the text in 7.1, which is being worked on right now in #163 (PR #521).

However, maybe it would be good to put in the spec somewhere that "the most rapidly varying" dimension is always the last in a netCDF file? I'm sure that's defined in the netCDF spec itself, but having it in CF could be helpful.

NOTE: there may be other places to look at in the doc -- I only found these three by searching for "rapidly varying".

Moderator

TBA

@ChrisBarker-NOAA ChrisBarker-NOAA added the defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors label Jul 8, 2024
@davidhassell
Contributor

Hi Chris,

Thanks for describing this so clearly! Would a new entry in 1.3 terminology be sufficient? e.g. something like most rapidly varying dimension: <definition>, and then make sure we use that exact phrase elsewhere in the text:

  • in 1.5 COARDS: "...COARDS restricts the axis (equivalently dimension) ordering to be longitude, latitude, vertical, and time (with longitude being the most rapidly varying dimension)."

  • in 2.2, in the discussion of strings: "... a variable of type string with n dimensions, or as a variable of type char with n+1 dimensions where the most rapidly varying dimension..."

  • in 7.1 in cell boundaries: "The additional dimension should be the most rapidly varying dimension"

I notice that in UGRID (which is also CF, now) there is sometimes the option to specify which dimension is the most rapidly varying, e.g.

The face_dimension attribute specifies which netcdf dimension is used to indicate the index of the face in the connectivity arrays. This is needed because some applications store the data with the fastest varying index first, and some with that index last. The default is to use the num_faces as fastest dimension; e.g. a (num_faces, 3) array for triangles, but some applications might use a (3, num_faces) order, in which case the face_dimension attribute is required to help the client code disambiguate. The edge_dimension attribute is similar for the edge connectivity arrays.

If I understand correctly, given that in CDL/netCDF the most rapidly varying dimension is the last one, this description is misleading. It implies that the face index is always the slowest varying dimension, but that it can be in either position. Not so, right?
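
As a rough illustration of the disambiguation the UGRID text describes, here is a minimal Python sketch. The function name, dimension names, and node indices are hypothetical. It normalizes a connectivity array to one list of nodes per face, whichever position the face dimension occupies in CDL order.

```python
def faces_first(connectivity, dims, face_dimension):
    """Return the connectivity as a list of per-face node lists.

    connectivity: nested list whose two axes are named by `dims` (CDL order)
    dims: tuple of the two dimension names, e.g. ("nFaces", "three")
    face_dimension: name of the dimension that indexes the faces
    """
    if dims[0] == face_dimension:
        # Faces are along the first (slowest varying) dimension already.
        return [list(row) for row in connectivity]
    if dims[1] == face_dimension:
        # Faces are along the last (most rapidly varying) dimension: transpose.
        return [list(col) for col in zip(*connectivity)]
    raise ValueError(f"{face_dimension!r} is not a dimension of the variable")

# The same two triangles, written with either dimension order.
faces_by_row = [[0, 1, 2], [1, 3, 2]]      # dimensions (nFaces, three)
faces_by_col = [[0, 1], [1, 3], [2, 2]]    # dimensions (three, nFaces)

a = faces_first(faces_by_row, ("nFaces", "three"), "nFaces")
b = faces_first(faces_by_col, ("three", "nFaces"), "nFaces")
```

Both calls give the same per-face lists, which is the point of carrying the dimension name rather than assuming a position.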

@JonathanGregory
Contributor

Dear Chris and David

Thanks for addressing this issue. I agree with defining "most rapidly varying dimension" in 1.3 (David's suggestion) and I agree also with saying "last" in the text (Chris's suggestion). In addition, I suggest we should clarify "last" in the text. That is, I think we should say what we mean in more than one way, consistently each time. That's redundancy in the text, but ought to help with clarity so long as we maintain consistency. My proposal for the three cases is:

In 1.3, we could insert a definition like this:

most rapidly varying dimension: The dimension of a multidimensional variable which differs by unity (modulo dimension size) for elements that are adjacent in storage. When netCDF is represented in CDL, the most rapidly varying dimension is the last one e.g. x in float data(z,y,x). C and Python NumPy use the same order as C, also called "column-major order", but Fortran uses the opposite convention, also called "row-major order", so that when netCDF variables are accessed in Fortran the most rapidly varying dimension is the first one.

How's that?

Best wishes

Jonathan

@taylor13

lovely (and clear), from my perspective.

@ChrisBarker-NOAA
Contributor Author

This looks good to me, thanks!

One nit:

in 1.5 "...COARDS restricts the axis (equivalently dimension) ordering to be longitude, latitude, vertical, and time, with longitude being the last dimension in CDL order (the most rapidly varying dimension)."

Wouldn't that be (time, vertical, latitude, longitude) in CDL order? So it's a bit confusing to have it in the opposite order in the text. I know I follow an example before I carefully read the text! Was COARDS originally written with Fortran in mind?

@JonathanGregory
Copy link
Contributor

Those words have not changed, @ChrisBarker-NOAA, but I agree that it would be logical to put it in CDL order - good point. I don't know what software environment the authors of COARDS had in mind! Is this OK:

  • in 1.5 "...COARDS requires (time, vertical, latitude, longitude) as the CDL order for the dimensions of a variable, with longitude being the last dimension (the most rapidly varying dimension)."

Since we would not be quoting COARDS verbatim, I have rephrased it, in the hope (though not the certainty) of making it clearer.
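
For readers who want to check the (time, vertical, latitude, longitude) ordering numerically, here is a minimal Python sketch. The dimension sizes and the helper name are assumptions chosen for illustration. It computes the row-major storage offset of an element and shows that a step of one in longitude moves one storage location, while steps in the other dimensions move further.

```python
def flat_offset(index, shape):
    """Row-major (CDL order) storage offset of `index` in an array of `shape`."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

# Hypothetical sizes for (time, vertical, latitude, longitude).
shape = (2, 3, 4, 5)

# Distance in storage for a unit step along each dimension.
step_lon = flat_offset((0, 0, 0, 1), shape)
step_lat = flat_offset((0, 0, 1, 0), shape)
step_vert = flat_offset((0, 1, 0, 0), shape)
step_time = flat_offset((1, 0, 0, 0), shape)

print(step_time, step_vert, step_lat, step_lon)
```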

@ChrisBarker-NOAA
Contributor Author

Thanks! I think that's better, yes.

@davidhassell
Contributor

Hi,

This is looking good to me, thanks. A couple of questions:

  • Given that we carefully define "most rapidly varying dimension", wouldn't the phrase "the most rapidly varying dimension (the last dimension in CDL order)" be better for those three parts of the text (rather than "the last dimension in CDL order (the most rapidly varying dimension)")?

  • In the proposed definition we have "The dimension of a multidimensional variable which differs by unity (modulo dimension size) for elements that are adjacent in storage.", but I don't understand "which differs by unity (modulo dimension size)". Wouldn't "The dimension of a multidimensional variable for which elements are adjacent in storage" suffice?

@JonathanGregory
Contributor

Dear @davidhassell

  • Yes, I agree, it would be logical to exchange the positions of "most rapidly varying" and "last in CDL order". Thanks.

  • Yes, I didn't get it quite right. I meant, "The dimension of a multidimensional variable along which elements that are adjacent in storage have indices that differ by unity (modulo dimension size)." Does that make sense to you? I agree that it is the same idea as your words, "The dimension of a multidimensional variable for which elements are adjacent in storage," and maybe that would be fine. I wrote the longer (but perhaps unintelligible) version in order to avoid any vagueness about "adjacent", which can sometimes mean "close (but not necessarily contiguous)".

Best wishes

Jonathan

@ChrisBarker-NOAA
Contributor Author

I like @davidhassell's wording :-)

@JonathanGregory
Contributor

More than three weeks have passed with no further comment. I have prepared PR #535 to implement these changes, as I drafted, with the subsequent changes by @ChrisBarker-NOAA and @davidhassell. Please could someone check and merge. Thanks.

@JonathanGregory JonathanGregory added the change agreed Issue accepted for inclusion in the next version and closed label Sep 17, 2024
@JonathanGregory JonathanGregory added this to the 1.12 milestone Nov 28, 2024
@pvanlaake

I hate to be the fly in everyone's drink, but the current definition is faulty. In combination with some changes in the wording, may I propose an update to the text as follows:

most rapidly varying dimension: The dimension of a multidimensional variable for which elements are adjacent in storage. When a netCDF file is represented in CDL, the most rapidly varying dimension is the last one listed, e.g. x in float data(z,y,x). Python NumPy uses the same order as C, called "row-major order", with Fortran and R using the alternative storage mode, called "column-major order", so that when netCDF variables are accessed in Fortran or R the most rapidly varying dimension is the first one.
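
The row-major versus column-major distinction in this definition can be sketched in a few lines of plain Python. The helper name and the shape are illustrative assumptions. The function returns, for each dimension, how many elements apart in storage two neighbours along that dimension are; the dimension with stride 1 is the most rapidly varying one.

```python
def element_strides(shape, order):
    """Storage stride (in elements) of each dimension.

    order "C" is row-major (last index fastest);
    order "F" is column-major (first index fastest).
    """
    dims = range(len(shape))
    sequence = reversed(dims) if order == "C" else dims
    strides = [0] * len(shape)
    step = 1
    for d in sequence:
        strides[d] = step
        step *= shape[d]
    return tuple(strides)

shape_zyx = (2, 3, 4)  # hypothetical sizes for float data(z,y,x)

# Row-major, as in CDL, C and NumPy's default: x (last) has stride 1.
c_strides = element_strides(shape_zyx, "C")

# Column-major with the dimension list reversed to (x,y,z), which is how the
# same variable is seen from Fortran: x (now first) still has stride 1.
f_strides = element_strides(tuple(reversed(shape_zyx)), "F")
```

The storage itself is the same in both cases; only the order in which the dimensions are listed differs between the two conventions.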

@ChrisBarker-NOAA
Contributor Author

ChrisBarker-NOAA commented Jan 7, 2025

Thanks @pvanlaake : good catch!

For me and everyone else, the text proposed in #535 is:

"""
most rapidly varying dimension:: The dimension of a multidimensional variable for which elements are adjacent in storage. When netCDF is represented in CDL, the most rapidly varying dimension is the last one e.g. x in float data(z,y,x). C and Python NumPy use the same order as C, also called "column-major order", but Fortran uses the opposite convention, also called "row-major order", so that when netCDF variables are accessed in Fortran the most rapidly varying dimension is the first one.
"""

So the changes are that the "also called" wording is dropped, and that the labels "column-major" and "row-major" are swapped.

Also the text "C and Python NumPy use the same order as C" is reworded.

But I think maybe the sentence was supposed to be:

"C and Python NumPy use the same order as CDL ..."

Which I think is worth saying.

Small note:

NumPy uses C order by default, but also supports Fortran order -- though that's probably a technicality too detailed to get into here.
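
For anyone curious about that technicality, here is a small sketch using NumPy's `order` argument and the `flags` and `strides` attributes; the shape and dtype are arbitrary choices for illustration.

```python
import numpy as np

# Default layout is C order (row-major): the last index has the smallest stride.
c_arr = np.zeros((2, 3, 4), dtype=np.float32)

# Fortran order (column-major) is available on request: the first index
# has the smallest stride.
f_arr = np.zeros((2, 3, 4), dtype=np.float32, order="F")

print(c_arr.flags["C_CONTIGUOUS"], c_arr.strides)  # strides are in bytes
print(f_arr.flags["F_CONTIGUOUS"], f_arr.strides)
```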

Is it close enough to put this discussion in the PR for final editing?

@pvanlaake

pvanlaake commented Jan 8, 2025

I proposed a few more little tweaks:

  • "when netCDF is represented in CDL": I'd say "when a netCDF file is represented in CDL".
  • R: I'd appreciate it if the reference to R could be included. It is to Fortran what Python is to C (in this specific context). There are also a few netCDF-aware packages out there and there is an active user community. (I shall add that I am the developer of two such packages: CFtime (which, in its development version, already includes calendars utc and tai!) and ncdfCF, which is an early-stage package for general CF-compliance.)

On the issue of supporting either ordering scheme: the conventions dropped the COARDS requirement on dimension ordering, the implication of which is that a CF-compliant reader should be able to manage both arrangements. Both Python and R support array permutation so no issues there (for those two programming ecosystems), just so long as one analyses the dimension ordering.
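
As a hedged sketch of what "analysing the dimension ordering" might look like in Python, the snippet below permutes an array from whatever order its dimensions were stored in to the order the caller wants. The function name and dimension names are hypothetical; `np.transpose` is the only library call used.

```python
import numpy as np

def reorder(data, dims, wanted):
    """Return a view of `data` with axes permuted from `dims` to `wanted`.

    dims: dimension names of `data` in stored (CDL) order
    wanted: the same names in the order the caller prefers
    """
    axes = [dims.index(name) for name in wanted]
    return np.transpose(data, axes)

# Hypothetical variable stored as (lon, lat, time) instead of (time, lat, lon).
stored = np.arange(24).reshape(4, 3, 2)
canonical = reorder(stored, ("lon", "lat", "time"), ("time", "lat", "lon"))

print(canonical.shape)
```

The result is a view rather than a copy, so no data is moved until the caller asks for a contiguous array.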

@davidhassell
Contributor

"when netCDF is represented in CDL": I'd say "when a netCDF file is represented in CDL".

How about "when a netCDF dataset is represented in CDL"? NetCDF is really a data model rather than a file format, and doesn't have to be represented in a file, e.g. netCDF-Zarr represents the dataset as a set of nested directories.

@JonathanGregory
Contributor

Please could someone open a new defect issue and attach a new PR to it for this. It would be confusing to have the same issue listed twice in the revision history. Thanks.

@pvanlaake

Defect issue opened as #583. PR will follow as soon as any further comments and suggestions have been received.

@ChrisBarker-NOAA
Contributor Author

closing in favor of: #583.

5 participants