NXdata needs additional information for data to be plotted accurately #1527

Open
ggoneiESS opened this issue Jan 8, 2025 · 4 comments


ggoneiESS commented Jan 8, 2025

NXdata states that coordinate fields should have a shape that matches the data dimension(s), such that a given value, obtained at data[i][j][k], can be plotted at a point given by the other symbols. Additionally, it says:

The NXdata class is designed to encapsulate all the information required for a set of data to be plotted.

However, NXdata cannot be used to plot integrated data without making explicit assumptions. Consider a data set that has already been integrated (e.g. by a beam monitor) into bins of unequal width, recorded over 1 ms periods, with identical counts in every recorded period:

Low Independent | High Independent | Counts
___________________________________________
              0 |                1 |    10
              1 |                3 |    20
              3 |                6 |    30
              6 |               10 |    40
//        10-15 deliberately missed out
             15 |               20 |    50

One could plot each value at the low edge, the high edge, or the mid-point of its bin. The FIELDNAME_errors field could also be used to describe the range of the measurement, but that contradicts what the specification says the field should represent. Whichever point is chosen, either the first or the last bin edge is lost, along with any indication that the data were integrated. NXdata therefore cannot represent the data accurately, since the recorded data would currently have to look like:

data[[10, 20, 30, 40, 50],
     [10, 20, 30, 40, 50],
     [10, 20, 30, 40, 50],
     [10, 20, 30, 40, 50],
     [10, 20, 30, 40, 50]]
x[0, 1, 3, 6, 15] // or x[1, 3, 6, 10, 20], or x[0.5, 2, 4.5, 8, 17.5]
t[0, 1000, 2000, 3000, 4000] // or t[1000, 2000, 3000, 4000, 5000], or t[500, 1500, 2500, 3500, 4500]
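
The three candidate x coordinates can be derived directly from the bin table. The following is a minimal sketch in Python with NumPy (the variable names are illustrative and not part of NeXus) that lists which bin boundaries can no longer be recovered for each choice:

```python
import numpy as np

# Bin table from the example above (the 10-15 bin is deliberately absent).
low = np.array([0, 1, 3, 6, 15], dtype=float)
high = np.array([1, 3, 6, 10, 20], dtype=float)
counts = np.array([10, 20, 30, 40, 50])

# Third candidate coordinate: the mid-point of each bin.
mid = (low + high) / 2

# Every distinct bin boundary that appears in the table.
edges = np.union1d(low, high)

for name, x in (("low", low), ("high", high), ("mid", mid)):
    # Boundaries that are not present in this choice of x.
    lost = np.setdiff1d(edges, x)
    print(name, x.tolist(), "lost edges:", lost.tolist())
```

Each choice yields five coordinates for seven distinct boundaries, so some boundary information is always discarded.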

This could be fixed with four additional requirements:

  1. Requiring all numeric data to be continuous
  2. Specifying whether the value is a point or integrated datum
    2b. And if integrated whether the value given corresponds to the leading or trailing bin edge*
  3. Specifying whether there is an overflow bin at neither end, the first bin, the last bin, or both
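
To make the requirements above concrete, here is a rough sketch of how they might be modelled in Python. Everything in it (the class names, the enum values, the `is_contiguous` helper) is hypothetical and invented for illustration; none of it is an existing NeXus attribute or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class ValueKind(Enum):      # requirement 2: point or integrated datum
    POINT = "point"
    INTEGRATED = "integrated"


class BinEdge(Enum):        # requirement 2b: which edge the value is stored at
    LEADING = "leading"
    TRAILING = "trailing"


class Overflow(Enum):       # requirement 3: where overflow bins live
    NONE = "none"
    FIRST = "first"
    LAST = "last"
    BOTH = "both"


@dataclass(frozen=True)
class AxisSemantics:
    """Hypothetical record of how values relate to an axis."""
    kind: ValueKind
    edge: Optional[BinEdge] = None
    overflow: Overflow = Overflow.NONE

    def __post_init__(self):
        # An integrated value is only meaningful with a stated edge.
        if self.kind is ValueKind.INTEGRATED and self.edge is None:
            raise ValueError("integrated data must name a bin edge")
        if self.kind is ValueKind.POINT and self.edge is not None:
            raise ValueError("point data has no bin edge")


def is_contiguous(low: Sequence[float], high: Sequence[float]) -> bool:
    """Requirement 1: each bin must start where the previous one ended."""
    return all(h == l for h, l in zip(high, low[1:]))
```

With the table from the example, `is_contiguous([0, 1, 3, 6, 15], [1, 3, 6, 10, 20])` is false because of the omitted 10-15 bin, which is the situation the first requirement is meant to rule out.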

The data that is to be represented is equivalent to:

Low Independent | High Independent | Counts
___________________________________________
              0 |                0 |    0   // infinitesimal width bins have zero counts
              0 |                1 |    10
              1 |                3 |    20
              3 |                6 |    30   // bin A
              6 |                6 |    0   // infinitesimal width bins have zero counts BUT this bin implies a discontinuity between bin A and B
              6 |               10 |    40   // bin B
             10 |               15 |    0   // a 'missing' bin is functionally equivalent to a bin with no counts
             15 |               20 |    50
             20 |               20 |    0   // infinitesimal width bins have zero counts

which would be represented with

data[[0,  0,  0,  0,  0,  0,  0],
     [0, 10, 20, 30, 40, 0, 50],    // underflow example: [7317, 10, 20, 30, 40, 0, 50]
     [0, 10, 20, 30, 40, 0, 50],
     [0, 10, 20, 30, 40, 0, 50],
     [0, 10, 20, 30, 40, 0, 50],
     [0, 10, 20, 30, 40, 0, 50]]
x[0, 1, 3, 6, 10, 15, 20]
t[0, 1000, 2000, 3000, 4000, 5000]
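
The padded, trailing-edge layout above can be built mechanically from the original bin table. This is a sketch only, assuming NumPy and five recorded periods; the variable names are illustrative and the time axis simply reuses the values shown above.

```python
import numpy as np

# Original bin table (10-15 bin omitted) and five identical 1 ms periods.
low = [0, 1, 3, 6, 15]
high = [1, 3, 6, 10, 20]
counts = [10, 20, 30, 40, 50]
n_periods = 5

# Contiguous axes: every boundary in the table, and the period edges.
x = np.union1d(low, high)              # 0, 1, 3, 6, 10, 15, 20
t = np.arange(n_periods + 1) * 1000    # 0, 1000, ..., 5000

# Trailing-edge convention: slot i holds the bin that ENDS at x[i].
row = np.zeros(len(x), dtype=int)
row[np.searchsorted(x, high)] = counts  # the missing 10-15 bin stays 0

# The first row (t[0]) and first column (x[0]) are the zero-width bins.
data = np.zeros((len(t), len(x)), dtype=int)
data[1:, :] = row
print(data)
```

An underflow count, as in the commented example, would simply replace the zero in the first column.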

There may be use cases where it is better to assign the value to the leading edge of the bin (e.g. if there is an overflow bin at the end without underflow), and so the data could also be written as

data[[10, 20, 30, 40, 0, 50, 0],
     [10, 20, 30, 40, 0, 50, 0],
     [10, 20, 30, 40, 0, 50, 0],
     [10, 20, 30, 40, 0, 50, 0],
     [10, 20, 30, 40, 0, 50, 0],
     [0,  0,  0,  0,  0,  0,  0]]

without changing the axes.
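
A leading-edge layout can be generated from the bin table by indexing on the low edges instead of the high edges. The sketch below assumes NumPy and the seven-entry x axis and six-entry t axis given earlier; the names are illustrative only.

```python
import numpy as np

# Axes as given earlier in the issue.
x = np.array([0, 1, 3, 6, 10, 15, 20])
t = np.array([0, 1000, 2000, 3000, 4000, 5000])

# Original bin table (10-15 bin omitted).
low = [0, 1, 3, 6, 15]
counts = [10, 20, 30, 40, 50]

# Leading-edge convention: slot i holds the bin that STARTS at x[i].
row = np.zeros(len(x), dtype=int)
row[np.searchsorted(x, low)] = counts   # the missing 10-15 bin stays 0

# The last row (t[-1]) and last column (x[-1]) are the zero-width bins.
data = np.zeros((len(t), len(x)), dtype=int)
data[:-1, :] = row
print(data)
```

With these axes the array has one row per entry of t and one column per entry of x, and an overflow count would replace the zero in the last column.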

I understand that

NXdata provides data and coordinates to be plotted but does not describe how the data is to be plotted

but in this example it is vital to be able to specify that this is not point-like data: the counts per unit bin width are actually equal across the bins (10/1, 20/2, 30/3, 40/4 and 50/5). Currently, although I can see a way to manipulate the data so that bin edges are defined, there is no way to tell others who read the file what is going on, and it could easily be misinterpreted as a set of points. For similar reasons, it should be explicit that bins must be contiguous.
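
Dividing the counts by the bin widths shows why a point-like reading of this example is misleading. A short sketch, assuming NumPy and using a zero for the omitted 10-15 bin:

```python
import numpy as np

# Contiguous bin edges and the counts integrated within each bin.
edges = np.array([0, 1, 3, 6, 10, 15, 20], dtype=float)
counts = np.array([10, 20, 30, 40, 0, 50], dtype=float)  # 0 for the 10-15 bin

# Width of each bin, and counts per unit width.
widths = np.diff(edges)
density = counts / widths
print(widths.tolist())
print(density.tolist())
```

Plotted as points, the counts appear to rise steadily with x, whereas the per-unit-width values are the same in every populated bin. A histogram-style plot, for instance matplotlib's `stairs(values, edges)`, needs exactly this pair of arrays, with the edges one element longer than the values.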

*An alternative is to use the centre of the bin, which is most likely what statistical analysis requires, but this complicates the representation: in the example here it becomes convoluted to work out how wide each bin actually is, and it would require each axis to be one element longer than the data. Instead, one workaround might be to always use the centre for statistical analysis unless stated otherwise by repeating the keyword: "trailing trailing" for analysis that should both be filled at and use the trailing edge of the bin, and "trailing leading" for analysis that should be filled at the trailing edge of the bin but use the leading edge.

@ggoneiESS
Author

The edit was to the final representation of the data array

@PeterC-DLS
Contributor

For the case where all bins are present, text has been proposed in #1396 to include histogram axes that contain the bin edges - see L328. I'm not sure what a general solution for the missing/omitted bins case would be. Options include:

  • if counts are floats then use NaNs
  • if counts are integers then use 0
  • use a specific negative/zero value in all cases
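
The first two options could be sketched as follows. This assumes NumPy; `fill_value_for` and `insert_missing` are hypothetical helpers written for illustration and are not part of NeXus or any existing library. The third option would amount to returning one fixed sentinel regardless of dtype.

```python
import numpy as np


def fill_value_for(counts: np.ndarray):
    """Hypothetical helper: pick a placeholder for a missing bin by dtype."""
    if np.issubdtype(counts.dtype, np.floating):
        return np.nan   # float counts: NaN marks "no measurement"
    if np.issubdtype(counts.dtype, np.integer):
        return 0        # integer counts: no NaN available, fall back to 0
    raise TypeError(f"unsupported dtype for counts: {counts.dtype}")


def insert_missing(counts: np.ndarray, index: int) -> np.ndarray:
    """Return a copy of counts with a placeholder inserted before index."""
    return np.insert(counts, index, fill_value_for(counts))


int_counts = np.array([10, 20, 30, 40, 50])
float_counts = int_counts.astype(float)

# The omitted 10-15 bin sits between the 4th and 5th recorded bins.
padded_int = insert_missing(int_counts, 4)
padded_float = insert_missing(float_counts, 4)
print(padded_int, padded_float)
```

Note that with integer counts a placeholder of 0 cannot be told apart from a bin that was measured and found empty, which is part of why a general solution is unclear.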

@ggoneiESS
Author

Thanks Peter - I'm going to close this issue and hopefully the rest of the discussion can be had on that PR. I did look at the first couple of pages of PRs, open and closed, but the linked one was tucked away on page 3!

@ggoneiESS ggoneiESS reopened this Jan 9, 2025
@ggoneiESS ggoneiESS closed this as not planned Jan 9, 2025
@ggoneiESS ggoneiESS reopened this Jan 20, 2025
@ggoneiESS
Author

Discussion in #1396 suggests we re-open this issue.

I would propose that I submit a PR, but only after the axes definition is pushed. Any thoughts @rayosborn ?
