Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YAML with a colon but no space is valid YAML, but produces a string instead of a map #524

Open
yousefmoazzam opened this issue Nov 6, 2024 · 1 comment
Labels
question Further information is requested

Comments

@yousefmoazzam
Copy link
Collaborator

yousefmoazzam commented Nov 6, 2024

The problem

In httomo pipeline files mappings are used fairly frequently, such as when defining the parameters and parameter values for a method:

- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1

From some unexpected behaviour when running httomo at a beamline, investigations have led to the discovery that a mapping in YAML must have either:

  • at least one space after the colon character (if defining a key-value pair whose value is a non-collection type, such as an int)
center: 20.5
  • a newline character (if defining a key-value pair whose value is a collection type of some sort, such as a list)
parameters:
  - item_one
  - ...

For the first case, if there is no space after the colon character, then the text following is parsed as a python str, rather than as a python dict.

This means that if a user were to miss a space after a colon character, then this can cause the YAML to be parsed differently to how it was intended, and thus cause runtime errors.

Explanation

For example, for the following YAML (note the spaces after the colon characters after start and stop):

- method: standard_tomo
  module_path: httomo.data.hdf.loaders
  parameters:
    data_path: entry1/tomo_entry/data/data
    image_key_path: entry1/tomo_entry/instrument/detector/image_key
    rotation_angles:
      data_path: /entry1/tomo_entry/data/rotation_angle
    preview:
      detector_y:
        start: 100
        stop: 120

the detector_y field and its value parses to the following python data structure (note that the value of the detector_y key in the dict is a dict):

{
  'detector_y': {'start': 100, 'stop': 120}
}

However, if the spaces after the two colon characters are omitted (the change in syntax highlighting compared to the previous example is suggestive of there being a change in meaning of the value):

- method: standard_tomo
  module_path: httomo.data.hdf.loaders
  parameters:
    data_path: entry1/tomo_entry/data/data
    image_key_path: entry1/tomo_entry/instrument/detector/image_key
    rotation_angles:
      data_path: /entry1/tomo_entry/data/rotation_angle
    preview:
      detector_y:
        start:100
        stop:120

the detector_y field and its value parses to the following python data structure (notice that the value of the detector_y key in the dict is a string):

{
  'detector_y': 'start:100 stop:120'
}

I haven't found anything that explicitly states that without a space after a colon, the value is interpreted as a string. The closest I can find is in the YAML spec here where it states that a whitespace character needs to follow a colon in order to define a mapping (but it doesn't say what happens if you omit the colon, it may be somewhere else in the YAML spec, but it's a long document to sift through...).

What can be done?

I'm not sure yet. I took at a look to see if any YAML linters would allow the catching of if a colon was missing a space after it, such as yamllint, which has some options for configuring the rules when encountering a colon character.

However, it seems like it's not possible to catch this (partly due to it being valid YAML to have a missing space after a colon), see adrienverge/yamllint#563 (comment) and the rest of that issue.

In particular, in that comment, it's suggested that conversion to JSON, followed by defining a JSON schema, could solve the problem.

With the development of the web GUI moving forward and it involving this very idea of using a JSON schema to perform validation of parameters in pipeline files, this may well be an issue that could be resolved in that larger discussion.

@yousefmoazzam yousefmoazzam added the question Further information is requested label Nov 6, 2024
@dkazanc
Copy link
Collaborator

dkazanc commented Nov 7, 2024

Thanks @yousefmoazzam for looking into this. It looks like we're too vulnerable by exposing just YAML to the users. The major issue is that we cannot pick up most of the problems before the run. To me it feels like we need to start our developing on JSON schema as soon as possible and it should be in place before HTTomo in production. Without proper parameters validation, we should expect lots of problems related to YAML errors from inexperienced users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants