Ambiguity in "regular expression for PEFF description line" #28

mobiusklein · 2018-08-15T16:57:31Z

I'm attempting to implement a stricter PEFF parser in Python, but after consulting the controlled vocabulary, I don't see how to type-check annotations whose format is defined by the term "regular expression for PEFF description line":

[Term]
id: PEFF:1002001
name: regular expression for PEFF description line
def: "([0-9]+|[0-9]+|[a-zA-Z0-9]*)." [PSI:PEFF]
is_a: MS:1002479 ! regular expression

With syntax highlighting, the regex is:

/([0-9]+|[0-9]+|[a-zA-Z0-9]*)/

First, the expression translated into words seems partially redundant: "one or more digits between 0 and 9, OR one or more digits between 0 and 9, OR zero or more alphanumeric characters". The first two alternatives are identical, which seems odd. The reduced regex would be:

/([0-9]+|[a-zA-Z0-9]*)/

This reads as "one or more digits between 0 and 9, OR zero or more alphanumeric characters". This seems to suggest that each element of a |-separated tuple is implicitly matched separately, and that the meaning of each position in the tuple is not governed by the CV. That information is described in the format specification's text.

Is this interpretation consistent with the intentions of the authors?
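To make the reading above concrete, here is a minimal Python sketch. It assumes the trailing period in the CV `def` is punctuation rather than part of the pattern, and the value `123|456|ABC` is a hypothetical example, not one taken from the specification. It shows that the unescaped `|` acts as alternation, so the pattern cannot match a whole |-separated value but does match each element on its own, and that dropping the duplicated alternative does not change the result.

```python
import re

# The CV definition as written (assumption: the trailing "." is not part of the regex).
# An unescaped `|` is alternation, not a literal pipe character.
cv_pattern = re.compile(r"([0-9]+|[0-9]+|[a-zA-Z0-9]*)")

# The same pattern with the duplicated alternative removed.
reduced_pattern = re.compile(r"([0-9]+|[a-zA-Z0-9]*)")

# Hypothetical annotation value, for illustration only.
value = "123|456|ABC"
elements = value.split("|")

# The whole value never matches, because no alternative can consume a literal "|".
whole_value_matches = cv_pattern.fullmatch(value) is not None

# Each element matches when it is checked separately.
each_element_matches = all(cv_pattern.fullmatch(e) is not None for e in elements)

# The reduced pattern accepts and rejects the same inputs as the original.
same_behaviour = all(
    (cv_pattern.fullmatch(s) is None) == (reduced_pattern.fullmatch(s) is None)
    for s in elements + [value, ""]
)

print(whole_value_matches, each_element_matches, same_behaviour)
```

Under these assumptions, the pattern is only usable as a per-element check, which is the interpretation being asked about.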

edeutsch · 2018-08-28T03:25:28Z

However, there need to be several different ones now because different terms take a different number of elements.

I will leave this issue open because this does still need to be fixed in the CV.

mobiusklein · 2018-08-28T16:19:17Z

Should I go through the current draft of the spec in the repository and aggregate the feature type by regex?

edeutsch · 2018-08-28T16:32:00Z

Can you clarify what you mean by "aggregate the feature type by regex"? I don't follow.

mobiusklein · 2018-08-28T21:11:27Z

I mean to go over each explicitly named header key in the specification and construct a regular expression that matches the full range of inputs described there, and then group header keys by shared regular expression.
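The grouping step described here could be sketched roughly as follows. The key names `KeyA`, `KeyB`, `KeyC` and their patterns are hypothetical placeholders, not values from the specification; the point is only to show header keys being collected under the regular expression they share.

```python
from collections import defaultdict

# Hypothetical header keys and pattern strings, used only to illustrate the grouping.
patterns_by_key = {
    "KeyA": r"[0-9]+\|[A-Z]+",
    "KeyB": r"[0-9]+\|[0-9]+",
    "KeyC": r"[0-9]+\|[A-Z]+",
}


def group_keys_by_pattern(patterns):
    """Map each distinct pattern string to the list of header keys that use it."""
    groups = defaultdict(list)
    for key, pattern in patterns.items():
        groups[pattern].append(key)
    return dict(groups)


groups = group_keys_by_pattern(patterns_by_key)
for pattern, keys in groups.items():
    print(pattern, keys)
```

Any pattern whose list holds more than one key would be a candidate for a single shared CV term.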

mobiusklein · 2018-08-28T21:35:12Z

I hadn't realized there were so few controlled header keys. No two of them share a regular expression. I've tested these regular expressions on each of the examples from PEFF.1.0.draft28.docx, and they appear to match the valid examples and reject the invalid ones. I used ECMAScript non-capturing group notation, as the previous HUPO PSI formats I've seen seem to use ECMAScript regular expressions.

\VariantSimple

/[0-9]+\|[A-Z]+(?:\|[a-zA-Z0-9]+)?/

\VariantComplex

/[0-9]+\|[0-9]+\|(?:[A-Z]{2,})?(?:\|[a-zA-Z0-9]+)?/

\ModResUnimod

/(?:[0-9,]+|\?)\|UNIMOD:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\ModResPsi

/(?:[0-9,]+|\?)\|MOD:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\ModRes

/(?:[0-9,]+|\?)\|[^\|]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\Processed

/[0-9]+\|[0-9]+\|PEFF:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/
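As a rough illustration of how a Python parser might apply these patterns, here is a sketch that translates them to the `re` module and checks a single annotation item against the pattern for its header key. The names `HEADER_KEY_PATTERNS` and `is_valid_item` are hypothetical, and the sample items are made up for illustration rather than copied from the specification. One assumption to note: the position field of the ModRes keys is written as `(?:[0-9,]+|\?)`, so that the alternation between a position list and `?` covers only that field rather than splitting the whole pattern in two.

```python
import re

# Optional trailing tag shared by every pattern above.
OPTIONAL_TAG = r"(?:\|[a-zA-Z0-9]+)?"
# Assumption: a position list or "?" with the alternation confined to this field.
POSITION = r"(?:[0-9,]+|\?)"

# Hypothetical lookup table: header key -> compiled pattern for one item.
HEADER_KEY_PATTERNS = {
    "VariantSimple": re.compile(r"[0-9]+\|[A-Z]+" + OPTIONAL_TAG),
    "VariantComplex": re.compile(r"[0-9]+\|[0-9]+\|(?:[A-Z]{2,})?" + OPTIONAL_TAG),
    "ModResUnimod": re.compile(POSITION + r"\|UNIMOD:[0-9]+\|[^|]+" + OPTIONAL_TAG),
    "ModResPsi": re.compile(POSITION + r"\|MOD:[0-9]+\|[^|]+" + OPTIONAL_TAG),
    "ModRes": re.compile(POSITION + r"\|[^|]+\|[^|]+" + OPTIONAL_TAG),
    "Processed": re.compile(r"[0-9]+\|[0-9]+\|PEFF:[0-9]+\|[^|]+" + OPTIONAL_TAG),
}


def is_valid_item(header_key, item):
    """Return True if `item` matches the whole pattern registered for `header_key`."""
    pattern = HEADER_KEY_PATTERNS.get(header_key)
    if pattern is None:
        raise KeyError(f"no pattern registered for header key {header_key!r}")
    # fullmatch anchors both ends, so trailing junk is rejected.
    return pattern.fullmatch(item) is not None


# Made-up sample items, for illustration only.
print(is_valid_item("VariantSimple", "123|A"))
print(is_valid_item("ModResUnimod", "?|UNIMOD:35|Oxidation"))
print(is_valid_item("ModResPsi", "4|UNIMOD:35|Oxidation"))
```

Using `fullmatch` rather than `search` matters here: with an unanchored search, a pattern can succeed on a fragment of an otherwise malformed item.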
