Ambiguity in "regular expression for PEFF description line" #28

Open
mobiusklein opened this issue Aug 15, 2018 · 5 comments

@mobiusklein

I'm attempting to implement a stricter PEFF parser in Python, but after consulting the controlled vocabulary, I don't see how to type-check annotations whose format is defined by the term "regular expression for PEFF description line":

[Term]
id: PEFF:1002001
name: regular expression for PEFF description line
def: "([0-9]+|[0-9]+|[a-zA-Z0-9]*)." [PSI:PEFF]
is_a: MS:1002479 ! regular expression

With syntax highlighting, the regex is:

/([0-9]+|[0-9]+|[a-zA-Z0-9]*)/

First, the expression translated into words seems partially redundant: "one or more digits between 0 and 9 OR one or more digits between 0 and 9 OR zero or more alphanumeric characters". The first two alternatives are identical, which seems odd. The reduced regex would be:

/([0-9]+|[a-zA-Z0-9]*)/

This reads as "one or more digits between 0 and 9 OR zero or more alphanumeric characters". This seems to suggest that each element of a |-separated tuple is implicitly matched against the regex separately, and that the positions within the tuple are not governed by the CV. Instead, that information is described in the format specification's text.
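
A minimal Python sketch of this reading, assuming the pattern is taken from the `def:` line with the trailing `.` dropped. The sample value `3|7|A` is made up for illustration and is not taken from the specification:

```python
import re

# Pattern copied from the def: line of PEFF:1002001 (trailing "." dropped).
CV_PATTERN = re.compile(r"([0-9]+|[0-9]+|[a-zA-Z0-9]*)")

# Hypothetical annotation item, chosen only for illustration.
item = "3|7|A"

# The unescaped | is alternation, so the item as a whole does not match.
full = CV_PATTERN.fullmatch(item)

# Each |-separated element, however, matches on its own.
per_element = [CV_PATTERN.fullmatch(part) is not None for part in item.split("|")]

print(full)         # None
print(per_element)  # [True, True, True]
```

Under this reading, the CV regex can only validate individual elements, not the number or order of elements in the tuple.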

Is this interpretation consistent with the intentions of the authors?

@edeutsch
Contributor

edeutsch commented Aug 28, 2018

I think what was intended was to allow something that looks like this:
(12|50|S)
i.e. (digits|digits|anylettersornumbers)
where the | characters are delimiters, not OR symbols
So maybe it needs to be this:
def: "([0-9]+\|[0-9]+\|[a-zA-Z0-9]*)." [PSI:PEFF]
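
A rough Python check of the escaped form, as a sketch only. It assumes the outer parentheses act as a regex group rather than literal characters, and `is_valid_item` is a hypothetical helper name:

```python
import re

# Escaped pattern from the proposal above; \| now matches a literal pipe.
# Assumption: the outer parentheses are a regex group, not literal characters.
ESCAPED = re.compile(r"([0-9]+\|[0-9]+\|[a-zA-Z0-9]*)")

def is_valid_item(item: str) -> bool:
    """Return True if the whole item has the digits|digits|alphanumerics shape."""
    return ESCAPED.fullmatch(item) is not None

print(is_valid_item("12|50|S"))  # True
print(is_valid_item("12|S"))     # False: only two elements
print(is_valid_item("12|50|"))   # True: the last element may be empty
```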

However, there need to be several different ones now because different terms take a different number of elements.

I will leave this issue open because this does still need to be fixed in the CV.

@mobiusklein
Author

Should I go through the current draft of the spec in the repository and aggregate the feature type by regex?

@edeutsch
Contributor

Can you clarify what you mean by "aggregate the feature type by regex"? I don't understand.

@mobiusklein
Author

I mean to go over each explicitly named header key in the specification and construct a regular expression that matches the full range of inputs described there, and then group header keys by shared regular expression.
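
Roughly, the grouping step could look like the following Python sketch. The key names and patterns in the table are placeholders, not values from the specification, and `group_keys_by_pattern` is a hypothetical helper:

```python
from collections import defaultdict

# Hypothetical key -> pattern table; both keys and patterns are placeholders.
patterns_by_key = {
    "KeyA": r"[0-9]+\|[A-Z]+",
    "KeyB": r"[0-9]+\|[0-9]+",
    "KeyC": r"[0-9]+\|[A-Z]+",
}

def group_keys_by_pattern(table):
    """Invert the table so that each distinct pattern lists the keys that share it."""
    groups = defaultdict(list)
    for key, pattern in table.items():
        groups[pattern].append(key)
    return dict(groups)

groups = group_keys_by_pattern(patterns_by_key)
for pattern, keys in groups.items():
    print(pattern, keys)
```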

@mobiusklein
Author

I hadn't realized there were so few controlled header keys. There are no duplicates. I've tested these regular expressions on each of the examples from PEFF.1.0.draft28.docx, and they appear to match the valid examples and reject the invalid ones. I used ECMAScript non-capturing group notation, as the previous HUPO PSI formats I've seen seem to use ECMAScript regular expressions.

\VariantSimple

/[0-9]+\|[A-Z]+(?:\|[a-zA-Z0-9]+)?/

\VariantComplex

/[0-9]+\|[0-9]+\|(?:[A-Z]{2,})?(?:\|[a-zA-Z0-9]+)?/

\ModResUnimod

/(?:[0-9,]+|\?)\|UNIMOD:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\ModResPsi

/(?:[0-9,]+|\?)\|MOD:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\ModRes

/(?:[0-9,]+|\?)\|[^\|]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/

\Processed

/[0-9]+\|[0-9]+\|PEFF:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?/
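
For anyone who wants to try these from Python, here is a small sketch using three of the patterns above with `re.fullmatch`, so that a partial match does not count as valid. The `(?:...)` notation is also accepted by Python's `re` module. The test values are made up for illustration and are not the examples from the draft document, and `check_item` is a hypothetical helper name:

```python
import re

# Three of the patterns listed above, compiled unchanged.
PATTERNS = {
    "VariantSimple": re.compile(r"[0-9]+\|[A-Z]+(?:\|[a-zA-Z0-9]+)?"),
    "VariantComplex": re.compile(r"[0-9]+\|[0-9]+\|(?:[A-Z]{2,})?(?:\|[a-zA-Z0-9]+)?"),
    "Processed": re.compile(r"[0-9]+\|[0-9]+\|PEFF:[0-9]+\|[^\|]+(?:\|[a-zA-Z0-9]+)?"),
}

def check_item(key: str, item: str) -> bool:
    """Return True if the entire item matches the pattern registered for key."""
    return PATTERNS[key].fullmatch(item) is not None

# Made-up items, for illustration only.
print(check_item("VariantSimple", "12|A"))        # True
print(check_item("VariantSimple", "12|A|tag1"))   # True: optional trailing tag
print(check_item("VariantSimple", "12"))          # False: residue part missing
print(check_item("VariantComplex", "5|9|"))       # True: empty replacement allowed
print(check_item("Processed", "1|20|PEFF:0001021|signal peptide"))  # True
```

Using `fullmatch` rather than `search` matters here, because an unanchored search would accept an item that merely contains a valid prefix.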
