Encoding PTMs parameters into one-line Experimental Design #13

ypriverol · 2019-08-15T22:18:35Z

We have continued working with the metadata experimental design.

See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format

However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet, and MaxQuant encode PTMs as string lines, which is great because we can encode the PTM variables as a string that is easy to translate into the search strings.

MSGF+ :

StaticMod=C2H3N1O1,     C,  fix, any,       Carbamidomethyl       # Fixed Carbamidomethyl C (alkylation)
StaticMod=229.1629,     *,  fix, N-term,    TMT6plex
StaticMod=229.1629,     K,  fix, any,       TMT6plex

Comet:

variable_mod1 = 15.9949 M 0 3
variable_mod2 = 0.0 X 0 3
variable_mod3 = 0.0 X 0 3
variable_mod4 = 0.0 X 0 3
variable_mod5 = 0.0 X 0 3
variable_mod6 = 0.0 X 0 3

CRUX:

C+57.02146,2M+15.9949,1STY+79.966331
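
To illustrate how engine-specific such strings are, here is a minimal Python sketch that splits the CRUX-style line above into its parts. It assumes each comma-separated item is an optional leading integer (taken here to be a per-peptide maximum), one or more residue letters, and a signed mass shift; the function name `parse_crux_mods` and the output keys are hypothetical, not part of CRUX.

```python
import re

# Assumed item layout: [max count][residue letters][signed mass shift]
CRUX_MOD = re.compile(r"^(\d*)([A-Z]+)([+-]\d+(?:\.\d+)?)$")

def parse_crux_mods(spec):
    """Split a CRUX-style modification string into a list of dicts."""
    mods = []
    for item in spec.split(","):
        match = CRUX_MOD.match(item.strip())
        if match is None:
            raise ValueError(f"unrecognised modification item: {item!r}")
        count, residues, delta = match.groups()
        mods.append({
            # No leading integer: no per-peptide limit was given
            "max_per_peptide": int(count) if count else None,
            "residues": list(residues),
            "mass_shift": float(delta),
        })
    return mods

crux_mods = parse_crux_mods("C+57.02146,2M+15.9949,1STY+79.966331")
for mod in crux_mods:
    print(mod)
```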

I think we can propose a way to encode these PTMs as a string within the metadata files.

Name ; aminoacid; type; position; UnimodAccession

Where:
Name: Name of the modification.
aminoacid: Target amino acid.
type: Fixed, Variable, Custom.
position: Any, N-term, Protein N-term.
UnimodAccession: Unimod accession.

The Unimod accession can be replaced with delta mass.
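
As a rough sketch of how a consumer might read the five-field string proposed above, the following Python snippet splits on `;` and maps the fields by position. The class name `ModificationParameter`, the function `parse_positional`, and the attribute names are hypothetical and only mirror the field order in the proposal.

```python
from dataclasses import dataclass

FIELDS = ("name", "amino_acid", "mod_type", "position", "accession")

@dataclass
class ModificationParameter:
    name: str
    amino_acid: str
    mod_type: str   # Fixed, Variable or Custom
    position: str   # Any, N-term or Protein N-term
    accession: str  # Unimod accession, or a delta mass given as text

def parse_positional(line):
    """Parse 'Name; aminoacid; type; position; UnimodAccession'."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return ModificationParameter(*parts)

example_mod = parse_positional("Carbamidomethyl; C; Fixed; Any; UNIMOD:4")
print(example_mod)
```

Because the fields are positional, a missing or extra `;` changes the meaning of every later field, which is why the sketch rejects any line that does not have exactly five parts.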

prvst · 2019-08-16T01:47:43Z

So you are basically talking about how the modifications are declared inside the parameter files, not about how they are represented inside the search results, right? In this case I think that the proposal should be more human-readable than machine-friendly. Parameter files are often shared with people who are not entirely familiar with proteomics, or are sometimes even used as proof of documentation for an analysis. A format like the one used by Comet (variable_mod1 = 15.9949 M 0 3) might be easy for software to consume, but it is nearly impossible to interpret for a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal; it makes it easier to spot errors and to differentiate isobaric PTMs.

mlocardpaulet · 2019-08-16T06:35:01Z

Veit (and others) worked on this at proteoform level. But I think you could find their strategy interesting: LeDuc, R. D., Schwämmle, V., Shortreed, M. R., Cesnik, A. J., Solntsev, S. K., Shaw, J. B., … Tsybin, Y. O. (2018). ProForma: A Standard Proteoform Notation. Journal of Proteome Research, 17(3), 1321–1325. https://doi.org/10.1021/acs.jproteome.7b00851

ypriverol · 2019-08-16T06:55:03Z

Thanks, @mlocardpaulet, for the reference. We are talking more about PTMs as search parameters. To represent PTMs in results, we have good references (as you said) with ProForma, mzTab, and others.

The problem we want to solve is that when we annotate a PRIDE or ProteomeXchange experiment, we should annotate some parameters from the search so that external tools like SearchGUI and others can reanalyze the data. This issue is about how to encode PTM search parameters.

RalfG · 2019-08-16T08:27:06Z

We often do something like this, in a JSON structure:

```json
"modifications":[
    {"name":"Glu->pyro-Glu", "unimod_accession":27, "mass_shift":-18.0153, "amino_acid":"E", "n_term":true, "fixed":false},
    {"name":"Gln->pyro-Glu", "unimod_accession":28, "mass_shift":-17.0305, "amino_acid":"Q", "n_term":true, "fixed":false},
    {"name":"Acetyl", "unimod_accession":1, "mass_shift":42.0367, "amino_acid":null, "n_term":true, "fixed":false},
    {"name":"Oxidation", "unimod_accession":35, "mass_shift":15.9994, "amino_acid":"M", "n_term":false, "fixed":false},
    {"name":"Carbamidomethyl", "unimod_accession":4, "mass_shift":57.0513, "amino_acid":"C", "n_term":false, "fixed":true}
],
```

ypriverol · 2019-08-16T08:38:09Z

Thanks @RalfG. With the current proposal

Name ; aminoacid; type; position; UnimodAccession

would it be feasible to represent your JSON? We want first to have a tab-delimited representation to align more closely with the experimental design, but in the future, yes, we will also serialize to JSON.

In my proposal, your first modification would look like this:

Glu->pyro-Glu; E; variable; N-term; UNIMOD:27

The only thing missing is the mass shift. I didn't include it because it can be retrieved from the UNIMOD accession. However, I agree we can be more explicit by including the mass shift.
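
A minimal Python sketch of that mapping, assuming the JSON field names from the comment above (`name`, `amino_acid`, `n_term`, `fixed`, `unimod_accession`). The function name `to_one_line` and the `*` placeholder for a missing amino acid are hypothetical choices, not part of either proposal.

```python
import json

def to_one_line(mod):
    """Render one JSON modification entry as 'Name; aminoacid; type; position; accession'."""
    target = mod["amino_acid"] or "*"  # hypothetical wildcard when no residue is given
    mod_type = "Fixed" if mod["fixed"] else "Variable"
    position = "N-term" if mod["n_term"] else "Any"
    return f'{mod["name"]}; {target}; {mod_type}; {position}; UNIMOD:{mod["unimod_accession"]}'

record = json.loads(
    '{"name":"Oxidation", "unimod_accession":35, "mass_shift":15.9994, '
    '"amino_acid":"M", "n_term":false, "fixed":false}'
)
one_line = to_one_line(record)
print(one_line)
```

The `mass_shift` value is dropped in this direction, which is the one piece of the JSON that the five-field string cannot carry.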

mvaudel · 2019-08-16T08:58:13Z

A few points you might want to consider:

The target can be a single amino acid or an amino acid pattern (like in glyco). This can be encoded as a simple regular expression.
The terminus can be peptide or protein.
I strongly recommend not to use the rounded mass, and rather stick to the atomic composition. I would make the atomic composition mandatory.
If you are aiming for a format like mzIdentML, generated by software for software, user friendliness is not that much of an issue, we should rather focus on ease and speed of parsing?

ypriverol · 2019-08-16T09:17:12Z

@mvaudel:

Here are my comments.

A few points you might want to consider:

The target can be a single amino acid or an amino acid pattern (like in glyco). This can be encoded as a simple regular expression.

I like this idea. We should accept the pattern. However, what is the best way to encode a pattern in a standardized way? I can see a lot of software and users writing their own pattern rules that are difficult to translate into a specific language. I found a link about how to standardize regular expressions: https://www.regular-expressions.info/refflavors.html. It is probably a good place to start.

The terminus can be peptide or protein.

I think this is really common now. If we use the Unimod definitions, the options will be:

N-term
Protein N-term
Anywhere

I strongly recommend not to use the rounded mass, and rather stick to the atomic composition. I would make the atomic composition mandatory.

We can explicitly ask for the atomic composition; however, most of the search engines and tools currently use the mass shift. In addition, if we go for the tab-delimited, user-friendly option, the mass shift is easier to obtain than the atomic composition. I really think we should not add a lot of details if the UNIMOD accession is known. If the UNIMOD accession is not known, could the composition serve as the name of the modification?

If you are aiming for a format like mzIdentML, generated by software for software, user friendliness is not that much of an issue, we should rather focus on ease and speed of parsing?

We are aiming for a tab-delimited format that is easy to produce by software but also easy for submitters to produce and read manually, and that can be enriched by our submission tools. For example, a user should be able to specify a fixed modification like this:

Glu->pyro-Glu; E; fixed;  N-term; UNIMOD:27

With SearchGUI, you should be able to pick it up from there and go on with the reanalysis.

ypriverol · 2019-08-16T09:24:26Z

Hi @prvst

So you are basically talking about how the modifications are declared inside the parameter files, not on how they are represented inside the search results, right ?

Yes, you should be able to go from here to an MSFragger parameter file and perform a reanalysis of your dataset.

In this case, I think that the proposal should be more human-readable than machine friendly.

Agreed, but we should require enough information to enable machines to enrich the files and perform the reanalysis. For example, if the UNIMOD ID is provided, we don't need to add some of the fields, like composition; see my point to @mvaudel.

Parameter files are often shared with people who are not entirely familiar with proteomics or even sometimes used as proof of documentation for analysis. A format like the one used by Comet variable_mod1 = 15.9949 M 0 3 might be easy to be consumed by a software, but quite impossible to be interpreted by a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal, it makes it easier to spot errors and to differentiate isobaric PTMs.

Agreed @prvst, this is why we are using words rather than binary 0/1 values.

ypriverol · 2019-08-16T09:35:24Z

@mvaudel about the regular expression, probably this is the standard:

http://pubs.opengroup.org/onlinepubs/9699919799/

trishorts · 2019-08-16T12:12:26Z

For MetaMorpheus, we got our start using the UniProt ptmlist and just retained that format. There is a key-value pair system that is pretty easy to interpret. There are mandatory fields and bonus fields. For example, we add diagnostic ions and neutral losses (dependent on fragmentation type). We also have a field that carries equivalent accession numbers for the same mod in different database systems.

Here is an example:

ID Phosphorylation
TG S or T
PP Anywhere.
NL HCD:H0 or HCD:H3 O4 P1
MT Common Biological
CF H1 O3 P1
DR Unimod; 21.
//

However, we use a .toml file for search settings, which is probably where the data you're mining would come from. In that file for mods, we have only PTM name and target motif. That combination is required to be unique for us.
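
For reference, a record in the ptmlist-style layout shown above can be read with a few lines of Python. This is only a sketch that assumes a two-letter key, whitespace, then the value, with `//` closing each record; `parse_ptmlist` is a hypothetical helper, not MetaMorpheus code. Values are kept as lists so that a key may appear more than once.

```python
EXAMPLE = """\
ID Phosphorylation
TG S or T
PP Anywhere.
NL HCD:H0 or HCD:H3 O4 P1
MT Common Biological
CF H1 O3 P1
DR Unimod; 21.
//
"""

def parse_ptmlist(text):
    """Return one dict per record, mapping each key to a list of its values."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # record terminator
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current.setdefault(key, []).append(value.strip())
    if current:  # tolerate a missing final terminator
        records.append(current)
    return records

ptm_records = parse_ptmlist(EXAMPLE)
print(ptm_records[0]["ID"], ptm_records[0]["TG"][0].split(" or "))
```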

ypriverol · 2019-08-16T12:16:58Z

@trishorts can you provide me with a .toml file?

trishorts · 2019-08-16T12:18:44Z

sure thing. Let me create one with interesting PTMs.

ypriverol · 2019-08-16T12:33:51Z

Following @trishorts idea of key=value pairs for each property, we can update my first proposal:

Name ; aminoacid; type; position; UnimodAccession

example:

Glu->pyro-Glu; E; fixed; N-term; UNIMOD:27

We can improve it using the key=value structure:

ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)

With this approach, we can control the key names (ID, TG, TY, TP, PP, UA, CF, ...) and extend them in the specification. This will cover @mvaudel's use case of adding the composition (CF). Also, the order of the properties does not matter because each value is identified by its key.

The downside of this approach is that it is less human-readable.
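
A minimal Python sketch of reading the key=value form, assuming pairs are separated by `;` and each pair is split at its first `=`. The helper `parse_key_value` is hypothetical, and rejecting repeated keys is an assumption of this sketch rather than something the proposal states.

```python
def parse_key_value(cell):
    """Parse 'KEY=value; KEY=value; ...' into a dict."""
    params = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")  # split at the first '=' only
        if not sep:
            raise ValueError(f"missing '=' in {pair!r}")
        key = key.strip()
        if key in params:
            raise ValueError(f"duplicate key {key!r}")
        params[key] = value.strip()
    return params

kv_mod = parse_key_value(
    "ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)"
)
print(kv_mod)
```

Because the result is keyed, `CF=H(-2)O(-1); ID=Glu->pyro-Glu` and the same pairs in the opposite order parse to the same dictionary.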

@mvaudel @trishorts @prvst @RalfG opinions welcome.

BTW @trishorts, what does TG mean?

mvaudel · 2019-08-16T12:49:25Z

Sounds really nice, having explicit labels increases readability and flexibility. Indeed neutral losses and reporter ions are needed, thanks for putting this up. Here again, we use atomic composition and never rounded mass ;)

ypriverol · 2019-08-16T12:53:49Z

@mwalzer actually highlighted that we need to define what is optional and mandatory to be able to define a modification parameter.

I think the only mandatory value would be a name ID because with the name phosphorylation we can guess most of the other fields.

mwalzer · 2019-08-16T12:56:26Z

I like the key=value idea.
So it would be that a key can occur multiple times and be interpreted as a virtual list? I don't much like the use of different separator characters.

Two potential issues that I see in general are:

how should a consumer interpret a metadata file with such PTM encoding when some keys are (because optional) missing
and how to cope with conflicting information, say for example the unimod has different positions in store as given via the encoding

mvaudel · 2019-08-16T13:00:54Z

My personal experience is - don't rely on Unimod.

trishorts · 2019-08-16T13:07:37Z

Here are the key value pairs that MetaMorpheus uses with brief explanation

AC Accession

An accession number, frequently supplied by the primary databases (e.g. UniProt and Unimod).

CF Chemical formula (required if no MM is supplied/defined)

This is the chemical formula of the added or removed atoms. This is required but the mass shift used is specified by MM. The particular isotope of the element can be specified in curly braces following the element name. For example, carbon-13 is written as C{13} in the chemical formula. The number of atoms is specified after the closing brace. Five carbon-13 atoms is written as C{13}5.

DI Diagnostic Ions

Certain PTMs (e.g. acetylation or glycosylation) produce small diagnostic fragment ions that can be detected in MS/MS spectra. These ions can serve as useful indicators of the presence of the corresponding PTM. This feature is currently disabled.

DR External database links
FT Feature key

Used in the UniProt ptmlist but not needed for custom mods in MetaMorpheus

ID Identifier (Required)

This is the text used to describe the modification in the output.

MM Monoisotopic mass (Required if CF is not supplied/defined)

The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This will override the monoisotopic mass described in the chemical formula because there are cases where the mass of the mod and the mass shift from the mod are different (e.g. trimethylation has mass of 43 but mass shift from trimethylation is 42).

MT Modification type (Required)

This specifies which modification group the modification should be included with. Existing modification types are described here. The user is free to designate their own type, which creates a separate list.

NL Neutral loss (if any)

Certain PTMs (e.g. phosphorylation) have labile modifications that can be lost during ionization. The peptide parent mass in MS1 may be seen with or without the modification. Specifying neutral loss tells MetaMorpheus to take this phenomenon into account.

PP Position of the modification in the polypeptide (Required)

Choose from the following options: Anywhere.; Peptide N-terminal.; N-terminal.; Peptide C-terminal. DON'T FORGET THE '.'

TG Target (Required)

Amino acid letter code capitalized or written out. Multiple targets separated by " or ". The capital letter 'X' may be used to mean any amino acid.
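
The required/optional rules listed above lend themselves to a small check. The Python sketch below assumes a record is a plain dict from key to value; `validate_record` is a hypothetical helper, and the set of accepted PP values is limited to the four options named in this comment.

```python
REQUIRED_KEYS = ("ID", "MT", "PP", "TG")
ALLOWED_POSITIONS = {
    "Anywhere.", "Peptide N-terminal.", "N-terminal.", "Peptide C-terminal.",
}

def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing required key {key}" for key in REQUIRED_KEYS if not record.get(key)]
    # CF is required if MM is absent, and MM is required if CF is absent
    if not record.get("CF") and not record.get("MM"):
        problems.append("either CF or MM must be supplied")
    position = record.get("PP")
    if position and position not in ALLOWED_POSITIONS:
        problems.append(f"unrecognised PP value {position!r} (check the trailing '.')")
    return problems

phospho = {"ID": "Phosphorylation", "TG": "S or T", "PP": "Anywhere.",
           "MT": "Common Biological", "CF": "H1 O3 P1"}
print(validate_record(phospho))
```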

ypriverol · 2019-08-16T13:08:05Z

@mwalzer some comments here:

I like the key=value idea.
So it would be that a key can occur multiple times and be interpreted as a virtual list? I don't much like the use of different separator characters.

I don't see how, in this particular case, we could have more than one value for a particular key. That would be a different modification.

The idea would be:

	comment [modification parameters]	comment [modification parameters]
sample 1	ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)	ID=Oxidation; TG=M
sample 2	ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)	ID=Oxidation; TG=M
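
A sketch of how a consumer could collect the modification cells from such a table, where the same column header is repeated once per modification. It assumes a tab-delimited file with the sample name in the first column; `modification_cells` is a hypothetical helper and the example text only reproduces the first row above.

```python
import csv
import io

EXAMPLE_TSV = (
    "\tcomment [modification parameters]\tcomment [modification parameters]\n"
    "sample 1\tID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)"
    "\tID=Oxidation; TG=M\n"
)

def modification_cells(tsv_text, column="comment [modification parameters]"):
    """Map each sample name to the list of cells found under the repeated column."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # The header repeats, so collect every matching column index
    indexes = [i for i, name in enumerate(header) if name.strip() == column]
    return {
        row[0]: [row[i] for i in indexes if i < len(row) and row[i]]
        for row in body if row
    }

cells = modification_cells(EXAMPLE_TSV)
print(cells["sample 1"])
```

A dict-based reader keyed by header name would silently keep only the last of the repeated columns, which is why the sketch works with column indexes instead.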

Two potential issues that I see in general are:

how should a consumer interpret a metadata file with such PTM encoding when some keys are (because optional) missing

Actually, this is a great point. The consumers of the metadata can make decisions depending on which data are missing. For example, in PRIDE we will implement a system that annotates as many of these values as possible; but if the user submits only the name, we can suggest the possible modifications in Unimod to the user.

and how to cope with conflicting information, say for example the unimod has different positions in store as given via the encoding

This is up to the consuming system or software to decide. For example, we have a library that, if a delta mass and the name of the modification are provided and they match exactly one UNIMOD modification, can suggest that modification.
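
The matching rule described here can be sketched as follows. This is not the PRIDE library itself: `KNOWN_MODS` is a tiny hand-written stand-in for a real Unimod lookup, and the function name `suggest` and the default tolerance are assumptions made for the example.

```python
# Hand-written stand-in for a Unimod lookup table (monoisotopic masses)
KNOWN_MODS = [
    {"accession": "UNIMOD:1", "name": "Acetyl", "mono_mass": 42.010565},
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "mono_mass": 57.021464},
    {"accession": "UNIMOD:21", "name": "Phospho", "mono_mass": 79.966331},
    {"accession": "UNIMOD:35", "name": "Oxidation", "mono_mass": 15.994915},
]

def suggest(name=None, delta_mass=None, tolerance=0.001):
    """Return the single entry matching the given hints, or None if 0 or >1 match."""
    candidates = KNOWN_MODS
    if name is not None:
        candidates = [m for m in candidates if name.lower() in m["name"].lower()]
    if delta_mass is not None:
        candidates = [m for m in candidates if abs(m["mono_mass"] - delta_mass) <= tolerance]
    return candidates[0] if len(candidates) == 1 else None

print(suggest(name="oxidation", delta_mass=15.9949))
```

Returning nothing when the match is ambiguous leaves the decision with the submitter instead of guessing.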

trishorts · 2019-08-16T13:10:17Z

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

ypriverol · 2019-08-16T13:16:36Z

Agree.

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

I checked before making the proposal, and ; is not included in any Interim Name in Unimod, so we are probably fine. But if the user uses the description instead, we can have conflicts (e.g. Loss of O; nitro photochemical decomposition).

mvaudel · 2019-08-16T13:20:00Z

Can we not just use quotes for all values?
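
A sketch of what quoting every value would buy, assuming the form `KEY="value"` with pairs still separated by `;`. The pattern and the helper `parse_quoted` are hypothetical, and this simple version does not handle a double quote inside a value.

```python
import re

# KEY="value": everything between the quotes belongs to the value, including ';'
QUOTED_PAIR = re.compile(r'(\w+)="([^"]*)"')

def parse_quoted(cell):
    """Parse 'KEY="value"; KEY="value"' into a dict."""
    return dict(QUOTED_PAIR.findall(cell))

quoted = parse_quoted('ID="Loss of O; nitro photochemical decomposition"; TG="M"')
print(quoted)
```

A plain split on `;` would cut the example name in two, while the quoted form keeps it intact.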

mvaudel · 2019-08-17T09:21:54Z

In addition, it should be possible to specify where the modification is attached on the motif. The format needs to specify that it is zero-based and what the default is.
e.g. motif="[ST]" target=-2 would search modifications two amino acids before any S or T, which would be equivalent to motif="XX[ST]" with a default target of 0. motif="[ST]" target=1 would look for a modification after any S or T.
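
The offset semantics above can be sketched in Python as follows. The sketch assumes `X` in a motif means any amino acid (translated to `.`), that positions are zero-based, and that sites falling outside the sequence are dropped; `target_sites` and `motif_to_regex` are hypothetical helper names.

```python
import re

def motif_to_regex(motif):
    # Assumption: 'X' stands for any residue and never appears inside [...]
    return motif.replace("X", ".")

def target_sites(sequence, motif, target=0):
    """Zero-based modified positions: start of each motif match plus the target offset."""
    # A lookahead makes overlapping motif occurrences visible
    pattern = re.compile("(?=%s)" % motif_to_regex(motif))
    sites = []
    for match in pattern.finditer(sequence):
        site = match.start() + target
        if 0 <= site < len(sequence):
            sites.append(site)
    return sites

print(target_sites("AASAT", "[ST]", target=-2))
print(target_sites("AASAT", "XX[ST]", target=0))
```

Both calls report the same positions, which is the equivalence described in the comment above.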

ypriverol · 2019-08-18T13:55:25Z

PTM site position ongoing discussion:

I will try to summarize the discussion about the PTM site parameter, which is holding up the first PR #15.

1- Target Amino acid (TA) (Proposed by @ypriverol)

TA=M

Target amino acid letter. If the modification targets multiple sites, it should be provided as a Target Regular Expression (TR).

Pros:

This is easy for manual annotation and represents the most common modifications. It can be extended, following @trishorts's proposal, with an "or" operator to represent multiple single sites, e.g. TA=S or T or Y, or with | as TA=S|T|Y.
Easy for the submitter of proteomics data to repositories.

Cons:

It is only a subset of the regular expression format (TR) proposed by @mvaudel and @RonBeavis.

2- Target Amino Acid as Regular expression (proposed by @RonBeavis @mvaudel ):

TA=N[^P][ST]

This proposal aims to represent all sites into a regular expression including motifs, etc.

Pros:

All modifications sites and configurations can be represented.

Cons:

Difficult for submitters and users to write (a possible solution would be a web page listing regular expressions for all well-known PTMs, like UNIMOD?).
Difficult to interpret by readers of the sample metadata files. In addition, it will need some agreements on validations. We will need to develop tools to validate Regular expressions.

Comments are needed here to agree on one of the options: @mvaudel @mwalzer @RalfG @RonBeavis @prvst @trishorts .

RalfG · 2019-08-18T14:03:41Z

I tend to prefer option 2, as it is more comprehensive and correct. I agree that this option is more difficult for human submitters and human readers, but a well-designed submission form should be able to take these issues away for the common modifications.

I suspect that regex validators already exist for most programming languages?

mobiusklein · 2019-08-18T15:51:51Z

Option 2 still lacks a way to express which amino acid is the actual target. In this case, the N-glycosylation motif modifies the first amino acid (N), but this isn't guaranteed to be the case. The bacterial N-glycosylation motif has a prefix as well as a suffix around the modification site: [DE][^P]N[^P][ST].

To be able to use a regular expression, we would need to either A) specify capture group index, B) use named capture groups, or C) add a marker to the regular expression to indicate that an amino acid is the target.

The glycosaminoglycan linker glycosylation process preferentially targets S[GA]X[GA] where both S and X may be modified, but X should not be modified if S is not. There's plenty of poorly understood biology here, so we don't know the constraints on X.

If we have to use a capture group, then validation is more than just compiling the regular expression; it also means testing that the pattern contains a capture group? If we want trivial cases not to require a capture group, should we check that the pattern cannot produce matches of length > 1?
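
One way such a validation could look in Python, as a sketch only: the pattern must compile and must contain exactly one capture group, unless it is a single residue or a single character class, in which case the match itself is the target. The function name `check_target_pattern` and the definition of the "trivial" case are assumptions of this example.

```python
import re

# Trivial patterns: one residue letter or one character class such as [ST] or [^P]
TRIVIAL = re.compile(r"[A-Z]|\[\^?[A-Z]+\]")

def check_target_pattern(pattern):
    """Return (is_valid, reason) for a target regular expression."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return False, f"does not compile: {exc}"
    if compiled.groups == 1:
        return True, "one capture group marks the target"
    if compiled.groups == 0:
        if TRIVIAL.fullmatch(pattern):
            return True, "single-residue pattern, the match itself is the target"
        return False, "multi-residue pattern without a capture group"
    return False, "more than one capture group"

for candidate in ["(N)[^P][ST]", "N[^P][ST]", "[ST]", "[DE][^P](N)[^P][ST]"]:
    print(candidate, check_target_pattern(candidate))
```

Named groups would work the same way: `re.compile(pattern).groupindex` exposes the group names, so a validator could require a group called, say, `target` instead of counting groups.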

ypriverol · 2019-08-18T22:25:05Z

Can we list a set of examples with the name of modifications and possible Regular expressions? @mvaudel @RonBeavis @mobiusklein @trishorts . I think it will help us to define more clearly option 2.

mobiusklein · 2019-08-19T01:22:03Z

Beyond glycosylation motifs, I do not know many that are "hard rules", and we stray into a gray area between blind combinatorial expansion rules vs. prescribed target sites from a database.

You can draw a few from PROSITE:

Phosphorylation
https://prosite.expasy.org/PDOC00004 [RK]{2}.([ST])
https://prosite.expasy.org/PDOC00005 ([ST]).[RK]
https://prosite.expasy.org/PDOC00006 ([ST])..[DE]
https://prosite.expasy.org/PDOC00007 [RK].{2,3}[DE].{2,3}(Y)

N-myristoylation
https://prosite.expasy.org/PDOC00008 (G)[^EDRKHPFYW]..[STAGCN][^P]

Amidation
https://prosite.expasy.org/PDOC00009 (.)G[RK]{2}
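
As a sketch of how patterns written this way could be applied, the Python snippet below reports the zero-based position of the captured residue for every occurrence of a motif, including overlapping ones. The helper `modified_sites` and the test sequences are made up for illustration.

```python
import re

def modified_sites(sequence, pattern):
    """Zero-based positions of the residue captured by group 1, for every motif match."""
    # Wrapping the motif in a lookahead lets overlapping occurrences be found
    scanner = re.compile("(?=%s)" % pattern)
    return sorted({match.start(1) for match in scanner.finditer(sequence)})

# Pattern from PDOC00006 above: the captured S or T is the modified residue
print(modified_sites("AASAAEA", r"([ST])..[DE]"))
# Pattern from PDOC00004 above: the target sits at the end of the motif
print(modified_sites("MKRRASVA", r"[RK]{2}.([ST])"))
```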

ypriverol · 2019-08-19T09:43:24Z

@mobiusklein :

This representation is more complex than what I was planning to represent, because it also encodes information about the enzyme. What do you think, @mvaudel @trishorts @RonBeavis?

trishorts · 2019-08-19T13:59:22Z

I don't really have any comments about how you represent motifs. I like having motifs where they are appropriate. We don't use regex unless it can't be avoided.

trishorts · 2019-08-19T14:04:41Z

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched. And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations like those that Ron has mentioned earlier that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced. So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

ypriverol · 2019-08-19T14:50:03Z

Thanks for this comment @trishorts, I think in the document I make clear what is the original intention of these efforts.

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched.

1.- THIS IS THE MAIN INTENTION. The current metadata about experimental design in public databases, including PRIDE, is really poor. This makes data reuse and reproducibility really difficult. We want to provide a tab-delimited format that enriches the data submission process in two directions:

1.1- The file format should be able to provide information about the experimental design and the sample metadata, including taxonomy, tissues, etc. We are proposing SDRF because RNA-Seq has been using the format for more than 10 years and we have thousands and thousands of well-annotated projects with no problems (including single-cell experiments). Using SDRF will enable us and the proteomics community to move towards multiomics, annotating proteomics and transcriptomics experiments in the same way.

1.2- We need to provide sufficient information about the data analysis protocol to describe how the data was processed. This "protocol" description within the SDRF is specific to each field, in our case proteomics, and we need to define some rules about how to capture it, including how to encode PTM search parameters (this issue). The next discussion should be about enzyme, fragment tolerances, TMT fragment ion masses, etc.

And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations like those that Ron has mentioned earlier that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced.

Agree.

So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

Looking at the modification parameters that most search engines and resources (MSGF+, Comet, UNIMOD) expose to users, the following properties recur: accession or name; position (anywhere, C-term and N-term, protein C-term and N-term); composition; and mass shift or monoisotopic mass.

The current PR #15 aims to define those properties first, since they are the easiest to define. In my opinion, the current definition of the amino acid target (AT) should cover only which amino acids will be modified.

AT = S,T,Y

Then, what I now call TR (target regular expression) should define more complex structures. I see now that SearchGUI (@mvaudel) uses a Pattern Design defined as Target AA and Excluded AA.

If we accept the current proposal in PR #15, we can then clearly discuss how to encode the full information of PTM parameters as regular expressions.

trishorts · 2019-08-19T15:00:24Z

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me. Thanks. BTW, I couldn't begin to construct such a REGEX.

mobiusklein · 2019-08-19T15:13:36Z

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable. Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.

Is the intent of this experimental design section to capture all modifications, or only variable modifications? Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

Repeat above for cross-linked peptide experiments?

ypriverol · 2019-08-19T16:22:20Z

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me.

Can you review the following PR #15? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

RalfG · 2019-08-19T16:36:21Z

@ypriverol:

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

If we are talking about modifications targeting the N-term NH2- or the C-term -COOH, I think N-term and C-term would be good ways to describe them. If we are talking about PTMs specifically targeting the side-chain of an N-term/C-term amino acid, I would go for ., * or any in combination with the PP (polypeptide position) key.

Mass shift-wise, this does not really matter. But I guess for "blocking" the sites in the search space, it could, in theory, make a difference.

ypriverol · 2019-08-19T16:37:29Z

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable.

OK

Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.

I will open a new issue about that, to discuss possible implementations. In the current PR #15 that definition is pending until we have a decision.

Is the intent of this experimental design section to capture all modifications, or only variable modifications?

Variable and fixed modifications defined as parameters in the search. See the definition in PR #15.

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name "Unknown modification" together with a mass shift and all possible amino acids.

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

trishorts · 2019-08-19T16:42:36Z

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me.

Can you review the following PR #15? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

I'm on board with this

trishorts · 2019-08-19T16:43:36Z

woops

RalfG · 2019-08-19T16:47:52Z

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name "Unknown modification" together with a mass shift and all possible amino acids.

For open modification search engines that search for a (very large) fixed list of modifications, this would work. But some open modifications search engines do not have an a priori list of modifications to search for. For those search engines, it would be good to include an any mass shift or open search tag in the data analysis protocol.

mobiusklein · 2019-08-19T16:56:21Z

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

Glycoproteomics search engines do not use "site specific" databases, though should the repositories become complete enough, that'd be desirable. Most of them simply put every single glycan of the appropriate type at each site just like any other variable modification. PEFF has not yet standardized how to communicate the range of glycoforms expected at a specific site, simply that a site is glycosylated.

If including just a very long list of modifications is sufficient, then this should work for glycoproteomics too, provided we have an acceptable way to encode our glycans. If that defeats the purpose of this format, then both glycoproteomics and those open modification search engines with a large database of modifications both might not have an appropriate method to be described by.

jpfeuffer · 2019-08-19T17:34:11Z

Hi @ypriverol and others:

I was wondering how one would represent mutually exclusive modifications like SILAC modifications:
Some search engines like Comet allow for a simultaneous search of such modifications (encoded in the "binary group" column of its parameters at the end of the page here).
With other search engines you might need to search multiple times with the same non-quantification modification and one of the quantification modifications in the group (and afterwards merge the results).
I could imagine either introducing another key/value pair for such a "binary group" and/or allowing multiple rows for the same Run to represent different Samples.

Anyone thought about that already?

ypriverol · 2019-08-19T17:38:35Z

@jpfeuffer Can you propose how to encode that into a key=value representation?

jpfeuffer · 2019-08-19T17:49:48Z

Maybe an optional key "BG" for every modification with integer values representing the group of modifications that should be/were searched together in a binary (all-or-none) way.
If this optional key is missing the modification is handled as usual (and considered on its own).
You could adapt the description from the Comet page in your documentation.

If the searches were performed separately e.g. with another search engine, the user can still go for multiple rows I think, so no loss of generality here.
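As a rough illustration of the optional "BG" key described above, the sketch below parses key=value modification strings and separates those sharing a binary group from those handled on their own. Only the `BG` key comes from this comment; the other keys (`NT` for name, `TA` for target amino acid, `MT` for modification type), the `;` separator and the function names are assumptions made for the example.

```python
from collections import defaultdict


def parse_modification(text):
    """Parse a 'KEY=value;KEY=value' string into a dict of fields."""
    fields = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def split_by_binary_group(mod_strings):
    """Return (groups, independent).

    groups maps each integer BG value to the (name, target) pairs that are
    searched together in an all-or-none way; independent lists the pairs
    with no BG key, which are handled as usual.
    """
    groups = defaultdict(list)
    independent = []
    for text in mod_strings:
        mod = parse_modification(text)
        entry = (mod["NT"], mod["TA"])
        if "BG" in mod:
            groups[int(mod["BG"])].append(entry)
        else:
            independent.append(entry)
    return dict(groups), independent


# Hypothetical SILAC-style example: heavy K and heavy R share group 1.
mods = [
    "NT=Oxidation;TA=M;MT=Variable",
    "NT=Label:13C(6);TA=K;MT=Variable;BG=1",
    "NT=Label:13C(6);TA=R;MT=Variable;BG=1",
]
groups, independent = split_by_binary_group(mods)
print(groups)
print(independent)
```

Because the key is optional, strings without `BG` parse exactly as before, so existing entries would not need to change.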

ypriverol · 2019-08-20T06:31:21Z

@jpfeuffer I was thinking that most search engines treat SILAC and multiplex modifications as variable modifications, and this solves the binary group problem.

ypriverol · 2020-02-18T19:06:31Z

Thanks to all for your comments. I will close this issue because we now have a proposal: https://github.com/bigbio/proteomics-metadata-standard/tree/master/experimental-design#encoding-protein-modifications

ypriverol self-assigned this Aug 15, 2019

ypriverol added the enhancement New feature or request label Aug 15, 2019

ypriverol changed the title ~~Encoding PTMs into one-line Experimental Design parameters~~ Encoding PTMs parameters into one-line Experimental Design Aug 15, 2019

ypriverol assigned prvst and mvaudel Aug 16, 2019

ypriverol pinned this issue Aug 16, 2019

ypriverol added help wanted Extra attention is needed question Further information is requested labels Aug 16, 2019

trishorts closed this as completed Aug 19, 2019

trishorts reopened this Aug 19, 2019

ypriverol mentioned this issue Aug 20, 2019

Encoding PTM sites as Regular expressions #17

Closed

mwalzer unpinned this issue Jan 13, 2020

ypriverol closed this as completed Feb 18, 2020

ypriverol pushed a commit that referenced this issue May 6, 2020

Merge pull request #13 from bigbio/master

1a2200d
