Encoding PTMs parameters into one-line Experimental Design #13

Closed
ypriverol opened this issue Aug 15, 2019 · 58 comments
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested Specification Specification issues related with PRIDE formats, API, etc


@ypriverol
Member

ypriverol commented Aug 15, 2019

@hbarsnes @mvaudel @StSchulze

We have continued working with the metadata experimental design.

See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format

However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet, and MaxQuant encode PTMs as single-line strings, which is great, because we can encode PTM parameters as a string that will be easy to translate into the search strings.

MSGF+ :

StaticMod=C2H3N1O1,     C,  fix, any,       Carbamidomethyl       # Fixed Carbamidomethyl C (alkylation)
StaticMod=229.1629,     *,  fix, N-term,    TMT6plex
StaticMod=229.1629,     K,  fix, any,       TMT6plex

Comet:

variable_mod1 = 15.9949 M 0 3
variable_mod2 = 0.0 X 0 3
variable_mod3 = 0.0 X 0 3
variable_mod4 = 0.0 X 0 3
variable_mod5 = 0.0 X 0 3
variable_mod6 = 0.0 X 0 3

CRUX:

C+57.02146,2M+15.9949,1STY+79.966331

I think we can propose a way to encode these PTMs as strings within the metadata files.

Name ; aminoacid; type; position; UnimodAccession

Where:
Name: name of the modification.
aminoacid: the amino acid targeted by the modification.
type: Fixed, Variable, Custom
position: Any, N-term, Protein N-term
UnimodAccession: Unimod accession

The Unimod accession can be replaced with delta mass.
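
As a rough illustration of how a consumer might read the five-field line proposed above, here is a minimal Python sketch. The function name `parse_positional_mod` and the dictionary keys are hypothetical and not part of any agreed specification.

```python
# Field order follows the proposal: Name ; aminoacid; type; position; UnimodAccession
FIELDS = ["name", "amino_acid", "type", "position", "unimod_accession"]

def parse_positional_mod(line):
    """Split one semicolon-separated modification line into named fields."""
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))

mod = parse_positional_mod("Carbamidomethyl; C; Fixed; Any; UNIMOD:4")
print(mod)
```

Because the fields are positional, a missing or extra `;` changes the meaning of every later field, which is why the sketch rejects lines with the wrong field count.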

@ypriverol ypriverol self-assigned this Aug 15, 2019
@ypriverol ypriverol added the enhancement New feature or request label Aug 15, 2019
@ypriverol ypriverol changed the title Encoding PTMs into one-line Experimental Design parameters Encoding PTMs parameters into one-line Experimental Design Aug 15, 2019
@prvst

prvst commented Aug 16, 2019

So you are basically talking about how the modifications are declared inside the parameter files, not about how they are represented inside the search results, right? In this case, I think the proposal should be more human-readable than machine-friendly. Parameter files are often shared with people who are not entirely familiar with proteomics, and are sometimes even used as documentation of an analysis. A format like the one used by Comet, variable_mod1 = 15.9949 M 0 3, might be easy for software to consume, but is nearly impossible to interpret for a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal; it makes it easier to spot errors and to differentiate isobaric PTMs.

@mlocardpaulet
Collaborator

Veit (and others) worked on this at proteoform level. But I think you could find their strategy interesting: LeDuc, R. D., Schwämmle, V., Shortreed, M. R., Cesnik, A. J., Solntsev, S. K., Shaw, J. B., … Tsybin, Y. O. (2018). ProForma: A Standard Proteoform Notation. Journal of Proteome Research, 17(3), 1321–1325. https://doi.org/10.1021/acs.jproteome.7b00851

@ypriverol
Member Author

Thanks, @mlocardpaulet, for the reference. We are talking more about PTMs as search parameters. To represent PTMs in results, we already have good references (as you said), such as ProForma, mzTab, and others.

The problem we want to solve is that when we annotate a PRIDE or ProteomeXchange experiment, we should annotate some of the search parameters so that external tools like SearchGUI and others can reanalyze the data. This issue is about how to encode PTMs as search parameters.

@RalfG
Collaborator

RalfG commented Aug 16, 2019

We often do something like this, in a JSON structure:

    "modifications":[
        {"name":"Glu->pyro-Glu", "unimod_accession":27, "mass_shift":-18.0153, "amino_acid":"E", "n_term":true, "fixed":false},
        {"name":"Gln->pyro-Glu", "unimod_accession":28, "mass_shift":-17.0305, "amino_acid":"Q", "n_term":true, "fixed":false},
        {"name":"Acetyl", "unimod_accession":1, "mass_shift":42.0367, "amino_acid":null, "n_term":true, "fixed":false},
        {"name":"Oxidation", "unimod_accession":35, "mass_shift":15.9994, "amino_acid":"M", "n_term":false, "fixed":false},
        {"name":"Carbamidomethyl", "unimod_accession":4, "mass_shift":57.0513, "amino_acid":"C", "n_term":false, "fixed":true}
    ],

@ypriverol
Member Author

Thanks @RalfG. Would it be feasible to represent your JSON in the current proposal?

Name ; aminoacid; type; position; UnimodAccession

We first want a tab-delimited representation that aligns with the experimental design, but in the future, YES, we will also serialize to JSON.

In my proposal your first modification will be like:

Glu->pyro-Glu; E; fixed;  N-term; UNIMOD:27

The only thing missing is the mass shift, I didn't include it because it can be retrieved from the UNIMOD accession. However, I agree we can be more explicit using the mass shift.
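
To make the mapping concrete, here is a minimal sketch that turns one entry of the JSON structure above into the one-line form. The function name `to_one_line`, the use of `*` for a missing amino acid, and the choice of `Any` for non-terminal positions are assumptions for illustration only.

```python
import json

def to_one_line(mod):
    """Render a JSON modification entry as 'Name; aminoacid; type; position; UnimodAccession'."""
    position = "N-term" if mod["n_term"] else "Any"
    mod_type = "fixed" if mod["fixed"] else "variable"
    # A null amino acid (no specific residue) is written as '*' here; this is an assumption.
    residue = mod["amino_acid"] or "*"
    return f'{mod["name"]}; {residue}; {mod_type}; {position}; UNIMOD:{mod["unimod_accession"]}'

record = json.loads(
    '{"name":"Carbamidomethyl", "unimod_accession":4, "mass_shift":57.0513, '
    '"amino_acid":"C", "n_term":false, "fixed":true}'
)
line = to_one_line(record)
print(line)  # Carbamidomethyl; C; fixed; Any; UNIMOD:4
```

The mass shift is dropped in this direction, so the round trip back to JSON would need a lookup by accession.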

@mvaudel
Collaborator

mvaudel commented Aug 16, 2019

A few points you might want to consider:

  • The target can be a single amino acid or an amino acid pattern (like in glyco). This can be encoded as a simple regular expression.
  • The terminus can be peptide or protein.
  • I strongly recommend not to use the rounded mass, and rather stick to the atomic composition. I would make the atomic composition mandatory.
  • If you are aiming for a format like mzIdentML, generated by software for software, user friendliness is not that much of an issue, we should rather focus on ease and speed of parsing?

@ypriverol
Member Author

@mvaudel:

Here are my comments.

A few points you might want to consider:

  • The target can be a single amino acid or an amino acid pattern (like in glyco). This can be encoded as a simple regular expression.

I like this idea. We should accept the pattern. However, what is the best way to encode a pattern in a standardized way? I can see a lot of software and users writing their own pattern rules that are difficult to translate into a specific language. I found a page comparing regular expression flavors, https://www.regular-expressions.info/refflavors.html, which is probably a good place to start.

  • The terminus can be peptide or protein.

I think this is really common now. If we use the Unimod definitions, they will be:

  • N-term
  • Protein N-term
  • Anywhere

  • I strongly recommend not to use the rounded mass, and rather stick to the atomic composition. I would make the atomic composition mandatory.

We can explicitly ask for the atomic composition; however, MOST search engines and tools currently use the mass shift. In addition, if we go for the tab-delimited, user-friendly option, the mass shift is easier to get than the atomic composition. I really think we should not add a lot of detail if the UNIMOD accession is known. If the UNIMOD accession is not known, then could the composition be the name of the modification?

  • If you are aiming for a format like mzIdentML, generated by software for software, user friendliness is not that much of an issue, we should rather focus on ease and speed of parsing?

We are aiming for a tab-delimited format that is easy to produce by software, but also easy for submitters to produce and read manually, and that can be enriched by our submission tools. For example, a user should be able to specify a fixed modification like this:

Glu->pyro-Glu; E; fixed;  N-term; UNIMOD:27

With SearchGUI, you should be able to pick it up from there and go on with the reanalysis.
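
As a sketch of what "picking it up from there" could look like, the snippet below converts a one-line fixed modification into the MSGF+ `StaticMod=` layout quoted at the top of this thread. The one-entry composition lookup and the position mapping are hypothetical stand-ins for a real Unimod lookup, and only fixed modifications are handled.

```python
# Illustrative one-entry lookup; a real tool would resolve the accession against Unimod.
COMPOSITION_BY_ACCESSION = {"UNIMOD:4": "C2H3N1O1"}
# Assumed mapping from the proposal's position terms to the MSGF+ spelling.
POSITION_TO_MSGF = {"any": "any", "n-term": "N-term"}

def to_msgfplus_static(line):
    """Convert 'Name; aminoacid; type; position; UnimodAccession' to a StaticMod line."""
    name, residue, mod_type, position, accession = [p.strip() for p in line.split(";")]
    if mod_type.lower() != "fixed":
        raise ValueError("this sketch only handles fixed modifications")
    composition = COMPOSITION_BY_ACCESSION[accession]
    msgf_position = POSITION_TO_MSGF[position.lower()]
    return f"StaticMod={composition}, {residue}, fix, {msgf_position}, {name}"

param_line = to_msgfplus_static("Carbamidomethyl; C; fixed; Any; UNIMOD:4")
print(param_line)  # StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl
```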

@ypriverol
Member Author

Hi @prvst

So you are basically talking about how the modifications are declared inside the parameter files, not on how they are represented inside the search results, right ?

Yes, you should be able to go from here to an MSFragger parameter file and perform a reanalysis of your dataset.

In this case, I think that the proposal should be more human-readable than machine friendly.

Agree, but we should require enough information to enable machines to enrich the files and perform the reanalysis. For example, if the UNIMOD ID is provided, we don't need to add some of the fields, like composition; see my point to @mvaudel.

Parameter files are often shared with people who are not entirely familiar with proteomics or even sometimes used as proof of documentation for analysis. A format like the one used by Comet variable_mod1 = 15.9949 M 0 3 might be easy to be consumed by a software, but quite impossible to be interpreted by a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal, it makes easier to spot errors and to differentiate isobaric PTMs.

Agree @prvst, this is why we are adding words rather than binary 0/1 values.

@ypriverol
Member Author

@mvaudel about the regular expression, probably this is the standard:

http://pubs.opengroup.org/onlinepubs/9699919799/

@trishorts
Collaborator

For MetaMorpheus, we got our start using the UniProt ptmlist and just retained that format. It is a key-value pair system that is pretty easy to interpret. There are mandatory fields and bonus fields. For example, we add diagnostic ions and neutral losses (dependent on fragmentation type). We also have a field that carries equivalent accession numbers for the same mod in different database systems.

Here is an example:

ID Phosphorylation
TG S or T
PP Anywhere.
NL HCD:H0 or HCD:H3 O4 P1
MT Common Biological
CF H1 O3 P1
DR Unimod; 21.
//

However, we use a .toml file for search settings, which is probably where the data you're mining would come from. In that file for mods, we have only PTM name and target motif. That combination is required to be unique for us.
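
For readers unfamiliar with the ptmlist-style layout, here is a minimal sketch of a reader for records like the Phosphorylation example above. It assumes one two-letter key per line, a value after the first space, and `//` as the record terminator; the function name `parse_ptmlist` is hypothetical and this is not MetaMorpheus code.

```python
PTMLIST = """\
ID Phosphorylation
TG S or T
PP Anywhere.
NL HCD:H0 or HCD:H3 O4 P1
MT Common Biological
CF H1 O3 P1
DR Unimod; 21.
//
"""

def parse_ptmlist(text):
    """Return one dict per record; records end with a '//' line."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            records.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    return records

records = parse_ptmlist(PTMLIST)
targets = records[0]["TG"].split(" or ")
print(records[0]["ID"], targets)
```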

@ypriverol
Member Author

@trishorts can you provide me with a .toml file?

@trishorts
Collaborator

sure thing. Let me create one with interesting PTMs.

@ypriverol
Member Author

Following @trishorts idea of key=value pairs for each property, we can update my first proposal:

Name ; aminoacid; type; position; UnimodAccession

example:

Glu->pyro-Glu; E; fixed; N-term; UNIMOD:27

We can improve it using the key=value structure:

ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)

With this approach, we can control the key names (ID, TG, TY, TP, PP, UA, CF, ...) and extend them in the specification. This will cover the use case from @mvaudel of adding the composition (CF). Also, the order of the properties does not matter because each one is identified by its key.

The downside of this approach is that it is less human-readable.

@mvaudel @trishorts @prvst @RalfG opinions welcome.

BTW @trishorts, what does TG mean?
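
A minimal sketch of how the key=value form could be read, assuming `;` separates pairs and the first `=` in each pair separates key from value. The function name `parse_key_value_mod` is hypothetical.

```python
def parse_key_value_mod(cell):
    """Parse 'ID=...; TG=...; ...' into a dict; the order of the pairs is irrelevant."""
    pairs = {}
    for item in cell.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so values such as 'Glu->pyro-Glu' stay intact.
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs

mod = parse_key_value_mod(
    "ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)"
)
print(mod["ID"], mod["CF"])
```

Since the result is a dictionary, two cells that list the same pairs in a different order parse to equal values.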

@ypriverol ypriverol pinned this issue Aug 16, 2019
@mvaudel
Collaborator

mvaudel commented Aug 16, 2019

Sounds really nice, having explicit labels increases readability and flexibility. Indeed neutral losses and reporter ions are needed, thanks for putting this up. Here again, we use atomic composition and never rounded mass ;)

@ypriverol
Member Author

@mwalzer actually highlighted that we need to define what is optional and mandatory to be able to define a modification parameter.

I think the only mandatory value would be a name ID because with the name phosphorylation we can guess most of the other fields.

@ypriverol ypriverol added help wanted Extra attention is needed question Further information is requested labels Aug 16, 2019
@mwalzer
Collaborator

mwalzer commented Aug 16, 2019

I like the key=value idea.
So would a key be allowed to occur multiple times and be interpreted as a virtual list? I don't much like the use of different separator characters.

Two potential issues that I see in general are:

  • how should a consumer interpret a metadata file with such PTM encoding when some keys are (because optional) missing
  • and how to cope with conflicting information, say for example the unimod has different positions in store as given via the encoding

@mvaudel
Collaborator

mvaudel commented Aug 16, 2019

My personal experience is - don't rely on Unimod.

@trishorts
Collaborator

Here are the key value pairs that MetaMorpheus uses with brief explanation

  • AC Accession

An accession number, frequently supplied by the primary databases (e.g. UniProt and Unimod).

  • CF Chemical formula (required if no MM is supplied/defined)

This is the chemical formula of the added or removed atoms. This is required but the mass shift used is specified by MM. The particular isotope of the element can be specified in curly braces following the element name. For example, carbon-13 is written as C{13} in the chemical formula. The number of atoms is specified after the closing brace. Five carbon-13 atoms is written as C{13}5.

  • DI Diagnostic Ions

Certain PTMs (e.g. acetylation or glycosylation) produce small diagnostic fragment ions that can be detected in MS/MS spectra. These ions can serve as useful indicators of the presence of the corresponding PTM. This feature is currently disabled.

  • DR External database links
  • FT Feature key

Used in the UniProt ptmlist but not needed for custom mods in MetaMorpheus

  • ID Identifier (Required)

This is the text used to describe the modification in the output.

  • MM Monoisotopic mass (Required if CF is not supplied/defined)

The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This will override the monoisotopic mass described in the chemical formula because there are cases where the mass of the mod and the mass shift from the mod are different (e.g. trimethylation has mass of 43 but mass shift from trimethylation is 42).

  • MT Modification type (Required)

This specifies which modification group the modification should be included with. Existing modification types are described here. The user is free to designate their own type, which creates a separate list.

  • NL Neutral loss (if any)

Certain PTMs (e.g. phosphorylation) have labile modifications that can be lost during ionization. The peptide parent mass in MS1 may be seen with or without the modification. Specifying neutral loss tells MetaMorpheus to take this phenomenon into account.

  • PP Position of the modification in the polypeptide (Required)

Choose from the following options: Anywhere.; Peptide N-terminal.; N-terminal.; Peptide C-terminal. DON'T FORGET THE '.'

  • TG Target (Required)

Amino acid letter code capitalized or written out. Multiple targets separated by " or ". The capital letter 'X' may be used to mean any amino acid.
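
The required/optional rules listed above can be checked mechanically. The sketch below is an assumption about how such a check might look, not MetaMorpheus code; `missing_keys` is a hypothetical helper that treats ID, MT, PP, and TG as required and accepts either CF or MM.

```python
REQUIRED = {"ID", "MT", "PP", "TG"}

def missing_keys(record):
    """List the required keys that are absent from one modification record."""
    missing = sorted(REQUIRED - record.keys())
    # Either a chemical formula (CF) or a monoisotopic mass (MM) must be present.
    if "CF" not in record and "MM" not in record:
        missing.append("CF or MM")
    return missing

complete = {
    "ID": "Phosphorylation", "TG": "S or T", "PP": "Anywhere.",
    "MT": "Common Biological", "CF": "H1 O3 P1",
}
print(missing_keys(complete))                 # []
print(missing_keys({"ID": "Phosphorylation"}))
```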

@ypriverol
Member Author

@mwalzer some comments here:

I like the key=value idea.
So it would be that a key can occur multiple times and be interpreted as a virtual list? I don't like much the use of different separation chars.

I don't see how, in this particular case, we could have more than one value for a particular key. That would be a different modification.

The idea would be:

| | comment [modification parameters] | comment [modification parameters] |
|---|---|---|
| sample 1 | ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1) | ID=Oxidation; TG=M |
| sample 2 | ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1) | ID=Oxidation; TG=M |
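
Because the column name repeats once per modification, a reader cannot key the row by header name alone. Here is a minimal sketch of one way to collect all modification cells per sample from a tab-delimited file; the `source name` header for the first column and the function name `modification_cells` are assumptions for illustration.

```python
import csv
import io

TSV = (
    "source name\tcomment [modification parameters]\tcomment [modification parameters]\n"
    "sample 1\tID=Glu->pyro-Glu; TG=E; TP=fixed\tID=Oxidation; TG=M\n"
    "sample 2\tID=Glu->pyro-Glu; TG=E; TP=fixed\tID=Oxidation; TG=M\n"
)

def modification_cells(tsv_text, column="comment [modification parameters]"):
    """Map each sample (first column) to the list of its modification cells."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # The column name repeats, so collect every matching index instead of using DictReader.
    indices = [i for i, name in enumerate(header) if name == column]
    return {row[0]: [row[i] for i in indices] for row in body}

cells = modification_cells(TSV)
print(cells["sample 1"])
```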

Two potential issues that I see in general are:

  • how should a consumer interpret a metadata file with such PTM encoding when some keys are (because optional) missing

Actually, this is a great point. The consumers of the metadata can make decisions depending on which data is missing. For example, in PRIDE we will implement a system that annotates as many of these values as possible; but if the user submits only the name, we can suggest the possible matching modifications in Unimod.

  • and how to cope with conflicting information, say for example the unimod has different positions in store as given via the encoding

This is up to the consuming system or software to decide. For example, we have a library that, if a delta mass plus the name of the modification is provided and it matches exactly one UNIMOD modification, can suggest that modification.
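
The unique-match idea can be sketched as follows. This is not the library mentioned above; the five-entry table is a small illustrative subset of Unimod (monoisotopic masses), and the function name `suggest_accession` and the default tolerance are assumptions.

```python
# Small illustrative subset; a real implementation would load the full Unimod database.
UNIMOD_SUBSET = [
    {"accession": "UNIMOD:1", "name": "Acetyl", "mono_mass": 42.010565},
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "mono_mass": 57.021464},
    {"accession": "UNIMOD:21", "name": "Phospho", "mono_mass": 79.966331},
    {"accession": "UNIMOD:35", "name": "Oxidation", "mono_mass": 15.994915},
    {"accession": "UNIMOD:40", "name": "Sulfo", "mono_mass": 79.956815},
]

def suggest_accession(name=None, mass_shift=None, tolerance=0.001):
    """Return an accession only if the given name and/or mass match exactly one entry."""
    hits = UNIMOD_SUBSET
    if name is not None:
        hits = [e for e in hits if e["name"].lower() == name.lower()]
    if mass_shift is not None:
        hits = [e for e in hits if abs(e["mono_mass"] - mass_shift) <= tolerance]
    return hits[0]["accession"] if len(hits) == 1 else None

print(suggest_accession(name="Oxidation", mass_shift=15.9949))   # UNIMOD:35
print(suggest_accession(mass_shift=79.96, tolerance=0.02))       # None (Phospho and Sulfo both match)
```

Returning nothing on an ambiguous match leaves the decision to the submitter, which fits the point that conflicts are resolved by the consumer rather than by the format.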

@trishorts
Collaborator

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

@ypriverol
Member Author

Agree.

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

I checked before making the proposal, and ; does not appear in any Interim Name in Unimod, so we are probably fine. But if the user uses the description, we can have conflicts (e.g. Loss of O; nitro photochemical decomposition).

@mvaudel
Collaborator

mvaudel commented Aug 16, 2019

Can we not just use quotes for all values?
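
As a sketch of what quoting every value would buy, the snippet below reads `KEY="value"` pairs with a regular expression, so a `;` inside a value no longer breaks the split. The quoted syntax shown is hypothetical, as is the function name `parse_quoted_mod`; escaped quotes inside values are not handled.

```python
import re

# One KEY="value" pair; the value may contain anything except a double quote.
PAIR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def parse_quoted_mod(cell):
    """Parse 'ID="..."; TP="..."' into a dict, tolerating ';' inside quoted values."""
    return dict(PAIR.findall(cell))

mod = parse_quoted_mod('ID="Loss of O; nitro photochemical decomposition"; TP="variable"')
print(mod["ID"])
```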

@mvaudel
Collaborator

mvaudel commented Aug 17, 2019

In addition, it should be possible to specify where the modification is attached on the motif. The format needs to specify that it is zero-based and what the default is.
e.g. motif="[ST]" target=-2 would search modifications two amino acids before any S or T, which would be equivalent to motif="XX[ST]" with a default target of 0. motif="[ST]" target=1 would look for a modification after any S or T.
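
The offset semantics can be checked with a short sketch. It assumes the motif is a Python regular expression (so the X wildcard above is written as `.`), that positions are zero-based, and that the offset is counted from the start of each motif match; `modified_positions` is a hypothetical name.

```python
import re

def modified_positions(sequence, motif, target=0):
    """Zero-based indices of modified residues: motif start plus the target offset."""
    positions = []
    # A lookahead makes overlapping motif occurrences show up as separate matches.
    for match in re.finditer(f"(?={motif})", sequence):
        index = match.start() + target
        if 0 <= index < len(sequence):
            positions.append(index)
    return positions

seq = "AKSPT"
print(modified_positions(seq, "[ST]", target=-2))   # [0, 2]
print(modified_positions(seq, "..[ST]", target=0))  # [0, 2], the equivalent form
print(modified_positions(seq, "[ST]", target=1))    # [3]
```

Offsets that fall outside the sequence are dropped here; whether that should instead be an error is a decision the format would need to state.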

@ypriverol
Member Author

PTM site position ongoing discussion:

I will try to summarize the discussion about the PTM site parameter, which is blocking the first PR #15.

1- Target Amino acid (TA) (Proposed by @ypriverol)

TA=M

Target amino acid letter. If the modification targets multiple sites, it should be provided as a Target Regular Expression (TR).

Pros:

  • This will be easy for manual annotation and can represent all of the most common modifications. It can be improved with the proposal by @trishorts of using an or operator (or or |) to represent multiple single sites, e.g. TA=S or T or Y, or with |, TA=S|T|Y.

  • Easy for the submitter of proteomics data to repositories.

Cons:

2- Target Amino Acid as Regular expression (proposed by @RonBeavis @mvaudel ):

TA=N[^P][ST]

This proposal aims to represent all sites into a regular expression including motifs, etc.

Pros:

  • All modifications sites and configurations can be represented.

Cons:

  • Difficult for submitters and users to write (a possible solution would be a web page listing regular expressions for all well-known PTMs, like UNIMOD?).

  • Difficult for readers of the sample metadata files to interpret. In addition, it will need some agreement on validation. We will need to develop tools to validate the regular expressions.

Comments are needed here to agree on one of the options: @mvaudel @mwalzer @RalfG @RonBeavis @prvst @trishorts .

@RalfG
Collaborator

RalfG commented Aug 18, 2019

I tend to prefer option 2, as it is more comprehensive and correct. I agree that this option is more difficult for human submitters and human readers, but a well-designed submission form should be able to take these issues away for the common modifications.

I suspect that regex validators already exist for most programming languages?

@mobiusklein

mobiusklein commented Aug 18, 2019

Option 2 still lacks a way to express which amino acid is the actual target. In this case, the N-glycosylation motif modifies the first amino acid (N), but this isn't guaranteed to be the case. The bacterial N-glycosylation motif has a prefix as well as a suffix around the modification site: [DE][^P]N[^P][ST].

To be able to use a regular expression, we would need to either A) specify capture group index, B) use named capture groups, or C) add a marker to the regular expression to indicate that an amino acid is the target.

The glycosaminoglycan linker glycosylation process preferentially targets S[GA]X[GA] where both S and X may be modified, but X should not be modified if S is not. There's plenty of poorly understood biology here, so we don't know the constraints on X.

If we have to use a capture group, then validation is more than just compiling the regular expression; it also means testing that it contains a capture group? If we want trivial cases not to require a capture group, do we check that the pattern cannot produce matches of length > 1?
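
For what it is worth, the capture-group check is cheap in Python, since a compiled pattern exposes its group count. The sketch below assumes option A (the target is the single capture group); requiring exactly one group is an assumption, and cases like the S[GA]X[GA] linker with two modifiable residues would need a looser rule. `check_target_pattern` is a hypothetical name.

```python
import re

def check_target_pattern(pattern):
    """Return 'ok' or a short message describing why the pattern is rejected."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return f"invalid regular expression: {exc}"
    # compiled.groups is the number of capturing groups in the pattern.
    if compiled.groups != 1:
        return f"expected exactly one capture group, found {compiled.groups}"
    return "ok"

print(check_target_pattern("[DE][^P](N)[^P][ST]"))  # ok
print(check_target_pattern("N[^P][ST]"))            # expected exactly one capture group, found 0
print(check_target_pattern("[ST"))
```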

@ypriverol
Member Author

Can we list a set of examples with the name of modifications and possible Regular expressions? @mvaudel @RonBeavis @mobiusklein @trishorts . I think it will help us to define more clearly option 2.

@mobiusklein

Beyond glycosylation motifs, I do not know many that are "hard rules", and we stray into a gray area between blind combinatorial expansion rules vs. prescribed target sites from a database.

You can draw a few from PROSITE:

Phosphorylation
https://prosite.expasy.org/PDOC00004 [RK]{2}.([ST])
https://prosite.expasy.org/PDOC00005 ([ST]).[RK]
https://prosite.expasy.org/PDOC00006 ([ST])..[DE]
https://prosite.expasy.org/PDOC00007 [RK].{2,3}[DE].{2,3}(Y)

N-myristoylation
https://prosite.expasy.org/PDOC00008 (G)[^EDRKHPFYW]..[STAGCN][^P]

Amidation
https://prosite.expasy.org/PDOC00009 (.)G[RK]{2}
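
To show how such a pattern would be consumed, here is a minimal sketch that reports the position of the captured residue for every motif occurrence. The peptide sequence is made up for illustration, and `target_sites` is a hypothetical helper.

```python
import re

def target_sites(sequence, pattern):
    """Zero-based positions of the residue captured by group 1 of the pattern."""
    # Wrapping the pattern in a lookahead reports overlapping motif matches too.
    return [m.start(1) for m in re.finditer(f"(?={pattern})", sequence)]

# ([ST]).[RK] is the second phosphorylation pattern listed above.
sites = target_sites("MASAKTPR", "([ST]).[RK]")
print(sites)  # [2, 5]
```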

@ypriverol
Member Author

@mobiusklein :

This representation is more complex than what I was thinking of representing, because it also encodes the information about the enzyme. What do you think, @mvaudel @trishorts @RonBeavis?

@trishorts
Collaborator

I don't really have any comments about how you represent motifs. I like having motifs where they are appropriate. We don't use regex unless it can't be avoided.

@trishorts
Collaborator

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched. And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations, like those that Ron has mentioned earlier, that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced. So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

@ypriverol
Member Author

Thanks for this comment @trishorts, I think in the document I make clear what is the original intention of these efforts.

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched.

1.- THIS IS THE MAIN INTENTION. The current metadata about experimental design in public databases, including PRIDE, is really poor. This makes data reuse and reproducibility really difficult. We want to provide a tab-delimited format that enriches the data submission process in two directions:

1.1- The file format should be able to provide information about the experimental design and sample metadata, including taxonomy, tissues, etc. We are proposing SDRF because RNA-Seq has been using the format for more than 10 years, and we have thousands and thousands of well-annotated projects with no problems (including single-cell experiments). Using SDRF will enable us and the proteomics community to move towards multiomics, annotating proteomics and transcriptomics experiments in the same way.

1.2- We need to provide sufficient information about the data analysis protocol to describe how the data was processed. This "protocol" description within the SDRF is specific to each field, in our case proteomics, and we need to define some rules about how to capture it, including how to encode PTM search parameters (this issue). The next discussion should be about enzymes, fragment tolerances, TMT fragment ion masses, etc.

And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations like those that Ron has mentioned earlier that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced.

Agree.

So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

Most of the search engine parameters exposed to users (MSGF+, Comet, UNIMOD) cover the following properties of a modification parameter: Accession or Name, Position [anywhere, C and N-term, Protein C and N-term], Composition, and Mass shift or Monoisotopic mass.

The current PR #15 aims to define those properties first, since they are the easiest to define. In my opinion, the current definition of the amino acid target AT should only state which amino acids will be modified.

AT = S,T,Y  

Then what I currently call TR (Target Regular expression) should be used to define more complex structures. I see now that SearchGUI (@mvaudel) uses a Pattern Design defined as Target AA and Excluded AA.

If we accept the current proposal in PR #15, then we can clearly discuss how to encode the full information of PTM parameters as regular expressions.

@trishorts
Collaborator

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me. Thanks. BTW, I couldn't begin to construct such a REGEX.

@mobiusklein

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable. Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.

Is the intent of this experimental design section to capture all modifications, or only variable modifications? Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

Repeat above for cross-linked peptide experiments?

@ypriverol
Member Author

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me.

Can you review PR #15? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N-term and C-term positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

@RalfG
Collaborator

RalfG commented Aug 19, 2019

@ypriverol:

The only thing pending is how to define modifications that affect the N-term and C-term positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

If we are talking about modifications targeting the N-term NH2- or the C-term -COOH, I think N-term and C-term would be good ways to describe them. If we are talking about PTMs specifically targeting the side-chain of an N-term/C-term amino acid, I would go for ., * or any in combination with the PP (polypeptide position) key.

Mass shift-wise, this does not really matter. But I guess for "blocking" the sites in the search space, it could, in theory, make a difference.

@ypriverol
Member Author

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable.

OK

Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.

I will open a new issue about that, to discuss possible implementations. In the current PR #15 that definition is pending until we have a decision.

Is the intent of this experimental design section to capture all modifications, or only variable modifications?

Variable and fixed modifications defined as parameters in the search. See the definition in PR #15.

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name Unknown modification together with a mass shift and all possible amino acids.

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

@trishorts
Collaborator

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches for variable phosphorylation on, say, alanine, then AT = A; that's what we want to know. I think this clarifies everything for me.

Can you review PR #15? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N-term and C-term positions rather than amino acids. I like the UNIMOD definitions N-term and C-term.

I'm on board with this

@trishorts trishorts reopened this Aug 19, 2019
@trishorts
Collaborator

woops

@RalfG
Collaborator

RalfG commented Aug 19, 2019

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name Unknown modification together with a mass shift and all possible amino acids.

For open modification search engines that search for a (very large) fixed list of modifications, this would work. But some open modifications search engines do not have an a priori list of modifications to search for. For those search engines, it would be good to include an any mass shift or open search tag in the data analysis protocol.

@mobiusklein

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

Glycoproteomics search engines do not use "site specific" databases, though should the repositories become complete enough, that'd be desirable. Most of them simply put every single glycan of the appropriate type at each site just like any other variable modification. PEFF has not yet standardized how to communicate the range of glycoforms expected at a specific site, simply that a site is glycosylated.

If including just a very long list of modifications is sufficient, then this should work for glycoproteomics too, provided we have an acceptable way to encode our glycans. If that defeats the purpose of this format, then neither glycoproteomics nor those open modification search engines with a large database of modifications may have an appropriate way to be described.

@jpfeuffer

Hi @ypriverol and others:

I was wondering how one would represent mutually exclusive modifications like SILAC modifications:
Some search engines like Comet allow for a simultaneous search of such modifications (encoded in the "binary group" column of its parameters at the end of the page here).
With other search engines you might need to search multiple times with the same non-quantification modification and one of the quantification modifications in the group (and afterwards merge the results).
I could imagine either introducing another key/value pair for such a "binary group" and/or allowing multiple rows for the same Run to represent different Samples.

Anyone thought about that already?

@ypriverol
Member Author

@jpfeuffer Can you propose how to encode that in a key=value representation?

@jpfeuffer

Maybe an optional key "BG" for every modification with integer values representing the group of modifications that should be/were searched together in a binary (all-or-none) way.
If this optional key is missing the modification is handled as usual (and considered on its own).
You could adapt the description from the Comet page in your documentation.

If the searches were performed separately e.g. with another search engine, the user can still go for multiple rows I think, so no loss of generality here.
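
A minimal sketch of how a consumer might interpret the optional BG key, assuming each modification has already been parsed into a dictionary of key/value pairs. The function name `group_by_binary_group` and the example label names are for illustration only.

```python
from collections import defaultdict

def group_by_binary_group(mods):
    """Split modifications into all-or-none groups (by BG) and independent ones."""
    groups = defaultdict(list)
    independent = []
    for mod in mods:
        if "BG" in mod:
            groups[int(mod["BG"])].append(mod["ID"])
        else:
            # No BG key: the modification is considered on its own, as usual.
            independent.append(mod["ID"])
    return dict(groups), independent

mods = [
    {"ID": "Label:13C(6)15N(2)", "TG": "K", "BG": "1"},
    {"ID": "Label:13C(6)15N(4)", "TG": "R", "BG": "1"},
    {"ID": "Oxidation", "TG": "M"},
]
groups, independent = group_by_binary_group(mods)
print(groups, independent)
```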

@ypriverol
Member Author

@jpfeuffer I was thinking that most search engines treat SILAC and multiplex modifications as variable modifications, and that this solves the binary-group problem.

@ypriverol
Member Author

Thanks to all for your comments, I will close this issue because we have a proposal now https://github.com/bigbio/proteomics-metadata-standard/tree/master/experimental-design#encoding-protein-modifications

ypriverol pushed a commit that referenced this issue May 6, 2020