Encoding PTMs parameters into one-line Experimental Design #13
Comments
So you are basically talking about how the modifications are declared inside the parameter files, not how they are represented inside the search results, right? In that case, I think the proposal should be more human-readable than machine-friendly. Parameter files are often shared with people who are not entirely familiar with proteomics, and are sometimes even used as documentation of an analysis. A format like the one used by Comet |
Veit (and others) worked on this at proteoform level. But I think you could find their strategy interesting: LeDuc, R. D., Schwämmle, V., Shortreed, M. R., Cesnik, A. J., Solntsev, S. K., Shaw, J. B., … Tsybin, Y. O. (2018). ProForma: A Standard Proteoform Notation. Journal of Proteome Research, 17(3), 1321–1325. https://doi.org/10.1021/acs.jproteome.7b00851 |
Thanks, @mlocardpaulet, for the reference. We are talking more about PTMs as search parameters. To represent PTMs in results, we already have good references (as you said) such as ProForma, mzTab, and others. The problem we want to solve is that when we annotate a PRIDE or ProteomeXchange experiment, we should also annotate some of the search parameters so that external tools like SearchGUI and others can reanalyze the data. This issue is about how to encode PTMs as search parameters. |
We often do something like this, in a JSON structure:
|
Thanks @RalfG In the current proposal
Would it be feasible to represent your JSON? We first want a tab-delimited representation that aligns with the experimental design, but in the future, YES, we will also serialize to JSON. In my proposal, your first modification would look like:
The only thing missing is the mass shift. I didn't include it because it can be retrieved from the UNIMOD accession. However, I agree we can be more explicit by including the mass shift. |
A few points you might want to consider:
|
Here my comments.
I like this idea. We should accept the pattern. However, what is the best way to encode a pattern in a standardized way? I can see a lot of software and users writing their own pattern rules that are difficult to translate into a specific language. I found a link about how to standardize regular expressions: https://www.regular-expressions.info/refflavors.html. Probably a good place to start.
I think this is really common now. If we use Unimod definitions, it will be:
We can explicitly ask for the atomic mass; however, MOST of the search engines and tools currently use the mass_shift. In addition, if we go for the tab-delimited, user-friendly option, the mass shift is easier to obtain than the atomic composition. I really think we should not add a lot of detail if the UNIMOD accession is known. If the UNIMOD accession is not known, could the composition be the name of the modification?
We are aiming for a tab-delimited format that is easy to produce by software but also easy to write and read manually by submitters, and that can be enriched by our submission tools. For example, a user should be able to specify a fixed modification like this:
You should be able to pick it up from there with SearchGUI and go on with the reanalysis. |
Hi @prvst
Yes, you should be able to go from here to an MSFragger parameter file and perform a reanalysis of your dataset.
Agree, but we should require enough information for machines to enrich the files and perform the reanalysis. For example, if the UNIMOD id is provided, we don't need to add some of the fields, like composition; see my point with @mvaudel.
Agree @prvst, this is why we are adding some words rather than binary 0/1 values. |
@mvaudel about the regular expression, probably this is the standard: |
For MetaMorpheus, we got our start using the UniProt ptmlist and just retained that format. There is a key-value pair system that is pretty easy to interpret. There are mandatory fields and bonus fields. For example, we add diagnostic ions and neutral losses (dependent on fragmentation type). We also have a field that carries equivalent accession numbers for the same mod in different database systems. Here is an example: ID Phosphorylation However, we use a .toml file for search settings, which is probably where the data you're mining would come from. In that file for mods, we have only PTM name and target motif. That combination is required to be unique for us. |
@trishorts can you provide me |
sure thing. Let me create one with interesting PTMs. |
Following @trishorts idea of
example:
We can improve it using the key=value structure:
With this approach, we can control the key names (ID, TG, TY, TP, PP, UA, CF ...) and extend them in the specification. This will cover the use case from @mvaudel to add the Composition (CF). Also, the order of the properties does not matter because it is controlled by the key. The downside of this approach is that it is less human-readable. @mvaudel @trishorts @prvst @RalfG opinions welcome. BTW @trishorts, what is the meaning of |
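To illustrate the point about key order, here is a minimal sketch of how such a key=value string could be parsed. The `;` pair separator and the example values are assumptions made for illustration only (the exact syntax was still under discussion); the key names are the ones listed above.

```python
# Hypothetical sketch: the ';' separator and the example values are assumptions,
# not part of any agreed specification. Key names come from the comment above.
ALLOWED_KEYS = {"ID", "TG", "TY", "TP", "PP", "UA", "CF"}


def parse_modification(text):
    """Parse 'KEY=value;KEY=value' into a dict, rejecting unknown or repeated keys."""
    fields = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing separator
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep:
            raise ValueError(f"missing '=' in {pair!r}")
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown key {key!r}")
        if key in fields:
            raise ValueError(f"duplicate key {key!r}")
        fields[key] = value.strip()
    return fields


# The same modification written with two different key orders.
first = parse_modification("ID=Phosphorylation;TG=S;PP=Anywhere")
second = parse_modification("PP=Anywhere;TG=S;ID=Phosphorylation")
print(first == second)
```

Because the result is a dictionary keyed by the property name, both orderings produce the same record. A naive split like this breaks as soon as a value contains the separator.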
Sounds really nice, having explicit labels increases readability and flexibility. Indeed neutral losses and reporter ions are needed, thanks for putting this up. Here again, we use atomic composition and never rounded mass ;) |
@mwalzer actually highlighted that we need to define what is optional and mandatory to be able to define a modification parameter. I think the only mandatory value would be a name |
I like the key=value idea. Two potential issues that I see in general are:
|
My personal experience is - don't rely on Unimod. |
Here are the key value pairs that MetaMorpheus uses with brief explanation
An accession number frequently supplied by the primary databases (e.g. UniProt and Unimod).
This is the chemical formula of the added or removed atoms. This is required but the mass shift used is specified by MM. The particular isotope of the element can be specified in curly braces following the element name. For example, carbon-13 is written as C{13} in the chemical formula. The number of atoms is specified after the closing brace. Five carbon-13 atoms is written as C{13}5.
Certain PTMs (e.g. acetylation or glycosylation) produce small diagnostic fragment ions that can be detected in MS/MS spectra. These ions can serve as useful indicators of the presence of the corresponding PTM. This feature is currently disabled.
Used in the UniProt ptmlist but not needed for custom mods in MetaMorpheus
This is the text used to describe the modification in the output.
The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This will override the monoisotopic mass described in the chemical formula because there are cases where the mass of the mod and the mass shift from the mod are different (e.g. trimethylation has mass of 43 but mass shift from trimethylation is 42).
This specifies which modification group the modification should be included with. Existing modification types are described here. The user is free to designate their own type, which creates a separate list.
Certain PTMs (e.g. phosphorylation) have labile modifications that can be lost during ionization. The peptide parent mass in MS1 may be seen with or without the modification. Specifying neutral loss tells MetaMorpheus to take this phenomenon into account.
Choose from the following options: Anywhere.; Peptide N-terminal.; N-terminal.; Peptide C-terminal. DON'T FORGET THE '.'
Amino acid letter code capitalized or written out. Multiple targets separated by " or ". The capital letter 'X' may be used to mean any amino acid. |
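As a rough illustration of the chemical formula and mass shift fields described above, the sketch below parses a formula with the `C{13}5` isotope notation and computes a monoisotopic mass, letting an explicitly supplied mass override the computed one. This is not MetaMorpheus code: the function names, the small mass table, the tolerance of spaces, and the handling of negative counts for removed atoms are all assumptions.

```python
import re

# Monoisotopic masses (Da) for a few elements. Keys are (element, isotope);
# isotope None means the most abundant isotope. Deliberately incomplete.
MASSES = {
    ("H", None): 1.00782503207,
    ("C", None): 12.0,
    ("C", 13): 13.00335483507,
    ("N", None): 14.0030740048,
    ("O", None): 15.99491461956,
    ("P", None): 30.97376163,
}

# Element symbol, optional isotope in curly braces, optional (possibly negative) count.
TOKEN = re.compile(r"([A-Z][a-z]?)(?:\{(\d+)\})?(-?\d+)?")


def parse_formula(formula):
    """Return {(element, isotope): count} for a formula such as 'C{13}5' or 'H1 O3 P1'."""
    formula = formula.replace(" ", "")  # assumption: spaces are insignificant
    counts = {}
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos} in {formula!r}")
        element, isotope, count = match.groups()
        key = (element, int(isotope) if isotope else None)
        counts[key] = counts.get(key, 0) + (int(count) if count else 1)
        pos = match.end()
    return counts


def mass_shift(formula, explicit_mass=None):
    """Use the explicit mass when given; otherwise sum the formula's monoisotopic masses."""
    if explicit_mass is not None:
        return explicit_mass
    return sum(MASSES[key] * n for key, n in parse_formula(formula).items())


print(parse_formula("C{13}5"))
print(f"{mass_shift('H1O3P1'):.5f}")
```

Printing with five decimal places follows the accuracy requested for the mass field.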
@mwalzer some comments here:
I don't see how, in this particular case, we can have more than one value for one particular key. That would be a different modification. The idea would be:
Actually, this is a great point. The consumers of the metadata can make decisions depending on which data is missing. For example, in PRIDE we will implement a system that annotates as many of these values as possible; but if the user submits only the name, we can suggest possible Unimod modifications to the user.
This is up to the consuming system or software to decide. For example, we have a library that, if a delta mass and the name of the modification are provided and they match uniquely to one UNIMOD modification, can suggest that modification. |
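The kind of lookup described here could work roughly as in the sketch below. It is not the library mentioned in the comment: the function name, the tolerance default, and the small hand-picked table of Unimod entries are assumptions for illustration.

```python
# A few Unimod entries (accession, name, monoisotopic delta mass in Da).
# A real tool would load the full Unimod database instead of this subset.
UNIMOD_SUBSET = [
    ("UNIMOD:1", "Acetyl", 42.010565),
    ("UNIMOD:4", "Carbamidomethyl", 57.021464),
    ("UNIMOD:21", "Phospho", 79.966331),
    ("UNIMOD:35", "Oxidation", 15.994915),
    ("UNIMOD:37", "Trimethyl", 42.046950),
]


def suggest_unimod(delta_mass, name=None, tol=0.01):
    """Return a Unimod accession only if mass (and name, when given) match exactly one entry."""
    hits = [entry for entry in UNIMOD_SUBSET if abs(entry[2] - delta_mass) <= tol]
    if name is not None:
        hits = [entry for entry in hits if entry[1].lower() == name.lower()]
    return hits[0][0] if len(hits) == 1 else None


print(suggest_unimod(79.9663, "phospho"))
print(suggest_unimod(42.03, tol=0.05))            # Acetyl and Trimethyl both fit
print(suggest_unimod(42.03, "Acetyl", tol=0.05))  # the name resolves the ambiguity
```

Returning nothing on an ambiguous match leaves the decision to the submitter rather than guessing between near-isobaric modifications.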
We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat. |
Agree.
I check before the proposal and |
Can we not just use quotes for all values? |
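A minimal sketch of what quoting every value could look like on the parsing side, so that commas or semicolons inside a modification name no longer break the reader. The `;` separator, the backslash escape for embedded quotes, and the example name are assumptions, not an agreed syntax.

```python
import re

# KEY="value" followed by ';' or end of string. Inside the quotes, any character
# is allowed; a literal quote or backslash must be escaped with a backslash.
PAIR = re.compile(r'\s*(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"\s*(?:;|$)')


def parse_quoted(text):
    """Parse 'KEY="value";KEY="value"' into a dict, honouring quotes and escapes."""
    fields = {}
    pos = 0
    while pos < len(text):
        match = PAIR.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse at offset {pos}: {text[pos:]!r}")
        key, raw = match.group(1), match.group(2)
        fields[key] = re.sub(r"\\(.)", r"\1", raw)  # drop the escaping backslashes
        pos = match.end()
    return fields


# Hypothetical modification name containing both a comma and a semicolon.
record = parse_quoted('ID="N6,N6,N6-trimethyllysine; alternate";TG="K"')
print(record["ID"])
print(record["TG"])
```

The cost is that every writer must quote and escape consistently, which makes hand-written files slightly noisier.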
In addition, it should be possible to specify where the modification is attached on the motif. The format needs to specify that it is zero-based and what the default is. |
PTM site position ongoing discussion: I will try to summarize the discussion about the PTM site parameter, which is blocking the first PR #15 . 1- Target Amino acid (TA) (Proposed by @ypriverol)
Target amino acid letter. If the modification targets multiple sites, it should be provided as a Target Regular Expression (TR). Pros:
Cons:
2- Target Amino Acid as Regular expression (proposed by @RonBeavis @mvaudel ):
This proposal aims to represent all sites into a regular expression including motifs, etc. Pros:
Cons:
Comments needed here to agree on one of the options: @mvaudel @mwalzer @RalfG @RonBeavis @prvst @trishorts . |
I tend to prefer option 2, as it is more comprehensive and correct. I agree that this option is more difficult for human submitters and human readers, but a well-designed submission form should be able to take these issues away for the common modifications. I suspect that regex validators already exist for most programming languages? |
Option 2 still lacks a way to express which amino acid is the actual target. In this case, the N-glycosylation motif modifies the first amino acid ( To be able to use a regular expression, we would need to either A) specify a capture group index, B) use named capture groups, or C) add a marker to the regular expression to indicate which amino acid is the target. The glycosaminoglycan linker glycosylation process preferentially targets If we have to use a capture group, then validation is more than just compiling the regular expression; it also means testing that it contains a capture group? If we want trivial cases not to require a capture group, do we check that the pattern cannot produce matches of length > 1? |
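Option B (named capture groups) could be sketched as follows. The group name `target`, the function name, and the test sequence are assumptions; the pattern is the common N-glycosylation sequon (N, then any residue except P, then S or T). Reported positions are zero-based.

```python
import re


def target_positions(pattern, sequence, group="target"):
    """Return zero-based positions of the residue captured by the named group."""
    compiled = re.compile(pattern)
    # Validation goes beyond compiling: the pattern must declare the target group.
    if group not in compiled.groupindex:
        raise ValueError(f"pattern {pattern!r} has no named group {group!r}")
    # Wrapping the pattern in a lookahead also reports overlapping motif matches.
    lookahead = re.compile(f"(?={pattern})")
    return [match.start(group) for match in lookahead.finditer(sequence)]


SEQUON = r"(?P<target>N)[^P][ST]"
print(target_positions(SEQUON, "MNGTNPSANAS"))
```

In this sequence the asparagine followed by proline is skipped, and only the asparagines inside a complete motif are reported as modification sites.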
Can we list a set of examples with the name of modifications and possible Regular expressions? @mvaudel @RonBeavis @mobiusklein @trishorts . I think it will help us to define more clearly option 2. |
Beyond glycosylation motifs, I do not know many that are "hard rules", and we stray into a gray area between blind combinatorial expansion rules vs. prescribed target sites from a database. You can draw a few from PROSITE: Phosphorylation N-myristoylation Amidation |
This representation is more complex than what I was planning to represent, because it also encodes the information of the enzyme. What do you think, @mvaudel @trishorts @RonBeavis? |
I don't really have any comments about how you represent motifs. I like having motifs where they are appropriate. We don't use regex unless it can't be avoided. |
New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched. The other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations, like those that Ron has mentioned earlier, that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced. So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation not be allowed at tryptic peptide termini in the recording of the entry, but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation, but I see a collision. |
Thanks for this comment @trishorts, I think the document makes clear the original intention of these efforts.
1.- THIS IS THE MAIN INTENTION. The current metadata about experimental design in public databases, including PRIDE, is really poor. This makes data reuse and reproducibility really difficult. We want to provide a tab-delimited format that enriches the data submission process in two directions: 1.1- The file format should be able to provide information about the 1.2- We need to provide
Agree.
By looking into most of the search engine parameters (MSGF+, Comet, UNIMOD) exposed to the The current PR #15 aims to define first those properties that are easiest to define. In my opinion, the current definition of Amino Acid target
Then, what I named now If we accept the current proposal PR #15 , then we can clearly discuss how to encode the full information of PTM parameters into regular expressions. |
As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches variable phosphorylation on, say, alanine, then AT = A, and that's what we want to know. I think this clarifies everything for me. Thanks. BTW, I couldn't begin to construct such a REGEX. |
Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable. Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern. Is the intent of this experimental design section to capture all modifications, or only variable modifications? Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow? We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme? Repeat above for cross-linked peptide experiments? |
Can you review the following PR #15 ? I made minor changes to reflect the latest discussion. The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definition |
If we are talking about modifications targeting the N-term NH2- or the C-term -COOH, I think Mass shift-wise, this does not really matter. But I guess for "blocking" the sites in the search space, it could, in theory, make a difference. |
OK
I will open a new issue about that, to discuss possible implementations. In the current PR #15 that definition is pending until we have a decision.
Variable and fixed modifications are defined as parameters in the search. See the definition in the PR #15
For "dark modifications" we can use a name
We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF. |
I'm on board with this |
woops |
For open modification search engines that search for a (very large) fixed list of modifications, this would work. But some open modifications search engines do not have an a priori list of modifications to search for. For those search engines, it would be good to include an |
Glycoproteomics search engines do not use "site specific" databases, though should the repositories become complete enough, that'd be desirable. Most of them simply put every single glycan of the appropriate type at each site just like any other variable modification. PEFF has not yet standardized how to communicate the range of glycoforms expected at a specific site, simply that a site is glycosylated. If including just a very long list of modifications is sufficient, then this should work for glycoproteomics too, provided we have an acceptable way to encode our glycans. If that defeats the purpose of this format, then glycoproteomics and open modification search engines with a large database of modifications might both lack an appropriate way to be described. |
Hi @ypriverol and others: I was wondering how one would represent mutually exclusive modifications like SILAC modifications: Anyone thought about that already? |
@jpfeuffer Can you propose how to encode that into a |
Maybe an optional key "BG" for every modification with integer values representing the group of modifications that should be/were searched together in a binary (all-or-none) way. If the searches were performed separately e.g. with another search engine, the user can still go for multiple rows I think, so no loss of generality here. |
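A rough sketch of how a consumer might read the proposed optional "BG" key: modifications sharing the same integer are collected into one all-or-none group, and modifications without the key stay independent. The dictionary layout, the function name, and the example labels are assumptions for illustration.

```python
from collections import defaultdict


def group_by_bg(mods):
    """Split modifications into all-or-none groups (by 'BG') and independent ones."""
    groups = defaultdict(list)
    independent = []
    for mod in mods:
        if "BG" in mod:
            groups[int(mod["BG"])].append(mod["ID"])
        else:
            independent.append(mod["ID"])
    return dict(groups), independent


# Hypothetical SILAC heavy channel (K and R labels searched together) plus an
# ordinary variable modification that is searched independently.
mods = [
    {"ID": "Label:13C(6)15N(2)", "TG": "K", "BG": "1"},
    {"ID": "Label:13C(6)15N(4)", "TG": "R", "BG": "1"},
    {"ID": "Oxidation", "TG": "M"},
]
groups, independent = group_by_bg(mods)
print(groups)
print(independent)
```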
@jpfeuffer I was thinking that most of the search engines used SILAC and multiplex modifications as |
Thanks to all for your comments, I will close this issue because we have a proposal now https://github.com/bigbio/proteomics-metadata-standard/tree/master/experimental-design#encoding-protein-modifications |
@hbarsnes @mvaudel @StSchulze
We have continued working with the metadata experimental design.
See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format
However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet, and MaxQuant encode PTMs as string lines, which is great, because we can encode PTM variables as a string that will be easy to translate into the search strings.
MSGF+ :
Comet:
CRUX:
I think we can propose a way to encode these PTMs as strings within the metadata files.
Where:
Name: Name of the modification.
aminoacid: Amino acid
Type: Fixed, Variable, Custom
Position: Any, N-Term, Protein N-term
UnimodAccession: Unimod Accession
The Unimod accession can be replaced with delta mass.
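The fields listed above could be modelled along these lines. This is only a sketch under stated assumptions: the class name, the attribute names, the default position, and the rule that at least one of Unimod accession or delta mass must be present are mine; the allowed Type and Position values are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = {"Fixed", "Variable", "Custom"}
ALLOWED_POSITIONS = {"Any", "N-Term", "Protein N-term"}


@dataclass(frozen=True)
class ModificationParam:
    """One PTM search parameter; hypothetical model of the proposed fields."""
    name: str
    aminoacid: str
    type: str
    position: str = "Any"
    unimod_accession: Optional[str] = None
    delta_mass: Optional[float] = None  # may stand in for the Unimod accession

    def __post_init__(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown type {self.type!r}")
        if self.position not in ALLOWED_POSITIONS:
            raise ValueError(f"unknown position {self.position!r}")
        if self.unimod_accession is None and self.delta_mass is None:
            raise ValueError("provide a Unimod accession or a delta mass")


carbamidomethyl = ModificationParam(
    name="Carbamidomethyl",
    aminoacid="C",
    type="Fixed",
    unimod_accession="UNIMOD:4",
)
print(carbamidomethyl)
```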