DO NOT MERGE: add notebook demonstrating proteomic aggregation and example analysis #120
base: main
…n, also added scipy and jupyter to requirements.txt (not sure if jupyter is necessary or assumed to be preexisting, but I had to recreate my venv, so I added it). Changed the pandas version to 2.1.2 because 2.1.1 was raising 'ValueError: numpy.dtype size changed, may indicate binary incompatibility' when pandas was imported in the ipynb (checked, and there appears to be a bug fix in 2.1.2).
…ed overall readme with new notebook name but no link)
proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb
…ue to system storage maintenance
…ns.py. Could not figure out a way to import across different subfolders right now. Also added a table length limit of 6 for pandas df printouts and changed almost all cells to single output.
proteomic_aggregation_and_visualization/python/aggregation_functions.py
@samobermiller - some of my comments might be out of date, feel free to resolve them. I didn't realize I didn't "submit" them last night, doh!
…scussion with Lee Ann. Will still need to change the folder readme. This version includes a t-test and p-values, which may be removed depending on further discussion.
May remove the current t-test and p-value analysis and replace it with a missingness figure.
@@ -0,0 +1,4230 @@ | |||
{ |
I think this might need a little more description, with regard to the in_manifest id and instrument_run id. Are we assuming that the in_manifest id is the same simply because they came from the same study? If so, we probably shouldn't.
That first line of code is to confirm that all the runs have the same id, right? And then the next chunk confirms that the manifest_category is instrument_run. If I got that right, then I think my suggestion is to be more clear in the description above. Something like:
Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.
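The check described in that suggested text could be sketched roughly as follows. This is only an illustration: the record layout (plain dictionaries, with `in_manifest` held as a list of ids), the ids, and the helper name `check_single_instrument_run` are all assumptions, not the notebook's actual code.

```python
def check_single_instrument_run(outputs, manifest_records):
    """Return the shared manifest record, or raise if aggregation looks unsafe.

    outputs: list of dicts, each with an "in_manifest" list (assumed layout).
    manifest_records: dict mapping manifest id -> manifest record (assumed layout).
    """
    # Collect every manifest id referenced by the proteomic outputs.
    manifest_ids = set()
    for output in outputs:
        manifest_ids.update(output["in_manifest"])

    # All runs must point at exactly one manifest record.
    if len(manifest_ids) != 1:
        raise ValueError(f"expected one manifest id, found {sorted(manifest_ids)}")

    (manifest_id,) = manifest_ids
    record = manifest_records[manifest_id]

    # Only an 'instrument_run' manifest indicates successive runs on one instrument.
    if record["manifest_category"] != "instrument_run":
        raise ValueError(
            f"manifest {manifest_id} has category {record['manifest_category']!r}"
        )
    return record


# Made-up example records (hypothetical ids).
outputs = [
    {"id": "output-1", "in_manifest": ["manifest-A"]},
    {"id": "output-2", "in_manifest": ["manifest-A"]},
]
manifest_records = {
    "manifest-A": {"id": "manifest-A", "manifest_category": "instrument_run"},
}
shared_record = check_single_instrument_run(outputs, manifest_records)
```

Raising an error instead of silently continuing keeps outputs from different manifest records from being aggregated by accident.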
@@ -0,0 +1,4230 @@ | |||
{ |
The code below does a little more than putting all the results in the same dataframe. In particular, I think more info about the last part would be good. You could just add more to the comment below, or add more to the explanation above. Either way, the type of protein becomes important below when doing FDR, so I think it's worth adding more info.
Determine the type of protein match for each peptide: contaminant, reverse (false positive match to the reversed amino acid sequence of a protein), or forward (match to the true, forward amino acid sequence of a protein).
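A minimal sketch of that classification step is below. The identifier prefixes (`XXX_` for reversed sequences, `Contaminant_` for contaminants) are assumptions made for illustration; the prefixes actually used in the workflow's result files should be checked before relying on this.

```python
REVERSE_PREFIX = "XXX_"              # assumption: marks reversed (decoy) sequences
CONTAMINANT_PREFIX = "Contaminant_"  # assumption: marks contaminant proteins


def protein_match_type(protein_id):
    """Classify a protein identifier as 'contaminant', 'reverse', or 'forward'."""
    if protein_id.startswith(CONTAMINANT_PREFIX):
        return "contaminant"
    # A reversed contaminant (e.g. "XXX_Contaminant_...") is counted as reverse here.
    if protein_id.startswith(REVERSE_PREFIX):
        return "reverse"
    return "forward"


# Hypothetical identifiers.
example_types = {
    protein_id: protein_match_type(protein_id)
    for protein_id in ["geneA_0001", "XXX_geneA_0001", "Contaminant_TRYP_BOVIN"]
}
```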
@@ -0,0 +1,4230 @@ | |||
{ |
I am not wild about calling this 'noise'. Suggest the following change:
A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained against the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomic Result' files.
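To make the trade-off concrete, here is a rough sketch of choosing a filter value. It assumes the FDR is estimated as reverse hits divided by forward hits among identifications passing the filter (one common target-decoy estimate, not necessarily the one the notebook uses), that lower spectral probabilities are better, and that each hit is a made-up `(spectral_probability, match_type)` pair.

```python
def estimate_fdr(hits, threshold):
    """Estimate FDR as reverse/forward among hits at or below the threshold."""
    forward = sum(1 for prob, kind in hits if prob <= threshold and kind == "forward")
    reverse = sum(1 for prob, kind in hits if prob <= threshold and kind == "reverse")
    if forward == 0:
        return None  # no forward hits retained, so the estimate is undefined
    return reverse / forward


def choose_filter(hits, candidate_thresholds, target_fdr=0.05):
    """Return the most lenient candidate threshold whose estimated FDR meets the target."""
    best = None
    for threshold in sorted(candidate_thresholds):
        fdr = estimate_fdr(hits, threshold)
        if fdr is not None and fdr <= target_fdr:
            best = threshold  # later (larger) passing thresholds retain more hits
    return best


# Made-up hits pooled across all datasets, so one filter applies to all of them.
hits = (
    [(1e-12, "forward")] * 30
    + [(1e-10, "forward")] * 10
    + [(1e-10, "reverse")] * 1
    + [(1e-8, "forward")] * 5
    + [(1e-8, "reverse")] * 5
)
candidates = [1e-12, 1e-10, 1e-8]
filter_at_5_percent = choose_filter(hits, candidates, target_fdr=0.05)
filter_at_1_percent = choose_filter(hits, candidates, target_fdr=0.01)
```

With these made-up numbers the 0.05 target allows the 1e-10 filter (40 forward hits, 1 reverse), while the 0.01 target only allows 1e-12 (30 forward hits, 0 reverse), which illustrates how a stricter FDR costs retained identifications.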
@@ -0,0 +1,4230 @@ | |||
{ |
Would it be informative to show plots of the abundances both before and after normalization?
@@ -0,0 +1,4230 @@ | |||
{ |
Edit:
Identify the razor protein for each peptide. Razor assignment is a method of limiting the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to a most likely matched protein.
The rules are as follows:
- If a peptide is unique to a protein, then that is the razor
- If a peptide belongs to more than one protein, but one of those proteins has another unique peptide, then that protein is the razor
- If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
- If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
- If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
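The rules above could be sketched as follows. This is an illustrative reading, not the notebook's implementation: the function name `assign_razor` and the input layout (peptide mapped to a set of forward proteins) are hypothetical, the unique-peptide rules are assumed to take precedence over the peptide-count rules, and only protein names (not gene annotations) are collapsed.

```python
from collections import defaultdict


def assign_razor(peptide_to_proteins):
    """Map each peptide to a razor protein string, or None if it is removed."""
    # Invert the mapping so peptides can be counted per protein.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins that own at least one peptide mapping to no other protein.
    has_unique = {
        next(iter(proteins))
        for proteins in peptide_to_proteins.values()
        if len(proteins) == 1
    }

    razor = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            # Unique peptide: its only protein is the razor.
            razor[peptide] = next(iter(proteins))
            continue
        with_unique = sorted(p for p in proteins if p in has_unique)
        if len(with_unique) == 1:
            # Exactly one candidate has a unique peptide elsewhere.
            razor[peptide] = with_unique[0]
        elif len(with_unique) > 1:
            # Several candidates have unique peptides: inconclusive, remove.
            razor[peptide] = None
        else:
            # Fall back to peptide counts; ties are collapsed into one string.
            top = max(len(protein_to_peptides[p]) for p in proteins)
            tied = sorted(p for p in proteins if len(protein_to_peptides[p]) == top)
            razor[peptide] = ", ".join(tied)
    return razor


# Made-up peptide-to-protein mappings covering each rule.
peptide_to_proteins = {
    "PEPA": {"P1"},
    "PEPB": {"P1", "P2"},
    "PEPC": {"P3", "P4"},
    "PEPD": {"P3", "P5"},
    "PEPE": {"P6", "P7"},
    "PEPF": {"P8"},
    "PEPG": {"P9"},
    "PEPH": {"P8", "P9"},
}
razor = assign_razor(peptide_to_proteins)
```

In this example PEPB goes to P1 (which has the unique peptide PEPA), PEPC and PEPD go to P3 (most peptides), PEPE collapses to "P6, P7", and PEPH is removed because both P8 and P9 have unique peptides.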
@@ -0,0 +1,4230 @@ | |||
{ |
Edit:
Perform sorted list protein mapping, which is a method of limiting the assignment of degenerate peptides to obtain a single protein identification for each peptide returned. This method does not use information from the original search protein fasta file, and thus is not the same as the popularly used 'first hit' strategy, although it employs a similar logic. It can be performed via the sortedproteins() function in agg_func.
It is defined as:
- If a peptide is unique to a protein, then that is the sorted_list protein
- If a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the sorted_list protein
- If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that is the sorted_list protein
- If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then the sorted_list is the first protein in a sorted list of all proteins across datasets
- If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
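The tie-breaking rule is the part that distinguishes this mapping from a collapsed-string approach, so here is a small sketch of just that step. The helper name `break_tie_by_sorted_list` is hypothetical, plain lexicographic sorting is an assumption (the sort order is not stated above), and this is not the actual sortedproteins() implementation.

```python
def break_tie_by_sorted_list(tied_proteins, all_proteins):
    """Pick the tied protein that comes first in a sorted list of all proteins."""
    # Build one sorted list over every protein seen across all datasets,
    # so the same protein wins the tie no matter which dataset is examined.
    sorted_proteins = sorted(set(all_proteins))
    rank = {protein: position for position, protein in enumerate(sorted_proteins)}
    return min(tied_proteins, key=rank.__getitem__)


# Made-up protein identifiers pooled across datasets.
all_proteins = ["P9", "P6", "P7", "P1"]
chosen = break_tie_by_sorted_list({"P7", "P6"}, all_proteins)
```

Because the ranking comes from the pooled protein list rather than from the order in the search fasta file, the result is deterministic across datasets, which is the sense in which it resembles but differs from a 'first hit' strategy.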
@@ -0,0 +1,4230 @@ | |||
{ |
Combine sortedmapping information with relative abundance values since sortedmapping returns a single protein for each peptide
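As a rough illustration of that combination step, the sketch below attaches one protein to each abundance row. The field names (`peptide`, `sample`, `relative_abundance`, `sorted_list_protein`) and the values are made up, and peptides missing from the mapping (for example, ones removed as inconclusive) are assumed to be dropped.

```python
# Hypothetical output of the sorted-list mapping: one protein per retained peptide.
sorted_mapping = {"PEPA": "P1", "PEPB": "P1", "PEPE": "P6"}

# Hypothetical long-format relative abundance rows.
abundance_rows = [
    {"peptide": "PEPA", "sample": "sample_1", "relative_abundance": 0.40},
    {"peptide": "PEPB", "sample": "sample_1", "relative_abundance": 0.25},
    {"peptide": "PEPE", "sample": "sample_2", "relative_abundance": 0.10},
    {"peptide": "PEPH", "sample": "sample_2", "relative_abundance": 0.05},
]

# Because each peptide maps to exactly one protein, the join never duplicates rows.
combined = [
    {**row, "sorted_list_protein": sorted_mapping[row["peptide"]]}
    for row in abundance_rows
    if row["peptide"] in sorted_mapping
]
```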
@@ -0,0 +1,4230 @@ | |||
{ |
Starting here, I think we want to modify the rest of the notebook.
One thing I would want the notebook to do from this point is to generate the input files needed for pMart. The table above is one, and the other would be the metadata table (rows with the sample identifiers and columns with the metadata factors like depth). Analysis of the combined data should be done in the software that we built for that purpose.
Gathering the annotation information for the proteins is useful, and so we should probably figure out where to put that information.
But searching across NMDC for other datasets with shared annotations or pathways is not something we want to do with this notebook.
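A sketch of writing the two tables mentioned in this comment (an abundance table and a sample metadata table) is below. The column names, sample ids, and values are made up, and the exact layout pMart expects is an assumption that should be checked against its documentation.

```python
import csv
import io

# Hypothetical aggregated abundances keyed by (protein, sample id).
abundance = {
    ("P1", "sample_1"): 0.4,
    ("P1", "sample_2"): 0.1,
    ("P6", "sample_2"): 0.5,
}

# Hypothetical metadata: one row per sample, one column per factor.
sample_metadata = [
    {"sample_id": "sample_1", "depth": "0-10cm"},
    {"sample_id": "sample_2", "depth": "10-20cm"},
]


def write_abundance_table(abundance, sample_ids, handle):
    """Write a wide table: one row per protein, one column per sample."""
    writer = csv.writer(handle)
    writer.writerow(["protein"] + sample_ids)
    for protein in sorted({protein for protein, _ in abundance}):
        # Missing values are left blank rather than filled with zero.
        writer.writerow([protein] + [abundance.get((protein, s), "") for s in sample_ids])


def write_metadata_table(sample_metadata, handle):
    """Write one row per sample with its metadata factors as columns."""
    writer = csv.DictWriter(handle, fieldnames=list(sample_metadata[0]))
    writer.writeheader()
    writer.writerows(sample_metadata)


# In-memory buffers stand in for real output files in this sketch.
sample_ids = [row["sample_id"] for row in sample_metadata]
abundance_buffer = io.StringIO()
metadata_buffer = io.StringIO()
write_abundance_table(abundance, sample_ids, abundance_buffer)
write_metadata_table(sample_metadata, metadata_buffer)
abundance_text = abundance_buffer.getvalue()
metadata_text = metadata_buffer.getvalue()
```

Taking the sample column order from the metadata rows keeps the two tables aligned on the same sample identifiers.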
#95
New Notebook Submissions: