DO NOT MERGE: add notebook demonstrating proteomic aggregation and example analysis #120

samobermiller · 2025-01-31T21:00:21Z

New Notebook Submissions:

Have you included a summary of the notebook in the README.md, including updated links to the notebook?
Does your PR include links to the new notebook (in the branch) for review using nbviewer, Colab, and reviewnb? These three are the preferred ways to review changes and additions to notebooks during review.
Does your PR include a test in a github workflow that tests the render-ability of your notebook?

…n, also added scipy and jupyter to requirements.txt (not sure if jupyter necessary or assumed as preexisting but had to recreate my venv so added it). changed pandas version to 2.1.2 because 2.1.1 was generating: 'ValueError: numpy.dtype size changed, may indicate binary incompatibility', when pandas was being imported in ipynb (checked and appears there was bug fix in 2.1.2)

review-notebook-app · 2025-01-31T21:00:27Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

kheal · 2025-02-04T18:30:42Z

@samobermiller - some of my comments might be out of date, feel free to resolve them. I didn't realize I didn't "submit" them last night doh!

samobermiller · 2025-02-06T01:33:20Z

May remove the current t-test and p-value analysis and replace them with a missingness figure.

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

@@ -0,0 +1,4230 @@
+{


I think this might need a little more description, with regard to the in_manifest id and instrument_run id. Are we assuming that the in_manifest id is the same simply because they came from the same study? If so, we probably shouldn't.
That first line of code is to confirm that all the runs have the same id, right? And then the next chunk confirms that the manifest_category is instrument_run. If I got that right, then I think my suggestion is to be more clear in the description above. Something like:

Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.
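
A minimal sketch of the check described above, written in plain Python so it stands alone. The record layout (`in_manifest` stored as a list, the `manifest_category` key, and the example ids) is assumed for illustration and may not match the notebook's dataframes exactly.

```python
def shared_manifest_id(outputs):
    """Return the single in_manifest id shared by all outputs, else raise."""
    ids = set()
    for record in outputs:
        # Assumption: in_manifest is stored as a list of manifest ids.
        ids.update(record["in_manifest"])
    if len(ids) != 1:
        raise ValueError(f"expected one manifest id, found {sorted(ids)}")
    return ids.pop()


def is_instrument_run(manifest_record):
    """True when the manifest record groups runs from one instrument run."""
    return manifest_record.get("manifest_category") == "instrument_run"


# Hypothetical records, for illustration only.
outputs = [
    {"id": "nmdc:dobj-example-1", "in_manifest": ["nmdc:manifest-example-1"]},
    {"id": "nmdc:dobj-example-2", "in_manifest": ["nmdc:manifest-example-1"]},
]
manifest = {"id": "nmdc:manifest-example-1", "manifest_category": "instrument_run"}

manifest_id = shared_manifest_id(outputs)
safe_to_aggregate = manifest_id == manifest["id"] and is_instrument_run(manifest)
```

Raising on mixed manifest ids, rather than silently picking one, keeps outputs from different manifest records from being aggregated by accident.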

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

The code below does a little more than putting all the results in the same dataframe. In particular, I think more info about the last part would be good. You could just add more to the comment below, or add more to the explanation above. Either way, the type of protein becomes important below when doing FDR, so I think it's worth adding more info.
Determine the type of protein match for each peptide: contaminant, reverse (false positive match to the reversed amino acid sequence of a protein), or forward (match to the true, forward amino acid sequence of a protein).
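
As a rough illustration of that labelling step, the sketch below classifies protein identifiers by prefix. The two prefixes are assumptions made for this example, not values confirmed from the workflow output; substitute whatever markers the result files actually use.

```python
# Assumed markers, for illustration only; check the real result files.
CONTAMINANT_PREFIX = "Contaminant_"
REVERSE_PREFIX = "XXX_"


def protein_type(protein_id):
    """Label a protein match as contaminant, reverse, or forward."""
    if protein_id.startswith(CONTAMINANT_PREFIX):
        return "contaminant"
    if protein_id.startswith(REVERSE_PREFIX):
        return "reverse"
    return "forward"


# Hypothetical identifiers.
labels = {p: protein_type(p) for p in ["Contaminant_TRYP_PIG", "XXX_protA", "protA"]}
```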

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

I am not wild about calling this 'noise'. Suggest the following change:
A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained with the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomic Result' files.
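
To make the trade-off concrete, here is a small sketch of choosing a filter value. It rests on two assumptions that are not stated in this thread: a lower spectral probability means a better match, and FDR is estimated as the number of reverse matches divided by the number of forward matches among the matches that pass the filter. The data are made up.

```python
def estimate_fdr(psms, threshold):
    """psms: iterable of (spectral_probability, protein_type) pairs.

    Returns (forward matches kept, estimated FDR) at the given threshold.
    Assumption: FDR is estimated as reverse / forward; contaminants are ignored.
    """
    kept = [ptype for prob, ptype in psms if prob <= threshold]
    forward = kept.count("forward")
    reverse = kept.count("reverse")
    if forward == 0:
        return forward, (float("inf") if reverse else 0.0)
    return forward, reverse / forward


def loosest_threshold_under(psms, candidates, target_fdr=0.05):
    """Loosest candidate threshold whose estimated FDR is at or below target."""
    passing = [t for t in candidates if estimate_fdr(psms, t)[1] <= target_fdr]
    return max(passing) if passing else None


# Made-up matches pooled across all datasets being compared.
psms = [
    (1e-12, "forward"), (1e-11, "forward"), (1e-10, "forward"), (1e-10, "forward"),
    (1e-9, "forward"), (1e-9, "reverse"), (1e-8, "forward"), (1e-8, "reverse"),
]
chosen = loosest_threshold_under(psms, [1e-10, 1e-9, 1e-8], target_fdr=0.05)
```

The threshold is picked once from the pooled matches and then applied to every dataset, which is what keeps the datasets comparable.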

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Collapse to unique peptides and normalize their relative abundance
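
One way this step could look, sketched in plain Python. Both choices here are assumptions rather than the notebook's confirmed method: repeated observations of a peptide within a dataset are combined with a caller-supplied function (`max` in the example, chosen arbitrarily), and normalization is a log2 transform followed by median centering within each dataset.

```python
import math
from collections import defaultdict
from statistics import median


def collapse_peptides(rows, combine=max):
    """rows: (dataset_id, peptide, abundance). One value per dataset/peptide."""
    grouped = defaultdict(list)
    for dataset_id, peptide, abundance in rows:
        grouped[(dataset_id, peptide)].append(abundance)
    return {key: combine(values) for key, values in grouped.items()}


def log2_median_center(collapsed):
    """Log2-transform, then subtract each dataset's median log2 abundance."""
    logged = {key: math.log2(value) for key, value in collapsed.items() if value > 0}
    per_dataset = defaultdict(list)
    for (dataset_id, _), value in logged.items():
        per_dataset[dataset_id].append(value)
    centers = {dataset_id: median(values) for dataset_id, values in per_dataset.items()}
    return {key: value - centers[key[0]] for key, value in logged.items()}


# Made-up rows: run1 observes PEPTIDEA twice.
rows = [
    ("run1", "PEPTIDEA", 8.0), ("run1", "PEPTIDEA", 4.0), ("run1", "PEPTIDEB", 2.0),
    ("run2", "PEPTIDEA", 16.0), ("run2", "PEPTIDEB", 16.0),
]
collapsed = collapse_peptides(rows)
normalized = log2_median_center(collapsed)
```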

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Would it be informative to show plots of the abundances both before and after normalization?

lamccue · 2025-02-10T17:37:56Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

I don't understand this step. Why would we sum abundance values?

lamccue · 2025-02-10T17:37:57Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Edit:
Identify the razor protein, which is a method of limiting the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to a most likely matched protein.
The rules are as follows:
If a peptide is unique to a protein, then that is the razor
If a peptide belongs to more than one protein, but one of those proteins has another unique peptide, then that protein is the razor
If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
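
The rules above can be sketched as follows. This is an illustrative reading of the rules, not the notebook's implementation: the function name, the input shape (peptide mapped to a set of forward proteins), and the `", "` separator for collapsed ties are all assumptions, and only protein ids are collapsed here, not gene annotations.

```python
from collections import defaultdict


def assign_razor(peptide_to_proteins):
    """Map each peptide to its razor protein; None marks a removed peptide."""
    peptides_of = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            peptides_of[protein].add(peptide)

    # Proteins that own at least one peptide no other protein shares.
    has_unique = {
        next(iter(proteins))
        for proteins in peptide_to_proteins.values()
        if len(proteins) == 1
    }

    razor = {}
    for peptide, proteins in peptide_to_proteins.items():
        candidates = sorted(proteins)
        with_unique = [p for p in candidates if p in has_unique]
        if len(candidates) == 1:
            razor[peptide] = candidates[0]      # unique peptide
        elif len(with_unique) == 1:
            razor[peptide] = with_unique[0]     # one candidate has a unique peptide
        elif len(with_unique) > 1:
            razor[peptide] = None               # inconclusive, drop the peptide
        else:
            top = max(len(peptides_of[p]) for p in candidates)
            best = [p for p in candidates if len(peptides_of[p]) == top]
            razor[peptide] = ", ".join(best)    # single winner, or collapsed tie
    return razor


# Hypothetical peptide-to-protein matches.
example = {
    "pep1": {"A"},
    "pep2": {"A", "B"},
    "pep3": {"C", "D"},
    "pep4": {"C", "D"},
    "pep5": {"C", "E"},
    "pep6": {"F"},
    "pep7": {"A", "F"},
    "pep8": {"G", "H"},
}
razor = assign_razor(example)
```

In this example `pep7` is dropped because both A and F have their own unique peptide, and `pep8` collapses to a combined G and H entry because the two proteins tie.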

lamccue · 2025-02-10T17:37:57Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Edit:
Perform sorted list protein mapping, which is a method of limiting the assignment of degenerate peptides to obtain a single protein identification for each peptide returned. This method does not use information from the original search protein fasta file, and thus is not the same as the popularly used 'first hit' strategy, although it employs a similar logic. It can be performed via the sortedproteins() function in agg_func.
It is defined as:
If a peptide is unique to a protein, then that is the sorted_list protein
If a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the sorted_list protein
If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that is the sorted_list protein
If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then the sorted_list is the first protein in a sorted list of all proteins across datasets
If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
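
For illustration, the definition above could be sketched like this. It is not the `sortedproteins()` function itself; the function name and input shape are hypothetical, and "sorted" is assumed to mean plain lexicographic order of protein ids across all datasets.

```python
from collections import defaultdict


def assign_sorted_list(peptide_to_proteins):
    """Map each peptide to one sorted_list protein; None marks a removed peptide."""
    peptides_of = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            peptides_of[protein].add(peptide)

    # Assumption: rank proteins by lexicographic order of their ids.
    rank = {protein: i for i, protein in enumerate(sorted(peptides_of))}
    has_unique = {
        next(iter(proteins))
        for proteins in peptide_to_proteins.values()
        if len(proteins) == 1
    }

    assigned = {}
    for peptide, proteins in peptide_to_proteins.items():
        with_unique = [p for p in proteins if p in has_unique]
        if len(proteins) == 1:
            assigned[peptide] = next(iter(proteins))
        elif len(with_unique) == 1:
            assigned[peptide] = with_unique[0]
        elif with_unique:
            assigned[peptide] = None  # more than one candidate has a unique peptide
        else:
            top = max(len(peptides_of[p]) for p in proteins)
            tied = [p for p in proteins if len(peptides_of[p]) == top]
            assigned[peptide] = min(tied, key=rank.get)  # first in sorted order
    return assigned


# Hypothetical peptide-to-protein matches.
example = {
    "pep1": {"A"},
    "pep2": {"A", "B"},
    "pep3": {"G", "H"},
    "pep4": {"C", "D"},
    "pep5": {"C", "D"},
    "pep6": {"C", "E"},
}
sorted_list = assign_sorted_list(example)
```

Unlike the collapsed-string outcome of a razor tie, every retained peptide here ends up with exactly one protein, which is what allows a one-to-one join with abundance values later.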

lamccue · 2025-02-10T17:37:57Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Combine sortedmapping information with relative abundance values since sortedmapping returns a single protein for each peptide
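
A small sketch of that combination, assuming the abundance values are keyed by dataset and peptide and the mapping is a plain peptide-to-protein dictionary (both shapes are hypothetical). Because each peptide maps to at most one protein, the join cannot duplicate abundance rows.

```python
def attach_proteins(abundance, peptide_to_protein):
    """abundance: {(dataset_id, peptide): value}. Returns rows with the mapped protein."""
    rows = []
    for (dataset_id, peptide), value in sorted(abundance.items()):
        protein = peptide_to_protein.get(peptide)
        if protein is None:
            continue  # peptide was removed as inconclusive, or never mapped
        rows.append(
            {
                "dataset_id": dataset_id,
                "peptide": peptide,
                "protein": protein,
                "abundance": value,
            }
        )
    return rows


# Made-up values; pep7 was removed by the mapping step.
abundance = {("run1", "pep1"): 1.0, ("run1", "pep7"): 0.5, ("run2", "pep1"): -1.0}
mapping = {"pep1": "A", "pep7": None}
combined = attach_proteins(abundance, mapping)
```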

lamccue · 2025-02-10T17:37:57Z

proteomic_aggregation_and_visualization/python/proteomic_aggregation_and_visualization.ipynb

Starting here, I think we want to modify the rest of the notebook.
One thing I would want the notebook to do from this point, is to generate the input files needed for pMart. The table above is one, and the other would be the metadata table (rows with the sample identifiers and columns with the metadata factors like depth). Analysis of the combined data should be done in the software that we built for that purpose.
Gathering the annotation information for the proteins is useful, and so we should probably figure out where to put that information.
But searching across NMDC for other datasets with shared annotations or pathways is not something we want to do with this notebook.
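
For the two pMart inputs mentioned above, a rough sketch of what the export could look like. The column names (`peptide`, `sample_id`), the wide layout, and the example values are assumptions made for illustration; pMart's actual upload requirements should be checked before settling on a format.

```python
import csv
import io


def abundance_csv(abundance, samples):
    """Wide table: one row per peptide, one column per sample, blank if missing."""
    peptides = sorted({peptide for _, peptide in abundance})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["peptide", *samples])
    for peptide in peptides:
        writer.writerow([peptide, *(abundance.get((s, peptide), "") for s in samples)])
    return buffer.getvalue()


def metadata_csv(metadata, factors):
    """One row per sample identifier, one column per metadata factor."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", *factors])
    for sample_id in sorted(metadata):
        writer.writerow([sample_id, *(metadata[sample_id].get(f, "") for f in factors)])
    return buffer.getvalue()


# Made-up values keyed by (sample, peptide); depth is the example factor.
abundance = {("run1", "PEPA"): 1.5, ("run2", "PEPA"): 0.5, ("run1", "PEPB"): -1.0}
metadata = {"run1": {"depth": "0-10cm"}, "run2": {"depth": "10-20cm"}}

abundance_table = abundance_csv(abundance, ["run1", "run2"])
metadata_table = metadata_csv(metadata, ["depth"])
```

Writing to strings keeps the sketch free of file paths; in the notebook the same rows would be written to files.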

samobermiller added 2 commits January 31, 2025 12:12

rerender of notebook

163d343

samobermiller linked an issue Jan 31, 2025 that may be closed by this pull request

Create notebook that aggregates proteomic workflow outputs and visualizes the results #95

Open

10 tasks

samobermiller requested a review from bmeluch January 31, 2025 21:00

samobermiller added 3 commits January 31, 2025 13:03

Remove issue_entry from tracking

3a8f766

updated within notebook folder readme colab and nbviewer links, updated overall readme with new notebook name but no link)

01b8bd2

readme issue

84b49cd

bmeluch reviewed Feb 1, 2025

samobermiller added 3 commits February 3, 2025 10:45

updated based on Beas comments in PR

4235871

added spaces to list of steps

2f5aa7f

another render edit

425c579

samobermiller requested a review from kheal February 3, 2025 18:52

reran notebook with information for all relevant gffs (prev delayed due to system storage maintenance

03ea76e

bmeluch reviewed Feb 4, 2025

samobermiller added 2 commits February 4, 2025 10:02

copied nmdc_api and removed repeat functions from aggregation_functions.py. could not figure out way to import across different subfolders right now. also added table length limit of 6 for pandas df printouts and changed almost all cells to single output

b0784d9

checks failed, retry push

add91ba

kheal reviewed Feb 4, 2025

adjustments based on katherine comments, remaining bea comment and discussion with lee ann. will still need to change in folder readme. this version includes t test and pvalues, may remove depending on further discussion

71efd7b

samobermiller requested a review from bmeluch February 6, 2025 01:34

bmeluch requested a review from lamccue February 10, 2025 17:26

lamccue reviewed Feb 10, 2025
