DO NOT MERGE: add notebook demonstrating proteomic aggregation and example analysis #120

Draft: wants to merge 12 commits into base: main

samobermiller
Collaborator

@samobermiller samobermiller commented Jan 31, 2025

#95

New Notebook Submissions:

  • Have you included a summary of the notebook in the README.md, including updated links to the notebook?
  • Does your PR include links to the new notebook (in the branch) for review using nbviewer, Colab, and reviewnb? These three are the preferred ways to review changes and additions to notebooks during review.
  • Does your PR include a test in a github workflow that tests the render-ability of your notebook?

…n, also added scipy and jupyter to requirements.txt (not sure if jupyter necessary or assumed as preexisting but had to recreate my venv so added it). changed pandas version to 2.1.2 because 2.1.1 was generating: 'ValueError: numpy.dtype size changed, may indicate binary incompatibility', when pandas was being imported in ipynb (checked and appears there was bug fix in 2.1.2)
@samobermiller samobermiller requested a review from bmeluch January 31, 2025 21:00
@samobermiller samobermiller requested a review from kheal February 3, 2025 18:52
…ns.py. could not figure out way to import across different subfolders right now. also added table length limit of 6 for pandas df printouts and changed almost all cells to single output
@kheal
Collaborator

kheal commented Feb 4, 2025

@samobermiller - some of my comments might be out of date, feel free to resolve them. I didn't realize I didn't "submit" them last night doh!

…scussion with lee ann. will still need to change in folder readme. this version includes t test and pvalues, may remove depending on further discussion
@samobermiller
Collaborator Author

may remove current t test and pvalue analysis and replace with missingness figure

@samobermiller samobermiller requested a review from bmeluch February 6, 2025 01:34
@bmeluch bmeluch requested a review from lamccue February 10, 2025 17:26
@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

I think this might need a little more description, with regard to the in_manifest id and instrument_run id. Are we assuming that the in_manifest id is the same simply because they came from the same study? If so, we probably shouldn't.

That first line of code is to confirm that all the runs have the same id, right? And then the next chunk confirms that the manifest_category is instrument_run. If I got that right, then I think my suggestion is to be more clear in the description above. Something like:

Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.
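A minimal sketch of that check, assuming the proteomic outputs have been loaded into a pandas DataFrame with one flat `in_manifest` id per row and that the manifest record is available as a plain dict. The helper names, column layout, and example ids are hypothetical and not taken from the notebook:

```python
import pandas as pd


def shared_manifest_id(outputs: pd.DataFrame) -> str:
    """Return the single in_manifest id shared by every output, or raise."""
    ids = outputs["in_manifest"]
    if ids.isna().any() or ids.nunique() != 1:
        raise ValueError(
            "outputs span several manifest records (or lack one); do not aggregate"
        )
    return ids.iloc[0]


def is_instrument_run(manifest_record: dict) -> bool:
    """True when the manifest record groups runs from one instrument run."""
    return manifest_record.get("manifest_category") == "instrument_run"


# Hypothetical example data standing in for the notebook's API results.
outputs = pd.DataFrame(
    {
        "output_id": ["out-1", "out-2", "out-3"],
        "in_manifest": ["manifest-A", "manifest-A", "manifest-A"],
    }
)
manifest_id = shared_manifest_id(outputs)
record = {"id": manifest_id, "manifest_category": "instrument_run"}
can_aggregate = is_instrument_run(record)
```

Raising on a mixed or missing id, rather than silently taking the first one, keeps outputs from different manifest records from being aggregated by accident.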


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

The code below does a little more than putting all the results in the same dataframe. In particular, I think more info about the last part would be good. You could just add more to the comment below, or add more to the explanation above. Either way, the type of protein becomes important below when doing FDR, so I think it's worth adding more info.

Determine the type of protein match for each peptide: contaminant, reverse (false positive match to the reversed amino acid sequence of a protein), or forward (match to the true, forward amino acid sequence of a protein).
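As an illustration of that labelling step, here is a small sketch that assumes the match type can be read from a prefix on the protein identifier. The prefixes `Contaminant_` and `XXX_`, the column names, and the example identifiers are placeholders; the notebook's actual identifiers may follow a different convention:

```python
import pandas as pd


def classify_match(
    protein_id: str,
    contaminant_prefix: str = "Contaminant_",  # assumed prefix, adjust as needed
    reverse_prefix: str = "XXX_",  # assumed prefix, adjust as needed
) -> str:
    """Label a protein match as contaminant, reverse, or forward."""
    if protein_id.startswith(contaminant_prefix):
        return "contaminant"
    if protein_id.startswith(reverse_prefix):
        return "reverse"
    return "forward"


# Hypothetical peptide-to-protein matches.
peptides = pd.DataFrame(
    {
        "peptide": ["AAK", "LLR", "GGK"],
        "protein": ["XXX_protein_1", "Contaminant_keratin", "protein_2"],
    }
)
peptides["match_type"] = peptides["protein"].map(classify_match)
```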


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

I am not wild about calling this 'noise'. Suggest the following change:

A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained against the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomic Result' files.
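The trade-off can be sketched roughly as follows. This assumes lower spectral probability values mean more confident matches, and it estimates FDR as the ratio of reverse to forward matches that pass the filter, which is one common target-decoy estimate and may not be the estimator the notebook uses. The column names `spec_prob` and `match_type` and the example values are hypothetical:

```python
import pandas as pd


def estimate_fdr(matches: pd.DataFrame, threshold: float) -> float:
    """Estimate FDR among matches that pass the spectral probability filter."""
    kept = matches[matches["spec_prob"] <= threshold]
    n_reverse = (kept["match_type"] == "reverse").sum()
    n_forward = (kept["match_type"] == "forward").sum()
    # No forward matches retained: the estimate is undefined.
    return n_reverse / n_forward if n_forward else float("nan")


def pick_threshold(matches: pd.DataFrame, candidates, target_fdr: float = 0.05):
    """Return the loosest candidate threshold whose estimated FDR meets the target."""
    passing = [t for t in candidates if estimate_fdr(matches, t) <= target_fdr]
    return max(passing) if passing else None


# Hypothetical matches pooled across all datasets being compared.
matches = pd.DataFrame(
    {
        "spec_prob": [1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-8],
        "match_type": ["forward", "forward", "forward", "forward", "reverse", "forward"],
    }
)
chosen = pick_threshold(matches, candidates=[1e-10, 1e-9, 1e-8], target_fdr=0.05)
```

Choosing the loosest passing threshold keeps as many forward identifications as possible while staying at or under the target FDR. Computing it on the pooled matches gives a single filter value that can then be applied to every dataset.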


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Collapse to unique peptides and normalize their relative abundance


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Would it be informative to show plots of the abundances both before and after normalization?


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

I don't understand this step. Why would we sum abundance values?


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Edit:

Identify the razor protein, which is a method of limiting the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to a most likely matched protein.

The rules are as follows:

  • If a peptide is unique to a protein, then that is the razor
  • If a peptide belongs to more than one protein, but one of those proteins has another unique peptide, then that protein is the razor
  • If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
  • If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
  • If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
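The rules above could be sketched like this. It is not the notebook's implementation: the function name and the input shape (a dict from peptide to the set of forward proteins it matches) are assumptions, and so is the precedence, which checks for proteins with unique peptides before comparing peptide counts:

```python
from collections import defaultdict


def assign_razor(peptide_to_proteins: dict) -> dict:
    """Map each peptide to its razor protein, or None if the peptide is dropped."""
    # Invert the mapping: protein -> set of peptides matched to it.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins that have at least one peptide matching no other protein.
    has_unique = {
        next(iter(proteins))
        for proteins in peptide_to_proteins.values()
        if len(proteins) == 1
    }

    razor = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            # Peptide is unique to one protein.
            razor[peptide] = next(iter(proteins))
            continue
        anchored = sorted(p for p in proteins if p in has_unique)
        if len(anchored) == 1:
            # Exactly one candidate protein has a unique peptide elsewhere.
            razor[peptide] = anchored[0]
        elif len(anchored) > 1:
            # Several candidates have unique peptides: mapping is inconclusive.
            razor[peptide] = None
        else:
            # Fall back to the candidate(s) with the most peptides.
            top = max(len(protein_to_peptides[p]) for p in proteins)
            tied = sorted(p for p in proteins if len(protein_to_peptides[p]) == top)
            # A tie is collapsed into a single string.
            razor[peptide] = tied[0] if len(tied) == 1 else ", ".join(tied)
    return razor


# Hypothetical mapping that exercises each rule once.
example = {
    "PEPA": {"P1"},
    "PEPB": {"P1", "P2"},
    "PEPC": {"P3", "P4"},
    "PEPD": {"P3", "P5"},
    "PEPE": {"P6", "P7"},
    "PEPF": {"P1", "P8"},
    "PEPG": {"P8"},
}
razor = assign_razor(example)
```

Only protein strings are collapsed here; the gene annotations mentioned in the fourth rule would need the same treatment.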


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Edit:

Perform sorted list protein mapping, which is a method of limiting the assignment of degenerate peptides to obtain a single protein identification for each peptide returned. This method does not use information from the original search protein fasta file, and thus is not the same as the popularly used 'first hit' strategy, although it employs a similar logic. It can be performed via the sortedproteins() function in agg_func.

It is defined as:

  • If a peptide is unique to a protein, then that is the sorted_list protein
  • If a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the sorted_list protein
  • If a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that is the sorted_list protein
  • If a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then the sorted_list is the first protein in a sorted list of all proteins across datasets
  • If a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
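The fourth rule, where several proteins tie on peptide count, is the part that relies on a sorted list, so a sketch of just that tie-break may help. This is a hypothetical helper and not the `sortedproteins()` function in `agg_func`; it assumes plain lexicographic sorting of protein identifiers:

```python
def sorted_list_tiebreak(tied_proteins, all_proteins):
    """Pick the tied protein that appears first in the sorted list of all proteins."""
    # Rank every protein seen across datasets by its position in the sorted list.
    rank = {protein: i for i, protein in enumerate(sorted(set(all_proteins)))}
    return min(tied_proteins, key=rank.__getitem__)


# Hypothetical example: P6 and P7 tie; P6 comes first in the global sorted list.
all_proteins = ["P9", "P6", "P7", "P1", "P6"]
winner = sorted_list_tiebreak({"P7", "P6"}, all_proteins)
```

Because the ranking is built from all proteins across datasets, the same peptide resolves to the same protein in every dataset, which is what makes the aggregated results comparable.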


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Combine sortedmapping information with relative abundance values since sortedmapping returns a single protein for each peptide


@@ -0,0 +1,4230 @@
{

@lamccue lamccue Feb 10, 2025

Starting here, I think we want to modify the rest of the notebook.

One thing I would want the notebook to do from this point is to generate the input files needed for pMart. The table above is one, and the other would be the metadata table (rows with the sample identifiers and columns with the metadata factors like depth). Analysis of the combined data should be done in the software that we built for that purpose.
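A rough sketch of producing those two tables with pandas, assuming the aggregated results are in long format (one row per peptide and sample). The column names, the wide layout with one column per sample, and the example values are assumptions; the exact format pMart expects should be checked against its documentation:

```python
import pandas as pd


def build_pmart_tables(long_abundance: pd.DataFrame, sample_metadata: pd.DataFrame):
    """Return (abundance table, metadata table) restricted to shared samples."""
    # One row per peptide, one column per sample; raises if a pair is duplicated.
    abundance = long_abundance.pivot(
        index="peptide", columns="sample_id", values="abundance"
    ).reset_index()
    abundance.columns.name = None
    # Keep only metadata rows for samples present in the abundance table.
    in_table = sample_metadata["sample_id"].isin(abundance.columns)
    metadata = sample_metadata[in_table].reset_index(drop=True)
    return abundance, metadata


# Hypothetical example data.
long_abundance = pd.DataFrame(
    {
        "peptide": ["PEPA", "PEPA", "PEPB"],
        "sample_id": ["S1", "S2", "S1"],
        "abundance": [10.0, 12.0, 5.0],
    }
)
sample_metadata = pd.DataFrame(
    {"sample_id": ["S1", "S2", "S3"], "depth": [0.1, 0.3, 0.5]}
)
abundance, metadata = build_pmart_tables(long_abundance, sample_metadata)
```

Each table can then be written out with `DataFrame.to_csv(path, index=False)`, using whatever file names the notebook settles on.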

Gathering the annotation information for the proteins is useful, and so we should probably figure out where to put that information.

But searching across NMDC for other datasets with shared annotations or pathways is not something we want to do with this notebook.


Development

Successfully merging this pull request may close these issues.

Create notebook that aggregates proteomic workflow outputs and visualizes the results
4 participants