Optimal Gene Filtering for TPM Expression Data in RNA-seq Analysis: Impact of Non-Protein Coding Biotypes on Hallmark Enrichment Analysis #28

Open
snijesh opened this issue Nov 29, 2023 · 1 comment


snijesh commented Nov 29, 2023

Hello members,

I am currently working with TPM expression data obtained from RNA-seq analysis, and my dataset includes a diverse range of biotypes such as miRNA, lncRNA, pseudogenes, etc., resulting in a total of around 60,000 genes. As I intend to perform enrichment analysis (ssGSEA) using the hallmark gene list from MSigDB, I am faced with a crucial decision regarding whether to filter the data based on biotype='protein coding'.

Given the diverse nature of the genes in my dataset, I am uncertain about the potential impact of including non-protein coding biotypes on the enrichment analysis. Filtering by biotype='protein coding' seems like a logical step to focus on protein-coding genes relevant to the hallmark pathways, but I would like to seek the community's advice and experiences on this matter.

Here are some specific questions to guide the discussion:

  1. In the context of hallmark pathway enrichment analysis, what are the potential advantages and disadvantages of including non-protein coding genes in the dataset?

  2. Has anyone encountered similar scenarios with a diverse set of biotypes in RNA-seq data, and if so, what criteria did you use for gene filtering, especially concerning biotypes?

  3. Are there specific biotypes, such as miRNA, lncRNA, or pseudogenes, that are known to significantly impact or contribute to hallmark pathway enrichment analysis?

  4. How does the choice of gene filtering criteria, specifically regarding biotype, affect the biological interpretation of enrichment analysis results using hallmark gene sets?

I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.

Thank you in advance for your assistance!

Collaborator

drmani commented Nov 29, 2023

As far as I know, MSigDB gene sets contain only protein-coding genes. Including non-protein-coding biotypes adds genes that belong to no gene set, which will affect your enrichment scores and may dilute any enrichment signal that is present. So the best approach is to filter out non-protein-coding genes before running ssGSEA.
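For illustration, the filtering step suggested above could look roughly like the sketch below. It is not taken from this project: the function names, the toy TPM values, and the gene-to-biotype mapping are all made up for the example, and the label `protein_coding` assumes Ensembl/GENCODE-style biotype names, so adjust it to whatever your annotation uses. The second helper reports which genes of a gene set are still present after filtering, which is one way to check whether the filter removed anything the gene set actually needs.

```python
def filter_by_biotype(tpm_by_gene, biotype_by_gene, keep_biotype="protein_coding"):
    """Return only the genes whose annotated biotype equals keep_biotype.

    tpm_by_gene:     dict mapping gene symbol -> list of TPM values (one per sample)
    biotype_by_gene: dict mapping gene symbol -> biotype label (hypothetical annotation)
    Genes with no biotype annotation are dropped.
    """
    return {
        gene: values
        for gene, values in tpm_by_gene.items()
        if biotype_by_gene.get(gene) == keep_biotype
    }


def gene_set_overlap(gene_set, tpm_by_gene):
    """Split a gene set into genes present in the expression table and genes missing from it."""
    found = sorted(g for g in gene_set if g in tpm_by_gene)
    missing = sorted(g for g in gene_set if g not in tpm_by_gene)
    return found, missing


# Toy data, invented for this example only.
tpm = {
    "TP53": [12.5, 8.1],
    "GAPDH": [950.0, 1020.3],
    "MALAT1": [300.2, 280.7],
    "MIR21": [5.0, 7.5],
    "GAPDHP1": [1.2, 0.9],
}
biotypes = {
    "TP53": "protein_coding",
    "GAPDH": "protein_coding",
    "MALAT1": "lncRNA",
    "MIR21": "miRNA",
    "GAPDHP1": "processed_pseudogene",
}

filtered = filter_by_biotype(tpm, biotypes)
print(f"kept {len(filtered)} of {len(tpm)} genes")

# A made-up three-gene set standing in for a hallmark gene set.
example_set = {"TP53", "GAPDH", "MYC"}
found, missing = gene_set_overlap(example_set, filtered)
print("found:", found)
print("missing:", missing)
```

Running the overlap check on both the unfiltered and the filtered table shows whether any gene-set members were lost to the biotype filter; if the counts match, the filter only removed genes that could not contribute to any set.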
