Discrepancy in Abundances in Sensitive Mode #45

Open
echolley31 opened this issue Jun 5, 2024 · 0 comments

Hi there! Thank you for your comment on my initial Metalign installation issue! I was able to get the application working, and I just had a question about the calculation of relative abundances. I posted this on Biostars, but I was wondering if you also had an answer? Thank you so much!

My lab is particularly interested in M. smegmatis and its abundance in a particular soil sample. When I ran Metalign with the default parameters, M. smeg was reported at 10% relative abundance, and around 30 other species were identified in the sample. This is a native soil sample with a limited amount of M. smeg added. Given the biodiversity of soil, I would expect more than 30 species to be identified, and I would not expect M. smeg to account for 10% of all reads.

When I reran Metalign in sensitive mode, it found more than 17,000 different species, and the M. smegmatis abundance dropped to about 0.015%. If these values are relative abundances that account for the percentage of unmapped reads in both runs, shouldn't Metalign report the same amount of M. smegmatis each time?

I've been reading about the CMash algorithm and how it pre-filters the database using the containment index: the number of k-mers shared between the reads and a reference genome, divided by the number of k-mers in that reference genome. With the defaults, Metalign uses a containment index cutoff of 0.01. In sensitive mode, the cutoff is 0.0, which effectively eliminates the pre-filtering step.
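
To make sure I have the idea right, here is a minimal Python sketch of that pre-filter as I understand it. It is not CMash's or Metalign's actual code: the function names, `k = 4`, the toy sequences, and the choice of `>=` for the cutoff comparison are all my own assumptions, and it computes the containment index exactly on small sets, whereas a real tool would presumably estimate it.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_index(read_kmers, genome_kmers):
    """Fraction of the genome's k-mers that also occur in the reads."""
    if not genome_kmers:
        return 0.0
    return len(read_kmers & genome_kmers) / len(genome_kmers)


def prefilter(reads, genomes, k, cutoff):
    """Keep genomes whose containment index is at least `cutoff`.

    Assumption: the comparison is `>=`, so a cutoff of 0.0 keeps every genome.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    kept = {}
    for name, seq in genomes.items():
        ci = containment_index(read_kmers, kmers(seq, k))
        if ci >= cutoff:
            kept[name] = ci
    return kept


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    reads = ["ACGTAC", "GTACGG"]
    genomes = {"genome_a": "ACGTACGG", "genome_b": "TTTTTTTT"}

    print(prefilter(reads, genomes, k=4, cutoff=0.01))  # default-like cutoff
    print(prefilter(reads, genomes, k=4, cutoff=0.0))   # sensitive-like cutoff
```

In this toy case the 0.01 cutoff drops the genome that shares no k-mers with the reads, while the 0.0 cutoff keeps both genomes, which is what I mean by the pre-filter being effectively disabled.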

Additionally, 61% of reads were unmapped with the defaults, compared with 78% in --sensitive mode. So although --sensitive mode identified far more organisms, the percentage of unmapped reads increased. Is Metalign therefore falsely aligning reads to M. smeg and the 30 other organisms in the default run, since those are all the filtered database contains? Why did the abundance of M. smegmatis go from 10% with a 0.01 CMash cutoff to 0.015% with no CMash cutoff?
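
For comparison, here is a small sketch that puts the two runs on a common scale. It assumes, without my having confirmed it, that the reported abundances are fractions of mapped reads only rather than of all reads. The helper name `fraction_of_all_reads` is hypothetical. The input numbers are the ones quoted above.

```python
def fraction_of_all_reads(abundance_among_mapped, unmapped_fraction):
    """Rescale an abundance among mapped reads to a fraction of all reads.

    Assumption: the reported abundance is normalized over mapped reads only.
    """
    return abundance_among_mapped * (1.0 - unmapped_fraction)


# Default run: 10% abundance, 61% of reads unmapped.
default_run = fraction_of_all_reads(0.10, 0.61)

# Sensitive run: 0.015% abundance, 78% of reads unmapped.
sensitive_run = fraction_of_all_reads(0.00015, 0.78)

print(f"default:   {default_run:.4%} of all reads")
print(f"sensitive: {sensitive_run:.4%} of all reads")
print(f"ratio:     {default_run / sensitive_run:.0f}x")
```

Even under that assumption the two runs differ by roughly three orders of magnitude, so the difference in unmapped reads alone does not seem to explain the discrepancy.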

If there is anything wrong with my understanding of Metalign's algorithm or how it works, please let me know. I'm very new at this and just trying to understand this large discrepancy in the results. Thank you!
