Best practices for pre-processing prior to assembly #77
Hi @patrickaoude,
For short-read RNA-seq data, I have often used fastp for trimming adapters, low-quality bases, etc. RNA-Bloom's error correction for short reads is very similar to Rcorrector. If you are interested in assembling long-read RNA-seq data, then fastp and Rcorrector are not suitable, as I believe they are only intended for short reads. I think the majority of the tools mentioned in that paper are intended for short-read RNA-seq data, but the core ideas are still relevant for long reads in my opinion.
RNA-Bloom already does its own k-mer-based correction for long reads. As ONT and PacBio have improved the error rate of their sequencing data, you probably don't need an additional correction tool for that purpose.
If the RNA in your sample was poly-A selected, then you most likely do not need rRNA removal. Contamination detection is also very useful, especially if you are assembling reads for a species without a reference genome. As an FYI, Kraken2 did not perform very well on long reads according to this paper.
I haven't tried combining RNA-Bloom with RATTLE's clustering. However, I implemented a very crude clustering strategy for long reads in RNA-Bloom over 5 years ago; it wasn't very efficient, and the resulting assembly was not very good either.
One thing that I strongly suggest looking into is adapter trimming tools for long reads (e.g. Porechop, Pychopper, etc.). These tools can also potentially reduce chimeric reads, which are one of the major causes of misassemblies (in RNA-Bloom).
Hope that helps!
Hi Ka Ming, Thanks for your detailed and quick reply, it's very helpful. I had a few other clarifications to make if you get the chance:
Thanks again,
I have seen that message in their README for a number of years. You can use fastp for filtering long reads based on base quality, etc. I am not sure how well fastp performs at trimming adapters in ONT reads.
If the RNA was not poly-A selected, then it was likely ribo-depleted. Regardless, you can simply map your reads with minimap2 against a set of rRNA sequences (e.g. from SILVA) and set aside the reads that align.
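As a rough illustration of the rRNA-mapping idea above, here is a minimal Python sketch that reads minimap2's PAF output and drops the matching reads from a FASTQ stream. The minimap2 run itself is not shown. The function names, the 50% query-coverage threshold, and the assumption of 4-line FASTQ records are all illustrative choices, not something RNA-Bloom or minimap2 prescribes.

```python
from typing import Iterable, Iterator, List, Set


def rrna_read_ids(paf_lines: Iterable[str], min_query_fraction: float = 0.5) -> Set[str]:
    """Collect names of reads whose alignment spans enough of the read.

    PAF columns used: 1 = query name, 2 = query length,
    3 = query start, 4 = query end. The threshold is an assumption.
    """
    hits: Set[str] = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        name = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        if qlen > 0 and (qend - qstart) / qlen >= min_query_fraction:
            hits.add(name)
    return hits


def drop_reads(fastq_lines: Iterable[str], drop_ids: Set[str]) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records whose read name is not in drop_ids."""
    record: List[str] = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Header looks like "@name optional description"
            name = record[0][1:].split()[0]
            if name not in drop_ids:
                yield record
            record = []
```

In practice the two iterables would be open file handles for the PAF and FASTQ files; the threshold is worth tuning against a sample where the rRNA content is roughly known.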
It depends on the sequencing chemistry. If it is R9.4, then short reads do lower the error rate. If you would like to correct the long reads with short reads in your long-read assembly, you can supply your short reads to RNA-Bloom. Note that RNA-Bloom does not perform hybrid assembly of long and short reads.
If your reads are cDNAs (i.e. not direct RNA sequencing), then your command can be simplified to:
Hi Ka Ming, Thanks again for your reply. I tried running my data with RNA-Bloom2 and received an error in the third stage. I was hoping you might have some insight into what might have gone wrong:
I don't believe it's a memory issue (100 GB supplied), but if you have any suggestions for how I can troubleshoot this, please let me know! Thanks again!
The first two stages look fine. In stage 3, the overlapping step with minimap2 had some issues. The PAF output from minimap2 appears to be truncated: there should be 12 columns, but only 8 were found. Can you show me the content of the file
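The column check described above (12 PAF columns expected, only 8 found) can be reproduced by hand with a short Python sketch like the one below. The function name is hypothetical and this is not how RNA-Bloom performs the check internally. It only reports which lines have fewer than the 12 mandatory tab-separated PAF columns.

```python
from typing import Iterable, List, Tuple

PAF_MANDATORY_COLUMNS = 12  # per the PAF format; optional tags may follow


def find_short_paf_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for every line with too few columns."""
    problems: List[Tuple[int, int]] = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # ignore empty lines
        count = len(stripped.split("\t"))
        if count < PAF_MANDATORY_COLUMNS:
            problems.append((number, count))
    return problems
```

If only the final line is reported, one plausible explanation is that the writer was interrupted mid-line, for example because the job was killed. If the PAF file is gzip-compressed, it would need to be opened with `gzip.open(path, "rt")` instead of `open(path)`.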
Hmm, maybe it was a memory issue after all, not sure why the job would have been killed otherwise. Here are the file contents:
Hi,
I am interested in using long and short reads to assemble a transcriptome with RNA-Bloom2.
I had a few questions regarding pre-processing of reads before using RNA-Bloom2.
I am loosely following this paper: A simple guide to de novo transcriptome assembly and annotation
The steps I plan to include based on the paper:
Prior to error correction via fastp, it is recommended to use k-mer correction with Rcorrector. However, this tool (Rcorrector) is intended for short reads only. Do you have an alternative recommendation that is appropriate for long reads? I was thinking I could potentially use RATTLE and take the intermediate output after clustering and correction, but I am unsure whether this approach is sound. I also posted on RATTLE's GitHub to get the author's opinion on this approach.
Alternatively, is k-mer correction a necessary step given the way that RNA-Bloom2 works? If you have any suggestions for what I should do for pre-processing prior to assembly please let me know!
Thanks,
Patrick