snakemake pipeline viral genome pipe #749
Comments
Hi @jb013b,

While documentation on this topic isn't great, there are a few higher-level resources you can look to. For further info on the Snakemake pipelines, there's some overview here. Again, not perfect, but it can be handy to refer to. One thing to keep in mind is that Snakemake is all about specifying your desired end result, and it will figure out what it needs to do to get there. One of the more common end points is the

If you provide depleted uBAMs, you can place them in data/01_cleaned/samplename.cleaned.bam, but really, the Snakemake pipeline is meant to do all of that for you. Undepleted (raw) uBAMs can go in data/00_raw/samplename.bam. In fact, the whole pipeline can start from an Illumina BCL directory, which lets you re-demultiplex, redefine sample sheets, and so on. Or you can put paired fastq files in data/00_raw/samplename_L001_R1_001.fastq.gz (and R2_001.fastq.gz), as long as that directory also contains an Illumina-style SampleSheet.csv and RunInfo.xml (the Snakemake rules will automatically detect that and do the fastq-to-uBAM conversion).

There's also some code complexity in there that allows you to specify a couple of tabular inputs that let you merge data for the same sample across multiple sequencing runs. It does nice things like running all the depletions separately (in parallel) and then merging everything prior to assembly. This step is actually required if you want to do any intrahost variant calling, as we require multiple independent sequencing libraries per sample in order to call any iSNVs.

Let me know if you want to delve into any of those paths, but I suggest starting with the simpler things before getting into things like intrahost variant calling. If, at the end of the day, you find the Snakemake pipelines too difficult to use, you can always try the Cromwell or DNAnexus pipelines, which are entirely separate but call the same Python scripts underneath.
Those pipelines do not yet include inter- or intrahost variant calling, but most of the basic workflows (depletion, assembly, metagenomics, demultiplexing) are there. Although we've used Cromwell (either locally or on Google Cloud) successfully, we have no documentation for it yet, since it's quite new. The DNAnexus implementation, however, is our most popular one: it has been widely used both within our lab and with our partners, and it is the primary platform we train on and the one our ACEGID partner sites use regularly.
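To make the input locations described above concrete, here is a minimal Python sketch that builds the expected paths for one sample and reports which input forms are present on disk. It is not part of viral-ngs: the function names `expected_inputs` and `available_inputs` are hypothetical, and the R2 filename assumes the same `_L001_` pattern as the R1 file mentioned above.

```python
from pathlib import Path

# Directory names come from the comment above. Everything else here
# (function names, the dict layout) is hypothetical, not viral-ngs code.
RAW_DIR = Path("data/00_raw")
CLEANED_DIR = Path("data/01_cleaned")


def expected_inputs(sample, data_root=Path(".")):
    """Return the candidate input paths for one sample, keyed by input kind."""
    raw = data_root / RAW_DIR
    return {
        # Undepleted (raw) uBAM
        "raw_ubam": [raw / f"{sample}.bam"],
        # Already-depleted uBAM
        "cleaned_ubam": [data_root / CLEANED_DIR / f"{sample}.cleaned.bam"],
        # Paired fastq files plus the Illumina metadata files that must sit
        # in the same directory. The R2 name is an assumption (see lead-in).
        "paired_fastq": [
            raw / f"{sample}_L001_R1_001.fastq.gz",
            raw / f"{sample}_L001_R2_001.fastq.gz",
            raw / "SampleSheet.csv",
            raw / "RunInfo.xml",
        ],
    }


def available_inputs(sample, data_root=Path(".")):
    """List the input kinds whose files are all present on disk."""
    return [
        kind
        for kind, paths in expected_inputs(sample, data_root).items()
        if all(p.is_file() for p in paths)
    ]


if __name__ == "__main__":
    # Show where each input form would be looked for, then what is found.
    for kind, paths in expected_inputs("samplename").items():
        print(kind, [str(p) for p in paths])
    print(available_inputs("samplename"))
```

Running this from the top of a project directory gives a quick check that files are where the Snakemake rules are described as looking for them, before any workflow is launched.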
Hi Daniel,
What is the best way to edit the .yaml file to use the viral genome analysis sections (assembly/intrahost variation)?
Would input files need to be depleted already?
Which directory within /data should the input files be placed in?
What would the appropriate file format be?
When editing the .yaml, does one remove the depletion sections or leave them blank ("")?
If this has already been discussed, please point me in the right direction.
Thank you for all the help.
James