snakemake pipeline viral genome pipe #749

jb013b · 2018-01-03T20:41:44Z

What is the best way to edit the .yaml file to use the viral genome analysis sections (assembly/intrahost variation)?

Would input files need to be depleted already?
Which dir would the input files be placed within /data?
What would the appropriate file format be?

When editing the .yaml, does one remove the depletion sections or leave them blank ("")?

If this has already been discussed, please point me in the right direction.
Thank you for all the help.
James

dpark01 · 2018-01-03T21:14:49Z

Hi @jb013b,

While documentation on this topic isn't great, there are a few higher level resources you can look to.

For further info on the Snakemake pipelines, there's some overview here. Again, not perfect, but can be handy to refer to. One thing to keep in mind is that Snakemake is all about specifying your desired end-result and it will figure out what it needs to do to get there.

One of the more common end points is the all_assemble Snakemake target, which tries to assemble a genome for every sample defined in samples-assembly.txt. Because these assemblies often succeed or fail for reasons that have nothing to do with computational correctness (i.e., assembly often fails because of your data), this is usually an iterative process: run snakemake all_assemble, identify which samples did not have sufficient reads to create a genome, manually remove them from samples-assembly.txt (usually moving them into samples-assembly-failures.txt), and try again until snakemake succeeds on the all_assemble target. You can start to think about inter- and intrahost variants after that, but there is more complexity there that we can discuss later.
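The bookkeeping part of that loop can be sketched in Python. This is only an illustration, not part of viral-ngs: the helper name `move_failed_samples` is hypothetical, it assumes both files hold one sample name per line, and deciding which samples failed is still up to you.

```python
from pathlib import Path


def move_failed_samples(failed,
                        samples_file="samples-assembly.txt",
                        failures_file="samples-assembly-failures.txt"):
    """Move the given sample names from samples_file to failures_file.

    Hypothetical helper: assumes one sample name per line in both files.
    Returns the names that were actually moved.
    """
    samples_path = Path(samples_file)
    failures_path = Path(failures_file)
    failed = set(failed)

    # Read the current sample list, ignoring blank lines.
    names = [ln.strip() for ln in samples_path.read_text().splitlines() if ln.strip()]
    keep = [n for n in names if n not in failed]
    moved = [n for n in names if n in failed]

    # Read any failures already recorded so we do not duplicate entries.
    recorded = []
    if failures_path.exists():
        recorded = [ln.strip() for ln in failures_path.read_text().splitlines() if ln.strip()]
    for n in moved:
        if n not in recorded:
            recorded.append(n)

    # Write the failures list first, then the trimmed sample list.
    failures_path.write_text("".join(n + "\n" for n in recorded))
    samples_path.write_text("".join(n + "\n" for n in keep))
    return moved
```

After a failed run you would call something like `move_failed_samples(["sampleA"])` from the project directory and then re-run `snakemake all_assemble`.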

If you provide depleted uBAMs, you can place them in data/01_cleaned/samplename.cleaned.bam, but really, the Snakemake pipeline is meant to do all of that for you. Undepleted (raw) uBAMs can go in data/00_raw/samplename.bam. The whole pipeline can actually start from an Illumina BCL directory, which lets you re-demultiplex and redefine samplesheets and such. Or you can put paired fastq files in data/00_raw/samplename_L001_R1_001.fastq.gz (and R2_001.fastq.gz), as long as that directory also contains an Illumina-style SampleSheet.csv and RunInfo.xml (the Snakemake rules will automatically detect that and do the fastq-to-uBAM conversion). There's also some code complexity in there that allows you to specify a couple of tabular inputs that let you merge data for the same sample across multiple sequencing runs; it does nice things like run all the depletions separately (in parallel) and merge everything prior to assembly. This merge step is actually required if you want to do any intrahost variant calling, as we require multiple independent sequencing libraries per sample in order to call any iSNVs. Let me know if you want to delve into any of those paths, but I suggest starting with the simpler things before getting into things like intrahost variant calling.
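To make the directory layout concrete, here is a small Python sketch that maps a sample name to the locations mentioned above and reports which ones exist. The function names are hypothetical and not part of viral-ngs, and the full R2 fastq name is an assumption inferred from the R1 pattern.

```python
from pathlib import Path


def expected_inputs(sample, data_dir="data"):
    """Return the candidate input paths for one sample (hypothetical helper)."""
    data = Path(data_dir)
    return {
        # Undepleted (raw) uBAM
        "raw_ubam": data / "00_raw" / f"{sample}.bam",
        # Already-depleted uBAM
        "cleaned_ubam": data / "01_cleaned" / f"{sample}.cleaned.bam",
        # Paired fastq files; these also need SampleSheet.csv and RunInfo.xml
        # in the same directory. The R2 name is assumed to mirror the R1 name.
        "fastq_r1": data / "00_raw" / f"{sample}_L001_R1_001.fastq.gz",
        "fastq_r2": data / "00_raw" / f"{sample}_L001_R2_001.fastq.gz",
    }


def find_inputs(sample, data_dir="data"):
    """Return only the candidate paths that exist on disk."""
    return {kind: path
            for kind, path in expected_inputs(sample, data_dir).items()
            if path.exists()}
```

Calling `find_inputs("samplename")` from the project directory gives a quick check of which starting point the files on disk correspond to.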

If, at the end of the day, you find the Snakemake pipelines too difficult to use, you can always try the Cromwell or DNAnexus pipelines, which are entirely separate but call the same Python scripts underneath. Those pipelines have no inter- or intrahost variant calling yet, but most of the basic workflows (depletion, assembly, metagenomics, demultiplexing) are all there. Although we've used Cromwell (either locally or on Google Cloud) successfully, we have no documentation on that yet, since it's quite new. The DNAnexus implementation, however, is our most popular one: it has been widely used both within our lab and with our partners, and it is the primary platform we train on and that our ACEGID partner sites use regularly.

jb013b · 2018-01-03T22:35:04Z

Hi Daniel,
Thank you, that information is very helpful; the .yaml file was also very helpful. I am going to keep working on the Snakemake pipelines. The one sample I am starting with is ultracentrifuged virus from Vero cells (NHP). Prior to working on the viral-ngs pipeline I had depleted the NHP reads, so I have both non-depleted and depleted files. The original input is ~87,000,000 total reads (2x150). After depletion of NHP reads there were 86,000,000 total reads left. No guarantees that these are all CHIKV, but the majority should be. Any possibility it is failing due to the number of specific reads, or just the total read number?

DAWNkKim mentioned this issue Jul 10, 2023

Can viral-ngs detect virus integration? #1011

Open
