add example sample sheet and improve quick start
wxicu committed Oct 25, 2023
1 parent 47509b3 commit 565a491
Showing 8 changed files with 24 additions and 31 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -1,8 +1,7 @@
.nextflow*
work/
data/
-results/
-result/
+result*/
.DS_Store
testing/
testing*
2 changes: 1 addition & 1 deletion conda/demuxem_py.yml
@@ -3,9 +3,9 @@ channels:
  - bioconda
dependencies:
  - python=3.9
-  - pegasuspy
  - pip
  - pip:
      - pandas<2.0
      - demuxEM
      - argparse
+      - pegasuspy
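As a side note, an environment file like this is typically materialized with a standard conda command (not part of this commit; the relative path assumes the repository root):

```bash
# Create the demuxEM environment; with pegasuspy moved under pip:,
# conda resolves the base packages first and pip installs the rest.
conda env create -f conda/demuxem_py.yml
```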
6 changes: 5 additions & 1 deletion docs/source/general.md
@@ -99,7 +99,7 @@ profiles{

### **Running on multiple samples**

-The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. Along with the input data, the sample sheet should contain an additional column for unique sample IDs assigned to each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. However, there is a distinction between running on a single sample and running on multiple samples. When processing multiple samples, the pipeline only permits a single value for each process parameter, whereas in the case of a single sample, multiple values separated by commas are allowed. The sample sheet should have e.g. following columns depending on the methods you want to run:
+The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. Along with the input data, the sample sheet should contain an additional column with a unique sample ID for each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. There is one distinction, however: when processing multiple samples, the pipeline only permits a single value per process parameter, whereas for a single sample, multiple comma-separated values are allowed. Depending on the methods you want to run, the sample sheet (see the Resources section below for an example file) should have e.g. the following columns:

- sampleId
- rna_matrix_raw
@@ -152,3 +152,7 @@ The demultiplexing workflow saves its output in `$pipeline_output_folder/[gene/h
| ... | ... | ... | ... |
- `adata` folder: stores an AnnData object with the filtered scRNA-seq read counts and the assignment of each deconvolution method if `params.generate_anndata` is `True`. See the "scverse compatibility" section above for details.
- In the `rescue` mode, the pipeline generates some additional output files; for details, please check [](rescue).
+
+
+## **Resources**
+- There is an [example sample sheet](../../multi_sample_input.csv) for `multi_sample` mode.
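A minimal sketch of a multi-sample run using this sheet, assuming the sheet path is passed via `params.multi_sample` as described above (check the exact flag spelling against your hadge version):

```bash
# Demultiplex every sample listed in the sheet; per-process parameters
# still come from nextflow.config and apply to all samples.
nextflow run main.nf --mode hashing --multi_sample multi_sample_input.csv
```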
10 changes: 0 additions & 10 deletions docs/source/genetic.md
@@ -4,16 +4,6 @@ Genotyped-based deconvolution leverages the unique genetic composition of indivi

## **Genetics-based deconvolution (gene_demulti) in hadge**

-- Pre-processing: Samtools
-- Variant-calling: freebayes
-- Variant-filtering: BCFtools
-- Variant-calling: cellsnp-lite
-- Demuxlet
-- Freemuxlet
-- Vireo
-- Souporcell
-- scSplit
-
![Caption](_static/images/genotype-based.png)

## **Input data preparation**
12 changes: 1 addition & 11 deletions docs/source/hashing.md
@@ -1,17 +1,7 @@
# Hashing demultiplexing

Cell hashing is a sample processing technique in which each sample is processed individually to "tag" the cell membrane or the nuclei with unique oligonucleotide barcodes. The cells are then washed or the reaction is quenched, and the samples can be safely mixed and processed following the standard library preparation procedure. This yields two libraries, one for the scRNA and one for the hashing oligos (HTO), which are sequenced independently to produce one single-cell count matrix each. The hashtag counts are then processed bioinformatically to deconvolve each cell's source sample.
+## **Hashing-based deconvolution (hash_demulti) in hadge**
-
-- Pre-processing
-- Multiseq
-- HTODemux
-- HashedDrops
-- DemuxEM
-- HashSolo
-- Demuxmix
-- GMM-Demux
-- BFF
-
![Caption](_static/images/hashing-based.png)
## **Input data preparation**

11 changes: 8 additions & 3 deletions docs/source/index.md
@@ -36,15 +36,20 @@ As next, please run the pipeline
```bash
nextflow run http://github.com/theislab/hadge
```

-## **Quick start**
-
You can also:
- Choose the mode: `--mode=<genetic/hashing/rescue>`
- Specify the output folder via `--outdir` to save the output files. The folder is created automatically in the project directory.
- Specify the input data for each process.
- The pipeline can be run either locally or on an HPC with different resource specifications. By default, the pipeline runs locally. You can also use the SLURM executor by running the pipeline with `-profile cluster` (see the example after the quick-start commands below).
- Please also check [](usage) for more details.
+
+## **Quick start**
+```bash
+sh test_data/download_data.sh
+nextflow run main.nf
+```
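For illustration, the options listed above can be combined into a single invocation; the flag values below are placeholders, not commands added by this commit:

```bash
# Hashing-based demultiplexing with a custom output folder, submitted via SLURM;
# adjust --mode, --outdir and the profile to your setup.
nextflow run main.nf --mode hashing --outdir my_results -profile cluster
```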


## **Pipeline output**

By default, the pipeline runs on a single sample. In this case, all pipeline output is saved in the folder `$projectDir/$params.outdir/$params.mode`. When running the pipeline on multiple samples, the output of each sample is found in `$projectDir/$params.outdir/$sampleId/$params.mode`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on.
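To make the path variables concrete, a hypothetical check of the layout after a run with `--outdir results --mode hashing` (folder names are placeholders for illustration):

```bash
# Single-sample run: one folder per mode
ls results/hashing            # this is $pipeline_output_folder
# Multi-sample run: one subfolder per sampleId from the sample sheet
ls results/sample1/hashing
```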
3 changes: 3 additions & 0 deletions multi_sample_input.csv
@@ -0,0 +1,3 @@
+sampleId,rna_matrix_raw,rna_matrix_filtered,hto_matrix_raw,hto_matrix_filtered,bam,bam_index,barcodes,nsample,cell_data,vcf_mixed,vcf_donor,vireo_parent_dir,demultiplexing_result
+sample1,sample1/rna/raw_feature_bc_matrix,sample1/rna/filtered_feature_bc_matrix,sample1/hto/raw_feature_bc_matrix,sample1/hto/filtered_feature_bc_matrix,sample1/pooled.sorted.bam,sample1/pooled.sorted.bam.bai,sample1/rna/filtered_feature_bc_matrix/barcodes.tsv,2,None,None,None,None,None
+sample2,sample2/rna/raw_feature_bc_matrix,sample2/rna/filtered_feature_bc_matrix,sample2/hto/raw_feature_bc_matrix,sample2/hto/filtered_feature_bc_matrix,sample2/pooled.sorted.bam,sample2/pooled.sorted.bam.bai,sample2/rna/filtered_feature_bc_matrix/barcodes.tsv,2,None,None,None,None,None
8 changes: 5 additions & 3 deletions test_data/download_data.sh
@@ -1,6 +1,8 @@
#!/bin/sh
# Download data for genotype-based deconvolution methods (popscle tutorial dataset)
-# outputdir="test_data"
-# mkdir $outputdir & cd $outputdir
+outputdir="test_data"
+mkdir $outputdir
+cd $outputdir
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1xl9g3IPFNISa1Z2Uqj3nia4tDWtMDz5H' -O jurkat_293t_exons_only.vcf.withAF.vcf.gz
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jlhEO1Z7YGYVnxv1YO9arjDwGFeZbfkr' -O jurkat_293t_downsampled_n500_full_bam.bai
FILEID="13CV6CjP9VzmwG5MVHbJiVDMVdiIhGdJB"
@@ -18,7 +20,7 @@ tar -xzvf refdata-cellranger-hg19-3.0.0.tar.gz
# Download common variants
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1lw4T6d7uXsm9dt39ZtEwpuB2VTY3wK1y' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lw4T6d7uXsm9dt39ZtEwpuB2VTY3wK1y" -O common_variants_hg19.vcf && rm -rf /tmp/cookies.txt
wget https://master.dl.sourceforge.net/project/cellsnp/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
-gunzip genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
+#gunzip genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
# Processed by bcftools query -f '%CHROM:%POS\n' common_variants_hg19.vcf > common_variants_hg19_list.vcf
wget "https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/41773431/common_variants_hg19_list.vcf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230731/eu-west-1/s3/aws4_request&X-Amz-Date=20230731T153655Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=283575c6ccb3104b8b95684e6d955abd28b47db71c18d1eeec99ae5dab65ff7b"
# Download simulated data
