add example sample sheet and improve quick start
wxicu committed Oct 25, 2023
1 parent 47509b3 commit 565a491
Showing 8 changed files with 24 additions and 31 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -1,8 +1,7 @@
.nextflow*
work/
data/
-results/
-result/
+result*/
.DS_Store
testing/
testing*
2 changes: 1 addition & 1 deletion conda/demuxem_py.yml
@@ -3,9 +3,9 @@ channels:
  - bioconda
dependencies:
  - python=3.9
-  - pegasuspy
  - pip
  - pip:
      - pandas<2.0
      - demuxEM
      - argparse
+      - pegasuspy
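As a side note, an environment file like this is typically materialized with a standard conda command (not part of this commit; the relative path assumes the repository root):

```bash
# Create the demuxEM environment; with pegasuspy moved under pip:,
# conda resolves the base packages first and pip installs the rest.
conda env create -f conda/demuxem_py.yml
```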
6 changes: 5 additions & 1 deletion docs/source/general.md
@@ -99,7 +99,7 @@ profiles{

### **Running on multiple samples**

-The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. Along with the input data, the sample sheet should contain an additional column for unique sample IDs assigned to each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. However, there is a distinction between running on a single sample and running on multiple samples. When processing multiple samples, the pipeline only permits a single value for each process parameter, whereas in the case of a single sample, multiple values separated by commas are allowed. The sample sheet should have e.g. following columns depending on the methods you want to run:
+The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. Along with the input data, the sample sheet should contain an additional column with a unique sample ID for each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. There is one distinction, however: when processing multiple samples, the pipeline only permits a single value per process parameter, whereas for a single sample, multiple comma-separated values are allowed. Depending on the methods you want to run, the sample sheet (see the Resources section below for an example file) should have e.g. the following columns:

- sampleId
- rna_matrix_raw
@@ -152,3 +152,7 @@ The demultiplexing workflow saves its output in `$pipeline_output_folder/[gene/h
| ... | ... | ... | ... |
- `adata` folder: stores an AnnData object with the filtered scRNA-seq read counts and the assignment of each deconvolution method if `params.generate_anndata` is `True`. See the "scverse compatibility" section above for details.
- In the `rescue` mode, the pipeline generates some additional output files; for details, please check [](rescue).
+
+
+## **Resources**
+- There is an [example sample sheet](../../multi_sample_input.csv) for `multi_sample` mode.
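A minimal sketch of a multi-sample run using this sheet, assuming the sheet path is passed via `params.multi_sample` as described above (check the exact flag spelling against your hadge version):

```bash
# Demultiplex every sample listed in the sheet; per-process parameters
# still come from nextflow.config and apply to all samples.
nextflow run main.nf --mode hashing --multi_sample multi_sample_input.csv
```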
10 changes: 0 additions & 10 deletions docs/source/genetic.md
@@ -4,16 +4,6 @@ Genotyped-based deconvolution leverages the unique genetic composition of indivi

## **Genetics-based deconvolution (gene_demulti) in hadge**

-- Pre-processing: Samtools
-- Variant-calling: freebayes
-- Variant-filtering: BCFtools
-- Variant-calling: cellsnp-lite
-- Demuxlet
-- Freemuxlet
-- Vireo
-- Souporcell
-- scSplit
-
![Caption](_static/images/genotype-based.png)

## **Input data preparation**
12 changes: 1 addition & 11 deletions docs/source/hashing.md
@@ -1,17 +1,7 @@
# Hashing demultiplexing

Cell hashing is a sample processing technique in which each sample is processed individually to "tag" the cell membrane or the nuclei with unique oligonucleotide barcodes. The cells are then washed or the reaction is quenched, and the samples can be safely mixed and processed following the standard library preparation procedure. This yields two libraries, one for the scRNA and one for the hashing oligos (HTO), which are sequenced independently to produce one single-cell count matrix each. The hashtag counts are then processed bioinformatically to deconvolve each cell's source sample.
+## **Hashing-based deconvolution (hash_demulti) in hadge**
-
-- Pre-processing
-- Multiseq
-- HTODemux
-- HashedDrops
-- DemuxEM
-- HashSolo
-- Demuxmix
-- GMM-Demux
-- BFF
-
![Caption](_static/images/hashing-based.png)
## **Input data preparation**

11 changes: 8 additions & 3 deletions docs/source/index.md
@@ -36,15 +36,20 @@ As next, please run the pipeline
```bash
nextflow run http://github.com/theislab/hadge
```

-## **Quick start**
-
You can also:
- Choose the mode: `--mode=<genetic/hashing/rescue>`
- Specify the output folder via `--outdir` to save the output files. The folder is created automatically in the project directory.
- Specify the input data for each process.
- The pipeline can be run either locally or on an HPC with different resource specifications. By default, the pipeline runs locally. You can also use the SLURM executor by running the pipeline with `-profile cluster` (see the example after the quick-start commands below).
- Please also check [](usage) for more details.
+
+## **Quick start**
+```bash
+sh test_data/download_data.sh
+nextflow run main.nf
+```
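For illustration, the options listed above can be combined into a single invocation; the flag values below are placeholders, not commands added by this commit:

```bash
# Hashing-based demultiplexing with a custom output folder, submitted via SLURM;
# adjust --mode, --outdir and the profile to your setup.
nextflow run main.nf --mode hashing --outdir my_results -profile cluster
```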


## **Pipeline output**

By default, the pipeline runs on a single sample. In this case, all pipeline output is saved in the folder `$projectDir/$params.outdir/$params.mode`. When running the pipeline on multiple samples, the output of each sample is found in `$projectDir/$params.outdir/$sampleId/$params.mode`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on.
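To make the path variables concrete, a hypothetical check of the layout after a run with `--outdir results --mode hashing` (folder names are placeholders for illustration):

```bash
# Single-sample run: one folder per mode
ls results/hashing            # this is $pipeline_output_folder
# Multi-sample run: one subfolder per sampleId from the sample sheet
ls results/sample1/hashing
```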
3 changes: 3 additions & 0 deletions multi_sample_input.csv
@@ -0,0 +1,3 @@
+sampleId,rna_matrix_raw,rna_matrix_filtered,hto_matrix_raw,hto_matrix_filtered,bam,bam_index,barcodes,nsample,cell_data,vcf_mixed,vcf_donor,vireo_parent_dir,demultiplexing_result
+sample1,sample1/rna/raw_feature_bc_matrix,sample1/rna/filtered_feature_bc_matrix,sample1/hto/raw_feature_bc_matrix,sample1/hto/filtered_feature_bc_matrix,sample1/pooled.sorted.bam,sample1/pooled.sorted.bam.bai,sample1/rna/filtered_feature_bc_matrix/barcodes.tsv,2,None,None,None,None,None
+sample2,sample2/rna/raw_feature_bc_matrix,sample2/rna/filtered_feature_bc_matrix,sample2/hto/raw_feature_bc_matrix,sample2/hto/filtered_feature_bc_matrix,sample2/pooled.sorted.bam,sample2/pooled.sorted.bam.bai,sample2/rna/filtered_feature_bc_matrix/barcodes.tsv,2,None,None,None,None,None
8 changes: 5 additions & 3 deletions test_data/download_data.sh
@@ -1,6 +1,8 @@
#!/bin/sh
# Download data for genotype-based deconvolution methods (popscle tutorial dataset)
-# outputdir="test_data"
-# mkdir $outputdir & cd $outputdir
+outputdir="test_data"
+mkdir $outputdir
+cd $outputdir
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1xl9g3IPFNISa1Z2Uqj3nia4tDWtMDz5H' -O jurkat_293t_exons_only.vcf.withAF.vcf.gz
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jlhEO1Z7YGYVnxv1YO9arjDwGFeZbfkr' -O jurkat_293t_downsampled_n500_full_bam.bai
FILEID="13CV6CjP9VzmwG5MVHbJiVDMVdiIhGdJB"
@@ -18,7 +20,7 @@ tar -xzvf refdata-cellranger-hg19-3.0.0.tar.gz
# Download common variants
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1lw4T6d7uXsm9dt39ZtEwpuB2VTY3wK1y' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lw4T6d7uXsm9dt39ZtEwpuB2VTY3wK1y" -O common_variants_hg19.vcf && rm -rf /tmp/cookies.txt
wget https://master.dl.sourceforge.net/project/cellsnp/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
-gunzip genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
+#gunzip genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
# Processed by bcftools query -f '%CHROM:%POS\n' common_variants_hg19.vcf > common_variants_hg19_list.vcf
wget "https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/41773431/common_variants_hg19_list.vcf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230731/eu-west-1/s3/aws4_request&X-Amz-Date=20230731T153655Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=283575c6ccb3104b8b95684e6d955abd28b47db71c18d1eeec99ae5dab65ff7b"
# Download simulated data
