diff --git a/use_case_examples/MAG_fishing/README.md b/use_case_examples/MAG_fishing/README.md new file mode 100644 index 0000000..dc20fc8 --- /dev/null +++ b/use_case_examples/MAG_fishing/README.md @@ -0,0 +1,249 @@ + +# MAG Fishing + +## Introduction + +YACHT can be used to search or "fish" for a metagenome-assembled genome (MAG) of interest in a sample. We took a closer look to the results of the use case example for Low Abundance Samples and found that *Bacteroidales bacterium* was present in sample SRR32008482 using a *k*-size of 51 and ani of 0.95. In Table 1, we observed that the accepted threshold was to match at least 75 exclusive *k*-mer matches to the reference and Sample SRR32008482 exceeded this requirement matching 158 exclusive *k*-mers. Please refer to `result_k51_ani0.95_SRR32008482.xlsx` for more YACHT results of this sample. In the following, we are curious to find out whether another metagenomic sample, SRR32008483, has *B. bacterium* present in its microbial community. + +### Table 1: B. bacterium detected in Sample SRR32008482 using k-size of 51 and ani=0.95 +| Sample | Minimum Coverage | Organism Name | Exclusive *k*-mers to Genome | Coverage of Exclusive *k*-mers | num_matches | Acceptance Threshold | +|-------------|------------------|-------------------------------------------------------|----------------------------|------------------------------|-------------|----------------------| +| SRR32008482 | 0.5 | GCA_017506175.1 Bacteroidales bacterium, ASM1750617v1 | 2464 | 1232 | 158 | 75 | + + +Make sure you have the following dependencies to run this use case example: + +- YACHT +- fastq-dump +- datasets +- matplotlib_venn + +## Obtain datasets + +Throughout this use case example, we will use two sample datasets, SRR32008482 and SRR32008483, to evaluate how results may differ in samples when MAG fishing. Note that these two samples have different sequence coverages where one has a sequence coverage of 44Mb (SRR32008482) and the other has a sequence coverage of 4.4Mb (SRR32008483). Additionally, we will use only one genome of interest as our MAG reference dataset to train with YACHT, *B. bacterium* (Accession: GCA_017506175). + +Please run the following script to download these samples and the MAG reference dataset. + + bash data_download.sh + +## Can we find the MAG, Bacteroidales bacterium, in both of these samples? + +### Sketch samples of interest + +We start off our analysis by sketching samples using `yacht sketch sample`. + +#### Sample SRR32008482: + yacht sketch sample --infile data/SRR32008482.fastq --kmer 51 --scaled 10 --outfile SRR32008482.k51.sample.sig.zip +#### Sample SRR32008483: + yacht sketch sample --infile data/SRR32008483.fastq --kmer 51 --scaled 10 --outfile SRR32008483.k51.sample.sig.zip + + + +### Sketch the MAG of interest, B. bacterium + +If you completed the low abundance use case example, you may recall that we were able to download pre-trained reference datasets. However, in this use case example, we are only interested in retrieving or "fishing" a single reference MAG, rather than an entire reference database like GTDB. Therefore, we can use a smaller scale factor (i.e., 10) and still ensure computational efficiency. This preserves the k-mer set preventing any error that YACHT cannot match a k-mer from the MAG to our samples. Additionally, this reference will be a sketch using "--singleton" meaning that each entry within the FASTA/Q file will have a unique signature. Specifically, every entry in this reference file is a scaffold indicating that this reference has not yet been appropriately annotated. + +#### Similar to sketching a sample, we use `yacht sketch ref` to sketch our MAG of interest + + yacht sketch ref --infile data/ncbi_dataset/data/GCA_017506175.1/GCA_017506175.1_ASM1750617v1_genomic.fna --kmer 51 --scaled 10 --outfile MAG_ref_database.k51.sig.zip + +### Using `yacht train`, we train our MAG signature + yacht train --ref_file MAG_ref_database.k51.sig.zip --ksize 51 --num_threads 32 --ani_thresh 0.95 --prefix 'MAG_ref_database_k51_ani0.95' --outdir ./ + +### Identify whether B. bacterium is present or absent using `yacht run` + +#### Initially, we were interested in whether *B. bacterium* was present in Sample SRR32008483. + yacht run --json 'MAG_ref_database_k51_ani0.95_config.json' --sample_file 'SRR32008483.k51.sample.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_SRR32008483.k51.xlsx + +#### Because signatures were produced for each scaffold in the MAG reference database, we are also interested in evaluating the scaffolds present in the old sample of interest, Sample SRR32008482. + yacht run --json 'MAG_ref_database_k51_ani0.95_config.json' --sample_file 'SRR32008482.k51.sample.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_SRR32008482.k51.xlsx + + +## Do these samples share contigs of *B. bacterium*? + +Run following script to produce and evaluate figure. + + python venn.py + +Yes, we are able to identify contigs of B. bacterium within these samples. Both samples share 21 contigs of B. bacterium, where Sample SRR32008482 has 41 unique contigs at a minimal coverage of 0.5 and at a lower coverage Sample SRR32008483 has 2 unique contigs. Note, that the difference in scaffolds present can be an indication of difference in sequence coverage of the samples. + +![Venn Diagram](venn.png) + + \ No newline at end of file diff --git a/use_case_examples/MAG_fishing/data_download.sh b/use_case_examples/MAG_fishing/data_download.sh new file mode 100644 index 0000000..0cffe6d --- /dev/null +++ b/use_case_examples/MAG_fishing/data_download.sh @@ -0,0 +1,14 @@ +#!/bin/bash + +#Create a folder for data files +mkdir data + +### Download sample to data folder +fasterq-dump --concatenate-reads SRR32008482 -O data +fasterq-dump --concatenate-reads SRR32008483 -O data + +### Download MAG of interest +datasets download genome accession GCA_017506175.1 --include genome --filename data/. +cd data +unzip ncbi_dataset +cd ../ diff --git a/use_case_examples/MAG_fishing/result_k51_ani0.95_SRR32008482.xlsx b/use_case_examples/MAG_fishing/result_k51_ani0.95_SRR32008482.xlsx new file mode 100644 index 0000000..b7449bb Binary files /dev/null and b/use_case_examples/MAG_fishing/result_k51_ani0.95_SRR32008482.xlsx differ diff --git a/use_case_examples/MAG_fishing/venn.png b/use_case_examples/MAG_fishing/venn.png new file mode 100644 index 0000000..233421d Binary files /dev/null and b/use_case_examples/MAG_fishing/venn.png differ diff --git a/use_case_examples/MAG_fishing/venn.py b/use_case_examples/MAG_fishing/venn.py new file mode 100644 index 0000000..3bd6502 --- /dev/null +++ b/use_case_examples/MAG_fishing/venn.py @@ -0,0 +1,25 @@ +from matplotlib import pyplot as plt +from matplotlib_venn import venn2 +import pandas as pd + +#datasets for different ksizes + +sample1 = 'result_SRR32008482.k51.xlsx' +df_sample1 = pd.read_excel(sample1, sheet_name='min_coverage0.5') +sample1_lst = df_sample1['organism_name'] +sample1_set = {x for x in sample1_lst} + +sample2 = 'result_SRR32008483.k51.xlsx' +df_sample2 = pd.read_excel(sample2, sheet_name='min_coverage0.1') +sample2_lst = df_sample2['organism_name'] +sample2_set = {x for x in sample2_lst} + +#venn diagram + +venn2([sample1_set, sample2_set], alpha=0.5, set_labels=(f'SRR32008482\nmincov=0.5',f'SRR32008483\nmincov=0.1')) + +plt.title("k-size=51, ANI=0.95") + +# Save the figure +plt.savefig('venn.png',dpi=500) + diff --git a/use_case_examples/contamination_detection_example/1_before_starting.sh b/use_case_examples/contamination_detection_example/1_before_starting.sh new file mode 100644 index 0000000..05078e0 --- /dev/null +++ b/use_case_examples/contamination_detection_example/1_before_starting.sh @@ -0,0 +1,17 @@ +# Download SRR25626360 which represents WGS of Haemophilus influenzae +nohup fastq-dump --fasta 60 SRR25626360 2>&1 & + +### Download SRR24210460 which represents WGS of mycoplasma pneumoniae from library MDY +nohup fastq-dump --fasta 60 SRR24210460 2>&1 & + +### Download SRR7217470 which represents WGS of Chlamydia pneumoniae +nohup fastq-dump --fasta 60 SRR7217470 2>&1 & + +### Download SRR5962942 which represents WGS of Streptococcus pneumoniae +nohup fastq-dump --fasta 60 SRR5962942 2>&1 & + +### Download SRR26202532 which represents WGS of Bordetella pertussis +nohup fastq-dump --fasta 60 SRR26202532 2>&1 & + +### Download SRR2830253, reads of a healthy human lung microbiome +nohup fastq-dump --fasta 60 SRR2830253 2>&1 & diff --git a/use_case_examples/contamination_detection_example/2_before_starting.sh b/use_case_examples/contamination_detection_example/2_before_starting.sh new file mode 100644 index 0000000..3bccd1b --- /dev/null +++ b/use_case_examples/contamination_detection_example/2_before_starting.sh @@ -0,0 +1,30 @@ +# Create the example sample data for a patient with respiratory symptoms seeks to find out the pathogen that is causing them these symptoms. + +# Before moving on. Make sure reads needed to create sample dataset are available. Please reference create_reference_database.md + +# Create samples that will be loaded to the 96-well tray + +# Negative control, so just reads from a healthy lung +cat SRR2830253.fasta negative_control_well_11.fasta + +# Positive control with H. influenzae +cat SRR25626360.fasta SRR2830253.fasta > positive_control_well_23.fasta + +# Sample 1 +cat SRR25626360.fasta SRR2830253.fasta SRR25626360.fasta > positive_control_well_64.fasta + +# Sample 2 +cat SRR24210460.fasta SRR2830253.fasta SRR25626360.fasta > sample_well_80.fasta + +# I check one of my negative controls, which is a healthy lung example and we should not detect any bacteria here +# no contamination + +# I check one of my positive controls for M. pneumonaie which should not have H. influenzae +# no contamination + +# I check one of my positive controls for H. influenzae which should not have M. pneumonaie +# contamination + +# I check one of my samples for H. influenzae which should not have M. pneumonaie +# contamination + diff --git a/use_case_examples/contamination_detection_example/Picture1.png b/use_case_examples/contamination_detection_example/Picture1.png new file mode 100644 index 0000000..008b4d4 Binary files /dev/null and b/use_case_examples/contamination_detection_example/Picture1.png differ diff --git a/use_case_examples/contamination_detection_example/contamination_detection_example.md b/use_case_examples/contamination_detection_example/contamination_detection_example.md new file mode 100644 index 0000000..646e547 --- /dev/null +++ b/use_case_examples/contamination_detection_example/contamination_detection_example.md @@ -0,0 +1,95 @@ +# Contamination Detection Example +A research is being conducted on how microbial communities are being shaped among diffrent types of respiratory diseases. Samples were collected from two patience in wihch one patient is M. pneumoniae positive and the other in H. influenzae. To save time and many, only one 96-well tray will be used for both samples. Before downstream analysis can be performed, we want to know if cross contamination between samples occured during the loading of the 96-well tray and we randomly choose wells 11, 23, 64, and 80. + +Make sure all bacterial reads needed to create your reference dataset also known as a training dataset are available. +```bash +bash 1_before_starting.sh +``` +```bash +bash 2_before_starting.sh +``` + +### Sketch your training dataset and sample to your preference. + +#### Using k=31 +Note: training and sample datasets are required to have the same ksize. Please note that since we are sketching from a list of genomes. We can use the following sourmash sketch command: +```bash +sourmash sketch fromfile genome_list.csv -p dna,k=31,scaled=1000,abund -o training_database.k31.sig.zip +``` + +Sketch the negative control reads from well 11 +```bash +yacht sketch sample --infile ./negative_control_well_11.fasta --kmer 31 --scaled 1000 --outfile negative_control_well_11.k31.sig.zip +``` + +Sketch the positive control from well 23 +```bash +yacht sketch sample --infile ./positive_control_well_23.fasta --kmer 31 --scaled 1000 --outfile positive_control_well_23.k31.sig.zip +``` + +Sketch the positive control from well 64 +```bash +yacht sketch sample --infile ./positive_control_well_64.fasta --kmer 31 --scaled 1000 --outfile positive_control_well_64.k31.sig.zip +``` + +Sketch the sample from well 80 +```bash +yacht sketch sample --infile ./sample_well_80.fasta --kmer 31 --scaled 1000 --outfile sample_well_80.k31.sig.zip +``` + +### Make training data for k=31 +```bash +yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k31' --outdir ./ --force +``` + +### Identify whether the patient has a infection and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_config.json --sample_file negative_control_well_11.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./negative_control_well_11_k31_result.xlsx +``` + +```bash +yacht run --json training_database.k31_config.json --sample_file positive_control_well_23.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./positive_control_well_23_k31_result.xlsx +``` + +```bash +yacht run --json training_database.k31_config.json --sample_file positive_control_well_64.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./positive_control_well_64_k31_result.xlsx +``` + +```bash +yacht run --json training_database.k31_config.json --sample_file sample_well_80.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./sample_well_80.xlsx +``` + +### Results +Using a ksize of 31 at ANI 0.95, YACHT finds XYZ + +## Let's decrease ANI to 0.50 + +### Make training data for k=31 +```bash +yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k31_ani0.50' --outdir ./ --force +``` + +### Pathogen Detection using YACHT +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_ani0.50_config.json --sample_file negative_control_well_11.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_negative_control_well_11.xlsx +``` + +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_ani0.50_config.json --sample_file positive_control_well_23.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_positive_control_well_23.xlsx +``` + +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_ani0.50_config.json --sample_file positive_control_well_64.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_positive_control_well_64.xlsx +``` + +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_ani0.50_config.json --sample_file sample_well_80.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_sample_well_80.xlsx +``` + + +### Results +Decreasing ANI to 0.50 and using a ksize of 31, YACHT finds XYZ \ No newline at end of file diff --git a/use_case_examples/contamination_detection_example/genome_list.csv b/use_case_examples/contamination_detection_example/genome_list.csv new file mode 100644 index 0000000..cc965d4 --- /dev/null +++ b/use_case_examples/contamination_detection_example/genome_list.csv @@ -0,0 +1,6 @@ +0,name,genome_filename,protein_filename +1,SRR25626360,SRR25626360.fasta, +2,SRR24210460,SRR24210460.fasta, +3,SRR7217470,SRR7217470.fasta, +4,SRR5962942,SRR5962942.fasta, +5,SRR26202532,SRR26202532.fasta, diff --git a/use_case_examples/low_abundance_samples/README.md b/use_case_examples/low_abundance_samples/README.md new file mode 100644 index 0000000..77a2183 --- /dev/null +++ b/use_case_examples/low_abundance_samples/README.md @@ -0,0 +1,118 @@ + +# Low abundance samples + +## Introduction + +If it's your first time using YACHT, we recommend to first use the default parameters, a k-size of 31 and an ANI threshold of 0.95. + +Make sure you have the following dependencies to run this use case example: + +- fastq-dump + +## Obtain datasets + +Throughout this use case example, we will use this sample dataset to test and evaluate how results may change when modifying parameters such as k-size and ANI thresholds. + +To run our use case examples, there is no need to start from stratch when sketching our references. We will download and use pre-created reference signatures for a k-size of 21, 31, and 51. Please run the following script to download all the data needed. + + bash data_download.sh + +You can find additional reference signatures at [sourmash](https://sourmash.readthedocs.io/en/latest/databases.html#id9). + +## Using YACHT's default parameters: k-size=31, ani_thresh=0.95 + +### Sketch sample + +Sketch the sample dataset using the same k-size used for the reference. + + yacht sketch sample --infile data/SRR32008482.fastq --kmer 31 --scaled 1000 --outfile sample.31.sig.zip + +### Train reference + +Note, we didn't need to sketch the reference, since we were able to download the pre-created signatures, but if you provide YACHT with your own reference, you'll need to sketch it. For more information please visit: [YACHT](https://github.com/KoslickiLab/YACHT?tab=readme-ov-file#creating-sketches-of-your-reference-database-genomes-yacht-sketch-ref). + +Here, we will train our reference signature. We are using an ANI threshold of 0.95. This means that any species that is within that threshold will combine. + + yacht train --ref_file data/gtdb-rs214-k31.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'gtdb_ani_thresh_0.95' --outdir ./ + +### Identify presence or absence of species using yacht run + + yacht run --json 'gtdb_ani_thresh_0.95_config.json' --sample_file 'sample.31.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_k31_ani0.95.xlsx + +## Decrease k-size to 21 + +### Sketch sample + +Sketch the sample dataset using a k-size of 21. + + yacht sketch sample --infile data/SRR32008482.fastq --kmer 21 --scaled 1000 --outfile sample.21.sig.zip + +### Train reference + +Here, we will train our reference signature. We conitnue to use an ANI threshold of 0.95, but using a k-size of 21. + + yacht train --ref_file data/gtdb-rs214-k21.zip --ksize 21 --num_threads 64 --ani_thresh 0.95 --prefix 'gtdb_ani_thresh_0.95' --outdir ./ + +### How will using a smaller k-size change the identifcation of presence or absence of species when using yahct run? + + yacht run --json 'gtdb_ani_thresh_0.95_config.json' --sample_file 'sample.21.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_k21_ani0.95.xlsx + +## Increase k-size to 51 + +### Sketch sample + +Sketch the sample dataset using a k-size of 51. + + yacht sketch sample --infile data/SRR32008482.fastq --kmer 51 --scaled 1000 --outfile sample.51.sig.zip + +### Train reference + +To train our reference signature, conitnue using an ANI threshold of 0.95 increasing the k-size to 51. + + yacht train --ref_file data/gtdb-rs214-k51.zip --ksize 21 --num_threads 64 --ani_thresh 0.95 --prefix 'gtdb_ani_thresh_0.95' --outdir ./ + +### Run yacht run and observe difference in species presence/absence output + + yacht run --json 'gtdb_ani_thresh_0.95_config.json' --sample_file 'sample.51.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_k51_ani0.95.xlsx + +## Results + +Run following script to produce figure. + + python venn_ksize.py + +![Venn Diagram](venn_low_abundance_species_ksize.png) + +## Using default k-size 31 and increasing ANI to 0.9995 + +Now that we know what happens when the k-size is either decreased or increased, let's tune the ANI threshold! + +### Train reference + +Note that we have the signature for the samplle using a k-size of 31, so we can move forward to training our reference signature using an ANI threshold of 0.9995. + + yacht train --ref_file data/gtdb-rs214-k31.zip --ksize 31 --num_threads 64 --ani_thresh 0.9995 --prefix 'gtdb_ani_thresh_0.9995' --outdir ./ + +### Run yacht run and observe difference in species presence/absence output + + yacht run --json 'gtdb_ani_thresh_0.9995_config.json' --sample_file 'sample.31.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_k31_ani0.9995.xlsx + +## using default k-size 31 and decreasing ANI to 0.90 + +### Train reference + +Train our reference signature reducing the ANI threshold to 0.90. + + yacht train --ref_file data/gtdb-rs214-k31.zip --ksize 31 --num_threads 64 --ani_thresh 0.90 --prefix 'gtdb_ani_thresh_0.90' --outdir ./ + +### Run yacht run and observe difference in species presence/absence output + + yacht run --json 'gtdb_ani_thresh_0.90_config.json' --sample_file 'sample.31.sig.zip' --num_threads 32 --keep_raw --significance 0.95 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result_k31_ani0.90.xlsx + +## Results + +Run following script to produce figure. + + python venn_ani.py + +![Venn Diagram](venn_low_abundance_species_ani.png) diff --git a/use_case_examples/low_abundance_samples/data_download.sh b/use_case_examples/low_abundance_samples/data_download.sh new file mode 100644 index 0000000..8c9a605 --- /dev/null +++ b/use_case_examples/low_abundance_samples/data_download.sh @@ -0,0 +1,15 @@ +#!/bin/bash + +#Create a folder for data files +mkdir data + +### Download sample to data folder +fasterq-dump --concatenate-reads SRR32008482 -O data + +### Download pre-reference signatures for k=21,k=31,and k=51 +#k-size=21 +wget https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip --directory-prefix=data +#k-size=31 +wget https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k31.zip --directory-prefix=data +#k-size=51 +wget https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k51.zip --directory-prefix=data diff --git a/use_case_examples/low_abundance_samples/result_k21_ani0.95.xlsx b/use_case_examples/low_abundance_samples/result_k21_ani0.95.xlsx new file mode 100644 index 0000000..63d44cd Binary files /dev/null and b/use_case_examples/low_abundance_samples/result_k21_ani0.95.xlsx differ diff --git a/use_case_examples/low_abundance_samples/result_k31_ani0.90.xlsx b/use_case_examples/low_abundance_samples/result_k31_ani0.90.xlsx new file mode 100644 index 0000000..2233b0b Binary files /dev/null and b/use_case_examples/low_abundance_samples/result_k31_ani0.90.xlsx differ diff --git a/use_case_examples/low_abundance_samples/result_k31_ani0.95.xlsx b/use_case_examples/low_abundance_samples/result_k31_ani0.95.xlsx new file mode 100644 index 0000000..3b1859e Binary files /dev/null and b/use_case_examples/low_abundance_samples/result_k31_ani0.95.xlsx differ diff --git a/use_case_examples/low_abundance_samples/result_k31_ani0.9995.xlsx b/use_case_examples/low_abundance_samples/result_k31_ani0.9995.xlsx new file mode 100644 index 0000000..3013cbe Binary files /dev/null and b/use_case_examples/low_abundance_samples/result_k31_ani0.9995.xlsx differ diff --git a/use_case_examples/low_abundance_samples/result_k51_ani0.95.xlsx b/use_case_examples/low_abundance_samples/result_k51_ani0.95.xlsx new file mode 100644 index 0000000..06b57ba Binary files /dev/null and b/use_case_examples/low_abundance_samples/result_k51_ani0.95.xlsx differ diff --git a/use_case_examples/low_abundance_samples/venn_ani.py b/use_case_examples/low_abundance_samples/venn_ani.py new file mode 100644 index 0000000..6a5e61b --- /dev/null +++ b/use_case_examples/low_abundance_samples/venn_ani.py @@ -0,0 +1,32 @@ +from matplotlib import pyplot as plt +from matplotlib_venn import venn3 +import pandas as pd + +sheet='min_coverage0.05' + +#datasets for different ani + +file_ani90 = 'result_k31_ani0.90.xlsx' +df_ani90 = pd.read_excel(file_ani90, sheet_name=sheet) +species_ani90_lst = df_ani90['organism_name'] +species_ani90_set = {x for x in species_ani90_lst} + +file_ani95 = 'result_k31_ani0.95.xlsx' +df_ani95 = pd.read_excel(file_ani95, sheet_name=sheet) +species_ani95_lst = df_ani95['organism_name'] +species_ani95_set = {x for x in species_ani95_lst} + +file_ani9995 = 'result_k31_ani0.9995.xlsx' +df_ani9995 = pd.read_excel(file_ani9995, sheet_name=sheet) +species_ani9995_lst = df_ani9995['organism_name'] +species_ani9995_set = {x for x in species_ani9995_lst} + +# Create subplots + +venn3([species_ani90_set, species_ani95_set, species_ani9995_set], set_labels=('0.90', '0.95', '0.9995'), alpha=0.5) +plt.title("Different ANI thresholds at k-size=31 using minimal coverage 0.05") + +# Save the figure +plt.savefig('venn_low_abundance_species_ani.png',dpi=500) + + diff --git a/use_case_examples/low_abundance_samples/venn_ksize.py b/use_case_examples/low_abundance_samples/venn_ksize.py new file mode 100644 index 0000000..3fd7071 --- /dev/null +++ b/use_case_examples/low_abundance_samples/venn_ksize.py @@ -0,0 +1,30 @@ +from matplotlib import pyplot as plt +from matplotlib_venn import venn3 +import pandas as pd + +sheet='min_coverage0.05' + +#datasets for different ksizes + +file_ani95_k21 = 'result_k21_ani0.95.xlsx' +df_ani95_k21 = pd.read_excel(file_ani95_k21, sheet_name=sheet) +species_ani95_k21_lst = df_ani95_k21['organism_name'] +species_ani95_k21_set = {x for x in species_ani95_k21_lst} + +file_ani95_k31 = 'result_k31_ani0.95.xlsx' +df_ani95_k31 = pd.read_excel(file_ani95_k31, sheet_name=sheet) +species_ani95_k31_lst = df_ani95_k31['organism_name'] +species_ani95_k31_set = {x for x in species_ani95_k31_lst} + +file_ani95_k51 = 'result_k51_ani0.95.xlsx' +df_ani95_k51 = pd.read_excel(file_ani95_k51, sheet_name=sheet) +species_ani95_k51_lst = df_ani95_k51['organism_name'] +species_ani95_k51_set = {x for x in species_ani95_k51_lst} + +venn3([species_ani95_k21_set, species_ani95_k31_set, species_ani95_k51_set], set_labels=('k21', 'k31', 'k51'), alpha=0.5) +plt.title("Different k-sizes at ANI threshold 0.95 at minimal coverage of 0.05") + +# Save the figure +plt.savefig('venn_low_abundance_species_ksize.png',dpi=500) + + diff --git a/use_case_examples/low_abundance_samples/venn_low_abundance_species_ani.png b/use_case_examples/low_abundance_samples/venn_low_abundance_species_ani.png new file mode 100644 index 0000000..fecdfa7 Binary files /dev/null and b/use_case_examples/low_abundance_samples/venn_low_abundance_species_ani.png differ diff --git a/use_case_examples/low_abundance_samples/venn_low_abundance_species_ksize.png b/use_case_examples/low_abundance_samples/venn_low_abundance_species_ksize.png new file mode 100644 index 0000000..2d01c1f Binary files /dev/null and b/use_case_examples/low_abundance_samples/venn_low_abundance_species_ksize.png differ diff --git a/use_case_examples/pathogen_detection_example/1_before_starting.sh b/use_case_examples/pathogen_detection_example/1_before_starting.sh new file mode 100644 index 0000000..05078e0 --- /dev/null +++ b/use_case_examples/pathogen_detection_example/1_before_starting.sh @@ -0,0 +1,17 @@ +# Download SRR25626360 which represents WGS of Haemophilus influenzae +nohup fastq-dump --fasta 60 SRR25626360 2>&1 & + +### Download SRR24210460 which represents WGS of mycoplasma pneumoniae from library MDY +nohup fastq-dump --fasta 60 SRR24210460 2>&1 & + +### Download SRR7217470 which represents WGS of Chlamydia pneumoniae +nohup fastq-dump --fasta 60 SRR7217470 2>&1 & + +### Download SRR5962942 which represents WGS of Streptococcus pneumoniae +nohup fastq-dump --fasta 60 SRR5962942 2>&1 & + +### Download SRR26202532 which represents WGS of Bordetella pertussis +nohup fastq-dump --fasta 60 SRR26202532 2>&1 & + +### Download SRR2830253, reads of a healthy human lung microbiome +nohup fastq-dump --fasta 60 SRR2830253 2>&1 & diff --git a/use_case_examples/pathogen_detection_example/2_before_starting.sh b/use_case_examples/pathogen_detection_example/2_before_starting.sh new file mode 100644 index 0000000..8a682fe --- /dev/null +++ b/use_case_examples/pathogen_detection_example/2_before_starting.sh @@ -0,0 +1,11 @@ +# Create the example sample data for a patient with respiratory symptoms seeks to find out the pathogen that is causing them these symptoms. + +# Before moving on. Make sure reads needed to create sample dataset are available. Please reference create_reference_database.md + +# Sketch sample to your preference. Note: training and sample datasets are required to have the same ksize. + +## Using k=31 +nohup sourmash sketch fromfile lung_list.csv -p dna,k=31,scaled=1000,abund -o lung_sample.k31.sig.zip > k31_sample.log 2>&1 & + +## Using k=15 +nohup sourmash sketch fromfile lung_list.csv -p dna,k=15,scaled=1000,abund -o lung_sample.k15.sig.zip > k15_sample.log 2>&1 & diff --git a/use_case_examples/pathogen_detection_example/genome_list.csv b/use_case_examples/pathogen_detection_example/genome_list.csv new file mode 100644 index 0000000..cc965d4 --- /dev/null +++ b/use_case_examples/pathogen_detection_example/genome_list.csv @@ -0,0 +1,6 @@ +0,name,genome_filename,protein_filename +1,SRR25626360,SRR25626360.fasta, +2,SRR24210460,SRR24210460.fasta, +3,SRR7217470,SRR7217470.fasta, +4,SRR5962942,SRR5962942.fasta, +5,SRR26202532,SRR26202532.fasta, diff --git a/use_case_examples/pathogen_detection_example/lung_list.csv b/use_case_examples/pathogen_detection_example/lung_list.csv new file mode 100644 index 0000000..db27312 --- /dev/null +++ b/use_case_examples/pathogen_detection_example/lung_list.csv @@ -0,0 +1,3 @@ +0,name,genome_filename,protein_filename +1,SRR24210460,SRR24210460.fasta, +2,SRR26202532,SRR26202532.fasta, diff --git a/use_case_examples/pathogen_detection_example/pathogen_detection_example.md b/use_case_examples/pathogen_detection_example/pathogen_detection_example.md new file mode 100644 index 0000000..6b43ce0 --- /dev/null +++ b/use_case_examples/pathogen_detection_example/pathogen_detection_example.md @@ -0,0 +1,77 @@ +# Pathogen Detection Example +A patient with respiratory symptoms seeks to find out the pathogen that is causing them these symptoms. + +Make sure all bacterial reads needed to create your reference dataset also known as a training dataset are available. +```bash +bash 1_before_starting.sh +``` +```bash +bash 2_before_starting.sh +``` + +### Sketch your training dataset and sample to your preference. + +#### Using k=31 +Note: training and sample datasets are required to have the same ksize. Please note that since we are sketching from a list of genomes. We can use the following sourmash sketch command: +```bash +sourmash sketch fromfile genome_list.csv -p dna,k=31,scaled=1000,abund -o training_database.k31.sig.zip +``` + +Sketch your sample fasta file +```bash +yacht sketch sample --infile ./lung_sample.fasta --kmer 31 --scaled 1000 --outfile lung_sample.k31.sig.zip +``` + +### Make training data for k=31 +```bash +yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k31' --outdir ./ --force +``` + +### Identify whether the patient has a infection and what pathogen is causing the disease. +```bash +yacht run --json training_database.k31_config.json --sample_file lung_sample.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_result.xlsx +``` + +### Results +Using a ksize of 31, YACHT finds that M. pneumoniae is present in the lung sample. + +## What if we decrease ksize to 15? +If we use small ksizes like 15, we would expect to not find that the patient is infected by M. pneumoniae. Let's set up the experiment. Note that a ksize below 7 may not produce results and is not recommend. + +### Sketch Lung Sample using a k=15 +```bash +sourmash sketch fromfile genome_list.csv -p dna,k=15,scaled=1000,abund -o training_database.k15.sig.zip +``` + +Sketch your sample fasta file +```bash +yacht sketch sample --infile ./lung_sample.fasta --kmer 15 --scaled 1000 --outfile lung_sample.k15.sig.zip +``` + +### Make training data for k=15 +```bash +yacht train --ref_file training_database.k15.sig.zip --ksize 15 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k15' --outdir ./ --force +``` + +### Pathogen Detection using YACHT +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k15_config.json --sample_file lung_sample.k15.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k15_result.xlsx +``` +### Results +Using a ksize of 15, YACHT finds/does not fine that M. pneumoniae + +## Let's decrease ANI to 0.85 + +### Make training data for k=15 +```bash +yacht train --ref_file training_database.k15.sig.zip --ksize 15 --num_threads 64 --ani_thresh 0.85 --prefix 'training_database.k15_ani0.85' --outdir ./ --force +``` + +### Pathogen Detection using YACHT +Identify whether the patient has a infectin and what pathogen is causing the disease. +```bash +yacht run --json training_database.k15_ani0.85_config.json --sample_file lung_sample.k15.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k15_ani0.85_result.xlsx +``` +### Results +Using a ksize of 15, YACHT finds/does not fine that M. pneumoniae \ No newline at end of file diff --git a/use_case_examples/use-case-examples.md b/use_case_examples/use-case-examples.md new file mode 100644 index 0000000..71874e3 --- /dev/null +++ b/use_case_examples/use-case-examples.md @@ -0,0 +1,69 @@ +Advantages of using YACHT is that you can analyze from fasta or fastq files. + +# Biological Application + +Does YACHT provide qunatitative data? + +How sensitive is Yacht in detecting viruses? + +Could we use YACHT to identify the amount of host DNA? + +Is sequencing depth an issue for YACHT? + +Can YACHT be used to identify the amount of contamination that there is in a sample? + +## Contamination Detection + +Identifying contamination of samples is an important step for downstream analysis. + +Failure to detect sample contaminants can bias community diversity and strain sharing identification, which lead to false claims in research. Additionally, contaminant detection is a challenge for low-biomass data. + +There are two types of contaminants that yacht can detect. Specifically, external and cross contaminations. External contaminations includes any microbial DNA that can come from a researcher's native microbiome, experimental kits, and surrounding areas. Cross contamination can come from DNA extractions, sequencing index switching, and sample bleeding. Further, cross-contamination can complicate contimanation detection. + +Here, we explore how we can use YACHT for contaminant detection with the following tutorials: +* External Contamination Detection +* Cross Contamination Detection +* Contamination in Low-biomass Datasets + +Lou YC, Hoff J, Olm MR, West-Roberts J, Diamond S, Firek BA, Morowitz MJ, Banfield JF. Using strain-resolved analysis to identify contamination in metagenomics data. Microbiome. 2023 Mar 2;11(1):36. doi: 10.1186/s40168-023-01477-2. PMID: 36864482; PMCID: PMC9979413. + +### Contaminant Detection Examples + +#### External Contamination Detection + +Data: + +Commands: + +Interpretation: + +#### Cross Contamination Detection + +Data: + +Commands: + +Interpretation: + +#### Contamination in Low-biomass Datasets + +## Pathogenic Detection + +Early and accurate pathogen detection is vital for rapid diagnostic, clinical intervention, pathogen discovery, pathogen surveillance, and outbreak investigations. Some microorganisms are hard to grow, cultivate (i.e. viruses), and/or take a long time to grow (i.e. mycobacteriums and mold). Additionally, the origin of the sample can decrease pathogen detection sensitivity such as purulent samples having human DNA noise. These challenges can lead to extended hospitalizations, readmissions, as well as increased mortality. + +Research in pathogen detection have led to improvements of the experimental part such as differential lysis of human cells, etc. Further, efforts in improving computational tools and pipelines for pathogen detection have also been made such as real time pathogen detection (EPI2ME, SURPI-RT, etc) + +Methods for pathogen detection include PCR, multiplex PCR, broad-range PCR, antigen detection, MALDI-TOF MS, and PNA-FISH. However, PCR based methods are limited to false negatives, knowledge priori, and/or viral detection. Further, antigen detection is limited to the time for antibodies to develop which can take 1-2 weeks. Finally, methods such as mass spectrometry and insitu hybridization have been advancements for bacterial and fungal identification but still require bacterial cultivation, do not provide quantitative results (in the case of MALDI-TOF MS) or are limited to tissue samples (in the case of PNA-FISH). + +Using clinical metagenomic data, we can detect any pathogenic infection (i.e. respiratory, bloodstream, central nervous system infections) without prior knowledge from millions of reads produced by shot gun metagenomic sequencing. However, pathogen detection can be difficult depending on the sample type being used such as purulent samples having a lot of human DNA nouse. + +Computationally, pathogen detection is challenging due to alignment/classification algorithms becoming overwhelmed by large data, read sparsity (leading to de novo assembly difficulty), the lack of genome representation of novel pathogens, and the detection of pathogens that are highly divergent novel apthogens. Computationl subtraction is a popular approach to use for pathogen detection, tools such as PathSeq and SURPI use this approach. These approaches use alignments and although alignment based methods have been shown to slow down analyses, SURPI has shown to detect pathogens and generate results in approaite clinical timing from minutes to hours. + +Gu W, Deng X, Lee M, Sucu YD, Arevalo S, Stryke D, Federman S, Gopez A, Reyes K, Zorn K, Sample H, Yu G, Ishpuniani G, Briggs B, Chow ED, Berger A, Wilson MR, Wang C, Hsu E, Miller S, DeRisi JL, Chiu CY. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med. 2021 Jan;27(1):115-124. doi: 10.1038/s41591-020-1105-z. Epub 2020 Nov 9. PMID: 33169017; PMCID: PMC9020267. + +Batool M, Galloway-Peña J. Clinical metagenomics-challenges and future prospects. Front Microbiol. 2023 Jun 28;14:1186424. doi: 10.3389/fmicb.2023.1186424. PMID: 37448579; PMCID: PMC10337830. + +Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J, Greninger AL, Luk KC, Enge B, Wadford DA, Messenger SL, Genrich GL, Pellegrino K, Grard G, Leroy E, Schneider BS, Fair JN, Martínez MA, Isa P, Crump JA, DeRisi JL, Sittler T, Hackett J Jr, Miller S, Chiu CY. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res. 2014 Jul;24(7):1180-92. doi: 10.1101/gr.171934.113. Epub 2014 Jun 4. PMID: 24899342; PMCID: PMC4079973. + +### Pathogenic Detection Examples +