This programme takes advantage of Stacks to analyze NGS data. We have successfully processed NGS data for Brassicacae plants and Triticum aestivum L.
- Required environment
- Installing Prerequisites
- Quick Guide
- Multithread Run
- Documentation
- Interpreting Results
- stacks
- vcftools
- samtools
- trimmomatic
- java
- bwa
- CMplot
- matplotlib
- pandas
- scipy
- numpy
- etc.
→ Please refer to requirements.txt for more information.
Please refer to the homepage of each package for more information.
Run install.packages("package_needed")
in R.
Run the following script in shell. This script will download requirements.txt and automatically install.
if command -v wget >/dev/null 2>&1; then
wget -O ./requirements.txt https://raw.githubusercontent.com/ShuXingYu94/NGS_Process_Sample/master/requirements.txt
else
if command -v curl >/dev/null 2>&1; then
curl -o ./requirements.txt https://raw.githubusercontent.com/ShuXingYu94/NGS_Process_Sample/master/requirements.txt
else
echo "Wget and Curl are not installed. Please install."
exit 0
fi
fi
if command -v pip3 >/dev/null 2>&1; then
pip3 install -r requirements.txt
else
if command -v pip >/dev/null 2>&1; then
pip install -r requirements.txt
else
echo "Something wrong with pip/pip3. Please install python packages with other software like conda."
exit 0
fi
fi
rm ./requirements.txt
Please start with the following folder structure.
work_dir
├── BaseCall
│ ├── sample1_R1_001.fastq.gz
│ ├── sample1_R2_001.fastq.gz
│ ├── sample2_R1_001.fastq.gz
│ └── sample2_R2_001.fastq.gz
├── MIGadapter.fasta → (Optional)
└── popmap.txt → (Optional)
Normally, putting NGS data files in ./work_dir/BaseCall
is all you need to do.
For trimming of the original data files, an adapter.fasta file is needed. By default, a
MIGadapter.fasta
file will be downloaded.In case of multiple population analysis, you can put a
popmap.txt
file in./work_dir/
. By default, apopmap.txt
file will be automatically generated with all the samples recognized in the same population.
Download main.sh
and statistics.sh
files from the Master branch to your working directory.
You can find the recommended version in the latest release.
Input the required configurations in main.sh
as follows.
# Required Settings
workdir=/Users/.../work_dir
mapping_db=/Users/.../reference_genome.fa
trimmomatic_dir=/Users/.../Trimmomatic-0.39/trimmomatic-0.39.jar
adapter=${workdir}/MIGadapter.fasta
popmap_dir=${workdir}/popmap.txt
# Read file names
files="sample1 sample2"
Basically, there's no need to change the optional configurations.
However, in case of analyzing more than 2 samples, it is recommended to set stacks_r = 0.8
rather than stacks_r = 1.0
by default.
→ For more information, please refer to the documentation of Trimmomatic and Stacks.
Note
Currently, copy all the commands and execute is recommended.
Please execute the following commands in the same directory as main.sh
and statistics.sh
,
bash main.sh
bash statistics.sh
or simply copy all the commands and execute them.
Warning
This is not a part of the analysis !
Check the directory and files you are deleting !
- To start from the beginning, delete all the generated files, please execute the following command:
rm -rf ${workdir}/aligned
rm -rf ${workdir}/log
rm -rf ${workdir}/stacks
rm -rf ${workdir}/trimmed
rm -rf ${workdir}/statistics
rm ${workdir}/*.py
rm ${workdir}/*.r
# Optional: rm ${workdir}/popmap.txt
This will remove all the generated and downloaded files and return to the original folder structure.
- To rerun the statistical analysis, please execute the following command:
rm -rf ${workdir}/statistics
From the release later than v1.0.0, you can execute the script multithreading.sh
along to carry out the analysis more than 20 times faster than before (By the time of 2023/1/3).
It takes advantage of multicore functionality to accelerate the whole calculating process, which will require more memory than before.
Please note that it is not recommended to run this script before you have successfully executed all the commands stated in Quick Guide for at least once.
...still in progress, please follow the steps in Quick Guide.
After execution of main.sh
and statistics.sh
, the folder should look something like this:
work_dir
├── BaseCall
├── aligned
├── log
├── stacks
├── statistics
│ ├── Consensus
│ ├── Coverage
│ └── Read_depth
└── trimmed
# the files inside is not shown
Here, the original fastq.gz
data files are in folder ./BaseCall
. After trimming with given adapter.txt
file, the reads are saved in folder ./trimmed
. The trimmed files are subsequently mapped to reference genome and saved in .bam
format in folder ./algined
. Then, Stacks were used to analyze mapped reads, carry out SNPs calling and population analysis. These results are saved in ./stacks
.
Eventually, the statistical analysis were carried out with statistics.sh
using the above files. The generated result files are kept in folder ./statistics
, which should look like the following.
statistics/
├── Consensus
│ ├── chr.txt
│ └── consensus.txt
├── Coverage
│ ├── sample1.txt
│ └── sample2.txt
├── Read_depth
│ ├── sample1.txt
│ └── sample2.txt
├── SNP_Depth_0.jpg
├── SNP_Depth_10.jpg
├── SNPs_Distribution_0.svg
├── SNPs_Distribution_10.svg
└── results.txt
Basically, all the statistical results and figures we need is in this folder.
Currently, the following results are available.
- Data in table 1
- Reads Count of Raw Data
- Reads Count of Trimmed Data
- Mapped Length of Trimmed Data
- Mapped Coverage of Trimmed Data
- Average Depth of Mapped Reads
- Consensus Length
- Consensus Coverage
- Data in table 2
- SNPs Count
- Average SNP Depth
- Mapped Length / SNPs
- Consensus Length / SNPs
- Figures
- SNPs Distribution Based on Chromosomes
- Python - SNP distribution
- R - SNP depth and distribution
All the numerical data are saved in ./statistics/results.txt
. It is recommended to open the file with Excel or any software that supports .tsv
format.
File ./statistics/results.txt
contains two independent tables.
- Table 1 contains information from Reads Count of Raw Data to Consensus Coverage.
- Table 2 contains information from SNPs Count to Consensus Length / SNPs
SNPs Distribution Figure shows the distribution of SNPs on the chromosomes using reference genome information.
- File
./statistics/SNPs_Distribution_0.svg
shows distribution of all the SNPs from./stacks/populations.snps.vcf
- File
./statistics/SNPs_Distribution_10.svg
shows distribution of SNPs the depth of which is >= 10 on the first two samples.
SNP Depth Figure shows the depth of each SNPs on the chromosomes.
- File
./statistics/SNP_Depth_0.jpg
shows depth of all the SNPs from./stacks/populations.snps.vcf
- File
./statistics/SNP_Depth_10.jpg
shows depth of SNPs the depth of which is >= 10 on the first two samples.