-
Notifications
You must be signed in to change notification settings - Fork 11
StrainGST
Identify close reference genomes and interpret results
- A pre-built database for the genus or species of interest
- A whole metagenomic sequencing (WMS) sample
StrainGST iteratively compares the k-mer profiles of references in the database to the k-mers in the sample to identify close reference genomes to strains in a sample.
Our first step is to kmerize the sample reads. For example, if you have a FASTQ
file named patient1.fastq
with all reads, then we generate its corresponding
k-mer set as follows:
straingst kmerize -k 23 -o patient1.hdf5 patient1.fastq
Similar to the first step of the database creation section, this will generate
a HDF5 file named patient1.hdf5
with all k-mers and their corresponding
counts. Make sure the value of k
is the same as used in the
database creation step.
You can specify multiple FASTQ files to the command above, which is useful if you have paired-end reads. Furthermore, it will also automatically decompress gzipped files. For example, if you have gzipped paired-end FASTQ files, then the following command also works:
straingst kmerize -k 23 -o patient1.hdf5 \
patient1.1.fastq.gz patient1.2.fastq.gz
We can now run straingst run
with our database HDF5 and our sample HDF5:
straingst run -o results.tsv pan-genome-db.hdf5 patient1.hdf5
This will output a tab separated values (tsv) file, containing statistics about the sample k-mer set and a list of identified reference strains with accompanying metrics.
sample totalkmers distinct pkmers pkcov pan%
UMB11_01 2277023860 380759656 50090 6.984 1.536
i strain gkmers ikmers skmers cov kcov gcov acct even spec rapct wscore score
0 Esch_coli_NGF1 49631 49622 50090 0.985 7.009 6.831 0.980 0.987 1.000 1.507 0.940 0.940
The first two lines contain statistics on the whole sample.
Columns:
- sample: Sample name, derived from the k-mer set filename
- totalkmers: total number of k-mers counted in the sample, including k-mers that occur multiple times
- distinct: total unique number of k-mers
- pkmers: total unique number of k-mers that are also present in the database
- pkcov: average "coverage" (multiplicity) of each unique k-mer in the sample that is also present in the database.
- pan%: total number of k-mers (including duplicates) that are present in both the sample and database divided by the total number of k-mers in the sample (totalkmers), i.e. an estimation of the relative abundance of the species/genus of interest in this sample.
The next lines contain the close reference genomes identified by StrainGST.
Columns:
- i: Iteration number
- strain: Reference strain name
- gkmers: Total number of unique k-mers in the original reference genome (or its fingerprint).
- ikmers: Remaining unique k-mers in the genome after discarding k-mers excluded in an earlier iteration or because their average coverage was too high
- skmers: Remaining unique k-mers from the sample
- cov: Breadth of coverage of this reference, i.e. what fraction of k-mers in the reference is also present in the sample
- kcov: Average depth of coverage of k-mers both present in the reference and in the sample
- gcov: Average depth of coverage of all k-mers in the reference
- acct: What fraction of the sample k-mers can be explained by this reference?
- even: Evenness of coverage. A value close to 1 indicates that the coverage is evenly distributed along the genome, a value close to zero indicates that only a small part of the genome is well covered.
- spec: Obsolete
- rapct: Estimated strain relative abundance. The relative abundance of the first strain is biased upwards because it included "core" k-mers that get discarded in next iterations.
- wscore: Obsolete
- score: Score used to rank each reference in the database at each iteration. A high score represents high confidence in this prediction. Scores cannot be compared across iterations or across samples, and it is possible that a strain in a second iteration has a higher score than the strain in the first iteration.
Easily parse StrainGST file in Python:
from strainge.io.utils import parse_straingst
results = ['sample1.tsv', 'sample2.tsv']
for sample in results:
print('#', sample)
with open(sample) as f:
for strain in parse_straingst(f):
print(strain) # strain is a dict with above columns
With sample statistics:
from strainge.io.utils import parse_straingst
results = ['sample1.tsv', 'sample2.tsv']
for sample in results:
print('#', sample)
with open(sample) as f:
straingst_iter = iter(parse_straingst(f, return_sample_stats=True))
sample_stats = next(straingst_iter)
print(sample_stats)
for strain in straingst_iter:
print(strain) # strain is a dict with above columns