Skip to content

StrainGST

Lucas van Dijk edited this page Jul 29, 2020 · 3 revisions

Identify close reference genomes and interpret results

Prerequisites

  1. A pre-built database for the genus or species of interest
  2. A whole metagenomic sequencing (WMS) sample

Running StrainGST

1. K-merize the sample reads

StrainGST iteratively compares the k-mer profiles of references in the database to the k-mers in the sample to identify close reference genomes to strains in a sample.

Our first step is to kmerize the sample reads. For example, if you have a FASTQ file named patient1.fastq with all reads, then we generate its corresponding k-mer set as follows:

straingst kmerize -k 23 -o patient1.hdf5 patient1.fastq

Similar to the first step of the database creation section, this will generate a HDF5 file named patient1.hdf5 with all k-mers and their corresponding counts. Make sure the value of k is the same as used in the database creation step.

You can specify multiple FASTQ files to the command above, which is useful if you have paired-end reads. Furthermore, it will also automatically decompress gzipped files. For example, if you have gzipped paired-end FASTQ files, then the following command also works:

straingst kmerize -k 23 -o patient1.hdf5 \
    patient1.1.fastq.gz patient1.2.fastq.gz

2. Run StrainGST

We can now run straingst run with our database HDF5 and our sample HDF5:

straingst run -o results.tsv pan-genome-db.hdf5 patient1.hdf5

This will output a tab separated values (tsv) file, containing statistics about the sample k-mer set and a list of identified reference strains with accompanying metrics.

Output file description

Example output

sample    totalkmers      distinct   pkmers  pkcov   pan%
UMB11_01  2277023860      380759656  50090   6.984   1.536
i         strain          gkmers     ikmers  skmers  cov    kcov   gcov   acct   even   spec   rapct  wscore  score
0         Esch_coli_NGF1  49631      49622   50090   0.985  7.009  6.831  0.980  0.987  1.000  1.507  0.940   0.940

Sample statistics

The first two lines contain statistics on the whole sample.

Columns:

  • sample: Sample name, derived from the k-mer set filename
  • totalkmers: total number of k-mers counted in the sample, including k-mers that occur multiple times
  • distinct: total unique number of k-mers
  • pkmers: total unique number of k-mers that are also present in the database
  • pkcov: average "coverage" (multiplicity) of each unique k-mer in the sample that is also present in the database.
  • pan%: total number of k-mers (including duplicates) that are present in both the sample and database divided by the total number of k-mers in the sample (totalkmers), i.e. an estimation of the relative abundance of the species/genus of interest in this sample.

Reference strain statistics

The next lines contain the close reference genomes identified by StrainGST.

Columns:

  • i: Iteration number
  • strain: Reference strain name
  • gkmers: Total number of unique k-mers in the original reference genome (or its fingerprint).
  • ikmers: Remaining unique k-mers in the genome after discarding k-mers excluded in an earlier iteration or because their average coverage was too high
  • skmers: Remaining unique k-mers from the sample
  • cov: Breadth of coverage of this reference, i.e. what fraction of k-mers in the reference is also present in the sample
  • kcov: Average depth of coverage of k-mers both present in the reference and in the sample
  • gcov: Average depth of coverage of all k-mers in the reference
  • acct: What fraction of the sample k-mers can be explained by this reference?
  • even: Evenness of coverage. A value close to 1 indicates that the coverage is evenly distributed along the genome, a value close to zero indicates that only a small part of the genome is well covered.
  • spec: Obsolete
  • rapct: Estimated strain relative abundance. The relative abundance of the first strain is biased upwards because it included "core" k-mers that get discarded in next iterations.
  • wscore: Obsolete
  • score: Score used to rank each reference in the database at each iteration. A high score represents high confidence in this prediction. Scores cannot be compared across iterations or across samples, and it is possible that a strain in a second iteration has a higher score than the strain in the first iteration.

Tips and tricks

Easily parse StrainGST file in Python:

from strainge.io.utils import parse_straingst

results = ['sample1.tsv', 'sample2.tsv']

for sample in results:
    print('#', sample)
    with open(sample) as f:
        for strain in parse_straingst(f):
            print(strain)  # strain is a dict with above columns

With sample statistics:

from strainge.io.utils import parse_straingst

results = ['sample1.tsv', 'sample2.tsv']

for sample in results:
    print('#', sample)
    with open(sample) as f:
        straingst_iter = iter(parse_straingst(f, return_sample_stats=True))
        sample_stats = next(straingst_iter)
        print(sample_stats)

        for strain in straingst_iter:
            print(strain)  # strain is a dict with above columns