Releases: ahmedmagds/GNUVID

GNUVID v2.4

24 Oct 02:37
  • 1,392,002 High Quality GISAID sequences have been included in this analysis.
  • GNUVID compressed the 13920020 ORFs in the 1392002 genomes to 755489 unique alleles.
  • 731164 Sequence Types (STs) have been assigned in this dataset and were clustered in 4084 clonal complexes (CCs).
  • 1196 new CCs have been assigned (2888 CCs in Jun 2021 to 4084 in Aug 2021).
  • 3123 CCs are Inactive (i.e. last seen more than 1 month before 2021-08-31).
  • 397 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2021-08-31).
  • 564 CCs are Active (i.e. last seen within the 2 weeks before 2021-08-31).
  • A table showing summary information of the 564 Active Clonal Complexes (CCs) can be found here. A full report for the 4084 CCs can be found here.
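
The compression of ORFs into unique alleles and Sequence Types described above can be illustrated with a minimal sketch. It assumes that each distinct ORF sequence at a locus receives its own allele number and that each distinct combination of allele numbers receives its own ST. The function name, the numbering scheme and the toy sequences are hypothetical and are not taken from the GNUVID code.

```python
from collections import defaultdict


def assign_alleles_and_sts(genomes):
    """Assign allele numbers per locus and an ST per allele profile.

    genomes: dict mapping genome id -> list of ORF sequences, in the same
    locus order for every genome (hypothetical input layout).
    """
    allele_ids = defaultdict(dict)  # locus index -> {sequence: allele number}
    st_ids = {}                     # allele profile (tuple) -> ST number
    assignments = {}
    for genome_id, orfs in genomes.items():
        profile = []
        for locus, seq in enumerate(orfs):
            table = allele_ids[locus]
            if seq not in table:
                # First time this sequence is seen at this locus: new allele.
                table[seq] = len(table) + 1
            profile.append(table[seq])
        profile = tuple(profile)
        if profile not in st_ids:
            # First time this combination of alleles is seen: new ST.
            st_ids[profile] = len(st_ids) + 1
        assignments[genome_id] = (profile, st_ids[profile])
    return assignments, allele_ids, st_ids


# Toy example with two loci per genome (real genomes here have ten ORFs).
toy = {"g1": ["ATG", "CCC"], "g2": ["ATG", "CCC"], "g3": ["ATG", "CCT"]}
assignments, allele_ids, st_ids = assign_alleles_and_sts(toy)
unique_alleles = sum(len(table) for table in allele_ids.values())
```

In the toy example, identical genomes share one ST, while a change in a single ORF creates both a new allele and a new ST.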

GNUVID v2.3

20 Jul 04:49
  • 999,106 High Quality GISAID sequences have been included in this analysis from a total of 2,012,563 sequences.
  • GNUVID compressed the 9991060 ORFs in the 999106 genomes to 549768 unique alleles.
  • 523727 Sequence Types (STs) have been assigned in this dataset and were clustered in 2888 clonal complexes (CCs).
  • GNUVID now reports the WHO naming system for VOCs/VOIs (e.g. Alpha, Beta, etc.) as per the WHO update of 07/06/2021.
  • GNUVID now excludes genomes that do not pass the quality check for sequence length (20000) and proportion of ambiguous bases (Ns) (0.3). Users can change these cutoffs.
  • A table showing summary information of the 177 Active Clonal Complexes (CCs) can be found here. A full report for the 2888 CCs can be found here
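
The quality check mentioned in this release could be sketched as below. This is an illustration only: the function name and parameter names are hypothetical, and whether the cutoffs are treated as inclusive is an assumption, since the release notes only give the default values (20000 and 0.3).

```python
def passes_qc(sequence, min_length=20000, max_n_fraction=0.3):
    """Return True if a genome passes the length and ambiguity cutoffs.

    Assumptions (not confirmed by the release notes): a genome must be at
    least min_length bases long, and its fraction of N characters must not
    exceed max_n_fraction.
    """
    seq = sequence.upper()
    if not seq or len(seq) < min_length:
        return False
    n_fraction = seq.count("N") / len(seq)
    return n_fraction <= max_n_fraction


# Default cutoffs from this release, then the looser v2.2 values.
half_ambiguous = "A" * 10000 + "N" * 10000
strict = passes_qc(half_ambiguous)
relaxed = passes_qc(half_ambiguous, min_length=15000, max_n_fraction=0.5)
```

Exposing both cutoffs as keyword arguments mirrors the statement that users can change them.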

GNUVID 2.2

12 Mar 16:16

GNUVID v2.2 now uses minimap2 and Gofasta to align to the reference for prediction using the random forest classifier.

GNUVID now assigns genomes to five new Variants of Concern:

  • CC81085 represents the Brazilian P.1 lineage (a.k.a. 20J/501Y.V3).
  • CC70949 represents the Brazilian P.2 lineage.
  • CC72860 represents the Californian B.1.429 (CAL.20C) lineage.
  • CC71014 represents the South African B.1.351 lineage (a.k.a. 20H/501Y.V2).
  • 10 CCs represent the UK B.1.1.7 lineage (a.k.a. 20I/501Y.V1 Variant of Concern (VOC) 202012/01). (10 CCs: 46649, 45062, 49676, 54949, 54452, 58534, 57630, 66559, 62415 and 67441).

New Features

  • GNUVID now excludes genomes that do not pass the quality check for sequence length (15000) and proportion of ambiguous bases (Ns) (0.5). Users can change these cutoffs.
  • Skip exact matching (-e): do only prediction [Default: do exact matching first].
  • Prediction block size (-b): you can now assign the block size of genomes to be predicted at once [Default: 1000]. This will be helpful for machines with limited memory.
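
To illustrate why a block size helps on machines with limited memory, here is a minimal sketch of processing genomes in fixed-size blocks. The helper name `iter_blocks` is hypothetical and this is not the GNUVID implementation; it only shows the general idea that at most one block of records is held in memory at a time.

```python
from itertools import islice


def iter_blocks(records, block_size=1000):
    """Yield successive lists of at most block_size records."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


# Stand-in for 2500 query genomes; each block would be predicted separately.
block_lengths = [len(block) for block in iter_blocks(range(2500), 1000)]
```

A smaller block size lowers peak memory use at the cost of more prediction calls.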

GNUVID v2.1: Globally circulating clonal complexes as of 2021-01-06 (Variants of Concern now being assigned)

02 Feb 12:55
67455e9

GNUVID 2.1

GNUVID now assigns the three new variants of concern (UK, SA and Brazilian):

  • CC81085 represents the Brazilian P.1 lineage (a.k.a. 20J/501Y.V3).
  • CC71014 represents the South African B.1.351 lineage (a.k.a. 20H/501Y.V2).
  • 10 CCs represent the UK B.1.1.7 lineage (a.k.a. 20I/501Y.V1 Variant of Concern (VOC) 202012/01). (10 CCs: 46649, 45062, 49676, 54949, 54452, 58534, 57630, 66559, 62415 and 67441).

Globally circulating clonal complexes as of 2021-01-06:

  • 159515 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 1595150 ORFs in the 159515 genomes to 89491 unique alleles.

  • 81097 Sequence Types (STs) have been assigned in this dataset and were clustered in 406 clonal complexes (CCs).

  • 252 new CCs have been assigned.

  • 133 CCs are Inactive (i.e. last seen more than 1 month before 2021-01-06).

  • 209 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2021-01-06).

  • 64 CCs are Active (i.e. last seen within the 2 weeks before 2021-01-06).
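
The Active / Quiet / Inactive labels above depend only on how long before the cutoff date a CC was last seen. A minimal sketch follows, assuming day-based thresholds of 14 and 28 days; the release notes say "2 weeks", "2-4 weeks" and "more than 1 month", so the exact boundaries, the function name and the parameter names are assumptions.

```python
from datetime import date


def cc_status(last_seen, cutoff, active_days=14, quiet_days=28):
    """Classify a clonal complex by the age of its most recent isolate.

    Assumed thresholds: up to active_days before the cutoff is Active,
    up to quiet_days is Quiet, anything older is Inactive.
    """
    age = (cutoff - last_seen).days
    if age < 0:
        raise ValueError("last_seen is after the cutoff date")
    if age <= active_days:
        return "Active"
    if age <= quiet_days:
        return "Quiet"
    return "Inactive"


cutoff = date(2021, 1, 6)
statuses = [
    cc_status(date(2021, 1, 1), cutoff),
    cc_status(date(2020, 12, 15), cutoff),
    cc_status(date(2020, 11, 1), cutoff),
]
```

Keeping the thresholds as parameters makes it easy to test other interpretations of "1 month".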

GNUVID v2.0: Globally circulating clonal complexes as of 2020-10-20

09 Dec 21:46

GNUVID 2.0

Big update

This release of GNUVID comes with a significant speed-up and improved classification. The new classification algorithm is called GNUVID_Predict.

Use of GNUVID now is as simple as GNUVID query_fasta.fna

As of GNUVID 2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.

The model was trained using 53,565 SARS-CoV-2 sequences from GISAID. The alignment of these genomes to MN908947.3 was one-hot encoded. The classifier was built using the scikit-learn implementation of Random Forest.
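
As an illustration of the one-hot encoding step, the sketch below turns an aligned sequence into a flat 0/1 feature vector. The alphabet (A, C, G, T, gap, N), the handling of unexpected characters and the function name are assumptions; the release notes do not state how the alignment columns were actually encoded.

```python
# Assumed alphabet: four bases, the alignment gap and the ambiguity code N.
ALPHABET = "ACGT-N"
_INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot_encode(aligned_seq):
    """Encode an aligned sequence as a flat list of 0/1 features.

    Each alignment column becomes len(ALPHABET) features with a single 1.
    Characters outside the assumed alphabet are treated as N.
    """
    features = []
    for base in aligned_seq.upper():
        row = [0] * len(ALPHABET)
        row[_INDEX.get(base, _INDEX["N"])] = 1
        features.extend(row)
    return features


encoded = one_hot_encode("AC")
```

Because every genome is aligned to the same reference, all encoded vectors have the same length, which is what a classifier such as scikit-learn's RandomForestClassifier expects as input rows.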

Globally circulating clonal complexes as of 2020-10-20:

  • 69686 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 696860 ORFs in the 69686 genomes to 37921 unique alleles.

  • 35010 Sequence Types (STs) have been assigned in this dataset and were clustered in 154 clonal complexes (CCs).

  • 84 new CCs have been assigned.

  • 82 CCs are Inactive (i.e. last seen more than 1 month before 2020-10-20).

  • 27 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-10-20).

  • 45 CCs are Active (i.e. last seen within the 2 weeks before 2020-10-20).

  • CC70, CC26, CC343, CC439, CC927, CC1434, CC11290, CC13202, CC13669 and CC17244 are now called CC550, CC750, CC9999, CC2649, CC1179, CC2175, CC18372, CC13208, CC12995 and CC13413, respectively.

GNUVID v1.4: Globally circulating clonal complexes as of 2020-08-17

26 Aug 04:22
  • 32719 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 327190 ORFs in the 32719 genomes to 19224 unique alleles.

  • 17615 Sequence Types (STs) have been assigned in this dataset and were clustered in 70 clonal complexes (CCs).

  • 11 new CCs have been assigned.

  • 54 CCs are Inactive (i.e. last seen more than 1 month before 2020-08-17).

  • 12 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-08-17).

  • 4 CCs are Active (i.e. last seen within the 2 weeks before 2020-08-17).

  • CC70, CC26, CC343, CC927, CC1434 and CC13202 are now called CC550, CC750, CC9999, CC1179, CC2175 and CC13208, respectively.

GNUVID v1.3: Globally circulating clonal complexes as of 2020-07-17

30 Jul 08:18

Summary

  • 25594 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 255940 ORFs in the 25594 genomes to 15025 unique alleles.

  • 13688 Sequence Types (STs) have been assigned in this dataset and were clustered in 59 clonal complexes (CCs).

  • 14 new CCs have been assigned.

  • 42 CCs are Inactive (i.e. last seen more than 1 month before 2020-07-17).

  • 12 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-07-17).

  • 5 CCs are Active (i.e. last seen within the 2 weeks before 2020-07-17).

  • CC70, CC26, CC343, CC927 and CC1434 are now called CC550, CC750, CC9999, CC1179 and CC2175, respectively.

  • Starting from the DB update (07/17/2020), the nine defining SNPs (C241,C3037,A23403,C8782,G11083,G25563,G26144,T28144,G28882) for the six GISAID clades are reported for each CC for easier correlation between the two systems.

GNUVID v1.2: Globally circulating clonal complexes as of 2020-06-17

02 Jul 22:14

Summary

  • 18298 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 182980 ORFs in the 18298 genomes to 10686 unique alleles.

  • 9676 Sequence Types (STs) have been assigned in this dataset and were clustered in 45 clonal complexes (CCs).

  • 21 new CCs have been assigned.

  • 33 CCs are Inactive (i.e. last seen more than 1 month before 2020-06-17).

  • 5 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-06-17).

  • 7 CCs are Active (i.e. last seen within the 2 weeks before 2020-06-17).

  • CC70 and CC26 are now called CC550 and CC750, respectively.

  • The three isolates from the recent large COVID-19 outbreak in Beijing were assigned three new STs, each a Single Locus Variant (SLV) of ST300 at ORF1ab. ST300 is mainly found in Europe and is the founder of CC300.

    • Beijing/IVDC-01-06/2020|EPI_ISL_469254|2020-06-11 is assigned to ST9646 (CC300)

    • Beijing/IVDC-02-06/2020|EPI_ISL_469255|2020-06-11 is assigned to ST9647 (CC300)

    • Beijing/env/IVDC-03-06/2020|EPI_ISL_469256|2020-06-11 is assigned to ST9648 (CC300)
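
A Single Locus Variant differs from another ST at exactly one locus of its allele profile. The check can be sketched as follows, assuming each ST is represented as a tuple of allele numbers in a fixed locus order. The function names and the example profiles are hypothetical and do not reproduce the real ST300 or ST9646 profiles.

```python
def differing_loci(profile_a, profile_b):
    """Return the indices of the loci at which two allele profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return [
        locus
        for locus, (a, b) in enumerate(zip(profile_a, profile_b))
        if a != b
    ]


def is_single_locus_variant(profile_a, profile_b):
    """True when the two profiles differ at exactly one locus."""
    return len(differing_loci(profile_a, profile_b)) == 1


# Hypothetical ten-locus profiles; locus 0 stands in for ORF1ab.
founder = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
variant = (2, 1, 1, 1, 1, 1, 1, 1, 1, 1)
changed = differing_loci(founder, variant)
```

Returning the differing loci, rather than only a yes/no answer, also shows which ORF carries the change.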

GNUVID v1.1

02 Jul 06:34

New and updated scripts have been added, and a database release report is now available in the main README.md.

Launch

08 Jun 20:07
d9377d3
v1.0

Update README.md