Skip to content

Commit

Permalink
GNUVID v2.4 update
Browse files Browse the repository at this point in the history
  • Loading branch information
ahmedmagds committed Oct 24, 2021
1 parent 753fc7e commit c52b4ec
Show file tree
Hide file tree
Showing 21 changed files with 9,360 additions and 7,217 deletions.
51 changes: 23 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,11 @@
# GNUVID
**G**ene **N**ovelty **U**nit-based **V**irus **ID**entification for **SARS-CoV-2**
## Introduction
GNUVID (GNU-based Virus IDentification) is a Python3 program. It ranks CDS nucleotide sequences in a genome fna file based on the number of observed exact CDS nucleotide matches in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach. The 10 ORFs (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10) in SARS-CoV-2 are used for typing. It automatically assigns allele numbers to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It is based on our recent panallelome approach implemented in [WhatsGNU](https://github.com/ahmedmagds/WhatsGNU). It can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.<br/>
GNUVID (GNU-based Virus IDentification) is a Python3 program. It ranks CDS nucleotide sequences in a genome fna file based on the number of observed exact CDS nucleotide matches in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach. The 10 ORFs (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10) in SARS-CoV-2 are used for typing. It automatically assigns allele numbers to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It is based on our recent panallelome approach implemented in [WhatsGNU](https://github.com/ahmedmagds/WhatsGNU). The STs are then clustered into bigger groups which are designated clonal complexes (CCs) based on their grouping on a minimum spanning tree (MST). The CCs are more granular than a Pango Lineage. It can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.<br/>

A pre-print of the paper **Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing** can be found here: https://www.biorxiv.org/content/10.1101/2020.12.28.424582v1
GNUVID is now published [Moustafa AM and Planet PJ 2021. **Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing**. Genome Biology and Evolution;13(9):evab197](https://doi.org/10.1093/gbe/evab197)

A table of acknowledgements for the GISAID SARS-CoV-2 sequences used here is available from:
https://github.com/ahmedmagds/GNUVID/blob/master/GISAID_acknowledgement_table_2021_01_06-merged_all.pdf
We acknowledge the open-science of the individual research labs and public agencies that have made their SARS-CoV-2 genomes available on [GISAID](https://www.gisaid.org/).

## Install and use as simple as
Make a new environment and install GNUVID in it
Expand All @@ -20,46 +19,42 @@ conda create -n GNUVID -c bioconda gnuvid
conda activate GNUVID
```

## Globally circulating clonal complexes as of 2021-06-21:
## Globally circulating clonal complexes as of 2021-08-31:

- 999106 High Quality GISAID sequences have been included in this analysis.
- 1,392,002 High Quality GISAID sequences have been included in this analysis.

- GNUVID compressed the 9991060 ORFs in the 999106 genomes to 549768 unique alleles.
- GNUVID compressed the 13920020 ORFs in the 1392002 genomes to 755489 unique alleles.

- 523727 Sequence Types (STs) have been assigned in this dataset and were clustered in 2888 clonal complexes (CCs).
- 731164 Sequence Types (STs) have been assigned in this dataset and were clustered in 4084 clonal complexes (CCs).

- 2482 new CCs have been assigned (406 CCs in Jan 2021 to 2888 in Jun 2021).
- 1196 new CCs have been assigned (2888 CCs in Jun 2021 to 4084 in Aug 2021).

- 1716 CCs have been Inactive (i.e. Last time seen more than 1 month before 2021-06-21).
- 3123 CCs have been Inactive (i.e. Last time seen more than 1 month before 2021-08-31).

- 995 CCs have gone Quiet (i.e. Last seen 2-4 weeks before 2021-06-21).
- 397 CCs have gone Quiet (i.e. Last seen 2-4 weeks before 2021-08-31).

- 177 CCs have been Active (i.e. Last seen within the 2 weeks before 2021-06-21).
- 564 CCs have been Active (i.e. Last seen within the 2 weeks before 2021-08-31).


## GNUVID now reports the WHO Naming system for VOCs/VOIs (e.g. Alpha, Beta..etc) as per the WHO updated on 07/06/2021:
## GNUVID now reports the WHO Naming system for VOCs/VOIs/VUMs (e.g. Alpha, Beta..etc) as per the WHO updated on 10/22/2021:

- 1346 CCs representing the **Alpha** VOC (a.k.a. B.1.1.7).
- 1597 CCs representing the **Alpha** VOC (a.k.a. B.1.1.7 and descendant Q.* lineages).

- 25 CCs representing the **Beta** VOC (a.k.a. B.1.351, B.1.351.2, B.1.351.3).
- 27 CCs representing the **Beta** VOC (a.k.a. B.1.351 and descendant lineages).

- 61 CCs representing the **Gamma** VOC (a.k.a. P.1, P.1.1, P.1.2).
- 117 CCs representing the **Gamma** VOC (a.k.a. P.1 and descendant lineages).

- 47 CCs representing the **Delta** VOC (a.k.a. B.1.617.2, AY.1, AY.2).
- 777 CCs representing the **Delta** VOC (a.k.a. B.1.617.2 and descendant AY.* lineages).

- 3 CCs representing the **Eta** VOI (a.k.a. B.1.525).
- 6 CC representing the **Lambda** VOI (a.k.a. C.37).

- 80 CCs representing the **Iota** VOI (a.k.a. B.1.526).
- 6 CCs representing the **Mu** VOI (a.k.a. B.1.621).

- 8 CCs representing the **Kappa** VOI (a.k.a. B.1.617.1).
- 225 CCs representing the 16 lineages (B.1.427/429, R.1, C.1.2, B.1.466.2, B.1.1.318, B.1.1.519, B.1.1.523, C.36.3, B.1.525, B.1.526, B.1.619, B.1.620, B.1.630, B.1.617.1 and B.1.214.2) that are currently designated **Variants Under Monitoring (VUM)** by WHO for Further Monitoring.

- 1 CC representing the **Lambda** VOI (a.k.a. C.37).
- The remaining 1329/4084 CCs are not designated VOC/VOI/VUM by WHO (10/22/2021).

- 124 CCs representing the 12 lineages (B.1.427/429, P.2, P.3, R.1/R.2, B.1.466.2, B.1.621, AV.1, B.1.1.318, B.1.1.519, AT.1, C.36.3/C.36.3.1, and B.1.214.2) that are currently designated **Alerts** by WHO for Further Monitoring.

- The remaining 1193/2888 CCs are not designatd VOC/VOI/Alert by WHO (07/06/2021).

**A table showing summary information of the 177 Active Clonal Complexes (CCs) can be found [here](https://github.com/ahmedmagds/GNUVID/tree/master/db). A full report for the 2888 CCs can be found [here](https://github.com/ahmedmagds/GNUVID/blob/master/db/GNUVID_06212021_CCs_report.txt)**
**A table showing summary information of the 564 Active Clonal Complexes (CCs) can be found [here](https://github.com/ahmedmagds/GNUVID/tree/master/db). A full report for the 4084 CCs can be found [here](https://github.com/ahmedmagds/GNUVID/blob/master/db/GNUVID_08312021_CCs_report.txt)**

## Installation
### Dependencies
Expand Down Expand Up @@ -97,7 +92,7 @@ $export PATH=$PATH:/path/to/folder/having/GNUVID/bin
If you need it permanently, you can add this last line to your .bashrc or .bash_profile.
### Test
* Type GNUVID_Predict.py -h and it should output help screen.
* Type GNUVID_Predict.py -v and you should see an output like GNUVID.py v2.3.
* Type GNUVID_Predict.py -v and you should see an output like GNUVID.py v2.4.

## Usage for GNUVID_Predict.py
### Input
Expand All @@ -115,7 +110,7 @@ $GNUVID_Predict.py -i -o new_genomes_GNUVID new_genomes.fasta
```
usage: GNUVID_Predict.py [-h] [-o OUTPUT_FOLDER] [-m MIN_LEN] [-n N_MAX] [-b BLOCK_PRED] [-e] [-i] [-f] [-q] [-v] query_fna
GNUVID v2.3 uses the natural variation in public genomes of SARS-CoV-2 to rank
GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to rank
gene sequences based on the number of observed exact matches (the GNU score)
in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome
based on its profile of unique gene allele sequences. It can type (using whole
Expand Down
6 changes: 3 additions & 3 deletions bin/GNUVID.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@

PARSER = argparse.ArgumentParser(
prog="GNUVID.py",
description="GNUVID v2.3 utilizes the natural\
description="GNUVID v2.4 utilizes the natural\
variation in public genomes of SARS-CoV-2 to rank gene sequences based on the number of observed exact \
matches (the GNU score) in all known genomes of SARS-CoV-2. It types the genomes based on their unique \
gene allele sequences. It types (using a whole genome MLST) your query genome in seconds.",
Expand Down Expand Up @@ -120,7 +120,7 @@
"--version",
help="print version and exit",
action="version",
version="%(prog)s 2.1",
version="%(prog)s 2.4",
)
PARSER.add_argument(
"reference",
Expand Down Expand Up @@ -1044,7 +1044,7 @@
QUERYFILE_OBJECT.close()
logging.info("Typed the query isolate/s and wrote Query_isolates_GNUVID_ST_Report.txt")
logging.info("Done in --- {:.3f} seconds ---".format(time.time() - START_TIME))
logging.info("""Thanks for using GNUVID v2.3, I hope you found it useful.
logging.info("""Thanks for using GNUVID v2.4, I hope you found it useful.
Please cite WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
Please also cite BLAST+ 'Camacho et al. 2009, BMC Bioinformatics;10:421' if you use GNUVID.
Please also cite GISAID 'Shu Y. and McCauley J. 2017, EuroSurveillance; 22:13'
Expand Down
2 changes: 1 addition & 1 deletion bin/GNUVID_Post_CC_processor.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
PARSER.add_argument("-l","--level", type=int, help="level of locus variant to assign CC (e.g. SLV, DLV) [Default: 2]")
PARSER.add_argument("-n","--number_connections", type=int, help="number of connections to assign CC [Default: 20]")
PARSER.add_argument("output_path", type=str, help="output path folder for aln")
PARSER.add_argument("eBURST_MST_report", type=str, help="eBURST MST csv report")
PARSER.add_argument("eBURST_MST_report", type=str, help="eBURST MST txt report")
PARSER.add_argument("ST_GNUVID_report", type=str, help="ST GNUVID csv report")
PARSER.add_argument("fasta_aln", type=str, help="path to Multifasta.fna name")
if len(sys.argv) == 1:
Expand Down
31 changes: 21 additions & 10 deletions bin/GNUVID_Post_Training.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@
prog="GNUVID_Post_Training.py",
description="This script will predict CC for NA_genomes & summarize CCs",)
PARSER.add_argument("-p","--predict",help="run NA prediction [default OFF]",action="store_true",)
PARSER.add_argument("-s","--subsample",help="subsample fna as query_aln file used so extract right feature [default OFF]",action="store_true",)
PARSER.add_argument(
"inactive_date",
type=str,
Expand All @@ -71,7 +72,7 @@
PARSER.add_argument("ST_GNUVID_report", type=str, help="ST GNUVID txt report")
PARSER.add_argument("DB_RF", type=str, help="Random forest DB (.joblib)")
PARSER.add_argument("features", type=str, help="features (SNPs) positions")
PARSER.add_argument("query_aln", type=str, help="Query Whole Genome SNPs MSA file to analyze (.aln)")
PARSER.add_argument("query_aln", type=str, help="Query Whole Genome SNPs MSA file to analyze (.aln) or Query Whole Genome MSA file (.fna)")
if len(sys.argv) == 1:
PARSER.print_help()
sys.exit(0)
Expand All @@ -81,8 +82,10 @@
if ARGS.predict:
VCF_OBJECT = open(ARGS.features,'r') # has the nucleotide(feature) positions
features_list = []
postitions_list = []
for line in VCF_OBJECT:
features_list.append(line.rstrip())
postitions_list.append(int(line.rstrip())-1)
VCF_OBJECT.close()
###########Parse the SNPs alignment file#########
SNPs_aln = open(ARGS.query_aln,'r')#sequence_file
Expand All @@ -99,6 +102,8 @@
if counter in [1000,50000,100000,200000,300000,400000,500000,600000,700000]:
print('RAM memory % used: {} for {} seqs'.format(psutil.virtual_memory()[3],counter),flush=True)
lst = list(SEQUENCE_STRING)
if ARGS.subsample:
lst = [lst[ind] for ind in postitions_list]
SEQUENCES_LIST.append(lst)
order_list.append(SEQUENCE_INFO)
SEQUENCE_STRING = ""
Expand All @@ -110,6 +115,8 @@
if counter in [1000,50000,100000,200000,300000,400000,500000,600000,700000]:
print('RAM memory % used: {} for {} seqs'.format(psutil.virtual_memory()[3],counter),flush=True)
lst = list(SEQUENCE_STRING)
if ARGS.subsample:
lst = [lst[ind] for ind in postitions_list]
SEQUENCES_LIST.append(lst)
order_list.append(SEQUENCE_INFO)
SNPs_aln.close()
Expand Down Expand Up @@ -295,14 +302,15 @@
os.system('gzip -c {} > {}'.format(op_name, gzip1))
os.system('gzip -c {} > {}'.format(Output_GNUVID, gzip2))
#######WHO Naming#########
VOC_VOI = {"B.1.1.7":'Alpha', 'B.1.351':'Beta','B.1.351.2':'Beta','B.1.351.3':'Beta',
'P.1':'Gamma','P.1.1':'Gamma','P.1.2':'Gamma','B.1.617.2':'Delta',
'AY.1':'Delta','AY.2':'Delta','B.1.525':'Eta','B.1.526':'Iota',
'B.1.617.1':'Kappa','C.37':'Lambda',
'B.1.427':'Alert','B.1.429':'Alert', 'P.2':'Alert','P.3':'Alert',
'R.1':'Alert','R.2':'Alert', 'B.1.466.2':'Alert','B.1.621':'Alert',
'AV.1':'Alert','B.1.1.318':'Alert','B.1.1.519':'Alert','AT.1':'Alert',
'C.36.3':'Alert','C.36.3.1':'Alert','B.1.214.2':'Alert','B.1.427/429':'Alert'}
VOC_VOI = {"B.1.1.7":'Alpha','Q':'Alpha','B.1.351':'Beta','P.1':'Gamma',
'B.1.617.2':'Delta','AY':'Delta','C.37':'Lambda','B.1.621':'Mu',
'B.1.525':'VUM','B.1.526':'VUM','B.1.630':'VUM',
'B.1.617.1':'VUM','B.1.619':'VUM','B.1.620':'VUM','C.1.2':'VUM',
'B.1.427':'VUM','B.1.429':'VUM', 'P.2':'VUM','P.3':'VUM',
'R.1':'VUM','B.1.466.2':'VUM','B.1.1.523':'VUM',
'AV.1':'VUM','B.1.1.318':'VUM','B.1.1.519':'VUM','AT.1':'VUM',
'C.36.3':'VUM','B.1.214.2':'VUM','B.1.427/429':'VUM'}
variants_list = list(VOC_VOI.keys())
######Writing report######
inactive_date = ARGS.inactive_date
quiet_date = ARGS.quiet_date
Expand Down Expand Up @@ -358,7 +366,10 @@
try:
WHO = VOC_VOI[Top_Pango[0]]
except:
WHO = 'NA'
try:
WHO = VOC_VOI[Top_Pango[0].rsplit('.',1)[0]]
except:
WHO = 'NA'
output_report_object.write('{}\t{}/{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(
CC,STs_count,isolates_count,', '.join(Countries_list),
Top_Region_percent, dates_range, CC_state, Top_Pango_percent,Top_GIS_percent,WHO,','.join(aa_lst)))
Expand Down
23 changes: 12 additions & 11 deletions bin/GNUVID_Predict.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@
########################################
PARSER = argparse.ArgumentParser(
prog="GNUVID_Predict.py",
description="GNUVID v2.3 uses the natural variation in public genomes of SARS-CoV-2 to \
description="GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to \
rank gene sequences based on the number of observed exact matches (the GNU score) \
in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome based on \
its profile of unique gene allele sequences. It can type (using whole genome multilocus sequence typing; wgMLST) \
Expand Down Expand Up @@ -121,7 +121,7 @@
"--version",
help="print version and exit",
action="version",
version="%(prog)s 2.3",
version="%(prog)s 2.4",
)
PARSER.add_argument(
"query_fna", type=str, help="Query Whole Genome Nucleotide FASTA file to analyze (.fna)"
Expand All @@ -133,7 +133,7 @@
if bool(vars(ARGS)["individual"]) and bool(vars(ARGS)["exact_matching"]):
PARSER.exit(status=0, message="Error: You cannot use -i with -e\n")
OS_SEPARATOR = os.sep
Classifier_version = '06/21/2021'
Classifier_version = '08/31/2021'
START_TIME0 = time.time()
if ARGS.exact_matching:
e_matching = 0
Expand All @@ -146,7 +146,7 @@
if ARGS.min_len:
min_len = ARGS.min_len
else:
min_len = 20000
min_len = 25000
if ARGS.n_max:
n_max = ARGS.n_max
else:
Expand Down Expand Up @@ -205,19 +205,19 @@
SEQ_OBJECT = open(seq_file,'r')
Script_Path = os.path.realpath(__file__)
DB_Folder_Path = os.path.join(Script_Path.rsplit(OS_SEPARATOR,2)[0], "db")
VCF_file = os.path.join(DB_Folder_Path,'SNPs_26221_Jun2021.txt') # has the nucleotide(feature) positions
VCF_file = os.path.join(DB_Folder_Path,'SNPs_25419_Aug2021.txt') # has the nucleotide(feature) positions
VCF_OBJECT = open(VCF_file,'r')
CCs_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_CCs_report.txt') # has the WHO naming
CCs_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_CCs_report.txt') # has the WHO naming
CCs_OBJECT = open(CCs_file,'r')
DT_model = os.path.join(DB_Folder_Path,'GNUVID_06212021_RandomForest.joblib')#ML model
COMP_DB_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_comp_db.joblib')#compressed DB
DT_model = os.path.join(DB_Folder_Path,'GNUVID_08312021_RandomForest.joblib')#ML model
COMP_DB_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_comp_db.joblib')#compressed DB
if os.path.exists(COMP_DB_file):
logging.info("Found compressed database")
else:
url = 'https://zenodo.org/record/5112632/files/GNUVID_06212021_comp_db.joblib?download=1'
url = 'https://zenodo.org/record/5594708/files/GNUVID_08312021_comp_db.joblib?download=1'
urllib.request.urlretrieve(url, COMP_DB_file)
logging.info("Downloaded compressed database")
strains_report_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_DB_isolates_report_deidentified.txt.gz')
strains_report_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_DB_isolates_report_deidentified.txt.gz')
Ref_CDS = os.path.join(DB_Folder_Path,'MN908947.3_cds.fna')
Ref_WG = os.path.join(DB_Folder_Path,'MN908947.3.fasta')
############SNPs and CCs_WHO##################
Expand Down Expand Up @@ -667,8 +667,9 @@ def FASTA_to_seq_list(seq_file,SNP_postitions,blk_size):
fo.write('{},{},{}\n'.format(seqId,Classifier_version,','.join(y_results[-4:])))
logging.info("Finished Run in --- {:.3f} seconds ---".format(time.time() - START_TIME0))
logging.info("Typed the query isolate/s and wrote {}".format(output_file.rsplit(OS_SEPARATOR,1)[-1]))
logging.info("""Thanks for using GNUVID v2.3, I hope you found it useful.
logging.info("""Thanks for using GNUVID v2.4, I hope you found it useful.
References:
GNUVID 'Moustafa AM and Planet PJ 2021, Genome Biology and Evolution;13(9):evab197'.
WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
pandas 'Reback et al. 2020, DOI:10.5281/zenodo.3509134'.
Scikit-learn 'Pedregosa et al. 2011, JMLR; 12:2825-2830'.
Expand Down
Loading

0 comments on commit c52b4ec

Please sign in to comment.