GNUVID v2.4 update

ahmedmagds · Oct 24, 2021 · c52b4ec · c52b4ec
1 parent 753fc7e
commit c52b4ec
Show file tree

Hide file tree

Showing 21 changed files with 9,360 additions and 7,217 deletions.
diff --git a/README.md b/README.md
@@ -6,12 +6,11 @@
 # GNUVID
 **G**ene **N**ovelty **U**nit-based **V**irus **ID**entification for **SARS-CoV-2**
 ## Introduction
-GNUVID (GNU-based Virus IDentification) is a Python3 program. It ranks CDS nucleotide sequences in a genome fna file based on the number of observed exact CDS nucleotide matches in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach. The 10 ORFs (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10) in SARS-CoV-2 are used for typing. It automatically assigns allele numbers to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It is based on our recent panallelome approach implemented in [WhatsGNU](https://github.com/ahmedmagds/WhatsGNU). It can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.<br/>
+GNUVID (GNU-based Virus IDentification) is a Python3 program. It ranks CDS nucleotide sequences in a genome fna file based on the number of observed exact CDS nucleotide matches in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach. The 10 ORFs (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10) in SARS-CoV-2 are used for typing. It automatically assigns allele numbers to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It is based on our recent panallelome approach implemented in [WhatsGNU](https://github.com/ahmedmagds/WhatsGNU). The STs are then clustered into bigger groups which are designated clonal complexes (CCs) based on their grouping on a minimum spanning tree (MST). The CCs are more granular than a Pango Lineage. It can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.<br/>
 
-A pre-print of the paper **Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing** can be found here: https://www.biorxiv.org/content/10.1101/2020.12.28.424582v1
+GNUVID is now published [Moustafa AM and Planet PJ 2021. **Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing**. Genome Biology and Evolution;13(9):evab197](https://doi.org/10.1093/gbe/evab197)
 
-A table of acknowledgements for the GISAID SARS-CoV-2 sequences used here is available from:
-https://github.com/ahmedmagds/GNUVID/blob/master/GISAID_acknowledgement_table_2021_01_06-merged_all.pdf
+We acknowledge the open-science of the individual research labs and public agencies that have made their SARS-CoV-2 genomes available on [GISAID](https://www.gisaid.org/).
 
 ## Install and use as simple as
 Make a new environment and install GNUVID in it
@@ -20,46 +19,42 @@ conda create -n GNUVID -c bioconda gnuvid
 conda activate GNUVID
 ```
 
-## Globally circulating clonal complexes as  of 2021-06-21:
+## Globally circulating clonal complexes as  of 2021-08-31:
 
-- 999106 High Quality GISAID sequences have been included in this analysis.
+- 1,392,002 High Quality GISAID sequences have been included in this analysis.
 
-- GNUVID compressed the 9991060 ORFs in the 999106 genomes to 549768 unique alleles.
+- GNUVID compressed the 13920020 ORFs in the 1392002 genomes to 755489 unique alleles.
 
-- 523727 Sequence Types (STs) have been assigned in this dataset and were clustered in 2888 clonal complexes (CCs).
+- 731164 Sequence Types (STs) have been assigned in this dataset and were clustered in 4084 clonal complexes (CCs).
 
-- 2482 new CCs have been assigned (406 CCs in Jan 2021 to 2888 in Jun 2021).
+- 1196 new CCs have been assigned (2888 CCs in Jun 2021 to 4084 in Aug 2021).
 
-- 1716 CCs have been Inactive (i.e. Last time seen more than 1 month before 2021-06-21).
+- 3123 CCs have been Inactive (i.e. Last time seen more than 1 month before 2021-08-31).
 
-- 995 CCs have gone Quiet (i.e. Last seen 2-4 weeks before 2021-06-21).
+- 397 CCs have gone Quiet (i.e. Last seen 2-4 weeks before 2021-08-31).
 
-- 177 CCs have been Active (i.e. Last seen within the 2 weeks before 2021-06-21).
+- 564 CCs have been Active (i.e. Last seen within the 2 weeks before 2021-08-31).
 
 
-## GNUVID now reports the WHO Naming system for VOCs/VOIs (e.g. Alpha, Beta..etc) as per the WHO updated on 07/06/2021:
+## GNUVID now reports the WHO Naming system for VOCs/VOIs/VUMs (e.g. Alpha, Beta..etc) as per the WHO updated on 10/22/2021:
 
-- 1346 CCs representing the **Alpha** VOC (a.k.a. B.1.1.7).
+- 1597 CCs representing the **Alpha** VOC (a.k.a. B.1.1.7 and descendant Q.* lineages).
 
-- 25 CCs representing the **Beta** VOC (a.k.a. B.1.351, B.1.351.2, B.1.351.3).
+- 27 CCs representing the **Beta** VOC (a.k.a. B.1.351 and descendant lineages).
 
-- 61 CCs representing the **Gamma** VOC (a.k.a. P.1, P.1.1, P.1.2).
+- 117 CCs representing the **Gamma** VOC (a.k.a. P.1 and descendant lineages).
 
-- 47 CCs representing the **Delta** VOC (a.k.a. B.1.617.2, AY.1, AY.2).
+- 777 CCs representing the **Delta** VOC (a.k.a. B.1.617.2 and descendant AY.* lineages).
 
-- 3 CCs representing the **Eta** VOI (a.k.a. B.1.525).
+- 6 CC representing the **Lambda** VOI (a.k.a. C.37).
 
-- 80 CCs representing the **Iota** VOI (a.k.a. B.1.526).
+- 6 CCs representing the **Mu** VOI (a.k.a. B.1.621).
 
-- 8 CCs representing the **Kappa** VOI (a.k.a. B.1.617.1).
+- 225 CCs representing the 16 lineages (B.1.427/429, R.1, C.1.2, B.1.466.2, B.1.1.318, B.1.1.519, B.1.1.523, C.36.3, B.1.525, B.1.526, B.1.619, B.1.620, B.1.630, B.1.617.1 and B.1.214.2) that are currently designated **Variants Under Monitoring (VUM)** by WHO for Further Monitoring.
 
-- 1 CC representing the **Lambda** VOI (a.k.a. C.37).
+- The remaining 1329/4084 CCs are not designated VOC/VOI/VUM by WHO (10/22/2021).
 
-- 124 CCs representing the 12 lineages (B.1.427/429, P.2, P.3, R.1/R.2, B.1.466.2, B.1.621, AV.1, B.1.1.318, B.1.1.519, AT.1, C.36.3/C.36.3.1, and B.1.214.2) that are currently designated **Alerts** by WHO for Further Monitoring.
-
-- The remaining 1193/2888 CCs are not designatd VOC/VOI/Alert by WHO (07/06/2021).
-
-**A table showing summary information of the 177 Active Clonal Complexes (CCs) can be found [here](https://github.com/ahmedmagds/GNUVID/tree/master/db). A full report for the 2888 CCs can be found [here](https://github.com/ahmedmagds/GNUVID/blob/master/db/GNUVID_06212021_CCs_report.txt)**
+**A table showing summary information of the 564 Active Clonal Complexes (CCs) can be found [here](https://github.com/ahmedmagds/GNUVID/tree/master/db). A full report for the 4084 CCs can be found [here](https://github.com/ahmedmagds/GNUVID/blob/master/db/GNUVID_08312021_CCs_report.txt)**
 
 ## Installation
 ### Dependencies
@@ -97,7 +92,7 @@ $export PATH=$PATH:/path/to/folder/having/GNUVID/bin
 If you need it permanently, you can add this last line to your .bashrc or .bash_profile.
 ### Test
 * Type GNUVID_Predict.py -h and it should output help screen.
-* Type GNUVID_Predict.py -v and you should see an output like GNUVID.py v2.3.
+* Type GNUVID_Predict.py -v and you should see an output like GNUVID.py v2.4.
 
 ## Usage for GNUVID_Predict.py
 ### Input
@@ -115,7 +110,7 @@ $GNUVID_Predict.py -i -o new_genomes_GNUVID new_genomes.fasta
 ```
 usage: GNUVID_Predict.py [-h] [-o OUTPUT_FOLDER] [-m MIN_LEN] [-n N_MAX] [-b BLOCK_PRED] [-e] [-i] [-f] [-q] [-v] query_fna
 
-GNUVID v2.3 uses the natural variation in public genomes of SARS-CoV-2 to rank
+GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to rank
 gene sequences based on the number of observed exact matches (the GNU score)
 in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome
 based on its profile of unique gene allele sequences. It can type (using whole

diff --git a/bin/GNUVID.py b/bin/GNUVID.py
@@ -59,7 +59,7 @@
 
 PARSER = argparse.ArgumentParser(
     prog="GNUVID.py",
-    description="GNUVID v2.3 utilizes the natural\
+    description="GNUVID v2.4 utilizes the natural\
  variation in public genomes of SARS-CoV-2 to rank gene sequences based on the number of observed exact \
  matches (the GNU score) in all known genomes of SARS-CoV-2. It types the genomes based on their unique \
  gene allele sequences. It types (using a whole genome MLST) your query genome in seconds.",
@@ -120,7 +120,7 @@
     "--version",
     help="print version and exit",
     action="version",
-    version="%(prog)s 2.1",
+    version="%(prog)s 2.4",
 )
 PARSER.add_argument(
     "reference",
@@ -1044,7 +1044,7 @@
     QUERYFILE_OBJECT.close()
 logging.info("Typed the query isolate/s and wrote Query_isolates_GNUVID_ST_Report.txt")
 logging.info("Done in --- {:.3f} seconds ---".format(time.time() - START_TIME))
-logging.info("""Thanks for using GNUVID v2.3, I hope you found it useful.
+logging.info("""Thanks for using GNUVID v2.4, I hope you found it useful.
 Please cite WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
 Please also cite BLAST+ 'Camacho et al. 2009, BMC Bioinformatics;10:421' if you use GNUVID.
 Please also cite GISAID 'Shu Y. and McCauley J. 2017, EuroSurveillance; 22:13'

diff --git a/bin/GNUVID_Post_CC_processor.py b/bin/GNUVID_Post_CC_processor.py
@@ -11,7 +11,7 @@
 PARSER.add_argument("-l","--level", type=int, help="level of locus variant to assign CC (e.g. SLV, DLV) [Default: 2]")
 PARSER.add_argument("-n","--number_connections", type=int, help="number of connections to assign CC [Default: 20]")
 PARSER.add_argument("output_path", type=str, help="output path folder for aln")
-PARSER.add_argument("eBURST_MST_report", type=str, help="eBURST MST csv report")
+PARSER.add_argument("eBURST_MST_report", type=str, help="eBURST MST txt report")
 PARSER.add_argument("ST_GNUVID_report", type=str, help="ST GNUVID csv report")
 PARSER.add_argument("fasta_aln", type=str, help="path to Multifasta.fna name")
 if len(sys.argv) == 1:

diff --git a/bin/GNUVID_Post_Training.py b/bin/GNUVID_Post_Training.py
@@ -59,6 +59,7 @@
     prog="GNUVID_Post_Training.py",
     description="This script will predict CC for NA_genomes & summarize CCs",)
 PARSER.add_argument("-p","--predict",help="run NA prediction [default OFF]",action="store_true",)
+PARSER.add_argument("-s","--subsample",help="subsample fna as query_aln file used so extract right feature [default OFF]",action="store_true",)
 PARSER.add_argument(
     "inactive_date",
     type=str,
@@ -71,7 +72,7 @@
 PARSER.add_argument("ST_GNUVID_report", type=str, help="ST GNUVID txt report")
 PARSER.add_argument("DB_RF", type=str, help="Random forest DB (.joblib)")
 PARSER.add_argument("features", type=str, help="features (SNPs) positions")
-PARSER.add_argument("query_aln", type=str, help="Query Whole Genome SNPs MSA file to analyze (.aln)")
+PARSER.add_argument("query_aln", type=str, help="Query Whole Genome SNPs MSA file to analyze (.aln) or Query Whole Genome MSA file (.fna)")
 if len(sys.argv) == 1:
     PARSER.print_help()
     sys.exit(0)
@@ -81,8 +82,10 @@
 if ARGS.predict:
     VCF_OBJECT = open(ARGS.features,'r') # has the nucleotide(feature) positions
     features_list = []
+    postitions_list = []
     for line in VCF_OBJECT:
         features_list.append(line.rstrip())
+        postitions_list.append(int(line.rstrip())-1)
     VCF_OBJECT.close()
     ###########Parse the SNPs alignment file#########
     SNPs_aln = open(ARGS.query_aln,'r')#sequence_file
@@ -99,6 +102,8 @@
                     if counter in [1000,50000,100000,200000,300000,400000,500000,600000,700000]:
                         print('RAM memory % used: {} for {} seqs'.format(psutil.virtual_memory()[3],counter),flush=True)
                     lst = list(SEQUENCE_STRING)
+                    if ARGS.subsample:
+                        lst = [lst[ind] for ind in postitions_list]
                     SEQUENCES_LIST.append(lst)
                     order_list.append(SEQUENCE_INFO)
                 SEQUENCE_STRING = ""
@@ -110,6 +115,8 @@
         if counter in [1000,50000,100000,200000,300000,400000,500000,600000,700000]:
             print('RAM memory % used: {} for {} seqs'.format(psutil.virtual_memory()[3],counter),flush=True)
         lst = list(SEQUENCE_STRING)
+        if ARGS.subsample:
+            lst = [lst[ind] for ind in postitions_list]
         SEQUENCES_LIST.append(lst)
         order_list.append(SEQUENCE_INFO)
     SNPs_aln.close()
@@ -295,14 +302,15 @@
 os.system('gzip -c {} > {}'.format(op_name, gzip1))
 os.system('gzip -c {} > {}'.format(Output_GNUVID, gzip2))
 #######WHO Naming#########
-VOC_VOI = {"B.1.1.7":'Alpha', 'B.1.351':'Beta','B.1.351.2':'Beta','B.1.351.3':'Beta',
-            'P.1':'Gamma','P.1.1':'Gamma','P.1.2':'Gamma','B.1.617.2':'Delta',
-            'AY.1':'Delta','AY.2':'Delta','B.1.525':'Eta','B.1.526':'Iota',
-            'B.1.617.1':'Kappa','C.37':'Lambda',
-        'B.1.427':'Alert','B.1.429':'Alert', 'P.2':'Alert','P.3':'Alert',
-        'R.1':'Alert','R.2':'Alert', 'B.1.466.2':'Alert','B.1.621':'Alert',
-        'AV.1':'Alert','B.1.1.318':'Alert','B.1.1.519':'Alert','AT.1':'Alert',
-        'C.36.3':'Alert','C.36.3.1':'Alert','B.1.214.2':'Alert','B.1.427/429':'Alert'}
+VOC_VOI = {"B.1.1.7":'Alpha','Q':'Alpha','B.1.351':'Beta','P.1':'Gamma',
+            'B.1.617.2':'Delta','AY':'Delta','C.37':'Lambda','B.1.621':'Mu',
+            'B.1.525':'VUM','B.1.526':'VUM','B.1.630':'VUM',
+            'B.1.617.1':'VUM','B.1.619':'VUM','B.1.620':'VUM','C.1.2':'VUM',
+        'B.1.427':'VUM','B.1.429':'VUM', 'P.2':'VUM','P.3':'VUM',
+        'R.1':'VUM','B.1.466.2':'VUM','B.1.1.523':'VUM',
+        'AV.1':'VUM','B.1.1.318':'VUM','B.1.1.519':'VUM','AT.1':'VUM',
+        'C.36.3':'VUM','B.1.214.2':'VUM','B.1.427/429':'VUM'}
+variants_list = list(VOC_VOI.keys())
 ######Writing report######
 inactive_date = ARGS.inactive_date
 quiet_date = ARGS.quiet_date
@@ -358,7 +366,10 @@
     try:
         WHO = VOC_VOI[Top_Pango[0]]
     except:
-        WHO = 'NA'
+        try:
+            WHO = VOC_VOI[Top_Pango[0].rsplit('.',1)[0]]
+        except:
+            WHO = 'NA'
     output_report_object.write('{}\t{}/{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(
             CC,STs_count,isolates_count,', '.join(Countries_list),
             Top_Region_percent, dates_range, CC_state, Top_Pango_percent,Top_GIS_percent,WHO,','.join(aa_lst)))

diff --git a/bin/GNUVID_Predict.py b/bin/GNUVID_Predict.py
@@ -58,7 +58,7 @@
 ########################################
 PARSER = argparse.ArgumentParser(
     prog="GNUVID_Predict.py",
-    description="GNUVID v2.3 uses the natural variation in public genomes of SARS-CoV-2 to \
+    description="GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to \
     rank gene sequences based on the number of observed exact matches (the GNU score) \
     in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome based on \
     its profile of unique gene allele sequences. It can type (using whole genome multilocus sequence typing; wgMLST) \
@@ -121,7 +121,7 @@
     "--version",
     help="print version and exit",
     action="version",
-    version="%(prog)s 2.3",
+    version="%(prog)s 2.4",
 )
 PARSER.add_argument(
     "query_fna", type=str, help="Query Whole Genome Nucleotide FASTA file to analyze (.fna)"
@@ -133,7 +133,7 @@
 if bool(vars(ARGS)["individual"]) and bool(vars(ARGS)["exact_matching"]):
     PARSER.exit(status=0, message="Error: You cannot use -i with -e\n")
 OS_SEPARATOR = os.sep
-Classifier_version = '06/21/2021'
+Classifier_version = '08/31/2021'
 START_TIME0 = time.time()
 if ARGS.exact_matching:
     e_matching = 0
@@ -146,7 +146,7 @@
 if ARGS.min_len:
     min_len = ARGS.min_len
 else:
-    min_len = 20000
+    min_len = 25000
 if ARGS.n_max:
     n_max = ARGS.n_max
 else:
@@ -205,19 +205,19 @@
 SEQ_OBJECT = open(seq_file,'r')
 Script_Path = os.path.realpath(__file__)
 DB_Folder_Path = os.path.join(Script_Path.rsplit(OS_SEPARATOR,2)[0], "db")
-VCF_file = os.path.join(DB_Folder_Path,'SNPs_26221_Jun2021.txt') # has the nucleotide(feature) positions
+VCF_file = os.path.join(DB_Folder_Path,'SNPs_25419_Aug2021.txt') # has the nucleotide(feature) positions
 VCF_OBJECT = open(VCF_file,'r')
-CCs_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_CCs_report.txt') # has the WHO naming
+CCs_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_CCs_report.txt') # has the WHO naming
 CCs_OBJECT = open(CCs_file,'r')
-DT_model = os.path.join(DB_Folder_Path,'GNUVID_06212021_RandomForest.joblib')#ML model
-COMP_DB_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_comp_db.joblib')#compressed DB
+DT_model = os.path.join(DB_Folder_Path,'GNUVID_08312021_RandomForest.joblib')#ML model
+COMP_DB_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_comp_db.joblib')#compressed DB
 if os.path.exists(COMP_DB_file):
     logging.info("Found compressed database")
 else:
-    url = 'https://zenodo.org/record/5112632/files/GNUVID_06212021_comp_db.joblib?download=1'
+    url = 'https://zenodo.org/record/5594708/files/GNUVID_08312021_comp_db.joblib?download=1'
     urllib.request.urlretrieve(url, COMP_DB_file)
     logging.info("Downloaded compressed database")
-strains_report_file = os.path.join(DB_Folder_Path,'GNUVID_06212021_DB_isolates_report_deidentified.txt.gz')
+strains_report_file = os.path.join(DB_Folder_Path,'GNUVID_08312021_DB_isolates_report_deidentified.txt.gz')
 Ref_CDS = os.path.join(DB_Folder_Path,'MN908947.3_cds.fna')
 Ref_WG = os.path.join(DB_Folder_Path,'MN908947.3.fasta')
 ############SNPs and CCs_WHO##################
@@ -667,8 +667,9 @@ def FASTA_to_seq_list(seq_file,SNP_postitions,blk_size):
         fo.write('{},{},{}\n'.format(seqId,Classifier_version,','.join(y_results[-4:])))
 logging.info("Finished Run in --- {:.3f} seconds ---".format(time.time() - START_TIME0))
 logging.info("Typed the query isolate/s and wrote {}".format(output_file.rsplit(OS_SEPARATOR,1)[-1]))
-logging.info("""Thanks for using GNUVID v2.3, I hope you found it useful.
+logging.info("""Thanks for using GNUVID v2.4, I hope you found it useful.
 References:
+GNUVID 'Moustafa AM and Planet PJ 2021, Genome Biology and Evolution;13(9):evab197'.
 WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
 pandas 'Reback et  al. 2020, DOI:10.5281/zenodo.3509134'.
 Scikit-learn 'Pedregosa et al. 2011, JMLR; 12:2825-2830'.