The S-protein-Typer tool identifies S-protein mutations of SARS-CoV-2 and reports the alterations at the amino acid level. This lets the evaluator easily spot combinations of adverse mutations in newly emerging variants. It has been developed for automated analysis of sequencing data generated by the SARS-Cov2 S1-ROI panel, but should also work for other sequencing data of the S-gene.
Files for analysis of sequencing data obtained by the S1-ROI primer panel with the ARTIC pipeline are included in this archive.
The tool has been tested on macOS and GNU/Linux systems; users of Windows 10 and above should be able to run it with minimal effort using WSL. To install the required packages (listed in the environment file), we recommend using a conda distribution (Miniconda or Anaconda). Details for installation can be found here.
Once the environment manager is set up, install the s-protein-typer by cloning/installing the current repository:
$ git clone https://github.com/MassimoGregorioTotaro/s-protein-typer.git
then move into the newly created installation folder
$ cd s-protein-typer
set the environment up (download all the dependencies for the S-protein-Typer tool)
$ conda env create -f s-protein-typer.yaml
and then activate the environment to execute the S-protein-Typer tool (this step is to be repeated every time a new terminal is opened)
$ conda activate s-protein-typer
Once the environment is activated (see above), the program can be launched. By default, it will analyse all sequences in every fasta file present in the 'data' folder with the standard settings (an example file for analysis is already included in the folder):
$ python s_protein_typer.py
Of note, it probably makes sense to delete the fasta files in the 'data' folder after analysis; otherwise they will be re-analysed the next time the program is executed.
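As an alternative to deleting the files, they can be moved out of 'data' after each run. The following is only a minimal sketch and not part of the tool: the `archive_analysed` helper, the `analysed` destination folder and the set of FASTA extensions are all assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Assumed FASTA extensions; adjust to the ones you actually use.
FASTA_SUFFIXES = {".fasta", ".fa", ".fas"}


def archive_analysed(data_dir="data", archive_dir="analysed"):
    """Move every FASTA file from data_dir into archive_dir.

    Returns the names of the moved files (hypothetical helper).
    """
    data, archive = Path(data_dir), Path(archive_dir)
    if not data.is_dir():
        return []
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(data.iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            target = archive / path.name
            if target.exists():
                # Never silently overwrite a previously archived file.
                raise FileExistsError(f"{target} already exists")
            shutil.move(str(path), str(target))
            moved.append(path.name)
    return moved
```

Calling `archive_analysed()` from the installation folder after a run leaves 'data' empty of FASTA files while keeping the inputs available for later inspection.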
All options of the S-protein-Typer can be viewed by running the program with the --help flag:
$ python s_protein_typer.py -h
- FILE1.fasta single files to be analysed can be passed as free arguments
- --silent disables runtime information
- --slow performs a Multiple Sequence Alignment, which is more accurate but also slower; use it in case some sequences are particularly weak
- --output exports the alignment as a CSV file, to be quickly analysed again (e.g. with a different classifier) or used for retraining the classifier
- --retrain (for experts) requires an aptly formatted file to retrain the provided classifier; the retrained classifier will then be exported in the same folder
- --reference (for experts) is needed if the reference FASTA file to be used is not in the default folder ('reference')
- --machine_readable (for experts) disables the terminal formatting so that the output can be redirected or saved to a file via terminal
- --classifier (for experts) is needed if the classifier to be used is not in the default folder ('reference')
- --alignment (for experts) reads a pre-generated CSV file or an alignment FASTA file, instead of the sequences FASTA files, for a quicker classification analysis
Be aware that the tool analyses mutations with respect to the whole S-protein sequence. Therefore, depending on the primer set that has been employed, parts at the beginning or at the end of the S-protein gene that were not sequenced will be reported as deletions (e.g. the S1-ROI panel, which focuses on the immunodominant part of the S-protein, excludes residues 1-14 and 684-1273).
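When reviewing results from the S1-ROI panel, it can help to separate deletions that fall inside the sequenced region from those that merely reflect missing coverage. The sketch below assumes the covered residues are 15-683 (derived from the excluded ranges given above) and works on plain 1-based residue positions; it makes no assumption about the tool's output format, and the function names are hypothetical.

```python
# Residues covered by the S1-ROI panel, derived from the excluded
# ranges 1-14 and 684-1273 (assumption based on the note above).
S1_ROI_FIRST, S1_ROI_LAST = 15, 683


def outside_panel(position, first=S1_ROI_FIRST, last=S1_ROI_LAST):
    """Return True if a 1-based residue position was not sequenced."""
    return position < first or position > last


def split_deletions(deleted_positions, first=S1_ROI_FIRST, last=S1_ROI_LAST):
    """Split deleted positions into (inside panel, outside panel)."""
    inside = [p for p in deleted_positions if not outside_panel(p, first, last)]
    outside = [p for p in deleted_positions if outside_panel(p, first, last)]
    return inside, outside
```

Only the positions in the first returned list are candidates for genuine deletions; for other primer sets, pass the appropriate `first` and `last` values.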
As mentioned in the usage instructions, by default all files present in the 'data' folder will be analysed at the same time; therefore, make sure to use unambiguous names for each sequence in the fasta files.
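Name clashes across files can be checked before launching the analysis. The following is a rough sketch, not part of the tool: it assumes that a sequence name is the first whitespace-delimited token of the FASTA header, which may differ from how the S-protein-Typer reads names, and both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def fasta_names(path):
    """Yield the name of each record in a FASTA file.

    Assumption: the name is the header text after '>' up to the
    first whitespace.
    """
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""


def duplicated_names(paths):
    """Map each name seen more than once to the files containing it."""
    seen = defaultdict(list)
    for path in paths:
        for name in fasta_names(path):
            seen[name].append(Path(path).name)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

For example, `duplicated_names(sorted(Path("data").glob("*.fasta")))` returns an empty dictionary when every sequence name is unique.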
The passed sequences are aligned both at the DNA level, to identify the correct ORF, and at the AA level, to identify the mutations. The standard protocol, which involves a series of pairwise alignments between the reference and the target sequences, usually performs well, but misalignments can occur when the sequences are of poor quality or have extensive deletions around the start codon. The user should watch out for messy output, treat it with skepticism and thoroughly check the corresponding sequence; alignment issues can reasonably be suspected when several INDEL mutations are reported at the beginning of the sequence. To overcome the problem, the calculation should be run again with the --slow flag (on a subset of sequences if time is a strict concern; the subset must, however, comprise at least 5 'good' sequences for each problematic one).
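The subset rule for a --slow re-run can be expressed as a small helper. This is only an illustrative sketch: `slow_subset` is a hypothetical function operating on lists of sequence names, and deciding which sequences count as 'good' or problematic is left to the user.

```python
def slow_subset(problematic, good, good_per_problematic=5):
    """Return the names to re-run with --slow.

    Keeps every problematic sequence and adds the minimum number of
    good ones (5 per problematic sequence by default).
    """
    problematic, good = list(problematic), list(good)
    needed = good_per_problematic * len(problematic)
    if len(good) < needed:
        raise ValueError(
            f"need at least {needed} good sequences, got {len(good)}"
        )
    return problematic + good[:needed]
```

The returned names can then be used to write a reduced FASTA file, which is passed to the program as a free argument together with the --slow flag.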
The classification is performed by a pre-trained Random Forest classifier, provided as a PKL file in the 'reference' folder. This classifier can, however, be created anew and trained by providing a correctly formatted CSV file (the original is provided here) to the --retrain flag, which will also take care of saving it for future use (beware: repeated runs will overwrite the outputs of previous ones). Since the generation and training steps are non-deterministic, you might have to perform several runs with the -t flag until you get an optimal outcome. The training dataset can be adjusted by adding or removing manually classified sequences: the mutations can be derived from the CSV output of an s-protein-typer run, and the class must be denoted in the first row, preceding the sequence name and separated from it by an underscore (e.g. B.1.1.7_XXX). When altering the training database, one must be careful to provide good quality data: avoid incomplete sequences (the classifier will be underfit), do not provide too many duplicates (the classifier will be overfit), be consistent with class names (e.g. an identical sequence identified as both B.1.617.2_XXX and delta_XXX will cause serious classification issues), do not neglect curating the outgroup ('NA'; it is fundamental for classification accuracy), etc.
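Inconsistent class names in a training set can be caught with a quick check before retraining. The sketch below is not the tool's own code: it assumes labels of the form CLASS_NAME split at the first underscore, represents each sequence by a collection of mutation strings (the strings in the usage example are purely illustrative), and uses hypothetical function names.

```python
from collections import defaultdict


def split_label(label):
    """Split 'CLASS_NAME' at the first underscore into (class, name)."""
    cls, sep, name = label.partition("_")
    if not sep or not cls or not name:
        raise ValueError(f"label {label!r} is not of the form CLASS_NAME")
    return cls, name


def conflicting_classes(records):
    """Return mutation profiles that appear under more than one class.

    records: iterable of (label, mutations) pairs, where mutations is
    any iterable of strings; the order of mutations is ignored.
    """
    classes = defaultdict(set)
    for label, mutations in records:
        cls, _ = split_label(label)
        classes[frozenset(mutations)].add(cls)
    return {
        profile: sorted(found)
        for profile, found in classes.items()
        if len(found) > 1
    }
```

Given two entries with the same mutations labelled `B.1.617.2_a` and `delta_b`, `conflicting_classes` reports that profile together with both class names, which points to labels that should be unified.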
Ambiguous codons cause issues when used for training the classifier (e.g. codon 'RUA' translates to 'B', which evades the encoding step and crashes the program during training, even though it should not).
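Sequences intended for training can be screened for ambiguous codons beforehand. This is a minimal sketch under stated assumptions: the input is an in-frame coding sequence, any symbol other than A, C, G, T or U counts as ambiguous (so gap characters are flagged too), and `ambiguous_codons` is a hypothetical helper rather than part of the tool.

```python
UNAMBIGUOUS = set("ACGTU")


def ambiguous_codons(coding_sequence):
    """Return (codon number, codon) for each codon with a non-ACGTU symbol.

    Codon numbers are 1-based; a trailing partial codon is checked as well.
    """
    seq = coding_sequence.upper()
    hits = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        if not set(codon) <= UNAMBIGUOUS:
            hits.append((start // 3 + 1, codon))
    return hits
```

Sequences for which the function returns a non-empty list are best left out of the training dataset or corrected first.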
Minor adjustments can be made, especially regarding the classifier, according to the development and evolution of the typing efforts.
Massimo G. Totaro, Institute of Biochemistry, Graz University of Technology, Graz, Austria.
Maintained until end 2021.