Variant Annotation

Overview

CellBase uses its integrated data to provide a rich, high-performance variant annotator. The variant annotation tool is part of the CellBase code base and can be accessed in two ways:

Using remote RESTful web services: both GET and POST annotation web services are available (see http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/). Web services can be called from JavaScript or through the Python client command line. Because the knowledge base is not installed locally, users do not need to store hundreds of gigabytes (about 900 GB in the current release, v4) and always query the latest data. Moreover, despite accessing the knowledge base remotely, the CellBase client provides a lightweight, efficient, multi-threaded implementation which outperforms other local variant annotators (see Benchmark). Annotation results from the web services are returned as JSON objects.
Using the Java command line: this fetches annotation data directly from the database. Users need to install the MongoDB database and have access to it, which gives a major speed-up compared with the remote web services. The current Java CLI can still connect to remote web services, although this functionality is being ported to the new Python client.
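As a rough illustration of the remote option, the sketch below builds an annotation URL and fetches the JSON response using only the Python standard library. It is not the CellBase Python client. The `genomic/variant/.../annotation` path, the `chrom:pos:ref:alt` variant notation, and the helper names `build_annotation_url` and `fetch_annotation` are assumptions made for illustration; check the web services documentation linked above for the actual endpoints.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host taken from this page; the path layout below is an assumption.
BASE_URL = "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest"


def build_annotation_url(variants, species="hsapiens", version="v4"):
    """Build a GET URL for a short list of variants (assumed chrom:pos:ref:alt)."""
    # Escape each variant individually, then join the IDs with commas.
    ids = ",".join(quote(v, safe=":") for v in variants)
    return f"{BASE_URL}/{version}/{species}/genomic/variant/{ids}/annotation"


def fetch_annotation(variants, timeout=30):
    """Call the (assumed) annotation endpoint and return the parsed JSON."""
    with urlopen(build_annotation_url(variants), timeout=timeout) as response:
        return json.load(response)
```

If the server is reachable and the assumed path is right, `fetch_annotation(["19:45411941:T:C"])` would return the parsed JSON response. For long variant lists the POST service mentioned above is likely the better fit, since GET URLs have practical length limits.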

The typical input for the CellBase variant annotator is a VCF file, although the CLI can also take a short list of variants as an explicit argument for fast annotation. The annotator can currently generate two output formats: a .json file with a list of VariantAnnotation objects (see the Variant and VariantAnnotation models at http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/variant/model), or a tab-separated values file in VEP format.
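To make the two input styles concrete, the following sketch converts VCF data lines into short variant strings of the kind that could be passed as an explicit list. The `chrom:pos:ref:alt` notation and the function name `vcf_line_to_variants` are assumptions; the exact argument format expected by the CLI is not stated here.

```python
def vcf_line_to_variants(line):
    """Turn one VCF line into a list of 'chrom:pos:ref:alt' strings.

    Header lines and blank lines yield an empty list. Multi-allelic records
    (comma-separated ALT) yield one string per alternate allele.
    """
    if line.startswith("#") or not line.strip():
        return []
    # The first five VCF columns are CHROM, POS, ID, REF and ALT.
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # A '.' in ALT means no alternate allele was called.
    return [f"{chrom}:{pos}:{ref}:{allele}" for allele in alt.split(",") if allele != "."]


def read_variants(vcf_path):
    """Collect variant strings from every data line of a VCF file."""
    variants = []
    with open(vcf_path, encoding="utf-8") as handle:
        for line in handle:
            variants.extend(vcf_line_to_variants(line))
    return variants
```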

Data sources

The variant annotator integrates most of the annotations available in the CellBase knowledge base: Ensembl's core transcript annotation, such as location, id, strand, biotype, etc.; protein annotation provided by UniProt, InterPro, SIFT and PolyPhen; population frequencies provided by the European Variation Archive for the 1000 Genomes Project Phase 3, the Exome Variant Server (EVS), the Exome Aggregation Consortium v3 (ExAC) and the Genome of the Netherlands (GoNL); sequence conservation from PhastCons and PhyloP; gene expression values from the Gene Expression Atlas and the Genotype-Tissue Expression project (GTEx); gene-drug interaction data from the Drug Gene Interaction Database (DGIdb) and the Human Phenotype Ontology database (HPO); clinical variant annotation from ClinVar, the Genome-Wide Association Studies catalog (GWAS) and the Catalogue of Somatic Mutations in Cancer (COSMIC). Sequence effect prediction is also calculated on the fly and described by Sequence Ontology (SO) terms.

Summary

Annotation	Homo sapiens GRCh37	Homo sapiens GRCh38	Other species
Consequence Types¹	✔️	✔️	✔️
Conservation Scores²	✔️	✖️	✖️
Protein Substitution Scores³	✔️	✔️	✖️

¹ More info at http://www.ensembl.org/info/genome/variation/predicted_data.html

² PhastCons, PhyloP and GERP

³ SIFT and PolyPhen

Benchmark

An exhaustive comparison of sequence effect predictions was made against Ensembl VEP (78) results for the whole 1000 Genomes Phase 1 variant set (XX million variants, XXX million effect predictions), yielding 99.9997% concordance with Ensembl VEP consequence types (see https://github.com/opencb/cellbase/wiki/Variant-Annotation for a detailed report on known differences).
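The exact definition of concordance used for this comparison is not given here. As one plausible reading, the sketch below counts a prediction as concordant when both tools report exactly the same set of SO terms for the same variant and transcript. The function name, the dictionary layout and the toy data are hypothetical.

```python
def concordance(ours, reference):
    """Fraction of shared (variant, transcript) keys with identical SO term sets.

    Both arguments map (variant, transcript) tuples to iterables of SO terms.
    Keys present in only one of the two inputs are ignored in this sketch.
    """
    shared = ours.keys() & reference.keys()
    if not shared:
        raise ValueError("no predictions in common to compare")
    matches = sum(1 for key in shared if set(ours[key]) == set(reference[key]))
    return matches / len(shared)


# Toy data only; these are not real benchmark results.
cellbase_calls = {
    ("1:100:A:G", "T1"): {"missense_variant"},
    ("1:200:C:T", "T2"): {"intron_variant"},
}
vep_calls = {
    ("1:100:A:G", "T1"): {"missense_variant"},
    ("1:200:C:T", "T2"): {"intron_variant", "splice_region_variant"},
}
toy_score = concordance(cellbase_calls, vep_calls)
```

A stricter or looser measure (for example, comparing only the most severe term per transcript) would give different numbers, so the figure above should be read together with the detailed report it links to.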

Custom annotations

The annotation provided by CellBase can be complemented with custom annotations supplied by the user. These custom annotations must be provided in a VCF file and are read from the INFO column.
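To show what reading annotations from the INFO column involves, here is a minimal sketch that parses a VCF INFO string into a dictionary. It follows the general VCF convention of semicolon-separated `KEY=value` entries and bare flags; it is not CellBase's own parser, and `MY_SCORE` is a made-up custom key.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict.

    'KEY=a,b' entries become lists of strings; bare flags map to True.
    An empty string or '.' (missing INFO) gives an empty dict.
    """
    if info in ("", "."):
        return {}
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a stray trailing semicolon
        key, separator, value = entry.partition("=")
        fields[key] = value.split(",") if separator else True
    return fields


# Hypothetical INFO column carrying one custom annotation (MY_SCORE).
custom = parse_info("AF=0.5;DB;MY_SCORE=1,2")
```

Values are kept as strings here; converting them to numbers would need the Number and Type declared for each key in the VCF header.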
