Models
This module stores models for entities so that they can be handled uniformly, regardless of the format of the file they were read from. The most commonly used fields are specified explicitly, while the entities also provide mechanisms for preserving all the information specific to a given format. For a variant, the explicit fields include chromosome, position, reference and alternate alleles; when a VCF file is stored, columns such as INFO are also saved in a key-value data structure.
A variant is uniquely identified by the tuple (chromosome, start, reference allele, alternate allele). The full list of fields that model a variant is defined in the Variant.java file:
chromosome : string
start : integer
end : integer
length : integer
reference : string
alternate : string
hgvs : set of strings
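As an illustration of the field list above, here is a minimal sketch of such a model. It is not the actual contents of Variant.java: the class name `VariantSketch`, the constructor signature and the way `length` is derived are assumptions made for the example. Equality is based only on the identifying tuple (chromosome, start, reference, alternate).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of the variant model; not the real Variant.java.
final class VariantSketch {
    final String chromosome;
    final int start;
    final int end;
    final int length;       // assumption: length of the longest allele
    final String reference;
    final String alternate;
    final Set<String> hgvs = new HashSet<>();

    VariantSketch(String chromosome, int start, int end,
                  String reference, String alternate) {
        this.chromosome = Objects.requireNonNull(chromosome);
        this.start = start;
        this.end = end;
        this.reference = Objects.requireNonNull(reference);
        this.alternate = Objects.requireNonNull(alternate);
        this.length = Math.max(reference.length(), alternate.length());
    }

    // Two variants are the same if the identifying tuple matches.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof VariantSketch)) return false;
        VariantSketch v = (VariantSketch) o;
        return start == v.start
                && chromosome.equals(v.chromosome)
                && reference.equals(v.reference)
                && alternate.equals(v.alternate);
    }

    @Override
    public int hashCode() {
        return Objects.hash(chromosome, start, reference, alternate);
    }
}
```

With this definition, two objects built from the same tuple compare as equal even if they were read from different files, which is what allows them to be used as keys in a set or map.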
Information specific to a given file is stored in a separate attribute, whose class is implemented in the ArchivedVariantFile.java file. For a VCF file, this covers the FILTER, QUAL, INFO and FORMAT columns, as well as all the samples.
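The key-value storage of file-specific columns could look roughly like the sketch below. The class `FileAttributesSketch` and its methods are hypothetical and do not reflect the real ArchivedVariantFile.java; the only thing taken from the VCF format itself is that INFO is a semicolon-separated list of `key=value` entries and valueless flags.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container for format-specific data; not the real ArchivedVariantFile.
final class FileAttributesSketch {
    // Insertion order is kept so columns can be written back in a stable order.
    final Map<String, String> attributes = new LinkedHashMap<>();

    // Stores a whole column (for example QUAL or FILTER) under its name.
    void addColumn(String name, String value) {
        attributes.put(name, value);
    }

    // Splits a VCF INFO string into individual key-value entries.
    void addInfo(String info) {
        if (info.isEmpty() || info.equals(".")) {
            return; // "." marks a missing value in VCF
        }
        for (String entry : info.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                attributes.put(entry, ""); // flag without a value
            } else {
                attributes.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
    }
}
```

For example, calling `addColumn("QUAL", "50")` and then `addInfo("DP=10;AF=0.5;DB")` leaves four entries in the map, with `DB` stored as a flag with an empty value.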
Creating one of these variants is straightforward when a single-alternate record is read from a VCF file, but what happens when multi-allelic variants are read? The VCF format represents them ambiguously, because they can be stored in one or several lines. Information must be stored in a homogeneous format, and we concluded that splitting a multi-allelic variant into (reference, alternate) pairs was the only way to preserve all the information while fulfilling that condition.
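The splitting step can be sketched as follows, assuming the ALT column lists its alleles separated by commas, as the VCF format does. The class and method names are hypothetical, and the sketch only produces the identifying tuples; a real implementation would presumably also have to carry over the per-file and per-sample data for each resulting pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the split; not the project's actual code.
final class MultiAllelicSplitSketch {
    // Returns one {chromosome, position, reference, alternate} tuple
    // per alternate allele listed in the ALT column.
    static List<String[]> split(String chromosome, int position,
                                String reference, String altColumn) {
        List<String[]> variants = new ArrayList<>();
        for (String alternate : altColumn.split(",")) {
            variants.add(new String[] {
                chromosome, String.valueOf(position), reference, alternate
            });
        }
        return variants;
    }
}
```

A record with reference `A` and ALT column `T,G` yields two tuples, one for (A, T) and one for (A, G), so the result is the same whether the file stored the alternates on one line or on two.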