Sequence population search

Anna Bernasconi edited this page Oct 31, 2022 · 2 revisions

The Sequence population search section is the primary tool for querying the large number of viral sequences available in EpiViruSurf (and in the companion system ViruSurf). It relies on a restricted set of attributes that are relevant to many genomic sources and present in most of them. The available attributes and their organization are explained in the Viral Conceptual Model (VCM) paper, which introduces the conceptualization of our integration effort.

The field Sequence population search condition is compiled dynamically to show, in a single place, which values have been selected in the table below.

The search can be defined in a table that contains general characteristics or properties of viral sequences. It is organized into four parts:

  • Virus: includes information on the virus species to which the sequences belong.
  • Host Organism: includes information on how the sequence was collected from the host organism.
  • Technology: includes the technological process and features analyzed in the experiment.
  • Organization: contains information on the organizations/projects which are behind the production of each experiment.

Each part includes multiple attributes (click the icon next to an attribute to read its definition), each offering several values that can be selected through drop-down menus. Next to each possible value, we report the number of sequences in our database that have that value.

The user can compose queries by combining values from all the drop-down menus. Note that the special value N/D indicates null values (i.e., it selects sequences that do not have a value for that particular attribute). Queries retrieve a list of sequences.

[Figure: metadata-search]

The interface allows choosing multiple values at the same time in one attribute drop-down list; these are treated as alternatives (logical OR). Values chosen for different attributes, instead, are treated as conditions that must all hold in the resulting items (logical AND).

In the following example, the user is choosing all sequences that are from either

"Connecticut" OR "New York"

AND that have "GC% > 37"

AND are partial (non-complete).

[Figure: metadata-search-example]
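The combination rule above (OR within one attribute, AND across attributes) can be sketched in Python as follows. This is only an illustration, not the EpiViruSurf implementation: the sample records, the attribute names (`province`, `gc_percent`, `is_complete`), and the `search` helper are hypothetical, and N/D is modeled as `None`.

```python
from typing import Any, Iterable, Mapping, Optional

Record = Mapping[str, Any]

# Hypothetical records; real sequence metadata has many more attributes.
SEQUENCES = [
    {"accession": "S1", "province": "Connecticut", "gc_percent": 38.1, "is_complete": False},
    {"accession": "S2", "province": "New York", "gc_percent": 36.5, "is_complete": False},
    {"accession": "S3", "province": "New York", "gc_percent": 38.0, "is_complete": True},
    {"accession": "S4", "province": None, "gc_percent": 39.0, "is_complete": False},
]


def search(
    records: Iterable[Record],
    selections: Mapping[str, set],
    gc_above: Optional[float] = None,
) -> list:
    """Return the records that satisfy every selected attribute.

    Values selected for one attribute are alternatives (OR);
    different attributes must all match (AND).
    """
    results = []
    for record in records:
        # A missing attribute is read as None, i.e. N/D.
        categorical_ok = all(
            record.get(attr) in allowed for attr, allowed in selections.items()
        )
        gc = record.get("gc_percent")
        gc_ok = gc_above is None or (gc is not None and gc > gc_above)
        if categorical_ok and gc_ok:
            results.append(record)
    return results


# ("Connecticut" OR "New York") AND GC% > 37 AND partial (non-complete)
hits = search(
    SEQUENCES,
    {"province": {"Connecticut", "New York"}, "is_complete": {False}},
    gc_above=37,
)
print([r["accession"] for r in hits])  # ['S1']
```

Under the same assumptions, passing `{"province": {None}}` would mimic selecting N/D, returning only the records that have no value for that attribute.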

Note that some fields are numerical (age, length, GC%, and N%), so they can be constrained with a range defined by a minimum and a maximum value. In addition, the user can check or uncheck the N/D flag: when it is checked, sequences that lack that field are included in the results; when it is unchecked, only sequences that have the field are returned.

Similarly, collection date and submission date are set through calendar-style drop-downs. Here too, the user may specify a lower and an upper bound for the range and include or exclude N/D (null) values.
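A range condition with an N/D flag, as described for numerical and date fields, could be modeled along these lines. This is a minimal sketch under stated assumptions: the `in_range` helper is hypothetical, bounds are assumed to be inclusive, an omitted bound is assumed to mean "unbounded", and N/D is again represented as `None`.

```python
from datetime import date
from typing import Optional, TypeVar

T = TypeVar("T", int, float, date)


def in_range(
    value: Optional[T],
    lower: Optional[T] = None,
    upper: Optional[T] = None,
    include_nd: bool = False,
) -> bool:
    """Check a value against an optional [lower, upper] range.

    A None value stands for N/D and is accepted only when include_nd is set.
    """
    if value is None:
        return include_nd
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# Numerical field, e.g. an age between 18 and 65
print(in_range(45, 18, 65))                     # True
print(in_range(None, 18, 65))                   # False: N/D excluded
print(in_range(None, 18, 65, include_nd=True))  # True: N/D included

# Date field, e.g. a collection date within 2020
print(in_range(date(2020, 3, 15), date(2020, 1, 1), date(2020, 12, 31)))  # True
```

The same helper works for numbers and dates because both support ordering comparisons.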