Mapping accessions/logical entities to DRS/ physical objects #104

ianfore · 2020-07-10T13:41:18Z

In this hackathon exercise SRA would be used as a test case to explore how biological entities (logical level) are handled in relation to the immutable physical objects in DRS.

INSDC ids used in SRA identify logical level/biological entities such as sequencing runs (SRRnnnnnn). The mapping to immutable digital objects (DRS ids) is not as simple as might be expected, for two reasons.
a) SRRs map to more than one digital object (e.g. a cram and a crai file)
b) the immutable digital object(s) to which they map may change (e.g. after alignment to a different reference sequence)
The attached example, sdl_example1.txt, shows the SRA Data Locator response for SRR7274638 as an example of the use case.

Schemas that define the biological entities would provide the data model describing the relationships between the objects.

The exercise would be to test schemas, searchable via the Discovery Search prototype, that map logical level ids to immutable DRS objects. In the SRA case this could potentially be as simple as using the NCBI implementation of SRA in BigQuery.

DavidPotCanuck · 2020-07-13T15:14:18Z

ISB-CGC bioinformaticians will examine SRA data in BigQuery. The goal is to compare how data is indexed in SRA to how NCI's ISB-CGC Cloud Resource is presenting similar information to its end-users in BigQuery tables at ISB-CGC. Insights into how researchers could combine information from both systems will be provided.

ianfore · 2020-08-12T12:47:10Z

8/10/20
Kurt outlined the approach in progress for making SRA data available via DRS.
There are three components:
RAS Clearing House - for auth and authz
DRS service
ID Exchange -

The ID Exchange would be passed an accession e.g. SRR and return a DRS id. As there are multiple files for each accession the plan is to return a DRS id to what would be a bundle.
For example, currently the accession SRR1999478 has four files which would have to be bundled. The following json is not a proposed format but serves to illustrate what would need to be bundled and where it exists.

{
    "accession": "SRR1999478",
    "files": [
        {
            "name": "14_DN.unmap.bam", "type": "bam",
            "locality": [
                { "service": "gs", "region": "us", "rehydrationRequired": true },
                { "service": "s3", "region": "us-east-1", "rehydrationRequired": true }
            ]
        },
        {
            "name": "14_DN.BWA.MARK.bam", "type": "bam",
            "locality": [
                { "service": "gs", "region": "us", "rehydrationRequired": true },
                { "service": "s3", "region": "us-east-1", "rehydrationRequired": true }
            ]
        },
        {
            "name": "SRR1999478.pileup", "type": "sra",
            "locality": [
                { "service": "sra-ncbi", "region": "dbgap" }
            ]
        },
        {
            "name": "SRR1999478", "type": "sra",
            "locality": [
                { "service": "sra-ncbi", "region": "dbgap" },
                { "service": "gs", "region": "us" },
                { "service": "s3", "region": "us-east-1"}
            ]
        }
    ]
}

There is no existing externally usable id/accession for the individual files for this SRR. DRS ids would also be generated for each of the files, so that a user can retrieve only the files required.

The challenge for code that reads the bundle is to work out what the types of the individual files are, and which to use for the purpose at hand. This amounts to understanding the semantics of the bundle. In the example above there are two files of type bam, and nothing in the bundle conveys the significance (semantics) of each. A human with the right knowledge might infer some meaning from the file names, but there is no consistency in file naming, and even if there were, encoding structured meaning in a filename is not a sound approach. For the two files listed with a type of sra one might infer that the file named pileup is a different type of file.
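The ambiguity can be made concrete with a minimal Python sketch. It uses the illustrative JSON above (which, as noted, is not a proposed format), trimmed to names and types; the helper names `group_by_type` and `ambiguous_types` are made up for this example.

```python
from collections import defaultdict

# Trimmed copy of the illustrative SRR1999478 bundle (locality omitted).
bundle = {
    "accession": "SRR1999478",
    "files": [
        {"name": "14_DN.unmap.bam", "type": "bam"},
        {"name": "14_DN.BWA.MARK.bam", "type": "bam"},
        {"name": "SRR1999478.pileup", "type": "sra"},
        {"name": "SRR1999478", "type": "sra"},
    ],
}


def group_by_type(bundle):
    """Group file names by the bundle's 'type' attribute."""
    groups = defaultdict(list)
    for entry in bundle["files"]:
        groups[entry["type"]].append(entry["name"])
    return dict(groups)


def ambiguous_types(bundle):
    """Return the types shared by more than one file.

    For these types the 'type' attribute alone cannot select a file.
    """
    groups = group_by_type(bundle)
    return sorted(t for t, names in groups.items() if len(names) > 1)


print(ambiguous_types(bundle))
```

Both types in this bundle are ambiguous, so a client has nothing other than the file names to go on.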

A second example of SRA content that would need to be represented.

{
    "accession": "SRR7274638",
    "files": [
        {
            "name": "95436.recal.cram", "type": "cram",
            "locality": [
                { "service": "s3", "region": "us-east-1" },
                { "service": "gs", "region": "us" }
            ]
        },
        {
            "name": "95436.recal.cram.crai", "type": "crai",
            "locality": [
                { "service": "s3", "region": "us-east-1" },
                { "service": "gs", "region": "us" }
            ]
        }
    ]
}

In this case the type attribute is more informative. The filename also conveys the type through the conventional file extension, but a distinct type attribute is more convenient.
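When each type occurs only once, as here, a client can select a file and a storage location without looking at the filename. The following is a rough Python sketch against the same illustrative (not proposed) JSON; the function name `pick` is hypothetical.

```python
# Copy of the illustrative SRR7274638 bundle shown above.
bundle = {
    "accession": "SRR7274638",
    "files": [
        {
            "name": "95436.recal.cram", "type": "cram",
            "locality": [
                {"service": "s3", "region": "us-east-1"},
                {"service": "gs", "region": "us"},
            ],
        },
        {
            "name": "95436.recal.cram.crai", "type": "crai",
            "locality": [
                {"service": "s3", "region": "us-east-1"},
                {"service": "gs", "region": "us"},
            ],
        },
    ],
}


def pick(bundle, file_type, service):
    """Return (name, region) for the only file of file_type on the given service.

    Raises LookupError if the type is missing, ambiguous, or not on that service.
    """
    matches = [f for f in bundle["files"] if f["type"] == file_type]
    if len(matches) != 1:
        raise LookupError(f"expected one file of type {file_type!r}, found {len(matches)}")
    entry = matches[0]
    for location in entry["locality"]:
        if location["service"] == service:
            return entry["name"], location["region"]
    raise LookupError(f"{entry['name']} is not available on {service!r}")


print(pick(bundle, "cram", "gs"))
print(pick(bundle, "crai", "s3"))
```

Run against the SRR1999478 example, the same function would raise for type bam, which is the gap the type attribute leaves open.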

A key question is whether the semantics of multiple objects should be represented within the bundle in a machine-actionable way, or whether those objects should instead be referenced in an external queryable schema which is used to obtain the precise ids needed for a particular purpose.

kabdilleh1 · 2020-08-18T21:11:45Z

https://docs.google.com/document/d/1nPK2fJ7w7tLaQY9I3uzd6tsI_PMDbVqoMHp5dgYB68g/edit

ianfore · 2020-10-15T18:21:29Z

The SRA public DRS server is now available. This gives the opportunity to work through two approaches to the problem with real examples.

Approach 1 - unpacking a bundle
The following DRS call uses a drs_id which corresponds to the SRA run accession no (SRR1599287). https://locate.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/99b71bee00f3dbc6d583887b91ea9a2f
For convenience, see attached response SRR1599287_drs.txt

The response is a bundle describing three files and the individual drs_ids for each. In order to determine which file is relevant for a given purpose it is necessary to parse the filename. There is no convention for file naming in DRS. It is not suggested here that there should be.
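As a sketch of what unpacking involves, the Python below walks the `contents` array that DRS v1 uses for bundles and then has to fall back on a filename suffix to choose a file. The sample object is hypothetical and is not the actual response for SRR1599287; its names and ids are placeholders, and the helper names are made up.

```python
import json
from urllib.request import urlopen


def fetch_drs_object(base_url, object_id):
    """GET a DRS v1 object. This needs network access and is not called below."""
    with urlopen(f"{base_url}/ga4gh/drs/v1/objects/{object_id}") as resp:
        return json.load(resp)


def list_bundle_files(drs_object):
    """Flatten a bundle's 'contents' into (name, id) pairs, recursing into nested bundles."""
    found = []
    for entry in drs_object.get("contents", []):
        if entry.get("contents"):
            found.extend(list_bundle_files(entry))
        else:
            found.append((entry["name"], entry.get("id")))
    return found


# Hypothetical bundle in the DRS v1 shape: placeholder names and ids only.
sample = {
    "id": "bundle-id",
    "contents": [
        {"name": "example.bam", "id": "file-id-1"},
        {"name": "example.bam.bai", "id": "file-id-2"},
        {"name": "example.sra", "id": "file-id-3"},
    ],
}

files = list_bundle_files(sample)
# With no type or semantics in the bundle, selection falls back on the filename.
bam_ids = [file_id for name, file_id in files if name.endswith(".bam")]
print(bam_ids)
```

To try it against the live server, `fetch_drs_object("https://locate.ncbi.nlm.nih.gov", "99b71bee00f3dbc6d583887b91ea9a2f")` would replace the sample object.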

Approach 2 - identify the specific file through Search
The Discovery Search reference implementation contains a table onek_genomes.sra_drs_files which may be queried as follows to obtain the drs_id for a specific file of interest.
SELECT drs_id, filename
FROM thousand_genomes.onek_genomes.sra_drs_files
WHERE filetype = 'bam' AND mapped = 'mapped'

More realistically, a search for files will be based on broader criteria, including sample and subject attributes.
See FASPScript14.py for a fully worked example.

Under this approach the specific file of interest can be identified through a mechanism (GA4GH Search) which provides a machine readable schema.
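A minimal sketch of how a client might submit such a query follows. It assumes the GA4GH Search API conventions (a POST to `/search` with a JSON body containing `query`, and paged responses carrying `data` and `pagination.next_page_url`). The function `run_search` and the base URL in the usage note are hypothetical; FASPScript14.py remains the worked example.

```python
import json
from urllib.request import Request, urlopen

QUERY = (
    "SELECT drs_id, filename "
    "FROM thousand_genomes.onek_genomes.sra_drs_files "
    "WHERE filetype = 'bam' AND mapped = 'mapped'"
)


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response."""
    request = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as resp:
        return json.load(resp)


def get_json(url):
    """GET a URL and decode the JSON response."""
    with urlopen(url) as resp:
        return json.load(resp)


def run_search(base_url, query, post=post_json, get=get_json):
    """Submit a query and collect rows from every page of results.

    Assumes each page has a 'data' list and an optional
    'pagination.next_page_url' pointing at the next page.
    """
    page = post(f"{base_url}/search", {"query": query})
    rows = []
    while True:
        rows.extend(page.get("data", []))
        next_url = (page.get("pagination") or {}).get("next_page_url")
        if not next_url:
            return rows
        page = get(next_url)
```

Usage would look like `run_search("https://search.example.org", QUERY)`, where the host is a placeholder for a Search endpoint. Each returned row carries a `drs_id` that can be passed directly to the DRS server, with no filename parsing.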

The practice of making tables available to query for specific files is widespread among GA4GH Driver Projects. See FASPScript2, which illustrates this for both the Cancer Research Data Commons and BioDataCatalyst.

An additional approach worth exploring is how the Research Objects initiative has handled the issue of describing the contents of an object like a bundle.

ianfore · 2021-01-18T19:09:50Z

Jim Vlasblom provided the following information

We have some data that might help with the item "Resolving accessions - SRA use case".

We've ingested public SRA metadata into bigquery (directly from the NCBI) and created DRS records that we've linked to this metadata. There are three tables in the striking-effort-817:ncbi_sra dataset, to which I've given you access:

'drs' - contains copies of the SRA subset of DRS records served by our drs server. The DRS Server actually uses a different database with both SRA and other DRS records, so changes here will not affect the server.

'meta' - Metadata scraped from the SRA. At this time we're missing a few key columns (e.g. run identifiers) but we'll be adding them soon.

'drsmeta' - the two tables above joined together. Prejoining improves performance in Presto, which would otherwise read in both tables and try to join them through our single Presto node.

You can use these to look up a DRS record by SRA metadata -- either by directly querying for the full DRS record (since we happen to mirror them in bigquery), or by doing a more "typical" workflow of looking up a DRS identifier and then querying the DRS server.

To look up a DRS record by id, you can use something like:

https://drs-server.staging.dnastack.com/ga4gh/drs/v1/objects/0bf7f02b-f334-4060-a402-40281cd8e2be

Where 0bf7f02b... is the DRS id. In our tables, we mostly just record this part of the DRS identifier right now. The full proper DRS identifier would be drs://drs-server.staging.dnastack.com/0bf7f02b... and is reported in the DRS record's "self URI".
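The relationship between the two forms can be sketched in Python, assuming the hostname-based DRS URI convention in which drs://&lt;hostname&gt;/&lt;id&gt; resolves to https://&lt;hostname&gt;/ga4gh/drs/v1/objects/&lt;id&gt;. The function names are hypothetical.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Translate a hostname-based DRS URI into the HTTPS URL of the DRS object."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


def drs_id_to_uri(hostname, object_id):
    """Build the full DRS URI from a bare id, as stored in the tables."""
    return f"drs://{hostname}/{object_id}"


uri = drs_id_to_uri(
    "drs-server.staging.dnastack.com",
    "0bf7f02b-f334-4060-a402-40281cd8e2be",
)
print(uri)
print(drs_uri_to_url(uri))
```

Storing only the bare id is workable as long as the hostname is known from context; the full URI is what makes the identifier portable between systems.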

Note: the DRS server requires some basic auth credentials. I will send these to you shortly.

It would make sense to explore these resources/approaches alongside the SRA DRS Service and ID Exchange. Note that data available on the DNAStack Public Search instance through the dbgap_demo.scr_gecco_susceptibility tables, such as sb_drs_index, are also a route to sequence data represented in SRA, albeit with the data in separate cloud storage. See this code example.

ianfore · 2021-01-18T19:31:20Z

Relevant dialog about the DNAStack tables containing SRA data

On Thu, Jan 7, 2021 at 12:30 PM Fore, Ian wrote:
When you refer to the public SRA metadata did you use their own BigQuery tables? Or did you import into striking-effort-817:ncbi_sra from somewhere else?
https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/

On Thu, Jan 7, 2021 at 12:45 PM Ayman Al Baz wrote:
While our metadata shares many fields with the BigQuery table provided by NCBI, it wasn't collected from the link you provided. We collected the metadata directly from NCBI using NCBI's Entrez API, as the Entrez API is more comprehensive than the linked BigQuery table.

On Thu, Jan 7, 2021 at 12:53 PM Jim Vlasblom wrote:
Thanks Ayman. I'll also add that this data contains publicly available metadata. Some of the data itself is publicly accessible (if it has a non null access URL), and some of it is not (null/missing access URL).

jvlasblom · 2021-01-19T21:13:47Z

We've updated our script to grab more metadata, and have created an updated striking-effort-817.ncbi_sra.january2021 table joining metadata to DRS records in the DNAstack DRS server. Some notes on this are here:
https://docs.google.com/document/d/17SFjBmr5WyA9WJsGIdubM4FhcGQAY68gbdx2rGD6Xk0/view

ianfore added the Hackathon exercise label Jul 10, 2020

This was referenced Oct 15, 2020

Consolidated DRS API Feedback: Identifiers ga4gh/data-repository-service-schemas#330

Open

Review approaches to locating objects of interest ga4gh/TASC#22

Closed

ianfore mentioned this issue Nov 10, 2020

Bundling and other approaches to mapping accessions/logical entities to DRS/ physical objects ga4gh/data-repository-service-schemas#337

Open
