Mapping accessions/logical entities to DRS/ physical objects #104
ISB-CGC bioinformaticians will examine SRA data in BigQuery. The goal is to compare how data is indexed in SRA to how NCI's ISB-CGC Cloud Resource is presenting similar information to its end-users in BigQuery tables at ISB-CGC. Insights into how researchers could combine information from both systems will be provided.
8/10/20 The ID Exchange would be passed an accession, e.g. an SRR, and return a DRS id. As there are multiple files for each accession, the plan is to return a DRS id for what would be a bundle.
There is no existing externally usable id/accession for the individual files for this SRR. DRS ids would also be generated for each of the files, which could then be used to retrieve those the user requires. The challenge for code that reads the bundle is to work out what the types of the individual files are and which to use for the purpose at hand. This amounts to understanding the semantics of the bundle. In the example above there are two files of type bam. There is no way of understanding the significance (semantics) of those two files. A human with the right knowledge might infer some meaning from the file names, but there is no consistency in file naming, and even if there were, encoding structured meaning in a filename is not a sound approach. For the two files listed with a type of sra one might infer that the file named pileup is a different type of file.

A second example of SRA content that would need to be represented.
In this case the type attribute is more informative. The filename also conveys the type through the conventional use of the file extension, but having a distinct type attribute is more convenient. A key question is whether the semantics of multiple objects should be represented within the bundle in a machine-actionable way, or whether those multiple objects should be referenced in an external queryable schema which is used to obtain the precise ids needed for a particular purpose.
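The ambiguity described for the first example (two files sharing the type bam) can be sketched in a few lines. This is a minimal illustration only: the bundle layout, the `type` field, and every id and filename below are assumptions made up for the sketch, not the actual SRA or DRS response.

```python
from collections import defaultdict

# Hypothetical bundle contents: ids, names and the "type" field are invented
# for illustration and are not taken from a real SRA or DRS response.
BUNDLE = {
    "id": "bundle-0001",
    "contents": [
        {"id": "obj-a", "name": "first_upload.bam", "type": "bam"},
        {"id": "obj-b", "name": "second_upload.bam", "type": "bam"},
        {"id": "obj-c", "name": "run.sra", "type": "sra"},
        {"id": "obj-d", "name": "run.pileup", "type": "sra"},
    ],
}


def ids_by_type(bundle):
    """Group the ids of the bundled files by their declared type."""
    grouped = defaultdict(list)
    for entry in bundle["contents"]:
        grouped[entry["type"]].append(entry["id"])
    return dict(grouped)


def pick_unique(bundle, file_type):
    """Return the only id of the given type, or raise if the choice is ambiguous."""
    candidates = ids_by_type(bundle).get(file_type, [])
    if len(candidates) != 1:
        raise LookupError(
            f"{len(candidates)} files of type {file_type!r}; "
            "the bundle alone does not say which one to use"
        )
    return candidates[0]
```

With this invented bundle, `pick_unique(BUNDLE, "bam")` raises `LookupError`, because a type attribute alone cannot distinguish the two bam files. Something beyond the type, either richer semantics in the bundle or an external schema, is needed to choose between them.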
The SRA public DRS server is now available. This gives the opportunity to work through two approaches to the problem with real examples.

Approach 1 - unpacking a bundle

The response is a bundle describing three files and the individual drs_ids for each. To determine which file is relevant for a given purpose it is necessary to parse the filename. There is no convention for file naming in DRS, and it is not suggested here that there should be.

Approach 2 - identify the specific file through Search

More realistically, a search for files will be based on broader criteria, including sample and subject attributes. Under this approach the specific file of interest can be identified through a mechanism (GA4GH Search) which provides a machine-readable schema. The practice of making tables available to query for specific files is widespread among GA4GH Driver Projects. See FASPScript2, which illustrates this for both the Cancer Research Data Commons and BioDataCatalyst.

An additional approach worth exploring is how the Research Objects initiative has handled the issue of describing the contents of an object like a bundle.
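The idea behind Approach 2 can be sketched with a small queryable table. This is not GA4GH Search itself: an in-memory SQLite database from the Python standard library stands in for the search service, and the table name `file_index`, its columns, and all values are hypothetical.

```python
import sqlite3

# Hypothetical index table: the table name, column names and every value
# below are invented for illustration, not taken from any real service.
ROWS = [
    ("RUN_EXAMPLE_1", "drs-0001", "cram", "reference_A"),
    ("RUN_EXAMPLE_1", "drs-0002", "crai", "reference_A"),
    ("RUN_EXAMPLE_1", "drs-0003", "cram", "reference_B"),
    ("RUN_EXAMPLE_2", "drs-0004", "cram", "reference_A"),
]


def build_index(rows):
    """Create an in-memory table mapping run accessions to typed DRS ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_index ("
        "run_accession TEXT, drs_id TEXT, file_format TEXT, reference TEXT)"
    )
    conn.executemany("INSERT INTO file_index VALUES (?, ?, ?, ?)", rows)
    return conn


def find_drs_ids(conn, run_accession, file_format, reference):
    """Select the DRS ids matching explicit, schema-defined attributes."""
    cursor = conn.execute(
        "SELECT drs_id FROM file_index "
        "WHERE run_accession = ? AND file_format = ? AND reference = ? "
        "ORDER BY drs_id",
        (run_accession, file_format, reference),
    )
    return [row[0] for row in cursor.fetchall()]
```

Because format and reference are columns in a schema rather than fragments of a filename, a caller can ask for exactly the file it needs, for example `find_drs_ids(build_index(ROWS), "RUN_EXAMPLE_1", "cram", "reference_A")`, without parsing names.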
Jim Vlasblom provided the following information
It would make sense to consider these resources/approaches alongside the SRA DRS Service and ID Exchange. Note that data available on the DNAStack Public Search instance through the dbgap_demo.scr_gecco_susceptibility tables, such as sb_drs_index, are also a route to sequence data represented in SRA - albeit with the data in separate cloud storage. See this code example.
Relevant dialog about the DNAStack tables containing SRA data
We've updated our script to grab more metadata, and have created an updated striking-effort-817.ncbi_sra.january2021 table joining metadata to DRS records in the DNAstack DRS server. Some notes on this are here:
In this hackathon exercise SRA would be used as a test case to explore how biological entities (logical level) are handled in relation to the immutable physical objects in DRS.
INSDC ids used in SRA identify logical level/biological entities such as sequencing runs (SRRnnnnnn). The mapping to immutable digital objects (DRS ids) is not as simple as might be expected, for two reasons.
a) SRRs map to more than one digital object (e.g. a cram and a crai file)
b) the immutable digital object(s) to which they map may change (e.g. after alignment to a different reference sequence)
The attached example, sdl_example1.txt, shows a response for SRR7274638 from the SRA Data Locator as an example of the use case.
Schemas which define the biological entities would provide the data model defining the relationships between the objects.
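A minimal sketch of what such a data model might look like, covering reasons (a) and (b) above. The class names, the accession, and the ids are all hypothetical and chosen only for illustration. This is one possible shape, not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalObject:
    """An immutable file, identified by a DRS id."""
    drs_id: str
    file_format: str


@dataclass
class Run:
    """A logical entity such as a sequencing run, identified by an accession."""
    accession: str
    # Each element is one generation of files; the last element is current.
    generations: list = field(default_factory=list)

    def publish(self, objects):
        """Record a new set of files without altering earlier generations."""
        self.generations.append(tuple(objects))

    def current_ids(self):
        """Return the DRS ids the accession currently maps to."""
        return [obj.drs_id for obj in self.generations[-1]]


# Invented accession and ids, for illustration only.
run = Run("RUN_EXAMPLE_1")
# (a) one accession maps to more than one digital object
run.publish([PhysicalObject("drs-0001", "cram"), PhysicalObject("drs-0002", "crai")])
# (b) realignment to a different reference yields new immutable objects
run.publish([PhysicalObject("drs-0003", "cram"), PhysicalObject("drs-0004", "crai")])
```

The accession stays stable while the set of DRS ids behind it changes. The earlier objects are never modified, only superseded, which keeps the physical level immutable.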
The exercise would be to test out schemas, searchable via the Discovery Search prototype, which map logical level ids to immutable DRS objects. In the SRA case this could potentially be as simple as using the NCBI implementation of SRA in BigQuery.