Let's talk expression data! #2229

Open
sammyjava opened this issue Mar 12, 2020 · 12 comments

@sammyjava
Member

As per the discussion in the community call today, I am hereby creating an "issue" to stimulate creation of a core expression data model so we can build tools that are commonly usable. (Note that Strain is implemented since we load expression data from multiple strains of a given legume species.)

To get things started, here's the data model that I use in the LIS mines:

<class name="ExpressionSource" is-interface="true">
        <attribute name="unit" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" is-interface="true">
        <attribute name="value" type="java.lang.Double"/>
        <reference name="gene" referenced-type="Gene"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>
@sammyjava
Member Author

sammyjava commented Mar 12, 2020

And here's a sample of data that is held by that data model:

id                | 38202165
unit              | TPM
primaryidentifier | Gene expression atlas of pigeonpea Asha(ICPL87119)
datasetid         | 38202163
class             | org.intermine.model.bio.ExpressionSource
-[ RECORD 1 ]-----+---------------------------------------------------------
num               | 1
description       | Mature seed at Reproductive stage (SRR5199304)
id                | 37000003
biosample         | SAMN06264156
name              | Mature seed at reprod (SRR5199304)
primaryidentifier | SRR5199304
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 2 ]-----+---------------------------------------------------------
num               | 2
description       | Immature seed at Reproductive stage (SRR5199305)
id                | 37000005
biosample         | SAMN06264155
name              | Immature seed at reprod (SRR5199305)
primaryidentifier | SRR5199305
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 3 ]-----+---------------------------------------------------------
num               | 3
description       | Mature pod at Reproductive stage (SRR5199306)
id                | 37000007
biosample         | SAMN06264154
name              | Mature pod at reprod (SRR5199306)
primaryidentifier | SRR5199306
organismid        | 5235944
strainid          | 5235945
datasetid         | 38202163
sourceid          | 38202165
class             | org.intermine.model.bio.ExpressionSample
-[ RECORD 1 ]---+----------------------------------------
intermine_value | 0
id              | 37000002
geneid          | 5235941
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 2 ]---+----------------------------------------
intermine_value | 1.21
id              | 37000062
geneid          | 5235946
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 3 ]---+----------------------------------------
intermine_value | 4.29
id              | 37000092
geneid          | 5235948
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue
-[ RECORD 4 ]---+----------------------------------------
intermine_value | 11.43
id              | 37000122
geneid          | 5235950
sampleid        | 37000003
class           | org.intermine.model.bio.ExpressionValue

@danielabutano
Member

Thanks @sammyjava! @rachellyne and @sergiocontrino let's discuss during the next Monday meeting

@sergiocontrino
Member

thank you @sammyjava, i think you are right that it would be nice to have a common basic model for this sooner rather than later. regarding the one you are using i have a few initial questions:

  • should the referenced type in ExpressionValue be a more generic bioentity?
  • what is the role of 'num' in ExpressionSample? is not the primaryidentifier enough?
  • maybe you could comment on using ExpressionSource, in particular this vs extending dataset and unit attribute here rather than in ExpressionValue (i think i can see the reasoning, would be nice to have your experience on that).
  • is this working for time-course experiments?
thanks!

@sammyjava
Member Author

Just saw this, Sergio!

* should the referenced type in ExpressionValue be a more generic bioentity?

Yes. No reason for it to specifically be Gene. I'm not sure about BioEntity, though, I think SequenceFeature would be more accurate. Proteins don't express but transposons can.

* what is the role of 'num' in ExpressionSample? is not the primaryidentifier enough?

That's for ordering the samples for user convenience, such as on a heat map axis. It's nice to have all the leaf-related tissues together and then the seed-related ones, etc. It can be left null.
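
For illustration, retrieving samples in that display order could look like the PathQuery XML below. This is only a sketch: the query name is made up, the paths assume the ExpressionSample class exactly as posted at the top of this issue, and the source identifier is the one from the pigeonpea sample data above.

```xml
<!-- Sketch only: "samples_in_display_order" is a hypothetical query name. -->
<query name="samples_in_display_order" model="genomic"
       view="ExpressionSample.num ExpressionSample.primaryIdentifier ExpressionSample.name"
       sortOrder="ExpressionSample.num asc">
        <!-- Restrict to the samples of one expression source. -->
        <constraint path="ExpressionSample.source.primaryIdentifier" op="="
                    value="Gene expression atlas of pigeonpea Asha(ICPL87119)"/>
</query>
```

A heat map could then lay out its sample axis in the returned order. Where samples with a null num end up is not something this sketch controls.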

* maybe you could comment on using ExpressionSource, in particular this vs extending dataset and unit attribute here rather than in  ExpressionValue (i think i can see the reasoning, would be nice to have your experience on that).

For extensibility. Although I agree that unit should reside with ExpressionValue, since it is the unit of that value (e.g. "TPM"). I extend ExpressionSource enormously in my mines, with all sorts of extra attributes. I don't think I want to extend DataSet with all that stuff. (Like SRA identifier, library prep details, all sorts of things that come up in RNA-seq experiments.)

* is this working for time-course experiments?

I haven't really thought of how to deal with time-course anything in InterMine. Of course you can add an attribute "time" to ExpressionValue and have values across time. I don't have any at the current time, although my first job in bioinformatics was dealing with Arabidopsis time-course experiments, for which I wrote a fairly big webapp.
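
As a rough sketch of that idea, assuming the SequenceFeature reference discussed above: the attribute names "time" and "timeUnit" are hypothetical and not part of any agreed model.

```xml
<!-- Sketch only: "time" and "timeUnit" are hypothetical attribute names. -->
<class name="ExpressionValue" is-interface="true">
        <attribute name="value" type="java.lang.Double"/>
        <attribute name="unit" type="java.lang.String"/>
        <!-- Hypothetical: elapsed time at which this value was measured. -->
        <attribute name="time" type="java.lang.Double"/>
        <!-- Hypothetical: unit of "time", e.g. hours. -->
        <attribute name="timeUnit" type="java.lang.String"/>
        <reference name="feature" referenced-type="SequenceFeature"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>
```

An alternative would be to treat each time point as its own ExpressionSample, which keeps ExpressionValue unchanged but moves the time information onto the sample.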

@sammyjava
Member Author

Updates per Sergio's suggestions.

<class name="ExpressionSource" is-interface="true">
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" is-interface="true">
        <attribute name="unit" type="java.lang.String"/>
        <attribute name="value" type="java.lang.Double"/>
        <reference name="feature" referenced-type="SequenceFeature"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
</class>

@sammyjava
Member Author

sammyjava commented Jun 25, 2020

FYI, this is my current model. I've got NCBI attributes in there (sra, bioProject, bioSample, geoSeries) as well as some others which should probably not be in the core model. But I thought I'd show you what I'm using. I also changed primaryIdentifier to identifier to make it clear that it doesn't extend Annotatable. (So publication and dataSet are explicitly listed as references.) Also, note that both organism and strain are referenced so that strain is not required.

<class name="ExpressionSource" is-interface="true">
        <attribute name="sra" type="java.lang.String"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="geoSeries" type="java.lang.String"/>
        <attribute name="origin" type="java.lang.String"/>
        <attribute name="shortName" type="java.lang.String"/>
        <reference name="publication" referenced-type="Publication"/>
        <reference name="bioProject" referenced-type="BioProject"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <collection name="samples" referenced-type="ExpressionSample" reverse-reference="source"/>
</class>

<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>

<class name="ExpressionValue" extends="java.lang.Object" is-interface="false">
        <attribute name="value" type="java.lang.Double"/>
        <attribute name="unit" type="java.lang.String"/>
        <reference name="sample" referenced-type="ExpressionSample"/>
        <reference name="feature" referenced-type="SequenceFeature"/>
</class>

@hendrikweisser

I can't comment on implementation details, but perhaps one point worth considering:

> Proteins don't express but transposons can.

What about protein-level expression data, e.g. from quantitative mass spec? You can have isoform-specific expression data that you couldn't capture properly on the gene level.
Not sure if you want this to be "in scope" for the current proposal.

@sammyjava
Member Author

Interesting. But I think it makes sense to limit the "things that express" to sequence features, which proteins are not. Just enforcing the "central dogma" really. I think you can map the proteins to transcript isoforms, and they often have the exact same name (gene.1, gene.2); they certainly do in all the LIS mines by specification. It certainly makes sense to store expression relative to transcripts, not genes.

@rachellyne
Member

Apologies that it has taken us so long to get back to this. I really like the idea of a core expression model (or possibly a couple of core expression models to cover different expression techniques; see below). I think this needs some discussion, and we should probably also take into account the visualizations we already have (and those we would like to add). Our problem here in Cambridge is that we have multiple expression models that cover different data and techniques: RNA-seq, microarray and in-situ hybridisation. It would be good to revisit the models (some were created many years ago, before RNA-seq was really even a thing; historically we have a bit of a mishmash). Our RNA-seq data would fit nicely into the model proposed above. The microarray and in-situ models are more complex. For instance, for the microarrays we have two samples and multiple expression scores (various Affymetrix measurements), plus info on probes etc. I'll put a couple of our models below.

@rachellyne
Member

rachellyne commented Oct 20, 2020

RNA-seq:

<class name="RNASeqResult" is-interface="true">
  <attribute name="expressionScore" type="java.lang.Double"/>
  <attribute name="tissue" type="java.lang.String"/>
  <attribute name="expressionType" type="java.lang.String"/>
  <reference name="gene" referenced-type="Gene" reverse-reference="rnaSeqResults"/>
  <collection name="dataSets" referenced-type="DataSet"/>
</class>

@rachellyne
Member

rachellyne commented Oct 20, 2020

Affymetrix arrays:

<class name="FlyAtlasResult" extends="MicroArrayResult" is-interface="true">
  <attribute name="affyCall" type="java.lang.String"/>
  <attribute name="presentCall" type="java.lang.Integer"/>
  <attribute name="enrichment" type="java.lang.Double"/>
  <attribute name="mRNASignal" type="java.lang.Double"/>
  <attribute name="mRNASignalSEM" type="java.lang.Double"/>
  <reference name="tissue" referenced-type="Tissue" reverse-reference="expressionResults"/>
</class>
<class name="Tissue" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <collection name="expressionResults" referenced-type="FlyAtlasResult" reverse-reference="tissue"/>
</class>
<class name="MicroArrayResult" is-interface="true">
  <attribute name="scale" type="java.lang.String"/>
  <attribute name="type" type="java.lang.String"/>
  <attribute name="isControl" type="java.lang.Boolean"/>
  <attribute name="value" type="java.lang.Float"/>
  <reference name="experiment" referenced-type="MicroArrayExperiment" reverse-reference="results"/>
  <reference name="material" referenced-type="ProbeSet" reverse-reference="results"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="results"/>
  <collection name="reporters" referenced-type="Reporter" reverse-reference="results"/>
  <collection name="genes" referenced-type="Gene" reverse-reference="microArrayResults"/>
  <collection name="samples" referenced-type="Sample"/>
  <collection name="dataSets" referenced-type="DataSet"/>
</class>
<class name="MicroArrayExperiment" is-interface="true">
  <attribute name="identifier" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="name" type="java.lang.String"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="experiment"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="experiment"/>
</class>
<class name="MicroArrayAssay" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="sample1" type="java.lang.String"/>
  <attribute name="sample2" type="java.lang.String"/>
  <attribute name="displayOrder" type="java.lang.Integer"/>
  <reference name="experiment" referenced-type="MicroArrayExperiment" reverse-reference="assays"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="assays"/>
  <collection name="samples" referenced-type="Sample" reverse-reference="assays"/>
</class>
<class name="Sample" extends="BioEntity" is-interface="true">
  <attribute name="materialType" type="java.lang.String"/>
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
  <attribute name="primaryCharacteristic" type="java.lang.String"/>
  <attribute name="primaryCharacteristicType" type="java.lang.String"/>
  <collection name="assays" referenced-type="MicroArrayAssay" reverse-reference="samples"/>
  <collection name="characteristics" referenced-type="SampleCharacteristic"/>
  <collection name="treatments" referenced-type="Treatment"/>
</class>
<class name="SampleCharacteristic" is-interface="true">
  <attribute name="type" type="java.lang.String"/>
  <attribute name="value" type="java.lang.String"/>
  <reference name="ontologyTerm" referenced-type="OntologyTerm"/>
</class>
<class name="Treatment" is-interface="true">
  <attribute name="action" type="java.lang.String"/>
  <collection name="protocols" referenced-type="Protocol"/>
  <collection name="parameters" referenced-type="TreatmentParameter" reverse-reference="treatment"/>
</class>
<class name="TreatmentParameter" is-interface="true">
  <attribute name="type" type="java.lang.String"/>
  <attribute name="value" type="java.lang.String"/>
  <attribute name="units" type="java.lang.String"/>
  <reference name="treatment" referenced-type="Treatment" reverse-reference="parameters"/>
</class>
<class name="Protocol" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <attribute name="description" type="java.lang.String"/>
</class>
<class name="ProbeSet" extends="BioEntity" is-interface="true">
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="material"/>
</class>
<class name="Reporter" is-interface="true">
  <attribute name="isControl" type="java.lang.Boolean"/>
  <attribute name="failType" type="java.lang.String"/>
  <attribute name="controlType" type="java.lang.String"/>
  <reference name="material" referenced-type="BioEntity"/>
  <collection name="results" referenced-type="MicroArrayResult" reverse-reference="reporters"/>
</class>
<class name="Gene" is-interface="true">
  <collection name="microArrayResults" referenced-type="MicroArrayResult" reverse-reference="genes"/>
</class>

@sammyjava
Member Author

Thanks, Rachel. Yes, I had RNA-seq in mind with the proposal, since that's what we're storing in the LIS mines. One comment: we should be sure the core model can handle expression experiments whose samples all come from a single tissue but differ in "treatment" (which could be a mutation). ExpressionSample should include a tissue attribute, but should also record what's special about the sample when it isn't the tissue. A concrete example is an Arabidopsis experiment I worked on where we had controls plus GR-REV, GR-STM, GR-AS2, and GR-KAN mutant lines. All samples were seedling leaves, and all were treated with dexamethasone for varying times before freezing. So there were mutants and treatments, but only seedling leaves.
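
A rough sketch of what that might look like on ExpressionSample, building on the June 25 version of the model above. The attribute names "tissue", "genotype", "treatment" and "treatmentDuration" are hypothetical placeholders for discussion, not an agreed design, and the example values in the comments come from the Arabidopsis experiment just described.

```xml
<!-- Sketch only: tissue, genotype, treatment and treatmentDuration are hypothetical names. -->
<class name="ExpressionSample" is-interface="true">
        <attribute name="num" type="java.lang.Integer"/>
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="description" type="java.lang.String"/>
        <attribute name="bioSample" type="java.lang.String"/>
        <!-- Hypothetical: e.g. "seedling leaf" for every sample in the example. -->
        <attribute name="tissue" type="java.lang.String"/>
        <!-- Hypothetical: e.g. "control" or "GR-REV". -->
        <attribute name="genotype" type="java.lang.String"/>
        <!-- Hypothetical: e.g. "dexamethasone". -->
        <attribute name="treatment" type="java.lang.String"/>
        <!-- Hypothetical: time under treatment before freezing, as free text. -->
        <attribute name="treatmentDuration" type="java.lang.String"/>
        <reference name="organism" referenced-type="Organism"/>
        <reference name="strain" referenced-type="Strain"/>
        <reference name="dataSet" referenced-type="DataSet"/>
        <reference name="source" referenced-type="ExpressionSource" reverse-reference="samples"/>
</class>
```

Plain string attributes keep the sketch simple; a Treatment collection like the one in the microarray model above would be the more structured alternative.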
