Let's talk expression data! #2229
And here's a sample of data that is held by that data model:
|
Thanks @sammyjava! @rachellyne and @sergiocontrino, let's discuss this during the next Monday meeting. |
Thank you @sammyjava. I think you are right that it would be nice to have a common basic model for this sooner rather than later. Regarding the one you are using, I have a few initial questions:
|
Just saw this, Sergio!
Yes. No reason for it to specifically be Gene. I'm not sure about BioEntity, though, I think SequenceFeature would be more accurate. Proteins don't express but transposons can.
That's for ordering the samples for user convenience, such as on a heat map axis. It's nice to have all the leaf-related tissues together and then the seed-related ones, etc. It can be left null.
For extensibility. Although I agree that unit should reside with ExpressionValue, since it is the unit of that value (e.g. "TPM"). I extend ExpressionSource enormously in my mines, with all sorts of extra attributes. I don't think I want to extend DataSet with all that stuff. (Like SRA identifier, library prep details, all sorts of things that come up in RNA-seq experiments.)
I haven't really thought of how to deal with time-course anything in InterMine. Of course you can add an attribute "time" to ExpressionValue and have values across time. I don't have any at the current time, although my first job in bioinformatics was dealing with Arabidopsis time-course experiments, for which I wrote a fairly big webapp. |
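For readers following along, the points in the reply above can be sketched in a few lines of Python. This is only an illustration, not the InterMine model itself: the class names `SequenceFeature`, `ExpressionSample` and `ExpressionValue` come from the thread, while the field names (`primary_identifier`, `order`, `time`) and the helper `display_order` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceFeature:
    # Anything that can be expressed: a gene, transcript, transposon, ...
    primary_identifier: str


@dataclass(frozen=True)
class ExpressionSample:
    name: str
    # Hypothetical field name: display position, e.g. on a heat map axis.
    # It may be left unset (null).
    order: Optional[int] = None


@dataclass(frozen=True)
class ExpressionValue:
    feature: SequenceFeature
    sample: ExpressionSample
    value: float
    unit: str  # the unit lives with the value, e.g. "TPM"
    time: Optional[float] = None  # assumed optional attribute for time courses


def display_order(samples):
    """Sort samples by `order`; samples without one go last, ties by name."""
    return sorted(samples, key=lambda s: (s.order is None, s.order or 0, s.name))


samples = [
    ExpressionSample("seed", order=3),
    ExpressionSample("root"),  # no order given
    ExpressionSample("leaf", order=1),
]
ordered_names = [s.name for s in display_order(samples)]
```

Keeping `unit` on the value rather than on the source means a single source could, in principle, carry values in more than one unit.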
Updates per Sergio's suggestions.
|
FYI, this is my current model. I've got NCBI attributes in there (
|
I can't comment on implementation details, but perhaps one point worth considering:
What about protein-level expression data, e.g. from quantitative mass spec? You can have isoform-specific expression data that you couldn't capture properly on the gene level. |
Interesting. But I think it makes sense to limit the "things that express" to sequence features, which proteins are not. Just enforcing the "central dogma" really. I think you can map the proteins to transcript isoforms, and they often have the exact same name (gene.1, gene.2); they certainly do in all the LIS mines by specification. It certainly makes sense to store expression relative to transcripts, not genes. |
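As a rough sketch of the mapping suggested above, assuming protein isoforms and transcript isoforms really do share identifiers (gene.1, gene.2), protein-level values could be attached to transcripts by name. The function name and the data shapes here are hypothetical, chosen only for illustration.

```python
def map_proteins_to_transcripts(protein_values, transcript_ids):
    """Attach protein-level values to transcripts with the same identifier.

    protein_values: dict of protein identifier -> expression value
    transcript_ids: iterable of known transcript identifiers
    Returns (mapped, unmatched): values keyed by transcript identifier, and
    a sorted list of protein identifiers that have no same-named transcript.
    """
    known = set(transcript_ids)
    mapped = {}
    unmatched = []
    for protein_id, value in protein_values.items():
        if protein_id in known:
            mapped[protein_id] = value
        else:
            unmatched.append(protein_id)
    return mapped, sorted(unmatched)


mapped, unmatched = map_proteins_to_transcripts(
    {"gene.1": 2.5, "gene.3": 0.1},
    ["gene.1", "gene.2"],
)
```

Anything that ends up in `unmatched` would need a real protein-to-transcript mapping, which is where the naming assumption stops holding.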
Apologies it has taken us a long time to get back to this. I really like the idea of a core expression model (or possibly a couple of core expression models to cover different expression techniques - see below). I think this needs some discussion, and we should probably also take into account the visualizations we already have (and what we would like to add). Our problem here in Cambridge is that we have multiple expression models that cover different data and techniques: RNA-seq, microarray and in-situ hybridisation. It would be good to revisit the models (some were created many years ago, before RNA-seq was really even a thing, so historically we have a bit of a mishmash). Our RNA-seq data would fit nicely into the model proposed above. The microarray and in-situ models are more complex. For instance, for the microarrays we have two samples and multiple expression scores (various Affymetrix measurements), plus info on probes etc. I'll put a couple of our models below. |
RNA-seq:
|
Affymetrix arrays:
|
Thanks, Rachel. Yes, I had RNA-seq in mind with the proposal, since that's what we're storing in the LIS mines. One comment is that we should be sure to write the core model to handle expression experiments that deal with samples of a single tissue but with various "treatments" (which could be mutations). ExpressionSample should include the tissue attribute, but should also contain what's special about the sample if it's not the tissue. A concrete example is an Arabidopsis experiment I worked on where we had controls, GR-REV, GR-STM, GR-AS2, and GR-KAN mutant lines. All samples were seedling leaves. And all were treated with dexamethasone with varying times before freezing. So there were mutants and treatments but only seedling leaves. |
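To make the single-tissue case above concrete, here is a minimal Python sketch of a sample record that carries the tissue plus whatever else distinguishes the sample. Only `ExpressionSample` and the tissue attribute come from the thread; the other field names (`genotype`, `treatment`, `treatment_minutes`), the helper function, and the treatment times are assumptions invented for the example, not data from the actual experiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpressionSample:
    name: str
    tissue: str
    genotype: Optional[str] = None           # hypothetical, e.g. "GR-REV"
    treatment: Optional[str] = None          # hypothetical, e.g. "dexamethasone"
    treatment_minutes: Optional[int] = None  # hypothetical, time before freezing


def distinguishing_attributes(samples):
    """Return the attributes that actually vary across the given samples."""
    fields = ("tissue", "genotype", "treatment", "treatment_minutes")
    return [f for f in fields if len({getattr(s, f) for s in samples}) > 1]


# Illustrative values only: the times below are made up.
samples = [
    ExpressionSample("control_30", "seedling leaf", "control", "dexamethasone", 30),
    ExpressionSample("rev_30", "seedling leaf", "GR-REV", "dexamethasone", 30),
    ExpressionSample("rev_60", "seedling leaf", "GR-REV", "dexamethasone", 60),
]
varying = distinguishing_attributes(samples)
```

In this toy data `tissue` and `treatment` are constant, so a tool built on the model would need to label or group samples by the attributes that do vary.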
As per the discussion in the community call today, I am hereby creating an "issue" to stimulate creation of a core expression data model so we can build tools that are commonly usable. (Note that Strain is implemented since we load expression data from multiple strains of a given legume species.)
To get things started, here's the data model that I use in the LIS mines: