Merge pull request #881 from microbiomedata/ADR-define-envo-value-sets
add ADR for env triad
Showing 2 changed files with 100 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
--- | ||
status: proposed / in progress | ||
date: 2024-08-28 | ||
deciders: Montana Smith, Mark Miller, Sierra Moxon | ||
consulted: Lee Ann McCue, Natalie Winans | ||
informed: | ||
--- | ||
|
||
# Establishing Reasonable Values for MIxS Environmental Triad Slots | ||
|
||
## Context and Problem Statement | ||
|
||
Determining what values to populate into each of the environmental triad slots (`env_broad_scale`, `env_local_scale`, | ||
and `env_medium`) is difficult, especially when further broken down into reasonable `env_broad_scale` values for Soil | ||
samples, Water samples, etc. Although challenging, this curation adds value to the NMDC ecosystem in terms of finding and | |
grouping similar Biosamples. | ||
|
||
NMDC will establish reproducible logic to provide users with value sets, i.e. curated lists of reasonable terms, for | ||
each slot. The value sets will be implemented as enumerations in the submission-schema, and will be composed in the MIxS | ||
style: the label for a class from a high-quality ontology, followed by the class's CURIE inside square brackets. For | |
example, one reasonable `env_broad_scale` for a Soil sample _might_ be 'temperate grassland biome [ENVO:01000193]'. | |
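
As a minimal sketch of the MIxS-style format described above, the following Python composes and parses `label [CURIE]` strings. The regular expression and both function names are assumptions made for illustration; the pattern actually used by the submission-schema may differ.

```python
import re

# Assumed pattern for "label [PREFIX:LOCAL_ID]"; not taken from the submission-schema.
TRIAD_VALUE = re.compile(r"^(?P<label>\S.*?) \[(?P<curie>[A-Za-z_]+:\w+)\]$")


def compose_triad_value(label: str, curie: str) -> str:
    """Build a MIxS-style value: the class label, then the CURIE in square brackets."""
    return f"{label.strip()} [{curie.strip()}]"


def parse_triad_value(value: str):
    """Return (label, curie) for a well-formed value, or None if the form is wrong."""
    match = TRIAD_VALUE.match(value.strip())
    if match is None:
        return None
    return match.group("label"), match.group("curie")


if __name__ == "__main__":
    text = compose_triad_value("temperate grassland biome", "ENVO:01000193")
    print(text)
    print(parse_triad_value(text))
```

Parsing only checks the shape of the string. It does not confirm that the label and the CURIE refer to the same ontology class.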
|
||
These value sets are not necessarily intended to be closed indefinitely. An environmental triad slot could accept values | |
that satisfy either an enumeration or a regular expression pattern. This could be a temporary allowance while NMDC's logic is | |
being refined. Values that match the regular expression but do not come from the enumeration will need to be reviewed on | |
a regular basis, as submitters could provide values that have typos, that have CURIE/label mismatches, or that are simply not | |
reasonable values for the slot. | |
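
The review workflow described above could be sketched as a three-way triage. This is an illustration only: the pattern, the `Triage` names, and the one-member value set are hypothetical and are not the contents of any NMDC enumeration.

```python
import re
from enum import Enum


class Triage(Enum):
    ACCEPTED = "in the enumeration"
    NEEDS_REVIEW = "matches the pattern only"
    REJECTED = "matches neither constraint"


# Assumed shape check for "label [PREFIX:LOCAL_ID]" (hypothetical pattern).
PATTERN = re.compile(r"^\S.*? \[[A-Za-z_]+:\w+\]$")

# Hypothetical one-member value set, used only to exercise the function.
EXAMPLE_VALUE_SET = frozenset({"temperate grassland biome [ENVO:01000193]"})


def triage(value: str, value_set=EXAMPLE_VALUE_SET) -> Triage:
    """Classify a submitted value against the enumeration and the pattern."""
    if value in value_set:
        return Triage.ACCEPTED
    if PATTERN.match(value):
        # Well-formed but not curated: could be a typo, a CURIE/label
        # mismatch, or an unreasonable term, so a person should look at it.
        return Triage.NEEDS_REVIEW
    return Triage.REJECTED


if __name__ == "__main__":
    for submitted in (
        "temperate grassland biome [ENVO:01000193]",
        "temperate grasland biome [ENVO:01000193]",  # label typo
        "grassland",
    ):
        print(submitted, "->", triage(submitted).name)
```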
|
||
*This ADR is in progress. The initial plan has been outlined here and may change.* | ||
|
||
## Decision Outcome | ||
|
||
NMDC environment-specific value sets for the environmental triad slots will be generated by processes that emphasize | ||
reusability and generalizability. Specifically, code has been developed to generate tables of candidates for each value | ||
set, with columns of theoretical and empirical supporting evidence. | ||
|
||
The theoretical evidence is based | ||
on [guidance provided by the EnvO and MIxS authors](https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS), | ||
along with the structure of the ontologies whose classes are included in the value sets. Furthermore, the technology for | |
extracting subsets from those ontologies will be queries composed with | ||
the [Ontology Access Kit](https://github.com/INCATools/ontology-access-kit). For example, NMDC may set the general rule | ||
that the values for the `env_broad_scale`, in combination with all environmental sample types, must be subclasses | ||
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428?lang=en). | ||
In that case, the `env_broad_scale` value sets for individual environments/sample types must be subsets of the general | |
query. | ||
|
||
The empirical evidence is based on values that have been used for the environmental triad slots in prominent Biosample | ||
metadata systems, such as NCBI, GOLD, and NMDC. Some of those sources, especially NCBI, are very permissive for the | ||
environmental triad values. Because the candidate value tables include multiple empirical sources plus the results of | ||
the general, rules-based OAK queries, no single source can introduce inappropriate values or exclude reasonable ones. | |
Initial exploration suggests that the queries will be hard limits for `env_broad_scale` and `env_medium`, but that the | ||
empirical evidence will be more important for `env_local_scale`. General, rules-based queries will be refined for each | ||
of the environments (MIxS Extensions), and will be reflected in either query-specific columns or general boolean | ||
columns, like 'is_biome' or 'is_environmental_material'. | ||
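
A rough sketch of how a candidate table with boolean query columns and empirical-source columns might be used follows. The rows, the `seen_in` values, and the ranking rule are made-up placeholders, not NMDC data or NMDC's actual selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    value: str
    is_biome: bool                      # result of a general, rules-based query
    is_environmental_material: bool     # result of a general, rules-based query
    seen_in: frozenset = frozenset()    # empirical sources (placeholder data)


# Placeholder rows; the seen_in sets are invented for illustration only.
CANDIDATES = [
    Candidate("temperate grassland biome [ENVO:01000193]", True, False,
              frozenset({"NCBI", "GOLD"})),
    Candidate("soil [ENVO:00001998]", False, True,
              frozenset({"NCBI", "GOLD", "NMDC"})),
]


def hard_limit(candidates, column: str):
    """Keep only the candidates for which the boolean query column is True."""
    return [c for c in candidates if getattr(c, column)]


def rank_by_empirical_support(candidates):
    """Order candidates by how many sources have used them (assumed rule)."""
    return sorted(candidates, key=lambda c: len(c.seen_in), reverse=True)


if __name__ == "__main__":
    print([c.value for c in hard_limit(CANDIDATES, "is_biome")])
    print([c.value for c in rank_by_empirical_support(CANDIDATES)])
```

Keeping the query results as columns, instead of filtering rows away early, leaves both kinds of evidence visible to reviewers.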
|
||
- This ADR will be updated as environment-specific queries are created. | ||
- The logic described above is intended to minimize cherry-picking of values for the sets. | ||
- Any filtering should be accomplished as a general query. For example, we will not remove a specific term, but rather identify a rule that accounts for more general needs. | ||
- Expert review will be done on an initial list of accepted terms. Insights from review will be fed back into reusable logic, which might require complex OAK queries over inconsistent annotations, or grouping of values within a semantic embedding space. | |
- As necessary, we can request that EnvO add classes in support of this work, or that more (and more consistent) axioms | ||
are added to existing classes. This will not be done by the squad contributing this ADR. | |
|
||
## NMDC General Queries for Environmental Triad Value Sets | ||
|
||
- `env_broad_scale` will consist of subclasses | ||
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428), | ||
with the exception of host-associated samples, including plant-associated samples. | ||
- `env_local_scale` will consist of subclasses | ||
of [material entity [BFO:0000040]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000040?lang=en), | ||
minus biome [ENVO:00000428] | ||
and [environmental material [ENVO:00010483]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00010483) | ||
- It's expected that additional branches of the is_a hierarchy will be subtracted, | |
like [chemical entity [CHEBI:24431]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCHEBI_24431) | ||
- `env_medium` will consist of subclasses of environmental material [ENVO:00010483] | |
|
||
For **soil** environment (MIxS Extension) | ||
|
||
- `env_broad_scale` will | ||
exclude [aquatic biome [ENVO:00002030]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00002030) | ||
- `env_local_scale` TBC | |
- `env_medium` TBC | ||
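
The general queries and the soil-specific exclusion above amount to set operations over the is_a hierarchy. The sketch below uses a hand-written, heavily simplified toy hierarchy (not extracted from EnvO, with intermediate classes omitted and one placeholder `EX:` identifier) so that it runs without OAK; in practice the branches would come from OAK queries.

```python
# Toy is_a edges (child -> parent). Hand-written and simplified; an assumption,
# not a faithful copy of EnvO. "EX:0000001" is a placeholder identifier.
PARENT = {
    "ENVO:00000428": "BFO:0000040",    # biome
    "ENVO:00002030": "ENVO:00000428",  # aquatic biome
    "ENVO:00000447": "ENVO:00002030",  # marine biome
    "ENVO:00000446": "ENVO:00000428",  # terrestrial biome
    "ENVO:01000193": "ENVO:00000446",  # temperate grassland biome
    "ENVO:00010483": "BFO:0000040",    # environmental material
    "ENVO:00001998": "ENVO:00010483",  # soil
    "CHEBI:24431": "BFO:0000040",      # chemical entity
    "EX:0000001": "BFO:0000040",       # agricultural field (placeholder ID)
}


def descendants(root: str) -> set:
    """Return every class that reaches root by following is_a edges (root excluded)."""
    found = set()
    for cls in PARENT:
        node = cls
        while node in PARENT:
            node = PARENT[node]
            if node == root:
                found.add(cls)
                break
    return found


def branch(root: str) -> set:
    """Return root together with all of its descendants."""
    return {root} | descendants(root)


# General queries
env_broad_scale = descendants("ENVO:00000428")
env_medium = descendants("ENVO:00010483")
env_local_scale = (
    descendants("BFO:0000040")
    - branch("ENVO:00000428")   # minus biome
    - branch("ENVO:00010483")   # minus environmental material
    - branch("CHEBI:24431")     # an expected additional subtraction
)

# Soil-specific refinement: drop the whole aquatic biome branch by rule,
# instead of removing individual terms.
soil_env_broad_scale = env_broad_scale - branch("ENVO:00002030")

if __name__ == "__main__":
    print(sorted(soil_env_broad_scale))
    print(sorted(env_local_scale))
```

Note that `descendants` here excludes the root class itself; whether the root should be a permissible value is a separate decision that this sketch does not settle.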
|
||
This ADR will be updated following the evaluation of these initial queries, followed by human review and voting. | ||
* This evaluation was completed by the authors of this PR and the members of the NMDC's "Env Triad Squad" | ||
|
||
## More Information | ||
|
||
* Reference the squad meeting notes. | ||
* https://github.com/microbiomedata/issues/issues/468 | ||
* https://github.com/microbiomedata/issues/issues/840 | ||
* https://github.com/microbiomedata/issues/issues/841 | ||
* https://github.com/microbiomedata/issues/issues/877 | ||
|
||
## Illustration of farm/agricultural field/banana plantation paths | ||
|
||
```shell | ||
poetry run runoak --input sqlite:obo:envo viz --predicates i 'farm' 'agricultural field' 'banana plantation' | ||
``` | ||
|
||
![farm_field_banana_crop_75pct.png](images/farm_field_banana_crop_75pct.png) |