Merge pull request #881 from microbiomedata/ADR-define-envo-value-sets
add ADR for env triad
mslarae13 authored Sep 19, 2024
2 parents 08654fc + b72eda5 commit 64399c8
Showing 2 changed files with 100 additions and 0 deletions.
100 changes: 100 additions & 0 deletions decisions/0015-env-triad-terms.md
@@ -0,0 +1,100 @@
---
status: proposed / in progress
date: 2024-08-28
deciders: Montana Smith, Mark Miller, Sierra Moxon
consulted: Lee Ann McCue, Natalie Winans
informed:
---

# Establishing Reasonable Values for MIxS Environmental Triad Slots

## Context and Problem Statement

Determining what values to populate into each of the environmental triad slots (`env_broad_scale`, `env_local_scale`,
and `env_medium`) is difficult, especially when further broken down into reasonable `env_broad_scale` values for Soil
samples, Water samples, etc. Although it is challenging, it adds value to the NMDC ecosystem by making it easier to find
and group similar Biosamples.

NMDC will establish reproducible logic to provide users with value sets, i.e. curated lists of reasonable terms, for
each slot. The value sets will be implemented as enumerations in the submission-schema, and will be composed in the MIxS
style: the label for a class from a high-quality ontology, followed by the class's CURIE inside square brackets. For
example, one reasonable `env_broad_scale` for a Soil sample _might_ be 'temperate grassland biome [ENVO:01000193]'.
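
The MIxS-style composition described above can be sketched in a few lines of Python. This is only an illustration: the function names and the regular expression are assumptions made for this example, not part of the submission-schema.

```python
import re

# Hypothetical pattern for "label [PREFIX:local_id]"; the real schema may differ.
TRIAD_VALUE = re.compile(
    r"^(?P<label>\S.*?) \[(?P<curie>[A-Za-z][A-Za-z0-9_.]*:\S+)\]$"
)


def compose(label: str, curie: str) -> str:
    """Build a MIxS-style value: the class label, then its CURIE in brackets."""
    return f"{label} [{curie}]"


def parse(value: str):
    """Return (label, curie) if the value has the MIxS shape, else None."""
    match = TRIAD_VALUE.match(value)
    if match is None:
        return None
    return match["label"], match["curie"]


example = compose("temperate grassland biome", "ENVO:01000193")
```

Here `example` holds the string quoted in the paragraph above, and `parse(example)` recovers the label and the CURIE as a pair.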

These value sets are not necessarily intended to be closed indefinitely. An environmental triad slot could be
constrained by either an enumeration or a regular expression pattern. This could be a temporary allowance while NMDC's
logic is being refined. Values that match the regular expression but do not come from the enumeration will need to be
reviewed on a regular basis, as submitters could provide values that contain typos, that have CURIE/label mismatches,
or that are simply not reasonable values for the slot.
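
As a rough sketch of that review step, the following Python sorts submitted values into accepted, needs-review, and rejected groups. The one-term enumeration, the pattern, and the status strings are all hypothetical and chosen only for illustration.

```python
import re

# Illustrative enumeration with a single permissible value (assumption).
ENUMERATION = {"temperate grassland biome [ENVO:01000193]"}

# Labels implied by the enumeration, keyed by CURIE.
LABEL_BY_CURIE = {"ENVO:01000193": "temperate grassland biome"}

# Hypothetical fallback pattern: some label, then a CURIE in square brackets.
PATTERN = re.compile(r"^(?P<label>\S.*) \[(?P<curie>[A-Za-z]+:\d+)\]$")


def triage(value: str) -> str:
    """Classify a submitted value against the enumeration and the pattern."""
    if value in ENUMERATION:
        return "accepted"
    match = PATTERN.match(value)
    if match is None:
        return "rejected"  # does not even have the label [CURIE] shape
    known_label = LABEL_BY_CURIE.get(match["curie"])
    if known_label is not None and known_label != match["label"]:
        return "review: label mismatch"  # e.g. a typo in the label
    return "review: not in enumeration"
```

A value such as 'temperate grasland biome [ENVO:01000193]' passes the pattern but would be flagged for review because its label does not match the label recorded for that CURIE.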

*This ADR is in progress. The initial plan has been outlined here and may change.*

## Decision Outcome

NMDC environment-specific value sets for the environmental triad slots will be generated by processes that emphasize
reusability and generalizability. Specifically, code has been developed to generate tables of candidates for each value
set, with columns of theoretical and empirical supporting evidence.

The theoretical evidence is based
on [guidance provided by the EnvO and MIxS authors](https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS),
along with the structure of the ontologies whose classes are included in the value sets. Furthermore, the technology for
extracting subsets from those ontologies will be queries composed with
the [Ontology Access Kit](https://github.com/INCATools/ontology-access-kit). For example, NMDC may set the general rule
that the values for the `env_broad_scale`, in combination with all environmental sample types, must be subclasses
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428?lang=en).
In that case, the `env_broad_scale` value sets for individual environments/sample types must be subsets of the general
query.

The empirical evidence is based on values that have been used for the environmental triad slots in prominent Biosample
metadata systems, such as NCBI, GOLD, and NMDC. Some of those sources, especially NCBI, are very permissive for the
environmental triad values. Because the candidate value tables include multiple empirical sources plus the results of
the general, rules-based OAK queries, no single source can introduce inappropriate values or exclude reasonable ones.
Initial exploration suggests that the queries will be hard limits for `env_broad_scale` and `env_medium`, but that the
empirical evidence will be more important for `env_local_scale`. General, rules-based queries will be refined for each
of the environments (MIxS Extensions), and will be reflected in either query-specific columns or general boolean
columns, like 'is_biome' or 'is_environmental_material'.
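
A minimal sketch of such a candidate table is shown below, using only the Python standard library. The term memberships for each source are made-up toy data, and the column names other than 'is_biome' are assumptions for this example; the real tables are generated from OAK queries and from the actual NCBI, GOLD, and NMDC records.

```python
def candidate_table(query_results, empirical_sources):
    """Build one row per candidate term with boolean evidence columns.

    query_results: column name -> set of CURIEs returned by a rules-based query
    empirical_sources: source name -> set of CURIEs observed in that source
    """
    all_terms = set().union(*query_results.values(), *empirical_sources.values())
    rows = []
    for curie in sorted(all_terms):
        row = {"curie": curie}
        for column, terms in query_results.items():
            row[column] = curie in terms
        for source, terms in empirical_sources.items():
            row[f"used_in_{source}"] = curie in terms
        row["n_empirical_sources"] = sum(
            curie in terms for terms in empirical_sources.values()
        )
        rows.append(row)
    return rows


# Toy inputs (hypothetical memberships, for illustration only).
queries = {"is_biome": {"ENVO:01000193", "ENVO:00002030"}}
sources = {
    "ncbi": {"ENVO:01000193", "ENVO:00010483"},
    "gold": {"ENVO:01000193"},
    "nmdc": {"ENVO:01000193", "ENVO:00002030"},
}
table = candidate_table(queries, sources)
```

In this toy table, a term that appears in a permissive source but fails the 'is_biome' query stays visible as a row, so it can be excluded by rule rather than by hand.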

- This ADR will be updated as environment-specific queries are created.
- The logic described above is intended to minimize cherry-picking of values for the sets.
- Any filtering should be accomplished as a general query. For example, we will not remove a specific term, but rather identify a rule that accounts for more general needs.
- Expert review will be done on an initial list of accepted terms. Insights from review will be fed back into reusable logic, which might require complex OAK queries over inconsistent annotations, or grouping of values within a semantic embedding space.
- As necessary, we can request that EnvO add classes in support of this work, or that more (and more consistent) axioms
be added to existing classes. This will not be done by the squad contributing this ADR.

## NMDC General Queries for Environmental Triad Value Sets

- `env_broad_scale` will consist of subclasses
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428),
with the exception of host-associated samples, including plant-associated samples.
- `env_local_scale` will consist of subclasses
of [material entity [BFO:0000040]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000040?lang=en),
minus biome [ENVO:00000428]
and [environmental material [ENVO:00010483]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00010483)
- It's expected that additional branches of the is_a hierarchy will be subtracted,
like [chemical entity [CHEBI:24431]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCHEBI_24431)
- `env_medium` will consist of subclasses of environmental material [ENVO:00010483]

For **soil** environment (MIxS Extension)

- `env_broad_scale` will
exclude [aquatic biome [ENVO:00002030]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00002030)
- `env_local_scale` TBC
- `env_medium` TBC
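
The general queries and the soil-specific exclusion above amount to set operations over the is_a hierarchy. The sketch below shows that logic on a toy hierarchy; it does not call OAK. The `EX:` identifiers are placeholders, the direct parent/child edges are simplified (real EnvO paths have intermediate classes), and treating "subclasses" as excluding the root class itself is an assumption.

```python
from collections import defaultdict

# Toy is_a edges, child -> parent. EX: terms are hypothetical placeholders.
IS_A = {
    "ENVO:00000428": "BFO:0000040",    # biome
    "ENVO:00010483": "BFO:0000040",    # environmental material
    "CHEBI:24431": "BFO:0000040",      # chemical entity
    "ENVO:00002030": "ENVO:00000428",  # aquatic biome
    "ENVO:01000193": "ENVO:00000428",  # temperate grassland biome (simplified)
    "EX:0000001": "ENVO:00002030",     # hypothetical aquatic sub-biome
    "EX:0000002": "BFO:0000040",       # hypothetical local feature, e.g. a farm
    "EX:0000003": "ENVO:00010483",     # hypothetical material
    "EX:0000004": "CHEBI:24431",       # hypothetical chemical
}

CHILDREN = defaultdict(set)
for child, parent in IS_A.items():
    CHILDREN[parent].add(child)


def branch(root):
    """Return the root plus all of its is_a descendants."""
    seen, stack = set(), [root]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(CHILDREN[term])
    return seen


def subclasses(root):
    """Return proper subclasses only (assumption: the root is excluded)."""
    return branch(root) - {root}


env_broad_scale = subclasses("ENVO:00000428")
env_medium = subclasses("ENVO:00010483")
env_local_scale = (
    subclasses("BFO:0000040")
    - branch("ENVO:00000428")   # minus biome
    - branch("ENVO:00010483")   # minus environmental material
    - branch("CHEBI:24431")     # minus chemical entity
)
# Soil: the general env_broad_scale set minus the aquatic biome branch.
soil_env_broad_scale = env_broad_scale - branch("ENVO:00002030")
```

Because the soil set is derived by subtracting a whole branch from the general set, it is a subset of the general query by construction, and no individual term is removed by hand.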

This ADR will be updated following the evaluation of these initial queries, followed by human review and voting.
* This evaluation was completed by the authors of this PR and the members of the NMDC's "Env Triad Squad"

## More Information

* Reference the squad meeting notes.
* https://github.com/microbiomedata/issues/issues/468
* https://github.com/microbiomedata/issues/issues/840
* https://github.com/microbiomedata/issues/issues/841
* https://github.com/microbiomedata/issues/issues/877

## Illustration of farm/agricultural field/banana plantation paths

```shell
poetry run runoak --input sqlite:obo:envo viz --predicates i 'farm' 'agricultural field' 'banana plantation'
```

![farm_field_banana_crop_75pct.png](images/farm_field_banana_crop_75pct.png)
Binary file added decisions/images/farm_field_banana_crop_75pct.png