add ADR for env triad #881

Merged
merged 27 commits into from
Sep 19, 2024
d791e06
can't see file in branch
mslarae13 Aug 28, 2024
f4416f8
more details
mslarae13 Aug 28, 2024
32576b4
will use queries plus evidence and voting
turbomam Sep 5, 2024
2822205
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
66abbf6
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
c423a12
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
7a8ffb4
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
92c55ec
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
d0601cc
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
f8eabe4
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
c2132fd
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
7e485d3
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
c6a15ed
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
4439276
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
c27895a
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
b2f5479
Merge pull request #886 from microbiomedata/queries-plus-evidence
mslarae13 Sep 18, 2024
5888bfc
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
ccad565
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
8cdcdfb
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
53fd85e
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
cdcad12
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
3ff5f9b
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
9844a41
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
3f425b3
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 18, 2024
f7c4461
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 19, 2024
d5f7f89
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 19, 2024
b72eda5
Update decisions/0015-env-triad-terms.md
mslarae13 Sep 19, 2024
108 changes: 108 additions & 0 deletions decisions/0015-env-triad-terms.md
@@ -0,0 +1,108 @@
---
status: proposed / in progress
date: 2024-08-28
deciders: Montana Smith, Mark Miller, Sierra Moxon
consulted: Lee Ann McCue, Natalie Winans
informed:
---

# Establishing Reasonable Values for MIxS Environmental Triad Slots

## Context and Problem Statement

Determining what values to populate into each of the environmental triad slots (`env_broad_scale`, `env_local_scale`,
and `env_medium`) is difficult, especially when further broken down into reasonable `env_broad_scale` values for Soil
samples, Water samples, etc. Although challenging, doing so adds value to the NMDC ecosystem in terms of finding and
grouping similar Biosamples.

NMDC will establish reproducible logic to provide users with value sets, i.e. curated lists of reasonable terms, for
each slot. The value sets will be implemented as enumerations in the submission-schema, and will be composed in the MIxS
style: the label for a class from a high-quality ontology, followed by the class's CURIE inside square brackets. For
example, one reasonable `env_broad_scale` for a Soil sample _might_ be 'temperate grassland biome [ENVO:01000193]'.
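
As a rough illustration of the MIxS-style composition described above, the following sketch parses and composes `label [CURIE]` strings. The regular expression and the function names are assumptions made for this example; the pattern actually used in the submission-schema may differ.

```python
import re

# Assumed pattern for the "label [PREFIX:LOCAL_ID]" style; not taken from the schema.
TRIAD_VALUE = re.compile(r"^(?P<label>\S.*?) \[(?P<curie>[A-Za-z]+:\d+)\]$")


def parse_triad_value(value: str):
    """Split a 'label [CURIE]' string into (label, curie), or return None."""
    match = TRIAD_VALUE.match(value.strip())
    if match is None:
        return None
    return match.group("label"), match.group("curie")


def format_triad_value(label: str, curie: str) -> str:
    """Compose a label and a CURIE into the bracketed MIxS-style form."""
    return f"{label} [{curie}]"


parsed = parse_triad_value("temperate grassland biome [ENVO:01000193]")
```

Here `parsed` holds the label and the CURIE as separate strings, which makes it easier to check each half against an ontology.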

These value sets are not necessarily intended to be closed indefinitely. An environmental triad slot could be
constrained by an enumeration or by a regular expression pattern. Accepting pattern-only values could be a temporary
allowance while NMDC's logic is being refined. Values that match the regular expression but do not come from the
enumeration will need to be reviewed on a regular basis, as submitters could provide values that have typos, that have
CURIE/label mismatches, or that are simply not reasonable values for the slot.
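
One way to picture the review process above is a small triage step that sorts submitted values into accepted, needs-review, and rejected. This is only a sketch: the pattern, the status names, and the `triage` function are hypothetical and not part of any NMDC code.

```python
import re

# Assumed, permissive pattern: some label text, then a bracketed CURIE.
PATTERN = re.compile(r"\S.* \[[A-Za-z]+:\d+\]")


def triage(value: str, enumeration: set) -> str:
    """Classify a submitted value against the enumeration and the pattern."""
    if value in enumeration:
        return "accepted"  # exact member of the curated value set
    if PATTERN.fullmatch(value):
        return "needs_review"  # well-formed, but not in the value set
    return "rejected"  # does not even match the pattern


soil_broad_scale = {"temperate grassland biome [ENVO:01000193]"}
status = triage("temperate grasland biome [ENVO:01000193]", soil_broad_scale)
```

The misspelled label in the usage line matches the pattern but not the enumeration, so it would be queued for review rather than silently accepted.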

*This ADR is in progress. The initial plan has been outlined here and may change.*

## Decision Outcome

NMDC environment-specific value sets for the environmental triad slots will be generated by processes that emphasize
reusability and generalizability. Specifically, code has been developed to generate tables of candidates for each value
set, with columns of theoretical and empirical supporting evidence.

The theoretical evidence is based
on [guidance provided by the EnvO and MIxS authors](https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS),
along with the structure of the ontologies whose classes are included in the value sets. Furthermore, the technology for
extracting subsets from those ontologies will be queries composed with
the [Ontology Access Kit](https://github.com/INCATools/ontology-access-kit). For example, NMDC may set the general rule
that the values for the `env_broad_scale`, in combination with all environmental sample types, must be subclasses
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428?lang=en).
In that case, the `env_broad_scale` value sets for individual environments/sample types must be subsets of the results
of the general query.
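
The subset rule can be sketched without OAK by walking a small in-memory is_a graph. The edges below are a toy, simplified hierarchy written for this example (real EnvO paths may pass through intermediate classes), and `descendants` is a hypothetical helper, not an OAK function.

```python
from collections import defaultdict

# Toy (child, parent) is_a edges; simplified, not the real EnvO structure.
IS_A = [
    ("ENVO:01000193", "ENVO:00000428"),  # temperate grassland biome -> biome
    ("ENVO:00002030", "ENVO:00000428"),  # aquatic biome -> biome
    ("ENVO:00000078", "BFO:0000040"),    # farm -> material entity
]


def descendants(root, edges):
    """Return every class reachable from root by following is_a edges downward."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


general_broad_scale = descendants("ENVO:00000428", IS_A)
soil_broad_scale = {"ENVO:01000193"}
is_valid_subset = soil_broad_scale <= general_broad_scale
```

Any environment-specific set that contains a class outside the general result, such as farm in this toy graph, would fail the subset check.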

The empirical evidence is based on values that have been used for the environmental triad slots in prominent Biosample
metadata systems, such as NCBI, GOLD, and NMDC. Some of those sources, especially NCBI, are very permissive for the
environmental triad values. Because the candidate value tables include multiple empirical sources plus the results of
the general, rules-based OAK queries, no single source can introduce inappropriate values or exclude reasonable ones.
Initial exploration suggests that the queries will be hard limits for `env_broad_scale` and `env_medium`, but that the
empirical evidence will be more important for `env_local_scale`. General, rules-based queries will be refined for each
of the environments (MIxS Extensions), and will be reflected in either query-specific columns or general boolean
columns, like 'is_biome' or 'is_environmental_material'.
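
A candidate table of this kind might be assembled roughly as follows. The usage lists are made-up placeholder data, and the column names other than `is_biome` are assumptions for the sake of the sketch.

```python
from collections import Counter


def build_candidate_table(usage_by_source, biome_curies):
    """Build one row per CURIE with a usage count per source and an is_biome flag."""
    counts = {source: Counter(values) for source, values in usage_by_source.items()}
    all_curies = set(biome_curies)
    for counter in counts.values():
        all_curies.update(counter)
    rows = []
    for curie in sorted(all_curies):
        row = {"curie": curie, "is_biome": curie in biome_curies}
        for source, counter in counts.items():
            row[f"{source}_count"] = counter[curie]  # 0 when the source never used it
        rows.append(row)
    return rows


# Placeholder usage data; not real counts from NCBI or GOLD.
usage = {
    "ncbi": ["ENVO:01000193", "ENVO:01000193", "ENVO:00000078"],
    "gold": ["ENVO:01000193"],
}
biomes = {"ENVO:01000193", "ENVO:00002030"}  # stand-in for a rules-based query result
rows = build_candidate_table(usage, biomes)
```

Each row then carries both kinds of evidence side by side, so a reviewer can see, for instance, a term that is used empirically but fails the `is_biome` rule.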

- This ADR will be updated as environment-specific queries are created.
- The logic described above is intended to minimize cherry-picking of values for the sets.
- Any filtering should be accomplished as a general query. For example, we will not remove a specific term, but rather identify a rule that accounts for more general needs.
- Expert review will be done on an initial list of accepted terms. Insights from review will be fed back into reusable logic, which might require complex OAK queries over inconsistent annotations, or grouping of values within a semantic embedding space. For example, there may be evidence that [farm [ENVO:00000078]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000078),
[agricultural field [ENVO:00000114]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000114)
and [banana plantation [ENVO:00000161]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000161)
- As necessary, we can request that EnvO add classes in support of this work, or that more (and more consistent) axioms
are added to existing classes. This will not be done by the squad contributing this ADR.

## NMDC General Queries for Environmental Triad Value Sets

- `env_broad_scale` will consist of subclasses
of [biome [ENVO:00000428]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00000428),
with the exception of host-associated samples, including plant-associated samples.
- `env_local_scale` will consist of subclasses
of [material entity [BFO:0000040]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000040?lang=en),
minus biome [ENVO:00000428]
and [environmental material [ENVO:00010483]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00010483)
- It's expected that additional branches of the is_a hierarchy will be subtracted,
like [chemical entity [CHEBI:24431]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCHEBI_24431)


- `env_medium` will consist of subclasses of environmental material [ENVO:00010483]
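
The "minus" logic for `env_local_scale` amounts to subtracting whole branches from a larger one. The sketch below uses a toy, flattened hierarchy invented for illustration (real ontology paths are deeper), and `subtract_branches` is a hypothetical helper rather than an OAK call.

```python
from collections import defaultdict

# Toy (child, parent) is_a edges; flattened for illustration only.
IS_A = [
    ("ENVO:00000428", "BFO:0000040"),    # biome -> material entity
    ("ENVO:01000193", "ENVO:00000428"),  # temperate grassland biome -> biome
    ("ENVO:00010483", "BFO:0000040"),    # environmental material -> material entity
    ("ENVO:2000045", "ENVO:00010483"),   # hydrocarbon-based env. material -> env. material
    ("ENVO:00000078", "BFO:0000040"),    # farm -> material entity
    ("CHEBI:24431", "BFO:0000040"),      # chemical entity -> material entity
]


def descendants(root, edges):
    """Return all classes below root in the is_a graph (root itself excluded)."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    found, stack = set(), [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def subtract_branches(root, excluded_roots, edges):
    """Descendants of root, minus each excluded root and everything beneath it."""
    kept = descendants(root, edges)
    for excluded in excluded_roots:
        kept -= {excluded} | descendants(excluded, edges)
    return kept


local_scale = subtract_branches("BFO:0000040", ["ENVO:00000428", "ENVO:00010483"], IS_A)
```

Adding `CHEBI:24431` to the excluded roots would remove the chemical entity branch as well, which is how further subtractions could be layered on as general rules.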

For **soil** environment (MIxS Extension)

- `env_broad_scale` will
exclude [aquatic biome [ENVO:00002030]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_00002030)
- `env_local_scale` ????? TBC
- `env_medium`
example [hydrocarbon-based environmental material [ENVO:2000045]](https://www.ebi.ac.uk/ols4/ontologies/envo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FENVO_2000045?lang=en)
- Textual filtering over labels, with or without regular expressions or stemming, _could_ also be used
- food, water and ice terms could be removed without stemming
- stemming with the pattern 'ferment' could be used to remove terms with fermented or fermenting in their labels
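
The textual filtering idea could look something like the sketch below, which treats "stemming" as simple prefix matching on `ferment`. The example labels are invented, and the two patterns are assumptions; a real implementation might use a proper stemmer instead.

```python
import re

# Whole-word matches, so 'ice' does not remove labels containing 'rice'.
WHOLE_WORDS = re.compile(r"\b(food|water|ice)\b", re.IGNORECASE)
# Prefix match standing in for stemming: fermented, fermenting, fermentation, ...
STEMS = re.compile(r"\bferment\w*", re.IGNORECASE)


def keep_label(label: str) -> bool:
    """Return True when the label survives both textual filters."""
    return not (WHOLE_WORDS.search(label) or STEMS.search(label))


# Invented labels, used only to exercise the filters.
labels = ["rice field soil", "sea ice", "fermenting plant material", "Fermented dairy"]
kept = [label for label in labels if keep_label(label)]
```

Note that whole-word matching would not catch a compound such as "seawater", so the choice between word-level and substring matching would itself need to be stated as a general rule.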

This ADR will be updated following the evaluation of these initial queries, followed by human review and voting.
* This evaluation was completed by the authors of this PR and the members of the NMDC's "Env Triad Squad"

## More Information

* Reference the squad meeting notes.
* https://github.com/microbiomedata/issues/issues/468
* https://github.com/microbiomedata/issues/issues/840
* https://github.com/microbiomedata/issues/issues/841
* https://github.com/microbiomedata/issues/issues/877

## Illustration of farm/agricultural field/banana plantation paths

```shell
poetry run runoak --input sqlite:obo:envo viz --predicates i 'farm' 'agricultural field' 'banana plantation'
```

![farm_field_banana_crop_75pct.png](images/farm_field_banana_crop_75pct.png)
Binary file added decisions/images/farm_field_banana_crop_75pct.png