397 add gold metadata to neon soil samples (#303)
* initial checkin with base class and basic tests

- Base ChangeSheet write class
- unit tests for base class

* add conftest and gold changesheet tests

- move test fixtures to conftest.py
- add get_biosample_name function and unit test to GoldBiosample generator

* update biosample name unit test

add explicit expected values

* Sketch out functions for gold changesheet generator

* function and test for missing GOLD ecosystem metadata

* add function and test for missing gold_biosample_identifiers

* add get_normalized_gold_biosample_identifier

* update logic with omics processing step

* skeleton find_omics_processing_set function, and updated (correct this time) test data files

* Add Omics to Biosample map

- add omics_to_biosample map input
- added nmdc / gold BioSample comparison logic
- unit tests
- stub API dependent methods

* Add changesheets.py package for common functions and classes

- Changesheet and ChangesheetLineItem classes
- API @op functions

* refactor to split omics processing data file read into stand-alone function

* more refactoring and code cleanup

* add test generation job

* add resource definitions and config

* refactor and code cleanup

Simplify to just Changesheet and ChangesheetLineItem classes

* Cleanup this branch to focus on getting assets working

* fix defs and fetch statement

* get basic GOLD asset generation working

* Add Api resources as ConfigurableResources

* Add asset scaffolding

* update normalizer functions to all take and return strings

* update resources add empty click script

* fix gold ID normalization and add unit tests

* implement compare biosamples and write_changesheet

* add omics record comparison

* Add validate_changesheet method

* cleanup unused data files

* fix validate_changesheet method and add logging

* delete dagster asset based code and tests - move to a demo branch

* add changesheet_output to .gitignore

* remove Dagster-related code and settings

* style: format with black

* Use TypeAlias for JSON_OBJECT

* Removed hard-coded URL from Changesheet.validate()

* remove .tsv file - should be ignored

* clarify function name and blacken formatting

* fix click options help text and blacken

* yet more blackening

* uncomment wait-for-it

* Delete get_data.ipynb

* Revert "Delete get_data.ipynb"

This reverts commit fe3e68a.

* add docstring for generate_changesheet

* automatic reformatting

* bring get_data notebook back to original state

* add some logging

* update to use gold_sequencing_identifiers over alternative_identifiers

* Delete neon_cache.sqlite

* strip and de-tab the value in tsv output

* set default line_items in changesheet class correctly

* update output_dir type hint

* remove apply_changes option

* Dry up unfindable logging

* Clean up gold normalization and documentation

* fix: style

---------

Co-authored-by: Donny Winston <[email protected]>
mbthornton-lbl and dwinston committed Nov 3, 2023
1 parent f1820af commit 295cba1
Showing 15 changed files with 2,074 additions and 5 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -160,4 +160,7 @@ repl.ipynb

 tests/nmdcdb/

-neon_cache.sqlite
+neon_cache.sqlite
+
+# output of changesheet generation
+nmdc_runtime/site/changesheets/changesheets_output/
1 change: 1 addition & 0 deletions docker-compose.test.yml
@@ -36,6 +36,7 @@ services:
       DAGSTER_POSTGRES_DB: "postgres_db"
     depends_on:
       - dagster-postgresql
+    restart: on-failure
     volumes:
       - ./:/opt/dagster/lib

1 change: 1 addition & 0 deletions docker-compose.yml
@@ -36,6 +36,7 @@ services:
       DAGSTER_POSTGRES_DB: "postgres_db"
     depends_on:
       - dagster-postgresql
+    restart: on-failure
     volumes:
       - ./:/opt/dagster/lib

3 changes: 1 addition & 2 deletions docs/nb/get_data.ipynb
@@ -640,8 +640,7 @@
 "32,590.4511390833583,590.450730484585,,33051017.328779608,174529.72008540362,95.62528562170476,-1,,,,,,,,unassigned,,,,,,,,,\n",
 "33,574.4557295997837,574.4553249571502,,57270445.86539885,179389.47009937343,165.6984621406013,-1,,,,,,,,unassigned,,,,,,,,,\n",
 "34,509.29372586837684,509.2933397523243,,3016102.9919072534,115624.41149005273,8.726379197243881,-1,,,,,,,,unassigned,,,,,,,,,\n",
-"36,311.1007284631918,311.10042248349583,,2602658.391103224,220831.98006834995,7.530175230287296,-1,,,,,,,,unassigned,,,,,,,,,\n",
-"\n"
+"36,311.1007284631918,311.10042248349583,,2602658.391103224,220831.98006834995,7.530175230287296,-1,,,,,,,,unassigned,,,,,,,,,\n"
 ]
 }
 ],
Empty file added nmdc_runtime/__init__.py
Empty file.
85 changes: 85 additions & 0 deletions nmdc_runtime/site/changesheets/base.py
@@ -0,0 +1,85 @@
# nmdc_runtime/site/changesheets/base.py
"""
base.py: Provides data classes for creating changesheets for NMDC database objects.
"""

import logging
import time
from dataclasses import dataclass, field
from pathlib import Path
import requests
from typing import Any, ClassVar, Dict, TypeAlias, Optional

from nmdc_runtime.site.resources import RuntimeApiUserClient

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s"
)

JSON_OBJECT: TypeAlias = Dict[str, Any]
CHANGESHEETS_DIR = Path(__file__).parent.absolute().joinpath("changesheets_output")


@dataclass
class ChangesheetLineItem:
    """
    A line item in a changesheet
    """

    id: str
    action: str
    attribute: str
    value: str

    @property
    def line(self) -> str:
        """
        Return the line item as a tab-separated string
        """
        # Newlines and tabs inside the value would break the TSV layout
        cleaned_value = self.value.replace("\n", " ").replace("\t", " ").strip()
        return f"{self.id}\t{self.action}\t{self.attribute}\t{cleaned_value}"


@dataclass
class Changesheet:
    """
    A changesheet
    """

    name: str
    # default_factory gives each instance its own list
    line_items: list = field(default_factory=list)
    header: ClassVar[str] = "id\taction\tattribute\tvalue"
    output_dir: Optional[Path] = None

    def __post_init__(self):
        if self.output_dir is None:
            self.output_dir = CHANGESHEETS_DIR
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.output_filename_root: str = f"{self.name}-{time.strftime('%Y%m%d-%H%M%S')}"
        self.output_filename: str = f"{self.output_filename_root}.tsv"
        self.output_filepath: Path = self.output_dir.joinpath(self.output_filename)

    def validate_changesheet(self, base_url: str) -> bool:
        """
        Validate the changesheet
        :param base_url: base URL of the API that serves the validation endpoint
        :return: True if the response status is OK, False otherwise
        """
        logging.info(f"Validating changesheet {self.output_filepath}")
        url = f"{base_url}/metadata/changesheets:validate"
        # Use a context manager so the file handle is closed after the upload
        with open(self.output_filepath, "rb") as f:
            resp = requests.post(url, files={"uploaded_file": f})
        return resp.ok

    def write_changesheet(self) -> None:
        """
        Write the changesheet to a file
        :return: None
        """
        with open(self.output_filepath, "w") as f:
            logging.info(f"Writing changesheet to {self.output_filepath}")
            f.write(self.header + "\n")
            for line_item in self.line_items:
                f.write(line_item.line + "\n")
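
The TSV layout that base.py produces can be illustrated with a small standalone sketch. It does not import nmdc_runtime. `LineItem`, `write_tsv`, and `demo` are hypothetical stand-ins that mirror the cleaning and writing logic in the diff above, and the sample id, action, and attribute values are made up for illustration only.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Same four-column header used by the Changesheet class in the diff
HEADER = "id\taction\tattribute\tvalue"


@dataclass
class LineItem:
    """Hypothetical stand-in for ChangesheetLineItem."""

    id: str
    action: str
    attribute: str
    value: str

    @property
    def line(self) -> str:
        # Replace newlines and tabs with spaces, then trim, so the value
        # cannot introduce extra rows or columns in the TSV output
        cleaned = self.value.replace("\n", " ").replace("\t", " ").strip()
        return f"{self.id}\t{self.action}\t{self.attribute}\t{cleaned}"


def write_tsv(path: Path, items: List[LineItem]) -> None:
    """Write the header followed by one line per item."""
    with open(path, "w") as f:
        f.write(HEADER + "\n")
        for item in items:
            f.write(item.line + "\n")


def demo() -> List[str]:
    """Write one made-up line item to a temporary file and read it back."""
    items = [LineItem("nmdc:bsm-00-000001", "update", "name", " Soil\tsample\n")]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.tsv"
        write_tsv(path, items)
        return path.read_text().splitlines()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The embedded tab and trailing newline in the sample value are collapsed to single spaces and trimmed, so the file keeps exactly four columns per row.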
1,561 changes: 1,561 additions & 0 deletions nmdc_runtime/site/changesheets/data/OmicsProcessing-to-catted-Biosamples.tsv

Large diffs are not rendered by default.
