- maintains the lifecycle and metadata—how data came to be, how it changed, how it’s been used—of each dataset and derived dataset hosted in the Station.
- serves the discovery and integration engines; it is also used by data users to describe trust constraints on capsules, and by contributors and stewards to implement access and governance policies.
DAC implements a common mental model consisting of profiles and relationships.
Profiles are descriptions of the data, such as statistical distributions of values, sketches, and temporal information.
Profiles are automatically computed by the Metadata Engine when datasets are submitted to the Station, or elicited from humans using incentives.
Profiles include (see the sketch after this list):
- what-profile to describe the actual dataset,
- how-profile to indicate what program produced the current dataset version,
- who-profile to indicate who produced and who uses the dataset,
- where-profile to indicate how the dataset can be accessed,
- why-profile to explain the purpose of the dataset,
- when-profile to explain when the dataset was modified and when it is valid.
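As an illustration, the six profile types could be grouped into a single record per dataset. The following is a minimal sketch; the class and field names are hypothetical, not the DAC's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical record grouping the six profile types for one dataset;
# payloads are kept as free-form dicts, mirroring the semi-structured
# design discussed below.
@dataclass
class DatasetProfile:
    what: dict[str, Any] = field(default_factory=dict)   # value distributions, sketches
    how: dict[str, Any] = field(default_factory=dict)    # program that produced this version
    who: dict[str, Any] = field(default_factory=dict)    # producers and users
    where: dict[str, Any] = field(default_factory=dict)  # how the dataset can be accessed
    why: dict[str, Any] = field(default_factory=dict)    # purpose of the dataset
    when: dict[str, Any] = field(default_factory=dict)   # modification time, validity window
```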
Relationships are built out of profiles: for example, provenance is built from who- and how-profiles, and syntactic relationships such as join and similarity graphs are built from what-profiles.
Both profiles and relationships are used by the discovery and integration modules.
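To make the derivation concrete, the sketch below shows how a join/similarity relationship could be computed from two what-profiles. The `column_sketches` field and the plain-set overlap test are illustrative simplifications; a real engine would use compact sketches such as MinHash.

```python
# Derive candidate join/similarity edges between two datasets by comparing
# the column sketches stored in their what-profiles (Jaccard overlap).
def similar_columns(what_a: dict, what_b: dict, threshold: float = 0.5):
    pairs = []
    for col_a, sketch_a in what_a.get("column_sketches", {}).items():
        for col_b, sketch_b in what_b.get("column_sketches", {}).items():
            a, b = set(sketch_a), set(sketch_b)
            overlap = len(a & b) / len(a | b) if a | b else 0.0
            if overlap >= threshold:
                pairs.append((col_a, col_b, overlap))
    return pairs
```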
DAC’s logical schema design strives for a balance between
- structuredness, which facilitates querying, and
- flexibility, which facilitates including new data.
At its core, it reflects the mental model introduced above, which allows different parties to understand and query it effectively.
To increase flexibility, it supports semi-structured data, such as JSON, to reflect the idiosyncrasies of different data formats: e.g., describing an image is different from describing a relation or an ML model.
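One way to realize this balance is a relational core with a JSON payload column: the fixed columns support querying, while the payload absorbs format-specific detail. The sketch below assumes this design; the table and field names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Structured part: fixed columns shared by every profile, easy to query.
# Flexible part: a JSON payload whose shape differs by data format.
conn.execute("""
    CREATE TABLE profiles (
        dataset_id   TEXT,
        profile_type TEXT,  -- what / how / who / where / why / when
        payload      TEXT   -- semi-structured, format-specific JSON
    )
""")
# Describing a relation vs. an image: same table, different payloads.
conn.execute("INSERT INTO profiles VALUES (?, ?, ?)",
             ("sales.csv", "what", json.dumps({"columns": ["id", "amount"], "rows": 10000})))
conn.execute("INSERT INTO profiles VALUES (?, ?, ?)",
             ("cat.png", "what", json.dumps({"width": 640, "height": 480, "channels": 3})))
print(conn.execute("SELECT dataset_id FROM profiles WHERE profile_type = 'what'").fetchall())
```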
The Data Station triggers the execution of the Metadata Engine whenever a new dataset is received, an existing dataset is updated, or the Station produces a derived data product from existing ones.
The Metadata Engine analyzes the dataset and extracts as many profiles as it can automatically. This is done by orchestrating analyzers that specialize in different profiles, and by eliciting the information directly from data users and contributors when it cannot be obtained otherwise (see Section 4). As a consequence, the full lifecycle of resident datasets and derived data products is known, because all operations on the data happen within the realm of the Station.
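A sketch of that orchestration logic follows; it assumes each analyzer is a callable returning the profiles it can compute, with anything left over flagged for human elicitation (the interface is an illustration, not the engine's actual API).

```python
PROFILE_TYPES = {"what", "how", "who", "where", "why", "when"}

def profile_dataset(dataset, analyzers):
    # Run each specialized analyzer; each returns a dict such as {"what": {...}}.
    profiles = {}
    for analyzer in analyzers:
        profiles.update(analyzer(dataset))
    # Whatever could not be computed automatically is elicited from
    # data users and contributors.
    missing = PROFILE_TYPES - profiles.keys()
    return profiles, missing
```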
A catalog service facilitates loading and querying, and stores and enforces access and governance policies. The service maintains schemas of the semi-structured data as well as indexes that permit the discovery and blending engines to find the information they need, and data users to specify trust constraints to guarantee the origin and nature of the results they request.
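One way to picture a trust constraint at query time is as a predicate over a dataset's profiles, as in the sketch below; this predicate form is an assumption for illustration.

```python
# A data user only accepts results derived from datasets whose
# who-profile names a trusted producer (hypothetical constraint).
def trusted_producer(profiles, allowed=frozenset({"alice@lab.example"})):
    return profiles.get("who", {}).get("producer") in allowed

def discover(catalog, constraint):
    # catalog maps dataset_id -> profiles; only datasets satisfying the
    # constraint are visible to the discovery engine for this query.
    return [ds for ds, profiles in catalog.items() if constraint(profiles)]
```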
Setting up the environment is normally done via GNU `make`. Start by creating an environment with conda (assuming it is already installed):

```
make create-environment
```
The base environment is declared in `environment.yml`. pip-managed requirements are handled in `requirements*.txt` files; these files are read both by `environment.yml` and by the `setup.py` script.
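For illustration, a common idiom for having `setup.py` read `requirements*.txt` looks like the following; this is a sketch of the general pattern, not necessarily this repository's exact script.

```python
# setup.py sketch: the same requirements files drive both conda
# (referenced from environment.yml) and pip installs.
from pathlib import Path
from setuptools import find_packages, setup

def read_requirements(path):
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]

setup(
    name="datastation",  # hypothetical package name
    packages=find_packages(),
    install_requires=read_requirements("requirements.txt"),
)
```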
Subsequent changes to the environment can be tracked and managed via the corresponding `make` targets:

```
make update-environment
make list-outdated-packages
```