[FR] Samples which extend samples from other datasets #5275
Comments
Hi @jnewb1 👋 The motivation behind this feature request definitely makes sense. You have a single source of truth for certain data and you want this to be available (and shared) on multiple other datasets without having to manually keep the downstream datasets in sync with the source dataset. Suppose Some design questions:
Adding fields to a dataset also carries some interface implications that must be satisfied in order for the dataset to work with the rest of the FiftyOne data model:
Some observations:
Implication 4 is the concerning one. The natural way to achieve this would be to prepend a
TLDR: this will be a complex feature to properly implement 🤓
Hi, thanks for the comment :) I agree it will be somewhat complex. I made a basic PR over here which passes some simple tests and uses a reference field. I'd like to do more benchmarking here to see what impacts the
Here's a functional version of read-only reference fields that are implemented by copying the values rather than dynamically looking them up.
The benefit of this approach is that it's minimally complex to implement and will be as fast as possible to use. Of course, the downside is that you have to manually check + sync the reference field if the source dataset is updated.

```python
from datetime import datetime

import fiftyone as fo


def create_reference_field(dataset, src_dataset, ref_field):
    """Adds a read-only `ref_field` to `dataset` whose values are sourced from
    the `ref_field` of `src_dataset`.
    """
    # Map each source filepath to its value, then copy by matching filepaths
    values = dict(zip(*src_dataset.values(["filepath", ref_field])))
    dataset.set_values(ref_field, values, key_field="filepath")

    # Mark the field read-only and record where/when it was copied from
    field = dataset.get_field(ref_field)
    field.read_only = True
    field.info = {
        "source_dataset": src_dataset.name,
        "last_modified_at": datetime.utcnow(),
    }
    field.save()


def list_reference_fields(dataset):
    """Lists the reference fields on the given dataset."""
    return [
        path
        for path, field in dataset.get_field_schema().items()
        if "source_dataset" in (field.info or {})
    ]


def check_reference_field(dataset, ref_field):
    """Returns True/False whether the reference field needs updating."""
    field = dataset.get_field(ref_field)
    src_dataset = fo.load_dataset(field.info["source_dataset"])
    return field.info["last_modified_at"] < src_dataset.max("last_modified_at")


def update_reference_field(dataset, ref_field):
    """Updates the reference field on the dataset with the current values from
    the source dataset.
    """
    # Temporarily make the field writable
    field = dataset.get_field(ref_field)
    field.read_only = False
    field.save()

    try:
        field = dataset.get_field(ref_field)
        src_dataset = fo.load_dataset(field.info["source_dataset"])
        values = dict(zip(*src_dataset.values(["filepath", ref_field])))
        dataset.set_values(ref_field, values, key_field="filepath")
        field.info["last_modified_at"] = datetime.utcnow()
    finally:
        # Always restore read-only status, even if the sync fails
        field.read_only = True
        field.save()
```

Example usage:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset1 = foz.load_zoo_dataset("quickstart")
dataset2 = dataset1.select_fields().clone()

# Add a `ground_truth` reference field to `dataset2` linked to `dataset1`
create_reference_field(dataset2, dataset1, "ground_truth")

assert len(list_reference_fields(dataset1)) == 0
assert len(list_reference_fields(dataset2)) == 1
assert len(dataset2.count_values("ground_truth.detections.label")) > 1
assert not check_reference_field(dataset2, "ground_truth")

# Delete some labels
del_view = dataset1.filter_labels("ground_truth", F("label") != "person")
dataset1.delete_labels(fields="ground_truth", view=del_view)

assert check_reference_field(dataset2, "ground_truth")

# Sync the reference field
update_reference_field(dataset2, "ground_truth")

assert len(dataset2.count_values("ground_truth.detections.label")) == 1
assert not check_reference_field(dataset2, "ground_truth")
```
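For readers without a FiftyOne install handy, the copy-then-check-staleness pattern used above can be modeled in plain Python. This is only a rough sketch: dicts stand in for samples, and `copy_field_by_key` and `is_stale` are hypothetical names that are not part of FiftyOne.

```python
from datetime import datetime, timezone


def copy_field_by_key(src_samples, dst_samples, field, key="filepath"):
    """Copies `field` from source samples to destination samples with the same `key`."""
    values = {s[key]: s.get(field) for s in src_samples}
    for sample in dst_samples:
        if sample[key] in values:
            sample[field] = values[sample[key]]


def is_stale(synced_at, src_samples):
    """Returns True if any source sample was modified after the last sync."""
    return synced_at < max(s["last_modified_at"] for s in src_samples)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 3, tzinfo=timezone.utc)

# Hypothetical source and destination "datasets"
src = [
    {"filepath": "a.jpg", "label": "cat", "last_modified_at": t0},
    {"filepath": "b.jpg", "label": "dog", "last_modified_at": t0},
]
dst = [{"filepath": "a.jpg"}]

# Initial copy, recorded as happening at t1
copy_field_by_key(src, dst, "label")
synced_at = t1
fresh_after_copy = not is_stale(synced_at, src)

# Editing the source at t2 makes the copied field stale until the next sync
src[0]["label"] = "person"
src[0]["last_modified_at"] = t2
stale_after_edit = is_stale(synced_at, src)
```

The point of the model is the tradeoff described above: reads on `dst` never touch `src`, but nothing updates `dst` until a sync is run explicitly.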
Proposal Summary
I would like to be able to have samples which extend other samples in another dataset. For instance, we have images which have different types of annotations or labels. Ideally, our base dataset contains all the raw images, and every other dataset simply references these entries. This would allow us to modify our base dataset and have these changes reflected in all of the other annotated datasets. Currently, we have to copy every sample to every dataset, including metadata, embeddings, etc., which means our database gets quite large.
What areas of FiftyOne does this feature affect?
`fiftyone` Python library
Details
I think there should be a sample type called `fo.SampleReference`, where you pass an existing sample in as well as any additional primitives / labels that are specific to this new sample. When you query the dataset, you get a sample with both the base labels as well as any labels in the `SampleReference`. Any fields from the base sample should be read only.
For instance:
Dataset 1: Base - contains images, `ImageMetadata`
Dataset 2: Labels - contains a reference to a sample in Base, as well as detection labels from a model (say yolo)
Dataset 3: Labels2 - contains a reference to a sample in Base, as well as detection labels for a different model
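To make the intended semantics concrete, here is a minimal plain-Python sketch of how such a reference could behave. Nothing here is FiftyOne API: the `SampleReference` class below is hypothetical, dicts stand in for samples, and the field names are made up for illustration.

```python
from types import MappingProxyType


class SampleReference:
    """Hypothetical sketch of a sample that extends a base sample.

    Base fields are read live from the base sample and cannot be written
    through the reference; extra fields are stored on the reference itself.
    """

    def __init__(self, base, **fields):
        for name in fields:
            if name in base:
                raise ValueError(f"Field '{name}' already exists on the base sample")
        # Read-only live view of the base sample's fields
        self._base = MappingProxyType(base)
        # Fields owned by this reference (e.g. model predictions)
        self._own = dict(fields)

    def __getitem__(self, name):
        # The base sample is the source of truth for its own fields
        if name in self._base:
            return self._base[name]
        return self._own[name]

    def __setitem__(self, name, value):
        if name in self._base:
            raise ValueError(f"Field '{name}' is read-only (owned by the base sample)")
        self._own[name] = value


# Hypothetical base sample and two references with different model outputs
base = {"filepath": "/data/img1.jpg", "metadata": {"width": 640, "height": 480}}
labels = SampleReference(base, yolo=["person", "dog"])
labels2 = SampleReference(base, other_model=["person"])

# A field added to the base sample is visible through every reference
base["clip_vit_base32"] = [0.1, 0.2]
```

Because the references look fields up on the base sample instead of copying them, both `labels` and `labels2` see the new embeddings immediately, while an attempt such as `labels["filepath"] = "x"` raises a `ValueError`.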
Now two parties can independently work on and improve the two datasets while using the same base dataset. If you add `clip_vit_base32` embeddings to the base dataset, both datasets can use them for filtering, searching, etc. If your Labels2 dataset is quite small (maybe 1/10th the size of Labels), you get much better performance for queries to this dataset than if you had combined everything into a single dataset.
Willingness to contribute
The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?