The Snorkel Zoo is a collection of utilities for writing labeling functions, transformation functions, and slicing functions, the operators used in the core Snorkel library. We’ve demonstrated the efficacy of templates, or declarative operators, across a range of use cases in prior work (Ratner et al. 2019). In this repository, we aim to provide a shared resource of builders, generators, and primitives that are effective in both research and production contexts. More importantly, we’re excited to crowdsource ideas from the community!
The repository is divided into subfolders for builders, generators, and primitives.
For a single problem, it’s helpful to have a shared interface for building a specific type of labeling function. For instance, the Intro Tutorial features a number of keyword labeling functions built from a shared template:
def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )


"""Spam comments talk about 'my channel', 'my video', etc."""
keyword_my = make_keyword_lf(keywords=["my"])

"""Spam comments ask users to subscribe to their channels."""
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])

"""Spam comments post links to other channels."""
keyword_link = make_keyword_lf(keywords=["http"])

"""Spam comments make requests rather than commenting."""
keyword_please = make_keyword_lf(keywords=["please", "plz"])

"""Ham comments actually talk about the video's content."""
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)
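The template above passes `keyword_lookup` as the function behind each labeling function, but its body is not shown here. The following is a minimal sketch of what such a lookup could look like, assuming each data point exposes a `text` attribute and that the label constants take the values `ABSTAIN = -1`, `HAM = 0`, `SPAM = 1`. The example comments are hypothetical, and `SimpleNamespace` stands in for a real dataset row so the snippet runs without Snorkel installed.

```python
from types import SimpleNamespace

# Label constants (assumed values, not defined in this README).
ABSTAIN = -1
HAM = 0
SPAM = 1


def keyword_lookup(x, keywords, label):
    """Return `label` if any keyword occurs in the comment text, else abstain."""
    text = x.text.lower()
    if any(word in text for word in keywords):
        return label
    return ABSTAIN


# Hypothetical data points standing in for rows of the comment dataset.
spam_comment = SimpleNamespace(text="Please SUBSCRIBE to my channel")
ham_comment = SimpleNamespace(text="This song is great")

print(keyword_lookup(spam_comment, keywords=["subscribe"], label=SPAM))  # votes SPAM
print(keyword_lookup(ham_comment, keywords=["subscribe"], label=SPAM))   # abstains
print(keyword_lookup(ham_comment, keywords=["song"], label=HAM))         # votes HAM
```

Because the keywords and label arrive as arguments rather than being hard-coded, one function can back every keyword labeling function the builder produces.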
Labeling functions may also be generated programmatically. We’ve explored this in a number of settings, from automatically generated labeling functions (Varma et al. 2019) to natural language interfaces for parsing labeling functions (Hancock et al. 2018). In the Crowdsourcing Tutorial, we show a generator that produces a labeling function for each crowdworker:
def worker_lf(x, worker_dict):
    return worker_dict.get(x.tweet_id, ABSTAIN)


def make_worker_lf(worker_id):
    worker_dict = worker_dicts[worker_id]
    name = f"worker_{worker_id}"
    return LabelingFunction(name, f=worker_lf, resources={"worker_dict": worker_dict})


worker_lfs = [make_worker_lf(worker_id) for worker_id in worker_dicts]
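The generator relies on a `worker_dicts` mapping that is not defined in this snippet. As a rough sketch, assuming the crowd annotations are available as flat `(worker_id, tweet_id, label)` records, the mapping could be built as below. The records, the `ABSTAIN = -1` value, and the `SimpleNamespace` data point are all hypothetical, chosen so the example runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace

ABSTAIN = -1  # assumed abstain value

# Hypothetical crowd annotations: (worker_id, tweet_id, label).
annotations = [
    (7, "t1", 1),
    (7, "t2", 0),
    (9, "t1", 1),
]

# Group annotations by worker: worker_id -> {tweet_id: label}.
worker_dicts = defaultdict(dict)
for worker_id, tweet_id, label in annotations:
    worker_dicts[worker_id][tweet_id] = label


def worker_lf(x, worker_dict):
    """Return this worker's label for the tweet, or abstain if unlabeled."""
    return worker_dict.get(x.tweet_id, ABSTAIN)


# Each worker votes only on the tweets they annotated.
tweet = SimpleNamespace(tweet_id="t2")
votes = {wid: worker_lf(tweet, wd) for wid, wd in worker_dicts.items()}
print(votes)
```

Here worker 7 labeled tweet `t2` and worker 9 did not, so the second worker abstains on that data point.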
For certain use cases, it's helpful to generate primitives, or basic features, over the underlying data for Snorkel operators to access. This is especially important for non-textual data modalities, as we’ve shown in work across medical imaging (Fries et al. 2019) and computer vision (Chen et al. 2019).
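To make the idea concrete, here is a small illustrative sketch of a primitive for image data. It is not taken from this repository: the `pixels` field, the `add_brightness` helper, the label names, and the threshold are all hypothetical. The primitive computes a basic feature once, and a labeling function then reads that feature instead of the raw pixels.

```python
from statistics import mean
from types import SimpleNamespace

# Hypothetical label values for a toy "bright image" task.
ABSTAIN = -1
BRIGHT = 1


def add_brightness(x):
    """Primitive: attach the mean pixel intensity to the data point."""
    pixels = [p for row in x.pixels for p in row]
    x.brightness = mean(pixels)
    return x


def lf_bright(x, threshold=200):
    """Labeling function that only looks at the precomputed primitive."""
    return BRIGHT if x.brightness >= threshold else ABSTAIN


# Hypothetical 2x2 grayscale images with intensities in [0, 255].
light_image = add_brightness(SimpleNamespace(pixels=[[250, 240], [230, 220]]))
dark_image = add_brightness(SimpleNamespace(pixels=[[10, 20], [30, 40]]))

print(lf_bright(light_image), lf_bright(dark_image))
```

Separating the feature computation from the labeling logic lets several operators share one primitive rather than each recomputing it from the raw data.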
Coming soon!