Example-Driven Error Detection

Traditional error detection approaches require user-defined parameters and rules. Thus, the user has to know both the error detection system and the data. However, we can also formulate error detection as a semi-supervised classification problem that only requires domain expertise. The challenges for such an approach are twofold: (1) to represent the data in a way that enables a classification model to identify various kinds of data errors, and (2) to pick the most promising data values for learning. In this paper, we address these challenges with our new example-driven error detection method (ED2). First, we discuss and identify the appropriate features to locate different kinds of data errors across different data types. Second, we present a new two-dimensional multi-classifier sampling strategy for active learning. The combined application of these techniques enables the convergence of the classification task with high detection accuracy. On several real-world datasets, ED2 requires, on average, labels for only 1% of the data to outperform existing error detection approaches that are manually configured and tuned.

Citing

For further details, refer to the paper. If any of this code was helpful for your research, please consider citing it:

@inproceedings{neutatz2019ed2,
  title={{ED2:} {A} {C}ase for {A}ctive {L}earning in {E}rror {D}etection},
  author={Neutatz, Felix and Mahdavi, Mohammad and Abedjan, Ziawasch},
  booktitle={{CIKM}},
  year={2019}
}

Datasets

We provide both the dirty and the clean versions of several datasets.

Additional Evaluations

In addition to the charts provided in the paper, we provide additional evaluations on more datasets:

Feature representations: Besides more datasets, we also provide the F1-score for LSTM features on Address, Flights, and Hospital.
Column selection strategies
Classification models

Documentation

We are working to expand the documentation over time. We start here:

Constraints that we used to run NADEEF

Setup

cd model
sudo apt-get install libpq-dev python-dev python-tk
sudo python setup.py install

Using ED2

To run the experiments, first set the paths in a configuration file named after your machine. Examples can be found here: ~/model/ml/configuration/resources/
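If you are unsure which configuration file applies to your machine, a small helper such as the one below may help. This is only a sketch under stated assumptions: it treats "the name of your machine" as the hostname reported by the operating system, and it makes no assumption about the file extension. The function name `find_machine_configs` is hypothetical and not part of this repository.

```python
import socket
from pathlib import Path
from typing import List, Optional


def find_machine_configs(config_dir: str,
                         machine_name: Optional[str] = None) -> List[Path]:
    """List files in config_dir that are named after the given machine.

    Hypothetical helper, not part of ED2. Assumes the machine name is
    the OS hostname unless one is passed in explicitly.
    """
    name = machine_name or socket.gethostname()
    directory = Path(config_dir).expanduser()
    if not directory.is_dir():
        return []
    # Match "<name>" exactly or "<name>.<extension>", whatever the extension is.
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and (p.name == name or p.stem == name)
    )


if __name__ == "__main__":
    # Directory taken from the instructions above; adjust to your checkout.
    matches = find_machine_configs("~/model/ml/configuration/resources/")
    print("Machine name:", socket.gethostname())
    print("Matching configuration files:", [str(p) for p in matches])
```

If the list is empty, copy one of the example files in that directory, rename it to your machine name, and adjust the paths inside.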

Then, you can adapt the file ~/model/ml/experiments/features_experiment_multi.py to run the experiments that you are interested in.
