This project is a generic de-duplication framework using Spring and Lucene.
It runs in two modes — deduplication and match.
- Deduplication: operates on a single data file and looks for duplicate records within it.
- Match: operates on two data files, and looks for matching records between the two.
In order to run it needs two know two things:
- the directory
data-dir
containing the config-, input- and future output- files - optional: the path of the
config-file
(a spring bean context file), relative todata-dir
, defaults to 'config.xml'
The project is built using maven, so to run on the command-line do:
mvn clean compile exec:java -Dexec.args="--data-dir <path/to/directory> --config-file <path/to/config>"
For full examples look at the high-level cucumber tests, the simplest one currently is at src/test/resources/org/kew/rmf/core/levenshtein_dedup.feature.
For examples of concrete Matchers/Transformers/etc look at the corresponding unit-tests.
The spring context is split up in three different files:
- application-context.xml: a generic context that is always the same (independent of config details)
- application-context-dedup.xml/application-context-match.xml: this context adds the engine to the configuration; depending on the specific task this will be either a LuceneDeduplicator or a LuceneMatcher
- the config-specific context: Contains the 'user-configuration', the information about the data-sets (file locations and format, which columns to use in which way, matchers, transformers, reporters). This is the context that has to import the other two. A sample configuration can be found at src/test/resources/org/kew/rmf/core/levenshtein_dedup.feature.
All the work is done by a DataHandler
, by the <engine>
bean in the task
specific spring context. The actual configuration is held in the <config>
bean
in the user provided config that wires together all user-configurable elements.
Besides the file location and file format the most important here are the
Properties
: they wire together the name of a column with the desired
Transformer
s and Matcher
.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ LOGICAL DATAMODEL ┃
┃ Configuration ┃
┃ ┃ ┏╾< Transformer ┃
┃ ┃ ┃ ┃
┃ ┣━━━━━━━━╾< Property ╼━━━━━┫ ┃
┃ ┃ (each col) ┃ ┃
┃ ┃ ┗╾ Matcher ┃
┃ ┗━━━━━━━━╾< Reporter ┃
┃ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
The Reporter(s) define the output format of the match- and deduplication tasks.
See FurtherReading.md for more background info.