Merge pull request #836 from IBM/issue-753-ededup-docid
Update doc for doc_id and ededup to follow template in issue #753
Showing 9 changed files with 548 additions and 135 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,33 +1,13 @@ | ||
# Doc ID Transform | ||
|
||
The Document ID transforms adds a document identification (unique integers and content hashes), which later can be | ||
used in de-duplication operations, per the set of | ||
[transform project conventions](../../README.md#transform-project-conventions) | ||
the following runtimes are available: | ||
The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a | ||
content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate | ||
documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following | ||
runtimes are available: | ||
|
||
* [pythom](python/README.md) - enables the running of the base python transformation | ||
in a Python runtime | ||
* [ray](ray/README.md) - enables the running of the base python transformation | ||
in a Ray runtime | ||
* [spark](spark/README.md) - enables the running of a spark-based transformation | ||
in a Spark runtime. | ||
* [kfp](kfp_ray/README.md) - enables running the ray docker image | ||
in a kubernetes cluster using a generated `yaml` file. | ||
|
||
## Summary | ||
|
||
This transform annotates documents with document "ids". | ||
It supports the following transformations of the original data: | ||
* Adding document hash: this enables the addition of a document hash-based id to the data. | ||
The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. | ||
To enable this annotation, set `hash_column` to the name of the column, | ||
where you want to store it. | ||
* Adding integer document id: this allows the addition of an integer document id to the data that | ||
is unique across all rows in all tables provided to the `transform()` method. | ||
To enable this annotation, set `int_id_column` to the name of the column, where you want | ||
to store it. | ||
|
||
Document IDs are generally useful for tracking annotations to specific documents. Additionally | ||
[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have | ||
document ID column(s), you can use this transform to create ones. | ||
* [python](python/README.md) - enables running the base python transform in a Python runtime | ||
* [ray](ray/README.md) - enables running the base python transform in a Ray runtime | ||
* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime. | ||
* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file. | ||
|
||
Please check [here](python/README.md) for a more detailed description of this transform. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,182 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "afd55886-5f5b-4794-838e-ef8179fb0394", | ||
"metadata": {}, | ||
"source": [ | ||
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from git clone:\n", | ||
"```\n", | ||
"make venv \n", | ||
"source venv/bin/activate \n", | ||
"pip install jupyterlab\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture\n", | ||
"## This is here as a reference only\n", | ||
"# Users and application developers must use the right tag for the latest from pypi\n", | ||
"%pip install data-prep-toolkit\n", | ||
"%pip install data-prep-toolkit-transforms==0.2.2.dev3" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", | ||
"metadata": { | ||
"jp-MarkdownHeadingCollapsed": true | ||
}, | ||
"source": [ | ||
"##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n", | ||
"* doc_column - specifies name of the column containing the document (required for ID generation)\n", | ||
"* hash_column - specifies name of the column created to hold the string document id, if None, id is not generated\n", | ||
"* int_id_column - specifies name of the column created to hold the integer document id, if None, id is not generated\n", | ||
"* start_id - the id from which the ID generator starts" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ebf1f782-0e61-485c-8670-81066beb734c", | ||
"metadata": {}, | ||
"source": [ | ||
"##### ***** Import required classes and modules" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"import sys\n", | ||
"\n", | ||
"from data_processing.runtime.pure_python import PythonTransformLauncher\n", | ||
"from data_processing.utils import ParamsUtils\n", | ||
"from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", | ||
"from doc_id_transform_base import (\n", | ||
" doc_column_name_cli_param,\n", | ||
" hash_column_name_cli_param,\n", | ||
" int_column_name_cli_param,\n", | ||
" start_id_cli_param,\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7234563c-2924-4150-8a31-4aec98c1bf33", | ||
"metadata": {}, | ||
"source": [ | ||
"##### ***** Setup runtime parameters for this transform" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e90a853e-412f-45d7-af3d-959e755aeebb", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"\n", | ||
"# create parameters\n", | ||
"input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", | ||
"output_folder = os.path.join( \"python\", \"output\")\n", | ||
"local_conf = {\n", | ||
" \"input_folder\": input_folder,\n", | ||
" \"output_folder\": output_folder,\n", | ||
"}\n", | ||
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", | ||
"params = {\n", | ||
" # Data access. Only required parameters are specified\n", | ||
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", | ||
" # execution info\n", | ||
" \"runtime_pipeline_id\": \"pipeline_id\",\n", | ||
" \"runtime_job_id\": \"job_id\",\n", | ||
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", | ||
" # doc id params\n", | ||
" doc_column_name_cli_param: \"contents\",\n", | ||
" hash_column_name_cli_param: \"hash_column\",\n", | ||
" int_column_name_cli_param: \"int_id_column\",\n", | ||
" start_id_cli_param: 5,\n", | ||
"}" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", | ||
"metadata": {}, | ||
"source": [ | ||
"##### ***** Use python runtime to invoke the transform" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "0775e400-7469-49a6-8998-bd4772931459", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture\n", | ||
"sys.argv = ParamsUtils.dict_to_req(d=params)\n", | ||
"launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())\n", | ||
"launcher.launch()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b", | ||
"metadata": {}, | ||
"source": [ | ||
"##### **** The specified folder will include the transformed parquet files." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "7276fe84-6512-4605-ab65-747351e13a7c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import glob\n", | ||
"glob.glob(\"python/output/*\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,21 +1,41 @@ | ||
# Document ID Python Annotator | ||
|
||
Please see the set of | ||
[transform project conventions](../../../README.md) | ||
for details on general project conventions, transform configuration, | ||
testing and IDE set up. | ||
Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, | ||
transform configuration, testing and IDE set up. | ||
|
||
## Building | ||
## Contributors | ||
- Boris Lublinsky ([email protected]) | ||
|
||
A [docker file](Dockerfile) that can be used for building docker image. You can use | ||
## Description | ||
|
||
```shell | ||
make build | ||
``` | ||
This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the | ||
original data: | ||
* **Adding a Document Hash** to each document. The unique hash-based ID is generated using | ||
`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name using | ||
the `hash_column` parameter. | ||
* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by | ||
the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column` | ||
parameter. | ||
|
||
Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes | ||
like [fuzzy deduplication](../../fdedup/README.md), which depend on the presence of integer IDs. If your dataset lacks document ID | ||
columns, this transform can be used to generate them. | ||
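
The two annotations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the transform's actual implementation: the helper `annotate_documents` and the dictionary row layout are hypothetical, and only the `hashlib.sha256(doc.encode("utf-8")).hexdigest()` call is taken from the description. The strings `hash_column` and `int_id_column` are used as literal keys for simplicity, and `start_id` mirrors the parameter of the same name in the notebook.

```python
import hashlib


def annotate_documents(docs, start_id=0):
    """Hypothetical helper: return one dict per document with both IDs attached."""
    rows = []
    for offset, doc in enumerate(docs):
        rows.append(
            {
                "contents": doc,
                # Content hash, computed with the call quoted in the description.
                "hash_column": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
                # Sequential integer ID, unique within this single call.
                "int_id_column": start_id + offset,
            }
        )
    return rows


rows = annotate_documents(["first doc", "second doc", "first doc"], start_id=5)
for row in rows:
    print(row["int_id_column"], row["hash_column"][:12], row["contents"])
```

In this sketch, documents with identical contents receive identical hashes while their integer IDs stay distinct, which is the property a downstream deduplication step can rely on.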
|
||
## Input Columns Used by This Transform | ||
|
||
| Input Column Name | Data Type | Description | | ||
|------------------------------------------------------------------|-----------|----------------------------------| | ||
| Column specified by the _doc_column_ configuration argument | str | Column that stores document text | | ||
|
||
## Configuration and command line Options | ||
## Output Columns Annotated by This Transform | ||
| Output Column Name | Data Type | Description | | ||
|--------------------|-----------|---------------------------------------------| | ||
| hash_column | str | Unique hash assigned to each document | | ||
| int_id_column | uint64 | Unique integer ID assigned to each document | | ||
|
||
The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py) | ||
## Configuration and Command Line Options | ||
|
||
The set of dictionary keys holding [DocIDTransform](src/doc_id_transform_base.py) | ||
configuration values is as follows: | ||
|
||
* _doc_column_ - specifies name of the column containing the document (required for ID generation) | ||
|
@@ -25,7 +45,7 @@ configuration for values are as follows: | |
|
||
At least one of _hash_column_ or _int_id_column_ must be specified. | ||
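
The rule above can be expressed as a small configuration check. The sketch below is an assumption about how such a check might look, not the code in `doc_id_transform_base.py`: the function `validate_doc_id_config` is hypothetical, and the error type and message are illustrative only.

```python
def validate_doc_id_config(config):
    """Hypothetical check: at least one output ID column must be requested."""
    hash_column = config.get("hash_column")
    int_id_column = config.get("int_id_column")
    if hash_column is None and int_id_column is None:
        raise ValueError(
            "at least one of hash_column or int_id_column must be specified"
        )
    return config


# Requesting only the hash column satisfies the rule.
validate_doc_id_config({"doc_column": "contents", "hash_column": "doc_hash"})

# Requesting neither column is rejected.
try:
    validate_doc_id_config({"doc_column": "contents"})
except ValueError as err:
    print(f"invalid configuration: {err}")
```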
|
||
## Running | ||
## Usage | ||
|
||
### Launched Command Line Options | ||
When running the transform with the Python launcher (i.e. PythonTransformLauncher), | ||
|
@@ -43,7 +63,40 @@ the following command line arguments are available in addition to | |
``` | ||
These correspond to the configuration keys described above. | ||
|
||
### Running the samples | ||
To run the samples, use the following `make` targets: | ||
|
||
* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args | ||
* `run-local-sample` - runs src/doc_id_local_python.py | ||
|
||
These targets will activate the virtual environment and set up any configuration needed. | ||
Use the `-n` option of `make` to see the details of what is done to run the sample. | ||
|
||
For example, | ||
```shell | ||
make run-cli-sample | ||
... | ||
``` | ||
Then | ||
```shell | ||
ls output | ||
``` | ||
to see the results of the transform. | ||
|
||
### Code example | ||
|
||
[notebook](../doc_id.ipynb) | ||
|
||
### Transforming data using the transform image | ||
|
||
To use the transform image to transform your data, please refer to the | ||
[running images quickstart](../../../../doc/quick-start/run-transform-image.md), | ||
substituting the name of this transform image and runtime as appropriate. | ||
|
||
## Testing | ||
|
||
Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md), we currently have: | ||
- [Unit test](test/test_doc_id_python.py) | ||
- [Integration test](test/test_doc_id.py) |