Merge pull request #836 from IBM/issue-753-ededup-docid
Update doc for doc_id and ededup to follow template in issue #753
touma-I authored Dec 5, 2024
2 parents 299aa5f + 54ced0e commit 5f92011
Showing 9 changed files with 548 additions and 135 deletions.
38 changes: 9 additions & 29 deletions transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
# Doc ID Transform

-The Document ID transforms adds a document identification (unique integers and content hashes), which later can be
-used in de-duplication operations, per the set of
-[transform project conventions](../../README.md#transform-project-conventions)
-the following runtimes are available:
+The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
+content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
+documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
+runtimes are available:

-* [pythom](python/README.md) - enables the running of the base python transformation
-in a Python runtime
-* [ray](ray/README.md) - enables the running of the base python transformation
-in a Ray runtime
-* [spark](spark/README.md) - enables the running of a spark-based transformation
-in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
-
-## Summary
-
-This transform annotates documents with document "ids".
-It supports the following transformations of the original data:
-* Adding document hash: this enables the addition of a document hash-based id to the data.
-The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`.
-To enable this annotation, set `hash_column` to the name of the column,
-where you want to store it.
-* Adding integer document id: this allows the addition of an integer document id to the data that
-is unique across all rows in all tables provided to the `transform()` method.
-To enable this annotation, set `int_id_column` to the name of the column, where you want
-to store it.
-
-Document IDs are generally useful for tracking annotations to specific documents. Additionally
-[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have
-document ID column(s), you can use this transform to create ones.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file

+Please check [here](python/README.md) for a more detailed description of this transform.
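For orientation, here is a minimal pyarrow sketch of what the annotation described above amounts to. It uses the hash formula quoted in the README; the `annotate_ids` function and its signature are illustrative, not the transform's actual API:

```python
import hashlib

import pyarrow as pa


def annotate_ids(table: pa.Table, doc_column: str, hash_column: str,
                 int_id_column: str, start_id: int = 0) -> pa.Table:
    """Append a sha256-hash column and a unique integer-ID column to a table."""
    docs = table.column(doc_column).to_pylist()
    # Hash formula as stated in the README above.
    hashes = [hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in docs]
    # Integer IDs are unique and increase from start_id.
    int_ids = pa.array(range(start_id, start_id + len(docs)), type=pa.uint64())
    table = table.append_column(hash_column, pa.array(hashes))
    return table.append_column(int_id_column, int_ids)
```

In the real transform, the ID sequence is coordinated across all tables seen by `transform()`, so IDs stay unique dataset-wide rather than per table.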
182 changes: 182 additions & 0 deletions transforms/universal/doc_id/doc_id.ipynb
@@ -0,0 +1,182 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install data-prep-toolkit\n",
"%pip install data-prep-toolkit-transforms==0.2.2.dev3"
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n",
"* doc_column - specifies name of the column containing the document (required for ID generation)\n",
"* hash_column - specifies name of the column created to hold the string document id, if None, id is not generated\n",
"* int_id_column - specifies name of the column created to hold the integer document id, if None, id is not generated\n",
"* start_id - an id from which ID generator starts () "
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### ***** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n",
"from doc_id_transform_base import (\n",
" doc_column_name_cli_param,\n",
" hash_column_name_cli_param,\n",
" int_column_name_cli_param,\n",
" start_id_cli_param,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### ***** Setup runtime parameters for this transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e90a853e-412f-45d7-af3d-959e755aeebb",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# create parameters\n",
"input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n",
"output_folder = os.path.join( \"python\", \"output\")\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" # execution info\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
" # doc id params\n",
" doc_column_name_cli_param: \"contents\",\n",
" hash_column_name_cli_param: \"hash_column\",\n",
" int_column_name_cli_param: \"int_id_column\",\n",
" start_id_cli_param: 5,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a",
"metadata": {},
"source": [
"##### ***** Use python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0775e400-7469-49a6-8998-bd4772931459",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())\n",
"launcher.launch()"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"python/output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
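Condensed into a plain script, the notebook's cells amount to the following; the paths, column names, and start ID mirror the notebook:

```python
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration
from doc_id_transform_base import (
    doc_column_name_cli_param,
    hash_column_name_cli_param,
    int_column_name_cli_param,
    start_id_cli_param,
)

local_conf = {
    "input_folder": os.path.join("python", "test-data", "input"),
    "output_folder": os.path.join("python", "output"),
}
params = {
    # Data access. Only the required parameters are specified.
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Doc ID parameters.
    doc_column_name_cli_param: "contents",       # column holding the document text
    hash_column_name_cli_param: "hash_column",   # column to receive the sha256 hash
    int_column_name_cli_param: "int_id_column",  # column to receive the integer ID
    start_id_cli_param: 5,                       # first integer ID to assign
}

# The launcher parses CLI-style arguments, so serialize params into sys.argv.
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())
launcher.launch()
```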
77 changes: 65 additions & 12 deletions transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
# Document ID Python Annotator

-Please see the set of
-[transform project conventions](../../../README.md)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
+transform configuration, testing and IDE set up.

-## Building
+## Contributors
+- Boris Lublinsky ([email protected])

-A [docker file](Dockerfile) that can be used for building docker image. You can use
+## Description

-```shell
-make build
-```
+This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
+original data:
+* **Adding a Document Hash** to each document. The unique hash-based ID is generated using
+`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name
+using the `hash_column` parameter.
+* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by
+the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
+parameter.
+
+Document IDs are essential for tracking annotations linked to specific documents. They are also required by processes
+like [fuzzy deduplication](../../fdedup/README.md), which depend on the presence of integer IDs. If your dataset lacks
+document ID columns, this transform can be used to generate them.
+
+## Input Columns Used by This Transform
+
+| Input Column Name                                           | Data Type | Description                      |
+|-------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument | str       | Column that stores document text |

-## Configuration and command line Options
+## Output Columns Annotated by This Transform
+
+| Output Column Name | Data Type | Description                                 |
+|--------------------|-----------|---------------------------------------------|
+| hash_column        | str       | Unique hash assigned to each document       |
+| int_id_column      | uint64    | Unique integer ID assigned to each document |

-The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py)
+## Configuration and Command Line Options
+
+The set of dictionary keys holding the [DocIDTransform](src/doc_id_transform_base.py)
configuration values is as follows:

* _doc_column_ - specifies the name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:

At least one of _hash_column_ or _int_id_column_ must be specified.

-## Running
+## Usage

### Launched Command Line Options
When running the transform with the python launcher (i.e. PythonTransformLauncher),
@@ -43,7 +63,40 @@ the following command line arguments are available in addition to
```
These correspond to the configuration keys described above.

### Running the samples
To run the samples, use the following `make` targets

* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args
* `run-local-sample` - runs src/doc_id_local_python.py

These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.

For example,
```shell
make run-cli-sample
...
```
Then run
```shell
ls output
```
to see the results of the transform.

### Code example

[notebook](../doc_id.ipynb)

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

We follow [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md).

Currently we have:
- [Unit test](test/test_doc_id_python.py)
- [Integration test](test/test_doc_id.py)
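As a quick sanity check after a run, the annotated parquet output can be inspected directly. A small sketch; the output folder and column names assume the configuration used in the notebook above:

```python
import glob
import os

import pyarrow.parquet as pq

# Assumes a prior run with the hash_column / int_id_column names used in the
# notebook; adjust the output folder to your configuration.
for path in sorted(glob.glob(os.path.join("python", "output", "*.parquet"))):
    table = pq.read_table(path)
    print(path, table.num_rows)
    # The annotated columns appear alongside the original ones.
    print(table.select(["hash_column", "int_id_column"]).slice(0, 3))
```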
10 changes: 9 additions & 1 deletion transforms/universal/doc_id/ray/README.md
@@ -1,10 +1,18 @@
-# Document ID Annotator
+# Document ID Ray Annotator

Please see the set of
[transform project conventions](../../../README.md)
for details on general project conventions, transform configuration,
testing and IDE set up.

+## Summary
+This project wraps the Document ID transform with a Ray runtime.
+
+## Configuration and Command Line Options
+
+Document ID configuration and command line options are the same as for the
+[base python transform](../python/README.md).
+
## Building

A [docker file](Dockerfile) can be used for building the docker image (e.g., via `make build`).
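By way of illustration, here is a hypothetical launch sketch for the Ray runtime, mirroring the python-runtime notebook above. The configuration class name and the `run_locally` flag are assumptions to verify against the ray package before use:

```python
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
# Assumed class name; check src/doc_id_transform_ray.py for the actual one.
from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration
from doc_id_transform_base import (
    doc_column_name_cli_param,
    hash_column_name_cli_param,
    int_column_name_cli_param,
)

params = {
    "run_locally": True,  # assumption: start an ephemeral local Ray cluster
    "data_local_config": ParamsUtils.convert_to_ast(
        {"input_folder": "test-data/input", "output_folder": "output"}
    ),
    doc_column_name_cli_param: "contents",
    hash_column_name_cli_param: "hash_column",
    int_column_name_cli_param: "int_id_column",
}
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = RayTransformLauncher(runtime_config=DocIDRayTransformRuntimeConfiguration())
launcher.launch()
```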
20 changes: 11 additions & 9 deletions transforms/universal/doc_id/spark/README.md
@@ -6,23 +6,25 @@ testing and IDE set up.

## Summary

-This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the [monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) pyspark function to generate the unique integer IDs. As described in the documentation of this function:
+This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the
+[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html)
+pyspark function to generate the unique integer IDs. As described in the documentation of this function:
> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
## Configuration and command line Options

-The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py)
-configuration for values are as follows:
-
-* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs.
+Document ID configuration and command line options are the same as for the
+[base python transform](../python/README.md).

## Running
-You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the `test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both the new annotated `test1.parquet` file and the `metadata.json` file.
+You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
+`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
+the new annotated `test1.parquet` file and the `metadata.json` file.

### Launched Command Line Options
-When running the transform with the Spark launcher (i.e. SparkTransformLauncher),
-the following command line arguments are available in addition to
-the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
+When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
+are available in addition to the options provided by the
+[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).

```
--doc_id_column_name DOC_ID_COLUMN_NAME
...
```
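To make the mechanism concrete, here is a minimal pyspark sketch of the ID assignment described above; the input path matches the sample data, while the output column name is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("doc_id_sketch").getOrCreate()

df = spark.read.parquet("test-data/input/test1.parquet")
# IDs are unique and monotonically increasing, but not consecutive.
df = df.withColumn("doc_id", monotonically_increasing_id())
df.write.mode("overwrite").parquet("output")
```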
(Diffs for the remaining changed files are not shown.)