Merge pull request #836 from IBM/issue-753-ededup-docid
Update doc for doc_id and ededup to follow template in issue #753
touma-I authored Dec 5, 2024
2 parents 299aa5f + 54ced0e commit 5f92011
Showing 9 changed files with 548 additions and 135 deletions.
38 changes: 9 additions & 29 deletions transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
# Doc ID Transform

-The Document ID transforms adds a document identification (unique integers and content hashes), which later can be
-used in de-duplication operations, per the set of
-[transform project conventions](../../README.md#transform-project-conventions)
-the following runtimes are available:
+The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
+content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
+documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
+runtimes are available:

-* [pythom](python/README.md) - enables the running of the base python transformation
-in a Python runtime
-* [ray](ray/README.md) - enables the running of the base python transformation
-in a Ray runtime
-* [spark](spark/README.md) - enables the running of a spark-based transformation
-in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
-
-## Summary
-
-This transform annotates documents with document "ids".
-It supports the following transformations of the original data:
-* Adding document hash: this enables the addition of a document hash-based id to the data.
-The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`.
-To enable this annotation, set `hash_column` to the name of the column,
-where you want to store it.
-* Adding integer document id: this allows the addition of an integer document id to the data that
-is unique across all rows in all tables provided to the `transform()` method.
-To enable this annotation, set `int_id_column` to the name of the column, where you want
-to store it.
-
-Document IDs are generally useful for tracking annotations to specific documents. Additionally
-[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have
-document ID column(s), you can use this transform to create ones.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file

+Please check [here](python/README.md) for a more detailed description of this transform.
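For orientation, here is a minimal pyarrow sketch of what the annotation described above amounts to. It uses the hash formula quoted in the README; the `annotate_ids` function and its signature are illustrative, not the transform's actual API:

```python
import hashlib

import pyarrow as pa


def annotate_ids(table: pa.Table, doc_column: str, hash_column: str,
                 int_id_column: str, start_id: int = 0) -> pa.Table:
    """Append a sha256-hash column and a unique integer-ID column to a table."""
    docs = table.column(doc_column).to_pylist()
    # Hash formula as stated in the README above.
    hashes = [hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in docs]
    # Integer IDs are unique and increase from start_id.
    int_ids = pa.array(range(start_id, start_id + len(docs)), type=pa.uint64())
    table = table.append_column(hash_column, pa.array(hashes))
    return table.append_column(int_id_column, int_ids)
```

In the real transform, the ID sequence is coordinated across all tables seen by `transform()`, so IDs stay unique dataset-wide rather than per table.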
182 changes: 182 additions & 0 deletions transforms/universal/doc_id/doc_id.ipynb
@@ -0,0 +1,182 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install data-prep-toolkit\n",
"%pip install data-prep-toolkit-transforms==0.2.2.dev3"
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n",
"* doc_column - specifies name of the column containing the document (required for ID generation)\n",
"* hash_column - specifies name of the column created to hold the string document id, if None, id is not generated\n",
"* int_id_column - specifies name of the column created to hold the integer document id, if None, id is not generated\n",
"* start_id - an id from which ID generator starts () "
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### ***** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n",
"from doc_id_transform_base import (\n",
" doc_column_name_cli_param,\n",
" hash_column_name_cli_param,\n",
" int_column_name_cli_param,\n",
" start_id_cli_param,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### ***** Setup runtime parameters for this transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e90a853e-412f-45d7-af3d-959e755aeebb",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# create parameters\n",
"input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n",
"output_folder = os.path.join( \"python\", \"output\")\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" # execution info\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
" # doc id params\n",
" doc_column_name_cli_param: \"contents\",\n",
" hash_column_name_cli_param: \"hash_column\",\n",
" int_column_name_cli_param: \"int_id_column\",\n",
" start_id_cli_param: 5,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a",
"metadata": {},
"source": [
"##### ***** Use python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0775e400-7469-49a6-8998-bd4772931459",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())\n",
"launcher.launch()"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"python/output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
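Condensed into a plain script, the notebook's cells amount to the following; the paths, column names, and start ID mirror the notebook:

```python
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration
from doc_id_transform_base import (
    doc_column_name_cli_param,
    hash_column_name_cli_param,
    int_column_name_cli_param,
    start_id_cli_param,
)

local_conf = {
    "input_folder": os.path.join("python", "test-data", "input"),
    "output_folder": os.path.join("python", "output"),
}
params = {
    # Data access. Only the required parameters are specified.
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Doc ID parameters.
    doc_column_name_cli_param: "contents",       # column holding the document text
    hash_column_name_cli_param: "hash_column",   # column to receive the sha256 hash
    int_column_name_cli_param: "int_id_column",  # column to receive the integer ID
    start_id_cli_param: 5,                       # first integer ID to assign
}

# The launcher parses CLI-style arguments, so serialize params into sys.argv.
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())
launcher.launch()
```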
77 changes: 65 additions & 12 deletions transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
# Document ID Python Annotator

-Please see the set of
-[transform project conventions](../../../README.md)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
+transform configuration, testing and IDE set up.

-## Building
+## Contributors
+- Boris Lublinsky ([email protected])

-A [docker file](Dockerfile) that can be used for building docker image. You can use
+## Description

-```shell
-make build
-```
+This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
+original data:
+* **Adding a Document Hash** to each document. The unique hash-based ID is generated using
+`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name
+using the `hash_column` parameter.
+* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by
+the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
+parameter.
+
+Document IDs are essential for tracking annotations linked to specific documents. They are also required by processes
+like [fuzzy deduplication](../../fdedup/README.md), which depend on the presence of integer IDs. If your dataset lacks
+document ID columns, this transform can be used to generate them.
+
+## Input Columns Used by This Transform
+
+| Input Column Name                                           | Data Type | Description                      |
+|-------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument | str       | Column that stores document text |

-## Configuration and command line Options
+## Output Columns Annotated by This Transform
+
+| Output Column Name | Data Type | Description                                 |
+|--------------------|-----------|---------------------------------------------|
+| hash_column        | str       | Unique hash assigned to each document       |
+| int_id_column      | uint64    | Unique integer ID assigned to each document |

-The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py)
+## Configuration and Command Line Options
+
+The set of dictionary keys holding the [DocIDTransform](src/doc_id_transform_base.py)
configuration values is as follows:

* _doc_column_ - specifies the name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:

At least one of _hash_column_ or _int_id_column_ must be specified.

-## Running
+## Usage

### Launched Command Line Options
When running the transform with the python launcher (i.e. PythonTransformLauncher),
@@ -43,7 +63,40 @@ the following command line arguments are available in addition to
```
These correspond to the configuration keys described above.

### Running the samples
To run the samples, use the following `make` targets

* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args
* `run-local-sample` - runs src/doc_id_local_python.py

These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.

For example,
```shell
make run-cli-sample
...
```
Then run
```shell
ls output
```
to see the results of the transform.

### Code example

[notebook](../doc_id.ipynb)

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

We follow [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md).

Currently we have:
- [Unit test](test/test_doc_id_python.py)
- [Integration test](test/test_doc_id.py)
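As a quick sanity check after a run, the annotated parquet output can be inspected directly. A small sketch; the output folder and column names assume the configuration used in the notebook above:

```python
import glob
import os

import pyarrow.parquet as pq

# Assumes a prior run with the hash_column / int_id_column names used in the
# notebook; adjust the output folder to your configuration.
for path in sorted(glob.glob(os.path.join("python", "output", "*.parquet"))):
    table = pq.read_table(path)
    print(path, table.num_rows)
    # The annotated columns appear alongside the original ones.
    print(table.select(["hash_column", "int_id_column"]).slice(0, 3))
```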
10 changes: 9 additions & 1 deletion transforms/universal/doc_id/ray/README.md
@@ -1,10 +1,18 @@
-# Document ID Annotator
+# Document ID Ray Annotator

Please see the set of
[transform project conventions](../../../README.md)
for details on general project conventions, transform configuration,
testing and IDE set up.

+## Summary
+This project wraps the Document ID transform with a Ray runtime.
+
+## Configuration and Command Line Options
+
+Document ID configuration and command line options are the same as for the
+[base python transform](../python/README.md).
+
## Building

A [docker file](Dockerfile) can be used for building the docker image (e.g., via `make build`).
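By way of illustration, here is a hypothetical launch sketch for the Ray runtime, mirroring the python-runtime notebook above. The configuration class name and the `run_locally` flag are assumptions to verify against the ray package before use:

```python
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
# Assumed class name; check src/doc_id_transform_ray.py for the actual one.
from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration
from doc_id_transform_base import (
    doc_column_name_cli_param,
    hash_column_name_cli_param,
    int_column_name_cli_param,
)

params = {
    "run_locally": True,  # assumption: start an ephemeral local Ray cluster
    "data_local_config": ParamsUtils.convert_to_ast(
        {"input_folder": "test-data/input", "output_folder": "output"}
    ),
    doc_column_name_cli_param: "contents",
    hash_column_name_cli_param: "hash_column",
    int_column_name_cli_param: "int_id_column",
}
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = RayTransformLauncher(runtime_config=DocIDRayTransformRuntimeConfiguration())
launcher.launch()
```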
20 changes: 11 additions & 9 deletions transforms/universal/doc_id/spark/README.md
@@ -6,23 +6,25 @@ testing and IDE set up.

## Summary

-This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the [monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) pyspark function to generate the unique integer IDs. As described in the documentation of this function:
+This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the
+[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html)
+pyspark function to generate the unique integer IDs. As described in the documentation of this function:
> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
## Configuration and command line Options

-The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py)
-configuration for values are as follows:
-
-* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs.
+Document ID configuration and command line options are the same as for the
+[base python transform](../python/README.md).

## Running
-You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the `test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both the new annotated `test1.parquet` file and the `metadata.json` file.
+You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
+`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
+the new annotated `test1.parquet` file and the `metadata.json` file.

### Launched Command Line Options
-When running the transform with the Spark launcher (i.e. SparkTransformLauncher),
-the following command line arguments are available in addition to
-the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
+When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
+are available in addition to the options provided by the
+[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).

```
--doc_id_column_name DOC_ID_COLUMN_NAME
...
```
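To make the mechanism concrete, here is a minimal pyspark sketch of the ID assignment described above; the input path matches the sample data, while the output column name is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("doc_id_sketch").getOrCreate()

df = spark.read.parquet("test-data/input/test1.parquet")
# IDs are unique and monotonically increasing, but not consecutive.
df = df.withColumn("doc_id", monotonically_increasing_id())
df.write.mode("overwrite").parquet("output")
```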
(Diffs for the remaining changed files are not shown.)