From 65382189d3124f8aacd9e391a7a2d267a25ef9f4 Mon Sep 17 00:00:00 2001
From: Constantin M Adam
Date: Tue, 26 Nov 2024 08:53:08 -0500
Subject: [PATCH 1/6] Update doc to follow template in issue #753

Signed-off-by: Constantin M Adam
---
 transforms/universal/doc_id/README.md        | 38 ++-------
 transforms/universal/doc_id/python/README.md | 89 +++++++++++++++++---
 transforms/universal/doc_id/ray/README.md    | 10 ++-
 transforms/universal/doc_id/spark/README.md  | 17 ++--
 4 files changed, 105 insertions(+), 49 deletions(-)

diff --git a/transforms/universal/doc_id/README.md b/transforms/universal/doc_id/README.md
index c5c785353..02564db20 100644
--- a/transforms/universal/doc_id/README.md
+++ b/transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
 # Doc ID Transform
 
-The Document ID transforms adds a document identification (unique integers and content hashes), which later can be
-used in de-duplication operations, per the set of
-[transform project conventions](../../README.md#transform-project-conventions)
-the following runtimes are available:
+The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
+content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
+documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
+runtimes are available:
 
-* [pythom](python/README.md) - enables the running of the base python transformation
- in a Python runtime
-* [ray](ray/README.md) - enables the running of the base python transformation
- in a Ray runtime
-* [spark](spark/README.md) - enables the running of a spark-based transformation
-in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
-
-## Summary
-
-This transform annotates documents with document "ids".
-It supports the following transformations of the original data:
-* Adding document hash: this enables the addition of a document hash-based id to the data.
- The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`.
- To enable this annotation, set `hash_column` to the name of the column,
- where you want to store it.
-* Adding integer document id: this allows the addition of an integer document id to the data that
- is unique across all rows in all tables provided to the `transform()` method.
- To enable this annotation, set `int_id_column` to the name of the column, where you want
- to store it.
-
-Document IDs are generally useful for tracking annotations to specific documents. Additionally
-[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have
-document ID column(s), you can use this transform to create ones.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file.
+
+Please check [here](python/README.md) for a more detailed description of this transform.
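+
+For illustration only, the sketch below mimics the two annotations described above on a pyarrow table, outside of any
+runtime. It is not the transform's actual implementation, and the column names are placeholders:
+
+```python
+# Illustrative sketch only -- not the transform's implementation.
+import hashlib
+
+import pyarrow as pa
+
+
+def annotate_ids(table: pa.Table, doc_column: str, hash_column: str,
+                 int_id_column: str, start_id: int = 0) -> tuple[pa.Table, int]:
+    docs = table.column(doc_column).to_pylist()
+    # Content hash of each document, as described above.
+    hashes = [hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs]
+    table = table.append_column(hash_column, pa.array(hashes))
+    # Integer IDs; returning the next free ID keeps them unique across tables.
+    ids = list(range(start_id, start_id + len(docs)))
+    table = table.append_column(int_id_column, pa.array(ids, type=pa.uint64()))
+    return table, start_id + len(docs)
+```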
diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md
index dbb02093c..bdaf834e3 100644
--- a/transforms/universal/doc_id/python/README.md
+++ b/transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
 # Document ID Python Annotator
 
-Please see the set of
-[transform project conventions](../../../README.md)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
+transform configuration, testing and IDE set up.
 
-## Building
+## Contributors
+- Boris Lublinsky (blublinsk@ibm.com)
 
-A [docker file](Dockerfile) that can be used for building docker image. You can use
+## Description
 
-```shell
-make build
-```
+This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
+original data:
+* **Adding a Document Hash** to each document. The unique hash-based ID is generated using
+`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name using
+the `hash_column` parameter.
+* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by
+the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
+parameter.
+
+Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
+like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
+columns, this transform can be used to generate them.
+
+## Input Columns Used by This Transform
+
+| Input Column Name                               | Data Type | Description                      |
+|-------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ config arg | str       | Column that stores document text |
+
+## Output Columns Annotated by This Transform
+| Output Column Name | Data Type | Description                                 |
+|--------------------|-----------|---------------------------------------------|
+| hash_column        | str       | Unique hash assigned to each document       |
+| int_id_column      | uint64    | Unique integer ID assigned to each document |
 
-## Configuration and command line Options
+## Configuration and Command Line Options
 
-The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py)
+The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_base.py)
 configuration for values are as follows:
 
 * _doc_column_ - specifies name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:
 
 At least one of _hash_column_ or _int_id_column_ must be specified.
 
-## Running
+## Usage
 
 ### Launched Command Line Options
 When running the transform with the Ray launcher (i.e. TransformLauncher),
@@ -43,7 +63,52 @@ the following command line arguments are available in addition to
 ```
 These correspond to the configuration keys described above.
 
+To use the transform image to transform your data, please refer to the
+[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
+substituting the name of this transform image and runtime as appropriate.
+
+## Building
+
+A [docker file](Dockerfile) that can be used for building docker image. 
You can use + +```shell +make build +``` + +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args +* `run-local-sample` - runs src/doc_id_local_python.py + +These targets will activate the virtual environment and set up any configuration needed. +Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-cli-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + +### Code example + +TBD + +### Transforming data using the transform image To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_doc_id_python.py) +- [Integration test](test/test_doc_id.py) diff --git a/transforms/universal/doc_id/ray/README.md b/transforms/universal/doc_id/ray/README.md index c9cb0d15c..ef260f719 100644 --- a/transforms/universal/doc_id/ray/README.md +++ b/transforms/universal/doc_id/ray/README.md @@ -1,10 +1,18 @@ -# Document ID Annotator +# Document ID Ray Annotator Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, transform configuration, testing and IDE set up. +## Summary +This project wraps the Document ID transform with a Ray runtime. + +## Configuration and command line Options +Document ID configuration and command line options are the same as for the base python +transform. + + ## Building A [docker file](Dockerfile) that can be used for building docker image. You can use diff --git a/transforms/universal/doc_id/spark/README.md b/transforms/universal/doc_id/spark/README.md index 932637c54..ace6f79e2 100644 --- a/transforms/universal/doc_id/spark/README.md +++ b/transforms/universal/doc_id/spark/README.md @@ -6,23 +6,26 @@ testing and IDE set up. ## Summary -This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the [monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) pyspark function to generate the unique integer IDs. As described in the documentation of this function: +This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the +[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) +pyspark function to generate the unique integer IDs. As described in the documentation of this function: > The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. ## Configuration and command line Options -The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) -configuration for values are as follows: +The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) configuration for values are as follows: * _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs. 
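+
+For illustration, a minimal PySpark sketch of this approach follows. It is not the transform's implementation, and the
+file paths and column name are placeholders:
+
+```python
+# Sketch only: assign integer doc IDs using the monotonically_increasing_id
+# function described above. Paths and the column name are placeholders.
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import monotonically_increasing_id
+
+spark = SparkSession.builder.appName("doc_id_sketch").getOrCreate()
+df = spark.read.parquet("test-data/input/test1.parquet")
+# IDs are guaranteed unique and monotonically increasing, but not consecutive.
+df = df.withColumn("doc_id", monotonically_increasing_id())
+df.write.mode("overwrite").parquet("output")
+```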
 ## Running
 
-You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the `test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both the new annotated `test1.parquet` file and the `metadata.json` file.
+You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
+`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
+the new annotated `test1.parquet` file and the `metadata.json` file.
 
 ### Launched Command Line Options
-When running the transform with the Spark launcher (i.e. SparkTransformLauncher),
-the following command line arguments are available in addition to
-the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
+When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
+are available in addition to the options provided by the
+[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
 
 ```
 --doc_id_column_name DOC_ID_COLUMN_NAME

From 0f96b6119f2ca844146a4c0da9ffb6b20dc3c072 Mon Sep 17 00:00:00 2001
From: Constantin M Adam
Date: Tue, 26 Nov 2024 09:07:09 -0500
Subject: [PATCH 2/6] Update doc to follow template in issue #753

Signed-off-by: Constantin M Adam
---
 transforms/universal/doc_id/python/README.md | 20 ++++----------------
 transforms/universal/doc_id/ray/README.md    |  4 ++--
 transforms/universal/doc_id/spark/README.md  |  5 ++---
 3 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md
index bdaf834e3..0f1b783c4 100644
--- a/transforms/universal/doc_id/python/README.md
+++ b/transforms/universal/doc_id/python/README.md
@@ -18,14 +18,14 @@ the `transform()` method. To store this ID in the data, specify the desired colu
 parameter.
 
 Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
-like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
+like [fuzzy deduplication](../../fdedup/README.md), which depend on the presence of integer IDs. If your dataset lacks document ID
 columns, this transform can be used to generate them.
 
 ## Input Columns Used by This Transform
 
-| Input Column Name                               | Data Type | Description                      |
-|-------------------------------------------------|-----------|----------------------------------|
-| Column specified by the _doc_column_ config arg | str       | Column that stores document text |
+| Input Column Name                                           | Data Type | Description                      |
+|-------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument | str       | Column that stores document text |
 
 ## Output Columns Annotated by This Transform
 | Output Column Name | Data Type | Description                                 |
@@ -63,18 +63,6 @@ the following command line arguments are available in addition to
 ```
 These correspond to the configuration keys described above.
 
-To use the transform image to transform your data, please refer to the
-[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
-substituting the name of this transform image and runtime as appropriate.
- -## Building - -A [docker file](Dockerfile) that can be used for building docker image. You can use - -```shell -make build -``` - ### Running the samples To run the samples, use the following `make` targets diff --git a/transforms/universal/doc_id/ray/README.md b/transforms/universal/doc_id/ray/README.md index ef260f719..438c6a16d 100644 --- a/transforms/universal/doc_id/ray/README.md +++ b/transforms/universal/doc_id/ray/README.md @@ -9,9 +9,9 @@ testing and IDE set up. This project wraps the Document ID transform with a Ray runtime. ## Configuration and command line Options -Document ID configuration and command line options are the same as for the base python -transform. +Document ID configuration and command line options are the same as for the +[base python transform](../python/README.md). ## Building diff --git a/transforms/universal/doc_id/spark/README.md b/transforms/universal/doc_id/spark/README.md index ace6f79e2..92a5f23d1 100644 --- a/transforms/universal/doc_id/spark/README.md +++ b/transforms/universal/doc_id/spark/README.md @@ -13,9 +13,8 @@ pyspark function to generate the unique integer IDs. As described in the documen ## Configuration and command line Options -The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) configuration for values are as follows: - -* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs. +Document ID configuration and command line options are the same as for the +[base python transform](../python/README.md). ## Running You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the From bab7c56bef94329e707ed8a0f917c76baccde063 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 27 Nov 2024 15:42:10 -0500 Subject: [PATCH 3/6] Updating docs for ededup Signed-off-by: Constantin M Adam --- transforms/universal/ededup/README.md | 68 ++------------------ transforms/universal/ededup/python/README.md | 66 +++++++++++++++++-- 2 files changed, 68 insertions(+), 66 deletions(-) diff --git a/transforms/universal/ededup/README.md b/transforms/universal/ededup/README.md index 0390cc19c..0a2f58af6 100644 --- a/transforms/universal/ededup/README.md +++ b/transforms/universal/ededup/README.md @@ -1,65 +1,11 @@ # Exact Deduplication Transform -## Summary - -Exact data deduplication is used to identify (and remove) records determined by native documents. -* It’s O(N2) complexity -* shuffling with lots of data movement - -It can be implemented using 2 approaches: -* Exact string matching -* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content - -Implementation here is using “streaming” deduplication, based on central hash: - -![](images/exactdedup.png) - -* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing - unique hashes seen so far. -* Individual data processors are responsible for: - * Reading data from data plane - * Converting documents into hashes - * Coordinating with distributed hashes cache to remove the duplicates - * Storing unique documents back to the data plane - -The complication of mapping this model to transform model is the fact that implementation requires a hash cache, -that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. -and pass it as a parameter to transforms. 
-
-## Transform runtime
-
-Transform runtime is responsible for creation of the hashes cache. Additionally it
-enhances statistics information with the information about hashes cache size and utilization
-
-## Configuration and command line Options
-
-The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py)
-configuration for values (common for Python and Ray) are as follows:
-
-* _doc_column_ - specifies name of the column containing documents
-* _doc_id_column_ - specifies the name of the column containing a document id
-* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used
- for the incremental ededup execution
-* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default
- location (output_folder/snapshot is used)
-
-## Snapshotting
-
-In the current implementation we also provide snapshotting. At the end of execution, the content
-of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental
-execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now
-when additional files come in, instead of running dedup on all the files, you can load snapshot
-from the previous run and run dedup only on new files
-
-
-## Available runtimes
-
-As per [transform project conventions](../../README.md#transform-project-conventions)
+The exact deduplication transform identifies and removes identical documents in a dataset by comparing them hash-for-hash
+to ensure exact matching. Per the set of [transform project conventions](../../README.md#transform-project-conventions)
 the following runtimes are available:
 
-* [python](python/README.md) - enables running of the base python transformation
- in a Python runtime
-* [ray](ray/README.md) - enables running of the base python transformation
-in a Ray runtime
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file.
+
+Please see [here](python/README.md) for a more detailed description of this transform.
diff --git a/transforms/universal/ededup/python/README.md b/transforms/universal/ededup/python/README.md
index 4a10e9f83..dbe6d37cf 100644
--- a/transforms/universal/ededup/python/README.md
+++ b/transforms/universal/ededup/python/README.md
@@ -1,15 +1,71 @@
 # Ededup Python Transform
 
-Please see the set of
-[transform project conventions](../../../README.md#transform-project-conventions)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md#transform-project-conventions) for details on
+general project conventions, transform configuration, testing and IDE set up.
 
-Also see [here](../ray/README.md) on details of implementation
 
 ## Summary
 This is a python version of ededup
+
+* As shown in the figure below, the implementation of exact dedup relies on a (distributed)
+hash cache and a set of individual data processors that read documents and convert them into hashes
+
+## Summary
+
+Exact data deduplication is used to identify (and remove) records determined by native documents. 
+* It’s O(N2) complexity +* shuffling with lots of data movement + +It can be implemented using 2 approaches: +* Exact string matching +* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content + +Implementation here is using “streaming” deduplication, based on central hash: + +![](images/exactdedup.png) + +* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing + unique hashes seen so far. +* Individual data processors are responsible for: + * Reading data from data plane + * Converting documents into hashes + * Coordinating with distributed hashes cache to remove the duplicates + * Storing unique documents back to the data plane + +The complication of mapping this model to transform model is the fact that implementation requires a hash cache, +that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. +and pass it as a parameter to transforms. + +## Transform runtime + +Transform runtime is responsible for creation of the hashes cache. Additionally it +enhances statistics information with the information about hashes cache size and utilization + +## Configuration and command line Options + +The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py) +configuration for values (common for Python and Ray) are as follows: + +* _doc_column_ - specifies name of the column containing documents +* _doc_id_column_ - specifies the name of the column containing a document id +* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used + for the incremental ededup execution +* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default + location (output_folder/snapshot is used) + +## Snapshotting + +In the current implementation we also provide snapshotting. At the end of execution, the content +of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental +execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now +when additional files come in, instead of running dedup on all the files, you can load snapshot +from the previous run and run dedup only on new files + + +## Available runtimes + + ## Configuration and command line Options See [common](../README.md) ededup parameters From 28b0e54622f1e4051d0e77edc2f40a8f7b13ab17 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 27 Nov 2024 23:48:03 -0500 Subject: [PATCH 4/6] Added notebook pointer Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/python/README.md | 2 +- transforms/universal/ededup/python/README.md | 129 ++++++++++--------- transforms/universal/ededup/ray/README.md | 12 +- 3 files changed, 76 insertions(+), 67 deletions(-) diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md index 0f1b783c4..7941f4bc7 100644 --- a/transforms/universal/doc_id/python/README.md +++ b/transforms/universal/doc_id/python/README.md @@ -85,7 +85,7 @@ To see results of the transform. 
 ### Code example
 
-TBD
+[notebook](../doc_id.ipynb)
 
 ### Transforming data using the transform image
 
diff --git a/transforms/universal/ededup/python/README.md b/transforms/universal/ededup/python/README.md
index dbe6d37cf..ac15e64d3 100644
--- a/transforms/universal/ededup/python/README.md
+++ b/transforms/universal/ededup/python/README.md
@@ -3,80 +3,54 @@
 Please see the set of [transform project conventions](../../../README.md#transform-project-conventions) for details on
 general project conventions, transform configuration, testing and IDE set up.
 
+## Contributors
+- Boris Lublinsky (blublinsk@ibm.com)
 
-## Summary
-This is a python version of ededup
+## Description
 
+This Python implementation of the exact deduplication transform uses "streaming" deduplication based on a central hash
+cache. As shown below, it relies on a distributed hash cache and data processors that read documents, generate hashes,
+coordinate with the cache to remove duplicates, and store unique documents in the data plane.
 
-* As shown in the figure below, the implementation of exact dedup relies on a (distributed)
-hash cache and a set of individual data processors that read documents and convert them into hashes
+![](../images/exactdedup.png)
 
-## Summary
+Mapping this model to the transform model is complicated by the need for a hash cache, which the transform model does
+not recognize. The solution is to have the transform runtime create the hash cache and pass it as a parameter to the
+transforms. The transform runtime handles hash cache creation and enhances statistics with details about cache size and
+utilization.
 
-Exact data deduplication is used to identify (and remove) records determined by native documents.
-* It’s O(N2) complexity
-* shuffling with lots of data movement
+### Incremental Execution and Snapshotting
 
-It can be implemented using 2 approaches:
-* Exact string matching
-* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content
+The current implementation includes snapshotting, where the hash cache is saved to storage (local disk or S3) at the
+end of execution. This enables incremental deduplication: you can run deduplication on existing files, save the hash
+cache, and later load the snapshot to deduplicate only new files, avoiding reprocessing the entire dataset.
 
-Implementation here is using “streaming” deduplication, based on central hash:
+## Input Columns Used by This Transform
 
-![](images/exactdedup.png)
+| Input Column Name                                              | Data Type | Description                      |
+|----------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument    | str       | Column that stores document text |
+| Column specified by the _doc_id_column_ configuration argument | int64     | Column that stores document ID   |
 
-* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing
- unique hashes seen so far. 
-* Individual data processors are responsible for: - * Reading data from data plane - * Converting documents into hashes - * Coordinating with distributed hashes cache to remove the duplicates - * Storing unique documents back to the data plane +## Configuration -The complication of mapping this model to transform model is the fact that implementation requires a hash cache, -that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. -and pass it as a parameter to transforms. - -## Transform runtime - -Transform runtime is responsible for creation of the hashes cache. Additionally it -enhances statistics information with the information about hashes cache size and utilization - -## Configuration and command line Options - -The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py) +The set of dictionary keys holding [EdedupTransform](src/ededup_transform_python.py) configuration for values (common for Python and Ray) are as follows: * _doc_column_ - specifies name of the column containing documents * _doc_id_column_ - specifies the name of the column containing a document id -* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used - for the incremental ededup execution -* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default - location (output_folder/snapshot is used) - -## Snapshotting - -In the current implementation we also provide snapshotting. At the end of execution, the content -of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental -execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now -when additional files come in, instead of running dedup on all the files, you can load snapshot -from the previous run and run dedup only on new files - +* _use_snapshot_ - specifies that ededup execution starts with a set of pre-existing hashes, enabling incremental +execution +* _snapshot_directory_ - specifies the directory for reading snapshots. If not provided, the default is +`output_folder/snapshot` -## Available runtimes +## Usage - -## Configuration and command line Options - -See [common](../README.md) ededup parameters - -## Running - -### Launched Command Line Options -The following command line arguments are available in addition to -the options provided by -the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md). -``` +The following command line arguments (corresponding to the configuration keys described above) are available in addition +to the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md). +```text --ededup_doc_column EDEDUP_DOC_COLUMN name of the column containing document --ededup_doc_id_column EDEDUP_DOC_ID_COLUMN @@ -86,7 +60,44 @@ the [python launcher](../../../../data-processing-lib/doc/python-launcher-option --ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY location of snapshot files ``` -These correspond to the configuration keys described above. + +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/ededup_transform_python.py using command line args +* `run-local-sample` - runs src/ededup_local.py + +These targets will activate the virtual environment and set up any configuration needed. 
+Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-cli-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + +### Code example + +[notebook](../ededup.ipynb) + +### Transforming data using the transform image + +To use the transform image to transform your data, please refer to the +[running images quickstart](../../../../doc/quick-start/run-transform-image.md), +substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_ededup_python.py) +- [Integration test](test/test_ededup.py) To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), diff --git a/transforms/universal/ededup/ray/README.md b/transforms/universal/ededup/ray/README.md index 35f96f55c..88fe034d5 100644 --- a/transforms/universal/ededup/ray/README.md +++ b/transforms/universal/ededup/ray/README.md @@ -1,13 +1,11 @@ # Exact Dedup -Please see the set of -[transform project conventions](../../../README.md) -for details on general project conventions, transform configuration, -testing and IDE set up. +Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, +transform configuration, testing and IDE set up. ## Additional parameters -In addition to [common](../README.md) ededup parameters Ray implementation provides two additional ones +In addition to [common](../python/README.md) ededup parameters Ray implementation provides two additional ones * _hash_cpu_ - specifies amount of CPU per hash actor * _num_hashes_ - specifies number of hash actors @@ -19,8 +17,8 @@ We also provide an [estimate](src/cluster_estimator.py) to roughly determine clu ## Running ### Launched Command Line Options -When running the transform with the Ray launcher (i.e. TransformLauncher), -the following command line arguments are available in addition to +When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are +available in addition to [the options provided by the launcher](../../../../data-processing-lib/doc/ray-launcher-options.md). ``` From 2a01b1bc9b9464eb094e0247ba341cd7f706d41b Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Thu, 28 Nov 2024 00:05:51 -0500 Subject: [PATCH 5/6] Added sample notebooks for doc_id and ededup Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/doc_id.ipynb | 182 +++++++++++++++++++++++ transforms/universal/ededup/ededup.ipynb | 175 ++++++++++++++++++++++ 2 files changed, 357 insertions(+) create mode 100644 transforms/universal/doc_id/doc_id.ipynb create mode 100644 transforms/universal/ededup/ededup.ipynb diff --git a/transforms/universal/doc_id/doc_id.ipynb b/transforms/universal/doc_id/doc_id.ipynb new file mode 100644 index 000000000..eb1ffe212 --- /dev/null +++ b/transforms/universal/doc_id/doc_id.ipynb @@ -0,0 +1,182 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. 
Example for transform developers working from git clone:\n", + "```\n", + "make venv \n", + "source venv/bin/activate \n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "%pip install data-prep-toolkit\n", + "%pip install data-prep-toolkit-transforms==0.2.2.dev3" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", + "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", + "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", + "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", + "#####" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "from doc_id_transform_base import (\n", + " doc_column_name_cli_param,\n", + " hash_column_name_cli_param,\n", + " int_column_name_cli_param,\n", + " start_id_cli_param,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # execution info\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", + " # doc id params\n", + " doc_column_name_cli_param: \"contents\",\n", + " hash_column_name_cli_param: \"hash_column\",\n", + " int_column_name_cli_param: \"int_id_column\",\n", + " start_id_cli_param: 5,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/transforms/universal/ededup/ededup.ipynb b/transforms/universal/ededup/ededup.ipynb new file mode 100644 index 000000000..b9d77a7aa --- /dev/null +++ b/transforms/universal/ededup/ededup.ipynb @@ -0,0 +1,175 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", + "```\n", + "make venv \n", + "source venv/bin/activate \n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "%pip install data-prep-toolkit\n", + "%pip install data-prep-toolkit-transforms==0.2.2.dev3" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. 
The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", + "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", + "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", + "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", + "#####" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "from ededup_transform_base import doc_column_name_cli_param, int_column_name_cli_param" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", + " # ededup parameters\n", + " doc_column_name_cli_param: \"contents\",\n", + " int_column_name_cli_param: \"document_id\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ededup_ray", + "language": "python", + "name": "ededup_ray" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 54ced0eb2a2ec935b277391fc4fe9505d6daa4c3 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 4 Dec 2024 17:31:52 -0500 Subject: [PATCH 6/6] Provide config parameter description in the notebook Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/doc_id.ipynb | 10 +++++----- transforms/universal/ededup/ededup.ipynb | 12 +++++++----- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/transforms/universal/doc_id/doc_id.ipynb b/transforms/universal/doc_id/doc_id.ipynb index eb1ffe212..7ecab7d65 100644 --- a/transforms/universal/doc_id/doc_id.ipynb +++ b/transforms/universal/doc_id/doc_id.ipynb @@ -34,11 +34,11 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", - "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", - "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", - "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", - "#####" + "##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n", + "* doc_column - specifies name of the column containing the document (required for ID generation)\n", + "* hash_column - specifies name of the column created to hold the string document id, if None, id is not generated\n", + "* int_id_column - specifies name of the column created to hold the integer document id, if None, id is not generated\n", + "* start_id - an id from which ID generator starts () " ] }, { diff --git a/transforms/universal/ededup/ededup.ipynb b/transforms/universal/ededup/ededup.ipynb index b9d77a7aa..9a84d4c51 100644 --- a/transforms/universal/ededup/ededup.ipynb +++ b/transforms/universal/ededup/ededup.ipynb @@ -34,11 +34,13 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", - "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", - "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", - "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. 
You don't have to set this parameter if you don't need to set bad words.\n", - "#####" + "##### **** Configure the transform parameters. The set of dictionary keys holding EdedupTransform configuration for values are as follows: \n", + "* doc_column - specifies name of the column containing documents\n", + "* doc_id_column - specifies the name of the column containing a document id\n", + "* use_snapshot - specifies that ededup execution starts with a set of pre-existing hashes, enabling incremental\n", + "execution\n", + "* snapshot_directory - specifies the directory for reading snapshots. If not provided, the default is\n", + "`output_folder/snapshot`" ] }, {