From 65382189d3124f8aacd9e391a7a2d267a25ef9f4 Mon Sep 17 00:00:00 2001
From: Constantin M Adam
Date: Tue, 26 Nov 2024 08:53:08 -0500
Subject: [PATCH 1/6] Update doc to follow template in issue #753

Signed-off-by: Constantin M Adam
---
 transforms/universal/doc_id/README.md        | 38 ++-------
 transforms/universal/doc_id/python/README.md | 89 +++++++++++++++++---
 transforms/universal/doc_id/ray/README.md    | 10 ++-
 transforms/universal/doc_id/spark/README.md  | 17 ++--
 4 files changed, 105 insertions(+), 49 deletions(-)

diff --git a/transforms/universal/doc_id/README.md b/transforms/universal/doc_id/README.md
index c5c785353..02564db20 100644
--- a/transforms/universal/doc_id/README.md
+++ b/transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
 # Doc ID Transform
 
-The Document ID transforms adds a document identification (unique integers and content hashes), which later can be
-used in de-duplication operations, per the set of
-[transform project conventions](../../README.md#transform-project-conventions)
-the following runtimes are available:
+The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
+content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
+documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
+runtimes are available:
 
-* [pythom](python/README.md) - enables the running of the base python transformation
- in a Python runtime
-* [ray](ray/README.md) - enables the running of the base python transformation
- in a Ray runtime
-* [spark](spark/README.md) - enables the running of a spark-based transformation
-in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
-
-## Summary
-
-This transform annotates documents with document "ids".
-It supports the following transformations of the original data:
-* Adding document hash: this enables the addition of a document hash-based id to the data.
- The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`.
- To enable this annotation, set `hash_column` to the name of the column,
- where you want to store it.
-* Adding integer document id: this allows the addition of an integer document id to the data that
- is unique across all rows in all tables provided to the `transform()` method.
- To enable this annotation, set `int_id_column` to the name of the column, where you want
- to store it.
-
-Document IDs are generally useful for tracking annotations to specific documents. Additionally
-[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have
-document ID column(s), you can use this transform to create ones.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file.
+
+Please check [here](python/README.md) for a more detailed description of this transform.
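+
+For illustration only, the sketch below mimics the two annotations described above on a pyarrow table, outside of any
+runtime. It is not the transform's actual implementation, and the column names are placeholders:
+
+```python
+# Illustrative sketch only -- not the transform's implementation.
+import hashlib
+
+import pyarrow as pa
+
+
+def annotate_ids(table: pa.Table, doc_column: str, hash_column: str,
+                 int_id_column: str, start_id: int = 0) -> tuple[pa.Table, int]:
+    docs = table.column(doc_column).to_pylist()
+    # Content hash of each document, as described above.
+    hashes = [hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs]
+    table = table.append_column(hash_column, pa.array(hashes))
+    # Integer IDs; returning the next free ID keeps them unique across tables.
+    ids = list(range(start_id, start_id + len(docs)))
+    table = table.append_column(int_id_column, pa.array(ids, type=pa.uint64()))
+    return table, start_id + len(docs)
+```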
diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md
index dbb02093c..bdaf834e3 100644
--- a/transforms/universal/doc_id/python/README.md
+++ b/transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
 # Document ID Python Annotator
 
-Please see the set of
-[transform project conventions](../../../README.md)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
+transform configuration, testing and IDE set up.
 
-## Building
+## Contributors
+- Boris Lublinsky (blublinsk@ibm.com)
 
-A [docker file](Dockerfile) that can be used for building docker image. You can use
+## Description
 
-```shell
-make build
-```
+This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
+original data:
+* **Adding a Document Hash** to each document. The unique hash-based ID is generated using
+`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name using
+the `hash_column` parameter.
+* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by
+the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
+parameter.
+
+Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
+like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
+columns, this transform can be used to generate them.
+
+## Input Columns Used by This Transform
+
+| Input Column Name                               | Data Type | Description                      |
+|-------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ config arg | str       | Column that stores document text |
+
+## Output Columns Annotated by This Transform
+| Output Column Name | Data Type | Description                                 |
+|--------------------|-----------|---------------------------------------------|
+| hash_column        | str       | Unique hash assigned to each document       |
+| int_id_column      | uint64    | Unique integer ID assigned to each document |
 
-## Configuration and command line Options
+## Configuration and Command Line Options
 
-The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py)
+The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_base.py)
 configuration for values are as follows:
 
 * _doc_column_ - specifies name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:
 
 At least one of _hash_column_ or _int_id_column_ must be specified.
 
-## Running
+## Usage
 
 ### Launched Command Line Options
 When running the transform with the Ray launcher (i.e. TransformLauncher),
@@ -43,7 +63,52 @@ the following command line arguments are available in addition to
 ```
 These correspond to the configuration keys described above.
 
+To use the transform image to transform your data, please refer to the
+[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
+substituting the name of this transform image and runtime as appropriate.
+
+## Building
+
+A [docker file](Dockerfile) that can be used for building docker image. 
You can use + +```shell +make build +``` + +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args +* `run-local-sample` - runs src/doc_id_local_python.py + +These targets will activate the virtual environment and set up any configuration needed. +Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-cli-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + +### Code example + +TBD + +### Transforming data using the transform image To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_doc_id_python.py) +- [Integration test](test/test_doc_id.py) diff --git a/transforms/universal/doc_id/ray/README.md b/transforms/universal/doc_id/ray/README.md index c9cb0d15c..ef260f719 100644 --- a/transforms/universal/doc_id/ray/README.md +++ b/transforms/universal/doc_id/ray/README.md @@ -1,10 +1,18 @@ -# Document ID Annotator +# Document ID Ray Annotator Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, transform configuration, testing and IDE set up. +## Summary +This project wraps the Document ID transform with a Ray runtime. + +## Configuration and command line Options +Document ID configuration and command line options are the same as for the base python +transform. + + ## Building A [docker file](Dockerfile) that can be used for building docker image. You can use diff --git a/transforms/universal/doc_id/spark/README.md b/transforms/universal/doc_id/spark/README.md index 932637c54..ace6f79e2 100644 --- a/transforms/universal/doc_id/spark/README.md +++ b/transforms/universal/doc_id/spark/README.md @@ -6,23 +6,26 @@ testing and IDE set up. ## Summary -This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the [monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) pyspark function to generate the unique integer IDs. As described in the documentation of this function: +This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the +[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) +pyspark function to generate the unique integer IDs. As described in the documentation of this function: > The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. ## Configuration and command line Options -The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) -configuration for values are as follows: +The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) configuration for values are as follows: * _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs. 
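+
+For illustration, a minimal PySpark sketch of this approach follows. It is not the transform's implementation, and the
+file paths and column name are placeholders:
+
+```python
+# Sketch only: assign integer doc IDs using the monotonically_increasing_id
+# function described above. Paths and the column name are placeholders.
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import monotonically_increasing_id
+
+spark = SparkSession.builder.appName("doc_id_sketch").getOrCreate()
+df = spark.read.parquet("test-data/input/test1.parquet")
+# IDs are guaranteed unique and monotonically increasing, but not consecutive.
+df = df.withColumn("doc_id", monotonically_increasing_id())
+df.write.mode("overwrite").parquet("output")
+```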
 ## Running
 
-You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the `test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both the new annotated `test1.parquet` file and the `metadata.json` file.
+You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
+`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
+the new annotated `test1.parquet` file and the `metadata.json` file.
 
 ### Launched Command Line Options
-When running the transform with the Spark launcher (i.e. SparkTransformLauncher),
-the following command line arguments are available in addition to
-the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
+When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
+are available in addition to the options provided by the
+[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
 
 ```
 --doc_id_column_name DOC_ID_COLUMN_NAME

From 0f96b6119f2ca844146a4c0da9ffb6b20dc3c072 Mon Sep 17 00:00:00 2001
From: Constantin M Adam
Date: Tue, 26 Nov 2024 09:07:09 -0500
Subject: [PATCH 2/6] Update doc to follow template in issue #753

Signed-off-by: Constantin M Adam
---
 transforms/universal/doc_id/python/README.md | 20 ++++----------------
 transforms/universal/doc_id/ray/README.md    |  4 ++--
 transforms/universal/doc_id/spark/README.md  |  5 ++---
 3 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md
index bdaf834e3..0f1b783c4 100644
--- a/transforms/universal/doc_id/python/README.md
+++ b/transforms/universal/doc_id/python/README.md
@@ -18,14 +18,14 @@ the `transform()` method. To store this ID in the data, specify the desired colu
 parameter.
 
 Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
-like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
+like [fuzzy deduplication](../../fdedup/README.md), which depend on the presence of integer IDs. If your dataset lacks document ID
 columns, this transform can be used to generate them.
 
 ## Input Columns Used by This Transform
 
-| Input Column Name                               | Data Type | Description                      |
-|-------------------------------------------------|-----------|----------------------------------|
-| Column specified by the _doc_column_ config arg | str       | Column that stores document text |
+| Input Column Name                                           | Data Type | Description                      |
+|-------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument | str       | Column that stores document text |
 
 ## Output Columns Annotated by This Transform
 | Output Column Name | Data Type | Description                                 |
@@ -63,18 +63,6 @@ the following command line arguments are available in addition to
 ```
 These correspond to the configuration keys described above.
 
-To use the transform image to transform your data, please refer to the
-[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
-substituting the name of this transform image and runtime as appropriate.
- -## Building - -A [docker file](Dockerfile) that can be used for building docker image. You can use - -```shell -make build -``` - ### Running the samples To run the samples, use the following `make` targets diff --git a/transforms/universal/doc_id/ray/README.md b/transforms/universal/doc_id/ray/README.md index ef260f719..438c6a16d 100644 --- a/transforms/universal/doc_id/ray/README.md +++ b/transforms/universal/doc_id/ray/README.md @@ -9,9 +9,9 @@ testing and IDE set up. This project wraps the Document ID transform with a Ray runtime. ## Configuration and command line Options -Document ID configuration and command line options are the same as for the base python -transform. +Document ID configuration and command line options are the same as for the +[base python transform](../python/README.md). ## Building diff --git a/transforms/universal/doc_id/spark/README.md b/transforms/universal/doc_id/spark/README.md index ace6f79e2..92a5f23d1 100644 --- a/transforms/universal/doc_id/spark/README.md +++ b/transforms/universal/doc_id/spark/README.md @@ -13,9 +13,8 @@ pyspark function to generate the unique integer IDs. As described in the documen ## Configuration and command line Options -The set of dictionary keys holding [DocIdTransform](src/doc_id_transform.py) configuration for values are as follows: - -* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs. +Document ID configuration and command line options are the same as for the +[base python transform](../python/README.md). ## Running You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the From bab7c56bef94329e707ed8a0f917c76baccde063 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 27 Nov 2024 15:42:10 -0500 Subject: [PATCH 3/6] Updating docs for ededup Signed-off-by: Constantin M Adam --- transforms/universal/ededup/README.md | 68 ++------------------ transforms/universal/ededup/python/README.md | 66 +++++++++++++++++-- 2 files changed, 68 insertions(+), 66 deletions(-) diff --git a/transforms/universal/ededup/README.md b/transforms/universal/ededup/README.md index 0390cc19c..0a2f58af6 100644 --- a/transforms/universal/ededup/README.md +++ b/transforms/universal/ededup/README.md @@ -1,65 +1,11 @@ # Exact Deduplication Transform -## Summary - -Exact data deduplication is used to identify (and remove) records determined by native documents. -* It’s O(N2) complexity -* shuffling with lots of data movement - -It can be implemented using 2 approaches: -* Exact string matching -* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content - -Implementation here is using “streaming” deduplication, based on central hash: - -![](images/exactdedup.png) - -* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing - unique hashes seen so far. -* Individual data processors are responsible for: - * Reading data from data plane - * Converting documents into hashes - * Coordinating with distributed hashes cache to remove the duplicates - * Storing unique documents back to the data plane - -The complication of mapping this model to transform model is the fact that implementation requires a hash cache, -that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. -and pass it as a parameter to transforms. 
-
-## Transform runtime
-
-Transform runtime is responsible for creation of the hashes cache. Additionally it
-enhances statistics information with the information about hashes cache size and utilization
-
-## Configuration and command line Options
-
-The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py)
-configuration for values (common for Python and Ray) are as follows:
-
-* _doc_column_ - specifies name of the column containing documents
-* _doc_id_column_ - specifies the name of the column containing a document id
-* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used
- for the incremental ededup execution
-* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default
- location (output_folder/snapshot is used)
-
-## Snapshotting
-
-In the current implementation we also provide snapshotting. At the end of execution, the content
-of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental
-execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now
-when additional files come in, instead of running dedup on all the files, you can load snapshot
-from the previous run and run dedup only on new files
-
-
-## Available runtimes
-
-As per [transform project conventions](../../README.md#transform-project-conventions)
+The exact deduplication transform identifies and removes identical documents in a dataset by comparing them hash-for-hash
+to ensure exact matching. Per the set of [transform project conventions](../../README.md#transform-project-conventions)
 the following runtimes are available:
 
-* [python](python/README.md) - enables running of the base python transformation
- in a Python runtime
-* [ray](ray/README.md) - enables running of the base python transformation
-in a Ray runtime
-* [kfp](kfp_ray/README.md) - enables running the ray docker image
-in a kubernetes cluster using a generated `yaml` file.
+* [python](python/README.md) - enables running the base python transform in a Python runtime
+* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file.
+
+Please see [here](python/README.md) for a more detailed description of this transform.
diff --git a/transforms/universal/ededup/python/README.md b/transforms/universal/ededup/python/README.md
index 4a10e9f83..dbe6d37cf 100644
--- a/transforms/universal/ededup/python/README.md
+++ b/transforms/universal/ededup/python/README.md
@@ -1,15 +1,71 @@
 # Ededup Python Transform
 
-Please see the set of
-[transform project conventions](../../../README.md#transform-project-conventions)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
+Please see the set of [transform project conventions](../../../README.md#transform-project-conventions) for details on
+general project conventions, transform configuration, testing and IDE set up.
 
-Also see [here](../ray/README.md) on details of implementation
 
 ## Summary
 This is a python version of ededup
+
+* As shown in the figure below, the implementation of exact dedup relies on a (distributed)
+hash cache and a set of individual data processors that read documents and convert them into hashes
+
+## Summary
+
+Exact data deduplication is used to identify (and remove) records determined by native documents. 
+* It’s O(N2) complexity +* shuffling with lots of data movement + +It can be implemented using 2 approaches: +* Exact string matching +* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content + +Implementation here is using “streaming” deduplication, based on central hash: + +![](images/exactdedup.png) + +* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing + unique hashes seen so far. +* Individual data processors are responsible for: + * Reading data from data plane + * Converting documents into hashes + * Coordinating with distributed hashes cache to remove the duplicates + * Storing unique documents back to the data plane + +The complication of mapping this model to transform model is the fact that implementation requires a hash cache, +that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. +and pass it as a parameter to transforms. + +## Transform runtime + +Transform runtime is responsible for creation of the hashes cache. Additionally it +enhances statistics information with the information about hashes cache size and utilization + +## Configuration and command line Options + +The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py) +configuration for values (common for Python and Ray) are as follows: + +* _doc_column_ - specifies name of the column containing documents +* _doc_id_column_ - specifies the name of the column containing a document id +* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used + for the incremental ededup execution +* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default + location (output_folder/snapshot is used) + +## Snapshotting + +In the current implementation we also provide snapshotting. At the end of execution, the content +of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental +execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now +when additional files come in, instead of running dedup on all the files, you can load snapshot +from the previous run and run dedup only on new files + + +## Available runtimes + + ## Configuration and command line Options See [common](../README.md) ededup parameters From 28b0e54622f1e4051d0e77edc2f40a8f7b13ab17 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 27 Nov 2024 23:48:03 -0500 Subject: [PATCH 4/6] Added notebook pointer Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/python/README.md | 2 +- transforms/universal/ededup/python/README.md | 129 ++++++++++--------- transforms/universal/ededup/ray/README.md | 12 +- 3 files changed, 76 insertions(+), 67 deletions(-) diff --git a/transforms/universal/doc_id/python/README.md b/transforms/universal/doc_id/python/README.md index 0f1b783c4..7941f4bc7 100644 --- a/transforms/universal/doc_id/python/README.md +++ b/transforms/universal/doc_id/python/README.md @@ -85,7 +85,7 @@ To see results of the transform. 
 ### Code example
 
-TBD
+[notebook](../doc_id.ipynb)
 
 ### Transforming data using the transform image
 
diff --git a/transforms/universal/ededup/python/README.md b/transforms/universal/ededup/python/README.md
index dbe6d37cf..ac15e64d3 100644
--- a/transforms/universal/ededup/python/README.md
+++ b/transforms/universal/ededup/python/README.md
@@ -3,80 +3,54 @@
 Please see the set of [transform project conventions](../../../README.md#transform-project-conventions) for details on
 general project conventions, transform configuration, testing and IDE set up.
 
+## Contributors
+- Boris Lublinsky (blublinsk@ibm.com)
 
-## Summary
-This is a python version of ededup
+## Description
 
+This Python implementation of the exact deduplication transform uses "streaming" deduplication based on a central hash
+cache. As shown below, it relies on a distributed hash cache and data processors that read documents, generate hashes,
+coordinate with the cache to remove duplicates, and store unique documents in the data plane.
 
-* As shown in the figure below, the implementation of exact dedup relies on a (distributed)
-hash cache and a set of individual data processors that read documents and convert them into hashes
+![](../images/exactdedup.png)
 
-## Summary
+Mapping this model to the transform model is complicated by the need for a hash cache, which the transform model does
+not recognize. The solution is to have the transform runtime create the hash cache and pass it as a parameter to the
+transforms. The transform runtime handles hash cache creation and enhances statistics with details about cache size and
+utilization.
 
-Exact data deduplication is used to identify (and remove) records determined by native documents.
-* It’s O(N2) complexity
-* shuffling with lots of data movement
+### Incremental Execution and Snapshotting
 
-It can be implemented using 2 approaches:
-* Exact string matching
-* Hash-based matching (ASSUMPTION: a hash is unique to each native document.) – moving hash value is cheaper than moving full content
+The current implementation includes snapshotting, where the hash cache is saved to storage (local disk or S3) at the
+end of execution. This enables incremental deduplication: you can run deduplication on existing files, save the hash
+cache, and later load the snapshot to deduplicate only new files, avoiding reprocessing the entire dataset.
 
-Implementation here is using “streaming” deduplication, based on central hash:
+## Input Columns Used by This Transform
 
-![](images/exactdedup.png)
+| Input Column Name                                              | Data Type | Description                      |
+|----------------------------------------------------------------|-----------|----------------------------------|
+| Column specified by the _doc_column_ configuration argument    | str       | Column that stores document text |
+| Column specified by the _doc_id_column_ configuration argument | int64     | Column that stores document ID   |
 
-* At the heart of the implementation is a hash cache implemented as a set of Ray actors and containing
- unique hashes seen so far. 
-* Individual data processors are responsible for: - * Reading data from data plane - * Converting documents into hashes - * Coordinating with distributed hashes cache to remove the duplicates - * Storing unique documents back to the data plane +## Configuration -The complication of mapping this model to transform model is the fact that implementation requires a hash cache, -that transform mode knows nothing about. The solution here is to use transform runtime to create haches cache. -and pass it as a parameter to transforms. - -## Transform runtime - -Transform runtime is responsible for creation of the hashes cache. Additionally it -enhances statistics information with the information about hashes cache size and utilization - -## Configuration and command line Options - -The set of dictionary keys holding [EdedupTransform](src/ededup_transform_ray.py) +The set of dictionary keys holding [EdedupTransform](src/ededup_transform_python.py) configuration for values (common for Python and Ray) are as follows: * _doc_column_ - specifies name of the column containing documents * _doc_id_column_ - specifies the name of the column containing a document id -* _use_snapshot_ - specifies that ededup execution starts from a set of already seen hashes. This can be used - for the incremental ededup execution -* _snapshot_directory_ - specifies a directory from which snapshots are read. If this is not specified, a default - location (output_folder/snapshot is used) - -## Snapshotting - -In the current implementation we also provide snapshotting. At the end of execution, the content -of the hash cache to storage (local disk or S3). The reason this is done is to enable incremental -execution of dedup. You can run dedup on a set of existing files and snapshot the hash cache. Now -when additional files come in, instead of running dedup on all the files, you can load snapshot -from the previous run and run dedup only on new files - +* _use_snapshot_ - specifies that ededup execution starts with a set of pre-existing hashes, enabling incremental +execution +* _snapshot_directory_ - specifies the directory for reading snapshots. If not provided, the default is +`output_folder/snapshot` -## Available runtimes +## Usage - -## Configuration and command line Options - -See [common](../README.md) ededup parameters - -## Running - -### Launched Command Line Options -The following command line arguments are available in addition to -the options provided by -the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md). -``` +The following command line arguments (corresponding to the configuration keys described above) are available in addition +to the options provided by the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md). +```text --ededup_doc_column EDEDUP_DOC_COLUMN name of the column containing document --ededup_doc_id_column EDEDUP_DOC_ID_COLUMN @@ -86,7 +60,44 @@ the [python launcher](../../../../data-processing-lib/doc/python-launcher-option --ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY location of snapshot files ``` -These correspond to the configuration keys described above. + +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/ededup_transform_python.py using command line args +* `run-local-sample` - runs src/ededup_local.py + +These targets will activate the virtual environment and set up any configuration needed. 
+Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-cli-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + +### Code example + +[notebook](../ededup.ipynb) + +### Transforming data using the transform image + +To use the transform image to transform your data, please refer to the +[running images quickstart](../../../../doc/quick-start/run-transform-image.md), +substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_ededup_python.py) +- [Integration test](test/test_ededup.py) To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), diff --git a/transforms/universal/ededup/ray/README.md b/transforms/universal/ededup/ray/README.md index 35f96f55c..88fe034d5 100644 --- a/transforms/universal/ededup/ray/README.md +++ b/transforms/universal/ededup/ray/README.md @@ -1,13 +1,11 @@ # Exact Dedup -Please see the set of -[transform project conventions](../../../README.md) -for details on general project conventions, transform configuration, -testing and IDE set up. +Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, +transform configuration, testing and IDE set up. ## Additional parameters -In addition to [common](../README.md) ededup parameters Ray implementation provides two additional ones +In addition to [common](../python/README.md) ededup parameters Ray implementation provides two additional ones * _hash_cpu_ - specifies amount of CPU per hash actor * _num_hashes_ - specifies number of hash actors @@ -19,8 +17,8 @@ We also provide an [estimate](src/cluster_estimator.py) to roughly determine clu ## Running ### Launched Command Line Options -When running the transform with the Ray launcher (i.e. TransformLauncher), -the following command line arguments are available in addition to +When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are +available in addition to [the options provided by the launcher](../../../../data-processing-lib/doc/ray-launcher-options.md). ``` From 2a01b1bc9b9464eb094e0247ba341cd7f706d41b Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Thu, 28 Nov 2024 00:05:51 -0500 Subject: [PATCH 5/6] Added sample notebooks for doc_id and ededup Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/doc_id.ipynb | 182 +++++++++++++++++++++++ transforms/universal/ededup/ededup.ipynb | 175 ++++++++++++++++++++++ 2 files changed, 357 insertions(+) create mode 100644 transforms/universal/doc_id/doc_id.ipynb create mode 100644 transforms/universal/ededup/ededup.ipynb diff --git a/transforms/universal/doc_id/doc_id.ipynb b/transforms/universal/doc_id/doc_id.ipynb new file mode 100644 index 000000000..eb1ffe212 --- /dev/null +++ b/transforms/universal/doc_id/doc_id.ipynb @@ -0,0 +1,182 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. 
Example for transform developers working from git clone:\n", + "```\n", + "make venv \n", + "source venv/bin/activate \n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "%pip install data-prep-toolkit\n", + "%pip install data-prep-toolkit-transforms==0.2.2.dev3" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", + "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", + "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", + "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", + "#####" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "from doc_id_transform_base import (\n", + " doc_column_name_cli_param,\n", + " hash_column_name_cli_param,\n", + " int_column_name_cli_param,\n", + " start_id_cli_param,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # execution info\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", + " # doc id params\n", + " doc_column_name_cli_param: \"contents\",\n", + " hash_column_name_cli_param: \"hash_column\",\n", + " int_column_name_cli_param: \"int_id_column\",\n", + " start_id_cli_param: 5,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=DocIDPythonTransformRuntimeConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/transforms/universal/ededup/ededup.ipynb b/transforms/universal/ededup/ededup.ipynb new file mode 100644 index 000000000..b9d77a7aa --- /dev/null +++ b/transforms/universal/ededup/ededup.ipynb @@ -0,0 +1,175 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", + "```\n", + "make venv \n", + "source venv/bin/activate \n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "%pip install data-prep-toolkit\n", + "%pip install data-prep-toolkit-transforms==0.2.2.dev3" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. 
The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", + "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", + "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", + "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", + "#####" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "from ededup_transform_base import doc_column_name_cli_param, int_column_name_cli_param" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", + " # ededup parameters\n", + " doc_column_name_cli_param: \"contents\",\n", + " int_column_name_cli_param: \"document_id\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ededup_ray", + "language": "python", + "name": "ededup_ray" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 54ced0eb2a2ec935b277391fc4fe9505d6daa4c3 Mon Sep 17 00:00:00 2001 From: Constantin M Adam Date: Wed, 4 Dec 2024 17:31:52 -0500 Subject: [PATCH 6/6] Provide config parameter description in the notebook Signed-off-by: Constantin M Adam --- transforms/universal/doc_id/doc_id.ipynb | 10 +++++----- transforms/universal/ededup/ededup.ipynb | 12 +++++++----- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/transforms/universal/doc_id/doc_id.ipynb b/transforms/universal/doc_id/doc_id.ipynb index eb1ffe212..7ecab7d65 100644 --- a/transforms/universal/doc_id/doc_id.ipynb +++ b/transforms/universal/doc_id/doc_id.ipynb @@ -34,11 +34,11 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", - "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", - "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", - "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.\n", - "#####" + "##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n", + "* doc_column - specifies name of the column containing the document (required for ID generation)\n", + "* hash_column - specifies name of the column created to hold the string document id, if None, id is not generated\n", + "* int_id_column - specifies name of the column created to hold the integer document id, if None, id is not generated\n", + "* start_id - an id from which ID generator starts () " ] }, { diff --git a/transforms/universal/ededup/ededup.ipynb b/transforms/universal/ededup/ededup.ipynb index b9d77a7aa..9a84d4c51 100644 --- a/transforms/universal/ededup/ededup.ipynb +++ b/transforms/universal/ededup/ededup.ipynb @@ -34,11 +34,13 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", - "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", - "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", - "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. 
You don't have to set this parameter if you don't need to set bad words.\n", - "#####" + "##### **** Configure the transform parameters. The set of dictionary keys holding EdedupTransform configuration for values are as follows: \n", + "* doc_column - specifies name of the column containing documents\n", + "* doc_id_column - specifies the name of the column containing a document id\n", + "* use_snapshot - specifies that ededup execution starts with a set of pre-existing hashes, enabling incremental\n", + "execution\n", + "* snapshot_directory - specifies the directory for reading snapshots. If not provided, the default is\n", + "`output_folder/snapshot`" ] }, {