From 569b08093f76eed48707cd446e4a4f487c5ef4a8 Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Fri, 22 Nov 2024 15:34:58 +0900
Subject: [PATCH 01/10] Update README.md

update readme for hap transform
---
 transforms/universal/hap/python/README.md | 56 ++++++++++++++++-------
 1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md
index 29d54d999..490cd8c37 100644
--- a/transforms/universal/hap/python/README.md
+++ b/transforms/universal/hap/python/README.md
@@ -1,10 +1,11 @@
 # Hate, Abuse, and Profanity (HAP) Annotation
 Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

-## Prerequisite
+## Description
+### Prerequisite
 This transform requires [NLTK](https://www.nltk.org/); please refer to `requirements.txt` for the complete set of dependencies.

-## Summary
+### Overview
 The hap transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the hap transform performs the following three steps to calculate the hap score for each document:

 * Sentence splitting: we use NLTK to split the document into sentence pieces.
@@ -12,18 +13,7 @@ The hap transform maps a non-empty input table to an output table with an added
 * Aggregation: the document hap score is determined by selecting the maximum hap score among its sentences.


-## Configuration and command line Options
-The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py)
-configuration for values are as follows:
-
-* --model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
-* --batch_size - modify it based on the infrastructure capacity. Defaults to `128`.
-* --max_length - the maximum length for the tokenizer. Defaults to `512`.
-* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
-* --annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to `hap_score`.
-
-
-## input format
+### Input format
 The input is in .parquet format and contains the following columns:

| doc_id | contents |
|:------:|:------:|
| 1 | GSC is very much a little Swiss Army knife for... |
| 2 | Here are only a few examples. And no, I'm not ... |

-## output format
+
+### Output format
 The output is in .parquet format and includes one additional column, `hap_score`, on top of those in the input:

| doc_id | contents | hap_score |
|:------:|:------:|:-------:|
| 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
| 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |

-## How to run
+## Configuration
+The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py)
+configuration values are as follows:
+
+
+* --model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
+* --batch_size - adjust it based on the available infrastructure capacity. Defaults to `128`.
+* --max_length - the maximum length for the tokenizer. Defaults to `512`.
+* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
+* --annotation_column - the column name containing the hap (toxicity) score in the output .parquet file. Defaults to `hap_score`.
+
+
+
+
+## Usage
 Place your input Parquet file in the `test-data/input/` directory. A sample file, `test1.parquet`, is available in this directory. Once done, run the script.

```python
@@ -48,6 +53,22 @@ python hap_local_python.py

You will obtain the output file `test1.parquet` in the output directory.

+### Code example
+TBD (link to the notebook will be provided)
+
+See the sample script [src/hap_local_python.py](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/hap/python/src/hap_local_python.py).
+
+### Transforming data using the transform image
+To use the transform image to transform your data, please refer to the
+[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
+substituting the name of this transform image and runtime as appropriate.
+
+## Testing
+
+Currently we have:
+- [hap test](transforms/universal/hap/python/test/test_hap.py)
+
+
 ## Throughput
 The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We compare two models:
@@ -62,6 +83,7 @@ We processed 6,000 documents (12 MB in Parquet file size) using the HAP transfor

 | granite-guardian-hap-125m | 1.14 k |

-
+### Credits
+The HAP transform is jointly developed by IBM Research - Tokyo and IBM Research - Yorktown Heights.

From 27c854da2d336f918d257f4c69712ef011e92a14 Mon Sep 17 00:00:00 2001
From: Aanchal Goyal
Date: Mon, 2 Dec 2024 13:15:18 +0530
Subject: [PATCH 02/10] Updated Resources webpage with latest talks and links

Signed-off-by: Aanchal Goyal
---
 resources.md | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/resources.md b/resources.md
index 4f5657a02..fb413e38b 100644
--- a/resources.md
+++ b/resources.md
@@ -1,3 +1,8 @@
+# New Features & Enhancements
+
+- Support for Docling 2.0 added to DPK in the [pdf2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet/python) transform. The new updates allow DPK users to ingest other types of documents, e.g. MS Word, MS PowerPoint, images, Markdown, AsciiDoc, etc.
+- Released the [Web2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/web2parquet) transform for crawling the web.
+
 # Data Prep Kit Resources

 ## 📄 Papers
@@ -7,24 +12,43 @@
 3. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)


-## 🎤 Talks
+## 🎤 External Events and Showcase
 1. **"Building Successful LLM Apps: The Power of high quality data"** - [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) | [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
 2. **"Hands on session for fine tuning LLMs"** - [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
 3. **"Build your own data preparation module using data-prep-kit"** - [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)
 4. **"Data Prep Kit: A Comprehensive Cloud-Native Toolkit for Scalable Data Preparation in GenAI App"** - [Video](https://www.youtube.com/watch?v=WJ147TGULwo) | [Slides](https://ossaidevjapan24.sched.com/event/1jKBm)
5. **"RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
6. **Tech Educator Summit** - [IBM CSR Event](https://www.linkedin.com/posts/aanchalaggarwal_github-ibmdata-prep-kit-open-source-project-activity-7254062098295472128-OA_x?utm_source=share&utm_medium=member_desktop)
7. **Talk and hands-on session** at [MIT Bangalore](https://www.linkedin.com/posts/saptha-surendran-71a4a0ab_ibmresearch-dataprepkit-llms-activity-7261987741087801346-h0no?utm_source=share&utm_medium=member_desktop)
8. **PyData NYC 2024** - [90 mins Tutorial](https://nyc2024.pydata.org/cfp/talk/AWLTZP/)
9. **Open Source AI** [Demo Night](https://lu.ma/oss-ai?tk=A8BgIt)
10. [**Data Exchange Podcast with Ben Lorica**](https://thedataexchange.media/ibm-data-prep-kit/)
11. Unstructured Data Meetup - SF, NYC, Silicon Valley
12. IBM TechXchange Las Vegas
13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[%E2%80%A6]65125349376-FG8E/?utm_source=share&utm_medium=member_desktop)


## Example Code
Find example code in the README of each transform, and find sample Jupyter notebooks for getting started [**here**](examples/notebooks).

## Blogs / Tutorials
- [**IBM Developer Blog**](https://developer.ibm.com/blogs/awb-unleash-potential-llms-data-prep-kit/)
- [**Introductory Blog on DPK**](https://www.linkedin.com/pulse/unleashing-potential-large-language-models-through-data-aanchal-goyal-fgtff)
- [**DPK Header Cleanser Module Blog by external contributor**](https://www.linkedin.com/pulse/enhancing-data-quality-developing-header-cleansing-tool-kalathiya-i1ohc/?trackingId=6iAeBkBBRrOLijg3LTzIGA%3D%3D)

-## Workshops
-- **2024-09-21: "RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
-
-## Discord
+# Relevant online communities
- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476)
- [**DPK is now listed in Github Awesome-LLM under LLM Data section**](https://github.com/Hannibal046/Awesome-LLM)
- [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/)
- [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/)

## We Want Your Feedback!
Feel free to contribute to existing discussions or create a new one to share your [feedback](https://github.com/IBM/data-prep-kit/discussions).

From ddfa7b8973b149735bea6a1b676264b7a69443c8 Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Mon, 2 Dec 2024 23:02:42 +0900
Subject: [PATCH 03/10] Add files via upload

---
 .../universal/hap/python/hap_python.ipynb     | 158 ++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 transforms/universal/hap/python/hap_python.ipynb

diff --git a/transforms/universal/hap/python/hap_python.ipynb b/transforms/universal/hap/python/hap_python.ipynb
new file mode 100644
index 000000000..deb147341
--- /dev/null
+++ b/transforms/universal/hap/python/hap_python.ipynb
@@ -0,0 +1,158 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "38aebf49-9460-4951-bb04-7045dec28690",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt_tab to /Users/ian/nltk_data...\n",
      "[nltk_data] Package punkt_tab is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# import necessary packages\n",
    "import ast\n",
    "import os\n",
    "import sys\n",
    "\n",
    "from data_processing.runtime.pure_python import PythonTransformLauncher\n",
    "from data_processing.utils import ParamsUtils\n",
    "from hap_transform_python import HAPPythonTransformConfiguration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6a8ec5e4-1f52-4c61-9c9e-4618f9034b80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create parameters\n",
    "__file__ = os.getcwd()\n",
    "input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../test-data/input\"))\n",
    "output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../output\"))\n",
    "local_conf = {\n",
    "    \"input_folder\": input_folder,\n",
    "    \"output_folder\": output_folder,\n",
    "}\n",
    "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
    "\n",
    "params = {\n",
    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
    "    \"runtime_job_id\": \"job_id\",\n",
    "    \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
    "}\n",
    "\n",
    "\n",
    "hap_params = {\n",
    "    \"model_name_or_path\": 'ibm-granite/granite-guardian-hap-38m',\n",
    "    \"annotation_column\": \"hap_score\",\n",
    "    \"doc_text_column\": \"contents\",\n",
    "    \"inference_engine\": \"CPU\",\n",
    "    \"max_length\": 512,\n",
    "    \"batch_size\": 128,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "94e908e2-1891-4dc7-9f85-85bbf8d44c5e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22:40:12 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n",
      "22:40:12 INFO - pipeline id pipeline_id\n",
      "22:40:12 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n",
      "22:40:12 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output\n",
      "22:40:12 INFO - data factory data_ max_files -1, n_sample -1\n",
      "22:40:12 INFO - data
factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:40:12 INFO - orchestrator hap started at 2024-12-02 22:40:12\n", + "22:40:12 ERROR - No input files to process - exiting\n", + "22:40:12 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Set the simulated command line args\n", + "sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n", + "# Launch to process the input\n", + "launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "f21d5d9b-562d-4530-8cea-2de5b63eb1dc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['../output/metadata.json', '../output/test1.parquet']" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# the outputs will be located in the following folders\n", + "import glob\n", + "glob.glob(\"../output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2cd3367a-205f-4d33-83fb-106e32173bc0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From dea69e247c9e6a75a0426b70913703cad0407843 Mon Sep 17 00:00:00 2001 From: ian-cho <42691703+ian-cho@users.noreply.github.com> Date: Mon, 2 Dec 2024 23:03:58 +0900 Subject: [PATCH 04/10] Update README.md 1. added contributor 2. added hap notebook --- transforms/universal/hap/python/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md index 490cd8c37..4feadfbf4 100644 --- a/transforms/universal/hap/python/README.md +++ b/transforms/universal/hap/python/README.md @@ -1,6 +1,9 @@ # Hate, Abuse, and Profanity (HAP) Annotation Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up. +## Contributor +- Yang Zhao (yangzhao@ibm.com) + ## Description ### Prerequisite This repository needs [NLTK](https://www.nltk.org/) and please refer to `requirements.txt`. @@ -54,9 +57,7 @@ python hap_local_python.py You will obtain the output file `test1.parquet` in the output directory. ### Code example -TBD (link to the notebook will be provided) - -See the sample script [src/hap_local_python.py](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/hap/python/src/hap_local_python.py). 
+[notebook](../hap_python.ipynb)

### Transforming data using the transform image
To use the transform image to transform your data, please refer to the

From 8e3f3bd4940debe35a944ff3b040193f49688a9a Mon Sep 17 00:00:00 2001
From: ian-cho <42691703+ian-cho@users.noreply.github.com>
Date: Tue, 3 Dec 2024 11:33:51 +0900
Subject: [PATCH 05/10] Add files via upload

---
 .../universal/hap/python/hap_python.ipynb     | 111 +++++++++++++-----
 1 file changed, 84 insertions(+), 27 deletions(-)

diff --git a/transforms/universal/hap/python/hap_python.ipynb b/transforms/universal/hap/python/hap_python.ipynb
index deb147341..ad0c42a02 100644
--- a/transforms/universal/hap/python/hap_python.ipynb
+++ b/transforms/universal/hap/python/hap_python.ipynb
@@ -1,8 +1,54 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cefa9cf6-e043-4b75-b416-a0b26c8cb3ad",
   "metadata": {},
   "source": [
    "**** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "    make venv \n",
    "    source venv/bin/activate \n",
    "    pip install jupyterlab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4a84e965-feeb-424d-9263-9f127e53a1aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "%pip install data-prep-toolkit\n",
    "%pip install data-prep-toolkit-transforms==0.2.2.dev3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d695832-16bc-48d3-a9c3-6ce650ae4a5c",
   "metadata": {},
   "source": [
    "**** Configure the transform parameters. The set of dictionary keys holding HAPTransform configuration values are as follows:\n",
    "    - model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier ibm-granite/granite-guardian-hap-38m.\n",
    "    - annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to hap_score.\n",
    "    - doc_text_column - the column name containing the document text in the input .parquet file. Defaults to contents.\n",
    "    - batch_size - modify it based on the infrastructure capacity. Defaults to 128.\n",
    "    - max_length - the maximum length for the tokenizer. Defaults to 512."
+ ] + }, + { + "cell_type": "markdown", + "id": "3f9dbf94-2db4-492d-bbcb-53ac3948c256", + "metadata": {}, + "source": [ + "***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": 2, "id": "38aebf49-9460-4951-bb04-7045dec28690", "metadata": {}, "outputs": [ @@ -16,7 +62,6 @@ } ], "source": [ - "# import necessary packages\n", "import ast\n", "import os\n", "import sys\n", @@ -26,9 +71,17 @@ "from hap_transform_python import HAPPythonTransformConfiguration" ] }, + { + "cell_type": "markdown", + "id": "f443108f-40e4-40e5-a052-e8a7f4fbccdf", + "metadata": {}, + "source": [ + "***** Setup runtime parameters for this transform" + ] + }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "id": "6a8ec5e4-1f52-4c61-9c9e-4618f9034b80", "metadata": {}, "outputs": [], @@ -61,9 +114,17 @@ "}" ] }, + { + "cell_type": "markdown", + "id": "d70abda8-3d66-4328-99ce-4075646a7756", + "metadata": {}, + "source": [ + "***** Use python runtime to invoke the transform" + ] + }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "94e908e2-1891-4dc7-9f85-85bbf8d44c5e", "metadata": {}, "outputs": [ @@ -71,40 +132,36 @@ "name": "stderr", "output_type": "stream", "text": [ - "22:40:12 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n", - "22:40:12 INFO - pipeline id pipeline_id\n", - "22:40:12 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", - "22:40:12 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output\n", - "22:40:12 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:40:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:40:12 INFO - orchestrator hap started at 2024-12-02 22:40:12\n", - "22:40:12 ERROR - No input files to process - exiting\n", - "22:40:12 INFO - Completed execution in 0.0 min, execution result 0\n" + "11:29:11 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n", + "11:29:11 INFO - pipeline id pipeline_id\n", + "11:29:11 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", + "11:29:11 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output\n", + "11:29:11 INFO - data factory data_ max_files -1, n_sample -1\n", + "11:29:11 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "11:29:11 INFO - orchestrator hap started at 2024-12-03 11:29:11\n", + "11:29:11 ERROR - No input files to process - exiting\n", + "11:29:11 INFO - Completed execution in 0.0 min, execution result 0\n" ] - }, - { - "data": { - "text/plain": [ - "0" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ - "# Set the 
simulated command line args\n", + "%%capture\n", "sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)\n", - "# create launcher\n", "launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n", - "# Launch to process the input\n", "launcher.launch()" ] }, + { + "cell_type": "markdown", + "id": "0bd4ad5c-a1d9-4ea2-abb7-e43571095392", + "metadata": {}, + "source": [ + "**** The specified folder will include the transformed parquet files." + ] + }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "id": "f21d5d9b-562d-4530-8cea-2de5b63eb1dc", "metadata": {}, "outputs": [ @@ -114,7 +171,7 @@ "['../output/metadata.json', '../output/test1.parquet']" ] }, - "execution_count": 4, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } From 083c230aae7ff3b0cafe322b0be86e58bfe5ed0b Mon Sep 17 00:00:00 2001 From: ian-cho <42691703+ian-cho@users.noreply.github.com> Date: Tue, 3 Dec 2024 13:58:19 +0900 Subject: [PATCH 06/10] Update README.md --- transforms/universal/hap/python/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/universal/hap/python/README.md b/transforms/universal/hap/python/README.md index 4feadfbf4..2cc8504d2 100644 --- a/transforms/universal/hap/python/README.md +++ b/transforms/universal/hap/python/README.md @@ -57,7 +57,7 @@ python hap_local_python.py You will obtain the output file `test1.parquet` in the output directory. ### Code example -[notebook](../hap_python.ipynb) +[notebook](./hap_python.ipynb) ### Transforming data using the transform image To use the transform image to transform your data, please refer to the From 66efc18dd22327ab749bdd74dbb8aba85d93a8b3 Mon Sep 17 00:00:00 2001 From: Aanchal Goyal Date: Tue, 3 Dec 2024 11:03:17 +0530 Subject: [PATCH 07/10] Updated links to 15 and Discord Signed-off-by: Aanchal Goyal --- resources.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/resources.md b/resources.md index fb413e38b..9263dd183 100644 --- a/resources.md +++ b/resources.md @@ -28,7 +28,7 @@ 12. IBM TechXchange Las Vegas 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/) -15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[%E2%80%A6]65125349376-FG8E/?utm_source=share&utm_medium=member_desktop) +15. 
DPK tutorial and hands on session at IIIT Delhi ## Example Code @@ -43,7 +43,7 @@ Find example code in readme section of each tranform and some sample jupyter not # Relevant online communities -- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476) +- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1303454647427661866) - [**DPK is now listed in Github Awesome-LLM under LLM Data section**](https://github.com/Hannibal046/Awesome-LLM) - [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/) - [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/) From da2c6c1a6d3b4e3b0b1db6d87646ec9e9f2ebdac Mon Sep 17 00:00:00 2001 From: Aanchal Goyal Date: Tue, 3 Dec 2024 15:47:37 +0530 Subject: [PATCH 08/10] Added working link for 15 Signed-off-by: Aanchal Goyal --- resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resources.md b/resources.md index 9263dd183..9a011c3f0 100644 --- a/resources.md +++ b/resources.md @@ -28,7 +28,7 @@ 12. IBM TechXchange Las Vegas 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/) -15. DPK tutorial and hands on session at IIIT Delhi +15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[…]565125349376-FG8E?utm_source=share&utm_medium=member_desktop) ## Example Code From 9fc6d5bbe4f2a11ba417e32d6d57119cbab45b97 Mon Sep 17 00:00:00 2001 From: Aanchal Goyal Date: Tue, 3 Dec 2024 18:49:29 +0530 Subject: [PATCH 09/10] Modified link for bullet 15 Signed-off-by: Aanchal Goyal --- resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resources.md b/resources.md index 9a011c3f0..3164f5ce3 100644 --- a/resources.md +++ b/resources.md @@ -28,7 +28,7 @@ 12. IBM TechXchange Las Vegas 13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley 14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/) -15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinel[…]565125349376-FG8E?utm_source=share&utm_medium=member_desktop) +15. 
[**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinelearning-activity-7263121565125349376-FG8E?utm_source=share&utm_medium=member_desktop)

 ## Example Code

From 2a374d1bfc88eef6813941b8629a0d7716a7bad9 Mon Sep 17 00:00:00 2001
From: Maroun Touma
Date: Tue, 3 Dec 2024 16:29:03 -0500
Subject: [PATCH 10/10] fix layout for commands in first cell

Signed-off-by: Maroun Touma
---
 transforms/universal/hap/python/hap_python.ipynb | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/transforms/universal/hap/python/hap_python.ipynb b/transforms/universal/hap/python/hap_python.ipynb
index ad0c42a02..62486fb0d 100644
--- a/transforms/universal/hap/python/hap_python.ipynb
+++ b/transforms/universal/hap/python/hap_python.ipynb
@@ -6,9 +6,11 @@
    "metadata": {},
    "source": [
     "**** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
+    "```\n",
     "    make venv \n",
     "    source venv/bin/activate \n",
-    "    pip install jupyterlab"
+    "    pip install jupyterlab\n",
+    "```"
   ]
  },
  {
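---

For quick reference, the notebook committed in this series boils down to the short sketch below. It mirrors the cells shown in the patches above (imports, parameter dictionaries, simulated argv, launcher); the input/output folder paths are placeholders to adapt to your own layout, and the parameter names are the ones documented in the README's Configuration section.

```python
# Minimal sketch of driving the HAP transform with the pure-Python runtime,
# distilled from hap_python.ipynb. Folder paths below are placeholders; the
# optional runtime_code_location entry from the notebook is omitted.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from hap_transform_python import HAPPythonTransformConfiguration

# Where to read the input .parquet files and write the annotated output.
local_conf = {"input_folder": "test-data/input", "output_folder": "output"}

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
}

# HAP-specific settings; the values shown here are the documented defaults.
hap_params = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "inference_engine": "CPU",
    "max_length": 512,
    "batch_size": 128,
}

# The launcher parses its configuration from the command line, so the
# notebook simulates sys.argv before launching.
sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)
launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())
launcher.launch()
```

On success, the output folder contains the annotated `test1.parquet` alongside a `metadata.json`, as the notebook's final `glob.glob("../output/*")` cell shows.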