Merge pull request IBM#945 from touma-I/documentation-1.0
Cleanup documentation for 1.0.0
touma-I authored Jan 21, 2025
2 parents c8096b1 + 2f5c691 commit 87844a0
Showing 7 changed files with 556 additions and 264 deletions.
31 changes: 2 additions & 29 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ Data modalities supported _today_: Code and Natural Language.

### Fastest way to experience Data Prep Kit

With no setup necessary, let's use a Google Colab-friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: [examples/notebooks/Run_your_first_transform_colab.ipynb](examples/notebooks/Run_your_first_transform_colab.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb). ([Here](doc/google-colab.md) are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of or are not needed.) The same notebook can be downloaded and run on the local machine, without cloning the repo or any other setup. For additional guidance on setting up JupyterLab, see the Appendix section below.
With no setup necessary, let's use a Google Colab-friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: [examples/notebooks/Run_your_first_transform_colab.ipynb](examples/notebooks/Run_your_first_transform_colab.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb). ([Here](doc/google-colab.md) are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of or are not needed.) The same notebook can be downloaded and run on the local machine, without cloning the repo or any other setup. For additional guidance on setting up JupyterLab, see the [quick start guide](doc/quick-start/quick-start.md#jupyter).

### Install data prep kit from PyPi

Expand All @@ -71,7 +71,7 @@ When installing select transforms, users can specify the name of the transform i
```bash
pip install 'data-prep-toolkit-transforms[pdf2parquet]'
```
For guidance on creating the virtual environment for installing Data Prep Kit, refer to the Appendix section below.
For additional guidance on creating the virtual environment for installing Data Prep Kit, see the [quick start guide](doc/quick-start/quick-start.md#conda).
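After installing an extra, a quick way to confirm that the optional dependency resolved is to probe for it with `importlib` before running a pipeline. This is a minimal sketch; the module names below are illustrative placeholders, not the actual names the extras install — check the installed package for the real importable names:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the named module is importable in this environment."""
    return importlib.util.find_spec(name) is not None

# Illustrative names only; substitute the modules your installed extras provide.
for mod in ("json", "definitely_not_installed_xyz"):
    print(mod, "->", "present" if has_module(mod) else "missing")
```

This fails fast with a clear message instead of surfacing an `ImportError` deep inside a transform run.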

### Run your first data prep pipeline

Expand Down Expand Up @@ -173,33 +173,6 @@ When you finish working with the cluster, and want to clean up or destroy it. Se

You can run transforms via docker image or using virtual environments. This [document](doc/quick-start/run-transform-venv.md) shows how to run a transform using virtual environment. You can follow this [document](doc/quick-start/run-transform-image.md) to run using docker image.

## Appendix
### Create a Virtual Environment

To run on a local machine, follow these steps to quickly set up Data Prep Kit in a Python virtual environment.

```bash
conda create -n data-prep-kit -y python=3.11
conda activate data-prep-kit
python --version
```

Verify that the Python version is 3.11.

If you are using a Linux system, install gcc using the commands below; it is required to compile and install [fasttext](https://fasttext.cc/), which is currently used by some of the transforms.

```bash
conda install gcc_linux-64
conda install gxx_linux-64
```

### Setting up JupyterLab for local experimentation with transform notebooks

```bash
pip install jupyterlab ipykernel ipywidgets
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```

## Citations <a name="citations"></a>

If you use Data Prep Kit in your research, please cite our paper:
Expand Down
18 changes: 17 additions & 1 deletion doc/mac.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,23 @@ machine with an Intel CPU.

### Memory Considerations

To verify that running transforms through KFP does not leak memory, and to get an idea of the required Podman VM memory size configuration, a few tests were devised and run, as summarized [here](memory.md).
To verify that running transforms through KFP does not leak memory, and to get an idea of the required Podman VM memory size configuration, a few tests were devised and run, as summarized below:

#### Memory and Endurance Considerations

A test was devised with a set of 1483 files on a Mac with 32GB of memory and 4 CPU cores, using the traceback library to check for memory leaks.
Ten iterations were run and memory usage was observed; it peaked at around 4GB, with no obvious signs of a memory leak.
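A leak check of this shape can be sketched with Python's standard `tracemalloc` module. This is an editorial sketch, not the original test harness: `run_transform_iteration` is a hypothetical stand-in for one pass of the actual NOOP transform over the file set.

```python
import tracemalloc

def run_transform_iteration(files):
    # Hypothetical stand-in for one transform pass over the file set;
    # the real test ran the NOOP transform over the 1483 files.
    return [len(name) for name in files]

def peak_memory_per_iteration(files, iterations=10):
    """Record tracemalloc's peak allocation (bytes) for each iteration."""
    peaks = []
    tracemalloc.start()
    for _ in range(iterations):
        run_transform_iteration(files)
        _, peak = tracemalloc.get_traced_memory()
        peaks.append(peak)
        tracemalloc.reset_peak()  # measure each iteration independently
    tracemalloc.stop()
    return peaks

peaks = peak_memory_per_iteration([f"file-{i}.parquet" for i in range(1483)])
```

If the per-iteration peak grows steadily instead of staying roughly flat, allocations are accumulating across iterations, which is the signature of a leak.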

Another set of tests was run with the same 1483 files on a Podman VM under different memory configurations; the results are shown below.
Around 4GB of available memory appears to be needed to process all 1483 files successfully.

| CPU Cores | Total Memory | Memory Used by Ray | Transform | Files Processed Successfully |
|-----------|--------------|--------------------|-----------|------------------------------|
| 4         | 8GB          | 4.2GB              | NOOP      | 1483                         |
| 4         | 6GB          | 3GB                | NOOP      | 910                          |
| 4         | 4GB          | 2GB                | NOOP      | 504                          |
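As a back-of-the-envelope check (an editorial sketch, not part of the original tests), the table is roughly consistent with files processed scaling linearly with the memory available to Ray. Interpolating between the two extreme configurations predicts the middle row fairly well:

```python
# (GB used by Ray -> files processed) from the table above
points = {4.2: 1483, 3.0: 910, 2.0: 504}

# Fit a line through the two extreme configurations.
files_per_gb = (points[4.2] - points[2.0]) / (4.2 - 2.0)  # ~445 files per GB
predicted_at_3gb = points[2.0] + files_per_gb * (3.0 - 2.0)
print(round(predicted_at_3gb))  # ~949, close to the observed 910
```

Under that linearity assumption, a bit over 4GB of Ray memory is needed to cover the full set, matching the successful 8GB-VM configuration.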



> **Note**: the *current* release does not support building cross-platform images, so please do not build images on Apple silicon.
13 changes: 0 additions & 13 deletions doc/memory.md

This file was deleted.

