diff --git a/README.md b/README.md
index 946a79256..ade3bed68 100644
--- a/README.md
+++ b/README.md
@@ -126,59 +126,33 @@ Now that you have run a single transform, the next step is to explore how to put
 
 The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.
 
-| Modules | Python-only | Ray | Spark | KFP on Ray |
-|----------------------------------|------------------|------------------|------------------|------------------------|
-| **Data Ingestion** | | | | |
-| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|[HTML to Parquet](transforms/universal/html2parquet/python/README.md) |:white_check_mark:| | | |
-| **Universal (Code & Language)** | | | | |
-| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | |:white_check_mark:| |:white_check_mark: |
-| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark:|:white_check_mark: |
-| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark:|:white_check_mark: |
-| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
-| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
-| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| **Language-only** | | | | |
-| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md)| :white_check_mark:| :white_check_mark: | | :white_check_mark: |
-| **Code-only** | | | | |
-| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | |:white_check_mark:| | |
-| [License Select Annotation](transforms/code/license_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| Modules | Python-only | Ray | Spark | KFP on Ray |
-|----------------------------------|------------------|------------------|---------------------|------------------------|
-| **Data Ingestion** | | | | |
-| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|[HTML to Parquet](transforms/universal/html2parquet/python/README.md) |:white_check_mark:| | | |
-| **Universal (Code & Language)** | | | | |
-| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | |:white_check_mark:| |:white_check_mark: |
-| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
-| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
-| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
-| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| **Language-only** | | | | |
-| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md)| :white_check_mark:| :white_check_mark: | | :white_check_mark: |
-| **Code-only** | | | | |
-| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
-| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | |:white_check_mark:| | |
+| Modules | Python-only | Ray | Spark | KFP on Ray |
+|:-------------------------------------------------------------------------------------|:------------------:|:------------------:|:------------------:|:------------------:|
+| **Data Ingestion** | | | | |
+| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [HTML to Parquet](transforms/universal/html2parquet/python/README.md) | :white_check_mark: | | | |
+| **Universal (Code & Language)** | | | | |
+| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
+| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
+| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| **Language-only** | | | | |
+| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| **Code-only** | | | | |
+| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
+| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | | :white_check_mark: | | |
+| [License Select Annotation](transforms/code/license_select/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 
 Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules!
 