The table of transforms was duplicated (#649)
* fix lib doc .py links and update resize readme

Signed-off-by: David Wood <[email protected]>

* reorder some instructions in RELEASE.md

Signed-off-by: David Wood <[email protected]>

* updated doc on exception processing by the runtime

Signed-off-by: David Wood <[email protected]>

* updated release notes and release process doc

Signed-off-by: David Wood <[email protected]>

* cleanups in the release documentation

Signed-off-by: David Wood <[email protected]>

* cleanups in the release documentation

Signed-off-by: David Wood <[email protected]>

* remove duplicated table of transforms

Signed-off-by: David Wood <[email protected]>

* center columns in module table of readme

Signed-off-by: David Wood <[email protected]>

---------

Signed-off-by: David Wood <[email protected]>
daw3rd authored Oct 1, 2024
1 parent afafbf1 commit 6e21f64
Showing 1 changed file with 27 additions and 53 deletions.
80 changes: 27 additions & 53 deletions README.md
@@ -126,59 +126,33 @@ Now that you have run a single transform, the next step is to explore how to put
The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.


| Modules | Python-only | Ray | Spark | KFP on Ray |
|----------------------------------|------------------|------------------|------------------|------------------------|
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
|[HTML to Parquet](transforms/universal/html2parquet/python/README.md) |:white_check_mark:| | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | |:white_check_mark:| |:white_check_mark: |
| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark:|:white_check_mark: |
| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark:|:white_check_mark: |
| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| **Language-only** | | | | |
| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md)| :white_check_mark:| :white_check_mark: | | :white_check_mark: |
| **Code-only** | | | | |
| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | |:white_check_mark:| | |
| [License Select Annotation](transforms/code/license_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| Modules | Python-only | Ray | Spark | KFP on Ray |
|----------------------------------|------------------|------------------|---------------------|------------------------|
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
|[HTML to Parquet](transforms/universal/html2parquet/python/README.md) |:white_check_mark:| | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | |:white_check_mark:| |:white_check_mark: |
| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: |:white_check_mark:| :white_check_mark: |:white_check_mark: |
| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| **Language-only** | | | | |
| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md)| :white_check_mark:| :white_check_mark: | | :white_check_mark: |
| **Code-only** | | | | |
| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: |:white_check_mark:| |:white_check_mark: |
| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | |:white_check_mark:| | |
| Modules | Python-only | Ray | Spark | KFP on Ray |
|:-------------------------------------------------------------------------------------|:------------------:|:------------------:|:------------------:|:------------------:|
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [HTML to Parquet](transforms/universal/html2parquet/python/README.md) | :white_check_mark: | | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
| [Unique ID annotation](transforms/universal/doc_id/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Filter on annotations](transforms/universal/filter/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Profiler](transforms/universal/profiler/ray/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Resize](transforms/universal/resize/python/README.md) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| [Tokenizer](transforms/universal/tokenization/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| **Language-only** | | | | |
| [Language identification](transforms/language/lang_id/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Document quality](transforms/language/doc_quality/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Document chunking for RAG](transforms/language/doc_chunk/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Text encoder](transforms/language/text_encoder/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PII Annotator/Redactor](transforms/language/pii_redactor/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| **Code-only** | | | | |
| [Programming language annnotation](transforms/code/proglang_select/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Code quality annotation](transforms/code/code_quality/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Malware annotation](transforms/code/malware/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Header cleanser](transforms/code/header_cleanser/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Semantic file ordering](transforms/code/repo_level_ordering/ray/README.md) | | :white_check_mark: | | |
| [License Select Annotation](transforms/code/license_select/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |


Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules!