[Feature] Html2ParquetTransform support output_format_value json #908

1337stn · 2025-01-02T16:01:28Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Running through "RAG with Data Prep Kit" focusing on Step-2:

https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb

Step-2: Process Input Documents (RAG stage 1, 2 & 3)

This code uses DPK to:

    Extract text from PDFs (RAG stage-1)
    Perform de-duplication (RAG stage-1)
    Split the documents into chunks (RAG stage-2)
    Vectorize the chunks (RAG stage-3)

In "Extract text from PDFs" (RAG stage-1), the call to pdf2parquet_transform_python accepts three output options, defined by pdf2parquet_contents_types: markdown, text, and json.

data-prep-kit/transforms/language/pdf2parquet/dpk_pdf2parquet/transform.py

Line 67 in 6a06d87

class pdf2parquet_contents_types(str, enum.Enum):
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"

In the next step, "split the documents into chunks" (RAG stage-2), the call to doc_chunk accepts three options for chunking_type: dl_json, li_markdown, and li_token_text, with dl_json as the default.

data-prep-kit/transforms/language/doc_chunk/README.md

Line 66 in 6a06d87

    
           When invoking the CLI, the parameters must be set as `--doc_chunk_<name>`, e.g. `--doc_chunk_column_name_key=myoutput`.

Thus the example code uses JSON at both stages: Stage-1 output = json and Stage-2 chunking type = dl_json.
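
The pairing between the two stages can be sketched as a lookup table. Only the json to dl_json pairing comes from the example notebook; the markdown and text pairings are assumptions inferred from the option names, and `chunking_type_for` is a hypothetical helper, not part of DPK.

```python
import enum


class pdf2parquet_contents_types(str, enum.Enum):
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"


# Stage-1 contents type -> Stage-2 chunking_type.
# Only JSON -> dl_json is confirmed by the example notebook;
# the other two pairings are assumptions based on the option names.
CHUNKING_TYPE_BY_CONTENTS = {
    pdf2parquet_contents_types.JSON: "dl_json",
    pdf2parquet_contents_types.MARKDOWN: "li_markdown",
    pdf2parquet_contents_types.TEXT: "li_token_text",
}


def chunking_type_for(contents_type: str) -> str:
    """Hypothetical helper: pick a chunking_type for a Stage-1 contents type."""
    # The enum constructor raises ValueError for an unknown contents type.
    return CHUNKING_TYPE_BY_CONTENTS[pdf2parquet_contents_types(contents_type)]


print(chunking_type_for("application/json"))
```

A table like this makes the gap visible: html2parquet currently has no value that would map onto the default dl_json chunking type.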

When I attempt to change Stage-1 to extract text from HTML using html2parquet_transform_python, there are only two options for html2parquet_output_format: markdown and txt.

data-prep-kit/transforms/language/html2parquet/dpk_html2parquet/transform.py

Line 168 in 6a06d87

class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"

However, html2parquet_transform_python is documented as using Trafilatura, and Trafilatura also supports JSON output:

https://trafilatura.readthedocs.io/en/latest/usage-python.html

Output

By default, the output is in plain text (TXT) format without metadata. The following additional formats are available:

    CSV
    HTML (from version 1.11 onwards)
    JSON
    Markdown (from version 1.9 onwards)
    XML and XML-TEI (following the guidelines of the Text Encoding Initiative)

To specify the output format, use one of the following strings: "csv", "json", "html", "markdown", "txt", "xml", "xmltei".
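
As a minimal sketch of the documented `output_format` option, the snippet below asks Trafilatura for JSON and parses the result. It assumes the `trafilatura` package is installed; `extract_as` and `SUPPORTED_FORMATS` are hypothetical names introduced here, and the format list is copied from the documentation excerpt above.

```python
import json
from typing import Optional

# Format strings copied from the Trafilatura documentation excerpt above.
SUPPORTED_FORMATS = ("csv", "json", "html", "markdown", "txt", "xml", "xmltei")


def extract_as(html: str, output_format: str = "txt") -> Optional[str]:
    """Hypothetical wrapper: validate the format, then call trafilatura.extract."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    import trafilatura  # imported lazily so validation works without the package

    # extract() returns a string, or None when no main content is found.
    return trafilatura.extract(html, output_format=output_format)


if __name__ == "__main__":
    page = (
        "<html><body><article><h1>Title</h1>"
        "<p>Some body text for the example page.</p>"
        "</article></body></html>"
    )
    try:
        result = extract_as(page, output_format="json")
    except ImportError:
        result = None  # trafilatura is not installed in this environment
    if result is not None:
        # The JSON output is a string; parse it to inspect the fields.
        print(sorted(json.loads(result).keys()))
```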

In the meantime, I will attempt to change Stage-2 to work on markdown/text, to align with the currently supported output formats of html2parquet_transform_python.

However, it seems html2parquet_transform_python could accept html2parquet_output_format: json and pass it through to Trafilatura, which already supports JSON. This would keep Stage-2 and beyond in the RAG example(s) unchanged, since they default to JSON.

Could you please consider adding JSON support to html2parquet_output_format (similar to the code below, plus whatever other downstream changes may be required), to align with the pdf2parquet_contents_types options and with the options Trafilatura already supports?

class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"
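
To illustrate the pass-through idea, here is a small sketch in which the enum value is handed directly to an extractor as its output format. `convert_html` and `fake_extract` are hypothetical stand-ins, not DPK code; in the real transform the extractor would presumably be Trafilatura's `extract`, and the actual downstream changes may differ.

```python
import enum
from typing import Callable, Optional


class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"  # proposed addition


def convert_html(
    html: str,
    output_format: str,
    extractor: Callable[..., Optional[str]],
) -> Optional[str]:
    """Hypothetical helper: validate the format, then pass it through unchanged."""
    # The enum constructor raises ValueError for an unsupported format string.
    fmt = html2parquet_output_format(output_format)
    return extractor(html, output_format=fmt.value)


def fake_extract(html: str, output_format: str = "txt") -> str:
    """Stand-in extractor (assumption) that only echoes the requested format."""
    return f"[{output_format}] {len(html)} chars"


print(convert_html("<p>hi</p>", "json", fake_extract))
```

Because the enum values match Trafilatura's format strings, no translation layer is needed in this sketch; whether the resulting JSON is directly usable by the Stage-2 default is something the PR would need to verify.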

Thank you for your consideration and thank you for a great software tool.

Are you willing to submit a PR?

Yes I am willing to submit a PR!


touma-I · 2025-01-08T12:38:42Z

Thanks @1337stn. I do think this will be needed and would like @shahrokhDaijavad and @sungeunan-ibm to weigh in. But I think you should proceed with a PR. Thanks

shahrokhDaijavad · 2025-01-08T15:26:18Z

I think this is a good suggestion. Supporting JSON as an additional output format for the html2parquet transform and making it consistent with the pdf2parquet transform output formats is a nice addition, and the work is straightforward since Trafilatura already supports this. @1337stn I also think you should proceed with a PR. Thanks.

1337stn added the enhancement New feature or request label Jan 2, 2025

touma-I assigned 1337stn Jan 8, 2025
