fix: failure Loading json columns into an arrow table #2273

Status: Open, 2 commits to be merged into base: devel

neuromantik33
Contributor

@neuromantik33 neuromantik33 commented Feb 7, 2025

Issue Summary

When loading a table with json columns into an Arrow table, the process fails if the json payloads are nullable. Because the dataset is sparse, the first row in the result set is often None, so py_type is inferred as NoneType. The column is then not converted into a JSON string array, and the load fails with the following error: Expected bytes, got a 'dict' object.

Steps to Reproduce

  • Load a SQL table containing a json column into an Arrow table.
  • Ensure the first row of the result set contains None in the json column.
  • Observe that the Arrow conversion fails with the above error message.

Expected Behavior

  • The json column should be properly converted to a json string array, even if the first value in the dataset is NULL.
  • The inference mechanism should recognize the json type instead of defaulting to NoneType.
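The expected behavior above can be sketched in plain Python. This is only an illustration under assumptions, not the code in this PR: the helper names infer_py_type and json_column_to_strings are hypothetical, and the idea is simply to infer the type from the first non-None value rather than from the first row.

```python
import json
from typing import Any, List, Optional, Sequence


def infer_py_type(values: Sequence[Any]) -> type:
    """Return the type of the first non-None value, or NoneType if all are None."""
    for value in values:
        if value is not None:
            return type(value)
    return type(None)


def json_column_to_strings(values: Sequence[Any]) -> List[Optional[str]]:
    """Serialize dict/list payloads to JSON strings, keeping None as None."""
    if infer_py_type(values) not in (dict, list):
        return list(values)  # not a JSON-like column: leave values untouched
    return [None if v is None else json.dumps(v) for v in values]


# Sparse column whose first value is None, as in the failing case.
column = [None, {"a": 1}, None, [1, 2]]
converted = json_column_to_strings(column)
```

With this approach a leading None no longer hides the dict payloads, so the column can be handed over as strings.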

sql_database workaround

A query adapter can be used to explicitly cast JSON columns to TEXT:

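import sqlalchemy as sa
from sqlalchemy.sql import sqltypes
from sqlalchemy.sql.expression import Select


def json_to_text(_: Select, table: sa.Table) -> Select:
    """FIXME Workaround to convert all JSON types to text until pyarrow adapter is fixed"""
    # Cast every JSON column to TEXT and keep its original name; pass other columns through.
    columns = [
        sa.cast(col, sqltypes.Text).label(col.name) if isinstance(col.type, sqltypes.JSON) else col
        for col in table.columns
    ]
    return sa.select(*columns)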

But since I also use this module in my pg_replication source plugin, I preferred to correct it at the source, no pun intended 😉

netlify bot commented Feb 7, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 0f57b92
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67a55a0dbe50aa00082f3cc2

@rudolfix
Collaborator

@neuromantik33 FYI we are on it here #2295
@zilto please look at the test case in this PR. This is the thing I mentioned in the PR. The current idea is that we launch the conversion only when we get an exception from arrow, so we avoid looping over Python rows in most cases.
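A rough sketch of that "convert only on failure" idea, under stated assumptions: build_with_fallback and strict_string_builder are hypothetical names, and the strict builder is a plain-Python stand-in for a typed array builder, not an actual Arrow call. It is not the implementation in #2295.

```python
import json
from typing import Any, Callable, List, Sequence, TypeVar

T = TypeVar("T")


def build_with_fallback(values: Sequence[Any], build: Callable[[List[Any]], T]) -> T:
    """Try the fast path first; serialize JSON payloads only if the builder rejects the column."""
    try:
        return build(list(values))
    except (TypeError, ValueError):
        # Slow path: loop over the Python values once and dump dict/list payloads to strings.
        serialized = [json.dumps(v) if isinstance(v, (dict, list)) else v for v in values]
        return build(serialized)


def strict_string_builder(values: List[Any]) -> List[Any]:
    """Hypothetical stand-in for a typed array builder: accepts only str or None."""
    for v in values:
        if v is not None and not isinstance(v, str):
            raise TypeError(f"Expected str, got a '{type(v).__name__}' object")
    return values
```

Columns that already hold strings never pay for the per-row loop; only a rejected column takes the slow path.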

@neuromantik33
Contributor Author

Thanks for tackling this issue. Feel free to close this whenever you have something, and if you need a beta tester ... 😉

2 participants