BUG: fix error in write_dataframe when writing an empty or all-None object column with use_arrow #512

Open: wants to merge 11 commits into base: main

theroggy
Member

@theroggy theroggy commented Dec 21, 2024

When a dataframe is written with an object column that has no rows or contains only None values, the object column is converted to a null-type Arrow column, which is not supported by GDAL and leads to an error being thrown.

Solves #513

@brendan-ward brendan-ward changed the title TST: add test to show error when an empty object column is written uing arrow TST: add test to show error when an empty object column is written using arrow Dec 23, 2024
@brendan-ward
Member

I'm not clear on what the proper fix is going to be in this case. Should we instead raise our own error when writing an empty dataframe with one or more object dtype columns present using arrow, and direct user to the non-arrow interface? Or should we fall back to using non-arrow ourselves when detecting an empty data frame? No benefit of using arrow for this case.

@theroggy
Member Author

theroggy commented Dec 24, 2024

> I'm not clear on what the proper fix is going to be in this case. Should we instead raise our own error when writing an empty dataframe with one or more object dtype columns present using arrow, and direct user to the non-arrow interface? Or should we fall back to using non-arrow ourselves when detecting an empty data frame? No benefit of using arrow for this case.

Yes, I didn't have a clear idea yet either... What I was wondering about (as indicated in #513) was whether it was a very conscious choice in pyarrow.Table.from_pandas to convert an object column to the null datatype, as all other datatypes (int, ...) are retained as such. object is obviously a very special case, so I understand it is treated differently from int, ..., but the null datatype doesn't seem super useful to me (I might be wrong)...

It is a good point, however, that arrow doesn't have a lot of added value if there is no data to be written, so an easy fix could be to detect upfront that the dataframe is empty and disable the use of arrow...
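A rough sketch of what such an upfront check could look like. `resolve_use_arrow` is a hypothetical helper invented for this illustration; it is not an existing pyogrio function and not necessarily how this PR handles the case.

```python
import pandas as pd


def resolve_use_arrow(df: pd.DataFrame, use_arrow: bool) -> bool:
    """Hypothetical helper: drop the Arrow path when there are no rows.

    With zero rows there is no data to transfer, so falling back to the
    non-Arrow writer costs nothing.
    """
    if use_arrow and len(df) == 0:
        return False
    return use_arrow


empty_df = pd.DataFrame({"obj_col": pd.Series([], dtype="object")})
filled_df = pd.DataFrame({"obj_col": ["a", "b"]})

empty_result = resolve_use_arrow(empty_df, use_arrow=True)
filled_result = resolve_use_arrow(filled_df, use_arrow=True)
```

Note that a check on the row count alone only covers the empty case; a non-empty frame whose object column holds only None values would still take the Arrow path.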

…w-error-when-an-empty-object-column-is-written-using-arrow
@theroggy theroggy changed the title TST: add test to show error when an empty object column is written using arrow BUG: fix error when an empty object column is written with use_arrow Jan 23, 2025
@theroggy
Member Author

theroggy commented Jan 23, 2025

I found an extra, related problem. The same error occurs with object type columns with all None values: these are converted to a null-type column as well by pyarrow.Table.from_pandas.

I found a fix for both issues by looping over the table and converting all null-type columns to string type.

@theroggy theroggy self-assigned this Jan 26, 2025
@theroggy theroggy changed the title BUG: fix error when an empty object column is written with use_arrow BUG: fix error when an empty or all-None object column is written with use_arrow Jan 26, 2025
@theroggy theroggy changed the title BUG: fix error when an empty or all-None object column is written with use_arrow BUG: fix error in write_dataframe when writing an empty or all-None object column with use_arrow Jan 26, 2025
@theroggy theroggy requested a review from brendan-ward January 26, 2025 20:33
2 participants