
SNOW-735220: pd_writer example from documentation not working #1420

Closed
jonashaag opened this issue Jan 31, 2023 · 28 comments

@jonashaag

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]

  2. What operating system and processor architecture are you using?

Linux-5.4.0-1094-aws-x86_64-with-glibc2.31

  3. What are the component versions in the environment (pip freeze)?
mamba list | grep -e snow -e pandas
pandas                    1.5.2           py310h769672d_0    https://conda.anaconda.org/conda-forge
snowflake-connector-python 2.8.3           py310hea2c3ac_0    https://conda.anaconda.org/conda-forge
snowflake-sqlalchemy      1.3.4              pyhd8ed1ab_0    https://conda.anaconda.org/conda-forge

I also tried with snowflake-sqlalchemy 1.4.5

  4. What did you do?

Run this example from the documentation

import pandas
from snowflake.connector.pandas_tools import pd_writer

# Create a DataFrame containing data about customers
df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])

# Specify that the to_sql method should use the pd_writer function
# to write the data from the DataFrame to the table named "customers"
# in the Snowflake database.
df.to_sql('customers', engine, index=False, method=pd_writer)
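
For context, the example references an engine created separately. A typical setup with snowflake-sqlalchemy, using placeholder account and credentials, would look roughly like:

from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

# Placeholder connection parameters, for illustration only.
engine = create_engine(URL(
    account='my_account',
    user='my_user',
    password='my_password',
    database='testdb',
    schema='public',
))
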
  5. What did you expect to see?

Expected the example to work, but got

snowflake.connector.errors.ProgrammingError: 000904 (42000): 01aa0467-3201-91be-0000-1e695e9861c6: SQL compilation error: error line 1 at position 92
invalid identifier '"name"'

  6. Can you set logging to DEBUG and collect the logs?

/

@github-actions bot changed the title from "pd_writer example from documentation not working" to "SNOW-735220: pd_writer example from documentation not working" on Jan 31, 2023
@jonashaag
Author

If I use an ALL_UPPERCASE table name I get the following instead

sqlalchemy.exc.InvalidRequestError: Could not reflect: requested table(s) not available in Engine(...) schema '...': (MY_TABLE_NAME)

@jonashaag
Author

If I pass quote_identifiers=False, I hit #1220.

@michaelkwagner

michaelkwagner commented Feb 14, 2023

I am also having a similar issue. I created this test to illustrate:

import pandas as pd
from snowflake.connector.pandas_tools import pd_writer
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

account = "xxx"
user = "xxx"
password = "xxx"

def test_snowflake_pd_writer():
    url = URL(
        account=account,
        user=user,
        password=password)
    engine = create_engine(url=url)

    try:
        database = "PD_WRITER_DB"
        sql = f'CREATE DATABASE {database} '
        with engine.connect() as connection:
            connection.execute(sql)

        url = URL(
            account=account,
            user=user,
            password=password,
            database=database,
            schema="PUBLIC")
        engine = create_engine(url=url)

        # upper case columns
        df = pd.DataFrame({
            "AA": [1, 1, 1],
            "BB": [2, 2, 2],
            "CC": [3, 3, 3]})

        # both of the following create the columns in uppercase and populate the new tables
        df.to_sql(name="UPPER_NO_PD_WRITER", con=engine, index=False)
        df.to_sql(name="UPPER_PD_WRITER", con=engine, method=pd_writer, index=False)

        # mixed case columns
        df = pd.DataFrame({
            "Aa": [1, 1, 1],
            "Bb": [2, 2, 2],
            "Cc": [3, 3, 3]})

        # both of the following create the columns in mixed case and populate the new tables            
        df.to_sql(name="MIXED_NO_PD_WRITER", con=engine, index=False)
        df.to_sql(name="MIXED_PD_WRITER", con=engine, method=pd_writer, index=False)

        # lower case columns
        df = pd.DataFrame({
            "aa": [1, 1, 1],
            "bb": [2, 2, 2],
            "cc": [3, 3, 3]})
            
        # following df.to_sql creates table but columns in new table are upper case column names: AA, BB, CC
        # is there a way to create the column names as lowercase? 
        # the columns are populated with the values from the dataframe
        df.to_sql(name="LOWER_NO_PD_WRITER", con=engine, index=False)
        
        # following df.to_sql creates table but columns in new table are upper case column names: AA, BB, CC
        # fails with the error: invalid identifier '"aa"'
        # no values are populated.
        df.to_sql(name="LOWER_PD_WRITER", con=engine, method=pd_writer, index=False)
    finally:
        with engine.connect() as connection:
            connection.execute(f'DROP DATABASE IF EXISTS {database} ')

@sfc-gh-achandrasekaran
Contributor

We will look at this issue this week. FYI @sfc-gh-stan

@sfc-gh-stan
Contributor

Hi all, thanks for opening this issue. I could reproduce the error and will be working on a bug fix. Assigning this to myself.

@sfc-gh-stan
Contributor

I looked into this further. The cause is that when SQLAlchemy creates the table, it uses the query 'CREATE TABLE customers ( name TEXT, balance BIGINT )', but pd_writer defaults to quoting the identifier names and uppercasing the table name. Snowflake therefore finds the correct table (customers and "CUSTOMERS" match) but cannot find the columns (unquoted name resolves to NAME, which does not match "name"). Specifying quote_identifiers=False should work around this; I didn't hit #1220 when running df.to_sql('new_customers', engine, index=False, method=lambda a,b,c,d: pd_writer(a,b,c,d, quote_identifiers=False)) with

pandas==1.5.2
...
snowflake-connector-python==2.8.3
snowflake-sqlalchemy==1.4.3
SQLAlchemy==1.4.47

I will look into #1220 for now.

FYI, the documented example for pd_writer actually works because the column names are specified in uppercase; the example used here is from write_pandas.
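
For reference, an equivalent way to spell that workaround with functools.partial instead of a lambda (same assumptions as above) would be:

from functools import partial

from snowflake.connector.pandas_tools import pd_writer

# Disable identifier quoting in pd_writer, so the COPY INTO step matches the
# unquoted (uppercase-folded) column names SQLAlchemy used when creating the table.
df.to_sql(
    'new_customers',
    engine,
    index=False,
    method=partial(pd_writer, quote_identifiers=False),
)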

@jonashaag
Author

Any news on this?

@willsthompson

Thanks for looking into this problem, but your proposed workaround doesn't handle the case where the column name is lowercase and case-sensitive. Our application supports Snowflake connections, so we have no control over table names and have to handle our customers' tables as-is. In order to fully support Snowflake casing options, we've had to build fairly elaborate workarounds for this bug and for snowflakedb/snowflake-sqlalchemy#388.

@jonashaag
Author

Hello, is anyone from Snowflake looking into this?

@sfc-gh-stan
Contributor

I see how setting quote_identifiers to False does not work for column names that are case-sensitive. I will take a look again soon and see what the best approach is here.

@sfc-gh-stan
Contributor

sfc-gh-stan commented Jun 7, 2023

Hi all, with #1594, you should be able to work around this behavior by enquoting the columns in your pandas DataFrames:

df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['"name"', '"balance"'])
df.to_sql('customers', engine, index=False, method=pd_writer)
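
If the DataFrame already exists with plain lowercase column names, the same workaround can be applied programmatically, e.g. (a sketch that treats every column as case-sensitive):

# Wrap each column name in double quotes so that the quoted identifiers used by
# pd_writer match the quoted, case-sensitive columns of the created table.
df.columns = [f'"{col}"' for col in df.columns]
df.to_sql('customers', engine, index=False, method=pd_writer)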

@jonashaag
Author

Is this what you are going to recommend in the documentation?

@sfc-gh-stan
Contributor

Is this what you are going to recommend in the documentation?

Sure, I will include it in the documentation of pd_writer.

@jonashaag
Author

Honestly, I think it's a terrible workaround. It shouldn't require manual quoting of column names to be able to upload a dataframe.

Can we at least move the quoting logic inside the Snowflake adapter? So when at some point in the future this is properly fixed in Snowflake/the adapter, people don't need to remove this workaround from their code bases.

@sfc-gh-stan
Contributor

@jonashaag, I think this is documented behavior of Snowflake rather than a bug in the connector. The problem is that when pandas generates the query to create the table, it uses the DataFrame's column names directly, but Snowflake requires case-sensitive identifiers to be double-quoted.
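
A simplified illustration of the mismatch (not the exact SQL the libraries emit):

# 1) Table creation (pandas/SQLAlchemy): identifiers are left unquoted, so
#    Snowflake folds them to uppercase; the columns are stored as NAME, BALANCE.
create_stmt = "CREATE TABLE customers (name TEXT, balance BIGINT)"

# 2) Data load (pd_writer): identifiers are double-quoted, so they stay lowercase
#    and case-sensitive; Snowflake looks for a column literally named "name",
#    which does not exist in the table created above.
copy_stmt = 'COPY INTO customers ("name", "balance") FROM ...'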

@jonashaag
Author

jonashaag commented Jun 8, 2023

Ok, but since we have special code to support Pandas dataframes in the connector already, don't you think a good place for Snowflake specific column name handling would be inside the connector as well? Otherwise we have code in the connector to deal with Pandas dataframes that doesn't work unless you make specific changes to your dataframe. It's just not nice from a developer experience point of view.

@willsthompson

willsthompson commented Jun 8, 2023

df.to_sql() doesn't appear to have a mechanism to change table create behavior, but internally it uses SQLAlchemy to create the table. So maybe this could be further resolved in snowflake-sqlalchemy? There, I think, you could define specific SQLAlchemy dialect rules so that the quoting is handled automatically at the SQLAlchemy level, rather than leaking this abstraction up to the pandas call. I agree with @jonashaag that this fix is not ideal.

@sfc-gh-stan
Contributor

sfc-gh-stan commented Jun 14, 2023

I see your concerns with this workaround. After digging into snowflake-sqlalchemy a bit more, I found that identifiers containing a mix of lowercase and uppercase characters are actually enquoted automatically; however, identifiers consisting of lowercase characters only are not, which causes the issue here.
I agree this should ideally be fixed in snowflake-sqlalchemy with a custom IdentifierPreparer or similar, but that would be a breaking change with a potentially huge blast radius.
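
For the record, the kind of change being discussed would look roughly like the sketch below; it uses SQLAlchemy's IdentifierPreparer hook for illustration only and is not an actual snowflake-sqlalchemy patch:

from sqlalchemy.sql.compiler import IdentifierPreparer

# Hypothetical preparer that quotes every identifier, so all-lowercase column
# names would be created as case-sensitive identifiers. Swapping something like
# this into the Snowflake dialect is exactly the breaking change mentioned above.
class AlwaysQuotePreparer(IdentifierPreparer):
    def _requires_quotes(self, value):
        return True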

@willsthompson

Thanks, and I can see you're basically at a dead end for options in this repo, so if you want to carry on this conversation somewhere else just lmk.

Is there a PM responsible for this part of Snowflake's tooling who can make a call on what to do here? I completely understand not wanting to break anything as a maintainer, but it generally seems like an odd decision on Snowflake's part not to support official Snowflake features in their own tooling. Even if they decide not to do more to handle this problem, there should at least be documented disclaimers about what is and is not supported, to save devs the trouble of rediscovering these rough edge cases.

Also, I haven't explored this in detail, but I think there are ways to add new IdentifierPreparer logic on an opt-in basis. It's possible to pass arguments from create_engine() that will be received by the dialect class. So IIUC you could still keep the current behavior as the default and allow users to opt into alternative logic via something like

engine = create_engine(
    'snowflake://user:password@localhost/test', 
    strict_quotes=True
)
# or
engine = create_engine(
    'snowflake://user:password@localhost/test?strict_quotes=true'
)
@jonashaag
Author

I agree, handling of case sensitivity should definitely be improved in the Python connector/SQLAlchemy dialect. One of the first things one has to do when adding Snowflake support to a Python application or library is to add some hacks to make it work with column name casing. There are likely dozens of different hacks in the various codebases that have the privilege of having to deal with these problems, and I'm sure each of those hacks is incorrect and incomplete in a different way.

@jonashaag
Author

jonashaag commented Jul 10, 2023

Any news on this? Will you be leaving incorrect documentation forever?

@jonashaag
Author

Hello?

@sfc-gh-aling
Collaborator

hi @jonashaag, apologies for the delayed response.

we have updated the docstring to note the limitation as well as the workaround:

Notes:
Please note that when the column names in the pandas DataFrame consist of strictly lowercase letters, the column names need to
be enquoted, otherwise a `ProgrammingError` will be raised.
This is because `snowflake-sqlalchemy` does not enquote lowercase column names when creating the table, but `pd_writer` enquotes the columns by default,
and the COPY INTO command looks for enquoted column names.
Future improvements will be made in the snowflake-sqlalchemy library.
Example usage:
import pandas as pd
from snowflake.connector.pandas_tools import pd_writer
sf_connector_version_df = pd.DataFrame([('snowflake-connector-python', '1.0')], columns=['NAME', 'NEWEST_VERSION'])
sf_connector_version_df.to_sql('driver_versions', engine, index=False, method=pd_writer)
# when the column names consist of only lowercase letters, enquote the column names
sf_connector_version_df = pd.DataFrame([('snowflake-connector-python', '1.0')], columns=['"name"', '"newest_version"'])
sf_connector_version_df.to_sql('driver_versions', engine, index=False, method=pd_writer)

thanks for sharing your thoughts and feedback on the issue. I agree that in the long term this should be fixed in the sqlalchemy library.

@jonashaag
Author

Are you guys fine with incorrect documentation?

This issue is slowly becoming a meme

@sfc-gh-aling
Collaborator

hi @jonashaag, thanks for the reminder regarding the adjustments needed for the official documentation; we're working with the doc team to get this fixed: https://github.com/snowflakedb/snowflake-prod-docs/pull/4378

@jonashaag
Author

Thank you! That repo doesn't seem to be public but I'm looking forward to seeing the fix deployed!

@jonashaag
Author

Seems to be fixed, thank you!
