
SNOW-640837: pd_writer regression with Pandas 1.3 -> 1.4 #1220

Open
jonashaag opened this issue Aug 8, 2022 · 21 comments
Assignees
Labels
bug status-triage_done Initial triage done, will be further handled by the driver team triaged

Comments

@jonashaag

Reproducer:

import pandas as pd
from functools import partial
from snowflake.connector.pandas_tools import pd_writer
import logging

logging.basicConfig(level="INFO")

TABLE = "xxx"
SCHEMA = "yyy"

with get_sqlalchemy_engine().connect() as conn:
    conn.execute(f"CREATE TEMP TABLE {SCHEMA}.{TABLE} (a int)")
    conn.execute(f"USE SCHEMA {SCHEMA}") # This is needed for pd_writer to work
    pd.DataFrame({"a": [42]}).to_sql(
        TABLE,
        schema=SCHEMA,
        con=conn,
        index=False,
        if_exists="append",
        # see https://github.com/snowflakedb/snowflake-connector-python/issues/513
        # Uncommenting this works as well
        method=partial(pd_writer, quote_identifiers=False),
    )
    # This works:
    df = pd.read_sql_table(TABLE, conn, schema=SCHEMA)
    assert len(df) == 1
    # But this doesn't:
    df = pd.read_sql(f"SELECT * FROM {SCHEMA}.{TABLE}", conn)
    assert len(df) == 1, df

It works in Pandas 1.3 (specifically, before this PR: pandas-dev/pandas#43116)

Logs attached.

log-pandas-13.txt
log-pandas-14.txt

@github-actions github-actions bot changed the title pd_writer regression with Pandas 1.3 -> 1.4 SNOW-640837: pd_writer regression with Pandas 1.3 -> 1.4 Aug 8, 2022
@jonashaag
Author

Any news on this?

@jonashaag
Author

Bump

@jonashaag
Author

jonashaag commented Jan 31, 2023

Something curious happens here: if you use a new connection, the result is empty (even without temporary tables).

import pandas as pd
from functools import partial
from snowflake.connector.pandas_tools import pd_writer
import logging

logging.basicConfig(level="INFO")

TABLE = "xxx"
SCHEMA = "yyy"

with get_sqlalchemy_engine().connect() as conn:
    conn.execute(f"USE SCHEMA {SCHEMA}") # This is needed for pd_writer to work
    pd.DataFrame({"a": [42]}).to_sql(
        TABLE,
        schema=SCHEMA,
        con=conn,
        index=False,
        if_exists="append",
        # see https://github.com/snowflakedb/snowflake-connector-python/issues/513
        # Uncommenting this works as well
        method=partial(pd_writer, quote_identifiers=False),
    )
    # This works:
    df = pd.read_sql_table(TABLE, conn, schema=SCHEMA)
    assert len(df) == 1

with get_sqlalchemy_engine().connect() as conn:
    conn.execute(f"USE SCHEMA {SCHEMA}") # This is needed for pd_writer to work
    # But this doesn't:
    df = pd.read_sql_table(TABLE, conn, schema=SCHEMA)
    assert len(df) == 1

@sfc-gh-aling
Collaborator

thanks for reaching out and for the example, we will take a look into this

@sfc-gh-aling
Collaborator

sfc-gh-aling commented Apr 5, 2023

hey @jonashaag, I think this is a regression in the method pd.DataFrame.to_sql. When I run to_sql on its own with pandas 1.3.0, I can see the data inserted into the table; with 1.4.0, however, to_sql doesn't insert any data.
I'm not sure why read_sql_table works in 1.4.0, because when I check the data in the database through the UI I don't see it.

Can you verify that to_sql doesn't insert data on your side as well? If that's the case, I think it's better to open an issue in the pandas repo.

@jonashaag
Author

Well, the first read_sql_table works, so the data must be stored somewhere! But interestingly, adding this before the first read_sql_table breaks it:

pd.read_sql("SELECT 1", con=conn)

@sfc-gh-aling
Collaborator

sfc-gh-aling commented Apr 11, 2023

thanks for the update. What do you mean by "breaks it"? Does it mean no data is inserted, or something else?

In the meantime, I'm wondering if you could run the same experiment against a sqlite db to see whether this is a general pandas issue; if so, we can create an issue in the pandas repo.
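As a minimal local stand-in for that sqlite experiment (stdlib only, no pandas; the table name xxx mirrors the reproducer, all other names are made up), the sketch below shows the visibility rule that would explain the symptom: an uncommitted write succeeds on the writer's connection but is invisible to every other connection until COMMIT, which matches a COPY that returns success while the table looks empty from elsewhere.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "probe.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE xxx (a int)")
writer.commit()

# The insert succeeds, but sqlite3 keeps it inside an open transaction.
writer.execute("INSERT INTO xxx VALUES (42)")

# A second connection cannot see the uncommitted row.
reader = sqlite3.connect(path)
before = reader.execute("SELECT COUNT(*) FROM xxx").fetchone()[0]

# After an explicit commit, a fresh connection does see it.
writer.commit()
after = sqlite3.connect(path).execute("SELECT COUNT(*) FROM xxx").fetchone()[0]

print(before, after)  # 0 1
```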

@jonashaag
Author

Can you reproduce the issue? If so, why do you need me to do additional work?

@sfc-gh-aling
Collaborator

Apologies if there was any miscommunication; I thought you might be interested in the pandas behavior.
I will try something on my side and keep you updated. I appreciate your help with the issue!

@sfc-gh-aling
Collaborator

I just want to call out that this is still on our radar and we will keep investigating.

@jonashaag
Author

Thank you. I wonder why it takes almost a year to fix such a fundamental problem.

@jonashaag
Author

Hello, is anyone from Snowflake looking into this?

@sfc-gh-aling
Collaborator

sfc-gh-aling commented Jun 1, 2023

hey @jonashaag , I'm going to dedicate some time today to dive into the code base and post updates for you, I know this has been hanging for a while, thanks for your patience.

@sfc-gh-aling
Collaborator

I did some more investigation. The way the connector saves a pandas DataFrame into a table is:

  • the connector converts the pandas DataFrame to Parquet locally
  • the connector uploads the Parquet file to a stage
  • the connector executes a COPY command to copy the data from the staged file into the table

I found that with pandas 1.3, the COPY command loads data into the table; with 1.4, however, the COPY command query returns successfully but no data is loaded into the table.

I verified the behavior with a non-temp table:

with engine.connect() as conn:
    conn.execute(f"CREATE OR REPLACE TABLE {SCHEMA}.{TABLE} (a int)")
    conn.execute(f"USE SCHEMA {SCHEMA}") # This is needed for pd_writer to work
    pd.DataFrame({"a": [42]}).to_sql(
        TABLE,
        schema=SCHEMA,
        con=conn,
        index=False,
        if_exists="append",
        # see https://github.com/snowflakedb/snowflake-connector-python/issues/513
        # Uncommenting this works as well
        method=partial(pd_writer, quote_identifiers=False),
    )

I also verified the Parquet file generated by pandas 1.4.4 -- if I manually copy the generated Parquet into a table, the table does have the data.

For now I have no clue why the COPY command executed in the connector with pandas 1.4.4 returns success but loads no data; I will keep investigating.
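For reference, those three steps can be sketched as the statements involved. This is a hedged mock-up, not the connector's actual code: the stage name (my_temp_stage) and file path are made up, and the Snowflake calls are left commented out because they need a live connection; only the SQL strings are built here.

```python
import pandas as pd

# The data being written in the reproducer above.
df = pd.DataFrame({"a": [42]})

# Step 1: serialize the DataFrame to Parquet locally, e.g.:
# df.to_parquet("/tmp/chunk.parquet")   # requires pyarrow or fastparquet

# Step 2: upload the local file to a (hypothetical) stage.
put_sql = "PUT file:///tmp/chunk.parquet @my_temp_stage"

# Step 3: copy the staged file into the target table.
copy_sql = (
    "COPY INTO xxx FROM @my_temp_stage "
    "FILE_FORMAT = (TYPE = PARQUET)"
)

# With a live connection and cursor, these would run as:
# cursor.execute(put_sql)
# cursor.execute(copy_sql)
# connection.commit()  # the missing commit is exactly what this issue is about
```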

@jonashaag
Author

Thanks! Note that this is also the case with all other pandas versions.

@jonashaag
Author

Should we use the same quoting workaround here as mentioned in #1420?

@sfc-gh-aling
Collaborator

sfc-gh-aling commented Jun 9, 2023

I don't think it's due to the quoting: the Parquet file is good, and I can manually copy the data into the table, so it's not related to quoted identifiers.

Actually, I found something very interesting when checking the log around the COPY command.

In pandas 1.3, I see a COMMIT right after the COPY command is executed:

DEBUG:snowflake.connector.cursor:running query [COPY INTO ... /* Python:snowflake.connector.pandas_tools.writ...]
...
DEBUG:snowflake.connector.cursor:binding: [COMMIT] with input=[None], processed=[{}]	
INFO:snowflake.connector.cursor:query: [COMMIT]	
DEBUG:snowflake.connector.connection:sequence counter: 8	
DEBUG:snowflake.connector.cursor:Request id: xxx
DEBUG:snowflake.connector.cursor:running query [COMMIT]	
DEBUG:snowflake.connector.cursor:is_file_transfer: True	
DEBUG:snowflake.connector.connection:_cmd_query	
DEBUG:snowflake.connector._query_context_cache:serialize_to_json() called	
DEBUG:snowflake.connector.connection:sql=[COMMIT], sequence_id=[8], is_file_transfer=[False]	
DEBUG:snowflake.connector.network:Session status for SessionPool 'xxx.snowflakecomputing.com', SessionPool 1/1 active sessions	
DEBUG:snowflake.connector.network:remaining request timeout: None, retry cnt: 1	
DEBUG:snowflake.connector.network:Request guid: 
DEBUG:snowflake.connector.network:socket timeout: 60	
DEBUG:snowflake.connector.vendored.urllib3.connectionpool:https://xxx.snowflakecomputing.com:443 "POST /queries/v1/query-request?requestId= HTTP/1.1" 200 1149	
DEBUG:snowflake.connector.network:SUCCESS	
DEBUG:snowflake.connector.network:Session status for SessionPool 'xxx.snowflakecomputing.com', SessionPool 0/1 active sessions	
DEBUG:snowflake.connector.network:ret[code] = None, after post request	
DEBUG:snowflake.connector.network:Query id: 
DEBUG:snowflake.connector.cursor:sfqid: 
INFO:snowflake.connector.cursor:query execution done	
DEBUG:snowflake.connector.cursor:SUCCESS	
DEBUG:snowflake.connector.cursor:PUT OR GET: False	
DEBUG:snowflake.connector.cursor:Query result format: json	
DEBUG:snowflake.connector.result_batch:parsing for result batch id: 1	
INFO:snowflake.connector.cursor:Number of results in first chunk: 1	
DEBUG:snowflake.connector.connection:cursor	
DEBUG:snowflake.connector.cursor:executing SQL/command	
...

However, I don't see similar COMMIT logging after the COPY command executes with pandas 1.4, which means the COPY INTO command might never be committed. This explains why we do not see any data in the table after executing COPY INTO.

I think we're close to finding the root cause and the solution.
I am going to check whether anything changed around commit behavior in pandas, and try forcing a commit within the connector.

@sfc-gh-aling
Collaborator

sfc-gh-aling commented Jun 9, 2023

Alright, I finally found a workaround:

after adding cursor.execute("commit") after the line
https://github.com/snowflakedb/snowflake-connector-python/blob/main/src/snowflake/connector/pandas_tools.py#L304

I can now copy the Parquet into the table and the data is there.

The questions now are:

  1. what pandas changed around commits
  2. how the connector should handle the commit if pandas does not commit itself

I will keep looking

@sfc-gh-aling
Collaborator

As a workaround, you could try adding conn.execute("COMMIT") to your code after calling pandas.to_sql:

with get_sqlalchemy_engine().connect() as conn:
    conn.execute(f"USE SCHEMA {SCHEMA}") # This is needed for pd_writer to work
    pd.DataFrame({"a": [42]}).to_sql(
        TABLE,
        schema=SCHEMA,
        con=conn,
        index=False,
        if_exists="append",
        # see https://github.com/snowflakedb/snowflake-connector-python/issues/513
        # Uncommenting this works as well
        method=partial(pd_writer, quote_identifiers=False),
    )
    conn.execute('commit')

I'm still comparing the behavior differences between pandas 1.3 and 1.4.

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage_done Initial triage done, will be further handled by the driver team label Dec 16, 2024
@pep-sanwer

pep-sanwer commented Dec 18, 2024

I'm seeing this issue as a lack of committed changes (all other functionality works as expected, for example auto-creating a previously non-existing table) when using write_pandas directly.

Data is correctly PUT -> COPY INTO, but the changes are not committed. This only happens when COPY INTO-ing an existing table (appending rows). The workaround mentioned here: #1220 (comment) works.

Resolved in:

snowflake-connector-python==3.12.4
snowflake-sqlalchemy==1.7.2
pandas==2.2.3
sqlalchemy==2.0.36

without patching snowflake-connector-python code, by wrapping the call to write_pandas in an explicit transaction block (via a sqlalchemy engine):

with engine.begin() as conn:
    # call to `write_pandas`
    ...

# other code
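This explicit-transaction pattern can be shown end to end with sqlite standing in for Snowflake, so the sketch runs without credentials (table name xxx mirrors the thread; the Snowflake case would use a snowflake-sqlalchemy engine instead). engine.begin() commits on successful exit, so a fresh connection sees the row regardless of whether pandas commits anything itself:

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine, text

# sqlite stands in for Snowflake so the pattern runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
engine = create_engine(f"sqlite:///{path}")

# engine.begin() opens a transaction and commits on successful exit, making
# the write durable no matter what pandas does (or does not) commit itself.
with engine.begin() as conn:
    pd.DataFrame({"a": [42]}).to_sql("xxx", con=conn, index=False, if_exists="append")

# A brand-new connection sees the committed row.
with engine.connect() as conn:
    df = pd.read_sql(text("SELECT * FROM xxx"), conn)
assert len(df) == 1
```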

@jonashaag
Author

FWIW, we abandoned this API altogether and replaced it with a two-step approach: first upload the DataFrame as Parquet to a temporary stage, then load that Parquet. Much more reliable than using this API, which no one seems to care enough about to fix.
