dlt version
1.5.0
Describe the problem
When write_disposition is set as a hint on the source (as shown below), it is ignored and the load defaults to append.
When write_disposition is instead set at the pipeline level, it is honored, but incremental loading is ignored.
Expected behavior
I expect the script to perform a destructive replace when write_disposition="replace" is set at the source level, and I expect dlt to apply incremental loading correctly even when write_disposition is set at the pipeline level. (Behavior should not differ depending on whether write_disposition is set at the pipeline level or at the source level.)
Steps to reproduce
Here is my full code with write_disposition set as part of a hint. (Note that I am not running it as a Lambda function; I am running it on my local machine.)
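Since the snippet itself is not reproduced here, below is a minimal sketch of the source-level variant, assuming a filesystem resource over S3 with an incremental hint on modification_date; the bucket URL, file glob, and reader are illustrative, not the reporter's actual values:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl

# List files in the bucket; the incremental hint on modification_date should
# make later runs pick up only files added since the previous run.
files = filesystem(bucket_url="s3://my-test-bucket/promotions/", file_glob="*.json")
files.apply_hints(incremental=dlt.sources.incremental("modification_date"))

# Parse the files and set the write disposition as a hint on the resource.
promotions = (files | read_jsonl()).with_name("promotions")
promotions.apply_hints(write_disposition="replace")

pipeline = dlt.pipeline(
    pipeline_name="s3_to_databricks",
    destination="databricks",
    dataset_name="oc_data_bronze",
)
print(pipeline.run(promotions))
```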
The easiest way to replicate this is to put a simple test JSON file into an S3 bucket, run the pipeline, then put an additional file into the bucket and run the pipeline again (with write_disposition set to replace). You will see two _dlt_load_id values in the data:
SELECT _dlt_load_id, count(*) FROM oc_data_bronze.promotions
GROUP BY _dlt_load_id;
To see the other error, you'll need to move the write_disposition to the pipeline level as shown below. (I have a couple of resources with write_disposition set to append, but those can be ignored.)
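Again, a minimal sketch of the pipeline-level variant rather than the reporter's exact code; the only change from the snippet above is that write_disposition is passed to pipeline.run() instead of being applied as a resource hint:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl

files = filesystem(bucket_url="s3://my-test-bucket/promotions/", file_glob="*.json")
files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
promotions = (files | read_jsonl()).with_name("promotions")

pipeline = dlt.pipeline(
    pipeline_name="s3_to_databricks",
    destination="databricks",
    dataset_name="oc_data_bronze",
)
# write_disposition set at the pipeline level: replace is honored here,
# but the incremental hint on modification_date is apparently ignored.
print(pipeline.run(promotions, write_disposition="replace"))
```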
Run the pipeline with one file in the source S3 bucket, then add another file and run again. The target table in Databricks will contain data from both S3 files but only a single _dlt_load_id, implying that a destructive replace was performed but both files were read, when only the latest one should have been. You can further confirm this in the logs of the second run: start_value is set to None, when it should be the modification date from the first load.
{"written_at":"2025-01-30T17:01:21.635Z","written_ts":1738256481635656000,"component_name":"s3_to_databricks","process":12821,"msg":"Bind incremental on filesystem_products with initial_value: None, start_value: None, end_value: None","type":"log","logger":"dlt","thread":"MainThread","level":"INFO","module":"init","line_no":483,"version":{"dlt_version":"1.5.0","pipeline_name":"s3_to_databricks"}}
Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
S3 bucket
dlt destination
No response
Other deployment details
Destination is Databricks
Additional information
No response