Release DataFusion 45.0.0 #14008

Open
10 of 30 tasks
alamb opened this issue Jan 4, 2025 · 24 comments

@alamb
Contributor

alamb commented Jan 4, 2025

Is your feature request related to a problem or challenge?

Tracking ticket for next release, also a place to track desired inclusions

Last release was https://crates.io/crates/datafusion/44.0.0 on December 31, 2024, so the next major release would be around Feb 1, 2025

Steps:

Pre-release testing

Prior release tickets:

Please let me know if you would like to add any items to this list or change the categorization

Items to fix before release

Items maybe to complete (not sure if they are blockers)

Nice to Have (but non blockers -- e.g. bugs but not regressions)

@alamb alamb added the enhancement New feature or request label Jan 4, 2025
@alamb alamb mentioned this issue Jan 4, 2025
10 tasks
@alamb
Contributor Author

alamb commented Jan 13, 2025

@andygrove would you like to coordinate this release or would you like me to? (or does anyone else want to do so?)

@alamb
Contributor Author

alamb commented Jan 13, 2025

I also added some issues to the description above that I think would be worth fixing

@andygrove
Member

@andygrove would you like to coordinate this release or would you like me to? (or does anyone else want to do so?)

I don't have a preference. I will be traveling around this time though, so perhaps it would make sense for someone else to be release manager for this one.

@alamb
Contributor Author

alamb commented Jan 13, 2025

I don't have a preference. I will be traveling around this time though, so perhaps it would make sense for someone else to be release manager for this one.

I am happy to do it again for 45 if no one else would like the opportunity (see what I did there 😆 )

@xudong963
Member

xudong963 commented Jan 14, 2025

I don't have a preference. I will be traveling around this time though, so perhaps it would make sense for someone else to be release manager for this one.

I am happy to do it again for 45 if no one else would like the opportunity (see what I did there 😆 )

Thanks, alamb, I booked 46 in advance!

@alamb
Contributor Author

alamb commented Jan 14, 2025

I am happy to do it again for 45 if no one else would like the opportunity (see what I did there 😆 )

Thanks, alamb, I booked 46 in advance!

Awesome -- I filed #14123 to track 46

@alamb
Contributor Author

alamb commented Jan 15, 2025

I plan to start assembling and testing the release candidate the week of Jan 27 (in about 2 weeks' time)

@shehabgamin
Contributor

As promised, Sail is working on porting relevant tests into DataFusion.

A good starting point is a regression our tests caught in DataFusion 43, which still seems to persist in DataFusion 44. A regression was introduced in DataFusion 43.0.0 related to casting to UTF8 in various places. Upgrading to DataFusion 43.0.0 required adding explicit casting in several areas as a workaround. This PR (lakehq/sail#355) comments out those changes to expose the regression through the 12 additional failed tests compared to the main branch.

Once I’ve pinpointed the root cause(s) of the regression, I’ll create an issue in DataFusion to track the work. I want to ensure the issue accurately reflects the problem before filing it. I’m happy to address these regressions and port over the tests that cover them in the same PR. Hopefully, we can get this resolved in time for the DataFusion 45 release!

@alamb
Contributor Author

alamb commented Jan 18, 2025

Once I’ve pinpointed the root cause(s) of the regression, I’ll create an issue in DataFusion to track the work. I want to ensure the issue accurately reflects the problem before filing it. I’m happy to address these regressions and port over the tests that cover them in the same PR. Hopefully, we can get this resolved in time for the DataFusion 45 release!

Thank you very much @shehabgamin 🙏

I strongly suspect this is related to switching to Utf8View by default in Parquet; you can validate this theory by disabling the following config setting:

https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.schema_force_view_types
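
For anyone who wants to try the same check from Rust, here is a minimal sketch of turning that setting off when building a session. It assumes the `datafusion` crate is available as a dependency; the helper name `context_without_view_types` is made up for illustration and is not part of DataFusion.

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

/// Hypothetical helper (not part of DataFusion): builds a context with
/// `schema_force_view_types` disabled, so the Parquet reader is not asked
/// to read string and binary columns as view types.
fn context_without_view_types() -> SessionContext {
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.schema_force_view_types", false);
    SessionContext::new_with_config(config)
}
```

The same option can also be set from SQL with `SET datafusion.execution.parquet.schema_force_view_types = false;`.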

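I think we are pretty close to closing out the Utf8View epic (now that we have upgraded to the latest arrow):

* [[Epic] A Collection of Additional UTF8View support tickets #13504](https://github.com/apache/datafusion/issues/13504)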

I'll add that to the list for 45 too

@alamb
Contributor Author

alamb commented Jan 18, 2025

I plan to start preparing / testing / pushing this release the week of Jan 27, aiming to get a release candidate early the next week

@shehabgamin
Contributor

I strongly suspect this is related to switching to Utf8View by default in Parquet; you can validate this theory by disabling the following config setting:

https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.schema_force_view_types

I think we are pretty close to closing out the Utf8View epic (now that we have upgraded to the latest arrow):

* [[Epic] A Collection of Additional UTF8View support tickets #13504](https://github.com/apache/datafusion/issues/13504)

Thanks for the pointer @alamb!

I tried setting datafusion.execution.parquet.schema_force_view_types to false, but unfortunately, none of the 12 failed tests passed.

I'll take a deeper look into the issue after the weekend. Hope you have a great rest of your weekend!

@shehabgamin
Contributor

I'll take a deeper look into the issue after the weekend. Hope you have a great rest of your weekend!

Most of the regressions are related to this issue: #14230. I should be able to resolve them well before the 45 release.

While testing my local Sail code with the latest commit on DataFusion's main branch, I encountered several breaking changes that may make DataFusion 45 a jarring upgrade for some users. Given the previous discussion about wanting to make releases less jarring (#13334 (comment)), I wanted to bring this to your attention, @alamb.

Aside from that, there is one remaining regression I haven't investigated yet, which seems to be related to Parquet.

@alamb
Contributor Author

alamb commented Jan 22, 2025

While testing my local Sail code with the latest commit on DataFusion's main branch, I encountered several breaking changes that may make DataFusion 45 a jarring upgrade for some users

Thanks @shehabgamin -- Can you enumerate these changes (or point me at a PR) so we can see if there is some way to make it less jarring?

@shehabgamin
Contributor

Thanks @shehabgamin -- Can you enumerate these changes (or point me at a PR) so we can see if there is some way to make it less jarring?

Yeah I'll work on that right now!

@shehabgamin
Contributor

My apologies @alamb, the DataFusion upgrade from the latest main branch commit is smoother than I initially thought. After investigating the flood of errors, I discovered that many were resolved by simply updating Sail's serde-arrow dependency to Arrow 54. Projects without PyO3 or the pyarrow feature in DataFusion should experience a seamless upgrade (as of writing). Projects using PyO3 with the pyarrow feature enabled will have varying experiences based on their usage of PyO3.

PyO3 0.23.3
DataFusion 45 upgrades from PyO3 0.22 to 0.23.3. This is an exciting change, but may introduce significant breaking changes for PyO3 users. Since these changes vary based on PyO3 usage, I'm not listing Sail's specific changes here. Users can refer to the PyO3 migration guide: https://pyo3.rs/v0.23.0/migration

DataFusion
ValuesExec is now deprecated. The deprecation message is a bit confusing though. It currently states: "Use MemoryExec::try_new_as_values instead", but I think it should say: "Use MemoryExec::try_new_as_values or MemoryExec::try_new_from_batches instead". Or, just simply: "Use MemoryExec instead".

If you'd like to see these changes, they're in my PR that's testing the regression fixes: lakehq/sail#355

@jayzhan211
Contributor

ValuesExec is now deprecated. The deprecation message is a bit confusing though. It currently states: "Use MemoryExec::try_new_as_values instead", but I think it should say: "Use MemoryExec::try_new_as_values or MemoryExec::try_new_from_batches instead". Or, just simply: "Use MemoryExec instead".

To replace ValuesExec, try_new_as_values is the right one to use, not try_new_from_batches.

@shehabgamin
Contributor

To replace ValuesExec, try_new_as_values is the right one to use, not try_new_from_batches.

Some people currently use ValuesExec::try_new_from_batches, so MemoryExec::try_new_as_values wouldn't necessarily be a suitable substitute.

@jayzhan211
Contributor

I see.

@andygrove
Member

I created an issue to track our progress with upgrading Comet to use DataFusion 45 and linked to it from the PR description: #14274

@andygrove
Member

@alamb I took the liberty of adding #14277 to the "must fix" list

@shehabgamin
Contributor

shehabgamin commented Jan 25, 2025

Most of the regressions are related to this issue: #14230. I should be able to resolve them well before the 45 release.

It turns out that type coercion for UDF arguments (TypeSignature::Coercible) was not being applied to the majority of types. I adjusted the scope of #14230 and #14268 to reflect this.
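
To make the idea concrete, here is a toy sketch of what a "coercible" argument signature does: the function declares the kind of value it wants, and each argument type is either accepted as-is or mapped to a type it should be cast to. Everything in it (`ToyType`, `ToyTarget`, `coerce`, `coerce_args`, and the specific cast rules) is hypothetical and is not DataFusion's actual API or coercion table.

```rust
// Toy model only: every type and function here is hypothetical and is NOT
// DataFusion's API. It only illustrates the idea of a coercible signature.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int32,
    Int64,
    Boolean,
    Utf8,
    Utf8View,
}

/// What a function declares it wants for one argument.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyTarget {
    String,
    Integer,
}

/// Returns the type the argument should end up as, or `None` if the
/// argument cannot be coerced to the target.
fn coerce(arg: ToyType, target: ToyTarget) -> Option<ToyType> {
    match (target, arg) {
        // Already a string type: no cast needed.
        (ToyTarget::String, ToyType::Utf8 | ToyType::Utf8View) => Some(arg),
        // In this toy, every other type can be cast to a string.
        (ToyTarget::String, _) => Some(ToyType::Utf8),
        // Already an integer type: no cast needed.
        (ToyTarget::Integer, ToyType::Int32 | ToyType::Int64) => Some(arg),
        // In this toy, nothing else is coercible to an integer.
        (ToyTarget::Integer, _) => None,
    }
}

/// Coerces a whole argument list; fails if the argument count is wrong
/// or if any single argument cannot be coerced.
fn coerce_args(args: &[ToyType], signature: &[ToyTarget]) -> Option<Vec<ToyType>> {
    if args.len() != signature.len() {
        return None;
    }
    args.iter()
        .zip(signature)
        .map(|(arg, target)| coerce(*arg, *target))
        .collect()
}
```

In this toy, `coerce_args(&[ToyType::Int64], &[ToyTarget::String])` yields `Some(vec![ToyType::Utf8])`, so the caller does not need to add an explicit cast. If a rule like the second match arm only covered a few input types, callers would have to cast by hand, which is the kind of workaround described earlier in this thread.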

IMO this should go on the "must fix" list too. I'll make sure to have the PR ready by the end of the weekend.

@shehabgamin
Contributor

@alamb @jayzhan211 #14268 is ready for review!

@alamb alamb self-assigned this Jan 27, 2025
@alamb
Contributor Author

alamb commented Jan 27, 2025

I am starting to do some ticket triage and prepare for the release

IMO this should go on the "must fix" list too. I'll make sure to have the PR ready by the end of the weekend.

Done

@alamb
Contributor Author

alamb commented Jan 27, 2025

Some people currently use ValuesExec::try_new_from_batches, so MemoryExec::try_new_as_values wouldn't necessarily be a suitable substitute.

@shehabgamin -- makes sense. I made this PR to try and clarify:

Thank you again for the testing
