Arrow language to use industry standard format for columnar data #7755
Comments
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-06): Progress: Fixed remaining issues for MacOS arm build. Started looking into requirements for Arrow, doing some background reading on specification. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue investigating into Arrow. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-07): Progress: Small progress on getting the initial example going. Increased timeout for running launcher command (MacOS appears to consistently fail for no apparent reason on CI #8486). It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow. |
Hubert Plociniczak reports a new STANDUP for the provided date (2023-12-08): Progress: Continued working on small date examples, dealing with some interop problems. Debugged some connection problems in CI for MacOS builds; it turned out that retries were fixed and added in the latest sbt (needed a bump, #8498). It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-11): Progress: Draft PR is up which implements the first half proposed in the ticket. Date32 and Date64 types are supported. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-12): Progress: Adding more tests and types, as per specification. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-13): Progress: Addressing PR reviews, figuring out how to test casting. Lots of meetings. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow language. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-14): Progress: Support for casting for Arrow vectors created in other languages. Added a test illustrating the support using Java's Arrow implementation. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow language. |
Hubert Plociniczak reports a new STANDUP for the provided date (2023-12-15): Progress: Added proper nullability support without any copying. PR is mostly ready with the initial support. Added more tests. It should be finished by 2023-12-22. Next Day: Next day I will be working on the #7755 task. Address review, but also look into ongoing compiler bugs. |
Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-01-09): Summary: There is a 19-day delay in the implementation of the Arrow language to use industry standard format for columnar data (#7755) task. Delay Cause: Didn't address PR feedback before the holidays and moved on to other issues in the meantime. |
Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-09): Progress: Rebasing against latest develop, fighting with the JPMS. Also trying to confirm optimization in AliasAnalysis via benchmarks. It should be finished by 2024-01-10. Next Day: Next day I will be working on the #7755 task. Merge PR |
Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-10): Progress: Modularized Arrow language project. Investigated UUID optimization. It should be finished by 2024-01-10. Next Day: Next day I will be working on the #7755 task. Address PR review. |
Initial implementation of the Arrow language. Closes #7755.

Currently supported logical types are:
- Date (days and milliseconds)
- Int (8, 16, 32, 64)

One can currently:
- allocate a new fixed-length, nullable Arrow vector: `new[<name-of-the-type>]`
- cast an already existing fixed-length Arrow vector from a memory address: `cast[<name-of-the-type>]`
Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-01-11): Summary: There is a 2-day delay in the implementation of the Arrow language to use industry standard format for columnar data (#7755) task. Delay Cause: Underestimated the amount of work related to PR review. Will defer some work to follow-up PRs. |
Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-11): Progress: Addressing PR review. PR that introduces lazy UUID generation (#8716). It should be finished by 2024-01-12. Next Day: Next day I will be working on the #7755 task. Address PR review. |
Hubert Plociniczak reports a new STANDUP for the provided date (2024-01-12): Progress: Fighting with CI, last-minute improvements to the PR. The remaining work will go into follow-up PRs. Similarly with the UUID PR in #8728. Investigated the assertion failure in #8595 and pushed a workaround unblocking the PR, but attempts to come up with a minimal reproducible case have failed so far. It should be finished by 2024-01-12. Next Day: Next day I will be working on the #8689 task. Look into GUI/backend failures reported recently. |
Investigation of
resulted in an observation that Pandas 2.0 supports Apache Arrow. If we want smooth, zero-copy exchange of Enso `Table` & `Column` data with Pandas & other Arrow-supporting libraries, then the best way is to store the data in Arrow formats.

To get the best from Truffle we want to create an internal Arrow language to hide the access to Arrow structures behind a simple facade and `InteropLibrary.readArrayElement`. This would be the expected usage:
E.g. one could specify the format of the data as text that gets parsed and returns a factory function to create an array of a given size. The function then returns a `TruffleObject` that behaves like an array:

When this partially evaluates, we would really like to get down to a few simple assembly instructions that read the `buffer` offset, add `at`, and write the four bytes representing time to that address.

Btw. one can use the official Java bindings as an inspiration. To be able to consume "Arrows" from other libraries, we need a way to "cast" a pointer to some Arrow type:
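As a rough illustration of what such a cast means at the memory level, here is a minimal Python sketch using the standard `ctypes` module. It is not the Arrow language's syntax or the Truffle implementation; the helper name `cast_int32` and the `producer` buffer are hypothetical, and the sketch assumes a little-endian host so that native `int32` values match Arrow's default byte order.

```python
import ctypes


def cast_int32(address, length):
    """Hypothetical helper: view `length` int32 values at `address` without copying."""
    return (ctypes.c_int32 * length).from_address(address)


# Stand-in for a data buffer owned by some other Arrow-aware library.
producer = (ctypes.c_int32 * 4)(10, 20, 30, 40)
address = ctypes.addressof(producer)

# "Cast" the raw address to an array-like view over the same memory.
view = cast_int32(address, 4)
view[1] = 99  # writes through to the producer's memory, no copy involved
```

The view does not own the memory or keep the producer alive, so the consumer has to make sure the original buffer outlives it. Any real cast mechanism has to deal with the same ownership question.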
The Arrow language only allocates (or casts) array-like structures in a proper format suitable for exchange with other Arrow-ready libraries. Apache Arrow only specifies the data layout; it doesn't provide any operations on the data. Such operations are then provided by libraries working on the interchangeable data.
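To make the "layout only" point concrete, the following Python sketch models a fixed-width, nullable vector the way the Arrow columnar format lays out a Date32 array: a validity bitmap (one bit per slot, least-significant bit first, 1 meaning valid) plus a data buffer of little-endian 32-bit day counts since the UNIX epoch. The class name `Date32Vector` and its methods are hypothetical and only illustrate the layout; the actual Arrow language is implemented on Truffle and is not shown here.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


class Date32Vector:
    """Hypothetical array-like object using an Arrow-style Date32 layout."""

    WIDTH = 4  # Date32 stores days since the UNIX epoch as a 32-bit integer

    def __init__(self, length):
        self.length = length
        # Validity bitmap: one bit per slot; all zero means every slot is null.
        self.validity = bytearray((length + 7) // 8)
        # Data buffer: `length` fixed-width slots.
        self.data = bytearray(length * self.WIDTH)

    def _check(self, at):
        if not 0 <= at < self.length:
            raise IndexError(at)

    def __len__(self):
        return self.length

    def __setitem__(self, at, value):
        self._check(at)
        days = (value - EPOCH).days
        # Write four bytes at byte offset `at * WIDTH` in the data buffer.
        struct.pack_into("<i", self.data, at * self.WIDTH, days)
        # Mark the slot as valid (least-significant bit numbering).
        self.validity[at >> 3] |= 1 << (at & 7)

    def __getitem__(self, at):
        self._check(at)
        if not (self.validity[at >> 3] & (1 << (at & 7))):
            return None  # null slot
        (days,) = struct.unpack_from("<i", self.data, at * self.WIDTH)
        return EPOCH + timedelta(days=days)


v = Date32Vector(10)
v[3] = date(2024, 1, 12)
```

Reading or writing an element is just an offset computation plus a four-byte load or store, which is the shape of code the issue hopes partial evaluation will reduce to a few instructions.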