Arrow language to use industry standard format for columnar data #7755

Closed
JaroslavTulach opened this issue Sep 6, 2023 · 14 comments · Fixed by #8512
Labels
-compiler -libs Libraries: New libraries to be implemented l-apache-arrow InMemory Table move to Apache Arrow

Comments

@JaroslavTulach
Member

JaroslavTulach commented Sep 6, 2023

Investigation of

resulted in an observation that Pandas 2.0 supports Apache Arrow. If we want smooth, zero-copy exchange of Enso Table & Column data with Pandas & other Arrow-supporting libraries, then the best way is to store the data in Arrow formats.

To get the best from Truffle we want to create an internal Arrow language to hide the access to Arrow structures behind a simple facade and InteropLibrary.readArrayElement. This would be the expected usage:

var myDate32array = ctx.eval("arrow", "new[date32]").execute(size)

E.g. one could specify the format of the data as a text that gets parsed and returns a factory function to create an array of the given size. The function then returns a TruffleObject that behaves like an array:

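import java.nio.ByteBuffer;
import com.oracle.truffle.api.interop.InteropLibrary;
import com.oracle.truffle.api.interop.TruffleObject;
import com.oracle.truffle.api.interop.UnsupportedMessageException;
import com.oracle.truffle.api.library.CachedLibrary;
import com.oracle.truffle.api.library.ExportLibrary;
import com.oracle.truffle.api.library.ExportMessage;

@ExportLibrary(InteropLibrary.class)
class ArrayDate32Array implements TruffleObject {
    private final ByteBuffer buffer;

    // Sketch only: the constructor and the remaining array messages are omitted.
    @ExportMessage
    void writeArrayElement(long index, Object obj, @CachedLibrary(limit = "3") InteropLibrary iop)
            throws UnsupportedMessageException {
        assert iop.isDate(obj);
        // date32 holds days since the UNIX epoch in a 4-byte signed integer
        var at = Math.toIntExact(index * Integer.BYTES);
        var days = Math.toIntExact(iop.asDate(obj).toEpochDay());
        buffer.putInt(at, days);
    }
}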

When this partially evaluates, we would really like to end up with just a few simple assembly instructions that read the buffer's base address, add at, and write the four bytes representing the date to that address.
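The date32 layout behind the snippet above can be tried out without Truffle. The following is only a minimal sketch under stated assumptions: `Date32Sketch` and its methods are hypothetical names, not part of Enso or of the Arrow Java bindings, and a direct `ByteBuffer` stands in for the Arrow data buffer. It stores each date as the number of days since the UNIX epoch in a 4-byte little-endian integer, which is how Arrow lays out date32 values by default.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;

/** Hypothetical helper (not Enso or Arrow API): a date32 column backed by a ByteBuffer. */
final class Date32Sketch {
    private final ByteBuffer buffer;
    private final int size;

    Date32Sketch(int size) {
        this.size = size;
        // Off-heap, little-endian storage: 4 bytes per element.
        this.buffer = ByteBuffer.allocateDirect(size * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    }

    void write(int index, LocalDate date) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException(index);
        // date32 = days since 1970-01-01, stored as a signed 32-bit integer.
        buffer.putInt(index * Integer.BYTES, Math.toIntExact(date.toEpochDay()));
    }

    LocalDate read(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException(index);
        return LocalDate.ofEpochDay(buffer.getInt(index * Integer.BYTES));
    }

    public static void main(String[] args) {
        var column = new Date32Sketch(2);
        column.write(0, LocalDate.of(1970, 1, 2));
        column.write(1, LocalDate.of(2023, 9, 6));
        System.out.println(column.read(0) + " " + column.read(1));
    }
}
```

The element offset is simply index * 4, so after compilation the write should reduce to an address computation and a single 32-bit store.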

Btw. one can use the official Java bindings as an inspiration. To be able to consume "Arrows" from other libraries, we need a way to "cast" a pointer to some Arrow type:

var otherDate32array = ctx.eval("arrow", "cast[date32]").execute(pointer)

The Arrow language only allocates (or casts) array-like structures in the proper format, suitable for exchange with other Arrow-ready libraries. Apache Arrow only specifies the data layout - it doesn't provide any operations on the data. Such operations are then provided by libraries working on the interchangeable data.
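The idea of "casting" existing memory instead of copying it can be illustrated with plain `java.nio`. This is only an analogy under stated assumptions: the proposed `cast[date32]` starts from a raw pointer, while this sketch starts from a `ByteBuffer` that stands in for memory filled by another library; `CastSketch` and `castDate32` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.time.LocalDate;

/** Hypothetical helper (not Enso or Arrow API): a zero-copy date32 view over existing bytes. */
final class CastSketch {
    /** Reinterprets the bytes as little-endian 32-bit values; no data is copied. */
    static IntBuffer castDate32(ByteBuffer raw) {
        // duplicate() shares the content but resets the byte order, so set it again.
        return raw.duplicate().order(ByteOrder.LITTLE_ENDIAN).asIntBuffer();
    }

    public static void main(String[] args) {
        // Stand-in for a buffer produced by some other Arrow-aware library.
        ByteBuffer producer = ByteBuffer.allocateDirect(2 * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        producer.putInt(0, (int) LocalDate.of(2023, 9, 6).toEpochDay());

        IntBuffer days = castDate32(producer);

        // A write made by the producer after the cast is visible through the view.
        producer.putInt(Integer.BYTES, 1);
        System.out.println(LocalDate.ofEpochDay(days.get(0)) + " " + LocalDate.ofEpochDay(days.get(1)));
    }
}
```

Because both objects refer to the same memory, the consumer only needs to agree with the producer on the layout, which is exactly what the Arrow format specifies.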

Further references

@github-project-automation github-project-automation bot moved this to ❓New in Issues Board Sep 6, 2023
@JaroslavTulach JaroslavTulach added the l-apache-arrow InMemory Table move to Apache Arrow label Sep 6, 2023
@jdunkerley jdunkerley added -compiler -libs Libraries: New libraries to be implemented labels Oct 31, 2023
@jdunkerley jdunkerley moved this from ❓New to 📤 Backlog in Issues Board Nov 7, 2023
@enso-bot

enso-bot bot commented Dec 7, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-06):

Progress: Fixed remaining issues for MacOS arm build. Started looking into requirements for Arrow, doing some background reading on specification. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue investigating into Arrow.

@enso-bot

enso-bot bot commented Dec 8, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-07):

Progress: Small progress on getting the initial example going. Increased timeout for running launcher command (MacOS appears to consistently fail for no apparent reason on CI #8486). It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow.

@enso-bot

enso-bot bot commented Dec 11, 2023

Hubert Plociniczak reports a new STANDUP for the provided date (2023-12-08):

Progress: Continued working on small date examples, dealing with some interop problems. Debugged some connection problems in CI for MacOS builds; it turned out some retries were fixed and added to the latest sbt (needed a bump, #8498). It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow.

@hubertp hubertp moved this from 📤 Backlog to 🔧 Implementation in Issues Board Dec 12, 2023
@enso-bot

enso-bot bot commented Dec 12, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-11):

Progress: Draft PR is up which implements the first half proposed in the ticket. Date32 and Date64 types are supported. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow.

@enso-bot

enso-bot bot commented Dec 13, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-12):

Progress: Adding more tests and types, as per specification. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow.

@enso-bot

enso-bot bot commented Dec 14, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-13):

Progress: Addressing PR reviews, figuring out how to test casting. Lots of meetings. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow language.

@enso-bot

enso-bot bot commented Dec 15, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-12-14):

Progress: Support for casting for Arrow vectors created in other languages. Added a test illustrating the support using Java's Arrow implementation. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Continue with adding support for Arrow language.

@hubertp hubertp mentioned this issue Dec 15, 2023
3 tasks
@enso-bot

enso-bot bot commented Dec 18, 2023

Hubert Plociniczak reports a new STANDUP for the provided date (2023-12-15):

Progress: Added proper nullability support without any copying. PR is mostly ready with the initial support. Added more tests. It should be finished by 2023-12-22.

Next Day: Next day I will be working on the #7755 task. Address review, but also look into ongoing compiler bugs.

@jdunkerley jdunkerley moved this from 🔧 Implementation to 👁️ Code review in Issues Board Jan 2, 2024
@enso-bot

enso-bot bot commented Jan 10, 2024

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-01-09):

Summary: There is a 19-day delay in the implementation of the Arrow language to use industry standard format for columnar data (#7755) task.
It will cause a 0-day delay for the delivery of this weekly plan.

Delay Cause: Didn't address PR feedback before holidays and moved on to other issues in the meantime.

@enso-bot

enso-bot bot commented Jan 10, 2024

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-09):

Progress: Rebasing against latest develop, fighting with the JPMS. Also trying to confirm optimization in AliasAnalysis via benchmarks. It should be finished by 2024-01-10.

Next Day: Next day I will be working on the #7755 task. Merge PR

@enso-bot

enso-bot bot commented Jan 11, 2024

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-10):

Progress: Modularized Arrow language project. Investigated UUID optimization. It should be finished by 2024-01-10.

Next Day: Next day I will be working on the #7755 task. Address PR review.

@mergify mergify bot closed this as completed in #8512 Jan 12, 2024
mergify bot pushed a commit that referenced this issue Jan 12, 2024
Initial implementation of the Arrow language. Closes #7755.
Currently supported logical types are
- Date (days and milliseconds)
- Int (8, 16, 32, 64)

One can currently
- allocate a new fixed-length, nullable Arrow vector - `new[<name-of-the-type>]`
- cast an already existing fixed-length Arrow vector from a memory address - `cast[<name-of-the-type>]`

Closes #7755.
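
The "fixed-length, nullable" vectors listed in the commit message above correspond to an Arrow layout that pairs a data buffer with a validity bitmap: one bit per slot, least-significant bit first, where 1 means the slot holds a value and 0 means null. The sketch below only illustrates that layout; `NullableInt32Sketch` is a hypothetical name and the merged implementation may differ in its details.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not Enso or Arrow API): int32 values plus an Arrow-style validity bitmap. */
final class NullableInt32Sketch {
    private final ByteBuffer data;
    private final ByteBuffer validity;

    NullableInt32Sketch(int size) {
        data = ByteBuffer.allocateDirect(size * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        // One bit per slot, rounded up to whole bytes; zeroed, so every slot starts as null.
        validity = ByteBuffer.allocateDirect((size + 7) / 8);
    }

    void set(int index, int value) {
        data.putInt(index * Integer.BYTES, value);
        int byteIndex = index >> 3;
        // Mark the slot as valid by setting bit (index % 8) of byte (index / 8).
        validity.put(byteIndex, (byte) (validity.get(byteIndex) | (1 << (index & 7))));
    }

    void setNull(int index) {
        int byteIndex = index >> 3;
        // Clear the validity bit; the data bytes are left untouched.
        validity.put(byteIndex, (byte) (validity.get(byteIndex) & ~(1 << (index & 7))));
    }

    boolean isNull(int index) {
        return (validity.get(index >> 3) & (1 << (index & 7))) == 0;
    }

    /** Returns the value, or null when the validity bit is cleared. */
    Integer get(int index) {
        return isNull(index) ? null : data.getInt(index * Integer.BYTES);
    }

    public static void main(String[] args) {
        var vector = new NullableInt32Sketch(10);
        vector.set(9, 7);
        System.out.println(vector.get(0) + " " + vector.get(9));
    }
}
```

Nulls live in a separate bitmap rather than in sentinel values, so marking a slot as null never requires rewriting or copying the data buffer.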
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Jan 12, 2024
@enso-bot

enso-bot bot commented Jan 12, 2024

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-01-11):

Summary: There is a 2-day delay in the implementation of the Arrow language to use industry standard format for columnar data (#7755) task.
It will cause a 0-day delay for the delivery of this weekly plan.

Delay Cause: Underestimated the amount of work related to PR review. Will delay some work to follow-up PRs.

@enso-bot

enso-bot bot commented Jan 12, 2024

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-11):

Progress: Addressing PR review. PR that introduces lazy UUID generation (#8716). It should be finished by 2024-01-12.

Next Day: Next day I will be working on the #7755 task. Address PR review.

@enso-bot

enso-bot bot commented Jan 15, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-01-12):

Progress: Fighting with CI, last-minute improvements to the PR. The remaining work will go into follow-up PRs. Similarly with the UUID PR in #8728. Investigated an assertion failure in #8595 and pushed a workaround unblocking the PR, but attempts to come up with a minimal reproducible case have so far failed. It should be finished by 2024-01-12.

Next Day: Next day I will be working on the #8689 task. Look into GUI/backend failures reported recently.


3 participants