Ballista Python Issue(s) #1142
Comments
I’ve been meaning to dive into this and also some work happening on. I don’t have my computer this weekend, so I can’t test to verify, but you may get unblocked if you do. I did write up an issue to improve these confusing errors: apache/datafusion-python#853
But even if that unblocks you, I worry it still doesn’t resolve the core issue of trying to share that session context from one Python package to another.
Draft patch to illustrate "Possible Solution (I)":
diff --git a/Cargo.lock b/Cargo.lock
index 815323b..a00bdc5 100644
diff --git a/Cargo.toml b/Cargo.toml
index df72cd4..cf3cb1c 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -47,6 +47,7 @@ async-trait = "0.1"
futures = "0.3"
object_store = { version = "0.11.0", features = ["aws", "gcp", "azure", "http"] }
url = "2"
+ballista = { path = "../arrow-ballista/ballista/client", default-features = false }
[build-dependencies]
prost-types = "0.13" # keep in line with `datafusion-substrait`
diff --git a/python/datafusion/context.py b/python/datafusion/context.py
index 957d7e3..ca6094a 100644
--- a/python/datafusion/context.py
+++ b/python/datafusion/context.py
@@ -423,7 +423,7 @@ class SessionContext:
"""
def __init__(
- self, config: SessionConfig | None = None, runtime: RuntimeConfig | None = None
+ self, config: SessionConfig | None = None, runtime: RuntimeConfig | None = None, url: str | None = None
) -> None:
"""Main interface for executing queries with DataFusion.
@@ -448,7 +448,7 @@ class SessionContext:
config = config.config_internal if config is not None else None
runtime = runtime.config_internal if runtime is not None else None
- self.ctx = SessionContextInternal(config, runtime)
+ self.ctx = SessionContextInternal(config, runtime, url)
def register_object_store(
self, schema: str, store: Any, host: str | None = None
diff --git a/src/context.rs b/src/context.rs
index f445874..a40bc47 100644
--- a/src/context.rs
+++ b/src/context.rs
@@ -23,6 +23,7 @@ use std::sync::Arc;
use arrow::array::RecordBatchReader;
use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::FromPyArrow;
+use ballista::prelude::SessionContextExt;
use datafusion::execution::session_state::SessionStateBuilder;
use object_store::ObjectStore;
use url::Url;
@@ -271,11 +272,13 @@ pub struct PySessionContext {
#[pymethods]
impl PySessionContext {
- #[pyo3(signature = (config=None, runtime=None))]
+ #[pyo3(signature = (config=None, runtime=None, ballista_url=None))]
#[new]
pub fn new(
config: Option<PySessionConfig>,
runtime: Option<PyRuntimeConfig>,
+ ballista_url: Option<String>,
+ py: Python,
) -> PyResult<Self> {
let config = if let Some(c) = config {
c.config
@@ -293,9 +296,16 @@ impl PySessionContext {
.with_runtime_env(runtime)
.with_default_features()
.build();
- Ok(PySessionContext {
- ctx: SessionContext::new_with_state(session_state),
- })
+
+ match ballista_url {
+ Some(url) => Ok(PySessionContext {
+ ctx: wait_for_future(py, SessionContext::remote_with_state(&url, session_state))
+ .map_err(DataFusionError::from)?,
+ }),
+ None => Ok(PySessionContext {
+ ctx: SessionContext::new_with_state(session_state),
+ }),
+ }
}
/// Register an object store with the given name
More details at apache/datafusion-python@main...milenkovicm:datafusion-python:feat_add_ballista
If we go in this direction, we would need to make ballista an optional feature.
I finally got some time to try this, but unfortunately no luck, no such function. I tried a variation of the proposal, wrapping DataFrame, but got the same error:
from ballista import BallistaBuilder
# from datafusion.context import SessionContext
from datafusion import functions as f
from datafusion.dataframe import DataFrame
ctx = BallistaBuilder()\
.standalone()
df = ctx.sql("SELECT 1 as r")
df0 = DataFrame(df)
df0.aggregate(
[f.col("r")], [f.count_star()]
)
df0.show()
Update: I have also tried:
ctx = SessionContext()
ctx.ctx = BallistaBuilder()\
.standalone()
Same issue with function conversion as before.
After spending some time and reading PyO3/pyo3#1444, I see no simple solution for the problem.
Summary
After some investigation and reading PyO3/pyo3#1444, it looks non-trivial to share (pyo3) structures between multiple crates. There might be some hacks, but it's a long shot. So the options mentioned in #1142 still stand.
Starting with option 2: re-export all the (py)datafusion structures and functions as part of (py)ballista. I can't comment on the scale of the effort, but if we go with it we could end up in the same situation ballista was in, constantly lagging behind (py)datafusion. Thus I'd argue that this approach would be dead on arrival due to the lack of maintainers and the overall duplicated work.
Option 1: create a ballista-specific context in (py)datafusion. IMHO, this approach makes the most sense from a technical perspective. We would just need to expose an optional (py)datafusion ballista integration. This would mean a bit of extra work for the (py)datafusion team. Ballista would be baggage, which in the long run may go into "unmaintained" mode.
In the short term, I would suggest not releasing the (py)ballista bindings until we decide on an approach. Also, if we decide to go with "Option 1", we could use the (py)ballista project for scheduler/executor Python bindings.
Open to any suggestions.
One more option to throw in. Could we reduce the scope for (py)Ballista for now to just support SQL and not the DataFrame API? We would just need the ability to send SQL to the server (perhaps via FlightSQL) and then fetch record batches.
We don't even need FlightSQL; the protocol supports sending a SQL statement:
so we would not need any context on the (py)ballista side, just a gRPC client.
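To make the SQL-only idea concrete, here is a minimal sketch of what such a client could look like. Every name in it (`SqlTransport`, `SqlOnlyClient`, `InMemoryTransport`) is hypothetical and not part of ballista or datafusion-python; the transport is a stand-in for whatever gRPC call the scheduler actually exposes, and record batches are represented as plain Python lists rather than Arrow data.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol

# Placeholder for an Arrow record batch; a real client would return Arrow data.
RecordBatch = List[dict]


class SqlTransport(Protocol):
    """Hypothetical transport; a real one would wrap the scheduler's gRPC API."""

    def execute(self, sql: str) -> Iterator[RecordBatch]: ...


@dataclass
class SqlOnlyClient:
    """Sends SQL text and yields result batches; holds no session context."""

    transport: SqlTransport

    def sql(self, query: str) -> Iterator[RecordBatch]:
        if not query.strip():
            raise ValueError("query must not be empty")
        return self.transport.execute(query)


class InMemoryTransport:
    """Test double (assumption) that answers every query with one fixed batch."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def execute(self, sql: str) -> Iterator[RecordBatch]:
        self.sent.append(sql)
        yield [{"r": 1}]


transport = InMemoryTransport()
client = SqlOnlyClient(transport)
batches = list(client.sql("SELECT 1 AS r"))
print(batches)
```

The point of the sketch is that nothing here touches a `SessionContext` or `DataFrame`, so the cross-package type problem discussed in this thread would not arise; the cost is that users lose the DataFrame API.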
Personally, I find (py)datafusion running on ballista a killer feature :) and a great way to avoid GIL limitations.
@andygrove may I ask what kind of scenarios you'd like to support with "option 3"?
First of all, I'm not an expert in Rust-Python (pyo3) integration; if I've done or said something stupid, my apologies.
The current implementation of (py)ballista has a limitation when it comes to DataFrame operations. The following code will result in an error:
It will throw an exception (similar to):
Actually, the previous implementation had the same problem; the same error is thrown (git checkout 2f223db21557c15080bf865ac692d276b8f0b770).
A similar issue occurs if SessionConfig is used:
The problem with RuntimeConfig and SessionConfig could be solved if they are re-exported in ballista:
but the first problem with DataFrame would still remain.
My guess is that this is an FFI issue, since ballista and datafusion are different packages; I'm not sure what the problem is or how to resolve it.
@timsaucer's comment #1091 (comment) makes more sense to me now.
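As a rough analogy for the guess above, the following pure-Python sketch shows how two packages that each carry their own copy of a class end up with distinct types, so an object from one is rejected by the other. This is only an illustration under the assumption that the failure is a type-identity mismatch between separately built extension modules; it does not use pyo3, and `fake_datafusion`, `fake_ballista`, and `aggregate` are made-up stand-ins.

```python
import types

# The same class body, compiled separately into two modules.
CLASS_SOURCE = """
class DataFrame:
    def __init__(self, rows):
        self.rows = rows
"""


def build_module(name: str) -> types.ModuleType:
    """Create an in-memory module that defines its own DataFrame class."""
    module = types.ModuleType(name)
    exec(CLASS_SOURCE, module.__dict__)
    return module


# Two stand-in packages (hypothetical names), each with its own class object.
fake_datafusion = build_module("fake_datafusion")
fake_ballista = build_module("fake_ballista")


def aggregate(df):
    """Stand-in for a function that only accepts its own package's DataFrame."""
    if not isinstance(df, fake_datafusion.DataFrame):
        found = f"{type(df).__module__}.{type(df).__name__}"
        raise TypeError(f"expected fake_datafusion.DataFrame, got {found}")
    return len(df.rows)


own_df = fake_datafusion.DataFrame([1, 2, 3])
foreign_df = fake_ballista.DataFrame([1, 2, 3])

print(aggregate(own_df))
try:
    aggregate(foreign_df)
except TypeError as err:
    print(f"rejected: {err}")
```

Both classes have the same name and layout, yet the `isinstance` check fails for the foreign instance because the class objects are different. Whether this is exactly what happens between the two compiled packages is not confirmed in this thread.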
Possible Solution (I)
One obvious way would be to move ballista context creation to datafusion-python. We need one line of context creation:
As the ballista context is a SessionContext, it would be trivial to integrate and, I believe, it would avoid the previous issues.
We could provide only a "remote context" (no standalone), making it an optional feature that Python datafusion users could opt in to. This would somewhat limit the number of libraries ballista would bring to datafusion-python (we could split core into core and client-core to further reduce deps).
This proposal would mean bringing an optional dependency into datafusion-python, and additional complexity in the (datafusion-python) release process.
(py)ballista would stay; it could expose scheduler and executor control as proposed in #1107.
A big risk of this proposal is that ballista could block datafusion-python releases if it goes back to unmaintained mode.
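The opt-in behaviour described in this section can be sketched in plain Python as follows. This is not the datafusion-python API: `ContextSketch`, `make_context`, and `BALLISTA_FEATURE_ENABLED` are hypothetical names used only to show the dispatch (local context when no URL is given, remote context when a URL is given and the optional feature is available), and the URL is just an example value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical flag; in a real build this would reflect whether the optional
# ballista feature was compiled into the extension module.
BALLISTA_FEATURE_ENABLED = True


@dataclass
class ContextSketch:
    """Stand-in for a session context that records where queries would run."""

    mode: str
    url: Optional[str] = None


def make_context(
    url: Optional[str] = None,
    *,
    ballista_enabled: bool = BALLISTA_FEATURE_ENABLED,
) -> ContextSketch:
    """Return a local context by default, or a remote one when a URL is given."""
    if url is None:
        return ContextSketch(mode="local")
    if not ballista_enabled:
        # Fail loudly instead of silently falling back to local execution.
        raise RuntimeError("remote contexts need the optional ballista feature")
    return ContextSketch(mode="remote", url=url)


print(make_context())
print(make_context("df://localhost:50050"))
```

Raising an error when the feature is missing, rather than quietly running locally, is a design choice of this sketch; it keeps a misconfigured install from appearing to work while executing everything on one machine.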
Possible Solution (II)
Another possible solution is to re-export all classes from datafusion-python in ballista. I'm not sure how complex or practical this would be.
I'm not sure whether datafusion-python applications would need some kind of rewriting to be able to run on ballista.
This would put additional responsibility on the ballista maintainers (there are not many of them).
Any Other Solution?
I'm not sure; open to suggestions.
Proposal
Short term proposal:
datafusion-ballista/.github/workflows/rust.yml
Lines 121 to 122 in 81cfa63
We should release (py)ballista once we figure out the best approach to fix it.