Kill Spark Query #223
Comments
I need help from @src-d/data-processing with this, as I don't know how to kill the underlying Spark job of the corresponding query. The best option would be if it were possible to do it through our fork of PyHive.
Currently, we support query cancelling for MySQL, which works by getting the connection id of the running query. Unfortunately, there's nothing similar for Spark; the only way is to close the underlying connection (see [private] here). I thought that a way could be to use a custom pool for SQLAlchemy that adds an id to each connection and provides a way to kill one given its connection id (a rough sketch follows below). I thought that a singleton pool could work, but then I realized that running the application through gunicorn means having different connection pools, and each of them won't be able to close a connection belonging to another pool. Do you have any ideas?
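To make the custom-pool idea concrete, here is a minimal sketch using SQLAlchemy pool events instead of a full custom pool class. The `CONNECTIONS` registry and `kill_connection` helper are illustrative names, not existing code, and the registry is per-process, which is exactly where gunicorn gets in the way:

```python
import threading
import uuid

from sqlalchemy import create_engine, event

# Per-process registry of raw DBAPI connections; with gunicorn, each worker
# process would have its own copy, which is the limitation described above.
CONNECTIONS = {}
_LOCK = threading.Lock()

engine = create_engine("hive://localhost:10000/default")  # placeholder DSN

@event.listens_for(engine, "connect")
def _track(dbapi_conn, connection_record):
    conn_id = str(uuid.uuid4())
    connection_record.info["conn_id"] = conn_id
    with _LOCK:
        CONNECTIONS[conn_id] = dbapi_conn

@event.listens_for(engine, "close")
def _untrack(dbapi_conn, connection_record):
    with _LOCK:
        CONNECTIONS.pop(connection_record.info.get("conn_id"), None)

def kill_connection(conn_id):
    """Close a tracked connection, aborting whatever query is running on it."""
    with _LOCK:
        dbapi_conn = CONNECTIONS.pop(conn_id, None)
    if dbapi_conn is not None:
        dbapi_conn.close()
```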
We may want to implement
each worker is independent, as you mentioned. There is a way to share memory between workers, but it won't solve the problem as soon as we have more than one instance of sourced-ui itself. Also, remember that SQL queries can be executed in both sync and async mode (sync inside Superset itself, async in Celery). If it's not feasible to implement
@ajnavarro do you have any input on this?
In my opinion, implementing connection_id and kill is either not possible or really complicated in any case:
I can't see the problem with closing the connection from the client side; the Thrift server is able to handle it with no problem.
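For illustration, this is roughly what that client-side cancellation could look like with stock PyHive (host/port are placeholders, and our fork may differ): `Cursor.cancel()` asks HiveServer2 to cancel the running operation, and `Connection.close()` drops the session entirely.

```python
from pyhive import hive

conn = hive.connect(host="spark-thrift-server", port=10000)  # placeholder host/port
cursor = conn.cursor()

# Started elsewhere, e.g. in the request that runs the query:
#   cursor.execute("SELECT ... FROM some_big_table")

# Called from the STOP code path (e.g. another thread or request):
cursor.cancel()   # ask the Thrift server to cancel the in-flight operation
conn.close()      # or tear down the whole session; the server copes with this
```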
In short, it's not feasible with the current Superset architecture without patching much of the internals. In more detail, pseudo-code for sync queries (the problem is on the frontend):
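Something along these lines (an illustrative sketch, not Superset's actual code; the view and helper names are hypothetical):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/sql_json_sync", methods=["POST"])
def sql_json_sync():
    sql = request.json["sql"]
    engine = get_engine_for_database(request.json["database_id"])  # hypothetical helper
    with engine.connect() as conn:
        rows = [tuple(r) for r in conn.execute(sql)]  # blocks until Spark finishes
    # Pressing STOP only aborts the fetch in the browser; nothing interrupts
    # the execute() above, so the Spark job keeps running to completion.
    return jsonify(rows=rows)
```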
In theory we can patch the frontend, but it won't solve the main use case where queries are executed in SQL Lab. Pseudo-code for async query execution (the problem is on the backend):
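Again as an illustrative sketch (the Celery task and helpers are hypothetical, the broker URL is assumed):

```python
from celery import Celery

celery_app = Celery("superset", broker="redis://localhost:6379/0")  # assumed broker

@celery_app.task
def run_sql(sql, database_id):
    engine = get_engine_for_database(database_id)  # hypothetical helper
    with engine.connect() as conn:
        return [tuple(r) for r in conn.execute(sql)]

def sql_json_async(sql, database_id):
    # The view only enqueues the task and returns; the connection to Spark is
    # opened inside the Celery worker, so the web process has no handle on it.
    result = run_sql.delay(sql, database_id)
    return {"task_id": result.id}  # the task id would have to be saved to cancel later
```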
In theory we can patch the code, save the task_id, and then terminate the task, but:
This means terminating a query might also terminate other queries that were started by other users or in different tabs. As far as I know, Celery doesn't provide any way to terminate a task rather than a process.
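For reference, the closest thing Celery offers is revoking with terminate=True, which signals the worker process rather than the task itself; the task_id below is assumed to have been saved when the query was enqueued:

```python
from celery import Celery

celery_app = Celery("superset", broker="redis://localhost:6379/0")  # assumed broker

def stop_query(task_id):
    # terminate=True sends SIGTERM to the worker process currently executing
    # the task; by the time the signal arrives that process may already be
    # running a different query, which is why this can take out other work.
    celery_app.control.revoke(task_id, terminate=True, signal="SIGTERM")
```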
[PRIVATE] Link to doc with the current alternatives proposed: https://docs.google.com/document/d/1kQBxAQGRNTako9la9ZqEcnnH5JukF5nhVZjb82FUDRU/edit#heading=h.6kocf3qjzjnc |
When I press STOP in Superset SQL Lab it should kill the Spark job/query.