Kill Spark Query #223

Closed
eiso opened this issue Jul 24, 2019 · 7 comments · Fixed by #267
Assignees: se7entyse7en
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

eiso (Member) commented Jul 24, 2019

When I press STOP in Superset SQL Lab it should kill the Spark job/query.

@eiso eiso added the bug Something isn't working label Jul 24, 2019
@se7entyse7en se7entyse7en self-assigned this Aug 8, 2019
se7entyse7en (Contributor) commented:

I need help from @src-d/data-processing for this, as I don't know how to kill the underlying Spark job of the corresponding query. The best option would be if it's possible to do it through our fork of PyHive.

se7entyse7en (Contributor) commented:

Currently, we support query cancellation for MySQL. This is achieved by getting the connection id through the SELECT CONNECTION_ID() statement before running a query, and by issuing KILL CONNECTION <id> when we want to cancel it.
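
For reference, a minimal sketch of that MySQL flow (the pymysql driver and the function names here are just illustrative, not the actual Superset code): the connection id is fetched before the long-running query, and the kill is issued from a separate connection.

import pymysql

def run_query(dsn, sql):
    conn = pymysql.connect(**dsn)
    with conn.cursor() as cur:
        cur.execute("SELECT CONNECTION_ID()")
        (connection_id,) = cur.fetchone()  # store this so the query can be cancelled later
        cur.execute(sql)                   # the long-running query
        return connection_id, cur.fetchall()

def cancel_query(dsn, connection_id):
    # issued from a *different* connection, since the original one is busy
    admin = pymysql.connect(**dsn)
    with admin.cursor() as cur:
        cur.execute("KILL CONNECTION %d" % int(connection_id))
    admin.close()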

Unfortunately, there's nothing similar for Spark; the only way is to close the underlying connection (see [private] here). One idea could be to use a custom pool for sqlalchemy that attaches an id to each connection and exposes a way to kill one given its connection id. I thought a singleton pool could work, but then I realized that running the application through gunicorn means having different connection pools, and each of them won't be able to close a connection belonging to another pool.
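
A rough sketch of that idea using SQLAlchemy pool events (all names and the DSN below are made up for illustration; it only works inside a single process, which is exactly the gunicorn limitation above):

import uuid
from sqlalchemy import create_engine, event

live_connections = {}  # our connection id -> raw DB-API connection

engine = create_engine("hive://spark-thrift-server:10000/default")  # hypothetical DSN

@event.listens_for(engine, "checkout")
def track_connection(dbapi_conn, connection_record, connection_proxy):
    # tag every checked-out connection with an id we can look it up by later
    conn_id = uuid.uuid4().hex
    connection_record.info["tracking_id"] = conn_id
    live_connections[conn_id] = dbapi_conn

def kill_connection(conn_id):
    dbapi_conn = live_connections.pop(conn_id, None)
    if dbapi_conn is not None:
        dbapi_conn.close()  # dropping the connection aborts the running query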

Do you have any idea?

@se7entyse7en se7entyse7en added the help wanted Extra attention is needed label Aug 16, 2019
smacker (Contributor) commented Aug 19, 2019

We may want to implement SELECT CONNECTION_ID() & KILL in Spark.

> custom pool for sqlalchemy that adds an id to a connection

Each worker is independent, as you mentioned. There is a way to share memory between workers, but it won't solve the problem as soon as we have more than one instance of sourced-ui itself.

Also, remember that SQL queries can be executed in both sync and async modes (sync inside Superset itself and async in Celery).

If it's not feasible to implement SELECT CONNECTION_ID() & KILL in Spark, we can consider putting a proxy in front of Spark that would implement these features and drop the real connection.
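
To illustrate the proxy idea (purely a sketch, not an existing component; the host, port, and names are assumptions): a small TCP relay in front of the Thrift server that keeps a registry of upstream sockets, so a control call can drop one and abort the query server-side.

import socket
import threading
import uuid

UPSTREAM = ("spark-thrift-server", 10000)  # assumed Thrift server address
connections = {}                           # conn_id -> upstream socket

def pipe(src, dst):
    # relay bytes in one direction until either side closes
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        src.close()
        dst.close()

def handle(client):
    upstream = socket.create_connection(UPSTREAM)
    conn_id = uuid.uuid4().hex
    connections[conn_id] = upstream
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
    return conn_id

def kill(conn_id):
    # called from a small control API when the user presses STOP
    sock = connections.pop(conn_id, None)
    if sock is not None:
        sock.close()

def serve(port=20000):
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen()
    while True:
        client, _ = srv.accept()
        handle(client)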

se7entyse7en (Contributor) commented:

> If it's not feasible to implement SELECT CONNECTION_ID() & KILL in Spark, we can consider putting a proxy in front of Spark that would implement these features and drop the real connection.

@ajnavarro do you have any input on this?

ajnavarro commented:

In my opinion, implementing connection_id and kill is either not possible or really complicated in any case:

  • If we push down those expressions to gitbase it won't work, because gitbase instances won't have the same connection_id and there is no way to sync those connection ids between gitbases.
  • If we implement those expressions on the Spark side, there is no connection concept at that level, so we would need to implement a bunch of new stuff on top of Spark and the Thrift server, which in my opinion is not worth it.

I don't see a problem with closing the connection from the client side; the Thrift server is able to handle it with no problem.
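
For reference, closing or cancelling from the client side looks roughly like this with PyHive against the Thrift server (host, port, and query are placeholders). I believe the Hive cursor exposes a cancel() call, and closing the connection has the same effect of aborting the running operation:

import threading
from pyhive import hive

conn = hive.connect(host="spark-thrift-server", port=10000)  # placeholder host/port
cursor = conn.cursor()

# the query runs in a worker thread and blocks until it finishes or is cancelled
worker = threading.Thread(
    target=cursor.execute, args=("SELECT count(*) FROM commits",))  # placeholder query
worker.start()

# later, when the user presses STOP:
cursor.cancel()  # asks the Thrift server to cancel the running operation
conn.close()     # or simply drop the whole connection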

smacker (Contributor) commented Aug 20, 2019

> closing the connection from client-side

In short, it's not feasible with the current Superset architecture without patching much of the internals.

In more detail:

Pseudo-code for sync queries (problem on frontend):

function runQuery(query) {
  return fetch(...).then().catch();
}

document.querySelector('.run-button').addEventListener('click', () => {
  runQuery(query); // the promise isn't saved anywhere in the state
});

In theory we can patch the frontend, but it won't solve the main use case when queries are executed in SQL Lab.

Pseudo-code for async query execution (problem on backend):

def execute(query):
    celery.send_task('execute_query', query)
    # task_id isn't saved anywhere or returned to the frontend
    return

In theory we can patch the code to save the task_id and then terminate the task, but:
http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks

> The terminate option is a last resort for administrators when a task is stuck. It’s not for terminating the task, it’s for terminating the process that’s executing the task, and that process may have already started processing another task at the point when the signal is sent, so for this reason you must never call this programmatically.

This means terminating a query might also terminate other queries that were started by other users or in different tabs.

As far as I know, Celery doesn't provide any way to terminate a single task rather than the process running it.
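
For completeness, the patch being discussed would look roughly like this (names are illustrative, not actual Superset code); the revoke(terminate=True) call is exactly the unsafe part quoted above:

def execute(query):
    # `celery` is the Celery app object, as in the pseudo-code above
    result = celery.send_task('execute_query', args=[query])
    save_task_id(query, result.id)  # hypothetical helper: persist the id with the query
    return result.id

def cancel(task_id):
    # unsafe: terminate=True kills the worker process, not just this task
    celery.control.revoke(task_id, terminate=True)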

carlosms (Contributor) commented:

[PRIVATE] Link to doc with the current alternatives proposed: https://docs.google.com/document/d/1kQBxAQGRNTako9la9ZqEcnnH5JukF5nhVZjb82FUDRU/edit#heading=h.6kocf3qjzjnc
