Kill Spark Query #223

Closed
eiso opened this issue Jul 24, 2019 · 7 comments · Fixed by #267
Assignees: se7entyse7en
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

eiso (Member) commented Jul 24, 2019

When I press STOP in Superset SQL Lab it should kill the Spark job/query.

@eiso eiso added the bug Something isn't working label Jul 24, 2019
@se7entyse7en se7entyse7en self-assigned this Aug 8, 2019
se7entyse7en (Contributor) commented:

I need help from @src-d/data-processing for this, as I don't know how to kill the underlying Spark job of the corresponding query. The best option would be if it's possible to do it through our fork of PyHive.

se7entyse7en (Contributor) commented:

Currently, we support query cancellation for MySQL. This is achieved by getting the connection id through the SELECT CONNECTION_ID() statement before running a query, and by issuing KILL CONNECTION <id> when we want to cancel it.
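
For reference, a minimal sketch of that MySQL flow (the pymysql driver and the function names here are just illustrative, not the actual Superset code): the connection id is fetched before the long-running query, and the kill is issued from a separate connection.

import pymysql

def run_query(dsn, sql):
    conn = pymysql.connect(**dsn)
    with conn.cursor() as cur:
        cur.execute("SELECT CONNECTION_ID()")
        (connection_id,) = cur.fetchone()  # store this so the query can be cancelled later
        cur.execute(sql)                   # the long-running query
        return connection_id, cur.fetchall()

def cancel_query(dsn, connection_id):
    # issued from a *different* connection, since the original one is busy
    admin = pymysql.connect(**dsn)
    with admin.cursor() as cur:
        cur.execute("KILL CONNECTION %d" % int(connection_id))
    admin.close()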

Unfortunately, there's nothing similar for Spark; the only way is to close the underlying connection (see [private] here). One idea could be to use a custom pool for sqlalchemy that attaches an id to each connection and exposes a way to kill one given its connection id. I thought a singleton pool could work, but then I realized that running the application through gunicorn means having different connection pools, and each of them won't be able to close a connection belonging to another pool.
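
A rough sketch of that idea using SQLAlchemy pool events (all names and the DSN below are made up for illustration; it only works inside a single process, which is exactly the gunicorn limitation above):

import uuid
from sqlalchemy import create_engine, event

live_connections = {}  # our connection id -> raw DB-API connection

engine = create_engine("hive://spark-thrift-server:10000/default")  # hypothetical DSN

@event.listens_for(engine, "checkout")
def track_connection(dbapi_conn, connection_record, connection_proxy):
    # tag every checked-out connection with an id we can look it up by later
    conn_id = uuid.uuid4().hex
    connection_record.info["tracking_id"] = conn_id
    live_connections[conn_id] = dbapi_conn

def kill_connection(conn_id):
    dbapi_conn = live_connections.pop(conn_id, None)
    if dbapi_conn is not None:
        dbapi_conn.close()  # dropping the connection aborts the running query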

Do you have any idea?

@se7entyse7en se7entyse7en added the help wanted Extra attention is needed label Aug 16, 2019
smacker (Contributor) commented Aug 19, 2019

We may want to implement SELECT CONNECTION_ID() & KILL in Spark.

> custom pool for sqlalchemy that adds an id to a connection

Each worker is independent, as you mentioned. There is a way to share memory between workers, but it won't solve the problem as soon as we have more than one instance of sourced-ui itself.

Also, remember that SQL queries can be executed in both sync and async modes (sync inside Superset itself and async in Celery).

If it's not feasible to implement SELECT CONNECTION_ID() & KILL in Spark, we can consider putting a proxy in front of Spark that would implement these features and drop the real connection.
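
To illustrate the proxy idea (purely a sketch, not an existing component; the host, port, and names are assumptions): a small TCP relay in front of the Thrift server that keeps a registry of upstream sockets, so a control call can drop one and abort the query server-side.

import socket
import threading
import uuid

UPSTREAM = ("spark-thrift-server", 10000)  # assumed Thrift server address
connections = {}                           # conn_id -> upstream socket

def pipe(src, dst):
    # relay bytes in one direction until either side closes
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        src.close()
        dst.close()

def handle(client):
    upstream = socket.create_connection(UPSTREAM)
    conn_id = uuid.uuid4().hex
    connections[conn_id] = upstream
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
    return conn_id

def kill(conn_id):
    # called from a small control API when the user presses STOP
    sock = connections.pop(conn_id, None)
    if sock is not None:
        sock.close()

def serve(port=20000):
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen()
    while True:
        client, _ = srv.accept()
        handle(client)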

se7entyse7en (Contributor) commented:

> If it's not feasible to implement SELECT CONNECTION_ID() & KILL in Spark, we can consider putting a proxy in front of Spark that would implement these features and drop the real connection.

@ajnavarro do you have any input on this?

ajnavarro commented:

In my opinion, implementing connection_id and kill is either not possible or really complicated in any case:

  • If we push down those expressions to gitbase it won't work, because gitbase instances won't have the same connection_id and there is no way to sync those connection ids between gitbases.
  • If we implement those expressions on the Spark side, there is no connection concept at that level, so we would need to implement a bunch of new stuff on top of Spark and the Thrift server, which in my opinion is not worth it.

I don't see a problem with closing the connection from the client side; the Thrift server is able to handle it with no problem.
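
For reference, closing or cancelling from the client side looks roughly like this with PyHive against the Thrift server (host, port, and query are placeholders). I believe the Hive cursor exposes a cancel() call, and closing the connection has the same effect of aborting the running operation:

import threading
from pyhive import hive

conn = hive.connect(host="spark-thrift-server", port=10000)  # placeholder host/port
cursor = conn.cursor()

# the query runs in a worker thread and blocks until it finishes or is cancelled
worker = threading.Thread(
    target=cursor.execute, args=("SELECT count(*) FROM commits",))  # placeholder query
worker.start()

# later, when the user presses STOP:
cursor.cancel()  # asks the Thrift server to cancel the running operation
conn.close()     # or simply drop the whole connection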

smacker (Contributor) commented Aug 20, 2019

> closing the connection from client-side

In short, it's not feasible with the current Superset architecture without patching much of the internals.

In more detail:

Pseudo-code for sync queries (problem on frontend):

function runQuery(query) {
  return fetch(...).then().catch();
}

document.querySelector('.run-button').addEventListener('click', () => {
  runQuery(query); // the promise isn't saved anywhere in the state
});

In theory we can patch the frontend, but it won't solve the main use case when queries are executed in SQL Lab.

Pseudo-code for async query execution (problem on backend):

def execute(query):
    celery.send_task('execute_query', query)
    # task_id isn't saved anywhere or returned to the frontend
    return

In theory we can patch the code to save the task_id and then terminate the task, but:
http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks

> The terminate option is a last resort for administrators when a task is stuck. It’s not for terminating the task, it’s for terminating the process that’s executing the task, and that process may have already started processing another task at the point when the signal is sent, so for this reason you must never call this programmatically.

This means terminating a query might also terminate other queries that were started by other users or in different tabs.

As far as I know, Celery doesn't provide any way to terminate a single task rather than the process running it.
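
For completeness, the patch being discussed would look roughly like this (names are illustrative, not actual Superset code); the revoke(terminate=True) call is exactly the unsafe part quoted above:

def execute(query):
    # `celery` is the Celery app object, as in the pseudo-code above
    result = celery.send_task('execute_query', args=[query])
    save_task_id(query, result.id)  # hypothetical helper: persist the id with the query
    return result.id

def cancel(task_id):
    # unsafe: terminate=True kills the worker process, not just this task
    celery.control.revoke(task_id, terminate=True)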

carlosms (Contributor) commented:

[PRIVATE] Link to doc with the current alternatives proposed: https://docs.google.com/document/d/1kQBxAQGRNTako9la9ZqEcnnH5JukF5nhVZjb82FUDRU/edit#heading=h.6kocf3qjzjnc
