Spark query cancellation #267
/cc @src-d/data-processing for error logs of
superset/superset/db_engine_specs.py
Outdated
connection of the cursor, but that's not an integer, and here an
integer is required."""

return 1
An alternative would be to change this condition.
@@ -37,7 +37,7 @@ fi
 if [ "$#" -ne 0 ]; then
   exec "$@"
 elif [ "$SUPERSET_ENV" = "development" ]; then
-  celery worker --app=superset.sql_lab:celery_app --pool=gevent -Ofair &
+  celery worker --app=superset.sql_lab:celery_app --logfile=/tmp/celery.log --loglevel=INFO --pool=gevent -Ofair &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth exposing the logfile and loglevel as env vars, WDYT?
Let's do it in a separate PR.
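For reference, the env-var version of the entrypoint line could look like the sketch below. `CELERY_LOGFILE` and `CELERY_LOGLEVEL` are made-up variable names (the PR doesn't define them), and the final `echo` only prints the command that would be run instead of starting a worker:

```shell
# Hypothetical sketch: expose the logfile and loglevel as env vars,
# falling back to the values currently hardcoded in the entrypoint.
CELERY_LOGFILE="${CELERY_LOGFILE:-/tmp/celery.log}"
CELERY_LOGLEVEL="${CELERY_LOGLEVEL:-INFO}"
echo celery worker --app=superset.sql_lab:celery_app \
  --logfile="$CELERY_LOGFILE" --loglevel="$CELERY_LOGLEVEL" \
  --pool=gevent -Ofair
```

The `${VAR:-default}` expansion keeps current behavior when the variables are unset, so the change is backward compatible.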
@@ -37,7 +37,7 @@ fi
 if [ "$#" -ne 0 ]; then
   exec "$@"
 elif [ "$SUPERSET_ENV" = "development" ]; then
-  celery worker --app=superset.sql_lab:celery_app --pool=gevent -Ofair &
+  celery worker --app=superset.sql_lab:celery_app --logfile=/tmp/celery.log --loglevel=INFO --pool=gevent -Ofair &
should we redirect it to stdout of the container instead?
I would prefer not to mix it with the application's logs, but I agree that it would make the logs much more accessible. What I would do is redirect to stdout, but make celery run in a separate container. Do you see any particular reason why it needs to run in the same container?
We can consider moving celery to a separate container. I don't see any disadvantages except one more container, but let's do it in a separate issue.
Do you mind just reverting this change for now? One of the problems I see with it: `INFO` is very verbose, and I'm not sure log rotation works correctly in this case. If not, `/tmp/celery.log` would grow very large very fast and would eventually crash the container with an out-of-disk-space error.
Done here.
Created #269 to discuss moving celery to a separate container.
superset/superset/db_engine_specs.py
Outdated
@@ -1416,6 +1421,74 @@ class SparkSQLEngineSpec(HiveEngineSpec):

     engine = 'sparksql'

     @classmethod
     def _dumps_operation_handle(cls, op_handle):
         return {
I didn't check how pyhive types are defined, but most probably you can just use the `__dict__` attribute instead.
It doesn't work because `op_handle.operationId` is an instance of `hive.ttypes.THandleIdentifier`. One way would be to define a custom JSON encoder/decoder, but IMO it isn't worth the effort.
How about:

```python
dict(op_handle.__dict__, operation_id=op_handle.operationId.__dict__)
```

or, if decoding is required:

```python
dict(op_handle.__dict__, operation_id={
    'guid': op_handle.operationId.guid.decode('ISO-8859-1'),
    'secret': op_handle.operationId.secret.decode('ISO-8859-1'),
})
```

The idea is to remove the direct dependency on `TOperationHandle` internals.
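For what it's worth, here's a standalone sketch of that round-trip. The `THandleIdentifier` and `TOperationHandle` classes below are simplified stand-ins for pyhive's Thrift types, not the real definitions, and the helper names are hypothetical:

```python
class THandleIdentifier:
    """Simplified stand-in for hive.ttypes.THandleIdentifier."""
    def __init__(self, guid, secret):
        self.guid = guid      # raw bytes
        self.secret = secret  # raw bytes

class TOperationHandle:
    """Simplified stand-in for the Thrift operation handle."""
    def __init__(self, operationId, operationType, hasResultSet):
        self.operationId = operationId
        self.operationType = operationType
        self.hasResultSet = hasResultSet

def dumps_operation_handle(op_handle):
    # Flatten the handle into JSON-serializable primitives. The nested
    # identifier holds raw bytes, hence the ISO-8859-1 round-trip (it maps
    # every byte value to a valid one-character string).
    return dict(op_handle.__dict__, operationId={
        'guid': op_handle.operationId.guid.decode('ISO-8859-1'),
        'secret': op_handle.operationId.secret.decode('ISO-8859-1'),
    })

def loads_operation_handle(data):
    # Rebuild the handle from the serialized dict when cancelation is requested.
    ident = THandleIdentifier(
        guid=data['operationId']['guid'].encode('ISO-8859-1'),
        secret=data['operationId']['secret'].encode('ISO-8859-1'),
    )
    return TOperationHandle(ident, data['operationType'], data['hasResultSet'])
```

ISO-8859-1 is a convenient choice here because, unlike UTF-8, it can decode any byte sequence without error and encode it back losslessly.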
Fixed here.
looks much nicer!!!
superset/superset/db_engine_specs.py
Outdated
@@ -1151,7 +1151,12 @@ def fetch_data(cls, cursor, limit):
             raise Exception('Query error', state.errorMessage)
         try:
             return super(HiveEngineSpec, cls).fetch_data(cursor, limit)
-        except pyhive.exc.ProgrammingError:
+        except (pyhive.exc.ProgrammingError, pyhive.exc.OperationalError):
I'm not sure it's a good idea. Let's check in which cases `OperationalError` is raised. I believe there should be some cases where we should return it to the user or log it instead of just skipping it.
> Let's check in which cases `OperationalError` is raised.

Actually, I didn't check the other cases when this error is raised, and I agree that in some cases we should return it. My idea was for this to be temporary until `gsc` handles this correctly without raising an exception on cancelation. I could have a look at the specific exception raised during cancelation and skip only that one. WDYT?
If it's possible to catch only the cancelation error, that would be much better.
Fixed here.
superset/superset/db_engine_specs.py
Outdated
# The `pyhive.exc.OperationalError` exception is raised when a
# query is cancelled. This seems to happen because `gsc` expects
# the state to be `FINISHED` but got `CANCELED`.
cancelation_error_msg = 'Expected state FINISHED, but found CANCELED'
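The resulting control flow can be sketched in isolation like this. The `OperationalError` class and the `fetch_data` wrapper below are simplified stand-ins for illustration, not Superset's or pyhive's actual code:

```python
class OperationalError(Exception):
    """Simplified stand-in for pyhive.exc.OperationalError."""

# FIXME: needed only while gsc doesn't support the CANCELED state.
CANCELATION_ERROR_MSG = 'Expected state FINISHED, but found CANCELED'

def fetch_data(fetch):
    """Run `fetch`, swallowing only the known cancelation error."""
    try:
        return fetch()
    except OperationalError as exc:
        if CANCELATION_ERROR_MSG in str(exc):
            # A canceled query simply yields no rows.
            return []
        # Any other OperationalError is a real failure: re-raise it
        # so it reaches the user or the logs.
        raise
```

Matching on the error message string is fragile, but since this is temporary until `gsc` supports `CANCELED`, it avoids swallowing genuine operational failures the way a blanket `except OperationalError` would.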
Please add a FIXME comment, as it's needed only while `gsc` doesn't support `CANCELED`.
Done here.
I'm gonna address DCO and rebase after approvals.
[private] https://src-d.slack.com/archives/C7TB5NEDN/p1567074879029100 This PR only sends the request to cancel the query downstream. To also have the underlying query on gitbase stopped, at least…
Signed-off-by: Lou Marvin Caraig <[email protected]>
Rebased and force-pushed. After rebasing, the cancellation request wasn't being sent; I needed to add this. The reason is that some lines later there's a refresh of the query object that makes the… I tried with… Need further investigation 😞.
All linters are failing.
Hi, I need some help here, especially from the @src-d/data-processing team. Here's what currently happens using this PR for…
The problem is that between point 5 and point 6, if run… After point 6, running… So it seems like the Spark task was killed after… In the layers above…
Okay, I asked @ajnavarro and to recap: Spark cancelation is not instant, and gitbase simply checks for the TCP connection being open, so this is the most we can do so far: [private] slack thread.
@se7entyse7en somehow pylint thinks that… I found a similar issue, but it's the opposite of what we see. And I don't see any relevant issue in the astroid repository. Anyway, pylint shouldn't be able to infer the type of the…
Fixed the requirement with the proper release here.
To fix test coverage we would need to add tests for…
Let's just merge. I don't see a point in mocking everything. We can add testing on Spark instead of MySQL in the gsc branch. Though we don't have a test for cancelation on MySQL either.
Closes #223.
Depends on src-d/PyHive#2.
This PR adds support for query cancellation by canceling the underlying Spark job. This is done by saving the corresponding JSON-serialized `operationHandle` in the `extra` field of the query object. This `operationHandle` is then retrieved, marshaled and sent to the Thrift server to request the cancelation. This also adds some logging and sets the log level and log file for the celery worker.
At the celery layer the task ends correctly without raising any error:
At the Spark level it's possible to see that the job is canceled by looking at the dashboard (http://localhost:4040/jobs):
On the other hand, some exception is not handled by `gsc`: