circus watchers need an async_kill option in some cases #987

Open
thefab opened this issue Jun 14, 2016 · 5 comments

@thefab
Contributor

thefab commented Jun 14, 2016

When you have several watchers with:

  • a large graceful_timeout (600 seconds in our case)
  • max_age values > 0

The Controller's call to the Arbiter's manage_watchers can take a very long time (600 seconds in our case). During this delay circus is not blocked (thanks to the Tornado ioloop), but:

  • some intrusive commands are rejected (because manage_watchers is still running)
  • no new manage_watchers run is launched

So during this 600-second graceful period, other dead or expired processes (even ones belonging to other watchers) are not (re)launched at all.
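
For example, a setup like the following reproduces the problem (just a sketch using the Python API; the watcher names and commands are made up, and I'm assuming get_arbiter passes the extra keys straight to Watcher — an equivalent circus.ini behaves the same):

    from circus import get_arbiter

    # First watcher: a long graceful_timeout combined with max_age > 0, so
    # recycling its processes keeps manage_watchers busy for up to 600s.
    # Second watcher: its dead processes are not respawned during that time.
    arbiter = get_arbiter([
        {"name": "slow_worker", "cmd": "python worker.py",
         "numprocesses": 4, "max_age": 3600, "graceful_timeout": 600},
        {"name": "other_worker", "cmd": "python other_worker.py",
         "numprocesses": 2},
    ])
    try:
        arbiter.start()
    finally:
        arbiter.stop()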

The real fix is really intrusive and complex because of concurrency issues.

We are working on a small fix based on an "async_kill" flag (at the watcher level).

Ideas and advice are welcome.

@k4nar
Contributor

k4nar commented Jun 14, 2016

This conflict error is something that comes up pretty often. I'm not sure what the solution is, but this is clearly something we have to fix.

Actually, I'm wondering why we block other commands during manage_watchers. Right now it's a pretty heavy coroutine, but I think we could convert it to something more flexible. Also, it is called as a PeriodicCallback, but we could convert it to a while True loop with a gen.sleep, which is what Tornado's documentation recommends.
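
Roughly something like this (just a sketch, not actual circus code; I'm assuming arbiter.manage_watchers stays the coroutine it is today and check_delay stays the polling interval):

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def manage_watchers_loop(arbiter):
        # Unlike a PeriodicCallback, a new iteration only starts once the
        # previous one has finished, so runs can never overlap even when a
        # graceful kill takes hundreds of seconds.
        while True:
            yield arbiter.manage_watchers()
            yield gen.sleep(arbiter.check_delay)

    # started once, e.g. with:
    # IOLoop.current().spawn_callback(manage_watchers_loop, arbiter)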

@thefab: Do you see any major blocking point making this impossible?

@k4nar k4nar added the question label Jun 14, 2016
@thefab
Contributor Author

thefab commented Jun 14, 2016

My analysis is the same. But splitting manage_watchers into more flexible parts, or removing the synchronization decorator... is not easy at all, because it's really dangerous in terms of complex regressions. This would be a major change.

I'm currently working on a more pragmatic change. For the moment, I'm introducing this option at the watcher level:

+      async_kill: don't wait for the kill to fully complete (SIGTERM +
+      graceful timeout + SIGKILL); can help if you have a big graceful
+      timeout and don't mind having more processes than numprocesses
+      during the graceful kill phase.
+      Defaults to False.

The main problem, in our use case, is that the graceful_timeout blocks the complete execution of manage_processes (at the watcher level) and therefore of manage_watchers (at the arbiter level).

With async_kill=True and very few changes to handle the new option, manage_processes runs fast because it no longer waits for the kill to fully complete: SIGTERM is sent, the process is removed from watcher.processes, and the rest of the kill is put on the ioloop (as a future), completely detached from the manage_watchers execution (which should then be fast too).

This mitigates the conflict error because manage_watchers becomes fast.
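
To give an idea of the principle (a simplified sketch, not the actual #988 code; os.kill stands in for the watcher's real signal plumbing):

    import os
    import signal

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def async_kill(watcher, process):
        # Send SIGTERM and stop tracking the process right away, so
        # manage_processes / manage_watchers don't wait on the grace period.
        os.kill(process.pid, signal.SIGTERM)
        watcher.processes.pop(process.pid, None)

        @gen.coroutine
        def _finish_kill():
            # The slow part runs detached on the ioloop: wait for the
            # graceful timeout, then SIGKILL if the process is still there.
            yield gen.sleep(watcher.graceful_timeout)
            try:
                os.kill(process.pid, signal.SIGKILL)
            except OSError:
                pass  # already exited during the grace period

        IOLoop.current().spawn_callback(_finish_kill)

The trade-off, as the option doc above says, is that the watcher can temporarily run more than numprocesses processes while the old ones finish dying.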

@thefab
Contributor Author

thefab commented Jun 20, 2016

#988 is deployed on our integration server and has been working well for a few days.

OK, it's not the perfect solution to the "manage_watchers monolith", but I can't work on that for the moment. It's too big and too risky for us right now.

The #988 proposal mitigates the problem; in our particular use case, it's a huge win.
I can refactor #988 a bit and try to add some unit tests and docs to reach a mergeable status.

But are you interested?

I'm waiting for your decision. Thanks.

@k4nar
Contributor

k4nar commented Jun 20, 2016

I haven't had time to take a deep look at it yet. From what I've seen, it seems to be a small change and a big win, so I think this could be merged.

Anyway, I'm trying to fix our tests before accepting new PRs, and I'm starting to see the light at the end of the tunnel :).

@thefab
Contributor Author

thefab commented Jun 20, 2016

OK, perfect. I'll wait for your test fixes before breaking them with my changes ;-)
