circus watchers need an async_kill option in some cases #987

Open
thefab opened this issue Jun 14, 2016 · 5 comments

@thefab
Contributor

thefab commented Jun 14, 2016

When you have several watchers with:

  • a large graceful_timeout (600 seconds in our case)
  • max_age values > 0

The Controller's call to the Arbiter's manage_watchers can take a very long time (600 seconds in our case). During this delay circus is not blocked (thanks to the Tornado ioloop), but:

  • some intrusive commands are rejected (because manage_watchers is still running)
  • no new manage_watchers run is launched

So during this 600-second graceful period, other dead or expired processes (even ones belonging to other watchers) are not (re)launched at all.
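
For example, a setup like the following reproduces the problem (just a sketch using the Python API; the watcher names and commands are made up, and I'm assuming get_arbiter passes the extra keys straight to Watcher — an equivalent circus.ini behaves the same):

    from circus import get_arbiter

    # First watcher: a long graceful_timeout combined with max_age > 0, so
    # recycling its processes keeps manage_watchers busy for up to 600s.
    # Second watcher: its dead processes are not respawned during that time.
    arbiter = get_arbiter([
        {"name": "slow_worker", "cmd": "python worker.py",
         "numprocesses": 4, "max_age": 3600, "graceful_timeout": 600},
        {"name": "other_worker", "cmd": "python other_worker.py",
         "numprocesses": 2},
    ])
    try:
        arbiter.start()
    finally:
        arbiter.stop()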

The real fix is really intrusive and complex because of concurrency issues.

We are working on a small fix based on an "async_kill" flag (at the watcher level).

Ideas and advice are welcome.

@k4nar
Contributor

k4nar commented Jun 14, 2016

This conflict error is something that comes up pretty often. I'm not sure what the solution is, but this is clearly something we have to fix.

Actually, I'm wondering why we block other commands during manage_watchers. Right now it's a pretty heavy coroutine, but I think we could convert it to something more flexible. Also, it is called as a PeriodicCallback, but we could convert it to a while True loop with a gen.sleep, which is what Tornado's documentation recommends.
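
Roughly something like this (just a sketch, not actual circus code; I'm assuming arbiter.manage_watchers stays the coroutine it is today and check_delay stays the polling interval):

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def manage_watchers_loop(arbiter):
        # Unlike a PeriodicCallback, a new iteration only starts once the
        # previous one has finished, so runs can never overlap even when a
        # graceful kill takes hundreds of seconds.
        while True:
            yield arbiter.manage_watchers()
            yield gen.sleep(arbiter.check_delay)

    # started once, e.g. with:
    # IOLoop.current().spawn_callback(manage_watchers_loop, arbiter)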

@thefab: Do you see any major blocking point making this impossible?

@k4nar k4nar added the question label Jun 14, 2016
@thefab
Contributor Author

thefab commented Jun 14, 2016

My analysis is the same. But splitting manage_watchers into more flexible parts, or removing the synchronization decorator... is not easy at all, because it's really dangerous in terms of complex regressions. This would be a major change.

I'm currently working on a more pragmatic change. For the moment, I'm introducing this option at the watcher level:

+      async_kill: don't wait for the kill to fully complete (SIGTERM +
+      graceful timeout + SIGKILL); can help if you have a big graceful
+      timeout and don't mind having more processes than numprocesses
+      during the graceful kill phase.
+      Defaults to False.

The main problem, in our use case, is that the graceful_timeout blocks the complete execution of manage_processes (at the watcher level) and therefore of manage_watchers (at the arbiter level).

With async_kill=True and very few changes to handle the new option, manage_processes runs fast because it no longer waits for the kill to fully complete: SIGTERM is sent, the process is removed from watcher.processes, and the rest of the kill is put on the ioloop (as a future), completely detached from the manage_watchers execution (which should then be fast too).

This mitigates the conflict error because manage_watchers becomes fast.
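
To give an idea of the principle (a simplified sketch, not the actual #988 code; os.kill stands in for the watcher's real signal plumbing):

    import os
    import signal

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def async_kill(watcher, process):
        # Send SIGTERM and stop tracking the process right away, so
        # manage_processes / manage_watchers don't wait on the grace period.
        os.kill(process.pid, signal.SIGTERM)
        watcher.processes.pop(process.pid, None)

        @gen.coroutine
        def _finish_kill():
            # The slow part runs detached on the ioloop: wait for the
            # graceful timeout, then SIGKILL if the process is still there.
            yield gen.sleep(watcher.graceful_timeout)
            try:
                os.kill(process.pid, signal.SIGKILL)
            except OSError:
                pass  # already exited during the grace period

        IOLoop.current().spawn_callback(_finish_kill)

The trade-off, as the option doc above says, is that the watcher can temporarily run more than numprocesses processes while the old ones finish dying.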

@thefab
Contributor Author

thefab commented Jun 20, 2016

#988 is deployed on our integration server and has been working well for a few days.

OK, it's not the perfect solution to the "manage_watchers monolith", but I can't work on that for the moment. It's too big and too risky for us right now.

The #988 proposal mitigates the problem; in our particular use case, it's a huge win.
I can refactor #988 a bit and try to add some unit tests and docs to reach a mergeable status.

But are you interested?

I'm waiting for your decision. Thanks.

@k4nar
Contributor

k4nar commented Jun 20, 2016

I haven't had time to take a deep look at it yet. From what I've seen, it seems to be a small change and a big win, so I think this could be merged.

Anyway, I'm trying to fix our tests before accepting new PRs, and I'm starting to see the light at the end of the tunnel :).

@thefab
Contributor Author

thefab commented Jun 20, 2016

OK, perfect. I'll wait for your test fixes before breaking them with my changes ;-)
