
Improvement in gunicorn container settings #322

Open
hille721 opened this issue Aug 6, 2021 · 13 comments

@hille721
Contributor

hille721 commented Aug 6, 2021

What is the idea?

I'm not sure if the current gunicorn settings in the official ara images are really optimized for container usage:

--workers=4

Starting 4 workers means 4 processes inside the container, which is vertical scaling inside the container. But isn't using containers about horizontal scaling? Thus, instead of spawning more processes in one container, we would just use more containers.

I found this nice guide: https://pythonspeed.com/articles/gunicorn-in-docker/ and also tried the recommended settings. With them I am able to spawn more containers, each with fewer resources, which works much better on my container platform (OpenShift).
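
For reference, applied to ara the settings from that guide would look roughly like this:

gunicorn --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm --bind 0.0.0.0:8000 ara.server.wsgi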

The guide is from 2019 and I'm not an expert in this topic, but maybe there are some here who can jump into the discussion :)

@dmsimard
Contributor

dmsimard commented Aug 7, 2021

Hi and thanks for the issue!

To be fair, I must say that the docs make no claims that the container images published by the project are intended or optimized for production use at a large scale:

The scripts are designed to yield images that are opinionated and “batteries-included” for the sake of simplicity. They install the necessary packages for connecting to MySQL and PostgreSQL databases and set up gunicorn as the application server.

You are encouraged to use these scripts as a base example that you can build, tweak and improve the container image according to your specific needs and preferences.

For example, precious megabytes can be saved by installing only the things you need and you can change the application server as well as its configuration.

That is not to say that we cannot improve the base image we publish but the objective is more about getting people started quickly and then allowing users to tweak on their own by showing them how the sausage is made.
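
As a rough sketch, the shipped command can be overridden at runtime without rebuilding anything (assuming the image name we publish and podman as the runtime):

podman run --rm -p 8000:8000 docker.io/recordsansible/ara-api:latest \
  bash -c '/usr/local/bin/ara-manage migrate && python3 -m gunicorn --workers=1 --access-logfile - --bind 0.0.0.0:8000 ara.server.wsgi'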

That said, it wouldn't be a bad idea to benchmark different approaches and settings to find out what works best and what doesn't so we can make an informed decision. I personally like gunicorn but there's also uwsgi and other ways to run the application if people really want to.

Edit: links to existing benchmarks:

@VannTen

VannTen commented Oct 10, 2024

It looks like gunicorn and containers don't go very well together.

We're currently doing a PoC with ara on Kubernetes to record our playbook runs, using the provided images, and consistently getting WORKER TIMEOUT errors (doing simple curl calls with not much data, and using sqlite for now since we're just trying out ara):

127.0.0.1 - - [10/Oct/2024:13:14:00 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:03 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:10 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:11 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:13 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:14 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:23 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
[2024-10-10 13:24:16 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:65)
[2024-10-10 13:24:17 +0000] [1] [ERROR] Worker (pid:65) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:24:17 +0000] [103] [INFO] Booting worker with pid: 103
[2024-10-10 13:24:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:67)
[2024-10-10 13:24:53 +0000] [1] [ERROR] Worker (pid:67) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:24:53 +0000] [104] [INFO] Booting worker with pid: 104
[2024-10-10 13:25:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-10-10 13:25:28 +0000] [1] [ERROR] Worker (pid:101) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:25:28 +0000] [105] [INFO] Booting worker with pid: 105
127.0.0.1 - - [10/Oct/2024:13:25:57 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:29:19 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"

@hille721
Contributor Author

What do you have for gunicorn settings? I have the following:

gunicorn --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm --log-file=- --access-logfile=- --bind 0.0.0.0:8000 ara.server.wsgi

With that, ara has been running on Kubernetes for years.

@dmsimard
Contributor

dmsimard commented Oct 11, 2024

Hi @VannTen and thanks for the feedback (also merci for working on kubespray ❤️).

I stand by my previous comment: we aren't specifically tuning the container images for scale or performance, but they should work, and if there is anything we can do to make them run better we should consider it.

The recent blog post you shared is interesting and didn't exist when we last looked at this, the takeaway being:

tl;dr The conventional wisdom to use multiple workers in a containerized instance of Flask/Django/anything that is served with gunicorn is incorrect - you should only use one worker per container, otherwise you’re not properly using the resources allocated to your application. Using multiple workers per container also runs the risk of OOM SIGKILLs without logging, making diagnosis of issues much more difficult than it would be otherwise.

I don't personally have ara deployed in k8s right now but I am willing to work with you to find out if this is true in the context of ara, while putting the odds in our favour by doing two more things (that are part of general performance troubleshooting tips):

  • Switching to a mysql or postgresql backend (to make sure we aren't running into sqlite lock contention)
  • Putting a k8s ingress in front of the ara containers (up to you: haproxy, nginx, traefik, etc.); see the rough sketch below
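
A rough kubectl sketch for the ingress part, assuming an nginx ingress class and a Service called ara-api exposing port 8000:

kubectl create ingress ara --class=nginx --rule="ara.example.com/*=ara-api:8000"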

The container images currently ship with this command:

buildah config --cmd "bash -c '/usr/local/bin/ara-manage migrate && python3 -m gunicorn --workers=4 --access-logfile - --bind 0.0.0.0:8000 ara.server.wsgi'" "${build}"

For the sake of simplicity, I have gone ahead and rebuilt the latest image, only changing the number of workers from 4 to 1. (@hille721, if you have any information or data regarding your additional settings, maybe we can test that too.)

You can try this image here: docker.io/dmsimard/ara-dont-use-this-for-prod:one-worker
(via https://hub.docker.com/repository/docker/dmsimard/ara-dont-use-this-for-prod/general)

If you want to use MySQL, you should have environment variables that look like this where the ara server container runs:

ARA_DATABASE_CONN_MAX_AGE: 60
ARA_DATABASE_ENGINE: django.db.backends.mysql
ARA_DATABASE_HOST: mysql.host.name
ARA_DATABASE_NAME: ara
ARA_DATABASE_PASSWORD: password
ARA_DATABASE_PORT: 3306
ARA_DATABASE_USER: ara

For PostgreSQL:

ARA_DATABASE_CONN_MAX_AGE: 60
ARA_DATABASE_ENGINE: django.db.backends.postgresql
ARA_DATABASE_HOST: postgresql.host.name
ARA_DATABASE_NAME: ara
ARA_DATABASE_PASSWORD: password
ARA_DATABASE_PORT: 5432
ARA_DATABASE_USER: ara

Please let me know how that works out and if you have any interesting findings we can work with.

Thanks!

@VannTen

VannTen commented Oct 11, 2024 via email

@dmsimard
Contributor

dmsimard commented Oct 11, 2024

For what it's worth, the ara server doesn't /need/ to run with gunicorn. Any WSGI server known to run Django will work as well (uwsgi, apache mod_wsgi, etc.).

We feel the same about databases in k8s: the database server can run outside on bare metal or on a VM, etc., you just have to be mindful of the network latency between the ara server and the database server.

That said, it feels like we might be missing something because the performance shouldn't be /that/ bad and errors shouldn't come up so easily, especially if you aren't running concurrent playbooks which could run into the sqlite lock issues.

Are you able to reproduce the kind of issues you are seeing if you try to run the container outside of k8s? I mean locally with podman or docker.
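
For example, a quick local sketch using the test image from my earlier comment and the postgresql variables above (host and credentials are placeholders):

podman run --rm --name ara-api -p 8000:8000 \
  -e ARA_DATABASE_ENGINE=django.db.backends.postgresql \
  -e ARA_DATABASE_HOST=postgresql.host.name \
  -e ARA_DATABASE_NAME=ara \
  -e ARA_DATABASE_USER=ara \
  -e ARA_DATABASE_PASSWORD=password \
  -e ARA_DATABASE_PORT=5432 \
  docker.io/dmsimard/ara-dont-use-this-for-prod:one-worker
curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' http://127.0.0.1:8000/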

@dmsimard
Contributor

dmsimard commented Oct 11, 2024

https://pythonspeed.com/articles/gunicorn-in-docker/ seems pretty interesting and the reasoning behind the options seems pretty sound to me; it's probably what we're going to try next (--workers=1 ended up not making much of a difference, unfortunately).

It suggests using --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm which was part of the command @hille721 provided. I can build an image with that so we can compare but it will be later -- I'm about to board a flight back home :p
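
Untested, but I expect the resulting container command would end up looking roughly like this (the current command with those flags swapped in):

bash -c '/usr/local/bin/ara-manage migrate && python3 -m gunicorn --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm --access-logfile - --bind 0.0.0.0:8000 ara.server.wsgi'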

@dmsimard
Contributor

I put up an image with those settings: docker.io/dmsimard/ara-dont-use-this-for-prod:w2-t4-gthread-shm

I will also do some testing on my end out of curiosity.

@dmsimard
Contributor

dmsimard commented Oct 12, 2024

I have used an approach similar to the benchmarking blog posts (database backends, ansible versions & ara) to test whether there is a significant difference between the current image and the "tweaked" settings.

This is running locally on the same machine (16 cores, 32GB RAM, modest SSDs) on Fedora 40.

The results:

[two screenshots of the benchmark runs]

Stock (current image)

  • sqlite: 6m51s
  • mysql (with ARA_CALLBACK_THREADS=4): 4m59s

2 workers, 4 threads, gthread, /dev/shm

  • sqlite: 6m40s
  • mysql (with ARA_CALLBACK_THREADS=4): 4m54s

So, yes, while the numbers are slightly better using the 2 workers / 4 threads / gthread / /dev/shm options, the difference is almost negligible in practice: the benchmarking test playbook does nothing 10,000 times really fast.
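
For reference, the comparison can be reproduced with something along these lines (the server URL and playbook name are placeholders; ARA_CALLBACK_THREADS is the setting mentioned above):

export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
export ARA_API_CLIENT=http
export ARA_API_SERVER=http://127.0.0.1:8000
export ARA_CALLBACK_THREADS=4
time ansible-playbook benchmark.yml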

In any case, I am unable to reproduce the extreme sluggishness you are seeing.

I will leave it at that for now but I am interested to learn if you find out anything.

Edit: tangentially related, these numbers are better than the ones last benchmarked by a significant margin:

[image: results from the previous benchmarks]

Maybe we are due for a new blog post :)

@VannTen

VannTen commented Oct 13, 2024 via email

@dmsimard
Contributor

Hi @VannTen, I'm reaching out to see if you ended up finding anything interesting.

Thanks,

@VannTen

VannTen commented Nov 19, 2024 via email

@dmsimard
Contributor

Thanks for reporting back :D

I have not revisited the topic of making the callback less blocking in a while and it could be worth looking into again.

With some time to think about it, the approach used in https://gist.github.com/phemmer/8ee4ea0ebf1b389050ce4a2bd78c66d6 could be shipped as an additional callback that people can use if need be. I need some time to test it out.

I will also add it to my to-do list for benchmarking I will be doing in the not-too-distant future.
