Memory leak / concurrency issues in short-running workers #2522
Initial investigation
Redict
Even though the issue appears to come from Redict, the deployment is stable. As of writing this comment, there is one pod deployed that has been running since July 29th (with regards to the resources: Briefly checking the Redict deployment, I notice an increasing trend in the number of connected clients: before opening this issue it was around 1500, currently as I'm writing this comment it's
Status update
On Friday I replaced Redict with Valkey and the workers got redeployed. I've been checking the count of connected clients here and there:
Based on these observations, rescaling the workers dropped the number of connections, and the issue is present across different deployments (e.g., Redis, Redict, Valkey). Posting the list of connected clients before experimenting with queues
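A minimal sketch of how the connection counts per worker pod could be tallied from saved `CLIENT LIST` output. The helper names and the sample addresses are made up for illustration and are not taken from the actual deployment.

```python
from collections import Counter


def parse_client_list(raw: str) -> list[dict[str, str]]:
    """Parse CLIENT LIST text (one client per line, space-separated key=value)."""
    clients = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = {}
        for token in line.split(" "):
            key, sep, value = token.partition("=")
            if sep:  # skip tokens without '='
                fields[key] = value
        clients.append(fields)
    return clients


def connections_per_host(clients: list[dict[str, str]]) -> Counter:
    """Count connections per client host (the part of addr before the port)."""
    counts: Counter = Counter()
    for client in clients:
        host = client.get("addr", "").rpartition(":")[0]
        counts[host] += 1
    return counts


# Hypothetical sample, shaped like real CLIENT LIST lines but heavily shortened.
SAMPLE = """\
id=11 addr=10.0.0.5:40112 fd=8 name= age=7200 idle=7100 flags=N db=0 cmd=get
id=12 addr=10.0.0.5:40388 fd=9 name= age=3600 idle=3500 flags=N db=0 cmd=get
id=13 addr=10.0.0.9:51234 fd=10 name= age=30 idle=1 flags=N db=0 cmd=ping
"""

if __name__ == "__main__":
    for host, count in connections_per_host(parse_client_list(SAMPLE)).most_common():
        print(host, count)
```

Grouping by host makes it easier to see whether the connections pile up on a few pods (e.g., the short-running workers) or are spread evenly.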
2º update
To pinpoint the issue more precisely, I've rescaled the workers while watching the stats from Valkey.
OpenShift Metrics: The issue is definitely coming from short-running workers… Based on the previous findings:
I assume that running out of connection slots is a side effect of the memory leak that causes the restarts. This could be caused by failed cleanup of the concurrent threads in the short-running workers (which hold onto both the allocated memory and the open connection to Valkey). I also suspected a bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but the memory should've been cleaned up.
Next steps
Captured output from the
Testing on prod pt. 2
Before adjusting the timeout there were between 4k and 6k clients, so even the 1hr timeout seems reasonable. Checking the client list posted in a comment above, most of the clients have
Going through the Redis docs I found:
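To get a rough idea of how many connections a given `timeout` value would reap, the `idle` field of the `CLIENT LIST` output can be compared against it. The sketch below assumes the output was saved as text; the function names and sample values are hypothetical. A `timeout` of 0 disables the idle cleanup.

```python
import re

# Matches the idle=<seconds> field of a CLIENT LIST line.
IDLE_RE = re.compile(r"(?:^|\s)idle=(\d+)(?=\s|$)")


def idle_seconds(raw: str) -> list[int]:
    """Extract the idle time (in seconds) of every client in CLIENT LIST text."""
    values = []
    for line in raw.splitlines():
        match = IDLE_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


def split_by_timeout(idle_values: list[int], timeout: int) -> tuple[int, int]:
    """Return (kept, stale): clients within the timeout vs. idle for longer."""
    if timeout <= 0:  # 0 means the server never closes idle clients
        return len(idle_values), 0
    stale = sum(1 for idle in idle_values if idle > timeout)
    return len(idle_values) - stale, stale


# Hypothetical sample, shortened to the fields that matter here.
SAMPLE = """\
id=11 addr=10.0.0.5:40112 age=7200 idle=7100 cmd=get
id=12 addr=10.0.0.5:40388 age=3600 idle=3500 cmd=get
id=13 addr=10.0.0.9:51234 age=30 idle=1 cmd=ping
"""

if __name__ == "__main__":
    for timeout in (3600, 1800):
        kept, stale = split_by_timeout(idle_seconds(SAMPLE), timeout)
        print(f"timeout={timeout}: kept={kept} stale={stale}")
```

Since the server cleans up idle clients iteratively, the actual number of open connections at any moment can be higher than the `kept` estimate.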
Gotta love the first point…
TODO
Looks OK so far. After a brief inspection of the Valkey client list, lowering the timeout further to 1800 (
Before switching to 3600 (1 hour), we hung around 600-800 connections, right now (cleanup is iterative = not all clients that pass the
As for the short-running workers, I don't really see a noticeable difference (the red line indicates when the timeout was set).
The last restart happened yesterday, even after setting the timeout on Valkey, and there appears to be an increasing trend in used memory, so the timeout doesn't appear to help in any way with the issue of the short-running workers. However, since Valkey cleans up the connections, the leak no longer causes a DoS from running out of Valkey connections…
Currently hanging around 550 clients in Valkey, will open a PR to have the timeout configured in our Valkey/Redict/Redis deployment.
The leaks in short-running pods result in idle connections to Redict/Valkey. All of these KV-databases have a ‹timeout› option in their config that allows for iterative cleanup of hanging connections. This mitigates the issue to the point of still having free connection slots to Redict/Valkey, i.e., the pods shall be killed, but handlers do not end up in a retry-loop trying to connect to Redict/Valkey.
Since the config is 1:1 between all of Redis, Redict, and Valkey, create one ConfigMap, map the config into the databases and pass the path to the config as an argument.
Tested with Redict and Valkey. »NOT« tested with Redis.
Related to packit/packit-service#2522
Signed-off-by: Matej Focko <[email protected]>
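For experimenting before the config lands, the same ‹timeout› option can also be changed at runtime via `CONFIG SET`. The sketch below is only an illustration: `set_idle_timeout` and `InMemoryClient` are hypothetical names, and the stand-in client exists solely so the example runs without a server. It mimics the `config_set`/`config_get` methods that redis-py clients expose.

```python
class InMemoryClient:
    """Hypothetical stand-in mimicking config_set/config_get of a real client."""

    def __init__(self) -> None:
        self._config = {"timeout": "0"}  # 0 = idle clients are never closed

    def config_set(self, name: str, value) -> bool:
        self._config[name] = str(value)
        return True

    def config_get(self, pattern: str) -> dict[str, str]:
        # Simplification: exact-name lookup only, no glob patterns.
        return {k: v for k, v in self._config.items() if k == pattern}


def set_idle_timeout(client, seconds: int) -> int:
    """Set the server-side idle timeout and return the value read back."""
    if seconds < 0:
        raise ValueError("timeout must be >= 0 (0 disables the idle timeout)")
    client.config_set("timeout", seconds)
    return int(client.config_get("timeout")["timeout"])


if __name__ == "__main__":
    client = InMemoryClient()
    print(set_idle_timeout(client, 1800))
```

With a real server, a redis-py client (e.g., `redis.Redis(host=...)`) could be passed in instead of the stand-in. A value set this way is not persisted across restarts unless the config is rewritten, which is one reason to keep the option in the config file mounted from the ConfigMap.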
- [x] Deployed on both stage and prod… :PepeLaugh:
> Friday evening deployments hit different…
Related to packit/packit-service#2522
Fixing this should also solve #2427
Blocked on #2693
Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS