Worker goes unhealthy after 10 minutes on sidekiq 7 #94
Hi @w32-blaster,
@arturictus do you mean to check it in the sidekiq UI?
@w32-blaster, yes
Hi @arturictus Yes, the sidekiq-alive queue is running and we can see it in the sidekiq UI. Please let us know if you have any clue why this might be happening with the sidekiq-alive gem when used with sidekiq 7.1. Just another quick piece of info: we also use
I am also using it with sidekiq 7.1 together with

One problem that can happen is if your cron jobs spawn enough jobs to take all available threads: the alive job will not get picked up and the app will start reporting as unhealthy. I encountered such an issue a while back in my project here: https://gitlab.com/dependabot-gitlab/dependabot/-/issues/293#note_1408148243

This is not really solvable in sidekiq < 7, but capsules in the new version should be able to address that. Potential fix: #96
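For readers hitting the thread-starvation variant, here is a minimal sketch of what a dedicated Sidekiq 7 capsule for the health-check queue could look like. The capsule name, queue name, and concurrency are assumptions for illustration, not this gem's actual setup:

# config/initializers/sidekiq.rb (illustrative; names and numbers are assumptions)
Sidekiq.configure_server do |config|
  # Give the health-check queue its own capsule with a dedicated thread so
  # busy application queues cannot starve the alive job.
  config.capsule("alive") do |capsule|
    capsule.concurrency = 1
    capsule.queues = ["sidekiq-alive"]
  end
end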
That's not the case. Our worker threads are sitting there doing nothing. Then after 10 minutes they get removed because of the health check.
@andrcuns You are right. I just checked and verified this: the sidekiq-alive queue is not scheduled in staging (where we deployed sidekiq 7), but it's present in other environments where we still have sidekiq 6.5.6. Now, could you please help us debug where the actual issue is? Here is the list of gems and their versions we are using along with

@andrcuns Any clue on how to debug this issue would be highly appreciated.
I am not using

You can start with checking the sidekiq startup logs to see if sidekiq-alive is even started correctly. You should have something like this:

[2023-09-01 08:08:17 +0000] INFO -- Starting sidekiq-alive: {:hostname=>"dependabot-gitlab-worker-fc46db7c4-kpdwp", :port=>"7433", :ttl=>60, :queue=>"healthcheck-dependabot-gitlab-worker-fc46db7c4-kpdwp", :register_set=>"sidekiq-alive-hostnames", :liveness_key=>"SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::dependabot-gitlab-worker-fc46db7c4-kpdwp", :register_key=>"SIDEKIQ_REGISTERED_INSTANCE::dependabot-gitlab-worker-fc46db7c4-kpdwp"}
[2023-09-01 08:08:17 +0000] INFO -- Successfully started sidekiq-alive, registered with key: SIDEKIQ_REGISTERED_INSTANCE::dependabot-gitlab-worker-fc46db7c4-kpdwp on set sidekiq-alive-hostnames
[2023-09-01 08:08:18 +0000] INFO -- WEBrick 1.8.1
[2023-09-01 08:08:18 +0000] INFO -- ruby 3.1.4 (2023-03-30) [x86_64-linux]
[2023-09-01 08:08:18 +0000] INFO -- WEBrick::HTTPServer#start: pid=7 port=7433
I can see that sidekiq-alive is starting successfully:
I will try removing hiredis, deploy, and see if that fixes the issue. Will reply back once I test it out. Thanks so much for your help thus far @andrcuns :-)
If it starts successfully, the issue is then either with the job (which is supposed to update the alive key) not being scheduled, or with the job being scheduled but failing to update the key. I believe it's one of these two, given that it starts failing only after 10 minutes and the log shows that it creates the liveness key successfully.
@andrcuns So, I tried removing the hiredis client but the issue persists. I can see the sidekiq-alive queue here: https://app.com.com/sidekiq/queues but it does not appear in the scheduled tab: https://app.com/sidekiq/scheduled, which means the job is not being scheduled. Do you have any idea where I can dig deeper to see why the job is not being scheduled? Thank you once again!
@rakibulislam Not really sure. I imagine you can try enabling

We could probably use some additional debug log messages in this repo as well, like actions being performed during startup and inside the worker. If it's not getting scheduled, I'm guessing it means either this line did not schedule the initial job, or it did but this action did not trigger the next job.
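The suggestion above is cut off, but a common first step is raising Sidekiq's log level to debug. A minimal sketch, assuming that is what was meant (the initializer path is illustrative; the same effect can be had by starting sidekiq with -v):

# config/initializers/sidekiq.rb (assumption: the truncated suggestion refers to debug logging)
require "logger"

Sidekiq.configure_server do |config|
  # Log job fetches and enqueues at debug level to see whether the
  # sidekiq-alive job is ever picked up or re-scheduled.
  config.logger.level = Logger::DEBUG
end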
@andrcuns I was debugging this locally and found out that this line is executed on sidekiq startup. But this line is not executed automatically, and I think that's the main issue we need to figure out. But when I run this command directly from the rails console:

Do you see why this might be happening? Thanks for your time and help!
@andrcuns Looks like it's enqueuing the
Hey Team! Appreciate any help!
Okay, I fixed this issue, finally! For me everything was working fine with sidekiq 6.5.6, but sidekiq-alive stopped processing and scheduling jobs after we upgraded to sidekiq 7.1. Finally, the solution was to add the sidekiq-alive queue in the
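The sentence above is truncated, but the workaround presumably means listing the queue in the static queue configuration Sidekiq reads at boot. A hedged sketch, assuming the queue list lives in config/sidekiq.yml and that the gem's queue is literally named sidekiq-alive (newer versions append the hostname, so check the exact name in the Sidekiq UI):

# config/sidekiq.yml (hypothetical workaround, not the gem's documented setup)
:queues:
  - default
  - sidekiq-alive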
That's not really a fix though; the queue should be created automatically. In the latest version with sidekiq 7+ it uses a separate capsule as well. It would be nice to understand why the queue was not there.
@andrcuns I agree with you. It would be great to understand why it is not creating the sidekiq-alive queue automatically with sidekiq 7.1. Probably creating a new capsule might help? I will explore more in the future, but the fix works for us for now.
This really sounds like a workaround to me, but @rakibulislam did great work figuring this out.
The workaround from @rakibulislam also worked for me and now the Sidekiq Azure web container stays alive past the 10-minute threshold. The queue now shows in the scheduled tab and it refreshes every 5 min. Thank you.
Yeah, some additional debug-level messages wouldn't hurt. I'll see if I can add a PR sometime this week.
Hi there,
This library is intended to run with the HOSTNAME present; without it, it will produce false positives.
QUESTION: are you running redis in cluster mode, as a single instance, or with replicas?
@arturictus we are running redis in Elasticache as a single instance. Not in cluster mode. There are no replicas currently |
Facing the same issue here. The 10 minutes mark is actually based on the default TTL for the Redis key in the config file:
I reduced this to 60 and the worker goes unhealthy after 1 minute ("Can't find the alive key") even though the worker is fine. +++
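For reference, the TTL being discussed here is the gem's time_to_live setting (10 minutes by default). A minimal sketch of tuning it in an initializer, with the value chosen purely for illustration:

# config/initializers/sidekiq_alive.rb
SidekiqAlive.setup do |config|
  # Lifetime of the liveness key in Redis. The worker re-registers itself
  # roughly every time_to_live / 2 seconds, so while the process is healthy
  # the key should never expire.
  config.time_to_live = 10 * 60
end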
Confirming the issue: exactly after 10 minutes the worker gets unhealthy status, so basically sidekiq_alive returns "not alive", and after that sidekiq starts failing with
The sidekiq_alive job was not present in scheduled jobs. During another run the job is present. 🤔 I don't see any pattern yet. What I just saw: the job was still scheduled, but the sidekiq worker was already terminated by the provisioner. Another option is to get rid of

UPD: So, tuning the kubernetes pod settings helped. The sidekiq_alive queue is present, the job is enqueued, and the status is updated without issues.

UPD 2:

UPD 3: I checked this manually in the Rails console:

SidekiqAlive.hostname
SidekiqAlive.alive?
SidekiqAlive::Worker.perform_sync

After running it manually the pod became alive, and basically it means that
Now I also see two jobs enqueued for
Also spotted that

UPD 5: The worker job gets properly scheduled when manually called, so it goes under "Scheduled jobs" and the waiting time is 5 minutes (600/2, as it follows the code). At some point the job gets stuck under "Enqueued jobs" and never gets back to "Scheduled jobs".
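For anyone following the same trail, here is a hedged sketch of extra console checks that complement the calls above: inspecting the TTL of the liveness key (key format taken from the startup log earlier in this thread, so it may differ per gem version) and looking for the worker in the scheduled set.

# Rails console sketch; the key format is an assumption based on the log above.
liveness_key = "SIDEKIQ::LIVENESS_PROBE_TIMESTAMP::#{SidekiqAlive.hostname}"

# Remaining TTL of the alive key; -2 means the key is gone and the probe will fail.
Sidekiq.redis { |conn| conn.ttl(liveness_key) }

# Is the next SidekiqAlive::Worker job actually sitting in the scheduled set?
Sidekiq::ScheduledSet.new.select { |job| job.klass == "SidekiqAlive::Worker" }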
UPD 6: Most probably I will just go with a simple implementation of a web server like this:

# PATH: config/initializers/sidekiq_alive.rb
class SidekiqAliveServer
  def run!
    handler = Rack::Handler.get("webrick")
    Signal.trap("TERM") { handler.shutdown }
    handler.run(self, Port: 7433, Host: "0.0.0.0", AccessLog: [], Logger: Rails.logger)
  end

  # Rack entrypoint: returns 200 when sidekiq processes exist and queue
  # latency looks sane, 500 otherwise.
  def call(_env)
    workers = Sidekiq::Workers.new
    process_set = Sidekiq::ProcessSet.new
    process_set_size = process_set.size
    queues = Sidekiq::Queue.all
    queues_size = queues.size
    queues_avg_latency = queues.sum(&:latency) / queues_size if queues_size.positive?

    response = {
      workers_size: workers.size,
      process_set_size: process_set_size,
      queues_size: queues_size,
      queues_avg_latency: queues_avg_latency
    }

    is_alive = process_set_size.positive? && queues_size.positive? && queues_avg_latency < 30 * 60
    response_code = is_alive ? 200 : 500

    [response_code, { "Content-Type" => "application/json" }, [response.to_json]]
  end
end

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # Run the health server in a separate process so it does not depend on a free Sidekiq thread.
    @server_pid = fork { SidekiqAliveServer.new.run! }
  end

  config.on(:quiet) do
  end

  config.on(:shutdown) do
    Process.kill("TERM", @server_pid) unless @server_pid.nil?
    Process.wait(@server_pid) unless @server_pid.nil?
  end
end

Latest version is here: https://gist.github.com/amkisko/95662a67da9c7344ee538786ed3e9d6e
I'm not sure if this will help anyone else, but I was also experiencing this problem with sidekiq

On the throttled README, it states:
Because

# config/initializers/sidekiq.rb
require "sidekiq/throttled"

Sidekiq.configure_server do |config|
  config.on(:startup) { Sidekiq::Throttled.setup! }
end
@amkisko I was having the same issue as your comment here
Fix: In
I don't really know anything about Sidekiq internals, but it seems like it's creating a Redis connection pool that's too small - somehow there are only two connections, but the pool is exhausted when this method is called:
setting
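The fix above is truncated, but given the pool-exhaustion description it presumably boils down to giving Sidekiq a larger Redis connection pool. A hedged sketch for Sidekiq 6.x (where the size option is honoured; Sidekiq 7 sizes the pool from concurrency instead), with numbers chosen purely for illustration:

# config/initializers/sidekiq.rb (hypothetical values, Sidekiq 6.x)
Sidekiq.configure_server do |config|
  # Explicit pool size so application jobs plus sidekiq-alive's checks
  # don't exhaust the default pool.
  config.redis = { url: ENV["REDIS_URL"], size: 10 }
end

Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDIS_URL"], size: 5 }
end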
@jasondoc3 we were having the exact same problem with Sidekiq 6.5.11 and luckily I found your comment. Thanks! :-)
We are trying to upgrade our infrastructure to use sidekiq 7 and are facing the following issue:
sidekiq_alive works fine when registering the worker in redis, and the pod then becomes healthy.
After exactly 10 minutes it becomes unhealthy with the message "Can't find the alive key" and the pod gets restarted.
We have verified that during this 10-minute period the worker really is alive, so it's not that our health check only starts working after 10 minutes.
Any idea where to look to solve this problem?