Workflow frozen/not updating after ~100 nodes. #6281
I have a workflow that can spawn a lot of nodes, in the thousands. When I ran it some months ago, it got stuck at around 4,000 nodes. I found out that I needed to enable persistence for such a large workflow, so I enabled it and all has been working well. Fast forward to today: the same workflow now freezes at about 100 nodes with the same symptoms. The nodes' status does not get updated and the workflow does not proceed to the next steps, even though the pods themselves have completed and stopped (I checked both the log and [...]). For the pods that get stuck, there are quite a lot of these logs from workflow-controller (the following is just an excerpt of a very long repetition of the same thing).
Telling it to Stop or Terminate does not do anything either (again, the same symptom as a workflow being too large for etcd). Since persistence is enabled and it's only ~100 nodes, where and how do I debug to find out why this is happening? Specs (I tried a whole bunch of stuff, and they all behave the same):
EDIT: I thought the workflow froze, but it actually didn't, and it managed to run all tasks successfully. It was just the UI not updating to reflect the actual status of the nodes. Any idea what to check for debugging?
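One way to narrow this down is to compare the phase recorded for each node against the phase of its pod, and list the ones where the pod has finished but the node has not caught up. Below is a minimal Python sketch of that comparison. The function name `find_stale_nodes`, the two input dicts, and the sample names are hypothetical; it assumes you have already collected a name-to-phase mapping for nodes and for pods yourself, and that a node and its pod share the same name, which may not hold in every setup.

```python
from typing import Dict, List, Tuple

# Phases treated as "finished" in this sketch (an assumption; adjust as needed).
POD_DONE = {"Succeeded", "Failed"}
NODE_DONE = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def find_stale_nodes(
    node_phases: Dict[str, str],
    pod_phases: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (name, node_phase, pod_phase) for every pod that has finished
    while the matching node is missing or still in a non-finished phase."""
    stale = []
    for name, pod_phase in sorted(pod_phases.items()):
        # A node with no recorded phase is reported as "Unknown".
        node_phase = node_phases.get(name, "Unknown")
        if pod_phase in POD_DONE and node_phase not in NODE_DONE:
            stale.append((name, node_phase, pod_phase))
    return stale


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    nodes = {"step-a": "Running", "step-b": "Succeeded"}
    pods = {"step-a": "Succeeded", "step-b": "Succeeded", "step-c": "Running"}
    for name, node_phase, pod_phase in find_stale_nodes(nodes, pods):
        print(f"{name}: node={node_phase} pod={pod_phase}")
```

If the list comes back empty while the UI still shows old statuses, that would point at the display side rather than at the controller.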
Replies: 1 comment 1 reply
Have a look at #6256; I think they are related.