Pipeline never pauses due to "failed to get information of buffer" error #2088
Comments
On the second issue, of why the buffer information is not available, I will check the logs you posted to see if I can isolate a reason.
Nice! And thank you!
I'm seeing log messages like "vertex/controller.go:121","msg":"ISB Service is not in healthy status" after the pipeline topology is changed, so just in case, let me also post the watch we have on the isbsvc resource.
Looking at that isbsvc.yaml file, it does seem like the isbsvc is unhealthy between 5:57:20 and 5:57:33. This time range seems to correspond to the "ISB Service is not in healthy status" messages. The "failed to get information of buffer" error starts after that time range at 5:57:39 and continues throughout the remaining test. But note that the buffer was created just before the isbsvc was reporting itself as unhealthy:
Could something have gone wrong creating the buffer but not been reported?
@juliev0 - From the log file you attached, it looks like the isbsvc was newly created, and it was unhealthy because it was waiting for the pods to be ready. It's clearly mentioned in the status.
@whynowy Take a look at 2024-09-25T05:56:23Z - it actually was deemed healthy at that time. Then later at 2024-09-25T05:57:20Z it was back to "Waiting for 1 pods to be ready..."
Where did you run your test case? The fact is that the statefulset had been in ready status for about 1 min (2024-09-25T05:56:23Z -> 2024-09-25T05:57:20Z), and then went to a status of 1 pod not ready. I cannot think of a reason other than one of the pods getting recreated, especially since I don't see any other reconciliation activity in between.
In CI, which uses k3d to create a cluster.
Are you able to consistently see the issue?
I've updated the Description with all of the relevant timestamps. If the isbsvc is unhealthy when a Creation Job is executed and something causes the buffer to not exist, it seems that Numaflow needs to be able to recover from that, right?
I know I have seen that same "failed to get information of buffer" error at least once before. I can keep an eye out for more occurrences.
If the Creation Job fails due to connectivity with the isbsvc, what will happen?
The job will be retried by k8s.
Do you know if it uses some sort of exponential backoff method for retrying? I guess the Numaflow Controller log doesn't have any visibility into that, huh? I wonder if it would actually retry for the full ~13 seconds that it appears it was unhealthy... hmm...
It is exponential backoff.
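For reference, Kubernetes retries failed Job pods with an exponential back-off delay (10s, 20s, 40s, ... capped at six minutes), bounded by the Job's backoffLimit. A minimal sketch of such a Job is below; the name, image, and args are placeholders, not the actual buffer-creation Job that Numaflow generates:

```yaml
# Illustrative Job only, to show how k8s bounds the retries.
apiVersion: batch/v1
kind: Job
metadata:
  name: buffer-creation-example          # placeholder name
spec:
  backoffLimit: 6                        # default value; failed attempts before the Job is marked Failed
  template:
    spec:
      restartPolicy: Never               # each failure creates a new pod after the back-off delay
      containers:
        - name: create-buffers
          image: example/isbsvc-create:latest   # placeholder image
          args: ["create-buffers"]              # placeholder args
```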
It seems this is probably due to the fact that we are not configuring persistence for the isbsvc in the test - will close this.
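For anyone hitting the same thing in their own tests, a minimal sketch (values are illustrative) of an InterStepBufferService with persistence configured, so the JetStream streams backing the buffers survive pod restarts:

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  jetstream:
    version: latest        # pin to a concrete version in practice
    persistence:
      volumeSize: 3Gi      # illustrative size; backs the streams with a PVC
```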
Describe the bug
Our CI failed when Numaflow failed to pause our pipeline after we set desiredPhase=Paused. There are 2 issues here:
This is the numaflow log and the numaplane log:
numaflow-controller.log
numaplane-controller.log
This is the result of a "watch" on the Pipeline as it changed over time.
pipeline.yaml.txt
Relevant time stamps:
05:57:18 - desiredPhase=Paused
05:57:18 - Pipeline topology changes from in->out to in->cat->out, and Creation/Deletion jobs created (including creation of "numaplane-system-test-pipeline-rollout-0-cat-0" buffer)
05:57:20 - isbsvc reported as unhealthy in its Status: "Unavailable: Waiting for 1 pods to be ready..."
05:57:33 - isbsvc reported as healthy again
05:57:39 - first occurrence of error "failed to get information of buffer "numaplane-system-test-pipeline-rollout-0-cat-0"" in log
05:57:40 - desiredPhase=Running
05:57:43 - desiredPhase=Paused
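For context on the timestamps above, a rough sketch of what the paused in->cat->out pipeline spec could look like; vertex details and the name are illustrative, not the actual numaplane-generated spec:

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: test-pipeline-rollout-0       # illustrative name
spec:
  lifecycle:
    desiredPhase: Paused              # the field toggled at 05:57:18 / 05:57:40 / 05:57:43
  vertices:
    - name: in
      source:
        generator:
          rpu: 5
          duration: 1s
    - name: cat                       # the vertex added by the topology change
      udf:
        builtin:
          name: cat
    - name: out
      sink:
        log: {}
  edges:
    - from: in
      to: cat
    - from: cat
      to: out
```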
To Reproduce
Run our e2e test enough times to cause this kind of thing to happen.
Environment (please complete the following information):
Message from the maintainers:
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.