
Pipeline never pauses due to "failed to get information of buffer" error #2088

Closed · juliev0 opened this issue Sep 25, 2024 · 17 comments
Labels: area/controller, bug (Something isn't working)

@juliev0 (Contributor) commented Sep 25, 2024

Describe the bug

Our CI failed because Numaflow did not pause our pipeline after we set desiredPhase=Paused.

There are 2 issues here:

  1. The Pipeline Controller won't honor the PauseGracePeriodSettings timeout when an error keeps recurring, because that check is done at the end of the function (previously discussed with @kohlisid, who will change that; see the sketch after this list).
  2. For some reason, there is an error contacting the daemon: "Updated desired pipeline phase failed: {error 26 0 rpc error: code = Unknown desc = failed to get information of buffer "numaplane-system-test-pipeline-rollout-0-cat-0"}", which is why it's never able to pause. We need to understand what's causing that error.
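
For item 1, here is a minimal sketch of the idea (hypothetical names and signatures, not the actual Numaflow controller code): evaluate the pause grace period before any step that can fail, so a repeatedly failing daemon call cannot starve the timeout check.

```go
package controller

import (
	"context"
	"time"
)

// Pipeline is a stand-in for the real CRD type; only the fields needed for
// this sketch are included.
type Pipeline struct {
	PauseGracePeriod time.Duration // derived from the lifecycle pause grace period setting
	PauseRequestedAt time.Time     // when desiredPhase was set to Paused
}

// pausePipeline evaluates the grace-period deadline before any step that can
// fail, so a daemon call that keeps erroring ("failed to get information of
// buffer ...") can no longer prevent the pipeline from eventually pausing.
func pausePipeline(
	ctx context.Context,
	pl *Pipeline,
	buffersDrained func(context.Context) (bool, error), // e.g. the daemon query
	forcePause func(context.Context) error, // scale vertices to zero, mark Paused
) (bool, error) {
	// 1. Honor the timeout unconditionally, before anything that can error out.
	if time.Since(pl.PauseRequestedAt) >= pl.PauseGracePeriod {
		return true, forcePause(ctx)
	}

	// 2. Only then attempt the step that may keep failing.
	drained, err := buffersDrained(ctx)
	if err != nil {
		// Requeue; the deadline check above still runs on the next reconcile.
		return false, err
	}
	if drained {
		return true, forcePause(ctx)
	}
	return false, nil // not drained yet; requeue and keep waiting
}
```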

Here are the numaflow log and the numaplane log:
numaflow-controller.log
numaplane-controller.log

This is the result of a "watch" on the Pipeline as it changed over time.
pipeline.yaml.txt

Relevant time stamps:
05:57:18 - desiredPhase=Paused
05:57:18 - Pipeline topology changes from in->out to in->cat->out, and Creation/Deletion jobs created (including creation of "numaplane-system-test-pipeline-rollout-0-cat-0" buffer)
05:57:20 - isbsvc reported as unhealthy in its Status: "Unavailable: Waiting for 1 pods to be ready..."
05:57:33 - isbsvc reported as healthy again
05:57:39 - first occurrence of error "failed to get information of buffer "numaplane-system-test-pipeline-rollout-0-cat-0"" in log
05:57:40 - desiredPhase=Running
05:57:43 - desiredPhase=Paused

To Reproduce
Run our e2e test enough times to cause this kind of thing to happen.

Environment (please complete the following information):

  • Numaflow: rc-18 from here, which was built from master on 9/23

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

juliev0 added the bug (Something isn't working) label on Sep 25, 2024
@kohlisid (Contributor)

Hey @juliev0
As discussed earlier, I have a PR for that: #2084.
This case should be covered when we merge it in.

@kohlisid (Contributor)

On the second issue, of why the buffer information is not available, I will check the logs you posted to see if I can isolate a reason.

@juliev0 (Contributor, Author) commented Sep 25, 2024

Nice! and thank you!

@juliev0 (Contributor, Author) commented Sep 25, 2024

isbservice.yaml.txt

I'm seeing messages in the log like "vertex/controller.go:121","msg":"ISB Service is not in healthy status" after the pipeline topology is changed, so just in case, I'm also posting the watch we have on the isbsvc resource.

@juliev0 (Contributor, Author) commented Sep 25, 2024

Looking at that isbsvc.yaml file, it does seem like the isbsvc was unhealthy between 5:57:20 and 5:57:33. This time range seems to correspond to the "ISB Service is not in healthy status" messages.

The "failed to get information of buffer" error starts after that time range at 5:57:39 and continues throughout the remaining test.

But note that the buffer was created just before the isbsvc was reporting itself as unhealthy:

{"level":"info","ts":"2024-09-25T05:57:18.841049044Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:342","msg":"Created a job successfully for ISB creating","namespace":"numaplane-system","pipeline":"test-pipeline-rollout-0","buffers":["numaplane-system-test-pipeline-rollout-0-cat-0"],"buckets":["numaplane-system-test-pipeline-rollout-0-in-cat","numaplane-system-test-pipeline-rollout-0-cat-out"],"servingStreams":[]}

Could something have gone wrong creating the buffer but not been reported?
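
In case it helps with debugging, here is a hedged way to confirm by hand whether the buffer's backing JetStream stream actually exists in the isbsvc. This assumes the default JetStream-based ISB service and uses the github.com/nats-io/nats.go client; the port-forward target, credentials handling, and the buffer-to-stream name mapping are assumptions, not verified details.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Port-forward the isbsvc NATS service locally first, for example:
	//   kubectl -n numaplane-system port-forward svc/<isbsvc-service> 4222:4222
	nc, err := nats.Connect("nats://localhost:4222") // auth options omitted; the real isbsvc requires credentials
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Assumption: the stream name matches (or is derived from) the buffer name.
	// If the stream is missing, StreamInfo returns an error, which would match
	// the "failed to get information of buffer" symptom seen in the daemon.
	info, err := js.StreamInfo("numaplane-system-test-pipeline-rollout-0-cat-0")
	if err != nil {
		log.Fatalf("stream lookup failed: %v", err)
	}
	fmt.Printf("stream exists, %d messages\n", info.State.Msgs)
}
```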

@whynowy (Member) commented Sep 25, 2024

@juliev0 - From the log file you attached, it looks like the isbsvc was newly created, and it was unhealthy because it was waiting for pods to be ready. It's clearly mentioned in the status:

  message: |
    Unavailable: Waiting for 1 pods to be ready...

@juliev0 (Contributor, Author) commented Sep 25, 2024

> From the log file you attached, it looks like the isbsvc was newly created, and it was unhealthy because it was waiting for pods to be ready.

@whynowy Take a look at 2024-09-25T05:56:23Z - it actually was deemed healthy at that time. Then later, at 2024-09-25T05:57:20Z, it was back to "Waiting for 1 pods to be ready..."

@whynowy (Member) commented Sep 25, 2024

> @whynowy Take a look at 2024-09-25T05:56:23Z - it actually was deemed healthy at that time. Then later, at 2024-09-25T05:57:20Z, it was back to "Waiting for 1 pods to be ready..."

Where did you run your test case? The fact is that the statefulset had been in ready status for about 1 minute (2024-09-25T05:56:23Z -> 2024-09-25T05:57:20Z) and then went to a status of 1 pod not ready. I cannot think of a reason other than one of the pods getting recreated, especially not seeing any other reconciliation activity in between.

@juliev0 (Contributor, Author) commented Sep 25, 2024

> Where did you run your test case?

The CI, which uses k3d to create a cluster.

@whynowy (Member) commented Sep 25, 2024

> The CI, which uses k3d to create a cluster.

Are you able to consistently see the issue?

@juliev0 (Contributor, Author) commented Sep 25, 2024

I've updated the Description above with all of the relevant timestamps.

If the isbsvc is unhealthy when a Creation Job is executed and something causes the buffer to not exist, it seems that Numaflow needs to be able to recover from that, right?

@juliev0 (Contributor, Author) commented Sep 25, 2024

> Are you able to consistently see the issue?

I know I have seen that same "failed to get information of buffer" error at least once before. I can keep an eye out for more occurrences.

@juliev0 (Contributor, Author) commented Sep 25, 2024

> If the isbsvc is unhealthy when a Creation Job is executed and something causes the buffer to not exist, it seems that Numaflow needs to be able to recover from that, right?

If the Creation Job fails due to connectivity with the isbsvc, what will happen?

@whynowy (Member) commented Sep 25, 2024

> If the Creation Job fails due to connectivity with the isbsvc, what will happen?

The job will be retried by k8s.
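
For reference, the Kubernetes Job controller re-creates failed pods with an exponential back-off delay (10s, 20s, 40s, ... capped at six minutes) until .spec.backoffLimit is exhausted. Below is a rough sketch of the retry-relevant parts of such a Job; the name, image, and args are placeholders, not Numaflow's actual job definition.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// bufferCreationJob sketches the retry-relevant parts of a buffer-creation
// Job. The point is BackoffLimit: the Job controller re-creates failed pods
// with exponential back-off until the limit is exhausted, so a transiently
// unreachable isbsvc should eventually be retried.
func bufferCreationJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "isbsvc-create-buffers", // placeholder name
			Namespace: "numaplane-system",
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: int32Ptr(6), // pod retries before the Job is marked Failed
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "create-buffers",
						Image: "quay.io/numaproj/numaflow:latest",           // placeholder image/tag
						Args:  []string{"isbsvc-create", "--buffers=<...>"}, // placeholder args
					}},
				},
			},
		},
	}
}

func main() {
	job := bufferCreationJob()
	fmt.Println(job.Name, "backoffLimit:", *job.Spec.BackoffLimit)
}
```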

@juliev0 (Contributor, Author) commented Sep 25, 2024

Do you know if it uses some sort of exponential backoff method for retrying? I guess the Numaflow Controller log doesn't have any visibility into that, huh? I wonder if it would actually retry for the full ~13 seconds that it appears to have been unhealthy... hmm...

@whynowy (Member) commented Sep 26, 2024

> Do you know if it uses some sort of exponential backoff method for retrying?

It is exponential backoff.

@juliev0 (Contributor, Author) commented Oct 1, 2024

It seems this is probably due to the fact that we are not configuring persistence for the isbsvc in the test, so I will close this.

juliev0 closed this as completed Oct 1, 2024