
Pipeline never pauses due to "failed to get information of buffer" error #2088

Closed · juliev0 opened this issue Sep 25, 2024 · 17 comments
Labels: area/controller, bug (Something isn't working)

@juliev0 (Contributor) commented Sep 25, 2024

Describe the bug

Our CI failed because Numaflow did not pause our pipeline after we set desiredPhase=Paused.

There are 2 issues here:

  1. The Pipeline Controller won't honor the PauseGracePeriodSettings timeout when an error keeps recurring, because that check is done at the end of the function (previously discussed with @kohlisid, who will change that; see the sketch after this list).
  2. For some reason, there is an error contacting the daemon: "Updated desired pipeline phase failed: {error 26 0 rpc error: code = Unknown desc = failed to get information of buffer "numaplane-system-test-pipeline-rollout-0-cat-0"}", which is why it's never able to pause. We need to understand what's causing that error.
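
For item 1, here is a minimal sketch of the idea (hypothetical names and signatures, not the actual Numaflow controller code): evaluate the pause grace period before any step that can fail, so a repeatedly failing daemon call cannot starve the timeout check.

```go
package controller

import (
	"context"
	"time"
)

// Pipeline is a stand-in for the real CRD type; only the fields needed for
// this sketch are included.
type Pipeline struct {
	PauseGracePeriod time.Duration // derived from the lifecycle pause grace period setting
	PauseRequestedAt time.Time     // when desiredPhase was set to Paused
}

// pausePipeline evaluates the grace-period deadline before any step that can
// fail, so a daemon call that keeps erroring ("failed to get information of
// buffer ...") can no longer prevent the pipeline from eventually pausing.
func pausePipeline(
	ctx context.Context,
	pl *Pipeline,
	buffersDrained func(context.Context) (bool, error), // e.g. the daemon query
	forcePause func(context.Context) error, // scale vertices to zero, mark Paused
) (bool, error) {
	// 1. Honor the timeout unconditionally, before anything that can error out.
	if time.Since(pl.PauseRequestedAt) >= pl.PauseGracePeriod {
		return true, forcePause(ctx)
	}

	// 2. Only then attempt the step that may keep failing.
	drained, err := buffersDrained(ctx)
	if err != nil {
		// Requeue; the deadline check above still runs on the next reconcile.
		return false, err
	}
	if drained {
		return true, forcePause(ctx)
	}
	return false, nil // not drained yet; requeue and keep waiting
}
```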

Here are the numaflow log and the numaplane log:
numaflow-controller.log
numaplane-controller.log

This is the result of a "watch" on the Pipeline as it changed over time.
pipeline.yaml.txt

Relevant time stamps:
05:57:18 - desiredPhase=Paused
05:57:18 - Pipeline topology changes from in->out to in->cat->out, and Creation/Deletion jobs created (including creation of "numaplane-system-test-pipeline-rollout-0-cat-0" buffer)
05:57:20 - isbsvc reported as unhealthy in its Status: "Unavailable: Waiting for 1 pods to be ready..."
05:57:33 - isbsvc reported as healthy again
05:57:39 - first occurrence of error "failed to get information of buffer "numaplane-system-test-pipeline-rollout-0-cat-0"" in log
05:57:40 - desiredPhase=Running
05:57:43 - desiredPhase=Paused

To Reproduce
Run our e2e test enough times to cause this kind of thing to happen.

Environment (please complete the following information):

  • Numaflow: rc-18 from here, which was built from master on 9/23

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

juliev0 added the bug (Something isn't working) label on Sep 25, 2024
@kohlisid (Contributor)

Hey @juliev0
As discussed earlier, I have a PR for that: #2084.
This case should be covered when we merge it in.

@kohlisid (Contributor)

On the second issue, of why the buffer information is not available, I will check the logs you posted to see if I can isolate a reason.

@juliev0 (Contributor, Author) commented Sep 25, 2024

Nice! and thank you!

@juliev0 (Contributor, Author) commented Sep 25, 2024

isbservice.yaml.txt

I'm seeing messages in the log like "vertex/controller.go:121","msg":"ISB Service is not in healthy status" after the pipeline topology is changed, so just in case, I'm also posting the watch we have on the isbsvc resource.

@juliev0 (Contributor, Author) commented Sep 25, 2024

Looking at that isbsvc.yaml file, it does seem like the isbsvc was unhealthy between 5:57:20 and 5:57:33. This time range seems to correspond to the "ISB Service is not in healthy status" messages.

The "failed to get information of buffer" error starts after that time range at 5:57:39 and continues throughout the remaining test.

But note that the buffer was created just before the isbsvc was reporting itself as unhealthy:

{"level":"info","ts":"2024-09-25T05:57:18.841049044Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:342","msg":"Created a job successfully for ISB creating","namespace":"numaplane-system","pipeline":"test-pipeline-rollout-0","buffers":["numaplane-system-test-pipeline-rollout-0-cat-0"],"buckets":["numaplane-system-test-pipeline-rollout-0-in-cat","numaplane-system-test-pipeline-rollout-0-cat-out"],"servingStreams":[]}

Could something have gone wrong creating the buffer but not been reported?
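
In case it helps with debugging, here is a hedged way to confirm by hand whether the buffer's backing JetStream stream actually exists in the isbsvc. This assumes the default JetStream-based ISB service and uses the github.com/nats-io/nats.go client; the port-forward target, credentials handling, and the buffer-to-stream name mapping are assumptions, not verified details.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Port-forward the isbsvc NATS service locally first, for example:
	//   kubectl -n numaplane-system port-forward svc/<isbsvc-service> 4222:4222
	nc, err := nats.Connect("nats://localhost:4222") // auth options omitted; the real isbsvc requires credentials
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Assumption: the stream name matches (or is derived from) the buffer name.
	// If the stream is missing, StreamInfo returns an error, which would match
	// the "failed to get information of buffer" symptom seen in the daemon.
	info, err := js.StreamInfo("numaplane-system-test-pipeline-rollout-0-cat-0")
	if err != nil {
		log.Fatalf("stream lookup failed: %v", err)
	}
	fmt.Printf("stream exists, %d messages\n", info.State.Msgs)
}
```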

@whynowy (Member) commented Sep 25, 2024

@juliev0 - From the log file you attached, it looks like the isbsvc was newly created, and it was unhealthy because it was waiting for pods to be ready. It's clearly mentioned in the status:

  message: |
    Unavailable: Waiting for 1 pods to be ready...

@juliev0 (Contributor, Author) commented Sep 25, 2024

> From the log file you attached, it looks like the isbsvc was newly created, and it was unhealthy because it was waiting for pods to be ready.

@whynowy Take a look at 2024-09-25T05:56:23Z - it actually was deemed healthy at that time. Then later, at 2024-09-25T05:57:20Z, it was back to "Waiting for 1 pods to be ready..."

@whynowy (Member) commented Sep 25, 2024

> @whynowy Take a look at 2024-09-25T05:56:23Z - it actually was deemed healthy at that time. Then later, at 2024-09-25T05:57:20Z, it was back to "Waiting for 1 pods to be ready..."

Where did you run your test case? The fact is that the statefulset had been in ready status for about 1 minute (2024-09-25T05:56:23Z -> 2024-09-25T05:57:20Z) and then went to a status of 1 pod not ready. I cannot think of a reason other than one of the pods getting recreated, especially not seeing any other reconciliation activity in between.

@juliev0 (Contributor, Author) commented Sep 25, 2024

> Where did you run your test case?

The CI, which uses k3d to create a cluster.

@whynowy (Member) commented Sep 25, 2024

> The CI, which uses k3d to create a cluster.

Are you able to consistently see the issue?

@juliev0 (Contributor, Author) commented Sep 25, 2024

I've updated the Description above with all of the relevant timestamps.

If the isbsvc is unhealthy when a Creation Job is executed and something causes the buffer to not exist, it seems that Numaflow needs to be able to recover from that, right?

@juliev0 (Contributor, Author) commented Sep 25, 2024

> Are you able to consistently see the issue?

I know I have seen that same "failed to get information of buffer" error at least once before. I can keep an eye out for more occurrences.

@juliev0 (Contributor, Author) commented Sep 25, 2024

> If the isbsvc is unhealthy when a Creation Job is executed and something causes the buffer to not exist, it seems that Numaflow needs to be able to recover from that, right?

If the Creation Job fails due to connectivity with the isbsvc, what will happen?

@whynowy (Member) commented Sep 25, 2024

> If the Creation Job fails due to connectivity with the isbsvc, what will happen?

The job will be retried by k8s.
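
For reference, the Kubernetes Job controller re-creates failed pods with an exponential back-off delay (10s, 20s, 40s, ... capped at six minutes) until .spec.backoffLimit is exhausted. Below is a rough sketch of the retry-relevant parts of such a Job; the name, image, and args are placeholders, not Numaflow's actual job definition.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// bufferCreationJob sketches the retry-relevant parts of a buffer-creation
// Job. The point is BackoffLimit: the Job controller re-creates failed pods
// with exponential back-off until the limit is exhausted, so a transiently
// unreachable isbsvc should eventually be retried.
func bufferCreationJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "isbsvc-create-buffers", // placeholder name
			Namespace: "numaplane-system",
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: int32Ptr(6), // pod retries before the Job is marked Failed
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "create-buffers",
						Image: "quay.io/numaproj/numaflow:latest",           // placeholder image/tag
						Args:  []string{"isbsvc-create", "--buffers=<...>"}, // placeholder args
					}},
				},
			},
		},
	}
}

func main() {
	job := bufferCreationJob()
	fmt.Println(job.Name, "backoffLimit:", *job.Spec.BackoffLimit)
}
```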

@juliev0 (Contributor, Author) commented Sep 25, 2024

Do you know if it uses some sort of exponential backoff method for retrying? I guess the Numaflow Controller log doesn't have any visibility into that, huh? I wonder if it would actually retry for the full ~13 seconds that it appears to have been unhealthy... hmm...

@whynowy (Member) commented Sep 26, 2024

> Do you know if it uses some sort of exponential backoff method for retrying?

It is exponential backoff.

@juliev0 (Contributor, Author) commented Oct 1, 2024

It seems this is probably due to the fact that we are not configuring persistence for the isbsvc in the test, so I will close this.

juliev0 closed this as completed Oct 1, 2024