1 master & 0 workers setup requires "create manifest" step twice #238
Comments
Please collect the log bundle from bootstrap node - https://docs.okd.io/latest/installing/installing-troubleshooting.html#installation-bootstrap-gather_installing-troubleshooting |
log-bundle-20200701135351.tar.gz Thank you @vrutkovs ! |
That's expected - initial FCOS is being updated to the version in the release (https://origin-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.4.0-0.okd/release/4.4.0-0.okd-2020-07-01-045420). The control plane never requested master ignition from bootstrap - ensure your LB is set up correctly. Did any of the masters boot? Can they access the LB address and fetch ignition? |
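One way to check whether a master can fetch ignition through the LB is to request the machine config endpoint by hand. This is a minimal sketch, assuming the LB serves the internal API name `api-int.<cluster domain>` on port 22623 (the usual UPI layout). The function names `mcs_url` and `check_ignition` are made up for illustration and are not part of any tool.

```shell
#!/usr/bin/env bash
# Build the machine config server URL for a role ("master" or "worker").
# Assumption: the LB exposes api-int.<cluster domain> on port 22623.
mcs_url() {
  local cluster_domain="$1" role="$2"
  printf 'https://api-int.%s:22623/config/%s\n' "$cluster_domain" "$role"
}

# Print the HTTP status of the ignition endpoint; succeed only on 200.
check_ignition() {
  local url code
  url="$(mcs_url "$1" "$2")"
  # --insecure: the endpoint certificate is signed by a cluster-internal CA.
  code="$(curl --insecure --silent --output /dev/null \
    --write-out '%{http_code}' --max-time 10 "$url")" || true
  echo "$url -> HTTP ${code:-000}"
  [ "$code" = "200" ]
}
```

Running `check_ignition ocp.domain.net master` from a machine on the masters' network should report HTTP 200 if the LB forwards to the bootstrap node. A 503 usually points at the LB backend rather than at the node making the request.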
I am seeing a similar issue with FCOS 32 and OKD 4.5 bare-metal UPI. The Bootstrap node starts fine, the API comes up, but the master nodes are unable to retrieve ignition config. Using curl to attempt to retrieve the ignition config results in a 503. I've got a log bundle that I will upload as soon as I double check to ensure that I haven't done something silly to cause the issue. |
I suspect Zincati has been fixed recently and we don't disable it, so master nodes are being updated to latest stable instead of the payload we expect |
I've got a long weekend with the US holiday, so I'll have some time in my lab to tinker. |
@vrutkovs The output I generated was actually from just the beginning of the installation; the control node was never started. Sorry about that. I thought that FCOS should not update beyond 31 for 4.4.0, and I assumed this was the problem I have been seeing since 29 Jun. I used to destroy and recreate the cluster a few times per day, and at some point, even with exactly the same source files, kernel images, and openshift-installer version, I was no longer able to bootstrap the cluster. I hoped it was related to the FCOS version not being pinned, but the problem must be elsewhere. @cgruver For 4.5.0 I was also seeing what you are describing (#229), by the way. This bug probably does not make sense any more, as @vrutkovs explained how it is fixed between openshift-installer and FCOS versions. I will open a new ticket and provide more information there - I hope this is ok @vrutkovs. Thank you! |
Feel free to reuse this ticket if you like |
Thanks @vrutkovs! The version that still uses FCOS 31, and which was battle-tested for me, was 4.4.0-0.okd-2020-05-23-055148-beta5 with the fedora-coreos-31.20200517.3.0-live-kernel-x86_64 PXE kernel. Please find the attached archive (this time the control node was bootstrapped completely): In bootkube.sh I see that it should be completed: But in I can not find anything interesting in POD logs, kube-apiserver is complaining about certificates: Just like my kubectl command: I am at a dead end now; I see PODs bouncing. I don't know of anything that could have changed except for the external dependencies, which is why I was hoping this was related to the FCOS upgrade. If you can take a look and tell me what I am missing, that would be great. Thanks! |
Few notes from log-bundle:
|
I wanted to add more when bootstrap was ready, I had
I have just checked and it seems like api and master-01 endpoint are providing exactly the same certificates:
Besides, this configuration was working fine for over 3 weeks:
It is almost a one-to-one copy of the haproxy.cfg example. Honestly, this feels like some kind of sorcery to me; I think I have checked every possible scenario. But perhaps I am missing something super obvious. This is what the API certificate looks like:
When I am trying to login using kubeadmin password from
The same if I try to connect to master directly:
I have also tried to rebuild the cluster with the PXE kernel from FCOS 32 (as stated in #229 (comment)), but this failed the same way. And this time I had 3 workers in the config. Please find the log archive attached: Vadim, please let me know if there is anything else you can think of that I can check. Thanks a lot for your help! |
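The certificate comparison between the LB address and a master, mentioned above, can be reproduced with `openssl`. This is a sketch only: the helper names are made up, and the host name `master-01.ocp.domain.net` in the usage note is an assumption based on the "master-01" mentioned in the thread. Both probes send the same SNI name because the API server may choose its certificate based on it.

```shell
#!/usr/bin/env bash
# Print the SHA-256 fingerprint of the certificate served at host:port,
# presenting the given SNI name during the TLS handshake.
cert_fingerprint() {
  local endpoint="$1" sni="$2"
  openssl s_client -connect "$endpoint" -servername "$sni" \
    </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256
}

# Succeed only when both fingerprints are non-empty and identical.
same_fingerprint() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Usage, with the host names being assumptions:
`same_fingerprint "$(cert_fingerprint api.ocp.domain.net:6443 api.ocp.domain.net)" "$(cert_fingerprint master-01.ocp.domain.net:6443 api.ocp.domain.net)" && echo "same certificate"`.
Identical fingerprints only show that the LB passes TLS through unchanged. They do not show that the local kubeconfig trusts the signer of that certificate.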
@vrutkovs I managed to deploy it. I will let you know soon. |
So, now it works every time, but I need to run a command to create manifests twice for some reason:
The key here is to run
And now it is perfectly fine! This is debug output from the first iteration, with the warning:
@vrutkovs Have you seen something like this? Thank you! |
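The exact commands did not survive in this comment, so the following is only a guess at the shape of the workaround: run `openshift-install create manifests` twice against the same directory, then generate the ignition configs. The function name and the `OPENSHIFT_INSTALL` override are hypothetical and exist only so the sketch can be tried against a stub.

```shell
#!/usr/bin/env bash
# Assumed workaround from this thread: create manifests twice, then
# create ignition configs. Stops at the first failing step.
generate_ignition() {
  local dir="$1"
  # OPENSHIFT_INSTALL is a hypothetical override, not an installer feature.
  local installer="${OPENSHIFT_INSTALL:-openshift-install}"
  "$installer" create manifests --dir="$dir" &&   # first pass (shows the warning)
    "$installer" create manifests --dir="$dir" && # second pass
    "$installer" create ignition-configs --dir="$dir"
}
```

Called as `generate_ignition ./cluster`, with `./cluster` containing a fresh install-config.yaml. Whether the second pass is still needed probably depends on the installer build.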
That's weird. Are you reusing the same directory across attempts? If yes you should remove hidden files (".openshift_*") and/or use |
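For anyone who does reuse a directory, the hidden-file clean-up suggested above could look roughly like this. It is a sketch with a made-up function name. It removes only top-level files matching `.openshift_*` and leaves install-config.yaml alone.

```shell
#!/usr/bin/env bash
# Delete hidden installer state/log files (.openshift_*) from the top
# level of an install directory, printing each file that is removed.
clean_install_state() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name '.openshift_*' -print -delete
}
```

For example, `clean_install_state ./cluster` before running the installer again in `./cluster`.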
I am not reusing it, it is always fresh directory:
Super weird stuff, and I don't really understand why this was not causing the issue before. Thanks Vadim! |
Here it is with completely empty directory:
|
What's the output of |
|
hmm, that's very odd. Could you also attach the contents of |
You are right, this is super odd. This is my install-config.yaml file:
|
It works fine if I use
I have also tried on the same machine, to use OKD
And it has the warning, which creates "unbootstrapable" ignition files. |
Okay, some OKD-specific change must have broken that. Not quite sure which change exactly |
I noticed that you are creating a cluster with one master and 3 workers. It would be interesting to see if you still see the warning in the following two scenarios:
What happens if you run If you don't get the warning, that might help narrow the search. |
Hello! Case 1:
Case 2:
And it actually did narrow the search :) Thanks! |
Right, I see, so it's the single-master install code breaking things |
I'm seeing exactly the same behaviour. |
This is a pretty important issue, but since it has a workaround we won't block GA on it |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Any plans to fix that? |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closing this issue. In response to this:
Describe the bug
OKD 4.4 installer updates FCOS image to version 32
Version
4.4.0-0.okd-2020-07-01-045420 (but I also checked it with a lot of other 4.4 builds)
How reproducible
This worked 100% of the time before FCOS 32 was released and marked as STABLE.
Kernel PXE:
fedora-coreos-31.20200517.3.0-live-initramfs.x86_64.img
fedora-coreos-31.20200517.3.0-live-kernel-x86_64
Baremetal CoreOS image:
fedora-coreos-31.20200505.3.0-metal.x86_64.raw.xz
Log bundle
[must-gather ] OUT Get https://api.ocp.domain.net:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")
[must-gather ] OUT
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")
$ openshift-install wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.4.0-0.okd-2020-07-01-045420
DEBUG Built from commit ddd989504d76ae25b0a020db9b29f9375d5ce242
INFO Waiting up to 20m0s for the Kubernetes API at https://api.ocp.domain.net:6443...
DEBUG Still waiting for the Kubernetes API: Get https://api.ocp.domain.net:6443/version?timeout=32s: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")
Related issues
Possibly:
#229
#227
Thank you!