Update the VMware metadata to new, modern defaults #1119

Closed
miabbott opened this issue Mar 8, 2022 · 32 comments

Comments

@miabbott
Member

miabbott commented Mar 8, 2022

Describe the enhancement

The current configuration used when constructing the OVA disk for VMware installs specifies metadata in the OVF which will soon be unsupported by VMware. Specifically, we are specifying hardware version 13, which indicates that the oldest supported version of VMware ESXi is 6.5.

ESXi 6.5 will lose "General Support" from VMware on Oct 15, 2022, so it would behoove us to update the hardware version we specify in the OVF so that the oldest version of VMware ESXi that FCOS can run on is one still supported by VMware. I propose we use hardware version 17, which makes 7.0 the oldest version of ESXi that can be used/supported.

Additionally, the OVF specifies that the guest OS is a variant of RHEL7 which is wildly incorrect. It may be desirable to update that piece of metadata to indicate it is fedora64Guest. See https://vdc-download.vmware.com/vmwb-repository/dcr-public/da47f910-60ac-438b-8b9b-6122f4d14524/16b7274a-bf8b-4b4c-a05e-746f2aa93c8c/doc/vim.vm.GuestOsDescriptor.GuestOsIdentifier.html. (Alternately, we use rhel8_64Guest which is closer to the truth, but requires the use of at least ESXi 6.5 U3 and hardware version 14. See https://kb.vmware.com/s/article/67443 and https://kb.vmware.com/s/article/67621)

Finally, a guest FCOS VM will default to using legacy BIOS firmware with the current OVF settings. It may be desirable to update the settings to default to UEFI firmware, which will enable the use of SecureBoot in the guest. See https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.vm.GuestOsDescriptor.FirmwareType.html

At a minimum, I propose we start using hardware version 17 (which also unblocks the use of the UEFI firmware; minimum hardware version for EFI support is 13).

System details

- VMware

Additional information

We ultimately want this set of changes for RHCOS, as OCP will be dropping support for ESXi 6.5 as part of the next OCP release.

The proposed changes are in a PR to coreos-assembler - coreos/coreos-assembler#2740

Workaround

If some or none of these changes are accepted, it may be possible for users to upgrade the hardware version of the guest VM via the vSphere management tools (with caveats) - https://kb.vmware.com/s/article/1010675

@dustymabe
Member

We discussed this in the community meeting today.

12:03:43   dustymabe | #agreed We aren't opposed to updating the OVA to contain
                     | more modern defaults, but would like to know how users can
                     | revert the behavior if they need to since the platforms
                     | aren't EOL yet. We'd either need some documentation to let
                     | the users know how or to know that the users can ignore   
                     | warnings and still proceed.

@jcpowermac

jcpowermac commented Mar 9, 2022

@dustymabe it might be annoying but someone could easily do this. untar the ova and change the values within the coreos.ovf file.

@dustymabe
Member

thank you for that info @jcpowermac

@miabbott
Member Author

I created an FCOS OVA that has:

  • hardware version 17
  • UEFI firmware
  • osType == rhel8_64Guest

...using the PR from @jcpowermac (coreos/coreos-assembler#2740)

https://miabbott.fedorapeople.org/fedora-coreos-35.20220310.dev.0-vmware.x86_64.ova

I'm requesting access to our internal vSphere environment, where I'll do some sanity checks.

@fifofonix if you can report back anything you find using it in your environment, that would be awesome!

@fifofonix

@miabbott I've tried creating a VM from the ova on our vSphere 6.7.0.51000 environment but unfortunately it is failing with a message regarding an unsupported hardware family (Unsupported hardware family 'vmx-17'). This version is introduced with vSphere 7.0.0 (https://kb.vmware.com/s/article/1003746) and presumably cannot be run on lower platform versions. I think the next OVA is currently on 'vmx-13' which means it supports back to vSphere 6.5 which makes sense since EOL for 6.5 is 10/15/22.

Note: I am also a desktop VMware Fusion 12.2.1 user and should be able to spin up the OVA you've shared on the desktop and validate it functions if that is useful. This won't help me with one of the main things I do want to check on vSphere - that open-vm-tools still operates correctly allowing vSphere to interact with the FCOS guest in various ways.

@jcpowermac

> @miabbott I've tried creating a VM from the ova on our vSphere 6.7.0.51000 environment but unfortunately it is failing with a message regarding an unsupported hardware family (Unsupported hardware family 'vmx-17'). This version is introduced with vSphere 7.0.0 (https://kb.vmware.com/s/article/1003746) and presumably cannot be run on lower platform versions.

@fifofonix Right, you can't run a higher hardware version than what the ESXi host supports.

> I think the next OVA is currently on 'vmx-13' which means it supports back to vSphere 6.5 which makes sense since EOL for 6.5 is 10/15/22.

6.7 is also included in that date.
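
As a rough illustration of the compatibility rule discussed here, the check can be sketched as a small shell helper. The table only lists the hardware version / ESXi pairs mentioned in this thread (13 and 6.5, 14 and 6.7, 17 and 7.0), and the function names are hypothetical; consult the VMware KB linked above for the full list.

```shell
#!/bin/sh
# Sketch only. The table lists just the pairs mentioned in this thread;
# the function names are hypothetical.

# Print the oldest ESXi release that can run the given hardware version.
min_esxi_for_hw() {
    case "$1" in
        13) echo "6.5" ;;
        14) echo "6.7" ;;
        17) echo "7.0" ;;
        *)  echo "unknown"; return 1 ;;
    esac
}

# Succeed if an image built for hardware version $1 can be deployed on a
# host whose highest supported hardware version is $2.
host_can_run() {
    [ "$1" -le "$2" ]
}
```

With these definitions, `host_can_run 17 14` fails, which corresponds to the "Unsupported hardware family 'vmx-17'" error reported above, while `host_can_run 14 14` succeeds.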

@miabbott
Member Author

@fifofonix thanks for the info about the errors in your environment; this seems to match what is expected.

If you could do one more test for us, that would be useful. I'd like you to modify the OVF in the file to change the hardware version to 14 and see if you encounter any errors in your environment:

$ tar -xvf fedora-coreos-35.20220310.dev.0-vmware.x86_64.ova
$ sed -i 's|vmx-17|vmx-14|' coreos.ovf
$ tar -H posix -cvf fedora-coreos-35.20220310.dev.0-vmware-vmx-14.x86_64.ova coreos.ovf disk.vmdk 

Hardware version 14 is required for the rhel8_64Guest osType and puts the oldest supported version of ESXi at 6.7, so it should allow you to use the new OVA in your environment and allows you to use UEFI firmware.
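
For anyone repeating this with a different target version, the sed step above can be wrapped in a small helper. This is only a sketch: it assumes the OVF records the hardware version as a string of the form vmx-NN (which is what the sed command above relies on), and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the OVF stores the hardware version as "vmx-NN".

# Print the first vmx-NN hardware version found in the OVF file ($1).
get_hw_version() {
    grep -o 'vmx-[0-9][0-9]*' "$1" | head -n 1
}

# Rewrite every vmx-NN occurrence in the OVF ($1) to vmx-<new> ($2).
# A temporary file is used instead of "sed -i" so that this also works
# with sed implementations that lack in-place editing.
set_hw_version() {
    ovf="$1"
    new="$2"
    tmp="$(mktemp)" || return 1
    if sed "s|vmx-[0-9][0-9]*|vmx-${new}|g" "$ovf" > "$tmp"; then
        # Copy the contents back so the original file permissions are kept.
        cat "$tmp" > "$ovf"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

After extracting the OVA, `get_hw_version coreos.ovf` shows the current value and `set_hw_version coreos.ovf 14` rewrites it; the OVA is then repacked with the tar command shown above.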

@fifofonix

@miabbott I can provision machines from a modified OVA like this and they seem to run fine in vSphere. I can see that vSphere hardware family has clicked forward to 14, and that the Guest OS description is Red Hat Fedora (64-bit). The layered open-vm-tools also seems to function fine.

The terraform provisioner I used defaulted to a BIOS firmware even though I believe the new OVA-specified default is EFI. I was able to stop the machine, change to EFI enabling secure boot and all seemed to proceed normally. However, I can make this same change on older OVA versions too without issue. (Note that when changing an older OVA vm firmware from BIOS to EFI the former is 'recommended' but in the newer OVA the latter is 'recommended').

The environment I run these development / throwaway tests on is a cluster of 4 NUCs and unfortunately I do not believe these have a TPM chip. I'm not sure whether this is a requirement from a root-of-trust basis for a true secure boot but I would think so. In the journal I can see secure boot messages related to seeking various certificates but I also see a TPM bypass message. Furthermore, of course given yours is an unsigned OS I'd expect some kind of loud failure of secure boot somewhere but this is not immediately apparent to me...

...is there something you'd like to see from the journal logs?

@jcpowermac

@fifofonix you can set firmware in terraform: https://registry.terraform.io/providers/hashicorp/vsphere/latest/docs/resources/virtual_machine#firmware

@miabbott
Member Author

> Guest OS description is Red Hat Fedora (64-bit)

That seems...odd, but I don't think it is a show-stopper.

> The environment I run these development / throwaway tests on is a cluster of 4 NUCs and unfortunately I do not believe these have a TPM chip. I'm not sure whether this is a requirement from a root-of-trust basis for a true secure boot but I would think so. In the journal I can see secure boot messages related to seeking various certificates but I also see a TPM bypass message. Furthermore, of course given yours is an unsigned OS I'd expect some kind of loud failure of secure boot somewhere but this is not immediately apparent to me...
>
> ...is there something you'd like to see from the journal logs?

I believe it should be possible to successfully boot any Fedora-derived OS using Secure Boot; as I understand it, Fedora has an agreement with Microsoft to sign the shim binary shipped with Fedora to allow for it.

On my local Fedora Silverblue system, I can see the following in the journal:

Feb 12 16:26:12 fedora kernel: secureboot: Secure boot enabled
Feb 12 16:26:12 fedora kernel: Kernel is locked down from EFI Secure Boot mode; see man kernel_lockdown.7

Reviewing the vSphere docs on enabling Secure Boot, it looks like just toggling the use of EFI and Secure Boot is all that is required. If the OS comes up, then it was successful.
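
To check for the journal message quoted above from a script, something like the following could be used. It is a minimal sketch that assumes the kernel logs exactly the "secureboot: Secure boot enabled" text shown in the excerpt; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: reads kernel log lines on stdin and succeeds if the kernel
# reported Secure Boot as enabled (message text taken from the excerpt above).
secure_boot_enabled() {
    grep -q 'secureboot: Secure boot enabled'
}
```

On a running guest this could be fed from the kernel journal for the current boot, e.g. `journalctl -k -b | secure_boot_enabled && echo "Secure Boot is on"`.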

I think the tests you have done thus far prove that the changes we are proposing are workable; we can provide docs showing how to change the parameters in the OVF for users that are not in a position to upgrade to vSphere 7.0.

@fifofonix

I recreated the machines with the terraform firmware option (thanks @jcpowermac) set to EFI. I can then stop the VM, set the secure boot checkbox, and reboot successfully, with the journal showing the messages @miabbott quotes. However, for reasons not yet understood, when I then use the accompanying efi_secure_boot_enabled option to get Secure Boot on first boot, the boot hangs, even though I can see in the management console that the secure boot checkbox is set. I'm running some more tests on simpler ignition files to see whether it has anything to do with the various systemd units I am invoking.

In terms of the Guest OS description I noticed that it only changes to Red Hat Fedora (64 bit) post the layering of open-vm-tools. Prior to that it reads something more like Redhat Enterprise Linux 64 Bit (or similar).

For people who may be stuck on VMware 6.7 due to other infra dependencies the suggested changes may be workable - requiring them to modify the OVA but also hosting it somewhere on their infra - so that it can be used easily by terraform or whatever tooling they use.

@fifofonix

@miabbott I've stripped down my ignition to bare minimum (nearly) and the failure of first secure boot is still repeatable. I've captured a serial port log read out but I have zero experience of reading these. Hope this helps.

@miabbott
Member Author

miabbott commented Mar 11, 2022

@fifofonix that serial log is useful. It looks like you are hitting coreos/ignition#1092

        Starting Ignition (fetch-offline)...
[    2.699953] coreos-ignition-setup-user[592]: File /mnt/boot_partition/ignition/config.ign does not exist.. Skipping copy
[    2.702968] systemd[1]: Reached target Basic System.
[    2.705005] systemd[1]: Starting Ignition OSTree: Regenerate Filesystem UUID (boot)...
[    2.707419] systemd[1]: Finished Ignition OSTree: Regenerate Filesystem UUID (boot).
[    2.709813] systemd[1]: Starting CoreOS Ignition User Config Setup...
[    2.710731] Lockdown: ignition: raw io port access is restricted; see man kernel_lockdown.7
[    2.711931] ignition[600]: Ignition 2.13.0
[    2.715397] systemd[1]: Finished CoreOS Ignition User Config Setup.
[FAILED] Failed to start Ignition (fetch-offline).
[    2.717551] ignition[600]: Stage: fetch-offline
See 'systemctl status ignition-fetch-offline.service' for details.
[    2.719349] systemd[1]: Starting Ignition (fetch-offline)...
[DEPEND] Dependency failed for Ignition Complete.
[DEPEND] Dependency failed for Initrd Default Target.

Since access to the Ignition config is not possible, the system can't be provisioned and fails to come up.

I think the workaround here is to do the initial boot without Secure Boot and then enable it after. Clearly not the ideal case, but the upstream issue/PR doesn't appear to be moving towards resolution - vmware-archive/vmw-guestinfo#21
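
For anyone triaging a similar serial log, the failed units can be pulled out with a short filter. This is a sketch under the assumption that the raw capture contains ANSI colour escape sequences around the status words and unit names, as serial console output from systemd normally does; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes systemd status lines of the form
# "[FAILED] Failed to start <unit>." wrapped in ANSI colour codes.

# Remove ANSI colour sequences (ESC [ ... m) from stdin.
strip_ansi() {
    esc="$(printf '\033')"
    sed "s/${esc}\[[0-9;]*m//g"
}

# Print the name of every unit that systemd reported as failed to start.
failed_units() {
    strip_ansi | sed -n 's/^\[FAILED\] Failed to start \(.*\)\.$/\1/p'
}
```

Running `failed_units < serial.log` (with serial.log being a hypothetical capture file) would list "Ignition (fetch-offline)" for the log above.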

@dustymabe dustymabe added meeting topics for meetings and removed meeting topics for meetings labels Mar 16, 2022
@dustymabe
Member

We discussed this in the community meeting today. @miabbott and @bgilbert are going to work together on a few more details and we'll revisit this next week.

@bgilbert
Contributor

As I mentioned in the meeting, the virtual hardware versions listed in VMware docs are the maximum versions supported by each VMware release. So nothing bad will happen if VMware 6.5 goes EOL and we're still on HW version 13; we could theoretically stay on HW 13 indefinitely. For maximum compatibility, it makes sense to update only when all VMware releases that need our current HW version are EOL.

Older versions of VMware Workstation/Fusion/Player are already EOL, so the only issue is ESXi. ESXi 6.5 and 6.7 go EOL on 2022-10-15, and we can then update to HW 17 without dropping any supported VMware products.

OCP has its own, more aggressive lifecycle for VMware virtual hardware support. That's fine; we can template the HW version through image.yaml. While we're at it, we should also template the OS version, so FCOS can claim to be Fedora and RHCOS can claim to be RHEL 8.

I think it makes sense to switch the VMware image to UEFI regardless.

@dustymabe
Member

Anyone have a creative (or not so creative) idea to remind ourselves to move to HW 17 on/after 2022-10-15 ?

Also, when we do that should we create the docs telling people how to revert the behavior?

@jcpowermac

> Anyone have a creative (or not so creative) idea to remind ourselves to move to HW 17 on/after 2022-10-15 ?

Members of OCP SPLAT will care. I will put in a calendar entry so I remember to reach out.

@bgilbert
Contributor

> Anyone have a creative (or not so creative) idea to remind ourselves to move to HW 17 on/after 2022-10-15 ?

Slack /remind? 😁

> Also, when we do that should we create the docs telling people how to revert the behavior?

I would lean no. If you're using an EOL component in your stack, you're on your own. Or we could split the difference and send a coreos-status post.

@cgwalters
Member

> Slack /remind? 😁

There are also things like https://docs.github.com/en/actions/managing-issues-and-pull-requests/scheduling-issue-creation which I've been meaning to look at.

@dustymabe
Member

> > Anyone have a creative (or not so creative) idea to remind ourselves to move to HW 17 on/after 2022-10-15 ?
>
> Members of OCP SPLAT will care. I will put in a calendar entry so I remember to reach out.

@jcpowermac Just so we're clear, my understanding is that RHCOS/OCP will make this change earlier than that EOL date to reflect product requirements. The work Benjamin is doing (linked in #1119 (comment)) will make it possible to configure that differently for RHCOS/FCOS.

@jcpowermac

> > > Anyone have a creative (or not so creative) idea to remind ourselves to move to HW 17 on/after 2022-10-15 ?
> >
> > Members of OCP SPLAT will care. I will put in a calendar entry so I remember to reach out.
>
> @jcpowermac Just so we're clear, my understanding is that RHCOS/OCP will make this change earlier than that EOL date to reflect product requirements.

Yep

@miabbott
Member Author

> @jcpowermac Just so we're clear, my understanding is that RHCOS/OCP will make this change earlier than that EOL date to reflect product requirements. The work Benjamin is doing (linked in #1119 (comment)) will make it possible to configure that differently for RHCOS/FCOS.

We are going to update the hardware version that RHCOS uses to 15 which lines up with the minimums we want to enforce for OCP (notably the CSI driver requires hw version 15) - openshift/os#748

@bgilbert
Contributor

Ignition upstream and the Ignition FCOS/RHCOS packages all have workarounds for coreos/ignition#1092 now, so coreos/coreos-assembler#2767 enables Secure Boot in the OVA.

@dustymabe
Member

While we're on the "updating VMware defaults" topic, should we revisit/update/close #871 ?

@bgilbert
Contributor

Might be good. I'm not planning to work on #871 in the short term though.

@dustymabe
Member

Since some changes are starting to land - do we need to send out a communication about this to users?

@miabbott
Member Author

> Since some changes are starting to land - do we need to send out a communication about this to users?

I think once coreos/coreos-assembler#2767 lands, we can do an email to [email protected] with the changes that have landed.

@dustymabe
Member

We discussed this in the community meeting today.

13:05:31    dustymabe | #info because of some great work by bgilbert we now have 
                      | the flexibility to use different values in the OVA for
                      | FCOS and RHCOS so each derivative can use values most
                      | appropriate. FCOS will remain at hw version 13 until after
                      | the vSphere 6.5/6.7 EOL date.
13:06:06    dustymabe | #info the defaults were changed for FCOS to use EFI
                      | firmware and Secure Boot by default
13:12:21    dustymabe | #action miabbott to send an email to the list about
                      | planned updates to the OVA

@miabbott
Member Author

This ticket got a bit ambitious: it encompasses multiple changes, which makes it hard to track them for the purposes of the release notes and communicating to users.

I've created two issues to track the two separate changes we are concerned with:

@miabbott
Member Author

PR to add instructions on modifying the OVF metadata - coreos/fedora-coreos-docs#376
