rhel 8.10 - compute-redhat.yml breaks image build #419

Open
xdkreij opened this issue Jun 28, 2024 · 10 comments

Comments

@xdkreij

xdkreij commented Jun 28, 2024

Problem description
During iPXE boot, the following problem pops up. Has anyone encountered this before?

image

Command used
ansible-playbook compute-redhat.yml -v

Expected results
A working image that boots successfully :-)

@xdkreij
Author

xdkreij commented Jun 28, 2024

A dump of found 'issues'

000000 04:59:47 [root@cpu site]# luna osimage kernel compute
Traceback (most recent call last):
  File "/bin/luna", line 7, in <module>
    CLI = Cli().main()
  File "/trinity/local/python/lib/python3.10/site-packages/luna/cli.py", line 109, in main
    self.call_class()
  File "/trinity/local/python/lib/python3.10/site-packages/luna/cli.py", line 137, in call_class
    call(self.args, self.parser, self.subparsers)
  File "/trinity/local/python/lib/python3.10/site-packages/luna/osimage.py", line 67, in __init__
    call(self)
  File "/trinity/local/python/lib/python3.10/site-packages/luna/osimage.py", line 299, in kernel_osimage
    http_response = result.json()
AttributeError: 'types.SimpleNamespace' object has no attribute 'json'

Adding a print statement to the Python code, like so: print(result.content), results in

{'message': 'osimage pack for compute already queued', 'request_id': '1719565192.3494275247818686'}
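For illustration only: the traceback and the printed value suggest that `result` is sometimes a `types.SimpleNamespace` whose `content` already holds a decoded dict, rather than a response object with a `.json()` method. Below is a minimal sketch of an accessor that tolerates both shapes. The helper name `payload_of` is hypothetical and is not part of the luna CLI, and this is not necessarily how the maintainers fixed it.

```python
from types import SimpleNamespace


def payload_of(result):
    """Return the decoded payload from either kind of result object."""
    json_method = getattr(result, "json", None)
    if callable(json_method):
        # Response-like object (for example requests.Response): decode the body.
        return json_method()
    # Fallback (assumption): the payload is already stored on .content.
    return getattr(result, "content", None)


# The shape seen in the print output above.
queued = SimpleNamespace(
    content={
        "message": "osimage pack for compute already queued",
        "request_id": "1719565192.3494275247818686",
    }
)
print(payload_of(queued)["message"])
```

With a helper like this, the "already queued" reply would be reported as a normal message instead of raising `AttributeError`.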

@aphmschonewille
Member

"osimage pack for compute already queued" normally means that another packing for that image was already in progress. It prevents it from being packed twice at the same time. However if changes were made while the other packing was already in progress, things will go wrong. Was there only one packing active at that time, or were there concurrent operations going or something else?
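As a rough sketch of the behaviour described above (one pending pack per image, with duplicates rejected), a toy in-memory guard could look like the following. The class `PackQueue` and the "queued" success message are hypothetical; only the "already queued" wording comes from this thread, and the real luna daemon may work quite differently.

```python
import threading


class PackQueue:
    """Toy in-memory guard that allows one pending pack per image."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def request(self, image):
        """Queue a pack for `image` and return a status message."""
        with self._lock:
            if image in self._pending:
                return f"osimage pack for {image} already queued"
            self._pending.add(image)
            return f"osimage pack for {image} queued"

    def finish(self, image):
        """Mark the pack for `image` as done so it can be queued again."""
        with self._lock:
            self._pending.discard(image)


queue = PackQueue()
print(queue.request("compute"))  # first request is accepted
print(queue.request("compute"))  # rejected while the first is still pending
queue.finish("compute")
print(queue.request("compute"))  # accepted again after completion
```

Note that a guard like this only prevents duplicate packs; it does not detect changes made to the image while a pack is running, which is the failure mode mentioned above.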

@xdkreij
Author

xdkreij commented Jul 2, 2024

> "osimage pack for compute already queued" normally means that another packing for that image was already in progress. It prevents it from being packed twice at the same time. However if changes were made while the other packing was already in progress, things will go wrong. Was there only one packing active at that time, or were there concurrent operations going or something else?

Only one - via the compute-redhat.yml :-)

I wonder if this would eventually result in the 'kernel panic'. The playbook seems to complete successfully, but apparently something goes terribly wrong with the image (build?) itself.

(Side note: I do have to fix rhsm.conf halfway through the play within the image itself, otherwise the redhat.repo gets overwritten and redirects to cdn.redhat.com instead. But I doubt that would cause image issues, since afterwards everything kicks off fine.)

@xdkreij
Author

xdkreij commented Jul 2, 2024

w00000t.... i think i may have solved it...

image

What I did is described here: https://www.linuxquestions.org/questions/linux-server-73/centos-7-does-not-boot-4175619015/

Like so...

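# copy init and systemd from the host into the compute image
cp /sbin/init /trinity/images/compute/sbin/init
cp /lib/systemd/systemd /trinity/images/compute/lib/systemd/systemd

# repack the image and update node001, then reboot the node
luna osimage pack compute
luna node change -o compute node001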

I've got no clue whatsoever why it doesn't work without this, but I'll test compute-redhat.yml again with a new image soon to verify whether this actually solved it.
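A quick way to check whether a freshly built image has the same problem before booting a node from it is to look for the two files the workaround copies. This is only a sketch: the function name `missing_init_files` is made up for illustration, and the image path is the one used in the commands above.

```python
import os

# Paths (relative to the image root) that the workaround above copied in.
REQUIRED = ("sbin/init", "lib/systemd/systemd")


def missing_init_files(image_root, required=REQUIRED):
    """Return the required paths that are absent below image_root."""
    missing = []
    for rel in required:
        # lexists() counts a symlink as present without following the final link.
        if not os.path.lexists(os.path.join(image_root, rel)):
            missing.append("/" + rel)
    return missing


if __name__ == "__main__":
    # An empty list means both files (or symlinks) are present in the image.
    print(missing_init_files("/trinity/images/compute"))
```

One caveat: if a directory inside the image (such as `sbin` or `lib`) is an absolute symlink, the check resolves it against the host filesystem rather than the image, so the result can be misleading in that case.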

@aphmschonewille
Member

There were two problems that you hit. There was indeed a bug in the CLI where a returned call caused the Python trace. That has been solved and will be released soon. The other problem you see, the missing /sbin/init, is something I cannot really explain yet. May I ask when you cloned the TrinityX repo? This helps us determine whether this is an ongoing problem or something that has already been solved through other fixes.

@xdkreij
Author

xdkreij commented Jul 3, 2024

> There were two problems that you hit. There was indeed a bug in the CLI where a returned call caused the Python trace. That has been solved and will be released soon. The other problem you see, the missing /sbin/init, is something I cannot really explain yet. May I ask when you cloned the TrinityX repo? This helps us determine whether this is an ongoing problem or something that has already been solved through other fixes.

The repo was cloned on the 24th (luckily I keep track of things using ARA), during the re-deployment of the entire controller on RHEL 8.10.

@trick-1

trick-1 commented Aug 7, 2024

I too have this issue. While executing "ansible-playbook default.yml" I noticed the following:

[WARNING]: Target is a chroot or systemd is offline. This can lead to false positives or prevent the init system tools from working.

I suspect it might be related to systemd not behaving in a chroot.

@trick-1

trick-1 commented Aug 7, 2024

A bit more information. I pulled the repo yesterday and built a new controller per the instructions here (https://supercomputing.tue.nl/documentation/administration/trinityx/installation/#fix-uchiwa-logrotate-script-owner). It completed without issue.

I then went on to build the node images:

# ansible-playbook compute-default.yml -v

It also completed without error, except for the warning above.

I have since changed to an alternative distribution build:
alternative_distribution: Rocky-9

I get the same error regardless.

I also note that during bring-up of the node it fails to copy /etc/passwd and /etc/groups.

@trick-1

trick-1 commented Aug 8, 2024

So I commented out the following section:

#- import_playbook: imports/trinity-redhat-image-setup.yml
#  vars:
#    hostlist: "{{ hostvars['localhost']['image_name'] }}.osimages.luna"

and the image builds and deploys without error. Time to step through the setup and see what breaks.

@aphmschonewille
Member

Hi gents,

Quite a bit has changed (read: fixes, improvements and, of course, the addition of new bugs :)
We've tested the installation a million times, using defaults as much as possible, but do not see any problems or any of the above issues. Who has time to pull the latest release, 14.4u1, and try it? Or can we close the issue?

--Antoine
