Test building on Snellius: Zen4/H100 #903
base: 2023.06-software.eessi.io
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80
bot: show_config
bot: show_config
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
Hmm, it successfully installs CUDA and cuDNN in the host-injections dir, but then I see:
I'm not sure if that error means the step of putting stuff in host-injections didn't finish properly, but I see the installation of …
Another strange thing is that the above job ran until the walltime ran out, even though the builds were done after 20 minutes or so and nothing else got added to the output file. Logging into the node, it seemed to just hang. The …
As a hacky fix, I've set a more permissive …
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
Killing the tee task (manually) makes the job continue...
The hang happens again in the test step:
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
…lled, and we need to accept the eula
bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
This https://stackoverflow.com/questions/73158567/bash-script-is-stuck-at-tee#comment129210841_73158567 might be a pointer to my issue. I seem to remember having some trouble with containers not 'returning' properly upon exit, i.e. I'd have to press Enter to get a prompt again or something. Maybe that is indeed keeping the pipe open, and it just seems like …
Edit: one thing is that I also still see a lot of cvmfs2 processes:
Maybe these somehow prevent the container from exiting completely, since they are not being cleaned up properly? They do also disappear if I kill tee, so I'm not sure if they're the cause or the effect...
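The general mechanism suspected here (a reader such as tee only sees end-of-file once every process holding the write end of the pipe has closed it) can be reproduced in isolation. Below is a minimal sketch, assuming a POSIX `sh` and `sleep` are available. The `sleep` process merely stands in for a lingering background process such as the cvmfs2 ones, and `seconds_until_eof` is a hypothetical helper written for this illustration, not part of the bot or the build scripts. It does not confirm that this is what happens in the build job.

```python
import subprocess
import time


def seconds_until_eof(shell_command: str) -> float:
    """Run shell_command with stdout on a pipe and time how long the reader waits for EOF."""
    start = time.monotonic()
    proc = subprocess.Popen(["sh", "-c", shell_command], stdout=subprocess.PIPE)
    # read() only returns once *every* process holding the write end has closed it,
    # which is the same condition a `... | tee logfile` reader waits for.
    proc.stdout.read()
    proc.wait()
    return time.monotonic() - start


# The shell exits right after `echo`, but the background sleep inherited the
# pipe as its stdout, so the reader keeps waiting until sleep itself exits.
leaky = seconds_until_eof("sleep 2 & echo done")

# Same command, but the background process no longer holds the pipe,
# so the reader sees EOF as soon as the shell exits.
clean = seconds_until_eof("sleep 2 >/dev/null 2>&1 & echo done")

print(f"background child holds the pipe: EOF after {leaky:.1f}s")
print(f"background child redirected:     EOF after {clean:.1f}s")
```

If the same thing is going on in the job, listing which processes still hold the write end of the pipe (for example via `/proc/<pid>/fd` on the node) would help tell cause from effect.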
Looking at the logs, everything went fine in #903 (comment), so I'm puzzled about the failure. However, I don't see a tarball in the jobdir, and I don't see a tarball creation step in the build logs... I will need to figure out why not, but that's for next week.
Ah, my guess is that it's because I kill the … (Line 22 in 51a1006)
For now, I've set up a personal bot instance to build some experience with bot deployment. This PR is purely to test that instance.