Test building on Snellius: Zen4/H100 #903

Open
wants to merge 2 commits into
base: 2023.06-software.eessi.io

casparvl
Collaborator

For now, I've set up a personal bot instance to build some experience with bot deployment. This PR is purely to test that instance.

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-casparvl

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/zen4
  • repositories: eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

@casparvl casparvl added the tests (Related to software testing) and accel:nvidia labels Jan 31, 2025
@EESSI EESSI deleted a comment from eessi-bot bot Jan 31, 2025
@EESSI EESSI deleted a comment from eessi-bot bot Jan 31, 2025
@EESSI EESSI deleted a comment from eessi-bot-casparvl bot Jan 31, 2025
@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
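
The "expanded format" line in the bot replies above suggests that the short filter keys (repo, accel, and later arch) are rewritten to long names before the command is handled. A minimal sketch of that mapping, assuming a hypothetical helper `expand_bot_command` that is not part of the bot's code base:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the bot): expands the short filter keys
# seen in this thread (repo, arch, accel) to the long names shown in the
# bot's "expanded format" line. Other words are passed through unchanged.
expand_bot_command() {
  local word
  local out=()
  for word in "$@"; do
    case "${word}" in
      repo:*)  out+=("repository:${word#repo:}") ;;
      arch:*)  out+=("architecture:${word#arch:}") ;;
      accel:*) out+=("accelerator:${word#accel:}") ;;
      *)       out+=("${word}") ;;
    esac
  done
  # Join the words with single spaces.
  echo "${out[*]}"
}

expanded="$(expand_bot_command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software accel:nvidia/cc80)"
echo "${expanded}"
```

This only illustrates the key expansion; why a given instance decides to submit no jobs is not covered here.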

@casparvl
Collaborator Author

bot: show_config

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-casparvl

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/zen4
  • repositories: eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

@casparvl
Collaborator Author

bot: show_config

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Jan 31, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-casparvl

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat

@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

New job on instance eessi-bot-casparvl for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/casparl/eessi-bot-casparvl/jobs/2025.01/pr_903/9720670

date | job status | comment
Jan 31 21:03:36 UTC 2025 | submitted | job id 9720670 awaits release by job manager
Jan 31 21:03:53 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 31 21:04:58 UTC 2025 | running | job 9720670 is running
Jan 31 22:05:02 UTC 2025 | finished | 🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job9720670.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Jan 31 22:05:02 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job9720670.test does not exist in job directory, or parsing it failed.

@casparvl
Collaborator Author

Hmm, it successfully installed CUDA and cuDNN in the host injection dir, but then I see:

Found host CUDA version 9.0
Found NVIDIA GPU driver version 555.42.06
Using downloaded list of libraries
Matched 48 CUDA Libraries
ERROR: The current umask (0027) does not allow global read permissions, you'll want everyone to be able to read the created directory.
>> Using /home/casparl/EESSI/bot-instance/SHARED/easybuild/sources as shared EasyBuild source path
>> MODULEPATH set up: /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all:/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/modules/all:/gpfs/work2/1/casparl/eessi-bot-casparvl/jobs/2025.01/pr_903/event_d2a0a360-e016-11ef-90bb-fd28efd9803a/run_000/linux_x86_64_amd_zen4/eessi.io-2023.06-software/init/modules
Processing easystack file easystacks/software.eessi.io/2023.06/accel/nvidia/zen4_h100/eessi-2023.06-eb-4.9.4-2023a-CUDA.yml...

I'm not sure if that error means the step of putting stuff in host-injections isn't properly finished, but I see the installation of pmt is retriggering the CUDA install, which then fails because we don't accept the license agreement.

@casparvl
Collaborator Author

Another strange thing is that the above job ran until the walltime ran out, even though the builds were done after 20 mins or so and nothing else got added to the output file. Logging into the node, it seemed to just hang. The tee process was still there. Maybe that's also why the file _bot_job9720670.result wasn't found: that tee process was probably killed at the end of the walltime. Not sure why it would hang though.

@casparvl
Collaborator Author

As a hacky fix, I've set a more permissive umask at the start of the bot-build.slurm job.
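
For reference, a minimal sketch of what such a workaround could look like. The value 0022 is an assumption (the comment above does not say which umask was chosen), and the scratch directory and its name are purely illustrative:

```shell
#!/usr/bin/env bash
# Assumption: 0022 is one possible permissive value; the umask actually used
# in the job script is not stated in this thread.
umask 0022

# Create a scratch directory and inspect the permissions a new subdirectory
# gets under this umask (stat -c is the GNU coreutils form).
scratch_dir="$(mktemp -d)"
mkdir "${scratch_dir}/host_injections_demo"   # illustrative name only
dir_mode="$(stat -c '%a' "${scratch_dir}/host_injections_demo")"
echo "umask=$(umask) mode=${dir_mode}"
rm -rf "${scratch_dir}"
```

With the original umask of 0027, the same mkdir would produce a directory without any permissions for "other", which is what the ERROR line in the log complains about.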

@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

New job on instance eessi-bot-casparvl for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/casparl/eessi-bot-casparvl/jobs/2025.01/pr_903/9721287

date | job status | comment
Jan 31 22:09:36 UTC 2025 | submitted | job id 9721287 awaits release by job manager
Jan 31 22:09:54 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 31 22:10:43 UTC 2025 | running | job 9721287 is running
Jan 31 22:32:19 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-9721287.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Jan 31 22:32:19 UTC 2025 | test result | 😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-9721287.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
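
The report above reads like a series of string matches against the job output file. A rough sketch of such checks, assuming a hypothetical function `check_job_output`; the patterns are copied from the report, but the bot's real implementation may differ, and the artefact check is not covered:

```shell
#!/usr/bin/env bash
# Hypothetical re-creation of the pattern checks listed in the report above.
check_job_output() {
  local out_file="$1"
  local status=0
  local pattern
  # Any of these strings in the output counts against the job.
  for pattern in 'FATAL:' 'ERROR:' 'FAILED:' 'required modules missing:'; do
    if grep -qF -- "${pattern}" "${out_file}"; then
      echo "found message matching ${pattern}"
      status=1
    fi
  done
  # These strings are expected to be present in a good run.
  for pattern in 'No missing installations' '.tar.gz created!'; do
    if ! grep -qF -- "${pattern}" "${out_file}"; then
      echo "no message matching ${pattern}"
      status=1
    fi
  done
  return "${status}"
}

# Usage on a made-up log file: first a clean log, then the same log with an error.
sample="$(mktemp)"
printf '%s\n' 'No missing installations' 'foo.tar.gz created!' > "${sample}"
check_job_output "${sample}" && good_result=pass || good_result=fail
printf '%s\n' 'ERROR: something went wrong' >> "${sample}"
check_job_output "${sample}" > /dev/null && bad_result=pass || bad_result=fail
rm -f "${sample}"
echo "clean log: ${good_result}, log with error: ${bad_result}"
```

Passing all of these checks is evidently not sufficient on its own: a job can still be reported as FAILURE when no artefacts are found.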

@casparvl
Collaborator Author

Another strange thing is that the above job ran until the walltime ran out, even though the builds were done after 20 mins or so and nothing else got added to the output file. Logging into the node, it seemed to just hang. The tee process was still there. Maybe that's also why the file _bot_job9720670.result wasn't found: that tee process was probably killed at the end of the walltime. Not sure why it would hang though.

Killing the tee task (manually) makes the job continue...

@casparvl
Collaborator Author

Hang happens again in the test step:

460885 casparl     20   0  218M  3072  3072 S   0.0  0.0  0:00.00   1 │  └─ bash /var/spool/slurm/slurmd/job9721287/slurm_script
 468521 casparl     20   0  218M  3584  3072 S   0.0  0.0  0:00.01   1 │     └─ bash bot/test.sh
 468736 casparl     20   0  217M  2048  2048 S   0.0  0.0  0:00.00   1 │        └─ tee -a test.outerr.Zccj

@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

New job on instance eessi-bot-casparvl for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/casparl/eessi-bot-casparvl/jobs/2025.01/pr_903/9721606

date | job status | comment
Jan 31 22:35:43 UTC 2025 | submitted | job id 9721606 awaits release by job manager
Jan 31 22:35:56 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 31 22:37:16 UTC 2025 | running | job 9721606 is running
Jan 31 22:45:35 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-9721606.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Jan 31 22:45:35 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-9721606.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

eessi-bot bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

Updates by the bot instance eessi-bot-casparvl (click for details)
  • received bot command build instance:eessi-bot-casparvl repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-casparvl repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-casparvl

eessi-bot-casparvl bot commented Jan 31, 2025

New job on instance eessi-bot-casparvl for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/casparl/eessi-bot-casparvl/jobs/2025.01/pr_903/9721779

date | job status | comment
Jan 31 22:45:17 UTC 2025 | submitted | job id 9721779 awaits release by job manager
Jan 31 22:45:34 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 31 22:46:41 UTC 2025 | running | job 9721779 is running
Jan 31 23:06:30 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-9721779.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Jan 31 23:06:30 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-9721779.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

casparvl commented Jan 31, 2025

This https://stackoverflow.com/questions/73158567/bash-script-is-stuck-at-tee#comment129210841_73158567 might be a pointer to my issue. I seem to remember having some trouble with containers not 'returning' properly upon exit, i.e. I'd have to press enter to get a prompt again or something. Maybe that is indeed keeping the pipe open, and it just seems like tee is hanging... Well, I will have to investigate that later by manually running the container and seeing how that goes.

Edit: one thing is that I also still see a lot of cvmfs2 processes:

1628256 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.01  13 ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628315 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628316 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   3 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628317 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   9 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628318 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628319 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00  10 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628320 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00  15 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628321 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628322 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628323 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   8 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628324 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628325 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628326 casparl     20   0  521M 32980  7680 S   0.0  0.0  0:00.00   2 │  └─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628257 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.05   7 ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628357 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00   1 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628358 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00  15 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628359 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00   1 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628360 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00   9 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628361 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:13.75   6 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628362 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00   2 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628363 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00  15 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628364 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00  15 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628365 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00   3 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628366 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.00  15 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628367 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.59   1 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628368 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.59  11 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628369 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.59   3 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628382 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.52   9 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628788 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.32   0 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628926 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.28   7 │  ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1630891 casparl     20   0  924M 42848  8704 S   0.0  0.0  0:00.03   4 │  └─ cvmfs2 software.eessi.io /dev/fd/3 -f
1628307 casparl     20   0 15060  2052  1536 S   0.0  0.0  0:00.00  12 ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1628309 casparl     20   0 21228  7680  6144 S   0.0  0.0  0:00.67  14 ├─ cvmfs2 __cachemgr__ . 10 11 10485760000 5242880000 1 3 -1 :
1628314 casparl     20   0 21228  7680  6144 S   0.0  0.0  0:00.00   7 │  └─ cvmfs2 __cachemgr__ . 10 11 10485760000 5242880000 1 3 -1 :
1628313 casparl     20   0 14816  2568  2048 S   0.0  0.0  0:00.00  13 ├─ cvmfs2 __cachemgr__ . 10 11 10485760000 5242880000 1 3 -1 :
1628356 casparl     20   0 15060  2056  1536 S   0.0  0.0  0:00.00   4 ├─ cvmfs2 software.eessi.io /dev/fd/3 -f
1635266 casparl     20   0  521M 34532  7680 S   0.0  0.0  0:00.02   5 ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1635317 casparl     20   0  521M 34532  7680 S   0.0  0.0  0:00.00  11 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f
1635318 casparl     20   0  521M 34532  7680 S   0.0  0.0  0:00.00  10 │  ├─ cvmfs2 cvmfs-config.cern.ch /dev/fd/3 -f

Maybe these somehow prevent the container from completely exiting, since they are not properly being cleaned up? They do also disappear if I kill tee, so I'm not sure if they're the cause or the effect...
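
The "something keeps the pipe open" hypothesis can be reproduced in isolation. A minimal sketch, assuming `sleep` as a stand-in for a lingering child process (for example a daemon started inside the container); this is not the bot's actual pipeline:

```shell
#!/usr/bin/env bash
# tee only sees end-of-file once every process that holds the write end of
# the pipe has exited, not just the script that was piped into it.

# Case 1: the background sleep inherits stdout (the pipe), so tee keeps
# waiting until the sleep exits, even though the script itself is done.
start=$(date +%s)
bash -c 'sleep 3 & echo "build done"' | tee /dev/null
elapsed_inherited=$(( $(date +%s) - start ))

# Case 2: the background sleep gets its own stdout/stderr, so tee sees
# end-of-file as soon as the script exits.
start=$(date +%s)
bash -c 'sleep 3 >/dev/null 2>&1 & echo "build done"' | tee /dev/null
elapsed_detached=$(( $(date +%s) - start ))

echo "inherited pipe: ${elapsed_inherited}s, detached: ${elapsed_detached}s"
```

If the leftover cvmfs2 processes inherited the pipe in a similar way, that would be consistent with tee only returning once they are gone, but that remains to be confirmed.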

@casparvl
Collaborator Author

Looking at the logs, everything went fine in #903 (comment) so I'm puzzled about the failure. I don't see a tarball however in the jobdir, and don't see a tarball creation step in the build logs... Will need to figure out why not - but that's for next week.

@casparvl
Collaborator Author

Ah, my guess is, it's because I kill the tee process, and then because of

the build job just dies completely. So.... fix the tee issue = fix the tarball creation issue.
