Changes made since version 1.17.0 prior to version 1.18.0:
Features
- add downscaleAndOverwritePopulateJail
- PR: #311
- add priority class
- PR: #313
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- MSP-3516: settings of accounting to scrape jobs stats
- PR: #321
- Print actual command before executing it in bash scripts
- PR: #329
- Move gpubench to worker image and bind mount it
- PR: #333
- Move chroot plugin inside containers and bind mount it
- PR: #335
- Move enroot inside images and bind mount it
- PR: #339
- NOTASK: add debug logs
- PR: #357
- Move Pyxis from jail to images and bind-mount it
- PR: #361
- MSP-4080: add simple rebooter
- PR: #369
- MSP-4080: add CheckNodeCondition to rebooter
- PR: #372
- MSP-4080: add rebooting node check
- PR: #377
- MSP-4080: add reboot node and build image
- PR: #381
- MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
- PR: #383
- Preinstall Nvidia mock packages issues/384
- PR: #387
- Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
- PR: #390
- Preinstall dcgmi tools to the jail
- PR: #394
- MSP-4080: add render, reconcile rebooter and rbac
- PR: #391
- Remove Nvidia CUDA from worker image and apt clean
- PR: #397
- Build jail image based on own CUDA packages installation
- PR: #415
- Add Epilog and Prolog options
- PR: #411
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
Fixes
- MSP-3918: Fix bug reconciliation logic for scenarios with maintenance=true and accounting=false
- PR: #309
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
- PR: #315
- MSP-3515: fix mistake in values slurmdbdConfig and slurmConfig
- PR: #318
- [Fix] Install libpmix into nccl-benchmark image
- PR: #319
- Remove openmpi from controller
- PR: #320
- MSP-3992: fix bug with empty version of annotation
- PR: #334
- [FIX] Add patching for service annotations [MSP-3801]
- PR: #354
- fix: update AppArmor profile to allow creation of library links
- PR: #356
- NOTASK: fix bug invalid memory address or nil pointer when get role
- PR: #359
- Enable leader election for controller manager by default
- PR: #365
- Change watching ns mechanism
- PR: #366
- MSP-4080: fix bugs with stuck draining condition
- PR: #399
- Temporarily remove expose_enroot_logs flag
- PR: #417
- Fix ci for external contributors
- PR: #419
- Fix non-zero error handling in gpu_healthcheck.sh
- PR: #418
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
Dependencies
- build(deps): bump alpine from b97e2a8 to 56fa17d
- PR: #310
- bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
- PR: #312
- build(deps): bump golang from 7ea4c9d to a6927f4
- PR: #322
- build(deps): bump golang from a6927f4 to 585103a
- PR: #323
- build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
- PR: #325
- build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
- PR: #324
- build(deps): bump golang from 585103a to 9820aca
- PR: #328
- build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
- PR: #327
- build(deps): bump golang from 9820aca to 51a6466
- PR: #331
- bump golang.org/x/net to v0.33.0
- PR: #340
- build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
- PR: #341
- build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
- PR: #342
- build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
- PR: #344
- build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
- PR: #345
- build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #346
- build(deps): bump golang from 51a6466 to 8c10f21
- PR: #338
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #349
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #353
- build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
- PR: #358
- Bump kube-apiserver v0.32.1 in gpubench
- PR: #367
- Bump go version for gpubench
- PR: #368
- build(deps): bump golang from 8c10f21 to e213430
- PR: #386
- build(deps): bump golang from e213430 to 9271129
- PR: #392
- build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
- PR: #402
- build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0
- PR: #421
Other
- fix docs about GPUs are required #306
- PR: #317
- Revert "Print actual command before executing it in bash scripts"
- PR: #332
- Update pyxis version with container_image_save and expose_enroot_logs enabled
- PR: #376
Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano
Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
---|---|---|---|---|
5301 | 235 | 196 | 4604 | 1434 |