
1.18.0

@github-actions released this 13 Feb 20:34
· 24 commits to dev since this release
b580718

Changes made since version 1.17.0 prior to version 1.18.0:

🚀 Features

  • add downscaleAndOverwritePopulateJail
  • add priority class
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • MSP-3516: settings of accounting to scrape jobs stats
  • Print actual command before executing it in bash scripts
  • Move gpubench to worker image and bind mount it
  • Move chroot plugin inside containers and bind mount it
  • Move enroot inside images and bind mount it
  • NOTASK: add debug logs
  • Move Pyxis from jail to images and bind-mount it
  • MSP-4080: add simple rebooter
  • MSP-4080: add CheckNodeCondition to rebooter
  • MSP-4080: add rebooting node check
  • MSP-4080: add reboot node and build image
  • MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
  • Preinstall Nvidia mock packages issues/384
  • Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
  • Preinstall dcgmi tools to the jail
  • MSP-4080: add render, reconcile rebooter and rbac
  • Remove Nvidia CUDA from worker image and apt clean
  • Build jail image based on own CUDA packages installation
  • Add Epilog and Prolog options
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

πŸ› Fixes

  • MSP-3918: Fix bug in reconciliation logic for scenarios with maintenance=true and accounting=false
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
  • MSP-3515: fix mistake in values slurmdbdConfig and slurmConfig
  • [Fix] Install libpmix into nccl-benchmark image
  • Remove openmpi from controller
  • MSP-3992: fix bug with empty version of annotation
  • [FIX] Add patching for service annotations [MSP-3801]
  • fix: update AppArmor profile to allow creation of library links
  • NOTASK: fix invalid memory address or nil pointer dereference when getting role
  • Enable leader election for controller manager by default
  • Change watching ns mechanism
  • MSP-4080: fix bugs with stuck draining condition
  • Temporarily remove expose_enroot_logs flag
  • Fix CI for external contributors
  • Fix non-zero error handling in gpu_healthcheck.sh
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

📦 Dependencies

  • build(deps): bump alpine from b97e2a8 to 56fa17d
  • bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
  • build(deps): bump golang from 7ea4c9d to a6927f4
  • build(deps): bump golang from a6927f4 to 585103a
  • build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
  • build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
  • build(deps): bump golang from 585103a to 9820aca
  • build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
  • build(deps): bump golang from 9820aca to 51a6466
  • bump golang.org/x/net to v0.33.0
  • build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
  • build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
  • build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
  • build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump golang from 51a6466 to 8c10f21
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
  • Bump kube-apiserver v0.32.1 in gpubench
  • Bump go version for gpubench
  • build(deps): bump golang from 8c10f21 to e213430
  • build(deps): bump golang from e213430 to 9271129
  • build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
  • build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0

Other

  • fix docs about GPUs being required #306
  • Revert "Print actual command before executing it in bash scripts"
  • Update pyxis version with container_image_save and expose_enroot_logs enabled

Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano

πŸ“ Categorized PRs πŸ“‚ Uncategorized PRs πŸ“₯ Commits βž• Lines added βž– Lines deleted
5301 235 196 4604 1434