Load nvidia-uvm at boot time and install nvidia-persistenced as system service on GPU instances

On GPU instances, this commit loads the Nvidia unified virtual memory kernel module by default and installs the Nvidia persistence daemon as a system service.
Nvidia unified virtual memory makes it easier to use memory on both the CPU and the GPU.
The Nvidia persistence daemon keeps the GPU initialized, which shortens application startup latency.
References:
Nvidia-uvm: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
Nvidia persistence daemon: https://docs.nvidia.com/deploy/driver-persistence/index.html

Signed-off-by: Hanwen <[email protected]>
hanwen-cluster committed Apr 21, 2023
1 parent 637319d commit 97db18d
Showing 5 changed files with 67 additions and 0 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -24,6 +24,8 @@ This file is used to list changes made in each version of the AWS ParallelCluste
- Add log rotation support for ParallelCluster managed logs.
- Track head node memory and root volume disk utilization using the `mem_used_percent` and `disk_used_percent` metrics collected through the CloudWatch Agent.
- Enforce the DCV Authenticator Server to use at least `TLS-1.2` protocol when creating the SSL Socket.
- Load kernel module [nvidia-uvm](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) by default.
- Install [Nvidia persistence daemon](https://docs.nvidia.com/deploy/driver-persistence/index.html) as a system service.

**CHANGES**
- Upgrade Slurm to version 23.02.1.
@@ -0,0 +1 @@
nvidia-uvm
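
The one-line file above appears to be the `nvidia/nvidia.conf` source that the recipe copies to `/etc/modules-load.d/nvidia.conf`. Files in that directory list one kernel module name per line, and blank lines and lines starting with `#` or `;` are ignored. A minimal plain-Ruby sketch of that reading rule follows; the helper `modules_to_load` and the sample text are hypothetical and not part of the cookbook.

```ruby
# Hypothetical helper, not part of the cookbook: return the module names
# that a modules-load.d style file asks to have loaded at boot.
def modules_to_load(conf_text)
  conf_text.each_line
           .map(&:strip)
           # Skip blank lines and comment lines (starting with '#' or ';').
           .reject { |line| line.empty? || line.start_with?('#', ';') }
end

# Made-up sample content for illustration only.
sample = <<~CONF
  # Load the Nvidia unified memory module at boot
  nvidia-uvm
CONF

puts modules_to_load(sample).inspect
```

A check built this way matches the exact module name, which is stricter than testing whether the file content merely contains the substring `uvm`.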
3 changes: 3 additions & 0 deletions cookbooks/aws-parallelcluster-config/recipes/base.rb
@@ -39,6 +39,9 @@
  action :configure
end

# Configure Nvidia driver
include_recipe "aws-parallelcluster-config::nvidia"

# EFA runtime configuration
efa 'Configure system for EFA' do
  action :configure
41 changes: 41 additions & 0 deletions cookbooks/aws-parallelcluster-config/recipes/nvidia.rb
@@ -0,0 +1,41 @@
# frozen_string_literal: true

#
# Cookbook:: aws-parallelcluster
# Recipe:: nvidia
#
# Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
# License. A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.

if graphic_instance? && nvidia_installed?
  # Load kernel module Nvidia-uvm
  kernel_module 'nvidia-uvm' do
    action :load
  end
  # Make sure kernel module Nvidia-uvm is loaded at instance boot time
  cookbook_file 'nvidia.conf' do
    source 'nvidia/nvidia.conf'
    path '/etc/modules-load.d/nvidia.conf'
    owner 'root'
    group 'root'
    mode '0644'
  end
  # Make sure nvidia_persistenced is installed as a system service
  bash 'nvidia.run advanced' do
    cwd '/usr/share/doc/NVIDIA_GLX-1.0/samples'
    user 'root'
    group 'root'
    code <<-NVIDIA
      tar -xf nvidia-persistenced-init.tar.bz2
      ./nvidia-persistenced-init/install.sh
    NVIDIA
  end
end
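
The recipe names the module `nvidia-uvm`, while the kernel reports loaded modules with underscores, which is why the InSpec check in this commit looks for `nvidia_uvm`. As a rough sketch of that naming rule, the plain-Ruby snippet below checks `/proc/modules`-style text for a module after normalizing hyphens to underscores. The helper names and the sample lines are hypothetical and are not part of the cookbook.

```ruby
# Hypothetical helpers, not part of the cookbook.

# The kernel treats '-' and '_' in module names as equivalent and
# reports loaded modules with underscores.
def normalize_module_name(name)
  name.tr('-', '_')
end

# Return true if the module appears in /proc/modules-style text,
# where the first whitespace-separated field of each line is the name.
def module_loaded?(proc_modules_text, name)
  wanted = normalize_module_name(name)
  proc_modules_text.each_line.any? { |line| line.split.first == wanted }
end

# Made-up sample lines for illustration; sizes and addresses are not real.
sample = <<~PROC
  nvidia_uvm 1234567 0 - Live 0x0000000000000000 (OE)
  nvidia 56789012 1 nvidia_uvm, Live 0x0000000000000000 (POE)
PROC

puts module_loaded?(sample, 'nvidia-uvm')
```

On a real instance the text would come from `File.read('/proc/modules')` instead of the sample string.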
20 changes: 20 additions & 0 deletions test/recipes/controls/aws_parallelcluster_config/nvidia_spec.rb
@@ -39,6 +39,26 @@
  end
end

control 'tag:config_nvidia_uvm_and_persistenced_on_graphic_instances' do
  only_if do
    !(os_properties.centos7? && os_properties.arm?) &&
      !instance.custom_ami? && instance.graphic?
  end

  describe kernel_module('nvidia_uvm') do
    it { should be_loaded }
  end

  describe file('/etc/modules-load.d/nvidia.conf') do
    its('content') { should include("uvm") }
  end

  describe service('nvidia-persistenced') do
    it { should be_enabled }
    it { should be_running }
  end
end

control 'tag:config_gdrcopy_disabled_on_non_graphic_instances' do
  only_if do
    !(os_properties.centos7? && os_properties.arm?) &&