We run sonar by means of cron on each node in each cluster. Cron runs two scripts:

- One script samples the node every few minutes using `sonar ps` and then exfiltrates the sample data to the analysis host by means of `curl`.
- Another script runs `sonar sysinfo` daily and on reboot to probe the system configuration of the node; these data are also exfiltrated with `curl`.
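For orientation, the crontab that ends up in `/etc/cron.d/sonar` (see below) might look roughly like this; the schedule, paths, user, and `MAILTO` address here are illustrative assumptions, and the authoritative copies are the `sonar-runner-<config>.cron` files described below:

```
# Hypothetical /etc/cron.d/sonar -- intervals, paths, and address are
# assumptions; see sonar-runner-<config>.cron for the real schedule.
MAILTO=admin@example.org
*/5 * * * * sonar-runner /var/lib/sonar-runner/sonar-batchless.sh
@daily      sonar-runner /var/lib/sonar-runner/sysinfo.sh
@reboot     sonar-runner /var/lib/sonar-runner/sysinfo.sh
```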
On each cluster there is a sonar home directory, ideally on a shared disk. This directory holds the following programs and scripts:

- `sonar` is an executable that samples the system and prints sample data on stdout.
- `sonar-<config>.sh` is the script that runs `sonar ps` to sample process data; here `<config>` is `slurm` for nodes controlled by slurm and `batchless` for nodes without any batch system at all (typically login, interactive, and ad-hoc nodes).
- `sysinfo.sh` is the corresponding script that runs `sonar sysinfo` to probe the system configuration.
- `sonar-config.sh` (literally that name) is a shared configuration file that is read by the two scripts.
- `sonar-runner-<config>.cron` is a `cron` script appropriate for a node to schedule invocations of `sonar-<config>.sh` and `sysinfo.sh`; it is the master copy for the crontab that has been copied to `/etc/cron.d/sonar` on each node. If the cluster has both batch nodes and batchless nodes then there will be two crontabs here, one for each node type. The correct file must be copied to `/etc/cron.d/sonar` on each node.
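Concretely, on a cluster with both node types the sonar home directory will contain something like the following (the exact file set varies per cluster):

```
sonar                         # architecture-dependent sampling executable
sonar-slurm.sh                # runs "sonar ps" on slurm nodes
sonar-batchless.sh            # runs "sonar ps" on batchless nodes
sysinfo.sh                    # runs "sonar sysinfo"
sonar-config.sh               # shared configuration read by the scripts
sonar-runner-slurm.cron       # crontab master copy for slurm nodes
sonar-runner-batchless.cron   # crontab master copy for batchless nodes
secrets/
    upload-auth.netrc         # identity information used by curl
```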
In the sonar home directory there is also a subdirectory `secrets/` with authentication information:

`secrets/upload-auth.netrc` contains identity information used by `curl` in its communication with the server. For more information about this file, see the documentation in `../jobanalyzer-server`. The format is a `netrc` file:

```
machine naic-monitor.uio.no login LOGIN password PASSWORD
```

where `LOGIN` and `PASSWORD` must be obtained from the admin of `naic-monitor.uio.no` and will be different for different clusters.
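For reference, a file in this format is typically passed to `curl` with its `--netrc-file` option; the following sketch is purely illustrative (the endpoint URL and data file are assumptions, and the real invocation is in the scripts):

```
# Hypothetical upload -- the URL and data file are illustrative only.
curl --netrc-file secrets/upload-auth.netrc \
     --data-binary @sample.csv \
     https://naic-monitor.uio.no/sonar/upload
```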
The `secrets/` subdirectory should have mode 0700 or 0500 and the files within it should have mode 0600 or 0400.
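Starting from freshly created files, that can be done with:

```
chmod 700 secrets
chmod 600 secrets/upload-auth.netrc
```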
The contents of all the scripts and the crontab may vary a little from cluster to cluster (see below), and the binaries are naturally architecture-dependent.
Sonar runs as the user `sonar-runner` and the sonar home directory is `~sonar-runner`, owned by `sonar-runner:sonar-runner`. The scripts and crontabs are set up to use `sonar_dir=$HOME`.
The cron job is run as `sonar-runner` by copying `sonar-runner-batchless.cron` to `/etc/cron.d/sonar`.

NOTE: You may wish to change the `MAILTO` definition in the cron script first.

Since none of the nodes have slurm, the sonar script is `sonar-batchless.sh`.
Currently, `~sonar-runner` is `/var/lib/sonar-runner` on all nodes. As this directory is private to each machine, the binaries and scripts have to be updated everywhere every time there is an update. It is hard to use a shared directory, since the UID of `sonar-runner` is not the same on every machine.
Let's assume you have a checkout of the repos for Sonar and Jobanalyzer (see later for instructions) and you are in the parent directory of the two repos. We'll construct a temporary directory `upload-tmp` and populate it with the necessary directory structure and files.
```
mkdir upload-tmp
( cd sonar ; cargo build --release )
cp sonar/target/release/sonar upload-tmp
cp Jobanalyzer/production/sonar-nodes/mlx.hpc.uio.no/*.{sh,cron} upload-tmp
cd upload-tmp
mkdir secrets
echo "machine naic-monitor.uio.no login mlx.hpc.uio.no password PASSWORD" > secrets/upload-auth.netrc
chmod -R go-rwx secrets
```
Now, consider whether to edit these files:

- `*-batchless.sh` and `sonar-runner-batchless.cron`, to change the sonar root directory
- `sonar-config.sh`, to change any other settings
- `sonar-runner-batchless.cron`, to change cron's `MAILTO` variable
Now we can create a tar file that contains everything except the password (and is therefore freely copyable):

```
tar czf sonar-mlnodes.tar.gz sonar *.sh *.cron secrets
```

The tar file does not normally have to be created more than once, since all the ML nodes have the same architecture.
On each ML node, do as follows.

Put the tar file somewhere accessible to the sonar-runner user; `/tmp` might work if you don't want to open your build directory to the world:

```
$ cp sonar-mlnodes.tar.gz /tmp
```
Create the user if necessary:

```
$ sudo -i
# useradd -r -m -d /var/lib/sonar-runner -s /sbin/nologin -c "Sonar monitor account" sonar-runner
# <edit /etc/security/access.conf and add "+ : sonar-runner : cron" before any deny-all lines>
```
On systems with AMD GPUs it is also necessary to add sonar-runner to group `video` to grant the necessary permissions to run `rocm-smi`.
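For example, using standard `usermod` (as root):

```
# usermod -aG video sonar-runner
```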
Then set everything up:

```
# sudo -u sonar-runner /bin/bash
$ cd
$ tar xzf /tmp/sonar-mlnodes.tar.gz
$ <edit secrets/upload-auth.netrc to add the PASSWORD>
$ ^D
# cp /var/lib/sonar-runner/sonar-runner-batchless.cron /etc/cron.d/sonar
# ^D
$ rm /tmp/sonar-mlnodes.tar.gz
```
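At this point a quick smoke test is possible; running the bare binary only prints sample data on stdout and does not upload anything:

```
$ sudo -u sonar-runner /var/lib/sonar-runner/sonar ps | head
```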
If you want to test the setup without sending output directly to the production system, then edit `sonar-config.sh` to use the address of a test system that has data ingestion set up.
Sonar runs as the user `sonar` and the sonar work directory is `/cluster/var/sonar`, owned by `sonar:sonar`. The cron job is run as the user `sonar` by copying `sonar-runner-<config>.cron` to `/etc/cron.d/sonar`, for the appropriate `<config>` for the node.

NOTE: You may wish to change the `MAILTO` definition in the cron script first.
On Fox we use a lock file to prevent multiple instances of Sonar from running at the same time; the lock file directory is currently `/var/tmp`.
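One common way to implement such a lock in a shell script is `flock(1)`; a minimal sketch, where the lock file name is an assumption (the actual mechanism is in the Fox scripts):

```
# Run sonar only if no other instance holds the lock; otherwise exit quietly.
flock -n /var/tmp/sonar.lock "$sonar_dir/sonar" ps || exit 0
```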
On these nodes, the cron script is `sonar-runner-slurm.cron` and it will run the scripts `sonar-slurm.sh` and `sysinfo.sh`.
The following additional conditions have to be met on compute and gpu nodes (a verification sketch follows the list):

- there shall be a user `sonar` with UID 502 (the same UID on all nodes) and homedir `/var/lib/sonar`; the user can be nologin
- the directory `/var/lib/sonar` shall exist
- the owner of `/var/lib/sonar` shall be `sonar.sonar`
- in `/etc/security/access.conf` there shall be the permission `+ : sonar : cron` before the deny-all line
- the owner of `/etc/cron.d/sonar` shall be `root.root`
- cron errors show up in `/var/log/cron`
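A few shell checks will verify most of these conditions; this sketch is a convenience, not part of the production setup (adjust the paths for the interactive/login variant below):

```
id sonar                                    # expect uid=502
stat -c '%U.%G' /var/lib/sonar              # expect sonar.sonar
grep ' sonar ' /etc/security/access.conf    # expect "+ : sonar : cron" before deny-all
stat -c '%U.%G' /etc/cron.d/sonar           # expect root.root
```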
To make sure sonar is set up on new and reinstalled nodes:

- the access.conf setting shall be reflected in `master:/install/dists/fox/syncfiles-el9/etc/security/access.conf`
- the script `master:/install/postscripts/fox_sonar` shall contain commands to set up the home directory and permissions (a sketch of such commands follows)
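For illustration only, a postscript of that kind might contain commands along these lines, derived from the conditions listed above; the real `fox_sonar` on the master is authoritative:

```
# Hypothetical sketch -- the real commands are in master:/install/postscripts/fox_sonar.
id sonar >/dev/null 2>&1 || \
    useradd -r -u 502 -d /var/lib/sonar -s /sbin/nologin -c "Sonar monitor account" sonar
mkdir -p /var/lib/sonar
chown sonar.sonar /var/lib/sonar
```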
On these nodes, the cron script is `sonar-runner-batchless.cron` and it will run the scripts `sonar-batchless.sh` and `sysinfo.sh`. The reason for this difference from the compute nodes is that job numbers must be synthesized by Sonar.
The following additional conditions have to be met on interactive nodes and login nodes:

- there shall be a user `sonar` with UID 502 (the same UID on all nodes) and homedir `/home/sonar`; the user can be nologin
- the directory `/home/sonar` shall exist
- the owner of `/home/sonar` shall be `sonar.whatever`
- normally `/etc/security/access.conf` allows everything, otherwise see above
- the owner of `/etc/cron.d/sonar` shall be `root.root`
- cron errors show up in the journal
The setup procedure (building, creating a tar file, installing it) is largely as for the ML nodes except that the user is `sonar` and not `sonar-runner`.

When complete, the setup will be as for Fox (see above) except the work directory is `/cluster/shared/sonar` (instead of `/cluster/var/sonar`), and there is currently no requirement to use a lock file.
TBD, but likely as for Saga.
For testing, the production server will receive Sonar and Sysinfo data for the cluster `naic-monitor.uio.no` with arbitrary host names and at arbitrary times. See test-node/README.md for more information about how to use this.
Information about how to set up sonar on the server is in `../jobanalyzer-server/README.md`.
(Possibly stale information) To complement the information above, see the PR that added everything for Saga for an example of what a new node configuration might look like.
You need to install or load a compiler for Rust 1.74 or later to build `sonar`.

```
git clone https://github.com/NordicHPC/sonar.git
cd sonar
cargo build --release
```

The executable will appear in `target/release/sonar`. Test it by running `target/release/sonar ps`.
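If the build fails with edition or feature errors, check the toolchain version first:

```
$ rustc --version    # must report 1.74 or later
```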
To check out Jobanalyzer:

```
git clone https://github.com/NAICNO/Jobanalyzer.git
```
One does not simply copy new executables and scripts into place, as this will create all sorts of problems.
If the only thing that has changed is the `sonar` executable then it can normally be `mv`'d into place without stopping anything; any running sonar processes will have a handle to the original file while still executing.
For everything else, generally the safest bet is to spin down the cron job by removing `/etc/cron.d/sonar`. This has to be done once per node. Then, once `sonar` is not running (`pgrep sonar` returning no hits is a good indication), the new files can be put in place and the cron job restarted.
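Putting that together, a per-node upgrade might look roughly like this sketch, run as root; the staging directory and the choice of crontab are illustrative assumptions:

```
# Sketch of a safe per-node upgrade -- adapt names and paths as needed.
rm /etc/cron.d/sonar                          # stop new cron invocations
while pgrep sonar > /dev/null; do             # wait for running instances to exit
    sleep 10
done
cp /tmp/new-files/* /var/lib/sonar-runner     # put the new files in place (hypothetical staging dir)
cp /var/lib/sonar-runner/sonar-runner-batchless.cron /etc/cron.d/sonar   # restart the cron job
```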