LDMS connector

Vivek Kale edited this page Jun 9, 2023 · 5 revisions

Summary

The Lightweight Distributed Metric Service (LDMS) is a health monitoring system that tracks the performance of a set of independent applications running on a particular supercomputer or supercomputing platform. LDMS is developed at Sandia National Laboratories and is used across U.S. DOE laboratories. The LDMS Kokkos Tools connector collects data on Kokkos kernels and converts it into a format that LDMS understands.

Because the volume of data gathered is very large, often tens of TB per day, the LDMS Kokkos Tools connector should be used with the sampler utility of Kokkos Tools, which extracts samples of profiling data from a Kokkos application.

Key Features

  • Data collected by LDMS is already on the node and can be queried with little to no overhead.

Getting Started

Configuring the Sampler

  • The following environment variables can be adjusted by users in their run scripts:
      export KOKKOS_TOOLS_SAMPLER_RATE=101
      export KOKKOS_LDMS_VERBOSE=0 
    
  • The environment variable KOKKOS_SAMPLER_RATE sets the rate at which kernel function calls are sampled. The default is 1%.
  • When set to a non-zero integer, the environment variable KOKKOS_LDMS_VERBOSE makes the tool write every Kokkos message sent to LDMS to an output file.
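
As a minimal sketch of how these settings might sit at the top of a run script, the snippet below uses the variable names and values from the export lines above, keeps any value the caller has already set, and checks each value before the application starts. The helper is_uint and the application name my_kokkos_app are hypothetical, and the check assumes both variables take non-negative integers, which this page does not state explicitly.

```shell
#!/bin/sh
# Sketch of a run-script preamble (not part of the LDMS connector itself).
# Keep values already set by the caller; otherwise use the example values.
: "${KOKKOS_TOOLS_SAMPLER_RATE:=101}"
: "${KOKKOS_LDMS_VERBOSE:=0}"

# is_uint: hypothetical helper, succeeds only for non-negative integers.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Assumption: both settings are expected to be non-negative integers.
for name in KOKKOS_TOOLS_SAMPLER_RATE KOKKOS_LDMS_VERBOSE; do
  eval "value=\${$name}"
  if ! is_uint "$value"; then
    echo "error: $name must be a non-negative integer, got '$value'" >&2
    exit 1
  fi
done

export KOKKOS_TOOLS_SAMPLER_RATE KOKKOS_LDMS_VERBOSE
echo "sampler rate: $KOKKOS_TOOLS_SAMPLER_RATE, LDMS verbose: $KOKKOS_LDMS_VERBOSE"

# Launch the Kokkos application after this point, for example:
# ./my_kokkos_app   # hypothetical application name
```

Validating the values up front means a typo in the run script fails quickly on the login or compute node instead of producing a long job with no usable samples.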

Storage of Sampled Data

All data collected by LDMS are stored in the built-in storage system, the Distributed Scalable Object Store (DSOS), which is set up as part of the LDMS setup tutorials.

Visualization of Sampled Data

Data can be visualized using Grafana. Instructions for setting up and using Grafana are available at https://ovis-hpcreadthedocs.readthedocs.io/en/latest/grafanapanel.html
