hpcmd

Schematic of site-wide monitoring using hpcmd and Splunk

Overview

hpcmd is a software daemon that periodically runs Linux perf and comparable tools to obtain metrics from performance counters. Intel Broadwell, Skylake, and newer processors are fully supported, e.g., to compute the floating-point performance in GFLOPS or the memory bandwidth in GB/s. Performance metrics from GPUs, OmniPath and InfiniBand networks, and GPFS file systems are supported as well. hpcmd computes derived metrics and writes the data as syslog lines. On a cluster installation, these local syslog lines can be collected via rsyslog and then stored and analyzed in Splunk. hpcmd fully integrates with the SLURM batch system, which makes it possible to correlate performance metrics with individual jobs.
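
To illustrate the idea of a derived metric written as a syslog line, here is a minimal sketch in Python. It is not taken from the hpcmd source: the function names, the key=value line layout, and the "hpcmd-sketch" syslog identifier are assumptions made for this example only.

```python
import syslog


def gflops(flop_count_start, flop_count_end, interval_seconds):
    """Derive GFLOPS from two cumulative floating-point operation counts
    sampled interval_seconds apart (hypothetical helper)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (flop_count_end - flop_count_start) / interval_seconds / 1e9


def format_metric_line(jobid, metrics):
    """Render metrics as space-separated key=value pairs.
    This layout is an assumption, not hpcmd's actual line format."""
    fields = ["jobid={}".format(jobid)]
    fields += ["{}={:.3f}".format(key, value)
               for key, value in sorted(metrics.items())]
    return " ".join(fields)


def emit(line):
    """Send one line to the local syslog, from where rsyslog could forward it."""
    syslog.openlog("hpcmd-sketch")  # hypothetical identifier
    syslog.syslog(syslog.LOG_INFO, line)
```

A sampling loop would call gflops() on two consecutive counter readings, pass the result to format_metric_line() together with the SLURM job ID, and hand the line to emit().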

Top Features

  • no measurable performance impact on applications
  • Linux daemon and systemd service
  • user mode for custom measurements
  • user-triggered suspension of the systemd service
  • SLURM integration, SLURM job detection
  • flexible hierarchical config files in YAML format
  • modular software design for easy extensibility
  • extensively tested through large-scale monitoring of more than 160,000 CPUs

Requirements

The hpcmd package requires a Python 2.7 or Python 3 environment. The setuptools package is required to run setup.py correctly. In addition, the Python packages python-daemon and pyyaml are necessary at runtime.

To actually measure performance data on Linux, the following binaries are called:

  • perf
  • top
  • ps
  • numastat
  • ibstat (InfiniBand)
  • opastat (OmniPath)
  • ipstat (Ethernet)
  • nvidia-smi (GPU)
  • mmpmon (GPFS)
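
Before enabling a measurement, it can be useful to check which of these binaries are available on a node. The following sketch is not part of hpcmd; the function name find_missing and the TOOLS list are assumptions for illustration, and shutil.which requires Python 3.3 or newer.

```python
import shutil

# Binary names as listed above.
TOOLS = ["perf", "top", "ps", "numastat", "ibstat",
         "opastat", "ipstat", "nvidia-smi", "mmpmon"]


def find_missing(tools, which=shutil.which):
    """Return the sorted names of tools that are not found on PATH.
    The lookup function is injectable so the check can be tested."""
    return sorted(name for name in tools if which(name) is None)


if __name__ == "__main__":
    missing = find_missing(TOOLS)
    if missing:
        print("not found on PATH: " + ", ".join(missing))
    else:
        print("all measurement tools found")
```

Tools for hardware that a node does not have (for example opastat on a system without OmniPath) are expected to be reported as missing.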

Installation

The hpcmd package can be installed in the standard way using setup.py:

python setup.py install [--user]

Documentation

Online Documentation

http://mpcdf.pages.mpcdf.de/hpcmd/

Publication

Stanisic L., Reuter K. (2020) MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis. In: Schwardmann U. et al. (eds) Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. doi.org/10.1007/978-3-030-48340-1_47, arXiv:1909.11704 (2019)