hpcmd
Overview
hpcmd is a software daemon that periodically runs Linux perf and comparable tools to obtain metrics from performance counters. Intel Broadwell, Skylake, and newer processors are fully supported, so that hpcmd can, for example, compute the floating-point performance in GFLOPS or the memory bandwidth in GB/s. In addition, performance metrics from GPUs, OmniPath and InfiniBand networks, and GPFS file systems are supported. hpcmd computes derived metrics and writes the data as syslog lines. On a cluster installation, these local syslog lines can be collected via rsyslog and then stored and analyzed in Splunk. hpcmd integrates fully with the SLURM batch system, which makes it possible to correlate performance metrics with individual jobs.
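To illustrate the idea of derived metrics, the following is a minimal sketch of how raw counter deltas from one sampling interval could be turned into rates and rendered as a single log line. The function names, the key=value layout, and the "hpcmd_sketch" tag are assumptions made for this example; they are not hpcmd's actual code or output format.

```python
def derived_metrics(flop_count, bytes_moved, interval_s):
    """Convert raw counter deltas from one sampling interval into rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        "gflops": flop_count / interval_s / 1e9,        # floating-point ops per second, in 1e9
        "mem_bw_gbs": bytes_moved / interval_s / 1e9,   # bytes per second, in 1e9
    }


def format_line(job_id, metrics):
    """Render the metrics as one key=value line (hypothetical layout)."""
    fields = ["jobid={}".format(job_id)]
    fields += ["{}={:.3f}".format(key, value) for key, value in sorted(metrics.items())]
    return "hpcmd_sketch " + " ".join(fields)


# Example: 2.4e12 floating-point operations and 6.0e11 bytes in a 600 s interval.
metrics = derived_metrics(flop_count=2.4e12, bytes_moved=6.0e11, interval_s=600)
line = format_line(12345, metrics)
print(line)
```

Tagging every line with the job ID is what would allow a downstream system to group samples by job, under the assumption that the job ID is known at sampling time.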
Top Features
- no measurable performance impact on applications
- Linux daemon and systemd service
- user mode for custom measurements
- user-triggered suspension of the systemd service
- SLURM integration, SLURM job detection
- flexible hierarchical config files in YAML format
- modular software design for easy extensibility
- extensively tested through large-scale monitoring of more than 160,000 CPUs
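As an illustration of what hierarchical configuration can look like, here is a minimal sketch in which a more specific configuration layer overrides a set of defaults, key by key. This is an assumed interpretation of "hierarchical"; the keys shown (sampling, interval_s, tools, output, target) are hypothetical and are not taken from hpcmd's config files. In practice the dictionaries would typically come from yaml.safe_load() of the pyyaml package.

```python
import copy


def merge_config(base, override):
    """Return a new dict in which values from override replace those in base.

    Nested dicts are merged recursively; all other values are replaced whole.
    """
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_config(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical defaults and a hypothetical site-specific layer.
defaults = {
    "sampling": {"interval_s": 600, "tools": ["perf", "numastat"]},
    "output": {"target": "syslog"},
}
site = {"sampling": {"interval_s": 300}}

effective = merge_config(defaults, site)
print(effective)
```

Only the overridden key changes; everything else is inherited from the defaults, and the input dictionaries are left unmodified.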
Requirements
The hpcmd package requires a Python 2.7 or Python 3 environment. The setuptools package is required to run setup.py correctly. In addition, the Python packages python-daemon and pyyaml are necessary at runtime.
To actually measure performance data on Linux, the following binaries are called:
- perf
- top
- ps
- numastat
- ibstat (InfiniBand)
- opastat (OmniPath)
- ipstat (Ethernet)
- nvidia-smi (GPU)
- mmpmon (GPFS)
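Before deploying, it can be useful to check which of these binaries are available on a node. The following is a small sketch, not part of hpcmd, that looks each tool up on the PATH. It assumes Python 3 (shutil.which is not available in Python 2.7), and it will report tools as missing if they are installed outside the PATH.

```python
import shutil

# Tool names as listed above, with the subsystem each one covers.
TOOLS = {
    "perf": "performance counters",
    "top": "processes",
    "ps": "processes",
    "numastat": "NUMA memory",
    "ibstat": "InfiniBand",
    "opastat": "OmniPath",
    "ipstat": "Ethernet",
    "nvidia-smi": "GPU",
    "mmpmon": "GPFS",
}


def missing_tools(names, which=shutil.which):
    """Return the sorted list of tool names that cannot be found on the PATH."""
    return sorted(name for name in names if which(name) is None)


if __name__ == "__main__":
    for name in missing_tools(TOOLS):
        print("not found: {} ({})".format(name, TOOLS[name]))
```

The network, GPU, and file system tools are only relevant on nodes that have the corresponding hardware or file system, so a missing entry there is not necessarily an error.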
Installation
The hpcmd package can be installed in the standard way using setup.py:
python setup.py install [--user]
Documentation
Online Documentation
http://mpcdf.pages.mpcdf.de/hpcmd/
Publication
MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis. Stanisic L., Reuter K. (2020) In: Schwardmann U. et al. (eds) Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. doi.org/10.1007/978-3-030-48340-1_47. Preprint: arXiv:1909.11704 (2019).