AMD Device Metrics Exporter

AMD Device Metrics Exporter#

AMD Device Metrics Exporter enables Prometheus-format metrics collection for AMD GPUs in HPC and AI environments. It provides detailed telemetry, including temperature, utilization, memory usage, and power consumption. This tool includes the following features:

Features#

  • Prometheus-compatible metrics endpoint

  • Rich GPU telemetry data including:

    • Temperature monitoring

    • Utilization metrics

    • Memory usage statistics

    • Power consumption data

    • PCIe bandwidth metrics

    • Performance metrics

  • Kubernetes integration via Helm chart

  • Slurm integration support

  • Configurable service ports

  • Container-based deployment

Requirements#

  • Ubuntu 22.04, 24.04

  • Docker (or compatible container runtime)

Rocm Version

Driver Version

Exporter Image Version

Platform

6.2.x

6.8.5

v1.0.0

MI2xx, MI3xx

6.3.x

6.10.5

v1.1.0, v1.2.0

MI2xx, MI3xx

6.4.x

6.12.12

v1.3.0

MI3xx

6.4.x

6.12.12

v1.3.0.1

MI2xx, MI3xx

Available Metrics#

Device Metrics Exporter provides extensive GPU metrics including:

  • Temperature metrics

    • Edge temperature

    • Junction temperature

    • Memory temperature

    • HBM temperature

  • Performance metrics

    • GPU utilization

    • Memory utilization

    • Clock speeds

  • Power metrics

    • Current power usage

    • Average power usage

    • Energy consumption

  • Memory statistics

    • Total VRAM

    • Used VRAM

    • Free VRAM

  • PCIe metrics

    • Bandwidth

    • Link speed

    • Error counts

For a full list of available metrics see this page.