AMD Device Metrics Exporter#
AMD Device Metrics Exporter collects metrics from AMD GPUs and NICs in HPC and AI environments and exposes them in Prometheus format. It provides detailed telemetry, including temperature, utilization, memory usage, and power consumption.
Features#
- Prometheus-compatible metrics endpoint
- Rich GPU telemetry data including:
  - Temperature monitoring
  - Utilization metrics
  - Memory usage statistics
  - Power consumption data
  - PCIe bandwidth metrics
  - Performance metrics
- Kubernetes integration via Helm chart
- Slurm integration support
- Configurable service ports
- Container-based deployment
- Beta: Kubernetes Dynamic Resource Allocation (DRA) GPU claim support (Kubernetes 1.34+)
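Because the endpoint is Prometheus-compatible, its output can be read by any client that understands the Prometheus text exposition format, not only by a Prometheus server. The following is a minimal sketch of parsing such a payload in Python. The metric names, labels, and values in `SAMPLE` are hypothetical placeholders rather than the exporter's actual output, and the parser only handles simple `name{labels} value` lines.

```python
import re

# Hypothetical payload in Prometheus text exposition format.
# Metric names, labels, and values are illustrative only.
SAMPLE = """\
# HELP example_gpu_temperature_celsius Illustrative temperature gauge
# TYPE example_gpu_temperature_celsius gauge
example_gpu_temperature_celsius{gpu_id="0"} 41.5
example_gpu_temperature_celsius{gpu_id="1"} 43
example_gpu_power_watts{gpu_id="0"} 120
"""

# metric name, optional {label set}, then the sample value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# one label="value" pair inside the braces
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not recognize
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice the text would come from an HTTP GET against the exporter's metrics endpoint, on whichever port the service is configured to use, and the decoded response body would be passed to `parse_metrics`.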
GPU Metrics#
Requirements#
- Ubuntu 22.04, 24.04
- Docker (or compatible container runtime)
Compatibility Matrix#
| ROCm Version | Driver Version | Exporter Image Version | Platform |
|---|---|---|---|
| 6.2.x | 6.8.5 | v1.0.0 | MI2xx, MI3xx |
| 6.3.x | 6.10.5 | v1.2.0 | MI2xx, MI3xx |
| 6.4.x | 6.12.12 | v1.3.1 | MI2xx, MI3xx |
| 7.0.x | 6.14.14 | v1.4.0.1 | MI2xx, MI3xx |
| 7.1.x | 6.16.6 | v1.4.2 | MI2xx, MI3xx |
| TBD | TBD | v1.5.0 (dev) | MI2xx, MI3xx |
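When a deployment script picks the image tag, the compatibility matrix above can be encoded as a small lookup. This is only a sketch: the `COMPATIBILITY` structure and the `exporter_for_rocm` helper are hypothetical names introduced here, and the in-development v1.5.0 row is left out because its ROCm and driver versions are still TBD.

```python
# Rows copied from the compatibility matrix above:
# (ROCm release series, driver version, exporter image version).
COMPATIBILITY = [
    ("6.2", "6.8.5", "v1.0.0"),
    ("6.3", "6.10.5", "v1.2.0"),
    ("6.4", "6.12.12", "v1.3.1"),
    ("7.0", "6.14.14", "v1.4.0.1"),
    ("7.1", "6.16.6", "v1.4.2"),
]


def exporter_for_rocm(rocm_version):
    """Return (driver, exporter image) for a ROCm version such as '6.4.1', or None."""
    # Compare on the major.minor series, since the matrix lists '6.4.x' and so on.
    series = ".".join(rocm_version.split(".")[:2])
    for rocm_series, driver, image in COMPATIBILITY:
        if rocm_series == series:
            return driver, image
    return None  # ROCm series not listed in the matrix


if __name__ == "__main__":
    print(exporter_for_rocm("6.4.1"))
```

Returning `None` for an unlisted series, rather than guessing the nearest release, keeps unsupported combinations visible to the caller.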
Available Metrics#
Device Metrics Exporter provides extensive GPU metrics including:
- Temperature metrics
  - Edge temperature
  - Junction temperature
  - Memory temperature
  - HBM temperature
- Performance metrics
  - GPU utilization
  - Memory utilization
  - Clock speeds
- Power metrics
  - Current power usage
  - Average power usage
  - Energy consumption
- Memory statistics
  - Total VRAM
  - Used VRAM
  - Free VRAM
- PCIe metrics
  - Bandwidth
  - Link speed
  - Error counts
See GPU Metrics List for the complete list.
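The memory statistics are often combined into a single utilization figure for dashboards or alerts. A minimal sketch, assuming that total and used VRAM are reported in the same unit and that free VRAM equals total minus used (this page does not state either); `vram_summary` is a hypothetical helper, not part of the exporter.

```python
def vram_summary(total, used):
    """Return (free, used_percent) for total and used VRAM given in the same unit."""
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    if not 0 <= used <= total:
        raise ValueError("used VRAM must be between 0 and total")
    free = total - used  # assumption: free is simply what is not in use
    return free, 100.0 * used / total


if __name__ == "__main__":
    # Illustrative numbers only, for example in GiB.
    print(vram_summary(64, 16))
```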
NIC Metrics#
Requirements#
- Ubuntu 22.04, 24.04
- Docker (or compatible container runtime)
- AMD NICs with supported drivers (AINIC)
Compatibility Matrix#
| AINIC Firmware Version | Exporter Image Version | Supported NICs |
|---|---|---|
| N/A (host nicctl) | nic-v1.0.0 | Pollara 400 |
| N/A (host nicctl) | nic-v1.0.1 | Pollara 400 |
| 1.117.5-a-56 | nic-v1.1.0 | Pollara 400 |
Available Metrics#
Device Metrics Exporter provides extensive NIC metrics including:
- Port statistics
  - Frame counts (RX/TX)
  - Octet counts (RX/TX)
  - Pause and priority frames
  - FCS and other error counts
- LIF statistics
  - Unicast/multicast/broadcast packets
  - DMA errors
  - Drop counts
- Queue Pair (QP) statistics
  - Send Queue requester metrics
  - Receive Queue responder metrics
  - QCN congestion metrics
- RDMA statistics
  - Tx/Rx unicast packets
  - CNP/ECN packets
  - Request/response errors
- Ethtool statistics
  - Packet and byte counts
  - Frame size distribution
  - Per-queue drop counts
See AINIC Metrics List for the complete list.
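Frame, octet, and packet counts of this kind are typically exposed as cumulative counters, so a throughput or error rate comes from the difference between two scrapes. The sketch below assumes monotonically increasing counters that restart from zero on reset, which is the usual Prometheus counter convention rather than something this page states. `counter_rate` and the sample values are hypothetical.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate between two counter samples (timestamps in seconds)."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so treat it as a reset and
        # count only what has accumulated since the restart.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # Illustrative: an RX octet counter sampled 15 seconds apart.
    print(counter_rate(1000, 0.0, 4000, 15.0), "octets per second")
```

A Prometheus server would normally do this with its own rate functions; the sketch only shows the arithmetic for ad hoc scripts that read the endpoint directly.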