AMD Device Metrics Exporter#
AMD Device Metrics Exporter enables Prometheus-format metrics collection for AMD GPUs in HPC and AI environments. It provides detailed telemetry, including temperature, utilization, memory usage, and power consumption. This tool includes the following features:
Prometheus-compatible metrics endpoint
Rich GPU telemetry data
Kubernetes integration
Slurm integration support
Configurable service ports
Container-based deployment
Available Metrics#
Device Metrics Exporter provides extensive GPU metrics including:
Temperature metrics
Edge temperature
Junction temperature
Memory temperature
HBM temperature
Performance metrics
GPU utilization
Memory utilization
Clock speeds
Power metrics
Current power usage
Average power usage
Energy consumption
Memory statistics
Total VRAM
Used VRAM
Free VRAM
PCIe metrics
Bandwidth
Link speed
Error counts