Prometheus and Grafana integration#
The provided Grafana dashboards visualize GPU metrics that Prometheus collects from the AMD Device Metrics Exporter. The dashboard files are located in the grafana directory:
dashboard_overview.json: High-level GPU cluster overview.
dashboard_gpu.json: Detailed per-GPU metrics.
dashboard_job.json: GPU usage by job (Slurm and Kubernetes).
dashboard_node.json: Host-level GPU usage.
To ingest metrics into Prometheus, you can use one of the following methods:
Method 1: Direct Prometheus Configuration#
Run Prometheus (for Testing)#
docker run -p 9090:9090 -v ./example/prometheus.yml:/etc/prometheus/prometheus.yml -v prometheus-data:/prometheus prom/prometheus
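Once the container is running, you can confirm that Prometheus started correctly by querying its `/-/healthy` and `/-/ready` management endpoints. The following is a minimal sketch, assuming Prometheus is reachable on localhost:9090 as published by the command above. The names `PROM_URL`, `prom_endpoint`, and `check_prometheus` are illustrative and not part of the exporter or Prometheus.

```shell
# Base URL of the Prometheus instance. localhost:9090 matches the port
# published by the docker run command above (override PROM_URL if needed).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the URL of a Prometheus management endpoint ("healthy" or "ready").
# Hypothetical helper used only for this example.
prom_endpoint() {
  printf '%s/-/%s\n' "$PROM_URL" "$1"
}

# Succeed only if both endpoints answer without an HTTP error.
check_prometheus() {
  curl -fsS "$(prom_endpoint healthy)" && curl -fsS "$(prom_endpoint ready)"
}
```

Run `check_prometheus` after starting the container. A non-zero exit status suggests that Prometheus is not up yet or that the port mapping differs from the assumed one.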
Installing Grafana (for Testing)#
Follow the official Grafana Debian Installation guide.
Start Grafana Server:
sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server
Configure Prometheus#
Add the AMD Device Metrics Exporter endpoint to your Prometheus configuration:
scrape_configs:
  - job_name: 'gpu_metrics'
    static_configs:
      - targets: ['exporter_external_ip:5000']
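After Prometheus reloads this configuration, one way to check that the exporter is being scraped is to evaluate the built-in `up` metric for the job through the Prometheus HTTP query API. This is a sketch under the assumption that Prometheus listens on localhost:9090. `PROM_URL`, `up_query`, and `check_scrape_job` are hypothetical names introduced for illustration.

```shell
# Base URL of the Prometheus instance (assumed, override as needed).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the PromQL expression that reports scrape health for a job
# (1 = last scrape succeeded, 0 = last scrape failed).
up_query() {
  printf 'up{job="%s"}\n' "$1"
}

# Evaluate the expression through the /api/v1/query endpoint.
# -G sends the URL-encoded query as a GET parameter.
check_scrape_job() {
  curl -fsS -G "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(up_query "$1")"
}
```

For the configuration above, call `check_scrape_job gpu_metrics`. The JSON response should contain one result per target, with a value of 1 when the exporter is reachable.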
Method 2: Using Prometheus Operator in Kubernetes#
If you’re using Kubernetes, you can install Prometheus and Grafana using the Prometheus Operator:
Add the Prometheus Community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install the kube-prometheus-stack (includes Prometheus, Alertmanager, and Grafana):
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=true
Deploy Device Metrics Exporter with ServiceMonitor enabled:
helm install metrics-exporter \
  https://github.com/ROCm/device-metrics-exporter/releases/download/v1.2.1/device-metrics-exporter-charts-v1.2.1.tgz \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s \
  -n mynamespace --create-namespace
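To confirm that the install created a ServiceMonitor and that the monitoring stack is running, you can list the relevant objects with kubectl. The sketch below assumes the namespaces used in the commands above (`mynamespace` and `monitoring`). The `KUBECTL` variable and both function names are hypothetical, added so the commands can be previewed without a cluster.

```shell
# Command used to talk to the cluster. Set KUBECTL=echo for a dry run
# that only prints the arguments instead of contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# List ServiceMonitor objects in the exporter namespace
# (defaults to the namespace used in the helm install above).
list_servicemonitors() {
  $KUBECTL get servicemonitors --namespace "${1:-mynamespace}"
}

# List the pods of the monitoring stack (Prometheus, Alertmanager, Grafana).
list_monitoring_pods() {
  $KUBECTL get pods --namespace "${1:-monitoring}"
}
```

Calling `list_servicemonitors` should show an entry for the exporter if the chart created one. An empty list suggests that `serviceMonitor.enabled` was not applied.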
For detailed ServiceMonitor configuration options and troubleshooting, please refer to the Prometheus ServiceMonitor Integration documentation.
Pre-built Grafana dashboards are available in the grafana/ directory of the repository. Import them through the Grafana interface to visualize your GPU metrics.
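As an alternative to importing through the interface, dashboards can be uploaded with Grafana's HTTP API (`POST /api/dashboards/db`). The following is a rough sketch, not the project's documented procedure. It assumes Grafana listens on localhost:3000 and that `GRAFANA_TOKEN` holds a service account token with permission to create dashboards. `GRAFANA_URL`, `dashboard_payload`, and `import_dashboard` are hypothetical names.

```shell
# Base URL of the Grafana server (3000 is Grafana's default port).
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Wrap a dashboard JSON file in the request body expected by
# POST /api/dashboards/db. The file is assumed to contain valid JSON.
dashboard_payload() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$1")"
}

# Upload one dashboard file. GRAFANA_TOKEN is an assumed environment
# variable holding a Grafana service account token.
import_dashboard() {
  dashboard_payload "$1" | curl -fsS -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
}
```

A loop such as `for f in grafana/dashboard_*.json; do import_dashboard "$f"; done` would upload every dashboard file. Dashboards that carry data source placeholders or a fixed numeric `id` may be rejected by this endpoint, in which case importing through the interface is the safer route.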