Prometheus and Grafana Integration

The provided Grafana dashboards visualize GPU metrics that Prometheus collects from the AMD Device Metrics Exporter. Pre-built dashboards are available in the grafana directory of the repository.

Import these dashboards through the Grafana interface for immediate visualization of your GPU metrics.

Grafana Dashboard Setup

  • Variables can be configured at any time in each dashboard’s Settings > Variables section.

    g_metrics_prefix: string prepended to the metric names used in dashboard queries (e.g. gpu_gfx_activity -> amd_gpu_gfx_activity)

  • The prefix can also be set from the dropdown menu in the top-left corner of each dashboard.
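
To illustrate what g_metrics_prefix does, here is a minimal Python sketch of the renaming described above. The helper name apply_metrics_prefix is hypothetical and is not part of the dashboards or the exporter. The sketch also assumes the variable value carries its own trailing underscore ("amd_"), as the gpu_gfx_activity -> amd_gpu_gfx_activity example suggests.

```python
def apply_metrics_prefix(metric_name: str, prefix: str = "") -> str:
    """Return metric_name with prefix prepended (hypothetical helper)."""
    # An empty prefix leaves the metric name unchanged.
    return f"{prefix}{metric_name}"


# Assumed prefix value "amd_", taken from the example in the text above.
print(apply_metrics_prefix("gpu_gfx_activity", "amd_"))  # amd_gpu_gfx_activity
```

If the exporter is configured to emit prefixed metric names, the dashboard variable presumably has to be set to the same prefix so that the queries still match.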

Methods to Ingest Metrics into Prometheus

Method 1: Direct Prometheus Configuration

Run Prometheus (for Testing)

docker run -p 9090:9090 -v ./example/prometheus.yml:/etc/prometheus/prometheus.yml -v prometheus-data:/prometheus prom/prometheus

Installing Grafana (for Testing)

Follow the official Grafana Debian Installation guide.

Start Grafana Server:

sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configure Prometheus

Add the AMD Device Metrics Exporter endpoint to your Prometheus configuration:

scrape_configs:
  - job_name: 'gpu_metrics'
    static_configs:
      # Replace exporter_external_ip with the address of the host running the exporter.
      - targets: ['exporter_external_ip:5000']
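
Once Prometheus has loaded this configuration, one way to confirm that the gpu_metrics job is being scraped is to inspect the response of the Prometheus HTTP API endpoint /api/v1/targets. The following Python sketch filters such a response by job name. The helper job_health and the sample payload are illustrative assumptions: the payload is a trimmed, hand-written example rather than real output, and the /metrics path in it is Prometheus's default metrics_path, not something stated above.

```python
import json
from typing import Dict


def job_health(targets_json: str, job_name: str) -> Dict[str, str]:
    """Map each scrape URL of job_name to its reported health."""
    payload = json.loads(targets_json)
    health = {}
    for target in payload["data"]["activeTargets"]:
        # Keep only targets that belong to the requested scrape job.
        if target["labels"].get("job") == job_name:
            health[target["scrapeUrl"]] = target["health"]
    return health


# Trimmed, hand-written sample of a /api/v1/targets response (not real output).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {
                    "job": "gpu_metrics",
                    "instance": "exporter_external_ip:5000",
                },
                "scrapeUrl": "http://exporter_external_ip:5000/metrics",
                "health": "up",
            }
        ]
    },
})

print(job_health(SAMPLE_RESPONSE, "gpu_metrics"))
# {'http://exporter_external_ip:5000/metrics': 'up'}
```

Against a live instance, the JSON text could be fetched with urllib.request.urlopen("http://localhost:9090/api/v1/targets"), assuming Prometheus is reachable on port 9090 as in the docker command above. A health value other than "up" usually means Prometheus cannot reach the exporter address.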

Method 2: Using Prometheus Operator in Kubernetes

If you’re using Kubernetes, you can install Prometheus and Grafana using the Prometheus Operator:

  1. Add the Prometheus Community Helm repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

  2. Install the kube-prometheus-stack (includes Prometheus, Alertmanager, and Grafana):

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=true

  3. Deploy Device Metrics Exporter with ServiceMonitor enabled:

helm install metrics-exporter \
  https://github.com/ROCm/device-metrics-exporter/releases/download/v1.4.0/device-metrics-exporter-charts-v1.4.0.tgz \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s \
  -n mynamespace --create-namespace
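
To make the two serviceMonitor flags easier to read, the sketch below expands simple dotted --set keys into the nested values structure they correspond to. It is an illustration only: set_flags_to_values is a hypothetical helper, it handles only plain key.path=value pairs with boolean conversion, and Helm's real parser also supports lists, escaping, and other types.

```python
from typing import Any, Dict, List


def set_flags_to_values(flags: List[str]) -> Dict[str, Any]:
    """Expand simple key.path=value strings into a nested dict."""
    values: Dict[str, Any] = {}
    for flag in flags:
        path, raw = flag.split("=", 1)
        keys = path.split(".")
        node = values
        # Walk or create one nested dict per key segment except the last.
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        # Only booleans are converted here; everything else stays a string.
        node[keys[-1]] = {"true": True, "false": False}.get(raw, raw)
    return values


print(set_flags_to_values([
    "serviceMonitor.enabled=true",
    "serviceMonitor.interval=15s",
]))
# {'serviceMonitor': {'enabled': True, 'interval': '15s'}}
```

The same nesting is what a values file passed with -f would contain under a top-level serviceMonitor key.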

For detailed ServiceMonitor configuration options and troubleshooting, please refer to the Prometheus ServiceMonitor Integration documentation.