Metrics Exporter#

Features#

  • Prometheus-compatible metrics endpoint

  • Rich GPU telemetry data including:

    • Temperature monitoring

    • Utilization metrics

    • Memory usage statistics

    • Power consumption data

    • PCIe bandwidth metrics

    • Performance metrics

  • Kubernetes integration via Helm chart

  • Slurm integration support

  • Configurable service ports

  • Container-based deployment

Requirements#

  • Ubuntu 22.04, 24.04

  • Docker (or compatible container runtime)

Rocm Version

Driver Version

Exporter Image Version

Platform

6.2.x

6.8.5

v1.0.0

MI2xx, MI3xx

6.3.x

6.10.5

v1.1.0, v1.2.0

MI2xx, MI3xx

6.4.x

6.12.12

v1.3.0

MI3xx

6.4.x

6.12.12

v1.3.0.1

MI2xx, MI3xx

Configure metrics exporter#

To start the Device Metrics Exporter along with the GPU Operator configure the spec/metricsExporter/enable field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter

# Specify the metrics exporter config
metricsExporter:
    # To enable/disable the metrics exporter, disabled by default
    enable: True

    # kubernetes service type for metrics exporter, clusterIP(default) or NodePort
    serviceType: "NodePort"

    # Node port for metrics exporter service, metrics endpoint $node-ip:$nodePort
    nodePort: 32500

    # image for the metrics-exporter container
    image: "rocm/device-metrics-exporter:v1.2.0"
 

The metrics-exporter pods start after updating the DeviceConfig CR

#kubectl get pods -n kube-amd-gpu -l "app.kubernetes.io/name=metrics-exporter"
NAME                                       READY   STATUS    RESTARTS   AGE
gpu-operator-metrics-exporter-q8hbb   1/1     Running   0          74s

Note

Note: The Device Metrics Exporter name will be prefixed with the name of your DeviceConfig custom resource (“gpu-operator” in the default helm installation)

Metrics Exporter DeviceConfig#

Field Name

Details

Enable

Enable/Disable metrics exporter

Port

Service port exposed by metrics exporter

serviceType

service type for metrics, clusterIP/NodePort

nodePort

Node port for metrics exporter service

selector

Node selector for metrics exporter daemonset

image

metrics exporter image

config

metrics configurations (fields/labels)

name

configmap name for custom fields/labels

Customize metrics fields/labels#

To customize metrics fields/labels, create a configmap with fields/labels and use it in DeviceConfig CR

kubectl create configmap <name> --from-file=examples/metricsExporter/config.json

Example config file is available here: config.json