Prometheus ServiceMonitor Integration

Prometheus ServiceMonitor Integration#

The Device Metrics Exporter Helm chart supports integration with Prometheus Operator through ServiceMonitor Custom Resource Definition (CRD). This enables automated service discovery and scraping of metrics by Prometheus instances managed by the Prometheus Operator.

Prerequisites#

Before using the ServiceMonitor feature, ensure that:

You have a Kubernetes cluster with Prometheus Operator installed
The ServiceMonitor CRD is available in your cluster (servicemonitors.monitoring.coreos.com)

Enabling ServiceMonitor#

To deploy the Device Metrics Exporter with a ServiceMonitor resource, use the following Helm command:

helm install metrics-exporter \
  https://github.com/ROCm/device-metrics-exporter/releases/download/v1.3.0/device-metrics-exporter-charts-v1.3.0.tgz \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s \
  -n mynamespace --create-namespace

This will automatically create a ServiceMonitor resource that Prometheus Operator can discover and use to scrape metrics from the Device Metrics Exporter. The ServiceMonitor will be deployed in the same namespace as the metrics service and daemonset. Additional configuration in Prometheus is necessary to select the metrics namespace and the ServiceMonitor. Aternatively, define a values.yaml with the desired options and use it in helm install.

Configuration Options#

The following options can be customized for the ServiceMonitor via Helm values values.yaml:

serviceMonitor:
  # -- Whether to create a ServiceMonitor resource for Prometheus Operator
  enabled: false
  # -- Scrape interval for the ServiceMonitor
  interval: "30s"
  # -- Honor labels configuration for ServiceMonitor
  honorLabels: true
  # -- Honor timestamps configuration for ServiceMonitor
  honorTimestamps: true
  # -- Additional labels for the ServiceMonitor object (to match the Prometheus Operator instance selectors)
  labels: {}
  # -- RelabelConfigs to apply to targets before scraping
  relabelings: []

Verifying ServiceMonitor Deployment#

After installation, you can verify that the ServiceMonitor was created correctly:

kubectl get servicemonitor -n mynamespace

Here, mynamespace refers to the metrics namespace where the service and daemonset are deployed. You should see a ServiceMonitor resource with the name pattern <release-name>-amd-metrics-exporter.

Troubleshooting#

If Prometheus is not scraping metrics from the Device Metrics Exporter:

Verify the ServiceMonitor exists and is correctly configured:

kubectl describe servicemonitor <release-name>-amd-metrics-exporter -n mynamespace

Check if the labels on the ServiceMonitor match the Prometheus Operator’s serviceMonitorSelector:
```
kubectl get prometheus -n monitoring -o yaml
```

Ensure that the service endpoints are available and ready:

kubectl get endpoints -l app=<release-name>-amdgpu-metrics-exporter -n mynamespace