Kubernetes (Helm) installation#
This page explains how to install AMD Device Metrics Exporter using Kubernetes.
System requirements#
ROCm 6.3.x
Ubuntu 22.04 or later
Kubernetes cluster v1.29.0 or later
Helm v3.2.0 or later
kubectlcommand-line tool configured with access to the cluster
Note: GPU and NIC monitoring cannot be enabled simultaneously in a single Helm deployment. Choose the appropriate configuration based on your monitoring needs:
GPU Monitoring: For monitoring AMD GPUs (requires ROCm 6.2.0+)
NIC Monitoring: For monitoring AMD NICs (requires compatible AMD NIC hardware)
Installation#
For Kubernetes environments, a Helm chart is provided for easy deployment. The chart comes with default values pre-configured - you only need to customize them if your environment requires specific settings.
Install Helm (if not installed)#
# Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
Deploy the exporter using Helm#
# Install Helm Charts for GPU monitoring
helm repo add exporter https://rocm.github.io/device-metrics-exporter
helm repo update
helm install exporter exporter/device-metrics-exporter-charts \
--version v1.4.1 \
--namespace kube-amd-gpu \
--create-namespace
Verifying the GPU installation
After installation, verify that the exporter pods are running:
kubectl get pods -l app.kubernetes.io/name=metrics-exporter -n kube-amd-gpu # For GPU monitoring
Customizing the GPU installation
You can customize the installation using one of two methods:
Option 1: Using –set flags
Override individual values on the command line using --set key=value:
Example: Enable Prometheus monitoring and adjust resource limits
helm install exporter exporter/device-metrics-exporter-charts \
--version v1.4.1 \
--namespace kube-amd-gpu \
--create-namespace \
--set serviceMonitor.enabled=true \
--set resources.limits.cpu=4 \
--set resources.limits.memory=8Gi
Example: Add custom tolerations for tainted nodes
helm install exporter exporter/device-metrics-exporter-charts \
--version v1.4.1 \
--namespace kube-amd-gpu \
--create-namespace \
--set tolerations[0].key=example.com/foo \
--set tolerations[0].operator=Exists \
--set tolerations[0].effect=NoSchedule
Option 2: Using a custom values.yaml file
For more extensive customization, download and modify the default values.yaml:
# Download the default values.yaml for GPU monitoring
helm show values exporter/device-metrics-exporter-charts --version v1.4.1 > gpu-values.yaml
Edit gpu-values.yaml as needed (example below), then install using the custom file:
tolerations:
- key: example.com/foo
operator: Exists
effect: NoSchedule
helm install exporter exporter/device-metrics-exporter-charts \
--version v1.4.1 \
--namespace kube-amd-gpu \
--create-namespace \
-f gpu-values.yaml
Debugging GPU installation
Use the --debug flag with helm install or helm template to see the fully rendered Kubernetes manifests with all values applied. This is useful for verifying that overrides are taking effect before deploying:
helm template exporter exporter/device-metrics-exporter-charts \
--debug \
-f gpu-values.yaml
Use --dry-run with helm install to simulate the installation without applying any resources to the cluster. Unlike helm template, a dry run communicates with the Kubernetes API server to validate the manifests against the cluster:
helm install exporter exporter/device-metrics-exporter-charts \
--version v1.4.1 \
--namespace kube-amd-gpu \
--create-namespace \
--dry-run \
-f gpu-values.yaml
# Install Helm Charts for NIC monitoring
helm repo add exporter https://rocm.github.io/device-metrics-exporter
helm repo update
helm install exporter exporter/nic-device-metrics-exporter-charts \
--version v1.2.0 \
--namespace kube-amd-network \
--create-namespace
Verifying the NIC installation
After installation, verify that the exporter pods are running:
kubectl get pods -l app.kubernetes.io/name=metrics-exporter -n kube-amd-network # For NIC monitoring
Customizing the NIC installation
You can customize the installation using one of two methods:
Note: Ensure the Metrics Exporter image image.tag matches the AINIC firmware version installed on your nodes. Refer to the Compatibility Matrix for the correct image version to use and configure the correct image.tag value in the values file.
Option 1: Using –set flags
Override individual values on the command line using --set key=value:
Example: Enable Prometheus monitoring and adjust resource limits
helm install exporter exporter/nic-device-metrics-exporter-charts \
--version v1.2.0 \
--namespace kube-amd-network \
--create-namespace \
--set serviceMonitor.enabled=true \
--set resources.limits.cpu=4 \
--set resources.limits.memory=8Gi
Example: Add custom tolerations for tainted nodes
helm install exporter exporter/nic-device-metrics-exporter-charts \
--version v1.2.0 \
--namespace kube-amd-network \
--create-namespace \
--set tolerations[0].key=example.com/foo \
--set tolerations[0].operator=Exists \
--set tolerations[0].effect=NoSchedule
Option 2: Using a custom values.yaml file
For more extensive customization, download and modify the default values.yaml:
# Download the default values.yaml for NIC monitoring
helm show values exporter/nic-device-metrics-exporter-charts --version v1.2.0 > nic-values.yaml
Edit nic-values.yaml as needed (example below), then install using the custom file:
tolerations:
- key: example.com/foo
operator: Exists
effect: NoSchedule
helm install exporter exporter/nic-device-metrics-exporter-charts \
--version v1.2.0 \
--namespace kube-amd-network \
--create-namespace \
-f nic-values.yaml
Debugging NIC installation
Use the --debug flag with helm install or helm template to see the fully rendered Kubernetes manifests with all values applied. This is useful for verifying that overrides are taking effect before deploying:
helm template exporter exporter/nic-device-metrics-exporter-charts \
--debug \
-f nic-values.yaml
Use --dry-run with helm install to simulate the installation without applying any resources to the cluster. Unlike helm template, a dry run communicates with the Kubernetes API server to validate the manifests against the cluster:
helm install exporter exporter/nic-device-metrics-exporter-charts \
--version v1.2.0 \
--namespace kube-amd-network \
--create-namespace \
--dry-run \
-f nic-values.yaml
Testing the installation#
Verify that the exporter is running and exposing metrics:
curl http://<exporter-service-ip>:5000/metrics
curl http://<exporter-service-ip>:5001/metrics
Uninstallation#
To uninstall the exporter, run:
helm uninstall exporter -n kube-amd-gpu # For GPU monitoring
helm uninstall exporter -n kube-amd-network # For NIC monitoring
Enabling DRA (Beta) for GPU monitoring#
Dynamic Resource Allocation (DRA) GPU claim support is available starting with exporter v1.4.1 on Kubernetes 1.34+. Pod association works natively with both the AMD Kubernetes device plugin (k8s-device-plugin) and the AMD GPU DRA driver (k8s-gpu-dra-driver) without any additional Helm configuration. The exporter first uses device plugin allocations and, if absent, automatically inspects DRA resource claims.
Checklist:
Cluster version: Kubernetes 1.34+.
DRA GPU driver deployed AMD GPU DRA driver.
Pods use resource claims referencing the AMD GPU driver.