Installation Guide#
This guide walks through the process of installing the AMD GPU device plugin on a Kubernetes cluster.
Prerequisites#
Before installing the AMD GPU device plugin, ensure your environment meets the following requirements:
System Requirements#
Kubernetes: v1.18 or higher
AMD GPUs: ROCm-capable AMD GPU hardware
GPU Drivers: AMD GPU drivers or ROCm stack installed on worker nodes
Helm: v3.2.0 or later (if using the health check feature or GPU Operator)
Driver Installation#
If you haven’t installed the AMD GPU drivers yet, follow the official ROCm Installation Guide.
Installation Steps#
Choose one of the following options based on your requirements.
Option 1: Standard Device Plugin#
Use this option if you only need basic GPU allocation without health monitoring.
Using Pre-defined YAML File: You can use the pre-defined YAML file provided in this repository. Run the following command:
kubectl create -f k8s-ds-amdgpu-dp.yaml
Pulling from the Web: Alternatively, you can directly pull the YAML file from the repository:
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
Option 1.a: Standard Device Plugin with Init Container#
Use this option when deploying the Device Plugin in environments where the amdgpu driver may not be loaded before the plugin starts. This deployment includes an init container that waits for the amdgpu driver to load before launching the main plugin container.
Using Pre-defined YAML File: You can use the pre-defined YAML file provided in this repository. Run the following command:
kubectl create -f k8s-ds-amdgpu-dp-with-init-container.yaml
Pulling from the Web: Alternatively, you can directly pull the YAML file from the repository:
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp-with-init-container.yaml
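The waiting behaviour described for Option 1.a can be pictured as a small polling loop. This is only a sketch of the idea, not the init container's actual implementation: it assumes that a loaded amdgpu module shows up as the directory /sys/module/amdgpu, and the function name wait_for_amdgpu, its timeout argument, and the overridable directory are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: poll until the amdgpu kernel module is visible under
# /sys/module, or give up after a timeout (in seconds).
# Usage: wait_for_amdgpu [timeout_seconds] [sys_module_dir]
wait_for_amdgpu() {
  timeout="${1:-60}"
  module_dir="${2:-/sys/module}"   # overridable so the loop can be tried without a GPU
  elapsed=0
  while [ ! -d "$module_dir/amdgpu" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "amdgpu not loaded after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "amdgpu driver is loaded"
}
```

Running wait_for_amdgpu 30 on a worker node waits up to 30 seconds and returns a non-zero status if the driver never appears.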
Option 2: Device Plugin with Health Checks#
Use this option if you need GPU health monitoring capabilities in addition to GPU allocation.
Step 1: Install AMD Device Metrics Exporter#
The health check feature requires the AMD Device Metrics Exporter to be installed. This service exposes GPU metrics and health information, and the device plugin connects to it to obtain GPU health status.
Create a metrics-exporter-values.yaml file with the following content:
platform: k8s
nodeSelector: {} # Optional: Add custom nodeSelector
image:
  repository: docker.io/rocm/device-metrics-exporter
  tag: v1.2.0
  pullPolicy: Always
service:
  type: ClusterIP
  ClusterIP:
    port: 5000
# Enable GRPC socket for device plugin health monitoring
socket:
  enable: true
  path: /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
  permissions: 0777
volumeMounts:
  - name: socket-dir
    mountPath: /var/lib/amd-metrics-exporter
volumes:
  - name: socket-dir
    hostPath:
      path: /var/lib/amd-metrics-exporter
      type: DirectoryOrCreate
Install the metrics exporter with Helm:
helm install metrics-exporter \
  https://github.com/ROCm/device-metrics-exporter/releases/download/v1.2.0/device-metrics-exporter-charts-v1.2.0.tgz \
  -n kube-system -f metrics-exporter-values.yaml
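Once the release is installed, one way to confirm that the socket settings took effect is to look for the socket file on a GPU worker node. The sketch below assumes you have shell access to that node and that the hostPath volume from the values file places the socket at the same path on the host. The helper name check_exporter_socket is hypothetical.

```shell
# Hypothetical helper, to be run on a GPU worker node: report whether the
# metrics exporter's gRPC socket exists at the path set in
# metrics-exporter-values.yaml.
SOCKET_PATH="/var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket"

check_exporter_socket() {
  path="${1:-$SOCKET_PATH}"
  if [ -S "$path" ]; then      # -S is true only for a Unix domain socket
    echo "socket present: $path"
  else
    echo "socket missing: $path" >&2
    return 1
  fi
}
```

If the socket is missing, helm status metrics-exporter -n kube-system and the exporter pod's logs are reasonable next places to look.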
Step 2: Install Device Plugin with Health Checks#
After successfully installing the metrics exporter, deploy the device plugin with health check capability:
Using Pre-defined YAML File: You can use the pre-defined YAML file provided in this repository. Run the following command:
kubectl create -f k8s-ds-amdgpu-dp-health.yaml
Pulling from the Web: Alternatively, you can directly pull the YAML file from the repository:
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp-health.yaml
Option 3: Using AMD GPU Operator#
The AMD GPU Operator provides a comprehensive solution that installs and manages:
AMD GPU device plugin
Node labeler
Device metrics exporter
Driver installation and updates
See the GPU Operator Documentation for installation instructions and additional information.
Install Node Labeler (Optional)#
The AMD GPU Node Labeler automatically detects and labels nodes with detailed GPU properties, enabling more precise workload scheduling.
The node labeler requires:
A service account with permissions to modify node labels
Privileged container access for GPU discovery
Deploy the node labeler using the provided DaemonSet manifest:
kubectl create -f k8s-ds-amdgpu-labeller.yaml
After deployment, nodes with AMD GPUs will be automatically labeled with properties including:
Device ID
Product Name
Driver Version
VRAM Size
SIMD Count
Compute Unit count
GPU Family information
Firmware and Feature Versions
The labels are added with two prefixes:
amd.com/gpu.* - Current prefix
beta.amd.com/gpu.* - Legacy prefix (maintained for backwards compatibility)
Verify the labels on your nodes using one of these commands:
# View all labels on every node (GPU labels included)
kubectl get nodes -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels
# Show nodes carrying GPU labels (this pattern matches both prefixes)
kubectl get nodes --show-labels | grep "amd.com/gpu"
# Show only nodes carrying legacy-prefix GPU labels
kubectl get nodes --show-labels | grep "beta.amd.com/gpu"
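The grep filters above print whole node lines, with every label on the node included. To see only the current-prefix GPU labels, one per line, the comma-separated label string can be split first. This is a minimal sketch; gpu_labels is a hypothetical helper name.

```shell
# Hypothetical filter: read a comma-separated label string on stdin and print
# only the labels that start with amd.com/gpu., one per line.
# Legacy beta.amd.com/gpu.* labels are excluded by the ^ anchor.
# Exits non-zero when no label matches.
gpu_labels() {
  tr ',' '\n' | grep '^amd\.com/gpu\.'
}

# Usage (needs cluster access; the LABELS column is the last field):
#   kubectl get node <node-name> --show-labels --no-headers | awk '{print $NF}' | gpu_labels
```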
Example labels for an AMD MI300X GPU:
amd.com/gpu.cu-count=304
amd.com/gpu.device-id=74a1
amd.com/gpu.family=AI
amd.com/gpu.product-name=AMD_Instinct_MI300X_OAM
amd.com/gpu.simd-count=1216
amd.com/gpu.vram=192G
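Labels like these can be used to pick out nodes with a standard Kubernetes label selector. The sketch below wraps kubectl get nodes -l in a hypothetical helper, nodes_with_label. The key and value in the usage line come from the example labels above and will differ on other hardware.

```shell
# Hypothetical helper: list the nodes whose label KEY equals VALUE.
# Usage: nodes_with_label <label-key> <label-value>
nodes_with_label() {
  kubectl get nodes -l "$1=$2" -o name
}

# Example (needs cluster access):
#   nodes_with_label amd.com/gpu.family AI
```

The same key and value can be placed in a pod's nodeSelector to steer a workload onto matching nodes.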
Verify the Device Plugin Installation#
Check the status of the pods:
kubectl get pods -n kube-system
Describe the device plugin pod to see its status and events:
kubectl describe pod <device-plugin-pod-name> -n kube-system
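If you would rather wait for the rollout than poll the pod list, kubectl rollout status can block until the DaemonSet is ready. This is a minimal sketch, assuming the DaemonSet lives in kube-system like the pods above. This guide does not state the DaemonSet's name, so the hypothetical helper wait_for_daemonset takes it as an argument. Running kubectl get daemonsets -n kube-system lists the candidates.

```shell
# Hypothetical helper: block until every pod of a DaemonSet in kube-system is
# ready, or fail once the timeout expires.
# Usage: wait_for_daemonset <daemonset-name> [timeout]
wait_for_daemonset() {
  kubectl rollout status "daemonset/$1" -n kube-system --timeout="${2:-120s}"
}
```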
After deploying the device plugin, verify that your AMD GPUs are properly recognized as schedulable resources:
# List all nodes with their AMD GPU capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:"status.capacity.amd\.com/gpu"
Example output:
NAME          GPU
k8s-node-01   8
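A further check is to schedule a pod that requests the amd.com/gpu resource shown in the capacity column. The sketch below only prints a manifest. The helper name gpu_test_pod and the pod name amd-gpu-smoke-test are hypothetical, and the container image is left for you to supply (any image that provides sh and ls should do). It also assumes the plugin exposes the GPU to the container through /dev/kfd and /dev/dri.

```shell
# Hypothetical helper: print a Pod manifest that requests one AMD GPU.
# Usage: gpu_test_pod <container-image> | kubectl apply -f -
gpu_test_pod() {
  if [ -z "$1" ]; then
    echo "usage: gpu_test_pod <container-image>" >&2
    return 1
  fi
  image="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ${image}
      # List the GPU device nodes visible inside the container, then exit.
      command: ["sh", "-c", "ls -l /dev/kfd /dev/dri"]
      resources:
        limits:
          amd.com/gpu: 1
EOF
}
```

If the pod stays in Pending, no node currently has a free amd.com/gpu resource to allocate. kubectl logs amd-gpu-smoke-test shows the device listing once the pod has run.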
Troubleshooting#
If the device plugin pods are not running, check the logs:
kubectl logs -n kube-system <amdgpu-device-plugin-pod-name>
Common issues include:
GPU drivers not installed correctly
ROCm stack not installed or misconfigured
Insufficient permissions for the device plugin to access GPU devices
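For the driver and permission issues listed above, a quick look at the GPU device nodes on the affected worker node can narrow things down. This sketch assumes the usual ROCm layout, where compute access goes through /dev/kfd and the entries under /dev/dri. The helper name check_gpu_device_nodes is hypothetical.

```shell
# Hypothetical helper, to be run on a GPU worker node: check for the device
# nodes that ROCm workloads normally use.
# Usage: check_gpu_device_nodes [dev_dir]
check_gpu_device_nodes() {
  dev_dir="${1:-/dev}"
  rc=0
  for entry in kfd dri; do
    if [ -e "$dev_dir/$entry" ]; then
      echo "found:   $dev_dir/$entry"
    else
      echo "missing: $dev_dir/$entry"
      rc=1
    fi
  done
  return "$rc"
}
```

When both entries are present, ls -l /dev/kfd /dev/dri shows their owner, group, and mode, which helps with the permissions case.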
Uninstalling the Device Plugin#
To uninstall the device plugin, delete the DaemonSet using the same manifest file you used for installation.
If you installed the standard device plugin (Option 1):
kubectl delete -f k8s-ds-amdgpu-dp.yaml
# Or using the web URL
kubectl delete -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
If you installed the device plugin with health checks (Option 2):
kubectl delete -f k8s-ds-amdgpu-dp-health.yaml
# Or using the web URL
kubectl delete -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp-health.yaml
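The same pattern should apply to the other components installed in this guide. The sketch below assumes the manifest file names, Helm release name, and namespace used in the installation steps above. The function names are hypothetical wrappers that keep the three cases apart.

```shell
# Remove the init-container variant of the device plugin (Option 1.a).
uninstall_plugin_with_init_container() {
  kubectl delete -f k8s-ds-amdgpu-dp-with-init-container.yaml
}

# Remove the metrics exporter Helm release installed in Option 2, Step 1.
uninstall_metrics_exporter() {
  helm uninstall metrics-exporter -n kube-system
}

# Remove the optional node labeler DaemonSet.
uninstall_node_labeler() {
  kubectl delete -f k8s-ds-amdgpu-labeller.yaml
}
```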