# DRA (Dynamic Resource Allocation) Driver

## Overview
The AMD GPU Operator supports Dynamic Resource Allocation (DRA), a Kubernetes API for requesting and sharing resources between pods and containers. DRA is an alternative to the traditional Device Plugin approach for making AMD GPUs available to workloads.
The DRA driver is built on the AMD GPU DRA Driver, which implements the Kubernetes DRA interface for AMD Instinct GPUs.
Note: DRA requires Kubernetes 1.32 or later. On Kubernetes 1.32 and 1.33, the `DynamicResourceAllocation` feature gate must be enabled explicitly.

Important: The DRA driver and the device plugin cannot be enabled at the same time in the same `DeviceConfig`. The operator validates this and rejects configurations where both are enabled.
For a detailed comparison of DRA vs Device Plugin capabilities, refer to the AMD GPU DRA Driver documentation.
## Prerequisites

- Kubernetes 1.32+ with the `DynamicResourceAllocation` feature gate enabled
- AMD GPU Operator installed via Helm
- AMD GPU driver (`amdgpu`) installed on the worker nodes. The DRA driver requires the `amdgpu` kernel module to be loaded in order to discover GPUs and publish `ResourceSlices`.
- CDI (Container Device Interface) enabled in the container runtime. CDI is enabled by default in containerd 2.0+ and CRI-O. If you are running older versions, enable CDI manually; refer to your container runtime's documentation for instructions.
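For containerd 1.7, for example, CDI support can be switched on in the CRI plugin configuration. A sketch (the spec directories shown are the conventional CDI defaults; check your runtime's documentation for the authoritative settings):

```toml
# /etc/containerd/config.toml (containerd 1.7.x; CDI is on by default in 2.0+)
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Enable Container Device Interface support in the CRI plugin
  enable_cdi = true
  # Directories containerd scans for CDI spec files
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```

Restart containerd (e.g., `systemctl restart containerd`) after editing the file.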
## Enabling the DRA Driver

### Option 1: Enable during Helm install with default CR

You can enable the DRA driver in the default `DeviceConfig` at install time:
```shell
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set deviceConfig.spec.devicePlugin.enableDevicePlugin=false \
  --set deviceConfig.spec.draDriver.enable=true
```
By default, the operator uses the `rocm/k8s-gpu-dra-driver:latest` image from Docker Hub. To specify a custom DRA driver image:

```shell
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set deviceConfig.spec.devicePlugin.enableDevicePlugin=false \
  --set deviceConfig.spec.draDriver.enable=true \
  --set deviceConfig.spec.draDriver.image=rocm/k8s-gpu-dra-driver:latest
```
This creates a default `DeviceConfig` with the DRA driver enabled and the device plugin disabled.

### Option 2: Enable via DeviceConfig CR

If the operator is already installed, you can enable the DRA driver by editing the `DeviceConfig`:

```shell
kubectl edit deviceconfigs -n kube-amd-gpu default
```

Set the `draDriver` section:
Note: If no `image` is specified, the operator defaults to `rocm/k8s-gpu-dra-driver:latest`.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: default
  namespace: kube-amd-gpu
spec:
  draDriver:
    enable: true
    image: rocm/k8s-gpu-dra-driver:latest
  devicePlugin:
    enableDevicePlugin: false
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```
Note: If the device plugin is currently enabled, you must disable it before enabling the DRA driver. The operator enforces mutual exclusion between these two components.
### Option 3: Apply a DeviceConfig YAML

Create a file `dra-deviceconfig.yaml`:

Note: The `image` field is optional. If omitted, the operator defaults to `rocm/k8s-gpu-dra-driver:latest`.
```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: dra-config
  namespace: kube-amd-gpu
spec:
  draDriver:
    enable: true
    image: rocm/k8s-gpu-dra-driver:latest
    imagePullPolicy: IfNotPresent
  devicePlugin:
    enableDevicePlugin: false
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```

Apply it:

```shell
kubectl apply -f dra-deviceconfig.yaml
```
## Verifying the DRA Driver

After enabling, verify the DRA driver DaemonSet is running:

```shell
kubectl get daemonsets -n kube-amd-gpu
```

You should see a DaemonSet named `<deviceconfig-name>-dra-driver` (e.g., `default-dra-driver`) with pods running on each GPU node:

```shell
$ kubectl get pods -n kube-amd-gpu -l app=dra-driver
NAME                       READY   STATUS    RESTARTS   AGE
default-dra-driver-abc12   1/1     Running   0          2m
```

Verify the `DeviceClass` exists:

```shell
$ kubectl get deviceclass gpu.amd.com
NAME          AGE
gpu.amd.com   5m
```

Check that `ResourceSlices` are being published by the driver:

```shell
$ kubectl get resourceslices
NAME                                   DRIVER        NODE           AGE
gpu-worker-1-gpu.amd.com-gpu-0-0qkr2   gpu.amd.com   gpu-worker-1   2m
```
## DeviceClass

The operator's Helm chart creates a `DeviceClass` named `gpu.amd.com` by default. This `DeviceClass` uses the following CEL selector expression:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.amd.com
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.amd.com'"
```
### Disabling automatic DeviceClass creation

If you manage the `DeviceClass` independently (e.g., in a GitOps workflow or when using the standalone DRA driver Helm chart), you can stop the operator from creating it:

```shell
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set draDriver.deviceClass.create=false
```

Or during upgrade:

```shell
helm upgrade amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --set draDriver.deviceClass.create=false
```
## Requesting GPUs with DRA

With DRA enabled, workloads request GPUs using `ResourceClaim` and `ResourceClaimTemplate` objects instead of `resources.limits`.
For workload examples including single GPU, multi-GPU, and GPU sharing scenarios, see the DRA driver examples in the upstream repository.
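For orientation, a minimal single-GPU request might look like the following sketch. Object names and the workload image are illustrative; the API group/version is Kubernetes 1.32's `resource.k8s.io/v1beta1`:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu              # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.amd.com   # DeviceClass created by the operator
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                 # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: rocm/dev-ubuntu-22.04       # illustrative workload image
    command: ["rocm-smi"]
    resources:
      claims:
      - name: gpu               # references the pod-level resourceClaims entry
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Each pod created from a `ResourceClaimTemplate` gets its own `ResourceClaim`, so the scheduler allocates a GPU per pod rather than sharing one claim across replicas.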
## Migrating from Device Plugin to DRA

To migrate an existing deployment from the traditional device plugin to the DRA driver:

1. Disable the device plugin by editing the `DeviceConfig`:

   ```shell
   kubectl patch deviceconfig default -n kube-amd-gpu --type=merge \
     -p '{"spec":{"devicePlugin":{"enableDevicePlugin":false}}}'
   ```

2. Enable the DRA driver:

   ```shell
   kubectl patch deviceconfig default -n kube-amd-gpu --type=merge \
     -p '{"spec":{"draDriver":{"enable":true}}}'
   ```

3. Update workload specifications to use `ResourceClaim`/`ResourceClaimTemplate` instead of `resources.limits` with `amd.com/gpu`.

4. Verify the DRA driver pods are running and `ResourceSlices` are published.

Warning: Workloads requesting `amd.com/gpu` will no longer be able to access GPUs once the device plugin is disabled. Update all workload specs before completing the migration.
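For the workload-update step, a before/after sketch may help. Pod names and images are illustrative, and `single-gpu` stands for an existing `ResourceClaimTemplate` in the workload's namespace:

```yaml
# Before: device plugin extended resource
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-legacy          # illustrative
spec:
  containers:
  - name: app
    image: rocm/dev-ubuntu-22.04   # illustrative
    resources:
      limits:
        amd.com/gpu: 1
---
# After: DRA claim reference
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-dra             # illustrative
spec:
  containers:
  - name: app
    image: rocm/dev-ubuntu-22.04   # illustrative
    resources:
      claims:
      - name: gpu               # references the pod-level resourceClaims entry
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu   # pre-existing template (illustrative)
```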
## DRA Driver DeviceConfig Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `enable` | bool | `false` | Enable or disable the DRA driver |
| `image` | string | `rocm/k8s-gpu-dra-driver:latest` | DRA driver container image |
| `imagePullPolicy` | string | | Image pull policy: `Always`, `IfNotPresent`, or `Never` |
| `tolerations` | list | | Tolerations for the DRA driver DaemonSet pods |
| `imageRegistrySecret` | object | | Image pull secret for private registries |
| `cmdLineArguments` | map | | Additional command-line flags passed to the DRA driver binary. Keys are flag names (without the leading `--`) |
| `nodeSelector` | map | | Node selector for the DRA driver DaemonSet; if not specified, reuses `spec.selector` |
| `upgradeStrategy` | string | | DaemonSet upgrade strategy |
| `maxUnavailable` | int | | Maximum pods unavailable during a rolling update |
## Passing Command-Line Arguments

The `cmdLineArguments` field lets you pass flags directly to the `gpu-kubeletplugin` binary. Specify each flag as a key-value pair where the key is the flag name without the leading `--`:

```yaml
spec:
  draDriver:
    enable: true
    cmdLineArguments:
      cdi-root: /etc/cdi
      healthcheck-port: "8080"
      v: "4"
      logging-format: json
```

For the full list of supported flags and their descriptions, refer to the DRA driver CLI options reference.
## Helm Chart Values

| Key | Type | Default | Description |
|---|---|---|---|
| `deviceConfig.spec.draDriver.enable` | bool | `false` | Enable DRA driver in default DeviceConfig |
| `deviceConfig.spec.draDriver.image` | string | `rocm/k8s-gpu-dra-driver:latest` | DRA driver image |
| `deviceConfig.spec.draDriver.imagePullPolicy` | string | | Image pull policy |
| `deviceConfig.spec.draDriver.tolerations` | list | | Tolerations |
| `deviceConfig.spec.draDriver.imageRegistrySecret` | object | | Image pull secret |
| `deviceConfig.spec.draDriver.cmdLineArguments` | object | | Command-line arguments |
| `deviceConfig.spec.draDriver.nodeSelector` | object | | Node selector; if not specified, reuses `deviceConfig.spec.selector` |
| `draDriver.deviceClass.create` | bool | `true` | Whether to create the `gpu.amd.com` DeviceClass |
| `draDriver.serviceAccount.annotations` | object | | Annotations for the DRA driver ServiceAccount |
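The `--set` flags shown in the install examples translate directly into a values file. A sketch limited to keys that appear in this document (the chart may expose more):

```yaml
# values.yaml (equivalent to the --set flags used in Option 1)
deviceConfig:
  spec:
    devicePlugin:
      enableDevicePlugin: false   # device plugin and DRA driver are mutually exclusive
    draDriver:
      enable: true
      image: rocm/k8s-gpu-dra-driver:latest
draDriver:
  deviceClass:
    create: true                  # default; set false to manage the DeviceClass yourself
```

Install with `helm install amd-gpu-operator rocm/gpu-operator-charts -n kube-amd-gpu --create-namespace -f values.yaml`.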
## Troubleshooting

### DRA driver pods not starting

- Verify your Kubernetes version is 1.32+ and the `DynamicResourceAllocation` feature gate is enabled.
- Check the DRA driver DaemonSet events: `kubectl describe daemonset <name>-dra-driver -n kube-amd-gpu`
- Ensure the DRA driver ServiceAccount and RBAC resources exist:

  ```shell
  kubectl get sa amd-gpu-operator-dra-driver -n kube-amd-gpu
  kubectl get clusterrole amd-gpu-operator-dra-driver-role
  ```
### No ResourceSlices published

- The AMD GPU driver (`amdgpu`) must be installed on the worker node. The DRA driver relies on the `amdgpu` kernel module to enumerate GPU devices; without it, the DRA driver pod will run but not publish any `ResourceSlices`. Verify the driver is loaded with `lsmod | grep amdgpu` on the worker node.
- If using the operator for driver management, ensure `spec.driver.enable: true` in your `DeviceConfig`.
- The DRA driver pod must be running and healthy on the node.
- Check pod logs: `kubectl logs <dra-driver-pod> -n kube-amd-gpu`
- Verify AMD GPUs are detected on the node: `kubectl get node <node> -o yaml | grep amd-gpu`
### Validation error: "DRADriver and DevicePlugin cannot be enabled at the same time"

The operator enforces mutual exclusion between these components. Disable the device plugin before enabling the DRA driver (or vice versa). See the Migration section above.
## Further Reading

- AMD GPU DRA Driver (upstream): detailed DRA driver documentation, architecture, and advanced configuration
- DRA Driver CLI Options Reference: all command-line flags accepted by the `gpu-kubeletplugin` binary
- Kubernetes DRA Documentation: Kubernetes-native DRA concepts and API reference
- Device Plugin Documentation: the traditional device plugin approach
- Full DeviceConfig Reference: all available `DeviceConfig` fields