OpenShift (Helm)#
Warning
Installing via Helm is not a recommended method for Red Hat OpenShift. Users who want to use AMD GPUs with OpenShift should consider using the OLM method instead.
This guide walks through installing the AMD GPU Operator on an OpenShift cluster using Helm.
Prerequisites#
OpenShift Requirements#
OpenShift Container Platform 4.16 or later
Cluster administrator privileges
Helm v3.2.0 or later
oc CLI tool configured with cluster access
Required OpenShift Operators#
The following operators must be enabled in your OpenShift cluster (enabled by default):
Service-CA Operator
Required for certificate signing and webhook authentication
Verifies communication between kube-api-server and KMM webhook server
MachineConfig Operator
Required for configuring the blacklist for the amdgpu driver
Manages node-level configuration
Cluster Image Registry Operator
Required for driver image builds within OpenShift
Manages internal image registry storage
Steps to enable the image registry operator if it is disabled (example using emptyDir storage, which is ephemeral and suitable only for testing):
Configure registry storage:
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
Enable the registry:
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
Verify the registry pod is running:
oc get pods -n openshift-image-registry
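If you are unsure whether the registry is already enabled, a couple of read-only checks can confirm its state before patching (the deployment name image-registry is the operator's default):

```shell
# Should print "Managed" once the registry is enabled
oc get configs.imageregistry.operator.openshift.io cluster \
  -o jsonpath='{.spec.managementState}{"\n"}'

# Block until the registry deployment has rolled out
oc rollout status deployment/image-registry -n openshift-image-registry
```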
Installation Methods#
There are two ways to install the AMD GPU Operator on OpenShift:
Method 1: All-in-One Installation#
This method installs the operator and all dependencies using a single Helm chart.
Install the operator and dependencies:
helm install amd-gpu-operator rocm/gpu-operator-helm \
--namespace kube-amd-gpu \
--create-namespace \
--set platform=openshift
Verify the installation:
oc get pods -n kube-amd-gpu
Expected output:
NAME                                                   READY   STATUS    RESTARTS   AGE
nfd-master-67b568b89c-lvk9k                            1/1     Running   0          2m
nfd-worker-nkrgl                                       1/1     Running   0          2m
amd-gpu-operator-controller-manager-56844b49b4-tk75f   1/1     Running   0          2m
amd-gpu-kmm-controller-78ddd75846-kxd8n                1/1     Running   0          2m
amd-gpu-kmm-webhook-server-749cb8b565-ktbsp            1/1     Running   0          2m
amd-gpu-nfd-controller-manager-77764d98c5-h76pp        2/2     Running   0          2m
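Pod names and counts vary with chart version; rather than eyeballing the list, one way to block until everything is ready:

```shell
# Wait up to 5 minutes for all operator pods in the namespace to become Ready
oc wait --for=condition=Ready pod --all -n kube-amd-gpu --timeout=300s
```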
Method 2: Component-by-Component Installation#
This method allows more control over the installation process by installing dependencies separately.
Step 1: Install Node Feature Discovery (NFD) Operator#
Navigate to OpenShift Web Console → OperatorHub
Search for “Node Feature Discovery”
Select and install the Red Hat version of the operator
Choose the default installation options
Step 2: Install Kernel Module Management (KMM) Operator#
Navigate to OpenShift Web Console → OperatorHub
Search for “Kernel Module Management”
Select and install the Red Hat version (without Hub label)
Choose the default installation options
Step 3: Install AMD GPU Operator#
Install the operator while skipping the already-installed dependencies:
helm install amd-gpu-operator rocm/gpu-operator-helm \
--namespace kube-amd-gpu \
--create-namespace \
--set platform=openshift \
--set nfd.enabled=false \
--set kmm.enabled=false
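To confirm the chart installed with the bundled NFD and KMM disabled, a quick sanity check:

```shell
# The release should be listed as deployed
helm list -n kube-amd-gpu

# Only the operator controller pods should appear here; NFD and KMM pods
# come from the operators installed via OperatorHub in Steps 1-2
oc get pods -n kube-amd-gpu
```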
Post-Installation Configuration#
1. Configure Node Feature Discovery#
Create an NFD rule to detect AMD GPUs:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: amd-gpu-nfd-instance
  namespace: kube-amd-gpu
spec:
  operand:
    image: quay.io/openshift/origin-node-feature-discovery:4.16
    imagePullPolicy: IfNotPresent
    servicePort: 12000
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
            - "device"
        custom:
          - name: amd-gpu
            labels:
              feature.node.kubernetes.io/amd-gpu: "true"
            matchAny:
              - matchFeatures:
                  - feature: pci.device
                    matchExpressions:
                      vendor: {op: In, value: ["1002"]}
                      device: {op: In, value: [
                        "74a0", # MI300A
                        "74a1", # MI300X
                        "740f", # MI210
                        "7408", # MI250X
                        "740c", # MI250/MI250X
                        "738c", # MI100
                        "738e"  # MI100
                      ]}
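Assuming the manifest above is saved as nfd-rule.yaml (an arbitrary filename), apply it and wait for the label to appear:

```shell
oc apply -f nfd-rule.yaml

# NFD workers rescan every sleepInterval (60s above); GPU nodes then gain the label
oc get nodes -l feature.node.kubernetes.io/amd-gpu=true -o name
```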
2. Create blacklist (for installing out-of-tree kernel module)#
Create a MachineConfig custom resource to add the amdgpu kernel module to the modprobe blacklist; an example is shown below. Set the machineconfiguration.openshift.io/role label to master if you run Single Node OpenShift, or to worker in other scenarios with dedicated control plane nodes.
Warning
After the amdgpu kernel module is added to the blacklist via a MachineConfig custom resource, the Machine Config Operator will automatically reboot the selected nodes.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: amdgpu-module-blacklist
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: "/etc/modprobe.d/amdgpu-blacklist.conf"
          mode: 420
          overwrite: true
          contents:
            source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo="
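The contents.source value is simply a base64-encoded one-line modprobe config, and mode: 420 is the decimal form of octal 0644. A quick sketch to reproduce and verify the payload, assuming a POSIX shell with base64 available:

```shell
# Encode the one-line modprobe config (trailing newline included)
payload=$(printf 'blacklist amdgpu\n' | base64)
echo "$payload"   # prints YmxhY2tsaXN0IGFtZGdwdQo=

# Decode it back to confirm what lands in /etc/modprobe.d/amdgpu-blacklist.conf
printf '%s' "$payload" | base64 -d

# Ignition file modes are decimal: 420 is octal 0644
printf '%o\n' 420   # prints 644
```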
3. Create DeviceConfig Resource#
Create a DeviceConfig custom resource to trigger driver installation:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true
    image: image-registry.openshift-image-registry.svc:5000/amdgpu_kmod
    version: 6.2.2
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
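Assuming the manifest above is saved as deviceconfig.yaml (an arbitrary filename), apply it and watch the pods it triggers:

```shell
oc apply -f deviceconfig.yaml

# Builder and KMM worker pods should appear as the driver image is built and loaded
oc get pods -n kube-amd-gpu -w
```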
Verification#
1. Check Node Labels#
Verify GPU detection:
oc get nodes -l feature.node.kubernetes.io/amd-gpu=true
2. Check Component Status#
Verify all pods are running:
oc get pods -n kube-amd-gpu
Check GPU resource availability:
oc get node -o json | jq '.items[].status.capacity."amd.com/gpu"'
3. Check Driver Status#
Monitor driver installation:
oc logs -n kube-amd-gpu -l app=kmm-worker
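Beyond the pod logs, you can confirm the module actually loaded on a node; `<gpu-node-name>` below is a placeholder for one of your labeled GPU nodes:

```shell
# Run lsmod on the host through a debug pod (requires cluster-admin)
oc debug node/<gpu-node-name> -- chroot /host sh -c 'lsmod | grep amdgpu'
```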
Troubleshooting#
Common Issues#
Certificate Issues
Check Service-CA operator status
Verify webhook certificates are properly mounted
Driver Build Failures
Check builder pod logs
Verify registry access
Check available storage
Node Labeling Issues
Verify NFD operator status
Check NFD worker pods on GPU nodes
Review NFD rule syntax
For detailed troubleshooting, run the support tool:
./tools/techsupport_dump.sh -w -o yaml <node-name>
Uninstallation#
Please refer to the Uninstallation document for uninstalling related resources.