Quick Start Guide#
Getting up and running with the AMD GPU Operator and Device Metrics Exporter on Kubernetes is quick and easy. This short guide uses the Helm installation method on a standard Kubernetes install. More detailed instructions on the different installation methods can be found on this site:

- GPU Operator Kubernetes Helm Install
- GPU Operator Red Hat OpenShift Install
Installing the GPU Operator#
The GPU Operator uses cert-manager to manage certificates for mTLS communication between services. If you haven’t already installed cert-manager as a prerequisite on your Kubernetes cluster, install it as follows:
```bash
# Add and update the cert-manager repository
helm repo add jetstack https://charts.jetstack.io --force-update

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
```
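Before moving on, it can help to confirm cert-manager is actually ready. A quick sanity check (the deployment names below are the chart’s defaults):

```bash
# Wait until all three cert-manager deployments report Available
kubectl wait --for=condition=Available deployment \
  cert-manager cert-manager-webhook cert-manager-cainjector \
  -n cert-manager --timeout=120s
```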
Once cert-manager is installed, you’re just a few commands away from installing the GPU Operator and having fully managed GPU infrastructure. Add the Helm repository and fetch the latest charts:
```bash
# Add the Helm repository
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
```
Next, install the AMD GPU Operator Helm charts with the `helm install` command.
Tip

Before v1.3.0, the GPU Operator Helm chart does not provide a default `DeviceConfig`; you need to take an extra step to create a `DeviceConfig` yourself. Starting from v1.3.0, the `helm install` command supports one-step installation plus configuration, creating a default `DeviceConfig` with default values. Those defaults may not work for every deployment scenario; please refer to Typical Deployment Scenarios below for more information and the corresponding `helm install` commands.
```bash
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.3.0   # or v1.2.2, v1.2.1, v1.2.0, v1.1.1, v1.1.0, v1.0.0
```
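If you want a quick confirmation that the release deployed before configuring anything further, `helm status` works for this (using the release name and namespace from above):

```bash
helm status amd-gpu-operator -n kube-amd-gpu
```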
Typical Deployment Scenarios#
Use a VM worker node with a VF-passthrough GPU

If you are using a VM-based GPU worker node with Virtual Function (VF) passthrough powered by the AMD MxGPU GIM driver, the VF device shows up in the guest VM. You need to adjust the default node selector to `"feature.node.kubernetes.io/amd-vgpu": "true"` to make the `DeviceConfig` work for your VM-based cluster.
```bash
# v1.3.0: one-step install with the adjusted selector
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.3.0 \
  --set-json 'deviceConfig.spec.selector={"feature.node.kubernetes.io/amd-gpu":null,"feature.node.kubernetes.io/amd-vgpu":"true"}'

# v1.2.2 or v1.2.1: install, then take an extra step to create a DeviceConfig
# with spec.selector "feature.node.kubernetes.io/amd-vgpu":"true"
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.2.2   # or v1.2.1

# v1.2.0, v1.1.1, v1.1.0, v1.0.0: no amd-vgpu detection support in these versions;
# manually modify the DeviceConfig selector so it selects your worker nodes
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.2.0   # or v1.1.1, v1.1.0, v1.0.0
```
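To verify up front that node-feature-discovery has labelled your VM worker nodes as expected, you can print the label as a column (nothing operator-specific here, just kubectl):

```bash
# Nodes with VF-passthrough GPUs should show "true" in the AMD-VGPU column
kubectl get nodes -L feature.node.kubernetes.io/amd-vgpu
```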
Use a GPU worker node without an inbox / pre-installed driver

If your worker node doesn’t have an inbox or pre-installed AMD GPU driver loaded, the operands (e.g. device plugin, metrics exporter) will be stuck in the `Init:0/1` pod state.
If you plan to use the GPU Operator to install an out-of-tree driver on your worker nodes, please refer to the Driver Installation Guide to configure the default `DeviceConfig`. Here are example commands:
```bash
# 1. Prepare an image registry to store the driver image (e.g. Docker Hub)
# 2. Set up the image registry secret:
#    kubectl create secret docker-registry mySecret -n kube-amd-gpu --docker-username=xxx --docker-password=xxx --docker-server=index.docker.io

# v1.3.0: one-step install with driver options
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.3.0 \
  --set deviceConfig.spec.driver.enable=true \
  --set deviceConfig.spec.driver.blacklist=true \
  --set deviceConfig.spec.driver.version=6.4 \
  --set deviceConfig.spec.driver.image=docker.io/myUserName/amd-driver-image \
  --set deviceConfig.spec.driver.imageRegistrySecret.name=mySecret

# v1.2.2 and earlier (v1.2.1, v1.2.0, v1.1.1, v1.1.0, v1.0.0): install, then take
# an extra step to create a DeviceConfig with proper configs in spec.driver
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.2.2   # or v1.2.1, v1.2.0, v1.1.1, v1.1.0, v1.0.0
```
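Once the out-of-tree driver has been built and loaded, the operand pods should move past `Init:0/1` on their own. If you want to double-check from the worker node itself, standard Linux tooling is enough (this assumes shell access to the node):

```bash
# On the GPU worker node: the amdgpu kernel module should now be loaded
lsmod | grep amdgpu
```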
Deploy the `DeviceConfig` separately, without using the default one from the Helm chart installation

You can use the option `--set crds.defaultCR.install=false` to disable deployment of the default `DeviceConfig`, then deploy it later in a separate step with your desired configuration.
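A minimal sketch of that two-step flow (the file name `my-deviceconfig.yaml` and its selector value are placeholders, and the `amd.com/v1alpha1` API group is assumed from the upstream example `DeviceConfig`; adjust both to your cluster):

```bash
# Step 1: install the chart without creating the default DeviceConfig
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --version=v1.3.0 \
  --set crds.defaultCR.install=false

# Step 2: apply your own DeviceConfig once it is ready
kubectl apply -f my-deviceconfig.yaml
```

Where `my-deviceconfig.yaml` might look like:

```yaml
# Minimal placeholder DeviceConfig; see the Full Reference Config for all fields
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: default
  namespace: kube-amd-gpu
spec:
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```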
Verify Installation#
After running the `helm install` command with proper configurations in `values.yaml`, you should see the GPU Operator pods starting up in the namespace you specified above, `kube-amd-gpu`. Here is an example with one control plane node and one GPU worker node:
```bash
$ kubectl get deviceconfigs -n kube-amd-gpu
NAME      AGE
default   10m

$ kubectl get pods -n kube-amd-gpu
NAME                                                              READY   STATUS    AGE
amd-gpu-operator-gpu-operator-charts-controller-manager-74nm5wt  1/1     Running   10m
amd-gpu-operator-kmm-controller-5c895cd594-h65nm                 1/1     Running   10m
amd-gpu-operator-kmm-webhook-server-76d6765d5b-g5g74             1/1     Running   10m
amd-gpu-operator-node-feature-discovery-gc-64c9b7dcd9-gz4g4      1/1     Running   10m
amd-gpu-operator-node-feature-discovery-master-7d69c9b6f9-hcrxm  1/1     Running   10m
amd-gpu-operator-node-feature-discovery-worker-jlzbs             1/1     Running   10m
default-device-plugin-9r9bh                                      1/1     Running   10m
default-metrics-exporter-6c7z5                                   1/1     Running   10m
default-node-labeller-xtwbm                                      1/1     Running   10m
```
- Controller components: `gpu-operator-charts-controller-manager`, `kmm-controller`, and `kmm-webhook-server`
- Operands: `default-device-plugin`, `default-node-labeller`, and `default-metrics-exporter`
Please refer to Troubleshooting if any issues occur during installation and configuration.

For a full list of `DeviceConfig` configurable options, refer to the Full Reference Config documentation. An example `DeviceConfig` is supplied in the ROCm/gpu-operator repository:

```bash
kubectl apply -f https://raw.githubusercontent.com/ROCm/gpu-operator/refs/heads/release-v1.3.0/example/deviceconfig_example.yaml
```
That’s it! The GPU Operator components should now all be running. You can verify this by checking the namespace where the GPU Operator components are installed (default: `kube-amd-gpu`):

```bash
kubectl get pods -n kube-amd-gpu
```
Creating a GPU-enabled Pod#
To create a pod that uses a GPU, specify the GPU resource in your pod specification:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: rocm/rocm-terminal:latest
      resources:
        limits:
          amd.com/gpu: 1 # requesting 1 GPU
```
Save this YAML to a file (e.g., `gpu-pod.yaml`) and create the pod:

```bash
kubectl apply -f gpu-pod.yaml
```
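To confirm a GPU was actually allocated to the container, a quick check is the pod state plus the device nodes inside it (this assumes the device plugin exposes the GPU under `/dev/dri`, which is how the ROCm stack surfaces devices):

```bash
# The pod should reach Running once a GPU is allocated
kubectl get pod gpu-pod

# The allocated GPU appears as card/renderD device nodes inside the container
kubectl exec gpu-pod -- ls /dev/dri
```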
Checking GPU Status#
To check the status of GPUs in your cluster:
```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'
```
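Illustrative output for a cluster with one control plane node and one 8-GPU worker node (the node names and count here are made up):

```text
NAME            GPUs
control-plane   <none>
gpu-worker1     8
```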
Using amd-smi#
To run `amd-smi` in a pod:

Create a YAML file named `amd-smi.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  containers:
    - image: docker.io/rocm/rocm-terminal:latest
      name: amd-smi
      command: ["/bin/bash"]
      args: ["-c", "amd-smi version && amd-smi monitor -ptum"]
      resources:
        limits:
          amd.com/gpu: 1
        requests:
          amd.com/gpu: 1
  restartPolicy: Never
```
Create the pod:

```bash
kubectl create -f amd-smi.yaml
```
Check the logs and verify that the `amd-smi` output reflects the expected ROCm version and GPU presence:

```bash
$ kubectl logs amd-smi
AMDSMI Tool: 24.6.2+2b02a07 | AMDSMI Library version: 24.6.2.0 | ROCm version: 6.2.2
GPU  POWER  GPU_TEMP  MEM_TEMP  GFX_UTIL  GFX_CLOCK  MEM_UTIL  MEM_CLOCK
  0  126 W     40 °C     32 °C       1 %    182 MHz      0 %    900 MHz
```
Using rocminfo#
To run `rocminfo` in a pod:

Create a YAML file named `rocminfo.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocminfo
spec:
  containers:
    - image: docker.io/rocm/rocm-terminal:latest
      name: rocminfo
      command: ["/bin/sh", "-c"]
      args: ["rocminfo"]
      securityContext:
        runAsUser: 0
      resources:
        limits:
          amd.com/gpu: 1
  restartPolicy: Never
```
Create the pod:

```bash
kubectl create -f rocminfo.yaml
```

Check the logs and verify the output:

```bash
kubectl logs rocminfo
```
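`rocminfo` output is long; to pick out just the agent headers and device names, you can filter the log (`Marketing Name` is one of the per-agent fields rocminfo prints):

```bash
# List only the agents and their device names
kubectl logs rocminfo | grep -E 'Agent|Marketing Name'
```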
Configuring GPU Resources#
Configuration parameters are documented in the Custom Resource Installation Guide.