Full Reference Config
Full DeviceConfig
Below is an example of a full DeviceConfig CR that can be used to install the AMD GPU Operator and its components. This example includes all the available fields and their default values.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig # Custom Resource Definition used by the GPU Operator
metadata:
  # Name of the DeviceConfig CR. Note that the names of the device plugin, node labeller and metrics exporter pods will be prefixed with this name
  name: gpu-operator
  namespace: kube-amd-gpu # Namespace for the GPU Operator and its components
spec:
  ## AMD GPU Driver Configuration ##
  driver:
    # Set to false to skip driver installation and use the inbox or pre-installed driver on worker nodes
    # Set to true to let the operator install the out-of-tree amdgpu kernel module
    enable: false
    # Set to true to blacklist the amdgpu kernel module, which is required for installing the out-of-tree driver
    # This does not work on OpenShift clusters. OpenShift users should use a Machine Config Operator (MCO) resource to configure the amdgpu blacklist.
    # An example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
    blacklist: false
    version: "6.4" # Specify the driver version you would like installed; this corresponds to a ROCm version number
    # Specify the repository that hosts your driver image
    # Note:
    # 1. DO NOT include the image tag, as the AMD GPU Operator automatically manages the image tag for you
    # 2. Updating the driver image repository is not supported. Delete the existing DeviceConfig and create a new one with the updated image repository
    image: docker.io/username/repo
    # (Optional) Specify the credential for your private registry if it requires credentials for pull/push access
    # You can create the docker-registry type secret by running a command like:
    # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
    # Make sure you create the secret within the namespace where the KMM operator is running
    imageRegistrySecret:
      name: my-image-secret
    imageRegistryTLS:
      insecure: false # If true, check for the container image using plain HTTP
      insecureSkipTLSVerify: false # If true, skip any TLS server certificate validation (useful for self-signed certificates)
    upgradePolicy:
      enable: true # (Optional) Set to true to enable automatic driver upgrades, or false to manage driver upgrades manually
      maxParallelUpgrades: 3 # (Optional) Number of nodes that will be upgraded in parallel. Default is 1
    # (Optional) Specify the secrets that store the private and public keys used to sign the built driver
    # Nodes with Secure Boot enabled require image signing to load the kernel module
    # You need to register the public key in the system's Machine Owner Key (MOK) database
    imageSign:
      keySecret:
        name: image-sign-private-key-secret
      certSecret:
        name: image-sign-public-key-secret
    # (Optional) Configure the driver image build within the cluster
    imageBuild:
      # Configure the registry to search for the base image used to build the driver
      # e.g. if your worker nodes run Ubuntu 22.04 and baseImageRegistry is docker.io,
      # the image builder will use docker.io/ubuntu:22.04 as the base image
      baseImageRegistry: docker.io
      baseImageRegistryTLS:
        insecure: false # If true, check for the container image using plain HTTP
        insecureSkipTLSVerify: false # If true, skip any TLS server certificate validation (useful for self-signed certificates)
    # (Optional) Specify driver tolerations so the operator can manage out-of-tree drivers on tainted nodes
    tolerations:
      - key: "example-key"
        operator: "Equal"
        value: "example-value"
        effect: "NoSchedule"

  ## AMD K8s Device Plugin Configuration ##
  devicePlugin:
    # (Optional) Specifying image names is optional. The default image names are shown here.
    devicePluginImage: rocm/k8s-device-plugin:latest # Change this to trigger a device plugin upgrade on CR update
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest # Change this to trigger a node labeller upgrade on CR update
    upgradePolicy:
      # (Optional) If no upgradePolicy is specified for a component but its image is changed, the daemonset will
      # be upgraded according to the defaults: `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value

  ## AMD GPU Metrics Exporter Configuration ##
  metricsExporter:
    enable: false # false by default. Set to true to enable the metrics exporter
    serviceType: ClusterIP # Service type used to expose the metrics exporter endpoint. Can be either `ClusterIP` or `NodePort`.
    port: 5000 # Note: if serviceType is NodePort, the port must be in the 30000-32767 range, e.g. 32500
    # (Optional) Specifying the metrics exporter image is optional. The default image name is shown here.
    image: rocm/device-metrics-exporter:v1.3.1 # Change this to trigger a metrics exporter upgrade on CR update
    upgradePolicy:
      # (Optional) If no upgradePolicy is specified for a component but its image is changed, the daemonset will
      # be upgraded according to the defaults: `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
    # If a node selector is specified here, the metrics exporter will only be deployed on nodes that match the selector
    # See Item #6 on https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
    selector:
      feature.node.kubernetes.io/amd-gpu: "true" # You must include this again because this selector overrides the global selector
      amd.com/device-metrics-exporter: "true" # Helpful when you want to disable the metrics exporter on specific nodes

  ## AMD Device Config Manager Configuration ##
  configManager:
    enable: false # false by default. Set to true to enable the config manager
    image: "rocm/device-config-manager:v1.3.1" # Image for the device-config-manager container
    imagePullPolicy: IfNotPresent # Image pull policy for the config manager. Accepted values are Always, IfNotPresent and Never
    config: # Specify the ConfigMap that stores the profile configuration
      name: "config-manager-config"
    upgradePolicy:
      # (Optional) If no upgradePolicy is specified for a component but its image is changed, the daemonset will
      # be upgraded according to the defaults: `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
    # The DCM pod, whether deployed standalone or through the GPU Operator, has a default toleration attached:
    # key: amd-dcm, value: up, operator: Equal, effect: NoExecute
    # (Optional) Specify additional tolerations here if the DCM pod needs to bypass nodes with other taints
    configManagerTolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"
    selector: # (Optional)
      feature.node.kubernetes.io/amd-gpu: "true" # Include this if you wish to override the global selector

  selector:
    # Specify the nodes to be managed by this DeviceConfig custom resource. This selector applies to all components unless a selector
    # is specified in the component configuration. The node labeller will automatically find nodes with AMD GPUs and apply the label
    # `feature.node.kubernetes.io/amd-gpu: "true"` to them for you
    feature.node.kubernetes.io/amd-gpu: "true"
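As a reference for applying the configuration above, the commands below are a minimal sketch. They assume the manifest is saved locally as full-deviceconfig.yaml (a hypothetical file name), that the kube-amd-gpu namespace already exists, and that my-image-secret matches the name referenced under driver.imageRegistrySecret; substitute your own registry credentials.

# Create the docker-registry secret referenced by driver.imageRegistrySecret
# (create it in the namespace where the operator and KMM are running)
kubectl create secret docker-registry my-image-secret \
  -n kube-amd-gpu \
  --docker-username=<username> \
  --docker-password=<password>

# Create the DeviceConfig CR from the manifest above
kubectl apply -f full-deviceconfig.yaml

# Verify the CR was accepted
kubectl get deviceconfigs -n kube-amd-gpu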
Minimal DeviceConfig
Below is an example of a minimal DeviceConfig CR that can be used to install the AMD GPU Operator and its components. All fields not listed below revert to their default values. See the Full DeviceConfig above for all available fields and their default values.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false # Set to false to skip driver installation and use the inbox or pre-installed driver on worker nodes
  devicePlugin:
    enableNodeLabeller: true
  metricsExporter:
    enable: true # Enable/disable the metrics exporter; disabled by default
    serviceType: "NodePort" # Expose the metrics exporter through a NodePort service
    nodePort: 32500
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
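To sanity-check a deployment driven by this minimal CR, a few standard kubectl commands are sketched below. The pod name prefix follows from the CR name gpu-operator; the /metrics path and the <node-ip> placeholder are assumptions to adapt to your cluster.

# Component pods (device plugin, node labeller, metrics exporter) appear in the operator namespace, prefixed with the CR name
kubectl get pods -n kube-amd-gpu

# Nodes with AMD GPUs should carry the label used by the selector
kubectl get nodes -l feature.node.kubernetes.io/amd-gpu=true

# With serviceType NodePort and nodePort 32500, the exporter should be reachable on any worker node
curl http://<node-ip>:32500/metrics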