Full Reference Config#

Full DeviceConfig#

Below is an example of a full DeviceConfig CR that can be used to install the AMD GPU Operator and its components. This example includes all the available fields and their default values.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig #New Custom Resource Definition used by the GPU Operator
metadata:
  # Name of the DeviceConfig CR. Note that the name of device plugin, node-labeller and metric-explorter pods will be prefixed with
  name: gpu-operator
  namespace: kube-amd-gpu # Namespace for the GPU Operator and it's components
spec:
  ## AMD GPU Driver Configuration ##
  driver:
    # Set to false to skip driver installation to use inbox or pre-installed driver on worker nodes
    # Set to true to enable operator to install out-of-tree amdgpu kernel module
    enable: false
    # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver
    # Not working for OpenShift cluster. OpenShift users please use the Machine Config Operator (MCO) resource to configure amdgpu blacklist.
    # Example MCO resource is available at https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
    blacklist: false
    # Specify your repository to host driver image
    # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
    image: docker.io/username/repo
    # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
    # you can create the docker-registry type secret by running command like:
    # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
    # Make sure you created the secret within the namespace that KMM operator is running
    imageRegistrySecret:
      name: mysecret
  imageRegistryTLS:
    insecure: false # If true, check for the container image using plain HTTP
    InsecureSkipTLSVerify: false # If true, skip any TLS server certificate validation (useful for self-signed certificates)
  version: "6.3" # Specify the driver version you would like to be installed that coincides with a ROCm version number
  upgradePolicy:
    enable: true
    maxParallelUpgrades: 3 # (Optional) Number of nodes that will be upgraded in parallel. Default is 1
  ## AMD K8s Device Plugin Configuration ##
  commonConfig:
    # (Optional) Specify common values used by all components.
    initContainerImage: busybox:1.36 # Specify the InitContainerImage to use for all component pods
    utilsContainer:
      image: docker.io/amdpsdo/gpu-operator-utils:latest # Image to use for the utils container
      imagePullPolicy: IfNotPresent # Image pull policy for the utils container. Either `Always`, `IfNotPresent` or `Never`
      # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
      # you can create the docker-registry type secret by running command like:
      # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
      # Make sure you created the secret within the namespace that KMM operator is running
      imageRegistrySecret:
        name: mysecret
  devicePlugin:
    enableNodeLabeller: true # enable or disable the node labeller
    # (Optional) Specifying image names are optional. Default image names for shown here if not specified.
    devicePluginImage: rocm/k8s-device-plugin:latest # Change this to trigger metrics exporter upgrade on CR update
    devicePluginImagePullPolicy: IfNotPresent # Image pull policy for the device plugin. Either `Always`, `IfNotPresent` or `Never`
    # devicePluginImagePullPolicy default value is "IfNotPresent" for valid tags, "Always" for no tag or "latest" tag
    devicePluginTolerations:
      key: "key1" # Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty,
      # operator must be "Exists"; this combination means to match all values and all keys.
      operator: "Equal" # Operator represents a key's relationship to the value. Valid operators are Exists and Equal.
      # Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
      value: "value1" # Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty,
      # otherwise just a regular string.
      effect: "NoSchedule" # Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed
      # values are "NoSchedule", "PreferNoSchedule" and "NoExecute".
      tolerationSeconds: [Expected Int value, not set by default] #Seconds represents the period of time the toleration tolerates the taint.
      # By default, it is not set, which means tolerate the taint forever (do not evict). Effect needs to be NoExecute for this,
      # otherwise this field is ignored. Zero and negative values will be treated as 0 (evict immediately) by the system.
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest # Change this to trigger metrics exporter upgrade on CR update
    nodeLabellerImagePullPolicy: IfNotPresent # Image pull policy for the node labeller. Either `Always`, `IfNotPresent` or `Never`
    # nodeLabellerImagePullPolicy default value is "IfNotPresent" for valid tags, "Always" for no tag or "latest" tag
    nodeLabellerTolerations:
      key: "key1" # Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty,
      # operator must be "Exists"; this combination means to match all values and all keys.
      operator: "Equal" # Operator represents a key's relationship to the value. Valid operators are Exists and Equal.
      # Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
      value: "value1" # Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty,
      # otherwise just a regular string.
      effect: "NoSchedule" # Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed
      # values are "NoSchedule", "PreferNoSchedule" and "NoExecute".
      tolerationSeconds: [Expected Int value, not set by default] #Seconds represents the period of time the toleration tolerates the taint.
      # By default, it is not set, which means tolerate the taint forever (do not evict). Effect needs to be NoExecute for this,
      # otherwise this field is ignored. Zero and negative values will be treated as 0 (evict immediately) by the system.
    imageRegistrySecret:
      # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
      # you can create the docker-registry type secret by running command like:
      # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
      # Make sure you created the secret within the namespace that KMM operator is running
      name: mysecret
    upgradePolicy:
      #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
      # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: RollingUpdate, # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
  ## AMD GPU Metrics Exporter Configuration ##
  metricsExporter:
    enable: false # false by Default. Set to true to enable the Metrics Exporter
    serviceType: ClusterIP # ServiceType used to expose the Metrics Exporter endpoint. Can be either `ClusterIp` or `NodePort`.
    port: 5000 # Note if specifying NodePort as the serviceType use `32500` as the port number must be between 30000-32767
    # (Optional) Specifying metrics exporter image is optional. Default imagename shown here if not specified.
    image: rocm/device-metrics-exporter:v1.2.0 # Change this to trigger metrics exporter upgrade on CR update
    imagePullPolicy: "IfNotPresent" # image pull policy for the metrics exporter container. Either `Always`, `IfNotPresent` or `Never`
    # imagePullPolicy default value is "IfNotPresent" for valid tags, "Always" for no tag or "latest" tag
    config:
      # Name of the ConfigMap that contains the metrics exporter configuration.
      name: gpu-config # (Optional) If the configmap does not exist the DeviceConfig will show a validation error and not start any plugin pods
    upgradePolicy:
      #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
      # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: RollingUpdate, # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
    # If specifying a node selector here, the metrics exporter will only be deployed on nodes that match the selector
    # See Item #6 on https://instinct.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
    tolerations:
      key: "key1" # Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty,
      # operator must be "Exists"; this combination means to match all values and all keys.
      operator: "Equal" # Operator represents a key's relationship to the value. Valid operators are Exists and Equal.
      # Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
      value: "value1" # Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty,
      # otherwise just a regular string.
      effect: "NoSchedule" # Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed
      # values are "NoSchedule", "PreferNoSchedule" and "NoExecute".
      tolerationSeconds: [Expected Int value, not set by default] #Seconds represents the period of time the toleration tolerates the taint.
      # By default, it is not set, which means tolerate the taint forever (do not evict). Effect needs to be NoExecute for this,
      # otherwise this field is ignored. Zero and negative values will be treated as 0 (evict immediately) by the system.
    imageRegistrySecret:
      # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
      # you can create the docker-registry type secret by running command like:
      # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
      # Make sure you created the secret within the namespace that KMM operator is running
      name: mysecret
    selector:
      feature.node.kubernetes.io/amd-gpu: "true" # You must include this again as this selector will overwrite the global selector
      amd.com/device-metrics-exporter: "true" # Helpful for when you want to disable the metrics exporter on specific nodes
  ## AMD GPU Device Test Runner Configuration ##
  testRunner:
    enable: true # false by Default. Set to true to enable the Metrics Exporter
    serviceType: ClusterIP # ServiceType used to expose the Metrics Exporter endpoint. Can be either `ClusterIp` or `NodePort`.
    port: 5000 # Note if specifying NodePort as the serviceType use `32500` as the port number must be between 30000-32767
    # (Optional) Specifying metrics exporter image is optional. Default imagename shown here if not specified.
    image: docker.io/rocm/test-runner:v1.2.0-beta.0 # Change this to trigger metrics exporter upgrade on CR update
    imagePullPolicy: "IfNotPresent" # image pull policy for the test runner container. Either `Always`, `IfNotPresent` or `Never`
    # imagePullPolicy default value is "IfNotPresent" for valid tags, "Always" for no tag or "latest" tag
    config:
      # Name of the configmap to customize the config for test runner. If not specified default test config will be aplied
      name: test-config # (Optional) If the configmap does not exist the DeviceConfig will show a validation error and not start any plugin pods
    logsLocation:
      mountPath: "/var/log/amd-test-runner" # mount path inside test runner container for log files
      hostPath: "/var/log/amd-test-runner" # host path to be mounted into test runner container for log files
    upgradePolicy:
      #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
      # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
      upgradeStrategy: RollingUpdate, # (Optional) Can be either `RollingUpdate` or `OnDelete`
      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
    # If specifying a node selector here, the metrics exporter will only be deployed on nodes that match the selector
    # See Item #6 on https://instinct.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
    tolerations:
      key: "key1" # Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty,
      # operator must be "Exists"; this combination means to match all values and all keys.
      operator: "Equal" # Operator represents a key's relationship to the value. Valid operators are Exists and Equal.
      # Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
      value: "value1" # Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty,
      # otherwise just a regular string.
      effect: "NoSchedule" # Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed
      # values are "NoSchedule", "PreferNoSchedule" and "NoExecute".
      tolerationSeconds: [Expected Int value, not set by default] #Seconds represents the period of time the toleration tolerates the taint.
      # By default, it is not set, which means tolerate the taint forever (do not evict). Effect needs to be NoExecute for this,
      # otherwise this field is ignored. Zero and negative values will be treated as 0 (evict immediately) by the system.
    imageRegistrySecret:
      # (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
      # you can create the docker-registry type secret by running command like:
      # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
      # Make sure you created the secret within the namespace that KMM operator is running
      name: mysecret
    selector:
      feature.node.kubernetes.io/amd-gpu: "true" # You must include this again as this selector will overwrite the global selector
      amd.com/device-test-runner: "true" # Helpful for when you want to disable the test runner on specific nodes
  selector:
  # Specify the nodes to be managed by this DeviceConfig Custom Resource.  This will be applied to all components unless a selector
  # is specified in the component configuration. The node labeller will automatically find nodes with AMD GPUs and apply the label
  # `feature.node.kubernetes.io/amd-gpu: "true"` to them for you
    feature.node.kubernetes.io/amd-gpu: "true"

Minimal DeviceConfig#

The below is an example of the minimal DeviceConfig CR that can be used to install the AMD GPU Operator and its components. All fields not listed below will revert to their default values. See the above Full DeviceConfig for all available fields and their default values.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false # Set to false to skip driver installation to use inbox or pre-installed driver on worker nodes
  devicePlugin:
    enableNodeLabeller: true
  metricsExporter:
    enable: true # To enable/disable the metrics exporter, disabled by default
    serviceType: "NodePort" # Node port for metrics exporter service
    nodePort: 32500
    testRunner:
      enable: true
      logsLocation:
        mountPath: "/var/log/amd-test-runner" # mount path inside test runner container for logs
        hostPath: "/var/log/amd-test-runner" # host path to be mounted into test runner container for logs
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"

Metrics Exporter ConfigMap#

apiVersion: v1
kind: ConfigMap
metadata:
  name: exporter-configmap
  namespace: kube-amd-gpu
data:
  config.json: |
    {
      "GPUConfig": {
        "Labels": [
          "GPU_UUID",
          "SERIAL_NUMBER",
          "GPU_ID",
          "POD",
          "NAMESPACE",
          "CONTAINER",
          "JOB_ID",
          "JOB_USER",
          "JOB_PARTITION",
          "CLUSTER_NAME",
          "CARD_SERIES",
          "CARD_MODEL",
          "CARD_VENDOR",
          "DRIVER_VERSION",
          "VBIOS_VERSION",
          "HOSTNAME"
        ]
      }
    }