Full Reference Config

Full DeviceConfig

Below is an example of a full DeviceConfig custom resource (CR) that can be used to install the AMD GPU Operator and its components. The example lists all available fields, with default values noted in the comments where applicable.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig # New Custom Resource Definition used by the GPU Operator
metadata:
  # Name that will prefix device plugin, node-labeller and metrics-exporter pods
  name: gpu-operator
  # Namespace where the GPU Operator and its components will run
  namespace: kube-amd-gpu
spec:
  ## AMD GPU Driver Configuration ##
  driver:
    # Set to false to use existing in-tree/pre-installed driver
    # Set to true to install out-of-tree amdgpu kernel module
    # Default: true
    enable: false
    # Set blacklist to true to blacklist the inbox / pre-installed amdgpu kernel module
    # Required when spec.driver.enable is true
    # GPU Worker node reboot is required to apply the blacklist
    blacklist: false
    # Specify the out-of-tree amdgpu driver version you want to install that coincides with a ROCm version number
    version: "6.3"
    # Specify your repository URL to host driver image for out-of-tree amdgpu kernel module
    # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
    image: docker.io/username/repo
    # (Optional) Specify the credentials for your private registry if it requires authentication for pull/push access
    # You can create a docker-registry type secret by running a command like:
    # kubectl create secret docker-registry mysecret -n kmm-namespace --docker-username=xxx --docker-password=xxx
    # Make sure the secret is created in the namespace where the KMM operator is running
    imageRegistrySecret:
      name: mysecret
    # (Optional) Specify your image registry's TLS config
    imageRegistryTLS:
      insecure: false # If true, check for the container image using plain HTTP
      insecureSkipTLSVerify: false # If true, skip any TLS server certificate validation (useful for self-signed certificates)
    # (Optional) Specify configuration to sign the driver image
    # Used when there is no pre-compiled driver image and the operator
    # builds and signs the driver image in one step within the cluster
    # Required for systems with Secure Boot enabled
    imageSign:
      # The private key used to sign kernel modules within the image
      keySecret:
        name: my-key-secret
      # The public key used to sign kernel modules within the image
      certSecret:
        name: my-cert-secret
  ## AMD K8s Device Plugin Configuration ##
  devicePlugin:
    # (Optional) Specifying image names is optional. The default image names shown here are used if none are specified.
    devicePluginImage: rocm/k8s-device-plugin:latest # Change this to trigger a device plugin upgrade on CR update
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest # Change this to trigger a node labeller upgrade on CR update
    # (Optional) Specify image registry secret to pull device plugin and node labeller images if needed.
    imageRegistrySecret:
      name: my-deviceplugin-image-secret
    # (Optional) Enable or disable node labeller, default value is true
    enableNodeLabeller: true
  ## AMD GPU Metrics Exporter Configuration ##
  metricsExporter:
    # Enable metrics collection and exposure (Default: false)
    enable: false
    # Service type for metrics endpoint exposure
    # Values: ClusterIP, NodePort
    # Default: ClusterIP
    serviceType: ClusterIP
    # Port for metrics endpoint when using ClusterIP
    # Default: 5000
    port: 5000
    # Port for metrics endpoint when using NodePort
    # Valid range: 30000-32767
    # Default: 32500
    nodePort: 32500
    # Container image for metrics exporter
    # Default: rocm/device-metrics-exporter:latest
    image: rocm/device-metrics-exporter:latest
    # Private registry credentials (optional)
    imageRegistrySecret:
      name: exporter-image-pull-secret
    # Custom metrics exporter configuration (optional)
    config:
      name: exporter-configmap
    # RBAC Proxy Configuration for secure metrics endpoint access (optional)
    rbacConfig:
      # Enable RBAC authentication proxy (Default: false)
      # When enabled, provides authentication and authorization for metrics endpoint
      enable: false
      # RBAC proxy container image
      # Default: quay.io/brancz/kube-rbac-proxy:v0.18.1
      image: "quay.io/brancz/kube-rbac-proxy:v0.18.1"
      # TLS configuration for metrics endpoint
      # Set true to disable HTTPS
      disableHttps: false
      # TLS certificate configuration
      # Default: Auto-generated self-signed certificates
      secret:
        name: my-kube-rbac-proxy-cert
    # If specifying a node selector here, the metrics exporter will only be deployed on nodes that match the selector
    # See Item #6 on https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
    selector:
      feature.node.kubernetes.io/amd-gpu: "true" # You must include this again as this selector will overwrite the global selector
      amd.com/device-metrics-exporter: "true" # Helpful for when you want to disable the metrics exporter on specific nodes
  # Specify the nodes to be managed by this DeviceConfig Custom Resource. This will be applied to all components unless a selector
  # is specified in the component configuration. The node labeller will automatically find nodes with AMD GPUs and apply the label
  # `feature.node.kubernetes.io/amd-gpu: "true"` to them for you
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
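
As a rough sketch of how the driver comments above fit together, the fragment below shows a driver section with the out-of-tree installation turned on. It reuses the placeholder values from the full example (docker.io/username/repo and version "6.3"), which you would replace with your own. It is an assumption that no other driver fields are needed in your environment: a private registry would also need imageRegistrySecret, and a Secure Boot system would also need imageSign.

```yaml
spec:
  driver:
    # Install the out-of-tree amdgpu kernel module
    enable: true
    # Blacklist the inbox / pre-installed amdgpu module, which the comments
    # above mark as required when enable is true (worker node reboot needed)
    blacklist: true
    # Placeholder driver version taken from the full example
    version: "6.3"
    # Placeholder repository taken from the full example; no image tag
    image: docker.io/username/repo
```

This fragment would be merged into a complete DeviceConfig rather than applied on its own.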

Minimal DeviceConfig

Below is an example of a minimal DeviceConfig CR that can be used to install the AMD GPU Operator and its components. All fields not listed take their default values. See the Full DeviceConfig example above for all available fields and their default values.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false # Set to false to skip driver installation and use the inbox or pre-installed driver on worker nodes
  devicePlugin:
    enableNodeLabeller: true
  metricsExporter:
    enable: true # Enables the metrics exporter, which is disabled by default
    serviceType: "NodePort" # Expose the metrics exporter service through a NodePort
    nodePort: 32500 # Node port for the metrics exporter service
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
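
One possible variation on the minimal example, sketched here under assumptions, keeps the exporter on the default ClusterIP service and restricts it to a subset of GPU nodes with a component-level selector. All field names come from the Full DeviceConfig example. The amd.com/device-metrics-exporter label is the illustrative one used there, and it is assumed that you apply it to the target nodes yourself.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false # Use the inbox or pre-installed driver
  metricsExporter:
    enable: true
    serviceType: ClusterIP # Default service type
    port: 5000 # Default ClusterIP port
    # Component-level selector; it overwrites the global selector,
    # so the GPU label has to be repeated here
    selector:
      feature.node.kubernetes.io/amd-gpu: "true"
      amd.com/device-metrics-exporter: "true" # Assumed to be set manually on chosen nodes
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```

With this layout, the device plugin and node labeller would still follow the global selector, while the exporter would only be scheduled on nodes that carry both labels.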