Troubleshooting

This guide provides steps to diagnose and resolve common issues with the AMD GPU Operator.

Checking Operator Status

To check the status of the AMD GPU Operator:

kubectl get pods -n kube-amd-gpu
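To narrow the pod list down to the pods that need attention, a small wrapper can filter on the STATUS column. The following is a minimal sketch, not part of the operator tooling: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: list pods whose STATUS is neither Running nor Completed.
# The namespace defaults to kube-amd-gpu, as used throughout this guide.
unhealthy_pods() {
  local ns="${1:-kube-amd-gpu}"
  # --no-headers drops the column header so awk only sees pod rows.
  kubectl get pods -n "$ns" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage: unhealthy_pods [namespace]
# Prints one "<pod-name> <status>" line per pod that needs a closer look.
```

An empty result suggests that every pod in the namespace is either running or has completed.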

Collecting Logs

To collect logs from the AMD GPU Operator:

kubectl logs -n kube-amd-gpu <pod-name>
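When it is unclear which pod is at fault, it can help to save the logs of every pod in the namespace in one pass. The sketch below is one possible way to do that; the function name collect_operator_logs and the default output directory are assumptions made for illustration.

```shell
# Hypothetical helper: save the logs of every pod in the namespace to one file per pod.
collect_operator_logs() {
  local ns="${1:-kube-amd-gpu}"
  local out_dir="${2:-./amd-gpu-operator-logs}"
  local pod
  mkdir -p "$out_dir"
  # "-o name" prints entries such as "pod/<pod-name>".
  for pod in $(kubectl get pods -n "$ns" -o name); do
    # Strip the "pod/" prefix for the file name; keep going if one pod has no logs yet.
    kubectl logs -n "$ns" "$pod" --all-containers=true --timestamps \
      > "$out_dir/${pod#pod/}.log" 2>&1 || true
  done
}

# Usage: collect_operator_logs [namespace] [output-directory]
```

Error output from kubectl is written into the same file, so a pod that is still initializing leaves a short error message rather than an empty log.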

Potential Issues with DeviceConfig

  • Refer to Typical Deployment Scenarios for more information, including the helm install commands and configs that fit your specific use case.

  • If operand pods (e.g. device plugin, metrics exporter) are stuck in the Init:0/1 state, the GPU worker node either has no GPU driver loaded or the driver was not loaded properly.

    • If you are using an inbox or pre-installed driver, check the node's dmesg output to see why the driver was not loaded properly.

    • If you want to deploy an out-of-tree driver, check the Driver Installation Guide, then modify the default DeviceConfig so that the Operator installs the out-of-tree GPU driver on your worker nodes:

kubectl edit deviceconfigs -n kube-amd-gpu default

  • Verify that the DeviceConfig has been applied successfully across all nodes by checking its status. Any configuration issues (such as field validation errors) will be reported in the status section with the OperatorReady condition set to False. Use the following command to view the status:

kubectl get deviceconfigs -n kube-amd-gpu default -o yaml

Example status section from the output:

status:
  conditions:
  - lastTransitionTime: "2026-03-10T09:56:53Z"
    message: ""
    reason: OperatorReady
    status: "True"
    type: Ready
  devicePlugin:
    availableNumber: 1
    desiredNumber: 1
    nodesMatchingSelectorNumber: 1
  metricsExporter:
    availableNumber: 1
    desiredNumber: 1
    nodesMatchingSelectorNumber: 1
  observedGeneration: 1
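Two of the checks in the list above can be scripted. The sketch below is illustrative only: both function names are hypothetical, the first assumes the condition is reported with type Ready as in the sample status above, and the second must be run on the GPU worker node itself, where it assumes the driver's kernel module is named amdgpu.

```shell
# Hypothetical helper: print the status ("True"/"False") of the Ready condition
# of a DeviceConfig, using kubectl's JSONPath filter support.
deviceconfig_ready_status() {
  local name="${1:-default}"
  local ns="${2:-kube-amd-gpu}"
  kubectl get deviceconfigs -n "$ns" "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Hypothetical helper, run on the worker node: succeeds (exit 0) only if the
# amdgpu kernel module appears in the lsmod output.
amdgpu_module_loaded() {
  lsmod | awk '$1 == "amdgpu" { found = 1 } END { exit !found }'
}

# Usage:
#   [ "$(deviceconfig_ready_status)" = "True" ] || echo "DeviceConfig is not ready"
#   amdgpu_module_loaded || dmesg | grep -i amdgpu
```

If the module is missing, the dmesg filter in the usage comment is a quick way to look for driver load errors on that node.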

Debugging Driver Installation

If the AMD GPU driver build fails:

  • Check the status of the build pod:

kubectl get pods -n kube-amd-gpu

  • View the build pod logs:

kubectl logs -n kube-amd-gpu <build-pod-name>

  • Check events for more information:

kubectl get events -n kube-amd-gpu
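The event list can be long on a busy cluster. As a sketch of one way to focus it, the hypothetical helper below keeps only Warning events and sorts them by time; the function name recent_warning_events is an assumption, while --field-selector and --sort-by are standard kubectl get options.

```shell
# Hypothetical helper: show only Warning events in the namespace, oldest first.
recent_warning_events() {
  local ns="${1:-kube-amd-gpu}"
  kubectl get events -n "$ns" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Usage: recent_warning_events [namespace]
```

With this ordering, the most recent warnings for the build pod appear at the bottom of the output.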

Using Techsupport-dump Tool

The techsupport-dump script can be used to collect system state and logs for debugging:

./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>

Options:

  • -w: wide option

  • -o yaml/json: output format (default: json)

  • -k kubeconfig: path to kubeconfig (default: ~/.kube/config)
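As an illustration of how the options combine, the sketch below wraps one invocation that requests YAML output with an explicit kubeconfig. The function name techsupport_bundle and the TECHSUPPORT_DUMP override variable are hypothetical; only the flags and the script path come from the usage line above.

```shell
# Hypothetical wrapper around the techsupport-dump script.
#   $1: node name, or "all" (default: all)
#   $2: path to kubeconfig (default: ~/.kube/config)
# TECHSUPPORT_DUMP is an assumed override for the script location.
techsupport_bundle() {
  local target="${1:-all}"
  local kubeconfig="${2:-$HOME/.kube/config}"
  local script="${TECHSUPPORT_DUMP:-./tools/techsupport_dump.sh}"
  "$script" -o yaml -k "$kubeconfig" "$target"
}

# Usage: techsupport_bundle <node-name|all> [kubeconfig]
```

Run it from the repository root so that the default relative script path resolves.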

Please file an issue with the collected techsupport bundle on our GitHub Issues page.