Troubleshooting#

This guide provides steps to diagnose and resolve common issues with the AMD GPU Operator.

Checking Operator Status#

To check the status of the AMD GPU Operator:

kubectl get pods -n kube-amd-gpu
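
On a healthy cluster all Operator pods are expected to be Running. The commands below are a quick, hedged way to surface unhealthy pods and to inspect the DeviceConfig custom resource that drives the deployment (the resource name default matches the edit command later in this guide; substitute the name used by your install):

# List pods that are not Running or Completed
kubectl get pods -n kube-amd-gpu --field-selector=status.phase!=Running,status.phase!=Succeeded

# Inspect the DeviceConfig custom resource
kubectl get deviceconfigs -n kube-amd-gpu
kubectl describe deviceconfigs -n kube-amd-gpu default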

Collecting Logs#

To collect logs from the AMD GPU Operator:

kubectl logs -n kube-amd-gpu <pod-name>
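
For pods with more than one container, or when a pod has restarted, the following kubectl flags can help (the pod name is a placeholder):

# Stream logs from all containers in the pod
kubectl logs -n kube-amd-gpu <pod-name> --all-containers=true -f

# Fetch logs from the previous (crashed) container instance
kubectl logs -n kube-amd-gpu <pod-name> --previous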

Potential Issues with default DeviceConfig#

  • Refer to Typical Deployment Scenarios for more information and for the Helm install commands and configurations that fit your specific use case.

  • If operand pods (e.g., device plugin, metrics exporter) are stuck in the Init:0/1 state, the GPU worker node does not have the GPU driver loaded, or the driver failed to load properly.

    • If you are using an inbox or pre-installed driver, check the node's dmesg output to see why the driver failed to load.

    • If you want to deploy an out-of-tree driver, we suggest checking the Driver Installation Guide (./drivers/installation.html) and then modifying the default DeviceConfig so the Operator installs the out-of-tree GPU driver on your worker nodes, as sketched after the edit command below.

kubectl edit deviceconfigs -n kube-amd-gpu default
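
As a first check, confirm on the worker node whether the amdgpu kernel module is present. If you choose the out-of-tree path, the snippet below is only a sketch of how the default DeviceConfig might be switched to an Operator-managed driver; the spec.driver fields and the version string are assumptions for illustration, and the Driver Installation Guide documents the authoritative schema:

# On the GPU worker node: check whether the amdgpu module is loaded and why it may have failed
lsmod | grep amdgpu
dmesg | grep -i amdgpu | tail -n 50

# Illustrative only: enable Operator-managed out-of-tree driver installation
# (field names and driver version are assumptions; verify against the Driver Installation Guide)
kubectl patch deviceconfigs -n kube-amd-gpu default --type merge \
  -p '{"spec":{"driver":{"enable":true,"version":"6.2.2"}}}'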

Debugging Driver Installation#

If the AMD GPU driver build fails:

  • Check the status of the build pod:

kubectl get pods -n kube-amd-gpu

  • View the build pod logs:

kubectl logs -n kube-amd-gpu <build-pod-name>

  • Check events for more information:

kubectl get events -n kube-amd-gpu
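
To narrow the output down to the failing build pod, kubectl describe and an event field selector can help (the pod name is a placeholder):

# Show detailed status, including init-container and image-pull failures
kubectl describe pod -n kube-amd-gpu <build-pod-name>

# Show only events that reference the build pod, newest last
kubectl get events -n kube-amd-gpu --field-selector involvedObject.name=<build-pod-name> --sort-by='.lastTimestamp'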

Using Techsupport-dump Tool#

The techsupport_dump.sh script can be used to collect system state and logs for debugging:

./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>

Options:

  • -w: wide option

  • -o yaml/json: output format (default: json)

  • -k kubeconfig: path to kubeconfig (default: ~/.kube/config)
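
For example, to collect a YAML-formatted bundle with the wide option from a single node, or a JSON bundle from all nodes (the node name node-1 is a placeholder):

./tools/techsupport_dump.sh -w -o yaml node-1
./tools/techsupport_dump.sh -o json all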

Please file an issue with the collected techsupport bundle on our GitHub Issues page.