Troubleshooting#

This guide provides steps to diagnose and resolve common issues with the AMD GPU Operator.

Checking Operator Status#

To check the status of the AMD GPU Operator:

kubectl get pods -n kube-amd-gpu
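
On a healthy cluster all Operator pods are expected to be Running. The commands below are a quick, hedged way to surface unhealthy pods and to inspect the DeviceConfig custom resource that drives the deployment (the resource name default matches the edit command later in this guide; substitute the name used by your install):

# List pods that are not Running or Completed
kubectl get pods -n kube-amd-gpu --field-selector=status.phase!=Running,status.phase!=Succeeded

# Inspect the DeviceConfig custom resource
kubectl get deviceconfigs -n kube-amd-gpu
kubectl describe deviceconfigs -n kube-amd-gpu default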

Collecting Logs#

To collect logs from the AMD GPU Operator:

kubectl logs -n kube-amd-gpu <pod-name>
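
For pods with more than one container, or when a pod has restarted, the following kubectl flags can help (the pod name is a placeholder):

# Stream logs from all containers in the pod
kubectl logs -n kube-amd-gpu <pod-name> --all-containers=true -f

# Fetch logs from the previous (crashed) container instance
kubectl logs -n kube-amd-gpu <pod-name> --previous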

Potential Issues with default DeviceConfig#

  • Refer to Typical Deployment Scenarios for more information and for the Helm install commands and configurations that fit your specific use case.

  • If operand pods (e.g., device plugin, metrics exporter) are stuck in the Init:0/1 state, the GPU worker node does not have the GPU driver loaded, or the driver failed to load properly.

    • If you are using an inbox or pre-installed driver, check the node's dmesg output to see why the driver failed to load.

    • If you want to deploy an out-of-tree driver, we suggest checking the Driver Installation Guide (./drivers/installation.html) and then modifying the default DeviceConfig so the Operator installs the out-of-tree GPU driver on your worker nodes, as sketched after the edit command below.

kubectl edit deviceconfigs -n kube-amd-gpu default
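
As a first check, confirm on the worker node whether the amdgpu kernel module is present. If you choose the out-of-tree path, the snippet below is only a sketch of how the default DeviceConfig might be switched to an Operator-managed driver; the spec.driver fields and the version string are assumptions for illustration, and the Driver Installation Guide documents the authoritative schema:

# On the GPU worker node: check whether the amdgpu module is loaded and why it may have failed
lsmod | grep amdgpu
dmesg | grep -i amdgpu | tail -n 50

# Illustrative only: enable Operator-managed out-of-tree driver installation
# (field names and driver version are assumptions; verify against the Driver Installation Guide)
kubectl patch deviceconfigs -n kube-amd-gpu default --type merge \
  -p '{"spec":{"driver":{"enable":true,"version":"6.2.2"}}}'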

Debugging Driver Installation#

If the AMD GPU driver build fails:

  • Check the status of the build pod:

kubectl get pods -n kube-amd-gpu

  • View the build pod logs:

kubectl logs -n kube-amd-gpu <build-pod-name>

  • Check events for more information:

kubectl get events -n kube-amd-gpu
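
To narrow the output down to the failing build pod, kubectl describe and an event field selector can help (the pod name is a placeholder):

# Show detailed status, including init-container and image-pull failures
kubectl describe pod -n kube-amd-gpu <build-pod-name>

# Show only events that reference the build pod, newest last
kubectl get events -n kube-amd-gpu --field-selector involvedObject.name=<build-pod-name> --sort-by='.lastTimestamp'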

Using Techsupport-dump Tool#

The techsupport_dump.sh script can be used to collect system state and logs for debugging:

./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>

Options:

  • -w: wide option

  • -o yaml/json: output format (default: json)

  • -k kubeconfig: path to kubeconfig (default: ~/.kube/config)
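
For example, to collect a YAML-formatted bundle with the wide option from a single node, or a JSON bundle from all nodes (the node name node-1 is a placeholder):

./tools/techsupport_dump.sh -w -o yaml node-1
./tools/techsupport_dump.sh -o json all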

Please file an issue with the collected techsupport bundle on our GitHub Issues page.