Troubleshooting
This guide provides steps to diagnose and resolve common issues with the AMD GPU Operator.
Checking Operator Status
To check the status of the AMD GPU Operator, list its pods in the kube-amd-gpu namespace:
kubectl get pods -n kube-amd-gpu
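When the namespace holds many pods, it can help to show only the ones that need attention. The following is a minimal sketch, not part of the operator tooling: the function name `unhealthy_pods` and the `NAMESPACE` variable are hypothetical, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).

```shell
# Namespace used throughout this guide; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"

# Print "<pod-name> <status>" for every pod whose STATUS column is
# neither Running nor Completed. (Hypothetical helper for illustration.)
unhealthy_pods() {
  # --no-headers drops the header row; column 3 is STATUS.
  kubectl get pods -n "$NAMESPACE" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Calling `unhealthy_pods` prints nothing when every pod is Running or Completed. Any pod it does print is a reasonable starting point for the log and event checks in the sections that follow.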
Collecting Logs
To collect logs from an AMD GPU Operator pod, replace <pod-name> with a name from the pod listing:
kubectl logs -n kube-amd-gpu <pod-name>
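To capture logs from every pod in the namespace at once, the command above can be wrapped in a loop. This is a sketch under stated assumptions: the function name `collect_logs`, the `NAMESPACE` variable, and the default output directory `./operator-logs` are all hypothetical choices for illustration.

```shell
# Namespace used throughout this guide; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"

# Save the logs of every pod in the namespace to <out_dir>/<pod-name>.log.
# (Hypothetical helper; the output directory name is an assumption.)
collect_logs() {
  out_dir="${1:-./operator-logs}"
  mkdir -p "$out_dir"
  # "-o name" prints one "pod/<name>" entry per line.
  kubectl get pods -n "$NAMESPACE" -o name | while read -r pod; do
    # Strip the "pod/" prefix to build the file name.
    kubectl logs -n "$NAMESPACE" --all-containers=true "$pod" \
      > "$out_dir/${pod#pod/}.log" 2>&1
  done
}
```

For a container that has crashed and restarted, `kubectl logs --previous` returns the logs of the previous instance, which often contain the actual failure.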
Debugging Driver Installation
If the AMD GPU driver build fails:
- Check the status of the build pod: 
kubectl get pods -n kube-amd-gpu
- View the build pod logs: 
kubectl logs -n kube-amd-gpu <build-pod-name>
- Check events for more information: 
kubectl get events -n kube-amd-gpu
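The event list can be long, so narrowing it to warnings and sorting by time may make a failed build easier to spot. The sketch below uses standard `kubectl` flags (`--field-selector`, `--sort-by`, `describe pod`); the function names and the `NAMESPACE` variable are hypothetical.

```shell
# Namespace used throughout this guide; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"

# Print only Warning events, sorted by their last timestamp.
warning_events() {
  kubectl get events -n "$NAMESPACE" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Show container states and recent events for a single pod,
# for example the build pod found in the first step.
describe_pod() {
  kubectl describe pod -n "$NAMESPACE" "$1"
}
```

For example, `describe_pod <build-pod-name>` shows why a build pod is stuck in a pending or error state when its logs are empty.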
Using Techsupport-dump Tool
The techsupport-dump tool collects system state and logs for debugging, either from a single node or from all nodes:
./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>
Options:
- -w: wide option
- -o yaml/json: output format (default: json)
- -k kubeconfig: path to kubeconfig (default: ~/.kube/config)
Arguments:
- <node-name/all>: name of the node to collect from, or all to collect from every node
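As an illustration of how the options combine, the sketch below wraps the script in a small function. It relies only on the flags listed above; the function name `dump_node` and the variables `TECHSUPPORT` and `KUBECONFIG_PATH` are hypothetical, and the script path assumes you run it from the repository root.

```shell
# Path to the script, relative to the repository root (assumption).
TECHSUPPORT="${TECHSUPPORT:-./tools/techsupport_dump.sh}"
# Kubeconfig to pass with -k; defaults to the tool's documented default.
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/config}"

# Collect a YAML-format dump for one node, or for all nodes when no
# node name is given. (Hypothetical wrapper for illustration.)
dump_node() {
  node="${1:-all}"
  "$TECHSUPPORT" -o yaml -k "$KUBECONFIG_PATH" "$node"
}
```

Running `dump_node <node-name>` is then equivalent to `./tools/techsupport_dump.sh -o yaml -k ~/.kube/config <node-name>`, and `dump_node` with no argument collects from all nodes.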