Troubleshooting#
This guide provides steps to diagnose and resolve common issues with the AMD GPU Operator.
Checking Operator Status#
To check the status of the AMD GPU Operator:
kubectl get pods -n kube-amd-gpu
Collecting Logs#
To collect logs from the AMD GPU Operator:
kubectl logs -n kube-amd-gpu <pod-name>
Debugging Driver Installation#
If the AMD GPU driver build fails:
Check the status of the build pod:
kubectl get pods -n kube-amd-gpu
View the build pod logs:
kubectl logs -n kube-amd-gpu <build-pod-name>
Check events for more information:
kubectl get events -n kube-amd-gpu
Using Techsupport-dump Tool#
The techsupport-dump tool can be used to collect system state and logs for debugging:
./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>
Options:
-w
: wide option-o yaml/json
: output format (default: json)-k kubeconfig
: path to kubeconfig (default: ~/.kube/config)