Troubleshooting Device Metrics Exporter#
This topic provides an overview of troubleshooting options for Device Metrics Exporter.
Techsupport Collection#
K8s Techsupport Collection#
The techsupport-dump script can be used to collect system state and logs for debugging:
# ./techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] [-r helm-release-name] <node-name/all>
Options:
-w
: wide option-o yaml/json
: output format (default: json)-k kubeconfig
: path to kubeconfig (default: ~/.kube/config)-r helm-release-name
: helm release name
Docker Techsupport Collection#
docker exec -it device-metrics-exporter metrics-exporter-ts.sh
docker cp device-metrics-exporter:/var/log/amd-metrics-exporter-techsupport-<timestamp>.tar.gz .
Debian Techsupport Collection#
sudo metrics-exporter-ts.sh
Please file an issue with collected techsupport bundle on our GitHub Issues page
Logs#
You can view the container logs by executing the following command:
K8s deployment#
kubectl logs -n <namespace> <exporter-container-on-node>
Docker deployment#
docker logs device-metrics-exporter
Debian deployment#
sudo journalctl -xu amd-metrics-exporter
Common Issues#
This section describes common issues with AMD Device Metrics Exporter
Port conflicts:
Verify port 5000 is available
Configure an alternate port through the configuration file
Device access:
Ensure proper permissions on
/dev/dri
and/dev/kfd
Verify ROCm is properly installed
Metric collection issues:
Check GPU driver status
Verify ROCm version compatibility