Automatic Unhealthy Device Test#
Automatic Unhealthy Device Test trigger#
Test runner is periodically watching for the device health status from device metrics exporter per 30 seconds. Once exporter reported GPU status is unhealthy, test runner will start to run one-time test on the unhealthy GPU. The test result will be exported as Kubernetes event.
Configure test runner#
To start the Test Runner along with the GPU Operator, Device Metrics Exporter must be enabled since Test Runner is depending on the exported health status. Configure the spec/metricsExporter/enable
field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter and configure the spec/testRunner/enable
field in deviceconfig Custom Resource(CR) to enable/disable test runner.
# Specify the metrics exporter config
metricsExporter:
# To enable/disable the metrics exporter, disabled by default
enable: True
# kubernetes service type for metrics exporter, clusterIP(default) or NodePort
serviceType: "NodePort"
# Node port for metrics exporter service, metrics endpoint $node-ip:$nodePort
nodePort: 32500
# image for the metrics-exporter container
image: "rocm/device-metrics-exporter:v1.2.0"
# Specify the test runner config
testRunner:
# To enable/disable the test runner, disabled by default
enable: true
# image for the test runner container
image: docker.io/rocm/test-runner:v1.2.0-beta.0
# specify the mount for test logs
logsLocation:
# mount path inside test runner container
mountPath: "/var/log/amd-test-runner"
# host path to be mounted into test runner container
hostPath: "/var/log/amd-test-runner"
Check test runner pod logs#
kube-amd-gpu test-deviceconfig-metrics-exporter-6g286 1/1 Running 0 18m
kube-amd-gpu test-deviceconfig-test-runner-r9gjr 1/1 Running 0 18m
Once device metrics exporter and test runner were brought up by applying the corresponding deviceconfig Custom Resource(CR), the test runner pod logs can be viewed by running kubectl logs
command, for the above example it is:
kubectl logs -n kube-amd-gpu test-deviceconfig-test-runner-r9gjr
Check test running node labels#
When the test is ongoing the corresponding label will be added to the node resource: "amd.testrunner.gpu_health_check.gst_single": "running"
, the test running label will be removed once the test completed.
Check test result event#
The test runner generated event can be found from the operator’s namespace:
$ kubectl get events -n kube-amd-gpu
LAST SEEN TYPE REASON OBJECT MESSAGE
8m8s Normal TestFailed pod/test-runner-manual-trigger-c4hpw [{"number":1,"suitesResult":{"42924":{"gpustress-3000-dgemm-false":"success","gpustress-41000-fp32-false":"failure","gst-1215Tflops-4K4K8K-rand-fp8":"failure","gst-8096-150000-fp16":"success"}}}]
Test runner generated event can be retrieved by filtering the source component: kubectl get events -n kube-amd-gpu -o=jsonpath='{.items[?(@.source.component=="amd-test-runner")]}'
Here is an example event resource:
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2025-02-07T17:39:50Z",
"involvedObject": {
"kind": "Pod",
"name": "gpu-operator-test-runner-58jrf",
"namespace": "kube-amd-gpu"
},
"kind": "Event",
"lastTimestamp": "2025-02-07T17:39:50Z",
"message": "[{\"number\":1,\"suitesResult\":{\"35824\":{\"gpustress-3000-dgemm-false\":\"success\",\"gpustress-41000-fp32-false\":\"failure\",\"gst-1215Tflops-4K4K8K-rand-fp8\":\"failure\",\"gst-8096-150000-fp16\":\"success\"}},\"status\":\"completed\"}]",
"metadata": {
"creationTimestamp": "2025-02-07T17:39:50Z",
"generateName": "amd-test-runner-gpu_health_check-",
"labels": {
"testrunner.amd.com/category": "gpu_health_check",
"testrunner.amd.com/gpu.id.0": "35824",
"testrunner.amd.com/gpu.kfd.35824": "0",
"testrunner.amd.com/hostname": "leto",
"testrunner.amd.com/recipe": "gst_single",
"testrunner.amd.com/trigger": "auto_unhealthy_gpu_watch"
},
"managedFields": [
{
"apiVersion": "v1",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:count": {},
"f:firstTimestamp": {},
"f:involvedObject": {},
"f:lastTimestamp": {},
"f:message": {},
"f:metadata": {
"f:generateName": {},
"f:labels": {
".": {},
"f:testrunner.amd.com/category": {},
"f:testrunner.amd.com/gpu.id.0": {},
"f:testrunner.amd.com/gpu.kfd.35824": {},
"f:testrunner.amd.com/hostname": {},
"f:testrunner.amd.com/recipe": {},
"f:testrunner.amd.com/trigger": {}
}
},
"f:reason": {},
"f:source": {
"f:component": {},
"f:host": {}
},
"f:type": {}
},
"manager": "amd-test-runner",
"operation": "Update",
"time": "2025-02-07T17:39:50Z"
}
],
"name": "amd-test-runner-gpu_health_check-2266p",
"namespace": "kube-amd-gpu",
"resourceVersion": "779260",
"uid": "1993b984-6806-403a-a25c-2c135828e6c2"
},
"reason": "TestFailed",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "amd-test-runner",
"host": "leto"
},
"type": "Warning"
}
involvedObject
shows which test runner pod executed the test and generated the event.lastTimestamp
shows the time when event was generated.message
contains detailed test result, which can be further passsed into json format:
$ kubectl get events -o=jsonpath='{.items[?(@.source.component=="amd-test-runner")]}' -n kube-amd-gpu | jq -r .message | jq .
[
{
"number": 1,
"suitesResult": {
"35824": {
"gpustress-3000-dgemm-false": "success",
"gpustress-41000-fp32-false": "failure",
"gst-1215Tflops-4K4K8K-rand-fp8": "failure",
"gst-8096-150000-fp16": "success"
}
},
"status": "completed"
}
]
in the above example 35824
is the GPU’s KFD ID reported by amd-smi (in rocm-smi it is named as GUID), gpustress-3000-dgemm-false
represents a specific test action and success
is the action’s result.
metadata
contains the basic information of event.generateName
is the common prefix of the generated event name, which is in the format ofamd-test-runner-<test category>-
.name
is the generated event name, which is in the format of<generatedName>-<generated unique suffix>
namespace
is the namespace where the test runner is running.labels
contains all the information regarding the event and is helpful for filtering the event."labels": { "testrunner.amd.com/category": "gpu_health_check", "testrunner.amd.com/gpu.id.0": "35824", "testrunner.amd.com/gpu.kfd.35824": "0", "testrunner.amd.com/hostname": "leto", "testrunner.amd.com/recipe": "gst_single", "testrunner.amd.com/trigger": "auto_unhealthy_gpu_watch" },
testrunner.amd.com/category
is which kind of test the event is related to.testrunner.amd.com/trigger
is how the test was triggered.testrunner.amd.com/recipe
is the test recipe name.testrunner.amd.com/hostname
is the name of the host where the test happened.testrunner.amd.com/gpu.id.X
shows which GPU was involved andX
is the GPU index number, the value is corresponding GPU KFD ID.testrunner.amd.com/gpu.kfd.Y
shows which GPU was involved andY
is the GPU KFD ID, the value is corresponding GPU index number.
reason
gives an overall result of the whole test run, it could beTestPassed
,TestFailed
orTestTimedOut
.source
shows where the event came from, including component nameamd-test-runner
and worker node’s host name.type
classifies the event into different level. For test runner generated event,TestPassed
events are assigned withNormal
event type whileTestFailed
andTestTimedOut
events are assigned withWarning
event type.
Advanced Configuration - ConfigMap#
You can provide a config map to specify test recipe details for the test runner. Create the config map then specify the config map name in the deviceconfig Custom Resource(CR) for test runner to pick up the config.
Note
If you want to update the config for test runner on the fly for the auto_unhealthy_gpu_watch
trigger, directly update the configmap then the test runner can pick up the new config. After reading the new config, test runner’s ongoing test won’t be interrupted and still going with old config. The new config will be applied to the next test run.
Here is an example config map:
apiVersion: v1
kind: ConfigMap
metadata:
name: test-runner-config-map
namespace: kube-amd-gpu
data:
config.json: |
{
"TestConfig": {
"GPU_HEALTH_CHECK": {
"TestLocationTrigger": {
"global": {
"TestParameters": {
"AUTO_UNHEALTHY_GPU_WATCH": {
"TestCases": [
{
"Recipe": "gst_single",
"Iterations": 1,
"StopOnFailure": true,
"TimeoutSeconds": 600
}
]
}
}
},
"node1": {
"TestParameters": {
"AUTO_UNHEALTHY_GPU_WATCH": {
"TestCases": [
{
"Recipe": "mem",
"Iterations": 1,
"StopOnFailure": true,
"TimeoutSeconds": 600
}
]
}
}
}
}
}
}
}
The name of the config map should be put under the deviceconfigs Custom Resource’s spec/testRunner/config/name
field:
spec:
testRunner:
enable: true
config:
name: test-runner-config-map
Config map explanation:
TestCategory:
GPU_HEALTH_CHECK
indicates all the config under this category is working for checking GPU device health status.Global and node specific config:
Under
TestLocationTrigger
there areglobal
configs and host specific configs. Worker node whose name isnode1
will pick up the configs undernode1
, other worker nodes will pick up the configs underglobal
.Test Triggers:
Under
TestParameters
there is a map from test trigger to specific test case configs, which means the configs are setup for the corresponding test triggers (AUTO_UNHEALTHY_GPU_WATCH
,MANUAL
andPRE_START_JOB_CHECK
)Test Cases:
Note
Test case is a list under each test trigger, in the current release only the first test case in the list will be executed.
Recipe:
String. Please check the Appendix for more details about all available test recipes. Default test recipe is
gst_single
.Iterations:
Positive integer. Number of iterations to run for each test run. Default value is
1
.StopOnFailure:
Boolean. If any iteration of test run failed, whether to stop the entire test run or not. Default is
true
.TimeoutSeconds:
Positive integer. Specifies the timeout for each test iteration. The default value is
600
seconds. If the running time exceeds this limit, the current test iteration will be terminated.DeviceIDs (Only works for
manual
andpre-start-job-check
test trigger):List of string for GPU 0-inedxed ID. A selector to filter which GPU would run the test. For example, if there are 2 GPUs the GPU ID would be 0 and 1. To select GPU0 to run the test only, please configure the DeviceIDs:
{ "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 600, "DeviceIDs": ["0"] }
If the DeviceIDs list is empty or not specified, all GPUs will be selected.
If the DeviceIDs list is specified and all the IDs in the list are invalid, test runner process would exit with error status.