Test Runner Overview
The test runner component offers hardware validation, diagnostics, and benchmarking capabilities for your GPU worker nodes. The new capabilities include:
Automatically triggering configurable tests on unhealthy GPUs
Scheduling or manually triggering tests within the Kubernetes cluster
Running pre-start job tests as init containers within your GPU workload pods to ensure GPU health and stability before long-running jobs execute
Reporting test results as Kubernetes events
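The pre-start pattern above relies on standard Kubernetes behavior: init containers run to completion before the main containers start, so a failing test keeps the workload from launching. The following is a minimal sketch of that pattern, not the test runner's documented configuration. The function name, pod name, and both image references are hypothetical placeholders, and the `amd.com/gpu` resource name assumes the AMD GPU device plugin is installed in the cluster.

```python
import json


def build_pod_with_gpu_precheck(workload_image, test_image, gpu_count=1):
    """Build a Pod manifest whose init container must succeed before the workload starts.

    All names and images are illustrative placeholders, not values
    taken from the test runner documentation.
    """
    # Assumed resource name exposed by the AMD GPU device plugin.
    gpu_limits = {"amd.com/gpu": gpu_count}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "gpu-workload-with-precheck"},
        "spec": {
            # Do not restart the pod automatically if the pre-start test fails.
            "restartPolicy": "Never",
            # Init containers run to completion before "containers" start.
            "initContainers": [
                {
                    "name": "gpu-precheck",
                    "image": test_image,
                    "resources": {"limits": dict(gpu_limits)},
                }
            ],
            "containers": [
                {
                    "name": "workload",
                    "image": workload_image,
                    "resources": {"limits": dict(gpu_limits)},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_with_gpu_precheck(
        workload_image="example.com/my-training-job:latest",  # hypothetical image
        test_image="example.com/test-runner:latest",  # hypothetical image
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`. The actual test runner image, its arguments, and any required volume mounts are not shown here and should be taken from the test runner's configuration documentation.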
Under the hood, the test runner uses the ROCm Validation Suite (RVS) and the AMD GPU Field Health Check (AGFHC) toolkit to run a range of tests, including GPU stress tests, PCIe bandwidth benchmarks, memory tests, and, if desired, longer burn-in tests.
Note
The public test runner image only supports executing RVS tests.
The AGFHC toolkit is NOT publicly accessible and requires special authorization. It can be used not only with the test runner but also in various other workflows. For more details, see the Instinct documentation website.
To access the full test runner image, which includes both RVS and the AGFHC toolkit, please contact your AMD representative to complete the authorization process.