AGFHC (AMD GPU Field Health Check) Support#
In addition to RVS (ROCm Validation Suite), the test runner supports AGFHC (AMD GPU Field Health Check) to ensure the health of AMD GPUs in production environments.
Triggering AGFHC Tests#
To support more than one test framework, the test runner allows you to specify the test framework in the config.json
file.
Example Config Map to use AGFHC test framework:
apiVersion: v1
kind: ConfigMap
metadata:
name: agfhc-config-map
namespace: default
data:
config.json: |
{
"TestConfig": {
"GPU_HEALTH_CHECK": {
"TestLocationTrigger": {
"global": {
"TestParameters": {
"MANUAL": {
"TestCases": [
{
"Framework": "AGFHC",
"Recipe": "all_lvl1",
"Iterations": 1,
"StopOnFailure": true,
"TimeoutSeconds": 2400,
"DeviceIDs": ["0", "1"],
"Arguments": "--ignore-dmesg,--disable-sysmon"
}
]
}
}
}
}
}
}
}
The default framework is RVS if not specified, but you can switch to AGFHC by setting the Framework
field to AGFHC
in the TestCases
section of the config.json
. The Recipe
field specifies the test suite to run from the specified framework. You can supply additional optional arguments to the test cases using the Arguments
field. At present, only 1 testcase
can be run at a time.
Please refer to the AGFHC documentation for available test recipes and additional configuration options.
Recipes#
For example, for MI300X GPUs, the following test recipes are currently available:
Name |
Title |
Path |
---|---|---|
all_burnin_12h |
A ~12h check across system |
/opt/amd/agfhc/recipes/mi300x/all_burnin_12h.yml |
all_burnin_24h |
A ~24h check across system |
/opt/amd/agfhc/recipes/mi300x/all_burnin_24h.yml |
all_burnin_4h |
A ~4h check across system |
/opt/amd/agfhc/recipes/mi300x/all_burnin_4h.yml |
all_lvl1 |
A ~5m check across system |
/opt/amd/agfhc/recipes/mi300x/all_lvl1.yml |
all_lvl2 |
A ~10m check across system |
/opt/amd/agfhc/recipes/mi300x/all_lvl2.yml |
all_lvl3 |
A ~30m check across system |
/opt/amd/agfhc/recipes/mi300x/all_lvl3.yml |
all_lvl4 |
A ~1h check across system |
/opt/amd/agfhc/recipes/mi300x/all_lvl4.yml |
all_lvl5 |
A ~2h check across system |
/opt/amd/agfhc/recipes/mi300x/all_lvl5.yml |
all_perf |
Run all performance based tests |
/opt/amd/agfhc/recipes/mi300x/all_perf.yml |
dma_lvl1 |
A ~5m DMA workload |
/opt/amd/agfhc/recipes/mi300x/dma_lvl1.yml |
dma_lvl2 |
A ~10m DMA workload |
/opt/amd/agfhc/recipes/mi300x/dma_lvl2.yml |
dma_lvl3 |
A ~30m DMA workload |
/opt/amd/agfhc/recipes/mi300x/dma_lvl3.yml |
dma_lvl4 |
A ~1h DMA workload |
/opt/amd/agfhc/recipes/mi300x/dma_lvl4.yml |
gfx_lvl1 |
A ~5m GFX workload |
/opt/amd/agfhc/recipes/mi300x/gfx_lvl1.yml |
gfx_lvl2 |
A ~10m GFX workload |
/opt/amd/agfhc/recipes/mi300x/gfx_lvl2.yml |
gfx_lvl3 |
A ~30m GFX workload |
/opt/amd/agfhc/recipes/mi300x/gfx_lvl3.yml |
gfx_lvl4 |
A ~1h GFX workload |
/opt/amd/agfhc/recipes/mi300x/gfx_lvl4.yml |
hbm_burnin_24h |
A ~24h extended hbm test |
/opt/amd/agfhc/recipes/mi300x/hbm_burnin_24h.yml |
hbm_burnin_8h |
A ~8h extended hbm test |
/opt/amd/agfhc/recipes/mi300x/hbm_burnin_8h.yml |
hbm_lvl1 |
A ~5m HBM workload |
/opt/amd/agfhc/recipes/mi300x/hbm_lvl1.yml |
hbm_lvl2 |
A ~10m HBM workload |
/opt/amd/agfhc/recipes/mi300x/hbm_lvl2.yml |
hbm_lvl3 |
A ~30m HBM workload |
/opt/amd/agfhc/recipes/mi300x/hbm_lvl3.yml |
hbm_lvl4 |
A ~1h HBM workload |
/opt/amd/agfhc/recipes/mi300x/hbm_lvl4.yml |
hbm_lvl5 |
A ~2h HBM workload |
/opt/amd/agfhc/recipes/mi300x/hbm_lvl5.yml |
hsio |
Run all HSIO tests once |
/opt/amd/agfhc/recipes/mi300x/hsio.yml |
pcie_lvl1 |
A ~5m PCIe workload |
/opt/amd/agfhc/recipes/mi300x/pcie_lvl1.yml |
pcie_lvl2 |
A ~10m PCIe workload |
/opt/amd/agfhc/recipes/mi300x/pcie_lvl2.yml |
pcie_lvl3 |
A ~30m PCIe workload |
/opt/amd/agfhc/recipes/mi300x/pcie_lvl3.yml |
pcie_lvl4 |
A ~1h PCIe workload |
/opt/amd/agfhc/recipes/mi300x/pcie_lvl4.yml |
rochpl_isolation |
Run rocHPL on each GPU |
/opt/amd/agfhc/recipes/mi300x/rochpl_isolation.yml |
single_pass |
Run all tests once |
/opt/amd/agfhc/recipes/mi300x/single_pass.yml |
thermal |
Verify thermal solution |
/opt/amd/agfhc/recipes/mi300x/thermal.yml |
xgmi_lvl1 |
A ~5m xGMI workload |
/opt/amd/agfhc/recipes/mi300x/xgmi_lvl1.yml |
xgmi_lvl2 |
A ~10m xGMI workload |
/opt/amd/agfhc/recipes/mi300x/xgmi_lvl2.yml |
xgmi_lvl3 |
A ~30m xGMI workload |
/opt/amd/agfhc/recipes/mi300x/xgmi_lvl3.yml |
xgmi_lvl4 |
A ~1h xGMI workload |
/opt/amd/agfhc/recipes/mi300x/xgmi_lvl4.yml |
NOTE: Each one of the aforementioned recipes could consist of multiple test cases. Execution of individual AGFHC test case is currently not supported.
Supported Partition Profile#
The Instinct GPU models could be configured with certain GPU partition profiles to execute AGFHC tests, the supported partition profiles are:
GPU Model |
Compute Partition |
Memory Partition |
Number of GPUs for testing |
---|---|---|---|
mi300a |
SPX |
NPS1 |
1 |
mi300a |
SPX |
NPS1 |
2 |
mi300a |
SPX |
NPS1 |
4 |
mi300x |
SPX |
NPS1 |
1 |
mi300x |
SPX |
NPS1 |
8 |
mi300x |
DPX |
NPS2 |
2 |
mi300x |
DPX |
NPS2 |
16 |
mi308x |
SPX |
NPS1 |
1 |
mi308x |
SPX |
NPS1 |
8 |
mi325x |
SPX |
NPS1 |
1 |
mi325x |
SPX |
NPS1 |
8 |
mi308x-hf |
SPX |
NPS1 |
1 |
mi308x-hf |
SPX |
NPS1 |
8 |
mi300x-hf |
SPX |
NPS1 |
1 |
mi300x-hf |
SPX |
NPS1 |
8 |
mi350x |
SPX |
NPS1 |
1 |
mi350x |
SPX |
NPS1 |
8 |
mi355x |
SPX |
NPS1 |
1 |
mi355x |
SPX |
NPS1 |
8 |
AGFHC arguments#
As for the AGFHC arguments, please refer to AGFHC official documents for the full list of available arguments. Here is a list of frequently used arguments:
Argument |
Description |
Default/Example |
---|---|---|
–update-interval UPDATE_INTERVAL |
Set the interval to print elapsed timing updates on the console. |
|
–sysmon-interval SYSMON_INTERVAL |
Set to update the default sysmon interval |
|
–tar-logs |
Generate a tar file of all logs |
|
–disable-sysmon |
Set to disable system monitoring data collection. |
Default: enabled |
–disable-numa-control |
Set to disable control of numa balancing. |
Default: enabled |
–disable-ras-checks |
Set to disable ras checks. |
Default: enabled |
–disable-bad-pages-checks |
Set to disable bad pages checks. |
Default: enabled |
–disable-dmesg-checks |
Set to disable dmesg checks. |
Default: enabled |
–ignore-dmesg |
Set to ignore dmesg fails, logs will still be created. |
Default: dmesg fails enabled |
–ignore-ras |
Set to ignore ras fails, logs will still be created. |
Default: ras fails enabled |
–ignore-performance |
Set to ignore performance to skip the performance analysis and perform only RAS/dmesg checks. |
Default: performance analysis enabled |
–known-dmesg-only |
Do not fail on any unknown dmesg, but mark them as expected. |
Default: any unknown dmesg fails |
–disable-hsio-gather |
Set to disable hsio gather. |
Default: enabled |
Kubernetes events#
Upon successful execution of the AGFHC test recipe, the results are output as Kubernetes events. You can view these events using the following command:
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
113s Normal TestPassed pod/test-runner-manual-trigger-hqb7b [{"number":1,"suitesResult":{"0":{"gfx_dgemm":"success","hbm_bw":"success","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"},"1":{"gfx_dgemm":"success","hbm_bw":"success","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"}},"status":"completed"}]
If the test fails, the event will indicate a failure status.
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
63s Warning TestFailed pod/test-runner-manual-trigger-fs64h [{"number":1,"suitesResult":{"0":{"gfx_dgemm":"success","hbm_bw":"success","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"},"2":{"gfx_dgemm":"success","hbm_bw":"failure","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"}},"status":"completed"}]
Log export#
By default, test execution logs are saved to /var/log/amd-test-runner/
on the host. Log export functionality is also supported, similar to RVS. AGFHC provides more detailed logs than RVS and all the logs provided by the framework are included in the tarball.