System validation tests#
The validation tests in this section verify that a system is operating
correctly. They use the ROCm Validation Suite (rvs), a collection of tests,
benchmarks, and qualification tools, each targeting a specific subsystem of
the system under test (SUT).
If rvs is not already installed on the SUT, install it with the following command (Ubuntu):
sudo apt install rocm-validation-suite
Then ensure that the directory containing the rvs executable, /opt/rocm/bin,
is on your PATH. Use the following command:
export PATH=$PATH:/opt/rocm/bin
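As a quick sanity check after updating PATH, a small shell helper can confirm that the rvs executable resolves. This is a minimal sketch; the function name require_rvs is hypothetical and not part of rvs.

```shell
# Hypothetical helper (not part of rvs): report whether an rvs executable
# resolves on the current PATH.
require_rvs() {
  if command -v rvs >/dev/null 2>&1; then
    echo "rvs found at: $(command -v rvs)"
  else
    echo "rvs not found on PATH; check that /opt/rocm/bin is included" >&2
    return 1
  fi
}

# Usage: require_rvs || exit 1
```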
The rvs tool consists of modules, each implementing a particular test
functionality. The modules can be broadly categorized as targeting the
following aspects of the hardware platform:
Compute / GPU
Memory
IO / PCIe
Each of these categories runs a subset of rvs modules to validate that the
category is working as expected.
The standard way to run an rvs module is by providing a configuration file
prefixed with the -c option. When rvs is installed properly on the SUT, the conf
files are found in the /opt/rocm/share/rocm-validation-suite/conf/ folder. Since
this path is part of every rvs command in this document, an environment
variable is defined and used in place of the long path in commands and their
output. To set this variable in the environment, run the following command:
export RVS_CONF=/opt/rocm/share/rocm-validation-suite/conf
The configuration files section of the ROCm Validation Suite User Guide
provides a detailed description of the conf file, its format, and its keys.
It’s recommended to become familiar with the conf file before running the rvs
tests described in the following sections. Be aware that some conf files are
included in product-specific subfolders (for instance,
/opt/rocm/share/rocm-validation-suite/conf/MI300X). If present, always use
GPU-specific configurations instead of the default test configurations.
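The rule of preferring a GPU-specific conf file can be sketched as a small shell helper. The function name pick_conf and its argument order are hypothetical; the sketch only assumes the directory layout described above (a product subfolder such as MI300X under the conf folder).

```shell
# Hypothetical helper (not part of rvs): print the GPU-specific conf file if
# it exists, otherwise fall back to the default conf file of the same name.
# Arguments: <conf root> <product subfolder, e.g. MI300X> <conf file name>
pick_conf() {
  root=$1 product=$2 name=$3
  if [ -f "$root/$product/$name" ]; then
    echo "$root/$product/$name"
  else
    echo "$root/$name"
  fi
}

# Usage: rvs -c "$(pick_conf "$RVS_CONF" MI300X gst_single.conf)"
```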
In the following subsections, under each of the categories, the relevant rvs
test modules are listed along with descriptions of how the category is
validated. Example rvs commands with the expected output are also provided.
Most of the rvs tests do not report strict PASS / FAIL conditions. Instead,
when they are run on the SUT, the observed output is expected to fall within
the reasonable range provided.
Compute / GPU#
rvs has three types of modules to validate the Compute subsystem:
Properties
Benchmark / Stress / Qualification
Monitor
MI300X GPU accelerators have many architectural features. Similar to the check
described in the Check GPU presence (lspci) section, rvs has an option to
display all MI300X GPU accelerators present in the SUT. Before proceeding with
the modules below, run the following command to make sure all the GPUs are
seen with their correct PCIe properties.
Command:
rvs -g
Expected output:
ROCm Validation Suite (version 0.0.60202)
Supported GPUs available:
0000:05:00.0 - GPU[ 2 - 28851] AMD Instinct MI300X (Device 29857)
0000:26:00.0 - GPU[ 3 - 23018] AMD Instinct MI300X (Device 29857)
0000:46:00.0 - GPU[ 4 - 29122] AMD Instinct MI300X (Device 29857)
0000:65:00.0 - GPU[ 5 - 22683] AMD Instinct MI300X (Device 29857)
0000:85:00.0 - GPU[ 6 - 53458] AMD Instinct MI300X (Device 29857)
0000:a6:00.0 - GPU[ 7 - 63883] AMD Instinct MI300X (Device 29857)
0000:c6:00.0 - GPU[ 8 - 53667] AMD Instinct MI300X (Device 29857)
0000:e5:00.0 - GPU[ 9 - 63738] AMD Instinct MI300X (Device 29857)
Result:
PASSED: All 8 GPUs are seen in the output
FAILED: Otherwise
Action: Don’t proceed further. Debug the issue of not being able to see all GPUs.
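The PASSED condition above can be checked mechanically by counting the GPU entries in the output. This is a minimal sketch: the function name check_gpu_count is hypothetical, and it assumes each GPU entry contains the text "GPU[" as in the expected output shown.

```shell
# Hypothetical helper (not part of rvs): count the GPU entries in `rvs -g`
# output read from stdin and compare with the expected count (default 8).
check_gpu_count() {
  expected=${1:-8}
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  found=$(grep -c 'GPU\[' || true)
  echo "GPUs listed: $found (expected $expected)"
  [ "$found" -eq "$expected" ]
}

# Usage: rvs -g | check_gpu_count 8
```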
Properties#
The GPU Properties module queries the configuration of a targeted GPU and returns the device’s static characteristics. These static values can be used to debug issues such as device support, performance and firmware problems.
To confirm the architectural properties of the GPU, use the GPUP module, which uses the GPUP configuration file.
The configuration file for the GPUP module is located at ${RVS_CONF}/gpup_single.conf.
The GPUP module section of the ROCm Validation Suite User Guide provides a
detailed description of the GPUP conf file, its format, and its keys.
Command:
rvs -c ${RVS_CONF}/gpup_single.conf
Expected output (truncated):
The conf file has six test cases, RVS-GPUP-TC1, RVS-GPUP-TC2, and so on up to
RVS-GPUP-TC6. Only a truncated version of the output of RVS-GPUP-TC1 is shown
here. The other tests are modified versions of RVS-GPUP-TC1, which display a
subset of properties and/or a subset of io_links-properties.
The first block of output displays the properties (all):
[RESULT] [ 54433.732433] Action name :RVS-GPUP-TC1
[RESULT] [ 54433.733858] Module name :gpup
[RESULT] [ 54433.733992] [RVS-GPUP-TC1] gpup 28851 cpu_cores_count 0
[RESULT] [ 54433.733994] [RVS-GPUP-TC1] gpup 28851 simd_count 1216
...
[RESULT] [ 54433.734018] [RVS-GPUP-TC1] gpup 28851 num_xcc 8
[RESULT] [ 54433.734018] [RVS-GPUP-TC1] gpup 28851 max_engine_clk_ccompute 3250
The block below shows only one of the io_link-properties of the eight GPUs
(0 to 7):
[RESULT] [ 96878.647964] [RVS-GPUP-TC1] gpup 28851 0 type 2
[RESULT] [ 96878.647973] [RVS-GPUP-TC1] gpup 28851 0 version_major 0
[RESULT] [ 96878.647982] [RVS-GPUP-TC1] gpup 28851 0 version_minor 0
[RESULT] [ 96878.647990] [RVS-GPUP-TC1] gpup 28851 0 node_from 2
[RESULT] [ 96878.647997] [RVS-GPUP-TC1] gpup 28851 0 node_to 0
[RESULT] [ 96878.648013] [RVS-GPUP-TC1] gpup 28851 0 weight 20
[RESULT] [ 96878.648020] [RVS-GPUP-TC1] gpup 28851 0 min_latency 0
[RESULT] [ 96878.648029] [RVS-GPUP-TC1] gpup 28851 0 max_latency 0
[RESULT] [ 96878.648037] [RVS-GPUP-TC1] gpup 28851 0 min_bandwidth 312
[RESULT] [ 96878.648045] [RVS-GPUP-TC1] gpup 28851 0 max_bandwidth 64000
[RESULT] [ 96878.648053] [RVS-GPUP-TC1] gpup 28851 0 recommended_transfer_size 0
[RESULT] [ 96878.648060] [RVS-GPUP-TC1] gpup 28851 0 flags 1
Result:
PASSED: If generated output looks similar
FAILED: If any GPU is not listed in output or ERROR tagged logs are seen
This module is not typically expected to fail.
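When comparing static properties across GPUs, it can help to pull a single value out of the GPUP log. The following is a minimal sketch; the function name gpup_prop is hypothetical, and it assumes property lines end in "gpup <GPU ID> <property> <value>" as in the output above.

```shell
# Hypothetical helper (not part of rvs): print the value of one property for
# one GPU from GPUP output read from stdin.
# Arguments: <GPU ID as shown in the log, e.g. 28851> <property name>
gpup_prop() {
  awk -v id="$1" -v prop="$2" '
    # Property lines end in: gpup <GPU ID> <property> <value>
    NF >= 4 && $(NF-3) == "gpup" && $(NF-2) == id && $(NF-1) == prop { print $NF }
  '
}

# Usage: rvs -c ${RVS_CONF}/gpup_single.conf | gpup_prop 28851 simd_count
```

The io_link lines carry an extra link index between the GPU ID and the property name, so they are not matched by this pattern.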
Benchmark, stress, qualification#
These categories of modules perform qualification of the GPU subsystem, execute stress tests, and compute and display bandwidth. The modules do not produce a PASS / FAIL result. When bandwidth is measured, the module only reports the bandwidth and doesn’t compare it with an existing set of numbers. The only exceptions are the GST and IET modules.
Benchmark#
The GPU Stress Test (GST) module stresses the GPU FLOPS performance for
SGEMM, DGEMM, and HGEMM operations and computes and displays peak
GFLOP/s. The GST module uses two configuration files: one general purpose
(gst_single.conf) and one MI300X specific (gst_ext.conf). Each is detailed
below.
The MI300X version of the general-purpose gst_single.conf configuration file
for the GST module is located at:
${RVS_CONF}/MI300X/gst_single.conf
Run the following command to perform the general GPU stress test using the
gst_single.conf config file.
Command:
rvs -c ${RVS_CONF}/MI300X/gst_single.conf
Expected output (truncated):
[RESULT] [1101980.682169] Action name :gst-1215Tflops-4K4K8K-rand-fp8
[RESULT] [1101980.683973] Module name :gst
[RESULT] [1101980.836841] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] Start of GPU ramp up
[RESULT] [1101987.830800] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1539705
[RESULT] [1101988.831928] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] End of GPU ramp up
[RESULT] [1101992.16545 ] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1640057
[RESULT] [1101995.85574 ] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1595462
[RESULT] [1101998.181333] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1687129
[RESULT] [1102001.278962] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1686102
[RESULT] [1102003.864611] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1687129
[RESULT] [1102003.864648] [gst-1215Tflops-4K4K8K-rand-fp8] [GPU:: 28851] GFLOPS 1687129 Target GFLOPS: 1215000 met: TRUE
...
Result:
PASSED: If met: TRUE is displayed in the test log for all eight GPUs and actions, it indicates the test was able to hit peak GFLOP/s which matches or exceeds the target values listed in the config file.
FAILED: Test results fail to meet the target GFLOP/s
Action: Do not proceed further. Report this issue to your system manufacturer immediately.
The MI300X specific gst_ext.conf configuration file for the GST module is
located at:
${RVS_CONF}/MI300X/gst_ext.conf
Run the following command to perform the MI300X GPU specific stress test
using the gst_ext.conf config file.
Command:
ROCBLAS_TENSILE_LIBPATH=/opt/rocm/lib/rocblas/library/ rvs -c ${RVS_CONF}/MI300X/gst_ext.conf
Expected output (truncated):
[RESULT] [603545.521766] Action name :gst-1000Tflops-8KB-fp8_r-false
[RESULT] [603545.523245] Module name :gst
[RESULT] [603545.685745] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] Start of GPU ramp up
[RESULT] [603552.11787 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235406
[RESULT] [603553.12495 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1250866
[RESULT] [603554.12557 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235406
[RESULT] [603555.12386 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] End of GPU ramp up
[RESULT] [603556.12907 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1220772
[RESULT] [603557.13180 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1221056
[RESULT] [603558.13786 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1238206
[RESULT] [603559.13885 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1231140
[RESULT] [603560.14584 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1232638
[RESULT] [603561.14988 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1237375
[RESULT] [603562.15658 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1237069
[RESULT] [603563.16277 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1237102
[RESULT] [603564.16494 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1236422
[RESULT] [603565.17256 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1236946
[RESULT] [603566.17565 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1236323
[RESULT] [603567.17654 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235515
[RESULT] [603568.17924 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235281
[RESULT] [603569.18070 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235452
[RESULT] [603570.18519 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1235085
[RESULT] [603571.18960 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1234038
[RESULT] [603572.19046 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1234418
[RESULT] [603573.19153 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1234417
[RESULT] [603574.19692 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1233895
[RESULT] [603575.20205 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1233942
[RESULT] [603576.20336 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1233328
[RESULT] [603577.20441 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1233327
[RESULT] [603578.21167 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1233693
[RESULT] [603579.21800 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1231561
[RESULT] [603580.22072 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1232009
[RESULT] [603581.22249 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1232113
[RESULT] [603582.22852 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1232700
[RESULT] [603583.23573 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1232620
[RESULT] [603584.23655 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1231152
[RESULT] [603585.12439 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1238206
[RESULT] [603585.12457 ] [gst-1000Tflops-8KB-fp8_r-false] [GPU:: 28851] GFLOPS 1238206 Target GFLOPS: 1000000 met: TRUE
...
Result:
PASSED: If “met: TRUE” is displayed in the test log for all eight GPUs, it indicates the test was able to hit peak GFLOP/s which matches or exceeds the target values listed in the config file.
FAILED: Test results fail to meet the target GFLOP/s
Action: Do not proceed further. Report this issue to your system manufacturer immediately.
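Because the GST logs are long, the met: status lines can be checked with a small filter instead of by eye. This is a minimal sketch; the function name gst_all_met is hypothetical. It treats any met: line whose last field is not TRUE as a failure, so it does not depend on the exact text printed for a missed target.

```shell
# Hypothetical helper (not part of rvs): read GST output from stdin and
# succeed only if every "met:" line ends in TRUE.
# Exit status: 0 = all met, 1 = at least one target not met, 2 = no met: lines.
gst_all_met() {
  awk '
    /met:/ {
      total++
      if ($NF != "TRUE") { bad++; print "not met: " $0 }
    }
    END {
      if (total == 0) { print "no met: lines found"; exit 2 }
      exit (bad > 0)
    }
  '
}

# Usage: rvs -c ${RVS_CONF}/MI300X/gst_single.conf | gst_all_met
```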
Stress#
The Input Energy Delay Product (EDP) test (IET) module runs GEMM workloads to stress the GPU power, that is, Total Graphics Power (TGP).
This test is used to:
Verify the GPU can handle maximum power stress for a sustained period.
Check that the GPU power reaches a set target power.
The configuration file for the IET module is located at
${RVS_CONF}/MI300X/iet_single.conf.
Command:
rvs -c ${RVS_CONF}/MI300X/iet_single.conf
The IET module runs six different actions. Each action is performed on all
eight GPUs. Each GPU power test displays a TRUE or FALSE status, as shown in
the following output example.
Expected output (truncated):
[RESULT] [1102597.157090] Action name :iet-620W-1K-rand-dgemm
[RESULT] [1102597.159274] Module name :iet
[RESULT] [1102597.333747] [iet-620W-1K-rand-dgemm] [GPU:: 28851] Power(W) 127.000000
[RESULT] [1102597.334457] [iet-620W-1K-rand-dgemm] [GPU:: 23018] Power(W) 123.000000
[RESULT] [1102597.334500] [iet-620W-1K-rand-dgemm] [GPU:: 22683] Power(W) 123.000000
...
[RESULT] [1102657.372824] [iet-620W-1K-rand-dgemm] [GPU:: 29122] pass: TRUE
[RESULT] [1102657.372859] [iet-620W-1K-rand-dgemm] [GPU:: 23018] pass: TRUE
[RESULT] [1102657.372936] [iet-620W-1K-rand-dgemm] [GPU:: 28851] pass: TRUE
[RESULT] [1102657.373301] [iet-620W-1K-rand-dgemm] [GPU:: 53458] pass: TRUE
[RESULT] [1102657.373508] [iet-620W-1K-rand-dgemm] [GPU:: 63738] pass: TRUE
[RESULT] [1102657.373620] [iet-620W-1K-rand-dgemm] [GPU:: 63883] pass: TRUE
[RESULT] [1102657.374090] [iet-620W-1K-rand-dgemm] [GPU:: 22683] pass: TRUE
[RESULT] [1102657.374158] [iet-620W-1K-rand-dgemm] [GPU:: 53667] pass: TRUE
[RESULT] [1102658.379728] Action name :iet-wait-750W-28K-rand-dgemm
[RESULT] [1102658.379781] Module name :iet
Result:
PASSED: pass: TRUE is displayed for each GPU.
FAILED: pass: FALSE is displayed for any GPU
Action: Do not proceed further. Report this issue to your system manufacturer immediately.
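Since the IET module runs several actions across all GPUs, a per-action tally of the pass: lines gives a compact view of the result. This is a minimal sketch; the function name iet_pass_counts is hypothetical, and it assumes result lines end in "[<action>] [GPU:: <ID>] pass: <status>" as in the output above.

```shell
# Hypothetical helper (not part of rvs): read IET output from stdin and print,
# for each action, how many GPUs reported pass: TRUE out of all pass: lines.
iet_pass_counts() {
  awk '
    # Result lines end in: [<action>] [GPU:: <ID>] pass: <TRUE|FALSE>
    NF >= 5 && $(NF-1) == "pass:" {
      action = $(NF-4)
      total[action]++
      if ($NF == "TRUE") passed[action]++
    }
    END {
      for (a in total) printf "%s %d/%d\n", a, passed[a], total[a]
    }
  '
}

# Usage: rvs -c ${RVS_CONF}/MI300X/iet_single.conf | iet_pass_counts
```

On a healthy eight-GPU system, each action would be expected to show 8/8. The order in which actions are printed is not guaranteed.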
Qualification#
The GPU monitor (GM) module is used to report and validate the following system attributes.
Temperature
Fan speed
Memory clock
System clock
Power
The configuration file for the GM module is located at
${RVS_CONF}/gm_single.conf.
Command:
rvs -c ${RVS_CONF}/gm_single.conf
Expected output (truncated):
[RESULT] [209228.305186] [metrics_monitor] gm 28851 temp violations 0
[RESULT] [209228.305186] [metrics_monitor] gm 28851 clock violations 0
[RESULT] [209228.305186] [metrics_monitor] gm 28851 mem_clock violations 0
[RESULT] [209228.305186] [metrics_monitor] gm 28851 fan violations 0
[RESULT] [209228.305186] [metrics_monitor] gm 28851 power violations 0
...
Result:
PASSED: If the output displays violations 0 for all five attributes for each GPU. Pipe output to grep to create a quick summary of violations.
FAILED: If any violations have a non-zero value
Action: Continue with the next step but periodically monitor by running this module.
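The quick summary of violations mentioned above can be sketched with a filter that prints only non-zero counts. The function name gm_nonzero_violations is hypothetical, and the sketch assumes summary lines end in "gm <GPU ID> <attribute> violations <count>" as in the output shown.

```shell
# Hypothetical helper (not part of rvs): read GM output from stdin and print
# every attribute with a non-zero violation count. Exits 1 if any are found.
gm_nonzero_violations() {
  awk '
    # Summary lines end in: gm <GPU ID> <attribute> violations <count>
    NF >= 5 && $(NF-1) == "violations" && $NF + 0 != 0 {
      print "GPU " $(NF-3) ": " $(NF-2) " violations = " $NF
      bad = 1
    }
    END { exit (bad + 0) }
  '
}

# Usage: rvs -c ${RVS_CONF}/gm_single.conf | gm_nonzero_violations
```

No output and an exit status of 0 correspond to the PASSED condition.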
Memory#
To validate the GPU memory subsystem, rvs has the following two types of
modules:
MEM
BABEL
MEM#
The Memory module, MEM, tests the GPU memory for hardware errors and soft errors using HIP. It consists of various tests that use algorithms like Walking 1 bit, Moving inversion and Modulo 20. The module executes the following memory tests [Algorithm, data pattern]:
Walking 1 bit
Own address test
Moving inversions, ones & zeros
Moving inversions, 8-bit pattern
Moving inversions, random pattern
Block move, 64 moves
Moving inversions, 32-bit pattern
Random number sequence
Modulo 20, random pattern
Memory stress test
The configuration file for the MEM module is located at ${RVS_CONF}/mem.conf.
Command:
rvs -c ${RVS_CONF}/mem.conf -l mem.txt
The entire output file is not shown here for brevity. Grepping the saved log
file for certain strings makes the log easier to understand. The -l mem.txt
option in the command writes the entire output to that file.
Grepping for the string “mem Test 1:” shows that Test 1 (Change one bit
memory address) is launched for each GPU.
grep "mem Test 1:" mem.txt
[RESULT] [214775.925788] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214776.112738] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214776.299030] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214776.486354] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214776.674529] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214776.865057] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214777.52685 ] [action_1] mem Test 1: Change one bit memory addresss
[RESULT] [214777.155703] [action_1] mem Test 1: Change one bit memory addresss
Grepping for the string “mem Test 1 :” shows that Test 1 passed for all GPUs.
grep "mem Test 1 :" mem.txt
[RESULT] [214775.947349] [action_1] mem Test 1 : PASS
[RESULT] [214776.134798] [action_1] mem Test 1 : PASS
[RESULT] [214776.320838] [action_1] mem Test 1 : PASS
[RESULT] [214776.509205] [action_1] mem Test 1 : PASS
[RESULT] [214776.697979] [action_1] mem Test 1 : PASS
[RESULT] [214776.888054] [action_1] mem Test 1 : PASS
[RESULT] [214777.75572 ] [action_1] mem Test 1 : PASS
[RESULT] [214777.178653] [action_1] mem Test 1 : PASS
Similarly, you can grep other strings to parse the log file easily.
Grepping for the string “bandwidth” shows the memory bandwidth perceived by each of the eight GPUs.
grep "bandwidth" mem.txt
[RESULT] [214808.291036] [action_1] mem Test 11: elapsedtime = 6390.423828 bandwidth = 2003.017090GB/s
[RESULT] [214812.175895] [action_1] mem Test 11: elapsedtime = 6387.198242 bandwidth = 2004.028564GB/s
[RESULT] [214813.999085] [action_1] mem Test 11: elapsedtime = 6400.554199 bandwidth = 1999.846802GB/s
[RESULT] [214814.406234] [action_1] mem Test 11: elapsedtime = 6397.101074 bandwidth = 2000.926392GB/s
[RESULT] [214814.583630] [action_1] mem Test 11: elapsedtime = 6388.572266 bandwidth = 2003.597534GB/s
[RESULT] [214815.176800] [action_1] mem Test 11: elapsedtime = 6378.345703 bandwidth = 2006.810059GB/s
[RESULT] [214815.384878] [action_1] mem Test 11: elapsedtime = 6404.943848 bandwidth = 1998.476196GB/s
[RESULT] [214815.419048] [action_1] mem Test 11: elapsedtime = 6416.849121 bandwidth = 1994.768433GB/s
Result:
PASSED: If all memory tests passed without memory errors and the bandwidth obtained in Test 11 is approximately 2 TB/s
FAILED: If any memory errors are reported and/or the obtained bandwidth is far below 2 TB/s
Action: Do not proceed further. Report this issue to your system manufacturer immediately.
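To judge the bandwidth criterion quickly, the lowest reported Test 11 bandwidth can be extracted from the log. This is a minimal sketch; the function name mem_min_bandwidth is hypothetical, and it assumes bandwidth lines end in a value followed by GB/s as in the output above. How close to 2 TB/s is close enough remains a judgment call.

```shell
# Hypothetical helper (not part of rvs): read the MEM log from stdin and print
# the lowest reported bandwidth value (in GB/s, without the unit).
mem_min_bandwidth() {
  awk '
    /bandwidth =/ {
      v = $NF
      sub(/GB\/s$/, "", v)          # strip the unit suffix
      if (min == "" || v + 0 < min + 0) min = v
    }
    END { if (min != "") print min }
  '
}

# Usage: mem_min_bandwidth < mem.txt
```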
BABEL#
Refer to the BabelStream section for instructions on how to run this module to test memory.
IO#
To validate the GPU interfaces, rvs has the following three types of
modules:
PEBB – PCIe Bandwidth Benchmark
PEQT – PCIe Qualification Tool
PBQT – P2P Benchmark and Qualification Tool
PEBB (PCIe Bandwidth Benchmark)#
The PCIe Bandwidth Benchmark attempts to saturate the PCIe bus with DMA transfers between system memory and a target GPU card’s memory. The maximum bandwidth obtained is reported.
The configuration file for the PEBB module is located at:
${RVS_CONF}/MI300X/pebb_single.conf
Command:
rvs -c ${RVS_CONF}/MI300X/pebb_single.conf -l pebb.txt
The PEBB module has the following tests defined in the conf file (where h2d
means host to device, d2h means device to host, xMB means random block size,
and b2b means back to back):
h2d-sequential-51MB
d2h-sequential-51MB
h2d-d2h-sequential-51MB
h2d-parallel-xMB
d2h-parallel-xMB
h2d-d2h-xMB
h2d-b2b-51MB
d2h-b2b-51MB
h2d-d2h-b2b-51MB
Each of these tests will produce the following header as part of the output log. It shows the distances between CPUs and GPUs.
Expected output (truncated):
[RESULT] [1103843.610745] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 2 - 28851 - 0000:05:00.0] distance:20 PCIe:20
[RESULT] [1103843.610763] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 2 - 28851 - 0000:05:00.0] distance:52 PCIe:52
[RESULT] [1103843.610771] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 3 - 23018 - 0000:26:00.0] distance:20 PCIe:20
[RESULT] [1103843.610778] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 3 - 23018 - 0000:26:00.0] distance:52 PCIe:52
[RESULT] [1103843.610787] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 4 - 29122 - 0000:46:00.0] distance:20 PCIe:20
[RESULT] [1103843.610795] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 4 - 29122 - 0000:46:00.0] distance:52 PCIe:52
[RESULT] [1103843.610802] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 5 - 22683 - 0000:65:00.0] distance:20 PCIe:20
[RESULT] [1103843.610810] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 5 - 22683 - 0000:65:00.0] distance:52 PCIe:52
[RESULT] [1103843.610817] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 6 - 53458 - 0000:85:00.0] distance:52 PCIe:52
[RESULT] [1103843.610825] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 6 - 53458 - 0000:85:00.0] distance:20 PCIe:20
[RESULT] [1103843.610833] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 7 - 63883 - 0000:a6:00.0] distance:52 PCIe:52
[RESULT] [1103843.610841] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 7 - 63883 - 0000:a6:00.0] distance:20 PCIe:20
[RESULT] [1103843.610848] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 8 - 53667 - 0000:c6:00.0] distance:52 PCIe:52
[RESULT] [1103843.610856] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 8 - 53667 - 0000:c6:00.0] distance:20 PCIe:20
[RESULT] [1103843.610863] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 0] [GPU:: 9 - 63738 - 0000:e5:00.0] distance:52 PCIe:52
[RESULT] [1103843.610871] [d2h-sequential-64MB] pcie-bandwidth [CPU:: 1] [GPU:: 9 - 63738 - 0000:e5:00.0] distance:20 PCIe:20
The other half of the output for each test shows the transfer bandwidth and indicates whether it’s a bidirectional or unidirectional transfer.
[RESULT] [1103903.617888] [d2h-sequential-64MB] pcie-bandwidth [ 1/16] [CPU:: 0] [GPU:: 2 - 28851 - 0000:05:00.0] h2d::false d2h::true 56.298 GBps ...
[RESULT] [1103903.617971] [d2h-sequential-64MB] pcie-bandwidth [ 2/16] [CPU:: 1] [GPU:: 2 - 28851 - 0000:05:00.0] h2d::false d2h::true 55.664 GBps ...
[RESULT] [1103903.617982] [d2h-sequential-64MB] pcie-bandwidth [ 3/16] [CPU:: 0] [GPU:: 3 - 23018 - 0000:26:00.0] h2d::false d2h::true 56.304 GBps ...
[RESULT] [1103903.617993] [d2h-sequential-64MB] pcie-bandwidth [ 4/16] [CPU:: 1] [GPU:: 3 - 23018 - 0000:26:00.0] h2d::false d2h::true 56.318 GBps ...
[RESULT] [1103903.618009] [d2h-sequential-64MB] pcie-bandwidth [ 5/16] [CPU:: 0] [GPU:: 4 - 29122 - 0000:46:00.0] h2d::false d2h::true 56.318 GBps ...
[RESULT] [1103903.618019] [d2h-sequential-64MB] pcie-bandwidth [ 6/16] [CPU:: 1] [GPU:: 4 - 29122 - 0000:46:00.0] h2d::false d2h::true 56.273 GBps ...
[RESULT] [1103903.618029] [d2h-sequential-64MB] pcie-bandwidth [ 7/16] [CPU:: 0] [GPU:: 5 - 22683 - 0000:65:00.0] h2d::false d2h::true 56.297 GBps ...
[RESULT] [1103903.618039] [d2h-sequential-64MB] pcie-bandwidth [ 8/16] [CPU:: 1] [GPU:: 5 - 22683 - 0000:65:00.0] h2d::false d2h::true 55.592 GBps ...
[RESULT] [1103903.618052] [d2h-sequential-64MB] pcie-bandwidth [ 9/16] [CPU:: 0] [GPU:: 6 - 53458 - 0000:85:00.0] h2d::false d2h::true 56.293 GBps ...
[RESULT] [1103903.618063] [d2h-sequential-64MB] pcie-bandwidth [10/16] [CPU:: 1] [GPU:: 6 - 53458 - 0000:85:00.0] h2d::false d2h::true 56.337 GBps ...
[RESULT] [1103903.618072] [d2h-sequential-64MB] pcie-bandwidth [11/16] [CPU:: 0] [GPU:: 7 - 63883 - 0000:a6:00.0] h2d::false d2h::true 56.298 GBps ...
[RESULT] [1103903.618083] [d2h-sequential-64MB] pcie-bandwidth [12/16] [CPU:: 1] [GPU:: 7 - 63883 - 0000:a6:00.0] h2d::false d2h::true 56.325 GBps ...
[RESULT] [1103903.618116] [d2h-sequential-64MB] pcie-bandwidth [13/16] [CPU:: 0] [GPU:: 8 - 53667 - 0000:c6:00.0] h2d::false d2h::true 56.311 GBps ...
[RESULT] [1103903.618124] [d2h-sequential-64MB] pcie-bandwidth [14/16] [CPU:: 1] [GPU:: 8 - 53667 - 0000:c6:00.0] h2d::false d2h::true 56.340 GBps ...
[RESULT] [1103903.618134] [d2h-sequential-64MB] pcie-bandwidth [15/16] [CPU:: 0] [GPU:: 9 - 63738 - 0000:e5:00.0] h2d::false d2h::true 56.287 GBps ...
[RESULT] [1103903.618142] [d2h-sequential-64MB] pcie-bandwidth [16/16] [CPU:: 1] [GPU:: 9 - 63738 - 0000:e5:00.0] h2d::false d2h::true 56.334 GBps ...
Result:
PASSED: If all CPUs-GPUs distances are displayed and CPU x (x=0/1) to GPU y (y=2/3/4/5/6/7/8/9) PCIe transfer bandwidths are displayed.
FAILED: Otherwise
Action: Proceed to the next step. Run this test again later.
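The PASSED condition amounts to seeing a distance line and a bandwidth line for every CPU and GPU pair. A rough way to check this is to count both kinds of lines for one test. This is a minimal sketch; the function name pebb_line_counts is hypothetical, and it relies only on the pcie-bandwidth, distance:, and GBps markers visible in the output above.

```shell
# Hypothetical helper (not part of rvs): read PEBB output for one test from
# stdin and count the distance lines and the bandwidth lines.
pebb_line_counts() {
  awk '
    /pcie-bandwidth/ && /distance:/ { dist++ }
    /pcie-bandwidth/ && / GBps/     { bw++ }
    END { printf "distance_lines=%d bandwidth_lines=%d\n", dist, bw }
  '
}

# Usage: grep -F '[d2h-sequential-64MB]' pebb.txt | pebb_line_counts
```

With two CPUs and eight GPUs, as in the example output, both counts would be expected to be 16 for each test.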
PEQT (PCIe Qualification Tool)#
The PCIe Qualification Tool is used to qualify the PCIe bus the GPU is connected to. The qualification tool can determine the following characteristics of the PCIe bus interconnect to a GPU:
Support for Gen 3 atomic completers
DMA transfer statistics
PCIe link speed
PCIe link width
The configuration file for the PEQT module is located at ${RVS_CONF}/peqt_single.conf.
Command:
sudo rvs -c ${RVS_CONF}/peqt_single.conf
This module has a total of 17 tests (pcie_act_1 to pcie_act_17). Each test
checks for a subset of PCIe capabilities and shows the true or false status.
Note
The tests need sudo permission to run properly.
Expected output:
[RESULT] [1105558.986882] Action name :pcie_act_1
[RESULT] [1105558.988288] Module name :peqt
[RESULT] [1105559.33461 ] [pcie_act_1] peqt true
[RESULT] [1105559.33492 ] Action name :pcie_act_2
[RESULT] [1105559.33497 ] Module name :peqt
[RESULT] [1105559.72308 ] [pcie_act_2] peqt true
[RESULT] [1105559.72325 ] Action name :pcie_act_3
[RESULT] [1105559.72330 ] Module name :peqt
[RESULT] [1105559.114937] [pcie_act_3] peqt true
[RESULT] [1105559.114957] Action name :pcie_act_4
[RESULT] [1105559.114962] Module name :peqt
[RESULT] [1105559.155511] [pcie_act_4] peqt true
[RESULT] [1105559.155526] Action name :pcie_act_5
[RESULT] [1105559.155531] Module name :peqt
[RESULT] [1105559.190472] [pcie_act_5] peqt true
[RESULT] [1105559.190491] Action name :pcie_act_6
[RESULT] [1105559.190495] Module name :peqt
[RESULT] [1105559.230632] [pcie_act_6] peqt true
[RESULT] [1105559.230646] Action name :pcie_act_7
[RESULT] [1105559.230651] Module name :peqt
[RESULT] [1105559.273512] [pcie_act_7] peqt true
[RESULT] [1105559.273534] Action name :pcie_act_8
[RESULT] [1105559.273538] Module name :peqt
[RESULT] [1105559.316290] [pcie_act_8] peqt true
[RESULT] [1105559.316305] Action name :pcie_act_9
[RESULT] [1105559.316310] Module name :peqt
[RESULT] [1105559.357042] [pcie_act_9] peqt true
[RESULT] [1105559.357064] Action name :pcie_act_10
[RESULT] [1105559.357069] Module name :peqt
[RESULT] [1105559.391754] [pcie_act_10] peqt true
[RESULT] [1105559.391767] Action name :pcie_act_11
[RESULT] [1105559.391771] Module name :peqt
[RESULT] [1105559.434373] [pcie_act_11] peqt true
[RESULT] [1105559.434391] Action name :pcie_act_12
[RESULT] [1105559.434395] Module name :peqt
[RESULT] [1105559.470072] [pcie_act_12] peqt true
[RESULT] [1105559.470087] Action name :pcie_act_13
[RESULT] [1105559.470091] Module name :peqt
[RESULT] [1105559.512754] [pcie_act_13] peqt true
[RESULT] [1105559.512774] Action name :pcie_act_14
[RESULT] [1105559.512778] Module name :peqt
[RESULT] [1105559.552761] [pcie_act_14] peqt true
[RESULT] [1105559.552779] Action name :pcie_act_15
[RESULT] [1105559.552783] Module name :peqt
[RESULT] [1105559.586778] [pcie_act_15] peqt true
[RESULT] [1105559.586794] Action name :pcie_act_16
[RESULT] [1105559.586798] Module name :peqt
[RESULT] [1105559.620305] [pcie_act_16] peqt true
[RESULT] [1105559.620322] Action name :pcie_act_17
[RESULT] [1105559.620326] Module name :peqt
[RESULT] [1105559.651564] [pcie_act_17] peqt true
Result:
PASSED: [pcie_act_x] peqt true can be seen for all 17 actions.
FAILED: If any tests show false.
Action: Check that you are running this test as root or with sudo privileges. If not, actions 6 through 16 will fail. Run this test again later.
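To spot failing actions without reading all 17 result blocks, the output can be filtered for actions that did not report true. This is a minimal sketch; the function name peqt_failed_actions is hypothetical, and it assumes result lines end in "[<action>] peqt <true|false>" as in the output above.

```shell
# Hypothetical helper (not part of rvs): read PEQT output from stdin and print
# the name of every action that reported false. Exits 1 if any are found.
peqt_failed_actions() {
  awk '
    # Result lines end in: [<action>] peqt <true|false>
    NF >= 3 && $(NF-1) == "peqt" && ($NF == "true" || $NF == "false") {
      if ($NF != "true") { print $(NF-2); bad = 1 }
    }
    END { exit (bad + 0) }
  '
}

# Usage: sudo rvs -c ${RVS_CONF}/peqt_single.conf | peqt_failed_actions
```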
PBQT (P2P Benchmark and Qualification Tool)#
The PBQT module executes the following tests:
Lists all GPUs that support P2P
Characterizes the P2P links between peers
Performs a peer-to-peer throughput test between all P2P pairs
The configuration file for the pbqt module for MI300X is located here:
${RVS_CONF}/MI300X/pbqt_single.conf.
The conf file has 12 actions_xy test segments. Each of these checks for
peer-to-peer connectivity among GPUs and provides a true/false status. In
addition, it performs a bidirectional throughput test and reports the
throughput obtained based on config parameters. Since the throughput is not
compared against target throughput numbers, there is no PASS/FAIL condition
for the overall test.
It’s recommended that you carefully review the pbqt_single.conf file before
running the following command.
Command:
rvs -c ${RVS_CONF}/MI300X/pbqt_single.conf
Only a few example lines from the very long log file are shown, because the other lines look similar: all combinations of GPU pairs are considered, and numbers for those pairs are reported.
The expected output below (truncated) shows that unidirectional connectivity is true for the first GPU and its connection to the other seven GPU peers:
[RESULT] [1104553.34268 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 3 - 23018 - 0000:26:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34276 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 4 - 29122 - 0000:46:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34280 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 5 - 22683 - 0000:65:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34283 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 6 - 53458 - 0000:85:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34289 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 7 - 63883 - 0000:a6:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34294 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 8 - 53667 - 0000:c6:00.0] peers:true distance:15 xGMI:15
[RESULT] [1104553.34298 ] [p2p-unidir-sequential-64MB] p2p [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 9 - 63738 - 0000:e5:00.0] peers:true distance:15 xGMI:15
The following lines show unidirectional throughput between the 56 GPU pairs (not all are shown):
[RESULT] [1104673.143726] [p2p-unidir-parallel-64MB] p2p-bandwidth[ 1/56] [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 3 - 23018 - 0000:26:00.0] bidirectional: false 48.962 GBps duration: 1.462462 secs
[RESULT] [1104673.144823] [p2p-unidir-parallel-64MB] p2p-bandwidth[ 2/56] [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 4 - 29122 - 0000:46:00.0] bidirectional: false 48.914 GBps duration: 1.470746 secs
[RESULT] [1104673.145898] [p2p-unidir-parallel-64MB] p2p-bandwidth[ 3/56] [GPU:: 2 - 28851 - 0000:05:00.0] [GPU:: 5 - 22683 - 0000:65:00.0] bidirectional: false 48.577 GBps duration: 1.480956 secs
Result:
PASSED: If peers:true lines are observed for GPU peer-to-peer connectivity and the throughput values are non-zero.
FAILED: Otherwise
Action: Do not proceed further. Report this issue to your system manufacturer immediately.
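Both parts of the PASSED condition can be checked with a single filter over the log. This is a minimal sketch; the function name pbqt_check is hypothetical. It flags any peers: line that is not peers:true and any line whose value before GBps is zero or less, based on the line formats shown above.

```shell
# Hypothetical helper (not part of rvs): read PBQT output from stdin and
# report links that are not peers:true or that show zero throughput.
# Exit status: 0 = OK, 1 = problem found, 2 = no peers: lines in the input.
pbqt_check() {
  awk '
    / peers:/ {
      links++
      if ($0 !~ / peers:true /) { print "not peers: " $0; bad = 1 }
    }
    / GBps / {
      # The throughput value is the field just before "GBps".
      for (i = 2; i <= NF; i++)
        if ($i == "GBps" && $(i-1) + 0 <= 0) { print "zero throughput: " $0; bad = 1 }
    }
    END {
      if (links == 0) { print "no peers: lines found"; exit 2 }
      exit (bad + 0)
    }
  '
}

# Usage: rvs -c ${RVS_CONF}/MI300X/pbqt_single.conf | pbqt_check
```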