Network and Cluster Validation#
This chapter details validation of the network and the cluster. Validating network performance and reliability for AMD Instinct™ platforms is essential to ensure optimal data throughput and cluster efficiency.
This chapter outlines pre-validation checks, such as enabling RDMA on backend NICs and verifying link speeds, with 400G NICs recommended to avoid bottlenecks. It also details the installation and execution of OpenFabrics Enterprise Distribution (OFED) performance tests to validate GPU-to-NIC, NIC-to-switch, GPU-to-GPU, and host-to-host communication. Disabling ACS is advised for consistent performance during testing.
Multi-node RCCL tests are introduced to verify inter-node GPU communication, with automation options available via the Cluster Validation Suite. The chapter then transitions to cluster validation, establishing performance baselines and guiding further optimization. It recommends AI model validation using benchmarks like Llama 3.1 and provides resources for multi-node network performance validation, emphasizing workload tuning for enhanced efficiency on AMD Instinct™ systems.
Network Validation#
This section covers essential steps for validating network performance and reliability on AMD Instinct™ platforms. Proper network validation is critical, as misconfigured settings or faulty hardware can significantly impact data throughput and cluster efficiency. The process includes both software and hardware checks, from enabling RDMA (Remote Direct Memory Access) and verifying link speed to running targeted performance benchmarks and evaluating collective GPU operations. Following the outlined procedures helps you identify and resolve network bottlenecks quickly, so that your system performs well under high-demand workloads.
Pre-Validation Network Checks#
Before initiating network tests, follow the steps below to confirm that all backend interfaces are RDMA-enabled and running at the specified speed. These checks confirm that your AMD Instinct™ cluster is prepared for consistent, high-performance networking.
Enable RDMA on All Backend NICs#
All backend network interfaces should support RDMA. To confirm that it is enabled, run:
rdma link
Example output for Broadcom NICs:
link bnxt_re0/1 state ACTIVE physical_state LINK_UP netdev ens26np0
link bnxt_re1/1 state ACTIVE physical_state LINK_UP netdev ens27np0
link bnxt_re2/1 state ACTIVE physical_state LINK_UP netdev ens25np0
link bnxt_re3/1 state ACTIVE physical_state LINK_UP netdev ens24np0
link bnxt_re4/1 state ACTIVE physical_state LINK_UP netdev ens22np0
link bnxt_re5/1 state ACTIVE physical_state LINK_UP netdev ens23np0
link bnxt_re6/1 state ACTIVE physical_state LINK_UP netdev ens21np0
link bnxt_re7/1 state ACTIVE physical_state LINK_UP netdev ens20np0
For Mellanox NICs, the RDMA device names would be mlx5_0, mlx5_1, etc.
Example output for AMD Pensando (AINIC) NICs:
link ionic_0/1 state ACTIVE physical_state LINK_UP netdev benic7p1
link ionic_1/1 state ACTIVE physical_state LINK_UP netdev benic8p1
link ionic_2/1 state ACTIVE physical_state LINK_UP netdev benic6p1
link ionic_3/1 state ACTIVE physical_state LINK_UP netdev benic5p1
link ionic_4/1 state ACTIVE physical_state LINK_UP netdev benic3p1
link ionic_5/1 state ACTIVE physical_state LINK_UP netdev benic2p1
link ionic_6/1 state ACTIVE physical_state LINK_UP netdev benic1p1
link ionic_7/1 state ACTIVE physical_state LINK_UP netdev benic4p1
Result:
PASSED: All 8 backend NICs are listed and report state ACTIVE and physical_state LINK_UP.
FAILED: One or more of the backend network interfaces is not listed or does not report state ACTIVE and physical_state LINK_UP.
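The pass/fail check above can be scripted. The following is a minimal sketch, not part of the official procedure: the function names `count_active_rdma_links` and `check_rdma_links` are hypothetical, and the sketch assumes the `rdma link` output format shown in the examples above. Because any other RDMA-capable interfaces on the host would also be counted, it passes when at least the expected number of links is active.

```shell
# Count RDMA links whose state is ACTIVE and whose physical_state is LINK_UP.
# Reads `rdma link` output on stdin and prints the count.
count_active_rdma_links() {
    awk '$1 == "link" {
             state = ""; phys = ""
             for (i = 2; i < NF; i++) {
                 if ($i == "state") state = $(i + 1)
                 if ($i == "physical_state") phys = $(i + 1)
             }
             if (state == "ACTIVE" && phys == "LINK_UP") n++
         }
         END { print n + 0 }'
}

# Compare the count with the expected number of backend NICs (8 in this guide).
check_rdma_links() {
    expected="${1:-8}"
    active="$(count_active_rdma_links)"
    if [ "$active" -ge "$expected" ]; then
        echo "PASSED: $active active RDMA links (expected $expected)"
    else
        echo "FAILED: $active active RDMA links (expected $expected)"
        return 1
    fi
}
```

A possible invocation is `rdma link | check_rdma_links 8`, which returns a non-zero exit status on failure so it can be used in a larger script.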
Check NIC Link Speed#
Verify that the NICs in your servers report the correct speeds. Several commands and utilities are available for checking speed, depending on your network type. For the AMD Instinct™ product line, 400G network cards are generally advised: 200G cards are not sufficient to avoid bottlenecks, and 800G cards are not needed.
RoCE / Ethernet#
Set backend NIC speed
sudo ethtool -s <interface> autoneg off speed 400000 duplex full
Check backend NIC speed
sudo ethtool <interface> | grep -i speed
cat /sys/class/net/<interface>/speed
InfiniBand#
ibdiagnet writes a report on the entire fabric to its default log files, where you can verify link speeds.
ibstat or ibstatus tells you if the link is up and reports the speed at which it is running for all HCAs in the server.
Result#
PASSED: All backend NICs report a speed of 400000 (Mb/s, that is, 400 Gb/s).
FAILED: Any backend NIC reports a speed of less than 400000.
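For RoCE / Ethernet interfaces, the speed check can be applied to several NICs at once. The sketch below is an illustration under stated assumptions: the function names `read_link_speeds` and `check_link_speeds` are hypothetical, speeds are read from the sysfs path shown earlier, and an unreadable speed is reported as -1. It does not cover InfiniBand, where ibstat or ibstatus would be used instead.

```shell
# Print "<interface> <speed>" for each named interface, reading the speed from sysfs.
# An interface whose speed cannot be read is reported as -1.
read_link_speeds() {
    for ifc in "$@"; do
        echo "$ifc $(cat "/sys/class/net/$ifc/speed" 2>/dev/null || echo -1)"
    done
}

# Read "<interface> <speed>" pairs on stdin and compare each speed with the
# expected value in Mb/s (first argument, default 400000).
# Exits non-zero if any interface is below the expected speed.
check_link_speeds() {
    awk -v want="${1:-400000}" '
        NF >= 2 {
            if ($2 + 0 >= want + 0) {
                printf "PASSED: %s %s\n", $1, $2
            } else {
                printf "FAILED: %s %s (expected %s)\n", $1, $2, want
                bad = 1
            }
        }
        END { exit bad + 0 }'
}
```

A possible invocation, using interface names from the Broadcom example output (these differ per system), is `read_link_speeds ens26np0 ens27np0 | check_link_speeds 400000`.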
RCCL Multi-Node Fabric Test#
Multi-node RCCL testing verifies that GPUs in one node can communicate correctly and efficiently with GPUs in other nodes in the cluster, ensuring that the communication fabric is functioning properly. An overview of RCCL, along with configuration instructions, is provided in the RCCL Benchmarking section. It is generally recommended to run this test on the full cluster or on a subset of up to sixteen nodes, whichever is smaller.
RCCL Pre-Checks#
Before running RCCL, confirm that the following criteria are met:
Ensure that UCX, OpenMPI, and RCCL-Tests are built using the options given in this document. For systems using Pensando NICs, the AMD ANP should also be rebuilt.
Run a fabric ping test and ensure reachability.
Configure PFC and DCQCN with recommended parameters.
Ensure that QoS (PFC + DCQCN) is configured across the network and that DSCP marking on NICs and switches is in sync.
Set the ulimit value to unlimited.
Run OFED Performance Tests, CPU-to-CPU and GPU-to-GPU, across the cluster.
Disable ACS. Performance will vary dramatically if ACS is enabled.
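The ACS pre-check in the list above can be verified from `lspci` output. The following is a rough sketch rather than an official tool: the function name `count_acs_enabled` is hypothetical, and the sketch assumes the usual `lspci -vvv` layout in which each ACS-capable device has an `ACSCtl:` line whose flags end in `+` (enabled) or `-` (disabled).

```shell
# Count PCIe devices that have at least one ACS control flag enabled.
# Reads `lspci -vvv` output on stdin and prints the count.
# ACSCap lines (capabilities) are ignored; only ACSCtl lines (current settings) count.
count_acs_enabled() {
    awk '/ACSCtl:/ && /[A-Za-z][+]/ { n++ } END { print n + 0 }'
}
```

A possible invocation is `sudo lspci -vvv | count_acs_enabled`. A result of 0 suggests that no device currently has ACS enabled; root privileges are typically needed for `lspci` to show these fields.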
RCCL Multi-Node Test#
Refer to the RCCL Benchmarking page for detailed instructions on building and running multi-node RCCL tests.
Result:
PASSED: For a two-node cluster, Allreduce performance of 304 GB/s or greater for MI300X, or 350 GB/s or greater for MI350X or MI355X.
FAILED: Allreduce performance less than expected values.
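The thresholds above can be encoded in a small helper for use in automation. This is a sketch under stated assumptions: the function names `allreduce_threshold` and `check_allreduce` are hypothetical, and the measured value is supplied by hand from the two-node Allreduce result, since this guide does not specify how to extract it from the test output.

```shell
# Print the minimum two-node Allreduce result (GB/s) quoted in this guide for a GPU model.
allreduce_threshold() {
    case "$1" in
        MI300X) echo 304 ;;
        MI350X|MI355X) echo 350 ;;
        *) echo "unknown GPU model: $1" >&2; return 2 ;;
    esac
}

# Compare a measured Allreduce value (GB/s) with the threshold for the given model.
# Arguments: <gpu_model> <measured_GBps>
check_allreduce() {
    want="$(allreduce_threshold "$1")" || return 2
    # awk handles the floating-point comparison, which plain sh cannot do.
    if awk -v got="$2" -v want="$want" 'BEGIN { exit !(got + 0 >= want + 0) }'; then
        echo "PASSED: $2 GB/s >= $want GB/s ($1)"
    else
        echo "FAILED: $2 GB/s < $want GB/s ($1)"
        return 1
    fi
}
```

For example, `check_allreduce MI300X 310.5` reports a pass, while `check_allreduce MI355X 349.9` reports a failure and returns a non-zero exit status.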
RCCL Multi-Node Tests with the Cluster Validation Suite#
The multi-node RCCL test recommended for deployment can be automated using the Cluster Validation Suite.
Before running CVS, ensure that the configuration files for your specific platform and the nodes under test are correctly defined in the cvs/input/config_file/ subdirectories.
After installing the test suite, run the multi-node RCCL test on the cluster using:
pytest -vvv --log-file=/tmp/test.log -s ./tests/rccl/rccl_multinode_cvs.py \
--cluster_file input/cluster_file/cluster.json \
--config_file input/config_file/rccl/rccl_config.json \
--html=/var/www/html/cvs/rccl.html --capture=tee-sys --self-contained-html
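Before invoking the command above, it can help to confirm that the cluster and configuration files it references are in place. The sketch below is illustrative only: the function name `check_json_files` is hypothetical, it checks only that each file exists, is non-empty, and parses as JSON (when `python3` is available), and it does not validate the contents against what CVS expects.

```shell
# Check that each path exists, is non-empty, and (when python3 is available)
# parses as JSON. Returns non-zero if any file fails a check.
check_json_files() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING: $f"
            rc=1
        elif command -v python3 >/dev/null 2>&1 \
             && ! python3 -m json.tool "$f" >/dev/null 2>&1; then
            echo "INVALID JSON: $f"
            rc=1
        else
            echo "OK: $f"
        fi
    done
    return "$rc"
}
```

A possible invocation, run from the same directory as the pytest command and using the paths it references, is `check_json_files input/cluster_file/cluster.json input/config_file/rccl/rccl_config.json`.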
Cluster Validation#
After successfully completing the tests mentioned in this guide, the System Under Test (SUT) meets the single-node customer acceptance criteria. The test results related to performance serve as a baseline for further enhancements. To further optimize the system, make incremental changes to individual parameters noted in the prerequisites and repeat the tests. Once complete, proceed with AI model validation and cluster network validation using the guides mentioned below.
Cluster Network Performance Validation#
After validating single-node performance, configure each server for maximum data transfer and bandwidth. It is essential to test both host and device performance in single-node and multi-node setups using targeted benchmarks.
The Cluster Network Performance Validation Guide for single-node and multi-node networking provides step-by-step instructions on configuring network settings, devices, and running performance tests to ensure AMD Instinct™-based GPU clusters operate at peak speed and bandwidth.
Use the guides as follows:
1. Validate and optimize a single server using the Single Node Configuration section of the GPU Cluster Networking guide.
2. If you are using a RoCE network, consult the RoCE Network Configuration Guide for additional steps pertaining to your hardware. Otherwise, proceed to step 3.
3. Once all individual nodes are validated, proceed through the Multi-node Networking Guide. If you did not install MPI-supported rccl-tests in the RCCL Benchmarking section, ensure you do so when following the multi-node guide.
To thoroughly evaluate your cluster’s performance, we recommend using the Llama 3.1 405B training benchmark with the JAX library. Please note that the runtime will vary depending on the number of nodes, NICs per node, and overall cluster configuration.
No multi-node inference benchmark is currently suggested for evaluating clusters, but the use of vLLM and SGLang should be considered.
AI Workload Validation with the Cluster Validation Suite#
AMD recommends the Llama 3.1 405B workload with JAX for multi-node testing.
The Cluster Validation Suite includes scripts for automatically installing and running this workload as part of deployment testing. Additional documentation is available on the ROCm website, on the Run Cluster Validation Suite Tests page, within the Jax training test scripts section.
Workload Optimization#
Once the system and networking have been fully validated, review the Workload Optimization Guide to learn more about how to tune workloads for maximum performance on AMD Instinct™ systems.