Network Validation

This section covers the essential steps for validating network performance and reliability on AMD Instinct™ platforms. Proper network validation is critical, because misconfigured settings or faulty hardware can significantly reduce data throughput and cluster efficiency. The process includes both software and hardware checks: enabling RDMA (Remote Direct Memory Access), verifying link speed, running targeted performance benchmarks, and evaluating collective GPU operations. Following these procedures helps you identify and resolve network bottlenecks before they affect high-demand workloads.

Pre-Validation Network Checks

Before starting network tests, follow the steps below to confirm that all backend interfaces are RDMA-enabled and running at the specified speed. These checks confirm that your AMD Instinct™ cluster is ready for consistent, high-performance networking.

Enable RDMA on All Backend NICs

All backend network interfaces must support RDMA. To verify that it is enabled, run the following command:

rdma link

Result:

  • PASSED: All 8 backend NICs are listed and report a status of ENABLED.

  • FAILED: One or more of the backend network interfaces is not listed or is DISABLED.
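The pass/fail check above can be scripted. The following is a minimal sketch, not an official tool: the helper names count_active_links and check_rdma_links are hypothetical, and it assumes the iproute2 output format in which each port appears on a line beginning with "link" and containing "state ACTIVE" when the link is up. If your rdma link output reports status differently, adjust the match accordingly. Note that the count covers every RDMA link on the host, so frontend NICs with RDMA enabled would also be counted.

```shell
# Hypothetical helper: count RDMA links reported as ACTIVE.
# Reads `rdma link` output on stdin. Assumes lines of the form:
#   link <device>/<port> state ACTIVE physical_state LINK_UP netdev <name>
count_active_links() {
    awk '$1 == "link" {
             for (i = 2; i < NF; i++)
                 if ($i == "state" && $(i + 1) == "ACTIVE") n++
         }
         END { print n + 0 }'
}

# Hypothetical helper: compare the ACTIVE count with the expected
# number of backend NICs (default 8, per the criteria above).
check_rdma_links() {
    expected="${1:-8}"
    active="$(count_active_links)"
    if [ "$active" -ge "$expected" ]; then
        echo "PASSED: ${active} RDMA links ACTIVE (expected ${expected})"
    else
        echo "FAILED: only ${active} RDMA links ACTIVE (expected ${expected})"
        return 1
    fi
}
```

After sourcing these functions, pipe the command output into the check, for example: rdma link | check_rdma_links 8. A non-zero exit status indicates that fewer links than expected are active.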