AMD Instinct MI350X#
The AMD Instinct™ MI350X is a high-performance GPU accelerator designed for AI, HPC, and demanding workloads. This document provides MI350X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.
Overview#
The MI350X targets training of large AI models, high-speed inference, and complex HPC workloads. The MI350X platform uses a Universal Baseboard (UBB 2.0) configuration that hosts 8 AMD Instinct MI350X OAM (OCP Accelerator Module) accelerators with a total of 2.3 TB of HBM3E memory (8 × 288 GB).
Each MI350X GPU uses a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 288 GB of HBM3E memory per accelerator. It is built on AMD’s CDNA 4 architecture, with fully meshed Infinity Fabric™ connectivity between accelerators.
System Requirements#
Operating System Support#
For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:
ROCm System Requirements - Supported Distributions
Note
The ROCm documentation is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.
Hardware Configuration#
Expected GPU Configuration#
Recommended high-level platform configuration:
Dual-socket AMD EPYC 9004/9005-series (or supported server-class CPUs) with BIOS configured per guide
3.0 TB or more of system memory
Eight 400G backend NICs (RoCE or InfiniBand)
GPU Identification#
All MI350X GPUs (PCI vendor:device 1002:75a0) should appear in lspci output:
sudo lspci -d 1002:75a0
Example (truncated for brevity – expect 8 lines):
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
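The GPU count check above can be scripted. The following is a minimal sketch, assuming the `lspci -d 1002:75a0` output contains exactly one non-empty line per GPU; the function name `check_gpu_count` is hypothetical and not part of any AMD tool.

```shell
# check_gpu_count EXPECTED
# Reads `lspci -d 1002:75a0` output on stdin and compares the number of
# non-empty lines (assumed: one line per GPU) with EXPECTED (default 8).
check_gpu_count() {
  local expected="${1:-8}"
  local found
  # grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`.
  found=$(grep -c '[^[:space:]]' || true)
  if [ "$found" -eq "$expected" ]; then
    echo "PASS: found $found GPUs"
  else
    echo "FAIL: found $found GPUs, expected $expected"
    return 1
  fi
}
```

On a node, this could be invoked as `sudo lspci -d 1002:75a0 | check_gpu_count 8`; a non-zero return status indicates a missing or extra device.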
Acceptance Criteria Checklist#
This section presents the high-level cluster acceptance criteria as a checklist that can be executed and tracked step by step. The checklist verifies that the system meets the technical, operational, and performance criteria required for “Go-Live” readiness. It is organized into the following key areas:
Prerequisites Validation - Ensure all system requirements and dependencies are met
Basic Health Checks - Verify hardware detection and basic system health
System Validation - Conduct comprehensive single and multi-node stress testing and qualification
Performance Benchmarks - Validate compute, memory, and interconnect performance
Each area consists of a defined set of criteria that are hyperlinked to the corresponding sections within this guide, enabling users to quickly access detailed procedures, execution steps, and supporting guidance.
The System Validation area, which includes both single-node and multi-node testing, defines minimum required execution (run) times for each test. These requirements ensure that validation is conducted under appropriate conditions to accurately assess system stability, performance, and reliability.
Successful completion of this checklist, with no errors or hardware faults observed in validation logs, confirms that the cluster has been properly configured, validated at both the single-node and multi-node levels, and is capable of supporting sustained AI workloads in a production environment.
Prerequisites Validation#
Ensure all system requirements are met before proceeding with validation. See the Prerequisites documentation and System setup for more details.
✅ Supported operating system (see ROCm supported distributions)
✅ ROCm 7.0.1 or later installed for MI350X (verify: cat /opt/rocm/.info/version)
✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)
✅ Required kernel parameters present:
pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu
intel_iommu=on if Intel host CPU
✅ Minimum 3.0 TB system memory available
✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes
✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)
✅ Environment variables (if used):
HIP_FORCE_DEV_KERNARG=1 (default ≥ ROCm 6.2)
HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
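Two of the prerequisites above, the ROCm version and the kernel parameters, lend themselves to a scripted check. The sketch below is illustrative only: the function names are hypothetical, `sort -V` assumes GNU coreutils, and the version comparison assumes a plain dotted version string.

```shell
# missing_kernel_params CMDLINE PARAM...
# Prints each PARAM that does not appear as a whole word in CMDLINE.
missing_kernel_params() {
  local cmdline="$1"
  shift
  local p
  for p in "$@"; do
    case " $cmdline " in
      *" $p "*) ;;          # present as a space-delimited token
      *) echo "$p" ;;       # absent: report it
    esac
  done
}

# version_at_least HAVE WANT
# Succeeds if HAVE >= WANT, using GNU `sort -V` version ordering.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

A possible invocation is `missing_kernel_params "$(cat /proc/cmdline)" pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu`, where empty output means all listed parameters are present. For the version, something like `v=$(cat /opt/rocm/.info/version); version_at_least "${v%%-*}" 7.0.1` strips any build suffix before comparing; whether the file carries such a suffix on a given install is an assumption.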
Basic Health Checks#
These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see Health Checks.
| Test | Command | Pass/Fail criteria |
|---|---|---|
| | | Pass: OS version listed in compatibility matrix |
| | | Pass: Shows all required params ( |
| | | Pass: Null (no GPU-related errors) |
| | | Pass: ≥ 3.0T system memory available |
| | | Pass: 8 MI350X GPUs found |
| | | Pass: Each GPU: Speed 32GT/s, Width x16, no |
| | | Pass: Idle metrics as specified |
| | | Pass: Null |
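The PCIe link criterion (Speed 32GT/s, Width x16 on every GPU) can be checked by summarizing the `LnkSta` lines of verbose `lspci` output. This is a rough sketch under stated assumptions: it expects `lspci -vv`-style text on stdin, treats a `(downgraded)` annotation as a failure, and uses a hypothetical function name `link_summary`.

```shell
# link_summary
# Reads `lspci -vv` output on stdin and prints "<good> <bad>", where
# <good> counts LnkSta lines reporting Speed 32GT/s and Width x16 with
# no "downgraded" annotation, and <bad> counts all other LnkSta lines.
link_summary() {
  awk '/LnkSta:/ {
         if ($0 ~ /Speed 32GT\/s/ && $0 ~ /Width x16/ && $0 !~ /downgraded/)
           good++
         else
           bad++
       }
       END { printf "%d %d\n", good, bad }'
}
```

For an 8-GPU node, `sudo lspci -d 1002:75a0 -vv | link_summary` would be expected to print `8 0`. A result of `0 0` usually means the link status was not visible, for example when `lspci` is run without root privileges.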
System Validation#
AGFHC (AMD GPU Field Health Check) provides structured recipes exercising PCIe, HBM, compute, power/thermal and fabric.
Single-Node Tests#
The following single-node tests must each run for at least the required run time, with no failures reported in the validation logs.
| Test | Command | Run Time | Purpose | Pass Criteria |
|---|---|---|---|---|
| | | 2 hours | Broad system-level coverage (PCIe, HBM, compute, power) | Overall result PASS / return code 0 |
| hbm_lvl5 (4 iterations) | | 8 hours | Intensive HBM stress & ECC observation | All iterations PASS / no memory errors |
| | | 1 hour | GPU compute stress test | PASS / return code 0 |
| | | 3 hours (10 hours recommended) | Linpack-like integration stress | PASS / completes without failures |
| | | 10 minutes | PCIe bandwidth & link health | PASS / expected link stability |
| | | 2–11 minutes | Single-node GPU interconnect validation | busbw meets expected thresholds |
| | See workload validation | 1–24 hours | Sustained AI workload (Llama 3.1 70B with JAX) | Completes without failures |
Multi-Node Tests#
The following multi-node tests must each run for at least the required run time, with no failures reported in the validation logs.
| Test | Reference | Run Time | Purpose | Pass Criteria |
|---|---|---|---|---|
| | Network validation | 2 hours | RDMA fabric bandwidth and latency | All tests PASS / expected bandwidth |
| | Network validation | Up to 128 nodes, 10 hours | Multi-node GPU fabric validation | All nodes PASS / expected bandwidth |
| | Cluster validation | 24 hours | Sustained AI workload (Llama 3.1 405B with JAX) | Completes without failures |
Review results.json in the output directory or terminal summary; any FAIL requires remediation before acceptance.
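As a first pass over a results file, a plain text scan for the string FAIL can flag runs that need a closer look. The sketch below deliberately makes no assumption about the JSON schema of results.json; it only searches for the literal, case-sensitive text `FAIL`, and the function name `report_failures` is hypothetical.

```shell
# report_failures FILE
# Prints every line of FILE containing the literal string FAIL (with line
# numbers). Returns 0 if none are found, 1 if any are found, and 2 if the
# file cannot be read.
report_failures() {
  if [ ! -r "$1" ]; then
    echo "cannot read: $1" >&2
    return 2
  fi
  if grep -n 'FAIL' "$1"; then
    return 1
  fi
  return 0
}
```

A text scan like this is only a triage aid: it cannot distinguish a failed test from a field name that happens to contain FAIL, so any hit should still be reviewed in the full log or terminal summary.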
Performance Benchmarks#
RCCL All-Reduce bandwidth benchmark:
RCCL All-Reduce (in-place bus bandwidth @ 8 GB message)
all_reduce_perf -b 8 -e 8G -f 2 -g 8
Pass: In-place busbw ≥ 304 GB/s
Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.
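The RCCL threshold above can be evaluated automatically from saved benchmark output. This is a sketch, assuming the usual rccl-tests result layout of 13 columns per row (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place), which puts the in-place busbw in field 12. The function names are hypothetical, and the column layout should be confirmed against the header printed by the installed `all_reduce_perf`.

```shell
# inplace_busbw SIZE_BYTES
# Reads all_reduce_perf output on stdin and prints the in-place busbw (GB/s)
# for the row whose first column equals SIZE_BYTES.
# ASSUMPTION: in-place busbw is field 12 of each result row.
inplace_busbw() {
  awk -v size="$1" '$1 == size && NF >= 12 { print $12 }'
}

# meets_threshold VALUE MIN
# Succeeds if VALUE >= MIN, compared as floating-point numbers.
# An empty VALUE is treated as 0 and therefore fails.
meets_threshold() {
  awk -v v="$1" -v min="$2" 'BEGIN { exit !(v + 0 >= min + 0) }'
}
```

With the output saved to a file (here a hypothetical `allreduce.log`), `meets_threshold "$(inplace_busbw 8589934592 < allreduce.log)" 304` succeeds only when the 8 GB row (8 × 1024³ bytes) reaches the required 304 GB/s.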