AMD Instinct MI350X#

The AMD Instinct™ MI350X is a high-performance GPU accelerator designed for AI, HPC, and demanding workloads. This document provides MI350X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.

Overview#

The MI350X targets training of large AI models, high-throughput inference, and complex HPC workloads. The MI350X platform uses a Universal Baseboard (UBB 2.0) configuration that hosts 8 AMD Instinct MI350X OAM (OCP Accelerator Module) accelerators with a total of 2.3 TB of HBM3E memory (8 × 288 GB).

Each MI350X GPU uses a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 288 GB of HBM3E memory. It is built on AMD's CDNA 4 architecture, with fully meshed Infinity Fabric™ connectivity between accelerators.

System Requirements#

Operating System Support#

For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:

ROCm System Requirements - Supported Distributions

Note

The ROCm documentation is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.

Hardware Configuration#

Expected GPU Configuration#

Recommended high-level platform configuration:

Dual-socket AMD EPYC 9004/9005-series (or other supported server-class CPUs), with BIOS configured per the recommended settings
3.0 TB or more of system memory
Eight 400G backend NICs (RoCE or InfiniBand)

GPU Identification#

All MI350X GPUs (PCI vendor:device 1002:75a0) should appear in lspci output:

sudo lspci -d 1002:75a0

Example (truncated for brevity – expect 8 lines):

05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
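
The count check above can be scripted. The following is a minimal sketch, not an official tool: the function name `check_gpu_count` is hypothetical, and it assumes each non-empty line of the filtered `lspci` output corresponds to one GPU.

```shell
# Count devices in lspci output read from stdin and compare with the expected count.
# check_gpu_count is a hypothetical helper; 8 matches the UBB configuration described above.
check_gpu_count() {
    local expected="${1:-8}"
    local found
    # Each non-empty line is one PCI function that matched the vendor:device filter.
    found=$(awk 'NF { n++ } END { print n + 0 }')
    if [ "$found" -eq "$expected" ]; then
        echo "PASS: found $found of $expected GPUs"
    else
        echo "FAIL: found $found of $expected GPUs"
        return 1
    fi
}

# Usage on a live node:
#   sudo lspci -d 1002:75a0 | check_gpu_count 8
```

Reading from stdin keeps the helper independent of the hardware, so it can also be run against saved `lspci` output from another node.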

Acceptance Criteria#

The MI350X system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation (AGFHC recipes) → Performance Benchmarks.

System Acceptance Process#

1. Prerequisites Validation - Ensure all system requirements and dependencies are met
2. Basic Health Checks - Verify hardware detection and basic system health
3. System Validation - Conduct comprehensive stress testing and qualification
4. Performance Benchmarks - Validate compute, memory, and interconnect performance

The system is accepted when all required recipe runs and benchmarks pass without errors and no hardware faults appear in the logs.

Prerequisites Validation#

Ensure all system requirements are met before proceeding with validation. See the Prerequisites documentation and System setup for more details.

✅ Supported operating system (see ROCm supported distributions)
✅ ROCm 7.0.1 or later installed for MI350X (verify: cat /opt/rocm/.info/version)
✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)
✅ Required kernel parameters present:
- pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu
- intel_iommu=on (Intel host CPUs only)
✅ Minimum 3.0 TB system memory available
✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes
✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)
✅ Environment variables (if used):
- HIP_FORCE_DEV_KERNARG=1 (the default from ROCm 6.2 onward)
- HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
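
Two of the checklist items above, the kernel parameters and the minimum ROCm version, lend themselves to scripting. The sketch below is illustrative only: `missing_kernel_params` and `version_ge` are hypothetical helper names, the version comparison relies on GNU `sort -V`, and it assumes any build suffix after a `-` in the version file can be ignored.

```shell
# Required kernel parameters, as listed in the checklist above.
REQUIRED_PARAMS="pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu"

# Print the required parameters that are absent from the given command line.
# Prints nothing and returns 0 when all are present; returns 1 otherwise.
missing_kernel_params() {
    local cmdline="$1" param missing=""
    for param in $REQUIRED_PARAMS; do
        # Pad with spaces so only whole, space-delimited parameters match.
        case " $cmdline " in
            *" $param "*) ;;
            *) missing="$missing $param" ;;
        esac
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
}

# True when version $1 >= version $2 (version-aware ordering via GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a live node:
#   missing_kernel_params "$(cat /proc/cmdline)"
#   version_ge "$(cut -d- -f1 /opt/rocm/.info/version)" 7.0.1 && echo "ROCm version OK"
```

On Intel hosts, `intel_iommu=on` would need to be appended to `REQUIRED_PARAMS`.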

Basic Health Checks#

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see Health Checks.

Test	Command	Pass/Fail criteria
Check OS distribution	`cat /etc/os-release`	Pass: OS version listed in compatibility matrix Fail: Otherwise
Check kernel boot arguments	`cat /proc/cmdline`	Pass: All required parameters present (`pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu`), plus `intel_iommu=on` on Intel host CPUs Fail: Any required parameter missing
Check for driver errors	`dmesg \| grep -i error`	Pass: No output Fail: Otherwise
Check available memory	`free -h` / `cat /proc/meminfo`	Pass: ≥ 3.0 TB system memory available Fail: < 3.0 TB
Check GPU presence	`sudo lspci -d 1002:75a0`	Pass: 8 MI350X GPUs found Fail: Otherwise
Check GPU link speed and width	`sudo lspci -d 1002:75a0 -vvv \| grep -e DevSta -e LnkSta`	Pass: Each GPU: Speed 32GT/s, Width x16, no `FatalErr+` Fail: Otherwise
Monitor utilization metrics	`amd-smi monitor -putm`	Pass: Idle metrics as specified Fail: Otherwise
Check system kernel logs for errors	`sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'`	Pass: No output Fail: Otherwise
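
The link speed and width check in the table can be automated along these lines. This is a sketch under stated assumptions: `check_pcie_links` is a hypothetical helper, and it assumes the usual `lspci -vvv` layout in which the negotiated link appears on a `LnkSta:` line and device status flags on a `DevSta:` line. It matches ` FatalErr+` with a leading space so that `NonFatalErr+` is not mistaken for a fatal error.

```shell
# Read `lspci -vvv` output on stdin.
# Fails if any LnkSta line is not 32GT/s x16, if any DevSta line reports FatalErr+,
# or if no LnkSta line is present at all.
check_pcie_links() {
    awk '
        /LnkSta:/ {
            links++
            if ($0 !~ /Speed 32GT\/s/ || $0 !~ /Width x16/) {
                bad++
                print "degraded link: " $0
            }
        }
        /DevSta:/ && / FatalErr\+/ {
            bad++
            print "fatal error: " $0
        }
        END {
            if (links == 0) { print "no LnkSta lines found"; exit 1 }
            exit (bad > 0)
        }
    '
}

# Usage on a live node:
#   sudo lspci -d 1002:75a0 -vvv | check_pcie_links && echo "PCIe links OK"
```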

System Validation#

AGFHC (AMD GPU Field Health Check) provides structured recipes exercising PCIe, HBM, compute, power/thermal and fabric.

Recipe	Command	Purpose	Pass Criteria
all_lvl5	`/opt/amd/agfhc/agfhc -r all_lvl5 -o <output_dir>`	Broad ~2h system-level coverage (PCIe, HBM, compute, power)	Overall result PASS / return code 0
hbm_lvl5 (run twice)	`/opt/amd/agfhc/agfhc -r hbm_lvl5:i=2 -o <output_dir>`	Intensive HBM stress & ECC observation	Both iterations PASS / no memory errors
pcie_lvl2	`/opt/amd/agfhc/agfhc -r pcie_lvl2 -o <output_dir>`	PCIe bandwidth & link health	PASS / expected link stability
miniHPL (optional)	`/opt/amd/agfhc/agfhc -t miniHPL:d=120m -o <output_dir>`	Linpack-like integration stress (MI350X)	PASS / completes without failures

Review results.json in the output directory, or the terminal summary; any FAIL requires remediation before acceptance.
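
The recipes in the table can be chained so that the first failure stops the sequence. The wrapper below is a minimal sketch: `run_recipes` and the `AGFHC_BIN` override are hypothetical names, and it relies only on the return-code criterion stated in the table (0 means PASS). It does not parse results.json.

```shell
# AGFHC_BIN is a hypothetical override; the default path is the one used in the table above.
AGFHC_BIN="${AGFHC_BIN:-/opt/amd/agfhc/agfhc}"

# Run each recipe in order, writing to one output directory per recipe.
# Usage: run_recipes <output_root> <recipe> [<recipe> ...]
# Returns non-zero as soon as a recipe exits with a non-zero return code.
run_recipes() {
    local out_root="$1" recipe dir
    shift
    for recipe in "$@"; do
        # Drop any ":i=2"-style modifier when naming the output directory.
        dir="$out_root/${recipe%%:*}"
        mkdir -p "$dir"
        if "$AGFHC_BIN" -r "$recipe" -o "$dir"; then
            echo "PASS $recipe"
        else
            echo "FAIL $recipe (rc=$?)"
            return 1
        fi
    done
}

# Usage on a live node:
#   run_recipes /var/log/agfhc all_lvl5 hbm_lvl5:i=2 pcie_lvl2
```

The output root `/var/log/agfhc` in the usage line is only an example location.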

Performance Benchmarks#

RCCL All-Reduce (in-place bus bandwidth at 8 GB message size):

Command: all_reduce_perf -b 8 -e 8G -f 2 -g 8

Pass: In-place busbw ≥ 304 GB/s

Fail: Otherwise
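
The threshold comparison can be scripted against the benchmark output. The sketch below is an illustration, not part of rccl-tests: `check_busbw` is a hypothetical helper, and it assumes the usual rccl-tests result layout of 13 whitespace-separated fields per row, with the message size in bytes in field 1 and the in-place busbw (GB/s) in field 12. Verify the column order against the header printed by your build before relying on it.

```shell
# Read all_reduce_perf output on stdin and compare the in-place busbw at one
# message size against a minimum.
# Usage: check_busbw [size_bytes] [min_busbw_GBps]
# Defaults: 8589934592 bytes (8 GB message) and 304 GB/s, per the criteria above.
check_busbw() {
    local size_bytes="${1:-8589934592}" min_busbw="${2:-304}"
    awk -v size="$size_bytes" -v min="$min_busbw" '
        # Header and comment lines start with "#", so they never match the size.
        $1 == size && NF >= 13 { found = 1; bw = $12 }
        END {
            if (!found) { print "FAIL: no result row for size " size; exit 1 }
            if (bw + 0 >= min + 0) { print "PASS: in-place busbw " bw " GB/s"; exit 0 }
            print "FAIL: in-place busbw " bw " GB/s < " min " GB/s"
            exit 1
        }
    '
}

# Usage on a live node:
#   all_reduce_perf -b 8 -e 8G -f 2 -g 8 | check_busbw
```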

Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.