AMD Instinct MI350X#

The AMD Instinct™ MI350X is a high-performance GPU accelerator designed for AI, HPC, and demanding workloads. This document provides MI350X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.

Overview#

The MI350X targets large-model AI training, high-throughput inference, and complex HPC workloads. The platform uses a Universal Baseboard (UBB 2.0) that hosts eight AMD Instinct MI350X OAM (OCP Accelerator Module) accelerators with a combined 2.3 TB of HBM3E memory.

Each MI350X GPU uses a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and carries 288 GB of HBM3E memory. It is built on AMD’s CDNA 4 architecture, with fully meshed Infinity Fabric™ connectivity between accelerators.
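The platform memory total follows directly from the per-accelerator capacity. A quick arithmetic check, assuming decimal units (1 TB = 1000 GB):

```shell
# Accelerator count and per-accelerator HBM capacity, as stated above.
gpus_per_board=8
hbm_gb_per_gpu=288

# 8 x 288 GB = 2304 GB, which the overview rounds to 2.3 TB.
total_hbm_gb=$((gpus_per_board * hbm_gb_per_gpu))
echo "Total HBM: ${total_hbm_gb} GB"   # prints "Total HBM: 2304 GB"
```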

System Requirements#

Operating System Support#

For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:

ROCm System Requirements - Supported Distributions

Note

The ROCm documentation is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.

Hardware Configuration#

Expected GPU Configuration#

Recommended high-level platform configuration:

  • Dual-socket AMD EPYC 9004/9005-series (or other supported server-class CPUs), with the BIOS configured per the recommended settings

  • 3.0 TB or more of system memory

  • Eight 400G backend NICs (RoCE or InfiniBand)

GPU Identification#

All MI350X GPUs (PCI vendor:device 1002:75a0) should appear in lspci output:

sudo lspci -d 1002:75a0

Example (truncated for brevity – expect 8 lines):

05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
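Counting the lines by eye is error-prone across many nodes. The sketch below automates the check. The helper name `check_gpu_count` is hypothetical, and it assumes `lspci -d 1002:75a0` prints exactly one line per device, as in the example above.

```shell
# check_gpu_count [EXPECTED]: read `lspci -d 1002:75a0` output on stdin,
# count the non-blank lines, and compare against EXPECTED (default 8).
check_gpu_count() {
    expected=${1:-8}
    # grep -c exits non-zero when the count is 0, so mask that status.
    found=$(grep -c '[^[:space:]]' || true)
    if [ "$found" -eq "$expected" ]; then
        echo "PASS: $found GPUs found"
    else
        echo "FAIL: expected $expected GPUs, found $found"
        return 1
    fi
}

# Usage on a live system:
#   sudo lspci -d 1002:75a0 | check_gpu_count 8
```

Because the function reads standard input, it can also be run against saved `lspci` output collected from remote nodes.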

Acceptance Criteria#

The MI350X system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation (AGFHC recipes) → Performance Benchmarks.

System Acceptance Process#

  1. Prerequisites Validation - Ensure all system requirements and dependencies are met

  2. Basic Health Checks - Verify hardware detection and basic system health

  3. System Validation - Conduct comprehensive stress testing and qualification

  4. Performance Benchmarks - Validate compute, memory, and interconnect performance

The system is accepted when all required recipe runs and benchmarks pass without errors and no hardware faults appear in the logs.
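The four stages are meant to run in order, with a failure blocking everything after it. A minimal sketch of that gating logic follows. The function names `run_stage` and `run_acceptance` are hypothetical, and each stage is assumed to be a single command or script path that exits 0 on success.

```shell
# run_stage NAME COMMAND...: run one stage and report PASS or FAIL.
run_stage() {
    stage_name=$1
    shift
    if "$@"; then
        echo "PASS: $stage_name"
    else
        echo "FAIL: $stage_name"
        return 1
    fi
}

# run_acceptance PREREQ HEALTH VALIDATION BENCH: run the four stages in
# order; the && chain stops at the first stage that fails.
run_acceptance() {
    run_stage "Prerequisites Validation" "$1" &&
    run_stage "Basic Health Checks" "$2" &&
    run_stage "System Validation" "$3" &&
    run_stage "Performance Benchmarks" "$4"
}

# Usage (script names are placeholders, not part of any AMD package):
#   run_acceptance ./prereqs.sh ./health.sh ./agfhc_recipes.sh ./rccl_bench.sh
```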

Prerequisites Validation#

Ensure all system requirements are met before proceeding with validation. See the Prerequisites documentation and System setup for more details.

  • ✅ Supported operating system (see ROCm supported distributions)

  • ✅ ROCm 7.0.0 or later installed for MI350X (verify: cat /opt/rocm/.info/version)

  • ✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)

  • ✅ Required kernel parameters present:

    • pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu

    • intel_iommu=on (only when the host CPU is Intel)

  • ✅ Minimum 3.0 TB system memory available

  • ✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes

  • ✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)

  • ✅ Environment variables (if used):

    • HIP_FORCE_DEV_KERNARG=1 (default ≥ ROCm 6.2)

    • HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
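Two of the items above, the ROCm version floor and the kernel parameters, lend themselves to scripted checks. The sketch below is one way to do it. The helper names `version_ge` and `missing_kernel_params` are hypothetical, `sort -V` (as provided by GNU coreutils) is assumed to be available, and the version file is assumed to hold a dotted version optionally followed by a `-build` suffix.

```shell
# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Relies on `sort -V` for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# missing_kernel_params CMDLINE PARAM...: print every PARAM that does not
# appear as a whole word in CMDLINE; succeed only if none is missing.
missing_kernel_params() {
    cmdline=$1
    shift
    missing=0
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                          # present
            *) echo "missing: $param"; missing=1 ;;   # absent
        esac
    done
    [ "$missing" -eq 0 ]
}

# Usage on a live system (Intel hosts would append intel_iommu=on):
#   version_ge "$(cut -d- -f1 /opt/rocm/.info/version)" 7.0.0
#   missing_kernel_params "$(cat /proc/cmdline)" pci=realloc=off pci=bfsort \
#       iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu
```

Matching on whole words avoids false positives such as `iommu=pt` being satisfied by `intel_iommu=pt`.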

Basic Health Checks#

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see Health Checks.

Check OS distribution
  Command: cat /etc/os-release
  Pass: OS version listed in compatibility matrix
  Fail: Otherwise

Check kernel boot arguments
  Command: cat /proc/cmdline
  Pass: Shows all required params (pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu), plus intel_iommu=on if the host CPU is Intel
  Fail: Missing any required param

Check for driver errors
  Command: dmesg | grep -i error

Check available memory
  Command: free -h / cat /proc/meminfo
  Pass: ≥ 3.0T system memory available
  Fail: < 3.0T

Check GPU presence
  Command: sudo lspci -d 1002:75a0
  Pass: 8 MI350X GPUs found
  Fail: Otherwise

Check GPU link speed and width
  Command: sudo lspci -d 1002:75a0 -vvv | grep -e DevSta -e LnkSta
  Pass: Each GPU: Speed 32GT/s, Width x16, no FatalErr+
  Fail: Otherwise

Monitor utilization metrics
  Command: amd-smi monitor -putm
  Pass: Idle metrics as specified
  Fail: Otherwise

Check system kernel logs for errors
  Command: sudo dmesg -T | grep -iE 'error|warn|fail|exception'
  Pass: No matching lines in the output
  Fail: Otherwise
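The link speed and width check can be scripted by scanning the `LnkSta:` and `DevSta:` lines that `lspci -vvv` prints. The sketch below is a hypothetical helper (`check_pcie_links`), assuming the usual `lspci` wording `Speed 32GT/s` and `Width x16` on each `LnkSta:` line.

```shell
# check_pcie_links: read `lspci -vvv` output on stdin and verify that every
# LnkSta line reports 32GT/s x16 and that no DevSta line sets FatalErr+.
check_pcie_links() {
    awk '
        /LnkSta:/ {
            links++
            if ($0 !~ /Speed 32GT\/s/ || $0 !~ /Width x16/) bad++
        }
        # Require whitespace (or line start) before FatalErr+ so that
        # NonFatalErr+ is not counted as a fatal error.
        /DevSta:/ && /(^|[[:space:]])FatalErr\+/ { bad++ }
        END {
            if (links > 0 && bad == 0) {
                print "PASS: " links " link(s) at 32GT/s x16"
                exit 0
            }
            print "FAIL: " bad + 0 " problem(s) across " links + 0 " link(s)"
            exit 1
        }'
}

# Usage on a live system:
#   sudo lspci -d 1002:75a0 -vvv | check_pcie_links
```

The helper fails when no `LnkSta:` line is seen at all, so an empty `lspci` result is not mistaken for a healthy system.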

System Validation#

AGFHC (AMD GPU Field Health Check) provides structured recipes exercising PCIe, HBM, compute, power/thermal and fabric.

all_lvl5
  Command: /opt/amd/agfhc/agfhc -r all_lvl5 -o <output_dir>
  Purpose: Broad ~2h system-level coverage (PCIe, HBM, compute, power)
  Pass criteria: Overall result PASS / return code 0

hbm_lvl5 (run twice)
  Command: /opt/amd/agfhc/agfhc -r hbm_lvl5:i=2 -o <output_dir>
  Purpose: Intensive HBM stress & ECC observation
  Pass criteria: Both iterations PASS / no memory errors

pcie_lvl3
  Command: /opt/amd/agfhc/agfhc -r pcie_lvl3 -o <output_dir>
  Purpose: PCIe bandwidth & link health
  Pass criteria: PASS / expected link stability

miniHPL (optional)
  Command: /opt/amd/agfhc/agfhc -t miniHPL:d=120m -o <output_dir>
  Purpose: Linpack-like integration stress (MI350X)
  Pass criteria: PASS / completes without failures

Review results.json in the output directory or terminal summary; any FAIL requires remediation before acceptance.
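Since each recipe signals success through its return code, the required recipes can be chained in a small wrapper. The sketch below uses a hypothetical `run_recipes` helper and relies only on the `-r` and `-o` options shown above; it does not parse results.json, whose layout is not described here.

```shell
# run_recipes AGFHC_BIN OUTPUT_DIR RECIPE...: run each recipe in turn and
# stop at the first one that returns a non-zero code.
run_recipes() {
    agfhc_bin=$1
    out_dir=$2
    shift 2
    for recipe in "$@"; do
        rc=0
        "$agfhc_bin" -r "$recipe" -o "$out_dir" || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "PASS: $recipe"
        else
            echo "FAIL: $recipe (return code $rc)"
            return 1
        fi
    done
}

# Usage on a live system (the output directory is a placeholder):
#   run_recipes /opt/amd/agfhc/agfhc /tmp/agfhc_out all_lvl5 hbm_lvl5:i=2 pcie_lvl3
```

Stopping at the first failure keeps a multi-hour run from continuing on a node that already needs remediation.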

Performance Benchmarks#

RCCL All-Reduce bandwidth benchmark:

RCCL All-Reduce (in-place bus bandwidth @ 8 GB message)

Command: all_reduce_perf -b 8 -e 8G -f 2 -g 8

Pass: In-place busbw ≥ 304 GB/s
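The threshold comparison can be automated by extracting the in-place busbw from the benchmark output. The sketch below uses hypothetical helpers (`inplace_busbw`, `check_busbw`) and assumes the usual rccl-tests table layout, in which each data row starts with the message size in bytes and the in-place busbw is the 12th whitespace-separated field. Verify that layout against your build before relying on it.

```shell
# inplace_busbw SIZE_BYTES: read all_reduce_perf output on stdin and print
# the in-place busbw of the row whose first field equals SIZE_BYTES.
# Assumption: in-place busbw is field 12 of each data row.
inplace_busbw() {
    awk -v size="$1" '$1 == size { print $12; exit }'
}

# check_busbw THRESHOLD SIZE_BYTES: compare the extracted value against
# THRESHOLD (GB/s); awk performs the floating-point comparison.
check_busbw() {
    threshold=$1
    value=$(inplace_busbw "$2")
    if [ -z "$value" ]; then
        echo "FAIL: no result row for size $2"
        return 1
    fi
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
        echo "PASS: in-place busbw $value GB/s >= $threshold GB/s"
    else
        echo "FAIL: in-place busbw $value GB/s < $threshold GB/s"
        return 1
    fi
}

# Usage on a live system (8G corresponds to 8589934592 bytes):
#   all_reduce_perf -b 8 -e 8G -f 2 -g 8 | check_busbw 304 8589934592
```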

Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.