AMD Instinct MI350X

The AMD Instinct™ MI350X is a high-performance GPU accelerator designed for AI, HPC, and other demanding workloads. This document provides MI350X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.

Overview

The MI350X targets large-model AI training, high-throughput inference, and complex HPC workloads. The MI350X platform uses a Universal Baseboard (UBB 2.0) that hosts eight AMD Instinct MI350X OAM (OCP Accelerator Module) accelerators with a total of 2.3 TB of HBM3E memory.

Each MI350X GPU has a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 288 GB of HBM3E memory. It is built on AMD’s CDNA 4 architecture, and the accelerators on the baseboard are connected by a fully meshed Infinity Fabric™ topology.

System Requirements

Operating System Support

For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:

ROCm System Requirements - Supported Distributions

Note

The ROCm documentation is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.

Hardware Configuration

Expected GPU Configuration

Recommended high-level platform configuration:

  • Dual-socket AMD EPYC 9004/9005-series (or supported server-class CPUs) with BIOS configured per guide

  • 3.0 TB or more of system memory

  • Eight 400G backend NICs (RoCE or InfiniBand)

GPU Identification

All MI350X GPUs (PCI vendor:device 1002:75a0) should appear in lspci output:

sudo lspci -d 1002:75a0

Example (truncated for brevity – expect 8 lines):

05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
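
The GPU count check can be scripted. The following is a minimal sketch, not an official tool: the function names and the EXPECTED_GPUS variable are hypothetical, and the expected count of 8 is taken from the platform description above. It counts non-empty lines on stdin, so it relies on lspci having already filtered by vendor:device ID.

```shell
# Expected number of GPUs per node (assumption: one 8-OAM baseboard).
EXPECTED_GPUS=8

# Count non-empty lines on stdin. `lspci -d 1002:75a0` prints one line per
# matching device, so the line count is the device count.
count_mi350x() {
    grep -c . || true   # grep -c still prints 0 when nothing matches
}

# Read lspci output on stdin and report PASS or FAIL against EXPECTED_GPUS.
check_gpu_count() {
    found=$(count_mi350x)
    if [ "$found" -eq "$EXPECTED_GPUS" ]; then
        echo "PASS: $found of $EXPECTED_GPUS GPUs detected"
    else
        echo "FAIL: $found of $EXPECTED_GPUS GPUs detected"
        return 1
    fi
}
```

Usage: sudo lspci -d 1002:75a0 | check_gpu_count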

Acceptance Criteria Checklist

This section gives the high-level cluster acceptance criteria as a checklist so that each item can be executed and tracked. The checklist verifies that the system meets the technical, operational, and performance criteria required for “Go-Live” readiness. It is organized into four areas:

  1. Prerequisites Validation - Ensure all system requirements and dependencies are met

  2. Basic Health Checks - Verify hardware detection and basic system health

  3. System Validation - Conduct comprehensive single and multi-node stress testing and qualification

  4. Performance Benchmarks - Validate compute, memory, and interconnect performance

Each area consists of a defined set of criteria that are hyperlinked to the corresponding sections within this guide, enabling users to quickly access detailed procedures, execution steps, and supporting guidance.

The System Validation area, which includes both single-node and multi-node testing, defines minimum required execution (run) times for each test. These requirements ensure that validation is conducted under appropriate conditions to accurately assess system stability, performance, and reliability.

Successful completion of this checklist, with no errors or hardware faults observed in validation logs, confirms that the cluster has been properly configured, validated at both the single-node and multi-node levels, and is capable of supporting sustained AI workloads in a production environment.

Prerequisites Validation

Ensure all system requirements are met before proceeding with validation. See the Prerequisites documentation and System setup for more details.

  • ✅ Supported operating system (see ROCm supported distributions)

  • ✅ ROCm 7.0.1 or later installed for MI350X (verify: cat /opt/rocm/.info/version)

  • ✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)

  • ✅ Required kernel parameters present:

    • pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu

    • intel_iommu=on (only if the host CPU is Intel)

  • ✅ Minimum 3.0 TB system memory available

  • ✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes

  • ✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)

  • ✅ Environment variables (if used):

    • HIP_FORCE_DEV_KERNARG=1 (the default in ROCm 6.2 and later)

    • HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
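
Two of the checklist items above, the kernel parameters and the minimum ROCm version, lend themselves to a scripted check. The sketch below is one possible approach. The function names are hypothetical, the parameter list is copied from the checklist, and the version comparison assumes GNU sort with the -V option is available.

```shell
# Required kernel parameters, copied from the checklist above.
# intel_iommu=on is left out because it applies only to Intel hosts.
REQUIRED_PARAMS="pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu"

# Print every required parameter that is absent from the given command line.
# $1: kernel command line, for example "$(cat /proc/cmdline)"
missing_kernel_params() {
    for param in $REQUIRED_PARAMS; do
        case " $1 " in
            *" $param "*) ;;        # present as a whole word
            *) echo "$param" ;;     # missing
        esac
    done
}

# Succeed if dotted version $1 is greater than or equal to minimum $2.
# The smaller of the two versions sorts first under `sort -V`.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: missing_kernel_params "$(cat /proc/cmdline)" prints nothing when all parameters are present. For the version check, version_at_least "$(cut -d- -f1 /opt/rocm/.info/version)" 7.0.1 strips any build suffix after the first hyphen before comparing.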

Basic Health Checks

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see Health Checks.

Each check lists the command to run and its pass/fail criteria.

Check OS distribution
    Command: cat /etc/os-release
    Pass: OS version is listed in the compatibility matrix
    Fail: Otherwise

Check kernel boot arguments
    Command: cat /proc/cmdline
    Pass: All required parameters are present (pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu), plus intel_iommu=on on Intel hosts
    Fail: Any required parameter is missing

Check for driver errors
    Command: dmesg | grep -i error
    Pass: No output (no GPU-related errors)
    Fail: Errors present

Check available memory
    Command: free -h or cat /proc/meminfo
    Pass: ≥ 3.0T system memory available
    Fail: < 3.0T

Check GPU presence
    Command: sudo lspci -d 1002:75a0
    Pass: 8 MI350X GPUs found
    Fail: Otherwise

Check GPU link speed and width
    Command: sudo lspci -d 1002:75a0 -vvv | grep -e DevSta -e LnkSta
    Pass: Each GPU reports Speed 32GT/s, Width x16, and no FatalErr+
    Fail: Otherwise

Monitor utilization metrics
    Command: amd-smi monitor -putm
    Pass: Idle metrics as specified
    Fail: Otherwise

Check system kernel logs for errors
    Command: sudo dmesg -T | grep -iE 'error|warn|fail|exception'
    Pass: No output
    Fail: Otherwise
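
The link speed and width check above produces several lines per GPU, which are tedious to read by eye. The following sketch filters that output down to the lines that violate the pass criteria. The function name bad_links is hypothetical. It assumes the usual lspci -vvv layout, in which link status appears on a line containing "LnkSta:" and device status flags are space-separated on a line containing "DevSta:".

```shell
# Read `lspci -vvv` output on stdin and print only the offending lines:
#   - LnkSta lines that lack the expected speed or width
#   - DevSta lines that report FatalErr+
# Prints nothing when every link passes.
# $1: expected speed (default 32GT/s), $2: expected width (default x16)
bad_links() {
    awk -v s="Speed ${1:-32GT/s}" -v w="Width ${2:-x16}" '
        # index() is a literal substring search, so "/" needs no escaping.
        # "LnkSta:" does not match the separate "LnkSta2:" line.
        /LnkSta:/ && (index($0, s) == 0 || index($0, w) == 0) { print; next }
        # The leading space keeps "NonFatalErr+" from matching.
        /DevSta:/ && index($0, " FatalErr+") > 0 { print }
    '
}
```

Usage: sudo lspci -d 1002:75a0 -vvv | bad_links. Empty output corresponds to a pass for this check.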

System Validation

AGFHC (AMD GPU Field Health Check) provides structured recipes exercising PCIe, HBM, compute, power/thermal and fabric.

Single-Node Tests

The following single-node tests must each be run for at least the required run time, with no failures reported in the validation logs.

all_lvl5
    Command: /opt/amd/agfhc/agfhc -r all_lvl5 -o <output_dir>
    Run time: 2 hours
    Purpose: Broad system-level coverage (PCIe, HBM, compute, power)
    Pass criteria: Overall result PASS / return code 0

hbm_lvl5 (4 iterations)
    Command: /opt/amd/agfhc/agfhc -r hbm_lvl5:i=4 -o <output_dir>
    Run time: 8 hours
    Purpose: Intensive HBM stress & ECC observation
    Pass criteria: All iterations PASS / no memory errors

gfx_lvl4
    Command: /opt/amd/agfhc/agfhc -r gfx_lvl4 -o <output_dir>
    Run time: 1 hour
    Purpose: GPU compute stress test
    Pass criteria: PASS / return code 0

miniHPL
    Command: /opt/amd/agfhc/agfhc -t minihpl:d=3h -o <output_dir>
    Run time: 3 hours (10 hours recommended)
    Purpose: Linpack-like integration stress
    Pass criteria: PASS / completes without failures

pcie_lvl2
    Command: /opt/amd/agfhc/agfhc -r pcie_lvl2 -o <output_dir>
    Run time: 10 minutes
    Purpose: PCIe bandwidth & link health
    Pass criteria: PASS / expected link stability

Single-node RCCL
    Command: all_reduce_perf -b 8 -e 8G -f 2 -g 8
    Run time: 2–11 minutes
    Purpose: Single-node GPU interconnect validation
    Pass criteria: busbw meets expected thresholds

AI Workloads
    Command: See workload validation
    Run time: 1–24 hours
    Purpose: Sustained AI workload (Llama 3.1 70B with JAX)
    Pass criteria: Completes without failures
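
Because several of the single-node tests use return code 0 as a pass criterion, the AGFHC recipes can be chained in a small wrapper that records each result. This is a sketch under stated assumptions, not a supported tool. The function names and the OUT_DIR default are hypothetical. The AGFHC path and recipe names are those listed above. Writing each recipe to its own subdirectory is a choice made here, not a documented requirement. miniHPL is omitted because it is invoked with -t rather than -r.

```shell
# Run one validation step and report PASS or FAIL from its exit status.
# $1: label for the log line; remaining arguments: the command to run.
run_step() {
    label=$1
    shift
    if "$@"; then
        echo "PASS: $label"
    else
        rc=$?   # exit status of the command that was just tested
        echo "FAIL: $label (return code $rc)"
        return 1
    fi
}

# Path from the test list above; OUT_DIR is a hypothetical location.
AGFHC=${AGFHC:-/opt/amd/agfhc/agfhc}
OUT_DIR=${OUT_DIR:-/tmp/agfhc_results}

# Run the -r recipes in sequence, continuing after a failure so that every
# recipe is exercised. Returns non-zero if any recipe failed.
run_agfhc_recipes() {
    failed=0
    for recipe in all_lvl5 hbm_lvl5:i=4 gfx_lvl4 pcie_lvl2; do
        # ${recipe%%:*} drops recipe options such as ":i=4" from the dir name
        run_step "$recipe" "$AGFHC" -r "$recipe" -o "$OUT_DIR/${recipe%%:*}" || failed=1
    done
    return "$failed"
}
```

Usage: run_agfhc_recipes | tee agfhc_summary.log. Each recipe still needs its full required run time, so expect the sequence to take many hours.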

Multi-Node Tests

The following multi-node tests must each be run for at least the required run time, with no failures reported in the validation logs.

OFED Performance Tests
    Reference: Network validation
    Run time: 2 hours
    Purpose: RDMA fabric bandwidth and latency
    Pass criteria: All tests PASS / expected bandwidth

Multi-node RCCL
    Reference: Network validation
    Run time: Up to 128 nodes, 10 hours
    Purpose: Multi-node GPU fabric validation
    Pass criteria: All nodes PASS / expected bandwidth

AI Workloads
    Reference: Cluster validation
    Run time: 24 hours
    Purpose: Sustained AI workload (Llama 3.1 405B with JAX)
    Pass criteria: Completes without failures

For tests run with -o <output_dir>, review results.json in that output directory or the terminal summary. Any FAIL requires remediation before acceptance.

Performance Benchmarks

RCCL All-Reduce bandwidth benchmark:

RCCL All-Reduce (in-place bus bandwidth @ 8 GB message)

Command: all_reduce_perf -b 8 -e 8G -f 2 -g 8

Pass: In-place busbw ≥ 304 GB/s

Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.
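
The threshold comparison can be automated by parsing the benchmark output. The sketch below is one way to do it, and the function name check_busbw is hypothetical. It assumes the usual rccl-tests table layout, in which each data row has 13 whitespace-separated columns and the in-place busbw is column 12. It also assumes that -e 8G produces a final row of 8589934592 bytes. Verify both against the header printed by your build before relying on the result.

```shell
# Read all_reduce_perf output on stdin and check the in-place busbw of one row.
# $1: minimum busbw in GB/s (default 304, the threshold above)
# $2: message size in bytes (default 8589934592, assumed to match -e 8G)
check_busbw() {
    awk -v min="${1:-304}" -v size="${2:-8589934592}" '
        # Header lines start with "#", so they never match the size.
        $1 == size && NF >= 12 {
            found = 1
            # Column 12 is assumed to be the in-place busbw.
            if ($12 + 0 >= min + 0) { print "PASS: busbw " $12 " GB/s"; ok = 1 }
            else { print "FAIL: busbw " $12 " GB/s < " min " GB/s" }
        }
        END {
            if (!found) print "FAIL: no row for size " size
            exit (ok ? 0 : 1)
        }'
}
```

Usage: all_reduce_perf -b 8 -e 8G -f 2 -g 8 | check_busbw 304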