AMD Instinct MI355X
The AMD Instinct™ MI355X is a high-performance GPU accelerator designed for AI, HPC, and demanding workloads. This document provides MI355X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.
Overview
The MI355X targets large-model AI training, high-throughput inference, and complex HPC workloads. The MI355X platform uses a Universal Baseboard (UBB 2.0) configuration that hosts 8 AMD Instinct MI355X OAM (OCP Accelerator Module) accelerators with a total of 2.3 TB of HBM3E memory (8 × 288 GB).
Each MI355X GPU features a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 288 GB of HBM3E memory. It uses AMD's CDNA 4 architecture with fully meshed Infinity Fabric™ connectivity between accelerators.
System Requirements
Operating System Support
For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:
ROCm System Requirements - Supported Distributions
Note
The ROCm documentation is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.
Hardware Configuration
Expected GPU Configuration
Recommended high-level platform configuration:
Dual-socket AMD EPYC 9004/9005-series (or supported server-class CPUs) with BIOS configured per guide
3.0 TB or more of system memory
Eight 400G backend NICs (RoCE or InfiniBand)
GPU Identification
All MI355X GPUs (PCI vendor:device 1002:75a3) should appear in lspci output:
sudo lspci -d 1002:75a3
Example (truncated for brevity – expect 8 lines):
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
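The count check above can be scripted. The following is a minimal sketch, assuming the input was already filtered with `-d 1002:75a3` so that each non-empty line is one GPU; the function names `count_listed_gpus` and `expect_gpu_count` are illustrative and not part of any AMD tool.

```shell
# Count non-empty lines of lspci output supplied on stdin.
# Assumes the caller filtered with "-d 1002:75a3", so one line == one GPU.
count_listed_gpus() {
    grep -c '[^[:space:]]'
}

# Report PASS only when the number of listed GPUs equals the expected
# count (defaults to 8, the figure given in the text above).
expect_gpu_count() {
    expected="${1:-8}"
    # grep -c exits non-zero on zero matches; "|| true" keeps the count usable
    found="$(count_listed_gpus || true)"
    if [ "$found" -eq "$expected" ]; then
        echo "PASS: $found MI355X GPUs found"
    else
        echo "FAIL: expected $expected MI355X GPUs, found $found"
        return 1
    fi
}
```

A possible invocation is `sudo lspci -d 1002:75a3 | expect_gpu_count 8`. Reading from stdin keeps the check usable on saved logs as well as live output.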
Acceptance Criteria
The MI355X system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation (AGFHC recipes) → Performance Benchmarks.
System Acceptance Process
Prerequisites Validation - Ensure all system requirements and dependencies are met
Basic Health Checks - Verify hardware detection and basic system health
System Validation - Conduct comprehensive stress testing and qualification
Performance Benchmarks - Validate compute, memory, and interconnect performance
The system is accepted when all required recipe runs and benchmarks pass without errors and no hardware faults appear in the logs.
Prerequisites Validation
Ensure all system requirements are met before proceeding with validation. See the Prerequisites documentation and System setup for more details.
✅ Supported operating system (see ROCm supported distributions)
✅ ROCm 7.0.0 or later installed for MI355X (verify: cat /opt/rocm/.info/version)
✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)
✅ Required kernel parameters present: pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu (add intel_iommu=on if the host CPU is Intel)
✅ Minimum 3.0 TB system memory available
✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes
✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)
✅ Environment variables (if used):
HIP_FORCE_DEV_KERNARG=1 (the default in ROCm 6.2 and later)
HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
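Two of the checklist items lend themselves to a scripted check: the kernel parameters and the ROCm version. The sketch below is one way to do it under stated assumptions: the parameter list is copied from the checklist, the version string is assumed to start with a numeric major component (as in `7.0.0`), and the names `REQUIRED_PARAMS`, `missing_kernel_params`, and `rocm_version_ok` are hypothetical helpers, not part of ROCm.

```shell
# Required parameters, copied from the checklist above.
REQUIRED_PARAMS="pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu"

# Print each required parameter that is absent from a kernel command line.
# $1: command line string (for example the contents of /proc/cmdline)
# $2: optional extra parameters (for example "intel_iommu=on" on Intel hosts)
missing_kernel_params() {
    cmdline=" $1 "
    for param in $REQUIRED_PARAMS ${2:-}; do
        case "$cmdline" in
            *" $param "*) ;;        # present as a whole word
            *) echo "$param" ;;     # absent: report it
        esac
    done
}

# Succeed when the major version is at least 7 (ROCm 7.0.0 or later).
# $1: version string; assumed to begin with "<major>."
rocm_version_ok() {
    major="${1%%.*}"
    [ "$major" -ge 7 ] 2>/dev/null
}
```

For example, `missing_kernel_params "$(cat /proc/cmdline)"` prints nothing when every parameter is present, and `rocm_version_ok "$(cat /opt/rocm/.info/version)"` returns a non-zero status on an older install.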
Basic Health Checks
These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see Health Checks.
| Test | Command | Pass/Fail criteria |
|---|---|---|
| | | Pass: OS version listed in compatibility matrix |
| | | Pass: Shows all required params |
| | `dmesg \| grep -i error` | |
| | | Pass: ≥ 3.0 TB system memory available |
| | | Pass: 8 MI355X GPUs found |
| | `sudo lspci -d 1002:75a3 -vvv \| grep -e DevSta -e LnkSta` | |
| | | Pass: Idle metrics as specified |
| | | Pass: Null |
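The PCIe link check in the table can be turned into a filter that prints only suspicious links. This is a sketch only: the expected speed and width are caller-supplied assumptions (this document does not state them), the `LnkSta:` line layout is assumed to follow common pciutils output, and `find_degraded_links` is a hypothetical helper name.

```shell
# Print LnkSta lines whose speed or width differs from the expected values.
# $1: expected speed string, $2: expected width string. Both are supplied
# by the caller; they are not values taken from this document.
find_degraded_links() {
    awk -v speed="$1" -v width="$2" '
        /LnkSta:/ {
            s = ""; w = ""
            # The value is the field that follows the "Speed" / "Width" keyword
            for (i = 1; i < NF; i++) {
                if ($i == "Speed") s = $(i + 1)
                if ($i == "Width") w = $(i + 1)
            }
            # Drop a trailing comma so "32GT/s," compares equal to "32GT/s"
            sub(/,$/, "", s)
            sub(/,$/, "", w)
            if (s != speed || w != width) print
        }'
}
```

One way to use it is `sudo lspci -d 1002:75a3 -vvv | find_degraded_links <speed> <width>`, with the values taken from your platform guide; empty output means every listed link matched.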
System Validation
AGFHC (AMD GPU Field Health Check) provides structured recipes exercising PCIe, HBM, compute, power/thermal and fabric.
| Recipe | Command | Purpose | Pass Criteria |
|---|---|---|---|
| | | Broad ~2h system-level coverage (PCIe, HBM, compute, power) | Overall result PASS / return code 0 |
| hbm_lvl5 (run twice) | | Intensive HBM stress & ECC observation | Both iterations PASS / no memory errors |
| | | PCIe bandwidth & link health | PASS / expected link stability |
| miniHPL (optional) | | Linpack-like integration stress (MI350X) | PASS / completes without failures |
Review results.json in the output directory or the terminal summary; any FAIL requires remediation before acceptance.
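For recipes that must pass on repeated runs (hbm_lvl5 is listed as "run twice"), a small wrapper can enforce the "return code 0" criterion on each iteration. The sketch below is generic: it takes the command as arguments rather than assuming any AGFHC invocation, and `run_iterations` is a hypothetical helper name.

```shell
# Run a command a fixed number of times, stopping at the first failure.
# $1: number of iterations; remaining arguments: the command to run.
run_iterations() {
    count="$1"
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i/$count: $*"
        if ! "$@"; then
            echo "FAIL on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "PASS: $count iterations completed with return code 0"
}
```

Calling `run_iterations 2` followed by your site's hbm_lvl5 command line covers the two-pass requirement. The return code alone does not replace reviewing results.json for memory errors.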
Performance Benchmarks
RCCL All-Reduce bandwidth benchmark:
RCCL All-Reduce (in-place bus bandwidth @ 8 GB message)
all_reduce_perf -b 8 -e 8G -f 2 -g 8
Pass: In-place busbw ≥ 304 GB/s
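The threshold comparison can be automated by extracting the in-place busbw value for the largest message. The sketch below assumes the usual rccl-tests table layout, in which the message size in bytes is column 1 and the in-place busbw is column 12; verify this against your build's header line. The names `inplace_busbw` and `busbw_meets` are hypothetical helpers.

```shell
# Print the in-place busbw for one message size from benchmark output on stdin.
# $1: message size in bytes.
# Assumption: size is column 1 and in-place busbw is column 12.
inplace_busbw() {
    awk -v size="$1" '$1 !~ /^#/ && $1 == size { print $12 }'
}

# Succeed when the measured value is at least the threshold.
# $1: measured GB/s, $2: minimum GB/s (304 in the criterion above).
busbw_meets() {
    awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}
```

A possible use, assuming `8G` is reported as 8589934592 bytes: `bw="$(all_reduce_perf -b 8 -e 8G -f 2 -g 8 | inplace_busbw 8589934592)"` followed by `busbw_meets "$bw" 304`.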
Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.