# AMD Instinct MI350X

The AMD Instinct™ MI350X is a high-performance GPU accelerator designed for AI, HPC, and demanding workloads. This document provides MI350X-specific prerequisites, health checks, validation steps, and performance acceptance criteria.

## Overview

The MI350X targets training of large AI models, high-throughput inference, and complex HPC workloads. The MI350X platform uses a Universal Baseboard (UBB 2.0) that hosts eight AMD Instinct MI350X OAM (OCP Accelerator Module) accelerators, for a total of 2.3 TB of HBM3E memory (8 × 288 GB).

Each MI350X GPU uses a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 288 GB of HBM3E memory. It is built on AMD's CDNA 4 architecture, with fully meshed Infinity Fabric™ connectivity between accelerators.

- **[MI350X Product Page](https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html)**
- **[MI350X Product Brief](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/amd-instinct-mi350x-gpu-brochure.pdf)**
- **[MI350X Platform Brief](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/amd-instinct-mi350x-platform-brochure.pdf)**
- **[MI350 Series Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi350.html)**

## System Requirements

### Operating System Support

For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:

[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions)

```{note}
[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.
```

## Hardware Configuration

### Expected GPU Configuration

Recommended high-level platform configuration:

- Dual-socket AMD EPYC 9004/9005-series (or other supported server-class CPUs), with BIOS configured per the recommended settings
- 3.0 TB or more of system memory
- Eight 400G backend NICs (RoCE or InfiniBand)

### GPU Identification

All MI350X GPUs (PCI vendor:device 1002:75a0) should appear in `lspci` output:

```bash
sudo lspci -d 1002:75a0
```

Example (truncated for brevity – expect 8 lines):

```text
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
15:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
...
f5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a0
```
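
The count check can be scripted. The sketch below is a minimal example, not an official tool: `count_gpu_lines` and `check_gpu_count` are hypothetical helper names, and it assumes that each non-empty line of `lspci -d 1002:75a0` output corresponds to one device.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count MI350X devices in `lspci -d 1002:75a0` output.
EXPECTED_GPUS=8   # eight OAM accelerators per UBB 2.0 baseboard

# Reads lspci output on stdin and prints the number of non-empty lines.
# With the -d filter, each line is assumed to describe one matching device.
count_gpu_lines() {
    grep -c '[^[:space:]]' || true
}

# Prints PASS/FAIL and returns 0 only when exactly EXPECTED_GPUS are listed.
check_gpu_count() {
    local found
    found=$(count_gpu_lines)
    if [ "$found" -eq "$EXPECTED_GPUS" ]; then
        echo "PASS: found $found GPUs"
    else
        echo "FAIL: found $found GPUs, expected $EXPECTED_GPUS"
        return 1
    fi
}

# Usage on a live node:
#   sudo lspci -d 1002:75a0 | check_gpu_count
```

Reading from stdin keeps the check usable on saved logs as well as on a live node.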

## Acceptance Criteria Checklist

This section presents the high-level cluster acceptance criteria as a checklist that can be executed and tracked step by step. The checklist verifies that the system meets the technical, operational, and performance criteria required for "Go-Live" readiness. It is organized into the following areas:

1. **[Prerequisites Validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met
2. **[Basic Health Checks](#basic-health-checks)** - Verify hardware detection and basic system health
3. **[System Validation](#system-validation)** - Conduct comprehensive single and multi-node stress testing and qualification
4. **[Performance Benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance

Each area consists of a defined set of criteria that are hyperlinked to the corresponding sections within this guide, enabling users to quickly access detailed procedures, execution steps, and supporting guidance.

The System Validation area, which includes both single-node and multi-node testing, defines minimum required execution (run) times for each test. These requirements ensure that validation is conducted under appropriate conditions to accurately assess system stability, performance, and reliability.

Successful completion of this checklist, with no errors or hardware faults observed in validation logs, confirms that the cluster has been properly configured, validated at both the single-node and multi-node levels, and is capable of supporting sustained AI workloads in a production environment.

### Prerequisites Validation

Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details.

- ✅ Supported operating system (see ROCm supported distributions)
- ✅ ROCm 7.0.1 or later installed for MI350X (verify: `cat /opt/rocm/.info/version`)
- ✅ BIOS configured per recommended settings (PCIe, xGMI, IOMMU enabled, DF C-states disabled, etc.)
- ✅ Required kernel parameters present:
  - `pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu`
  - `intel_iommu=on` if Intel host CPU
- ✅ Minimum 3.0 TB system memory available
- ✅ Latest applicable firmware (BIOS, BMC, PCIe switch, NIC, GPU BKC) applied consistently across nodes
- ✅ ROCm Validation Suite (RVS) and AGFHC prerequisites installed (AGFHC package)
- ✅ Environment variables (if used):
  - `HIP_FORCE_DEV_KERNARG=1` (the default in ROCm 6.2 and later)
  - `HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0`
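
Two of the items above, the ROCm version floor and the kernel parameters, lend themselves to a scripted check. The following is a minimal sketch under stated assumptions: `version_ge` and `missing_params` are hypothetical helper names, the version comparison relies on GNU `sort -V`, and the version file is assumed to contain a dotted version optionally followed by a `-build` suffix. `intel_iommu=on` is left out of the list because it applies only to Intel hosts.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for two prerequisite checks.

# version_ge A B: succeeds when dotted version A is >= B (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Kernel parameters required by the checklist (Intel-only flag not included).
REQUIRED_PARAMS="pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu"

# missing_params CMDLINE: prints each required parameter absent from CMDLINE,
# one per line. Prints nothing when all parameters are present.
missing_params() {
    local cmdline=" $1 " param
    for param in $REQUIRED_PARAMS; do
        case "$cmdline" in
            *" $param "*) ;;          # exact, whitespace-delimited match
            *) echo "$param" ;;
        esac
    done
}

# Usage on a live node:
#   version_ge "$(cut -d- -f1 /opt/rocm/.info/version)" 7.0.1 || echo "ROCm too old"
#   missing_params "$(cat /proc/cmdline)"
```

Matching on whitespace-delimited tokens avoids false positives from parameters that merely share a prefix, such as `iommu=pt` inside `intel_iommu=pt`.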

### Basic Health Checks

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md).

| Test | Command | Pass/Fail criteria |
|------|---------|-------------------|
| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix<br>**Fail**: Otherwise |
| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Shows all required params (`pci=realloc=off pci=bfsort iommu=pt numa_balancing=disable modprobe.blacklist=amdgpu`), plus `intel_iommu=on` on Intel host CPUs<br>**Fail**: Missing any required param |
| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg \| grep -i error` | **Pass**: No GPU-related errors reported<br>**Fail**: GPU-related errors present |
| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `free -h` / `cat /proc/meminfo` | **Pass**: ≥ 3.0 TB system memory available<br>**Fail**: < 3.0 TB |
| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:75a0` | **Pass**: 8 MI350X GPUs found<br>**Fail**: Otherwise |
| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:75a0 -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Each GPU: Speed 32GT/s, Width x16, no `FatalErr+`<br>**Fail**: Otherwise |
| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified<br>**Fail**: Otherwise |
| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: No output<br>**Fail**: Otherwise |
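
The link speed and width check in the table above can be automated by parsing the `lspci -vvv` output. This is a rough sketch rather than a supported tool: `check_links` is a hypothetical helper, and it assumes the usual `DevSta:` and `LnkSta:` line layout printed by `lspci`. It compares `FatalErr+` as a whole field, because a plain substring match would also hit `NonFatalErr+`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: validate PCIe link status lines read from stdin.
# Expects the output of: sudo lspci -d 1002:75a0 -vvv | grep -e DevSta -e LnkSta
check_links() {
    awk '
        # "LnkSta:" with the colon skips the separate "LnkSta2:" lines.
        /LnkSta:/ {
            links++
            if ($0 !~ /Speed 32GT\/s/ || $0 !~ /Width x16/) bad++
        }
        # Compare whole fields so that "NonFatalErr+" is not counted.
        /DevSta:/ {
            for (i = 1; i <= NF; i++) if ($i == "FatalErr+") bad++
        }
        END {
            printf "%d links checked, %d problems\n", links, bad
            exit (bad > 0 || links == 0)
        }
    '
}

# Usage on a live node:
#   sudo lspci -d 1002:75a0 -vvv | grep -e DevSta -e LnkSta | check_links
```

The function returns non-zero when any link is degraded, when a fatal error flag is set, or when no link lines were found at all. The reported link count should also equal the expected GPU count of 8.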

### System Validation

AGFHC (AMD GPU Field Health Check) provides structured test recipes that exercise PCIe, HBM, compute, power/thermal, and fabric.

#### Single-Node Tests

The following single-node tests must be run for the required run time, with no failures reported in the validation logs.

| Test | Command | Run Time | Purpose | Pass Criteria |
|--------|---------|----------|---------|---------------|
| [all_lvl5](../common/system-validation.md#all_lvl5) | `/opt/amd/agfhc/agfhc -r all_lvl5 -o <output_dir>` | 2 hours | Broad system-level coverage (PCIe, HBM, compute, power) | Overall result PASS / return code 0 |
| [hbm_lvl5](../common/system-validation.md#hbm_lvl5) (4 iterations) | `/opt/amd/agfhc/agfhc -r hbm_lvl5:i=4 -o <output_dir>` | 8 hours | Intensive HBM stress & ECC observation | All iterations PASS / no memory errors |
| [gfx_lvl4](../common/system-validation.md#gfx_lvl4) | `/opt/amd/agfhc/agfhc -r gfx_lvl4 -o <output_dir>` | 1 hour | GPU compute stress test | PASS / return code 0 |
| [miniHPL](../common/system-validation.md#minihpl) | `/opt/amd/agfhc/agfhc -t minihpl:d=3h -o <output_dir>` | 3 hours (10 hours recommended) | Linpack-like integration stress | PASS / completes without failures |
| [pcie_lvl2](../common/system-validation.md#pcie_lvl2) | `/opt/amd/agfhc/agfhc -r pcie_lvl2 -o <output_dir>` | 10 minutes | PCIe bandwidth & link health | PASS / expected link stability |
| [Single-node RCCL](../common/rccl-benchmarking.md#single-node-rccl-testing) | `all_reduce_perf -b 8 -e 8G -f 2 -g 8` | 2–11 minutes | Single-node GPU interconnect validation | busbw meets expected thresholds |
| [AI Workloads](../network/validation.md#ai-workload-validation-with-the-cluster-validation-suite) | See workload validation | 1–24 hours | Sustained AI workload (Llama 3.1 70B with JAX) | Completes without failures |
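
Because several of the tests above are judged by return code, a small wrapper can run them in sequence and collect the failures. The sketch below is illustrative only: `run_step` and `FAILED_STEPS` are hypothetical names, and the output directory layout in the usage comment is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a validation command and record PASS/FAIL
# from its return code, continuing with later steps after a failure.
FAILED_STEPS=""

# run_step NAME CMD [ARGS...]
run_step() {
    local name=$1 rc=0
    shift
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "PASS: $name"
    else
        echo "FAIL: $name (return code $rc)"
        FAILED_STEPS="$FAILED_STEPS $name"
    fi
}

# Usage on a live node (OUT is an assumed output directory):
#   OUT=/tmp/agfhc-results
#   run_step pcie_lvl2 /opt/amd/agfhc/agfhc -r pcie_lvl2 -o "$OUT/pcie_lvl2"
#   run_step gfx_lvl4  /opt/amd/agfhc/agfhc -r gfx_lvl4  -o "$OUT/gfx_lvl4"
#   [ -z "$FAILED_STEPS" ] || echo "Remediate before acceptance:$FAILED_STEPS"
```

A return code of 0 is a necessary signal, not a sufficient one. The validation logs still need to be reviewed for each run.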

#### Multi-Node Tests

The following multi-node tests must be run for the required run time, with no failures reported in the validation logs.

| Test | Reference | Run Time | Purpose | Pass Criteria |
|------|-----------|----------|---------|---------------|
| [OFED Performance Tests](../network/rdma-benchmarking.md#ofed-performance-tests) | Network validation | 2 hours | RDMA fabric bandwidth and latency | All tests PASS / expected bandwidth |
| [Multi-node RCCL](../network/validation.md#rccl-multi-node-fabric-test) | Network validation | Up to 128 nodes, 10 hours | Multi-node GPU fabric validation | All nodes PASS / expected bandwidth |
| [AI Workloads](../network/validation.md#ai-workload-validation-with-the-cluster-validation-suite) | Cluster validation | 24 hours | Sustained AI workload (Llama 3.1 405B with JAX) | Completes without failures |

After each run, review `results.json` in the output directory, or the terminal summary; any FAIL result requires remediation before acceptance.

### Performance Benchmarks

RCCL All-Reduce bandwidth benchmark:

:::{card} Command: `all_reduce_perf -b 8 -e 8G -f 2 -g 8`
RCCL All-Reduce (in-place bus bandwidth @ 8 GB message)
^^^
**Pass:** In-place busbw ≥ 304 GB/s
+++
**Fail:** Otherwise
:::

Additional application or GEMM benchmarks may be executed as supplemental evidence, but only the RCCL all-reduce threshold above is required in this template.
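
The RCCL threshold above can be checked against saved benchmark output. The following sketch is an assumption-laden example: `check_busbw` is a hypothetical helper, and it assumes the common `all_reduce_perf` table layout in which data rows have 13 columns and the in-place `busbw` is column 12. Verify the column order against the header printed by your build before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the in-place busbw of the largest message
# (the last data row) with a minimum threshold in GB/s.
# Assumed row layout: size count type redop root
#                     time algbw busbw #wrong   (out-of-place)
#                     time algbw busbw #wrong   (in-place)
check_busbw() {
    local min=${1:-304}   # default threshold from the card above
    awk -v min="$min" '
        # Data rows start with a numeric size; header lines start with "#".
        $1 ~ /^[0-9]+$/ && NF >= 13 { size = $1; bw = $12; found = 1 }
        END {
            if (!found) { print "FAIL: no data rows found"; exit 1 }
            verdict = (bw + 0 >= min + 0) ? "PASS" : "FAIL"
            printf "%s: in-place busbw %s GB/s at %s bytes (threshold %s GB/s)\n",
                   verdict, bw, size, min
            exit (verdict == "FAIL")
        }
    '
}

# Usage on a live node:
#   all_reduce_perf -b 8 -e 8G -f 2 -g 8 | tee rccl.log | check_busbw 304
```

Using the last data row avoids hard-coding the byte count that `-e 8G` expands to.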
