# AMD Instinct MI300X

The AMD Instinct™ MI300X is a high-performance GPU accelerator designed for AI, HPC, and other demanding workloads. This document covers MI300X-specific requirements, specifications, and acceptance testing criteria.

## Overview

The MI300X platform uses a Universal Baseboard (UBB 2.0) that hosts 8 AMD Instinct MI300X OAM (OCP Accelerator Module) accelerators with a total of 1.5 TB of HBM3 memory.

Each MI300X GPU features a multi-chiplet design with 8 XCDs (Accelerator Complex Dies) and 192 GB of HBM3 memory per accelerator. It is built on AMD's CDNA 3 architecture, with fully meshed Infinity Fabric™ connectivity between accelerators.

- **[MI300X Product Page](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)**
- **[MI300X Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf)**
- **[MI300 Series Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi300.html)**

## System Requirements

### Operating System Support

For the most up-to-date information on supported operating systems and distributions, please refer to the official ROCm documentation:

[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions)

```{note}
[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.
```

### GPU Identification

All MI300X GPUs should appear in `lspci` output:

```bash
lspci | grep MI300X
```

Expected output example:

```text
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
26:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
46:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
65:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
85:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
a6:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
c6:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
e5:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]
```
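
The presence check above can be scripted. The following is a minimal sketch, assuming the `lspci` description contains the string `MI300X` as in the example output; the helper names `count_mi300x` and `check_gpu_count` are hypothetical and not part of any AMD tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count lines mentioning MI300X in lspci-style text on stdin.
count_mi300x() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  grep -c 'MI300X' || true
}

# Hypothetical helper: succeed only if the expected number of GPUs is listed.
# $1 = expected GPU count (defaults to 8, the UBB configuration described above).
check_gpu_count() {
  local expected="${1:-8}"
  local found
  found="$(count_mi300x)"
  if [ "$found" -eq "$expected" ]; then
    echo "PASS: found $found of $expected MI300X GPUs"
  else
    echo "FAIL: found $found of $expected MI300X GPUs"
    return 1
  fi
}
```

On a live system this could be invoked as `lspci | check_gpu_count 8`.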

## Acceptance Criteria

The MI300X system acceptance process is designed to validate that your system is operating correctly and meets all performance specifications. Users are expected to step through the validation guides in sequence to ensure comprehensive system verification.

### System Acceptance Process

1. **[Prerequisites Validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met
2. **[Basic Health Checks](#basic-health-checks)** - Verify hardware detection and basic system health
3. **[System Validation](#system-validation)** - Conduct comprehensive stress testing and qualification
4. **[Performance Benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance

The system is accepted when all of the criteria in the following sections pass.

### Prerequisites Validation

Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details.

- ✅ Supported operating system version installed
- ✅ Compatible ROCm version installed
- ✅ System manufacturer compatibility verified
- ✅ All required dependencies installed

### Basic Health Checks

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md).

| Test | Command | Pass/Fail criteria |
|------|---------|-------------------|
| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix<br>**Fail**: Otherwise |
| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off`, `amd_iommu=on` or `intel_iommu=on`, and `iommu=pt`<br>**Fail**: Otherwise |
| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: No output<br>**Fail**: Errors reported |
| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: 1.5T or more<br>**Fail**: Less than 1.5T |
| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `lspci \| grep MI300X` | **Pass**: All 8 GPUs found<br>**Fail**: Otherwise |
| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:74a1 -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed 32GT/s, width `x16`, no `FatalErr+`<br>**Fail**: Otherwise |
| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified<br>**Fail**: Otherwise |
| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: No output<br>**Fail**: Otherwise |
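
As an illustration of the kernel boot argument check in the table above, here is a minimal sketch that tests a command-line string for the three required settings. The function names `has_arg` and `check_boot_args` are hypothetical, and the sketch assumes arguments are separated by single spaces, as they normally are in `/proc/cmdline`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the space-separated command line ($1)
# contains the exact argument ($2) as a whole word.
has_arg() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: check the boot arguments listed in the table above.
# $1 = kernel command line, e.g. the contents of /proc/cmdline.
check_boot_args() {
  local cmdline="$1"
  local status=0
  has_arg "$cmdline" "pci=realloc=off" || { echo "missing: pci=realloc=off"; status=1; }
  # Either vendor's IOMMU switch satisfies the criterion.
  if ! has_arg "$cmdline" "amd_iommu=on" && ! has_arg "$cmdline" "intel_iommu=on"; then
    echo "missing: amd_iommu=on or intel_iommu=on"
    status=1
  fi
  has_arg "$cmdline" "iommu=pt" || { echo "missing: iommu=pt"; status=1; }
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}
```

On a live system this could be invoked as `check_boot_args "$(cat /proc/cmdline)"`.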

### System Validation

Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md).

| Test | Command | Pass/Fail criteria |
|------|---------|-------------------|
| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All GPUs listed with no errors<br>**Fail**: Missing GPUs or errors |
| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI300X/gst_single.conf` | **Pass**: `met: TRUE` in logs<br>**Fail**: Target GFLOP/s not met |
| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI300X/iet_single.conf` | **Pass**: `met: TRUE` for all actions<br>**Fail**: Otherwise |
| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth ~2TB/s<br>**Fail**: Any test failed or low bandwidth |
| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI300X/pebb_single.conf` | **Pass**: All distances and bandwidths displayed<br>**Fail**: Missing data |
| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true<br>**Fail**: Otherwise |
| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput<br>**Fail**: Otherwise |

### Performance Benchmarks

Performance validation ensures the system meets MI300X specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking).

:::{card} Command: `TransferBench a2a`
[TransferBench all-to-all](../common/system-validation.md#transferbench)
^^^
**Pass:** ≥ 32.9 GB/s
+++
**Fail:** otherwise
:::

:::{card} Command: `TransferBench p2p`
[TransferBench peer-to-peer](../common/system-validation.md#transferbench)
^^^

| Test | Pass Criteria |
|------|--------------|
| UniDir | ≥ 33.9 GB/s |
| BiDir | ≥ 43.9 GB/s |

+++
**Fail:** otherwise
:::

:::{card} Command: `TransferBench example.cfg`
[TransferBench config tests (1–6)](../common/system-validation.md#transferbench)
^^^

| Test | Pass Criteria |
|------|--------------|
| Test 1 | ≥ 47.1 GB/s |
| Test 2 | ≥ 48.4 GB/s |
| Test 3 | ≥ 31.9 GB/s (0→1), ≥ 38.9 GB/s (1→0) |
| Test 4 | ≥ 1264 GB/s |
| Test 5 | N/A (GPU validation) |
| Test 6 | ≥ 48.6 GB/s |

+++
**Fail:** otherwise
:::
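
The TransferBench thresholds above are simple lower bounds, so a measured value can be compared against them with a small helper. This is a sketch only: `meets_threshold` is a hypothetical name, and the measured value must be read from the TransferBench output by hand or by your own parsing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare a measured bandwidth against a minimum.
# $1 = label, $2 = measured GB/s, $3 = threshold GB/s.
meets_threshold() {
  local label="$1" measured="$2" threshold="$3"
  # awk handles the floating-point comparison that plain [ ] cannot.
  if awk -v m="$measured" -v t="$threshold" 'BEGIN { exit !(m + 0 >= t + 0) }'; then
    echo "PASS: $label $measured >= $threshold GB/s"
  else
    echo "FAIL: $label $measured < $threshold GB/s"
    return 1
  fi
}
```

For example, `meets_threshold "Test 4" 1300.2 1264` checks an illustrative measurement of 1300.2 GB/s against the Test 4 criterion.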

:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g 8`
[RCCL Allreduce](../common/system-validation.md#rccl-allreduce)
^^^
**Pass:** ≥ 304 GB/s
+++
**Fail:** otherwise
:::

:::{card} Command:
[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks)
^^^

```bash
rocblas-bench -f gemm \
  -r s -m 4000 -n 4000 -k 4000 \
  --lda 4000 --ldb 4000 --ldc 4000 \
  --transposeA N --transposeB T
```

**Pass:** ≥ 94100 GFLOPS
+++
**Fail:** otherwise
:::

:::{card} Command:
[rocBLAS BF16](../common/system-validation.md#rocblas-gemm-benchmarks)
^^^

```bash
rocblas-bench -f gemm_strided_batched_ex \
  --transposeA N --transposeB T \
  -m 1024 -n 2048 -k 512 \
  --a_type h --lda 1024 --stride_a 4096 \
  --b_type h --ldb 2048 --stride_b 4096 \
  --c_type s --ldc 1024 --stride_c 2097152 \
  --d_type s --ldd 1024 --stride_d 2097152 \
  --compute_type s \
  --alpha 1.1 --beta 1 \
  --batch_count 5
```

**Pass:** ≥ 130600 GFLOPS
+++
**Fail:** otherwise
:::

:::{card} Command:
[rocBLAS INT8](../common/system-validation.md#rocblas-gemm-benchmarks)
^^^

```bash
rocblas-bench -f gemm_strided_batched_ex \
  --transposeA N --transposeB T \
  -m 1024 -n 2048 -k 512 \
  --a_type i8_r --lda 1024 --stride_a 4096 \
  --b_type i8_r --ldb 2048 --stride_b 4096 \
  --c_type i32_r --ldc 1024 --stride_c 2097152 \
  --d_type i32_r --ldd 1024 --stride_d 2097152 \
  --compute_type i32_r \
  --alpha 1.1 --beta 1 \
  --batch_count 5
```

**Pass:** ≥ 162700 GFLOPS
+++
**Fail:** otherwise
:::
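
The rocBLAS GFLOPS criteria above can be checked against captured `rocblas-bench` output. The sketch below assumes the tool prints a comma-separated header line containing a `rocblas-Gflops` column, followed by a line of values. That layout is an assumption to verify against your ROCm version, and `extract_gflops` and `check_gflops` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value under the "rocblas-Gflops" header
# from comma-separated rocblas-bench output on stdin (assumed layout).
extract_gflops() {
  awk -F',' '
    col == 0 {
      # Still looking for the header line that names the column.
      for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/[ \t\r]/, "", name)
        if (name == "rocblas-Gflops") col = i
      }
      next
    }
    {
      # First line after the header holds the measured values.
      value = $col
      gsub(/[ \t\r]/, "", value)
      print value
      exit
    }
  '
}

# Hypothetical helper: $1 = minimum GFLOPS from the criteria above.
check_gflops() {
  local threshold="$1"
  local measured
  measured="$(extract_gflops)"
  if [ -z "$measured" ]; then
    echo "FAIL: no rocblas-Gflops column found"
    return 1
  fi
  if awk -v m="$measured" -v t="$threshold" 'BEGIN { exit !(m + 0 >= t + 0) }'; then
    echo "PASS: $measured GFLOPS >= $threshold"
  else
    echo "FAIL: $measured GFLOPS < $threshold"
    return 1
  fi
}
```

One way to use it is to pipe the benchmark output into the checker, for example `rocblas-bench ... | check_gflops 94100` for the FP32 case.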

:::{card} Command: `mpiexec -n 8 wrapper.sh`
[BabelStream](../common/system-validation.md#babelstream)
^^^

| Kernel | Threshold (MB/s) |
|--------|-----------------|
| Copy  | ≥ 4,177,285 |
| Mul   | ≥ 4,067,069 |
| Add   | ≥ 3,920,853 |
| Triad | ≥ 3,885,301 |
| Dot   | ≥ 3,660,781 |

+++
**Fail:** otherwise
:::
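
The BabelStream thresholds above can be applied to captured results with a short script. This sketch assumes each result line starts with the kernel name followed by the rate in MB/s, separated by whitespace, as in BabelStream's default table output. The names `threshold_for` and `check_babelstream` are hypothetical.

```bash
#!/usr/bin/env bash
# Thresholds in MB/s, copied from the table above.
threshold_for() {
  case "$1" in
    Copy)  echo 4177285 ;;
    Mul)   echo 4067069 ;;
    Add)   echo 3920853 ;;
    Triad) echo 3885301 ;;
    Dot)   echo 3660781 ;;
    *)     return 1 ;;
  esac
}

# Hypothetical helper: read BabelStream output on stdin and check every
# recognised kernel line (assumed format: "<Kernel> <MB/s> ...").
check_babelstream() {
  local status=0 seen=0
  local kernel rate rest limit
  while read -r kernel rate rest; do
    # Skip headers and any line that does not start with a known kernel name.
    limit="$(threshold_for "$kernel")" || continue
    seen=$((seen + 1))
    if awk -v r="$rate" -v t="$limit" 'BEGIN { exit !(r + 0 >= t + 0) }'; then
      echo "PASS: $kernel $rate >= $limit"
    else
      echo "FAIL: $kernel $rate < $limit"
      status=1
    fi
  done
  if [ "$seen" -eq 0 ]; then
    echo "FAIL: no kernel results found"
    return 1
  fi
  return "$status"
}
```

Because the benchmark is launched with 8 MPI ranks, the captured output may contain one set of results per GPU. The loop checks every matching line, so a single slow GPU fails the run.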
