Performance Benchmarking on AMD Instinct MI300X GPUs#

Introduction#

Optimizing LLM workloads is an essential yet challenging aspect of deploying AI applications, as performance depends on numerous factors across hardware and software. Benchmarking serves as a critical step in this process, enabling you to measure your inference stack’s performance, identify potential bottlenecks, and determine which configurations deliver optimal throughput and latency for your specific use cases.

This tutorial guides you through running LLM performance benchmarks on a single ND MI300X v5 Azure VM using the Model Automation and Dashboarding (MAD) toolkit. MAD is an automated benchmarking framework for evaluating AI model performance on AMD GPUs, with integrated metrics tracking and reporting capabilities. It provides a command-line interface (CLI) that streamlines the benchmarking process by wrapping existing AI frameworks, tools, and models inside Docker containers for controlled and repeatable execution.

Prerequisites#

  • SSH Keys: Have an SSH key pair available and OpenSSH installed on your local machine for secure VM access

  • Azure Account: Maintain an active Azure account with an appropriate subscription and resource group

  • Permissions: Ensure you have the necessary permissions to create and manage Azure resources

  • vCPU Quota: Verify that your subscription has sufficient vCPU quota for ND MI300X v5 VMs (see the example check after this list)

  • Command-Line Tools: Install the Azure CLI on your local machine
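
For example, you can review the current vCPU usage and limits for your target region with the Azure CLI before provisioning. This is an optional sketch; the exact name of the ND MI300X v5 quota family may differ in your subscription, so adjust the filter as needed.

# List usage and limits for the region, filtering for MI300X-related quota families
az vm list-usage --location <your-region> --output table | grep -i "mi300x"

Replace <your-region> with the Azure region where you intend to deploy the VM.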

Once these steps are completed, you’re ready to begin setting up the MAD benchmarking environment.

Prepare high-performance storage#

Before provisioning the VM, we need to prepare the high-performance storage that will be used for caching Hugging Face models and Docker images. The ND MI300X v5 VM comes with eight local NVMe disks, which can be configured as a RAID 0 array to provide high-speed storage.

Create a cloud-init.yaml file in the same directory where you will run the az vm create command (see further down). This file automates the VM’s initial configuration and is designed to be idempotent and persistent after reboots. Paste the following YAML content into the cloud-init.yaml file.

#cloud-config
package_update: true
users:
  - default
  - name: azureuser
    groups: [docker, sudo, video, render]
    sudo: ALL=(ALL) NOPASSWD:ALL

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      set -euo pipefail

      # Wait up to 2 minutes for all 8 NVMe disks to be detected
      for i in {1..24}; do
          if [ "$(ls /dev/nvme*n1 2>/dev/null | wc -l)" -eq 8 ]; then
              break
          fi
          sleep 5
      done

      # Idempotency check
      if mountpoint -q /mnt/resource_nvme && grep -q '^/dev/md0 ' /proc/mounts; then
        exit 0
      fi

      # Detect NVMe disks and names
      NUM_NVME_DISKS=$(ls /dev/nvme*n1 2>/dev/null | wc -l || echo 0)
      NVME_DISK_NAMES=$(ls /dev/nvme*n1 2>/dev/null || true)

      # Create mount point
      mkdir -p /mnt/resource_nvme

      # Stop existing RAID array if any
      mdadm --stop /dev/md0 2>/dev/null || true

      # Create RAID 0 array
      if ! mdadm --create /dev/md0 --force --run --level 0 --raid-devices "$NUM_NVME_DISKS" $NVME_DISK_NAMES; then
          exit 1
      fi

      # Format if not already XFS
      if ! blkid /dev/md0 | grep -q 'TYPE="xfs"'; then
        mkfs.xfs -f /dev/md0
      fi

      # Mount the array and set permissions
      mount /dev/md0 /mnt/resource_nvme
      chmod 1777 /mnt/resource_nvme

      # Docker directory
      mkdir -p /mnt/resource_nvme/docker
      chown root:docker /mnt/resource_nvme/docker

      # Update fstab for persistent mount
      grep -q '/dev/md0 /mnt/resource_nvme' /etc/fstab || echo "/dev/md0 /mnt/resource_nvme xfs defaults,nofail 0 2" >> /etc/fstab

  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - systemctl restart docker

The cloud-init script will:

  • Ensure system packages are updated.

  • Create a user named azureuser, granting passwordless sudo privileges and adding it to the docker, sudo, video, and render groups.

  • Create and set executable permissions for a shell script located at /opt/setup_nvme.sh, which detects NVMe disks, combines them into a single RAID 0 array, formats it with XFS, mounts it at /mnt/resource_nvme, and ensures the mount persists across reboots. A directory for Docker images is also created on this volume.

  • Create a Docker configuration file /etc/docker/daemon.json that tells Docker to use the newly created /mnt/resource_nvme/docker directory for its data, which will improve read and write performance for containers.

  • Restart the Docker service to apply the new configuration.

Warning

The RAID 0 array uses ephemeral NVMe disks. Data stored here is not persistent and will be lost if the VM is deallocated or redeployed.

Create MI300X virtual machine#

To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.

resource_group="my-rg"
region="my-region"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025081902"
ssh_key="<ssh-rsa AAAAB3...>"

For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your ssh public key string.
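
If you are unsure of your public key string, you can print it on your local machine, assuming the default RSA key location (adjust the path if you use a different key):

cat ~/.ssh/id_rsa.pub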

Note

The image version is date-stamped and updated over time. You can check for the latest version using the Azure CLI command: az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table.

The following Azure CLI command creates a new VM using the variables and the cloud-init file we defined above:

az vm create \
  --resource-group $resource_group \
  --name $vm_name \
  --image $vm_image \
  --size $vm_size \
  --location $region \
  --admin-username $admin_username \
  --ssh-key-value "$ssh_key" \
  --security-type Standard \
  --os-disk-size-gb 256 \
  --os-disk-delete-option Delete \
  --custom-data cloud-init.yaml

Important

Azure has shifted its default security type to TrustedLaunch for newly created VMs. The Standard security type is still supported, but it requires explicit registration of the UseStandardSecurityType feature flag per Azure subscription. Register the feature flag using the following command: az feature register --namespace Microsoft.Compute --name UseStandardSecurityType. Verify the registration was successful with az feature show --namespace Microsoft.Compute --name UseStandardSecurityType.

If the VM was created successfully, the shell will display information specific to your deployment. Take note of the public IP address, as you will need it to connect to the VM.

On your local machine, navigate to the hidden .ssh directory and connect to your Azure VM:

ssh -i id_rsa azureuser@<vm_public_ip_address>

Replace <vm_public_ip_address> with the actual public IP address of your VM.
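
Once connected, you can optionally confirm that cloud-init has finished before validating the storage. Both commands below are standard cloud-init tooling:

# Block until cloud-init completes and report its final status
cloud-init status --wait

# Inspect the provisioning log if something looks wrong
sudo tail -n 50 /var/log/cloud-init-output.log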

Validate high-performance storage#

To verify that the RAID array was created successfully and is mounted correctly, run the following command on your VM:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE

Example output:

NAME      SIZE TYPE  MOUNTPOINT         FSTYPE
loop0    63.7M loop  /snap/core20/2496
loop1    89.4M loop  /snap/lxd/31333
loop2    44.4M loop  /snap/snapd/23545
sda       256G disk
├─sda1  255.9G part  /                  ext4
├─sda14     4M part
└─sda15   106M part  /boot/efi          vfat
sdb         1T disk
└─sdb1   1024G part  /mnt               ext4
sr0       638K rom
nvme4n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme2n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme1n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme3n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme5n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme6n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme0n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs
nvme7n1   3.5T disk                     linux_raid_member
└─md0    27.9T raid0 /mnt/resource_nvme xfs

The output confirms that the eight local ~3.5T NVMe devices (nvme0n1–nvme7n1) were striped with mdadm into a single RAID 0 volume, md0, of approximately 27.9T, formatted with XFS, and mounted at /mnt/resource_nvme by the cloud-init script. The root (OS) disk is a separate 256G disk (sda) mounted at /.
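
For a closer look at the array itself, you can also query mdadm and check the usable capacity directly. These optional checks use standard Linux tooling:

# Show RAID level, member disks, and array state
sudo mdadm --detail /dev/md0

# Confirm the mounted capacity of the striped volume
df -h /mnt/resource_nvme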

Validate using AMD SMI library#

Validate your setup using the AMD System Management Interface (AMD SMI) library, which is a versatile command-line tool for managing and monitoring AMD hardware, with a primary focus on GPUs.

Verify the version of the AMD SMI Library along with the ROCm version:

amd-smi version

Example output:

AMDSMI Tool: 24.6.3+9578815 | AMDSMI Library version: 24.6.3.0 | ROCm version: 6.2.4

Note

Allow 1-2 minutes after VM startup for all services to initialize before running amd-smi commands.

To list all eight AMD GPUs on the VM, run the following command:

amd-smi list

Example output:

GPU: 0
    BDF: 0002:00:00.0
    UUID: 1fff74b5-0000-1000-807c-84c354560001

GPU: 1
    BDF: 0003:00:00.0
    UUID: 8bff74b5-0000-1000-8042-32403807af72

GPU: 2
    BDF: 0004:00:00.0
    UUID: faff74b5-0000-1000-80fd-df190d55b466

[output truncated]
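
Beyond listing devices, amd-smi can also report live utilization, power, and temperature, which is handy while a benchmark is running. Subcommand behavior can vary between amd-smi releases, so treat this as an optional check:

amd-smi monitor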

Verify Docker installation#

Verify that Docker is running on your VM:

systemctl status --full docker --no-pager

To verify that Docker is working correctly, run the following command to pull and run the hello-world image:

docker run hello-world

If Docker is set up correctly, you should see a message indicating that the hello-world image ran successfully.
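
You can also confirm that Docker picked up the data-root configured by cloud-init and stores its images on the NVMe RAID volume:

docker info --format '{{ .DockerRootDir }}'

The command should print /mnt/resource_nvme/docker.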

Request model access on Hugging Face#

To benchmark gated models that require access permissions, you need to request access on Hugging Face. Visit the model page on Hugging Face and submit the access request form.

Note

Ensure that the email address and username you submit match your Hugging Face account, and provide your full company name when submitting the request.

Configure Hugging Face access token#

Set your Hugging Face token as an environment variable so that MAD can access gated models on the Hugging Face Hub:

echo 'export MAD_SECRETS_HFTOKEN="<your Hugging Face access token>"' >> ~/.bashrc
source ~/.bashrc

Replace <your Hugging Face access token> with your actual Hugging Face access token. This command appends the export statement to your ~/.bashrc file, which is executed every time you open a new terminal session. By sourcing the file, you make the token available immediately in the current session.

Tip

If you don’t have a Hugging Face access token, you can create one by logging into your Hugging Face account, navigating to Settings > Access Tokens, and creating a new token of the “read” type. Copy the token and replace <your Hugging Face access token> in the command above.
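
To sanity-check that the token is exported and valid, you can query the Hugging Face Hub whoami endpoint. This optional check returns your account details as JSON if the token is accepted:

curl -s -H "Authorization: Bearer $MAD_SECRETS_HFTOKEN" https://huggingface.co/api/whoami-v2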

Create virtual environment#

To ensure a clean and isolated environment for benchmarking, we will create a Python virtual environment for the MAD toolkit. This allows us to manage dependencies without affecting the system Python installation.

Install the Ubuntu/Debian python3.10-venv system package, which provides the Python venv module needed to create isolated virtual environments for your benchmarking dependencies:

sudo DEBIAN_FRONTEND=noninteractive apt install -y python3.10-venv

Once complete, create a Python virtual environment in /mnt/resource_nvme/venv and activate it:

python3 -m venv /mnt/resource_nvme/venv
source /mnt/resource_nvme/venv/bin/activate
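
With the environment activated, your shell prompt is typically prefixed with (venv), and python resolves to the virtual environment rather than the system interpreter. A quick way to confirm:

which python
# Expected: /mnt/resource_nvme/venv/bin/python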

Clone the benchmarking repository#

We will clone the MAD repository to the mounted NVMe RAID volume to ensure optimal performance during benchmarking. Inside the MAD repository, we will also install madengine, which provides the CLI for running benchmarks.

Clone the MAD repository into the /mnt/resource_nvme directory:

cd /mnt/resource_nvme
git clone https://github.com/ROCm/MAD

Next, install madengine, which is the CLI for MAD:

cd MAD
pip install -r requirements.txt
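
If the installation succeeded, the madengine CLI should now be available inside the virtual environment. Printing its help text is a quick, optional way to verify this (assuming the CLI exposes a standard --help flag):

madengine --help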

You should now have a directory structure in /mnt/resource_nvme looking like this:

.
├── MAD                # MAD benchmarking toolkit
├── docker             # Docker images
└── venv               # Python virtual environment

Structuring your benchmarking environment this way keeps it isolated and organized, where each directory has a specific role, making it easier to manage dependencies, containers, and models.

Discover available models#

All available models are defined in a JSON configuration file named models.json in the MAD repository. This file contains metadata about each model, such as its name, URL, Dockerfile path, run script, required number of GPUs, tags, and more. The models listed are generally categorized for inference, training (including fine-tuning), or pretraining. You can typically identify the intended purpose of a model by its name and its associated tags. The madengine discover command reads this file and lists all available models.
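
If you prefer to glance at the raw file rather than going through the CLI, you can inspect it directly. This is optional; the commands assume models.json sits at the top level of the checkout, and jq may need to be installed first:

# Pretty-print the model definitions (press q to exit)
jq . models.json | less

# Quickly locate a specific model entry
grep -n '"pyt_vllm_llama-3.1-8b_fp8"' models.json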

Navigate to the root of the MAD directory and run the following command:

madengine discover

Example output (truncated):

MAD_MINIO environment variable is not set.
MAD_MINIO is using default values.
Discovering all models in the project
Number of models in total: 77
pyt_huggingface_gpt2
pyt_huggingface_bert
pyt_vllm_llama-2-70b
pyt_vllm_llama-3.1-8b
pyt_vllm_llama-3.1-8b_fp8
...

The --tags option allows you to filter models whose tags match any you specify (logical OR). For example, to list all models tagged with pyt (PyTorch) or fp16, you can run:

madengine discover --tags pyt fp16

Output omitted for brevity, but it will show all models that match any of the specified tags.

Each model name is also automatically available as a tag. For example, to list only the pyt_vllm_llama-3.1-8b_fp8 model, run:

madengine discover --tags pyt_vllm_llama-3.1-8b_fp8

Output:

MAD_MINIO environment variable is not set.
MAD_MINIO is using default values.
Discovering all models in the project
[
    {
        "name": "pyt_vllm_llama-3.1-8b_fp8",
        "url": "",
        "dockerfile": "docker/pyt_vllm",
        "scripts": "scripts/vllm/run.sh",
        "data": "huggingface",
        "n_gpus": "-1",
        "owner": "[email protected]",
        "training_precision": "",
        "multiple_results": "perf_Llama-3.1-8B-Instruct-FP8-KV.csv",
        "tags": [
            "pyt",
            "vllm",
            "vllm_extended"
        ],
        "timeout": -1,
        "args": "--model_repo amd/Llama-3.1-8B-Instruct-FP8-KV --config configs/extended.csv"
    }
]

The n_gpus field indicates that this model can utilize all available GPUs on the VM (-1 means all GPUs). The dockerfile and scripts fields specify the paths to the Dockerfile and the run script used to execute the benchmark inside the container. The tags field lists tags associated with this model, which can be used for filtering during discovery or benchmarking.

The args field defines the default command-line arguments for this model’s benchmark run. By default it includes --config configs/extended.csv, a configuration that schedules both throughput and serving tests. Treat these defaults as a baseline; you can override or extend them to tailor the benchmark to a specific scenario. The next section shows how to run only a single throughput test.

Run model benchmarks#

The basic command to run a model benchmark is:

madengine run --tags TAGS

Similar to the discover command, the run command uses the --tags option to specify which model(s) to benchmark. The TAGS can be a single tag or multiple tags separated by spaces.

To run a pyt_vllm_llama-3.1-8b_fp8 throughput benchmark using vLLM, use the following command:

madengine run --tags pyt_vllm_llama-3.1-8b_fp8:benchmark=throughput --live-output > >(tee output.log) 2> >(tee error.log >&2)

The --tags option uses a colon (:) to separate the model tag from additional key-value pairs that modify the benchmark behavior. In this case, benchmark=throughput specifies that we want to run only the throughput test for the selected model.

The --live-output flag enables real-time output streaming to the terminal, allowing you to monitor the benchmark progress as it runs. We also redirect the standard output to output.log and standard error to error.log for later analysis, if needed.
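
Benchmark runs can take a while, so it is often convenient to follow progress from a second SSH session instead of keeping the launching terminal in the foreground. A minimal example, run from the MAD directory where the logs are written:

tail -f output.log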

Note

The first time you run a benchmark for this model, MAD will pull the corresponding Docker image (rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812) and cache it locally. This may take several minutes depending on your internet connection speed. Subsequent runs of the same model will use the cached image, significantly reducing startup time.
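
After that first run, you can confirm the image is cached locally, which explains the faster startup on subsequent runs:

docker images rocm/vllm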

Analyze benchmark results#

After a run completes, MAD writes log, metric, and environment artifacts into the working directory.

Access results locally#

To easily access the results from your local machine, start a simple HTTP server on the VM from inside the MAD directory:

python -m http.server 8090 --bind 127.0.0.1

Binding to 127.0.0.1 prevents public exposure, adding an extra layer of security.

On your local machine, run:

ssh -L 8090:localhost:8090 azureuser@<vm_public_ip_address>

Open your browser: http://localhost:8090.

Alternatively, or as a complement, you can copy files to your local machine using scp. Run the following command locally to copy the perf.csv file from the VM to your current directory:

scp azureuser@<vm_public_ip_address>:/mnt/resource_nvme/MAD/perf.csv .

Key artifacts#

The following key artifacts are generated in the MAD directory after the benchmark run:

Logs:

  • output.log: Full orchestrator + container stdout (authoritative textual trace of events).

  • error.log: Standard error stream (warnings, errors, tracebacks).

  • *.live.log: In-container streaming log captured mid-run (subset of output.log).

Metrics:

  • perf.csv: The main combined performance file (see the quick-inspection example after this list).

  • perf.html: HTML rendering of perf.csv.

  • perf_entry.csv: Temporary file.

  • perf_<model_name>.csv: Per-workload results file (from the multiple_results field in models.json).

  • *_env.csv: Environment snapshot (hardware, ROCm stack, Python packages, runtime variables).
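
For a quick look at perf.csv directly on the VM, before downloading anything, you can render the comma-separated columns as an aligned table using standard Linux tools:

column -s, -t < perf.csv | less -S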

Interpreting the sample output#

The perf.csv file contains detailed information collected during the benchmark run. Below is a sample output for the throughput benchmark:

model                     performance    metric
..._throughput_tot        10,450.84      tokens/sec
..._throughput_gen        9,836          tokens/sec

Key throughput metrics:

  • throughput_tot: This measures the end-to-end token processing rate, combining both prompt processing (prefill) and the subsequent generation of new tokens (decode). It reflects the overall performance of the entire request.

  • throughput_gen: This measures the token generation rate (decode). For autoregressive models, this is often the most critical metric.

In this example, the difference between them is small, so the prefill phase is not adding a large extra cost for this particular configuration.

The full model name in perf.csv is long but descriptive—pyt_vllm_llama-3.1-8b_fp8_Llama-3.1-8B-Instruct-FP8-KV_throughput_1_1024_128_2048_float8_throughput_tot—encoding the exact conditions of the test:

  • pyt_vllm_llama-3.1-8b_fp8: The MAD model name that was run.

  • Llama-3.1-8B-Instruct-FP8-KV: The specific model variant from Hugging Face.

  • throughput: The type of test conducted.

  • 1: Number of concurrent requests (batch size).

  • 1024: Number of prompt (prefill, input) tokens.

  • 128: Number of generated (decode, output) tokens.

  • 2048: Maximum context window size.

  • float8: The data precision used (floating point 8).

  • throughput_tot: The specific metric being reported (total throughput), as explained above.

A back-of-the-envelope calculation can estimate the idealized latency. By dividing the total tokens (1,024 prompt + 128 generated = 1,152) by the end-to-end throughput (10,450 tokens/sec), we get an idealized latency of approximately 110 milliseconds. This figure excludes real-world overheads like network latency and request queuing, but for interactive applications like chatbots, a sub-200 millisecond response is often perceived as instantaneous.
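
You can reproduce this estimate with a one-line calculation in the shell; the numbers below simply mirror the sample run above:

# (prompt tokens + generated tokens) / end-to-end throughput, converted to milliseconds
awk 'BEGIN { printf "%.0f ms\n", (1024 + 128) / 10450.84 * 1000 }'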

Conclusion#

This tutorial showed how to provision an ND MI300X v5 VM with high-performance storage, create and validate the environment, run MAD-based LLM benchmarks, and interpret core throughput metrics. By establishing this reproducible baseline—logs, ground truth metrics, and environment snapshots—you can systematically optimize your inference stack to achieve the best possible price-performance for your LLM workloads on Azure.

Recommended next steps:

  • Archive a clean baseline (metrics + environment snapshot) for regression detection.

  • Choose an appropriate model and configuration that reflect production traffic and meet your service level objectives.

  • Systematically sweep concurrency, sequence lengths, precision, and batch size to explore trade-offs.

  • Build a lightweight dashboard for trend and regression tracking.

  • Track cost efficiency (tokens/sec per dollar-hour) alongside throughput and latency.
