Performance Benchmarking on AMD Instinct MI300X GPUs#
Introduction#
Optimizing LLM workloads is an essential yet challenging aspect of deploying AI applications, as performance depends on numerous factors across hardware and software. Benchmarking serves as a critical step in this process, enabling you to measure your inference stack’s performance, identify potential bottlenecks, and determine which configurations deliver optimal throughput and latency for your specific use cases.
This tutorial guides you through running LLM performance benchmarks on a single ND MI300X v5 Azure VM using the Model Automation and Dashboarding (MAD) toolkit. MAD is an automated benchmarking framework for evaluating AI model performance on AMD GPUs, with integrated metrics tracking and reporting capabilities. It provides a command-line interface (CLI) that streamlines the benchmarking process by wrapping existing AI frameworks, tools, and models inside Docker containers for controlled and repeatable execution.
Prerequisites#
SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access
Azure Account: Maintain an active Azure account with appropriate subscription and resource group
Permissions: Ensure you have necessary permissions to create and manage Azure resources
vCPU Quota: Verify your subscription has sufficient vCPU quota for ND MI300X v5 VMs
Command-Line Tools: Install Azure CLI on your local machine
Once these steps are completed, you’re ready to begin setting up the MAD benchmarking environment.
Prepare high-performance storage#
Before provisioning the VM, we need to prepare the high-performance storage that will be used for caching Hugging Face models and Docker images. The ND MI300X v5 VM comes with eight local NVMe disks, which can be configured as a RAID 0 array to provide high-speed storage.
Create a cloud-init.yaml file in the same directory where you will run the az vm create command (see further down). This file automates the VM's initial configuration and is designed to be idempotent and to persist across reboots. Paste the following YAML content into the cloud-init.yaml file.
#cloud-config
package_update: true
users:
  - default
  - name: azureuser
    groups: [docker, sudo, video, render]
    sudo: ALL=(ALL) NOPASSWD:ALL

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      set -euo pipefail

      # Wait up to 2 minutes for all 8 NVMe disks to be detected
      for i in {1..24}; do
        if [ "$(ls /dev/nvme*n1 2>/dev/null | wc -l)" -eq 8 ]; then
          break
        fi
        sleep 5
      done

      # Idempotency check
      if mountpoint -q /mnt/resource_nvme && grep -q '^/dev/md0 ' /proc/mounts; then
        exit 0
      fi

      # Detect NVMe disks and names
      NUM_NVME_DISKS=$(ls /dev/nvme*n1 2>/dev/null | wc -l || echo 0)
      NVME_DISK_NAMES=$(ls /dev/nvme*n1 2>/dev/null || true)

      # Create mount point
      mkdir -p /mnt/resource_nvme

      # Stop existing RAID array if any
      mdadm --stop /dev/md0 2>/dev/null || true

      # Create RAID 0 array
      if ! mdadm --create /dev/md0 --force --run --level 0 --raid-devices "$NUM_NVME_DISKS" $NVME_DISK_NAMES; then
        exit 1
      fi

      # Format if not already XFS
      if ! blkid /dev/md0 | grep -q 'TYPE="xfs"'; then
        mkfs.xfs -f /dev/md0
      fi

      # Mount the array and set permissions
      mount /dev/md0 /mnt/resource_nvme
      chmod 1777 /mnt/resource_nvme

      # Docker directory
      mkdir -p /mnt/resource_nvme/docker
      chown root:docker /mnt/resource_nvme/docker

      # Update fstab for persistent mount
      grep -q '/dev/md0 /mnt/resource_nvme' /etc/fstab || echo "/dev/md0 /mnt/resource_nvme xfs defaults,nofail 0 2" >> /etc/fstab

  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - systemctl restart docker
The cloud-init script will:
Ensure system packages are updated.
Create a user named azureuser, granting passwordless sudo privileges and adding it to the docker, video, and render groups.
Create and set executable permissions for a shell script located at /opt/setup_nvme.sh, which detects the NVMe disks, combines them into a single RAID 0 array, formats it with XFS, mounts it at /mnt/resource_nvme, and ensures the mount persists across reboots. A directory for Docker images is also created on this volume.
Create a Docker configuration file /etc/docker/daemon.json that tells Docker to use the newly created /mnt/resource_nvme/docker directory for its data, which will improve read and write performance for containers.
Restart the Docker service to apply the new configuration.
Warning
The RAID 0 array uses ephemeral NVMe disks. Data stored here is not persistent and will be lost if the VM is deallocated or redeployed.
Create MI300X virtual machine#
To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.
resource_group="my-rg"
region="my-region"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025081902"
ssh_key="<ssh-rsa AAAB3...>"
For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...>
with your ssh public key string.
Note
The image version is date-stamped and updated over time. You can check for the latest version using the Azure CLI command: az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table.
The following Azure CLI command creates a new VM using the variables and the cloud-init file we defined above:
az vm create \
--resource-group $resource_group \
--name $vm_name \
--image $vm_image \
--size $vm_size \
--location $region \
--admin-username $admin_username \
--ssh-key-value "$ssh_key" \
--security-type Standard \
--os-disk-size-gb 256 \
--os-disk-delete-option Delete \
--custom-data cloud-init.yaml
Important
Azure has shifted its default security type to TrustedLaunch for newly created VMs. The Standard security type is still supported, but it requires explicit registration of the UseStandardSecurityType feature flag per Azure subscription. Register the feature flag using the following command: az feature register --namespace Microsoft.Compute --name UseStandardSecurityType. Verify the registration was successful with az feature show --namespace Microsoft.Compute --name UseStandardSecurityType.
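For convenience, the two commands from the note above can be run back to back. Feature registration can take some time to complete, so re-run the second command until it reports a registered state:
az feature register --namespace Microsoft.Compute --name UseStandardSecurityType
az feature show --namespace Microsoft.Compute --name UseStandardSecurityType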
If the VM was created successfully, the shell will display information specific to your deployment. Take note of the public IP address, as you will need it to connect to the VM.
On your local machine, navigate to the hidden .ssh directory and connect to your Azure VM:
ssh -i id_rsa azureuser@<vm_public_ip_address>
Replace <vm_public_ip_address> with the actual public IP address of your VM.
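Before validating the storage, you can optionally confirm that cloud-init has finished applying the configuration. This uses standard cloud-init tooling available on the Ubuntu image:
cloud-init status --wait
If anything looks off, the log at /var/log/cloud-init-output.log is a good place to start debugging.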
Validate high-performance storage#
To verify that the RAID array was created successfully and is mounted correctly, run the following command on your VM:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE
Example output:
NAME SIZE TYPE MOUNTPOINT FSTYPE
loop0 63.7M loop /snap/core20/2496
loop1 89.4M loop /snap/lxd/31333
loop2 44.4M loop /snap/snapd/23545
sda 256G disk
├─sda1 255.9G part / ext4
├─sda14 4M part
└─sda15 106M part /boot/efi vfat
sdb 1T disk
└─sdb1 1024G part /mnt ext4
sr0 638K rom
nvme4n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme2n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme1n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme3n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme5n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme6n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme0n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
nvme7n1 3.5T disk linux_raid_member
└─md0 27.9T raid0 /mnt/resource_nvme xfs
The output confirms the eight local ~3.5T NVMe devices (nvme0n1–nvme7n1) were striped with mdadm into a single RAID 0 volume md0 of approximately 27.9T, formatted XFS, and mounted at /mnt/resource_nvme by the cloud-init script. The root (OS) disk is a separate 256G disk named sda, mounted at /.
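As an additional, optional sanity check, the standard Linux tools below report the array state and the usable capacity of the mounted volume:
cat /proc/mdstat          # md0 should appear as an active raid0 across the eight NVMe devices
df -h /mnt/resource_nvme  # shows the usable space on the mounted volume (roughly 28T)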
Validate using AMD SMI library#
Validate your setup using the AMD System Management Interface (AMD SMI) library, which is a versatile command-line tool for managing and monitoring AMD hardware, with a primary focus on GPUs.
Verify the version of the AMD SMI Library along with the ROCm version:
amd-smi version
Example output:
AMDSMI Tool: 24.6.3+9578815 | AMDSMI Library version: 24.6.3.0 | ROCm version: 6.2.4
Note
Allow 1-2 minutes after VM startup for all services to initialize before running amd-smi commands.
To list all eight AMD GPUs on the VM, run the following command:
amd-smi list
Example output:
GPU: 0
BDF: 0002:00:00.0
UUID: 1fff74b5-0000-1000-807c-84c354560001
GPU: 1
BDF: 0003:00:00.0
UUID: 8bff74b5-0000-1000-8042-32403807af72
GPU: 2
BDF: 0004:00:00.0
UUID: faff74b5-0000-1000-80fd-df190d55b466
[output truncated]
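While benchmarks are running later in this tutorial, you can keep an eye on GPU utilization, power, and memory from a second SSH session. The monitor subcommand shown below is part of recent AMD SMI releases; the exact columns may differ between versions:
amd-smi monitor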
Verify Docker installation#
Verify that Docker is running on your VM:
systemctl status --full docker --no-pager
To verify that Docker is working correctly, run the following command to pull and run the hello-world image:
docker run hello-world
If Docker is set up correctly, you should see a message indicating that the hello-world image ran successfully.
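As an optional check that the cloud-init configuration took effect, you can also confirm that Docker stores its data on the NVMe-backed volume rather than the OS disk:
docker info --format '{{ .DockerRootDir }}'
The expected output is /mnt/resource_nvme/docker.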
Request model access on Hugging Face#
To benchmark gated models that require access permissions, you need to request access on Hugging Face. Visit the model page on Hugging Face and submit the access request form.
Note
When submitting the request, use the same email address and username as your Hugging Face account, and provide your full company name.
Configure Hugging Face access token#
Set your Hugging Face token as an environment variable so that MAD can access gated models on the Hugging Face Hub:
echo 'export MAD_SECRETS_HFTOKEN="<your Hugging Face access token>"' >> ~/.bashrc
source ~/.bashrc
Replace <your Hugging Face access token> with your actual Hugging Face access token. This command appends the export statement to your ~/.bashrc file, which is executed every time you open a new terminal session. By sourcing the file, you make the token available immediately in the current session.
Tip
If you don’t have a Hugging Face access token, you can create one by logging into your Hugging Face account, navigating to Settings > Access Tokens, and creating a new token of the “read” type. Copy the token and replace <your Hugging Face access token> in the command above.
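If you want to confirm the token is valid before running benchmarks, one lightweight check (assuming outbound internet access from the VM) is to query the Hugging Face whoami endpoint; it should return your account details rather than an authorization error:
curl -s -H "Authorization: Bearer $MAD_SECRETS_HFTOKEN" https://huggingface.co/api/whoami-v2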
Create virtual environment#
To ensure a clean and isolated environment for benchmarking, we will create a Python virtual environment for the MAD toolkit. This allows us to manage dependencies without affecting the Python system installation.
Install the Ubuntu/Debian python3.10-venv system package, which provides the Python venv module needed to create isolated virtual environments for your benchmarking dependencies:
sudo DEBIAN_FRONTEND=noninteractive apt install -y python3.10-venv
Once complete, create a Python virtual environment in /mnt/resource_nvme/venv and activate it:
python3 -m venv /mnt/resource_nvme/venv
source /mnt/resource_nvme/venv/bin/activate
Clone the benchmarking repository#
We will clone the MAD repository to the mounted NVMe RAID volume to ensure optimal performance during benchmarking. We will then install madengine, the package that provides the CLI for running benchmarks.
Clone the MAD repository into the /mnt/resource_nvme directory:
cd /mnt/resource_nvme
git clone https://github.com/ROCm/MAD
Next, install madengine, which is the CLI for MAD:
cd MAD
pip install -r requirements.txt
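If the installation succeeded, the madengine CLI should now be available inside the virtual environment. A quick way to confirm this, and to see the available subcommands, is to ask for the built-in help:
madengine --help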
You should now have a directory structure in /mnt/resource_nvme looking like this:
.
├── MAD # MAD benchmarking toolkit
├── docker # Docker images
└── venv # Python virtual environment
Structuring your benchmarking environment this way keeps it isolated and organized, where each directory has a specific role, making it easier to manage dependencies, containers, and models.
Discover available models#
All available models are defined in a JSON configuration file named models.json in the MAD repository. This file contains metadata about each model, such as its name, URL, Dockerfile path, run script, required number of GPUs, tags, and more. The models listed are generally categorized for inference, training (including fine-tuning), or pretraining. You can typically identify the intended purpose of a model by its name and its associated tags. The madengine discover command reads this file and lists all available models.
Navigate to the root of the MAD directory and run the following command:
madengine discover
Example output (truncated):
MAD_MINIO environment variable is not set.
MAD_MINIO is using default values.
Discovering all models in the project
Number of models in total: 77
pyt_huggingface_gpt2
pyt_huggingface_bert
pyt_vllm_llama-2-70b
pyt_vllm_llama-3.1-8b
pyt_vllm_llama-3.1-8b_fp8
...
The --tags option allows you to filter models whose tags match any you specify (logical OR). For example, to list all models tagged with pyt (PyTorch) or fp16, you can run:
madengine discover --tags pyt fp16
Output omitted for brevity, but it will show all models that match any of the specified tags.
Each model name is also automatically available as a tag. For example, to list only the pyt_vllm_llama-3.1-8b_fp8 model, run:
madengine discover --tags pyt_vllm_llama-3.1-8b_fp8
Output:
MAD_MINIO environment variable is not set.
MAD_MINIO is using default values.
Discovering all models in the project
[
{
"name": "pyt_vllm_llama-3.1-8b_fp8",
"url": "",
"dockerfile": "docker/pyt_vllm",
"scripts": "scripts/vllm/run.sh",
"data": "huggingface",
"n_gpus": "-1",
"owner": "[email protected]",
"training_precision": "",
"multiple_results": "perf_Llama-3.1-8B-Instruct-FP8-KV.csv",
"tags": [
"pyt",
"vllm",
"vllm_extended"
],
"timeout": -1,
"args": "--model_repo amd/Llama-3.1-8B-Instruct-FP8-KV --config configs/extended.csv"
}
]
The n_gpus field indicates that this model can utilize all available GPUs on the VM (-1 means all GPUs). The dockerfile and scripts fields specify the paths to the Dockerfile and the run script used to execute the benchmark inside the container. The tags field lists the tags associated with this model, which can be used for filtering during discovery or benchmarking.
The args field defines the default command-line arguments for this model's benchmark run. By default it includes --config configs/extended.csv, a configuration that schedules both throughput and serving tests. Treat these defaults as a baseline; you can override or extend them to tailor the benchmark to a specific scenario. The next section shows how to run only a single throughput test.
Run model benchmarks#
The basic command to run a model benchmark is:
madengine run --tags TAGS
Similar to the discover command, the run command uses the --tags option to specify which model(s) to benchmark. TAGS can be a single tag or multiple tags separated by spaces.
To run a pyt_vllm_llama-3.1-8b_fp8 throughput benchmark using vLLM, use the following command:
madengine run --tags pyt_vllm_llama-3.1-8b_fp8:benchmark=throughput --live-output > >(tee output.log) 2> >(tee error.log >&2)
The --tags option uses a colon (:) to separate the model tag from additional key-value pairs that modify the benchmark behavior. In this case, benchmark=throughput specifies that we want to run only the throughput test for the selected model.
The --live-output flag enables real-time output streaming to the terminal, allowing you to monitor the benchmark progress as it runs. We also redirect the standard output to output.log and standard error to error.log for later analysis, if needed.
Note
The first time you run a benchmark for this model, MAD will pull the corresponding Docker image (rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812) and cache it locally. This may take several minutes depending on your internet connection speed. Subsequent runs of the same model will use the cached image, significantly reducing startup time.
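You can verify that the image was cached after the first run with a standard Docker command; the exact tag may differ if MAD has been updated to use a newer image:
docker images | grep vllm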
Analyze benchmark results#
After a run completes, MAD writes log, metric, and environment artifacts into the working directory.
Access results locally#
To easily access the results from your local machine, start a simple HTTP server on the VM from inside the MAD directory:
python -m http.server 8090 --bind 127.0.0.1
Binding to 127.0.0.1 prevents public exposure, adding an extra layer of security.
On your local machine, run:
ssh -L 8090:localhost:8090 azureuser@<vm_public_ip_address>
Open your browser and navigate to http://localhost:8090.
Alternatively, or as a complement, you can also copy the files locally using scp. On your local machine, run the following command to copy the perf.csv file from the VM to your current local directory:
scp azureuser@<vm_public_ip_address>:/mnt/resource_nvme/MAD/perf.csv .
Key artifacts#
The following key artifacts are generated in the MAD directory after the benchmark run:
Logs:
output.log: Full orchestrator + container stdout (authoritative textual trace of events).
error.log: Standard error stream (warnings, errors, tracebacks).
*.live.log: In-container streaming log captured mid-run (subset of output.log).
Metrics:
perf.csv: The main combined performance file.
perf.html: HTML rendering of perf.csv.
perf_entry.csv: Temporary file.
perf_<model_name>.csv: Per-workload results file (from the multiple_results field in models.json).
*_env.csv: Environment snapshot (hardware, ROCm stack, Python packages, runtime variables).
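For a quick look at perf.csv directly on the VM, standard shell tools are enough; the command below aligns the comma-separated columns for easier reading (assuming the fields contain no embedded commas):
column -s, -t < perf.csv | less -S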
Interpreting the sample output#
The perf.csv file contains detailed information collected during the benchmark run. Below is a sample output for the throughput benchmark:
| model | performance | metric |
|---|---|---|
| pyt_vllm_llama-3.1-8b_fp8_Llama-3.1-8B-Instruct-FP8-KV_throughput_1_1024_128_2048_float8_throughput_tot | 10,450.84 | tokens/sec |
| pyt_vllm_llama-3.1-8b_fp8_Llama-3.1-8B-Instruct-FP8-KV_throughput_1_1024_128_2048_float8_throughput_gen | 9,836 | tokens/sec |
Key throughput metrics:
throughput_tot: Measures the end-to-end token processing rate, combining prompt processing (prefill) and the subsequent generation of new tokens (decode). It reflects the overall performance of the entire request.
throughput_gen: Measures the token generation rate (decode only). For autoregressive models, this is often the most critical metric.
In this example, the difference between them is small, so the prefill phase is not adding a large extra cost for this particular configuration.
The full model name in perf.csv is long but descriptive: pyt_vllm_llama-3.1-8b_fp8_Llama-3.1-8B-Instruct-FP8-KV_throughput_1_1024_128_2048_float8_throughput_tot. It encodes the exact conditions of the test:
pyt_vllm_llama-3.1-8b_fp8: The MAD model name that was run.
Llama-3.1-8B-Instruct-FP8-KV: The specific model variant from Hugging Face.
throughput: The type of test conducted.
1: Number of concurrent requests (batch size).
1024: Number of prompt (prefill, input) tokens.
128: Number of generated (decode, output) tokens.
2048: Maximum context window size.
float8: The data precision used (8-bit floating point).
throughput_tot: The specific metric being reported (total throughput), as explained above.
A back-of-the-envelope calculation can estimate the idealized latency. By dividing the total tokens (1,024 prompt + 128 generated = 1,152) by the end-to-end throughput (10,450 tokens/sec), we get an idealized latency of approximately 110 milliseconds. This figure excludes real-world overheads like network latency and request queuing, but for interactive applications like chatbots, a sub-200 millisecond response is often perceived as instantaneous.
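If you want to reproduce the arithmetic on the VM, a one-line awk calculation using the sample numbers above gives the same figure:
awk 'BEGIN { printf "Idealized latency: %.0f ms\n", (1024 + 128) / 10450.84 * 1000 }'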
Conclusion#
This tutorial showed how to provision an ND MI300X v5 VM with high-performance storage, create and validate the environment, run MAD-based LLM benchmarks, and interpret core throughput metrics. By establishing this reproducible baseline—logs, ground truth metrics, and environment snapshots—you can systematically optimize your inference stack to achieve the best possible price-performance for your LLM workloads on Azure.
Recommended next steps:
Archive a clean baseline (metrics + environment snapshot) for regression detection.
Choose an appropriate model and configuration that reflect production traffic and meet your service level objectives.
Systematically sweep concurrency, sequence lengths, precision, and batch size to explore trade-offs.
Build a lightweight dashboard for trend and regression tracking.
Track cost efficiency (tokens/sec per dollar-hour) alongside throughput and latency.