Getting Started with AMD Instinct MI300X on Azure#
Introduction#
AI workloads require powerful virtual machines (VMs) to handle intense computational demands and extensive data processing. The ND MI300X v5 series VM, equipped with eight AMD Instinct MI300X GPUs, is ideal for the most advanced deep learning training and inference, driving cutting-edge AI applications across industries. This guide will walk you through provisioning an ND MI300X v5 VM on Microsoft Azure.
Prerequisites#
SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access
Azure Account: Maintain an active Azure account with appropriate subscription and resource group
Permissions: Ensure you have necessary permissions to create and manage Azure resources
vCPU Quota: Verify your subscription has sufficient vCPU quota for ND MI300X v5 VMs
Command-Line Tools: Install Azure CLI on your local machine or use Azure Cloud Shell
Tip
Access Azure Cloud Shell easily through the Azure portal, by visiting shell.azure.com, or by setting it up in your local terminal environment.
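To confirm you have enough vCPU quota before provisioning, you can query current usage and limits with the Azure CLI. This is a sketch; the `MI300X` filter on the quota family name is an assumption, so adjust it if your quota entry is listed under a different name:

```shell
# List vCPU usage and limits for quota families matching 'MI300X' in a region.
# The localName filter is an assumption; adjust if your quota entry differs.
az vm list-usage \
  --location westus \
  --query "[?contains(localName, 'MI300X')]" \
  --output table
```

If the limit is zero, request a quota increase for the ND MI300X v5 family before proceeding.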
Check for availability#
Verify MI300X VM availability in your preferred Azure regions with the following command. Adjust the regions array as needed:
regions=("westus" "francecentral" "uksouth")
for region in "${regions[@]}"; do
  echo "$region"
  az vm list-sizes \
    --location "$region" \
    --query "[?contains(name, 'MI300X')]" \
    --output table
done
Tip
Run az account list-locations --output table to view all available Azure region names.
If MI300X VMs are available in the specified regions, you’ll see two VM sizes listed:
Standard_ND96is_MI300X_v5
Standard_ND96isr_MI300X_v5
These VMs belong to the ND-series family, specifically engineered for high-performance deep learning, generative AI, and HPC workloads. Key specifications include:
Eight AMD Instinct MI300X GPUs per VM, each with 192GB of HBM3 memory
96 CPU cores with isolated hardware (i) for dedicated customer use
Premium solid-state drive storage (s) for enhanced data processing speeds
Version 5 (v5) of the architecture
The Standard_ND96isr_MI300X_v5 variant includes InfiniBand networking (r) for high-speed inter-node communication, essential for distributed training of large AI models and efficient cluster management.
Preparation#
To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.
resource_group="my-rg"
region="westus"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"
ssh_key="<ssh-rsa AAAAB3...>"
To obtain the uniform resource name (URN) for the VM image, use the following command:
az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table
Look for an image with a URN similar to microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701.
For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your SSH public key string.
Tip
To retrieve your public SSH key, navigate to your .ssh directory and use type id_rsa.pub (Windows) or cat id_rsa.pub (Linux/macOS).
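If you don't have a key pair yet, you can generate one with OpenSSH. A minimal sketch using the default RSA key path:

```shell
# Generate a 4096-bit RSA key pair at the default location.
# You will be prompted for an optional passphrase.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa
```

The public half (id_rsa.pub) is what you paste into the ssh_key variable; the private half stays on your local machine.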
Setting up additional user accounts#
To add developer accounts with sudo privileges and SSH access, create a cloud-init configuration file:
nano cloud_init_add_user.txt
Use the following template, replacing the placeholder values with actual usernames and SSH keys:
#cloud-config
users:
  - default
  - name: <user-name1>
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - <ssh-rsa AAAAB3...>
  - name: <user-name2>
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - <ssh-rsa AAAAB3...>
Important
Ensure the first line starts with #cloud-config without any leading spaces, as this syntax is required for cloud-init to process the file correctly.
Important
The - default parameter preserves the admin user created during VM provisioning. Omitting this parameter would replace the default admin user.
To ensure that the cloud_init_add_user.txt file was created correctly, verify it in Azure Cloud Shell: cat cloud_init_add_user.txt.
Create MI300X virtual machine#
The following Azure CLI command creates a new VM using the variables and the configuration file we defined above. It specifies parameters such as the resource group, VM image, administrator user name, and administrator SSH public key for secure access. It also sets the OS disk size to 256 GB and ensures the OS disk is deleted when the VM is deleted. The --custom-data parameter adds the additional users we defined in the cloud-init script file.
az vm create \
--resource-group $resource_group \
--name $vm_name \
--image $vm_image \
--size $vm_size \
--location $region \
--admin-username $admin_username \
--ssh-key-value "$ssh_key" \
--custom-data cloud_init_add_user.txt \
--security-type Standard \
--os-disk-size-gb 256 \
--os-disk-delete-option Delete
Tip
Enclose the $ssh_key variable in double quotes to ensure that the entire key value is passed correctly.
It takes a few minutes to create the VM and supporting resources. If the VM was created successfully, the shell will display information specific to your deployment.
{
"fqdns": "",
"id": "/subscriptions/<guid>/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachines/MI300X",
"location": "westus",
"macAddress": "00-0D-3A-35-FE-3F",
"powerState": "VM running",
"privateIpAddress": "10.0.0.5",
"publicIpAddress": "13.64.99.136",
"resourceGroup": "my-rg",
"zones": ""
}
Note
Take note of the VM’s public IP address, as you will use this address to access the VM from your local machine.
During first boot, cloud-init will configure the additional user accounts specified in your configuration file.
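If you need to look up the VM's public IP address again later, you can query it with the Azure CLI. A minimal sketch, assuming the variables defined earlier are still set in your shell:

```shell
# Retrieve the public IP address of the VM (instance view details required)
az vm show \
  --resource-group $resource_group \
  --name $vm_name \
  --show-details \
  --query publicIps \
  --output tsv
```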
Connect to your VM using SSH:
ssh -i id_rsa azureuser@<vm-ip-address>
Validate using AMD-SMI Library#
Validate your setup using the AMD System Management Interface (AMD-SMI) Library, which is a versatile command-line tool for managing and monitoring AMD hardware, with a primary focus on GPUs.
To verify the version of the AMD-SMI Library, run the following command:
amd-smi version
Note
Allow 1-2 minutes after VM startup for all services to initialize before running amd-smi commands.
The output will show the version of the AMD-SMI tool itself, the version of the AMD-SMI library, and the ROCm platform version. This information helps you ensure that you are using the correct versions of these components for managing and monitoring your AMD GPUs.
To list all eight AMD GPUs on the VM, along with basic information like their universally unique identifier, run the following command:
amd-smi list
You should see output similar to this:
GPU: 0
BDF: 0002:00:00.0
UUID: 1fff74b5-0000-1000-807c-84c354560001
GPU: 1
BDF: 0003:00:00.0
UUID: 8bff74b5-0000-1000-8042-32403807af72
GPU: 2
BDF: 0004:00:00.0
UUID: faff74b5-0000-1000-80fd-df190d55b466
[output truncated]
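Beyond listing devices, AMD-SMI can also report live GPU telemetry such as utilization, power draw, and temperature. A minimal sketch; subcommand and output details may vary between AMD-SMI versions:

```shell
# Display a live telemetry table (utilization, power, temperature)
# for all GPUs; press Ctrl+C to exit
amd-smi monitor
```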
Docker status and group setup#
Verify Docker is running:
systemctl status --full docker --no-pager
Setting up a Docker group on the VM is a best practice for running Docker commands without needing sudo each time. While this is not strictly required, it is highly recommended. If your user is not part of this group, you might encounter permission issues.
Check if your user is part of the Docker group using the groups command. If you see docker in the output, you are already part of the Docker group. If not, add your user to the Docker group:
sudo usermod -aG docker $USER
Important
You must log out and log back in for group membership changes to take effect.
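As an alternative to logging out, you can start a subshell with the new group membership applied immediately:

```shell
# Apply the docker group membership in a new subshell without logging out
newgrp docker

# Confirm you can reach the Docker daemon without sudo
docker ps
```

Note that newgrp only affects the current shell; other sessions still require a fresh login.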
Configure high-performance storage#
Optimize your VM by configuring a RAID 0 array across the eight NVMe disks, creating a single high-performance storage location for Docker and Hugging Face cache data.
Step 1: Prepare the NVMe disks for RAID 0 configuration#
Create a mount directory:
sudo mkdir -p /mnt/resource_nvme/
Create a RAID 0 array using all NVMe disks:
sudo mdadm --create /dev/md128 --force --run --level=0 --raid-devices=8 /dev/nvme*n1
Format the RAID storage with the XFS filesystem, which is designed for large files and extensive storage capacity.
sudo mkfs.xfs --force /dev/md128
Mount the RAID array:
sudo mount /dev/md128 /mnt/resource_nvme
Set permissions to allow all users to read, write, and execute files in the directory:
sudo chmod 1777 /mnt/resource_nvme
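Before moving on, it is worth confirming that the array assembled and mounted correctly:

```shell
# Confirm the RAID 0 array (md128) is active across the NVMe devices
cat /proc/mdstat

# Confirm the filesystem is mounted and check its capacity
df -h /mnt/resource_nvme
```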
Step 2: Configure Hugging Face to use the RAID storage#
Create and configure the Hugging Face cache directory:
mkdir -p /mnt/resource_nvme/hf_cache
Set the environment variable HF_HOME to point to the new directory:
export HF_HOME=/mnt/resource_nvme/hf_cache
This environment variable is used in the docker run command to ensure that any containers pulling models or data from Hugging Face use the correct cache directory.
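The export only lasts for the current shell session. To keep the setting across SSH sessions, you can append it to your shell profile:

```shell
# Persist HF_HOME across sessions by adding it to ~/.bashrc
echo 'export HF_HOME=/mnt/resource_nvme/hf_cache' >> ~/.bashrc

# Reload the profile and confirm the variable is set
source ~/.bashrc
echo "$HF_HOME"
```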
Step 3: Configure Docker to use the RAID storage#
Create the Docker data directory:
mkdir -p /mnt/resource_nvme/docker
Configure Docker to use the new directory:
echo '{"data-root": "/mnt/resource_nvme/docker"}' | sudo tee /etc/docker/daemon.json
sudo chmod 0644 /etc/docker/daemon.json
The permission 0644 means the owner can read and write the file, while others can only read it, which is a common setting for configuration files.
Restart the Docker service:
sudo systemctl restart docker
Restarting Docker ensures that it starts using the new data directory specified in the daemon.json file.
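You can confirm that Docker picked up the new data root:

```shell
# Print Docker's active data directory; it should show the RAID-backed path
docker info --format '{{ .DockerRootDir }}'
```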
Important
Remember that the end user is responsible for data and access management in the shared responsibility model of the cloud.
Run inference test#
As an example AI workload, we are going to run the DeepSeek-R1 model.
Download the sgl-dev Docker image from the ROCm repository on Docker Hub:
docker pull rocm/sgl-dev:20250323-srt
Important
Check the Docker Hub repository for the most recent tag (version), as tags change frequently.
Launch the SGLang server:
docker run \
--device=/dev/kfd \
--device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
--shm-size 32g \
--ipc=host \
-p 30000:30000 \
-v /mnt/resource_nvme:/mnt/resource_nvme \
-e HF_HOME=/mnt/resource_nvme/hf_cache \
-e HSA_NO_SCRATCH_RECLAIM=1 \
-e GPU_FORCE_BLIT_COPY_SIZE=64 \
-e DEBUG_HIP_BLOCK_SYN=1024 \
rocm/sgl-dev:20250323-srt \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0
Note
It may take up to 10 minutes to load the model. If successful, you will see a message “The server is fired up and ready to roll!”
Important
Open up a new shell on the same host for the following commands.
Verify the container is running:
docker ps --all
Sample output:
CONTAINER ID   IMAGE                       COMMAND                  CREATED          STATUS          PORTS   NAMES
db137f3da5d8   rocm/sgl-dev:20250323-srt   "python3 -m sglang.l…"   57 seconds ago   Up 56 seconds           pedantic_cartwright
Test the endpoint by retrieving model information:
curl http://localhost:30000/get_model_info
The output should look something like this:
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}
Finally, let’s run a sample request:
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Did you know that",
"sampling_params": { "max_new_tokens": 16, "temperature": 0.6 }
}'
Sample output:
{"text":" 70% of the Earth's surface is covered by water? But only ","meta_info":{"id":"ae022d181c5945fa8fc70db9fc76e4d5","finish_reason":{"type":"length","length":16},"prompt_tokens":5,"completion_tokens":16,"cached_tokens":1}}
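SGLang also exposes an OpenAI-compatible API on the same port, which lets you reuse existing OpenAI client code against the local server. A sketch, assuming the /v1/chat/completions endpoint available in recent SGLang versions:

```shell
# Query the OpenAI-compatible chat completions endpoint
# (endpoint availability may vary by SGLang version)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 32
  }'
```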
Cleanup (Optional)#
To manage costs when the VM is not in use:
Deallocate (stop) the VM to pause billing for compute resources
Delete the VM and associated resources if they are no longer needed
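Both steps above can be performed with the Azure CLI, assuming the variables defined earlier are still set:

```shell
# Deallocate (stop) the VM to pause billing for compute resources;
# storage charges for attached disks still apply
az vm deallocate --resource-group $resource_group --name $vm_name

# Permanently delete the VM; associated resources such as NICs and public IPs
# may need separate deletion unless delete options were configured
az vm delete --resource-group $resource_group --name $vm_name --yes
```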
Conclusion#
By following this guide, you’ve successfully provisioned and configured an ND MI300X v5 VM on Azure, optimized its storage configuration, and validated its setup with a real-world AI workload. This environment provides the high-performance computing resources necessary for advanced AI development and inference tasks.