Getting Started with AMD Instinct MI300X on Azure#

Introduction#

AI workloads require powerful virtual machines (VMs) to handle intense computational demands and extensive data processing. The ND MI300X v5 series VM, equipped with eight AMD Instinct MI300X GPUs, is ideal for the most advanced deep learning training and inference, driving cutting-edge AI applications across industries. This guide will walk you through provisioning an ND MI300X v5 VM on Microsoft Azure.

Prerequisites#

  • SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access

  • Azure Account: Maintain an active Azure account with appropriate subscription and resource group

  • Permissions: Ensure you have necessary permissions to create and manage Azure resources

  • vCPU Quota: Verify your subscription has sufficient vCPU quota for ND MI300X v5 VMs (see the check after this list)

  • Command-Line Tools: Install Azure CLI on your local machine or use Azure Cloud Shell
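To check the quota mentioned above, you can list current usage and limits for a region of interest and filter for the MI300X family. This is a minimal sketch; the exact family name may vary, so adjust the filter as needed:

az vm list-usage --location westus --query "[?contains(name.value, 'MI300X')]" --output table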

Tip

Access Azure Cloud Shell easily through the Azure portal, by visiting shell.azure.com, or by setting it up in your local terminal environment.

Check for availability#

Verify MI300X VM availability in your preferred Azure regions with the following command. Adjust the regions array as needed:

regions=("westus" "francecentral" "uksouth")

for region in "${regions[@]}"; do
    echo $region
    az vm list-sizes \
      --location $region \
      --query "[?contains(name, 'MI300X')]" \
      --output table
done

Tip

Run az account list-locations --output table to view all available Azure region names.

If MI300X VMs are available in the specified regions, you’ll see two VM sizes listed:

  • Standard_ND96is_MI300X_v5

  • Standard_ND96isr_MI300X_v5

These VMs belong to the ND-series family, specifically engineered for high-performance deep learning, generative AI, and HPC workloads. Key specifications include:

  • Eight AMD Instinct MI300X GPUs per VM, each with 192GB of HBM3 memory

  • 96 CPU cores with isolated hardware (i) for dedicated customer use

  • Premium solid-state drive storage (s) for enhanced data processing speeds

  • Version 5 (v5) of the architecture

The Standard_ND96isr_MI300X_v5 variant includes InfiniBand networking (r) for high-speed inter-node communication, essential for distributed training of large AI models and efficient cluster management.

Preparation#

To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.

resource_group="my-rg"
region="westus"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"
ssh_key="<ssh-rsa AAAAB3...>"

To obtain the uniform resource name (URN) for the VM image, use the following command:

az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table

Look for an image with a URN similar to microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701.

For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your ssh public key string.
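If you don't have an SSH key pair yet, you can generate one first. This is a minimal sketch; choose the key type and file path that fit your environment:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa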

Tip

To retrieve your public SSH key, navigate to your .ssh directory and use type id_rsa.pub (Windows) or cat id_rsa.pub (Linux/macOS).

Setting up additional user accounts#

To add developer accounts with sudo privileges and SSH access, create a cloud-init configuration file:

nano cloud_init_add_user.txt

Use the following template, replacing the placeholder values with actual usernames and SSH keys:

#cloud-config
users:
  - default
  - name: <user-name1>
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - <ssh-rsa AAAAB3...>
  - name: <user-name2>
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - <ssh-rsa AAAAB3...>

Important

Ensure the first line starts with #cloud-config without any leading spaces, as this syntax is required for cloud-init to process the file correctly.

Important

The - default parameter preserves the admin user created during VM provisioning. Omitting this parameter would replace the default admin user.

To ensure that the cloud_init_add_user.txt file was created correctly, verify it in Azure Cloud Shell: cat cloud_init_add_user.txt.

Create MI300X virtual machine#

The following Azure CLI command creates a new VM using the variables and the configuration file defined above. It specifies parameters such as the resource group, VM image, administrator user name, and administrator SSH public key for secure access. It also sets the OS disk size to 256 GB and ensures the OS disk is deleted when the VM is deleted. The --custom-data parameter adds the additional users defined in the cloud-init file.

az vm create \
    --resource-group $resource_group \
    --name $vm_name \
    --image $vm_image \
    --size $vm_size \
    --location $region \
    --admin-username $admin_username \
    --ssh-key-value "$ssh_key" \
    --custom-data cloud_init_add_user.txt \
    --security-type Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

Tip

Enclose the $ssh_key variable in double quotes to ensure that the entire key value is passed correctly.

It takes a few minutes to create the VM and supporting resources. If the VM was created successfully, the shell will display information specific to your deployment.

{
  "fqdns": "",
  "id": "/subscriptions/<guid>/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachines/MI300X",
  "location": "westus",
  "macAddress": "00-0D-3A-35-FE-3F",
  "powerState": "VM running",
  "privateIpAddress": "10.0.0.5",
  "publicIpAddress": "13.64.99.136",
  "resourceGroup": "my-rg",
  "zones": ""
}

Note

Take note of the VM’s public IP address, as you will use this address to access the VM from your local machine.

During first boot, cloud-init will configure the additional user accounts specified in your configuration file.

Connect to your VM using SSH:

ssh -i id_rsa azureuser@<vm-ip-address>
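Once connected, you can optionally confirm that cloud-init completed and that the additional accounts exist (the username placeholder matches the cloud-init file above):

cloud-init status
getent passwd <user-name1>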

Validate using AMD-SMI Library#

Validate your setup using the AMD System Management Interface (AMD-SMI) Library, which is a versatile command-line tool for managing and monitoring AMD hardware, with a primary focus on GPUs.

To verify the version of the AMD-SMI Library, run the following command:

amd-smi version

Note

Allow 1-2 minutes after VM startup for all services to initialize before running amd-smi commands.

The output will show the version of the AMD-SMI tool itself, the version of the AMD-SMI library, and the ROCm platform version. This information helps you ensure that you are using the correct versions of these components for managing and monitoring your AMD GPUs.

To list all eight AMD GPUs on the VM, along with basic information like their universally unique identifier, run the following command:

amd-smi list

You should see output similar to this:

GPU: 0
    BDF: 0002:00:00.0
    UUID: 1fff74b5-0000-1000-807c-84c354560001

GPU: 1
    BDF: 0003:00:00.0
    UUID: 8bff74b5-0000-1000-8042-32403807af72

GPU: 2
    BDF: 0004:00:00.0
    UUID: faff74b5-0000-1000-80fd-df190d55b466

[output truncated]

Docker status and group setup#

Verify Docker is running:

systemctl status --full docker --no-pager

Adding your user to the docker group is a best practice that lets you run Docker commands without prefixing each one with sudo. While not strictly required, it is highly recommended; without it, you may run into permission errors.

Check if your user is part of the Docker group using the groups command. If you see docker in the output, you are already part of the Docker group. If not, add your user to the Docker group:

sudo usermod -aG docker $USER

Important

You must log out and log back in for group membership changes to take effect.
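Alternatively, you can apply the new membership in your current session by starting a subshell with the docker group active:

newgrp docker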

Configure high-performance storage#

Optimize your VM by configuring a RAID 0 array across the eight NVMe disks, creating a single high-performance storage location for Docker and Hugging Face cache data.

Step 1: Prepare the NVMe disks for RAID 0 configuration#

Create a mount directory:

sudo mkdir -p /mnt/resource_nvme/

Create a RAID 0 array using all NVMe disks:

sudo mdadm --create /dev/md128 --force --run --level=0 --raid-devices=8 /dev/nvme*n1

Format the RAID storage with the XFS filesystem, which is designed for large files and extensive storage capacity.

sudo mkfs.xfs --force /dev/md128

Mount the RAID array:

sudo mount /dev/md128 /mnt/resource_nvme

Set permissions to allow all users to read, write, and execute files in the directory:

sudo chmod 1777 /mnt/resource_nvme
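Optionally, verify that the array was assembled and mounted as expected:

cat /proc/mdstat
df -h /mnt/resource_nvme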

Step 2: Configure Hugging Face to use the RAID storage#

Create and configure the Hugging Face cache directory:

mkdir -p /mnt/resource_nvme/hf_cache

Set the environment variable HF_HOME to point to the new directory:

export HF_HOME=/mnt/resource_nvme/hf_cache

This environment variable is used in the docker run command to ensure that any containers pulling images or data from Hugging Face use the correct cache directory.
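Note that export only affects the current shell session. To persist the setting across logins, you can append it to your shell profile (a sketch assuming bash):

echo 'export HF_HOME=/mnt/resource_nvme/hf_cache' >> ~/.bashrc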

Step 3: Configure Docker to use the RAID storage#

Create the Docker data directory:

mkdir -p /mnt/resource_nvme/docker

Configure Docker to use the new directory:

echo '{"data-root": "/mnt/resource_nvme/docker"}' | sudo tee /etc/docker/daemon.json
sudo chmod 0644 /etc/docker/daemon.json

The permission 0644 means the owner can read and write the file, while others can only read it, which is a common setting for configuration files.

Restart the Docker service:

sudo systemctl restart docker

Restarting Docker ensures that it starts using the new data directory specified in the daemon.json file.
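To confirm that Docker picked up the new data root, inspect its runtime configuration; the output should show /mnt/resource_nvme/docker:

docker info | grep "Docker Root Dir"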

Important

Remember that the end user is responsible for data and access management in the shared responsibility model of the cloud.

Run inference test#

As an example AI workload, we are going to run the DeepSeek-R1 model.

Download the rocm/sgl-dev Docker image from Docker Hub:

docker pull rocm/sgl-dev:20250323-srt

Important

Check the Docker Hub repository for the most recent tag (version), as tags change frequently.

Launch the SGLang server:

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sgl-dev:20250323-srt \
  python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0

Note

It may take up to 10 minutes to load the model. If successful, you will see a message “The server is fired up and ready to roll!”

Important

Open up a new shell on the same host for the following commands.

Verify the container is running:

docker ps --all

Sample output:

CONTAINER ID   IMAGE                          COMMAND                  CREATED          STATUS          PORTS     NAMES
db137f3da5d8   rocm/sgl-dev:20250323-srt      "python3 -m sglang.l…"   57 seconds ago   Up 56 seconds             pedantic_cartwright
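While the model loads, you can follow the server logs from this shell. Substitute the container name or ID from your own docker ps output; the name below comes from the sample above:

docker logs -f pedantic_cartwright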

Test the endpoint by retrieving model information:

curl http://localhost:30000/get_model_info

The output should look something like this:

{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}

Finally, let’s run a sample request:

curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Did you know that",
    "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 }
    }'

Sample output:

{"text":" 70% of the Earth's surface is covered by water? But only ","meta_info":{"id":"ae022d181c5945fa8fc70db9fc76e4d5","finish_reason":{"type":"length","length":16},"prompt_tokens":5,"completion_tokens":16,"cached_tokens":1}}

Cleanup (Optional)#

To manage costs when the VM is not in use:

  • Deallocate (stop) the VM to pause billing for compute resources

  • Delete the VM and associated resources if they are no longer needed (see the sample commands below)
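For example, using the variables defined earlier (a sketch; double-check the resource names before deleting anything):

# Stop compute billing while keeping the VM and its disks
az vm deallocate --resource-group $resource_group --name $vm_name

# Permanently delete the VM (the OS disk is removed as well, per --os-disk-delete-option)
az vm delete --resource-group $resource_group --name $vm_name --yes

Note that az vm delete may leave associated resources such as the NIC and public IP address behind; remove them individually, or delete the entire resource group if nothing else depends on it.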

Conclusion#

By following this guide, you’ve successfully provisioned and configured an ND MI300X v5 VM on Azure, optimized its storage configuration, and validated its setup with a real-world AI workload. This environment provides the high-performance computing resources necessary for advanced AI development and inference tasks.
