Setting up Azure Kubernetes Service with AMD Instinct MI300X#
Introduction#
Azure Kubernetes Service (AKS) and AMD Instinct MI300X accelerators (GPUs) form a powerful combination for the most advanced AI deployments, capable of scaling up to thousands of GPUs. In this guide, you will learn how to create an AKS cluster, add a GPU node pool with AMD Instinct MI300X, stop, start, scale, and delete nodes, and install and configure kubectl. In addition, the guide covers installing and configuring the AMD GPU Operator to help you better leverage AMD Instinct GPUs on your cluster.
Prerequisites#
SSH Keys: Ensure you have an SSH key pair and OpenSSH installed on your local machine for secure VM access.
Azure Setup: You need an active Azure account and subscription, a resource group, and the necessary permissions to create and manage resources.
vCPU Quota: Verify your subscription has sufficient vCPU quota (see the example command after this list). Contact your Microsoft account representative if needed.
Command-Line Interpreter: You will need a command-line interpreter with Azure CLI installed on your local machine. Alternatively, you can use Azure Cloud Shell, which comes with the latest version of Azure CLI already installed.
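For example, you can check your current vCPU usage and limits in a region with az vm list-usage. This is a quick sketch: <your-region> is a placeholder, and the exact quota family name for MI300X VMs may vary, so filtering on “ND” is only a starting point.
# List vCPU quotas for the region and filter for ND-series VM families
az vm list-usage \
  --location <your-region> \
  --output table | grep -i "ND"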
Create variables#
Set the following variables to streamline the VM creation process. Update the placeholder values as necessary.
1region="my-region"
2resource_group="my-rg"
3aks_name="my-aks"
4systempool="systempool"
5gpupool="gpupool"
6vm_size="Standard_D32a_v4"
7gpu_vm_size="Standard_ND96is_MI300X_v5"
8os_sku="Ubuntu"
9ssh_key="<ssh-rsa AAAB3...>"
To optimize resource allocation and cost efficiency, two different VM sizes are used in the AKS cluster. The Standard_D32a_v4 VM size is designated for the system node pool, handling general-purpose tasks and managing the cluster’s infrastructure. Meanwhile, the Standard_ND96is_MI300X_v5 VM size is allocated to the GPU node pool for AI workloads. This separation ensures efficient management of both regular operations and intensive AI processing without over-provisioning resources.
Note
As of this writing, the Standard_ND96isr_MI300X_v5 VM size with InfiniBand is not yet enabled in AKS. Therefore, this guide focuses on the Standard_ND96is_MI300X_v5. For more details on this instance type, see the ND-MI300X-v5 Series on Microsoft’s docs.
For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your SSH public key string.
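If you don’t already have a key pair, you can generate one and capture the public key into the variable. The following is a minimal sketch assuming an RSA key at the default path; adjust the path and key type to your setup.
# Generate a 4096-bit RSA key pair (skip if you already have one)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

# Store the public key string in the ssh_key variable
ssh_key="$(cat ~/.ssh/id_rsa.pub)"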
Create AKS cluster#
Each AKS deployment spans two resource groups. You create the first resource group, which contains only the Kubernetes service. In this user guide, as stated in the prerequisites, we assume this resource group already exists.
The second resource group is called the node resource group. It is automatically created by AKS when you execute the az aks create command. By default, this resource group has a name based on the pattern MC_resourcegroupname_clustername_location. Its purpose is to store all the infrastructure resources associated with the cluster, including the virtual machine scale sets (VMSS), which themselves include the VMs (the nodes), virtual networking, public IP addresses, and storage.
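After running the az aks create command below, you can confirm the node resource group’s name by querying the cluster, for example:
# Query the automatically created node resource group
az aks show \
  --resource-group $resource_group \
  --name $aks_name \
  --query nodeResourceGroup \
  --output tsv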
Important
You should only use the node resource group for resources that share the cluster’s life cycle. Do not modify any of its resources, as doing so can lead to cluster operation failures.
To create the AKS cluster using Azure CLI, execute the following command:
az aks create \
--location $region \
--resource-group $resource_group \
--name $aks_name \
--nodepool-name $systempool \
--node-vm-size $vm_size \
--os-sku $os_sku \
--ssh-key-value "$ssh_key" \
--node-count 1 \
--node-osdisk-size 256 \
--node-osdisk-type Managed \
--enable-node-public-ip
This command creates an AKS cluster with a system node pool containing a single node dedicated to running critical system components. The system node pool is specifically named “systempool” to clearly distinguish its purpose from user workloads. The node is provisioned with a 256 GB managed OS disk, providing storage for system services. The command also enables a public IP address for the node, allowing direct external access.
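To verify that the deployment completed successfully, you can query the cluster’s provisioning state; the command should return Succeeded:
# Check that the cluster finished provisioning successfully
az aks show \
  --resource-group $resource_group \
  --name $aks_name \
  --query provisioningState \
  --output tsv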
Important
For production environments, enhance security by: 1) restricting API access with --api-server-authorized-ip-ranges, 2) implementing network policies for improved traffic control, and 3) using managed identities for enhanced authentication. For more information, see AKS security best practices.
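As an illustration, the authorized IP ranges flag can also be applied to an existing cluster with az aks update; the CIDR below is a documentation placeholder, so substitute your own management network:
# Restrict API server access to a specific IP range (placeholder CIDR)
az aks update \
  --resource-group $resource_group \
  --name $aks_name \
  --api-server-authorized-ip-ranges 203.0.113.0/24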
Note
Node pools group together nodes (VMs) of the same size and configuration by implementing virtual machine scale sets (VMSS). If you want to have different node sizes in a single cluster, you need to create multiple node pools, one for each size.
Add MI300X GPU node pool#
To support AI workloads, add a user node pool with two MI300X VMs to your AKS cluster.
First, enable the --skip-gpu-driver-install flag by installing the aks-preview extension. This flag is necessary because the VM size Standard_ND96is_MI300X_v5 is not in AKS’s list of supported SKUs for automatic GPU driver installation. Without this flag, AKS will return an error when trying to create nodes with this VM size. The AMD GPU Operator will handle the driver installation instead.
Add the aks-preview extension to your Azure CLI installation:
az extension add --name aks-preview --upgrade
Register the SkipGpuDriverInstallPreview feature to enable the --skip-gpu-driver-install flag:
az feature register \
--namespace "Microsoft.ContainerService" \
--name "SkipGpuDriverInstallPreview"
Check the registration status:
az feature show \
--namespace "Microsoft.ContainerService" \
--name "SkipGpuDriverInstallPreview" \
--query "properties.state"
Note
Allow 10-15 minutes for the feature registration to complete. The status should change from “Registering” to “Registered” before proceeding. You can periodically run the status check command until it shows “Registered”.
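If you prefer not to re-run the check by hand, a simple polling loop such as the following sketch works; adjust the interval as you see fit:
# Poll every 60 seconds until the feature state reads "Registered"
while [ "$(az feature show \
    --namespace "Microsoft.ContainerService" \
    --name "SkipGpuDriverInstallPreview" \
    --query "properties.state" \
    --output tsv)" != "Registered" ]; do
  echo "Still registering, waiting 60 seconds..."
  sleep 60
done
echo "Feature registered."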
Re-register the resource provider to apply the feature registration:
az provider register --namespace Microsoft.ContainerService
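You can confirm the provider registration completed by checking its state, which should return Registered:
# Verify the resource provider registration state
az provider show \
  --namespace Microsoft.ContainerService \
  --query registrationState \
  --output tsv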
After completing these steps, you can proceed to create the GPU node pool:
az aks nodepool add \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool \
--node-vm-size $gpu_vm_size \
--os-sku $os_sku \
--node-count 2 \
--node-osdisk-type Managed \
--node-osdisk-size 256 \
--scale-down-mode Deallocate \
--enable-node-public-ip \
--skip-gpu-driver-install
The user node pool is specifically named “gpupool” to clearly indicate its purpose of running GPU workloads. The node pool is provisioned with two nodes, each equipped with a 256 GB managed OS disk. The --node-count parameter sets the initial number of allocated (running) nodes in the node pool. Setting the --scale-down-mode parameter to Deallocate ensures the nodes are deallocated when they are scaled down. Setting --node-osdisk-type to Managed makes the OS disk persistent, which is required if you want to deallocate your nodes to save costs, as ephemeral OS disks don’t support deallocation. The command also enables a public IP address for the nodes, allowing direct external access. Lastly, the --skip-gpu-driver-install flag skips the automatic GPU driver installation, as the AMD GPU Operator will handle this task.
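Once the command completes, you can confirm the GPU node pool was provisioned as expected, for example:
# Check the GPU node pool's provisioning state and node count
az aks nodepool show \
  --resource-group $resource_group \
  --cluster-name $aks_name \
  --name $gpupool \
  --query "{state: provisioningState, count: count}" \
  --output table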
Warning
The default value of --scale-down-mode is Delete, which will delete any scaled-down nodes.
Note
If you want to enable autoscaling in the GPU node pool from the start, use the --enable-cluster-autoscaler parameter together with the --min-count and --max-count parameters.
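For instance, autoscaling can also be enabled on the existing pool later with az aks nodepool update; the bounds below are illustrative:
# Enable the cluster autoscaler on the GPU node pool (example bounds)
az aks nodepool update \
  --resource-group $resource_group \
  --cluster-name $aks_name \
  --name $gpupool \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 2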
Stopping, starting, scaling, and deleting nodes#
For AI development tasks, you typically don’t need your GPU nodes running continuously. By stopping (deallocating) the node pool or scaling nodes to zero, you can save costs since you won’t be charged for the deallocated nodes.
To stop a running node pool:
az aks nodepool stop \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
To start a stopped node pool:
az aks nodepool start \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
Alternatively, you can use the az aks nodepool scale command to change the number of nodes in a node pool, either increasing or decreasing the node count.
az aks nodepool scale \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool \
--node-count <desired-node-count>
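For example, to take all GPU nodes offline while keeping the node pool definition in place, scale it down to zero:
# Scale the GPU node pool down to zero nodes
az aks nodepool scale \
  --resource-group $resource_group \
  --cluster-name $aks_name \
  --name $gpupool \
  --node-count 0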
Note
Remember, the --scale-down-mode parameter controls whether the nodes are deallocated or deleted when they are scaled down.
To delete a node pool, including its nodes, use the following command:
az aks nodepool delete \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
Install and configure kubectl#
To interact with your newly created AKS cluster, you’ll need to set up the Kubernetes command-line tool, kubectl, and configure it for your specific cluster.
To install kubectl on your local machine, run the following command:
az aks install-cli
Verify the installation was successful:
kubectl version --client
Run the following command to configure kubectl to connect to your AKS cluster:
az aks get-credentials \
--resource-group $resource_group \
--name $aks_name
After running this command, your kubectl configuration file will be updated with the necessary credentials and context for your AKS cluster. The file is typically located at ~/.kube/config on Linux/macOS or %USERPROFILE%\.kube\config on Windows.
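You can check which context kubectl is currently using with:
# Show the active kubectl context (should match your AKS cluster name)
kubectl config current-context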
Confirm you can communicate with your cluster by listing its nodes:
kubectl get nodes
You should see output similar to the following; once the GPU node pool is running, its nodes will be listed as well:
NAME STATUS ROLES AGE VERSION
aks-systempool-xxxxxxxx-vmss000000 Ready <none> 5m17s v1.31.7
Install AMD GPU Operator#
Now that you have an AKS cluster with AMD Instinct GPU nodes attached and kubectl set up correctly, you are ready to install the AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. It also adds features such as:
Automated driver installation and management
Easy deployment of the AMD GPU device plugin
Metrics collection and export
Simplified GPU resource allocation for containers
Automatic worker node labeling for GPU-enabled nodes
Ability to run GPU health tests via the Device Test Runner
The AMD GPU Operator and all of its components can be installed via Helm. To continue with the install, refer to the AMD GPU Operator Kubernetes Install page.
To get up and running more quickly, you can also follow the steps on the AMD GPU Operator Quick Start page.
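As a rough sketch of what the Helm flow looks like, the commands below follow the AMD GPU Operator documentation at the time of writing; the repository URL, chart name, and namespace are assumptions that may change, and the operator has additional prerequisites (such as cert-manager) covered on the install page, so treat that page as authoritative.
# Add the AMD GPU Operator Helm repository (URL per the operator docs; may change)
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the operator into its own namespace
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace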
Conclusion#
In this guide, you have successfully set up an Azure Kubernetes Service (AKS) cluster with AMD Instinct MI300X nodes. Your cluster is now fully operational with GPU support, featuring essential components such as the AMD GPU Operator. You can now deploy your AI workloads and leverage the power of AMD Instinct GPUs for accelerated computing.