Setting up Azure Kubernetes Service with AMD Instinct MI300X#
Introduction#
Azure Kubernetes Service (AKS) and AMD Instinct MI300X accelerators (GPUs) form a powerful combination for the most advanced AI deployments—a setup capable of scaling up to thousands of GPUs. In this guide, you will learn how to create an AKS cluster, add a GPU node pool with AMD Instinct MI300X accelerators, stop, start, scale, and delete nodes, and install and configure kubectl. In addition, the guide covers installing and configuring the AMD GPU Operator to help you better leverage AMD Instinct GPUs on your cluster.
Prerequisites#
Azure Setup: You need an active Azure account and subscription, a resource group, and the necessary permissions to create and manage resources.
vCPU Quota: Verify your subscription has sufficient vCPU quota. Contact your Microsoft account representative if needed.
Command-Line Interpreter: You will need a command-line interpreter with Azure CLI 2.72.0 or above installed on your local machine.
Create variables#
Set the following variables to streamline the VM creation process. Update the placeholder values as necessary.
region="my-region"
resource_group="my-rg"
aks_name="my-aks"
systempool="systempool"
gpupool="gpupool"
vm_size="Standard_D32a_v4"
gpu_vm_size="Standard_ND96isr_MI300X_v5"
os_sku="Ubuntu"
To optimize resource allocation and cost efficiency, two different VM sizes are used in the AKS cluster. The Standard_D32a_v4 VM size is designated for the system node pool, handling general-purpose tasks and managing the cluster’s infrastructure. Meanwhile, the Standard_ND96isr_MI300X_v5 VM size is allocated to the GPU node pool for AI workloads. This separation ensures efficient management of both regular operations and intensive AI processing without over-provisioning resources.
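Before creating the cluster, you can optionally confirm that both VM sizes are offered in your chosen region. This is a quick sanity check, assuming the variables above are already set in your shell:

```shell
# List matching SKUs in the target region; an empty table means the size
# is not available there. --size filters by name prefix.
az vm list-skus \
    --location $region \
    --size Standard_ND96isr_MI300X_v5 \
    --output table

az vm list-skus \
    --location $region \
    --size Standard_D32a_v4 \
    --output table
```

Check the `Restrictions` column in the output: a SKU can be listed for a region yet still be restricted for your subscription.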
Create AKS cluster#
Each AKS deployment spans two resource groups. You create the first resource group, which contains only the Kubernetes service. In this user guide, as stated in the prerequisites, we assume this resource group already exists.
The second resource group is called the node resource group. It is automatically created by AKS when you execute the az aks create
command. By default, this resource group has a name based on the pattern MC_resourcegroupname_clustername_location. Its purpose is to store all the infrastructure resources associated with the cluster, including the virtual machine scale sets (VMSS), which themselves include the VMs (the nodes), virtual networking, public IP addresses, and storage.
Important
You should only use the node resource group for resources that share the cluster’s life cycle. Do not modify any of its resources, as doing so can lead to cluster operation failures.
To create the AKS cluster using Azure CLI, execute the following command:
az aks create \
--location $region \
--resource-group $resource_group \
--name $aks_name \
--nodepool-name $systempool \
--node-vm-size $vm_size \
--os-sku $os_sku \
--node-count 1 \
--generate-ssh-keys
This command creates an AKS cluster with a system node pool dedicated to running critical system components. The node pool is named “systempool” to clearly distinguish its purpose from user workloads, and it is provisioned with a single node, which is sufficient for most development and testing scenarios. The --node-count parameter sets the initial number of allocated (running) nodes in the node pool.
Important
For production environments, enhance security by using managed identities for authentication and implementing network policies for traffic control. For more information, see AKS security best practices.
Note
Node pools group together nodes (VMs) of the same size and configuration by implementing virtual machine scale sets (VMSS). If you want to have different node sizes in a single cluster, you need to create multiple node pools, one for each size.
Add MI300X GPU node pool#
To support AI workloads, add a user node pool with two MI300X VMs to your AKS cluster.
To ensure that the machines in the node pool land on the same physical InfiniBand network, you need to register the AKS InfiniBand Support feature:
az feature register \
--name AKSInfinibandSupport \
--namespace Microsoft.ContainerService
Verify the registration was successful:
az feature show \
--namespace "Microsoft.ContainerService" \
--name AKSInfinibandSupport
When done, run the following command to add the GPU node pool:
az aks nodepool add \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool \
--node-vm-size $gpu_vm_size \
--os-sku $os_sku \
--node-count 2 \
--node-osdisk-type Managed \
--node-osdisk-size 256 \
--scale-down-mode Deallocate \
--gpu-driver None
The user node pool is specifically named “gpupool” to clearly indicate its purpose for running GPU workloads. The --node-count
parameter sets the initial number of allocated (running) nodes in the node pool. The --node-osdisk-type
parameter value Managed
makes the OS disk type persistent, which is required if you want to deallocate your nodes to save costs as ephemeral OS disks don’t support deallocation. Setting the --scale-down-mode
parameter to Deallocate
ensures the nodes are deallocated when they are scaled down. The --gpu-driver None
parameter is necessary because the VM size Standard_ND96isr_MI300X_v5
is not in AKS’s list of supported SKUs for automatic GPU driver installation. The AMD GPU Operator will handle the driver installation instead.
Warning
The default value of --scale-down-mode
is Delete
, which will delete any scaled-down nodes.
Note
If you want to enable autoscaling in the GPU node pool from the start, use the --enable-cluster-autoscaler
parameter together with the --min-count
and --max-count
parameters.
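As a hedged illustration of the note above, the node pool could be created with autoscaling enabled from the start. The min/max counts here are examples only; pick values that fit your quota and workload:

```shell
# Example: GPU node pool with the cluster autoscaler enabled,
# allowed to scale between 0 and 2 nodes.
az aks nodepool add \
    --resource-group $resource_group \
    --cluster-name $aks_name \
    --name $gpupool \
    --node-vm-size $gpu_vm_size \
    --os-sku $os_sku \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 2 \
    --node-count 2 \
    --node-osdisk-type Managed \
    --node-osdisk-size 256 \
    --scale-down-mode Deallocate \
    --gpu-driver None
```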
Install and configure kubectl#
To interact with your newly created AKS cluster, you’ll need to set up the Kubernetes command-line tool, kubectl
, and configure it for your AKS cluster.
To install kubectl on your local machine, run the following command:
az aks install-cli
Verify the installation was successful:
kubectl version --client
The --client
flag ensures that only the client version of kubectl is displayed, not the Kubernetes server version.
Run the following command to configure kubectl to connect to your AKS cluster:
az aks get-credentials \
--resource-group $resource_group \
--name $aks_name
Note
This command may require administrative privileges to modify the kubeconfig file. If you encounter permission issues, try running the command with elevated privileges (e.g., using sudo
on Linux/macOS or running the command as an administrator on Windows). You may also need to refresh your Azure credentials using az login
.
After running this command, you should see an output similar to:
Merged "my-aks" as current context in \Users\<username>\.kube\config
Despite the Windows-style path in the example message, the credentials are merged into your kubeconfig file regardless of your operating system. This file is typically located at ~/.kube/config
on Linux/macOS or %USERPROFILE%\.kube\config
on Windows.
Verify that kubectl is configured correctly by running:
kubectl config view --minify
This command outputs the entire configuration details of the current context, including the cluster name, user, and context. Ensure that the output matches your AKS cluster’s details.
Confirm you can communicate with your cluster by listing its nodes:
kubectl get nodes
You should see an output similar to:
NAME STATUS ROLES AGE VERSION
aks-gpupool-xxxxxxxx-vmss000000 Ready <none> 2m4s v1.31.7
aks-gpupool-xxxxxxxx-vmss000001 Ready <none> 114s v1.31.7
aks-systempool-xxxxxxxx-vmss000000 Ready <none> 11h v1.31.7
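To see which node pool and VM size back each node, you can surface the relevant node labels as extra columns. This assumes the standard AKS `agentpool` label and the upstream instance-type label are present:

```shell
# -L appends a column for each named node label.
kubectl get nodes \
    -L agentpool \
    -L node.kubernetes.io/instance-type
```

The GPU nodes should report the node pool name you chose (for example, gpupool) and the Standard_ND96isr_MI300X_v5 instance type.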
Stopping, starting, scaling, and deleting nodes#
For AI development tasks, you typically don’t need your GPU nodes running continuously. By stopping (deallocating) the node pool or scaling nodes to zero, you can save costs since you won’t be charged for the deallocated nodes.
To stop a running node pool:
az aks nodepool stop \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
To start a stopped node pool:
az aks nodepool start \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
Alternatively, you can use the az aks nodepool scale
command to increase or decrease the number of nodes in a node pool.
az aks nodepool scale \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool \
--node-count <desired-node-count>
Note
Remember, the --scale-down-mode
parameter controls whether the nodes are deallocated or deleted when they are scaled down.
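If the node pool was created with the default scale-down mode (Delete), it can be switched to Deallocate after the fact; a minimal sketch:

```shell
# Change the scale-down behavior of an existing node pool so that
# scaled-down nodes are deallocated (stopped) rather than deleted.
az aks nodepool update \
    --resource-group $resource_group \
    --cluster-name $aks_name \
    --name $gpupool \
    --scale-down-mode Deallocate
```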
To delete a node pool, including its nodes, use the following command:
az aks nodepool delete \
--resource-group $resource_group \
--cluster-name $aks_name \
--name $gpupool
Install AMD GPU Operator#
Now that you have an AKS cluster with AMD Instinct GPU nodes attached and kubectl set up correctly, you are ready to install the AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. It also adds features such as:
Automated driver installation and management
Easy deployment of the AMD GPU device plugin
Metrics collection and export
Simplified GPU resource allocation for containers
Automatic worker node labeling for GPU-enabled nodes
Ability to run GPU health tests via the Device Test Runner
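Once the operator is installed, GPU-enabled nodes should advertise an allocatable GPU resource. The resource name below assumes the default AMD device plugin configuration:

```shell
# Show allocatable AMD GPUs per node; dots in the resource name must be
# escaped in the custom-columns expression.
kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUs:.status.allocatable.amd\.com/gpu'
```

Each MI300X node should report a non-empty GPU count; system nodes will show `<none>`.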
The AMD GPU Operator and all of its components can be easily installed via Helm. To continue with the installation, refer to the AMD GPU Operator Kubernetes Install page.
To get up and running more quickly, you can instead follow the steps on the AMD GPU Operator Quick Start page.
Troubleshooting#
If you encounter any issues during the installation or configuration process, here are some common troubleshooting steps:
Check Azure CLI Version: Ensure you are using Azure CLI 2.72.0 or above on your local machine. You can check the version by running az version.
Check Azure CLI Extensions: If you have the aks-preview extension installed, remove it with az extension remove --name aks-preview to avoid conflicts.
Refresh Azure Credentials: If you encounter authentication issues, run az login to refresh your credentials.
Verify Resource Quotas: Ensure that your AKS cluster has sufficient resources allocated for your workloads.
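To inspect regional vCPU usage against your quota, you can filter the usage report for the MI300X family. The exact quota family name can vary by subscription and region, so the filter below is an assumption:

```shell
# List compute usage for the region and pick out MI300X-related entries.
az vm list-usage \
    --location $region \
    --output table | grep -i "MI300X"
```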
Conclusion#
In this guide, you have successfully set up an Azure Kubernetes Service (AKS) cluster with AMD Instinct MI300X nodes. Your cluster is now fully operational with GPU support, featuring essential components such as the AMD GPU Operator. You can now deploy your AI workloads and leverage the power of AMD Instinct GPUs for accelerated computing.