Setting up Kubernetes with AMD Instinct MI300X on Azure#

Introduction#

This guide covers setting up Rancher Kubernetes Engine 2 (RKE2), a lightweight, secure Kubernetes distribution that adheres closely to upstream Kubernetes. RKE2 is designed for flexibility and ease of use in various environments, including cloud and government scenarios.

Additionally, this guide includes steps for setting up ND MI300X v5 virtual machines to act as nodes in the Kubernetes cluster, creating inbound network security group rules for remote access, configuring kubectl, installing Helm, Cert-Manager, and the AMD GPU Operator. It also covers creating a DeviceConfig custom resource for managing the MI300X GPUs, configuring a pod, and deploying a test workload.

Prerequisites#

  • SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access

  • Azure Account: Maintain an active Azure account with appropriate subscription and resource group

  • Permissions: Ensure you have necessary permissions to create and manage Azure resources

  • vCPU Quota: Verify your subscription has sufficient vCPU quota for ND MI300X v5 VMs

  • Command-Line Tools: Install Azure CLI on your local machine or use Azure Cloud Shell

Create virtual machines#

Define the following variables to streamline the process of creating the VMs. Replace the placeholder values as needed.

SERVER_NAME="rke2-server"
SERVER_SIZE="Standard_D32a_v4"
WORKER_NAME="rke2-worker1"
WORKER_SIZE="Standard_ND96isr_MI300X_v5"
RESOURCE_GROUP="<your-resource-group>"
REGION="<your-region>"
ADMIN_USERNAME="azureuser"
SOURCE_IP="<your-public-ip-address>/32"
SSH_KEY="<ssh-rsa AAAB3...>"
VM_IMAGE="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"

Note

For the WORKER_NAME variable, increment the number for each worker added, e.g. rke2-worker2, rke2-worker3, etc.

Note

In the SOURCE_IP variable, the /32 suffix restricts the NSG rule to exactly one IP address, which is a security best practice. To allow access from multiple sources, you can specify several IP addresses or CIDR ranges.
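For example, a minimal sketch with placeholder addresses; because the NSG rule commands later in this guide expand $SOURCE_IP unquoted, the space-separated prefixes are passed to --source-address-prefixes as separate values, which the Azure CLI accepts:

# Placeholder values: replace with your own IP addresses or CIDR ranges
SOURCE_IP="203.0.113.10/32 198.51.100.0/24"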

Tip

To find your public IP address, you can use curl ifconfig.me or visit a website like https://whatismyipaddress.com.
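For instance, a hedged one-liner (assuming curl is installed and ifconfig.me is reachable) fills in the SOURCE_IP variable directly:

# Look up your public IP and append the /32 suffix
SOURCE_IP="$(curl -s ifconfig.me)/32"
echo $SOURCE_IP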

Create the server VM:

az vm create \
    --name $SERVER_NAME \
    --size $SERVER_SIZE \
    --resource-group $RESOURCE_GROUP \
    --location $REGION \
    --admin-username $ADMIN_USERNAME \
    --ssh-key-value "$SSH_KEY" \
    --image $VM_IMAGE \
    --security-type Standard \
    --public-ip-sku Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

Create a worker VM:

az vm create \
    --name $WORKER_NAME \
    --size $WORKER_SIZE \
    --resource-group $RESOURCE_GROUP \
    --location $REGION \
    --admin-username $ADMIN_USERNAME \
    --ssh-key-value "$SSH_KEY" \
    --image $VM_IMAGE \
    --security-type Standard \
    --public-ip-sku Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

The --public-ip-sku Standard parameter enables a static public IP address for remote cluster access. The --os-disk-delete-option Delete ensures the OS disk is removed when the VM is deleted, which helps with cleanup.
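If you plan to add several workers, a minimal sketch along the following lines (reusing the variables above and incrementing the worker name as noted earlier) provisions them in a loop; adjust the count to match your vCPU quota:

# Create three worker VMs named rke2-worker1 through rke2-worker3
for i in 1 2 3; do
  az vm create \
    --name "rke2-worker${i}" \
    --size $WORKER_SIZE \
    --resource-group $RESOURCE_GROUP \
    --location $REGION \
    --admin-username $ADMIN_USERNAME \
    --ssh-key-value "$SSH_KEY" \
    --image $VM_IMAGE \
    --security-type Standard \
    --public-ip-sku Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete
done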

Create inbound NSG rules#

A Network Security Group (NSG) in Azure provides network-level security for your VMs through inbound and outbound traffic rules. When you create a VM, Azure automatically creates an NSG and associates it with the VM’s network interface. This NSG, named rke2-serverNSG for your server node, acts as a virtual firewall by filtering traffic between Azure resources, controlling which ports are open or closed, and defining allowed source and destination addresses.

To securely access your cluster, you’ll create two inbound rules in the NSG: SSH access on port 22 for remote administration, and Kubernetes API access on port 6443 for kubectl commands.

Allow SSH access#

If not already created during the VM creation process, create an NSG rule to allow access to the server node via SSH:

az network nsg rule create \
  --resource-group $RESOURCE_GROUP \
  --nsg-name rke2-serverNSG \
  --name allow-ssh \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes $SOURCE_IP \
  --source-port-ranges "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges 22

Allow Kubernetes API access#

To allow access to the server node via kubectl, create the following rule:

az network nsg rule create \
  --resource-group $RESOURCE_GROUP \
  --nsg-name rke2-serverNSG \
  --name allow-k8s-api \
  --priority 1001 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes $SOURCE_IP \
  --source-port-ranges "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges 6443

Verify NSG rules#

Verify your NSG rules after creation:

az network nsg rule list \
  --resource-group $RESOURCE_GROUP \
  --nsg-name rke2-serverNSG \
  --output table

Example output:

Name           ResourceGroup     Priority    SourcePortRanges    SourceAddressPrefixes    SourceASG    Access    Protocol    Direction    DestinationPortRanges    DestinationAddressPrefixes    DestinationASG
-------------  ----------------  ----------  ------------------  -----------------------  -----------  --------  ----------  -----------  -----------------------  ----------------------------  ----------------
allow-ssh      instinct-demo-rg  1000        *                   <your public IP>/32      None         Allow     Tcp         Inbound      22                       *                             None
allow-k8s-api  instinct-demo-rg  1001        *                   <your public IP>/32      None         Allow     Tcp         Inbound      6443                     *                             None

Tip

To learn more about configuring network security for your specific environment, refer to the Azure documentation.

Install Kubernetes#

We will use Rancher Kubernetes Engine 2 (RKE2), a Kubernetes distribution specifically designed for data centers with a strong emphasis on security.

Important

The RKE2 installation requires root privileges or sudo.

Server node installation#

Follow these steps to set up the server node, which will host the control plane components and manage the cluster.

SSH into the VM named rke2-server:

ssh -i id_rsa azureuser@<server-ip>

Tip

Obtain the server’s public IP address from the Azure portal or by running az vm list-ip-addresses --name rke2-server --resource-group $RESOURCE_GROUP. The IP address is also displayed in the terminal after successful VM creation.
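As an optional convenience, the sketch below (assuming the default JSON layout returned by az vm list-ip-addresses) captures the address in a shell variable before connecting:

SERVER_IP=$(az vm list-ip-addresses \
  --name $SERVER_NAME \
  --resource-group $RESOURCE_GROUP \
  --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" \
  --output tsv)
ssh -i id_rsa azureuser@$SERVER_IP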

Download and run the installation script:

curl -sfL https://get.rke2.io | sudo sh -

Enable the RKE2 server service to start automatically:

sudo systemctl enable rke2-server.service

Start the service:

sudo systemctl start rke2-server.service

Confirm that it is running:

systemctl status rke2-server.service --no-pager -l

In addition to installing the rke2-server service, the installation script also sets up various utilities and clean-up scripts. Notably, it generates a kubeconfig file at /etc/rancher/rke2/rke2.yaml, which we will use to connect to the server node. Additionally, it creates an authentication token for registering other server or worker nodes, located at /var/lib/rancher/rke2/server/node-token.

Note

Kubeconfig is the generic term for a file that configures kubectl access to clusters. On your local machine, this file is named config, without a filename extension.

Configure kubectl on the server#

The Kubernetes command-line tool, kubectl, allows you to interact with a Kubernetes cluster’s control plane. Kubectl is included in the RKE2 installation, so you only need to add the RKE2 binary directory to PATH:

echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc

By default, kubectl looks for a configuration file at $HOME/.kube/config. For RKE2, the kubeconfig is generated at /etc/rancher/rke2/rke2.yaml. To use this file, set the KUBECONFIG environment variable to point to it:

echo "export KUBECONFIG=/etc/rancher/rke2/rke2.yaml" >> ~/.bashrc

Note

We add both environment variables to the .bashrc file to make them persistent across sessions.

To apply both changes immediately in your current session, run:

source ~/.bashrc

After sourcing the .bashrc file, verify the kubectl installation:

kubectl version --client

Also check that the KUBECONFIG variable is set correctly:

echo $KUBECONFIG

Next, change the security permissions of the kubeconfig file:

sudo chmod 644 /etc/rancher/rke2/rke2.yaml

This allows non-root users to access the kubeconfig file while maintaining security.

Note

You’ll need to reset these permissions whenever the RKE2 server service starts or restarts, as RKE2 automatically resets the kubeconfig file permissions to 600 as a security measure.
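If you prefer not to repeat the chmod step after every restart, one option is RKE2's write-kubeconfig-mode setting; a minimal sketch, assuming you are comfortable with a world-readable kubeconfig on the server:

# Persist the relaxed kubeconfig permissions across rke2-server restarts
echo 'write-kubeconfig-mode: "0644"' | sudo tee -a /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server.service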

Verify the file permissions:

ls -l /etc/rancher/rke2/rke2.yaml

Verify that kubectl is working and view the current cluster status:

kubectl get nodes

Example output:

NAME          STATUS   ROLES                       AGE   VERSION
rke2-server   Ready    control-plane,etcd,master   97s   v1.31.8+rke2r1

Server node token retrieval#

Before proceeding with worker setup, retrieve the join token from the server node:

sudo cat /var/lib/rancher/rke2/server/node-token

Save this token as it will be needed to join the worker node to the cluster.
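Alternatively, you can pull the token over SSH from your local machine into a shell variable; a sketch, assuming the same SSH key and admin username as above:

NODE_TOKEN=$(ssh -i id_rsa azureuser@<server-ip> "sudo cat /var/lib/rancher/rke2/server/node-token")
echo $NODE_TOKEN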

Worker node installation#

Follow these steps to set up a worker (agent) node, which will join the cluster to handle AI workloads. Repeat these steps for each worker node you set up.

SSH into the worker VM, e.g. rke2-worker1:

ssh -i id_rsa azureuser@<worker-ip>

Download and run the RKE2 installation script, setting the type to agent:

curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_TYPE="agent" sh -

Enable the RKE2 agent service to start automatically:

sudo systemctl enable rke2-agent.service

Next, configure the worker node to connect to the cluster. Create the configuration directory:

sudo mkdir -p /etc/rancher/rke2/

Open the configuration file for editing:

sudo nano /etc/rancher/rke2/config.yaml

If the file doesn’t exist, it will be created.

Add the following content to config.yaml:

server: https://<server-private-ip>:9345
token: <node-token>

Replace <server-private-ip> with the private IP address of your server node. You can obtain this IP address by checking the Azure portal or by running az vm list-ip-addresses --name rke2-server --resource-group $RESOURCE_GROUP. Replace <node-token> with the token obtained from the server in the previous step.
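For example, a hedged one-liner (run from your local machine, assuming the default output layout of az vm list-ip-addresses) prints the server's private IP:

az vm list-ip-addresses \
  --name rke2-server \
  --resource-group $RESOURCE_GROUP \
  --query "[0].virtualMachine.network.privateIpAddresses[0]" \
  --output tsv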

Note

The rke2 server process listens on port 9345 for new nodes to register. The Kubernetes API is still served on port 6443, as normal.

Save the file and exit the editor.

Check the contents of config.yaml to ensure it is correct:

cat /etc/rancher/rke2/config.yaml

Start the RKE2 agent service:

sudo systemctl start rke2-agent.service

Verify that it is running:

systemctl status rke2-agent.service --no-pager -l

Check the logs to see if the worker has successfully connected to the server:

sudo journalctl --unit rke2-agent --no-pager -l

In the output, look for a log entry containing the message “Remotedialer connected to proxy”. Example:

Apr 01 20:13:32 rke2-worker1 rke2[19231]: time="2025-04-01T20:13:32Z" level=info msg="Remotedialer connected to proxy" url="wss://10.0.0.5:9345/v1-rke2/connect"

This log entry indicates that the rke2-worker1 node, using the RKE2 component Remotedialer, successfully established a secure WebSocket connection to the RKE2 server node at 10.0.0.5 on port 9345. The phrase “connected to proxy” means that the server node acts as a proxy or intermediary, facilitating communication between the worker node and the Kubernetes control plane.

Note

If the log entries show errors about failing to pull images (for example, failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image "index.docker.io/rancher/rke2-runtime:v1.31.8-rke2r1": not found), try restarting the agent with sudo systemctl restart rke2-agent.service. After restarting, check the logs again for messages indicating successful image pulls, such as Image index.docker.io/rancher/hardened-kubernetes:v1.31.8-rke2r1-build20250423 has already been pulled and Imported docker.io/rancher/rke2-runtime:v1.31.8-rke2r1.

Regenerate RKE2 certificates#

To enable external access to your cluster, we need to include the server’s public IP address in the TLS certificate’s Subject Alternative Name (SAN) list. By default, RKE2 only includes the server’s internal IP address in the certificate, which prevents external tools like kubectl from validating the connection. We’ll create a configuration file with the public IP and regenerate the certificates to ensure secure external communication with the Kubernetes API server.

Create or open the rke2-server configuration file:

sudo nano /etc/rancher/rke2/config.yaml

Add the following tls-san entry:

tls-san:
  - "<server-public-ip>"

This entry specifies an additional SAN for the TLS certificate, ensuring the certificate is valid for the server’s public IP address.

Save and close the file. Now follow these steps to apply the new configuration.

Stop the RKE2 server:

sudo systemctl stop rke2-server

Rotate the certificates to generate new ones with the updated SAN:

sudo rke2 certificate rotate

Start the RKE2 server to apply the changes:

sudo systemctl start rke2-server

After rotating the certificates and restarting the server, the file permissions of the kubeconfig file need to be set again to ensure proper access:

sudo chmod 644 /etc/rancher/rke2/rke2.yaml

Verify the file permissions:

ls -l /etc/rancher/rke2/rke2.yaml

Verify that the new TLS certificate includes the public IP address by inspecting the certificate:

sudo openssl x509 -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt -text -noout
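To focus on just the SAN entries, you can filter the output, for example:

sudo openssl x509 -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt -text -noout | grep -A 1 "Subject Alternative Name"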

Install kubectl locally#

Download and install kubectl on your local machine by following the official documentation.

Verify the client (local) installation:

kubectl version --client

To enable kubectl to connect to the remote cluster server, configure the kubeconfig file.

Create the hidden directory .kube if it doesn’t exist:

# Linux/macOS
mkdir $HOME/.kube

# Windows
mkdir %USERPROFILE%\.kube

Run the command for your platform below to copy the kubeconfig file from the server node and save it on your local machine as config in the hidden .kube directory:

# Linux/macOS
ssh azureuser@<public-ip> "sudo cat /etc/rancher/rke2/rke2.yaml" > $HOME/.kube/config

# Windows
ssh azureuser@<public-ip> "sudo cat /etc/rancher/rke2/rke2.yaml" > %USERPROFILE%\.kube\config

Once downloaded, open the kubeconfig file and replace the server’s loopback address 127.0.0.1 with its public IP address. Save and close the file.

Note

The kubeconfig file contains connection information for the Kubernetes cluster, including the API server address and authentication credentials. By default, the server address is set to the loopback address 127.0.0.1. Replacing it with the public IP address allows kubectl to connect to the cluster from outside the server node.
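If you prefer to script the substitution on Linux/macOS, a minimal sketch (GNU sed syntax; on macOS use sed -i '' instead) is:

# Replace the loopback address with the server's public IP (placeholder shown)
sed -i 's/127.0.0.1/<server-public-ip>/' $HOME/.kube/config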

Set the KUBECONFIG environment variable to point to the kubeconfig file:

# Linux/macOS
export KUBECONFIG=$HOME/.kube/config

# Windows
setx KUBECONFIG "%USERPROFILE%\.kube\config"

Display all the contexts defined in your kubeconfig file:

kubectl config get-contexts

If needed, set default as the current context in your kubeconfig file:

kubectl config use-context default

Display the name of the currently active context in your kubeconfig file:

kubectl config current-context

Verify the configuration of the current context:

kubectl config view --minify

The command will omit sensitive data.

The next command should list the nodes in your RKE2 cluster if everything is set up correctly:

kubectl get nodes

Example output:

NAME           STATUS   ROLES                       AGE     VERSION
rke2-server    Ready    control-plane,etcd,master   4h3m    v1.31.8+rke2r1
rke2-worker1   Ready    <none>                      3h25m   v1.31.8+rke2r1

Having <none> in the ROLES column indicates the node is a worker node.
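Optionally, you can give the worker a friendlier role label; this is purely cosmetic, and the label name below simply follows the common node-role convention:

kubectl label node rke2-worker1 node-role.kubernetes.io/worker=worker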

Note

If you encounter an issue running kubectl get nodes, wait a few minutes for background processes to complete.

To display information about the cluster’s control plane and services, run:

kubectl cluster-info

Example output:

Kubernetes control plane is running at https://13.64.210.60:6443
CoreDNS is running at https://13.64.210.60:6443/api/v1/namespaces/kube-system/services/rke2-coredns-rke2-coredns:udp-53/proxy

Install Helm#

Helm is a package manager that helps you manage Kubernetes applications using Helm charts, which are collections of YAML configuration files describing pre-configured Kubernetes resources.

Install Helm on the rke2-server node:

sudo snap install helm --classic

Install Cert-Manager#

Cert-Manager is an open-source Kubernetes tool that automates the management and renewal of TLS certificates, keeping them valid and up to date. Perform these steps on the rke2-server node.

Add the Helm chart repository jetstack:

helm repo add jetstack https://charts.jetstack.io --force-update

Install cert-manager:

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.17.2 \
  --set crds.enabled=true \
  --kubeconfig /etc/rancher/rke2/rke2.yaml

Replace v1.17.2 with the version you want to install, if needed.

Tip

Use the command helm search repo jetstack/cert-manager --devel --versions to find the latest version of cert-manager.

This command installs the cert-manager Helm chart from the Jetstack repository into the Kubernetes namespace cert-manager. The --set crds.enabled=true option enables the installation of custom resource definitions (CRDs), which is a requirement for cert-manager to function properly.

Verify the installation by listing all the pods running in the cert-manager namespace:

kubectl get pods --namespace cert-manager

Example output:

NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-6794b8d569-6cp8l              1/1     Running   0          3d17h
cert-manager-cainjector-7f69cd69f7-t8h4j   1/1     Running   0          3d17h
cert-manager-webhook-6cc5dccc4b-sn6vk      1/1     Running   0          3d17h

Install AMD GPU Operator#

The AMD GPU Operator consists of a set of tools for managing AMD GPUs in Kubernetes clusters. It automates the deployment and management of the necessary components to enable GPU support in your Kubernetes cluster. These components include drivers, device plugins, and monitoring tools. Perform these steps on the rke2-server node.

Add the Helm chart repository rocm:

helm repo add rocm https://rocm.github.io/gpu-operator --force-update

Install the AMD GPU Operator:

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --version v1.2.2

Verify the installation by listing all the pods running in the kube-amd-gpu namespace:

kubectl get pods --namespace kube-amd-gpu

Example output:

NAME                                                              READY   STATUS    RESTARTS   AGE
amd-gpu-operator-gpu-operator-charts-controller-manager-7b2wvgr   1/1     Running   0          2m42s
amd-gpu-operator-kmm-controller-548b84fbcd-64jj6                  1/1     Running   0          2m42s
amd-gpu-operator-kmm-webhook-server-5d86b78f84-69fs7              1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-gc-64c9b7dcd9-tnnfb       1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-master-7d69c9b6f9-pwslb   1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-worker-6w8mm              1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-worker-r6nkn              1/1     Running   0          2m42s

Verify that the nodes are correctly labeled:

kubectl get nodes -L feature.node.kubernetes.io/amd-vgpu

Example output:

NAME           STATUS   ROLES                       AGE     VERSION          AMD-VGPU
rke2-server    Ready    control-plane,etcd,master   4h      v1.31.8+rke2r1
rke2-worker1   Ready    <none>                      3h24m   v1.31.8+rke2r1   true

View available cluster GPUs:

kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'

Example output:

NAME           GPUs
rke2-server    <none>
rke2-worker1   8

Create DeviceConfig custom resource#

A DeviceConfig is a custom resource that enables the configuration of specialized hardware resources like GPUs in Kubernetes. It allows you to specify how the AMD GPU Operator should manage the GPUs in your cluster.

Create a file named deviceconfig.yaml on the rke2-server node to configure the AMD GPU Operator:

sudo nano deviceconfig.yaml

Paste the following configuration into the file:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  metricsExporter:
    enable: true
    serviceType: "NodePort"
    nodePort: 32500
    image: docker.io/rocm/device-metrics-exporter:v1.2.0
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"

Save and exit the file.

In this DeviceConfig, enable: false under driver indicates that the pre-installed drivers are used rather than operator-managed ones. It also configures the device plugin and node labeller images. The metrics exporter is set up to collect and export GPU usage metrics. Finally, a selector identifies which nodes in the cluster should be managed by this configuration.

Apply the deviceconfig.yaml file to the Kubernetes cluster:

kubectl apply -f deviceconfig.yaml

Verify the deviceconfig resource in the kube-amd-gpu namespace:

kubectl describe deviceconfig -n kube-amd-gpu
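Because the DeviceConfig enables the metrics exporter on NodePort 32500, you can optionally spot-check it from the server node; a sketch, assuming the exporter serves Prometheus-style metrics at /metrics and that the worker's private IP is reachable within the virtual network:

# Replace <worker-private-ip> with the worker node's private IP address
curl http://<worker-private-ip>:32500/metrics | head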

Pod configuration and test#

We will set up a Kubernetes pod to monitor GPU usage and check the AMD-SMI version using an Ubuntu ROCm image. This test involves creating a manifest file, deploying the pod, and retrieving the pod logs. Finally, we delete the pod to clean up.

On the rke2-server node, use a text editor to create a manifest file called amd-smi.yaml that defines the pod configuration:

apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  containers:
  - image: docker.io/rocm/dev-ubuntu-22.04:latest
    name: amd-smi
    command: ["/bin/bash"]
    args: ["-c","amd-smi version && amd-smi monitor -ptumv"]
    resources:
      limits:
        amd.com/gpu: 8
      requests:
        amd.com/gpu: 8
  restartPolicy: Never

This configuration defines a pod with a single container based on an Ubuntu ROCm image, which we use to run amd-smi and view GPU details. The resources section requests all eight GPUs on the worker node.

Create the pod:

kubectl create -f amd-smi.yaml

Note

Allow a few minutes for the pod to start and initialize.
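You can watch the pod move through its lifecycle (ContainerCreating, Running, and finally Completed, since the container exits after printing its output):

kubectl get pod amd-smi --watch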

Retrieve the pod logs:

kubectl logs amd-smi

Example output:

AMDSMI Tool: 25.3.0+ede62f2 | AMDSMI Library version: 25.3.0 | ROCm version: 6.4.0 | amdgpu version: 6.8.5 | amd_hsmp version: N/A
GPU  POWER   GPU_T   MEM_T   GFX_CLK   GFX%   MEM%  MEM_CLOCK  VRAM_USED  VRAM_TOTAL
  0  143 W   41 °C   34 °C   249 MHz    1 %    0 %   1159 MHz     282 MB    195766   196048 MB       0.0
  1  147 W   41 °C   33 °C   224 MHz    1 %    0 %   1147 MHz     282 MB    195766   196048 MB       0.0
  2  145 W   38 °C   32 °C   250 MHz    1 %    0 %   1138 MHz     282 MB    195766   196048 MB       0.0
  3  145 W   43 °C   35 °C   240 MHz    1 %    0 %   1125 MHz     282 MB    195766   196048 MB       0.0
  4  143 W   39 °C   33 °C   241 MHz    1 %    0 %   1113 MHz     282 MB    195766   196048 MB       0.0
  5  143 W   41 °C   34 °C   241 MHz    1 %    0 %   1099 MHz     282 MB    195766   196048 MB       0.0
  6  140 W   41 °C   32 °C   272 MHz    1 %    0 %   1087 MHz     282 MB    195766   196048 MB       0.0
  7  150 W   39 °C   32 °C   244 MHz    1 %    0 %   1133 MHz     282 MB    195766   196048 MB       0.0

Delete the pod:

kubectl delete -f amd-smi.yaml

Cleanup (optional)#

To manage costs when the cluster is not in use:

  • Deallocate the VMs to pause billing for compute resources

  • Delete the VMs and associated resources if they are no longer needed (see the example commands after this list)
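A minimal sketch of both options using the variables defined earlier (deallocation retains the disks and their storage costs; deletion is irreversible):

# Pause compute billing by deallocating the VMs
az vm deallocate --name $WORKER_NAME --resource-group $RESOURCE_GROUP
az vm deallocate --name $SERVER_NAME --resource-group $RESOURCE_GROUP

# Or delete the VMs entirely; the OS disks are removed automatically
# because the VMs were created with --os-disk-delete-option Delete
az vm delete --name $WORKER_NAME --resource-group $RESOURCE_GROUP --yes
az vm delete --name $SERVER_NAME --resource-group $RESOURCE_GROUP --yes

Note that az vm delete does not remove associated network resources such as NICs, public IP addresses, or the NSG; delete them separately, or delete the entire resource group if it is dedicated to this cluster.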

Conclusion#

This guide walked you through setting up a Kubernetes cluster on Azure using RKE2, configured with ND MI300X v5 VMs. You now have a functional cluster with GPU support, equipped with essential components like Cert-Manager and the AMD GPU Operator. The cluster is ready for AI workloads that can leverage the computational power of AMD’s MI300X accelerators.

While this guide covers the basics of setting up a Kubernetes cluster with AMD GPUs, additional considerations are crucial for a production environment. These include enhanced security measures, infrastructure automation, monitoring, and scaling strategies.

Software versions#

This guide uses the following software versions:

  • Ubuntu: 22.04 (Azure HPC image 22.04.2025030701)

  • RKE2: v1.31.8+rke2r1

  • Cert-Manager: v1.17.2

  • AMD GPU Operator: v1.2.2

  • AMD ROCm: 6.2.4 (VM image), 6.4.0 (container image)

References#