Setting up Kubernetes with AMD Instinct MI300X using Azure VMs#

Introduction#

This guide covers setting up Rancher Kubernetes Engine 2 (RKE2), a lightweight, secure Kubernetes distribution that adheres closely to upstream Kubernetes. RKE2 is designed for flexibility and ease of use in various environments, including cloud and government scenarios.

Additionally, this guide includes steps for setting up ND MI300X v5 virtual machines to act as nodes in the Kubernetes cluster, creating inbound network security group rules for remote access, installing kubectl, Helm, Cert-Manager, and the AMD GPU Operator. It also covers creating a DeviceConfig custom resource for managing the MI300X GPUs, configuring a pod, and deploying a test workload.

Note

If you’re looking for a guide on Azure Kubernetes Service (AKS), Microsoft’s managed Kubernetes offering, please refer to this guide.

Prerequisites#

  • SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access

  • Azure Account: Maintain an active Azure account with appropriate subscription and resource group

  • Permissions: Ensure you have necessary permissions to create and manage Azure resources

  • vCPU Quota: Verify your subscription has sufficient vCPU quota for ND MI300X v5 VMs

  • Command-Line Tools: Install Azure CLI on your local machine or use Azure Cloud Shell
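
Before creating the VMs, you can optionally confirm that you’re signed in and that your subscription has quota for the ND MI300X v5 family. A minimal sketch using the Azure CLI (the grep filter is an assumption; match it to the exact family name shown for your subscription):

# Sign in to Azure (opens a browser window)
az login

# List vCPU usage and limits for the target region, then look for
# the ND MI300X v5 family row (the filter below is only a guess at its name)
az vm list-usage --location <your-region> --output table | grep -i "MI300X"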

Create virtual machines#

Define the following variables to streamline the process of creating the VMs. Replace the placeholder values as needed.

server_name="rke2-server"
server_size="Standard_D32a_v4"
worker_name="rke2-worker1"
worker_size="Standard_ND96is_MI300X_v5"
resource_group="<your-resource-group>"
region="<your-region>"
admin_username="azureuser"
ssh_key="<ssh-rsa AAAB3...>"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"

Note

For the worker_name variable, increment the number for each worker added, e.g. rke2-worker2, rke2-worker3, etc.

Create the server VM:

az vm create \
    --name $server_name \
    --size $server_size \
    --resource-group $resource_group \
    --location $region \
    --admin-username $admin_username \
    --ssh-key-value "$ssh_key" \
    --image $vm_image \
    --security-type Standard \
    --public-ip-sku Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

Create a worker VM:

az vm create \
    --name $worker_name \
    --size $worker_size \
    --resource-group $resource_group \
    --location $region \
    --admin-username $admin_username \
    --ssh-key-value "$ssh_key" \
    --image $vm_image \
    --security-type Standard \
    --public-ip-sku Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

The --public-ip-sku Standard parameter enables a static public IP address for remote cluster access. The --os-disk-delete-option Delete ensures the OS disk is removed when the VM is deleted, which helps with cleanup.
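
Once both VMs are provisioned, you’ll need their public IP addresses for the NSG rules and SSH steps that follow. One way to look them up, as a sketch using the variables defined above:

# List the public and private IP addresses of all VMs in the resource group
az vm list-ip-addresses \
  --resource-group $resource_group \
  --output table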

Create inbound NSG rules#

A Network Security Group (NSG) in Azure provides network-level security for your VMs through inbound and outbound traffic rules. When you create a VM, Azure automatically creates an NSG and associates it with the VM’s network interface. This NSG, named rke2-serverNSG for your server node, acts as a virtual firewall by filtering traffic between Azure resources, controlling which ports are open or closed, and defining allowed source and destination addresses.

To securely access your cluster, you’ll create two inbound rules in the NSG: SSH access on port 22 for remote administration, and Kubernetes API access on port 6443 for kubectl commands.

Allow SSH access#

If not already created during the VM creation process, create an NSG rule to allow access to the server node via SSH:

az network nsg rule create \
  --resource-group $resource_group \
  --nsg-name rke2-serverNSG \
  --name allow-ssh \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes <source-ip>/32 \
  --source-port-ranges "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges 22

To reduce the risk of unauthorized access, set the --source-address-prefixes parameter value <source-ip> to your local machine’s public IP address. The /32 suffix restricts the rule to exactly one IP address, which is a security best practice for NSG rules. To allow access from multiple IP addresses, you can specify a comma-separated list of IP addresses or CIDR ranges.

Tip

To find your public IP address, you can use curl ifconfig.me or visit a website like https://whatismyipaddress.com.
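
For example, you could capture your public IP address in a shell variable and plug it into the rule above. A small sketch, assuming curl is available on your local machine:

# Capture your current public IP address
my_ip=$(curl -s ifconfig.me)

# Pass it to the NSG rule as: --source-address-prefixes ${my_ip}/32
echo "Allowing inbound access from ${my_ip}/32"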

Allow Kubernetes API access#

To allow access to the server node via kubectl, create the following rule:

az network nsg rule create \
  --resource-group $resource_group \
  --nsg-name rke2-serverNSG \
  --name allow-k8s-api \
  --priority 1001 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes <source-ip>/32 \
  --source-port-ranges "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges 6443

The --source-address-prefixes parameter value <source-ip> should be set to your local machine’s public IP address, similar to the SSH rule.

Warning

Always restrict source IP addresses to specific addresses rather than allowing access from anywhere ("*"). This follows the principle of least privilege and reduces your attack surface.

Verify your NSG rules after creation:

az network nsg rule list \
  --resource-group $resource_group \
  --nsg-name rke2-serverNSG \
  --output table

Tip

To learn more about how to configure network security for your specific environment, refer to the Azure documentation.

Install Kubernetes#

We will use Rancher Kubernetes Engine 2 (RKE2), a Kubernetes distribution specifically designed for data centers with a strong emphasis on security.

Important

The RKE2 installation requires root privileges or sudo.

Server node installation#

Follow these steps to set up the server node, which will host the control plane components and manage the cluster.

SSH into the VM named rke2-server:

ssh -i id_rsa azureuser@<public-ip>

Replace <public-ip> with the public IP address of the VM rke2-server.

Download and run the installation script:

curl -sfL https://get.rke2.io | sudo sh -

Enable the RKE2 server service to start automatically:

sudo systemctl enable rke2-server.service

Start the service:

sudo systemctl start rke2-server.service

Confirm that it is running:

systemctl status rke2-server.service --no-pager

In addition to installing the rke2-server service, the installation script also sets up various utilities and clean-up scripts. Notably, it generates a kubeconfig file at /etc/rancher/rke2/rke2.yaml, which we will use to connect to the server node. Additionally, it creates an authentication token for registering other server or worker nodes, located at /var/lib/rancher/rke2/server/node-token.
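
Among the bundled utilities, RKE2 ships its own kubectl binary under /var/lib/rancher/rke2/bin. You can use it for a quick sanity check on the server before installing kubectl separately; a sketch:

# Verify the control plane is up using the kubectl binary bundled with RKE2
sudo /var/lib/rancher/rke2/bin/kubectl \
  --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes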

Note

Kubeconfig is a generic term for a file that configures kubectl access to clusters. On your local machine, this file is named config, without a filename extension.

Worker node installation#

Follow these steps to set up a worker (agent) node, which will join the cluster to handle AI workloads. Repeat these steps for each worker node you set up.

SSH into the worker VM, e.g. rke2-worker1:

ssh -i id_rsa azureuser@<public-ip>

Replace <public-ip> with the public IP address of the worker node.

Download and run the RKE2 installation script, setting the type to agent:

curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_TYPE="agent" sh -

Enable the RKE2 agent service to start automatically:

sudo systemctl enable rke2-agent.service

Next, configure the worker node to connect to the cluster. Create the configuration directory:

sudo mkdir -p /etc/rancher/rke2/

Open the configuration file for editing:

sudo nano /etc/rancher/rke2/config.yaml

If the file doesn’t exist, it will be created.

Add the following content to config.yaml:

server: https://<ip-address>:9345
token: <authentication token>

Important

Replace <ip-address> with the private IP address of your server node.
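
If you don’t have the private IP address at hand, one way to look it up from your local machine with the Azure CLI, as a sketch using the variables defined earlier (you can also run ip addr directly on the server):

# Query the server VM's private IP address
az vm show \
  --resource-group $resource_group \
  --name $server_name \
  --show-details \
  --query privateIps \
  --output tsv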

Retrieve the <authentication token> by running the following command on the server node:

sudo cat /var/lib/rancher/rke2/server/node-token

Copy and paste it into the config.yaml file on your worker node. Save the file and exit the editor.

Note

The rke2 server process listens on port 9345 for new nodes to register. The Kubernetes API is still served on port 6443, as normal.

Start the RKE2 agent service:

sudo systemctl start rke2-agent.service

Verify that it is running:

systemctl status rke2-agent.service --no-pager

Check the logs to see if the worker has successfully connected to the server:

sudo journalctl --unit rke2-agent --no-pager

In the output, look for a log entry containing the message “Remotedialer connected to proxy”. Example:

Apr 01 20:13:32 rke2-worker1 rke2[19231]: time="2025-04-01T20:13:32Z" level=info msg="Remotedialer connected to proxy" url="wss://10.0.0.5:9345/v1-rke2/connect"

This log entry indicates that the rke2-worker1 node, using the RKE2 component Remotedialer, successfully established a secure WebSocket connection to the RKE2 server node at 10.0.0.5 on port 9345. The phrase “connected to proxy” means that the server node acts as a proxy or intermediary, facilitating communication between the worker node and the Kubernetes control plane.

Note

If you see errors related to failing to pull images (e.g., Failed to pull images from /var/lib/rancher/rke2/agent/images/), try restarting the rke2-agent service: sudo systemctl restart rke2-agent.service. This error can occur if the worker node is unable to pull images from Docker Hub. Restarting the agent service often resolves these image loading issues.

Regenerate RKE2 certificates#

To enable external access to your cluster, we need to include the server’s public IP address in the TLS certificate’s Subject Alternative Name (SAN) list. By default, RKE2 only includes the server’s internal IP address in the certificate, which prevents external tools like kubectl from validating the connection. We’ll create a configuration file with the public IP and regenerate the certificates to ensure secure external communication with the Kubernetes API server.

Create or open the RKE2 server configuration file:

sudo nano /etc/rancher/rke2/config.yaml

Add the following content to include the public IP address in the TLS-SAN list:

tls-san:
  - "<public-ip>"  # Replace with your server's public IP address

This entry specifies an additional SAN for the TLS certificate, ensuring the certificate is valid for the server’s public IP address.

Save and close the file. Now follow these steps to apply the new configuration.

Stop the RKE2 server:

sudo systemctl stop rke2-server

Rotate the certificates to generate new ones with the updated SAN:

sudo rke2 certificate rotate

Start the RKE2 server to apply the changes:

sudo systemctl start rke2-server

Verify that the new TLS certificate includes the public IP address by inspecting the certificate:

sudo openssl x509 -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt -text -noout
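
To quickly confirm the new SAN entry without reading the full certificate dump, you can filter the output, for example:

# Show only the Subject Alternative Name section of the serving certificate
sudo openssl x509 \
  -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt \
  -noout -text | grep -A 1 "Subject Alternative Name"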

Install kubectl#

The Kubernetes command-line tool, kubectl, allows you to interact with a Kubernetes cluster’s control plane.

Install kubectl locally#

Download and install kubectl on your local machine by following the official documentation.

Verify the client (local) installation:

kubectl version --client

To enable kubectl to connect to the remote cluster server, configure the kubeconfig file.

Create the hidden directory .kube if it doesn’t exist:

mkdir %USERPROFILE%\.kube # Windows
mkdir $HOME/.kube # Linux/macOS

Run a command based on the following template to copy the kubeconfig file from the server node and save it on your local machine as config in the hidden .kube directory:

ssh azureuser@<public-ip> "sudo cat /etc/rancher/rke2/rke2.yaml" > <path>config

Replace the placeholder <public-ip> with the server’s public IP address and <path> with the path to your local .kube directory, including the trailing path separator.

Once downloaded, open the kubeconfig file and replace the server’s loopback address 127.0.0.1 with its public IP address. Save and close the file.

Note

The kubeconfig file contains connection information for the Kubernetes cluster, including the API server address and authentication credentials. By default, the server address is set to the loopback address 127.0.0.1. Replacing it with the public IP address allows kubectl to connect to the cluster from outside the server node.
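
On Linux or macOS, one way to make this substitution non-interactively is with sed; a sketch, where <public-ip> is your server’s public IP address:

# Point the kubeconfig at the server's public IP instead of localhost
# (on macOS, use: sed -i '' 's/127.0.0.1/<public-ip>/' $HOME/.kube/config)
sed -i 's/127.0.0.1/<public-ip>/' $HOME/.kube/config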

Set the KUBECONFIG environment variable to point to the kubeconfig file:

setx KUBECONFIG "%USERPROFILE%\.kube\config" # Windows
export KUBECONFIG=$HOME/.kube/config # Linux/macOS

Display all the contexts defined in your kubeconfig file:

kubectl config get-contexts

If needed, set default as the current context in your kubeconfig file:

kubectl config use-context default

Display the name of the currently active context in your kubeconfig file:

kubectl config current-context

Verify the configuration of the current context:

kubectl config view --minify

The command will omit sensitive data.

The next command should list the nodes in your RKE2 cluster if everything is set up correctly:

kubectl get nodes

Example output:

NAME           STATUS   ROLES                       AGE     VERSION
rke2-server    Ready    control-plane,etcd,master   4h3m    v1.31.7+rke2r1
rke2-worker1   Ready    <none>                      3h25m   v1.31.7+rke2r1

Having <none> in the ROLES column indicates the node is a worker node.
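
If you prefer the ROLES column to show a descriptive value for workers, you can optionally add a role label yourself. This is purely cosmetic; a sketch:

# Optional: give the worker node a visible "worker" role label
kubectl label node rke2-worker1 node-role.kubernetes.io/worker=worker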

Note

If you encounter an issue running kubectl get nodes, wait a few minutes for background processes to complete.

To display information about the cluster’s control plane and services, run:

kubectl cluster-info

Example output:

Kubernetes control plane is running at https://13.64.210.60:6443
CoreDNS is running at https://13.64.210.60:6443/api/v1/namespaces/kube-system/services/rke2-coredns-rke2-coredns:udp-53/proxy

Install kubectl on the server#

To install kubectl on the rke2-server node, run this command:

sudo snap install kubectl --classic

Ensure the KUBECONFIG environment variable is set to the correct path of the kubeconfig file on the server:

echo $KUBECONFIG

If not, set the KUBECONFIG environment variable to point to the kubeconfig file:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

This tells kubectl and Helm to use this configuration file for cluster interactions.

Change the security permissions of the kubeconfig file:

sudo chmod 644 /etc/rancher/rke2/rke2.yaml

This makes the file readable by non-root tools such as kubectl and Helm while keeping it writable only by root.

Install Helm#

Helm is a package manager that helps you manage Kubernetes applications using Helm charts, which are collections of YAML configuration files describing pre-configured Kubernetes resources.

Install Helm on the rke2-server node:

sudo snap install helm --classic
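
Confirm the installation by checking the client version:

helm version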

Install Cert-Manager#

Cert-Manager is an open source Kubernetes tool that ensures certificates are valid and up-to-date. Perform these steps on the rke2-server node.

Add the Helm chart repository jetstack:

helm repo add jetstack https://charts.jetstack.io --force-update

Install cert-manager:

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version <latest> \
  --set crds.enabled=true \
  --kubeconfig /etc/rancher/rke2/rke2.yaml

Replace <latest> with the version number you want to install.

Tip

Use the command helm search repo jetstack/cert-manager --devel --versions to find the latest version of cert-manager.

This command installs the cert-manager Helm chart from the Jetstack repository into the Kubernetes namespace cert-manager. The --set crds.enabled=true option enables the installation of custom resource definitions (CRDs), which is a requirement for cert-manager to function properly.

Verify the installation by listing all the pods running in the cert-manager namespace:

kubectl get pods --namespace cert-manager

Example output:

NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-6794b8d569-6cp8l              1/1     Running   0          3d17h
cert-manager-cainjector-7f69cd69f7-t8h4j   1/1     Running   0          3d17h
cert-manager-webhook-6cc5dccc4b-sn6vk      1/1     Running   0          3d17h

Install AMD GPU Operator#

The AMD GPU Operator consists of a set of tools for managing AMD GPUs in Kubernetes clusters. It automates the deployment and management of the necessary components to enable GPU support in your Kubernetes cluster. These components include drivers, device plugins, and monitoring tools.

Add the Helm chart repository rocm:

helm repo add rocm https://rocm.github.io/gpu-operator --force-update

Install the AMD GPU Operator:

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace

Verify the installation by listing all the pods running in the kube-amd-gpu namespace:

kubectl get pods --namespace kube-amd-gpu

Example output:

NAME                                                              READY   STATUS    RESTARTS   AGE
amd-gpu-operator-gpu-operator-charts-controller-manager-7b2wvgr   1/1     Running   0          2m42s
amd-gpu-operator-kmm-controller-548b84fbcd-64jj6                  1/1     Running   0          2m42s
amd-gpu-operator-kmm-webhook-server-5d86b78f84-69fs7              1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-gc-64c9b7dcd9-tnnfb       1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-master-7d69c9b6f9-pwslb   1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-worker-6w8mm              1/1     Running   0          2m42s
amd-gpu-operator-node-feature-discovery-worker-r6nkn              1/1     Running   0          2m42s

Use the following command to label a worker node with AMD GPU hardware:

kubectl label node rke2-worker1 feature.node.kubernetes.io/amd-gpu=true

Repeat this step for each worker node that’s equipped with AMD GPUs.

Verify that the nodes are correctly labeled:

kubectl get nodes -L feature.node.kubernetes.io/amd-gpu

Example output:

NAME           STATUS   ROLES                       AGE    VERSION          AMD-GPU
rke2-server    Ready    control-plane,etcd,master   4d6h   v1.31.7+rke2r1
rke2-worker1   Ready    <none>                      4d6h   v1.31.7+rke2r1   true

Create DeviceConfig custom resource#

A DeviceConfig is a custom resource that enables the configuration of specialized hardware resources like GPUs in Kubernetes. It allows you to specify how the AMD GPU Operator should manage the GPUs in your cluster.

On the rke2-server node, where kubectl is already installed, create a file named deviceconfig.yaml to configure the AMD GPU Operator in the cluster:

sudo nano deviceconfig.yaml

Paste the following configuration into the file:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  metricsExporter:
    enable: true
    serviceType: "NodePort"
    nodePort: 32500
    image: docker.io/rocm/device-metrics-exporter:v1.2.0
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"

Save and exit the file.

This DeviceConfig specifies the driver, where enable: false indicates using pre-installed drivers. It configures the device plugin and node labeller images. The metrics exporter is set up to collect and export GPU usage metrics. Finally, a selector identifies which nodes in the cluster should be managed by this configuration.

Apply the deviceconfig.yaml file to the Kubernetes cluster:

kubectl apply -f deviceconfig.yaml

Verify the deviceconfig resource in the kube-amd-gpu namespace:

kubectl describe deviceconfig -n kube-amd-gpu
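
Once the device plugin pods are running, the labeled worker nodes should advertise amd.com/gpu as an allocatable resource. One way to check, as a sketch:

# The Capacity and Allocatable sections should list amd.com/gpu
kubectl describe node rke2-worker1 | grep -i "amd.com/gpu"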

Pod configuration and test#

We will set up a Kubernetes pod that uses a ROCm PyTorch image to check the AMD-SMI version and monitor GPU usage. This test involves creating a manifest file, deploying the pod, and retrieving the pod logs. Finally, we delete the pod to clean up.

Create a manifest file called amd-smi.yaml that defines the pod configuration:

apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  containers:
  - image: docker.io/rocm/pytorch:latest
    name: amd-smi
    command: ["/bin/bash"]
    args: ["-c", "amd-smi version && amd-smi monitor -ptum"]
    resources:
      limits:
        amd.com/gpu: 1
      requests:
        amd.com/gpu: 1
  restartPolicy: Never

This configuration sets up a pod with a single container based on a ROCm PyTorch image, which we use to check the AMD-SMI version and monitor GPU usage.

Create the pod:

kubectl create -f amd-smi.yaml

Note

Allow a few minutes for the pod to start and initialize.
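
You can watch the pod’s status while it pulls the (fairly large) PyTorch image and runs, for example:

# Watch until the pod reaches the Completed status (Ctrl+C to stop watching)
kubectl get pod amd-smi --watch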

Retrieve the pod logs:

kubectl logs amd-smi

Delete the pod:

kubectl delete -f amd-smi.yaml

Conclusion#

This guide walked you through setting up a Kubernetes cluster on Azure using RKE2, configured with ND MI300X v5 VMs. You now have a fully functional cluster with GPU support, complete with essential components like Cert-Manager and the AMD GPU Operator. The cluster is ready for AI workloads that can leverage the computational power of AMD’s MI300X accelerators.

References#