Model Inference with AMD Instinct MI300X on Azure Using vLLM#

Introduction#

This tutorial demonstrates how to run model inference workloads using AMD Instinct MI300X GPUs on Microsoft Azure with vLLM, a popular library for LLM inference. You’ll learn how to provision an ND MI300X v5 virtual machine, configure the required environment, and run a simple chatbot application using the Meta-Llama-3-8B-Instruct model accessed through Hugging Face.

The steps detailed in this guide are adapted from the chatbot tutorial in the AMD AI Developer Hub and have been successfully tested in Azure. The approach outlined should work as a foundation for most of the inference tutorials found on that site.

Prerequisites#

  • SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access

  • Azure Account: Maintain an active Azure account with appropriate subscription and resource group

  • Permissions: Ensure you have necessary permissions to create and manage Azure resources

  • vCPU Quota: Verify your subscription has sufficient vCPU quota for the ND MI300X v5 VM

  • Command-Line Tools: Install Azure CLI version 2.72.0 or above on your local machine (a quick way to check the CLI version and quota is shown after this list)
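
You can verify the CLI version and vCPU quota prerequisites from your terminal before provisioning anything. This is a minimal sketch using standard Azure CLI commands; the quota listing can be long, so look for the VM family that covers the ND MI300X v5 size in your target region.

# Show the installed Azure CLI version (should be 2.72.0 or above)
az version

# List compute quotas and current usage for your target region
az vm list-usage --location <your-region> --output table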

Request model access on Hugging Face#

In this tutorial, we will use the Meta-Llama-3-8B-Instruct model from Hugging Face. This model is designed for instruction-following tasks and is part of the Llama 3 series developed by Meta. To access this model, visit the Meta-Llama-3-8B-Instruct model page and submit the access request form. Once your request is accepted, you will receive an email notification confirming your access to the model.

Note

Ensure you use the same email address and username as your Hugging Face account, and provide your full company name when submitting the request.

Create Hugging Face access token#

You will also need to create a Hugging Face access token if you do not already have one. Create a new access token by navigating to the Hugging Face Access Tokens page and clicking on “Create new token”. Select the “Read” token type, click “Create token”, and copy the generated token. Store it in a secure location. You will use this token to authenticate your requests to the Hugging Face Hub.

Create variables#

To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.

resource_group="<your-resource-group>"
region="<your-region>"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"
source_ip="<your-ip-address>/32"
ssh_key="<ssh-rsa AAAAB3...>"

For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your SSH public key string.
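
If you need to retrieve your public key string, you can print it from the default OpenSSH location and assign it to the variable directly. This is a minimal sketch assuming an RSA key pair stored at ~/.ssh/id_rsa.pub; adjust the path if your key lives elsewhere.

ssh_key=$(cat ~/.ssh/id_rsa.pub)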

Tip

For additional guidance on how to create MI300X virtual machines on Azure, refer to this guide.

Create MI300X virtual machine#

The following Azure CLI command creates a new VM using the variables we defined above:

az vm create \
    --resource-group $resource_group \
    --name $vm_name \
    --image $vm_image \
    --size $vm_size \
    --location $region \
    --ssh-key-value "$ssh_key" \
    --security-type Standard \
    --os-disk-size-gb 256 \
    --os-disk-delete-option Delete

Important

Azure has shifted its default security type to TrustedLaunch for newly created VMs. The Standard security type is still supported, but it requires explicit registration of the UseStandardSecurityType feature flag per Azure subscription.
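
If the flag is not yet registered for your subscription, the registration can be done with the Azure CLI along the following lines. This is a sketch only; the Microsoft.Compute namespace is an assumption, so confirm the correct namespace and the flag's current status for your subscription.

# Register the feature flag (namespace assumed to be Microsoft.Compute)
az feature register --namespace Microsoft.Compute --name UseStandardSecurityType

# Check the registration state, then re-register the resource provider once it shows "Registered"
az feature show --namespace Microsoft.Compute --name UseStandardSecurityType --query properties.state
az provider register --namespace Microsoft.Compute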

Allow a few minutes for the VM and supporting resources to be created. If the VM was created successfully, the shell will display information specific to your deployment:

{
  "fqdns": "",
  "id": "/subscriptions/<guid>/resourceGroups/<your-resource-group>/providers/Microsoft.Compute/virtualMachines/MI300X",
  "location": "<your-region>",
  "macAddress": "00-0D-3A-35-FE-3F",
  "powerState": "VM running",
  "privateIpAddress": "10.0.0.5",
  "publicIpAddress": "<vm-ip-address>",
  "resourceGroup": "<your-resource-group>",
  "zones": ""
}

Note

Take note of the VM’s public IP address, as you will use this address to access the VM from your local machine. You can also obtain the VM’s public IP address from the Azure portal or by running the Azure CLI command az vm list-ip-addresses --name MI300X --resource-group $resource_group.
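
If you prefer to capture the public IP address in a shell variable for later use, a minimal sketch using the VM details query:

# Store the VM's public IP address in a variable
vm_ip=$(az vm show --show-details --resource-group $resource_group --name $vm_name --query publicIps --output tsv)
echo $vm_ip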

Create inbound NSG rule#

When you create a virtual machine in Azure, it is typically associated with a Network Security Group (NSG) that controls inbound and outbound traffic based on source and destination IP addresses, ports, and protocols. The NSG for your VM—in our example, MI300XNSG—usually includes a default rule that allows SSH inbound traffic on port 22. To access the Jupyter server running on the VM, you will need to create a new NSG rule that allows inbound traffic on port 8888 from your trusted IP address(es), defined in the source_ip variable.

az network nsg rule create \
  --resource-group $resource_group \
  --nsg-name MI300XNSG \
  --name allow-jupyter \
  --priority 1001 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes $source_ip \
  --source-port-ranges "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges 8888

Validate using AMD SMI library#

Validate your setup using the AMD System Management Interface (AMD SMI) library, which is a versatile command-line tool for managing and monitoring AMD hardware.

On your local machine, navigate to the hidden .ssh directory and connect to your VM:

ssh -i id_rsa azureuser@<vm-ip-address>

Note

Allow a couple of minutes after VM startup for all services to initialize before running amd-smi commands.

Verify the version of AMD SMI on your VM:

amd-smi version

The output will show the version of the AMD SMI tool itself, the version of the AMD SMI library, and the ROCm version. This information helps you ensure that you are using the correct versions of these components for managing and monitoring your AMD GPUs.

To list all eight AMD GPUs on the VM, along with basic information such as their universally unique identifier (UUID), run the following command:

amd-smi list

You should see output similar to the following:

GPU: 0
    BDF: 0002:00:00.0
    UUID: 1fff74b5-0000-1000-807c-84c354560001

GPU: 1
    BDF: 0003:00:00.0
    UUID: 8bff74b5-0000-1000-8042-32403807af72

GPU: 2
    BDF: 0004:00:00.0
    UUID: faff74b5-0000-1000-80fd-df190d55b466

[output truncated]

Docker setup#

Verify that Docker is running on your VM:

systemctl status --full docker --no-pager

Check if your user is part of the Docker group using the groups command. If you see docker in the output, you are already part of the Docker group. If not, add your user to the Docker group:

sudo usermod -aG docker $USER

Switch to the Docker group without logging out and back in by running:

newgrp docker

To verify that Docker is working correctly, run the following command to pull and run the hello-world image:

docker run hello-world

Launch Docker container#

The chatbot application will run inside a Docker container using the rocm/vllm:latest image, which is preconfigured with the necessary libraries and tools for running vLLM on AMD MI300X GPUs.

Launch the Docker container:

docker run -it --rm \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --ipc host \
  --security-opt seccomp=unconfined \
  --shm-size 8G \
  --hostname localhost \
  -v $(pwd):/workspace \
  -w /workspace/notebooks \
  --env HUGGINGFACE_HUB_CACHE=/workspace \
  --entrypoint /bin/bash \
  rocm/vllm:latest

Explanation of the key options:

  • The -v $(pwd):/workspace option creates a volume mount that maps your current directory ($(pwd)) on the host machine to the /workspace directory inside the container, allowing you to share files between the VM and the container.

  • The -w /workspace/notebooks option sets the working directory inside the container where you will create and run your Jupyter notebooks.

  • The --env HUGGINGFACE_HUB_CACHE=/workspace option configures the Hugging Face Hub cache location inside the container, preventing redundant downloads across container restarts.

  • The --entrypoint /bin/bash option overrides the container’s default entry command with the Bash shell, providing an interactive terminal that allows you to execute the commands outlined below inside the container.

Note

It may take several minutes for the container to start, especially if it is the first time you are running it.

After successful initialization of the container, you will see a command prompt similar to:

root@localhost:/workspace/notebooks#

This indicates that you are now inside the Docker container’s shell environment, logged in as the root user, and positioned in the /workspace/notebooks directory.

Install Jupyter#

To install Jupyter, run the following command inside the Docker container:

pip install jupyter

Start the Jupyter server:

jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Explanation of the key options:

  • The --ip=0.0.0.0 option allows Jupyter to listen on all available network interfaces, making it accessible from outside the container.

  • The --port=8888 option specifies the port on which Jupyter will run; 8888 is the default port for JupyterLab.

  • The --no-browser option prevents Jupyter from trying to open a web browser automatically, which is useful when running on a remote server.

  • The --allow-root option allows Jupyter to run as the root user, which is necessary since you are logged in as the root user inside the container.

After running the command, you will see output similar to the following:

[output truncated]
To access the server, open this file in a browser:
    file:///root/.local/share/jupyter/runtime/jpserver-24-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
    http://127.0.0.1:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
[C 2025-05-27 13:27:45.280 ServerApp]

Copy either URL provided in the output; it includes a token used to authenticate access to the Jupyter server running inside the Docker container.

Open a web browser on your local machine and paste the URL, replacing localhost or 127.0.0.1 (depending on which URL you copied) with the public IP address of your VM, so that it matches this pattern:

http://<vm-ip-address>:8888/lab?token=<jupyter-server-token>

Note

Ensure that you replace <jupyter-server-token> with the token provided in the output of the Jupyter server command. Do not use the Hugging Face access token, which you will use in the next step.

Provide Hugging Face access token#

To access the Meta-Llama-3-8B-Instruct model from Hugging Face, you need to provide your Hugging Face access token. Run the following code in a new cell in your Jupyter notebook to log in to Hugging Face:

from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()

Verify that your Hugging Face access token is set correctly by running the following code in a new cell:

try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")

You should see a message similar to:

Token validated successfully! Logged in as: <your-hugging-face-username>

Build chatbot with vLLM#

To build a basic chatbot, we will use the vllm library, which provides an efficient way to run large language models (LLMs) such as Meta-Llama-3-8B-Instruct. The following steps will guide you through creating the chatbot.

Import the necessary libraries and modules:

from vllm import LLM, SamplingParams
import gc
import torch
import time

The Chatbot class below encapsulates all functionality needed for our chatbot application: loading the LLM model, constructing prompts based on user input, generating responses using the LLM, and managing the conversation history. The class also includes a cleanup method to release GPU memory when the chatbot is no longer needed.

class Chatbot:
    def __init__(self):
        self.history = []
        self.system_instruction = (
            "You are a helpful and professional chatbot. "
            "Keep your responses concise, friendly, and relevant."
        )
        self.llm = self.load_model()

    def load_model(self):
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Adjust if using another model
        print("Loading the model. Please wait...")
        llm = LLM(model=model_name)
        print("Model loaded successfully!")
        return llm

    def construct_prompt(self, user_input):
        recent_history = self.history[-4:]
        conversation = (
            [{"role": "system", "content": self.system_instruction}]
            + recent_history
            + [{"role": "user", "content": user_input}]
        )
        return conversation

    def generate_response(self, conversation, max_tokens=200):
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens
        )
        outputs = self.llm.chat(conversation, sampling_params)
        reply = outputs[0].outputs[0].text
        return reply

    def get_response(self, user_input):
        conversation = self.construct_prompt(user_input)
        bot_response = self.generate_response(conversation)
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": bot_response})
        return bot_response

    def cleanup(self):
        """
        Clean up resources and release GPU memory.
        """
        if hasattr(self, "llm") and self.llm:
            print("Cleaning up GPU memory...")
            del self.llm  # Delete the LLM object
            gc.collect()
            torch.cuda.empty_cache()  # Release cached GPU memory (ROCm exposes the torch.cuda API)
            time.sleep(5)  # Brief pause before printing the final message
            print("Cleanup complete!")

Test the chatbot#

Finally, we can test the chatbot by creating an instance of the Chatbot class and interacting with it through a simple text input loop.

Create a new cell in your Jupyter notebook and run the following code:

chatbot = Chatbot()
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting chatbot...")
        break
    response = chatbot.get_response(user_input)
    print(f"Bot: {response}")

chatbot.cleanup()

This code initializes the chatbot, enters a loop where it waits for user input, and generates responses based on the user’s queries. The conversation history is maintained, allowing the chatbot to provide contextually relevant answers.

Note

Allow a few minutes for the model to load the first time you run the chatbot. Subsequent interactions will be faster as the model remains loaded in memory.

When you see the message: “Model loaded successfully!”, you can start interacting with the chatbot. Type your messages in the input field, and the chatbot will respond accordingly.

Exit the chat by typing exit or quit.

Cleanup (optional)#

To manage costs when the VM is not in use, you can perform the following cleanup steps (example commands are shown after the list):

  • Deallocate (stop) the VM to pause billing for compute resources

  • Delete the VM and associated resources if they are no longer needed
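
A sketch of the corresponding Azure CLI commands, reusing the variables defined earlier. Deleting the VM is irreversible, so confirm you no longer need any data on its disks first.

# Deallocate (stop) the VM to pause compute billing
az vm deallocate --resource-group $resource_group --name $vm_name

# Delete the VM when it is no longer needed; the OS disk is removed as well
# because the VM was created with --os-disk-delete-option Delete
az vm delete --resource-group $resource_group --name $vm_name --yes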

Conclusion#

In this tutorial, you learned how to set up a virtual machine with AMD Instinct MI300X GPUs on Microsoft Azure, configure the environment for model inference using vLLM, and build a simple chatbot application using the Meta-Llama-3-8B-Instruct model from Hugging Face. You also learned how to manage the Docker container, install Jupyter, and run the chatbot application within the container. This setup provides a solid foundation for running various inference workloads on AMD GPUs in Azure. You can now explore more advanced features of vLLM, experiment with different models, and build more complex applications using the powerful capabilities of AMD Instinct MI300X GPUs.

References#