Model Inference with AMD Instinct MI300X on Azure Using vLLM#
Introduction#
This tutorial demonstrates how to run model inference workloads using AMD Instinct MI300X GPUs on Microsoft Azure with vLLM, a popular library for LLM inference. You’ll learn how to provision an ND MI300X v5 virtual machine, configure the required environment, and run a simple chatbot application using the Meta-Llama-3-8B-Instruct model accessed through Hugging Face.
The steps detailed in this guide are adapted from the chatbot tutorial in the AMD AI Developer Hub and have been successfully tested in Azure. The approach outlined should work as a foundation for most of the inference tutorials found on that site.
Prerequisites#
SSH Keys: Have an SSH key pair and OpenSSH installed on your local machine for secure VM access
Azure Account: Maintain an active Azure account with appropriate subscription and resource group
Permissions: Ensure you have necessary permissions to create and manage Azure resources
vCPU Quota: Verify your subscription has sufficient vCPU quota for the ND MI300X v5 VM
Command-Line Tools: Install Azure CLI version 2.72.0 or above on your local machine (a quick version check is shown after this list)
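If you are unsure which Azure CLI version you have, you can check it before continuing. The az version command prints the CLI and extension versions as JSON:
az version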
Request model access on Hugging Face#
In this tutorial, we will use the Meta-Llama-3-8B-Instruct model from Hugging Face. This model is designed for instruction-following tasks and is part of the Llama 3 series developed by Meta. To access this model, visit the Meta-Llama-3-8B-Instruct model page and submit the access request form. When accepted, you will receive an email notification confirming your access to the model.
Note
Ensure you use the same email address and username as your Hugging Face account, and provide your full company name when submitting the request.
Create Hugging Face access token#
You will also need to create a Hugging Face access token if you do not already have one. Create a new access token by navigating to the Hugging Face Access Tokens page and clicking on “Create new token”. Select the “Read” token type, click “Create token”, and copy the generated token. Store it in a secure location. You will use this token to authenticate your requests to the Hugging Face Hub.
Create variables#
To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.
resource_group="<your-resource-group>"
region="<your-region>"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"
source_ip="<your-ip-address>/32"
ssh_key="<ssh-rsa AAAAB3...>"
For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your SSH public key string.
Tip
For additional guidance on how to create MI300X virtual machines on Azure, refer to this guide.
Create MI300X virtual machine#
The following Azure CLI command creates a new VM using the variables we defined above:
az vm create \
--resource-group $resource_group \
--name $vm_name \
--image $vm_image \
--size $vm_size \
--location $region \
--ssh-key-value "$ssh_key" \
--security-type Standard \
--os-disk-size-gb 256 \
--os-disk-delete-option Delete
Important
Azure has shifted its default security type to TrustedLaunch for newly created VMs. The Standard security type is still supported, but it requires explicit registration of the UseStandardSecurityType feature flag per Azure subscription.
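If your subscription has not yet registered that flag, the following is a sketch of the registration with the Azure CLI. It assumes the UseStandardSecurityType flag is exposed under the Microsoft.Compute resource provider; verify the namespace against the current Azure documentation before running:
az feature register --namespace Microsoft.Compute --name UseStandardSecurityType
az feature show --namespace Microsoft.Compute --name UseStandardSecurityType --query properties.state
az provider register --namespace Microsoft.Compute
Feature registration can take several minutes. Re-run the show command until the state reports Registered, then re-register the provider so the change propagates.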
Allow a few minutes for the VM and supporting resources to be created. If the VM was created successfully, the shell will display information specific to your deployment:
{
"fqdns": "",
"id": "/subscriptions/<guid>/resourceGroups/<your-resource-group>/providers/Microsoft.Compute/virtualMachines/MI300X",
"location": "<your-region>",
"macAddress": "00-0D-3A-35-FE-3F",
"powerState": "VM running",
"privateIpAddress": "10.0.0.5",
"publicIpAddress": "<vm-ip-address>",
"resourceGroup": "<your-resource-group>",
"zones": ""
}
Note
Take note of the VM’s public IP address, as you will use this address to access the VM from your local machine. You can also obtain the VM’s public IP address from the Azure portal or by running the Azure CLI command az vm list-ip-addresses --name MI300X --resource-group $resource_group.
Create inbound NSG rule#
When you create a virtual machine in Azure, it is typically associated with a Network Security Group (NSG) that controls inbound and outbound traffic based on source and destination IP addresses, ports, and protocols. The NSG for your VM (in our example, MI300XNSG) usually includes a default rule that allows SSH inbound traffic on port 22. To access the Jupyter server running on the VM, you will need to create a new NSG rule that allows inbound traffic on port 8888 from your trusted IP address(es), defined in the source_ip variable.
az network nsg rule create \
--resource-group $resource_group \
--nsg-name MI300XNSG \
--name allow-jupyter \
--priority 1001 \
--direction Inbound \
--access Allow \
--protocol Tcp \
--source-address-prefixes $source_ip \
--source-port-ranges "*" \
--destination-address-prefixes "*" \
--destination-port-ranges 8888
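To confirm the rule was created as intended, you can optionally inspect it with the Azure CLI:
az network nsg rule show \
--resource-group $resource_group \
--nsg-name MI300XNSG \
--name allow-jupyter \
--output table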
Validate using AMD SMI library#
Validate your setup using the AMD System Management Interface (AMD SMI) library, which is a versatile command-line tool for managing and monitoring AMD hardware.
On your local machine, navigate to the hidden .ssh directory and connect to your VM:
ssh -i id_rsa azureuser@<vm-ip-address>
Note
Allow a couple of minutes after VM startup for all services to initialize before running amd-smi commands.
Verify the version of AMD SMI on your VM:
amd-smi version
The output will show the version of the AMD SMI tool itself, the version of the AMD SMI library, and the ROCm version. This information helps you ensure that you are using the correct versions of these components for managing and monitoring your AMD GPUs.
To list all eight AMD GPUs on the VM, along with basic information like their universally unique identifier, run the following command:
amd-smi list
You should see output similar to the following:
GPU: 0
BDF: 0002:00:00.0
UUID: 1fff74b5-0000-1000-807c-84c354560001
GPU: 1
BDF: 0003:00:00.0
UUID: 8bff74b5-0000-1000-8042-32403807af72
GPU: 2
BDF: 0004:00:00.0
UUID: faff74b5-0000-1000-80fd-df190d55b466
[output truncated]
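If you also want a quick view of utilization, power draw, temperature, and VRAM usage across the eight GPUs, the AMD SMI tool provides a monitor subcommand (the exact columns shown may vary with the AMD SMI version installed on the VM image):
amd-smi monitor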
Docker setup#
Verify that Docker is running on your VM:
systemctl status --full docker --no-pager
Check if your user is part of the Docker group using the groups command. If you see docker in the output, you are already part of the Docker group. If not, add your user to the Docker group:
sudo usermod -aG docker $USER
Switch to the Docker group without logging out and back in by running:
newgrp docker
To verify that Docker is working correctly, run the following command to pull and run the hello-world image:
docker run hello-world
Launch Docker container#
The chatbot application will run inside a Docker container using the rocm/vllm:latest image, which is preconfigured with the necessary libraries and tools for running vLLM on AMD MI300X GPUs.
Launch the Docker container:
docker run -it --rm \
--network host \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--ipc host \
--security-opt seccomp=unconfined \
--shm-size 8G \
--hostname localhost \
-v $(pwd):/workspace \
-w /workspace/notebooks \
--env HUGGINGFACE_HUB_CACHE=/workspace \
--entrypoint /bin/bash \
rocm/vllm:latest
Explanation of the key options:
The -v $(pwd):/workspace option creates a volume mount that maps your current directory ($(pwd)) on the host machine to the /workspace directory inside the container, allowing you to share files between the VM and the container.
The -w /workspace/notebooks option sets the working directory inside the container where you will create and run your Jupyter notebooks.
The --env HUGGINGFACE_HUB_CACHE=/workspace option configures the Hugging Face Hub cache location inside the container, preventing redundant downloads across container restarts.
The --entrypoint /bin/bash option overrides the container’s default entry command with the Bash shell, providing an interactive terminal that allows you to execute the commands outlined below inside the container.
Note
It may take several minutes for the container to start, especially if it is the first time you are running it.
After successful initialization of the container, you will see a command prompt similar to:
root@localhost:/workspace/notebooks#
This indicates that you are now inside the Docker container’s shell environment, logged in as the root user, and positioned in the /workspace/notebooks directory.
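Before installing anything, it is worth confirming that the container can see the GPUs. A quick check, assuming the rocm-smi utility ships with the rocm/vllm image (it is included in most ROCm-based images):
rocm-smi
You should see all eight MI300X devices listed. If not, re-check the --device /dev/kfd and --device /dev/dri flags passed to docker run.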
Install Jupyter#
To install Jupyter, run the following command inside the Docker container:
pip install jupyter
Start the Jupyter server:
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
The --ip=0.0.0.0 option allows Jupyter to listen on all available network interfaces, making it accessible from outside the container. The --port=8888 option specifies the port on which Jupyter will run, which is the default port for Jupyter Lab. The --no-browser option prevents Jupyter from trying to open a web browser automatically, which is useful when running on a remote server. The --allow-root option allows Jupyter to run as the root user, which is necessary if you are still logged in as root.
After running the command, you will see output similar to the following:
[output truncated]
To access the server, open this file in a browser:
file:///root/.local/share/jupyter/runtime/jpserver-24-open.html
Or copy and paste one of these URLs:
http://localhost:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
http://127.0.0.1:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
[C 2025-05-27 13:27:45.280 ServerApp]
Copy either URL provided in the output, which includes a token used for authenticating access to the Jupyter server, and replace localhost or 127.0.0.1 (depending on which URL you copied) with the public IP address of your VM. This URL is used to access the Jupyter server running inside the Docker container.
Open a web browser on your local machine and paste the modified URL, replacing <vm-ip-address> with the actual public IP address of your VM and <jupyter-server-token> with the token provided in the Jupyter server output:
http://<vm-ip-address>:8888/lab?token=<jupyter-server-token>
Note
Ensure that you replace <jupyter-server-token> with the token provided in the output of the Jupyter server command. Do not use the Hugging Face access token, which you will use in the next step.
Provide Hugging Face access token#
To access the Meta-Llama-3-8B-Instruct model from Hugging Face, you need to provide your Hugging Face access token. Run the following code in a new cell in your Jupyter notebook to log in to Hugging Face:
from huggingface_hub import notebook_login, HfApi
# Prompt the user to log in
notebook_login()
Verify that your Hugging Face access token is set correctly by running the following code in a new cell:
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
You should see a message similar to:
Token validated successfully! Logged in as: <your-hugging-face-username>
Build chatbot with vLLM#
To build a basic chatbot using vLLM, we will use the vllm library, which provides an efficient way to run large language models (LLMs) like Meta-Llama-3-8B-Instruct. The following steps will guide you through creating the chatbot.
Import the necessary libraries and modules:
from vllm import LLM, SamplingParams
import gc
import torch
import time
The Chatbot class below encapsulates all the functionality needed for our chatbot application: loading the LLM model, constructing prompts based on user input, generating responses using the LLM, and managing the conversation history. The class also includes a cleanup method to release GPU memory when the chatbot is no longer needed.
class Chatbot:
    def __init__(self):
        self.history = []
        self.system_instruction = (
            "You are a helpful and professional chatbot. "
            "Keep your responses concise, friendly, and relevant."
        )
        self.llm = self.load_model()

    def load_model(self):
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Adjust if using another model
        print("Loading the model. Please wait...")
        llm = LLM(model=model_name)
        print("Model loaded successfully!")
        return llm

    def construct_prompt(self, user_input):
        recent_history = self.history[-4:]
        conversation = [{"role": "system", "content": self.system_instruction}] + recent_history + [{"role": "user", "content": user_input}]
        return conversation

    def generate_response(self, conversation, max_tokens=200):
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens
        )
        outputs = self.llm.chat(conversation, sampling_params)
        reply = outputs[0].outputs[0].text
        return reply

    def get_response(self, user_input):
        conversation = self.construct_prompt(user_input)
        bot_response = self.generate_response(conversation)
        self.history.append({"role": "user", "content": user_input})
        # Store the reply with the "assistant" role so the chat template accepts the history on later turns
        self.history.append({"role": "assistant", "content": bot_response})
        return bot_response

    def cleanup(self):
        """
        Clean up resources and release GPU memory.
        """
        if hasattr(self, "llm") and self.llm:
            print("Cleaning up GPU memory...")
            del self.llm  # Delete the LLM object
            gc.collect()
            torch.cuda.empty_cache()  # Clear the GPU memory cache
            time.sleep(5)  # Wait a few seconds before printing the final message
            print("Cleanup complete!")
Test the chatbot#
Finally, we can test the chatbot by creating an instance of the Chatbot class and interacting with it through a simple text input loop.
Create a new cell in your Jupyter notebook and run the following code:
chatbot = Chatbot()

while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting chatbot...")
        break
    response = chatbot.get_response(user_input)
    print(f"Bot: {response}")

chatbot.cleanup()
This code initializes the chatbot, enters a loop where it waits for user input, and generates responses based on the user’s queries. The conversation history is maintained, allowing the chatbot to provide contextually relevant answers.
Note
Allow a few minutes for the model to load the first time you run the chatbot. Subsequent interactions will be faster as the model remains loaded in memory.
When you see the message: “Model loaded successfully!”, you can start interacting with the chatbot. Type your messages in the input field, and the chatbot will respond accordingly.
Exit the chat by typing exit or quit.
Cleanup (optional)#
To manage costs when the VM is not in use, you can perform the following cleanup steps (example commands are sketched after this list):
Deallocate (stop) the VM to pause billing for compute resources
Delete the VM and associated resources if they are no longer needed
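For reference, here is a minimal sketch of those two steps using the Azure CLI and the variables defined earlier. Deleting the VM also removes the OS disk, because the VM was created with --os-disk-delete-option Delete; the NIC, public IP, and NSG created alongside it may need to be deleted separately:
az vm deallocate --resource-group $resource_group --name $vm_name
az vm delete --resource-group $resource_group --name $vm_name --yes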
Conclusion#
In this tutorial, you learned how to set up a virtual machine with AMD Instinct MI300X GPUs on Microsoft Azure, configure the environment for model inference using vLLM, and build a simple chatbot application using the Meta-Llama-3-8B-Instruct model from Hugging Face. You also learned how to manage the Docker container, install Jupyter, and run the chatbot application within the container. This setup provides a solid foundation for running various inference workloads on AMD GPUs in Azure. You can now explore more advanced features of vLLM, experiment with different models, and build more complex applications using the powerful capabilities of AMD Instinct MI300X GPUs.