Fine-tuning Llama-3.2 3B with LoRA on Azure#
Introduction#
Large language models (LLMs) with open weights are typically pretrained and instruction-tuned on vast datasets, enabling them to perform a wide range of tasks out of the box. However, organizations often need to adapt these models to proprietary data or specialized tasks. This process, known as fine-tuning, customizes a model’s behavior for specific domains and requirements, unlocking greater value from these publicly available models.
This tutorial demonstrates how to use low-rank adaptation (LoRA) to fine-tune Meta’s Llama-3.2 3B model on instruction-following data. LoRA enables efficient adaptation by adding a small number of trainable parameters to the model. This approach significantly reduces computational costs compared to full fine-tuning, without compromising model quality.
The workflow presented is adapted from the AMD AI Developer Hub and validated specifically for AMD Instinct MI300X GPUs on Azure. You will provision an Azure virtual machine, configure a Docker environment, and use the PEFT library from Hugging Face to implement LoRA. To illustrate the technique, you will use a sample dataset of 1,000 multilingual instruction-following examples. Finally, you will test your fine-tuned model by generating a sample response to a query.
By the end of this tutorial, you will have practical experience with LoRA-based fine-tuning on MI300X GPUs in Azure, preparing you to apply these techniques to your own projects.
Prerequisites#
SSH Keys: Have an SSH key pair installed on your local machine for secure VM access.
Azure Account: Maintain an active Azure account with appropriate subscription and resource group.
Permissions: Ensure you have necessary permissions to create and manage Azure resources.
vCPU Quota: Verify your subscription has sufficient vCPU quota.
Command-Line Tools: Install Azure CLI version 2.74.0 or later on your local machine.
Request model access on Hugging Face#
Visit the Llama-3.2 3B model page on Hugging Face and submit the access request form. Once your request is approved, you will receive an email notification confirming your access to the model.
Note
Ensure that the email address and username you provide match your Hugging Face account, and include your full company name when submitting the request.
Create Hugging Face access token#
You will also need to create a Hugging Face access token if you do not already have one. Create a new access token by navigating to the Hugging Face Access Tokens page and clicking on “Create new token”. Select the “Read” token type, click “Create token”, and copy the generated token. Store it in a secure location. You will use this token to authenticate your requests to the Hugging Face Hub.
Create MI300X virtual machine#
To streamline the process of creating the VM, define the following variables. Replace the placeholder values as needed.
resource_group="<your-resource-group>"
region="<your-region>"
vm_name="MI300X"
admin_username="azureuser"
vm_size="Standard_ND96isr_MI300X_v5"
vm_image="microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701"
source_ip="<your-ip-address>/32"
ssh_key="<ssh-rsa AAAAB3...>"
For the ssh_key value, replace the placeholder <ssh-rsa AAAAB3...> with your SSH public key string.
Tip
For additional guidance on how to create MI300X virtual machines on Azure, refer to this guide.
The following Azure CLI command creates a new VM using the variables we defined above:
az vm create \
--resource-group $resource_group \
--name $vm_name \
--image $vm_image \
--size $vm_size \
--location $region \
--admin-username $admin_username \
--ssh-key-value "$ssh_key" \
--security-type Standard \
--os-disk-size-gb 256 \
--os-disk-delete-option Delete
Important
Azure has shifted its default security type to TrustedLaunch
for newly created VMs. The Standard
security type is still supported, but it requires explicit registration of the UseStandardSecurityType
feature flag per Azure subscription.
Allow a few minutes for the VM and supporting resources to be created. If the VM was created successfully, the shell will display information specific to your deployment:
{
"fqdns": "",
"id": "/subscriptions/<guid>/resourceGroups/<your-resource-group>/providers/Microsoft.Compute/virtualMachines/MI300X",
"location": "<your-region>",
"macAddress": "60-45-BD-04-D7-7B",
"powerState": "VM running",
"privateIpAddress": "10.0.0.5",
"publicIpAddress": "<vm-ip-address>",
"resourceGroup": "<your-resource-group>",
"zones": ""
}
Note
Take note of the VM’s public IP address, as you will use this address to access the VM from your local machine. You can also obtain the VM’s public IP address from the Azure portal or by running the Azure CLI command az vm list-ip-addresses --name MI300X --resource-group $resource_group.
Create inbound NSG rule#
When you create a virtual machine in Azure, it is typically associated with a Network Security Group (NSG) that controls inbound and outbound traffic based on source and destination IP addresses, ports, and protocols. The NSG for your VM—in our example, MI300XNSG
—usually includes a default rule that allows SSH inbound traffic on port 22. To access the Jupyter server running on the VM, you will need to create a new NSG rule that allows inbound traffic on port 8888
from your trusted IP address(es), defined in the source_ip
variable.
az network nsg rule create \
--resource-group $resource_group \
--nsg-name MI300XNSG \
--name allow-jupyter \
--priority 1001 \
--direction Inbound \
--access Allow \
--protocol Tcp \
--source-address-prefixes $source_ip \
--source-port-ranges "*" \
--destination-address-prefixes "*" \
--destination-port-ranges 8888
Docker setup#
On your local machine, navigate to the hidden .ssh
directory and connect to your VM:
ssh -i id_rsa azureuser@<vm-ip-address>
Verify that Docker is running on your VM:
systemctl status --full docker --no-pager
Check if your user is part of the Docker group using the groups
command. If you see docker
in the output, you are already part of the Docker group. If not, add your user to the Docker group:
sudo usermod -aG docker $USER
Switch to the Docker group without logging out and back in by running:
newgrp docker
To verify that Docker is working correctly, run the following command to pull and run the hello-world
image:
docker run hello-world
Launch Docker container#
The code will run inside a Docker container using the rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 image, which is preconfigured with the necessary libraries and tools for running on AMD MI300X GPUs.
Launch the Docker container:
docker run -it --rm \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--shm-size 8G \
--hostname localhost \
-v $(pwd):/workspace \
-w /workspace/notebooks \
rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
Note
It may take several minutes for the container to start, especially if it is the first time you are running it.
After successful initialization of the container, you will see a command prompt similar to:
root@localhost:/workspace/notebooks#
This indicates that you are now inside the Docker container’s shell environment, logged in as the root
user, and positioned in the /workspace/notebooks
directory.
Install Jupyter#
To install Jupyter, run the following command inside the Docker container:
pip install jupyter
Once complete, start the Jupyter server:
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
The --ip=0.0.0.0
option allows Jupyter to listen on all available network interfaces, making it accessible from outside the container. The --port=8888
option specifies the port on which Jupyter will run, which is the default port for Jupyter Lab. The --no-browser
option prevents Jupyter from trying to open a web browser automatically, which is useful when running on a remote server. The --allow-root
option allows Jupyter to run as the root user, which is necessary if you are still logged in as root.
After running the command, you will see output similar to the following:
[output truncated]
To access the server, open this file in a browser:
file:///root/.local/share/jupyter/runtime/jpserver-24-open.html
Or copy and paste one of these URLs:
http://localhost:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
http://127.0.0.1:8888/lab?token=3ef930be965a6496c152ce2026e5b82cb53c0d600ab918ee
[C 2025-05-27 13:27:45.280 ServerApp]
Copy either URL provided in the output, which includes a token used for authenticating access to the Jupyter server, and replace localhost
or 127.0.0.1
(depending on which URL you copied) with the public IP address of your VM. This URL is used to access the Jupyter server running inside the Docker container.
Open a web browser on your local machine and paste the modified URL, replacing <vm-ip-address>
with the actual public IP address of your VM and <jupyter-server-token>
with the token provided in Jupyter server output:
http://<vm-ip-address>:8888/lab?token=<jupyter-server-token>
Note
Ensure that you replace <jupyter-server-token>
with the token provided in the output of the Jupyter server command. Do not use the Hugging Face access token, which you will use in the next step.
Install the required packages#
Run the following commands inside the Jupyter notebook running within the Docker container:
!pip install \
pandas \
peft==0.14.0 \
transformers==4.47.1 \
trl==0.13.0 \
accelerate==1.2.1 \
scipy \
tensorboardX
Provide Hugging Face access token#
To access the model from Hugging Face, you need to provide your Hugging Face access token. Run the following code in a new cell in your Jupyter notebook to log in to Hugging Face:
from huggingface_hub import notebook_login, HfApi
notebook_login()
This will prompt you to enter your Hugging Face access token. Paste it in the input field, uncheck the box “Add token as git credential?”, and click Login.
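If you prefer non-interactive authentication, for example when running these steps as a script rather than a notebook, the huggingface_hub library can also take the token programmatically. A minimal sketch, assuming you have first exported the token in the container shell as an HF_TOKEN environment variable:
import os
from huggingface_hub import login

# Read the access token from the environment instead of pasting it into a widget.
login(token=os.environ["HF_TOKEN"])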
Verify that your Hugging Face access token is set correctly:
try:
api = HfApi()
user_info = api.whoami()
print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
print(f"Token validation failed. Error: {e}")
You should see a message similar to:
Token validated successfully! Logged in as: <your-hugging-face-username>
Configure GPUs for PyTorch#
This script configures which GPUs PyTorch should recognize and use during training. In this setup, all eight GPUs available on the MI300X VM are enabled, allowing PyTorch to fully leverage the machine’s computational capacity and significantly speed up the fine-tuning process.
import os
import torch
gpus = list(range(8)) # [0, 1, 2, 3, 4, 5, 6, 7]
os.environ.setdefault("CUDA_VISIBLE_DEVICES", ','.join(map(str, gpus)))
print(f"PyTorch detected number of available devices: {torch.cuda.device_count()}")
Expected output:
PyTorch detected number of available devices: 8
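As an optional sanity check, you can also print the name of each visible device to confirm that PyTorch sees the MI300X accelerators. A short sketch using standard PyTorch calls:
# On ROCm builds of PyTorch, AMD GPUs are exposed through the familiar torch.cuda API.
for i in range(torch.cuda.device_count()):
    print(f"Device {i}: {torch.cuda.get_device_name(i)}")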
Import the required libraries#
Import the necessary libraries for dataset handling, model configuration, and LoRA fine-tuning:
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
pipeline
)
from peft import LoraConfig, PeftModel
from peft import get_peft_model
from trl import SFTTrainer
Load and configure the base model#
Load the base model and tokenizer for fine-tuning. The meta-llama/Llama-3.2-3B model is used as the base model, and the LoRA fine-tuned model will be saved as Llama-3.2-3B-lora.
base_model_name = "meta-llama/Llama-3.2-3B"
new_model_name = "Llama-3.2-3B-lora"
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code=True,
use_fast=True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto", # enables model parallelism
trust_remote_code=True
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
The tokenizer is set to use the end-of-sequence token as the padding token, which is a common practice for language models to ensure consistent input lengths during training and inference. We also set device_map="auto"
to enable model parallelism, where the model’s layers are split across all available GPUs.
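If you are curious how the layers were distributed, the placement chosen by Accelerate is stored on the model. A quick, optional check:
# hf_device_map is populated when the model is loaded with device_map="auto".
# It maps module names to the device index each group of layers was placed on.
print(base_model.hf_device_map)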
Load the dataset#
To fine-tune the Llama-3.2 3B model we will use the guanaco-llama2-1k dataset of 1,000 multilingual instruction-following examples. This dataset is a small subset of the Guanaco dataset and is intended for demonstration purposes.
Load the dataset and its training split:
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
Note
The guanaco-llama2-1k
dataset only contains a train split, so we will not perform validation in this tutorial. For production use cases, it is best practice to use a dataset with separate training and validation splits to monitor for overfitting, and a final test split for unbiased model evaluation.
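If you adapt this workflow to your own data, you can carve out a held-out validation split directly with the datasets library. A minimal sketch, assuming an illustrative 90/10 split (this tutorial continues with the full 1,000-example training split):
# Split the single train split into train and validation subsets.
split = training_data.train_test_split(test_size=0.1, seed=42)
train_split = split["train"]        # 900 examples
validation_split = split["test"]    # 100 examples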
Display the shape of the training dataset:
print(training_data.shape)
Expected output:
(1000, 1)
Look at a random sample from the training dataset:
import random
print(training_data[random.randint(0, len(training_data) - 1)])
In the random sample output, notice the use of the special tokens <s>, </s>, [INST], and [/INST]. These tokens help the model understand the structure of the input data, indicating where each instruction and response begins and ends. Here’s what each token signifies:
<s> and </s>: These are the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens. They separate different turns in a multi-turn conversation, allowing the model to distinguish between individual instruction and response pairs.
[INST] and [/INST]: These tokens mark the start and end of an instruction in the conversation.
For example, a multi-turn conversation might look like this:
{'text': "<s>[INST] What is LoRA in machine learning? [/INST] LoRA stands for Low-Rank Adaptation, a technique for efficiently fine-tuning large language models. </s><s>[INST] Why is it useful? [/INST] It reduces memory and compute requirements by only training a small number of additional parameters. </s>"}
In this example, each instruction and response pair is wrapped with <s> and </s>, while the instruction itself is enclosed by [INST] and [/INST].
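If you want to build prompts in this format yourself, a small helper function along these lines can be used. The function name is illustrative; the same single-turn format appears again when testing the fine-tuned model later in this tutorial:
def format_prompt(query: str) -> str:
    # Wrap a single-turn instruction in the chat markers used by the guanaco-llama2-1k dataset.
    return f"<s>[INST] {query} [/INST]"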
LoRA configuration#
Low-rank adaptation (LoRA) introduces trainable low-rank matrices into specific layers of the base model, significantly reducing the number of trainable parameters and memory requirements.
Configure the LoRA parameters for the base model using the PEFT library:
peft_config = LoraConfig(
lora_alpha=8,
lora_dropout=0.1,
r=8,
bias="none",
task_type="CAUSAL_LM"
)
The lora_alpha
parameter is a scaling factor for the LoRA weights, lora_dropout
helps prevent overfitting, and r
specifies the rank of the low-rank matrices. A common practice is to set alpha equal to, half of, or double the value of r, depending on the model and task.
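Conceptually, LoRA keeps the original weight matrix frozen and learns two small matrices A and B whose product forms a low-rank update, scaled by alpha / r. The following sketch illustrates the idea for a single linear layer; it is illustrative only and not how PEFT implements the adapters internally:
import torch

hidden = 3072            # example dimension, roughly the hidden size of Llama-3.2 3B
r, alpha = 8, 8

W = torch.randn(hidden, hidden)   # frozen pretrained weight
A = torch.randn(r, hidden)        # trainable low-rank factor
B = torch.zeros(hidden, r)        # trainable, starts at zero so the initial update is zero
x = torch.randn(hidden)

# LoRA forward pass: the original projection plus a low-rank update scaled by alpha / r.
y = W @ x + (alpha / r) * (B @ (A @ x))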
GPU performance monitoring#
Monitoring GPU performance is a recommended best practice to ensure your allocated MI300X GPUs are being used efficiently and to help identify potential issues during fine-tuning. While Azure data centers are designed with advanced cooling and monitoring systems, keeping an eye on GPU metrics can help you optimize resource usage, especially during longer or larger training runs.
To monitor GPU performance, use the AMD System Management Interface (AMD SMI) tool:
!amd-smi monitor -putm
Example output when idle:
GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK
0 168 W 43 °C 39 °C 0 % 1592 MHz 0 % 900 MHz
1 161 W 42 °C 36 °C 1 % 1627 MHz 0 % 900 MHz
2 168 W 43 °C 35 °C 1 % 1636 MHz 0 % 900 MHz
3 166 W 42 °C 36 °C 1 % 1680 MHz 0 % 900 MHz
4 168 W 42 °C 35 °C 0 % 1691 MHz 0 % 900 MHz
5 167 W 41 °C 37 °C 0 % 1714 MHz 0 % 900 MHz
6 167 W 41 °C 36 °C 0 % 1730 MHz 0 % 900 MHz
7 166 W 41 °C 36 °C 0 % 1745 MHz 0 % 900 MHz
Configure fine-tuning hyperparameters#
Before starting the fine-tuning process, define the following hyperparameters to control aspects such as batch size and learning rate. The configuration below is a good starting point for the fine-tuning job on 8 MI300X GPUs. Use it as a baseline and adjust the parameters based on your dataset and specific requirements.
training_args = TrainingArguments(
output_dir="./results_lora",
num_train_epochs=5,
per_device_train_batch_size=4,
gradient_accumulation_steps=1,
optim="adamw_torch",
save_steps=50,
logging_steps=50,
learning_rate=4e-5,
weight_decay=0.001,
fp16=False,
bf16=True, # MI300X supports bfloat16 natively
max_grad_norm=0.3,
max_steps=-1,
warmup_ratio=0.03,
group_by_length=True,
lr_scheduler_type="linear",
report_to="tensorboard"
)
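A quick arithmetic check relates these settings to the training output you will see later. Because this notebook runs in a single process and device_map="auto" shards the model across the GPUs rather than replicating it, the effective batch size is simply per_device_train_batch_size multiplied by gradient_accumulation_steps:
examples = 1000        # size of the guanaco-llama2-1k training split
per_device_batch = 4
grad_accum = 1
num_processes = 1      # one training process; the model itself is sharded across the GPUs

effective_batch = per_device_batch * grad_accum * num_processes   # 4
steps_per_epoch = examples // effective_batch                     # 250
total_steps = steps_per_epoch * 5                                 # 1,250 steps over 5 epochs
print(effective_batch, steps_per_epoch, total_steps)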
Start the fine-tuning process#
Now that the model, dataset, and training arguments are configured, you can start the fine-tuning process using the SFTTrainer class from the TRL library.
Run the following code to initialize the supervised fine-tuning (SFT) trainer:
trainer = SFTTrainer(
model=base_model,
train_dataset=training_data,
peft_config=peft_config,
args=training_args
)
Inspect the model within the trainer to see how LoRA has reduced the number of trainable parameters:
trainer.model.print_trainable_parameters()
Expected output:
trainable params: 2,293,760 || all params: 3,215,043,584 || trainable%: 0.0713
This output indicates that out of the roughly 3.2 billion parameters in the Llama-3.2 3B model, approximately 2.3 million are trainable with LoRA, or about 0.0713% of the total. This low percentage demonstrates the efficiency of LoRA: it allows significant model adaptation with minimal additional computational cost.
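As a back-of-the-envelope check of that figure, assume PEFT applies its default LoRA target modules for Llama models (the q_proj and v_proj attention projections) and use the published Llama-3.2 3B dimensions. The layer sizes below come from the model configuration and would change if you targeted additional modules:
hidden_size = 3072    # Llama-3.2 3B hidden dimension
kv_dim = 1024         # 8 key/value heads x head dimension 128
num_layers = 28
r = 8

# Each adapted projection adds r * (in_features + out_features) parameters (the A and B matrices).
q_proj_params = r * (hidden_size + hidden_size)   # 49,152 per layer
v_proj_params = r * (hidden_size + kv_dim)        # 32,768 per layer

total = num_layers * (q_proj_params + v_proj_params)
print(total)   # 2,293,760 trainable parameters, matching the trainer output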
Finally, start the fine-tuning process:
trainer.train()
While training, run the AMD SMI monitor command in a separate terminal to keep track of GPU performance:
amd-smi monitor -putm
Example output:
GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK
0 299 W 59 °C 42 °C 95 % 2066 MHz 1 % 1295 MHz
1 409 W 57 °C 43 °C 100 % 1958 MHz 4 % 1300 MHz
2 393 W 54 °C 41 °C 100 % 1959 MHz 4 % 1300 MHz
3 367 W 57 °C 41 °C 100 % 1982 MHz 3 % 1300 MHz
4 308 W 54 °C 40 °C 100 % 2043 MHz 2 % 1300 MHz
5 298 W 55 °C 42 °C 100 % 2066 MHz 1 % 1300 MHz
6 288 W 55 °C 40 °C 100 % 2084 MHz 0 % 1300 MHz
7 178 W 42 °C 37 °C 0 % 2099 MHz 0 % 900 MHz
As the output shows, most of the GPUs are running at or near 100% utilization, which indicates that the fine-tuning process is making effective use of the MI300X GPUs.
During training, you will see output indicating the progress of the training process.
Example output:
Step Training Loss
50 1.758200
100 1.548800
...
1250 1.452800
TrainOutput(global_step=1250, training_loss=1.4649119018554688, metrics={'train_runtime': 225.7386, 'train_samples_per_second': 22.15, 'train_steps_per_second': 5.537, 'total_flos': 3.163109317926912e+16, 'train_loss': 1.4649119018554688, 'epoch': 5.0})
In this example, the training run completed in under four minutes. The initial steep drop in loss is typical, as the model quickly adapts from its pretrained knowledge to the specific patterns in the small fine-tuning dataset. Once the model has learned the main patterns, there is less new information to absorb, so the loss flattens out. The goal is not to reach zero loss, which would indicate overfitting (memorization), but to teach the model to generalize well to the instruction-following tasks it will encounter in practice.
Save the fine-tuned model#
After the fine-tuning process is complete, save the model:
trainer.model.save_pretrained(new_model_name)
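Because trainer.model is a PEFT model, this call writes only the small LoRA adapter weights to the Llama-3.2-3B-lora directory rather than a full copy of the base model. If you plan to reload the adapter elsewhere, it is convenient to also save the tokenizer alongside it, an optional step:
# Save the tokenizer next to the adapter so the checkpoint is self-contained.
tokenizer.save_pretrained(new_model_name)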
Test the fine-tuned model#
The following code will load the base model, merge it with your LoRA weights, and set up a text generation pipeline.
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
peft_model = PeftModel.from_pretrained(base_model, new_model_name)
peft_model = peft_model.merge_and_unload()
text_gen_pipeline = pipeline(  # use a distinct name so the imported pipeline() helper is not shadowed
"text-generation",
model=peft_model,
tokenizer=tokenizer,
max_length=256,
truncation=True,
device_map="auto"
)
If you see a message like Device set to use cuda:0, it confirms that the model is set up to run on the first GPU. This is expected even on AMD hardware: PyTorch’s ROCm build exposes AMD GPUs through the same cuda device API used by libraries such as Hugging Face Transformers, so this message simply indicates that the GPU is being used correctly.
To generate a response, use:
query = "What do you think is the most important part of building an AI chatbot?"
prompt = f"<s>[INST] {query} [/INST]"
output = text_gen_pipeline(prompt)
print(output[0]['generated_text'])
Notice the use of the <s>, [INST], and [/INST] tokens to format the input prompt, matching the structure of the training data. Also notice that the prompt does not end with a </s> token: leaving the sequence open after [/INST] signals to the model that the instruction is complete and that it should now generate the response.
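The generated_text field returned by the pipeline includes the prompt itself. If you only want the model’s answer, you can strip everything up to the closing [/INST] tag, for example:
# Keep only the text generated after the instruction block.
response = output[0]['generated_text'].split("[/INST]", 1)[-1].strip()
print(response)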
Cleanup (optional)#
To manage costs when the VM is not in use, you can perform the following cleanup steps:
Deallocate (stop) the VM to pause billing for compute resources
Delete the VM and associated resources if they are no longer needed
Conclusion#
In this tutorial, you successfully fine-tuned the Llama-3.2 3B model using LoRA on AMD Instinct MI300X GPUs in Azure. You provisioned a VM, configured a Docker environment, and used the Hugging Face PEFT library to fine-tune the model. You also explored best practices for secure access and GPU monitoring before testing the new model with a sample prompt. With these skills, you are now prepared to experiment with larger datasets, different LoRA configurations, and more thorough model validation.
References#
Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv, June 17, 2021. https://doi.org/10.48550/arXiv.2106.09685.