Deploy Azure CycleCloud Workspace for Slurm with AMD Instinct MI300X#
Introduction#
Azure CycleCloud Workspace for Slurm offers an enterprise-ready solution for creating, configuring, and deploying HPC and AI clusters efficiently in the cloud. Azure CycleCloud is a cloud orchestration and management platform that automates the creation, configuration, and scaling of complex HPC infrastructure on Azure. It handles the provisioning of virtual machines, networking, storage, and other resources required for a cluster.
Slurm, on the other hand, is an open-source workload manager designed to schedule and manage jobs on HPC clusters. While CycleCloud sets up and manages the underlying infrastructure, Slurm is responsible for allocating compute resources, queuing jobs, and monitoring workload execution within the cluster.
In this tutorial, you will learn how to deploy a CycleCloud-managed Slurm cluster on Azure, configure Slurm to support AMD Instinct MI300X GPUs, and run a test job to verify your deployment. By the end, you will have a scalable and robust environment for running HPC and AI workloads on Azure, leveraging both CycleCloud’s automation and Slurm’s workload management capabilities.
Note
The Azure CycleCloud Workspace for Slurm solution template is under active development, and features and deployment steps may change frequently. While we aim to keep this tutorial up-to-date, please refer to the Azure CycleCloud documentation for the latest information as you follow along.
Prerequisites#
SSH Keys: Have an SSH key pair installed on your local machine for secure cloud access.
Azure Setup: Maintain an active Azure account, subscription, and resource group for deploying CycleCloud and Slurm resources.
Permissions: Have the Owner or Contributor role on the subscription to create and manage resources.
vCPU Quota: Have sufficient vCPU quota in the Azure region where you plan to deploy the Slurm cluster.
Command-Line Interpreter: Have Azure CLI 2.74.0 or above installed on your local machine.
Clone Azure CycleCloud Workspace for Slurm#
The Azure CycleCloud Workspace for Slurm GitHub repository contains the necessary scripts and configurations for deploying Slurm clusters on Azure.
Clone the repository to your local machine using the latest release tag:
git clone --branch "2025.04.24" https://github.com/azure/cyclecloud-slurm-workspace.git
Replace 2025.04.24 with the latest release tag if necessary.
Tip
To find the latest release tag, visit the GitHub releases page or run git ls-remote --tags https://github.com/Azure/cyclecloud-slurm-workspace.git.
In the repository, you will find two key resources: (1) a bicep folder containing the Bicep templates for deploying resources, and (2) a uidefinitions folder containing the UI definition file, which helps with customizing the deployment to your specific requirements. In the next steps, we will use these resources to deploy the Slurm cluster.
Create UI definition sandbox#
Open the uidefinitions folder and copy the contents of the createUiDefinition.json file. Open the Create UI Definition Sandbox page in your browser and replace the definition with the contents of the UI definition file. Click Preview in the bottom-left corner to view the deployment configuration form. Proceed through each tab of the user experience to configure the deployment to your requirements.
Tip
You can keep the default values for most fields. However, customize settings such as VM sizes, number of nodes, and other parameters based on your workload requirements.
Step 1: Basics#
Select your Azure subscription, region, and resource group where you want to deploy resources. Choose a CycleCloud VM name and size. For this tutorial, use the default values. The CycleCloud VM is a virtual machine that hosts the CycleCloud portal, which provides a web-based interface for managing your Slurm cluster. Provide an admin username, password, and SSH public key. You can keep the default username or customize it.
Step 2: File-system#
Create a new Azure NetApp Files file system with Premium service level and an appropriate capacity. Azure NetApp Files is a high-performance file storage service that provides low-latency access to data, making it ideal for HPC and AI workloads.
Step 3: Networking#
Select an existing virtual network or create a new one. Enable Create a Bastion for SSH connections to securely connect to Slurm nodes without exposing them through public IP addresses. Deselect Create a NAT Gateway unless you have specific requirements for outbound traffic from the Slurm nodes. Leave the remaining networking settings at their default values.
Step 4: Slurm Settings#
Select Start cluster to automatically start the Slurm cluster after deployment. This ensures the Slurm scheduler and login nodes are ready for use immediately. Configure the scheduler and the login nodes. Keep the default values unless you have specific requirements.
Step 5: Partition Settings#
The CycleCloud Workspace for Slurm supports multiple partitions, including HTC, HPC, and GPU. Focus on the GPU partition for this tutorial.
Under the GPU Partition section, select the ND96isr_MI300X_v5 VM size and specify the maximum number of nodes to deploy. In this tutorial, we are using 2 nodes. For the Image Name field, select Custom image and enter microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701.
Note
Ensure the VM size contains an r in its name (ND96isr_MI300X_v5) for RDMA support.
Tip
To obtain the uniform resource name (URN) for the VM image, use the command az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table.
Step 6: Open OnDemand#
Skip this step as the tutorial does not cover Open OnDemand configuration.
Step 7: Advanced#
Skip this step as the tutorial does not cover advanced settings.
Step 9: Review + create#
Review the summary of your configuration to ensure all settings are correct. Select View outputs payload at the bottom of the screen. Copy the JSON-formatted text displayed on the right-hand side and save it locally as parameters.json.
Create variables#
To simplify repeated commands and reduce errors, create shell variables for common values you will use throughout the deployment process.
Define the following variables:
location=<your-region>
resource_group=<your-resource-group>
ssh_key=<ssh_key_file>
cyclecloud_name=ccw
username=hpcadmin
Replace the placeholder values <your-region>, <your-resource-group>, and <ssh_key_file> with your actual values. Keep the default values for cyclecloud_name and username unless you have specific requirements.
Note
The <ssh_key_file> variable should point to the private SSH key file on your local machine, for example, ~/.ssh/id_rsa on Linux/macOS or C:\Users\<your-username>\.ssh\id_rsa on Windows.
Validate solution files#
Before deploying resources, validate the main template mainTemplate.bicep and the parameters file parameters.json to ensure there are no syntax errors or misconfigurations. Run the validation from the directory containing the mainTemplate.bicep and parameters.json files, or adjust the paths accordingly.
Run the following command to validate the files:
az deployment sub validate \
--location $location \
--template-file mainTemplate.bicep \
--parameters parameters.json \
--verbose > validation_output.json
Open the validation_output.json file and look for "provisioningState": "Succeeded" to confirm that there are no critical errors in your solution files. Examine the diagnostics array for warnings or informational messages about potential issues. Finally, check the "validatedResources" array to ensure all expected resources are listed.
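If you prefer to check the result from the terminal, a simple pattern match against the saved output works; this is a minimal sketch, and the exact JSON layout may vary slightly between Azure CLI versions:
grep -o '"provisioningState": *"[^"]*"' validation_output.json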
Tip
Use the --verbose or --debug flag with Azure CLI commands to get more detailed output for troubleshooting errors or warnings.
Accept Azure CycleCloud image terms#
Before deploying the Slurm cluster using the Azure CycleCloud image, you must accept the terms of use for the image to comply with Azure’s licensing requirements.
Set the active Azure subscription where you want the accepted terms to apply:
az account set -s <subscription-id>
Replace <subscription-id> with your Azure subscription ID.
Tip
You can find your subscription ID by running az account list --output table.
Accept the terms of use for the Azure CycleCloud image:
az vm image terms accept --urn azurecyclecloud:azure-cyclecloud:cyclecloud8-gen2:latest
Deploy CycleCloud Workspace for Slurm#
Deploying the CycleCloud Workspace for Slurm provisions the necessary infrastructure to manage HPC and AI workloads on Azure. This includes creating the CycleCloud VM, scheduler node, login nodes, and other resources required for Slurm cluster management.
The deployment uses a subscription-scoped main Bicep template, mainTemplate.bicep. This means you will run the deployment at the subscription level, but all resources will be provisioned within the resource group specified in the parameters.json file. Ensure you have defined the required shell variables ($location, $resource_group, $cyclecloud_name, etc.) as described above.
Run the following command to deploy the CycleCloud Workspace:
az deployment sub create \
--location $location \
--name $cyclecloud_name \
--template-file mainTemplate.bicep \
--parameters parameters.json
Note
The deployment takes around 10 minutes. You can monitor the progress in the Azure portal by navigating to Settings > Deployments within your resource group.
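If you prefer to monitor from the terminal instead of the portal, you can poll the subscription-level deployment state; a minimal sketch, assuming the deployment name matches the $cyclecloud_name value used above:
az deployment sub show \
  --name $cyclecloud_name \
  --query "properties.provisioningState" \
  --output tsv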
Once the deployment is complete, verify the resources created using the following command:
az resource list --resource-group $resource_group --output table
Check for the following resources in the output:
ccw-cyclecloud-vm: CycleCloud portal VM.
scheduler-<unique-id>: Slurm scheduler node.
login-<unique-id>: VM scale set for Slurm login nodes.
In this tutorial, we only deploy one login node, but you can scale it up later if needed.
Note
Compute nodes are not created automatically during deployment. Instead, these nodes are created using the azslurm CLI tool. We will cover how to do this later in this tutorial.
Install Azure CLI extension for Bastion#
Azure Bastion provides secure SSH and RDP connectivity to your Azure VMs without exposing them to the public internet.
Install the Azure CLI extension for Bastion:
az extension add --name bastion --upgrade
Confirm that the extension was installed successfully by listing all installed extensions:
az extension list --output table
Look for bastion in the output.
Connect to CycleCloud portal#
The CycleCloud portal is a web-based interface for managing Slurm clusters deployed on Azure. It allows you to configure nodes, monitor workloads, and manage resources efficiently.
Connect to the CycleCloud VM using Azure Bastion:
az network bastion tunnel \
--name bastion \
--resource-group $resource_group \
--target-resource-id $(az vm show --resource-group $resource_group --name ccw-cyclecloud-vm --query "id" -o tsv) \
--resource-port 443 \
--port 8443
This command creates a secure tunnel, mapping port 8443 on your local machine to port 443 on ccw-cyclecloud-vm, which is used for HTTPS traffic.
If the connection was successfully established, you should see output similar to the following:
Opening tunnel on port: 8443
Tunnel is ready, connect on port 8443
Ctrl + C to close
Leave this terminal open to keep the tunnel active.
Open a browser and navigate to https://localhost:8443. Log in using the admin username and password provided during deployment.
Note
You may see a browser warning due to the self-signed certificate. You can safely ignore this warning and proceed to the portal.
Connect to Slurm scheduler node#
The Slurm scheduler node runs the Slurm controller daemon (slurmctld) and communicates with the compute nodes to manage job execution. As a system administrator, you may need to connect to the scheduler node for tasks such as configuring partitions, managing nodes, or troubleshooting cluster issues.
Connect to the Slurm scheduler node:
az network bastion ssh \
--name bastion \
--resource-group $resource_group \
--target-resource-id $(az vm list --resource-group $resource_group --query "[?contains(name, 'scheduler')].id" --output tsv) \
--auth-type ssh-key \
--username $username \
--ssh-key $ssh_key
Once connected, you can proceed to configure the Slurm compute partitions in the next steps.
Verify Slurm configuration#
To ensure the Slurm scheduler node is properly configured to manage AMD Instinct MI300X GPUs, verify the gres.conf file on the scheduler node. This file defines Generic RESources (GRES), such as GPUs, for Slurm.
Open the gres.conf configuration file:
sudo nano /etc/slurm/gres.conf
Ensure the following line is present in the file:
Nodename=ccw-gpu-[1-2] Name=gpu Count=8 File=/dev/dri/renderD[128,136,144,152,160,168,176,184]
This configuration specifies that the Slurm scheduler should set up 8 GPUs on each of the compute nodes named ccw-gpu-1 and ccw-gpu-2. The File parameter points to the device files for the MI300X GPUs, which Slurm will use to manage GPU resources.
If the line is missing, add it to the file and save your changes.
To apply the changes, you need to restart the Slurm controller daemon slurmctld, which is covered below.
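As an additional sanity check once a GPU node is up (node creation is covered later in this tutorial), you can confirm that the render device files referenced in gres.conf exist on that node; a minimal sketch, assuming you can SSH from the scheduler to ccw-gpu-1:
ssh ccw-gpu-1 'ls -l /dev/dri/renderD*'
The command should list eight render devices matching the File entry above.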
Disable automatic suspension of idle nodes#
In certain scenarios, such as during development, testing, or learning phases, you may want to disable the automatic suspension of idle nodes. This ensures that nodes remain active without interruption. However, in production environments, enabling automatic suspension is recommended to optimize resource usage and reduce costs.
Open the slurm.conf file on the scheduler node using a text editor:
sudo nano /etc/slurm/slurm.conf
Append the following line to the end of the file:
SuspendTime=-1
Setting the suspension time to -1 disables automatic suspension, so nodes remain running regardless of how long they stay idle.
To apply the changes, you need to restart the Slurm controller daemon, which is covered in the next section.
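After the restart, you can confirm that the scheduler picked up the new value; for example:
scontrol show config | grep -i suspendtime
The output should show a SuspendTime value that indicates suspension is disabled (displayed as NONE or -1, depending on the Slurm version).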
The azslurm CLI#
The azslurm CLI is a specialized command-line tool designed to simplify the management of Slurm clusters on Azure.
Apply configuration changes using azslurm#
If you made any changes to the Slurm configuration files, such as gres.conf or slurm.conf, you need to apply these changes for them to take effect. The azslurm CLI tool simplifies this process by automatically updating the necessary configuration files and restarting the Slurm controller daemon.
On the scheduler node, switch to the root user:
sudo -i
As root, run the following command to apply the changes:
azslurm scale
This command updates the Slurm configuration files and ensures the Slurm scheduler is aware of the MI300X GPUs on the compute nodes. The azslurm CLI tool handles all necessary steps, including updating the azure.conf file with the correct GPU configuration and restarting the slurmctld service.
To leave the root shell, run logout.
Create GPU nodes using azslurm#
While logged in to the Slurm scheduler node as hpcadmin, run the following command:
sudo azslurm resume --node-list=ccw-gpu-[1-2]
This command creates two GPU nodes named ccw-gpu-1 and ccw-gpu-2, each with 8 MI300X GPUs. In your resource group in the Azure portal, you should see the new GPU nodes created as a new VM scale set with a name starting with gpu-. You can also verify the creation of the GPU nodes in the CycleCloud portal.
Note
Allow a few minutes for the nodes to be provisioned and initialized.
To confirm the nodes are created successfully, run:
sinfo # or scontrol show nodes
Example output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dynamic up infinite 0 n/a
gpu up infinite 2 idle ccw-gpu-[1-2]
hpc* up infinite 0 n/a
htc up infinite 0 n/a
The suffix * indicates the default partition, which is set to hpc. The gpu partition shows that both GPU nodes are available and idle.
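You can also confirm that the GPU GRES is registered on a node; a minimal sketch:
scontrol show node ccw-gpu-1 | grep -i gres
The Gres field in the output should reflect the eight GPUs defined in gres.conf.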
Delete GPU nodes using azslurm#
If you need to delete the GPU nodes, for example, to free up resources or reconfigure the cluster, you can suspend the nodes.
To delete the GPU nodes, run the following command from the Slurm scheduler node:
sudo azslurm suspend --node-list=ccw-gpu-[1-2]
This command deletes both nodes from the Slurm cluster and the VM scale set. You can verify that the nodes are no longer listed in the CycleCloud portal or in the Azure portal under the gpu- VM scale set.
Add users to video and render groups#
To ensure users can access the AMD Instinct MI300X GPUs, they must belong to the supplementary groups video and render on all GPU nodes. Follow the steps below to add users to these groups.
On the Slurm scheduler node, run the following:
for i in ccw-scheduler $(scontrol show hostname "ccw-gpu-[1-2]"); do
    for user in hpcadmin; do
        echo "Adding $user to video and render groups on $i..."
        if ssh $i "sudo usermod -aG video,render $user"; then
            echo "Successfully added $user to video and render groups on $i"
        else
            echo "Failed to add $user to video and render groups on $i"
        fi
    done
done
Confirm that hpcadmin has been added to the video and render groups on each node:
groups hpcadmin
Example output:
hpcadmin : hpcadmin video render cyclecloud
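To check every node in one pass instead of logging in to each, you can reuse the same SSH loop pattern from the scheduler node; a minimal sketch:
for i in ccw-scheduler $(scontrol show hostname "ccw-gpu-[1-2]"); do
    echo "Groups for hpcadmin on $i:"
    ssh $i "groups hpcadmin"
done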
Tip
Ensure all listed users exist on each node before adding them to the video and render groups. Refer to the Azure CycleCloud Documentation for more details on user management in CycleCloud environments.
CycleCloud CLI#
The CycleCloud CLI allows you to interact with the CycleCloud portal and perform various operations on your Slurm environment using a command-line interface on your local machine. Although this tutorial does not focus on the CycleCloud CLI, it is presented here to provide a brief overview of its installation and configuration.
Install CycleCloud CLI#
Download the CycleCloud CLI from the CycleCloud portal by navigating to the About page:
https://localhost:8443/about
Alternatively, download the CLI using the terminal:
curl --insecure --remote-name https://localhost:8443/static/tools/cyclecloud-cli.zip
After downloading, unzip the file to extract the CycleCloud CLI installer and follow the instructions in the README file to install it.
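As an illustration of what the installation typically looks like on Linux (the archive layout and script name below are assumptions, so defer to the bundled README if they differ):
unzip cyclecloud-cli.zip
cd cyclecloud-cli-installer
./install.sh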
Once complete, verify the installation:
cyclecloud --version
If the command is not recognized, restart your terminal to refresh environment variables.
Configure CycleCloud CLI#
Once the CycleCloud CLI is installed, you need to configure it to connect to your Azure CycleCloud server URL.
Initialize CycleCloud with the following command:
cyclecloud initialize
Enter the CycleCloud server URL (https://localhost:8443), accept the untrusted certificate warning, and provide your CycleCloud username and password when prompted.
Verify the configuration:
cyclecloud show_cluster ccw
Example output:
-------------
ccw : started
-------------
Resource group: <your-resource-group>
Cluster nodes:
scheduler: Started ccw-scheduler 10.0.0.132 scheduler-<unique-id>
Cluster node arrays:
login: 1 instances, 4 cores, Started
gpu: 2 instances, 192 cores, Started
Total nodes: 4
This command retrieves information about your Slurm cluster, including the resource group, cluster nodes, cluster node arrays, and their statuses. Cluster node arrays represent groups of nodes organized for a specific purpose, such as login nodes or compute nodes. In the above output, there are two node arrays: login and gpu. Both arrays are started, indicating that they are running and available for use.
Submit a test job to Slurm#
Once your Slurm cluster is deployed and configured, you can submit a test job to verify that the compute nodes and GPUs are functioning correctly. In this tutorial, we will use the all_reduce_perf test from the RCCL (ROCm Collective Communication Library) benchmarking tools. All-reduce is a collective communication operation used in parallel computing, in which multiple GPUs work together to perform a reduction (in our example, summing numbers) across all GPUs. The test reports various metrics, helping you ensure that the MI300X GPUs and RDMA-enabled networking are configured correctly for high-performance workloads.
Connect to the login node#
To submit a test job, connect to the Slurm login node using Azure Bastion. First, retrieve the VM scale set name for the login node:
login_vmss_name=$(az vmss list \
--resource-group $resource_group \
--query "[?starts_with(name, 'login-')].name" \
--output tsv)
Next, retrieve the resource ID of the login node:
login_node_id=$(az vmss list-instances \
--resource-group $resource_group \
--name $login_vmss_name \
--query "[0].id" \
--output tsv)
Connect to the login node:
az network bastion ssh \
--name bastion \
--resource-group $resource_group \
--target-resource-id $login_node_id \
--auth-type ssh-key \
--username $username \
--ssh-key $ssh_key
Create a job script#
On the login node, create a job script to test the cluster’s functionality:
sudo nano test.sh
Paste the following content into the file:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module load mpi/hpcx
export LD_LIBRARY_PATH=/opt/rccl/lib:$LD_LIBRARY_PATH \
OMPI_MCA_coll_hcoll_enable=0 \
OMPI_MCA_plm_rsh_no_tree_spawn=1 \
OMPI_MCA_plm_rsh_num_concurrent=800 \
NCCL_IB_PCI_RELAXED_ORDERING=1 \
NCCL_IB_HCA=^mlx5_an0 \
NCCL_MIN_NCHANNELS=112
srun --mpi=pmix /opt/rccl-tests/all_reduce_perf -b 1 -e 8G -f 2 -g 1 -O 0
Save and close the file.
The lines beginning with #SBATCH are Slurm directives that specify various options for Slurm:
--job-name: Sets the name of the job.
--partition: Specifies the GPU partition to use for the job.
--gpus-per-node: Specifies the number of GPUs to allocate per node.
--ntasks-per-node: Allocates eight tasks per node, matching the number of GPUs.
--output and --error: Output and error log files for the job, where %x is the job name and %j is the job allocation number.
The module command loads the HPC-X MPI module, which is necessary for running MPI jobs on the Slurm cluster.
The export command sets seven environment variables to configure the Slurm job environment, including the library path for RCCL and various performance tuning parameters for large-scale computations across multiple GPUs.
The srun command executes the all_reduce_perf test to measure the performance of the all-reduce operation across multiple GPUs. This job uses the PMIx plugin, which enables efficient communication between tasks using MPI (Message Passing Interface), a standard for parallel computing. In this setup, each task is assigned to one GPU, and MPI handles the coordination between them.
The command parameters define the following:
-b 1: Start with a message size of 1 byte.
-e 8G: End with a message size of 8 gigabytes.
-f 2: Use a multiplication factor of 2 between sizes.
-g 1: Each MPI task uses 1 GPU. With 8 tasks per node, all GPUs are used concurrently.
-O 0: Use in-place operations (out-of-place operations are omitted for simplicity).
This setup allows the same script to scale from a single node (8 GPUs) to multiple nodes (for example, 2 nodes with 16 GPUs) by adjusting the --nodes parameter when submitting the job.
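Before submitting the batch job, you can optionally open an interactive shell on a GPU node to confirm the GPUs are visible to Slurm and ROCm; a minimal sketch, assuming the ROCm tools are available on the PATH in the chosen image:
srun --partition=gpu --nodes=1 --gpus-per-node=8 --pty bash
rocm-smi    # should list 8 AMD Instinct MI300X GPUs
exit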
Submit the job#
Submit the job to the Slurm scheduler on 1 node:
sbatch --nodes=1 test.sh
You can experiment with setting the --nodes parameter to run the job on multiple nodes and see how the performance scales with more GPUs.
Monitor the job status using:
squeue
Example output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 gpu test hpcadmin R 0:03 1 ccw-gpu-1
Tip
If you encounter issues with job submission (e.g., jobs stuck in the PD (pending) state with messages about nodes being down or drained), try running sudo systemctl restart slurmd on each GPU node. This ensures the Slurm daemon is running and the node can communicate with the controller.
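To see why Slurm marked a node as down or drained in the first place, query the reason field; for example:
sinfo -R
scontrol show node ccw-gpu-1 | grep -i reason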
Once the job completes, review the output file:
cat test_<job-ID>.out
Example output:
Loading mpi/hpcx
Loading requirement:
/opt/hpcx-v2.18-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/modulefiles/hpcx
# nThread 1 nGpus 1 minBytes 1 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
# Rank 0 Pid 47406 on ccw-gpu-1 device 0 [0002:00:00.0] AMD Instinct MI300X VF
# Rank 1 Pid 47407 on ccw-gpu-1 device 1 [0003:00:00.0] AMD Instinct MI300X VF
# Rank 2 Pid 47408 on ccw-gpu-1 device 2 [0004:00:00.0] AMD Instinct MI300X VF
# Rank 3 Pid 47409 on ccw-gpu-1 device 3 [0005:00:00.0] AMD Instinct MI300X VF
# Rank 4 Pid 47410 on ccw-gpu-1 device 4 [0006:00:00.0] AMD Instinct MI300X VF
# Rank 5 Pid 47411 on ccw-gpu-1 device 5 [0007:00:00.0] AMD Instinct MI300X VF
# Rank 6 Pid 47412 on ccw-gpu-1 device 6 [0008:00:00.0] AMD Instinct MI300X VF
# Rank 7 Pid 47413 on ccw-gpu-1 device 7 [0009:00:00.0] AMD Instinct MI300X VF
#
# in-place
# size count type redop root time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s)
0 0 float sum -1 0.21 0.00 0.00 0
0 0 float sum -1 0.18 0.00 0.00 0
4 1 float sum -1 33.33 0.00 0.00 0
8 2 float sum -1 32.14 0.00 0.00 0
16 4 float sum -1 32.81 0.00 0.00 0
32 8 float sum -1 33.40 0.00 0.00 0
64 16 float sum -1 32.16 0.00 0.00 0
128 32 float sum -1 27.86 0.00 0.01 0
256 64 float sum -1 29.96 0.01 0.01 0
512 128 float sum -1 30.07 0.02 0.03 0
1024 256 float sum -1 29.07 0.04 0.06 0
2048 512 float sum -1 28.87 0.07 0.12 0
4096 1024 float sum -1 29.00 0.14 0.25 0
8192 2048 float sum -1 21.38 0.38 0.67 0
16384 4096 float sum -1 29.35 0.56 0.98 0
32768 8192 float sum -1 31.76 1.03 1.81 0
65536 16384 float sum -1 30.95 2.12 3.71 0
131072 32768 float sum -1 33.54 3.91 6.84 0
262144 65536 float sum -1 35.83 7.32 12.80 0
524288 131072 float sum -1 39.05 13.43 23.49 0
1048576 262144 float sum -1 43.69 24.00 42.00 0
2097152 524288 float sum -1 50.31 41.69 72.95 0
4194304 1048576 float sum -1 65.68 63.86 111.76 0
8388608 2097152 float sum -1 106.1 79.04 138.32 0
16777216 4194304 float sum -1 158.4 105.89 185.31 0
33554432 8388608 float sum -1 299.9 111.90 195.83 0
67108864 16777216 float sum -1 456.5 147.00 257.25 0
134217728 33554432 float sum -1 802.9 167.17 292.55 0
268435456 67108864 float sum -1 1536.7 174.68 305.70 0
536870912 134217728 float sum -1 3040.1 176.59 309.04 0
1073741824 268435456 float sum -1 6080.6 176.58 309.02 0
2147483648 536870912 float sum -1 12112 177.31 310.29 0
4294967296 1073741824 float sum -1 24307 176.70 309.22 0
8589934592 2147483648 float sum -1 48607 176.72 309.26 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 94.0969
#
In the output, confirm that all GPUs are properly detected (ranks 0-7) and identified as AMD Instinct MI300X GPUs. The main table shows performance at different data sizes (from 1 byte to 8 GB). The algbw (algorithm bandwidth) column shows the achieved bandwidth in GB/s, while the busbw (bus bandwidth) column shows bandwidth adjusted to reflect hardware utilization so it can be compared with the theoretical maximum bandwidth of the hardware.
Notice how the performance (bandwidth utilization) changes as the data size increases:
For small message sizes, bandwidth is relatively low. This is expected, as fixed launch and synchronization overheads dominate the transfer time for small messages.
As the message size increases, performance improves significantly. Larger messages allow the communication libraries and hardware to better utilize available bandwidth, resulting in higher throughput.
For large message sizes, performance plateaus near the maximum achievable bandwidth of the MI300X GPUs. In this case, the busbw column in the output shows values approaching 310 GB/s, which reflects the peak interconnect performance of the system.
This trend is typical for GPU-based collective communication benchmarks and indicates that the system is functioning and scaling as expected.
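If you rerun the test at different node counts, a quick way to compare runs is to pull the summary line from each output file; a minimal sketch, assuming the files follow the test_<job-ID>.out naming used above:
grep "Avg bus bandwidth" test_*.out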
Troubleshooting#
If you encounter issues during the deployment, check the following:
Azure subscription: Ensure you have the correct Azure subscription selected. You can check your current subscription with az account show.
Verify resource quotas: Ensure that your subscription has sufficient resources (see the example after this list).
Check role assignment: Verify you have a role that permits subscription-level deployments. You can check your role assignments with az role assignment list --assignee <your-email> --output table.
Check activity logs: Check the activity logs in Azure and CycleCloud to help identify issues.
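To check regional vCPU quotas from the CLI, as referenced in the list above, list the usage for your deployment region and review the VM family that corresponds to the ND MI300X v5 size:
az vm list-usage --location $location --output table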
Clean up (optional)#
To manage costs when the Slurm cluster is not in use, consider the following options:
Enable automatic suspension of idle GPU nodes to reduce costs when the cluster is not in use.
Delete the Slurm cluster resources if they are no longer needed.
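If you deployed everything into a dedicated resource group and no longer need any of it, the simplest teardown is to delete that resource group. Note that this removes every resource in the group, including the CycleCloud VM, networking, and storage, so only do this if nothing else shares the group:
az group delete --name $resource_group --yes --no-wait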
Conclusion#
You have successfully deployed an Azure CycleCloud Workspace for Slurm with AMD Instinct MI300X GPUs. This setup provides a robust environment for running HPC and AI workloads on Azure, leveraging the power of Slurm for workload management and the high performance of AMD Instinct MI300X GPUs for compute-intensive tasks.