Deploy Azure CycleCloud Workspace for Slurm with AMD Instinct MI300X#

Introduction#

Azure CycleCloud Workspace for Slurm offers an enterprise-ready solution for creating, configuring, and deploying HPC and AI clusters efficiently in the cloud. Azure CycleCloud is a cloud orchestration and management platform that automates the creation, configuration, and scaling of complex HPC infrastructure on Azure. It handles the provisioning of virtual machines, networking, storage, and other resources required for a cluster.

Slurm, on the other hand, is an open-source workload manager designed to schedule and manage jobs on HPC clusters. While CycleCloud sets up and manages the underlying infrastructure, Slurm is responsible for allocating compute resources, queuing jobs, and monitoring workload execution within the cluster.

In this tutorial, you will learn how to deploy a CycleCloud-managed Slurm cluster on Azure, configure Slurm to support AMD Instinct MI300X GPUs, and run a test job to verify your deployment. By the end, you will have a scalable and robust environment for running HPC and AI workloads on Azure, leveraging both CycleCloud’s automation and Slurm’s workload management capabilities.

Note

The Azure CycleCloud Workspace for Slurm solution template is under active development, and features and deployment steps may change frequently. While we aim to keep this tutorial up-to-date, please refer to the Azure CycleCloud documentation for the latest information as you follow along.

Prerequisites#

  • SSH Keys: Have an SSH key pair installed on your local machine for secure cloud access.

  • Azure Setup: Maintain an active Azure account, subscription, and resource group for deploying CycleCloud and Slurm resources.

  • Permissions: Have the Owner or Contributor role on the subscription to create and manage resources.

  • vCPU Quota: Have sufficient vCPU quota in the Azure region where you plan to deploy the Slurm cluster.

  • Azure CLI: Have Azure CLI version 2.74.0 or above installed on your local machine.

Clone Azure CycleCloud Workspace for Slurm#

The Azure CycleCloud Workspace for Slurm GitHub repository contains the necessary scripts and configurations for deploying Slurm clusters on Azure.

Clone the repository to your local machine using the latest release tag:

git clone --branch "2025.04.24" https://github.com/azure/cyclecloud-slurm-workspace.git

Replace 2025.04.24 with the latest release tag if necessary.

Tip

To find the latest release tag, visit the GitHub releases page or run git ls-remote --tags https://github.com/Azure/cyclecloud-slurm-workspace.git.

In the repository, you will find two key resources: (1) a bicep folder containing the Bicep templates for deploying resources, and (2) a uidefinitions folder containing the UI definition file, which helps with customizing the deployment to your specific requirements. In the next steps, we will use these resources to deploy the Slurm cluster.

Create UI definition sandbox#

Open the uidefinitions folder and copy the contents of the createUiDefinition.json file. Open the Create UI Definition Sandbox page in your browser and replace the definition with the contents of the UI definition file. Click Preview in the bottom-left corner to view the deployment configuration form. Proceed through each tab of the user experience to configure the deployment to your requirements.

Tip

You can keep the default values for most fields. However, customize settings such as VM sizes, number of nodes, and other parameters based on your workload requirements.

Step 1: Basics#

Select your Azure subscription, region, and resource group where you want to deploy resources. Choose a CycleCloud VM name and size. For this tutorial, use the default values. The CycleCloud VM is a virtual machine that hosts the CycleCloud portal, which provides a web-based interface for managing your Slurm cluster. Provide an admin username, password, and SSH public key. You can keep the default username or customize it.

Step 2: File-system#

Create a new Azure NetApp Files file system with Premium service level and an appropriate capacity. Azure NetApp Files is a high-performance file storage service that provides low-latency access to data, making it ideal for HPC and AI workloads.

Step 3: Networking#

Select an existing virtual network or create a new one. Enable Create a Bastion for SSH connections to securely connect to Slurm nodes without exposing them through public IP addresses. Deselect Create a NAT Gateway unless you have specific requirements for outbound traffic from the Slurm nodes. Leave the remaining networking settings at their default values.

Step 4: Slurm Settings#

Select Start cluster to automatically start the Slurm cluster after deployment. This ensures the Slurm scheduler and login nodes are ready for use immediately. Configure the scheduler and the login nodes. Keep the default values unless you have specific requirements.

Step 5: Partition Settings#

The CycleCloud Workspace for Slurm supports multiple partitions, including HTC, HPC, and GPU. Focus on the GPU partition for this tutorial.

Under the GPU Partition section, select the ND96isr_MI300X_v5 VM size and specify the maximum number of nodes to deploy. In this tutorial, we are using 2 nodes. For the Image Name field, select Custom image and enter microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701.

Note

Ensure the VM size contains an r in its name (ND96isr_MI300X_v5) for RDMA support.

Tip

To obtain the uniform resource name (URN) for the VM image, use the command az vm image list --publisher microsoft-dsvm --sku 2204-rocm --all --output table.

Step 6: Open OnDemand#

Skip this step as the tutorial does not cover Open OnDemand configuration.

Step 7: Advanced#

Skip this step as the tutorial does not cover advanced settings.

Step 8: Tags (optional)#

Add standard tags such as Environment, CostCenter, or Owner to organize and manage your resources effectively.

Step 9: Review + create#

Review the summary of your configuration to ensure all settings are correct. Select View outputs payload at the bottom of the screen. Copy the JSON-formatted text displayed on the right-hand side and save it locally as parameters.json.

Create variables#

To simplify repeated commands and reduce errors, create shell variables for common values you will use throughout the deployment process.

Define the following variables:

location=<your-region>
resource_group=<your-resource-group>
ssh_key=<ssh_key_file>
cyclecloud_name=ccw
username=hpcadmin

Replace the placeholder values <your-region>, <your-resource-group>, and <ssh_key_file> with your actual values. Keep the default values for cyclecloud_name and username unless you have specific requirements.
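
For example, a deployment in the East US region might use values like the following (illustrative only; substitute your own):

location=eastus
resource_group=my-ccw-rg
ssh_key=~/.ssh/id_rsa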

Note

The <ssh_key_file> should point to the private SSH key file on your local machine, for example, ~/.ssh/id_rsa on Linux/macOS or C:\Users\<your-username>\.ssh\id_rsa on Windows.

Validate solution files#

Before deploying resources, validate the main template mainTemplate.bicep and parameters file parameters.json to ensure there are no syntax errors or misconfigurations. Ensure you are positioned in the directory containing the mainTemplate.bicep and parameters.json files when you run the command, or adjust the paths accordingly.

Run the following command to validate the files:

az deployment sub validate \
  --location $location \
  --template-file mainTemplate.bicep \
  --parameters parameters.json \
  --verbose > validation_output.json

Open the validation_output.json file and look for "provisioningState": "Succeeded" to confirm that there are no critical errors in your solution files. Examine the diagnostics array for warnings or informational messages for potential issues. Finally, check the "validatedResources" array to ensure all expected resources are listed.
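
To surface these fields quickly without opening the file, a simple grep works (the exact spacing may vary slightly between CLI versions):

grep -o '"provisioningState": "[^"]*"' validation_output.json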

Tip

Use the --verbose or --debug flag with Azure CLI commands to get more detailed output when troubleshooting errors or warnings.

Accept Azure CycleCloud image terms#

Before deploying the Slurm cluster using the Azure CycleCloud image, you must accept the terms of use for the image to comply with Azure’s licensing requirements.

Set the active Azure subscription where you want the accepted terms to apply:

az account set -s <subscription-id>

Replace <subscription-id> with your Azure subscription ID.

Tip

You can find your subscription ID by running az account list --output table.

Accept the terms of use for the Azure CycleCloud image:

az vm image terms accept --urn azurecyclecloud:azure-cyclecloud:cyclecloud8-gen2:latest

Deploy CycleCloud Workspace for Slurm#

Deploying the CycleCloud Workspace for Slurm provisions the necessary infrastructure to manage HPC and AI workloads on Azure. This includes creating the CycleCloud VM, scheduler node, login nodes, and other resources required for Slurm cluster management.

The deployment uses a subscription-scoped main Bicep template mainTemplate.bicep. This means you will run the deployment at the subscription level, but all resources will be provisioned within the resource group specified in the parameters.json file. Ensure you have defined the required shell variables ($location, $resource_group, $cyclecloud_name, etc.) as described above.

Run the following command to deploy the CycleCloud Workspace:

az deployment sub create \
  --location $location \
  --name $cyclecloud_name \
  --template-file mainTemplate.bicep \
  --parameters parameters.json

Note

The deployment takes around 10 minutes. You can monitor the progress in the Azure portal by navigating to Settings > Deployments within your resource group.
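
You can also poll the provisioning state from the CLI, using the deployment name set earlier:

az deployment sub show \
  --name $cyclecloud_name \
  --query properties.provisioningState \
  --output tsv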

Once the deployment is complete, verify the resources created using the following command:

az resource list --resource-group $resource_group --output table

Check for the following resources in the output:

  • ccw-cyclecloud-vm: CycleCloud portal VM.

  • scheduler-<unique-id>: Slurm scheduler node.

  • login-<unique-id>: VM scale set for Slurm login nodes.

In this tutorial, we only deploy one login node, but you can scale it up later if needed.

Note

Compute nodes are not created automatically during deployment. Instead, these nodes are created using the azslurm CLI tool. We will cover how to do this later in this tutorial.

Install Azure CLI extension for Bastion#

Azure Bastion provides secure SSH and RDP connectivity to your Azure VMs without exposing them to the public internet.

Install the Azure CLI extension for Bastion:

az extension add --name bastion --upgrade

Confirm that the extension was installed successfully by listing all installed extensions:

az extension list --output table

Look for bastion in the output.
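
Alternatively, query the extension directly; the command prints the installed version:

az extension show --name bastion --query version --output tsv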

Connect to CycleCloud portal#

The CycleCloud portal is a web-based interface for managing Slurm clusters deployed on Azure. It allows you to configure nodes, monitor workloads, and manage resources efficiently.

Connect to the CycleCloud VM using Azure Bastion:

az network bastion tunnel \
  --name bastion \
  --resource-group $resource_group \
  --target-resource-id $(az vm show --resource-group $resource_group --name ccw-cyclecloud-vm --query "id" -o tsv) \
  --resource-port 443 \
  --port 8443

This command creates a secure tunnel, mapping port 8443 on your local machine to port 443 on ccw-cyclecloud-vm, which is used for HTTPS traffic.

If the connection was successfully established, you should see output similar to the following:

Opening tunnel on port: 8443
Tunnel is ready, connect on port 8443
Ctrl + C to close

Leave this terminal open to keep the tunnel active.

Open a browser and navigate to https://localhost:8443. Log in using the admin username and password provided during deployment.

Note

You may see a browser warning due to the self-signed certificate. You can safely ignore this warning and proceed to the portal.

Connect to Slurm scheduler node#

The Slurm scheduler node runs the Slurm controller daemon (slurmctld) and communicates with the compute nodes to manage job execution. As a system administrator, you may need to connect to the scheduler node for tasks such as configuring partitions, managing nodes, or troubleshooting cluster issues.

Connect to the Slurm scheduler node:

az network bastion ssh \
  --name bastion \
  --resource-group $resource_group \
  --target-resource-id $(az vm list --resource-group $resource_group --query "[?contains(name, 'scheduler')].id" --output tsv) \
  --auth-type ssh-key \
  --username $username \
  --ssh-key $ssh_key

Once connected, you can proceed to configure the Slurm compute partitions in the next steps.

Verify Slurm configuration#

To ensure the Slurm scheduler node is properly configured to manage AMD Instinct MI300X GPUs, verify the gres.conf file on the scheduler node. This file defines Generic RESources (GRES), such as GPUs, for Slurm.

Open the gres.conf configuration file:

sudo nano /etc/slurm/gres.conf

Ensure the following line is present in the file:

Nodename=ccw-gpu-[1-2] Name=gpu Count=8 File=/dev/dri/renderD[128,136,144,152,160,168,176,184]

This configuration tells Slurm that each of the compute nodes ccw-gpu-1 and ccw-gpu-2 provides 8 GPUs. The File parameter points to the render device files for the MI300X GPUs, which Slurm uses to manage and allocate GPU resources.

If the line is missing, add it to the file and save your changes.

To apply the changes, you need to restart the Slurm controller daemon slurmctld, which is covered below.
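
As a sanity check of the File paths, once the GPU nodes are running (node creation is covered later in this tutorial), you can list the render device files on a node from the scheduler:

ssh ccw-gpu-1 "ls -l /dev/dri/renderD*"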

Disable automatic suspension of idle nodes#

In certain scenarios, such as during development, testing, or learning phases, you may want to disable the automatic suspension of idle nodes. This ensures that nodes remain active without interruption. However, in production environments, enabling automatic suspension is recommended to optimize resource usage and reduce costs.

Open the slurm.conf file on the scheduler node using a text editor:

sudo nano /etc/slurm/slurm.conf

Append the following line to the end of the file:

SuspendTime=-1

This setting disables the automatic suspension of idle nodes by setting the suspension time to -1, which means that nodes will not be suspended automatically regardless of their idle state.

To apply the changes, you need to restart the Slurm controller daemon, which is covered in the next section.
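
After the restart, you can confirm the setting took effect; depending on the Slurm version, the value is reported as -1 or NONE:

scontrol show config | grep -i suspendtime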

The azslurm CLI#

The azslurm CLI is a specialized command-line tool designed to simplify the management of Slurm clusters on Azure.

Apply configuration changes using azslurm#

If you made any changes to the Slurm configuration files, such as gres.conf or slurm.conf, you need to apply these changes for them to take effect. The azslurm CLI tool simplifies this process by automatically updating the necessary configuration files and restarting the Slurm controller daemon.

On the scheduler node, switch to the root user:

sudo -i

As root, run the following command to apply the changes:

azslurm scale

This command updates the Slurm configuration files and ensures the Slurm scheduler is aware of the MI300X GPUs on the compute nodes. The azslurm CLI tool handles all necessary steps, including updating the azure.conf file with the correct GPU configuration and restarting the slurmctld service.

To leave the root shell, run logout.

Create GPU nodes using azslurm#

While logged in to the Slurm scheduler node as hpcadmin, run the following command:

sudo azslurm resume --node-list=ccw-gpu-[1-2]

This command creates two GPU nodes named ccw-gpu-1 and ccw-gpu-2, each with 8 MI300X GPUs. In your resource group in the Azure portal, you should see the new GPU nodes created as a new VM scale set with a name starting with gpu-. You can also verify the creation of the GPU nodes in the CycleCloud portal.
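
From your local machine, you can confirm the new scale set with a filtered query:

az vmss list \
  --resource-group $resource_group \
  --query "[?starts_with(name, 'gpu-')].name" \
  --output table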

Note

Allow a few minutes for the nodes to be provisioned and initialized.

To confirm the nodes are created successfully, run:

sinfo  # or scontrol show nodes

Example output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dynamic      up   infinite      0    n/a
gpu          up   infinite      2   idle ccw-gpu-[1-2]
hpc*         up   infinite      0    n/a
htc          up   infinite      0    n/a

The suffix * indicates the default partition, which is set to hpc. The gpu partition shows that both GPU nodes are available and idle.
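
To verify that the GRES configuration from gres.conf was applied, inspect one of the node records; assuming the configuration shown earlier, the output should include a Gres=gpu:8 entry:

scontrol show node ccw-gpu-1 | grep -i gres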

Delete GPU nodes using azslurm#

If you need to delete the GPU nodes, for example, to free up resources or reconfigure the cluster, you can suspend the nodes.

To delete the GPU nodes, run the following command from the Slurm scheduler node:

sudo azslurm suspend --node-list=ccw-gpu-[1-2]

This command deletes both nodes from the Slurm cluster and the VM scale set. You can verify that the nodes are no longer listed in the CycleCloud portal or in the Azure portal under the gpu- VM scale set.

Add users to video and render groups#

To ensure users can access the AMD Instinct MI300X GPUs, they must belong to the supplementary groups video and render on all GPU nodes. Follow the steps below to add users to these groups.

On the Slurm scheduler node, run the following:

# Loop over the scheduler and both GPU nodes (scontrol expands the hostlist expression)
for i in ccw-scheduler $(scontrol show hostnames ccw-gpu-[1-2]); do
  # Add each listed user to the video and render groups on node $i
  for user in hpcadmin; do
    echo "Adding $user to video and render groups on $i..."
    if ssh $i "sudo usermod -aG video,render $user"; then
      echo "Successfully added $user to video and render groups on $i"
    else
      echo "Failed to add $user to video and render groups on $i"
    fi
  done
done

Confirm that hpcadmin has been added to the video and render groups (run this on each node; the new group membership takes effect at the user's next login):

groups hpcadmin

Example output:

hpcadmin : hpcadmin video render cyclecloud

Tip

Ensure all listed users exist on each node before adding them to the video and render groups. Refer to the Azure CycleCloud Documentation for more details on user management in CycleCloud environments.

CycleCloud CLI#

The CycleCloud CLI allows you to interact with the CycleCloud portal and perform various operations on your Slurm environment using a command-line interface on your local machine. Although this tutorial does not focus on the CycleCloud CLI, it is presented here to provide a brief overview of its installation and configuration.

Install CycleCloud CLI#

Download the CycleCloud CLI from the CycleCloud portal by navigating to the About page:

https://localhost:8443/about

Alternatively, download the CLI using the terminal:

curl --insecure --remote-name https://localhost:8443/static/tools/cyclecloud-cli.zip

After downloading, unzip the file to extract the CycleCloud CLI executable and follow the instructions in the README file to install it.

Once complete, verify the installation:

cyclecloud --version

If the command is not recognized, restart your terminal to refresh environment variables.

Configure CycleCloud CLI#

Once the CycleCloud CLI is installed, you need to configure it to connect to your Azure CycleCloud server URL.

Initialize CycleCloud with the following command:

cyclecloud initialize

Enter the CycleCloud server URL (https://localhost:8443), accept the untrusted certificate warning, and provide your CycleCloud username and password when prompted.

Verify the configuration:

cyclecloud show_cluster ccw

Example output:

-------------
ccw : started
-------------
Resource group: <your-resource-group>
Cluster nodes:
    scheduler: Started ccw-scheduler 10.0.0.132 scheduler-<unique-id>
Cluster node arrays:
     login: 1 instances, 4 cores, Started
     gpu:   2 instances, 192 cores, Started
Total nodes: 4

This command retrieves information about your Slurm cluster, including the resource group, cluster nodes, cluster node arrays, and their statuses. Cluster node arrays represent groups of nodes organized for a specific purpose, such as login nodes or compute nodes. In the above output, there are two node arrays: login and gpu. Both arrays are started, indicating that they are running and available for use.

Submit a test job to Slurm#

Once your Slurm cluster is deployed and configured, you can submit a test job to verify that the compute nodes and GPUs are functioning correctly. In this tutorial, we will use the all_reduce_perf test from the RCCL (ROCm Collective Communication Library) benchmarking tools. This test is a collective communications operation used in parallel computing, involving multiple GPUs working together to perform a reduction operation (in our example, summing numbers) across all GPUs. The test provides various metrics, helping you ensure that the MI300X GPUs and RDMA-enabled networking are configured correctly for high-performance workloads.

Connect to the login node#

To submit a test job, connect to the Slurm login node using Azure Bastion. First, retrieve the VM scale set name for the login node:

login_vmss_name=$(az vmss list \
  --resource-group $resource_group \
  --query "[?starts_with(name, 'login-')].name" \
  --output tsv)

Next, retrieve the resource ID of the login node:

login_node_id=$(az vmss list-instances \
  --resource-group $resource_group \
  --name $login_vmss_name \
  --query "[0].id" \
  --output tsv)

Connect to the login node:

az network bastion ssh \
  --name bastion \
  --resource-group $resource_group \
  --target-resource-id $login_node_id \
  --auth-type ssh-key \
  --username $username \
  --ssh-key $ssh_key

Create a job script#

On the login node, create a job script to test the cluster’s functionality:

nano test.sh

Paste the following content into the file:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module load mpi/hpcx

export LD_LIBRARY_PATH=/opt/rccl/lib:$LD_LIBRARY_PATH \
       OMPI_MCA_coll_hcoll_enable=0 \
       OMPI_MCA_plm_rsh_no_tree_spawn=1 \
       OMPI_MCA_plm_rsh_num_concurrent=800 \
       NCCL_IB_PCI_RELAXED_ORDERING=1 \
       NCCL_IB_HCA=^mlx5_an0 \
       NCCL_MIN_NCHANNELS=112

srun --mpi=pmix /opt/rccl-tests/all_reduce_perf -b 1 -e 8G -f 2 -g 1 -O 0

Save and close the file.

The lines beginning with #SBATCH are Slurm directives that specify various options for Slurm:

  • --job-name: Sets the name of the job.

  • --partition: Specifies the GPU partition to use for the job.

  • --gpus-per-node: Specifies the number of GPUs to allocate per node.

  • --ntasks-per-node: Allocates eight tasks per node, matching the number of GPUs.

  • --output and --error: Output and error log files for the job, where %x expands to the job name and %j to the job ID.

The module command loads the HPC-X MPI module, which is required for running MPI jobs on the Slurm cluster.

The export command sets seven environment variables to configure the Slurm job environment, including the library path for RCCL, and various performance tuning parameters for large-scale computations across multiple GPUs.

The srun command executes the all_reduce_perf test to measure the performance of the all-reduce operation across multiple GPUs. This job uses the PMIx plugin, which enables efficient communication between tasks using MPI (Message Passing Interface), a standard for parallel computing. In this setup, each task is assigned to one GPU, and MPI handles the coordination between them.

The command parameters define the following:

  • -b 1: Start with a message size of 1 byte.

  • -e 8G: End with a message size of 8 gigabytes.

  • -f 2: Use a multiplication factor of 2 between sizes.

  • -g 1: Each MPI task uses 1 GPU. With 8 tasks per node, all GPUs are used concurrently.

  • -O 0: Use in-place operations (out-of-place operations are omitted for simplicity).

This setup allows the same script to scale from a single node (8 GPUs) to multiple nodes (e.g., 2 nodes with 16 GPUs) by adjusting the --nodes parameter when submitting the job.

Submit the job#

Submit the job to the Slurm scheduler on 1 node:

sbatch --nodes=1 test.sh

You can experiment with setting the --nodes parameter to run the job on multiple nodes and see how the performance scales with more GPUs.

Monitor the job status using:

squeue

Example output:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1       gpu     test hpcadmin  R       0:03      1 ccw-gpu-1

Tip

If you encounter issues with job submission (e.g., jobs stuck in the PD (pending) state with messages about nodes being down or drained), try running sudo systemctl restart slurmd on each GPU node. This ensures the Slurm daemon is running and the node can communicate with the controller.
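
A minimal sketch for restarting the daemon on both GPU nodes from the scheduler node, assuming inter-node SSH as used earlier:

for i in $(scontrol show hostnames ccw-gpu-[1-2]); do
  ssh $i "sudo systemctl restart slurmd"
done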

Once the job completes, review the output file:

cat test_<job-ID>.out

Example output:

Loading mpi/hpcx
  Loading requirement:
    /opt/hpcx-v2.18-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/modulefiles/hpcx
# nThread 1 nGpus 1 minBytes 1 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
#   Rank  0 Pid  47406 on  ccw-gpu-1 device  0 [0002:00:00.0] AMD Instinct MI300X VF
#   Rank  1 Pid  47407 on  ccw-gpu-1 device  1 [0003:00:00.0] AMD Instinct MI300X VF
#   Rank  2 Pid  47408 on  ccw-gpu-1 device  2 [0004:00:00.0] AMD Instinct MI300X VF
#   Rank  3 Pid  47409 on  ccw-gpu-1 device  3 [0005:00:00.0] AMD Instinct MI300X VF
#   Rank  4 Pid  47410 on  ccw-gpu-1 device  4 [0006:00:00.0] AMD Instinct MI300X VF
#   Rank  5 Pid  47411 on  ccw-gpu-1 device  5 [0007:00:00.0] AMD Instinct MI300X VF
#   Rank  6 Pid  47412 on  ccw-gpu-1 device  6 [0008:00:00.0] AMD Instinct MI300X VF
#   Rank  7 Pid  47413 on  ccw-gpu-1 device  7 [0009:00:00.0] AMD Instinct MI300X VF
#
#                                                              in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)
           0             0     float     sum      -1     0.21    0.00    0.00      0
           0             0     float     sum      -1     0.18    0.00    0.00      0
           4             1     float     sum      -1    33.33    0.00    0.00      0
           8             2     float     sum      -1    32.14    0.00    0.00      0
          16             4     float     sum      -1    32.81    0.00    0.00      0
          32             8     float     sum      -1    33.40    0.00    0.00      0
          64            16     float     sum      -1    32.16    0.00    0.00      0
         128            32     float     sum      -1    27.86    0.00    0.01      0
         256            64     float     sum      -1    29.96    0.01    0.01      0
         512           128     float     sum      -1    30.07    0.02    0.03      0
        1024           256     float     sum      -1    29.07    0.04    0.06      0
        2048           512     float     sum      -1    28.87    0.07    0.12      0
        4096          1024     float     sum      -1    29.00    0.14    0.25      0
        8192          2048     float     sum      -1    21.38    0.38    0.67      0
       16384          4096     float     sum      -1    29.35    0.56    0.98      0
       32768          8192     float     sum      -1    31.76    1.03    1.81      0
       65536         16384     float     sum      -1    30.95    2.12    3.71      0
      131072         32768     float     sum      -1    33.54    3.91    6.84      0
      262144         65536     float     sum      -1    35.83    7.32   12.80      0
      524288        131072     float     sum      -1    39.05   13.43   23.49      0
     1048576        262144     float     sum      -1    43.69   24.00   42.00      0
     2097152        524288     float     sum      -1    50.31   41.69   72.95      0
     4194304       1048576     float     sum      -1    65.68   63.86  111.76      0
     8388608       2097152     float     sum      -1    106.1   79.04  138.32      0
    16777216       4194304     float     sum      -1    158.4  105.89  185.31      0
    33554432       8388608     float     sum      -1    299.9  111.90  195.83      0
    67108864      16777216     float     sum      -1    456.5  147.00  257.25      0
   134217728      33554432     float     sum      -1    802.9  167.17  292.55      0
   268435456      67108864     float     sum      -1   1536.7  174.68  305.70      0
   536870912     134217728     float     sum      -1   3040.1  176.59  309.04      0
  1073741824     268435456     float     sum      -1   6080.6  176.58  309.02      0
  2147483648     536870912     float     sum      -1    12112  177.31  310.29      0
  4294967296    1073741824     float     sum      -1    24307  176.70  309.22      0
  8589934592    2147483648     float     sum      -1    48607  176.72  309.26      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 94.0969
#

In the output, confirm that all eight GPUs are detected (ranks 0-7) and identified as AMD Instinct MI300X. The main table shows performance at different message sizes (from 1 byte to 8 GB). The algbw (algorithm bandwidth) column shows the achieved bandwidth in GB/s, while the busbw (bus bandwidth) column normalizes that value to reflect interconnect utilization, so it can be compared against the theoretical peak bandwidth of the hardware.

Notice how the performance (bandwidth utilization) changes as the data size increases:

  • For small message sizes, bandwidth is relatively low. This is expected: fixed per-operation latency and launch overhead dominate the total time, so only a small fraction of the available bandwidth is used.

  • As the message size increases, performance improves significantly. Larger messages allow the communication libraries and hardware to better utilize available bandwidth, resulting in higher throughput.

  • For large message sizes, performance plateaus near the maximum achievable bandwidth of the MI300X GPUs. In this case, the busbw column in the output shows values approaching 310 GB/s, which reflects the peak interconnect performance of the system.

This trend is typical for GPU-based collective communication benchmarks and indicates that the system is functioning and scaling as expected.
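
The busbw values follow the standard nccl-tests convention for all-reduce, busbw = algbw × 2 × (n − 1) / n, where n is the number of ranks. With n = 8 GPUs the factor is 1.75, which is consistent with the largest entries in the table above:

176.72 GB/s (algbw) x 2 x (8 - 1) / 8 ≈ 309.26 GB/s (busbw)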

Troubleshooting#

If you encounter issues during the deployment, check the following:

  • Azure subscription: Ensure you have the correct Azure subscription selected. You can check your current subscription with az account show.

  • Verify resource quotas: Ensure your subscription has sufficient vCPU quota for the selected VM sizes in the target region (see the quota command after this list).

  • Check role assignment: Verify you have a role that permits subscription-level deployments. You can check your role assignments with az role assignment list --assignee <your-email> --output table.

  • Check activity logs: Check the activity logs in Azure and CycleCloud to help identify issues.
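
For the quota check mentioned above, the following command lists current vCPU usage against the limits in your region:

az vm list-usage --location $location --output table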

Clean up (optional)#

To manage costs when the Slurm cluster is not in use, consider the following options:

  • Enable automatic suspension of idle GPU nodes to reduce costs when the cluster is not in use.

  • Delete the Slurm cluster resources if they are no longer needed.

Conclusion#

You have successfully deployed an Azure CycleCloud Workspace for Slurm with AMD Instinct MI300X GPUs. This setup provides a robust environment for running HPC and AI workloads on Azure, leveraging the power of Slurm for workload management and the high performance of AMD Instinct MI300X GPUs for compute-intensive tasks.
