Example Training Workload via Slinky#

The following outlines the steps to get Slinky up and running on Kubernetes and to run a simple image classification training workload that verifies the GPUs are accessible.

Clone this repo and go into the slinky folder#

git clone https://github.com/rocm/gpu-operator.git
cd gpu-operator/example/slinky

Installing Slinky Prerequisites#

Install the AMD GPU Operator, configure the DeviceConfig, and make sure that the device plugin is advertising the AMD GPU devices as allocatable resources:

kubectl get node -oyaml | grep -i allocatable -A 10 | grep amd.com

amd.com/gpu: "8"

The following steps for installing prerequisites and installing Slinky have been taken from the SlinkyProject/slinky-operator repo quick-start guide.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
	--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
	--namespace prometheus --create-namespace --set installCRDs=true

Installing Slinky Operator#

helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace

Make sure the operator deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Building the Slurm Compute Node Image#

You will need to build a Slurm Docker image for the Slurm compute node that includes ROCm and a ROCm-compatible PyTorch version. The slurm-rocm-torch directory contains an example Dockerfile that can be used to build this image. It is based on the Dockerfile from the Slinky repo, with the only modifications being:

  • the base image is rocm/pytorch-training:v25.4, which already has ROCm and PyTorch installed

  • the COPY patches/ patches/ line has been commented out as there are currently no patches to be applied

  • the COPY --from=build /tmp/*.deb /tmp/ line has also been commented out as there are no .deb files to copy

Installing Slurm Cluster#

Once the image has been built and pushed to a repository, update the values-slurm.yaml file to specify the compute node image you will be using:

# Slurm compute (slurmd) configurations.
compute:
  #
  # -- (string)
  # Set the image pull policy.
  imagePullPolicy: IfNotPresent
  #
  # Default image for the nodeset pod (slurmd)
  # Each nodeset may override this setting.
  image:
    #
    # -- (string)
    # Set the image repository to use.
    repository: docker-registry/docker-repository/docker-image
    #
    # -- (string)
    # Set the image tag to use.
    # @default -- The Release appVersion.
    tag: image-tag

Install the Slurm cluster Helm chart:

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace

Make sure the Slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-gpu-node            1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Prepping Compute Node#

  1. Get the Slurm compute node pod name

    SLURM_COMPUTE_POD=$(kubectl get pods -n slurm | grep ^slurm-compute-gpu-node | awk '{print $1}');echo $SLURM_COMPUTE_POD
    
  2. Add the slurm user to the video and render groups and create the slurm user's home directory on the Slurm compute node

    kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- bash -c "
        usermod -aG video,render slurm
        mkdir -p /home/slurm
        chown slurm:slurm /home/slurm"
    
  3. Copy the PyTorch test script, found in the example/slinky folder of this repo, to the Slurm compute node

    kubectl cp test.py slurm/$SLURM_COMPUTE_POD:/tmp/test.py
    
  4. Copy the Fashion-MNIST image classification model training script to the Slurm compute node

    kubectl cp train_fashion_mnist.py slurm/$SLURM_COMPUTE_POD:/tmp/train_fashion_mnist.py
    
  5. Run the test.py script on the compute node to confirm the GPUs are accessible (a sketch of what such a check looks like is shown after this list)

    kubectl exec -it slurm-controller-0 -n slurm --  srun python3 /tmp/test.py
    
  6. Run the single-GPU training script on the compute node (see the training-loop sketch after this list)

    kubectl exec -it slurm-controller-0 -n slurm --  srun python3 /tmp/train_fashion_mnist.py
    
  7. Run the multi-GPU training script on the compute node (see the distributed-training sketch after this list)

    kubectl exec -it slurm-controller-0 -n slurm --  srun apptainer exec --rocm --bind /tmp:/tmp torch_rocm.sif torchrun --standalone --nnodes=1 --nproc_per_node=8 --master-addr localhost train_mnist_distributed.py
    
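For reference, a minimal GPU-visibility check in the spirit of test.py might look like the sketch below. This is illustrative only and not necessarily the exact script shipped in example/slinky; it relies on ROCm builds of PyTorch exposing AMD GPUs through the standard torch.cuda API.

# Illustrative GPU visibility check (the repo's test.py may differ).
import torch

def main():
    if not torch.cuda.is_available():
        raise SystemExit("No GPUs visible to PyTorch")
    count = torch.cuda.device_count()
    print(f"PyTorch {torch.__version__} sees {count} GPU(s)")
    for i in range(count):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")
    # Run a small matmul on GPU 0 to confirm kernels actually execute.
    x = torch.randn(1024, 1024, device="cuda:0")
    torch.cuda.synchronize()
    print("Matmul result norm:", (x @ x).norm().item())

if __name__ == "__main__":
    main()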

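As an illustration of what the single-GPU training step involves, a Fashion-MNIST training loop typically follows the pattern sketched below. This is not the exact train_fashion_mnist.py from the repo; it assumes torchvision is available in the compute node image and downloads the dataset to /tmp/data (adjust the path, or pre-stage the data, if the node has no internet access).

# Illustrative single-GPU Fashion-MNIST training loop (not the repo's exact script).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def main():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    train_set = datasets.FashionMNIST(
        "/tmp/data", train=True, download=True, transform=transforms.ToTensor()
    )
    loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

    # Small CNN classifier for 28x28 grayscale images, 10 classes.
    model = nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
        nn.Linear(128, 10),
    ).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(2):
        for step, (images, labels) in enumerate(loader):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

if __name__ == "__main__":
    main()
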
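The multi-GPU step launches torchrun inside an Apptainer container, which spawns one process per GPU (--nproc_per_node=8) and sets RANK, LOCAL_RANK, and WORLD_SIZE for each process. Whatever train_mnist_distributed.py contains, a torchrun-launched DistributedDataParallel script generally follows the skeleton below; the tiny linear model and random batches are stand-ins, not the actual workload.

# Illustrative DistributedDataParallel skeleton for a torchrun launch.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    # On ROCm, PyTorch's "nccl" backend is backed by RCCL.
    dist.init_process_group(backend="nccl")

    model = DDP(nn.Linear(784, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # A real script would shard the dataset across ranks with a DistributedSampler.
    for step in range(10):
        x = torch.randn(64, 784, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()
        if dist.get_rank() == 0 and step % 5 == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
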
Other Useful Slurm Commands#

Check Slurm Node Info#

kubectl exec -it slurm-controller-0 -n slurm --  sinfo

Check Job Queue#

kubectl exec -it slurm-controller-0 -n slurm --  squeue

Check Node Resources#

kubectl exec -it slurm-controller-0 -n slurm -- sinfo -N -o "%N %G"