Example Training Workload via Slinky#
The following outlines the steps to get up and running with Slinky on Kubernetes and to run a simple image classification training workload that verifies GPUs are accessible.
Clone this repo and go into the slinky folder#
git clone https://github.com/rocm/gpu-operator.git
cd example/slinky
Installing Slinky Prerequisites#
Install the AMD GPU Operator, configure the DeviceConfig,
and make sure that the device plugin is advertising the AMD GPU devices as allocatable resources:
$ kubectl get node -oyaml | grep -i allocatable -A 10 | grep amd.com
amd.com/gpu: "8"
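If amd.com/gpu does not appear under allocatable, one quick sanity check (pod names and namespaces vary with your DeviceConfig, so this is just a rough filter) is to confirm the device plugin pods are running:
kubectl get pods -A | grep -i device-plugin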
The following steps for installing prerequisites and installing Slinky have been taken from the SlinkyProject/slurm-operator repo quick-start guide.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace --set installCRDs=true
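Before installing the operator, it can be worth confirming the prerequisite charts deployed cleanly; a minimal check of the cert-manager and Prometheus pods looks like:
kubectl get pods -n cert-manager
kubectl get pods -n prometheus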
Installing Slinky Operator#
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace
Make sure the operator deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
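Optionally (the exact CRD names depend on the operator version), you can also confirm that the operator's custom resource definitions were registered:
kubectl get crds | grep -i slinky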
Building the Slurm Compute Node Image#
You will need to build a Slurm Docker image to be used for the Slurm compute node that includes ROCm and a ROCm-compatible PyTorch version. The slurm-rocm-torch directory contains an example Dockerfile that can be used to build this image. It is based on the Dockerfile from the Slinky repo, with the only modifications being:
- the base image is rocm/pytorch-training:v25.4, which already has ROCm and PyTorch installed
- the COPY patches/ patches/ line has been commented out, as there are currently no patches to be applied
- the COPY --from=build /tmp/*.deb /tmp/ line has also been commented out, as there are no .deb files to copy
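The build and push commands themselves are not prescribed by the quick-start; as a rough sketch, assuming you are still in the example/slinky folder and my-registry.example.com/my-repo/slurm-rocm-torch:v25.4 is a placeholder for your own image reference:
docker build -t my-registry.example.com/my-repo/slurm-rocm-torch:v25.4 slurm-rocm-torch/
docker push my-registry.example.com/my-repo/slurm-rocm-torch:v25.4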
Installing Slurm Cluster#
Once the image has been built and pushed to a repository, update the values-slurm.yaml file to specify the compute node image you will be using:
# Slurm compute (slurmd) configurations.
compute:
  #
  # -- (string)
  # Set the image pull policy.
  imagePullPolicy: IfNotPresent
  #
  # Default image for the nodeset pod (slurmd)
  # Each nodeset may override this setting.
  image:
    #
    # -- (string)
    # Set the image repository to use.
    repository: docker-registry/docker-repository/docker-image
    #
    # -- (string)
    # Set the image tag to use.
    # @default -- The Release appVersion.
    tag: image-tag
Install the Slurm cluster Helm chart:
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace
Make sure the Slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-gpu-node 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Prepping Compute Node#
Get the Slurm compute node pod name:
SLURM_COMPUTE_POD=$(kubectl get pods -n slurm | grep ^slurm-compute-gpu-node | awk '{print $1}');echo $SLURM_COMPUTE_POD
Add the slurm user to the video and render groups, and create the slurm user's home directory on the Slurm compute node:
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- bash -c "usermod -aG video,render slurm && mkdir -p /home/slurm && chown slurm:slurm /home/slurm"
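To verify the group membership and home directory took effect (a simple optional check), you can run:
kubectl exec -n slurm $SLURM_COMPUTE_POD -- id slurm
kubectl exec -n slurm $SLURM_COMPUTE_POD -- ls -ld /home/slurm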
Copy the PyTorch test script, which can be found in the example/slinky folder of this repo, to the Slurm compute node:
kubectl cp example/slinky/test.py slurm/$SLURM_COMPUTE_POD:/tmp/test.py
Copy Fashion MNIST Image Classification Model Training script to Slurm compute node
kubectl cp example/slinky/train_fashion_mnist.py slurm/$SLURM_COMPUTE_POD:/tmp/train_fashion_mnist.py
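To double-check that both scripts landed on the compute node before submitting any jobs:
kubectl exec -n slurm $SLURM_COMPUTE_POD -- ls -l /tmp/test.py /tmp/train_fashion_mnist.py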
Run the test.py script on the compute node to confirm GPUs are accessible:
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 /tmp/test.py
Run the single-GPU training script on the compute node:
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 /tmp/train_fashion_mnist.py
Run the multi-GPU training script on the compute node:
kubectl exec -it slurm-controller-0 -n slurm -- srun apptainer exec --rocm --bind /tmp:/tmp torch_rocm.sif torchrun --standalone --nnodes=1 --nproc_per_node=8 --master-addr localhost train_mnist_distributed.py
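While a training job is running, you can watch GPU utilization from the compute node pod; this assumes rocm-smi is available in the image, which is the case for the ROCm-based base image used above:
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- rocm-smi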
Other Useful Slurm Commands#
Check Slurm Node Info#
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
Check Job Queue#
kubectl exec -it slurm-controller-0 -n slurm -- squeue
Check Node Resources#
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -N -o "%N %G"