Slurm integration#
AMD Device Metrics Exporter integrates with the Slurm workload manager to track GPU metrics for Slurm jobs. This topic explains how to set up and configure the integration.
Prerequisites#
Slurm workload manager installed and configured
AMD Device Metrics Exporter installed and running
Root or sudo access on Slurm nodes
Installation#
Copy the integration scripts:
sudo mkdir -p /etc/slurm/prolog.d /etc/slurm/epilog.d
sudo cp ${TOP_DIR}/example/slurm/exporter-prolog.sh /etc/slurm/prolog.d/exporter-prolog.sh
sudo cp ${TOP_DIR}/example/slurm/exporter-epilog.sh /etc/slurm/epilog.d/exporter-epilog.sh
sudo chmod +x /etc/slurm/prolog.d/exporter-prolog.sh
sudo chmod +x /etc/slurm/epilog.d/exporter-epilog.sh
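The shipped scripts are the source of truth for the integration. As a rough illustration of the mechanism, the sketch below assumes the prolog records per-job metadata as a file under /var/run/exporter and the epilog removes it when the job ends; the file name and key/value format here are assumptions for illustration, not the shipped scripts' actual contract.
#!/bin/bash
# exporter-prolog.sh -- illustrative sketch only; deploy the shipped script.
# Assumption: the exporter discovers Slurm job metadata from per-job files
# under /var/run/exporter. Slurm exports SLURM_JOB_ID, SLURM_JOB_USER,
# SLURM_JOB_PARTITION, and SLURM_CLUSTER_NAME to the prolog environment.
JOB_DIR=/var/run/exporter
mkdir -p "${JOB_DIR}"
cat > "${JOB_DIR}/${SLURM_JOB_ID}" <<EOF
JobID=${SLURM_JOB_ID}
JobUser=${SLURM_JOB_USER}
Partition=${SLURM_JOB_PARTITION}
Cluster=${SLURM_CLUSTER_NAME}
EOF

#!/bin/bash
# exporter-epilog.sh -- illustrative sketch only; deploy the shipped script.
# Remove the job file so the exporter stops attaching this job's labels.
rm -f "/var/run/exporter/${SLURM_JOB_ID}"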
Configure Slurm:
sudo vi /etc/slurm/slurm.conf
# Add these lines:
PrologFlags=Alloc
Prolog="/etc/slurm/prolog.d/*"
Epilog="/etc/slurm/epilog.d/*"
Restart Slurm services to apply changes:
sudo systemctl restart slurmd # On compute nodes
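To confirm that the prolog and epilog settings are active, query the running configuration with scontrol:
scontrol show config | grep -Ei 'prolog|epilog'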
Exporter Container Deployment#
Directory Setup#
It’s recommended to use the following directory structure to store persistent exporter data on the host:
$ tree exporter/
exporter/
└── config
    └── config.json
Create the directory required for tracking Slurm jobs:
sudo mkdir -p /var/run/exporter
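On most distributions /var/run is a tmpfs that is cleared on reboot, so this directory will not survive a restart. One way to recreate it automatically at boot is a systemd-tmpfiles rule (the file name amd-exporter.conf below is an arbitrary choice):
echo 'd /var/run/exporter 0755 root root -' | sudo tee /etc/tmpfiles.d/amd-exporter.conf
sudo systemd-tmpfiles --create /etc/tmpfiles.d/amd-exporter.conf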
Start Exporter Container#
Once the directory structure is ready, start the exporter container from inside the exporter/ directory (so the relative ./config mount resolves correctly):
docker run -d \
--device=/dev/dri \
--device=/dev/kfd \
-v ./config:/etc/metrics \
-v /var/run/exporter/:/var/run/exporter/ \
-p 5000:5000 --name exporter \
rocm/device-metrics-exporter:v1.2.1
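Before wiring in Slurm jobs, a quick sanity check confirms the container is running and serving metrics:
docker ps --filter name=exporter
curl -s http://localhost:5000/metrics | head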
Verification#
Submit a test job:
srun --gpus=1 amd-smi monitor
Check metrics endpoint:
curl http://localhost:5000/metrics | grep job_id
You should see metrics tagged with the Slurm job ID.
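For example, a line of the output might look like the following (the metric name and label values are illustrative and depend on your configuration):
gpu_gpu_util{cluster_name="slurm-cluster",gpu_id="0",job_id="1234",job_partition="gpu",job_user="alice"} 0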
Metrics#
When Slurm integration is enabled, the following job-specific labels are added to metrics:
job_id
: Slurm job ID
job_user
: Username of the job owner
job_partition
: Slurm partition name
cluster_name
: Slurm cluster name
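Once these labels are exported, Prometheus queries can be scoped to a single job. A sketch, assuming a utilization metric named gpu_gpu_util (the name is illustrative):
# Average GPU utilization across the GPUs assigned to job 1234
avg by (job_id) (gpu_gpu_util{job_id="1234"})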
Troubleshooting#
Common Issues#
Script permissions:
Ensure the exporter scripts are executable
Verify proper ownership (should be owned by the root or slurm user)
Configuration issues:
Check Slurm logs for prolog/epilog execution errors
Verify paths in slurm.conf are correct
Metric collection:
Ensure the metrics exporter is running
Check if job ID labels are being properly set
Check service status:
systemctl status gpuagent.service amd-metrics-exporter.service
Logs#
View Slurm logs for integration issues:
sudo tail -f /var/log/slurm/slurmd.log
View service logs:
journalctl -u gpuagent.service -u amd-metrics-exporter.service
Advanced Configuration#
Custom Script Location#
You can place the scripts in a different location by updating the paths in slurm.conf:
Prolog=/path/to/custom/slurm-prolog.sh
Epilog=/path/to/custom/slurm-epilog.sh
Additional Job Information#
The integration script can be modified to include additional job-specific information in the metrics. Edit the script to add custom labels as needed.
Slurm labels are disabled by default. To enable Slurm labels, add the following to your config.json:
{
  "GPUConfig": {
    "Labels": [
      "GPU_UUID",
      "SERIAL_NUMBER",
      "GPU_ID",
      "JOB_ID",
      "JOB_USER",
      "JOB_PARTITION",
      "CLUSTER_NAME",
      "CARD_SERIES",
      "CARD_MODEL",
      "CARD_VENDOR",
      "DRIVER_VERSION",
      "VBIOS_VERSION",
      "HOSTNAME"
    ]
  }
}
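If your exporter version does not pick up configuration changes automatically, restart the container so the new label set takes effect:
docker restart exporter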