Slurm integration#
AMD Device Metrics Exporter integrates with the Slurm workload manager to track GPU metrics for Slurm jobs. This topic explains how to set up and configure the integration.
Prerequisites#
Slurm workload manager installed and configured
AMD Device Metrics Exporter installed and running
Root or sudo access on Slurm nodes
Installation#
Copy the integration scripts (prolog and epilog):
sudo cp /usr/local/etc/metrics/slurm/slurm-prolog.sh /etc/slurm/
sudo cp /usr/local/etc/metrics/slurm/slurm-epilog.sh /etc/slurm/
sudo chmod +x /etc/slurm/slurm-*.sh
Configure Slurm:
sudo vi /etc/slurm/slurm.conf
# Add these lines:
PrologFlags=Alloc
Prolog=/etc/slurm/slurm-prolog.sh
Epilog=/etc/slurm/slurm-epilog.sh
Restart Slurm services to apply changes:
sudo systemctl restart slurmctld # On controller node
sudo systemctl restart slurmd # On compute nodes
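After the restart, it can help to confirm that the running configuration picked up the new settings. The following is a minimal sketch, assuming `scontrol` is on the PATH of the node; the helper name `filter_prolog_settings` is made up for this example.

```shell
# Keep only the Prolog, Epilog, and PrologFlags lines from
# `scontrol show config` output read on stdin.
# Requiring "=" right after the key (allowing whitespace) excludes
# related keys such as PrologSlurmctld or EpilogMsgTime.
filter_prolog_settings() {
    grep -Ei '^(Prolog|Epilog|PrologFlags)[[:space:]]*='
}

# Usage on a node with Slurm installed:
#   scontrol show config | filter_prolog_settings
```

The three lines printed should show the script paths and flag added to slurm.conf.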
Verification#
Submit a test job:
srun --gpus=1 amd-smi monitor
Check metrics endpoint:
curl http://localhost:5000/metrics | grep job_id
You should see metrics tagged with the Slurm job ID.
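To see which job IDs are currently attached to metrics, rather than scanning raw `grep` output, the label values can be extracted and de-duplicated. This is a sketch only; `extract_job_ids` is a hypothetical helper name, and it assumes the endpoint returns the standard Prometheus text format.

```shell
# Print the distinct, non-empty job_id label values found in
# Prometheus text-format metrics read from stdin.
# The leading [{,] keeps the match to a label named exactly job_id.
extract_job_ids() {
    grep -oE '[{,]job_id="[^"]+"' | cut -d'"' -f2 | sort -u
}

# Usage (same endpoint as the curl command above):
#   curl -s http://localhost:5000/metrics | extract_job_ids
```

While the test job is running, its Slurm job ID should appear in the output.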
Metrics#
When Slurm integration is enabled, the following job-specific labels are added to metrics:
job_id: Slurm job ID
job_user: Username of job owner
job_partition: Slurm partition name
cluster_name: Slurm cluster name
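The four labels listed above can be summarized per job straight from the metrics endpoint. The sketch below assumes Prometheus text-format output and label values without spaces; `job_label_summary` is a hypothetical helper, not part of the exporter.

```shell
# Print one line per distinct job: job_id job_user job_partition cluster_name.
# Reads Prometheus text-format metrics on stdin; a missing or empty
# label is shown as "-". Series without a job_id are skipped.
job_label_summary() {
    awk '
    function label(line, name,    pat, s) {
        pat = "[{,]" name "=\"[^\"]*\""
        if (match(line, pat) == 0) return "-"
        s = substr(line, RSTART + 1, RLENGTH - 1)  # text of the form name="value"
        sub(/^[^"]*"/, "", s)                      # drop everything up to the opening quote
        sub(/"$/, "", s)                           # drop the closing quote
        return (s == "" ? "-" : s)
    }
    /^#/ { next }                                  # skip HELP and TYPE comment lines
    {
        id = label($0, "job_id")
        if (id == "-") next
        print id, label($0, "job_user"), label($0, "job_partition"), label($0, "cluster_name")
    }' | sort -u
}

# Usage:
#   curl -s http://localhost:5000/metrics | job_label_summary
```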
Troubleshooting#
Common Issues#
Script permissions:
Ensure the prolog and epilog scripts are executable
Verify proper ownership (should be owned by the root or slurm user)
Configuration issues:
Check Slurm logs for prolog/epilog execution errors
Verify paths in slurm.conf are correct
Metric collection:
Ensure the metrics exporter is running
Check that the job ID labels are being set on the metrics endpoint
Check service status:
systemctl status rdc.service gpuagent.service amd-metrics-exporter.service
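The script permission checks under Common Issues can be scripted. This is a minimal sketch, assuming GNU `stat` (as found on typical Linux nodes); `check_slurm_script` is a hypothetical helper, and the optional second argument for the allowed owners is an addition for illustration.

```shell
# Check that a Slurm prolog/epilog script exists, is executable, and is
# owned by an allowed user (default: root or slurm).
# Usage: check_slurm_script PATH ["owner1 owner2 ..."]
check_slurm_script() {
    script="$1"
    allowed="${2:-root slurm}"
    [ -f "$script" ] || { echo "missing: $script"; return 1; }
    [ -x "$script" ] || { echo "not executable: $script"; return 1; }
    owner=$(stat -c '%U' "$script")   # GNU stat: owner user name
    for user in $allowed; do
        if [ "$owner" = "$user" ]; then
            echo "ok: $script (owner $owner)"
            return 0
        fi
    done
    echo "unexpected owner $owner: $script"
    return 1
}

# Usage with the paths from the installation steps:
#   check_slurm_script /etc/slurm/slurm-prolog.sh
#   check_slurm_script /etc/slurm/slurm-epilog.sh
```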
Logs#
View Slurm logs for integration issues:
sudo tail -f /var/log/slurm/slurmd.log
View service logs:
journalctl -u rdc.service -u gpuagent.service -u amd-metrics-exporter.service
Advanced Configuration#
Custom Script Location#
You can place the scripts in a different location by updating the paths in slurm.conf:
Prolog=/path/to/custom/slurm-prolog.sh
Epilog=/path/to/custom/slurm-epilog.sh
Additional Job Information#
The integration script can be modified to include additional job-specific information in the metrics. Edit the script to add custom labels as needed.
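As a starting point for such a change, the sketch below only shows how extra job details could be gathered from the environment Slurm provides to prolog scripts. How the shipped script hands these values to the exporter is not described here, so that step is left out. The function name `build_job_labels` and the `job_account` key are hypothetical, and the sketch assumes `SLURM_JOB_ACCOUNT` is set in your prolog environment.

```shell
# Collect job details from Slurm-provided environment variables and
# print them as space-separated key=value pairs.
# job_account is a hypothetical extra label; "unknown" is used when a
# variable is unset or empty.
build_job_labels() {
    printf 'job_id=%s job_user=%s job_partition=%s cluster_name=%s job_account=%s\n' \
        "${SLURM_JOB_ID:-unknown}" \
        "${SLURM_JOB_USER:-unknown}" \
        "${SLURM_JOB_PARTITION:-unknown}" \
        "${SLURM_CLUSTER_NAME:-unknown}" \
        "${SLURM_JOB_ACCOUNT:-unknown}"
}

# Usage inside a modified prolog script:
#   labels=$(build_job_labels)
```

Inspect the shipped prolog script to see where it records job data before wiring in additional values.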