Troubleshooting

The AMD Container Toolkit is designed to integrate smoothly into Docker-based environments. However, issues can arise from system configuration, driver installation, or runtime settings. This guide provides step-by-step methods for identifying and resolving the most common ones.

Common Issues

1. Driver Not Loaded

If the AMD GPU driver is not detected, verify that the amdgpu module is loaded:

lsmod | grep amdgpu

If the module is not present, attempt to load it manually:

sudo modprobe amdgpu

If loading fails, check the kernel log for driver errors:

dmesg | grep amdgpu

The output shows any problems that occurred during driver initialization. On systems that restrict access to the kernel log, run dmesg with sudo.
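The module check above can be scripted. This is a minimal sketch, assuming amdgpu is built as a loadable module rather than compiled into the kernel. `module_loaded` is a hypothetical helper name, not part of the toolkit. It reads /proc/modules, the list that lsmod formats.

```shell
#!/usr/bin/env bash
# module_loaded NAME [FILE]: succeed if NAME is listed in FILE.
# FILE defaults to /proc/modules, the list that lsmod formats.
# (Hypothetical helper; not part of the AMD Container Toolkit.)
module_loaded() {
    local name="$1" list="${2:-/proc/modules}"
    # Each entry starts with the module name followed by a space.
    grep -q "^${name} " "$list" 2>/dev/null
}

if module_loaded amdgpu; then
    echo "amdgpu: loaded"
else
    echo "amdgpu: not loaded; try 'sudo modprobe amdgpu' and check dmesg"
fi
```

The optional second argument exists only so the helper can be tried against a saved copy of a module list.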

2. Permission Denied Errors

If GPU devices are not visible inside containers, or access to them fails with a permission error:

  • Verify GPU accessibility using rocm-smi outside the container.

  • Ensure the user belongs to the following groups:

    • render

    • video

Verify your group membership:

groups $USER

If you are not a member, add yourself to the necessary groups:

sudo usermod -a -G render,video $USER

Note: Log out and back in for the changes to take effect.
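To confirm that the new membership is active in the current session, the check can be scripted. This is a sketch under one assumption: `missing_groups` is a hypothetical helper written for this example. It relies on `id -nG`, which, when called without a user name, reports the groups of the running session rather than the entries in the group database, so it also shows whether the re-login has taken effect.

```shell
#!/usr/bin/env bash
# missing_groups "HAVE LIST" NEED...: print each NEED that does not appear
# in the space-separated HAVE LIST. (Hypothetical helper.)
missing_groups() {
    local have=" $1 " g
    shift
    for g in "$@"; do
        case "$have" in
            *" $g "*) ;;          # group present, nothing to report
            *) echo "$g" ;;       # group absent
        esac
    done
}

# id -nG without a user name lists the groups of the current session.
missing=$(missing_groups "$(id -nG)" render video)
if [ -z "$missing" ]; then
    echo "current session is in render and video"
else
    # Unquoted on purpose so the names print on one line.
    echo "current session is missing:" $missing
fi
```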

3. Docker Daemon Restart Failure

If Docker fails to restart after configuring the AMD runtime, inspect the Docker logs:

sudo journalctl -u docker

Look for errors related to:

  • Container runtime conflicts

  • GPU device issues

  • Improper /etc/docker/daemon.json configuration

Verify that the path of the amd runtime is set correctly:

cat /etc/docker/daemon.json
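The Docker journal inspected earlier in this section can be long. As a rough sketch, a small filter can narrow it to lines that are likely to matter. `filter_runtime_errors` is a hypothetical helper, and its keyword pattern is an assumption about what is relevant, not an official list of Docker messages.

```shell
#!/usr/bin/env bash
# filter_runtime_errors: read log lines on stdin and keep those that mention
# a runtime, daemon.json, or an error/failure (case-insensitive).
# The keyword list is an assumption, not an official set of Docker messages.
filter_runtime_errors() {
    grep -iE 'runtime|daemon\.json|error|fail' || true
}

# Usage (run manually; limits the journal to the last 15 minutes):
#   sudo journalctl -u docker --since "15 minutes ago" --no-pager | filter_runtime_errors
```

The `|| true` keeps the function from reporting failure when no line matches, which matters if the script runs with `set -e`.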

4. Runtime Configuration Issues

If Docker does not recognize the AMD runtime, validate the Docker configuration:

cat /etc/docker/daemon.json

Ensure the file contains an entry for the amd runtime that points to the runtime binary:

{
   "runtimes": {
       "amd": {
           "path": "/usr/bin/amd-container-runtime",
           "runtimeArgs": []
       }
   }
}

If the configuration is missing or incorrect, regenerate it and restart Docker:

sudo amd-ctk runtime configure
sudo systemctl restart docker
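The configuration check can be sketched as a script that extracts the registered path and confirms the binary exists. `amd_runtime_path` is a hypothetical helper. It is a line-oriented text match that assumes a layout like the example above, not a general JSON parser.

```shell
#!/usr/bin/env bash
# amd_runtime_path FILE: print the "path" value found in the "amd" runtime
# entry of FILE. Text matching only; assumes a layout like the example
# above and is not a general JSON parser. (Hypothetical helper.)
amd_runtime_path() {
    sed -n '/"amd"[[:space:]]*:/,/}/ s/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

config="/etc/docker/daemon.json"
if [ ! -r "$config" ]; then
    echo "cannot read $config"
else
    path=$(amd_runtime_path "$config")
    if [ -z "$path" ]; then
        echo "no amd runtime entry found in $config"
    elif [ -x "$path" ]; then
        echo "amd runtime registered: $path"
    else
        echo "amd runtime points to $path, which is missing or not executable"
    fi
fi
```

After Docker restarts, `docker info` also lists the registered runtimes in its Runtimes field, which offers a second way to confirm that the daemon picked up the entry.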

5. CDI Specification Not Applied

If Docker does not recognize GPUs through the CDI specification, regenerate the specification:

sudo amd-ctk cdi generate --output=/etc/cdi/amd.json

Inspect the generated specification:

cat /etc/cdi/amd.json

If issues persist, restart Docker:

sudo systemctl restart docker
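Reading the specification by eye can miss a truncated or empty file. The sketch below performs a quick sanity check, assuming only that a CDI specification carries the top-level keys cdiVersion, kind, and devices. `cdi_spec_ok` is a hypothetical helper. It does a text search, not a schema validation.

```shell
#!/usr/bin/env bash
# cdi_spec_ok FILE: succeed if FILE is non-empty and mentions the top-level
# keys of a CDI specification: cdiVersion, kind, devices.
# Text search only; this does not validate the JSON. (Hypothetical helper.)
cdi_spec_ok() {
    local spec="$1" key
    [ -s "$spec" ] || return 1
    for key in cdiVersion kind devices; do
        grep -q "\"$key\"" "$spec" || return 1
    done
}

spec="/etc/cdi/amd.json"
if cdi_spec_ok "$spec"; then
    echo "$spec looks like a CDI specification"
else
    echo "$spec is missing, empty, or incomplete; regenerate it"
fi
```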

Log File Reference

The AMD Container Toolkit logs runtime events and errors to the following location:

/var/log/amd-container-runtime.log

You can follow the log in real time with:

sudo tail -f /var/log/amd-container-runtime.log

This log captures detailed interactions between Docker and the AMD container runtime, including:

  • Runtime initialization

  • GPU device injection

  • OCI specification modifications

  • CDI specification usage

If an issue is hard to diagnose, consult this log file for detailed debugging information.
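When the log is large, it can help to look only at recent problem lines. This is a sketch under a stated assumption: the toolkit's log format is not described here, so the keywords error, fail, and warn are a guess at how problems are labeled. `recent_errors` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# recent_errors FILE [N]: from the last N lines of FILE (default 200), print
# those containing "error", "fail", or "warn" in any letter case.
# The keywords are an assumption about the log format. (Hypothetical helper.)
recent_errors() {
    local file="$1" count="${2:-200}"
    tail -n "$count" "$file" 2>/dev/null | grep -iE 'error|fail|warn' || true
}

# Prints nothing if the log is absent or unreadable; run the script with
# sudo if the log is not readable by your user.
recent_errors /var/log/amd-container-runtime.log
```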

Diagnostic Commands

  • List Available Devices:

    amd-ctk cdi list
    
  • Check Runtime Configuration:

    cat /etc/docker/daemon.json
    
  • Inspect Docker Logs:

    sudo journalctl -u docker
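The commands in this list can be gathered into a single report, which is convenient when sharing results. This is a minimal sketch: `run_diag` and the output file name are hypothetical, and the journal query is limited to the last 50 lines to keep the report short.

```shell
#!/usr/bin/env bash
# run_diag LABEL CMD...: print a header, then run CMD if its program is on
# PATH; otherwise note that it was skipped. (Hypothetical helper.)
run_diag() {
    local label="$1"
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with status $?)"
    else
        echo "(skipped: $1 not found)"
    fi
}

# Hypothetical report location; adjust as needed.
report="${TMPDIR:-/tmp}/amd-ctk-diagnostics.txt"
{
    run_diag "CDI devices" amd-ctk cdi list
    run_diag "Docker daemon config" cat /etc/docker/daemon.json
    run_diag "Docker service log (last 50 lines)" journalctl -u docker -n 50 --no-pager
} > "$report"
echo "wrote $report"
```

Run the script with sudo if the journal is not readable by your user.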
    

Next Steps

If the above steps do not resolve your issue:

  • Validate your ROCm driver installation with:

    rocminfo

  • Verify GPU accessibility with:

    rocm-smi

  • Consult the official AMD Container Toolkit documentation or reach out to the support community for advanced troubleshooting.