GPU Tracker#
Currently, barebones Docker provides no way to track access of GPUs in containers. Additionally, by default, multiple containers in Docker can be granted access to the same GPU simultaneously. GPU Tracker is an extremely lightweight feature of AMD Container Toolkit that solves these issues.
GPU Tracker state is initialized during AMD Container Toolkit installation and is by default disabled. Users can enable or disable the GPU Tracker feature by using the enable or disable CLIs. When enabled, the GPU Tracker automatically maintains the state of GPUs and the containers that they are made accessible to, only if the containers are launched and granted access to the GPUs using the AMD_VISILE_DEVICES environment variable. When the container process completes execution or is stopped, the GPU Tracker state is automatically updated to reflect GPUs released by the specific container.
NOTE: GPU Tracker feature is currenly supported only if containers are started using the docker run command and GPUs are made accessible in containers using the AMD_VISIBLE_DEVICES environment variable. If containers are started and granted access to GPUs in any other manner, GPU Tracker feature is not supported.
GPU Tracker provides CLIs that can be used to control the accessibility of GPUs in containers. The accessibility of GPUs can be set to either shared or exclusive.
The
sharedaccessibility indicates that the GPU can be made accessibile to multiple containers simultaneously. By default, all GPUs are granted thesharedaccessibility to reflect the default Docker behavior.The
exclusiveaccessibility indicates that the GPU can be made accessible to at most one container at any point of time.
GPU Tracker status can be queried at any point of time using the status command and reset using the reset CLIs.
> sudo amd-ctk gpu-tracker -h
NAME:
AMD Container Toolkit CLI gpu-tracker - GPU Tracker related commands
USAGE:
amd-ctk gpu-tracker [gpu-ids] [accessibility]
Arguments:
gpu-ids Comma-separated list of GPU IDs (comma separated list, range operator, all)
accessibility Must be either 'exclusive' or 'shared'
Examples:
amd-ctk gpu-tracker 0,1,2 exclusive
amd-ctk gpu-tracker 0,1-2 shared
amd-ctk gpu-tracker all shared
OR
amd-ctk gpu-tracker [command] [options]
COMMANDS:
disable Disable the GPU Tracker
enable Enable the GPU Tracker
reset Reset the GPU Tracker
status Show Status of GPUs
help, h Shows a list of commands or help for one command
OPTIONS:
--help, -h show help
Using GPU Tracker#
Let us assume that the node has 4 GPUs as indicated below
> rocm-smi
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
====================================================================================================================
0 4 0x740f, 12261 33.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
1 5 0x740f, 13566 38.0°C 40.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
2 3 0x740f, 57300 34.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
3 2 0x740f, 1997 38.0°C 41.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
Show GPU Tracker Status:
Once AMD Container Toolkit, is installed, the GPU Tracker is initialized and the status can be queried using the
statusCLI. If GPU Tracker is enabled, by default it can be seen that GPUs are granted thesharedaccessibility.> amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared - 1 0x89CAA15875FF5A43 Shared - 2 0x6E32F10EFC982B4C Shared - 3 0x12FE4F7FDAF06B9 Shared -
If GPU Tracked feature is not enabled, then a message indicating this is printed.
> amd-ctk gpu-tracker status GPU Tracker is disabled
Enabling GPU Tracker:
GPU Tracker can be enabled using the
enableCLI. When GPU Tracker is newly enabled, it starts tracking usage of GPUs in containers with no prior knowledge of GPUs state. If GPU Tracker is already currently enabled, then nothing happens and a message indicating this is printed.> amd-ctk gpu-tracker status GPU Tracker is disabled > amd-ctk gpu-tracker enable GPU Tracker has been enabled > amd-ctk gpu-tracker enable GPU Tracker is already enabled
Disabling GPU Tracker:
GPU Tracker can be disabled using the
disableCLI. If GPU Tracker is again enabled in the future, all the GPUs state related information will be lost.> amd-ctk gpu-tracker disable GPU Tracker has been disabled > amd-ctk gpu-tracker status GPU Tracker is disabled
Granting access to GPUs in Docker containers:
If GPU Tracker is enabled before launching container, it automatically tracks the usage of GPUs in containers as indicated below.
> docker run --runtime=amd -itd -e AMD_VISIBLE_DEVICES=0-2 rocm/rocm-terminal bash 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 > docker run --runtime=amd -itd -e AMD_VISIBLE_DEVICES=1,3 rocm/rocm-terminal bash 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd > amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 1 0x89CAA15875FF5A43 Shared 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 2 0x6E32F10EFC982B4C Shared 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 3 0x12FE4F7FDAF06B9 Shared 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd > docker rm -f 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 36b012bb34c96149a6ef5b28623e6e75cf9f71eb2b824b2c8f44e0449c7a1aa8 > amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared - 1 0x89CAA15875FF5A43 Shared 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 2 0x6E32F10EFC982B4C Shared - 3 0x12FE4F7FDAF06B9 Shared 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffdSetting GPUs to have
exclusiveaccessibility:If GPU Tracker is enabled, GPUs can be set to have exclusive access in containers. If the user tries to make GPUs exclusive when GPU Tracker is disabled, nothing happens and a message indicating that GPU Tracker is disabled is printed.
> amd-ctk gpu-tracker 1-3 exclusive GPUs [1 2 3] have been made exclusive > amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared - 1 0x89CAA15875FF5A43 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 2 0x6E32F10EFC982B4C Exclusive - 3 0x12FE4F7FDAF06B9 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd > docker run --runtime=amd -itd -e AMD_VISIBLE_DEVICES=0-2 rocm/rocm-terminal bash d23ff3dce1839cbf8ce7ad362641ab85e80b315c319edf73b269c460e348053a docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/d23ff3dce1839cbf8ce7ad362641ab85e80b315c319edf73b269c460e348053a/log.json: no such file or directory): amd-container-runtime did not terminate successfully: exit status 1: GPUs [0 2] allocated GPUs [1] are exclusive and already in use Released GPUs [2 0] used by container d23ff3dce1839cbf8ce7ad362641ab85e80b315c319edf73b269c460e348053a : unknown. > amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared - 1 0x89CAA15875FF5A43 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 2 0x6E32F10EFC982B4C Exclusive - 3 0x12FE4F7FDAF06B9 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd
In the above example, GPUs 1,2 and 3 have been granted
exclusiveaccess.When a new container
d23ff3dce1839cbf8ce7ad362641ab85e80b315c319edf73b269c460e348053athat requests access to GPUs 0,1 and 2 is launched, the following happens:The new container is created.
The new container is granted access to GPU 0 as no container is currently using GPU 0.
GPUs 1 is already being used by container
90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd. Hence, the new container is not granted access to it as GPU 1 hasexclusiveaccessibility.The new container is granted access to GPU 2 as no container is currently using GPU 2 though GPU 2 has
exclusiveaccessibility.The container is not started since it has not been granted access to the required GPU resources.
The resources that have been granted to the new container are released.
NOTE:
Even though the new container
d23ff3dce1839cbf8ce7ad362641ab85e80b315c319edf73b269c460e348053ais not successfully started, it is still visible when we rundocker ps -acommand.> docker ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES d23ff3dce183 rocm/rocm-terminal "bash" 11 seconds ago Created funny_gagarin 90cb29e11e83 rocm/rocm-terminal "bash" 45 seconds ago Up 44 seconds practical_williams
This is because Docker has already created the container when the runtime errors out due to non-availability of resources. This behavior is similar to behavior exhibited by Docker when a container fails to start in any stage after the container is created in Docker. In such cases also, the container is visible in the
docker ps -acommand output with status asCreatedas depicted below.> docker run -itd ubuntu incorrect_command 94f11c132e8cd0a35d05bcc8bcaf77264563998d07f6ad5c73798cf9ddd94726 docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "incorrect_command": executable file not found in $PATH: unknown. > docker ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 94f11c132e8c ubuntu "incorrect_command" 17 seconds ago Created elastic_ardinghelli
Only GPUs that are currently not being used by more than 1 container can be set to have
exclusiveaccessibility.> amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared 8463b475b55b104b30edec8ddf6249b6214b27127106aa0ff4a8a514b856810e 1 0x89CAA15875FF5A43 Shared 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 8463b475b55b104b30edec8ddf6249b6214b27127106aa0ff4a8a514b856810e 2 0x6E32F10EFC982B4C Exclusive 8463b475b55b104b30edec8ddf6249b6214b27127106aa0ff4a8a514b856810e 3 0x12FE4F7FDAF06B9 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd > amd-ctk gpu-tracker 1 exclusive GPUs [1] have not been made exclusive because more than one container is currently using it
Setting GPUs to have
sharedaccessibility:If GPU Tracker is enabled, GPUs can be set to have shared access in containers. If the user tries to make GPUs shared when GPU Tracker is disabled, nothing happens and a message indicating that GPU Tracker is disabled is printed. By default when GPU Tracker is disabled, GPUs have shared accessibility.
> amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared - 1 0x89CAA15875FF5A43 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd 2 0x6E32F10EFC982B4C Exclusive - 3 0x12FE4F7FDAF06B9 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd > amd-ctk gpu-tracker 1 shared GPUs [1] have been made shared > docker run --runtime=amd -itd -e AMD_VISIBLE_DEVICES=0-2 rocm/rocm-terminal bash a8ce87c99727107ab467508bd431a170b148001fe8a866fcf96d5cc6af9a7f5e > amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared a8ce87c99727107ab467508bd431a170b148001fe8a866fcf96d5cc6af9a7f5e 1 0x89CAA15875FF5A43 Shared 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd a8ce87c99727107ab467508bd431a170b148001fe8a866fcf96d5cc6af9a7f5e 2 0x6E32F10EFC982B4C Exclusive a8ce87c99727107ab467508bd431a170b148001fe8a866fcf96d5cc6af9a7f5e 3 0x12FE4F7FDAF06B9 Exclusive 90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffdIn the above example, GPU 1 has been set to
sharedaccess from the previousexclusiveaccess.When a new container
a8ce87c99727107ab467508bd431a170b148001fe8a866fcf96d5cc6af9a7f5ethat requests access to GPUs 0,1 and 2 is launched, the following happens:The new container is created.
The new container is granted access to GPU 0 as no container is currently using GPU 0.
GPUs 1 is already being used by container
90cb29e11e83aa3ae497c68c90e1f0894b85262188c1ef9c7284457a9bc35ffd. However, the new container is granted access to GPU 1 as GPU 1 hassharedaccessibility.The new container is granted access to GPU 2 as no container is currently using GPU 2 though GPU 2 has
exclusiveaccessibility.The container is successfully started since it has been granted access to the required GPU resources.
Resetting GPU Tracker Status:
Resetting GPU Tracker clears the GPU Tracker state, i.e. the accessibility of all GPU is set to
sharedand all information about which GPUs have been made accessible in containers is cleared. If GPU Tracker is enabled, then after the reset operation also the GPU Tracker is enabled. Conversely, if GPU Tracker is cdisabled, then after the reset operation also the GOU Tracker remains disabled.Resetting GPU Tracker is primarily useful in cases where GPU Tracker is enabled and the partitioning scheme of the GPUs has been altered. Changing the partitioning scheme of the GPUs invalidated the CDI Spec and GPU Tracker state. In these cases, it is required to:
Stop all running containers
Reset GPU Tracker
Regenerate CDI Spec
Restart containers
If GPU Tracker is disabled when the partitioning scheme of the GPUs have been altered, then GPU Tracker need not be reset. However, it is recommended to still perform the other actions. It makes no difference if GPU Tracker is reset when it is disabled.
> amd-ctk gpu-tracker status GPUs info is invalid. Please reset GPU Tracker. > amd-ctk gpu-tracker reset GPU Tracker has been reset Since GPU Tracker was enabled, it is recommended to stop and restart running containers to get the most accurate GPU Tracker status > sudo amd-ctk cdi generate Generated CDI spec: /etc/cdi/amd.json > sudo docker run --runtime=amd -itd -e AMD_VISIBLE_DEVICES=0-2 rocm/rocm-terminal bash 988135dafcd94bf98fbd92ca97f4a07c9bcfff0521359ee9bc8a6973cc3e25ce > sudo amd-ctk gpu-tracker status ------------------------------------------------------------------------------------------------------------------------ GPU Id UUID Accessibility Container Ids ------------------------------------------------------------------------------------------------------------------------ 0 0xEA35F57CC80DEB35 Shared 988135dafcd94bf98fbd92ca97f4a07c9bcfff0521359ee9bc8a6973cc3e25ce 1 0x89CAA15875FF5A43 Shared 988135dafcd94bf98fbd92ca97f4a07c9bcfff0521359ee9bc8a6973cc3e25ce 2 0x6E32F10EFC982B4C Shared 988135dafcd94bf98fbd92ca97f4a07c9bcfff0521359ee9bc8a6973cc3e25ce 3 0x12FE4F7FDAF06B9 Shared -