Migration Guide: NVIDIA to AMD#
Migrating from the NVIDIA Container Toolkit to the AMD Container Toolkit lets developers run containerized GPU workloads on AMD Instinct GPUs with minimal modifications. This guide provides step-by-step instructions to make the transition smooth and efficient.
Moving container workflows from NVIDIA GPUs (using nvidia-docker) to AMD Instinct GPUs (using the AMD Container Toolkit) involves the following key changes:
Step 1: Environment Variable Updates#
NVIDIA GPUs are typically managed through the NVIDIA_VISIBLE_DEVICES environment variable. In the AMD Container Toolkit, this is replaced with:
export AMD_VISIBLE_DEVICES=all # To enable all GPUs
export AMD_VISIBLE_DEVICES=0,1 # To enable specific GPUs
The variable syntax remains consistent; only the prefix changes to AMD. This environment variable is recognized by the AMD container runtime, which uses it to expose GPUs to your container workloads.
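For example, to start a container that can see only the first two GPUs, pass the variable with Docker's -e flag (using <image-name> as a placeholder for any ROCm-capable image):
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=0,1 <image-name>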
Step 2: Update Runtime Configuration#
NVIDIA’s container runtime is identified as nvidia in Docker commands. For AMD, the runtime flag needs to be updated to:
sudo docker run --rm --runtime=amd <image-name>
To set AMD as the default runtime:
sudo amd-ctk runtime configure --runtime=docker --set-as-default
If you previously used:
sudo docker run --rm --runtime=nvidia <image-name>
You would now use the equivalent:
sudo docker run --rm --runtime=amd <image-name>
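After configuring the runtime, you can confirm that Docker has registered it; one quick sanity check (the exact output formatting varies across Docker versions) is:
sudo docker info | grep -i runtime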
Step 3: Command Line Utility Replacement#
NVIDIA provides nvidia-ctk for container configuration and runtime settings. In the AMD ecosystem, this is replaced with:
amd-ctk
For example, listing all available GPUs with the NVIDIA tooling looks like:
nvidia-ctk list
The AMD equivalent:
amd-ctk list
You can also generate CDI specifications with:
amd-ctk cdi generate --output=/etc/cdi/amd.json
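With a CDI specification in place, GPUs can also be requested through Docker's generic --device flag rather than the amd runtime. A minimal sketch, assuming the generated spec registers devices under the amd.com/gpu device kind:
sudo docker run --rm --device amd.com/gpu=all <image-name>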
Step 4: Container Images#
Containers built for NVIDIA GPUs often rely on CUDA-based images. AMD’s container toolkit is designed to work seamlessly with ROCm-enabled images:
- TensorFlow: tensorflow/tensorflow:rocm-latest
- PyTorch: rocm/pytorch:latest
- Triton Inference Server: rocm/tritonserver:latest
For example, running PyTorch with AMD GPUs:
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=all rocm/pytorch:latest
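To confirm that PyTorch inside the container can actually see the GPUs, a one-line check is enough; ROCm builds of PyTorch expose devices through the familiar torch.cuda API, so the call below works unchanged:
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=all rocm/pytorch:latest python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"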
Step 5: Framework Adjustments#
To fully leverage AMD Instinct GPUs, frameworks like TensorFlow and PyTorch must use their ROCm-enabled versions. This ensures compatibility and optimized performance for machine learning workloads.
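A quick way to verify that a ROCm-enabled build is in use is to list the devices the framework can see from inside the container; a sketch using the TensorFlow image tag from Step 4:
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=all tensorflow/tensorflow:rocm-latest python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"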
Compatibility Notes#
- Ensure Docker is version 25 or above for CDI compatibility (a quick version check is shown below).
- Some CUDA-specific applications may require minor modifications.
- ROCm supports the majority of ML frameworks, but always validate against your application stack.
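To confirm that the installed Docker engine meets the version requirement:
docker version --format '{{.Server.Version}}'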
Comparison#
| Feature | NVIDIA Docker | AMD Container Toolkit |
|---|---|---|
| GPU Enumeration | nvidia-ctk list | amd-ctk list |
| Container Runtime | --runtime=nvidia | --runtime=amd |
| Environment Variable | NVIDIA_VISIBLE_DEVICES | AMD_VISIBLE_DEVICES |
| Framework Images | NVIDIA-specific images optimized for CUDA | ROCm-optimized images designed for AMD GPUs |
| TensorFlow Support | CUDA TensorFlow: supports TensorFlow operations on NVIDIA GPUs | ROCm TensorFlow: optimized TensorFlow builds for AMD GPUs |
| PyTorch Support | CUDA PyTorch: optimized for NVIDIA architectures | ROCm PyTorch: optimized for AMD Instinct architectures |
| Configuration Toolkit | nvidia-ctk | amd-ctk |
| Default Docker Runtime | nvidia-ctk runtime configure --runtime=docker --set-as-default | amd-ctk runtime configure --runtime=docker --set-as-default |
Testing and Validation#
After migration, it is crucial to validate workloads with:
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=all rocm/rocm-terminal rocm-smi
Ensure that all intended GPUs are detected and functioning as expected.
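To verify that GPU masking works as intended, repeat the check with a restricted device list; only the selected GPU should appear in the rocm-smi output:
sudo docker run --rm --runtime=amd -e AMD_VISIBLE_DEVICES=0 rocm/rocm-terminal rocm-smi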
Next Steps#
Once migration is complete:
- Update your CI/CD pipelines to reflect the runtime changes.
- Adjust Dockerfiles if specific runtime flags were set for NVIDIA.
- Monitor GPU usage with tools like rocm-smi and amd-ctk list (a simple example follows this list).
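For ongoing monitoring on the host, one simple option is to poll rocm-smi at a fixed interval (the two-second interval here is an arbitrary choice):
watch -n 2 rocm-smi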
This completes the NVIDIA to AMD migration process. With these changes in place, you can take full advantage of AMD Instinct GPUs in your containerized workflows.