Preparing Pre-compiled Driver Images#
Overview#
The AMD GPU Operator uses the Kernel Module Management (KMM) Operator to deploy AMD GPU drivers on worker nodes. Due to kernel compatibility requirements, each driver image must match the worker node’s exact environment:
Linux distribution
OS release version
Kernel version
You can prepare pre-compiled driver images in advance and import them into the cluster. KMM then skips the in-cluster driver build stage and uses those images directly to load the amdgpu kernel modules on the worker nodes.
How KMM Selects Driver Images#
KMM determines the appropriate driver image based on the combination of:
Worker node OS information
Requested ROCm driver version
Image Tag Format#
KMM looks up driver images by tag. The controller determines the image tag as follows:
Parse the node’s `osImage` field (`kubectl get node -oyaml | grep -i osImage`) to determine the OS and version:
osImage |
OS |
version |
---|---|---|
|
|
|
|
|
|
Read the node’s `kernelVersion` field (`kubectl get node -oyaml | grep -i kernelVersion`) to determine the kernel version.
Read the user-configured amdgpu driver version from the `DeviceConfig` field `spec.driver.version`.
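OS | Tag Format | Example Image Tag
---|---|---
Ubuntu | `ubuntu-<OS version>-<kernel version>-<driver version>` |
CoreOS | `coreos-<OS version>-<kernel version>-<driver version>` | `coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0`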
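As a minimal sketch of how the tag components combine, the shell function below joins the OS name, OS version, kernel version, and driver version with hyphens. The format is inferred from the example tags used elsewhere on this page (for instance `coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0`); `driver_image_tag` is an illustrative helper name and is not part of KMM or the AMD GPU Operator.

```shell
#!/bin/sh
# Assemble a driver image tag as <os>-<os version>-<kernel version>-<driver version>.
# Assumption: this format is inferred from the example tags on this page.
# driver_image_tag is a hypothetical helper, not an operator or KMM command.
driver_image_tag() {
    if [ "$#" -ne 4 ]; then
        echo "usage: driver_image_tag <os> <os-version> <kernel-version> <driver-version>" >&2
        return 1
    fi
    printf '%s-%s-%s-%s\n' "$1" "$2" "$3" "$4"
}

# Example with the CoreOS values used in the BuildConfig on this page.
driver_image_tag coreos 9.6 5.14.0-570.45.1.el9_6.x86_64 7.0
# prints: coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
```

On an Ubuntu build host, the same helper could be called as `driver_image_tag ubuntu "$VERSION_ID" "$(uname -r)" 7.0` after sourcing `/etc/os-release`.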
When a DeviceConfig is created with driver management enabled (`spec.driver.enable=true`), KMM will:
Check if a matching driver image exists in the registry
If not found, build the driver image in-cluster using the AMD GPU Operator’s Dockerfile
If found, directly use the existing image to install the driver
Building Pre-compiled Driver Images#
Ubuntu#
Follow these steps to build a pre-compiled driver image. Make sure your system meets the Linux system requirements for ROCm.
Prepare the Dockerfile
ARG OS_VERSION
FROM ubuntu:${OS_VERSION} as builder
ARG OS_CODENAME
ARG KERNEL_FULL_VERSION
ARG DRIVERS_VERSION
ARG REPO_URL
# Install build dependencies
RUN apt-get update && apt-get install -y bc \
    bison \
    flex \
    libelf-dev \
    gnupg \
    wget \
    git \
    make \
    gcc \
    linux-headers-${KERNEL_FULL_VERSION} \
    linux-modules-extra-${KERNEL_FULL_VERSION}
# Configure AMD GPU repository
RUN mkdir --parents --mode=0755 /etc/apt/keyrings
RUN wget ${REPO_URL}/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | tee /etc/apt/keyrings/rocm.gpg > /dev/null
RUN echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] ${REPO_URL}/amdgpu/${DRIVERS_VERSION}/ubuntu ${OS_CODENAME} main" \
    | tee /etc/apt/sources.list.d/amdgpu.list
# Install and configure driver
RUN apt-get update && apt-get install -y amdgpu-dkms
RUN depmod ${KERNEL_FULL_VERSION}
# Create final image
ARG OS_VERSION
FROM ubuntu:${OS_VERSION}
ARG KERNEL_FULL_VERSION
RUN apt-get update && apt-get install -y kmod
# Set up module directory structure
RUN mkdir -p /opt/lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/amd* /opt/lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/modules.* /opt/lib/modules/${KERNEL_FULL_VERSION}/
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/kernel /opt/lib/modules/${KERNEL_FULL_VERSION}/kernel
# Set up firmware directory
RUN mkdir -p /firmwareDir/updates/amdgpu
COPY --from=builder /lib/firmware/updates/amdgpu /firmwareDir/updates/amdgpu
Build Steps Explanation:
Choose a base image matching your worker nodes’ OS (example: `ubuntu:22.04`)
Install the `amdgpu-dkms` package using the OS package manager
Update module dependencies: run `depmod ${KERNEL_FULL_VERSION}`
Configure the final image:
Install `kmod` (required for modprobe operations)
Copy the required files to the locations KMM expects:
Kernel modules: `/opt/lib/modules/${KERNEL_FULL_VERSION}/`
Firmware files: `/firmwareDir/updates/amdgpu/`
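The layout described above can be checked before pushing an image. The following is a sketch only: `check_driver_image_layout` is a hypothetical helper, and it assumes the image filesystem has been unpacked into a directory (for example with `docker create` followed by `docker export` piped into `tar`). It looks for `modules.dep`, an `amdgpu.ko*` module file, and the firmware directory.

```shell
#!/bin/sh
# Check that an unpacked image root contains the paths described above.
# check_driver_image_layout is a hypothetical helper name, not part of KMM.
check_driver_image_layout() {
    root="$1"     # directory holding the unpacked image filesystem
    kernel="$2"   # kernel version the image was built for
    status=0

    # modules.dep is produced by depmod and copied with the modules.* files.
    for path in \
        "$root/opt/lib/modules/$kernel/modules.dep" \
        "$root/firmwareDir/updates/amdgpu"
    do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done

    # The amdgpu module may be compressed, so match amdgpu.ko with any suffix.
    if ! find "$root/opt/lib/modules/$kernel" -name 'amdgpu.ko*' 2>/dev/null | grep -q .; then
        echo "missing: amdgpu kernel module under $root/opt/lib/modules/$kernel" >&2
        status=1
    fi

    return "$status"
}

# Typical use (paths are examples):
#   check_driver_image_layout /tmp/unpacked-image "$(uname -r)"
```

A non-zero return value means at least one expected path is absent; the missing paths are reported on stderr.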
Trigger the build with the Dockerfile
Make sure the build node runs the same OS and kernel as your production nodes.
Tag the image as described in the Image Tag Format section.
source /etc/os-release
export AMDGPU_VERSION=7.0
docker build \
--build-arg OS_VERSION=${VERSION_ID} \
--build-arg OS_CODENAME=${VERSION_CODENAME} \
--build-arg KERNEL_FULL_VERSION=$(uname -r) \
--build-arg DRIVERS_VERSION=${AMDGPU_VERSION} \
--build-arg REPO_URL=https://repo.radeon.com \
-t registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION} .
Push the image to a registry
docker push registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION}
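Because the image only works on nodes whose kernel matches the build host, it can help to compare kernel versions before building. The sketch below assumes node kernel versions arrive one per line on stdin; `kernel_mismatches` is an illustrative helper name. The `kubectl` JSONPath query in the comment reads the standard `status.nodeInfo.kernelVersion` node field.

```shell
#!/bin/sh
# Print every node kernel version (one per line on stdin) that differs from
# the build host kernel. Returns non-zero if any mismatch was found.
# kernel_mismatches is an illustrative helper name.
kernel_mismatches() {
    build_kernel="$1"
    mismatches=0
    while IFS= read -r node_kernel; do
        # Skip blank lines in the input.
        [ -n "$node_kernel" ] || continue
        if [ "$node_kernel" != "$build_kernel" ]; then
            echo "$node_kernel"
            mismatches=1
        fi
    done
    return "$mismatches"
}

# Typical use on the build host (requires kubectl access to the cluster):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
#     | kernel_mismatches "$(uname -r)"
```

Any kernel version printed belongs to a node that needs its own image build, tagged with that node's kernel version.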
OpenShift - Red Hat Enterprise Linux CoreOS#
Follow these steps to build a pre-compiled driver image for an OpenShift cluster. Make sure your RHEL version and driver version meet the Linux system requirements for ROCm.
Collect System Information
Collect the following system information from the OpenShift build node before configuring the build process:
Kernel version:
uname -r
Kernel-compatible OpenShift DriverToolkit image:
oc adm release info --image-for driver-toolkit
Prepare image registry:
Decide where to push your pre-compiled driver image:
Case 1: Use the OpenShift internal registry:
Enable the internal registry (skip this step if it is already enabled):
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
  --patch '{"spec":{"storage":{"emptyDir":{}}}}'
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
  --patch '{"spec":{"managementState":"Managed"}}'
# make sure the image registry pods are running
oc get pods -n openshift-image-registry
Create ImageStream
oc create imagestream amdgpu_kmod
Case 2: Use an external image registry:
Create a secret for pushing the image, if required:
kubectl create secret docker-registry docker-auth \
  --docker-server=registry.example.com \
  --docker-username=xxx \
  --docker-password=xxx
Create OpenShift `BuildConfig`
Create the following YAML file. The full example assumes that you are using the OpenShift internal image registry and that the BuildConfig is saved in the default namespace.
To configure the build in another namespace, change the namespace accordingly in the example steps.
To use another image registry, replace the `spec.output` part with this:
spec:
  output:
    pushSecret:
      name: docker-auth
    to:
      kind: DockerImage
      # follow the Image Tag Format section to get your image tag
      name: registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
Full example:
kind: BuildConfig
apiVersion: build.openshift.io/v1
metadata:
  name: amd-gpu-operator-build
  namespace: default
  labels:
    app.kubernetes.io/component: build
spec:
  runPolicy: Serial
  nodeSelector: null
  output:
    to:
      kind: ImageStreamTag
      # follow the Image Tag Format section to get your image tag
      name: amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
  successfulBuildsHistoryLimit: 5
  failedBuildsHistoryLimit: 5
  strategy:
    type: Docker
    dockerStrategy:
      buildArgs:
        - name: DRIVERS_VERSION # amdgpu version
          value: '7.0'
        - name: REPO_URL
          value: 'https://repo.radeon.com'
        - name: KERNEL_VERSION
          value: 5.14.0-570.45.1.el9_6.x86_64
        - name: KERNEL_FULL_VERSION
          value: 5.14.0-570.45.1.el9_6.x86_64
        - name: DTK_AUTO
          # DriverToolkit image, get it from `oc adm release info --image-for driver-toolkit`
          value: 'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3af1db51aa8a453fbba972e0039a496f0848eb15e6b411ef0bbb7d5ed864ac7'
  serviceAccount: builder
  source:
    type: Dockerfile
    dockerfile: |-
      ARG DTK_AUTO
      FROM ${DTK_AUTO} as builder
      ARG KERNEL_VERSION
      ARG DRIVERS_VERSION
      ARG REPO_URL
      RUN dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm -y && \
          crb enable && \
          sed -i "s/\$releasever/9/g" /etc/yum.repos.d/epel*.repo && \
          dnf install dnf-plugin-config-manager -y && \
          dnf clean all
      RUN dnf install -y 'dnf-command(config-manager)' && \
          dnf config-manager --add-repo=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/ && \
          dnf config-manager --add-repo=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os/ && \
          rpm --import https://www.centos.org/keys/RPM-GPG-KEY-CentOS-Official && \
          dnf clean all
      RUN source /etc/os-release && \
          echo -e "[amdgpu] \n\
      name=amdgpu \n\
      baseurl=${REPO_URL}/amdgpu/${DRIVERS_VERSION}/el/${VERSION_ID}/main/x86_64/ \n\
      enabled=1 \n\
      priority=50 \n\
      gpgcheck=1 \n\
      gpgkey=${REPO_URL}/rocm/rocm.gpg.key" > /etc/yum.repos.d/amdgpu.repo
      RUN dnf clean all && \
          cat /etc/yum.repos.d/amdgpu.repo && \
          dnf install amdgpu-dkms -y && \
          depmod ${KERNEL_VERSION} && \
          find /lib/modules/${KERNEL_VERSION} -name "*.ko.xz" -exec xz -d {} \; && \
          depmod ${KERNEL_VERSION}
      RUN mkdir -p /modules_files && \
          mkdir -p /amdgpu_ko_files && \
          mkdir -p /kernel_files && \
          cp /lib/modules/${KERNEL_VERSION}/modules.* /modules_files/ && \
          cp -r /lib/modules/${KERNEL_VERSION}/extra/* /amdgpu_ko_files/ && \
          cp -r /lib/modules/${KERNEL_VERSION}/kernel/* /kernel_files/
      FROM registry.redhat.io/ubi9/ubi-minimal
      ARG KERNEL_VERSION
      RUN microdnf install -y kmod
      COPY --from=builder /amdgpu_ko_files /opt/lib/modules/${KERNEL_VERSION}/extra
      COPY --from=builder /kernel_files /opt/lib/modules/${KERNEL_VERSION}/kernel
      COPY --from=builder /modules_files /opt/lib/modules/${KERNEL_VERSION}/
      COPY --from=builder /lib/firmware/updates/amdgpu /firmwareDir/updates/amdgpu
Trigger driver image build
Option 1 - Web Console:
Log in to the OpenShift web console with your username and password
Select `Builds`, then select `BuildConfigs` in the navigation bar
Click `Create BuildConfig`, select YAML view, and paste the YAML file created in the last step
Select the `BuildConfig` in the list, click `Actions`, then select `Start Build`
Select `Builds` on the current `BuildConfig` page; a new build should have been triggered and be in running status
Wait for the build to complete. You can monitor the progress in the `Logs` section; at the end it should show that the push was successful
Delete the `BuildConfig` if needed
Option 2 - Command Line Interface (CLI):
Create the `BuildConfig` from the YAML file created in the last step: `oc apply -f build-config.yaml`
Start the build: `oc start-build amd-gpu-operator-build`
Check the build status: `oc get build` and `oc get pods | grep build`
Wait for the build to complete; the logs should show that the push was successful
Delete the `BuildConfig` if needed: `oc delete -f build-config.yaml`
Using Pre-compiled Images#
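In the previous section, Building Pre-compiled Driver Images, we pushed the driver image to a registry (`registry.example.com/amdgpu-driver` in the Ubuntu example, `registry.example.com/amdgpu_kmod` in the OpenShift example). Now you can configure your `DeviceConfig` to use the pre-compiled images: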
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: test-deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    # Registry path without tag - operator manages tags
    # If you use OpenShift internal image registry, by default the operator will auto select the internal image registry URL
    image: registry.example.com/amdgpu_kmod
    # Registry credentials if required
    imageRegistrySecret:
      name: docker-auth
    # Driver version
    version: "7.0"
Important: Do not include the image tag in the `image` field - the operator automatically appends the appropriate tag based on the node’s OS and kernel version.
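A quick way to catch a tag that slipped into the `image` value is to inspect the part of the reference after the last `/`. This is a sketch under that simple assumption; `image_has_tag` is a hypothetical helper and does not reproduce how the operator itself validates the field.

```shell
#!/bin/sh
# Return 0 if an image reference carries a tag or digest, 1 otherwise.
# Only the part after the last "/" is inspected, so a registry port such as
# registry.example.com:5000/amdgpu_kmod is not mistaken for a tag.
# image_has_tag is a hypothetical helper name.
image_has_tag() {
    last_component="${1##*/}"
    case "$last_component" in
        *:*|*@*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example: warn before applying a DeviceConfig with a tagged image value.
if image_has_tag "registry.example.com/amdgpu_kmod"; then
    echo "remove the tag from spec.driver.image" >&2
fi
```

The check is deliberately conservative: a reference without any `/` (for example `localhost:5000`) would be reported as tagged, so it is best applied to full registry paths.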
Create registry credentials, if needed:
kubectl create secret docker-registry docker-auth \
-n kube-amd-gpu \
--docker-server=registry.example.com \
--docker-username=xxx \
--docker-password=xxx
If you host driver images on Docker Hub, you can omit the `--docker-server` parameter.