Air-gapped Installation Guide#

This guide explains how to install the AMD GPU Operator in an air-gapped environment where the Kubernetes cluster has no external network connectivity.

Prerequisites#

  • Kubernetes v1.29.0+

  • Helm v3.2.0+

  • Access to an internal container registry

Required Images#

The following images must be mirrored to your internal registry:

# Core Operator Images
docker.io/rocm/gpu-operator:<version>
docker.io/rocm/gpu-operator-utils:<version>

# Device Plugin Images
docker.io/rocm/k8s-device-plugin:latest
docker.io/rocm/k8s-node-labeller:latest

# Metrics and Testing Images
docker.io/rocm/device-metrics-exporter:<version>
docker.io/rocm/test-runner:<version>
docker.io/rocm/device-config-manager:<version>

# Kernel Module Management (KMM) Images
docker.io/rocm/kernel-module-management-operator:<version>
docker.io/rocm/kernel-module-management-webhook-server:<version>
docker.io/rocm/kernel-module-management-worker:<version>
docker.io/rocm/kernel-module-management-signimage:<version>

# Build and Base Images
gcr.io/kaniko-project/executor:v1.23.2
docker.io/ubuntu:<Ubuntu OS version>
docker.io/busybox:1.36

# Node Feature Discovery
registry.k8s.io/nfd/node-feature-discovery:v0.16.1

# Cert-Manager Images
quay.io/jetstack/cert-manager-controller:v1.15.1
quay.io/jetstack/cert-manager-webhook:v1.15.1
quay.io/jetstack/cert-manager-cainjector:v1.15.1
quay.io/jetstack/cert-manager-acmesolver:v1.15.1

# Argo workflow controller image (for Auto Node Remediation)
quay.io/argoproj/workflow-controller:v3.6.5

Required RPM/DEB Packages#

For driver compilation, ensure these packages are available in your internal package repository:

RHEL/CentOS#

kernel-devel
kernel-headers
gcc
make
elfutils-libelf-devel

Ubuntu#

linux-headers-$(uname -r)
build-essential

Installation Steps#

1. Mirror Required Images#

On a connected system, run the following script to pull, tag, and push all required images:

#!/bin/bash
# Configuration - Update these variables according to your environment
INTERNAL_REGISTRY="internal-registry.example.com"
OPERATOR_VERSION="v1.4.1"  # GPU operator version, e.g., "v1.5.0"
UBUNTU_VERSION="22.04"  # e.g., "22.04"
KANIKO_VERSION="v1.23.2"
NFD_VERSION="v0.16.1"
CERT_MANAGER_VERSION="v1.15.1"
BUSYBOX_VERSION="1.36"

# Operator images with version tag
OPERATOR_VERSIONED_IMAGES=(
  "gpu-operator"
  "gpu-operator-utils"
  "k8s-device-plugin"
  "device-metrics-exporter"
  "test-runner"
  "device-config-manager"
  "kernel-module-management-operator"
  "kernel-module-management-webhook-server"
  "kernel-module-management-worker"
  "kernel-module-management-signimage"
)

# Operator images with latest tag
ROCM_LATEST_IMAGES=(
  "k8s-device-plugin:latest"
  "k8s-node-labeller:latest"
)

# Pull, tag, and push ROCm versioned images
for img in "${OPERATOR_VERSIONED_IMAGES[@]}"; do
  docker pull docker.io/rocm/${img}:${OPERATOR_VERSION}
  docker tag rocm/${img}:${OPERATOR_VERSION} ${INTERNAL_REGISTRY}/rocm/${img}:${OPERATOR_VERSION}
  docker push ${INTERNAL_REGISTRY}/rocm/${img}:${OPERATOR_VERSION}
done

# Pull, tag, and push ROCm latest images
for img in "${ROCM_LATEST_IMAGES[@]}"; do
  docker pull docker.io/rocm/${img}
  docker tag rocm/${img} ${INTERNAL_REGISTRY}/rocm/${img}
  docker push ${INTERNAL_REGISTRY}/rocm/${img}
done

# Third-party images (kaniko, ubuntu, busybox, NFD)
docker pull gcr.io/kaniko-project/executor:${KANIKO_VERSION}
docker tag gcr.io/kaniko-project/executor:${KANIKO_VERSION} ${INTERNAL_REGISTRY}/kaniko-project/executor:${KANIKO_VERSION}
docker push ${INTERNAL_REGISTRY}/kaniko-project/executor:${KANIKO_VERSION}

docker pull docker.io/ubuntu:${UBUNTU_VERSION}
docker tag ubuntu:${UBUNTU_VERSION} ${INTERNAL_REGISTRY}/ubuntu:${UBUNTU_VERSION}
docker push ${INTERNAL_REGISTRY}/ubuntu:${UBUNTU_VERSION}

docker pull docker.io/busybox:${BUSYBOX_VERSION}
docker tag busybox:${BUSYBOX_VERSION} ${INTERNAL_REGISTRY}/busybox:${BUSYBOX_VERSION}
docker push ${INTERNAL_REGISTRY}/busybox:${BUSYBOX_VERSION}

docker pull registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION}
docker tag registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} ${INTERNAL_REGISTRY}/nfd/node-feature-discovery:${NFD_VERSION}
docker push ${INTERNAL_REGISTRY}/nfd/node-feature-discovery:${NFD_VERSION}

# Cert-manager images
CERT_MANAGER_IMAGES=(
  "cert-manager-controller"
  "cert-manager-webhook"
  "cert-manager-cainjector"
  "cert-manager-acmesolver"
)

for img in "${CERT_MANAGER_IMAGES[@]}"; do
  docker pull quay.io/jetstack/${img}:${CERT_MANAGER_VERSION}
  docker tag quay.io/jetstack/${img}:${CERT_MANAGER_VERSION} ${INTERNAL_REGISTRY}/jetstack/${img}:${CERT_MANAGER_VERSION}
  docker push ${INTERNAL_REGISTRY}/jetstack/${img}:${CERT_MANAGER_VERSION}
done

2. Configure Internal Package Repository#

  1. Create an internal package repository mirror containing required build packages

  2. Configure worker nodes to use the internal repository

  3. Verify package availability:

# RHEL/CentOS
yum list kernel-devel kernel-headers gcc make elfutils-libelf-devel

# Ubuntu
apt list linux-headers-$(uname -r) build-essential

3. Install Cert-Manager#

  • Download the cert-manager helm chart on a connected system and transfer it to your air-gapped environment

  • Install cert-manager using internal images:

helm install cert-manager ./cert-manager-v1.15.1.tgz \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --set image.repository=internal-registry.example.com/jetstack/cert-manager-controller \
  --set webhook.image.repository=internal-registry.example.com/jetstack/cert-manager-webhook \
  --set cainjector.image.repository=internal-registry.example.com/jetstack/cert-manager-cainjector \
  --set acmesolver.image.repository=internal-registry.example.com/jetstack/cert-manager-acmesolver

4. Install AMD GPU Operator#

  • Download the GPU operator helm chart on a connected system and transfer it to your air-gapped environment

  • Create custom values file for the operator:

# operator-values.yaml
controllerManager:
  manager:
    image:
      repository: internal-registry.example.com/rocm/gpu-operator
      tag: <version>
    imagePullPolicy: IfNotPresent

deviceConfig:
  spec:
    driver:
      # enable this section if you want operator to build the driver
      # enable: true
      #image: internal-registry.example.com/rocm/driver-image
      #version: <amdgpu driver version>
    commonConfig:
      initContainerImage: internal-registry.example.com/busybox:1.36
      utilsContainer:
        image: internal-registry.example.com/rocm/gpu-operator-utils:<version>
    devicePlugin:
      devicePluginImage: internal-registry.example.com/rocm/k8s-device-plugin:<version>
      nodeLabellerImage: internal-registry.example.com/rocm/k8s-device-plugin:labeller-<version>
    metricsExporter:
      image: internal-registry.example.com/rocm/device-metrics-exporter:<version>
    testRunner:
      image: internal-registry.example.com/rocm/test-runner:<version>
    configManager:
      image: internal-registry.example.com/rocm/device-config-manager:<version>

# NFD image configuration
node-feature-discovery:
  image:
    repository: internal-registry.example.com/nfd/node-feature-discovery
    tag: v0.16.1

# KMM (Kernel Module Management) image configuration
kmm:
  controller:
    manager:
      image:
        repository: internal-registry.example.com/rocm/kernel-module-management-operator
        tag: <version>
      imagePullPolicy: IfNotPresent
      env:
        relatedImageBuild: internal-registry.example.com/kaniko-project/executor:v1.23.2
        relatedImageSign: internal-registry.example.com/rocm/kernel-module-management-signimage:<version>
        relatedImageWorker: internal-registry.example.com/rocm/kernel-module-management-worker:<version>
  webhookServer:
    webhookServer:
      image:
        repository: internal-registry.example.com/rocm/kernel-module-management-webhook-server
        tag: <version>
      imagePullPolicy: IfNotPresent
  • Install the operator:

helm install amd-gpu-operator ./gpu-operator-<version>.tgz \
  --namespace kube-amd-gpu \
  --create-namespace \
  -f operator-values.yaml

5. Configure DeviceConfig#

A default DeviceConfig is automatically created during helm installation (controlled by crds.defaultCR.install: true in values.yaml). The default DeviceConfig uses the image settings specified in the deviceConfig.spec section of your values file.

If you disabled the default DeviceConfig creation (by setting crds.defaultCR.install: false in your values file), or if you want to create an additional custom DeviceConfig, you can create one manually:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  selector:
    # if using SR-IOV VM, please use feature.node.kubernetes.io/amd-vgpu: "true"
    feature.node.kubernetes.io/amd-gpu: "true"
  driver:
    enable: false  # Set to true if you want operator to build and manage out-of-tree drivers
    image: internal-registry.example.com/rocm/driver-image
    version: "<version>"
  commonConfig:
    initContainerImage: internal-registry.example.com/busybox:1.36
    utilsContainer:
      image: internal-registry.example.com/rocm/gpu-operator-utils:<version>
  devicePlugin:
    enableDevicePlugin: true
    devicePluginImage: internal-registry.example.com/rocm/k8s-device-plugin:<version>
    enableNodeLabeller: true
    nodeLabellerImage: internal-registry.example.com/rocm/k8s-device-plugin:labeller-<version>
  metricsExporter:
    enable: true
    image: internal-registry.example.com/rocm/device-metrics-exporter:<version>
  testRunner:
    enable: false
    image: internal-registry.example.com/rocm/test-runner:<version>
  configManager:
    enable: false
    image: internal-registry.example.com/rocm/device-config-manager:<version>

Verification#

  • Check operator pod status:

kubectl get pods -n kube-amd-gpu
  • Verify driver installation:

kubectl get deviceconfig -n kube-amd-gpu
  • Check GPU detection:

kubectl get nodes -l feature.node.kubernetes.io/amd-gpu=true

Troubleshooting#

  • Image Pull Errors

    • Verify internal registry connectivity

    • Check image names and tags

    • Verify registry credentials

  • Driver Build Failures

    • Verify package repository connectivity

    • Check package availability

    • Verify build dependencies

  • Certificate Issues

    • Check cert-manager deployment

    • Verify TLS certificates for internal services

Collecting Logs#

# Operator logs
kubectl logs -n kube-amd-gpu deployment/amd-gpu-operator-controller-manager

# Driver build logs
kubectl logs -n kube-amd-gpu <driver-build-pod>

Run the support tool for comprehensive diagnostics:

./tools/techsupport_dump.sh -w -o yaml <node-name>

Additional Considerations#

  • Registry Certificates

    • Ensure registry certificates are trusted by all nodes

    • Configure container runtime to trust internal certificates

  • Package Repository Security

    • Configure repository signing keys

    • Verify package integrity

  • Network Requirements

    • Ensure internal DNS resolution works

    • Configure necessary firewall rules

    • Set up required proxy settings if applicable