Air-gapped Installation Guide
This guide explains how to install the AMD GPU Operator in an air-gapped environment where the Kubernetes cluster has no external network connectivity.
Prerequisites
Kubernetes v1.29.0+
Helm v3.2.0+
Access to an internal container registry
Required Images
The following images must be mirrored to your internal registry:
# Core Operator Images
rocm/gpu-operator:<version>
rocm/gpu-operator-bundle:<version>
rocm/gpu-operator-catalog:<version>
# Device Plugin Images
rocm/k8s-device-plugin:<version>
rocm/k8s-device-plugin-labeller:<version>
# Dependency Images
quay.io/jetstack/cert-manager-controller:<version>
quay.io/jetstack/cert-manager-webhook:<version>
quay.io/jetstack/cert-manager-cainjector:<version>
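The <version> placeholders above have to match the tags the charts actually deploy. One way to cross-check the list, sketched here under the assumption that a chart archive has already been downloaded on a connected host, is to render the chart with helm template and collect every image field. The helper name list_images and the archive file name are illustrative only. Images that the operator deploys at runtime (for example those named in a DeviceConfig) will not show up in the rendered chart.

```shell
#!/usr/bin/env bash
set -eu

# list_images (hypothetical helper): read rendered Kubernetes manifests on
# stdin and print each distinct container image reference once.
list_images() {
  sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' | tr -d "\"'" | sort -u
}

# Example, on a connected host (the archive name is an assumption):
#   helm template cert-manager ./cert-manager-v1.15.1.tgz | list_images
```

Any image printed here that is missing from the internal registry is a likely source of later pull failures.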
Required DEB Packages
For driver compilation, ensure these packages are available in your internal package repository:
Ubuntu
linux-headers-$(uname -r)
build-essential
Installation Steps
1. Mirror Required Images
Download images on a connected system:
# Example for core operator images
docker pull rocm/gpu-operator:<version>
docker pull rocm/k8s-device-plugin:<version>
Tag images for your internal registry:
docker tag rocm/gpu-operator:<version> internal-registry.example.com/rocm/gpu-operator:<version>
docker tag rocm/k8s-device-plugin:<version> internal-registry.example.com/rocm/k8s-device-plugin:<version>
Push to your internal registry:
docker push internal-registry.example.com/rocm/gpu-operator:<version>
docker push internal-registry.example.com/rocm/k8s-device-plugin:<version>
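The pull, tag, and push steps above apply to every image in the Required Images list, so they can be wrapped in a loop. The following is a minimal sketch, not an official tool: the function names, the DRY_RUN switch, and the choice to drop a source registry host such as quay.io while keeping the rest of the path are all assumptions about how the internal registry is laid out.

```shell
#!/usr/bin/env bash
set -eu

REGISTRY="${REGISTRY:-internal-registry.example.com}"

# target_ref: map a source image to its name in the internal registry.
# A first path component containing "." or ":" (or equal to "localhost") is
# treated as a registry host and dropped, so quay.io/jetstack/x becomes
# $REGISTRY/jetstack/x. This layout is an assumption, not a requirement.
target_ref() {
  local image="$1" first
  case "$image" in
    */*)
      first="${image%%/*}"
      case "$first" in
        *.*|*:*|localhost) image="${image#*/}" ;;
      esac
      ;;
  esac
  printf '%s/%s\n' "$REGISTRY" "$image"
}

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# mirror_image: pull, retag, and push a single image.
mirror_image() {
  local src="$1" dst
  dst="$(target_ref "$src")"
  run docker pull "$src"
  run docker tag "$src" "$dst"
  run docker push "$dst"
}

# mirror_all: mirror every non-empty, non-comment line read from stdin.
mirror_all() {
  local line
  while IFS= read -r line; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    mirror_image "$line" </dev/null
  done
}

# Example (images.txt is a hypothetical file with one image per line):
#   DRY_RUN=1 mirror_all < images.txt   # preview the commands
#   mirror_all < images.txt             # actually mirror
```

Running once with DRY_RUN=1 makes it easy to review the target names before anything is pushed.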
2. Configure Internal Package Repository
Create an internal package repository mirror containing required build packages
Configure worker nodes to use the internal repository
Verify package availability:
# Ubuntu
apt list linux-headers-$(uname -r) build-essential
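The apt list check can also be scripted so that a missing package fails loudly. The sketch below relies on apt-cache policy, which prints a Candidate line for each known package and reports (none) when no configured repository can supply it. The function names are illustrative, and the check assumes apt-get update has already been run against the internal repository.

```shell
#!/usr/bin/env bash
set -eu

# has_candidate: read `apt-cache policy <pkg>` output on stdin and succeed
# only if it names an installable candidate version.
has_candidate() {
  local candidate
  candidate="$(awk '$1 == "Candidate:" && !seen { print $2; seen = 1 }')"
  [ -n "$candidate" ] && [ "$candidate" != "(none)" ]
}

# check_packages: report each package given as an argument and return
# non-zero if any of them has no install candidate.
check_packages() {
  local missing=0 pkg
  for pkg in "$@"; do
    if apt-cache policy "$pkg" | has_candidate; then
      echo "ok       $pkg"
    else
      echo "MISSING  $pkg"
      missing=1
    fi
  done
  return "$missing"
}

# Example, on a worker node:
#   check_packages "linux-headers-$(uname -r)" build-essential
```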
3. Install Cert-Manager
Create a custom values file for cert-manager:
# cert-manager-values.yaml
global:
  imageRegistry: internal-registry.example.com
Install cert-manager using internal images:
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set installCRDs=true \
  -f cert-manager-values.yaml
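In a cluster with no external connectivity, the jetstack/cert-manager chart reference only resolves if the chart itself is reachable, so the chart may need the same treatment as the images. A hedged sketch follows, assuming a connected host with Helm and some approved way to copy files into the air-gapped environment. The function names are illustrative, and https://charts.jetstack.io is the public Jetstack chart repository.

```shell
#!/usr/bin/env bash
set -eu

CHART_VERSION="${CHART_VERSION:-v1.15.1}"

# chart_archive: file name that `helm pull` writes for a chart and version.
chart_archive() {
  printf '%s-%s.tgz\n' "$1" "$2"
}

# fetch_chart: run on the connected host to download the chart archive
# into the current directory.
fetch_chart() {
  helm repo add jetstack https://charts.jetstack.io
  helm repo update
  helm pull jetstack/cert-manager --version "$CHART_VERSION"
}

# install_chart: run inside the air-gapped environment, from the directory
# that holds the copied archive and cert-manager-values.yaml.
install_chart() {
  helm install cert-manager "./$(chart_archive cert-manager "$CHART_VERSION")" \
    --namespace cert-manager \
    --create-namespace \
    --set installCRDs=true \
    -f cert-manager-values.yaml
}
```

Installing from a local archive removes the need for the version flag, since the archive already pins the chart version. The same approach would presumably work for the operator chart in the next step.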
4. Install AMD GPU Operator
Create a custom values file for the operator:
# operator-values.yaml
global:
  imageRegistry: internal-registry.example.com
driver:
  repository: internal-registry.example.com/rocm/gpu-operator
  version: "<version>"
devicePlugin:
  repository: internal-registry.example.com/rocm/k8s-device-plugin
  version: "<version>"
# Additional configuration for internal repositories
buildArgs:
  ROCM_REPO_URL: "http://internal-repo.example.com/rocm"
  ROCM_REPO_KEY: "http://internal-repo.example.com/rocm/rocm.gpg.key"
Install the operator:
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  -f operator-values.yaml
5. Configure DeviceConfig
Create a DeviceConfig that references your internal registry:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  driver:
    image: internal-registry.example.com/rocm/gpu-driver
    version: "<version>"
  devicePlugin:
    # Use the tags that were mirrored to the internal registry; a tag that was
    # never pushed (such as latest) cannot be pulled in an air-gapped cluster.
    devicePluginImage: internal-registry.example.com/rocm/k8s-device-plugin:<version>
    nodeLabellerImage: internal-registry.example.com/rocm/k8s-device-plugin-labeller:<version>
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
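Before applying the DeviceConfig, it can help to confirm that no image field still points outside the internal registry. The following sketch scans a manifest for keys ending in "image" (which covers image, devicePluginImage, and nodeLabellerImage) and prints any value that does not start with the given registry host. The function name and the deviceconfig.yaml file name are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# external_images: read a manifest on stdin and print "<key> <ref>" for every
# key ending in "image" (case-insensitive) whose value does not start with
# "<registry>/", where <registry> is the first argument.
external_images() {
  awk -v prefix="$1/" '
    tolower($1) ~ /image:$/ && NF >= 2 {
      ref = $2
      gsub(/"/, "", ref)
      if (index(ref, prefix) != 1) print $1, ref
    }'
}

# Example (deviceconfig.yaml is an assumed file name):
#   external_images internal-registry.example.com < deviceconfig.yaml
#   kubectl apply -f deviceconfig.yaml
```

Empty output means every image reference in the file uses the internal registry.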
Verification
Check operator pod status:
kubectl get pods -n kube-amd-gpu
Verify driver installation:
kubectl get deviceconfig -n kube-amd-gpu
Check GPU detection:
kubectl get nodes -l feature.node.kubernetes.io/amd-gpu=true
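The pod status check above can be reduced to a list of pods that still need attention. This is a small sketch that filters the output of kubectl get pods with the --no-headers flag. It treats a pod as healthy when its status is Completed, or when it is Running with all containers ready. The function name is illustrative.

```shell
#!/usr/bin/env bash
set -eu

# not_ready_pods: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and print name, ready count,
# and status for each pod that is not Completed and not fully ready.
not_ready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Example:
#   kubectl get pods -n kube-amd-gpu --no-headers | not_ready_pods
```

A status such as ImagePullBackOff or ErrImagePull in this output usually points back to the registry checks in the Troubleshooting section.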
Troubleshooting
Image Pull Errors
Verify internal registry connectivity
Check image names and tags
Verify registry credentials
Driver Build Failures
Verify package repository connectivity
Check package availability
Verify build dependencies
Certificate Issues
Check cert-manager deployment
Verify TLS certificates for internal services
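For image pull and certificate problems, a quick first check is whether a node can reach the registry API at all. Registries that implement the standard HTTP API answer a GET on /v2/ with 200, or with 401 when credentials are required. The sketch below uses curl for that probe. The function names are illustrative, and it assumes the registry is served over HTTPS on the default port.

```shell
#!/usr/bin/env bash
set -eu

# registry_reachable: interpret the HTTP status of GET /v2/.
# 200 means open access; 401 means the registry answered and wants
# credentials. Anything else (including 000 for no connection or a TLS
# failure) is treated as not reachable.
registry_reachable() {
  case "$1" in
    200|401) return 0 ;;
    *) return 1 ;;
  esac
}

# probe_registry: print the HTTP status code returned by https://<host>/v2/.
# curl errors (DNS, TLS, timeout) go to stderr and yield the code 000.
probe_registry() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/v2/" || true
}

# Example, on a worker node:
#   code="$(probe_registry internal-registry.example.com)"
#   if registry_reachable "$code"; then echo "registry answered ($code)"; fi
```

A TLS error on stderr together with code 000 suggests the node does not trust the registry certificate, while 401 suggests the remaining problem is credentials.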
Collecting Logs
# Operator logs
kubectl logs -n kube-amd-gpu deployment/amd-gpu-operator-controller-manager
# Driver build logs
kubectl logs -n kube-amd-gpu <driver-build-pod>
Run the support tool for comprehensive diagnostics:
./tools/techsupport_dump.sh -w -o yaml <node-name>
Additional Considerations
Registry Certificates
Ensure registry certificates are trusted by all nodes
Configure container runtime to trust internal certificates
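How certificates are trusted depends on the node OS and container runtime. As one example, and assuming Ubuntu nodes running containerd as a systemd service, the registry CA can be added to the system trust store and the runtime restarted so that it rereads the store. The function names, the DRY_RUN switch, and the certificate path in the example are assumptions.

```shell
#!/usr/bin/env bash
set -eu

# ca_target: destination path in the Ubuntu trust-store source directory.
# update-ca-certificates only picks up files ending in .crt.
ca_target() {
  local base
  base="$(basename "$1")"
  printf '/usr/local/share/ca-certificates/%s.crt\n' "${base%.*}"
}

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# trust_registry_ca: install a PEM-encoded CA certificate and restart
# containerd (assumed runtime) so it reloads the system trust store.
trust_registry_ca() {
  run sudo cp "$1" "$(ca_target "$1")"
  run sudo update-ca-certificates
  run sudo systemctl restart containerd
}

# Example, on each node (the certificate path is hypothetical):
#   DRY_RUN=1 trust_registry_ca /tmp/registry-ca.pem
```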
Package Repository Security
Configure repository signing keys
Verify package integrity
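On Ubuntu, one way to tie a repository to its signing key is the signed-by option in the apt source entry, so that packages from the internal mirror are only accepted when signed by that specific key. The sketch below just builds such an entry. The keyring path, suite name, and component in the example are assumptions and must match how the internal mirror is actually published.

```shell
#!/usr/bin/env bash
set -eu

# apt_source_line: build a one-line apt source entry pinned to a keyring.
# Arguments: keyring path, repository URL, suite, component.
apt_source_line() {
  printf 'deb [signed-by=%s] %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example (keyring path, suite, and component are assumptions):
#   apt_source_line /etc/apt/keyrings/rocm.gpg \
#     http://internal-repo.example.com/rocm jammy main \
#     | sudo tee /etc/apt/sources.list.d/internal-rocm.list
#   sudo apt-get update
```

If apt-get update then reports a signature error, the key in the keyring does not match the one used to sign the mirror.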
Network Requirements
Ensure internal DNS resolution works
Configure necessary firewall rules
Set up required proxy settings if applicable