Air-gapped Installation Guide
This guide explains how to install the AMD GPU Operator in an air-gapped environment where the Kubernetes cluster has no external network connectivity.
Prerequisites
- Kubernetes v1.29.0+ 
- Helm v3.2.0+ 
- Access to an internal container registry 
Required Images
The following images must be mirrored to your internal registry:
# Core Operator Images
rocm/gpu-operator:<version>
rocm/gpu-operator-bundle:<version>
rocm/gpu-operator-catalog:<version>
# Device Plugin Images
rocm/k8s-device-plugin:<version>
rocm/k8s-device-plugin-labeller:<version>
# Dependency Images
quay.io/jetstack/cert-manager-controller:<version>
quay.io/jetstack/cert-manager-webhook:<version>
quay.io/jetstack/cert-manager-cainjector:<version>
Required DEB Packages
For driver compilation, ensure these packages are available in your internal package repository:
# Ubuntu
linux-headers-$(uname -r)
build-essential
Installation Steps
1. Mirror Required Images
- Download images on a connected system: 
# Example for core operator images
docker pull rocm/gpu-operator:<version>
docker pull rocm/k8s-device-plugin:<version>
- Tag images for your internal registry: 
docker tag rocm/gpu-operator:<version> internal-registry.example.com/rocm/gpu-operator:<version>
docker tag rocm/k8s-device-plugin:<version> internal-registry.example.com/rocm/k8s-device-plugin:<version>
- Push to your internal registry: 
docker push internal-registry.example.com/rocm/gpu-operator:<version>
docker push internal-registry.example.com/rocm/k8s-device-plugin:<version>
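The pull, tag, and push steps above repeat for every image in the Required Images list, so they can be wrapped in a small helper. The following is a minimal sketch, not part of the operator tooling: `INTERNAL_REGISTRY`, `DRY_RUN`, `to_internal_ref`, and `mirror_image` are illustrative names, and dropping the `quay.io` host from the cert-manager references is an assumption about how your registry is laid out.

```shell
#!/bin/sh
# Sketch only: variable and function names here are illustrative assumptions.
INTERNAL_REGISTRY="${INTERNAL_REGISTRY:-internal-registry.example.com}"

# Map a public image reference to its internal-registry equivalent.
# A leading registry host (first path component containing a dot or a colon,
# such as quay.io) is dropped; Docker Hub references like rocm/... are kept whole.
to_internal_ref() {
  ref="$1"
  case "$ref" in
    */*)
      first="${ref%%/*}"
      case "$first" in
        *.*|*:*|localhost) ref="${ref#*/}" ;;
      esac
      ;;
  esac
  printf '%s/%s\n' "$INTERNAL_REGISTRY" "$ref"
}

# Print (DRY_RUN=1, the default) or run the pull/tag/push sequence for one image.
mirror_image() {
  src="$1"
  dst="$(to_internal_ref "$src")"
  for cmd in "docker pull $src" "docker tag $src $dst" "docker push $dst"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf '%s\n' "$cmd"
    else
      $cmd || return 1
    fi
  done
}
```

Calling `mirror_image` once per image reference prints the three commands for review; setting `DRY_RUN=0` on a system that can reach both the public and the internal registry runs them. If no single host can reach both sides, `docker save` and `docker load` can carry the images across as archives instead.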
2. Configure Internal Package Repository
- Create an internal package repository mirror containing required build packages 
- Configure worker nodes to use the internal repository 
- Verify package availability: 
# Ubuntu
apt list linux-headers-$(uname -r) build-essential
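`apt list` prints results but does not fail when a package is missing, which makes it awkward to script. As a hedged alternative, the sketch below parses `apt-cache policy` output and returns a non-zero status if any package has no install candidate. The function names `has_candidate` and `check_packages` are illustrative, and the parsing assumes the standard English `Candidate:` line (hence `LC_ALL=C`).

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# Read `apt-cache policy <pkg>` output on stdin; succeed only if a real
# candidate version is listed (not "(none)" and not empty output).
has_candidate() {
  awk '/^[ \t]*Candidate:/ { if ($2 != "(none)") found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Check each package name given as an argument, print one status line per
# package, and return non-zero if any package is unavailable.
check_packages() {
  missing=0
  for pkg in "$@"; do
    if LC_ALL=C apt-cache policy "$pkg" 2>/dev/null | has_candidate; then
      printf 'OK       %s\n' "$pkg"
    else
      printf 'MISSING  %s\n' "$pkg"
      missing=1
    fi
  done
  return "$missing"
}

# Example, run on a worker node:
#   check_packages "linux-headers-$(uname -r)" build-essential
```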
3. Install Cert-Manager
- Create a custom values file for cert-manager: 
# cert-manager-values.yaml
global:
  imageRegistry: internal-registry.example.com
- Install cert-manager using internal images: 
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set installCRDs=true \
  -f cert-manager-values.yaml
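The `helm install` command above resolves `jetstack/cert-manager` from a chart repository, which an air-gapped cluster typically cannot reach. One possible approach, sketched here under assumptions, is to download the chart archive on a connected system with `helm pull`, copy it across, and install from the local file. `CHART_DIR`, `DRY_RUN`, and the function names are hypothetical; the repository URL is the public Jetstack chart repository.

```shell
#!/bin/sh
# Sketch only: CHART_DIR, DRY_RUN, and the function names are assumptions.
CHART_VERSION="${CHART_VERSION:-v1.15.1}"   # version used in this guide
CHART_DIR="${CHART_DIR:-./charts}"          # hypothetical staging directory

# Echo the command when DRY_RUN=1 (default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# On the connected system: download the chart archive into CHART_DIR.
fetch_chart() {
  run mkdir -p "$CHART_DIR" &&
  run helm repo add jetstack https://charts.jetstack.io &&
  run helm pull jetstack/cert-manager --version "$CHART_VERSION" --destination "$CHART_DIR"
}

# On the air-gapped side: install from the copied archive instead of a repo.
install_chart() {
  run helm install cert-manager "$CHART_DIR/cert-manager-$CHART_VERSION.tgz" \
    --namespace cert-manager --create-namespace \
    --set installCRDs=true \
    -f cert-manager-values.yaml
}
```

With the default `DRY_RUN=1`, both functions only print the commands they would run, which makes it easy to review them before executing anything.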
4. Install AMD GPU Operator
- Create a custom values file for the operator: 
# operator-values.yaml
global:
  imageRegistry: internal-registry.example.com
driver:
  repository: internal-registry.example.com/rocm/gpu-operator
  version: "<version>"
devicePlugin:
  repository: internal-registry.example.com/rocm/k8s-device-plugin
  version: "<version>"
# Additional configuration for internal repositories
buildArgs:
  ROCM_REPO_URL: "http://internal-repo.example.com/rocm"
  ROCM_REPO_KEY: "http://internal-repo.example.com/rocm/rocm.gpg.key"
- Install the operator: 
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  -f operator-values.yaml
5. Configure DeviceConfig
Create a DeviceConfig that references your internal registry:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  driver:
    image: internal-registry.example.com/rocm/gpu-driver
    version: "<version>"
    
  devicePlugin:
    devicePluginImage: internal-registry.example.com/rocm/k8s-device-plugin:<version>
    nodeLabellerImage: internal-registry.example.com/rocm/k8s-device-plugin-labeller:<version>
    
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
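Before applying a DeviceConfig in an air-gapped cluster, it can help to confirm that every image field points at the internal registry and that none relies on a mutable `latest` tag that may never have been mirrored. The sketch below is a simple line-based check, not a YAML parser: `audit_manifest` is a hypothetical helper, and it assumes each image value sits on one line after a key ending in `image:` or `Image:`.

```shell
#!/bin/sh
# Sketch only: audit_manifest is an illustrative helper, not operator tooling.

# Print each image value in a manifest that lies outside the internal registry
# or uses the "latest" tag. Arguments: <file> [registry-host].
audit_manifest() {
  file="$1"
  registry="${2:-internal-registry.example.com}"
  awk -v reg="$registry/" '
    # Match keys such as image:, devicePluginImage:, nodeLabellerImage:
    /^[ \t-]*[A-Za-z]*[Ii]mage:[ \t]*[^ \t]/ {
      val = $NF
      gsub(/"/, "", val)
      if (index(val, reg) != 1)   print "external: " val
      else if (val ~ /:latest$/)  print "mutable tag: " val
    }' "$file"
}

# Example:
#   audit_manifest deviceconfig.yaml internal-registry.example.com
```

An empty result means every matched image value starts with the internal registry host and carries a tag other than `latest`.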
Verification
- Check operator pod status: 
kubectl get pods -n kube-amd-gpu
- Verify driver installation: 
kubectl get deviceconfig -n kube-amd-gpu
- Check GPU detection: 
kubectl get nodes -l feature.node.kubernetes.io/amd-gpu=true
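The three checks above can be chained so that the first failing command stops the sequence. This is a minimal sketch: `KUBECTL`, `NAMESPACE`, and `verify_install` are illustrative names, and the namespace default matches the one used in this guide.

```shell
#!/bin/sh
# Sketch only: variable and function names are illustrative assumptions.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"

# Run the verification commands in order and stop at the first failure.
verify_install() {
  echo "== Operator pods ==" &&
  $KUBECTL get pods -n "$NAMESPACE" &&
  echo "== DeviceConfig ==" &&
  $KUBECTL get deviceconfig -n "$NAMESPACE" &&
  echo "== GPU nodes ==" &&
  $KUBECTL get nodes -l feature.node.kubernetes.io/amd-gpu=true
}
```

Note that a successful exit only means each query ran; the pod status and node list still need to be read to confirm that the operator is healthy and GPUs were detected.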
Troubleshooting
- Image Pull Errors
  - Verify internal registry connectivity
  - Check image names and tags
  - Verify registry credentials
- Driver Build Failures
  - Verify package repository connectivity
  - Check package availability
  - Verify build dependencies
- Certificate Issues
  - Check cert-manager deployment
  - Verify TLS certificates for internal services
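For image pull errors, a quick way to check image names is to list the images that running pods reference and flag any that do not come from the internal registry. The sketch below assumes the `kube-amd-gpu` namespace from this guide; `list_pod_images` and `external_images` are hypothetical helper names, and only regular containers (not init containers) are inspected.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.
INTERNAL_REGISTRY="${INTERNAL_REGISTRY:-internal-registry.example.com}"

# Print one image reference per container for all pods in a namespace.
list_pod_images() {
  kubectl get pods -n "${1:-kube-amd-gpu}" \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}'
}

# Read image references on stdin and print, de-duplicated, those that do not
# start with the internal registry host.
external_images() {
  awk -v prefix="$INTERNAL_REGISTRY/" 'NF && index($0, prefix) != 1' | sort -u
}

# Example:
#   list_pod_images kube-amd-gpu | external_images
```

Any line printed by the example is an image that the cluster would try to pull from outside the internal registry.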
 
Collecting Logs
# Operator logs
kubectl logs -n kube-amd-gpu deployment/amd-gpu-operator-controller-manager
# Driver build logs
kubectl logs -n kube-amd-gpu <driver-build-pod>
Run the support tool for comprehensive diagnostics:
./tools/techsupport_dump.sh -w -o yaml <node-name>
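When the support tool is not available on the host you are working from, the basic logs can still be gathered by hand. The following sketch writes the operator logs, the pod list, and recent events into one directory; `collect_logs`, `KUBECTL`, and the default output directory name are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: variable and function names are illustrative assumptions.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"

# Gather operator logs and namespace state into a directory and print its path.
collect_logs() {
  out_dir="${1:-gpu-operator-logs}"
  mkdir -p "$out_dir" || return 1
  # Operator controller logs (deployment name as used in this guide).
  $KUBECTL logs -n "$NAMESPACE" deployment/amd-gpu-operator-controller-manager \
    > "$out_dir/operator.log" 2>&1
  # Pod list and recent events help correlate image pull or build failures.
  $KUBECTL get pods -n "$NAMESPACE" -o wide > "$out_dir/pods.txt" 2>&1
  $KUBECTL get events -n "$NAMESPACE" --sort-by=.lastTimestamp \
    > "$out_dir/events.txt" 2>&1
  echo "$out_dir"
}
```

Errors from individual commands are captured in the corresponding output file rather than aborting the run, so a partial collection is still produced when one resource is unavailable.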
Additional Considerations
- Registry Certificates
  - Ensure registry certificates are trusted by all nodes
  - Configure container runtime to trust internal certificates
- Package Repository Security
  - Configure repository signing keys
  - Verify package integrity
- Network Requirements
  - Ensure internal DNS resolution works
  - Configure necessary firewall rules
  - Set up required proxy settings if applicable
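As one illustration of the registry certificate item, the sketch below adds an internal CA to the system trust store on an Ubuntu node and then probes the registry over TLS. It assumes Ubuntu nodes running containerd and a CA file you have already copied to the node; `CA_FILE`, `REGISTRY`, `DRY_RUN`, and the function names are hypothetical. Other runtimes and distributions use different paths.

```shell
#!/bin/sh
# Sketch only: assumes Ubuntu nodes with containerd; names are illustrative.
REGISTRY="${REGISTRY:-internal-registry.example.com}"
CA_FILE="${CA_FILE:-./internal-ca.crt}"   # hypothetical path to the CA certificate

# Echo the command when DRY_RUN=1 (default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Install the CA into the system trust store and restart the runtime so it
# picks up the change. Requires root when actually executed.
trust_registry_ca() {
  run cp "$CA_FILE" /usr/local/share/ca-certificates/internal-registry-ca.crt &&
  run update-ca-certificates &&
  run systemctl restart containerd
}

# Probe the registry API endpoint. A zero exit status with any HTTP code
# (including 401) means the TLS handshake succeeded.
check_registry() {
  run curl --silent --output /dev/null --write-out '%{http_code}\n' "https://$REGISTRY/v2/"
}
```

With the default `DRY_RUN=1`, both functions print the commands for review; run them with `DRY_RUN=0` as root on each node once the paths match your environment.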