Kubernetes Reference Architectures for AMD AI Infrastructure

This repository provides production-ready reference architectures for deploying and operating AMD AI infrastructure at scale on Kubernetes. Covering AMD Instinct™ GPUs, AMD Pensando AINICs, and AMD EPYC™ processors, these guides help organizations build, validate, and operate high-performance AI/ML datacenters from the ground up.

Why These Reference Architectures?

Deploying enterprise AI infrastructure involves more than just installing hardware. Organizations face challenges across the entire lifecycle:

Hardware Intake: How do you validate hundreds of GPU servers before production use?
Platform Foundation: How do you build a Kubernetes platform optimized for AMD hardware?
Day-2 Operations: How do you monitor, remediate, and scale GPU/NIC infrastructure?
Workload Optimization: How do you maximize performance for distributed AI training?

These reference architectures provide tested, reproducible solutions using AMD’s Kubernetes-native tools:

Tool	Purpose
AMD GPU Operator	GPU driver lifecycle, device plugin, health monitoring
AMD Network Operator	AINIC driver lifecycle, Multus CNI, RDMA networking
Device Metrics Exporter	Prometheus metrics for GPU and NIC telemetry
AGFHC	AMD GPU Field Health Check - enterprise burn-in testing
CVS (Cluster Validation Suite)	Multi-node cluster validation - RCCL, IB perf, distributed training

Reference Architecture Journey

The architectures are organized as a progressive journey from hardware intake to production AI workloads:

Phase 1: Hardware Foundation

Reference	Description	Status
GPU Server Intake & Validation	Kubernetes-based workflow for receiving, validating, and releasing GPU servers to production	🚧 In Progress
Bare Metal Provisioning with MAAS	Automate GPU node provisioning with MAAS + Cluster API	🚧 Upcoming

The GPU Server Intake & Validation reference architecture covers the Kubernetes-based validation workflow:

Part 1: Validation Cluster Setup - GPU/Network operators deployment, health check gates, burn-in testing
Part 2: Release Decision - AGFHC test results, pass/fail handling, node relabeling
Part 3: Production Handoff - Cluster migration, continuous monitoring, auto-remediation

Note: Physical intake (racking, cabling, power) and OS provisioning are assumed to be handled by existing customer processes. A MAAS-based provisioning reference is planned separately.
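
As a minimal sketch of the Part 2 release decision, the snippet below maps a node's validation results to a single intake label. The label key `example.com/intake-state`, its values, and the `CheckResult` type are hypothetical names chosen for illustration. They are not part of the AMD GPU Operator or AGFHC, and real AGFHC output would first need to be parsed into this shape.

```python
from dataclasses import dataclass

# Hypothetical label key and values; substitute whatever your cluster uses.
LABEL_KEY = "example.com/intake-state"


@dataclass
class CheckResult:
    """One validation outcome, e.g. a single burn-in test on a node."""
    name: str
    passed: bool


def release_decision(results):
    """Return (labels, failed_names) for a node given its check results.

    A node is released only if at least one check ran and all checks passed.
    """
    failed = [r.name for r in results if not r.passed]
    if not results:
        state = "pending"       # nothing ran yet, so do not release
    elif failed:
        state = "quarantined"   # hold the node back for investigation
    else:
        state = "production"    # every gate passed
    return {LABEL_KEY: state}, failed


if __name__ == "__main__":
    labels, failed = release_decision([
        CheckResult("burn-in", True),
        CheckResult("rccl-bandwidth", False),
    ])
    print(labels, failed)
```

The resulting label could then be applied with `kubectl label node <node-name> example.com/intake-state=production --overwrite`, again assuming the hypothetical key above. Treating an empty result set as "pending" rather than "production" keeps an untested node from being released by accident.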

Phase 2: GPU Lifecycle Management

Reference	Description	Status
Day-2 GPU Lifecycle Management	Continuous health checks in production clusters using CVS workflows	🚧 Upcoming
GPU Auto-Remediation with Argo Workflows	Automated detection, isolation, and recovery of unhealthy GPUs	🚧 Upcoming

Phase 3: Network Infrastructure

Reference	Description	Status
AMD GPU + AINIC Co-deployment	Deploy GPU and Network operators together for high-performance AI training	🚧 Upcoming
RDMA Networking for Distributed Training	Configure Multus, SR-IOV, and RDMA for GPU-to-GPU communication	🚧 Upcoming

Phase 4: AI/ML Workloads

Reference	Description	Status
RCCL Testing Reference Architecture	Validate collective communication performance across GPU clusters	🚧 Upcoming
Distributed PyTorch Training with RDMA	End-to-end distributed training with optimized networking	🚧 Upcoming
MLOps Pipeline with Network-Optimized Training	Production ML pipelines leveraging AMD hardware acceleration	🚧 Upcoming

Target Audience

Infrastructure Engineers: Building GPU-accelerated Kubernetes platforms
Platform Engineers: Operating and scaling AI/ML infrastructure
Site Reliability Engineers: Ensuring availability and performance of GPU clusters
DevOps Teams: Automating hardware lifecycle and workload deployment
Data Center Operators: Qualifying and onboarding AMD hardware at scale

Common Prerequisites

Before using these reference architectures, ensure your environment meets these requirements:

Requirement	Specification
Kubernetes	v1.29+
Operating System	Ubuntu 22.04 LTS or Ubuntu 24.04 LTS
Container Runtime	containerd with GPU support
GPU Hardware	AMD Instinct™ MI300X, MI325X, MI350X, or MI355X
NIC Hardware	AMD Pensando Pollara AINIC (for network architectures)
Tools	Helm v3.2+, kubectl, ROCm 6.0+
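
The version minimums in the table above can be checked mechanically before starting. The following is a rough sketch, not an official validation tool: the `MINIMUMS` mapping copies the Kubernetes, Helm, and ROCm minimums from the table, while the function names and the idea of passing in version strings collected elsewhere (for example from `kubectl version` or `helm version`) are assumptions made for illustration.

```python
import re

# Minimum versions copied from the prerequisites table.
MINIMUMS = {
    "kubernetes": "1.29",
    "helm": "3.2",
    "rocm": "6.0",
}


def parse_version(text):
    """Extract the first dotted number from text, e.g. 'v1.29.3' -> (1, 29, 3)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    f = parse_version(found)
    m = parse_version(minimum)
    # Pad with zeros so that '1.29' and '1.29.0' compare as equal.
    width = max(len(f), len(m))
    f += (0,) * (width - len(f))
    m += (0,) * (width - len(m))
    return f >= m


def check_prerequisites(found_versions):
    """Map each required component to True/False; missing components fail."""
    report = {}
    for name, minimum in MINIMUMS.items():
        found = found_versions.get(name)
        report[name] = found is not None and meets_minimum(found, minimum)
    return report


if __name__ == "__main__":
    # Example input; in practice these strings would come from the cluster.
    print(check_prerequisites({
        "kubernetes": "v1.30.2",
        "helm": "v3.14.0",
        "rocm": "5.7.1",
    }))
```

Versions are compared as integer tuples rather than strings so that, for instance, Helm 3.14 is correctly treated as newer than 3.2.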