Kubernetes Reference Architectures for AMD AI Infrastructure
This repository provides production-ready reference architectures for deploying and operating AMD AI infrastructure at scale on Kubernetes. Covering AMD Instinct™ GPUs, AMD Pensando AINICs, and AMD EPYC™ processors, these guides help organizations build, validate, and operate high-performance AI/ML datacenters from the ground up.
Why These Reference Architectures?
Deploying enterprise AI infrastructure involves more than just installing hardware. Organizations face challenges across the entire lifecycle:
Hardware Intake: How do you validate hundreds of GPU servers before production use?
Platform Foundation: How do you build a Kubernetes platform optimized for AMD hardware?
Day-2 Operations: How do you monitor, remediate, and scale GPU/NIC infrastructure?
Workload Optimization: How do you maximize performance for distributed AI training?
These reference architectures provide tested, reproducible solutions using AMD’s Kubernetes-native tools:
| Tool | Purpose |
|---|---|
|  | GPU driver lifecycle, device plugin, health monitoring |
|  | AINIC driver lifecycle, Multus CNI, RDMA networking |
|  | Prometheus metrics for GPU and NIC telemetry |
|  | AMD GPU Field Health Check - enterprise burn-in testing |
|  | Multi-node cluster validation - RCCL, IB perf, distributed training |
Reference Architecture Journey
The architectures are organized as a progressive journey from hardware intake to production AI workloads:
Phase 1: Hardware Foundation
| Reference | Description | Status |
|---|---|---|
|  | Kubernetes-based workflow for receiving, validating, and releasing GPU servers to production | 🚧 In Progress |
|  | Automate GPU node provisioning with MAAS + Cluster API | 🚧 Upcoming |
The GPU Server Intake & Validation reference architecture covers the Kubernetes-based validation workflow:
Part 1: Validation Cluster Setup - GPU and Network operator deployment, health check gates, burn-in testing
Part 2: Release Decision - AGFHC test results, pass/fail handling, node relabeling
Part 3: Production Handoff - Cluster migration, continuous monitoring, auto-remediation
Note: Physical intake (rack, cable, power) and OS provisioning are assumed to be handled by existing customer processes. A future MAAS-based provisioning reference is planned separately.
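The release decision in Part 2 can be pictured as a small gate: a node is released only when every validation test passes, and is otherwise held back for triage. The following is a minimal sketch under stated assumptions, not the workflow's actual implementation. The `CheckResult` type, the `example.com/intake-state` label key, and the label values are all hypothetical names introduced for illustration; they do not come from AGFHC or from any AMD operator.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RELEASE = "release"        # every gate passed; node may move to production
    QUARANTINE = "quarantine"  # a gate failed or no results exist; hold for triage


@dataclass(frozen=True)
class CheckResult:
    name: str     # hypothetical test identifier, e.g. "burn-in"
    passed: bool


# Hypothetical label key and values, chosen only for this illustration.
LABEL_KEY = "example.com/intake-state"


def decide(results: list[CheckResult]) -> Decision:
    """Release only if at least one result exists and all results passed."""
    if results and all(r.passed for r in results):
        return Decision.RELEASE
    return Decision.QUARANTINE


def node_labels(decision: Decision) -> dict[str, str]:
    """Map a decision to the label a relabeling step might apply to the node."""
    value = "production-ready" if decision is Decision.RELEASE else "needs-triage"
    return {LABEL_KEY: value}


def failed_checks(results: list[CheckResult]) -> list[str]:
    """List the names of failed checks, in input order, for a triage report."""
    return [r.name for r in results if not r.passed]
```

Treating an empty result set as a failure is a deliberate choice in this sketch: a node with no recorded test results has not been validated, so it should not be released by default.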
Phase 2: GPU Lifecycle Management
| Reference | Description | Status |
|---|---|---|
|  | Continuous health checks in production clusters using CVS workflows | 🚧 Upcoming |
|  | Automated detection, isolation, and recovery of unhealthy GPUs | 🚧 Upcoming |
Phase 3: Network Infrastructure
| Reference | Description | Status |
|---|---|---|
|  | Deploy GPU and Network operators together for high-performance AI training | 🚧 Upcoming |
|  | Configure Multus, SR-IOV, and RDMA for GPU-to-GPU communication | 🚧 Upcoming |
Phase 4: AI/ML Workloads
| Reference | Description | Status |
|---|---|---|
|  | Validate collective communication performance across GPU clusters | 🚧 Upcoming |
|  | End-to-end distributed training with optimized networking | 🚧 Upcoming |
|  | Production ML pipelines leveraging AMD hardware acceleration | 🚧 Upcoming |
Target Audience
Infrastructure Engineers: Building GPU-accelerated Kubernetes platforms
Platform Engineers: Operating and scaling AI/ML infrastructure
Site Reliability Engineers: Ensuring availability and performance of GPU clusters
DevOps Teams: Automating hardware lifecycle and workload deployment
Data Center Operators: Qualifying and onboarding AMD hardware at scale
Common Prerequisites
Before using these reference architectures, ensure your environment meets these requirements:
| Requirement | Specification |
|---|---|
| Kubernetes | v1.29+ |
| Operating System | Ubuntu 22.04 LTS or Ubuntu 24.04 LTS |
| Container Runtime | containerd with GPU support |
| GPU Hardware | AMD Instinct™ MI300X, MI325X, MI350X, or MI355X |
| NIC Hardware | AMD Pensando Pollara AINIC (for network architectures) |
| Tools | Helm v3.2+, kubectl, ROCm 6.0+ |
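
The version minimums in the prerequisites table above lend themselves to an automated pre-flight check. The sketch below is one possible way to compare a reported version string against those minimums. It assumes you have already collected the version strings yourself (for example from the output of `kubectl version` or `helm version`); the `MINIMUMS` mapping and both function names are hypothetical helpers, not part of any AMD tool.

```python
import re

# Minimum versions copied from the prerequisites table.
MINIMUMS = {
    "kubernetes": (1, 29),
    "helm": (3, 2),
    "rocm": (6, 0),
}


def parse_version(text: str) -> tuple:
    """Extract the first dotted number from strings such as 'v1.29.3' or '6.0'."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(component: str, reported: str) -> bool:
    """Return True if the reported version is at or above the table's minimum."""
    return parse_version(reported) >= MINIMUMS[component]
```

For example, `meets_minimum("kubernetes", "v1.28.9")` evaluates to `False`, while `meets_minimum("helm", "v3.2.0")` evaluates to `True`. Comparing versions as integer tuples avoids the string-comparison pitfall where `"1.9"` would sort above `"1.29"`.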