Kubernetes Reference Architectures for AMD AI Infrastructure

This repository provides production-ready reference architectures for deploying and operating AMD AI infrastructure at scale on Kubernetes. Covering AMD Instinct™ GPUs, AMD Pensando AINICs, and AMD EPYC™ processors, these guides help organizations build, validate, and operate high-performance AI/ML datacenters from the ground up.

Why These Reference Architectures?

Deploying enterprise AI infrastructure involves more than just installing hardware. Organizations face challenges across the entire lifecycle:

  • Hardware Intake: How do you validate hundreds of GPU servers before production use?

  • Platform Foundation: How do you build a Kubernetes platform optimized for AMD hardware?

  • Day-2 Operations: How do you monitor, remediate, and scale GPU/NIC infrastructure?

  • Workload Optimization: How do you maximize performance for distributed AI training?

These reference architectures provide tested, reproducible solutions using AMD’s Kubernetes-native tools:

| Tool | Purpose |
| --- | --- |
| AMD GPU Operator | GPU driver lifecycle, device plugin, health monitoring |
| AMD Network Operator | AINIC driver lifecycle, Multus CNI, RDMA networking |
| Device Metrics Exporter | Prometheus metrics for GPU and NIC telemetry |
| AGFHC | AMD GPU Field Health Check: enterprise burn-in testing |
| CVS (Cluster Validation Suite) | Multi-node cluster validation: RCCL, IB perf, distributed training |

Reference Architecture Journey

The architectures are organized as a progressive journey from hardware intake to production AI workloads:

Phase 1: Hardware Foundation

| Reference | Description | Status |
| --- | --- | --- |
| GPU Server Intake & Validation | Kubernetes-based workflow for receiving, validating, and releasing GPU servers to production | 🚧 In Progress |
| Bare Metal Provisioning with MAAS | Automate GPU node provisioning with MAAS + Cluster API | 🚧 Upcoming |

The GPU Server Intake & Validation reference architecture covers the Kubernetes-based validation workflow:

  • Part 1: Validation Cluster Setup - GPU and Network Operator deployment, health check gates, burn-in testing

  • Part 2: Release Decision - AGFHC test results, pass/fail handling, node relabeling

  • Part 3: Production Handoff - Cluster migration, continuous monitoring, auto-remediation

Note: Physical intake (rack, cable, power) and OS provisioning are assumed to be handled by existing customer processes. A future MAAS-based provisioning reference is planned separately.
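The pass/fail handling and node relabeling in Part 2 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the workflow this repository ships: the label key `example.com/intake-state`, the state values, and the `CheckResult` type are all hypothetical, and the AGFHC result format is not modeled. The only real command it builds is `kubectl label node ... --overwrite`.

```python
from dataclasses import dataclass

# Hypothetical label key and state values, chosen only for illustration.
STATE_LABEL = "example.com/intake-state"


@dataclass(frozen=True)
class CheckResult:
    """One validation outcome for a node (hypothetical shape)."""
    name: str
    passed: bool


def release_decision(results):
    """Return (state, failed_check_names) for one node."""
    failed = [r.name for r in results if not r.passed]
    if not results:
        # No results yet, so no release decision can be made.
        return "pending", failed
    return ("quarantined" if failed else "released"), failed


def relabel_command(node, results):
    """Build the kubectl argument list that records the decision on the node."""
    state, _ = release_decision(results)
    return ["kubectl", "label", "node", node,
            f"{STATE_LABEL}={state}", "--overwrite"]
```

Returning the command as an argument list keeps the decision logic testable without a cluster; a caller could pass it to `subprocess.run` once the label scheme is agreed on.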

Phase 2: GPU Lifecycle Management

| Reference | Description | Status |
| --- | --- | --- |
| Day-2 GPU Lifecycle Management | Continuous health checks in production clusters using CVS workflows | 🚧 Upcoming |
| GPU Auto-Remediation with Argo Workflows | Automated detection, isolation, and recovery of unhealthy GPUs | 🚧 Upcoming |

Phase 3: Network Infrastructure

| Reference | Description | Status |
| --- | --- | --- |
| AMD GPU + AINIC Co-deployment | Deploy GPU and Network operators together for high-performance AI training | 🚧 Upcoming |
| RDMA Networking for Distributed Training | Configure Multus, SR-IOV, and RDMA for GPU-to-GPU communication | 🚧 Upcoming |

Phase 4: AI/ML Workloads

| Reference | Description | Status |
| --- | --- | --- |
| RCCL Testing Reference Architecture | Validate collective communication performance across GPU clusters | 🚧 Upcoming |
| Distributed PyTorch Training with RDMA | End-to-end distributed training with optimized networking | 🚧 Upcoming |
| MLOps Pipeline with Network-Optimized Training | Production ML pipelines leveraging AMD hardware acceleration | 🚧 Upcoming |

Target Audience

  • Infrastructure Engineers: Building GPU-accelerated Kubernetes platforms

  • Platform Engineers: Operating and scaling AI/ML infrastructure

  • Site Reliability Engineers: Ensuring availability and performance of GPU clusters

  • DevOps Teams: Automating hardware lifecycle and workload deployment

  • Data Center Operators: Qualifying and onboarding AMD hardware at scale

Common Prerequisites

Before using these reference architectures, ensure your environment meets these requirements:

| Requirement | Specification |
| --- | --- |
| Kubernetes | v1.29+ |
| Operating System | Ubuntu 22.04 LTS or Ubuntu 24.04 LTS |
| Container Runtime | containerd with GPU support |
| GPU Hardware | AMD Instinct™ MI300X, MI325X, MI350X, or MI355X |
| NIC Hardware | AMD Pensando Pollara AINIC (for network architectures) |
| Tools | Helm v3.2+, kubectl, ROCm 6.0+ |
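
The version minimums above (Kubernetes v1.29+, Helm v3.2+, ROCm 6.0+) lend themselves to a simple pre-flight check. The following Python sketch is one possible way to do this, not a tool provided by this repository: the function names and the `found` input mapping are hypothetical, and collecting the version strings from a live system (for example from `kubectl version` or `helm version`) is left to the caller.

```python
import re

# Minimum (major, minor) versions taken from the prerequisites listed above.
MINIMUMS = {"kubernetes": (1, 29), "helm": (3, 2), "rocm": (6, 0)}


def parse_version(text):
    """Extract (major, minor) from strings such as 'v1.29.3' or '6.0.2'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_prerequisites(found):
    """Return the components that are missing or older than the minimum.

    `found` maps a component name to its reported version string
    (a hypothetical input shape chosen for this sketch).
    """
    unmet = []
    for name, minimum in MINIMUMS.items():
        raw = found.get(name)
        # Tuples compare element by element, so (3, 1) < (3, 2) is True.
        if raw is None or parse_version(raw) < minimum:
            unmet.append(name)
    return unmet
```

For example, `unmet_prerequisites({"kubernetes": "v1.30.1", "helm": "v3.1.0", "rocm": "6.2.0"})` returns `["helm"]`, because only Helm is below its minimum.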