Operator Overview


The AMD GPU Operator consists of several key components that work together to manage AMD GPUs in Kubernetes clusters. This document provides an overview of each component and its role in the system.

Core Components

Controller Manager

The AMD GPU Operator Controller Manager is the central control component that manages the operator’s custom resources. Its primary responsibilities include:

Managing the DeviceConfig custom resource
Running reconciliation loops to maintain desired state
Coordinating driver installation, upgrades, and removal
Managing the lifecycle of dependent components (device plugin, node labeller, metrics exporter)
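As a rough illustration of what a reconciliation loop does, the sketch below compares a desired state with an observed state and lists the actions needed to close the gap. It is not the operator's implementation: the ClusterState type, the reconcile function, the action strings, and the "x.y.z" driver version are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterState:
    """Hypothetical stand-in for the driver and components on a node."""
    driver_version: Optional[str] = None
    components: set = field(default_factory=set)


def reconcile(desired: ClusterState, observed: ClusterState) -> list:
    """Return the actions that would move `observed` toward `desired`."""
    actions = []
    if observed.driver_version != desired.driver_version:
        if observed.driver_version is None:
            actions.append(f"install driver {desired.driver_version}")
        elif desired.driver_version is None:
            actions.append(f"remove driver {observed.driver_version}")
        else:
            actions.append(
                f"upgrade driver {observed.driver_version} -> {desired.driver_version}"
            )
    # Dependent components: deploy what is missing, remove what is unwanted.
    for name in sorted(desired.components - observed.components):
        actions.append(f"deploy {name}")
    for name in sorted(observed.components - desired.components):
        actions.append(f"remove {name}")
    return actions


if __name__ == "__main__":
    desired = ClusterState(
        "x.y.z", {"device-plugin", "node-labeller", "metrics-exporter"}
    )
    observed = ClusterState(None, {"device-plugin"})
    for action in reconcile(desired, observed):
        print(action)
```

A real controller would run a function like this repeatedly, triggered by changes to the custom resource or the cluster, and would stop emitting actions once the observed state matches the desired state.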

Node Feature Discovery (NFD)

The Node Feature Discovery (NFD) component automatically detects and labels nodes with AMD GPU hardware. Key features include:

Detection of AMD GPUs using PCI vendor and device IDs
Automatic node labeling with feature.node.kubernetes.io/amd-gpu: "true"
Hardware capability discovery and reporting
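The detection rule above can be sketched as a small function that maps a node's PCI devices to the label. This is only a conceptual model, not NFD's actual code or rule syntax. The function name, its input shape, and the device IDs in the usage example are hypothetical placeholders; 0x1002 is AMD's PCI vendor ID, and the label key comes from the text above.

```python
AMD_PCI_VENDOR_ID = 0x1002  # PCI vendor ID assigned to AMD
AMD_GPU_LABEL = "feature.node.kubernetes.io/amd-gpu"


def amd_gpu_labels(pci_devices, known_device_ids):
    """Return the node labels implied by the PCI devices found on a node.

    pci_devices: iterable of (vendor_id, device_id) integer pairs.
    known_device_ids: set of device IDs treated as AMD GPUs (caller-supplied;
    this sketch does not ship a real list).
    """
    for vendor_id, device_id in pci_devices:
        if vendor_id == AMD_PCI_VENDOR_ID and device_id in known_device_ids:
            return {AMD_GPU_LABEL: "true"}
    return {}


if __name__ == "__main__":
    # 0xAAAA is a placeholder device ID, not a real GPU model.
    print(amd_gpu_labels([(0x1002, 0xAAAA)], known_device_ids={0xAAAA}))
```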

Note

OpenShift clusters use a specialized NFD Operator that includes Red Hat optimizations for OpenShift environments.

Kernel Module Management (KMM)

The Kernel Module Management (KMM) Operator handles the lifecycle of GPU driver kernel modules. Its responsibilities include:

Loading, upgrading, and unloading host kernel modules
Managing containerized driver operations
Coordinating with the Controller Manager for driver lifecycle events

Note

Kubernetes: Uses the AMD-optimized KMM Operator provided by the GPU Operator Helm chart
OpenShift: Uses the Red Hat KMM Operator with OpenShift-specific optimizations

Component Interaction

The components work together in the following sequence:

1. NFD identifies worker nodes with AMD GPUs
2. Controller Manager processes DeviceConfig custom resources
3. KMM handles driver operations based on the DeviceConfig configuration
4. Device Plugin registers amd.com/gpu as an allocatable resource on each GPU node
5. Node Labeller adds detailed GPU information to node labels
6. Metrics Exporter provides ongoing monitoring

[Figure: Architecture diagram]

Plugins and Extensions

Device Plugin

The AMD GPU Device Plugin enables GPU resource allocation in Kubernetes:

Implements the Kubernetes Device Plugin API
Registers AMD GPUs as allocatable resources
Enables GPU resource requests and limits in pod specifications
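To show what a GPU request looks like in a pod specification, the sketch below builds a minimal Pod manifest that asks for the amd.com/gpu resource and prints it as JSON, which kubectl accepts alongside YAML. The helper name, pod name, and container image are hypothetical placeholders. Extended resources such as this one are set under limits; if requests is also given, it must equal the limit.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests `gpus` AMD GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The scheduler places the pod only on a node with
                    # enough unallocated amd.com/gpu capacity.
                    "resources": {"limits": {"amd.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder, not a published image.
    manifest = gpu_pod_manifest("gpu-test", "example.com/rocm-workload:latest")
    print(json.dumps(manifest, indent=2))
```

Assuming the script is saved as gpu_pod.py (a hypothetical file name), its output could be piped to the cluster with `python gpu_pod.py | kubectl apply -f -`.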

Node Labeller

The Node Labeller provides detailed GPU information through node labels:

Automatically detects GPU properties
Adds detailed GPU-specific labels to nodes
Enables fine-grained pod scheduling based on GPU capabilities
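Label-based scheduling comes down to exact-match comparison between a pod's nodeSelector and each node's labels. The sketch below models that matching in plain Python. The label keys and values under example.com/ are hypothetical; the labels the Node Labeller actually writes depend on the operator version and configuration.

```python
def matching_nodes(nodes, selector):
    """Return the names of nodes whose labels satisfy every selector entry.

    nodes: dict mapping node name -> dict of labels.
    selector: dict of label key -> required value (equality match only).
    """
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


if __name__ == "__main__":
    # Hypothetical label keys and values, for illustration only.
    nodes = {
        "worker-1": {"example.com/gpu.family": "A", "example.com/gpu.vram": "64G"},
        "worker-2": {"example.com/gpu.family": "B", "example.com/gpu.vram": "64G"},
        "worker-3": {},
    }
    print(matching_nodes(nodes, {"example.com/gpu.vram": "64G"}))
    print(matching_nodes(nodes, {"example.com/gpu.family": "A"}))
```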

Metrics Exporter

The Device Metrics Exporter provides monitoring capabilities:

Exports GPU metrics in Prometheus format
Monitors GPU utilization, temperature, and health
Enables integration with monitoring systems
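For readers unfamiliar with the Prometheus text format, the sketch below parses a few sample lines into (name, labels, value) tuples. It is a simplified reader for illustration: it ignores timestamps and does not unescape label values. The metric and label names in the sample are hypothetical, not the exporter's real metric names.

```python
import re

# metric_name, optional {label="value",...} block, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    # Hypothetical metric names, used only to show the format.
    sample = (
        "# HELP example_gpu_temperature_celsius GPU temperature\n"
        "# TYPE example_gpu_temperature_celsius gauge\n"
        'example_gpu_temperature_celsius{gpu_id="0",node="worker-1"} 47.5\n'
        "example_gpu_healthy 1\n"
    )
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

In practice a Prometheus server scrapes and parses this format itself; a hand-written parser like this is mainly useful for quick checks against an exporter endpoint.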

Test Runner

The Test Runner offers hardware validation, diagnostics, and benchmarking capabilities across various scenarios:

Automatically triggers configurable ROCm Validation Suite tests on unhealthy GPUs
Supports manually triggered or scheduled test execution within the Kubernetes cluster
Supports running tests as init containers within the GPU workload pod
Reports test results as Kubernetes events