GPU Operator v1.0.0 Release Notes#
This release is the first major release of AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.
Release Highlights#
- Manage AMD GPU drivers with desired versions on Kubernetes cluster nodes 
- Customized scheduling of AMD GPU workloads within a Kubernetes cluster 
- Metrics and statistics monitoring solution for AMD GPU hardware and workloads 
- Support for specialized networking environments such as HTTP proxy or air-gapped networks 
Hardware Support#
New Hardware Support#
- AMD Instinct™ MI300 - Required driver version: ROCm 6.2+ 
- AMD Instinct™ MI250 - Required driver version: ROCm 6.2+ 
- AMD Instinct™ MI210 - Required driver version: ROCm 6.2+ 
Platform Support#
New Platform Support#
- Kubernetes 1.29+ 
  - Supported features: 
    - Driver management 
    - Workload scheduling 
    - Metrics monitoring 
  - Requirements: Kubernetes version 1.29+ 
 
Breaking Changes#
Not Applicable as this is the initial release.
New Features#
Feature Category#
- Driver management 
  - Managed Driver Installations: Users can install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes, or optionally choose to use the inbox or pre-installed driver on the worker nodes 
  - DeviceConfig Custom Resource: Users can configure a new DeviceConfig CRD (Custom Resource Definition) to define the driver management behavior of the GPU Operator 
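As an illustrative sketch only, a DeviceConfig custom resource might look like the following. The resource name and namespace match the kubectl patch example later in these notes, but the apiVersion and spec field names here are assumptions for illustration and may not match the shipped CRD schema:

```yaml
# Hypothetical DeviceConfig sketch -- apiVersion and spec fields are
# illustrative assumptions, not the authoritative CRD schema.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true      # let the Operator manage the DKMS driver install
    version: "6.2"    # desired ROCm driver version (6.2+ supported)
  metricsExporter:
    enable: true      # optionally enable the Device Metrics Exporter
```

Consult the generated CRD (kubectl explain deviceconfig) for the authoritative field names.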
 
- GPU Workload Scheduling 
  - Custom Resource Allocation “amd.com/gpu”: After the deployment of the GPU Operator, a new custom resource allocation, amd.com/gpu, will be present on each GPU node, listing the allocatable GPU resources against which GPU workloads can be scheduled 
  - Assign Multiple GPUs: Users can easily specify the number of AMD GPUs required by each workload in the deployment/pod spec, and the Kubernetes scheduler will automatically take care of assigning the correct GPU resources 
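For example, a pod can request GPUs through the amd.com/gpu resource in a standard Kubernetes pod spec; the pod name and container image below are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-workload            # placeholder name
spec:
  containers:
  - name: app
    image: rocm/pytorch:latest   # placeholder image
    resources:
      limits:
        amd.com/gpu: 2           # request two AMD GPUs on one node
```

The scheduler will only place this pod on a node whose amd.com/gpu allocatable count can satisfy the request.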
 
- Metrics Monitoring for GPUs and Workloads 
  - Out-of-box Metrics: Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator, providing a robust out-of-box monitoring solution for Prometheus to consume 
  - Custom Metrics Configurations: Users can utilize a ConfigMap to customize the configuration and behavior of the Device Metrics Exporter 
 
- Specialized Network Setups 
  - Air-gapped Installation: Users can install the GPU Operator in a secure air-gapped environment where the Kubernetes cluster has no external network connectivity 
  - HTTP Proxy Support: The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an HTTP proxy. Support for HTTPS proxy will be added in a future version of the GPU Operator. 
 
Known Limitations#
- GPU Operator driver installs only the DKMS package 
  - Impact: Applications that require ROCm packages will need to install the respective packages themselves. 
  - Affected Configurations: All configurations 
  - Workaround: None, as this is the intended behavior 
 
- When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install 
  - Impact: The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg 
  - Affected Configurations: Nodes with driver version >= ROCm 6.2.x 
  - Workaround: Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+ 
 
- GPU Operator is unable to install the amdgpu driver if an existing driver is already installed 
  - Impact: Driver install will fail if the amdgpu in-box driver is present/already installed 
  - Affected Configurations: All configurations 
  - Workaround: When installing the amdgpu driver using the GPU Operator, worker nodes should have amdgpu blacklisted, or amdgpu drivers should not be pre-installed on the node. Blacklist the in-box driver so that it is not loaded, or remove the pre-installed driver 
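Blacklisting the in-box module is a standard modprobe configuration rather than an Operator-specific mechanism; one common sketch is:

```
# /etc/modprobe.d/blacklist-amdgpu.conf
# Prevent the in-box amdgpu module from loading at boot so the
# Operator-managed driver can be installed instead.
blacklist amdgpu
```

Depending on the distribution, the initramfs may need to be regenerated (for example, update-initramfs -u on Debian/Ubuntu) and the node rebooted for the blacklist to take effect.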
 
- When the GPU Operator is used in skip-driver-install mode, removing the amdgpu module while the device plugin is installed will not be reflected in the active GPUs available on the server 
  - Impact: Workload scheduling is affected, as workloads may be scheduled on nodes that do not have an active GPU. 
  - Affected Configurations: All configurations 
  - Workaround: Restart the deployed device plugin pod. 
 
- Worker nodes where the kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed 
  - Impact: Node upgrade will not proceed automatically and requires manual intervention 
  - Affected Configurations: All configurations 
  - Workaround: Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off: 
 - kubectl cordon <node-name> 
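Beyond cordoning, the take-out/re-add cycle described above typically also involves draining and uncordoning the node. A sketch using standard kubectl commands follows; the node name is a placeholder and the drain flags depend on your workloads:

```shell
# Stop new pods from being scheduled on the node
kubectl cordon <node-name>

# Evict existing pods (flags depend on your DaemonSets / local storage)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ...upgrade the kernel and reboot the node...

# Allow scheduling on the node again once it has rejoined the cluster
kubectl uncordon <node-name>
```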
- When the GPU Operator is installed with the Metrics Exporter enabled, driver upgrade is blocked because the exporter is actively using the amdgpu module 
  - Impact: Driver upgrade is blocked 
  - Affected Configurations: All configurations 
  - Workaround: Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows: 
    - Label all nodes with a new label: kubectl label nodes --all amd.com/device-metrics-exporter=true 
    - Patch the DeviceConfig to include the new selectors for the metrics exporter: kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}' 
    - Remove the amd.com/device-metrics-exporter label from the specific node on which you want to disable the exporter: kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-