Auto Remediation of GPU nodes#
The GPU Operator provides automatic remediation for GPU worker nodes that become unhealthy due to GPU-related issues. When such problems are detected, the operator triggers a remediation workflow: a series of automated steps designed to restore the node to a healthy state. This functionality is powered by Argo Workflows, a lightweight and scalable open-source workflow engine for Kubernetes. Through the DeviceConfig Custom Resource, the GPU Operator offers extensive customization options for configuring remediation behavior.
Note: The auto node remediation feature currently supports bare metal deployments. Support for VM-based deployments is planned for an upcoming release.
Auto-Remediation Workflow Overview#
The following diagram illustrates the end-to-end flow of automatic remediation:
┌──────────────────────────────────────────────────────────────────┐
│                         GPU Worker Node                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌────────────────────────┐                                     │
│   │ Device Metrics         │                                     │
│   │ Exporter               │  Reports inband-RAS errors          │
│   └───────────┬────────────┘                                     │
│               │                                                  │
│               ▼                                                  │
│   ┌────────────────────────┐                                     │
│   │ Node Problem           │  Queries for inband-RAS errors      │
│   │ Detector (NPD)         │  and marks node condition as True   │
│   └───────────┬────────────┘                                     │
│               │                                                  │
└───────────────┼──────────────────────────────────────────────────┘
                │
                │  Node condition status update
                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         Controller Node                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌────────────────────────┐                                     │
│   │ GPU Operator           │  Observes node error conditions     │
│   └───────────┬────────────┘                                     │
│               │                                                  │
│               ▼                                                  │
│   ┌────────────────────────┐                                     │
│   │ Argo Workflow          │  Triggers remediation workflow      │
│   │ Controller             │  for the affected node              │
│   └───────────┬────────────┘                                     │
│               │                                                  │
└───────────────┼──────────────────────────────────────────────────┘
                │
                │  Executes remediation steps
                ▼
    Affected GPU Worker Node
The Node Problem Detector (NPD) maintains a unique node condition for each error type, enabling users to configure different remediation actions tailored to specific error conditions.
Note: The GPU Operator prevents multiple concurrent workflows on the same node. When a node is tainted and a workflow is already executing, no additional workflows will be triggered on that node until the current workflow completes.
Prerequisites#
Automatic node remediation requires the following components to be enabled and running on the cluster:
Device Metrics Exporter - Reports unhealthy metrics and inband-RAS errors that are used to detect faulty GPUs.
Node Problem Detector (NPD) - An open-source Kubernetes component that runs on all nodes to identify node issues and report them to upstream controllers in the Kubernetes management stack. For more information about NPD configuration, see the NPD documentation.
Note: For OpenShift users, please refer to the NPD Installation section for specific instructions on installing NPD in OpenShift clusters.
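Once both components are running, you can verify that NPD is publishing GPU-related node conditions by inspecting a node's condition list. This is a generic kubectl query; the exact condition names depend on your NPD configuration:
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'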
Installation - Vanilla Kubernetes#
The GPU Operator Helm installation includes the following Argo Workflows components:
Argo workflow controller (deployed as a Kubernetes deployment)
Argo CRDs for defining workflow templates and workflows
The GPU Operator installs Argo Workflows v4.0.3, using a customized installation YAML tailored for auto-remediation requirements. This customization excludes components not needed for remediation, such as the Argo workflow server. For more information about Argo Workflows concepts, refer to the official documentation.
Note: By default, the GPU Operator installs auto-remediation components (the workflow controller and CRDs) during Helm deployment. If Argo Workflows is already present in the cluster, you can skip installation of only the CRDs by setting:
--set remediation.installCRDs=false
To disable the auto node remediation feature entirely, use:
--set remediation.enabled=false
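For example, a Helm install that keeps remediation enabled but skips the bundled Argo CRDs (because Argo is already present) might look like the following; the repository and chart names here are illustrative, so use the ones from your GPU Operator installation guide:
helm install amd-gpu-operator rocm/gpu-operator-charts \
  -n kube-amd-gpu --create-namespace \
  --set remediation.enabled=true \
  --set remediation.installCRDs=false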
Installation - OpenShift#
For OpenShift users: To use the auto remediation feature, additional steps are required to install Argo Workflows on the OpenShift cluster. Two scenarios need consideration:
If using the OpenShift AI Operator with the DataScienceCluster CRD: Argo Workflows may already be deployed by the OpenShift AI Operator. If the CustomResourceDefinition workflows.argoproj.io already exists on the cluster, no additional installation is needed.
If not using the OpenShift AI Operator: Follow these steps to install Argo Workflows on your OpenShift cluster:
a. Install CRDs (must be executed separately due to CRD size):
oc apply --server-side --force-conflicts -k "https://github.com/argoproj/argo-workflows/manifests/base/crds/full?ref=v4.0.3"
b. Add the Argo Helm repository:
helm repo add argo https://argoproj.github.io/argo-helm --force-update
c. Install Argo Workflows using Helm:
helm install argo-workflow argo/argo-workflows \
-n argo-workflow \
--create-namespace \
--version=1.0.6 \
--set crds.install=false \
--set controller.instanceID.enabled=true \
--set controller.instanceID.explicitID=amd-gpu-operator-remediation-workflow \
--set 'controller.tolerations[0].key=amd-gpu-unhealthy' \
--set 'controller.tolerations[0].operator=Exists' \
--set 'controller.tolerations[0].effect=NoSchedule' \
--set 'controller.tolerations[1].key=amd-dcm' \
--set 'controller.tolerations[1].operator=Equal' \
--set 'controller.tolerations[1].value=up' \
--set 'controller.tolerations[1].effect=NoExecute'
Important: The controller.instanceID.explicitID value must be set to amd-gpu-operator-remediation-workflow. The GPU Operator labels every remediation workflow and workflow template it creates with workflows.argoproj.io/controller-instanceid: amd-gpu-operator-remediation-workflow. An Argo workflow-controller only reconciles workflows whose controller-instanceid label matches its configured instanceID, so without this setting the Helm-installed controller will silently ignore the operator's workflows. Refer to the Argo Workflows controller instanceID documentation and the argo-workflows chart values for full details.
Important: The controller.tolerations entry for amd-gpu-unhealthy:NoSchedule is required. During remediation the GPU Operator taints the affected node with amd-gpu-unhealthy:NoSchedule (see NodeRemediationTaints). If the workflow-controller pod happens to be scheduled on a node that later gets tainted, it will be evicted and remediation will stall. Adding this toleration ensures the controller keeps running on tainted nodes so it can continue driving the workflow to completion. The same toleration is applied to the in-tree workflow controller, metrics-exporter, and other operator-managed components.
Important: The controller.tolerations entry for amd-dcm=up:NoExecute is also required when the Device Config Manager (DCM) is used for GPU partitioning on the cluster. DCM taints the node with amd-dcm=up:NoExecute while applying a partition profile (see Applying Partition Profiles) to evict non-essential workloads from the GPU node. Without this toleration, the Argo workflow-controller pod would be evicted from any GPU node undergoing a partition change, and any in-flight remediation workflow on that node would stall. Tolerating the DCM taint lets the controller continue driving the workflow even while DCM is reconfiguring partitions.
The same toleration must also be added to the Kernel Module Management (KMM) operator's Helm chart when KMM is installed separately on OpenShift. KMM is responsible for (re)building and loading the GPU driver kernel module on the node after a remediation reboot; if its controller pod cannot tolerate amd-gpu-unhealthy:NoSchedule, it may be evicted from a tainted node and the driver will never be reloaded, blocking the post-reboot validation step of the workflow. When installing the KMM operator via Helm, pass the equivalent flags (the exact key path depends on the KMM chart you use, e.g. controller.manager.tolerations for the upstream chart):
--set 'controller.manager.tolerations[0].key=amd-gpu-unhealthy' \
--set 'controller.manager.tolerations[0].operator=Exists' \
--set 'controller.manager.tolerations[0].effect=NoSchedule'
If you prefer a values file over --set, the equivalent block is:
controller:
  instanceID:
    enabled: true
    explicitID: amd-gpu-operator-remediation-workflow
  tolerations:
    - key: amd-gpu-unhealthy
      operator: Exists
      effect: NoSchedule
    - key: amd-dcm
      operator: Equal
      value: up
      effect: NoExecute
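With this block saved as values.yaml, the Helm command from step (c) reduces to:
helm install argo-workflow argo/argo-workflows \
  -n argo-workflow \
  --create-namespace \
  --version=1.0.6 \
  --set crds.install=false \
  -f values.yaml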
Configuration and customization#
Device Config configuration#
The DeviceConfig Custom Resource includes a RemediationWorkflowSpec section for configuring and customizing the auto-remediation feature:
remediationWorkflow:
  # Enable auto node remediation feature for AMD GPU Operator. Disabled by default.
  # Set to true to activate automatic remediation workflows when GPU issues are detected.
  enable: true
  # ConfigMap containing mappings between node conditions and remediation workflows.
  # If not specified, the operator uses the default 'default-conditional-workflow-mappings' ConfigMap.
  # The ConfigMap defines which workflow template to execute for each specific error condition.
  config:
    name: configmapName
  # Time-to-live duration for retaining failed workflow objects and pods before cleanup.
  # Accepts duration strings like "5h", "24h", "30m", "1h30m". Default is 24 hours.
  # Retaining failed workflows allows for post-mortem analysis and troubleshooting.
  ttlForFailedWorkflows: 5h
  # Container image used for executing GPU validation tests during remediation workflows.
  # This image runs test suites to verify GPU health after remediation completes.
  # Default image supports only RVS tests. Contact AMD for an AGFHC-enabled test runner.
  testerImage: docker.io/rocm/test-runner:v1.4.1
  # Maximum number of remediation workflows that can execute concurrently across the cluster.
  # Helps maintain minimum node availability by preventing excessive simultaneous remediations.
  # A value of 0 (default) means no limit is enforced. Excess workflows are queued as Pending.
  maxParallelWorkflows: 0
  # Custom taints to apply to nodes during the remediation process.
  # If not specified, the operator applies the default taint 'amd-gpu-unhealthy:NoSchedule'.
  # Taints prevent new workload scheduling on affected nodes during remediation.
  nodeRemediationTaints:
    - key:    # Taint key (e.g., 'amd-gpu-unhealthy')
      value:  # Taint value (e.g., specific error condition)
      effect: # Taint effect (e.g., 'NoSchedule', 'NoExecute', 'PreferNoSchedule')
  # Custom labels to apply to nodes during automatic remediation workflows.
  # These labels persist throughout the remediation process and can be used for
  # monitoring, tracking, or applying custom policies.
  nodeRemediationLabels:
    label-one-key: label-one-val
    label-two-key: label-two-val
  # Configuration for pod eviction behavior when draining workloads from nodes.
  # Controls how pods are removed during remediation, including timeouts, grace periods,
  # and namespace exclusions to protect critical infrastructure.
  nodeDrainPolicy:
    # Enable forced draining of pods that do not respond to standard termination signals.
    # When true, pods that cannot be evicted gracefully will be forcibly removed.
    force: false
    # Maximum time in seconds to wait for the drain operation to complete.
    # A value of 0 means infinite timeout. Default is 300 seconds (5 minutes).
    timeoutSeconds: 300
    # Grace period in seconds for pods to shut down gracefully after termination signal.
    # Overrides each pod's terminationGracePeriodSeconds. Use -1 to respect pod settings.
    gracePeriodSeconds: 60
    # When true, DaemonSet-managed pods are excluded from the drain operation.
    # DaemonSets are designed to run on all nodes and will automatically reschedule.
    ignoreDaemonSets: true
    # List of namespaces to exclude from pod eviction during drain operation.
    # Pods in these namespaces remain on the node, allowing critical infrastructure
    # components to continue operating throughout the remediation process.
    ignoreNamespaces:
      - kube-system
      - cert-manager
  # Controls whether the remediation workflow starts automatically. Default is true.
  # If true, the workflow starts as soon as the matching node condition is detected.
  # If false, the workflow is created in a suspended state and must be started manually,
  # giving users control over when remediation actually begins.
  autoStartWorkflow: true
  # Container image that packages the remediation ConfigMap. When set, the operator
  # runs a Job using this image to apply the ConfigMap to the cluster before the
  # remediation workflow proceeds.
  configMapImage: yourregistry/configmap-image:version
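In a full DeviceConfig Custom Resource, this block sits under spec. A minimal sketch is shown below; the apiVersion and namespace are assumptions and should match the CRD and namespace used by your GPU Operator installation:
apiVersion: amd.com/v1alpha1   # assumed; verify against your installed CRD
kind: DeviceConfig
metadata:
  name: gpu-device-config
  namespace: kube-amd-gpu      # assumed operator namespace
spec:
  remediationWorkflow:
    enable: true
    ttlForFailedWorkflows: 24h
    maxParallelWorkflows: 2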
Enable - Controls whether automatic node remediation is enabled. Set this field to true to activate the auto-remediation feature in the cluster.
Config - References a ConfigMap that contains mappings between node conditions and their corresponding remediation workflows. The GPU Operator automatically creates a default-conditional-workflow-mappings ConfigMap with predefined mappings. Users can either modify this default ConfigMap or create their own custom ConfigMap. If left empty, the default ConfigMap is used automatically. See the Remediation Workflow ConfigMap section below for details.
Note: The default-conditional-workflow-mappings ConfigMap is created automatically by the GPU Operator.
TtlForFailedWorkflows - Defines the time-to-live (TTL) duration for retaining failed workflow objects and their associated pods before automatic cleanup. This field accepts a duration string in standard formats (e.g., “24h”, “30m”, “1h30m”). Retaining failed workflows allows for post-mortem analysis and troubleshooting. Once the specified duration expires, the workflow resources are automatically garbage collected by the system. The default retention period is 24 hours.
TesterImage - Specifies the container image for executing GPU validation tests during remediation workflows. This image must align with Spec.TestRunner.Image specifications and runs test suites to verify GPU health after remediation completion. If unspecified, the default image is docker.io/rocm/test-runner:v1.4.1.
Note: The default image supports only RVS test execution. For AGFHC test framework support within workflows, contact your AMD representative to obtain access to the AGFHC-enabled test runner image.
MaxParallelWorkflows - Limits the maximum number of remediation workflows that can execute concurrently across the cluster. This setting helps maintain minimum node availability by preventing excessive simultaneous remediation operations. A value of zero (default) means no limit is enforced.
When the number of triggered workflows exceeds this limit, additional workflows are queued by the Argo workflow controller in a Pending state. Queued workflows remain pending until an active workflow completes, freeing a slot within the configured parallelism limit.
NodeRemediationLabels - Defines custom labels to be applied to nodes during automatic remediation workflows. These labels persist throughout the remediation process and can be used for monitoring, tracking, or applying custom policies.
NodeRemediationTaints - Specifies custom taints to be applied to nodes during the remediation process. If no taints are specified, the Operator applies the default taint amd-gpu-unhealthy:NoSchedule to prevent workload scheduling on the affected node.
NodeDrainPolicy - Configures the pod eviction behavior when draining workloads from nodes during the remediation process. This policy controls how pods are removed, including timeout settings, grace periods, and namespace exclusions. See the Node Drain Policy Configuration section below for detailed field descriptions.
AutoStartWorkflow - Specifies the startup behavior of the remediation workflow. Default value is true. If true, the remediation workflow starts automatically when the node condition matches. If false, the remediation workflow remains in a suspended state when the node condition matches and must be started manually by the user. To resume the workflow at a later point, refer to the Resume or Abort a Paused Workflow section below.
Spec.CommonConfig.UtilsContainer - The remediation workflow uses a utility image for executing its steps. Specify the utility image in the Spec.CommonConfig.UtilsContainer section of the DeviceConfig. If the UtilsContainer section is not specified, the default image docker.io/rocm/gpu-operator-utils:latest is used.
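A minimal sketch of that section, assuming the YAML field names mirror the spec path above:
spec:
  commonConfig:
    utilsContainer:
      image: docker.io/rocm/gpu-operator-utils:latest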
Node Drain Policy Configuration#
The NodeDrainPolicy field accepts a DrainSpec object with the following configurable parameters:
Force - Enables forced draining of pods that do not respond to standard termination signals. When set to true, pods that cannot be evicted gracefully will be forcibly removed. Default value is false.
TimeoutSeconds - Specifies the maximum time in seconds to wait for the drain operation to complete before giving up. A value of zero means infinite timeout, allowing the drain operation to continue indefinitely. Default value is 300 seconds (5 minutes).
GracePeriodSeconds - Defines the grace period in seconds that Kubernetes allows for a pod to shut down gracefully after receiving a termination signal. This value overrides the pod’s configured terminationGracePeriodSeconds. A value of -1 uses each pod’s own grace period setting. Default value is -1.
IgnoreDaemonSets - When set to true, DaemonSet-managed pods are excluded from the drain operation. This is typically desired since DaemonSets are designed to run on all nodes and will automatically reschedule on the same node. Default value is true.
IgnoreNamespaces - Defines a list of namespaces to exclude from pod eviction during the drain operation. Pods running in these namespaces will remain on the node, allowing critical infrastructure components to continue operating throughout the remediation process. By default, the following namespaces are excluded: kube-system, cert-manager, and the GPU Operator’s namespace.
Other Configuration options#
NPD Configuration - NPD configuration is explained in more detail in this section. The Node Problem Detector (NPD) DaemonSet must continue running during workflow execution to verify issue resolution. Add the following toleration to the NPD DaemonSet:
amd-gpu-unhealthy:NoSchedule op=Exists
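Expressed as a pod-spec toleration in the NPD DaemonSet manifest, this shorthand corresponds to:
tolerations:
  - key: amd-gpu-unhealthy
    operator: Exists
    effect: NoSchedule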
The GPU Operator automatically applies this toleration to internal components such as KMM and metrics-exporter, ensuring they continue running during workflow execution.
Failed Workflow Handling - If a remediation workflow fails, the affected node remains in a tainted state. To manually restore the node to a schedulable state for workloads, remove the taint using the following command:
kubectl taint node <node-name> amd-gpu-unhealthy:NoSchedule-
Remediation Workflow ConfigMap#
The remediation ConfigMap defines error-to-workflow mappings. The default ConfigMap is derived from the AMD Service Action Guide (SAG). Each entry maps a unique error code (AFID) to a remediation workflow, specifying the Argo Workflow template to run and any workflow-specific parameters. For details on AFID values and event lists, see the AMD Instinct AFID Event List documentation.
The ConfigMap can be provided in one of the following ways:
ConfigMap Image (recommended) - Set spec.remediationWorkflow.configMapImage in the DeviceConfig to reference a container image that packages the ConfigMap. The operator runs a Job from this image to create the ConfigMap in the cluster. This decouples the SAG version from the operator version, allowing the ConfigMap to be updated independently.
User-created ConfigMap - Create a ConfigMap manually and reference it via spec.remediationWorkflow.config.name in the DeviceConfig. The operator will use the referenced ConfigMap as-is and will not modify or delete it during cleanup.
Note: If neither configMapImage nor config is specified, the operator will not create a default ConfigMap and remediation will not proceed until one is provided.
Example Error Mapping Section#
The following example demonstrates a complete error mapping configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: <CM_NAME>
  namespace: <CM_NAMESPACE>
data:
  Version: 0.9.0
  workflow: |
    - nodeCondition: AMDGPUXgmi
      workflowTemplate: default-template
      validationTestsProfile:
        framework: AGFHC
        recipe: all_lvl4
        iterations: 1
        stopOnFailure: true
        timeoutSeconds: 4800
      physicalActionNeeded: true
      notifyRemediationMessage: >-
        Remove GPU tray from node.
        Confirm that all four screws on all eight OAMs are torqued
        as described in OAM Removal and Installation guide.
        Re-install the GPU tray into node.
      notifyTestFailureMessage: >-
        Remove the failing UBB assembly and return to AMD, along with the
        relevant failure details: at a minimum this should be the RF event
        that indicated the original fail, and if that RF event includes an
        additional data URI, the CPER and/or the decoded JSON from the CPER
        as pointed by the additional data. Install a new or known-good UBB
        assembly to the GPU tray.
      recoveryPolicy:
        maxAllowedRunsPerWindow: 3
        windowSize: 15m
ConfigMap Field Descriptions#
nodeCondition - Specifies a unique description for an error code (AFID). This value must match the corresponding node condition defined in the Node Problem Detector (NPD) configuration.
workflowTemplate - Defines the Argo Workflows template to execute for this specific error condition. The default-template is used by default and provides comprehensive remediation steps (detailed below). While users can create and reference custom Argo workflow templates in the cluster, it is recommended to use the operator-managed default-template for consistency and maintainability.
validationTestsProfile - Specifies the test framework and test suite to execute for validating GPU health after remediation. Supported frameworks include AGFHC and RVS. All fields under validationTestsProfile are mandatory and correspond to the parameters documented in the Test Runner Documentation.
physicalActionNeeded - Indicates whether manual physical intervention is required on the node (e.g., RMA of faulty GPU, hardware inspection, etc.). Specific actions are detailed in the notifyRemediationMessage field for each error condition. For issues resolved by a reboot, this field is set to false.
notifyRemediationMessage - Provides detailed instructions for physical or manual actions when physicalActionNeeded is true. This message guides administrators through the required remediation steps to resolve the fault.
notifyTestFailureMessage - Contains instructions to be displayed when validation tests fail after remediation attempts. This message typically includes escalation procedures and diagnostic information requirements.
recoveryPolicy - Defines limits on remediation attempts to prevent excessive recovery cycles. Includes maxAllowedRunsPerWindow (maximum retry attempts) and windowSize (time window for counting attempts). Once the number of remediation workflows crosses maxAllowedRunsPerWindow, no new workflow is triggered for the same node condition within windowSize. After the window elapses, if the issue still persists, a new remediation workflow is allowed to start again.
skipRebootStep - Controls whether the node reboot step is executed during the remediation workflow. The default workflow template includes an automatic reboot step to reinitialize GPU hardware after performing the recommended remediation actions. Set this field to true to skip the reboot step when the node has already been rebooted manually as part of the remediation process or when a reboot is not desired for the specific error condition. Default value is false.
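For instance, a mapping for a condition that is cleared by validation alone (no reboot, no physical action) could look like the following; the condition name and test recipe here are placeholders, not values from the default SAG mappings:
- nodeCondition: SomeRebootFreeCondition   # placeholder; must match an NPD condition
  workflowTemplate: default-template
  validationTestsProfile:
    framework: RVS
    recipe: some_rvs_recipe                # placeholder; see Test Runner documentation
    iterations: 1
    stopOnFailure: true
    timeoutSeconds: 1800
  physicalActionNeeded: false
  skipRebootStep: true                     # node was already rebooted manually
  recoveryPolicy:
    maxAllowedRunsPerWindow: 3
    windowSize: 15m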
Remediation of Partitioned GPUs#
The auto node remediation feature fully supports nodes with partitioned GPUs. When GPUs are partitioned using the Device Config Manager (DCM) with compute and memory partition profiles (e.g., CPX+NPS4), the remediation workflow operates seamlessly on these nodes.
Important: After the remediation workflow completes, the GPU partition profile on the affected node is reset to the default SPX+NPS1 mode (no partitions). Users must manually re-apply the desired partition profile on the remediated node by following the steps described in the GPU Partitioning via DCM documentation.
Default Workflow Template#
Note: The default-template is automatically created on the cluster by the GPU Operator.
The default-template workflow performs the following remediation steps:
1. Label Node - Applies custom labels to the node as specified in the NodeRemediationLabels field of the DeviceConfig Custom Resource. If no labels are configured, this step is skipped and the workflow proceeds to the next step.
2. Taint Node - Applies a taint with key = "AMD_GPU_Unhealthy", op = equal, value = node_condition, effect = noSchedule to prevent new workload scheduling.
3. Drain Workloads - Evicts all pods utilizing AMD GPUs from the affected node.
4. Notify Administrator - Generates a Kubernetes event to notify the administrator if manual intervention is required for the detected issue.
5. Suspend Workflow - Pauses workflow execution pending manual intervention or automatic resumption based on configured policies.
6. Reboot Node - Issues a reboot command on the affected node to clear transient errors and reinitialize GPU hardware. This step exits gracefully after triggering the reboot, ensuring the workflow pod is not disrupted by the node shutdown.
7. Wait for Node Ready - Monitors the rebooted node until it comes back online and reports a Ready condition in the Kubernetes cluster before proceeding to validation.
8. Validate GPUs - Executes AGFHC/RVS validation tests to confirm GPU health after reboot.
9. Verify Condition - Confirms that the triggering node condition has been resolved (status changed to False).
10. Remove Taint - Removes the node taint to restore GPU availability for workload scheduling.
11. Remove Labels - Removes all custom labels that were applied to the node in Step 1, restoring the node to its original label state.
Each workflow step is executed as a separate Kubernetes pod. For advanced use cases, users can create custom workflow templates using the Argo CRDs available on the cluster and reference them in the ConfigMap.
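A custom template is an ordinary Argo WorkflowTemplate. A minimal skeleton is sketched below under two assumptions: the controller-instanceid label must match the operator's instance ID (see the OpenShift installation notes above), and the single step shown is purely illustrative; it does not reflect the parameters the operator passes to the default-template, which your custom template would need to accept:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: custom-remediation-template   # referenced via workflowTemplate in the ConfigMap
  labels:
    workflows.argoproj.io/controller-instanceid: amd-gpu-operator-remediation-workflow
spec:
  entrypoint: remediate
  templates:
    - name: remediate
      container:
        image: busybox               # illustrative step image
        command: ["sh", "-c", "echo remediation step placeholder"]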
While most workflow steps are self-explanatory, Steps 4, 5, and 8 require additional clarification.
Workflow Step 4: Physical Intervention Check#
According to the AMD service action guide, certain GPU issues require physical intervention (e.g., checking wiring, securing screws, retorquing connections). When such conditions are detected, the workflow generates a Kubernetes event to notify the administrator of the required physical action before suspending at this step. Whether a condition requires physical action is flagged by the physicalActionNeeded field in the corresponding ConfigMap mapping, and the action itself is described in that mapping's notifyRemediationMessage field.
This step enables administrators to identify nodes awaiting physical intervention. After completing the necessary physical repairs, administrators can resume the workflow for validation using the label described in Workflow Step 5.
Querying Remediation Events#
The remediation workflow generates Kubernetes events at key stages to notify administrators of workflow progress. These events can be queried using:
kubectl get events -n <amdgpu-operator-namespace> --field-selector involvedObject.kind=Node
The following event types are generated:
amd-gpu-remediation-required - Generated before the workflow suspends, indicating that a node condition has been detected and remediation is required. For conditions requiring physical intervention, the event message describes the specific action needed.
amd-gpu-remediation-succeeded - Generated when the remediation workflow completes successfully and all GPU validation tests pass.
amd-gpu-remediation-failed - Generated when GPU validation tests fail after the remediation attempt. The event message includes details about the failure and the affected node.
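Assuming these identifiers surface as the event Reason (as the naming suggests), you can filter for a specific stage, for example:
kubectl get events -n <amdgpu-operator-namespace> \
  --field-selector involvedObject.kind=Node,reason=amd-gpu-remediation-required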
Workflow Step 5: Workflow Suspension and Resumption#
The GPU Operator determines whether to automatically resume the workflow after it pauses in Step 4. This pause accommodates scenarios requiring manual intervention. The workflow may remain suspended in two primary cases:
Excessive Remediation Attempts: When a RecoveryPolicy is configured in the ConditionalWorkflowMappings ConfigMap, it defines the maximum remediation attempts allowed within a specified time window. Nodes exceeding this threshold will have their workflows paused indefinitely until manual resumption.
Physical Action Required: When a physical action is specified for a workflow in the ConditionalWorkflowMappings ConfigMap, the workflow pauses at this step, allowing administrators to perform the required maintenance. A notification event is generated to alert the user.
If neither condition applies, the workflow automatically resumes without manual intervention.
Resume or Abort a Paused Workflow#
To resume a suspended workflow, apply the label operator.amd.com/gpu-force-resume-workflow=true to the affected node. The operator detects this label and resumes workflow execution.
To abort the workflow entirely, apply the label operator.amd.com/gpu-abort-workflow=true to the node. This keeps the node in a tainted state for manual remediation. This option is useful when automatic remediation is no longer desired and the workflow should be deleted while paused.
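Both actions are plain node labels, so they can be applied with kubectl:
# Resume a suspended workflow after completing manual intervention
kubectl label node <node-name> operator.amd.com/gpu-force-resume-workflow=true
# Abort the paused workflow and leave the node tainted for manual remediation
kubectl label node <node-name> operator.amd.com/gpu-abort-workflow=true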
Workflow Step 8: GPU Validation Testing#
This step executes comprehensive GPU health validation tests using the test runner:
Test Profile Configuration: The test profile for each node condition is specified in the validationTestsProfile field within the ConfigMap.
Test Execution: The workflow creates a Kubernetes Job that launches a test runner container. This container retrieves and executes the specified test profile.
Result Verification: The workflow evaluates test results and only proceeds if all tests pass successfully. If any test fails, the entire workflow terminates with a failure status.
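To follow a remediation run end to end, you can watch the Workflow objects and their step pods directly. The namespace in which workflows are created depends on your deployment; the workflows.argoproj.io/workflow label is the standard label Argo applies to a workflow's step pods:
kubectl get workflows -n <amdgpu-operator-namespace> -w
kubectl get pods -n <amdgpu-operator-namespace> -l workflows.argoproj.io/workflow=<workflow-name>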