Resource Health Monitoring#

The device plugin includes a resource health monitoring system that reports the health status of network devices in real time. It obtains health updates from an external metrics service over gRPC, so that Kubernetes can take device health into account when allocating devices and scheduling pods.

Key Features#

Health Monitoring Capabilities#

Real-time Health Polling: The device plugin continuously polls the health status of NIC devices.
Kubelet Integration: Health status is published to kubelet as heartbeat signals.
Automated Response: When a NIC is found to be unhealthy, the system automatically:
- Marks the device as unhealthy in the device plugin
- Prevents new pod allocations to the unhealthy device
Important Limitation: When a NIC is marked as unhealthy, kubelet does not automatically evict or reschedule existing pods that use that device. Intervention is required through one of the following approaches:
- Manual Eviction: Administrators evict the affected pods so that the scheduler can reschedule them onto healthy device resources
- Application-Level Handling: Applications implement health detection logic that triggers self-eviction when device issues are detected
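
As a starting point for manual eviction, the sketch below lists the pods on a node that request the NIC resource. It is an illustration only: the function name pods_using_resource is hypothetical, and the input is assumed to be the JSON printed by kubectl get pods --all-namespaces -o json.

```python
import json


def pods_using_resource(pod_list_json: str, node_name: str,
                        resource: str = "amd.com/nic") -> list[str]:
    """Return "namespace/name" for each pod on node_name that requests resource."""
    pods = json.loads(pod_list_json).get("items", [])
    matches = []
    for pod in pods:
        spec = pod.get("spec", {})
        if spec.get("nodeName") != node_name:
            continue  # pod is scheduled on a different node (or not scheduled yet)
        for container in spec.get("containers", []):
            resources = container.get("resources") or {}
            # Merge requests and limits; for extended resources they are equal.
            requested = {**(resources.get("requests") or {}),
                         **(resources.get("limits") or {})}
            if int(requested.get(resource, "0")) > 0:
                meta = pod["metadata"]
                matches.append(f'{meta.get("namespace", "default")}/{meta["name"]}')
                break  # one matching container is enough to list the pod
    return matches
```

The result covers every pod on the node that requests the resource, because the pod spec alone does not say which physical NIC was assigned. An administrator would still need to decide which of these pods to remove, for example with kubectl delete pod <pod_name> -n <namespace>.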

Configuration#

Enabling Health Monitoring#

Health monitoring is enabled by setting enableExporterHealthCheck to true in the device plugin configuration:

{
  "resourceList": [{
    "resourceName": "nic", 
    "resourcePrefix": "amd.com", 
    "enableExporterHealthCheck": true,
    "selectors": {
      "vendors": ["1dd8"], 
      "devices": ["1002"],
      "drivers": ["ionic"],
      "excludeTopology": true,
      "isRdma": true
    }
  }]
}

This ConfigMap is generated automatically by the Network Operator during Helm installation. To enable or disable health monitoring, edit the enableExporterHealthCheck setting in the ConfigMap.
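
To confirm which resources have health monitoring turned on, the configuration can be inspected programmatically. The sketch below parses the JSON shown above; the function name health_check_settings is hypothetical, and treating a missing enableExporterHealthCheck key as disabled is an assumption rather than documented plugin behavior.

```python
import json


def health_check_settings(config_text: str) -> dict[str, bool]:
    """Map each "<resourcePrefix>/<resourceName>" to its health-check flag."""
    config = json.loads(config_text)
    settings = {}
    for entry in config.get("resourceList", []):
        full_name = f'{entry.get("resourcePrefix", "")}/{entry["resourceName"]}'
        # Assumption: an absent key is reported as disabled (False).
        settings[full_name] = bool(entry.get("enableExporterHealthCheck", False))
    return settings
```

Applied to the example configuration, this would report the amd.com/nic resource with health checking enabled.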

Monitoring Health Status#

Node Resource Status#

You can monitor device health status by examining node resource information with kubectl describe node <node_name>. The output contains two resource fields that reflect device health: Capacity and Allocatable.

Capacity vs Allocatable Resources#

Capacity represents all physically available NICs on the node, regardless of their health status. This value remains constant unless devices are physically added or removed from the node.

Example for a node with 2 NICs:

Capacity:
  amd.com/nic:         2
  amd.com/vnic:        0

Allocatable reflects only the NICs reported as healthy and available for pod scheduling. When all NICs are healthy, this value equals the Capacity.

Example when all NICs are healthy:

Allocatable:
  amd.com/nic:        2
  amd.com/vnic:       0

Health Status Changes#

When a NIC is marked as unhealthy, the change is immediately reflected in the Allocatable count, while Capacity remains unchanged.

Example after one NIC becomes unhealthy:

Allocatable:
  amd.com/nic:        1
  amd.com/vnic:       0

This ensures that the Kubernetes scheduler places pods that request amd.com/nic only on nodes with enough healthy NICs available.
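
The comparison between Capacity and Allocatable can also be scripted. The sketch below assumes the input is the JSON printed by kubectl get node <node_name> -o json and computes how many NICs are currently excluded from scheduling; the function name unavailable_nics is hypothetical.

```python
import json


def unavailable_nics(node_json: str, resource: str = "amd.com/nic") -> int:
    """Return Capacity minus Allocatable for the given resource on one node."""
    status = json.loads(node_json)["status"]
    # Quantities for extended resources are serialized as integer strings.
    capacity = int(status.get("capacity", {}).get(resource, "0"))
    allocatable = int(status.get("allocatable", {}).get(resource, "0"))
    return capacity - allocatable
```

For the example above, with a Capacity of 2 and an Allocatable of 1, the function returns 1. A result of 0 means that every NIC on the node is reported healthy.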