Kubernetes configuration

Kubernetes configuration#

When deploying AMD Device Metrics Exporter on Kubernetes, a ConfigMap is deployed in the exporter namespace.

Configuration parameters#

  • ServerPort: this field is ignored when Device Metrics Exporter is deployed by the GPU Operator to avoid conflicts with the service node port config.

  • GPUConfig:

    • Fields: An array of strings specifying what metrics field to be exported.

    • Labels: SERIAL_NUMBER, GPU_ID, POD, NAMESPACE, CONTAINER, JOB_ID, JOB_USER, JOB_PARTITION, CARD_MODEL, HOSTNAME, GPU_PARTITION_ID, GPU_COMPUTE_PARTITION_TYPE, GPU_MEMORY_PARTITION_TYPE, KFD_PROCESS_ID and DEPLOYMENT_MODE are always set and cannot be removed. Labels supported are available in the provided example configmap.yml.

    • CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the GPUMetricLabel list, only CLUSTER_NAME is allowed to be set in CustomLabels. Any other labels from this list cannot be set. Users can define other custom labels outside of this restriction. These labels will be exported with every metric, ensuring consistent metadata across all metrics.

    • ExtraPodLabels: This defines a map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label that will be exposed in metrics, and the value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying.
      (e.g. Considering an entry like "WORKLOAD_ID"   : "amd-workload-id", where WORKLOAD_ID is a label visible in metrics and its value is the pod label value of a pod label key set as amd-workload-id).

    • ProfilerMetrics: A map of toggle to enable Profiler Metrics either for all nodes or a specific hostname with desired state. Key with specific hostname $HOSTNAME takes precedense over a all key. This only controls the Profiler Metrics which has prefix of GPU_PROF_ from the metrics list.

  • CommonConfig:

    • MetricsFieldPrefix: Add prefix string for all the fields exporter. Premetheus Metric Label formatted string prefix will be accepted, on any invalid prefix will default to empty prefix to allow exporting of the fields.

    • HealthService : Health Service configurations for the exproter.

      • Enable : false to disable, otherwise enabled by default

    • LoggerConfig: Logger configurations for the exporter.

      • Level: Log level for the exporter. Supported levels are DEBUG, INFO, WARN, ERROR. Default is INFO.

      • MaxSizeMB: Maximum size in megabytes of the log file before it gets rotated. Default is 10 MB.

      • MaxBackups: Maximum number of old log files to retain. Default is 3.

      • MaxAgeDays: Maximum number of days to retain old log files. Default is 7 days.

      • LogRotationDisable: Boolean flag to disable log rotation. If set to true, log rotation is disabled and logs will be written to a single file without rotation. Default is false.

  • NICConfig:

    • Fields: An array of strings specifying what metrics field to be exported. Detailed list of fields can be found here

    • Labels: NIC_SERIAL_NUMBER, NIC_UUID, NIC_HOSTNAME are always set and cannot be removed. Workload related labels such as NIC_POD, NIC_NAMESPACE, and NIC_CONTAINER are dynamically added to the LIF when there is an associated workload. Labels supported are available in the provided example configmap.yml.

    • CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. CLUSTER_NAME is the only label that is exported by default. Users can define other custom labels outside of this restriction. These labels will be exported with every metric, ensuring consistent metadata across all metrics.

    • HealthCheckConfig: List of the configs that determine the health check behavior for NICs. This includes settings such as whether interfaces that are down should be reported as unhealthy (InterfaceAdminDownAsUnhealthy). These configurations help define how NIC health metrics are evaluated and exported.

  • IFOEConfig:

    • Fields: An array of strings specifying what IFOE metrics fields to be exported. Detailed list of fields can be found here. If no fields are specified, all IFOE metrics are exported by default.

    • Labels: HOSTNAME and IFOE_UUID are mandatory labels that are always set and cannot be removed. Additional optional labels include IFOE_STATION_UUID, IFOE_PORT_NAME, and IFOE_DEVICE_UUID which provide more granular identification of IFOE components. Labels supported are available in the provided example configmap.yml.

    • CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. These labels will be exported with every IFOE metric, ensuring consistent metadata across all metrics. Custom labels allow you to add deployment-specific information such as cluster identifiers, data center locations, or other organizational metadata.

    • ExtraPodLabels: Similar to GPUConfig, this defines a map that links Prometheus label names to Kubernetes pod labels for IFOE metrics. This allows you to expose pod metadata as Prometheus labels for easier correlation between IFOE network metrics and workload information.

Setting custom values#

To use a custom configuration when deploying the Metrics Exporter:

  1. Create a ConfigMap based on the provided example configmap.yml file.

  2. Change the configMap property in values.yaml to configmap.yml

  3. Run helm install:

helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.5.0/device-metrics-exporter-charts-v1.5.0.tgz -n metrics-exporter -f values.yaml --create-namespace

Device Metrics Exporter polls for configuration changes every minute, so updates take effect without container restarts.