Kubernetes configuration
When deploying AMD Device Metrics Exporter on Kubernetes, a ConfigMap is deployed in the exporter namespace.
Configuration parameters
- ServerPort: This field is ignored when Device Metrics Exporter is deployed by the GPU Operator, to avoid conflicts with the service node port configuration.
- GPUConfig:
  - Fields: An array of strings specifying which metric fields to export.
  - Labels: SERIAL_NUMBER, GPU_ID, POD, NAMESPACE, CONTAINER, JOB_ID, JOB_USER, JOB_PARTITION, CARD_MODEL, HOSTNAME, GPU_PARTITION_ID, GPU_COMPUTE_PARTITION_TYPE, GPU_MEMORY_PARTITION_TYPE, KFD_PROCESS_ID and DEPLOYMENT_MODE are always set and cannot be removed. The supported labels are listed in the provided example configmap.yml.
  - CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the GPUMetricLabel list, only CLUSTER_NAME can be set in CustomLabels; no other label from that list can be set. Users can define any other custom labels outside of this restriction. These labels are exported with every metric, ensuring consistent metadata across all metrics.
  - ExtraPodLabels: A map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label exposed in metrics, and each value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying. For example, with an entry like "WORKLOAD_ID" : "amd-workload-id", WORKLOAD_ID is the label visible in metrics, and its value is the value of the pod label whose key is amd-workload-id.
  - ProfilerMetrics: A map of toggles that set the desired state of Profiler Metrics, either for all nodes or for a specific hostname. A key with a specific hostname ($HOSTNAME) takes precedence over the all key. This controls only the Profiler Metrics, which have the GPU_PROF_ prefix in the metrics list.
- CommonConfig:
  - MetricsFieldPrefix: A prefix string added to all exported fields. The prefix must be a valid Prometheus metric label formatted string; an invalid prefix falls back to an empty prefix so that the fields are still exported.
  - HealthService: Health Service configuration for the exporter.
    - Enable: Set to false to disable the Health Service; it is enabled by default.
- LoggerConfig: Logger configuration for the exporter.
  - Level: Log level for the exporter. Supported levels are DEBUG, INFO, WARN, ERROR. Default is INFO.
  - MaxSizeMB: Maximum size in megabytes of the log file before it gets rotated. Default is 10 MB.
  - MaxBackups: Maximum number of old log files to retain. Default is 3.
  - MaxAgeDays: Maximum number of days to retain old log files. Default is 7 days.
  - LogRotationDisable: Boolean flag to disable log rotation. If set to true, log rotation is disabled and logs are written to a single file without rotation. Default is false.
- NICConfig:
  - Fields: An array of strings specifying which metric fields to export. Detailed list of fields can be found here.
  - Labels: NIC_SERIAL_NUMBER, NIC_UUID, NIC_HOSTNAME are always set and cannot be removed. Workload-related labels such as NIC_POD, NIC_NAMESPACE, and NIC_CONTAINER are dynamically added to the LIF when there is an associated workload. The supported labels are listed in the provided example configmap.yml.
  - CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. CLUSTER_NAME is the only label that is exported by default. Users can define other custom labels outside of this restriction. These labels are exported with every metric, ensuring consistent metadata across all metrics.
  - HealthCheckConfig: List of the configs that determine the health check behavior for NICs. This includes settings such as whether interfaces that are down should be reported as unhealthy (InterfaceAdminDownAsUnhealthy). These configurations define how NIC health metrics are evaluated and exported.
- IFOEConfig:
  - Fields: An array of strings specifying which IFOE metric fields to export. Detailed list of fields can be found here. If no fields are specified, all IFOE metrics are exported by default.
  - Labels: HOSTNAME and IFOE_UUID are mandatory labels that are always set and cannot be removed. Additional optional labels include IFOE_STATION_UUID, IFOE_PORT_NAME, and IFOE_DEVICE_UUID, which provide more granular identification of IFOE components. The supported labels are listed in the provided example configmap.yml.
  - CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. These labels are exported with every IFOE metric, ensuring consistent metadata across all metrics. Custom labels allow you to add deployment-specific information such as cluster identifiers, data center locations, or other organizational metadata.
  - ExtraPodLabels: Similar to GPUConfig, this defines a map that links Prometheus label names to Kubernetes pod labels for IFOE metrics. This allows you to expose pod metadata as Prometheus labels for easier correlation between IFOE network metrics and workload information.
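As a rough illustration of how a few of the parameters above fit together, the sketch below shows a ConfigMap carrying a small GPU configuration. It is not copied from the shipped example: the ConfigMap name, the config.json data key, the boolean values under ProfilerMetrics, the hostname gpu-node-01, and the DATACENTER label are all assumptions made for illustration. Take the exact layout and the valid Fields and Labels names from the provided example configmap.yml.

```yaml
# Illustrative sketch only; compare against the provided example configmap.yml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: exporter-config          # hypothetical name
  namespace: metrics-exporter    # namespace used by the helm command in this guide
data:
  # Assumption: the exporter reads its settings as JSON from a key named config.json.
  # CustomLabels: CLUSTER_NAME is the only GPUMetricLabel entry that may be set here;
  #   DATACENTER is a made-up user-defined label (at most 10 custom labels in total).
  # ExtraPodLabels: the WORKLOAD_ID metric label takes the value of the pod label
  #   amd-workload-id.
  # ProfilerMetrics: assumed boolean toggles; the hostname key gpu-node-01
  #   (hypothetical) takes precedence over the all key, so GPU_PROF_ metrics are
  #   enabled everywhere except on that node.
  # MetricsFieldPrefix: must be a valid Prometheus-style string, otherwise an empty
  #   prefix is used.
  config.json: |
    {
      "GPUConfig": {
        "CustomLabels": {
          "CLUSTER_NAME": "cluster-a",
          "DATACENTER": "dc-1"
        },
        "ExtraPodLabels": {
          "WORKLOAD_ID": "amd-workload-id"
        },
        "ProfilerMetrics": {
          "all": true,
          "gpu-node-01": false
        }
      },
      "CommonConfig": {
        "MetricsFieldPrefix": "amd_",
        "HealthService": {
          "Enable": true
        }
      }
    }
```

ServerPort is left out of this sketch because it is ignored when the exporter is deployed by the GPU Operator.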
Setting custom values
To use a custom configuration when deploying the Metrics Exporter:
1. Create a ConfigMap based on the provided example configmap.yml file.
2. Change the configMap property in values.yaml to configmap.yml.
3. Run helm install:
helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.5.0/device-metrics-exporter-charts-v1.5.0.tgz -n metrics-exporter -f values.yaml --create-namespace
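For the values.yaml change in the steps above, the relevant line might look like the following sketch. It assumes that configMap is a top-level key in the chart's values.yaml, which should be checked against the chart's default values before use.

```yaml
# values.yaml (fragment); assumes configMap is a top-level key in the chart's values
configMap: configmap.yml   # value taken from the steps above
```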
Device Metrics Exporter polls for configuration changes every minute, so updates take effect without container restarts.