Developer Guide#
This document provides build instructions and guidance for developers working on the AMD Device Metrics Exporter repository.
Environment Setup#
The project Makefile provides an easy way to create a Docker build container that packages the Docker and Go versions needed to build this repository. The following environment variables can be set, either directly or via a dev.env file:
DOCKER_REGISTRY: Docker registry (default: docker.io/rocm).
DOCKER_BUILDER_TAG: Docker build container tag (default: v1.0).
BUILD_BASE_IMAGE: Base image for Docker build container (default: ubuntu:22.04).
EXPORTER_IMAGE_NAME: Metrics exporter container name (default: device-metrics-exporter).
EXPORTER_IMAGE_TAG: Metrics exporter container tag (default: latest).
TESTRUNNER_IMAGE_NAME: Test runner image name (default: test-runner).
UBUNTU_VERSION: Ubuntu version for builds (jammy for 22.04, noble for 24.04).
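As an illustration, a dev.env file that spells out the defaults listed above might look like the following. This is a sketch that assumes the file uses plain KEY=VALUE lines; check the Makefile for the exact syntax it expects. The UBUNTU_VERSION value is only an example, since no default is stated for it.

```shell
# dev.env (illustrative; assumes plain KEY=VALUE syntax)
DOCKER_REGISTRY=docker.io/rocm
DOCKER_BUILDER_TAG=v1.0
BUILD_BASE_IMAGE=ubuntu:22.04
EXPORTER_IMAGE_NAME=device-metrics-exporter
EXPORTER_IMAGE_TAG=latest
TESTRUNNER_IMAGE_NAME=test-runner
# Example choice, not a documented default: jammy = 22.04, noble = 24.04
UBUNTU_VERSION=jammy
```

Override only the variables you need to change; the rest can be left out so the Makefile defaults apply.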
Build Prerequisites#
Before starting, ensure that Docker is installed and running, and that your user account has permission to run Docker commands.
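One way to verify this prerequisite is a small shell check like the sketch below. The function name check_docker is hypothetical and not part of this repository; it relies only on the standard docker info command, which fails when the daemon is not running or the current user cannot reach it.

```shell
# Hypothetical helper (not part of the repository): report whether Docker is usable.
check_docker() {
    # Is the docker CLI on PATH at all?
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker: not installed"
        return 1
    fi
    # "docker info" talks to the daemon, so it fails if the daemon is
    # stopped or the current user lacks permission to access it.
    if ! docker info >/dev/null 2>&1; then
        echo "docker: installed, but the daemon is not reachable by this user"
        return 2
    fi
    echo "docker: ok"
}
```

Run check_docker before invoking any Makefile target; a non-zero return status indicates that the Docker setup needs attention first.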
Quick Start#
To quickly build everything using Docker:
make default
The default target creates a Docker build container that packages the developer tools required to build all other targets in the Makefile, then builds the all target inside that container.
Building Components#
Build and Launch Docker Build Container Shell#
Run the following command to start a Docker-based build container shell:
make docker-shell
This gives you an interactive Docker environment with the necessary tools pre-installed. It is recommended to run all other Makefile targets inside this build environment.
Compiling the AMD Device Metrics Exporter#
To compile from within the build environment, run:
make all
This command builds:
AMD Metrics Exporter
Proto-generated code
Metrics utility
AMD Test Runner
Note: AMD Test Runner builds are currently disabled in this branch. Please use prebuilt images to deploy test runner until support for building the component is added here.
Building a Debian Package#
To build a Debian package for Ubuntu:
make pkg
This creates .deb packages in the bin directory.
Build Docker images#
To build the standard exporter image:
make docker
Testing#
To run unit tests in pkg/:
make unit-test
To run end-to-end tests:
make e2e
Note: End-to-end tests run against a mock AMD Metrics Exporter image that simulates the generated metrics.
Helm Chart Packaging#
To package Helm charts:
make helm-charts
GPU Agent Integration#
The AMD Device Metrics Exporter relies on GPU Agent, which provides programmable APIs to configure and monitor AMD Instinct GPUs. GPU Agent enables low-level interactions with the GPUs, facilitating the collection and reporting of device-specific metrics.
Building GPU Agent#
Developers can make changes directly in the GPU Agent repository, build the GPU Agent binary, and then integrate the built binary into the Device Metrics Exporter project. Copy the static binary into the assets folder of the AMD Device Metrics Exporter repository, then run:
gzip -f assets/gpuagent_static.bin
make all
Build a new Docker image with the new gpuagent binary packaged using:
make docker
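The copy and compress steps can be wrapped in a small helper, sketched below. The function name stage_gpuagent and its arguments are hypothetical. Only the assets/gpuagent_static.bin target name comes from the steps above, and the source path depends on where your GPU Agent build places its output.

```shell
# Hypothetical helper (not part of the repository): stage a locally built
# GPU Agent binary into a Device Metrics Exporter checkout.
#   $1: path to the static GPU Agent binary you built
#   $2: root of the exporter checkout (defaults to the current directory)
stage_gpuagent() {
    src="$1"
    repo="${2:-.}"
    if [ ! -f "$src" ]; then
        echo "stage_gpuagent: no such file: $src" >&2
        return 1
    fi
    if [ ! -d "$repo/assets" ]; then
        echo "stage_gpuagent: no assets folder under: $repo" >&2
        return 1
    fi
    # Copy under the file name used in the steps above, then compress in place.
    # gzip -f overwrites any existing gpuagent_static.bin.gz.
    cp "$src" "$repo/assets/gpuagent_static.bin"
    gzip -f "$repo/assets/gpuagent_static.bin"
}
```

After staging the binary, run make all and make docker in the build environment as described above.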
Architecture#
Metrics HTTP Server Request Handling#
sequenceDiagram
actor user/client
user/client ->> exporter : http /metrics
exporter ->> metricHandler: UpdateMetrics
metricHandler ->> gpuagentClient : UpdateStaticMetrics
gpuagentClient ->> gpuagent : gRPC getGPUs
gpuagent ->> amdsmilib : getGPUs statistics
amdsmilib -->> gpuagent : getGPUs statistics response
gpuagent -->> gpuagentClient : getGPUs response
gpuagentClient -->> metricHandler : UpdateStaticMetrics response
metricHandler -->> exporter : UpdateMetrics response
exporter -->> user/client : http /metrics response
Health Monitoring And gRPC Service#
sequenceDiagram
exporter ->> metricsvc : start gRPC service over unix socket
metricsvc ->> gpuagentClient : UpdateStaticMetrics
gpuagentClient ->> gpuagent : gRPC getGPU
gpuagent -->> gpuagentClient : getGPU response
gpuagentClient -->> metricsvc : UpdateStaticMetrics response
metricsvc ->> metricsvc : evaluate GPU health @ 30s interval
Health gRPC Request Handling#
sequenceDiagram
actor user/client
user/client ->> exporter : gRPC List/GetGPUState
exporter ->> metricsvc : GetGPUHealthStates
metricsvc -->> exporter : GetGPUHealthStates response
exporter -->> user/client : GPUStateResponse