Developer Guide#

This guide provides information for developers who want to contribute to or modify the AMD GPU Operator.

Warning

This project is not yet ready to accept commits from external developers.

Prerequisites#

Go v1.20 (Go v1.21 and v1.22 are not supported yet because of open issues)
Docker
Kubernetes cluster (v1.29.0+) or OpenShift (4.16+)
kubectl or oc CLI tool configured to access your cluster
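The prerequisites above can be checked before building. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it only tests that the tools are on PATH and extracts the Go release from the output of go version. It does not verify the cluster version.

```shell
#!/bin/sh
# Hypothetical helpers (not from the gpu-operator repository).

# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report missing prerequisites; returns 1 if any tool is absent.
report_prerequisites() {
    missing=$(missing_tools "$@")
    if [ -n "$missing" ]; then
        printf 'Missing tools:\n%s\n' "$missing" >&2
        return 1
    fi
    printf 'All required tools found.\n'
}

# Extract "major.minor" from a line such as
# "go version go1.20.5 linux/amd64" (the format printed by `go version`).
go_release() {
    printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Example calls (left commented so the script has no side effects):
#   report_prerequisites go docker kubectl    # use oc instead of kubectl on OpenShift
#   [ "$(go_release "$(go version)")" = "1.20" ] || echo "Go v1.20 is expected" >&2
```

On OpenShift, pass oc in place of kubectl to report_prerequisites.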

Development Environment Setup#

Install Helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

For alternative installation methods, refer to the Helm Official Website.

Install Helmify:
- Download the released binary from the Helmify GitHub release page, unpack it, and move it to a directory on your PATH.

Clone the repository:

git clone https://github.com/ROCm/gpu-operator.git
cd gpu-operator

(Optional) Set up a local Docker registry if you want to build and host container images locally:

docker run -d -p 5000:5000 --name registry registry:latest

Modify the registry-related variables in the Makefile:
- DOCKER_REGISTRY: Set to localhost:5000 for local development, or your preferred registry
- IMAGE_NAME: Set to rocm/gpu-operator
- IMAGE_TAG: Set as needed (e.g., v1.0.0 or latest)
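As an illustration of how the three variables relate, the sketch below composes them into an image reference. It assumes the Makefile joins them in the conventional registry/name:tag form, which is not confirmed here, and image_ref is a hypothetical helper rather than something the repository provides.

```shell
#!/bin/sh
# Compose an image reference in the conventional registry/name:tag form.
# Assumption: the Makefile combines the variables the same way.
image_ref() {
    printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Defaults mirror the values suggested in the list above.
DOCKER_REGISTRY=${DOCKER_REGISTRY:-localhost:5000}
IMAGE_NAME=${IMAGE_NAME:-rocm/gpu-operator}
IMAGE_TAG=${IMAGE_TAG:-latest}

printf 'Image would be: %s\n' "$(image_ref "$DOCKER_REGISTRY" "$IMAGE_NAME" "$IMAGE_TAG")"

# Variables given on the make command line take precedence over ordinary
# assignments in a Makefile, so editing the file may not be needed
# (unless the Makefile uses the `override` directive):
#   make docker-build DOCKER_REGISTRY="$DOCKER_REGISTRY" IMAGE_NAME="$IMAGE_NAME" IMAGE_TAG="$IMAGE_TAG"
```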
Compile the project:

make

This generates the basic YAML files for the CRDs, builds the controller images, builds the Helm charts, and builds the OpenShift OLM bundle.

(Optional) Run a specific make target:
- Run make docker/shell to build and attach to a container with the build environment configured.
- Run make <specific target> inside the container to execute that target.

Build and push the AMD GPU Operator image:

make docker-build
make docker-push

Note: If you’re using a remote registry that requires authentication, ensure you’ve logged in using docker login before pushing.

Generate Helm charts:
- For vanilla Kubernetes: make helm
- For OpenShift: OPENSHIFT=1 make helm

Check the Makefile help message for more options:

make help

Running Tests#

Running the e2e tests requires a Kubernetes cluster. Prepare the cluster and place its kubeconfig file at ~/.kube/config so that kubectl and helm can access it. The e2e test cases deploy the Operator to your cluster and then run against it.
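A quick pre-flight check can catch a missing kubeconfig before the tests start. This is a sketch with hypothetical function names. It assumes the simple case in which kubectl and helm read the file named by the KUBECONFIG environment variable, or ~/.kube/config when that variable is unset. A colon-separated KUBECONFIG list is not handled.

```shell
#!/bin/sh
# Resolve the kubeconfig path for the simple case: use $KUBECONFIG when it
# is set and non-empty, otherwise fall back to ~/.kube/config.
kubeconfig_path() {
    printf '%s\n' "${KUBECONFIG:-$HOME/.kube/config}"
}

# Succeed only if the resolved kubeconfig is a readable regular file.
kubeconfig_ready() {
    cfg=$(kubeconfig_path)
    if [ -f "$cfg" ] && [ -r "$cfg" ]; then
        printf 'Using kubeconfig: %s\n' "$cfg"
    else
        printf 'No readable kubeconfig at %s\n' "$cfg" >&2
        return 1
    fi
}

# Example (commented out so nothing runs against a cluster):
#   kubeconfig_ready && make e2e
```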

To run the e2e tests:

make e2e

To run e2e tests with a specific Helm chart:

make e2e GPU_OPERATOR_CHART="path to helm chart"

To run only the e2e tests:

make -C tests/e2e # run e2e tests only

GPU Operator E2E Tests#

The tests/k8s-e2e/ directory contains an e2e test suite that installs the GPU Operator via Helm and verifies metrics and health. Tests run against a live Kubernetes cluster.

Prerequisites#

A running Kubernetes cluster with at least one AMD GPU node
kubectl configured (~/.kube/config or a custom kubeconfig)
Docker (to build the test runner image)

Test runner image#

docker build -t gpu-op-k8s-e2e:latest -f tests/k8s-e2e/Dockerfile.e2e tests/k8s-e2e/

Running tests#

Full install + verify + teardown#

Pass the Helm chart either as a local directory path (the helm-charts-k8s/ directory in the repository root) or as an OCI/repo reference if the chart is published to a registry:

docker run --rm \
  -v /path/to/kubeconfig:/kubeconfig:ro \
  -v /path/to/gpu-operator/helm-charts-k8s:/helm-charts:ro \
  gpu-op-k8s-e2e:latest \
  -kubeconfig /kubeconfig \
  -operatorchart /helm-charts \
  -operatortag v1.5.0 \
  -test.timeout 60m

Verify only (pre-deployed cluster)#

docker run --rm -v /path/to/kubeconfig:/kubeconfig:ro \
  gpu-op-k8s-e2e:latest \
  -kubeconfig /kubeconfig -existing \
  -check.f 'TestOp010|TestOp020|TestOp030|TestOp040|TestOp050|TestOp060|TestOp065|TestOp070' \
  -test.timeout 30m
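The -check.f value in the command above is a regular expression that joins test names with '|'. The sketch below builds such a filter from a list of names, which can be easier to maintain than a long quoted string. check_filter is a hypothetical helper, not part of the test suite.

```shell
#!/bin/sh
# Join test names with '|' to form an alternation regex for -check.f.
check_filter() {
    filter=""
    for name in "$@"; do
        # Prepend the separator only when the filter is already non-empty.
        filter="${filter:+$filter|}$name"
    done
    printf '%s\n' "$filter"
}

check_filter TestOp010 TestOp020 TestOp030
# prints: TestOp010|TestOp020|TestOp030

# Example use with the test runner (commented out; requires Docker and a cluster):
#   docker run --rm -v /path/to/kubeconfig:/kubeconfig:ro \
#     gpu-op-k8s-e2e:latest \
#     -kubeconfig /kubeconfig -existing \
#     -check.f "$(check_filter TestOp010 TestOp020)" \
#     -test.timeout 30m
```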

Using make#

# Full install+verify+teardown
make -C tests/k8s-e2e all KUBECONFIG=/path/to/kubeconfig OPERATOR_TAG=v1.5.0

# Verify only (pre-deployed)
make -C tests/k8s-e2e verify KUBECONFIG=/path/to/kubeconfig

Common flags#

| Flag | Default | Description |
|------|---------|-------------|
| `-kubeconfig` | `~/.kube/config` | Path to kubeconfig |
| `-operatorchart` | OCI registry chart | GPU Operator helm chart (OCI ref or local path) |
| `-operatortag` | `v1.4.1` | GPU Operator chart version |
| `-namespace` | `kube-amd-gpu` | Kubernetes namespace |
| `-existing` | `false` | Skip install/teardown; verify only against a pre-deployed cluster |
| `-noteardown` | `false` | Skip teardown after tests (leave operator installed) |
| `-helmset` | (none) | Extra helm `--set` override (repeatable) |
| `-check.f` | (all) | Regex filter for test names (gocheck syntax) |
| `-test.timeout` | `30m` | Overall test timeout |
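Because -helmset is repeatable, each override needs its own flag. The sketch below shows one way to generate the runner arguments from a list of overrides. It assumes each override is written as key=value, as with helm --set. The runner_args helper and the keys in the example are hypothetical and do not refer to real chart values.

```shell
#!/bin/sh
# Print the test-runner arguments one per line, adding one -helmset flag
# per override.
# Usage: runner_args <operator-tag> [key=value ...]
runner_args() {
    tag=$1
    shift
    printf '%s\n' -kubeconfig /kubeconfig -operatortag "$tag"
    for override in "$@"; do
        printf '%s\n' -helmset "$override"
    done
}

# Hypothetical override keys, for illustration only.
runner_args v1.5.0 example.keyA=1 example.keyB=2
```

The printed arguments can then be appended to the docker run command after the image name.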

Creating a Pull Request#

Fork the repository on GitHub.
Create a new branch for your changes.
Make your changes and commit them with clear, descriptive commit messages.
Push your changes to your fork.
Create a pull request against the main repository.

Please ensure your code follows our coding standards and includes appropriate tests.

Build Documentation Website Locally#

Install the MkDocs utilities:

python3 -m pip install mkdocs

Build the website:

cd docs
python3 -m mkdocs build

Serve the website locally:

python3 -m mkdocs serve --dev-addr localhost:2345

The local docs website updates dynamically as changes are made to the Markdown docs.
