Pre-start Job Test#
Pre-start Job Test trigger#
The test runner can be embedded as an init container in your Kubernetes workload pod definition. The init container runs before the actual workload containers start, so the system can be tested right before the workload begins using the hardware resources.
Configure pre-start init container#
The init container requires RBAC configuration to grant the pod access to export events and add node labels to the cluster. Here is an example configuring the RBAC and Deployment resources:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-run
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: test-run-cluster-role
rules:
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - get
      - list
      - watch
      - create
      - update
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: test-run-rb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: test-run-cluster-role
subjects:
  - kind: ServiceAccount
    name: test-run
    namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-gpu-deployment
  namespace: default
  labels:
    purpose: demo-pytorch-amdgpu
spec:
  replicas: 1
  selector:
    matchLabels:
      purpose: demo-pytorch-amdgpu
  template:
    metadata:
      labels:
        purpose: demo-pytorch-amdgpu
    spec:
      serviceAccountName: test-run
      initContainers:
        - name: init-test-runner
          image: docker.io/rocm/test-runner:v1.2.0-beta.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              amd.com/gpu: 1 # requesting a GPU
          env:
            # Set TEST_TRIGGER to PRE_START_JOB_CHECK to run the test runner as an init container
            - name: TEST_TRIGGER
              value: "PRE_START_JOB_CHECK"
            # Use the downward API to pass the pod name to the test runner container
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # Use the downward API to pass the pod namespace to the test runner container
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            # Use the downward API to pass the node name to the test runner container
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      containers:
        - name: pytorch-gpu-workload
          image: busybox:latest
          command: ["/bin/sh", "-c", "--"]
          args: ["sleep 6000"]
          resources:
            limits:
              amd.com/gpu: 1 # requesting a GPU
```
Check test runner init container#
While the test runner is running, the pod reports `Init:0/1`:

```shell
$ kubectl get pod
NAME                                      READY   STATUS     RESTARTS   AGE
pytorch-gpu-deployment-7c6bb979f5-p2wlk   0/1     Init:0/1   0          2m52s
```
Check the test runner container logs:

```shell
$ kubectl logs pytorch-gpu-deployment-7c6bb979f5-p2wlk -c init-test-runner
```
Once the test runner completes, the workload container starts to run:

```shell
$ kubectl get pod
NAME                                      READY   STATUS    RESTARTS   AGE
pytorch-gpu-deployment-7c6bb979f5-p2wlk   1/1     Running   0          7m46s
```
Check test running node labels#
While a test is ongoing, a corresponding label is added to the node resource, for example `"amd.testrunner.gpu_health_check.gst_single": "running"`. The label is removed once the test completes.
Check test result event#
Events generated by the test runner can be found in the namespace where the workload resource is defined:

```shell
$ kubectl get events -n kube-amd-gpu
LAST SEEN   TYPE     REASON       OBJECT                                 MESSAGE
8m8s        Normal   TestFailed   pod/test-runner-manual-trigger-c4hpw   [{"number":1,"suitesResult":{"42924":{"gpustress-3000-dgemm-false":"success","gpustress-41000-fp32-false":"failure","gst-1215Tflops-4K4K8K-rand-fp8":"failure","gst-8096-150000-fp16":"success"}}}]
```
More detailed information about test result events can be found in this section.
Advanced Configuration - ConfigMap#
You can create a config map to customize the test trigger and recipe configs. For an example config map and an explanation of its fields, please check this section.
After creating the config map, specify a volume and volume mount to mount it into the test runner container:

- In the config map, the file name (key) must be `config.json`.
- Within the test runner container, the mount path must be `/etc/test-runner/`.

An example of mounting the config map into the test runner container can be found in this section.
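As an illustration, a minimal sketch of this wiring might look like the fragment below. The config map name `test-runner-config`, the volume name, and the empty `config.json` body are hypothetical placeholders; only the `config.json` key name and the `/etc/test-runner/` mount path come from the requirements above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-runner-config   # hypothetical name
  namespace: default
data:
  config.json: |             # the key must be named config.json
    {}
---
# Fragment of the Deployment pod template spec: mount the config map
# into the test runner init container at the required path.
      initContainers:
        - name: init-test-runner
          volumeMounts:
            - name: config-volume          # hypothetical volume name
              mountPath: /etc/test-runner/ # required mount path
      volumes:
        - name: config-volume
          configMap:
            name: test-runner-config
```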
Advanced Configuration - Logs Mount#
The test runner saves compressed test logs in the `/var/log/amd-test-runner/` folder within the container. If you don't configure a logs mount, the logs are not exported by default. You can mount any of the many Kubernetes volume types into the test runner container so that the test logs are persisted at your desired place.
Here is an example of mounting a `hostPath` volume into the container; the key points are:

- Define a `hostPath` volume.
- Mount the volume to a directory within the test runner container.
- Use the `LOG_MOUNT_DIR` environment variable to ask the test runner to save logs into the mounted directory.

An example of mounting the `hostPath` volume into the test runner container can be found in this section.
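The three points above can be sketched as the pod-template fragment below. The host directory `/var/log/test-runner`, the volume name, and the choice of `/var/log/amd-test-runner` as both the mount path and the `LOG_MOUNT_DIR` value are assumptions for illustration; only the variable name and the in-container log location come from the text above.

```yaml
# Fragment of the Deployment pod template spec.
      initContainers:
        - name: init-test-runner
          env:
            - name: LOG_MOUNT_DIR            # tell the test runner where to save logs
              value: /var/log/amd-test-runner
          volumeMounts:
            - name: test-logs                # hypothetical volume name
              mountPath: /var/log/amd-test-runner
      volumes:
        - name: test-logs
          hostPath:
            path: /var/log/test-runner       # hypothetical host directory
            type: DirectoryOrCreate
```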