Workload Validation#
Systems under test should be capable of running large AI models, including Large Language Models (LLMs). This section provides guidance on validating AI model performance on AMD Instinct systems.
AI Model Performance Check#
For LLM inference and training performance results to validate against, refer to the latest Performance Results with AMD ROCm Software on the AMD ROCm AI Developer Hub or the AMD Infinity Hub.
Detailed instructions on how to reproduce these results accompany the published results.
Single-Node Workload Validation#
For single-node deployment validation, AMD recommends running Llama 3.1 70B with JAX using the Cluster Validation Suite.
AI Model Performance Validation with the Cluster Validation Suite#
For system validation during deployment, the Cluster Validation Suite includes scripts that automatically install and run the Llama 3.1 70B JAX workload. See the JAX training test scripts section.
The Megatron workload is also supported as an optional deployment workload.
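Before launching these workloads, it can be useful to confirm that JAX can actually see the node's accelerators. The following sketch is a generic pre-flight check, not part of the Cluster Validation Suite; it falls back to CPU devices when no GPUs are visible:

```python
# Pre-flight sanity check: list the devices JAX will use for the workload.
# On an AMD Instinct node with ROCm-enabled JAX, each GPU appears as a
# separate device; without GPUs, JAX falls back to the CPU backend.
import jax

devices = jax.devices()
print(f"JAX backend: {jax.default_backend()}, devices visible: {len(devices)}")
for d in devices:
    print(f"  {d.platform}:{d.id}")

# A node with no visible devices cannot run the validation workloads.
assert jax.device_count() >= 1, "no devices visible to JAX"
```

If the reported device count does not match the number of GPUs installed in the node, resolve that before running the training workload, since missing devices will silently reduce throughput.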
Multi-Node Workload Validation#
For multi-node workload validation, see the Network and Cluster Validation section.
AMD recommends the Llama 3.1 405B workload with JAX for multi-node testing. The runtime varies with the number of nodes, the number of NICs per node, and the overall cluster configuration.
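For reference, a multi-node JAX run requires each process to join the cluster before the workload starts. The sketch below shows the standard `jax.distributed.initialize` call; the coordinator address and process-layout environment variables here are placeholder assumptions, and in practice the Cluster Validation Suite scripts and the job launcher handle this setup:

```python
# Minimal multi-node JAX setup sketch. Every node runs the same script
# with its own process index; the env variable names below are
# illustrative placeholders, not a documented interface.
import os
import jax

coordinator = os.environ.get("JAX_COORDINATOR", "10.0.0.1:1234")
num_procs = int(os.environ.get("JAX_NUM_PROCS", "1"))
proc_id = int(os.environ.get("JAX_PROC_ID", "0"))

if num_procs > 1:
    # After this call, jax.devices() reports every accelerator across
    # all nodes, not just the local ones.
    jax.distributed.initialize(
        coordinator_address=coordinator,
        num_processes=num_procs,
        process_id=proc_id,
    )

print(f"global devices: {jax.device_count()}, "
      f"local devices: {jax.local_device_count()}")
```

With a single process the `initialize` call is skipped and the global and local device counts match, which makes the same script usable for both the single-node and multi-node checks.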
AMD does not currently suggest a multi-node inference benchmark for evaluating clusters; however, inference serving frameworks such as vLLM and SGLang should be considered.