RoCE cluster network configuration guide for AMD Instinct accelerators#
RDMA over Converged Ethernet (RoCE) is a network protocol that can deliver speeds comparable to InfiniBand when running AI/HPC workloads, and offers lower cost than InfiniBand due to its compatibility with standard Ethernet architecture.
This guide contains instructions for optimizing the performance of a RoCE cluster network at the network interface card (NIC) and switch level through proper configuration, as well as routing guidance to mitigate issues like MAC address mismatch (ARP flux) that can occur when establishing RDMA sessions on nodes with multiple NICs.
RoCE configuration for NICs#
The specific steps to configure your NIC for RoCE support differ based on the NIC manufacturer. This guide cannot provide steps for every manufacturer, so this section offers high-level recommendations for setting up RoCE NICs from the most common manufacturers and defers to manufacturer documentation for complete setup.
Install NIC firmware and driver#
NIC drivers typically include a regular Ethernet driver, a RoCE driver, and a peer-memory (also known as GPU direct RDMA) driver, depending on the vendor.
Specific installation steps vary from vendor to vendor and may differ between NIC models from the same vendor; refer to the public documentation each vendor provides for installing RoCE drivers.
Always consult vendor-specific instructions in addition to this guide when configuring your NIC. You may need to reach out directly to the vendor if instructions are not publicly available.
Note
The latest driver for a NIC may require a Linux kernel that is not yet supported by the AMD ROCm/amdgpu software stack. Before updating, review the supported operating systems for ROCm and verify that the kernel required by the driver is supported by the version of ROCm you have installed.
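After installing the drivers, you can confirm that the NIC's RDMA devices are visible to the operating system. A minimal check using the standard rdma-core utilities (the package and device names are placeholders that vary by distribution and NIC):
$ sudo apt install ibverbs-utils       # or the equivalent rdma-core utilities package for your distribution
$ ibv_devices                          # lists RDMA-capable devices exposed by the NIC driver
$ ibv_devinfo -d <device_name>         # a port reporting "link_layer: Ethernet" indicates RoCE capability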
Set static NIC speed#
Most high-speed network adapters support multiple speeds and are often configured with a default “auto-negotiation” feature that dynamically sets the speed based on network conditions. It’s a best practice to disable this feature and configure a single static network speed instead. This avoids unexpected changes in speed and simplifies debugging if you encounter performance issues.
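The exact procedure is vendor-specific, but on many Linux systems you can inspect and pin the link speed with ethtool. A minimal sketch, assuming a 400GbE port named ens1np0 (some NICs require the vendor tool instead):
$ ethtool ens1np0                                                # show supported speeds and the auto-negotiation state
$ sudo ethtool -s ens1np0 speed 400000 duplex full autoneg off   # pin a static 400GbE speed and disable auto-negotiation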
Enable RoCE support mode#
Most RoCE-capable NICs have a feature flag that must be set before they can communicate through RDMA. As the default setting for this feature differs by NIC vendor and model, you must verify all NICs are configured to support RoCE before running tests.
As an example, Broadcom NICs use the support_rdma flag to govern this feature. You can check the status with the NICCLI configuration tool:
$ sudo niccli -i 3 nvm -getoption support_rdma -scope 0
support_rdma = False
sudo niccli -i <NIC index> nvm -getoption support_rdma -scope <scope index>
In this case, the NIC is not configured to support RoCE, so run nvm -setoption to enable it:
$ sudo niccli -i 3 nvm -setoption support_rdma -value 1 -scope 0
support_rdma is set successfully
sudo niccli -i <NIC index> nvm -setoption support_rdma -value <value> -scope <scope index>
Other vendors use different utilities and flags to control this setting; refer to vendor-specific documentation in these scenarios. For Broadcom NICs, you can also refer to the Broadcom RoCE configuration scripts provided in the networking guides to review and configure RDMA support in bulk on each NIC in a node.
Enable PCIe relaxed ordering#
Configuring relaxed ordering for your NICs can offer a performance improvement by relaxing the ordering rules that govern data transfers in the base PCIe specification. As with RoCE support, how to enable this feature differs by vendor and NIC model, but examples for Broadcom NICs are provided in this guide as a starting framework.
Note
NIC configuration is only one part of enabling PCIe relaxed ordering. You must also ensure your server architecture supports relaxed ordering and that it is enabled in BIOS on each node in your cluster.
To check relaxed ordering on a Broadcom NIC, use NICCLI:
$ sudo niccli -i 3 nvm -getoption pcie_relaxed_ordering
pcie_relaxed_ordering = Enabled
sudo niccli -i <NIC index> nvm -getoption pcie_relaxed_ordering
If pcie_relaxed_ordering shows a disabled value, you can enable it with this command:
$ sudo niccli -i 3 nvm -setoption pcie_relaxed_ordering -value 1
pcie_relaxed_ordering is set successfully
Please reboot the system to apply the configuration
sudo niccli -i <NIC index> nvm -setoption pcie_relaxed_ordering -value <value>
For other vendors, refer to vendor-specific documentation for information about how to verify and enable this setting. For Broadcom NICs, you can also refer to the Broadcom RoCE configuration scripts provided in the networking guides to review and configure relaxed ordering in bulk on each NIC in a node.
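Independent of the vendor tool, you can check from the operating system whether relaxed ordering is currently enabled on the NIC's PCIe function. A minimal sketch using lspci (the bus address is a placeholder):
$ lspci | grep -i ethernet                                   # find the NIC's PCIe address (bus:device.function)
$ sudo lspci -s <bus:device.function> -vvv | grep RlxdOrd    # "RlxdOrd+" under DevCtl means relaxed ordering is enabled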
Disable ACS and set IOMMU passthrough#
On any nodes hosting GPUs, disable ACS and configure IOMMU passthrough so that peer-to-peer transfers between NICs and GPUs function as expected.
To disable ACS, use the disable ACS script provided in the single-node network guide. To set IOMMU passthrough on a Linux system, add iommu=pt to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub, then run sudo update-grub. You can see a more detailed flow at GRUB settings and Issue #5: Application hangs on Multi-GPU systems.
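For reference, a minimal sketch of the relevant GRUB entry and a post-reboot check (your GRUB_CMDLINE_LINUX_DEFAULT line will likely already contain other options that should be preserved):
# /etc/default/grub (illustrative; keep any existing options on the line)
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt"

# Apply the change and verify after the reboot
$ sudo update-grub && sudo reboot
$ cat /proc/cmdline | grep -o iommu=pt    # confirms the kernel booted with IOMMU passthrough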
Enable DCQCN through QoS configuration#
Data Center Quantized Congestion Notification (DCQCN) is a traffic control method that combines two features, Explicit Congestion Notification (ECN) and Priority Flow Control (PFC), to support end-to-end lossless Ethernet in a data center environment.
In communication between NICs, ECN detects congestion in switch buffers and alerts the endpoint (receiving NIC) through the packets' ECN bits. The receiving NIC then transmits a congestion notification packet (CNP) to the sending NIC to reduce the transfer rate. If congestion is still too high for a specific traffic class with ECN in effect, PFC pauses traffic for that class until the congestion is resolved.
ECN and PFC are configured per NIC through quality of service (QoS) parameters. Refer to your vendor-specific documentation on how to set the following parameters for your NICs:
Set the RoCE priority class.
Enable PFC on RoCE priority class.
Set CNP priority class (usually the highest priority of 7).
Example of default QoS configuration on a Broadcom Thor2 NIC
# RoCE v2 packets are marked with a DSCP value 26 and use Priority 3 internally
# CNP packets are marked with a DSCP value 48 and use Priority 7 internally
# PFC is enabled for Priority 3 traffic
# Three Traffic classes are set up, TC0 for non RoCE traffic, TC1 for RoCE traffic, and TC2 for CNP traffic
# RoCE and non-RoCE traffic share ETS bandwidth of 50% each. The ETS bandwidth share applies only when the actual traffic is available to use the bandwidth share. In the absence of non-RoCE traffic, all the available bandwidth will be used by RoCE and vice-versa.
# CNP traffic is treated as ETS Strict Priority
$ sudo niccli -dev 1 get_qos
IEEE 8021QAZ ETS Configuration TLV:
PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2
TC Bandwidth: 50% 50% 0%
TSA_MAP: 0:ets 1:ets 2:strict
IEEE 8021QAZ PFC TLV:
PFC enabled: 3
IEEE 8021QAZ APP TLV:
APP#0:
Priority: 7
Sel: 5
DSCP: 48
APP#1:
Priority: 3
Sel: 5
DSCP: 26
APP#2:
Priority: 3
Sel: 3
UDP or DCCP: 4791
TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0%
$ sudo niccli -dev 1 dump pri2cos
Base Queue is 0 for port 0
----------------------------
Priority TC Queue ID
------------------------
0 0 4
1 0 4
2 0 4
3 1 0
4 0 4
5 0 4
6 0 4
7 2 5
$ sudo niccli -dev 1 get_dscp2prio
dscp2prio mapping:
priority:7 dscp: 48
priority:3 dscp: 26
sudo niccli -dev <NIC index> get_qos
sudo niccli -dev <NIC index> dump pri2cos
sudo niccli -dev <NIC index> get_dscp2prio
NIC QoS troubleshooting#
Sometimes, the default QoS on a NIC may differ significantly from that recommended by Broadcom in the BCM957608 Ethernet Networking Guide for AMD Instinct MI300X GPU Clusters (pages 38-42), even if the RoCE profile has been set in NVM.
If you determine this is the case for any of your NICs, follow these steps to resolve it:
1. Run the install.sh script provided in the Broadcom NIC release package:

$ cd bcm5760x_<version.x.y.z>/utils/linux_installer/
$ bash install.sh -i <interface_name> -o ECNPFC -f -b 50 -w

The flags can be understood as follows:
-o ECNPFC - Enables PFC, sets traffic priority, and sets the DSCP values for RoCE and CNP traffic.
-b 50 - Sets RoCE to occupy a minimum of 50% of bandwidth.
-w - Assumes proper firmware is already installed and skips it.
2. Once the script completes, run sudo reboot to prevent the RoCE driver warning about unmatched DSCP values.
3. Verify the QoS is now correct. Run sudo niccli -dev 1 get_qos and ensure the output matches the example in the previous section, paying particular attention to PFC state, traffic classes, and DSCP values.
RoCE configuration for network switches#
You will need direct or remote access to switches in your cluster to configure them for optimal data transfer over a RoCE network. This guide provides instructions for Dell and Arista switches using SONiC and Arista EOS respectively.
Switch authentication and configuration terminal access#
The first step is to log in to the switch and elevate your permissions so that you can change configurations.
For Dell switches running SONiC:
1. Access your switch CLI with SSH.
2. Run sonic-cli.
3. Run configure or configure terminal to enter configuration mode.
4. Run exit at any time to leave configuration mode.
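For reference, a typical SONiC session might look like the following sketch (the management address, user name, and prompts will differ on your switch):
$ ssh admin@<switch-mgmt-ip>     # log in to the switch management interface
admin@sonic:~$ sonic-cli         # enter the SONiC CLI
sonic# configure terminal        # enter configuration mode
sonic(config)# exit              # leave configuration mode when finished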
For Arista switches running EOS:
1. Access your switch CLI with SSH.
2. Run enable to receive elevated privileges.
3. Run configure terminal to enter configuration mode.
4. Run exit at any time to leave configuration mode.
Enable RoCE support#
On Dell SONiC switches:
1. While in configuration mode, run roce enable.
2. Reboot the switch if prompted.
On Arista switches, EOS supports RoCE communication by default. Instead, ensure that PFC for the RoCE traffic class is enabled on each port that handles RoCE traffic.
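As a hedged sketch of what per-port PFC enablement can look like in EOS, consistent with the QoS profile example later in this guide (the interface and priority values are examples to adapt):
switch(config)# interface Ethernet1/1
switch(config-if-Et1/1)# priority-flow-control on
switch(config-if-Et1/1)# priority-flow-control priority 3 no-drop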
Implement standard extended naming for switch interfaces#
Dell recommends what is referred to as the “standard” or “standard extended” naming convention for switch interfaces.
The naming scheme is understood as Eth<line_card_id>/<port_id>/[breakout_port_id]. On a fixed switch, this naming scheme simplifies matching each port to its front-panel label: Eth1/16/[x] corresponds to port 16 as physically labeled on the switch, and the line card ID remains 1 since there is a single line card.
However, if multiple line cards are present in a modular switch like the Arista 7388X5 series, additional effort is required to match the port name to its physical label.
On Dell SONiC switches:
1. While in configuration mode, run interface-naming standard extended.
2. Run write memory.
3. Log out of the switch, then log back in to view the change in interface names.

Interface    Name                               Vendor      Part No.   Serial No.        QSA Adapter   Qualified
--------------------------------------------------------------------------------------------------------------------------------------
Eth1/1       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR711I0060   N/A           True
Eth1/2       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR711I0042   N/A           True
Eth1/3       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR70CO0095   N/A           True
...
Eth1/64      QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR70CM0041   N/A           True
Eth1/65      N/A                                N/A         N/A        N/A               N/A           False
Eth1/66      N/A                                N/A         N/A        N/A               N/A           False
Arista switches are pre-configured to use the standard extended naming convention, so no additional action should be required.
Verify all connected transceivers are detected#
Once all physical cluster cabling is complete, check that your switch transceivers are detected and online.
On Dell SONiC switches:
1. While in configuration mode, run show interface transceiver summary | no-more.
2. Verify all transceivers appear in the interface list.

--------------------------------------------------------------------------------------------------------------------------------------
Interface    Name                               Vendor      Part No.   Serial No.        QSA Adapter   Qualified
--------------------------------------------------------------------------------------------------------------------------------------
Eth1/1       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR711I0060   N/A           True
Eth1/2       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR711I0042   N/A           True
Eth1/3       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR70CO0095   N/A           True
Eth1/4       QSFP56-DD 400GBASE-SR8-AEC-3.0M    DELL EMC    DH11M      CN0F9KR70CQ0026   N/A           True
...
On Arista EOS switches:
1. While in configuration mode, run show inventory.
2. Verify all transceivers appear in the inventory list.

System has 54 switched transceiver slots
Port   Manufacturer       Model              Serial Number      Rev
----   ----------------   ----------------   ----------------   ----
1      Arista Networks    DCS-7050TX-72Q
2      Arista Networks    DCS-7050TX-72Q
3      Arista Networks    DCS-7050TX-72Q
4      Arista Networks    DCS-7050TX-72Q
5      Arista Networks    DCS-7050TX-72Q
Configure switch links#
Link training is used to calibrate the network signal between two devices over a physical, copper-based Ethernet cable. It is typically required when running a direct attach cable (DAC) but discouraged for optics (Dell SmartFabric OS10 User Guide, Release 10.5.6).
If you require link training, enable it on both your NIC and switch OS.
You can run niccli to enable link training on your NICs:
niccli -dev 1 nvm -setoption link_training -value [0|1] -scope 0
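You can likely read the current value back with the matching getoption form used elsewhere in this guide; treat the exact option name and scope as an assumption to verify against your NIC's NICCLI documentation:
$ sudo niccli -dev 1 nvm -getoption link_training -scope 0    # assumed counterpart of the setoption command above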
On Dell SONiC switches, if your switch ports are connected to non-DAC cables, disable link training:
1. While in configuration mode, run interface range to select an interface range such as Eth 1/1-1/32.
2. Run no shutdown.
3. Run no standalone-link-training.

$ (config)# interface range Eth 1/1-1/32
%Info: Configuring only existing interfaces in range
$ (config-if-range-eth**)# no shutdown
$ (config-if-range-eth**)# no standalone-link-training
For switch ports connected to DAC cables:
1. While in configuration mode, run interface range to select an interface range such as Eth 1/33-1/64.
2. Run no shutdown.
3. Run standalone-link-training.

$ (config-if-range-eth**)# interface range Eth 1/33-1/64
%Info: Configuring only existing interfaces in range
$ (config-if-range-eth**)# no shutdown
$ (config-if-range-eth**)# standalone-link-training
On Arista EOS switches:
1. While in configuration mode, run interface Ethernet to select an interface range such as 1-32.
2. Run no shutdown.

$ (config)# interface Ethernet 1-32
$ (config-if-Et1-32)# no shutdown
Important
Some Arista switches have been observed not to support autonegotiation or standalone link training on the edge ports (eth1, eth2, eth31-34, eth63, eth64) when running older versions of Arista EOS. In that situation you must either use DACs only in the switch ports that can enable link training, or disable link training on all ports and on the NICs, to make full use of the switch.
Since neither approach is ideal, the preferred solution is to update Arista EOS to version 4.33.0F or later, which should allow standalone link training and autonegotiation on all ports.
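To check which EOS release a switch is running before planning an upgrade, use show version:
switch# show version    # the "Software image version" line shows the running EOS release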
Link training support matrix#
Refer to the following table to determine whether standalone link training should be enabled, based on your switch OS and cable type.
| Switch OS | Cable type | Port speed | Link training | BRCM NIC link training |
|---|---|---|---|---|
| Arista EOS >= 4.33.0F | optics | 400 - no autoneg | off | off |
| Arista EOS >= 4.33.0F | DAC | 400 - no autoneg | on | on |
| Dell SONiC | optics | 400 - no autoneg | off | off |
| Dell SONiC | DAC | 400 - no autoneg | on | on |
| Dell OS10 | optics | 400 - no autoneg | N/A | off |
| Dell OS10 | DAC | 400 - no autoneg | N/A | on |
Match switch QoS configuration to NIC for DCQCN#
It’s critical that the configuration you set up on a NIC be matched by your switch. Refer to the BCM957608 Ethernet Networking Guide for AMD Instinct MI300X GPU Clusters (pages 38-42) as a reference; the configuration detailed there is a good baseline for most switches and NICs.
DCQCN in Arista EOS can be summarized as:
Enable PFC on all ports.
Match the RoCE and CNP DSCP values to those set on the NIC.
Enable ECN and configure ECN thresholds.
Example - DCQCN configuration on an Arista 7388 switch
!!! Map the RoCE and CNP traffic classes (TC) to their DSCP values to match the NIC config
qos map traffic-class 3 to dscp 26
qos map traffic-class 7 to dscp 48
!!! Define PFC settings and ECN buffer thresholds for RoCE traffic. Note that ECN buffer thresholds can be optimized later depending on the workload running on the cluster.
qos profile QOS_ROCE_DCQCN
qos trust dscp
priority-flow-control on
priority-flow-control priority 3 no-drop
!
uc-tx-queue 3
random-detect ecn minimum-threshold 2000 segments maximum-threshold 10000 segments max-mark-probability 20 weight 0
random-detect ecn count
!
!!! Sample switch port configuration. Note speed was set to 200GbE because the NICs on nodes had a 200GbE speed.
interface Ethernet2/5/1
load-interval 2
mtu 9214
speed 200g-4
error-correction encoding reed-solomon
ip address 1.1.122.14/31
phy link training
service-profile QOS_ROCE_DCQCN
!
interface Ethernet2/5/5
load-interval 2
mtu 9214
speed 200g-4
error-correction encoding reed-solomon
ip address 1.1.122.24/31
phy link training
service-profile QOS_ROCE_DCQCN
!
For SONiC running on Dell switches, most of the configuration below is auto-generated when the roce enable command is run. Just make sure that the QoS configuration generated on the switch matches the configuration on the NICs.
Example - DCQCN configuration on Dell Z9664f-O64 switch using sonic-cli
!!! Configure ECN buffer thresholds for RoCE traffic. Thresholds can be adjusted later depending on network performance.
!
qos wred-policy ROCE
green minimum-threshold 2048 maximum-threshold 12480 drop-probability 15
ecn green
!
qos scheduler-policy ROCE
!
queue 0
type dwrr
weight 50
!
queue 3
type dwrr
weight 50
!
queue 4
type dwrr
weight 50
!
queue 6
type strict
!
qos map dscp-tc ROCE
dscp 0-3,5-23,25,27-47,49-63 traffic-class 0
dscp 24,26 traffic-class 3
dscp 4 traffic-class 4
dscp 48 traffic-class 6
!
qos map dot1p-tc ROCE
dot1p 0-2,5-7 traffic-class 0
dot1p 3 traffic-class 3
dot1p 4 traffic-class 4
!
qos map tc-queue ROCE
traffic-class 0 queue 0
traffic-class 1 queue 1
traffic-class 2 queue 2
traffic-class 3 queue 3
traffic-class 4 queue 4
traffic-class 5 queue 5
traffic-class 6 queue 6
traffic-class 7 queue 7
!
qos map tc-pg ROCE
traffic-class 3 priority-group 3
traffic-class 4 priority-group 4
traffic-class 0-2,5-7 priority-group 7
!
qos map pfc-priority-queue ROCE
pfc-priority 0 queue 0
pfc-priority 1 queue 1
pfc-priority 2 queue 2
pfc-priority 3 queue 3
pfc-priority 4 queue 4
pfc-priority 5 queue 5
pfc-priority 6 queue 6
pfc-priority 7 queue 7
!
qos map pfc-priority-pg ROCE
pfc-priority 0 pg 0
pfc-priority 1 pg 1
pfc-priority 2 pg 2
pfc-priority 3 pg 3
pfc-priority 4 pg 4
pfc-priority 5 pg 5
pfc-priority 6 pg 6
pfc-priority 7 pg 7
!
hardware
!
access-list
counters per-entry
!
tcam
!
line vty
service-policy type qos in oob-qos-policy
!
interface Loopback 0
ip address 192.168.0.1/32
!
interface Eth1/1
description Spine-Eth1/1
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!
interface Eth1/2
description Spine-Eth1/2
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!
interface Eth1/3
description Spine-Eth1/3
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!
Backend network routing methods for preventing ARP flux#
ARP flux occurs when an IP address is mapped to an incorrect MAC address in the ARP table. This is a known problem in Linux hosts with multiple network interfaces on the same subnet, as any ARP request for an IP address to a host will be answered by every available interface on that host.
For an HPC/AI cluster, an incorrect MAC address in the ARP table can have several impacts on RDMA traffic:
Communication may fail if the interface corresponding to the returned (incorrect) MAC address has no open RDMA session.
Multiple IP addresses may map to the same MAC address, resulting in one NIC receiving excessive traffic while other NICs sit idle, causing a performance bottleneck.
This section discusses two methods for mitigating the effects of ARP flux: IPv4 configuration at the host level, or VLAN/L3 routing at the switch level.
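Before applying either method, you can confirm that ARP flux is actually occurring. A minimal check from a neighboring host on the same subnet, assuming the arping utility is installed and eth0 is that host's interface on the subnet:
# Probe one IP address of the multi-NIC node from a neighboring host.
# Replies arriving from more than one MAC address indicate ARP flux.
$ sudo arping -c 3 -I eth0 <ip-of-multi-nic-node>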
Preventing ARP flux with Linux host IPv4 configuration#
You can set IPv4 sysctl parameters on individual Linux hosts to prevent ARP flux. This method is most effective when systems across the network are stable and do not frequently change OS.
To temporarily force only the correct NIC to respond to ARP, run the following commands:
$ sudo sysctl -w net.ipv4.conf.all.arp_announce=1  # Prefer source addresses on the target's subnet when sending ARP
$ sudo sysctl -w net.ipv4.conf.all.arp_ignore=2    # Reply only when the target IP is configured on the receiving interface
To make the change permanent, add these lines to /etc/sysctl.conf and reboot:
net.ipv4.conf.all.arp_announce = 1
net.ipv4.conf.all.arp_ignore = 2
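If you prefer to apply the persisted settings without rebooting, sysctl can reload /etc/sysctl.conf directly; you can then confirm the active values:
$ sudo sysctl -p                                                        # reload /etc/sysctl.conf
$ sysctl net.ipv4.conf.all.arp_announce net.ipv4.conf.all.arp_ignore    # confirm both values are active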
Preventing ARP flux with individual subnets and L3 routing#
Instead of configuring the host’s IPv4 parameters, you can leverage your network switches to isolate each NIC on a unique subnet. ARP requests are then forwarded through inter-VLAN or point-to-point routing and reach only one NIC at a time.
Backend network routing with VLAN#
Routing with VLANs ensures any two backend network NICs can communicate with one another while preventing ARP flux.
The requirements for inter-VLAN routing are as follows:
The number of VLANs must equal the number of backend NICs per host.
For each host, a NIC is routed to only one switch VLAN: NIC1 on each host is routed to VLAN1, NIC2 on each host to VLAN2, and so on.
If using SONIC as the switch OS, each VLAN is assigned an IP address on the switch side. On the server side, the VLAN IP address is specified as the gateway of the interface.
Switch VLAN-based routing - Host 1 example with 8 NICs
network:
ethernets:
eth1:
mtu: 9000
addresses:
- 192.168.2.1/24 # Unique subnet 192.168.2.X/24
routing-policy:
- from: 192.168.2.1
table: 102
routes: # Everything from this interface routes to VLAN with IP address 192.168.2.254
- to: 0.0.0.0/0
via: 192.168.2.254 # VLAN IP address specified in the switch
table: 102
eth2:
mtu: 9000
addresses:
- 192.168.3.1/24
routing-policy:
- from: 192.168.3.1
table: 103
routes:
- to: 0.0.0.0/0
via: 192.168.3.254
table: 103
eth3:
mtu: 9000
addresses:
- 192.168.4.1/24
routing-policy:
- from: 192.168.4.1
table: 104
routes:
- to: 0.0.0.0/0
via: 192.168.4.254
table: 104
eth4:
mtu: 9000
addresses:
- 192.168.5.1/24
routing-policy:
- from: 192.168.5.1
table: 105
routes:
- to: 0.0.0.0/0
via: 192.168.5.254
table: 105
eth5:
mtu: 9000
addresses:
- 192.168.6.1/24
routing-policy:
- from: 192.168.6.1
table: 106
routes:
- to: 0.0.0.0/0
via: 192.168.6.254
table: 106
eth6:
mtu: 9000
addresses:
- 192.168.7.1/24
routing-policy:
- from: 192.168.7.1
table: 107
routes:
- to: 0.0.0.0/0
via: 192.168.7.254
table: 107
eth7:
mtu: 9000
addresses:
- 192.168.8.1/24
routing-policy:
- from: 192.168.8.1
table: 108
routes:
- to: 0.0.0.0/0
via: 192.168.8.254
table: 108
eth8:
mtu: 9000
addresses:
- 192.168.9.1/24
routing-policy:
- from: 192.168.9.1
table: 109
routes:
- to: 0.0.0.0/0
via: 192.168.9.254
table: 109
version: 2
Switch VLAN-based routing - Host 2 example with 8 NICs
network:
ethernets:
eth1:
mtu: 9000
addresses:
- 192.168.2.2/24
routing-policy:
- from: 192.168.2.2
table: 102
routes:
- to: 0.0.0.0/0
via: 192.168.2.254
table: 102
eth2:
mtu: 9000
addresses:
- 192.168.3.2/24
routing-policy:
- from: 192.168.3.2
table: 103
routes:
- to: 0.0.0.0/0
via: 192.168.3.254
table: 103
eth3:
mtu: 9000
addresses:
- 192.168.4.2/24
routing-policy:
- from: 192.168.4.2
table: 104
routes:
- to: 0.0.0.0/0
via: 192.168.4.254
table: 104
eth4:
mtu: 9000
addresses:
- 192.168.5.2/24
routing-policy:
- from: 192.168.5.2
table: 105
routes:
- to: 0.0.0.0/0
via: 192.168.5.254
table: 105
eth5:
mtu: 9000
addresses:
- 192.168.6.2/24
routing-policy:
- from: 192.168.6.2
table: 106
routes:
- to: 0.0.0.0/0
via: 192.168.6.254
table: 106
eth6:
mtu: 9000
addresses:
- 192.168.7.2/24
routing-policy:
- from: 192.168.7.2
table: 107
routes:
- to: 0.0.0.0/0
via: 192.168.7.254
table: 107
eth7:
mtu: 9000
addresses:
- 192.168.8.2/24
routing-policy:
- from: 192.168.8.2
table: 108
routes:
- to: 0.0.0.0/0
via: 192.168.8.254
table: 108
eth8:
mtu: 9000
addresses:
- 192.168.9.2/24
routing-policy:
- from: 192.168.9.2
table: 109
routes:
- to: 0.0.0.0/0
via: 192.168.9.254
table: 109
version: 2
Example - Sonic switch configuration with VLAN definitions
interface Vlan1
description nic1_vlan
ip address 192.168.2.254/24
!
interface Vlan2
description nic2_vlan
ip address 192.168.3.254/24
!
interface Vlan3
description nic3_vlan
ip address 192.168.4.254/24
!
interface Vlan4
description nic4_vlan
ip address 192.168.5.254/24
!
interface Vlan5
description nic5_vlan
ip address 192.168.6.254/24
!
interface Vlan6
description nic6_vlan
ip address 192.168.7.254/24
!
interface Vlan7
description nic7_vlan
ip address 192.168.8.254/24
!
interface Vlan8
description nic8_vlan
ip address 192.168.9.254/24
!
interface Eth1/1
description "Node1 nic1"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 1
!
interface Eth1/2
description "Node1 nic2"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 2
!
interface Eth1/3
description "Node1 nic3"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 3
!
interface Eth1/4
description "Node1 nic4"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 4
!
interface Eth1/5
description "Node1 nic5"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 5
!
interface Eth1/6
description "Node1 nic6"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 6
!
interface Eth1/7
description "Node1 nic7"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 7
!
interface Eth1/8
description "Node1 nic8"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 8
!
interface Eth1/9
description "Node2 nic1"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 1
!
interface Eth1/10
description "Node2 nic2"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 2
!
interface Eth1/11
description "Node2 nic3"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 3
!
interface Eth1/12
description "Node2 nic4"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 4
!
interface Eth1/13
description "Node2 nic5"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 5
!
interface Eth1/14
description "Node2 nic6"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 6
!
interface Eth1/15
description "Node2 nic7"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 7
!
interface Eth1/16
description "Node2 nic8"
mtu 9100
speed 400000
fec RS
standalone-link-training
unreliable-los auto
no shutdown
switchport access Vlan 8
!
Backend network routing with /31 subnet point-to-point routing#
Since a /31 subnet contains exactly two usable host addresses, with no separate network or broadcast address, it can be leveraged to prevent ARP flux in a way similar to VLANs.
The requirements for point-to-point routing are:
Each NIC on a host must have a /31 network mask (for example, 192.168.131.X/31).
Each connected backend switch port must have an IP address that the NIC interface can use as a gateway.
Example - point-to-point /31 IPV4 routing host netplan file
network:
ethernets:
eth1:
mtu: 9000
addresses:
- 192.168.1.1/31
routing-policy:
- from: 192.168.1.1
table: 101
routes:
- to: 0.0.0.0/0
via: 192.168.1.0
table: 101
eth2:
mtu: 9000
addresses:
- 192.168.1.3/31
routing-policy:
- from: 192.168.1.3
table: 102
routes:
- to: 0.0.0.0/0
via: 192.168.1.2
table: 102
eth3:
mtu: 9000
addresses:
- 192.168.1.5/31
routing-policy:
- from: 192.168.1.5
table: 103
routes:
- to: 0.0.0.0/0
via: 192.168.1.4
table: 103
eth4:
mtu: 9000
addresses:
- 192.168.1.7/31
routing-policy:
- from: 192.168.1.7
table: 104
routes:
- to: 0.0.0.0/0
via: 192.168.1.6
table: 104
eth5:
mtu: 9000
addresses:
- 192.168.1.9/31
routing-policy:
- from: 192.168.1.9
table: 105
routes:
- to: 0.0.0.0/0
via: 192.168.1.8
table: 105
eth6:
mtu: 9000
addresses:
- 192.168.1.11/31
routing-policy:
- from: 192.168.1.11
table: 106
routes:
- to: 0.0.0.0/0
via: 192.168.1.10
table: 106
eth7:
mtu: 9000
addresses:
- 192.168.1.13/31
routing-policy:
- from: 192.168.1.13
table: 107
routes:
- to: 0.0.0.0/0
via: 192.168.1.12
table: 107
eth8:
mtu: 9000
addresses:
- 192.168.1.15/31
routing-policy:
- from: 192.168.1.15
table: 108
routes:
- to: 0.0.0.0/0
via: 192.168.1.14
table: 108
version: 2
Example - Switch configuration for point-to-point /31 IPV4 routing (applicable for Sonic, EOS, and others)
!
interface Eth1/1
description node1-eth1
ip address 192.168.1.0/31
!
interface Eth1/2
description node1-eth2
ip address 192.168.1.2/31
!
interface Eth1/3
description node1-eth3
ip address 192.168.1.4/31
!
interface Eth1/4
description node1-eth4
ip address 192.168.1.6/31
!
interface Eth1/5
description node1-eth5
ip address 192.168.1.8/31
!
interface Eth1/6
description node1-eth6
ip address 192.168.1.10/31
!
interface Eth1/7
description node1-eth7
ip address 192.168.1.12/31
!
interface Eth1/8
description node1-eth8
ip address 192.168.1.14/31
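After the netplan and switch configurations are applied, you can spot-check the point-to-point routing from a host. A minimal verification sketch using the addressing from the examples above (eth1 at 192.168.1.1/31, its switch port at 192.168.1.0, policy table 101):
$ sudo netplan apply              # apply the host netplan configuration
$ ip rule show                    # each NIC's source address should select its own routing table (101-108)
$ ip route show table 101         # eth1's default route should point to the switch port address 192.168.1.0
$ ping -c 3 -I eth1 192.168.1.0   # confirm the switch-side address answers over eth1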