RoCE cluster network configuration guide for AMD Instinct accelerators#

RDMA over Converged Ethernet (RoCE) is a network protocol that can deliver speeds comparable to InfiniBand for AI/HPC workloads, typically at lower cost because it runs over standard Ethernet infrastructure.

This guide explains how to optimize the performance of a RoCE cluster network through proper configuration at the network interface card (NIC) and switch level, and describes routing approaches that mitigate issues such as MAC address mismatch (ARP flux), which can occur when establishing RDMA sessions on nodes with multiple NICs.

RoCE configuration for NICs#

The specific steps to configure your NIC for RoCE support differ by manufacturer. As this guide cannot cover every vendor, this section provides high-level recommendations for what to look for when setting up the most common RoCE NICs; always defer to the manufacturer's documentation for complete setup.

Install NIC firmware and driver#

Depending on the vendor, NIC driver packages typically include a regular Ethernet driver, a RoCE driver, and a peer-memory driver (also known as GPU Direct RDMA).

Specific installation steps vary from vendor to vendor and may differ between NIC models from the same vendor; most vendors publish public documentation for installing their RoCE drivers.

Always consult vendor-specific instructions in addition to this guide when configuring your NIC. You may need to reach out directly to the vendor if instructions are not publicly available.
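
Once the drivers are installed, a quick sanity check is to confirm that the RDMA devices are visible to user space. This is a generic sketch that assumes the rdma-core utilities (ibv_devices, ibv_devinfo) are available on the node:

$ ibv_devices

$ ibv_devinfo | grep -E "hca_id|state|link_layer"
# A working RoCE port typically reports "state: PORT_ACTIVE" and "link_layer: Ethernet"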

Note

The latest driver for a NIC may require a Linux kernel that is not yet supported by the AMD ROCm/amdgpu software stack. Before updating, review the supported operating systems for ROCm and verify that the kernel required by the driver is supported by the version of ROCm you have installed.
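
For example, assuming a DKMS-based amdgpu driver installation, you can check the running kernel and whether the amdgpu module has been built against it:

$ uname -r

$ dkms status | grep amdgpu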

Set static NIC speed#

Most high-speed network adapters support multiple speeds and ship with an “auto-negotiation” feature that sets the speed dynamically based on network conditions. It’s a best practice to disable this feature and configure a single static network speed instead. This avoids unexpected speed changes and simplifies debugging if you encounter performance issues.
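
As a generic Linux example, you can often inspect and pin the link speed with ethtool; the interface name ens1np0 and the 200 Gb/s speed below are placeholders, and some NICs require the vendor's own utility instead:

$ sudo ethtool ens1np0 | grep -E "Speed|Auto-negotiation"

$ sudo ethtool -s ens1np0 speed 200000 duplex full autoneg off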

Enable RoCE support mode#

Most RoCE-capable NICs have a feature flag that must be set before they can communicate through RDMA. As the default setting for this feature differs by NIC vendor and model, you must verify all NICs are configured to support RoCE before running tests.

As an example, Broadcom NICs use the support_rdma flag to govern this feature. You can check the status with the NICCLI configuration tool:

$ sudo niccli -i 3 nvm -getoption support_rdma -scope 0

support_rdma = False

The general syntax is:

sudo niccli -i <NIC index> nvm -getoption support_rdma -scope <scope index>

In this case, the NIC is not configured to support RoCE, so run nvm -setoption to enable it:

$ sudo niccli -i 3 nvm -setoption support_rdma -value 1 -scope 0

support_rdma is set successfully

The general syntax is:

sudo niccli -i <NIC index> nvm -setoption support_rdma -value <value> -scope <scope index>

Other vendors use different utilities and flags to control this setting; refer to vendor-specific documentation in those cases. For Broadcom NICs, you can also use the Broadcom RoCE configuration scripts provided in the networking guides to review and configure RDMA support in bulk across all NICs in a node, as sketched below.
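
As a rough sketch of reviewing the flag across several Broadcom NICs at once, you can loop over the NIC indexes with the same niccli command shown above; the index range 1-8 and scope 0 are placeholders for your topology:

$ for i in $(seq 1 8); do echo "NIC $i:"; sudo niccli -i $i nvm -getoption support_rdma -scope 0; done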

Enable PCIe relaxed ordering#

Enabling relaxed ordering on your NICs can improve performance by relaxing the strict transaction ordering rules defined in the base PCIe specification. As with RoCE support, how to enable this feature differs by vendor and NIC model, but examples for Broadcom NICs are provided in this guide as a starting point.

Note

NIC configuration is only one part of enabling PCIe relaxed ordering. You must also ensure your server architecture supports relaxed ordering and that it is enabled in BIOS on each node in your cluster.
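
As an OS-level cross-check that does not depend on the vendor tool, you can read the PCIe Device Control register of the NIC with lspci; the bus address 41:00.0 is a placeholder for your NIC:

$ sudo lspci -s 41:00.0 -vvv | grep RlxdOrd
# "RlxdOrd+" on the DevCtl line means relaxed ordering is enabled for this device;
# "RlxdOrd-" means it is disabled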

To check relaxed ordering on a Broadcom NIC, use NICCLI:

$ sudo niccli -i 3 nvm -getoption pcie_relaxed_ordering

pcie_relaxed_ordering = Enabled

The general syntax is:

sudo niccli -i <NIC index> nvm -getoption pcie_relaxed_ordering

If pcie_relaxed_ordering is reported as Disabled, enable it with this command:

$ sudo niccli -i 3 nvm -setoption pcie_relaxed_ordering -value 1

pcie_relaxed_ordering is set successfully
Please reboot the system to apply the configuration

The general syntax is:

sudo niccli -i <NIC index> nvm -setoption pcie_relaxed_ordering -value <value>

For other vendors, refer to vendor-specific documentation for how to verify and enable this setting. For Broadcom NICs, you can also use the Broadcom RoCE configuration scripts provided in the networking guides to review and configure relaxed ordering in bulk across all NICs in a node.

Disable ACS and set IOMMU passthrough#

On any node hosting GPUs, disable ACS and configure IOMMU passthrough so that peer-to-peer transfers between NICs and GPUs function as expected.

To disable ACS, use the disable ACS script provided in the single-node network guide. To set IOMMU passthrough on a Linux system, add iommu=pt to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub, then run sudo update-grub. You can see a more detailed flow at GRUB settings and Issue #5: Application hangs on Multi-GPU systems.
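
A minimal sketch of that GRUB change; the existing kernel parameters shown (quiet splash) are placeholders for whatever your distribution already sets:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"

$ sudo update-grub
$ sudo reboot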

Enable DCQCN through QoS configuration#

Data Center Quantized Congestion Notification (DCQCN) is a traffic control method that combines two features, Explicit Congestion Notification (ECN) and Priority Flow Control (PFC), to support end-to-end lossless Ethernet in a data center environment.

In communication between NICs, ECN detects congestion in network switch buffers and alerts the endpoint (receiving NIC) through the ECN bits in the packet headers. The receiving NIC then transmits a congestion notification packet (CNP) to the sending NIC so that it reduces its transfer rate. If congestion remains too high for a traffic class even with ECN in effect, PFC pauses traffic for that class until the congestion is resolved.

ECN and PFC are configured per NIC through quality of service (QoS) parameters. Refer to your vendor-specific documentation on how to set the following parameters for your NICs:

  • Set the RoCE priority class.

  • Enable PFC on the RoCE priority class.

  • Set the CNP priority class (usually the highest priority, 7).

Example of default QoS configuration on a Broadcom Thor2 NIC
# RoCE v2 packets are marked with a DSCP value of 26 and use Priority 3 internally
# CNP packets are marked with a DSCP value of 48 and use Priority 7 internally
# PFC is enabled for Priority 3 traffic
# Three traffic classes are set up: TC0 for non-RoCE traffic, TC1 for RoCE traffic, and TC2 for CNP traffic
# RoCE and non-RoCE traffic share ETS bandwidth of 50% each. The ETS bandwidth share applies only when there is traffic to use it; in the absence of non-RoCE traffic, all available bandwidth is used by RoCE, and vice versa.
# CNP traffic is treated as ETS strict priority

$ sudo niccli -dev 1 get_qos

IEEE 8021QAZ ETS Configuration TLV:
         PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2
         TC Bandwidth: 50% 50% 0%
         TSA_MAP: 0:ets 1:ets 2:strict
IEEE 8021QAZ PFC TLV:
         PFC enabled: 3
IEEE 8021QAZ APP TLV:
         APP#0:
         Priority: 7
         Sel: 5
         DSCP: 48

         APP#1:
         Priority: 3
         Sel: 5
         DSCP: 26

         APP#2:
         Priority: 3
         Sel: 3
         UDP or DCCP: 4791

TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0%

$ sudo niccli -dev 1 dump pri2cos

Base Queue is 0 for port 0
----------------------------
Priority   TC   Queue ID
------------------------
0         0    4
1         0    4
2         0    4
3         1    0
4         0    4
5         0    4
6         0    4
7         2    5

$ sudo niccli -dev 1 get_dscp2prio

dscp2prio mapping:
         priority:7  dscp: 48
         priority:3  dscp: 26
The general syntax for these commands is:

sudo niccli -dev <NIC index> get_qos

sudo niccli -dev <NIC index> dump pri2cos

sudo niccli -dev <NIC index> get_dscp2prio

NIC QoS troubleshooting#

Sometimes, the default QoS on a NIC may differ significantly from that recommended by Broadcom in the BCM957608 Ethernet Networking Guide for AMD Instinct MI300X GPU Clusters (pages 38-42), even if the RoCE profile has been set in NVM.

If you determine this is the case for any of your NICs, follow these steps to resolve:

  1. Run the install.sh script provided in the Broadcom NIC release package.

    $ cd bcm5760x_<version.x.y.z>/utils/linux_installer/
    
    $ bash install.sh -i <interface_name> -o ECNPFC -f -b 50 -w
    

    The flags can be understood as follows:

    • -o ECNPFC - Enables PFC, sets the traffic priorities, and sets the DSCP values for RoCE and CNP traffic.

    • -b 50 - Guarantees RoCE a minimum of 50% of the bandwidth.

    • -w - Assumes the proper firmware is already installed and skips the firmware update.

  2. Once the script completes, run sudo reboot so that the RoCE driver does not warn about mismatched DSCP values.

  3. Verify the QoS is now correct. Run sudo niccli -dev 1 get_qos and ensure the output matches the example in the previous section, paying particular attention to the PFC state, traffic classes, and DSCP values.

RoCE configuration for network switches#

You will need direct or remote access to switches in your cluster to configure them for optimal data transfer over a RoCE network. This guide provides instructions for Dell and Arista switches using SONiC and Arista EOS respectively.

Switch authentication and configuration terminal access#

The first step is to log in to the switch and elevate your permissions so that you can change configurations.

On Dell switches running SONiC:

  1. Access the switch CLI with ssh.

  2. Run sonic-cli.

  3. Run configure or configure terminal to enter configuration mode.

  4. Run exit at any time to leave configuration mode.

On Arista switches running EOS:

  1. Access the switch CLI with ssh.

  2. Run enable to receive elevated privileges.

  3. Run configure terminal to enter configuration mode.

  4. Run exit at any time to leave configuration mode.

Enable RoCE support#

On Dell switches running SONiC:

  1. While in configuration mode, run roce enable.

  2. Reboot the switch if prompted.

On Arista switches running EOS, RoCE communication is supported by default. Instead of a global enable command, ensure PFC for the RoCE traffic class is enabled on each port that handles RoCE traffic (see the Arista DCQCN example later in this guide).

Implement standard extended naming for switch interfaces#

Dell recommends what is referred to as the “standard” or “standard extended” naming convention for switch interfaces. The naming scheme reads as Eth<line_card_id>/<port_id>/[breakout_port_id]. On a fixed switch, this scheme makes it simple to match each port to its front-panel label: Eth1/16/[x] corresponds to port 16 as physically labeled on the switch, and the line card ID remains 1 because there is a single line card.

However, if multiple line cards are present in a modular switch like the Arista 7388X5 series, additional effort is required to match the port name to its physical label.

On Dell switches running SONiC:

  1. While in configuration mode, run interface-naming standard extended.

  2. Run write memory.

  3. Log out of the switch, then log back in to view the change in interface names.

    --------------------------------------------------------------------------------------------------------------------------------------
    Interface    Name                                    Vendor            Part No.          Serial No.        QSA Adapter       Qualified
    --------------------------------------------------------------------------------------------------------------------------------------
    Eth1/1       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR711I0060   N/A               True
    Eth1/2       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR711I0042   N/A               True
    Eth1/3       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR70CO0095   N/A               True
    ...
    Eth1/64      QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR70CM0041   N/A               True
    Eth1/65      N/A                                     N/A               N/A               N/A               N/A               False
    Eth1/66      N/A                                     N/A               N/A               N/A               N/A               False
    

Arista switches are pre-configured to use the standard extended naming convention; no additional action should be required.

Verify all connected transceivers are detected#

Once all physical cluster cabling is complete, check that your switch transceivers are detected and online.

On Dell switches running SONiC:

  1. While in configuration mode, run show interface transceiver summary | no-more.

  2. Verify all transceivers appear in the interface list.

    --------------------------------------------------------------------------------------------------------------------------------------
    Interface    Name                                    Vendor            Part No.          Serial No.        QSA Adapter       Qualified
    --------------------------------------------------------------------------------------------------------------------------------------
    Eth1/1       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR711I0060   N/A               True
    Eth1/2       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR711I0042   N/A               True
    Eth1/3       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR70CO0095   N/A               True
    Eth1/4       QSFP56-DD 400GBASE-SR8-AEC-3.0M         DELL EMC          DH11M             CN0F9KR70CQ0026   N/A               True
    ...
    
On Arista switches running EOS:

  1. While in configuration mode, run show inventory.

  2. Verify all transceivers appear in the interface list.

    System has 54 switched transceiver slots
      Port Manufacturer     Model            Serial Number    Rev
      ---- ---------------- ---------------- ---------------- ----
      1    Arista Networks  DCS-7050TX-72Q
      2    Arista Networks  DCS-7050TX-72Q
      3    Arista Networks  DCS-7050TX-72Q
      4    Arista Networks  DCS-7050TX-72Q
      5    Arista Networks  DCS-7050TX-72Q
    

Match switch QoS configuration to NIC for DCQCN#

It’s critical that the configuration you set up on a NIC be matched by your switch. Refer to the BCM957608 Ethernet Networking Guide for AMD Instinct MI300X GPU Clusters (pages 38-42) as a reference; the configuration detailed there is a good baseline for most switches and NICs.

DCQCN in Arista EOS can be summarized as:

  • Enable PFC on all ports.

  • Match the RoCE and CNP DSCP values to those set on the NICs.

  • Enable ECN and configure ECN thresholds.

Example - DCQCN configuration on an Arista 7388 switch
!!! Map the RoCE and CNP traffic classes (TC) to their DSCP values to match the NIC config
qos map traffic-class 3 to dscp 26
qos map traffic-class 7 to dscp 48

!!! Define PFC settings and ECN buffer thresholds for RoCE traffic. Note that ECN buffer thresholds can be optimized later depending on the workload running on the cluster.
qos profile QOS_ROCE_DCQCN
   qos trust dscp
   priority-flow-control on
   priority-flow-control priority 3 no-drop
   !
   uc-tx-queue 3
      random-detect ecn minimum-threshold 2000 segments maximum-threshold 10000 segments max-mark-probability 20 weight 0
      random-detect ecn count
!

!!! Sample switch port configuration. Note that the speed is set to 200GbE because the NICs on the nodes run at 200GbE.
interface Ethernet2/5/1
   load-interval 2
   mtu 9214
   speed 200g-4
   error-correction encoding reed-solomon
   ip address 1.1.122.14/31
   phy link training
   service-profile QOS_ROCE_DCQCN
!
interface Ethernet2/5/5
   load-interval 2
   mtu 9214
   speed 200g-4
   error-correction encoding reed-solomon
   ip address 1.1.122.24/31
   phy link training
   service-profile QOS_ROCE_DCQCN
!

For SONiC running on Dell switches, most of the configuration below is auto-generated when the roce enable command is run. Just make sure that the QoS configuration generated on the switch matches the configuration on the NICs.

Example - DCQCN configuration on Dell Z9664f-O64 switch using sonic-cli
!!! Configure ECN buffer thresholds for RoCE traffic. Thresholds can be adjusted later depending on network performance.
!
qos wred-policy ROCE
green minimum-threshold 2048 maximum-threshold 12480 drop-probability 15
ecn green
!
qos scheduler-policy ROCE
!
queue 0
type dwrr
weight 50
!
queue 3
type dwrr
weight 50
!
queue 4
type dwrr
weight 50
!
queue 6
type strict
!
qos map dscp-tc ROCE
dscp 0-3,5-23,25,27-47,49-63 traffic-class 0
dscp 24,26 traffic-class 3
dscp 4 traffic-class 4
dscp 48 traffic-class 6
!
qos map dot1p-tc ROCE
dot1p 0-2,5-7 traffic-class 0
dot1p 3 traffic-class 3
dot1p 4 traffic-class 4
!
qos map tc-queue ROCE
traffic-class 0 queue 0
traffic-class 1 queue 1
traffic-class 2 queue 2
traffic-class 3 queue 3
traffic-class 4 queue 4
traffic-class 5 queue 5
traffic-class 6 queue 6
traffic-class 7 queue 7
!
qos map tc-pg ROCE
traffic-class 3 priority-group 3
traffic-class 4 priority-group 4
traffic-class 0-2,5-7 priority-group 7
!
qos map pfc-priority-queue ROCE
pfc-priority 0 queue 0
pfc-priority 1 queue 1
pfc-priority 2 queue 2
pfc-priority 3 queue 3
pfc-priority 4 queue 4
pfc-priority 5 queue 5
pfc-priority 6 queue 6
pfc-priority 7 queue 7
!
qos map pfc-priority-pg ROCE
pfc-priority 0 pg 0
pfc-priority 1 pg 1
pfc-priority 2 pg 2
pfc-priority 3 pg 3
pfc-priority 4 pg 4
pfc-priority 5 pg 5
pfc-priority 6 pg 6
pfc-priority 7 pg 7
!
hardware
!
access-list
counters per-entry
!
tcam
!
line vty
service-policy type qos in oob-qos-policy
!
interface Loopback 0
ip address 192.168.0.1/32
!
interface Eth1/1
description Spine-Eth1/1
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!
interface Eth1/2
description Spine-Eth1/2
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!
interface Eth1/3
description Spine-Eth1/3
mtu 9216
speed 400000
fec RS
unreliable-los auto
no shutdown
ipv6 enable
ars bind port_pro
queue 3 wred-policy ROCE
queue 4 wred-policy ROCE
scheduler-policy ROCE
qos-map dscp-tc ROCE
qos-map dot1p-tc ROCE
qos-map tc-queue ROCE
qos-map tc-pg ROCE
qos-map pfc-priority-queue ROCE
qos-map pfc-priority-pg ROCE
priority-flow-control priority 3
priority-flow-control priority 4
priority-flow-control watchdog action drop
priority-flow-control watchdog on detect-time 200
priority-flow-control watchdog restore-time 400
!

Backend network routing methods for preventing ARP flux#

ARP flux occurs when an IP address is mapped to an incorrect MAC address in the ARP table. This is a known problem on Linux hosts with multiple network interfaces on the same subnet, because an ARP request for any of the host's IP addresses may be answered by every interface on that host.

For an HPC/AI cluster, an incorrect MAC address in the ARP table can have several impacts on RDMA traffic:

  • Communication may fail if the interface corresponding to the returned (incorrect) MAC address has no open RDMA session.

  • Multiple IP addresses may map to the same MAC address, resulting in one NIC receiving excessive traffic while the other NICs sit idle, causing a performance bottleneck.

This section discusses two methods for mitigating the effects of ARP flux: IPv4 configuration at the host level, or VLAN/L3 routing at the switch level.
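
One generic way to spot ARP flux on a host is to look for several backend IP addresses resolving to the same MAC address in the neighbor table:

$ ip neigh show | sort -k5
# Entries look like "192.168.2.2 dev eth1 lladdr aa:bb:cc:dd:ee:ff REACHABLE";
# multiple backend IPs sharing one lladdr is a sign of ARP flux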

Preventing ARP flux with Linux host IPv4 configuration#

You can set IPv4 sysctl parameters on individual Linux hosts to prevent ARP flux. This method is most effective when the systems across the network are stable and do not change operating systems frequently.

To temporarily force only the correct NIC to respond to ARP, run the following commands:

$ sudo sysctl -w net.ipv4.conf.all.arp_announce=1 # Prefer source addresses on the target's subnet when sending ARP

$ sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 # Reply only on the interface that owns the requested IP address

To make the change permanent, add these lines to /etc/sysctl.conf and reboot:

net.ipv4.conf.all.arp_announce = 1
net.ipv4.conf.all.arp_ignore = 2
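
If you prefer not to reboot, the values in /etc/sysctl.conf can usually be applied immediately:

$ sudo sysctl -p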

Preventing ARP flux with individual subnets and L3 routing#

Instead of configuring the host's IPv4 parameters, you can use your network switches to isolate each NIC on its own subnet. ARP requests then reach only one NIC at a time, with traffic between NICs handled through inter-VLAN or point-to-point routing.

Backend network routing with VLAN#

Routing with VLANs ensures any two backend network NICs can communicate with one another while preventing ARP flux.

The requirements for inter-VLAN routing are as follows:

  • The number of VLANs must equal the number of backend NICs per host.

  • For each host, a NIC is routed to only one switch VLAN: NIC1 on each host is routed to VLAN1, NIC2 on each host to VLAN2, and so on.

  • If using SONiC as the switch OS, each VLAN is assigned an IP address on the switch side. On the server side, the VLAN IP address is specified as the gateway of the interface.

Switch VLAN-based routing - Host 1 example with 8 NICs
network:
  ethernets:
    eth1:
      mtu: 9000
      addresses:
      - 192.168.2.1/24  # Unique subnet 192.168.2.X/24
      routing-policy:
        - from: 192.168.2.1
          table: 102
      routes:   # Everything from this interface routes to VLAN with IP address 192.168.2.254
        - to: 0.0.0.0/0
          via: 192.168.2.254    # VLAN IP address specified in the switch
          table: 102
    eth2:
      mtu: 9000
      addresses:
      - 192.168.3.1/24
      routing-policy:
        - from: 192.168.3.1
          table: 103
      routes:
        - to: 0.0.0.0/0
          via: 192.168.3.254
          table: 103
    eth3:
      mtu: 9000
      addresses:
      - 192.168.4.1/24
      routing-policy:
        - from: 192.168.4.1
          table: 104
      routes:
        - to: 0.0.0.0/0
          via: 192.168.4.254
          table: 104
    eth4:
      mtu: 9000
      addresses:
      - 192.168.5.1/24
      routing-policy:
        - from: 192.168.5.1
          table: 105
      routes:
        - to: 0.0.0.0/0
          via: 192.168.5.254
          table: 105
    eth5:
      mtu: 9000
      addresses:
      - 192.168.6.1/24
      routing-policy:
        - from: 192.168.6.1
          table: 106
      routes:
        - to: 0.0.0.0/0
          via: 192.168.6.254
          table: 106
    eth6:
      mtu: 9000
      addresses:
      - 192.168.7.1/24
      routing-policy:
        - from: 192.168.7.1
          table: 107
      routes:
        - to: 0.0.0.0/0
          via: 192.168.7.254
          table: 107
    eth7:
      mtu: 9000
      addresses:
      - 192.168.8.1/24
      routing-policy:
        - from: 192.168.8.1
          table: 108
      routes:
        - to: 0.0.0.0/0
          via: 192.168.8.254
          table: 108
    eth8:
      mtu: 9000
      addresses:
      - 192.168.9.1/24
      routing-policy:
        - from: 192.168.9.1
          table: 109
      routes:
        - to: 0.0.0.0/0
          via: 192.168.9.254
          table: 109
  version: 2
Switch VLAN-based routing - Host 2 example with 8 NICs
network:
  ethernets:
    eth1:
      mtu: 9000
      addresses:
      - 192.168.2.2/24
      routing-policy:
        - from: 192.168.2.2
          table: 102
      routes:
        - to: 0.0.0.0/0
          via: 192.168.2.254
          table: 102
    eth2:
      mtu: 9000
      addresses:
      - 192.168.3.2/24
      routing-policy:
        - from: 192.168.3.2
          table: 103
      routes:
        - to: 0.0.0.0/0
          via: 192.168.3.254
          table: 103
    eth3:
      mtu: 9000
      addresses:
      - 192.168.4.2/24
      routing-policy:
        - from: 192.168.4.2
          table: 104
      routes:
        - to: 0.0.0.0/0
          via: 192.168.4.254
          table: 104
    eth4:
      mtu: 9000
      addresses:
      - 192.168.5.2/24
      routing-policy:
        - from: 192.168.5.2
          table: 105
      routes:
        - to: 0.0.0.0/0
          via: 192.168.5.254
          table: 105
    eth5:
      mtu: 9000
      addresses:
      - 192.168.6.2/24
      routing-policy:
        - from: 192.168.6.2
          table: 106
      routes:
        - to: 0.0.0.0/0
          via: 192.168.6.254
          table: 106
    eth6:
      mtu: 9000
      addresses:
      - 192.168.7.2/24
      routing-policy:
        - from: 192.168.7.2
          table: 107
      routes:
        - to: 0.0.0.0/0
          via: 192.168.7.254
          table: 107
    eth7:
      mtu: 9000
      addresses:
      - 192.168.8.2/24
      routing-policy:
        - from: 192.168.8.2
          table: 108
      routes:
        - to: 0.0.0.0/0
          via: 192.168.8.254
          table: 108
    eth8:
      mtu: 9000
      addresses:
      - 192.168.9.2/24
      routing-policy:
        - from: 192.168.9.2
          table: 109
      routes:
        - to: 0.0.0.0/0
          via: 192.168.9.254
          table: 109
  version: 2
Example - SONiC switch configuration with VLAN definitions
interface Vlan1
 description nic1_vlan
 ip address 192.168.2.254/24
!
interface Vlan2
 description nic2_vlan
 ip address 192.168.3.254/24
!
interface Vlan3
 description nic3_vlan
 ip address 192.168.4.254/24
!
interface Vlan4
 description nic4_vlan
 ip address 192.168.5.254/24
!
interface Vlan5
 description nic5_vlan
 ip address 192.168.6.254/24
!
interface Vlan6
 description nic6_vlan
 ip address 192.168.7.254/24
!
interface Vlan7
 description nic7_vlan
 ip address 192.168.8.254/24
!
interface Vlan8
 description nic8_vlan
 ip address 192.168.9.254/24
!

interface Eth1/1
 description "Node1 nic1"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 1
!
interface Eth1/2
 description "Node1 nic2"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 2
!
interface Eth1/3
 description "Node1 nic3"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 3
!
interface Eth1/4
 description "Node1 nic4"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 4
!
interface Eth1/5
 description "Node1 nic5"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 5
!
interface Eth1/6
 description "Node1 nic6"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 6
!
interface Eth1/7
 description "Node1 nic7"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 7
!
interface Eth1/8
 description "Node1 nic8"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 8
!
interface Eth1/9
 description "Node2 nic1"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 1
!
interface Eth1/10
 description "Node2 nic2"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 2
!
interface Eth1/11
 description "Node2 nic3"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 3
!
interface Eth1/12
 description "Node2 nic4"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 4
!
interface Eth1/13
 description "Node2 nic5"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 5
!
interface Eth1/14
 description "Node2 nic6"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 6
!
interface Eth1/15
 description "Node2 nic7"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 7
!
interface Eth1/16
 description "Node2 nic8"
 mtu 9100
 speed 400000
 fec RS
 standalone-link-training
 unreliable-los auto
 no shutdown
 switchport access Vlan 8
!
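
After the host netplan files are applied and the switch VLANs are configured, a simple sanity check is to ping each VLAN gateway from the matching host interface; the interface name and gateway address below correspond to NIC1 and VLAN1 in the examples above:

$ sudo netplan apply

$ ping -c 3 -I eth1 192.168.2.254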

Backend network routing with /31 subnet point-to-point routing#

A /31 subnet (RFC 3021) contains exactly two usable host addresses, with no separate network or broadcast address, so each NIC-to-switch-port link can be placed on its own subnet to prevent ARP flux in a way similar to VLANs. For example, 192.168.1.0/31 contains only 192.168.1.0 (assigned to the switch port) and 192.168.1.1 (assigned to the NIC).

The requirements for point-to-point routing are:

  • Each NIC on a host must have a /31 network mask (for example, 192.168.131.X/31).

  • Each connected backend switch port must have an IP address that the NIC interface can use as a gateway.

Example - point-to-point /31 IPv4 routing host netplan file
network:
  ethernets:
    eth1:
      mtu: 9000
      addresses:
      - 192.168.1.1/31
      routing-policy:
      - from: 192.168.1.1
        table: 101
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.0
        table: 101
    eth2:
      mtu: 9000
      addresses:
      - 192.168.1.3/31
      routing-policy:
      - from: 192.168.1.3
        table: 102
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.2
        table: 102
    eth3:
      mtu: 9000
      addresses:
      - 192.168.1.5/31
      routing-policy:
      - from: 192.168.1.5
        table: 103
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.4
        table: 103
    eth4:
      mtu: 9000
      addresses:
      - 192.168.1.7/31
      routing-policy:
      - from: 192.168.1.7
        table: 104
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.6
        table: 104
    eth5:
      mtu: 9000
      addresses:
      - 192.168.1.9/31
      routing-policy:
      - from: 192.168.1.9
        table: 105
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.8
        table: 105
    eth6:
      mtu: 9000
      addresses:
      - 192.168.1.11/31
      routing-policy:
      - from: 192.168.1.11
        table: 106
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.10
        table: 106
    eth7:
      mtu: 9000
      addresses:
      - 192.168.1.13/31
      routing-policy:
      - from: 192.168.1.13
        table: 107
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.12
        table: 107
    eth8:
      mtu: 9000
      addresses:
      - 192.168.1.15/31
      routing-policy:
      - from: 192.168.1.15
        table: 108
      routes:
      - to: 0.0.0.0/0
        via: 192.168.1.14
        table: 108
  version: 2
Example - Switch configuration for point-to-point /31 IPv4 routing (applicable to SONiC, EOS, and others)
!
interface Eth1/1
 description node1-eth1
 ip address 192.168.1.0/31
!
interface Eth1/2
 description node1-eth2
 ip address 192.168.1.2/31
!
interface Eth1/3
 description node1-eth3
 ip address 192.168.1.4/31
!
interface Eth1/4
 description node1-eth4
 ip address 192.168.1.6/31
!
interface Eth1/5
 description node1-eth5
 ip address 192.168.1.8/31
!
interface Eth1/6
 description node1-eth6
 ip address 192.168.1.10/31
!
interface Eth1/7
 description node1-eth7
 ip address 192.168.1.12/31
!
interface Eth1/8
 description node1-eth8
 ip address 192.168.1.14/31
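
As with the VLAN approach, you can verify each /31 link by pinging the switch-side address from the corresponding host interface; eth1 and 192.168.1.0 below match the examples above:

$ sudo netplan apply

$ ping -c 3 -I eth1 192.168.1.0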