Mellanox nmlx5_core driver 4.23 issues on ESXi 8.0 Update 1

Mellanox Driver Overview

Problem Inventory - Mellanox Driver Update on ESXi 8.0u1 causing network virtualization issues

After upgrading to ESXi 8.0 Update 1, the following issues appear on hosts with affected nmlx5_core adapters:

  • Delayed or failed IP discovery on VLAN-backed segments, even within the same host. Once an address is in the ARP cache, no further issues occur
  • Delayed or failed IP discovery and IP allocation failures on VLAN trunked port-groups, even within the same host. Issues persist even after IP discovery succeeds
  • Overlay encapsulation offload failures:
    • ICMP with any payload size will function bidirectionally via Edge Transport Nodes / FRRLinux machines, but TCP and UDP will not
    • All overlay traffic encapsulated by a vSphere host flows correctly between workloads on the same NSX overlay segment
    • All overlay traffic encapsulated by a vSphere host flows correctly between segments on the same NSX distributed router
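
The overlay symptom above (ICMP passes, TCP and UDP do not) can be checked from a workload with a small script. The following is a minimal sketch, not a tool used in the original troubleshooting: the helper names tcp_probe and matches_offload_symptom are hypothetical, the ICMP and UDP results are assumed to come from separate tests (for example ping and an application-level UDP check), and only the TCP handshake is probed directly.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def matches_offload_symptom(icmp_ok: bool, tcp_ok: bool, udp_ok: bool) -> bool:
    """True when ICMP works but TCP and UDP both fail (the pattern listed above)."""
    return icmp_ok and not tcp_ok and not udp_ok
```

If ping succeeds toward a destination behind the Edge Transport Node while tcp_probe returns False for a port known to be listening there, the result is consistent with the encapsulation offload failure rather than a routing or firewall problem, although it does not prove it.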

These issues are seen on the following hardware models:

  • MCX4121A-ACAT firmware revisions 14.25 and 14.32

These issues appear after the upgrade to vSphere 8.0 Update 1, which includes the following updated driver:

nmlx5-core 4.23.0.36-8vmw.800.1.0.20513097
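
When auditing several hosts, the version string above can be parsed and compared programmatically. This is a rough sketch under stated assumptions: the layout "<driver version>-<release>.<build>" is inferred from the single string quoted here, the function names are hypothetical, and treating the entire 4.23 series as suspect is an assumption, since only 4.23.0.36 is confirmed by this post. The string itself would typically be taken from the output of `esxcli software vib list` on the host.

```python
from typing import NamedTuple, Tuple


class VibVersion(NamedTuple):
    driver: Tuple[int, ...]  # driver version, e.g. (4, 23, 0, 36)
    build: int               # trailing build number, e.g. 20513097


def parse_vib_version(version: str) -> VibVersion:
    """Split a VIB version such as '4.23.0.36-8vmw.800.1.0.20513097'."""
    driver_part, release_part = version.strip().split("-", 1)
    driver = tuple(int(piece) for piece in driver_part.split("."))
    # Assumption: the last dot-separated field of the release part is the build.
    build = int(release_part.rsplit(".", 1)[1])
    return VibVersion(driver=driver, build=build)


def is_reported_series(version: str) -> bool:
    """True if the driver is in the 4.23 series named in this post (assumed scope)."""
    return parse_vib_version(version).driver[:2] == (4, 23)
```

A host reporting a 4.22 driver would return False here, which matches the rollback behaviour described in this post.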

This driver from NVIDIA ships with support for both BlueField SmartNIC and ConnectX Generation 5 network adapters as one package. Rolling back to a previous release of ESXi 8 with the previous driver (nmlx5-core 4.22) immediately resolves all overlay issues.

Resolution

UPDATE: This problem has been resolved in ESXi 8.0 Update 1c.