Intel 10GbE NICs missing after VMware ESXi 7 upgrade

To provide a quick back story on how I got here: I purchased the first pieces of my home lab back in 2016. When building my home lab, my aim was to build an environment which was small, quiet and closely aligned to the VMware HCL. I obviously also had a budget to work with. At the time many people within the vCommunity were choosing to build their home labs using retired server hardware such as Dell PowerEdge R710s, or cheap & unsupported hardware such as Intel NUCs (remember, this was 2016). Neither of these quite suited me because they were either too large and loud, or unsupported, so instead I started to look at SuperMicro. These days most people would be familiar with the variety of hardware offerings SuperMicro provide in their lineup, most of which can be found on the VMware HCL. However, back in 2016 the vCommunity were only just starting to explore the possibilities of building a home lab using SuperMicro. The other challenge I had was that the 5028D-TN4T server, which was the popular choice, did not come cheap, especially for those living outside the US!

So I decided to build my own unique setup using SuperMicro X10SDV-TLN4F motherboards. I assumed that by choosing a platform which was on the VMware HCL, I would avoid all the pain and suffering of having to tinker with the hardware to get it to work. Well, I assumed wrong!

When I first built my lab I was using a single physical host running VMware ESXi 6.5. I initially used this environment to deploy nested vSphere environments to develop and test AsBuiltReport against many different versions of VMware vSphere. Over the years I have gradually added more hardware to scale out my environment and provide more capacity and redundancy. However, as more hardware was added, I began to encounter more issues with my setup. The first issue I encountered was when I added some Noctua cooling fans and discovered that I needed to modify the fan thresholds. When I chose to add a second ESXi host, directly connecting the onboard Intel X552/X557-AT 10GbE network adapters for vMotion and vSAN traffic, I encountered another issue whereby the NICs would intermittently disconnect and never reconnect. Thankfully Paul Braren was able to provide a solution via his website over at TinkerTry.

And so begins this story…

My home lab had been running vSphere 6.7 U3 for quite some time without any issues whatsoever. Given the stability of my environment, I had been reluctant to upgrade to vSphere 7 earlier this year, mostly due to some of the issues I had heard about, such as ESXi hosts rebooting back into ESXi 6.7. I did decide to make some changes in preparation for the new ESXi 7.0 partition layout, moving my ESXi 6.7 boot drives from USB to NVMe SSD. Moving the boot drive was easy and everything went smoothly. All I was waiting for now was vSphere 7.0 U1 to drop.

Yesterday I woke to news that VMware vSphere 7.0 Update 1 had been released, and I wasted no time in starting the upgrade of my home lab. I had already upgraded my vCenter Server to 7.0, so I proceeded to update it to U1 via the VAMI. Easy as! Next up, I downloaded the VMware ESXi 7.0 U1 ISO image and uploaded it to Lifecycle Manager, created my baseline and attached it to my ESXi hosts. I currently run a 2-node vSAN setup with the vSAN witness server nested on my physical hosts (unsupported, I know!). I upgraded the vSAN witness server without issue and quickly moved on to upgrading my first physical ESXi host. I performed the upgrade and the host rebooted successfully, however my vSAN cluster was now reporting a network partition error. On closer inspection I found that my ESXi 7.0 host was now missing its two 10GbE network adapters, the same adapters I use for vSAN and vMotion traffic! Thankfully I had anticipated the scenario of losing the vSAN datastore and had placed my vCenter Server on an NFS datastore hosted on my Synology NAS.
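
For anyone who prefers the command line over Lifecycle Manager, the same host upgrade can be done with the ESXi offline bundle via esxcli. The depot path and image profile name below are placeholders, so list the profiles contained in your bundle first:

[root@esxi:~] esxcli software sources profile list -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U1-depot.zip    # placeholder path; lists the image profiles in the bundle
[root@esxi:~] esxcli system maintenanceMode set -e true                                                       # put the host into maintenance mode first
[root@esxi:~] esxcli software profile update -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U1-depot.zip -p ESXi-7.0U1-16850804-standard    # profile name is an example
[root@esxi:~] reboot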

And down the rabbit hole I went…

My first thought was that the issue related to the Intel NIC driver. I soon found myself reading about the deprecation of the legacy VMKlinux drivers in vSphere 7.0. This then led me to check the VMware HCL, where I found that the NICs were natively supported in vSphere 7.0. I SSH'd to the ESXi host and ran esxcli software vib list to confirm that the native inbox drivers were installed.
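
If you want to run the same check, filtering the VIB list for the Intel drivers keeps the output manageable; the exact package names and versions will vary with your build (the native driver ships as ixgben, while the legacy VMKlinux one typically shows up as net-ixgbe):

[root@esxi:~] esxcli software vib list | grep -i ixgb   # confirm which Intel 10GbE driver packages are installed
[root@esxi:~] esxcli network nic list                   # list the NICs the host can currently see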

The VMware hardware compatibility list, however, was showing a newer version of the ixgben driver, so I installed the updated driver to see if this would resolve my issue. Unfortunately it did not, and my NICs were still not being detected. I then proceeded on the merry-go-round of installing and testing every one of the listed drivers. No change!!
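
For anyone wanting to do the same, updating the ixgben driver is a matter of uploading the driver offline bundle to a datastore and installing it with esxcli, followed by a reboot; the bundle path below is just a placeholder:

[root@esxi:~] esxcli software vib install -d /vmfs/volumes/datastore1/Intel-ixgben-offline-bundle.zip   # placeholder path to the downloaded driver bundle
[root@esxi:~] reboot                                                                                    # the new driver is only picked up after a reboot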

Next I thought it could be a BIOS or firmware issue, so I proceeded with full system updates of the BIOS, NICs and IPMI for good measure! It had been a while since I had upgraded the NIC firmware, and I found this article useful as a reminder of the steps for performing the upgrade using the Intel Boot Utility and Drivers.

System firmware updates complete, yet my issue still had not been resolved!! I had now spent most of the day attempting to fix this, it was late, and my wife was calling me for dinner. I did not want to walk away knowing the issue was unresolved. Then one last Google search led me to this VMTN communities post. I quickly jumped back to my SSH window and typed cat /etc/vmware/esx.conf, frantically searching for what was outlined in the post. Unfortunately, nothing outlined in the post resolved my issue… BUT…
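
esx.conf is a long file, so rather than scrolling through the whole thing, grepping for the module name gets you to the relevant entries quickly:

[root@esxi:~] grep -i ixgb /etc/vmware/esx.conf   # show any ixgbe/ixgben module settings recorded in the host configuration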

Just as I was about to admit defeat and head to the dinner table, I saw this:
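
Sitting in esx.conf was an entry along these lines, with the ixgben module explicitly disabled (the exact line may differ slightly on other hosts):

/vmkernel/module/ixgben/enabled = "false"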

WTF!?!

I made a copy of esx.conf and modified the line to /vmkernel/module/ixgben/enabled = "true" and rebooted my host. Fingers crossed! (Keep reading to learn the correct method for changing this)
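
In shell terms, that late-night hack looked roughly like this (hand-editing esx.conf is not the supported method, as covered further down):

[root@esxi:~] cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak   # keep a backup copy first
[root@esxi:~] vi /etc/vmware/esx.conf                            # change /vmkernel/module/ixgben/enabled from "false" to "true"
[root@esxi:~] reboot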

After the reboot, I WAS BACK!

[root@esxi:~] vmware -vl
VMware ESXi 7.0.1 build-16850804
VMware ESXi 7.0 Update 1
[root@esxi:~] esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:05:00.0 igbn Up 1000Mbps Full 0c:c4:7a:cd:f4:b0 1500 Intel Corporation I350 Gigabit Network Connection
vmnic1 0000:05:00.1 igbn Up 1000Mbps Full 0c:c4:7a:cd:f4:b1 1500 Intel Corporation I350 Gigabit Network Connection
vmnic2 0000:03:00.0 ixgben Up 10000Mbps Full 0c:c4:7a:cd:f5:76 9000 Intel(R) Ethernet Connection X552/X557-AT 10GBASE-T
vmnic3 0000:03:00.1 ixgben Up 10000Mbps Full 0c:c4:7a:cd:f5:77 9000 Intel(R) Ethernet Connection X552/X557-AT 10GBASE-T

It was now dinner time and my wife was getting rather annoyed, so now that I had my host back online I went to spend some time with the family. I ate dinner, put the kids to bed and settled in to wind down watching some Netflix with my wife. Now, if you've just spent an entire day frustratingly trying to solve an issue like this, might I suggest you do not watch David Attenborough's A Life On Our Planet, as it will leave you feeling even worse than you did before your day began. Thankfully, Sir David does manage to turn it around at the end and leave you with some faint hope that the earth's population might survive another day!

Meanwhile, I could not head to bed without knowing the real reason for my issue. With new information at hand, I continued my search to see if this issue had been encountered before. In my search I discovered that the correct way to enable/disable the ixgben module is actually to use one of the following:

esxcfg-module method

Enable: esxcfg-module -e ixgben
Disable: esxcfg-module -d ixgben

esxcli method

Enable: esxcli system module set -m ixgben -e=true
Disable: esxcli system module set -m ixgben -e=false
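
Whichever method you use, the change only takes effect after a reboot, and you can check the current state of the module beforehand with:

[root@esxi:~] esxcli system module list | grep ixgben   # shows whether the ixgben module is enabled and loaded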

But this still didn’t answer my question as to why the ixgben driver was disabled in the first place?

And then I re-discovered this post, which made me recall that I had replaced the ixgben driver with the legacy ixgbe driver back on ESXi 6.5 to address the intermittent dropouts of the 10GbE NICs. In upgrading from ESXi 6.x to 7.x, the ixgbe driver, being a VMKlinux driver, was deprecated and replaced with the native inbox ixgben driver. So although the native driver was installed, my ESXi configuration still had its module disabled from that old workaround.

Re-enabling the ixgben driver module eventually resolved the issue!
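
If you want to double-check which driver has claimed the 10GbE ports after the reboot, the NIC details report it, along with the driver and firmware versions (vmnic2/vmnic3 is just how they enumerate on my hosts):

[root@esxi:~] esxcli network nic get -n vmnic2   # the Driver Info section should now show ixgben for the X552/X557 ports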

TL;DR

If you previously replaced the native ixgben driver with the legacy VMKlinux ixgbe driver on ESXi 6.x (for example, to work around intermittent X552/X557 10GbE dropouts), the ixgben module may still be disabled in your host configuration. ESXi 7.0 deprecates the VMKlinux ixgbe driver, so after the upgrade the 10GbE NICs disappear. Re-enable the native module with esxcfg-module -e ixgben (or esxcli system module set -m ixgben -e=true) and reboot the host.

1 Comment

  1. This is very interesting. Small world 🙂 I am reading your article and see that you referenced my blog (itnoobs.net). I am just migrating back to VMware (clean install of vSphere 7) and ran into 10GbE NIC connectivity issues. The workaround was to set the speed manually, but I just upgraded the firmware of the NICs and it seems to have resolved the issue. The very confusing part is that this did not happen on my recently purchased X10SDV-TLN4F board; it only affected the one I had purchased 4 years ago. The firmware version reported via the vSphere esxcli command differs between the two, but surprisingly that is not the case according to Intel's firmware update tool! So I am a bit stumped here, but I guess things are working OK now at least. This seems to be an issue on vSphere and not other OSes/hypervisors.
