Wednesday, March 2, 2011

It's always the simple things that cause the most trouble, March 2011 Edition.

I have been trying to hunt down a "random reboot" problem for a few days now, and I hope that I have fixed it.  I searched all over the internets trying to find folks with similar problems, but found no working solution.

(For those interested in the technical details, I have a multi-node Hyper-V "highly available" cluster running Windows Server 2008 R2.  My servers are Dell PowerEdge R815s with four 12-core AMD processors and Broadcom 5709C NICs.  I have an EqualLogic 6000E and 6500XV for central storage, and I am using iSCSI for all of those connections.)

So the Internets took me to many great sites with complicated fixes, and of course fixes that don't apply to my setup (it seems that Intel has a nasty bug with their processors right now, btw).  For some examples:

In a nutshell, everyone above suggests driver updates, firmware updates, Windows updates... all stuff that is pretty simple. The more I Googled, the more I thought it might be related to a NIC issue, or at least I had a hunch that it was.

So I decided to take what I call the "dummy" approach.  I went through every NIC, setting by setting, on each node in the cluster.  I had already turned off the "Allow the computer to turn off this device to save power" option on the iSCSI NICs, but I had left it alone on the motherboard NICs.  I decided to change this so every NIC was the same, disabling the setting.
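If clicking through every NIC by hand sounds tedious, the comparison itself is easy to script once the settings are written down.  Here is a minimal Python sketch of that idea.  Everything in it is made up for illustration: the node names, the NIC names, the setting keys, and the find_inconsistent_settings helper are all hypothetical, and it only compares an inventory you have already collected (it does not read anything from Windows).

```python
from collections import defaultdict


def find_inconsistent_settings(nic_settings):
    """Return the settings whose values differ across NICs.

    nic_settings maps (node, nic) -> {setting_name: value}.
    The result maps setting_name -> {value: [(node, nic), ...]} and
    only includes settings that have more than one distinct value.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for (node, nic), settings in nic_settings.items():
        for name, value in settings.items():
            seen[name][value].append((node, nic))
    return {
        name: {value: sorted(where) for value, where in by_value.items()}
        for name, by_value in seen.items()
        if len(by_value) > 1
    }


# Hypothetical inventory, typed in by hand for this example.
# "allow_power_off" stands for the "Allow the computer to turn off
# this device to save power" checkbox.
inventory = {
    ("node1", "iSCSI-1"): {"allow_power_off": False, "speed_duplex": "Auto"},
    ("node1", "Onboard-1"): {"allow_power_off": True, "speed_duplex": "Auto"},
    ("node2", "iSCSI-1"): {"allow_power_off": False, "speed_duplex": "Auto"},
    ("node2", "Onboard-1"): {"allow_power_off": True, "speed_duplex": "Auto"},
}

mismatches = find_inconsistent_settings(inventory)
for setting, by_value in mismatches.items():
    for value, nics in by_value.items():
        print(f"{setting} = {value}: {nics}")
```

With the sample inventory above, only the power-saving option is flagged, because the onboard NICs still have it enabled while the iSCSI NICs do not.  The speed/duplex entry is the same everywhere, so it is left out of the report.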

Then a meeting took me away from diagnosing it any further yesterday... I went home, expecting to get a bazillion e-mails from SCOM the next morning, since one or more nodes had been rebooting every night.  But my inbox was free of SCOM's nagging, and none of the nodes rebooted.  So no matter how many times we forget, we should check the simple things first.