Wednesday, April 20, 2011

Purple Screen of Death - Dell issues

You may notice I don't post a lot of issues on here. That's because we frankly don't have a whole lot of issues. I personally attribute it to using Best Practices, and regular maintenance. That said, things will still get lost in the weeds occasionally.

We ran into the Purple Screen of Death on one of our ESX 4.1 boxes yesterday. It is a Dell R610 that apparently had a hardware hiccup and logged these errors:

Tue Apr 19 21:09:15 2011
PCIE Fatal Err: Critical Event sensor, bus fatal error (Bus 1 Device 0 Function 1) was asserted
0xA10002FBF9AD4DB1000413186FAA0101h
Tue Apr 19 21:09:15 2011
Err Reg Pointer: OEM sensor, OEM Diagnostic data event was asserted
0xA00002FBF9AD4DB10004C11A7E011610h
Tue Apr 19 21:09:15 2011
PCIE Fatal Err: Critical Event sensor, bus fatal error (Bus 1 Device 0 Function 0) was asserted
0x9F0002FBF9AD4DB1000413186FAA0001h
Tue Apr 19 21:09:15 2011
Err Reg Pointer: OEM sensor, OEM Diagnostic data event was asserted
0x9E0002FBF9AD4DB10004C11A7E011610h

We rebooted the box, and it came back online just fine, but we didn't feel comfortable with it, so we stuck it in maintenance mode and had someone contact Dell. Dell reports that we need to update the BIOS on it:

-----

Yes, it appears your system is affected by some of the microcode updates released from Intel on the 5500 and 5600 series processors.  That is likely the cause of these PCI errors.  The course of action we need to take is:

· Update the BIOS
· Update the iDRAC
· Clear out the old log entries
· Monitor for recurrence.

------

So it's sitting in maintenance mode until someone has some time to love on it. The awesome thing is that we run N+1 (one more box than we need), so we have that luxury. I know plenty of people who refuse to listen to why you should go N+1, and they would be scrambling to make a maintenance window to update it.

The downside to this whole fiasco was that when it hiccupped, it stayed online (as is the default with ESX) and held onto the storage of its VMs. Therefore, HA couldn't restart them on another box until someone manually SHUT OFF the pretty Purple-VM-Eater. As soon as they did that, all was well in the world and the phone stopped ringing.

Since I'm not fond of relying on manual intervention to make HA work, I found the command for auto-restart when a PSoD happens and applied it to ALL our hosts:

esxcfg-advcfg -s X /Misc/BlueScreenTimeout

Where X = number of seconds before restart

I went with 30 seconds, so that I have a chance to see the screen if I happen to be looking at the host when it dies.
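
For anyone wanting to copy this, here is a rough sketch of the 30-second version. It only prints the commands so you can review them before pasting them into the host's console. The psod_timeout_cmd helper and the TIMEOUT variable are my own illustrative names, not anything that ships with ESX, and I'm assuming you run the printed commands as root on the host itself.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# Assumption: the printed commands are run as root on the ESX host itself.

TIMEOUT=30   # seconds the purple screen stays up before the host reboots

# Hypothetical helper: print the set command for a given timeout.
# Rejects anything that is not a whole number of seconds.
psod_timeout_cmd() {
    case "$1" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "esxcfg-advcfg -s $1 /Misc/BlueScreenTimeout"
}

# Command that sets the timeout
psod_timeout_cmd "$TIMEOUT"

# Command that reads the value back (-g gets the current setting)
echo "esxcfg-advcfg -g /Misc/BlueScreenTimeout"
```

Running the -g command afterwards is just a sanity check that the new value stuck on each host.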

 
------
Dustin Shaw
VCP