Saturday, October 2, 2010

All About the Design

It seems to me that you can always tell if an IT guy is worth his salt by the way that he architects/designs solutions.

We recently picked up a small shop whose previous guy built them a beefy Dell T710 running Microsoft Hyper-V. They've got 5 VMs running on it, so nothing wrong with the basic idea behind it, but the execution is where I have an issue.

The machine was originally built (from factory) with the onboard SAS 6/iR RAID card with a set of four 450GB drives in a RAID 5 config. No hot spares. Issue number one.

He realized that he must have underbuilt, so he then went off the deep end. He went out to Newegg and bought four 2TB SATA drives, bought the caddies off eBay, and bought a PERC 6/i RAID controller directly from Dell (must've not found it cheaper somewhere else). This means the factory warranty from Dell (3 year NBD onsite) won't cover any of the added parts except the RAID card. What's the most likely thing to die on a server again?

He then went in and reconfigured the machine to run off the PERC with a second RAID 5 set with the 2TB drives (again no hot spare), and directly attached one of the VMs to it for their main data storage. This is all fine and dandy, except for the fact that there is 800GB free on the first RAID set, and they are only (at this point) using 40GB on the second RAID set (leaving roughly ALL of it free).
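To put numbers on how overbuilt that second set is, here is a rough back-of-the-envelope sketch. It assumes nominal drive sizes (450GB and 2000GB) and the usual RAID 5 rule of one drive's worth of capacity lost to parity; the function name is just something I made up for illustration, and real formatted capacity will come in a bit lower.

```python
def raid5_usable_gb(drive_count, drive_size_gb):
    """Nominal usable capacity of a RAID 5 set: one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


# Four 450GB drives on the SAS 6/iR, four 2TB drives on the PERC 6/i.
first_set_gb = raid5_usable_gb(4, 450)
second_set_gb = raid5_usable_gb(4, 2000)

# They are using about 40GB on the second set.
used_second_gb = 40
free_pct = (second_set_gb - used_second_gb) / second_set_gb * 100

print(first_set_gb, second_set_gb, round(free_pct, 1))
```

So the second set works out to roughly 6TB nominal with about 99% of it sitting empty, while the first set still has room to spare.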

And here comes the second part of the issue - much bigger than the previous config issues.

The PERC 6/i controller has issues running on drives 1.5TB and up. The latest firmware revision runs better than the previous one, but the controller is regularly (it's happened 3 times in a month, hence why the old guy went bye-bye) dropping a drive offline. Part of the issue is that when it drops the drive, it drops the RAID set offline at the same time.

When I got called in, the RAID set had dropped two drives offline (meaning bad, bad things could've happened to their data). I reseated the "bad" drive, imported the foreign config, and let the RAID set rebuild itself. I then proceeded to move their data off of that RAID set and remove the config from the environment.

They still have 800GB to grow into before we even need to consider doing anything. My first order of business is to tell them that they need to get another 450GB drive and put it in as a hot spare. As I mentioned, this is a small shop, and the server sits on the floor in a closet. They only open the door when there is a problem, so they could easily lose a drive and not know it until someone asks "What's that red light for? It's been doing that for a month." Hence the driving need for a hot spare.

Dustin Shaw
