Tuesday, October 03, 2006

Hardware: Adding drives to a PE2650 with hot-swappable bays takes how many reboots?

The answer is 6.

Okay, I’ll be fair… 4.

So we have a new client that initially involved us because one of their Dell PowerEdge 2650s running Windows Server 2003 was “running slowly”… after an initial review we could better phrase that as “out of drive space”… but otherwise, at first glance, it was none too interesting.

So after verifying that we had good backups of the box and freeing up enough space to start working, we checked the server’s application stack to make sure it was supportable, applied SP1, and rebooted. Then, with the intention of configuring two new drives in a RAID 1 configuration, we installed Dell OpenManage Server Administrator (OMSA) 5.1.0 with the Storage Manager component, and rebooted. After logging into the OMSA interface, we checked the Storage component, clicked the Information/Configuration tab, and got… nothing. Under Global Tasks it said “No Task Available”, and no storage controllers were visible. Which isn’t good. So we talked to Dell OpenManage support, and they knew exactly what it was… the server wasn’t using the Dell driver for the PERC3/Di controller… fair enough. After talking to support, we came up with a path forward…

1) Download and install the latest driver, then reboot
2) Update the firmware, then reboot
3) Apply a patch to “make the warning icon in Device Manager go away” (uh-huh…), then reboot

So that’s pretty reasonable for the most part, right?

Because the Dell PE2650 isn’t ours, and isn’t a box that we’ve standardized on, we talked to tech support a bit more about the process of adding the new drives. The process really wasn’t what I expected. It turns out that even after doing the driver, firmware, and patch updates, we still wouldn’t be able to configure the drives using the web-based OMSA. Which sounds a bit out of sorts… isn’t the whole point of having OMSA and hot-swappable bays to minimize reboots, keep people out of the RAID controller BIOS, and meet SLAs? Now maybe I’m not adding drives to a PE2650 every day… but that just doesn’t sound right to me.

Well, if we take the Dell recommended approach we’re going to have to do the following:

1) Shut down the server
2) Remove the existing drives, and add the new drives
3) Use the RAID controller BIOS to configure the new drives
4) Turn off the server. Add the original drives back in.
5) Enter the RAID controller BIOS, and accept the notification that “Changes have been made”
6) Reboot. The server should then boot normally.

So all things considered, that’s more risk than we really anticipated, not to mention time and reboots. Granted, we probably wouldn’t have been in this situation if whoever built the box had used the Dell drivers and installed OMSA during the initial install, but we don’t want to point fingers. Further, we could have noticed the driver issue before going to Dell support… ultimately, we want to show value to this new client and start building trust. With the server about to go out of warranty, we’ve decided to use the drives that had been purchased in a different server, move a couple of shares around, and recommend that the server be repurposed out of the business-critical role that it currently occupies.

How to handle a similar situation better in the future? If this were an existing client, we probably would have managed the issue differently. Certainly if we envision working with the PE2650 line more, or the Dell PERC controllers more, we need to come up with some internal procedures for working in the BIOS. Otherwise, we’d be accepting too much risk on behalf of our client – which we really shouldn’t be doing in the first place.

Let me know if you have any thoughts on this process, and if we had any major misses that we could have addressed.

2 comments:

Anonymous said...

I don't disagree with anything you're saying. On a project like this I would have budgeted in a good fudge factor, so if a rebuild became necessary you wouldn't be looking at 3 days of work for free.

Nick said...

It's not even so much that I mind giving away some service up front – because building trust and showing value to a new mid-sized client should pay off in the long term. What I mind is what looks to me like bad procedures, unnecessary complexity, and excess risk. I’m not trying to solely blame Dell – as there were contributing factors in a number of areas – but I am saying there’s some type of problem when Dell OpenManage support says “Don’t use the OpenManage software to add those drives to a PE2650”… because what’s the point of having storage management software and server-class equipment? "Turn off the box, and use the RAID BIOS tool..." no thanks... not unless it’s absolutely necessary.