Adaptec RAID cards are junk

Background: I work with the OpenBSD project. Long ago, OpenBSD had support for a series of hardware RAID cards made by Adaptec, but the cards turned out to be very buggy (not that uncommon), and Adaptec didn't document the bugs. That makes it impossible for outsiders (like OpenBSD) to write a good driver for the card, as the driver can't work around bugs nobody will describe. Because of this inability to write a good driver for Adaptec RAID cards, the decision was made to simply pull the support from OpenBSD. Better to have no support than buggy support.

From time to time, people would complain about this deleted driver and argue that other OSs were superior because they supported this bad hardware (usually without the fixes in the drivers that were needed). This story was originally posted on one of the OpenBSD mail lists in three installments -- it wasn't intended to be three installments, but the story kept getting better/worse as time went on.

A RAID card is a device that hooks hard disks to a computer and permits several disks to be presented to the OS as one big disk, usually with added redundancy. RAID stands for "Redundant Array of Independent Disks" (originally "Inexpensive Disks"). Since the card acts as a link between your disks and your operating system and applications, its reliable operation is critical to your data's health.
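
For readers who haven't seen how the "added redundancy" part can work, here is a minimal sketch of XOR parity, one common scheme (the idea behind RAID 5). It is only an illustration: the disk contents and the helper name xor_blocks are made up for this example, and nothing here is meant to describe how Adaptec's firmware actually did it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data "disks", each holding one equal-sized block.
data_disks = [b"mail", b"spam", b"logs"]

# A fourth disk stores the parity of the other three.
parity = xor_blocks(data_disks)

# Pretend the second disk died: rebuild its block from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

The catch is that the controller has to get every one of those writes right, every time. If it scribbles on the parity or loses what was in its cache, the redundancy quietly stops protecting anything.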

Adaptec itself disappeared as an independent business some time after these posts were made.

Original posts:

At one point, I managed the e-mail for about 30,000 people scattered around North America. The system we used for that e-mail was a "canned appliance" -- a bundle of hardware and software which is managed on a day-to-day basis by me, but has a company we can fall back on for support when things break. The product was made by a company called Mirapoint, which really had a pretty remarkable product line -- if I told you I basically managed 30,000 users single-handedly, you might have an idea just HOW good it was. They were, overall, good people. They made some mistakes, but they did their darnedest to make good on them.

When I hired in, their system was based on FreeBSD, with their own mail transport system and various spam and virus filtering systems. After I had worked with them a while, they announced they were switching from FreeBSD to Linux as the base OS for their product. This really wasn't an issue for customers, as they never get to see a Unix shell prompt...but it was interesting to me, so I asked a few of their people about the decision (after warning them first that I worked with the OpenBSD project, so I wasn't a totally disinterested party).

They told me, in short, they were "wanting to be in the e-mail processing and delivery business, not the hardware device driver writing business", and with FreeBSD, they seemed to have to do too much driver development to get things to work as they desired, and the drivers they were after were "just available" for Linux.

I.e., they wanted to pick the hardware and have the drivers available, and someone else to support the OS (they went with RedHat), rather than picking the best OS for the job and selecting the best hardware that OS supported.

(They also claimed some performance benefits out of Linux that I do not believe in the slightest, based on later experience. They also indicated that they were having trouble with third-party antivirus vendors providing FreeBSD versions of their software; they wanted to ship only Linux versions. This I do believe.)

So, for their latest major release of the system, they have a Linux-based application with a bunch of hardware choices, all of them with Adaptec RAID hardware.

One day as I'm walking into the office, our customer's rep called me and said, "we got e-mail problems". After a bit of investigation, I found all the edge machines were wedged. Rebooting them solved the problem. The mail system manufacturer looked at them and said, "Oh, looks like a problem with the RAID card, upgrade to this new version, which is supposed to fix this".

Ok, shit happens, and unfortunately, that's just accepted in most non-OpenBSD parts of the computer world, so I shut down one machine at a time and upgrade to the newest firmware.

WELL...the new firmware doesn't cause hangs, it causes random reboots.... Isn't that special.

They tell me, "Yes, we've seen that recently. Try this new, newest firmware". Guess what? That one doesn't fix the reboots, but NOW when the system spontaneously reboots, the cache is mishandled and manages to corrupt the file systems on the disks, so instead of a reboot and a few minutes of non-productivity, you get a dead lump of a canned appliance until you get in front of it, boot their magic remote repair CD and a remote tech does an fsck of your file systems.

So they give me another NEW firmware. That one seems to (usually) fix the file system corruption, but still reboots frequently, and once in a while, trashes the file system.

(I do want to point out that they really had ZERO intent of you EVER booting a firmware upgrade CD on these things. They are supposed to be serial managed, no keyboard or VGA monitor is ever supposed to be attached to them...until you need to upgrade the firmware...the hardware they have actually supported console redirection, but since that was all supposed to be handled by the OS, it is not turned on. Ooops.)

For the last few weeks, I've been running a mix of different firmware versions, just so I don't have another "come in and all my mail servers are dead at once" day. One day, they asked me to install a special firmware with debugging features so hopefully Adaptec can figure out what is going wrong and actually make it work correctly this time.

You think the OpenBSD project is blowing smoke when they say Adaptec RAID hardware has piles of horrible bugs? THEY ARE NOT. I think it is very safe to say that your data is not Adaptec's priority. They have a garbage product and garbage drivers and they try to patch around it in any way they can OTHER than build it right in the first place. Don't go telling yourself this is just "OpenBSD doesn't play nice, so they don't get good drivers from Adaptec". Linux plays nice with everyone, signs any NDA and takes drivers under any conditions...and they get crap, too...but they are content with it! But remember: you heard it from OpenBSD first. I can't say I'd ever trust any Adaptec RAID card with data on any OS after seeing this little issue.

Punchline: I had a chat with one of the top techs at this mail system provider, and told him about the OpenBSD experience with Adaptec. He told me they have come to the same conclusion and that their next generation product would have a much better (by OpenBSD standards) manufacturer for the RAID systems...


Chapter 2

Recap:
Bad firmware -> locking system.
New firmware -> rebooting system.
Newer firmware -> still reboots, now trashes file systems.
Newer firmware -> still reboots, trashes file systems less often.

At the time of that posting, I was running new firmware with diagnostic code in it to capture critical info, so Adaptec could figure out why their cards were crashing my system.

So, for a couple months, things were going pretty well. We got a few crashes out of the system and data to the vendor to pass up to Adaptec, but no really big events. Then one weekend, one of the machines falls over and can't get back up. I figure "surprise", VPN into work, remove it from the cluster, and I'll worry about it Monday.

Ok, now look at this from Adaptec's perspective... You have pissed off your customer and your customer's customer. You can't find the problem, so you have asked them to run special diagnostic firmware to have them help you do your job. What can you possibly do to further impress them with your incompetence now?

So Monday, I go into work, cable up the machine and...it's hung in the RAID controller boot (not the system boot, but since HW manufacturers think it is so f*ing cool that OSs boot, of course they want their RAID controller to have a well advertised boot process too). And it hangs. Not even trying to read an OS off the disks, just hung. Power off, back on, still hangs. Re-seat card, still hangs.

I called Mirapoint's support, told 'em the symptoms, they agree that it is the RAID controller that failed. I start thinking, well, maybe I was a little hard on Adaptec, publicly bashing them like this and in reality, maybe I just had a defective RAID card all along. It might explain why a large majority (though certainly not all!) of the crashes happened on this one machine...and now the card is totally dead. Hm. Maybe just bad hardware. I'm starting to consider how I'll word my semi-retraction to the things I've said about Adaptec before.

Then the phone rings, it's my regular contact at the system vendor. He's telling me there's something really strange going on, as these cards are popping all over the country, all at people who have been running the diagnostic firmware. They can't believe the conclusion, but it seems like there's a time bomb in the diagnostic firmware. They have a call in to Adaptec, but the guy responsible for the diagnostic firmware is on vacation, and it takes 'em a while to track the guy down, "but it is possible". Sure enough, a couple hours later, I get a call back that confirms the firmware is actively killing our cards, and thank goodness that I upgraded them over a period of days and not all in a short period of time, and I do an emergency reversion of all the other systems.

How do you top your past levels of incompetence now? Thank your victim..er..customers who are helping you debug your product by time-bombing the device so that sixty days after install, your adapter breaks. Can you top that? Yeah. Don't tell anyone about the time bomb -- don't tell the VAR, or the end user, "if you help us debug our crappy product, don't let it run this way for 60 days, or your computer will start doing space heater imitations".

(One could argue that they topped that one step further by actually locking the boot process so one could not even boot up the firmware update disk and downgrade the firmware to something that sucks less, but I am willing to pass that off as a bug, not deliberate).

Think about this a bit. These people DELIBERATELY put a feature in their firmware to STOP me (and a lot of other people) from using this card. Legitimate users, but they felt that I was entitled to help them debug their shit for no more than sixty days. They worked hard at putting this feature in. This isn't a piece of software that has access to the resources of a computer, like real-time clocks and writable disks. This is a bloomin' RAID controller, which they managed to build a persistent time bomb into so that after 60 days of operation, it destroyed itself!! (and again, note: it didn't just crash and need to be power cycled, it DAMAGED THE CARD so it had to be replaced). This took some effort -- I can't think of any other reason to have a real time clock in a RAID card. I also somehow doubt that the coder who did this sat down and wrote the time bomb AFTER he was charged with coming up with the diagnostic firmware. No, I rather suspect he grabbed some off-the-shelf code, something they put routinely into their diagnostic and troubleshooting systems, but wasn't intended to get out into the general public. They obviously care more about things OTHER than your system integrity and reliability. This coder made an error in judgment, but they obviously had the tools lying around for some reason.

Now, tell me again how horrible it is that OpenBSD doesn't let you trust your data (and OpenBSD's reputation) to these incompetent assholes?

Current status: system is running on non-diagnostic firmware. Adaptec and our mail system vendor can reproduce this problem in the lab with a couple days' work, so they are closer to a good solution(?), but not fixed yet. Vendor has come out with a new version of their mail system software which handles corrupted file systems much better than the old versions did (and works on a new line of hardware which uses a different RAID vendor). But we are still about six months into this problem and it still exists.

It does bring one part of the OpenBSD stance on Adaptec RAID hardware into question, though. The real reason they aren't giving us the errata for these products may not be that they don't wish to or are embarrassed by how bad it is, but that they don't even understand the problems in the product themselves. Doesn't change the conclusion: Adaptec products can't be trusted.


Chapter 3

Ok, when we last saw this story, Adaptec was working on new firmware which really would fix the problem. Not too long after I wrote the second chapter in this saga, I got word from my ever-patient support guy that they got a new firmware for me, and if this doesn't do it, they are sending me new controllers (LSI), which they have switched to for all new machines they send out (forcing a rev of their application).

The new firmware is installed, and finally...things are working.

For a while.

A couple months ago, one of the boxes hangs and quits working, somewhat like the very very first problem, but to be honest, that isn't my first guess. I reboot the box and call the service vendor. They take a look, and sure enough, a couple hours later I get a call from the guy who has patiently worked with me on this stuff, and he says, "It did it again." I can't believe this, but sure enough, the system logs show the controller tripping up and killing the system. Again.

So, they tell me, "That does it, you are getting the upgrade kit". They send me out the deluxe edition, complete with new disks, array pre-created and preloaded with the new OS, as the old array won't be readable on the new array controller. The kit is actually pretty decent; they have obviously spent a bit of time planning on having people field-upgrade what were supposed to be "sealed" boxes, and changing cards in computers. Of course, that hasn't been an issue for me...well, ever. I like hardware.

It's something of a pain, though, as we basically have to rebuild the box from scratch, and reconfigure it as the old system was (and hope we got everything right the first time...which we did by the time the third box was upgraded). While we are upgrading the first one, though, another one died on us...leading us to think we've got an uptime-related issue.

So at this point, I've got three of the boxes upgraded with new firmware (and a new version of the OS to go along with it). The fourth box, I offered to test the NEXT new Adaptec firmware on. Curiously, the version number on the new firmware is SMALLER than the last "This is it" version. Yes, you could feel my support contact rolling his eyes when he told me that.

Do note that every step of the way, they are sending me new FIRMWARE, not new OS drivers. They are having trouble working around the bugs in the hardware. ADAPTEC HARDWARE WAS/IS CRAP. THEY KNEW IT.



 

Copyright 2021, Nick Holland