Raid Arrays, Downtime, The pain of 3ware!

Web hosting requires massive amounts of storage to satisfy customers’ needs. This ever-increasing demand for storage is met by many different storage paradigms, including NAS, SAN, and local SAS or SATA. This storage is normally carved up into different types of RAID arrays. I don’t wish to discuss RAID theory. Instead I feel compelled to express my extreme frustration with certain RAID controller manufacturers and to point out what I feel are deficiencies in their offerings and technical transparency.

3ware is the primary supplier of the RAID controllers we use. While I have had several complaints with them over the last several years, their products have been usable, albeit frustrating and painful to manage in large quantities. I don’t wish to single them out; however, over the past 2 weeks their product and the misinformation about it have caused us and our users inexcusable and painful downtime. I wish to personally apologize to all Bluehost users on box500-503 and Hostmonster users on servers host300-host303. The huge downtime can be directly attributed to 3ware and their lack of transparency with regard to their controllers and their limitations.

Below are several complaints I have against virtually every raid controller manufacturer.

1) Almost every benchmark distributed by RAID controller manufacturers shows only RAID 5 and RAID 6 sequential read and write I/O. This is not representative of almost any workload in today’s computing environment. It would be like saying a Chevy Suburban gets 100 MPG. It’s possible if it’s in neutral going down a hill, but you will never get those results EVER in real-life use. In my opinion 3ware is scared to show real-life performance benchmarks because they struggle to beat even the most basic storage alternatives.

2) Support is not knowledgeable at most of these companies. I understand this. As someone who employs hundreds of support engineers I know first hand the challenges of training support representatives. However, when you sell a product as technical as a RAID controller you had better not have someone in India typing questions into a knowledge base. It’s extremely frustrating when you know far more about a product than the people you are calling for help.

3) Lack of published technical information – I understand the difference between marketing materials and technical materials, but SOMEWHERE you need to be able to find the beef! In so many cases there isn’t ANY information on the technical underpinnings that make these devices work. I have been “escalated” up the chain (ahem…) at 3ware several times only to confirm over and over that those I talk to have almost no idea what I am talking about. Please, just let me talk to the driver developer!!! I will pay! Just give me someone who REALLY knows. Short of me going through the driver code (And don’t think I haven’t done that!) the information I need simply isn’t there.
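To put rough numbers on the benchmark complaint in point 1, here is a back-of-envelope sketch of why sequential figures say so little about a seek-heavy workload. The drive parameters (8.5 ms average seek, 7200 RPM, 100 MB/s sustained transfer, 4 KB requests) are illustrative assumptions, not measurements of any 3ware controller or of our servers, and the function names are made up for this example.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 60_000.0 / rpm / 2.0


def random_iops(seek_ms: float, rpm: float, transfer_mb_s: float, block_kb: float) -> float:
    """I/Os per second for a single disk when every request needs a seek."""
    transfer_ms = (block_kb / 1024.0) / transfer_mb_s * 1000.0
    service_ms = seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms
    return 1000.0 / service_ms


def random_throughput_mb_s(seek_ms: float, rpm: float, transfer_mb_s: float, block_kb: float) -> float:
    """Effective MB/s when the disk spends most of its time seeking."""
    return random_iops(seek_ms, rpm, transfer_mb_s, block_kb) * block_kb / 1024.0


if __name__ == "__main__":
    # Assumed, illustrative drive parameters (hypothetical, not measured).
    SEEK_MS, RPM, SEQ_MB_S, BLOCK_KB = 8.5, 7200, 100.0, 4
    iops = random_iops(SEEK_MS, RPM, SEQ_MB_S, BLOCK_KB)
    rand_mb_s = random_throughput_mb_s(SEEK_MS, RPM, SEQ_MB_S, BLOCK_KB)
    print(f"random 4 KB reads: {iops:.0f} IOPS, {rand_mb_s:.2f} MB/s")
    print(f"sequential figure is {SEQ_MB_S / rand_mb_s:.0f}x higher")
```

Under these assumptions a single disk delivers well under 1 MB/s on small random requests, which is why a sequential benchmark is the Suburban rolling downhill in neutral.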

Here are a few observations about 3ware that make me want to jab an ice pick in one of my eyes.

1) No support for RAID 1 “split seeks” (servicing reads from both mirrored disks so that each head travels less), at least nothing that I can test and show. I still can’t find a single person at 3ware who even knows what split seeks are, let alone whether they support them and to what degree. Because we use primarily RAID 1 and some RAID 1+0 arrays this is extremely important to us. Please don’t email me saying we should use RAID 5 or 6. I know our workload perfectly, and RAID 5 or 6 is a nightmare that many other hosts don’t understand. Minimizing disk seeks is far more important than maximizing space with RAID 5 or 6 unless you really are doing primarily sequential I/O.

2) No support for using the onboard 256 MB or 512 MB cache (depending on controller model) for anything other than RAID 5 or RAID 6, except for write-through journaling. This means that RAID 1 or RAID 0 is actually SLOWER on 3ware cards than on most onboard motherboard controllers. Again, no one at 3ware knows anything about this. If you use write-back journaling then you lose data if you have to do a hard reset on the server. In “performance” mode on the controller you can’t separate journaling from FUA (Force Unit Access) requests, where a process asks for confirmation that a write has really reached the disk; the array reports the write as complete when it is not. This means you choose fast and lose data occasionally, or slow and skip the onboard cache completely. Get with the program and use the cache for something other than boosting sequential I/O benchmarks! Please support RAID 1 with your cache. Maybe even give us the option to choose read-ahead for the cache or dedicate it exclusively to write caching. What a novel idea!

3) Driver updates that don’t work, and software updates that cause us to endlessly “verify” and “initialize” our arrays for no reason except that 3ware’s software is too inept to know the state of its arrays. Anyone who has used 3ware for any amount of time knows exactly what I am talking about here!

4) Rapid restore – We personally don’t use this option, but I tested it to see how it works. It is basically an option that lets you restore an array faster by tracking only the parts of the array that are out of sync, so you don’t have to go through the whole array to verify its integrity. The only problem is that it CRUSHES the array with writes to accomplish this feat. To illustrate, imagine writing down every footstep you have taken in your house for a week so that when it comes time to vacuum you only have to clean the spots where you walked. It takes 10 hours to write down all your steps so that you only have to vacuum for 10 minutes, or you could just skip writing it all down and vacuum for 30 minutes instead. 3ware is CLEARLY testing this on a mostly idle disk array or they would never have released this beast to the public. When I spoke with 3ware they claimed it put very minimal load on the array. When I asked for the technical details of exactly how it worked, they were of course without any concrete answers. I had to test it myself to see the impact.
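Since 3ware would not say how Rapid restore works, the following is only a toy model of the general dirty-region-tracking idea that point 4 describes (similar in spirit to a write-intent bitmap), not 3ware's implementation. The class, its methods, and the cost accounting (one extra log write the first time a clean region is written, one more when it is marked clean again) are all assumptions made for illustration.

```python
class DirtyRegionLog:
    """Toy model: track which regions of a mirror may be out of sync."""

    def __init__(self, array_blocks: int, region_blocks: int):
        self.array_blocks = array_blocks
        self.region_blocks = region_blocks
        self.dirty = set()      # region numbers that may be out of sync
        self.data_writes = 0    # writes the workload actually asked for
        self.log_writes = 0     # extra writes spent on bookkeeping

    def write(self, block: int) -> None:
        region = block // self.region_blocks
        if region not in self.dirty:
            # Assumed cost: the log must be persisted before the data write.
            self.dirty.add(region)
            self.log_writes += 1
        self.data_writes += 1

    def flush_clean(self) -> None:
        """Mark every region clean once both mirrors agree (assumed 1 write each)."""
        self.log_writes += len(self.dirty)
        self.dirty.clear()

    def blocks_to_resync(self) -> int:
        """Blocks a rebuild must copy, instead of the whole array."""
        return min(len(self.dirty) * self.region_blocks, self.array_blocks)


if __name__ == "__main__":
    log = DirtyRegionLog(array_blocks=1000, region_blocks=100)
    for block in (5, 7, 250):
        log.write(block)
    print("resync", log.blocks_to_resync(), "of", log.array_blocks, "blocks")
    print("extra log writes so far:", log.log_writes)
```

In this model the rebuild shrinks to the dirty regions, but a busy array with scattered writes keeps dirtying and cleaning regions, so the bookkeeping writes pile up on top of the real workload. That is the footsteps-versus-vacuuming trade-off in code form.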

Wow. I feel much better. I’m sorry for this long post, but trying to do quality hosting is sometimes an impossible task when you have to rely on so many outside vendors and service providers to come together to make your product work. Just as we rely on vendors to provide our service to you, you rely on us to power your websites and businesses. In the end it is 100% our responsibility to make it all work for you. I am so sorry that many of you have experienced unacceptable downtime from us because of these controllers. We have solved almost all of these issues now, but we know we have lost the confidence of many of you. I just wanted you to know the real reason behind our recent problems and that many sleepless nights were spent in the pursuit of a solid, long-lasting solution.

Thanks,
Matt Heaton / President Bluehost.com / Hostmonster.com

30 Responses to “Raid Arrays, Downtime, The pain of 3ware!”

  1. Raid Arrays, Downtime, The pain of 3ware!…

    Web hosting requires massive amounts of storage to satisfy customers needs. This ever increasing demand for storage is backed by many different connected storage paradigms including nas, san or local sas or sata and so forth. This data is normally ca…

  2. Maybe it’s time for some 10gig E iSCSI SANs

  3. Thank you for the interesting read Matt, a great technical insight into the day to day problems of operating a hosting company, if only more hosts were as vocal as you about problems with hardware vendors you might see more action from them.

  4. David L says:

    This morning I found ALL my websites and BlueHost emails down. I managed to connect to this blog nonetheless (couldn’t get into cPanel) and was very glad to get such a full explanation of your headaches.

    It’s much easier to “grin and bear it” when we know the struggles you are having to deal with and how hard you are working behind the scenes.

    My own internet provider has similar major headaches with the hardware and firmware for his wireless rural systems and this gives me a better appreciation of his agony too.

    Thanks, Matt.

  5. Harley says:

    Matt,

    I never liked using RAID of any form. To me, when you factor in the multiple levels of redundancy in a solution, the overall cost is absurd. Long ago I worked as a mainframe programmer. At the first and most important level, the application, we could decide right then how many copies of your information would be kept when storing data. So let’s assume you structured it to make two copies of your data. Then we had an automatic hierarchical storage manager that would automate when and where your data would go.

    Assuming we make two copies of data. Those get sent to separate super fast servers. Each server has redundancy built in. Essentially we are already paying for 4 times as much hardware as we need if things would magically just work 100% of the time. Then it is not accessed in a while so the system decides to move it to slower disk systems and finally tape. Well, the tape system is actually first a virtual tape system that is caching it on a disk system so we are paying more just to have this intelligent cache system. But it finally goes to tape where it was again split up into two copies and sent to multiple libraries, each of which has multiple redundant systems just to keep each single system fully redundant.

    The point is that the cost of redundancy in an overall system is excessive.

    I always thought that you just need to have some type of redundancy built in at a high layer for non mainframe systems, not just for archive purpose and the back end system ought to be incredibly cheap but highly parallel. When I first heard of the XIV purchase by IBM it upset me that people inside the company did not have the foresight to create a product of that type of architecture because it is similar to what I had been thinking storage should look like. It hasn’t met all my expectations yet but at least the paradigm shift in how to build a disk system is close. I would have similar worries that you have in the controllers ability to handle the workloads properly since the massive amount of parallelism in a fully populated system really taxes the controllers.

    I have a similar problem with how SSDs are being utilized in storage systems. The characteristics (of NAND flash) are completely different and the customer should not be in a situation where they have to make an either SSD or HDD decision. A system architecture should be designed in such a way as to exploit the properties of the NAND flash chips themselves along with HDDs to fully exploit the parallel chip, fast random read but slow sequential write of NAND flash chips and very fast sequential read/write but slow random read/write of HDDs. Forgive me if I think that systems from RAMSAN are too expensive (low latency but too low of capacity) while IBM systems are also expensive (high capacity but higher latency). A while ago I calculated a system architecture that could take about a 20% capacity hit but gain overall IOPS anywhere from 3-15 times as it would relate to a real overall system benchmark. It takes another paradigm shift in architecture and packaging NAND flash in SSD is far from an optimal use for NAND flash (redundant and restrictive). It is just incredibly difficult to explain to people and especially customers because they are not used to being able to hold so many dimensions of complexity about a technology and it is not easy to simplify. I am sure you would be the exception.

    Another very smart blogger that writes intelligently about systems like XIV, SSD directions and even mentions a library I created (my high density library) can be found here http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage
    Just keep in mind it is all things IBM since that is where he works.

    Best of luck in your fight for efficiency in everything you create. Your ideas have been an inspiration to me. I like the way you think and am happy you share. I often feel bored that nobody around me is willing or able to share in the types of wild ideas that continuously rattle around my brain.
    My blog is short, to the point, hosted by your company and is at http://worldlearningtree.com/blog

    Later,

    Harley

  6. While it seems as if I’ve experienced more downtime over the last year than in the previous year, I appreciate the honesty and the fact that your tech support guys have a clue about what’s going on and are always there. Even though it’s a bummer, it’s nice to know that someone is aware of a problem and working on it.

  7. Andrew Kerr says:

    Hey Matt,

    While I really don’t understand most of what you were saying about RAID, I greatly appreciate your honesty and transparency in being forthcoming about these issues. My sites were affected… I’m on box 500 and it was annoying to say the least.

    What makes me glad is that you and your team are dedicated to getting these problems solved so that the little guy can have a great reliable service. It makes times of downtime worthwhile knowing that you guys are aware of it and are making it better.

    Thanks!

  8. Darvid says:

    You should move to SAN. Why are you mucking about with RAID controllers on webservers anyway? It shouldn’t matter what box is designated box 503; that’s just a number. It could be any CPU and NIC somewhere on the SAN that carries the number. Look at Google: they assume hardware is going to fail, so they wrote a distributed system over the top. If you get to be a certain size you have to do that, or go with something a little more professional than Linux, such as Solaris. Granted, Linux has come a long way….

    If you look at newer OS designs, such as Plan 9, storage is abstracted to the filesystem. It shouldn’t matter if it’s disk A or disk 12818. And if you need more you plug it in to the network somewhere and it’s immediately incorporated into the filesystem. This (as another poster mentioned) is not a new development, but rather has been used in the mainframe world almost since the beginning.

    Anyway, if you’re running your boxes as individual servers, you are going to have a lot of administrative headaches. Better to get some big solid servers and use virtualization to provide your services, with a big SQL box kitted out with RAID and all that. Sun owns MySQL, you know.

  9. Jay R. Wren says:

    Hi Matt,

    As a former sysadmin, I feel your pain.

    If you are considering a new vendor, I highly recommend the fast I/O hardware from Scalable Informatics. Please check out their jackrabbit hardware. And as for knowledgeable support, talk to Joe Landman. I’m sure he can answer any questions you have.

    I have no ties with Scalable Informatics. I have just been impressed with their transparency (always blogging about the hardware) and the speed of their hardware.

  10. Marisa says:

    Information like this needs to be splashed on your front page when there’s technical trouble. I’ve seen numerous threads like this one recently and that can’t be good for your reputation.

    For the record, I’ve been happy with bluehost since the cpu exceeded limits error was fixed.

  11. Chris says:

    So does this mean most of our recent downtime issues should be resolved? I was about to switch hosts but since I’m a little bit of a slacker it so happens that I’m still with bluehost. Were those controllers also causing the frequent CPU overuse lockdowns? Our sites don’t get that much traffic and any time I check my processes in cPanel there’s never more than 2 listed.

  12. doug says:

    “In the end it is 100% our responsibility to make it all work for you. I am so sorry that many of you have experienced unacceptable downtime from us because of these controllers. We have solved almost all of these issues now, but we know we have lost the confidence of many of you.”

    So what are you doing to restore this confidence?

    As per my recent experience with your tech support, nothing.

    Having a happy feeling about knowing what’s going on doesn’t give my website any more uptime.

  13. Howlndog says:

    Hello Matt,

    I’ve been a satisfied customer for going on 2 years and the company’s honesty with issues is greatly appreciated. I also like your tech insights here on the blog.

    You use the domains hosted figure, but what is the number for your “massive amounts of storage”? I guess xx terabytes that grows like a monster.

    1 Gigabyte: A high-fidelity audio recording OR A video at TV quality
    5 Terabytes: My employer
    1 Petabyte: 3 years of Earth Observing System data
    1 Exabyte: All words ever spoken by human beings

    Keep your feet on the ground and keep reaching for the stars.

  14. Chuck Linart says:

    I have been with you guys for a couple of years now and have been very satisfied until recently. Please keep us apprised of what’s going on, and thanks for the explanation.

    One simple (potential) solution (and I’m certainly no expert…):

    Instead of using complicated RAID arrays, just rsync regularly to redundant servers that are set up as mirrors? Maybe?

  15. Jordan says:

    Hi Matt,

    I know you’re working very hard, but I just wanted to express my frustration with the recent downtime. I have referred about 50 of my clients to Bluehost, and I absolutely love Bluehost and its services, but lately things have not been working properly. My clients have businesses that depend on their websites, and when a site goes down in the middle of the DAY it’s really not good. I can somewhat understand if this happens in the middle of the night, but when I can’t even access my own site and portfolio of client work, this looks VERY unprofessional, in the sense that potential clients can’t reach my site. For someone who creates online identities and marketing campaigns, having my site down or my email not working in the middle of the day is not good. I don’t mean to sound harsh, but I just don’t understand why it’s down and why there are so many issues lately…

  16. Marisa says:

    And you definitely want to check out that thread I mentioned. Seems your chat support person jumped right into pushing VPS or Dedicated hosting before realizing it was a hardware problem. The whole chat session has been posted on that forum and I’d have jumped ship if that would have been me.

    I want to see you meet your goals and Bluehost succeed but someone needs to look into this.

  17. Andy Frey says:

    Hi Matt,

    Have you tinkered or tested any Promise Technology RAID cards or equipment? I used to install a ton of their boards in boxes for video editors. I haven’t kept up on them, lately, but this article and the Promise card I found in the junk drawer behind me the other day made me wonder.

    Have an excellent day and thanks for the updates and honesty!

    Andy

  18. Chris says:

    Hi Matt,

    Thanks for your information!
    As previously said, your honesty and transparency about your hosting are greatly appreciated! At least we know wtf is happening.

    My site is being affected too. I hope you can fix it!

    Good Luck!

  19. Thanks Matt,

    You may want to go back and take out all the names of vendors though…just don’t want to see you get them upset and make things worse…Hey I think I’ll make part of this into my next post on my blog (once it becomes available to blog on ;) )but let me start some of it off here…it seems that most advertisers promise things that they can only deliver in a perfect world. Honesty is almost completely absent in marketing and salesmanship anymore. Why? Because everything is “all things being equal” or “in a perfect world or most ideal situation” etc… Yet in this world of ours, there is constant interruption, distraction, misdirection, misinformation and unforeseen unintended consequence. Even our economy and everything else can fall into that category I guess. So what is one to do? Plan for the worst and expect the worse, hope for the best I guess. Maybe tell others, from the top, the truth for once as you’ve done…so people can understand. Also, know how to control what you can, and use cause and effect so you can leverage the good and divert or otherwise head off the bad as you’re undoubtedly doing. In IT I always say go for double what you think you’ll need…timelines, storage space, power requirements, cabling, etc.

    I remember once when I was in charge of IT in a LAN/MAN supporting approx 4K people and we lost 1 of 3 mail servers I was well ahead on trying to get replaced…so I had to learn how to stay cool, what to do, and not get upset at the fact that I had been the most vocal trying to head off what seemed to be coming…in fact we even started a weekly reboot and weekly emptying of excess log files, recycle bins, etc prior to this event just to do something…and held on as long as we could while the bureaucracy played fiscal prioritization in full…and I can say once we lost 1/3 of everyone’s mailbox it became a priority and I was able to use it to leverage a 100% replacement of my entire server farm and network infrastructure with the right people listening when I said “remember what happened when…”. Anyhow maybe this is also something to learn from…just don’t share your secrets, or your competitors may get an edge…I went with Bluehost since I thought it was the best deal…unlimited everything for $6.95/month and then no gimmicks thereafter. That’s how it needs to be. I plan on going full swing into the affiliate side of referring others to Bluehost soon and hope to do well so again thanks for the explanation.

    BTW my network admin manager back in my previous days told me when we had RAID that the backup stuff was unrealistic and that we should better choose other backup methods…we had one product that wasn’t reliable not because of the HW but because of the SW side of it…the HW has been around…tape drives and all…it’s the sw management piece we had a challenge with until we ran into Dantz Retrospect in 2003/2004…we were at the bleeding edge of it but it worked…and it’s no wonder EMC later acquired them…much better than the previous unnamed troublesome software we never got to work right…anyhow I just happened also to run into this http://www.marketwatch.com/news/story/dell-simplifies-disk-based-backup-recovery/story.aspx?guid=%7B16DF44B0-248A-4545-9214-8B29C07F8630%7D&dist=hppr sounds promising…as do so many other products…the best thing though is to keep it simple…if you can’t have 100% go with the 80% solution at least (80-20 rule) just don’t get stuck with only the 20%…that’s my 2 cents.

    Mike

  20. Matt,

    I appreciate your honest assessment of the challenges of hosting.

    Below is an email I am sending to complaints@bluemail.com though, my guess is, it will not be read.

    I have been a Bluehost client for several years and have just renewed my service contract with Bluehost despite service interruptions in the past.

    Today’s service interruption of more than 4 hours is disturbing.

    I have a 30-year IT background and understand the challenges of service provisioning, but I fail to understand the poor management approach to server backup, restoration and standby systems.

    These best practices have been around for a long time. I sense the Sr. management of this company might benefit from first-hand consulting, and I am willing to work with your team to improve the quality of service being delivered.

    The irony is that doing the job right is far less costly than your current processes. If you really are concerned about the bottom line (and my business) please have a Sr. level manager contact me. You might want to use my alternate email address (from which this email is being sent) given the current problems with my Bluehost services.

    Best regards,

    Michael B Shear
    President & CEO
    mshear@pocketsnet.com
    POCKETS Distributed Workplace Alternative, Inc
    720-253-3700

  21. wes says:

    I hope to see the end of these downtimes. There are times when I have to check one of my sites and do something only to see its down but I hope a good solution will be found in the end.

  22. Mike Snyder says:

    I have been extremely frustrated and unhappy with BlueHost over the past two weeks. I have had sites down, email not accessible, and a major lack of support from chat, my preferred method of communication due to multi tasking most of my day. I find your comments about support interesting because I found that for the most part, a large number of your staff were uneducated about these problems. I have placed several phone calls, sent several emails, and logged in to many chats in the past two weeks with no answers. Many times I got a canned “our admins are doing maintenance on your server and it should be fine in 15-30 minutes” response and was disconnected from the chat…basically hung up on.

    I am very patient when given the answers to my questions, but when I am misled or fed BS, I tend to lose my patience quickly. Tonight was the first time I got an honest answer. I called in and was told that you, the President, sent out a letter letting staff know the downtimes and problems on servers were unacceptable. While that didn’t solve any of my problems, I completely understand and felt good knowing that someone was hopefully working on my problem. Perhaps that message should have been sent out to all support staff two weeks ago to let them have something to give to customers, rather than a line of crap that didn’t do anyone any good!

    To be perfectly honest, if it wasn’t so much work to move my sites, you would have lost me as a customer long ago. I have too much riding on the sites I have hosted with you to deal with this. Knowing that something is being done and that people in high places are as unhappy as me has restored my faith a little.

    I still have some customer service issues I hope to iron out, but I am desperately hoping your problems, and mine as a result, are eliminated shortly. I have always loved BlueHost and am part of the referral program. In fact I have signed up two clients recently even after I have had problems, so I still trust your company.

    Please don’t let me down!

    Mike Snyder

  23. Tony Lloyd says:

    So you’re not in a different position than me. I was like man, I wonder if the people in charge at Bluehost can imagine the frustration of a customer who can pinpoint the cause for failure on their server (that would be me, on one of my dozen or so bluehost accounts), but who has no recourse to fix this, or who has been provided no remedy.

    For close to 2 weeks we’ve seen greater than 90% disk usage on the /dev/sdc1 (/home) drive. It climbed to 92%, now to 94%. A site we’re trying to promote as a large user community for a huge software platform, which is a simple WordPress blog, barely loads. Other sites barely show up. We contact support, and are told “we’ll have to add a new harddrive”, but nothing gets done. We ask “how can a server get past 90% disk usage on its main drive without anybody taking action?”

    Welcome to the lives of your customers recently. This is just one example of a problem we’ve been able to pinpoint, but that nothing has been done about. Bluehost’s service seems to be going to the dogs lately, and it’s really frustrating. That frustration is only amplified by knowing that you deal with the exact same kind of concerns, but don’t take safeguards to keep your customers from experiencing the same thing.

  24. NeXEA says:

    This is why you’re the man, Matt, and why I use you.

    I just don’t have the time or the amount of patience to host my own stuff.

    BTW excellent considering I’m on the 390s boxes.

    Keep up the amazing work.
    -NeXEA

  25. Jackie Fitts says:

    Matt I am also glad the Live Chat is a)there, and b)the person was honest in telling me what was going on – but this downtime from a DDOS? has come at a time when I’m teaching my client how to use and update their site! The frustration I’m feeling now is helped by reading of your work… so AFTER it’s fixed :) I’ll look for more behind the scenes info, maybe also some explanation on what happened here.

  26. Sherry Grove says:

    Thank you very much for this post. We experienced some downtimes this year that were frustrating and caused us some concern. This restores my faith. I believe BH has some pretty darn good customer service, but I do have a recommendation.

    We would appreciate it very much if you would notify folks of planned or possible downtimes. I realize that this is a little more work for you, but it would be a tremendous help for us.

  27. Adam says:

    Hi Matt,

    As a user of host301 and suffering the outage (and data loss) I do take comfort in knowing that you are busting heads to address issues like this rather than just accepting the product as is.

    Keep it up please.

    Thanks!

  28. Jim says:

    Matt,
    You’re scaring the hell out of me with your RAID problems. I just moved over from you know who gd to build a Joomla site on Bluehost; hope I didn’t make the wrong decision :((

    Jim Novak

  29. Matt,

    I’m going to quote you first, “Please, just let me talk to the driver developer!!! I will pay! Just give me someone who REALLY knows. ”

    I FEEL YOUR PAIN!!! My website has been down now for 24 hours, I’ve called your customer service hotline 4 different times and written you an email, and I still can’t get to anyone who can reassure me that my site will even come back on at all.

    I’ve been using Bluehost for one month come November and my site has already been down twice (2), the first time for about four hours and this time it’s 24 HOURS, and still counting.

    I just got off the phone with a customer service representative. I asked this very nice gentleman if I could please speak with a manager. After holding for about 45 minutes, he came back and had to tell me that no one wanted to take my call. ARE YOU SERIOUS??? It wasn’t that no one was available, it was that no one wanted to take my call. I was very polite, I did not yell, I did not swear and still nothing.

    I’m trying to be understanding of the situation because I know that you must be working tirelessly to fix it, but given the events of the last day, couldn’t you see why someone in my shoes would just get a new service provider?? I’m curious to get your opinion.

    Thank you for your time,

    Andrew Rubalcava

  30. Jeff says:

    It sounds like the support you have experienced from RAID controller manufacturers is similar to what BlueHost customers are currently receiving from BlueHost.

    I find this very ironic, but hopefully this will help BlueHost to up their support level since you can see how frustrating it is when you are unable to find someone who will give you straight answers and get issues resolved.
