Archive for October, 2008

Happy Halloween!

Friday, October 31st, 2008

Sorry…. Lately I have sort of disappeared from my blog. Perhaps it’s because I have been delving DEEP into the underpinnings of the Linux kernel, ferreting out problems that lesser hosting companies would run screaming from!

Anyway, I’m almost back to normal life and to prove it I offer up the following physical evidence!

Bluehost Pumpkin

Bluehost Pumpkin #2

Let it never be said that I let a potential marketing opportunity slip by. All those trick-or-treaters need a website to sell or trade their candy on eBay. OK, jokes aside, it is good to be back. I will try to be MUCH more active than I have been lately.

Happy Halloween!

Matt Heaton / President Bluehost.com

Collectl – A System Admins’ Dream Tool

Wednesday, October 29th, 2008

We run Linux on all our servers at Bluehost and Hostmonster. Our particular flavor of Linux is CentOS 4 and 5. For a variety of reasons we feel that Linux is head and shoulders better than Windows in virtually every way. But in order to really squeeze the best performance out of a system or diagnose a performance-related issue, you used to have to rely on many different tools to give you the bits and pieces of information you would need.

Enter “collectl” (Yes that is an L on the end of collect). Collectl is indispensable to any system admin. It replaces sar, vmstat, top, atop, iostat, and many other tools that I USED to use. Now instead of having to rely on those various tools that did 80% of what I needed I just use collectl.

Collectl was the brainchild of Mark Seger over at HP’s Scalable Computing & Infrastructure group. He developed this as a tool to monitor huge clusters of HP servers, but saw the benefit that the “average” Linux user could get from using this tool.

The best part of collectl in my opinion is that it is CONSTANTLY being updated and improved. This means that you have access to data and reporting that is available in many of the newer kernels. So many of the other tools out there simply don’t keep up and so get left by the wayside as the rapid development of the kernel continues.

An example of this is when I needed a realtime I/O breakdown by process, which simply can’t be had without the data exposed by a newer kernel. No other monitoring software out there supported what the kernel was kicking out except collectl.
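For a rough idea of what that kind of per-process data looks like, here is a minimal Python sketch. It assumes the kernel data in question is the per-process I/O accounting exposed in /proc/<pid>/io (available since kernel 2.6.20); the post does not say which counters were involved, and this is not how collectl itself is implemented. The function names and the sample numbers are made up for illustration.

```python
from pathlib import Path


def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def read_proc_io(pid):
    """Read the I/O counters for one process (Linux only, needs permission)."""
    return parse_proc_io(Path(f"/proc/{pid}/io").read_text())


def io_delta(before, after):
    """Counter growth between two samples of the same process."""
    return {key: after[key] - before[key] for key in after if key in before}


# Hypothetical samples taken one second apart, in the /proc/<pid>/io format.
SAMPLE_BEFORE = """rchar: 1000
wchar: 500
syscr: 10
syscw: 5
read_bytes: 4096
write_bytes: 8192
cancelled_write_bytes: 0
"""

SAMPLE_AFTER = """rchar: 3000
wchar: 900
syscr: 14
syscw: 9
read_bytes: 12288
write_bytes: 24576
cancelled_write_bytes: 0
"""

delta = io_delta(parse_proc_io(SAMPLE_BEFORE), parse_proc_io(SAMPLE_AFTER))
print(delta["read_bytes"], delta["write_bytes"])
```

Sampling read_proc_io(pid) for every process at a fixed interval and sorting by the deltas gives a crude "who is hitting the disk right now" view, which is the sort of breakdown described above.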

Rarely do I endorse a product on my blog, but since Mark has done so much for me personally with regard to developing collectl and helping us get the most out of it, I feel it is the least I can do.

Collectl is a free product. You can read up on it or download it at the following URL:

http://collectl.sourceforge.net/

Thanks again to Mark for all his hard work! The community greatly appreciates it!

Matt Heaton / President Bluehost.com

RAID Arrays, Downtime, The Pain of 3ware!

Sunday, October 5th, 2008

Web hosting requires massive amounts of storage to satisfy customers’ needs. This ever-increasing demand for storage is backed by many different connected storage paradigms, including NAS, SAN, local SAS or SATA, and so forth. This storage is normally carved up into differing types of RAID arrays. I don’t wish to discuss RAID theory. Instead I feel compelled to express my extreme frustration with certain RAID controller manufacturers and to point out what I feel are deficiencies in their offerings and technical transparency.

3ware is the primary supplier for the RAID controllers we use. While I have had several complaints with them over the last several years, their products have been useable, albeit frustrating and painful to manage in large quantities. I don’t wish to single them out; however, over the past 2 weeks their product and the misinformation about it have caused us and our users inexcusable and painful downtime. I wish to personally apologize to all users on Bluehost box500-503 and Hostmonster users on servers host300-host303. The huge downtime can be directly attributed to 3ware and their lack of transparency with regard to their controllers and limitations.

Below are several complaints I have against virtually every RAID controller manufacturer.

1) Almost every benchmark distributed by RAID controller manufacturers shows only RAID 5 and RAID 6 sequential read and write I/O. This is not representative of almost any workload in today’s computing environment. This would be like saying a Chevy Suburban gets 100 MPG. It’s possible if it’s in neutral going down a hill, but you will never get those results EVER in real life use. In my opinion 3ware is scared to show real life performance benchmarks because they struggle to beat even the most basic storage alternatives.

2) Support is not knowledgeable at most of these companies. I understand this. As someone who employs hundreds of support engineers I know first hand the challenges with training support representatives. However, when you sell a product as technical as RAID controllers you better not have someone in India typing questions into a knowledge base. It’s extremely frustrating when you know far more about a product than the people you are calling for help.

3) Lack of published technical information – I understand the difference between marketing materials and technical materials, but SOMEWHERE you need to be able to find the beef! In so many cases there isn’t ANY information on the technical underpinnings that make these devices work. I have been “escalated” up the chain (ahem…) at 3ware several times only to confirm over and over that those I talk to have almost no idea what I am talking about. Please, just let me talk to the driver developer!!! I will pay! Just give me someone who REALLY knows. Short of me going through the driver code (And don’t think I haven’t done that!) the information I need simply isn’t there.
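To put rough numbers on complaint 1, here is a back-of-the-envelope Python sketch of why a sequential benchmark says so little about a seek-bound workload. Every figure in it is an assumption chosen for illustration (a generic 7200 RPM disk with an 8.5 ms average seek, 4 KiB requests, 80 MB/s sequential); none of them are 3ware numbers or measurements from our servers.

```python
def random_iops(avg_seek_ms, rpm):
    """Random requests per second for one spinning disk, ignoring transfer time."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution on average
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


def random_throughput_mib_s(avg_seek_ms, rpm, request_kib):
    """Throughput when every request pays a full seek plus rotational latency."""
    return random_iops(avg_seek_ms, rpm) * request_kib / 1024.0


# Assumed figures for a generic disk, for illustration only.
AVG_SEEK_MS = 8.5
RPM = 7200
REQUEST_KIB = 4
SEQUENTIAL_MB_S = 80.0

iops = random_iops(AVG_SEEK_MS, RPM)
throughput = random_throughput_mib_s(AVG_SEEK_MS, RPM, REQUEST_KIB)
print(f"random: about {iops:.0f} IOPS, {throughput:.2f} MiB/s")
print(f"sequential (assumed): {SEQUENTIAL_MB_S:.0f} MB/s")
```

Under these assumptions the disk manages roughly 79 random requests per second, or about 0.3 MiB/s, against tens of MB/s sequentially. That gap of two orders of magnitude is the difference between the hill in neutral and real life use.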

Here are a few observations about 3ware that make me want to jab an ice pick in one of my eyes.

1) No support for RAID 1 “split seeks”, at least nothing that I can test and show. I still can’t find a single person at 3ware who even knows what split seeks are, let alone whether they support them and to what degree. Because we use primarily RAID 1 and some RAID 1+0 arrays this is extremely important to us. Please don’t email me saying we should use RAID 5 or 6. I know our workload perfectly and RAID 5 or 6 is a nightmare that many other hosts don’t understand. Minimizing disk seeks is infinitely more important than maximizing space with RAID 5 or 6 unless you really are doing primarily sequential I/O.

2) No support for using the onboard 256 MB or 512 MB cache (depending on controller model) for anything other than RAID 5 or RAID 6, except for write-through journaling. This means that RAID 1 or RAID 0 is actually SLOWER on 3ware cards than on most onboard motherboard controllers. Again, no one at 3ware knows anything about this. If you use write-back journaling then you lose data if you have to do a hard reset on the server. You can’t separate FUA requests (where a process asks for confirmation that a real write has occurred, yet the disk array says the write is complete when it’s not) from journaling in “performance” mode on the controller. This means you choose fast and lose data occasionally, or slow and skip the onboard cache completely. Get with the program and use the cache for something other than boosting sequential I/O benchmarks! Please support RAID 1 with your cache. Maybe even give us the option to choose read ahead for the cache or specify it exclusively for write cache, what a novel idea!

3) Driver updates that don’t work, and software updates that cause us to endlessly “verify” and “initialize” our arrays for no reason except that 3ware’s software is too inept to know the state of its arrays. Anyone who has used 3ware for any amount of time knows exactly what I am talking about here!

4) Rapid restore – We personally don’t use this option, but I tested it to see how it works. It is basically an option that allows you to restore an array faster by tracking only the parts of the array that are out of sync, so you don’t have to go through the whole array to verify its integrity. The only problem is that it CRUSHES the array with writes to accomplish this feat. To illustrate this, imagine writing down every footstep you have taken in your house for a week so that when it comes time to vacuum you only have to clean those spots where you walked. It takes 10 hours to write down all your steps so you only have to vacuum for 10 minutes, or you could just skip writing it all down and vacuum for 30 minutes instead. 3ware is CLEARLY testing this on a mostly idle disk array or they would have never released this beast into the public. When I spoke with 3ware they claimed it put very minimal load on the array. When I asked for the technical details of how it worked exactly they were of course without any concrete answers. I had to test it myself to see the impact.
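For anyone wondering what the split seeks in observation 1 buy you, here is a toy Python simulation. It is only a sketch under loud assumptions: a two-disk mirror, purely random single-track reads, seek cost counted as track distance, and a simple "send the read to whichever head is closer" policy. It does not describe how any real controller, 3ware or otherwise, schedules reads, and the track count and request count are arbitrary.

```python
import random


def total_seek_distance(requests, split_seeks):
    """Sum of head movement (in tracks) needed to serve a list of reads."""
    heads = [0, 0]  # current track position of each mirror's head
    total = 0
    for track in requests:
        if split_seeks:
            # Serve the read from the mirror whose head is already closest.
            disk = min((0, 1), key=lambda d: abs(heads[d] - track))
        else:
            # No split seeks: every read goes to the same disk.
            disk = 0
        total += abs(heads[disk] - track)
        heads[disk] = track
    return total


rng = random.Random(42)  # fixed seed so the run is repeatable
requests = [rng.randrange(100_000) for _ in range(10_000)]

single = total_seek_distance(requests, split_seeks=False)
split = total_seek_distance(requests, split_seeks=True)
print(f"one disk serving all reads: {single} tracks of head movement")
print(f"nearest-head split seeks:   {split} tracks of head movement")
```

Only reads can be split this way, since every write still has to land on both mirrors. For a read-heavy, seek-bound workload like shared hosting, that is exactly why this feature matters more than the extra space RAID 5 or 6 would give.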

Wow. I feel much better. I’m sorry for this long post, but trying to do quality hosting is sometimes an impossible task when you have to rely on so many outside vendors and service providers to come together to make your product work. Just as we rely on vendors to provide our service to you, you rely on us to power your websites and businesses. In the end it is 100% our responsibility to make it all work for you. I am so sorry that many of you have experienced unacceptable downtime from us because of these controllers. We have solved almost all of these issues now, but we know we have lost the confidence of many of you. I just wanted you to know the real reason behind our recent problems and that many sleepless nights were spent in the pursuit of a solid long lasting solution.

Thanks,
Matt Heaton / President Bluehost.com / Hostmonster.com