This is perhaps the least sexy topic I’ve ever written about.
The Linux CPU scheduler is an extremely important part of how Linux works. CFS (the Completely Fair Scheduler) has been a part of Linux for a couple of years. The purpose of the scheduler is to look at tasks (processes and threads), assign each one a processor or CPU core to run on, and make sure that all the tasks that need run time get a fair share of processing time. It is also responsible for context switching (swapping out a task that has used up its time slice or doesn’t need any more run time so that another can run) and for migrating tasks from one CPU/core to another. This helps to balance load and make better use of CPU cache by being “smart” about where to put queued and running processes.
It all sounds simple enough, but there are HUGE problems with the design of CFS, in my opinion. I’m getting into dangerous territory here because I’m about to tear apart something that was designed by people who are much smarter than I am. However, I have something that most kernel developers don’t have access to: a huge and unbelievably busy network. Our network receives more than a trillion (yes, with a T) hits every quarter. We receive more than 100 million emails every day. We send out more than 25 million emails each day. We now have more than 5 petabytes of storage. In short, I have one of the best testbeds on the planet for finding deficiencies in an operating system.
Enough background, let’s get to why I think CFS is “broken”. As the number of processes increases, CFS gets disproportionately slower and slower until almost no work (CPU processing) gets done. There are many tunables to modify how CFS behaves, but the premise stays the same. CFS is built on the incorrect (in my opinion) assumption that all processes are always “equal”. I can easily create enough processes on a production server that CFS will consume almost all the CPU cycles just trying to schedule the processes, leaving them almost no time to actually run.
Think of it like this: let’s assume that scheduling each process costs 0.1% of the CPU, and that actually running the program then takes X% of the CPU. What if you have 900 processes running and each one costs 0.1% of the CPU for scheduling? That is 90% gone, so you only have 10% of the CPU left in which to run your software. In reality I think it’s much worse than this example. After about 1500 concurrent processes CFS completely starts to fall apart on our servers.
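To make that arithmetic concrete, here is a minimal Python sketch of the toy model above. It assumes, as the example does, a flat made-up scheduling cost of 0.1% of the CPU per running process. The function name and the linear cost model are purely illustrative and are not a measurement of how CFS really behaves.

```python
def cpu_left_for_work(num_processes, sched_cost_pct=0.1):
    """Percent of CPU left for real work under the toy model.

    Assumes (hypothetically) that every running process costs a fixed
    sched_cost_pct of the CPU just to be scheduled.
    """
    overhead = num_processes * sched_cost_pct
    # Overhead cannot eat more than the whole CPU, so clamp at zero.
    return max(0.0, 100.0 - overhead)


if __name__ == "__main__":
    for n in (100, 500, 900, 1500):
        print(f"{n:5d} processes -> {cpu_left_for_work(n):5.1f}% left for work")
```

With these made-up numbers the model hits zero at 1000 processes. The real curve is almost certainly not linear, which is the point of the "much worse" remark above.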
The worst part about this is that the only way you can really tell it is happening is to measure the process quantum (the time slice of a CPU/core that userspace programs get). How many of you know how to measure the average process quantum of the scheduler? That’s what I thought.
If you add up all the “quantum times” during a 1 second period and compare the total with the full second, the difference shows how much CPU the kernel is taking to service those requests. On a desktop system I get about 95% of a CPU for running my software. On our busiest servers I get about 70% of our available CPU time for actually running our software. The rest is eaten up by the inefficient scheduler. If you feel compelled to evaluate the process quantum time, you can enable sched_debug in the kernel and check out its output. It’s actually pretty good data for those nerdy enough to read it.
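For a rough per-process view of the same idea, the Python sketch below works from /proc/<pid>/schedstat. This is a swapped-in illustration, not the sched_debug method described above. It assumes a kernel built with scheduler statistics, where that file holds three numbers: nanoseconds spent on the CPU, nanoseconds spent waiting on a run queue, and the number of time slices run. The function names and the sample line are hypothetical.

```python
def parse_schedstat(text):
    """Parse the three fields of a /proc/<pid>/schedstat line."""
    run_ns, wait_ns, slices = (int(field) for field in text.split())
    return run_ns, wait_ns, slices


def average_quantum_ms(text):
    """Average time on CPU per time slice, in milliseconds."""
    run_ns, _wait_ns, slices = parse_schedstat(text)
    if slices == 0:
        return 0.0  # the task has not been scheduled yet
    return run_ns / slices / 1e6


def read_average_quantum_ms(pid):
    """Read schedstat for a live process (Linux only, needs schedstats)."""
    with open(f"/proc/{pid}/schedstat") as f:
        return average_quantum_ms(f.read())


# Made-up sample: 2 s on CPU, 0.5 s waiting, 4000 time slices.
sample = "2000000000 500000000 4000"
print(f"{average_quantum_ms(sample):.3f} ms per slice")
```

This only covers one task at a time and says nothing about kernel overhead directly, but a shrinking average as the process count climbs is the kind of signal to look for.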
It’s been nearly impossible to prove my calculations over the last several months, but after many long nights I now feel very comfortable saying that CFS truly is a broken design. It may be a good design for a desktop, and admittedly the kernel guys have made low latency desktops a priority, but still… You do have to have some upper bound on how many processes can be running and how many new processes can be started over a given period of time, but this limit should be MUCH higher than 1500-2000. I would say it needs to be somewhere in the 10,000 range to really be effective with hardware that will be coming out in the next 6-18 months. If Linux wants to scale efficiently to 16, 32, or 64 cores then the scheduler needs some serious work.
How do we fix it? Well, we actually have a “process start throttler” kernel patch that evens out the start times of processes, which gives the scheduler more predictable behavior, but it doesn’t solve the issue of the scheduler simply not scaling. It does give us a pretty substantial gain in speed, and more importantly it stops a single user who launches a ton of processes at once from impacting the speed and stability of everyone else on the system. This is pretty complex to explain, and it’s being tested on live servers starting today, but that is a blog entry for another day.
Thanks,
Matt Heaton
Great post. Makes me feel confident I’m with the right hosting company.
Maybe I’m still ignorant about this problem.
I find this post too technical to understand.
Great article. Since you have this in place, could you explain the differences between VPS hosting and Bluehost?
Storage capacity is a plus on Bluehost as most VPS plans limit that, though I have now seen 50 GB options, which puts it in the realm of what you offer.
CPU is now about the same as Bluehost, as you usually get a shared CPU slice on a VPS.
Memory I am unsure of.
It seems to me disk thrashing (which almost no one talks about), which could slow down any hosting company’s clients, might be more of an issue on a VPS where users can run low-level kernel apps.
Security would seem like a plus on the VPS side as well as running bigger applications that can be better managed via ssh.
Price of course is always a winner on Bluehost, but I would be interested in hearing your take on the whole VPS vs Bluehost issue.
Thanks!
Chris
Hi Matt,
After trying several other hosting companies for my Drupal sites, I became a Bluehost customer about 8 months ago. I am EXTREMELY happy with your service, especially your customer service, and the fact that you are always current on Drupal versions. Invariably, your CS staff is knowledgeable, pleasant, and quick to help or make suggestions. I am dropping this note to request that you please add reseller accounts to your services. I now have clients at several different accounts, and it’s getting hard to manage. My business continues to grow and I really, I mean REALLY want to stay with Bluehost. I know I can add unlimited domains to my main account, but frankly, with all the domain forwarding, etc, it’s getting confusing. Is there any possibility Bluehost will be adding reseller accounts?
Thanks so much,
Lynne
This service continues to get better. You people are doing a great job.
Have you released this patch to the Linux community?
Interesting – just curious if you’ve compared with schedulers in say FreeBSD?
A quick Google search found this quote: “ULE2 seems to perform considerably better for many workloads (eg, MySQL) on many-core machines.” http://news.ycombinator.com/item?id=1016662
Dear Matt, thank you for this text. Again, explaining how you guys work helps me understand how the company works and get in touch with the human side of Bluehost.
I mean, being geographically far away from Bluehost, which has been hosting my work for almost 5 years (I have 2 accounts), could make me feel insecure, and I have some friends telling me to move all my domains to Europe. But with articles like this I can somehow get closer to your company and feel more a part of it, despite being just a simple client.
About this Linux thing, wow, I never knew unresolved things existed in the Linux world, or that they could be so hard to fix after all.
I wish you a great week, greetings from Sahara Desert, João Pedro
Matt, bravo on the post and for your hard work and attention to detail with your own systems and clients. I find it amazing that you go to such depth with advancing Linux, especially in such a profound way that we can all, in turn, benefit from.
This is another reason I won’t ever leave this great company!
I have been reading a lot about CPU cores and CPU processing time lately. Sounds pretty interesting to me, Matt. I think you take a hands-on approach, which is good, and you have the ability to experiment with the scheduler.
Thanks for the insight into the complexities of Linux. Especially from a hosting point of view. I’m with another hosting company but I’ll take a look at what you have to offer if you’re being proactive with stuff like this.
Now if you could port IBM z/OS’s workload manager to *NIX…we’d be set
So I wonder what huge shops like Google and Facebook do, since they run on Linux. For sure they must have altered this to their liking. I wonder how Solaris tackled this issue as well. Great technical post!
I hope I’m not late in asking this, but what is your opinion on BFS in regards to your system? Would you consider using it or not?
You know you should drop Ingo Molnar and the scheduler guys some info on your workload/setup – there is a lot of interest in tuning the scheduler to other workloads than desktop and it sounds like you have an extremely interesting web serving load.
Alan
Have you any experience of Solaris? They say it scales much better than Linux, maybe it would solve your problem? Have you tried ZFS for storage?
Is this a problem with threads too? Let’s say you run a busy web server with Apache worker. How will the scheduler handle threads?
I like the post and trust that your conclusions about the scheduler and linux are correct. I don’t think that you can necessarily expect linux to be all things to all people. I just think it’s impressive that linux can handle your very demanding load as well as it does.
I wanted to echo “yet another anonymous” and mention BFS. I’ve used it on my own server, and was generally pleased with the performance, but I don’t have enough simultaneous processes to evaluate it properly. BFS, to my understanding, is more of a “first come, first served” scheduler. It is a thousand or more lines shorter than CFS, and reportedly is much faster. From what I have read, Google already plans to integrate it into Android for similar reasons. I would be very interested in hearing about the performance difference for a server.
Well, I’m enjoying the blog, and looking forward to more great posts!
Yes, threads in this instance are dealt with just like normal processes except in one way (they are often WORSE). The reason threads can be even more problematic is that threads try to stay on the same core instead of switching around. This is good for caching, but can cause a HUGE imbalance of tids/pids. Sometimes I will have literally 400 processes/threads on a single core and 10 processes/threads on other cores. It’s this kind of thing that makes me scratch my head at some of the decisions that some of the kernel devs make.
Matt
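For anyone who wants to look for that kind of per-core imbalance on their own box, here is a minimal Python sketch. It is an illustration under stated assumptions, not the tooling used on the servers discussed here. It relies on field 39 (`processor`, the CPU a task last ran on) of /proc/<pid>/task/<tid>/stat, and the function names are hypothetical.

```python
import glob
from collections import Counter


def last_cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, from a /proc/.../stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the last ')'.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return int(rest[36])


def tasks_per_cpu(stat_lines):
    """Count tasks by the CPU they last ran on."""
    return Counter(last_cpu_from_stat(line) for line in stat_lines)


def read_all_task_stats():
    """Collect stat lines for every thread on a live Linux system."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            pass  # the task exited while we were scanning
    return lines
```

Calling `tasks_per_cpu(read_all_task_stats())` gives a snapshot of where tasks last ran. It counts sleeping tasks too, so it is only a rough indicator of how lopsided the placement is.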
Hey Matt,
Have you considered setting up CFS group scheduling? I understand your main point, and it may not solve all of the issues you have, and I’m not sure how much it would help with your software. But if you know something is launching a ton of threads, and has a habit of doing so, you can make a group for it, so that ALL its threads together have to share one slice of time with the rest of the threads on the system. I do agree that CFS has some major pitfalls (I’m running BFS on my desktop now), but it may help if you have not considered it.