This is perhaps the least sexy topic I’ve ever written about. The Linux CPU scheduler is an extremely important part of how Linux works. CFS (the Completely Fair Scheduler) has been part of Linux for a couple of years now. The scheduler’s job is to look at tasks (processes and threads), assign each one a processor or CPU core to run on, and make sure that every process that needs run time gets an equal and fair share of it. It is also responsible for context switching (migrating tasks from one CPU/core to another, or switching out processes that don’t need any more run time). Done well, this balances processes across cores and makes better use of CPU cache by being “smart” about where queued and running processes are placed.
It all sounds simple enough, but in my opinion there are HUGE problems with the design of CFS. I’m getting into dangerous territory here because I’m about to tear apart something that was designed by people much smarter than me. However, I have something that most kernel developers don’t have access to: a huge and unbelievably busy network. Our network receives more than a trillion (yes, with a T) hits every quarter. We receive more than 100 million emails every day and send out more than 25 million each day. We now have more than 5 petabytes of storage. In short, I have one of the best testbeds on the planet for finding deficiencies in an operating system.
Enough background; let’s get to why I think CFS is “broken”. As the number of processes increases, CFS gets disproportionately slower and slower until almost no real work (actual CPU processing) gets done. There are many tunables that modify how CFS behaves, but the premise is always the same: CFS is built on the (in my opinion incorrect) assumption that all processes are always “equal”. I can easily create enough processes on a production server that CFS consumes almost all the CPU cycles just scheduling the processes to run, leaving the processes almost no time to actually run.
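If you want to see this for yourself, something like the following sketch will do it. This is a hypothetical load generator I’m writing for illustration, not our actual test harness: it forks a pile of always-runnable busy loops, which is the case that stresses the scheduler hardest, and you can watch throughput collapse as the count climbs past a couple thousand.

```c
/* spin.c: hypothetical load generator, not our real test harness.
 * Forks N always-runnable busy loops so the scheduler never gets a
 * break. Build: cc -O2 spin.c -o spin    Run: ./spin 1500 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1500;  /* process count */

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: pure CPU burn, always runnable, never blocking. */
            volatile unsigned long x = 0;
            for (;;)
                x++;
        }
        if (pid < 0) {
            perror("fork");
            break;
        }
    }

    sleep(60);          /* let the machine suffer for a minute      */
    kill(0, SIGKILL);   /* nuke the whole process group, us included */
    return 0;
}
```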
Think of it like this: assume that scheduling a process costs 0.1% of the CPU every time it runs, and then running the program itself costs some further X% of the CPU. With 900 processes running, each paying that 0.1% scheduling tax, 90% of the CPU is gone before any real work starts, and only 10% remains for actually running your software. In reality I think it’s much worse than this example, because the cost doesn’t stay flat. At around 1,500 concurrent processes, CFS completely falls apart on our servers.
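Here’s that back-of-the-envelope arithmetic as a runnable toy model. To be clear, the 0.1% per-process figure is my illustration, not a measured constant, and the real overhead isn’t this nicely linear:

```c
/* Toy model of the arithmetic above: a fixed per-process scheduling
 * cost eating into the CPU budget. Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    const double cost_per_proc = 0.001;       /* assumed 0.1% per process */

    for (int n = 100; n <= 1000; n += 100) {
        double overhead = n * cost_per_proc;  /* scheduling tax        */
        double useful   = 1.0 - overhead;     /* left for real work    */
        if (useful < 0)
            useful = 0;
        printf("%4d processes: %5.1f%% overhead, %5.1f%% useful\n",
               n, overhead * 100.0, useful * 100.0);
    }
    return 0;
}
```

At 900 processes the model leaves you the 10% from the example above; at 1,000 it leaves you nothing, which is roughly the cliff we see in practice.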
The worst part about this is that the only way you can really tell it’s happening is to measure the process quantum (the time slice a userspace program gets on a CPU/core). How many of you know how to measure the scheduler’s average process quantum? That’s what I thought. If you add up all the quantum times over a one-second period and compare the total against the full second, the difference is how much CPU the kernel burned servicing those tasks. On a desktop system I get about 95% of a CPU for running my software. On our busiest servers I get about 70% of the available CPU time for actually running our software; the rest is eaten up by the inefficient scheduler. If you feel compelled to evaluate quantum times yourself, you can enable sched_debug in the kernel and check its output. It’s actually pretty good data for those nerdy enough to read it.
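This isn’t the exact method behind my numbers above, but a quick way to eyeball your own average quantum is the per-task schedstat counters, assuming your kernel was built with schedstats support (so /proc/self/schedstat exists). The file holds three numbers: nanoseconds spent on CPU, nanoseconds spent waiting on a runqueue, and the number of timeslices run. Delta the first against the third and you have an average quantum:

```c
/* quantum.c: rough average-quantum estimator for the current process.
 * Requires a kernel with schedstats (/proc/self/schedstat). */
#include <stdio.h>

static int read_stat(unsigned long long *run, unsigned long long *wait,
                     unsigned long long *slices)
{
    FILE *f = fopen("/proc/self/schedstat", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "%llu %llu %llu", run, wait, slices);
    fclose(f);
    return (n == 3) ? 0 : -1;
}

int main(void)
{
    unsigned long long r1, w1, s1, r2, w2, s2;

    if (read_stat(&r1, &w1, &s1) != 0) {
        fprintf(stderr, "no schedstat support in this kernel\n");
        return 1;
    }

    /* Burn CPU for a while so we actually accumulate timeslices. */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 500000000UL; i++)
        x++;

    if (read_stat(&r2, &w2, &s2) != 0)
        return 1;

    if (s2 > s1)
        printf("avg quantum: %.2f ms over %llu slices "
               "(%.2f ms spent waiting to run)\n",
               (r2 - r1) / 1e6 / (s2 - s1), s2 - s1,
               (w2 - w1) / 1e6);
    return 0;
}
```

Run it on an idle desktop and then on a loaded server; the wait time and the shrinking quantum tell the story.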
It’s been nearly impossible to prove my calculations over the last several months, but after many long nights I now feel very comfortable saying that CFS truly is a broken design. It may be a good design for a desktop, and admittedly the kernel guys have made low-latency desktops a priority, but still… You do have to have some upper bound on how many processes can be running and how many new processes can be started over a given period of time, but that limit should be MUCH higher than 1,500-2,000. I would say it needs to be somewhere in the 10,000 range to really be effective with the hardware coming out in the next 6-18 months. If Linux wants to scale efficiently to 16, 32, or 64 cores, then the scheduler needs some serious work.
How do we fix it? Well, we actually have a “process start throttler” kernel patch that evens out process start times and gives the scheduler predictable behavior. It doesn’t solve the underlying problem of the scheduler not scaling, but it gives us a pretty substantial gain in speed and, more importantly, it stops a single user who launches a ton of processes at once from hurting the speed and stability of everyone else on the system. It’s complex to explain fully, and it’s actually being tested on live servers starting today, but that is a blog entry for another day.
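To give a taste of the idea without getting into the patch itself (which lives in the kernel and does this at fork time), here’s a userspace sketch of the same shaping behavior: a per-second start budget so a burst of new processes hits the scheduler at a bounded rate. The limit of 50 starts per second is a made-up illustrative number, not our tuning:

```c
/* throttle.c: userspace sketch of the start-throttling idea. Our real
 * patch does this inside the kernel; this only illustrates the shaping. */
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define MAX_STARTS_PER_SEC 50   /* illustrative limit, not our tuning */

static void throttled_spawn(void (*work)(void))
{
    static time_t window;       /* current one-second window   */
    static int started;         /* starts used in this window  */
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    if (now.tv_sec != window) { /* window rolled over: reset budget */
        window = now.tv_sec;
        started = 0;
    }
    if (++started > MAX_STARTS_PER_SEC) {
        /* Budget spent: sleep to the next window, then carry on.
         * (Rough accounting is fine for a sketch.) */
        struct timespec rest = { .tv_sec = 0,
                                 .tv_nsec = 1000000000L - now.tv_nsec };
        nanosleep(&rest, NULL);
        started = 1;
    }
    if (fork() == 0) {
        work();
        _exit(0);
    }
}

static void work(void) { usleep(100000); }  /* stand-in workload */

int main(void)
{
    for (int i = 0; i < 500; i++)   /* a burst of 500 becomes ~10s of starts */
        throttled_spawn(work);
    while (wait(NULL) > 0)
        ;
    return 0;
}
```

The point isn’t the mechanism, it’s the effect: the scheduler sees a steady, predictable arrival rate instead of a thundering herd.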