Linux CPU Scheduler (The biggest problem you never knew you had!)

This is perhaps the least sexy topic I’ve ever written about :) The linux cpu scheduler is an extremely important part of how linux works. The CFS scheduler (Completely fair scheduler) has been a part of linux for a couple of years. The purpose of the scheduler is to look at tasks (processes and threads) and assign them a processor or cpu core to run on and to make sure that all the processes that need run time get an equal and fair share of processing time. It is also responsible for context switching (migrating tasks from one cpu/core to another or switching out processes that don’t need anymore run time). This helps to balance processes and make better use of cpu cache by being “smart” about where to put queued and running processes.

It all sounds simple enough, but there are HUGE problems with the design of CFS in my opinion. I’m getting in dangerous territory here because I’m about to tear apart something that was designed by people that are much smarter than myself. However, I have something that most kernel developers don’t have access to – a huge and unbelievably busy network. Our network receives more than a trillion (Yes with a T) hits every quarter. We receive more than 100 million email every day. We send out more than 25 million email each day. We now have more than 5 petabytes of storage. In short, I have one of the best testbeds on the planet for finding deficiencies in an operating system.

Enough background, lets get to why I think CFS is “broken”. As the number of processes increases CFS is disproportionally slower and slower until almost no work (CPU processing) gets done. There are many tunables to modify how CFS behaves but the premise is the same. CFS is based on the incorrect (In my opinion) basis that all processes are always “equal”. I can easily create enough processes on a production server that CFS will completely consume almost all the cpu cycles just trying to schedule the processes to run without giving the processes almost any time to actually run.

Think of it like this – Lets assume that for every process to run it takes .1% of the cpu to “schedule” a process to run, and then it takes X % of cpu to run the program. But what if you have 900 processes running and each one takes .1% of the cpu for scheduling. Now you only have 10% of the cpu remaining in which to run your software. In reality I think its much worse than this example. After about 1500 concurrent processes CFS completely starts to fall apart on our servers.

The worst part about this is that the only way you can really tell this is happening is to measure the process quantum (The time slice that userspace programs get of a cpu/core). How many of you know how to measure the average process quantum of the scheduler – That’s what I thought :) If you add up all the “quantum times” during a 1 second period and look at the difference you will see how much CPU the kernel is taking to service those requests. On a desktop system I get about 95% of a CPU for running my software. On our busiest servers I get about 70% of our available CPU time for actually running our software. The rest is eaten up by the inefficient scheduler. If you feel compelled to evaluate the process quantum time you can enable sched_debug in the kernel and check out its output. It’s actually pretty good data for those nerdy enough to read it.

Its been near impossible to prove my calculations over the last several months, but after many long nights I now feel very comfortable in saying that CFS truly is a broken design. It may be a good design for a desktop, and admittedly the kernel guys have made low latency desktops a priority but still… You do have to have some upper bound limit on how many processes can be running and how many new processes can be started over a given period of time, but this limit should be MUCH higher than 1500-2000. I would say it needs to be somewhere in the 10,000 range to really be effective with hardware that will be coming out in the next 6-18 months. If linux wants to scale efficiently to 16,32,64 cores then the scheduler needs some serious work.

How do we fix it? Well, we actually have a “process start throttler” kernel patch that evens out the start times of processes that gives predictable behavior to the scheduler, but it doesn’t solve the issue of the scheduler simply not scaling. It actually gives us a pretty substantial gain in speed and more importantly it stops a single user that launches a ton of processes at once from impacting the speed and stability of everyone else on the system. This is pretty complex to explain, but its actually being tested on live servers starting today, but that is a blog entry for another day.

Thanks,
Matt Heaton

36 Responses to “Linux CPU Scheduler (The biggest problem you never knew you had!)”

  1. Christian says:

    Great post. Makes me feel confident I’m with the right hosting company.

  2. Tiens says:

    Maybe i’m still ignorant about this problem

  3. Tianshi says:

    I find this post too technical to understand

  4. Great article. Since you have this in place, could explain the differences between VPS hosting vs Bluehost?

    Storage capacity is a plus on Bluehost as most VPS limit that, though I have now seen 50gig options which puts it in the realm of what you offer.

    CPU is now the same as Bluehost as you usually get shared cpu slice on a VPS.

    Memory I am unsure of.

    It seems to me disk trashing (which almost no one talks about) which could slow down any hosting company’s clients might be more of an issue on a VPS where users can run low level kernal apps.

    Security would seem like a plus on the VPS side as well as running bigger applications that can be better managed via ssh.

    Price of course is always a winner on Bluehost, but I would be intersted in hearing your take on the whole VPS vs Bluehost issue.

    Thanks!

    Chris

  5. Lynne Frey says:

    Hi Matt,
    After trying several other hosting companies for my Drupal sites, I became a Bluehost customer about 8 months ago. I am EXTREMELY happy with your service, especially your customer service, and the fact that you are always current on Drupal versions. Invariably, your CS staff is knowledgeable, pleasant, and quick to help or make suggestions. I am dropping this note to request that you please add reseller accounts to your services. I now have clients at several different accounts, and it’s getting hard to manage. My business continues to grow and I really, I mean REALLY want to stay with Bluehost. I know I can add unlimited domains to my main account, but frankly, with all the domain forwarding, etc, it’s getting confusing. Is there any possibility Bluehost will be adding reseller accounts?
    Thanks so much,
    Lynne

  6. adisa says:

    this service continues to get better. You people are doing a great job

  7. marvin says:

    have you released this patch to the linux community?

  8. mark says:

    Interesting – just curious if you’ve compared with schedulers in say FreeBSD?

    quick google search fond this quote “ULE2 seems to perform considerably better for many workloads (eg, MySQL) on many-core machines.” http://news.ycombinator.com/item?id=1016662

  9. Dear Matt, thank you for this text, again, explaining how you guys work makes me understand how the company works, and, get in touch with the human level of bluehost.

    I mean, being far away geographically from Bluehost that has been hosting my work for almost 5 years (i have 2 accounts) could make me unsecured and i have some friend telling me to change all my domains to Europe, but with articles like this, I can somehow get closer to your company feeling more part of it though just being a simple client.

    About this Linux thing, wow, I never knew non-resolved things existed in the Linux world, and, that they could be not that easy to fix after all.

    I wish you a great week, greetings from Sahara Desert, João Pedro

  10. Kelly says:

    Matt, bravo on the post and for your hard work and attention to detail with your own systems and clients. I find it amazing that you go to such depth with advancing Linux, especially in such a profound way that we can all, in turn, benefit from.

    This is another reason I won’t ever leave this great company!

  11. Russ says:

    I have been reading alot about cpu/core and CPU processing time lately. Sounds pretty interesting to me Matt. I think you take a hands on approach which is good and you have the ability to experiment with the scheduler.

  12. allen says:

    Thanks for the insight into the complexities of Linux. Especially from a hosting point of view. I’m with another hosting company but I’ll take a look at what you have to offer if you’re being proactive with stuff like this.

  13. Anonymous says:

    Now if you could port IBM z/OS’s workload manager to *NIX…we’d be set :-)

  14. A says:

    So I wonder what huge shops like google and facebook do, since they run on linux. For sure they must have altered this to their liking. I wonder how Solaris tackled this issue as well. Great technical post!

  15. yet another anonymous says:

    I hope I’m not late in asking this, but what is your opinion on BFS in regards to your system? Would you consider using it or no?

  16. Alan Cox says:

    You know you should drop Ingo Molnar and the scheduler guys some info on your workload/setup – there is a lot of interest in tuning the scheduler to other workloads than desktop and it sounds like you have an extremely interesting web serving load.

    Alan

  17. kebabbert says:

    Have you any experience of Solaris? They say it scales much better than Linux, maybe it would solve your problem? Have you tried ZFS for storage?

  18. Peter E says:

    Is this a problem with threads to? Let’s say you run a busy webserver with apache worker? How will the scheduler handle threads?

  19. derek says:

    I like the post and trust that your conclusions about the scheduler and linux are correct. I don’t think that you can necessarily expect linux to be all things to all people. I just think it’s impressive that linux can handle your very demanding load as well as it does.

  20. OmniUni says:

    I wanted to echo “yet another anonymous” and mention BFS. I’ve used it on my own server, and was generally pleased with the performance, but I don’t have enough simultaneous processes to evaluate it properly. BFS, to my understanding is more of a “first come, first serve” scheduler. It is some thousand lines or more shorter than CFS, and reportedly is much faster. From what I have read, Google already plans to integrate it into Android for similar reasons. I would be very interested in hearing about the performance difference for a server.

    Well, I’m enjoying the blog, and looking forward to more great posts!

  21. matt says:

    Yes, threads in this instance are dealt with just like a normal processes except in one way (They are often WORSE). The reason why threads can be even more problematic is because threads try and stay on the same core instead of switching around. This is good for caching, but can cause a HUGE inbalance of tids/pids. Sometimes I will have literally 400 processes/threads on a single core and 10 processes/threads on other cores. Its this kind of thing that makes me scratch my head at some of the decisions that some of the kernels devs make.

    Matt

  22. drewski says:

    Hey Matt,
    Have you considered setting up CFS group scheduling? I know your main point and it may not solve all of the issues you have, and not sure how it would help you greatly with your software, but if you know something is launching a ton of threads, and has a habit of doing so, you can make a group for it, so it has to share its time for ALL its threads with the rest of the threads on the system. I do agree that CFS does have some major pitfalls (im running BFS on my desktop now), but it may help if you have not considered it.

  23. Jim says:

    Sir, You could shorten your blog considerably with:

    Hi my name is mattheaton.

    I work for a big company with a zillion hits a day. We are so great. That’s right- literally!

    I have decided to write a post about the Linux scheduler, which is such an arcane subject that no one will know I have absolutely no idea what I am talking about.

    I’ve come up with some really nonsensical math that “I cannot seem to prove” for some reason.

    I’ve even invented some cool sounding words like “quantum underlayer” that mean absolutely nothing!

    I like to think of my blog as a soap box for my fans.

    Have a great day!

    MH

    -sir please stick to stuff you at least can pretend you know something about.

  24. xxiao says:

    I agree CFS has pitfalls, not only in large servers, but also in embedded products, it may be ok for desktop(in that case, BFS is simpler and better).

    i have a 70 threads program, with old O(1) scheduler it ran perfectly, right after CFS was swapped in, the program is 20% slower, and it’s nearly impossible to tune CFS, as it has too many undocumented features/variables there.

    wish someone can forward-port O(1) to the new kernels, CFS sucks big.

  25. jasbir says:

    I got this hosting company from caroline-middlebrook.com on her tutorial on setting up with blue host, very informative. Credit goes to her. I guess blue host will be helpful for my upcoming website

  26. Dashares says:

    have you released this patch to the linux community?

  27. Sounds like you have found a way to distribute the starting of processes more evenly. The volume of hits and emails you receive is amazing.

  28. Hey Matt is bluehost supports .htaccess support

  29. paypal says:

    Have you considered setting up CFS group scheduling?

  30. shubham says:

    hai……
    mat…. thanx 4 nice information…
    can u tell me the reason behind this……how it can happen if no. of processes increases then period also increases so time slice may be less but how this kind of condition can come…….
    is this becoz of….
    1.increased context swtching…
    2.cache hot time…..
    3.preparation to do load balance???or dilemma that which process to migrate…

    plz reply soon……

  31. Adeel Ahmad says:

    Hi Matt, Great article. I face the exact same problem on my servers…

  32. jinlala says:

    Hi Matt, Great article. I face the exact same problem on my servers…

  33. I’m very interested in your work environment. This must be programmers heaven (or hell) I’m creating a new core that brings every thing back to basics and what you wrote is exactly what I thought what’s wrong with computers.

    Schedulers are not as nice as you would think. Sometimes it’s even better to skip things and finish others, than to try to keep up and finally lost the fight of time by trying to calculate how efficient you can do it.

    This core I’m making will never get the marked I think but it’s nice to design and think about the future possibilities.

    How I thought to solve some of this problems is not a software but a hardware solution. The hardware should have it’s own onboard triggers, handlers and scheduler for optimized use of it. The hardware should have his own basic default routines to do things. I know its not SEXY to talk about multithreading and even less SEXY to make a program like.

    Get_NIC_data(TARGET MEMORY=WEBBROWSER)
    (the hard ware does the TCP or even the HTTP transaction)
    onWEBBROWSER_TRIGGER put it on screen (in color RED with BLUE corners)
    Get_NIC_data(TARGET MEMORY=WEBBROWSER>PLAYER)
    onWEBBROWSER_TRIGGER play the sound (volume = 3)
    onSOUND_TO_PLAY(mix data with existing sounds to play.

    The only thing a OS should do is define what the display style of a browser is. (maybe a to simple example but you know what I mean in big lines.)

    But I think the time is right to stop thinking about backwards compatibility and start thinking about modularize the hardware with its tasks. So you can simply plug in a extra NIC if it’s getting to slow or flash the old embedded code to a newer version.

    The CPU would get some rest to only memory to the right place. Who need a CPU with 32 cores??? Give me intelligent hardware modules that can do what there’re made for!!!!

  34. shubham tiwari says:

    yes matt , i can prove it mathamatically!!!

  35. gang cheng says:

    very interesting, I am going to compare different schedulers, it is very help for me!

  36. honey says:

    nice job keep it up.

Leave a Reply