Jacques Mattheij

Technology, Coding and Business

Poll vs Epoll, once again.

Introduction

If you read this blog with some regularity and you're not a technology person, specifically a systems-level programmer with a working knowledge of C and of what goes on in the guts of webservers, this post may not be for you. If you're curious, or you are part of that category, by all means read on.

Apologies for the formatting of this post; I'm doing all this on a tiny little screen, I'll try to fix it.

Today I spent some time trying to make sense of the guts of the poll vs epoll debate: the shell script that drives the test and the pipetest.c program that Zed Shaw used to try to determine which of the two is the better choice for his project 'mongrel2'.

Zed is a pretty opinionated guy and he seems to have a hard time conversing with people in a normal tone of voice, especially when they're critical of him. For all that, he may - or may not - have a point, and I figured that if he does it's worth spending some time on this.

Since Zed is now a scientist he will no doubt be happy that someone took the time to reproduce his findings. In fact, I think one conclusion of the work I did today is that, in the case Zed outlines, the situation is even better for poll than he made it out to be.

The bad news (for Zed at least) is that he misinterpreted the usage of the benchmarking utility that he used to come up with his numbers, even though the ‘usage’ string gave a pretty good hint about how it was to be used.

For the record, whether poll or epoll is faster in a given situation is to me a run race: for 'web'-like servers in the real world the answer seems to be 'epoll'. That is based on the observation (which I've confirmed for my own webserver, and I invite you to do the same for yours) that fds tend to be 'mostly idle' for a variety of reasons, and that in practice the situation tends to worsen as you get more fds, because of the limited availability of worker threads (which are relatively expensive, and are typically vastly outnumbered by the number of connections). Every worker thread that is busy doing something for one client is not available to service another one, so that other client simply has to wait.

Worker threads are the chosen solution for web servers of almost all plumage because they allow the CPU(s) to stay busy for other clients while waiting for stuff to complete (such as, but not limited to, disk IO); without worker threads in some form or other the server would grind to a complete halt at the first blocking IO request.

In this simulation there is no real work to be done, so the turnaround time for the work is '0'. Data received on one fd can be sent out right away on another. There is never an issue with sending data to the receiver: there is always space, so a write could not possibly block, and there are no back-ends to talk to. That is cheating in a way, because in real life output is the larger part of the story; you can't benchmark on input alone.

But that's what we have here, so that's what I'll use. It is important to keep in mind, though, that this is a simplification, and why that matters.
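To make the 'writes can not block' point from above a bit more concrete, here is a tiny sketch (my own illustration, not code from pipetest.c) of the kind of pipe pair the benchmark juggles; a token-sized write into an otherwise empty pipe buffer always succeeds immediately:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];              /* fds[0] = read end, fds[1] = write end */
        char token[16] = "tok";  /* a 'token' is tiny compared to the pipe buffer */

        if (pipe(fds) == -1) {
            perror("pipe");
            return 1;
        }
        /* non-blocking, the way an event-driven server would set things up */
        fcntl(fds[0], F_SETFL, O_NONBLOCK);
        fcntl(fds[1], F_SETFL, O_NONBLOCK);

        /* a 16 byte write into an empty pipe buffer (64 KiB by default on
           Linux) can not block or fail with EAGAIN; that is the 'output is
           free' situation the benchmark creates for itself */
        ssize_t n = write(fds[1], token, sizeof(token));
        printf("wrote %zd bytes without blocking\n", n);
        return 0;
    }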

I think (and have observed in real life) that servers spend more time doing other stuff than they spend polling, so unless you run tens of thousands of file descriptors you are unlikely to see much improvement. But if there is some gain to be had in this scenario then let's try to make the right decision, even if it's based on a situation that does not really reflect actual workloads. The one takeaway from all of this is that properly benchmarking anything is devilishly hard, even when it looks simple.

Typically during these tests, on an otherwise unloaded server, there were tens of thousands of calls per second of wall-clock time to both poll and epoll, on a single CPU (because the program is not threaded it can not take advantage of multiple CPUs).

Because each call to poll or epoll can potentially result in nfds/nthreads actions, the overhead of the call itself is fairly low. Comparing wall-clock time with time spent in the user process, typically only 20% of the time was spent in the user process; all the rest was spent on the kernel side, in poll/epoll, read or write.
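That 20% figure is easy to verify for yourself: compare the wall-clock time of a run against what getrusage() reports for user and system time. A small standalone sketch of that measurement (my own, not the instrumentation pipetest itself uses):

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval start, end;
        struct rusage ru;

        gettimeofday(&start, NULL);
        /* ... the benchmark proper would run here ... */
        usleep(100000);
        gettimeofday(&end, NULL);

        getrusage(RUSAGE_SELF, &ru);
        double wall = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

        /* everything that is not user time was spent in the kernel
           (poll/epoll, read, write) or waiting */
        printf("wall %.3fs  user %.3fs (%.0f%%)  sys %.3fs (%.0f%%)\n",
               wall, user, 100.0 * user / wall, sys, 100.0 * sys / wall);
        return 0;
    }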

Also, you should take the workload of your system into consideration. If you are sure that a very large portion of your fds will be active most of the time, and that your worker pool is sufficiently large compared to the number of connections you have, then there is no reason not to use poll (think media/file servers). If, on the other hand, you are fairly sure that the majority of your fds are inactive and/or that the size of your worker pool is substantially smaller than the number of fds you have open (think web servers), then there is no reason why you shouldn't use epoll.
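For those who don't have both APIs in their fingers: with poll() you hand the kernel the complete fd table on every single call and scan the whole array afterwards, whereas with epoll you register each fd once and epoll_wait() hands you back only the fds that are actually ready. A bare-bones sketch of both loops (skeleton code of my own, error handling omitted):

    #include <poll.h>
    #include <sys/epoll.h>

    /* poll(): the whole table goes to the kernel on every call, and we
       walk all nfds entries afterwards, active or not */
    void poll_loop(struct pollfd *fds, int nfds)
    {
        for (;;) {
            int n = poll(fds, nfds, -1);
            for (int i = 0; i < nfds && n > 0; i++) {
                if (fds[i].revents & POLLIN) {
                    /* handle_read(fds[i].fd); */
                    n--;
                }
            }
        }
    }

    /* epoll: fds are registered once with epoll_ctl(), and epoll_wait()
       returns only the fds that actually have events pending */
    void epoll_loop(int *fds, int nfds)
    {
        struct epoll_event ev, events[64];
        int epfd = epoll_create(1);

        for (int i = 0; i < nfds; i++) {
            ev.events = EPOLLIN;
            ev.data.fd = fds[i];
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        }
        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++) {
                /* handle_read(events[i].data.fd); */
            }
        }
    }

The shape of those two loops is really the whole story: mostly-idle fd sets favour epoll, mostly-busy ones favour poll.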

That’s the precise reason those two calls exist in the first place, and which one is ‘best’ (for whatever bit of extra throughput you can squeeze out of a machine because of this optimization) depends on your traffic.

For low-volume situations which one you choose is totally irrelevant; it only starts to matter when you have thousands, tens of thousands or even hundreds of thousands of fds to worry about.

For practical purposes you can safely ignore the outcome of all this; this is explicitly not a real-world test, and no data was created other than what was shuffled back and forth between memory buffers on the same machine(s).

I emphatically do not claim that any of this is science by any stretch of the imagination; it's simply a measurement of system performance. I don't have any 'testable hypothesis' to bring to the table and I won't be upset if anybody wants to criticise any of this.

Feel free, I like to learn stuff.

I’m just a ‘blub’ programmer with an itch to scratch.

This is simply a presentation of some facts and the code to produce the data, which is a slight variation on the code that Zed used, courtesy of RedHat.

Reproduction

My test setup is what I currently have to work with: a small laptop running Ubuntu NBR, and the servers in my server farm in Amsterdam, one of which I temporarily vacated to be able to run the test. The server runs kernel 2.6.17.11 SMP and is a 64-bit machine; the laptop runs 2.6.31-22 SMP and is a 32-bit machine. Both have plenty of RAM for this test.

I included the laptop for completeness' sake; of course when you use this stuff 'in real life' you'll be running it on server-grade hardware. It's just another data point, and it helps to verify that what Zed observed is not an anomaly.

Before we go to the actual meat of this test, a bit about what that infamous pipetest program really does.

Pipetest passes tokens around between the two sides of a pipe until those tokens have reached a maximum lifetime. When all the tokens have reached their maximum lifetime and all the 'threads' have finished, the total elapsed time and the number of tokens passed are used to compute the total throughput of the program in tokens per second.

The parameters to the program are the number of pipes to use, the number of simulated worker threads (these are not process threads; the program is single-threaded) and the number of times each token will be passed. Besides that there is a switch that tells the program which of the two poll versions to use and whether output should be formatted for visualisation using gnuplot.
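As far as I can tell from the code, the headline number is simply the total number of tokens passed divided by the elapsed wall-clock time; with one token per simulated thread and each token passed 'generations' times, that works out to roughly this (illustrative names, not the actual identifiers):

    /* Illustrative only: roughly how the tokens-per-second figure comes about.
       One token per simulated thread, each passed 'generations' times. */
    double tokens_per_second(int threads, int generations, double elapsed_seconds)
    {
        return ((double)threads * generations) / elapsed_seconds;
    }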

In tabular form, the output of the pipetest program over several runs on the two test machines looks like this (never mind the fractions, they’re meaningless, and probably the same goes for the last 3 digits but that is how the program outputs its results and I did not want to mess with it at this stage):

(all figures are tokens per second)

    Threads    laptop, 5000 generations     server, 5000 generations     server, 1000 generations
               poll         epoll           poll         epoll           poll         epoll
     1000      52794.72     113635.83       84178.84     236435.19       84162.04     231716.63
     2000      80807.31     115024.91       127368.02    187668.68       109851.82    188437.40
     3000      95930.27     114717.18       154023.00    185205.31       159714.92    184739.19
     4000      109707.55    115473.41       163494.94    188995.27       166714.32    187494.50
     5000      117928.17    113419.36       193359.85    184003.86       186738.63    185297.41
     6000      120943.38    113902.14       207693.91    184282.37       202951.42    185804.94
     7000      126391.55    113746.99       218761.49    181032.53       198644.44    186326.04
     8000      128464.63    113497.32       215132.22    181263.91       215884.37    175914.84
     9000      133985.83    110956.98       231116.02    189192.59       218715.00    188358.29
    10000      140910.47    111840.06       237152.70    181067.26       219449.18    186205.08

The faster of the two slots is coloured green; where it is too close to call, nobody gets green.

Each of the test runs used 10,000 pipes, so 20,000 fds in all.

At first glance everything is more or less as expected: poll passes more tokens around when there are more threads, so the number of fds per 'poll' operation is lower, while epoll is relatively flat across the board.

The only surprise to me in this data is the first epoll result in the 1,000 threads row, where epoll for some obscure reason seems to perform more or less on par with the best 'poll' result from the first column.

I ran the '1,000 generations' test to see if I could speed up testing without compromising the quality of the results, so that I could run more tests while tweaking the code. I hope you will agree with me that the differences, while measurable, are not of such a magnitude that they materially impact the validity of the test. After all the code changes are done I'll run another test with the number of token generations back at 5,000.

The other thing to notice is that the ‘crossover’ point where the poll version outperforms the epoll version is somewhere around 5,000 threads when using 10,000 pipes.

One thing Zed's script does not do is vary the number of pipes, so let's look at the effect of doing that (this test was only done on the server, with 1,000 generations; the number of threads is now given as a percentage of the number of pipes):

    Threads    1000 pipes                   10000 pipes (same as above)  20000 pipes
               poll         epoll           poll         epoll           poll         epoll
     10%       155888.53    480854.76       84162.04     231716.63       78448.75     191127.99
     20%       230266.45    437160.38       109851.82    188437.40       120056.84    182416.58
     30%       252568.83    409625.09       159714.92    184739.19       155138.36    179768.17
     40%       264276.37    379503.76       166714.32    187494.50       176217.87    175751.94
     50%       277974.36    330917.8        186738.63    185297.41       182674.37    177957.49
     60%       268144.08    289724.48       202951.42    185804.94       194176.30    177750.36
     70%       265375.87    278233.32       198644.44    186326.04       207907.90    180438.06
     80%       247331.14    250956.30       215884.37    175914.84       210931.99    175575.64
     90%       282519.12    229451.59       218715.00    188358.29       212061.12    171391.23
    100%       275873.78    230116.33       219449.18    186205.08       213802.77    178249.96

By now it should be clear that there is a pattern to all of this: past a certain ratio of 'threads' to pipes there is a crossover from where epoll is fastest to where poll is fastest.

ATR analysis: poll is even better!

Zed introduces a metric he names 'ATR', the active-to-total ratio. What he's getting at is that a critical part of evaluating poll versus epoll is that you have to take into account the total number of fds that a process is monitoring versus the number of fds that are actually active at any one time.

The guts of the pipetest program do this:

At startup time a token is 'seeded' into each of the pipes that are to see activity, so they have something to read from. Then, in each iteration of the main loop, a poll is done on all fds; those that have tokens waiting are read from and the tokens are processed. The processing consists of incrementing the generation number of the token (and checking whether it has reached the cut-off value), and then writing the token to the write side of the pipe.

Rinse, repeat.
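In code, the shape of that loop is roughly as follows. This is a deliberately stripped-down paraphrase that I wrote to illustrate the mechanism (poll() variant only, my own identifiers, not the actual pipetest.c code):

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    struct token { int thread; int generation; };

    int main(void)
    {
        enum { NR_PIPES = 100, NR_THREADS = 10, MAX_GENERATION = 1000 };
        int rd[NR_PIPES], wr[NR_PIPES];
        struct pollfd pfds[NR_PIPES];

        for (int i = 0; i < NR_PIPES; i++) {
            int p[2];
            if (pipe(p)) { perror("pipe"); return 1; }
            rd[i] = p[0]; wr[i] = p[1];
            pfds[i].fd = rd[i];
            pfds[i].events = POLLIN;
        }

        /* seed: one token per simulated worker thread */
        for (int t = 0; t < NR_THREADS; t++) {
            struct token tok = { t, 0 };
            write(wr[t % NR_PIPES], &tok, sizeof(tok));
        }

        long passed = 0;
        int alive = NR_THREADS;
        while (alive > 0) {
            poll(pfds, NR_PIPES, -1);
            for (int i = 0; i < NR_PIPES; i++) {
                if (!(pfds[i].revents & POLLIN))
                    continue;
                struct token tok;
                read(rd[i], &tok, sizeof(tok));
                passed++;
                if (++tok.generation >= MAX_GENERATION)
                    alive--;                          /* token reached its maximum lifetime */
                else
                    write(wr[i], &tok, sizeof(tok));  /* pass it on */
            }
        }
        printf("%ld tokens passed\n", passed);
        return 0;
    }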

So when you have 10,000 pipes in total and 1,000 threads, each thread is 'responsible' for 10 fds, and the number of tokens in circulation is limited to the number of simulated worker threads. The active-to-total ratio for a thread then is the number of threads divided by the number of pipes: in this case 10%, and with every line in the tables above you go up by 10% until the number of threads equals the number of pipes, in which case there is a 100% ATR, guaranteed. In real life you'd never have 10,000 worker threads though, so the higher portion of these graphs is only interesting for some future computer on which we can run tens of thousands of threads, but I currently don't have one like that at my disposal.

But which file descriptors are the ones that see activity? That doesn't seem like a relevant question at first, but the risk with benchmarks (any benchmark, not just this one) is that your creation does not model the real world and that you end up testing something that does not reflect the use case for which the equipment under test was designed.

I added a couple of lines to pipetest.c to see which fds were seeing the activity, from '0' to 'MAX_FDS', in 10 buckets. Typically the fds were used from the top down, sequentially. So, for instance, when running with 1,000 threads on 10,000 fds only the top 1,000 fds would be used; the others would never see activity (file run.server.activity1). Now why would that matter?
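The instrumentation itself is nothing more than a small histogram: divide the fd range into 10 buckets and count a hit every time an fd sees activity. In spirit it amounts to this (a paraphrased sketch, the identifiers are illustrative):

    #include <stdio.h>

    #define MAX_FDS 20000

    static long fd_activity[10];

    /* call this for every fd that sees a read in the main loop */
    static void count_fd(int fd)
    {
        fd_activity[fd * 10 / MAX_FDS]++;
    }

    /* print the distribution at the end of a run */
    static void dump_fd_activity(void)
    {
        for (int i = 0; i < 10; i++)
            printf("bucket %d (fds %5d-%5d): %ld hits\n", i,
                   i * MAX_FDS / 10, (i + 1) * MAX_FDS / 10 - 1, fd_activity[i]);
    }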

I think it matters because that is not a typical spread of active fds out of a large pool; normally speaking it's not one sequential block at the end of the set that sees all the activity, the active fds are spread out all over the set.

And I expect that this works to the advantage of poll because poll can’t be impacted in a negative way by spreading out the fds but epoll definitely can (think about how you would implement epoll to see why).

I wonder what would happen if I add a bit of randomness so that the fds that see activity are spread out more uniformly across the total set.

Now if you look carefully at the code of pipetest.c you come across this magical bit of code:

    #if 0   /* what the heck is this? – JEM */
        pipe_idx = toke->generation * 17 + toke->thread;
        pipe_idx += toke->thread * (nr_pipes / max_threads);
        pipe_idx %= nr_pipes;
    #endif

        pipe_idx = nr_pipes - toke->thread - 1;

It almost makes you wonder what would happen if you were to enable it. Now, Mr. JEM, presumably the guy that disabled the code and replaced it with the 'improved' version below it, may not have realised it, but that piece of code distributes the activity across all the fds in the set.

The '17' is a dead giveaway: that's a prime number, as we all know, and a good hint that someone was trying hard to make sure he wasn't hitting the same entries over and over again.
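You can convince yourself of that by replaying the formula for a whole run and bucketing where the activity lands; the hits end up spread across the whole pipe set instead of being pinned to one end of it. A standalone illustration (parameters mirror the 10,000 pipes / 1,000 threads case):

    #include <stdio.h>

    int main(void)
    {
        int nr_pipes = 10000, max_threads = 1000, generations = 1000;
        long buckets[10] = { 0 };

        /* the re-enabled formula from pipetest.c, replayed for every
           (thread, generation) pair of a run */
        for (int thread = 0; thread < max_threads; thread++) {
            for (int generation = 0; generation < generations; generation++) {
                int pipe_idx = generation * 17 + thread;
                pipe_idx += thread * (nr_pipes / max_threads);
                pipe_idx %= nr_pipes;
                buckets[pipe_idx * 10 / nr_pipes]++;
            }
        }
        for (int i = 0; i < 10; i++)
            printf("pipes %4d-%4d: %ld hits\n",
                   i * nr_pipes / 10, (i + 1) * nr_pipes / 10 - 1, buckets[i]);
        return 0;
    }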

So, re-enable that bit of code, and retest.

Lots of failing assertions about toke->tofd == fd not being what's expected; I'm assuming that check was added by JEM, so I'm removing it for now.

Retest, still no go. Debug some more: a couple of lines down there is a #if 1 that looks like it puts tokens on ice for a bit instead of sending them right away. Switched that back to the old version.

Retest. Still no go, still all the fds hit sequentially at the end of the set.

Turns out that ‘really_send_toke’ is also modified by ‘JEM’. I’m beginning to suspect that this program was downloaded from some less than pristine source.

Two more changes then: changed the READ to a WRITE in send_toke, and changed really_send_toke to make sure that we actually use that fd.

Finally! Nicely evenly distributed activity across all the fds.

Removed the difference with the ‘pending tokens’ line again to keep things as close as possible to the run that we did before. Retest to make sure the distribution is still even. Yep.

Ok, remove the logging of fd activity by commenting it out.

Retest.

Indeed, the spreading of the fds has slowed down epoll to some extent. The reason for this is that epoll now has to manage a larger list of fds internally.

Poll already had to manage all of them so for poll this change doesn’t make any difference.

Threads

Next, the story about the 'threads'. The threads in the pipetest benchmark are simulated: what happens (as outlined above) is that a limited amount of activity is set up in parallel, to simulate worker threads that each get to handle a bunch of incoming traffic. The number of tokens that is seeded determines the upper limit of this simulated worker pool.

The data that Zed Shaw provides simulates up to 10K threads on 10K pipes, for a 'coverage' of 0 to 100%. But how realistic is that figure? 10K threads is a number that we can really only dream of. On my most heavily loaded servers the thread pool is a few thousand threads at most, and those are pretty impressive beasts.

Nowhere near 10K. But I do have a very large number of sockets open; typically I run out of sockets before I run out of threads, by far. So 10K threads of activity (that is, 10K active tokens in circulation) when you have 10K pipes to read from means that every worker on every thread is busy processing. Typically that's not what you'll see. What you're more likely to see is between 500 and 2,000 worker threads on the largest servers, with socket pools of several tens of thousands (up to 64K on my machines).

So the active-to-total ratio on a real-life server with tens of thousands of sockets is in practice going to be severely limited by the size of the worker pool. After all, if there are only a few thousand workers able to do a read or a write at any given time, then by definition the rest of the sockets will be idle; whether there is data to be sent or received is moot.

Increasing the number of simulated threads to be equal to the number of pipes then becomes an academic exercise beyond something like 2,000 threads, at least for a simulation of the largest machines that I can currently lay my hands on (for normal money, anyway).

In the future this will no doubt change, and it's nice that you can plug fairly arbitrary numbers into pipetest and still get a result, but it helps if you ground those numbers in the real world.

The modified code that Zed Shaw provided and the output of the runs used to create the tables above are here: http://ww.com/epoll.tar

Zed's original article is here: http://sheddingbikes.com/posts/1280829388.html

and the HN article that triggered this response is here: http://news.ycombinator.com/item?id=1570694