Jacques Mattheij

Technology, Coding and Business

Poll vs Epoll (2), measuring the ATR on a very high volume site

Web servers do their work by advancing a connection through several states.

The first thing that happens is that a new connection is ‘accepted’ from a client that wishes to connect to the server.

Then there will be a delay until the client sends the request, followed by the server reading the request from the client. Requests may span more than one read, so the server may have to wait for a bit before the next part of the request becomes available; this is repeated until the request is complete.

If it takes too long for the client to send the request the server may ‘time out’ the connection and close it.

Assuming that doesn’t happen, and the request is indeed received before the time-out expires, the request is then analysed and passed off to a ‘worker’ to process it. Depending on the kind of server this can happen in any number of ways, but for the purpose of this discussion that is immaterial.

The worker may spend some time processing the request, and will prepare a response. As soon as the response is ready (and in some cases the first part of the response will be ready well in advance of the total and can be considered ready to be sent) the response will be sent back to the client.

Because the response can be too large for the output buffers of the webserver, the output will be written in blocks that fit into the output buffer.

Then, when the last block (or the only one) is written the connection will be closed, unless you’re using something called ‘keepalive’, in which case the connection will be recycled for another request, and it will only be closed when a fixed time-out has expired or the other side closes it first.

In a schematic form this looks like:

    accept
    delay
    read
    process
    delay
    write
    delay
    close

Where each of the delays may be 0 (so non-existent) and the read and write lines may have to be executed multiple times (I’m ignoring keepalive for the moment).
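
To make the schematic a bit more tangible, here is how that lifecycle might be modelled in code. This is purely my own illustration; the names are made up and not taken from any real server:

    /* Hypothetical per-connection state, mirroring the schematic
       above; an event loop advances each connection through these
       states as reads and writes complete. */
    enum conn_state {
            CONN_ACCEPTED,      /* accept() done, waiting for the request  */
            CONN_READING,       /* request arriving, possibly in pieces    */
            CONN_PROCESSING,    /* handed to a worker, building a response */
            CONN_WRITING,       /* response going out, possibly in blocks  */
            CONN_CLOSED         /* closed (or recycled under keepalive)    */
    };

    struct connection {
            int fd;                 /* the file descriptor           */
            enum conn_state state;  /* where in the lifecycle we are */
    };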

Every time data is read from or written to the file descriptor associated with the connection the connection can be considered ‘active’, at any other point in time the connection can be considered to be ‘inactive’.

The total number of connections is incremented on the ‘accept’ call, and decremented on the ‘close’.

A file descriptor is active when:

  • data is being read from it
  • data is being written to it

Otherwise it is idle, in any one of the ‘delay’ states.

Note that the kernel the webserver runs on and the webserver have completely different ideas about what it means for a connection to be active or idle.

For the kernel the connection is active whenever there is data coming in on the line, but the webserver has no way of knowing when that happens; it only sees that data when the kernel sees fit to pass it on.

Conversely, during a write, for the application it is ‘over’ when the kernel returns from the write call, but in reality the data may be posted to a buffer in the kernel and go out on the wire quite a bit later.

Also, from the point of view of an application ‘read’ and ‘write’ are atomic operations, nothing happens in between them, whereas for the kernel a single ‘read’ may be the result of the collation of a whole pile of fragmented packets, and a single ‘write’ may result in the creation of a whole pile of packets, for instance because the write is larger than the packet size that the underlying link layer supports.
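
To make the write side of this concrete, here is a minimal sketch of the kind of loop it implies on a non-blocking socket. This is my own illustration, with error handling trimmed; note that even a 'complete' write only means the kernel has accepted the data into its buffers, not that it has hit the wire:

    #include <errno.h>
    #include <unistd.h>

    /* Try to write 'len' bytes to a non-blocking fd; returns the number
       of bytes the kernel accepted, stopping early when its socket
       buffer is full.  A short count means: go back to the poll and
       wait until the fd becomes writable again. */
    ssize_t write_some(int fd, const char *buf, size_t len)
    {
            size_t done = 0;

            while (done < len) {
                    ssize_t n = write(fd, buf + done, len - done);
                    if (n > 0)
                            done += n;      /* buffered by the kernel  */
                    else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                            break;          /* buffer full, poll again */
                    else
                            return -1;      /* a real error            */
            }
            return (ssize_t)done;
    }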

Now, because all of this is in relation to the whole poll/epoll debate I’m going to count the number of fds that the ‘poll’ or ‘epoll’ is done on and then increment the ‘active’ counter for every fd that we are going to do a read or a write on based on the result of the poll/epoll.

Keeping track of those two variables then allows you to periodically compute the active-to-total ratio as seen by the webserver.
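
Stripped of everything server-specific, that bookkeeping amounts to something like the following sketch (my own illustration; the pollfd array itself is assumed to be maintained elsewhere by the server):

    #include <poll.h>

    static long npolled;    /* total number of fds handed to poll() */
    static long nactive;    /* fds that came back with events       */

    /* One pass of a poll() based event loop, with the two counter
       updates the measurement needs added in. */
    void poll_once(struct pollfd *fds, nfds_t nfds)
    {
            int ready = poll(fds, nfds, 100);   /* 100 ms timeout */

            if (ready >= 0) {
                    npolled += nfds;
                    nactive += ready;
            }
            /* ... dispatch reads/writes for the ready fds here ... */
    }

    /* The active-to-total ratio, to be sampled periodically. */
    double active_to_total(void)
    {
            return npolled ? (double)nactive / (double)npolled : 0.0;
    }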

Instrumenting web servers to measure these values is not that hard, but doing it for every webserver that you come across is not useful, because the ‘apache’ model (a pre-forked webserver in which each process handles a single file descriptor) more or less guarantees a very high ATR on the one file descriptor that each pre-forked process will handle. The other file descriptors (the ones that have been accepted but have not yet been assigned a worker thread) are going to be ignored until a worker thread is free to do something about them.

This figure is mostly interesting for servers that handle more than one fd per thread.

So, to get more real-world data we unfortunately can’t use the webserver with the biggest deployment, or any other webserver that uses the same model; at least not in any easy way, as far as I can see.

So, which one can we use?

I’m a big fan of ‘varnish’, one of the fastest - if not the fastest - front-end HTTP caches available today. It was built by someone who has an excellent reputation as a systems programmer, Poul-Henning Kamp, typically shortened to ‘phk’. Phk did a great deal of kernel programming and decided to apply his knowledge of the kernel side to create a front-end cache that is extremely fast.

To give you an idea of just how fast: I’m currently deploying about 48 instances of Varnish across 8 machines. At ‘low tide’ during the day the instances serve up about 400 requests per second each (so that’s an aggregate total of 48*400 = 19200 requests per second); during ‘peak’ hours the number doubles. The long term average is currently around 500 requests per second per instance.

The workload for this server is images, typically several sizes of the same image: a thumbnail, a 160x120 image and a full size one, where full-size has been limited to 640x480 pixels. File sizes range from a few hundred bytes for the thumbnails to a maximum of several tens of kilobytes for the larger images (see the previous article on this subject for why the kind of workload matters). Each instance uses 2000 worker threads to serve up to 64K connections.

Phk uses ‘epoll’ or ‘poll’ depending on what you choose during the configuration of the system (and what your kernel supports!). The (default) epoll version of the implementation is entirely event driven (that’s what the ‘e’ in epoll stands for), so it is harder to instrument it to find out the active-to-total ratio for the file descriptors that are to be polled. But fortunately there is also a fallback to ‘poll’, and you can force that one to be used even if your system supports ‘epoll’.
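
The difference in what is measurable follows directly from the shape of the two calls: with poll() the complete set of fds is handed to the kernel on every call, so the denominator of the ratio is right there in the arguments, while epoll_wait() only hands back the ready events and the size of the interest set never passes through the call site. A minimal sketch (not varnish’s actual code):

    #include <poll.h>
    #include <sys/epoll.h>

    /* poll: the full fd array goes in every time, so both numbers
       needed for the ratio are visible at the call site. */
    void wait_with_poll(struct pollfd *fds, nfds_t nfds)
    {
            int ready = poll(fds, nfds, 100);   /* nfds  = total polled    */
            (void)ready;                        /* ready = fds with events */
    }

    /* epoll: only the ready events come back; how many fds are being
       watched in total is not visible here. */
    void wait_with_epoll(int epfd, struct epoll_event *events, int maxevents)
    {
            int ready = epoll_wait(epfd, events, maxevents, 100);
            (void)ready;
    }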

I snuck this little bit of code into the poll version of the program (and of course the two variables to go with it):

                {
                        /* npolled and nactive are the two counters mentioned
                           above; hpoll + 1 is the number of pollfd slots that
                           were handed to poll(), v is poll()'s return value:
                           the number of fds with events pending */
                        npolled += hpoll + 1;
                        nactive += v;

                        /* batch the output: only log (and reset) once 10000+
                           events have accumulated */
                        if (nactive > 10000) {
                                FILE * fp;
                                fp = fopen("/tmp/activefd.log","a");
                                if (fp) {
                                        fprintf(fp,"%d %d\n",npolled,nactive);
                                        fclose(fp);
                                }
                                nactive = 0;
                                npolled = 0;
                        }
                }

Right after the call to ‘poll’. What it does is fairly obvious: it tracks the total number of fds polled and the number of fds returned that have events waiting to be processed. When 10000 or more events have been processed the data gets written to a log file and the counters are reset. Batching the data like this lessens the impact of the test on the varnish instance that we’re doing this on (this is a production server after all), and it also decreases the chance of the measurement influencing the results.

So, on to testing this:

    ./configure --disable-epoll
    make

This forces the switch to the doctored version of the poll code.

Make a backup of the old binary, replace it with the instrumented one we just built and restart the varnishd instances.

After a (very) short while this sort of output started to show up in the file:

    15887790 10001
    5624582 10001
    13947720 10001
    13232210 10001
    12942946 10001
    15698762 10001
    13519076 10002
    13889941 10001
    13717163 10001
    15750061 10001
    14406686 10001
    14879378 10001
    14411073 10001

The first column is the total number of file descriptors that were polled, the second the number of events received by poll. As you can see the ratio between the two is typically well in excess of 1000 fds polled for a single event received. In other words, this server sees an active-to-total ratio of less than .001 (for the first sample, 10001 / 15887790 ≈ 0.0006)! And that’s with 2,000 worker threads.
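
To turn that log back into a single number, a throwaway filter along these lines will do (a sketch of my own, reading the two-column format that the fprintf above produces):

    #include <stdio.h>

    /* Read "npolled nactive" pairs from stdin and print the overall
       active-to-total ratio. */
    int main(void)
    {
            long npolled, nactive;
            double total_polled = 0, total_active = 0;

            while (scanf("%ld %ld", &npolled, &nactive) == 2) {
                    total_polled += npolled;
                    total_active += nactive;
            }
            if (total_polled > 0)
                    printf("atr = %f\n", total_active / total_polled);
            return 0;
    }

Fed the thirteen samples above, this comes out at roughly 0.0007.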

Also interesting: the server load nearly doubled when switching from epoll to poll in this example.

The big difference between testing these varnishd instances and the ‘yawwws’ test that I wrote about earlier in that Hacker News thread (http://news.ycombinator.com/item?id=1573145) needs some explaining:

The more frequently you hit the ‘poll’, the more frequently you’ll do so for no reason at all, but the lower the latency as perceived by your clients will be. In other words, given that the number of events needed to handle a certain number of clients is more or less a constant, the frequency with which you poll them will determine how many events you will get per poll. For example, if your clients generate 10,000 events per second and you poll 100 times per second you’ll average 100 events per poll; poll 100,000 times per second and you’ll average just 0.1.

Let’s look at that in some detail:

Yawwws, my ‘homebrew’ webserver, is optimized to run in parallel to a very heavily loaded apache. I don’t particularly care about minimal latency (beyond some reasonable maximum everything is fine with me), as long as the customers are happy. The trade-off then becomes how frequently you want to hit that poll to see if anybody out there is ready to play.

Varnish is extremely aggressive in this respect: it hits the poll loop far more frequently than Yawwws does, and as a consequence it sees on average less activity per file descriptor each time the poll gets hit.

In other words, in a real life web serving situation you can increase the active-to-total ratio of the ‘poll’ by simply reducing the frequency with which you execute that poll, at the expense of an increase in latency; roughly speaking, the ratio behaves like the event rate divided by the poll frequency times the number of fds per poll.

If you care about latency and you are going to hit the poll loop aggressively, then epoll is your friend; if you want a program that is very friendly to its neighbours and you are willing to accept more latency as a consequence, then you can probably get away with using poll.