Using epoll? Might wanna think about batching epoll_ctl

How we seamlessly reduce system-call overhead and improve performance of epoll based servers
Daniel Shaulov

Daniel Shaulov

Jun 4, 2020

I recently encountered several web-services spending more than a good deal of CPU time executing epoll system calls. I decided to try to optimize them along with other epoll-based servers by reducing the number of syscalls used for epoll maintenance. By employing a seamless defer-and-batch strategy, I succeeded in lowering syscall overhead and improved server performance.

Before/ After using epoll_ctl
Epoll batching illustrated

The following section describes the history of polling mechanisms in Linux and other Unix-like systems, which (besides being interesting on its own) helps us better understand why epoll is designed the way it is. If you’re familiar with the history, feel free to skip ahead.

 

History

One of the main challenges of scaling a network server is handling a large number of concurrent connections while maintaining high performance. When performing I/O on these connections, only some will be ready for further progress, so the server must have a mechanism for determining which connections can be serviced without blocking. These mechanisms include select, poll and epoll, among others.

Select

Select was the first system call in Unix to poll many file descriptors (FDs) for read/write. It was created back in the 1980s to allow applications to multiplex I/O on multiple descriptors without the need to busy-loop. Select still exists today but it has its fair share of issues arising from its simplistic original design, one that made sense in the early days of Unix but fails to handle modern server workloads.

fd_set rfds;
struct timeval tv = {5, 0}; // Wait up to five seconds.
FD_ZERO(&rfds);
FD_SET(STDIN_FILENO, &rfds); // Watch stdin (fd 0) for input
 
select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv); // Assume success for sample code
if (FD_ISSET(STDIN_FILENO, &rfds)) {
    printf("Data is available now.\n");
}

One of the most egregious issues with select is the limited range of FDs—defined in FD_SETSIZE at 1024. Not only does this mean that select can’t handle more than 1024 FDs, it can’t even support a single FD greater than 1023! If your server has more than 1024 open FDs (including client connections, files, pipes—everything), select simply won’t work, even if it was called with only a few FDs.

Another problem with select’s design is its stateless API. Each call to select must pass three fd_sets to be polled (readfds, writefds, exceptfds). Then the syscall registers callbacks for all FDs only to deregister them before switching back to the user. This engenders substantial overhead for each call and performance is further exacerbated by the fact that select alters the fd_sets in-place, so the user must reset them between calls.

Poll

The next stage of FD polling was the aptly named poll syscall, contributing a much saner API. The new API meant no hard FD limit and no need to reset the input FD lists. But it still suffered from being stateless—each poll call still had the unfortunate overhead of iterating over all FDs upon syscall enter and exit.

struct pollfd fds[1];
 
// Watch stdin (fd 0) for input
fds[0].fd = STDIN_FILENO;
fds[0].events = POLLIN;
 
poll(fds, 1, 5 * 1000); // Wait up to five seconds. Assume success for sample code
if (fds[0].revents & POLLIN) {
    printf("Data is available now.\n");
}

Epoll

As demand rose for a better-scaled polling mechanism, Linux came out with epoll—the long awaited stateful alternative to select and poll. The new construct of the epoll API is the epoll instance, a kernel data structure that holds state relating to polling calls (e.g., interest list, ready list). The API provides the following syscalls for using epoll instances:

  1. epoll_create creates a new epoll instance in the kernel and returns an epoll file descriptor for this instance.
  2. epoll_ctl manipulates the interest list of an epoll instance. Available operations are EPOLL_ADD for adding a new FD to the list, EPOLL_MOD for modifying an existing FD’s settings, and EPOLL_DEL for deletion of an FD from the list.
  3. epoll_wait waits for FDs in the instance’s interest list, returning ready events. This is the actual polling syscall.

 

struct epoll_event event, events[1];
int epoll_fd, event_count;
 
epoll_fd = epoll_create1(0); // Assume success for sample code
event.events = EPOLLIN;
event.data.fd = STDIN_FILENO;
 
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &event); // Register stdin for EPOLLIN
 
event_count = epoll_wait(epoll_fd, events, 1, 5 * 1000); // Wait up to five seconds
if (event_count > 0) {
   printf("Data is available now.\n"); // Data is available on events[0].data.fd
}
 
close(epoll_fd);

While epoll fixed the stateless problem of select and poll and it seemed like we had finally obtained an ideal polling mechanism, things weren’t as simple as they, at first, seemed. Epoll introduced many new issues, one of them regarding its performance overhead. Working with epoll usually requires numerous calls for epoll_ctl, for example, when a webserver adds/removes client/backend connections, or when modifying an FD’s entry from EPOLLIN (read) to EPOLLOUT (write), or back. This means an awful lot of context switching to the kernel, which is exacerbated by Meltdown/Spectre mitigations that increase the penalty for syscalls.

 

Optimizing Epoll

The first attempt at optimizing epoll tries to defer as many calls to epoll_ctl as possible, then to execute them all as a batch when required, relieving the cost of many separate epol_ctl syscalls. This optimization technique comprises two components: a kernel-mode driver implementing a new kernel API I named “epoll_batch” (in the form of a device IOCTL), and a user-mode library that can seamlessly alter any epoll-based application by replacing epoll library calls with its own implementations. When running the server application, I simply add “LD_PRELOAD=/path/to/my/epoll_optimize.so” to its command line, giving me complete control over all epoll library calls: epoll_create, epoll_ctl, and epoll_wait.

Since I wished to defer epoll_ctl calls, I decided to handle their bookkeeping in a hashtable consisting of epfds for keys and lists of pending ctls as values. The implementation of the new epoll functions now became straight-forward:

  • epoll_create will first call the original epoll_create and then add a new empty list to the hashtable under the newly allocated epfd.
  • epoll_ctl won’t call the original syscall; instead it will add the ctl parameters to the hashtable under the given epfd.
  • epoll_wait will call our new syscall, epoll_batch, and pass it the pending ctls from the hashtable. Our syscall will execute all pending ctls and, only then, execute the wait on the epoll.

Seems simple enough. Not a lot of room for trouble, right? 

I wish! Unfortunately, I ran into a slew of problems, painfully learning that there was plenty of room for trouble.

 

The Slew of Problems

Error Handling

Let’s first discuss a general problem with syscall batching—error propagation. When deferring syscalls, we need to assume no errors at defer time, and only report the error later at the batched execution, failing the entire batch. In the case of epoll_ctl, there are a few ways in which it can fail, most resulting from invalid arguments. But some problems might be out of our control, like insufficient memory.

The naive approach I implemented at first assumed no errors happen in the server’s “happy flow”. A sensible assumption, I thought. The first few tests showed promise, but then one of the open-source web servers I was testing on broke completely. I traced the problem to a code snippet that looked something like this:

int epoll_ctl_set(int epoll_fd, int fd, struct epoll_event *event) {
	int res = epoll_ctl(epoll_fd, fd, EPOLL_CTL_ADD, event);
	if (res != 0 && errno == EEXIST) {
		res = epoll_ctl(epoll_fd, fd, EPOLL_CTL_MOD, event);
	}
	return res;
}

The code adds an FD to an epoll object regardless of its previous state. It either adds it or, if it was already in the epoll, updates its settings. Since we didn’t report the error on the first epoll_ctl (with EPOLL_CTL_ADD), the second one (with EPOLL_CTL_MOD) was never called. When we execute the entire batch we finally get the error and fail the entire batch.

To handle this situation, we worked the same logic into our epoll_batch syscall implementation—If a deferred EPOLL_CTL_ADD operation fails with EEXIST we attempt to perform an EPOLL_CTL_MOD instead. To be on the safe side, we also added the reverse —if a deferred EPOLL_CTL_MOD fails with ENOENT, we EPOLL_CTL_ADD it instead.

It’s important to note that this isn’t a foolproof generic solution to the problem of batching’s error propagation since we made assumptions about the application’s intent. It’s possible that the application has different logic than the one we implemented, in which case our current approach might still affect the application’s business logic.

Close

Upon closing a file/socket, it’s automatically removed from all epoll instances. An FD might be added to an epoll and subsequently closed before ever waiting on it. This is (shamefully!) wasteful, and there is never a good reason for it, but, nevertheless, some internal app logic and abstractions might cause this to happen. Since we defer the actual operations on the FD, they will be executed after the FD has been already closed, bringing disaster upon us. The best possible outcome is to get EBADF when finally calling epoll_ctl on the closed FD, but the more vicious scenario can happen if the FD was reassigned after being released. In that case, we perform epoll operations on an entirely different object!

My solution was to hook the call to close as well, and to perform all pending epoll_ctls for all epfds in the hashtable. This is far from ideal for performance, which is the reason I started this whole enterprise! This approach starts to show cracks, and the following issues don’t exactly improve the situation.

Nested Epoll

The next problem came from a little known (but fully documented) feature of epoll: nesting. It’s possible to add FDs to epoll_fd1, and then add epoll_fd1 to epoll_fd2. epoll_fd1 will indicate being readable when it has any events waiting. This is documented in the Q&A part of epoll(7):

Q3  Is the epoll file descriptor itself poll/epoll/selectable?

A3  Yes. If an epoll file descriptor has events waiting, then it will indicate as being readable.

Nesting epolls causes a problem for our implementation since, when an epoll_wait is called on epoll_fd2, we will only execute the deferred ctls for epoll_fd2, and not for epoll_fd1. The implicit wait on epoll_fd1 won’t function as expected, and the wait won’t be triggered when it should have.

After a fun debugging session, I discovered the issue and changed the implementation so it simply won’t defer operations on a nested epoll_fd, thereby not optimizing it at all. This approach was far less elegant than I had first hoped, but I haven’t lost hope yet.

Multiprocessing

Now consider the case where an application creates an epoll_fd, adds an FD to it, then forks. We defer the epoll_ctl before the fork, so we will end up executing the epoll_ctl in both the child and the parent when they each call epoll_wait. This means we now need to also hook fork and execute all pending ctls at fork time.

This now begs the question: what happens when an application uses epoll instances concurrently?

Multithreading

Using epoll concurrently is pretty common, so we need to make sure that access to our hashtable and to each FD’s list of deferred ctls is handled in a thread-safe manner. I tried adding classical locking mechanisms and also using lockless mechanisms in the aforementioned data structures, keeping the following invariants:

  • If an epoll_ctl was called (and deferred) before an epoll_wait, it will be executed before performing the wait.
  • If two epoll_ctls were called (and deferred) in the same thread, the order in which they were called will be the order they are executed.

Unfortunately, all of my attempts at solving this proved to have such an adverse effect on performance that they practically negated all the gains from batching the syscalls in the first place!

[Concluding a long Slack discussion] “New idea – we write a blog post with bad lockless implementations and show what’s wrong with them. We never tell anyone that those were all ours” ~ Anonymous (definitely not me)

 

A New Approach

I have to admit, at this stage I felt a bit disheartened. It seemed like using an epfd concurrently in multiple threads involved too many complications for my goal. Actually, some might say that it’s complicated enough even without my batching attempts. After all, the man page of epoll(7) has an Example for suggested usage (reasonable), a Questions and answers section (concerning), and finally a Possible pitfalls and ways to avoid them (seriously?). If you’re interested, you can learn even more about epoll’s multithreading pitfalls in “Epoll is fundamentally broken” and in Bryan Cantrill’s wonderful rant on BSD Now.

…careful programming may be needed to do this correctly.

– EPOLL(7) Linux Programmer’s Manual

I decided to try a new approach, focusing only on applications that use each epfd in a single thread (They can still use multiple threads with different epfds). This new approach allowed me to simplify my previous model. I got rid of the hashtable, instead keeping in Thread Local Storage a single ctls list per-thread. On each call to epoll_wait I now simply call the epoll_batch syscall with all of the deferred ctls of the calling thread, regardless of the epfd they were called on.

We no longer need to hook epoll_create, since there’s no per-epfd list. We no longer need any special functionality for nested epolls—all the ctls on the inner epfd will be executed before the next epoll_wait in the correct order. We no longer need to iterate many lists when an FD is closed or when the process is forked. We just need to call epoll_batch on the calling thread’s pending ctls, just like when epoll_wait is called.

With this new implementation, I was finally able to see the seamless performance improvements I wished for, even if only for a specific subset of applications. For those other daring applications I’ll have to find different tricks.


0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top
Skip to content