Intro, the Xeon Phi, and the Halt Loop

In this first post I’d like to first introduce my research, give an overview of what kinds of things I’ll likely be posting here, and then explain the name.

My research interest is broadly in operating systems, and I’m a PhD student in the Prescience Lab at Northwestern, under the direction of Peter Dinda. We do a lot of work in low-level systems software, from VMMs and high-performance computing to new OSes, parallel languages, and transactional memory. One of our notable projects is the Palacios Virtual Machine Monitor, which I’ve spent a lot of time hacking on.

Much of my work has revolved around questioning the established assumptions and principles around layers in systems software, whether they be abstraction layers or protection layers. More to the point, what really is the right border to draw between the software components running on your computer. What properties should that border have? We’ve more or less accepted the de facto model of how programs are run on a system, a model most evident in general purpose computing on, for example, Linux and x86, that has come to us as much through historical happenstances as through well-thought-out and researched design. Things change fast enough in our field that some of the basic assumptions based on the state of affairs 40 or 50 years ago may no longer (and likely will not) hold at all in the next few years. With core counts going through the roof, on-chip silicon photonics coming to the forefront, and exotic memory technologies going into production, tomorrow’s machine is going to look a lot different (HP’s Machine might be a good example of a confluence of these innovations) from the ones upon which the foundations of good system software design were established. We’ve all heard the now somewhat tired one-liner taken from Rob Pike’s polemic, Systems Software Research is Irrelevant. While I find it more or less accurate in context (I’ll be writing about this context later on), that piece has a timing strangely reminiscent of Lord Kelvin’s infamous proclamation at the turn of the 20th century that was shattered only a handful of years later during Einstein’s annus mirabilis. I get the feeling something similar is happening now—or is about to happen—in our field.

On this blog, I expect to write up some of the things I’ve been thinking about or working on that may have broader interest or utility to other people. Some things may have little relation to my research or even my field but may be in some way intellectually curious. You may at some point see posts from some of my colleagues on related topics, but I don’t currently have any plans for that in place.

Now for the name—’The Halt Loop’. This comes from the work I’m doing now on porting an operating system (called Nautilus) we built to Intel’s Xeon Phi. I’ll be writing a post dedicated to this next time, but for now suffice it to say that the Phi is a very interesting piece of hardware.

One of our Xeon Phis
One of our Xeon Phis

It is a PCIe accelerator housing many small Intel (in this case Atom) cores, meant to handle offload from the host CPU that’s particularly amenable to parallelization. Getting a custom OS running on it is far from trivial, and as far as I know we are one of two two have done it, the other group being the Barrelfish group at ETH. They created a port that spans both the host and the card. The somewhat stripped down Linux “µOS”, shipped as part of the MPSS software stack, I won’t count. To get to the point, while doing some performance testing with our OS, we found some very peculiar performance issues, particularly when nothing of importance was happening on the card.

After several days of tracking the problem down, we discovered the cause—the idle loop. When the machine cores were idle with no work to do, we had the core go into a loop like this:

while (1) {

Pretty standard stuff if you’re familiar with preemptive threading. To understand why this is an issue, it’s worth discussing what goes on in yield(). The most important thing that must be done is to check whether or not there are any other tasks to run on the core. The current design of our scheduler requires that the data structure holding these tasks must be accessed atomically, so at some point there’s a lock and unlock on that (core-specific) data structure.

The lock() routine in the end turns into an xadd instruction on this architecture, which generates a non-trivial amount of traffic on the memory system. This traffic appears to have a significant effect on the performance of the Xeon Phi, and is demonstrated by the following graph:


This graph shows the performance of HPCG (a standard HPC benchmark based on a numerical algorithm for computing conjugate gradient) running on top of the Legion runtime system (which I don’t have time to go into right now) on our kernel and on Linux.  This is a relatively small problem size, so it runs for a few seconds on a large core count. Here I’m using 32 of the 228 hardware threads available on the Xeon Phi. We can, for now, ignore Linux, but notice the 2x performance difference between the first two bars. The only difference is that in the second bar (with some additional nuances) we now have the following idle loop:

while (1) {

To seasoned OS developers this probably looks like common sense, especially on a machine with so many hardware threads per core, but I should mention that in our case we need to deal with very few context switches. It is the idle loop itself that is causing the issue.  We noticed that adding a delay instead of the halt also increased performance significantly.  That this would have an effect is not too surprising, but that it should result in such a large performance hit was quite surprising to us. At this point I’ve spent so much time reading Intel’s documentation for system software on the Xeon Phi that I feel like I encounter it in my dreams. Yet besides a cursory discussion of the hardware multithreading model, nothing indicates that something like an idle loop—even a particularly dumb one—should have such devastating effects on performance.

It doesn’t help that this document is really not meant for someone writing a new, third-party OS for the Phi. All you get is some high-level descriptions of the operational model, notes on divergences from typical Intelisms (that turn out to be pretty understated in their ability to confuse and vex you in a particularly Mickensian way), and a long list of memory-mapped registers. The latter turns out to be the most important, but you can see for yourself how verbose the descriptions of the registers are. It turns out in many cases to be more useful to dig through Intel’s modified Linux codebase than to reference that document. All that said, I’m sure Intel has sound reasons, both business and technical, for not releasing another tome of a manual.

The halt-while-idle-or-suffer issue has much the same flavor as many of the issues I typically run into in my research, so I thought ‘The Halt Loop’ would be an appropriate name for this blog.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s