This is the first of a multi-part series documenting some of the experience we’ve gained in getting our small OS running full blast on Intel’s Xeon Phi coprocessor. In my last post, I hinted at some of the work that we’ve been doing in our lab, but now I’d like to take some time to go into much more detail with the hope that this information will be useful to others; we certainly could have benefited from a few more pointers at the outset.
However, before we delve into any technical content, I’d like to first introduce a bit of historical context, if for no other purpose than to potentially shed light on some of the idiosyncrasies in the hardware design of the Phi and its architected interface.
A Brief Overview: History of the Xeon Phi (and more)
When I was still an undergraduate in the mid-aughts, the scientific computing community was, it seemed, alight with excitement over the realization that the massively parallel chips already in use by the computer graphics and gaming communities were actually fairly amenable to some of the problems that could be fashioned to fit the data-parallel motif (here’s a nice overview from 2008 on GPU computing, and another here from some of the top architects at NVIDIA). NVIDIA broke ground with the first programmable GPU in 2001, and by 2004 there was already a full-blown academic workshop on GPGPU computing. The next year, documents began appearing that showed readers how to massage these highly specialized pieces of hardware to fit their needs. In 2006, we began to see research that would ameliorate some of the programming burden with languages that hid the details of the GPU behind more familiar data-parallel constructs. After that, CUDA came to fruition, and while a bit rough around the edges (to say the least), it managed to become the mainstay of GPGPU computing.
It certainly helped that there was already a gigantic market for these chips among gamers, who demanded (and still do) a steady improvement in the visual aesthetics and realism of the medium. This, we can be sure, generated plenty of pressure on the manufacturers to improve their designs and, in accordance with Moore’s law, continue squeezing more and more transistors into the silicon. To scientists running their codes on self-managed Beowulf clusters or waiting for time on expensive department, university, or government machines, the realization that they could save a tremendous amount of time by buying a relatively cheap piece of gaming hardware at Best Buy must have been a welcome one.
I’m not all that familiar with the history of graphics hardware, but in many ways these graphics chips share a common lineage with some of the vector-based supercomputers that came out of the 1970s and 1980s. The first instance of a large-scale computer actually built for this kind of vector processing was probably either Control Data’s STAR-100 or the ILLIAC IV, depending on who you ask. The ILLIAC, developed amidst intense criticism of the Vietnam War and nuclear proliferation, had a troubled tenure at the UIUC campus. The ideas that were hammered out to build these machines later found their way into Seymour Cray’s Cray-1 and Steve Chen’s Cray X-MP (among others), both of which helped bring vector machines to the forefront of the high-performance computing market and further reinforced the mystical status that the two designers already enjoyed. I would refer any readers interested in a more technical history of these kinds of machines to Readings in Computer Architecture, an invaluable compendium of white papers and academic research on the subject. For a more approachable background, I would recommend The Supermen by Charles Murray, a fascinating historical narrative on the engineers who breathed life into supercomputing.
As the end of the first decade of the new millennium approached and CUDA was beginning to mature, there seems to have been a broad realization both within and without the research community that the GPGPU paradigm was not at all likely to go away. We began to see courses appear in CS curriculums offering GPU-centric content, and the graphics hardware companies, eager to take advantage of their momentum, began aggressively hiring academic researchers to keep it going.
It is important to realize that this flurry of interest in GPGPU computing was accompanied, and even preceded, by a strong sense of unease among engineers and computer scientists. Clock speeds on commodity chips had been leveling out because their power density was nearing that of a nuclear reactor (!), and the dreaded end of free performance handed down to us from device physicists and engineers was nigh. This was referred to as the Power Wall. As if that weren’t enough of a problem, David Patterson, among others, pointed out that we also had two other walls to deal with: the Memory Wall and the ILP Wall. The former arises from 1) limited pin bandwidth entering and leaving the chip, and 2) a much slower increase in memory performance relative to CPUs. The latter comes from diminishing returns in performance (power is an issue here as well) when you deepen a processor’s pipeline. Patterson’s slides, linked above, can in some ways be interpreted as a call for heterogeneous computing in general, but that’s a discussion for another time.
In any case, some form of parallelism seemed to be the only way forward. By the time GPGPU was gaining traction, Intel had already drawn up a plan to move most of its product line to the multi-core model, and AMD was following a similar path. The problem with these chips, at least from a technical computing perspective, is that they are really not designed for highly data-parallel codes. While SIMD units had been integrated into commodity processors for some time, they certainly weren’t a central feature. This, however, is where GPUs excelled. It seemed that the CPU giants were intently focused on keeping their claim in the desktop and server markets.
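To make "data-parallel" a little more concrete, here is a minimal sketch of SAXPY (y = a*x + y), a textbook data-parallel kernel. It is only an illustration: the function names and the `lanes` parameter are made up for this example, and the chunked version merely mimics in plain Python what a SIMD unit does in hardware.

```python
# Illustrative sketch only: SAXPY (y = a*x + y), a classic data-parallel kernel.
# Function names and the lane width are assumptions made for this example.

def saxpy_scalar(a, x, y):
    """Compute a*x[i] + y[i] one element at a time, like a plain scalar loop."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_chunked(a, x, y, lanes=4):
    """Compute the same result in groups of `lanes` elements.

    Each output element depends only on the inputs at the same index, so
    the work can be cut into fixed-width groups without changing the
    result. A SIMD unit exploits exactly this independence.
    """
    out = []
    for start in range(0, len(x), lanes):
        xs = x[start:start + lanes]
        ys = y[start:start + lanes]
        # On vector hardware, each group would be handled by a few
        # vector instructions rather than one instruction per element.
        out.extend(a * xi + yi for xi, yi in zip(xs, ys))
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [10.0] * 6
    print(saxpy_scalar(2.0, x, y))
    print(saxpy_chunked(2.0, x, y, lanes=4))
```

Both functions return the same list. The only difference is how many elements are handled per step, which is the property that wide vector units and GPUs are built to exploit.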
Sony, Toshiba, and IBM recognized the power of SIMD combined with multi-core as early as 2000 (at about the same time that NVIDIA released its GeForce 3) and so created a vehicle for it in the Cell Broadband Engine, which was used in the PlayStation 3. The Cell BE’s successor was integrated into IBM’s Roadrunner in 2008, which enjoyed a decent stint at the top of the TOP500. However, the Cell architecture was dropped the following year.
It wasn’t until the mid-2000s, when the Cell was reaching its full potential and CUDA’s first SDK release was imminent, that Intel began considering a chip aimed at graphics and technical computing, named Larrabee. Larrabee had its own turbulent history, and Intel scrapped the project in 2010, not long after finally giving a demo of the chip.
While the Larrabee project ultimately failed, Intel still had plenty of impetus to market to the HPC crowd. After all, in addition to the well-established university and government user base, HPC was a growing industrial resource as well, used heavily by the likes of oil companies and automotive research labs. The legacy of Larrabee was not entirely lost to history, and in fact many aspects of its design lived on.
While it couldn’t compete with NVIDIA in the graphics game, Intel managed to leverage its research designs to build a very useful piece of hardware. All three of its prototypes ultimately fed into what we now know as the Xeon Phi. In 2010, after dropping Larrabee, Intel introduced a new project under the name Many Integrated Core, or MIC, targeted solely at scientific computing. The remnants of Intel’s other projects can be seen in the first prototype of the MIC family, codenamed Knights Ferry.
The Larrabee graphics components were completely scrapped, but we can still see the wide vector units in use, and the large core-count (32) along with the simplicity of the cores reflects some of the design aspects used in the other prototype many-core chips.
Knights Ferry had a fairly limited user base (including CERN) and was quickly superseded by Knights Corner, the architecture used in most of the Xeon Phis you’ll find in the wild today. With Knights Corner, Intel raised the core count to 50+ and beefed up the memory capacity from 2 GB to 6-8 GB. The cores themselves are based on older Pentium cores with several additions, including the fat 512-bit SIMD registers. I’ll go into more detail about the design next time, but it should at least be clear at this point that Knights Corner is a large step up from Larrabee.
While Larrabee was a definite failure, its successor, the Xeon Phi, is enjoying considerable adoption among the HPC community. Above is a screenshot taken directly from the TOP500 website that shows the Xeon Phi accounting for 28% of all coprocessors present in TOP500 machines. It certainly is a nifty little card, so I would expect that number to keep growing, especially if the hype around the next-generation Knights Landing holds any water.
Next time I will dive into some of the hardware design and the programming interface of the Xeon Phi, specifically as it relates to building a third-party OS.