Autotuning Stencil Codes for Cache-Based Multicore Platforms

Title: Autotuning Stencil Codes for Cache-Based Multicore Platforms
Publication Type: Journal Article
Year of Publication: 2009
Authors: Datta, K.

As clock frequencies have tapered off and the number of cores on a chip has
taken off, the challenge of effectively utilizing these multicore systems has become
increasingly important. However, the diversity of multicore machines in today’s
market compels us to individually tune for each platform. This is especially true
for problems with low computational intensity, since memory latency and
bandwidth improve much more slowly than computational rates.
One such kernel is a stencil, a regular nearest neighbor operation over the points
in a structured grid. Stencils often arise from solving partial differential equations,
which are found in almost every scientific discipline. In this thesis, we analyze three
common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the
Gauss-Seidel Red-Black Helmholtz kernel.
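
To make the stencil definition concrete, the following is a minimal sketch of one sweep of a 7-point stencil over a cubic grid. The function names `make_grid` and `sweep_7pt` and the coefficients `alpha` and `beta` are illustrative assumptions, not code from the thesis, and the pure-Python loops are meant to show the access pattern rather than to perform well.

```python
def make_grid(n, fill=0.0):
    """Allocate an n x n x n grid as nested lists (hypothetical helper)."""
    return [[[fill for _ in range(n)] for _ in range(n)] for _ in range(n)]


def sweep_7pt(src, dst, alpha, beta):
    """One out-of-place 7-point stencil sweep over the interior points.

    Each output point is alpha times the centre value plus beta times the
    sum of its six face neighbours. Boundary points are left untouched.
    """
    n = len(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                neighbours = (src[i - 1][j][k] + src[i + 1][j][k] +
                              src[i][j - 1][k] + src[i][j + 1][k] +
                              src[i][j][k - 1] + src[i][j][k + 1])
                dst[i][j][k] = alpha * src[i][j][k] + beta * neighbours
```

Each interior point reads seven values and writes one, so the arithmetic per byte moved is low, which is why such kernels tend to be limited by memory traffic rather than by arithmetic.
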
We examine the performance of these stencil codes over a spectrum of multicore
architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the
highly multithreaded Sun Victoria Falls, and the low-power IBM Blue Gene/P. These
platforms not only have significant variations in their core architectures, but also
exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM
bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for
such a diverse set of platforms represents a serious challenge.
Unfortunately, compilers alone do not achieve satisfactory stencil code performance
on this varied set of platforms. Instead, we have created an automatic stencil
code tuner, or auto-tuner, that incorporates several optimizations into a single
software framework. These optimizations hide memory latency, account for non-uniform
memory access times, reduce the volume of data transferred, and take advantage of
special instructions. The auto-tuner then searches over the space of optimizations,
thereby allowing for much greater productivity than hand-tuning. The fully auto-tuned
code runs up to 5.4× faster than a straightforward implementation and is more
scalable across cores.
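
The search step can be illustrated with a minimal sketch, assuming a single tunable optimization (loop blocking of a 7-point sweep) and an exhaustive search that times every candidate. The names `blocked_sweep` and `autotune`, the block-size candidates, and the coefficients are hypothetical, and the actual auto-tuner covers a much larger optimization space than this. In pure Python the timings will not reflect cache behaviour, so the sketch shows only the structure of the search.

```python
import itertools
import time


def make_grid(n, fill=0.0):
    """Allocate an n x n x n grid as nested lists (hypothetical helper)."""
    return [[[fill] * n for _ in range(n)] for _ in range(n)]


def blocked_sweep(src, dst, alpha, beta, bj, bk):
    """7-point sweep with the j and k loops tiled into bj x bk blocks."""
    n = len(src)
    for jj in range(1, n - 1, bj):
        for kk in range(1, n - 1, bk):
            j_end = min(jj + bj, n - 1)
            k_end = min(kk + bk, n - 1)
            for i in range(1, n - 1):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        dst[i][j][k] = alpha * src[i][j][k] + beta * (
                            src[i - 1][j][k] + src[i + 1][j][k] +
                            src[i][j - 1][k] + src[i][j + 1][k] +
                            src[i][j][k - 1] + src[i][j][k + 1])


def autotune(n, block_sizes, trials=3):
    """Time every (bj, bk) pair; return the fastest pair and its time."""
    src = make_grid(n, 1.0)
    dst = make_grid(n, 0.0)
    best = None
    for bj, bk in itertools.product(block_sizes, repeat=2):
        elapsed = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            blocked_sweep(src, dst, 2.0, 0.5, bj, bk)
            # Keep the best of several trials to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if best is None or elapsed < best[1]:
            best = ((bj, bk), elapsed)
    return best
```

A call such as `autotune(32, [4, 8, 16, 32])` returns the fastest block-size pair found on the machine it runs on, which is the point of tuning per platform instead of fixing the parameters by hand.
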
By using performance models to identify performance limits, we determined that
our auto-tuner can achieve over 95% of the attainable performance for all three
stencils in our study. This demonstrates that auto-tuning is an important technique
for fully exploiting available multicore resources.
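
As a rough illustration of how a performance limit can be estimated, the sketch below uses a simple bandwidth-bound model: attainable performance is the smaller of the peak flop rate and the DRAM bandwidth multiplied by the kernel's flops per byte. This is an assumed, simplified model and not necessarily the one used in the thesis. The function names are hypothetical, and the figures in the usage note (8 flops and 16 bytes per point, 80 GFlop/s peak, 20 GB/s bandwidth, 9.6 GFlop/s measured) are made-up inputs, not measurements from the study.

```python
def attainable_gflops(peak_gflops, dram_gb_per_s, flops_per_point, bytes_per_point):
    """Upper bound on GFlop/s under a simple bandwidth-bound model."""
    intensity = flops_per_point / bytes_per_point  # flops per byte of DRAM traffic
    return min(peak_gflops, dram_gb_per_s * intensity)


def fraction_of_limit(measured_gflops, limit_gflops):
    """Fraction of the modelled limit that a measured rate achieves."""
    return measured_gflops / limit_gflops
```

With the hypothetical inputs above, `attainable_gflops(80.0, 20.0, 8, 16)` gives a bandwidth-bound limit of 10 GFlop/s, and a measured rate of 9.6 GFlop/s would correspond to roughly 96% of that limit.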