PERI: Autotuning Memory Intensive Kernels for Multicore

Title: PERI: Autotuning Memory Intensive Kernels for Multicore
Publication Type: Journal Article
Year of Publication: 2008
Authors: Williams, S., Datta, K., Carter, J., Oliker, L., Shalf, J., Yelick, K. A., & Bailey, D.
Journal: Journal of Physics, SciDAC PI Conference: Conference Series: 123012001

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to sparse matrix-vector multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the high-performance computing literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4× improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
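The search-based idea described in the abstract can be illustrated with a minimal sketch: generate several variants of a kernel, time each on the target machine, and keep the fastest. The code below is an assumption-laden illustration, not the authors' code generator. It uses a 1D explicit heat-equation step in plain Python for brevity, with a block size as the only tuning parameter, and the names `heat_step_blocked`, `make_heat_variant`, and `autotune` are hypothetical.

```python
import time


def heat_step_blocked(u, alpha, block):
    """One explicit heat-equation step on a 1D grid, swept in chunks of `block` points.

    Boundary values (first and last entries) are held fixed.
    """
    n = len(u)
    new = u[:]  # copy so the update reads only old values
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        for i in range(start, stop):
            new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def make_heat_variant(block):
    """Return a kernel variant specialized to one block size (hypothetical helper)."""
    return lambda u, alpha: heat_step_blocked(u, alpha, block)


def autotune(make_variant, candidates, args, repeats=3):
    """Time every candidate parameter and return (best_param, timings).

    The best-of-`repeats` wall-clock time is used to reduce timing noise.
    """
    timings = {}
    for param in candidates:
        kernel = make_variant(param)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(*args)
            best = min(best, time.perf_counter() - t0)
        timings[param] = best
    best_param = min(timings, key=timings.get)
    return best_param, timings


# Example search space and input (illustrative values only).
grid = [0.0] * 2048
grid[1024] = 1.0
candidates = [16, 64, 256, 1024]
best_block, timings = autotune(make_heat_variant, candidates, (grid, 0.1))
```

Every variant computes the same result; only the traversal order differs, so the search is free to pick whichever is fastest on a given platform. A real auto-tuner for compiled kernels would explore a much larger space of optimizations than this single parameter.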
