Automically Tuning Collective Communication for One-Sided Programming Models

TitleAutomically Tuning Collective Communication for One-Sided Programming Models
Publication TypeJournal Article
Year of Publication2009
AuthorsNishtala, R.

Technology trends suggest that future machines will rely on parallelism to meet increasing
performance requirements. To aid in programmer productivity and application performance,
many parallel programming models provide communication building blocks called col lective
communication. These operations, such as Broadcast, Scatter, Gather, and Reduce, ab-
stract common global data movement patterns behind a simple library interface allowing
the hardware and runtime system to optimize them for performance and scalability.
We consider the problem of optimizing collective communication in Partitioned Global
Address Space (PGAS) languages. Rooted in traditional shared memory programming
models, they deliver the benefits of sophisticated distributed data structures using language
extensions and one-sided communication. One-sided communication allows one processor
to directly read and write memory associated with another. Many popular PGAS language
implementations share a common runtime system called GASNet for implementing such
communication. To provide a highly scalable platform for our work, we present a new
implementation of GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of
thousands of processors.
We demonstrate that PGAS languages are highly scalable and that the one-sided com-
munication within them is an efficient and convenient platform for collective communication.
We show how to use one-sided communication to achieve 3
× improvements in the latency
and throughput of the collectives over standard message passing implementations. Using a
3D FFT as a representative communication bound benchmark, for example, we see a 17%
increase in performance on 32,768 cores of the BlueGene/P and a 1.5
× improvement on 1024
cores of the CrayXT4. We also show how the automatically tuned collectives can deliver
more than an order of magnitude in performance over existing implementations on shared
memory platforms.
There is no obvious best algorithm that serves all machines and usage patterns demon-
strating the need for tuning and we thus build an automatic tuning system in GASNet that
optimizes the collectives for a variety of large scale supercomputers and novel multicore
architectures. To understand the large search space, we construct analytic performance
2 models use them to minimize the overhead of autotuning. We demonstrate that autotun-
ing is an effective approach to addressing performance optimizations on complex parallel