CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

Title: CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Bauer, M., Cook, H. M., & Khailany, B.
Conference Name: International Conference on Supercomputing, SC'11
Date Published: 11/2011
Conference Location: Seattle, WA

As the computational power of GPUs continues to scale with
Moore's Law, an increasing number of applications are becoming
limited by memory bandwidth. We propose an approach
for programming GPUs with tightly-coupled specialized
DMA warps for performing memory transfers between
on-chip and off-chip memories. Separate DMA warps improve
memory bandwidth utilization by better exploiting
available memory-level parallelism and by leveraging efficient
inter-warp producer-consumer synchronization mechanisms.
DMA warps also improve programmer productivity
by decoupling the need for thread array shapes to match
data layout. To illustrate the benefits of this approach,
we present an extensible API, CudaDMA, that encapsulates
synchronization and common sequential and strided
data transfer patterns. Using CudaDMA, we demonstrate
speedup of up to 1.37x on representative synthetic microbenchmarks,
and 1.15x-3.2x on several kernels from scientific
applications written in CUDA running on NVIDIA Fermi
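
The warp-specialization idea described in the abstract can be sketched in plain CUDA. The example below is a minimal illustration under stated assumptions and does not use the CudaDMA API: the names `scale_tiles` and `run_demo`, the tile size, and the thread counts are all hypothetical, and block-wide `__syncthreads()` stands in for the finer-grained inter-warp producer-consumer synchronization the paper refers to. One extra "DMA" warp stages tiles from global memory into a double-buffered shared-memory array while the compute warps consume the previously staged tile. The tile size is deliberately not equal to the number of compute threads, to show that the data layout need not match the shape of the compute thread array.

```cuda
#include <vector>
#include <cuda_runtime.h>

// All sizes below are illustrative assumptions, not values from the paper.
constexpr int WARP_SIZE       = 32;
constexpr int COMPUTE_THREADS = 4 * WARP_SIZE;  // warps that do the arithmetic
constexpr int DMA_THREADS     = WARP_SIZE;      // one warp that only moves data
constexpr int TILE            = 512;            // elements staged per step

// Block b handles tiles b, b + gridDim.x, b + 2 * gridDim.x, ...
// At step k the DMA warp loads tile k into buf[k % 2] while the compute
// warps process tile k - 1 from buf[(k - 1) % 2].
__global__ void scale_tiles(const float* in, float* out, int num_tiles, float alpha)
{
    __shared__ float buf[2][TILE];

    const int  stride = gridDim.x;
    const int  first  = blockIdx.x;
    const int  count  = (first < num_tiles) ? (num_tiles - first + stride - 1) / stride : 0;
    const bool is_dma = threadIdx.x >= COMPUTE_THREADS;

    for (int k = 0; k <= count; ++k) {
        if (is_dma) {
            if (k < count) {
                // Producer: copy the next tile from off-chip to on-chip memory.
                const int  lane = threadIdx.x - COMPUTE_THREADS;
                const long base = static_cast<long>(first + k * stride) * TILE;
                for (int i = lane; i < TILE; i += DMA_THREADS)
                    buf[k % 2][i] = in[base + i];
            }
        } else if (k >= 1) {
            // Consumers: process the tile staged during the previous step.
            const long base = static_cast<long>(first + (k - 1) * stride) * TILE;
            for (int i = threadIdx.x; i < TILE; i += COMPUTE_THREADS)
                out[base + i] = alpha * buf[(k - 1) % 2][i];
        }
        // Every thread in the block reaches this barrier once per step,
        // which hands the two buffers over between producer and consumers.
        __syncthreads();
    }
}

// Host-side check: returns the number of mismatching elements, or -1 on a CUDA error.
int run_demo()
{
    const int   num_tiles = 64;
    const int   n         = num_tiles * TILE;
    const float alpha     = 2.0f;

    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1000);

    float* d_in  = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale_tiles<<<8, COMPUTE_THREADS + DMA_THREADS>>>(d_in, d_out, num_tiles, alpha);
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != alpha * h_in[i]) ++mismatches;
    return mismatches;
}
```

To try the sketch, call `run_demo()` from a `main` function and compile with `nvcc`; a return value of 0 means every element was scaled correctly. The block-wide barrier keeps the example short, but it also makes the compute warps wait at every step, so it should not be expected to reproduce the speedups reported above.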

cudadma-sc11.pdf (737.58 KB)