2D image convolution is ubiquitous in image processing and computer
vision problems such as feature extraction. Exploiting parallelism is
a common strategy for accelerating convolution. Parallel processors
keep getting faster, but algorithms such as image convolution remain
memory-bound on processors such as GPUs. Therefore, reducing
memory communication is fundamental to accelerating image
convolution. To reduce memory communication, we reorganize the
convolution algorithm to prefetch image regions into registers, and we do
more work per thread with fewer threads. To enable portability to future
architectures, we implement a convolution autotuner that sweeps
the design space of memory layouts and loop unrolling configurations.
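
As an illustrative example, a minimal CUDA sketch of the register-blocking idea might look like the following, with the filter size and output tile shape fixed at compile time so the compiler can fully unroll the loops and keep the prefetched region in registers. The kernel name, tile shape, and single-channel row-major layout are assumptions for illustration, not the exact kernel described here:

```cuda
// Sketch: register-blocked 2D convolution. Each thread computes a
// TX x TY tile of output pixels for a compile-time FW x FW filter,
// prefetching its whole input footprint into registers once.
__constant__ float d_filter[7 * 7];   // filter coefficients (FW <= 7)

template <int FW, int TX, int TY>
__global__ void conv2d_regblock(const float *in, float *out,
                                int width, int height)
{
    // Top-left output pixel of this thread's TX x TY tile.
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * TX;
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * TY;
    if (x0 + TX + FW - 1 > width || y0 + TY + FW - 1 > height)
        return;  // skip tiles whose input footprint leaves the image

    // Prefetch the (TY+FW-1) x (TX+FW-1) input footprint. Compile-time
    // bounds let the compiler fully unroll these loops and keep r[][]
    // in registers rather than spilling to local memory.
    float r[TY + FW - 1][TX + FW - 1];
#pragma unroll
    for (int i = 0; i < TY + FW - 1; ++i)
#pragma unroll
        for (int j = 0; j < TX + FW - 1; ++j)
            r[i][j] = in[(y0 + i) * width + (x0 + j)];

    // Each prefetched value is reused by up to TX * TY output pixels,
    // which is where the reduction in global-memory traffic comes from.
#pragma unroll
    for (int ty = 0; ty < TY; ++ty)
#pragma unroll
        for (int tx = 0; tx < TX; ++tx) {
            float acc = 0.0f;
#pragma unroll
            for (int fy = 0; fy < FW; ++fy)
#pragma unroll
                for (int fx = 0; fx < FW; ++fx)
                    acc += r[ty + fy][tx + fx] * d_filter[fy * FW + fx];
            out[(y0 + ty) * width + (x0 + tx)] = acc;
        }
}
```

Because every loop bound is a compile-time constant, each (FW, TX, TY) choice compiles to its own fully unrolled kernel; larger tiles increase reuse per thread at the cost of register pressure, which is exactly the trade-off an autotuner can search.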
We focus on convolution with small filters (2x2–7x7), but our techniques
can be extended to larger filter sizes. Depending on filter size,
our speedups on two NVIDIA architectures range from 1.2x to 4.5x
over state-of-the-art GPU libraries.
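
A tuner over such a kernel can then be little more than a loop that times each candidate instantiation and keeps the fastest, as in the host-side sketch below. It assumes the conv2d_regblock kernel from the previous sketch with FW = 3; the candidate tile list is illustrative, and a fuller tuner would also sweep memory layouts and thread-block shapes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Defined in the previous sketch.
template <int FW, int TX, int TY>
__global__ void conv2d_regblock(const float *in, float *out,
                                int width, int height);

// Time one instantiation of the kernel (FW fixed at 3 here).
template <int TX, int TY>
float time_variant(const float *in, float *out, int width, int height)
{
    dim3 block(32, 8);  // thread-block shape, held fixed in this sketch
    dim3 grid((width / TX + block.x - 1) / block.x,
              (height / TY + block.y - 1) / block.y);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    conv2d_regblock<3, TX, TY><<<grid, block>>>(in, out, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

void tune(const float *in, float *out, int width, int height)
{
    // Tile shapes are template parameters, so every candidate is a
    // separate, fully unrolled kernel instantiation.
    struct { const char *name; float ms; } cand[] = {
        { "1x1", time_variant<1, 1>(in, out, width, height) },
        { "2x2", time_variant<2, 2>(in, out, width, height) },
        { "4x2", time_variant<4, 2>(in, out, width, height) },
        { "4x4", time_variant<4, 4>(in, out, width, height) },
    };
    int best = 0;
    for (int i = 1; i < 4; ++i)
        if (cand[i].ms < cand[best].ms) best = i;
    printf("best tile: %s (%.3f ms)\n", cand[best].name, cand[best].ms);
}
```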