We examine the problem of solving many thousands
of small dense linear algebra factorizations simultaneously
on Graphics Processing Units (GPUs). We are interested
in problems ranging from several hundred rows and columns
down to 4 × 4 matrices. Problems of this size are common, especially
in signal processing. However, they have received very
little attention from current numerical linear algebra libraries
for GPUs, which have thus far focused only on very large
problems found in traditional supercomputing applications
and benchmarks. To solve small problems efficiently we tailor
our implementation to the GPUs inverted memory hierarchy
and multilevel parallelism hierarchy. We provide a model
of the GPU memory subsystem that can accurately predict
and explain the performance of our approach across different
problem sizes.
As a motivating example, we look at space-time adaptive
radar processing, a real-time application that requires
hundreds of independent QR factorizations of small complex
matrices (e.g., 240 × 66). For realistic matrix sizes from a
standard radar processing benchmark, our implementation on
an NVIDIA Quadro 6000 GPU runs 2.8× to 25× faster than
Intel’s Math Kernel Library (MKL) on an Intel Core i7-2600.
For the QR factorizations of 5,000 56 × 56 single-precision
matrices, our approach runs 29× faster than MKL and 140×
faster than the state-of-the-art linear algebra library for GPUs.
In each of these cases we use the GPU’s hardware-accelerated
division and square root functions, which are accurate
to 22 mantissa bits.
