Source: Dense Linear Algebra Pattern (page 4)
Author: Eric Battenberg
Blocked Approach
More sophisticated algorithms construct the product hierarchically from multiplications of smaller matrix blocks [7]. By computing outer products on small blocks of the input and output matrices, we can more eectively exploit spatial locali
Auto-Tuning the Blocked Approach
When using a blocked approach, the optimal block sizes will change across architectures. Variables that can aect the best blocking parameters include the number of oating point registers, cache sizes, TLB size, time to access various levels of the memory hierarchy, number of Foating point units, etc.. The interaction between all of these factors is so complicated that being able to decide on optimal blocking parameters while simply considering these variables is near impossible. An eective way to arrive at optimal values for the blocking parameters, I0, J0, and K0, is through the use of auto-tuning. As mentioned above, ATLAS[5] and PHiPAC[6] are two examples 4 % outer product approach for k = 1:K for i = 1:I for j = 1:J C(i,j) = C(i,j) + A(i,k)*B(k,j); = + x C(:,:) C(:,:) A(:,k) B(k,:) K
Figure 3: An implementation based on outer products. Each outer product represents a rank-one update to the entire product matrix.
of the successes of auto-tuning matrix multiply. In simple terms, to auto-tune a blocked matrix multiply, one would simply write a script to compile the blocked code with many dierent combinations of blocking parameters, measure the performance of each version of the code, and then choose the version that performed best.
One thing to be aware of when using a blocked approach is that many times, the optimal block size does not evenly divide the dimensions of the entire matrix. This can be handled in two ways. In the rst method, one can simply zero pad the matrices up to a size that is divisible by the block size. Alternatively, functions that use smaller-sized blocks can be written to handle the \fringes”. Multiple Level Blocking
A rst level of blocking, as discussed above, is commonly referred to as \register blocking” because the data reuse properties of such an approach are thought to exploit the speed of the processor registers. In order to more fully exploit additional levels of the memory hierarchy, multiple levels of blocking can be used. This concept is illustrated at a high level in Figure 5. Code for using an additional level of blocking would iterate through all of the larger, secondlevel blocks contained in each matrix. Within the code to handle each second-level block, the code to iterate through the rst-level blocks contained within the current second-level block would be called.
Using a second level of blocking is thought to exploit the L1 cache which is slower than the register but is much larger in size. Utilizing a third level of blocking would target the L2 cache. Subsequent levels of blocking, while possibly eective, are not typically used.
Surface to volume ratio discussion.