CS-5984

Cache friendly programming


As we have seen in class, there are many alternative ways to perform the multiplication of two matrices. Each algorithm has advantages/disadvantages depending on how the underlying system is organized.

We want to compute the product C = C + A*B where A is mxn, B is nxp, and C is mxp. The direct iterative method is:

do i=1,m

do j=1,p

do k=1,n

C(i,j) = C(i,j) + A(i,k)*B(k,j)

end do

end do

end do

This method is called 'ijk' multiplication because of the order of the loops. Clearly this order can be changed without affecting the final result. By permuting the loops we obtain 'ikj', 'jik', 'jki', 'kij', and 'kji' multiplication algorithms.


Consider matrices with m=n=p=1024 (or other large size). Implement the six matrix multiplication algorithms in C or Fortran. Time their execution. The differences in execution times are due to the different data acces patterns the algorithms have.


Consider the best ordering and try to further improve performance using loop unrolling. Try unrolling different loops (i,j,k) and combinations, and use different unrolling levels. What is the additional increase in performance?


The Basic Linear Algebra Subroutine ( BLAS) package offers a standard interface to commonly used linear algebra algorithms. Browse through the description of the package. Learn about its capabilities.

Now use the BLAS dgemm.f subroutine to perform the same matrix multiplication. This is the netlib implementation of BLAS matrix-matrix multiplication. Time it. Do you notice any difference from your best implementation? (Note: BLAS are implemented in Fortran). An example of the Fortran driver to call dgemm is test_matrix_multiply.f.


Most computer vendors provide a BLAS library that is optimized for the specific architecture. Repeat the matrix multiplication with dgemm but now use the vendor-supplied BLAS library. You should be able to link to it directly, e.g. by using

f77 -O2 test_matrix_multiply.f -L/usr/local/lib -lblas

For Linux/Windows you can download an optimized BLAS from K. Goto.

The Automatically Tuned Linear Algebra Software (ATLAS) project offers a way to automatically adjust the BLAS code to take advantage of the given architecture. Learn about ATLAS from their web page. Download the pre-built ATLAS library that matches your machine and re-run the matrix multiplication example, this time using the dgemm function of ATLAS:

f77 -O2 test_matrix_multiply.f -L -llapack -lcblas -lf77blas -latlas

and compare the timing with the compiled and vendor's versions of BLAS. How different is the performance of the vendor/goto/ATLAS optimized matrix multiplication from your best implementation, and from the compiled BLAS?


Strassen proposed to do the matrix multiplication using only 7 elementary (or block) multiplications, and more additions. The idea can be extended to multiply n x n matrices with n=2^k. Divide the matrices into 4 blocks, each of dimension 2^{k-1} x 2^{k-1}, and apply Strassen's formula (where multiplications and additions are done on blocks). For each block multiplication one can recursively apply Strassen again. The complexity (i.e., the number of operations for large n) of the iterative matrix multiplication is Theta(n^3) multiplications (theta stands for ``in the exact order of''). The complexity of Strassen multiplication is Theta(n^{\log 2}) ~ n^{2.7})$. Thus in theory for large matrices (if additions are considerably less expensive than multiplications) Strassen is more efficient than the iterative algorithm.


asandu@cs.vt.edu (Adrian Sandu)
http://www.cs.vt.edu/~asandu/Courses/CS5984/Homework/homework_02.html