The Matrix Square Root, Blocking and Parallelism

Posted on 11 Jul 2012

NAG recently embarked on a Knowledge Transfer Partnership with the University of Manchester to introduce matrix function capabilities into the NAG Library. As part of this collaboration, Nick Higham (University of Manchester), Rui Ralha (University of Minho, Portugal) and I have been investigating how blocking can be used to speed up the computation of matrix square roots.

There is plenty of interesting mathematical theory concerning matrix square roots, but for now we'll just use the definition that a matrix X is a square root of A if X² = A. Matrix roots have applications in finance and population modelling, where transition matrices are used to describe the evolution of a system over a certain time interval t. The square root of a transition matrix describes the evolution over the interval t/2. The matrix square root also forms a key part of algorithms for computing other matrix functions.
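As a small illustration (a hypothetical 2-state transition matrix, computed here with SciPy's general-purpose `sqrtm` rather than any routine discussed in this post):

```python
import numpy as np
from scipy.linalg import sqrtm

# Hypothetical one-year transition matrix for a 2-state system
# (each row sums to 1).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

X = sqrtm(P)  # describes the transition over half the interval

# Applying the half-interval matrix twice recovers the full interval.
print(np.allclose(X @ X, P))  # True
```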

To find a square root of a matrix A, we start by computing a Schur decomposition A = QTQ*. The square root U of the resulting upper triangular matrix T can then be found via a simple recurrence over the elements Uij and Tij: the diagonal entries are Uii = √Tii and, for j > i, Uij = (Tij − Σ UikUkj)/(Uii + Ujj), where the sum runs over k = i+1, …, j−1. A square root of A is then recovered as X = QUQ*.

We call this the point method.
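As an illustrative sketch of the point method (in Python/NumPy rather than the Fortran used in our experiments; the function name is ours), the recurrence for an upper triangular T might look like:

```python
import numpy as np

def sqrtm_point(T):
    """Square root of an upper triangular matrix T by the point method:
    U[i,i] = sqrt(T[i,i]) and, for j > i,
    U[i,j] = (T[i,j] - sum_{k=i+1}^{j-1} U[i,k]*U[k,j]) / (U[i,i] + U[j,j])."""
    n = T.shape[0]
    U = np.zeros_like(T, dtype=complex)
    for i in range(n):
        U[i, i] = np.sqrt(T[i, i] + 0j)  # complex sqrt, in case T[i,i] < 0
    for d in range(1, n):                # sweep one superdiagonal at a time
        for i in range(n - d):
            j = i + d
            s = T[i, j] - U[i, i + 1:j] @ U[i + 1:j, j]
            U[i, j] = s / (U[i, i] + U[j, j])  # assumes this sum is nonzero
    return U
```

For a real T with positive diagonal the imaginary part of U is zero and U @ U reproduces T to rounding error.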

In many numerical linear algebra libraries (such as LAPACK and the BLAS), run times can be drastically reduced by grouping operations into blocks that make more efficient use of a computer's cache memory. We found that run times for the point method could be similarly reduced by solving the recurrence in blocks, and that even greater speed-ups could be obtained with an alternative blocking scheme in which the matrix is repeatedly split into four blocks and the algorithm calls itself recursively on the diagonal blocks.
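The recursive scheme can be sketched as follows (again in Python; SciPy's `solve_sylvester` and `sqrtm` stand in for tailored routines, and the cutoff `nmin` is an illustrative choice, not a tuned value):

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

def sqrtm_recursive(T, nmin=2):
    """Recursively blocked square root of an upper triangular T (a sketch).
    Writing T = [[T11, T12], [0, T22]] and U likewise, the blocks satisfy
    U11 @ U11 = T11, U22 @ U22 = T22, and the Sylvester equation
    U11 @ U12 + U12 @ U22 = T12."""
    n = T.shape[0]
    if n <= nmin:
        return sqrtm(T)                      # direct method on small blocks
    m = n // 2
    U11 = sqrtm_recursive(T[:m, :m], nmin)   # the two diagonal blocks are
    U22 = sqrtm_recursive(T[m:, m:], nmin)   # independent subproblems
    U12 = solve_sylvester(U11, U22, T[:m, m:])
    return np.block([[U11, U12],
                     [np.zeros((n - m, m)), U22]])
```

The off-diagonal block comes from a Sylvester equation, so the bulk of the work moves into matrix–matrix operations, which is what makes blocking cache-friendly.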

The graph below shows run times for the three algorithms applied to triangular matrices of various sizes. Recursive blocking is about 10% faster than standard blocking and up to eight times faster than the point algorithm.

For full square matrices, much of the work is done in computing the Schur decomposition, but speeding up the triangular phase is nevertheless useful! The graph below shows run times for MATLAB's sqrtm (an implementation of the point algorithm) together with Fortran implementations, called from within MATLAB, of the point (fort_point) and recursively blocked (fort_recurse) algorithms. The recursive routine, fort_recurse, is about 2.5 times faster than sqrtm and over twice as fast as fort_point.

We wrote up our findings in much more detail in a paper. However, the plot thickened when we started investigating parallel implementations.

A very fruitful way of taking advantage of multicore architectures is simply to use threaded BLAS. Another approach, used in the NAG Library for SMP and Multicore, is to explicitly parallelise code using OpenMP. Some run times for the triangular phase of the algorithm using the various blocking schemes and parallel approaches are shown in the graph below.

If only a threaded BLAS was used, recursive blocking still performed best. However, when OpenMP was used, recursive blocking actually slowed down! The algorithm's performance was badly hit by synchronization requirements within the recursion. Synchronization and data dependencies are often crucial factors in determining how well an algorithm performs in parallel.

So the moral of this story is: the best algorithm for serial architectures may not be the best algorithm in parallel! (Of course, users of the NAG Library for SMP & Multicore need not concern themselves with this detail; they can rely on NAG developers providing optimal tuning to enable best runtime performance.)