# High Performance Matrix Multiplication Algorithm on a Distributed Memory Parallel Machine Using Overlapped Communication

Original Publication Date: 1993-Nov-01

Included in the Prior Art Database: 2005-Mar-21

## Publishing Venue

IBM

## Related People

Gustavson, FG: AUTHOR [+2]

## Abstract

This algorithm for computing C is based on an outer product formulation and consists of k′ basic steps. In each basic step all nodes participate in computing an update to the C matrix and in simultaneously transmitting data for the next update of C. Normally, an inter-node communication would be required between two steps. However, communication is overlapped with the computation by initiating the necessary inter-node communication a step earlier. That is, at the i-th step, when we are forming the i-th update to the C matrix, we are also distributing the data required for forming the (i+1)-st update to the C matrix. In any step we have two buffers on a node; one is used for receiving data from other nodes and the other is used for local computing.

**This text was extracted from an ASCII text file.**

**This is the abbreviated version, containing approximately 30% of the total text.**


In the next step the functionality of the two buffers is interchanged, i.e., a buffer used for receiving data in Step i is used for local computing in Step i+1, and vice versa. First, data distribution is discussed and later, details of the algorithm are given.
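The k′-step, double-buffered structure described above can be sketched serially as follows. This is a minimal stand-in, not the disclosure's implementation: the function name is mine, the "prefetch" of the next panels models the overlapped inter-node communication, and the two-element buffer list models the pair of node buffers whose roles swap each step.

```python
import numpy as np

def outer_product_matmul(A, B, k_prime):
    """Serial sketch of the k'-step outer-product formulation:
    C is accumulated as a sum of k' rank-k updates C += A_panel @ B_panel.
    Two buffers alternate roles: while one holds the panels for the
    current update, the other receives the panels for the next update."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % k_prime == 0, "K must split into k' panels"
    k = K // k_prime
    C = np.zeros((M, N))
    # buffers[0] starts loaded with the 0-th panels; buffers[1] is the
    # receive buffer for the next step's panels.
    buffers = [(A[:, 0:k].copy(), B[0:k, :].copy()), None]
    for sigma in range(k_prime):
        # "Prefetch" the (sigma+1)-st panels into the other buffer; on a
        # real machine this is the overlapped inter-node communication.
        if sigma + 1 < k_prime:
            s = (sigma + 1) * k
            buffers[(sigma + 1) % 2] = (A[:, s:s + k].copy(),
                                        B[s:s + k, :].copy())
        # Compute the sigma-th rank-k update from the current buffer.
        A_panel, B_panel = buffers[sigma % 2]
        C += A_panel @ B_panel
    return C
```

In the next iteration the index `sigma % 2` flips, so the buffer that was just filled becomes the compute buffer, mirroring the buffer interchange described in the text.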

Data Distribution: A parallel machine is viewed as a logical two-dimensional array of m′ × n′ nodes. A node at position (i,j) is denoted by p(i,j), where 0 ≤ i < m′ and 0 ≤ j < n′. Also, let 0 ≤ σ < k′. The quantity k′ denotes the number of basic steps our algorithm performs. Its choice depends upon P = m′n′, the number of processors or nodes, and/or the SGEMM/DGEMM performance of an m, n, k size problem. A, B, and C are partitioned. Note that the submatrix A_{iσ} is an m × k matrix,

A_{iσ} = A(im : im+m-1, σk : σk+k-1)

Similarly, B_{σj} and C_{ij} are k × n and m × n submatrices:

B_{σj} = B(σk : σk+k-1, jn : jn+n-1)
C_{ij} = C(im : im+m-1, jn : jn+n-1)

For simplicity of presentation we assume K = kk′, M = mm′, and N = nn′. The mapping of these submatrices onto the processors of the parallel machine is such that (i) A_{iσ} is assigned to node p(i, σ mod k′) (i.e., the column dimension of A is distributed in blocks of k to the columns of P in wrap-around fashion), (ii) B_{σj} is assigned to node p(σ mod k′, j) (i.e., the row dimension of B is distributed in blocks of k to the rows of P in wrap-around fashion), and (iii) C_{ij} is assigned to node p(i,j). Note that there is only one block of C which gets mapped to a node. On the other hand, there may be more than one submatrix of A and B assigned to a node. The exact number of submatrices mapped on to a node is determin...
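The wrap-around mapping can be illustrated with small hypothetical helpers (names and parameters are mine, not from the disclosure). The modulus of the wrap-around is left as a parameter here: the text applies σ mod k′, while the number of distinct node columns (respectively rows) actually bounds how many panels share a node, which is how more than one A or B panel can land on the same node.

```python
from collections import Counter

def owner_of_A(i, sigma, n_cols):
    # A panel with block row i, panel index sigma -> node p(i, sigma mod n_cols):
    # block-cyclic over the columns of the node grid P.
    return (i, sigma % n_cols)

def owner_of_B(sigma, j, n_rows):
    # B panel with panel index sigma, block column j -> node p(sigma mod n_rows, j):
    # block-cyclic over the rows of the node grid P.
    return (sigma % n_rows, j)

def owner_of_C(i, j):
    # Exactly one C block per node: C_{ij} -> p(i, j).
    return (i, j)

def a_panels_per_node(m_prime, n_prime, k_prime):
    """Count how many A panels the wrap-around mapping places on each node."""
    counts = Counter()
    for i in range(m_prime):
        for sigma in range(k_prime):
            counts[owner_of_A(i, sigma, n_prime)] += 1
    return counts
```

For example, with a 2 × 2 node grid and k′ = 5 panels, panel indices 0..4 wrap onto the two node columns as 0,1,0,1,0, so each node row holds three panels in column 0 and two in column 1, matching the observation that A and B panels, unlike C blocks, need not be unique per node.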