
High Performance Matrix Multiplication Algorithm on a Distributed Memory Parallel Machine Using Overlapped Communication

IP.com Disclosure Number: IPCOM000106507D
Original Publication Date: 1993-Nov-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 141K

Publishing Venue

IBM

Related People

Gustavson, FG: AUTHOR (and 2 others)

Abstract

This algorithm for computing C is based on an outer product formulation and consists of k' basic steps. In each basic step all nodes participate in computing an update to the C matrix and in simultaneously transmitting data for the next update of C. Normally, an inter-node communication would be required between two steps. However, communication is overlapped with the computation by initiating the necessary inter-node communication a step earlier. That is, at the i-th step, when we are forming the i-th update to the C matrix, we are also distributing the data required for forming the (i+1)-st update to the C matrix. In any step we have two buffers on a node; one is used for receiving data from other nodes and the other is used for local computing.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 30% of the total text.

High Performance Matrix Multiplication Algorithm on a Distributed Memory Parallel Machine Using Overlapped Communication

      This algorithm for computing C is based on an outer product formulation and consists of k' basic steps.  In each basic step all nodes participate in computing an update to the C matrix and in simultaneously transmitting data for the next update of C.  Normally, an inter-node communication would be required between two steps.  However, communication is overlapped with the computation by initiating the necessary inter-node communication a step earlier.  That is, at the i-th step, when we are forming the i-th update to the C matrix, we are also distributing the data required for forming the (i+1)-st update to the C matrix.  In any step we have two buffers on a node; one is used for receiving data from other nodes and the other is used for local computing.  In the next step the functionality of the two buffers is interchanged, i.e., a buffer used for receiving data in Step i is used for local computing in Step i+1, and vice versa.  First, data distribution is discussed and later, details of the algorithm are given.
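      As a concrete illustration of the two-buffer scheme, the following is a minimal sketch of one node's k' basic steps.  It is not the disclosure's code: MPI stands in for the machine's message-passing layer, CBLAS dgemm for the local SGEMM/DGEMM kernel, and the ranks a_src/a_dst/b_src/b_dst, the tags, and the function names are assumptions, since the abbreviated text does not specify the communication pattern.

    /* One node's view of the k' basic steps: compute with a[cur]/b[cur]
       while receiving the next step's panels into a[1-cur]/b[1-cur].
       Step 0's panels are assumed to be resident in a[0]/b[0]. */
    #include <mpi.h>
    #include <cblas.h>

    /* Local update C(m x n) += A(m x k) * B(k x n), column-major. */
    static void local_update(int m, int n, int k,
                             const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, m, B, k, 1.0, C, m);
    }

    void overlapped_steps(int kprime, int m, int n, int k,
                          double *a[2], double *b[2], double *C,
                          int a_src, int a_dst, int b_src, int b_dst,
                          MPI_Comm comm)
    {
        int cur = 0;
        for (int step = 0; step < kprime; ++step) {
            MPI_Request req[4];
            int nreq = 0;
            if (step + 1 < kprime) {
                /* Initiate the inter-node communication for step+1
                   before computing the update of step. */
                MPI_Irecv(a[1-cur], m*k, MPI_DOUBLE, a_src, step+1, comm, &req[nreq++]);
                MPI_Irecv(b[1-cur], k*n, MPI_DOUBLE, b_src, step+1, comm, &req[nreq++]);
                MPI_Isend(a[cur],   m*k, MPI_DOUBLE, a_dst, step+1, comm, &req[nreq++]);
                MPI_Isend(b[cur],   k*n, MPI_DOUBLE, b_dst, step+1, comm, &req[nreq++]);
            }
            local_update(m, n, k, a[cur], b[cur], C);  /* step-th update of C */
            MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
            cur = 1 - cur;  /* receive buffer becomes compute buffer */
        }
    }

Because the sends and receives are posted before the dgemm call, the transfers can proceed while the update is computed; the MPI_Waitall then ensures the next step's buffers are complete before the buffer roles are interchanged.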

      Data Distribution:  A parallel machine is viewed as a logical two-dimensional array of m' × n' nodes.  A node at position (i,j) is denoted by p(i,j), where 0 ≤ i < m' and 0 ≤ j < n'.  Also, let 0 ≤ σ < k'.  The quantity k' denotes the number of basic steps our algorithm performs.  Its choice depends upon P = m'n', the number of processors or nodes, and/or the SGEMM/DGEMM performance of an m, n, k size problem.  A, B, and C are partitioned.  Note that the submatrix A_{iσ} is an m × k matrix,

      A_{iσ} = A(im : im+m-1, σk : σk+k-1)

Similarly, B_{σj} and C_{ij} are k × n and m × n submatrices:

      B_{σj} = B(σk : σk+k-1, jn : jn+n-1)
      C_{ij} = C(im : im+m-1, jn : jn+n-1)
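      In code, the index arithmetic of this partitioning is simply the following (a sketch; the Block type and helper names are invented for illustration, with 0-based, inclusive bounds):

    typedef struct { int r0, r1, c0, c1; } Block;  /* rows r0..r1, cols c0..c1 */

    Block A_block(int i, int sigma, int m, int k)   /* A_{i sigma}: m x k */
    { return (Block){ i*m, i*m + m - 1, sigma*k, sigma*k + k - 1 }; }

    Block B_block(int sigma, int j, int k, int n)   /* B_{sigma j}: k x n */
    { return (Block){ sigma*k, sigma*k + k - 1, j*n, j*n + n - 1 }; }

    Block C_block(int i, int j, int m, int n)       /* C_{i j}: m x n */
    { return (Block){ i*m, i*m + m - 1, j*n, j*n + n - 1 }; }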

      For simplicity of presentation we assume K = kk', M = mm', and N = nn'.  The mapping of these submatrices onto the processors of the parallel machine is such that (i) A_{iσ} is assigned to node p(i, σ mod n') (i.e., the column dimension of A is distributed in blocks of k to the columns of P in wrap-around fashion), (ii) B_{σj} is assigned to node p(σ mod m', j) (i.e., the row dimension of B is distributed in blocks of k to the rows of P in wrap-around fashion), and (iii) C_{ij} is assigned to node p(i, j).  Note that there is only one block of C which gets mapped to a node.  On the other hand, there may be more than one submatrix of A and B assigned to a node.  The exact number of submatrices mapped onto a node is determin...
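      The block-to-node mapping (i)-(iii) can be sketched as follows (hypothetical helper names; the wrap-around is over the n' columns of P for A and the m' rows of P for B):

    typedef struct { int row, col; } Node;  /* node p(row, col) */

    Node owner_A(int i, int sigma, int nprime) { return (Node){ i, sigma % nprime }; }
    Node owner_B(int sigma, int j, int mprime) { return (Node){ sigma % mprime, j }; }
    Node owner_C(int i, int j)                 { return (Node){ i, j }; }

For example, with k' = 6 blocks of A wrapped onto n' = 4 processor columns, columns 0 and 1 each receive two blocks while columns 2 and 3 receive one.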