Browse Prior Art Database

General Matrix Transpose Algorithm for a Parallel Processor Computer

IP.com Disclosure Number: IPCOM000111745D
Original Publication Date: 1994-Mar-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 4 page(s) / 77K

Publishing Venue

IBM

Related People

Hanson, WA: AUTHOR [+2]

Abstract

Disclosed is an algorithm for performing a matrix transpose in a distributed memory parallel processing environment. The matrix is divided evenly among the processors and divided into blocks. The blocks are exchanged among the processors, and the individual elements of each block are transposed. The matrix can be square or non-square. The processors are assumed to be connected through a cross-bar switch that allows each processor to send data to any other processor. An all-to-all communication pattern is used to determine the order in which the data block exchanges are to be performed.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

General Matrix Transpose Algorithm for a Parallel Processor Computer

      Disclosed is an algorithm for performing a matrix transpose in
a distributed memory parallel processing environment.  The matrix is
divided evenly among the processors and divided into blocks.  The
blocks are exchanged among the processors, and the individual
elements of each block are transposed.  The matrix can be square or
non-square.  The processors are assumed to be connected through a
cross-bar switch that allows each processor to send data to any other
processor.  An all-to-all communication pattern is used to determine
the order in which the data block exchanges are to be performed.

      The columns of the matrix to be transposed are divided evenly
among the processors, as shown in Fig. 1.  The columns are also
divided into equal-sized data blocks.  (These blocks are not square
if the matrix is not square.)  To accomplish the matrix transpose,
the processors must first exchange the data blocks:  for each pair of
processors P and Q, processor P must send block Q to processor Q, and
processor Q must send block P to processor P.  Each processor needs a
temporary storage buffer to store the received data blocks, as shown
in Fig. 2.  Two communication buffers, each the size of a data block,
are also needed for the sending and receiving of data blocks.  The
copying of a data block from the matrix to the send buffer, as well
as the copying of data from the receive buffer to the temporary
storage buffer, can be overlapped with the actual data communication.
The transposition of the individual elements of each data block can
be performed either during the copy to the send buffe...