Browse Prior Art Database

Matrix Multiplication on a Parallel Processing Machine

IP.com Disclosure Number: IPCOM000088346D
Original Publication Date: 1977-May-01
Included in the Prior Art Database: 2005-Mar-04
Document File: 3 page(s) / 36K

Publishing Venue

IBM

Related People

Chandra, AK: AUTHOR

Abstract

A method for computing the product of any two given n x n matrices rapidly on a p processor single-instruction-stream multiple-data-stream computer is described herein. The algorithm runs in time O((n/2.81//p+ log n), and can be used for other problems including the solution of linear equations, finding the transitive closure of a graph, and context-free recognition.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 54% of the total text.

Page 1 of 3

Matrix Multiplication on a Parallel Processing Machine

A method for computing the product of any two given n x n matrices rapidly on a p processor single-instruction-stream multiple-data-stream computer is described herein. The algorithm runs in time O((n/2.81//p+ log n), and can be used for other problems including the solution of linear equations, finding the transitive closure of a graph, and context-free recognition.

Let U,V be the given n x n matrices where n = 2/k/ (if n is not a power of 2, the arrays are considered to be expanded to the next higher power of 2 by filling in zeroes). The result U*V is to be in matrix W. Let A = {0, 1, 2, 3, 4, 5, 6} and B = {0, 1, 2, 3}. Also, let f be the one-one onto function f: {0, 1, ..., n-1}/2/---> B/k/ given by: if 0 </= i,j </= n-1, and i=i(1),...,i(k), j=j(1),...,j(k) in binary, then f(i,j) = Alpha = Alpha(1), Alpha(2),...,Alpha(k). where Alpha(l) = 2i(l) + j(l) for all l </= k.

We have two arrays X,Y indexed by the k-digit strings from A. The algorithm is as follows. Each step (substep) indicates operations that may be done in parallel. Comment - in steps 3 and 6, terms Alpha 0 Beta, etc., are concatenations.

(Image Omitted)

Comment: In steps 3 and 6, all assignments are considered simultaneous. If this is not possible, additional space may be used for temporaries.

Steps 3 and 6 require 14*7/l -1/*4/k-l/ and 4*7/l -1/*4/k-l/ assignments, respectively, so that the total number of assignments required in steps 2-6 is There are only 3n/2/ assignments in steps 1 and 7, and all steps can use at least some constant percentage of all p processors, thereby running in time 0(n/2.81//p) for p </= n/2.81//log n. Additional detail follows.

Given p processors, steps 1 and 7, can be executed in time 0(n/2/log n/p + log n). The critical parts of the algorithm are steps 2-6. We show below how to implement steps 2 and 3 in more detail, and the implementation of steps 4,5 and 6 are similar. (Comment: The implementation could be simplified if simultaneous access to the same memory location by several processors is allowed.)

The p processors have names 0, 1,..., p-1. In addition, there are arrays NUMBER, MAX [0: k-1] which are initialized as follows:

(Image Omitted)

The significance of these arrays is the following. When step 3 of the algorithm is executed, there are 4/k - l/ distinct Beta's, and 7...