Browse Prior Art Database

Method for Exact Evaluation of Sums in Floating-Point Method

IP.com Disclosure Number: IPCOM000116683D
Original Publication Date: 1995-Oct-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 2 page(s) / 54K

Publishing Venue

IBM

Related People

Megiddo, N: AUTHOR [+2]

Abstract

In a computer, a real number represented in a computer as a floating point number has three parts: sign, mantissa and exponent. The sign is given in one bit. The mantissa and the exponent are both integers. When adding a series of both positive and negative numbers into a running sum in floating point arithmetic, only a fixed number of the most significant bits of the sum can be stored in the mantissa. Subsequent subtractions can zero that part of the sum. Hence, the summation may terminate with no significant bits of the sum at all.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 57% of the total text.

Method for Exact Evaluation of Sums in Floating-Point Method

      In a computer, a real number represented in a computer as a
floating point number has three parts:  sign, mantissa and exponent.
The sign is given in one bit.  The mantissa and the exponent are both
integers.  When adding a series of both positive and negative numbers
into a running sum in floating point arithmetic, only a fixed number
of the most significant bits of the sum can be stored in the
mantissa.  Subsequent subtractions can zero that part of the sum.
Hence, the summation may terminate with no significant bits of the
sum at all.

      The problem solved in this invention is how to evaluate exactly
the sum of many numbers in one of the following senses:  (i) Find the
sum and represent it as a sum of exponent-disjoint floating point
numbers (ii) compute the sum at a given precision, and (iii) compute
only the sign of the sum.

      Three methods for the exact evaluation of sums in floating
point arithmetic are disclosed.  All the methods add the input
numbers in a certain order, and the running sum is represented in a
form fuller than one floating point number.  All the methods are
based on the idea of sorting the input numbers into buckets
containing only numbers with the same exponent.  The first phase of
all methods than adds up numbers in the same bucket repeatedly, until
each bucket contains only one number.

      In two of the methods, the numbers have to be
reverse-no...