The IEEE floating-point standard reference article from the English Wikipedia on 24-Jul-2004
(provided by Fixed Reference: snapshots of Wikipedia from wikipedia.org)

IEEE floating-point standard


The IEEE floating-point standard (IEEE 754) is an IEEE standard, used by many CPUs and FPUs, which defines: formats for representing floating-point numbers; representations of special values (zero, infinity, very small values known as denormal numbers, and bit combinations that do not represent a number, known as NaNs); five exceptions, together with when they occur and what happens when they do; four rounding modes; and a set of floating-point operations that work identically on any conforming system.

IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (>= 43 bits, not commonly used) and double-extended precision (>= 79 bits, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Many languages specify that they implement IEEE arithmetic, although sometimes it is optional. The C programming language, for example, allows but does not require IEEE arithmetic. Where C does use it, float commonly implements IEEE single precision and double implements IEEE double precision.

The IEEE floating-point standard is also known as IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) and IEC 559: "Binary floating-point arithmetic for microprocessor systems".

Table of contents
1 Anatomy of a floating-point number
2 External links

Anatomy of a floating-point number

The following describes the standard's formats for floating-point numbers.

Bit Conventions

Bits within a word of width W are indexed with integers in the range 0 to W-1 inclusive. Bit 0 is drawn on the right. When the word, or a region within it, is read as a binary number, the lowest-indexed bit is usually also the least significant.

Single Precision 32 bit

A single-precision binary floating-point number is stored in a 32-bit word:

 1     8               23              width in bits
+-+--------+-----------------------+
|S|  Exp   |  Fraction             |
+-+--------+-----------------------+
31 30    23 22                    0    bit index (0 on right)
   bias +127

S - sign

Exp - exponent, stored with the bias added

Fraction - fractional part of the significand

The exponent is biased in the engineering sense of the word -- the value stored systematically deviates (by 127 here) from the actual value. Biasing is done because exponents have to be signed values to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, makes comparison harder. To solve this the exponent is biased before being stored, by adjusting its value to put it within an unsigned range suitable for comparison. So, for a single-precision float, an exponent in the range -126 .. +127 is biased by 127 to get a value in the range 1 .. 254 (0 and 255 have special meanings described below). When interpreting the floating-point number the process is followed in reverse.
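
The field layout and the bias can be checked with a short sketch. It assumes Python's standard struct module, whose ">f" format packs a value as a big-endian IEEE single. The helper name float32_fields is made up for this illustration and is not part of the standard.

```python
import struct

BIAS = 127  # single-precision exponent bias

def float32_fields(x):
    """Split x (rounded to single precision) into (S, Exp, Fraction)."""
    # Reinterpret the four bytes of the single as an unsigned 32-bit word.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31            # bit 31
    exp = (word >> 23) & 0xFF    # bits 30..23, stored with the bias added
    fraction = word & 0x7FFFFF   # bits 22..0
    return sign, exp, fraction

# -6.5 is -1.625 * 2**2, so the true exponent is 2 and the stored one is 2 + 127.
sign, exp, fraction = float32_fields(-6.5)
print(sign, exp, exp - BIAS, hex(fraction))  # prints: 1 129 2 0x500000
```

Subtracting BIAS from the stored Exp recovers the true exponent, which is the "process followed in reverse" described above.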

The set of possible data values can be divided into the following classes: zeroes, denormalised numbers, normalised numbers, infinities, and NaNs (Not a Number).

(NaNs are used to represent exceptional cases, such as the square root of a negative number.)

Each class can be distinguished by the value of the Exp field, together with whether the Fraction field is zero:

Consider the Exp and Fraction fields as unsigned binary integers (Exp will be in the range 0-255):

Class                  Exp     Fraction
Zeroes                 0       0
Denormalised numbers   0       non zero
Normalised numbers     1-254   any
Infinities             255     0
NaN (Not a Number)     255     non zero
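
As a rough illustration, the table can be turned directly into a classifier for 32-bit patterns. The function name classify_float32 and the returned strings are choices made for this sketch only.

```python
def classify_float32(word):
    """Name the class of a 32-bit pattern, following the table above."""
    exp = (word >> 23) & 0xFF    # Exp as an unsigned integer, 0..255
    fraction = word & 0x7FFFFF   # Fraction as an unsigned integer
    if exp == 0:
        return "zero" if fraction == 0 else "denormalised"
    if exp == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalised"          # Exp in 1..254, any Fraction

for pattern in (0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    print(hex(pattern), classify_float32(pattern))
```

The sign bit is deliberately ignored: it selects between the positive and negative member of each class but does not change the class.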

For normalised numbers, the most common, Exp is the biased exponent and Fraction is the fractional part of the significand. The number has value v:

v = s * 2^e * m

Where

s = 1 (positive numbers) when S is 0

s = -1 (negative numbers) when S is 1

e = Exp - 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.Fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of Fraction). Note 1 <= m < 2

Denormalised numbers are evaluated the same way, except that e = -126 and m = 0.Fraction. Note that -126 is the smallest exponent for a normalised number.
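
The formula v = s * 2^e * m, with the denormalised case included, can be sketched as follows. The function names decode_float32 and hardware_value are made up for this example, and the cross-check assumes Python's struct module, whose ">f" format is an IEEE single.

```python
import struct

def decode_float32(word):
    """Evaluate v = s * 2**e * m for a 32-bit pattern that is a finite number."""
    sign = word >> 31
    exp = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exp == 255:
        raise ValueError("infinity or NaN: the formula does not apply")
    s = -1.0 if sign else 1.0
    if exp == 0:
        e, m = -126, fraction / 2**23       # zero or denormalised: m = 0.Fraction
    else:
        e, m = exp - 127, 1 + fraction / 2**23  # normalised: m = 1.Fraction
    return s * 2.0**e * m

def hardware_value(word):
    """Let struct reinterpret the same bits as a single, for comparison."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(decode_float32(0xC0D00000))  # prints: -6.5
print(decode_float32(0x00000001) == hardware_value(0x00000001))  # prints: True
```

Every single-precision value is exactly representable as a Python float (a double), so the comparison with hardware_value is exact rather than approximate.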

There are two Zeroes, +0 (S is 0) and -0 (S is 1), and two Infinities +Inf (S is 0) and -Inf (S is 1).

Notice that NaNs and Infinities have all 1s in the Exp field.

Double Precision 64 bit

Double precision is essentially the same but the fields are wider:

 1     11                                52
+-+-----------+----------------------------------------------------+
|S|  Exp      |  Fraction                                          |
+-+-----------+----------------------------------------------------+
63 62       52 51                                                 0
   bias +1023

NaNs and Infinities are represented with Exp being all 1s (2047).

For normalised numbers the exponent bias is +1023 (so e is Exp - 1023). For denormalised numbers the exponent is -1022 (the minimum exponent for a normalised number).
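
The same decoding works for double precision once the field widths and bias are changed. The sketch below assumes Python's struct module (">d" is a big-endian IEEE double); float64_fields and decode_float64 are names invented for the illustration.

```python
import struct

def float64_fields(x):
    """Split a double into (S, Exp, Fraction): 1, 11 and 52 bits wide."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = word >> 63
    exp = (word >> 52) & 0x7FF           # bits 62..52
    fraction = word & ((1 << 52) - 1)    # bits 51..0
    return sign, exp, fraction

def decode_float64(sign, exp, fraction):
    """Rebuild the value of a finite double from its three fields."""
    if exp == 2047:
        raise ValueError("infinity or NaN")
    s = -1.0 if sign else 1.0
    if exp == 0:
        e, m = -1022, fraction / 2**52        # zero or denormalised
    else:
        e, m = exp - 1023, 1 + fraction / 2**52   # normalised, bias 1023
    return s * 2.0**e * m

print(float64_fields(1.0))                            # prints: (0, 1023, 0)
print(decode_float64(*float64_fields(0.1)) == 0.1)    # prints: True
```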

Comparing floating-point numbers

An interesting feature of this representation is that it makes it simple to compare numbers of the same sign that are not NaNs. For positive numbers a and b (sign bit 0), a < b exactly when the unsigned binary integers with the same bit patterns as a and b are ordered the same way. In other words, two positive floating-point numbers known not to be NaNs can be compared with an unsigned integer comparison on the same bits.
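
A small experiment makes the property concrete. It assumes Python's struct module for access to the bit patterns; float32_bits and same_order are hypothetical helper names, and the sample values are arbitrary non-negative numbers chosen for the sketch.

```python
import struct

def float32_bits(x):
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

def same_order(values):
    """True if sorting by value and sorting by bit pattern agree."""
    return sorted(values) == sorted(values, key=float32_bits)

# Non-negative samples: zero, a denormal, normalised numbers, and infinity.
samples = [2.0, 0.0, 1e30, 1.5, float("inf"), 1e-45, 1.0, 1.5e-38]
print(same_order(samples))  # prints: True
```

For two negative numbers the integer order is reversed: -2.0 is less than -1.0, but its bit pattern 0xC0000000 is greater than 0xBF800000, so the comparison has to be flipped when both sign bits are 1.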

External links


This article (or an earlier version of it) contains material from FOLDOC's article on IEEE Floating Point Standard, used with permission.