Saturday 20 October 2012

1.2 Floating Point

 Floating Point
Representation for non-integral numbers
  • including very small and very large numbers

Like scientific notation
  • $-2.34 \times 10^{56}$
  • $+0.002 \times 10^{-4}$
  • $987.02 \times 10^{9}$

In binary
  • $\pm 1.xxxxxxxx_{\text{two}} \times 2^{yyyyyy}$

Types float and double in C

Floating Point Representation

There are two representations:

  • Single precision (32-bit)
  • Double precision (64-bit)
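To make the 32-bit layout concrete, here is a minimal C sketch that splits a `float` into its three fields. It assumes `float` is an IEEE 754 single (1 sign bit, 8 exponent bits, 23 fraction bits), which holds on most current platforms but is not guaranteed by C itself. The names `SingleFields` and `split_single` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit single. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits */
} SingleFields;

/* Split a float into its fields (assumes float is IEEE 754 single precision). */
SingleFields split_single(float x) {
    uint32_t bits;
    SingleFields f;

    assert(sizeof bits == sizeof x);   /* sanity check on the assumption */
    memcpy(&bits, &x, sizeof bits);    /* copy the raw bit pattern */

    f.sign = bits >> 31;               /* top bit */
    f.exponent = (bits >> 23) & 0xFFu; /* next 8 bits */
    f.fraction = bits & 0x7FFFFFu;     /* low 23 bits */
    return f;
}
```

For example, `split_single(1.0f)` should give sign 0, exponent field 127 and fraction 0, because 1.0 is $1.0_{\text{two}} \times 2^{0}$ and the exponent is stored with the bias added.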


Example 1

Show the IEEE 754 binary representation of the number $-0.75_{\text{ten}}$ in single and double precision.

Solution:

The number $-0.75_{\text{ten}}$ is also

                                   $-3/4_{\text{ten}}$ or $-3/2^{2}_{\text{ten}}$

It is also represented by the binary fraction

                                   $-11_{\text{two}}/2^{2}_{\text{ten}}$ or $-0.11_{\text{two}}$

In scientific notation, the value is

                                   $-0.11_{\text{two}} \times 2^{0}$

And in normalized scientific notation, it is

                                   $-1.1_{\text{two}} \times 2^{-1}$

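The general representation for a single precision number is

                                   $(-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - 127)}$

The exponent of $-1.1_{\text{two}} \times 2^{-1}$ is $-1$, so the stored exponent field must be $-1 + 127 = 126$. Subtracting the bias 127 from that field gives back $-1$, and the result is

                 $(-1)^{1} \times (1 + .1000\ 0000\ 0000\ 0000\ 0000\ 000_{\text{two}}) \times 2^{(126 - 127)}$

For double precision the bias is 1023 and the fraction field has 52 bits, so the same number is

                 $(-1)^{1} \times (1 + .1000\ 0000\ \ldots\ 0000_{\text{two}}) \times 2^{(1022 - 1023)}$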
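The encoding worked out above can be checked on a real machine. The sketch below copies the raw bits of a `float` and a `double` into integers so they can be compared with the hand-derived pattern. It assumes the platform's `float` and `double` are IEEE 754 single and double precision. The helper names `single_bits` and `double_bits` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float (assumes IEEE 754 single). */
uint32_t single_bits(float x) {
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Return the raw 64-bit pattern of a double (assumes IEEE 754 double). */
uint64_t double_bits(double x) {
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

With sign 1, exponent field 126 (0111 1110) and fraction 1000...0, the single precision pattern is 1011 1111 0100 0000 ... 0000, so `single_bits(-0.75f)` should equal `0xBF400000`. With exponent field 1022, `double_bits(-0.75)` should equal `0xBFE8000000000000`.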



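Converting Binary to Decimal Floating Point

Example 2

Now let's try going the other direction. What decimal number is represented by the single precision bit pattern with sign bit 1, exponent field $1000\ 0001_{\text{two}}$, and fraction field $0100\ 0000\ 0000\ 0000\ 0000\ 000_{\text{two}}$?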

 Solution:

The sign bit is 1, the exponent field contains 129, and the fraction field contains $1 \times 2^{-2} = 1/4$, or 0.25. Using the basic equation,

    $(-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})} = (-1)^{1} \times (1 + 0.25) \times 2^{(129 - 127)}$

                                                              $= -1 \times 1.25 \times 2^{2}$

                                                              $= -1.25 \times 4$

                                                              $= -5.0$
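The basic equation translates almost directly into code. The following is a rough sketch for normalized single precision values only; the function name `decode_single` is hypothetical, and the fraction is passed as the 23-bit integer stored in the field, so 0.25 corresponds to `0x200000`.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate (-1)^sign * (1 + fraction / 2^23) * 2^(exponent - 127).
   Sketch only: handles normalized numbers, i.e. exponent field 1..254. */
double decode_single(uint32_t sign, uint32_t exponent, uint32_t fraction) {
    double value;
    int e;

    assert(exponent >= 1 && exponent <= 254);  /* 0 and 255 are reserved */
    assert(fraction < (1u << 23));             /* fraction is 23 bits wide */

    value = 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */

    /* Scale by 2^(exponent - bias) with exact doublings or halvings. */
    e = (int)exponent - 127;
    for (; e > 0; --e) value *= 2.0;
    for (; e < 0; ++e) value /= 2.0;

    return sign ? -value : value;
}
```

Calling `decode_single(1, 129, 0x200000)` follows the same steps as the hand calculation: 1.25 scaled by $2^{2}$, with a negative sign.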



Floating-Point Addition

Example: Decimal Floating-Point Addition

Try adding the numbers $0.5_{\text{ten}}$ and $-0.4375_{\text{ten}}$ in binary.

Solution:
Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

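        $0.5_{\text{ten}} = 1/2_{\text{ten}} = 1/2^{1}_{\text{ten}}$
                      $= 0.1_{\text{two}} = 0.1_{\text{two}} \times 2^{0} = 1.000_{\text{two}} \times 2^{-1}$

    $-0.4375_{\text{ten}} = -7/16_{\text{ten}} = -7/2^{4}_{\text{ten}}$
                  $= -0.0111_{\text{two}} = -0.0111_{\text{two}} \times 2^{0} = -1.110_{\text{two}} \times 2^{-2}$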

Now we follow the algorithm:

Step 1: The significand of the number with the lesser exponent ($-1.110_{\text{two}} \times 2^{-2}$) is shifted right until its exponent matches that of the larger number:

                               $-1.110_{\text{two}} \times 2^{-2} = -0.111_{\text{two}} \times 2^{-1}$

Step 2: Add the significands:

                               $1.000_{\text{two}} \times 2^{-1} + (-0.111_{\text{two}} \times 2^{-1}) = 0.001_{\text{two}} \times 2^{-1}$

Step 3: Normalize the sum, checking for overflow or underflow:

                       $0.001_{\text{two}} \times 2^{-1} = 0.010_{\text{two}} \times 2^{-2} = 0.100_{\text{two}} \times 2^{-3}$

                                                $= 1.000_{\text{two}} \times 2^{-4}$

Since $127 \ge -4 \ge -126$, there is no overflow or underflow. (The biased exponent would be $-4 + 127$, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)
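The range check in Step 3 is a simple comparison. A small sketch, with hypothetical helper names, assuming single precision (bias 127, biased values 0 and 255 reserved):

```c
#include <assert.h>
#include <stdbool.h>

enum { SINGLE_BIAS = 127 };

/* True if the unbiased exponent e fits a normalized single precision number,
   i.e. its biased form lies between 1 and 254. */
bool exponent_fits_single(int e) {
    int biased = e + SINGLE_BIAS;
    return biased >= 1 && biased <= 254;
}

/* Return the biased form of e; aborts if e would overflow or underflow. */
int biased_exponent_single(int e) {
    assert(exponent_fits_single(e));
    return e + SINGLE_BIAS;
}
```

For the sum above, `exponent_fits_single(-4)` holds and `biased_exponent_single(-4)` gives 123.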

Step 4: Round the sum:

                                           $1.000_{\text{two}} \times 2^{-4}$

The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding.

This sum is then

              $1.000_{\text{two}} \times 2^{-4} = 0.0001000_{\text{two}} = 0.0001_{\text{two}}$
                                      $= 1/2^{4}_{\text{ten}} = 1/16_{\text{ten}} = 0.0625_{\text{ten}}$

This sum is what we would expect from adding $0.5_{\text{ten}}$ to $-0.4375_{\text{ten}}$.
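The four steps can be sketched in C for the same toy format with 4 bits of precision. This is an illustration under simplifying assumptions, not how real hardware is built: the type `Tiny` and the functions `tiny_add` and `tiny_value` are invented here, the significand 1.xxx is stored as an integer from 8 to 15 (in units of $2^{-3}$), and bits shifted out are simply dropped instead of being kept as guard bits for rounding.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy format: value = (-1)^sign * (sig / 8) * 2^exp, sig in 8..15 when normalized. */
typedef struct {
    int sign;  /* 0 = positive, 1 = negative */
    int sig;   /* 4-bit significand 1.xxx as an integer, 0 for zero */
    int exp;   /* unbiased exponent */
} Tiny;

Tiny tiny_add(Tiny a, Tiny b) {
    Tiny r;
    int sum;

    /* Step 1: shift the operand with the lesser exponent right until both match. */
    while (a.exp < b.exp) { a.sig >>= 1; a.exp++; }
    while (b.exp < a.exp) { b.sig >>= 1; b.exp++; }

    /* Step 2: add the signed significands. */
    sum = (a.sign ? -a.sig : a.sig) + (b.sign ? -b.sig : b.sig);
    r.sign = sum < 0;
    r.sig = abs(sum);
    r.exp = a.exp;
    if (r.sig == 0) { r.sign = 0; r.exp = 0; return r; }  /* exact zero */

    /* Step 3: normalize to 1.xxx, then check the exponent range. */
    while (r.sig >= 16) { r.sig >>= 1; r.exp++; }
    while (r.sig < 8)   { r.sig <<= 1; r.exp--; }
    assert(r.exp >= -126 && r.exp <= 127);  /* no overflow or underflow */

    /* Step 4: rounding. This sketch truncates, so nothing is left to do. */
    return r;
}

/* Convert a Tiny to a double so the result can be checked in decimal. */
double tiny_value(Tiny t) {
    double v = t.sig / 8.0;
    int e;
    for (e = t.exp; e > 0; --e) v *= 2.0;
    for (e = t.exp; e < 0; ++e) v /= 2.0;
    return t.sign ? -v : v;
}
```

With `a = {0, 8, -1}` (that is, $1.000_{\text{two}} \times 2^{-1}$) and `b = {1, 14, -2}` (that is, $-1.110_{\text{two}} \times 2^{-2}$), `tiny_add(a, b)` passes through the same intermediate values as the worked example: the aligned significand 7, the sum 1, and three left shifts to reach significand 8 with exponent -4.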

Floating-Point Multiplication

Example: Decimal Floating-Point Multiplication

Let's try multiplying the numbers $0.5_{\text{ten}}$ and $-0.4375_{\text{ten}}$.

Solution:
In binary, the task is multiplying $1.000_{\text{two}} \times 2^{-1}$ by $-1.110_{\text{two}} \times 2^{-2}$.

Step 1: Adding the exponents without bias:

                                           $-1 + (-2) = -3$

Or using the biased representation:

            $(-1 + 127) + (-2 + 127) - 127 = (-1 - 2) + (127 + 127 - 127) = -3 + 127 = 124$
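The point of the biased calculation is that adding two biased exponents counts the bias twice, so one copy has to be subtracted. A short sketch, with a hypothetical function name and assuming single precision:

```c
#include <assert.h>

enum { BIAS = 127 };

/* Exponent field of a product, given the exponent fields of the operands.
   Both inputs already include the bias, so one bias is removed from the sum.
   The result still has to be range-checked by the caller. */
int biased_product_exponent(int biased_a, int biased_b) {
    assert(biased_a >= 1 && biased_a <= 254);  /* normalized operands only */
    assert(biased_b >= 1 && biased_b <= 254);
    return biased_a + biased_b - BIAS;
}
```

For the operands here, the exponent fields are 126 and 125, and `biased_product_exponent(126, 125)` gives 124.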

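Step 2: Multiplying the significands:

                             $1.000_{\text{two}} \times 1.110_{\text{two}} = 1.110000_{\text{two}}$

The product is $1.110000_{\text{two}} \times 2^{-3}$, but we need to keep it to 4 bits, so it is $1.110_{\text{two}} \times 2^{-3}$.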

Step 3: Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since $127 \ge -3 \ge -126$, there is no overflow or underflow. (Using the biased representation, $254 \ge 124 \ge 1$, so the exponent fits.)


Step 4: Rounding the product makes no change:

                             $1.110_{\text{two}} \times 2^{-3}$

Step 5: Since the signs of the original operands differ, make the sign of the product negative. Hence the product is

                             $-1.110_{\text{two}} \times 2^{-3}$

Converting to decimal to check our results:

$-1.110_{\text{two}} \times 2^{-3} = -0.001110_{\text{two}} = -0.00111_{\text{two}}$
                         $= -7/2^{5}_{\text{ten}} = -7/32_{\text{ten}} = -0.21875_{\text{ten}}$

The product of $0.5_{\text{ten}}$ and $-0.4375_{\text{ten}}$ is indeed $-0.21875_{\text{ten}}$.
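The five multiplication steps can be sketched the same way for the toy 4-bit format. As before, this is only an illustration: `Tiny`, `tiny_mul` and `tiny_value` are made-up names, the significand 1.xxx is stored as an integer from 8 to 15, both operands are assumed to be normalized and nonzero, and extra product bits are truncated instead of rounded.

```c
#include <assert.h>

/* Toy format: value = (-1)^sign * (sig / 8) * 2^exp, sig in 8..15. */
typedef struct {
    int sign;  /* 0 = positive, 1 = negative */
    int sig;   /* 4-bit significand 1.xxx as an integer */
    int exp;   /* unbiased exponent */
} Tiny;

Tiny tiny_mul(Tiny a, Tiny b) {
    Tiny r;
    int product;

    assert(a.sig >= 8 && a.sig <= 15);  /* normalized operands only */
    assert(b.sig >= 8 && b.sig <= 15);

    /* Step 1: add the (unbiased) exponents. */
    r.exp = a.exp + b.exp;

    /* Step 2: multiply the significands; the product is in units of 2^-6. */
    product = a.sig * b.sig;            /* between 64 and 225 */

    /* Step 3: normalize (the product may be 1x.xxxxxx), then check the range. */
    if (product >= 128) { product >>= 1; r.exp++; }
    assert(r.exp >= -126 && r.exp <= 127);  /* no overflow or underflow */

    /* Step 4: keep 4 bits. This sketch truncates the low 3 bits. */
    r.sig = product >> 3;

    /* Step 5: the product is negative exactly when the signs differ. */
    r.sign = a.sign ^ b.sign;
    return r;
}

/* Convert a Tiny to a double so the result can be checked in decimal. */
double tiny_value(Tiny t) {
    double v = t.sig / 8.0;
    int e;
    for (e = t.exp; e > 0; --e) v *= 2.0;
    for (e = t.exp; e < 0; ++e) v /= 2.0;
    return t.sign ? -v : v;
}
```

For `a = {0, 8, -1}` and `b = {1, 14, -2}`, the integer product is 8 × 14 = 112, which is $1.110000_{\text{two}}$ in units of $2^{-6}$. Dropping the low three bits leaves 14, that is $1.110_{\text{two}}$, with exponent -3 and a negative sign.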


