Awesome ICT: 1.2 Floating Point

Floating Point

Representation for non-integral numbers

including very small and very large numbers

Like scientific notation

-2.34 X 10⁵⁶
+0.002 X 10^-4
987.02 X 10⁹

In binary

± 1.xxxxxxxx _two X 2^yyyyyy

Types float and double in C
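
As a rough illustration of the binary form above, the following C sketch rewrites a double as significand X 2^exponent, with the magnitude of the significand in the range [1, 2). The names sci2 and normalize2 are hypothetical helpers invented for this example, and the code assumes a nonzero, finite input.

```c
#include <assert.h>

/* Hypothetical helper type: value == significand * 2^exponent. */
typedef struct {
    double significand; /* magnitude in [1, 2); the sign is carried here */
    int exponent;       /* power of two */
} sci2;

/* Rewrite a nonzero, finite value in normalized binary scientific notation.
   Halving or doubling a binary floating-point number only moves the
   exponent, so the loops below do not lose any bits. */
sci2 normalize2(double value) {
    /* value * 0.0 equals 0 only for finite inputs (it is NaN otherwise). */
    assert(value != 0.0 && value * 0.0 == 0.0);
    sci2 r = { value, 0 };
    while (r.significand >= 2.0 || r.significand <= -2.0) {
        r.significand /= 2.0;
        r.exponent++;
    }
    while (r.significand < 1.0 && r.significand > -1.0) {
        r.significand *= 2.0;
        r.exponent--;
    }
    return r;
}
```

For instance, normalize2(-0.75) yields a significand of -1.5 (that is, -1.1 _two) and an exponent of -1.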

Floating Point Representation

There are two representations:

Single precision (32-bit)

Double precision (64-bit)

Example 1

Show the IEEE 754 binary representation of the number -0.75 _ten in single and double precision.

Solution:

The number -0.75 _ten is also

-3/4 _ten or -3/2² _ten

It is also represented by the binary fraction

-11 _two/2²_ten or -0.11 _two

In scientific notation, the value is

-0.11 _two X 2⁰

And in normalized scientific notation, it is

-1.1 _two X 2^-1

The general representation for a single precision number is

(-1)^s X (1+fraction) X 2^{(Exponent-127)}

Adding the bias 127 to the exponent of -1.1 _two X 2^-1 gives the exponent field -1 + 127 = 126, so the result is

(-1)¹ X (1+.1000 0000 0000 0000 0000 000 _two) X 2^(126-127)
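
One way to check this result is to look at the raw bits in C. The sketch below assumes that float and double are IEEE 754 single and double precision (true on most current platforms, but not guaranteed by the C standard). The helper names float_bits, double_bits, single_sign, single_exponent and single_fraction are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bits of a float into a 32-bit integer.
   memcpy is used because it is a well-defined way to reinterpret bytes. */
uint32_t float_bits(float value) {
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption: 32-bit float */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Same idea for a double and a 64-bit integer. */
uint64_t double_bits(double value) {
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t)); /* assumption: 64-bit double */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Single precision fields: 1 sign bit, 8 exponent bits, 23 fraction bits. */
uint32_t single_sign(uint32_t bits)     { return bits >> 31; }
uint32_t single_exponent(uint32_t bits) { return (bits >> 23) & 0xFFu; }
uint32_t single_fraction(uint32_t bits) { return bits & 0x7FFFFFu; }
```

Under those assumptions, float_bits(-0.75f) is 0xBF400000: sign 1, exponent field 126 (0111 1110 _two), and a fraction of 1000 0000 0000 0000 0000 000 _two. In double precision the exponent field is 11 bits wide with a bias of 1023, so double_bits(-0.75) is 0xBFE8000000000000: sign 1, exponent field 1022, and a 52-bit fraction that again starts with a single 1.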

Converting Binary to Decimal Floating Point

Example 2

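Now let’s try going the other direction. What decimal number is represented by a single precision value whose sign bit is 1, whose exponent field is 1000 0001 _two, and whose fraction field is 0100 0000 0000 0000 0000 000 _two?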

Solution:

The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 X 2^-2 = 1/4, or 0.25. Using the basic equation,

(-1)^s X (1+fraction) X 2^{(Exponent-Bias)}= (-1)¹ X (1+0.25) X 2^(129-127)

= -1 X 1.25 X 2²

=-1.25 X 4

=-5.0
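
The same calculation can be sketched in C by applying the basic equation to the three fields of a 32-bit pattern. The function name decode_single is hypothetical, and the sketch only handles normalized numbers (exponent field 1 to 254). The pattern 0xC0A00000 used in the note after the code is the sign, exponent and fraction of this example packed into one word.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate (-1)^s * (1 + fraction) * 2^(exponent - 127) for a
   normalized single precision bit pattern. */
double decode_single(uint32_t bits) {
    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu);
    uint32_t fraction_field = bits & 0x7FFFFFu;

    /* Exponent fields 0 and 255 are reserved and not handled here. */
    assert(exponent >= 1 && exponent <= 254);

    /* The 23 fraction bits represent fraction_field / 2^23. */
    double fraction = (double)fraction_field / 8388608.0;

    /* Build 2^(exponent - 127) by repeated doubling or halving. */
    double scale = 1.0;
    for (int e = exponent - 127; e > 0; e--) scale *= 2.0;
    for (int e = exponent - 127; e < 0; e++) scale /= 2.0;

    double magnitude = (1.0 + fraction) * scale;
    return sign ? -magnitude : magnitude;
}
```

Here decode_single(0xC0A00000u) evaluates -1 X 1.25 X 2² and returns -5.0.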

Floating-Point Addition

Example: Binary Floating-Point Addition

Try adding the numbers 0.5 _ten and -0.4375 _ten in binary.

Solution:

Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

0.5 _ten = 1/2 _ten =1/2¹_ten

= 0.1 _two = 0.1 _two X 2⁰ = 1.000 _two X 2^-1

-0.4375 _ten = -7/16 _ten = - 7/2⁴_ten

= -0.0111 _two = - 0.0111 _two X 2⁰ = -1.110 _two X 2^-2

Now we follow the algorithm:

Step 1: The significand of the number with the lesser exponent (-1.110 _two X 2^-2) is shifted right until its exponent matches that of the larger number:

-1.110 _two X 2^-2 = -0.111 _two X 2^-1

Step 2: Add the significands:

1.000 _two X 2^-1 + (-0.111 _two X 2^-1) = 0.001 _two X 2^-1

Step 3: Normalize the sum, checking for overflow or underflow:

0.001 _two X 2^-1 = 0.010 _two X 2^-2 = 0.100 _two X 2^-3

=1.000 _two X 2^-4

Since 127 ≥ -4 ≥ -126, there is no overflow or underflow. (The biased exponent would be -4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)

Step 4: Round the sum:

1.000 _two X 2^-4

The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding.

This sum is then

1.000 _two x 2^-4 = 0.0001000 _two =0.0001 _two

=1/2⁴_ten =1/16 _ten =0.0625 _ten

This sum is what we would expect from adding 0.5 _ten to -0.4375 _ten.
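
The four steps can be mimicked in C with a toy format that keeps a 4-bit significand as a small integer. This is only a sketch under simplifying assumptions: the type fp4 and the functions fp4_add, fp4_value and exponent_fits_single are invented for this example, both operands must be normalized, and bits shifted out are simply dropped, whereas real hardware keeps extra guard bits for rounding.

```c
#include <assert.h>

/* Hypothetical toy format: value = sign * (significand / 8) * 2^exponent.
   A normalized significand 1.xxx _two is stored as an integer from 8 to 15. */
typedef struct {
    int sign;        /* +1 or -1 */
    int significand; /* 4-bit pattern b.bbb as an integer */
    int exponent;    /* unbiased power of two */
} fp4;

/* Range check from Step 3, using the single precision exponent limits. */
int exponent_fits_single(int exponent) {
    return exponent >= -126 && exponent <= 127;
}

fp4 fp4_add(fp4 a, fp4 b) {
    assert(a.significand >= 8 && a.significand <= 15);
    assert(b.significand >= 8 && b.significand <= 15);

    /* Step 1: shift the operand with the smaller exponent to the right. */
    while (a.exponent < b.exponent) { a.significand >>= 1; a.exponent++; }
    while (b.exponent < a.exponent) { b.significand >>= 1; b.exponent++; }

    /* Step 2: add the signed significands. */
    int sum = a.sign * a.significand + b.sign * b.significand;
    fp4 r = { sum < 0 ? -1 : 1, sum < 0 ? -sum : sum, a.exponent };
    if (r.significand == 0) return r; /* exact zero: nothing to normalize */

    /* Step 3: normalize so the significand is back in 1.xxx form. */
    while (r.significand >= 16) { r.significand >>= 1; r.exponent++; }
    while (r.significand < 8)   { r.significand <<= 1; r.exponent--; }

    /* Step 4: rounding is plain truncation here (the shifts above). */
    return r;
}

/* Convert a toy value to a double for checking. */
double fp4_value(fp4 x) {
    double v = x.sign * (x.significand / 8.0);
    for (int e = x.exponent; e > 0; e--) v *= 2.0;
    for (int e = x.exponent; e < 0; e++) v /= 2.0;
    return v;
}
```

With 0.5 _ten written as {1, 8, -1} and -0.4375 _ten as {-1, 14, -2}, fp4_add returns significand 8 and exponent -4, that is 1.000 _two X 2^-4, and fp4_value gives 0.0625 for it.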

Floating-Point Multiplication

Example: Binary Floating-Point Multiplication

Let’s try multiplying the numbers 0.5 _ten and -0.4375 _ten.

Solution:

In binary, the task is multiplying 1.000 _two X 2^-1 by -1.110 _two X 2^-2.

Step 1: Adding the exponents without bias:

-1+(-2)= -3

Or using the biased representation:

(-1 + 127) + (-2 + 127) - 127 = (-1 - 2) + (127 + 127 - 127) = -3 + 127 = 124

Step 2: Multiplying the significands:

1.000 _two X 1.110 _two = 1.110000 _two

The product is 1.110000 _two X 2^-3, but we need to keep it to 4 bits, so it is 1.110 _two X 2^-3.

Step 3: Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since 127 ≥ -3 ≥ -126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent fits.)

Step 4: Rounding the product makes no change:

1.110 _two X 2^-3

Step 5: Since the signs of the original operands differ, make the sign of the product negative. Hence the product is

-1.110 _two X 2^-3

Converting to decimal to check our results:

-1.110 _two X 2^-3 = -0.001110 _two = -0.00111 _two

= -7/2⁵ _ten = -7/32 _ten = -0.21875 _ten

The product of 0.5 _ten and -0.4375 _ten is indeed -0.21875 _ten.
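
The five steps can also be sketched in C with a toy 4-bit significand stored as a small integer. The type fp4 and the functions fp4_mul and fp4_value are hypothetical names for this example. The sketch assumes normalized, nonzero operands and truncates the extra product bits instead of rounding them properly.

```c
#include <assert.h>

/* Hypothetical toy format: value = sign * (significand / 8) * 2^exponent.
   A normalized significand 1.xxx _two is stored as an integer from 8 to 15. */
typedef struct {
    int sign;        /* +1 or -1 */
    int significand; /* 4-bit pattern b.bbb as an integer */
    int exponent;    /* unbiased power of two */
} fp4;

fp4 fp4_mul(fp4 a, fp4 b) {
    assert(a.significand >= 8 && a.significand <= 15);
    assert(b.significand >= 8 && b.significand <= 15);

    /* Step 1: add the unbiased exponents. */
    int exponent = a.exponent + b.exponent;

    /* Step 2: multiply the significands. Each has 3 fraction bits,
       so the product has 6 and represents product / 64. */
    int product = a.significand * b.significand;

    /* Step 3: normalize. The product of two values in [1, 2) lies in
       [1, 4), so at most one right shift is needed. */
    if (product >= 128) { product >>= 1; exponent++; }

    /* Step 4: keep 4 bits by dropping the low 3 fraction bits. */
    int significand = product >> 3;

    /* Step 5: the sign is negative exactly when the operand signs differ. */
    fp4 r = { a.sign * b.sign, significand, exponent };
    return r;
}

/* Convert a toy value to a double for checking. */
double fp4_value(fp4 x) {
    double v = x.sign * (x.significand / 8.0);
    for (int e = x.exponent; e > 0; e--) v *= 2.0;
    for (int e = x.exponent; e < 0; e++) v /= 2.0;
    return v;
}
```

Multiplying {1, 8, -1} by {-1, 14, -2} returns sign -1, significand 14 (1.110 _two) and exponent -3, whose biased form is -3 + 127 = 124, and fp4_value gives -0.21875 for it.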

Saturday, 20 October 2012
