Floating point is a representation for non-integral numbers,
- including very small and very large numbers.
It works like scientific notation:
- -2.34 × 10^56
- +0.002 × 10^-4
- +987.02 × 10^9
In binary:
- ±1.xxxxxxx_two × 2^yyyy
It corresponds to the types float and double in C.
Floating Point Representation
There are two representations:
- Single precision(32-bit)
- Double precision(64-bit)
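The two formats can be checked from C itself. The following is a minimal sketch, assuming a platform where float and double are the IEEE 754 single and double formats (the C standard does not require this); the function names are illustrative and not part of any library.

```c
#include <float.h>
#include <limits.h>

/* Storage width of each type in bits (32 and 64 on IEEE 754 platforms). */
int float_storage_bits(void)  { return (int)(sizeof(float) * CHAR_BIT); }
int double_storage_bits(void) { return (int)(sizeof(double) * CHAR_BIT); }

/* Significand precision in bits, counting the implicit leading 1.
   Assumes FLT_RADIX is 2, as it is on IEEE 754 binary platforms. */
int float_precision_bits(void)  { return FLT_MANT_DIG; }
int double_precision_bits(void) { return DBL_MANT_DIG; }
```

On such a platform the storage widths are 32 and 64 bits, and the precisions are 24 and 53 bits (23 and 52 stored fraction bits plus the implicit leading 1).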
Example 1
Show the IEEE 754 binary representation of the number -0.75_ten in single and double precision.
Solution:
The number -0.75_ten is also
-3/4_ten
or -3/2^2_ten
It is also represented by the binary fraction
-11_two / 2^2_ten
or -0.11_two
In scientific notation, the value is
-0.11_two × 2^0
And in normalized scientific notation, it is
-1.1_two × 2^-1
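The general representation for a single-precision number is
(-1)^s × (1 + fraction) × 2^(Exponent - 127)
The exponent field stores the actual exponent plus the bias of 127. For -1.1_two × 2^-1 the field is -1 + 127 = 126, so the result is
(-1)^1 × (1 + .1000 0000 0000 0000 0000 000_two) × 2^(126 - 127)
The single-precision bit pattern (sign | 8-bit exponent | 23-bit fraction) is therefore
1 | 0111 1110 | 1000 0000 0000 0000 0000 000
For double precision the bias is 1023, so the exponent field is -1 + 1023 = 1022:
(-1)^1 × (1 + .1000 ... 0000_two) × 2^(1022 - 1023)
The double-precision bit pattern (sign | 11-bit exponent | 52-bit fraction) is
1 | 011 1111 1110 | 1 followed by 51 zeros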
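The field values of Example 1 can be checked by reading the bits of -0.75 directly. This is a minimal sketch, assuming float and double are IEEE 754 single and double precision; the type single_fields and the function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a single-precision value (illustrative type, not a library one). */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 bits, without the implicit leading 1 */
} single_fields;

/* Copy the bits of a float into an integer.
   Assumes float is a 32-bit IEEE 754 single. */
uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Same for double; assumes a 64-bit IEEE 754 double. */
uint64_t double_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Split a float into its sign, exponent and fraction fields. */
single_fields split_single(float x) {
    uint32_t bits = float_bits(x);
    single_fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

Under those assumptions, split_single(-0.75f) gives sign 1, exponent 126 and fraction 0x400000 (only the top fraction bit set), and the full patterns are 0xBF400000 for single and 0xBFE8000000000000 for double precision.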
Converting Binary to Decimal Floating Point
Example 2
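Now let's try going the other direction: what decimal number is represented by the single-precision value with sign bit 1, exponent field 1000 0001_two, and fraction field 0100 0000 0000 0000 0000 000_two?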
Solution:
The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 × 2^-2 = 1/4, or 0.25. Using the basic equation,
(-1)^s × (1 + fraction) × 2^(Exponent - Bias) = (-1)^1 × (1 + 0.25) × 2^(129 - 127)
= -1 × 1.25 × 2^2
= -1.25 × 4
= -5.0
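The same decoding can be written as a short C function. This is a sketch under stated assumptions: it handles only normalized numbers (exponent fields 1 to 254) and ignores the reserved fields 0 and 255. The name decode_single is hypothetical, and the pattern 0xC0A00000 is the hexadecimal form of sign 1, exponent 129, fraction .01_two.

```c
#include <stdint.h>

/* Decode a normalized single-precision bit pattern with
   (-1)^s x (1 + fraction) x 2^(Exponent - 127).
   Sketch only: exponent fields 0 and 255 are not treated specially. */
double decode_single(uint32_t bits) {
    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu);
    uint32_t fraction_bits = bits & 0x7FFFFFu;

    /* The 23 fraction bits are an integer count of 2^-23 steps. */
    double value = 1.0 + (double)fraction_bits / 8388608.0; /* 2^23 */

    /* Scale by 2^(Exponent - 127) using exact doublings or halvings. */
    int e = exponent - 127;
    for (; e > 0; --e) value *= 2.0;
    for (; e < 0; ++e) value /= 2.0;

    return sign ? -value : value;
}
```

Calling decode_single(0xC0A00000u) returns -5.0, matching the hand calculation.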
Floating-Point Addition
Example: Binary Floating-Point Addition
Try adding the numbers 0.5_ten and -0.4375_ten in binary.
Solution:
Let's first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:
0.5_ten = 1/2_ten = 1/2^1_ten = 0.1_two = 0.1_two × 2^0 = 1.000_two × 2^-1
-0.4375_ten = -7/16_ten = -7/2^4_ten = -0.0111_two = -0.0111_two × 2^0 = -1.110_two × 2^-2
Now we follow the algorithm:
Step 1: The significand of the number with the lesser exponent (-1.110_two × 2^-2) is shifted right until its exponent matches that of the larger number:
-1.110_two × 2^-2 = -0.111_two × 2^-1
Step 2: Add the significands:
1.000_two × 2^-1 + (-0.111_two × 2^-1) = 0.001_two × 2^-1
Step 3: Normalize the sum, checking for overflow or underflow:
0.001_two × 2^-1 = 0.010_two × 2^-2 = 0.100_two × 2^-3 = 1.000_two × 2^-4
Since 127 ≥ -4 ≥ -126, there is no overflow or underflow. (The biased exponent would be -4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)
Step 4: Round the sum:
1.000_two × 2^-4
The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding.
The sum is then
1.000_two × 2^-4 = 0.0001000_two = 0.0001_two = 1/2^4_ten = 1/16_ten = 0.0625_ten
This sum is what we would expect from adding 0.5_ten to -0.4375_ten.
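The four addition steps can be sketched in C with a toy 4-bit format. Everything here is an assumption made for illustration: the type tiny_float stores the significand 1.xxx as an integer from 8 to 15 (value = sig / 8) together with an unbiased exponent. Bits shifted out during alignment are simply dropped, so rounding is a no-op; real hardware keeps extra guard bits.

```c
/* Toy format: value = (-1)^sign x (sig / 8) x 2^exp, sig in 8..15 when normalized. */
typedef struct {
    int sign;      /* 0 = positive, 1 = negative */
    unsigned sig;  /* 4-bit significand 1.xxx stored as an integer */
    int exp;       /* unbiased exponent */
} tiny_float;

tiny_float tiny_make(int sign, unsigned sig, int exp) {
    tiny_float x;
    x.sign = sign; x.sig = sig; x.exp = exp;
    return x;
}

/* Single-precision exponent range check used in Step 3. */
int tiny_exponent_ok(int exp) { return exp >= -126 && exp <= 127; }

tiny_float tiny_add(tiny_float a, tiny_float b) {
    /* Step 1: make a the operand with the larger exponent,
       then shift b's significand right to match it. */
    if (a.exp < b.exp) { tiny_float t = a; a = b; b = t; }
    int shift = a.exp - b.exp;
    unsigned b_sig = shift < 4 ? b.sig >> shift : 0; /* shifted-out bits are lost */

    /* Step 2: add the signed significands. */
    int sum = (a.sign ? -(int)a.sig : (int)a.sig)
            + (b.sign ? -(int)b_sig : (int)b_sig);

    tiny_float r = tiny_make(sum < 0, (unsigned)(sum < 0 ? -sum : sum), a.exp);
    if (r.sig == 0) return tiny_make(0, 0, 0); /* exact zero, a convention of this sketch */

    /* Step 3: normalize to 1.xxx, adjusting the exponent. */
    while (r.sig >= 16) { r.sig >>= 1; r.exp++; }
    while (r.sig < 8)   { r.sig <<= 1; r.exp--; }

    /* Step 4: rounding. Nothing to do: only 4 bits were kept throughout. */
    return r;
}

/* Convert back to a double to check the result. */
double tiny_value(tiny_float x) {
    double v = (double)x.sig / 8.0;
    int e = x.exp;
    for (; e > 0; --e) v *= 2.0;
    for (; e < 0; ++e) v /= 2.0;
    return x.sign ? -v : v;
}
```

With 0.5 as tiny_make(0, 8, -1) and -0.4375 as tiny_make(1, 14, -2), tiny_add returns significand 8 and exponent -4, that is 1.000_two × 2^-4, and tiny_value gives 0.0625.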
Floating-Point Multiplication
Example: Binary Floating-Point Multiplication
Let's try multiplying the numbers 0.5_ten and -0.4375_ten.
Solution:
In binary, the task is multiplying 1.000_two × 2^-1 by -1.110_two × 2^-2.
Step 1: Adding the exponents without bias:
-1 + (-2) = -3
Or, using the biased representation:
(-1 + 127) + (-2 + 127) - 127 = (-1 - 2) + (127 + 127 - 127) = -3 + 127 = 124
Step 2: Multiplying the significands:
1.000_two × 1.110_two = 1.110000_two
The product is 1.110000_two × 2^-3, but we need to keep it to 4 bits, so it is 1.110_two × 2^-3.
Step 3: Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since 127 ≥ -3 ≥ -126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent fits.)
Step 4: Rounding the product makes no change:
1.110_two × 2^-3
Step 5: Since the signs of the original operands differ, make the sign of the product negative. Hence the product is
-1.110_two × 2^-3
Converting to decimal to check our results:
-1.110_two × 2^-3 = -0.001110_two = -0.00111_two = -7/2^5_ten = -7/32_ten = -0.21875_ten
The product of 0.5_ten and -0.4375_ten is indeed -0.21875_ten.
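The five multiplication steps can be sketched the same way. As before, the toy type tiny_float and its function names are assumptions for illustration only: the significand 1.xxx is stored as an integer from 8 to 15 (value = sig / 8), and the extra product bits are truncated rather than rounded, which matches this example because the dropped bits are all zero.

```c
/* Toy format: value = (-1)^sign x (sig / 8) x 2^exp, sig in 8..15 when normalized. */
typedef struct {
    int sign;      /* 0 = positive, 1 = negative */
    unsigned sig;  /* 4-bit significand 1.xxx stored as an integer */
    int exp;       /* unbiased exponent */
} tiny_float;

tiny_float tiny_make(int sign, unsigned sig, int exp) {
    tiny_float x;
    x.sign = sign; x.sig = sig; x.exp = exp;
    return x;
}

/* Biased form of an exponent, as stored in single precision. */
int biased_exponent(int exp) { return exp + 127; }

/* Multiply two normalized tiny_float values. */
tiny_float tiny_mul(tiny_float a, tiny_float b) {
    tiny_float r;

    /* Step 1: add the unbiased exponents. */
    r.exp = a.exp + b.exp;

    /* Step 2: multiply the significands. Each has 3 fraction bits,
       so the product has 6 (value = product / 64). */
    unsigned product = a.sig * b.sig;

    /* Step 3: normalize. The product of two values in [1, 2) lies in [1, 4),
       so at most one right shift is needed. */
    if (product >= 128) { product >>= 1; r.exp++; }

    /* Step 4: keep 4 bits. This sketch truncates the low 3 bits. */
    r.sig = product >> 3;

    /* Step 5: the product is negative when the operand signs differ. */
    r.sign = a.sign ^ b.sign;
    return r;
}

/* Convert back to a double to check the result. */
double tiny_value(tiny_float x) {
    double v = (double)x.sig / 8.0;
    int e = x.exp;
    for (; e > 0; --e) v *= 2.0;
    for (; e < 0; ++e) v /= 2.0;
    return x.sign ? -v : v;
}
```

Multiplying tiny_make(0, 8, -1) by tiny_make(1, 14, -2) gives sign 1, significand 14 (1.110_two) and exponent -3, whose biased form is 124; tiny_value returns -0.21875.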