=================================================
Assignment #05 - Number Systems with Floating Point Numbers
=================================================
- Ian! D. Allen - idallen@idallen.ca - www.idallen.com

Sources for many of these answers (thank you!):
    Ian Allen
    Jim Khuu
    Terence Christie
    William Jarvis
Corrections by (thank you!):
    -

1.  Conversions from different bases and representations to decimal:
    Reference: http://en.wikipedia.org/wiki/Signed_number_representations

    a) Given the digits 010101(16), convert to decimal from base "16" = 
                
                16^5     16^4   16^3   16^2 16^1 16^0
                1048576  65536  4096   256  16   1
                  0      1      0      1    0    1
                ------------------------------------
                  0 +    65536 +0 +    256 +0 +  1
               
                =  65793 base 10

    b) Given the digits 010101(10), convert to decimal from base "10" =
                
                10^5   10^4   10^3 10^2 10^1 10^0
                100000 10000  1000 100  10   1
                0      1      0    1    0    1
                ------------------------------
                0     +10000 +0   +100  +0  +1

                = 10101 base 10   [ of course! ]

    c) Given the digits 010101(8), convert to decimal from base "8" =

                8^5   8^4   8^3 8^2 8^1 8^0
                32768 4096  512 64  8   1
                0     1     0   1   0   1
                ---------------------------
                0 +   4096 +0  +64 +0 + 1

                = 4161 base 10

    d) Given the digits 010101(2), convert to decimal from base "2" =
                
                2^5 2^4  2^3 2^2 2^1 2^0
                32  16   8   4   2   1  
                0   1    0   1   0   1
                ------------------------
                0 + 16 + 0 + 4 + 0 + 1

                = 21 base 10

    e) Given the digits 010101(-2), convert to decimal from base "-2" =

                (-2)^5 (-2)^4 (-2)^3 (-2)^2 (-2)^1 (-2)^0
                 -32     16     -8      4     -2      1
                   0      1      0      1      0      1
                -----------------------------------------
                   0  +  16  +   0  +   4  +   0  +   1

                = 21 base 10

    f) Given the digits 010101(2), convert to decimal from bias-127 =
       
       Convert binary to unsigned decimal, then subtract 127:
                
                2^5 2^4  2^3 2^2 2^1 2^0
                32  16   8   4   2   1  
                0   1    0   1   0   1
                ------------------------
                0 + 16 + 0 + 4 + 0 + 1

                = 21 - 127 = -106 base 10

    g) Given the digits 010101(2), convert to decimal from bias-63 =
                
       Convert binary to unsigned decimal, then subtract 63:

                2^5 2^4  2^3 2^2 2^1 2^0
                32  16   8   4   2   1  
                0   1    0   1   0   1
                ------------------------
                0 + 16 + 0 + 4 + 0 + 1

                = 21 - 63 = -42 base 10

    h) Given the digits 010101(2), convert to decimal from bias-31 =

       Convert binary to unsigned decimal, then subtract 31:

                2^5 2^4  2^3 2^2 2^1 2^0
                32  16   8   4   2   1  
                0   1    0   1   0   1
                ------------------------
                0 + 16 + 0 + 4 + 0 + 1

                = 21 - 31 = -10 base 10

    i) Given the digits 010101(2), convert to decimal from bias-16 =

       Convert binary to unsigned decimal, then subtract 16:

                2^5 2^4  2^3 2^2 2^1 2^0
                32  16   8   4   2   1  
                0   1    0   1   0   1
                ------------------------
                0 + 16 + 0 + 4 + 0 + 1

                = 21 - 16 = 5 base 10
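
    The conversions above can be checked with a short Python sketch.
    It is only an illustration, not part of the assignment; the
    function names digits_to_int and from_bias are made up here.
    Multiplying the running value by the base at each digit works for
    negative bases too.

```python
def digits_to_int(digits: str, base: int) -> int:
    """Evaluate a digit string positionally in the given base (may be negative)."""
    value = 0
    for ch in digits:
        # int(ch, 16) accepts 0-9 and A-F; digits are not range-checked here.
        value = value * base + int(ch, 16)
    return value


def from_bias(bits: str, bias: int) -> int:
    """Decode a biased (excess-N) bit string: unsigned value minus the bias."""
    return digits_to_int(bits, 2) - bias


for base in (16, 10, 8, 2, -2):
    print("base", base, "=", digits_to_int("010101", base))
for bias in (127, 63, 31, 16):
    print("bias", bias, "=", from_bias("010101", bias))
```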

2.  Perform the following additions and subtractions in binary, assuming
    a 6 bit word. Show the Result value plus the values of the Zero, Sign,
    Carry, and Overflow flag values for each (five answers for each).
    (The Zero flag is on iff the Result is zero.)

    The "Carry" flag indicates a "Borrow" when doing subtraction
    of a big number from a smaller number.

    CARRY:   1111                        
             011010                011010
           + 001111              - 001111
    ANS:     101001                001011
        
      Zero      = 0         Zero      = 0 
      Sign      = 1         Sign      = 0
      Carry     = 0         Carry     = 0 (no borrow needed)
      Overflow  = 1         Overflow  = 0

                     
    CARRY:  111111                     
             010111                010110
           + 101001              - 010110
    ANS:     000000                000000

      Zero      = 1         Zero      = 1 
      Sign      = 0         Sign      = 0
      Carry     = 1         Carry     = 0 (no borrow needed)
      Overflow  = 0         Overflow  = 0

    The overflow flag comes on when the answer is wrong for two's
    complement.  The simple rule to remember is that overflow only
    happens when pos+pos=neg or neg+neg=pos.  For subtraction, note
    that subtracting a positive is the same math as adding a negative,
    and subtracting a negative is the same math as adding a positive.

    In the above examples, we have:

        a) pos+pos=neg is wrong = overflow
        b) pos-pos=pos is fine  = no overflow
        c) pos+neg=pos is fine  = no overflow
        d) pos-pos=pos is fine  = no overflow

    Above, "pos-pos" is the same math as "pos+neg" (no overflow possible).
    "neg-neg" would be the same math as "neg+pos" (no overflow possible).

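    The flag rules above can be sketched in Python as follows.  This
    is an illustrative model only, assuming the Carry flag means
    "borrow" on subtraction (as stated above); the function name alu
    and its dictionary layout are invented for this sketch.

```python
def alu(a: int, b: int, op: str, bits: int = 6):
    """Add or subtract two unsigned bit patterns; return (result, flags)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    if op == "+":
        raw = a + b
        carry = raw > mask                  # carry out of the top bit
        # Overflow: operands have the same sign, result sign differs.
        overflow = bool(~(a ^ b) & (a ^ raw) & sign_bit)
    else:
        raw = a - b
        carry = a < b                       # Carry means "borrow" here
        # Overflow: operands differ in sign, result sign differs from a.
        overflow = bool((a ^ b) & (a ^ raw) & sign_bit)
    result = raw & mask
    flags = {"Zero": int(result == 0),
             "Sign": int(bool(result & sign_bit)),
             "Carry": int(carry),
             "Overflow": int(overflow)}
    return result, flags


for a, b, op in [(0b011010, 0b001111, "+"), (0b011010, 0b001111, "-"),
                 (0b010111, 0b101001, "+"), (0b010110, 0b010110, "-")]:
    result, flags = alu(a, b, op)
    print(f"{a:06b} {op} {b:06b} = {result:06b}", flags)
```

    Passing bits=12 lets the same sketch handle 12-bit hexadecimal
    operands.
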
3.  What is the minimum number of binary bits needed to represent the number
    of each day in the year (the Julian day number)?

    Need to represent 1 through 366.  2**8=256 and 2**9=512, so choose 9 bits.

4.  What is the minimum number of binary bits needed to represent the
    number of each day in the year, if the number of days can be positive
    or negative (e.g. "minus 300 days" or "today - 300")?

    Need to represent -366 through +366.  Using two's complement (and the
    formula from the last assignment), we choose 10 bits with range:
    -(2**9) to +((2**9)-1)  or  -512 to +511
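
    The bit counts for the day-number questions above can be
    confirmed with a small Python sketch (the helper names are
    hypothetical).  It just searches for the smallest word size whose
    range covers the values needed.

```python
def bits_unsigned(max_value: int) -> int:
    """Smallest n such that 2**n - 1 >= max_value."""
    return max(1, max_value.bit_length())


def bits_twos_complement(lo: int, hi: int) -> int:
    """Smallest n such that -(2**(n-1)) <= lo and hi <= 2**(n-1) - 1."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


print(bits_unsigned(366))                # days 1 through 366
print(bits_twos_complement(-366, 366))   # signed day offsets
```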

5.  Unix/Linux has traditionally used a 32-bit signed integer to
    store the number of seconds since midnight on January 1, 1970, UTC.
    Calculate roughly in what year/month/day this value overflows and
    time starts going negative.

    32 bits signed means the max positive seconds is +((2**31)-1) or
    2,147,483,647 seconds.

    A year is about 365.25 days or 31,557,600 seconds.

    2,147,483,647 / 31,557,600 = 68.049650385 years.

    year 1970 + 68 = year 2038

    The remainder 0.049650385 years is about 18.134803121 days.

    January 1 + 18 = January 19 (2038)

    A Wikipedia search confirms the actual date as 03:14:07 UTC on Tuesday,
    19 January 2038:   http://en.wikipedia.org/wiki/Year_2038_problem
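
    The same date falls out of Python's standard datetime module,
    which is one way to check the rough calculation above:

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_second = epoch + timedelta(seconds=2**31 - 1)   # largest 32-bit signed value
print(last_second.isoformat())                       # 2038-01-19T03:14:07+00:00
```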

6.  If possible, convert the following decimal values into 2's complement
    form, assuming a 12-bit word. Show your results in both binary
    and hexadecimal.

    a) -1
            -1 is negative, so treat as positive and bit flip later
            1(10) = 001h as 12-bits hex
                --> FFEh (flip the bits using bit-flip table)
                --> FFFh (add one for two's complement)
            FFFh = 1111 1111 1111(2)
            (or - just remember that -1 is always "all bits on")
    b) +693
            693(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
            693 / 256 = 2 rem 181
            181 /  16 = 11 rem 5 [ and we write 11(10) = B(16) ]
            5 /     1 = 5
            +693 = 2B5h = 0010 1011 0101(2)
    c) -693
            -693 is negative so treat as positive and bit flip later
            +693(10) = 2B5h (from above)
                   --> D4Ah (flip the bits using bit-flip table)
                   --> D4Bh (add one for two's complement)
            -693 = D4Bh = 1101 0100 1011(2)
    d) +2048
            2048(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
            2048 / 256 = 8
            0 / 16     = 0
            0 / 1      = 0
            = 800h which is negative in 12 bits!
            --> 2048 is too big to fit in 12 bits two's complement
            --> max (using formula) is +((2**11)-1) = +2047
            --> too big!
    e) -2048
            -2048 is negative so treat as positive and bit flip later
            +2048(10) = 800h (from above)
                    --> 7FFh (flip the bits using bit-flip table)
                    --> 800h (add one for two's complement)
            -2048 = 800h = 1000 0000 0000(2)
            (or - just remember that the most negative number has the
             sign bit on and nothing else in two's complement)
    f) +4097
       --> doesn't fit in 12 bits (use the formula to know this)
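
    A Python sketch of the same conversions, for checking answers.
    The function name twos_complement is made up for this example.
    It relies on the fact that masking a negative Python integer with
    (2**bits)-1 yields its two's complement bit pattern.

```python
def twos_complement(value: int, bits: int = 12):
    """Return (hex, binary) strings for value, or None if it does not fit."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        return None                       # outside the two's complement range
    pattern = value & ((1 << bits) - 1)   # bit pattern as an unsigned number
    width = (bits + 3) // 4               # hex digits needed
    return f"{pattern:0{width}X}h", f"{pattern:0{bits}b}"


for v in (-1, 693, -693, 2048, -2048, 4097):
    print(v, twos_complement(v))
```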

7.  Perform the indicated arithmetic in hexadecimal, assuming a 12-bit word.
    Show the hexadecimal result plus the states of the Zero, Sign, Carry
    and Overflow flags (five answers for each problem).

    The "Carry" flag indicates a "Borrow" when doing subtraction
    of a big number from a smaller number.

     CARRY: 111               101
             D8A      948      C8B      ACE
            +276     -35A     +839     -BDF
            -------------------------------
             000      5EE      4C4      EEF

       Zero: on       off      off      off
       Sign: off      off      off      on
      Carry: on       off      on       on
   Overflow: off      on*      on       off

    (*) Subtracting a positive is the same as adding a negative,
        and adding two negatives must give a negative, not a positive.
        Or, consider that subtracting a positive from a negative must
        generate a more negative number, not a positive number.

    The overflow flag comes on when the answer is wrong for two's
    complement.  The simple rule to remember is that overflow only
    happens when pos+pos=neg or neg+neg=pos.  For subtraction, note
    that subtracting a positive is the same math as adding a negative,
    and subtracting a negative is the same math as adding a positive.

    In the above examples, we have:

        a) neg+pos=pos is fine  = no overflow
        b) neg-pos=pos is wrong = overflow
        c) neg+neg=pos is wrong = overflow
        d) neg-neg=neg is fine  = no overflow

    Above, "neg-pos" is the same math as "neg+neg" and overflow is possible,
    and "neg-neg" is the same math as "neg+pos" (no overflow possible).
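
    Another way to test for overflow is to convert both hex operands
    to signed values, do the arithmetic exactly, and see whether the
    true answer fits in 12 bits.  The Python sketch below takes that
    approach; to_signed and overflows are names invented here.

```python
def to_signed(pattern: int, bits: int = 12) -> int:
    """Interpret an unsigned bit pattern as a two's complement value."""
    return pattern - (1 << bits) if pattern & (1 << (bits - 1)) else pattern


def overflows(a: int, b: int, op: str, bits: int = 12) -> bool:
    """True if the exact signed result falls outside the n-bit range."""
    sa, sb = to_signed(a, bits), to_signed(b, bits)
    exact = sa + sb if op == "+" else sa - sb
    return not -(1 << (bits - 1)) <= exact <= (1 << (bits - 1)) - 1


for a, b, op in [(0xD8A, 0x276, "+"), (0x948, 0x35A, "-"),
                 (0xC8B, 0x839, "+"), (0xACE, 0xBDF, "-")]:
    print(f"{a:03X} {op} {b:03X}: overflow =", overflows(a, b, op))
```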

8.  Express floating-point 123.456 as a normalized decimal number using
    scientific notation with four digits of precision.

    Normalized:  1.23456 x 10**2
    Four digits: 1.234   x 10**2  (or 1.235 with rounding)

9.  Add floating-point decimal 1234000.0 to 1.5 and express the result as
    a normalized decimal number using scientific notation with four
    digits of precision.

    1234000.0 + 1.5 = 1234001.5

    Normalized:  1.2340015 x 10**6
    Four digits: 1.234     x 10**6

10. Add *binary* floating-point 1111000.0 to 1.1 and express the result
    as a normalized binary number using (binary) scientific notation
    with four (binary) digits of precision.

    1111000.0 + 1.1 = 1111001.1

    Normalized:  1.1110011 x 2**6
    Four digits: 1.111     x 2**6
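
    The last three answers can be reproduced with a short Python
    sketch.  The helper names are hypothetical.  Note that Python's
    "e" format rounds rather than truncates, so 123.456 comes out as
    1.235.  For the binary case the two operands are scaled by 2 so
    they can be held as integers, and truncate_sig keeps only the
    leading bits.

```python
def sci(x: float, digits: int = 4) -> str:
    """Normalized decimal scientific notation with `digits` significant digits."""
    return f"{x:.{digits - 1}e}"


def truncate_sig(n: int, keep: int) -> int:
    """Zero all but the `keep` most significant bits of a positive integer."""
    drop = max(0, n.bit_length() - keep)
    return (n >> drop) << drop


print(sci(123.456))             # 1.235e+02
print(sci(1234000.0 + 1.5))     # 1.234e+06

a, b = 0b11110000, 0b11         # 1111000.0 and 1.1, both scaled by 2
total = a + b                   # 1111001.1 scaled by 2
print(f"{truncate_sig(total, 4):b}")   # same bits as a: the 1.1 is lost
```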

11. Looking at the two previous questions, is it possible in a computer
    to add a number to a floating-point number without having any effect,
    i.e. is it true that A+B=B for certain floating-point values of A and B?

    Yes.  Both previous questions show cases where A+B=A when A is big
    and B is small (swap the names and the same cases give A+B=B).
    Because the number of bits of precision is fixed in ordinary
    floating-point arithmetic inside a computer, there will be some
    arithmetic that will not have enough precision to represent the
    true answer.  Adding two values of greatly differing magnitudes
    (e.g. add 3 to 10**99) usually leaves the larger number unchanged,
    because there are not enough bits of precision to represent the
    small number being added to it.
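
    This is easy to see in Python, whose float type is normally an
    IEEE 754 double (53 bits of precision, not the 24 of single
    precision).  The sketch assumes Python 3.9 or later for math.ulp,
    which reports the gap between adjacent floating-point values.

```python
import math

big = 1e99
small = 3.0
print(big + small == big)    # True: the 3 is absorbed
print(math.ulp(big))         # spacing of doubles near 1e99, far larger than 3
```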

12. Encode the decimal value +274.5625 as a 32-bit IEEE-754 floating
    point field and show your final answer in hexadecimal.

    274.5625(10) = 100010010.1001(2)  [ see previous labs for how ]

    Normalized:  1.000100101001 x 2**8

    Mantissa part: .000100101001 (drop the leading 1.)
       - pad on the right with zeroes to fill up 23 bits:
         00010010100100000000000
    Exponent part: 8
       - excess-127 notation means add 127 before we convert to binary:
         8+127 = 135 = 128+7= 10000111(2)
    Sign: 0 (positive)

    In IEEE 754 single-precision (32-bit) format (1+8+23 bits):

     = 0 10000111 00010010100100000000000
     = 0100 0011 1000 1001 0100 1000 0000 0000
     =    4    3    8    9    4    8    0    0
     = 43894800h

13. Encode the decimal value -12.1875 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    1. Number is negative so the sign bit will be 1

    2. Convert 12.1875 to binary 1100.0011(2)

    3. Normalize the binary number 1.1000011 * 2**3

    4. The binary digits to the right of the binary point become the mantissa.
       Pad to the right with zeroes to fill up 23 bits:
       10000110000000000000000

    5. The exponent is 3.  Bias it with 127 and it becomes 3+127 = 130.
       Convert 130 to binary becomes 10000010 (128+2)

    6. Put it all together in 1+8+23=32 bits like this:
       = 1 10000010 10000110000000000000000
       = 1100 0001 0100 0011 0000 0000 0000 0000
       =    C    1    4    3    0    0    0    0
       = C1430000h
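
    Hand encodings like the two above can be checked with Python's
    standard struct module, which packs a value as a big-endian IEEE
    754 single-precision field when given the ">f" format.  The
    helper names float32_hex and fields are invented for this sketch.

```python
import struct


def float32_hex(x: float) -> str:
    """Big-endian IEEE 754 single-precision pattern of x as 8 hex digits."""
    return struct.pack(">f", x).hex().upper()


def fields(x: float):
    """Split the 32 bits into (sign, biased exponent, mantissa) strings."""
    bits = f"{int(float32_hex(x), 16):032b}"
    return bits[0], bits[1:9], bits[9:]


for value in (274.5625, -12.1875):
    print(value, float32_hex(value) + "h", fields(value))
```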

14. Encode the decimal value +0.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    +Zero is a special number with all-zero bits:  00000000h

15. Encode the decimal value -0.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    -Zero is a special number with all-zero bits except the sign:  80000000h

16. Encode the decimal value +1.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    1. 1.0(10) = 1.0(2)

    2. Normalized = 1.0 x 2**0  (exponent is zero)

    3. Exponent is 0.
       Bias it with 127 --> 0+127 = 127 = (128-1) = 01111111(2)

    4. The binary digits to the right of the binary point become the mantissa.
       Pad to the right with zeroes to fill up 23 bits:
       00000000000000000000000

    5. Sign is      0
       Exponent is  01111111
       Mantissa is  00000000000000000000000
       Result       0 01111111 00000000000000000000000

    6. Grouping     0011 1111 1000 0000 0000 0000 0000 0000
                  =    3    F    8    0    0    0    0    0
                  = 3F800000h

17. Encode the decimal value -1.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    As for +1 (3F800000h), except turn on the negative sign bit:

                  = BF800000h

18. Encode the decimal value +2.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    Compare with +1.0 (from above):

    +1.0 =  1.0(2) x 2**0
    +2.0 = 10.0(2)
         =  1.0(2) x 2**1

    So do as above for +1, except with an exponent one greater.
    Previous exponent plus one = 01111111+1 = 10000000(2)

    5. Sign is      0
       Exponent is  10000000
       Mantissa is  00000000000000000000000
       Result       0 10000000 00000000000000000000000

    6. Grouping     0100 0000 0000 0000 0000 0000 0000 0000
                  =    4    0    0    0    0    0    0    0
                  = 40000000h

19. Encode the decimal value -2.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    As for +2 (40000000h), except turn on the negative sign bit:

                  = C0000000h

20. Encode the decimal value +4.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    Compare with +2.0 (from above):

    +2.0 =  1.0(2) x 2**1
    +4.0 = 100.0(2)
         =  1.0(2) x 2**2

    So do as above for +2, except with an exponent one greater.
    Previous exponent plus one = 10000000+1 = 10000001(2)

    5. Sign is      0
       Exponent is  10000001
       Mantissa is  00000000000000000000000
       Result       0 10000001 00000000000000000000000

    6. Grouping     0100 0000 1000 0000 0000 0000 0000 0000
                  =    4    0    8    0    0    0    0    0
                  = 40800000h

21. Encode the decimal value -4.0 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    As for +4 (40800000h), except turn on the negative sign bit:

                  = C0800000h
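
    For exact powers of two such as 1, 2 and 4 the mantissa is all
    zeroes, so the whole pattern is just the sign bit and the biased
    exponent shifted into place.  A minimal Python sketch of that
    shortcut (the function name is hypothetical, and it only covers
    the normalized exponent range):

```python
def power_of_two_pattern(exponent: int, negative: bool = False) -> str:
    """Hex pattern of +/-2**exponent in IEEE 754 single precision."""
    if not -126 <= exponent <= 127:
        raise ValueError("exponent outside the normalized range")
    # Sign in bit 31, excess-127 exponent in bits 30..23, mantissa all zero.
    word = (int(negative) << 31) | ((exponent + 127) << 23)
    return f"{word:08X}h"


for e in (0, 1, 2):
    print(2**e, power_of_two_pattern(e), power_of_two_pattern(e, negative=True))
```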

22. Assuming the following eight-byte hex dump contains two Big-Endian,
    32-bit, IEEE-754 encoded values:     C2 2D C0 00 3F 60 00 00
    decode both values shown in this dump as separate decimal values. 

        The two numbers are C22DC000h and 3F600000h

        1. Write out C22DC000h in binary:

           C    2    2    D    C    0    0    0
        1100 0010 0010 1101 1100 0000 0000 0000

        2. Re-group as 1,8,23 bit pieces:
            1 10000100 01011011100000000000000
            sign is negative
            exponent is 10000100
            mantissa is 01011011100000000000000

        3. Add back the hidden 1. to the left of the mantissa:
                1.01011011100000000000000

        4. Convert the exponent to decimal.
           10000100(2) = 128+4 = 132.
           Un-bias the exponent by removing the excess 127:
               132-127 = 5
           Thus the original exponent factor was 2**5

        5. De-normalize the mantissa using the exponent:
                1.01011011100000000000000 x 2**5
                 = 101011.0111 x 2**0

        6. Convert the de-normalized binary fraction to decimal.
           101011.0111(2) = 32+8+2+1 + 0.250+0.125+0.0625 = 43.4375(10)

        Add the minus sign:  -43.4375(10)

        1. Write out 3F600000h in binary:

              3    F    6    0    0    0    0    0
           0011 1111 0110 0000 0000 0000 0000 0000

        2. Re-group as 1,8,23 bit pieces:
            0 01111110 11000000000000000000000
            sign is positive
            Exponent is 01111110
            Mantissa is 11000000000000000000000

        3. Add back the hidden 1. to the left of the mantissa:
                1.11000000000000000000000

        4. Convert the exponent to decimal.
           01111110(2) = 128-2 = 126.
           Un-bias the exponent by removing the excess 127:
               126-127 = -1
           Thus the original exponent factor was 2**(-1)

        5. De-normalize the mantissa using the exponent:
                1.11000000000000000000000 x 2**(-1)
                
                = 0.11100000000000000000000 x 2**0
                
        6. Convert the de-normalized binary fraction to decimal.
            0.111 = 0.5+0.25+0.125 = 0.875

        Number is positive:  +0.875
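
    The decoding steps above can be sketched in Python.  The decode
    function below is an illustration written for this answer (it
    handles normalized numbers only); the struct module from the
    standard library is used as an independent check.

```python
import struct


def decode(word: int) -> float:
    """Decode a normalized 32-bit IEEE 754 pattern step by step."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = ((word >> 23) & 0xFF) - 127      # remove the excess 127
    mantissa = 1 + (word & 0x7FFFFF) / 2**23    # add back the hidden 1.
    return sign * mantissa * 2.0**exponent


dump = bytes.fromhex("C2 2D C0 00 3F 60 00 00")
first, second = struct.unpack(">ff", dump)      # two big-endian floats
print(first, second)
print(decode(0xC22DC000), decode(0x3F600000))
```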

23. The IEEE 754 floating-point number 81234567h is negative.  Without
    converting, give the hexadecimal for the same number, only positive.

    Turn off the sign bit:  81234567h --> 01234567h

24. The IEEE 754 floating-point number 7EDCBA98h is positive.  Without
    converting, give the hexadecimal for the same number, only negative.

    Turn on the sign bit:  7EDCBA98h --> FEDCBA98h

25. Without converting, cross out or delete all the IEEE 754 negative numbers,
    leaving only the positive numbers:
    1837A654h 7A6A3B65h 87B5CDE2h 90A5B5EFh A0000037h D1B8765Ah F0000000h

    1837A654h 7A6A3B65h XXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXX
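
    The sign-bit answers above all come down to testing or flipping
    the top bit of the 32-bit pattern.  A Python sketch, with helper
    names invented for the example:

```python
SIGN = 0x80000000                      # bit 31 of a 32-bit word


def is_negative(word: int) -> bool:
    """True if the single-precision sign bit is on."""
    return bool(word & SIGN)


def negate(word: int) -> int:
    """Flip the sign bit, leaving exponent and mantissa alone."""
    return word ^ SIGN


patterns = [0x1837A654, 0x7A6A3B65, 0x87B5CDE2, 0x90A5B5EF,
            0xA0000037, 0xD1B8765A, 0xF0000000]
positives = [f"{p:08X}h" for p in patterns if not is_negative(p)]
print(positives)
print(f"{negate(0x81234567):08X}h", f"{negate(0x7EDCBA98):08X}h")
```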

26. How does the answer to the previous question change if you are told that
    all the bit patterns are really IEEE 754 double-precision numbers?
 
    If 87B5CDE2h is double-precision, it is a 64-bit number, i.e. it should
    be written as 0000000087B5CDE2h.  The sign bit is off.  All the given
    bit patterns are positive numbers if taken as double-precision 64-bit,
    because they all start with a zero sign bit (in 64 bits).

27. Which has more *Precision* available - a 32-bit integer or a 32-bit
    floating-point number?

    Integers have 32 bits of precision.  Single-precision floats only
    have about 24 bits of mantissa.  Integers have more precision.

28. Which has more *Range* available - a 32-bit integer or a 32-bit
    floating-point number?

    32-bit integers only range plus or minus 2**31 (approximately).
    32-bit floating point can range plus or minus 2**127 (approximately).
    Floating point has more range.

29. IEEE 754 single-precision floating-point can store numbers in the
    approximate range of -2**127 to +2**127. Look up or use a calculator
    to express this range (approximately) as powers of ten (decimal).

    "Approximately plus/minus 10**38"

30. True/False: Decimal 1234.0 x 10**37 fits in IEEE 754 single precision
    floating point.

    1234.0 x 10**37 = 1.234 x 10**40 which exceeds 10**38 - FALSE.

31. True/False: Decimal 0.00001 x 10**40 fits in IEEE 754 single
    precision floating point.

    0.00001 x 10**40 = 1.0 x 10**35 which fits in 10**38 - TRUE.
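
    The last three range answers can be checked numerically.  The
    Python sketch below uses the largest finite single-precision
    value, (2 - 2**-23) x 2**127, which is slightly larger than the
    rough 2**127 figure used above but still on the order of 10**38.
    FLT_MAX and fits_range are names chosen for this example.

```python
import math

FLT_MAX = (2 - 2**-23) * 2.0**127      # largest finite single-precision value


def fits_range(x: float) -> bool:
    """True if the magnitude of x is within single-precision range."""
    return abs(x) <= FLT_MAX


print(math.log10(2.0**127))            # a little over 38
print(fits_range(1234.0e37))           # False
print(fits_range(0.00001e40))          # True
```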

32. Cross-out or delete the values that fit in IEEE 754 single precision
    floating point with no loss of range or precision, leaving only the
    values that do *not* fit completely accurately:
    2**30-3    2**30-1    2**30    2**30+1    2**30+3    2**30+2**29

    All the numbers fit within the 2**127 exponent range of IEEE 754.
    Cross out anything that fits within 23 bits of precision:

    2**30-3    2**30-1    XXXXX    2**30+1    2**30+3    XXXXXXXXXXX

    For example, 2**30-1 is 111111111111111111111111111111(2) which
    is 1.11111111111111111111111111111 x 2**29 and needs 30 bits of
    precision (29 of them after the binary point).  Doesn't fit in a
    23-bit IEEE mantissa.

    For example, 2**30+2**29 is 1100000000000000000000000000000(2)
    = 1.1 x 2**30, which only needs two bits of precision.  Fits.

33. Without converting, cross-out or delete the sums that fit in IEEE 754
    single-precision floating-point with no loss of range or precision,
    leaving only the sums that do *not* fit accurately:
    2**29+2**10+2**9+2**0      2**26+2**0      2**29+2**28+2**27+2**26
    2**27+2**23+2**1           2**29+2**28+2**2+2**1

    All the numbers fit within the 2**127 exponent range of IEEE 754.
    Cross out anything that fits within 23 bits of precision:

    2**29+2**10+2**9+2**0      2**26+2**0      XXXXXXXXXXXXXXXXXXXXXXX
    2**27+2**23+2**1           2**29+2**28+2**2+2**1

    For example, 2**26+2**0 needs 27 bits of precision:
        = 100000000000000000000000001(2) = 1.00000000000000000000000001 x 2**26
    For example, 2**29+2**28+2**27+2**26 only needs four bits of precision:
        = 1.111 x 2**29
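
    For the two "does it fit exactly" questions above, one way to
    check is to count the bits from the leading 1 to the last 1 of
    each integer and compare with the 24 bits that single precision
    can hold (the hidden 1 plus the 23 stored mantissa bits).  The
    Python sketch below does that, and cross-checks by packing each
    value with the standard struct module.  The function names are
    made up for the example.

```python
import struct


def significant_bits(n: int) -> int:
    """Bits from the leading 1 through the last 1 of a positive integer."""
    trailing_zeros = (n & -n).bit_length() - 1
    return n.bit_length() - trailing_zeros


def fits_exactly(n: int) -> bool:
    """True if n needs at most 24 significant bits (hidden 1 + 23 stored)."""
    return significant_bits(n) <= 24


def round_trips(n: int) -> bool:
    """Cross-check: pack as single precision and compare with the original."""
    return struct.unpack(">f", struct.pack(">f", float(n)))[0] == n


values = [2**30 - 3, 2**30 - 1, 2**30, 2**30 + 1, 2**30 + 3, 2**30 + 2**29]
sums = [2**29 + 2**10 + 2**9 + 2**0, 2**26 + 2**0,
        2**29 + 2**28 + 2**27 + 2**26, 2**27 + 2**23 + 2**1,
        2**29 + 2**28 + 2**2 + 2**1]
for n in values + sums:
    print(n, significant_bits(n), fits_exactly(n), round_trips(n))
```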

34. Why do the decimal numbers 2147483775 (0x8000007F) and 2147483648
    (0x80000000) both convert to the same IEEE 754 single-precision
    floating-point number 0x4F000000 that has decimal value 2147483648.0?

    The number 2147483775 (0x8000007F) requires 32 bits of precision.
    The last (rightmost) 8 bits of precision are thrown away when
    converting to IEEE 754 single-precision, so the "7F" part of
    0x8000007F disappears and it looks just like 0x80000000, which converts
    back to 2147483648.0 decimal, not to 2147483775.  You lose precision
    when converting a 32-bit integer into a 23-bit mantissa.

    In other words:

    2147483775 in binary is 10000000000000000000000001111111
    2147483648 in binary is 10000000000000000000000000000000

    When you convert both of these numbers to IEEE 754 single-precision
    floating point, the mantissa only holds the 23 bits that follow the
    leading (hidden) 1 bit.
    What is stored in the mantissa for 2147483775 is
    (1.)00000000000000000000000 and the extra 01111111 gets discarded.
    What is stored in the mantissa for 2147483648 is
    (1.)00000000000000000000000 which is exactly the same as for 2147483775.
    Even though the numbers are different, an IEEE 754 single-precision
    floating-point number only has a 23-bit mantissa, so both of these
    numbers end up being the same when they are converted, because anything
    beyond those 23 bits is discarded.
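
    A Python sketch of the same observation.  split_for_float32 is a
    made-up helper that shows which bits of a 32-bit integer (top bit
    set) are kept and which are lost; float32_hex uses the standard
    struct module to do the real conversion.  Real hardware normally
    rounds to nearest rather than simply discarding bits, but here
    the lost bits 01111111 are below the halfway point, so both give
    the same result.

```python
import struct


def split_for_float32(n: int):
    """Return (24 kept bits, 8 lost bits) of a 32-bit integer with bit 31 set."""
    kept, lost = n >> 8, n & 0xFF
    return f"{kept:024b}", f"{lost:08b}"


def float32_hex(n: int) -> str:
    """Single-precision pattern of the integer n, as 0x-prefixed hex."""
    return "0x" + struct.pack(">f", float(n)).hex().upper()


for n in (2147483775, 2147483648):
    print(n, split_for_float32(n), float32_hex(n))
```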

35. Explain why, in a computer, floating point mathematics may not be
    associative or distributive, i.e. (A+B)+C may not equal A+(B+C).

    Floating point arithmetic can lose precision when small numbers
    are added to or subtracted from big numbers.  If you arrange your
    mathematics so that the small numbers are added to each other
    first, they stand a better chance of affecting the bigger number.
    Order matters.
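
    A small Python demonstration, assuming Python floats are IEEE 754
    doubles (53 bits of precision), so the "big" value here is 2**53
    rather than something sized for single precision:

```python
big = 2.0**53                  # above this, not every integer is a double

left = (big + 1.0) + 1.0       # each 1.0 is absorbed on its own
right = big + (1.0 + 1.0)      # the small values combine first

print(left == right)           # False
print(left - big, right - big)
```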

36. How close to zero can you get with IEEE 754 32-bit floating point?
    (What is the non-zero value that is closest to zero?)  Express the
    answer in both approximate power-of-two notation and in approximate
    power-of-ten notation.

    Approximately 2**(-126) (normalized IEEE) which is approximately 10**(-38).
    Denormalized IEEE 754 numbers can get closer to zero, down to 2**(-149)
    or approximately 10**(-45), at the expense of some precision.
    Remember 10**(-38) and you won't be far wrong.
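
    Both limits can be read straight from their bit patterns with
    Python's struct module.  The helper name from_hex is invented for
    this sketch; the patterns are the smallest normalized value
    (exponent field 1, mantissa 0) and the smallest denormalized
    value (exponent field 0, mantissa 1).

```python
import struct


def from_hex(pattern: str) -> float:
    """Decode 8 hex digits as a big-endian IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(pattern))[0]


smallest_normal = from_hex("00800000")
smallest_denormal = from_hex("00000001")
print(smallest_normal, smallest_denormal)
```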

-- 
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/