=================================================
Assignment #05 - Number Systems with Floating Point Numbers
=================================================
- Ian! D. Allen - idallen@idallen.ca - www.idallen.com

Sources for many of these answers (thank you!):
    Ian Allen
    Jim Khuu
    Terence Christie
    William Jarvis

Corrections by (thank you!):
    -

1. Conversions from different bases and representations to decimal:
   Reference: http://en.wikipedia.org/wiki/Signed_number_representations

   a) Given the digits 010101(16), convert to decimal from base "16" =

         16^5      16^4      16^3      16^2      16^1      16^0
      1048576     65536      4096       256        16         1
            0         1         0         1         0         1
      ---------------------------------------------------------
            0 +   65536 +       0 +     256 +       0 +       1 = 65793 base 10

   b) Given the digits 010101(10), convert to decimal from base "10" =

        10^5     10^4     10^3     10^2     10^1     10^0
      100000    10000     1000      100       10        1
           0        1        0        1        0        1
      ---------------------------------------------------
           0 +  10000 +      0 +    100 +      0 +      1 = 10101 base 10  [ of course! ]

   c) Given the digits 010101(8), convert to decimal from base "8" =

        8^5     8^4     8^3     8^2     8^1     8^0
      32768    4096     512      64       8       1
          0       1       0       1       0       1
      ---------------------------------------------
          0 +  4096 +     0 +    64 +     0 +     1 = 4161 base 10

   d) Given the digits 010101(2), convert to decimal from base "2" =

      2^5   2^4   2^3   2^2   2^1   2^0
       32    16     8     4     2     1
        0     1     0     1     0     1
      ---------------------------------
        0 +  16 +   0 +   4 +   0 +   1 = 21 base 10

   e) Given the digits 010101(-2), convert to decimal from base "-2" =

      (-2)^5   (-2)^4   (-2)^3   (-2)^2   (-2)^1   (-2)^0
         -32       16       -8        4       -2        1
           0        1        0        1        0        1
      ---------------------------------------------------
           0 +     16 +      0 +      4 +      0 +      1 = 21 base 10

   f) Given the digits 010101(2), convert to decimal from bias-127 =
      Convert binary to unsigned decimal, then subtract 127:

      2^5   2^4   2^3   2^2   2^1   2^0
       32    16     8     4     2     1
        0     1     0     1     0     1
      ---------------------------------
        0 +  16 +   0 +   4 +   0 +   1 = 21 - 127 = -106 base 10

   g) Given the digits 010101(2), convert to decimal from bias-63 =
      Convert binary to unsigned decimal, then subtract 63:

      2^5   2^4   2^3   2^2   2^1   2^0
       32    16     8     4     2     1
        0     1     0     1     0     1
      ---------------------------------
        0 +  16 +   0 +   4 +   0 +   1 = 21 - 63 = -42 base 10

   h) Given the digits 010101(2),
      convert to decimal from bias-31 =
      Convert binary to unsigned decimal, then subtract 31:

      2^5   2^4   2^3   2^2   2^1   2^0
       32    16     8     4     2     1
        0     1     0     1     0     1
      ---------------------------------
        0 +  16 +   0 +   4 +   0 +   1 = 21 - 31 = -10 base 10

   i) Given the digits 010101(2), convert to decimal from bias-16 =
      Convert binary to unsigned decimal, then subtract 16:

      2^5   2^4   2^3   2^2   2^1   2^0
       32    16     8     4     2     1
        0     1     0     1     0     1
      ---------------------------------
        0 +  16 +   0 +   4 +   0 +   1 = 21 - 16 = 5 base 10

2. Perform the following additions and subtractions in binary, assuming
   a 6 bit word.  Show the Result value plus the values of the Zero, Sign,
   Carry, and Overflow flag values for each (five answers for each).
   (The Zero flag is on iff the Result is zero.)  The "Carry" flag
   indicates a "Borrow" when doing subtraction of a big number from a
   smaller number.

   CARRY:   1111
            011010                011010
          + 001111              - 001111
   ANS:     101001                001011

            Zero     = 0          Zero     = 0
            Sign     = 1          Sign     = 0
            Carry    = 0          Carry    = 0 (no borrow needed)
            Overflow = 1          Overflow = 0

   CARRY:  111111
            010111                010110
          + 101001              - 010110
   ANS:     000000                000000

            Zero     = 1          Zero     = 1
            Sign     = 0          Sign     = 0
            Carry    = 1          Carry    = 0 (no borrow needed)
            Overflow = 0          Overflow = 0

   The overflow flag comes on when the answer is wrong for two's
   complement.  The simple rule to remember is that overflow only happens
   when pos+pos=neg or neg+neg=pos.  For subtraction, note that
   subtracting a positive is the same math as adding a negative, and
   subtracting a negative is the same math as adding a positive.
   In the above examples, we have:

      a) pos+pos=neg is wrong = overflow
      b) pos-pos=pos is fine  = no overflow
      c) pos+neg=pos is fine  = no overflow
      d) pos-pos=pos is fine  = no overflow

   Above, "pos-pos" is the same math as "pos+neg" (no overflow possible).
   "neg-neg" would be the same math as "neg+pos" (no overflow possible).

3. What is the minimum number of binary bits needed to represent the
   number of each day in the year (the Julian day number)?

   Need to represent 1 through 366.
   2**8=256 and 2**9=512, so choose 9 bits.

4.
   What is the minimum number of binary bits needed to represent the
   number of each day in the year, if the number of days can be positive
   or negative (e.g. "minus 300 days" or "today - 300")?

   Need to represent -366 through +366.
   Using two's complement (and the formula from the last assignment),
   we choose 10 bits with range:
      -(2**9) to +((2**9)-1)  or  -512 to +511

5. Unix/Linux has traditionally used a 32-bit signed integer to store the
   number of seconds since midnight on January 1, 1970, UTC.  Calculate
   roughly in what year/month/day this value overflows and time starts
   going negative.

   32 bits signed means the max positive seconds is +((2**31)-1)
   or 2,147,483,647 seconds.
   A year is about 365.25 days or 31,557,600 seconds.
      2,147,483,647 / 31,557,600 = 68.049650385 years.
      year 1970 + 68 = year 2038
   The remainder 0.049650385 years is about 18.134803121 days.
      January 1 + 18 = January 19 (2038)
   A Wikipedia search confirms the actual date as
   03:14:07 UTC on Tuesday, 19 January 2038:
      http://en.wikipedia.org/wiki/Year_2038_problem

6. If possible, convert the following decimal values into 2's complement
   form, assuming a 12-bit word.  Show your results in both binary and
   hexadecimal.

   a) -1
      -1 is negative, so treat as positive and bit flip later
      1(10) = 001h as 12-bits hex
      --> FFEh (flip the bits using bit-flip table)
      --> FFFh (add one for two's complement)
      FFFh = 1111 1111 1111(2)
      (or - just remember that -1 is always "all bits on")

   b) +693
      693(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
      693 / 256 = 2 rem 181
      181 / 16 = 11 rem 5    [ and we write 11(10) = B(16) ]
      5 / 1 = 5
      +693 = 2B5h = 0010 1011 0101(2)

   c) -693
      -693 is negative so treat as positive and bit flip later
      +693(10) = 2B5h (from above)
      --> D4Ah (flip the bits using bit-flip table)
      --> D4Bh (add one for two's complement)
      -693 = D4Bh = 1101 0100 1011(2)

   d) +2048
      2048(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
      2048 / 256 = 8
      0 / 16 = 0
      0 / 1 = 0
      = 800h which is negative in 12 bits!
      --> 2048 is too big to fit in 12 bits two's complement
      --> max (using formula) is +((2**11)-1) = +2047 --> too big!

   e) -2048
      -2048 is negative so treat as positive and bit flip later
      +2048(10) = 800h (from above)
      --> 7FFh (flip the bits using bit-flip table)
      --> 800h (add one for two's complement)
      -2048 = 800h = 1000 0000 0000(2)
      (or - just remember that the most negative number has the
       sign bit on and nothing else in two's complement)

   f) +4097
      --> doesn't fit in 12 bits (use the formula to know this)

7. Perform the indicated arithmetic in hexadecimal, assuming a 12-bit
   word.  Show the hexadecimal result plus the states of the Zero, Sign,
   Carry and Overflow flags (five answers for each problem).
   The "Carry" flag indicates a "Borrow" when doing subtraction of a big
   number from a smaller number.

   CARRY:    111                 101
              D8A       948       C8B       ACE
             +276      -35A      +839      -BDF
             ------------------------------------
              000       5EE       4C4       EEF

   Zero:      on        off       off       off
   Sign:      off       off       off       on
   Carry:     on        off       on        on
   Overflow:  off       on*       on        off

   (*) Subtracting a positive is the same as adding a negative, and
       adding two negatives must give a negative, not a positive.
       Or, consider that subtracting a positive from a negative must
       generate a more negative number, not a positive number.

   The overflow flag comes on when the answer is wrong for two's
   complement.  The simple rule to remember is that overflow only happens
   when pos+pos=neg or neg+neg=pos.  For subtraction, note that
   subtracting a positive is the same math as adding a negative, and
   subtracting a negative is the same math as adding a positive.
   In the above examples, we have:

      a) neg+pos=pos is fine  = no overflow
      b) neg-pos=pos is wrong = overflow
      c) neg+neg=pos is wrong = overflow
      d) neg-neg=neg is fine  = no overflow

   Above, "neg-pos" is the same math as "neg+neg" and overflow is
   possible, and "neg-neg" is the same math as "neg+pos" (no overflow
   possible).

8. Express floating-point 123.456 as a normalized decimal number using
   scientific notation with four digits of precision.
   Normalized:  1.23456 x 10**2
   Four digits: 1.234 x 10**2   (or 1.235 with rounding)

9. Add floating-point decimal 1234000.0 to 1.5 and express the result as
   a normalized decimal number using scientific notation with four digits
   of precision.

   1234000.0 + 1.5 = 1234001.5
   Normalized:  1.2340015 x 10**6
   Four digits: 1.234 x 10**6

10. Add *binary* floating-point 1111000.0 to 1.1 and express the result
    as a normalized binary number using (binary) scientific notation with
    four (binary) digits of precision.

    1111000.0 + 1.1 = 1111001.1
    Normalized:  1.1110011 x 2**6
    Four digits: 1.111 x 2**6

11. Looking at the two previous questions, is it possible in a computer
    to add a number to a floating-point number without having any effect,
    i.e. is it true that A+B=B for certain floating-point values of A
    and B?

    Yes.  Both previous questions show cases where A+B=A when A is big
    and B is small.

    Because the number of bits of precision is fixed in ordinary
    floating-point arithmetic inside a computer, there will be some
    arithmetic that will not have enough precision to represent the true
    answer.  Adding two values of greatly differing magnitudes (e.g. add
    3 to 10**99) usually leaves the larger number unchanged, because
    there are not enough bits of precision to represent the small number
    being added to it.

12. Encode the decimal value +274.5625 as a 32-bit IEEE-754 floating
    point field and show your final answer in hexadecimal.

    274.5625(10) = 100010010.1001(2)   [ see previous labs for how ]
    Normalized: 1.000100101001 x 2**8
    Mantissa part: .000100101001  (drop the leading 1.)
      - pad on the right with zeroes to fill up 23 bits:
        00010010100100000000000
    Exponent part: 8
      - excess-127 notation means add 127 before we convert to binary:
        8+127 = 135 = 128+7 = 10000111(2)
    Sign: 0 (positive)

    In IEEE 754 single-precision (32-bit) format (1+8+23 bits):
      = 0 10000111 00010010100100000000000
      = 0100 0011 1000 1001 0100 1000 0000 0000
      =    4    3    8    9    4    8    0    0
      = 43894800h

13.
    Encode the decimal value -12.1875 as 32-bit IEEE-754 floating point
    field and show your answer in hexadecimal.

    1. Number is negative so the sign bit will be 1
    2. Convert 12.1875 to binary 1100.0011(2)
    3. Normalize the binary number 1.1000011 * 2**3
    4. The binary digits to the right of the binary point become the
       mantissa.  Pad to the right with zeroes to fill up 23 bits:
         10000110000000000000000
    5. The exponent is 3.  Bias it with 127 and it becomes 3+127 = 130.
       Convert 130 to binary becomes 10000010 (128+2)
    6. Put it all together in 1+8+23=32 bits like this:
         = 1 10000010 10000110000000000000000
         = 1100 0001 0100 0011 0000 0000 0000 0000
         =    C    1    4    3    0    0    0    0
         = C1430000h

14. Encode the decimal value +0.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    +Zero is a special number with all-zero bits: 00000000h

15. Encode the decimal value -0.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    -Zero is a special number with all-zero bits except the sign: 80000000h

16. Encode the decimal value +1.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    1. 1.0(10) = 1.0(2)
    2. Normalized = 1.0 x 2**0 (exponent is zero)
    3. Exponent is 0.  Bias it with 127 --> 0+127 = 127
       = (128-1) = 01111111(2)
    4. The binary digits to the right of the binary point become the
       mantissa.  Pad to the right with zeroes to fill up 23 bits:
         00000000000000000000000
    5. Sign is     0
       Exponent is 01111111
       Mantissa is 00000000000000000000000
       Result      0 01111111 00000000000000000000000
    6. Grouping  0011 1111 1000 0000 0000 0000 0000 0000
                 =  3    F    8    0    0    0    0    0
                 = 3F800000h

17. Encode the decimal value -1.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    As for +1 (3F800000h), except turn on the negative sign bit:
      = BF800000h

18. Encode the decimal value +2.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.
    Compare with +1.0 (from above):
      +1.0 =  1.0(2) x 2**0
      +2.0 = 10.0(2) = 1.0(2) x 2**1
    So do as above for +1, except with an exponent one greater.
    Previous exponent plus one = 01111111+1 = 10000000(2)
    5. Sign is     0
       Exponent is 10000000
       Mantissa is 00000000000000000000000
       Result      0 10000000 00000000000000000000000
    6. Grouping  0100 0000 0000 0000 0000 0000 0000 0000
                 =  4    0    0    0    0    0    0    0
                 = 40000000h

19. Encode the decimal value -2.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    As for +2 (40000000h), except turn on the negative sign bit:
      = C0000000h

20. Encode the decimal value +4.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    Compare with +2.0 (from above):
      +2.0 =   1.0(2) x 2**1
      +4.0 = 100.0(2) = 1.0(2) x 2**2
    So do as above for +2, except with an exponent one greater.
    Previous exponent plus one = 10000000+1 = 10000001(2)
    5. Sign is     0
       Exponent is 10000001
       Mantissa is 00000000000000000000000
       Result      0 10000001 00000000000000000000000
    6. Grouping  0100 0000 1000 0000 0000 0000 0000 0000
                 =  4    0    8    0    0    0    0    0
                 = 40800000h

21. Encode the decimal value -4.0 as 32-bit IEEE-754 floating point field
    and show your answer in hexadecimal.

    As for +4 (40800000h), except turn on the negative sign bit:
      = C0800000h

22. Assuming the following eight-byte hex dump contains two Big-Endian,
    32-bit, IEEE-754 encoded values:

       C2 2D C0 00 3F 60 00 00

    decode both values shown in this dump as separate decimal values.

    The two numbers are C22DC000h and 3F600000h

    1. Write out C22DC000h in binary:
          C    2    2    D    C    0    0    0
       1100 0010 0010 1101 1100 0000 0000 0000
    2. Re-group as 1,8,23 bit pieces:
       1 10000100 01011011100000000000000
       sign is     negative
       exponent is 10000100
       mantissa is 01011011100000000000000
    3. Add back the hidden 1. to the left of the mantissa:
       1.01011011100000000000000
    4. Convert the exponent to decimal.  10000100(2) = 128+4 = 132.
       Un-bias the exponent by removing the excess 127: 132-127 = 5
       Thus the original exponent factor was 2**5
    5.
       De-normalize the mantissa using the exponent:
       1.01011011100000000000000 x 2**5 = 101011.0111 x 2**0
    6. Convert the de-normalized binary fraction to decimal.
       101011.0111(2) = 32+8+2+1 + 0.250+0.125+0.0625 = 43.4375(10)
       Add the minus sign: -43.4375(10)

    1. Write out 3F600000h in binary:
          3    F    6    0    0    0    0    0
       0011 1111 0110 0000 0000 0000 0000 0000
    2. Re-group as 1,8,23 bit pieces:
       0 01111110 11000000000000000000000
       sign is     positive
       exponent is 01111110
       mantissa is 11000000000000000000000
    3. Add back the hidden 1. to the left of the mantissa:
       1.11000000000000000000000
    4. Convert the exponent to decimal.  01111110(2) = 128-2 = 126.
       Un-bias the exponent by removing the excess 127: 126-127 = -1
       Thus the original exponent factor was 2**(-1)
    5. De-normalize the mantissa using the exponent:
       1.11000000000000000000000 x 2**(-1)
         = 0.11100000000000000000000 x 2**0
    6. Convert the de-normalized binary fraction to decimal.
       0.111 = 0.5+0.25+0.125 = 0.875
       Number is positive: +0.875

23. The IEEE 754 floating-point number 81234567h is negative.  Without
    converting, give the hexadecimal for the same number, only positive.

    Turn off the sign bit: 81234567h --> 01234567h

24. The IEEE 754 floating-point number 7EDCBA98h is positive.  Without
    converting, give the hexadecimal for the same number, only negative.

    Turn on the sign bit: 7EDCBA98h --> FEDCBA98h

25. Without converting, cross out or delete all the IEEE 754 negative
    numbers, leaving only the positive numbers:

    1837A654h 7A6A3B65h 87B5CDE2h 90A5B5EFh A0000037h D1B8765Ah F0000000h

    1837A654h 7A6A3B65h XXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXX

26. How does the answer to the previous question change if you are told
    that all the bit patterns are really IEEE 754 double-precision
    numbers?

    If 87B5CDE2h is double-precision, it is a 64-bit number, i.e. it
    should be written as 0000000087B5CDE2h.  The sign bit is off.
    All the given bit patterns are positive numbers if taken as
    double-precision 64-bit, because they all start with a zero sign bit
    (in 64 bits).

27. Which has more *Precision* available - a 32-bit integer or a 32-bit
    floating-point number?

    Integers have 32 bits of precision.
    Single-precision floats only have about 24 bits of mantissa.
    Integers have more precision.

28. Which has more *Range* available - a 32-bit integer or a 32-bit
    floating-point number?

    32-bit integers only range plus or minus 2**31 (approximately).
    32-bit floating point can range plus or minus 2**126 (approximately).
    Floating point has more range.

29. IEEE 754 single-precision floating-point can store numbers in the
    approximate range of -2**127 to +2**+127.  Look up or use a
    calculator to express this range (approximately) as powers of ten
    (decimal).

    "Approximately plus/minus 10**38"

30. True/False: Decimal 1234.0 x 10**37 fits in IEEE 754 single precision
    floating point.

    1234.0 x 10**37 = 1.234 x 10**40 which exceeds 10**38 - FALSE.

31. True/False: Decimal 0.00001 x 10**40 fits in IEEE 754 single
    precision floating point.

    0.00001 x 10**40 = 1.0 x 10**35 which fits in 10**38 - TRUE.

32. Cross-out or delete the values that fit in IEEE 754 single precision
    floating point with no loss of range or precision, leaving only the
    values that do *not* fit completely accurately:

       2**30-3  2**30-1  2**30  2**30+1  2**30+3  2**30+2**29

    All the numbers fit within the 2**126 exponent range of IEEE 754.
    Cross out anything that fits within 23 bits of precision:

       2**30-3  2**30-1  XXXXX  2**30+1  2**30+3  XXXXXXXXXXX

    For example, 2**30-1 is 111111111111111111111111111111(2) which is
    1.11111111111111111111111111111 x 2**29 and needs 30 bits of
    precision (29 of them after the leading 1.).  Doesn't fit in a
    23-bit IEEE mantissa.

    For example, 2**30+2**29 is 1100000000000000000000000000000(2)
    = 1.1 x 2**30 which only needs two bits of precision
    (1.1 x 2**30).  Fits.

33.
    Without converting, cross-out or delete the sums that fit in IEEE 754
    single-precision floating-point with no loss of range or precision,
    leaving only the sums that do *not* fit accurately:

       2**29+2**10+2**9+2**0
       2**26+2**0
       2**29+2**28+2**27+2**26
       2**27+2**23+2**1
       2**29+2**28+2**2+2**1

    All the numbers fit within the 2**126 exponent range of IEEE 754.
    Cross out anything that fits within 23 bits of precision:

       2**29+2**10+2**9+2**0
       2**26+2**0
       XXXXXXXXXXXXXXXXXXXXXXX
       2**27+2**23+2**1
       2**29+2**28+2**2+2**1

    For example, 2**26+2**0 needs 27 bits of precision:
       = 100000000000000000000000001(2)
       = 1.00000000000000000000000001 x 2**26

    For example, 2**29+2**28+2**27+2**26 only needs four bits of
    precision:
       = 1.111 x 2**29

34. Why do the decimal numbers 2147483775 (0x8000007F) and 2147483648
    (0x80000000) both convert to the same IEEE 754 single-precision
    floating-point number 0x4F000000 that has decimal value 2147483648.0?

    The number 2147483775 (0x8000007F) requires 32 bits of precision.
    The last (rightmost) 8 bits of precision are thrown away when
    converting to IEEE 754 single-precision, so the "7F" part of
    0x8000007F disappears and it looks just like 0x80000000, which
    converts back to 2147483648.0 decimal, not to 2147483775.  You lose
    precision when converting a 32-bit integer into a 23-bit mantissa.

    In other words:

       2147483775 in binary is 10000000000000000000000001111111
       2147483648 in binary is 10000000000000000000000000000000

    When you convert both of these numbers to IEEE 754 single-precision
    floating point, the mantissa only holds 23 of those bits.
    What is stored in the mantissa for 2147483775 is
    (1.)00000000000000000000000 and the extra 01111111 gets discarded.
    What is stored in the mantissa for 2147483648 is
    (1.)00000000000000000000000 which is exactly the same as for
    2147483775.
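
    The collapse of both integers onto one bit pattern can be checked
    with a minimal Python sketch.  It assumes the standard struct
    module's "f" format (IEEE 754 single precision, rounded to nearest);
    the helper name float32_bits is made up for this illustration.

```python
import struct


def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value as an int."""
    # '>f' packs the value as a big-endian 32-bit float (round to nearest);
    # '>I' reads the same four bytes back as an unsigned 32-bit integer.
    packed = struct.pack(">f", float(value))
    return struct.unpack(">I", packed)[0]


a = float32_bits(2147483775)  # integer 0x8000007F
b = float32_bits(2147483648)  # integer 0x80000000

# Both lines print the same hexadecimal pattern.
print(f"2147483775 -> {a:08X}h")
print(f"2147483648 -> {b:08X}h")
```

    Both values print as 4F000000h, matching the answer above.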
    Even though the numbers are different, an IEEE 754 single-precision
    floating-point number only has a precision of 23 bits, so both of
    these numbers end up being the same when they are converted, because
    anything beyond 23 bits is discarded.

35. Explain why, in a computer, floating point mathematics may not be
    associative or distributive, i.e. (A+B)+C may not equal A+(B+C).

    Floating point arithmetic can lose precision when small numbers are
    added to or subtracted from big numbers.  If you arrange your
    mathematics so that the small numbers are added to each other first,
    they stand a better chance of affecting the bigger number.
    Order matters.

36. How close to zero can you get with IEEE 754 32-bit floating point?
    (What is the non-zero value that is closest to zero?)  Express the
    answer in both approximate power-of-two notation and in approximate
    power-of-ten notation.

    Approximately 2**(-126) (normalized IEEE) which is approximately
    10**(-38).  Denormalized IEEE 754 numbers can get closer to zero at
    the expense of some precision.  Remember 10**(-38) and you won't be
    far wrong.

--
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
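
The hand-worked IEEE 754 answers to questions 12 through 22 can be
cross-checked by machine.  The following is only a sketch, assuming
Python's standard struct module; the function names encode and decode
are hypothetical helpers, not part of the assignment.

```python
import struct


def encode(value):
    """Hex digits of the big-endian IEEE 754 single-precision form of value."""
    return struct.pack(">f", value).hex().upper()


def decode(hex_digits):
    """Decimal value of an 8-hex-digit big-endian single-precision pattern."""
    return struct.unpack(">f", bytes.fromhex(hex_digits))[0]


# Questions 12-21: decimal value to hexadecimal bit pattern.
for value in (274.5625, -12.1875, 0.0, -0.0, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0):
    print(f"{value:>10} -> {encode(value)}h")

# Question 22: hexadecimal bit pattern back to decimal value.
for pattern in ("C22DC000", "3F600000"):
    print(f"{pattern}h -> {decode(pattern)}")
```

Every value used here is exactly representable in single precision, so
the printed patterns can be compared digit for digit with the answers
above.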