


Question

Systems Architecture

Floating point is concerned with computer arithmetic that represents numbers in which the binary point is not fixed, as it is for integers. The IEEE 754 standard describes this representation of floating-point numbers.

Come up with two different numbers in normalized scientific notation and convert them to the binary version.

Explain the steps to add those two numbers together while considering the overflow, underflow and normalized sum in the form of a flowchart.

Please make copy paste available

Explanation / Answer

Machine epsilon (eps = 2^-52 for IEEE 754 double precision) is easy to represent on its own: its fraction field is all zeros, since eps is simply 1.000...0 * 2^-52.

However, for

1 + eps,

things are more complicated.

We no longer have a fraction of all zeros as in 1.000...0 * 2^-52;

instead the sum is

1.000...01 * 2^0,

with a single 1 in the last (52nd) bit of the fraction, which is why all 52 fraction bits are needed to distinguish 1 + eps from 1.
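As a quick sanity check, here is a minimal Python sketch (Python floats are IEEE 754 doubles, so this is only illustrating the point above, not defining it):

    eps = 2.0 ** -52              # machine epsilon for a 53-bit significand
    print(1.0 + eps == 1.0)       # False: 1 + eps is the next double above 1
    print(1.0 + eps / 2 == 1.0)   # True: eps/2 falls off the end of the 52 fraction bits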

One possibility for handling numbers with fractional parts is to add bits after the decimal point: The first bit after the decimal point is the halves place, the next bit the quarters place, the next bit the eighths place, and so on.

Suppose that we want to represent 1.625(10). We would want 1 in the ones place, leaving us with 0.625. Then we want 1 in the halves place, leaving us with 0.625 - 0.5 = 0.125. No quarters will fit, so put a 0 there. We want a 1 in the eighths place, and we subtract 0.125 from 0.125 to get 0.

So the binary representation of 1.625 would be 1.101(2).
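That place-by-place subtraction can be mechanized. Here is a small Python sketch (frac_to_binary is just an illustrative name) that produces the same digits by repeatedly doubling the fractional part, which is equivalent to testing the halves, quarters, eighths, and so on in turn:

    def frac_to_binary(x, places=8):
        # split off the integer part, then generate fractional bits one at a time
        whole = int(x)
        frac = x - whole
        bits = []
        for _ in range(places):
            frac *= 2                    # shift the next binary digit up to the ones place
            bits.append(str(int(frac)))
            frac -= int(frac)
        return f"{whole:b}." + "".join(bits)

    print(frac_to_binary(1.625))         # 1.10100000, i.e. 1.101 in binary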

The idea of fixed-point representation is to split the bits of the representation between the places to the left of the decimal point and places to the right of the decimal point. For example, a 32-bit fixed-point representation might allocate 24 bits for the integer part and 8 bits for the fractional part.

To represent 1.625, we would use the first 24 bits to indicate 1, and we'd use the remaining 8 bits to represent 0.625. Thus, our 32-bit representation would be:

00000000 00000000 00000001 10100000.
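In code, this 24.8 fixed-point scheme amounts to storing round(value * 256) in an ordinary 32-bit integer. A rough Python sketch (to_fixed and from_fixed are illustrative names, not a standard API):

    def to_fixed(value):
        return round(value * 256)        # value as an integer multiple of 1/256

    def from_fixed(raw):
        return raw / 256

    raw = to_fixed(1.625)                # 416
    print(f"{raw:032b}")                 # 00000000000000000000000110100000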

Fixed-point representation works reasonably well as long as you work with numbers within the supported range. The 32-bit fixed-point representation described above can represent any multiple of 1/256 from 0 up to 2^24 ≈ 16.7 million. But programs frequently need to work with numbers from a much broader range. For this reason, fixed-point representation isn't used very often in today's computing world.

A notable exception is financial software. Here, all computations must be represented exactly to the penny, and in fact further precision is rarely desired, since results are always rounded to the penny. Moreover, most applications have no requirement for large amounts (like trillions of dollars), so the limited range of fixed-point numbers isn't an issue. Thus, programs typically use a variant of fixed-point representation that represents each amount as an integer multiple of 1/100, just as the fixed-point representation described above represents each number as a multiple of 1/256.
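A tiny illustration of the cents-based variant, assuming amounts are simply kept as integer cents (the dollar values are made up for the example):

    subtotal = 1999                              # $19.99 stored as 1999 cents
    shipping = 595                               # $5.95
    total = subtotal + shipping                  # 2594 cents, exact to the penny
    print(f"${total // 100}.{total % 100:02d}")  # $25.94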

Normalized floating-point

Floating-point representation is an alternative technique based on scientific notation.

Floating-point basics

Though we'd like to use scientific notation, we'll base our scientific notation on powers of 2, not powers of 10, because we're working with computers that prefer binary. For example, 5.5(10) is 101.1(2) in binary, and it converts to binary scientific notation as 1.011(2) × 2^2. In converting to binary scientific notation here, we moved the binary point to the left two places, just as we would in converting 101.1(10) to scientific notation: it would be 1.011(10) × 10^2.

Once we have a number in binary scientific notation, we still must have a technique for mapping it into a set of bits. First, let us define the two parts of scientific representation: in 1.011(2) × 2^2, we call 1.011(2) the mantissa (or the significand), and we call 2 the exponent. In this section we'll use 8 bits to store such a number.

We use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents), and the last three bits for the mantissa's fractional part. Note that we omit the integer part of the mantissa: Since the mantissa must have exactly one nonzero bit to the left of its decimal point, and the only nonzero bit is 1, we know that the bit to the left of the decimal point must be a 1. There's no point in wasting space in inserting this 1 into our bit pattern, so we include only the bits of the mantissa to the right of the decimal point.

For our example of 5.5(10) = 1.011(2) × 22, we add 7 to 2 to arrive at 9(10) = 1001(2) for the exponent bits. Into the mantissa bits we place the bits following the decimal point of the scientific notation, 011. This gives us 0 1001 011 as the 8-bit representation of 5.5(10).
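The whole encoding procedure (sign bit, exponent biased by 7, three fraction bits) can be sketched in a few lines of Python. encode8 is a made-up name for this hypothetical 8-bit format, and the sketch assumes a nonzero input that fits in the format; its rounding of ties matches the policy discussed later in this answer.

    def encode8(x):
        # 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits after the point
        sign = 0
        if x < 0:
            sign, x = 1, -x
        exp = 0
        while x >= 2:                    # normalize so that 1 <= x < 2
            x /= 2
            exp += 1
        while x < 1:
            x *= 2
            exp -= 1
        frac = round((x - 1) * 8)        # round to the nearest eighth (ties to even)
        if frac == 8:                    # rounding carried into the ones place
            frac = 0
            exp += 1
        return f"{sign} {exp + 7:04b} {frac:03b}"

    print(encode8(5.5))                  # 0 1001 011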

We call this floating-point representation because the values of the mantissa bits “float” along with the decimal point, based on the exponent's given value. This is in contrast to fixed-point representation, where the decimal point is always in the same place among the bits given.

Suppose we want to represent -96(10).

First we convert our desired number to binary: -1100000(2).

Then we convert this to binary scientific notation: -1.100000(2) × 2^6.

Then we fit this into the bits.

We choose 1 for the sign bit since the number is negative.

We add 7 to the exponent and place the result into the four exponent bits. For this example, we arrive at 6 + 7 = 13(10) = 1101(2).

The three mantissa bits are the first three bits following the leading 1: 100. (If it happened that there were 1 bits beyond the 1/8's place, we would need to round the mantissa to the nearest eighth.)

Thus we end up with 1 1101 100.
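Running the encode8 sketch from above on -96 reproduces the same bit pattern:

    print(encode8(-96))                  # 1 1101 100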

Conversely, suppose we want to decode the number 0 0101 100.

We observe that the number is positive, and the exponent bits represent 0101(2) = 5(10). This is 7 more than the actual exponent, and so the actual exponent must be -2. Thus, in binary scientific notation, we have 1.100(2) × 2^-2.

We convert this to binary: 1.100(2) × 2^-2 = 0.011(2).

We convert the binary into decimal: 0.011(2) = 1/4 + 1/8 = 3/8 = 0.375(10).
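Decoding goes the other way. A matching Python sketch (decode8 is again an invented name) un-biases the exponent and re-attaches the implicit leading 1:

    def decode8(bits):
        # bits given as "s eeee mmm", e.g. "0 0101 100"
        sign, exp, frac = bits.split()
        value = (1 + int(frac, 2) / 8) * 2 ** (int(exp, 2) - 7)
        return -value if sign == "1" else value

    print(decode8("0 0101 100"))         # 0.375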

Representable numbers

This 8-bit floating-point format can represent a wide range of both small numbers and large numbers. To find the smallest possible positive number we can represent, we would want the sign bit to be 0, we would place 0 in all the exponent bits to get the smallest exponent possible, and we would put 0 in all the mantissa bits. This gives us 0 0000 000, which represents

1.000(2) × 2^(0 - 7) = 2^-7 ≈ 0.0078(10).

To determine the largest positive number, we would want the sign bit still to be 0, we would place 1 in all the exponent bits to get the largest exponent possible, and we would put 1 in all the mantissa bits. This gives us 0 1111 111, which represents

1.111(2) × 2^(15 - 7) = 1.111(2) × 2^8 = 111100000(2) = 480(10).

Thus, our 8-bit floating-point format can represent positive numbers from about 0.0078(10) to 480(10). In contrast, the 8-bit two's-complement representation can only represent positive numbers between 1 and 127.
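Feeding the two extreme bit patterns through the decode8 sketch above confirms this range:

    print(decode8("0 0000 000"))         # 0.0078125, i.e. 2^-7
    print(decode8("0 1111 111"))         # 480.0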

But notice that the floating-point representation can't represent all of the numbers in its range. This would be impossible, since eight bits can represent only 2^8 = 256 distinct values, and there are infinitely many real numbers in the range to represent. What's going on? Let's consider how to represent 51(10) in this scheme. In binary, this is 110011(2) = 1.10011(2) × 2^5. When we try to fit the mantissa into the 3-bit portion of our scheme, we find that the last two bits won't fit: we would be forced to round to 1.101(2) × 2^5, and the resulting bit pattern would be 0 1100 101. That rounding means that we're not representing the number precisely. In fact, 0 1100 101 translates to

1.101(2) × 2^(12 - 7) = 1.101(2) × 2^5 = 110100(2) = 52(10).

Thus, in our 8-bit floating-point representation, 51 equals 52! That's pretty irritating, but it's a price we have to pay if we want to be able to handle a large range of numbers with such a small number of bits.
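The rounding loss is easy to see with the two sketches above: 51 and 52 encode to the same bit pattern, and decoding it gives back 52.

    print(encode8(51))                   # 0 1100 101
    print(encode8(52))                   # 0 1100 101, identical
    print(decode8(encode8(51)))          # 52.0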

(By the way, in rounding numbers that are exactly between two possibilities, the typical policy is to round so that the final mantissa bit is 0. For example, taking the number 19, we end up with 1.0011(2) × 2^4, and we would round this up to 1.010(2) × 2^4 = 20(10). On the other hand, rounding up the number 21 = 1.0101(2) × 2^4 would lead to 1.011(2) × 2^4, leaving a 1 in the final bit of the mantissa, which we want to avoid; so instead we round down to 1.010(2) × 2^4 = 20(10). Doubtless you were taught in grade school to “round up” all the time; computers don't do this because if we consistently round up, all those roundings will bias the total of the numbers upward. Rounding so the final bit is 0 ensures that exactly half of the possible numbers round up and exactly half round down.)

While a floating-point representation can't represent all numbers precisely, it does give us a guaranteed number of significant digits. For this 8-bit representation, we get a single digit of precision, which is pretty limited. To get more precision, we need more mantissa bits. Suppose we defined a similar 16-bit representation with 1 bit for the sign, 6 bits for the sum of 31 and the actual exponent, and 9 bits for the mantissa's fractional part.

This representation, with its 9 mantissa bits, happens to provide three significant digits. Given a limited length for a floating-point representation, we have to compromise between more mantissa bits (to get more precision) and more exponent bits (to get a wider range of numbers to represent). For 16-bit floating-point numbers, the 6-and-9 split is a reasonable tradeoff of range versus precision.
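The tradeoff can be made concrete by looking at the relative spacing between neighboring mantissa values, which is 2^-m for m mantissa bits (a rough rule of thumb for significant digits, not an exact count):

    for m in (3, 9, 52):
        print(m, 2.0 ** -m)
    # 3 bits  -> 0.125       roughly 1 significant decimal digit
    # 9 bits  -> ~0.00195    roughly 3 significant decimal digits
    # 52 bits -> ~2.2e-16    roughly 16 digits (IEEE 754 double precision)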