High Performance Arithmetic
For many applications, the speed of arithmetic operations are the bottleneck to performance. Most supercomputers, such as the Cray, the Tera, and the Intel Hypercube are considered “super” because they excel at performing fixed and floating point arithmetic. In this section we discuss a number of ways to improve the speed of addition, subtraction, multiplication, and division.
HIGH PERFORMANCE ADDITION
The ripple-carry adder that we reviewed in Section 3.2.2 may introduce too much delay into a system. The longest path through the adder is from the inputs of the least significant full adder to the outputs of the most significant full adder. The process of summing the inputs at each bit position is relatively fast (a small two-level circuit suffices) but the carry propagation takes a long time to work its way through the circuit. In fact, the propagation time is proportional to the number of bits in the operands. This is unfortunate, since more significant figures in an addition translates to more time to perform the addition. In this section, we look at a method of speeding the carry propagation in what is known as a carry lookahead adder.
In Appendix B, reduced Boolean expressions for the sum (si) and carry outputs (ci+1) of a full adder are created. These expressions are repeated below, with sub- scripts added to denote the relative position of a full adder in a ripple-carry adder:
The Gi and Pi terms are referred to as generate and propagate functions, respectively, for the effect they have on the carry. When Gi = 1, a carry is generated at stage i. When Pi = 1, then a carry is propagated through stage i if either ai or bi is a 1. The Gi and Pi terms can be created in one level of logic since they only depend on an AND or an OR of the input variables, respectively.
We can now create a four-bit carry lookahead adder as shown in Figure 3-17. We still have the delay through the full adders as before, but now the carry chain is broken into independent pieces that require one gate delay for Gi and Pi and two more gate delays to generate ci+1. Thus, a depth of three gate delays is added, but the ripple-carry chain is removed. If we assume that each full adder introduces a gate delay of two, then a four-bit carry lookahead adder will have a maximum gate delay of five, whereas a four-bit ripple-carry adder will have a maximum gate delay of eight. The difference between the two approaches is more pronounced for wider operands. This process is limited to about eight bits of carry-lookahead, because of gate fanin limitations discussed in Appendix A. For additions of numbers having more than eight bits, the carry-lookahead circuits can be cascaded to compute the carry in and carry out of each carry-lookahead unit. (See the EXAMPLE at the end of the chapter.)
HIGH PERFORMANCE MULTIPLICATION
A number of methods exist for speeding the process of multiplication. Two methods are described in the sections below. The first approach gains performance by skipping over blocks of 1’s, which eliminates addition steps. A parallel multiplier is described next, in which a cross product among all pairs of multi- plier and multiplicand bits is formed. The result of the cross product is summed by rows to produce the final product.
The Booth Algorithm
The Booth algorithm treats positive and negative numbers uniformly. It operates on the fact that strings of 0’s or 1’s in the multiplier require no additions – just shifting. Additions or subtractions take place at the boundaries of the strings, where transitions take place from 0 to 1 or from 1 to 0. A string of 1’s in the multiplier from bit positions with weights 2u to 2v can be treated as 2u+1 – 2v. For example, if the multiplier is 001110 (+14)10, then u = 3 and v = 1, so 24 – 21 = 14.
In a hardware implementation, the multiplier is scanned from right to left. The first transition is observed going from 0 to 1, and so 21 is subtracted from the initial value (0). On the next transition, from 1 to 0, 24 is added, which results in +14. A 0 is considered to be appended to the right side of the multiplier in order to define the situation in which a 1 is in the rightmost digit of the multiplier.
If the multiplier is recoded according to the Booth algorithm, then fewer steps may be needed in the multiplication process. Consider the multiplication example shown in Figure 3-18. The multiplier (14)10 contains three 1’s, which means
that three addition operations are required for the shift/add multiplication procedure that is described in Section 3.3.1. The Booth recoded multiplier is obtained by scanning the original multiplier from right to left, and placing a -1 in the position where the first 1 in a string is encountered, and placing a +1 in the position where the next 0 is seen. The multiplier 001110 thus becomes 0 +1 0 0 0. The Booth recoded multiplier contains just two nonzero digits: +1 and -1, which means that only one addition operation and one subtraction operation are needed, and so a savings is realized for this example.
A savings is not always realized, however, and in some cases the Booth algorithm may cause more operations to take place than if it is not used at all. Consider the example shown in Figure 3-19, in which the multiplier consists of alternating 1’s and 0’s. This is the same example shown in Figure 3-18 but with the multiplicand and multiplier swapped. Without Booth recoding of the multiplier, three addition operations are required for the three 1’s in the multiplier. The Booth recoded multiplier, however, requires six addition and subtraction operations, which is clearly worse. We improve on this in the next section.
The Modified Booth Algorithm
One solution to this problem is to group the recoded multiplier bits in pairs, known as bit pair recoding, which is also known as the modified Booth algorithm. Grouping bit pairs from right to left produces three “+1,-1” pairs as shown in Figure 3-20. Since the +1 term is to the left of the -1 term, it has a
weight that is twice as large as the weight for the -1 position. Thus, we might think of the pair as having the collective value +2 – 1 = +1.
In a similar manner, the pair -1,+1 is equivalent to -2 + 1 = -1. The pairs +1,+1 and -1,-1 cannot occur. There are a total of seven pairs that can occur, which are shown in Figure 3-21. For each case, the value of the recoded bit pair is multi-
plied by the multiplicand and is added to the product. In an implementation of bit pair recoding, the Booth recoding and bit pair recoding steps are collapsed into a single step, by observing three multiplier bits at a time, as shown in the corresponding multiplier bit table.
The process of bit pair recoding of a multiplier guarantees that in the worst case, only w/2 additions (or subtractions) will take place for a w-bit multiplier.
Array Multipliers
The serial method we used for multiplying two unsigned integers in Section 3.2.1 requires only a small amount of hardware, but the time required to ultiply two numbers of length w grows as w2. We can speed the multiplication process so that it completes in just 2w steps by implementing the manual process shown in Figure 3-10 in parallel. The general idea is to form a one-bit product between each multiplier bit and each multiplicand bit, and then sum each row of partial product elements from the top to the bottom in systolic (row by row) fashion.
The structure of a systolic array multiplier is shown in Figure 3-22. A partial product (PP) element is shown at the bottom of the figure. A multiplicand bit (mi) and a multiplier bit (qj) are multiplied by the AND gate, which forms a partial product at position (i,j) in the array. This partial product is added with the partial product from the previous stage (bj) and any carry that is generated in the previous stage (aj). The result has a width of 2w, and appears at the bottom of the array (the high order w bits) and at the right of the array (the low order w bits).
3.5.3 HIGH PERFORMANCE DIVISION
We can extend the unsigned integer division technique of Section 3.3.2 to pro- duce a fractional result in computing a/b. The general idea is to scale a and b to look like integers, perform the division process, and then scale the quotient to correspond to the actual result of dividing a by b.
A faster method of division makes use of a lookup table and iteration. An iterative method of finding a root of a polynomial is called Newton’s iteration, which is illustrated in Figure 3-23. The goal is to find where the function f(x) crosses the
The number of bits of precision doubles on each iteration (see [Goldberg, 1990]), and so if we are looking to obtain 32 bits of precision and we start with a single bit of precision, then five iterations are required to reach our target precision. The problem now is to cast division in the form of finding a zero for f(x).
Consider the function 1/x – b which has a zero at 1/b. If we start with b, then we can compute 1/b by iteratively applying Newton’s method. Since f ’(x) = -1/x2, we now have:
Thus, we only need to perform multiplication and subtraction in order to per- form division. Further, if our initial guess for x0 is good enough, then we may only need to perform the iteration a few times.
Before using this method on an example, we need to consider how we will obtain our initial guess. If we are working with normalized fractions, then it is relatively easy to make use of a lookup table for the first few digits. Consider computing 1/.101101 using a 16-bit normalized base 2 fraction in which the leading 1 is not hidden. The first three bits for any binary fraction will be one of the patterns:
.100, .101, .110, or .111. These fractions correspond to the base 10 numbers 1/2, 5/8, 3/4, and 7/8, respectively. The reciprocals of these numbers are 2, 8/5, 4/3, and 8/7, respectively. We can store the binary equivalents in a lookup table, and then retrieve x0 based on the first three bits of b.
The leading 1 in the fraction does not contribute to the precision, and so the leading three bits of the fraction only provide two bits of precision. Thus, the lookup table only needs two bits for each entry, as shown in Figure 3-24.
Now consider computing 1/.1011011 using this floating point representation. We start by finding x0 using the table shown in Figure 3-24. The first three bits of the fraction b are 101, which corresponds to x0 = 01. We compute x1 = x0(2 – x0b) and obtain, in unsigned base 2 arithmetic: x1 = 01(10 – (01)(.1011011)) = 1.0100101. Our two bits of precision have now become four bits of precision.
For this example, we will retain as much intermediate precision as we can. In general, we only need to retain at most 2p bits of intermediate precision for a p-bit result. We iterate again, obtaining eight bits of precision:
3.5.4 RESIDUE ARITHMETIC
Addition, subtraction, and multiplication can all be performed in a single, carry- less step using residue arithmetic. The residue number system is based on relatively prime integers called moduli. The residue of an integer with respect to a particular modulus is the least positive integer remainder of the division of the integer by the modulus. A set of possible moduli are 5, 7, 9, and 4. With these moduli, 5 ´ 7 ´ 9 ´ 4 = 1260 integers can be uniquely represented. A table showing the representation of the first twenty decimal integers using moduli 5, 7, 9, and 4 is shown in Figure 3-25.
Addition and multiplication in the residue number system result in valid residue numbers, provided the size of the chosen number space is large enough to contain the results. Subtraction requires each residue digit of the subtrahend to be complemented with respect to its modulus before performing addition. Addition and multiplication examples are shown in Figure 3-26. For these examples, the
moduli used are 5, 7, 9, and 4. Addition is performed in parallel for each column, with no carry propagation. Multiplication is also performed in parallel for each column, independent of the other columns.
Although residue arithmetic operations can be very fast, there are a number of disadvantages to the system. Division and sign detection are difficult, and a representation for fractions is also difficult. Conversions between the residue number system and weighted number systems are complex, and often require involved methods such as the Chinese remainder theorem. The conversion problem is important because the residue number system is not very useful with- out being translated to a weighted number system so that magnitude comparisons can be made. However, for integer applications in which the time spent in addition, subtraction, and multiplication outweighs the time spent in division, conversion, etc., the residue number system may be a practical approach. An important application area is matrix-vector multiplication, which is used extensively in signal processing.
EXAMPLE: WIDE WORD HIGH PERFORMANCE ADDER
A practical word width for a carry lookahead adder (CLA) is four bits, whereas a 16-bit word width is not as practical because of the large fan-ins and fan-outs of the internal logic. We can subdivide a 16-bit addition problem into four 4-bit groups in which carry lookahead is used within the groups, and in which carry lookahead is also used among the groups. This organization is referred to as a group carry lookahead adder (GCLA). For this example, we will compare a
some additional logic that generates the carries between the four-bit groups. Each group behaves as an ordinary CLA, except that the least significant carry into each CLA is treated as a variable instead of as a 0, and that group generate (GG) and group propagate (GP) signals are generated. A GG signal is generated when a carry is generated somewhere within a group, and all of the more significant propagate signals are true. This means that a carry into a group will propagate all the way through the group. The corresponding equations for the least significant GG and GP signals in Figure 3-27 are shown below:
The carry into each group, except for the carry into CLA0, is computed from the GG and GP signals. For example, c4 is true when GG0 is true or when GP0 and c0 are both true. The corresponding equation is:
Higher order carries out of each group are computed in a similar manner:
In terms of gate delays, a 16-bit CLA has a longest path of five gate delays to pro- duce the most significant sum bit, as discussed in Section 3.5.1. Each of the CLAs in the 16-bit GCLA also has at least five gate delays on the longest path. The GG and GP signals are generated in three gate delays, and the carry signals out of each group are generated in two more gate delays, resulting in a total of five gate delays to generate the carry out of each group. In the highest bit position (s15), five gate delays are needed to generate c12, and another five gate delays are needed to generate s15, for a worst case path of 10 gate delays through the 16-bit GCLA.
With regard to fan-in and fan-out, the maximum fan-in of any gate in a four-bit CLA is four (refer to Figure 3-17), and in general, the maximum fan-in of any gate in an n-bit CLA is n. Thus, the maximum fan-in of any gate in a 16-bit CLA is 16. In comparison, the maximum fan-in for a 16-bit GCLA is five (for generating c16). The fan-outs for both cases are the same as the fan-ins.
In summary, the 16-bit CLA has only half of the depth of the 16-bit GCLA (five gate delays vs. 10 gate delays). The highest fan-in for a 16-bit CLA is 16, which is more than three times the highest fan-in for a 16-bit GCLA (16 vs. five). The highest fan-outs are the same as the highest fan-ins for each case.