Trends in computer architecture: parallel architecture (mapping an algorithm onto a parallel architecture, fine-grain parallelism – the connection machine cm-1 and coarse-grain parallelism: the cm-5).

MAPPING AN ALGORITHM ONTO A PARALLEL ARCHITECTURE

The process of mapping an algorithm onto a parallel architecture begins with a dependency analysis, in which the data dependencies among the operations in a program are identified. Consider the C code shown in Figure 10-23. In an ordinary SISD processor, the four numbered statements require four time steps to complete, as illustrated in the control sequence of Figure 10-24a. The dependency graph shown in Figure 10-24b exposes the natural parallelism in the control sequence. The dependency graph is created by assigning each operation in the original program to a node in the graph, and then drawing a directed arc from each node that produces a result to the node(s) that need it.

The control sequence requires four time steps to complete, but the dependency graph shows that the program can be completed in just three time steps, since operations 0 and 1 do not depend on each other and can be executed simultaneously. The resulting speedup of 4/3 may not be very great, but for other programs, the opportunity for speedup can be substantial, as we will see.
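
To make the dependency analysis concrete, the sketch below builds a small dependency graph in C and computes the earliest time step in which each operation can execute (one more than the longest chain of producers feeding it). The particular dependency structure is a hypothetical stand-in for the four statements of Figure 10-23, which is not reproduced here; it is chosen only so that operations 0 and 1 are independent, as described above.

#include <stdio.h>

#define N_OPS 4

/* deps[i][j] is 1 if operation i needs the result of operation j.
   Hypothetical structure: operations 0 and 1 are independent, operation 2
   uses the results of 0 and 1, and operation 3 uses the result of 2. */
static const int deps[N_OPS][N_OPS] = {
    { 0, 0, 0, 0 },
    { 0, 0, 0, 0 },
    { 1, 1, 0, 0 },
    { 0, 0, 1, 0 },
};

int main(void)
{
    int step[N_OPS];
    int total = 0;

    /* Operations are numbered so that each one depends only on
       lower-numbered operations, so a single forward pass suffices. */
    for (int i = 0; i < N_OPS; i++) {
        step[i] = 1;                        /* no producers: time step 1 */
        for (int j = 0; j < i; j++)
            if (deps[i][j] && step[j] + 1 > step[i])
                step[i] = step[j] + 1;      /* must follow producer j */
        if (step[i] > total)
            total = step[i];
        printf("operation %d can run in time step %d\n", i, step[i]);
    }
    printf("parallel time steps: %d (sequential time steps: %d)\n", total, N_OPS);
    return 0;
}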

Consider a matrix multiplication problem Ax = b in which A is a 4×4 matrix and x and b are both 4×1 matrices, as illustrated in Figure 10-25a. Our goal is to solve for the bi, using the equations shown in Figure 10-25b. Every operation is assigned a number, starting from 0 and ending at 27. There are 28 operations, assuming that no operation can receive more than two operands. A program running on a SISD processor that computes the bi requires 28 time steps to complete, if we make the simplifying assumption that additions and multiplications take the same amount of time.

A dependency graph for this problem is shown in Figure 10-26. The worst case path from any input to any output traverses three nodes, and so the entire process can be completed in three time steps, resulting in a speedup of

Speedup = 28 time steps / 3 time steps ≈ 9.3

Now that we know the structure of the data dependencies, we can plan a mapping of the nodes of the dependency graph to PEs in a parallel processor. Figure 10-27a shows one such mapping. For this example, we assume that the time to complete each addition is 10 ns, the time to complete each multiplication is 100 ns, and the time to communicate between PEs is 1000 ns. These numbers are for a fictitious processor, but the methods extend to real parallel processors.

As we can see from the parallel time of 2120 ns to execute the program using the mapping shown in Figure 10-27a, the time spent in communication dominates performance. This is worse than a SISD approach, since the 16 multiplications and the 12 additions would require only 16 × 100 ns + 12 × 10 ns = 1720 ns. There is no processor-to-processor communication cost within a SISD processor, and so only the computation time is considered.

An alternative mapping is shown in Figure 10-27b in which all of the operations needed to compute b0 are clustered onto the same PE. We have thus increased the granularity of the computation, which is a measure of the number of operations assigned to each PE. A single PE is a sequential, SISD processor, and so none of the operations within a cluster can be executed in parallel, but the communication time among the operations is reduced to 0. As shown in the diagram, the parallel time for b0 is now 430 ns which is much better than either the previous parallel mapping or a straight SISD mapping. Since there are no dependencies among the bi, they can all be computed in parallel, using one processor per bi. The actual speedup is now:

Speedup = 1720 ns / 430 ns = 4

Communication is always a bottleneck in parallel processing, and so it is important that we maintain a proper balance. We should not be led astray into thinking that adding processors to a problem will always speed it up; in fact, adding processors can increase execution time as a result of the added communication time. In general, we want to maintain a ratio in which:

time spent in computation / time spent in communication > 1
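
The arithmetic behind these numbers is easy to reproduce. The sketch below uses the figures quoted above (10 ns per addition, 100 ns per multiplication, 1000 ns per message between PEs) to compute the sequential time, the clustered parallel time for one bi, and the resulting speedups; the fine-grain figure of 2120 ns is reproduced from one critical-path decomposition that is consistent with the text (a multiplication, a message, an addition, a message, and a final addition).

#include <stdio.h>

int main(void)
{
    const double t_add = 10.0;      /* ns per addition         */
    const double t_mul = 100.0;     /* ns per multiplication   */
    const double t_com = 1000.0;    /* ns per PE-to-PE message */

    /* Sequential (SISD) time: all 16 multiplications and 12 additions. */
    double t_seq = 16 * t_mul + 12 * t_add;                     /* 1720 ns */

    /* Coarse-grain mapping: each bi is computed entirely on one PE
       (4 multiplications and 3 additions), with no communication,
       and the four PEs work on the four bi in parallel. */
    double t_coarse = 4 * t_mul + 3 * t_add;                    /* 430 ns */

    /* Fine-grain mapping: the assumed critical path for a bi is
       multiply, communicate, add, communicate, add. */
    double t_fine = t_mul + t_com + t_add + t_com + t_add;      /* 2120 ns */

    printf("sequential time:            %.0f ns\n", t_seq);
    printf("fine-grain parallel time:   %.0f ns (speedup %.2f)\n",
           t_fine, t_seq / t_fine);
    printf("coarse-grain parallel time: %.0f ns (speedup %.2f)\n",
           t_coarse, t_seq / t_coarse);
    return 0;
}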

FINE-GRAIN PARALLELISM – THE CONNECTION MACHINE CM-1

The Connection Machine (CM-1) is a massively parallel SIMD processor designed and built by Thinking Machines Corporation during the 1980’s. The architecture is noted for high connectivity between a large number of small processors. The CM-1 consists of a large number of one-bit processors arranged at the vertices of an n-space hypercube. Each processor communicates with other processors via routers that send and receive messages along each dimension of the hypercube.

A block diagram of the CM-1 is shown in Figure 10-28. The host computer is a conventional SISD machine such as a Symbolics computer (which was popular at the time) that runs a program written in a high level language such as LISP or C. Parallelizable parts of a high level program are farmed out to 2^n processors (2^16 processors is the size of a full CM-1) via a memory bus (for data) and a microcontroller (for instructions), and the results are collected via the memory bus. A separate high bandwidth datapath is provided for input and output directly to and from the hypercube.

The CM-1 makes use of a 12-space hypercube between the routers that send and receive data packets. The overall CM-1 prototype uses a 16-space hypercube, and so the difference between the 12-space router hypercube and the 16-space PE hypercube is made up by a crossbar that serves the 16 PEs attached to each router. For the purpose of example, a four-space hypercube is shown in Figure 10-29 for the router network. Each vertex of the hypercube is a router with an attached group of 16 PEs, each of which has a unique binary address. The router hypercube shown in Figure 10-29 thus serves 256 PEs. Routers that are directly connected to other routers can be found by inverting any one of the four most significant bits of the router address. (The router address forms the four most significant bits of the address of each of the 16 PEs that the router serves.)
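
Since directly connected routers differ in exactly one address bit, neighbor computation is just a set of single-bit inversions. The sketch below, written for the four-space router hypercube of Figure 10-29, prints the neighbors of one router by XORing its address with each power of two; the particular address used is a hypothetical example.

#include <stdio.h>

#define DIMENSIONS 4    /* the four-space router hypercube of Figure 10-29 */

int main(void)
{
    unsigned router = 0x5;      /* example router address 0101 (hypothetical) */

    printf("neighbors of router 0x%X:\n", router);
    for (int bit = 0; bit < DIMENSIONS; bit++) {
        unsigned neighbor = router ^ (1u << bit);   /* invert one address bit */
        printf("  dimension %d -> 0x%X\n", bit, neighbor);
    }

    /* The 16 PEs attached to this router share these address bits as the
       most significant bits of their own addresses; the remaining low
       bits select the individual PE. */
    return 0;
}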


Each PE is made up of a 16-bit flag register, a three-input, two-output ALU, and a 4096-bit random access memory, as shown in Figure 10-30. During operation, an external controller (the microcontroller of Figure 10-28) selects two bits from memory via the A address and B address lines. Only one value can be read from memory at a time, so the A value is buffered while the B value is fetched. The controller selects a flag to read, and feeds the flag and the A and B values into an ALU whose function it also selects. The result of the computation produces a new value for the A-addressed location and one of the flags.

The ALU takes three one-bit data inputs, two from the memory and one from the flag register, and 16 control inputs from the microcontroller, and produces two one-bit data outputs for the memory and flag registers. The ALU generates all 2^3 = 8 combinations (minterms) of the input variables for each of the two outputs. Eight of the 16 control lines select the minterms that are needed in the sum-of-products form of each output.
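
The minterm-selection scheme can be modeled directly: each output is the OR of whichever of the eight minterms of (A, B, flag) its 8-bit control field selects, and since exactly one minterm is active for a given input combination, the output is simply the selected bit of the control field. The function below is a behavioral sketch of that idea, not CM-1 microcode; the names and the control-mask encoding are our own.

#include <stdio.h>

/* Model of a three-input, two-output minterm-selecting ALU.  memory_ctrl
   and flag_ctrl are 8-bit masks: bit m is set if minterm m of (a, b, flag)
   contributes to that output. */
static void alu(int a, int b, int flag,
                unsigned memory_ctrl, unsigned flag_ctrl,
                int *memory_out, int *flag_out)
{
    int minterm = (a << 2) | (b << 1) | flag;   /* which of the 8 minterms is active */
    *memory_out = (memory_ctrl >> minterm) & 1;
    *flag_out   = (flag_ctrl   >> minterm) & 1;
}

int main(void)
{
    /* Example control settings: memory output = a AND b (minterms 110 and
       111, mask 0xC0); flag output = a XOR b (minterms 010, 011, 100, 101,
       mask 0x3C). */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int mem, flg;
            alu(a, b, 0, 0xC0, 0x3C, &mem, &flg);
            printf("a=%d b=%d  memory_out=%d flag_out=%d\n", a, b, mem, flg);
        }
    return 0;
}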

PEs communicate with other PEs through routers. Each router services communication between a PE and the network by receiving packets from the network intended for the attached PEs, injecting packets into the network, buffering when necessary, and forwarding messages that use the router as an intermediary to get to their destinations.

The CM-1 is a landmark machine for the massive parallelism made available by the architecture. For scalable problems like finite element analysis (such as modeling heat flow through the use of partial differential equations), the available parallelism can be fully exploited. There is usually a need for floating point manipulation in this case, and so floating point processors augment the PEs in the next-generation CM-2. A natural way to model heat flow is through a mesh interconnect, which is implemented as a hardwired bypass to the message-passing routing mechanism through the North-East-West-South (NEWS) grid. Thus we can reduce the cost of PE-to-PE communication for this application.

Not all problems scale so well, and there is a general trend moving away from fine grain parallel processing. This is largely due to the difficulty of keeping the PEs busy doing useful work, while also keeping the time spent in computation greater than the time spent in communication. In the next section, we look at a coarse grain architecture: The CM-5.

COARSE-GRAIN PARALLELISM: THE CM-5

The CM-5 (Thinking Machines Corporation) combines properties of both SIMD and MIMD architectures, and thereby provides greater flexibility for mapping a parallel algorithm onto the architecture. The CM-5 is illustrated in Figure 10-31. There are three types of processors for data processing, control, and I/O. These processors are connected primarily by the Data Network and the Control Network, and to a lesser extent by the Diagnostic Network.


The processing nodes are assigned to control processors, which form partitions, as illustrated in Figure 10-32. A partition contains a control processor, a number of processing nodes, and dedicated portions of the Control and Data Networks. Note that there are both user partitions (where the data processing takes place) and I/O partitions.

The Data Network uses a fat-tree topology, as illustrated in Figure 10-33. The general idea is that the bandwidth from child nodes to parent nodes increases as the network approaches the root, to account for the increased traffic as data travels from the leaves toward the root.

The Control Network uses a simple binary tree topology in which the system components are at the leaves. A control processor occupies one leaf in a partition, and the processing nodes are placed in the remaining nodes, although not necessarily filling all possible node positions in a subtree.

The Diagnostic Network is a separate binary tree in which one or more diagnostic processors are at the root. At the leaves are physical components, such as circuit boards and backplanes, rather than logical components such as processing nodes.

Each control processor is a self-contained system that is comparable in complexity to a workstation. A control processor contains a RISC microprocessor that serves as a CPU, a local memory, I/O that contains disks and Ethernet connections, and a CM-5 interface.

Each processing node is much smaller, and contains a SPARC-based microprocessor, a memory controller for 8, 16, or 32 Mbytes of local memory, and a network interface to the Control and Data Networks. In a full implementation of a CM-5, there can be up to 16,384 processing nodes, each performing 64-bit floating point and integer operations, operating at a clock rate of 32 MHz.

Overall, the CM-5 provides a true mix of SIMD and MIMD styles of processing, and offers greater applicability than the stricter SIMD style of its CM-1 and CM-2 predecessors.

 

Trends in computer architecture: parallel architecture (the Flynn taxonomy and interconnection networks).

Parallel Architecture

One method of improving the performance of a processor is to decrease the time needed to execute instructions. This will work up to a limit of about 400 MHz (Stone, 1991), at which point an effect known as ringing on busses prohibits further speedup with conventional bus technology. This is not to say that higher clock speeds are not possible, because indeed current microprocessors have clock rates well above 400 MHz, but that “shared bus” approaches become impractical at these speeds. As conventional architectural approaches to improving performance wear thin, we need to consider alternative methods of improving performance.

One alternative to squeezing more speed out of a single processor and its bus is to increase the number of processors, and to decompose and distribute a single program among them. This approach is known as parallel processing, in which a number of processors work collectively, in parallel, on a common problem. We see an example of parallel processing earlier in the chapter with pipelining. For that case, four processors are connected in series (Figure 10-3), each performing a different task, like an assembly line in a factory. The interleaved memory described in Chapter 7 is another example of pipelining.

A parallel architecture can be characterized in terms of three parameters: (1) the number of processing elements (PEs); (2) the interconnection network among the PEs; and (3) the organization of the memory. In the four-stage instruction pipeline of Figure 10-3, there are four PEs. The interconnection network among the PEs is a simple ring. The memory is an ordinary RAM that is external to the pipeline.

Characterizing the architecture of a parallel processor is a relatively easy task, but measuring the performance is not nearly so simple. Although we can easily measure the increased speed that a simple enhancement like a pipeline offers, the overall speedup is data dependent: not all programs and data sets map well onto a pipelined architecture. Other performance considerations of pipelined architectures that are also data dependent are the cost of flushing a pipeline, the increased area cost, the latency (input to output delay) of the pipeline, etc.

A few common measures of performance are parallel time, speedup, efficiency, and throughput. The parallel time is simply the absolute time needed for a program to execute on a parallel processor. The speedup is the ratio of the time for a program to execute on a sequential (non-parallel, that is) processor to the time for that same program to execute on a parallel processor. In a simple form, we can represent speedup (S, now in the context of parallel processing) as:

S = TSequential / TParallel

Since a sequential algorithm and a parallel version of the same algorithm may be programmed very differently for each machine, we need to qualify TSequential and TParallel so that they apply to the best implementation for each machine.

There is more to the story. If we want to achieve a speedup of 100, it is not enough to simply distribute a single program over 100 processors. The problem is that not many computations are easily decomposed in a way that fully utilizes the available PEs. If there are even a small number of sequential operations in a parallel program, then the speedup can be significantly limited. This is summarized by Amdahl’s law, in which speedup is expressed in terms of the number of processors p and the fraction of operations that must be performed sequentially f:

S = 1 / (f + (1 – f)/p)

For example, if f = 10% of the operations must be performed sequentially, then speedup can be no greater than 10 regardless of how many processors are used:

S = 1 / (0.1 + (1 – 0.1)/p) < 1 / 0.1 = 10

This brings us to measurements of efficiency. Efficiency is the ratio of speedup to the number of processors used. For a speedup of 5.3 with 10 processors, the efficiency is:

Efficiency = S / p = 5.3 / 10 = 53%

If we double the number of processors to 20, then the speedup increases to 6.9 but the efficiency reduces to 34%. Thus, parallelizing an algorithm can improve performance to a limit that is determined by the amount of sequential operations. Efficiency is drastically reduced as speedup approaches its limit, and so it does not make sense to simply use more processors in a computation in the hope that a corresponding gain in performance will be achieved.
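
The speedup and efficiency figures quoted above follow directly from Amdahl's law. The short program below evaluates S = 1 / (f + (1 – f)/p) and E = S/p for f = 0.1 and a few processor counts, reproducing the 5.3/53% and 6.9/34% values.

#include <stdio.h>

/* Amdahl's law: f is the fraction of sequential operations,
   p is the number of processors. */
static double speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    const double f = 0.10;
    const int procs[] = { 10, 20, 100, 1000 };

    for (int i = 0; i < 4; i++) {
        int p = procs[i];
        double s = speedup(f, p);
        printf("p = %4d   speedup = %5.2f   efficiency = %5.1f%%\n",
               p, s, 100.0 * s / p);
    }
    /* With f = 0.1 the speedup creeps toward its limit of 1/f = 10
       while the efficiency collapses. */
    return 0;
}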

Throughput is a measure of how much computation is achieved over time, and is of special concern for I/O bound and pipelined applications. For the case of a four stage pipeline that remains filled, in which each pipeline stage completes its task in 10 ns, the average time to complete an operation is 10 ns even though it takes 40 ns to execute any one operation. The overall throughput for this situation is then:

Throughput = 1 operation / 10 ns = 100 million operations per second

THE FLYNN TAXONOMY

Computer architectures can be classified in terms of their instruction streams and their data streams, using a taxonomy developed by M. J. Flynn (Flynn, 1972). A conventional sequential processor fits into the single-instruction stream, single data stream (SISD) category, as illustrated in Figure 10-14a. Only a single instruction is executed at a time in a SISD processor, although pipelining may allow several instructions to be in different phases of execution at any given time.


In a single instruction stream, multiple data stream (SIMD) processor, several identical processors perform the same sequence of operations on different data sets, as illustrated in Figure 10-14b. A SIMD system can be thought of as a room filled with mail sorters, all sorting different pieces of mail into the same set of bins.

In a multiple instruction stream, multiple data stream (MIMD) processor, several processors perform different operations on different data sets, but are all coordinated to execute a single parallel program, as illustrated in Figure 10-14c. An example of a MIMD processor is the Sega home video entertainment system, which has four processors for (1) sound synthesis (a Yamaha synthesis processor); (2) sound filtering (a Texas Instruments programmable sound generator); (3) program execution (a 68000); and (4) background processing (a Z80). We will see more of the Sega Genesis in a Case Study at the end of the chapter.

In a multiple instruction stream, single data stream (MISD) processor, a single data stream is operated on by several functional units, as illustrated in Figure 10-14d. The data stream is typically a vector of several related streams. This configuration is known as a systolic array, which we see in Chapter 3 in the form of an array multiplier.

INTERCONNECTION NETWORKS

When a computation is distributed over a number of PEs, the PEs need to communicate with each other through an interconnection network. There is a host of topologies for interconnection networks, each with its own characteristics in terms of crosspoint complexity (an asymptotic measure of area), diameter (the length of the worst-case path through the network), and blocking (whether or not a new connection can be established in the presence of other connections). A few representative topologies and control strategies for configuring networks are described below.

One of the most powerful network topologies is the crossbar, in which every PE is directly connected to every other PE. An abstract view of a crossbar is illustrated in Figure 10-15a, in which four PEs are interconnected. In a closeup view illustrated in Figure 10-16, the crossbar contains crosspoint switches, which are configurable devices that either connect or disconnect the lines that go through it. In general, for N PEs, the crosspoint complexity (the number of crosspoint switches) is N^2. In Figure 10-15a, N = 4 (not 8) because the output ports of the PEs on the left and the input ports of the PEs on the right belong to the same PEs. The crosspoint complexity is thus 4^2 = 16. The network diameter is 1 since every PE can directly communicate with every other PE, with no intervening PEs. The number of crosspoint switches that are traversed is not normally considered in evaluating the network diameter. The crossbar is strictly nonblocking, which means that there is always an available path between every input and output regardless of the configuration of existing paths.


At the other extreme of complexity is the bus topology, which is illustrated in Figure 10-15b. With the bus topology, a fixed amount of bus bandwidth is shared among the PEs. The crosspoint complexity is N for N PEs, and the network diameter is 1, so the bus grows more gracefully than the crossbar. There can only be one source at a time, and there is normally only one receiver, so blocking is a frequent situation for a bus.

In a ring topology, there are N crosspoints for N PEs as shown in Figure 10-15c. As for the crossbar, each crosspoint is contained within a PE. The network diameter is N/2, but the collective bandwidth is N times greater than for the case of the bus. This is because adjacent PEs can communicate directly with each other over their common link without affecting the rest of the network.

In the mesh topology, there are N crosspoints for N PEs, but the diameter is only 2√N, as shown in Figure 10-15d. All PEs can simultaneously communicate in just 3√N steps, as discussed in (Leighton, 1992), using an off-line routing algorithm (in which the crosspoint settings are determined external to the PEs).

In the star topology, there is a central hub through which all PEs communicate as shown in Figure 10-15e. Since all of the connection complexity is centralized, the star can only grow to sizes that are bounded by the technology, which is normally less than for decentralized topologies like the mesh. The crosspoint complexity within the hub varies according to the implementation, which can be anything from a bus to a crossbar.

In the tree topology, there are N crosspoints for N PEs, and the diameter is 2log2N – 1 as shown in Figure 10-15f. The tree is effective for applications in which there is a great deal of distributing and collecting of data.

In the perfect shuffle topology, there are N crosspoints for N PEs as shown in Figure 10-15g. The diameter is log2N, since it takes log2N passes through the network to connect any PE with any other in the worst case. The perfect shuffle name comes from the property that if a deck of 2^N cards, in which N is an integer, is cut in half and interleaved N times, then the original configuration of the deck is restored. All N PEs can simultaneously communicate in 3log2N – 1 passes through the network, as presented in (Wu and Feng, 1981).
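
The card-deck property is easy to check by direct simulation. The sketch below performs the perfect shuffle (cut the deck in half and interleave the halves) on a deck of 2^K cards and reports how many shuffles it takes to restore the original order; the deck size is an arbitrary choice for illustration.

#include <stdio.h>
#include <string.h>

#define K 4                 /* deck of 2^K cards */
#define N (1 << K)

/* One perfect shuffle: cut the deck in half and interleave the halves. */
static void shuffle(int deck[N])
{
    int out[N];
    for (int i = 0; i < N / 2; i++) {
        out[2 * i]     = deck[i];            /* card from the top half    */
        out[2 * i + 1] = deck[N / 2 + i];    /* card from the bottom half */
    }
    memcpy(deck, out, sizeof(out));
}

int main(void)
{
    int deck[N], start[N];
    for (int i = 0; i < N; i++)
        deck[i] = start[i] = i;

    for (int pass = 1; pass <= K; pass++) {
        shuffle(deck);
        if (memcmp(deck, start, sizeof(deck)) == 0) {
            printf("deck of %d cards restored after %d shuffles\n", N, pass);
            return 0;
        }
    }
    printf("deck of %d cards not restored within %d shuffles\n", N, K);
    return 0;
}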

Finally, the hypercube has N crosspoints for N PEs, with a diameter of log2N-1, as shown in Figure 10-15h. The smaller number of crosspoints with respect to the perfect shuffle topology is balanced by a greater connection complexity in the PEs.

Let us now consider the behavior of blocking in interconnection networks. Figure 10-17a shows a configuration in which four processors are interconnected with a two-stage perfect shuffle network in which each crosspoint either passes both inputs straight through to the outputs, or exchanges the inputs to the outputs. A path is enabled from processor 0 to processor 3, and another path is enabled from processor 3 to processor 0. Neither processor 1 nor processor 2 needs to communicate, but they participate in some arbitrary connections as a side effect of the crosspoint settings that are already specified.

Suppose that we want to add another connection, from processor 1 to processor 1. There is no way to adjust the unused crosspoints to accommodate this new connection because all of the crosspoints are already set, and the needed connection does not occur as a side effect of the current settings. Thus, connection 1 → 1 is now blocked.

If we are allowed to disturb the settings of the crosspoints that are currently in use, then we can accommodate all three connections, as illustrated in Figure 10-17b. An interconnection network that operates in this manner is referred to as a rearrangeably nonblocking network.

The three-stage Clos network is strictly nonblocking. That is, there is no need to disturb the existing settings of the crosspoints in order to add another connection. An example of a three-stage Clos network is shown in Figure 10-18 for four PEs. In the input stage, each crosspoint is actually a crossbar that can make any connection of the two inputs to the three outputs. The crosspoints in the middle stage and the output stage are also small crossbars. The number of inputs to each input crosspoint and the number of outputs from each output crosspoint is selected according to the desired complexity of the crosspoints, and the desired complexity of the middle stage.

The middle stage has three crosspoints in this example, and in general, there are (n – 1) + (p – 1) + 1 = n + p – 1 crosspoints in the middle stage, in which n is the number of inputs to each input crosspoint and p is the number of outputs from each output crosspoint. This is how the three-stage Clos network maintains a strictly nonblocking property. There are n – 1 ways that an input can be blocked at the output of an input-stage crosspoint as a result of existing connections. Similarly, there are p – 1 ways that existing connections can block a desired connection into an output crosspoint. In order to ensure that every desired connection can be made between available input and output ports, there must be one more path available.

For this case, n = 2 and p = 2, and so we need n + p – 1 = 2 + 2 – 1 = 3 paths from every input crosspoint to every output crosspoint. Architecturally, this relationship is satisfied with three crosspoints in the middle stage that each connect every input crosspoint to every output crosspoint.

EXAMPLE: STRICTLY NONBLOCKING NETWORK

For this example, we want to design a strictly nonblocking (three-stage Clos) network for 12 channels (12 inputs and 12 outputs to the network) while maintaining a low maximum complexity of any crosspoint in the network.

There are a number of ways that we can organize the network. For the input stage, we can have two input nodes with 6 inputs per node, or 6 input nodes with two inputs per node, to list just two possibilities. We have similar choices for the output stage. Let us start by looking at a configuration that has two nodes in the input stage and two nodes in the output stage, with 6 inputs for each node in the input stage and 6 outputs for each node in the output stage. For this case, n = p = 6, which means that n + p – 1 = 11 nodes are needed in the middle stage, as shown in Figure 10-19. The maximum complexity of any node for this case is 6 × 11 = 66, for each of the input and output nodes.

Now let us try using 6 input nodes and 6 output nodes, with two inputs for each input node and two outputs for each output node. For this case, n = p = 2, which means that n + p – 1 = 3 nodes are needed in the middle stage, as shown in Figure 10-20. The maximum node complexity for this case is 6 × 6 = 36, for each of the middle stage nodes, which is better than the maximum node complexity of 66 for the previous case.

Similarly, networks for n = p = 4 and n = p = 3 are shown in Figure 10-21 and Figure 10-22, respectively. The maximum node complexity for each of these networks is 4 × 7 = 28 and 4 × 4 = 16, respectively. Among the four configurations studied here, n = p = 3 gives the lowest maximum node complexity.
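
The four configurations in this example can be checked mechanically. For a symmetric three-stage Clos network with 12 channels and n = p inputs per input node, there are 12/n input nodes (and as many output nodes), n + p – 1 middle nodes, and the node complexities computed below; the program reproduces the 66, 36, 28, and 16 figures quoted above.

#include <stdio.h>

int main(void)
{
    const int channels = 12;
    const int choices[] = { 6, 2, 4, 3 };      /* values of n = p, in the order discussed */

    for (int i = 0; i < 4; i++) {
        int n = choices[i];                    /* inputs per input node (= p)        */
        int edge_nodes = channels / n;         /* number of input (and output) nodes */
        int middle     = n + n - 1;            /* n + p - 1 middle-stage nodes       */

        int edge_complexity   = n * middle;            /* each edge node is an n-by-middle crossbar */
        int middle_complexity = edge_nodes * edge_nodes;

        int max = edge_complexity > middle_complexity ? edge_complexity
                                                      : middle_complexity;
        printf("n = p = %d: %2d middle nodes, maximum node complexity %d\n",
               n, middle, max);
    }
    return 0;
}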


 

Trends in computer architecture: vliw machines and case study: the Intel IA-64 (Merced) architecture (background—the 80x86 cisc architecture and the Merced: an epic architecture).

10.7 VLIW Machines

There is an architecture that is in a sense competitive with superscalar architectures, referred to as the VLIW (Very Long Instruction Word) architecture. In VLIW machines, multiple operations are packed into a single instruction word that may be 128 or more bits wide. The VLIW machine has multiple execution units, similar to the superscalar machine. A typical VLIW CPU might have two IUs, two FPUs, two load/store units, and a BPU. It is the responsibility of the compiler to organize multiple operations into the instruction word. This relieves the CPU of the need to examine instructions for dependencies, or to order or reorder instructions. A disadvantage is that the compiler must out of necessity be pessimistic in its estimates of dependencies. If it cannot find enough instructions to fill the instruction word, it must fill the blank spots with NOP instructions. Furthermore, VLIW architectural improvements require software to be recompiled to take advantage of them.
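
A minimal way to picture a VLIW instruction word is as a fixed set of operation slots, one per execution unit, that the compiler fills and pads with NOPs. The layout below is a hypothetical illustration built around the unit mix mentioned above (two IUs, two FPUs, two load/store units, and a BPU); it is not the format of any particular VLIW machine.

#include <stdio.h>

/* Hypothetical VLIW word: one slot per execution unit. */
enum unit { IU0, IU1, FPU0, FPU1, LSU0, LSU1, BPU, N_SLOTS };

struct slot {
    unsigned short opcode;      /* 0 means NOP */
    unsigned char  dest, src1, src2;
};

struct vliw_word {
    struct slot slots[N_SLOTS];
};

int main(void)
{
    /* The compiler packs whatever independent operations it could find;
       every remaining slot is left as a NOP. */
    struct vliw_word w = { 0 };
    w.slots[IU0]  = (struct slot){ .opcode = 1, .dest = 3, .src1 = 1, .src2 = 2 };  /* e.g. add  */
    w.slots[LSU0] = (struct slot){ .opcode = 7, .dest = 4, .src1 = 5, .src2 = 0 };  /* e.g. load */

    int useful = 0;
    for (int i = 0; i < N_SLOTS; i++)
        if (w.slots[i].opcode != 0)
            useful++;
    printf("%d of %d slots carry useful work; the rest are NOPs\n", useful, N_SLOTS);
    return 0;
}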

There have been a number of attempts to market VLIW machines, but mainly, VLIW machines have fallen out of favor in recent years. Performance is the primary culprit, for the reasons above, among others.

Case Study: The Intel IA-64 (Merced) Architecture

This section discusses a microprocessor family in development by an alliance between Intel and Hewlett-Packard, which it is hoped will take the consortium into the 21st century. We first look into the background that led to the decision to develop a new architecture, and then we look at what is currently known about the architecture. (The information in this section is taken from various publications and Web sites, and has not been confirmed by Intel or Hewlett-Packard.)

BACKGROUND—THE 80X86 CISC ARCHITECTURE

The current Intel 80x86 architecture, which runs on some 80% of desktop computers in the late 1990’s, had its roots in the 8086 microprocessor, designed in the late 1970’s. The architectural roots of the family go back to the original Intel 8080, designed in the early 1970’s. As a persistent advocate of upward compatibility, Intel has been in a sense hobbled by a CISC architecture that is over 20 years old. Other vendors, such as Motorola, abandoned hardware compatibility for modernization, relying upon emulators to ease the transition to a new ISA.

In any case, Intel and Hewlett-Packard decided several years ago that the x86 architecture would soon reach the end of its useful life, and they began joint research on a new architecture. Intel and Hewlett-Packard have been quoted as saying that RISC architectures have “run out of gas,” so to speak, so their search led in other directions. The result of their research led to the IA-64, which stands for “Intel Architecture-64.” The first of the IA-64 family is known by the code name Merced, after the Merced River, near San Jose, California.

THE MERCED: AN EPIC ARCHITECTURE

Although Intel has not released significant details of the Merced ISA, it refers to its architecture as Explicitly Parallel Instruction Computing, or EPIC. Intel takes pains to point out that it is not a VLIW or even an LIW machine, perhaps out of sensitivity to the bad reputation that VLIW machines have received; nevertheless, some industry analysts refer to it as "the VLIW-like EPIC architecture."

Features

While exact details are not publicly known as of this writing, published sources report that the Merced is expected to have the following characteristics:

• 128 64-bit GPRs and perhaps 128 80-bit FPRs;

• 64 1-bit predicate registers (explained later);

• Instruction words contain three instructions packed into one 128-bit parcel;

• Execution units, roughly equivalent to IU, FPU, and BPU, appear in multiples of three, and the IA-64 will be able to schedule instructions into these multiples;

• It will be the burden of the compiler to schedule the instructions to take advantage of the multiple execution units;

• Most of the instructions seem to be RISC-like, although it is rumored that the processor will (still!) execute 80x86 binary codes, in a dedicated execution unit, known as the DXU;

• Speculative loads. The processor will be able to load values from memory well in advance of when they are needed. Exceptions caused by the loads are postponed until execution has proceeded to the place where the loads would normally have occurred;

• Predication (not prediction), where both sides of a conditional branch instruction are executed and the results from the side not taken are discarded.

These latter two features are discussed in more detail later.

The Instruction Word

The 128-bit instruction word, shown in Figure 10-13, has three 40-bit instructions and an 8-bit template. The template is placed by the compiler to tell the CPU which instructions in and near that instruction word can execute in parallel, thus the term "Explicit." The CPU need not analyze the code at runtime to expose instructions that can be executed in parallel, because the compiler determines that ahead of time. Compilers for most VLIW machines must place NOP instructions in slots where instructions cannot be executed in parallel. In the IA-64 scheme, the presence of the template identifies those instructions in the word that can and cannot be executed in parallel, so the compiler is free to schedule instructions into all three slots, regardless of whether they can be executed in parallel.

The 6-bit predicate field in each instruction represents a tag placed there by the compiler to identify which leg of a conditional branch the instruction is part of, and is used in branch predication.

Branch Predication

Rather than using branch prediction, the IA-64 architecture uses branch predication to remove penalties due to mispredicted branches. When the compiler encounters a conditional branch instruction that is a candidate for predication, it selects two unique labels and labels the instructions in each leg of the branch with one of the two labels, identifying which leg they belong to. Both legs can then be executed in parallel. There are 64 one-bit predicate registers, one corresponding to each of the 64 possible predicate identifiers.

When the actual branch outcome is known, the corresponding one-bit predicate register is set if the branch outcome is TRUE, and the one-bit predicate register corresponding to the FALSE label is cleared. Then the results from instructions having the correct predicate label are kept, and results from instructions having the incorrect (mis-predicted) label are discarded.
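
At the source level, predication amounts to if-conversion: both legs are evaluated into temporaries and the predicate selects which results are kept. The C fragment below is only a conceptual sketch of that transformation, written in ordinary C rather than IA-64 code.

#include <stdio.h>

/* Branching version: only one leg executes. */
static int with_branch(int x)
{
    int r;
    if (x > 0)
        r = x * 2;          /* TRUE leg  */
    else
        r = x - 7;          /* FALSE leg */
    return r;
}

/* If-converted version: both legs execute, and a predicate selects the
   result.  On a predicated machine the discarded leg's results are simply
   never committed; here the selection is modeled with ordinary C. */
static int with_predication(int x)
{
    int p       = (x > 0);  /* plays the role of a one-bit predicate register */
    int r_true  = x * 2;    /* TRUE leg, executed unconditionally  */
    int r_false = x - 7;    /* FALSE leg, executed unconditionally */
    return p ? r_true : r_false;
}

int main(void)
{
    for (int x = -2; x <= 2; x++)
        printf("x = %2d   branch = %3d   predicated = %3d\n",
               x, with_branch(x), with_predication(x));
    return 0;
}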

Speculative Loads

The architecture also employs speculative loads, that is, examining the instruction stream for upcoming load instructions and loading the value ahead of time, speculating that the value will actually be needed and will not have been altered by intervening operations. If successful, this eliminates the normal latency inherent in memory accesses. The compiler examines the instruction stream for candidate load operations that it can “hoist” to a location earlier in the instruction sequence. It inserts a check instruction at the point where the load instruction was originally located. The data value is thus available in the CPU when the check instruction is encountered.

The problem that is normally faced by speculative loads is that the load operation may generate an exception, for example because the address is invalid. However, the exception may not be genuine, because the load may be beyond a branch instruction that is not taken, and thus would never actually be executed. The IA-64 architecture postpones processing the exception until the check instruction is encountered. If the branch is not taken then the check instruction will not be executed, and thus the exception will not be processed.
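
The compiler transformation can be sketched in C: the load is hoisted above the branch, and a check at the load's original location decides whether a deferred exception must finally be honored. The helper names below (speculative_load, check_speculation) are invented purely for illustration; they are not IA-64 instructions or library calls.

#include <stdio.h>

/* Invented helpers that stand in for a speculative load and its check. */
struct spec { int value; int deferred_fault; };

static struct spec speculative_load(const int *addr)
{
    struct spec s = { 0, 0 };
    if (addr == NULL)
        s.deferred_fault = 1;   /* remember the exception instead of raising it */
    else
        s.value = *addr;
    return s;
}

static int check_speculation(struct spec s)
{
    if (s.deferred_fault) {
        fprintf(stderr, "deferred load fault raised at the check\n");
        return 0;               /* a real machine would trap here */
    }
    return s.value;
}

int main(void)
{
    int data = 42;
    int *p = &data;             /* try p = NULL to see the deferred fault */
    int take_branch = 0;

    /* Hoisted load: issued well before we know whether it is needed. */
    struct spec s = speculative_load(p);

    /* ... other useful work could overlap with the load here ... */

    if (!take_branch) {
        /* Original location of the load: the check either delivers the
           value or finally processes the postponed exception. */
        printf("loaded value: %d\n", check_speculation(s));
    }
    return 0;
}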

All of this complexity places a heavy burden on the compiler, which must be clever about how it schedules operations into the instruction words.

80x86 Compatibility

Intel was recently granted a patent for a method, presumably to be used with IA-64, for supporting two instruction sets, one of which is the x86 instruction set. It describes instructions to allow switching between the two execution modes, and for data sharing between them.

Estimated Performance

It has been estimated that the first Merced implementation will appear sometime in the year 2000, and will have an 800 MHz clock speed. Goals are for it to have performance several times that of current-generation processors when running in EPIC mode, and that of a 500 MHz Pentium II in x86 mode. Intel has stated that initially the IA-64 microprocessor will be reserved for use in high-performance workstations and servers, and at an estimated initial price of $5000 each this will undoubtedly be the case.

On the other hand, skeptics, who seem to abound when new technology is announced, say that the technology is unlikely to meet expectations, and that the IA-64 may never see the light of day. Time will tell.

 

Trends in computer architecture: multiple instruction issue (superscalar) machines – the PowerPC 601 and case study: the PowerPC™ 601 as a superscalar architecture (instruction set architecture of the PowerPC 601 and hardware architecture of the PowerPC 601).

10.4 Multiple Instruction Issue (Superscalar) Machines – The PowerPC 601

In the earlier pipelining discussion, we see how several instructions can be in various phases of execution at once. Here, we look at superscalar architecture, where, with separate execution units, several instructions can be executed simultaneously. In a superscalar architecture, there might be one or more separate Integer Units (IUs), Floating Point Units (FPUs), and Branch Processing Units (BPUs). This implies that instructions need to be scheduled into the various execution units, and further, that instructions might be executed out of order.

Out-of-order execution means that instructions need to be examined prior to dispatching them to an execution unit, not only to determine which unit should execute them, but also to determine whether executing them out of order would result in an incorrect program because of dependencies between the instructions. This in turn implies an instruction unit that can prefetch instructions into an instruction queue, determine the kinds of instructions and the dependence relations among them, and schedule them into the various execution units.

10.5 Case Study: The PowerPC™ 601 as a Superscalar Architecture

As an example of a modern superscalar architecture let us examine the Motorola PowerPC™ 601. The 601 has actually been superseded by more powerful members of the PowerPC family, but it will serve to illustrate the important features of superscalar architectures without introducing unnecessary complexity into our discussion.

10.6.1 INSTRUCTION SET ARCHITECTURE OF THE POWERPC 601

The 601 is a 32-bit general register RISC machine whose ISA includes:

• 32 32-bit general purpose integer registers (GPRs);

• 32 64-bit floating point registers (FPRs);

• 8 4-bit condition code registers;

• nearly 50 special-purpose 32-bit registers that are used to control various aspects of memory management and the operating system;

• over 250 instructions (many of which are special-purpose).

10.6.2 HARDWARE ARCHITECTURE OF THE POWERPC 601

Figure 10-12, taken from the Motorola PowerPC 601 user’s manual, shows the microarchitecture of the 601. The flow of instructions and data proceeds via the System Interface, shown at the bottom of the figure, into the 32 KByte cache. From there, instructions are fetched eight at a time into the Instruction Unit, shown at the top of the figure. The issue logic within the Instruction Unit examines the instructions in the queue for their kind and dependencies, and issues them into one of the three execution units: IU, BPU, or FPU.

The IU is shown as containing the GPR file and an integer exception register, XER, which holds information about exceptions that arise within the IU. The IU can execute most integer instructions in a single clock cycle, thus obviating the need for any kind of pipelining of integer instructions.

The FPU contains the FPRs and the floating point status and control register (FPSCR). The FPSCR contains information about floating point exceptions and the type of result produced by the FPU. The FPU is pipelined, and most FP instructions can be issued at a rate of one per clock.

As we mentioned above in the section on pipelining, branch instructions, especially conditional branch instructions, pose a bottleneck when trying to overlap instruction execution. This is because the branch condition must first be ascertained to be true, for example “branch on plus” must test the N flag to ascertain that it is cleared. The branch address must then be computed, which often involves address arithmetic. Only then can the PC be loaded with the branch address.

The 601 attacks this problem in several ways. First, as mentioned above, there are eight 4-bit condition code registers instead of the usual one. This allows up to eight instructions to have separate condition code bits, and therefore not interfere with each other’s ability to set condition codes. The BPU looks in the instruction queue, and if it finds a conditional branch instruction it proceeds to compute the branch target address ahead of time, and fetches instructions at the branch target. If the branch is taken, this results in effectively a zero-cycle branch, since the instruction at the branch target has already been fetched in anticipation of the branch condition being satisfied. The BPU also has a link register (LR) in which it can store subroutine return addresses, thus saving one GPR as well as several other registers used for special purposes. Note that the BPU can issue its addresses over a separate bus directly to the MMU and Memory Unit for prefetching instructions.

The RTC unit shown in the figure is a Real Time Clock which has a calendar range of 137 years, with an accuracy of 1 ns.

The MMU and Memory Unit assist in fetching both instructions and data. Note the separate path for data items that goes from the cache directly to the GPR and FPR register files.

The PowerPC 601, and its more powerful descendants are typical of the modern general purpose microprocessor. Current microprocessor families are superscalar in design, often having several of each kind of execution unit.

 

Trends in computer architecture: overlapping register windows

Overlapping Register Windows

One modern architectural feature that has not been as widely adopted as other features (such as pipelining) is overlapping register windows, which to date has only been adopted by the SPARC family. This feature is based upon studies that show typical programs spend much of their time dealing with procedure call-and-return overhead, which involves passing parameters on a stack located in main memory in traditional architectures. The SPARC architecture reduces much of this overhead by employing multiple register sets that overlap. These registers are used for passing parameters between procedures, instead of using a stack in main memory.

Procedure calls may be deeply nested in an ordinary program, but for a given window of time, the nesting depth fluctuates within a narrow band. Figure 10-6 illustrates this behavior. For a nesting depth window size of five, the window moves only 18 times for 100 procedure calls. Results produced by a group at UC Berkeley (Tamir and Sequin, 1983) show that a window size of eight will shift on less than 1% of the calls or returns.

The small window size for nested calls is important for improving performance. For each procedure call, a stack frame is normally constructed in which parameters, a return address, and local variables are placed. There is thus a great deal of stack manipulation that takes place for procedure calls, but the complexity of the manipulation is not all that great. That is, stack references are highly localized within a small area.

The RISC I architecture exploits this locality by keeping the active portion of the stack in registers. Figure 10-7 shows the user’s view of register usage for the RISC I. The user sees 32 registers, in which each register is 32 bits wide. Registers R0-R7 are used for global variables. Registers R8-R15 are used for incoming parameters. Registers R16-R23 are used for local variables, and registers R24-R31 are used for outgoing parameters. The eight registers within each group are enough to satisfy the bulk of call/return activity, as evidenced by the frequency distribution in Figure 10-6.

Although the user sees 32 registers, there may be several hundred registers that overlap. Figure 10-8 illustrates the concept of overlapping register windows. The global registers are detached from the others, and are continuously available as R0-R7. Registers R8-R31 make up the remaining 24 registers that the user sees, but this group of registers slides deeper into the register file (the entire set of registers) on each procedure call. Since the outgoing parameters for one procedure are the incoming parameters to another, these sets of registers can overlap. Registers R8-R31 are referred to as a window. A current window pointer (CWP) points to the current window, and increases or decreases for calls and returns, respectively.

In the statistically rare event when there are not enough registers for the level of nesting, main memory is used. However, main memory is used for the lowest numbered window, so that the new current window still uses registers. The highest register location then wraps around to the lowest, forming a circular buffer. As returns are made, registers that were written to memory are restored to the register file. Thus, execution always takes place with registers and never directly with main memory.
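
A register file with overlapping windows can be modeled as a circular array indexed through a current window pointer, so that the caller's outgoing registers and the callee's incoming registers are literally the same storage. The sketch below is a simplified model in the spirit of Figures 10-7 and 10-8 (eight windows is an arbitrary choice, and the spilling of overflowed windows to memory is omitted); it is not a description of any particular SPARC implementation.

#include <stdio.h>

#define NWINDOWS 8                     /* number of windows (illustrative)              */
#define PHYS     (NWINDOWS * 16)       /* 8 ins + 8 locals per window; the outs of one
                                          window overlap the ins of the next window     */

static int phys[PHYS];                 /* windowed portion of the register file */
static int globals[8];                 /* R0-R7, always visible                 */
static int cwp = 0;                    /* current window pointer                */

/* Map the user-visible registers of the current window onto physical storage. */
static int *in_reg(int i)    { return &phys[(cwp * 16 + i) % PHYS]; }        /* R8-R15  */
static int *local_reg(int i) { return &phys[(cwp * 16 + 8 + i) % PHYS]; }    /* R16-R23 */
static int *out_reg(int i)   { return &phys[((cwp + 1) * 16 + i) % PHYS]; }  /* R24-R31 */

static void do_save(void)    { cwp = (cwp + 1) % NWINDOWS; }             /* procedure call   */
static void do_restore(void) { cwp = (cwp + NWINDOWS - 1) % NWINDOWS; }  /* procedure return */

int main(void)
{
    globals[0] = 0;                    /* the global set is separate from the windows */

    *out_reg(0) = 123;                 /* caller places an outgoing parameter in R24  */
    do_save();                         /* callee's window slides over the caller's outs */
    printf("callee sees incoming parameter: %d\n", *in_reg(0));     /* prints 123 */

    *local_reg(0) = *in_reg(0) * 2;    /* callee scratch work in R16 */
    *in_reg(1) = *local_reg(0);        /* callee leaves a result in its R9 */
    do_restore();
    printf("caller sees returned value:     %d\n", *out_reg(1));    /* prints 246 */
    return 0;
}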

EXAMPLE: COMPILED CODE FOR OVERLAPPING REGISTER WINDOWS AND DELAYED BRANCHES

In this section, we analyze a C compiler-produced SPARC assembly program that exploits features of the RISC concept. We start with the C program shown in Figure 10-9, in which the main routine passes two integers to a subroutine, which returns the sum of the integers to the main routine. The program is compiled with a Solaris Unix C compiler. An explanation of the more significant aspects of the assembled code is given in Figure 10-10, which includes a number of features found only in RISC code. There are a number of new instructions and pseudo-ops introduced in this code:


.seg/.section Unix executable programs have three segments for data, text (the instructions), and the stack. The .seg pseudo-op instructs the assembler to place the code that follows into one of these three segments. Some of the segments have different protections, which is why there is a data segment and also a data1 segment. The data1 segment contains constants, and should be protected from writing. The data segment is both readable and writable and is therefore not protected against reading or writing (but it is protected from being executed, as is data1). Newer versions of Unix allow more text and data areas to be used for different read, write, and execute protections.

%hi Same as ARC pseudo-op .high22.

%lo Same as ARC pseudo-op .low10.

add Same as addcc except that the condition codes are unaffected.

save Advances current window pointer and increments stack pointer to create space for local variables.


mov Same as: or %g0, register_or_immediate, destination_register. This differs from st because the destination is a register.

nop No-operation (the processor waits for one instruction cycle, while the branch finishes).

.ascii/.asciz Reserves space for an ASCII string.

set Sets a register to a value. This is a macro that expands to the sethi, %hi, and %lo constructs shown in #PROLOGUE# 1.


ret Return. Same as: jmpl %i7+8, %g0. Notice that it seems like the return should only go 4 bytes past the calling instruction, not 8 bytes as indicated above. This is because the instruction that follows the call is a nop, inserted by the compiler to protect the integrity of the pipeline.

restore Decrements current window pointer.

b Same as ba.

.file Identifies the source file.

.align Forces the code that follows onto a boundary evenly divisible by its argument.

.type Associates a label with its type.

.size Computes the size of a segment.

.ident Identifies the compiler version.

We can see the positions of the delay slots, marked with nop instructions. (The optimizing feature of the compiler has not been applied yet.) Despite the availability of overlapping register windows in the SPARC architecture, the unoptimized code makes no use of this feature: parameters to be passed to a routine are copied to the stack, and the called routine then retrieves these parameters from the stack. Again, the optimizing feature of the compiler has not yet been invoked. (But read on.)

Notice that the compiler does not seem to be consistent with its choice of registers for parameter passing. Prior to the call to add_two, the compiler uses %o0 and %o1 (%r0 and %r1) for parameters passed to add_two. Then, %r1 is used for the parameters passed to printf. Why did the compiler not start with %r0 again, or choose the next available register (%o2)? This is the register assignment problem, which has been the object of a great deal of study. We will not go into details here, as this is more appropriate for a course in compiler design, but suffice it to say that any logically correct assignment of variables to registers will work, but that some assignments are better than others in terms of the number of registers used and the overall program execution time.

Why are the stack frames so large? We only need three words on the stack frame for local variables a, b, and c in main. We might also need a word to store the return address, although the compiler does not seem to generate code for that. There are no parameters passed to main by the operating system, and so the stack frame that main sees should only be four words (16 bytes) in size. Thus, the line at the beginning of routine main:

save %sp, -128, %sp

should only be:

save %sp, -16, %sp.

What is all of the extra space for? There are a number of runtime situations that may need stack space. For instance, if the nesting depth is greater than the number of windows, then the stack must be used for overflow. (See Figure D-2 in [SPARC, 1992])

If a scalar is passed from one routine to another, then everything is fine. But if a callee refers to the address of a passed scalar (or aggregate), then the scalar (or aggregate) must be copied to the stack and be referenced from there for the lifetime of the pointer (or for the lifetime of the procedure, if the pointer lifetime is not known).

Why does the return statement ret cause a return to the code that is 8 bytes past the call, instead of 4 as we have been doing it? As mentioned above, this is because there is a nop that follows call (the so-called “delay-slot instruction”).

Notice that routine labels that appear in the source code are prepended with an underscore in the assembly code, so that main, add_two, and printf in C become _main, _add_two, and _printf in gcc-generated SPARC code. This means that if we want to write a C program that is linked to a gcc-generated SPARC program, the C calls should be made to routines that begin with underscores. For example, if add_two is compiled into SPARC code, and we invoke it from a C main program in another file, then the C program should make a call to _add_two, and not add_two, even though the routine started out as add_two. Further, the C program needs to declare _add_two as external.

If the compilation for add_two is continued down to an executable file, then there is no need to treat the labels differently. The add_two routine will still be labeled _add_two, but routine main will be compiled into code that expects to see _add_two and so everything will work OK. This is not the case, however, if a gcc program makes calls to a Fortran library.

Fortran is a commonly used language in the scientific community, and there are a number of significant Fortran libraries that are used for linear algebra, modeling and simulation, and parallel scientific applications. As programmers of language XYZ (whatever language that may be), we sometimes find ourselves wanting to write XYZ programs that make calls to Fortran routines. This is easy to do once we understand what is happening.

There are two significant issues that need to be addressed:

(1) differences in routine labels;

(2) differences in subroutine linkage.

In Fortran, the source code labels are prepended with two underscores in the assembly code. A C program (if C is language XYZ) that makes a call to Fortran routine add_two would then make a call to __add_two, which also must be declared as external in the C source code (and declared as global in the Fortran program).

If all of the parameters that are passed to the Fortran routines are pointers, then everything will work OK. If there are any scalars passed, then there will be trouble because C (like Java) uses call-by-value for scalars whereas Fortran uses call-by-reference. We need to “trick” the C compiler into using call-by-reference by making it explicit. Wherever a Fortran routine expects a scalar in its argument list, we use a pointer to the scalar in the C code.
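
Putting the two rules together (the extra underscores on the label, and pointers everywhere to mimic call-by-reference), a C call to a Fortran function add_two might look like the sketch below. The declared signature is an assumption made for illustration, the number of prepended underscores varies between compilers, and the program must of course be linked against the compiled Fortran object file before it can run.

#include <stdio.h>

/* Assumed declaration of the externally compiled Fortran function add_two.
   Every argument is a pointer because Fortran passes arguments by reference. */
extern int __add_two(int *a, int *b);

int main(void)
{
    int a = 2, b = 3;

    /* Pass addresses, not values, so that the Fortran routine receives
       references to the scalars. */
    int sum = __add_two(&a, &b);

    printf("%d + %d = %d\n", a, b, sum);
    return 0;
}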

As a practical consideration, the gcc compiler will compile Fortran programs. It knows what to do by observing the extension of the source file, which should be .f for Fortran.

Now, let us take a look at how an optimizing compiler improves the code. Figure 10-11 shows the optimized code using the compiler’s -O flag. Notice there is not a single nop, ld, or st instruction. Wasted cycles devoted to nop instructions have been reclaimed, and memory references devoted to stack manipulation have been eliminated.

 

Trends in computer architecture: pipelining the data path (arithmetic, branch, and load-store instructions, pipelining instructions and keeping the pipeline filled).

10.2 Pipelining the Datapath

The flow of instructions through a pipeline follows the steps normally taken when an instruction is executed. In the discussion below, we consider how three classes of instructions (arithmetic, branch, and load-store) are executed, and then we relate this to how the instructions are pipelined.

10.3.1 ARITHMETIC, BRANCH, AND LOAD-STORE INSTRUCTIONS

Consider the “normal” sequence of events when an arithmetic instruction is executed in a load-store machine:

1) Fetch the instruction from memory;

2) Decode the instruction (it is an arithmetic instruction, but the CPU has to find that out through a decode operation);

3) Fetch the operands from the register file;

4) Apply the operands to the ALU;

5) Write the result back to the register file.

There are similar patterns for other instruction classes. For branch instructions the sequence is:

1) Fetch the instruction from memory;

2) Decode the instruction (it is a branch instruction);

3) Fetch the components of the address from the instruction or register file;

4) Apply the components of the address to the ALU (address arithmetic);

5) Copy the resulting effective address into the PC, thus accomplishing the branch.

The sequence for load and store instructions is:

1) Fetch the instruction from memory;

2) Decode the instruction (it is a load or store instruction);

3) Fetch the components of the address from the instruction or register file;

4) Apply the components of the address to the ALU (address arithmetic);

5) Apply the resulting effective address to memory along with a read (load) or write (store) signal. If it is a write signal, the data item to be written must also be retrieved from the register file.

The three sequences above show a high degree of similarity in what is done at each stage: (1) fetch, (2) decode, (3) operand fetch, (4) ALU operation, (5) result writeback.

These five phases are similar to the four phases discussed in Chapters 4 and 6, except that we have refined the fourth phase, "execute," into two subphases: ALU operation and writeback, as illustrated in Figure 10-3. A result writeback is not always needed, and one way to deal with this is to have two separate subphases (ALU and writeback) with a bypass path for situations when a writeback is not needed. For this discussion, we take a simpler approach, and force all instructions to go entirely through each phase, whether or not that is actually needed.

10.3.2 PIPELINING INSTRUCTIONS

In practice, each CPU designer approaches the design of the pipeline from a different perspective, depending upon the particular design goals and instruction set. For example, the original SPARC implementation had only four pipeline stages, while some floating point pipelines may have a dozen or more stages.

Each of the execution units performs a different operation in the fetch-execute cycle. After the Instruction Fetch unit finishes its task, the fetched instruction is handed off to the Decode unit. At this point, the Instruction Fetch unit can begin fetching the next instruction, which overlaps with the decoding of the previous instruction. When the Instruction Fetch and Decode units complete their tasks, they hand off the remaining tasks to the next units (Operand Fetch is the next unit for Decode). The flow of control continues until all units are filled.

10.3.3 KEEPING THE PIPELINE FILLED

Notice an important point: although it takes multiple steps to execute an instruction in this model, on average, one instruction can be executed per cycle as long as the pipeline stays filled. The pipeline does not stay filled, however, unless we are careful as to how instructions are ordered. We know from Figure 10-1 that approximately one in every four instructions is a branch. We cannot fetch the instruction that follows a branch until the branch completes execution. Thus, as soon as the pipeline fills, a branch is encountered, and then the pipeline has to be flushed by filling it with no-operations (NOPs). These NOPs are sometimes referred to as pipeline bubbles. A similar situation arises with the LOAD and STORE instructions. They generally require an additional clock cycle in which to access memory, which has the effect of expanding the Execute phase from one cycle to two cycles at times. The "wait" cycles are filled with NOPs.

Figure 10-4 illustrates the pipeline behavior during a memory reference and also during a branch for the ARC instruction set. The addcc instruction enters the pipeline on time step (cycle) 1. On cycle 2, the ld instruction, which references memory, enters the pipeline and addcc moves to the Decode stage. The pipeline continues filling with the srl and subcc instructions on cycles 3 and 4, respectively. On cycle 4, the addcc instruction is executed and leaves the pipeline. On cycle 5, the ld instruction reaches the Execute stage, but does not finish execution because an additional cycle is needed for memory references. The ld instruction finishes execution during cycle 6.

image
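The cycle numbers in this trace can be reproduced with a small model. The C sketch below is our own simplification rather than the ARC pipeline itself: each instruction enters the pipeline one cycle after its predecessor and leaves four cycles after it enters, except ld and st, which spend a fifth cycle on the memory access. For the instruction sequence of Figure 10-4 it reports addcc finishing on cycle 4, ld on cycle 6, and be occupying cycles 5 through 8, matching the description above.

#include <stdio.h>

/* A toy trace of the pipeline behavior described for Figure 10-4: each
 * instruction is fetched one cycle after its predecessor and leaves the
 * pipeline four cycles after it enters, except ld/st, which need a fifth
 * cycle for the memory access.  The hardware does not stall; the compiler
 * must keep the delay-slot instruction from using the loaded register. */
struct instr { const char *mnemonic; int is_memory; };

int main(void)
{
    struct instr prog[] = {
        { "addcc", 0 }, { "ld", 1 }, { "srl", 0 }, { "subcc", 0 }, { "be", 0 }
    };
    int n = (int)(sizeof prog / sizeof prog[0]);

    for (int i = 0; i < n; i++) {
        int enter  = i + 1;                        /* cycle it is fetched       */
        int cycles = prog[i].is_memory ? 5 : 4;    /* total cycles in pipeline  */
        printf("%-6s enters on cycle %d, finishes on cycle %d\n",
               prog[i].mnemonic, enter, enter + cycles - 1);
    }
    return 0;
}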

Branch and Load Delay Slots

The ld and st instructions both require five cycles, but the remaining instructions require only four. Thus, an instruction that follows an ld or st should not use the register that is being loaded or stored. A safe approach is to insert a NOP after an ld or an st, as shown in Figure 10-5a. The extra cycle (or cycles, depending on the architecture) for a load is known as a delayed load, since the data from the load is not immediately available on the next cycle. A delayed branch is similar, as shown for the be instruction in cycles 5 through 8 of Figure 10-4. The position occupied by this NOP instruction is known as a load delay slot or a branch delay slot, respectively.

image

It is often possible for the compiler to find a nearby instruction to fill the delay slot. In Figure 10-5a, the srl instruction can be moved to the position of the nop since its register usage does not conflict with the surrounding code and reordering instructions this way does not impact the result. After replacing the nop line with the srl line, the code shown in Figure 10-5b is obtained. This is the code that is traced through the pipeline in Figure 10-4.
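The register-usage test that lets the compiler make this move can be sketched as below. The instruction encoding and the register numbers are hypothetical, chosen only to mirror the srl example; a real compiler must also check dependencies against every instruction the candidate is moved past.

#include <stdio.h>

/* A rough sketch of the check a compiler might make before moving an
 * instruction into a load delay slot: the candidate must neither read nor
 * write the register that the ld is loading. */
struct insn { int writes_reg; int reads_reg1; int reads_reg2; };

static int can_fill_load_delay_slot(struct insn candidate, int loaded_reg)
{
    return candidate.reads_reg1 != loaded_reg &&
           candidate.reads_reg2 != loaded_reg &&
           candidate.writes_reg  != loaded_reg;
}

int main(void)
{
    /* Hypothetical registers: an srl-like instruction reads %r3 and writes
     * %r7, while the ld loads %r2, so there is no conflict. */
    struct insn srl_like = { .writes_reg = 7, .reads_reg1 = 3, .reads_reg2 = -1 };
    printf("can fill the delay slot: %s\n",
           can_fill_load_delay_slot(srl_like, 2) ? "yes" : "no");
    return 0;
}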

Speculative Execution of Instructions

An alternative approach to dealing with branch behavior in pipelines is to simply guess which way the branch will go, and then undo any damage if the wrong path is taken. Statistically, loops are executed more often than not, and so it is usually a good guess to assume that a branch that exits a loop will not be taken. Thus, a processor can start processing the next instruction in anticipation of the direction of the branch. If the branch goes the wrong way, then the execution phase for the next instruction, and any subsequent instructions that enter the pipeline, can be stopped so that the pipeline can be flushed. This approach works well for a number of architectures, particularly those with slow clock rates or deep pipelines. For RISCs, however, the overhead of determining when a branch goes the wrong way, and then cleaning up any side effects caused by wrong instructions entering the pipeline, is generally too great. In RISC pipelines, a nop is normally placed in the delay slot when the compiler cannot find a useful instruction to fill it.

EXAMPLE: ANALYSIS OF PIPELINE EFFICIENCY

In this example, we analyze the efficiency of a pipeline.

A processor has a five-stage pipeline. If a branch is taken, then four cycles are needed to flush the pipeline. The branch penalty b is thus 4. The probability P_b that a particular instruction is a branch is 0.25. The probability P_t that the branch is taken is 0.5. We would like to compute the average number of cycles needed to execute an instruction, and the execution efficiency.

When the pipeline is filled and there are no branches, then the average number of cycles per instruction (CPI_No_Branch) is 1. The average number of cycles per instruction when there are branches is then:

CPI_avg = CPI_No_Branch + b × P_b × P_t = 1 + 4 × 0.25 × 0.5 = 1.5

The execution efficiency is the ratio of the cycles per instruction when there are no branches to the cycles per instruction when there are branches. Thus we have:

Efficiency = CPI_No_Branch / CPI_avg = 1 / 1.5 = 0.67 = 67%

The processor runs at 67% of its potential speed as a result of branches, but this is still much better than the five cycles per instruction that might be needed without pipelining.
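The arithmetic of this example can be checked with a short C program; it simply evaluates the two formulas above with the given parameters.

#include <stdio.h>

int main(void)
{
    /* Parameters from the example: branch penalty of 4 cycles, 25% of
     * instructions are branches, and half of those branches are taken. */
    double cpi_no_branch = 1.0;
    double b  = 4.0;     /* branch penalty in cycles                   */
    double pb = 0.25;    /* probability that an instruction is a branch */
    double pt = 0.5;     /* probability that the branch is taken        */

    double cpi_avg    = cpi_no_branch + b * pb * pt;  /* 1 + 4*0.25*0.5 = 1.5 */
    double efficiency = cpi_no_branch / cpi_avg;      /* 1/1.5 = 0.67         */

    printf("average CPI          = %.2f\n", cpi_avg);
    printf("execution efficiency = %.0f%%\n", efficiency * 100.0);
    return 0;
}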

There are techniques for improving the efficiency. As stated above, we know that loops are usually executed more than once, so we can guess that a branch out of a loop will not be taken and be right most of the time. We can also run simulations on the non-loop branches, and get a statistical sampling of which branches are likely to be taken, and then guess the branches accordingly. As explained above, this approach works best when the pipeline is deep or the clock rate is slow.

 

TRENDS IN COMPUTER ARCHITECTURE: From CISC to RISC

10.2 From CISC to RISC

image

Historically, when memory cycle times were very long and when memory prices were high, a small number of complicated instructions held an advantage over a larger number of simpler instructions. There came a point, however, when memory became inexpensive enough, and memory hierarchies became fast and large enough, that computer architects began reexamining this advantage. One technology that affected this examination was pipelining—that is, keeping the execution unit more or less the same, but allowing different instructions (which each require several clock cycles to execute) to use different parts of the execution unit on each clock cycle. For example, one instruction might be accessing operands in the register file while another is using the ALU.

We will cover pipelining in more detail later in the chapter, but the important point to make here is that computer architects learned that CISC instructions do not fit pipelined architectures very well. For pipelining to work effectively, each instruction needs to have similarities to other instructions, at least in terms of relative instruction complexity. The reason can be viewed in analogy to an assembly line that produces different models of an automobile. For efficiency, each “station” of the assembly line should do approximately the same amount and kind of work. If the amount or kind of work done at each station is radically different for different models, then periodically the assembly line will have to “stall” to accommodate the requirements of the given model.

CISC instruction sets have the disadvantage that some instructions, such as register-to-register moves, are inherently simple, whereas others, such as the MVC instruction and those like it, are complex and take many more clock cycles to execute.

The main philosophical underpinnings of the RISC approach are:

• Prefetch instructions into an instruction queue in the CPU before they are needed. This has the effect of hiding the latency associated with the instruction fetch.

• With instruction fetch times no longer a penalty, and with cheap memory to hold a greater number of instructions, there is no real advantage to complex CISC instructions. Complex operations should instead be composed of sequences of simple RISC instructions, even though the number of instructions needed may increase (typically by as much as 1/3 over a CISC approach).

• Moving operands between registers and memory is expensive, and should be minimized.

• The RISC instruction set should be designed with pipelined architectures in mind.

• There is no requirement that CISC instructions be maintained as integrated wholes; they can be decomposed into sequences of simpler RISC instructions.

The result is that RISC architectures have characteristics that distinguish them from CISC architectures:

• All instructions are of fixed length, one machine word in size.

• All instructions perform simple operations that can be issued into the pipeline at a rate of one per clock cycle. Complex operations are now composed of simple instructions by the compiler.

• All operands must be in registers before being operated upon. There is a separate class of memory access instructions: LOAD and STORE. This is referred to as a LOAD-STORE architecture.

• Addressing modes are limited to simple ones. Complex addressing calculations are built up using sequences of simple operations.

• There should be a large number of general registers for arithmetic operations so that temporary variables can be stored in registers rather than on a stack in memory.

In the next few sections, we explore additional motivations for RISC architectures, and special characteristics that make RISC architectures effective.

 
 

Trends in computer architecture: quantitative analyses of program execution (quantitative performance analysis).

TRENDS IN COMPUTER ARCHITECTURE

In the earlier chapters, the fetch-execute cycle is described in the form: “fetch an instruction, execute that instruction, fetch the next instruction, etc.” This gives the impression of a straight-line linear progression of program execution. In fact, the processor architectures of today have many advanced features that go beyond this simple paradigm. These features include pipelining, in which several instructions sharing the same hardware can simultaneously be in various phases of execution; superscalar execution, in which several instructions are executed simultaneously using different portions of the hardware, with possibly only some of the results contributing to the overall computation; very long instruction word (VLIW) architectures, in which each instruction word specifies multiple instructions (of smaller widths) that are executed simultaneously; and parallel processing, in which multiple processors are coordinated to work on a single problem.

In this chapter we cover issues that relate to these features. The discussion begins with issues that led to the emergence of the reduced instruction set computer (RISC), and examples of RISC features and characteristics. Following that, we cover an advanced feature used specifically in SPARC architectures: overlapping register windows. We then cover two important architectural features: superscalar execution and VLIW architectures. We then move into the topic of parallel processing, touching both on parallel architectures and program decomposition. The chapter concludes with case studies covering Intel’s Merced architecture, the PowerPC 601, and an example of a pervasive parallel architecture that can be found in a home videogame system.

10.1 Quantitative Analyses of Program Execution

Prior to the late 1970s, computer architects exploited improvements in integrated circuit technology by increasing the complexity of instructions and addressing modes, as the benefits of such improvements were thought to be obvious. It became an effective selling strategy to have more complex instructions and more complex addressing modes than a competing processor. Increases in architectural complexity catered to the belief that a significant barrier to better machine performance was the semantic gap—the gap between the meanings of high-level language statements and the meanings of machine-level instructions.

Unfortunately, as computer architects attempted to close the semantic gap, they sometimes made it worse. The IBM 360 architecture has the MVC (move character) instruction that copies a string of up to 256 bytes between two arbitrary locations. If the source and destination strings overlap, then the overlapped portion is copied one byte at a time. The runtime analysis that determines the degree of overlap adds a significant overhead to the execution time of the MVC instruction. Measurements show that overlaps occur only a few percent of the time, and that the average string size is only eight bytes. In general, faster execution results when the MVC instruction is entirely ignored, and instead, its function is synthesized with simpler instructions. Although a greater number of instructions may be executed without the MVC instruction, on average, fewer clock cycles are needed to implement the copy operation without using MVC than by using it.
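The alternative the text describes, synthesizing the copy from simpler instructions, boils down to an ordinary byte-copy loop. The C sketch below is our own illustration (the function name and the short fixed-size example are invented); it is the kind of loop a compiler can emit for the common case of short, non-overlapping strings, avoiding the per-call overlap analysis that MVC performs.

#include <stdio.h>

/* Copy n bytes one at a time, the way a compiler might synthesize a short
 * string copy from simple load, store, and branch instructions. */
static void copy_bytes(char *dst, const char *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = src[i];
}

int main(void)
{
    char src[8] = "CISC", dst[8];
    copy_bytes(dst, src, 5);          /* 5 bytes covers the trailing '\0' */
    printf("%s\n", dst);
    return 0;
}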

Long-held views began to change in 1971, when Donald Knuth published a landmark analysis of typical FORTRAN programs, showing that most of the statements are simple assignments. Later research by John Hennessy at Stanford University and David Patterson at the University of California at Berkeley confirmed that most complex instructions and addressing modes went largely unused by compilers. These researchers popularized the use of program analysis and benchmark programs to evaluate the impact of architecture upon performance.

Figure 10-1, taken from (Knuth, 1971), summarizes the frequency of occurrence of instructions in a mix of programs written in a variety of languages. Nearly half of all instructions are assignment statements. Interestingly, arithmetic and other “more powerful” operations account for only 7% of all instructions. Thus, if we want to improve the performance of a computer, our efforts would be better spent optimizing instructions that account for the greatest percentage of execution time rather than focusing on instructions that are inherently complex but rarely occur.

Related metrics are shown in Figure 10-2. From the figure, the number of terms in an assignment statement is normally just a few. The most frequent case (80%) is the simple variable assignment, X ← Y. There are only a few local variables in each procedure, and only a few arguments are normally passed to a procedure.

image

We can see from these measurements that the bulk of computer programs are very simple at the instruction level, even though more complex programs could potentially be created. This means that there may be little or no payoff in increasing the complexity of the instructions.

Discouragingly, analyses of compiled code showed that compilers usually did not take advantage of the complex instructions and addressing modes made available by computer architects eager to close the semantic gap. One important reason for this phenomenon is that it is difficult for a compiler to analyze the code in sufficient detail to locate areas where the new instructions can be used effectively, because of the great difference in meaning between most high-level language constructs and the expression of those constructs in assembly language. This observation, and the ever increasing speed and capacity of integrated circuit technology, converged to bring about an evolution from complex instruction set computer (CISC) machines to RISC machines.

A basic tenet of current computer architecture is to make the frequent case fast, and this often means making it simple. Since the assignment statement happens so frequently, we should concentrate on making it fast (and simple, as a consequence). One way to simplify assignments is to force all communication with memory into just two commands: LOAD and STORE. The LOAD/STORE model is typical of RISC architectures. We see the LOAD/STORE concept in Chapter 4 with the ld and st instructions for the ARC.

By restricting memory accesses to LOAD/STORE instructions only, other instructions can only access data that is stored in registers. There are two consequences of this, both good and bad: (1) accesses to memory can be easily overlapped, since there are fewer side effects that would occur if different instruction types could access memory (this is good); and (2) there is a need for a large number of registers (this seems bad, but read on).

A simpler instruction set results in a simpler and typically smaller CPU, which frees up space on a microprocessor to be used for something else, like registers. Thus, the need for more registers is balanced to a degree by the newly vacant circuit area, or chip real estate as it is sometimes called. A key problem lies in how to manage these registers, which is discussed in Section 10.4.

10.1.1 QUANTITATIVE PERFORMANCE ANALYSIS

When we estimate machine performance, the measure that is generally most important is execution time, T. When considering the impact of some performance improvement, the effect of the improvement is usually expressed in terms of the speedup, S, taken as the ratio of the execution time without the improvement (T_wo) to the execution time with the improvement (T_w):

S = T_wo / T_w

For example, if adding a 1MB cache module to a computer system results in lowering the execution time of some benchmark program from 12 seconds to 8 seconds, then the speedup would be 12/8 = 1.5, or 50%. An equation to calculate speedup as a direct percent can be represented as:

S (%) = (T_wo − T_w) / T_w × 100%

We can develop a more fine-grained equation for estimating T if we have information about the machine’s clock period, τ, the number of clock cycles per instruction, CPI, and a count of the number of instructions executed by the program during its execution, IC. In this case the total execution time for the program is given by:

T = IC × CPI × τ

CPI and IC can be expressed either as an average over the instruction set and total count, respectively, or summed over each kind and number of instructions in the instruction set and program. Substituting the latter equation into the former we get:

S (%) = (IC_wo × CPI_wo × τ_wo − IC_w × CPI_w × τ_w) / (IC_w × CPI_w × τ_w) × 100%

These equations, and others derived from them, are useful in computing and estimating the impact of changes in instructions and architecture upon performance.

EXAMPLE: CALCULATING SPEEDUP FOR A NEW INSTRUCTION SET

Suppose we wish to estimate the speedup obtained by replacing a CPU having an average CPI of 5 with another CPU having an average CPI of 3.5, with the clock period increased from 100 ns to 120 ns. The equation above becomes:

S (%) = (IC × 5 × 100 ns − IC × 3.5 × 120 ns) / (IC × 3.5 × 120 ns) × 100% = (500 − 420) / 420 × 100% ≈ 19%

Thus, without actually running a benchmark program we can estimate the impact of an architectural change upon performance.
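The estimate is easy to verify with a few lines of C that apply T = IC × CPI × τ to both machines. The instruction count below is an arbitrary placeholder, since the same program runs on both CPUs and IC cancels out of the ratio.

#include <stdio.h>

int main(void)
{
    /* Old CPU: average CPI of 5 at a 100 ns clock period.
       New CPU: average CPI of 3.5 at a 120 ns clock period. */
    double ic    = 1.0e6;                 /* placeholder instruction count */
    double t_old = ic * 5.0 * 100e-9;     /* execution time in seconds     */
    double t_new = ic * 3.5 * 120e-9;

    double speedup_pct = (t_old - t_new) / t_new * 100.0;
    printf("speedup = %.0f%%\n", speedup_pct);   /* about 19 */
    return 0;
}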

 

SUMMARY OF COMMUNICATION

■ SUMMARY

Communication involves the transfer of information between systems. As a rule, data is transferred in bit-serial fashion because the time-of-flight delays dominate the transfer time for high speed networks. However, modulation schemes allow many bits to be encoded in a single sample, such as for a dibit. The choice of modulation scheme impacts the introduction of errors into the transmission of information. Error detection and correction are made possible through redundancy, in which there are more bit patterns possible than the number of valid bit patterns. If the error bit patterns do not have a single closest valid code word, then error detection is possible but error correction is not possible. If every error bit pattern is reachable from only one valid bit pattern, then error correction is also possible.
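As a concrete illustration of redundancy used for correction, the C sketch below constructs a single-error-correcting Hamming code word for one 8-bit character with even parity, using the usual placement of check bits at power-of-two positions. It is our own example, not code from the text.

#include <stdio.h>
#include <stdint.h>

/* Encode 8 data bits into a 12-bit single-error-correcting (SEC) code word
 * with even parity.  Bit positions are numbered 1..12 from the right;
 * positions that are powers of two (1, 2, 4, 8) hold the check bits. */
static uint16_t sec_encode(uint8_t data)
{
    uint16_t code = 0;
    int d = 0;

    /* Place the data bits in the non-power-of-two positions. */
    for (int pos = 1; pos <= 12; pos++) {
        if ((pos & (pos - 1)) == 0)          /* power of two: check-bit slot */
            continue;
        if (data & (1 << d))
            code |= (uint16_t)1 << (pos - 1);
        d++;
    }

    /* Check bit p (p = 1, 2, 4, 8) creates even parity over every position
     * whose index has bit p set, including the check bit itself. */
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && (code & ((uint16_t)1 << (pos - 1))))
                parity ^= 1;
        if (parity)
            code |= (uint16_t)1 << (p - 1);
    }
    return code;
}

int main(void)
{
    uint16_t w = sec_encode('A');            /* ASCII 'A' = 0x41; any byte works */
    printf("SEC code word: 0x%03X\n", w);
    return 0;
}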

Local area networks (LANs) manage complexity by using layering that is based on the OSI model. These days, LANs are typically connected to wide area networks (WANs), most notably, the Internet. The Internet is based on the TCP/IP protocol suite. User data is encapsulated at the Application, Transport, Network, and Link layers, and is sent through the Internet and de-encapsulated (demultiplexed) on the receiving system. As it is passed through the Internet, the data traverses various transmission media that vary in bandwidth and distance capabilities.

■ FURTHER READING

(Needleman, 1990) and (Schnaidt, 1990) give a thorough treatment of local area networks according to the OSI model, and (Tanenbaum, 1996) is a good reference on network communication in general. (Halsall, 1996) gives a thorough and readable treatment of network media types. (dePrycker, 1993) gives an in-depth account of ATM and its characteristics.

(Tanenbaum, 1999) and (Stallings, 1996) give readable explanations of Hamming encoding. (Hamming, 1986) and (Peterson and Weldon, 1972) give more detailed treatments of error-correcting codes.

Halsall, F., Data Communications, Computer Networks, and Open Systems, 4/e, Addison-Wesley, (1996).

Hamming, R. W., Coding and Information Theory, 2/e, Prentice-Hall, (1986).

Needleman, R., Understanding Networks, Simon and Schuster, New York, (1990).

Peterson, W. Wesley and E. J. Weldon, Jr., Error-Correcting Codes, 2/e, The MIT Press, (1972).

de Prycker, Martin, Asynchronous Transfer Mode: Solution for Broadband ISDN, 2/e, Ellis Horwood, (1993).

Schnaidt, P., LAN Tutorial, Miller Freeman Publications, California, (1990).

Stallings, W., Computer Organization and Architecture: Designing for Performance, 4/e, Prentice Hall, Upper Saddle River, (1996).

Tanenbaum, A., Computer Networks, 3/e, Prentice Hall, Upper Saddle River, (1996).

Tanenbaum, A., Structured Computer Organization, 4/e, Prentice Hall, Englewood Cliffs, (1999).

■ PROBLEMS

9.1 What is the Hamming distance for the ASCII SEC code discussed in Section 9.4.2?

9.2 Construct the SEC code for the ASCII character ‘Q’ using even parity.

9.3 For parts (a) through (d) below, use a SEC code with even parity.

(a) How many check bits should be added to encode a six-bit word?

(b) Construct the SEC code for the six-bit word: 1 0 1 1 0 0. When constructing the code, number the bits from right to left starting with 1, as in the method described in Section 9.4.2.

(c) A receiver sees the SEC-encoded word 1 1 1 0 0, which encodes a two-bit pattern. What is the original two-bit pattern?

(d) The 12-bit word 1 0 1 1 1 0 0 1 1 0 0 1, complete with an SEC code (even parity), is received. What 12-bit word was actually sent?

9.4 How many check bits are needed for a SEC code for an initial word size of 1024?

9.5 Construct a checksum word for EBCDIC characters ‘V’ through ‘Z’ using vertical redundancy checking with even parity. DO NOT use longitudinal redundancy checking.

9.6 Compare the number of bits used for parity in the SEC code with the simple parity VRC code, for 1024 eight-bit characters:

(a) Compute the number of check bits generated using SEC only (horizontally).

(b) Compute the number of checksum bits using VRC only.

9.7 The SEC code discussed in Section 9.4.2 can be turned into a double error detecting/SEC (DED/SEC) code by adding one more bit that creates even parity over the SEC code (which includes the parity bit being added). Explain how double error detection works while also maintaining single error correction with this approach.

9.8 Compute the CRC for a message to be transmitted M(x) = 101100110 and a generator polynomial G(x) = x^3 + x^2 + 1.

9.9 What is the longest burst error that CRC-32 is sure to catch?

9.10 To which IPv4 class does address 165.230.140.67 belong?

9.11 How many networks (not hosts) can the IPv4 class A, B, and C addresses support? That is, how many distinct class A, B, and C network addresses can there be? Do not consider reserved addresses.

9.12 Network media almost always carry data in bit-serial fashion, and virtually never in parallel. That is not to say that data could not be carried over a network in byte-parallel or word-parallel fashion; there simply is no advantage to doing it this way. To see why this is the case, calculate the time required to transmit a 32-bit word between two computers over a 32-foot network. The network speed is 1 Gbps per channel. The time-of-flight delay imposed by the distance is 1 ns per foot. Calculate the time to transmit the 32-bit word using a single channel (bit-serial fashion) and using 32 channels (word-parallel fashion).

9.13 An ATM network has two switches A and B that switch on the virtual path identifier only. The network topology and the routing tables for the

image

 

Communication: case study: asynchronous transfer mode (synchronous vs. asynchronous transfer mode, what is atm?, atm network architecture and outlook on atm).

9.4 Case Study: Asynchronous Transfer Mode

Historically, there have been different networks that carry different types of information:

• Telex (old-style teletypewriters, used for news feeds, stock quotes, etc.);

• “Plain old telephone service” (POTS) via the public switched telephone network (PSTN);

• Data, via the packet switched data network (PSDN);

• Television via: (1) ground based antennae; (2) community antennae TV (CATV, known as cable TV); and (3) satellites;

• Local data via local area networks (LANs).

image

Each network is separately planned, dimensioned, developed, and operated. This specialization results in independent worldwide networks. Bandwidth is not shared among the networks, yet when video traffic peaks at prime time, telephone traffic lulls. Although there may always be video programs available on a particular channel at all hours of the day, less-expensive local programming may be used in the off-peak hours, reducing the need for video traffic via satellites or via long distance wireline networks.

Each type of service needs to deal with peak traffic and bursty traffic on its own, although it would be more economical to share networks as long as the peaks and bursts do not coincide. An attempt at sharing is Integrated Services Digital Network (ISDN), which comes in two forms. Narrowband-ISDN (NISDN) is designed for 64 Kb/s telephone switching. NISDN allows voice and data traffic to be carried over the same network. Broadband-ISDN (BISDN) supports video traffic as well as voice and data traffic.

The ISDN Basic Rate Interface (BRI) is used for NISDN. The BRI offers two Bearer (B) channels at 64 Kb/s per channel, for user data, and one Data (D) channel at 16 Kb/s which is used for control and signaling. The total bit rate is 192 Kb/s when higher level framing overhead is taken into account, but the maximum rate available to the user is 128 Kb/s when the two B channels are “bonded” into a single channel.

The ISDN Primary Rate Interface (PRI) offers 23 B channels and one 64 Kb/s D channel, and is commonly known as a “T1 line.” ISDN lines can be leased from telecommunication companies and can be configured to set up private networks. As a general rule of thumb, the monthly rate for leasing ten 64 Kb/s lines equals the rate for leasing a T1 line that supports 23 64 Kb/s lines. A problem with the overall economics of ISDN is that the supported bit rates are service dependent (NISDN is an example of a service), as opposed to traffic dependent, which is what modern networks need. Services must fall into even multiples of B and D channels in the offered ratios, or else they will not make efficient use of the available bandwidth. We may not know what bit rates future services will need, and so we need a service-independent bit transport for a network to be extensible. The goal is to have a single network that can serve everyone, which is where asynchronous transfer mode (ATM) comes in.

9.4.1 SYNCHRONOUS VS. ASYNCHRONOUS TRANSFER MODE

With synchronous transfer mode (STM), also known as time division multiplexing (TDM), a data stream is made up of time slots that are assigned to channels in a round robin fashion. Figure 9-20a illustrates TDM for four channels. A station can only send during a preassigned time slot. Other unused time slots may go by while a station waits for its preassigned time slot. (In the public switched telephone network, each channel receives one time slot every 125 µs, which allows for 8000 samples per second at 8 bits per sample, resulting in 64 Kb/s for a voice-grade line.)

Figure 9-20b shows asynchronous transfer mode, which may still use 125 µs framing, but now, any station can use any slot, and the network is used more efficiently. On the down side, operating an ATM network is much more complex than operating a simpler TDM network.
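The difference can be seen in a toy slot-assignment model. The four-channel traffic pattern below is invented purely for illustration; it shows preassigned TDM slots going idle while ATM gives every slot to a channel that has data.

#include <stdio.h>

/* Contrast of slot assignment: with TDM (STM), each of four channels owns
 * every fourth slot whether or not it has data; with ATM, any channel with
 * data may claim the next free slot. */
int main(void)
{
    int has_data[4] = { 1, 0, 0, 1 };   /* only channels 0 and 3 are active */

    printf("TDM: ");
    for (int slot = 0; slot < 8; slot++) {
        int ch = slot % 4;              /* preassigned owner of this slot */
        printf(has_data[ch] ? "%d " : "- ", ch);
    }

    printf("\nATM: ");
    for (int slot = 0, ch = 0; slot < 8; slot++) {
        while (!has_data[ch])           /* skip channels with nothing to send */
            ch = (ch + 1) % 4;
        printf("%d ", ch);
        ch = (ch + 1) % 4;
    }
    printf("\n");
    return 0;
}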

9.4.2 WHAT IS ATM?

ATM is a combination of hardware and a set of protocols that delivers a guaranteed bandwidth with a bounded, low latency. Two enabling areas of progress that make ATM possible are:

image

1) Technology – The speed and density of VLSI allow network functionality to be pushed into the end systems rather than spread throughout the intermediary components. Error detection is only performed at the endpoints, for example, instead of at every intermediate router, as is common for the Internet. The development of high-bandwidth, low bit-error-rate optical fiber is a supporting technology that makes this approach practical.

2) Systems – Fast packet switches that introduce a low latency into end-to-end communication enable the delivery of real-time services.

Typical speeds for ATM are in the range of 1 Gb/s – 2.5 Gb/s, with an average delay of only 450 µs for each switch that handles an ATM packet.

9.4.3 ATM NETWORK ARCHITECTURE

Before an ATM transfer takes place, a connection must be established by contacting a signaling control point (SCP) which is a network device that has the authority to configure ATM switches for the transfer. All packets are constrained to follow the path set up by the SCP, and all packets arrive in order.

The job of the switch is very simple: it simply looks at the destination address in the header, and then it sends the packet on the path indicated in the header. A new destination address is placed in the header, as described further below.

All ATM packets have the same fixed size of 53 bytes, as shown in Figure 9-21.

image

The packet created by an end system for injection into the network, which is known as the user-to-network interface (UNI) format, is shown in Figure 9-21a, and the packet format used from that point onward, which is known as the network-to-network interface (NNI) format, is shown in Figure 9-21b.

The generic flow control (GFC) field is used at the network boundary to police how packets are allowed to enter the network. Once a packet is in the network, the GFC field of the UNI format is no longer needed, and it is combined with the 8-bit virtual path identifier (VPI) field to form a 12-bit VPI field in the NNI format. The VPI field can be thought of as identifying one particular home served by a cable that carries many TV channels, whereas the virtual channel identifier (VCI) field identifies a specific channel on that cable. The payload type identifier (PTI) identifies whether the data field carries user data or network data, and other information. The cell loss priority (CLP) bit determines whether this cell can be dropped during times of congestion. The header error control (HEC) is a CRC over the header. The 48-byte payload field follows.
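A struct can make the header layout easier to follow. The sketch below is only a schematic of the NNI format: C bit-fields are not guaranteed to pack into the exact 5-byte wire encoding, so it records the field widths rather than a wire-accurate header.

#include <stdio.h>
#include <stdint.h>

/* Schematic of a 53-byte ATM cell in NNI format: a 5-byte header (12-bit
 * VPI, 16-bit VCI, 3-bit PTI, 1-bit CLP, 8-bit HEC) plus a 48-byte payload. */
struct atm_cell_nni {
    unsigned int vpi : 12;   /* virtual path identifier               */
    unsigned int vci : 16;   /* virtual channel identifier            */
    unsigned int pti : 3;    /* payload type identifier               */
    unsigned int clp : 1;    /* cell loss priority                    */
    uint8_t      hec;        /* header error control: CRC over header */
    uint8_t      payload[48];
};

int main(void)
{
    struct atm_cell_nni cell = { .vpi = 7, .vci = 42, .pti = 0, .clp = 0 };
    printf("cell for VPI %u, VCI %u\n", (unsigned)cell.vpi, (unsigned)cell.vci);
    return 0;
}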

Figure 9-22 shows a simple ATM network that consists of three switches. Each switch has a symmetric routing table that can be driven in either direction. For example, at ATM Switch #1, an incoming packet (on the left) that has a VPI field of 7 is sent to the right after changing the VPI field to 5. Likewise, a packet that comes into ATM Switch #1 from the right with a VPI field of 5 is sent to the left after changing the VPI field to 7. This is referred to as “virtual path switching” because only the VPI field is used in routing the packet. The VCI field is unchanged. There can also be “virtual channel switching,” in which routing is done based on the VCI field, and the VPI field is left unchanged.

image
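The switching step itself is small enough to sketch directly. The port numbers and table entries below are hypothetical, chosen to mimic the symmetric 7-to-5 mapping just described for Switch #1.

#include <stdio.h>

/* Virtual path switching reduced to its essentials: key on (input port, VPI),
 * rewrite the VPI, and forward on the output port; the VCI is untouched. */
struct vp_entry { int in_port; unsigned vpi_in; int out_port; unsigned vpi_out; };

static const struct vp_entry table[] = {
    { 0, 7, 1, 5 },   /* left to right: VPI 7 becomes VPI 5 */
    { 1, 5, 0, 7 },   /* right to left: VPI 5 becomes VPI 7 */
};

static int vp_switch(int in_port, unsigned *vpi)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].in_port == in_port && table[i].vpi_in == *vpi) {
            *vpi = table[i].vpi_out;
            return table[i].out_port;
        }
    return -1;                        /* no entry: the cell is dropped */
}

int main(void)
{
    unsigned vpi = 7;
    int port = vp_switch(0, &vpi);
    printf("forward on port %d with VPI %u\n", port, vpi);
    return 0;
}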

9.4.4 OUTLOOK ON ATM

Although ATM holds a great deal of promise for an extensible network that serves many user needs, economies of scale favor legacy networks such as Ethernet for end-user systems. ATM appears mostly in backbone networks, with multiplexors and concentrators used for connecting legacy networks at the boundary of the backbone to the backbone itself. ATM continues its penetration into backbone networks and also to the desktop, which is currently the exception rather than the rule. Whether ATM makes it to the desktop pervasively, only time will tell.