Chapter 5. Microprocessors
Microprocessors are the most complicated devices ever created by human beings. But don’t despair. A microprocessor is complicated the same way a castle you might build from Legos is complicated. You can easily understand each of the individual parts of a microprocessor and how they fit together, just as you might build a fort or castle from a legion of blocks of assorted shapes. The metaphor is more apt than you might think. Engineers, in fact, design microprocessors as a set of functional blocks—not physical blocks like Legos, but blocks of electronic circuits that perform specific functions. And they don’t fit them together on the floor of a giant playroom. Most of their work is mental, a lot of it performed on computers. Yes, you need a computer to design a computer these days. Using a computer is faster; it can give engineers insight by showing them ideas they could previously only imagine, and it doesn’t leave a mess of blocks on the floor that Mom makes them clean up.
In creating a microprocessor, a team of engineers may work on each one of the blocks that makes up the final chip, and the work of one team may be almost entirely unfathomable by another team. That’s okay, because all that teams need to know about the parts besides their own is what those other parts do and how they fit together, not how they are made. All the complex details—the logic gates from which each block is built—are irrelevant to anyone not designing that block. It’s like those Legos. You might know a Lego block is made from molecules of plastic, but you don’t need to know about the carbon backbones and the side-chains that make up each molecule to put the blocks together.
Legos come in different shapes. The functional blocks of a microprocessor have different functions. We’ll start our look at microprocessors by defining what those functions are. Then we’ll leave the playroom behind and look at how you make a microprocessor operate by examining the instructions that it uses. This detour is more important than you might think. Much of the design work in creating a series of microprocessors goes into deciding exactly what instructions it must carry out—which instructions are most useful for solving problems and doing so quickest. From there, we’ll look at how engineers have used their imaginations to design ever-faster microprocessors and at the tricks they use to coax more speed from each new generation of chip.
Microprocessor designers can’t just play with theory in creating microprocessors. They have to deal with real-world issues. Somehow, machines must make the chips. Once they are made, they have to operate—and keep operating. Hard reality puts some tight constraints on microprocessor design. If engineers aren’t careful, microprocessors can become miniature incinerators, burning themselves up. We’ll take a look at some of these real-world issues that guide microprocessor design—including electricity, heat, and packaging, all of which work together (and at times, against the design engineer).
Next, we’ll look at real microprocessors, the chips you can actually buy. We’ll start with a brief history lesson to put today’s commercial offerings in perspective. And, finally, we’ll look at the chips in today’s (and tomorrow’s) computers to see which is meant for what purpose—and which is best for your own computer.
Every modern microprocessor starts with the basics—clocked-logic digital circuitry. The chip has millions of separate gates combined into three basic function blocks: the input/output unit (or I/O unit), the control unit, and the arithmetic/logic unit (ALU). The last two are sometimes jointly called the central processing unit (CPU), although the same term often is used as a synonym for the entire microprocessor. Some chipmakers further subdivide these units, give them other names, or include more than one of each in a particular microprocessor. In any case, the functions of these three units are an inherent part of any chip. The differences are mostly a matter of nomenclature, because you can understand the entire operation of any microprocessor as a product of these three functions.
All three parts of the microprocessor interact together. In all but the simplest microprocessor designs, the I/O unit is under the control of the control unit, and the operation of the control unit may be determined by the results of calculations of the arithmetic/logic unit. The combination of the three parts determines the power and performance of the microprocessor.
Each part of the microprocessor also has its own effect on the processing speed of the system. The control unit operates the microprocessor’s internal clock, which determines the rate at which the chip operates. The I/O unit determines the bus width of the microprocessor, which influences how quickly data and instructions can be moved in and out of the microprocessor. And the registers in the arithmetic/logic unit determine how much data the microprocessor can operate on at one time.
The input/output unit links the microprocessor to the rest of the circuitry of the computer, passing along program instructions and data to the registers of the control unit and arithmetic/logic unit. The I/O unit matches the signal levels and timing of the microprocessor’s internal solid-state circuitry to the requirements of the other components inside the computer. The internal circuits of a microprocessor, for example, are designed to be stingy with electricity so that they can operate faster and cooler. These delicate internal circuits cannot handle the higher currents needed to link to external components. Consequently, each signal leaving the microprocessor goes through a signal buffer in the I/O unit that boosts its current capacity.
The input/output unit can be as simple as a few buffers, or it may involve many complex functions. In the latest Intel microprocessors used in some of the most powerful computers, the I/O unit includes cache memory and clock-doubling or -tripling logic to match the high operating speed of the microprocessor to slower external memory.
The microprocessors used in computers have two kinds of external connections to their input/output units: those connections that indicate the address of memory locations to or from which the microprocessor will send or receive data or instructions, and those connections that carry the data or instructions themselves. The former is called the address bus of the microprocessor; the latter, the data bus.
The number of bits in the data bus of a microprocessor directly influences how quickly it can move information. The more bits that a chip can use at a time, the faster it is. The first microprocessors had data buses only four bits wide. Pentium chips and the related Athlon, Celeron, and Duron chips use 64-bit data buses, although their internal registers remain 32 bits wide. Itanium and Opteron chips extend the registers themselves to 64 bits.
The number of bits available on the address bus influences how much memory a microprocessor can address. A microprocessor with 16 address lines, for example, can directly work with 2¹⁶ addresses; that’s 65,536 (or 64K) different memory locations. The different microprocessors used in various computers span a range of address bus widths from 32 to 64 or more bits.
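The arithmetic behind this is a simple power of two: every additional address line doubles the number of locations the chip can reach. A quick sketch (purely illustrative, not from the original text):

```python
# Each address line carries one bit, so n lines address 2**n locations.
def addressable_locations(address_lines: int) -> int:
    return 2 ** address_lines

print(addressable_locations(16))  # 65,536 locations (64K)
print(addressable_locations(20))  # 1,048,576 locations (1MB)
print(addressable_locations(32))  # 4,294,967,296 locations (4GB)
```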
The range of bit addresses used by a microprocessor and the physical number of address lines of the chip no longer correspond. That’s because people and microprocessors look at memory differently. Although people tend to think of memory in terms of bytes, each comprising eight bits, microprocessors now deal in larger chunks of data, corresponding to the number of bits in their data buses. For example, a Pentium chip chews into data 32 bits at a time, so it doesn’t need to look to individual bytes. It swallows them four at a time. Chipmakers consequently omit the address lines needed to distinguish chunks of memory smaller than their data buses. This bit of frugality saves the number of connections the chip needs to make with the computer’s circuitry, an issue that becomes important once you see (as you will later) that the modern microprocessor requires several hundred external connections—each prone to failure.
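The bookkeeping described above can be sketched in a couple of lines: with a data bus four bytes wide, the two lowest bits of a byte address merely select a byte within a chunk, so the chip needs external address lines only for the chunk number (a hypothetical illustration, not any particular chip’s wiring):

```python
# With a 4-byte-wide data bus, a byte address splits into a chunk
# address (what the external address lines carry) and a byte-within-chunk
# offset (resolved inside the chip, needing no external lines).
def split_address(byte_address: int, bus_bytes: int = 4):
    chunk = byte_address // bus_bytes   # sent on the address bus
    offset = byte_address % bus_bytes   # which byte inside the chunk
    return chunk, offset

print(split_address(1027))  # chunk 256, byte 3 within that chunk
```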
The control unit of a microprocessor is a clocked logic circuit that, as its name implies, controls the operation of the entire chip. Unlike more common integrated circuits, whose function is fixed by hardware design, the control unit is more flexible. The control unit follows the instructions contained in an external program and tells the arithmetic/logic unit what to do. The control unit receives instructions from the I/O unit, translates them into a form that can be understood by the arithmetic/logic unit, and keeps track of which step of the program is being executed.
With the increasing complexity of microprocessors, the control unit has become more sophisticated. In the basic Pentium, for example, the control unit must decide how to route signals between what amounts to two separate processing units called pipelines. In other advanced microprocessors, the function of the control unit is split among other functional blocks, such as those that specialize in evaluating and handling branches in the stream of instructions.
The arithmetic/logic unit handles all the decision-making operations (the mathematical computations and logic functions) performed by the microprocessor. The unit takes the instructions decoded by the control unit and either carries them out directly or executes the appropriate microcode (see the section titled “Microcode” later in this chapter) to modify the data contained in its registers. The results are passed back out of the microprocessor through the I/O unit.
The first microprocessors had but one ALU. Modern chips may have several, which commonly are classed into two types. The basic form is the integer unit, one that carries out only the simplest mathematical operations. More powerful microprocessors also include one or more floating-point units, which handle advanced math operations (such as trigonometric and transcendental functions), typically at greater precision.
Although functionally a floating-point unit is part of the arithmetic/logic unit, engineers often discuss it separately because the floating-point unit is designed to process only floating-point numbers and not to take care of ordinary math or logic operations.
Floating-point describes a way of expressing values, not a mathematically defined type of number such as an integer, rational, or real number. The essence of a floating-point number is that its decimal point “floats” between a predefined number of significant digits rather than being fixed in place the way dollar values always have two decimal places.
Mathematically speaking, a floating-point number has three parts: a sign, which indicates whether the number is greater or less than zero; a significand (sometimes called a mantissa), which comprises all the digits that are mathematically meaningful; and an exponent, which determines the order of magnitude of the significand (essentially the location to which the decimal point floats). Think of a floating-point number as being like those represented by scientific notation. But whereas scientists are apt to deal in base-10 (the exponents in scientific notation are powers of 10), floating-point units think of numbers digitally in base-2 (all ones and zeros in powers of two).
As a practical matter, the form of floating-point numbers used in computer calculations follows standards laid down by the Institute of Electrical and Electronic Engineers (IEEE). The IEEE formats take values that can be represented in binary form using 80 bits. Although 80 bits seems somewhat arbitrary in a computer world that’s based on powers of two and a steady doubling of register size from 8 to 16 to 32 to 64 bits, it’s exactly the right size to accommodate 64 bits of the significand, with 15 bits left over to hold an exponent value and an extra bit for the sign of the number held in the register. Although the IEEE standard allows for 32-bit and 64-bit floating-point values, most floating-point units are designed to accommodate the full 80-bit values. The floating-point unit (FPU) carries out all its calculations using the full 80 bits of the chip’s registers, unlike the integer unit, which can independently manipulate its registers in byte-wide pieces.
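You can pick a floating-point value apart into these three fields yourself. The sketch below uses the 64-bit IEEE format (1 sign bit, 11 exponent bits, 52 significand bits), since Python’s standard `struct` module exposes that format directly; the 80-bit extended format works the same way, only with a 15-bit exponent and a 64-bit significand:

```python
import struct

# Decompose a 64-bit IEEE double into its sign, unbiased exponent,
# and significand fraction bits.
def decompose(value: float):
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)  # 52 explicit fraction bits
    return sign, exponent - 1023, significand

print(decompose(1.0))   # sign 0, exponent 0, fraction 0
print(decompose(-6.0))  # sign 1, exponent 2 (i.e., -1.5 x 2**2)
```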
The floating-point units of Intel-architecture processors have eight of these 80-bit registers in which to perform their calculations. Instructions in your programs tell the microprocessor whether to use its ordinary integer ALU or its floating-point unit to carry out a mathematical operation. The different instructions are important because the eight 80-bit registers in Intel floating-point units also differ from integer units in the way they are addressed. Commands for integer unit registers are directly routed to the appropriate register as if sent by a switchboard. Floating-point unit registers are arranged in a stack, sort of an elevator system. Values are pushed onto the stack, and with each new number the old one goes down one level. Stack machines are generally regarded as lean and mean computers. Their design is austere and streamlined, which helps them run more quickly. The same holds true for stack-oriented floating-point units.
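The stack arrangement can be modeled in a few lines. In the toy sketch below (an illustration only, not Intel’s circuitry), a newly pushed value becomes the top of the stack and every older value sinks one level, in contrast to integer registers, which are addressed directly by name:

```python
# Toy model of an eight-register floating-point stack: values are
# pushed on top, and each earlier value sinks one level deeper.
class FPStack:
    def __init__(self):
        self.regs = []

    def push(self, value):
        if len(self.regs) == 8:
            raise OverflowError("FP stack full")
        self.regs.insert(0, value)  # new value becomes the top, ST(0)

    def st(self, i):
        return self.regs[i]  # ST(0) is the top of the stack

fp = FPStack()
fp.push(3.0)
fp.push(4.0)
print(fp.st(0), fp.st(1))  # 4.0 3.0 -- the older value sank a level
```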
Until the advent of the Pentium, a floating-point unit was not a guaranteed part of a microprocessor. Some 486 and all previous chips omitted floating-point circuitry. The floating-point circuitry simply added too much to the complexity of the chip, at least for the state of fabrication technology at that time. To cut costs, chipmakers simply left the floating-point unit as an option.
When it was necessary to accelerate numeric operations, the earliest microprocessors used in computers allowed you to add an additional, optional chip to your computer to accelerate the calculation of floating-point values. These external floating-point units were termed math coprocessors.
The floating-point units of modern microprocessors have evolved beyond mere number-crunching. They have been optimized to reflect the applications for which computers most often crunch floating-point numbers—graphics and multimedia (calculating dots, shapes, colors, depth, and action on your screen display).
Instructions are the basic units for telling a microprocessor what to do. Internally, the circuitry of the microprocessor has to carry out hundreds, thousands, or even millions of logic operations to carry out one instruction. The instruction, in effect, triggers a cascade of logical operations. How this cascade is controlled marks the great divide in microprocessor and computer design.
The first electronic computers used a hard-wired design. An instruction simply activated the circuits appropriate for carrying out all the steps required. This design has its advantages. It optimizes the speed of the system because the direct hard-wire connection adds nothing to slow down the system. Simplicity means speed, and the hard-wired approach is the simplest. Moreover, the hard-wired design was the practical and obvious choice. After all, computers were so new that no one had thought up any alternative.
However, the hard-wired computer design has a significant drawback. It ties the hardware and software together into a single unit. Any change in the hardware must be reflected in the software. A modification to the computer means that programs have to be modified. A new computer design may require that programs be entirely rewritten from the ground up.
The inspiration for breaking away from the hard-wired approach was the need for flexibility in instruction sets. Throughout most of the history of computing, determining exactly what instructions should make up a machine’s instruction set was more an art than a science. IBM’s first commercial computers, the 701 and 702, were designed more from intuition than from any study of which instructions programmers would need to use. Each machine was tailored to a specific application. The 701 ran instructions thought to serve scientific users; the 702 had instructions aimed at business and commercial applications.
When IBM tried to unite its many application-specific computers into a single, more general-purpose line, these instruction sets were combined so that one machine could satisfy all needs. The result was, of course, a wide, varied, and complex set of instructions. The new machine, the IBM 360 (introduced in 1964), was unlike previous computers in that it was created not as hardware but as an architecture. IBM developed specifications and rules for how the machine would operate but enabled the actual machine to be created from any hardware implementation designers found most expedient. In other words, IBM defined the instructions that the 360 would use but not the circuitry that would carry them out. Previous computers used instructions that directly controlled the underlying hardware. To adapt the instructions defined by the architecture to the actual hardware that made up the machine, IBM adopted an idea called microcode, originally conceived by Maurice Wilkes at Cambridge University.
In the microcode design, an instruction causes a computer to execute a small program to carry out the logic instructions required by the instruction. The collection of small programs for all the instructions the computer understands is its microcode.
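The idea can be sketched compactly: each architectural instruction is just a key into a store of small micro-programs, and executing the instruction means stepping through its micro-operations. The instruction and micro-op names below are invented for illustration:

```python
# Hypothetical microcode store: each visible instruction expands into
# a sequence of primitive micro-operations the hardware actually performs.
MICROCODE = {
    "ADD_MEM": ["fetch_operand", "load_register", "alu_add", "store_result"],
    "INC":     ["load_register", "alu_add"],
}

def execute(instruction):
    # Running one instruction means running its whole micro-program.
    return [micro_op for micro_op in MICROCODE[instruction]]

print(execute("ADD_MEM"))  # one instruction, four internal micro-steps
```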
Although the additional layer of microcode made machines more complex, it added a great deal of design flexibility. Engineers could incorporate whatever new technologies they wanted inside the computer, yet still run the same software with the same instructions originally written for older designs. In other words, microcode enabled new hardware designs and computer systems to have backward compatibility with earlier machines.
After the introduction of the IBM 360, nearly all mainframe computers used microcode. When microprocessors came along, they followed the same design philosophy, using microcode to match instructions to hardware. Using this design, a microprocessor actually has a smaller processor inside it, sometimes called a nanoprocessor, running the microcode.
This microcode-and-nanoprocessor approach makes creating a complex microprocessor easier. The powerful data-processing circuitry of the chip can be designed independently of the instructions it must carry out. The manner in which the chip handles its complex instructions can be fine-tuned even after the architecture of the main circuits is laid into place. Bugs in the design can be fixed relatively quickly by altering the microcode, which is an easy operation compared to the alternative of developing a new design for the whole chip (a task that’s not trivial when millions of transistors are involved). The rich instruction set fostered by microcode also makes writing software for the microprocessor (and computers built from it) easier, reducing the number of instructions needed for each operation.
Microcode has a big disadvantage, however. It makes computers and microprocessors more complicated. In a microprocessor, the nanoprocessor must go through several of its own microcode instructions to carry out every instruction you send to the microprocessor. More steps means more processing time taken for each instruction. Extra processing time means slower operation. Engineers found that microcode had its own way to compensate for its performance penalty—complex instructions.
Using microcode, computer designers could easily give an architecture a rich repertoire of instructions that carry out elaborate functions. A single, complex instruction might do the job of half a dozen or more simpler instructions. Although each instruction would take longer to execute because of the microcode, programs would need fewer instructions overall. Moreover, adding more instructions could boost speed. One result of this microcode “more is merrier” instruction approach is that typical computer microprocessors have seven different subtraction commands.
Although long the mainstay of computer and microprocessor design, microcode is not necessary. While system architects were staying up nights concocting ever more powerful and obscure instructions, a counter force was gathering. Starting in the 1970s, the microcode approach came under attack by researchers who claimed it takes a greater toll on performance than its benefits justify.
By eliminating microcode, this design camp believed, simpler instructions could be executed at speeds so much higher that no degree of instruction complexity could compensate. By necessity, such hard-wired machines would offer only a few instructions because the complexity of their hard-wired circuitry would increase dramatically with every additional instruction added. Practical designs are best made with small instruction sets.
John Cocke at IBM’s Yorktown Research Laboratory analyzed the usage of instructions by computers and discovered that most of the work done by computers involves relatively few instructions. Given a computer with a set of 200 instructions, for example, two-thirds of its processing involves using as few as 10 of the total instructions. Cocke went on to design a computer that was based on a few instructions that could be executed quickly. He is credited with inventing the Reduced Instruction Set Computer (RISC) in 1974. The term RISC itself is credited to David Patterson, who used it in a course in microprocessor design at the University of California at Berkeley in 1980.
The first chip to bear the label and to take advantage of Cocke’s discoveries was RISC-I, a laboratory design that was completed in 1982. To distinguish this new design approach from traditional microprocessors, microcode-based systems with large instruction sets have come to be known as Complex Instruction Set Computers (CISC).
Cocke’s research showed that most of the computing was done by basic instructions, not by the more powerful, complex, and specialized instructions. Further research at Berkeley and Stanford Universities demonstrated that there were even instances in which a sequence of simple instructions could perform a complex task faster than a single complex instruction could. The result of this research is often summarized as the 80/20 Rule, meaning that about 20 percent of a computer’s instructions do about 80 percent of the work. The aim of the RISC design is to optimize a computer’s performance for that 20 percent of instructions, speeding up their execution as much as possible. The remaining 80 percent of the commands could be duplicated, when necessary, by combinations of the quick 20 percent. Analysis and practical experience have shown that the 20 percent could be made so much faster that the overhead required to emulate the remaining 80 percent was no handicap at all.
To enable a microprocessor to carry out all the required functions with a handful of instructions requires a rethinking of the programming process. Instead of simply translating human instructions into machine-readable form, the compilers used by RISC processors attempt to find the optimum instructions to use. The compiler takes a more in-depth look at the requested operations and finds the best way to handle them. The result was the creation of optimizing compilers discussed in Chapter 3, “Software.”
In effect, the RISC design shifts a lot of the processing from the microprocessor to the compiler—a lot of the work in running a program gets taken care of before the program actually runs. Of course, the compiler does more work and takes longer to run, but that’s a fair tradeoff—a program needs to be compiled only once but runs many, many times, when the streamlined execution really pays off.
RISC microprocessors have several distinguishing characteristics. Most instructions execute in a single clock cycle—or even faster with advanced microprocessor designs with several execution pathways. All the instructions are the same length with similar syntax. The processor itself does not use microcode; instead, the small repertory of instructions is hard-wired into the chip. RISC instructions operate only on data in the registers of the chip, not in memory, making what is called a load-store design. The design of the chip itself is relatively simple, with comparatively few logic gates that are themselves constructed from simple, almost cookie-cutter designs. And most of the hard work is shifted from the microprocessor itself to the compiler.
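The load-store restriction mentioned above can be made concrete with a toy model (invented for illustration, not any real instruction set): arithmetic works only register-to-register, so operating on values in memory takes explicit loads and stores around the operation.

```python
# Toy load-store machine: only LOAD and STORE touch memory;
# ADD works strictly on registers, as in a RISC design.
memory = {0x100: 7, 0x104: 5, 0x108: 0}
registers = {"r1": 0, "r2": 0, "r3": 0}

def load(reg, addr):  registers[reg] = memory[addr]
def store(addr, reg): memory[addr] = registers[reg]
def add(dest, a, b):  registers[dest] = registers[a] + registers[b]

# Adding two values in memory takes four simple instructions
# instead of one complex memory-to-memory instruction:
load("r1", 0x100)
load("r2", 0x104)
add("r3", "r1", "r2")
store(0x108, "r3")
print(memory[0x108])  # 12
```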
Both CISC and RISC offer compelling design rationales and performance, desirable enough that engineers working on one kind of chip often looked over the shoulders of those working in the other camp. As a result, they developed hybrid chips embodying elements of both the CISC and RISC designs. All the latest processors—from the Pentium Pro to the Pentium 4, Athlon, and Duron as well—have RISC cores mated with complex instruction sets.
The basic technique involves converting the classic Intel instructions into RISC-style instructions to be processed by the internal chip circuitry. Intel calls the internal RISC-like instructions micro-ops. The term is often abbreviated as uops (strictly speaking, the initial u should be the Greek letter mu, which is an abbreviation for micro) and pronounced you-ops. Other companies use slightly different terminology.
By design, the micro-ops sidestep the primary shortcomings of the Intel instruction set by making the encoding of all commands more uniform, converting all instructions to the same length for processing, and eliminating arithmetic operations that directly change memory by loading memory data into registers before processing.
The translation to RISC-like instructions allows the microprocessor to function internally as a RISC engine. The code conversion occurs in hardware, completely invisible to your applications and out of the control of programmers. In other words, it shifts work that RISC designs handed to the compiler back into the hardware. There’s a good reason for this backward shift: It lets the RISC core deal with existing programs—those compiled before the RISC designs were created.
Single Instruction, Multiple Data
In a quest to improve the performance of Intel microprocessors on common multimedia tasks, Intel’s hardware and software engineers analyzed the operations multimedia programs most often required. They then sought the most efficient way to enable their chips to carry out these operations. They essentially worked to enhance the signal-processing capabilities of their general-purpose microprocessors so that they would be competitive with dedicated processors, such as digital signal processor (DSP) chips. They called the technology they developed Single Instruction, Multiple Data (SIMD). In effect a new class of microprocessor instructions, SIMD is the enabling element of Intel’s MultiMedia Extensions (MMX) to its microprocessor command set. Intel further developed this technology to add its Streaming SIMD Extensions (SSE, once known as the Katmai New Instructions) to its Pentium III microprocessors to enhance their 3D processing power. The Pentium 4 further enhances SSE with more multimedia instructions to create what Intel calls SSE2.
As the name implies, SIMD allows one microprocessor instruction to operate across several bytes or words (or even larger blocks of data). In the MMX scheme of things, the SIMD instructions are matched to the 64-bit data buses of Intel’s Pentium and newer microprocessors. All data, whether it originates as bytes, words, or 32-bit double-words, gets packed into 64-bit form. Eight bytes, four words, or two double-words get packed into a single 64-bit package that, in turn, gets loaded into a 64-bit register in the microprocessor. One microprocessor instruction then manipulates the entire 64-bit block.
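You can mimic the packing idea in ordinary software with a so-called SWAR (SIMD-within-a-register) trick: eight bytes live in one 64-bit integer, and a single arithmetic expression updates all eight lanes while masking keeps a carry in one lane from spilling into its neighbor. This is an illustrative sketch of the principle, not Intel’s MMX implementation:

```python
LANE_HIGH = 0x8080808080808080  # top bit of each of the 8 byte lanes
LANE_LOW  = 0x7F7F7F7F7F7F7F7F  # remaining 7 bits of each lane

def packed_add_u8(a, b):
    # Lane-wise wrap-around add of eight packed unsigned bytes:
    # add the low 7 bits of every lane, then fold the top bits back
    # in with XOR so no carry crosses a lane boundary.
    return ((a & LANE_LOW) + (b & LANE_LOW)) ^ ((a ^ b) & LANE_HIGH)

a = int.from_bytes(bytes([1, 2, 3, 4, 5, 6, 7, 8]), "little")
b = int.from_bytes(bytes([10] * 8), "little")
result = packed_add_u8(a, b).to_bytes(8, "little")
print(list(result))  # [11, 12, 13, 14, 15, 16, 17, 18]
```

One "instruction" (the expression in `packed_add_u8`) has added 10 to all eight bytes at once, which is exactly the economy SIMD hardware exploits.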
Although the approach at first appears counterintuitive, it improves the handling of common graphic and audio data. In video processor applications, for example, it can trim the number of microprocessor clock cycles for some operations by 50 percent or more.
Very Long Instruction Words
Just as RISC started flowing into the product mainstream, a new idea started designers thinking in the opposite direction. Very long instruction word (VLIW) technology at first appears to run against the RISC stream by using long, complex instructions. In reality, VLIW is a refinement of RISC meant to better take advantage of superscalar microprocessors. Each very long instruction word is made from several RISC instructions. In a typical implementation, eight 32-bit RISC instructions combine to make one instruction word.
Ordinarily, combining RISC instructions would add little to overall speed. As with RISC, the secret of VLIW technology is in the software—the compiler that produces the final program code. The instructions in the long word are chosen so that they execute at the same time (or as close to it as possible) in parallel processing units in the superscalar microprocessor. The compiler chooses and arranges instructions to match the needs of the superscalar processor as best as possible, essentially taking the optimizing compiler one step further. In essence, the VLIW system takes advantage of preprocessing in the compiler to make the final code and microprocessor more efficient.
VLIW technology also takes advantage of the wider bus connections of the latest generation of microprocessors. Existing chips link to their support circuitry with 64-bit buses. Many have 128-bit internal buses. The 256-bit very long instruction words push further still, enabling a microprocessor to load several cycles’ worth of work in a single memory cycle. Transmeta’s Crusoe processor uses VLIW technology.
Functionally, the first microprocessors operated a lot like meat grinders. You put something in such as meat scraps, turned a crank, and something new and wonderful came out—a sausage. Microprocessors started with data and instructions and yielded answers, but operationally they were as simple and direct as turning a crank. Every operation carried out by the microprocessor clicked with a turn of the crank—one clock cycle, one operation.
Such a design is straightforward and almost elegant. But its wonderful simplicity imposes a heavy constraint. The computer’s clock becomes an unforgiving jailor, locking up the performance of the microprocessor. A chip with this turn-the-crank design is locked to the clock speed and can never improve its performance beyond one operation per clock cycle. The situation is worse than that. The use of microcode almost ensures that at least some instructions will require multiple clock cycles.
One way to speed up the execution of instructions is to reduce the number of internal steps the microprocessor must take for execution. That idea was the guiding principle behind the first RISC microprocessors and what made them so interesting to chip designers. Actually, however, step reduction can take one of two forms: making the microprocessor more complex so that steps can be combined or making the instructions simpler so that fewer steps are required. Both approaches have been used successfully by microprocessor designers—the former as CISC microprocessors, the latter as RISC.
Ideally, it would seem, executing one instruction every clock cycle would be the best anyone could hope for, the ultimate design goal. With conventional microprocessor designs, that would be true. But engineers have found another way to trim the clock cycles required by each instruction—by processing more than one instruction at the same time.
Two basic approaches to processing more instructions at once are pipelining and superscalar architecture. All modern microprocessors take advantage of these technologies as well as several other architectural refinements that help them carry out more instructions for every cycle of the system clock.
The operating speed of a microprocessor is usually called its clock speed, which describes the frequency at which the core logic of the chip operates. Clock speed is usually measured in megahertz (one million hertz or clock cycles per second) or gigahertz (a billion hertz). All else being equal, a higher number in megahertz means a faster microprocessor.
Faster does not necessarily mean the microprocessor will compute an answer more quickly, however. Different microprocessor designs can execute instructions more efficiently because there’s no one-to-one correspondence between instruction processing and clock speed. In fact, each new generation of microprocessor has been able to execute more instructions per clock cycle, so a new microprocessor can carry out more instructions at a given megahertz rating. At the same megahertz rating, a Pentium 4 is faster than a Pentium III. Why? Because of pipelining, superscalar architecture, and other design features.
Sometimes microprocessor-makers take advantage of this fact and claim that megahertz doesn’t matter. For example, AMD’s Athlon processors carry out more instructions per clock cycle than Intel’s Pentium III, so AMD stopped using megahertz numbers to describe its chips. Instead, it substituted model designations that hinted at the speed of a comparable Pentium chip. An Athlon XP 2200+ processes data as quickly as a Pentium 4 chip running at 2200MHz, although the Athlon chip actually operates at less than 2000MHz. With the introduction of its Itanium series of processors, Intel also made assertions that megahertz doesn’t matter because Itanium chips have clock speeds substantially lower than Pentium chips.
A further complication is software overhead. Microprocessor speed doesn’t affect the performance of Windows or its applications very much. That’s because the performance of Windows depends on the speed of your hard disk, video system, memory system, and other system resources as well as your microprocessor. Although a Windows system using a 2GHz processor will appear faster than a system with a 1GHz processor, it won’t be anywhere near twice as fast.
In other words, the megahertz rating of a microprocessor gives only rough guidance in comparing microprocessor performance in real-world applications. Faster is better, but a comparison of megahertz (or gigahertz) numbers does not necessarily express the relationship between the performance of two chips or computer systems.
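The relationship can be sketched with simple arithmetic. The clock rates and instructions-per-cycle figures below are hypothetical, chosen only to show how a chip with a lower clock but a higher instruction rate can finish more work each second:

```python
def instructions_per_second(clock_mhz, instructions_per_cycle):
    # Rough throughput: clock rate times average instructions per cycle.
    return clock_mhz * 1_000_000 * instructions_per_cycle

# Hypothetical chips: the IPC figures are illustrative, not measured.
older_chip = instructions_per_second(2200, 0.8)  # higher clock, lower IPC
newer_chip = instructions_per_second(1800, 1.1)  # lower clock, higher IPC
print(newer_chip > older_chip)  # True
```

Clock speed alone tells you neither number; only the product of the two describes how much work gets done.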
In older microprocessor designs, a chip works single-mindedly. It reads an instruction from memory, carries it out, step by step, and then advances to the next instruction. Each step requires at least one tick of the microprocessor’s clock. Pipelining enables a microprocessor to read an instruction, start to process it, and then, before finishing with the first instruction, read another instruction. Because every instruction requires several steps, each in a different part of the chip, several instructions can be worked on at once and passed along through the chip like a bucket brigade (or its more efficient alternative, the pipeline). Intel’s Pentium chips, for example, have four levels of pipelining. Up to four different instructions may be undergoing different phases of execution at the same time inside the chip. When operating at its best, pipelining reduces the multiple-step/multiple-clock-cycle processing of an instruction to a single clock cycle.
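The arithmetic of the bucket brigade is easy to sketch. Assuming an idealized pipeline that never stalls, the cycle counts work out like this:

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction runs through every stage before the next begins.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # Once the pipeline fills, one instruction completes per cycle.
    return n_stages + n_instructions - 1

# With a four-stage pipeline, as in the Pentium example above:
print(unpipelined_cycles(100, 4))  # 400 cycles
print(pipelined_cycles(100, 4))    # 103 cycles -- nearly one per clock
```

The longer the program runs, the closer the pipelined chip gets to its ideal of one instruction per clock cycle.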
Pipelining is very powerful, but it is also demanding. The pipeline must be carefully organized, and the parallel paths kept carefully in step. It’s sort of like a chorus singing a canon such as Frère Jacques—one missed beat and the harmony falls apart. If one of the execution stages delays, all the rest delay as well. The demands of pipelining push microprocessor designers to make all instructions execute in the same number of clock cycles. That way, keeping the pipeline in step is easier.
In general, the more stages to a pipeline, the greater acceleration it can offer. Intel has added superlatives to the pipeline name to convey the enhancement. Super-pipelining is Intel’s term for breaking the basic pipeline stages into several steps, resulting in a 12-stage design used for its Pentium Pro through Pentium III chips. Later, Intel further sliced the stages to create the current Pentium 4 chip with 20 stages, a design Intel calls hyper-pipelining.
Real-world programs conspire against lengthy pipelines, however. Nearly all programs branch. That is, their execution can take alternate paths down different instruction streams, depending on the results of calculations and decision-making. A pipeline can load up with instructions of one program branch before it discovers that another branch is the one the program is supposed to follow. In that case, the entire contents of the pipeline must be dumped and the whole thing loaded up again. The result is a lot of logical wheel-spinning and wasted time. The bigger the pipeline, the more time that’s wasted. The waste resulting from branching begins to outweigh the benefits of bigger pipelines in the vicinity of five stages.
Today’s most powerful microprocessors are adopting a technology called branch prediction logic to deal with this problem. The microprocessor makes its best guess at which branch a program will take as it is filling up the pipeline. It then executes these most likely instructions. Because the chip is guessing at what to do, this technology is sometimes called speculative execution.
When the microprocessor’s guesses turn out to be correct, the chip benefits from the multiple-pipeline stages and is able to run through more instructions than clock cycles. When the chip’s guess turns out wrong, however, it must discard the results obtained under speculation and execute the correct code. The chip marks the data in later pipeline stages as invalid and discards it. Although the chip doesn’t lose time—the program would have executed in the same order anyway—it does lose the extra boost bequeathed by the pipeline.
Modern microprocessors push this speculation a step further. The chip may carry out an instruction in a predicted branch before it confirms whether it has predicted the branch properly. If the prediction is correct, the instruction has already been executed, so the chip wastes no time. If the prediction was incorrect, the chip must execute a different instruction, which it would have had to do anyhow, so it suffers no penalty.
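A toy model shows why deep pipelines pay dearly for wrong guesses. The workload figures here (20 percent branches, 10 percent of them mispredicted) are assumptions chosen for illustration, and the penalty for a miss is modeled simply as refilling the whole pipeline:

```python
def avg_cycles_per_instruction(pipeline_depth, branch_fraction, mispredict_rate):
    # Idealized model: one cycle per instruction, plus a full pipeline
    # refill every time a branch prediction turns out wrong.
    flush_penalty = pipeline_depth
    return 1 + branch_fraction * mispredict_rate * flush_penalty

shallow = avg_cycles_per_instruction(5, 0.20, 0.10)   # roughly 1.1
deep = avg_cycles_per_instruction(20, 0.20, 0.10)     # roughly 1.4
print(shallow, deep)  # the deeper pipeline loses more to each bad guess
```

This is why better branch prediction becomes more valuable, not less, as pipelines grow longer.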
The steps in a program normally are listed sequentially, but they don’t always need to be carried out exactly in order. Just as tough problems can be broken into easier pieces, program code can be divided as well. If, for example, you want to know the larger of two rooms, you have to compute the volume of each and then make your comparison. If you had two brains, you could compute the two volumes simultaneously. A superscalar microprocessor design does essentially that. By providing two or more execution paths for programs, it can process two or more program parts simultaneously. Of course, the chip needs enough innate intelligence to determine which problems can be split up and how to do it. The Pentium, for example, has two parallel, pipelined execution paths.
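The room-volume example can be written out to show what the chip’s scheduling logic looks for: the two volume calculations share no data and can proceed side by side, while the final comparison must wait for both.

```python
def volume(length, width, height):
    return length * width * height

# These two calculations are independent, so a superscalar chip could
# send them down separate execution pipelines at the same time.
room_a = volume(4, 5, 3)      # 60
room_b = volume(6, 3, 2)      # 36

# Only this step depends on both results, so it must wait for them.
larger = "room A" if room_a > room_b else "room B"
print(larger)  # room A
```

The hard part, of course, is that the chip must discover this independence on its own, on the fly, in arbitrary programs.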
The first superscalar computer design was the Control Data Corporation 6600 mainframe, introduced in 1964. Designed specifically for intense scientific applications, the initial 6600 machines were built from eight functional units and were the fastest computers in the world at the time of their introduction.
Superscalar architecture gets its name because it goes beyond the incremental increase in speed made possible by scaling down microprocessor technology. An improvement to the scale of a microprocessor design would reduce the size of the microcircuitry on the silicon chip. The size reduction shortens the distance signals must travel and lowers the amount of heat generated by the circuit (because the elements are smaller and need less current to effect changes). Some microprocessor designs lend themselves to scaling down. Superscalar designs get a more substantial performance increase by incorporating a more dramatic change in circuit complexity.
Using pipelining and superscalar architecture cycle-saving techniques has dramatically cut the number of clock cycles required to execute a typical microprocessor instruction. Early microprocessors needed, on average, several cycles for each instruction. Today’s chips can often carry out multiple instructions in a single clock cycle. Engineers describe pipelined, superscalar chips by the number of instructions they can retire per clock cycle. They look at the number of instructions that are completed because this best describes how much work the chip actually accomplishes.
No matter how well the logic of a superscalar microprocessor divides up a program, each pipeline is unlikely to get an equal share of the work. One or another pipeline will grind away while another finishes in an instant. Certainly the chip logic can shove another instruction down the free pipeline (if another instruction is ready). But if the next instruction depends on the results of the one before it, and that instruction is the one stuck grinding away in the other pipeline, the free pipeline stalls. It is available for work but can do no work, thus potential processor power gets wasted.
Like a good Type-A employee who always looks for something to do, a microprocessor can keep itself busy. It can check the program for the next instruction that doesn’t depend on previous work that’s not finished and start on that instruction instead. This sort of ambitious approach to programs is termed out-of-order execution, and it helps microprocessors take full advantage of superscalar designs.
This sort of ambitious microprocessor faces a problem, however. It is no longer running the program in the order it was written, and the results might be other than the programmer had intended. Consequently, microprocessors capable of out-of-order execution don’t immediately post the results from their processing into their registers. The work gets carried out invisibly and the results of the instructions that are processed out of order are held in a buffer until the chip has finished the processing of all the previous instructions. The chip puts the results back into the proper order, checking to be sure that the out-of-order execution has not caused any anomalies, before posting the results to its registers. To the program and the rest of the outside world, the results appear in the microprocessor’s registers as if they had been processed in normal order, only faster.
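A toy reorder buffer makes the idea concrete. Execution may finish in any order, but results sit in a buffer and become visible strictly in program order:

```python
# Minimal sketch: instruction slots are numbered in program order.
results = {}    # completed-but-unretired results (the buffer)
retired = []    # what the outside world sees, in order

def complete(slot, value):
    results[slot] = value           # execution may finish in any order

def retire_in_order(n_slots):
    for slot in range(n_slots):     # posting happens strictly in order
        retired.append((slot, results[slot]))

complete(2, 6)   # a later, independent instruction finishes first
complete(0, 5)
complete(1, 50)
retire_in_order(3)
print(retired)   # [(0, 5), (1, 50), (2, 6)] -- program order restored
```

A real chip also checks for anomalies before posting, as described above, but the ordering discipline is the same.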
Out-of-order execution often runs into its own problems. Two independently executable instructions may refer to or change the same register. In the original program, one would carry out its operation, then the other would do its work later. During superscalar out-of-order execution, the two instructions may want to work on the register simultaneously. Because that conflict would inevitably lead to confusing results and errors, an ordinary superscalar microprocessor would have to ensure the two instructions referencing the same register executed sequentially instead of in parallel, thus eliminating the advantage of its superscalar design.
To avoid such problems, advanced microprocessors use register renaming. Instead of a small number of registers with fixed names, they use a larger bank of registers that can be named dynamically. The circuitry in each chip converts the references made by an instruction to a specific register name to point instead to its choice of physical register. In effect, the program asks for the EAX register, and the chip says, “Sure,” and gives the program a register it calls EAX. If another part of the program asks for EAX, the chip pulls out a different register and tells the program that this one is EAX, too. The program takes the microprocessor’s word for it, and the microprocessor doesn’t worry because it has several million transistors to sort things out in the end.
And it takes several million transistors because the chip must track all references to registers. It does this to ensure that when one program instruction depends on the result in a given register, it has the right register and results dished up to it.
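The bookkeeping can be sketched in a few lines. This is a deliberately simplified renamer (real chips recycle physical registers rather than handing out fresh ones forever), but it shows how two uses of “EAX” stop colliding:

```python
physical = {}      # the large bank of physical registers
rename_map = {}    # architectural name (e.g. "EAX") -> current physical slot
next_free = 0

def write(arch_reg, value):
    global next_free
    rename_map[arch_reg] = next_free   # hand out a fresh physical register
    physical[next_free] = value
    next_free += 1

def read(arch_reg):
    return physical[rename_map[arch_reg]]

write("EAX", 10)               # one part of the program uses "EAX"...
first_slot = rename_map["EAX"]
write("EAX", 99)               # ...another part asks for "EAX" and gets a new slot
print(read("EAX"))             # 99
print(physical[first_slot])    # 10 -- the earlier value survives untouched
```

Because the two writes land in different physical registers, instructions referring to the “same” register no longer have to wait for one another.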
Explicitly Parallel Instruction Computing
With Intel’s shift to 64-bit architecture for its most powerful line of microprocessors (aimed, for now, at the server market), the company introduced a new instruction set to complement the new architecture. Called Explicitly Parallel Instruction Computing (EPIC), the design follows the precepts of RISC architecture by putting the hard work into software (the compiler), while retaining the advantages of longer instructions used by SIMD and VLIW technologies.
The difference between EPIC and older Intel chips is that the compiler takes a swipe at each program and determines where parallel processes can occur. It then optimizes the program code to sort out separate streams of execution that can be routed to different microprocessor pipelines and carried out concurrently. This not only relieves the chip from working to figure out how to divide up the instruction stream, but it also allows the software to more thoroughly analyze the code rather than trying to do it on the fly.
By analyzing and dividing the instruction streams before they are submitted to the microprocessor, EPIC trims the need and use of speculative execution and branch prediction. The compiler can look ahead in the program, so it doesn’t have to speculate or predict. It knows how best to carry out a complex program.
The instruction stream is not the only bottleneck in a modern computer. The core logic of most microprocessors operates much faster than other parts of most computers, including the memory and support circuitry. The microprocessor links to the rest of the computer through a connection called the system bus or the front-side bus. The speed at which the system bus operates sets a maximum limit on how fast the microprocessor can send data to other circuits (including memory) in the computer.
When the microprocessor needs to retrieve data or an instruction from memory, it must wait for this to come across the system bus. The slower the bus, the longer the microprocessor has to wait. More importantly, the greater the mismatch between microprocessor speed and system bus speed, the more clock cycles the microprocessor needs to wait. Applications that involve repeated calculations on large blocks of data—graphics and video in particular—are apt to require the most access to the system bus to retrieve data from memory. These applications are most likely to suffer from a slow system bus.
The first commercial microprocessors of the current generation operated their system buses at 66MHz. Through the years, manufacturers have boosted this speed and increased the speed at which the microprocessor can communicate with the rest of the computer. Chips now use clock speeds of 100MHz or 133MHz for their system buses.
With the Pentium 4, Intel added a further refinement to the system bus. Using a technology called quad-pumping, Intel puts four data transfers on each clock cycle of the system bus. A quad-pumped system bus operating at 100MHz therefore carries data at an effective 400MHz rate. A quad-pumped system bus running at 133MHz achieves an effective 533MHz data rate.
The performance of the system bus is often described in its bandwidth, the number of total bytes of data that can move through the bus in one second. The data buses of all current computers are 64 bits wide—that’s 8 bytes. Multiplying the clock speed or data rate of the bus by its width yields its bandwidth. A 100MHz system bus therefore has an 800MBps (megabytes per second) bandwidth. A 400MHz bus has a 3.2GBps bandwidth.
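The bandwidth arithmetic reduces to a one-line function. The 64-bit (8-byte) width is the figure given above:

```python
def bus_bandwidth_mbps(clock_mhz, transfers_per_clock=1, width_bits=64):
    # Megabytes per second: effective data rate times bus width in bytes.
    return clock_mhz * transfers_per_clock * (width_bits // 8)

print(bus_bandwidth_mbps(100))                         # 800 MBps
print(bus_bandwidth_mbps(100, transfers_per_clock=4))  # 3200 MBps (quad-pumped)
```

The second figure is the 3.2GBps bandwidth quoted above for a quad-pumped 100MHz bus.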
Intel usually locks the system bus speed to the clock speed of the microprocessor. This synchronous operation optimizes the transfer rate between the bus and the microprocessor. It also explains some of the odd frequencies at which some microprocessors are designed to operate (for example, 1.33GHz or 2.56GHz). Sometimes the system bus operates at a speed that’s not an even divisor of the microprocessor speed—for example, the microprocessor clock speed may be 4.5, 5, or 5.5 times the system bus speed. Such a mismatch can slow down system performance, although such mismatches are minimized by effective microprocessor caching (see the upcoming section “Caching” for more information).
Translation Look-Aside Buffers
Modern pipelined, superscalar microprocessors need to access memory quickly, and they often repeatedly go to the same address in the execution of a program. To speed up such operations, most newer microprocessors include a quick lookup list of the pages in memory that the chip has addressed most recently. This list is termed a translation look-aside buffer, a small block of fast memory inside the microprocessor that stores a table cross-referencing the virtual addresses in programs with the corresponding real addresses in physical memory that the program has most recently used. The microprocessor can take a quick glance away from its normal address-translation pipeline, effectively “looking aside,” to fetch the addresses it needs.
The translation look-aside buffer (TLB) appears to be very small in relation to the memory of most computers. Typically, a TLB may be 64 to 256 entries. Each entry, however, refers to an entire page of memory, which with today’s Intel microprocessors, totals four kilobytes. The amount of memory that the microprocessor can quickly address by checking the TLB is the TLB address space, which is the product of the number of entries in the TLB and the page size. A 256-entry TLB can provide fast access to a megabyte of memory (256 entries times 4KB per page).
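The address-space calculation is a simple product, as shown here with the 4KB page size of the Intel chips described above:

```python
def tlb_address_space_kb(entries, page_size_kb=4):
    # Memory quickly reachable through the TLB: entries times page size.
    return entries * page_size_kb

print(tlb_address_space_kb(256))  # 1024 KB -- the one-megabyte figure above
print(tlb_address_space_kb(64))   # 256 KB
```

A program whose working set fits inside the TLB address space rarely pays the full cost of address translation.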
The most important means of matching today’s fast microprocessors to the speeds of affordable memory, which is inevitably slower, is memory caching. A memory cache interposes a block of fast memory—typically high-speed static RAM—between the microprocessor and the bulk of primary storage. A special circuit called a cache controller (which current designs make into an essential part of the microprocessor) attempts to keep the cache filled with the data or instructions that the microprocessor is most likely to need next. If the information the microprocessor requests next is held within the cache, it can be retrieved without waiting.
This fastest possible operation is called a cache hit. If the needed data is not in the cache memory, it is retrieved from outside the cache, typically from ordinary RAM at ordinary RAM speed. The result is called a cache miss.
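The hit-or-miss bookkeeping can be sketched in a few lines. This toy cache never evicts anything; the stand-in memory contents are arbitrary:

```python
main_memory = {addr: addr * 2 for addr in range(1024)}  # stand-in data
cache = {}
hits = misses = 0

def fetch(addr):
    global hits, misses
    if addr in cache:
        hits += 1                    # cache hit: served without waiting
    else:
        misses += 1                  # cache miss: go out to slow main memory
        cache[addr] = main_memory[addr]
    return cache[addr]

for addr in (5, 6, 5, 5, 6):         # repeated addresses reward the cache
    fetch(addr)
print(hits, misses)                  # 3 2
```

Because programs revisit the same addresses so often, even this naive scheme turns most accesses into hits.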
Caches are sometimes described by their logical and electrical proximity to the microprocessor’s core logic. The closest physically and electrically to the microprocessor’s core logic is the primary cache, also called a Level One cache. A secondary cache (or Level Two cache) fits between the primary cache and main memory. The secondary cache usually is larger than the primary cache but operates at a lower speed (to make its larger mass of memory more affordable). Rarely is a tertiary cache (or Level Three cache) interposed between the secondary cache and memory.
In modern microprocessor designs, both the primary and secondary caches are part of the microprocessor itself. Older designs put the secondary cache in a separate part of a microprocessor module or in external memory.
Primary and secondary caches differ in the way they connect with the core logic of the microprocessor. A primary cache invariably operates at the full speed of the microprocessor’s core logic with the widest possible bit-width connection between the core logic and the cache. Secondary caches have often operated at a rate slower than the chip’s core logic, although all current chips operate the secondary cache at full core speed.
A major factor that determines how successful the cache will be is how much information it contains. The larger the cache, the more data that is in it and the more likely any needed byte will be there when your system calls for it. Obviously, the best cache is one that’s as large as, and duplicates, the entirety of system memory. Of course, a cache that big is also absurd. You could use the cache as primary memory and forget the rest. The smallest cache would be a byte, also an absurd situation because it guarantees the next read is not in the cache. Chipmakers try to make caches as large as possible within the constraints of fabricating microprocessors affordably.
Instruction and Data Caches
Modern microprocessors subdivide their primary caches into separate instruction and data caches, typically with each assigned one-half the total cache memory. This separation allows for a more efficient microprocessor design. Microprocessors handle instructions and data differently and may even send them down different pipelines. Moreover, instructions and data typically use memory differently—instructions are sequential whereas data can be completely random. Separating the two allows designers to optimize the cache design for each.
Write-Through and Write-Back Caches
Caches also differ in the way they treat writing to memory. Most caches make no attempt to speed up write operations. Instead, they push write commands through the cache immediately, writing to cache and main memory (with normal wait-state delays) at the same time. This write-through cache design is the safe approach because it guarantees that main memory and cache are constantly in agreement. Most Intel microprocessors through the current versions of the Pentium use write-through technology.
The faster alternative is the write-back cache, which allows the microprocessor to write changes to its cache memory and then immediately go back about its work. The cache controller eventually writes the changed data back to main memory as time allows.
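The write-back idea can be sketched as follows. Dirty lines are those changed in the cache but not yet copied to memory; the flush stands in for the cache controller catching up when time allows:

```python
cache = {}
dirty = set()            # lines changed in the cache but not yet in memory
main_memory = {}

def write(addr, value):
    cache[addr] = value  # the chip posts the change and goes right back to work
    dirty.add(addr)

def flush():
    for addr in dirty:   # the cache controller catches up as time allows
        main_memory[addr] = cache[addr]
    dirty.clear()

write(0x10, 42)
memory_current = 0x10 in main_memory  # False: memory is briefly out of date
flush()
print(main_memory[0x10])              # 42
```

The window in which memory is stale is exactly why write-through is the safer, if slower, design.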
The logical configuration of a cache involves how the memory in the cache is arranged and how it is addressed (that is, how the microprocessor determines whether needed information is available inside the cache). The major choices are direct-mapped, full associative, and set-associative.
The direct-mapped cache divides the fast memory of the cache into small units, called lines (corresponding to the lines of storage used by Intel 32-bit microprocessors, which allow addressing in 16-byte multiples, blocks of 128 bits), each of which is identified by an index. Main memory is divided into blocks the size of the cache, and the lines in the cache correspond to the locations within such a memory block. Each line can be drawn from a different memory block, but only from the location corresponding to the location in the cache. Which block the line is drawn from is identified by a tag. For the cache controller—the electronics that ride herd on the cache—determining whether a given byte is stored in a direct-mapped cache is easy. It just checks the tag for a given index value.
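The tag-and-index scheme amounts to slicing the memory address into three fields. The 256-line cache size here is an assumption for illustration; the 16-byte line matches the figure above:

```python
LINE_SIZE = 16    # bytes per line, the 16-byte multiple described above
NUM_LINES = 256   # assumed cache size: 256 lines of 16 bytes = 4 KB

def split_address(addr):
    # In a direct-mapped cache the address alone decides where a byte
    # may live: offset within the line, line index, and a tag naming
    # which memory block the line came from.
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return tag, index, offset

# Two addresses exactly one cache-size apart share an index but differ
# in tag -- they compete for the same cache line:
print(split_address(0x0123))                          # (0, 18, 3)
print(split_address(0x0123 + LINE_SIZE * NUM_LINES))  # (1, 18, 3)
```

That collision between same-index addresses is precisely the weakness described next.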
The problem with the direct-mapped cache is that if a program regularly moves between addresses with the same indexes in different blocks of memory, the cache needs to be continually refreshed—which means cache misses. Although such operation is uncommon in single-tasking systems, it can occur often during multitasking and slow down the direct-mapped cache.
The opposite design approach is the full-associative cache. In this design, each line of the cache can correspond to (or be associated with) any part of main memory. Lines of bytes from diverse locations throughout main memory can be piled cheek-by-jowl in the cache. The major shortcoming of the full-associative approach is that the cache controller must check the addresses of every line in the cache to determine whether a memory request from the microprocessor is a hit or miss. The more lines there are to check, the more time it takes. A lot of checking can make cache memory respond more slowly than main memory.
A compromise between direct-mapped and full-associative caches is the set-associative cache, which essentially divides up the total cache memory into several smaller direct-mapped areas. The cache is described as the number of ways into which it is divided. A four-way set-associative cache, therefore, resembles four smaller direct-mapped caches. This arrangement overcomes the problem of moving between blocks with the same indexes. Consequently, the set-associative cache has more performance potential than a direct-mapped cache. Unfortunately, it is also more complex, making the technology more expensive to implement. Moreover, the more “ways” there are to a cache, the longer the cache controller must search to determine whether needed information is in the cache. This ultimately slows down the cache, mitigating the advantage of splitting it into sets. Most computer-makers find a four-way set-associative cache to be the optimum compromise between performance and complexity.
At its heart, a microprocessor is an electronic device. This electronic foundation has important ramifications in the construction and operation of chips. The “free lunch” principle (that is, there is none) tells us that every operation has its cost. Even the quick electronic thinking of a microprocessor takes a toll. The thinking involves the switching of the state of tiny transistors, and each state change consumes a bit of electrical power, which gets converted to heat. The transistors are so small that the process generates a minuscule amount of heat, but with millions of them in a single chip, the heat adds up. Modern microprocessors generate so much heat that keeping them cool is a major concern in their design.
Heat is the enemy of the semiconductor because it can destroy the delicate crystal structure of a chip. If a chip gets too hot, it will be irrevocably destroyed. Packing circuits tightly concentrates the heat they generate, and the small size of the individual circuit components makes them more vulnerable to damage.
Heat can cause problems more subtle than simple destruction. Because the conductivity of semiconductor circuits also varies with temperature, the effective switching speed of transistors and logic gates also changes when chips get too hot or too cold. Although this temperature-induced speed change does not alter how fast a microprocessor can compute (the chip must stay locked to the system clock at all times), it can affect the relative timing between signals inside the microprocessor. Should the timing get too far off, a microprocessor might make a mistake, with the inevitable result of crashing your system. All chips have rated temperature ranges within which they are guaranteed to operate without such timing errors.
Because chips generate more heat as speed increases, they can produce heat faster than it can radiate away. This heat buildup can alter the timing of the internal signals of the chip so drastically that the microprocessor will stop working and—as if you couldn’t guess—cause your system to crash. To avoid such problems, computer manufacturers often attach heatsinks to microprocessors and other semiconductor components to aid in their cooling.
A heatsink is simply a metal extrusion that increases the surface area from which heat can radiate from a microprocessor or other heat-generating circuit element. Most heatsinks have several fins, rows of pins, or some other geometry that increases their surface area. Heatsinks are usually made from aluminum because that metal is one of the better thermal conductors, enabling the heat from the microprocessor to quickly spread across the heatsink.
Heatsinks provide passive cooling (passive because cooling requires no power-using mechanism). Heatsinks work by convection, transferring heat to the air that circulates past the heatsink. Air circulates around the heatsink because the warmed air rises away from the heatsink and cooler air flows in to replace it.
In contrast, active cooling involves some kind of mechanical or electrical assistance to remove heat. The most common form of active cooling is a fan, which blows a greater volume of air past the heatsink than would be possible with convection alone. Nearly all modern microprocessors require a fan for active cooling, typically built into the chip’s heatsink.
The makers of notebook computers face another challenge in efficiently managing the cooling of their computers. Using a fan to cool a notebook system is problematic. The fan consumes substantial energy, which trims battery life. Moreover, the heat generated by the fan motor itself can be a significant part of the thermal load of the system. Most designers of notebook machines have turned to more innovative passive thermal controls, such as heat pipes and using the entire chassis of the computer as a heatsink.
In desktop computers, overheating rather than excess electrical consumption is the major power concern. Even the most wasteful of microprocessors uses far less power than an ordinary light bulb. The most that any computer-compatible microprocessor consumes is about nine watts, hardly more than a night light and of little concern when the power grid supplying your computer has megawatts at its disposal.
If you switch to battery power, however, every last milliwatt is important. The more power used by a computer, the shorter the time its battery can power the system or the heavier the battery it will need to achieve a given life between charges. Every degree a microprocessor raises its case temperature clips minutes from its battery runtime.
Battery-powered notebooks and sub-notebook computers consequently caused microprocessor engineers to do a quick about-face. Whereas once they were content to use bigger and bigger heatsinks, fans, and refrigerators to keep their chips cool, today they focus on reducing temperatures and wasted power at the source.
One way to cut power requirements is to make the design elements of a chip smaller. Smaller digital circuits require less power. But further shrinking is not an option on demand; microprocessors are invariably designed to be as small as possible with the prevailing technology.
To further trim the power required by microprocessors to make them more amenable to battery operation, engineers have come up with two new design twists: low-voltage operation and system-management mode. Although founded on separate ideas, both are often used together to minimize microprocessor power consumption. All new microprocessor designs incorporate both technologies.
Since the very beginning of the transistor-transistor logic family of digital circuits (the design technology that later blossomed into the microprocessor), digital logic has operated with a supply voltage of 5 volts. That level is essentially arbitrary. Almost any voltage would work. But 5-volt technology offers some practical advantages. It’s low enough to be both safe and frugal with power needs but high enough to avoid noise and allow for several diode drops, the inevitable reduction of voltage that occurs when a current flows across a semiconductor junction.
Every semiconductor junction, which essentially forms a diode, reduces, or drops, the voltage passed through it. Silicon junctions impose a diode drop of about 0.7 volts, and there may be one or more such junctions in a logic gate. Other materials impose smaller drops—that of germanium, for example, is 0.4 volts—but the drop is unavoidable.
There’s nothing magical about using 5 volts. Reducing the voltage used by logic circuits dramatically reduces power consumption because power consumption in electrical circuits increases by the square of the voltage. That is, doubling the voltage of a circuit increases the power it uses by fourfold. Reducing the voltage by one-half reduces power consumption by three-quarters (providing, of course, that the circuit will continue to operate at the lower voltage).
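The square law is worth working out in numbers, taking the classic 5-volt level as the baseline:

```python
def relative_power(voltage, reference_voltage=5.0):
    # Power varies with the square of the supply voltage,
    # other factors held equal.
    return (voltage / reference_voltage) ** 2

print(relative_power(5.0))   # 1.0  -- the classic 5-volt baseline
print(relative_power(2.5))   # 0.25 -- half the voltage, one-quarter the power
print(relative_power(1.3))   # about 0.07 -- a Pentium 4 class core voltage
```

A drop from 5 volts to the 1.3-volt range of current cores cuts power to under a tenth of the old level, which is why low-voltage operation pays off so handsomely.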
All current microprocessor designs operate at about 2 volts or less. The Pentium 4 operates at just over 1.3 volts with minor variations, depending on the clock frequency of the chip. For example, the 2.53GHz version requires 1.325 volts. Microprocessors designed for mobile applications typically operate at about 1.1 volts; some as low as 0.95 volt.
To minimize power consumption, Intel sets the operating voltage of the core logic of its chips as low as possible—some, such as Intel’s ultra-low-voltage Mobile Pentium III-M, to just under 1 volt. The integral secondary caches of these chips (which are fabricated separately from the core logic) usually require their own, often higher, voltage supply. In fact, operating voltage has become so critical that Intel devotes several pins of its Pentium II and later microprocessors to encoding the voltage needs of the chip, and the host computer must adjust its supply to the chip to precisely meet those needs.
Most bus architectures and most of today’s memory modules operate at the 3.3 volt level. Future designs will push that level lower. Rambus memory systems, for example, operate at 2.5 volts (see Chapter 6, “Chipsets,” for more information).
Trimming the power need of a microprocessor both reduces the heat the chip generates and increases how long it can run off a battery supply, an important consideration for portable computers. Reducing the voltage and power use of the chip is one way of keeping the heat down and the battery running, but chipmakers have discovered they can save even more power through frugality. The chips use power only when they have to, thus managing their power consumption.
Chipmakers have two basic power-management strategies: shutting off circuits when they are not needed and slowing down the microprocessor when high performance is not required.
The earliest form of power-savings built into microprocessors was part of system management mode (SMM), which allowed the circuitry of the chip to be shut off. In terms of clock speed, the chip went from full speed to zero. Initially, chips switched off after a period of system inactivity and woke up to full speed when triggered by an appropriate interrupt.
More advanced systems cycled the microprocessor between on and off states as they required processing power. The chief difficulty with this design is that nothing gets done when the chip isn’t processing. This kind of power management only works when you’re not looking (not exactly a salesman’s dream) and is a benefit you should never be able to see. Intel gives this technique the name QuickStart and claims that it can save enough energy between your keystrokes to significantly reduce overall power consumption by briefly cutting the microprocessor’s electrical needs by 95 percent. Intel introduced QuickStart in the Mobile Pentium II processor, although it has not widely publicized the technology.
In the last few years, chipmakers have approached the power problem with more advanced power-saving systems that take an intermediary approach. One way is to reduce microprocessor power when it doesn’t need it for particular operations. Intel slightly reduces the voltage applied to its core logic based on the activity of the processor. Called Intel Mobile Voltage Positioning (IMVP), this technology can reduce the thermal design power—which means the heat produced by the microprocessor—by about 8.5 percent. According to Intel, this reduction is equivalent to reducing the speed of a 750MHz Mobile Pentium III by 100MHz.
Another technique for saving power is to reduce the performance of a microprocessor when its top speed is not required by your applications. Instead of entirely switching off the microprocessor, the chipmakers reduce its performance to trim power consumption. Each of the three current major microprocessor manufacturers puts its own spin on this performance-as-needed technology, labeling it with a clever trademark. Intel offers SpeedStep, AMD offers PowerNow!, and Transmeta offers LongRun. Although at heart all three are conceptually much the same, in operation you’ll find distinct differences between them.
Internal mobile microprocessor power savings started with SpeedStep, introduced by Intel on January 18, 2000, with the Mobile Pentium III microprocessors, operating at 600MHz and 650MHz. To save power, these chips can be configured to reduce their operating speed when running on battery power to 500MHz. All Mobile Pentium III and Mobile Pentium 4 chips since that date have incorporated SpeedStep into their designs. Mobile Celeron processors do not use SpeedStep.
The triggering event is a reduction of power to the chip. For example, the initial Mobile Pentium III chips go from the 1.7 volts required for operation at their top speeds down to 1.35 volts. As noted earlier, the M-series steps down from 1.4 to 1.15 volts, the low-voltage M-series from 1.35 to 1.1 volts, and the ultra-low-voltage chips from 1.1 to 0.975 volts. Note that a 15-percent reduction in voltage in itself reduces power consumption by about 28 percent, with a further reduction that’s proportional to the speed decrease. The 600MHz Pentium III, for example, cuts its power consumption an additional 17 percent thanks to the clock reduction when slipping down from 600MHz to 500MHz.
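These savings can be checked against a simple model: power scales with the square of the voltage and, to a first approximation, linearly with the clock frequency. The figures below are the chapter's own; the model itself is a simplification, not Intel's published methodology:

```python
# Approximate fraction of original power after a combined voltage and clock change.
# Power varies as voltage squared times frequency (a first-order simplification).
def power_fraction(v_new, v_old, f_new, f_old):
    return (v_new / v_old) ** 2 * (f_new / f_old)

# A 15-percent voltage cut by itself saves roughly 28 percent.
print(f"{1 - 0.85 ** 2:.0%} savings from the voltage cut alone")  # 28% ...

# Mobile Pentium III stepping from 1.7 V at 600 MHz down to 1.35 V at 500 MHz.
saved = 1 - power_fraction(1.35, 1.7, 500, 600)
print(f"{saved:.0%} total savings")  # roughly 47% total savings
```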
Intel calls the two modes Maximum Performance Mode (for high speed) and Battery Optimized Mode (for low speed). According to Intel, switching between speeds requires about one two-thousandth of a second. The M-series of Mobile Pentium III adds an additional step to provide an intermediary level of performance when operating on battery. Intel calls this technology Enhanced SpeedStep. Table 5.1 lists the SpeedStep capabilities of many Intel chips.
Advanced Micro Devices devised its own power-saving technology, called PowerNow!, for its mobile processors. The AMD technology differs from Intel’s SpeedStep by providing up to 32 levels of speed reduction and power savings. Note that 32 levels is the design limit. Actual implementations of the technology from AMD have far fewer levels. All current AMD mobile processors—both the Mobile Athlon and Mobile Duron lines—use PowerNow! technology.
PowerNow! operates in one of three modes:
Battery Saver. This mode runs the chip at a lower speed using a lower voltage to conserve battery power, exactly as a SpeedStep chip would, but with multiple levels. The speed and voltage are determined by the chip’s requirements and programming of the BIOS (which triggers the change).
Automatic. This mode makes the changes in voltage and clock speed dynamic, responding to the needs of the system. When an application requires maximum processing power, the chip runs at full speed. As the need for processing declines, the chip adjusts its performance to match. Current implementations allow for four discrete speeds at various operating voltages. Automatic mode is the best compromise for normal operation of portable computers.
The actual means of varying the clock frequency involves dynamic control of the clock multiplier inside the AMD chip. The external oscillator or clock frequency the computer supplies the microprocessor does not change, regardless of the performance demand. In the case of the initial chip to use PowerNow! (the Mobile K6-2+ chip, operating at 550MHz), its actual operating speed could vary from 200MHz to 550MHz in a system that takes full advantage of the technology.
Control of PowerNow! starts with the operating system (which is almost always Windows). Windows monitors processor usage, and when it dips below a predetermined level, such as 50 percent, Windows signals the PowerNow! system to cut back the clock multiplier inside the microprocessor and then signals a programmable voltage regulator to trim the voltage going to the chip. Note that even with PowerNow!, the chip’s supply voltage must be adjusted externally to the chip to achieve the greatest power savings.
If the operating system detects that the available processing power is still underused, it signals to cut back another step. Similarly, should processing needs reach above a predetermined level (say, 90 percent of the available ability), the operating system signals PowerNow! to kick up performance (and voltage) by a notch.
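The control loop the last two paragraphs describe can be sketched as a simple state machine. Everything below is illustrative: the 50 and 90 percent thresholds match the text's examples, but the performance states and their voltages are hypothetical values, not AMD's actual tables:

```python
# A toy governor in the spirit of PowerNow!: the operating system watches CPU
# utilization and steps between preset performance states one notch at a time.
P_STATES = [(200, 1.4), (300, 1.5), (450, 1.6), (550, 1.7)]  # (MHz, volts), hypothetical

def next_state(current, utilization):
    """Return the index of the next performance state given the current load."""
    if utilization < 0.50 and current > 0:
        return current - 1          # processor underused: step down a notch
    if utilization > 0.90 and current < len(P_STATES) - 1:
        return current + 1          # processor saturated: step up a notch
    return current                  # load in the comfortable middle: hold steady

state = 3                           # start at full speed
state = next_state(state, 0.30)     # light load, so drop one step
print(P_STATES[state])              # (450, 1.6)
```

In a real system, each step down would also signal an external programmable voltage regulator, since (as the text notes) the supply voltage must be adjusted outside the chip to realize the full savings.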
Transmeta calls its proprietary power-saving technology LongRun. It is a feature of both current Crusoe processors, the TM5500 and TM5800. In concept, LongRun is much like AMD’s PowerNow! The chief difference is control. Because of the design of Transmeta’s Crusoe processors, Windows instructions are irrelevant to their power usage—Crusoe chips translate the Windows instructions into their own format. To gauge its power needs, the Crusoe chip monitors the flow of its own native instructions and adjusts its speed to match the processing needs of that code stream. In other words, the Crusoe chip does its own monitoring and decision-making regarding power savings without regard to Windows power-conservation information.
According to Transmeta, LongRun allows its microprocessors to adjust their power consumption by changing their clock frequency on the fly, just as PowerNow! does, as well as to adjust their operating voltage. The processor core steps down processor speed in 33MHz increments, and each step holds the potential of reducing chip voltage. For example, trimming the speed of a chip from 667MHz to 633MHz also allows for reducing the operating voltage from 1.65 to 1.60 volts.
The working part of a microprocessor is exactly what the nickname “chip” implies: a small flake of a silicon crystal no larger than a postage stamp. Although silicon is a fairly robust material with moderate physical strength, it is sensitive to chemical contamination. After all, semiconductors are grown in precisely controlled atmospheres, the chemical content of which affects the operating properties of the final chip. To prevent oxygen and contaminants in the atmosphere from adversely affecting the precision-engineered silicon, the chip itself must be sealed away. The first semiconductors, transistors, were hermetically sealed in tiny metal cans.
The art and science of semiconductor packaging has advanced since those early days. Modern integrated circuits (ICs) are often surrounded in epoxy plastic, an inexpensive material that can be easily molded to the proper shape. Unfortunately, microprocessors can get very hot, sometimes too hot for plastics to safely contain. Most powerful modern microprocessors are consequently cased in ceramic materials that are fused together at high temperatures. Older, cooler chips reside in plastic. The most recent trend in chip packaging is the development of inexpensive tape-based packages optimized for automated assembly of circuit boards.
The most primitive of microprocessors (that is, those of the early generation that had neither substantial signal nor power requirements) fit in the same style housing popular for other integrated circuits—the infamous dual inline pin (DIP) package. The packages grew more pins—or legs, as engineers sometimes call them—to accommodate the ever-increasing number of signals in data and address buses.
The DIP package is far from ideal for a number of reasons. Adding more connections, for example, makes for an ungainly chip. A centipede microprocessor would be a beast measuring a full five inches long. Not only would such a critter be hard to fit onto a reasonably sized circuit board, it would require that signals travel substantially farther to reach the end pins than those in the center. At modern operating frequencies, that difference in distance can amount to a substantial fraction of a clock cycle, potentially putting the pins out of sync.
Modern chip packages are compact squares that avoid these problems. Engineers developed several separate styles to accommodate the needs of the latest microprocessors.
The most common is the pin grid array (PGA), a square package that varies in size with the number of pins that it must accommodate (typically about two inches square). The first PGA chips had 68 pins. Pentium 4 chips in PGA packages have up to 478.
No matter their number, the pins are spaced as if they were laid out on a checkerboard, making the “grid array” of the package name (see Figure 5.1).
To fit the larger number of pins used by wider-bus microprocessors into a reasonable space, Intel rearranged the pins of some processors (notably the Pentium Pro), staggering them so that they can fit closer together. The result is a staggered pin grid array (SPGA) package, as shown in Figure 5.2.
Pins take up space and add to the cost of fabrication, so chipmakers have developed a number of pinless packages. The first of these to find general use was the Leadless Chip Carrier (LCC). Instead of pins, this style of package has contact pads on one of its surfaces. The pads are plated with gold to avoid corrosion or oxidation that would impede the flow of the minute electrical signals used by the chip (see Figure 5.3). The pads are designed to mate with springy contacts in a special socket. Once installed, the chip itself may be hidden in the socket, under a heat sink, or perhaps only the top of the chip may be visible, framed by the four sides of the socket.
A related design, the Plastic Leaded Chip Carrier (PLCC), substitutes epoxy plastic for the ceramic materials ordinarily used for encasing chips. Plastic is less expensive and easier to work with. Some microprocessors with low thermal output sometimes use a housing designed to be soldered down—the Plastic Quad Flat Package (PQFP), sometimes called simply the quad flat pack because the chips are flat (they fit flat against the circuit board) and they have four sides (making them a quadrilateral, as shown in Figure 5.4).
The Tape Carrier Package takes the advantages of the quad flat pack a step further, reducing the chip to what looks like a pregnant bulge in the middle of a piece of photographic film (see Figure 5.5).
Another way to deal with the problem of pins is to reduce them to vestigial bumps, substituting precision-formed globs of solder that can mate with socket contacts. Alternately, the globs can be soldered directly to a circuit board using surface-mount technology. Because the solder contacts start out as tiny balls but use a variation on the PGA layout, the package is termed solder-ball grid array. (Note that solder is often omitted from the name, thus yielding the abbreviation BGA.)
When Intel’s engineers first decided to add secondary caches to the company’s microprocessors, they used a separately housed slice of silicon for the cache. Initially Intel put the CPU and cache chips in separate chambers in one big, black chip. The design, called the Multi-Cavity Module (MCM), was used only for the Pentium Pro chip.
Next, Intel shifted to putting the CPU and cache on a small circuit board inside a cartridge, initially called the Single Edge Contact cartridge or SEC cartridge (which Intel often abbreviates SECC) when it was used for the Pentium II chip. Figure 5.6 shows the Pentium II microprocessor SEC cartridge.
To cut the cost of the cartridge for the inexpensive Celeron line, Intel eliminated the case around the chip to make the Single Edge Processor (SEP) package (see Figure 5.7).
When Intel developed the capability to put the CPU and secondary cache on a single piece of silicon, called a die, the need for cartridges disappeared. Both later Celeron and Pentium III had on-die caches and were packaged both as cartridges and as individual chips in PGA and similar packages. With the Pentium 4, the circle was complete. Intel offers the latest Pentiums only in compact chip-style packages.
The package that the chip is housed in has no effect on its performance. It can, however, be important when you want to replace or upgrade your microprocessor with a new chip or upgrade card. Many of these enhancement products require that you replace your system’s microprocessor with a new chip or adapter cable that links to a circuit board. If you want the upgrade or a replacement part to fit on your motherboard, you may have to specify which package your computer uses for its microprocessor.
Ordinarily you don’t have to deal with microprocessor sockets unless you’re curious and want to pull out the chip, hold it in your hand, and watch a static discharge turn a $300 circuit into epoxy-encapsulated sand. Choose to upgrade your computer to a new and better microprocessor, and you’ll tangle with the details of socketry, particularly if you want to improve your Pentium.
Intel recognizes nine different microprocessor sockets for its processors, from the 486 to the Pentium Pro. In 1999, it added a new socket for some incarnations of the Pentium II Celeron. Other Pentium II and Pentium III chips, packaged as modules or cartridges, mate with slots instead of sockets. Table 5.2 summarizes these socket types, the chips that use them, and the upgrades appropriate to them.
No matter the designation or origin, all microprocessors in today’s Windows-based computers share a unique characteristic and heritage. All are direct descendants of the very first microprocessor. The instruction set used by all current computer microprocessors is rooted in the instructions selected for that first-ever chip. Even the fastest of today’s Pentium 4 chips has, hidden in its millions of transistors, the capability of acting exactly like that first chip.
In a way, that’s good because this backward-looking design assures us that each new generation of microprocessor remains compatible with its predecessors. When a new chip arrives, manufacturers can plug it into a computer and give you reasonable expectations that all your old software will still work. But holding to the historical standard also heaps extra baggage on chip designs that holds back performance. By switching to a radically new design, engineers could create a faster, simpler microprocessor—one that could run circles around any of today’s chips, but, alas, one that can’t use any of your current programs or operating systems.
The history of the microprocessor stretches back to a 1969 request to Intel by a now-defunct Japanese calculator company, Busicom. The original plan was to build a series of calculators, each one different and each requiring a custom integrated circuit. Using conventional IC technology, the project would have required the design of 12 different chips. The small volumes of each design would have made development costs prohibitive.
Intel engineer Marcian E. (Ted) Hoff had a better idea, one that could slash the necessary design work. Instead of a collection of individually tailored circuits, he envisioned creating one general-purpose device that would satisfy the needs of all the calculators. Hoff laid out an integrated circuit with 2,300 transistors using 10-micron design rules with four-bit registers and a four-bit data bus. Using a 12-bit multiplexed addressing system, it was able to address 640 bytes of memory for storing subproducts and results.
Most amazing of all, once fabricated, the chip worked. It became the first general-purpose microprocessor, which Intel put on sale as the 4004 on November 15, 1971.
The chip was a success. Not only did it usher in the age of low-cost calculators, it also gave designers a single solid-state programmable device for the first time. Instead of designing the digital decision-making circuits in products from scratch, developers could buy an off-the-shelf component and tailor it to their needs simply by writing the appropriate program.
With the microprocessor’s ability to handle numbers proven, the logical next step was to enable chips to deal with a broader range of data, including text characters. The 4004’s narrow four-bit design was sufficient for encoding only numbers and basic operations—a total of 16 symbols. The registers would need to be wider to accommodate a wider repertory. Rather than simply bump up the registers a couple of bits, Intel’s engineers chose to go double and design a full eight-bit microprocessor with eight-bit registers and an eight-bit data bus. In addition, this endowed the chip with the ability to address a full 16KB of memory using 14 multiplexed address lines. The result, which required a total of 3,450 transistors, was the Intel 8008, introduced in April 1972.
Intel continued development (as did other integrated circuit manufacturers) and, in April 1974, created a rather more drastic revision, the 8080, which required nearly twice as many transistors (6,000) as the earlier chip. Unlike the 8008, the new 8080 chip was planned from the start for byte-size data. Intel gave the 8080 a 16-bit address bus that could handle a full 64KB of memory and a richer command set, one that embraced all the commands of the 8008 but went further. This set a pattern for Intel microprocessors: Every increase in power and range of command set enlarged on what had gone before rather than replacing it, thus ensuring backward compatibility (at least to some degree) of the software. To this day, the Intel-architecture chips used in personal computers can run program code written using 8080 instructions. From the 8080 on, the story of the microprocessor is simply one of improvements in fabrication technology and increasingly complex designs.
With each new generation of microprocessor, manufacturers relied on improving technology in circuit design and fabrication to increase the number and size of the registers in each microprocessor, broadening the data and address buses to match. When that strategy stalled, they moved to superscalar designs with multiple pipelines. Improvements in semiconductor fabrication technology made the increasing complexity of modern microprocessor designs both practical and affordable. In the three decades since the introduction of the first microprocessor, the linear dimensions of semiconductor circuits have decreased to roughly 1/75th their original size, from 10-micron design rules to 0.13 micron, which means microprocessor-makers can squeeze nearly 6,000 transistors where only one fit originally. This size reduction also facilitates higher speeds. Today’s microprocessors run nearly 25,000 times faster than the first chip out of the Intel foundry, 2.5GHz in comparison to the 108KHz of the first 4004 chip.
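These scaling claims follow from back-of-the-envelope arithmetic, which is worth checking: the shrink in linear dimensions is squared to get the gain in packing density, and the clock speeds divide directly:

```python
# Checking the chapter's scaling figures with simple arithmetic.
linear_shrink = 10 / 0.13        # design rules: 10 microns down to 0.13 micron
area_gain = linear_shrink ** 2   # packing density grows with the square of the shrink
speed_gain = 2.5e9 / 108e3       # 2.5 GHz today versus the 4004's 108 KHz clock

print(round(linear_shrink))      # 77    -- linear dimensions shrank about 77-fold
print(round(area_gain))          # 5917  -- close to the "6,000 transistors" figure
print(round(speed_gain))         # 23148 -- "nearly 25,000 times faster"
```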
Personal Computer Influence
The success of the personal computer marked a major turning point in microprocessor design. Before the PC, microprocessor engineers designed what they regarded as the best possible chips. Afterward, they focused their efforts on making chips for PCs. This change came between what is now regarded as the first generation of Intel microprocessors and the third generation, in the years 1981 to 1987.
The engineers who designed the IBM Personal Computer chose to use a chip from the Intel 8086 family. Intel introduced the 8086 chip in 1978 as an improvement over its first chips. Intel’s engineers doubled the size of the registers in its 8080 to create a chip with 16-bit registers and about 10 times the performance. The 16-bit design carried through completely, also doubling the size of the data bus of earlier chips to 16 bits to move information in and out twice as fast.
In addition, Intel broadened the address bus from 16 bits to 20 bits to allow the 8086 to directly address up to one megabyte of RAM. Intel divided this memory into 64KB segments to make programming and the transition to the new chip easier. A single 16-bit register could address any byte in a given segment. Another, separate register indicated which of the segments that address was in.
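The mechanics are concrete enough to model: the 8086 shifts the 16-bit segment value left four bits and adds the 16-bit offset, producing a 20-bit physical address (so segments begin every 16 bytes and overlap heavily). A minimal sketch:

```python
# 8086 real-mode addressing: physical address = (segment << 4) + offset.
# The result wraps at 20 bits, mirroring the chip's 20 address lines.
def physical_address(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF

# Segment 0x1000 starts at byte 0x10000; offset 0x0200 lands at 0x10200.
print(hex(physical_address(0x1000, 0x0200)))  # 0x10200

# The very top of the 1 MB space: segment 0xF000, offset 0xFFFF.
print(hex(physical_address(0xF000, 0xFFFF)))  # 0xfffff
```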
A year after the introduction of the 8086, Intel introduced the 8088. The new chip was identical to the 8086 in every way—16-bit registers, 20 address lines, and the same command set—except one. Its data bus was reduced to eight bits, enabling the 8088 to exploit readily available eight-bit support hardware. On its own, the 8088 broke no new ground and should have been little more than a footnote in the history of the microprocessor. However, its compromise design that mated 16-bit power with cheap 8-bit support chips made the 8088 IBM’s choice for its first personal computer. With that, the 8088 entered history as the second most important product in the development of the microprocessor, after the ground-breaking 4004.
After the release of the 8086, Intel’s engineers began to work on a successor chip with even more power. Designated the 80286, the new chip was to feature several times the speed and 16 times more addressable memory than its predecessors. Inherent in its design was the capability of multitasking, with new instructions for managing tasks and a new operating mode, called protected mode, that made its full 16MB of memory fodder for advanced operating systems.
The 80286 chip itself was introduced in 1982, but its first major (and most important) application didn’t come until 1984 with the introduction of IBM’s Personal Computer AT. Unfortunately, this development work began before the PC arrived, and few of the new features were compatible with the personal computer design. The DOS operating system for PCs and all the software that ran under it could not take advantage of the chip’s new protected mode—which effectively put most of the new chip’s memory off limits to PC programs.
With all its innovations ignored by PCs, the only thing the 80286 had going for it was its higher clock speed, which yielded better computer performance. Initially released running at 6MHz, computers powered by the 80286 quickly climbed to 8MHz, and then 10MHz. Versions operating at 12.5MHz, 16MHz, 20MHz, and ultimately 24MHz were eventually marketed.
The 80286 proved to be an important chip for Intel, although not because of any enduring success. It taught the company’s engineers two lessons. First was the new importance of the personal computer to Intel’s microprocessor market. Second was licensing. Although the 80286 was designed by Intel, the company licensed the design to several manufacturers, including AMD, Harris Semiconductor, IBM, and Siemens. Intel granted these licenses not only for income but also to assure the chip buyers that they had alternate sources of supply for the 80286, just in case Intel went out of business. At the time, Intel was a relatively new company, one of many struggling chipmakers. With the success of the PC and its future ensured, however, Intel would never again license its designs so freely.
Even before the 80286 made it to the marketplace, Intel’s engineers were working on its successor, a chip designed with the power of hindsight. By then they could see the importance that the personal computer’s primeval DOS operating system had on the microprocessor market, so they designed to match DOS instead of some vaguely conceived successor. They also added in enough power to make the chip a fearsome competitor.
The next chip, the third generation of Intel design, was the 80386. Two features distinguish it from the 80286: a full 32-bit design, for both data and addressing, and the new Virtual 8086 mode. The first gave the third generation unprecedented power. The second made that power useful.
Moreover, Intel learned to tailor the basic microprocessor design to specific niches in the marketplace. In addition to the mainstream microprocessor, the company saw the need to introduce an “entry level” chip, which would enable computer makers to sell lower-cost systems, and a version designed particularly for the needs of battery-powered portable computers. Intel renamed the mainstream 80386 as the 386DX, designated an entry-level chip the 386SX (introduced in 1988), and reengineered the same logic core for low-power applications as the 386SL (introduced in 1990).
The only difference between the 386DX and 386SX was that the latter had a 16-bit external data bus whereas the former had a 32-bit external bus. Internally, however, both chips had full 32-bit registers. The origin of the D/S nomenclature is easily explained. The external bus of the 386DX handled double words (32 bits), and that of the 386SX, single words (16 bits).
Intel knew it had a winner and severely restricted its licensing of the 386 design. IBM (Intel’s biggest customer at the time) got a license only by promising not to sell chips. It could only market the 386-based microprocessors it built inside complete computers or on fully assembled motherboards. AMD won its license to duplicate the 386 in court based on technology-sharing agreements with Intel dating before even the 80286 had been announced. Another company, Chips and Technologies, reverse-engineered the 386 to build clones, but these were introduced too late—well after Intel advanced to its fourth generation of chips—to see much market success.
Age of Refinement
The 386 established Intel Architecture in essentially its final form. Later chips differ only in details. They have no new modes. Although Intel has added new instructions to the basic 386 command set, almost any commercial software written today will run on any Intel processor all the way back to the 386—but not likely any earlier processor, if the software is Windows based. The 386 design had proven itself and had become the foundation for a multibillion-dollar software industry. The one area for improvement was performance. Today’s programs may run on a 386-based machine, but they are likely to run very slowly. Current chips are about 100 times faster than any 386.
The next major processor after the 386 was, as you might expect, the 486. Even Intel conceded its new chip was basically an improved 386. The most significant difference was that Intel added three features that could boost processing speed by working around handicaps in circuitry external to the microprocessor. These innovations included an integral Level One cache that helped compensate for slow memory systems, pipelining within the microprocessor to get more processing power from low clock speeds, and an integral floating-point unit that eliminated the handicap of an external connection. As this generation matured, Intel added one further refinement that let the microprocessor race ahead of laggardly support circuits—splitting the chip so that its core logic and external bus interface could operate at different speeds.
Intel introduced the first of this new generation in 1989 in the form of a chip then designated 80486, continuing its traditional nomenclature. When the company added other models derived from this basic design, it renamed the then-flagship chip the 486DX, marked lower-priced models with the SX suffix, and gave low-power designs for portable computers the SL designation, as it had with the third generation. Other manufacturers followed suit, using the 486 designation for their similar products—and often the D/S indicators for top-of-the-line and economy models.
In the 486 family, however, the D/S split does not distinguish the width of the data bus. The designations had become disconnected from their origins. In the 486 family, Intel economized on the SX version by eliminating the integral floating-point unit. The savings from this strategy was substantial—without the floating-point circuitry, the 486SX required only about half the silicon of the full-fledged chip, making it cheaper to make. In the first runs of the 486SX, however, the difference was purely a matter of marketing. The SX chips were identical to the DX chips except that their floating-point circuitry was either defective or deliberately disabled to make a less capable processor.
As far as hardware basics are concerned, the 486 series retained the principal features of the earlier generation of processors. Chips in both the third and fourth generations have three operating modes (real, protected, and virtual 8086), full 32-bit registers, and a 32-bit address bus enabling up to 4GB of memory to be directly addressed. Both support virtual memory that extends their addressing to 64TB. Both have built-in memory-management units that can remap memory in 4KB pages.
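The address-space figures in that paragraph follow directly from the register and bus widths. A quick check; the accounting for the 64TB virtual figure (16,384 segment selectors, each spanning up to 4GB) is the conventional one for this architecture:

```python
# Address-space arithmetic for the 386/486 generation.
physical = 2 ** 32                 # 32 address lines: 4 GB directly addressable
virtual = (2 ** 14) * (2 ** 32)    # 16,384 selectors, each mapping up to 4 GB
pages = physical // (4 * 1024)     # the MMU remaps memory in 4 KB pages

print(physical // 2 ** 30)  # 4        (GB of directly addressable memory)
print(virtual // 2 ** 40)   # 64       (TB of virtual address space)
print(pages)                # 1048576  (4 KB page frames in the 4 GB space)
```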
But the hardware of the 486 also differs substantially from the 386 (or any previous Intel microprocessor). The pipelining in the core logic allows the chip to work on parts of several instructions at the same time. At times the 486 could carry out one instruction every clock cycle. Tighter silicon design rules (smaller details etched into the actual silicon that makes up the chip) gave the 486 more speed potential than preceding chips. The small but robust 8KB integral primary cache helped the 486 work around the memory wait states that plagued faster 386-based computers.
The streamlined hardware design (particularly pipelining) meant that the 486-level microprocessors could think faster than 386 chips when the two operated at the same clock speed. On most applications, the 486 proved about twice as fast as a 386 at the same clock rate, so a 20MHz 486 delivered about the same program throughput as a 40MHz 386.
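The clock-for-clock comparison reduces to simple arithmetic. The sketch below (the instructions-per-clock figures are assumptions chosen to match the roughly two-to-one advantage the text describes, not measured values) shows why a 20MHz 486 and a 40MHz 386 land at about the same throughput:

```python
# Rough throughput model: instructions per clock (IPC) times clock rate.
# The IPC values are illustrative assumptions, not benchmark data.
IPC_386 = 0.5   # assume the 386 averages an instruction every other clock
IPC_486 = 1.0   # pipelining lets the 486 approach one instruction per clock

def relative_throughput(ipc, clock_mhz):
    """Relative throughput in millions of instructions per second."""
    return ipc * clock_mhz

print(relative_throughput(IPC_486, 20))  # 20MHz 486 -> 20.0
print(relative_throughput(IPC_386, 40))  # 40MHz 386 -> 20.0
```

By this crude measure both chips deliver the same 20 million instructions per second, which is the equivalence the text describes.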
In March 1993, Intel introduced its first superscalar microprocessor, the first chip to bear the designation Pentium. At the time the computer industry expected Intel to continue its naming tradition and label the new chip the 80586. In fact, the competition was banking on it. Many had already decided to use that numerical designation for their next generation of products. Intel, however, wanted to distinguish its new chip from any potential clones and establish its own recognizable brand on the marketplace. Getting trademark protection for the 586 designation was unlikely. A federal court had earlier ruled that the 386 numeric designation was generic—that is, it described a type of product rather than something exclusive to a particular manufacturer—so trademark status was not available for it. Intel coined the word Pentium because it could get trademark protection. It also implied the number 5, signifying fifth generation, much as “586” would have.
Intel has used the Pentium name quite broadly as the designation for mainstream (or desktop performance) microprocessors, but even in its initial usage the singular Pentium designation obscured changes in silicon circuitry. Two very different chips wear the plain designation “Pentium.” The original Pentium began its life under the code name P5 and was the designated successor to the 486DX. Characterized by 5-volt operation, low operating speeds, and high power consumption, the P5 was offered by Intel at only three speeds: 60MHz, 66MHz, and 90MHz. Later, Intel refined the initial Pentium design as the P54C (another internal code name), with tighter design rules and lower voltage operation. These innovations raised the speed potential of the design, and commercial chips gradually stepped up from 100MHz to 200MHz. The same basic design underlies the Pentium OverDrive (or P24T) processor used for upgrading 486-based PCs.
In January 1997, Intel enhanced the Pentium instruction set to better handle multimedia applications and created the Pentium Processor with MMX Technology (code-named P55C during development). These chips also incorporated a larger, 32KB on-chip primary memory cache.
To put the latest in Pentium power in the field, Intel reengineered the Pentium with MMX Technology chip for low-power operation to make the Mobile Pentium with MMX Technology chip, also released in January 1997. Unlike the deskbound version, the mobile chip had its addressing capability enhanced by four more lines to allow direct access to 64GB of physical memory.
The Pentium was Intel’s last CISC design. Other manufacturers were adapting RISC designs to handle the Intel instruction set and achieving results that put Intel on notice. The company responded with its own RISC-based design in 1995 that became the standard Intel core logic until the introduction of the Pentium 4 in the year 2000. Intel developed this logic core under the code name P6, and it has appeared in a wide variety of chips, including those bearing the names Pentium Pro, Pentium II, Celeron, Xeon, and Pentium III.
That’s not to say all these chips are the same. Although the entire series uses essentially the same execution units, the floating-point unit continued to evolve throughout the series. The Pentium Pro incorporates a traditional floating-point unit. That of the Pentium II is enhanced to handle the MMX instruction set. The Pentium III adds Streaming SIMD Extensions. In addition, Intel altered the memory cache and bus of these chips to match the requirements of particular market segments to distinguish the Celeron and Xeon lines from the plain Pentium series.
The basic P6 design uses its own internal circuits to translate classic Intel instructions into micro-ops that can be processed in a RISC-based core, which has been tuned using all the RISC design tricks to massage extra processing speed from the code. Intel called this design Dynamic Execution. In the standard language of RISC processors, Dynamic Execution merely indicates a combination of out-of-order instruction execution and the underlying technologies that enable its operation (branch prediction, register renaming, and so on).
The P6 pipeline has 12 stages, divided into three sections: an in-order fetch/decode section, an out-of-order dispatch/execution section, and an in-order retirement section. The design is superscalar, incorporating two integer units and one floating-point unit.
One look and there’s no mistaking the Pentium Pro. Instead of a neat square chip, it’s a rectangular giant. Intel gives this package the name Multi-Chip Module (MCM). It is also termed a dual-cavity PGA (pin-grid array) package because it holds two distinct slices of silicon, the microprocessor core and secondary cache memory. This was Intel’s first chip with an integral secondary cache. Notably, this design results in more pins than any previous Intel microprocessor and a new socket requirement, Socket 8 (discussed earlier).
The main processor chip of the Pentium Pro uses the equivalent of 5.5 million transistors. About 4.5 million of them are devoted to the actual processor itself. The other million provide the circuitry of the chip’s primary cache, which provides a total of 16KB storage bifurcated into separate 8KB sections for program instructions and data. Compared to true RISC processors, the Pentium Pro uses about twice as many transistors. The circuitry that translates instructions into RISC-compatible micro-ops requires the additional transistor logic.
The integral secondary RAM cache fits onto a separate slice of silicon in the other cavity of the MCM. Its circuitry involves another 15.5 million transistors for 256KB of storage and operates at the same speed as the core logic of the rest of the Pentium Pro.
The secondary cache connects with the microprocessor core logic through a dedicated 64-bit bus, termed a back-side bus, that is separate and distinct from the 64-bit front-side bus that connects to main memory. The back-side bus operates at the full internal speed of the microprocessor, whereas the front-side bus operates at a fraction of the internal speed of the microprocessor.
The Pentium Pro bus design superficially appears identical to that of the Pentium with 32-bit addressing, a 64-bit data path, and a maximum clock rate of 66MHz. Below the surface, however, Intel enhanced the design by shifting to a split-transaction protocol. Whereas the Pentium (and, indeed, all previous Intel processors) handled memory accessing as a two-step process (the chip sends an address out the bus on one clock cycle and reads the data on the next), the Pentium Pro can put an address on the bus at the same time it reads data from a previously posted address. Because the address and data buses use separate lines, these two operations can occur simultaneously. In effect, the throughput of the bus can nearly double without an increase in its clock speed.
The internal bus interface logic of the Pentium Pro is designed for multiprocessor systems. Up to four Pentium Pro chips can be directly connected together, pin for pin, without any additional support circuitry. The computer’s chipset arbitrates the combination.
One underlying reason for the cartridge-style design is to accommodate the Pentium II’s larger secondary cache, which is not integral to the chip package but rather co-mounted on the circuit board inside the cartridge. The 512KB of static cache memory connects through a 64-bit back-side bus. Note that the secondary cache memory of a Pentium II operates at one-half the speed of the core logic of the chip itself. This reduced speed is, of course, a handicap, but it was a design expediency: it lowered the cost of the technology, allowing Intel to use off-the-shelf cache memory (from another manufacturer, at least initially) in a lower-cost package. The Pentium II secondary cache design has another limitation. Although the Pentium II can address up to 64GB of memory, its cache can track only 512MB. The Pentium II also has a 32KB primary cache that’s split with 16KB assigned to data and 16KB to instructions. Table 5.3 summarizes the Intel Pentium II line.
Mobile Pentium II
To bring the power of the Pentium II processor to notebook computers, Intel reengineered the desktop chip to reduce its power consumption and altered its packaging to fit slim systems. The resulting chip—the Mobile Pentium II, introduced on April 2, 1997—preserved the full power of the Pentium II while sacrificing only its multiprocessor support. The power savings comes from two changes. The core logic of the Mobile Pentium II is specifically designed for low-voltage operation and has been engineered to work well with higher external voltages. It also incorporates an enriched set of power-management modes, including a new QuickStart mode that essentially shuts down the chip, except for the logic that monitors for bus activity by the PCI bridge chip, and allows the chip to wake up when it’s needed. This design, because it does not monitor for other processor activity, prevents the Mobile Pentium II from being used in multiprocessor applications. The Mobile Pentium II can also switch off its cache clock during its sleep or QuickStart states.
Initially, the Mobile Pentium II shared the same P6 core logic design and cache design with the desktop Pentium II (full-speed 32KB primary cache and half-speed 512KB secondary cache inside its mini-cartridge package). However, as fabrication technology improved, Intel was able to integrate the secondary cache on the same die as the processor core, and on January 25, 1999, the company introduced a new version of the Mobile Pentium II with an integral 256KB cache operating at full core speed. Unlike the Pentium II, the mobile chip has the ratio between its core and bus clocks fixed at the factory to operate with a 66MHz front-side bus. Table 5.4 lists the introduction dates and basic characteristics of the Mobile Pentium II models.
Pentium II Celeron
Introduced on March 4, 1998, the Pentium II Celeron was Intel’s entry-level processor derived from the Pentium II. Although it had the same processor core as what was at the time Intel’s premier chip (the second-generation Pentium II with 0.25-micron design rules), Intel trimmed the cost of building the chip by eliminating the integral 512KB secondary (Level Two) memory cache installed in the Pentium II cartridge. The company also opted to lower the packaging cost of the chip by omitting the metal outer shell of the full Pentium II and instead leaving the Celeron’s circuit board substrate bare. In addition, the cartridge-based Celeron package lacked the thermal plate of the Pentium II and the latches that secure it to the slot. Intel terms the Celeron a Single Edge Processor Package to distinguish it from the Single Edge Contact cartridge used by the Pentium II.
In 1999, Intel introduced a new, lower-cost package for the Celeron, a plastic pin-grid array (PPGA) shell that looks like a first-generation Pentium on steroids. It has 370 pins and mates with Intel’s PGA370 socket. The chip itself measures just under two inches square (nominally 49.5 millimeters) and about three millimeters thick, not counting the pins, which hang down another three millimeters or so (the actual specification is 3.05 to 3.30 millimeters).
When the Celeron chip was initially introduced, the absence of a cache took such a toll on performance that Intel was forced by market pressure to revise its design. In August 1998, the company added a 128KB cache operating at one-half core speed to the Celeron. Code names distinguished the two chips: the first Celeron was code-named Covington during development; the revised chip was code-named Mendocino. Intel further increased the cache to 256KB on October 2, 2001, with the introduction of a 1.2GHz Celeron variant.
Intel also distinguished the Celeron from its more expensive processor lines by limiting its front-side bus speed to 66MHz. All Celerons sold before January 3, 2001 were limited to that speed. With the introduction of the 800MHz Celeron, Intel kicked the chip’s front-side bus up to 100MHz. With the introduction of a 1.7GHz Celeron on May 2, 2002, Intel started quad-clocking the chip’s front-side bus, yielding an effective data rate of 400MHz.
Intel also limited the memory addressing of the Celeron to 4GB of physical RAM by omitting the four highest address bus signals used by the Pentiums II and III from the Celeron pin-out. The Celeron does not support multiprocessor operation, and, until Intel introduced the Streaming SIMD Extensions to the 1.2GHz version, the Celeron understood only the MMX extension to the Intel instruction set.
Table 5.5 lists the features and introduction dates of various Celeron models.
Pentium II Xeon
In 1998, Intel sought to distinguish its higher performance microprocessors from its economy line. In the process, the company created the Xeon, a refined Pentium II microprocessor core enhanced by a higher-speed memory cache, one that operated at the same clock rate as the core logic of the chip.
At heart, the Xeon is a full 32-bit microprocessor with a 64-bit data bus, as with all Pentium-series processors. Its address bus provides for direct access to up to 64GB of RAM. The internal logic of the chip allows for up to four Xeons to be linked together without external circuitry to form powerful multiprocessor systems.
A sixth generation processor, the Xeon is a Pentium Pro derivative by way of the standard Pentium II. It incorporates two 12-stage pipelines to make what Intel terms Dynamic Execution micro-architecture.
The Xeon incorporates two levels of caching. One is integral to the logic core itself, a primary 32KB cache split 16KB for instructions, 16KB for data. In addition, a separate secondary cache is part of the Xeon processor module but is mounted separately from the core logic on the cartridge substrate. This integral-but-separate design allows flexibility in configuring the Xeon. Current chips are available equipped with either 512KB or 1MB of L2 cache, and the architecture and slot design allow for secondary caches of up to 2MB. This integral cache runs at the full core speed of the microprocessor.
This design required a new interface, tagged Slot 2 by Intel.
Initially the core operating speed of the Xeon started where the Pentium II left off (at the time, 400MHz) and followed the Pentium II up to 450MHz.
The front-side bus of the Xeon was initially designed for 100MHz operation, although higher speeds are possible and expected. A set of contacts on the SEC cartridge allows the motherboard to adjust the multiplier that determines the ratio between front-side bus and core logic speed.
The independence of the logic core and cache is emphasized by the power requirements of the Xeon. Each section requires its own voltage level. The design of the Xeon allows Intel flexibility in the power requirements of the chip through a special coding scheme. A set of pins indicates the core voltage and the cache voltage required by the chip, and the motherboard is expected to read this coding and deliver the required voltages. The Xeon design allows for core voltages as low as 1.8 volts or as high as 2.1 volts (the level required by the first chips). Cache voltage requirements may reach as high as 2.8 volts. Nominally, the Xeon is a 2-volt chip.
Overall, the Xeon is optimized for workstations and servers and features built-in support for up to four identical chips in a single computer. Table 5.6 summarizes the original Xeon product line.
Pentium II OverDrive
To give an upgrade path for systems originally equipped with the Pentium Pro processor, Intel developed a new OverDrive line of direct-replacement upgrades. These Pentium II OverDrive chips fit the same zero-insertion force Socket 8 used by the Pentium Pro, so you can slide one chip out and put the other in. Dual-processor systems can use two OverDrive upgrades. Intel warns that some systems may require a BIOS upgrade to accommodate the OverDrive upgrade.
The upgrade offers the revised design of the Pentium II (which means better 16-bit operation) as well as higher clock speeds. The chip also can earn an edge over ordinary Pentium II chips operating at the same speeds—the 512KB secondary cache in the OverDrive chip operates at full core logic speed, not half speed as in the Pentium II.
Although more than a dozen companies make microprocessors, any new personal computer you buy will likely be based on a chip from one of only three companies: Advanced Micro Devices, Intel, or Transmeta. Intel, the largest semiconductor-maker in the world, makes the majority of computer processors—about 80 percent in the first quarter of the year 2002, according to Mercury Research. In the same period, AMD sold 18.2 percent of the chips destined for personal computers.
Intel earned its enviable position not only by inventing the microprocessor but also by a quirk of fate: IBM chose one of its chips for the first Personal Computer in 1981, the machine after which all modern personal computers have been patterned. Computers must use Intel chips or chips designed to match the Intel Architecture to be able to run today’s most popular software. Because microprocessors are so complicated, designing and building them requires a huge investment, which prevents new competitors from edging into the market.
Currently Intel sells microprocessors under several brand names. The most popular is the Pentium 4, which Intel claims as its best-selling microprocessor, ever. But the Pentium 4 is not a single chip design. Rather, it’s a trade name. Intel has marketed two distinctly different chip designs as Pentium 4. More confusing still, Intel markets the same core logic under more than one name. At one time, the Pentium, Celeron, and Xeon all used essentially the same internal design. The names designated the market segment Intel hoped to carve for the respective chips—Celeron for the price-sensitive low end of the personal computer marketplace, Pentium for the mainstream, and Xeon for the pricey, high-end server marketplace. Although Intel did tailor some features to justify its market positioning of the chips, they nevertheless shared the same circuitry deep inside.
Today, that situation has changed. Intel now designs Xeon chips separately, and Celerons often retain older designs and technologies longer than mainstream Pentiums. The market position assigned to the microprocessor names remains the same. Intel offers Celeron chips for the budget conscious; they sacrifice the last bit of performance to make computers more affordable. Pentium processors are meant for the mainstream (most computer purchasers) and deliver full performance for single-user computers. Xeons are specialized microprocessors designed primarily for high-powered server systems. Intel has added a further name, Itanium, to its trademark lineup. The Itanium uses a new architecture (usually termed IA64, shorthand for 64-bit Intel Architecture) that is not directly compatible with software meant for other Intel chips.
The important lesson is that the names you see on the market are only brand names and do not reflect what’s inside a chip. Some people avoid confusion by using the code name the manufacturer gave a given microprocessor design during its development. These code names allow consumers to distinguish the Northwood processor from the Willamette, both of which are sold as Pentium 4. The Northwood is a newer design with higher-speed potentials. Table 5.7 lists many of Intel’s microprocessor code names.
Announced in January and officially released on February 26, 1999, the Pentium III was the swan song for Intel’s P6 processor core, developed for the Pentium Pro. Code-named Katmai during its development, the Pentium III chip is most notable for adding SSE to the Intel microprocessor repertory. SSE is a compound acronym for Streaming SIMD Extensions (SIMD itself being an acronym for Single Instruction, Multiple Data). SIMD technology allows one microprocessor instruction to operate across several bytes or words (or even larger blocks of data). In the Pentium III, SSE (formerly known as the Katmai New Instructions or KNI) is a set of 70 new SIMD codes for microprocessor instructions that allows programs to specify elaborate three-dimensional processing functions with a single command.
Unlike the MMX extensions, which added no new registers to the basic Pentium design and instead simply redesignated the floating-point unit registers for multimedia functions, Intel’s Streaming SIMD Extensions add new registers to Intel architecture, pushing the total number of transistors inside the core logic of the chip above 9.5 million.
At heart, the Pentium III uses the same core logic as its Pentium II and Pentium Pro forebears, amended to handle its larger instruction set. The enhancements chiefly involve the floating-point unit, which does double-duty processing multimedia instructions. In other words, the Pentium III does not mark a new generation of microprocessor technology or performance. Even Intel noted that on programs that do not take advantage of the Streaming SIMD Extensions of the Pentium III, the chips deliver performance that’s about the same as a Pentium II.
Although the initial fabrication used 0.25-micron technology, Intel rapidly shifted to 0.18-micron technology with the Coppermine design. The result is that the circuitry of the chip takes less silicon (making fabrication less expensive), requires less power, and is able to operate at higher speeds. The initial Pentium III releases, and all versions through at least the 600MHz chip, operate with a 100MHz memory bus. Many of the new chips using the Coppermine core ratchet the maximum memory speed to 133MHz using Rambus memory technology, although some retain a 100MHz maximum memory speed.
In going to the new Coppermine design, Intel replaced earlier Pentium III chips built with 0.25-micron design features, in particular those at the 450, 533, 550, and 600MHz speeds. The newer chips are designated with the suffix E. In addition, to distinguish chips with 133MHz front-side bus capability from the 100MHz versions offered at the same speeds, Intel added a B suffix to the designation of 533 and 600MHz chips capable of running their memory buses at the higher 133MHz speed.
The Pentium III was the first Intel processor to cross the line at a 1GHz clock speed with a chip released on March 8, 2000. The series ended its development run at 1.13GHz on July 31, 2000, although Intel continued to manufacture the chip into 2002.
The Pentium III was designed to plug into the same Slot 1 as the Pentium II; however, the Pentium III now comes in three distinct packages. One, the SEC cartridge, is familiar from the Pentium II. The Pentium III is also available in the SEC cartridge 2 (or SECC2). In most respects, the SECC2 is identical to the SEC cartridge and plugs into the same slot (Slot 1, which Intel has renamed SC242), but the SECC2 package lacks the thermal plate of the earlier design. Instead, the SECC2 is designed to mate with an external heatsink, and, because of the lack of the thermal plate, it makes a better thermal connection for more effective cooling. (Well, it makes sense when Intel explains it.) In addition, the 450 and 550E are available in a new package design termed FC-PGA (for Flip-Chip Pin Grid Array), which is more compact and less expensive than the cartridge design.
As with the Pentium II, the 0.25-micron versions of the Pentium III have a 32KB integral primary cache and include a 512KB secondary cache on the same substrate but not in the same hermetic package as the core CPU. The secondary cache runs at half the chip speed. The newer 0.18-micron versions have a smaller, 256KB secondary cache but operate it at full chip speed and locate it on the same silicon as the core logic. In addition, Intel has broadened the data path between the core logic and the cache to enhance performance (Intel calls this Advanced Transfer Cache technology). According to Intel, these improvements give the newer Coppermine-based (0.18-micron) Pentium III chips a 25-percent performance advantage over older Pentium III chips operating at the same clock speed. The entire Pentium III line supports multiprocessing with up to two chips.
The most controversial aspect of the Pentium III lineup is its internal serial number. Hard-coded into the chip, this number is unique to each individual microprocessor. Originally Intel foresaw that a single command—including a query from a distant Web site—would cause the chip to send out its serial number for positive identification (of the chip, of the computer it is in, and of the person owning or using the computer). Intel believed the feature would improve Internet security, not to mention allowing the company to track its products and detect counterfeits. Consumer groups saw the “feature” as an invasion of privacy, and under threat of boycott Intel changed its policy. Where formerly the Pentium III would default to making the identification information available, after the first production run of the new chip, the identification would default to off and require a specific software command to make the serial number accessible. Whether the chip serial number is available becomes a setup feature of the BIOS in PCs using the Pentium III chip, although a software command can override that setting. In other words, someone can always interrogate your PC to discover your Pentium III’s serial number. Therefore, you might want to watch what you say online when you run with the Pentium III. Table 5.8 summarizes the history of the Pentium III.
Pentium III Xeon
To add its Streaming SIMD Extensions to its server products, on March 17, 1999, Intel introduced the Pentium III Xeon. As with the Pentium III itself, the new instructions are the chief change, but they are complemented by a shift to finer technology. As a result, the initial new Xeons start with a speed of 500MHz. At this speed, Intel offers the chip with either a 512KB, 1MB, or 2MB integral secondary cache operating at core speed. The new Slot 2 chips also incorporate the hardware serial number feature of the Pentium III chip.
Developed under the code name Tanner, the Pentium III Xeon improved upon the original (Pentium II) Xeon with additions to the core logic to handle Intel’s Streaming SIMD Extensions. Aimed at the same workstation market as the original Xeon, the Pentium III Xeon is distinguished from the ordinary Pentium III by its larger integral cache, its Slot 2 packaging, and its wider multiprocessor support—the Pentium III Xeon design allows for servers with up to eight processors.
When Intel introduced its Coppermine 0.18-micron technology on October 25, 1999, it unveiled three new Pentium III Xeon versions with speed ratings up to 733MHz. Except for packaging, however, these new Xeons differed little from the ordinary Pentium III line. As with the mainstream processors, the Xeons supported a maximum of two processors per system and had cache designs identical to the ordinary Pentium III with a 256KB secondary cache operating at full processor speed using wide-bus Advanced Transfer Cache technology. In May of 2000, Intel added Xeons with larger, 1MB and 2MB caches as well as higher-speed models with 256KB caches. Table 5.9 lists the characteristics of all of Intel’s Pentium III Xeon chips.
Intel Pentium 4
Officially introduced on November 20, 2000, the Pentium 4 is Intel’s newest and most powerful microprocessor core for personal computers. According to Intel, the key advance made by the new chip is its use of NetBurst micro-architecture, which can be roughly explained as a better way of translating program instructions into the micro-ops that the chip actually carries out. NetBurst is the first truly new Intel core logic design since the introduction of the P6 (Pentium Pro) in 1995.
Part of the innovation is an enhancement to the instruction set; another part is an improvement to the underlying hardware. All told, the first Pentium 4 design required the equivalent of 42 million transistors.
Chips designated Pentium 4 actually use one of two designs. Intel code-named the early chips Willamette and used 0.18-micron design rules in their fabrication. At initial release, these chips operated at 1.4GHz and 1.5GHz, but Intel soon upped their clock speeds. At the time, Intel and AMD were in a horserace for the fastest microprocessor, and the title shifted between the Athlon and Pentium 4 with each new chip release.
In 2002, Intel shifted to 0.13-micron design rules with a new processor core, code-named Northwood. This shift resulted in a physically smaller chip that also allows more space for cache memory—whereas the Willamette chip boasts 256KB of on-chip Level Two cache operating at full core speed, the Northwood design doubles that to 512KB. The difference is in size only. Both chips have a 256-bit-wide connection with their caches, which use an eight-way set-associative design.
Of particular note, the Pentium 4 uses a different system bus from that of the Pentium III chip. As a practical matter, that means the Pentium 4 requires different chipsets and motherboards from those of the Pentium III. Although this is ordinarily the concern of the computer manufacturer, the new design has important benefits. It adds extra speed to the system (memory) bus by shifting data up to four times faster than older designs using a technology Intel calls Source-Synchronous Transfer. In effect, this signaling system packs four bits of information into each clock cycle, so a bus with a 133MHz nominal clock speed can shift data at an effective rate of 533MHz. The address bus is double-clocked, signaling twice in each clock cycle, yielding an effective rate of 266MHz. Because the Pentium 4, like earlier Pentiums, has a 64-bit-wide data bus, that speed allows it to move information at a peak rate of 4.3GBps (that is, 8 bytes times 533MHz).
Only Northwood chips rated at 2.26GHz, 2.4GHz, and 2.53GHz have 533MHz system buses. Other Northwood chips, as well as all Willamette versions, use a 400MHz system bus (that is, a quadruple-clocked 100MHz bus). Note that chips operating at 2.4GHz may have either a 400MHz or 533MHz system bus. The system bus speed is set in the system design and cannot be varied through hardware or software.
The Pentium 4 has three execution units. The two integer arithmetic/logic units (ALUs) comprise what Intel calls a rapid execution engine. They are “rapid” because they operate at twice the speed of the rest of the core logic (that is, 5.06GHz in a 2.53GHz chip), executing up to two instructions in each clock cycle. The registers in each ALU are 32 bits wide.
Unlike previous Intel floating-point units, the registers in the Pentium 4 FPU are 128 bits wide. The chief benefit of these wider registers is in carrying out multimedia instructions using a further enhancement on Intel’s Streaming SIMD Extensions (SSE), a set of 144 new instructions (mainly aimed at moving bytes in and out of the 128-bit registers but also including double-precision floating-point and memory-management instructions) called SSE2.
Intel lengthened the pipelines pumping instructions into the execution units to 20 stages, the longest of any microprocessor currently in production. Intel calls this design hyperpipelined technology.
One way Intel pushes more performance from the Pentium 4 is by double-clocking the integer units in the chip. They operate at twice the external clock frequency applied to the chip (that is, at 3GHz in the 1.5GHz Pentium 4 chip). Balancing the increased speed is a 400MHz system bus throughput as well as an improved primary cache and integral 256KB secondary cache. Intel connects this secondary cache to the rest of the chip through a new 256-bit-wide bus, double the size of those in previous chips.
Intel also coined the term hyperpipelining to describe the Pentium 4. The term refers to Intel’s doubling the depth of the instruction pipeline, as compared to the previous line-leader, the Pentium III. One (but not all) of the pipelines in the Pentium 4 stretches out for 20 stages. Intel claims that the new NetBurst micro-architecture enabled the successful development of the long pipeline because it minimizes the penalties associated with mispredicting instruction branches.
The Pentium 4 recognizes the same instruction set as previous Intel microprocessors, including the Streaming SIMD Extensions introduced with the Pentium III, but the Pentium 4 adds 144 more instructions to the list. Intel terms the result SSE2. The chip has the same basic data and address bus structure as the Pentium III, allowing it to access up to 64GB of physical memory eight bytes (64 bits) at a time.
Table 5.10 summarizes the Intel Pentium 4 line.
Xeon (Pentium 4)
On May 21, 2001, Intel released the first of its Xeon microprocessors built using the NetBurst core logic of the Pentium 4 microprocessor. To optimize the chip for use in computer servers, the company increased the secondary cache size of the chip, up to 2MB of on-chip cache. For multiprocessor systems, Intel later derived a separate chip, the Xeon MP processor, for servers with two to four microprocessors. The chief difference between the MP chip and the base Xeon chip is the former chip’s caching—a three-level design. The primary cache is 8KB, the secondary cache is 256KB, and the tertiary cache is either 512KB or 1MB.
Table 5.11 summarizes the Pentium 4 Xeon, including Xeon MP processors.
The diversity of models that Intel puts on the desktop is exceeded only by the number of microprocessors it makes for portable computers. The current lineup includes four major models, each of which includes chips meant to operate on three different voltage levels, in a wide range of frequencies. The choices include Mobile Celeron, Mobile Pentium III (soon to fall from the product line, as of this writing), Mobile Pentium III-M, and Mobile Pentium 4-M. Each has its target market, with Celeron at the low end, Pentium 4 at the highest, and the various Pentium III chips for everything in between. Each chip shares essentially the same core as the desktop chip bearing the same designation. But mobile chips have added circuitry for power management and, in some models, different packaging.
Microprocessors for portable computers differ from those meant for desktop systems in three ways: operating power, power management, and performance. The last is a result of the first and, in normal operation, the second.
To help portable computers run longer from a single charge of their batteries, Intel and other microprocessor manufacturers reduce the voltage at which their chips operate. Intel, in fact, produces chips in three voltage ranges, which it calls very (or sometimes, ultra) low voltage, low voltage, and nothing at all (standard-voltage chips get no special designation). Very-low-voltage chips operate as low as 0.95 volts. Low-voltage chips dip down to about 1.1 volts. Low-voltage operation necessitates lower-speed operation, which limits performance. Of the chips that Intel produces in all three power levels, the very-low-voltage chips inevitably have the slowest megahertz rating.
Power management aids in the same end—prolonging battery life—and incidentally prevents your laptop computer from singeing you should you operate it on your lap. Some machines still get uncomfortably hot.
Intel aims its Mobile Celeron at the budget market. The chips are not only restricted to lower clock speeds than their Pentium siblings, but most Mobile Celeron chips lack Intel’s SpeedStep power-saving technology. The Mobile Celeron perpetually lags Intel’s desktop processors in other performance indexes. For example, much like the desktop Celerons, the mobile chips usually are one step out-of-date in front-side bus speed. Although Intel has bumped its top memory bus speed to 533MHz, it only endows the latest versions of the Mobile Celeron with the older quad-pumped 400MHz rating. Most Mobile Celerons still have 100MHz front-side buses.
Much like the desktop line, the first Mobile Celeron was a modest alteration of the Pentium II. Introduced on January 25, 1999, it used the same core logic but with a different cache design, a 128KB cache on the chip substrate operating at full core speed. Otherwise, the chip followed the Pentium II design and matched its range of speeds, from 266MHz up to 466MHz, using the same 66MHz front-side bus as the desktop chips and the same 0.25-micron design rules. It differed chiefly by operating at a lower, power-saving voltage, 1.6 volts.
On February 14, 2000, Intel revised the Mobile Celeron design to take advantage of the Pentium III core and its 0.18-micron design rules. The newer Mobile Celerons gained two advantages: the higher, 100MHz front-side bus speed and the Streaming SIMD Extensions to the instruction set. In addition to a faster chip (at 500MHz), Intel also introduced a 450MHz chip, slower than the quickest of the old Mobile Celeron design but able to take advantage of the higher bus speed. Intel continued to upgrade this core up to a speed of 933MHz, introduced on October 1, 2001.
When Intel moved the Mobile Celeron line to 0.13-micron technology, the company cut the core voltage of the chips down to 1.45 volts while pushing up its top clock speed to 1.2GHz. The smaller design rules left more space on the chip’s silicon, which Intel used for an enhanced secondary cache, pushing it to 256KB.
On June 24, 2002, Intel switched the Mobile Celeron core once again, bringing in a design derived from the Pentium 4. The new core allowed Intel to trim the chip’s operating voltage once again, down to 1.3 volts. In addition, new versions of the Mobile Celeron boast the quad-pumped 400MHz front-side bus speed of the Pentium 4 as well as its enhanced Streaming SIMD Extensions 2 instruction set.
Table 5.12 summarizes the life history of Intel’s Mobile Celeron product line.
Mobile Pentium III
For about a year, Intel’s most powerful mobile chips wore the Pentium III designation. As the name implies, they were derived from the desktop series with several features added to optimize them for mobile applications and, incidentally, to bring the number of transistors on their single slice of silicon to 28 million. Table 5.13 summarizes the Mobile Pentium III product line.
Mobile Pentium III-M
When Intel shifted to new fabrication, the company altered the core logic design of the Pentium III. Although the basic design remained the same, the tighter design rules allowed for higher-speed operation. In addition, Intel improved the power management of the Mobile Pentium III with Enhanced SpeedStep technology, which allows the chip to shift down in speed in increments to conserve power. Table 5.14 summarizes the Mobile Pentium III-M lineup.
Mobile Pentium 4-M
Intel’s highest performance mobile chip is the Mobile Pentium 4-M. Based on the same core logic as the line-leading Pentium 4, the mobile chip is enhanced with additional power-management features and lower-voltage operation. Table 5.15 summarizes the Pentium 4-M lineup.
For use in computers with strict power budgets—either because the manufacturer decided to devote little space to batteries or because the maker opted for extremely long runtimes—Intel has developed several lines of low-voltage microprocessors. These mobile chips have operating voltages substantially lower than the mainstream chips. Such low-voltage chips have been produced in three major mobile processor lines: the Mobile Celeron, the Mobile Pentium III, and the Mobile Pentium III-M.
For systems in which power consumption is absolutely critical, Intel has offered versions of its various mobile chips designed for ultra-low-voltage operation. Ultra, like beauty, is in the eye of the beholder—these chips often operate at voltages only a fraction lower than ordinary low-voltage chips. The lower operating voltage limits the top speed of these chips; they are substantially slower than the ordinary low-voltage chips at the time of their introduction. But again, they are meant for systems where long battery life is more important than performance. Table 5.16 summarizes Intel’s ultra-low-voltage microprocessors.
Intel’s Itanium microprocessor line marks an extreme shift for Intel, entirely breaking with the Intel Architecture of the past—which means that the Itanium cannot run programs or operating systems designed for other Intel chips. Instead of using the old Intel design dating back to the 4004, the Itanium introduces Intel’s Explicitly Parallel Instruction Computing architecture, which is based on the Precision Architecture originally developed by Hewlett-Packard Corporation for its line of RISC chips. In short, that means everything you know about Intel processors doesn’t apply to the Itanium, especially the performance you should expect from the megahertz ratings of the chips. Itanium chips look slow on paper but perform fast in computers.
The original Itanium was code-named Merced and was introduced in mid-2001, with an announcement from Intel on May 29, 2001, that systems soon would be shipping. The original Itanium was sold in two speeds: 733MHz and 800MHz. Equipped with a 266MHz system bus (double-clocked 133MHz), the Itanium further enhanced performance with a three-level on-chip cache design with 32KB in its primary cache, 96KB in its secondary cache, and 2MB or 4MB in its tertiary cache (depending on chip model). All aspects of the chip feature a full 64-bit bus width, both data and address lines. Meant for high-performance servers, the Itanium allows for up to 512 processors in a single computer.
Introduced on July 8, 2002, the Itanium 2 (code-named McKinley) pushed up the clock speed of the same basic design as the original Itanium by shifting down the design rules to 0.13 micron. Intel increased the secondary cache of the Itanium 2 to 256KB and allowed for tertiary caches of 1.5MB or 3MB. The refined design also increased the system bus speed to 400MHz (actually, a double-clocked 200MHz bus) with a bus width of 128 bits. Initially, Itanium 2 chips were offered at speeds of 900MHz and 1.0GHz. Table 5.17 summarizes the Itanium line.
Advanced Micro Devices Microprocessors
Advanced Micro Devices currently fields two lines of microprocessor. Duron chips correspond to Intel’s Celeron line, targeting the budget-minded consumer. Athlon chips are mainstream, full-performance microprocessors. Around the end of the year 2002, AMD will add the Opteron name to its product line. Meant to compete with the performance of Intel’s Itanium, the Opteron will differ with a design meant to run today’s software as well or better than current Intel processors. Opteron will become the new top-end of the AMD lineup.
AMD’s answer to the Pentium III and its P6 core was the Athlon. The Athlon is built on a RISC core with three integer pipelines (compared to the two inside the Pentium III), three floating-point pipelines (versus one in the Pentium III), and three instruction decoders (compared to one in the Pentium III). The design permits the Athlon to achieve up to nine operations per clock cycle, compared to five for the Pentium III. AMD designed the floating-point units specifically for multimedia and endowed them with both the MMX (under Intel license) and 3DNow! instruction sets.
Program code being what it is—not very amenable to superscalar processing—the Athlon’s advantage proved more modest in reality. Most people gave the Athlon a slight edge on the Pentium III, megahertz for megahertz. The Athlon chip was more than powerful enough to challenge Intel for leadership in processing power. For more than a year, AMD and Intel ran a speed race for the fastest processor, with AMD occasionally edging ahead even in pure megahertz.
The Athlon has several other features that help to boost its performance. It has both primary (L1) and secondary (L2) caches on-chip, operating at the full speed rating of the core logic. A full 128KB is devoted to the primary cache, half for instructions and half for data, and 256KB is devoted to the secondary cache, for a total (the figure that AMD usually quotes) of 384KB of cache. The secondary cache connects to the chip through a 64-bit back-side bus operating at the core speed of the chip.
The system bus of the Athlon also edged past that used by Intel for the Pentium III. At introduction, the Athlon allowed for a 200MHz system bus (and 133MHz memory). Later, in March, 2001, the system bus interface was bumped up to 266MHz. This bus operates asynchronously with the core logic, so AMD never bothered with some of the odd speeds Intel used for its chips. The instruction set of the Athlon includes an enhanced form of AMD’s 3DNow! The Athlon recognizes 45 3D instructions, compared to 21 for AMD’s previous-generation K6-III chip.
The design of the Athlon requires more than 22 million transistors. As with other chips in its generation, it has registers 32 bits wide but connects to its primary cache through a 128-bit bus and to the system through a 64-bit data bus. It can directly address up to 8TB of memory through an address bus that’s effectively 43 bits wide.
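The 8TB figure follows directly from the 43-bit address width, since each added address line doubles the addressable range:

```python
# Addressable memory through a 43-bit address bus, as the text states.
address_bits = 43
addressable_bytes = 2 ** address_bits
terabytes = addressable_bytes // 2 ** 40   # binary terabytes
print(terabytes)   # 8
```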
AMD has introduced several variations on the Athlon name—the basic Athlon, the Athlon 4 (to parallel the introduction of the Pentium 4), and the Athlon XP (paralleling Microsoft’s introduction of Windows XP). The difference between the Athlon and Athlon 4 is in name alone. The basic core of all these Athlons is the same. Only the speed rating has increased with time. With the XP designation, however, AMD added Intel’s Streaming SIMD Extensions to the instruction set of the chip, giving it better multimedia performance.
The Athlon comes in cartridge form and slides into AMD’s Slot A. Based on the EV6 bus design developed by Digital Equipment Corporation (now part of Compaq) for the Alpha chip (a microprocessor originally meant for minicomputers but now being phased out), the new socket is physically the same as Intel’s Slot 1, but the signals are different and the AMD chip is incompatible with slots for Intel processors.
AMD fabricated its initial Athlon chips using 0.25-micron design rules. In November, 1999, the company shifted to new fabrication facilities that enabled it to build the Athlon with 0.18-micron design rules.
Table 5.18 summarizes the features of the AMD Athlon line.
For multiprocessor applications, AMD adapted the core logic of the Athlon chip with bus control circuitry meant for high-bandwidth transfers. These chips are specifically aimed at servers rather than desktop computers. Table 5.19 summarizes the AMD offerings.
To take on Intel’s budget-priced Celeron chips, AMD slimmed down the Athlon to make a lower-priced product. Although based on the same logic core as the Athlon, the Duron skimps on cache. It retains the same 128KB primary cache, split with half handling data and half instructions, but the secondary cache is cut to 64KB. As with the Athlon, however, both caches operate at full core logic speed. The smaller secondary cache reduces the size of the silicon die required to make the chip, allowing more Durons than Athlons to be fabricated from each silicon wafer, thus cutting manufacturing cost.
The basic architecture of the Duron core matches the Athlon with three integer pipelines, three floating-point pipelines (which also process both 3DNow! and Intel’s MMX instruction sets), and three instruction/address decoders. Duron chips even share the same 0.18-micron technology used by the higher-priced Athlon. For now, however, Durons are restricted to lower speeds than the Athlon line and have not benefited from AMD’s higher-speed 266MHz system bus. All Durons use a 200MHz bus.
During development AMD used the code name Spitfire for the Duron. The company explains the official name of the chip as “derived from the Latin root durare, meaning ‘to last,’ and on, meaning ‘unit.’” The root is the same as the English word durability. Table 5.20 summarizes the characteristics of the AMD Duron line.
As with its desktop processors, AMD has two lines of chips for portable computers, the Athlon and Duron, for the high and low ends of the market, respectively. Unlike Intel, AMD puts essentially the same processors as used on the desktop in mobile packages. The AMD chips operate at the same low voltages as chips specifically designed for mobile applications, and AMD’s desktop (and therefore, mobile) products all use its power-saving PowerNow! technology.
The one difference: AMD shifted to 0.13-micron technology for its portable Athlon XP while the desktop chip stuck with 0.18-micron technology. Table 5.21 summarizes AMD’s Mobile Athlon product line.
As with desktop chips, the chief difference between AMD’s Mobile Athlon and Mobile Duron is the size of the secondary cache—only 64KB in the Duron chips. Table 5.22 summarizes the Mobile Duron line.
AMD chose the name Opteron for what it calls its eighth generation of microprocessors, for which it has used the code-name Hammer during development. The Opteron represents the first 64-bit implementation of Intel architecture, something Intel has neglected to develop.
The Opteron design extends the registers of Pentium-style computers to a full 64 bits wide. It’s a forthright extension of the current Intel architecture, and AMD makes the transition the same way Intel extended the original 16-bit bus of the 8086-style chips to 32 bits for the 386 series. The new, wide registers are a superset of the 32-bit registers. In the Opteron’s compatibility mode, 16-bit instructions simply use the least significant 16 bits of the wide registers, 32-bit instructions use the least significant 32 bits, and 64-bit instructions use the entire register width. As a result, the Opteron can run any Intel code at any time without the need for emulators or coprocessors. Taking advantage of the full 64-bit power of the Opteron will, of course, require new programs written with the new 64-bit instructions.
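The register-subsetting scheme can be sketched with bit masks. This is an illustrative model of the behavior described above, not actual chip logic (real 64-bit x86 implementations also zero-extend 32-bit writes, a detail omitted here):

```python
# Illustrative model: 16-, 32-, and 64-bit operations sharing one 64-bit
# register, each touching only the least significant bits it needs.
MASKS = {16: 0xFFFF, 32: 0xFFFF_FFFF, 64: 0xFFFF_FFFF_FFFF_FFFF}

def read_reg(reg, width):
    """A narrow read sees only the low `width` bits of the register."""
    return reg & MASKS[width]

def write_reg(reg, operand, width):
    """A narrow write replaces only the low `width` bits."""
    return (reg & ~MASKS[width] & MASKS[64]) | (operand & MASKS[width])

reg = 0x1122_3344_5566_7788
print(hex(read_reg(reg, 16)))            # 0x7788
print(hex(read_reg(reg, 32)))            # 0x55667788
print(hex(write_reg(reg, 0xAAAA, 16)))   # 0x112233445566aaaa
```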
The Opteron design also changes the structure of the processor core, rearranging the pipelines and processing units. The Opteron design uses three separate decode pipelines that feed a packing stage that links all three pipelines to more efficiently divide operations between them. The pipelines then feed into another stage of decoding, then eight stages of scheduling. At that point, the pipelines route integer and floating-point operations to individual processors. AMD quotes a total pipeline of 12 stages for integers and 17 for floating-point operations. The floating-point unit understands everything from MMX through 3DNow! to Intel’s latest SSE2.
As important as the core logic is, AMD has made vast improvements on the I/O of the Opteron. Major changes come in two areas. AMD builds the memory controller into the Opteron, so the chip requires no separate memory control hub. The interface uses DDR memory through two 128-bit-wide channels. Each channel can handle four memory modules, initially those rated for PC1600 operation, although the Opteron design allows for memory as fast as PC2700. According to AMD, building the memory interface into the Opteron reduces latency (waiting time), an advantage that increases with every step up in clock speed.
Strictly speaking, the Crusoe processors from Transmeta Corporation are not Intel architecture chips. They use an entirely different instruction set from Intel chips and by themselves could not run a Windows program on a dare. Transmeta’s not-so-secret weapon is what it calls Code Morphing software, a program that runs on the Crusoe chip and translates Intel’s instruction set into its own. In effect, the Crusoe chip is the core logic of a modern Intel Architecture chip stripped of its hardware code translation.
The core is a very long instruction word processor, one that uses instructions that can be either 64 or 128 bits long. The core has two pipelines—an integer pipeline with seven stages and a floating-point pipeline with 10. Transmeta keeps the control logic for the core logic simple. It does not allow out-of-order execution, and instruction scheduling is handled by software.
Transmeta provides both a 64KB primary instruction cache and a 64KB primary data cache. The Crusoe comes with either of two sizes of secondary cache. The TM5500 uses a 256KB secondary cache, and the TM5800 has a 512KB secondary cache. At the time this was written, the chips were available with speed ratings of 667, 700, 733, 800, 867, and 900MHz.
To help the chip mimic Intel processors, the Crusoe family has a translation look-aside buffer that uses the same protection bits and address-mapping as Intel processors. The Crusoe hardware generates the same condition codes as Intel chips, and their floating-point units use the same 80-bit format as Intel’s basic FPU design (but not the 128-bit registers used by SSE2 instructions).
The result of this design is a very compact microprocessor that does what it does very quickly while using very little power. Transmeta has concentrated its marketing on the low power needs of the Crusoe chips, and they are used almost exclusively in portable computers. A less charitable way of looking at the Crusoe is that its smaller silicon needs make for a chip that’s far less expensive to manufacture and easier to design. That’s not quite fair because developing the Code Morphing software is as expensive as designing silicon logic. Moreover, the current Crusoe chips take advantage of the small silicon needs of their logic cores, adding more features onto the same die. The current Crusoe models include the north bridge circuits of a conventional chipset on the same silicon as the core logic. The Crusoe chip includes the system bus, memory, and PCI bus interfaces, making portable computer designs potentially more compact. Current Crusoe versions support both SDR and DDR memory with system bus speeds up to 133MHz.
Another way to look at Code Morphing is to consider it as a software emulator, a program that runs on a chip to mimic another. Emulators are often used at the system level to allow programs meant for one computer to run on another. The chief distinctions between Code Morphing and traditional emulation are that Code Morphing works at the chip level and that the Crusoe chip keeps the necessary translation routines in firmware stored in read-only memory (ROM) chips.
According to Transmeta, Code Morphing also helps the Crusoe chip to be faster, enabling it to keep up with modern superscalar chips. The Code Morphing software doesn’t translate each Intel instruction on the fly. Instead, it translates a series of instructions, potentially even full subroutines. It retains the results as if in a cache so that if it encounters the same set of Intel instructions again, it can look up the code to use rather than translating it again. The effect doesn’t become apparent, according to Transmeta, until the Intel-based routine has been executed several times. The tasks typically involved with running a modern computer—the Windows graphic routines, software drivers, and so on—should benefit greatly from this technology. In reality, Crusoe processors don’t test well, but they deliver adequate performance for the sub-notebook computers that are their primary application.
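The translate-once, reuse-thereafter behavior described above resembles a memoized translator. A minimal sketch, with hypothetical names (this is not Transmeta’s actual software):

```python
# Minimal sketch of translate-and-cache, the behavior the text attributes
# to Code Morphing. All names here are illustrative, not Transmeta's.
translation_cache = {}   # maps a block of guest instructions to native code

def translate_block(guest_block):
    """Stand-in for the expensive translation step."""
    return tuple("native_" + op for op in guest_block)

def execute_block(guest_block):
    key = tuple(guest_block)
    if key not in translation_cache:       # translate only on first encounter
        translation_cache[key] = translate_block(guest_block)
    return translation_cache[key]          # later runs reuse the cached code

loop = ["mov", "add", "jnz"]
first = execute_block(loop)    # translated on this call
second = execute_block(loop)   # served from the cache
print(first is second)         # True
```

The payoff, as the text notes, comes only after a block has run several times, which is why frequently executed routines such as drivers and graphics code benefit most.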