INTEL AND MOTOROLA 32- & 64-BIT MICROPROCESSORS
This chapter provides a summary of the basic features of 32- and 64-bit microprocessors manufactured by Intel and Motorola. Intel 80386 and Motorola 68020 are covered in detail while an overview of the other 32-bit microprocessors is also included. Finally, a brief coverage of the 64-bit microprocessors is provided.
11.1 Typical Features of 32-bit and 64-bit Microprocessors
This section describes the basic aspects of typical 32- and 64-bit microprocessors. Topics include on-chip features such as pipelining, memory management, floating-point, and cache memory implemented in typical 32- and 64-bit microprocessors.
The first 32-bit microprocessor, Intel’s problematic iAPX432, was introduced in 1980. Soon afterwards, the concept of "mainframe on a chip" or "micromainframe" was used to indicate the capabilities of these microprocessors and to distinguish them from previous 8- and 16-bit microprocessors.
The introduction of several 32-bit microprocessors revolutionized the microprocessor world. The performance of these 32-bit microprocessors is comparable to that of superminicomputers such as Digital Equipment Corporation’s VAX-11/750 and VAX-11/780. Designers of 32-bit microprocessors have implemented many powerful features of mainframe computers to increase the capabilities of the microprocessor chip sets. These include pipelining, on-chip cache memory, memory management, and floating-point arithmetic.
As mentioned in Chapter 8, pipelining is the technique in which instruction fetch and execute cycles are overlapped. This method allows simultaneous preparation for execution of one or more instructions while another instruction is being executed. Pipelining was used for many years in mainframe and minicomputer CPUs to speed up the instruction execution time of these machines. The 32-bit microprocessors implement the pipelining concept and simultaneously operate on several 32-bit words, which may represent different instructions or part of a single instruction.
Although pipelining greatly increases the rate of execution of nonbranching code, pipelines must be emptied and refilled each time a branch or jump instruction is encountered in the code. This may slow down the processing rate for code with many branches or jumps. Thus, there is an optimum pipeline depth, which is strongly related to the instruction set, architecture, and gate density attainable on the processor chip. For many of the applications run on the 32-bit microprocessors, the three-stage pipeline is considered a reasonably optimal depth.
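The trade-off between pipeline depth and branch frequency can be illustrated with a rough cycle count. The following Python sketch is a simplified model and not a description of any particular processor: it assumes one instruction enters the pipeline per clock, and that every taken branch discards the (depth - 1) instructions fetched behind it. The function names are hypothetical.

```python
def pipeline_cycles(n_instructions: int, depth: int, n_taken_branches: int) -> int:
    """Clock cycles for an idealized pipeline (assumed model, see lead-in)."""
    if n_instructions == 0:
        return 0
    fill = depth - 1                                  # cycles to fill the pipeline once
    flush_penalty = n_taken_branches * (depth - 1)    # refill after each taken branch
    return fill + n_instructions + flush_penalty


def unpipelined_cycles(n_instructions: int, depth: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * depth


if __name__ == "__main__":
    # 100 instructions on a three-stage pipeline with 0, 10, and 40 taken branches.
    for branches in (0, 10, 40):
        print(branches, pipeline_cycles(100, 3, branches), unpipelined_cycles(100, 3))
```

In this model, the more taken branches a program contains, the smaller the advantage of the pipeline becomes, and the penalty per branch grows with the pipeline depth.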
With memory management, virtual memory techniques, traditionally a feature of mainframes, are also implemented as on-chip hardware on typical 32-bit microprocessors.
This allows programmers to write programs much larger than those that could fit in the main memory space available to the microprocessors; the programs are simply stored on a secondary device, such as a disk drive, and portions of the program are swapped into main memory as needed.
Segmentation circuitry has been included in many 32-bit microprocessor chips.
With this technique, blocks of code called "segments," which correspond to modules of the program and have varying sizes set by the programmer or compiler, are swapped. For many applications, however, an alternative method borrowed from mainframes and superminis called "paging" is used. Basically, paging differs from segmentation in that pages are of equal sizes. Demand paging, in which the operating system automatically swaps pages as needed, can be used with all 32-bit microprocessors.
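Demand paging can be sketched with a short simulation. The Python example below is illustrative only: the 4096-byte page size and the least-recently-used (LRU) replacement policy are assumptions chosen for the sketch, since the text does not specify how the operating system selects a page to swap out. The function name is hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


def count_page_faults(addresses, n_frames, page_size=PAGE_SIZE):
    """Count page faults for a sequence of byte addresses.

    `n_frames` is the number of pages that fit in main memory.
    LRU replacement is an assumption made for this sketch.
    """
    resident = OrderedDict()  # page number -> None, ordered from oldest to newest use
    faults = 0
    for addr in addresses:
        page = addr // page_size          # all pages have the same size
        if page in resident:
            resident.move_to_end(page)    # mark as most recently used
        else:
            faults += 1                   # page must be swapped in from disk
            if len(resident) >= n_frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults


if __name__ == "__main__":
    trace = [0, 100, 4096, 8192, 0, 12288, 4096]
    print(count_page_faults(trace, n_frames=3))
    print(count_page_faults(trace, n_frames=4))
```

With more frames of main memory available, fewer references cause a swap, which is the behavior the program sees as a larger memory.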
Floating-point arithmetic is yet another area in which the new chips are mimicking mainframes. With early microprocessors, floating-point arithmetic was implemented in software, largely as a subroutine. When required, execution would jump to a piece of code that would handle the tasks. This method, however, slows the execution rate considerably, so floating-point hardware, such as fast bit-slice (registers and ALU on a chip) processors and, in some cases, special-purpose chips, was developed. Other than the Intel 8087, these chips behaved more or less like peripherals. When floating-point arithmetic was required, the problems were sent to the floating-point processor and the CPU was freed to move on to other instructions while it waited for the results. The floating-point processor is implemented as on-chip hardware in typical 32-bit microprocessors, as in mainframe and minicomputer CPUs. Caching or memory-management schemes are utilized with all 32-bit microprocessors in order to minimize access time for most instructions.
A cache, used for years in minis and mainframes, is a relatively small, high-speed memory installed between a processor and its main memory. The theory behind a cache is that a significant portion of the CPU time spent running typical programs is tied up in executing loops; thus, the chances are good that if an instruction to be executed is not the next sequential instruction, it will be one of some relatively small number of instructions back, a concept known as locality of reference. Therefore, a high-speed memory large enough to contain most loops should greatly increase processing rates. Cache memory is included as on-chip hardware in typical 32-bit microprocessors.
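The effect of locality of reference can be demonstrated with a small simulation. The Python sketch below models a direct-mapped cache, which is one simple cache organization chosen here as an assumption; the line size, number of lines, and function names are hypothetical and do not describe any specific microprocessor.

```python
def hit_rate(addresses, n_lines, line_size):
    """Fraction of accesses that hit in a direct-mapped cache (assumed organization)."""
    tags = [None] * n_lines            # one stored tag per cache line; None means empty
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing this address
        index = block % n_lines        # cache line the block maps to
        tag = block // n_lines         # identifies which block occupies the line
        total += 1
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: fetch the block from main memory
    return hits / total if total else 0.0


def loop_trace(start, n_instructions, instruction_size, iterations):
    """Instruction-fetch addresses for a loop executed `iterations` times."""
    return [start + i * instruction_size
            for _ in range(iterations)
            for i in range(n_instructions)]


if __name__ == "__main__":
    trace = loop_trace(0, 16, 4, 10)           # a 64-byte loop run 10 times
    print(hit_rate(trace, n_lines=8, line_size=16))   # cache holds the whole loop
    print(hit_rate(trace, n_lines=2, line_size=16))   # cache smaller than the loop
```

When the cache is large enough to hold the whole loop, only the first pass misses; when it is too small, lines are repeatedly overwritten and the hit rate drops.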
Typical 32-bit microprocessors such as Pentium and PowerPC chips are superscalar processors. This means that they can execute more than one instruction in one clock cycle. Also, some 32-bit microprocessors such as the PowerPC contain an on-chip real-time clock. This allows these processors to use modern multitasking operating systems that require time keeping for task switching and for keeping the calendar date.
A few 32-bit microprocessors implement a multiple branch prediction feature. This allows these microprocessors to anticipate jumps of the instruction flow ahead of time. Also, some 32-bit microprocessors determine an optimal sequence of instruction execution by looking at decoded instructions and then determining whether to execute or hold the instructions. Typical 32-bit microprocessors use a "look ahead" approach to execute instructions. They search the instruction pool for a sequence of instructions and perform a useful task rather than execute the present instruction and then go to the next.
The 64-bit microprocessors include all the features of 32-bit microprocessors. In addition, they contain multiple on-chip integer and floating-point units and larger address and data buses. The 64-bit microprocessors can typically execute 4 instructions per clock cycle and can run at a clock speed of more than 300 MHz.
The Pentium microprocessor is designed using a combination of mostly microprogramming (CISC, Complex Instruction Set Computer) and some hardwired control (RISC, Reduced Instruction Set Computer), whereas the PowerPC is designed using hardwired control with almost no microcode. The PowerPC is a RISC microprocessor. This means that a simple instruction set is included with the PowerPC. The PowerPC instruction set includes register-to-register, load, and store instructions. All instructions involving arithmetic operations use registers; load and store instructions are utilized to access memory. Almost all computations can be obtained from these simple instructions. Finally, the 64-bit microprocessors are ideal candidates for data-crunching machines and high-performance desktop systems/workstations.
11.2 Intel 32-Bit and 64-Bit Microprocessors
This section provides a summary of Intel 32-bit and 64-bit microprocessors. The Intel line of microprocessors has gone through many changes. The 8080/8085 (8-bit) was the first major chip by Intel but did not see major use. In 1978, Intel introduced a more powerful processor called the 8086. The 8086 is covered in detail in earlier sections of this chapter. This chip had many improved features over the 8080/85. As mentioned before, the 8086 is a 16-bit processor and utilizes pipelining. Pipelining allows the processor to execute and fetch instructions at the same time. The Intel line has progressed through the years to the 80286, 80386, 80486, and Pentium. The general trend has been an expansion of the bit width of the processors both internally and externally. The Pentium processor was introduced in 1993, and the name was changed from 80586 to Pentium because of trademark law (a number such as 80586 could not be trademarked). The processor used more than 3 million transistors and had an initial speed of 60 MHz. The speed has increased over the years to the latest speed of 233 MHz. Table 11.1 compares the basic features of the Intel 80386DX, 80386SX, 80486DX, 80486SX, 80486DX2, and Pentium. These are all 32-bit microprocessors. Note that the 80386SL (not listed in the table) is also a 32-bit microprocessor with a 16-bit data bus like the 80386SX. The 80386SL can run at a speed of up to 26 MHz and has a direct addressing capability of 32 MB. The 80386SL provides virtual memory support along with on-chip memory management and protection. It can be interfaced to the 80387SX to provide floating-point support. The 80386SL includes on-chip disk controller hardware.
The Pentium microprocessor uses superscalar technology to allow multiple instructions to be executed at the same time. The Pentium uses BiCMOS technology, which combines the speed of bipolar transistors and the power efficiency of CMOS technology. The internal registers are only 32 bits wide even though the chip has a 64-bit external data bus. It has a 32-bit address bus, which allows 4 gigabytes of addressable memory space. The math coprocessor is on-chip and is up to ten times faster than the 486 in performing certain instructions. There are two execution units in the Pentium that allow the multiple execution. The multiple execution only works for instructions that are data independent, meaning that an instruction that uses the result of the immediately preceding instruction cannot be executed in parallel with it. The Pentium uses two execution units called the "U and V pipes." Each has five pipeline stages. The U pipe can execute any of the instructions in the 80×86 set, but the V pipe executes only simple instructions. Another new feature of the Pentium is branch prediction. This feature allows the Pentium to predict and prefetch codes and advance them through the pipeline without waiting for the outcome of the zero flag.
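The data-independence requirement for dual execution can be sketched as a simple check. The Python example below is a deliberately reduced model, not the Pentium's actual pairing rules: it represents each instruction as a hypothetical (destination, sources) pair of register names, and it also rejects two instructions that write the same register, which is an assumption added for the sketch. It ignores the requirement that the V pipe execute only simple instructions.

```python
def can_issue_together(first, second):
    """Return True if `second` does not depend on the result of `first`.

    Each instruction is a (destination, sources) pair, for example
    ("EAX", {"EAX", "EBX"}) for ADD EAX, EBX.
    """
    dest1, _ = first
    dest2, sources2 = second
    reads_previous_result = dest1 in sources2      # second needs the result of first
    overwrites_same_register = dest1 == dest2      # assumed extra restriction
    return not (reads_previous_result or overwrites_same_register)


if __name__ == "__main__":
    add_eax_ebx = ("EAX", {"EAX", "EBX"})   # ADD EAX, EBX
    mov_ecx_eax = ("ECX", {"EAX"})          # MOV ECX, EAX (uses the new EAX)
    mov_ecx_edx = ("ECX", {"EDX"})          # MOV ECX, EDX (independent)
    print(can_issue_together(add_eax_ebx, mov_ecx_eax))
    print(can_issue_together(add_eax_ebx, mov_ecx_edx))
```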
The implementation of virtual memory is an important feature of the Pentium.
It allows a total of 64 terabytes of virtual memory. The 386/486 allowed only a 4K page size for virtual memory, but the Pentium allows either 4K or 4M page sizes. The 4K page option makes it backward compatible with the 386/486 processors. The 4M page size option allows mapping of a large program without fragmentation. It reduces the number of page misses in virtual memory mode.
In the next section, the Intel 80386 is covered in detail.
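The difference between the two page sizes is easy to quantify. The short Python sketch below counts how many pages are needed to map a program of a given size; the 8-megabyte program is a hypothetical example, and K and M are taken as 2^10 and 2^20 bytes.

```python
KB = 1024
MB = 1024 * KB


def pages_needed(program_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map a program (rounded up to whole pages)."""
    return -(-program_bytes // page_bytes)   # ceiling division on integers


if __name__ == "__main__":
    program = 8 * MB                          # hypothetical program size
    print(pages_needed(program, 4 * KB))      # pages with the 4K option
    print(pages_needed(program, 4 * MB))      # pages with the 4M option
```

Fewer, larger pages mean fewer distinct mappings for the same program, which is why the 4M option reduces page misses for large programs.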
Table 11.1 compares the basic features of 80386, 80486, and Pentium.
11.3 Intel 80386
The Intel 80386 is Intel’s first 32-bit microprogrammed microprocessor. Its introduction in 1985 facilitated the introduction of Microsoft’s Windows operating systems. The high-speed computing required by the graphical interface of Windows operating systems was supplied by the 80386. Also, the on-chip memory management of the 80386 allowed memory to be allocated and managed by the operating system. In the past, memory management was performed by software.
The Intel 80386 is a 32-bit microprocessor and is based on the 8086. A variation of the 80386 (32-bit data bus) is the 80386SX microprocessor, which contains a 16-bit data bus along with all other features of the 80386. The 80386 is software compatible at the object code level with the Intel 8086. The 80386 includes separate 32-bit internal and external data paths along with 8 general-purpose 32-bit registers. The processor can handle 8-, 16-, and 32-bit data types. It has separate 32-bit data and address pins, and generates a 32-bit physical address. The 80386 can directly address up to 4 gigabytes (2^32 bytes) of physical memory and 64 terabytes (2^46 bytes) of virtual memory. The 80386 can be operated from a 12.5-, 16-, 20-, 25-, 33-, or 40-MHz clock. The chip has 132 pins and is typically housed in a pin grid array (PGA) package. The 80386 is designed using high-speed HCMOS III technology.
The 80386 is highly pipelined and can perform instruction fetching, decoding, execution, and memory management functions in parallel. The on-chip memory management and protection hardware translates logical addresses to physical addresses and provides the protection rules required in a multitasking environment. The 80386 contains a total of 129 instructions. The 80386 protection mechanism, paging, and the instructions to support them are not present in the 8086.
The main differences between the 8086 and the 80386 are the 32-bit addresses and data types and paging and memory management. To provide these features and other applications, several new instructions are added in the 80386 instruction set beyond those of the 8086.
11.3.1 Internal 80386 Architecture
The internal architecture of the 80386 includes several functional units that operate in parallel. The parallel operation is known as "pipelined processing." Fetching, decoding, execution, memory management, and bus access for several instructions are performed simultaneously. Typical functional units of the 80386 are these:
- Bus interface unit (BIU)
- Execution unit (EU)
- Segmentation unit
- Paging unit
The 80386 BIU performs a function similar to that of the 8086 BIU. The execution unit processes the instructions from the instruction queue. It contains mainly a control unit and a data unit. The control unit contains microcode and parallel hardware for fast multiplication, division, and effective address calculation. The data unit includes an ALU, 8 general-purpose registers, and a 64-bit barrel shifter for performing multiple bit shifts in one clock cycle. The data unit carries out data operations requested by the control unit. The segmentation unit translates logical addresses into linear addresses at the request of the execution unit. The translated linear address is sent to the paging unit.
Upon enabling of the paging mechanism, the 80386 translates the linear addresses into physical addresses. If paging is not enabled, the physical address is identical to the linear address and no translation is necessary. The 80386 segmentation and paging units support memory management functions. The 80386 does not contain any on-chip cache. However, external cache memory can be interfaced to the 80386 using a cache controller chip.
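The two-step translation described above (logical to linear by the segmentation unit, then linear to physical by the paging unit) can be sketched in Python. This is a simplified illustration: it assumes 4K pages with the linear address split into a 10-bit directory index, a 10-bit table index, and a 12-bit offset, models the page tables as nested dictionaries, and omits protection checks. All function and variable names are hypothetical.

```python
def to_linear(segment_base: int, offset: int) -> int:
    """Segmentation step: linear address = segment base + offset (32-bit wrap)."""
    return (segment_base + offset) & 0xFFFFFFFF


def split_linear(linear: int):
    """Split a linear address into (directory index, table index, page offset)."""
    directory = (linear >> 22) & 0x3FF   # upper 10 bits
    table = (linear >> 12) & 0x3FF       # middle 10 bits
    offset = linear & 0xFFF              # lower 12 bits: offset within a 4K page
    return directory, table, offset


def to_physical(linear: int, page_directory: dict, paging_enabled: bool = True) -> int:
    """Paging step. With paging disabled, the physical address equals the linear one."""
    if not paging_enabled:
        return linear
    d, t, off = split_linear(linear)
    frame_base = page_directory[d][t]    # a KeyError here stands in for a page fault
    return frame_base | off


if __name__ == "__main__":
    linear = to_linear(0x00400000, 0x1234)
    tables = {1: {1: 0x0009A000}}        # hypothetical mapping for one page
    print(hex(linear), hex(to_physical(linear, tables)))
```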
11.3.2 Processing Modes
The 80386 has three processing modes: protected mode, real-address mode, and virtual 8086 mode. Protected mode is the normal 32-bit application of the 80386. All instructions and features of the 80386 are available in this mode. Real-address mode (also known as "real mode") is the mode of operation of the processor upon hardware reset. This mode appears to programmers as a fast 8086 with a few new instructions. This mode is utilized by most applications for initialization purposes only. Virtual 8086 mode (also called "V86 mode") is a mode in which the 80386 can go back and forth repeatedly between V86 mode and protected mode at a fast speed. When entering into V86 mode, the 80386 can execute an 8086 program. The processor can then leave V86 mode and enter protected mode to execute an 80386 program.
As mentioned, the 80386 enters real-address mode upon hardware reset. In this mode, the protection enable (PE) bit in control register 0 (CR0) is cleared to zero. Setting the PE bit in CR0 places the 80386 in protected mode. When the 80386 is in protected mode, setting the VM (virtual mode) bit in the flag register (the EFLAGS register) places the 80386 in V86 mode.
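The way the PE and VM bits select the processing mode can be summarized in a few lines of Python. The sketch below only decodes register values passed in as integers; it uses bit 0 of CR0 for PE and bit 17 of EFLAGS for VM. The function name is hypothetical.

```python
CR0_PE = 1 << 0       # protection enable bit in control register 0
EFLAGS_VM = 1 << 17   # virtual 8086 mode bit in EFLAGS


def processing_mode(cr0: int, eflags: int) -> str:
    """Return the 80386 processing mode implied by CR0 and EFLAGS values."""
    if not cr0 & CR0_PE:
        return "real-address"    # PE = 0, the state after hardware reset
    if eflags & EFLAGS_VM:
        return "virtual 8086"    # PE = 1 and VM = 1
    return "protected"           # PE = 1 and VM = 0


if __name__ == "__main__":
    print(processing_mode(cr0=0, eflags=0))
    print(processing_mode(cr0=CR0_PE, eflags=0))
    print(processing_mode(cr0=CR0_PE, eflags=EFLAGS_VM))
```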
11.3.3 Basic 80386 Programming Model
The 80386 basic programming model includes the following aspects:
- Memory organization and segmentation
- Data types
- Registers
- Addressing modes
- Instruction set
I/O is not included as part of the basic programming model because systems designers may choose to use I/O instructions for application programs or may choose to reserve them for the operating system.
Memory Organization and Segmentation
The 4-gigabyte physical memory of the 80386 is structured as 8-bit bytes. Each byte can be uniquely accessed by a 32-bit address. The programmer can write assembly language programs without knowledge of physical address space. The memory organization model available to applications programmers is determined by the system software designers. The memory organization model available to the programmer for each task can vary between the following possibilities:
- A flat address space includes a single array of up to 4 gigabytes. The 80386 maps the 4-gigabyte space into the physical address space automatically by using an address-translation scheme transparent to the applications programmers.
- A segmented address space includes up to 16,383 linear address spaces of up to 4 gigabytes each. In a segmented model, the address space is called the "logical" address space and can be up to 64 terabytes. The processor maps this address space onto the physical address space (up to 4 gigabytes) by an address-translation technique.
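The sizes quoted for the two memory models follow from simple powers of two, as the Python sketch below shows. It treats a gigabyte as 2^30 bytes and a terabyte as 2^40 bytes; the helper name is hypothetical. The 64-terabyte figure corresponds to 2^14 segments of 2^32 bytes each, and 16,383 segments give slightly less than that.

```python
GIGABYTE = 2 ** 30
TERABYTE = 2 ** 40


def logical_space_bytes(n_segments: int, max_segment_bytes: int) -> int:
    """Total logical address space of a segmented model."""
    return n_segments * max_segment_bytes


if __name__ == "__main__":
    flat = 2 ** 32                                      # one 32-bit address space
    print(flat // GIGABYTE)                             # size in gigabytes
    print(logical_space_bytes(2 ** 14, flat) // TERABYTE)   # size in terabytes
    print(logical_space_bytes(16383, flat) // GIGABYTE)     # just under 64 terabytes
```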
Data Types
Data types can be byte (8-bit), word (16-bit with the low byte addressed by n and the high byte addressed by n + 1), and double word (32-bit with byte 0 addressed by n and byte 3 addressed by n + 3). All three data types can start at any byte address. Therefore, the words are not required to be aligned at even-numbered addresses, and double words need not be aligned at addresses evenly divisible by 4. However, for maximum performance, data structures (including stacks) should be designed in such a way that, whenever possible, word operands are aligned at even addresses and double word operands are aligned at addresses evenly divisible by 4. That is, for 32-bit words, addresses should start at 0, 4, 8, … for the highest speed.
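The byte ordering and the cost of misalignment can be illustrated with a short Python sketch. It stores a double word in a byte array with the low byte at the lowest address, as described above, and estimates how many bus transfers an access needs under a simplified model: a 32-bit bus that transfers one aligned 4-byte group per cycle. The model and the function names are assumptions made for illustration.

```python
import struct


def store_dword(memory: bytearray, address: int, value: int) -> None:
    """Store a 32-bit value with byte 0 at `address` and byte 3 at `address + 3`."""
    memory[address:address + 4] = struct.pack("<I", value)   # "<I": little-endian 32-bit


def is_aligned(address: int, size: int) -> bool:
    """True if an operand of `size` bytes starts at a multiple of its size."""
    return address % size == 0


def bus_cycles(address: int, size: int, bus_bytes: int = 4) -> int:
    """Aligned bus-width groups touched by an operand (simplified model)."""
    first = address // bus_bytes
    last = (address + size - 1) // bus_bytes
    return last - first + 1


if __name__ == "__main__":
    mem = bytearray(8)
    store_dword(mem, 4, 0x12345678)
    print(mem.hex())
    print(bus_cycles(4, 4), bus_cycles(1, 4))   # aligned vs. misaligned double word
```

An operand that straddles a 4-byte boundary needs two transfers in this model, which is why aligned operands give the highest speed.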
Depending on the instruction referring to the operand, the following additional data types are available: integer (signed 8-, 16-, or 32-bit), ordinal (unsigned 8-, 16-, or 32-bit), near pointer (a 32-bit logical address that is an offset within a segment), far pointer (a 48-bit logical address consisting of a 16-bit selector and a 32-bit offset), string (8-, 16-, or 32-bit from 0 bytes to 2^32 - 1 bytes), bit field (a contiguous sequence of bits starting at any bit position of any byte and containing up to 32 bits), bit string (a contiguous sequence of bits starting at any position of any byte and containing up to 2^32 - 1 bits), and packed/unpacked BCD. When the 80386 is interfaced to a coprocessor such as the 80287 or 80387, then floating-point numbers are supported.
Registers
Figure 11.1 shows the 80386 registers. As shown in the figure, the 80386 has 16 registers classified as general, segment, status, and instruction pointer. The 8 general registers are the 32-bit registers EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. The low-order word of each of these 8 registers has the 8086 register name AX (AH or AL), BX (BH or BL), CX (CH or CL), DX (DH or DL), BP, SP, SI, and DI. They are useful for making the 80386 compatible with the 8086 processor.
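The relationship between a 32-bit register and its 8086-compatible parts can be shown with bit masks. The Python sketch below uses EAX as the example and treats register contents as plain integers; the helper names are hypothetical.

```python
def low_word(reg32: int) -> int:
    """AX: the low-order 16 bits of EAX."""
    return reg32 & 0xFFFF


def high_byte(reg32: int) -> int:
    """AH: bits 8 through 15 of EAX."""
    return (reg32 >> 8) & 0xFF


def low_byte(reg32: int) -> int:
    """AL: bits 0 through 7 of EAX."""
    return reg32 & 0xFF


def write_low_word(reg32: int, value16: int) -> int:
    """Writing AX replaces the low 16 bits and leaves the upper 16 bits of EAX unchanged."""
    return (reg32 & 0xFFFF0000) | (value16 & 0xFFFF)


if __name__ == "__main__":
    eax = 0x12345678
    print(hex(low_word(eax)), hex(high_byte(eax)), hex(low_byte(eax)))
    print(hex(write_low_word(eax, 0xABCD)))
```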
The six 16-bit segment registers (CS, SS, DS, ES, FS, and GS) allow systems software designers to select either a flat or segmented model of memory organization. The purpose of CS, SS, DS, and ES is the same as that of the corresponding 8086 registers. The two additional data segment registers FS and GS are included in the 80386 so that the four data segment registers (DS, ES, FS, and GS) can access four separate data areas and allow programs to access different types of data structures.
The flag register is a 32-bit register named EFLAGS. Figure 11.1 shows the meaning of each bit in this register. The low-order 16 bits of EFLAGS are named FLAGS and can be treated as a unit. This is useful when executing 8086 code because this part of EFLAGS is similar to the FLAGS register of the 8086. The 80386 flags are grouped into three types: status flags, control flags, and system flags.
The status flags include CF, PF, AF, ZF, SF, and OF, as in the 8086. The control flag DF is used by string instructions, as in the 8086. The system flags control I/O, maskable interrupts, debugging, task switching, and enabling of virtual 8086 execution in a protected, multitasking environment. The purpose of IF and TF is identical to that in the 8086. Let us explain some of the system flags:
- IOPL (I/O privilege level). This 2-bit field supports the 80386 protection feature.
- NT (nested task). The NT bit controls the IRET operation. If NT = 0, a usual return from interrupt is taken by the 80386 by popping EFLAGS, CS, and EIP from the stack. If NT = 1, the 80386 returns from an interrupt via task switching.
- RF (resume flag). This flag is used during debugging.
- VM (virtual 8086 mode). When the VM bit is set to 1, the 80386 executes 8086 programs. When the VM bit is 0, the 80386 operates in protected mode.
The instruction pointer register (EIP) contains the offset address relative to the start of the current code segment of the next sequential instruction to be executed. The low-order 16 bits of EIP are named IP and are useful when the 80386 executes 8086 instructions.
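The flag groups described above can be decoded from a 32-bit EFLAGS value with shifts and masks. The Python sketch below uses the standard 80386 bit positions (for example, CF in bit 0, IOPL in bits 12 and 13, NT in bit 14, RF in bit 16, and VM in bit 17); the function name and the sample value are hypothetical.

```python
# Bit positions of the single-bit flags in EFLAGS.
FLAG_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7,   # status flags
    "TF": 8, "IF": 9,                              # trap and interrupt enable
    "DF": 10,                                      # control flag for string instructions
    "OF": 11,                                      # status flag
    "NT": 14, "RF": 16, "VM": 17,                  # system flags
}


def decode_eflags(eflags: int) -> dict:
    """Return a dictionary of flag names to values for a 32-bit EFLAGS word."""
    fields = {name: (eflags >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["IOPL"] = (eflags >> 12) & 0b11         # 2-bit I/O privilege level
    return fields


if __name__ == "__main__":
    sample = 0x00023241                            # hypothetical register contents
    flags = decode_eflags(sample)
    print({name: value for name, value in flags.items() if value})
    print(hex(sample & 0xFFFF))                    # the low-order 16 bits (FLAGS)
```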
11.3.4 80386 Addressing Modes
The 80386 has 11 addressing modes, classified into register/immediate and memory addressing modes. The register/immediate type includes 2 addressing modes, and the memory addressing type contains 9 modes.
Register/Immediate Modes
Instructions using the register or immediate modes operate on either register or immediate operands. In register mode, the operand is contained in one of the 8-, 16-, or 32-bit general registers. An example is DEC ECX, which decrements the 32-bit register ECX by 1. In immediate mode, the operand is included as part of the instruction. An example is MOV EDX, 5167812FH, which moves the 32-bit data 5167812FH to the EDX register. Note that the source operand in this case is in immediate mode.
Memory Addressing Modes
The other 9 addressing modes specify the effective memory address of an operand. These modes are used when accessing memory. An 80386 address consists of two parts: a segment base address and an effective address. The effective address is computed by adding any combination of the following four elements:
1. Displacement. The 8- or 32-bit immediate data following the instruction is the displacement; 16-bit displacements can be used by inserting an address prefix before the instruction.
2. Base. The contents of any general-purpose register can be used as a base.
3. Index. The contents of any general-purpose register except ESP can be used as an index register. The elements of an array or a string of characters can be accessed via the index register.
4. Scale. The index register’s contents can be multiplied (scaled) by a factor of 1, 2, 4, or 8. A scaled index mode is efficient for accessing arrays or structures.
Effective Address, EA = base register + (index register × scale) + displacement
The 9 memory addressing modes are a combination of these four elements. Of the 9 modes, 8 are executed with the same number of clock cycles because the effective address calculation is pipelined with the execution of other instructions; the mode containing base, index, and displacement elements requires one additional clock cycle.
1. Direct mode. The operand’s effective address is included as part of the instruction as an 8-, 16-, or 32-bit displacement. An example is DEC WORD PTR [4000H].
2. Register indirect mode. A base or index register contains the operand’s effective address. An example is MOV EBX, [ECX].
3. Base mode. The contents of a base register are added to a displacement to obtain the operand’s effective address. An example is MOV [EDX + 16], EBX.
4. Index mode. The contents of an index register are added to a displacement to obtain the operand’s effective address. An example is ADD START[EDI], EBX.
5. Scaled index mode. The contents of an index register are multiplied by a scaling factor (1, 2, 4, or 8), and the result is added to a displacement to obtain the operand’s effective address. An example is MOV START[EBX * 8], ECX.
6. Based index mode. The contents of a base register are added to the contents of an index register to obtain the operand’s effective address. An example is MOV ECX, [ESI][EAX].
7. Based scaled index mode. The contents of an index register are multiplied by a scaling factor (1, 2, 4, or 8), and the result is added to the contents of a base register to obtain the operand’s effective address. An example is MOV [ECX * 4][EDX], EAX.
8. Based index mode with displacement. The operand’s effective address is obtained by adding the contents of a base register and an index register with a displacement. An example is MOV [EBX][EBP + 0F24782AH], ECX.
9. Based scaled index mode with displacement. The contents of an index register are multiplied by a scaling factor, and the result is added to the contents of a base register and a displacement to obtain the operand’s effective address. An example is MOV [ESI * 8][EBP + 60H], ECX.
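All nine memory addressing modes are special cases of one calculation: base + (index × scale) + displacement, with unused elements set to zero (or a scale of 1). The Python sketch below evaluates that sum for a few of the example instructions; the register contents are hypothetical values chosen for illustration, and the result is truncated to 32 bits.

```python
def effective_address(base: int = 0, index: int = 0, scale: int = 1,
                      displacement: int = 0) -> int:
    """EA = base + (index * scale) + displacement, truncated to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


if __name__ == "__main__":
    # Hypothetical register contents.
    EDX, EBP, ESI = 0x2000, 0x1000, 2

    # Direct mode: DEC WORD PTR [4000H]
    print(hex(effective_address(displacement=0x4000)))
    # Base mode: MOV [EDX + 16], EBX
    print(hex(effective_address(base=EDX, displacement=16)))
    # Based scaled index mode with displacement: MOV [ESI * 8][EBP + 60H], ECX
    print(hex(effective_address(base=EBP, index=ESI, scale=8, displacement=0x60)))
```

With a scale of 4, an index register can hold an array element number directly when the elements are double words, which is what makes the scaled modes convenient for arrays.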