10.7 VLIW Machines
An architecture that is in a sense competitive with superscalar architectures is the VLIW (Very Long Instruction Word) architecture. In VLIW machines, multiple operations are packed into a single instruction word that may be 128 or more bits wide. The VLIW machine has multiple execution units, similar to the superscalar machine. A typical VLIW CPU might have two IUs, two FPUs, two load/store units, and a BPU. It is the responsibility of the compiler to organize multiple operations into the instruction word. This relieves the CPU of the need to examine instructions for dependencies, or to order or reorder instructions. A disadvantage is that the compiler must of necessity be pessimistic in its estimates of dependencies, because it must schedule operations without knowing the values that will be present at run time. If it cannot find enough independent operations to fill the instruction word, it must fill the blank spots with NOP instructions. Furthermore, VLIW architectural improvements require software to be recompiled to take advantage of them.
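The packing job that falls to the compiler can be illustrated with a small sketch. It is not a description of any real VLIW compiler: the slot layout simply mirrors the "typical" CPU mentioned above, and the greedy, in-order packing policy, the `pack` function, and the sample operations are all assumptions made for illustration.

```python
# Illustrative sketch only. The slot layout mirrors the "typical" VLIW CPU
# described above; the greedy in-order packing policy is an assumption.
SLOTS = ("IU", "IU", "FPU", "FPU", "LSU", "LSU", "BPU")


def pack(ops):
    """Pack operations, given in program order, into instruction words.

    Each op is (name, unit, dest, sources). An op may not share a word
    with an earlier op whose destination it reads or overwrites.
    """
    words = []
    current = [None] * len(SLOTS)  # None marks a slot not yet filled
    written = set()                # registers written by ops in `current`

    for name, unit, dest, sources in ops:
        free = [i for i, u in enumerate(SLOTS) if u == unit and current[i] is None]
        dependent = dest in written or any(s in written for s in sources)
        if dependent or not free:
            # Close the current word and start a new one.
            words.append(current)
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, u in enumerate(SLOTS) if u == unit]
        current[free[0]] = name
        written.add(dest)

    if any(slot is not None for slot in current):
        words.append(current)
    # Unfilled slots become NOPs.
    return [["NOP" if slot is None else slot for slot in word] for word in words]


ops = [
    ("add",  "IU",  "r1", ("r2", "r3")),
    ("fmul", "FPU", "f1", ("f2", "f3")),
    ("load", "LSU", "r4", ("r5",)),
    ("sub",  "IU",  "r6", ("r1", "r4")),  # needs the results of add and load
]
words = pack(ops)
nops = sum(word.count("NOP") for word in words)
```

With these four operations the sketch emits two instruction words, and 10 of the 14 slots hold NOPs, which shows how quickly unfilled slots can accumulate when independent operations are scarce.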
There have been a number of attempts to market VLIW machines, but for the most part VLIW machines have fallen out of favor in recent years. Performance is the primary culprit, for the reasons above, among others.
Case Study: The Intel IA-64 (Merced) Architecture
This section discusses a microprocessor family in development by an alliance between Intel and Hewlett-Packard, which, it is hoped, will take the consortium into the 21st century. We first look into the background that led to the decision to develop a new architecture, and then we look at what is currently known about the architecture. (The information in this section is taken from various publications and Web sites, and has not been confirmed by Intel or Hewlett-Packard.)
BACKGROUND—THE 80X86 CISC ARCHITECTURE
The current Intel 80×86 architecture, which runs on some 80% of desktop computers in the late 1990’s, had its roots in the 8086 microprocessor, designed in the late 1970’s. The architectural roots of the family go back to the original Intel 8080, designed in the early 1970’s. Being a persistent advocate of upward compatibility, Intel has been in a sense hobbled by a CISC architecture that is over 20 years old. Other vendors such as Motorola abandoned hardware compatibility for modernization, relying upon emulators to ease the transition to a new ISA.
In any case, Intel and Hewlett-Packard decided several years ago that the x86 architecture would soon reach the end of its useful life, and they began joint research on a new architecture. Intel and Hewlett-Packard have been quoted as saying that RISC architectures have “run out of gas,” so to speak, so their search led in other directions. Their research led to the IA-64, which stands for “Intel Architecture-64.” The first of the IA-64 family is known by the code name Merced, after the Merced River, near San Jose, California.
THE MERCED: AN EPIC ARCHITECTURE
Although Intel has not released significant details of the Merced ISA, it refers to its architecture as Explicitly Parallel Instruction Computing, or EPIC. Intel takes pains to point out that it is not a VLIW or even an LIW machine, perhaps out of sensitivity to the bad reputation that VLIW machines have received. Some industry analysts, however, refer to it as “the VLIW-like EPIC architecture.”
Features
While exact details are not publicly known as of this writing, published sources report that the Merced is expected to have the following characteristics:
• 128 64-bit GPRs and perhaps 128 80-bit FPRs;
• 64 1-bit predicate registers (explained later);
• Instruction words contain three instructions packed into one 128-bit parcel;
• Execution units, roughly equivalent to IU, FPU, and BPU, appear in multiples of three, and the IA-64 will be able to schedule instructions into these multiples;
• It will be the burden of the compiler to schedule the instructions to take advantage of the multiple execution units;
• Most of the instructions seem to be RISC-like, although it is rumored that the processor will (still!) execute 80×86 binary codes, in a dedicated execution unit, known as the DXU;
• Speculative loads. The processor will be able to load values from memory well in advance of when they are needed. Exceptions caused by the loads are postponed until execution has proceeded to the place where the loads would normally have occurred;
• Predication (not prediction), where both sides of a conditional branch instruction are executed and the results from the side not taken are discarded.
These latter two features are discussed in more detail later.
The Instruction Word
The 128-bit instruction word, shown in Figure 10-13, has three 40-bit instructions and an 8-bit template. The template is placed by the compiler to tell the CPU which instructions in and near that instruction word can execute in parallel, thus the term “Explicit.” The CPU need not analyze the code at runtime to expose instructions that can be executed in parallel because the compiler determines that ahead of time. Compilers for most VLIW machines must place NOP instructions in slots where instructions cannot be executed in parallel. In the IA-64 scheme, the presence of the template identifies those instructions in the word that can and cannot be executed in parallel, so the compiler is free to schedule instructions into all three slots, regardless of whether they can be executed in parallel.
The 6-bit predicate field in each instruction represents a tag placed there by the compiler to identify which leg of a conditional branch the instruction is part of, and is used in branch predication.
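The field widths add up as 3 × 40 + 8 = 128 bits, and a 6-bit predicate field can name 2^6 = 64 predicates. The sketch below packs and unpacks such a word. Only the widths come from the description above; the bit positions (template in the low-order bits, predicate in the low six bits of each instruction) and the function names are assumptions made for illustration, not the layout used by Intel.

```python
# Field widths as described in the text; bit positions are ASSUMED.
INSTR_BITS = 40
TEMPLATE_BITS = 8
PRED_BITS = 6
INSTRS_PER_WORD = 3


def pack_word(template, instrs):
    """Combine an 8-bit template and three 40-bit instructions into one integer."""
    if len(instrs) != INSTRS_PER_WORD:
        raise ValueError("an instruction word holds exactly three instructions")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 8 bits")
    word = template  # assumption: template occupies the low-order bits
    for i, instr in enumerate(instrs):
        if not 0 <= instr < (1 << INSTR_BITS):
            raise ValueError("instruction does not fit in 40 bits")
        word |= instr << (TEMPLATE_BITS + i * INSTR_BITS)
    return word


def unpack_word(word):
    """Split a 128-bit word back into (template, [instr0, instr1, instr2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << INSTR_BITS) - 1
    instrs = [
        (word >> (TEMPLATE_BITS + i * INSTR_BITS)) & mask
        for i in range(INSTRS_PER_WORD)
    ]
    return template, instrs


def predicate_of(instr):
    """Return the 6-bit predicate tag (assumed to be the low six bits)."""
    return instr & ((1 << PRED_BITS) - 1)


word = pack_word(0b101, [0x1, 0xFFFFFFFFFF, 0x2A])
template, instrs = unpack_word(word)
```

Because the largest instruction and template values together fill exactly 128 bits, `word.bit_length()` never exceeds 128 for valid inputs.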
Branch Predication
Rather than using branch prediction, the IA-64 architecture uses branch predication to remove penalties due to mispredicted branches. When the compiler encounters a conditional branch instruction that is a candidate for predication, it selects two unique labels and labels the instructions in each leg of the branch instruction with one of the two labels, identifying which leg they belong to. Both legs can then be executed in parallel. There are 64 one-bit predicate registers, one corresponding to each of the 64 possible predicate identifiers.
When the actual branch outcome is known, the corresponding one-bit predicate register is set if the branch outcome is TRUE, and the one-bit predicate register corresponding to the FALSE label is cleared. Then the results from instructions having the correct predicate label are kept, and results from instructions having the incorrect (mis-predicted) label are discarded.
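The mechanism can be mimicked in software with a small sketch. It models the statement "if a > b then x = a - b, else x = 0"; the labels `P_TRUE` and `P_FALSE`, the `run_predicated` function, and the list of pending results are hypothetical names introduced here, and the sequential loop only stands in for what the hardware would do in parallel.

```python
def run_predicated(a, b):
    """Model 'if a > b: x = a - b else: x = 0' using predication."""
    pred = [0] * 64          # the 64 one-bit predicate registers
    P_TRUE, P_FALSE = 1, 2   # hypothetical labels chosen by the compiler

    # Both legs are executed; each result is tagged with its leg's label
    # and held back until the branch outcome is known.
    pending = [
        (P_TRUE, "x", a - b),   # leg executed when the condition is true
        (P_FALSE, "x", 0),      # leg executed when the condition is false
    ]

    # The branch outcome becomes known and sets the predicate registers.
    outcome = a > b
    pred[P_TRUE] = 1 if outcome else 0
    pred[P_FALSE] = 0 if outcome else 1

    # Keep results whose predicate register is set; discard the others.
    regs = {}
    for label, dest, value in pending:
        if pred[label]:
            regs[dest] = value
    return regs
```

Calling `run_predicated(7, 3)` keeps the result of the TRUE leg, while `run_predicated(3, 7)` keeps the result of the FALSE leg; in neither case is a branch taken or a pipeline flushed.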
Speculative Loads
The architecture also employs speculative loads, that is, examining the instruction stream for upcoming load instructions and loading the value ahead of time, speculating that the value will actually be needed and will not have been altered by intervening operations. If successful, this eliminates the normal latency inherent in memory accesses. The compiler examines the instruction stream for candidate load operations that it can “hoist” to a location earlier in the instruction sequence. It inserts a check instruction at the point where the load instruction was originally located. The data value is thus available in the CPU when the check instruction is encountered.
The problem that is normally faced by speculative loads is that the load operation may generate an exception, for example because the address is invalid. However, the exception may not be genuine, because the load may be beyond a branch instruction that is not taken, and thus would never actually be executed. The IA-64 architecture postpones processing the exception until the check instruction is encountered. If the branch is not taken then the check instruction will not be executed, and thus the exception will not be processed.
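A minimal sketch of the hoisted load and its check instruction follows. The `MEMORY` dictionary, the function names, and the use of Python's `MemoryError` to stand in for a hardware exception are all assumptions made for illustration; the real mechanism is implemented in the processor, not in software.

```python
# Hypothetical memory contents: only address 0x100 is valid.
MEMORY = {0x100: 42}


def speculative_load(addr):
    """Hoisted load: returns (value, fault_flag) and never raises."""
    if addr in MEMORY:
        return MEMORY[addr], False
    return None, True  # the fault is recorded, not processed


def check(result):
    """Check instruction, placed where the load originally appeared."""
    value, fault = result
    if fault:
        raise MemoryError("deferred fault from speculative load")
    return value


def run(addr, path_executed):
    """path_executed says whether control reaches the load's original location."""
    spec = speculative_load(addr)  # issued early, ahead of the branch
    if path_executed:
        return check(spec)         # fault, if any, is processed only here
    return None                    # check is skipped, so no exception occurs
```

Here `run(0x100, True)` returns the loaded value without waiting on memory at the point of use, `run(0xDEAD, False)` silently discards the bad load because its check is never reached, and `run(0xDEAD, True)` raises the postponed exception at the check.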
All of this complexity places a heavy burden on the compiler, which must be clever about how it schedules operations into the instruction words.
80×86 Compatibility
Intel was recently granted a patent for a method, presumably to be used with IA-64, for supporting two instruction sets, one of which is the x86 instruction set. It describes instructions to allow switching between the two execution modes, and for data sharing between them.
Estimated Performance
It has been estimated that the first Merced implementation will appear sometime in the year 2000, and will have an 800 MHz clock speed. Goals are for it to have performance several times that of current-generation processors when running in EPIC mode, and that of a 500 MHz Pentium II in x86 mode. Intel has stated that initially the IA-64 microprocessor will be reserved for use in high-performance workstations and servers, and at an estimated initial price of $5000 each this will undoubtedly be the case.
On the other hand, skeptics, who seem to abound when new technology is announced, say that the technology is unlikely to meet expectations, and that the IA-64 may never see the light of day. Time will tell.