THE ARITHMETIC COPROCESSOR, MMX, AND SIMD TECHNOLOGIES: INTRODUCTION TO SSE TECHNOLOGY

INTRODUCTION TO SSE TECHNOLOGY

The most recent class of instructions added to the instruction set of the Pentium 4 is SIMD (single-instruction, multiple-data). As the name implies, a single instruction operates on multiple data items at once, just as the MMX instructions do; MMX is itself a SIMD instruction set. The MMX instruction set functions only with integers; the SIMD extension instruction set functions with floating-point numbers as well as integers. The SIMD extension instructions first appeared in the Pentium III as SSE (streaming SIMD extensions) instructions. Later, SSE 2 instructions were added to the Pentium 4, and new to the Pentium 4 (beginning with the 90-nanometer E model) are SSE 3 instructions. The SSE 3 extensions are also found in the Core2 microprocessor.

Recall that the MMX instructions shared registers with the arithmetic coprocessor. The SSE instructions use a new and separate register array to operate on data. Figure 14–13 illustrates an array of eight 128-bit-wide registers that function with the SSE instructions. These new registers are called XMM registers (XMM0–XMM7), a name that denotes extended multimedia registers. To accommodate this new 128-bit-wide data size, a new keyword called OWORD is added. An OWORD (octalword) designates a 128-bit variable, as in OWORD PTR for the SSE instruction set. The term double quadword is also used at times to specify a 128-bit number.

Just as the MMX registers can contain multiple data types, so can the XMM registers of the SSE unit. Figure 14–14 illustrates the data types that can appear in any XMM register for various SSE instructions. An XMM register can hold four single-precision floating-point numbers or two double-precision floating-point numbers. XMM registers can also hold sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or two 64-bit integers. This is a twofold increase in capacity compared to the integers contained in the 64-bit MMX registers, and hence up to a twofold increase in the execution speed of integer operations that use the XMM registers and SSE instructions. For new applications that are designed to execute on a Pentium 4 or newer microprocessor, the SSE instructions are used in place of the MMX instructions. Because not all machines are yet Pentium 4 class machines, there is still a need to include MMX technology instructions in a program for compatibility with these older systems.

Floating-Point Data

Floating-point data are operated upon as either packed or scalar, and as either single-precision or double-precision. The packed operation is performed on all sections of the register at once; the scalar form operates only on the rightmost (least-significant) section of the register contents. Figure 14–15 shows both the packed and scalar operations on SSE data in XMM registers. The scalar form is comparable to the operation performed by the arithmetic coprocessor. Opcodes are appended with PS (packed single), SS (scalar single), PD (packed double), or SD (scalar double) to form the desired instruction. For example, the base opcode for a multiply is MUL, so a packed double multiplication is MULPD and a scalar double multiplication is MULSD. The single-precision multiplies are MULPS and MULSS. In other words, once the two-letter extension and its meaning are understood, it is relatively easy to master the new SSE instructions.
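The difference between the packed and scalar forms can be sketched with compiler intrinsics. This is only an illustration, assuming an x86 compiler that provides `<xmmintrin.h>`; the function names `mul_packed` and `mul_scalar` are hypothetical and do not come from the text. The intrinsics `_mm_mul_ps` and `_mm_mul_ss` normally map to MULPS and MULSS.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (assumes an x86 or x86-64 target)
#include <array>

// Packed single multiply (MULPS): all four 32-bit sections are multiplied.
std::array<float, 4> mul_packed(const std::array<float, 4>& a,
                                const std::array<float, 4>& b) {
    __m128 va = _mm_loadu_ps(a.data());  // unaligned load of four floats
    __m128 vb = _mm_loadu_ps(b.data());
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), _mm_mul_ps(va, vb));
    return out;
}

// Scalar single multiply (MULSS): only the lowest section is multiplied;
// the upper three sections are copied unchanged from the first operand.
std::array<float, 4> mul_scalar(const std::array<float, 4>& a,
                                const std::array<float, 4>& b) {
    __m128 va = _mm_loadu_ps(a.data());
    __m128 vb = _mm_loadu_ps(b.data());
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), _mm_mul_ss(va, vb));
    return out;
}
```

With inputs {1, 2, 3, 4} and {10, 20, 30, 40}, the packed form yields {10, 40, 90, 160}, while the scalar form yields {10, 2, 3, 4}, because element 0 of the array is the least-significant section of the register.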

The Instruction Set

The SSE extensions add a few new types of operations to the instruction set. The floating-point unit does not have a reciprocal instruction, an operation that is used quite often to solve complex equations. The reciprocal ($1/x$) now appears in the SSE extensions as the RCP instruction, which is written as RCPPS or RCPSS. There is also a reciprocal of a square root ($1/\sqrt{x}$) instruction, called RSQRT, which is written as RSQRTPS or RSQRTSS. Both instructions exist only in single-precision form (there are no PD or SD versions), and both return a fast approximation that is accurate to about 12 bits rather than a fully rounded result.
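As a rough sketch of how these two instructions are reached from C++, the intrinsics `_mm_rcp_ps` and `_mm_rsqrt_ps` from `<xmmintrin.h>` normally map to RCPPS and RSQRTPS. The wrapper names `rcp_approx` and `rsqrt_approx` are hypothetical, and an x86 target is assumed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (assumes an x86 or x86-64 target)
#include <array>

// Approximate 1/x for four single-precision values at once (RCPPS).
std::array<float, 4> rcp_approx(const std::array<float, 4>& x) {
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), _mm_rcp_ps(_mm_loadu_ps(x.data())));
    return out;
}

// Approximate 1/sqrt(x) for four single-precision values at once (RSQRTPS).
std::array<float, 4> rsqrt_approx(const std::array<float, 4>& x) {
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), _mm_rsqrt_ps(_mm_loadu_ps(x.data())));
    return out;
}
```

Because the results are approximations, code that needs full single-precision accuracy would use DIVPS and SQRTPS (`_mm_div_ps`, `_mm_sqrt_ps`) instead, or refine the estimate afterward.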

The remainder of the instructions for the SSE unit are basically the same as for the microprocessor and MMX unit except for a few cases. The instruction table in Appendix B lists the instructions, but does not list the extensions (PS, SS, PD, and SD) to the instructions. Again note that SSE 2 and SSE 3 contain double-precision operations and SSE does not. Instructions that start with the letter P operate on integer data that are byte, word, doubleword, or quadword sized. For example, the PADDB XMM0, XMM1 instruction adds the 16 byte-sized integers in the XMM1 register to the 16 byte-sized integers in the XMM0 register. PADDW adds 16-bit integers, PADDD adds doublewords, and PADDQ adds quadwords. Intel does not provide execution times for these instructions, so they do not appear in the appendix.
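The element size selected by the final letter matters because carries do not cross element boundaries. The sketch below assumes an x86 compiler with SSE2 intrinsics in `<emmintrin.h>`; `Bytes`, `add_bytes`, and `add_words` are hypothetical names chosen for illustration. The intrinsics `_mm_add_epi8` and `_mm_add_epi16` normally map to PADDB and PADDW.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics (assumes an x86 target)
#include <array>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;  // one XMM register's worth of data

// PADDB: sixteen independent 8-bit additions, each wrapping modulo 256.
Bytes add_bytes(const Bytes& a, const Bytes& b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    Bytes out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()),
                     _mm_add_epi8(va, vb));
    return out;
}

// PADDW: eight independent 16-bit additions; a carry out of the low byte
// propagates into the high byte of the same word.
Bytes add_words(const Bytes& a, const Bytes& b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    Bytes out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()),
                     _mm_add_epi16(va, vb));
    return out;
}
```

Adding FFH and 01H in byte 0 gives 00H in both bytes 0 and 1 with the byte form, but gives 00H in byte 0 and 01H in byte 1 with the word form, since the two bytes are treated as the little-endian word 00FFH.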

The Control/Status Register

The SSE unit also contains a control/status register accessed as MXCSR. Figure 14–16 illustrates the MXCSR for the SSE unit. Notice that this register is very similar to the control and status registers of the arithmetic coprocessor presented earlier in this chapter. It sets the rounding mode and the exception masks for the SSE unit, much as the control register does for the arithmetic coprocessor, and it provides status information about the operation of the SSE unit.

The SSE control/status register is loaded from memory using the LDMXCSR and FXRSTOR instructions, or stored into memory using the STMXCSR and FXSAVE instructions. Suppose that the rounding control (see Figure 14–6 for the state of the rounding control bits) needs to be changed to round toward positive infinity (RC = 10). Example 14–14 shows the software that changes only the rounding control bits of the control/status register.
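The same read-modify-write idea can be sketched in C++ with the `_mm_getcsr` and `_mm_setcsr` intrinsics, which normally map to STMXCSR and LDMXCSR. This is not the listing from Example 14–14; the helper names `with_rounding` and `set_round_up` are hypothetical, and the sketch relies on the rounding control field occupying bits 14 and 13 of MXCSR.

```cpp
#include <xmmintrin.h>  // _mm_getcsr / _mm_setcsr (assumes an x86 target)

// Return an MXCSR image in which only the RC field (bits 14-13) is replaced
// by the two-bit value rc; every other bit is preserved.
inline unsigned int with_rounding(unsigned int mxcsr, unsigned int rc) {
    return (mxcsr & ~0x6000u) | ((rc & 3u) << 13);
}

// Select round toward positive infinity (RC = 10 binary) and return the
// previous MXCSR contents so that the caller can restore them later.
inline unsigned int set_round_up() {
    unsigned int old = _mm_getcsr();     // read MXCSR (STMXCSR)
    _mm_setcsr(with_rounding(old, 2u));  // write MXCSR (LDMXCSR)
    return old;
}
```

A caller would typically save the value returned by `set_round_up()`, perform the calculation, and then pass the saved value back to `_mm_setcsr` so that later code sees the original rounding mode.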


Programming Examples

A few programming examples are needed to show how to use the SSE unit. As mentioned, the SSE unit allows floating-point and integer operations on multiple data. Suppose that the capacitive reactance is needed for a circuit that contains a 1.0 μF capacitor at various frequencies from 100 Hz to 10,000 Hz in 100 Hz steps. The equation used to calculate capacitive reactance is:

$$X_C = \frac{1}{2\pi f C}$$

Example 14–15 illustrates a procedure that generates the 100 outcomes for this equation using the SSE unit and single-precision floating-point data. The program listed in Example 14–15(a) uses the SSE unit to perform four calculations per iteration, while the program in Example 14–15(b) uses the floating-point coprocessor to calculate XC one value at a time. Example 14–15(c) is yet another version, written in C++. Examine the loops to see that the first example goes through its loop 25 times and the second goes through its loop 100 times. Each time the loop executes in Example 14–15(a) it executes seven instructions (25 × 7 = 175), which takes 175 instruction times. Example 14–15(b) executes eight instructions per iteration of its loop (100 × 8 = 800), which requires 800 instruction times. By using this parallelism, the SSE unit allows the calculations to be accomplished in much less time than the coprocessor version. The C++ version in Example 14–15(c) uses the directive __declspec(align(16)) before each variable to make certain that the variables are aligned properly in memory. If the directives are missing, the program may not function, because SSE memory variables accessed with aligned instructions must be aligned on 16-byte (double quadword) boundaries. This final version executes about 4 1/2 times faster than Example 14–15(b).
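A minimal C++ sketch of the four-at-a-time approach follows. It is not the listing from Example 14–15; it assumes an x86 compiler with `<xmmintrin.h>`, uses the standard `alignas(16)` specifier in place of `__declspec(align(16))`, and the names `ReactanceTable` and `capacitive_reactance` are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (assumes an x86 or x86-64 target)

// 100 results, aligned on a 16-byte boundary so aligned stores are legal.
struct alignas(16) ReactanceTable {
    float xc[100];
};

// Compute XC = 1 / (2 * pi * f * C) for C = 1.0 uF and
// f = 100, 200, ..., 10000 Hz, four frequencies per loop pass.
ReactanceTable capacitive_reactance() {
    ReactanceTable t;
    const __m128 k    = _mm_set1_ps(2.0f * 3.14159265f * 1.0e-6f);  // 2*pi*C
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 step = _mm_set1_ps(400.0f);  // four steps of 100 Hz
    __m128 f = _mm_setr_ps(100.0f, 200.0f, 300.0f, 400.0f);
    for (int i = 0; i < 100; i += 4) {                  // 25 passes
        __m128 x = _mm_div_ps(one, _mm_mul_ps(k, f));   // packed multiply, divide
        _mm_store_ps(&t.xc[i], x);                      // aligned 16-byte store
        f = _mm_add_ps(f, step);                        // next four frequencies
    }
    return t;
}
```

The first entry comes out near 1591.5 Ω (at 100 Hz) and the last near 15.9 Ω (at 10,000 Hz). The aligned store is what makes the 16-byte alignment of the table a requirement; `_mm_storeu_ps` would relax it at some possible cost in speed.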


The first example in this section (Example 14–15) used floating-point numbers to perform multiple calculations, but the SSE unit can also operate on integers. The example illustrated in Example 14–16 uses integer operations to add BlockA to BlockB and store the sums in BlockC. Each block contains 4000 eight-bit numbers. Example 14–16(a) lists an assembly language procedure that forms the sums using the standard integer unit of the microprocessor, which requires 4000 iterations to accomplish.


Both example programs generate 4000 sums, but the second example, which uses the SSE unit, does so by passing through its loop 250 times, while the first example requires 4000 passes. Hence, the second example executes one-sixteenth as many loop iterations and runs roughly 16 times faster because of the SSE unit. Notice how PADDB (an instruction presented with the MMX unit) is used with the SSE unit. The SSE unit uses the same integer instructions as the MMX unit, except that the registers are different: the MMX unit uses 64-bit-wide MM registers and the SSE unit uses 128-bit-wide XMM registers.
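The two loop structures can be compared in a short C++ sketch. It is not the assembly language from Example 14–16; it assumes an x86 compiler with SSE2 intrinsics, and the names `kBlockSize`, `add_blocks_scalar`, and `add_blocks_sse` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics (assumes an x86 target)
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 4000;  // bytes per block, as in the text

// One byte per pass: 4000 loop iterations.
void add_blocks_scalar(const std::uint8_t* a, const std::uint8_t* b,
                       std::uint8_t* c) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        c[i] = static_cast<std::uint8_t>(a[i] + b[i]);  // wraps modulo 256
    }
}

// Sixteen bytes per pass with PADDB: 4000 / 16 = 250 loop iterations.
void add_blocks_sse(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* c) {
    for (std::size_t i = 0; i < kBlockSize; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i),
                         _mm_add_epi8(va, vb));
    }
}
```

Unaligned loads and stores are used here so that the blocks need no special alignment; the block size must be a multiple of 16, which 4000 is. The actual speedup on a given machine depends on memory bandwidth and may be less than the 16-to-1 ratio of loop passes.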

Optimization

The compiler in Visual C++ does have optimization for the SSE unit, but it does not optimize the examples presented in this chapter. It will attempt to optimize a single equation in a statement if the SSE unit can be utilized for that equation. It does not examine a program for blocks of operations that can be performed in parallel, as in the examples presented here. Until compilers and language extensions are developed that can express parallel operations such as these, programs that require high speeds will need hand-coded assembly language for optimization. This is especially true of the SSE unit.
