INTRODUCTION TO MMX TECHNOLOGY

The MMX (multimedia extensions) technology adds 57 new instructions to the instruction set of the Pentium–Pentium 4 microprocessors. These are general-purpose instructions that operate on several data elements packed into a single 64-bit quantity. The new MMX instructions are designed for applications such as motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics. These instructions (new beginning with the Pentium with MMX technology, which Intel released in 1997) operate in parallel with other operations, as do the instructions for the arithmetic coprocessor.

Data Types

The MMX architecture introduces new packed data types. The data types are eight packed, consecutive 8-bit bytes; four packed, consecutive 16-bit words; and two packed, consecutive 32-bit doublewords. Bytes in this multibyte format have consecutive memory addresses and use the little endian form, as with other Intel data. See Figure 14–11 for the format for these new data types.
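The three packed formats are simply three ways of dividing the same 64 bits. The short C sketch below illustrates this with plain shifts on a uint64_t; it is only a model of the layout, not MMX code, and the helper names (packed_byte, packed_word, packed_dword) are hypothetical names chosen for this illustration. Element 0 is taken to be the least significant element, which is the one stored at the lowest memory address in little endian form.

```c
#include <stdint.h>

/* Hypothetical helpers: pull element i out of a 64-bit packed quantity.
   Element 0 is the least significant (lowest address in little endian form). */

/* One of eight packed bytes, i = 0..7. */
uint8_t packed_byte(uint64_t q, unsigned i)
{
    return (uint8_t)(q >> (8 * i));
}

/* One of four packed words, i = 0..3. */
uint16_t packed_word(uint64_t q, unsigned i)
{
    return (uint16_t)(q >> (16 * i));
}

/* One of two packed doublewords, i = 0..1. */
uint32_t packed_dword(uint64_t q, unsigned i)
{
    return (uint32_t)(q >> (32 * i));
}
```

For the quantity 0807060504030201H, byte 0 is 01H, word 1 is 0403H, and doubleword 1 is 08070605H.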

The MMX technology registers have the same format as a 64-bit quantity in memory and have two data access modes: 64-bit access mode and 32-bit access mode. The 64-bit access mode is used for 64-bit memory and register transfers for most instructions. The 32-bit access mode is used for 32-bit memory accesses and for 32-bit register transfers. The 32-bit transfers occur between the microprocessor's integer registers and the MMX registers, and the 64-bit transfers occur between the MMX registers themselves, which are the floating-point coprocessor registers.

Figure 14–12 illustrates the internal register set of the MMX technology extension and how it uses the floating-point coprocessor register set. This technique is called aliasing because the floating-point registers double as the MMX registers. That is, the MMX registers (MM0–MM7) are the same as the floating-point registers. Note that each MMX register is 64 bits wide and uses the rightmost 64 bits of the corresponding 80-bit floating-point register.

Instruction Set

The instruction set for MMX technology includes arithmetic, comparison, conversion, logical, shift, and data transfer instructions. Although the instruction types are similar to those in the microprocessor's instruction set, the main difference is that the MMX instructions use the data types shown in Figure 14–11 instead of the normal data types used with the microprocessor.

Arithmetic Instructions. The set of arithmetic instructions includes addition, subtraction, multiplication, and a special multiplication with an addition. Three types of addition exist: wraparound, signed saturation, and unsigned saturation. The PADD and PSUB instructions add or subtract packed bytes, packed words, or packed doublewords, whether the data are signed or unsigned. The add instructions are appended with a B, W, or D to select the size, as in PADDB for a byte, PADDW for a word, and PADDD for a doubleword. The same is true for the PSUB instruction. The PMULHW and the PMULLW instructions perform multiplication on four pairs of signed 16-bit operands, producing four 32-bit products. The PMULHW instruction stores the high-order 16 bits of each product, and the PMULLW instruction stores the low-order 16 bits. The PMADDWD instruction multiplies and adds. After the four pairs of words are multiplied, adjacent 32-bit products are added in pairs to produce two 32-bit doubleword results.
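A brief C sketch may help separate the three multiply instructions. It models a single 16-bit lane of PMULLW and PMULHW and a single doubleword result of PMADDWD, on the understanding (taken from Intel's instruction descriptions) that PMULHW keeps the upper half of each 32-bit signed product and PMULLW keeps the lower half. The function names are hypothetical and the code is an emulation in portable C, not MMX code.

```c
#include <stdint.h>

/* One lane of PMULLW: low-order 16 bits of the signed 32-bit product. */
uint16_t pmullw_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product & 0xFFFFu);
}

/* One lane of PMULHW: high-order 16 bits of the signed 32-bit product. */
uint16_t pmulhw_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product >> 16);
}

/* One doubleword result of PMADDWD: two adjacent word products are added.
   The sum is formed in unsigned arithmetic so that the single overflow case
   (all four words equal to 8000H) wraps instead of being undefined in C. */
int32_t pmaddwd_pair(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    return (int32_t)(p0 + p1);
}
```

As an example, 1000 times 1000 is 000F4240H, so the PMULLW lane holds 4240H and the PMULHW lane holds 000FH. A full 32-bit product can be rebuilt by combining the two results.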

The MMX instructions use operands just as the integer or floating-point instructions do. The difference is the register names (MM0–MM7). For example, the PADDB MM1, MM2 instruction adds the entire 64-bit contents of MM2 to MM1, byte by byte. The result is stored in MM1. When each 8-bit section is added, any carries generated are dropped. For example, the byte A0H added to the byte 70H produces the byte sum of 10H. The true sum is 110H, but the carry is dropped. Note that the second operand or source can be a memory location containing the 64-bit packed source or an MMX register. You might say that this instruction performs the same function as eight separate byte-sized ADD instructions! If used in an application, this certainly speeds execution of the application. Like PADD, PSUB does not carry or borrow between elements; a result that does not fit simply wraps around. The saturating forms of these instructions behave differently. With PADDSB or PSUBSB (signed saturation), if an overflow or underflow occurs, the result becomes 7FH (+127) for an overflow and 80H (–128) for an underflow. With PADDUSB or PSUBUSB (unsigned saturation), the result is limited to FFH and 00H. Intel calls this saturation because these values represent the largest and smallest numbers that fit in a byte.
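The difference between wraparound and saturation is easy to see in a small emulation. The C sketch below models PADDB (wraparound) and PADDSB (signed saturation) on a 64-bit value held in a uint64_t. The function names are hypothetical, and this is portable C that imitates the instructions rather than MMX code.

```c
#include <stdint.h>

/* Interpret a byte as a signed value in the range -128..+127. */
int as_signed_byte(uint8_t v)
{
    return v >= 0x80 ? (int)v - 256 : (int)v;
}

/* Emulates PADDB: eight independent byte additions, carries dropped. */
uint64_t paddb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(dst >> (8 * i));
        uint8_t b = (uint8_t)(src >> (8 * i));
        uint8_t sum = (uint8_t)(a + b);      /* carry out of bit 7 is lost */
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}

/* Emulates PADDSB: eight signed byte additions, clamped to -128..+127. */
uint64_t paddsb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8; i++) {
        int sum = as_signed_byte((uint8_t)(dst >> (8 * i)))
                + as_signed_byte((uint8_t)(src >> (8 * i)));
        if (sum > 127)  sum = 127;           /* overflow saturates to 7FH  */
        if (sum < -128) sum = -128;          /* underflow saturates to 80H */
        result |= (uint64_t)(uint8_t)sum << (8 * i);
    }
    return result;
}
```

With these definitions, paddb applied to A0H and 70H gives 10H, while paddsb applied to 70H and 20H gives 7FH instead of the wrapped value 90H. Note also that paddb applied to 00FFH and 0001H gives 0000H, because the carry out of the low byte never reaches the next byte.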

Comparison Instructions. There are two comparison instructions: PCMPEQ (equal) and PCMPGT (greater than, using signed comparison). As with PADD and PSUB, there are three versions of each compare instruction: for example, PCMPEQB (compares bytes), PCMPEQW (compares words), and PCMPEQD (compares doublewords). These instructions do not change the microprocessor flag bits; instead, each element of the result is all ones for a true condition and all zeros for a false condition. For example, if the PCMPEQB MM2, MM3 instruction is executed and the least significant bytes of MM2 and MM3 are 10H and 11H, respectively, the result found in the least significant byte of MM2 is 00H. This indicates that the least significant bytes were not equal. If the least significant byte of the result contained an FFH, it would indicate that the two bytes were equal.
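Because the compare instructions produce a mask rather than flags, it can help to see the mask being built. The following C sketch emulates PCMPEQB on a 64-bit value; the function name is a hypothetical label for the emulation, which is ordinary C rather than MMX code.

```c
#include <stdint.h>

/* Emulates PCMPEQB: each result byte is FFH where the corresponding
   bytes of dst and src are equal, and 00H where they differ. */
uint64_t pcmpeqb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(dst >> (8 * i));
        uint8_t b = (uint8_t)(src >> (8 * i));
        if (a == b)
            result |= (uint64_t)0xFF << (8 * i);
    }
    return result;
}
```

If the two operands hold 10H and 11H in their least significant bytes and zeros elsewhere, the result is FFFFFFFFFFFFFF00H: the low byte is 00H because the bytes differ, and the other seven bytes are FFH because they match.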

Conversion Instructions. There are two basic conversion instructions: PACK and PUNPCK. PACK is available as PACKSS (signed saturation) and PACKUS (unsigned saturation). PUNPCK is available as PUNPCKH (unpack high data) and PUNPCKL (unpack low data). Similar to the prior instructions, these are appended with size letters, but the letters must be used in pairs: WB (word to byte) or DW (doubleword to word) for the pack instructions, and BW, WD, or DQ for the unpack instructions. PACKUS exists only as PACKUSWB. For example, the PACKUSWB MM3, MM6 instruction packs the four words from MM3 into the low-order four bytes of MM3 and the four words from MM6 into the high-order four bytes of MM3. Each source word is treated as a signed number. If the word is too large to fit into an unsigned byte, the destination byte becomes FFH; if the word is negative, the destination byte becomes 00H. For signed saturation, the limits are the same values explained under addition.
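The packing order and the unsigned saturation can be checked with a small emulation. The C sketch below models PACKUSWB as Intel documents it: the four words of the destination become the low four bytes of the result and the four words of the source become the high four bytes, with each signed word clamped to the range 00H to FFH. The function names are hypothetical and the code is portable C, not MMX code.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value (given as its bit pattern) to 0..255. */
uint8_t saturate_word_to_ubyte(uint16_t bits)
{
    int value = bits >= 0x8000 ? (int)bits - 65536 : (int)bits;
    if (value < 0)   return 0x00;   /* negative words become 00H  */
    if (value > 255) return 0xFF;   /* words above 255 become FFH */
    return (uint8_t)value;
}

/* Emulates PACKUSWB dst, src. */
uint64_t packuswb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint8_t low  = saturate_word_to_ubyte((uint16_t)(dst >> (16 * i)));
        uint8_t high = saturate_word_to_ubyte((uint16_t)(src >> (16 * i)));
        result |= (uint64_t)low  << (8 * i);        /* bytes 0..3 from dst */
        result |= (uint64_t)high << (8 * (i + 4));  /* bytes 4..7 from src */
    }
    return result;
}
```

For a destination of 00010100FFFF007FH, the four words are 007FH, FFFFH (–1), 0100H (256), and 0001H, which pack to the bytes 7FH, 00H, FFH, and 01H.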

Logic Instructions. The logic instructions are PAND (AND), PANDN (AND NOT), POR (OR), and PXOR (Exclusive-OR). Note that PANDN is not a NAND: it inverts the destination and then ANDs it with the source. These instructions do not have size extensions, and perform these bit-wise operations on all 64 bits of the data. For example, the POR MM2, MM3 instruction ORs all 64 bits of MM3 with MM2. The logical sum is placed into MM2 after the OR operation.
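PANDN is the logic instruction most often misread, so a short C sketch may be useful. It models PANDN as Intel documents it (the destination is inverted and then ANDed with the source) and shows one common use, selecting between two values under a mask with PAND, PANDN, and POR. The function names are hypothetical, and the code is plain C that imitates the instructions.

```c
#include <stdint.h>

/* Emulates PAND dst, src. */
uint64_t pand(uint64_t dst, uint64_t src)
{
    return dst & src;
}

/* Emulates PANDN dst, src: (NOT dst) AND src. This is not a NAND. */
uint64_t pandn(uint64_t dst, uint64_t src)
{
    return ~dst & src;
}

/* Where a mask bit is 1 take the bit from a, otherwise take it from b.
   This mirrors a PAND, PANDN, POR sequence. */
uint64_t select_by_mask(uint64_t mask, uint64_t a, uint64_t b)
{
    return pand(mask, a) | pandn(mask, b);
}
```

With a destination of FF00H and a source of 0FF0H, pandn gives 00F0H, whereas a true NAND of the same operands would give FFFFFFFFFFFFF0FFH.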

Shift Instructions. This group contains logical shifts and an arithmetic shift right instruction. The logical shifts are PSLL (left) and PSRL (right). Variations are word (W), doubleword (D), and quadword (Q). For example, the PSLLQ MM3,2 instruction shifts all 64 bits in MM3 left two places. Another example is the PSLLD MM3,2 instruction that shifts the two 32-bit doublewords in MM3 left two places each.

The PSRA (arithmetic right shift) instruction functions in the same manner as the logical right shift, except that the sign bit is preserved. It is available only for words (PSRAW) and doublewords (PSRAD).
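The effect of the size suffix on a shift is easiest to see by comparing results. The C sketch below emulates PSLLQ, PSLLD, and PSRAW on a 64-bit value. The function names are hypothetical, the code is portable C rather than MMX code, and the handling of oversized counts (a logical shift yields zero, an arithmetic shift fills with the sign bit) follows Intel's instruction descriptions.

```c
#include <stdint.h>

/* Emulates PSLLQ: one 64-bit logical shift left. */
uint64_t psllq(uint64_t dst, unsigned count)
{
    return count > 63 ? 0 : dst << count;
}

/* Emulates PSLLD: each 32-bit doubleword is shifted left separately,
   so bits never cross from the low doubleword into the high one. */
uint64_t pslld(uint64_t dst, unsigned count)
{
    if (count > 31)
        return 0;
    uint32_t low  = (uint32_t)dst << count;
    uint32_t high = (uint32_t)(dst >> 32) << count;
    return ((uint64_t)high << 32) | low;
}

/* Emulates PSRAW: each 16-bit word is shifted right and the vacated
   bits are filled with copies of that word's sign bit. */
uint64_t psraw(uint64_t dst, unsigned count)
{
    uint64_t result = 0;
    if (count > 15)
        count = 15;                 /* larger counts leave only sign bits */
    for (unsigned i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(dst >> (16 * i));
        uint16_t r = (uint16_t)(w >> count);
        if (w & 0x8000u)
            r |= (uint16_t)(0xFFFFu << (16 - count));
        result |= (uint64_t)r << (16 * i);
    }
    return result;
}
```

Shifting 00000001C0000000H left two places gives 0000000700000000H with psllq but 0000000400000000H with pslld, because the two upper bits of the low doubleword are discarded instead of moving into the high doubleword.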

Data Transfer Instructions. There are two data transfer instructions: MOVD and MOVQ. These instructions allow transfers between registers and between a register and memory. The MOVD instruction transfers 32 bits of data between an integer register or memory location and an MMX register. For example, the MOVD ECX, MM2 instruction copies the rightmost 32 bits of MM2 into ECX. There is no instruction to transfer the leftmost 32 bits of an MMX register. You could use a shift right of 32 places (PSRLQ) before a MOVD to do the transfer.
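The shift-then-move idea for reaching the leftmost 32 bits can be modeled in a few lines of C. In the sketch below, a uint64_t stands in for an MMX register and the 32-bit move is written with Intel's mnemonic MOVD in the comments; the function names are hypothetical and the code is an emulation, not MMX code.

```c
#include <stdint.h>

/* Models MOVD r32, mm: only the rightmost 32 bits are copied. */
uint32_t movd_from_mmx(uint64_t mm)
{
    return (uint32_t)mm;
}

/* Models a 64-bit logical right shift of 32 places (PSRLQ mm, 32)
   followed by MOVD r32, mm to reach the leftmost 32 bits. */
uint32_t high_dword_from_mmx(uint64_t mm)
{
    uint64_t shifted = mm >> 32;    /* PSRLQ mm, 32 */
    return movd_from_mmx(shifted);  /* MOVD r32, mm */
}
```

In real MMX code the shift overwrites the register, so a program that still needs the original 64 bits would copy the register first.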

The MOVQ instruction copies all 64 bits between an MMX register and memory or another MMX register. The MOVQ MM2, MM3 instruction transfers all 64 bits of MM3 into MM2.

EMMS Instruction. The EMMS (empty MMX-state) instruction sets all the tags in the floating-point unit to 11 (binary), so the floating-point registers are listed as empty. The EMMS instruction must be executed before the return instruction at the end of any MMX procedure, or a subsequent floating-point operation will cause a floating-point interrupt error, which can crash Windows or any other application. If you plan to use floating-point instructions within an MMX procedure, you must use the EMMS instruction before executing the floating-point instruction. All other MMX instructions clear the tags (00), which indicates that all floating-point registers are in use.

Instruction Listing. Table 14–10 lists all the MMX instructions with the machine code so these instructions can be used with the assembler. At present, MASM does not support these new instructions unless you have upgraded to the latest version (6.15). The latest version can be found in the Windows Driver Development Kit (Windows DDK), which is available for a small shipping charge from Microsoft Corporation. It is also available in Visual Studio Express (search for ML.EXE). Any MMX instruction can be used inside Visual C++ using the inline assembler.

Programming Example. Example 14–13 illustrates a simple programming example that uses the MMX instructions to perform a task that takes eight times longer using normal microprocessor instructions. In this example an array of 1000 bytes of data (BLOCKA) is added to a second array of 1000 bytes (BLOCKB). The result is stored in a third array called BLOCKC. Example 14–13(a) lists a procedure that uses traditional assembly language to perform the addition and Example 14–13(b) shows the same process using MMX instructions.
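The idea behind the example (eight byte additions per step instead of one) can also be sketched in portable C. The code below is not Example 14–13 and does not use MMX instructions; as a stand-in for PADDB it uses a well-known masking technique that adds eight bytes held in one 64-bit integer without letting carries cross byte boundaries. The function names and the test pattern are hypothetical choices for this illustration; only the 1000-byte block size comes from the text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1000   /* size of BLOCKA, BLOCKB, and BLOCKC in the text */

/* One byte per step, as with ordinary byte-sized ADD instructions. */
void add_blocks_scalar(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (uint8_t)(a[i] + b[i]);
}

/* Eight bytes per step. The low 7 bits of every byte are added normally,
   then bit 7 of every byte is fixed up with an exclusive-OR, so no carry
   ever leaves its own byte (the same result PADDB would give). */
void add_blocks_packed(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t top  = 0x8080808080808080ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x, y, sum;
        memcpy(&x, a + i, 8);               /* load 8 bytes of the first block  */
        memcpy(&y, b + i, 8);               /* load 8 bytes of the second block */
        sum = ((x & low7) + (y & low7)) ^ ((x ^ y) & top);
        memcpy(c + i, &sum, 8);             /* store 8 result bytes             */
    }
    for (; i < n; i++)                      /* leftover bytes, if n is not a multiple of 8 */
        c[i] = (uint8_t)(a[i] + b[i]);
}

/* Fills two blocks with an arbitrary pattern, adds them both ways,
   and returns 1 if the two result blocks are identical. */
int packed_matches_scalar(void)
{
    static uint8_t blocka[BLOCK_SIZE], blockb[BLOCK_SIZE];
    static uint8_t c_scalar[BLOCK_SIZE], c_packed[BLOCK_SIZE];
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        blocka[i] = (uint8_t)(i * 7);
        blockb[i] = (uint8_t)(i * 13 + 200);
    }
    add_blocks_scalar(blocka, blockb, c_scalar, BLOCK_SIZE);
    add_blocks_packed(blocka, blockb, c_packed, BLOCK_SIZE);
    return memcmp(c_scalar, c_packed, BLOCK_SIZE) == 0;
}
```

For 1000 bytes the packed loop runs 125 times instead of 1000, which is where the factor of eight comes from.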