Chapter 6: I/O Expansion Buses and Cards
Overview
Expansion buses are a system-level means of connection that allow adapters and controllers to use the computer's system resources—memory and I/O space, interrupts, DMA channels, and so on—directly. Devices connected to expansion buses can also control these buses themselves, obtaining access to the rest of the computer's resources (usually, memory space). This kind of direct control, known as "bus mastering," makes it possible to unload the CPU and to achieve high data exchange rates. Physically, expansion buses are implemented as slot or pin connectors; their conductors are typically short, which allows high operating frequencies. A bus does not necessarily have to carry external connectors: it can also be used solely to connect devices integrated onto the motherboard.
Currently, the third generation of I/O expansion bus architecture is becoming dominant; these buses include PCI Express (also known as 3GIO), HyperTransport, and InfiniBand. The ISA bus, an asynchronous parallel bus with low bandwidth (less than 10 MBps) that provides neither exchange robustness nor autoconfiguration, belongs to the first generation. The second generation started with the EISA and MCA buses, followed by the PCI bus and its PCI-X extension. This is a generation of synchronous, reliable buses with autoconfiguration capabilities; some versions are capable of hot-swapping, and transfer rates measured in GBps can be reached. To connect a large number of devices, buses are linked into a hierarchical tree-like structure using bridges. The third generation is characterized by a transition from parallel buses to point-to-point connections with serial interfaces; multiple clients are connected using a so-called "switching fabric." In essence, third-generation I/O expansion approaches a very local network (within the confines of the motherboard).
The most commonly used expansion bus in modern computers is the PCI bus, supplemented by the AGP port. In desktop computers, the ISA bus is becoming less popular, but it has maintained its position in industrial and embedded computers, both in its traditional slot version and in the PC/104 "sandwich" type. In notebook computers, PCMCIA slots with the PC Card and CardBus buses are extensively used. The LPC bus is a modern, inexpensive way to connect devices on the motherboard that are not resource-intensive. All these buses are considered in detail in this chapter. Information about the obsolete MCA, EISA, and VLB buses can be found in other books.
The characteristics of the standard PC expansion buses are given in Table 6.1.
Table 6.1: Characteristics of Expansion Buses
6.1 PCI and PCI-X Buses
The Peripheral Component Interconnect (PCI) local bus is the principal expansion bus in modern computers. It was developed for the Pentium family of processors, but it is also well suited to 486 family processors, as well as to modern processors.
Currently, the PCI is a well-standardized, highly efficient, reliable expansion bus supported by several computer platforms, including PC-compatible computers, PowerPC, and others. The specifications of the PCI bus are updated periodically. The description given here covers the PCI and PCI-X bus standards up to and including versions 2.3 and 2.0, respectively:
- PCI 1.0 (1992): general concept defined; signals and protocol of a 32-bit parallel synchronous bus with a clock frequency of up to 33.3 MHz and a peak bandwidth of 132 MBps.
- PCI 2.0 (1993): introduced the specification for connectors and expansion cards, with a possible width extension to 64 bits (speeds of up to 264 MBps); 5 V and 3.3 V power supplies provided for.
- PCI 2.1 (1995): introduced the 66 MHz clock frequency (3.3 V only), making it possible to provide a peak bandwidth of up to 264 MBps in the 32-bit version and 528 MBps in the 64-bit version.
- PCI 2.2 (PCI Local Bus Specification, Revision 2.2, of 12/18/1998): specified and clarified some provisions of version 2.1. Also introduced a new interrupt signaling mechanism, MSI.
- PCI 2.3 (2002): defined bits for interrupts that facilitate identifying the interrupt source; made 5 V cards obsolete (only 3.3 V and universal cards are left); introduced the low-profile expansion card form factor; introduced the supplementary SMBus and its signals. This version, described in the PCI Local Bus Specification, Revision 2.3, is the basis for the current extensions.
- PCI 3.0: obsoletes 5 V motherboards, leaving only universal and 3.3 V ones.
In 1999, the PCI-X extension came out, based on PCI 2.2. It is intended to raise the peak bus bandwidth significantly by using a higher transfer frequency, and to increase operating efficiency by employing an improved protocol. The protocol defines, among other things, split transactions and attributes that allow the exchange parties to plan their actions. The extension provides for mechanical, electrical, and software compatibility of PCI-X devices and motherboards with regular PCI; naturally, however, all devices on a bus adjust to the slowest piece of equipment.
In the 3.3 V interface of PCI-X 1.0, the clock frequency was raised to as high as 133 MHz, producing the PCI-X66, PCI-X100, and PCI-X133 variants. Peak bandwidth reaches about 533 MBps in the 32-bit version and over 1 GBps in the 64-bit version. PCI-X 1.0 is described in the PCI-X Addendum to the PCI Local Bus Specification, Revision 1.0b (2002).
PCI-X 2.0 introduced new clocking modes with doubled (PCI-X266) and quadrupled (PCI-X533) data transfer frequency relative to the base clock frequency of 133 MHz. Such high frequencies require a low-voltage interface (1.5 V) and error-correcting code (ECC) control. In addition to the 32- and 64-bit versions, a 16-bit version is specified for embedded computers. A new type of transaction—Device ID Messages (DIM)—was introduced: These are messages that address a device using its identifier. PCI-X 2.0 is described in two documents: PCI-X Protocol Addendum to the PCI Local Bus Specification, Revision 2.0 (PCI-X PT 2.0); and PCI-X Electrical and Mechanical Addendum to the PCI Local Bus Specification, Revision 2.0 (PCI-X EM 2.0).
In addition to the bus specification, there are several specifications for other components:
- PCI to PCI Bridge Architecture Specification, Revision 1.1 (PCI Bridge 1.1), for bridges interconnecting PCI buses
- PCI BIOS Specification, defining the configuring of PCI devices and interrupt controllers
- PCI Hot-Plug Specification, Revision 1.1 (PCI HP 1.1), providing for dynamic (hot) device connection and disconnection
- PCI Power Management Interface Specification, Revision 1.1 (PCI PM 1.1), for controlling power consumption
Based on the PCI 2.1 bus, Intel developed the dedicated Accelerated Graphics Port (AGP) interface for connecting a graphics accelerator (see Section 6.2).
PCI specifications are published and supported by the PCI Special Interest Group (PCI SIG, http://www.pcisig.org). The PCI bus exists in several form-factor variations: CompactPCI (cPCI), Mini PCI, PXI, and CardBus.
The PCI bus was first introduced as a mezzanine bus for systems with the ISA bus, and later became the central bus. It is connected to the processor's system bus by the high-performance north bridge, which is a part of the motherboard chipset. The south bridge connects the other I/O expansion buses and devices to the PCI bus. These include the ISA/EISA and MCA buses, as well as the ISA-like X-BUS and the LPC interface, to which the motherboard's integrated circuits (the ROM BIOS, the interrupt, DMA, and keyboard controllers, the COM and LPT ports, the FDD, and so on) are connected. In modern motherboards that use a hub architecture, the PCI bus has been moved out of the main data path: its channel to the CPU and RAM is no less powerful, but it is no longer loaded with transit traffic from devices on the other buses.
The bus is synchronous: All signals are latched at the rising edge of the CLK signal. The nominal clock frequency is 33 MHz, but it can be lowered if necessary (frequencies of 20-33 MHz were used on machines with 486 CPUs). The bus can often be overclocked to 41.5 MHz, half of the typical 83 MHz system bus frequency. Starting with revision 2.1, the bus frequency can be raised to 66 MHz, provided that all devices connected to the bus support it.
The nominal bus width is 32 bits, although the specification also defines a 64-bit bus. At 33 MHz, the theoretical throughput of the bus reaches 132 MBps for the 32-bit bus and 264 MBps for the 64-bit bus. At 66 MHz, it reaches 264 MBps and 528 MBps for the 32-bit and 64-bit buses, respectively. These peak values are achieved only during burst transfers; because of protocol overhead, the real average aggregate throughput available to all masters is lower.
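The peak figures follow directly from one 32- or 64-bit transfer per bus clock during a burst; a minimal sketch in C that reproduces the arithmetic (with 1 MBps taken as 10^6 bytes per second, as in the text):

#include <stdio.h>

/* Peak PCI bandwidth: one full-width transfer per bus clock. */
int main(void)
{
    const double mhz[] = { 33.0, 66.0 };
    const int bits[]   = { 32, 64 };

    for (int c = 0; c < 2; c++)
        for (int w = 0; w < 2; w++)
            printf("%2d-bit bus @ %2.0f MHz: %3.0f MBps peak\n",
                   bits[w], mhz[c], mhz[c] * (bits[w] / 8));
    return 0;
}

This prints the same 132/264/264/528 MBps values quoted above.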
The CPU can interact with PCI devices using memory and I/O commands addressed within the ranges allocated to each device at configuration. Devices may generate maskable and nonmaskable interrupt requests. There is no concept of a DMA channel number on the PCI bus, but a bus agent may play the role of master and maintain a high-performance exchange with memory and other devices without using CPU resources. In this way, for example, DMA exchange with ATA devices connected to a PCI IDE controller can be implemented (see Section 8.2.1).
Operating in the bus master role is desirable for all devices that require extensive data exchange, as the device can generate quite lengthy burst transactions, for which the effective data transfer speed approaches the declared peak. For controlling devices, memory-mapped I/O is recommended instead of I/O ports as far as possible.
Each PCI device has a standard set of configuration registers located in an address space separate from the memory and I/O spaces. Using these registers, devices are identified and configured, and their characteristics are controlled.
The PCI specification requires a device to be capable of moving the resources it uses (the memory and I/O ranges) within the limits of the available address space. This allows conflict-free resource allocation among many devices and/or functions. A device can be configured in two different ways: mapping its registers either onto the memory space or onto the I/O space. The driver can determine the current settings by reading the contents of the device's base address registers; it can also determine the interrupt request number used by the device.
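The standard way to discover the size of the range behind a base address register is the all-ones probe: writable address bits read back as ones, and the complement of the resulting mask, plus one, gives the decoded range length. A minimal sketch, assuming hypothetical cfg_read32()/cfg_write32() helpers (one possible implementation of such helpers is shown in the next section):

#include <stdint.h>

/* Assumed configuration-access helpers (see the Mechanism #1 sketch later). */
uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg);
void     cfg_write32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg,
                     uint32_t val);

/* All-ones probe of a base address register: write 0xFFFFFFFF, read back,
 * restore. Bit 0 distinguishes an I/O range (1) from a memory range (0). */
uint32_t bar_range_size(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t bar_reg)
{
    uint32_t orig = cfg_read32(bus, dev, fn, bar_reg);
    cfg_write32(bus, dev, fn, bar_reg, 0xFFFFFFFFu);
    uint32_t probe = cfg_read32(bus, dev, fn, bar_reg);
    cfg_write32(bus, dev, fn, bar_reg, orig);        /* restore original   */

    if (probe == 0)
        return 0;                                    /* BAR not implemented */
    uint32_t mask = (orig & 1u) ? ~0x3u : ~0xFu;     /* strip flag bits     */
    return ~(probe & mask) + 1u;   /* e.g., probe 0xFFFFFF01 -> 256 bytes   */
}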
6.1.1 Device Enumeration
For the PCI bus, a hierarchy of enumeration categories—bus, device, function—has been adopted: The host enumerates and configures all PCI devices. This role is usually played by the central processor, which is connected to the PCI bus by the host (main) bridge, relative to which bus addressing begins.
The PCI bus is a set of signal lines that directly connect the interface pins of a group of devices: slots or devices integrated onto the motherboard. A system can have several PCI buses interconnected by PCI bridges. A bridge electrically isolates one bus's interface signals from another's while connecting them logically. The host bridge connects the main bus with the host: the CPU and RAM. Each bus has its own number; buses are numbered sequentially, and the main bus is assigned number 0.
A PCI device is an integrated circuit or expansion card, connected to one of the PCI buses, that uses the IDSEL bus line allocated to it to identify itself during configuration register accesses. A device can be multifunctional (i.e., consist of so-called functions, numbered from 0 to 7). Every function is allocated a 256-byte configuration space. A multifunctional device responds only to configuration cycles addressed to function numbers for which configuration space is actually implemented. A device must always have function 0 (whether a device with the given number is present is determined by the results of accessing this function); the device vendor assigns the other function numbers, from 1 to 7, as needed. Depending on their implementation, simple one-function devices may respond either to any function number or only to function 0.
Enumeration categories are involved only when accessing configuration space registers (see Section 6.1.12). These registers are accessed during configuration, which involves enumerating the detected devices, allocating them nonconflicting resources (memory and I/O ranges), and assigning hardware interrupt numbers. In the course of further regular operation, the devices respond to accesses to the memory and I/O addresses allocated to them, which have been conveyed to the software modules associated with them.
Each function is configured individually. The full function identifier consists of three parts: the bus number, device number, and function number. The short identification form (used in Unix OS messages, for example) looks like PCI0:1:2, meaning function 2 of device 1 connected to the main (0) PCI bus. The configuration software must operate with a list of all functions of all devices that have been detected on the PCI buses of the given system.
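On PC-compatible machines, the host bridge generates configuration cycles through Configuration Mechanism #1: the bus:device:function triplet and the register number are packed into a doubleword written to port CF8h, after which the register contents are exchanged through port CFCh. A sketch (outl()/inl() stand for raw port I/O primitives, which are platform-specific; on Linux/x86, for example, the ones from <sys/io.h> can serve after gaining port access):

#include <stdint.h>

void     outl(uint32_t value, uint16_t port);   /* assumed port I/O */
uint32_t inl(uint16_t port);

#define PCI_CONFIG_ADDRESS 0xCF8
#define PCI_CONFIG_DATA    0xCFC

/* Configuration Mechanism #1: pack bus:device:function and the register
 * number into CONFIG_ADDRESS; the register then appears at CONFIG_DATA. */
uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
    uint32_t addr = 0x80000000u               /* enable bit          */
                  | ((uint32_t)bus << 16)     /* bus number          */
                  | ((uint32_t)(dev & 0x1F) << 11)
                  | ((uint32_t)(fn & 0x07) << 8)
                  | (uint32_t)(reg & 0xFC);   /* double-word aligned */

    outl(addr, PCI_CONFIG_ADDRESS);
    return inl(PCI_CONFIG_DATA);
}

Reading register 0 of PCI0:1:2 this way returns the Vendor/Device ID double word; the value FFFFFFFFh means that no such function is present.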
The PCI bus employs positional addressing, meaning that the number assigned to a device is determined by where the device is connected to the bus. The device number, or dev, is determined by the AD bus line to which the IDSEL signal line of the given slot is connected: As a rule, adjacent PCI slots use adjacent device numbers; their numbering is determined by the manufacturer of the motherboard (or of the passive backplane for industrial computers). Very often, decreasing device numbers, beginning with 20 or 15, are used for the slots. Groups of adjacent slots may be connected to different buses; devices are numbered independently on each PCI bus (devices may have the same dev numbers, but their bus numbers will be different). PCI devices integrated into the motherboard use the same numbering system. Their numbers are hard-wired, whereas the numbers of expansion cards can be changed by moving them into different slots.
One PCI card can present only one device to the bus to which it is connected, because only one IDSEL line is allocated to the slot into which it is installed. If a card contains several devices (a 4-port Ethernet card, for example), then a PCI bridge must be installed on it; the bridge is itself a PCI device, addressed via the IDSEL line allocated to the given card. This bridge creates a supplementary PCI bus on the card, to which several devices can be connected. Each of these devices is assigned its own IDSEL line, but that line is now part of the given card's supplementary PCI bus.
In terms of memory and I/O space addressing, the positional address (bus and device number) is not important within the limits of one bus. However, the device number determines the interrupt request line that the given device can use. (See Section 6.2.6 for more information on this subject; here, it is enough to note that devices on the same bus whose numbers differ by 4 will use the same interrupt request line. Assigning them different interrupt request lines may be possible only when they are located on different buses, and this depends on the motherboard.) In systems with several PCI buses, installing a device into slots of different buses may affect its performance; this depends on the characteristics of the particular bus and how far away it is from the host bridge.
Figuring out the device numbering and the system for assigning interrupt request lines is simple. This can be done by installing a PCI card sequentially into each of the slots (remembering to turn the power off each time) and noting the messages about the PCI devices found that are displayed at the end of POST. PCI devices built into the motherboard and not disabled in CMOS Setup will also appear in these messages. Although this may all seem very clear and simple, some operating systems—especially "smart" ones, such as Windows—are not content with the allocated interrupt request numbers and change them as they see fit; this does not affect line sharing in any way.
All bus devices can be configured only from the host's side; this is the host's special role. No master on any PCI bus has access to the configuration registers of all devices, and without this access, complete configuration cannot be done. Even from the main PCI bus, the registers of the host bridge are not accessible to a master; without access to these registers, address distribution between the host and the PCI devices cannot be programmed. The possibilities for accessing configuration registers are even more modest from the other PCI buses (see Section 6.1.6).
6.1.2 Bus Protocol
Two devices are involved in every transaction (bus exchange): the initiator (bus master) and the target device (bus slave). The rules of interaction between these devices are defined by the PCI bus protocol. A device can monitor bus transactions without participating in them (i.e., without generating any signals); this mode is called snooping. The Special Cycle is a broadcast transaction: in such a cycle, the initiator does not interact with any of the devices by means of the protocol. The set of bus interface signals and their functions are shown in Table 6.2.
Table 6.2: PCI Bus Signals
The states of all signal lines are sampled at the positive transition of the CLK signal; it is precisely these moments that are meant by "bus cycles" in the description that follows (they are marked by vertical dotted lines in the drawings). At different points in time, the same signal lines are controlled by different bus devices; for conflict-free handing over of control of the lines, some time is needed when no device drives them. On the timing diagrams, this event—known as turnaround—is marked by a pair of semicircular arrows. See Fig. 6.1.
At any given moment, the bus can be controlled by only one master device, which has received this right from the arbiter. Each master device has a REQ# signal to request control of the bus and a GNT# signal to acknowledge that bus control has been granted. A device can begin a transaction (i.e., assert the FRAME# signal) only after receiving an active GNT# signal and waiting for the bus to reach the idle state. While the device waits for the bus to become idle, the arbiter can change its mind and grant bus control to another device with a higher priority. Deactivating the GNT# signal prevents the device from beginning the next transaction and, under certain conditions considered later in this chapter, forces it to terminate the transaction in progress. A special unit, part of the bridge that connects the given bus to the computer core, handles the arbitration requests. The priority scheme (fixed, cyclic, or combined) is determined by the arbiter's programming.
Addresses and data are transmitted over the shared multiplexed AD lines. The four multiplexed C/BE[3:0] lines carry the command during the address phase and the byte enables during the data phases. In write transactions, the C/BE[3:0] lines enable the bytes that are on the AD bus at the same time as these signals; in read transactions, they refer to the bytes of the following data phase. At the beginning of a transaction, the master device asserts the FRAME# signal, sends the target device address over the AD bus, and sends the transaction type (i.e., the command) over the C/BE# lines. The addressed target device responds with the DEVSEL# signal. The master device indicates its readiness to exchange data by the IRDY# signal, which it may assert even before receiving the DEVSEL# signal. When the target device is also ready to exchange data, it asserts the TRDY# signal. Data are sent over the AD bus only when both the IRDY# and TRDY# signals are asserted. The master and target devices use these signals to match their exchange rates by introducing wait cycles. If both asserted their ready signals at the end of the address phase and maintained them until the end of the transfer, 32 bits of data would be sent in each cycle after the address phase, achieving the maximum exchange efficiency. In read transactions, an extra clock is required for turnaround after the address phase, during which the initiator relinquishes control of the AD lines; the target device can take the AD bus only on the next clock. No turnaround is needed in a write transaction, because the data are sent by the initiator.
On the PCI bus, all transactions are of the burst type: Each transaction begins with an address phase, which may be followed by one or more data phases. The number of data phases in a burst is not indicated explicitly; instead, the master device deasserts the FRAME# signal before the last data phase, with the IRDY# signal still asserted. In single transactions, the FRAME# signal is active only for the duration of one cycle. If a device does not support burst transactions in the slave mode, then during the first data phase it must request termination of the burst by asserting the STOP# and TRDY# signals simultaneously. In response, the master device will complete the current transaction and continue the exchange with a subsequent transaction addressed to the next address. After the last data phase, the master device releases the IRDY# signal, and the bus goes into the idle state, in which both the FRAME# and IRDY# signals are inactive.
By asserting the FRAME# signal simultaneously with releasing the IRDY# signal, the initiator can begin the next transaction without passing through the bus idle phase. Such fast back-to-back transactions may be directed either to the same target device (the first type) or to different ones (the second type). All PCI devices acting as targets support the first type; support of the second type, which is optional, is indicated by bit 7 of the status register. An initiator is allowed to use fast back-to-back transactions with different target devices (enabled by bit 9 of its command register, if it is capable of them) only if all bus agents are capable of fast transactions. When data exchange is conducted in the PCI-X mode, fast back-to-back transactions are not allowed.
The handshaking protocol makes the exchange reliable, as the master device always receives information about the target device having completed the transaction. Parity checking increases the reliability and validity of the exchange: the AD[31:0] and C/BE[3:0] lines are protected by the parity bit on the PAR line (the number of ones on these lines, PAR included, must be even). The actual value of PAR appears on the bus with a one-cycle delay with respect to the AD[31:0] and C/BE# lines. On detecting an error, a device asserts the PERR# signal, shifted one cycle after the valid parity signal. All bytes are taken into account when calculating the parity of a data transfer, including invalid ones (marked by a high level of the corresponding C/BEx# signal); the state of the bits, even in invalid data bytes, must remain stable during the data phase.
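The PAR rule amounts to an XOR reduction: PAR is driven so that the 36 protected bits plus PAR itself contain an even number of ones. A small sketch:

#include <stdint.h>

/* PAR is the XOR reduction of AD[31:0] and C/BE[3:0]#. Invalid bytes
 * participate too, which is why their lines must stay stable. */
static int pci_par(uint32_t ad, uint8_t cbe)
{
    uint64_t v = ((uint64_t)(cbe & 0xFu) << 32) | ad;

    v ^= v >> 32;
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (int)(v & 1u);   /* value to drive on the PAR line */
}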
Each bus transaction must be concluded as planned, or terminated, with the bus assuming the idle state (the FRAME# and IRDY# signals going inactive). Either the master or the slave device may initiate the conclusion of a transaction.
A master device can conclude a transaction in one of the following ways:
- Completion is executed when the data exchange ends.
- A time-out occurs when the master device is deprived of control of the bus (the GNT# signal being driven high) during a transaction and the time indicated by its Latency Timer has expired. This may happen when the addressed target device turns out to be slower than expected, or when the planned transaction is too long. Short transactions (of one or two data phases) complete normally even when the GNT# signal goes high and the time-out is triggered.
- A master-abort termination takes place when the master device does not receive a response from the target device (DEVSEL#) within the specified length of time.
A transaction may be terminated at the target device’s initiative. The target device can do this by asserting the STOP# signal. Here, three types of termination are possible:
- Retry: The STOP# signal is asserted with the TRDY# signal inactive before the first data phase. This situation arises when the target device cannot supply the first data within the allowed time period (16 cycles) because it is busy. Retry is an instruction to the master device to execute the same transaction again.
- Disconnect: The STOP# signal is asserted during or after the first data phase. If STOP# is asserted while the TRDY# signal of the current data phase is active, the current data are transmitted and the transaction is then terminated. If STOP# is asserted while TRDY# is inactive, the transaction is terminated without transmitting the next phase's data. Disconnect is executed when a target device cannot send or receive the next portion of the burst's data in time. It instructs the master device to repeat the transaction, but with a modified start address (i.e., to continue from the point of interruption).
- Target-abort: The STOP# signal is asserted simultaneously with the deactivation of the DEVSEL# signal. (In the preceding situations, DEVSEL# remained active while STOP# was being asserted.) No data are sent after this termination. A target-abort is executed when the target device detects a fatal error or some other condition (e.g., an unsupported command) because of which it cannot service the current request.
Supporting all three termination types is not mandatory for target devices; however, any master device must be capable of terminating a transaction for any of these reasons.
Terminations of the Retry type are used to organize delayed transactions. Delayed transactions are used by slow target devices, and also by PCI bridges when forwarding transactions to another bus. While terminating the transaction (from the initiator's point of view) with a Retry condition, the target device executes the transaction internally. When the initiator repeats the transaction (issues the same command with the same address and the same set of C/BE# signals in the data phase), the target device (or the bridge) will have the result ready (data for a read transaction, execution status for a write transaction) and will promptly return it to the initiator. The target device (or the bridge) must store the result of an executed delayed transaction until the initiator requests it. However, due to some abnormal situation, the initiator may "forget" to repeat the transaction. To keep the result storage buffer from overflowing, the device has to discard such results. This can be done without detrimental effects only for delayed transactions with prefetchable memory; the other types of transactions cannot be discarded without the danger of violating data integrity. They may be discarded only if no repeat request arrives within 2^15 bus cycles (a Discard Timer time-out); a device can inform its driver (or the operating system) about this situation.
A transaction initiator may request exclusive use of the PCI bus for the duration of a whole exchange operation that requires several bus transactions. For example, when the central processor executes an instruction that modifies the contents of a memory cell in a PCI device, it needs to read the data from the device, modify them in its ALU, and then return them to the device. To prevent other initiators from inserting their transactions into this sequence (which would endanger data integrity), the host bridge executes the operation as a locked one: i.e., it keeps the LOCK# bus signal asserted for the duration of its execution. Regular PCI devices neither use nor generate this signal; only bridges use it, to control arbitration.
PCI Bus Commands
PCI commands are defined by the transaction direction and type, as well as the address space to which they pertain. The PCI bus command set includes the following commands:
- The I/O Read and I/O Write commands are used to access the I/O address space.
- The Memory Read and Memory Write commands are used to perform short, (as a rule) non-burst transactions. Their direct purpose is to access I/O devices that are mapped onto the memory space. For real memory, which allows prefetching, the Memory Read Line, Memory Read Multiple, and Memory Write and Invalidate commands are used.
- Memory Read Line is employed when a read up to the end of a cache line is planned. Distinguishing this type of read allows increased memory exchange efficiency.
- Memory Read Multiple is used for transactions involving more than one cache line. Using this type of transaction allows the memory controller to prefetch lines, which gives an extra performance increase.
- Memory Write and Invalidate is used to write entire cache lines; all bytes in all phases must be enabled. This operation saves time by allowing the cache controller to simply invalidate the "dirty" cache lines corresponding to the written area, without writing them back to main memory. The initiator that issues this command must know the cache line size in the given system (it has a special register for this purpose in the configuration area).
- The Dual Address Cycle (DAC) allows the 32-bit bus to communicate with devices employing 64-bit addressing. In this case, the lower 32 address bits are sent simultaneously with this command, after which follows a regular cycle that sets the exchange type and carries the higher 32 address bits. The PCI bus permits 64-bit I/O port addressing. (It is of no use on x86 machines, but PCI is also used on other platforms.)
- The Configuration Read and Configuration Write commands address the configuration space of devices. Access is performed only by aligned double words; bits AD[1:0] are used to identify the cycle type. A special hardware/software mechanism is needed to generate these commands.
- The Special Cycle command is of the broadcast type, which makes it different from all other commands. No bus agent responds to it, and the host bridge or another device that starts this cycle always terminates it with a master-abort; it takes 6 bus cycles to complete. The Special Cycle broadcasts messages that any "interested" bus agent can read. The message type is encoded in the contents of the AD[15:0] lines; data sent with the message may be placed on the AD[31:16] lines. Regular devices ignore the address phase of this cycle, but bridges use the information in it to control how the message is broadcast. Messages with the codes 0000h, 0001h, and 0002h are used to indicate processor shutdown, processor halt, and the x86-specific functions pertaining to cache operations and tracing, respectively; the codes 0003h-FFFFh are reserved. The same hardware/software mechanism that generates configuration cycles may generate the Special Cycle, but with an address having a specific meaning.
- The Interrupt Acknowledge command reads the interrupt vector. In protocol terms, it looks like a read command addressed to the interrupt controller (PIC or APIC). This command sends no useful information over the AD bus in the address phase (BE[3:0]# sets the vector size), but its initiator—the host bridge—must ensure stable signals and correct parity. In PCs, an 8-bit vector is sent in byte 0 when the interrupt controller is ready (upon the TRDY# signal). The interrupt is acknowledged in one cycle; the bridge suppresses the first idle cycle that x86 processors perform for backward-compatibility reasons.
Commands are coded by the C/BE# bits in the address phase (see Table 6.3); PCI-X specific commands are considered in the following sections.
Table 6.3: PCI and PCI-X Bus Command Decoding
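For quick reference, the plain-PCI command codes carried on C/BE[3:0]# during the address phase can be written as an enumeration (codes 4, 5, 8, and 9 are reserved in plain PCI; the PCI-X reinterpretations of several codes, listed in Table 6.3, are omitted here):

/* Standard PCI command encodings on C/BE[3:0]# in the address phase. */
enum pci_command {
    PCI_CMD_INTERRUPT_ACK   = 0x0,
    PCI_CMD_SPECIAL_CYCLE   = 0x1,
    PCI_CMD_IO_READ         = 0x2,
    PCI_CMD_IO_WRITE        = 0x3,
    PCI_CMD_MEM_READ        = 0x6,
    PCI_CMD_MEM_WRITE       = 0x7,
    PCI_CMD_CONFIG_READ     = 0xA,
    PCI_CMD_CONFIG_WRITE    = 0xB,
    PCI_CMD_MEM_READ_MULT   = 0xC,  /* Memory Read Multiple        */
    PCI_CMD_DUAL_ADDR_CYCLE = 0xD,  /* DAC: 64-bit addressing      */
    PCI_CMD_MEM_READ_LINE   = 0xE,
    PCI_CMD_MEM_WRITE_INVAL = 0xF   /* Memory Write and Invalidate */
};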
Each bus command carries the address that pertains to the first data phase of the burst. The address for every subsequent data phase in the burst is incremented by 4 (the next double word), or by 8 for 64-bit transfers, although the order may be different in memory-access commands. The bytes of the AD bus that carry meaningful information are selected in the data phases by the C/BE[3:0]# signals, which within a burst may change their states arbitrarily from phase to phase. The enabled bytes need not be adjacent on the data bus, and data phases in which not a single byte is enabled are possible. Unlike the ISA bus, the PCI bus cannot change its width dynamically: All devices must connect to the bus in the 32- or 64-bit mode. If a PCI device uses a functional circuit of a different width (e.g., an 8255 integrated circuit, which has an 8-bit data bus and four registers, needs to be connected), then schematic conversions that map all its registers onto the 32-bit AD bus become necessary. The 16-bit connection capability appeared only in the second version of PCI-X.
Addressing is different for each of the three spaces—memory, I/O ports, and configuration registers; the address is ignored in the special cycles.
Memory Addressing
A physical memory address is sent over the PCI bus; in x86 (and other) processors, it is derived from the logical address by the page translation performed by the MMU. In the memory-access commands, the address aligned at the double-word boundary is transmitted over the AD[31:2] lines; the AD[1:0] lines set the burst addressing mode:
- 00—Linear incrementing: The address of each subsequent phase is obtained by incrementing the preceding address by the number of bytes of the bus width: 4 bytes for a 32-bit bus, 8 bytes for a 64-bit bus.
- 10—Cacheline Wrap mode: Memory access addresses wrap around the end of the cache line (see the sketch after this list). Within a transaction, each subsequent phase address is incremented until it reaches the end of the cache line, after which it wraps around to the beginning of the line and increments up to the value preceding the starting address. If a transaction is longer than the cache line, it continues in the next line from the same offset at which it started. Thus, with a 16-byte line and a 32-bit bus, the data phases of a transaction that began at the address xxxxxx08h will follow the addresses xxxxxx0Ch, xxxxxx00h, xxxxxx04h; then, in the next cache line: xxxxxx18h, xxxxxx1Ch, xxxxxx10h, xxxxxx14h. The cache line length is set in the configuration space of the device. If a device does not have the cache line size register, it must terminate the transaction after the first data phase, because the order in which the addresses alternate would be indeterminate.
- 01 and 11—These combinations are reserved. A target may treat them as a Disconnect request after the first data phase.
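The wrap order is easy to compute: advance within the line modulo the line length, then move to the next line at the same starting offset. A sketch that reproduces the example above (the line length must be a power of two here):

#include <stdio.h>
#include <stdint.h>

/* Print the data-phase address sequence for Cacheline Wrap mode.
 * line_len must be a power of two; width is the bus width in bytes. */
void cacheline_wrap(uint32_t start, uint32_t line_len, uint32_t width,
                    unsigned phases)
{
    uint32_t per_line = line_len / width;          /* phases per line  */
    uint32_t base = start & ~(line_len - 1);       /* current line     */
    uint32_t off  = start &  (line_len - 1);       /* offset in a line */

    for (unsigned i = 0; i < phases; i++) {
        uint32_t addr = base + (i / per_line) * line_len
                      + (off + (i % per_line) * width) % line_len;
        printf("xxxxxx%02Xh\n", addr & 0xFF);
    }
}

int main(void)
{
    cacheline_wrap(0x08, 16, 4, 8);  /* reproduces the example above */
    return 0;
}

Calling cacheline_wrap(0x08, 16, 4, 8) prints exactly the eight addresses listed in the Cacheline Wrap example.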
If addresses above 4 GB need to be accessed, a dual-address cycle is used: it carries the lower 32 bits of the 64-bit address together with the DAC command, and the cycle that follows carries the actual command along with the higher 32 bits of the address. In regular commands, bits [63:32] are assumed to be zero.
In PCI-X, a full memory address is sent using all the AD[31:0] lines. In burst transactions, the address determines the exact location of the burst's starting byte, and the addresses are assumed to increment in linear ascending order. All bytes are involved in a burst transaction, starting from the specified starting byte and ending with the last one, as given by the byte counter; individual bytes cannot be disabled in a PCI-X burst transaction as they can be in PCI. In single DWORD transactions, the AD[1:0] address bits determine which bytes may be enabled by the C/BE[3:0]# signals. Thus, if:
- AD[1:0] = 00, then C/BE[3:0]# = xxxx
- AD[1:0] = 01, then C/BE[3:0]# = xxx1
- AD[1:0] = 10, then C/BE[3:0]# = xx11
- AD[1:0] = 11, then C/BE[3:0]# = x111 (only byte 3 is sent, or no bytes are enabled)
I/O Addressing
In the I/O access commands, all the AD[31:0] lines are used (decoded), allowing any byte to be addressed. The AD[31:2] bits give the address of the double word being transmitted, and the AD[1:0] bits define which bytes may be enabled by the C/BE[3:0]# signals. The rules for PCI transactions are somewhat different here: When at least one byte is sent, the byte pointed to by the address must also be enabled. Thus, when:
- AD[1:0] = 01, then C/BE[3:0]# = xx01 or 1111
- AD[1:0] = 10, then C/BE[3:0]# = x011 or 1111
- AD[1:0] = 11, then C/BE[3:0]# = 0111 (only byte 3 is sent) or C/BE[3:0]# = 1111 (no bytes are enabled)
Formally, these cycles can also come in bursts, although this capability is seldom used in practice. All 32 address bits are available for I/O port addressing on the PCI bus, but x86 processors can use only the lower 16 bits.
The same interrelations between the C/BE[3:0]# signals and the address described in the previous section for single DWORD memory transactions extend to PCI-X I/O transactions, which are always single DWORD; a sketch of the rule follows.
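The PCI I/O rule reduces to two checks: the byte pointed to by the address must be enabled whenever anything is enabled, and the bytes below the addressed one must be disabled. A small validator (C/BE# is active low; for the PCI-X single-DWORD rule above, only the lower-bytes check applies):

#include <stdbool.h>
#include <stdint.h>

/* Check the PCI I/O byte-enable rule: ad10 is the value of AD[1:0],
 * be_n the active-low C/BE[3:0]# pattern. Either no byte is enabled
 * (be_n == 1111b), or the addressed byte is enabled and all bytes
 * below it are disabled. */
bool io_byte_enables_valid(unsigned ad10, uint8_t be_n)
{
    uint8_t below = (uint8_t)((1u << ad10) - 1); /* bytes below address */

    if ((be_n & 0xF) == 0xF)        /* no bytes enabled: always legal  */
        return true;
    if (be_n & (1u << ad10))        /* addressed byte must be enabled  */
        return false;
    return (be_n & below) == below; /* lower bytes must be disabled    */
}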
Addressing Configuration Registers and the Special Cycle
The configuration read/write commands have two address formats, each used in specific situations. To access the registers of a device located on the given bus, Type 0 configuration transactions are employed (Fig. 6.2, a). The device (an expansion card) is selected by an individual IDSEL signal, generated by the bus's bridge from the device number. The selected device sees the function number Fun in bits AD[10:8] and the configuration register number Reg in bits AD[7:2]; bits AD[1:0] = 00 are the Type 0 indicator. The AD[31:11] lines serve as the source of the IDSEL signals for the devices of the given bus. The bus specification defines the AD11 line as the IDSEL line for device 0 and the AD12 line as the IDSEL line for device 1; the sequence continues in this order, with the AD31 line being the IDSEL line for device 20. The bridge specification features a table in which only the lines AD16 (device 0) through AD31 (device 15) are used.
PCI devices combined with a bridge (sharing the same microchip) can also use larger numbers, for which there are not enough AD lines. In PCI-X, the undecoded device number Dev is sent over the AD[15:11] lines: Devices use it as a part of their identifier in the transaction attributes. For devices operating in Mode 1, the AD[31:16] lines are used for IDSEL; in Mode 2, only the AD[23:16] lines are used, making 7 the largest device number. This allows the function's configuration space to be expanded to 4 KB: The AD[27:24] lines are used as the higher bits of the configuration register number, UReg (Fig. 6.2, c).
Type 1 configuration transactions are used to access devices on other buses (Fig. 6.2, d). Here, the number Bus of the bus on which the sought device is located is given by bits AD[23:16]; the device number Dev, by bits AD[15:11]; the function number Fun, by bits AD[10:8]; and the register number Reg, by bits AD[7:2]. The value 01 in bits AD[1:0] is the Type 1 indicator. In PCI-X Mode 2, the higher bits of the register number, UReg, are sent over the AD[27:24] lines.
Because bits AD[1:0] are used only to identify the transaction type, the configuration registers are accessed only by double words. The distinction between the two configuration transaction types is used to construct the hierarchical PCI configuration system. Unlike transactions with memory and I/O addresses, which travel from the initiator to the target device regardless of their mutual location, configuration transactions propagate over the bus hierarchy in one direction only: downward, from the host (central processor) through the main bus to the subordinate buses. Consequently, only the host can configure all PCI devices (including bridges).
No information is sent over the AD bus in the address phase of the PCI broadcast command, the Special Cycle. Any PCI bus agent can invoke a special cycle on any specifically indicated bus by using a Type 1 configuration write transaction with the bus number in bits AD[23:16]; the device and function number fields (bits AD[15:8]) must be all ones, and the register number field must be zeroed out. This transaction transits the buses regardless of the mutual location of the initiator and the target bus, and only the very last bridge converts it into the actual special cycle.
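With Configuration Mechanism #1 (sketched earlier), generating a special cycle therefore means writing an address with the device field all ones (1Fh), the function field all ones (7), and the register field 0, and then writing the message to CONFIG_DATA. A sketch under those assumptions:

#include <stdint.h>

void outl(uint32_t value, uint16_t port);  /* as in the earlier sketch */

/* Special Cycle via Configuration Mechanism #1: device field 1Fh,
 * function field 7, register field 0; the subsequent write to
 * CONFIG_DATA carries the message (type in bits [15:0], data in
 * bits [31:16]) and triggers the cycle on the indicated bus. */
void pci_special_cycle(uint8_t bus, uint16_t msg_type, uint16_t msg_data)
{
    uint32_t addr = 0x80000000u
                  | ((uint32_t)bus << 16)
                  | (0x1Fu << 11)           /* device: all ones   */
                  | (0x7u << 8);            /* function: all ones */

    outl(addr, 0xCF8);
    outl(((uint32_t)msg_data << 16) | msg_type, 0xCFC);
}

/* pci_special_cycle(0, 0x0000, 0) would broadcast a shutdown message
 * on bus 0, if the host bridge implements special cycle generation. */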
PCI-X Protocol Modification
In many respects, the PCI-X bus protocol is the same as described above: the same latching at the CLK transition, the same control signal functions. The changes in the protocol are aimed at raising the efficiency of bus cycle usage. For this purpose, additions were made to the protocol that allow devices to foresee upcoming events and to plan their actions accordingly.
In regular PCI, all transactions begin in the same way (with the address phase), as burst transactions whose length is unknown in advance. In practice, however, I/O transactions always have only one data phase; long bursts are efficient (and are used) only for memory accesses. In PCI-X, there are two transaction types in terms of length:
- Burst: all memory-access commands except Memory Read DWORD.
- Single double word (DWORD): all other commands.
Each transaction has a new attribute phase following the address phase. In this phase, the initiator presents its identifier (RBN—bus number, RDN—device number, and RFN—function number), a 5-bit tag, a 12-bit byte counter (for burst transactions only; UBC—higher bits, LBC—lower bits), and additional characteristics (the RO and NS bits) of the memory area to which the given transaction pertains. The attributes are sent over the AD[31:0] and BE[3:0]# bus lines (Fig. 6.3). The initiator identifier together with the tag defines the sequence: one or more transactions that carry out a logical data transfer scheduled by the initiator. With a 5-bit tag, each initiator can simultaneously execute up to 32 logical transfers (a tag can be reused for another logical transfer only after the previous transaction using the same tag value has been completed). A logical transfer (sequence) can be up to 4,096 bytes long (byte counter value 00…01 corresponds to 1, value 11…11 to 4,095, and value 00…00 to 4,096); the number of bytes that must be transferred in the given sequence is indicated in the attributes of each transaction. The number of bytes to be transmitted in a given transaction is not determined in advance (either the initiator or the target device can stop a transaction). However, in order to raise efficiency, stringent requirements are applied to burst transactions.
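As a data structure, the attribute phase can be pictured as follows (a sketch only; the exact placement of the fields on the AD/C/BE# lines follows Fig. 6.3 and is not modeled here). Note the byte counter encoding, where the all-zeros field means 4,096:

#include <stdint.h>

/* PCI-X burst transaction attributes, as described in the text. */
struct pcix_attributes {
    uint8_t  rbn;         /* requester bus number               */
    uint8_t  rdn;         /* requester device number (5 bits)   */
    uint8_t  rfn;         /* requester function number (3 bits) */
    uint8_t  tag;         /* 5 bits: up to 32 open sequences    */
    uint16_t byte_count;  /* 12 bits: UBC/LBC combined          */
    uint8_t  ro;          /* Relaxed Ordering flag              */
    uint8_t  ns;          /* No Snoop flag                      */
};

/* The 12-bit byte counter encodes 1..4,096:
 * 0x001 -> 1, ..., 0xFFF -> 4,095, and 0x000 -> 4,096. */
static inline unsigned pcix_sequence_bytes(uint16_t byte_count)
{
    unsigned n = byte_count & 0xFFF;
    return n ? n : 4096;
}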
If a transaction has more than one data phase, it can terminate either after all the declared bytes (per the byte counter in the attributes) have been transmitted, or only on a cache line boundary (a 128-byte memory address boundary). If the transaction participants are not ready to meet these requirements, one of them must stop the transaction after the first data phase. Only the target device retains the right to emergency transaction termination at any moment; the initiator is strictly responsible for its actions.
The characteristics of the memory to which a given transaction pertains make it possible to select the optimal method of access when processing the transaction. These characteristics are set by the device that requests the particular sequence; how it learns the memory properties is a matter for its driver to be concerned with. The memory-characteristic attributes pertain only to burst memory-access transactions (but not to MSI messages):
- The Relaxed Ordering (RO) flag means that the execution order of individual write or read transactions can be changed.
- The No Snoop (NS) flag means that the memory area to which the given transaction pertains is not cached anywhere.
In PCI-X, delayed transactions are replaced by split transactions. The target device can complete any transaction, with the exception of memory writes, either immediately (using the regular PCI method) or by the split transaction protocol. In the latter case, the target device issues a Split Response, executes the command internally, and afterwards initiates its own transaction (a Split Completion) to send the data or to inform the initiator of the completion of the initial (split) transaction. The target device must split a transaction if it cannot answer it before the initial latency period expires. The device that initiates a split transaction is called the requester.
The device that completes a split transaction is called the completer. To complete the transaction, the completer must request bus control from the arbiter; the requester plays the role of the target device in the completion phase. Even a device that is formally not a bus master (as indicated in its configuration space registers) can complete a transaction by the split method. A Split Completion transaction looks much like a burst write transaction, but differs from it in the address phase: Instead of a full memory or I/O address, the identifier of the sequence to which this completion pertains (with the requester's bus, device, and function numbers) and only the lower six address bits are sent over the AD bus (Fig. 6.4). The completer obtains this identifier from the attributes of the transaction that it splits.
Using this identifier (specifically, the requester's bus number), the bridges convey the completion transaction to the requesting device. The completer's identifier (CBN—bus number, CDN—device number, and CFN—function number; see Fig. 6.3, c) is sent in the attribute phase. The requester must recognize its sequence identifier and respond to the transaction in the regular way (immediately). The sequence may be processed in a series of completion transactions, until the byte counter is exhausted (or the sequence is terminated by a time-out). The requester works out by itself to which starting address each of the completion transactions belongs (it knows what it asked for and how many bytes have already arrived). A completion transaction can carry either the requested read data or a Split Completion Message.
The requester must always be ready to receive the data of a sequence that it has started; moreover, the data of different sequences may arrive in arbitrary order. The completer can also generate completion transactions for several sequences in arbitrary order. Within one sequence, the completions must, naturally, be ordered by address (the addresses themselves are not sent). The attributes of a completion transaction contain the bus, device, and function numbers and a byte counter. In addition, they contain three flags:
- Byte Count Modified (BCM): indicates that fewer data bytes will be sent than the requester asked for (sent with completion data).
- Split Completion Error (SCE): indicates a completion error; set when a completion message is sent as an early error indicator (before the message itself has been decoded).
- Split Completion Message (SCM): message indicator (distinguishes a message from data).
PCI-X 2.0 Data Transfer Distinctions
In addition to the protocol changes described above, a new operating mode—Mode 2—was introduced in PCI-X 2.0. The new mode allows faster writing of memory blocks, uses ECC control, and can be used only with the low-voltage (1.5 V) interface. It has the following features:
- The time for address decoding by the target device—the delay of its DEVSEL# response to a command directed to it—has been increased by 1 clock in all transactions. This extra clock is needed for the ECC control to ascertain the validity of the address and command.
- In Memory Write Block transactions, data are transferred at two or four times the clock frequency. In these transactions, the BEx# signals are used for synchronization from the data source (they are not used as byte enables, because it is assumed that all bytes must be enabled). Each data transfer (64, 32, or 16 bits) is strobed by the BEx# signals: the BE[1:0]# and BE[3:2]# line pairs provide differential strobe signals for the AD[15:0] and AD[31:16] data lines. There can be two or four data subphases in one bus clock, which at a CLK frequency of 133 MHz yields PCI-X266 and PCI-X533. Because all control signals are synchronized by the common CLK signal, the transfer granularity becomes two or four data subphases: for a 32-bit bus, this means that data transfers (as well as halts or suspensions of a transfer) come in multiples of 8 or 16 bytes.
In the 64-bit version of the bus, the AD[63:32] lines are used only in data phases; only the 32-bit bus is used for the address (even a 64-bit one) and for the attributes.
Devices operating in Mode 2 have the option of using a 16-bit bus. In this case, the address and attribute phases take 2 clocks each, and the data phases always come in pairs (preserving the regular granularity). On the address/data bus, the AD[31:16] lines carry bits [15:0] of the information in the first phase of a pair and bits [31:16] in the second. The C/BE[1:0]# information is sent over the C/BE[3:2]# lines in the first phase, and the C/BE[3:2]# information in the second. The ECC[5:2] lines are used for ECC control: bits ECC[0], ECC[1], ECC[6], and the special E16 control bit are sent over them in the first phase, and bits ECC[5:2] in the second. The 16-bit bus is intended only for embedded applications (no slots or expansion cards are provided for it).
Message Exchange Between Devices (DIM Command)
The ability to send information (messages) to a device, addressing it by its identifier (bus, device, and function numbers), was introduced in PCI-X 2.0. The memory and I/O address spaces are not used to address and route these messages, which can be exchanged between any bus devices, including the host bridge. The messages are sent in sequences that use the Device ID Message (DIM) command. This command has specific address and attribute formats. In the address phase (Fig. 6.5, a), the identifier of the message receiver (completer) is sent: the numbers of its bus, device, and function (CBN, CDN, and CFN, respectively). The Route Type (RT) bit indicates the routing type of the message: 0—explicit addressing using the identifier mentioned above; 1—implicit addressing to the host bridge (the identifier is not used in this case). The Silent Drop (SD) bit sets the error-handling method for the given transaction: 0—regular (as for a memory write); 1—some types of errors are ignored (but not parity or ECC errors). The Message Class field sets the message class, according to which the lower address byte is interpreted. A transaction can also use a dual-address cycle. In this case, the DAC command code is sent over the C/BE[3:0]# lines in the first address phase, with the contents of bits AD[31:00] corresponding to Fig. 6.5, a; the DIM command code is sent over the C/BE[3:0]# lines in the second address phase, with bits AD[31:00] interpreted according to the message class. Having decoded the DIM command, a device that supports message exchange checks whether the receiver identifier field matches its own.
The message source identifier (RBN, RDN, and RFN), the message tag (Tag), the 12-bit byte counter (UBC and LBC), and additional attribute bits are sent in the attribute phase (Fig. 6.5, b). The Initial Request (IR) bit is the start-of-message indicator; the message itself can be broken into parts by the initiator, the receiver, or intermediary bridges (the bit is set to zero in all the following parts). The Relaxed Ordering (RO) bit indicates that the given message can be delivered in any order relative to the other messages and memory writes propagating in the same direction (the order in which the fragments of a given message are delivered is always preserved).
The body of the message, which is sent in the data phase, can be up to 4,096 bytes long (this limit is set by the 12-bit byte counter). The contents of the body are determined by the message class; class 0 is used at the manufacturer's discretion.
Bridges transfer explicitly routed messages using the bus number of the receiver. Problems with the transfer may arise only at the host bridges: If there is more than one host bridge, it may be very difficult to link them architecturally (through the memory controller buses, for example). The capability to transfer messages from one bus to another through host bridges is desirable (it is simpler than transferring transactions of all types), but not mandatory. If this capability is supported, the user enjoys more freedom (the bus topology does not have to be considered when placing devices). Implicitly routed messages are sent only in the direction of the host.
Supporting DIM is not mandatory for PCI-X devices, but PCI-X Mode 2 devices are required to support it. If a DIM message is addressed to a device located on a bus operating in the standard PCI mode (or the path to it goes through a PCI bus), the bridge either discards the message (if SD=1) or aborts the transaction (Target-Abort, if SD=0).
Boundaries of Address Ranges and Transactions
The Base Address Registers (BARs) in the configuration space header describe the memory and I/O ranges occupied by a device (or, more exactly, by a function). It is assumed that the range length is a power of two (2^n, n = 0, 1, 2, …) and that the range is naturally aligned. In PCI, memory ranges are allocated in units of 2^n paragraphs (a paragraph is 16 bytes; i.e., the minimum range size is 16 bytes); I/O ranges are allocated in units of 2^n double words. PCI-to-PCI bridges map memory addresses with a granularity of 1 MB and I/O addresses with a granularity of 4 KB.
In PCI, a burst transaction can be interrupted at the boundary of any double word (any quadruple word in 64-bit transactions). In PCI-X, in order to optimize memory accesses, burst transactions can be interrupted only at special points called Allowable Disconnect Boundaries (ADBs). ADB points are located at intervals of 128 bytes: a whole number (1, 2, 4, or 8) of cache lines in modern processors. Of course, this limitation applies only to transaction boundaries inside a sequence: If a sequence has been planned to end off an ADB, then its last transaction will also end off a boundary. However, this type of situation is avoided by designing data structures that can be properly aligned (sometimes even at the cost of redundancy).
The term ADB Delimited Quantum (ADQ) is associated with these address boundaries; it denotes the part of a transaction or of buffer memory (in bridges and devices) that lies between adjacent allowable disconnect boundaries. For example, a transaction crossing one allowable disconnect boundary consists of two data ADQs and occupies two ADQ buffers in a bridge.
In accordance with the allowable transaction boundaries, the memory areas that PCI-X devices occupy also must begin and end at ADBs: Memory is allocated in ADQ quanta. Consequently, the minimum memory area allocated to a PCI-X device cannot be less than 128 bytes and, taking the area description rules into account, its size must be 128 × 2^n bytes.
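Counting ADQs is simple arithmetic on 128-byte regions; a small helper illustrating the example above:

#include <stdint.h>

#define ADB_INTERVAL 128u  /* allowable disconnect boundary spacing */

/* Number of ADQs (128-byte regions) a transfer touches: a transfer
 * crossing one ADB consists of two ADQs, as in the example above. */
static unsigned adq_count(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return 0;
    return (unsigned)((addr + len - 1) / ADB_INTERVAL
                      - addr / ADB_INTERVAL + 1);
}

/* adq_count(0x70, 64) == 2: bytes 0x70..0xAF cross the ADB at 0x80. */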
Transaction Execution Time, Timers, and Buffers
The PCI protocol regulates the time (number of clocks) allowed for different phases of a transaction. Bus operation is controlled by several timers, which do not allow bus cycles to be wasted and make it possible to plan bandwidth distribution.
Each target device must respond to the transaction addressed to it sufficiently rapidly. The reply from the addressed target device (the DEVSEL# signal) must come within 1-3 clocks after the address phase, depending on how fast the particular device is: 1 clock—fast, 2 clocks—medium, 3 clocks—slow decoding. If there is no answer, the next clock is allocated to transaction intercepting by a subtractive address decoding bridge.
Target initial latency (i.e., the delay in the appearance of the TRDY# signal with respect to the FRAME# signal), must not exceed 16 bus cycles. If, because of its technical characteristics, a device sometimes does not manage to complete its business during this interval, it must assert the STOP# signal, terminating the transaction. This will make the master device repeat the transaction, and chances are greater that this attempt will be successful. If a device is slow and often cannot manage to complete a transaction successfully within 16 bus cycles, then it must execute Delayed Transactions. Target devices are equipped with an incremental bus-cycle-duration tracking mechanism (Incremental Latency Mechanism) that does not allow the interval between the adjacent data phases in the burst (target subsequent latency) to exceed 8 bus cycles. If a target device cannot maintain this rate, it must terminate the transaction. It is desirable that a device inform about its “falling behind” as soon as possible, without waiting out the 16- or 8-cycle limits: This economizes on the bus’ bandwidth.
The initiator also must not slow down the data flow. The permissible delay from the beginning of the FRAME# signal to the IRDY# signal (master data latency), as well as between data phases, must not exceed 8 cycles. If a target device periodically rejects a memory write operation and requests a repeat (as may happen when writing to video memory, for example), then there is a time limit for the operation to be completed. The Maximum Completion Time timer has a threshold of 10 μsec—334 cycles at 33 MHz or 668 cycles at 66 MHz—during which the initiator must have an opportunity to push through at least one data phase. The timer begins counting from the moment the memory write repeat is requested, and is reset when a subsequent memory write transaction other than the requested repeat is completed. Devices that cannot comply with the maximum memory write time limit must provide their drivers with a means of determining the states in which sufficiently fast memory writes are impossible. The driver, naturally, must take such states into consideration and not strain the bus and the device with fruitless write attempts.
Each master device capable of forming a burst of more than two data phases long must have its own programmable Latency Timer, which regulates its operation when it loses bus control. This timer actually sets the limitation on the length of a burst transaction and, consequently, on the portion of the bus bandwidth allotted to this device. The timer is set going every time the device asserts the FRAME# signal, and counts off bus cycles to the value specified in the configuration register of the same name. The actions of a master device when the timer reaches the threshold depend on the command type and the states of the FRAME# and GNT# signals at the moment the timer is triggered:
-
If the master device deactivates the FRAME# signal before the timer is triggered, the transaction terminates normally.
-
If the GNT# signal is deactivated and the command currently being executed is not a Memory Write and Invalidate command, then the initiator must curtail the transaction by deactivating the FRAME# signal. It is allowed to complete the current data phase and execute one more.
-
If the GNT# signal is deasserted and a Memory Write and Invalidate command is being executed, then the initiator must complete the transaction in one of two ways. If the double word currently being transmitted is not the last in the cache line, the transaction is terminated at the end of the current cacheline. If the double word is the last in the current cacheline, the transaction is terminated at the end of the next cache line.
Arbitration latency is defined as the number of cycles that elapse from the time the initiator issues a bus-control request by the REQ# signal to the time it is granted this right by the GNT# signal. This latency depends on the activity of the other initiators, the operating speeds of the other devices (the fewer wait cycles they introduce, the better), and how fast the arbiter itself is. As noted above, depending on the command being executed and the state of the signals, a master device must either curtail the transaction or continue it to its planned completion.
When master devices are configured, they declare their resource requirements by stating the maximum permissible bus access grant delay (Max_Lat) and the minimum time they need control over the bus (Min_GNT). These requirements depend on how fast a device is and how it has been designed. However, whether these requirements will actually be satisfied (the arbitration strategy is supposed to be chosen based on them) is not guaranteed.
More precisely, the arbitration latency is counted from the moment the master's REQ# is asserted to the moment it receives the GNT# signal and the bus goes into the Idle state (only from that moment can the device begin a transaction). The total latency depends on how many master devices there are on the bus, how active they are, and on the values (in bus clocks) of their latency timers. The greater these values, the longer other devices have to wait to be granted control over a considerably loaded bus.
The bus allows a device to lower its power consumption at the price of decreased performance by using address/data stepping on the AD[31:0] and PAR lines:
-
In continuous stepping, signals begin to be formed by low-current formers several cycles before asserting the valid-data acknowledgement signal (FRAME# in the address phase; IRDY# or TRDY# in the data phase). During these cycles, the signals will “crawl” to the required value using lower current.
-
In discrete stepping, signal formers use regular currents but, instead of all switching at the same time, switch in groups (e.g., byte by byte), with only one group switching per cycle. This reduces current surges, as fewer formers switch simultaneously.
A device is not obliged to use these capabilities (see the description of the command register's bit 7 functions), but it must “understand” these cycles. If a device delays the FRAME# signal, it risks losing the right to access the bus should the arbiter receive a request from a device with higher priority. For this reason, stepping was abolished in PCI 2.3 for all transactions except accesses to device configuration areas (type 0 configuration cycles). In these cycles, a device might not have enough time to recognize, in the very first cycle of the transaction, the IDSEL selection signal that arrives through a resistor from the corresponding ADx line.
In PCI-X, the requirements on the number of cycles are more stringent:
-
The initiator has no right to generate wait cycles. In write transactions, the initiator places the initial data (DATA0) on the bus two clocks after the attribute phase; if the transaction is of the burst type, the next data (DATA1) are placed two clocks after the device answers with a DEVSEL# signal. If the target device does not indicate that it is ready (by the TRDY# signal), the initiator must alternate DATA0 and DATA1 data in each clock, until the target device gives a ready signal (it is allowed to generate only an even number of wait cycles).
-
The target device can introduce wait cycles only for the initial data phase of the transaction; no wait is allowed in the following data phases.
To take full advantage of the bus' capabilities, devices must have buffers to accumulate data for burst transmissions. It is recommended that devices with transmission speeds of up to 5 MBps have buffers holding at least 4 double words; for faster devices, buffers holding 32 double words are recommended. For memory exchange operations, transactions that operate on a whole cache line are the most effective, which is also taken into account when the buffer size is determined. However, increasing the buffer size may complicate error processing and may also increase data delivery delays: Until a device fills its buffer to the predetermined level, it will not begin sending the data, and the devices for which the data are intended will be kept waiting.
The specification gives an example of a Fast Ethernet card design (transmission speed 10 MBps) that has a 64-byte buffer divided into two parts for each transmission direction (a ping-pong buffer). While the adapter is filling one half of the buffer with an incoming frame, it is outputting the accumulated contents of the second half into memory, after which the two halves swap places. It takes 8 data phases (approximately 0.25 μsec at 33 MHz) to output each half into memory, which corresponds to the MIN_GNT=1 setting. With incoming data arriving at 10 MBps, it takes 3.2 μsec to fill each half, which corresponds to the MAX_LAT=12 setting (in the MIN_GNT and MAX_LAT registers, time is set in 0.25 μsec units).
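The register values in this example can be reproduced with a few lines of arithmetic. The sketch below assumes the rounding directions (up for MIN_GNT, down for MAX_LAT), which the example implies but does not state explicitly.

    #include <stdio.h>

    /* Recompute the Fast Ethernet example: MIN_GNT and MAX_LAT are
       programmed in units of 0.25 usec. */
    int main(void)
    {
        double burst_time = 8 * 0.030;   /* 8 data phases x 30 nsec, in usec   */
        double fill_time  = 32.0 / 10.0; /* 32 bytes at 10 bytes/usec (10 MBps) */

        printf("MIN_GNT = %d\n", (int)(burst_time / 0.25 + 0.9999)); /* -> 1  */
        printf("MAX_LAT = %d\n", (int)(fill_time / 0.25));           /* -> 12 */
        return 0;
    }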
Data Transfer Integrity Control and Error Handling
Parity checking of addresses and data is used to control data transfer integrity on the PCI bus; in PCI-X, an ECC with correction of one-bit errors is employed. ECC is mandatory for PCI-X Mode 2 operation; it can also be used when operating in Mode 1. The data transfer integrity control method is communicated by the bridge in the initialization pattern after a hardware reset of the bus; the bridge selects the control method supported by all its bus clients (including itself). Errors are reported by the PERR# signal (protocol signaling between the devices) and the SERR# signal (a fatal error signal that, as a rule, generates a nonmasked system interrupt).
The PAR and PAR64 signals are used in parity checking; these signals provide even parity over the AD[31:0], C/BE[3:0]#, PAR and the AD[63:32], C/BE[7:4]#, PAR64 sets of lines. The parity signals PAR and PAR64 are generated by the device that controls the AD bus at the given moment (the one placing a command and its address, attributes, or data). Parity signals are generated with a delay of one clock with respect to the lines they protect: AD and C/BE#. The rules are somewhat different for read operations in PCI-X: Parity bits in clock N pertain to the data bits of clock N – 1 and the C/BE# signals of clock N – 2. The PERR# and SERR# signals are generated by the information receiver in the clock that follows the one in which the wrong parity appeared.
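Since PAR is simply the XOR of the 36 protected bits (even parity over the whole group including PAR itself), it can be modeled with a short reduction; the function below is only a model of the rule, not of any bus logic.

    #include <stdint.h>

    /* PAR value for one clock: the XOR of AD[31:0] and C/BE[3:0]#, so
       that the 36 lines plus PAR carry an even number of ones in total. */
    unsigned pci_par(uint32_t ad, uint8_t cbe)
    {
        uint64_t v = ((uint64_t)(cbe & 0xFu) << 32) | ad;
        unsigned par = 0;
        while (v) {             /* fold all bits together modulo 2 */
            par ^= (unsigned)(v & 1u);
            v >>= 1;
        }
        return par;
    }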
With ECC, a 7-bit code on the ECC[6:0] lines is used to check the AD[31:0] and C/BE[3:0]# lines in the 32-bit mode; in the 64-bit mode, an 8-bit code with the ECC[7:0] signals is employed; in the 16-bit mode, a somewhat modified ECC7+1 scheme is used. In all operating modes, the ECC control allows only single-bit errors to be corrected, and most errors of higher multiplicity to be detected. Error correction can be disabled by software (via the ECC control register); in this case, all errors of multiplicity 1, 2, or 3 are detected. In all cases, the diagnostic information is saved in the ECC registers. The ECC bits are placed on the bus following the same rules and with the same latency as the parity bits. However, the PERR# and SERR# signals are generated by the information receiver one clock after the valid ECC bits: An extra clock is allotted for decoding the ECC syndrome and attempting to correct the error.
A detected parity error (the same as an ECC error in more than one bit) is unrecoverable. Information integrity in the address phase, and for PCI-X in the attribute phase, is checked by the target device. If an unrecoverable error is detected in these phases, the target device issues a SERR# signal (one clock long) and sets bit 14 in its status register: Signaled System Error. In the data phase, data integrity is checked by the data receiver; if it detects an unrecoverable error, it issues a PERR# signal and sets bit 15 in its status register: Detected Parity Error.
In the device status register, bit 8 (Master Data Parity Error) reflects the failure of a transaction (sequence) because of a detected error. The rules for setting it differ in PCI and PCI-X (a decoding sketch for these status bits follows the list below):
-
In PCI, it is set only by the transaction initiator when it generates (when doing a read) or detects (when doing a write) a PERR# signal.
-
In PCI-X, it is set by the transaction requester or a bridge: a read transaction initiator detects an error in data; a write transaction initiator detects the PERR# signal; a bridge as a target device receives completion data with an error or a completion message with a write transaction error from one of the devices.
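The three bits just discussed occupy fixed positions in the standard status register, so they can be decoded uniformly for PCI and PCI-X devices. A minimal decoding sketch follows; obtaining the 16-bit status value (a configuration read) is left out, and remember that in the real register these bits are cleared by writing ones to them.

    #include <stdint.h>
    #include <stdio.h>

    #define STS_MASTER_DATA_PARITY_ERROR (1u << 8)   /* bit 8  */
    #define STS_SIGNALED_SYSTEM_ERROR    (1u << 14)  /* bit 14 */
    #define STS_DETECTED_PARITY_ERROR    (1u << 15)  /* bit 15 */

    void report_status(uint16_t status)
    {
        if (status & STS_DETECTED_PARITY_ERROR)
            puts("parity/ECC error detected by the device as receiver");
        if (status & STS_SIGNALED_SYSTEM_ERROR)
            puts("SERR# was signaled (e.g., address/attribute phase error)");
        if (status & STS_MASTER_DATA_PARITY_ERROR)
            puts("transaction failed because of a detected data error");
    }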
When a data error is detected, a PCI-X device and its driver have two alternatives:
-
Without attempting to undertake any actions to recover and continue working, issue a SERR# signal: This is an unrecoverable error signal that can be interpreted by the operating system as a reason to reboot. For PCI devices, this is the only option.
-
Not issue a SERR# signal and attempt to handle the error itself. This can be done only in software, taking into account all the potential side effects of the extra operations (a simple repeated read, for example, may cause data loss).
The alternative selected is determined by bit 0 (Unrecoverable Data Recovery Enable) in the PCI-X Command register. By default (after a reset), this bit is zeroed out, causing a SERR# signal to be generated in case of a data error. The other option must be selected by a driver that is capable of handling errors on its own.
A detected error in the address or attribute phase is always unrecoverable.
The initiator (requester) of a transaction must be able to notify its driver if the transaction is rejected upon a Master Abort (no answer from the target device) or a Target Abort (transaction aborted by the target device); this can be done using interrupts or other suitable means. If such notification is not possible, the device must issue a SERR# signal.
6.1.3 Bus Bandwidth
In modern computers, the PCI bus is the fastest I/O bus; however, even its actual bandwidth is not that high. Here, the most common version of the bus—32 bits wide, clocked at 33 MHz—will be considered. As previously mentioned, the peak data transfer rate within a burst cycle is 132 MBps; i.e., 4 bytes of data are sent in one bus clock (33 × 4 = 132). However, burst cycles are by no means always used. To communicate with PCI devices, the processor uses memory or I/O access instructions, which it sends via the host bridge; the bridge translates them into PCI bus transactions. Because the main registers of x86 processors are 32 bits wide, no more than four bytes of data can be transmitted in one PCI transaction produced by a processor instruction, i.e., a single transmission (a DWORD transaction). Moreover, if the address of the transmitted double word is not aligned on the corresponding boundary, either two single cycles or one cycle with two data phases will be produced; in either case, the access takes longer to execute than with an aligned address.
However, when writing a data array to a PCI device (a transmission with sequentially incrementing addresses), the bridge may try to organize burst cycles. Modern processors, starting with the Pentium, have a 64-bit data bus and use write buffers, so two back-to-back 32-bit write requests may be combined into one 64-bit request. If this request is addressed to a 32-bit device, the bridge will try to send it as a burst with two data phases. An “advanced” bridge may also attempt to assemble consecutive requests into a burst, which may produce a burst of considerable length. Burst write cycles may be observed, for example, when the MOVSD string instruction with the REP repeat prefix is used to send a data array from the main memory to a PCI device. The same effect is also produced by sequences of LODSW, STOSW, and other memory access instructions.
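In a high-level language, this access pattern is just a loop of aligned, ascending stores into a memory-mapped device region, as in the sketch below; whether the stores actually coalesce into bursts depends on the processor's write buffers and the bridge, as explained next. The bar_va pointer is assumed to be a view of the device memory already mapped into the address space.

    #include <stddef.h>
    #include <stdint.h>

    /* Back-to-back aligned DWORD writes at sequentially incrementing
       addresses: the raw material a bridge may assemble into bursts. */
    void write_array_to_device(volatile uint32_t *bar_va,
                               const uint32_t *src, size_t dwords)
    {
        for (size_t i = 0; i < dwords; i++)
            bar_va[i] = src[i];
    }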
Because the core of a modern processor executes instructions much faster than the bus can output their results, the processor can execute several other operations between the instructions that produce assembled writes. However, if the data transfer is organized by a high-level-language statement, which for the sake of versatility is much more complex than the above-mentioned primitive assembler instructions, the transactions will most likely be executed one at a time for one of two reasons: either the processor's write buffers will not have enough “patience” to hold one 32-bit request until the next one appears, or the processor's or bridge's write buffers will be forcibly flushed upon a write request (see Section 6.2.10).
Reading from a PCI device in the burst mode is more difficult. Naturally, processors do not buffer the data they read: A read operation may be considered complete only when the actual data are received. Consequently, even string instructions will produce single cycles. However, modern processors can generate requests to read more than 4 bytes; for this purpose, instructions that load data into the MMX or XMM registers (8 and 16 bytes, respectively) may be used. From these registers, the data are then unloaded into RAM (which works much faster than any PCI device).
String I/O instructions (INSW, OUTSW with the REP repeat prefix) that are used for programmed input/output of data blocks (PIO) produce a series of single transactions because all the data in the block pertain to one PCI address.
It is easy to observe with an oscilloscope how a device is accessed: In single transactions, the FRAME# signal is asserted only for one clock; this is longer for burst transactions. The number of data phases in a burst is the same as the number of cycles during which both the IRDY# and TRDY# signals are asserted.
Trying to perform write transactions in the burst mode is advisable only when the PCI device supports burst transactions in the target mode. If it does not, attempting to write data in the burst mode will even lead to a slight efficiency loss, because the transactions will be terminated at the initiative of the slave device (by the STOP# signal) and not by the master device, causing the loss of one bus cycle. Thus, for example, when an array is written into a PCI device's memory using a high-level-language statement, a medium-speed device (one that introduces only 3 wait cycles) receives data every 7 cycles, which at 33 MHz gives a speed of 33 × 4/7 = 18.8 MBps. Here, the active part of the transaction—from assertion of the FRAME# signal to deassertion of the IRDY# signal—takes 4 cycles, and the pause takes 3 cycles. The same device using the MOVSD instruction receives data every 8 bus cycles, giving a speed of 33 × 4/8 = 16.5 MBps.
These data were obtained by observing the operation of a PCI core implemented in an Altera FPGA that does not support burst transactions in the slave mode. The same device works much more slowly when its memory is read: Using the REP MOVSW instruction, data could be obtained only once every 19–21 bus cycles, giving an average speed of 33 × 4/20 = 6.6 MBps. Here, the negative factors are the device's high latency (it presents data only 8 cycles after the FRAME# signal is asserted) and the fact that the processor begins its next transaction only after receiving the data from the previous one. In this case, despite losing a cycle (used by the target device to terminate the transaction), the trick of using the XMM register produces a positive effect: Each 64-bit processor request is executed as a consecutive pair of PCI transactions with only a two-cycle wait between them.
To determine the theoretical bus bandwidth limit, let's return to Fig. 6.1 and determine the minimum time (number of cycles) needed to execute a read or write transaction. In a read transaction, the current master of the AD bus changes after the initiator has issued the command and address (cycle 1). The turnaround takes cycle 2, which is why the TRDY# signal from the target device is delayed. Then, if the target device is fast enough, a data phase may follow (cycle 3). After the last data phase, one more cycle is needed for the reverse turnaround of the AD bus (in this case, cycle 4). Thus, it takes at least 4 cycles of 30 nsec each (at 33 MHz) to read one double word (4 bytes). If such transactions follow each other immediately (if the initiator is capable of operating this way and bus control is not taken away from it), then single transactions can reach a maximum read speed of 33 MBps. In write transactions, the initiator controls the AD bus throughout, so no time is lost on turnarounds. With a fast target device that inserts no extra wait cycles, write speeds of 66 MBps may be achieved.
Speeds comparable to the peak values may be achieved only with burst transmissions, when the three extra cycles for reads and the one extra cycle for writes are added not to each data phase but to the burst as a whole. Thus, reading a burst of 4 data phases takes 7 cycles, producing a speed of V = 16/(7 × 30) bytes/nsec = 76 MBps; writing such a burst takes 5 cycles, giving V = 16/(5 × 30) bytes/nsec = 106.6 MBps. With 16 data phases, the read speed may reach 112 MBps, and the write speed 125 MBps.
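These figures all follow from one formula: a read burst of n data phases takes n + 3 clocks, a write burst n + 1 clocks, at 30 nsec per clock. The sketch below simply tabulates it.

    #include <stdio.h>

    /* Burst throughput at 33 MHz (30 nsec clock): n data phases of
       4 bytes each, plus 3 overhead clocks for a read or 1 for a write. */
    int main(void)
    {
        const double clk_ns = 30.0;
        for (int n = 1; n <= 16; n *= 2) {
            double rd = n * 4 / ((n + 3) * clk_ns) * 1000.0; /* MBps */
            double wr = n * 4 / ((n + 1) * clk_ns) * 1000.0; /* MBps */
            printf("%2d phases: read %6.1f MBps, write %6.1f MBps\n",
                   n, rd, wr);
        }
        return 0;
    }

For n = 1 this reproduces the single-transaction figures above (33 and 66 MBps); for n = 4 and n = 16 it gives 76/106.7 and 112/125 MBps.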
These calculations do not take into account the time lost when the initiator changes. Upon receiving the GNT# signal, an initiator can begin a transaction only after it ascertains that the bus is idle (i.e., that the FRAME# and IRDY# signals are deasserted); recognizing the idle state takes another cycle. As can be seen, a single initiator can grab most of the bus' bandwidth by increasing the burst length. However, this will delay other devices in obtaining bus control, which is not always acceptable. It should also be noted that far from all devices can respond to transactions without inserting wait cycles, so the actual figures will be more modest.
Therefore, to achieve maximum exchange efficiency, PCI devices themselves must be bus masters and, moreover, be capable of operating in the burst mode. Far from all PCI devices support the burst transmission mode, and those that do, as a rule, have substantial burst length limitations. The bus' bandwidth can be raised radically by going over to a 66 MHz clock and a 64-bit bus, but this is not an inexpensive solution. For devices critical to data delivery timing (such as network adapters, audio and video devices, etc.) to work normally on the bus, the bus' entire declared bandwidth should not be squeezed out of it: Overloading the bus may lead, for example, to packets being lost because of data delivery timing errors. A Fast Ethernet adapter (100 Mbps) operating in the half-duplex mode takes about 13 MBps (10%) of a regular bus' declared bandwidth; operating in the full-duplex mode, it takes 26 MBps. A Gigabit Ethernet adapter, even in the half-duplex mode, barely fits into a regular bus' declared bandwidth (it “survives” only because of its large internal buffers); a 64-bit bus operating at 66 MHz is more suitable for it. Switching to PCI-X, with its higher clock frequencies (PCI-X66, PCI-X100, PCI-X133) and fast memory writes (PCI-X266 and PCI-X533), produces a substantial increase in peak speed and effective throughput.
While on the subject of bus throughput and effective exchange rate with PCI devices, the overhead introduced by additional PCI to PCI bridges should be kept in mind. A device located on a distant bus receives less throughput than a device located immediately after the host bridge and to which the above discussion applies. This is due to the way in which the bridge operates: Transactions over the bridge are executed in several stages (see Section 6.1.6).
6.1.4 Interrupts: INTx#, PME#, MSI, and SERR#
PCI devices can signal asynchronous events using interrupts. There are four types of interrupts available on the PCI bus:
-
Traditional wire signaling over the INTx lines
-
Wire signaling for power management events over the PME# line
-
Signaling using messages: MSI
-
Signaling an unrecoverable error over the SERR# line
Hardware Interrupts in PC-Compatible Computers
Hardware interrupts provide a processor’s reaction to the events that occur asynchronously relative to the program code being executed. As a reminder, hardware interrupts are divided into masked and nonmasked.
The processor always reacts to nonmasked interrupts (provided it has finished handling the previous NMI). These interrupts have the fixed vector 2. Nonmasked interrupts are used in PCs to signal fatal errors. A signal arrives on the NMI line from the parity check or ECC control circuits, from the ISA bus control line (IOCHK), or from the PCI bus (SERR#). The NMI signal is blocked from reaching the processor by setting bit 7 of port 070h to one; individual sources are enabled and identified by the bits of port 061h (a decoding sketch follows the list below):
-
Bit 2—ERP (R/W): enables main memory control and the PCI bus SERR# signal
-
Bit 3—EIC (R/W): enables ISA bus control
-
Bit 6—IOCHK (R): ISA bus control error (IOCHK# signal)
-
Bit 7—PCK (R): memory parity error or a SERR# signal on the PCI bus
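A sketch of the decoding follows. The inb() helper is a hypothetical stand-in for the platform's port-input primitive (on x86 it would wrap the IN instruction).

    #include <stdint.h>

    extern uint8_t inb(uint16_t port);  /* hypothetical port-input helper */

    /* Identify the NMI source from the bits of port 061h listed above. */
    const char *nmi_source(void)
    {
        uint8_t b = inb(0x61);
        if (b & 0x80)                   /* bit 7: PCK */
            return "memory parity error or SERR# on the PCI bus";
        if (b & 0x40)                   /* bit 6: IOCHK */
            return "ISA bus control error (IOCHK# signal)";
        return "unknown NMI source";
    }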
The processor’s reaction to masked interrupts can be delayed by resetting its internal IF flag (the CLI instruction enables interrupt; STI disables). When an event requiring the processor’s attention arises, the adapter (controller) of the device generates an interrupt request that arrives to the input of the interrupt controller. The interrupt controller generates a general masked interrupt request to the processor; when the processor confirms this request, the controller communicates to the processor the interrupt vector pointing to the software interrupt handler. The interrupt handler must service the given device, including resetting its request so that subsequent events can be reacted to and sending a completion command to the interrupt controller. On calling the interrupt handler, the processor automatically saves all the flags in the stack and clears the IF flag, disabling masked interrupts.
On return from the interrupt handler (by the IRET instruction), the processor restores the flags it saved, including IF (previously set to one), which enables interrupts again. The interrupt handler must contain an STI instruction if other (higher-priority) interrupts need to be reacted to while the current interrupt is being handled. This is especially important for long handlers; here, the STI instruction must be inserted as early as possible, right after the critical section (which must not be interrupted). The interrupt controller will handle subsequent interrupts of the same or lower priority only after it has received the interrupt completion command EOI (End of Interrupt).
Masked interrupts are used to signal about device events. Interrupt request signals are handled by interrupt controllers that are software-compatible with a chained pair or 8259A interrupt controllers. The general principle of generating interrupt requests is depicted in Fig. 6.6.
The 8259A controller allows individual request inputs to be masked and requests from different inputs to be organized into a priority system. The master controller 8259A#1 services requests 0, 1, and 3–7; its output is connected to the processor's interrupt request input. The slave controller 8259A#2, which services requests 8–15, is connected to input 2 of the master. Requests 8–15, with their descending priorities, are thus wedged between requests 1 and 3 of the master controller, whose request priorities also descend as the request numbers grow.
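The completion command mentioned above is the same on any PC-compatible machine: the nonspecific EOI (code 20h) written to the controller's command port. A request routed through the slave needs EOI on both controllers, as in this sketch; outb() is a hypothetical stand-in for the platform's port-output primitive.

    #include <stdint.h>

    extern void outb(uint16_t port, uint8_t val);  /* hypothetical helper */

    void send_eoi(int irq)
    {
        if (irq >= 8)
            outb(0xA0, 0x20);  /* EOI to the slave 8259A (requests 8-15) */
        outb(0x20, 0x20);      /* EOI to the master 8259A */
    }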
Requests to the inputs of the interrupt controllers arrive from system devices (keyboard, system timer, CMOS timer, coprocessor), motherboard peripheral controllers, and expansion cards. Traditionally, all ISA/EISA bus slots carry all the request lines that are not taken by the above-listed devices. These lines are denoted IRQx and have conventional functions (Table 6.4); some of them are given over to the PCI bus. The table also shows interrupt priorities: Requests are listed in order of decreasing priority. The vector numbers corresponding to the controllers' request lines, the priority system, and certain other parameters are set by software when the controllers are initialized; these main settings remain unchanged for purposes of software compatibility.
Table 6.4: Hardware Interrupts (in order of decreasing priority)
Each device requiring interrupts for its work must be assigned an interrupt number. The process of assigning interrupt numbers has two parts: First, the adapter that needs interrupts must be configured to use a specific interrupt line on the bus (by jumpers or software); second, the software support for the given adapter must be informed of the interrupt vector used. The Plug-and-Play system for the ISA or PCI buses can be engaged in the process of assigning interrupt numbers; special parameters in CMOS Setup are used to allocate interrupt request lines between buses. Modern operating systems can change the request number allocation performed by CMOS Setup.
Traditional PCI Interrupts: INTx#
Four physical interrupt request lines are allocated to PCI devices: IRQX, IRQY, IRQZ, and IRQW. They are connected to the INTA#, INTB#, INTC#, and INTD# lines of all PCI slots with a cyclic line offset (see Fig. 6.6). The correspondence of INTx# lines to IRQ inputs for devices on any PCI bus is shown in Table 6.5. PCI bridges simply electrically connect the same-name INTx lines of their primary and secondary buses.
Table 6.5: Interrupt Request Routing for PCI devices
A PCI device activates the interrupt signal by placing a low-level (open collector or sink output) signal on the selected interrupt line: INTx#. This signal must remain asserted until the software driver summoned by the interrupt resets the interrupt request by addressing the device that has issued it. If the interrupt controller again detects a low level on the interrupt request line after this, it means that an interrupt request on the same line has been placed by another device that shares the line with the first one, and that it is also requesting to be serviced.
Propagation of an interrupt request signal is not synchronized with the associated data transfers. A situation is possible in which an active device, having completed a data transfer to memory, issues an interrupt notifying the processor of this event, while the data transferred by the device are still delayed in the bridges (if the bus is overloaded); the processor would then begin servicing the interrupt without having received the data yet. In order to guarantee data integrity, the Interrupt Service Routine (ISR) must read one of the registers of its device: A read from behind the bridge forces all bridges to flush the buffered memory writes posted to them prior to the interrupt processing (see Section 6.1.6).
Interrupt request lines from the PCI slots and from the motherboard’s PCI devices are assigned to the inputs of the interrupt controllers in a relatively arbitrary manner. Configuration software can identify and indicate the taken interrupt request lines and the interrupt controller’s number by accessing the configuration space of the device (see Section 6.2.12). A software driver, having read the configuration registers, also can determine these parameters in order to provide the interrupt handler with the necessary vector and to reset the request line when servicing the interrupt.
Any PCI device function may enable its interrupt request line, but its interrupt request handler must be prepared to share the line with other devices. If a device requires only one interrupt request line, it should take the INTA# line; if it requires two, it should take the INTA# and INTB# lines; and so on. Taking into account the cyclic shift of the interrupt request lines, this rule makes it possible to install four simple devices into four adjacent slots so that each of them gets an individual interrupt request line. If a card requires two interrupt request lines, the adjacent slot must be left unoccupied to preserve monopolistic use of the interrupts. However, it must be remembered that the PCI devices built into the motherboard enable interrupts in the same manner (with the exception of the IDE controller, which, fortunately, is in a class by itself). In terms of interrupts, the AGP port must be considered like any other PCI slot. Consequently, it may turn out that far from all slots have individual interrupt lines.
Modern motherboards are equipped with an advanced programmable interrupt controller (APIC), which can provide additional inputs (as a rule eight of them, numbered from 16 to 23). These inputs are used by integrated devices, and some of them can be allocated to the PCI slots, which somewhat alleviates the interrupt lines shortage.
Devices and/or functions are assigned interrupt lines by the POST procedure, and this process is only partially controlled by the user. The interrupt request numbers available to the PCI bus are defined by the user in the CMOS Setup parameters (PCI/PNP Configuration). Depending on the BIOS version, this can be done in different ways: Either every INTA#…INTD# line is explicitly assigned its own number, or a range of numbers is given to PCI and ISA plug-and-play devices (although not to legacy ISA devices). In the end, POST determines which INTx# line corresponds to which controller request number and programs the interrupt request multiplexer accordingly. Depending on the user's actions, it may happen that not every PCI bus interrupt request line gets an individual interrupt controller input assigned to it. In this case, the multiplexer bundles several PCI interrupt request lines onto one controller input (i.e., even different PCI interrupt request lines become shared). In the worst-case scenario, PCI devices receive no inputs on the interrupt controller at all. It is unlikely that BIOS will give interrupts 14 or 15 (which belong to the IDE controller) or interrupts 3 or 4 (which belong to the COM ports) to the PCI bus. Modern operating systems get involved in the hardware platform functions to such an extent that they allow themselves (knowing the motherboard chipset or using PCI BIOS functions) to control the interrupt multiplexer. This capability can be enabled or disabled, for example, in Windows using the PCI Interrupt Steering flag in the PCI bus properties (Control Panel—System Devices—PCI Bus).
The driver (or other utility) that works with a PCI device determines the interrupt vector allocated to the device (or, rather, the function) by reading the Interrupt Line configuration register. This register holds the interrupt controller's input number (255 means that no number is assigned), and the vector is determined by this number. The input number for each device is assigned by POST: It reads the Interrupt Pin register of every function it detects and, from the device address (i.e., its geographical address), determines which of the INTA#…INTD# lines (at the input of the interrupt multiplexer) is used. The rules that define the correspondence between the Interrupt Pin and the interrupt request multiplexer input lines on the motherboard are not strictly set (the correspondence based on the device number modulo 4, sketched below, is just a recommendation), but the BIOS version of the motherboard in question knows them well. By this time, POST has already built the lines-to-input-numbers correspondence table; using this table, it writes the required value to the Interrupt Line configuration register. Determining whether there are other contenders for the same interrupt number is possible only by inspecting the configuration registers of the functions of all devices detected on the bus; this is not that difficult, and can be done using the PCI BIOS functions. The “delights” of shared interrupts are discussed in the following subsection.
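A minimal sketch of that recommended correspondence follows; the numbering of the physical lines is hypothetical, since the actual wiring of IRQW…IRQZ is board specific.

    /* Recommended (not mandatory) swizzle: a function's INTx# pin is
       shifted cyclically by its device number, modulo the four lines. */
    unsigned intx_line(unsigned device, unsigned pin /* 0=INTA# .. 3=INTD# */)
    {
        return (device + pin) % 4;  /* index of the physical interrupt line */
    }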
Starting with version 2.1, PCI BIOS has functions for determining interrupt routing capabilities and for configuring them. One function returns a data structure that provides, for each device (on each bus), information on the interrupt controller inputs (IRQx) to which the device's INTx lines can be connected, and on the inputs to which these lines are currently connected; the physical number of the slot into which the device is installed is also indicated. In addition, a bit map is returned showing which IRQx inputs are allocated exclusively to the PCI bus (and are not used by other bus clients). For a given device, the setup function links the selected INTx signal with the selected IRQx input of the interrupt controller (i.e., programs the multiplexer). This function is intended to be used only by configuration software (BIOS, the operating system), not by device drivers. The software that uses it is itself responsible for avoiding potential conflicts, for properly programming the interrupt controller (the selected input must respond to a low signal level, not to a positive transition), and for correcting the information in the configuration space of all devices involved (whose interrupt request lines are connected to the selected INTx line).
The Shared Interrupt Problem
Interrupt request lines are the scarcest resource in a computer inundated with peripheral devices; therefore, they have to be shared (i.e., one interrupt line is used by several devices). From the hardware point of view, the interrupt-sharing problem has been solved for the PCI bus: Here, the interrupt request is triggered by a low level, and the interrupt controller is sensitive to a level but not to a transition. Interrupts cannot be shared for the ISA bus, with its positive transition interrupt requests. The exception is motherboards and devices that support ISA Plug-and-Play, which can be made to work using the low level.
With the hardware solution of interrupt request line sharing successfully found, what remains is the task of identifying each interrupt source in order to launch a corresponding interrupt handler. It is desirable that this task be solved by the operating system and take minimal time.
Up to and including PCI 2.2, no commonly accepted method of indicating and disabling interrupts by software existed: For this purpose, each device used its own specific bits in operation registers mapped to the memory or I/O spaces. In such a case, only the given device's interrupt handler, which is part of its driver, can determine whether the device is the interrupt source at the given moment. Therefore, the operating system has no means of controlling shared interrupts other than lining their interrupt handlers up into a chain, and the handler's developer is responsible for its correct and efficient functioning. In PCI 2.3, fixed bits finally appeared in the status and command configuration registers of a device (function), so the operating system can use them to determine the source of a shared interrupt and summon only its handler. However, descriptions of devices and operating systems often do not mention whether they support PCI 2.3.
Device interrupt handlers must behave properly, taking into account the possibility of a shared interrupt entering the chain of interrupt handlers. When an interrupt is being processed, each handler in the chain must determine, by reading a register of its device, whether the interrupt was triggered by that device. If so, the handler must execute the necessary routine and clear the interrupt request of its device, handing over control to the next handler in the chain thereafter; otherwise, it simply hands over control to the next handler. Sometimes the following typical interrupt handler error is made: Having read the status register of its device and not detected a request indicator, the driver clears all the request sources (or even the entire device), just in case. This error is caused by a heedless driver developer who does not consider the possibility of interrupts being shared and does not trust hardware developers; seeing this unexpected situation during debugging, he or she “fixes” it by inserting the harmful code fragment. The harm lies in that, from the moment the device status register is read (with no request indicator found) to the moment this unnecessary clearing is executed, an interrupt request may arise in the device, and it will be blindly cleared and, consequently, lost. A skeleton of a correctly behaving shared handler is sketched below.
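Assuming the request-indicator bits and device accessors below exist (they are placeholders, not a real API), a correct shared handler looks like this:

    #include <stdbool.h>
    #include <stdint.h>

    extern uint32_t dev_read_status(void);          /* hypothetical accessors */
    extern void     dev_service(uint32_t pending);
    extern void     dev_ack_request(uint32_t pending);

    #define STATUS_REQUEST_MASK 0x0000000Fu         /* assumed request bits */

    /* Returns true if the interrupt came from our device, so the chaining
       logic knows whether the request has been dealt with. */
    bool shared_isr(void)
    {
        uint32_t pending = dev_read_status() & STATUS_REQUEST_MASK;
        if (pending == 0)
            return false;          /* not ours: touch nothing, pass it on */
        dev_service(pending);      /* handle the event(s) */
        dev_ack_request(pending);  /* clear ONLY the requests actually seen */
        return true;
    }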
However, even when the interrupt handlers lined up in a chain are correctly written, shared interrupts for devices of different types cannot, on the whole, be considered functional: Interrupts from devices that require a rapid reaction can be lost. This may happen if the handler for such a device is last in the chain and the handlers in front of it are not fast enough (in detecting that the interrupt is not theirs). How the system behaves in this case may differ depending on the order in which the drivers are loaded. For several devices of the same type (network adapters based on the same chip, for example) that use one driver, shared interrupts work out just fine.
Interrupt conflicts can manifest themselves in various ways. A network adapter may be unable to receive frames from the network (while being able to send them). It may take an extremely long time to access mass storage devices (sometimes, minutes for file or directory information to appear), or they may be impossible to access altogether. Audio cards may be silent or stutter, images in video players may be jerky, and so on. An interrupt conflict may also cause unexpected rebooting, for example, upon arrival of a frame from the network or a signal from the modem. The solution to the interrupt sharing problem can be moving cards around into suitable slots, in which no conflicts are observed (which does not necessarily mean that there are none). However, sometimes a “present” can be encountered from the motherboard's designers, in which only one of several PCI slots has a nonshared interrupt line (or none of them has one at all). As a rule, this kind of disorder cannot be treated without an X-Acto knife and a soldering iron. A more drastic method is to switch to signaling interrupts using messages: MSI.
Power Management Event Signaling: PME#
The PME# line, introduced in PCI 2.2, is used for signaling in the power management system: device state changes, system wake-up upon an event, etc. All PCI devices have electrical access to this line; just like the INTx# lines, the PME# line is not processed by the bridges in any way but is simply conveyed to all the clients. The signaling logic is analogous to that of INTx#: A device signals an event by shorting PME# to ground, so event signals from different devices are assembled by the wired-OR logic. The handler of this interrupt can identify the device that generated it by software access to the configuration registers of all devices capable of generating this signal. Devices (functions) capable of power management have a structure with the Capability ID = 1 identifier and a set of registers in their configuration space. These registers and their functions are as follows:
-
Power Management Capabilities (PMC): the specification version; which power states are supported; in which states PME# can be generated; whether the CLK signal is needed to generate PME#; the power consumption from the 3.3 V Aux supply.
-
Power Management Control/Status Register (PMCSR): indication, clearing, and enabling of PME#; power state control; control of data output via the Data register.
-
Data: an optional register that can be used, for example, to output the power consumption information.
-
Bridge Support Extension (PMCSR_BSE): indicators of bridge support: secondary bus control depending on the power state; the state of the secondary bus when the bridge is switched into D3 (clock halt or depowering).
The details of PCI power management and the formats of the corresponding configuration registers can be found in the PCI PM 1.1 specification.
Message Signaled Interrupts: MSI
The PCI bus has a progressive asynchronous event-notification mechanism based on message signaling: Message Signaled Interrupts (MSI). Here, to signal an interrupt request, the device requests bus control and, having received it, sends a message. The message looks like a regular double word memory write; the address (32- or 64-bit) and the message template are written into the configuration registers of the device (or, to be more exact, of the function) during the configuration stage. The upper 16 bits of the message are always zeroes, while the lower 16 carry the information about the interrupt source. A device (function) may need to signal more than one type of request; according to the device’s needs and its resources, the system tells the function how many different types of request it can generate.
Whether a function can use MSI is described in the configuration space by the MSI Capability structure (CAP_ID=05h), which must be present in the space of each function that supports MSI. There are three or four registers in the structure (Fig. 6.7); a configuration sketch follows the list below:
-
Message Address: a 32-bit memory address to which the message is sent (bits [1:0]=00). If 64-bit addressing is used (bit 7 in the Message Control register is set), then the upper part of the address is located in the Message Upper Address register. The system software places values in the address registers during the configuration stage.
-
Message Data: a 16-bit template for data that are sent in the message over the AD[15:0] lines. The system software writes the template during the configuration stage. The message sent by a function can have only a few lower bits (whose number is defined by the contents of the Multiple Message Enable field) modified in order to indicate different interrupt conditions. The rest of the message bits must comply with the template; bits [31:16] are always zeroed out.
-
Message Control: 16 bits long. In bit 7, the function indicates its ability to generate a 64-bit address. In the Multiple Message Capable field, the function declares its ability to generate distinguishable interrupt conditions. In Multiple Message Enable, the system tells the function the allowed number of conditions. Here, the values 000–101 are the binary-coded number of lower template bits that the device may modify in order to identify the interrupt source: 000—none (only one identifier is available to the device); 101—five bits, so that during the write the AD[4:0] lines identify the specific interrupt condition (one out of 32 available to the given function). The values 110 and 111 are reserved. The MSI_Enable bit enables MSI. MSI is disabled after a hardware reset; it can be enabled by software by setting the MSI_Enable bit (after the message address and template have been programmed), after which interrupt generation over INTx# is disabled.
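A hedged sketch of programming this structure is shown below (32-bit addressing, one message). The register offsets relative to the capability base are those of the standard MSI structure; finding the capability base itself (walking the capability list for CAP_ID=05h) and the cfg_* accessors are assumed.

    #include <stdint.h>

    extern uint16_t cfg_read16(int b, int d, int f, int reg);
    extern void     cfg_write16(int b, int d, int f, int reg, uint16_t v);
    extern void     cfg_write32(int b, int d, int f, int reg, uint32_t v);

    #define MSI_CTRL   0x02  /* Message Control, relative to capability base */
    #define MSI_ADDR   0x04  /* Message Address (lower 32 bits)              */
    #define MSI_DATA   0x08  /* Message Data (32-bit addressing assumed)     */
    #define MSI_ENABLE 0x0001

    void msi_setup(int b, int d, int f, int cap, uint32_t addr, uint16_t data)
    {
        cfg_write32(b, d, f, cap + MSI_ADDR, addr);  /* where to write       */
        cfg_write16(b, d, f, cap + MSI_DATA, data);  /* the message template */
        uint16_t ctrl = cfg_read16(b, d, f, cap + MSI_CTRL);
        cfg_write16(b, d, f, cap + MSI_CTRL, ctrl | MSI_ENABLE);
    }

Note the order: address and template first, MSI_Enable last, matching the rule stated above.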
The system can interpret two or more identical messages arriving fast back-to-back as one interrupt (because of its slow reaction). If each of them needs to be serviced, the interrupt handler must confirm the reception of each message to the device, and the device must not send another message until it receives confirmation that the previous one has been received. Different messages present no problems in this respect.
This mechanism can be used on motherboards that have an Advanced Programmable Interrupt Controller (APIC). A message is sent by writing the request number into the corresponding APIC register. For example, on motherboards based on a chipset with the ICH2 82801 hub, this register is located at the memory address FEC00020h, and the interrupt number can be in the range 0–23.
Interrupts using MSI make it possible to avoid the interrupt sharing problem arising from the scarcity of interrupt request lines in PCs. Moreover, they solve the data integrity problem: All data that the device writes prior to issuing an MSI are guaranteed to arrive at the recipient before MSI processing begins. MSI interrupts by some devices can be used alongside INTx interrupts by other devices in the same system; however, a device (function) that uses MSI must not use INTx interrupts.
6.1.5 Direct Memory Access, ISA DMA Emulation (PC/PCI, DDMA)
As mentioned previously, unlike the ISA bus with its 8237A controller, the PCI bus does not provide direct memory access through a centralized controller. To relieve the CPU of routine data transfers, PCI offers direct bus control by devices: PCI bus masters. The intelligence of these bus masters varies. The simplest ones transfer blocks of data to and from the system memory (or the memory of other devices) under the CPU's control: By accessing the pertinent registers of the master device, the CPU sets the starting address, block length, and transfer direction, and then gives permission to begin the transfer. This done, the master device executes the transfer when it is ready, without distracting the CPU; in this way, direct memory access is accomplished. A more sophisticated DMA controller can read from and write to chained (scattered) buffers, and perform other similar operations known from the advanced ISA/EISA DMA controllers. A more intelligent master device, which as a rule has its own microcontroller, does not limit itself to working under the CPU's control, but performs exchanges under its own controller's program.
In order to keep PCI devices less complex and make them compatible with old PC-oriented software, Intel developed the special PC/PCI DMA protocol. This protocol changes the functions of the REQi# and GNTi# signals of a bus agent specified in advance, which plays the role of a dedicated DMA conductor. The logic of this agent's signal pair DRQx# and DACKx# (external, in terms of the PCI bus) is analogous to that of the same-name ISA signals; the REQi# and GNTi# lines are also used, in a special way, in requesting control of the bus. When the agent receives a DRQx request (one or several), it sends the serially coded numbers of the active DRQx request lines over the REQi# line, using the CLK line for synchronization. In the first CLK cycle, the start bit is sent as a low level on the REQi# line; in the second cycle, the activity of the DRQ0 request is sent, then that of DRQ1, and so on, up to DRQ7, after which REQi# continues to be held low. The arbiter responds to this message by sending a message over the GNTi# line; this message also begins with a start bit, followed by the 3-bit code of the channel number that gets the DACK# acknowledgement for data transfer in the transaction. The agent must inform the arbiter of all changes on the request lines, including deactivations of request signals. The PC/PCI DMA mechanism can be implemented only in the motherboard chipset. An alternative solution—the Distributed DMA (DDMA) mechanism—makes it possible to break up the standard controller and emulate its individual channels with the resources of PCI cards. Both mechanisms can be implemented only as part of the bridge between the primary PCI bus and the ISA bus; therefore, their support can (or cannot) be provided only by the motherboard and enabled in CMOS Setup.
6.1.6 PCI and PCI-X Bridges
PCI Bridges are special hardware used to connect PCI buses to other buses and to each other. The Host Bridge is used to connect the main PCI bus (number 0) to the system bus (system RAM and the CPU). The honorable duty of the host bridge is to generate accesses to the configuration space under the control of the central processor; this allows the host (the central processor) to configure the entire subsystem of the PCI buses. There can be more than one host bridge in the system, which makes it possible to provide high-performance communications with the center for a large number of devices (there are limitations on how many devices can be placed on one bus). One of these buses is designated the main (bus 0).
Peer-to-peer PCI bridges are used to connect additional PCI buses. These bridges introduce an additional overhead to the data transfers, so the effective productivity of the exchange a device conducts with the center decreases with each bridge in the traffic’s way.
To connect PCMCIA, CardBus, MCA, ISA/EISA, X-Bus, and LPC buses to a PCI bus, special bridges are used. These bridges may be either integral parts of the motherboard chipsets, or implemented as discrete PCI devices (microchips). The bridges interconvert the interfaces of the buses that they connect, and also synchronize and buffer data.
Each bridge must be programmed—shown the memory and I/O address ranges allocated to the devices on the bus it serves. If the target address of the current transaction, initiated on the bus on one side of the bridge, belongs to the bus on the opposite side, the bridge relays the transaction to that side and takes care of coordinating the protocols of the buses. If there is more than one host bridge in the system, end-to-end routing between devices on different buses may not be possible: The host bridges are connected with each other only via the memory controller buses. Supporting the relaying of all types of PCI transactions between host bridges is too complicated in this case and, consequently, is not among the mandatory PCI requirements. Thus, all active devices on all PCI buses can access the system memory, but their peer-to-peer communication may depend on which PCI buses they are located on.
Using PCI bridges presents the following capabilities:
-
Increasing the number of devices that can be connected, overcoming the electrical limitations of a single bus.
-
Dividing PCI devices into segments—PCI buses—with different widths (32/64 bits), clock frequencies (33/66/100/133 MHz), and protocols (PCI, PCI-X Mode 1, PCI-X Mode 2). On each bus, the devices keep pace with the weakest member; placing devices on buses properly makes it possible to use the capabilities of the devices and of the motherboard with maximum efficiency.
-
Organizing segments with dynamic device connection/disconnection.
-
Organizing simultaneous parallel transaction execution from initiators located on different buses.
Every PCI bridge connects two buses: the primary bus, which is closer to the top of the hierarchy, and the secondary bus. The interfaces that connect the bridge to these buses are correspondingly called the primary and secondary. Only the tree-type bus connection configuration is allowed (i.e., two buses are interconnected by only one bridge and there are no bridge loops). Buses connected to the secondary interface of a bridge are called subordinate. PCI bridges form a PCI bus hierarchy at whose apex is the main bus (number zero), which is connected to the host bridge. If there is more than one host bridge, then the bus assigned number zero will be the main one.
A bridge must perform a set of mandatory functions:
-
To service the bus connected to its secondary interface. Namely:
-
To perform arbitration: receiving REQx# request signals from master devices on the bus and granting them, by GNTx# signals, the right to control the bus.
-
To park the bus: issuing a GNTx# signal to any device on the bus if none of the masters needs bus control.
-
To generate type 0 bus configuration cycles with issuing individual IDSEL signals to the addressed PCI device.
-
To pull the control signals up to the high level.
-
To determine the capabilities of the connected devices and select the bus operation mode that suits them (clock frequency, bus width, protocol).
-
To generate a hardware reset (RST#) upon a primary interface reset or on command, informing devices of the selected mode by special signaling (see Section 6.1.8).
-
To support maps of the resources located on the opposite sides on the bridge.
-
To reply as a target device to transactions initiated by a master on one interface and directed to a resource located on the other interface; to convey these transactions to the other interface, playing the role of a master device; and to convey the results of the transactions to the actual initiator.
Bridges that perform these functions are called transparent; no additional drivers are needed to work with devices that are located on such bridges. It is these bridges that are described in PCI Bridge 1.1; being PCI devices, they are assigned a special class (06). In the given case, the flat resource addressing model is assumed (resources being memory and I/O): Each device has its own addresses, unique within a given computer system (not overlapping with others).
There are also nontransparent bridges; these organize individual segments with their own local address spaces. Nontransparent bridges translate addresses for transactions in which the initiator and the target are located on opposite sides of the bridge. Not all resources (address ranges) of the opposite side may be reachable through such bridges. An example of nontransparent bridge use is a computer with a separate intelligent input/output subsystem (I2O) that has its own I/O processor and local address space.
Routing Functions of Transparent Bridges
The task of routing is to determine on which side of the bridge the resources addressed by each transaction are located; this is the first task performed in processing every transaction that a bridge sees on either of its interfaces. The task can be solved in two ways, because either a hierarchical PCI address (bus:device:function) or a flat memory or I/O address can be sent in the address phase.
Hierarchical Address Routing
Configuration write and read transactions, special cycle transactions, and (in PCI-X) also split transaction completions are addressed via bus and device numbers. The routing for these transactions is based on a bus-numbering system. When a system is configured, numbers are assigned to PCI buses strictly sequentially; bridge numbers correspond to the numbers of their secondary buses. Thus, the host bridge is assigned number zero. Numbers of the bridge’s subordinate buses start with the number that follows the number of its secondary bus. Therefore, the system bus topology information that each bridge needs is described by a bus number list—three numerical parameters in its configuration space:
-
Primary Bus Number
-
Secondary Bus Number (also, the bridge number)
-
Subordinate Bus Number—the maximal number of the subordinate bus
All buses with numbers in the range from the Secondary Bus Number to the Subordinate Bus Number inclusive lie on the secondary-interface side; all the other buses lie on the primary-interface side.
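As a minimal illustration, this downstream/upstream decision reduces to a range check on the three bus-number registers. The following C sketch uses illustrative names; they are not taken from any real driver or from the specification:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the three bus-number registers of a bridge. */
struct bridge_buses {
    uint8_t primary;      /* Primary Bus Number */
    uint8_t secondary;    /* Secondary Bus Number (= bridge number) */
    uint8_t subordinate;  /* Subordinate Bus Number (the maximum) */
};

/* True if the addressed bus lies on the secondary-interface side. */
static bool bus_is_downstream(const struct bridge_buses *br, uint8_t bus)
{
    return bus >= br->secondary && bus <= br->subordinate;
}
```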
Knowing the bus numbers allows bridges to relay accesses to the device configuration registers in the direction from the host to the subordinate buses, and to propagate special cycles in all directions. Bridges relay responses to split transactions (Split Complete) from one interface to another if they are addressed to a bus on the opposite side of the bridge.
Bridges do not relay type 0 configuration cycles. Bridges process type 1 configuration transactions detected on the primary interface in the following way:
-
If the bus number (AD[23:16]) corresponds to the secondary bus, bridges convert type 1 configuration transactions into type 0 configuration cycles or special cycles. During conversion to a type 0 cycle, the device number from the primary bus is decoded into a positional code on the secondary bus (see Section 6.1.2.1); the function and register numbers are relayed unchanged, and the AD[1:0] bits on the secondary bus are zeroed out. In PCI-X, besides the positional code, the secondary bus also receives the device number. A conversion into a special cycle (changing the command code) is used if all bits in the device and function number fields have values of one, while all bits in the register number field have values of zero.
-
If the bus number corresponds to the number range of the subordinate buses, bridges pass the transaction from the primary interface to the secondary without changes.
-
If the bus number lies outside the bus number range of the secondary interface side, bridges ignore the transaction.
Bridges only relay type 1 configuration cycles pertaining to special cycles (all bits in the device and function number fields are ones, while in the register number field, all bits are zeroes) from the secondary interface to the primary. If the bus number corresponds to the primary bus number, the bridge converts the transaction into a special cycle.
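To make the conversion concrete, here is a C sketch of the type 1 to type 0 translation under the field layout given above (bus in AD[23:16], device in AD[15:11], function in AD[10:8], register in AD[7:2]). The mapping of device numbers to IDSEL lines is implementation-dependent; the sketch assumes the common convention in which devices 0-15 drive AD[16]-AD[31]:

```c
#include <stdint.h>

#define CFG_DEV(ad) (((ad) >> 11) & 0x1F) /* device number field */

/* Build the AD pattern of a type 0 cycle on the secondary bus from a
   type 1 cycle seen on the primary bus. */
static uint32_t type1_to_type0(uint32_t ad)
{
    uint32_t dev = CFG_DEV(ad);
    /* Assumed IDSEL convention: device N (0-15) -> AD[16 + N]; device
       numbers without an IDSEL line end in a Master-Abort. */
    uint32_t idsel = (dev < 16) ? (1u << (16 + dev)) : 0;
    return idsel | (ad & 0x7FC); /* keep function and register; AD[1:0] = 00 */
}
```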
If none of the devices reacts to a configuration cycle, bridges can process this situation in two ways: by recording the absence of a device (a Master Abort will be triggered) or by executing dummy operations. In either case, reading a configuration register of a nonexistent device (function) must return a value of FFFFFFFFh (this is safe information, because FFFFh is an illegal vendor identifier).
Flat Address Routing
In order to route memory and I/O access transactions, bridges need address maps, on which the memory and I/O ranges belonging to the secondary and subordinate bus devices are indicated. This is enough for the flat unique addressing system. For the indicated ranges, the bridge must respond to transactions it sees on the primary interface as a target device and initiate these transactions as a master on the secondary interface; the bridge ignores all other primary interface transactions. For addresses outside of the indicated ranges, the bridge must act in reverse: respond as a target device to transactions it sees on the secondary interface and initiate these transactions on the primary interface; the bridge ignores all other secondary interface transactions. How the bridge relays transactions is described in Section 6.1.6.
All PCI-to-PCI bridges have one descriptor for each of the three resource types: I/O, I/O mapped onto the memory, and the real memory (prefetchable). The descriptor shows the base address and the range size. Resources of the same type for all devices that are located over the bridge (on the secondary and all subordinate buses) must be gathered into one—if possible, compact—range.
The I/O address area is set by the 8-bit I/O Base and I/O Limit registers with granularity of 4 KB. By their high bits, these registers define only four high bits of the 16-bit address of the beginning and the end of the relayed range. The lower 12 bits for I/O Base are assumed to be 000h; for I/O Limit, they are assumed to be FFFh. If there are no I/O ports on the secondary side of the bridge, then a number smaller than that contained in I/O Base is written into I/O Limit. If the bridge does not support I/O address mapping, then both registers always return zeroes when read; this type of bridge does not relay I/O transactions from the primary to the secondary side. If the bridge supports only 16-bit I/O addressing, then zeroes are returned in the lower four bits of both registers when a read is performed. It is assumed that the high address bits AD[31:16]=0, but they also must be decoded. If the bridge supports 32-bit I/O addressing, then 0001 is returned in the lower four bits of both registers when a read is performed. The I/O Base Upper 16 Bits and I/O Limit Upper 16 Bits registers contain the upper 16 bits of the lower and upper boundaries.
The bridge relays I/O transactions for the indicated range from the primary interface to the secondary only if the I/O Space Enable bit in the command register is set. I/O transactions from the secondary interface to the primary are relayed only when the Bus Master Enable bit is set.
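A minimal C sketch of how a 16-bit I/O address can be checked against the 4 KB-granular window formed by these registers; the names are illustrative, and the 16-bit addressing case is assumed:

```c
#include <stdbool.h>
#include <stdint.h>

/* The high nibble of I/O Base/I/O Limit supplies address bits [15:12];
   the low 12 bits are implied: 000h for the base, FFFh for the limit. */
static bool io_in_window(uint8_t io_base, uint8_t io_limit, uint16_t addr)
{
    uint16_t lo = (uint16_t)((io_base  & 0xF0u) << 8);            /* x000h */
    uint16_t hi = (uint16_t)(((io_limit & 0xF0u) << 8) | 0xFFFu); /* xFFFh */
    return lo <= hi && addr >= lo && addr <= hi; /* lo > hi: window disabled */
}
```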
Memory mapped I/O can use the first 4 GB of addresses (the limit of the 32-bit addressing) with granularity of 1 MB. The relayed range is set by the Memory Base (starting address) and Memory Limit (end address) registers. Only the upper 12 address bits AD[31:20] are used in these registers; the lower AD[19:0] bits are assumed to be 0 for Memory Base and FFFFFh for Memory Limit. In addition, the VGA memory range can be relayed (see Section 6.1.6).
Real PCI device memory, which allows prefetching, can lie within both the 32-bit (4 GB) and the 64-bit addressing ranges with granularity of 1 MB. The relayed range is set by the Prefetchable Memory Base (starting address) and Prefetchable Memory Limit (end address) registers. A value of 0001 (not 0000) returned by a read operation in the lower bits ([3:0]) of these registers is a 64-bit addressing indicator. In this case, the upper part of the addresses is located in the Prefetchable Base Upper 32 Bits and Prefetchable Limit Upper 32 Bits registers. The bridge may not necessarily support prefetchable memory; in this case, the above-described registers return zero during read operations.
Bridges relay memory transactions of the indicated ranges from the primary interface to the secondary only if the Memory Space Enable bit in the command register is set. Memory transactions from the secondary interface to the primary are relayed only if the Bus Master Enable bit is set.
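The memory-mapped I/O window check is analogous, with 1 MB granularity; a sketch under the same naming assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory Base/Memory Limit carry address bits [31:20] in their upper
   12 bits; the lower 20 address bits are implied (00000h and FFFFFh). */
static bool mem_in_window(uint16_t mem_base, uint16_t mem_limit, uint32_t addr)
{
    uint32_t lo = ((uint32_t)mem_base  & 0xFFF0u) << 16;
    uint32_t hi = (((uint32_t)mem_limit & 0xFFF0u) << 16) | 0xFFFFFu;
    return lo <= hi && addr >= lo && addr <= hi;
}
```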
Concepts of positive and subtractive address decoding are connected with bridges. Ordinary PCI agents (devices and bridges) respond only to calls addressed within the ranges described in their configuration space (via the base addresses and the memory and I/O ranges). This decoding method is called positive. A bridge employing positive decoding lets through only calls belonging to a predetermined address list specified in its configuration registers. A subtractive decoding bridge lets through only calls not pertaining to other devices. Its transparency ranges are formed as if by subtraction (hence the name) of the ranges described in the configuration registers of the other devices from the total space. Physically, devices (bridges) implement subtractive decoding in an easier way: the device monitors all bus transactions of the type that interests it (usually I/O or memory accesses); if it sees no response (a DEVSEL# signal in clocks 1-3 after the FRAME#) from any of the regular devices, it considers the transaction to be addressed to it and issues a DEVSEL# signal itself.
Only certain types of bridges possess subtractive decoding capability, which supplements positive decoding. Subtractive decoding has to be used for old devices (ISA, EISA), whose addresses are so scattered over the range that they cannot be gathered into an acceptably sized positive decoding range; accordingly, it is employed by bridges that connect old expansion buses (ISA, EISA). Positive and subtractive decoding pertains only to memory and I/O range accesses. Configuration accesses are routed using the bus number, which is sent in type 1 cycles (see Section 6.2.11): each bridge knows the numbers of all surrounding buses. Only the specific class code 060401h found in the header of the bridge's configuration registers indicates that the given bridge supports subtractive decoding.
ISA I/O Addressing Support
There are some peculiarities in I/O port addressing that are rooted in the ISA bus legacy. The 10-bit address decoding used in the ISA bus leads each of the addresses in the 0-3FFh range (the 10-bit addressing coverage limitation) to have 63 alias addresses, at which the same ISA device can be addressed when 16-bit addressing is used. For example, addresses x778h, xB78h, and xF78h (where x is any hexadecimal digit) are aliases for the 0378h address. ISA address aliases are used for various purposes, in particular in ISA Plug and Play. The 0-FFh address range is reserved for system (not user) ISA devices, for which aliases are not used. Consequently, in each kilobyte of the I/O address space, the last 768 bytes (offsets 100h-3FFh) can be aliases, but the first 256 bytes (offsets 000h-0FFh) cannot. There is an ISA Enable bit in the Bridge Control register that eliminates these alias address ranges from the common address range described by the I/O Base and I/O Limit bridge registers.
This elimination is effective only for the first 64 KB of the address space (covered by 16-bit addressing). Bridges do not relay transactions that pertain to these eliminated ranges from the primary interface to the secondary; conversely, transactions in these ranges seen on the secondary interface are relayed to the primary. This capability is needed to share the use of the small (64 KB) address range between PCI and ISA devices, reconciling the cut-up ISA address map with the capability to set only one I/O address range for each bridge. It makes sense to set the ISA Enable bit for the bridges that do not have ISA devices on their downstream side. These bridges will relay downstream all I/O transactions addressed to the first 256 bytes of each kilobyte of the address range described by the I/O Base and I/O Limit bridge registers. The configuration software can allocate these addresses to PCI devices that are below the given bridge (except the 0000h-00FFh addresses that belong to the motherboard devices).
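The filtering that the ISA Enable bit adds on top of the window check can be sketched as follows (again with illustrative names); only addresses whose offset within each 1 KB block is below 100h pass downstream:

```c
#include <stdbool.h>
#include <stdint.h>

static bool isa_alias_filter_passes(uint16_t addr, bool isa_enable)
{
    if (!isa_enable)
        return true;              /* no filtering: the whole window passes */
    return (addr & 0x300u) == 0;  /* offset within its 1 KB block < 100h */
}
```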
Special VGA Support
Bridges may provide special support for a VGA graphics adapter located on the secondary interface side. This support is initialized and enabled by the VGA Enable bit of the bridge configuration register. When the support is enabled, bridges relay VGA memory accesses in the 0A0000h-0BFFFFh address range; I/O register accesses are relayed in the 3B0h-3BBh and 3C0h-3DFh address ranges, including all their aliases (the AD[15:10] address lines are not decoded). This special approach is explained by the need to provide compatibility with the most common graphics adapter and the impossibility of describing all the needed ranges in the positive decoding address range tables. Additionally, to support VGA, the palette registers, which are located at the 3C6h, 3C8h, and 3C9h addresses and their aliases (the AD[15:10] address lines are not decoded here either), need to be accessed in a special way.
Monitoring writes to the VGA palette registers (VGA Palette Snooping) is an exception to the rule of unique routing of memory and I/O accesses. In a computer with a PCI bus, the video card is usually installed into a slot of this bus or into the AGP slot, which is logically equivalent to a PCI bus slot. A VGA card has palette registers that are traditionally mapped onto the I/O space. Sometimes, the computer's graphics system has an additional card that mixes the graphics adapter's signals with live-video signals by intercepting the binary information about the current pixel's color on the VESA Feature Connector bus before it reaches the palette registers. In this case, the color palette is determined by the palette registers of this additional card. A situation arises in which a write operation to the palette registers must be executed simultaneously on the video adapter (in the PCI bus or AGP slot) and on the additional video expansion card, which may even be located on another bus (including an ISA bus). The CMOS Setup may have the PCI VGA Palette Snoop parameter. With this parameter enabled, an I/O write to a palette register address will initiate a transaction not only on the bus where the video adapter is installed, but also on other buses. A read transaction at this address will be performed only with the video adapter. If the VGA Enable bit is set, read transactions will also be relayed over the bridge, because the palette register addresses lie within the VGA port common address range. The implementation of the monitoring may be delegated to the PCI video adapter. To do this, the card latches the data while writing to the palette register, but does not generate the DEVSEL# and TRDY# negotiation signals. As a result, the bridge passes this unclaimed transaction on to the ISA bus.
Transaction Relaying and Buffering
Transaction relaying is quite a difficult task for a bridge, and overall system efficiency depends on how it is solved. Exactly which transactions need to be relayed from one interface to another is decided by the above-described part of the bridge that handles routing. When relaying a transaction, the bridge, as a target PCI device, immediately replies to its initiator, regardless of what is taking place on its other side. This allows the bridge, like any other PCI device, to observe the limitations on the response time and transaction execution time. Further, the bridge requests control of the opposite-side bus and, having received this control, executes the transaction as if it were its initiator. If a read transaction is being relayed, the bridge must receive its results in order to forward them to the real transaction initiator. This general scenario is implemented differently for different commands; however, with all the abundance of choices, there are only two ways in which a PCI bridge can reply to the initiator:
-
A delayed transaction: The bridge delays the transaction by answering with a Retry condition. This makes the initiator repeat the transaction some time later. During this time, the bridge must perform the requested transaction on its other interface.
-
A posted write: The bridge pretends that the transaction has been successfully completed. This option is only possible for memory write operations. The real write is executed later, when the bridge obtains bus control on the other side of the interface.
Instead of posting transactions that are relayed from a bus operating in the PCI-X mode, PCI-X bridges must split them.
In order to speed up the execution of transactions arriving from the primary bus, it is practical for the bridge to park the secondary bus on itself; in this way, if the secondary bus is free, the bridge will not waste time obtaining bus control when relaying transactions.
Delayed Transactions
PCI bridges execute delayed transactions for all accesses to the I/O and configuration registers, as well as for all types of memory reads. Delayed transactions are executed in three stages:
-
Initiator requests a transaction (data exchange with the target device has not started yet).
-
Transaction completed by the target device.
-
Transaction completed by the initiator.
In order to execute a delayed transaction, the bridge must place a Delayed Request into the queue and issue a Retry condition by a STOP# signal. This completes the first phase of the transaction. The request contains latched values of the address, command, enabled bytes, and parity lines (and the REQ64# line for 64-bit buses); for delayed write transactions, data also need to be saved. This information is sufficient for the bridge to initiate the transaction on the opposite interface: the second phase of the delayed transaction. Its result is the queued delayed request converted into a delayed completion: the delayed request information together with the completion status (and the requested read data).
Having received the Retry condition, the initial transaction initiator has to reissue the request some time later; moreover, the reissued request must be identical to the original, otherwise the bus will consider it a new request. If by this time the bridge has completed processing the given transaction, this reissued request will be completed in the regular way (or aborted if that is what the target device did). If the transaction has not been completed yet, the bridge will issue a Retry again; the initiator will have to keep reissuing its request until it is normally completed by the bridge. This is the third, final phase of the delayed transaction.
An initiator that receives a Retry condition must reissue exactly the same transaction request; otherwise, the bridge will accumulate unclaimed answers. Of course, the bridge also must track the unclaimed transactions and some time later (2^10 or 2^15 bus clocks, depending on the value in the Bridge Control register) remove them from its queue, so as not to overfill it because of the initiator's forgetfulness.
Delaying transactions in bridges significantly increases the execution time of each of them (from the initiator's point of view); however, it allows multiple transactions queued by bridges to be processed. The result is an increase in the overall volume of transactions executed on all PCI buses per unit of time (i.e., the throughput of the bus system increases as a whole). In principle, bridges, being the custodians of the transaction queue, can execute two transactions simultaneously, each on its own interface. If transactions were not delayed but executed directly, the initiator would have to hold its bus until the destination bus became available (as well as all the intermediate buses, if the transaction transits more than one bridge). The resulting number of useless wait cycles on all buses would be unacceptably large.
When relaying memory-read transactions (using delayed requests), in some cases bridges can employ prefetching in order to speed up memory operations. In doing prefetching, the bridge runs a risk of reading more data from the source than the initiator will take from it in the given transaction. The extra data in the buffer are best cancelled in the transaction completion phase, because their real source in the memory may well have been changed by the time they are requested again. More sophisticated bridges can track these changes and cancel only those data in the buffer that have been modified in the source. Regular Memory Read commands allow bridges to read only the exact amount of the requested data. In this case, there are fewer opportunities to speed up transfers, but neither are there side effects from reading extra data. Reading extra data is absolutely prohibited for memory-mapped I/O registers. For example, reading control registers may change their state; an extra read (with unused results) of data registers may cause data loss. Bridges can do prefetching without any concerns when processing requests with Memory Read Line or Memory Read Multiple commands relayed in any direction. Masters that use these commands are responsible for ensuring that prefetching is allowed for the addressed ranges. If the bridge has registers describing prefetchable memory, then during the relaying, simple read command transactions from the primary interface addressed to the prefetchable memory on the secondary interface can be converted into Memory Read Line or Memory Read Multiple commands. The bridge may also assume that all memory transactions from the secondary interface have the main memory as the target device and, therefore, allow prefetching. However, the bridge must have a special bit that disables command conversion and prefetching based upon this assumption (blind prefetching can cause problems of software and hardware interaction).
Posted Writes
For memory-write transactions initiated on one side of the bridge and directed to the memory on the other side, the bridge must perform posted writes. Here, the data are received into the bridge's buffers, and the transaction is terminated for the initiator before the data reach their actual destination. The bridge delivers them when it is convenient for the recipient; moreover, this delivery can take more than one transaction, this time initiated by the bridge. Of course, if the bridge has no room in its posted write buffers (their size is limited), it will have to answer some memory-write transactions with a Retry condition. However, this is not a delayed transaction: bridges do not queue memory write requests. Bridges have separate buffers for posted writes. In general, posted writes are used only with memory operations. Only the host bridge has the right to post writes to the I/O ports, and even then only for processor-initiated transactions. Posted writes cannot be used with the configuration space.
In order to optimize bus bandwidth and the efficiency of the entire system, bridges may convert memory-write transactions that they relay. For example, one long regular memory-write burst transaction (MW, Memory Write) of a block that is not aligned on cacheline boundaries can be broken down into three transactions: an MW from the start of the block to the nearest line boundary, an MWI (Memory Write and Invalidate) with one or more full cachelines, and another MW from the last cacheline boundary to the end of the block. Moreover, several consecutive write transactions can be combined into one burst transaction, in which extra writes can be blocked using the byte-enable signals. For example, a sequence of single double-word writes to the addresses 0h, 4h, and Ch can be combined into one burst with the starting address 0; during the third data phase, when the unneeded address 8h is passed, all the C/BE[3:0]# signals are inactive. In some write transactions, individual byte writes can be merged into one transaction; this is allowed for prefetchable memory.
For example, a sequence of byte writes to addresses 3, 1, 0, and 2 can be merged into one double-word write, as these bytes belong to the same addressed double word. Combining and merging can work independently (merged transactions can be combined), but these conversions do not change the order of the physical writes to the devices. These capabilities are not mandatory: Whether or not a bridge has them depends on how “skillful” it is. The purpose of these conversions is to reduce the number of individual transactions (each of which has at least one “extra” address phase) and, as far as possible, the number of data phases. However, the bridge has no right to collapse writes: If it receives two or more posted writes with the same starting address, it must process them all.
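A toy model of byte merging under these rules might look as follows in C; the structure is purely illustrative and tracks the byte-enable mask of one pending double-word data phase. Note that it refuses to merge a second write to the same byte lane, since collapsing writes is prohibited:

```c
#include <stdbool.h>
#include <stdint.h>

struct merge_buf {
    uint32_t addr;  /* double-word-aligned address */
    uint32_t data;  /* collected bytes */
    uint8_t  be;    /* byte-valid mask for lanes 0-3 */
};

/* Try to merge one byte write; returns false if the buffer must be
   flushed first (different double word, or the lane is already used). */
static bool merge_byte_write(struct merge_buf *b, uint32_t addr, uint8_t val)
{
    uint32_t lane = addr & 3u;
    if ((addr & ~3u) != b->addr || (b->be & (1u << lane)))
        return false;
    b->data |= (uint32_t)val << (8 * lane);
    b->be   |= (uint8_t)(1u << lane);
    return true;
}
```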
PCI devices must perform combined writes without any problems: If a device cannot do this, it has been improperly designed. If a device does not allow byte-merging, it must have the Prefetchable bit in its memory descriptor zeroed out.
PCI-X Bridge Distinctions
The PCI-X bus protocol enables more efficient bus operation. Knowing the exact transaction length allows the bridge to plan its transfers more efficiently. Special requirements are applied to the bridge buffers: buffers for each queue type must hold no fewer than two cache lines. Compared with PCI bridges, PCI-X bridges have the following distinctions:
-
PCI-X bridge interfaces can work in the PCI mode as well as in PCI-X Mode 1 or Mode 2. Bridges must determine the capabilities of the weakest device on their secondary interfaces and switch this bus (all the devices on it) into the appropriate mode (in terms of the protocol and clock frequency).
-
When PCI and PCI-X buses are interconnected, the bridge has to convert some commands, as well as the protocol. When relaying transactions from a PCI bus to a PCI-X bus, the bridge has to generate transaction attributes. For these, the bus number is obtained from the bridge registers; the device and function number are set to zero. The bridge can “make up” the value of the memory access command byte counter based on the particular command (for cache line reads and writes, the length can be figured out from the line length) or address (to determine the prefetching availability).
-
All single (DWORD) transactions, as well as all burst reads from a PCI-X bus addressed to the other side of the bridge, are completed by the bridge as split transactions (and not as delayed, as is done in the PCI). This makes for more efficient bus use, as the transaction initiator (requester) does not have to periodically reissue the request: The answer will come to the requester as it becomes available. All burst memory writes are processed as posted writes. Of course, if the bridge’s request buffers are filled up, it will have to delay the transaction (by a Retry condition).
Transaction Execution Order and Synchronization
The posted write and delayed transaction mechanisms are aimed at providing, as far as possible, simultaneous execution of multiple exchange transactions in the PCI bus system. Each bridge has posted write and delayed transaction buffers and queues for commands that are relayed in both directions. The bridge can perform simultaneous data exchange on both of its interfaces, functioning both as the initiator and the target device. This raises the question of transaction execution order; more precisely, it concerns the order of completions (the phases in which the end target device is interacted with). Bridges observe the following main rules:
-
Posted writes transiting the bridge in one direction are completed in the target device in the same order as on the initiator’s bus.
-
Write transactions transiting the bridge in opposite directions are not coordinated with each other in terms of order.
-
A read transaction pushes out of the bridge all writes sent from the same side prior to its arrival. Before this transaction completes on its initiator's side (before the third delayed transaction phase), it also pushes out of the bridge all writes sent from the other side prior to completion of the given read by the end target device. In this way, the order of write and read transactions is preserved.
-
As a target device, the bridge must never make the acceptance of a memory-write transaction for posting contingent on the prior completion of a non-locked transaction it is performing as a master device on the same bus.
Bridges by themselves do not undertake any measures to synchronize transactions and interrupt requests. While transactions are buffered (e.g., they may get stuck in the bridge queues), interrupt request signals (INTx) are transferred by the bridge absolutely transparently (the bridge simply connects these lines on the primary and secondary interfaces electrically). For software to work correctly, all the data sent prior to issuance of the interrupt signal must reach their destinations. For this, the buffers of all the bridges located between the device that issued the interrupt request and its end partners in the transaction must be unloaded. Software can do this easily by reading any register of the device: a read passing over the bridges unloads their buffers. Another method also is possible: prior to issuing the interrupt signal, the device reads back the last data written by it. These matters are simpler with MSI interrupts: an MSI message cannot overtake the data issued earlier by the device.
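The "flushing read" mentioned here is a common driver idiom; a sketch of an interrupt handler that performs it follows. The MMIO helper and the register offset are placeholders, not taken from any particular device:

```c
#include <stdint.h>

static inline uint32_t mmio_read32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

static void irq_handler(volatile void *dev_regs)
{
    /* Reading any device register pulls all posted writes queued in
       the bridges along the path ahead of the read's completion. */
    (void)mmio_read32((const volatile uint8_t *)dev_regs + 0x00);
    /* ... DMA-written data in memory can now be trusted ... */
}
```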
One of the specific applications for which the PCI bus and its bridge connections can be used is true simultaneous execution of multiple data-exchange transactions over non-intersecting pathways; this is known as Concurrent PCI Transferring or PCI Concurrency. For example, while the processor conducts data exchange with the memory, a master PCI device can exchange data with another PCI device. However, this example of simultaneity is more from the realm of theory, as a PCI master device exchanges data with the system memory as a rule. A more interesting example is an exchange conducted between a video adapter connected to the AGP (a PCI relative; see Section 6.3) and the memory simultaneously with an exchange conducted between the processor and a PCI device. Another example would be a processor loading data into the graphics adapter at the same time as a PCI master device exchanges data with the system memory. Simultaneity requires quite complex arbitration logic for requests from all system agents, as well as various data-buffering tricks. Not all chipsets are capable of simultaneity (this feature is always emphasized in the product description), and it can be disabled in the CMOS Setup settings.
6.1.7 Device Configuration
The automatic system resource configuration capability (memory and I/O spaces and interrupt request lines) has been built into the PCI bus from the beginning. Automatic device configuration (selection of addresses and interrupt request lines) is supported by the BIOS and the operating system and is oriented toward the plug-and-play technology. For every function, the PCI standard defines a configuration space of up to 256 8-bit registers that pertain neither to the memory nor the I/O addressing space. Access to these registers is implemented via special bus commands, Configuration Read and Configuration Write, which are generated by one of the mechanisms described in this chapter. This addressing space has areas mandatory for all devices, as well as special-purpose areas. A specific device may not necessarily have registers at all addresses, but it has to support the normal termination of operations addressed to them. Reading nonexistent registers must return zeroes, while writes are executed as dummy operations.
A device configuration space begins with a standard header containing the vendor ID, the device ID, the device class ID, and a description of the required and allocated system resources. The header structure is standardized for regular devices (type 0), PCI/PCI bridges (type 1), and PCI/CardBus bridges (type 2). The header type determines the location of the common registers and the functions of their bits. Device-specific registers may follow the header. For standard device capabilities (such as power management), there are predefined-purpose register blocks. These blocks are linked in chains, with the first of these blocks referred to by a pointer in the standard header (CAP_PTR). The block's first register contains a pointer to the next block (or 0, if the given block is the last one). In this way, having examined the chain, the configuration software obtains a list of all available device capabilities and their locations in the function's configuration space. In PCI 2.3, the following CAP_ID identifiers are defined (partly considered in this book):
-
01—power management
-
02—AGP port
-
03—VPD (Vital Product Data), the data that give a comprehensive description of the hardware (possibly, the software as well) properties of the devices
-
04—numbering of slots and chassis
-
05—MSI interrupts
-
06—Hot Swap, the connection for Compact PCI
-
07—PCI-X protocol extensions
-
08—reserved for AMD
-
09—at the manufacturer’s discretion (Vendor Specific)
-
0Ah—Debug Port
-
0Bh—PCI Hot Plug, a standard support for “hot plugging”
The configuration space has been expanded to 1,024 bytes for PCI-X Mode 2 devices; this expanded space may contain expanded capabilities descriptions.
After a hardware reset or a power up, PCI devices do not respond to memory and I/O accesses; they are accessible only for configuration read and write operations. In these operations, devices are selected by the individual IDSEL signals, and provide information about their resource requirements and possible configuration options. After resources have been allocated by a configuration program (during POST or operating system booting), the configuration parameters (base addresses) are written into the configuration registers of a device. Only after this are the bits set in the devices (or, more precisely, in their functions) that enable them to respond to memory and I/O access commands, and also to control the bus themselves. In order always to be able to find a viable configuration, all the resources occupied by cards must be relocatable within their spaces. In the case of multifunctional cards, each function must have its own configuration space. A device can have the same registers mapped to either the memory or the I/O space. In this case, its configuration space must contain both descriptors, but the driver must use only one access method (preferably, via the memory).
The configuration space header describes the device’s needs of three types of addresses:
-
I/O Space registers.
-
Memory Mapped I/O registers. This is a memory area that must be accessed in strict compliance with the exchange-initiator requests. Accessing these registers can change the internal state of peripheral devices.
-
Prefetchable Memory. Reading extra data out of this memory area does not cause side effects; all bytes are read regardless of the BE [3:0]# signals, and bridges can merge individual byte writes (i.e., this is pure memory).
Address requirements are indicated in the Base Address Registers (BAR). The configuration software can determine the sizes of the necessary areas as follows: after a hardware reset, it reads and saves the values of the base addresses (these will be the default addresses); then, it writes FFFFFFFFh into each register and reads their values again; in the obtained words, the type-decoding bits (bits [3:0] for memory spaces and bits [1:0] for I/O spaces) are zeroed out, and the resulting 32-bit word is inverted and incremented. (Bits [31:16] are ignored for ports.) These operations will produce the length of the address range. It is assumed that the range size is expressed by a 2^n number, and that the range is normally aligned. Up to six base address registers fit into a standard header; however, the number of described blocks decreases when 64-bit addressing is used. Unused registers must always return zeroes when read.
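The sizing procedure just described, expressed as a C sketch; cfg_read32()/cfg_write32() stand for whatever configuration access mechanism the platform provides and are assumptions of the sketch, not standard functions:

```c
#include <stdint.h>

uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg);
void     cfg_write32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg,
                     uint32_t val);

static uint32_t bar_range_size(uint8_t bus, uint8_t dev, uint8_t fn,
                               uint8_t bar_reg)
{
    uint32_t saved = cfg_read32(bus, dev, fn, bar_reg);  /* default address */
    cfg_write32(bus, dev, fn, bar_reg, 0xFFFFFFFFu);
    uint32_t probe = cfg_read32(bus, dev, fn, bar_reg);
    cfg_write32(bus, dev, fn, bar_reg, saved);           /* restore */

    if (probe == 0)
        return 0;                               /* BAR not implemented */
    uint32_t mask = (probe & 1u) ? 0xFFFFFFFCu  /* I/O: drop bits [1:0];
                                      bits [31:16] may also be ignored */
                                 : 0xFFFFFFF0u; /* memory: drop bits [3:0] */
    return ~(probe & mask) + 1;                 /* invert and increment */
}
```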
The PCI supports legacy devices (VGA, IDE); they declare themselves as such by the class code in the header. Their traditional (fixed) port addresses are not declared in the configuration space, but as soon as the port access-enable bit is set, such devices can respond to accesses at these addresses.
Configuration Space of Regular Devices (Type 0)
The header format is shown in Fig. 6.8. Fields mandatory for all devices are indicated in gray. Device-specific registers can occupy configuration space addresses within the limits of 40-FFh.
The identifier fields listed below can only be read:
-
Device ID—device identifier assigned by the vendor.
-
Vendor ID—identifier of the PCI microchip manufacturer assigned by PCI SIG. Identifier FFFFh cannot be used; this value must be returned when an attempt to read the configuration space of a nonexisting device is made.
-
Revision ID—product revision assigned by the vendor. Used as an expansion of the Device ID field.
-
Header Type—bits [6:0] define the cell layout in the 10h-3Fh range. Bit 7 indicates a multifunctional device if set to 1. Fig. 6.8 shows the type 0 header format, which applies to regular PCI devices. The type 1 format applies to PCI-PCI bridges; the type 2 format applies to CardBus bridges.
-
Class Code—defines the device's main function and sometimes its programming interface (see Section 6.2.13). The upper byte (offset 0Bh) defines the base class, the middle byte defines the subclass, and the lower byte defines the programming interface (if it is standardized).
The rest of the header fields are device registers that allow both read and write operations:
-
Command (RW)—controls a PCI device’s behavior. This register allows reading and writing. After a hardware reset, all the register bits, except specially stipulated exemptions, are zeroed out. The register bits’ functions are as follows:
-
Bit 0—IO Space. Enables a device’s response to I/O space accesses.
-
Bit 1—Memory Space. Enables a device’s response to memory space accesses.
-
Bit 2—Bus Master. Enables the device to work as an initiator (bus master); ignored in PCI-X when completing split transactions.
-
Bit 3—Special Cycles. Enables a device’s responses to Special Cycle operations.
-
Bit 4—Memory Write and Invalidate enable. Enables a device to use Memory Write and Invalidate commands when working as an initiator. If this bit is zeroed out, the device must use regular memory writes instead of write and invalidate. Ignored in PCI-X.
-
Bit 5—VGA palette snoop. Enables tracking writes to the palette registers.
-
Bit 6—Parity Error Response. When set, this bit enables normal reaction to a parity or ECC error, generating the PERR# signal. If the bit is zeroed out, then the device only has to record the error in the status register and can continue normal operation. When ECC is used, error information is written into the ECC registers.
-
Bit 7—Stepping Control. This bit controls the device’s address/data stepping. If the device never does it, the bit must be hardwired to 0. If it always does it, then the bit must be hardwired to 1. Devices that have this capability set this bit to 1 after reset. In Version 2.3 and PCI-X, the bit is deallocated due to the abolition of stepping.
-
Bit 8—SERR# Enable. This bit enables generation of the error signal SERR#. An address parity error is reported if this bit and bit 6 equal 1.
-
Bit 9—Fast Back-to-Back Enable (optional, ignored in PCI-X). When set, it permits the master device to perform fast back-to-back transactions to different devices. When it is zeroed out, these transactions are allowed to only one device.
-
Bit 10—Interrupt Disable. Disables interrupt signal generation on the INTx lines (the bit is zeroed out, and interrupts are enabled, after a hardware reset and power up). The bit is defined starting with PCI 2.3; it was reserved before.
-
Bits [15:11]—reserved.
-
The Status register can be read from and written to. However, the writes can only zero out bits, not set them. Bits marked RO are read only. To zero out a register bit, its corresponding bit in the write data must be set to 1. The functions of the status register bits are as follows:
-
Bits [2:0]—reserved.
-
Bit 3—Interrupt Status. Set to one prior to issuing a signal over an INTx line, regardless of the value of the Interrupt Disable bit. Not associated with MSI interrupts. The bit is defined starting from PCI 2.3; it was reserved before. It is mandatory in PCI-X 2.0.
-
Bit 4—Capability List (RO, optional). Shows whether the capabilities indicator is present (offset 34h in the header).
-
Bit 5—66 MHz Capable (RO, optional). Indication of device’s 66 MHz operation capabilities.
-
Bit 6—reserved.
-
Bit 7—Fast Back-to-Back Capable (RO, optional). Indicates whether the device is capable of supporting fast back-to-back transactions to different devices.
-
Bit 8—Master Data Parity Error (bus masters only). Indicates that the transaction initiator (requester) has detected a non-recoverable error.
-
Bits [10:9]—DEVSEL Timing. Indicates the device selection timing: 00—fast, 01—medium, 10—slow. It defines the slowest reaction of the DEVSEL# signal to all commands except the Configuration Read and Configuration Write commands.
-
Bit 11—Signaled Target Abort. This bit is set by a target device when it terminates a transaction with a Target-Abort condition.
-
Bit 12—Received Target Abort. This bit is set by an initiator when its transaction is terminated by a Target-Abort.
-
Bit 13—Received Master Abort. This bit is set by the master device when its transaction is terminated by a Master-Abort (except for Special Cycle transactions).
-
Bit 14—Signaled System Error. Set by the device that activates the SERR# signal.
-
Bit 15—Detected Parity Error. Set by the device that detects a parity error.
-
Cache Line Size (RW)—cacheline length. Sets a cacheline size of between 0 and 128 bytes. Allowable values are 2^n; others are treated as 0. The initiator uses this parameter to determine which read command to use: regular, line, or multiple. The target device uses this parameter to cross line boundaries in burst memory accesses. After reset, this register is zeroed out.
-
Latency Timer (RW). Indicates the value of the latency timer (see Section 6.2.4) in terms of the bus clocks. Some of the bits may not allow modifications. (The lower three bits do not usually change, so the timer is programmed in 8-clock increments.)
-
BIST (RW)—built-in self-test register. The functions of its bits are as follows:
-
Bit 7—indicates whether the device is BIST-capable. 1 if yes, 0 if no.
-
Bit 6—test start. Writing a logical one into this bit initiates BIST. After the test is completed, the device sets the bit to 0. The test must take no longer than 2 sec to conclude.
-
Bits [5:4]—reserved. Set to zero.
-
Bits [3:0]—test results code. If set to zero, means that the test has been successful.
-
CardBus CIS Pointer (optional). This register points to the CardBus descriptor structure for combination PCI+CardBus devices.
-
Interrupt Line (RW). Holds the input number of the interrupt request controller for the utilized request line. Values in the 0-15 range indicate IRQ0-IRQ15 (in systems with an APIC, these values may be greater). A value of 255 means the input number is not known or is not used.
-
Interrupt Pin (RO). This register indicates the interrupt pin used by the device or device function: a value of 0 means no pin is used; 1 means INTA#; 2 means INTB#; 3 means INTC#; 4 means INTD#. Values 5-FFh are reserved.
-
Min_GNT (RO). This register indicates the minimal time that a master device must be given for bus control in 0.25-μsec intervals at a bus clock frequency of 33 MHz.
-
Max_LAT (RO). Indicates the maximum latency in providing the master device access to the bus in 0.25-μsec increments. A value of 0 means that the device has no special requirements.
-
Subsystem ID (assigned by the vendor) and Subsystem Vendor ID (assigned to the vendor by PCI SIG). These registers make it possible to precisely identify cards and devices among several cards with matching Device ID and Vendor ID that may be installed in one system. The card vendor's identifier goes into the Subsystem Vendor ID field at offset 2Ch; it may match the value of the Vendor ID field at offset 0 if the company produces both microchips and cards.
-
Capability Pointer (CAP_PTR). A pointer to the chain of the function's capabilities that are described in the configuration registers. Each capability has a set of registers that starts at a double-word boundary (bits [1:0] of the pointer are 0). Each list item starts with a capability type byte (CAP_ID, defined by the PCI SIG), followed by the pointer to the next list item (a zero pointer indicates the end of the list), followed by the capability descriptor bytes proper. Using CAP_PTR, the power management registers (if they exist), the AGP registers, and others are located (a walk of this chain is sketched after this list).
-
Base Address Registers (BAR) of the memory and I/O ports. For memory spaces, bit 0 is set to logical zero. Bits [2:1] define the memory type. If they equal 00, the memory is 32-bit; if they equal 10, the memory is 64-bit (in this case, the register is expanded by the following 4-byte word; 64-bit addressing is mandatory for PCI-X). Values 01 and 11 are reserved. (In previous versions of the standard, 01 was used to indicate that the base register must be mapped onto the memory below 1 MB.) Bit 3 (prefetchable) is set for the real memory allowing prefetching. Bits [31:4] are the base memory address; block size cannot exceed 2 GB. For I/O space, bit 0=1; bit 1=0 (reserved); bits [31:2] are the port block base address; the size of one range cannot exceed 256 bytes.
-
Expansion ROM Base Address. For card software support. Bit 0 is used to enable accesses to the card's ROM. Bits [10:1] are reserved. Bits [31:11] hold the base address. The size of the ROM range is determined the same way as in the BAR (see above). ROM can be accessed only when memory use is enabled (i.e., bit 1 in the command register is set).
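A sketch of walking the capabilities chain referred to by the Capability Pointer field above; cfg_read8() is again a placeholder for any configuration read mechanism:

```c
#include <stdint.h>

uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg);

/* Returns the offset of the first capability with the wanted CAP_ID
   (e.g., 01h for power management), or 0 if it is absent. */
static uint8_t find_capability(uint8_t bus, uint8_t dev, uint8_t fn,
                               uint8_t wanted_id)
{
    uint8_t ptr = cfg_read8(bus, dev, fn, 0x34) & 0xFCu;  /* CAP_PTR */
    while (ptr != 0) {
        uint8_t id   = cfg_read8(bus, dev, fn, ptr);      /* CAP_ID byte */
        uint8_t next = cfg_read8(bus, dev, fn, ptr + 1);  /* next pointer */
        if (id == wanted_id)
            return ptr;
        ptr = next & 0xFCu;
    }
    return 0;
}
```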
PCI-X Device Special Registers
PCI-X devices have an additional register block (Fig. 6.9) whose location is defined using the capabilities list (Capability ID = 07). The ECC registers appeared only in PCI-X 2.0.
The PCI-X Command register controls the new capabilities of the PCI-X protocol:
-
Bit 0 (RW)—Uncorrectable Data Error Recovery Enable. If the bit is not set, a SERR# signal is formed when a parity error is detected.
-
Bit 1 (RW)—Enable Relaxed Ordering in transaction attributes.
-
Bits [3:2] (RW)—Maximum Memory Read Byte Count: 0—512 bytes, 1—1,024 bytes, 2—2,048 bytes, 3—4,096 bytes.
-
Bits [6:4] (RW)—Maximum Outstanding Split Transactions: 0 through 7—1, 2, 3, 4, 8, 12, 16, 32 transactions, respectively.
-
Bits [11:7]—reserved.
-
Bits [13:12] (RO): PCI-X capabilities version (ECC support): 00—ECC not supported; 01—ECC only in Mode 2; 10—ECC in Mode 1 and 2.
-
Bits [15:14]—reserved.
The PCI-X Status register holds the function identifier (its address in the hierarchy of the configuration space); the device constantly monitors this address on the bus when executing configuration write operations. The device needs this identifier for presenting in the attribute phase. In addition, the register has device capability indicators and also split transaction error indicators. The functions of the PCI-X Status register’s bits are as follows:
-
Bits [2:0] (RO)—Function Number.
-
Bits [7:3] (RO)—Device Number. The device learns this number from the value in AD [15:11] during the address phase of the configuration write directed to the given device; the device is selected by the IDSEL line. Set to 1Fh after a reset.
-
Bits [15:8] (RO)—Bus Number. The device learns this number by the value in AD [7:0] during the attribute phase of the configuration write directed to the given device. Set to FFh after a reset.
-
Bit 16 (RO)—64-bit Device.
-
Bit 17 (RO)—133 MHz Capable (66 MHz otherwise).
-
Bit 19 (RWC)—Unexpected Split Completion.
-
Bit 20 (RO)—Device Complexity (of a bridge).
-
Bits [22:21] (RO)—Designed Maximum Memory Read Byte Count in the sequence initiated by the device: 0—512 bytes, 1—1,024 bytes, 2—2,048 bytes, 3—4,096 bytes.
-
Bits [25:23] (RO)—Designed Maximum Outstanding Split Transactions: 0 through 7—1, 2, 3, 4, 8, 12, 16, 32 transactions, respectively.
-
Bits [28:26] (RO)—Designed Maximum Cumulative Read Size expected by the device (requests have been sent, the replies have not been received yet): 0 through 7—8, 16, 32 … 1,024 ADQ.
-
Bit 29 (RWC)—Received Split Completion Error Message.
-
Bit 30 (RO)—PCI-X 266 Capable (Mode 2).
-
Bit 31 (RO)—PCI-X 533 Capable (Mode 2).
The ECC registers are used for control and diagnostic purposes. The ECC Control and Status Register is used to control ECC: to enable ECC in Mode 1 (in Mode 2 it is mandatory) and to enable correction of single-bit errors. The same register reports the error detection indicators; the command and bus phase in which the error was detected; and also the error syndrome value and the transaction attributes. The ECC First Address, ECC Second Address, and ECC Attribute registers hold the address at which the ECC error was detected and the attributes.
PCI-X Expanded Configuration Space
The PCI-X 2.0 specification expanded the configuration space of one function to 1,024 bytes. The standard 256-byte set of registers and the header format are preserved, and the additional space is used for the device's needs, including holding the descriptions of the additional capabilities. The expanded configuration space can be accessed either by using the expanded version of mechanism 1 (see further discussion), with an additional 4 bits of the register number sent over AD[27:24], or by mapping the configuration registers onto memory addresses. With memory mapping, the hierarchical configuration register addresses of all PCI devices are reflected in bits A[27:0]; the base address (A[63:28]) depends on how the system has been implemented and is communicated to the operating system. For memory mapping, all configuration registers of all devices of all PCI buses require a 256 MB memory area. The mapping scheme is simple and logical (a sketch of the computation follows the list below):
-
A [27:20]—Bus number (8 bits)
-
A [19:15]—Device number (5 bits)
-
A [14:12]—Function number (3 bits)
-
A [11:8]—Extended Register number (4 bits)
-
A [7:0]—Register number (8 bits)
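Packed into C, the computation of the memory-mapped configuration address might look as follows; cfg_base stands for the platform-specific A[63:28] value and is an assumption of the sketch:

```c
#include <stdint.h>

static uint64_t cfg_mmio_addr(uint64_t cfg_base, uint8_t bus, uint8_t dev,
                              uint8_t fn, uint16_t reg)
{
    return cfg_base
         | ((uint64_t)bus << 20)            /* A[27:20]: bus number      */
         | ((uint64_t)(dev & 0x1Fu) << 15)  /* A[19:15]: device number   */
         | ((uint64_t)(fn  & 0x07u) << 12)  /* A[14:12]: function number */
         | (uint64_t)(reg & 0xFFFu);        /* A[11:0]: extended register
                                               number + register number */
}
```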
Devices must detect and process configuration accesses performed in any way. The device developer must keep in mind that only the first 256 bytes of the function’s configuration space will be software-accessible if the device is placed on a regular PCI bus; therefore, only those registers not used in the standard PCI mode should be placed into the expanded area.
A new capability description format, taking into account the long (10-bit) register address, has also been introduced for the expanded configuration space. The expanded capabilities list must begin at address 100h (if there are no expanded capabilities, that location must hold a structure that cannot be interpreted as the beginning of a chain). Each capability starts with a 32-bit identifier, followed by the registers that describe the given capability. The 32-bit expanded capability identifier has the following structure (decoded in the sketch after this list):
-
Bits [15:0]—Capability ID
-
Bits [19:16]—Capability Version Number
-
Bits [31:20]—Next Capability Offset (relative to register 0 of the configuration space)
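Decoding that identifier in C, per the bit layout just listed:

```c
#include <stdint.h>

struct ext_cap_hdr {
    uint16_t id;      /* bits [15:0]:  Capability ID */
    uint8_t  version; /* bits [19:16]: capability version number */
    uint16_t next;    /* bits [31:20]: offset of the next capability */
};

static struct ext_cap_hdr parse_ext_cap(uint32_t dword)
{
    struct ext_cap_hdr h;
    h.id      = (uint16_t)(dword & 0xFFFFu);
    h.version = (uint8_t)((dword >> 16) & 0xFu);
    h.next    = (uint16_t)((dword >> 20) & 0xFFFu);
    return h;
}
```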
Configuration Space of PCI Bridges
The format of the configuration space of PCI-PCI bridges is shown in Fig. 6.10. The registers in the 00-17h address range fully coincide with the registers of a regular PCI device and describe the bridge’s behavior and status on the primary bus. Bit 2 of the command register (Bus Master Enable) controls the bridge’s capability to transfer transactions from the secondary bus to the primary. If this bit is zeroed out, the bridge must not respond as a target device on the secondary side in memory and I/O read/write transactions, because it will not be able to transfer these transactions to the primary bus. The BAR registers describe only the specific registers area (depending on the bridge’s implementation); they are not involved in the routing.
The bridge’s routing capabilities are defined by the following registers (see Section 6.1.6 for details):
-
Primary Bus Number register.
-
Secondary Bus Number register (also the bridge number).
-
Subordinate Bus Number register.
-
I/O Base and I/O Limit. These registers set the starting and the ending I/O range addresses of devices located behind the bridge. They provide only the upper four bits of the 16-bit I/O address; consequently, the address allocation granularity is 4 KB.
-
I/O Limit Upper 16 Bits and I/O Base Upper 16 Bits. These registers hold the upper part of the I/O address when 32-bit I/O addressing is used (indicated by the value 0001 in the lower four bits of the I/O Base and I/O Limit registers).
-
Memory Base and Memory Limit. These registers set the starting and ending addresses of the memory range onto which the I/O registers of the devices located behind the bridge are mapped. These registers set only the upper 12 bits of the 32-bit address; consequently, the address allocation granularity is 1 MB.
-
Prefetchable Memory Base and Prefetchable Memory Limit. These registers set the starting and the ending memory range addresses for devices located behind the bridge. They set only the upper 12 bits of the 32-bit memory address; consequently, the address allocation granularity is 1 MB.
-
Prefetchable Base Upper 32 Bits and Prefetchable Limit Upper 32 Bits. These are the registers of the upper address part of "pure" memory when 64-bit addressing is used (indicated by the value 0001 in the lower four bits of the Prefetchable Memory Base and Prefetchable Memory Limit registers).
The Secondary Status register is analogous to the regular Status register, but reflects the status of the secondary bus. The only difference is bit 14, which in the Secondary Status register indicates detecting the SERR# signal on the secondary interface, and not its issuance by the given device.
The Expansion ROM Base Address register, as with regular devices, sets the location of the BIOS expansion ROM (if the bridge has this ROM).
The Interrupt Line and Interrupt Pin registers pertain to the interrupts generated by the bridge (if they exist). These registers have no relation to the interrupts transferred by the bridge.
The Bridge Control register controls the bridge operation and indicates unclaimed completions of the delayed transactions. The functions of its bits are as follows:
-
Bit 0: Parity Error Response Enable. Enables the bridge to signal an address or data parity error detected on the secondary interface.
-
Bit 1: SERR# Enable. Enables transferring the SERR# signal from the secondary interface to the primary (the same-name bit must also be set in the command register).
-
Bit 2: ISA Enable. Enables ISA bus I/O addressing support (excluding the last 768 bytes of each kilobyte of the address range set by the I/O Base and I/O Limit registers).
-
Bit 3: VGA Enable. Enables the special VGA support (relaying of VGA memory and register accesses over the bridge; see the Special VGA Support section above).
-
Bit 4: reserved.
-
Bit 5: Master-Abort Mode. Defines bridge behavior in case it does not receive a reply from the target device when transferring a transaction: 0—ignore this situation, returning FF…FFh for reads and discarding write data; 1—inform the transaction initiator by a Target-Abort condition or, if this is not possible (in case of a posted write), issue a SERR# signal.
-
Bit 6: Secondary Bus Reset. Places a RST# signal on the secondary interface (when the bit is cleared, the RST# on the secondary interface is generated upon a RST# on the primary interface).
-
Bit 7: Fast Back-to-Back Enable on the secondary interface.
-
Bit 8: Primary Discard Timer. A discard timer for the results of the delayed transactions initiated by a master from the primary interface: 0—wait 2^15 bus clocks, 1—wait 2^10 bus clocks. The countdown starts when the result of a delayed transaction comes up to the top of the queue. If the master does not pick up the result (by repeating the transaction) within the set time, the result is discarded.
-
Bit 9: Secondary Discard Timer. Analogous to bit 8, only for transactions initiated by the master from the secondary interface.
-
Bit 10: Discard Timer Status. Indicator of a delayed transaction discard on either interface.
-
Bit 11: Discard Timer SERR# Enable. Enables SERR# signal generation on the primary interface upon discard timer actuation.
-
Bits [15:12]: reserved.
The Secondary Latency Timer register is analogous to the Latency Timer, but applies to the bridge acting as a master on the secondary bus: it limits the time during which the bridge may continue a burst after being deprived of bus control.
For a large system with expansion chassis, bridges can have slot and chassis numbering capability; for this, the bridge must have a capability with Capability ID = 04 (see Fig. 6.11).
The Expansion Slot register describes the expansion slots on the secondary bus of the bridge:
-
Bits [4:0]: Expansion Slots Provided—number of slots on the secondary bus of the bridge.
-
Bit 5: First in Chassis. Indicator of the first bridge in the expansion chassis. Also indicates that a chassis is present and, consequently, that the chassis register number is used. If the chassis has more than one bridge, the first one will be either the bridge with the lowest primary bus number (to which other bridges will be subordinate) or the lowest device number (other bridges will be of the same rank, but their secondary buses will have higher numbers).
-
Bits [7:6]: reserved.
The Chassis Number register sets the number of the chassis in which the given bridge is located (0—the chassis on which the host processor that carries out the configuration is located).
Software Access to the Configuration Space and Special Cycle Generation
Because the PCI configuration space is isolated, the host bridge must be equipped with a special mechanism to access it via commands from the processor, whose instructions can only access memory and I/O. The same mechanism is used to generate special cycles. Two mechanisms have been stipulated for PC-compatible computers, of which the PCI 2.2 specification has retained only the first: Configuration Mechanism #1, which is more transparent. The number of the configuration mechanism used by a particular motherboard can be found out by calling PCI BIOS. These mechanisms cannot be used to access the expanded configuration space of PCI-X (it can be accessed only via direct memory mapping; see above).
Configuration cycles are addressed to a specific device (a PCI microchip) located on a bus with a given number. Bridges decode the bus and device numbers to determine the device for which the IDSEL select signal has to be asserted. The function number and the register address are decoded by the device itself.
Two 32-bit I/O ports are reserved on the host bridge for Configuration Mechanism #1. Their names and addresses are CONFIG_ADDRESS (0CF8h) and CONFIG_DATA (0CFCh).
Both can be written to and read from. To address the configuration space, a 32-bit address (decoded as shown in Fig. 6.12) is first written to the CONFIG_ADDRESS register. After this, the contents of the required configuration register can be read from or written to the CONFIG_DATA register. Bit 31 in the CONFIG_ADDRESS enables configuration access and the generation of special cycles. Depending on the bus number indicated in this register, the host bridge generates one of two types of configuration cycle (a code sketch of the mechanism follows this list).
-
To access a device located on bus number zero (i.e., connected to the host bridge) a type 0 cycle is used (see Fig. 6.2, a, b, and c). In the address phase of this cycle, the bridge places the device-selection positional code on the AD [31:11] lines, the function number on the AD [10:8] lines, the register address on the AD [7:2] lines; bits [1:0]=0 indicate a type 0 cycle. In the PCI-X address phase, the device number is placed on the AD [15:11] lines; the expanded configuration space cannot be accessed via this mechanism.
-
To access a device that is not located on bus 0, a type 1 cycle is used. Here, the host bridge passes the address information from the CONFIG_ADDRESS register (bus, device, and function numbers and register address) to the main PCI bus, zeroing out the high-order bits [31:24] and setting the type 01 indicator in bits [1:0]. (See Fig. 6.2, d.)
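As an illustration of Configuration Mechanism #1, the following C sketch reads a 32-bit configuration register through the CONFIG_ADDRESS/CONFIG_DATA ports. It assumes an x86 environment where port I/O is permitted (on Linux, for example, after iopl()); the helper name is arbitrary:

#include <stdint.h>
#include <sys/io.h>   /* outl()/inl(); requires I/O privileges, e.g., iopl(3) */

/* Read a double word from the configuration space of function 'fn' of
   device 'dev' on bus 'bus' using Configuration Mechanism #1. */
static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
    uint32_t addr = (1u << 31)                     /* bit 31: access enable  */
                  | ((uint32_t)bus << 16)          /* bits [23:16]: bus      */
                  | ((uint32_t)(dev & 0x1F) << 11) /* bits [15:11]: device   */
                  | ((uint32_t)(fn & 0x07) << 8)   /* bits [10:8]: function  */
                  | (reg & 0xFC);                  /* bits [7:2]: register;
                                                      bits [1:0] must be 0   */
    outl(addr, 0xCF8);     /* CONFIG_ADDRESS */
    return inl(0xCFC);     /* CONFIG_DATA */
}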
The Special Cycle is generated by writing to the CONFIG_DATA register with bits [15:8] of the CONFIG_ADDRESS register set to logical one and bits [7:0] zeroed out. The number of the bus on which the cycle is generated is given in bits [23:16] of the CONFIG_ADDRESS register. Since the Special Cycle is a broadcast operation, no address information is sent in it, but its propagation can be controlled by setting the bus number. If the host generates a special cycle with a zero bus number, this cycle is executed on the main bus only, and the other bridges do not propagate it. If the bus number is not zero, the host bridge produces a Type 1 configuration write cycle, which is transformed into the special cycle only by the bridge of the destination bus. A special cycle generated by a bus master device is active only on that device's bus and does not propagate through the bridges. If such a cycle needs to be generated on another bus, the master device may do so through writes to the CONFIG_ADDRESS and CONFIG_DATA registers.
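Under the same assumptions as in the previous sketch, a Special Cycle can be produced by forming the address just described and writing the message data to CONFIG_DATA; this is only an illustration of the register usage:

#include <stdint.h>
#include <sys/io.h>

/* Generate a Special Cycle on the given bus via Mechanism #1:
   bits [15:8] of CONFIG_ADDRESS are all ones (device 1Fh, function 7),
   bits [7:0] are zero, and the message data go to CONFIG_DATA. */
static void pci_special_cycle(uint8_t bus, uint32_t message)
{
    uint32_t addr = (1u << 31) | ((uint32_t)bus << 16) | (0xFFu << 8);
    outl(addr, 0xCF8);      /* CONFIG_ADDRESS */
    outl(message, 0xCFC);   /* the write to CONFIG_DATA triggers the cycle */
}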
Two 8-bit I/O ports are reserved on the host bridge for the outdated and cumbersome Configuration Mechanism #2. Their addresses are 0CF8h and 0CFAh. Configuration Mechanism #2 maps the PCI devices' configuration spaces onto the C000h-CFFFh range of the I/O space. Because this range (only 4 K ports) is not sufficient to map the configuration spaces of all the devices of all the PCI buses, the address is generated in a rather elaborate way. Bits [7:4] in the Configuration Space Enable (CSE) register located at 0CF8h are the key enabling the mapping, while bits [3:1] carry the number of the function whose address space the accesses are directed to. When set to logical one, bit 0 (SCE—Special Cycle Enable) causes a special cycle to be generated instead of a configuration cycle. When the value of the key is zero, the C000h-CFFFh port range remains a regular part of the I/O space; when the value is not zero, configuration spaces of a selected function of sixteen possible devices are mapped onto it.
When addressing the configuration space of bus 0's devices, reading or writing a double word to a port in the C000h-CFFCh range generates a configuration cycle. In this cycle, bits [7:2] of the port address are placed on the AD[7:2] lines as the configuration register index, and bits [11:8] are decoded into a positional device-select code (IDSEL lines) on the AD[31:16] lines. The function number is placed on the AD[10:8] lines from the CSE register, while lines AD[1:0] are zeroed out. To address devices located on non-0 buses, the Forward Register (0CFAh) is used, into which the bus number is placed (this register is zeroed out upon a reset). If the bus number is not zero, then a type 1 cycle is generated; here, the function number comes from the CSE register, the lower four bits of the device number come from the address bits (AD15=0), and the bus number is taken from the Forward Register (bits AD[1:0]=0 and AD[31:24]=0 are hardware-generated).
To generate a Special Cycle using this mechanism, the CSE register is set to a non-zero key, the function number is set to 111, and SCE=1, after which a write is done to port CF00h. Depending on the contents of the Forward Register, either a special cycle or a configuration cycle (which will be converted into a special cycle on the target bus) is generated.
PCI Device Classes
An important part of the PCI specification is the classification of devices and the indication of the class code in the device's configuration space (the 3-byte Class Code). The upper byte identifies the base class, the middle byte the subclass, and the lower byte the programming interface (if it has been standardized). The class code makes it possible to detect the presence of particular devices in the system, which can be done via PCI BIOS. For standardized devices (such as 01:01:80—IDE controller, or 07:00:01—16450 serial port), a service program can find the necessary device and select the appropriate driver version. Classification codes are defined by the PCI SIG; they are regularly updated on the organization's site, http://www.pcisig.com. As a rule, fields with 0 values give the broadest device descriptions. Subclass value 80h means "other devices."
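The three class-code bytes occupy configuration-space offsets 09h-0Bh, so they arrive in the upper three bytes of the double word read at offset 08h (the low byte of which is the Revision ID). A small decoding sketch:

#include <stdint.h>

/* Split the double word read from configuration offset 08h into the
   base class, subclass, and programming interface (the example values
   correspond to 01:01:80, an IDE controller, as mentioned above). */
static void decode_class_code(uint32_t dword_at_08h, uint8_t *base_class,
                              uint8_t *subclass, uint8_t *prog_if)
{
    *base_class = (uint8_t)(dword_at_08h >> 24);  /* e.g., 01h: mass storage */
    *subclass   = (uint8_t)(dword_at_08h >> 16);  /* e.g., 01h: IDE          */
    *prog_if    = (uint8_t)(dword_at_08h >> 8);   /* e.g., 80h               */
}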
PCI BIOS
There are additional BIOS functions that facilitate interaction with PCI devices. These functions can be accessed from both the real and protected processor operating modes. PCI BIOS functions are used only to locate and configure PCI devices, both of which require access to their configuration spaces. These functions must be supported and used because configuration access cycles, like the special cycle, are executed in a platform-specific way. Additionally, PCI BIOS allows controlling PCI Interrupt Steering, hiding the specific software interface of an individual motherboard chipset.
The other types of interaction with devices, using their memory and I/O spaces as well as interrupt servicing, do not require BIOS support, because they are executed directly by processor instructions and are independent of the platform (motherboard chipset). Regular operations with these devices are conducted by accessing the device registers at the addresses received during configuration and by servicing the predefined interrupt requests from these devices. The PCI BIOS presence-check function reports the configuration mechanisms available; knowing how they operate, calls to PCI BIOS may henceforth be dispensed with.
Using PCI BIOS functions, software can search for the devices it needs by their identifiers or class codes. If all the installed devices need to be reinventoried, this can be done by reading the configuration information of all functions of all devices on all buses; this is faster than going through all possible identifier or class code combinations. For the detected devices, the software must determine the actual settings by reading the configuration space registers, taking into account that resources can be moved over the whole space and even between different memory and I/O ranges.
For the 16-bit real mode, V86 mode, and 16-bit protected mode, PCI BIOS functions are called via the Int 1Ah interrupt; when making the call, the function number is placed in the AX register. Software interrupt emulation is also possible: a far call is made to the physical address 000FFE6Eh (the standard entry point of the Int 1Ah interrupt handler), with the flag register pushed onto the stack prior to making the call.
For 32-bit protected-mode calls, the same functions are called via the entry points found in the catalog of the 32-bit services; the functions of the input and output registers and the carry flag (CF) are preserved. In order to use the 32-bit interface, its catalog needs to be found first and the presence of the PCI BIOS service ascertained by the "$PCI" identifier (the double word 49435024h).
The calls require a deep stack (up to 1,024 bytes). Normal completion is indicated by CF=0 and AH=0; if there has been an error, then CF=1 and AH contains one of the following error codes:
-
81h—unsupported function
-
86h—device not found
-
87h—invalid PCI register number
-
88h—installation failed
-
89h—not enough space in the data buffer
PCI BIOS functions are listed below:
-
AX=B101h—PCI BIOS presence check. When PCI BIOS is present, CF=0, AH=0, and EDX=20494350h (the "PCI " character string); all three conditions must be checked. The AL register holds the descriptor of the hardware mechanism for configuration-space access and special PCI cycle generation:
-
Bit 0—support of configuration space access mechanism #1
-
Bit 1—support of configuration space access mechanism #2
-
Bits [3:2]—reserved
-
Bit 4—special cycle generation using mechanism #1 support
-
Bit 5—special cycle generation using mechanism #2 support
-
Bits [7:6]—reserved
-
The BH and BL registers return the major and minor version numbers (in BCD digits); the CL register returns the number of the last PCI bus present in the system (one less than the number of buses, because their sequential numbering starts from zero). The EDI register may return the linear address of the entry point for the 32-bit BIOS services. Not all BIOS versions return this address (some BIOSes do not modify the EDI). To check for this, the EDI is zeroed out prior to making the call, and the returned value is then checked for zero.
-
AX=B102h—device search by identifier. When making a call, the device ID, vendor ID, and device index (the sequential number) are indicated in the CX, DX, and SI registers, respectively. Upon a successful return, the bus, device, and function numbers are indicated in the BH, BL[7:3], and BL[2:0] registers, respectively. To find all devices with these identifiers, calls are made incrementing SI sequentially from zero until the return code 86h is received (see the sketch following this list).
-
AX=B103h—device search by the class code. When making a call, the class, subclass, interface, and device index are indicated in the ECX[23:16], ECX[15:8], ECX[7:0], and SI registers, respectively. Upon a successful return, the bus, device, and function numbers are indicated in the BH, BL [7:3], and BL [2:0] registers, respectively.
-
AX=B106h—special PCI cycle generation. When making a call, the bus number is indicated in the BL register, and the EDX register carries the special cycle data.
-
AX=B108h—reading a byte from a PCI device configuration space. When making a call, the bus, device, function, and register numbers (0-FFh) are held in the BH, BL[7:3], BL[2:0], and DI registers, respectively. Upon a successful return, the CL register holds the read byte.
-
AX=B109h—reading a word from a PCI device configuration space. When making the call, the bus, device, function, and register numbers (0-FFh, even) are held in the BH, BL[7:3], BL[2:0], and DI registers, respectively. Upon a successful return, the CX register holds the read word.
-
AX=B10Ah—reading a double word from a PCI device configuration space. When making the call, the bus, device, function, and register numbers (0-FFh, a multiple of four) are held in the BH, BL [7:3], BL [2:0], and DI registers, respectively. Upon a successful return, the ECX register holds the read double word.
-
AX=B10Bh—writing a byte into a PCI device configuration space. When making the call, the bus, device, function, and register numbers (0-FFh) are held in the BH, BL [7:3], BL [2:0], and DI registers, respectively. The CL register holds the byte being written.
-
AX=B10Ch—writing a word into a PCI device configuration space. When making the call, the bus, device, function, and register numbers (0-FFh, even) are held in the BH, BL [7:3], BL [2:0], and DI registers, respectively. The CX register holds the word being written.
-
AX=B10Dh—writing a double word into a PCI device configuration space. When making the call, the bus, device, function, and register numbers (0-FFh, a multiple of four) are held in the BH, BL[7:3], BL[2:0], and DI registers, respectively. The ECX register holds the double word being written.
-
AX=B10Eh—determining interrupt allocation options (GET_IRQ_ROUTING_OPTIONS). When making the call, BX=0, and ES:EDI points to the parameter structure describing the buffer for storing the results; this structure consists of a word holding the buffer length, followed by the far pointer to the buffer's start address. In the 16-bit mode, the DS register points to the segment with the physical address F0000h; in the 32-bit mode, its contents are defined by the rules set out in the next subsection. Upon a successful return, the BX register holds a bit map of the IRQx requests; bit values of one in this map mean that the given input of the interrupt controller is used exclusively by the PCI bus. A sequential set of structures describing the capabilities of, and the interrupt allocation for, each PCI device (see Table 6.6) is placed into the buffer. Upon the return, the actual buffer length is held in the parameter structure. If a buffer too small to hold the entire result was given when making the call, the error code 89h is set.
Table 6.6: Description of Interrupt Options for One PCI Device
-
AX=B10Fh—interrupt request line allocation (SET_PCI_IRQ). When making the call, BH holds the bus number, bits [7:3] of BL hold the device number, and bits [2:0] of BL hold the function for which the request is being assigned. The interrupt pin (0Ah—INTA#, …, 0Dh—INTD#) is held in CL; the desired IRQx number (0-0Fh; 0 corresponds to disconnecting INTx# from the controller's inputs) is placed into the CH register. If the requested allocation is not possible, the 88h error code is set upon the return from the call. When using this function, the attendant changes must be made in the configuration registers of all involved devices and their functions.
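As an example of using these functions, here is a sketch of a device search with AX=B102h. The int1a() routine is a hypothetical wrapper assumed to load the register images, execute Int 1Ah (or the far-call emulation described above), and store the results back; it is not part of any standard library:

#include <stdint.h>

/* Hypothetical register image for the BIOS call. */
struct regs { uint32_t eax, ebx, ecx, edx, esi, edi; uint8_t cf; };
extern void int1a(struct regs *r);   /* assumed Int 1Ah invocation wrapper */

/* Find the index-th device with the given IDs (function AX=B102h).
   Returns 0 on success and fills in *bus, *dev, *fn; otherwise returns
   the AH error code (e.g., 86h when no more devices are found). */
static int pci_find_device(uint16_t vendor_id, uint16_t device_id,
                           uint16_t index,
                           uint8_t *bus, uint8_t *dev, uint8_t *fn)
{
    struct regs r = {0};
    r.eax = 0xB102;       /* device search by identifier */
    r.ecx = device_id;    /* CX: device ID */
    r.edx = vendor_id;    /* DX: vendor ID */
    r.esi = index;        /* SI: device index 0, 1, 2, ... */
    int1a(&r);
    if (r.cf)
        return (int)((r.eax >> 8) & 0xFF);  /* AH: error code */
    *bus = (uint8_t)((r.ebx >> 8) & 0xFF);  /* BH: bus number */
    *dev = (uint8_t)((r.ebx >> 3) & 0x1F);  /* BL[7:3]: device number */
    *fn  = (uint8_t)(r.ebx & 0x07);         /* BL[2:0]: function number */
    return 0;
}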
32-Bit BIOS Services Search
The 32-bit BIOS32 services are located using the 32-bit service catalog. The address of the catalog entry point is not known in advance; however, the way to find it is known: The signature string "_32_" (bytes 5Fh 33h 32h 5Fh, i.e., the double word 5F32335Fh) is sought at the beginnings of paragraphs in the memory address range 0E0000h-0FFFFFh. The 32-bit physical address of the catalog entry point follows this string. The entry points of the services proper are then found using the service catalog. The numbers, parameters, and results of the called functions are passed in the processor registers.
To search for a service in the catalog, a four-byte identifier string is placed into the EAX register, code 0 (the catalog search function code) is placed into the EBX register, and a far call is made to the catalog entry point address. The results of the search are returned in the registers: AL=00h—service found; in this case, EBX returns the base address of the service, ECX returns the service segment length, and EDX returns the offset of the entry point relative to the service's start (i.e., relative to the EBX value). A value of 81h returned in the AL register means that the service has not been found.
Prior to attempting to use the service catalog, it should be ascertained that the header is correct. This is done by inspecting its checksum: The sum of all header bytes must equal zero modulo 256. The header length (in paragraphs) is indicated in the byte at offset nine; the eighth byte holds the header version number. The checksum inspection is mandatory, as the four-byte signature may coincide with a fragment of the BIOS code (the string "_32_" disassembles as POP DI; XOR SI,[BP+SI]; POP DI).
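The search and verification just described can be sketched as follows; the code assumes that the 0E0000h-0FFFFFh range has already been mapped into the program's address space (for instance, via /dev/mem), which is an environment-specific detail:

#include <stdint.h>
#include <string.h>

/* Scan a mapped copy of physical 0E0000h-0FFFFFh for a valid BIOS32
   service directory header. Returns the 32-bit physical address of the
   catalog entry point, or 0 if none is found. */
static uint32_t find_bios32(const uint8_t *base /* maps physical 0E0000h */)
{
    for (uint32_t off = 0; off < 0x20000; off += 16) {  /* paragraph steps */
        const uint8_t *p = base + off;
        if (memcmp(p, "_32_", 4) != 0)
            continue;
        uint32_t len = (uint32_t)p[9] * 16;  /* header length in paragraphs */
        if (len == 0 || off + len > 0x20000)
            continue;
        uint8_t sum = 0;
        for (uint32_t i = 0; i < len; i++)   /* all header bytes must sum  */
            sum += p[i];                     /* to zero modulo 256         */
        if (sum == 0) {
            uint32_t entry;
            memcpy(&entry, p + 4, 4);        /* physical entry-point address */
            return entry;
        }
    }
    return 0;
}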
The 32-bit services are called by far calls (CALL FAR). The base of the code segment (CS) must coincide with the beginning of the 4-kilobyte page that contains the entry point; the limit must encompass this and the following page[1]. The requirements for the data segment (DS) base are the same, and its limit must be no smaller than the CS limit. It should be remembered that the addresses are physical (obtained after page translation of the linear addresses).
PCI Card Expansion ROM
The ROM BIOS in the microchip installed on the motherboard supports only the standard (in purpose and function) devices. Should it be necessary, additional devices installed into the expansion slots (ISA, PCI, PCMCIA) can have an additional ROM BIOS for their software support (also called Expansion ROM). This need arises when a device requires software support before the operating system and the application software have been loaded. Such additional ROM modules can also contain all the software needed to support a specialized diskless PC-based controller. Expansion ROM BIOS is used in EGA/VGA/SVGA graphics adapters, some hard drive controllers, SCSI controllers, remote-boot network adapters, and other peripheral devices.
The C8000h-F4000h range has been reserved in the memory space for the ISA bus expansion modules. The POST procedure scans this range in 2 KB steps looking for additional modules during its final stage (after the interrupt vectors have been initialized to point to the BIOS's own handlers). The additional BIOS module for graphics adapters (EGA, VGA, SVGA, etc.) has a fixed address, C0000h, and is initialized earlier (during the video adapter initialization). A PCI bus device contains only an expansion ROM flag in its configuration space; the actual memory address is assigned to it by the POST procedure.
Additional ROM BIOS modules must have a header aligned at the 2 KB memory page boundary; the format of the header is shown in Table 6.7.
Table 6.7: Additional ROM Module Header
In the traditional header, there were only the first three fields; the pointers to the PCI and ISA PnP structures were added later. A valid module starts with the AA55h flag and has a zero sum (modulo 256) of all bytes in the declared range. The real module length can exceed the declared length, but the checksum byte, naturally, must lie within the declared area.
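A module's validity can thus be checked as in the following sketch; per the conventional header layout (Table 6.7), the declared length is assumed to be held in byte 2, in 512-byte units:

#include <stdint.h>

/* Check an expansion ROM module: the AA55h flag (bytes 55h, AAh) and a
   zero sum modulo 256 over the declared length (byte 2, 512-byte units). */
static int rom_module_valid(const uint8_t *rom)
{
    if (rom[0] != 0x55 || rom[1] != 0xAA)
        return 0;                            /* no AA55h flag */
    uint32_t len = (uint32_t)rom[2] * 512;   /* declared module length */
    uint8_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += rom[i];
    return sum == 0;                         /* valid when the sum is zero */
}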
If it finds a valid module, the POST executes a far call (Call Far) to the module's initialization procedure, whose entry point is at offset 3 in the module. The module's designer is responsible for this procedure being correct. The procedure can reassign the vectors of interrupts serviced by the BIOS. If a procedure reassigns the Bootstrap (Int 19h) interrupt to itself, it obtains control over the operating system loading procedure; this feature is used, for example, to remotely boot computers in local area networks (Remote Boot Reset). If it is not necessary to continue the standard booting procedure (for example, when the additional module is a control program for some dedicated equipment), the ROM can contain, instead of an initialization procedure, the main program, which does not return control of the system loading sequence to the POST.
The initialization procedure and device-support software contained in the ROM must be written in such a way that the physical addresses, at which they are located in the memory space, do not matter to them. As a rule, the base address, and sometimes the ROM size of expansion cards, can be changed by hardware (jumpers or software-controlled switches). This feature makes it possible to allocate memory address ranges to ROM modules of several installed cards without conflicts.
For ROM BIOS expansions installed on PCI cards, a standard somewhat different from that of traditional ROM BIOS modules has been adopted. The ROM header is like the traditional one, but it has an additional pointer to the PCI data structure (Table 6.8). The vendor and device identifiers, as well as the class code, coincide with those described in the PCI device's configuration space. Because the PCI bus is used not only in PCs, the card's ROM can contain several modules. Each module starts with a data structure; the module proper follows the structure. The next module's data structure starts right after the previous module (if the last-module indicator is not set in the previous one), and so on. The platform (processor) type is indicated in the module header, and only the needed module is activated during BIOS initialization. This mechanism allows, for example, the same graphics adapter to be installed either into an IBM PC or a PowerPC.
[1]Prior to the PCI 2.2 specification, a pointer to the Vital Product Data string was located here.
The additional PCI card ROM has three parameters pertaining to its size. The overall ROM size is determined by reading the configuration space (see the sketch below). The size indicated in byte 2 of the header gives the module's length during the initialization process; this module is loaded into main memory by the POST procedure prior to calling the initialization procedure (the entry point with offset 3). The checksum, which is usually located at the end of the module, makes the sum of all the module's bytes zero. The length of the image indicated in the PCI data structure (the word with offset 10h) describes the size of the area that must remain in main memory during normal operation (this area can be smaller, as the initialization procedure code is no longer needed). This area is also protected by a checksum.
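The overall ROM size mentioned first can be obtained by the usual base-address sizing procedure: writing all ones to the address bits of the Expansion ROM Base Address register (offset 30h of a type 0 header) and reading back the result. A sketch, again using the assumed cfg_read32()/cfg_write32() helpers:

#include <stdint.h>

/* Assumed configuration-space accessors (hypothetical helpers). */
extern uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg);
extern void cfg_write32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg,
                        uint32_t val);

#define ROM_BAR 0x30   /* Expansion ROM Base Address register offset */

/* Determine the decoded ROM size in bytes (0 if no ROM is implemented).
   Bit 0 of the register is the ROM read-enable bit; it is left cleared. */
static uint32_t expansion_rom_size(uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint32_t saved = cfg_read32(bus, dev, fn, ROM_BAR);
    cfg_write32(bus, dev, fn, ROM_BAR, 0xFFFFF800);       /* address bits */
    uint32_t v = cfg_read32(bus, dev, fn, ROM_BAR) & 0xFFFFF800;
    cfg_write32(bus, dev, fn, ROM_BAR, saved);            /* restore */
    return v ? (~v + 1) : 0;   /* lowest writable address bit gives the size */
}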
PCI card ROM modules are serviced by the POST as follows. The POST procedure determines ROM presence by the Expansion ROM Base Address field in the configuration space, and assigns it an address in a free memory area. Afterward, by programming the command register, the ROM is enabled for reading and the AA55h header signature is looked for in it. When the signature is found, the POST procedure uses the code type to look for the appropriate image (whose identifiers coincide with those of the detected PCI device) and, having found it, loads it into the C0000h-DFFFFh range in main memory. After this, reading of the ROM is disabled (by a write to the Expansion ROM Base Address field) and the initialization procedure is called (the entry point at offset 3). When calling the procedure, the POST passes it the bus number (in the AH register), the device number (in AL[7:3]), and the function number (in AL[2:0]), thus providing the initialization procedure with the exact coordinates of the hardware it serves.
Afterward, the POST determines the size of the area that needs to be left in main memory (by using byte 2, which can be modified by the initialization procedure) and disables writes to this area. If the initialization procedure shrinks the occupied memory, it must take care that the checksum of the area described by byte 2 remains valid. If no memory is needed (the procedure zeroes out byte 2), then, naturally, no checksum is needed either. The VGA extension (determined by the class code) is serviced in a special way: It is loaded at the C0000h address. The initialization procedure can determine the presence of PnP BIOS in the system by inspecting the PnP control structure at the address indicated in ES:DI, and act according to the system environment.
6.1.8 Electrical Interface and Constructs
The PCI bus is built on CMOS integrated circuits that may use either 5 V or 3.3 V power supplies. The DC signal parameters are listed in Table 6.9. However, the rated power of the interface elements (output transistors) was set lower than is actually necessary to switch high-frequency signals (33 or 66 MHz). This is possible because the signals with which integrated circuits drive the bus lines reflect off the unmatched ends of these lines, which at such frequencies have the electrical characteristics of long (transmission) lines. There are no terminators at the ends of the bus; therefore, the arriving wave reflects off them with the same polarity and amplitude. Merging with the incident signal, the reflected wave provides the signal level required by the receiver. The transmitter thus generates signals with levels between the switching thresholds, and the necessary level is reached only after the arrival of the reflected wave. This imposes limitations on the physical bus length: The signal must reach the end and come back in less than a third of the clock period (i.e., 10 nsec at 33 MHz, 5 nsec at 66 MHz).
Table 6.9: PCI DC Interface Signal Parameters
In order to avoid false responses when all bus agents are inactive, on the motherboard, the control line signals FRAME#, TRDY#, IRDY#, DEVSEL#, STOP#, SERR#, PERR#, LOCK#, INTA#, INTB#, INTC#, INTD#, REQ64#, and ACK64# are pulled up to the power rail by resistors (typically, 2.7 K for the 5 V bus version and 8.2 K for the 3.3 V bus version).
Electrical specifications provide for two versions of peak load limits: two PCI devices built into the motherboard plus four expansion slots, or six built-in devices and two expansion slots. It is understood that one built-in PCI device presents only a single CMOS load, and that cards presenting only a single CMOS load are installed into the expansion slots. If the characteristics of the components and of the motherboard track routing surpass those required by the specification, other slot-device combinations are possible; for example, motherboards with five PCI slots can often be encountered. Strict restrictions are imposed on conductor lengths, as well as on the placement of expansion-card components and conductors. From the above, it becomes clear that producing homemade PCI cards based on medium-scale integrated circuits, as could be done for ISA cards, is practically impossible.
The clock rate of the bus is determined by the capabilities of all bus agents, including the bridges (as well as the host bridge, which is a part of the motherboard chipset). The clock frequency generator can set the high 66 MHz frequency only when the M66EN line is high; therefore, installing any card that does not support 66 MHz (with its contact B49 grounded) will lower the frequency to 33 MHz. Server motherboards have several PCI buses and allow different bus clocks (66 MHz and 33 MHz) to be used on different buses; for example, the 66 MHz clock can be used on 64-bit buses and the 33 MHz clock on 32-bit buses. There are no hardware restraints on overclocking the bus to 40-50 MHz, but doing so may cause expansion cards to operate with errors.
As per the PCI specification, devices must work without faults when the frequency drops from its nominal value (33 MHz) down to zero. Altering the frequency during operation is not prohibited, provided that the limitations on the minimum duration of the high and low levels of the CLK signal are duly observed. The CLK signal may stop only at a low level. After clocking resumes, the devices must resume their work as if there had been no synchronization break. When operating at 66 MHz and higher frequencies, in order to lower electromagnetic interference from a fixed-frequency signal, spread-spectrum clocking of the CLK signal can be employed: low-percentage frequency modulation with a modulation frequency of 30-33 KHz. If devices use phase-locked loops for synchronization, their speed ought to be sufficient to handle this modulation. In the PCI-X specification, the allowable clock frequency ranges depend on the bus mode (see Table 6.12).
Standard PCI Slots and Cards
Standard PCI and PCI-X slots are slot-type connectors with contacts spaced 0.05 inches apart. Slots are located somewhat farther from the back panel than the ISA/EISA or MCA slots. PCI card components are placed on the card’s left surface. Because of this, the last PCI slot usually shares the back panel opening with the neighboring ISA slot. This kind of slot is called shared, and either ISA or PCI cards can be installed into it.
PCI cards can use either 5 V or 3.3 V interface signal levels, or both. PCI slots’ signal levels correspond to the power supply voltages of the integrated circuits of the motherboard’s built-in PCI devices: either 5 V or 3.3 V. To avoid installing a card into a wrong slot, slots have keys determining their voltages. The role of keys is played by the rows of the missing contacts 12 and 13 and/or 50 and 51:
-
The key (a rib) for the 5 V slots is located in place of contacts 50 and 51 (closer to the front panel); these slots are abolished in PCI 3.0.
-
The key to the 3.3 V slots is in the place of contacts 12 and 13 (closer to the back panel).
-
Universal slots have no key ribs.
-
Edge connectors of the 5 V cards have matching notches only in the place of contacts 50 and 51; this type of card was abolished in PCI 2.3.
-
The 3.3 V cards have key notches only in the place of contacts 12 and 13.
-
Universal cards have both key notches.
The keys prevent installation of a card into a slot with the wrong power supply voltage. Cards and slots differ only in the buffer circuit power supply voltages that they receive off the +V I/O lines:
-
5 V slot is fed +5 V to its +V I/O line.
-
3.3 V slot is fed between +3.3 V and 3.6 V to its +V I/O line.
-
5 V card’s buffer integrated circuits may only be powered from +5 V.
-
3.3 V card’s buffer integrated circuits may be powered only from a +3.3 V to +3.6 V power supply.
-
A universal card’s buffer integrated circuits may be powered from either power supply and will generate and receive 5 V and 3.3 V specification signals without problems, depending on the type of slot in which it is installed (i.e., on the voltage on the +V I/O contacts).
Slots of both types have +3.3 V, +5 V, +12 V, and −12 V power supplies on the lines of the same names. The PCI 2.2 standard defines an additional 3.3Vaux power line; this provides a stand-by 3.3 V power supply for devices that generate PME# signal when the main power supply is turned off.
Motherboards are most commonly equipped with 5 V, 32-bit slots, which end with the contacts A62/B62. 64-bit slots are encountered less often; they are longer and end with the contacts A94/B94. The connector construction and the protocol allow 64-bit cards to be installed into 32-bit slots and vice versa, but the exchange will, naturally, be conducted in the 32-bit mode.
In terms of the mechanical keys, PCI-X cards and slots correspond to 3.3 V cards and slots; the +V I/O power supply voltage for PCI-X Mode 2 is set at 1.5 V.
Figure 6.13 shows a maximum-length 32-bit card (Long Card) with dimensions given in millimeters. A Short Card is 175 mm long, and even shorter cards are very common. A PCI card has an ISA-card-style mounting bracket (previously, cards with IBM PS/2 MCA-style mounting brackets could be encountered). There are also Low Profile cards, whose brackets are lower; these cards can be installed vertically into 19-inch cases of 2U height (about 9 cm).
The functions of the PCI/PCI-X card connector contacts are shown in Table 6.10.
Table 6.10: PCI Bus Connectors
[1]In PCI 2.1, the M66EN signal is defined only for the 3.3 V slots.
[2]The signal was introduced in PCI-X 2.0. It was previously reserved.
[3]The signal was introduced in PCI 2.2. It was previously reserved.
[4]The signal was introduced in PCI-X (in PCI, it is GND).
[5]The signals were introduced in PCI 2.3. In PCI 2.0 and 2.1, contacts A40 (SDONE) and A41 (SBO#) were used for cache snooping; in PCI 2.2, they were left unconnected (for compatibility purposes, they are pulled up to the high level by 5 K resistors on the motherboard).
PCI slots have connectors to test adapters using the JTAG interface (TCK, TDI, TDO, TMS, and TRST# signals). These signals are not always present on the motherboard, but they can form a logical chain of the tested adapters, to which external testing equipment can be connected. In order for the chain to be uninterrupted on a card that does not use JTAG, there must be a TDI-TDO link.
Most PCI signals are connected under the pure bus topology (i.e., the same-name contacts of one PCI bus slots are electrically connected with each other). There are some exceptions to this rule:
-
Each slot has individual REQ# and GNT# signal lines. They connect the slot with the arbiter (usually—the bridge connecting this bus to the upstream bus).
-
Each slot’s IDSEL signal is connected (via a resistor) to one of the AD[31:11] lines, thereby setting the device’s number on the bus.
-
The INTA#, INTB#, INTC#, and INTD# signals are rotated among the slots' contacts (Fig. 6.6), distributing the interrupt requests among the slots.
-
The CLK signal is routed individually to each slot from its synchronization buffer output. The lengths of all individual feeding tracks are made equal, thus providing synchronous signal on all slots (the tolerance for the 33 MHz slots is ±2 nsec, for the 66 MHz slots it is ±1 nsec).
When a standard motherboard is used in a low-profile case, a passive riser card can be installed into one of the PCI slots, and expansion cards installed into it. If more than one card is installed into the riser card, then, in order to implement the above-described exceptions, PCI extension connectors (small printed-circuit-board edge connectors) are used to bring these signals from other, unoccupied PCI slots on the motherboard. By moving these connectors around, the numbers of the devices installed in the riser card can be changed; more importantly, their interrupt-request line allocation can be changed. There is a weak link in this type of connection, though: the long (10-15 cm) ribbon cables that connect the riser card with the slots. All signals in such cables are sent over parallel nontwisted wires, which negatively affects the CLK signal: Its shape is distorted and a noticeable delay is introduced. This can result in sudden computer freezes without any diagnostic messages from the system. The situation can be helped by separating the CLK conductor from the common ribbon and counter-coiling its excess length (this reduces the conductor's inductance). The other signals in the ribbon cable are not as sensitive to the quality of the cable layout. The best solution is to use low-profile PCI cards installed into the motherboard without a riser card. There would be no problem with a riser card if a clock-source chip were installed on it, distributing the clock signal to all its slots; however, this requires microchips with a phase-locked loop that ties their output signal to the motherboard clock signal.
PCI-X Bus Initialization and Operating Mode Determination
Each PCI-X segment (a physical bus) must work in the most advanced mode available to all its clients, including the bus’ host bridge. In the standard PCI bus, the level of advancement is defined only by the available clock frequency (33 or 66 MHz), and a card informs about its capabilities over contact B49 (M66EN, see above). In the PCI-X bus, new capabilities are available: support of the PCI-X protocol proper (Mode 1 in PCI-X 2.0 terminology) and fast transfers (Mode 2). The card informs the bridge of these capabilities over contact B38 (PCIXCAP), which can be connected to the GND rail via a resistor or not be connected at all (NC), as illustrated in Table 6.11. Resistors’ nominal values are selected in such a way that the bridge can determine the capabilities of cards in multislot buses when the PCIXCAP circuits of all cards are connected in parallel (besides resistors, cards also have capacitors). The bridge that is the master of the given bus inspects the status of the M66EN and PCIXCAP lines at the start of a reset signal. It will select the operating mode of the bus according to the capabilities it sees on those lines (they will correspond to the weakest client’s capabilities). This mode is communicated to all clients using the PCI-X Initialization Pattern: levels of the PERR#, DEVSEL#, STOP#, and TRDY# signals at the end of the RST# signal (at its rising edge). By this moment, the corresponding +V I/O voltage is already being supplied to the slots. Possible bus operation modes and their patterns are shown in Table 6.12.
Table 6.11: Communication of PCI/PCI-X Card Capabilities
Table 6.12: PCI/PCI-X Bus Modes and Initialization Patterns
Device Hot-Plugging
Hot-plugging PCI devices requires a special Hot-Plug Controller in the system to control the hot-plug slots, as well as appropriate software support: the operating system, device drivers, and the controller driver.
Hot-plug slots must be connected to the PCI bus via switching circuits that provide the following:
-
Controlled switching (using electronic keys) of all PCI signal circuits
-
Controlled power supply
The hot-plug controller must provide the following for each of its slots:
-
Individual control of signal switching and the power supply.
-
Individual control of the RST# signal.
-
Individual detection of the PRSNT [1:2]# lines’ status, regardless of the slot’s state (connected or isolated).
-
Individual detection of the M66EN line’s status, regardless of the slot’s state (connected or isolated).
-
An individual "Attention" indicator signaling the status of the slot's power supply (whether the card can be pulled out or inserted). The indicator is software-controlled, and also communicates to the user any problems detected by the system for the device in the given slot.
The user participates in the hot-plug process. He or she must install (and remove) expansion cards only in slots with the power supply disconnected (such a slot's signals are also disconnected from the bus). After a module is installed, power is supplied to it; some time later, it is reset by the RST# signal and the device is initialized. Only after this does the controller connect the slot's signal lines to the bus. Then the software must identify and configure the connected device. Additional difficulties arise if a 33 MHz device is connected to a 66 MHz bus: Because the bus clock can only be changed while the RST# signal is active, and the device being connected cannot work at the high frequency, the entire PCI bus needs to be reset (with the subsequent initialization of all its devices). Before a slot is depowered, it is reset by the RST# signal and all its signal lines are disconnected from the bus.
Variants of PCI Bus Constructs
The PCI bus also has other constructs, specifications for which are available on the site http://www.pcisig.org (albeit, free of charge only to the members; others have to pay).
The Low-Profile PCI card has the conventional connector but a modified mounting bracket. These cards can be mounted vertically (without a riser card) even into low-profile cases (such as 19-inch-high 2U form factor). For these cards, only the 3.3 V power supply for interface circuits is specified (although, the 5 V power rail is preserved).
The PCI bus has dual use in notebook computers:
-
Expansion cards that can be installed by the end user without opening up the computer (with hot-plug capability) use the CardBus PCMCIA construct (see Section 6.5.1).
-
For internal component installation by the manufacturer (not available to the end user), various versions of Small PCI and Mini PCI constructs are used.
Small PCI (SPCI) is a miniaturized construct of the PCI specification; it used to be called Small Form-Factor PCI (SFF PCI). This specification is primarily intended for portable computers and is logically compatible with the conventional 32-bit/33 MHz PCI bus. To the standard signal suite, a new signal has been added: CLKRUN#. The host or devices can use this signal to control the bus clock for energy-saving reasons. The SPCI card's dimensions are the same as those of the PC Card or CardBus card, but special keys stop it from being wrongly connected. To connect an SPCI card to the motherboard, the latter is equipped with a two-row connector with 108 contact pins spaced 2 mm apart. The card can be installed directly into this connector, or a special adapter with two-sided ribbon contacts spaced 0.8 mm apart can be used. The SPCI bus is internal (expansion cards are located inside the case and are installed by the manufacturer with the power supply turned off); therefore, it is not aimed at replacing the CardBus (a bus for hot-swap connection of external equipment). There are three types of SPCI cards, with 5 V, 3.3 V, and universal power supplies, respectively. Thanks to the card's reduced dimensions, which entail reduced conductor lengths, its signal power requirements have been lowered. SPCI cards make it possible to take advantage of modular assembly (unloading the motherboard) while providing high exchange efficiency (which the CardBus does not).
The Mini PCI Specification is a miniature version of the PCI card (2.75″ x 1.81″ x 0.22″). Logically and electrically, it corresponds to the 32-bit PCI, and also uses the CLKRUN signal to lower the power consumption and does not have the JTAG signals. It also has additional signals to control audio and video applications.
The Compact PCI bus (cPCI), intended for industrial applications, is based on the PCI 2.1 specification. The bus allows a fairly large number of slots (up to eight) and supports 32-bit and 64-bit exchange. Constructively, Compact PCI circuit boards are Eurocards of 3U size (100 x 160 mm) with two connectors (J1 and J2) or of 6U size (233.35 x 160 mm) with four or five connectors (J1-J5).
The PXI bus (PCI eXtensions for Instrumentation) was developed by National Instruments based on the Compact PCI bus. PXI modules use the same constructs (Eurocards). In addition to the PCI signals, PXI connectors reserve space for extra buses:
-
Local buses coupling the neighbor modules
-
Buses (Trigger Bus) for mutual synchronization of modules and radially propagated signals (Star Trigger)
Designing Custom PCI Devices
Having studied the PCI protocol, it becomes clear that designing custom PCI devices using small- and medium-scale integration chips is a thankless task. The protocol itself is not that complex, but implementing the requirements for the configuration registers is difficult. As a rule, production-run PCI devices are built on one chip: The interface and the functional parts of the device are placed in one package. The development of such microchips is quite expensive and makes sense only when mass production is in prospect. For making development models and low-production-run devices, several companies produce various-purpose PCI interface microchips. On the PCI side, practically all these microchips support single transactions as targets, while more sophisticated models also allow burst cycles. More complex integrated circuits function as bus-mastering devices, setting up DMA channels for exchanges with system memory. Depending on the functional capabilities of the microchip, exchanges over these channels may be initiated by host software (host-initiated DMA) or by devices on the peripheral side of the microchip (target-initiated DMA). On the peripheral side, there are interfaces to connect peripheral microchips, as well as universal microprocessors and microcontrollers of popular families. A quite extensive selection of such microchips is presented on the site http://www.plxtech.com; other companies also work in this area.
Worth noting is the implementation of the PCI interface in Field Programmable Gate Array (FPGA) microchips. Here, the PCI core with the initiator and target device functions uses from 10,000 to 15,000 gates, the exact number depending on the required functions (see http://www.xilinx.com, http://www.altera.com). FPGA microchips are made with 20,000, 30,000, or 40,000 gates, so the remaining gates can be used for implementing the functional part of the device, FIFO buffers, etc.
ISA bus designs can be quickly converted to PCI using PCI-ISA microchip bridges (see, for example, http://www.iss-us.com).
[1]The segment may also begin a few pages earlier.
6.2 AGP Interface
The Accelerated Graphics Port (AGP) was introduced to connect graphics adapters with 3D accelerators. This type of adapter comprises an accelerator (specialized graphics processor), local memory used as both the video memory and the local memory of the graphics processor, and control and configuration registers accessible by both the local and the central processors. The accelerator can access both the local memory and the main memory, in which it may store data that do not fit into the local memory (as a rule, large-volume textures). The main idea of the AGP lies in providing the accelerator with the quickest possible access to the system memory (it already has quick access to the local memory), with higher priority than for other devices.
The AGP is a 32-bit parallel synchronous interface with a 66 MHz bus clock; most of the signals have been borrowed from the PCI bus. However, unlike the PCI bus, the AGP is a two-point interface. Using the logic and data channels of the motherboard chipset, it connects the video adapter with memory and processor system bus directly, without getting into the bottleneck that is the PCI bus. Exchange via the port can be conducted using either the PCI or the AGP protocol. The distinctive features of the AGP are as follows:
-
Pipelined memory access
-
Multiplied (2x/4x/8x) data transfer rate (relative to the port clock frequency)
-
Sideband command system (SBA) provided by the demultiplexed address and data buses
The concept of pipelined memory access is illustrated in Fig. 6.14, where PCI and AGP memory accesses are compared. With the PCI, the bus performs no transfers while the memory is responding to the access request, yet it is not free. The pipelined AGP access allows the next-in-line access requests to be sent during this time, after which a stream of responses is received.
Multiplying the data transfer rate relative to the 66 MHz bus clock provides a bandwidth of up to 532 MBps (2x), 1,066 MBps (4x), and 2,132 MBps (8x).
The way the address and data buses are demultiplexed is somewhat unusual. In order to save on the number of the interface lines, the address and command bus in the demultiplexed AGP mode are implemented using only 8 SideBand Address (SBA) lines. The command and address, as well as the transfer-length indicator, are transmitted serially over several cycles. It was not mandatory for AGP 1.0 devices to support demultiplexed addressing, since there was an alternative way of supplying address using the AD bus. In AGP 2.0, this became mandatory; and in the AGP 3.0 version, it is the only addressing method.
Many of the AGP’s advantages are only potential, and can only be implemented with the support of the video adapter’s hardware and software. In real life, an AGP video adapter can act in different ways:
-
Not use pipelining, exchanging data only in the PCI mode (possibly with Fast Write)
-
Not work with textures located in the system memory, but exchange data between memory and the local buffer, which is faster
AGP has practically the entire suite of PCI bus signals, plus additional AGP-specific signals. A device connected to the AGP may be intended solely for AGP operations, or it may be an AGP+PCI combination. The adapter's accelerator is an AGP master device; it can execute its requests in both the AGP and the PCI modes. In the AGP mode, exchanges are carried out employing (or not) features such as sideband addressing (SBA) and 2x/4x/8x speeds. In AGP-mode transactions, the accelerator can access only the main memory (not the local memory of PCI devices). Moreover, the adapter is a target PCI device that, in addition to regular PCI commands, can support (or not) fast writes at 2x/4x/8x from the processor side. The adapter plays the role of the target device when its local memory, I/O, or configuration space registers are accessed by the central processor.
A device connected to the AGP must be able to perform the functions of an AGP master device (otherwise, there is no sense in connecting it to the AGP) and the functions of a target PCI device with all its attributes (configuration registers, etc.); additionally, it can be a master PCI device.
There are two models of AGP accelerator operation: DMA and DIME (Direct Memory Execute). In the DMA model, the accelerator views the local memory as its primary memory and, when it runs low on it, uses the main memory to swap the excess data into and out of the local memory. In the DIME (or Execute) model, the accelerator views the local memory and the main memory as logically equivalent parts of one address space. With DMA operation, port traffic tends to consist of long block transfers; with DIME, the traffic is saturated with short random accesses.
The AGP specification was developed by Intel based on the 66 MHz PCI 2.1 bus. Currently, there are three main specification versions:
-
AGP 1.0 (1996): A port with two alternative pipelined memory access methods defined: sideband (over the SBA bus) and in-band (using the PIPE# signal). Two transfer modes—1x and 2x—defined. Power supply is 3.3 V.
-
AGP 2.0 (1998): Fast Write capability in the PCI mode added, along with 4x operation mode with 1.5 V power supply.
-
AGP 3.0 (2002; the project was called AGP8X): 8x operation mode added, with 0.8 V signal levels. The 1x and 2x speeds are abolished, and only one command method is left: sideband (SBA). Some AGP commands are eliminated; isochronous exchange commands are introduced; the capability to select pages described in the GART is introduced, along with selective coherence support when accessing different pages within the GART limits.
Only an intelligent graphics adapter with a 3D accelerator (and only one) can be connected to the AGP. The AGP system logic has a sophisticated memory controller that performs deep buffering and services requests from the AGP (from the adapter) and from its other clients (one or several central processors and the PCI bus) with high efficiency. The only way to connect more than one AGP adapter is to implement several AGP ports on the motherboard, which is unlikely ever to be done.
AGP can use the entire memory bandwidth of a modern 64-bit computer system. Here, the memory can also be accessed from the processor side and by PCI bus masters via the bridges. AGP support was first introduced by Intel in chipsets for P6 processors; its competitors use the AGP in motherboards for processors with the Pentium interface. Currently, practically all modern motherboards for PC-compatible computers and other platforms (even Macintosh) support AGP.
6.2.1 Transaction Protocols
In the PCI mode, transactions initiated by the accelerator begin with the activation of the FRAME# signal and are executed in the conventional PCI manner (see Section 6.1). Note that here the AD bus is occupied for the entire duration of the transaction. Moreover, memory read transactions take more cycles to execute than write transactions: After the address is issued, wait cycles are unavoidable while the memory is accessed. Writes are executed faster: The master sends the write data immediately after the address, and they "settle" in the memory controller's buffer while the memory is being accessed. The memory controller makes it possible to complete the transaction and release the bus before the data are physically written to memory. The adapter handles accesses from the processor (or PCI bus masters) just like a regular PCI device does.
Only the accelerator can initiate pipelined AGP transactions. These are placed by the AGP logic into the process queue, and are executed in the order depending on their priorities, the order the requests come in, and availability of data. The accelerator may address these transactions only to the system memory. If an AGP device needs to access the local memory of a PCI device, then it must perform these transactions in the PCI mode. Transactions that are addressed to an AGP device are handled by this device as they are handled by a PCI slave device; however, they can be executed using a fast write to the local memory operation. In this operation, data are sent at the AGP speed (2x/4x/8x), and their flow control is more like the AGP protocol rather than the PCI. Fast-write transactions are usually initiated by the processor, and are intended to force “push” data into the accelerator’s local memory.
Figure 6.15 illustrates the concept of the AGP pipeline. AGP can assume one of four states:
-
IDLE
-
DATA—transmitting data of the pipelined instructions
-
AGP—placing an AGP command into the queue
-
PCI—executing transaction in the PCI mode
The AGP IDLE state may be terminated by a PCI transaction request (from the accelerator or from the system side) or by an AGP request (only from the accelerator). In the PCI state, a PCI transaction is executed completely, from the issuing of the address and command to the completion of the data transfer. In the AGP state, the master device only transmits the command and the address of the transaction (upon the PIPE# signal or through the SBA port), which are placed into the queue; several requests may immediately follow one another. The port switches into the DATA state when it has an enqueued, as-yet-unserviced command ready for execution. Data for the enqueued commands are transmitted in this state. This state may be interrupted by intervening PCI requests (to execute a complete transaction) or AGP requests (to enqueue a new command); such interruptions[1] are only possible on the boundaries of the current AGP transactions, however. When the AGP has serviced all commands, it returns to the idle state. All transitions are controlled by the AGP arbiter, which reacts to the incoming requests (REQ# from the accelerator and external accesses from the processor and other PCI devices) and to the memory controller's responses.
The AGP transactions differ from the PCI transactions in some details, which are as follows:
-
The data phase is separated from the address phase, which is what makes pipelining possible.
-
A custom set of commands is used.
-
Transactions are addressed to the system memory only, and use the physical address space (like the PCI bus). Transaction lengths can only be multiples of 8 bytes, and transactions can begin only on 8-byte boundaries. Read transactions of other lengths must be executed in the PCI mode; write transactions can use the C/BE[3:0]# signals to mask unneeded bytes.
-
The transaction length is indicated explicitly in the request.
-
Pipelined requests do not guarantee memory and cache coherence. For operations requiring coherence, PCI transactions must be used. In AGP 3.0, memory areas may be specified for which coherence is ensured in pipelined transactions as well.
Two methods to issue AGP commands (enqueueing requests) are available, out of which one is chosen in the current configuration; the methods cannot be changed on the fly:
-
Requests are issued on the AD[31:0] and C/BE[3:0] bus by the PIPE# signal. On each CLK edge, the master device transmits the next double word along with the command code.
-
Commands are issued using the sideband SBA [7:0] address lines. Sideband means that these signals can be used regardless of whether or not the AD bus is busy. How the requests are clocked in depends on the mode (1x, 2x, 4x, or 8x).
When commands are issued over the AD bus with the PIPE# signal active, the AGP command code (CCCC) is coded by the C/BE[3:0]# signals; the starting address is placed on the AD[31:3] lines, and the length n of the requested data block on the AD[2:0] lines (a field-packing sketch appears after the command list below). The following commands are defined:
-
0000—Read. Reading (n + 1) quadwords from memory starting at the specified address.
-
0001—HP Read. High priority read. Abolished in AGP 3.0.
-
0100—Write. Memory write.
-
0101—HP Write. High priority memory write. Abolished in AGP 3.0.
-
1000—Long Read. Reading 4 × (n + 1) quadruple words (up to 256 bytes of data).
-
1001—HP Long Read. High priority long read. Abolished in AGP 3.0.
-
1010—Flush. Flushes the data of all previous write commands to their destination addresses (on the AGP, it looks like a read that returns a random quadruple word as acknowledgement of completion; the address and length specified in the request have no meaning).
-
1100—Fence. Creates a boundary that does not let reads pass writes in the low-priority access stream.
-
1101—Dual Address Cycle (DAC). This is a two-address cycle for 64-bit addressing. In the first cycle, the lower part of the address and the request length are sent over the AD lines. In the second cycle, the upper part of the address is sent over the AD lines and the actual command is sent over the C/BE [3:0] lines.
Special commands are set aside in AGP 3.0 for isochronous transfers (see Section 6.2.3).
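To make the AD-bus enqueueing format concrete, the following minimal C sketch packs one request as it would appear during a PIPE# clock: the 8-byte-aligned address on AD[31:3], the length code n on AD[2:0], and the command code on the C/BE[3:0]# lines. The structure and names are illustrative, not part of any real driver interface.

```c
#include <stdint.h>
#include <stdio.h>

/* One AGP request as presented on the bus during a PIPE# clock:
   AD[31:3] = starting address (8-byte aligned), AD[2:0] = n
   (the request transfers n + 1 quadruple words), C/BE[3:0]# = command. */
typedef struct {
    uint32_t ad;    /* value driven on the AD[31:0] lines  */
    uint8_t  cbe;   /* value driven on the C/BE[3:0] lines */
} agp_request;

enum { AGP_CMD_READ = 0x0, AGP_CMD_WRITE = 0x4, AGP_CMD_LONG_READ = 0x8 };

static agp_request agp_encode_request(uint32_t addr, unsigned qwords, uint8_t cmd)
{
    agp_request r;
    r.ad  = (addr & ~7u) | ((qwords - 1) & 7u); /* align address, n = qwords - 1 */
    r.cbe = cmd & 0xF;
    return r;
}

int main(void)
{
    /* read 4 quadruple words (32 bytes) starting at 1F000010h */
    agp_request r = agp_encode_request(0x1F000010, 4, AGP_CMD_READ);
    printf("AD = %08X, C/BE = %X\n", r.ad, r.cbe);
    return 0;
}
```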
In the sideband mode, commands are sent over the SBA[7:0] bus in four types of 16-bit packets. The packet type is encoded by the upper bits as follows:
-
Type 1: 0AAA AAAA AAAA ALLL—length field (LLL) and the lower address bits (A[14:03])
-
Type 2: 10CC CCRA AAAA AAAA—command code (CCCC) and the middle address bits (A[23:15])
-
Type 3: 110R AAAA AAAA AAAA—upper address bits (A[35:24])
-
Type 4: 1110 AAAA AAAA AAAA—additional upper address bits if 64-bit addressing is required
An all-ones packet is a no-operation command (NOP); it is sent when the SBA bus is idle. Bits marked R are reserved. Type 2, 3, and 4 packets are "sticky": their values are retained until a new packet of the same type is sent. A command is enqueued by a type 1 packet, which sets the transaction length and the lower address bits; the command code and the rest of the address must have been specified by previously sent type 2-4 packets. This method uses bus cycles very efficiently when issuing commands for transfers of arrays of data. Each two-byte packet is sent via the 8-bit SBA bus in two portions (upper byte first). Data on the SBA bus are clocked in a manner that depends on the port operating mode:
-
In the 1x operating mode, each byte is clocked at the rising edge of CLK. The start of a packet (its upper byte) is recognized as the first received byte that is not all ones (FFh); the lower byte is sent in the following clock cycle. The next command (a type 1 packet) may be enqueued every two CLK cycles (provided that the command code and the upper address bits have already been issued by earlier packets). The full command input cycle takes 10 clocks.
-
In the 2x operating mode, a separate strobe SB_STB is used for the SBA. The upper byte of a packet is sent at its negative transition, and the lower byte is sent at the next rising edge. The frequency of this strobe (but not the phase) coincides with the CLK frequency, so the next command can be enqueued in every CLK cycle.
-
In the 4x operating mode, yet another, inverted, strobe is used: SB_STB#. The upper byte of a packet is latched at the negative transition of the SB_STB, and the lower byte is latched at the next negative transition of the SB_STB#. The frequency of the strobes is twice the CLK frequency, so two packets can be enqueued in each CLK cycle. However, the AGP master may issue no more than one type 1 packet in each clock (i.e. enqueue no more than one request).
-
In 8x mode, the strobes have different names: SB_STBF and SB_STBS. Their frequency is four times the CLK frequency, so 4 packets fit into one clock. Nevertheless, the command enqueueing rate is still limited to one command per clock tick.
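Returning to the packet formats above, the encoding is simple enough to sketch in C: types 2 and 3 carry the sticky command and upper address bits, and a type 1 packet, carrying the low address bits and the length, actually enqueues the request. The helper names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Build the 16-bit SBA packets for one request: addr is a 36-bit
   address, len3 the 3-bit LLL length field, cmd the 4-bit CCCC code. */
static uint16_t sba_type1(uint64_t addr, unsigned len3)
{   /* 0AAA AAAA AAAA ALLL: A[14:3] and LLL; enqueues the request */
    return (uint16_t)((((addr >> 3) & 0xFFF) << 3) | (len3 & 7));
}
static uint16_t sba_type2(uint64_t addr, unsigned cmd)
{   /* 10CC CCRA AAAA AAAA: CCCC and A[23:15] (sticky) */
    return (uint16_t)(0x8000 | ((cmd & 0xF) << 10) | ((addr >> 15) & 0x1FF));
}
static uint16_t sba_type3(uint64_t addr)
{   /* 110R AAAA AAAA AAAA: A[35:24] (sticky) */
    return (uint16_t)(0xC000 | ((addr >> 24) & 0xFFF));
}

int main(void)
{
    uint64_t addr = 0x2ABCD5678ull;   /* a 36-bit example address */
    /* each packet goes out over SBA[7:0] upper byte first */
    printf("type3=%04X type2=%04X type1=%04X\n",
           sba_type3(addr), sba_type2(addr, 0x0 /* Read */),
           sba_type1(addr, 3 /* 4 quadruple words */));
    return 0;
}
```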
Responding to the received commands, the AGP port executes the data transfers. The AGP data phase is not explicitly tied to the command/address phase: The port supplies data phases when the system memory is ready for the requested exchange.
The AGP data are transmitted when the bus is in the DATA state. Data phases are supplied by the AGP port (system logic), based on the order of the commands it receives from the accelerator. The accelerator is informed of the role that the AD bus will play in the next transaction by the ST[2:0] signals, which are valid only during the GNT# signal (in AGP 1.0/2.0, codes 100-110 are reserved; a decoding sketch follows the list):
-
000—Data of the previously enqueued low-priority read request (simple asynchronous read in AGP 3.0) will be sent to the master device (or buffers are flushed).
-
001—Data of a high-priority read request will be sent to the master device. (Reserved in AGP 3.0).
-
010—The master device will have to supply low-priority write request data (simple asynchronous write data in AGP 3.0).
-
011—The master device will have to supply high-priority write request data. (Reserved in AGP 3.0.)
-
111—The master device is permitted to enqueue an AGP command (by the PIPE# signal) or to begin a PCI transaction (by the FRAME# signal).
-
110—Transceiver calibration cycle (for 8x speed in AGP 3.0).
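For reference, the role codes just listed can be expressed as a small decoder (hypothetical, with the AGP 1.0/2.0 meanings):

```c
#include <stdio.h>

/* Decode the ST[2:0] code sampled during GNT# (AGP 1.0/2.0 meanings;
   110 is the AGP 3.0 calibration cycle; 100/101 become isochronous). */
static const char *st_meaning(unsigned st)
{
    switch (st & 7) {
    case 0:  return "low-priority read data follow (or flush data)";
    case 1:  return "high-priority read data follow";
    case 2:  return "master must supply low-priority write data";
    case 3:  return "master must supply high-priority write data";
    case 6:  return "transceiver calibration cycle (AGP 3.0, 8x)";
    case 7:  return "master may assert PIPE# or FRAME#";
    default: return "reserved";
    }
}

int main(void)
{
    for (unsigned st = 0; st < 8; st++)
        printf("ST=%u: %s\n", st, st_meaning(st));
    return 0;
}
```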
From the status code, the accelerator learns only the type and priority of the command whose results are to follow in the current transaction. Exactly which enqueued command the port is servicing is determined by the accelerator itself: Since it enqueued the commands, it knows their order. The AGP interface has nothing like the transaction tags found in the system bus of the P6 processors or in the PCI-X bus. There are only four independent queues, one for each command type: low-priority read, high-priority read, low-priority write, and high-priority write. Commands from different queues may be executed in an arbitrary order; the port has the right to execute them in whatever order is most efficient. The actual order of command execution (memory reads and writes) may therefore change. However, within each queue, commands are executed in the order in which they were enqueued, and both the accelerator and the port know this order. Queue priorities were abolished in AGP 3.0, but isochronous transaction capability was introduced.
The system logic arbiter assigns higher priority to high-priority AGP requests than to requests from the CPU or from PCI bus master devices. To low-priority AGP requests, it assigns a priority lower than that of CPU requests but higher than that of the other master devices' requests. Although the protocol itself in no way limits the queue depth, the AGP specification limits it to 256 requests. During the configuration stage, the plug-and-play system sets the actual limit in the accelerator's configuration register according to its own capabilities and those of the motherboard. Programs that work with the accelerator (executed by both the local and the central processors) may not exceed this limit on pending commands in the queue (they have all the information necessary for this).
When transferring AGP data, the control signals borrowed from the PCI have almost the same functions as in the PCI. Data transfer in the AGP 1x mode is very similar to the PCI cycles, but the handshaking procedure is a little simpler (as this is a dedicated port and the exchange is performed only with the fast system-memory controller). The 2x, 4x, and 8x modes use specific strobing:
-
In the 1x mode, data (4 bytes on the AD [31:0]) are latched by the recipient at the rising edge of every CLK cycle, which provides a peak throughput of 66.6 x 4 = 266 MBps.
-
In the 2x mode, two strobe signals are used: AD_STB0 for the AD[15:0] lines and AD_STB1 for the AD[31:16] lines. They are generated by the data source; the receiver latches data at both the rising and the falling edges of the strobe signals. The strobe frequency coincides with the CLK frequency, which provides a peak bandwidth of 66.6 x 2 x 4 = 533 MBps.
-
In the 4x mode, two additional inverted strobes are used: AD_STB0# and AD_STB1#. Data are clocked at the rising and falling edges of both the positive and inverted strobes (strobe pairs can be used either as two individual signals or as one differential signal). The strobe frequency is twice the CLK frequency, which provides a bandwidth of 66.6 x 2 x 2 x 4 = 1,066 MBps.
-
In the 8x mode, the strobe pairs were given new names: AD_STBF[1:0] (F—first) and AD_STBS[1:0] (S—second). The even portions of data are latched at the positive transition of the first strobe; the odd portions are latched at the positive transition of the second strobe. The switching frequency of each strobe is four times the CLK frequency; the strobes are shifted by half their period relative to each other, which is what provides the eightfold data strobing frequency on the AD lines. From this comes the peak bandwidth of 66.6 x 4 x 2 x 4 = 2,132 MBps.
The AGP port must keep track of the accelerator buffers' readiness to receive and send the data of enqueued transactions. By the RBF# (Read Buffer Full) signal, the accelerator can inform the port that it is not ready to receive the data of low-priority read transactions (it must always be ready to receive high-priority data). Using the WBF# (Write Buffer Full) signal, the accelerator informs the port that it is not ready to accept a new batch of fast-write data.
6.2.2 Address Translation: AGP Aperture and GART
The AGP provides translation of the logical addresses used in accelerator requests to the system memory into physical addresses, thereby reconciling the views of the system memory seen by the accelerator's software and by the software running on the central processor. The translation is done on a page basis (the default page size is 4 KB), as adopted in the demand-paged virtual memory systems of x86 and other modern processors. All accesses that fall within the AGP aperture must be translated. The AGP aperture is the physical address area that lies above the main memory boundary and, as a rule, is adjacent to the adapter's local memory (Fig. 6.16).
Therefore, when working in DIME mode, the accelerator has a continuous memory area available, part of which comprises the adapter's local memory. The rest of the memory that it addresses is mapped onto the system memory via the aperture, using the Graphics Address Remapping Table (GART). Each element of this table describes one page of the aperture area. A validity flag indicates whether a GART element is valid; valid elements specify the address of the physical memory page onto which the corresponding aperture page is mapped. Physically, the GART is located in the system memory, aligned on a 4 KB page boundary; the AGP configuration registers point to its beginning.
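The translation itself is straightforward to sketch. Assuming 4 KB pages and a flat table of 32-bit entries with the validity flag in bit 0 and the physical page frame in the upper bits (a plausible layout for illustration; real chipsets define their own entry formats), an aperture access is translated roughly as follows:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Hypothetical GART entry: bit 0 = valid, bits [31:12] = page frame. */
static bool gart_translate(const uint32_t *gart, uint32_t aper_base,
                           uint32_t aper_size, uint32_t addr, uint32_t *phys)
{
    if (addr < aper_base || addr - aper_base >= aper_size)
        return false;               /* outside the aperture: no translation */
    uint32_t entry = gart[(addr - aper_base) >> PAGE_SHIFT];
    if (!(entry & 1))
        return false;               /* invalid entry: page not mapped */
    *phys = (entry & ~PAGE_MASK) | (addr & PAGE_MASK);
    return true;
}
```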
The size of the AGP aperture (which also determines the size of the GART) is set by programming the chipset registers. By adjusting the CMOS Setup parameters or using external utilities, it can be set to 8, 16, 32, …, 256, or more megabytes. The optimal aperture size depends on the amount of memory and the programs running; a common recommendation is half the main memory size. Setting the aperture size does not mean that this entire volume becomes unavailable to the system: It is simply the maximum amount of memory that the operating system will allocate to the accelerator upon request. As long as its own memory is sufficient, the accelerator will not ask for additional memory from the system memory resources. Only when its local memory is insufficient will it dynamically request additional memory, and these requests will be satisfied within the set aperture limits. As the accelerator's need for additional memory decreases, the memory is dynamically released for the regular needs of the operating system. However, if the graphics accelerator has no local memory at all (as in cheap integrated adapters), a portion of the system memory (at least for the screen buffer) is expropriated statically. This can be seen from the reduced memory size that the POST displays at the beginning of the system boot.
The port logic ensures the coherence of all system cache memories for AGP accesses outside the aperture address range. A capability for selectively enabling coherence for accesses inside the aperture was introduced in AGP 3.0; previous versions simply assumed that the memory area inside the aperture must be uncacheable. Because this memory was intended for storing textures, which are fairly static, this simplification is quite acceptable.
6.2.3 AGP 3.0 Isochronous Transactions
To support isochronous transactions in AGP 3.0, new command and status codes were introduced, as well as configuration registers controlling the isochronous connection. Isochronous transfers can be performed by the AGP master only via the AGP aperture; moreover, only with those memory areas that have no coherence support (in order to avoid unpredictable delays caused by the flushing of dirty cache lines). Isochronous transactions are available only at the 8x speed. An agreement for an isochronous transfer is described by the following parameter set:
-
N—the number of read or write transactions over time T
-
Y—isochronous transaction data block size
-
L—maximum data delivery latency from the moment a command was issued (in T periods)
The bandwidth is BW = N × Y/T; a T interval of 1 μsec is adopted. The block size Y can be 32, 64, 128, or 256 bytes; read transactions can have the same lengths, so an isochronous block is transferred in one read transaction. Write transactions can be only 32 or 64 bytes long, so a block can take 1, 2, or 4 transactions to transfer. Depending on the memory subsystem, the AGP can support isochronous traffic sufficient for the following applications (the figures can be checked against the formula, as shown after the list):
-
Video capture (in desktop PCs): 128 MBps, N = 2, L = 2, Y = 64
-
Video editing: 320 MBps, N = 5, L = 2, Y = 64
-
One HDTV channel stream: 384 MBps, N = 3, L = 10, Y = 128
-
Two HDTV channel streams (in powerful workstations): 640 MBps, N = 5, L = 10, Y = 128
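These figures agree with the formula above; for instance, for the two-stream HDTV case:

```latex
BW = \frac{N \times Y}{T} = \frac{5 \times 128\ \text{bytes}}{1\ \mu\text{sec}} = 640\ \text{MBps}
```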
The new AGP commands for isochronous transactions include the following:
-
0111 (ISOCH Read): isochronous read. The LLL field holds the transaction length code: 000—32 bytes, 001—64 bytes, 010—128 bytes, 011—256 bytes.
-
0110 (ISOCH Write/Unfenced): isochronous write with out-of-order completion. The LLL field holds the transaction length code: 000—32 bytes, 001—64 bytes.
-
0111 (ISOCH Write/Fenced): isochronous write with ordered completions. The LLL field holds the transaction length code: 000—32 bytes, 001—64 bytes.
-
1110 (ISOCH Align): reading the time offset relative to the isochronous period.
The new AGP status codes include the following:
-
100: isochronous data read
-
101: isochronous data write
6.2.4 AGP Configuration Registers
AGP interface devices are configured in the same way as regular PCI devices, by accessing their configuration space registers (see Section 6.1.12). However, AGP devices do not require the external IDSEL line: Their configuration register enable signal is connected internally to the AD16 line, so AGP configuration registers can be accessed when AD16=1.
During initialization, the POST allocates only system resources; AGP operations remain disabled. The loaded operating system sets the required AGP parameters—the exchange mode, fast-write support, over-4 GB addressing, the enqueueing method, and the queue length—and then enables AGP operations. To determine the devices' requirements, their parameters are read from the AGP status registers, and the negotiated parameters are written to the AGP command registers located in the configuration space. Port parameters are set via the configuration registers of the motherboard chipset's host bridge.
Two functions and their configuration spaces are involved in configuring an AGP system:
-
The AGP proper (core logic), which is the target device in AGP transactions
-
The graphics adapter, which is the AGP transaction initiator
The purposes of the specific configuration registers of these two functions partially coincide (Fig. 6.17). The registers of importance to the port only are marked in gray; optional registers are marked by an asterisk.
The APBASELO register (port only) sets the location of the AGP aperture:
-
Bits [31:22] set the address.
-
Bits [21:4] are always zeroes (aperture size cannot be less than 4 MB).
-
Bit 3 = 1 indicates prefetching capability.
-
Bits [2:1] set the address width: 00—32 bits, 10—64 bits (the APBASEHI register is also used).
The location of the rest of the registers is defined by the value of CAP_PTR.
The NCAPID register (in port or card) contains the AGP specification version:
-
Bits [31:24]—reserved
-
Bits [23:20]—the major version digit
-
Bits [19:16]—the minor version digit
-
Bits [15:8]—NEXT_PTR: a pointer to the next capabilities list (or zero)
-
Bits [7:0]—CAP_ID, the AGP capability identifier (02h).
The AGP status register, AGPSTAT (in port or card), indicates the main AGP capabilities: the allowable number of enqueued requests, sideband addressing support, over 4GB addressing support, 1x, 2x, 4x, or 8x modes:
-
Bits [31:24]—RQ (port only), the allowable total number of enqueued requests: 0—1 request, 255—256 requests.
-
Bits [23:18]—reserved (0).
-
Bit 17—ISOCH SUPPORT: isochronous transfer support (AGP 3.0).
-
Bit 16—reserved.
-
Bits [15:13]—ARQSZ (port only): indication of the optimal size of a request to the graphics adapter; Opt_Rq_Size = 2^(ARQSZ+4). Introduced in AGP 3.0.
-
Bits [12:10]—Cal_Cycle (AGP 3.0 only), calibration period: 000—4 msec, 001—16 msec, 010—64 msec, 011—256 msec, 111—no calibration needed; other values are reserved.
-
Bit 8—ITA_COH, providing coherence when accessing the accelerator via the aperture (with the coherence bit set in the corresponding GART entry).
-
Bit 7—GART64B, 64-bit GART element support.
-
Bit 6—HTRANS# (AGP 3.0 only), translation of host requests via the aperture: 0—when a host access is within the aperture range, the address is translated via the GART; 1—the host does not send requests within the aperture range.
-
Bit 5—Over4G memory addressing support.
-
Bit 4—FW, fast write support.
-
Bit 3—AGP3.0_MODE: 0—AGP 1.0/2.0; 1—AGP 3.0.
-
Bits [2:0]—RATE, exchange rates supported over AD and SBA. In AGP 1.0/2.0 mode: bit 0—1x, bit 1—2x, bit 2—4x. In AGP 3.0 mode: bit 0—4x, bit 1—8x.
The AGP command register, AGPCMD (card and port), is used to enable the above capabilities, and contains the following fields:
-
Bits [31:24]—RQ_DEPTH, setting the depth of the command queue.
-
Bits [23:16]—reserved (0).
-
Bits [15:13]—PARQSZ (AGP 3.0 only), setting the optimal request size that the AGP master must attempt to attain; Opt_Rq_Size = 2^(PARQSZ+4).
-
Bits [12:10]—PCAL_Cycle (AGP 3.0 only), setting the calibration cycle period.
-
Bit 9—SBA_ENABLE, setting the sideband command issuance.
-
Bit 8—AGP_ENABLE.
-
Bit 7—GART64B, enabling 64-bit GART elements.
-
Bit 6—reserved.
-
Bit 5—4G, enabling over 4G memory addressing (two-address cycles and type 4 parcels over the SBA).
-
Bit 4—FW_Enable, fast write enable.
-
Bit 3—reserved.
-
Bits [2:0]—DATA_RATE, setting the exchange mode (only one bit can be set). In AGP 2.0: bit 0—1x, bit 1—2x, bit 2—4x. In AGP 3.0: bit 0—4x, bit 1—8x, bit 2—reserved.
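The negotiation performed with these registers can be sketched in C: read AGPSTAT from both the port and the card, intersect their capabilities, and write the agreed values to both AGPCMD registers. The cfg_read32/cfg_write32 accessors are placeholders for whatever configuration-space access mechanism the platform provides; the bit positions follow the field descriptions above (sideband support is assumed to sit in status bit 9, mirroring SBA_ENABLE in the command register).

```c
#include <stdint.h>

/* Placeholders for configuration-space accesses at the AGP capability;
   off is an offset from the capability pointer (CAP_PTR). */
extern uint32_t cfg_read32(int dev, int off);
extern void     cfg_write32(int dev, int off, uint32_t v);

#define AGPSTAT 0x04    /* status register offset in the capability  */
#define AGPCMD  0x08    /* command register offset in the capability */

/* Intersect the capabilities of port and card and enable AGP. */
void agp_enable(int port, int card)
{
    uint32_t ps = cfg_read32(port, AGPSTAT);
    uint32_t cs = cfg_read32(card, AGPSTAT);
    uint32_t both = ps & cs;

    uint32_t rate;                     /* highest common RATE bit      */
    if (both & 4)      rate = 4;       /* 4x in the AGP 2.0 coding     */
    else if (both & 2) rate = 2;       /* 2x (or 8x in AGP 3.0 coding) */
    else               rate = 1;

    /* the queue depth must not exceed the smaller RQ of the two */
    uint32_t rq = (ps >> 24) < (cs >> 24) ? (ps >> 24) : (cs >> 24);

    uint32_t cmd = (rq << 24)
                 | (both & (1u << 9))  /* SBA_ENABLE if both support it */
                 | (both & (1u << 5))  /* over-4G addressing            */
                 | (both & (1u << 4))  /* fast writes                   */
                 | (1u << 8)           /* AGP_ENABLE                    */
                 | rate;               /* DATA_RATE: exactly one bit    */

    cfg_write32(card, AGPCMD, cmd);
    cfg_write32(port, AGPCMD, cmd);
}
```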
The NISTAT register (port and card) defines isochronous transfer capabilities (AGP 3.0 only):
-
Bits [31:24]—reserved (0).
-
Bits [23:16]—MAXBW, the maximum device bandwidth (total for the asynchronous and isochronous transfers) in 32-byte units per 1 μsec.
-
Bits [15:8]—ISOCH_N, maximum number of isochronous transactions over a 1 μsec period.
-
Bits [7:6]—ISOCH_Y, supported isochronous transfer sizes: 00—32, 64, 128, and 256 bytes; 01—64, 128, and 256 bytes; 10—128, 256, and more bytes; 11—256 bytes.
-
Bits [5:3]—ISOCH_L, maximum isochronous transfer latency in μsec (1-5).
-
Bit 2—reserved.
-
Bits [1:0]—Isoch-ErrorCode, isochronous exchange error code (00—no errors). For the port: 01—isochronous request queue overflow. For cards: 01—read buffer underflow, 10—read buffer overflow.
The NICMD register (port and card) controls isochronous transfers (AGP 3.0 only):
-
Bits [15:8]—PISOCH_N, the maximum number of isochronous transactions per 1 μsec period (for cards)
-
Bits [7:6]—PISOCH_Y, isochronous transfer size: 00—32, 64, 128, 256 bytes; 01—64, 128, 256 bytes; 10—128, 256 bytes and more; 11—256 bytes
-
Bits [5:0]—reserved
The AGPCTRL register controls the AGP port proper:
-
Bits [31:10]—reserved
-
Bit 9—CAL_CYCLE_DIS, disabling calibration cycles
-
Bit 8—APERENB, enabling aperture operations
-
Bit 7—GTLBEN, enabling TLB buffer operation (if the port has one)
-
Bits [6:0]—reserved
The APSIZE register (port) sets the aperture size:
-
Bits [15:12]—reserved.
-
Bits [11:8, 5:0]—APSIZE: 111…111—4 MB, 111…110—8 MB, 11110…0—256 MB, 000…000—4,096 MB
-
Bits [7:6]—always 0.
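The APSIZE encoding looks odd but follows a simple rule: the 10-bit field (bits [11:8] and [5:0] concatenated) is a run of ones followed by a run of zeros, and each additional zero doubles the aperture, from 4 MB (all ones) to 4,096 MB (all zeros). A hypothetical decoder:

```c
/* apsize is the 10-bit field formed by bits [11:8] and [5:0];
   each zero bit doubles the aperture size, starting from 4 MB. */
static unsigned aperture_mb(unsigned apsize)
{
    unsigned zeros = 0;
    for (unsigned i = 0; i < 10; i++)
        if (!(apsize & (1u << i)))
            zeros++;
    return 4u << zeros;    /* 0 zeros -> 4 MB ... 10 zeros -> 4,096 MB */
}
```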
The NEPG register (in AGP 3.0 ports) sets the size of the page described in the GART from the list of supported sizes:
-
Bits [15:12]—SEL, selected page size (2^(SEL+12)).
-
Bits [11:0]—a bit map of supported page sizes. A one in bit N indicates that a page size of 2^(N+12) is supported. Bit 0 is always set to one: All ports are required to support 4 KB pages.
The GARTLO[31:12] and GARTHI (port) registers set the starting address of the GART.
6.2.5 AGP Slots and Cards
The AGP graphics controller may be built into the motherboard or implemented on an expansion card and installed into the AGP slot. External AGP cards are similar to PCI cards (Fig. 6.18), but they use a high-density connector with a two-level (EISA-style) contact-pad arrangement. The connector itself is located farther from the back panel than the PCI connector.
AGP interface circuits can be powered by supplies of three different voltages (Vddq): 3.3 V (for 1x and 2x), 1.5 V (for 2x and 4x), and 0.8 V (for 8x). The RST# and CLK signals are always 3.3 V. To prevent installing a card into the wrong slot, both cards and slots have mechanical keys.
-
AGP 1.0 slots and cards use 3.3 V; they have keys in place of contacts 22-25: a rib in the slot and a notch in the card (Fig. 6.18, a).
-
AGP 2.0 slots and cards use 1.5 V; they have keys in place of contacts 42-45.
-
The universal AGP 2.0 slot (3.3 V/1.5 V) has no ribs, while the universal card has both notches. A universal motherboard determines the voltage at which the installed card operates by the TYPEDET# signal: If this signal's contact is not connected to anything, the card is of the 3.3 V type; if this contact is grounded, the card is of either the 1.5 V or the universal type. A universal card detects the nominal buffer supply voltage by the voltage level on the Vddq contacts (3.3 V or 1.5 V). In this way, cards are matched with the proper ports.
-
AGP 3.0 slots and cards use a 0.8 V power supply, but their keys are the same as the keys of the 1.5 V cards (in place of contacts 42-45). The card recognizes an AGP 3.0 port by the grounded MB_DET# line (in AGP 2.0 ports, this line is not connected).
-
The universal AGP 3.0 slot can work with both 8x (0.8 V power supply) and AGP 2.0 4x (1.5 V power supply) cards. The 0.8 V power supply voltage and the 8x mode are selected by the port logic.
To operate in the 2x/4x/8x modes, receivers need a reference voltage, Vref. Its nominal value is 0.4×Vddq for 3.3 V, 0.5×Vddq for 1.5 V, and 0.233×Vddq for 0.8 V. The receiver reference voltage is generated on the transmitters' side. The graphics device places the signal for the port on contact A66 (Vrefgc); the port (chipset) supplies the signal for the AGP device on contact B66 (Vrefcg).
When transferring data in the 8x mode, the data on the AD bus are dynamically inverted. The DBI_LO signal indicates inversion on lines AD[15:0]; the DBI_HI signal indicates inversion on lines AD[31:16]. The decision on changing the inversion state is made by comparing the output data with the data of the previous cycle: If the number of switched lines in the corresponding half of the AD bus is greater than eight, the corresponding DBI_xx signal changes its state to the opposite. Consequently, no more than eight signal lines are switched on each half of the AD bus, which allows current surges to be reduced. Automatic transceiver calibration is used in the 8x mode; it allows the transceivers' parameters to be matched with those of the line and the partner. The calibration is done either statically (during the initial launch) or dynamically (during operation), in order to compensate for parameter drift due to temperature changes.
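The inversion decision can be expressed compactly: for each 16-bit half of the AD bus, count how many lines would toggle relative to the previous cycle and flip the corresponding DBI signal (inverting the driven data) if more than eight would. A sketch, with each bus half held in a plain integer:

```c
#include <stdint.h>

static unsigned popcount16(uint16_t v)
{
    unsigned n = 0;
    while (v) { n += v & 1; v >>= 1; }
    return n;
}

/* Decide dynamic bus inversion for one 16-bit half of the AD bus.
   prev is the value driven in the previous cycle, next the new data,
   *dbi the current state of the DBI_xx signal. Returns the value
   actually driven onto the lines. */
static uint16_t dbi_encode(uint16_t prev, uint16_t next, int *dbi)
{
    uint16_t out = *dbi ? (uint16_t)~next : next;
    if (popcount16(prev ^ out) > 8) {   /* too many lines would switch */
        *dbi ^= 1;                      /* flip the inversion state    */
        out = (uint16_t)~out;
    }
    return out;                         /* now at most 8 lines switch  */
}
```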
Table 6.13 shows the functions of AGP slot contacts for AGP 3.0; functions of AGP 1.0 and 2.0 contacts are shown in parentheses. Because two contacts for a Vcc 3.3 power supply on the AGP 2.0 universal cards are lost to the keys, leaving only four of them, the card’s consumption current is limited (the maximum allowable current for each contact is 1 A). The auxiliary power supply line, 3.3 Vaux, which is used to feed the PME# signal generation circuits in the “sleep” mode, is also missing on the universal cards.
Table 6.13: Functions of AGP Contacts
[3]AGP 3.0 only.
[1]3.3 V type cards and slots do not have inverted strobes (i.e., do not support the 4x/8x modes).
[2]1x cards and slots do not need a reference voltage.
In addition to the AGP signals proper, AGP provides for USB bus signals. This bus is used to send USB data and control signals to peripheral devices, usually a USB capable video monitor. The lines are USB+, USB-, and the OVRCNT# signal that indicates current overload on the power rail supplying +5 V to the monitor.
The PME# signal pertains to the Power Management Interface. When there is a supplementary 3.3 Vaux power supply, the card can use this signal to initiate “wake up.”
The AGP Pro specification defines a more powerful connector, which allows a fourfold increase in the power supplied to the graphics controller. One-way compatibility is preserved: AGP cards can be installed into an AGP Pro slot, but not vice versa. Currently, the AGP Pro connector has been abolished, and a supplementary cable is used to supply power to the graphics card.
The AGP Pro connector has additional contact banks at each end of the regular AGP connector for the GND, 3.3 V, and 12 V power lines. These contacts' functions are given in Table 6.14. To prevent the improper installation of a regular AGP card, the additional part of the AGP Pro connector, which is closer to the back panel, is covered by a removable plastic cover. An AGP Pro card can also use one or two neighboring PCI slots in several ways: simply mechanically, as a support point; to recruit their power supply connectors for additional power; or to use their functional PCI connectors as well.
Table 6.14: Additional Contacts of the AGP Pro Connector
In total, an AGP Pro card can consume up to 110 W of power, which it takes off the 3.3 V (up to 7.6 A) and 12 V (up to 9.2 A) lines of the main AGP connector, the supplementary AGP Pro power supply connector, and one or two PCI connectors. High-power AGP Pro cards (50-110 W) take up two PCI slots; low-power cards (25-50 W) take up one PCI slot. Accordingly, their rear-panel mounting bracket is two or three times wider than usual. Additionally, such cards have front-panel mounting hardware. In the supplementary connector, the PRSNT1# contact, when grounded, indicates the card's presence. The PRSNT2# contact indicates the card's power consumption: up to 50 W when not connected, and up to 110 W when grounded.
[1]Here, the term "interrupt" means the intrusion of AGP commands and PCI transactions into the data flow; it has nothing to do with CPU interrupts.
6.3 PCI Express
The PCI Express is a new component-interconnect architecture introduced under the auspices of PCI SIG; it is also known as 3rd Generation Input-Output (3GIO). Here, connection of devices using parallel buses is replaced with point-to-point serial connections using switches. Many PCI bus software features are preserved in this architecture, which provides smooth migration from the PCI to PCI Express. The interface introduced new capabilities such as control over the quality of service (QoS) and the usage and budgeting of connections. The PCI Express protocol’s characteristic features are low overheads and delay times.
The PCI Express is positioned as a universal I/O architecture for computers of different classes, telecommunications devices, and embedded computer systems. High bandwidth is achieved at a price comparable to that of the PCI, or even lower. Its application area ranges from on-board microchip interconnections to intercard plug-in and cable connections. The high throughput per connection contact allows the number of connection contacts to be minimized. A small number of signal lines makes it possible to use compact designs. The interface's versatility makes it possible to use one software model for all form factors.
A PCI Express Link is a pair of opposite simplex channels connecting two components. Over these channels, packets carrying commands and transaction data, messages, and control transfers are transmitted. All PCI read and write transactions are implemented in the PCI Express as split transactions using a packet protocol; a transaction requester and a completer perform the transfers. There are four addressing spaces in the PCI Express: memory, I/O, configuration, and messages. The message space, new compared with the PCI, is used to transfer packetized sideband PCI signals: the INTx line interrupts, power consumption control signals, etc. In this way, virtual wires are implemented. A PCI Express port contains a transmitter, a receiver, and the components necessary to assemble and disassemble packets.
An example of the I/O topology illustrating the PCI Express architecture is shown in Fig. 6.20. The root complex is the central item of the architecture; it connects the I/O hierarchy with the core: the processor(s) and the memory. The root complex can have one or more PCI Express ports; each of these ports defines its own hierarchy domain. Each domain consists of one endpoint or a sub-hierarchy: several endpoints connected by switches. The capability of direct peer-to-peer communications between members of different domains is not mandatory, but can be present in specific implementations. To provide transparent peer-to-peer communications, switches must be located in the root complex. The central processor must be able to communicate with any device of the domain, and all devices must be able to access the memory. The root complex must generate requests to the configuration space: Its role is analogous to that of the PCI host bridge. The root complex can generate I/O requests as a requester; it can also generate locked requests, which must be executed as atomic operations. The root complex must not support locked requests as a completer, in order to prevent I/O deadlocks.
An endpoint is a device capable of initiating and/or executing PCI Express transactions on its own behalf, or on behalf of a non-PCI Express device (a USB host controller, for example). An endpoint must be visible in one of the hierarchy's domains. An endpoint must have a type 0 configuration space header (see Section 6.1.7) and respond as a completer to all configuration requests. All endpoints use the MSI mechanism to signal interrupts. There are two types of endpoints in the PCI Express: legacy endpoints, and endpoints built according to the PCI Express principles. Legacy endpoints are given more leeway:
-
They are not required to support more than 4 GB addressing space.
-
The I/O does not have to be absolutely relocatable using the base address registers (BAR), so I/O access transactions may be needed (memory access transactions are preferable).
-
The range of occupied addresses must be no less than 128 bytes (the boundary requirements were formed rigidly in the PCI-X).
-
The configuration space does not have to be expanded (it can remain 256 bytes).
-
The software model may use locked requests addressed to the device (but not issued from it).
A switch has several PCI Express ports. In terms of logic, a switch is a set of several virtual PCI-PCI bridges that connect the switch's ports to its own internal local bus. A virtual PCI bridge is described by configuration registers with a type 1 header (see Section 6.1.7). The port that opens toward the top of the hierarchy is called the upstream port; through it, the switch is configured as a set of PCI bridges. The switch transfers packets of all types between the ports using the address information relevant to the given packet type. The switch does not propagate locked requests from its downstream ports. The arbitration between the switch's ports can take virtual channels into account and, accordingly, distribute bandwidth fairly according to the devices' demands. The switch cannot split packets into smaller parts (many PCI bridges can do something analogous to this operation).
A PCI Express-PCI bridge connects the PCI/PCI-X bus hierarchy to the I/O fabric.
The fabric is configured using a configuration mechanism 100% compatible with the PCI 2.3, or using the expanded configuration space. Using virtual bridges, each PCI Express link is presented as a logical PCI bus with a unique number. Logical devices appear in the configuration space as PCI devices, each of which can have from one to eight functions, each with its own set of configuration registers.
The PCI Express architecture is divided into three layers:
-
The transaction layer is the uppermost layer, responsible for assembling and disassembling Transaction Layer Packets (TLP). These packets are used for read and write transactions, and also for signaling certain events. Each TLP has a unique identifier that allows a response packet to be sent to its sender. Various forms of addressing are used for TLPs, depending on the transaction type. Packets can carry attributes that disable coherence control (No Snoop) or allow Relaxed Ordering. Each transaction that requires an answer is executed as a split transaction (see the section on the PCI-X). The transaction layer is responsible for flow control, which is implemented on the basis of a credit mechanism.
-
The data link layer is the middle layer in the stack. It is responsible for controlling the link, detecting errors, and organizing repeat transfers until they succeed or the link is declared to have failed. The data link layer adds packet numbers and control codes to the packets it receives from the transaction layer. The data link layer itself generates and receives Data Link Layer Packets (DLLP), which are used to control the link.
-
The physical layer isolates the data link layer from all the details of signal transmission. It is made up of two parts. During transmission, the logic sub-block distributes data over the lanes, scrambles it, performs 8B/10B encoding and framing, and converts the result into serial code; these actions are repeated in reverse order on reception. Additional 8B/10B control characters are used for control signaling. The logic sub-block is also responsible for negotiating the link's parameters, its initialization, etc. The electrical sub-block is responsible for matching electrical parameters, synchronization, receiver detection, etc. The layered model adopted in the PCI Express allows the physical layer or its sub-blocks to be replaced with more effective coding and signaling schemes when such appear, without disturbing the other layers. The interface between the physical and data link layers depends on their implementation and is decided by the manufacturer. The external physical layer interface is clearly defined, and allows devices of different origins to be interconnected. The interface's transmitters and receivers are DC-decoupled, which makes interface matching independent of the technology used in manufacturing the components. To test a PCI Express device for compliance with the electrical characteristic requirements, it is sufficient to connect it to a special tester.
The PCI Express supports differentiated classes by the quality of service (QoS), providing the following capabilities:
-
Link resource allocation for each stream class (virtual channels)
-
Policy configuration by QoS for individual components
-
Specification of QoS for each packet
-
Creation of isochronous links
To support QoS, each TLP is tagged with a three-bit traffic class (TC) descriptor. This allows the transferred data to be separated by type, and differentiated conditions to be created for transferring different classes of traffic. Transaction execution order is kept within class boundaries, but not across classes. To differentiate the transfer conditions for different types of traffic, virtual channels can be created in the PCI Express switching elements. A virtual channel consists of physically dedicated sets of buffers and packet-routing mechanisms that process only the traffic of their virtual channel. When input packets are routed, arbitration is performed based on the virtual channel numbers and their priorities. Each port that supports virtual channels maps specific classes of packets onto the corresponding virtual channels. Any number of classes can be mapped onto one channel. By default, all traffic is tagged as class 0 (TC0) and is sent over the default channel 0 (VC0). Virtual channels are created as required.
The main method of signaling interrupts in the PCI Express is by sending messages (MSI) using 64-bit addressing (32-bit addressing is allowed only for legacy devices). However, in order to provide software compatibility, a device can emulate INTx# interrupts by sending these requests in special packets. As a rule, the interrupt controller, which is located in the root complex, receives both MSI and emulated INTx# interrupts. TC0 packets are used for the INTx# interrupt emulation signaling. When virtual channels are used, the MSI interrupts must employ the traffic class corresponding to the data traffic to which the given interrupts pertain. Otherwise, synchronization could be lost because of the relative reordering of the various classes of traffic. Synchronization can be supported by the same means as in the PCI/PCI-X: by reads (even of zero length) over the switch (bridge). Resorting to this technique is unavoidable if the interrupts pertain to data of different classes (virtual channels).
Advanced power management and power budgeting (PM) make it possible to do the following:
-
To identify each function’s PM capability
-
To switch a function into the necessary power consumption state
-
To obtain information about the current power consumption state of the function
-
To generate a wake up request with the main power supply turned off
-
To power up devices sequentially
Power management event signaling can be implemented in two ways: first, by packet emulation of the PME# signal (analogous to the INTx# signal emulation); second, by using the native PCI Express signaling with appropriate messages. When the PME# emulation is used, the signal source is identified by sequentially reading the configuration registers of the devices capable of generating this signal. The native signaling is much more practical: The identifier of the signal source is contained in the message.
Devices can be hot-plugged and hot-swapped using either the existing mechanisms (PCI Hot-Plug and Hot-Swap) or the PCI Express native mechanisms, without requiring additional signals. The standard hot-plug model involves the following elements:
-
A slot power supply indicator, which forbids insertion or extraction of the card (blinking indicates the transition process into the depowered state).
-
An attention indicator, which signals problems with the device installed into the given slot (blinking facilitates locating the problem slot).
-
A manually operated card latch.
-
A manual latch status sensor, which allows the system software to detect an unlocked latch.
-
An electromechanical blocking mechanism, preventing extraction of a card with the power on. There is no special blocking-control mechanism; if blocking is present, it must operate directly from the port’s power lines.
-
An “Attention” button to request a hot-plug connection.
-
A software user interface for requesting a hot-plug connection.
-
A slot numbering system allowing the required slot to be found visually.
CRC checking of all transaction and control packets is employed to provide transaction reliability and data integrity. The requester considers a transaction executed when it receives a confirmation message from the completer (only posted writes to the main memory do not require confirmations). The minimal error-handling capabilities are analogous to those of the PCI; detected errors are indicated in the function's configuration registers (the status register). Expanded error-handling capabilities provide primary information to the advanced error isolation and recovery procedures, as well as to the error monitoring and logging procedures. Errors are divided into three categories, which allows adequate recovery procedures to be used. The categories are as follows:
-
Correctable errors: automatically call the hardware recovery (repeat) procedure and do not require software intervention for a normal execution of the transaction.
-
Fatal errors: require a reset for reliable resumption of operation. Some transactions having nothing to do with the error may be lost because of this reset.
-
Non-fatal errors: do not require a reset to resume operation. These errors may cause loss of only some transactions directly involved with the error.
The software model of the PCI Express is compatible with the PCI in the following aspects:
-
PCI Express devices are detected, enumerated, and configured using the same configuration software as in the PCI (PCI-X 2.0).
-
Existing operating systems do not need to be modified in any way.
-
Drivers of existing devices are supported without any modifications.
-
New PCI Express functional capabilities are configured and enabled following the general concept of PCI device configuration.
A basic link consists of two low-voltage differential signal pairs: transmitting and receiving. Data are transmitted using self-clocking coding, which makes high transfer speeds attainable. The basic speed is 2.5 Gbps of raw bits (including the 8B/10B coding overhead) in each direction; higher speeds are planned for the future. Signal pairs (lanes) can be aggregated symmetrically in each direction to scale up the throughput. The specification allows links to be configured in 1, 2, 4, 8, 12, 16, and 32 lane widths, with the transferred data distributed among the lanes byte by byte. In this way, speeds of up to 80 Gbps can be reached, which approximately corresponds to a peak speed of 8 GBps. When the hardware is initialized, the number of lanes and the transfer speed are negotiated for each link; the negotiation is conducted purely at the hardware level, without any software involvement. The negotiated link parameters remain in effect for the entire duration of subsequent operation.
6.3.1 PCI Express Transactions and Packet Formats
All PCI Express traffic is conducted in packets. Of these, the transaction layer packets (TLP) are of practical interest. Each TLP starts with a header, which may be followed by a data field and an optional trailer known as a digest: a 32-bit CRC. The length of all packet fields is a multiple of a double word (DW, 32 bits). The packet header contains the following mandatory fields (a parsing sketch follows the list):
-
Fmt [1:0]—format field defining the packet type: bit 0—header length (0—3 DW, 1—4 DW); bit 1—data field presence (0—no data field).
-
Type [4:0]—type field, defining the packet (transaction) type (Table 6.15).
-
TC [2:0]—traffic class field.
-
TD—digest flag. A value of one indicates using a 32-bit CRC at the end of the packet; this CRC protects all the packet’s fields, which do not change in the process of the packet’s traveling via the PCI Express switches. This extra control is used for important transactions; for regular transactions, a channel level CRC control is employed.
-
EP—error flag. Indicates that an error occurred during reading of the transferred data and the data may be invalid (poisoned data).
-
Length[9:0]—data field length in double words: 000…001—1; 111…111—1,023; 000…000—1,024.
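A parsing sketch for the first header double word is shown below; the bit offsets follow the field descriptions above and should be treated as illustrative rather than as a normative layout.

```c
#include <stdint.h>

/* Pick apart the first double word of a TLP header (illustrative). */
typedef struct {
    unsigned fmt;     /* header length and data-field presence */
    unsigned type;    /* transaction type                      */
    unsigned tc;      /* traffic class                         */
    unsigned td;      /* 1 = 32-bit digest appended            */
    unsigned ep;      /* 1 = poisoned data                     */
    unsigned length;  /* data length in DWs, 0 means 1,024     */
} tlp_hdr;

static tlp_hdr tlp_parse_dw0(uint32_t dw0)
{
    tlp_hdr h;
    h.fmt    = (dw0 >> 29) & 0x3;
    h.type   = (dw0 >> 24) & 0x1F;
    h.tc     = (dw0 >> 20) & 0x7;
    h.td     = (dw0 >> 15) & 0x1;
    h.ep     = (dw0 >> 14) & 0x1;
    h.length =  dw0        & 0x3FF;
    return h;
}
```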
A transaction identifier is the identifier of the requester combined with an 8-bit tag; a transaction descriptor contains the transaction attributes (RO and NS), as well as the traffic class TC. The tag is used only for transactions requiring completion packets. By default, a requester can have up to 32 uncompleted transactions, so only five of the tag's 8 bits are used. However, the tag can be switched into the expanded mode, in which all 8 bits are used (up to 256 uncompleted requests).
Depending on the transaction type, different address and routing formats are used in packets. The address is specified to within an aligned double word (bits [1:0] = 00). For all data-carrying transactions (except messages), one of the header's bytes carries bits that enable bytes in the first and the last double words of the data field (all bytes in between are assumed to be enabled). Consequently, a packet can carry an arbitrary number of adjacent bytes starting from an arbitrary address.
Memory transactions can use either short (32-bits) or long (64-bits) addresses. The combined transaction address and length must not cause the transaction to cross the 4 KB page boundaries. Memory write transactions are executed as posted writes, and do not require confirmations.
I/O transactions have been left in the PCI Express for reasons of compatibility with the PCI/PCI-X and old software, but it is planned to abolish them. A 32-bit address is used in these transactions, and only one double word of data is transferred.
Configuration transactions are addressed and routed using the device identifier; these transactions use 32-bit addressing, and transfer only one double word of data. The format of the device identifier is the same as used in the PCI: an 8-bit bus number field, a 5-bit device number field, and a 3-bit function number field.
Message transactions are routed depending on the value of the rrr field: 000—toward the root complex; 001—by address; 010—by identifier; 011—broadcast from the root complex; 100—local message (goes no further than the receiver); 101—gathered and routed toward the root complex. One byte of the message carries the message code; some messages do not use the data field. Messages with rrr=100 only change the receiver's status (this is how, for example, the virtual INTx# wires are implemented). Messages with rrr=101 are used for one of the power management message types: A switch forwards such a message to the upstream port only after it receives this type of message from all downstream ports. Messages are used to emulate wire interrupts, to signal power management events and errors, and also for interdevice communications.
To emulate the INTx# interrupts (four virtual wires), eight message codes are employed (four for setting and four for clearing each signal). The switches (and the root complex) must monitor the status of the virtual wires on each of the downstream ports, taking into account message arrivals from the corresponding devices (the PCI-specific cyclic INTx# line rotation is preserved). From these statuses, the status of the upstream port's virtual wires is generated using an OR function, and corresponding messages are generated when the status of the virtual wires changes. The root complex performs an analogous task, and conveys the virtual signal to the real interrupt controller. In this way, message packets make it possible to "connect" the logical INTx# wires of devices on all logical buses.
6.3.2 Packet Transfer and Connection Bandwidth
TLPs, which are used to perform transactions, arrive at the data link layer. The main task of the data link layer is to provide reliable delivery of TLPs. For this purpose, the data link layer frames each TLP with a 12-bit sequence number and a 32-bit LCRC field (link-level CRC). Consequently, the data link layer adds 6 bytes of overhead to each TLP. For each TLP, the receiver must return a positive acknowledgement: a data link layer packet (DLLP) named Ack. If there is no acknowledgement, a timeout mechanism forces the transmitter to resend the packet. A negative acknowledgement mechanism is also provided for; it causes a repeat transfer without waiting for the timeout.
DLLPs are 6 bytes long: the information part is four bytes long and the CRC is 16 bits (two bytes) long. Besides confirming TLPs, DLLPs are used for flow control, and also to control the link’s power consumption.
The physical layer adds its framing to the transferred packets: A special STP (for a TLP) or SDP (for a DLLP) code is sent in front of each packet. Each packet is terminated by an END code. These special codes are different from the codes representing 8B/10B encoded data.
Now that the packet structure has been considered, the useful bandwidth of a basic PCI Express link (1 lane wide, 2.5 Gbps overall bandwidth) can be evaluated. Consider the shortest transaction: a double-word I/O write. The corresponding TLP is 4 double words long (three for the header and one for the data), or 16 bytes; the data link layer adds six more bytes to it, so 22 bytes arrive for 8B/10B coding. The physical layer adds two bytes of its framing, bringing the total to 24. Thus, 240 bits (24 x 10) will be sent into the line, which at 2.5 Gbps will take 96 nsec of the forward channel time to transmit. I/O port write transactions require an acknowledgement: an oncoming three-DW (12-byte) TLP that, after the data link and physical layers add their framings, expands to 20 bytes, and to 200 bits after going through the 8B/10B encoding. These will take 80 nsec of the reverse channel time to transmit. Now the data link layer acknowledgements of each TLP—the 6-byte Ack—need to be added, which with the 2-byte physical layer framing turn into 8 bytes, and after the 8B/10B encoding into 80 bits. These take 32 nsec in each channel. Altogether, an I/O port write transaction takes 96 + 32 = 128 nsec (0.128 μsec) in the forward channel, and 80 + 32 = 112 nsec in the reverse. The maximum transfer speed of continuous port writes is V = 4/0.128 = 31.25 MBps. The reverse channel is also occupied, with a load factor of 112/128 = 0.875. The useful data transfer speed is similar to the capabilities of the standard PCI bus (32 bits, 33 MHz), which takes four bus cycles to execute this type of transaction. A PCI Express I/O read transaction produces the same results (the PCI results will be worse because of the extra clock needed for the turnaround).
Now, the type of transaction most favorable for efficiency comparison purposes will be considered: writing a packet of 1,024 double words into the memory (using 32-bit addressing). Only one TLP is needed for this (a completion transaction is not required). The length of the packet is 3 DW + 1,024 DW = 4,108 bytes. The data link and physical layers add 6 + 2 = 8 bytes to this number, producing 4,116 bytes, which after going through 8B/10B encoding grows to 41,160 bits, or 16.5 μsec of transmission time. The data transfer speed is 4,096/16.5 ≈ 248 MBps: the efficiency level of the PCI (32 bits/66 MHz) at long burst transfers. The loading of the reverse channel by the data link layer acknowledgements is negligibly small in this case. The speed of reading from the memory is somewhat lower, because each read transaction consists of two TLPs: a read request (three or four DWs) and a completion packet with the data ((3 + N) DWs long). With a large packet length, the proportion of the additional double words is small.
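Both estimates follow from the same overhead bookkeeping, which is easy to reproduce (a sketch; the constants are the data link layer and framing sizes given above, with 10 bits per byte after 8B/10B encoding and 0.4 nsec per bit at 2.5 Gbps):

```c
#include <stdio.h>

/* Time to transmit a TLP of hdr_dw header DWs and data_dw data DWs
   over a basic link: +6 bytes data link layer, +2 bytes framing. */
static double tlp_ns(int hdr_dw, int data_dw)
{
    int bytes = 4 * (hdr_dw + data_dw) + 6 + 2;
    return bytes * 10 * 0.4;            /* 10 bits/byte, 0.4 ns/bit */
}

int main(void)
{
    /* I/O write: 3 DW header + 1 DW data, plus the 6-byte Ack DLLP */
    double fwd = tlp_ns(3, 1) + (6 + 2) * 10 * 0.4;
    printf("I/O write: %.0f ns -> %.2f MBps\n", fwd, 4 / fwd * 1000);

    /* posted memory write of 1,024 DWs: a single TLP, no completion;
       prints ~249 MBps (the text rounds the time to 16.5 usec -> 248) */
    double wr = tlp_ns(3, 1024);
    printf("Memory write: %.2f usec -> %.1f MBps\n", wr / 1000, 4096 / wr * 1000);
    return 0;
}
```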
If the reverse channel were fully loaded with useful traffic, the PCI Express bandwidth could be considered doubled thanks to the full-duplex operation capability. However, no such doubling is possible in the I/O port write example, because the reverse channel is loaded rather heavily with the acknowledgements. Calculating the useful speed per signal connector contact, a speed of 248 × 2/4 = 124 MBps per contact is obtained in the most favorable full-duplex operation mode. For comparison, consider the PCI-X533: It provides a peak write speed approaching 533 × 4 = 2,132 MBps. Its results for memory read operations are much more modest: The peak speed is only 533 MBps. With about 50 signal contacts employed (not counting the many ground lines), this gives about 10-40 MBps per contact. The AGP uses even more lines to produce the same peak speed, so the claims of high contact efficiency for the PCI Express are based on solid ground. Neither the PCI/PCI-X nor the AGP can provide full-duplex operation.
The above calculations are for a basic link; by increasing the width to 32 lanes, a maximum memory write speed of 248 × 32 = 7,936 MBps can be obtained. And if the total load of a full-duplex link is considered, the PCI Express can provide a total bandwidth potential of 15,872 MBps. Therefore, in its most efficient version, the PCI Express leaves the AGP, with its 2,132 MBps peak speed, far behind. However, a low contact count cannot be claimed here: A 32x PCI Express link requires 2 × 2 × 32 = 128 signal contacts (the AGP has fewer).
6.4 LPC Interface
The Low Pin Count (LPC) interface is employed to connect local devices—FDD controllers, serial and parallel ports, keyboards, audio codecs, BIOS, etc.—that were previously connected via the X-Bus or ISA bus. The new interface was introduced to replace the clumsy asynchronous ISA bus, with its numerous signals, which is rapidly becoming obsolete if it is not already. The interface provides the same access cycles as the ISA: memory and I/O read and write, DMA, and bus mastering. Devices can generate interrupt requests. Unlike the 24-bit-address ISA/X-Bus buses, which allow only the lower 16 MB of memory to be addressed, the LPC interface has 32-bit memory addressing, which provides access to 4 GB of memory. Its 16-bit port addressing provides access to a 64 KB port address space. The interface is synchronized with the PCI bus, but devices can insert an unlimited number of wait cycles. The interface is software-transparent—like the ISA/X-Bus—and does not require any drivers. The controller of the LPC interface is a PCI bridge device. The interface's bandwidth is practically the same as that of the ISA buses. The LPC 1.0 specification provides bandwidth calculations for the interface and the devices that use it. With FIFO buffers, the interface is used most efficiently in the DMA mode. In this case, the main user is the LPT port: At a transfer rate of 2 MBps, it will take 47% of the bandwidth. Next comes the infrared port: 4 Mbps (11.4%). The rest of the devices (FDD controller, COM ports, audio codecs) need even smaller shares; as a result, all of them working together take up to 75% of the bandwidth. Consequently, switching these devices from the ISA/X-Bus to the LPC should not cause bigger bandwidth-use problems than those in the older buses.
The interface has only seven mandatory signals:
-
LAD [3:0]—bidirectional multiplexed data bus
-
LFRAME#—host-controlled indicator of the beginning and the end of a cycle
-
LRESET#—reset signal, the same as the RST# signal on the PCI bus
-
LCLK—synchronization signal (33 MHz), the same as the CLK signal on the PCI bus
Supplementary LPC interface signals are as follows:
-
LDRQ#—Encoded DMA/Bus Master request from peripheral devices.
-
SERIRQ—serially encoded interrupt request line. Used if there are no standard ISA-style interrupt request lines.
-
CLKRUN#—used to halt the bus (in mobile systems). Required only for devices that need DMA or Bus mastering in systems capable of halting the PCI bus.
-
PME#—Power Management Event. May be activated by peripheral devices, as in the PCI.
-
LPCPD#—Power Down. Used by the host to tell peripheral devices to prepare for the power cut-off.
-
LSMI#—an SMI# interrupt request to repeat an I/O instruction.
The LFRAME# and LAD[3:0] signals are clocked by the rising edge of the LCLK signal. During each clock tick of an exchange cycle, fields of the protocol elements are sent over the LAD[3:0] bus. A general timing diagram of the LPC exchange cycle is shown in Fig. 6.21. The host starts each exchange cycle by asserting the LFRAME# line and placing the START field on the LAD[3:0] lines. Upon the LFRAME# signal, all peripheral devices must release the LAD[3:0] bus; the START field indicates that a bus cycle is to follow. In the next clock cycle, the host deasserts the LFRAME# signal and places the exchange-cycle-type code, CYCTYPE, on the LAD[3:0] bus. The LFRAME# signal may remain asserted longer than one clock cycle, but the exchange cycle starts (the START field is considered valid) in the last clock cycle in which the LFRAME# signal is asserted.
The host uses the LFRAME# signal to abort an exchange cycle (in case of a time-out error, for example) by asserting it and placing the corresponding code on the LAD [3:0] lines.
The START field can have the following codes:
-
0000—start of a cycle in which the host addresses a device
-
0010—granting access for master device 0
-
0011—granting access for master device 1
-
1111—forced cycle termination (abort)
The rest of the codes are reserved.
The CYCTYPE field sets the type and direction of the transfer. Bit 0 sets the direction: 0—read, 1—write. Bits [2:1] set the access type: 00—I/O, 01—memory, 10—DMA, 11—reserved. Bit 3 is reserved and is set to 0.
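To make the encoding concrete, here is a minimal C sketch; the lpc_cyctype() helper and the enum names are illustrative inventions, and only the bit layout itself comes from the description above.

#include <stdint.h>
#include <stdio.h>

/* Build the 4-bit CYCTYPE field: bit 0 is the direction (0 = read,
   1 = write); bits [2:1] are the access type (00 = I/O, 01 = memory,
   10 = DMA); bit 3 is reserved and always 0. */
enum lpc_access { LPC_IO = 0, LPC_MEM = 1, LPC_DMA = 2 };

static uint8_t lpc_cyctype(enum lpc_access type, int is_write)
{
    return (uint8_t)(((type & 0x3) << 1) | (is_write & 0x1));
}

int main(void)
{
    printf("memory read: %X\n", lpc_cyctype(LPC_MEM, 0)); /* prints 2 (0010) */
    printf("I/O write:   %X\n", lpc_cyctype(LPC_IO, 1));  /* prints 1 (0001) */
    return 0;
}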
The TAR (Turn ARound) field is used to change the “owner” of the LAD [3:0] bus; it takes two clock ticks to complete. In the first clock tick, the previous owner places the 1111 code on the LAD [3:0] lines; in the second, it switches the buffers into the high-impedance state.
The ADDR field is used to send the address. In the memory cycle, it takes 8 clock ticks (32 bits); in the I/O cycle, it takes 4 clock ticks. The upper bits are transmitted first (to get the address decoder working earlier).
Data are sent in the DATA field. Sending each byte requires 2 clock ticks; the lower nibble is sent first. In multiple byte transmissions, the lower byte is sent first.
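The nibble and byte ordering is easy to get wrong, so a short sketch may help; lpc_put_byte() is a hypothetical name, and only the lower-nibble-first rule comes from the text above.

#include <stdint.h>
#include <stdio.h>

/* Serialize one data byte into the two LAD[3:0] nibbles of the DATA
   field, lower nibble first (2 clocks per byte). */
static void lpc_put_byte(uint8_t b, uint8_t lad[2])
{
    lad[0] = b & 0x0F;         /* clock 1: lower nibble */
    lad[1] = (b >> 4) & 0x0F;  /* clock 2: upper nibble */
}

int main(void)
{
    uint8_t lad[2];
    lpc_put_byte(0xA5, lad);
    printf("%X %X\n", lad[0], lad[1]);  /* prints "5 A" */
    return 0;
}

For multi-byte transmissions, the same routine is simply applied to the lower byte first, then to the higher ones.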
The SYNC field is used by the addressed device to add wait states. It may contain the following codes:
-
0000—ready (without errors). For DMA, indicates request deassertion for the given channel.
-
0101—short wait (a few clocks).
-
0110—long wait.
-
1001—ready and a DMA channel request is present (not allowed for other types of access).
-
1010—error: Data have been transmitted but conditions have arisen that would generate the SERR# or IOCHK# signals on the PCI or ISA buses (for DMA, also means the request signal deassertion).
The rest of the codes are reserved.
The synchronization field controls the transmission, the introduction of wait cycles, and the time-out mechanism. Having begun the cycle, the host reads the synchronization field. If the device being addressed does not answer within three clocks, the host considers that there is no such device on the bus and terminates the transaction. If the host receives a short wait code, it waits until the code changes to ready or error, but after 8 clocks it terminates the transaction on a time-out. There is no time-out limit for the long wait code; it is the addressed device's responsibility not to hang the bus. When the host drives the SYNC field, the target device must wait as long as needed for the host to become ready, without applying any time-outs. In the fastest execution, the SYNC field takes one clock cycle.
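These rules amount to a small host-side polling loop. Below is a minimal C sketch under stated assumptions: lpc_read_lad() is a simulated stand-in for sampling LAD [3:0] on each clock, and all names are illustrative rather than taken from the specification.

#include <stdint.h>
#include <stdio.h>

/* Simulated LAD[3:0] samples, one per LCLK; a real host would read the
   bus. The assumed sequence here: two short waits, then ready. */
static const uint8_t samples[] = { 0x5, 0x5, 0x0 };
static int pos;
static uint8_t lpc_read_lad(void) { return samples[pos++]; }

enum { SYNC_READY = 0x0, SYNC_SHORT = 0x5, SYNC_LONG = 0x6,
       SYNC_READY_MORE = 0x9, SYNC_ERROR = 0xA };

/* Host-side SYNC polling with the time-outs described above.
   Returns the final SYNC code, or -1 on a time-out/absent device. */
static int lpc_wait_sync(void)
{
    for (int clocks = 1; ; clocks++) {
        uint8_t sync = lpc_read_lad();
        if (sync == SYNC_READY || sync == SYNC_READY_MORE ||
            sync == SYNC_ERROR)
            return sync;                  /* cycle completes        */
        if (sync == SYNC_LONG)
            continue;                     /* no time-out limit      */
        if (sync == SYNC_SHORT) {
            if (clocks >= 8) return -1;   /* short-wait time-out    */
        } else if (clocks >= 3) {
            return -1;                    /* no device answered     */
        }
    }
}

int main(void)
{
    printf("SYNC result: %d\n", lpc_wait_sync());  /* prints 0 (ready) */
    return 0;
}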
Fig. 6.22 shows the sequence of cycles for the host to access memory or I/O (fields supplied by the device are marked in gray). In all of these accesses, 1 byte is sent. A memory read, assuming 5 SYNC field clocks and an EPROM access time of 120 nsec, will require 21 clocks (0.63 μsec), which gives a memory-read throughput of 1.59 MBps. If the memory is pipelined, subsequent accesses will be executed faster. For a memory write, the SYNC field takes 1 clock, and the entire cycle takes 17 clock ticks (0.51 μsec), giving a memory-write throughput of 1.96 MBps. When addressing I/O, the address field is shorter; there is one SYNC clock and no wait clocks. Consequently, these cycles take 13 clock ticks to execute (0.39 μsec), giving an I/O read/write throughput of 2.56 MBps.
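These figures are easy to reproduce: divide the bytes transferred by the cycle time at the 33.3 MHz clock. The short program below, with the clock counts taken from the text, double-checks them; the same division reproduces the DMA and bus-mastering figures given later.

#include <stdio.h>

int main(void)
{
    const double f = 33.3e6;   /* LCLK frequency, Hz */
    struct { const char *name; int clocks; int bytes; } c[] = {
        { "memory read (5 SYNC clocks)", 21, 1 },
        { "memory write",                17, 1 },
        { "I/O read/write",              13, 1 },
    };
    for (int i = 0; i < 3; i++) {
        double t = c[i].clocks / f;               /* seconds per cycle */
        printf("%-30s %5.2f usec  %5.2f MBps\n",
               c[i].name, t * 1e6, c[i].bytes / t / 1e6);
    }
    return 0;
}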
To set up DMA exchanges and bus mastering, the host must have one LDRQ# input line for each connected device that uses these functions. Over this line, the device sends serially encoded information about the status of its DMA channel requests (Fig. 6.23). Transmission of a packet begins with the start bit, followed by the channel number code and the request bit ACT: 1 (high level) means the request is active, 0 means it is passive. The channel 4 code (100) is reserved for bus-mastering requests, and corresponds to the traditionally unavailable DMA channel. A packet is sent every time the request status changes. Usually, a request is only asserted this way; deassertion of a request is signaled via the SYNC field.
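As an illustration, the packet can be laid out as a bit sequence, one bit per LCLK. In the sketch below, ldrq_packet() is a hypothetical name, and the order of the channel bits is an assumption, since the text does not state it.

#include <stdint.h>
#include <stdio.h>

/* Lay out an LDRQ# message: bits[0] = start bit, bits[1..3] = channel
   number (bit order assumed MSB first), bits[4] = ACT (1 = request
   asserted). Channel 4 (100) requests bus mastering. */
static void ldrq_packet(uint8_t channel, int act, uint8_t bits[5])
{
    bits[0] = 0;                   /* start bit                     */
    bits[1] = (channel >> 2) & 1;  /* channel bit 2                 */
    bits[2] = (channel >> 1) & 1;  /* channel bit 1                 */
    bits[3] = channel & 1;         /* channel bit 0                 */
    bits[4] = (uint8_t)(act & 1);  /* ACT: 1 = assert, 0 = deassert */
}

int main(void)
{
    uint8_t bits[5];
    ldrq_packet(4, 1, bits);       /* bus-mastering request         */
    for (int i = 0; i < 5; i++) printf("%u", bits[i]);  /* "01001"  */
    printf("\n");
    return 0;
}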
The execution of DMA data transfer (Fig. 6.24) is controlled by the host, but it differs somewhat from the regular memory and I/O accesses. Here, new fields have been introduced:
-
The SIZE field defines the length of the transmission. Code 0000 means 1 byte, code 0001 means 2 bytes, code 0011 means 4 bytes. The rest of the codes are reserved.
-
The CHANNEL field is used by the host to send the DMA channel number—bits [2:0]—and the end-of-cycle indicator TC—bit 3 (a decoding sketch follows this list).
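A decoding sketch for these two fields, with illustrative function names, might look as follows:

#include <stdint.h>
#include <stdio.h>

/* Decode the SIZE nibble into a byte count per the codes above. */
static int lpc_size_bytes(uint8_t size)
{
    switch (size & 0xF) {
    case 0x0: return 1;
    case 0x1: return 2;
    case 0x3: return 4;
    default:  return -1;                  /* reserved               */
    }
}

static int lpc_dma_channel(uint8_t ch) { return ch & 0x7; }       /* bits [2:0] */
static int lpc_dma_tc(uint8_t ch)      { return (ch >> 3) & 1; }  /* bit 3: TC  */

int main(void)
{
    uint8_t channel_field = 0xB;          /* TC set, channel 3      */
    printf("channel %d, TC %d, %d byte(s)\n",
           lpc_dma_channel(channel_field), lpc_dma_tc(channel_field),
           lpc_size_bytes(0x1));          /* channel 3, TC 1, 2 byte(s) */
    return 0;
}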
Memory access cycles may transfer 1, 2, or 4 bytes. There are no wait states, as they are concealed by the DMA controller. Depending on the length of the access, memory read cycles take 11, 18, or 32 clock ticks (0.33, 0.54, or 0.96 μsec). This gives bus bandwidths of 3.03 MBps, 3.7 MBps, or 4.17 MBps, respectively. Write cycles take 11, 14, or 20 clock ticks (0.33, 0.42, or 0.60 μsec), giving throughputs of 3.03 MBps, 4.76 MBps, or 6.67 MBps, respectively. The field sequence that is repeated when transmitting 2 or 4 bytes is outlined in bold in Fig. 6.24.
A master device requests bus-master access the same way as direct memory access, but indicates the reserved channel number 4 (100). When granting access, the host places in the START field the number of the master, which then establishes the type of the cycle (Fig. 6.25). Bus mastering implies access to the resources of the host: system memory, a PCI device, etc. Data follow each other without interruption in 2- and 4-byte packets. However, there will always be wait states in the memory and I/O read cycles, because of the extra time required to arbitrate for the PCI bus or to access the memory controller. Assuming the SYNC field is six clock ticks long (it is unlikely to be shorter, and may well be longer), memory access cycles (read as well as write) will require 25, 27, or 31 clock ticks (0.75, 0.81, or 0.93 μsec), giving throughputs of 1.33 MBps, 2.47 MBps, or 4.30 MBps, respectively. Because the port address field is shorter, port access cycles are shorter too: 21, 23, or 27 clock ticks (0.63, 0.69, or 0.81 μsec), with throughputs of 1.59 MBps, 2.90 MBps, or 4.94 MBps, respectively.
The electrical interface for the LAD [3:0], LFRAME#, LDRQ#, and SERIRQ signals corresponds to the PCI 2.1 specification for the 3.3 V version. Depending on the motherboard, the other signals may be either 5 V or 3.3 V.
Configuring LPC devices does not require using PCI or ISA plug-and-play protocols, as the system BIOS knows all the LPC devices a priori. To access an LPC device, the host must decode its address, and redirect accesses to this address to the LPC controller.
6.5 Notebook PC Expansion Buses and Cards
Originally, portable and notebook PCs were built without any attempts to standardize or provide component compatibility. The situation has changed over time, however, and today there are several interfaces and form factors for expansion devices. The most popular are listed in Table 6.16.
Table 6.16: Form Factors and Interfaces of Portable PC Peripheral Devices
The first standard for expansion cards was PCMCIA, which was later renamed PC Card. In addition to expansion bus slots, notebook and pocket PCs may have slots for connecting memory cards (see Section 8.3).
A desktop PC can be equipped with PC Card slots using a special adapter-bridge card installed into a PCI or ISA slot. The slots themselves (one or two) are enclosed in a 3″ case mounted into the PC's front panel; the bridge card is connected to this case via a ribbon cable.
6.5.1 PCMCIA, PC Card, and CardBus Interfaces
At the beginning of the 1990s, the Personal Computer Memory Card International Association (PCMCIA) began work on standardizing notebook-computer expansion buses, with the primary emphasis on memory expansion. The first standard to appear, in June 1990, was PCMCIA Standard Release 1.0/JEIDA 4.0. It described a 68-contact interface and two form factors: Type I and Type II PC Cards. At first, the standard applied only to the electrical and physical requirements of memory cards. The Card Information Structure (CIS) metaformat, which describes a card's characteristics and capabilities and is the key element in providing for card interchangeability and the plug-and-play mechanism, was also introduced.
The next version, PCMCIA 2.0, which was released in 1991, defined an I/O operation interface, dual power supplies for memory cards, and testing methods. The specifications for the connector were left unchanged. Version 2.01 added the PC Card ATA specification, a new form factor, Type III, the Auto-Indexing Mass Storage (AIMS) specification, and the initial version of the Card Services Specification (CSS). Version 2.1, which came out in 1994, expanded the Card and Socket Services Specification (CSSS) and developed the Card Information Structure.
The PC Card standard, adopted in 1995, is the continuation of the previous standards. This standard introduced additional requirements directed at improving compatibility, as well as new capabilities, including a 3.3 V power supply, DMA, and 32-bit CardBus bus mastering. Later, specifications of other additional capabilities were added to the standard.
All PCMCIA and PC Card cards have a 68-contact connector. The functions of its contacts vary depending on the type of the interface card. The type of interface is “ordered” by the card when it is installed into the slot, which, naturally, must support the required interface. The memory interface provides 8- and 16-bit accesses with a minimal cycle time of 100 nsec, giving a maximum throughput of 10 MBps and 20 MBps, respectively. The I/O interface has a 255-nsec minimum cycle length, which corresponds to 3.92 MBps and 7.84 MBps throughputs for the 8- and 16-bit accesses, respectively. The CardBus interface supports practically the same exchange protocol as the PCI, but is somewhat simplified. Its 33 MHz clock and 32-bit data bus provide peak bandwidth of up to 132 MBps in the burst cycle. Cards have bus-mastering capability. They use the same automatic configuring system as the PCI (using the configuration space registers). The interface has supplementary capabilities incorporated to transfer digital audio signal, which can be done using both the traditional pulse-code modulation (PCM) and the new pulse-width modulation (PWM), which is actually a revival of a forgotten old method.
There is a special interface specification for PC Card ATA disk devices.
There are four different PC Card varieties: known as types, they all have the same width and length—54 mm x 85.5 mm—but different thicknesses. Thinner cards fit into slots for thicker ones. The four types are as follows:
-
PC Card Type I—3.3 mm—memory cards
-
PC Card Type II—5 mm—I/O devices (modems, network adapters, and the like)
-
PC Card Type III—10.5 mm—disk storage devices
-
PC Card Type IV—16 mm (no mention of this type could be found at the time of writing on the site http://www.pc-card.com)
There are also compact Small PC Card cards. These are 45 mm long by 42.8 mm wide, but have the same connector and thickness as the standard cards.
PCMCIA also maintains the Miniature Card standard (see Section 9.3.4) for memory cards (dynamic, static, ROM, and flash EPROM).
The functions of the connector contacts for the different interface types are given in Table 6.17. The functions of the signals for memory and I/O interface cards are given in Table 6.18. Signal names for CardBus cards are formed by prefixing the letter C to the corresponding PCI bus signal name. (See Section 6.2.2.)
Table 6.18: Signal Functions of Memory and I/O Cards
The interface of the memory and I/O cards is simple—practically the same as an asynchronous static memory interface. The card is selected by the CE# signal, which is asserted simultaneously with the address. The memory and configuration registers are read using the OE# signal and written using the WE# signal. The REG# signal, activated simultaneously with the CE# signal and the address, differentiates between memory accesses and accesses to the configuration registers. Separate IORD# and IOWR# signals, acting in tandem with an active REG# signal, are used to access I/O ports. During port accesses, the card can indicate that it is capable of 16-bit access by asserting the IOIS16# signal (as on the ISA bus). A device acknowledges a port read with the INPACK# signal, which it asserts and deasserts in response to the CE# signal. This signal allows the host to be certain that it is not reading an empty slot.
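The sequencing can be summarized in a rough host-side sketch. All the primitives below (set_addr(), assert_sig(), and so on) are invented stubs that merely print the actions; only the signal roles and their ordering come from the description above, and real timing parameters are omitted.

#include <stdint.h>
#include <stdio.h>

/* Stub signal primitives; a real host adapter would drive the socket. */
static void set_addr(uint32_t a)        { printf("ADDR <- %05X\n", (unsigned)a); }
static void assert_sig(const char *s)   { printf("assert   %s\n", s); }
static void deassert_sig(const char *s) { printf("deassert %s\n", s); }
static uint16_t sample_data(void)       { return 0x1234; /* dummy data */ }

/* Rough ordering of an I/O port read, per the description above. */
static uint16_t pccard_port_read(uint32_t addr)
{
    set_addr(addr);
    assert_sig("CE#");     /* select the card together with the address   */
    assert_sig("REG#");    /* port/register access rather than memory     */
    assert_sig("IORD#");   /* strobe the port read; the card answers with */
                           /* INPACK#, and IOIS16# if 16-bit capable      */
    uint16_t data = sample_data();
    deassert_sig("IORD#");
    deassert_sig("REG#");
    deassert_sig("CE#");
    return data;
}

int main(void)
{
    printf("data: %04X\n", pccard_port_read(0x3F8));
    return 0;
}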
PC Card slots may also provide DMA capability. Implementing DMA is the most cost-effective way to unload the CPU, but only simple ISA-bus-based hosts possess this capability. For PCI-bus systems, bus mastering of the CardBus is more usual. However, implementing bus mastering in cards is not a cheap option.
Multimedia cards can switch into the special Zoomed Video Port mode. In this mode, a separate point-to-point data transfer interface between the card and the host system is set up. Conceptually, the interface is similar to the VESA Feature Connector (VFC) for graphics cards: it has a dedicated bus for transferring video data that is not connected with the other buses (and does not load them), but it uses a different protocol. In the Zoomed Video Port mode, the A[25:4] address lines and the BVD2/SPKR#, INPACK#, and IOIS16# lines are assigned different functions, and are used to send video data and four digital audio channels. Only four address lines are left for the regular interface, which makes it possible to address 16 bytes of system memory and card attributes.
The Zoomed Video Port interface matches the CCIR601 timing diagrams, which allows the NTSC decoder to deliver video data from the card into the VGA display buffer in real time. The card may receive video data from either an external video input or an MPEG decoder.
Cards have a specially designated attribute memory space, in which the card's configuration and control registers used for auto-configuration are located. The standard describes the Card Information Structure (CIS) format. Cards may be multifunctional (a modem/network adapter combination, for example). The Multiple Function PC Cards (MFPC) specification provides separate configuration registers for each function and defines the interrupt-request line-sharing rules.
For external memory devices, the standard describes MS-DOS FAT-compatible data-storage formats and also formats oriented to flash memory as the main information-carrying medium. For direct execution of the programs stored in the card’s memory, there is the eXecute In Place (XIP) specification. This describes the software interface for executing these programs without loading them into the main system memory.
The standard describes the software-implemented Card Services interface, which standardizes the interaction of its clients (drivers, application software, and utilities) with devices. There is also the Socket Services interface, which is used to detect card connections and disconnections, to identify cards, and to configure the power supplies and the hardware interface.
There are two extensions of the standard, specific to the two groups that maintain the PC Card standard:
-
PCMCIA describes Auto-Indexed Mass Storage (AIMS), which is used for storing large volumes of data, such as images or multimedia data on block-oriented mass storage devices. It also contains specifications for a 15-pin shielded connector for modem I/O and LAN adapters. Another connector described is a 7-pin Modem I/O connector.
-
The JEIDA extension contains the Small Block Flash Format specification that simplifies the file structure of flash memory cards. The Still Image, Sound, and Related Information Format (SISRIF) is directed at recording images and sound to memory cards. There also is a specification for DRAM-Type memory cards.
Most adapters support plug-and-play technology and can be hot-swapped: installed and removed without turning off the computer. For this purpose, the card's power-line contacts are longer than the rest and, therefore, connect first and disconnect last. Two card-presence-detection contacts—CD1# and CD2# (Card Detect)—are shorter than the rest. When they connect with the slot's contacts, this signals to the host that the card has been completely inserted into the slot. Even though cards can be dynamically configured, it is sometimes necessary to reboot the system after changing the configuration.
Initially, cards and host systems used +5 V signaling voltage. With the addition of low-voltage signaling (+3.3 V), a mechanical key preventing the insertion of a 3.3 V card into a 5 V slot was introduced. Additionally, contacts 43 (VS1#) and 47 (VS2#) were defined for power-supply voltage selection. On 5 V cards, they are not connected; on 3.3 V cards, the VS1# contact is grounded and the VS2# contact is left unconnected. A host that supports both supply voltages uses these lines to determine the installed card's requirements and provides the appropriate voltage. If the host cannot provide the necessary voltage, it must not provide any, and must issue a connection error message instead. Cards usually support Advanced Power Management (APM), which is an important feature for battery-operated computers.
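The sense-contact decoding reduces to a small truth table; card_voltage_mv() is an illustrative name, with 1 denoting an open (pulled-up) contact and 0 a contact grounded by the card.

#include <stdio.h>

/* Decode the VS1#/VS2# voltage-sense contacts described above:
   both open -> 5 V card; VS1# grounded, VS2# open -> 3.3 V card. */
static int card_voltage_mv(int vs1, int vs2)
{
    if (vs1 == 1 && vs2 == 1) return 5000;  /* 5 V card: both open       */
    if (vs1 == 0 && vs2 == 1) return 3300;  /* 3.3 V card: VS1# grounded */
    return -1;                              /* unknown: apply no power   */
}

int main(void)
{
    printf("%d mV\n", card_voltage_mv(0, 1)); /* prints 3300 */
    return 0;
}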
A diverse range of devices—including memory, storage devices, communications devices, interface ports, game adapters, multimedia devices, and so on—is produced under the PC Card standard. However, they are all significantly more expensive than their full-size counterparts. Via the PC Card slot, portable computers can be connected to docking stations equipped with standard peripheral equipment. Compatibility problems sometimes arise, though, because manufacturers do not always adhere strictly to the standard.
PC Card slots are connected to a portable PC's system bus via a bridge; in computers with an internal PCI bus, this is a PCI-PC Card bridge. Notebook PCs may also have Small PCI (SPCI) slots, but these are not accessible without opening the case and do not allow devices to be hot-swapped.