Languages and the machine: the assembly process

The Assembly Process

The process of translating an assembly language program into a machine language program is referred to as the assembly process. The assembly process is relatively simple, since there is a one-to-one mapping of assembly language statements to their machine language counterparts. This is in contrast to compilation, for example, in which a given high-level language statement may be translated into a number of computationally equivalent machine language statements.

While assembly is a straightforward process, it is tedious and error-prone if done by hand. In fact, the assembler was one of the first software tools developed after the invention of the digital electronic computer.

Commercial assemblers provide at least the following capabilities:

• Allow the programmer to specify the run-time location of data values and programs. (Most often, however, the programmer would not specify an absolute starting location for a program, because the program will be moved around, or relocated, by the linker and perhaps the loader, as discussed below.)

• Provide a means for the programmer to initialize data values in memory prior to program execution.

• Provide assembly-language mnemonics for all machine instructions and addressing modes, and translate valid assembly language statements into their equivalent machine language binary values.

• Permit the use of symbolic labels to represent addresses and constants.

• Provide a means for the programmer to specify the starting address of the program, if there is one. (There would not be a starting address if the module being assembled is a procedure or function, for example.)

• Provide a degree of assemble-time arithmetic.

• Include a mechanism that allows variables to be defined in one assembly language program and used in another, separately assembled program.

• Provide for the expansion of macro routines, that is, routines that can be defined once, and then instantiated as many times as needed.

We shall illustrate how the assembly process proceeds by “hand assembling” a simple program from ARC assembly language to ARC machine language. The program we will assemble is similar to Figure 4-13, reproduced below for convenience as Figure 5-1. In assembling this program we use the ARC encoding formats shown in Figure 4-10, reproduced here as Figure 5-2. The figure shows the encoding of ARC machine language. That is, it specifies the target binary machine language of the ARC computer that the assembler must generate from the assembly language text.

image

image

Assembly and two-pass assemblers

Most assemblers pass over the assembly language text twice, and are referred to as “two-pass assemblers.” The first pass is dedicated to determining the addresses of all data items and machine instructions, and selecting which machine instruction should be produced for each assembly language instruction (but not yet generating machine code).

The addresses of data items and instructions are determined by employing an assemble-time analog to the Program Counter, referred to as the location counter. The location counter keeps track of the address of the current instruction or data item as assembly proceeds. It is generally initialized to 0 at the start of the first pass, and is incremented by the size of each instruction or data item as it is processed. The .org pseudo operation causes the location counter to be set to the value specified by the .org statement. For example, if the assembler encounters the statement

.org 1000

it would set the location counter to 1000, and the next instruction or data item would be assembled at that address. During this pass the assembler also performs any assembly-time arithmetic operations, and inserts the definitions of all labels and constant values into a table, referred to as the symbol table.
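The behavior of the location counter can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every statement other than .org is taken to occupy one 4-byte word, and the function name `assign_addresses` and the sample statements are hypothetical rather than part of any real assembler.

```python
def assign_addresses(statements, word_size=4):
    """Pair each statement with the address the location counter gives it."""
    location_counter = 0  # generally initialized to 0 at the start of the pass
    placed = []
    for statement in statements:
        fields = statement.split()
        if fields[0] == ".org":
            # .org only resets the location counter; it occupies no memory.
            location_counter = int(fields[1])
            continue
        placed.append((location_counter, statement))
        # Assumption: each instruction or data item is one 4-byte word.
        location_counter += word_size
    return placed


# Hypothetical input: one statement before the .org, two after it.
placed = assign_addresses([
    "ld [x], %r1",
    ".org 1000",
    "ld [y], %r2",
    "addcc %r1, %r2, %r3",
])
for address, statement in placed:
    print(address, statement)
```

The statement that follows .org 1000 is placed at address 1000, and the one after it at 1004.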

The primary reason for requiring a second pass is to allow symbols to be used in the program before they are defined, which is known as forward referencing. After the first pass, the assembler will have identified all symbols and entered them into its symbol table. During the second pass it generates the machine code, inserting the values of the symbols, which are by then known.

Let us now hand assemble the program shown in Figure 5-1 into machine code. When the assembler encounters the first instruction,

ld [x], %r1

it uses a pattern-matching process to recognize that it is a load instruction. Further pattern matching deduces that it is of the form “load from a memory address specified as a constant value (x in this case) plus the contents of a register (%r0 in this case) into a register (%r1 in this case).” This corresponds to the second Memory format shown in Figure 5-2. Examining the second Memory format we find that the op field for this instruction (ld) is 11. The destination of this ld instruction goes in the rd field, which is 00001 for %r1 in this case. The op3 field is 000000 for ld, as shown in the op3 box below the Memory formats. The rs1 field identifies the register, %r0 in this case, that is added to the simm13 field to form the source operand address, so rs1 is 00000. The i bit comes next. Notice that the i bit is used to distinguish between the first Memory format (i=0) and the second (i=1). Therefore the i bit is set to 1. The simm13 field specifies the address of the label x, which appears five words after the first instruction. Since the first instruction occurs at location 2048, and since each word is composed of four bytes, the address of x is 5 × 4 = 20 bytes after the beginning of the program. The address of x is then 2048 + 20 = 2068, which is represented by the bit pattern 0100000010100. This pattern fits into the signed 13-bit simm13 field.

The first line is thus assembled into the bit pattern shown below:

image

image
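The field-by-field encoding just described can be checked with a short Python sketch. The field order (op, rd, op3, rs1, i, simm13) and the field values come from the discussion above; the field widths of 2, 5, 6, 5, 1 and 13 bits are inferred from the bit patterns quoted in the text, and the function name `encode_memory_simm13` is hypothetical.

```python
def encode_memory_simm13(op, rd, op3, rs1, simm13):
    """Pack the second Memory format (i = 1) into one 32-bit word."""
    if not -4096 <= simm13 <= 4095:
        raise ValueError("constant does not fit in a signed 13-bit field")
    return (
        (op << 30)             # op:     2 bits
        | (rd << 25)           # rd:     5 bits
        | (op3 << 19)          # op3:    6 bits
        | (rs1 << 14)          # rs1:    5 bits
        | (1 << 13)            # i:      1 bit, set to 1 for this format
        | (simm13 & 0x1FFF)    # simm13: 13 bits, two's complement
    )


# ld with op = 11, rd = %r1, op3 = 000000, rs1 = %r0, simm13 = 2068 (address of x)
word = encode_memory_simm13(op=0b11, rd=1, op3=0b000000, rs1=0, simm13=2068)
print(format(word, "032b"))  # 11000010000000000010100000010100
print(format(word, "08x"))   # c2002814
```

Reading the printed bit string from left to right gives the fields in the same order as the walk-through: 11, 00001, 000000, 00000, 1, 0100000010100.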

As a general approach, the assembly process is carried out by reading assembly language statements sequentially, from first to last, and generating machine code for each statement. And as mentioned earlier, a difficulty with this approach is caused by forward referencing. Consider the program fragment shown in Figure 5-3. When the assembler sees the call statement, it does not yet know the location of sub_r since the sub_r label has not yet been seen. Thus the reference is entered into the symbol table and marked as unresolved. The reference is resolved when the definition of sub_r is found later in the program. The process of building a symbol table is described below.

image
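A minimal Python sketch of this bookkeeping follows. It assumes the symbol table is a dictionary in which the value None marks a symbol that has been referenced but not yet defined; the helper names and the address 2064 are hypothetical and chosen only for illustration.

```python
UNRESOLVED = None  # marker: referenced, but no definition seen yet


def note_reference(symbol_table, name):
    """Record a use of `name`, keeping its value if it is already defined."""
    symbol_table.setdefault(name, UNRESOLVED)


def note_definition(symbol_table, name, address):
    """Record the definition of `name`, resolving any earlier references."""
    if symbol_table.get(name) is not UNRESOLVED:
        raise ValueError(f"symbol {name!r} defined twice")
    symbol_table[name] = address


def unresolved_symbols(symbol_table):
    """Names that are still waiting for a definition."""
    return sorted(n for n, v in symbol_table.items() if v is UNRESOLVED)


symbol_table = {}
note_reference(symbol_table, "sub_r")         # the call statement is seen first
pending_before = unresolved_symbols(symbol_table)
note_definition(symbol_table, "sub_r", 2064)  # hypothetical address of the label
pending_after = unresolved_symbols(symbol_table)
print(pending_before, pending_after, symbol_table)
```

Any name still reported by `unresolved_symbols` at the end of the first pass would correspond to an undefined-symbol error.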

Assembly and the symbol table

In the first pass of the two-pass assembly process, a symbol table is created. A symbol is either a label or a symbolic name that refers to a value used during the assembly process.

As an example of how a two-pass assembler operates, consider assembling the code in Figure 4-14. Starting from the .begin statement, the assembler encounters the statement

.org 2048

This causes the assembler to set the location counter to 2048, and assembly proceeds from that address. The first statement encountered is

a_start .equ 3000

An entry is created in the symbol table for a_start, which is given the value 3000. (Note that .equ statements do not generate any code, and thus are not assigned addresses during assembly.)

Assembly proceeds as the assembler encounters the first machine instruction,

image

This instruction is assembled at the address specified by the location counter, 2048. The location counter is then incremented by the size of the instruction, 4 bytes, to 2052. Notice that when the symbol length is encountered the assembler has not seen any definition for it. An entry is created in the symbol table for length, but it is initially assigned the value “undefined” as shown by the “—” in Figure 5-4a.

image

The assembler then encounters the second instruction

image

It assembles this instruction at address 2052 and enters the symbol address into the symbol table, again setting its value to “undefined,” since its definition has not been seen. It then increments the location counter by 4 to 2056. The andcc instruction is assembled at address 2056, and the location counter is incremented by the size of the instruction, again 4 bytes, to 2060. The next symbol that is seen is loop, which is entered into the symbol table with a value of 2060, the value of the location counter. The next symbol that is encountered that is not in the symbol table is done, which is also entered into the symbol table without a value since it likewise has not been defined.

The first pass of assembly continues, and the unresolved symbols length, address, and done are assigned the values 2092, 2096, and 2088, respectively, as their definitions are encountered. The label a is encountered, and is entered into the table with a value of 3000. The label done appears at location 2088 because there are 10 instructions (40 bytes) between the beginning of the program and done, and 2048 + 40 = 2088. Addresses for the remaining labels are computed in a similar manner. If any labels are still undefined at the end of the first pass, then an error exists in the program and the assembler will flag the undefined symbols and terminate.
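The address arithmetic of the first pass can be reproduced with a small Python sketch. The listing below is not Figure 4-14: it is a stand-in shaped only to match the addresses quoted in the text, with `insn` and `data` as hypothetical placeholders for statements whose contents the text does not give, and with assumed operands on the named instructions. The function `first_pass` is likewise hypothetical and records definitions only.

```python
LISTING = """
.begin
.org 2048
a_start .equ 3000
ld [length], %r1
ld [address], %r2
andcc
loop: insn
insn
insn
insn
insn
insn
insn
done: insn
length: data
address: data
.org a_start
a: data
.end
"""


def first_pass(listing, word_size=4):
    """Build a symbol table of label and .equ definitions."""
    symbol_table = {}
    location_counter = 0
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0] in (".begin", ".end"):
            continue
        if fields[0] == ".org":
            # The operand may be a number or an already defined symbol.
            arg = fields[1]
            location_counter = int(arg) if arg.isdigit() else symbol_table[arg]
            continue
        if len(fields) >= 3 and fields[1] == ".equ":
            symbol_table[fields[0]] = int(fields[2])  # generates no code
            continue
        if fields[0].endswith(":"):
            symbol_table[fields[0][:-1]] = location_counter
            fields = fields[1:]
        if fields:
            location_counter += word_size  # one 4-byte word per statement
    return symbol_table


symbols = first_pass(LISTING)
print(symbols)
```

With this stand-in listing the sketch arrives at the same values as the text: loop at 2060, done at 2088, length at 2092, address at 2096, and both a_start and a at 3000.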

After the symbol table is created, the second pass of assembly begins. The program is read a second time, starting from the .begin statement, but now object code is generated. The first statement that is encountered that causes code to be generated is ld at location 2048. The symbol table shows that the address portion of the ld instruction is (2092)₁₀, the address of length, and so one word of code is generated using the Memory format. The second pass continues in this manner until all of the code is translated. The assembled program is shown in Figure 5-5. Notice that the displacements for branch addresses are given in words, rather than in bytes, because the branch instructions multiply the displacements by four.
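The conversion from a byte distance to a word displacement can be sketched as follows. This assumes the displacement is measured from the address of the branch instruction itself, as in SPARC-style encodings; the function name `branch_displacement` and the branch addresses 2064 and 2084 are hypothetical, while 2060 and 2088 are the values of loop and done given above.

```python
def branch_displacement(branch_address, target_address, word_size=4):
    """Signed displacement, in words, from a branch to its target."""
    byte_offset = target_address - branch_address
    if byte_offset % word_size != 0:
        raise ValueError("branch target is not word aligned")
    # The hardware multiplies the stored value by four, so store words.
    return byte_offset // word_size


# Hypothetical forward branch located at 2064, targeting done (2088).
forward = branch_displacement(2064, 2088)
# Hypothetical backward branch located at 2084, targeting loop (2060).
backward = branch_displacement(2084, 2060)
print(forward, backward)  # 6 -6
```

A backward branch yields a negative displacement, which would be stored in two's complement form in the instruction's displacement field.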

Final tasks of the assembler

After assembly is complete the assembler must add additional information to the assembled module for the linker and loader:

• The module name and size. If the execution model involves memory segments for code, data, stack, etc. then the sizes and identities of the various segments must be specified.

image

• The address of the start symbol, if one is defined in the module. Most assemblers and high level languages provide for a special reserved label that the programmer can use to indicate where the program should start execution. For example, C specifies that execution will start at the function named main(). In Figure 5-1 the label “main” is a signal to the assembler that execution should start at that location.

• Information about global and external symbols. The linker will need to know the addresses of any global symbols defined in the module and exported by it, and it will likewise need to know which symbols remain undefined in the module because they are defined as global in another module.

• Information about any library routines that are referenced by the module. Some libraries contain commonly used functionality such as math or other specialized functions. We will have more to say about library usage in the sections below.

• The values of any constants that are to be loaded into memory. Some loaders expect data initialization to be specified separately from the binary code.

• Relocation information. When the linker is invoked most of the modules that are to be linked will need to be relocated as the modules are concatenated. The whole issue of module relocation is complicated because some address references can be relocated and others cannot. We discuss relocation later, but here we note that the assembler specifies which addresses can be relocated and which others cannot.
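One way to picture the information listed above is as a single record attached to the assembled code. The Python dataclass below is a hypothetical layout for illustration only; real object file formats differ, and every field name and sample value here is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ObjectModule:
    name: str                                  # module name
    size: int                                  # module size in bytes
    start_address: Optional[int] = None        # None if no start symbol is defined
    exported: Dict[str, int] = field(default_factory=dict)      # global symbol -> address
    external: Set[str] = field(default_factory=set)             # defined in other modules
    libraries: List[str] = field(default_factory=list)          # referenced library routines
    initial_data: Dict[int, int] = field(default_factory=dict)  # address -> constant value
    relocatable: Set[int] = field(default_factory=set)          # offsets of relocatable addresses


# Hypothetical module: 56 bytes long, exports main, imports sub_r.
module = ObjectModule(name="demo", size=56, start_address=0)
module.exported["main"] = 0
module.external.add("sub_r")
module.relocatable.add(8)
print(module)
```

A module that holds only a procedure or function would simply leave `start_address` as None.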

Location of programs in memory

Up until now we have assumed that programs are located in memory at an address that is specified by a .org pseudo operation. This may indeed be the case in systems programming, where the programmer has a reason for wanting a program to be located at a specific memory location, but typically the programmer does not care where the program is located in memory. Furthermore, when separately assembled or compiled programs are linked together, it is difficult or impossible for the programmer to know exactly where each module will be located after linking, as they are concatenated one after the other. For this reason most addresses are specified as being relocatable in memory, except perhaps for addresses such as I/O addresses, which may be fixed at an absolute memory location.

In the next section we discuss relocation in more detail; here we merely note that it is the assembler’s responsibility to mark symbols as being relocatable. Whether a given symbol is relocatable or not depends upon both the assembly language and the operating system’s conventions. In any case, this relocation information is included in the assembled module for use by the linker and/or loader in a relocation dictionary. Symbols that are relocatable are often marked with an “R” after their value in the assembler’s listing file.
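How a linker or loader might use a relocation dictionary can be sketched briefly. This is a deliberately simplified model, assuming the module was assembled as if it started at address 0 and that each relocatable entry is a whole word holding an address; the function name `relocate` and the sample values are hypothetical.

```python
def relocate(words, relocation_dictionary, load_base):
    """Move a module to load_base, adjusting only the relocatable words.

    words: offset within the module -> word value
    relocation_dictionary: offsets whose values are relocatable addresses
    """
    relocated = {}
    for offset, value in words.items():
        if offset in relocation_dictionary:
            value += load_base  # relocatable address: shift with the module
        relocated[load_base + offset] = value  # absolute values are copied as-is
    return relocated


# Offset 0 holds the address of the word at offset 12 (relocatable);
# offset 4 holds a fixed I/O address and offset 12 a plain constant (absolute).
module_words = {0: 12, 4: 65280, 12: 20}
relocation_dictionary = {0}
image = relocate(module_words, relocation_dictionary, load_base=2048)
print(image)  # {2048: 2060, 2052: 65280, 2060: 20}
```

Only the word marked in the relocation dictionary changes; the absolute I/O address and the constant keep their assembled values.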
