Art of Assembly Language: Chapter Five
- Chapter Five - Variables and Data Structures
- 5.0 - Chapter Overview
- 5.1 - Some Additional Instructions: LEA, LES, ADD, and MUL
- 5.2 - Declaring Variables in an Assembly Language Program
- 5.3 - Declaring and Accessing Scalar Variables
- 5.3.1 - Declaring and using BYTE Variables
- 5.3.2 - Declaring and using WORD Variables
- 5.3.3 - Declaring and using DWORD Variables
- 5.3.4 - Declaring and using FWORD, QWORD, and TBYTE Variables
- 5.3.5 - Declaring Floating Point Variables with REAL4, REAL8, and REAL10
- 5.4 - Creating Your Own Type Names with TYPEDEF
- 5.5 - Pointer Data Types
- 5.6 - Composite Data Types
- 5.6.1 - Arrays
- 5.6.1.1 - Declaring Arrays in Your Data Segment
- 5.6.1.2 - Accessing Elements of a Single Dimension Array
- 5.6.2 - Multidimensional Arrays
- 5.6.2.1 - Row Major Ordering
- 5.6.2.2 - Column Major Ordering
- 5.6.2.3 - Allocating Storage for Multidimensional Arrays
- 5.6.2.4 - Accessing Multidimensional Array Elements in Assembly Language
- 5.6.3 - Structures
- 5.6.4 - Arrays of Structures and Arrays/Structures as Structure Fields
- 5.6.5 - Pointers to Structures
- 5.7 - Sample Programs
- 5.7.1 - Simple Variable Declarations
- 5.7.2 - Using Pointer Variables
- 5.7.3 - Single Dimension Array Access
- 5.7.4 - Multidimensional Array Access
- 5.7.5 - Simple Structure Access
- 5.7.6 - Arrays of Structures
- 5.7.7 - Structures and Arrays as Fields of Another Structure
- 5.7.8 - Pointers to Structures and Arrays of Structures
Copyright 1996 by Randall Hyde
All rights reserved.
Duplication other than for immediate display through a browser is prohibited
by U.S. Copyright Law.
This material is provided on-line as a beta-test of this text. It is for
the personal use of the reader only. If you are interested in using this
material as part of a course, please contact
rhyde@cs.ucr.edu
Supporting software and other materials are available via anonymous ftp
from ftp.cs.ucr.edu. See the "/pub/pc/ibmpcdir" directory for
details. You may also download the material from "Randall Hyde's Assembly
Language Page" at URL:
http://webster.ucr.edu
Notes:
This document does not contain the laboratory exercises, programming assignments,
exercises, or chapter summary. These portions were omitted for several reasons:
either they wouldn't format properly, they contained hyperlinks that were
too much work to resolve, they were under constant revision, or they were
not included for security reasons. Such omission should have very little
impact on the reader interested in learning this material or evaluating
this document.
This document was prepared using Harlequin's Web Maker 2.2 and Quadralay's
Webworks Publisher. Since HTML does not support the rich formatting options
available in Framemaker, this document is only an approximation of the actual
chapter from the textbook.
Chapter Five Variables and Data Structures
Chapter One discussed the basic format for data in memory. Chapter Three
covered how a computer system physically organizes that data. This chapter
finishes this discussion by connecting the concept of data representation
to its actual physical representation. As the title implies, this chapter
concerns itself with two main topics: variables and data structures. This
chapter does not assume that you've had a formal course in data structures,
though such experience would be useful.
5.0 Chapter Overview
This chapter discusses how to declare and access scalar variables, integers,
reals, data types, pointers, arrays, and structures. You must master these
subjects before going on to the next chapter. Declaring and accessing arrays,
in particular, seems to present a multitude of problems to beginning assembly
language programmers. However, the rest of this text depends on your understanding
of these data structures and their memory representation. Do not try to
skim over this material with the expectation that you will pick it up as
you need it later. You will need it right away and trying to learn this
material along with later material will only confuse you more.
5.1 Some Additional Instructions: LEA, LES, ADD, and MUL
The purpose of this chapter is not to present the 80x86 instruction
set. However, there are four additional instructions (above and beyond mov)
that will prove handy in the discussion throughout the rest of this chapter.
These are the load effective address (lea), load es
and general purpose register (les), addition (add),
and multiply (mul). These instructions, along with the mov
instruction, provide all the necessary power to access the different data
types this chapter discusses.
The lea instruction takes the form:
lea reg16, memory
reg16 is a 16 bit general purpose register. Memory is a memory
location represented by a mod/reg/rm byte (except that it must be a memory
location; it cannot be a register).
This instruction loads the 16 bit register with the offset of the location
specified by the memory operand. lea ax,1000h[bx][si], for
example, would load ax with the address of the memory location
pointed at by 1000h[bx][si]. This, of course, is the value
1000h+bx+si. Lea is also quite useful for obtaining the address of
a variable. If you have a variable I somewhere in memory, lea bx,I
will load the bx register with the address (offset)
of I.
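If the arithmetic behind lea seems abstract, the following C sketch may help. It is only a model of the instruction, not MASM code: the function name lea_disp_bx_si is made up for this illustration, and the sketch assumes the 16 bit wraparound that offsets have within a segment. Notice that it never reads memory; it just adds the displacement and the two registers.

```c
#include <stdint.h>

/* Models the value that `lea reg16, disp[bx][si]` leaves in reg16:
   the 16 bit offset disp + bx + si. Memory is never accessed.
   The cast discards any carry out of bit 15, so the sum wraps
   around at 64K just as a 16 bit offset does. */
uint16_t lea_disp_bx_si(uint16_t disp, uint16_t bx, uint16_t si)
{
    return (uint16_t)(disp + bx + si);
}
```

For example, with bx holding 20h and si holding 3, lea_disp_bx_si(0x1000, 0x20, 0x3) yields 1023h, which is what lea ax,1000h[bx][si] would load into ax.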
The les instruction takes the form
les reg16, memory32
This instruction loads the es register and one of the 16 bit
general purpose registers from the specified memory address. Note that any
memory address you can specify with a mod/reg/rm byte is legal but like
the lea instruction it must be a memory location, not a register.
The les instruction loads the specified general purpose register
from the word at the given address and loads the es register
from the following word in memory. This instruction and its companion
lds (which loads ds) are the only instructions
on pre-80386 machines that manipulate 32 bits at a time.
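The C sketch below models what les does with the four bytes at the memory operand. The type LesResult and the functions load_word and les are hypothetical names invented for this illustration. The sketch relies on the fact that the 80x86 stores the low order byte of a word at the lower address.

```c
#include <stdint.h>

/* Hypothetical record holding the two registers that les writes. */
typedef struct {
    uint16_t reg16;  /* the general purpose register, e.g. bx */
    uint16_t es;     /* the es segment register */
} LesResult;

/* Reads a 16 bit word stored low byte first, as on the 80x86. */
static uint16_t load_word(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Models `les reg16, memory32`: the word at the given address goes
   to reg16 and the following word goes to es. */
LesResult les(const uint8_t mem32[4])
{
    LesResult r;
    r.reg16 = load_word(mem32);      /* bytes 0 and 1: the offset  */
    r.es    = load_word(mem32 + 2);  /* bytes 2 and 3: the segment */
    return r;
}
```

If the four bytes in memory are 34h, 12h, 00h, 0B8h, the model returns reg16 = 1234h and es = 0B800h.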
The add instruction, like its x86 counterpart, adds two values
on the 80x86. This instruction takes several forms. There are five forms
that concern us here. They are
add reg, reg
add reg, memory
add memory, reg
add reg, constant
add memory, constant
All these instructions add the second operand to the first leaving the sum
in the first operand. For example, add bx,5 computes
bx := bx + 5.
The last instruction to look at is the mul (multiply) instruction.
This instruction has only a single operand and takes the form:
mul reg/memory
There are many important details concerning mul that this chapter
ignores. For the sake of the discussion that follows, assume that the register
or memory location is a 16 bit register or memory location. In such a case
this instruction computes dx:ax := ax * reg/mem. Note that there
is no immediate mode for this instruction.
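As a rough model of the 16 bit form of mul, consider the following C sketch. The names DxAx and mul16 are made up for this example. The code treats both operands as unsigned, forms the full 32 bit product, and splits it into the high word (dx) and the low word (ax).

```c
#include <stdint.h>

/* Hypothetical record for the register pair dx:ax. */
typedef struct {
    uint16_t dx;  /* high order word of the product */
    uint16_t ax;  /* low order word of the product  */
} DxAx;

/* Models `mul reg/mem` for a 16 bit operand: dx:ax := ax * operand. */
DxAx mul16(uint16_t ax, uint16_t operand)
{
    uint32_t product = (uint32_t)ax * operand;  /* full 32 bit result */
    DxAx r;
    r.dx = (uint16_t)(product >> 16);
    r.ax = (uint16_t)(product & 0xFFFFu);
    return r;
}
```

Multiplying 1234h by 10h gives 12340h, so the model returns dx = 1 and ax = 2340h. A program that looks only at ax afterwards sees just the low 16 bits of the product.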
5.2 Declaring Variables in an Assembly Language Program
Although you've probably surmised that memory locations and variables
are somewhat related, this chapter hasn't gone out of its way to draw strong
parallels between the two. Well, it's time to rectify that situation. Consider
the following short (and useless) Pascal program:
program useless(input,output);
var i,j:integer;
begin
i := 10;
write('Enter a value for j:');
readln(j);
i := i*j + j*j;
writeln('The result is ',i);
end.
When the computer executes the statement i:=10; it makes a
copy of the value 10 and somehow remembers this value for use later on.
To accomplish this, the compiler sets aside a memory location specifically
for the exclusive use of the variable i. Assuming the compiler
arbitrarily assigned location DS:10h for this purpose, it could use the instruction
mov ds:[10h],10 to accomplish this. If i is a
16 bit word, the compiler would probably assign the variable j to
the word starting at location 12h or 0Eh. Assuming it's location 12h, the
second assignment statement in the program might wind up looking like the
following:
        mov     ax, ds:[10h]    ;Fetch value of I
        mul     ds:[12h]        ;Multiply by J
        mov     ds:[10h], ax    ;Save in I (ignore overflow)
        mov     ax, ds:[12h]    ;Fetch J
        mul     ds:[12h]        ;Compute J*J
        add     ds:[10h], ax    ;Add I*J + J*J, store into I
Although there are a few details missing from this code, it is fairly straightforward
and you can easily see what is going on in this program.
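To convince yourself that this six-instruction sequence really computes i*j + j*j, it can help to trace it one step at a time. The C sketch below is a model, not part of the original program. The function name useless_i is made up, and the code assumes unsigned 16 bit arithmetic that keeps only the low word of each product, just as the assembly code ignores whatever mul leaves in dx.

```c
#include <stdint.h>

/* Traces the six instructions one at a time. The parameters i and j
   stand for the words at ds:[10h] and ds:[12h]; ax stands for the
   ax register. The return value is the final contents of i. */
uint16_t useless_i(uint16_t i, uint16_t j)
{
    uint16_t ax;

    ax = i;                              /* mov ax, ds:[10h]            */
    ax = (uint16_t)((uint32_t)ax * j);   /* mul ds:[12h] (dx discarded) */
    i  = ax;                             /* mov ds:[10h], ax            */
    ax = j;                              /* mov ax, ds:[12h]            */
    ax = (uint16_t)((uint32_t)ax * j);   /* mul ds:[12h] (dx discarded) */
    i  = (uint16_t)(i + ax);             /* add ds:[10h], ax            */
    return i;
}
```

With i = 10 and j = 3 the model returns 39, which is 10*3 + 3*3. With j = 300 the true answer, 93000, no longer fits in 16 bits and the model returns 27464 instead. That is the overflow the comments in the assembly code say to ignore.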
Now imagine a 5,000 line program like this one using variables like ds:[10h],
ds:[12h], ds:[14h], etc. Would you want to locate the statement where you
accidentally stored the result of a computation into j rather
than i? Indeed, why should you even care that the variable
i is at location 10h and j is at location 12h? Why shouldn't
you be able to use names like i and j rather than
worrying about these numerical addresses? It seems reasonable to rewrite
the code above as:
mov ax, i
mul j
mov i, ax
mov ax, j
mul j
add i, ax
Of course you can do this in assembly language! Indeed, one of the primary
jobs of an assembler like MASM is to let you use symbolic names for memory
locations. Furthermore, the assembler will even assign locations to the
names automatically for you. You needn't concern yourself with the fact
that variable i is really the word at memory location DS:10h
unless you're curious.
It should come as no surprise that ds will point to the dseg
segment in the SHELL.ASM file. Indeed, setting up ds so that
it points at dseg is one of the first things that happens in the SHELL.ASM
main program. Therefore, all you've got to do is tell the assembler to reserve
some storage for your variables in dseg and associate the offsets of those
variables with their names. This is a very simple process and is
the subject of the next several sections.
5.3 Declaring and Accessing Scalar Variables
Scalar variables hold single values. The variables i and
j in the preceding section are examples of scalar variables. Examples
of data structures that are not scalars include arrays, records, sets, and
lists. These latter data types are made up from scalar values. They are
the composite types. You'll see the composite types a little later; first
you need to learn to deal with the scalar types.
To declare a variable in dseg, you would use a statement something like
the following:
ByteVar byte ?
ByteVar is a label. It should begin at column one on a line
somewhere in the dseg segment (that is, between the dseg segment and
dseg ends statements). You'll find out all about labels in
a few chapters; for now you can assume that most legal Pascal/C/Ada identifiers
are also valid assembly language labels.
If you need more than one variable in your program, just place additional
lines in the dseg segment declaring those variables. MASM will automatically
allocate a unique storage location for each variable (it wouldn't be too
good to have i and j located at the same address
now, would it?). After you declare a variable, MASM will allow you to refer
to that variable by name rather than by location in your program. For example,
after inserting the above statement into the data segment (dseg), you could
use instructions like mov ByteVar,al in your program.
The first variable you place in the data segment gets allocated storage
at location DS:0. The next variable in memory gets allocated storage just
beyond the previous variable. For example, if the variable at location zero
was a byte variable, the next variable gets allocated storage at DS:1. However,
if the first variable was a word, the second variable gets allocated storage
at location DS:2. MASM is always careful to allocate variables in such a
manner that they do not overlap. Consider the following dseg definition:
dseg            segment para public 'data'
bytevar         byte    ?               ;byte allocates bytes
wordvar         word    ?               ;word allocates words
dwordvar        dword   ?               ;dword allocs dbl words
byte2           byte    ?
word2           word    ?
dseg            ends
MASM allocates storage for bytevar at location DS:0. Because bytevar is
one byte long, the next available memory location is going to be DS:1. MASM,
therefore, allocates storage for wordvar at location DS:1. Since words require
two bytes, the next available memory location after wordvar is DS:3 which
is where MASM allocates storage for dwordvar. Dwordvar is four bytes long,
so MASM allocates storage for byte2 starting at location DS:7. Likewise,
MASM allocates storage for word2 at location DS:8. Were you to stick another
variable after word2, MASM would allocate storage for it at location DS:0Ah.
Whenever you refer to one of the names above, MASM automatically substitutes
the appropriate offset. For example, MASM would translate the mov
ax,wordvar instruction into mov ax,ds:[1]. So now you
can use symbolic names for your variables and completely ignore the fact
that these variables are really memory locations with corresponding offsets
into the data segment.
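The bookkeeping MASM performs here is nothing more than a running total of the sizes declared so far. The following C sketch imitates it for the dseg example above. The names dseg_sizes and assign_offsets are invented for this illustration, and the sketch assumes, as the text describes, that variables are packed one after another with no gaps.

```c
#include <stddef.h>
#include <stdint.h>

/* Sizes in bytes of the variables in the dseg example, in declaration
   order: bytevar, wordvar, dwordvar, byte2, word2. */
const uint16_t dseg_sizes[5] = { 1, 2, 4, 1, 2 };

/* Gives each variable the first free offset, then advances the free
   offset by that variable's size. Returns the offset at which a
   further variable would be placed. */
uint16_t assign_offsets(const uint16_t *sizes, size_t count,
                        uint16_t *offsets)
{
    uint16_t next = 0;                  /* first variable is at DS:0 */
    for (size_t k = 0; k < count; k++) {
        offsets[k] = next;
        next = (uint16_t)(next + sizes[k]);
    }
    return next;
}
```

Running assign_offsets over dseg_sizes produces the offsets 0, 1, 3, 7, and 8 and returns 0Ah, which matches the locations worked out by hand above.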
5.3.1 Declaring and using BYTE Variables
So what are byte variables good for, anyway? Well, you can certainly
represent any data type that has no more than 256 different values with a single
byte. This includes some very important and often-used data types including
the character data type, boolean data type, most enumerated data types,
and small integer data types (signed and unsigned), just to name a few.
Characters on a typical IBM compatible system use the eight bit ASCII/IBM
character set. The 80x86 provides a rich set of instructions for manipulating
character data. It's not surprising to find that most byte variables in
a typical program hold character data.
The boolean data type represents only two values: true or false. Therefore,
it only takes a single bit to represent a boolean value. However, the 80x86
really wants to work with data at least eight bits wide. It actually takes
extra code to manipulate a single bit rather than a whole byte. Therefore,
you should use a whole byte to represent a boolean value. Most programmers
use the value zero to represent false and anything else (typically one)
to represent true. The 80x86's zero flag makes testing for zero/not zero
very easy. Note that this choice of zero or non-zero is mainly for convenience.
You could use any two different values (or two different sets of values)
to represent true and false.
Most high level languages that support enumerated data types convert them
(internally) to unsigned integers. The first item in the list is generally
item zero, the second item in the list is item one, the third is item two,
etc. For example, consider the following Pascal enumerated data type:
colors = (red, blue, green, purple, orange, yellow, white, black);
Most Pascal compilers will assign the value zero to red, one to blue, two
to green, etc.
Later, you will see how to actually create your own enumerated data types
in assembly language. All you need to learn now is how to allocate storage
for a variable that holds an enumerated value. Since it's unlikely there
will be more than 256 items enumerated by the data type, you can use a simple
byte variable to hold the value. If you have a variable, say color of type
colors, using the instruction mov color,2 is the same thing
as saying color:=green in Pascal. (Later, you'll even learn
how to use more meaningful statements like mov color,green to
assign the color green to the color variable).
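C happens to number enumeration constants the same way, starting from zero, so a short C sketch can stand in for what the Pascal compiler does. The names color_values and set_color are made up for this example, and uint8_t plays the part of the byte variable.

```c
#include <stdint.h>

/* C assigns 0 to the first constant, 1 to the second, and so on:
   red = 0, blue = 1, green = 2, ... black = 7. */
enum color_values { red, blue, green, purple, orange, yellow, white, black };

/* One byte of storage is plenty for eight distinct values. */
typedef uint8_t colors;

/* Models `mov color, 2`: store the raw number in the byte variable. */
void set_color(colors *color, uint8_t value)
{
    *color = value;
}
```

After set_color(&c, 2) the variable c compares equal to green, which is the point of the paragraph above: the name and the number are interchangeable once you know the numbering.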
Of course, if you have a small unsigned integer value (0...255) or small
signed integer (-128...127) a single byte variable is the best way to go
in most cases. Note that most programmers treat all data types except small
signed integers as unsigned values. That is, characters, booleans, enumerated
types, and unsigned integers are all usually unsigned values. In some very
special cases you might want to treat a character as a signed value, but
most of the time even characters are unsigned values.
There are three main statements for declaring byte variables in a program.
They are
identifier db ?
identifier byte ?
and
identifier sbyte ?
identifier represents the name of your byte variable. "db"
is an older term that predates MASM 6.x. You will see this directive used
quite a bit by other programmers (especially those who are not using MASM
6.x or later) but Microsoft considers it to be an obsolete term; you should
always use the byte and sbyte declarations instead.
The byte declaration declares unsigned byte variables. You
should use this declaration for all byte variables except small signed integers.
For signed integer values, use the sbyte (signed byte) directive.
Once you declare some byte variables with these statements, you may reference
those variables within your program by their names:
i db ?
j byte ?
k sbyte ?
.
.
.
mov i, 0
mov j, 245
mov k, -5
mov al, i
mov j, al
etc.
Although MASM 6.x performs a small amount of type checking, you should not
get the idea that assembly language is a strongly typed language. In fact,
MASM 6.x will only check the values you're moving around to verify that
they will fit in the target location. All of the following are legal in
MASM 6.x:
mov k, 255
mov j, -5
mov i, -127
Since all of these variables are byte-sized variables, and all the associated
constants will fit into eight bits, MASM happily allows each of these statements.
Yet if you look at them, they are logically incorrect. What does it mean
to move -5 into an unsigned byte variable? Since signed byte values must
be in the range -128...127, what happens when you store the value 255 into
a signed byte variable? Well, MASM simply converts these values to their
eight bit equivalents (-5 becomes 0FBh, 255 becomes 0FFh [-1], etc.).
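You can check these conversions with a few lines of C. The sketch below is only a model of what ends up in the byte, and the function names byte_pattern, read_as_byte, and read_as_sbyte are invented for this illustration.

```c
#include <stdint.h>

/* The eight bits stored for a constant: the value modulo 256. */
uint8_t byte_pattern(int value)
{
    return (uint8_t)value;
}

/* The same eight bits interpreted as an unsigned byte (0...255). */
int read_as_byte(uint8_t bits)
{
    return bits;
}

/* The same eight bits interpreted as a signed byte (-128...127). */
int read_as_sbyte(uint8_t bits)
{
    return bits >= 0x80 ? (int)bits - 256 : (int)bits;
}
```

byte_pattern(-5) is 0FBh, which reads back as 251 when treated as unsigned, and byte_pattern(255) is 0FFh, which reads back as -1 when treated as signed. The bits never change; only the interpretation does.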
Perhaps a later version of MASM will perform stronger type checking on the
values you shove into these variables, perhaps not. However, you should
always keep in mind that it will always be possible to circumvent this checking.
It's up to you to write your programs correctly. The assembler won't help
you as much as Pascal or Ada will. Of course, even if the assembler disallowed
these statements, it would still be easy to get around the type checking.
Consider the following sequence:
mov al, -5
.
; Any number of statements which do not affect AL
.
mov j, al
There is, unfortunately, no way the assembler is going to be able to tell
you that you're storing an illegal value into j. The registers,
by their very nature, are neither signed nor unsigned. Therefore the assembler
will let you store a register into a variable regardless of the value that
may be in that register.
Although the assembler does not check to see if both operands to an instruction
are signed or unsigned, it most certainly checks their size. If the sizes
do not agree the assembler will complain with an appropriate error message.
The following examples are all illegal:
mov i, ax ;Cannot move 16 bits into eight
mov i, 300 ;300 won't fit in eight bits.
mov k, -130 ;-130 won't fit into eight bits.
You might ask "if the assembler doesn't really differentiate signed
and unsigned values, why bother with them? Why not simply use db all
the time?" Well, there are two reasons. First, it makes your programs
easier to read and understand if you explicitly state (by using byte and
sbyte) which variables are signed and which are unsigned. Second, who said
anything about the assembler ignoring whether the variables are signed or
unsigned? The mov instruction ignores the difference, but there
are other instructions that do not.
One final point is worth mentioning concerning the declaration of byte variables.
In all of the declarations you've seen thus far, the operand field of the
statement has always contained a question mark. This question mark tells
the assembler that the variable should be left uninitialized when DOS loads
the program into memory. You may specify an initial value for the variable,
one that will be loaded into memory before the program starts executing, by
replacing the question mark with your initial value. Consider the following
byte variable declarations:
i db 0
j byte 255
k sbyte -1
In this example, the assembler will initialize i, j, and
k to zero, 255, and -1, respectively, when the program loads into
memory. This fact will prove quite useful later on, especially when discussing
tables and arrays. Once again, the assembler only checks the sizes of the
operands. It does not check to make sure that the operand for the
byte directive is positive or that the value in the operand field
of sbyte is in the range -128...127. MASM will allow any value
in the range -128...255 in the operand field of any of these statements.
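The size check described here amounts to a simple range test: a constant is acceptable if it lies anywhere between the smallest signed value and the largest unsigned value for that size. The C sketch below states the rule explicitly. The function name fits_in_bits is hypothetical, and the sketch describes the rule as the text gives it, not MASM's actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if value lies in the union of the signed and unsigned
   ranges for the given width, e.g. -128...255 for eight bits.
   The width is assumed to be between 1 and 32. */
bool fits_in_bits(int64_t value, unsigned bits)
{
    int64_t lowest  = -(INT64_C(1) << (bits - 1));  /* -128 for 8 bits */
    int64_t highest = (INT64_C(1) << bits) - 1;     /*  255 for 8 bits */
    return value >= lowest && value <= highest;
}
```

So fits_in_bits(255, 8) and fits_in_bits(-128, 8) are both true, while 300 and -130 are rejected. The same test with 16 bits accepts -32768...65535.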
In case you get the impression that there isn't a real reason to use byte
vs. sbyte in a program, you should note that while MASM sometimes ignores
the differences in these definitions, Microsoft's CodeView debugger does
not. If you've declared a variable as a signed value, CodeView will display
it as such (including a minus sign, if necessary). On the other hand, CodeView
will always display db and byte variables as positive
values.
5.3.2 Declaring and using WORD Variables
Most 80x86 programs use word values for three things: 16 bit signed
integers, 16 bit unsigned integers, and offsets (pointers). Oh sure, you
can use word values for lots of other things as well, but these three represent
most applications of the word data type. Since the word is the largest data
type the 8086, 8088, 80186, 80188, and 80286 can handle, you'll find that
for most programs, the word is the basis for most computations. Of course,
the 80386 and later allow 32 bit computations, but many programs do not
use these 32 bit instructions since that would limit them to running on
80386 or later CPUs.
You use the dw, word, and sword statements to
declare word variables. The following examples demonstrate their use:
NoSignedWord dw ?
UnsignedWord word ?
SignedWord sword ?
Initialized0 word 0
InitializedM1 sword -1
InitializedBig word 65535
InitializedOfs dw NoSignedWord
Most of these declarations are slight modifications of the byte declarations
you saw in the last section. Of course you may initialize any word variable
to a value in the range -32768...65535 (the union of the range for signed
and unsigned 16 bit constants). The last declaration above, however, is
new. In this case a label appears in the operand field (specifically, the
name of the NoSignedWord variable). When a label appears in the operand
field the assembler will substitute the offset of that label (within the
variable's segment). If these were the only declarations in dseg and they
appeared in this order, the last declaration above would initialize InitializedOfs
with the value zero since NoSignedWord's offset is zero within the data
segment. This form of initialization is quite useful for initializing pointers.
But more on that subject later.
The CodeView debugger differentiates dw/word variables and
sword variables. It always displays the unsigned values as
positive integers. On the other hand, it will display sword
variables as signed values (complete with minus sign, if the value is negative).
Debugging support is one of the main reasons you'll want to use word
or sword as appropriate.
5.3.3 Declaring and using DWORD Variables
You may use the dd, dword, and sdword statements
to declare four-byte integers, pointers, and other variable types. Such
variables will allow values in the range -2,147,483,648...4,294,967,295
(the union of the range of signed and unsigned four-byte integers). You
use these declarations like the word declarations:
NoSignedDWord dd ?
UnsignedDWord dword ?
SignedDWord sdword ?
InitBig dword 4000000000
InitNegative sdword -1
InitPtr dd InitBig
The last example initializes a double word pointer with the segment:offset
address of the InitBig variable.
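A segment:offset pointer stored in a double word keeps the offset in the low order word and the segment in the high order word, which is the layout les and lds expect. The C sketch below shows the packing. The function names are made up for this example, and the segment value in the note that follows is purely illustrative, since the real value is not known until DOS loads the program.

```c
#include <stdint.h>

/* Packs a segment and an offset into the 32 bit value a far pointer
   variable holds: segment in the high word, offset in the low word. */
uint32_t make_far_pointer(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 16) | offset;
}

/* Extracts the offset (low word) from a far pointer value. */
uint16_t far_offset(uint32_t pointer)
{
    return (uint16_t)(pointer & 0xFFFFu);
}

/* Extracts the segment (high word) from a far pointer value. */
uint16_t far_segment(uint32_t pointer)
{
    return (uint16_t)(pointer >> 16);
}
```

If the six declarations above were the only ones in dseg, InitBig would sit at offset 0Ch (three double words come before it). Supposing dseg were loaded at segment 1234h, InitPtr would hold make_far_pointer(0x1234, 0x000C), that is, 1234000Ch.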
Once again, it's worth pointing out that the assembler doesn't check the
types of these variables when looking at the initialization values. If the
value fits into 32 bits, the assembler will accept it. Size checking, however,
is strictly enforced. Since the only 32 bit mov instructions
on processors earlier than the 80386 are les and lds,
you will get an error if you attempt to access dword variables on these
earlier processors using a mov instruction. Of course, even
on the 80386 you cannot move a 32 bit variable into a 16 bit register, you
must use the 32 bit registers. Later, you'll learn how to manipulate 32
bit variables, even on a 16 bit processor. Until then, just pretend that
you can't.
Keep in mind, of course, that CodeView differentiates between dd/dword
and sdword. This will help you see the actual values your variables
have when you're debugging your programs. CodeView only does this, though,
if you use the proper declarations for your variables. Always use sdword
for signed values and dd or dword (dword
is best) for unsigned values.
5.3.4 Declaring and using FWORD, QWORD, and TBYTE Variables
MASM 6.x also lets you declare six-byte, eight-byte, and ten-byte variables
using the df/fword, dq/qword, and dt/tbyte
statements. Declarations using these statements were originally intended
for floating point and BCD values. There are better directives for the floating
point variables and you don't need to concern yourself with the other data
types you'd use these directives for. The following discussion is for completeness'
sake.
The df/fword statement's main utility is declaring 48 bit pointers
for use in 32 bit protected mode on the 80386 and later. Although you could
use this directive to create an arbitrary six byte variable, there are better
directives for doing that. You should only use this directive for 48 bit
far pointers on the 80386.
dq/qword lets you declare quadword (eight byte) variables.
The original purpose of this directive was to let you create 64 bit double
precision floating point variables and 64 bit integer variables. There are
better directives for creating floating point variables. As for 64 bit integers,
you won't need them very often on the 80x86 CPU (at least, not until Intel
releases a member of the 80x86 family with 64 bit general purpose registers).
The dt/tbyte directives allocate ten bytes of storage. There
are two data types indigenous to the 80x87 (math coprocessor) family that
use a ten byte data type: ten byte BCD values and extended precision (80
bit) floating point values. This text will pretty much ignore the BCD data
type. As for the floating point type, once again there is a better way to
do it.
5.3.5 Declaring Floating Point Variables with REAL4, REAL8, and REAL10
These are the directives you should use when declaring floating point
variables. Like dd, dq, and dt these statements
reserve four, eight, and ten bytes. The operand field for each of these statements
may contain a question mark (if you don't want to initialize the variable)
or an initial value in floating point form. The following
examples demonstrate their use:
x real4 1.5
y real8 1.0e-25
z real10 -1.2594e+10
Note that the operand field must contain a valid floating point constant
using either decimal or scientific notation. In particular, pure integer
constants are not allowed. The assembler will complain if you use an operand
like the following:
x real4 1
To correct this, change the operand field to "1.0".
Please note that it takes special hardware to perform floating point operations
(e.g., an 80x87 chip or an 80x86 with built-in math coprocessor). If such
hardware is not available, you must write software to perform operations
like floating point addition, subtraction, multiplication, etc. In particular,
you cannot use the 80x86 add instruction to add two floating
point values. This text will cover floating point arithmetic in a later
chapter. Nonetheless, it's appropriate to discuss how to declare floating
point variables in the chapter on data structures.
MASM also lets you use dd, dq, and dt to declare
floating point variables (since these directives reserve the necessary four,
eight, or ten bytes of space). You can even initialize such variables with
floating point constants in the operand field. But there are two major drawbacks
to declaring variables this way. First, as with bytes, words, and double
words, the CodeView debugger will only display your floating point variables
properly if you use the real4, real8, or real10
directives. If you use dd, dq, or dt, CodeView
will display your values as four, eight, or ten byte unsigned integers.
Another, potentially bigger, problem with using dd, dq, and
dt is that they allow both integer and floating point constant
initializers (remember, real4, real8, and real10
do not). Now this might seem like a good feature at first glance. However,
the integer representation for the value one is not the same as the floating
point representation for the value 1.0. So if you accidentally enter the
value "1" in the operand field when you really meant "1.0",
the assembler would happily digest this and then give you incorrect results.
Hence, you should always use the real4, real8, and real10
statements to declare floating point variables.
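To see just how different the two representations are, you can inspect the bits directly. The C sketch below does so under the assumption that the C float type uses the same four-byte IEEE format as real4 (true of 80x86 compilers in general use). The function names float_bits and bits_as_float are invented for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Returns the 32 bits that make up a four-byte floating point value. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the bytes unchanged */
    return bits;
}

/* Interprets 32 bits as a four-byte floating point value. */
float bits_as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);      /* copy the bytes unchanged */
    return f;
}
```

float_bits(1.0f) is 3F800000h, whereas the integer one is 00000001h. Read back as floating point, the integer pattern is a tiny number very close to zero, nowhere near 1.0. That is the incorrect result you would get from writing "1" after dd when you meant "1.0".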
5.4 Creating Your Own Type Names with TYPEDEF
Let's say that you simply do not like the names that Microsoft decided
to use for declaring byte, word, dword, real, and other variables. Let's
say that you prefer Pascal's naming convention or, perhaps, C's naming convention.
You want to use terms like integer, float, double, char, boolean, or whatever.
If this were Pascal you could redefine the names in the type section of
the program. With C you could use a "#define" or a typedef statement
to accomplish the task. Well, MASM 6.x has its own typedef statement that
also lets you create aliases of these names. The following example demonstrates
how to set up some Pascal compatible names in your assembly language programs:
integer typedef sword
char typedef byte
boolean typedef byte
float typedef real4
colors typedef byte
Now you can declare your variables with more meaningful statements like:
i integer ?
ch char ?
FoundIt boolean ?
x float ?
HouseColor colors ?
If you are an Ada, C, or FORTRAN programmer (or use any other language, for
that matter), you can pick type names you're more comfortable with. Of course,
this doesn't change how the 80x86 or MASM reacts to these variables one
iota, but it does let you create programs that are easier to read and understand
since the type names are more indicative of the actual underlying types.
Note that CodeView still respects the underlying data type. If you define
integer to be an sword type, CodeView will display variables
of type integer as signed values. Likewise, if you define float to mean
real4, CodeView will still properly display float variables
as four-byte floating point values.
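As a rough sketch of how such aliases might look in context, consider the following fragment. The segment name dseg follows SHELL.ASM; the segment attributes, the variable names Total and Done, and their initial values are assumptions made for this example:

```asm
integer     typedef sword           ;Aliases, as in the text above
boolean     typedef byte

dseg        segment para public 'data'
Total       integer -1              ;One word; holds 0FFFFh
Done        boolean 0               ;One byte
dseg        ends

; Later, in the code segment (with DS pointing at dseg):

            mov     ax, Total       ;Same encoding as if Total were an sword
            mov     Done, 1         ;Same encoding as if Done were a byte
```

The instructions are exactly the ones you would write for sword and byte variables; only the declarations read differently.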
5.5 Pointer Data Types
Some people refer to pointers as scalar data types, others refer to
them as composite data types. This text will treat them as scalar data types
even though they exhibit some tendencies of both scalar and composite data
types (for a complete description of composite data types, see section 5.6,
"Composite Data Types").
Of course, the place to start is with the question "What is a pointer?"
Now you've probably experienced pointers first hand in the Pascal, C, or
Ada programming languages and you're probably getting worried right now.
Almost everyone has a really bad experience when they first encounter pointers
in a high level language. Well, fear not! Pointers are actually easier to
deal with in assembly language. Besides, most of the problems you had with
pointers probably had nothing to do with pointers, but rather with the linked
list and tree data structures you were trying to implement with them. Pointers,
on the other hand, have lots of uses in assembly language that have nothing
to do with linked lists, trees, and other scary data structures. Indeed,
simple data structures like arrays and records often involve the use of
pointers. So if you've got some deep-rooted fear about pointers, well forget
everything you know about them. You're going to learn how great pointers
really are.
Probably the best place to start is with the definition of a pointer. Just
exactly what is a pointer, anyway? Unfortunately, high level languages like
Pascal tend to hide the simplicity of pointers behind a wall of abstraction.
This added complexity (which exists for good reason, by the way) tends to
frighten programmers because they don't understand what's going on.
Now if you're afraid of pointers, well, let's just ignore them for the time
being and work with an array. Consider the following array declaration in
Pascal:
    M: array [0..1023] of integer;
Even if you don't know Pascal, the concept here is pretty easy to understand.
M is an array with 1024 integers in it, indexed from M[0] to
M[1023]. Each one of these array elements can hold an integer value
that is independent of all the others. In other words, this array gives
you 1024 different integer variables each of which you refer to by number
(the array index) rather than by name.
If you encountered a program that had the statement M[0]:=100
you probably wouldn't have to think at all about what is happening with
this statement. It is storing the value 100 into the first element of the
array M. Now consider the following two statements:
    i := 0;             (* Assume "i" is an integer variable *)
    M [i] := 100;
You should agree, without too much hesitation, that these two statements
perform the same exact operation as M[0]:=100;. Indeed, you're
probably willing to agree that you can use any integer expression in the
range 0...1023 as an index into this array. The following statements still
perform the same operation as our single assignment to index zero:
    i := 5;             (* Assume all variables are integers *)
    j := 10;
    k := 50;
    M [i*j-k] := 100;
"Okay, so what's the point?" you're probably thinking. "Anything
that produces an integer in the range 0...1023 is legal. So what?"
Okay, how about the following:
    M [1] := 0;
    M [ M [1] ] := 100;
Whoa! Now that takes a few moments to digest. However, if you take it slowly,
it makes sense and you'll discover that these two instructions perform the
exact same operation you've been doing all along. The first statement stores
zero into array element M[1]. The second statement fetches
the value of M[1], which is an integer so you can use it as
an array index into M, and uses that value (zero) to control where it stores
the value 100.
If you're willing to accept the above as reasonable, perhaps bizarre, but
usable nonetheless, then you'll have no problems with pointers. Because
M[1] is a pointer! Well, not really, but if you were to change
"M" to "memory" and treat this array as all of memory,
this is the exact definition of a pointer.
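For the curious, here is roughly how those two Pascal statements might look in 80x86 assembly, assuming M is an array of 1024 words in the data segment. Array indexing is covered properly later in this chapter, so treat this as a preview rather than a recipe. The only detail to notice is that an index must be doubled to get a byte offset, because each element is two bytes long:

```asm
M           word    1024 dup (?)    ;M: array [0..1023] of integer

            mov     M[2], 0         ;M[1] := 0  (element 1 is at byte offset 2)
            mov     bx, M[2]        ;Fetch M[1] to use as an index
            add     bx, bx          ;Index * 2 = byte offset
            mov     M[bx], 100      ;M[ M[1] ] := 100
```

The value fetched from M[1] decides which memory location the last instruction writes to, which is all a pointer really does.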
A pointer is simply a memory location whose value is the address (or index,
if you prefer) of some other memory location. Pointers are very easy to
declare and use in an assembly language program. You don't even have to
worry about array indices or anything like that. In fact, the only complication
you're going to run into is that the 80x86 supports two kinds of pointers:
near pointers and far pointers.
A near pointer is a 16 bit value that provides an offset into a segment.
It could be any segment but you will generally use the data segment (dseg
in SHELL.ASM). If you have a word variable p that contains
1000h, then p "points" at memory location 1000h in
dseg. To access the word that p points at, you could use code
like the following:
    mov     bx, p           ;Load BX with pointer.
    mov     ax, [bx]        ;Fetch data that p points at.
By loading the value of p into bx this code loads
the value 1000h into bx (assuming p contains 1000h
and, therefore, points at memory location 1000h in dseg). The second instruction
above loads the ax register with the word starting at the location
whose offset appears in bx. Since bx now contains
1000h, this will load ax from locations DS:1000h and DS:1001h.
Why not just load ax directly from location 1000h using an
instruction like mov ax,ds:[1000h]? Well, there are lots of
reasons. But the primary reason is that this single instruction always loads
ax from location 1000h. Unless you are willing to mess around
with self-modifying code, you cannot change the location from which it loads
ax. The previous two instructions, however, always load ax
from the location that p points at. This is very easy to change
under program control, without using self-modifying code. In fact, the simple
instruction mov p,2000h will cause those two instructions above
to load ax from memory location DS:2000h the next time they
execute. Consider the following instructions:
    lea     bx, i           ;This can actually be done with
    mov     p, bx           ; a single instruction as you'll
     .                      ; see in Chapter Eight.
     .
  < Some code that skips over the next two instructions >

    lea     bx, j           ;Assume the above code skips these
    mov     p, bx           ; two instructions, that you get
     .                      ; here by jumping to this point from
     .                      ; somewhere else.
    mov     bx, p           ;Assume both code paths above wind
    mov     ax, [bx]        ; up down here.
This short example demonstrates two execution paths through the program.
The first path loads the variable p with the address of the
variable i (remember, lea loads bx
with the offset of the second operand). The second path through the code
loads p with the address of the variable j. Both
execution paths converge on the last two mov instructions that
load ax with i or j depending upon
which execution path was taken. In many respects, this is like a parameter
to a procedure in a high level language like Pascal. Executing the same
instructions accesses different variables depending on whose address (i
or j) winds up in p.
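One way to fill in the skipped parts of that example is sketched below. The labels UseJ and Fetch, the initial values, and the unconditional jmp are assumptions made purely for illustration; in a real program some test would decide which path executes:

```asm
; In the data segment:
i           word    1234h
j           word    5678h
p           word    ?               ;Near pointer: holds an offset into dseg

; In the code segment (DS points at dseg):
            lea     bx, i
            mov     p, bx           ;p points at i
            jmp     Fetch           ;Skip the second path

UseJ:       lea     bx, j           ;Reached only by a jump to UseJ
            mov     p, bx           ;p points at j

Fetch:      mov     bx, p           ;Both paths arrive here
            mov     ax, [bx]        ;AX = 1234h via the first path, 5678h via UseJ
```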
Sixteen bit near pointers are small, fast, and the 80x86 provides efficient
access using them. Unfortunately, they have one very serious drawback -
you can only access 64K of data (one segment) when using near pointers.
Far pointers overcome this limitation at the expense of being 32 bits long.
In return, far pointers let you access any piece of data anywhere in the memory
space. For this reason, and the fact that the UCR Standard Library uses
far pointers exclusively, this text will use far pointers most of the time.
But keep in mind that this is a decision based on trying to keep things
simple. Code that uses near pointers rather than far pointers will be shorter
and faster.
To access data referenced by a 32 bit pointer, you will need to load the
offset portion (L.O. word) of the pointer into bx, bp, si,
or di and the segment portion into a segment register (typically
es). Then you could access the object using the register indirect
addressing mode. Since the les instruction is so convenient
for this operation, it is the perfect choice for loading es
and one of the above four registers with a pointer value. The following
sample code stores the value in al into the byte pointed at
by the far pointer p:
    les     bx, p           ;Load p into ES:BX
    mov     es:[bx], al     ;Store away AL
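To see what les is doing, the sketch below sets up a far pointer by hand and then loads it both ways. The name Buffer and its size are placeholders, and the fragment assumes the pointer variable lives in the segment that DS addresses:

```asm
; In the data segment:
Buffer      byte    16 dup (?)
p           dword   ?               ;Far pointer: offset in L.O. word, segment in H.O. word

; In the code segment:
            mov     word ptr p, offset Buffer   ;Store the offset portion
            mov     word ptr p+2, seg Buffer    ;Store the segment portion

            les     bx, p                       ;ES:BX = p in one instruction

            mov     bx, word ptr p              ;The same load done by hand:
            mov     es, word ptr p+2            ; offset to BX, segment to ES

            mov     es:[bx], al                 ;Store AL into the first byte of Buffer
```

Both loading sequences leave the same address in ES:BX; les simply does it in a single instruction.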
Since near pointers are 16 bits long and far pointers are 32 bits long,
you could simply use the dw/word and dd/dword
directives to allocate storage for your pointers (pointers are inherently
unsigned, so you wouldn't normally use sword or sdword
to declare a pointer). However, there is a much better way to do this by
using the typedef statement. Consider the following general
forms:
    typename    typedef near ptr basetype
    typename    typedef far ptr basetype
In these two examples typename represents the name of the new type you're
creating while basetype is the name of the type you want to create a pointer
for. Let's look at some specific examples:
    nbytptr     typedef near ptr byte
    fbytptr     typedef far ptr byte
    colorsptr   typedef far ptr colors
    wptr        typedef near ptr word
    intptr      typedef near ptr integer
    intHandle   typedef near ptr intptr
(These declarations assume that you've previously defined the types colors
and integer with the typedef statement.) The typedef
statements with the near ptr operands produce 16 bit near pointers. Those
with the far ptr operands produce 32 bit far pointers. MASM 6.x ignores
the base type supplied after the near ptr or far ptr. However, CodeView
uses the base type to display the object a pointer refers to in its correct
format.
Note that you can use any type as the base type for a pointer. As the last
example above demonstrates, you can even define a pointer to another pointer
(a handle). CodeView would properly display the object a variable of type
intHandle points at as an address.
With the above types, you can now generate pointer variables as follows:
    bytestr         nbytptr   ?
    bytestr2        fbytptr   ?
    CurrentColor    colorsptr ?
    CurrentItem     wptr      ?
    LastInt         intptr    ?
Of course, you can initialize these pointers at assembly time if you know
where they are going to point when the program first starts running. For
example, you could initialize the bytestr variable above with the offset
of MyString using the following declaration:
    bytestr         nbytptr   MyString
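The following sketch shows a few of these typed pointers in use. The contents of MyString, the variables n and h, and the choice of registers are assumptions for illustration; the type names come from the examples above:

```asm
integer     typedef sword
nbytptr     typedef near ptr byte
intptr      typedef near ptr integer
intHandle   typedef near ptr intptr

; In the data segment:
MyString    byte    "Hello", 0
bytestr     nbytptr MyString        ;Initialized with the offset of MyString
n           integer 25
LastInt     intptr  n               ;Points at n
h           intHandle LastInt       ;Points at the pointer LastInt

; In the code segment (DS points at dseg):
            mov     bx, bytestr
            mov     al, [bx]        ;AL = 'H', the first byte of MyString

            mov     bx, h           ;BX = address of LastInt
            mov     bx, [bx]        ;BX = contents of LastInt = address of n
            mov     ax, [bx]        ;AX = 25
```

Reaching n through the handle takes one extra load compared with going through LastInt directly.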
Art of Assembly: Chapter Five - 26 SEP 1996