URL
https://opencores.org/ocsvn/light8080/light8080/trunk
c80dos.doc
>>> Small-C Version 1-N Compiler Documentation <<<
NOTE: C80DOS.EXE is the MSDOS compiled binary for running on
a standard PC class machine which emits 8080 assembler
that can then be assembled and loaded on the PC using
lasm.cpm and load.cpm with the zrun.com CP/M emulator.
The final output (or any of the intermediate output in
8080 assembler or Intel HEX format) can then be ported
to the CP/M machine by telecommunicating with any of
a myriad of programs or by writing the disk directly
using something like the Uniform.exe program or its
equivalent. Hopefully, in the near future, a Z80
opcode version of the compiler as well as PC executable
versions of lasm and load will be finished. (RDK)
Available in the <MICRO> directory is a compiler for a
subset of the language C. It consists of the two files C80.C
(compiler) and C80LIB.I80 (runtime library). It is in source
form and is free to anyone wishing to use it.
Characteristics of the compiler are as follows:
(1) It supports a subset of the language C. (see the
book "The C Programming Language", by Brian Kernighan and
Dennis Ritchie.) (2) It is written in C itself. (3) It is
syntactically identical to the C on UNIX (unlike some other
small C compilers and interpreters). (4) It produces as
output a text file suitable for input to an 8080 assembler.
(5) It is a stand-alone single-pass compiler (which means it
does its own syntax checking and parsing and produces no
intermediate files). (6) It can compile itself. This means
any processor supporting C can be used to develop this small
C compiler for any other processor.
The intention behind the writing of this compiler was to
bring the C language to small computers. It was developed
primarily on an 8080 system with 40 K bytes and a single
mini-floppy. Consequently, an effort was made to keep the
compiler small in order to fit within limited memory, and
intermediate files were avoided in order to conserve floppy
space.
COMPILER SPECIFICATIONS
As of this writing, the compiler supports the following:
(1) Data type declarations can be:
- "char" (8 bits)
- "int" (16 bits)
- (by placing an "*" before the variable name, a pointer
can be formed to the respective type of
data element).
(2) Arrays:
- single dimension (vector) arrays can be
of type "char" or "int".
(3) Expressions:
- unary operators:
"-" (minus)
"*" (indirection)
"&" (address of)
"++" (increment, either prefix or postfix)
"--" (decrement, either prefix or postfix)
- binary operators:
"+" (addition)
"-" (subtraction)
"*" (multiplication)
"/" (division)
"%" (mod, i.e. remainder from division)
"|" (inclusive 'or')
"^" (exclusive 'or')
"&" (bitwise 'and')
"==" (test for equal)
"!=" (test for not equal)
"<" (test for less than)
"<=" (test for less than or equal to)
">" (test for greater than)
">=" (test for greater than or equal to)
"<<" (arithmetic left shift)
">>" (arithmetic right shift)
- primaries:
-array[expression]
-function(arg1, arg2,...,argn)
-constant
-decimal number
-quoted string ("sample string")
-primed string ('a' or 'Z' or 'ab')
-local variable (or pointer)
-global (static) variable (or pointer)
(4) Program control:
-if(expression)statement;
-if(expression) statement;
else statement;
-while (expression) statement;
-break;
-continue;
-return;
-return expression;
-; (null statement)
-{statement; statement; ... statement;}
(compound statement)
(5) Pointers:
-local and static pointers can contain the
address of "char" or "int" data elements.
(6) Compiler commands:
- #define name string (pre-processor will replace
name by string throughout text.)
- #include filename (allows program to include other
files within this compilation.)
- #asm (not supported by standard C)
Allows all code between "#asm" and "#endasm"
to be passed unchanged to the target
assembler. This command is actually a statement
and may appear in the context:
"if (expression) #asm...#endasm else..."
(7) Miscellaneous:
-Expression evaluation maintains the same hierarchy
as standard C.
-Function calls are defined as
any primary followed by an open paren, so legal forms
include:
variable();
array[expression]();
constant();
function()();
-Pointer arithmetic takes into account the data
type of the destination (e.g. pointer++ will increment
by two if pointer was declared "int *pointer").
-Pointer compares generate unsigned
compares (since addresses are not signed numbers).
-Often used pieces of code
(i.e. storing the primary register indirect through the
top of the stack) generate calls to library routines to
shorten the amount of code generated.
-Generated code is "pure" (i.e. the code may be placed
in Read Only Memory). Code, literals, and variables
are kept in separate sections of memory.
-The generated code is re-entrant. Every time a function
is called, its local variables refer to a new stack
frame. By way of example, the compiler uses
recursive-descent for most of its parsing, which relies
heavily on re-entrant (recursive) functions.
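As an illustration of the subset listed above, the following
sketch uses only "char" and "int" data, pointers, "while",
"if"/"else", and recursion. The function names are invented for
this example and do not come from C80.C or C80LIB.I80. The
function headers use the modern prototype form so that a current
C compiler accepts the code; a compiler of this vintage would
most likely expect the older style of argument declarations, so
treat the headers as an assumption.

```c
/* Sketch only: function names are hypothetical, not from C80.C. */

/* Count characters up to the terminating zero using a char pointer. */
int str_len(char *s)
{
    int n;
    n = 0;
    while (*s) {
        ++n;
        ++s;            /* char pointer advances by one byte */
    }
    return n;
}

/* Sum `count` ints through an int pointer. */
int sum_ints(int *p, int count)
{
    int total;
    total = 0;
    while (count > 0) {
        total = total + *p;
        p++;            /* int pointer advances by one int */
        --count;
    }
    return total;
}

/* Recursion relies on each call getting a fresh stack frame. */
int factorial(int n)
{
    if (n <= 1) return 1;
    else return n * factorial(n - 1);
}
```

Note the bodies avoid initialized declarations, "for", and
assignment operators such as "+=", none of which the document
lists as supported.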
COMPILER RESTRICTIONS
Since recent stages of compiler check-out have been done
both on an 8080 system and on UNIX, language syntax appears
to be identical (within the given subset) between this small
C compiler and the standard UNIX compiler.
Not supported yet are:
(1) Structures.
(2) Multi-dimensional arrays.
(3) Floating point, long integer, or unsigned data types.
(4) Function calls returning anything but "int".
(5) The unaries "!", "~", and "sizeof".
(6) The control binary operators "&&", "||", and "?:".
(7) The declaration specifiers "auto", "static", "extern",
and "register".
(8) The statements "for", "switch", "case",
and "default".
(9) The use of arguments within a "#define" command.
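Several of the missing constructs can usually be expressed with
what the subset does provide. The sketch below shows one
possible rewriting of the "!" operator, the "&&" operator, and a
"for" loop. The function names are invented for the
illustration, and the headers are written in standard C form.

```c
/* Sketch only: hypothetical helper names. */

/* "!x" written as a comparison against zero. */
int is_zero(int x)
{
    return x == 0;
}

/* "a && b" written as nested ifs: b is tested only when a is nonzero. */
int both_set(int a, int b)
{
    if (a)
        if (b)
            return 1;
    return 0;
}

/* "for (i = 0; i < n; i++) total += v[i];" written with "while". */
int sum_array(int v[], int n)
{
    int i;
    int total;
    i = 0;
    total = 0;
    while (i < n) {
        total = total + v[i];
        i++;
    }
    return total;
}
```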
Compiler restrictions include:
(1) Since it is a single-pass compiler, undefined names
are not detected and are assumed to be function names not yet
defined. If this assumption is incorrect, the undefined
reference will not appear until the compiled program is
assembled.
(2) No optimizing is done. The code produced is sound
and capable of re-entrancy, but no attempt is made to
optimize either for code size or speed. It was assumed a
post-processor optimizer would later be written for the
target machine.
(3) Since the target assembler is of unknown
characteristics, no attempt is made to produce pseudo-ops to
declare static variables as internal or external.
(4) Constants are not evaluated by the compiler. That
is, the line of code:
X = 1+2;
would generate code to add "1" and "2" at runtime. The
results are correct, but unnecessary code is the penalty.
ASSEMBLY LANGUAGE INTERFACE
Interfacing to assembly language is relatively
straight-forward. The "#asm ... #endasm" construct allows
the user to place assembly language code directly into the
control context. Since it is considered by the compiler to
be a single statement, it may appear in such forms as:
while(1) #asm ... #endasm
or
if (expression) #asm...#endasm else...
Due to the workings of the preprocessor which must be
suppressed in this construct, the pseudo-op "#asm" must be
the last item before the carriage return on the end of the
line (i.e. the text between #asm and the <CR> is thrown
away), and the "#endasm" pseudo-op must appear on a line by
itself (i.e. everything after #endasm is also thrown away).
Since the parser is completely free-format outside of these
exceptions, the expected format is as follows:
if (expression) #asm
...
...
#endasm
else statement;
Note a semicolon is not required after the #endasm since
the end of context is obvious to the compiler. Assembly
language code within the "#asm ... #endasm" context has
access to all global symbols and functions by name. It is up
to the programmer to know the data type of the symbol
(whether "char" or "int" implies a byte access or a word
access). Stack locals and arguments may be retrieved by
offset (see STACK FRAME). External assembly language
routines invoked by function calls from the c-code have
access to all registers and do not have to restore them prior
to exit. They may push items on the stack as well, but must
pop them off before exit. It is the responsibility of the
calling program to remove arguments from the stack after a
function call. This must not be done by the function itself.
There is no limit to the number of bytes the function may
push onto the stack, providing they are removed prior to
returning. Since parameters are passed by value, the
parameters on the stack may be modified by the called program.
STACK FRAME
The stack is used extensively by the compiler. Function
arguments are pushed onto the stack as they are encountered
between parentheses (note, this is opposite that of standard
C, which means routines expressly retrieving arguments from
the stack rather than declaring them by name must beware).
By the definition of the language, parameter passing is "call
by value". For example the following code would be produced
for the C statement:
function(X, Y, z());
LHLD X
PUSH H
LHLD Y
PUSH H
CALL z
PUSH H
CALL function
POP B
POP B
POP B
Notice, the compiler cleans up the stack after the call
using a simple algorithm to use the least number of bytes.
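For a routine that retrieves its arguments from the stack
directly, the offsets can be worked out with a little
arithmetic. The sketch below assumes what the listing above
shows (arguments pushed left to right, two bytes each) together
with the fact that the 8080 CALL instruction pushes a two-byte
return address. The function name arg_offset is hypothetical.

```c
/* Offset from SP, on entry to the called function and before it
 * allocates any locals, of argument number `index` (0 = leftmost)
 * out of `count` arguments.
 * Assumes 2 bytes per argument and a 2-byte return address at SP+0. */
int arg_offset(int index, int count)
{
    return 2 + 2 * (count - 1 - index);
}
```

For the call function(X, Y, z()) this places the value returned
by z() at SP+2, Y at SP+4, and X at SP+6. Anything the routine
pushes afterwards moves all of these further from SP by the
number of bytes pushed.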
Local variables allocate as much stack space as is
needed, and are then assigned the current value of the stack
pointer (after the allocation) as their address.
int X;
would produce:
PUSH B
which merely allocates room on the stack for 2 bytes (not
initialized to any value). References to the local variable
X will now be made to the stack pointer + 0. If another
declaration is made:
char array[3];
the code would be:
DCX SP
PUSH B
Array[0] would be at SP+0, array[1] would be at SP+1,
array[2] would be at SP+2, and X would now be at SP+3. Thus,
assembly language code using "#asm...#endasm" cannot access
local variables by name, but must know how many intervening
bytes have been allocated between the declaration of the
variable and its use. It is worth pointing out local
declarations allocate only as much stack space as is
required, including an odd number of bytes, whereas function
arguments always consist of two bytes apiece. In the event
the argument was type "char" (8 bits), the most significant
byte of the 2-byte value is a sign-extension of the lower
byte.
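The bookkeeping described above can be sketched in C. The first
function computes the offset of a local from the stack pointer
as the total size of the locals declared after it; the second
shows the sign extension applied to a "char" argument. Both
function names are hypothetical, and the code models the
description in this section rather than quoting the compiler.

```c
/* sizes[] holds the byte size of each local in declaration order.
 * Returns the offset from SP of local number `which` once all
 * `count` locals have been allocated: the sum of the sizes of
 * the locals declared after it. */
int local_offset(int sizes[], int count, int which)
{
    int offset;
    int i;
    offset = 0;
    i = which + 1;
    while (i < count) {
        offset = offset + sizes[i];
        i++;
    }
    return offset;
}

/* Widen an 8-bit value to a 16-bit pattern: the upper byte
 * repeats the sign bit of the lower byte. */
int widen_char(int low_byte)
{
    low_byte = low_byte & 255;
    if (low_byte & 128)
        return low_byte | 65280;    /* 65280 is FF00 hex */
    return low_byte;
}
```

With sizes {2, 3} (the "int X" and "char array[3]" of the
example) the array comes out at SP+0 and X at SP+3, matching the
text.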
OPERATING THE COMPILER
The small C compiler begins by asking the user for a
number of options regarding the expected compilation. Since
it was easier to ask questions than to pull arguments from a
command line (which is in no way similar between the 8080
developmental system and UNIX), this was the preferred
method.
The questions asked are as follows:
Do you want the c-text to appear?
This gives the user the option of interleaving the
source code into the output file. Response is Y or N. If Y,
a semicolon will be placed at the start of each input line
(to force a comment to the 8080 assembler) and the input
lines will be printed where appropriate. If the answer is N,
only the generated 8080 code will be output.
Do you wish the globals to be defined?
This question is primarily a developmental aid between
machines. If the answer is Y, all static symbols will
allocate storage within the module being compiled. This is
the normal method. If N, no storage will be allocated, but
symbol references will still be made in the normal way.
Essentially, this question allows the user to specify all or
none of the static symbols external. It is to be considered
a temporary measure.
Starting number for labels?
This lets the user supply the first label number
generated by the compiler for its internal labels (which will
typically be "ccXXXXX", where XXXXX is a decimal number
increasing with each label). This option allows modules to
be compiled separately and later appended on the source level
without generating multi-defined labels.
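The reason for this option can be shown with a small sketch in
standard C. It assumes only what is stated above: labels are
"cc" followed by a decimal number that increases by one per
label. Whether the real compiler pads the number with leading
zeros is not stated here, so the sketch prints it unpadded. The
function names are hypothetical.

```c
#include <stdio.h>

/* Format label number `number` as "cc" plus decimal digits.
 * buf must have room for at least 16 characters. */
void format_label(char *buf, int number)
{
    sprintf(buf, "cc%d", number);
}

/* Nonzero if two modules, each using `count` consecutive label
 * numbers from its own starting number, would share a label. */
int labels_clash(int start_a, int count_a, int start_b, int count_b)
{
    return start_a < start_b + count_b && start_b < start_a + count_a;
}
```

Two modules can be appended safely when their label ranges do
not overlap, for example one starting at 0 and another starting
at a number larger than the first module's label count.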
Output filename?
This question gets from the user the name of the file to
be created. A null line sends output to the user's terminal.
Input filename?
This question gets from the user the name of the C
module to use as input. The question will be repeated each
time a name is supplied, allowing the user to create an
output file consisting of many separate input files (it
behaves as if the user had appended them together and
submitted only the one file). A null line response ends the
compilation process.
COMPILING THE COMPILER
The power of the compiler lies in the fact it can
compile itself. This allows a user to "bootstrap" the
compiler onto a new machine without excessive recoding.
To compile the compiler under the UNIX operating system,
the appropriate command is:
% cc C80.c -lS
which will invoke the UNIX C-compiler and the UNIX linker to
create the runnable file "a.out". This file may be renamed
as needed and used. No other files are needed.
In order to create a compiler for a new machine, the
user will need to compile the compiler into the language of
the destination processor. The procedure currently used to
create the compiler for my 8080 system is as follows:
(1) Edit the file C80.c to modify two lines of code:
-change the line of code
#include <stdio.h>
to
#define NULL 0
(this is done since the "stdio.h" I/O header file
contains unparsable lines for the small compiler, and the
line defining NULL is the only line of "stdio.h" needed by
the compiler).
-change the line of code
#define eol 10
to
#define eol 13
(this is done since my 8080 system uses <CR> for the end
of line character, and UNIX uses the "newline" character).
(2) Invoke the compiler (by typing "a.out" or whatever other
name it was given).
(3) Answer the questions by the compiler to use the file
C80.c as input and to produce the file C80.I80
as output.
(4) Append the files C80.I80 and C80LIB.I80 (the code for the
compiler and the code for the runtime library,
respectively).
(5) Assemble the combined file using some 8080 assembler.
(6) Execute the created run file.
Currently, the 8080 assembler used must possess the
abilities to handle symbol names unique to 8 characters and
to recognize lower-case symbol names as unique from their
upper-case equivalent. This is due to the fact the compiler
recognizes 8-character names and passes all static variable
and function names intact to the assembler. There are a few
symbol names within the compiler which are not unique until
the 7th character and which have "upper-case twins". These
discourage the use of the KL-10's MACN80 since it folds
lower-case to upper case and does not recognize 8-character
names. It may be used, however, if the user is aware of
these limitations and chooses symbol names within these
restrictions.
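A quick way to check a set of symbol names against a given
assembler is to compare them the way that assembler would. The
sketch below, in standard C, treats the number of significant
characters and the case folding as parameters, since the exact
limits of any particular assembler are not given here. The
function names and the sample names in the usage note are
hypothetical.

```c
/* Fold an ASCII lower-case letter to upper case. */
int fold(int c)
{
    if (c >= 'a')
        if (c <= 'z')
            return c - 32;
    return c;
}

/* Nonzero if names a and b look identical to an assembler that
 * keeps only the first `significant` characters and, when
 * `fold_case` is nonzero, ignores letter case. */
int names_collide(char *a, char *b, int significant, int fold_case)
{
    int i;
    int ca;
    int cb;
    i = 0;
    while (i < significant) {
        ca = a[i];
        cb = b[i];
        if (fold_case) {
            ca = fold(ca);
            cb = fold(cb);
        }
        if (ca != cb) return 0;
        if (ca == 0) return 1;   /* both names ended together */
        i++;
    }
    return 1;                    /* equal through the cutoff */
}
```

For instance, "getch" and "GETCH" collide only when case is
folded, and "address1" and "address2" collide when fewer than 8
characters are significant.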
THE FUTURE OF THE COMPILER
That part of the compiler which produces code for the
8080 is all together in the final section of the compiler.
Routines used by the compiler to produce code are kept short
and are commented. Changing this compiler to produce code
for any other machine is a matter of changing only these few
routines, and does not entail digging around through the
internals of the program. I would expect the change to
another machine could be made in an afternoon providing the
target machine had the following attributes:
(1) A stack, preferably running backwards as items
are pushed onto it.
(2) Two sixteen-bit registers. In the 8080 these
are the HL register pair (the primary register
to the compiler) and the DE register pair (the
secondary register).
(3) An assembler (or cross-assembler).
Since the compiler is just now on its feet and subject
to feedback from users, it is expected many changes will be
made to it. Already planned changes (in order of expected
addition) are:
(1) Constants will be pre-evaluated by the
compiler. Something like x=1+2*3 will become
x=7 prior to generating any code.
(2) Structures will be added. This is one of the
powers of C. Its omission has always been
considered temporary.
(3) Assignment operators (+=, &=, etc.) will be
added.
(4) Missing unary and binary operators and
statements will be added.
(5) The expression parser will create intermediate
tree-structures of the expressions and will
walk through them before generating any code.
This will allow some optimization and will
allow the function arguments to be passed on
the stack in the same sequence as UNIX.
(6) A peep-hole optimizer will be added to improve
the generated code.
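Planned change (1) could look roughly like the following. This
is a sketch and not the author's design: a small
recursive-descent evaluator, in standard C, that folds an
expression made only of decimal constants, "+", "-", "*", "/",
and parentheses into a single value. All names are hypothetical;
spaces and error reporting are left out.

```c
/* Cursor into the expression text being folded. */
static const char *cur;

static int expr(void);

/* primary := number | "(" expr ")" */
static int primary(void)
{
    int v = 0;
    if (*cur == '(') {
        cur++;
        v = expr();
        if (*cur == ')') cur++;
        return v;
    }
    while (*cur >= '0' && *cur <= '9') {
        v = v * 10 + (*cur - '0');
        cur++;
    }
    return v;
}

/* term := primary { ("*" | "/") primary } */
static int term(void)
{
    int v = primary();
    while (*cur == '*' || *cur == '/') {
        char op = *cur++;
        int rhs = primary();
        if (op == '*') v = v * rhs;
        else if (rhs != 0) v = v / rhs;   /* division by zero is skipped */
    }
    return v;
}

/* expr := term { ("+" | "-") term } */
static int expr(void)
{
    int v = term();
    while (*cur == '+' || *cur == '-') {
        char op = *cur++;
        int rhs = term();
        if (op == '+') v = v + rhs;
        else v = v - rhs;
    }
    return v;
}

/* Fold a constant expression such as "1+2*3" into its value. */
int fold_constants(const char *text)
{
    cur = text;
    return expr();
}
```

Given the text "1+2*3" the evaluator returns 7, which is the
value the compiler would then emit in place of the run-time
arithmetic.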
Many of these things represent a wish-list. Time will
be spent only when it becomes available. Any volunteer help
in any of these areas would be appreciated.
Questions should be directed to Ron Cain here at SRI
either at extension 3860 or at CAIN@SRI-KL.