OpenCores
URL https://opencores.org/ocsvn/light8080/light8080/trunk


         c80dos.doc

         >>>  Small-C Version 1-N Compiler Documentation  <<<

         NOTE:  C80DOS.EXE is the MSDOS compiled binary for running on
                a standard PC class machine which emits 8080 assembler
                that can then be assembled and loaded on the PC using
                lasm.cpm and load.cpm with the zrun.com CP/M emulator.
                The final output (or any of the intermediate output in
                8080 assembler or Intel HEX format) can then be ported
                to the CP/M machine by telecommunicating with any of
                a myriad of programs or by writing the disk directly
                using something like the Uniform.exe program or its
                equivalent.  Hopefully, in the near future, a Z80
                opcode version of the compiler as well as PC executable
                versions of lasm and load will be finished. (RDK)
                

              Available in the <MICRO> directory is a compiler  for  a
         subset of the language C.  It consists of the two files C80.C
         (compiler)  and  C80LIB.I80 (runtime library).  It is in source
         form  and  is   free   to   anyone   wishing   to   use   it.
         Characteristics of the compiler are as follows:

              (1)  It  supports  a subset of the language C.  (see the
         book "The C Programming Language",  by  Brian  Kernighan  and
         Dennis  Ritchie.)   (2) It is written in C itself.  (3) It is
         syntactically identical to the C on UNIX (unlike  some  other
         small  C  compilers  and  interpreters).   (4) It produces as
         output a text file suitable for input to an  8080  assembler.
         (5)  It is a stand-alone single-pass compiler (which means it
         does its own syntax checking  and  parsing  and  produces  no
         intermediate  files).  (6) It can compile itself.  This means
         any processor supporting C can be used to develop this  small
         C compiler for any other processor.

              The intention behind the writing of this compiler was to
         bring  the  C  language to small computers.  It was developed
         primarily on an 8080 system with  40  K  bytes  and  a  single
         mini-floppy.   Consequently,  an  effort was made to keep the
         compiler small in order to fit  within  limited  memory,  and
         intermediate  files  were avoided in order to conserve floppy
         space.


         COMPILER SPECIFICATIONS

              As of this writing, the compiler supports the following:

         (1) Data type declarations can be:

              - "char" (8 bits)
              - "int"  (16 bits)
              - (by placing an "*" before the variable name, a pointer
                can be formed to the respective type of
                data element).

         (2) Arrays:

              - single dimension (vector) arrays can be
                of type "char" or "int".

         (3) Expressions:

              - unary operators:
                 "-" (minus)
                 "*" (indirection)
                 "&" (address of)
                 "++" (increment, either prefix or postfix)
                 "--" (decrement, either prefix or postfix)
              - binary operators:
                 "+" (addition)
                 "-" (subtraction)
                 "*" (multiplication)
                 "/" (division)
                 "%" (mod, i.e. remainder from division)
                 "|" (inclusive 'or')
                 "^" (exclusive 'or')
                 "&" (bitwise 'and')
                 "==" (test for equal)
                 "!=" (test for not equal)
                 "<"  (test for less than)
                 "<=" (test for less than or equal to)
                 ">"  (test for greater than)
                 ">=" (test for greater than or equal to)
                 "<<" (arithmetic left shift)
                 ">>" (arithmetic right shift)
              - primaries:
                 -array[expression]
                 -function(arg1, arg2,...,argn)
                 -constant
                        -decimal number
                        -quoted string ("sample string")
                        -primed string ('a' or 'Z' or 'ab')
                 -local variable (or pointer)
                 -global (static) variable (or pointer)

         (4) Program control:

                -if(expression)statement;
                -if(expression) statement;
                        else statement;
                -while (expression) statement;
                -break;
                -continue;
                -return;
                -return expression;
                -; (null statement)
                -{statement; statement; ... statement;}
                         (compound statement)

         (5) Pointers:

                -local and static pointers can contain the
                 address of "char" or "int" data elements.

         (6) Compiler commands:

                - #define name string (pre-processor will replace
                        name by string throughout text.)
                - #include filename (allows program to include other
                        files within this compilation.)
                - #asm (not supported by standard C)
                        Allows all code between "#asm" and "#endasm"
                        to be passed unchanged to the target
                        assembler.  This command is actually a statement
                        and may appear in the context:
                        "if (expression) #asm...#endasm else..."

         (7) Miscellaneous:

                -Expression evaluation maintains the same hierarchy
                 as standard C.

                -Function calls are defined as
                 any primary followed by an open paren, so legal forms
                 include:

                        variable();
                        array[expression]();
                        constant();
                        function()();

                -Pointer arithmetic takes into account the data
                 type of the destination (e.g. pointer++ will increment
                 by two if pointer was declared "int *pointer").

                -Pointer compares generate unsigned
                 compares (since addresses are not signed numbers).

                -Often used pieces of code
                 (i.e. storing the primary register indirect through the
                 top of the stack) generate calls to library routines to
                 shorten the amount of code generated.

                -Generated code is "pure" (i.e. the code may be placed
                 in Read Only Memory).  Code, literals, and variables
                 are kept in separate sections of memory.

                -The generated code is re-entrant.  Every time a function
                 is called, its local variables refer to a new stack
                 frame.  By way of example, the compiler uses
                 recursive-descent for most of its parsing, which relies
                 heavily on re-entrant (recursive) functions.
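
         The pointer rules above (scaling by element size and unsigned
         address compares) can be pictured with a short sketch.  It is
         written in present-day standard C rather than in the Small-C
         subset, and the function names are hypothetical; int16_t
         stands in for the compiler's 16-bit "int".

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes a pointer to a 16-bit element advances on "pointer++". */
size_t int_pointer_step(void)
{
    int16_t cells[2];
    int16_t *pointer = cells;
    char *before = (char *)pointer;

    pointer++;                          /* scaled by the element size */
    return (size_t)((char *)pointer - before);
}

/* Compare two 16-bit addresses the way the text describes: unsigned. */
int address_below(uint16_t a, uint16_t b)
{
    return a < b;
}

/* The same comparison on signed 16-bit values, shown for contrast.
   Converting values above 32767 to int16_t is implementation-defined
   in C; on the usual two's-complement machines 0x8000 becomes -32768. */
int signed_below(uint16_t a, uint16_t b)
{
    return (int16_t)a < (int16_t)b;
}
```

         With addresses 0x7FFF and 0x8000 the unsigned compare orders
         them correctly, while the signed one would treat 0x8000 as
         negative and get the order backwards.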


         COMPILER RESTRICTIONS

              Since recent stages of compiler check-out have been done
         both on an 8080 system and on UNIX, language  syntax  appears
         to  be identical (within the given subset) between this small
         C compiler and the standard UNIX compiler.


         Not supported yet are:

         (1) Structures.
         (2) Multi-dimensional arrays.
         (3) Floating point, long integer, or unsigned data types.
         (4) Function calls returning anything but "int".
         (5) The unaries "!", "~", and "sizeof".
         (6) The control binary operators "&&", "||", and "?:".
         (7) The declaration specifiers "auto", "static", "extern",
                and "register".
         (8) The statements "for", "switch", "case",
                and "default".
         (9) The use of arguments within a "#define" command.


         Compiler restrictions include:

              (1) Since it is a single-pass compiler, undefined  names
         are not detected and are assumed to be function names not yet
         defined.   If  this  assumption  is  incorrect, the undefined
         reference will not  appear  until  the  compiled  program  is
         assembled.

              (2)  No  optimizing is done.  The code produced is sound
         and capable  of  re-entrancy,  but  no  attempt  is  made  to
         optimize  either  for  code  size or speed.  It was assumed a
         post-processor optimizer  would  later  be  written  for  the
         target machine.

              (3)   Since   the   target   assembler   is  of  unknown
         characteristics, no attempt is made to produce pseudo-ops  to
         declare static variables as internal or external.

              (4)  Constants  are not evaluated by the compiler.  That
         is, the line of code:

                        X = 1+2;

         would generate code to add "1"  and  "2"  at  runtime.   The
         results are correct, but unnecessary code is the penalty.
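
         For illustration, the kind of compile-time evaluation that
         restriction (4) says is missing can be sketched as a tiny
         recursive-descent folder.  This is not code from the
         compiler; the names are hypothetical, only "+", "-" and "*"
         on decimal constants are handled, and it is written in
         present-day standard C.

```c
#include <ctype.h>

static const char *cursor;      /* next unread character of the input */

static void skip_spaces(void)
{
    while (*cursor == ' ')
        cursor++;
}

/* number := digit { digit } */
static int parse_number(void)
{
    int value = 0;

    skip_spaces();
    while (isdigit((unsigned char)*cursor)) {
        value = value * 10 + (*cursor - '0');
        cursor++;
    }
    return value;
}

/* term := number { "*" number } */
static int parse_term(void)
{
    int value = parse_number();

    for (;;) {
        skip_spaces();
        if (*cursor != '*')
            return value;
        cursor++;
        value *= parse_number();
    }
}

/* expression := term { ("+" | "-") term }
   Returns the value of a constant expression such as "1+2". */
int fold_constant(const char *text)
{
    int value;

    cursor = text;
    value = parse_term();
    for (;;) {
        skip_spaces();
        if (*cursor == '+') {
            cursor++;
            value += parse_term();
        } else if (*cursor == '-') {
            cursor++;
            value -= parse_term();
        } else {
            return value;
        }
    }
}
```

         A compiler that folded constants this way could emit a single
         load of 3 for "X = 1+2;" instead of a runtime addition.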


         ASSEMBLY LANGUAGE INTERFACE

              Interfacing   to   assembly   language   is   relatively
         straight-forward.  The "#asm ...  #endasm"  construct  allows
         the  user  to  place assembly language code directly into the
         control context.  Since it is considered by the  compiler  to
         be a single statement, it may appear in such forms as:

                        while(1) #asm ... #endasm

                        or

                        if (expression) #asm...#endasm else...

              Due  to  the  workings of the preprocessor which must be
         suppressed in this construct, the pseudo-op  "#asm"  must  be
         the  last  item  before the carriage return on the end of the
         line (i.e.  the text between #asm  and  the  <CR>  is  thrown
         away),  and  the "#endasm" pseudo-op must appear on a line by
         itself (i.e.  everything after #endasm is also thrown  away).
         Since  the  parser is completely free-format outside of these
         exceptions, the expected format is as follows:

                        if (expression) #asm
                        ...
                        ...
                        #endasm
                        else statement;

              Note a semicolon is not required after the #endasm since
         the end of context is  obvious  to  the  compiler.   Assembly
         language  code  within  the  "#asm  ...  #endasm" context has
         access to all global symbols and functions by name.  It is up
         to the programmer  to  know  the  data  type  of  the  symbol
         (whether  "char"  or  "int"  implies  a byte access or a word
         access).  Stack locals and  arguments  may  be  retrieved  by
         offset   (see   STACK  FRAME).   External  assembly  language
         routines invoked by  function  calls  from  the  c-code  have
         access to all registers and do not have to restore them prior
         to  exit.  They may push items on the stack as well, but must
         pop them off before exit.  It is the  responsibility  of  the
         calling  program  to  remove arguments from the stack after a
         function call.  This must not be done by the function itself.
         There is no limit to the number of  bytes  the  function  may
         push  onto  the  stack,  providing  they are removed prior to
         returning.   Since  parameters  are  passed  by  value,   the
         parameters on the stack may be modified by the called program.



         STACK FRAME

              The stack is used extensively by the compiler.  Function
         arguments  are  pushed onto the stack as they are encountered
         between parentheses, that is, left to right (note,  this  is
         the  opposite  of  the  order used by C on UNIX, which means
         routines expressly retrieving arguments from the stack rather
         than declaring them by name must beware).
         By the definition of the language, parameter passing is "call
         by  value".  For example the following code would be produced
         for the C statement:

                function(X, Y, z());

                LHLD X
                PUSH H
                LHLD Y
                PUSH H
                CALL z
                PUSH H
                CALL function
                POP B
                POP B
                POP B

              Notice, the compiler cleans up the stack after the  call
         using a simple algorithm to use the least number of bytes.
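
         The calling sequence shown above can be modelled with a small
         text generator.  This is a sketch only, not the compiler's
         own code: the names emit_call and append_text are
         hypothetical, and every argument is assumed to be a global
         "int" variable (an argument that is itself a call, such as
         z() above, is not handled).

```c
#include <stddef.h>
#include <string.h>

/* Append text to the string already in out; -1 if it does not fit. */
static int append_text(char *out, size_t out_size, const char *text)
{
    size_t used = strlen(out);
    size_t need = strlen(text);

    if (used + need + 1 > out_size)
        return -1;
    memcpy(out + used, text, need + 1);   /* copies the terminator too */
    return 0;
}

/* Write the 8080 text for a call: push each argument left to right,
   call the function, then pop one word per argument in the caller.
   Returns 0 on success, -1 if the buffer is too small. */
int emit_call(char *out, size_t out_size, const char *function_name,
              const char *const *argument_names, int argument_count)
{
    int i;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (i = 0; i < argument_count; i++) {
        if (append_text(out, out_size, "LHLD ") ||
            append_text(out, out_size, argument_names[i]) ||
            append_text(out, out_size, "\nPUSH H\n"))
            return -1;
    }
    if (append_text(out, out_size, "CALL ") ||
        append_text(out, out_size, function_name) ||
        append_text(out, out_size, "\n"))
        return -1;
    for (i = 0; i < argument_count; i++) {
        if (append_text(out, out_size, "POP B\n"))
            return -1;
    }
    return 0;
}
```

         Calling emit_call with the names "X" and "Y" and the function
         name "function" yields the same instruction pattern as the
         listing above, minus the lines for z().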

              Local  variables  allocate  as  much  stack  space as is
         needed, and are then assigned the current value of the  stack
         pointer (after the allocation) as their address.

                int X;

         would produce:

                PUSH B

         which  merely  allocates  room  on the stack for 2 bytes (not
         initialized to any value).  References to the local  variable
         X  will  now  be  made  to the stack pointer + 0.  If another
         declaration is made:

                char array[3];

         the code would be:

                DCX SP
                PUSH B

         Array[0] would  be  at  SP+0,  array[1]  would  be  at  SP+1,
         array[2] would be at SP+2, and X would now be at SP+3.  Thus,
         assembly  language  code using "#asm...#endasm" cannot access
         local variables by name, but must know how  many  intervening
         bytes  have  been  allocated  between  the declaration of the
         variable and  its  use.   It  is  worth  pointing  out  local
         declarations   allocate  only  as  much  stack  space  as  is
         required, including an odd number of bytes, whereas  function
         arguments  always  consist of two bytes apiece.  In the event
         the argument was type "char" (8 bits), the  most  significant
         byte  of  the  2-byte  value is a sign-extension of the lower
         byte.
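
         The offset bookkeeping described above can be checked with a
         short sketch.  The function names are hypothetical and the
         code is present-day standard C; it only models the
         arithmetic and is not taken from the compiler.

```c
#include <stdint.h>

/* Offset from the stack pointer of the local declared at position
   "index" (0 = declared first), given the size in bytes of every
   local declared so far.  Each later declaration moves the stack
   pointer down, so the offset is the total size of all locals
   declared after this one. */
int local_offset(const int *sizes, int count, int index)
{
    int offset = 0;
    int i;

    for (i = index + 1; i < count; i++)
        offset += sizes[i];
    return offset;
}

/* A "char" argument occupies a full 16-bit stack word whose upper
   byte is the sign extension of the lower byte. */
uint16_t widen_char_argument(int8_t value)
{
    return (uint16_t)(int16_t)value;
}
```

         For the declarations "int X;" followed by "char array[3];"
         the sizes are 2 and 3, which places array at SP+0 and X at
         SP+3, matching the text.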



         OPERATING THE COMPILER

              The small C compiler begins by asking  the  user  for  a
         number  of options regarding the expected compilation.  Since
         it was easier to ask questions than to pull arguments from  a
         command  line  (which  is  in no way similar between the 8080
         developmental  system  and  UNIX),  this  was  the  preferred
         method.

         The questions asked are as follows:

              Do you want the c-text to appear?

              This gives the  user  the  option  of  interleaving  the
         source code into the output file.  Response is Y or N.  If Y,
         a  semicolon  will  be placed at the start of each input line
         (to force a comment to the  8080  assembler)  and  the  input
         lines will be printed where appropriate.  If the answer is N,
         only the generated 8080 code will be output.

              Do you wish the globals to be defined?

              This  question  is primarily a developmental aid between
         machines.  If the  answer  is  Y,  all  static  symbols  will
         allocate  storage  within the module being compiled.  This is
         the normal method.  If N, no storage will be  allocated,  but
         symbol  references  will  still  be  made  in the normal way.
         Essentially, this question allows the user to specify all  or
         none  of the static symbols external.  It is to be considered
         a temporary measure.

              Starting number for labels?

              This  lets  the  user  supply  the  first  label  number
         generated by the compiler for its internal labels (which will
         typically  be  "ccXXXXX",  where  XXXXX  is  a decimal number
         increasing with each label).  This option allows  modules  to
         be compiled separately and later appended on the source level
         without generating multi-defined labels.

              Output filename?

              This question gets from the user the name of the file to
         be created.  A null line sends output to the user's terminal.

              Input filename?

              This  question  gets  from  the  user  the name of the C
         module to use as input.  The question will be  repeated  each
         time  a  name  is  supplied,  allowing  the user to create an
         output file consisting  of  many  separate  input  files  (it
         behaves  as  if  the  user  had  appended  them  together and
         submitted only the one file).  A null line response ends  the
         compilation process.
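
         The purpose of the "Starting number for labels?" question can
         be pictured with a small sketch.  The structure and function
         names are hypothetical, and the exact digit format of the
         real labels is an assumption (plain decimal with no padding
         is used here).

```c
#include <stdio.h>

/* Per-module label state: the number the next label will carry. */
struct label_counter {
    int next_number;
};

/* Write the next internal label ("cc" followed by a decimal number)
   into out and advance the counter.  Returns the label length, or
   -1 if the buffer is too small. */
int next_label(struct label_counter *counter, char *out, size_t out_size)
{
    int length = snprintf(out, out_size, "cc%d", counter->next_number);

    if (length < 0 || (size_t)length >= out_size)
        return -1;
    counter->next_number++;
    return length;
}
```

         Two modules compiled with starting numbers far enough apart
         (for example 0 and 1000) draw their labels from disjoint
         ranges, so their outputs can be appended without producing
         multiply-defined labels.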


         COMPILING THE COMPILER

              The  power  of  the  compiler  lies  in  the fact it can
         compile itself.   This  allows  a  user  to  "bootstrap"  the
         compiler onto a new machine without excessive recoding.

              To compile the compiler under the UNIX operating system,
         the appropriate command is:

              % cc C80.c -lS

         which will invoke the UNIX C-compiler and the UNIX linker  to
         create  the  runnable file "a.out".  This file may be renamed
         as needed and used.  No other files are needed.

              In  order  to  create  a compiler for a new machine, the
         user will need to compile the compiler into the  language  of
         the  destination  processor.  The procedure currently used to
         create the compiler for my 8080 system is as follows:

         (1) Edit the file C80.c to modify two lines of code:

         -change the line of code

                #include <stdio.h>
                        to
                #define NULL 0

              (this is  done  since  the  "stdio.h"  I/O  header  file
         contains  unparsable  lines  for  the small compiler, and the
         line defining NULL is the only line of  "stdio.h"  needed  by
         the compiler).

         -change the line of code

                #define eol 10
                        to
                #define eol 13

              (this is done since my 8080 system uses <CR> for the end
         of line character, and UNIX uses the "newline" character).


         (2) Invoke the compiler (by typing "a.out" or whatever other
             name it was given).

         (3) Answer the questions by the compiler to use the file
             C80.c as input and to produce the file C80.I80
             as output.

         (4) Append the files C80.I80 and C80LIB.I80 (the code for the
             compiler and the code for the runtime library,
             respectively).

         (5) Assemble the combined file using some 8080 assembler.

         (6) Execute the created run file.

              Currently, the 8080  assembler  used  must  possess  the
         abilities  to  handle symbol names unique to 8 characters and
         to recognize lower-case symbol names  as  unique  from  their
         upper-case  equivalent.  This is due to the fact the compiler
         recognizes 8-character names and passes all  static  variable
         and  function names intact to the assembler.  There are a few
         symbol names within the compiler which are not  unique  until
         the  7th  character and which have "upper-case twins".  These
         discourage the use of  the  KL-10's  MACN80  since  it  folds
         lower-case  to  upper case and does not recognize 8-character
         names.  It may be used, however, if  the  user  is  aware  of
         these  limitations  and  chooses  symbol  names  within these
         restrictions.
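
         Whether a given assembler can tell two symbol names apart can
         be checked with a sketch like the following.  The function
         name and its parameters are hypothetical; the number of
         significant characters and the case-folding behaviour are
         passed in because they depend on the assembler being used.

```c
#include <ctype.h>

/* Returns 1 if names a and b would look identical to an assembler
   that keeps only the first "significant" characters of a symbol
   and, when fold_case is non-zero, folds lower case to upper case.
   Returns 0 if the assembler could still tell them apart. */
int symbols_collide(const char *a, const char *b,
                    int significant, int fold_case)
{
    int i;

    for (i = 0; i < significant; i++) {
        int ca = (unsigned char)a[i];
        int cb = (unsigned char)b[i];

        if (fold_case) {
            ca = toupper(ca);
            cb = toupper(cb);
        }
        if (ca != cb)
            return 0;
        if (ca == 0)
            return 1;           /* both names ended: identical */
    }
    return 1;                   /* identical in every kept character */
}
```

         Two made-up names that differ only in their 8th character are
         distinct with 8 significant characters but collide with 7,
         and "upper-case twins" collide as soon as case is folded.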


         THE FUTURE OF THE COMPILER

              That part of the compiler which produces  code  for  the
         8080  is  all  together in the final section of the compiler.
         Routines used by the compiler to produce code are kept  short
         and  are  commented.   Changing this compiler to produce code
         for any other machine is a matter of changing only these  few
         routines,  and  does  not  entail  digging around through the
         internals of the program.   I  would  expect  the  change  to
         another  machine  could be made in an afternoon providing the
         target machine had the following attributes:

              (1) A stack, preferably running backwards as  items
                   are pushed onto it.

              (2)  Two  sixteen-bit registers.  In the 8080 these
                   are the HL register pair (the primary register
                   to the compiler) and the DE register pair (the
                   secondary register).

              (3) An assembler (or cross-assembler).


              Since  the  compiler is just now on its feet and subject
         to feedback from users, it is expected many changes  will  be
         made  to  it.   Already planned changes (in order of expected
         addition) are:

              (1)  Constants  will  be   pre-evaluated   by   the
                   compiler.   Something like x=1+2*3 will become
                   x=7 prior to generating any code.

              (2) Structures will be added.  This is one  of  the
                   powers  of  C.   Its  omission has always been
                   considered temporary.

              (3) Assignment operators (+=, &=,  etc.)   will  be
                   added.

              (4)   Missing   unary   and  binary  operators  and
                   statements will be added.

              (5) The expression parser will create  intermediate
                   tree-structures  of  the  expressions and will
                   walk through them before generating any  code.
                   This  will  allow  some  optimization and will
                   allow the function arguments to be  passed  on
                   the stack in the same sequence as UNIX.

              (6)  A peep-hole optimizer will be added to improve
                   the generated code.

              Many  of  these things represent a wish-list.  Time will
         be spent only when it becomes available.  Any volunteer  help
         in any of these areas would be appreciated.

              Questions should be directed to Ron  Cain  here  at  SRI
         either at extension 3860 or at CAIN@SRI-KL.





