OpenCores
URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

Subversion Repositories zipcpu

[/] [zipcpu/] [trunk/] [doc/] [src/] [spec.tex] - Diff between revs 199 and 202


Rev 199 Rev 202
Line 84... Line 84...
\usepackage{bytefield}  % Install via apt-get install texlive-science
\usepackage{bytefield}  % Install via apt-get install texlive-science
% \graphicspath{{../gfx}}
% \graphicspath{{../gfx}}
\project{ZipCPU}
\project{ZipCPU}
\title{Specification}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\email{dgisselq (at) ieee.org}
\revision{Rev.~1.0}
\revision{Rev.~1.1}
\definecolor{webred}{rgb}{0.5,0,0}
\definecolor{webred}{rgb}{0.5,0,0}
\definecolor{webgreen}{rgb}{0,0.4,0}
\definecolor{webgreen}{rgb}{0,0.4,0}
\hypersetup{
\hypersetup{
        ps2pdf,
        ps2pdf,
        pdfpagelabels,
        pdfpagelabels,
Line 120... Line 120...
You should have received a copy of the GNU General Public License along
You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{$<$http://www.gnu.org/licenses/$>$} for
with this program.  If not, see \hbox{$<$http://www.gnu.org/licenses/$>$} for
a copy.
a copy.
\end{license}
\end{license}
\begin{revisionhistory}
\begin{revisionhistory}
 
2.0 & 1/18/2017 & Gisselquist & Switched from 32--bit to 8--bit bytes.\\\hline
 
1.1 & 11/28/2016 & Gisselquist & Moved the ZipSystem address to {\tt 0xff000000} base.\\\hline
1.0 & 11/4/2016 & Gisselquist & Major rewrite,
1.0 & 11/4/2016 & Gisselquist & Major rewrite,
                        includes compiler info\\\hline
                        includes compiler info\\\hline
0.91& 7/16/2016 & Gisselquist & Described three more CC bits\\\hline
0.91& 7/16/2016 & Gisselquist & Described three more CC bits\\\hline
0.9 & 4/20/2016 & Gisselquist & Modified ISA: LDIHI replaced with MPY,
0.9 & 4/20/2016 & Gisselquist & Modified ISA: LDIHI replaced with MPY,
        MPYU and MPYS replaced with MPYUHI, and MPYSHI respectively.  LOCK
        MPYU and MPYS replaced with MPYUHI, and MPYSHI respectively.  LOCK
Line 203... Line 205...
more.\footnote{A not--so integrated MMU is currently under development.}
more.\footnote{A not--so integrated MMU is currently under development.}
 
 
For those who like buzz words, the ZipCPU is:
For those who like buzz words, the ZipCPU is:
\begin{itemize}
\begin{itemize}
\item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,
\item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,
                instructions are 32-bits wide, etc.  Indeed, the ``byte size''
                instructions are 32-bits wide, etc.
                for this processor, as per the C--language definition of a
 
                ``byte'' being the smallest addressable unit, is 32--bits.
 
\item A RISC CPU.  There is no microcode for executing instructions.  All
\item A RISC CPU.  There is no microcode for executing instructions.  All
        instructions are designed to be completed in one clock cycle.
        instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture.  (Only load and store instructions
\item A Load/Store architecture.  (Only load and store instructions
                can access memory.)
                can access memory.)
\item Wishbone compliant.  All peripherals are accessed just like
\item Wishbone compliant.  All peripherals are accessed just like
Line 454... Line 454...
so it can be overridden upon instantiation.
so it can be overridden upon instantiation.
 
 
Given the performance benefits achieved by early branching, setting this flag
Given the performance benefits achieved by early branching, setting this flag
is highly recommended.
is highly recommended.
 
 
{\tt OPT\_PIPELINED\_BUS\_ACCESS} controls whether or not {\tt LOD}/{\tt STO}
{\tt OPT\_PIPELINED\_BUS\_ACCESS} controls whether or not memory
instructions can take advantage of the pipelined wishbone bus.  To be
instructions can take advantage of the pipelined wishbone bus.  To be
eligible, the operations to be pipelined must be adjacent, must be all
eligible, the operations to be pipelined must be adjacent, must be all
{\tt LOD}s or all {\tt STO}s, and the addresses must all use the same base
loads or all stores, and the addresses must all use the same base
address register and either have identical immediate offsets, or immediate
address register and either have identical immediate offsets, or immediate
offsets that increase by one for each instruction.  Further, the
offsets that increase by one for each instruction.  Further, the
{\tt LOD}/{\tt STO} string of instructions must all have the same conditional
string of load (or store) instructions must all have the same conditional
(if any).  Currently, this approach and benefit is most effectively used
(if any).  Currently, this approach and benefit is most effectively used
when saving registers to or restoring registers from the stack at the
when saving registers to or restoring registers from the stack at the
beginning/end of a procedure, when using assembly optimized programs, or
beginning/end of a procedure, when using assembly optimized programs, or
when doing a context swap.
when doing a context swap.
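
As a rough illustration of the eligibility rule just described, the following Python sketch checks whether a run of adjacent memory instructions could be pipelined.  The {\tt MemOp} record, its field names, and the {\tt may\_pipeline} function are hypothetical and exist only for this example; the actual test is made by hardware inside the CPU.  The offset step of one is taken literally from the text above and is left as a parameter in case a given revision counts offsets differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemOp:
    """Hypothetical description of one memory instruction (not a ZipCPU type)."""
    is_load: bool          # True for a load, False for a store
    base_reg: int          # base address register number
    offset: int            # immediate offset
    cond: Optional[str]    # condition mnemonic, or None if unconditional


def may_pipeline(ops: List[MemOp], step: int = 1) -> bool:
    """Return True if the adjacent operations in `ops` satisfy the stated rule."""
    if len(ops) < 2:
        return False       # a single access has nothing to be pipelined with
    first = ops[0]
    for prev, cur in zip(ops, ops[1:]):
        if cur.is_load != first.is_load:
            return False   # must be all loads or all stores
        if cur.base_reg != first.base_reg:
            return False   # must share one base address register
        if cur.cond != first.cond:
            return False   # must share the same condition, if any
        if cur.offset not in (prev.offset, prev.offset + step):
            return False   # offset must repeat or increase by one step
    return True
```

For instance, four unconditional stores to offsets 0 through 3 of the stack pointer pass this check, while a load followed by a store does not.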
 
 
I recommend setting this flag, for performance reasons, especially if your
I recommend setting this flag, for performance reasons, especially if your
wishbone bus implementation can handle pipelined bus accesses.  The logic
wishbone bus implementation can handle pipelined bus accesses.  The logic
impact of this setting is minimal, while the performance impact can be significant.
impact of this setting is minimal, while the performance impact can be significant.
 
 
{\tt OPT\_VLIW} includes within the instruction set the Very Long Instruction
{\tt OPT\_CIS} includes within the instruction set the Very Long Instruction
Word packing, which packs up to two instructions within each instruction word.
Word packing, which packs up to two instructions within each instruction word.
Non--packed instructions will still execute as normal; this just enables the
Non--packed instructions will still execute as normal; this just enables the
decoding and running of packed instructions.
decoding and running of packed instructions.
 
 
The next two options, {\tt INCLUDE\_DMA\_CONTROLLER} and
The next two options, {\tt INCLUDE\_DMA\_CONTROLLER} and
Line 482... Line 482...
control whether the DMA controller is included in the ZipSystem, and
control whether the DMA controller is included in the ZipSystem, and
whether or not the eight accounting timers are also included.  Set these to
whether or not the eight accounting timers are also included.  Set these to
include the respective peripherals, or comment them out to exclude them.  These only
include the respective peripherals, or comment them out to exclude them.  These only
affect the ZipSystem implementation, and not any ZipBones implementations.
affect the ZipSystem implementation, and not any ZipBones implementations.
 
 
Finally, if you find yourself needing to debug the core and specifically needing
Finally, if you find yourself needing to debug the core and specifically
to get a trace from the core to find out why something specifically failed,
needing to get a trace from the core to find out why something specifically
you may find it useful to define {\tt DEBUG\_SCOPE}.  This will add a 32--bit
failed, you may find it useful to define {\tt DEBUG\_SCOPE}.  This will add a
debug output from the core, as the last argument to the core, to the ZipSystem,
32--bit debug output from the core, as the last argument to the core, to the
or even to ZipBones.  The actual definition and composition of this debugging
ZipSystem, or even to ZipBones.  The actual definition and composition of
bit--field changes from one implementation to the next, depending upon needs
this debugging bit--field changes from one implementation to the next,
and necessities, so please look at the code at the bottom of {\tt zipcpu.v}
depending upon needs and necessities, so please look at the code at the
for more details.
bottom of {\tt zipcpu.v} for more details.
 
 
That ends our discussion of CPU options, but there remain several implementation
That ends our discussion of CPU options, but there remain several
parameters that can be defined with the CPU as well.  Some of these, such as
implementation parameters that can be defined with the CPU as well.  Some of
{\tt IMPLEMENT\_MPY}, {\tt IMPLEMENT\_DIVIDE}, {\tt IMPLEMENT\_FPU}, and
these, such as {\tt IMPLEMENT\_MPY}, {\tt IMPLEMENT\_DIVIDE},
{\tt EARLY\_BRANCHING} have already been discussed. The remainder shall be
{\tt IMPLEMENT\_FPU}, and {\tt EARLY\_BRANCHING} have already been discussed.
discussed quickly here.
The remainder shall be discussed quickly here.
 
 
The {\tt RESET\_ADDRESS} parameter controls what address the CPU attempts to
The {\tt RESET\_ADDRESS} parameter controls what address the CPU attempts to
fetch its first instruction from upon any CPU reset.  The default value is
fetch its first instruction from upon any CPU reset.  The default value is
not likely to be particularly useful, so overriding the default is recommended
not likely to be particularly useful, so overriding the default is recommended
for every implementation.
for every implementation.
 
 
The {\tt ADDRESS\_WIDTH} parameter can be used to trim down the width of
The {\tt ADDRESS\_WIDTH} parameter can be used to trim down the width of
addresses used by the CPU.  For example, although the Wishbone Bus definition
addresses used by the CPU.  For example, although the Wishbone Bus definition
used by the CPU has 32 address lines, particular implementations may have
used by the CPU has 30 address lines, particular implementations may have
fewer.  By setting this value to the actual number of wires in the address
fewer.  By setting this value to the actual number of wires in the address
bus, some logic can be spared within the CPU.  The default is a 32--bit wide
bus, some logic can be spared within the CPU.  The default is also the maximum,
bus.
a 30--bit address width.  Two additional bits are used internally by the CPU
 
to create the appearance of an 8--bit bus, by using the wishbone select lines.
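
To make the Rev 202 wording concrete, here is a minimal Python sketch of how a 32--bit byte address might be split into the 30--bit word address driven onto the bus and a four--bit select mask.  The function name is hypothetical, and two details are assumptions that this section does not state: that the byte lanes are ordered big--endian (byte offset zero on the most significant lane), and that unaligned accesses are simply rejected.

```python
def split_byte_address(addr: int, size: int = 1):
    """Split a 32-bit byte address into (word_address, select_mask).

    size is the access width in bytes: 1, 2, or 4.
    ASSUMPTION: big-endian lanes, so byte offset 0 maps to select bit 3.
    ASSUMPTION: unaligned accesses are rejected by this sketch.
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4 bytes")
    addr &= 0xFFFFFFFF
    if addr % size:
        raise ValueError("unaligned access")
    word_address = addr >> 2            # the 30 address lines
    lane = addr & 0x3                   # the two bits handled internally
    mask = (1 << size) - 1              # 0b0001, 0b0011, or 0b1111
    select = (mask << (4 - size)) >> lane
    return word_address, select
```

Under these assumptions a full word access asserts all four select lines, while a byte access asserts exactly one of them.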
 
 
The {\tt LGICACHE} parameter specifies the log base two of the instruction
The {\tt LGICACHE} parameter specifies the log base two of the instruction
cache size.  If no instruction cache is used, this option has no effect.
cache size.  If no instruction cache is used, this option has no effect.
Otherwise it sets the size of the instruction cache to be
Otherwise it sets the size of the instruction cache to be
$2^{\mbox{\tiny\tt LGICACHE}}$ words.  The traditional prefetch cache, if used,
$2^{\mbox{\tiny\tt LGICACHE}}$ words.  The traditional prefetch cache, if used,
Line 527... Line 528...
 
 
The {\tt START\_HALTED} parameter, if set to non--zero, will cause the
The {\tt START\_HALTED} parameter, if set to non--zero, will cause the
CPU to be halted upon startup.  This is useful for debugging, since it prevents
CPU to be halted upon startup.  This is useful for debugging, since it prevents
the CPU from doing anything without supervision.  Of course, once all pieces
the CPU from doing anything without supervision.  Of course, once all pieces
of your design are in place and proven, you'll probably want to set this to
of your design are in place and proven, you'll probably want to set this to
zero.
zero, so that the CPU will then start up immediately upon power up.
 
 
The {\tt EXTERNAL\_INTERRUPTS} parameter controls the number of interrupt
The {\tt EXTERNAL\_INTERRUPTS} parameter controls the number of interrupt
wires coming into the CPU.  This number must be between one and sixteen,
wires coming into the CPU.  This number must be between one and sixteen,
or if the performance counters are disabled, between one and twenty four.
or if the performance counters are disabled, between one and twenty four.
 
 
Line 584... Line 585...
In each register set, the Program Counter (PC) is register 15, whereas
In each register set, the Program Counter (PC) is register 15, whereas
the status register (SR) or condition code register (CC) is register 14.  All
the status register (SR) or condition code register (CC) is register 14.  All
other registers are identical in their hardware functionality.\footnote{Jumps
other registers are identical in their hardware functionality.\footnote{Jumps
to {\tt R0}, an instruction used to implement a return from a subroutine, may
to {\tt R0}, an instruction used to implement a return from a subroutine, may
be optimized in the future within the early branch logic.} By convention, the
be optimized in the future within the early branch logic.} By convention, the
stack pointer is register 13 and noted as (SP)--although there is nothing
stack pointer is register 13 and noted as (SP).  Beyond this convention,
special about this register other than this convention.  Also by convention, if
word accesses to offsets of the stack pointer are compressed when using the
the compiler needs a frame pointer it will be placed into register~12, and may
CIS instruction set.  Also by convention, if the compiler needs a frame
be abbreviated by FP.  Finally, by convention, R0 will hold a subroutine's
pointer it will be placed into register~12, and may be abbreviated by FP.
return address, sometimes called the link register (LR).
Finally, by convention, R0 will hold a subroutine's return address, sometimes
 
called the link register (LR).
 
 
When the CPU is in supervisor mode, instructions can access both register sets
When the CPU is in supervisor mode, instructions can access both register sets
via the {\tt MOV} instruction, whereas when the CPU is in user mode, {\tt MOV}
via the {\tt MOV} instruction, whereas when the CPU is in user mode, {\tt MOV}
instructions will only offer access to user registers.  We'll discuss this
instructions will only offer access to user registers.  We'll discuss this
further in subsection~\ref{sec:isa-mov}.
further in subsection~\ref{sec:isa-mov}.
Line 604... Line 606...
\begin{bitlist}
\begin{bitlist}
31\ldots 23 & R & Reserved for future uses\\\hline
31\ldots 23 & R & Reserved for future uses\\\hline
22\ldots 16 & R/W & Reserved for future uses\\\hline
22\ldots 16 & R/W & Reserved for future uses\\\hline
15 & R & Reserved for MMU exceptions\\\hline
15 & R & Reserved for MMU exceptions\\\hline
14 & W & Clear I-Cache command, always reads zero\\\hline
14 & W & Clear I-Cache command, always reads zero\\\hline
13 & R & VLIW instruction phase (1 for first half)\\\hline
13 & R & CIS instruction phase (1 for first half)\\\hline
12 & R & (Reserved for) Floating Point Exception\\\hline
12 & R & (Reserved for) Floating Point Exception\\\hline
11 & R & Division by Zero Exception\\\hline
11 & R & Division by Zero Exception\\\hline
10 & R & Bus-Error Flag\\\hline
10 & R & Bus-Error Flag\\\hline
9 & R & Trap Flag (or user interrupt).  Cleared on return to userspace.\\\hline
9 & R & Trap Flag (or user interrupt).  Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
8 & R & Illegal Instruction Flag\\\hline
Line 737... Line 739...
 
 
\item The thirteenth bit will operate in a similar fashion to both the bus
\item The thirteenth bit will operate in a similar fashion to both the bus
        error and division by zero flags, only it will be set upon a (yet to
        error and division by zero flags, only it will be set upon a (yet to
        be determined) floating point error.
        be determined) floating point error.
 
 
\item In the case of VLIW instructions, if an exception occurs after the first
\item In the case of CIS instructions, if an exception occurs after the first
        instruction but before the second, the fourteenth bit of the CC register
        instruction but before the second, the fourteenth bit of the CC
        will be set to indicate this fact.
        register will be set to indicate this fact.  This can be combined with
 
        the user PC to determine the address of the half-word where the fault occurred.
 
 
\item The fifteenth bit references a clear cache bit.  The supervisor may
\item The fifteenth bit references a clear cache bit.  The supervisor may
        write a one to this bit in order to clear the CPU instruction cache.
        write a one to this bit in order to clear the CPU instruction cache.
        The bit always reads as a zero.
        The bit always reads as a zero.
 
 
Line 760... Line 763...
All ZipCPU instructions fit in one of the formats shown in
All ZipCPU instructions fit in one of the formats shown in
Fig.~\ref{fig:iset-format}.
Fig.~\ref{fig:iset-format}.
\begin{figure}\begin{center}
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\bitheader{0-31}\\
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR}
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox[tlr]{4}{}
                \bitbox[lrt]{5}{OpCode}
                \bitbox[lrt]{5}{OpCode}
                \bitbox[lrt]{3}{Cnd}
                \bitbox[lrt]{3}{}
                \bitbox{1}{0}
                \bitbox{1}{0}
                \bitbox{18}{18-bit Signed Immediate} \\
                \bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{0}\bitbox{4}{DR}
\bitbox{1}{0}\bitbox[lr]{4}{DR}
                \bitbox[lrb]{5}{}
                \bitbox[lrb]{5}{}
                \bitbox[lrb]{3}{}
                \bitbox[lr]{3}{Cnd}
                \bitbox{1}{1}
                \bitbox{1}{1}
                \bitbox{4}{BR}
                \bitbox{4}{BR}
                \bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
                \bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR}
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox[lr]{4}{}
                \bitbox[lrt]{5}{5'hf}
                \bitbox[lrt]{5}{5'hf}
                \bitbox[lrt]{3}{Cnd}
                \bitbox[lrb]{3}{}
                \bitbox{1}{A}
                \bitbox{1}{A}
                \bitbox{4}{BR}
                \bitbox{4}{BR}
                \bitbox{1}{B}
                \bitbox{1}{B}
                \bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
                \bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR}
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox[lrb]{4}{}
                \bitbox{4}{4'hb}
                \bitbox{4}{4'hc}
                \bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
                \bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
                \bitbox{1}{}
                \bitbox{1}{}
                \bitbox{2}{11}
                \bitbox{2}{11}
                \bitbox{3}{xxx}
                \bitbox{3}{xxx}
                \bitbox{22}{Ignored}
                \bitbox{22}{Ignored}
                \end{leftwordgroup} \\
                \end{leftwordgroup} \\
\begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR}
 
                \bitbox[lrt]{5}{OpCode}
 
                \bitbox[lrt]{3}{Cnd}
 
                \bitbox{1}{0}
 
                \bitbox{4}{Imm.}
 
                \bitbox{14}{---} \\
 
\bitbox{1}{1}\bitbox[lr]{4}{}
 
                \bitbox[lrb]{5}{}
 
                \bitbox[lr]{3}{}
 
                \bitbox{1}{1}
 
                \bitbox{4}{BR}
 
                \bitbox{14}{---}        \\
 
\bitbox{1}{1}\bitbox[lrb]{4}{}
 
                \bitbox{4}{4'hb}
 
                \bitbox{1}{}
 
                \bitbox[lrb]{3}{}
 
                \bitbox{5}{5'b Imm}
 
                \bitbox{14}{---}        \\
 
\bitbox{1}{1}\bitbox{9}{---}
 
                \bitbox[lrt]{3}{Cnd}
 
                \bitbox{5}{---}
 
                \bitbox[lrt]{4}{DR}
 
                \bitbox[lrt]{5}{OpCode}
 
                \bitbox{1}{0}
 
                \bitbox{4}{Imm}
 
                \\
 
\bitbox{1}{1}\bitbox{9}{---}
 
                \bitbox[lr]{3}{}
 
                \bitbox{5}{---}
 
                \bitbox[lr]{4}{}
 
                \bitbox[lrb]{5}{}
 
                \bitbox{1}{1}
 
                \bitbox{4}{Reg} \\
 
\bitbox{1}{1}\bitbox{9}{---}
 
                \bitbox[lrb]{3}{}
 
                \bitbox{5}{---}
 
                \bitbox[lrb]{4}{}
 
                \bitbox{4}{4'hb}
 
                \bitbox{1}{}
 
                \bitbox{5}{5'b Imm}
 
                \end{leftwordgroup} \\
 
\end{bytefield}
\end{bytefield}
\caption{Zip Instruction Set Format}\label{fig:iset-format}
\caption{Zip Instruction Set Format}\label{fig:iset-format}
\end{center}\end{figure}
\end{center}\end{figure}
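
The two Standard rows of the figure can be read as the following field extraction, shown here as a small Python sketch.  The function names and the returned dictionary are illustrative assumptions only.  The bit positions follow from the field widths in the figure, counting bit 31 as the leftmost box.  The {\tt MOV}, {\tt LDI} and {\tt NOOP} rows use different layouts and are not handled here.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_standard(word: int) -> dict:
    """Split a 32-bit Standard-format instruction word into its fields."""
    if (word >> 31) & 1:
        raise ValueError("bit 31 is set: not a Standard-format word")
    fields = {
        "dr":     (word >> 27) & 0xF,    # bits 30..27, destination register
        "opcode": (word >> 22) & 0x1F,   # bits 26..22
        "cnd":    (word >> 19) & 0x7,    # bits 21..19, condition
    }
    if (word >> 18) & 1:
        # Register form: 4-bit BR plus a 14-bit signed immediate
        fields["br"] = (word >> 14) & 0xF
        fields["imm"] = sign_extend(word & 0x3FFF, 14)
    else:
        # Immediate form: 18-bit signed immediate, no BR
        fields["br"] = None
        fields["imm"] = sign_extend(word & 0x3FFFF, 18)
    return fields
```

Bit 18 selects between the two rows: zero for the 18--bit immediate form, one for the register plus 14--bit immediate form.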
The basic format is that some operation, defined by the OpCode, is applied
The basic format is that some operation, defined by the OpCode, is applied
if a condition, Cnd, is true in order to produce a result which is placed in
if a condition, Cnd, is true in order to produce a result which is placed in
the destination register (DR).  There are three basic exceptions to this
the destination register (DR).
model.  The first is the {\tt MOV} instruction, which steals bits~13 and~18
 
to allow supervisor access to user registers.  The second is the load 23--bit
There are three basic exceptions to this general instruction model.  The
 
first is the {\tt MOV} instruction, which steals bits~13 and~18
 
to allow supervisor access to user registers.  In supervisor mode, these
 
are set to one to reference user registers, zero otherwise.  They are ignored
 
in user mode.  The second exception is the load 23--bit
signed immediate instruction ({\tt LDI}), in that it accepts no conditions and
signed immediate instruction ({\tt LDI}), in that it accepts no conditions and
uses only a 4-bit opcode.  The last exception is the {\tt NOOP} instruction
uses only a 4-bit opcode.  The last exception is the {\tt NOOP} instruction
group, containing the {\tt NOOP}, {\tt BREAK}, and {\tt LOCK} opcodes.  These
group, containing the {\tt BREAK}, {\tt LOCK}, {\tt SIM}, and {\tt NOOP}
instructions ignore their register and immediate settings.\footnote{A future
opcodes.  These instructions ignore their register and immediate settings.
version of the CPU may repurpose the immediate bits within the {\tt NOOP}
Further, the immediate bits used by these opcodes are available for simulation
instruction to be simulator commands, while the immediate/register bits within
or debug facilities, but otherwise ignored by the CPU.
the {\tt BREAK} instruction may be used by the debugger for whatever purpose
 
it chooses to use them for--such as a breakpoint table index.}
 
 
 
The ZipCPU also supports a very long instruction word (VLIW) set of
 
instructions.  These aren't truly VLIW instructions in the sense that the CPU
 
still only issues one instruction at a time, but they do pack two instructions
 
into a single instruction word.  The number of bits used by the immediate field
 
is adjusted to make space for these instruction words.  Other than instruction
 
format, the only basic difference between VLIW and normal instructions is that
 
the CPU will not switch to interrupt mode in between the two instructions,
 
unless an exception is generated by the first instruction.  Likewise a new job
 
given to the assembler is that of automatically packing as many instructions as
 
possible into the VLIW format.
 
 
 
The disassembler will represent VLIW instructions by placing a vertical bar
 
between the two components, but still leaving them on the same line.
 
 
 
\subsection{Instruction OpCodes}\label{sec:isa-opcodes}
\subsection{Instruction OpCodes}\label{sec:isa-opcodes}
With a 5--bit opcode field, there are 32 possible instructions as shown in
With a 5--bit opcode field, there are 32 possible instructions as shown in
Tbl.~\ref{tbl:iset-opcodes}.
Tbl.~\ref{tbl:iset-opcodes}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85}
\begin{tabular}{|l|l|l|l|c|} \hline \rowcolor[gray]{0.85}
OpCode & & Instruction &Sets CC \\\hline\hline
OpCode & & A-Reg & Instruction &Sets CC \\\hline\hline
5'h00 & {\tt SUB} & Subtract &   \\\cline{1-3}
5'h00 & {\tt SUB} & \multicolumn{2}{l|}{Subtract} &   \\\cline{1-4}
5'h01 & {\tt AND} & Bitwise And &   \\\cline{1-3}
5'h01 & {\tt AND} & \multicolumn{2}{l|}{Bitwise And} &   \\\cline{1-4}
5'h02 & {\tt ADD} & Add two numbers &   \\\cline{1-3}
5'h02 & {\tt ADD} & \multicolumn{2}{l|}{Add two numbers} &   \\\cline{1-4}
5'h03 & {\tt OR}  & Bitwise Or & Y \\\cline{1-3}
5'h03 & {\tt OR}  & \multicolumn{2}{l|}{Bitwise Or} & Y \\\cline{1-4}
5'h04 & {\tt XOR} & Bitwise Exclusive Or &   \\\cline{1-3}
5'h04 & {\tt XOR} & \multicolumn{2}{l|}{Bitwise Exclusive Or} &   \\\cline{1-4}
5'h05 & {\tt LSR} & Logical Shift Right &   \\\cline{1-3}
5'h05 & {\tt LSR} & \multicolumn{2}{l|}{Logical Shift Right} &   \\\cline{1-4}
5'h06 & {\tt LSL} & Logical Shift Left &   \\\cline{1-3}
5'h06 & {\tt LSL} & \multicolumn{2}{l|}{Logical Shift Left} &   \\\cline{1-4}
5'h07 & {\tt ASR} & Arithmetic Shift Right &   \\\hline
5'h07 & {\tt ASR} & \multicolumn{2}{l|}{Arithmetic Shift Right} &   \\\hline
5'h08 & {\tt MPY} & 32x32 bit multiply & Y \\\hline
 
5'h09 & {\tt LDILO} & Load Immediate Low & N\\\hline
5'h08 & {\tt BREV} & \multicolumn{2}{l|}{Bit Reverse B operand into result}&  \\\cline{1-4}
5'h0a & {\tt MPYUHI} & Upper 32 of 64 bits from an unsigned 32x32 multiply &  \\\cline{1-3}
5'h09 & {\tt LDILO} & \multicolumn{2}{l|}{Load Immediate Low} & N\\\hline
5'h0b & {\tt MPYSHI} & Upper 32 of 64 bits from a signed 32x32 multiply & Y \\\cline{1-3}
5'h0a & {\tt MPYUHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from an unsigned 32x32 multiply} &  \\\cline{1-4}
5'h0c & {\tt BREV} & Bit Reverse B operand into result&  \\\cline{1-3}
5'h0b & {\tt MPYSHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from a signed 32x32 multiply} & Y \\\cline{1-4}
5'h0d & {\tt POPC}& Population Count &  \\\cline{1-3}
5'h0c & {\tt MPY} & \multicolumn{2}{l|}{32x32 bit multiply} & \\\hline
5'h0e & {\tt ROL} & Rotate Ra left by OpB bits&   \\\hline
5'h0d & {\tt MOV} & \multicolumn{2}{l|}{Move OpB into Ra} & N \\\hline
5'h0f & {\tt MOV} & Move OpB into Ra & N \\\hline
5'h0e & {\tt DIVU} & R0-R13 & Divide, unsigned & Y \\\cline{1-4}
5'h10 & {\tt CMP} & Compare (Ra-OpB) to zero & Y \\\cline{1-3}
5'h0f & {\tt DIVS} & R0-R13 & Divide, signed &  \\\hline\hline
5'h11 & {\tt TST} & Test (AND w/o setting result) &   \\\hline
%
5'h12 & {\tt LOD} & Load Ra from memory (OpB) & N \\\cline{1-3}
5'h10 & {\tt CMP} & \multicolumn{2}{l|}{Compare (Ra-OpB) to zero} & Y \\\cline{1-4}
5'h13 & {\tt STO} & Store Ra into memory at (OpB) &  \\\hline\hline
5'h11 & {\tt TST} & \multicolumn{2}{l|}{Test (AND w/o setting result)} &   \\\hline
5'h14 & {\tt DIVU} & Divide, unsigned & Y \\\cline{1-3}
5'h12 & {\tt LW} & \multicolumn{2}{l|}{Load a 32-bit word from memory (OpB) into Ra} & \\\cline{1-4}
5'h15 & {\tt DIVS} & Divide, signed &  \\\hline\hline
5'h13 & {\tt SW} & \multicolumn{2}{l|}{Store a 32-bit word from Ra into memory at (OpB)} &  \\\cline{1-4}
5'h16/7 & {\tt LDI} & Load 23--bit signed immediate & N \\\hline\hline
5'h14 & {\tt LH} & \multicolumn{2}{l|}{Load 16-bits from memory (opB) into Ra, clear upper 16 bits} & N \\\cline{1-4}
5'h18 & {\tt FPADD} & Floating point add &  \\\cline{1-3}
5'h15 & {\tt SH} & \multicolumn{2}{l|}{Store the lower 16-bits of Ra into memory at (OpB)} &  \\\cline{1-4}
5'h19 & {\tt FPSUB} & Floating point subtract &   \\\cline{1-3}
5'h16 & {\tt LB} & \multicolumn{2}{l|}{Load 8-bits from memory (OpB) into Ra, clear upper 24 bits} & \\\cline{1-4}
5'h1a & {\tt FPMPY} & Floating point multiply & Y \\\cline{1-3}
5'h17 & {\tt SB} & \multicolumn{2}{l|}{Store the lower 8-bits of Ra into memory at (OpB)} &  \\\hline\hline
5'h1b & {\tt FPDIV} & Floating point divide &   \\\cline{1-3}
5'h18/9 & {\tt LDI} & \multicolumn{2}{l|}{Load 23--bit signed immediate} & N \\\hline\hline
5'h1c & {\tt FPI2F} & Convert integer to floating point &   \\\cline{1-3}
5'h1a & {\tt FPADD} & R0-R13 & Floating point add &  \\\cline{1-4}
5'h1d & {\tt FPF2I} & Convert floating point to integer &   \\\hline
5'h1b & {\tt FPSUB} & R0-R13 & Floating point subtract &   \\\cline{1-4}
5'h1e & & {\em Reserved for future use} &\\\hline
5'h1c & {\tt FPMPY} & R0-R13 & Floating point multiply & Y \\\cline{1-4}
5'h1f & & {\em Reserved for future use} &\\\hline
5'h1d & {\tt FPDIV} & R0-R13 & Floating point divide &   \\\cline{1-4}
5'h18 & & \hbox to 0.5in{\tt NOOP}  (A-register = PC)&\\\cline{1-3}
5'h1e & {\tt FPI2F} & R0-R13 & Convert integer to floating point &   \\\cline{1-4}
5'h19 & & \hbox to 0.5in{\tt BREAK} (A-register = PC)& N\\\cline{1-3}
5'h1f & {\tt FPF2I} & R0-R13 & Convert floating point to integer &   \\\hline\hline
5'h1a & & \hbox to 0.5in{\tt LOCK}  (A-register = PC)&\\\hline
5'h1c & {\tt BREAK} &None(15)&& \\\cline{1-4}
 
5'h1d & {\tt LOCK} &None(15)&& N\\\cline{1-4}
 
5'h1e & {\tt SIM}  &None(15)&&\\\cline{1-4}
 
5'h1f & {\tt NOOP} &None(15)&&\\\hline
\end{tabular}
\end{tabular}
\caption{ZipCPU OpCodes}\label{tbl:iset-opcodes}
\caption{ZipCPU OpCodes}\label{tbl:iset-opcodes}
\end{center}\end{table}
\end{center}\end{table}
%
%
Of these opcodes, {\tt ROL} and {\tt POPC} are experimental and may be
 
replaced in future revisions.  (If you have a reason to like or wish to keep
 
these opcodes, please contact me.  If you know of alternatives that might be
 
better, please let me know as well.)  There is also room for six more
 
register-less instructions in the {\tt NOOP} instruction space,
 
and two floating point instruction opcodes have been reserved for future use.
 
 
 
\subsection{Conditional Instructions}\label{sec:isa-cond}
\subsection{Conditional Instructions}\label{sec:isa-cond}
Most, although not quite all, instructions may be conditionally executed.
Most, although not quite all, instructions may be conditionally executed.
The 23--bit load immediate instruction, together with the {\tt NOOP},
The 23--bit load immediate instruction, together with the {\tt NOOP},
{\tt BREAK}, and {\tt LOCK} instructions are the exceptions to this rule.
{\tt BREAK}, and {\tt LOCK} instructions are the exceptions to this rule.
All other instructions may be conditionally executed.
All other instructions may be conditionally executed.
 
 
From the four condition code flags, eight conditions are defined for standard
From the four condition code flags, eight conditions are defined, as shown in
instructions.  These are shown in Tbl.~\ref{tbl:conditions}.
Tbl.~\ref{tbl:conditions}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .LT}& Less than ('N' set) \\
3'h1 & {\tt .Z} & Only execute when `Z' is set \\
3'h2 & {\tt .Z} & Only execute when 'Z' is set \\
3'h2 & {\tt .LT}& Less than (`N' set) \\
3'h3 & {\tt .NZ}& Only execute when 'Z' is not set \\
3'h3 & {\tt .C} & Carry set (Also known as less-than unsigned) \\
3'h4 & {\tt .GT}& Greater than ('N' not set, 'Z' not set) \\
3'h4 & {\tt .V} & Overflow set\\
3'h5 & {\tt .GE}& Greater than or equal ('N' not set, 'Z' irrelevant) \\
3'h5 & {\tt .NZ}& Only execute when `Z' is not set \\
3'h6 & {\tt .C} & Carry set (Also known as less-than unsigned) \\
3'h6 & {\tt .GE}& Greater than or equal (`N' not set) \\
3'h7 & {\tt .V} & Overflow set\\
3'h7 & {\tt .NC}& Not carry (also known as greater-than or equal, unsigned) \\
\end{tabular}
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}\end{table}
\end{center}\end{table}
There is no condition code for less than or equal, not C or not V---there
There are no condition codes for either less than or equal or greater than,
just wasn't enough space in 3--bits.  Ways of handling non--supported
whether signed or unsigned.  In a similar fashion, there is no condition
conditions are discussed in Sec.~\ref{sec:in-mcond}.
code for not V---there just wasn't enough space in 3--bits.  Ways of handling
 
non--supported conditions are discussed in Sec.~\ref{sec:in-mcond}.
 
 
With the exception of \hbox{\tt CMP} and \hbox{\tt TST} instructions,
With the exception of \hbox{\tt CMP} and \hbox{\tt TST} instructions,
conditionally executed instructions will not further adjust the condition codes.
conditionally executed instructions will not further adjust the condition
Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust conditions
codes.  Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust
whenever they are executed.  In this way, multiple conditions may be evaluated
conditions whenever they are executed.  In this way, multiple conditions may
without branches, creating a sort of logical and--but only if all the conditions
be evaluated without branches, creating a sort of logical and--but only if all
are the same.  For example, to do something if \hbox{\tt R0} is one and
the conditions are the same.  For example, to do something if \hbox{\tt R0} is
\hbox{\tt R1} is two, one might try code such as Tbl.~\ref{tbl:dbl-condition}.
one and \hbox{\tt R1} is two, one might try code such as
 
Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{l}
\begin{tabular}{l}
        {\tt CMP 1,R0} \\
        {\tt CMP 1,R0} \\
        {\em ; Condition codes are now set based upon R0-1} \\
        {\em ; Condition codes are now set based upon R0-1} \\
        {\tt CMP.Z 2,R1} \\
        {\tt CMP.Z 2,R1} \\
        {\em ; If R0 $\neq$ 1, conditions are unchanged, {\tt Z} is still false.} \\
        {\em ; If R0 $\neq$ 1, conditions are unchanged, {\tt Z} is still false.} \\
        {\em ; If R0 $=$ 1, conditions are now set based upon R1-2.} \\
        {\em ; If R0 $=$ 1, conditions are now set based upon R1-2.} \\
        {\em ; Now some instruction could be done based upon the conjunction} \\
        {\em ; Now some instruction could be done based upon the conjunction} \\
        {\em ; of both conditions.} \\
        {\em ; of both conditions.} \\
        {\em ; While we use the example of a {\tt STO}, it could easily be any
        {\em ; While we use the example of a {\tt SW}, it could easily be any
                instruction.} \\
                instruction.} \\
        {\tt STO.Z R0,(R2)} \\
        {\tt SW.Z R0,(R2)} \\
\end{tabular}
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}
\end{center}\end{table}
 
 
The real utility of conditionally executed instructions is that, unlike
The real utility of conditionally executed instructions is that, unlike
conditional branches, conditionally executed instructions will not stall
conditional branches, conditionally executed instructions will not stall
the bus if they are not executed.
the bus if they are not executed.
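
As a hedged illustration of this point (a sketch assembled from the mnemonics
in Tbl.~\ref{tbl:conditions}, not taken from the ZipCPU sources), the signed
maximum of \hbox{\tt R0} and \hbox{\tt R1} might be left in \hbox{\tt R1}
without any branch.  The register choice is arbitrary, and the sketch assumes
that the subtraction \hbox{\tt R1-R0} does not overflow.
\begin{center}
\begin{tabular}{l}
        {\tt CMP R0,R1} \\
        {\em ; Condition codes are now set based upon R1-R0} \\
        {\tt MOV.LT R0,R1} \\
        {\em ; If R1 $<$ R0 (`N' set), R1 is replaced by R0; otherwise R1 is unchanged.} \\
\end{tabular}
\end{center}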
 
 
In the case of VLIW instructions, only four conditions are defined as shown
 
in Tbl.~\ref{tbl:vliw-conditions}.
 
\begin{table}\begin{center}
 
\begin{tabular}{l|l|l}
 
Code & Mnemonic & Condition \\\hline
 
2'h0 & None & Always execute the instruction \\
 
2'h1 & {\tt .LT} & Less than ('N' set) \\
 
2'h2 & {\tt .Z} & Only execute when 'Z' is set \\
 
2'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\
 
\end{tabular}
 
\caption{VLIW Conditions}\label{tbl:vliw-conditions}
 
\end{center}\end{table}
 
Further, the first bit of the three is given a special meaning: If the first
 
bit is set, the conditions apply to the second half of the instruction,
 
otherwise the conditions will only apply to the first half of a conditional
 
instruction.  Of course, the other conditions are still available by mingling
 
the non--VLIW instructions with VLIW instructions.
 
 
 
\subsection{Modifying Conditions}\label{sec:in-mcond}
\subsection{Modifying Conditions}\label{sec:in-mcond}
A quick look at the list of conditions supported by the ZipCPU and listed
A quick look at the list of conditions supported by the ZipCPU and listed
in Tbl.~\ref{tbl:conditions} reveals that the ZipCPU does not have a full set
in Tbl.~\ref{tbl:conditions} reveals that the ZipCPU does not have a full set
of conditions.  In particular, only one explicit unsigned condition is
of conditions.  In particular, only one explicit unsigned condition is
supported.  Therefore, Tbl.~\ref{tbl:creating-conditions}
supported.  Therefore, Tbl.~\ref{tbl:creating-conditions}
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
\begin{tabular}{|l|l|l|}\hline
Original & Modified & Name \\\hline\hline
Original & Modified & Name \\\hline\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
        & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BLT label}
        & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BLT label}
        & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline
        & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline
 
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
 
        & \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLT label\\BZ label}
 
        & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline\hline
 
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGT label}    % if (Ry > Rx) -> Rx < Ry
 
        & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BGE label}
 
        & Greater-than (immediate) \\[4mm]\hline
 
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGT label}     % if (Ry > Rx) -> Rx < Ry
 
        & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BLT label}
 
        & Greater-than (register) \\[4mm]\hline\hline
 
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLEU label}
 
        & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BC label}
 
        & Less-than or equal unsigned immediate \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}
        & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BC label}
        & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BNC label}
        & Less-than or equal unsigned \\[4mm]\hline
        & Less-than or equal unsigned register\\[4mm]\hline\hline
 
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGTU label}   % if (Ry > Rx) -> Rx < Ry
 
        & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BNC label}
 
        & Greater-than unsigned (immediate)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label}    % if (Ry > Rx) -> Rx < Ry
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label}    % if (Ry > Rx) -> Rx < Ry
        & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}
        & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}
        & Greater-than unsigned \\[4mm]\hline
        & Greater-than unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGEU label}    % if (Ry >= Rx) -> Rx <= Ry -> Rx < Ry+1
 
        & \parbox[t]{1.5in}{\tt CMP 1+Ry,Rx\\BC label}
 
        & Greater-than equal unsigned \\[4mm]\hline
 
\parbox[t]{1.5in}{\tt CMP A+Rx,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A
 
        & \parbox[t]{1.5in}{\tt CMP (1-A)+Ry,Rx\\BC label}
 
        & Greater-than equal unsigned (with offset)\\[4mm]\hline
 
\parbox[t]{1.5in}{\tt CMP A,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A
 
        & \parbox[t]{1.5in}{\tt LDI (A-1),Rx\\CMP Ry,Rx\\BC label}
 
        & Greater-than equal comparison with a constant\\[4mm]\hline
 
\end{tabular}
\end{tabular}
\caption{Modifying conditions}\label{tbl:creating-conditions}
\caption{Modifying conditions}\label{tbl:creating-conditions}
\end{center}\end{table}
\end{center}\end{table}
shows examples of how these unsupported conditions can be created
shows examples of how these unsupported conditions can be created
simply by adjusting the compare instruction, for no extra cost in clocks.
simply by adjusting the compare instruction, for no extra cost in clocks.
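
The rewrites in Tbl.~\ref{tbl:creating-conditions} rest on a few integer
identities, spelled out here as a reading aid rather than as part of the
original specification.  They hold for signed and unsigned comparisons alike,
but assume that $I+1$ remains representable, so the immediate forms fail when
$I$ is already the largest value of its type:
\begin{eqnarray*}
R_y \leq I & \Leftrightarrow & R_y < I+1 \\
R_y > I & \Leftrightarrow & R_y \geq I+1 \\
R_y > R_x & \Leftrightarrow & R_x < R_y \\
R_y \leq R_x & \Leftrightarrow & R_x \geq R_y
\end{eqnarray*}
The first two identities justify adding one to the immediate, while the last
two justify swapping the operands of the compare.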
Line 1048... Line 984...
In those cases where a fourteen or eighteen bit immediate doesn't make sense,
In those cases where a fourteen or eighteen bit immediate doesn't make sense,
such as for {\tt LDILO}, the extra bits associated with the immediate are
such as for {\tt LDILO}, the extra bits associated with the immediate are
simply ignored.  (This rule does not apply to the shift instructions,
simply ignored.  (This rule does not apply to the shift instructions,
{\tt ASR}, {\tt LSR}, and {\tt LSL}--which all use all of their immediate bits.)
{\tt ASR}, {\tt LSR}, and {\tt LSL}--which all use all of their immediate bits.)
 
 
VLIW instructions still use the same operand B as regular instructions, only
 
there was no room for any instruction plus immediate addressing.  Therefore,
 
VLIW instructions have either
 
a register or a 4--bit signed immediate as their operand B.  The only exception
 
is the load immediate instruction, which permits a 5--bit signed operand
 
B.\footnote{Although the space exists to extend this VLIW load immediate
 
instruction to six bits, the 5--bit limit was chosen to simplify the
 
disassembler.  This may change in the future.}
 
 
 
\subsection{Address Modes}\label{sec:isa-addr}
\subsection{Address Modes}\label{sec:isa-addr}
The ZipCPU supports two addressing modes: register plus immediate, and
The ZipCPU supports two addressing modes: register plus immediate, and
immediate addressing.  Addresses are encoded in the same fashion as
immediate addressing.  Addresses are encoded in the same fashion as
Operand B's, discussed above.
Operand B's, discussed above.
 
 
The VLIW instruction set only offers register addressing.
 
 
 
\subsection{Move Operands}\label{sec:isa-mov}
\subsection{Move Operands}\label{sec:isa-mov}
The previous set of operands would be perfect and complete, save only that the
The previous set of operands would be perfect and complete, save only that the
CPU needs access to non--supervisory registers while in supervisory mode.  The
CPU needs access to non--supervisory registers while in supervisory mode.  The
MOV instruction has been modified to fit that purpose.  The two bits,
MOV instruction has been modified to fit that purpose.  The two bits,
shown as {\tt A} and {\tt B} in Fig.~\ref{fig:iset-format} above, are designed
shown as {\tt A} and {\tt B} in Fig.~\ref{fig:iset-format} above, are designed
Line 1127... Line 1052...
exception, the divide by zero bit will be set in the CC register.  In the
exception, the divide by zero bit will be set in the CC register.  In the
case of a user mode divide by zero, this will be cleared by any return to user
case of a user mode divide by zero, this will be cleared by any return to user
mode command.  The supervisor bit may be cleared either by a reboot or by the
mode command.  The supervisor bit may be cleared either by a reboot or by the
external debugger.
external debugger.
 
 
\subsection{NOOP, BREAK, and Bus LOCK Instruction}
\section{CIS Instructions}
Three instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes}, are
 
somewhat special.  These are the {\tt NOOP}, {\tt BREAK}, and bus {\tt LOCK}
The ZipCPU also supports a compressed instruction set (CIS), outlined in
instructions.  These are encoded according to
Fig.~\ref{fig:iset-cis},
 
\begin{figure}\begin{center}
 
\begin{bytefield}[endianness=big]{16}
 
\bitheader{0-15}\\
 
\bitbox[lrt]{1}{}\bitbox[lrt]{4}{}
 
                \bitbox[lrt]{3}{COp}
 
                \bitbox{1}{0}
 
                \bitbox{7}{Imm.} \\
 
\bitbox[lr]{1}{1}\bitbox[lr]{4}{DR}
 
                \bitbox[lrb]{3}{}
 
                \bitbox{1}{1}
 
                \bitbox{4}{BR}
 
                \bitbox{3}{Imm} \\
 
\bitbox[lr]{1}{}\bitbox[lr]{4}{}
 
                \bitbox{3}{\tt LDI}
 
                \bitbox{8}{8'b Imm} \\
 
\bitbox[lrb]{1}{}\bitbox[lrb]{4}{}
 
                \bitbox{3}{\tt MOV}
 
                \bitbox{1}{1}
 
                \bitbox{4}{BR}
 
                \bitbox{3}{Imm} \\
 
\end{bytefield}
 
\caption{Zip Compressed Instruction Set (CIS) Format}\label{fig:iset-cis}
 
\end{center}\end{figure}
 
when enabled via {\tt OPT\_CIS}.
 
This compressed instruction set packs two instructions per word.  Words
 
must still be aligned, and jumping into the middle of a compressed instruction
 
is not allowed.  Further, the CIS only permits the encoding of 8~of the
 
32~opcodes available in the ISA, as listed in Tbl.~\ref{tbl:iset-cisops}.
 
\begin{table}\begin{center}
 
\begin{tabular}{|l|l|l|} \hline \rowcolor[gray]{0.85}
 
COp & & Instruction \\\hline\hline
 
3'h00 & {\tt SUB} & Subtract   \\\hline
 
3'h01 & {\tt AND} & Bitwise And   \\\hline
 
3'h02 & {\tt ADD} & Add two numbers   \\\hline
 
3'h03 & {\tt CMP}  & Compare  \\\hline

3'h04 & {\tt LW} & Load word   \\\hline

3'h05 & {\tt SW} & Store word  \\\hline

3'h06 & {\tt LDI} & Load immediate   \\\hline

3'h07 & {\tt MOV} & Move \\\hline
 
\end{tabular}
 
\caption{CIS OpCodes}\label{tbl:iset-cisops}
 
\end{center}\end{table}
 
A final feature of the compressed instruction set has to do with {\tt LW} and
 
{\tt SW} instructions.  An {\tt LW} or {\tt SW} instruction with bit-7 set
 
low references an offset of the Stack Pointer, (SP).  Hence the compressed
 
instruction set allows loads and stores to offsets of the Stack Pointer
 
of -128~octets on up to~127 octets.  In practice, this gives the compressed
 
load and store instructions, when referencing the stack, thirty--two words
 
that they can reference.
 
 
 
This compressed instruction set is somewhat similar to other architectures that
 
have a thumb instruction set, with the difference that the ZipCPU can intermix
 
regular and thumb instructions at will.  When using the CIS, instructions are
 
still issued one at a time; however, interrupts are disabled between
 
instruction halves, in order to prevent the CPU from stopping mid instruction.
 
Further, it is the silent job of the assembler to compress CIS instructions
 
in an opportunistic fashion.
 
 
 
The disassembler represents CIS instructions by placing a vertical bar
 
between the two components, while still leaving them on the same line.
 
 
 
The CIS does not support conditional execution.
 
 
 
\subsection{BREAK, Bus LOCK, SIM, and NOOP Instructions}
 
Four instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes}, are
 
somewhat special.  These are the {\tt BREAK}, bus {\tt LOCK}, {\tt SIM}, and
 
{\tt NOOP} instructions.  These are encoded according to
Fig.~\ref{fig:iset-noop}.
Fig.~\ref{fig:iset-noop}.
\begin{figure}\begin{center}
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\bitheader{0-31}\\
\begin{leftwordgroup}{NOOP}
 
\bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{}
 
        \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{Reserved for Simulator} \\
 
\bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{}
 
        \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{---} \\
 
\bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---}
 
        \bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}
 
        \bitbox{5}{Rsrvd}
 
                \end{leftwordgroup} \\
 
\begin{leftwordgroup}{BREAK}
\begin{leftwordgroup}{BREAK}
\bitbox{1}{0}\bitbox{3}{3'h7}
\bitbox[lrt]{1}{}\bitbox[lrt]{3}{}
                \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Reserved for debugger}
                \bitbox{1}{}\bitbox[lrt]{3}{}\bitbox{2}{00}\bitbox{22}{Reserved for debugger}
                \end{leftwordgroup} \\
                \end{leftwordgroup} \\
\begin{leftwordgroup}{LOCK}
\begin{leftwordgroup}{LOCK}
\bitbox{1}{0}\bitbox{3}{3'h7}
\bitbox[lr]{1}{0}\bitbox[lr]{3}{3'h7}
                \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored}
                \bitbox{1}{}\bitbox[lr]{3}{111}\bitbox{2}{01}\bitbox{22}{Ignored}
 
                \end{leftwordgroup} \\
 
\begin{leftwordgroup}{SIM}
 
\bitbox[lr]{1}{}\bitbox[lr]{3}{}\bitbox{1}{}
 
        \bitbox[lr]{3}{}\bitbox{2}{10}\bitbox[lrt]{22}{Reserved for Simulator}
 
                \end{leftwordgroup} \\
 
\begin{leftwordgroup}{NOOP}
 
\bitbox[lrb]{1}{}\bitbox[lrb]{3}{}\bitbox{1}{}
 
        \bitbox[lrb]{3}{}\bitbox{2}{11}\bitbox[lrb]{22}{}
                \end{leftwordgroup} \\
                \end{leftwordgroup} \\
\end{bytefield}
\end{bytefield}
\caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop}
\caption{BREAK/LOCK/SIM/NOOP Instruction Format}\label{fig:iset-noop}
\end{center}\end{figure}
\end{center}\end{figure}
 
 
The {\tt NOOP} instruction is just that: an instruction that does not perform
 
any operation.  While many other instructions, such as a move from a register
 
to itself, could also fit this role, only the NOOP instruction guarantees
 
that it will not stall waiting for a register to be available.   For this
 
reason, it gets its own place in the instruction set.  Bits 21--0 of this
 
instruction are reserved for commands which may be given to a simulator, such
 
as simulator exit, should the code be run from a simulator.  However, such
 
simulation codes have not yet been defined.
 
 
 
The {\tt BREAK} instruction is useful for creating a debug instruction that
The {\tt BREAK} instruction is useful for creating a debug instruction that
will halt the CPU without executing.  If in user mode, depending upon the
will halt the CPU without executing.  If in user mode, depending upon the
setting of the break enable bit, it will either switch to supervisor mode or
setting of the break enable bit, it will either switch to supervisor mode or
halt the CPU--depending upon where the user wishes to do his debugging.  The
halt the CPU--depending upon where the user wishes to do his debugging.  The
lower 22~bits of this instruction are likewise reserved for the debugger's
lower 22~bits of this instruction are reserved for the debugger's use.
use.
 
 
 
Finally, the {\tt LOCK} instruction was added in order to provide for
The {\tt LOCK} instruction provides the ZipCPU's atomic operation support,
atomic operations.  The {\tt LOCK} instruction only works when the CPU is
although it only works when the CPU is configured for pipeline
configured for pipeline mode.  It works by stalling the ALU pipeline stack
mode.\footnote{The reason for not allowing {\tt LOCK} support in
until all prior stages are filled, and then it guarantees that once a bus
non-pipelined mode is that the instruction fetch is not allowed to interrupt
cycle is started, the wishbone {\tt CYC} line will remain asserted until the
a lock cycle.  In non-pipelined mode, the instruction fetch must take place
LOCK is deasserted.  This allows the execution of three instructions, one
between every bus access, negating this utility.}  It works by stalling the
memory (ex. {\tt LOD}), one ALU (ex. {\tt ADD}), and another memory instruction
ALU pipeline stack until all prior stages are filled, and then it guarantees
(ex. {\tt STO}), to take place in an unbreakable fashion.  Example uses of this
that once a bus cycle is started, the wishbone {\tt CYC} line will remain
capability include an atomic increment, such as {\tt LOCK}, {\tt LOD (Rx),Ry},
asserted for up to three instructions.  This allows the execution of one
{\tt ADD \#,Ry}, and then {\tt STO Ry,(Rx)}, or even a two instruction pair
memory load (ex. {\tt LW}), one ALU operation (ex. {\tt ADD}), and then
such as a test and set sequence: {\tt LDI 1,Rz}, {\tt LOCK}, {\tt LOD (Rx),Ry},
another memory instruction (ex. {\tt SW}), to take place in an uninterrupted
{\tt STO Rz,(Rx)}.
fashion.  Example uses of this capability include an atomic increment, such
 
as {\tt LOCK}, {\tt LW (Rx),Ry}, {\tt ADD \#1,Ry}, {\tt SW Ry,(Rx)}, or even
 
a two instruction pair such as a test and set sequence: {\tt LDI 1,Rz},
 
{\tt LOCK}, {\tt LW (Rx),Ry}, {\tt SW Rz,(Rx)}.
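
Laid out one instruction per line, the atomic increment described above might
read as follows.  This is only a restatement of the sequence in the text:
{\tt Rx} is assumed to hold the address of the counter and {\tt Ry} to be a
scratch register, neither of which is fixed by the specification.
\begin{center}
\begin{tabular}{l}
        {\tt LOCK} \\
        {\em ; Once the load starts a bus cycle, {\tt CYC} stays asserted for up to three instructions} \\
        {\tt LW (Rx),Ry} \\
        {\em ; Ry now holds the current value of the counter} \\
        {\tt ADD \#1,Ry} \\
        {\tt SW Ry,(Rx)} \\
        {\em ; The incremented value is written back within the same locked bus cycle} \\
\end{tabular}
\end{center}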
 
 
 
The {\tt SIM} and {\tt NOOP} instructions need a touch more explaining.
 
From the standpoint of the CPU, when running from Verilog within an FPGA,
 
the {\tt SIM} instruction is an illegal instruction--generating an illegal
 
instruction exception.  Likewise the {\tt NOOP} instruction is just that:
 
an instruction that consumes a clock, but does not perform any operation.
 
In both cases, the lower 22--bits are ignored.
 
 
 
Both {\tt SIM} and {\tt NOOP} instructions, though, contain 22--bits that can
 
be used by a simulator if present.  The encoding of these 22--bits is identical,
 
so that programs that run in a simulator may run on actual hardware as well
 
(using the {\tt NOOP} encoding), or they may complain that they were not intended
 
to run on actual hardware, such as if the {\tt SIM} encoding were used.
 
Particular encodings allow for exiting the simulation with a known exit
 
code, {\tt $x$EXIT}, dumping either one or all registers, {\tt $x$DUMP},
 
or simply sending a character to the simulator's standard output stream,
 
{\tt $x$OUT}--where $x$ is either {\tt N} for the {\tt NOOP} version of the
 
instruction, or {\tt S} for the {\tt SIM} version of the opcode.
 
 
 
The {\tt SIM} instruction is currently a new facility for the ZipCPU, and
 
so its functionality remains under test.
 
 
\subsection{Floating Point}
\subsection{Floating Point}
Although the ZipCPU does not (yet) have a floating point unit, the current
Although the ZipCPU does not (yet) have a floating point unit, the current
instruction set offers eight opcodes for floating point operations, and treats
instruction set offers six opcodes for floating point operations, and treats
floating point exceptions like divide by zero errors.  Once this unit is built
floating point exceptions like divide by zero errors.  Once this unit is built
and integrated together with the rest of the CPU, the ZipCPU will support
and integrated together with the rest of the CPU, the ZipCPU will support
32--bit floating point instructions natively.  Any 64--bit floating point
32--bit floating point instructions natively.  Any 64--bit floating point
instructions will either need to be emulated in software, or else they will
instructions will either need to be emulated in software, or else they will
need an external floating point peripheral.
need an external floating point peripheral.
 
 
Until the FPU is built and integrated, or even afterwards if the floating point
Until this FPU is built and integrated, or even afterwards if the floating
unit is not installed by option, floating point instructions will trigger an
point unit is not installed by option, floating point instructions will
illegal instruction exception, which may be trapped and then implemented in
trigger an illegal instruction exception, which may be trapped and then
software.
implemented in software.
 
 
\subsection{Load/Store byte}
 
One difference between the ZipCPU and many other architectures is that there are
 
no load byte {\tt LB}, store byte {\tt SB}, load halfword {\tt LH} or store
 
halfword {\tt SH} instructions.  This lack is by design in an attempt to keep
 
the 32--bit bus simple.
 
 
 
Because the ZipCPU's addresses refer to 32--bit values, i.e. address one
 
will refer to a completely different 32--bit value than address two, simulating
 
these load and store byte instructions is difficult.
 
 
 
This is just the nature of the ZipCPU, as a result of the design choices that
 
were made.
 
 
 
\subsection{Derived Instructions}
\subsection{Derived Instructions}
The ZipCPU supports many other common instructions by construction, although
The ZipCPU supports many other common instructions by construction, although
not all of them are single cycle instructions.  Tables~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3} and~\ref{tbl:derived-4} show how these
not all of them are single cycle instructions.  Tables~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3} and~\ref{tbl:derived-4} show how these
other instructions may be implemented on the ZipCPU.  Many of these
other instructions may be implemented on the ZipCPU.  Many of these
instructions will have assembly equivalents,
instructions will have assembly equivalents,
such as the branch instructions, to facilitate working with the CPU.
such as the branch instructions, to facilitate working with the CPU.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
\begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
Mapped & Actual  & Notes \\\hline
{\tt ABS Rx}
{\tt ABS Rx}
        & \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
        & \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
        & Absolute value, depends upon the derived {\tt NEG} instruction
        & Absolute value, depends upon the derived {\tt NEG} instruction
        below, and so this expands into three instructions total.\\\hline
        below, and so this expands into three instructions total.\\\hline
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
        & \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
        & \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
        & Add with carry \\\hline
        & Add with carry \\\hline
{\tt BRA.$x$ +/-\$Addr}
\hbox{\tt BRA.$x$ +/-\$Addr}
        & \hbox{\tt ADD.$x$ \$Addr+PC,PC}
        & \hbox{\tt ADD.$x$ \$Addr+PC,PC}
        & Branch or jump on condition $x$.  Works for 18--bit
        & Branch or jump on condition $x$.  Works for 18--bit
                signed address offsets.\\\hline
                signed address offsets.\\\hline
% {\tt BRA.Cond +/-\$Addr}
% {\tt BRA.Cond +/-\$Addr}
%       & \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
%       & \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
Line 1259... Line 1251...
        & Clears Rx, leaving the flags untouched.  This instruction can be
        & Clears Rx, leaving the flags untouched.  This instruction can be
                executed conditionally. The assembler will quietly  choose
                executed conditionally. The assembler will quietly  choose
                between {\tt LDI} and {\tt BREV} depending upon the existence
                between {\tt LDI} and {\tt BREV} depending upon the existence
                of the condition.\\\hline
                of the condition.\\\hline
{\tt EXCH.W Rx }
{\tt EXCH.W Rx }
        & {\tt ROL \$16,Rx}
        & \parbox[t]{1.5in}{\tt MOV Rx,Rh \\
 
                LSL \$16,Rh \\
 
                LSR \$16,Rx \\
 
                OR Rh,Rx }
        & Exchanges the top and bottom 16--bit words of Rx \\\hline
        & Exchanges the top and bottom 16--bit words of Rx \\\hline
{\tt HALT }
{\tt HALT }
        & {\tt Or \$SLEEP,CC}
        & {\tt Or \$SLEEP,CC}
        & This only works when issued in interrupt/supervisor mode.  In user
        & This only works when issued in interrupt/supervisor mode.  In user
        mode this is simply a wait until interrupt instruction.
        mode this is simply a wait until interrupt instruction.
Line 1272... Line 1267...
        success instruction.\\\hline
        success instruction.\\\hline
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
{\tt IRET}
{\tt IRET}
        & {\tt OR \$GIE,CC}
        & {\tt OR \$GIE,CC}
        & Also known as an RTU instruction (Return to Userspace) \\\hline
        & Also known as an RTU instruction (Return to Userspace) \\\hline
{\tt JMP R6+\$Offset}
\hbox{\tt JMP R6+\$Offset}
        & {\tt MOV \$Offset(R6),PC}
        & {\tt MOV \$Offset(R6),PC}
        & Only works for 13--bit offsets.  Other offsets require adding the
        & Only works for 13--bit offsets.  Other offsets require adding the
        offset first to R6 before jumping.\\\hline
        offset first to R6 before jumping.\\\hline
{\tt LJMP \$Addr}
{\tt LJMP \$Addr}
        & \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }}
        & \parbox[t]{1.5in}{\tt LW (PC),PC \\ {\em Address }}
        & Although this only works for an unconditional jump, and it only
        & Although this only works for an unconditional jump, and it only
        works in an architecture with a unified instruction and data address
        works in an architecture with a unified instruction and data address
        space, this instruction combination makes for a nice combination that
        space, this instruction combination makes for a nice combination that
        can be adjusted by a linker at a later time.\\\hline
        can be adjusted by a linker at a later time.\\\hline
{\tt LJMP.x \$Addr}
{\tt LJMP.x \$Addr}
        & \parbox[t]{1.5in}{\tt LOD.x 2(PC),PC \\ ADD 1,PC \\ {\em Address }}
        & \parbox[t]{1.5in}{\tt LW.x 4(PC),PC \\ ADD 4,PC \\ {\em Address }}
        & Long jump, works for a conditional long jump.  \\\hline
        & Long jump that works conditionally, although this is not necessarily the best way to do it.  \\\hline
\end{tabular}
\end{tabular}
\caption{Derived Instructions}\label{tbl:derived-1}
\caption{Derived Instructions}\label{tbl:derived-1}
\end{center}\end{table}
\end{center}\end{table}
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabular}{p{1.1in}p{1.8in}p{3in}}\\\hline
\begin{tabular}{p{1.1in}p{1.8in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
Mapped & Actual  & Notes \\\hline
{\tt LJSR \$Addr  }
{\tt LJSR \$Addr  }
        & \parbox[t]{1.5in}{\tt MOV \$2+PC,R0 \\ LOD (PC),PC \\ {\em Address}}
        & \parbox[t]{1.5in}{\tt MOV \$8+PC,R0 \\ LW (PC),PC \\ {\em Address}}
        & Similar to LJMP, but it handles the return address properly.
        & Similar to LJMP, but it handles the return address properly.
        \\\hline
        \\\hline
{\tt JSR PC+\$Offset  }
{\tt JSR PC+\$Offset  }
        & \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ ADD \$Offset,PC}
        & \parbox[t]{1.5in}{\tt MOV \$4+PC,R0 \\ ADD \$Offset,PC}
        & This is similar to the jump and link instructions from other
        & This is similar to the jump and link instructions from other
        architectures, save only that it requires a specific link
        architectures, save only that it requires a specific link
        instruction, seen here as the {\tt MOV} instruction on the
        instruction, seen here as the {\tt MOV} instruction on the
        left.\\\hline
        left.\\\hline
{\tt LDI \$val,Rx }
{\tt LDI \$val,Rx }
Line 1313... Line 1308...
                created to facilitate this together with {\tt BREV}.
                created to facilitate this together with {\tt BREV}.
                \\
                \\
        This is also the appropriate means for setting a register value
        This is also the appropriate means for setting a register value
        to an arbitrary 32--bit value in a post--assembly link
        to an arbitrary 32--bit value in a post--assembly link
        operation.}\\\hline
        operation.}\\\hline
{\tt LOD.b \$addr,Rx}
 
        & \parbox[t]{1.5in}{\tt %
 
        LDI     \$addr,Ra \\
 
        LDI     \$addr,Rb \\
 
        LSR     \$2,Ra \\
 
        AND     \$3,Rb \\
 
        LOD     (Ra),Rx \\
 
        LSL     \$3,Rb \\
 
        SUB     \$32,Rb \\
 
        ROL     Rb,Rx \\
 
        AND \$0ffh,Rx}
 
        & \parbox[t]{3in}{This CPU is designed for 32'bit word
 
        length instructions.  Byte addressing is not supported by the CPU or
 
        the bus, so it therefore takes more work to do.
 
 
 
        Note also that in this example, \$Addr is a byte-wise address, where
 
        all other addresses in this document are 32-bit wordlength addresses.
 
        For this reason,
 
        we needed to drop the bottom two bits.  This also limits the address
 
        space of character accesses using this method from 16 MB down to 4MB.}
 
                \\\hline
 
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
        & \parbox[t]{1.5in}{\tt LSL \$1,Ry \\
        & \parbox[t]{1.5in}{\tt LSL \$1,Ry \\
        LSL \$1,Rx \\
        LSL \$1,Rx \\
        OR.C \$1,Ry}
        OR.C \$1,Ry}
        & Logical shift left with carry.  Note that the
        & Logical shift left with carry.  Note that the
Line 1351... Line 1325...
        BREV.C \$1,Rz \\
        BREV.C \$1,Rz \\
        LSR \$1,Rx \\
        LSR \$1,Rx \\
        OR Rz,Rx}
        OR Rz,Rx}
        & Logical shift right with carry.  Unlike the shift left, this
        & Logical shift right with carry.  Unlike the shift left, this
        approach doesn't extend well to numbers larger than two words. \\\hline
        approach doesn't extend well to numbers larger than two words. \\\hline
\end{tabular}
 
\caption{Derived Instructions, continued}\label{tbl:derived-2}
 
\end{center}\end{table}
 
\begin{table}\begin{center}
 
\begin{tabular}{p{1.2in}p{1.5in}p{3.2in}}\\\hline
 
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & Negates Rx\\\hline
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & Negates Rx\\\hline
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx}
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx}
        & Conditionally negates Rx\\\hline
        & Conditionally negates Rx\\\hline
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & One's complement\\\hline
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & One's complement\\\hline
{\tt POP Rx }
{\tt POP Rx }
        & \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP}
        & \parbox[t]{1.5in}{\tt LW \$(SP),Rx \\ ADD \$4,SP}
        & The compiler avoids the need for this instruction and the similar
        & The compiler avoids the need for this instruction and the similar
        {\tt PUSH} instruction when setting up the stack by coalescing all
        {\tt PUSH} instruction when setting up the stack by coalescing all
        the stack address modifications into a single instruction at the
        the stack address modifications into a single instruction at the
        beginning of any stack frame.\\\hline
        beginning of any stack frame.\\\hline
{\tt PUSH Rx}
{\tt PUSH Rx}
        & \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP}
        & \parbox[t]{1.5in}{\hbox{\tt SUB \$4,SP}
        \hbox{\tt STO Rx,\$(SP)}}
        \hbox{\tt SW Rx,\$(SP)}}
        & Note that for pipelined operation, it helps to coalesce all the
        & Note that for pipelined operation, it helps to coalesce all the
        {\tt SUB}'s into one command, and place the {\tt STO}'s right
        {\tt SUB}'s into one command, and place the {\tt SW}'s right
        after each other.  Further, to avoid a pipeline stall, the
        after each other.  Further, to avoid a pipeline stall, the
        immediate value for the first store must be zero.
        immediate value for the first store must be zero.
        \\\hline
        \\\hline
 
\end{tabular}
 
\caption{Derived Instructions, continued}\label{tbl:derived-2}
 
\end{center}\end{table}
 
\begin{table}\begin{center}
 
\begin{tabular}{p{1.0in}p{1.5in}p{3.2in}}\\\hline
{\tt PUSH Rx-Ry}
{\tt PUSH Rx-Ry}
        & \parbox[t]{1.5in}{\tt SUB \$$n$,SP \\
        & \parbox[t]{1.5in}{\tt SUB \$$4n$,SP \\
        STO Rx,\$(SP)
        SW Rx,\$(SP)
        \ldots \\
        \ldots \\
        STO Ry,\$$\left(n-1\right)$(SP)}
        SW Ry,\$$4\left(n-1\right)$(SP)}
        & Multiple pushes at once only need the single subtract from the
        & Multiple pushes at once only need the single subtract from the
        stack pointer.  This derived instruction is analogous to a similar one
        stack pointer.  This derived instruction is analogous to a similar one
        on the Motorola 68k architecture, although the Zip Assembler
        on the Motorola 68k architecture, although the Zip Assembler
        does not support the combined instruction.  This instruction
        does not support the combined instruction.  This instruction
        also supports pipelined memory access.\\\hline
        also supports pipelined memory access.\\\hline
{\tt RESET}
{\tt RESET}
        & \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\BUSY}
        & \parbox[t]{1in}{\tt LDI~0xff000000,R2\\LDI 1,R1\\\hbox{SW R1,\$watchdog(R2)}\\BUSY}
        & This depends upon the existence of a watchdog peripheral, and the
        & This depends upon the existence of a watchdog peripheral, and the
        peripheral base address being preloaded into {\tt R12}.  The BUSY
        peripheral base address being loaded into {\tt R2}.  The BUSY
        instructions are required because the CPU will continue until the
        instructions are required because the CPU will continue until the
        {\tt STO} has completed.
        {\tt SW} has completed.
 
 
        Another opportunity might be to jump to the reset address from within
        Another opportunity might be to jump to the reset address from within
        supervisor mode.\\\hline
        supervisor mode.\\\hline
{\tt RET} & {\tt MOV R0,PC}
{\tt RET} & {\tt MOV R0,PC}
        & This depends upon the form of the {\tt JSR} given on the previous
        & This depends upon the form of the {\tt JSR} given on the previous
        page that stores the return address into R0.
        page that stores the return address into R0.
        \\\hline
        \\\hline
{\tt SEX.b Rx }
{\tt SEXB Rx }
        & \parbox[t]{1.5in}{\tt LSL 24,Rx \\ ASR 24,Rx}
        & \parbox[t]{1.5in}{\tt LSL 24,Rx \\ ASR 24,Rx}
        & Sign extend an 8--bit value into a full word.\\\hline
        & Sign extend an 8--bit value into a full word.\\\hline
{\tt SEX.h Rx }
{\tt SEXH Rx }
        & \parbox[t]{1.5in}{\tt LSL 16,Rx \\ ASR 16,Rx}
        & \parbox[t]{1.5in}{\tt LSL 16,Rx \\ ASR 16,Rx}
        & Sign extend a 16--bit value into a full word.\\\hline
        & Sign extend a 16--bit value into a full word.\\\hline
{\tt STEP Rr,Rt}
{\tt STEP Rr,Rt}
        & \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
        & \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
        & Step a Galois implementation of a Linear Feedback Shift Register, Rr,
        & Step a Galois implementation of a Linear Feedback Shift Register, Rr,
                using taps Rt \\\hline
                using taps Rt \\\hline
{\tt STEP}
{\tt STEP}
        & \parbox[t]{1.5in}{\tt OR \$Step|\$GIE,CC}
        & \parbox[t]{1.5in}{\tt OR \$Step|\$GIE,CC}
        & Steps a user mode process by one instruction\\\hline
        & Steps a user mode process by one instruction\\\hline
%
 
%
 
\end{tabular}
 
\caption{Derived Instructions, continued}\label{tbl:derived-3}
 
\end{center}\end{table}
 
\begin{table}\begin{center}
 
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
 
{\tt STO.b Rx,\$addr}
 
        & \parbox[t]{1.5in}{\tt %
 
        LDI \$addr,Ra \\
 
        LDI \$addr,Rb \\
 
        LSR \$2,Ra \\
 
        AND \$3,Rb \\
 
        SUB \$32,Rb \\
 
        LOD (Ra),Ry \\
 
        AND \$0ffh,Rx \\
 
        AND \~\$0ffh,Ry \\
 
        ROL Rb,Rx \\
 
        OR Rx,Ry \\
 
        STO Ry,(Ra) }
 
        & \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
 
        for byte-wise operations.
 
 
 
        Note that in this example, \$addr is a
 
        byte-wise address, whereas in all of our other examples it is a
 
        32-bit word address. This also limits the address space
 
        of character accesses from 16 MB down to 4MB.
 
        Further, this instruction implies a byte ordering,
 
        such as big or little endian.} \\\hline
 
{\tt SUBR Rx,Ry }
{\tt SUBR Rx,Ry }
        & \parbox[t]{1.5in}{\tt SUB 1+Rx,Ry\\ XOR -1,Ry}
        % & \parbox[t]{1.5in}{\tt SUB 1+Rx,Ry\\ XOR -1,Ry}
 
        & \parbox[t]{1.5in}{\tt XOR -1,Ry\\ADD 1+Rx,Ry}
        & Ry is set to Rx-Ry, rather than the normal subtract which
        & Ry is set to Rx-Ry, rather than the normal subtract which
        sets Ry to Ry-Rx. \\\hline
        sets Ry to Ry-Rx. \\\hline
\parbox[t]{1.4in}{\tt SUB Ra,Rx\\SUBC Rb,Ry}
\parbox[t]{1.4in}{\tt SUB Ra,Rx\\SUBC Rb,Ry}
        & \parbox[t]{1.5in}{\tt SUB Ra,Rx\\SUB.C \$1,Ry\\SUB Rb,Ry}
        & \parbox[t]{1.5in}{\tt SUB Ra,Rx\\SUB.C \$1,Ry\\SUB Rb,Ry}
        & Subtract with carry.  Note that the overflow flag may not be
        & Subtract with carry.  Note that the overflow flag may not be
Line 1463... Line 1409...
        quietly turn the LDI instruction into a {\tt BREV}/{\tt LDILO} pair,
        quietly turn the LDI instruction into a {\tt BREV}/{\tt LDILO} pair,
        but the effect would be the same. \\\hline
        but the effect would be the same. \\\hline
{\tt TS Rx,Ry,(Rz)}
{\tt TS Rx,Ry,(Rz)}
        & \hbox{\tt LDI 1,Rx}
        & \hbox{\tt LDI 1,Rx}
                \hbox{\tt LOCK}
                \hbox{\tt LOCK}
                \hbox{\tt LOD (Rz),Ry}
                \hbox{\tt LW (Rz),Ry}
                \hbox{\tt STO Rx,(Rz)}
                \hbox{\tt SW Rx,(Rz)}
        & A test and set instruction.  The {\tt LOCK} instruction ensures
        & A test and set instruction.  The {\tt LOCK} instruction ensures
        that the next two instructions lock the bus between the instructions,
        that the next two instructions lock the bus between the instructions,
        so no one else can use it.  This guarantees that the operation is
        so no one else can use it.  This guarantees that the operation is
        atomic.
        atomic.
        \\\hline
        \\\hline
 
%
 
%
 
\end{tabular}
 
\caption{Derived Instructions, continued}\label{tbl:derived-3}
 
\end{center}\end{table}
 
\begin{table}\begin{center}
 
\begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline
{\tt TST Rx}
{\tt TST Rx}
        & {\tt TST \$-1,Rx}
        & {\tt TST \$-1,Rx}
        & Set the condition codes based upon Rx without changing Rx.
        & Set the condition codes based upon Rx without changing Rx.
        Equivalent to a CMP \$0,Rx.\\\hline
        Equivalent to a CMP \$0,Rx.\\\hline
{\tt WAIT}
{\tt WAIT}
Line 1485... Line 1438...
\end{center}\end{table}
\end{center}\end{table}
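
The two's complement identities behind the {\tt NEG} and {\tt SUBR} entries
above can be checked by hand.  As a short sketch, assuming 32--bit wraparound
arithmetic and writing $\overline{x}$ for the result of {\tt XOR \$-1,x}
(a notation introduced only for this example):
\begin{eqnarray*}
\overline{x} &=& -x - 1 \pmod{2^{32}} \\
\mbox{\tt NEG Rx:} \quad \overline{R_x} + 1 &=& -R_x \\
\mbox{\tt SUBR Rx,Ry:} \quad \overline{R_y} + \left(1 + R_x\right) &=& R_x - R_y
\end{eqnarray*}
The last line is why {\tt XOR -1,Ry} followed by {\tt ADD 1+Rx,Ry} leaves
$R_x - R_y$ in {\tt Ry}.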
 
 
\subsection{Interrupt Handling}
\subsection{Interrupt Handling}
The ZipCPU does not maintain any interrupt vector tables.  If an interrupt
The ZipCPU does not maintain any interrupt vector tables.  If an interrupt
takes place, the CPU simply switches from user to supervisor (interrupt)
takes place, the CPU simply switches from user to supervisor (interrupt)
mode.  The supervisor code then continues from where it left off after
mode.  Since getting to user mode in the first place required a return to
executing a return to userspace {\tt RTU} instruction.
userspace instruction, {\tt RTU}, once the interrupt takes place the
 
supervisor simply resumes executing code immediately after that
Since the CPU may return from userspace after either an interrupt, a
{\tt RTU} instruction.
trap, or an exception, it is up to the supervisor code that handles the
 
transition to determine which of the three has taken place.
Since the CPU may return from userspace after either an interrupt (hardware
 
generated), a trap (software generated), or an exception (a fault of some
 
type), it is up to the supervisor code that handles the transition to
 
determine which of the three has taken place.
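
One way to picture this is as a supervisor loop wrapped around the {\tt RTU}
instruction.  The following is only a sketch: the label, the use of {\tt ;}
for comments, and the branch back to the top of the loop are assumptions made
for illustration, and the logic that tells an interrupt from a trap or an
exception is deliberately left out.
\begin{verbatim}
supervisor_loop:                ; label syntax is an assumption
        RTU                     ; return to userspace
        ; Execution resumes here after an interrupt, trap, or exception.
        ; The supervisor must work out which of the three occurred
        ; (that logic is omitted from this sketch).
        BRA     supervisor_loop ; go back to user mode
\end{verbatim}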
 
 
\subsection{Pipeline Stages}
\subsection{Pipeline Stages}
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
the ZipCPU supports a five stage pipeline.
the ZipCPU supports a five stage pipeline.
\begin{enumerate}
\begin{enumerate}
Line 1530... Line 1486...
 
 
\item {\bf Decode}: Decodes an instruction into its OpCode, register(s) to
\item {\bf Decode}: Decodes an instruction into its OpCode, register(s) to
        read, condition code, and immediate offset.  This stage also
        read, condition code, and immediate offset.  This stage also
        determines whether the flags will be read or set, whether registers
        determines whether the flags will be read or set, whether registers
        will be read (and hence the pipeline may need to stall), or whether the
        will be read (and hence the pipeline may need to stall), or whether the
        result will be written back.  In many ways, simplifying the CPU also
        result will be written back.  In many ways, simplifying the CPU has
        meant simplifying this pipeline stage and hence the instruction set
        meant simplifying this particular pipeline stage and hence the
        architecture.
        instruction set architecture that it implements.
 
 
 
        This stage is also responsible for both normal and CIS decoding.
 
        Hence, following this stage, little information remains regarding
 
        whether or not the CPU was executing a CIS instruction.
 
 
\item {\bf Read Operands}: Reads from the register file and applies any
\item {\bf Read Operands}: Reads from the register file and applies any
        immediate values to the result.  There is no means of detecting or
        immediate values to the result.  There is no means of detecting or
        flagging arithmetic overflow or carry when adding the immediate to the
        flagging arithmetic overflow or carry when adding the immediate to the
        operand.  This stage will stall if any source operand is pending
        operand.  This stage will stall if any source operand is pending
        and the immediate value is non--zero.
        and the immediate value is non--zero.
 
 
\item At this point, the processing flow splits into one of four tracks: An
\item At this point, the processing flow splits into one of four tracks: An
        {\bf ALU} track which will accomplish a simple instruction, the
        {\bf ALU} track which will accomplish a simple instruction, the
        {\bf MemOps} stage which handles {\tt LOD} (load) and {\tt STO}
        {\bf MemOps} stage which handles {\tt LW} (load) and {\tt SW}
        (store) instructions, the {\bf divide} unit, and the
        (store) instructions, the {\bf divide} unit, and the
        {\bf floating point} unit.
        {\bf floating point} unit.
        \begin{itemize}
        \begin{itemize}
        \item Loads will stall instructions in the read operands stage until the
        \item Loads will stall instructions in the read operands stage until
                entire memory is complete, lest a register be read only to be
                the entire memory operation is complete, lest a register be
                updated unseen by the Load.
                read from the register file only to be updated unseen by the
 
                Load.
        \item Condition codes are set upon completion of the ALU, divide,
        \item Condition codes are set upon completion of the ALU, divide,
                or FPU stage.  (Memory operations do not set conditions.)
                or FPU stage.  (Memory operations do not set conditions.)
        \item Issuing a non--pipelined memory instruction to the memory unit
        \item Issuing a non--pipelined memory instruction to the memory unit
                while the memory unit is busy will stall the entire pipeline
                while the memory unit is busy will stall the entire pipeline
                until the memory unit is idle and ready to accept another
                until the memory unit is idle and ready to accept another
Line 1610... Line 1571...
Given that there are five stages to the pipeline, that accounts
Given that there are five stages to the pipeline, that accounts
for the four stalls.  (Were the {\tt pipefetch} cache chosen, there would
for the four stalls.  (Were the {\tt pipefetch} cache chosen, there would
be another stall internal to the {\tt pipefetch} cache.)
be another stall internal to the {\tt pipefetch} cache.)
 
 
The decode stage can handle the {\tt ADD \$X,PC}, {\tt LDI \$X,PC}, and
The decode stage can handle the {\tt ADD \$X,PC}, {\tt LDI \$X,PC}, and
{\tt LOD (PC),PC} instructions specially, however, when {\tt EARLY\_BRANCHING}
{\tt LW (PC),PC} instructions specially, however, when {\tt EARLY\_BRANCHING}
is enabled.  These instructions, when
is enabled.  These instructions, when
not conditioned on the flags, can execute with only a single stall cycle (two
not conditioned on the flags, can execute with only a single stall cycle (two
for the {\tt LOD(PC),PC} instruction),
for the {\tt LW(PC),PC} instruction),
such as is shown in Fig.~\ref{fig:branch}.
such as is shown in Fig.~\ref{fig:branch}.
\begin{figure}\begin{center}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
\end{center}\end{figure}
\end{center}\end{figure}
Line 1644... Line 1605...
This is also the reason why, when setting up a stack frame, the top of the
This is also the reason why, when setting up a stack frame, the top of the
stack frame is used first: it eliminates this stall cycle.\footnote{This only
stack frame is used first: it eliminates this stall cycle.\footnote{This only
applies if there is no local memory to allocate on the stack as well.}  Hence,
applies if there is no local memory to allocate on the stack as well.}  Hence,
to save registers at the top of a procedure, one would write:
to save registers at the top of a procedure, one would write:
\begin{enumerate}
\begin{enumerate}
\item\ {\tt SUB 2,SP}
\item\ {\tt SUB 16,SP}
\item\ {\tt STO R1,(SP)}
\item\ {\tt SW R1,(SP)}
\item\ {\tt STO R2,1(SP)}
\item\ {\tt SW R2,4(SP)}
\end{enumerate}
\end{enumerate}
Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack,
Had {\tt R1} instead been stored at {\tt 4(SP)} as the top of the stack,
there would've been an extra stall in setting up the stack frame.
there would've been an extra stall in setting up the stack frame.
 
 
\item When reading from the CC register after setting the flags
\item When reading from the CC register after setting the flags
Line 1677... Line 1638...
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
it doesn't read the condition codes before executing.
it doesn't read the condition codes before executing.
 
 
\item When waiting for a memory read operation to complete
\item When waiting for a memory read operation to complete
\begin{enumerate}
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\tt LW address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}
\end{enumerate}
 
 
Remember, the ZipCPU does not support out of order execution.  Therefore,
Remember, the ZipCPU does not support out of order execution.  Therefore,
Line 1716... Line 1677...
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.
and~\ref{fig:memwr}, there will be stalls.
 
 
\item Memory operation followed by a memory operation
\item Memory operation followed by a memory operation
\begin{enumerate}
\begin{enumerate}
\item\ {\tt STO address,RA}
\item\ {\tt SW address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt LOD address,RB}
\item\ {\tt LW address,RB}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}
\end{enumerate}
 
 
In this case, the LOD instruction cannot start until the STO is finished,
In this case, the LW instruction cannot start until the SW is finished,
as illustrated by Fig.~\ref{fig:mstld}.
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
memory unit is busy with the SW instruction, but otherwise this pipeline will
stall while waiting for it to complete before the load instruction can
stall while waiting for it to complete before the load instruction can
start.
start.
 
 
The ZipCPU has the capability of supporting a form of burst memory access,
The ZipCPU has the capability of supporting a form of burst memory access,
often called pipelined memory access within this document due to its use of
often called pipelined memory access within this document due to its use of
Line 1768... Line 1729...
\subsection{Simplified Wishbone Bus}\label{ssec:bus}
\subsection{Simplified Wishbone Bus}\label{ssec:bus}
The bus architecture of the ZipCPU is that of a simplified, pipelined, WISHBONE
The bus architecture of the ZipCPU is that of a simplified, pipelined, WISHBONE
bus built according to the B4 specification.  Several changes have been made to
bus built according to the B4 specification.  Several changes have been made to
simplify this bus.  First, all unnecessary ancillary information has been
simplify this bus.  First, all unnecessary ancillary information has been
removed.  This includes the retry, tag, lock, cycle type indicator, and burst
removed.  This includes the retry, tag, lock, cycle type indicator, and burst
indicator signals.  It also includes the select lines which would enable the
indicator signals.  The bus supports big endian operation where the high order
CPU to act on less than 32--bit words.  As a result all operations on the bus
octet occupies the low order address.  Second, we insist that all
are 32--bit operations.  The bus is neither little endian nor big endian.  For
 
this reason, all words are 32--bits.  All instructions are also 32--bits wide.
 
Everything has been built around the 32--bit word.  Even the byte size (the
 
size of the minimum addressable unit) is 32--bits.  Second, we insist that all
 
accesses be pipelined, and simplify that further by insisting that pipelined
accesses be pipelined, and simplify that further by insisting that pipelined
accesses not cross peripherals---although we leave it to the user to keep that
accesses not cross peripherals---although we leave it to the user to keep that
from happening in practice.  Finally, we insist that the wishbone strobe line
from happening in practice.  Finally, we insist that the wishbone strobe line
be zero any time the cycle line is inactive.  This makes decoding simpler
be zero any time the cycle line is inactive.  This makes decoding simpler
in slave logic: a transaction is initiated whenever the strobe line is high
in slave logic: a transaction is initiated whenever the strobe line is high
Line 1790... Line 1747...
The CPU knows nothing about which addresses reference on--chip or off-chip
The CPU knows nothing about which addresses reference on--chip or off-chip
memory, or even which reference peripherals.  Indeed, there is no indication
memory, or even which reference peripherals.  Indeed, there is no indication
within the CPU if a particular piece of memory can be cached or not, save that
within the CPU if a particular piece of memory can be cached or not, save that
the CPU assumes any and all instruction words can be cached.
the CPU assumes any and all instruction words can be cached.
 
 
The one exception to this rule revolves around addresses beginning with
The one exception to this rule revolves around addresses where the top 8 bits
{\tt 2'b11} in their high order word.  These addresses are used to access a
of their high order word are all ones.  These addresses are used to access a
variety of optional peripherals that will be discussed more in
variety of optional peripherals that will be discussed more in
Sec.~\ref{sec:zipsys}, but that are only present within the {\tt ZipSystem}.
Sec.~\ref{sec:zipsys}, but that are only present within the {\tt ZipSystem}.
When used with the bare {\tt ZipBones}, these addresses will cause a bus error.
When used with the bare {\tt ZipBones}, these addresses will cause a bus error.
 
 
The prefetch cache currently has no means of detecting an instruction that
The prefetch cache currently has no means of detecting an instruction that
was changed, save by clearing the instruction cache.  This may be necessary
was changed, save by clearing the instruction cache.  This may be necessary
when loading programs into previously used memory, or when creating
when loading programs into previously used memory, or when creating
self--modifying code.
self--modifying code.
 
 
Should the memory management unit (MMU) be integrated into the ZipCPU, the MMU
Should the memory management unit (MMU) be integrated into the ZipCPU, the MMU
will be able to be configured to instruct the ZipCPU as to which addresses may
configuration will tell the ZipCPU which addresses may be cached and which not.
be cached and which not.
 
 
 
This topic is discussed further in the linker section, Sec.~\ref{sec:ld-mem}
This topic is discussed further in the linker section, Sec.~\ref{sec:ld-mem}
of the ABI chapter, Chap.~\ref{chap:abi}.
of the ABI chapter, Chap.~\ref{chap:abi}.
 
 
% \subsection{Measured Performance}\label{sec:perf}
% \subsection{Measured Performance}\label{sec:perf}
Line 1966... Line 1922...
access\footnote{The pipeline cost of the DMA controller, including setup cost,
access\footnote{The pipeline cost of the DMA controller, including setup cost,
is a minimum of $14+2N$ clocks.} (The CPU gets priority over the bus, but once
is a minimum of $14+2N$ clocks.} (The CPU gets priority over the bus, but once
bus access is granted to the DMA peripheral, it will not be revoked mid--read
bus access is granted to the DMA peripheral, it will not be revoked mid--read
or mid--write.)
or mid--write.)
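
As a worked example of the minimum cost given in the footnote, and assuming
that $N$ there counts the words moved in a single transfer, a transfer of
$N=1024$ words would take at least
\[
14 + 2\times 1024 = 2062
\]
clocks, or slightly more than two clocks per word.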
 
 
 
The DMA controller supports only aligned word accesses.  It does not support
 
byte or half-word accesses.
 
 
When copying memory from one location to another, the DMA controller will
When copying memory from one location to another, the DMA controller will
copy in units of a given transfer length--up to 1024 words at a time.  It will
copy in units of a given transfer length--up to 1024 words at a time.  It will
read that transfer length into its internal buffer, and then write to the
read that transfer length into its internal buffer, and then write to the
destination address from that buffer.
destination address from that buffer.
 
 
Line 2023... Line 1982...
% ELF Format
% ELF Format
% Stack:
% Stack:
%       R13 is the stack register.
%       R13 is the stack register.
%       The stack grows downward.
%       The stack grows downward.
%       Memory at the current stack pointer is allocated.
%       Memory at the current stack pointer is allocated.
%       Hence, a PUSH is : SUB 1,SP; STO Rx,(SP)
%       Hence, a PUSH is : SUB 4,SP; SW Rx,(SP)
% Heap:
% Heap:
%       In general, not yet implemented.
%       In general, not yet implemented.
%       A less than adequate Heap has been implemented as a pointer, from which
%       A less than adequate Heap has been implemented as a pointer, from which
%       malloc requests simply decrement it.  Free's are NOOPs, leaving
%       malloc requests simply decrement it.  Free's are NOOPs, leaving
%       allocated memory allocated forever.
%       allocated memory allocated forever.
 
 
\section{Executable File Format}\label{sec:abi-elf}
\section{Executable File Format}\label{sec:abi-elf}
ZipCPU executable files are stored in the Executable and Linkable Format
ZipCPU executable files are stored in the Executable and Linkable Format
(ELF), prior to being placed in flash, or whatever memory they will be
(ELF), prior to being placed in flash, or whatever memory they will be
executed from.  All addresses within this format are ZipCPU addresses,
executed from.
referencing 32--bit quantities, whereas all offsets internal to the ELF file
 
represent 8--bit quantities.  Thus, when running the {\tt zip-objdump} utility
The ZipCPU described by this specification uses the 16--bit value {\tt 16'hdad1}
on a ZipCPU ELF file, the addresses are properly set.
to identify itself against other CPUs.  This is not an officially registered
 
number, and may change in the future.
 
 
The ZipCPU does not (yet) have a dynamic linker/loader.  All linking is
The ZipCPU does not (yet) have a dynamic linker/loader.  All linking is
currently static, and done prior to run time.
currently static, and done prior to run time.
 
 
\section{Stack}\label{sec:abi-stack}
\section{Stack}\label{sec:abi-stack}
Although nothing in the hardware requires this, the compiler back end
Register {\tt R13} (also known as the {\tt SP} register) is the stack register.
implementation uses {\tt R13} (also known as the {\tt SP} register) as a stack
The compiler generates code that grows the stack from
register, and grows the stack from
 
high addresses to lower addresses.  That means that the stack will usually
high addresses to lower addresses.  That means that the stack will usually
start out set to a very large value, such as one past the last RAM address,
start out set to a very large value, such as one past the last RAM address,
and it will grow to lower and lower values--hopefully never mixing with the
and it will grow to lower and lower values--hopefully never mixing with the
heap.  Memory at the current stack position is assumed to be allocated.
heap.  Memory at the current stack position is assumed to be allocated.
 
 
When creating a stack frame for a function, the compiler will subtract
When creating a stack frame for a function, the compiler will subtract
the size of the stack frame from the stack register.  It will then store
the size of the stack frame from the stack register.  It will then store
any registers used by the function, from {\tt R5} to {\tt R12} (including
any registers used by the function, from {\tt R5} to {\tt R12} (including
the link register {\tt R0}) onto offsets given by the stack pointer plus a
the link register {\tt R0}) onto offsets given by the stack pointer plus a
constant.  If a frame pointer is used, the compiler uses {\tt R12} (or {\tt FP})
constant.  If a frame pointer is used, the compiler uses {\tt R12} (or
for this purpose.  The frame pointer is set by moving the stack pointer
{\tt FP}) for this purpose.  The frame pointer is set by moving the stack
plus an offset into {\tt FP}.  This {\tt MOV} instruction effectively limits
pointer plus an offset into {\tt FP}.  This {\tt MOV} instruction effectively
the size of any individual stack frame to $2^{12}-1$ words.
limits the size of any individual stack frame to $2^{12}-1$ octets.
 
 
Once a subroutine is complete, the frame is unwound.  If the frame pointer,
Once a subroutine is complete, the frame is unwound.  If the frame pointer,
{\tt FP} was used, then {\tt FP} is copied directly to the stack pointer,
{\tt FP} was used, then {\tt FP} is copied directly to the stack pointer,
{\tt SP}.  Registers are restored, starting with {\tt R0} all the way to
{\tt SP}.  Registers are restored, starting with {\tt R0} all the way to
{\tt R12} ({\tt FP}).  This also restores, and obliterates, the subroutine
{\tt R12} ({\tt FP}).  This also restores, and obliterates, the subroutine
frame pointer.  Once complete, a value is added to the stack pointer to return
frame pointer.  Once complete, a value is added to the stack pointer to
it to its original value, and a jump is made to the value located within
return it to its original value, and a jump is made to the value located
{\tt R0}.
within {\tt R0}.
 
 
\section{Relocations}\label{sec:abi-reloc}
\section{Relocations}\label{sec:abi-reloc}
 
 
The ZipCPU binutils back end supports two several relocations, although the
The ZipCPU binutils back end supports several types of relocations, although
two most common are the 32--bit relocations for register load and long jump.
the two most common are the 32--bit relocations for register load and long
 
jump.
 
 
The first of these is for loading an arbitrary 32--bit value into a register.
The first of these is for loading an arbitrary 32--bit value into a register.
Such instructions are broken into a pair of {\tt BREV} and {\tt LDILO}
Such instructions are broken into a pair of {\tt BREV} and {\tt LDILO}
instructions, and once the value of the parameter is known their immediates
instructions, and once the value of the parameter is known their immediate
are filled in.
values can be filled in.
 
 
The second type of 32--bit relocation is for jumps to arbitrary addresses.
The second type of 32--bit relocation is for jumps to arbitrary addresses.
These jumps are supported by the \hbox{\tt LOD (PC),PC} instruction, followed
These jumps are supported by the \hbox{\tt LW (PC),PC} instruction, followed
by the 32--bit address to be filled in later by the linker.  If the jump is
by the 32--bit address to be filled in later by the linker.  If the jump is
conditional, then a conditional \hbox{\tt LOD.$x$ 1(PC),PC} instruction is
conditional, then a conditional \hbox{\tt LW.$x$ 4(PC),PC} instruction is
used, followed by a {\tt BRA 1(PC),PC} and then the 32--bit relocation value.
used, followed by an {\tt ADD 4,PC} and then the 32--bit relocation value.
 
 
If the branch distance is known and within reach, branches will be implemented
If a branch distance is known and within reach, then it will be implemented
with {\tt ADD \#,PC} instructions, possibly conditional, as necessary.
with an {\tt ADD \#,PC} instruction, possibly conditional, as necessary.
 
 
While other relocations are supported, they tend not to be used nearly as much
While other relocations are supported, they tend not to be used nearly as much
as these two.
as these two.
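
As a rough illustration of the register load relocation, the C sketch below
splits a 32--bit value into the two immediates and then models the result of
running the instruction pair.  It assumes that {\tt BREV} leaves the
bit--reversed immediate in the register and that {\tt LDILO} then replaces
only the lower 16~bits.  The helper names are hypothetical and are not taken
from the binutils sources.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, and so on). */
static uint32_t bitrev32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Hypothetical helper: the 16-bit immediate that BREV would need so that
 * the upper half of the register ends up holding the upper half of value. */
static uint32_t reloc_brev_imm(uint32_t value) {
    return bitrev32(value) & 0xffffu;
}

/* Hypothetical helper: the 16-bit immediate for LDILO (lower half). */
static uint32_t reloc_ldilo_imm(uint32_t value) {
    return value & 0xffffu;
}

/* Model of the instruction pair, under the assumptions stated above. */
static uint32_t model_load_pair(uint32_t brev_imm, uint32_t lo_imm) {
    uint32_t reg = bitrev32(brev_imm);              /* BREV: sets bits 31..16 */
    reg = (reg & 0xffff0000u) | (lo_imm & 0xffffu); /* LDILO: sets bits 15..0 */
    return reg;
}
```

Under these assumptions, feeding both immediates for a given value back
through {\tt model\_load\_pair()} reproduces that value.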
 
 
\section{Call format}\label{sec:abi-jsr}
\section{Call format}\label{sec:abi-jsr}
 
 
One feature of the ZipCPU is that it has no JSR instruction.  Jumps to
One unique feature of the ZipCPU is that it has no JSR instruction.  The assembler
subroutine's therefore take three assembly instructions:
attempts to minimize this problem by replacing a {\tt JSR}~{\em address}
The first is a {\tt MOV .Lcall\#\#(PC),R0}, which places the return address
instruction with a {\tt MOV \#(PC),R0} followed by a jump to the requested
into R0.  {\tt .Lcall\#\#} in this case is a label, where \#\# is a unique
address.  In this case, the offset to the PC for the {\tt MOV} instruction
number filled in by the compiler.  This instruction is followed by a
is determined by whether or not the jump can be accomplished with a local
{\tt BRA subroutine} instruction.  Finally, the third assembly ``instruction''
branch or a long jump.
of any call sequence is the label {\tt .Lcall\#\#}.
 
 
 
While this works well in practice, GCC's implementation prevents such things
While this works well in practice, this implementation prevents such things
as {\tt JSR}'s followed by {\tt BRA}'s from being combined together.
as {\tt JSR}'s followed by {\tt BRA}'s from being combined together.
 
 
Finally, the first five operands passed to the subroutine will be placed into
Finally, GCC will place the first five operands passed to the subroutine into
registers R1--R5.  Any additional operands are placed upon the stack.
registers R1--R5.  Any additional operands are placed upon the stack.
 
 
\section{Built-ins}\label{sec:abi-builtin}
\section{Built-ins}\label{sec:abi-builtin}
The ZipCPU ABI supports a number of built-in functions.  The compiler
The ZipCPU ABI supports a number of built-in functions.  The compiler
maps these functions directly to assembly language equivalents, essentially
maps these functions directly to assembly language equivalents, essentially
Line 2114... Line 2073...
instructions.  These are:
instructions.  These are:
\begin{enumerate}
\begin{enumerate}
\item {\tt zip\_bitrev(int)} reverses the bits in the given integer, returning
\item {\tt zip\_bitrev(int)} reverses the bits in the given integer, returning
        the result.  This utilizes the internal {\tt BREV} instruction, and is
        the result.  This utilizes the internal {\tt BREV} instruction, and is
        designed to be used with FFT's as necessary.
        designed to be used with FFT's as necessary.
\item {\tt zip\_busy()} executes an {\tt ADD -1,PC} instruction, essentially
\item {\tt zip\_busy()} executes an {\tt ADD -4,PC} instruction, essentially
        forcing the CPU into a very tight infinite loop.
        forcing the CPU into a very tight infinite loop.
\item {\tt zip\_cc()} returns the value of the current CC register.  This may
\item {\tt zip\_cc()} returns the value of the current CC register.  This may
        be used within both user and supervisor code to determine in which
        be used within both user and supervisor code to determine in which
        mode the CPU is running.
        mode the CPU is running.
\item {\tt zip\_halt()} executes an \hbox{\tt OR \$SLEEP,CC} instruction to
\item {\tt zip\_halt()} executes an \hbox{\tt OR \$SLEEP,CC} instruction to
Line 2196... Line 2155...
\begin{eqnarray*}
\begin{eqnarray*}
\mbox{blkram (wx) : ORIGIN = 0x0008000, LENGTH = 0x0008000}
\mbox{blkram (wx) : ORIGIN = 0x0008000, LENGTH = 0x0008000}
\end{eqnarray*}
\end{eqnarray*}
specifies that there is a region of memory, called blkram, that can be read and
specifies that there is a region of memory, called blkram, that can be read and
written, and that programs can execute from.  This section starts at address
written, and that programs can execute from.  This section starts at address
{\tt 0x8000} and extends for another {\tt 0x8000} words.  The other memories
{\tt 0x8000} and extends for another {\tt 0x8000} bytes.  The other memories
are defined in a similar manner, with names {\tt flash} and {\tt sdram}.
are defined in a similar manner, with names {\tt flash} and {\tt sdram}.
 
 
Following the memory section, three specific symbols are defined:
Following the memory section, three specific symbols are defined:
        {\tt \_flash}, defining the beginning of flash memory,
        {\tt \_flash}, defining the beginning of flash memory,
        {\tt \_blkram}, defining the beginning of on--chip block RAM,
        {\tt \_blkram}, defining the beginning of on--chip block RAM,
Line 2301... Line 2260...
        Equivalently, this is the address of the first unused piece of
        Equivalently, this is the address of the first unused piece of
        memory, or the location from whence to start any dynamic memory
        memory, or the location from whence to start any dynamic memory
        subsystem.
        subsystem.
\end{enumerate}
\end{enumerate}
 
 
 
All of these symbols need to reference word aligned addresses.
 
 
\section{Loading ZipCPU Programs}
\section{Loading ZipCPU Programs}
There are two basic ways to load a ZipCPU program, depending upon whether or
There are two basic ways to load a ZipCPU program, depending upon whether or
not the ZipCPU is active within the current configuration.  If the ZipCPU
not the ZipCPU is active within the current configuration.  If the ZipCPU
is not a part of the current FPGA configuration, one need only write the
is not a part of the current FPGA configuration, one need only write the
flash and then switch configurations.  It will be the CPU's responsibility
flash and then switch configurations.  It will be the CPU's responsibility
Line 2507... Line 2468...
{\tt void timer\_delay(int nclocks) \{} \\
{\tt void timer\_delay(int nclocks) \{} \\
\hbox to 0.25in{}\= {\em // Clear the PIC.  We want to exit from here on timer counts alone}\\
\hbox to 0.25in{}\= {\em // Clear the PIC.  We want to exit from here on timer counts alone}\\
        \> {\tt zip->pic = DISABLEALL|SYSINT\_TMA;}\\
        \> {\tt zip->pic = DISABLEALL|SYSINT\_TMA;}\\
        \> {\tt if (nclocks > 10) \{}\\
        \> {\tt if (nclocks > 10) \{}\\
        \> \hbox to 0.25in{}\= {\em // Set our timer to count down the given number of counts}\\
        \> \hbox to 0.25in{}\= {\em // Set our timer to count down the given number of counts}\\
        \> \> {\tt zip->tma = counts} \\
        \> \> {\tt zip->tma = nclocks;} \\
        \> \> {\tt zip->pic = EINT(SYSINT\_TMA);} \\
        \> \> {\tt zip->pic = EINT(SYSINT\_TMA);} \\
        \> \> {\tt zip\_wait();} \\
        \> \> {\tt zip\_wait();} \\
        \> \> {\tt zip->pic = CLEARPIC;} \\
        \> \> {\tt zip->pic = CLEARPIC;} \\
        \> {\tt \} }{\em // else anything less has likely already passed}
        \> {\tt \} }{\em // else anything less has likely already passed} \\
{\tt \}}\\
{\tt \}}\\
\end{tabbing}
\end{tabbing}
\caption{Waiting on a timer}\label{tbl:shi-timer}
\caption{Waiting on a timer}\label{tbl:shi-timer}
\end{center}\end{table}
\end{center}\end{table}
we present one means of waiting for a programmable amount of time using a
we present one means of waiting for a programmable amount of time using a
Line 2679... Line 2640...
One common operation is that of a memory move or copy.  This section will
One common operation is that of a memory move or copy.  This section will
present several methods available to the ZipCPU for performing a memory
present several methods available to the ZipCPU for performing a memory
copy, starting with the C code shown in Tbl.~\ref{tbl:memcp-c}.
copy, starting with the C code shown in Tbl.~\ref{tbl:memcp-c}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\parbox{4in}{\begin{tabbing}
\parbox{4in}{\begin{tabbing}
{\tt void} \= {\tt memcpy(void *dest, void *src, int len) \{} \\
{\tt void} \= {\tt memcpy(char *dest, char *src, int len) \{} \\
        \> {\tt for(int i=0; i<len; i++)} \\
        \> {\tt for(int i=0; i<len; i++)} \\
        \> \hspace{0.2in} {\tt *dest++ = *src++;} \\
        \> \hspace{0.2in} {\tt *dest++ = *src++;} \\
\}
\}
\end{tabbing}}
\end{tabbing}}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
Line 2696... Line 2657...
\begin{tabbing}
\begin{tabbing}
memcpy: \\
memcpy: \\
\hbox to 0.35in{}\={\em ; R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
\hbox to 0.35in{}\={\em ; R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
\>      {\em ; The following will operate in 6 ($N=0$), or $2+12N$ clocks ($N\neq 0$).} \\
\>      {\em ; The following will operate in 6 ($N=0$), or $2+12N$ clocks ($N\neq 0$).} \\
\>      {\tt CMP 0,R3} \\ % 8 clocks per setup
\>      {\tt CMP 0,R3} \\ % 8 clocks per setup
\>      {\tt JMP.Z R0} \hbox to 0.3in{}\= {\em ; A conditional return }\\
\>      {\tt RETN.Z} \hbox to 0.3in{}\= {\em ; A conditional return }\\
\>      {\em ; No stack frame needs to be set up to use {\tt R4}, since the compiler}\\
\>      {\em ; No stack frame needs to be set up to use {\tt R4}, since the compiler}\\
\>      {\em  ; assumes {\tt R1}-{\tt R4} may be used and changed by any subroutine} \\
\>      {\em  ; assumes {\tt R1}-{\tt R4} may be used and changed by any subroutine} \\
memcpy\_loop: \\ % 12 clocks per loop
memcpy\_loop: \\ % 12 clocks per loop
\>      {\tt LOD (R2),R4} \\
\>      {\tt LB (R2),R4} \\
\>      {\em ; (4 stalls, cannot be scheduled away)} \\
\>      {\em ; (4 stalls, cannot be scheduled away)} \\
\>      {\tt STO R4,(R1)} \> {\em ; (4 schedulable stalls, has no impact now)} \\
\>      {\tt SB R4,(R1)} \> {\em ; (4 schedulable stalls, has no impact now)} \\
\>      {\em ; Update our count of the number of remaining values to copy}\\
\>      {\em ; Update our count of the number of remaining values to copy}\\
\>      {\tt SUB 1,R3}  \> {\em ; This will be zero when we have copied our last}\\
\>      {\tt SUB 1,R3}  \> {\em ; This will be zero when we have copied our last}\\
\>      {\tt JMP.Z R0}  \> {\em ; + 4 stalls, if taken}\\
\>      {\tt RETN.Z}    \> {\em ; + 4 stalls, if taken}\\
\>      {\tt ADD 1,R1}  \> {\em ; Increment the destination pointer }\\
\>      {\tt ADD 1,R1}  \> {\em ; Increment the destination pointer }\\
\>      {\tt ADD 1,R2}  \> {\em ; Increment the source pointer }\\
\>      {\tt ADD 1,R2}  \> {\em ; Increment the source pointer }\\
\>      {\tt BRA memcpy\_loop} \\
\>      {\tt BRA memcpy\_loop} \\
\>      {\em ; (1 stall on a BRA instruction)} \\
\>      {\em ; (1 stall on a BRA instruction)} \\
\end{tabbing}
\end{tabbing}
\caption{Example Memory Copy code in Zip Assembly, Unoptimized}\label{tbl:memcp-asm}
\caption{Example Memory Copy code in Zip Assembly, Unoptimized}\label{tbl:memcp-asm}
\end{center}\end{table}
\end{center}\end{table}
This example points out several things associated with the ZipCPU.  First,
This example points out several things associated with the ZipCPU.  First,
a straightforward implementation of a for loop is not the fastest loop
a straightforward implementation of a for loop is not the fastest loop
structure.  For this reason, we have placed the test to continue at the
structure.  For this reason, we have placed the test to continue at the
end.  Second, all pointers are {\tt void} pointers to arbitrary 32--bit
end.  Second, notice that we can use {\tt R4} without storing it, since the
data types.  The ZipCPU does not have explicit support for smaller or larger
C~ABI allows for subroutines to use {\tt R1}--{\tt R4} without saving them.
data types, and so this memory copy cannot be applied at an 8--bit level.
This means that we can return from this subroutine using conditional jumps to
Third, notice that we can use {\tt R4} without storing it, since the C~ABI
 
allows for subroutines to use {\tt R1}--{\tt R4} without saving them.  This
 
means that we can return from this subroutine using conditional jumps to
 
{\tt R0}.
{\tt R0}.
 
 
Still, there's more that could be done.  Suppose we wished to use the pipeline
Still, there's more that could be done.  Suppose we wished to use the pipeline
bus capability?  We might then write something closer to
bus capability?  We might then write something closer to
Tbl.~\ref{tbl:memcp-opt}.
Tbl.~\ref{tbl:memcp-opt}.
Line 2736... Line 2694...
{\em ; Upon entry, R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
{\em ; Upon entry, R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
{\em ; Achieves roughly $32+17\left\lfloor\frac{N}{4}\right\rfloor$ clocks,
{\em ; Achieves roughly $32+17\left\lfloor\frac{N}{4}\right\rfloor$ clocks,
        after the initial pipeline delay}\\
        after the initial pipeline delay}\\
memcpy\_opt: \\
memcpy\_opt: \\
\hbox to 0.35in{}\=\hbox to 1.4in{\tt CMP 4,R3}\= {\em ; Check for short lengths, len $<$ 4}\\
\hbox to 0.35in{}\=\hbox to 1.4in{\tt CMP 4,R3}\= {\em ; Check for short lengths, len $<$ 4}\\
\>      {\tt BC memcpy\_finish} \> {\em ; Jump to the end if so}\\
\>      {\tt BC \_memcpy\_finish}       \> {\em ; Jump to the end if so}\\
\hbox to 0.35in{}\=\hbox to 1.4in{\tt SUB 3,SP}\= {\em ; Otherwise, create a stack frame, storing the registers}\\
\hbox to 0.35in{}\=\hbox to 1.4in{\tt SUB 12,SP}\= {\em ; Otherwise, create a stack frame, storing the registers}\\
\>      {\tt STO R5,(SP)}       \> {\em ; we will be using.  Note that this is a pipelined store, so}\\
\>      {\tt SW R5,(SP)}        \> {\em ; we will be using.  Note that this is a pipelined store, so}\\
\>      {\tt STO R6,1(SP)}      \> {\em ; subsequent stores only cost 1 clock.}\\
\>      {\tt SW R6,4(SP)}       \> {\em ; subsequent stores only cost 1 clock.}\\
\>      {\tt STO R7,2(SP)}\\
\>      {\tt SW R7,8(SP)}\\
\>      {\tt ADD 4,R2}          \> {\em ; Pre-Increment our pointers, for a 4-stage pipeline.  This}\\
\>      {\tt ADD 4,R2}          \> {\em ; Pre-Increment our pointers, for a 4-stage pipeline.  This}\\
\>      {\tt ADD 4,R1}          \> {\em ; also fills up 3 of the 4 stall states following the}\\
\>      {\tt ADD 4,R1}          \> {\em ; also fills up 3 of the 4 stall states following the}\\
\>      {\tt SUB 5,R3}          \> {\em ; stores.  Also, leave {\tt R3} as the number left minus one.}\\
\>      {\tt SUB 5,R3}          \> {\em ; stores.  Also, leave {\tt R3} as the number left minus one.}\\
\>      {\tt LOD -4(R2),R4}     \> {\em ; Load the first four values into }\\
\>      {\tt LB -4(R2),R4}      \> {\em ; Load the first four values into }\\
\>      {\tt LOD -3(R2),R5}     \> {\em ; registers, using a pipelined load.}\\
\>      {\tt LB -3(R2),R5}      \> {\em ; registers, using pipelined loads.}\\
\>      {\tt LOD -2(R2),R6}\\
\>      {\tt LB -2(R2),R6}\\
\>      {\tt LOD -1(R2),R7}\\
\>      {\tt LB -1(R2),R7}\\
{\tt mcopy\_next\_four\_chars:} \>\> {\em ; Here's the top of our copy loop}\\
{\tt \_mcopy\_next\_four\_chars:} \>\> {\em ; Here's the top of our copy loop}\\
\>      {\tt STO  R4,-4(R1)}    \> {\em ; Store four values, using a burst memory operation.}\\
\>      {\tt SB  R4,-4(R1)}     \> {\em ; Store four values, using a burst memory operation.}\\
\>      {\tt STO  R5,-3(R1)}    \> {\em ; One clock for subsequent stores.}\\
\>      {\tt SB  R5,-3(R1)}     \> {\em ; One clock for subsequent stores.}\\
\>      {\tt STO  R6,-2(R1)}    \> {\em ; None of these affect the flags, which were set when}\\
\>      {\tt SB  R6,-2(R1)}     \> {\em ; None of these affect the flags, which were set when}\\
\>      {\tt STO  R7,-1(R1)}    \> {\em ; we last adjusted {\tt R3}}\\
\>      {\tt SB  R7,-1(R1)}     \> {\em ; we last adjusted {\tt R3}}\\
\>      {\tt BC  preend\_memcpy} \> {\em ; +4 stall cycles, but only when taken}\\
\>      {\tt BC  \_preend\_memcpy} \> {\em ; +4 stall cycles, but only when taken}\\
\>      {\tt ADD  4,R1} \> {\em ; ALU ops don't stall during stores, so}\\
\>      {\tt ADD  4,R1} \> {\em ; ALU ops don't stall during stores, so}\\
\>      {\tt ADD  4,R2} \> {\em ; increment our pointers here.} \\
\>      {\tt ADD  4,R2} \> {\em ; increment our pointers here.} \\
\>      {\tt SUB  4,R3} \> {\em ; Calculate whether or not we have a next round}\\
\>      {\tt SUB  4,R3} \> {\em ; Calculate whether or not we have a next round}\\
\>      {\tt LOD  -4(R2),R4} \> {\em ; Preload the values for the next round}\\
\>      {\tt LB  -4(R2),R4} \>  {\em ; Preload the values for the next round}\\
\>      {\tt LOD  -3(R2),R5}\>  {\em ; Notice that these are also pipelined}\\
\>      {\tt LB  -3(R2),R5}\>   {\em ; Notice that these are also pipelined}\\
\>      {\tt LOD  -2(R2),R6}\>  {\em ; loads, as before.}\\
\>      {\tt LB  -2(R2),R6}\>   {\em ; loads, as before.}\\
\>      {\tt LOD  -1(R2),R7}\>  {\em ; The four stall cycles, though, are concurrent w/ the branch.}\\
\>      {\tt LB  -1(R2),R7}\>  {\em ; The four stall cycles, though, are concurrent w/ the branch.}\\
\>      {\tt BRA  mcopy\_next\_char} \> {\em ; Early branching avoids the full memory pipeline stall} \\
\>      {\tt BRA  \_mcopy\_next\_four\_chars}\hspace{0.25in} {\em ; Early branching avoids the full memory pipeline stall} \\
{\tt preend\_memcpy:}\\
{\tt \_preend\_memcpy:}\\
\>      {\tt ADD  1,R3} \>{\em ; R3 is now the remaining length, rather than one less than it}\\
\>      {\tt ADD  1,R3} \>{\em ; R3 is now the remaining length, rather than one less than it}\\
\>      {\tt LOD (SP),R5}  \> {\em ; Restore our saved registers, since the remainder of the routine}\\
\>      {\tt LW (SP),R5}  \> {\em ; Restore our saved registers, since the remainder of the routine}\\
\>      {\tt LOD 1(SP),R6} \> {\em ; doesn't use these registers}\\
\>      {\tt LW 4(SP),R6} \> {\em ; doesn't use these registers}\\
\>      {\tt LOD 2(SP),R7} \> {\em ;}\\
\>      {\tt LW 8(SP),R7} \> {\em ;}\\
\>      {\tt ADD 3,SP}  \>{\em ; Adjust the stack pointer back to what it was}\\
\>      {\tt ADD 12,SP} \>{\em ; Adjust the stack pointer back to what it was}\\
{\tt memcpy\_finish:}\>\>{\em ; At this point, there are $0\leq$ {\tt R3}$<4$ words left}\\
{\tt \_memcpy\_finish:}\>\>{\em ; At this point, there are $0\leq$ {\tt R3}$<4$ words left}\\
\>      {\tt CMP 1,R3} \> {\em ; Check if any ops are remaining }\\
\>      {\tt CMP 1,R3} \> {\em ; Check if any ops are remaining }\\
\>      {\tt JMP.LT R0} \> {\em ; Return now if nothing is left}\\
\>      {\tt RETN.LT} \> {\em ; Return now if nothing is left}\\
\>      {\tt LOD (R2),R4} \> {\em ; Load and store the first item}\\
\>      {\tt LB (R2),R4} \> {\em ; Load and store the first item}\\
\>      {\tt STO R4,(R1)} \> {\em ;}\\
\>      {\tt SB R4,(R1)} \> {\em ;}\\
\>      {\tt JMP.Z R0}  \> {\em ; Return if that was our only value}\\
\>      {\tt RETN.Z}    \> {\em ; Return if that was our only value}\\
\>      {\tt LOD 1(R2),R4}\>{\em; Load and store the second item (if necessary)} \\
\>      {\tt LB 1(R2),R4}\>{\em; Load and store the second item (if necessary)} \\
\>      {\tt STO R4,1(R1)}\\
\>      {\tt SB R4,1(R1)}\\
\>      {\tt CMP 3, R3}\\
\>      {\tt CMP 3, R3}\\
\>      {\tt JMP.LT R0}\\
\>      {\tt RETN.LT}\\
\>      {\tt LOD 2(R2),R4}\>{\em; Load and store the third item (if necessary)} \\
\>      {\tt LB 2(R2),R4}\>{\em; Load and store the third item (if necessary)} \\
\>      {\tt STO R4,2(R1)}\>{\em; {\tt LOD}, {\tt STO}, {\tt JMP R0} will cost 10 cycles}\\
\>      {\tt SB R4,2(R1)}\>{\em; {\tt LW}, {\tt SW}, {\tt RETN} will cost 10 cycles}\\
\>      {\tt JMP     R0} \> {\em ; Finally, we return}\\
\>      {\tt RETN} \> {\em ; Finally, we return}\\
\end{tabbing}}}
\end{tabbing}}}
\caption{Example Memory Copy code in Zip Assembly, Hand Optimized}\label{tbl:memcp-opt}
\caption{Example Memory Copy code in Zip Assembly, Hand Optimized}\label{tbl:memcp-opt}
\end{center}\end{table}
\end{center}\end{table}
This pipeline memory example, though, provides some neat things to discuss
This pipeline memory example, though, provides some neat things to discuss
about optimizing code using the ZipCPU.
about optimizing code using the ZipCPU.
Line 2818... Line 2776...
without needing a new comparison.  Hence, zero to three separate values can be
without needing a new comparison.  Hence, zero to three separate values can be
copied using only two compares.
copied using only two compares.
 
 
However, this discussion wouldn't be complete without an example of how
However, this discussion wouldn't be complete without an example of how
this memory operation would be made even simpler using the direct memory
this memory operation would be made even simpler using the direct memory
access controller.  In that case, we can return to C with the code in
access controller.  In that case, we can return to the C language with the
Tbl.~\ref{tbl:memcp-dmac}.
code in Tbl.~\ref{tbl:memcp-dmac}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabbing}
\begin{tabbing}
{\tt \#define DMACOPY 0x0fed0000} {\em // Copy memory, largest chunk at a time possible} \\
{\tt \#define DMACOPY 0x0fed0000} {\em // Copy memory, largest chunk at a time possible} \\
\\
\\
{\tt void} \= {\tt memcpy\_dma(void *dest, void *src, int len) \{} \\
{\tt void} \= {\tt memcpy\_dma(void *dest, void *src, int len) \{} \\
Line 2845... Line 2803...
        \> {\tt zip\_wait();}\\
        \> {\tt zip\_wait();}\\
{\tt \}}
{\tt \}}
\end{tabbing}
\end{tabbing}
\caption{Example Memory Copy code using the DMA}\label{tbl:memcp-dmac}
\caption{Example Memory Copy code using the DMA}\label{tbl:memcp-dmac}
\end{center}\end{table}
\end{center}\end{table}
For large memory amounts, the cost of this approach will scale at roughly
The DMA, however, will only work with an integer number of 32--bit aligned
2~clocks per word transferred.
words.  Still, for large memory amounts, the cost of this approach will scale
 
at roughly 2~clocks per word transferred.
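
The alignment restriction can be checked before handing a copy to the DMA.
The following is a minimal sketch of such a check, assuming 4--octet words;
{\tt dma\_copy\_ok()} is a hypothetical helper and not part of the ZipCPU
sources.  A caller might fall back to a byte--wise copy whenever it returns
zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero when both pointers are 32-bit
 * aligned and the length (in octets) is a whole number of 32-bit words. */
static int dma_copy_ok(const void *dest, const void *src, size_t len) {
    uintptr_t d = (uintptr_t)dest;
    uintptr_t s = (uintptr_t)src;
    /* Any of the two low bits set in any operand breaks word alignment. */
    return ((d | s | (uintptr_t)len) & 3u) == 0;
}
```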
 
 
Notice how much simpler this memory copy has become to write by using the DMA.
Notice how much simpler this memory copy has become to write by using the DMA.
But also consider, the system has only one direct memory access controller.
But also consider, the system has only one direct memory access controller.
What happens if one task tries to use the controller when it is already in use
What happens if one task tries to use the controller when it is already in use
by another task?  The result is that the direct memory access controller may
by another task?  The result is that the direct memory access controller may
Line 2861... Line 2820...
Another example worth discussing is the {\tt memset()} library function.
Another example worth discussing is the {\tt memset()} library function.
A straightforward implementation of this function in C might look like
A straightforward implementation of this function in C might look like
Tbl.~\ref{tbl:memset-c}.
Tbl.~\ref{tbl:memset-c}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabbing}
\begin{tabbing}
\hbox to 0.4in{\tt void} \= {\tt *memset(void *s, int c, size\_t n) \{} \\
\hbox to 0.4in{\tt void} \= {\tt *memset(char *s, int c, size\_t n) \{} \\
        \> {\tt for(size\_t i=0; i<n; i++)} \\
        \> {\tt for(size\_t i=0; i<n; i++)} \\
        \> \hspace{0.4in} {\tt *s++ = c;} \\
        \> \hspace{0.4in} {\tt *s++ = c;} \\
        \> {\tt return s;}\\
        \> {\tt return s;}\\
{\tt \}}
{\tt \}}
\end{tabbing}
\end{tabbing}
Line 2877... Line 2836...
\begin{tabbing}
\begin{tabbing}
{\em ; Upon entry, R0 = return address, R1 = s, R2 = c, R3 = len}\\
{\em ; Upon entry, R0 = return address, R1 = s, R2 = c, R3 = len}\\
{\em ; Cost: Roughly $4+6N$ clocks}\\
{\em ; Cost: Roughly $4+6N$ clocks}\\
{\tt memset:}\\
{\tt memset:}\\
\hbox to 0.25in{}\=\hbox to 1in{\tt TST R3}\={\em ; Return immediately if len (R3) is zero}\\
\hbox to 0.25in{}\=\hbox to 1in{\tt TST R3}\={\em ; Return immediately if len (R3) is zero}\\
\>      {\tt JMP.Z R0}\\
\>      {\tt RETN.Z}\\
\>      {\tt MOV R1,R4} \> {\em ; Keep our return value in R1, use R4 as a local}\\
\>      {\tt MOV R1,R4} \> {\em ; Keep our return value in R1, use R4 as a local}\\
{\tt memset\_loop:}\>\> {\em ; Here, we know we have at least one more to go}\\
{\tt memset\_loop:}\>\> {\em ; Here, we know we have at least one more to go}\\
\>      {\tt STO R2,(R4)} \> {\em       ; Store one value (no pipelining)} \\
\>      {\tt SB R2,(R4)} \> {\em        ; Store one value (no pipelining)} \\
\>      {\tt SUB 1,R3} \> {\em; Subtract during the store}\\
\>      {\tt SUB 1,R3} \> {\em; Subtract during the store}\\
\>      {\tt JMP.Z R0} \> {\em; Return (during store) if all done}\\
\>      {\tt RETN.Z} \> {\em; Return (during store) if all done}\\
\>      {\tt ADD 1,R4} \> {\em; Otherwise increment our pointer}\\
\>      {\tt ADD 1,R4} \> {\em; Otherwise increment our pointer}\\
\>      {\tt BRA memset\_loop} {\em ; and repeat}\\
\>      {\tt BRA memset\_loop} {\em ; and repeat}\\
\end{tabbing}
\end{tabbing}
\caption{Example Memset code, minimally optimized}\label{tbl:memset-unop}
\caption{Example Memset code, minimally optimized}\label{tbl:memset-unop}
\end{center}\end{table}
\end{center}\end{table}
Line 2908... Line 2867...
\hbox to 0.25in{}\=\hbox to 0.6in{\tt MOV}\=\hbox to 1.0in{\tt R1,R4}\={\em ; Make a local copy of *s, so we can return R1}\\
\hbox to 0.25in{}\=\hbox to 0.6in{\tt MOV}\=\hbox to 1.0in{\tt R1,R4}\={\em ; Make a local copy of *s, so we can return R1}\\
\>      {\tt CMP}\>{\tt 4,R3}\>{\em ; Jump to non--unrolled section}\\
\>      {\tt CMP}\>{\tt 4,R3}\>{\em ; Jump to non--unrolled section}\\
\>      {\tt JMP.C}\>{\tt memset\_pipe\_tail}\\
\>      {\tt JMP.C}\>{\tt memset\_pipe\_tail}\\
\>      {\tt SUB}\>{\tt 1,R3}\> {\em ; R3 is now one less than the number to finish}\\
\>      {\tt SUB}\>{\tt 1,R3}\> {\em ; R3 is now one less than the number to finish}\\
{\tt memset\_pipe\_unrolled:}\>\>\> {\em ; Here, we know we have at least four more to go}\\
{\tt memset\_pipe\_unrolled:}\>\>\> {\em ; Here, we know we have at least four more to go}\\
\>      {\tt STO}\>{\tt R2,(R4)} \> {\em  ; Store our four values, pipelining our}\\
\>      {\tt SB}\>{\tt R2,(R4)} \> {\em  ; Store our four values, pipelining our}\\
\>      {\tt STO}\>{\tt R2,1(R4)} \> {\em ; access across the bus }\\
\>      {\tt SB}\>{\tt R2,1(R4)} \> {\em ; access across the bus }\\
\>      {\tt STO}\>{\tt R2,2(R4)} \\
\>      {\tt SB}\>{\tt R2,2(R4)} \\
\>      {\tt STO}\>{\tt R2,3(R4)} \\
\>      {\tt SB}\>{\tt R2,3(R4)} \\
\>      {\tt SUB}\>{\tt 4,R3} \> {\em; If there are zero left, this will be a -1 result}\\
\>      {\tt SUB}\>{\tt 4,R3} \> {\em; If there are zero left, this will be a -1 result}\\
\>      {\tt JMP.C}\>{\tt prememset\_pipe\_tail}\> \hbox to 0.5in{}\= {\em; So we can use our LT condition}\\
\>      {\tt BC}\>{\tt prememset\_pipe\_tail}\> \hbox to 0.5in{}\= {\em; So we can use our LT condition}\\
\>      {\tt ADD}\>{\tt 4,R4} \> {\em ; Otherwise increment our pointer}\\
\>      {\tt ADD}\>{\tt 4,R4} \> {\em ; Otherwise increment our pointer}\\
\>      {\tt BRA}\>{\tt memset\_pipe\_loop} {\em ; and repeat using an early branchable instruction}\\
\>      {\tt BRA}\>{\tt memset\_pipe\_unrolled} {\em ; and repeat using an early branchable instruction}\\
{\tt prememset\_pipe\_tail:} \\
{\tt prememset\_pipe\_tail:} \\
\>    {\tt ADD}\>{\tt 1,R3}\>{\em ; Return our counts left to the run number}\\
\>    {\tt ADD}\>{\tt 1,R3}\>{\em ; Return our counts left to the run number}\\
{\tt memset\_pipe\_tail:}\>\>\>{\em ; At this point, we have R3=0-3 remaining}\\
{\tt memset\_pipe\_tail:}\>\>\>{\em ; At this point, we have R3=0-3 remaining}\\
\>      {\tt CMP}\>{\tt 1,R3}   \> {\em ; If there's less than one left}\\
\>      {\tt CMP}\>{\tt 1,R3}   \> {\em ; If there's less than one left}\\
\>      {\tt JMP.C}\>{\tt R0}   \> {\em ; then return early.}\\
\>      {\tt RETN.C}\>  \> {\em ; then return early.}\\
\>      {\tt STO}\>{\tt R2,(R4)} \> {\em ; If we've got one left, store it}\\
\>      {\tt SB}\>{\tt R2,(R4)} \> {\em ; If we've got one left, store it}\\
\>      {\tt STO.GT}\>{\tt R2,1(R4)} \> {\em ; if two, do a burst store}\\
\>      {\tt SB.GT}\>{\tt R2,1(R4)} \> {\em ; if two, do a burst store}\\
\>      {\tt CMP}\>{\tt 3,R3}   \> {\em ; Check if we have another left}\\
\>      {\tt CMP}\>{\tt 3,R3}   \> {\em ; Check if we have another left}\\
\>      {\tt STO.Z}\>{\tt R2,2(R4)}     \> {\em ; and store it if so.}\\
\>      {\tt SB.Z}\>{\tt R2,2(R4)}      \> {\em ; and store it if so.}\\
\>      {\tt JMP}\>{\tt R0}     \> {\em ; Return now that we are complete.}
\>      {\tt RETN}\>            \> {\em ; Return now that we are complete.}
\end{tabbing}
\end{tabbing}
\caption{Example Memset after loop unrolling, using pipelined memory ops}\label{tbl:memset-pipe}
\caption{Example Memset after loop unrolling, using pipelined memory ops}\label{tbl:memset-pipe}
\end{center}\end{table}
\end{center}\end{table}
Note that, in this example as with the {\tt memcpy} example, our loop variable
Note that, in this example as with the {\tt memcpy} example, our loop variable
is one less than the number of operations remaining.  This is because the ZipCPU
is one less than the number of operations remaining.  This is because the
has no less than or equal comparison, but only a less than comparison.  Further,
ZipCPU has no less than or equal comparison, but only a less than comparison.
because the length is given as an unsigned quantity, we {\em only} have a
By subtracting one from the loop variable, that's all our comparison needs to
less than comparison.  By subtracting one from the loop variable, that's
be--at least, until the end of the loop.  For that, we jump to a section one
all our comparison needs to be--at least, until the end of the loop.  For
instruction earlier and return our counts value to the true remaining length.
that, we jump to a section one instruction earlier and return our counts
 
value to the true remaining length.
 
 
 
You may also notice that, despite the four possibilities in the end game, we
You may also notice that, despite the four possibilities in the end game, we
can carefully rearrange the logic to only use two compares.  The first compare
can carefully rearrange the logic to only use two compares.  The first compare
tests against less than one and returns if there are no more sets left.  Using
tests against less than one and returns if there are no more sets left.  Using
the same compare, though, we can also know if we have one or more stores left.
the same compare, though, we can also know if we have one or more stores left.
Hence, we can create a burst memory operation with one or two stores.
Hence, we can create a burst memory operation with one or two stores.
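As a rough illustration only, the end game above can be sketched in C.  The
helper name {\tt memset\_tail} and its arguments are hypothetical and do not
appear in the ZipCPU sources.  The sketch assumes between zero and three
bytes remain, and shows how two compares cover all four cases.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the ZipCPU sources): finish a memset
 * when between zero and three bytes remain to be written. */
static void memset_tail(unsigned char *p, unsigned char c, size_t remaining)
{
    if (remaining < 1)   /* CMP 1,R3 then RETN.C: nothing left to store  */
        return;
    p[0] = c;            /* SB R2,(R4): at least one byte is left        */
    if (remaining > 1)   /* SB.GT R2,1(R4): reuses the first compare     */
        p[1] = c;
    if (remaining == 3)  /* CMP 3,R3 then SB.Z R2,2(R4): second compare  */
        p[2] = c;
}
```

The first two stores follow one another without an intervening compare,
which is what allows the assembly version to issue them as a burst.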
 
 
As one final example, we might also use the DMA for this operation, as with
The three examples given so far discuss and demonstrate solutions appropriate
 
for memory accesses that are not necessarily aligned.  Were the accesses
 
aligned, the operation could be done about four times faster.  To do this,
 
the {\tt LB} and {\tt SB} instructions would need to be replaced by {\tt LW}
 
and {\tt SW} instructions.
 
 
 
Still, if all accesses were able to be aligned, then we might also use the
 
DMA for this operation.  Hence, the DMA makes our final example in
Tbl.~\ref{tbl:memset-dma}.
Tbl.~\ref{tbl:memset-dma}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabbing}
\begin{tabbing}
{\tt \#define DMA\_CONSTSRC 0x20000000} {\em // Don't increment the source address}
{\tt \#define DMA\_CONSTSRC 0x20000000} {\em // Don't increment the source address}
\\
\\
{\tt void *} \= {\tt memset\_dma(void *s, int c, size\_t len) \{} \\
{\tt int *} \= {\tt memset\_dma(int *s, int c, size\_t len) \{} \\
        \> {\em // As before, this assumes we have access to the DMA, and that}\\
        \> {\em // As before, this assumes we have access to the DMA, and that}\\
        \> {\em // we are running in system high mode ...}\\
        \> {\em // we are running in system high mode ...}\\
        \> {\tt zip->dma.len = len;} \= {\em // Set up the DMA }\\
        \> {\tt zip->dma.len = len;} \= {\em // Set up the DMA }\\
        \> {\tt zip->dma.rd  = \&c;}\\
        \> {\tt zip->dma.rd  = \&c;}\\
        \> {\tt zip->dma.wr  = s;}\\
        \> {\tt zip->dma.wr  = s;}\\
Line 2968... Line 2932...
        \> {\em // interrupt within it, now we enable the DMA interrupt, and}\\
        \> {\em // interrupt within it, now we enable the DMA interrupt, and}\\
        \> {\em // only the DMA interrupt.}\\
        \> {\em // only the DMA interrupt.}\\
        \> {\tt zip->pic = EINT(SYSINT\_DMA);}\\
        \> {\tt zip->pic = EINT(SYSINT\_DMA);}\\
        \> {\em // And wait for the DMA to complete.} \\
        \> {\em // And wait for the DMA to complete.} \\
        \> {\tt zip\_wait();}\\
        \> {\tt zip\_wait();}\\
 
        \> {\em // Return the original source pointer, so as to} \\
 
        \> {\em // match the library definition.} \\
 
        \> {\tt return s;}\\
{\tt \}}
{\tt \}}
\end{tabbing}
\end{tabbing}
\caption{Example Memset code, only this time with the DMA}\label{tbl:memset-dma}
\caption{Example Memset code, only this time with the DMA}\label{tbl:memset-dma}
\end{center}\end{table}
\end{center}\end{table}
This is almost identical to the {\tt memcpy} function above that used the
This is almost identical to the {\tt memcpy} function above that used the
DMA, save that the pointer for the value read is given to be the address
DMA, save that the pointer for the value read is given to be the address
of c, and that the DMA is instructed not to increment its source pointer.
of c, and that the DMA is instructed not to increment its source pointer.
The DMA will still do {\tt len} reads, so the asymptotic performance will never
The DMA will still do {\tt len} reads, so the asymptotic performance will never
be less than $2N$ clocks per transfer.
be less than $2N$ clocks per transfer.
 
 
\section{String Operations}
 
Perhaps one of the immediate questions most folks will have is, how does one
 
handle string operations on a CPU that only handles 32--bit numbers?  Here we
 
offer a couple of possibilities.
 
 
 
The first possibility is the easy and natural choice: just define characters
 
to be 32--bit numbers and ignore the upper 24 bits.  This is the choice made
 
by the compiler.  Hence, if you compile a simple string compare function,
 
such as Tbl.~\ref{tbl:str-cmp},
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
\hbox to 0.25in{\tt int} \= {\tt strcmp(const char *s1, const char *s2) \{} \\
 
        \> {\tt while(*s1 == *s2)} \\
 
        \> \hbox to 0.25in{} {\tt s1++, s2++;} \\
 
        \> {\tt return *s2 - *s1;} \\
 
{\tt \}}
 
\end{tabbing}
 
\caption{Example string compare function}\label{tbl:str-cmp}
 
\end{center}\end{table}
 
string length function, such as Tbl.~\ref{tbl:str-len},
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
{\tt unsigned} \= {\tt strlen(const char *s) \{} \\
 
        \> {\tt int ln = 0;} \\
 
        \> {\tt while(*s++ != 0)} \\
 
        \> \hbox to 0.25in{} {\tt ln++;} \\
 
        \> {\tt return ln;} \\
 
{\tt \}}
 
\end{tabbing}
 
\caption{Example string length function}\label{tbl:str-len}
 
\end{center}\end{table}
 
or string copy function, such as Tbl.~\ref{tbl:str-cpy},
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
{\tt char *} \= {\tt strcpy(char *dest, const char *src) \{} \\
 
        \> {\tt char *d = dest;} {\em // Make a working copy of the dest ptr}\\
 
        \> {\tt do \{} \\
 
        \> \hbox to 0.25in{} {\tt *d++ = *src;} \\
 
        \> {\tt \} while(*src++);} \\
 
        \> {\tt return dest;} \\
 
{\tt \}}
 
\end{tabbing}
 
\caption{Example string copy function}\label{tbl:str-cpy}
 
\end{center}\end{table}
 
this is what you will get.
 
 
 
A little work with these functions, and you should be able to optimize them
 
in a fashion similar to that with memcpy.  This doesn't solve the fundamental
 
problem, though, of why am I wasting 32--bits for 8--bit quantities?
 
 
 
An alternative would be to use a packed string structure.  To pack a string,
 
one might do something like Tbl.~\ref{tbl:pstr}.
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
{\tt void} \= {\tt packstr(char *s) \{} \\
 
        \> {\tt char *d = s;} \= {\em // Pack our string in place} \\
 
        \> {\tt int w;}\>{\em // A holding word to pack things into} \\
 
        \> {\tt int k=0;}\>{\em // A count to know when to move to the next word} \\
 
        \> {\tt while(*s) \{} \\
 
        \> \hbox to 0.25in{}\={\tt w = (w<<8)|(*s++ \& 0x0ff);} \\
 
        \>\> {\em // After four of these octets, write the result out} \\
 
        \> \> {\tt if (((++k)\&3)==0) *d++ = w;} \\
 
        \> {\tt \}} \\
 
        \> {\em // But what happens if we never got to the fourth octet}\\
 
        \> {\em // in our last word?  We need to clean that up here.}\\
 
\\
 
        \> {\em // First, shift the partial value all the way up}\\
 
        \> {\tt w = (w<<(32-((k\&3)<<3)));} {\em // Shift up the last word}\\
 
        \> {\tt *d++ = w;} {\em // Store any remaining partial value }\\
 
        \> {\em // If we want to make sure our strings end in zero, we need}\\
 
        \> {\em // one more step:}\\
 
        \> {\tt *d = 0;} {\em // Make sure string ends in a zero.}\\
 
{\tt \}}
 
\end{tabbing}
 
\caption{String packing function}\label{tbl:pstr}
 
\end{center}\end{table}
 
Notice that our packed string places its first byte in the high order octet
 
of our first word, that any excess octets in the last word are zeros,
 
and that there remains a zero word following our string.  With this packed
 
string approach, compares and copies can proceed four times faster.  As an
 
example, Tbl.~\ref{tbl:pstr-cmp}
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
\hbox to 0.25in{\tt int} \= {\tt pstrcmp(const char *s1, const char *s2) \{} \\
 
        \> {\tt while(*s1 == *s2)} \\
 
        \> \hbox to 0.25in{} {\tt s1++, s2++;} \\
 
        \> {\tt return *s2 - *s1;} \\
 
{\tt \}}
 
\end{tabbing}
 
\caption{Packed string compare function}\label{tbl:pstr-cmp}
 
\end{center}\end{table}
 
presents a string compare function for a packed string.  You'll notice that
 
it doesn't look all that different from a string compare for a non-packed
 
string.  This is on purpose.  Another example might be a string copy, which
 
again, wouldn't look all that different.  Getting the number of used 8--bit
 
octets within a string is a touch more difficult.  In that case, one might
 
try something like Tbl.~\ref{tbl:pstr-len}.
 
\begin{table}\begin{center}
 
\begin{tabbing}
 
{\tt unsigned} \= {\tt pstrlen(const char *s) \{} \\
 
        \> {\tt int ln = 0;} \\
 
        \> {\tt while(*s++ != 0)} \\
 
        \> \hbox to 0.25in{}\={\tt ln+=4;} \\
 
        \> {\tt if (ln) \{}\\
 
        \>\>    {\em // Touch up the length in case of an incomplete last word} \\
 
        \>\>    {\tt int lastval = s[-1];}\\
 
\\
 
        \>\>    {\tt if ((lastval \& 0x0ff)==0) ln--;}\\
 
        \>\>    {\tt if ((lastval \& 0x0ffff)==0) ln--;}\\
 
        \>\>    {\tt if ((lastval \& 0x0ffffff)==0) ln--;}\\
 
        \> {\tt \}} \\
 
        \> {\tt return ln;} \\
 
{\tt \}}
 
\end{tabbing}
 
\caption{Packed string subcharacter length function}\label{tbl:pstr-len}
 
\end{center}\end{table}
 
 
 
\section{Context Switch}
\section{Context Switch}
 
 
Fundamental to any multiprocessing system is the ability to switch from one
Fundamental to any multiprocessing system is the ability to switch from one
task to the next.  In the ZipSystem, this is accomplished in one of a couple of
task to the next.  In the ZipSystem, this is accomplished in one of a couple of
Line 3156... Line 3007...
        registers to some supervisor memory structure, such as is shown in
        registers to some supervisor memory structure, such as is shown in
        Tbl.~\ref{tbl:context-out}.
        Tbl.~\ref{tbl:context-out}.
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabbing}
\begin{tabbing}
{\tt save\_context:} \\
{\tt save\_context:} \\
\hbox to 0.25in{}\={\tt SUB 1,SP}\hbox to 0.5in{}\= {\em ; Function prologue: create a stack}\\
\hbox to 0.25in{}\={\tt SUB 4,SP}\hbox to 0.5in{}\= {\em ; Function prologue: create a stack}\\
\>        {\tt STO R5,(SP)}     \> {\em ; frame and save R5.  (R1-R4 are assumed}\\
\>        {\tt SW R5,(SP)}      \> {\em ; frame and save R5.  (R1-R4 are assumed}\\
\>        {\tt MOV uR0,R2}      \> {\em ; to be used and in need of saving.  Then}\\
\>        {\tt MOV uR0,R2}      \> {\em ; to be used and in need of saving.  Then}\\
\>        {\tt MOV uR1,R3}      \> {\em ; copy the user registers, four at a time to }\\
\>        {\tt MOV uR1,R3}      \> {\em ; copy the user registers, four at a time to }\\
\>        {\tt MOV uR2,R4}      \> {\em ; supervisor registers, where they can be}\\
\>        {\tt MOV uR2,R4}      \> {\em ; supervisor registers, where they can be}\\
\>        {\tt MOV uR3,R5}      \> {\em ; stored, while exploiting memory pipelining}\\
\>        {\tt MOV uR3,R5}      \> {\em ; stored, while exploiting memory pipelining}\\
\>        {\tt STO R2,(R1)}     \>{\em ; Exploit memory pipelining: }\\
\>        {\tt SW R2,(R1)}      \>{\em ; Exploit memory pipelining: }\\
\>        {\tt STO R3,1(R1)}    \>{\em ; All instructions write to same base memory}\\
\>        {\tt SW R3,4(R1)}     \>{\em ; All instructions write to same base memory}\\
\>        {\tt STO R4,2(R1)}    \>{\em ; All offsets increment by one }\\
\>        {\tt SW R4,8(R1)}     \>{\em ; All offsets increment by four }\\
\>        {\tt STO R5,3(R1)} \\
\>        {\tt SW R5,12(R1)} \\
\>      \ldots {\em ; Need to repeat for all user registers} \\
\>      \ldots {\em ; Need to repeat for all user registers} \\
\iffalse
 
&        {\tt MOV uR5,R0} \\
 
&        {\tt MOV uR6,R1} \\
 
&        {\tt MOV uR7,R2} \\
 
&        {\tt MOV uR8,R3} \\
 
&        {\tt MOV uR9,R4} \\
 
&        {\tt STO R0,5(R5) }\\
 
&        {\tt STO R1,6(R5) }\\
 
&        {\tt STO R2,7(R5) }\\
 
&        {\tt STO R3,8(R5) }\\
 
&        {\tt STO R4,9(R5)} \\
 
\fi
 
\>        {\tt MOV uR12,R2}     \> {\em ; Finish copying ... } \\
\>        {\tt MOV uR12,R2}     \> {\em ; Finish copying ... } \\
\>        {\tt MOV uSP,R3} \\
\>        {\tt MOV uSP,R3} \\
\>        {\tt MOV uCC,R4} \\
\>        {\tt MOV uCC,R4} \\
\>        {\tt MOV uPC,R5} \\
\>        {\tt MOV uPC,R5} \\
\>        {\tt STO R2,12(R1)}   \> {\em ; and saving the last registers.}\\
\>        {\tt SW R2,48(R1)}    \> {\em ; and saving the last registers.}\\
\>        {\tt STO R3,13(R1)}   \> {\em ; Note that even the special user registers }\\
\>        {\tt SW R3,52(R1)}    \> {\em ; Note that even the special user registers }\\
\>        {\tt STO R4,14(R1)}   \> {\em ; are saved just like any others. }\\
\>        {\tt SW R4,56(R1)}    \> {\em ; are saved just like any others. }\\
\>        {\tt STO R5,15(R1)} \\
\>        {\tt SW R5,60(R1)} \\
\>        {\tt LOD (SP),R5}     \> {\em ; Restore our one saved register}\\
\>        {\tt LW (SP),R5}      \> {\em ; Restore our one saved register}\\
\>        {\tt ADD 1,SP}                \> {\em ; our stack frame,} \\
\>        {\tt ADD 4,SP}                \> {\em ; our stack frame,} \\
\>        {\tt JMP R0}          \> {\em ; and return }\\
\>        {\tt RETN}            \> {\em ; and return }\\
\end{tabbing}
\end{tabbing}
\caption{Example Storing User Task Context}\label{tbl:context-out}
\caption{Example Storing User Task Context}\label{tbl:context-out}
\end{center}\end{table}
\end{center}\end{table}
Since this task is so fundamental, the ZipCPU compiler back end provides
Since this task is so fundamental, the ZipCPU compiler back end provides
the {\tt zip\_save\_context(int *)} function.
the {\tt zip\_save\_context(int *)} function.
Line 3236... Line 3075...
        the user registers.  An example of this is shown in
        the user registers.  An example of this is shown in
        Tbl.~\ref{tbl:context-in},
        Tbl.~\ref{tbl:context-in},
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{tabbing}
\begin{tabbing}
{\tt restore\_context:} \\
{\tt restore\_context:} \\
\hbox to 0.25in{}\= {\tt SUB 1,SP}\hbox to 0.4in{}\={\em ; Set up a stack frame} \\
\hbox to 0.25in{}\= {\tt SUB 4,SP}\hbox to 0.4in{}\={\em ; Set up a stack frame} \\
\>      {\tt STO R5,(SP)} \> {\em ; and store a local register onto it.}\\
\>      {\tt SW R5,(SP)} \> {\em ; and store a local register onto it.}\\
\\
\\
\>      {\tt LOD (R1),R2} \> {\em ; By doing four loads at a time, we are }\\
\>      {\tt LW (R1),R2} \> {\em ; By doing four loads at a time, we are }\\
\>      {\tt LOD 1(R1),R3} \> {\em ; making sure we are using our pipelined}\\
\>      {\tt LW 4(R1),R3} \> {\em ; making sure we are using our pipelined}\\
\>      {\tt LOD 2(R1),R4} \> {\em ; memory capability. }\\
\>      {\tt LW 8(R1),R4} \> {\em ; memory capability. }\\
\>      {\tt LOD 3(R1),R5} \\
\>      {\tt LW 12(R1),R5} \\
\>      {\tt MOV R2,uR0} \> {\em ; Once the registers are loaded, copy them }\\
\>      {\tt MOV R2,uR0} \> {\em ; Once the registers are loaded, copy them }\\
\>      {\tt MOV R3,uR1} \> {\em ; into the user registers that they need to}\\
\>      {\tt MOV R3,uR1} \> {\em ; into the user registers that they need to}\\
\>      {\tt MOV R4,uR2} \> {\em ; be placed within.} \\
\>      {\tt MOV R4,uR2} \> {\em ; be placed within.} \\
\>      {\tt MOV R5,uR3} \\
\>      {\tt MOV R5,uR3} \\
        \> \ldots {\em ; Need to repeat for all user registers} \\
        \> \ldots {\em ; Need to repeat for all user registers} \\
\>      {\tt LOD 12(R1),R2} \> {\em ; Now for our last four registers ...}\\
\>      {\tt LW 48(R1),R2} \> {\em ; Now for our last four registers ...}\\
\>      {\tt LOD 13(R1),R3} \\
\>      {\tt LW 52(R1),R3} \\
\>      {\tt LOD 14(R1),R4} \\
\>      {\tt LW 56(R1),R4} \\
\>      {\tt LOD 15(R1),R5} \\
\>      {\tt LW 60(R1),R5} \\
\>      {\tt MOV R2,uR12} \> {\em ; These are the special purpose ones, restored }\\
\>      {\tt MOV R2,uR12} \> {\em ; These are the special purpose ones, restored }\\
\>      {\tt MOV R3,uSP} \> {\em ; just like any others.}\\
\>      {\tt MOV R3,uSP} \> {\em ; just like any others.}\\
\>      {\tt MOV R4,uCC} \\
\>      {\tt MOV R4,uCC} \\
\>      {\tt MOV R5,uPC} \\
\>      {\tt MOV R5,uPC} \\
 
 
\>      {\tt LOD (SP),R5} \> {\em ; Restore our saved register, } \\
\>      {\tt LW (SP),R5} \> {\em ; Restore our saved register, } \\
\>      {\tt ADD 1,SP}  \> {\em ; and the stack frame, }\\
\>      {\tt ADD 4,SP}  \> {\em ; and the stack frame, }\\
\>      {\tt JMP R0}    \> {\em ; and return to where we were called from.}\\
\>      {\tt RETN}      \> {\em ; and return to where we were called from.}\\
\end{tabbing}
\end{tabbing}
\caption{Example Restoring User Task Context}\label{tbl:context-in}
\caption{Example Restoring User Task Context}\label{tbl:context-in}
\end{center}\end{table}
\end{center}\end{table}
        Because this is such an important task, the ZipCPU GCC provides a
        Because this is such an important task, the ZipCPU GCC provides a
        built--in function, {\tt zip\_restore\_context(int *)}, which can be
        built--in function, {\tt zip\_restore\_context(int *)}, which can be
Line 3293... Line 3132...
\section{ZipSystem Peripheral Registers}
\section{ZipSystem Peripheral Registers}
The ZipSystem currently maintains 20 register locations, as shown
The ZipSystem currently maintains 20 register locations, as shown
in Tbl.~\ref{tbl:zpregs}.
in Tbl.~\ref{tbl:zpregs}.
\begin{table}[htbp]
\begin{table}[htbp]
\begin{center}\begin{reglist}
\begin{center}\begin{reglist}
PIC   & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
PIC   & \scalebox{0.8}{\tt 0xff000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
WDT   & \scalebox{0.8}{\tt 0xff000004} & 32 & R/W & Watchdog Timer \\\hline
WBU&\scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus timeout error\\\hline
WBU   &\scalebox{0.8}{\tt 0xff000008} & 32 & R & Address of last bus timeout error\\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
CTRIC & \scalebox{0.8}{\tt 0xff00000c} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA  & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRA  & \scalebox{0.8}{\tt 0xff000010} & 32 & R/W & Timer A\\\hline
TMRB  & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRB  & \scalebox{0.8}{\tt 0xff000014} & 32 & R/W & Timer B\\\hline
TMRC  & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
TMRC  & \scalebox{0.8}{\tt 0xff000018} & 32 & R/W & Timer C\\\hline
JIFF  & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
JIFF  & \scalebox{0.8}{\tt 0xff00001c} & 32 & R/W & Jiffies \\\hline
MTASK  & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
MTASK & \scalebox{0.8}{\tt 0xff000020} & 32 & R/W & Master Task Clock Counter \\\hline
MMSTL  & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
MMSTL & \scalebox{0.8}{\tt 0xff000024} & 32 & R/W & Master Stall Counter \\\hline
MPSTL  & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MPSTL & \scalebox{0.8}{\tt 0xff000028} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MICNT  & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
MICNT & \scalebox{0.8}{\tt 0xff00002c} & 32 & R/W & Master Instruction Counter\\\hline
UTASK  & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UTASK & \scalebox{0.8}{\tt 0xff000030} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL  & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UMSTL & \scalebox{0.8}{\tt 0xff000034} & 32 & R/W & User Stall Counter \\\hline
UPSTL  & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UPSTL & \scalebox{0.8}{\tt 0xff000038} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT  & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
UICNT & \scalebox{0.8}{\tt 0xff00003c} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL  & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMACTRL& \scalebox{0.8}{\tt 0xff000040} & 32 & R/W & DMA Control Register\\\hline
DMALEN  & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMALEN & \scalebox{0.8}{\tt 0xff000044} & 32 & R/W & DMA total transfer length\\\hline
DMASRC  & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMASRC & \scalebox{0.8}{\tt 0xff000048} & 32 & R/W & DMA source address\\\hline
DMADST  & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
DMADST & \scalebox{0.8}{\tt 0xff00004c} & 32 & R/W & DMA destination address\\\hline
% Cache  & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
 
\end{reglist}
\end{reglist}
\caption{ZipSystem Internal/Peripheral Registers}\label{tbl:zpregs}
\caption{ZipSystem Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
\end{center}\end{table}
These registers are located in the CPU's address space, although in a special
These registers are all 32-bit registers.  Writes of less than 32--bits
area of that space.  Indeed, the area is so special, that the CPU decodes
may have unexpected results.  Further, they are located in a reserved location
the address space location before placing the request onto the bus.  For
within the CPU's address space.  As a result, references to these locations
this reason, other containers for the CPU, such as the ZipBones which doesn't
by a ZipBones based system will generate a bus error.
have these registers, will still create errors when they are referenced.
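The byte addresses in the revised table follow directly from the older word
addresses: each register sits four bytes after the previous one, starting at
{\tt 0xff000000}.  The sketch below shows that arithmetic in C.  The names
{\tt ZIPSYS\_BASE}, {\tt enum zipsys\_reg}, and {\tt zipsys\_reg\_addr} are
hypothetical, chosen for illustration, and are not taken from the ZipCPU
headers.

```c
#include <stdint.h>

/* Base address of the ZipSystem peripherals, from the register table. */
#define ZIPSYS_BASE 0xff000000u

/* Hypothetical register indices, listed in table order. */
enum zipsys_reg {
    ZIPSYS_PIC = 0, ZIPSYS_WDT,   ZIPSYS_WBU,   ZIPSYS_CTRIC,
    ZIPSYS_TMRA,    ZIPSYS_TMRB,  ZIPSYS_TMRC,  ZIPSYS_JIFF,
    ZIPSYS_MTASK,   ZIPSYS_MMSTL, ZIPSYS_MPSTL, ZIPSYS_MICNT,
    ZIPSYS_UTASK,   ZIPSYS_UMSTL, ZIPSYS_UPSTL, ZIPSYS_UICNT,
    ZIPSYS_DMACTRL, ZIPSYS_DMALEN, ZIPSYS_DMASRC, ZIPSYS_DMADST
};

/* Each register is one 32-bit word, so byte addresses step by four. */
static inline uint32_t zipsys_reg_addr(enum zipsys_reg r)
{
    return ZIPSYS_BASE + 4u * (uint32_t)r;
}
```

For instance, {\tt zipsys\_reg\_addr(ZIPSYS\_JIFF)} yields {\tt 0xff00001c},
matching the JIFF entry in the table.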
 
 
 
Here in this section, we'll walk through the definition of each of these
Here in this section, we'll walk through the definition of each of these
registers in turn, together with any bit fields that may be associated with
registers in turn, together with any bit fields that may be associated with
them, and how to set those fields.
them, and how to set those fields.
 
 
Line 3526... Line 3363...
accessing the system via the wishbone bus.  The debug port itself has been
accessing the system via the wishbone bus.  The debug port itself has been
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
\begin{table}[htbp]
\begin{table}[htbp]
\begin{center}\begin{reglist}
\begin{center}\begin{reglist}
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
ZIPDATA & 4 & 32 & R/W & Debug Data Register \\\hline
\end{reglist}
\end{reglist}
\caption{ZipSystem Debug Registers}\label{tbl:dbgregs}
\caption{ZipSystem Debug Registers}\label{tbl:dbgregs}
\end{center}\end{table}
\end{center}\end{table}
 
 
Access to the ZipSystem begins with the Debug Control register, shown in
Access to the ZipSystem begins with the Debug Control register, shown in
Line 3652... Line 3489...
and Tbl.~\ref{tbl:wishbone-master} respectively.
and Tbl.~\ref{tbl:wishbone-master} respectively.
\begin{table}[htbp]
\begin{table}[htbp]
\begin{center}
\begin{center}
\begin{wishboneds}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
Type of interface & Master, Read/Write, pipelined\\\hline
Address Width & (ZipSystem parameter, can be up to 32~bits) \\\hline
Address Width & (ZipSystem parameter, up to 30~bits) \\\hline
Port size & 32--bit \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Port granularity & 8--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Data transfer ordering & Big--Endian \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
                XuLA2--LX25\\\hline
                XuLA2--LX25\\\hline
Signal Names & \begin{tabular}{ll}
Signal Names & \begin{tabular}{ll}
                Signal Name & Wishbone Equivalent \\\hline
                Signal Name & Wishbone Equivalent \\\hline
                {\tt i\_clk} & {\tt CLK\_O} \\
                {\tt i\_clk} & {\tt CLK\_O} \\
                {\tt o\_wb\_cyc} & {\tt CYC\_O} \\
                {\tt o\_wb\_cyc} & {\tt CYC\_O} \\
                {\tt o\_wb\_stb} & {\tt (CYC\_O)\&(STB\_O)} \\
                {\tt o\_wb\_stb} & {\tt (CYC\_O)\&(STB\_O)} \\
                {\tt o\_wb\_we} & {\tt WE\_O} \\
                {\tt o\_wb\_we} & {\tt WE\_O} \\
                {\tt o\_wb\_addr} & {\tt ADR\_O} \\
                {\tt o\_wb\_addr} & {\tt ADR\_O} \\
                {\tt o\_wb\_data} & {\tt DAT\_O} \\
                {\tt o\_wb\_data} & {\tt DAT\_O} \\
 
                {\tt o\_wb\_sel} & {\tt SEL\_O} \\
                {\tt i\_wb\_ack} & {\tt ACK\_I} \\
                {\tt i\_wb\_ack} & {\tt ACK\_I} \\
                {\tt i\_wb\_stall} & {\tt STALL\_I} \\
                {\tt i\_wb\_stall} & {\tt STALL\_I} \\
                {\tt i\_wb\_data} & {\tt DAT\_I} \\
                {\tt i\_wb\_data} & {\tt DAT\_I} \\
                {\tt i\_wb\_err} & {\tt ERR\_I}
                {\tt i\_wb\_err} & {\tt ERR\_I}
                \end{tabular}\\\hline
                \end{tabular}\\\hline
Line 3738... Line 3576...
\begin{table}
\begin{table}
\begin{center}\begin{portlist}
\begin{center}\begin{portlist}
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
{\tt o\_wb\_addr}  & 32 & Output & Bus address \\\hline
{\tt o\_wb\_addr}  & 30 & Output & Bus address \\\hline
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
 
{\tt o\_wb\_sel}   &  4 & Output & Select lines\\\hline
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
{\tt i\_wb\_err}   &  1 & Input  & Bus Error indication\\\hline
{\tt i\_wb\_err}   &  1 & Input  & Bus Error indication\\\hline
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
Line 3816... Line 3655...
        A new implementation using an iCE40 FPGA suggests that the ZipCPU
        A new implementation using an iCE40 FPGA suggests that the ZipCPU
        will fit within the 4k~4--way LUTs of the iCE40 HK4X FPGA, but only
        will fit within the 4k~4--way LUTs of the iCE40 HK4X FPGA, but only
        just barely.
        just barely.
 
 
\item The ZipCPU was designed to be an implementable soft core that could be
\item The ZipCPU was designed to be an implementable soft core that could be
        placed within an FPGA, controlling actions internal to the FPGA. It
        placed within an FPGA, controlling actions internal to the FPGA.  This
        fits this role rather nicely. It does not fit the role of a general
        version of the CPU in particular has been updated so that it would
        purpose CPU replacement very well: it has no octet level access,
        support a more general purpose CPU, since as of version~2.0 the ZipCPU
        no double--precision floating point capability, neither does it have
        now supports octet level access across the bus.
        vector registers and operations.  However, it was never designed to be
 
        such a general purpose CPU but rather a system within a chip.
        Still, it fits this role rather nicely.  Other capabilities common
 
        to more general purpose CPUs, such as
 
        double--precision floating point capability, vector registers and
 
        vector operations have been left out.  However, it was never designed
 
        to be such a general purpose CPU but rather a system within a chip.
 
 
\item The extremely simplified instruction set of the ZipCPU was a good
\item The extremely simplified instruction set of the ZipCPU was a good
        choice. Although it does not have many of the commonly used
        choice. Although it does not have many of the commonly used
        instructions, PUSH, POP, JSR, and RET among them, the simplified
        instructions, PUSH, POP, JSR, and RET among them, the simplified
        instruction set has demonstrated an amazing versatility. I will contend
        instruction set has demonstrated an amazing versatility. I will contend
        therefore and for anyone who will listen, that this instruction set
        therefore and for anyone who will listen, that this instruction set
        offers a full and complete capability for whatever a user might wish
        offers a full and complete capability for whatever a user might wish
        to do with two exceptions: bytewise character access and accelerated
        to do with two exceptions: bytewise character access and accelerated
        floating-point support.
        floating-point support.
\item This simplified instruction set is easy to decode.
 
\item The simplified bus transactions (32-bit words only) were also very easy
 
        to implement.
 
\item The burst load/store approach using the wishbone pipelining mode is
\item The burst load/store approach using the wishbone pipelining mode is
        novel, and can be used to greatly increase the speed of the processor.
        novel, and can be used to greatly increase the speed of the processor.
\item The novel approach to interrupts greatly facilitates the development of
\item The novel approach to interrupts greatly facilitates the development of
        interrupt handlers from within high level languages.
        interrupt handlers from within high level languages.
 
 
Line 3859... Line 3699...
        peripheral to copy instructions from the FLASH to a temporary memory
        peripheral to copy instructions from the FLASH to a temporary memory
        location, after which they may be executed at a single instruction
        location, after which they may be executed at a single instruction
        cycle per access again.
        cycle per access again.
 
 
\item Both GCC and binutils back ends exist for the ZipCPU.
\item Both GCC and binutils back ends exist for the ZipCPU.
 
\item As of this version of the CPU, a newlib version of the C--library
 
        now exists.
\end{itemize}
\end{itemize}
 
 
\section{The Not so Good}
\section{The Not so Good}
\begin{itemize}
\begin{itemize}
\item The CPU has no octet (character) support. This is both good and bad.
 
        Realistically, the CPU works just fine without it. Characters can be
 
        supported as subsets of 32-bit words without any problem. Practically,
 
        though, this creates two problems.  The first is that it makes porting
 
        code from non-ZipCPU platforms to the ZipCPU difficult--especially
 
        anything that depends upon the existence of {\tt *int8\_t},
 
        {\tt *int16\_t}, the size difference between
 
        {\tt sizeof(int)=4*sizeof(char)}, or that tries to
 
        create unions with characters and integers and then attempts to
 
        reference the address of the characters within that union.
 
 
 
        The second problem is that peripherals that depend upon character
 
        support on the bus may need to be rewritten to work on a 32--bit bus.
 
 
 
\item The ZipCPU does not (yet) support a data cache.  One is currently under
\item The ZipCPU does not (yet) support a data cache.  One is currently under
        development.
        development.
 
 
        The ZipCPU compensates for this lack via its burst memory capability.
        The ZipCPU compensates for this lack via its burst memory capability.
        Further, performance tests using Dhrystone suggest that the ZipCPU is
        Further, performance tests using Dhrystone suggest that the ZipCPU is
Line 3910... Line 3738...
        This isn't nearly as bad as it sounds, however, since most RISC
        This isn't nearly as bad as it sounds, however, since most RISC
        architectures have 32~registers that will need to be swapped upon any
        architectures have 32~registers that will need to be swapped upon any
        context swap.
        context swap.
 
 
\item The ZipCPU is by no means generic: it will never handle addresses
\item The ZipCPU is by no means generic: it will never handle addresses
        larger than 32-bits (4GW or 16GB) without a complete and total redesign.
        larger than 32-bits (4GB) without a complete and total redesign.
        This may limit its utility as a generic CPU in the future, although
        This may limit its utility as a generic CPU in the future, although
        as an embedded CPU within an FPGA this isn't really much of a
        as an embedded CPU within an FPGA this isn't really much of a
        restriction.
        restriction.
 
 
\item While a toolchain does exist for the ZipCPU, it isn't yet fully featured.
\item While a toolchain does exist for the ZipCPU, it isn't yet fully featured.
        The ZipCPU has no support for soft floating point arithmetic,
        The ZipCPU does not yet have any support for soft floating point
        nor does it have support for several standard library functions.
        arithmetic, nor does it have gdb support.  These may be provided
        Indeed, full C library support and gdb support are still lacking.
        in future versions.
\end{itemize}
\end{itemize}
 
 
\section{The Next Generation}
\section{The Next Generation}
This section could also be labeled as my ``To do'' list.  It outlines where
This section could also be labeled as my ``To do'' list.  It outlines where
you may expect features in the future.  Currently, there are five primary
you may expect features in the future.  Currently, there are five primary
Line 3932... Line 3760...
 
 
        The lack of any floating point capability, either hard or soft, makes
        The lack of any floating point capability, either hard or soft, makes
        porting math software to the ZipCPU difficult.  Simply building a
        porting math software to the ZipCPU difficult.  Simply building a
        soft floating point library will solve this problem.
        soft floating point library will solve this problem.
 
 
\item A C library.
 
 
 
        The lack of octet support has so far prevented the porting of
 
        newlib to the ZipCPU platform.  In the end, it may mean that any
 
        C library implementation for the ZipCPU may be subtly different
 
        from any you are familiar with.
 
 
 
\item A data cache
\item A data cache
 
 
        A preliminary data cache implemented as a write through cache has
        A preliminary data cache implemented as a write through cache has
        been developed.  Adding this into the CPU should require few changes
        been developed.  Adding this into the CPU should require few changes
        internal to the CPU.  I expect future versions of the CPU will permit
        internal to the CPU.  I expect future versions of the CPU will permit
Line 3951... Line 3772...
\item A Memory Management Unit
\item A Memory Management Unit
 
 
        The first version of such an MMU has already been written.  It is
        The first version of such an MMU has already been written.  It is
        available for examination in the ZipCPU repository.  This MMU exists
        available for examination in the ZipCPU repository.  This MMU exists
        as a peripheral of the ZipCPU.  Integrating this MMU into the ZipCPU
        as a peripheral of the ZipCPU.  Integrating this MMU into the ZipCPU
        will involve slowing down memory stores so that they can be accomplished
        will involve slowing down memory stores so that they can be
        synchronously, as well as determining how and when particular cache
        accomplished synchronously, as well as determining how and when
        lines need to be invalidated.
        particular cache lines need to be invalidated.
 
 
\item An integrated floating point unit (FPU)
\item An integrated floating point unit (FPU)
 
 
        Why a small scale CPU needs a hefty floating point unit, I'm not
        Why a small scale CPU needs a hefty floating point unit, I'm not
        certain, but many application contexts require the ability to do
        certain, but many application contexts require the ability to do
Line 3987... Line 3808...
% -     ADD     x,PC            // Any PC relative jump (20 bits)
% -     ADD     x,PC            // Any PC relative jump (20 bits)
%
%
% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)
% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)
%
%
% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,
% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,
%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction
%       LW      Addr(Rx),Rx     //    unconditional, requires second instruction
%
%
% -     LOD.C   Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond
% -     LW.C    Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond
%
%
% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid
% -     SW.C    Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid
%
%
% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table
% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table
%       BRA     +1              // memory address.  The BRA +1 can be skipped,
%       BRA     +1              // memory address.  The BRA +1 can be skipped,
%       .WORD   Addr            // but only if the address is placed at the end
%       .WORD   Addr            // but only if the address is placed at the end
%       LOD     -2(PC),PC       // of an executable section
%       LW      -2(PC),PC       // of an executable section
%
%
 
 
 No newline at end of file
 No newline at end of file
