URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

# Subversion Repositories zipcpu

## [/] [zipcpu/] [trunk/] [doc/] [src/] [spec.tex] - Diff between revs 39 and 68

Rev 39 Rev 68
Line 42... Line 42...
%%              http://www.gnu.org/licenses/gpl.html
%%              http://www.gnu.org/licenses/gpl.html
%%
%%
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\documentclass{gqtekspec}

\usepackage{import}

% \graphicspath{{../gfx}}
\project{Zip CPU}
\project{Zip CPU}
\title{Specification}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.5}
\revision{Rev.~0.6}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
\usepackage[dvips,ps2pdf,colorlinks=true,
        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
        pdfauthor={Dan Gisselquist},
        pdfauthor={Dan Gisselquist},
Line 74... Line 76...
You should have received a copy of the GNU General Public License along
You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
copy.
\end{license}
\end{license}
\begin{revisionhistory}
\begin{revisionhistory}

0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
Line 165... Line 168...
\item Completely open source, licensed under the GPL.\footnote{Should you
\item Completely open source, licensed under the GPL.\footnote{Should you
        need a copy of the Zip CPU licensed under other terms, please
        need a copy of the Zip CPU licensed under other terms, please
        contact me.}
        contact me.}
\end{itemize}
\end{itemize}
 
 

The Zip CPU also has one unique feature: the ability to do pipelined loads

and stores.  This allows the CPU to access on-chip memory at one access per

clock, minus a stall for the initial access.

 

\section{Characteristics of a SwiC}

 

Here, we shall define a soft core internal to an FPGA as a ``System within a

Chip,'' or a SwiC.  SwiCs have some unique properties internal to them

that have influenced the design of the Zip CPU.  Among these are the bus,

memory, and available peripherals.

 

Most other approaches to soft core CPUs employ a Harvard architecture.

This allows these other CPUs to have two separate bus structures: one for the

program fetch, and the other for the memory.  The Zip CPU is fairly unique in

its approach because it uses a von Neumann architecture.  This was done for

simplicity.  By using a von Neumann architecture, only one bus needs to be

implemented within any FPGA.  This helps to minimize real-estate, while

maintaining a high clock speed.  The disadvantage is that it can severely

degrade the overall instructions per clock count.

 

Soft cores within an FPGA have an additional characteristic regarding

memory access: it is slow.  Memory on chip may be accessed at a single

cycle per access, but small FPGAs have a limited amount of memory on chip.

Going off chip, however, is expensive.  Two examples will prove this point.  On

the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,

or 64~cycles per subsequent word in a pipelined architecture.  Likewise, the

SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles

per read, and 2~cycles for any subsequent pipelined access read or write.

Either way you look at it, this memory access will be slow and this doesn't

account for any logic delays should the bus implementation logic get

complicated.
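
To make these figures concrete, the following Python sketch totals the bus
cycles needed to move a burst of words, using only the XuLA2 numbers quoted
above.  It is not part of the Zip CPU distribution: the table and function
names are illustrative assumptions, and logic delays in the bus are ignored.

```python
# Cycle counts quoted in the text for the XuLA2 board, stored as
# (cycles for the first access, cycles for each subsequent pipelined access).
# The dictionary keys are hypothetical labels chosen for this sketch.
XULA2_CYCLES = {
    "flash_read": (128, 64),
    "sdram_write": (6, 2),
    "sdram_read": (10, 2),
}


def burst_cycles(kind, nwords, pipelined=True):
    """Return the total cycles needed to transfer nwords 32-bit words."""
    if nwords <= 0:
        return 0
    first, subsequent = XULA2_CYCLES[kind]
    if not pipelined:
        # Every access pays the full first-access cost.
        return first * nwords
    # Only the first access pays the full cost; the rest are pipelined.
    return first + subsequent * (nwords - 1)


if __name__ == "__main__":
    for kind in XULA2_CYCLES:
        print(kind, burst_cycles(kind, 8), burst_cycles(kind, 8, pipelined=False))
```

Under these assumptions an eight word SDRAM read costs far fewer cycles when
pipelined than when each word is fetched separately, which is the motivation
for the pipelined access discussed next.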

 

As may be noticed from the above discussion about memory speed, a second

characteristic of memory is that all memory accesses may be pipelined, and

that pipelined memory access is faster than non--pipelined access.  Therefore,

a SwiC soft core should support pipelined operations, but it should also

allow a higher priority subsystem to get access to the bus (no starvation).

 

As a further characteristic of SwiC memory options, on-chip caches are

expensive.  If you want to have a minimum of logic, cache logic may not be

the highest on the priority list.

 

In sum, memory is slow.  While one processor on one FPGA may be able to fill

its pipeline, the same processor on another FPGA may struggle to get more than

one instruction at a time into the pipeline.  Any SwiC must be able to deal

with both cases: fast and slow memories.

 

A final characteristic of SwiCs within FPGAs is the peripherals.

Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily

be created on chip to support the SwiC if necessary.  As an example, a simple

30-bit peripheral could easily support reversing 30-bit numbers: a read from

the peripheral returns its bit--reversed address.  This is cheap within an

FPGA, but expensive in instructions.
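
For comparison, a software bit reversal might be sketched as below.  This is
an illustration in Python rather than Zip CPU code, and the function name is
an assumption.  The loop needs a shift, a mask and an OR for every bit,
whereas within the fabric the same result is only a matter of wiring.

```python
def bit_reverse(value, width=30):
    """Reverse the low `width` bits of value, one bit per loop iteration."""
    result = 0
    for _ in range(width):
        # Move the lowest remaining input bit into the bottom of the result.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    print(hex(bit_reverse(1)))
```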

 

Indeed, anything that must be done fast within an FPGA is likely to already

be done--elsewhere in the fabric.  This leaves the CPU with the role of handling

sequential tasks that need a lot of state.

 

This means that the SwiC needs to live within a unique environment,

separate and different from the traditional SoC.  That isn't to say that a

SwiC cannot be turned into a SoC, just that this SwiC has not been designed

for that purpose.

 

\section{Lessons Learned}

 
Now, however, that I've worked on the Zip CPU for a while, it is not nearly
Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped.  Worse, I've had to adjust to create
as simple as I originally hoped.  Worse, I've had to adjust to create
capabilities that I was never expecting to need.  These include:
capabilities that I was never expecting to need.  These include:
\begin{itemize}
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
Line 237... Line 305...
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
        stalls at all, but rather to require the compiler to properly schedule
        stalls at all, but rather to require the compiler to properly schedule
        all instructions so that stalls would never be necessary.  After trying
        all instructions so that stalls would never be necessary.  After trying
        to build such an architecture, I gave up, having learned some things:
        to build such an architecture, I gave up, having learned some things:
 
 
        For example, in  order to facilitate interrupt handling and debug
        First, an ideal pipeline might look something like
        stepping, the CPU needs to know what instructions have finished, and
        Fig.~\ref{fig:ideal-pipeline}.
        which have not.  In other words, it needs to know where it can restart
\begin{figure}
        the pipeline from.  Once restarted, it must act as though it had
\begin{center}
        never stopped.  This killed my idea of delayed branching, since what
\includegraphics[width=4in]{../gfx/fullpline.eps}
        would be the appropriate program counter to restart at?  The one the
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
        CPU was going to branch to, or the ones in the delay slots?  This
\end{center}\end{figure}
        also makes the idea of compressed instruction codes difficult, since,
        Notice that, in this figure, all the pipeline stages are complete and
        again, where do you restart on interrupt?
        full.  Every instruction takes one clock and there are no delays.

        However, as the discussion above pointed out, the memory associated

        with a SwiC may not allow single clock access.  It may be instead

        that you can only read every two clocks.  In that case, what shall

        the pipeline look like?  Should it look like

        Fig.~\ref{fig:waiting-pipeline},

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stuttra.eps}

\caption{Instructions wait for each other}\label{fig:waiting-pipeline}

\end{center}\end{figure}

        where instructions are held back until the pipeline is full, or should

        it look like Fig.~\ref{fig:independent-pipeline},

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stuttrb.eps}

\caption{Instructions proceed independently}\label{fig:independent-pipeline}

\end{center}\end{figure}

        where each instruction is allowed to move through the pipeline

        independently?  For better or worse, the Zip CPU allows instructions

        to move through the pipeline independently.

 

        One approach to avoiding stalls is to use a branch delay slot,

        such as is shown in Fig.~\ref{fig:brdelay}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bdly.eps}

\caption{A typical branch delay slot approach}\label{fig:brdelay}

\end{center}\end{figure}

        In this figure, instructions

        {\tt BR} (a branch), {\tt BD} (a branch delay instruction),

        are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.

        Since it takes a processor a clock cycle to execute a branch, the

        delay slot allows the processor to do something useful in that

        cycle.  The problem the Zip CPU has with this approach is, what

        happens when the pipeline looks like Fig.~\ref{fig:brbroken}?

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bdbroken.eps}

\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}

\end{center}\end{figure}

        In this case, the branch delay slot never gets filled in the first

        place, and so the pipeline squashes it before it gets executed.

        If not that, then what happens when handling interrupts or

        debug stepping: when has the CPU finished an instruction?

        When the {\tt BR} instruction has finished, or must {\tt BD}

        follow every {\tt BR}?  And, again, what if the pipeline isn't

        full?

        These thoughts killed any hopes of doing delayed branching.
 
 
        So I switched to a model of discrete execution: Once an instruction
        So I switched to a model of discrete execution: Once an instruction
        enters into either the ALU or memory unit, the instruction is
        enters into either the ALU or memory unit, the instruction is
        guaranteed to complete.  If the logic recognizes a branch or a
        guaranteed to complete.  If the logic recognizes a branch or a
        condition that would render the instruction entering into this stage
        condition that would render the instruction entering into this stage
        possibly inappropriate (e.g.\ a conditional branch preceding a store
        possibly inappropriate (e.g.\ a conditional branch preceding a store
        instruction), then the pipeline stalls for one cycle
        instruction), then the pipeline stalls for one cycle
        until the conditional branch completes.  Then, if it generates a new
        until the conditional branch completes.  Then, if it generates a new
        PC address, the stages preceding are all wiped clean.
        PC address, the stages preceding are all wiped clean.
 
 
        The discrete execution model allows such things as sleeping: if the
        This model, however, generated too many pipeline stalls, so the

        discrete execution model was modified to allow instructions to go

        through the ALU unit and be canceled before writeback.  This removed

        the stall associated with ALU instructions before untaken branches.

 

        The discrete execution model allows such things as sleeping, as

        outlined in Fig.~\ref{fig:sleeping}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/sleep.eps}

\caption{How the CPU halts when sleeping}\label{fig:sleeping}

\end{center}\end{figure}

        If the
        CPU is put to ``sleep,'' the ALU and memory stages stall and back up
        CPU is put to ``sleep,'' the ALU and memory stages stall and back up
        everything before them.  Likewise, anything that has entered the ALU
        everything before them.  Likewise, anything that has entered the ALU
        or memory stage when the CPU is placed to sleep continues to completion.
        or memory stage when the CPU is placed to sleep continues to completion.
        To handle this logic, each pipeline stage has three control signals:
        To handle this logic, each pipeline stage has three control signals:
        a valid signal, a stall signal, and a clock enable signal.  In
        a valid signal, a stall signal, and a clock enable signal.  In
        general, a stage stalls if its contents are valid and the next step
        general, a stage stalls if its contents are valid and the next step
        is stalled.  This allows the pipeline to fill any time a later stage
        is stalled.  This allows the pipeline to fill any time a later stage
        stalls.
        stalls, as illustrated in Fig.~\ref{fig:stacking}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stacking.eps}

\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}

\end{center}\end{figure}
 
 
        This approach is also different from other pipeline approaches.  Instead
        This approach is also different from other pipeline approaches.
        of keeping the entire pipeline filled, each stage is treated
        Instead of keeping the entire pipeline filled, each stage is treated
        independently.  Therefore, individual stages may move forward as long
        independently.  Therefore, individual stages may move forward as long
        as the subsequent stage is available, regardless of whether the stage
        as the subsequent stage is available, regardless of whether the stage
        behind it is filled.
        behind it is filled.
 
 
\item {\bf Verilog Modules:} When examining how other processors worked
\item {\bf Verilog Modules:} When examining how other processors worked
Line 334... Line 461...
\begin{table}\begin{center}
\begin{table}\begin{center}
\begin{bitlist}
\begin{bitlist}
31\ldots 11 & R/W & Reserved for future uses\\\hline
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
8 & R & (Reserved for) Illegal Instruction Flag\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
3 & R/W & Overflow\\\hline
Line 399... Line 526...
%
%
 
 
This functionality was added to enable an external debugger to
This functionality was added to enable an external debugger to
        set and manage breakpoints.
        set and manage breakpoints.
 
 
The ninth bit is reserved for an illegal instruction bit.  When the CPU
The ninth bit is an illegal instruction bit.  When the CPU
tries to execute either a non-existent instruction, or an instruction from
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (once implemented) switch
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit.  The bit will automatically be
to supervisor mode while setting this bit.  The bit will automatically be
cleared upon any return to user mode.
cleared upon any return to user mode.
 
 
The tenth bit is a trap bit.  It is set whenever the user requests a soft
The tenth bit is a trap bit.  It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command.  This allows the
interrupt, and cleared on any return to userspace command.  This allows the
Line 416... Line 543...
\begin{table}
\begin{table}
\begin{center}
\begin{center}
\begin{tabular}{l|l}
\begin{tabular}{l|l}
Bit & Meaning \\\hline
Bit & Meaning \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & (Reserved for) Floating point enable \\\hline
8 & Illegal instruction error flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
5 & GIE, or Global Interrupt Enable \\\hline
4 & Sleep \\\hline
4 & Sleep \\\hline
3 & V, or overflow bit.\\\hline
3 & V, or overflow bit.\\\hline
Line 452... Line 579...
\end{center}
\end{center}
\end{table}
\end{table}
There is no condition code for less than or equal, not C or not V.  Sorry,
There is no condition code for less than or equal, not C or not V.  Sorry,
I ran out of space in 3--bits.  Conditioning on a non--supported condition
I ran out of space in 3--bits.  Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
is still possible, but it will take an extra instruction and a pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})

As an alternative, the condition may often be reversed, recovering those

extra two clocks.  Thus instead of \hbox{\tt CMP Rx,Ry;}

\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.
 
 
Conditionally executed ALU instructions will not further adjust the
Conditionally executed ALU instructions will not further adjust the
condition codes.
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}

instructions.   Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions

will adjust conditions whenever their conditionals are true.  In this way,

multiple conditions may be evaluated without branches.  For example, to do

something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try

code such as Tbl.~\ref{tbl:dbl-condition}.

\begin{table}\begin{center}

\begin{tabular}{l}

        {\tt CMP 1,R0} \\

        {;\em Condition codes are now set based upon R0-1} \\

        {\tt CMP.Z 2,R1} \\

        {;\em If R0 $\neq$ 1, conditions are unchanged.} \\

        {;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\

        {;\em Now do something based upon the conjunction of both conditions.} \\

        {;\em While we use the example of a STO, it could be any instruction.} \\

        {\tt STO.Z R0,(R2)} \\

\end{tabular}

\caption{An example of a double conditional}\label{tbl:dbl-condition}

\end{center}\end{table}
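
The behavior of Tbl.~\ref{tbl:dbl-condition} can be checked with a small
model.  The Python sketch below is an illustration only: it tracks nothing
but the Z flag, and its function names are assumptions rather than anything
from the Zip CPU sources.

```python
def cmp_sets_z(imm, reg):
    """Model only the Z flag of CMP imm,reg: Z is set when reg - imm is zero."""
    return (reg - imm) == 0


def double_condition(r0, r1):
    """Return True when the final STO.Z of the example would execute."""
    z = cmp_sets_z(1, r0)      # CMP 1,R0: flags set based upon R0-1
    if z:                      # CMP.Z 2,R1 executes only when Z is set
        z = cmp_sets_z(2, r1)  # ... and then sets flags based upon R1-2
    return z                   # STO.Z executes only if Z is still set


if __name__ == "__main__":
    for r0, r1 in [(1, 2), (1, 3), (0, 2)]:
        print(r0, r1, double_condition(r0, r1))
```

In this model the store executes only when both comparisons match, without
any branch being taken.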

 

\section{Traditional Interrupt Handling}
 
 
\section{Operand B}
\section{Operand B}
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
This Operand B is either equal to a register plus a signed immediate offset,
or an immediate offset by itself.  This value is encoded as shown in
or an immediate offset by itself.  This value is encoded as shown in
Line 999... Line 1149...
prefetch cache is loaded to support the next instruction.
prefetch cache is loaded to support the next instruction.
 
 
\item While waiting for the pipeline to load following any taken branch, jump,
\item While waiting for the pipeline to load following any taken branch, jump,
        return from interrupt or switch to interrupt context (5 stall cycles)
        return from interrupt or switch to interrupt context (5 stall cycles)
 
 
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
Fig.~\ref{fig:bcstalls}
be reloaded.  Given that there are five stages to the pipeline, that accounts
\begin{figure}\begin{center}
for four of the five stalls.  The stall cycle is lost in the pipelined prefetch
\includegraphics[width=3.5in]{../gfx/bc.eps}
stage which needs at least one clock with a valid PC before it can produce
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}
a new output.
\end{center}\end{figure}

illustrates the situation for a conditional branch.  In this case, the branch

instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so

forth.  However, since the branch is taken, the next instruction must be

{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.

Given that there are five stages to the pipeline, that accounts

for four of the five stalls.  The last stall cycle is lost in the pipelined

prefetch stage which needs at least one clock with a valid PC before it can

produce a new output.  {\Large\bf Note: When I did this myself, I counted

six stall cycles, for a total of seven cycles for this instruction.  Is five

really the right answer?}
 
 
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
not conditioned on the flags, can execute with only 3~stall cycles.
not conditioned on the flags, can execute with only 2~stall cycles, such as

is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is

slated to be improved upon in subsequent releases.  With a better prefetch,

it should be possible to drop this down to one or zero stall cycles.}

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bra.eps}

\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch}

\end{center}\end{figure}

In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction

following the branch in memory, while {\tt IA} is the first instruction at the

branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does

not represent any instruction.)
 
 
\item When reading from a prior register while also adding an immediate offset
\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
\begin{enumerate}
\item\ {\tt OPCODE ?,RA}
\item\ {\tt OPCODE ?,RA}
\item\ {\em (stall)}
\item\ {\em (stall)}
Line 1024... Line 1195...
opcode that will read and apply an immediate offset by one instruction.  The
opcode that will read and apply an immediate offset by one instruction.  The
good news is that this stall can easily be mitigated by proper scheduling.
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.
scheduled into the stall slot.
 
 
\item When any write to either the CC or PC Register is followed by a memory
\item When any (conditional) write to either the CC or PC Register is followed
        operation
        by a memory operation
\begin{enumerate}
\begin{enumerate}
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
\item\ {\em (stall, even if jump not taken)}
\item\ {\em (stall, even if jump not taken)}
\item\ {\tt LOD \$X(RA),RB}
\item\ {\tt LOD \$X(RA),RB}
\end{enumerate}
\end{enumerate}

A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},

\begin{figure}\begin{center}

\includegraphics[width=2in]{../gfx/bcmem.eps}

\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}

\end{center}\end{figure}

for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which

must be a load here), and ALU instructions {\tt I1} and so forth.
Since branches take place in the writeback stage, the Zip CPU will stall the
Since branches take place in the writeback stage, the Zip CPU will stall the
pipeline for one clock anytime there may be a possible jump.  This prevents
pipeline for one clock anytime there may be a possible jump--forcing the

memory operation to stay in the operand decode stage.  This prevents
an instruction from executing a memory access after the jump but before the
an instruction from executing a memory access after the jump but before the
jump is recognized.
jump is recognized.
 
 
This stall may be mitigated by shuffling the operations immediately following
This stall may be mitigated by shuffling the operations immediately following
a potential branch so that an ALU operation follows the branch instead of a
a potential branch so that an ALU operation follows the branch instead of a
Line 1048... Line 1227...
\item\ {\em (stall)}
\item\ {\em (stall)}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt BZ somewhere}
\item\ {\tt BZ somewhere}
\end{enumerate}
\end{enumerate}
 
 
The reason for this stall is simply performance.  Many of the flags are
The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic after the writeback instruction is
determined via combinatorial logic {\em during} the writeback cycle.
determined.  Trying to then place these into the input for one of the operands
Trying to then place these into the input for one of the operands for an

ALU instruction during the same cycle
created a time delay loop that would no longer execute in a single 100~MHz
created a time delay loop that would no longer execute in a single 100~MHz
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
either \ldots).
either \ldots).
 
 
This stall may be eliminated via proper scheduling, by placing an instruction
This stall may be eliminated via proper scheduling, by placing an instruction
that does not set flags in between the ALU operation and the instruction
that does not set flags in between the ALU operation and the instruction
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since

it doesn't read the condition codes.
 
 
\item When waiting for a memory read operation to complete
\item When waiting for a memory read operation to complete
\begin{enumerate}
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\tt LOD address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}
\end{enumerate}
 
 
Remember, the Zip CPU does not support out of order execution.  Therefore,
Remember, the Zip CPU does not support out of order execution.  Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared.  This is especially true of a load
stall until the memory unit is cleared.  This is illustrated in

Fig.~\ref{fig:memrd},

\begin{figure}\begin{center}

\includegraphics[width=5in]{../gfx/memrd.eps}

\caption{Pipeline handling of a load instruction}\label{fig:memrd}

\end{center}\end{figure}

since it is especially true of a load
instruction, which must still write its operand back to the register file.
instruction, which must still write its operand back to the register file.
Store instructions are different, since they can be busy with no impact on
Note that there is an extra stall at the end of the memory cycle, so that
later ALU write back operations.  Hence, only loads stall the pipeline.
the memory unit will be idle for one clock before an instruction will be

accepted into the ALU.

Store instructions are different, as shown in Fig.~\ref{fig:memwr},

\begin{figure}\begin{center}

\includegraphics[width=5in]{../gfx/memwr.eps}

\caption{Pipeline handling of a store instruction}\label{fig:memwr}

\end{center}\end{figure}

since they can be busy with the bus without impacting later write back

pipeline stages.  Hence, only loads stall the pipeline.
 
 
This also assumes that the memory being accessed is a single cycle memory.
This, of course, also assumes that the memory being accessed is a single cycle

memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
as long as forty clocks.   During this time the CPU and the external bus
as long as forty clocks.   During this time the CPU and the external bus
will be busy, and unable to do anything else.
will be busy, and unable to do anything else.  Likewise, if it takes a couple

of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}

and~\ref{fig:memwr}, there will be stalls.
 
 
\item Memory operation followed by a memory operation
\item Memory operation followed by a memory operation
\begin{enumerate}
\begin{enumerate}
\item\ {\tt STO address,RA}
\item\ {\tt STO address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt LOD address,RB}
\item\ {\tt LOD address,RB}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}
\end{enumerate}
 
 
In this case, the LOD instruction cannot start until the STO is finished,
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
stall while waiting for it to complete before the load instruction can
start.
 
 
The Zip CPU does have the capability of supporting pipelined memory access,
but only under the following conditions: all accesses within the pipeline
must be either all reads or all writes, all must use the same register for their
address, and there can be no stalls or other instructions between pipelined
memory access instructions.  Further, the offset to memory must be increasing
by one address each instruction.  These conditions work well for saving
registers to, or restoring them from, the stack.  Indeed, both
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrate pipelined memory
accesses.
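
As a rough illustration only, the pipelining conditions can be written as a
check over a run of adjacent memory instructions.  The C sketch below is not
taken from the Zip CPU sources: the {\tt mem\_op} structure and the
{\tt can\_pipeline} function are hypothetical names introduced here, and the
sketch assumes the instructions are already known to issue back to back with
no stalls or other instructions between them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memory instruction.  These names are
 * illustrative only and do not come from the Zip CPU sources. */
typedef struct {
    bool is_store;   /* true for STO, false for LOD        */
    int  base_reg;   /* register supplying the address     */
    int  offset;     /* immediate offset added to base_reg */
} mem_op;

/* Returns true if ops[0..n-1] meet the stated pipelining conditions:
 * all loads or all stores, one shared base register, and offsets that
 * rise by exactly one per instruction.  Assumption: the caller has
 * already established that the instructions are adjacent. */
bool can_pipeline(const mem_op *ops, size_t n)
{
    if (n == 0)
        return false;
    for (size_t i = 1; i < n; i++) {
        if (ops[i].is_store != ops[0].is_store)
            return false;               /* mixed reads and writes   */
        if (ops[i].base_reg != ops[0].base_reg)
            return false;               /* different address source */
        if (ops[i].offset != ops[i - 1].offset + 1)
            return false;               /* offset does not step by 1 */
    }
    return true;
}
```

Under these assumptions, three {\tt STO} instructions using offsets 1, 2,
and 3 from one base register pass the check, while a {\tt STO} followed by a
{\tt LOD} does not.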
 
 
\item When waiting for a conditional memory read operation to complete
\item When waiting for a conditional memory read operation to complete
\begin{enumerate}
\begin{enumerate}
\item\ {\tt LOD.Z address,RA}
\item\ {\tt LOD.Z address,RA}
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
Line 1132... Line 1338...
\caption{Zip System Peripherals}\label{fig:zipsystem}
\caption{Zip System Peripherals}\label{fig:zipsystem}
\end{center}\end{figure}
\end{center}\end{figure}
and described here.  They are designed to make
and described here.  They are designed to make
the Zip CPU more useful in an Embedded Operating System environment.
the Zip CPU more useful in an Embedded Operating System environment.
 
 
\section{Interrupt Controller}\label{sec:pic}

Perhaps the most important peripheral within the Zip System is the interrupt
controller.  While the Zip CPU itself can only handle one interrupt, and has
only the one interrupt state: disabled or enabled, the interrupt controller
can make things more interesting.
 
 
The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register.  Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.
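
A minimal C sketch of this register layout follows.  It assumes the fifteen
individual interrupts occupy bits~0--14 of the active half and bits~16--30 of
the enable half, which matches the {\tt 0x7fff} mask used in
Tbl.~\ref{tbl:traditional-isr}.  The macro and function names are
illustrative and are not part of any Zip System header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field positions, following the description in the text. */
#define PIC_MASTER_ENABLE 0x80000000u  /* bit 31: master interrupt enable */
#define PIC_ANY_ACTIVE    0x00008000u  /* bit 15: any interrupt active    */
#define PIC_LINE_MASK     0x00007fffu  /* fifteen individual interrupts   */

/* True if the master enable (bit 31) is set. */
bool pic_master_enabled(uint32_t v)
{
    return (v & PIC_MASTER_ENABLE) != 0;
}

/* Per-line enables (bits 16-30), shifted down to bits 0-14. */
uint32_t pic_enabled_lines(uint32_t v)
{
    return (v >> 16) & PIC_LINE_MASK;
}

/* True if the any-interrupt-active indicator (bit 15) is set. */
bool pic_any_active(uint32_t v)
{
    return (v & PIC_ANY_ACTIVE) != 0;
}

/* Per-line active flags (bits 0-14). */
uint32_t pic_active_lines(uint32_t v)
{
    return v & PIC_LINE_MASK;
}
```

Keeping the enable and active fields at the same bit positions after the
shift lets a handler combine them with a single AND.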
 
 
The interrupt controller has been designed so that bits can be controlled
The interrupt controller has been designed so that bits can be controlled
individually without having any knowledge of the rest of the controller
individually without having any knowledge of the rest of the controller
Line 1165... Line 1371...
interrupt and clears the active indicator.  This also has the side effect of
interrupt and clears the active indicator.  This also has the side effect of
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
to re-enable any other interrupts.
to re-enable any other interrupts.
 
 
The Zip System currently hosts two interrupt controllers, a primary and a
secondary.  The primary interrupt controller has one interrupt line (perhaps
more if you configure it for more) which may come from an external interrupt
controller, and one interrupt line from the secondary controller.  Other
primary interrupts include the system timers, the jiffies interrupt, and the
manual cache interrupt.  The secondary interrupt controller maintains an
interrupt state for all of the processor accounting counters.
 
 
\section{Counter}
\section{Counter}
 
 
The Zip Counter is a very simple counter: it just counts.  It cannot be
The Zip Counter is a very simple counter: it just counts.  It cannot be
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
Line 1193... Line 1399...
bit is set when writing to the timer, the timer becomes an interval timer and
reloads its last start time on any interrupt.  Hence, to mark seconds, one
might set the timer to 100~million (the number of clocks per second), and
set the high bit.  Ever after, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock).  This reload capability also limits the
maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),
rather than $2^{32}-1$.
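
The arithmetic behind these numbers can be sketched in C.  The function and
macro names below are hypothetical, and the sketch assumes only what the text
states: the high bit selects interval mode, and the remaining 31~bits hold the
count.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIMER_INTERVAL_BIT 0x80000000u  /* high bit: reload on interrupt */
#define TIMER_MAX_COUNT    0x7fffffffu  /* 2^31 - 1                      */

/* Builds the word to write for an interval timer of `clocks` cycles.
 * Returns false if the count is zero or does not fit in 31 bits. */
bool timer_interval_word(uint32_t clocks, uint32_t *word)
{
    if (clocks == 0 || clocks > TIMER_MAX_COUNT)
        return false;
    *word = TIMER_INTERVAL_BIT | clocks;
    return true;
}

/* Longest programmable interval, in whole milliseconds, at clock_hz. */
uint32_t timer_max_interval_ms(uint32_t clock_hz)
{
    if (clock_hz == 0)
        return 0;
    return (uint32_t)(((uint64_t)TIMER_MAX_COUNT * 1000u) / clock_hz);
}
```

With a 100~MHz clock, a one second interval gives the word
{\tt 0x85f5e100}, and the longest interval works out to 21474~ms.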
 
 
\section{Watchdog Timer}
\section{Watchdog Timer}
 
 
The watchdog timer is no different from any of the other timers, save for one
The watchdog timer is no different from any of the other timers, save for one
critical difference: the interrupt line from the watchdog
critical difference: the interrupt line from the watchdog
Line 1207... Line 1414...
write any other number to it---as with the other timers.
write any other number to it---as with the other timers.
 
 
While the watchdog timer supports interval mode, it doesn't make as much sense
While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.
as it did with the other timers.
 
 

\section{Bus Watchdog}

There is an additional watchdog timer on the Wishbone bus.  This timer,

however, is hardware configured and not software configured.  The timer is

reset at the beginning of any bus transaction, and only counts clocks during

such bus transactions.  If the bus transaction takes longer than the number

of counts the timer allots, it will raise a bus error flag to terminate the

transaction.  This is useful in the case of any peripherals that are

misbehaving.  If the bus watchdog terminates a bus transaction, the CPU may

then read from its port to find out which memory location created the problem.

 

Aside from its unusual configuration, the bus watchdog is just another

implementation of the fundamental timer described above.
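
The behavior described above can be modeled in a few lines of C.  This is a
behavioral sketch only, not the hardware implementation: the
{\tt bus\_watchdog} structure and its functions are hypothetical, the
hardware configured limit is modeled as an ordinary field, and the exact clock
on which the error is raised is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the bus watchdog. */
typedef struct {
    uint32_t limit;      /* clocks allotted per transaction (fixed in hardware) */
    uint32_t remaining;  /* clocks left in the current transaction              */
} bus_watchdog;

/* Call at the beginning of every bus transaction: the timer is reset. */
void bus_watchdog_begin(bus_watchdog *w)
{
    w->remaining = w->limit;
}

/* Call once per clock while a transaction is in progress.  Returns true
 * when the transaction has used more clocks than allotted, which is where
 * this model would raise the bus error flag. */
bool bus_watchdog_clock(bus_watchdog *w)
{
    if (w->remaining == 0)
        return true;
    w->remaining--;
    return false;
}
```

In this model a limit of three allows three clocks of bus activity, and the
fourth clock reports the error.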

 
\section{Jiffies}
\section{Jiffies}
 
 
This peripheral is motivated by the Linux use of ``jiffies'' whereby a process
can request to be put to sleep until a certain number of ``jiffies'' have
elapsed.  Using this interface, the CPU can read the number of ``jiffies''
Line 1305... Line 1525...
complete.
complete.
 
 
Eventually, I intend to place an operating system onto the ZipSystem; I'm
just not there yet.
 
 
The rest of this chapter examines some common programming models and how they
might be applied to the Zip System, and then finishes with a couple of examples.

 

\section{System High}

The easiest and simplest way to run the Zip CPU is in the system high mode.

In this mode, the CPU runs your program in supervisor mode from reboot to

power down, and is never interrupted.  You will need to poll the interrupt

controller to determine when any external condition has become active.  This

mode is useful, and can handle many microcontroller tasks.

 

Even better, in system high mode, all of the user registers are available

to the system high program as variables.  Accessing these registers can be

done in a single clock cycle, which would move them to the active register

set or move them back.  While this may seem like a load or store instruction,

none of these register accesses will suffer from memory delays.

 

The one thing that cannot be done in supervisor mode is a wait for interrupt

instruction.  This, however, is easily rectified by jumping to a user task

within the supervisor's memory space, such as the one shown in Tbl.~\ref{tbl:shi-idle}.

\begin{table}\begin{center}

\begin{tabbing}

{\tt supervisor\_idle:} \\

\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\

\>      {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\

\>      {\em ; outside of the CPU's address space when it switches to user} \\

\>      {\em ; mode.} \\

\>      {\tt MOV supervisor\_idle\_continue,uPC} \\

\>      {\em ; Put the processor into user mode and to sleep in the same} \\

\>      {\em ; instruction. } \\

\>      {\tt OR \$SLEEP|\$GIE,CC} \\

{\tt supervisor\_idle\_continue:} \\

\>      {\em ; Now, if we haven't done this inline, we need to return} \\

\>      {\em ; to whatever function called us.} \\

\>      {\tt RETN} \\

\end{tabbing}

\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}

\end{center}\end{table}

 

\section{Traditional Interrupt Handling}

Although the Zip CPU does not have a traditional interrupt architecture,

it is possible to create the more traditional interrupt approach via software.

In this mode, the programmable interrupt controller is used together with the

supervisor state to create the illusion of more traditional interrupt handling.

 

To set this up, upon reboot the supervisor task:

\begin{enumerate}

\item Creates a (single) user context, a user stack, and sets the user

        program counter to the entry of the user task

\item Creates a task table of ISR entries

\item Enables the master interrupt enable via the interrupt controller, albeit

        without enabling any of the fifteen potential underlying interrupts.

\item Switches to user mode, as the first part of the while loop in

        Tbl.~\ref{tbl:traditional-isr}.

\end{enumerate}

\begin{table}\begin{center}

\begin{tabbing}

{\tt while(true) \{} \\

\hbox to 0.25in{}\= {\tt rtu();}\\

        \> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\

        \>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\

        \> {\tt \} else \{} \\

        \> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\

\\

        \> \> {\em // Save the user context before running any ISRs.  This could easily be}\\

        \> \> {\em // implemented as an inline assembly routine or macro}\\

        \> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\

        \> \> {\em // At this point, we know an interrupt has taken place:  Ask the programmable}\\

        \> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\

        \> \>   {\tt int        picv = *pic;}\\

        \> \>   {\em // Turn off all active interrupts}\\

        \> \>   {\em // Globally disable interrupt generation in the process}\\

        \> \>   {\tt int        active = (picv >> 16) \& picv \& 0x07fff;}\\

        \> \>   {\tt *pic = (active<<16);}\\

        \> \>   {\em // We build a mask of interrupts to re-enable in picv.}\\

        \> \>   {\tt picv = 0;}\\

        \> \>   {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\

        \> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\

        \> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\

        \> \>\>\>       {\em // Acknowledge this particular interrupt.  While we could acknowledge all}\\

        \> \>\>\>       {\em // interrupts at once, by acknowledging only those with ISR's we allow}\\

        \> \>\>\>       {\em // the user process to use peripherals manually, and to manually check}\\

        \> \>\>\>       {\em // whether or not those other interrupts had occurred.}\\

        \> \>\>\>       {\tt *pic = msk; }\\

        \> \>\>\>       {\tt rtu(); }\\

        \> \>\>\>       {\em // The ISR will only exit on a trap in the Zip architecture.  There is}\\

        \> \>\>\>       {\em // no {\tt RETI} instruction.  Since the PIC holds all interrupts disabled,}\\

        \> \>\>\>       {\em // there is no need to check for further interrupts.}\\

        \> \>\>\>       {\em // }\\

        \> \>\>\>       {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\

        \>\>\>\>        {\em // re-enable its own interrupt without re-enabling all interrupts.  Hence, we}\\

        \>\>\>\>        {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\

        \> \>\>\>       {\em // re-enabled. }\\

        \> \>\>\>       {\tt mov(uR0,tmp); }\\

        \> \>\>\>       {\tt picv |= (tmp \& 0x7fff) << 16; }\\

        \> \>\>         {\tt \} }\\

        \> \>   {\tt \} }\\

        \> \>   {\tt RESTORE\_PARTIAL\_CONTEXT; }\\

        \> \>   {\em // Re-activate all (requested) interrupts }\\

        \> \>   {\tt *pic = picv | 0x80000000; }\\

        \>{\tt \} }\\

{\tt \}}\\

\end{tabbing}

\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}

\end{center}\end{table}

 

We can work through the interrupt handling process by examining

Tbl.~\ref{tbl:traditional-isr}.  First, remember, the CPU is always running

either the user or the supervisor context.  Once the supervisor switches to

user mode, control does not return until either an interrupt or a trap

has taken place.  (Okay, there's also the possibility of a bus error, or an

illegal instruction such as an unimplemented floating point instruction---but

for now we'll just focus on the trap instruction.)  Therefore, if the trap bit

isn't set, then we know an interrupt has taken place.

 

To process an interrupt, we steal the user's stack: the R0, CC, and PC registers

are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.

\begin{table}\begin{center}

\begin{tabbing}

SAVE\_PARTIAL\_CONTEXT: \\

\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\

\>        {\tt MOV -3(uSP),R3} \\

\>        {\tt MOV uR0,R0} \\

\>        {\tt MOV uCC,R1} \\

\>        {\tt MOV uPC,R2} \\

\>        {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\

\>        {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\

\>        {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\

\>        {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\

\end{tabbing}

\caption{Example Saving Minimal User Context}\label{tbl:save-partial}

\end{center}\end{table}

This is much cheaper than the full context swap of a preemptive multitasking

kernel, but it also depends upon the ISR saving any state it uses.  Further,

if multiple ISRs get called at once, this loses its optimality property

very quickly.

 

As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which

interrupts are enabled, and the bottom stores which have tripped.  (Interrupts

may trip without being enabled; they just will not generate an interrupt to the

CPU.)  Our first step is to query the register to find out our interrupt

state, and then to disable any interrupts that have tripped.  To do

that, we write a one to the enable half of the register while also clearing

the top bit (master interrupt enable).  This has the consequence of disabling

any and all further interrupts, not just the ones that have tripped.  Hence,

upon completion, we re--enable the master interrupt bit again.   Finally,

we keep track of which interrupts have tripped.
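
The mask arithmetic in this step can be isolated into three small C helpers.
They mirror the expressions used in Tbl.~\ref{tbl:traditional-isr}, but the
function names are hypothetical and the helpers work on plain values rather
than on the PIC register itself.

```c
#include <stdint.h>

/* Interrupts that are both enabled (bits 16-30) and tripped (bits 0-14). */
uint32_t pic_tripped_and_enabled(uint32_t picv)
{
    return (picv >> 16) & picv & 0x7fffu;
}

/* Word that disables the given lines.  Bit 31 is left clear, so writing
 * it also turns off the master interrupt enable, as the text describes. */
uint32_t pic_disable_word(uint32_t active)
{
    return active << 16;
}

/* Word that re-enables the given lines together with the master enable. */
uint32_t pic_reenable_word(uint32_t lines)
{
    return 0x80000000u | ((lines & 0x7fffu) << 16);
}
```

For a register value of {\tt 0x80038006}, lines 0 and 1 are enabled and lines
1 and 2 have tripped, so only line 1 is reported as active.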

 

Using the bit mask of interrupts that have tripped, we walk through all fifteen

possible interrupts.  If there is an ISR installed, we acknowledge and reset

the interrupt within the PIC, and then call the ISR.  The ISR, however, cannot

re--enable its interrupt without re-enabling the master interrupt bit.  Thus,

to keep things simple, when the ISR is finished it places its interrupt

mask back into R0, or clears R0.  This tells the supervisor mode process which

interrupts to re--enable.  Any other registers that the ISR uses must be

saved and restored.  (This is only truly optimal if only a single ISR is

called.)  As a final instruction, the ISR clears the GIE bit, executing a user

trap.  (Remember, the Zip CPU has no {\tt RETI} instruction to restore the

stack and return to userland.  It needs to go through the supervisor mode to

get there.)
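
The walk through the fifteen interrupts can be modeled on a host machine as
follows.  This is a sketch under stated assumptions, not the supervisor code
itself: each ISR is an ordinary C function whose return value stands in for
the value it would leave in R0, acknowledgments are recorded in a variable
instead of being written to the PIC, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_INTS 15

/* Model of an ISR: the return value stands in for the ISR's final R0,
 * that is, the mask of interrupts it wants re-enabled. */
typedef uint32_t (*isr_fn)(void);

/* Hypothetical example ISRs, used only to exercise the dispatcher. */
uint32_t example_timer_isr(void)   { return 1u << 0; } /* re-arm line 0     */
uint32_t example_oneshot_isr(void) { return 0; }       /* leave line masked */

/* Calls the ISR for every active interrupt that has one installed.
 * Returns the mask of lines to re-enable; *acked collects the lines that
 * would be acknowledged (on hardware: *pic = msk). */
uint32_t dispatch_isrs(uint32_t active, isr_fn const isr_table[NUM_INTS],
                       uint32_t *acked)
{
    uint32_t reenable = 0;

    *acked = 0;
    for (int i = 0; i < NUM_INTS; i++) {
        uint32_t msk = 1u << i;
        if ((active & msk) && isr_table[i] != NULL) {
            *acked |= msk;
            reenable |= isr_table[i]() & 0x7fffu;
        }
    }
    return reenable;
}
```

Interrupts without an installed ISR are neither acknowledged nor re-enabled
here, which leaves them for the user process to check manually.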

 

Then, once all interrupts are handled, the user context is restored in  a

fashion similar to Tbl.~\ref{tbl:restore-partial}.

\begin{table}\begin{center}

\begin{tabbing}

RESTORE\_PARTIAL\_CONTEXT: \\

\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\

\>        {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\

\>        {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\

\>        {\tt LOD 2(R3),R1} {\em ; All instructions read from the stack }\\

\>        {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\

\>        {\tt MOV R0,uR0} \\

\>        {\tt MOV R1,uCC} \\

\>        {\tt MOV R2,uPC} \\

\>        {\tt MOV 3(R3),uSP} \\

\end{tabbing}

\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}

\end{center}\end{table}

Again, this is short and sweet simply because any other registers that needed

saving were saved within the ISR.

 

There you have it: the Zip CPU, with its non-traditional interrupt architecture,

can still process interrupts in a very traditional fashion.
 
 
\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.  On the Zip CPU, this task is quite simple,
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
Line 1379... Line 1779...
data types.  The Zip CPU does not have explicit support for smaller or larger
data types.  The Zip CPU does not have explicit support for smaller or larger
data types, and so this memory copy cannot be applied at a byte level.
data types, and so this memory copy cannot be applied at a byte level.
Third, we've optimized the conditional jump to a return instruction into a
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.
conditional return instruction.
 
 
\section{Context Switch}
\section{Example: Context Switch}
 
 
Fundamental to any multiprocessing system is the ability to switch from one
Fundamental to any multiprocessing system is the ability to switch from one
task to the next.  In the ZipSystem, this is accomplished in one of a couple
task to the next.  In the ZipSystem, this is accomplished in one of a couple
ways.  The first step is that an interrupt happens.  Anytime an interrupt
ways.  The first step is that an interrupt happens.  Anytime an interrupt
happens, the CPU needs to execute the following tasks in supervisor mode:
happens, the CPU needs to execute the following tasks in supervisor mode:
Line 2049... Line 2449...
\item The Zip CPU is light weight and fully featured as it exists today. For
\item The Zip CPU is light weight and fully featured as it exists today. For
        anyone who wishes to build a general purpose CPU and then to
        anyone who wishes to build a general purpose CPU and then to
        experiment with building and adding particular features, the Zip CPU
        experiment with building and adding particular features, the Zip CPU
        makes a good starting point--it is fairly simple. Modifications should
        makes a good starting point--it is fairly simple. Modifications should
        be simple enough.
        be simple enough.
\item As an estimate of the ``weight'' of this implementation, the CPU has

        cost me less than 150 hours to implement from its inception.

\item The Zip CPU was designed to be an implementable soft core that could be
\item The Zip CPU was designed to be an implementable soft core that could be
        placed within an FPGA, controlling actions internal to the FPGA. It
        placed within an FPGA, controlling actions internal to the FPGA. It
        fits this role rather nicely. It does not fit the role of a system on
        fits this role rather nicely. It does not fit the role of a system on
        a chip very well, but then it was never intended to be a system on a
        a chip very well, but then it was never intended to be a system on a
        chip but rather a system within a chip.
        chip but rather a system within a chip.
Line 2067... Line 2465...
        to do with two exceptions: bytewise character access and accelerated
        to do with two exceptions: bytewise character access and accelerated
        floating-point support.
        floating-point support.
\item This simplified instruction set is easy to decode.
\item This simplified instruction set is easy to decode.
\item The simplified bus transactions (32-bit words only) were also very easy
\item The simplified bus transactions (32-bit words only) were also very easy
        to implement.
        to implement.

\item The pipelined load/store approach is novel, and can be used to greatly

        increase the speed of the processor.
\item The novel approach of having a single interrupt vector, which just
\item The novel approach of having a single interrupt vector, which just
        brings the CPU back to the instruction it left off at within the last
        brings the CPU back to the instruction it left off at within the last
        interrupt context doesn't appear to have been that much of a problem.
        interrupt context doesn't appear to have been that much of a problem.
        If most modern systems handle interrupt vectoring in software anyway,
        If most modern systems handle interrupt vectoring in software anyway,
        why maintain hardware support for it?
        why maintain hardware support for it?
Line 2096... Line 2496...
        Still, \ldots it's not bad, it's just not astonishingly good.
        Still, \ldots it's not bad, it's just not astonishingly good.
 
 
\item The fact that the instruction width equals the bus width means that the
\item The fact that the instruction width equals the bus width means that the
        instruction fetch cycle will always be interfering with any load or
        instruction fetch cycle will always be interfering with any load or
        store memory operation, with the only exception being if the
        store memory operation, with the only exception being if the
        instruction is already in the cache.  {\em This has become the
        instruction is already in the cache.
        fundamental limit on the speed and performance of the CPU!}

        Those familiar with the Von--Neumann approach of sharing a bus

        between data and instructions will not be surprised by this assessment.

 
 
        This could be fixed in one of three ways: the instruction set
        This could be fixed in one of three ways: the instruction set
        architecture could be modified to handle Very Long Instruction Words
        architecture could be modified to handle Very Long Instruction Words
        (VLIW) so that each 32--bit word would encode two or more instructions,
        (VLIW) so that each 32--bit word would encode two or more instructions,
        the instruction fetch bus width could be increased from 32--bits to
        the instruction fetch bus width could be increased from 32--bits to
Line 2144... Line 2541...
 
 
        This may also be written off as a ``feature'' of the Zip CPU, since
        the addition of a data cache can greatly increase the LUT count of
        a soft core.
 
 

        The Zip CPU compensates for this via its pipelined load and store

        instructions.

 
\item Many other instruction sets offer three operand instructions, whereas
\item Many other instruction sets offer three operand instructions, whereas
        the Zip CPU only offers two operand instructions. This means that it
        the Zip CPU only offers two operand instructions. This means that it
        takes the Zip CPU more instructions to do many of the same operations.
        takes the Zip CPU more instructions to do many of the same operations.
        The good part of this is that it gives the Zip CPU a greater amount of
        The good part of this is that it gives the Zip CPU a greater amount of
        flexibility in its immediate operand mode, although that increased
        flexibility in its immediate operand mode, although that increased
Line 2369... Line 2769...
 
 
% Appendices
% Appendices
% Index
% Index
\end{document}
\end{document}
 
 
 
%

%

% Symbol table relocation types:

%

% Only 3-types of instructions truly need relocations: those that modify the

% PC register, and those that access memory.

%

% -     LDI     Addr,Rx         // Load's an absolute address into Rx, 24 bits

%

% -     LDILO   Addr,Rx         // Load's an absolute address into Rx, 32 bits

%       LDIHI   Addr,Rx         //   requires two instructions

%

% -     JMP     Rx              // Jump to any address in Rx

%                       // Can be prefixed with two instructions to load Rx

%                       // from any 32-bit immediate

% -     JMP     #Addr           // Jump to any 24'bit (signed) address, 23'b uns

%

% -     ADD     x,PC            // Any PC relative jump (20 bits)

%

% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)

%

% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,

%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction

%

% -     LOD.C   Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond

%

% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid

%

% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table

%       BRA     +1              // memory address.  The BRA +1 can be skipped,

%       .WORD   Addr            // but only if the address is placed at the end

%       LOD     -2(PC),PC       // of an executable section

%
 
 