OpenCores
URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

Subversion Repositories zipcpu

Compare Revisions

  • This comparison shows the changes necessary to convert path
    /zipcpu/trunk/doc/src
    from Rev 39 to Rev 68

Rev 39 → Rev 68

/spec.tex
44,11 → 44,13
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\usepackage{import}
% \graphicspath{{../gfx}}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.5}
\revision{Rev.~0.6}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
76,6 → 78,7
copy.
\end{license}
\begin{revisionhistory}
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
167,6 → 170,71
contact me.}
\end{itemize}
 
The Zip CPU also has one unique feature: the ability to do pipelined loads
and stores. This allows the CPU to access on-chip memory at one access per
clock, minus a stall for the initial access.
 
\section{Characteristics of a SwiC}
 
Here, we shall define a soft core internal to an FPGA as a ``System within a
Chip,'' or a SwiC. SwiCs have some unique properties internal to them
that have influenced the design of the Zip CPU. Among these are the bus,
memory, and available peripherals.
 
Most other approaches to soft core CPU's employ a Harvard architecture.
This allows these other CPU's to have two separate bus structures: one for the
program fetch, and the other for the memory. The Zip CPU is fairly unique in
its approach because it uses a von Neumann architecture. This was done for
simplicity. By using a von Neumann architecture, only one bus needs to be
implemented within any FPGA. This helps to minimize real-estate, while
maintaining a high clock speed. The disadvantage is that it can severely
degrade the overall instructions per clock count.
 
Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow. Memory on chip may be accessed at a single
cycle per access, but small FPGA's have a limited amount of memory on chip.
Going off chip, however, is expensive. Two examples will prove this point. On
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles
per read, and 2~cycles for any subsequent pipelined access read or write.
Either way you look at it, this memory access will be slow and this doesn't
account for any logic delays should the bus implementation logic get
complicated.
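
As a rough illustration of these numbers, the following C sketch estimates
the clocks needed to read or write a burst of words. The function name
{\tt burst\_clocks} and its parameters are hypothetical, the cycle counts are
the XuLA2 figures quoted above, and any bus logic delays are ignored.

```c
/* Hypothetical helper (not part of the Zip CPU sources): clocks needed to
 * move n words when the first access costs `first` clocks and every
 * following pipelined access costs `next` clocks. */
unsigned burst_clocks(unsigned first, unsigned next, unsigned n)
{
    if (n == 0)
        return 0;                   /* nothing to transfer */
    return first + next * (n - 1);  /* one slow access, then pipelined ones */
}
```

For instance, {\tt burst\_clocks(128, 64, 8)} gives 576~clocks for eight words
from flash, while {\tt burst\_clocks(10, 2, 8)} gives 24~clocks for the same
read from SDRAM.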
 
As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that all memory accesses may be pipelined, and
that pipelined memory access is faster than non--pipelined access. Therefore,
a SwiC soft core should support pipelined operations, but it should also
allow a higher priority subsystem to get access to the bus (no starvation).
 
As a further characteristic of SwiC memory options, on-chip caches are
expensive. If you want to have a minimum of logic, cache logic may not be
the highest on the priority list.
 
In sum, memory is slow. While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline. Any SwiC must be able to deal
with both cases: fast and slow memories.
 
A final characteristic of SwiC's within FPGA's is the peripherals.
Specifically, FPGA's are highly reconfigurable. Soft peripherals can easily
be created on chip to support the SwiC if necessary. As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address. This is cheap within an
FPGA, but expensive in instructions.
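
To see why this is expensive in instructions, consider a minimal software
equivalent in C. This is only a sketch: the name {\tt reverse30} is
hypothetical, and the loop is one of several possible ways to reverse bits.

```c
#include <stdint.h>

/* Hypothetical software stand-in for the bit-reversal peripheral: reverse
 * the low 30 bits of v.  Each of the 30 iterations costs several
 * instructions, whereas the FPGA version is nothing more than wiring. */
uint32_t reverse30(uint32_t v)
{
    uint32_t r = 0;

    for (int i = 0; i < 30; i++) {
        r = (r << 1) | (v & 1);  /* shift the lowest bit of v into r */
        v >>= 1;
    }
    return r;
}
```

For example, {\tt reverse30(1)} returns {\tt 0x20000000}, the value with only
bit~29 set.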
 
Indeed, anything that must be done fast within an FPGA is likely to already
be done--elsewhere in the fabric. This leaves the CPU with the role of handling
sequential tasks that need a lot of state.
 
This means that the SwiC needs to live within a unique environment,
separate and different from the traditional SoC. That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose.
 
\section{Lessons Learned}
 
Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped. Worse, I've had to adjust to create
capabilities that I was never expecting to need. These include:
239,16 → 307,60
all instructions so that stalls would never be necessary. After trying
to build such an architecture, I gave up, having learned some things:
 
For example, in order to facilitate interrupt handling and debug
stepping, the CPU needs to know what instructions have finished, and
which have not. In other words, it needs to know where it can restart
the pipeline from. Once restarted, it must act as though it had
never stopped. This killed my idea of delayed branching, since what
would be the appropriate program counter to restart at? The one the
CPU was going to branch to, or the ones in the delay slots? This
also makes the idea of compressed instruction codes difficult, since,
again, where do you restart on interrupt?
First, an ideal pipeline might look something like
Fig.~\ref{fig:ideal-pipeline}.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{../gfx/fullpline.eps}
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
\end{center}\end{figure}
Notice that, in this figure, all the pipeline stages are complete and
full. Every instruction takes one clock and there are no delays.
However, as the discussion above pointed out, the memory associated
with a SwiC may not allow single clock access. It may be instead
that you can only read every two clocks. In that case, what shall
the pipeline look like? Should it look like
Fig.~\ref{fig:waiting-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttra.eps}
\caption{Instructions wait for each other}\label{fig:waiting-pipeline}
\end{center}\end{figure}
where instructions are held back until the pipeline is full, or should
it look like Fig.~\ref{fig:independent-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttrb.eps}
\caption{Instructions proceed independently}\label{fig:independent-pipeline}
\end{center}\end{figure}
where each instruction is allowed to move through the pipeline
independently? For better or worse, the Zip CPU allows instructions
to move through the pipeline independently.
 
One approach to avoiding stalls is to use a branch delay slot,
such as is shown in Fig.~\ref{fig:brdelay}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdly.eps}
\caption{A typical branch delay slot approach}\label{fig:brdelay}
\end{center}\end{figure}
In this figure, instructions
{\tt BR} (a branch), {\tt BD} (a branch delay instruction),
are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.
Since it takes a processor a clock cycle to execute a branch, the
delay slot allows the processor to do something useful in that
cycle. The problem the Zip CPU has with this approach is, what
happens when the pipeline looks like Fig.~\ref{fig:brbroken}?
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdbroken.eps}
\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}
\end{center}\end{figure}
In this case, the branch delay slot never gets filled in the first
place, and so the pipeline squashes it before it gets executed.
If not that, then what happens when handling interrupts or
debug stepping: when has the CPU finished an instruction?
When the {\tt BR} instruction has finished, or must {\tt BD}
follow every {\tt BR}? And, again, what if the pipeline isn't
full?
These thoughts killed any hopes of doing delayed branching.
 
So I switched to a model of discrete execution: Once an instruction
enters into either the ALU or memory unit, the instruction is
guaranteed to complete. If the logic recognizes a branch or a
258,7 → 370,18
until the conditional branch completes. Then, if it generates a new
PC address, the stages preceding are all wiped clean.
 
The discrete execution model allows such things as sleeping: if the
This model, however, generated too many pipeline stalls, so the
discrete execution model was modified to allow instructions to go
through the ALU unit and be canceled before writeback. This removed
the stall associated with ALU instructions before untaken branches.
 
The discrete execution model allows such things as sleeping, as
outlined in Fig.~\ref{fig:sleeping}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/sleep.eps}
\caption{How the CPU halts when sleeping}\label{fig:sleeping}
\end{center}\end{figure}
If the
CPU is put to ``sleep,'' the ALU and memory stages stall and back up
everything before them. Likewise, anything that has entered the ALU
or memory stage when the CPU is placed to sleep continues to completion.
266,10 → 389,14
a valid signal, a stall signal, and a clock enable signal. In
general, a stage stalls if its contents are valid and the next step
is stalled. This allows the pipeline to fill any time a later stage
stalls.
stalls, as illustrated in Fig.~\ref{fig:stacking}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stacking.eps}
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
\end{center}\end{figure}
 
This approach is also different from other pipeline approaches. Instead
of keeping the entire pipeline filled, each stage is treated
This approach is also different from other pipeline approaches.
Instead of keeping the entire pipeline filled, each stage is treated
independently. Therefore, individual stages may move forward as long
as the subsequent stage is available, regardless of whether the stage
behind it is filled.
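
The stall rule above can be sketched as a small behavioral model in C. This
is not the CPU's actual logic, only an illustration under stated assumptions:
the five--stage depth, the name {\tt pipeline\_step}, and its arguments are
all hypothetical.

```c
#include <stdbool.h>

#define NSTAGES 5  /* assumed pipeline depth for this sketch */

/* Advance a simplified pipeline by one clock.  valid[i] says whether
 * stage i holds an instruction, last_stalled says whether the final stage
 * cannot retire its instruction this clock, and fetch says whether a new
 * instruction is available for the first stage. */
void pipeline_step(bool valid[NSTAGES], bool last_stalled, bool fetch)
{
    bool stalled[NSTAGES];

    /* A stage stalls only if it is valid and the stage after it stalls. */
    stalled[NSTAGES - 1] = valid[NSTAGES - 1] && last_stalled;
    for (int i = NSTAGES - 2; i >= 0; i--)
        stalled[i] = valid[i] && stalled[i + 1];

    /* Working from the back, retire or advance whatever is not stalled. */
    if (!stalled[NSTAGES - 1])
        valid[NSTAGES - 1] = false;
    for (int i = NSTAGES - 2; i >= 0; i--) {
        if (valid[i] && !stalled[i]) {
            valid[i + 1] = true;
            valid[i] = false;
        }
    }

    /* The first stage accepts a new instruction whenever it is free. */
    if (!stalled[0] && fetch)
        valid[0] = true;
}
```

Starting from a pipeline with gaps and a stalled final stage, repeated calls
fill the empty stages one clock at a time until every stage is valid, after
which nothing moves until the stall clears.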
336,7 → 463,7
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline
8 & R & (Reserved for) Illegal Instruction Flag\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
401,9 → 528,9
This functionality was added to enable an external debugger to
set and manage breakpoints.
 
The ninth bit is reserved for an illegal instruction bit. When the CPU
The ninth bit is an illegal instruction bit. When the CPU
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (once implemented) switch
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit. The bit will automatically be
cleared upon any return to user mode.
 
418,7 → 545,7
\begin{tabular}{l|l}
Bit & Meaning \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & (Reserved for) Floating point enable \\\hline
8 & Illegal instruction error flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
454,10 → 581,33
There is no condition code for less than or equal, not C or not V. Sorry,
I ran out of space in 3--bits. Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
As an alternative, the condition may often be reversed, recovering those
extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;}
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.
 
Conditionally executed ALU instructions will not further adjust the
condition codes.
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
will adjust conditions whenever their conditionals are true. In this way,
multiple conditions may be evaluated without branches. For example, to do
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
code such as Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{tabular}{l}
{\tt CMP 1,R0} \\
{;\em Condition codes are now set based upon R0-1} \\
{\tt CMP.Z 2,R1} \\
{;\em If R0 $\neq$ 1, conditions are unchanged.} \\
{;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\
{;\em Now do something based upon the conjunction of both conditions.} \\
{;\em While we use the example of a STO, it could be any instruction.} \\
{\tt STO.Z R0,(R2)} \\
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}
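
The effect of the double conditional can be modeled in C as follows. This is
a sketch only: it tracks the Z flag alone, and the names {\tt cmp},
{\tt cmp\_z} and {\tt both\_match} are hypothetical helpers invented for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the Z flag is modeled in this sketch. */
struct flags { bool z; };

/* Models CMP imm,reg: Z is set when reg - imm is zero. */
void cmp(struct flags *f, int32_t imm, int32_t reg)
{
    f->z = (reg == imm);
}

/* Models CMP.Z imm,reg: compare only if Z is already set; otherwise the
 * condition codes are left unchanged. */
void cmp_z(struct flags *f, int32_t imm, int32_t reg)
{
    if (f->z)
        cmp(f, imm, reg);
}

/* True exactly when the final STO.Z of the example would execute. */
bool both_match(int32_t r0, int32_t r1)
{
    struct flags f;

    cmp(&f, 1, r0);    /* CMP 1,R0   */
    cmp_z(&f, 2, r1);  /* CMP.Z 2,R1 */
    return f.z;
}
```

Here {\tt both\_match(1, 2)} is true, while any other pairing, such as
{\tt both\_match(0, 2)} or {\tt both\_match(1, 3)}, is false.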
 
\section{Traditional Interrupt Handling}
 
\section{Operand B}
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
1001,15 → 1151,36
\item While waiting for the pipeline to load following any taken branch, jump,
return from interrupt or switch to interrupt context (5 stall cycles)
 
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
be reloaded. Given that there are five stages to the pipeline, that accounts
for four of the five stalls. The stall cycle is lost in the pipelined prefetch
stage which needs at least one clock with a valid PC before it can produce
a new output.
Fig.~\ref{fig:bcstalls}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/bc.eps}
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}
\end{center}\end{figure}
illustrates the situation for a conditional branch. In this case, the branch
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so
forth. However, since the branch is taken, the next instruction must be
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded.
Given that there are five stages to the pipeline, that accounts
for four of the five stalls. The last stall cycle is lost in the pipelined
prefetch stage which needs at least one clock with a valid PC before it can
produce a new output. {\Large\bf Note: When I did this myself, I counted
six stall cycles, for a total of seven cycles for this instruction. Is five
really the right answer?}
 
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
{\tt LDI \$X,PC} instructions specially, however. These instructions, when
not conditioned on the flags, can execute with only 3~stall cycles.
not conditioned on the flags, can execute with only 2~stall cycles, such as
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is
slated to be improved upon in subsequent releases. With a better prefetch,
it should be possible to drop this down to one or zero stall cycles.}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps}
\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch}
\end{center}\end{figure}
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
following the branch in memory, while {\tt IA} is the first instruction at the
branch address. ({\tt CLR} denotes a clear--pipeline operation, and does
not represent any instruction.)
 
\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
1026,15 → 1197,23
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.
 
\item When any write to either the CC or PC Register is followed by a memory
operation
\item When any (conditional) write to either the CC or PC Register is followed
by a memory operation
\begin{enumerate}
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
\item\ {\em (stall, even if jump not taken)}
\item\ {\tt LOD \$X(RA),RB}
\end{enumerate}
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},
\begin{figure}\begin{center}
\includegraphics[width=2in]{../gfx/bcmem.eps}
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}
\end{center}\end{figure}
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which
must be a load here), and ALU instructions {\tt I1} and so forth.
Since branches take place in the writeback stage, the Zip CPU will stall the
pipeline for one clock anytime there may be a possible jump. This prevents
pipeline for one clock anytime there may be a possible jump--forcing the
memory operation to stay in the operand decode stage. This prevents
an instruction from executing a memory access after the jump but before the
jump is recognized.
 
1050,9 → 1229,10
\item\ {\tt BZ somewhere}
\end{enumerate}
 
The reason for this stall is simply performance. Many of the flags are
determined via combinatorial logic after the writeback instruction is
determined. Trying to then place these into the input for one of the operands
The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic {\em during} the writeback cycle.
Trying to then place these into the input for one of the operands for an
ALU instruction during the same cycle
created a time delay loop that would no longer execute in a single 100~MHz
clock cycle. (The time delay of the multiply within the ALU wasn't helping
either \ldots).
1062,7 → 1242,8
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
it doesn't read the condition codes.
 
\item When waiting for a memory read operation to complete
\begin{enumerate}
1073,15 → 1254,32
 
Remember, the Zip CPU does not support out of order execution. Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared. This is especially true of a load
stall until the memory unit is cleared. This is illustrated in
Fig.~\ref{fig:memrd},
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memrd.eps}
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
\end{center}\end{figure}
and is especially true of a load
instruction, which must still write its operand back to the register file.
Store instructions are different, since they can be busy with no impact on
later ALU write back operations. Hence, only loads stall the pipeline.
Note that there is an extra stall at the end of the memory cycle, so that
the memory unit will be idle for one clock before an instruction will be
accepted into the ALU.
Store instructions are different, as shown in Fig.~\ref{fig:memwr},
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memwr.eps}
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
\end{center}\end{figure}
since they can be busy with the bus without impacting later write back
pipeline stages. Hence, only loads stall the pipeline.
 
This also assumes that the memory being accessed is a single cycle memory.
This, of course, also assumes that the memory being accessed is a single cycle
memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
as long as forty clocks. During this time the CPU and the external bus
will be busy, and unable to do anything else.
will be busy, and unable to do anything else. Likewise, if it takes a couple
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.
 
\item Memory operation followed by a memory operation
\begin{enumerate}
1091,10 → 1289,16
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}
 
In this case, the LOD instruction cannot start until the STO is finished.
In this case, the LOD instruction cannot start until the STO is finished,
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
stall waiting for it to complete.
stall while waiting for it to complete before the load instruction can
start.
 
The Zip CPU does have the capability of supporting pipelined memory access,
but only under the following conditions: all accesses within the pipeline
1102,7 → 1306,9
address, and there can be no stalls or other instructions between pipelined
memory access instructions. Further, the offset to memory must be increasing
by one address each instruction. These conditions work well for saving or
storing registers to the stack.
storing registers to the stack. Indeed, if you noticed, both
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory
accesses.
 
\item When waiting for a conditional memory read operation to complete
\begin{enumerate}
1134,7 → 1340,7
and described here. They are designed to make
the Zip CPU more useful in an Embedded Operating System environment.
 
\section{Interrupt Controller}
\section{Interrupt Controller}\label{sec:pic}
 
Perhaps the most important peripheral within the Zip System is the interrupt
controller. While the Zip CPU itself can only handle one interrupt, and has
1144,7 → 1350,7
The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register. Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30
control whether individual interrupts are enabled (1'b0) or disabled (1'b0).
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.
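
A minimal C sketch of decoding this register follows. The macro and function
names are hypothetical. The fifteen--bit mask {\tt 0x7fff} is the one used by
the interrupt handling example later in this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the fields described above. */
#define PIC_MASTER_ENABLE 0x80000000u  /* bit 31: overall interrupt enable */
#define PIC_ANY_ACTIVE    0x00008000u  /* bit 15: some interrupt is active */
#define PIC_INT_MASK      0x00007fffu  /* fifteen individual interrupts    */

bool pic_master_enabled(uint32_t picv)
{
    return (picv & PIC_MASTER_ENABLE) != 0;
}

/* Individual enable bits (bits 16--30), shifted down to bits 0--14. */
uint32_t pic_enabled(uint32_t picv)
{
    return (picv >> 16) & PIC_INT_MASK;
}

/* Individual active bits. */
uint32_t pic_active(uint32_t picv)
{
    return picv & PIC_INT_MASK;
}

/* Interrupts that are both enabled and active. */
uint32_t pic_pending(uint32_t picv)
{
    return pic_enabled(picv) & pic_active(picv);
}
```

As an example, a register value of {\tt 0x80058006} has the master enable
set, interrupts 0 and 2 enabled, interrupts 1 and 2 active, and therefore
only interrupt 2 pending.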
 
1167,12 → 1373,12
to re-enable any other interrupts.
 
The Zip System currently hosts two interrupt controllers, a primary and a
secondary. The primary interrupt controller has one interrupt line which may
come from an external interrupt controller, and one interrupt line from the
secondary controller. Other primary interrupts include the system timers,
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt
controller maintains an interrupt state for all of the processor accounting
counters.
secondary. The primary interrupt controller has one interrupt line (perhaps
more if you configure it for more) which may come from an external interrupt
controller, and one interrupt line from the secondary controller. Other
primary interrupts include the system timers, the jiffies interrupt, and the
manual cache interrupt. The secondary interrupt controller maintains an
interrupt state for all of the processor accounting counters.
 
\section{Counter}
 
1195,7 → 1401,8
might set the timer to 100~million (the number of clocks per second), and
set the high bit. Ever after, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock). This reload capability also limits the
maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.
maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),
rather than $2^{32}-1$.
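
The arithmetic behind these numbers can be sketched in C. The names below
are hypothetical, and the only facts assumed are those given above: the high
bit selects the reload behavior and the remaining 31~bits hold the count.

```c
#include <stdint.h>

/* Hypothetical names; the bit assignments follow the text above. */
#define TIMER_RELOAD 0x80000000u  /* high bit: reload after each timeout */
#define TIMER_MAX    0x7fffffffu  /* largest count, 2^31 - 1             */

/* Register value for a timer that interrupts every `clocks` clocks. */
uint32_t timer_interval_value(uint32_t clocks)
{
    return (clocks & TIMER_MAX) | TIMER_RELOAD;
}

/* Length of an interval in seconds for a given system clock rate. */
double timer_seconds(uint32_t clocks, double clock_hz)
{
    return (clocks & TIMER_MAX) / clock_hz;
}
```

With a 100~MHz clock, {\tt timer\_interval\_value(100000000)} yields
{\tt 0x85f5e100} for a once per second interrupt, and
{\tt timer\_seconds(TIMER\_MAX, 100e6)} comes to a little under 21.5~seconds.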
 
\section{Watchdog Timer}
 
1209,6 → 1416,19
While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.
 
\section{Bus Watchdog}
There is an additional watchdog timer on the Wishbone bus. This timer,
however, is hardware configured and not software configured. The timer is
reset at the beginning of any bus transaction, and only counts clocks during
such bus transactions. If the bus transaction takes longer than the number
of counts the timer allots, it will raise a bus error flag to terminate the
transaction. This is useful in the case of any peripherals that are
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may
then read from its port to find out which memory location created the problem.
 
Aside from its unusual configuration, the bus watchdog is just another
implementation of the fundamental timer described above.
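
The behavior described above can be modeled in a few lines of C. This is a
sketch under assumptions, not the hardware implementation: the structure, the
function {\tt bus\_watchdog\_clock}, and the choice to raise the error on the
first clock beyond the allotted count are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the bus watchdog. */
struct bus_watchdog {
    unsigned timeout;  /* clocks allotted per transaction (fixed in hardware) */
    unsigned count;    /* clocks spent in the current transaction             */
};

/* Call once per clock.  Returns true when a bus error should be raised. */
bool bus_watchdog_clock(struct bus_watchdog *w, bool bus_busy)
{
    if (!bus_busy) {
        w->count = 0;  /* idle bus: the next transaction starts from zero */
        return false;
    }
    w->count++;        /* only count clocks during a transaction */
    return w->count > w->timeout;
}
```

With {\tt timeout} set to three, a transaction may stay busy for three clocks
without complaint, and the fourth busy clock raises the error.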
 
\section{Jiffies}
 
This peripheral is motivated by the Linux use of `jiffies' whereby a process
1307,9 → 1527,189
Eventually, I intend to place an operating system onto the ZipSystem; I'm
just not there yet.
 
The rest of this chapter examines some common programming constructs, and
how they might be applied to the Zip System.
The rest of this chapter examines some common programming models, and how they
might be applied to the Zip System, and then finishes with a couple of examples.
 
\section{System High}
The easiest and simplest way to run the Zip CPU is in the system high mode.
In this mode, the CPU runs your program in supervisor mode from reboot to
power down, and is never interrupted. You will need to poll the interrupt
controller to determine when any external condition has become active. This
mode is useful, and can handle many microcontroller tasks.
 
Even better, in system high mode, all of the user registers are available
to the system high program as variables. Accessing these registers can be
done in a single clock cycle, which would move them to the active register
set or move them back. While this may seem like a load or store instruction,
none of these register accesses will suffer from memory delays.
 
The one thing that cannot be done in supervisor mode is a wait for interrupt
instruction. This, however, is easily rectified by jumping to a user task
within the supervisor's memory space, such as Tbl.~\ref{tbl:shi-idle}.
\begin{table}\begin{center}
\begin{tabbing}
{\tt supervisor\_idle:} \\
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
\> {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
\> {\em ; outside of the CPU's address space when it switches to user} \\
\> {\em ; mode.} \\
\> {\tt MOV supervisor\_idle\_continue,uPC} \\
\> {\em ; Put the processor into user mode and to sleep in the same} \\
\> {\em ; instruction. } \\
\> {\tt OR \$SLEEP|\$GIE,CC} \\
{\tt supervisor\_idle\_continue:} \\
\> {\em ; Now, if we haven't done this inline, we need to return} \\
\> {\em ; to whatever function called us.} \\
\> {\tt RETN} \\
\end{tabbing}
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
\end{center}\end{table}
 
\section{Traditional Interrupt Handling}
Although the Zip CPU does not have a traditional interrupt architecture,
it is possible to create the more traditional interrupt approach via software.
In this mode, the programmable interrupt controller is used together with the
supervisor state to create the illusion of more traditional interrupt handling.
 
To set this up, upon reboot the supervisor task:
\begin{enumerate}
\item Creates a (single) user context, a user stack, and sets the user
program counter to the entry of the user task
\item Creates a task table of ISR entries
\item Enables the master interrupt enable via the interrupt controller, albeit
without enabling any of the fifteen potential underlying interrupts.
\item Switches to user mode, as the first part of the while loop in
Tbl.~\ref{tbl:traditional-isr}.
\end{enumerate}
\begin{table}\begin{center}
\begin{tabbing}
{\tt while(true) \{} \\
\hbox to 0.25in{}\= {\tt rtu();}\\
\> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\
\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
\> {\tt \} else \{} \\
\> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\
\\
\> \> {\em // Save the user context before running any ISRs. This could easily be}\\
\> \> {\em // implemented as an inline assembly routine or macro}\\
\> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\
\> \> {\em // At this point, we know an interrupt has taken place: Ask the programmable}\\
\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
\> \> {\tt int picv = *pic;}\\
\> \> {\em // Turn off all active interrupts}\\
\> \> {\em // Globally disable interrupt generation in the process}\\
\> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\
\> \> {\tt *pic = (active<<16);}\\
\> \> {\em // We build a mask of interrupts to re-enable in picv.}\\
\> \> {\tt picv = 0;}\\
\> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
\> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\
\> \>\>\> {\em // Acknowledge this particular interrupt. While we could acknowledge all}\\
\> \>\>\> {\em // interrupts at once, by acknowledging only those with ISR's we allow}\\
\> \>\>\> {\em // the user process to use peripherals manually, and to manually check}\\
\> \>\>\> {\em // whether or not those other interrupts had occurred.}\\
\> \>\>\> {\tt *pic = msk; }\\
\> \>\>\> {\tt rtu(); }\\
\> \>\>\> {\em // The ISR will only exit on a trap in the Zip architecture. There is}\\
\> \>\>\> {\em // no {\tt RETI} instruction. Since the PIC holds all interrupts disabled,}\\
\> \>\>\> {\em // there is no need to check for further interrupts.}\\
\> \>\>\> {\em // }\\
\> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
\>\>\>\> {\em // re-enable its own interrupt without re-enabling all interrupts. Hence, we}\\
\>\>\>\> {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
\> \>\>\> {\em // re-enabled. }\\
\> \>\>\> {\tt mov(uR0,tmp); }\\
\> \>\>\> {\tt picv |= (tmp \& 0x7fff) << 16; }\\
\> \>\> {\tt \} }\\
\> \> {\tt \} }\\
\> \> {\tt RESTORE\_PARTIAL\_CONTEXT; }\\
\> \> {\em // Re-activate all (requested) interrupts }\\
\> \> {\tt *pic = picv | 0x80000000; }\\
\>{\tt \} }\\
{\tt \}}\\
\end{tabbing}
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
\end{center}\end{table}
 
We can work through the interrupt handling process by examining
Tbl.~\ref{tbl:traditional-isr}. First, remember, the CPU is always running
either the user or the supervisor context. Once the supervisor switches to
user mode, control does not return until either an interrupt or a trap
has taken place. (Okay, there's also the possibility of a bus error, or an
illegal instruction such as an unimplemented floating point instruction---but
for now we'll just focus on the trap instruction.) Therefore, if the trap bit
isn't set, then we know an interrupt has taken place.
 
To process an interrupt, we steal the user's stack: the R0, CC, and PC registers
are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
\begin{table}\begin{center}
\begin{tabbing}
SAVE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
\> {\tt MOV -3(uSP),R3} \\
\> {\tt MOV uR0,R0} \\
\> {\tt MOV uCC,R1} \\
\> {\tt MOV uPC,R2} \\
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
\end{tabbing}
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
\end{center}\end{table}
This is much cheaper than the full context swap of a preemptive multitasking
kernel, but it also depends upon the ISR saving any state it uses. Further,
if multiple ISRs get called at once, this loses its optimality property
very quickly.
 
As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which
interrupts are enabled, and the bottom stores which have tripped. (Interrupts
may trip without being enabled; they just will not generate an interrupt to the
CPU.) Our first step is to query the register to find out our interrupt
state, and then to disable any interrupts that have tripped. To do
that, we write a one to the enable half of the register while also clearing
the top bit (master interrupt enable). This has the consequence of disabling
any and all further interrupts, not just the ones that have tripped. Hence,
upon completion, we re--enable the master interrupt bit again. Finally,
we keep track of which interrupts have tripped.
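
The register manipulation described above can be sketched in C. This is only
an illustration under stated assumptions: the helper names are hypothetical,
and the field positions (enables in bits 30--16, tripped flags in bits 14--0,
master enable in bit 31) are inferred from the {\tt 0x7fff} mask, the shift by
sixteen, and the {\tt 0x80000000} constant used in
Tbl.~\ref{tbl:traditional-isr}. Sec.~\ref{sec:pic} remains the authoritative
description of the register.
\begin{verbatim}
#include <stdint.h>

/* Assumed layout, inferred from the constants in the ISR listing. */
#define PIC_MASTER_ENABLE 0x80000000u /* top bit: master interrupt enable */
#define PIC_INT_MASK      0x00007fffu /* fifteen interrupt lines          */

/* Interrupts that have tripped: the bottom half of the register. */
static uint32_t pic_tripped(uint32_t picv) {
    return picv & PIC_INT_MASK;
}

/* Interrupts that are enabled: the top half, less the master bit. */
static uint32_t pic_enabled(uint32_t picv) {
    return (picv >> 16) & PIC_INT_MASK;
}

/* Word to write to disable the lines in mask: ones in the enable
 * half with the top bit clear. */
static uint32_t pic_disable_word(uint32_t mask) {
    return (mask & PIC_INT_MASK) << 16;
}

/* Word to write to re-enable the lines in mask: the same ones in
 * the enable half, this time with the top bit set. */
static uint32_t pic_enable_word(uint32_t mask) {
    return PIC_MASTER_ENABLE | ((mask & PIC_INT_MASK) << 16);
}
\end{verbatim}
The two write helpers differ only in the top bit, which is why disabling the
tripped interrupts also clears the master enable until the supervisor writes
the enable word back.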
 
Using the bit mask of interrupts that have tripped, we walk through all fifteen
possible interrupts. If there is an ISR installed, we acknowledge and reset
the interrupt within the PIC, and then call the ISR. The ISR, however, cannot
re--enable its interrupt without re--enabling the master interrupt bit. Thus,
to keep things simple, when the ISR is finished it places its interrupt
mask back into R0, or clears R0. This tells the supervisor mode process which
interrupts to re--enable. Any other registers that the ISR uses must be
saved and restored. (This is only truly optimal if only a single ISR is
called.) As a final instruction, the ISR clears the GIE bit, executing a user
trap. (Remember, the Zip CPU has no {\tt RETI} instruction to restore the
stack and return to userland. It needs to go through the supervisor mode to
get there.)
 
Then, once all interrupts are handled, the user context is restored in a
fashion similar to Tbl.~\ref{tbl:restore-partial}.
\begin{table}\begin{center}
\begin{tabbing}
RESTORE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
\> {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\
\> {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\
\> {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\
\> {\tt MOV R0,uR0} \\
\> {\tt MOV R1,uCC} \\
\> {\tt MOV R2,uPC} \\
\> {\tt MOV 3(R3),uSP} \\
\end{tabbing}
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
\end{center}\end{table}
Again, this is short and sweet simply because any other registers that needed
saving were saved within the ISR.
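
To make the symmetry between Tbl.~\ref{tbl:save-partial} and
Tbl.~\ref{tbl:restore-partial} explicit, here is a small C model of the two
sequences. It is a sketch only: the structure and function names are
hypothetical, and memory is modeled as an array of 32-bit words indexed by
word address.
\begin{verbatim}
#include <stdint.h>

/* Hypothetical model of the user registers the two macros touch. */
struct user_ctx { uint32_t r0, cc, pc, sp; };

/* Mirrors the save sequence: R3 = uSP-3, three stores at offsets
 * 1..3 from R3, then uSP = R3. */
static void save_partial(struct user_ctx *u, uint32_t *mem) {
    uint32_t r3 = u->sp - 3;
    mem[r3 + 1] = u->r0;
    mem[r3 + 2] = u->cc;
    mem[r3 + 3] = u->pc;
    u->sp = r3;
}

/* Mirrors the restore sequence: R3 = uSP, three loads at offsets
 * 1..3 from R3, then uSP = R3+3. */
static void restore_partial(struct user_ctx *u, const uint32_t *mem) {
    uint32_t r3 = u->sp;
    u->r0 = mem[r3 + 1];
    u->cc = mem[r3 + 2];
    u->pc = mem[r3 + 3];
    u->sp = r3 + 3;
}
\end{verbatim}
Starting from a user stack pointer of 10, for example, the save leaves the
stack pointer at 7 with the three words at addresses 8 through 10, and the
restore returns it to 10.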
 
There you have it: the Zip CPU, with its non-traditional interrupt architecture,
can still process interrupts in a very traditional fashion.
 
\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run. On the Zip CPU, this task is quite simple,
1381,7 → 1781,7
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.
 
\section{Context Switch}
\section{Example: Context Switch}
 
Fundamental to any multiprocessing system is the ability to switch from one
task to the next. In the ZipSystem, this is accomplished in one of a couple
2051,8 → 2451,6
experiment with building and adding particular features, the Zip CPU
makes a good starting point--it is fairly simple. Modifications should
be simple enough.
\item As an estimate of the ``weight'' of this implementation, the CPU has
cost me less than 150 hours to implement from its inception.
\item The Zip CPU was designed to be an implementable soft core that could be
placed within an FPGA, controlling actions internal to the FPGA. It
fits this role rather nicely. It does not fit the role of a system on
2069,6 → 2467,8
\item This simplified instruction set is easy to decode.
\item The simplified bus transactions (32-bit words only) were also very easy
to implement.
\item The pipelined load/store approach is novel, and can be used to greatly
increase the speed of the processor.
\item The novel approach of having a single interrupt vector, which just
brings the CPU back to the instruction it left off at within the last
interrupt context doesn't appear to have been that much of a problem.
2098,10 → 2498,7
\item The fact that the instruction width equals the bus width means that the
instruction fetch cycle will always be interfering with any load or
store memory operation, with the only exception being if the
instruction is already in the cache. {\em This has become the
fundamental limit on the speed and performance of the CPU!}
Those familiar with the Von--Neumann approach of sharing a bus
between data and instructions will not be surprised by this assessment.
instruction is already in the cache.
 
This could be fixed in one of three ways: the instruction set
architecture could be modified to handle Very Long Instruction Words
2146,6 → 2543,9
the addition of a data cache can greatly increase the LUT count of
a soft core.
 
The Zip CPU compensates for this via its pipelined load and store
instructions.
 
\item Many other instruction sets offer three operand instructions, whereas
the Zip CPU only offers two operand instructions. This means that it
takes the Zip CPU more instructions to do many of the same operations.
2371,4 → 2771,36
% Index
\end{document}
 
 
%
%
% Symbol table relocation types:
%
% Only 3-types of instructions truly need relocations: those that modify the
% PC register, and those that access memory.
%
% - LDI Addr,Rx // Loads an absolute address into Rx, 24 bits
%
% - LDILO Addr,Rx // Loads an absolute address into Rx, 32 bits
% LDIHI Addr,Rx // requires two instructions
%
% - JMP Rx // Jump to any address in Rx
% // Can be prefixed with two instructions to load Rx
% // from any 32-bit immediate
% - JMP #Addr // Jump to any 24-bit (signed) address, 23-bit unsigned
%
% - ADD x,PC // Any PC relative jump (20 bits)
%
% - ADD.C x,PC // Any PC relative conditional jump (20 bits)
%
% - LDIHI Addr,Rx // Load from any 32-bit address, clobbers Rx,
% LOD Addr(Rx),Rx // unconditional, requires second instruction
%
% - LOD.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond
%
% - STO.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid
%
% - FARJMP #Addr: // Arbitrary 32-bit jumps require a jump table
% BRA +1 // memory address. The BRA +1 can be skipped,
% .WORD Addr // but only if the address is placed at the end
% LOD -2(PC),PC // of an executable section
%
