Line 42... |
Line 42... |
%% http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\usepackage{import}
% \graphicspath{{../gfx}}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.6}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
pdfauthor={Dan Gisselquist},
Line 74... |
Line 76... |
You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
Line 165... |
Line 168... |
\item Completely open source, licensed under the GPL.\footnote{Should you
need a copy of the Zip CPU licensed under other terms, please
contact me.}
\end{itemize}

The Zip CPU also has one unusual feature: the ability to do pipelined loads
and stores.  This allows the CPU to access on-chip memory at one access per
clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a
Chip,'' or a SwiC.  SwiCs have some distinctive properties
that have influenced the design of the Zip CPU.  Among these are the bus,
memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture.
This allows these other CPUs to have two separate bus structures: one for the
program fetch, and the other for the memory.  The Zip CPU is unusual in
its approach because it uses a von Neumann architecture.  This was done for
simplicity.  By using a von Neumann architecture, only one bus needs to be
implemented within any FPGA.  This helps to minimize real-estate, while
maintaining a high clock speed.  The disadvantage is that it can severely
degrade the overall instructions per clock count.

Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow.  Memory on chip may be accessed at a single
cycle per access, but small FPGAs have a limited amount of memory on chip.
Going off chip, however, is expensive.  Two examples will prove this point.  On
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
or 64~cycles per subsequent word in a pipelined architecture.  Likewise, the
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles
per read, and 2~cycles for any subsequent pipelined access read or write.
Either way you look at it, this memory access will be slow, and this doesn't
account for any logic delays should the bus implementation logic get
complicated.
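
To put rough numbers on this, the following Python sketch adds up the clock counts quoted above for a run of consecutive words.  The function name, and the assumption that every word after the first costs exactly the quoted pipelined rate, are illustrative only; actual bus timing depends on the implementation.

```python
# Illustrative sketch: the constants are the XuLA2 figures quoted in the text.
FLASH_FIRST, FLASH_NEXT = 128, 64          # clocks: first word, each later word
SDRAM_READ_FIRST, SDRAM_WRITE_FIRST = 10, 6
SDRAM_NEXT = 2                             # clocks per later pipelined word


def access_cycles(first, subsequent, words, pipelined=True):
    """Clocks needed to transfer `words` consecutive words.

    Without pipelining every word pays the full first-access cost.
    With pipelining only the first word does.
    """
    if words < 1:
        raise ValueError("words must be at least 1")
    if not pipelined:
        return first * words
    return first + subsequent * (words - 1)


if __name__ == "__main__":
    for n in (1, 8):
        print(n, "flash words:",
              access_cycles(FLASH_FIRST, FLASH_NEXT, n, pipelined=False),
              "vs", access_cycles(FLASH_FIRST, FLASH_NEXT, n))
        print(n, "SDRAM reads:",
              access_cycles(SDRAM_READ_FIRST, SDRAM_NEXT, n, pipelined=False),
              "vs", access_cycles(SDRAM_READ_FIRST, SDRAM_NEXT, n))
```

Under these assumptions, the benefit of pipelining grows with the length of the burst, which is why the remaining discussion treats pipelined access as a requirement rather than a nicety.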

As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that all memory accesses may be pipelined, and
that pipelined memory access is faster than non--pipelined access.  Therefore,
a SwiC soft core should support pipelined operations, but it should also
allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are
expensive.  If you want to have a minimum of logic, cache logic may not be
the highest on the priority list.

In sum, memory is slow.  While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline.  Any SwiC must be able to deal
with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.
Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily
be created on chip to support the SwiC if necessary.  As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address.  This is cheap within an
FPGA, but expensive in instructions.
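
To see why this is expensive in instructions, consider a software model of the same operation.  The sketch below is a hypothetical Python stand-in for such a peripheral (the name {\tt bitrev} is an assumption, not part of the Zip CPU): in software the reversal costs a shift, a mask, and an OR for each of the 30 bits, whereas in the fabric it is only wiring.

```python
def bitrev(value, width=30):
    """Return `value` with its low `width` bits in reversed order.

    Models a read from the hypothetical bit-reversal peripheral: the
    address goes in, and the bit-reversed address comes back.
    """
    result = 0
    for _ in range(width):
        # Move the lowest remaining input bit onto the low end of the result.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    print(hex(bitrev(0x1)))   # lowest bit moves to bit 29
    print(hex(bitrev(0x3)))   # two lowest bits move to bits 29 and 28
```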

Indeed, anything that must be done fast within an FPGA is likely to already
be done--elsewhere in the fabric.  This leaves the CPU with the role of handling
sequential tasks that need a lot of state.

This means that the SwiC needs to live within a distinctive environment,
separate and different from the traditional SoC.  That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose.

\section{Lessons Learned}

Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped.  Worse, I've had to adjust to create
capabilities that I was never expecting to need.  These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
Line 237... |
Line 305... |
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
stalls at all, but rather to require the compiler to properly schedule
all instructions so that stalls would never be necessary.  After trying
to build such an architecture, I gave up, having learned some things:

First, an ideal pipeline might look something like
Fig.~\ref{fig:ideal-pipeline}.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{../gfx/fullpline.eps}
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
\end{center}\end{figure}
Notice that, in this figure, all the pipeline stages are complete and
full.  Every instruction takes one clock and there are no delays.
However, as the discussion above pointed out, the memory associated
with a SwiC may not allow single clock access.  It may be instead
that you can only read every two clocks.  In that case, what shall
the pipeline look like?  Should it look like
Fig.~\ref{fig:waiting-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttra.eps}
\caption{Instructions wait for each other}\label{fig:waiting-pipeline}
\end{center}\end{figure}
where instructions are held back until the pipeline is full, or should
it look like Fig.~\ref{fig:independent-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttrb.eps}
\caption{Instructions proceed independently}\label{fig:independent-pipeline}
\end{center}\end{figure}
where each instruction is allowed to move through the pipeline
independently?  For better or worse, the Zip CPU allows instructions
to move through the pipeline independently.

One approach to avoiding stalls is to use a branch delay slot,
such as is shown in Fig.~\ref{fig:brdelay}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdly.eps}
\caption{A typical branch delay slot approach}\label{fig:brdelay}
\end{center}\end{figure}
In this figure, instructions
{\tt BR} (a branch), {\tt BD} (a branch delay instruction),
are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.
Since it takes a processor a clock cycle to execute a branch, the
delay slot allows the processor to do something useful during that
cycle.  The problem the Zip CPU has with this approach is, what
happens when the pipeline looks like Fig.~\ref{fig:brbroken}?
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdbroken.eps}
\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}
\end{center}\end{figure}
In this case, the branch delay slot never gets filled in the first
place, and so the pipeline squashes it before it gets executed.
If not that, then what happens when handling interrupts or
debug stepping: when has the CPU finished an instruction?
When the {\tt BR} instruction has finished, or must {\tt BD}
follow every {\tt BR}?  And, again, what if the pipeline isn't
full?
These thoughts killed any hopes of doing delayed branching.

So I switched to a model of discrete execution: Once an instruction
enters into either the ALU or memory unit, the instruction is
guaranteed to complete.  If the logic recognizes a branch or a
condition that would render the instruction entering into this stage
possibly inappropriate (e.g., a conditional branch preceding a store
instruction), then the pipeline stalls for one cycle
until the conditional branch completes.  Then, if it generates a new
PC address, the stages preceding are all wiped clean.

This model, however, generated too many pipeline stalls, so the
discrete execution model was modified to allow instructions to go
through the ALU unit and be canceled before writeback.  This removed
the stall associated with ALU instructions before untaken branches.

The discrete execution model allows such things as sleeping, as
outlined in Fig.~\ref{fig:sleeping}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/sleep.eps}
\caption{How the CPU halts when sleeping}\label{fig:sleeping}
\end{center}\end{figure}
If the
CPU is put to ``sleep,'' the ALU and memory stages stall and back up
everything before them.  Likewise, anything that has entered the ALU
or memory stage when the CPU is placed to sleep continues to completion.
To handle this logic, each pipeline stage has three control signals:
a valid signal, a stall signal, and a clock enable signal.  In
general, a stage stalls if its contents are valid and the next step
is stalled.  This allows the pipeline to fill any time a later stage
stalls, as illustrated in Fig.~\ref{fig:stacking}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stacking.eps}
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
\end{center}\end{figure}

This approach is also different from other pipeline approaches.
Instead of keeping the entire pipeline filled, each stage is treated
independently.  Therefore, individual stages may move forward as long
as the subsequent stage is available, regardless of whether the stage
behind it is filled.
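
The stall rule just described (a stage stalls if its contents are valid and the next step is stalled) can be sketched in a few lines of Python.  This is a behavioral toy under stated assumptions, not the Zip CPU's Verilog: the function {\tt step}, the list representation of the stages, and the {\tt exit\_stalled} flag are all hypothetical names chosen for illustration.

```python
def step(stages, incoming, exit_stalled):
    """Advance a toy pipeline by one clock.

    stages       -- list of instruction names, None for an empty stage;
                    index 0 is the earliest stage
    incoming     -- instruction offered to stage 0 this clock
    exit_stalled -- True if whatever follows the last stage cannot accept
    Returns (new_stages, accepted), where accepted says whether
    `incoming` was taken into stage 0.
    """
    n = len(stages)
    stalled = [False] * (n + 1)
    stalled[n] = exit_stalled
    # A stage stalls only if it holds something valid AND the next is stalled.
    for i in range(n - 1, -1, -1):
        stalled[i] = stages[i] is not None and stalled[i + 1]
    upstream = [incoming] + stages[:-1]
    # Stalled stages hold their contents; the rest load from upstream.
    new_stages = [stages[i] if stalled[i] else upstream[i] for i in range(n)]
    return new_stages, not stalled[0]


if __name__ == "__main__":
    pipe = ["C", None, "A"]            # "A" is stuck in the last stage
    pipe, took = step(pipe, "D", exit_stalled=True)
    print(pipe, took)                  # the empty slot behind "A" fills
    pipe, took = step(pipe, "E", exit_stalled=True)
    print(pipe, took)                  # now full, so nothing moves
    pipe, took = step(pipe, "E", exit_stalled=False)
    print(pipe, took)                  # stall released, everything advances
```

In this model an empty stage never stalls, so instructions stack up behind a stalled one until the pipeline is full, and only then does the stall reach the first stage.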

\item {\bf Verilog Modules:} When examining how other processors worked
Line 334... |
Line 461... |
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
Line 399... |
Line 526... |
%

This functionality was added to enable an external debugger to
set and manage breakpoints.

The ninth bit is an illegal instruction bit.  When the CPU
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit.  The bit will automatically be
cleared upon any return to user mode.
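
As a reading aid, the bit positions in the CC register table can be written out as a small Python decoder.  This is only a sketch: the names in {\tt CC\_BITS} and the function {\tt decode\_cc} are hypothetical, and only bits 3 through 10 are covered, since those are the bits listed in the table above.  Note that bits are numbered from zero, so the ``ninth bit'' is bit~8.

```python
# Hypothetical names for the CC register bits listed in the table above.
CC_BITS = {
    10: "bus_error",            # (reserved for) bus-error flag
    9: "trap",                  # trap, or user interrupt, flag
    8: "illegal_instruction",   # the "ninth bit"
    7: "break_enable",
    6: "step",
    5: "gie",                   # global interrupt enable
    4: "sleep",
    3: "overflow",
}


def decode_cc(value):
    """Return the set of flag names whose bits are set in `value`."""
    return {name for bit, name in CC_BITS.items() if (value >> bit) & 1}


if __name__ == "__main__":
    # Sleep with GIE set: per the table, the CPU waits for an interrupt.
    print(sorted(decode_cc((1 << 5) | (1 << 4))))
```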

The tenth bit is a trap bit.  It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command.  This allows the
Line 416... |
Line 543... |
\begin{table}
\begin{center}
\begin{tabular}{l|l}
Bit & Meaning \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & Illegal instruction error flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
4 & Sleep \\\hline
3 & V, or overflow bit.\\\hline
Line 452... |
Line 579... |
\end{center}
\end{table}
There is no condition code for less than or equal, not C or not V.  Sorry,
I ran out of space in 3--bits.  Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
As an alternative, the condition may often be reversed, recovering those
extra two clocks.  Thus instead of \hbox{\tt CMP Rx,Ry;}
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.

Conditionally executed ALU instructions will not further adjust the
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
instructions.  Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
will adjust conditions whenever their conditionals are true.  In this way,
multiple conditions may be evaluated without branches.  For example, to do
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
code such as Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{tabular}{l}
{\tt CMP 1,R0} \\
{;\em Condition codes are now set based upon R0-1} \\
{\tt CMP.Z 2,R1} \\
{;\em If R0 $\neq$ 1, conditions are unchanged.} \\
{;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\
{;\em Now do something based upon the conjunction of both conditions.} \\
{;\em While we use the example of a STO, it could be any instruction.} \\
{\tt STO.Z R0,(R2)} \\
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}
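
The effect of the double conditional in Tbl.~\ref{tbl:dbl-condition} can be traced with a short Python model.  This is a sketch only: it tracks just the Z flag, and the function name {\tt store\_executes} is hypothetical.  It returns whether the final {\tt STO.Z} would execute for given register values.

```python
def store_executes(r0, r1):
    """Model the Z flag through CMP 1,R0 ; CMP.Z 2,R1 ; STO.Z R0,(R2)."""
    # CMP 1,R0: flags are set from R0-1, so Z is set when R0 == 1.
    z = (r0 - 1) == 0
    # CMP.Z 2,R1: executes, and so updates the flags, only if Z is set.
    if z:
        z = (r1 - 2) == 0
    # STO.Z executes only if Z is still set, i.e. R0 == 1 and R1 == 2.
    return z


if __name__ == "__main__":
    for r0, r1 in [(1, 2), (1, 3), (0, 2)]:
        print(r0, r1, store_executes(r0, r1))
```

Because the second compare leaves the flags alone when its own condition fails, a mismatch on {\tt R0} carries through to the store without any branch.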

\section{Traditional Interrupt Handling}

\section{Operand B}
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
or an immediate offset by itself.  This value is encoded as shown in
Line 999... |
Line 1149... |
prefetch cache is loaded to support the next instruction.

\item While waiting for the pipeline to load following any taken branch, jump,
return from interrupt or switch to interrupt context (5 stall cycles)

Fig.~\ref{fig:bcstalls}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/bc.eps}
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}
\end{center}\end{figure}
illustrates the situation for a conditional branch.  In this case, the branch
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so
forth.  However, since the branch is taken, the next instruction must be
{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.
Given that there are five stages to the pipeline, that accounts
for four of the five stalls.  The last stall cycle is lost in the pipelined
prefetch stage which needs at least one clock with a valid PC before it can
produce a new output.  {\Large\bf Note: When I did this myself, I counted
six stall cycles, for a total of seven cycles for this instruction.  Is five
really the right answer?}

The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
not conditioned on the flags, can execute with only 2~stall cycles, such as
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is
slated to be improved upon in subsequent releases.  With a better prefetch,
it should be possible to drop this down to one or zero stall cycles.}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps}
\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch}
\end{center}\end{figure}
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
following the branch in memory, while {\tt IA} is the first instruction at the
branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does
not represent any instruction.)

\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
\item\ {\tt OPCODE ?,RA}
\item\ {\em (stall)}
Line 1024... |
Line 1195... |
opcode that will read and apply an immediate offset by one instruction.  The
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.

\item When any (conditional) write to either the CC or PC Register is followed
by a memory operation
\begin{enumerate}
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
\item\ {\em (stall, even if jump not taken)}
\item\ {\tt LOD \$X(RA),RB}
\end{enumerate}
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},
\begin{figure}\begin{center}
\includegraphics[width=2in]{../gfx/bcmem.eps}
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}
\end{center}\end{figure}
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which
must be a load here), and ALU instructions {\tt I1} and so forth.
Since branches take place in the writeback stage, the Zip CPU will stall the
pipeline for one clock anytime there may be a possible jump, forcing the
memory operation to stay in the operand decode stage.  This prevents
an instruction from executing a memory access after the jump but before the
jump is recognized.

This stall may be mitigated by shuffling the operations immediately following
a potential branch so that an ALU operation follows the branch instead of a
Line 1048... |
Line 1227... |
\item\ {\em (stall)}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt BZ somewhere}
\end{enumerate}

The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic {\em during} the writeback cycle.
Trying to then place these into the input for one of the operands for an
ALU instruction during the same cycle
created a time delay loop that would no longer execute in a single 100~MHz
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
either \ldots).

This stall may be eliminated via proper scheduling, by placing an instruction
that does not set flags in between the ALU operation and the instruction
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
it doesn't read the condition codes.

\item When waiting for a memory read operation to complete
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

Remember, the Zip CPU does not support out of order execution.  Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared.  This is especially true of a load
instruction, which must still write its operand back to the register file,
as illustrated in Fig.~\ref{fig:memrd}.
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memrd.eps}
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
\end{center}\end{figure}
Note that there is an extra stall at the end of the memory cycle, so that
the memory unit will be idle for one clock before an instruction will be
accepted into the ALU.
Store instructions are different, as shown in Fig.~\ref{fig:memwr},
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memwr.eps}
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
\end{center}\end{figure}
since they can be busy with the bus without impacting later write back
pipeline stages.  Hence, only loads stall the pipeline.

This, of course, also assumes that the memory being accessed is a single cycle
memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
as long as forty clocks.  During this time the CPU and the external bus
will be busy, and unable to do anything else.  Likewise, if it takes a couple
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.
|
|
|
\item Memory operation followed by a memory operation
|
\item Memory operation followed by a memory operation
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt STO address,RA}
|
\item\ {\tt STO address,RA}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\tt LOD address,RB}
|
\item\ {\tt LOD address,RB}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\end{enumerate}
|
\end{enumerate}
|
|
|
In this case, the LOD instruction cannot start until the STO is finished.
|
In this case, the LOD instruction cannot start until the STO is finished,
|
|
as illustrated by Fig.~\ref{fig:mstld}.
|
|
\begin{figure}\begin{center}
|
|
\includegraphics[width=5.5in]{../gfx/mstld.eps}
|
|
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
|
|
\end{center}\end{figure}
|
With proper scheduling, it is possible to do something in the ALU while the
|
With proper scheduling, it is possible to do something in the ALU while the
|
memory unit is busy with the STO instruction, but otherwise this pipeline will
|
memory unit is busy with the STO instruction, but otherwise this pipeline will
|
stall waiting for it to complete.
|
stall while waiting for it to complete before the load instruction can
|
|
start.
|
|
|
The Zip CPU does have the capability of supporting pipelined memory access,
|
The Zip CPU does have the capability of supporting pipelined memory access,
|
but only under the following conditions: all accesses within the pipeline
|
but only under the following conditions: all accesses within the pipeline
|
must all be reads or all be writes, all must use the same register for their
|
must all be reads or all be writes, all must use the same register for their
|
address, and there can be no stalls or other instructions between pipelined
|
address, and there can be no stalls or other instructions between pipelined
|
memory access instructions. Further, the offset to memory must be increasing
|
memory access instructions. Further, the offset to memory must be increasing
|
by one address each instruction. These conditions work well for saving or
|
by one address each instruction. These conditions work well for saving or
|
storing registers to the stack.
|
restoring registers on the stack. Indeed, both
|
|
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory
|
|
accesses.
|
|
|
\item When waiting for a conditional memory read operation to complete
|
\item When waiting for a conditional memory read operation to complete
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt LOD.Z address,RA}
|
\item\ {\tt LOD.Z address,RA}
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
Line 1132... |
Line 1338... |
\caption{Zip System Peripherals}\label{fig:zipsystem}
|
\caption{Zip System Peripherals}\label{fig:zipsystem}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
and described here. They are designed to make
|
and described here. They are designed to make
|
the Zip CPU more useful in an Embedded Operating System environment.
|
the Zip CPU more useful in an Embedded Operating System environment.
|
|
|
\section{Interrupt Controller}
|
\section{Interrupt Controller}\label{sec:pic}
|
|
|
Perhaps the most important peripheral within the Zip System is the interrupt
|
Perhaps the most important peripheral within the Zip System is the interrupt
|
controller. While the Zip CPU itself can only handle one interrupt, and has
|
controller. While the Zip CPU itself can only handle one interrupt, and has
|
only the one interrupt state: disabled or enabled, the interrupt controller
|
only the one interrupt state: disabled or enabled, the interrupt controller
|
can make things more interesting.
|
can make things more interesting.
|
|
|
The Zip System interrupt controller module supports up to 15 interrupts, all
|
The Zip System interrupt controller module supports up to 15 interrupts, all
|
controlled from one register. Bit~31 of the interrupt controller controls
|
controlled from one register. Bit~31 of the interrupt controller controls
|
overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30
|
overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30
|
control whether individual interrupts are enabled (1'b0) or disabled (1'b0).
|
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
|
Bit~15 is an indicator showing whether or not any interrupt is active, and
|
Bit~15 is an indicator showing whether or not any interrupt is active, and
|
bits~0--14 indicate whether or not an individual interrupt is active.
|
bits~0--14 indicate whether or not an individual interrupt is active.
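
As a rough illustration of the register layout just described, the following C sketch decodes a value read from the controller. All of the names ({\tt pic\_state}, {\tt pic\_decode}, {\tt pic\_pending} and the {\tt PIC\_} constants) are hypothetical and exist only for this example. It assumes that the fifteen individual active flags sit in bits~0--14, matching the fifteen enable flags in bits~16--30.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit positions follow the text above. */
#define PIC_MASTER_ENABLE 0x80000000u /* bit 31: overall interrupt enable  */
#define PIC_ANY_ACTIVE    0x00008000u /* bit 15: some interrupt is active  */
#define PIC_INT_MASK      0x00007fffu /* fifteen individual interrupt bits */

typedef struct {
    bool     master_enabled; /* bit 31                                  */
    bool     any_active;     /* bit 15                                  */
    uint32_t enabled;        /* bit i set: interrupt i is enabled       */
    uint32_t active;         /* bit i set: interrupt i has tripped      */
} pic_state;

/* Split a raw register value into its four fields. */
static pic_state pic_decode(uint32_t v)
{
    pic_state s;
    s.master_enabled = (v & PIC_MASTER_ENABLE) != 0;
    s.any_active     = (v & PIC_ANY_ACTIVE) != 0;
    s.enabled        = (v >> 16) & PIC_INT_MASK;
    s.active         = v & PIC_INT_MASK;
    return s;
}

/* Interrupts that are both enabled and tripped. */
static uint32_t pic_pending(uint32_t v)
{
    return (v >> 16) & v & PIC_INT_MASK;
}
```

For example, a register value of {\tt 0x80058004} would decode as: master enable set, interrupts 0 and 2 enabled, and interrupt 2 active, so that {\tt pic\_pending} returns {\tt 0x0004}.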
|
|
|
The interrupt controller has been designed so that bits can be controlled
|
The interrupt controller has been designed so that bits can be controlled
|
individually without having any knowledge of the rest of the controller
|
individually without having any knowledge of the rest of the controller
|
Line 1165... |
Line 1371... |
interrupt and clears the active indicator. This also has the side effect of
|
interrupt and clears the active indicator. This also has the side effect of
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
to re-enable any other interrupts.
|
to re-enable any other interrupts.
|
|
|
The Zip System currently hosts two interrupt controllers, a primary and a
|
The Zip System currently hosts two interrupt controllers, a primary and a
|
secondary. The primary interrupt controller has one interrupt line which may
|
secondary. The primary interrupt controller has one interrupt line (perhaps
|
come from an external interrupt controller, and one interrupt line from the
|
more if you configure it for more) which may come from an external interrupt
|
secondary controller. Other primary interrupts include the system timers,
|
controller, and one interrupt line from the secondary controller. Other
|
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt
|
primary interrupts include the system timers, the jiffies interrupt, and the
|
controller maintains an interrupt state for all of the processor accounting
|
manual cache interrupt. The secondary interrupt controller maintains an
|
counters.
|
interrupt state for all of the processor accounting counters.
|
|
|
\section{Counter}
|
\section{Counter}
|
|
|
The Zip Counter is a very simple counter: it just counts. It cannot be
|
The Zip Counter is a very simple counter: it just counts. It cannot be
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
Line 1193... |
Line 1399... |
bit is set when writing to the timer, the timer becomes an interval timer and
|
bit is set when writing to the timer, the timer becomes an interval timer and
|
reloads its last start time on any interrupt. Hence, to mark seconds, one
|
reloads its last start time on any interrupt. Hence, to mark seconds, one
|
might set the timer to 100~million (the number of clocks per second), and
|
might set the timer to 100~million (the number of clocks per second), and
|
set the high bit. Ever after, the timer will interrupt the CPU once per
|
set the high bit. Ever after, the timer will interrupt the CPU once per
|
second (assuming a 100~MHz clock). This reload capability also limits the
|
second (assuming a 100~MHz clock). This reload capability also limits the
|
maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.
|
maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),
|
|
rather than $2^{32}-1$.
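
The arithmetic above can be sketched in C as follows. The function and macro names are hypothetical, chosen only for this illustration. The sketch assumes, per the description above, that the high bit selects interval mode and that the remaining 31~bits hold the count.

```c
#include <stdint.h>

/* Hypothetical names; the bit layout follows the description above. */
#define TIMER_INTERVAL_BIT 0x80000000u /* high bit: reload on interrupt */
#define TIMER_MAX_COUNT    0x7fffffffu /* 2^31 - 1                      */

/*
 * Value to write to a timer so that it interrupts every `clocks` clocks.
 * Returns 0 (no timer value) if the request does not fit in 31 bits.
 */
static uint32_t timer_interval_value(uint32_t clocks)
{
    if (clocks == 0 || clocks > TIMER_MAX_COUNT)
        return 0;
    return TIMER_INTERVAL_BIT | clocks;
}

/* Longest representable interval, in milliseconds, for a given clock. */
static uint32_t timer_max_interval_ms(uint32_t clock_hz)
{
    if (clock_hz == 0)
        return 0;
    return (uint32_t)(((uint64_t)TIMER_MAX_COUNT * 1000u) / clock_hz);
}
```

With a 100~MHz clock, {\tt timer\_interval\_value(100000000)} yields {\tt 0x85F5E100} for a once per second interrupt, and {\tt timer\_max\_interval\_ms(100000000)} yields 21474, or roughly 21~seconds.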
|
|
|
\section{Watchdog Timer}
|
\section{Watchdog Timer}
|
|
|
The watchdog timer is no different from any of the other timers, save for one
|
The watchdog timer is no different from any of the other timers, save for one
|
critical difference: the interrupt line from the watchdog
|
critical difference: the interrupt line from the watchdog
|
Line 1207... |
Line 1414... |
write any other number to it---as with the other timers.
|
write any other number to it---as with the other timers.
|
|
|
While the watchdog timer supports interval mode, it doesn't make as much sense
|
While the watchdog timer supports interval mode, it doesn't make as much sense
|
as it did with the other timers.
|
as it did with the other timers.
|
|
|
|
\section{Bus Watchdog}
|
|
There is an additional watchdog timer on the Wishbone bus. This timer,
|
|
however, is hardware configured and not software configured. The timer is
|
|
reset at the beginning of any bus transaction, and only counts clocks during
|
|
such bus transactions. If the bus transaction takes longer than the number
|
|
of counts the timer allots, it will raise a bus error flag to terminate the
|
|
transaction. This is useful in the case of any peripherals that are
|
|
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may
|
|
then read from its port to find out which memory location created the problem.
|
|
|
|
Aside from its unusual configuration, the bus watchdog is just another
|
|
implementation of the fundamental timer described above.
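
The behavior described above can be modeled in a few lines of C. This is only a behavioral sketch under stated assumptions, not the hardware implementation: the structure and function names are hypothetical, and the exact clock on which the error is raised (here, the first clock after the allotted count has been exceeded) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical behavioral model of the bus watchdog. */
typedef struct {
    uint32_t timeout;    /* clocks allotted per transaction (fixed in hardware) */
    uint32_t count;      /* clocks spent so far in the current transaction      */
    bool     was_active; /* was a transaction in progress on the last clock?    */
} bus_watchdog;

/*
 * Advance the model by one clock.  `bus_active` is true while a bus
 * transaction is in progress.  Returns true when the transaction has
 * run longer than the allotted number of clocks, i.e. when a bus
 * error would be raised to terminate it.
 */
static bool bus_watchdog_tick(bus_watchdog *w, bool bus_active)
{
    if (!bus_active) {
        /* The watchdog only counts during bus transactions. */
        w->was_active = false;
        w->count = 0;
        return false;
    }
    if (!w->was_active) {
        /* Beginning of a new transaction: reset the counter. */
        w->count = 0;
        w->was_active = true;
    }
    w->count++;
    return w->count > w->timeout;
}
```

With a timeout of three clocks, a transaction still in progress on its fourth clock would trip the model, while any transaction that completes sooner leaves it silent.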
|
|
|
\section{Jiffies}
|
\section{Jiffies}
|
|
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
can request to be put to sleep until a certain number of `jiffies' have
|
can request to be put to sleep until a certain number of `jiffies' have
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
Line 1305... |
Line 1525... |
complete.
|
complete.
|
|
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
just not there yet.
|
just not there yet.
|
|
|
The rest of this chapter examines some common programming constructs, and
|
The rest of this chapter examines some common programming models, and how they
|
how they might be applied to the Zip System.
|
might be applied to the Zip System, and then finishes with a couple of examples.
|
|
|
|
\section{System High}
|
|
The easiest and simplest way to run the Zip CPU is in the system high mode.
|
|
In this mode, the CPU runs your program in supervisor mode from reboot to
|
|
power down, and is never interrupted. You will need to poll the interrupt
|
|
controller to determine when any external condition has become active. This
|
|
mode is useful, and can handle many microcontroller tasks.
|
|
|
|
Even better, in system high mode, all of the user registers are available
|
|
to the system high program as variables. Accessing these registers can be
|
|
done in a single clock cycle, which would move them to the active register
|
|
set or move them back. While this may seem like a load or store instruction,
|
|
none of these register accesses will suffer from memory delays.
|
|
|
|
The one thing that cannot be done in supervisor mode is a wait for interrupt
|
|
instruction. This, however, is easily rectified by jumping to a user task
|
|
within the supervisor's memory space, such as the one in Tbl.~\ref{tbl:shi-idle}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabbing}
|
|
{\tt supervisor\_idle:} \\
|
|
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
|
|
\> {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
|
|
\> {\em ; outside of the CPU's address space when it switches to user} \\
|
|
\> {\em ; mode.} \\
|
|
\> {\tt MOV supervisor\_idle\_continue,uPC} \\
|
|
\> {\em ; Put the processor into user mode and to sleep in the same} \\
|
|
\> {\em ; instruction. } \\
|
|
\> {\tt OR \$SLEEP|\$GIE,CC} \\
|
|
{\tt supervisor\_idle\_continue:} \\
|
|
\> {\em ; Now, if we haven't done this inline, we need to return} \\
|
|
\> {\em ; to whatever function called us.} \\
|
|
\> {\tt RETN} \\
|
|
\end{tabbing}
|
|
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
|
|
\end{center}\end{table}
|
|
|
|
\section{Traditional Interrupt Handling}
|
|
Although the Zip CPU does not have a traditional interrupt architecture,
|
|
it is possible to create the more traditional interrupt approach via software.
|
|
In this mode, the programmable interrupt controller is used together with the
|
|
supervisor state to create the illusion of more traditional interrupt handling.
|
|
|
|
To set this up, upon reboot the supervisor task:
|
|
\begin{enumerate}
|
|
\item Creates a (single) user context, a user stack, and sets the user
|
|
program counter to the entry of the user task
|
|
\item Creates a task table of ISR entries
|
|
\item Enables the master interrupt enable via the interrupt controller, albeit
|
|
without enabling any of the fifteen potential underlying interrupts.
|
|
\item Switches to user mode, as the first part of the while loop in
|
|
Tbl.~\ref{tbl:traditional-isr}.
|
|
\end{enumerate}
|
|
\begin{table}\begin{center}
|
|
\begin{tabbing}
|
|
{\tt while(true) \{} \\
|
|
\hbox to 0.25in{}\= {\tt rtu();}\\
|
|
\> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\
|
|
\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
|
|
\> {\tt \} else \{} \\
|
|
\> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\
|
|
\\
|
|
\> \> {\em // Save the user context before running any ISRs. This could easily be}\\
|
|
\> \> {\em // implemented as an inline assembly routine or macro}\\
|
|
\> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\
|
|
\> \> {\em // At this point, we know an interrupt has taken place: Ask the programmable}\\
|
|
\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
|
|
\> \> {\tt int picv = *pic;}\\
|
|
\> \> {\em // Turn off all active interrupts}\\
|
|
\> \> {\em // Globally disable interrupt generation in the process}\\
|
|
\> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\
|
|
\> \> {\tt *pic = (active<<16);}\\
|
|
\> \> {\em // We build a mask of interrupts to re-enable in picv.}\\
|
|
\> \> {\tt picv = 0;}\\
|
|
\> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
|
|
\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
|
|
\> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\
|
|
\> \>\>\> {\em // Acknowledge this particular interrupt. While we could acknowledge all}\\
|
|
\> \>\>\> {\em // interrupts at once, by acknowledging only those with ISR's we allow}\\
|
|
\> \>\>\> {\em // the user process to use peripherals manually, and to manually check}\\
|
|
\> \>\>\> {\em // whether or not those other interrupts have occurred.}\\
|
|
\> \>\>\> {\tt *pic = msk; }\\
|
|
\> \>\>\> {\tt rtu(); }\\
|
|
\> \>\>\> {\em // The ISR will only exit on a trap in the Zip architecture. There is}\\
|
|
\> \>\>\> {\em // no {\tt RETI} instruction. Since the PIC holds all interrupts disabled,}\\
|
|
\> \>\>\> {\em // there is no need to check for further interrupts.}\\
|
|
\> \>\>\> {\em // }\\
|
|
\> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
|
|
\>\>\>\> {\em // re-enable its own interrupt without re-enabling all interrupts. Hence, we}\\
|
|
\>\>\>\> {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
|
|
\> \>\>\> {\em // re-enabled. }\\
|
|
\> \>\>\> {\tt mov(uR0,tmp); }\\
|
|
\> \>\>\> {\tt picv |= (tmp \& 0x7fff) << 16; }\\
|
|
\> \>\> {\tt \} }\\
|
|
\> \> {\tt \} }\\
|
|
\> \> {\tt RESTORE\_PARTIAL\_CONTEXT; }\\
|
|
\> \> {\em // Re-activate all (requested) interrupts }\\
|
|
\> \> {\tt *pic = picv | 0x80000000; }\\
|
|
\>{\tt \} }\\
|
|
{\tt \}}\\
|
|
\end{tabbing}
|
|
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
|
|
\end{center}\end{table}
|
|
|
|
We can work through the interrupt handling process by examining
|
|
Tbl.~\ref{tbl:traditional-isr}. First, remember, the CPU is always running
|
|
either the user or the supervisor context. Once the supervisor switches to
|
|
user mode, control does not return until either an interrupt or a trap
|
|
has taken place. (Okay, there's also the possibility of a bus error, or an
|
|
illegal instruction such as an unimplemented floating point instruction---but
|
|
for now we'll just focus on the trap instruction.) Therefore, if the trap bit
|
|
isn't set, then we know an interrupt has taken place.
|
|
|
|
To process an interrupt, we steal the user's stack: the R0, CC, and PC registers
|
|
are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabbing}
|
|
SAVE\_PARTIAL\_CONTEXT: \\
|
|
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
|
|
\> {\tt MOV -3(uSP),R3} \\
|
|
\> {\tt MOV uR0,R0} \\
|
|
\> {\tt MOV uCC,R1} \\
|
|
\> {\tt MOV uPC,R2} \\
|
|
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\
|
|
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\
|
|
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\
|
|
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
|
|
\end{tabbing}
|
|
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
|
|
\end{center}\end{table}
|
|
This is much cheaper than the full context swap of a preemptive multitasking
|
|
kernel, but it also depends upon the ISR saving any state it uses. Further,
|
|
if multiple ISRs get called at once, this loses its optimality property
|
|
very quickly.
|
|
|
|
As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which
|
|
interrupts are enabled, and the bottom stores which have tripped. (Interrupts
|
|
may trip without being enabled; they just will not generate an interrupt to the
|
|
CPU.) Our first step is to query the register to find out our interrupt
|
|
state, and then to disable any interrupts that have tripped. To do
|
|
that, we write a one to the enable half of the register while also clearing
|
|
the top bit (master interrupt enable). This has the consequence of disabling
|
|
any and all further interrupts, not just the ones that have tripped. Hence,
|
|
upon completion, we re--enable the master interrupt bit again. Finally,
|
|
we keep track of which interrupts have tripped.
|
|
|
|
Using the bit mask of interrupts that have tripped, we walk through all fifteen
|
|
possible interrupts. If there is an ISR installed, we acknowledge and reset
|
|
the interrupt within the PIC, and then call the ISR. The ISR, however, cannot
|
|
re--enable its interrupt without re-enabling the master interrupt bit. Thus,
|
|
to keep things simple, when the ISR is finished it places its interrupt
|
|
mask back into R0, or clears R0. This tells the supervisor mode process which
|
|
interrupts to re--enable. Any other registers that the ISR uses must be
|
|
saved and restored. (This is only truly optimal if only a single ISR is
|
|
called.) As its final instruction, the ISR clears the GIE bit, executing a user
|
|
trap. (Remember, the Zip CPU has no {\tt RETI} instruction to restore the
|
|
stack and return to userland. It needs to go through the supervisor mode to
|
|
get there.)
|
|
|
|
Then, once all interrupts are handled, the user context is restored in a
|
|
fashion similar to Tbl.~\ref{tbl:restore-partial}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabbing}
|
|
RESTORE\_PARTIAL\_CONTEXT: \\
|
|
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
|
|
\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
|
|
\> {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\
|
|
\> {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\
|
|
\> {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\
|
|
\> {\tt MOV R0,uR0} \\
|
|
\> {\tt MOV R1,uCC} \\
|
|
\> {\tt MOV R2,uPC} \\
|
|
\> {\tt MOV 3(R3),uSP} \\
|
|
\end{tabbing}
|
|
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
|
|
\end{center}\end{table}
|
|
Again, this is short and sweet simply because any other registers that needed
|
|
saving were saved within the ISR.
|
|
|
|
There you have it: the Zip CPU, with its non-traditional interrupt architecture,
|
|
can still process interrupts in a very traditional fashion.
|
|
|
\section{Example: Idle Task}
|
\section{Example: Idle Task}
|
One task every operating system needs is the idle task, the task that takes
|
One task every operating system needs is the idle task, the task that takes
|
place when nothing else can run. On the Zip CPU, this task is quite simple,
|
place when nothing else can run. On the Zip CPU, this task is quite simple,
|
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
|
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
|
Line 1379... |
Line 1779... |
data types. The Zip CPU does not have explicit support for smaller or larger
|
data types. The Zip CPU does not have explicit support for smaller or larger
|
data types, and so this memory copy cannot be applied at a byte level.
|
data types, and so this memory copy cannot be applied at a byte level.
|
Third, we've optimized the conditional jump to a return instruction into a
|
Third, we've optimized the conditional jump to a return instruction into a
|
conditional return instruction.
|
conditional return instruction.
|
|
|
\section{Context Switch}
|
\section{Example: Context Switch}
|
|
|
Fundamental to any multiprocessing system is the ability to switch from one
|
Fundamental to any multiprocessing system is the ability to switch from one
|
task to the next. In the ZipSystem, this is accomplished in one of a couple
|
task to the next. In the ZipSystem, this is accomplished in one of a couple
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
Line 2049... |
Line 2449... |
\item The Zip CPU is light weight and fully featured as it exists today. For
|
\item The Zip CPU is light weight and fully featured as it exists today. For
|
anyone who wishes to build a general purpose CPU and then to
|
anyone who wishes to build a general purpose CPU and then to
|
experiment with building and adding particular features, the Zip CPU
|
experiment with building and adding particular features, the Zip CPU
|
makes a good starting point--it is fairly simple. Modifications should
|
makes a good starting point--it is fairly simple. Modifications should
|
be simple enough.
|
be simple enough.
|
\item As an estimate of the ``weight'' of this implementation, the CPU has
|
|
cost me less than 150 hours to implement from its inception.
|
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
fits this role rather nicely. It does not fit the role of a system on
|
fits this role rather nicely. It does not fit the role of a system on
|
a chip very well, but then it was never intended to be a system on a
|
a chip very well, but then it was never intended to be a system on a
|
chip but rather a system within a chip.
|
chip but rather a system within a chip.
|
Line 2067... |
Line 2465... |
to do with two exceptions: bytewise character access and accelerated
|
to do with two exceptions: bytewise character access and accelerated
|
floating-point support.
|
floating-point support.
|
\item This simplified instruction set is easy to decode.
|
\item This simplified instruction set is easy to decode.
|
\item The simplified bus transactions (32-bit words only) were also very easy
|
\item The simplified bus transactions (32-bit words only) were also very easy
|
to implement.
|
to implement.
|
|
\item The pipelined load/store approach is novel, and can be used to greatly
|
|
increase the speed of the processor.
|
\item The novel approach of having a single interrupt vector, which just
|
\item The novel approach of having a single interrupt vector, which just
|
brings the CPU back to the instruction it left off at within the last
|
brings the CPU back to the instruction it left off at within the last
|
interrupt context doesn't appear to have been that much of a problem.
|
interrupt context doesn't appear to have been that much of a problem.
|
If most modern systems handle interrupt vectoring in software anyway,
|
If most modern systems handle interrupt vectoring in software anyway,
|
why maintain hardware support for it?
|
why maintain hardware support for it?
|
Line 2096... |
Line 2496... |
Still, \ldots it's not bad, it's just not astonishingly good.
|
Still, \ldots it's not bad, it's just not astonishingly good.
|
|
|
\item The fact that the instruction width equals the bus width means that the
|
\item The fact that the instruction width equals the bus width means that the
|
instruction fetch cycle will always be interfering with any load or
|
instruction fetch cycle will always be interfering with any load or
|
store memory operation, with the only exception being if the
|
store memory operation, with the only exception being if the
|
instruction is already in the cache. {\em This has become the
|
instruction is already in the cache.
|
fundamental limit on the speed and performance of the CPU!}
|
|
Those familiar with the von~Neumann approach of sharing a bus
|
|
between data and instructions will not be surprised by this assessment.
|
|
|
|
This could be fixed in one of three ways: the instruction set
|
This could be fixed in one of three ways: the instruction set
|
architecture could be modified to handle Very Long Instruction Words
|
architecture could be modified to handle Very Long Instruction Words
|
(VLIW) so that each 32--bit word would encode two or more instructions,
|
(VLIW) so that each 32--bit word would encode two or more instructions,
|
the instruction fetch bus width could be increased from 32--bits to
|
the instruction fetch bus width could be increased from 32--bits to
|
Line 2144... |
Line 2541... |
|
|
This may also be written off as a ``feature'' of the Zip CPU, since
|
This may also be written off as a ``feature'' of the Zip CPU, since
|
the addition of a data cache can greatly increase the LUT count of
|
the addition of a data cache can greatly increase the LUT count of
|
a soft core.
|
a soft core.
|
|
|
|
The Zip CPU compensates for this via its pipelined load and store
|
|
instructions.
|
|
|
\item Many other instruction sets offer three operand instructions, whereas
|
\item Many other instruction sets offer three operand instructions, whereas
|
the Zip CPU only offers two operand instructions. This means that it
|
the Zip CPU only offers two operand instructions. This means that it
|
takes the Zip CPU more instructions to do many of the same operations.
|
takes the Zip CPU more instructions to do many of the same operations.
|
The good part of this is that it gives the Zip CPU a greater amount of
|
The good part of this is that it gives the Zip CPU a greater amount of
|
flexibility in its immediate operand mode, although that increased
|
flexibility in its immediate operand mode, although that increased
|
Line 2369... |
Line 2769... |
|
|
% Appendices
|
% Appendices
|
% Index
|
% Index
|
\end{document}
|
\end{document}
|
|
|
|
%
|
|
%
|
|
% Symbol table relocation types:
|
|
%
|
|
% Only 3-types of instructions truly need relocations: those that modify the
|
|
% PC register, and those that access memory.
|
|
%
|
|
% - LDI Addr,Rx // Load's an absolute address into Rx, 24 bits
|
|
%
|
|
% - LDILO Addr,Rx // Load's an absolute address into Rx, 32 bits
|
|
% LDIHI Addr,Rx // requires two instructions
|
|
%
|
|
% - JMP Rx // Jump to any address in Rx
|
|
% // Can be prefixed with two instructions to load Rx
|
|
% // from any 32-bit immediate
|
|
% - JMP #Addr // Jump to any 24'bit (signed) address, 23'b uns
|
|
%
|
|
% - ADD x,PC // Any PC relative jump (20 bits)
|
|
%
|
|
% - ADD.C x,PC // Any PC relative conditional jump (20 bits)
|
|
%
|
|
% - LDIHI Addr,Rx // Load from any 32-bit address, clobbers Rx,
|
|
% LOD Addr(Rx),Rx // unconditional, requires second instruction
|
|
%
|
|
% - LOD.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond
|
|
%
|
|
% - STO.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid
|
|
%
|
|
% - FARJMP #Addr: // Arbitrary 32-bit jumps require a jump table
|
|
% BRA +1 // memory address. The BRA +1 can be skipped,
|
|
% .WORD Addr // but only if the address is placed at the end
|
|
% LOD -2(PC),PC // of an executable section
|
|
%
|
|
|
No newline at end of file
|
No newline at end of file
|