URL
https://opencores.org/ocsvn/zipcpu/zipcpu/trunk
Subversion Repositories zipcpu
Compare Revisions
This comparison shows the changes necessary to convert path /zipcpu from Rev 67 to Rev 68
Rev 67 → Rev 68
/trunk/doc/spec.pdf
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
/trunk/doc/src/spec.tex
44,11 → 44,13
%% |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
\documentclass{gqtekspec} |
\usepackage{import} |
% \graphicspath{{../gfx}} |
\project{Zip CPU} |
\title{Specification} |
\author{Dan Gisselquist, Ph.D.} |
\email{dgisselq (at) opencores.org} |
\revision{Rev.~0.5} |
\revision{Rev.~0.6} |
\definecolor{webred}{rgb}{0.2,0,0} |
\definecolor{webgreen}{rgb}{0,0.2,0} |
\usepackage[dvips,ps2pdf,colorlinks=true, |
76,6 → 78,7
copy. |
\end{license} |
\begin{revisionhistory} |
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline |
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline |
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline |
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline |
167,6 → 170,71
contact me.} |
\end{itemize} |
|
The Zip CPU also has one unusual feature: the ability to do pipelined loads |
and stores. This allows the CPU to access on-chip memory at one access per |
clock, minus a stall for the initial access. |
|
\section{Characteristics of a SwiC} |
|
Here, we shall define a soft core internal to an FPGA as a ``System within a |
Chip,'' or a SwiC. SwiCs have some unique properties internal to them |
that have influenced the design of the Zip CPU. Among these are the bus, |
memory, and available peripherals. |
|
Most other approaches to soft core CPUs employ a Harvard architecture. |
This allows these other CPUs to have two separate bus structures: one for the |
program fetch, and the other for the memory. The Zip CPU is fairly unique in |
its approach because it uses a von Neumann architecture. This was done for |
simplicity. By using a von Neumann architecture, only one bus needs to be |
implemented within any FPGA. This helps to minimize real-estate, while |
maintaining a high clock speed. The disadvantage is that it can severely |
degrade the overall instructions per clock count. |
|
Soft cores within an FPGA have an additional characteristic regarding |
memory access: it is slow. Memory on chip may be accessed at a single |
cycle per access, but small FPGAs have a limited amount of memory on chip. |
Going off chip, however, is expensive. Two examples will prove this point. On |
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word, |
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the |
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles |
per read, and 2~cycles for any subsequent pipelined access read or write. |
Either way you look at it, this memory access will be slow and this doesn't |
account for any logic delays should the bus implementation logic get |
complicated. |
|
As may be noticed from the above discussion about memory speed, a second |
characteristic of memory is that all memory accesses may be pipelined, and |
that pipelined memory access is faster than non--pipelined access. Therefore, |
a SwiC soft core should support pipelined operations, but it should also |
allow a higher priority subsystem to get access to the bus (no starvation). |
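To put rough numbers on the benefit of pipelining, the following Python sketch tallies the cycle counts quoted above for the XuLA2 flash and SDRAM reads. The function name {\tt burst\_cycles} and the burst length of eight words are illustrative assumptions, as is the simplification that every word after the first costs exactly the quoted pipelined figure.

```python
def burst_cycles(n_words, first, subsequent):
    """Cycles to move n_words when only the first access pays full price."""
    if n_words <= 0:
        return 0
    return first + subsequent * (n_words - 1)


# Figures quoted in the text above for the XuLA2 board.
FLASH_FIRST, FLASH_NEXT = 128, 64
SDRAM_READ_FIRST, SDRAM_NEXT = 10, 2
BURST = 8  # hypothetical burst length, chosen only for illustration

for name, first, nxt in (("flash", FLASH_FIRST, FLASH_NEXT),
                         ("SDRAM read", SDRAM_READ_FIRST, SDRAM_NEXT)):
    one_at_a_time = BURST * burst_cycles(1, first, nxt)
    pipelined = burst_cycles(BURST, first, nxt)
    print(f"{name}: {BURST} words take {one_at_a_time} cycles singly, "
          f"{pipelined} cycles pipelined")
```

Even under these optimistic assumptions the flash remains slow, which is why the text treats both fast and slow memories as cases any SwiC must handle.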
|
As a further characteristic of SwiC memory options, on-chip caches are |
expensive. If you want to have a minimum of logic, cache logic may not be |
the highest on the priority list. |
|
In sum, memory is slow. While one processor on one FPGA may be able to fill |
its pipeline, the same processor on another FPGA may struggle to get more than |
one instruction at a time into the pipeline. Any SwiC must be able to deal |
with both cases: fast and slow memories. |
|
A final characteristic of SwiCs within FPGAs is the peripherals. |
Specifically, FPGAs are highly reconfigurable. Soft peripherals can easily |
be created on chip to support the SwiC if necessary. As an example, a simple |
30-bit peripheral could easily support reversing 30-bit numbers: a read from |
the peripheral returns its bit--reversed address. This is cheap within an |
FPGA, but expensive in instructions. |
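To see why bit reversal is expensive in instructions, consider what software has to do. The sketch below is a plain Python model, not Zip CPU code, and the name {\tt bit\_reverse} is only illustrative. Each of the 30 loop iterations needs a shift, a mask, an OR, and a loop test, whereas the FPGA peripheral amounts to crossed wires.

```python
def bit_reverse(value, width=30):
    """Reverse the low `width` bits of value, one bit per loop iteration."""
    result = 0
    for _ in range(width):
        # Shift the result left and append the current lowest bit of value.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Bit 0 of a 30-bit word ends up in bit 29.
print(hex(bit_reverse(1)))
```

Applying the function twice returns the original value, which is a handy sanity check for any hardware implementation as well.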
|
Indeed, anything that must be done fast within an FPGA is likely to already |
be done--elsewhere in the fabric. This leaves the CPU with the role of handling |
sequential tasks that need a lot of state. |
|
This means that the SwiC needs to live within a unique environment, |
separate and different from the traditional SoC. That isn't to say that a |
SwiC cannot be turned into a SoC, just that this SwiC has not been designed |
for that purpose. |
|
\section{Lessons Learned} |
|
Now, however, that I've worked on the Zip CPU for a while, it is not nearly |
as simple as I originally hoped. Worse, I've had to adjust to create |
capabilities that I was never expecting to need. These include: |
239,16 → 307,60
all instructions so that stalls would never be necessary. After trying |
to build such an architecture, I gave up, having learned some things: |
|
For example, in order to facilitate interrupt handling and debug |
stepping, the CPU needs to know what instructions have finished, and |
which have not. In other words, it needs to know where it can restart |
the pipeline from. Once restarted, it must act as though it had |
never stopped. This killed my idea of delayed branching, since what |
would be the appropriate program counter to restart at? The one the |
CPU was going to branch to, or the ones in the delay slots? This |
also makes the idea of compressed instruction codes difficult, since, |
again, where do you restart on interrupt? |
First, an ideal pipeline might look something like |
Fig.~\ref{fig:ideal-pipeline}. |
\begin{figure} |
\begin{center} |
\includegraphics[width=4in]{../gfx/fullpline.eps} |
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline} |
\end{center}\end{figure} |
Notice that, in this figure, all the pipeline stages are complete and |
full. Every instruction takes one clock and there are no delays. |
However, as the discussion above pointed out, the memory associated |
with a SwiC may not allow single clock access. It may be instead |
that you can only read every two clocks. In that case, what shall |
the pipeline look like? Should it look like |
Fig.~\ref{fig:waiting-pipeline}, |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/stuttra.eps} |
\caption{Instructions wait for each other}\label{fig:waiting-pipeline} |
\end{center}\end{figure} |
where instructions are held back until the pipeline is full, or should |
it look like Fig.~\ref{fig:independent-pipeline}, |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/stuttrb.eps} |
\caption{Instructions proceed independently}\label{fig:independent-pipeline} |
\end{center}\end{figure} |
where each instruction is allowed to move through the pipeline |
independently? For better or worse, the Zip CPU allows instructions |
to move through the pipeline independently. |
|
One approach to avoiding stalls is to use a branch delay slot, |
such as is shown in Fig.~\ref{fig:brdelay}. |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/bdly.eps} |
\caption{A typical branch delay slot approach}\label{fig:brdelay} |
\end{center}\end{figure} |
In this figure, the instructions |
{\tt BR} (a branch) and {\tt BD} (a branch delay instruction) |
are followed by the instructions after the branch: {\tt IA}, {\tt IB}, etc. |
Since it takes a processor a clock cycle to execute a branch, the |
delay slot allows the processor to do something useful in that |
branch. The problem the Zip CPU has with this approach is, what |
happens when the pipeline looks like Fig.~\ref{fig:brbroken}? |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/bdbroken.eps} |
\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken} |
\end{center}\end{figure} |
In this case, the branch delay slot never gets filled in the first |
place, and so the pipeline squashes it before it gets executed. |
If not that, then what happens when handling interrupts or |
debug stepping: when has the CPU finished an instruction? |
When the {\tt BR} instruction has finished, or must {\tt BD} |
follow every {\tt BR}? And, again, what if the pipeline isn't |
full? |
These thoughts killed any hopes of doing delayed branching. |
|
So I switched to a model of discrete execution: Once an instruction |
enters into either the ALU or memory unit, the instruction is |
guaranteed to complete. If the logic recognizes a branch or a |
258,7 → 370,18
until the conditional branch completes. Then, if it generates a new |
PC address, the stages preceding are all wiped clean. |
|
The discrete execution model allows such things as sleeping: if the |
This model, however, generated too many pipeline stalls, so the |
discrete execution model was modified to allow instructions to go |
through the ALU unit and be canceled before writeback. This removed |
the stall associated with ALU instructions before untaken branches. |
|
The discrete execution model allows such things as sleeping, as |
outlined in Fig.~\ref{fig:sleeping}. |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/sleep.eps} |
\caption{How the CPU halts when sleeping}\label{fig:sleeping} |
\end{center}\end{figure} |
If the |
CPU is put to ``sleep,'' the ALU and memory stages stall and back up |
everything before them. Likewise, anything that has entered the ALU |
or memory stage when the CPU is placed to sleep continues to completion. |
266,10 → 389,14
a valid signal, a stall signal, and a clock enable signal. In |
general, a stage stalls if its contents are valid and the next step |
is stalled. This allows the pipeline to fill any time a later stage |
stalls. |
stalls, as illustrated in Fig.~\ref{fig:stacking}. |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/stacking.eps} |
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking} |
\end{center}\end{figure} |
|
This approach is also different from other pipeline approaches. Instead |
of keeping the entire pipeline filled, each stage is treated |
This approach is also different from other pipeline approaches. |
Instead of keeping the entire pipeline filled, each stage is treated |
independently. Therefore, individual stages may move forward as long |
as the subsequent stage is available, regardless of whether the stage |
behind it is filled. |
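The stall rule just described can be sketched as a small behavioral model. This is a Python illustration written for this discussion, not a transcription of the CPU's logic: it tracks only the valid and stall signals, it ignores the clock enable, and the names {\tt step}, {\tt fetch}, and {\tt last\_stalled} are assumptions made for the example.

```python
def step(stages, fetch=None, last_stalled=False):
    """Advance a pipeline by one clock.

    stages:       list of stage contents, None marking a stage that is not valid
    fetch:        instruction offered to the first stage this clock, if any
    last_stalled: True when whatever follows the final stage cannot accept
    Returns (new_stages, retired, fetch_accepted).
    """
    n = len(stages)
    # A stage stalls if its contents are valid and the next step is stalled.
    stall = [False] * (n + 1)
    stall[n] = last_stalled
    for i in reversed(range(n)):
        stall[i] = stages[i] is not None and stall[i + 1]

    new = list(stages)
    retired = None
    # Move every valid, unstalled stage forward, starting from the back.
    for i in reversed(range(n)):
        if stages[i] is not None and not stall[i]:
            if i == n - 1:
                retired = stages[i]
            else:
                new[i + 1] = stages[i]
            new[i] = None

    fetch_accepted = fetch is not None and not stall[0]
    if fetch_accepted:
        new[0] = fetch
    return new, retired, fetch_accepted


# A partly filled pipeline with its final stage held up for three clocks.
pipe = ["I3", None, "I2", None, "I1"]
for n, nxt in enumerate(["I4", "I5", "I6"]):
    pipe, done, ok = step(pipe, fetch=nxt, last_stalled=True)
    print(n, pipe, "accepted" if ok else "refused")
```

In this model the gaps close while the last stage is held, and the fetch is only refused once every stage is valid.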
336,7 → 463,7
31\ldots 11 & R/W & Reserved for future uses\\\hline |
10 & R & (Reserved for) Bus-Error Flag\\\hline |
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline |
8 & R & (Reserved for) Illegal Instruction Flag\\\hline |
8 & R & Illegal Instruction Flag\\\hline |
7 & R/W & Break--Enable\\\hline |
6 & R/W & Step\\\hline |
5 & R/W & Global Interrupt Enable (GIE)\\\hline |
401,9 → 528,9
This functionality was added to enable an external debugger to |
set and manage breakpoints. |
|
The ninth bit is reserved for an illegal instruction bit. When the CPU |
The ninth bit is an illegal instruction bit. When the CPU |
tries to execute either a nonexistent instruction, or an instruction from |
an address that produces a bus error, the CPU will (once implemented) switch |
an address that produces a bus error, the CPU will (if implemented) switch |
to supervisor mode while setting this bit. The bit will automatically be |
cleared upon any return to user mode. |
|
418,7 → 545,7
\begin{tabular}{l|l} |
Bit & Meaning \\\hline |
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline |
8 & (Reserved for) Floating point enable \\\hline |
8 & Illegal instruction error flag \\\hline |
7 & Halt on break, to support an external debugger \\\hline |
6 & Step, single step the CPU in user mode\\\hline |
5 & GIE, or Global Interrupt Enable \\\hline |
454,10 → 581,33
There is no condition code for less than or equal, not C or not V. Sorry, |
I ran out of space in 3--bits. Conditioning on a non--supported condition |
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)}) |
As an alternative, the condition may often be reversed, recovering those |
extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;} |
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}. |
|
Conditionally executed ALU instructions will not further adjust the |
condition codes. |
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST} |
instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions |
will adjust conditions whenever their conditionals are true. In this way, |
multiple conditions may be evaluated without branches. For example, to do |
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try |
code such as Tbl.~\ref{tbl:dbl-condition}. |
\begin{table}\begin{center} |
\begin{tabular}{l} |
{\tt CMP 1,R0} \\ |
{;\em Condition codes are now set based upon R0-1} \\ |
{\tt CMP.Z 2,R1} \\ |
{;\em If R0 $\neq$ 1, conditions are unchanged.} \\ |
{;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\ |
{;\em Now do something based upon the conjunction of both conditions.} \\ |
{;\em While we use the example of a STO, it could be any instruction.} \\ |
{\tt STO.Z R0,(R2)} \\ |
\end{tabular} |
\caption{An example of a double conditional}\label{tbl:dbl-condition} |
\end{center}\end{table} |
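The effect of the double conditional can be checked with a short Python model. It tracks only the Z flag and treats {\tt CMP imm,reg} as setting Z when {\tt reg - imm} is zero, following the comments in the table. The function names are illustrative and the other condition codes are deliberately left out.

```python
def cmp_sets_z(imm, reg):
    """Model CMP imm,reg: return the Z flag, set when reg - imm == 0."""
    return (reg - imm) == 0


def store_would_execute(r0, r1):
    """Return True when the final STO.Z of the example would execute."""
    z = cmp_sets_z(1, r0)        # CMP 1,R0
    if z:                        # CMP.Z 2,R1 only runs when Z is set,
        z = cmp_sets_z(2, r1)    # and only then updates the flag
    return z                     # STO.Z R0,(R2) executes only if Z is set


for r0, r1 in ((1, 2), (1, 3), (0, 2)):
    print(r0, r1, store_would_execute(r0, r1))
```

Only the first pair satisfies both conditions, so only that case reaches the store.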
|
\section{Traditional Interrupt Handling} |
|
\section{Operand B} |
Many instruction forms have a 21-bit source ``Operand B'' associated with them. |
This Operand B is either equal to a register plus a signed immediate offset, |
1001,15 → 1151,36
\item While waiting for the pipeline to load following any taken branch, jump, |
return from interrupt or switch to interrupt context (5 stall cycles) |
|
If the PC suddenly changes, the pipeline is subsequently cleared and needs to |
be reloaded. Given that there are five stages to the pipeline, that accounts |
for four of the five stalls. The stall cycle is lost in the pipelined prefetch |
stage which needs at least one clock with a valid PC before it can produce |
a new output. |
Fig.~\ref{fig:bcstalls} |
\begin{figure}\begin{center} |
\includegraphics[width=3.5in]{../gfx/bc.eps} |
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls} |
\end{center}\end{figure} |
illustrates the situation for a conditional branch. In this case, the branch |
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so |
forth. However, since the branch is taken, the next instruction must be |
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded. |
Given that there are five stages to the pipeline, that accounts |
for four of the five stalls. The last stall cycle is lost in the pipelined |
prefetch stage which needs at least one clock with a valid PC before it can |
produce a new output. {\Large\bf Note: When I did this myself, I counted |
six stall cycles, for a total of seven cycles for this instruction. Is five |
really the right answer?} |
|
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and |
{\tt LDI \$X,PC} instructions specially, however. These instructions, when |
not conditioned on the flags, can execute with only 3~stall cycles. |
not conditioned on the flags, can execute with only 2~stall cycles, such as |
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is |
slated to be improved upon in subsequent releases. With a better prefetch, |
it should be possible to drop this down to one or zero stall cycles.} |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/bra.eps} |
\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch} |
\end{center}\end{figure} |
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction |
following the branch in memory, while {\tt IA} is the first instruction at the |
branch address. ({\tt CLR} denotes a clear--pipeline operation, and does |
not represent any instruction.) |
|
\item When reading from a prior register while also adding an immediate offset |
\begin{enumerate} |
1026,15 → 1197,23
That is, any instruction that does not add an immediate to {\tt RA} may be |
scheduled into the stall slot. |
|
\item When any write to either the CC or PC Register is followed by a memory |
operation |
\item When any (conditional) write to either the CC or PC Register is followed |
by a memory operation |
\begin{enumerate} |
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode} |
\item\ {\em (stall, even if jump not taken)} |
\item\ {\tt LOD \$X(RA),RB} |
\end{enumerate} |
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem}, |
\begin{figure}\begin{center} |
\includegraphics[width=2in]{../gfx/bcmem.eps} |
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem} |
\end{center}\end{figure} |
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which |
must be a load here), and ALU instructions {\tt I1} and so forth. |
Since branches take place in the writeback stage, the Zip CPU will stall the |
pipeline for one clock anytime there may be a possible jump. This prevents |
pipeline for one clock anytime there may be a possible jump--forcing the |
memory operation to stay in the operand decode stage. This prevents |
an instruction from executing a memory access after the jump but before the |
jump is recognized. |
|
1050,9 → 1229,10
\item\ {\tt BZ somewhere} |
\end{enumerate} |
|
The reason for this stall is simply performance. Many of the flags are |
determined via combinatorial logic after the writeback instruction is |
determined. Trying to then place these into the input for one of the operands |
The reason for this stall is simply performance: many of the flags are |
determined via combinatorial logic {\em during} the writeback cycle. |
Trying to then place these into the input for one of the operands for an |
ALU instruction during the same cycle |
created a time delay loop that would no longer execute in a single 100~MHz |
clock cycle. (The time delay of the multiply within the ALU wasn't helping |
either \ldots). |
1062,7 → 1242,8
that references the CC register. For example, {\tt MOV \$addr+PC,uPC} |
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur |
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC} |
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not. |
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since |
it doesn't read the condition codes. |
|
\item When waiting for a memory read operation to complete |
\begin{enumerate} |
1073,15 → 1254,32
|
Remember, the Zip CPU does not support out of order execution. Therefore, |
anytime the memory unit becomes busy both the memory unit and the ALU must |
stall until the memory unit is cleared. This is especially true of a load |
stall until the memory unit is cleared. This is illustrated in |
Fig.~\ref{fig:memrd}. |
\begin{figure}\begin{center} |
\includegraphics[width=5in]{../gfx/memrd.eps} |
\caption{Pipeline handling of a load instruction}\label{fig:memrd} |
\end{center}\end{figure} |
It is especially true of a load |
instruction, which must still write its operand back to the register file. |
Store instructions are different, since they can be busy with no impact on |
later ALU write back operations. Hence, only loads stall the pipeline. |
Note that there is an extra stall at the end of the memory cycle, so that |
the memory unit will be idle for one clock before an instruction will be |
accepted into the ALU. |
Store instructions are different, as shown in Fig.~\ref{fig:memwr}, |
\begin{figure}\begin{center} |
\includegraphics[width=5in]{../gfx/memwr.eps} |
\caption{Pipeline handling of a store instruction}\label{fig:memwr} |
\end{center}\end{figure} |
since they can be busy with the bus without impacting later write back |
pipeline stages. Hence, only loads stall the pipeline. |
|
This also assumes that the memory being accessed is a single cycle memory. |
This, of course, also assumes that the memory being accessed is a single cycle |
memory and that there are no stalls to get to the memory. |
Slower memories, such as the Quad SPI flash, will take longer--perhaps even |
as long as forty clocks. During this time the CPU and the external bus |
will be busy, and unable to do anything else. |
will be busy, and unable to do anything else. Likewise, if it takes a couple |
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd} |
and~\ref{fig:memwr}, there will be stalls. |
|
\item Memory operation followed by a memory operation |
\begin{enumerate} |
1091,10 → 1289,16
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)} |
\end{enumerate} |
|
In this case, the LOD instruction cannot start until the STO is finished. |
In this case, the LOD instruction cannot start until the STO is finished, |
as illustrated by Fig.~\ref{fig:mstld}. |
\begin{figure}\begin{center} |
\includegraphics[width=5.5in]{../gfx/mstld.eps} |
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld} |
\end{center}\end{figure} |
With proper scheduling, it is possible to do something in the ALU while the |
memory unit is busy with the STO instruction, but otherwise this pipeline will |
stall waiting for it to complete. |
stall while waiting for it to complete before the load instruction can |
start. |
|
The Zip CPU does have the capability of supporting pipelined memory access, |
but only under the following conditions: all accesses within the pipeline |
1102,7 → 1306,9
address, and there can be no stalls or other instructions between pipelined |
memory access instructions. Further, the offset to memory must be increasing |
by one address each instruction. These conditions work well for saving or |
storing registers to the stack. |
storing registers to the stack. Indeed, both |
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrate pipelined memory |
accesses. |
|
\item When waiting for a conditional memory read operation to complete |
\begin{enumerate} |
1134,7 → 1340,7
and described here. They are designed to make |
the Zip CPU more useful in an Embedded Operating System environment. |
|
\section{Interrupt Controller} |
\section{Interrupt Controller}\label{sec:pic} |
|
Perhaps the most important peripheral within the Zip System is the interrupt |
controller. While the Zip CPU itself can only handle one interrupt, and has |
1144,7 → 1350,7
The Zip System interrupt controller module supports up to 15 interrupts, all |
controlled from one register. Bit~31 of the interrupt controller controls |
overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30 |
control whether individual interrupts are enabled (1'b0) or disabled (1'b0). |
control whether individual interrupts are enabled (1'b1) or disabled (1'b0). |
Bit~15 is an indicator showing whether or not any interrupt is active, and |
bits~0--14 indicate whether or not an individual interrupt is active. |
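The register layout can be summarized in a few lines of Python. This sketch assumes the fifteen per-interrupt active flags occupy bits 0 through 14, so that they do not overlap the any-active indicator in bit 15. The constant and function names are illustrative and are not taken from the Zip System sources.

```python
MASTER_ENABLE = 1 << 31   # bit 31: overall interrupt enable
ENABLE_SHIFT = 16         # bits 16-30: individual interrupt enables
ANY_ACTIVE = 1 << 15      # bit 15: set when any interrupt is active
LINE_MASK = 0x7FFF        # fifteen interrupt lines (assumed bits 0-14)


def decode_pic(value):
    """Split a 32-bit interrupt controller word into its fields."""
    return {
        "master_enable": bool(value & MASTER_ENABLE),
        "enabled": (value >> ENABLE_SHIFT) & LINE_MASK,
        "any_active": bool(value & ANY_ACTIVE),
        "active": value & LINE_MASK,
    }


def pending(value):
    """Interrupt lines that are both enabled and active."""
    return (value >> ENABLE_SHIFT) & value & LINE_MASK


word = MASTER_ENABLE | (0b101 << ENABLE_SHIFT) | ANY_ACTIVE | 0b110
print(decode_pic(word), bin(pending(word)))
```

Here lines 0 and 2 are enabled while lines 1 and 2 are active, so only line 2 is pending.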
|
1167,12 → 1373,12
to re-enable any other interrupts. |
|
The Zip System currently hosts two interrupt controllers, a primary and a |
secondary. The primary interrupt controller has one interrupt line which may |
come from an external interrupt controller, and one interrupt line from the |
secondary controller. Other primary interrupts include the system timers, |
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt |
controller maintains an interrupt state for all of the processor accounting |
counters. |
secondary. The primary interrupt controller has one interrupt line (perhaps |
more if you configure it for more) which may come from an external interrupt |
controller, and one interrupt line from the secondary controller. Other |
primary interrupts include the system timers, the jiffies interrupt, and the |
manual cache interrupt. The secondary interrupt controller maintains an |
interrupt state for all of the processor accounting counters. |
|
\section{Counter} |
|
1195,7 → 1401,8
might set the timer to 100~million (the number of clocks per second), and |
set the high bit. Ever after, the timer will interrupt the CPU once per |
second (assuming a 100~MHz clock). This reload capability also limits the |
maximum timer value to $2^{31}-1$, rather than $2^{32}-1$. |
maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock), |
rather than $2^{32}-1$. |
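The arithmetic behind these figures is easy to verify. The Python sketch below assumes the 100~MHz clock used in the example above and that the high bit of the timer word selects the reload behavior, as the text describes. The helper name {\tt interval\_word} is hypothetical.

```python
CLOCK_HZ = 100_000_000          # 100 MHz system clock, as in the text
RELOAD_BIT = 1 << 31            # high bit: reload the timer on expiry
MAX_COUNT = (1 << 31) - 1       # largest count once the high bit is reserved


def interval_word(clocks):
    """Build the value to write for a repeating timer of `clocks` cycles."""
    if not 0 < clocks <= MAX_COUNT:
        raise ValueError("count must fit in 31 bits")
    return RELOAD_BIT | clocks


# A one second repeating interrupt at 100 MHz.
print(hex(interval_word(CLOCK_HZ)))

# Longest interval the 31-bit count allows, in seconds.
print(MAX_COUNT / CLOCK_HZ)
```

The longest interval works out to roughly 21.47~seconds, consistent with the rounded figure quoted above.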
|
\section{Watchdog Timer} |
|
1209,6 → 1416,19
While the watchdog timer supports interval mode, it doesn't make as much sense |
as it did with the other timers. |
|
\section{Bus Watchdog} |
There is an additional watchdog timer on the Wishbone bus. This timer, |
however, is hardware configured and not software configured. The timer is |
reset at the beginning of any bus transaction, and only counts clocks during |
such bus transactions. If the bus transaction takes longer than the number |
of counts the timer allots, it will raise a bus error flag to terminate the |
transaction. This is useful in the case of any peripherals that are |
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may |
then read from its port to find out which memory location created the problem. |
|
Aside from its unusual configuration, the bus watchdog is just another |
implementation of the fundamental timer described above. |
|
\section{Jiffies} |
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process |
1307,9 → 1527,189
Eventually, I intend to place an operating system onto the ZipSystem; I'm |
just not there yet. |
|
The rest of this chapter examines some common programming constructs, and |
how they might be applied to the Zip System. |
The rest of this chapter examines some common programming models and how they |
might be applied to the Zip System, and then finishes with a couple of examples. |
|
\section{System High} |
The easiest and simplest way to run the Zip CPU is in the system high mode. |
In this mode, the CPU runs your program in supervisor mode from reboot to |
power down, and is never interrupted. You will need to poll the interrupt |
controller to determine when any external condition has become active. This |
mode is useful, and can handle many microcontroller tasks. |
|
Even better, in system high mode, all of the user registers are available |
to the system high program as variables. Accessing these registers can be |
done in a single clock cycle, which would move them to the active register |
set or move them back. While this may seem like a load or store instruction, |
none of these register accesses will suffer from memory delays. |
|
The one thing that cannot be done in supervisor mode is a wait for interrupt |
instruction. This, however, is easily rectified by jumping to a user task |
within the supervisor's memory space, such as Tbl.~\ref{tbl:shi-idle}. |
\begin{table}\begin{center} |
\begin{tabbing} |
{\tt supervisor\_idle:} \\ |
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\ |
\> {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\ |
\> {\em ; outside of the CPU's address space when it switches to user} \\ |
\> {\em ; mode.} \\ |
\> {\tt MOV supervisor\_idle\_continue,uPC} \\ |
\> {\em ; Put the processor into user mode and to sleep in the same} \\ |
\> {\em ; instruction. } \\ |
\> {\tt OR \$SLEEP|\$GIE,CC} \\ |
{\tt supervisor\_idle\_continue:} \\ |
\> {\em ; Now, if we haven't done this inline, we need to return} \\ |
\> {\em ; to whatever function called us.} \\ |
\> {\tt RETN} \\ |
\end{tabbing} |
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle} |
\end{center}\end{table} |
|
\section{Traditional Interrupt Handling} |
Although the Zip CPU does not have a traditional interrupt architecture, |
it is possible to create the more traditional interrupt approach via software. |
In this mode, the programmable interrupt controller is used together with the |
supervisor state to create the illusion of more traditional interrupt handling. |
|
To set this up, upon reboot the supervisor task: |
\begin{enumerate} |
\item Creates a (single) user context, a user stack, and sets the user |
program counter to the entry of the user task |
\item Creates a task table of ISR entries |
\item Enables the master interrupt enable via the interrupt controller, albeit |
without enabling any of the fifteen potential underlying interrupts. |
\item Switches to user mode, as the first part of the while loop in |
Tbl.~\ref{tbl:traditional-isr}. |
\end{enumerate} |
\begin{table}\begin{center} |
\begin{tabbing} |
{\tt while(true) \{} \\ |
\hbox to 0.25in{}\= {\tt rtu();}\\ |
\> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\ |
\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\ |
\> {\tt \} else \{} \\ |
\> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\ |
\\ |
\> \> {\em // Save the user context before running any ISRs. This could easily be}\\ |
\> \> {\em // implemented as an inline assembly routine or macro}\\ |
\> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\ |
\> \> {\em // At this point, we know an interrupt has taken place: Ask the programmable}\\ |
\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\ |
\> \> {\tt int picv = *pic;}\\ |
\> \> {\em // Turn off all active interrupts}\\ |
\> \> {\em // Globally disable interrupt generation in the process}\\ |
\> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\ |
\> \> {\tt *pic = (active<<16);}\\ |
\> \> {\em // We build a mask of interrupts to re-enable in picv.}\\ |
\> \> {\tt picv = 0;}\\ |
\> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\ |
\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\ |
\> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\ |
\> \>\>\> {\em // Acknowledge this particular interrupt. While we could acknowledge all}\\ |
\> \>\>\> {\em // interrupts at once, by acknowledging only those with ISR's we allow}\\ |
\> \>\>\> {\em // the user process to use peripherals manually, and to manually check}\\ |
\> \>\>\> {\em // whether or not those other interrupts had occurred.}\\ |
\> \>\>\> {\tt *pic = msk; }\\ |
\> \>\>\> {\tt rtu(); }\\ |
\> \>\>\> {\em // The ISR will only exit on a trap in the Zip architecture. There is}\\ |
\> \>\>\> {\em // no {\tt RETI} instruction. Since the PIC holds all interrupts disabled,}\\ |
\> \>\>\> {\em // there is no need to check for further interrupts.}\\ |
\> \>\>\> {\em // }\\ |
\> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\ |
\>\>\>\> {\em // re-enable its own interrupt without re-enabling all interrupts. Hence, we}\\ |
\>\>\>\> {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\ |
\> \>\>\> {\em // re-enabled. }\\ |
\> \>\>\> {\tt mov(uR0,tmp); }\\ |
\> \>\>\> {\tt picv |= (tmp \& 0x7fff) << 16; }\\ |
\> \>\> {\tt \} }\\ |
\> \> {\tt \} }\\ |
\> \> {\tt RESTORE\_PARTIAL\_CONTEXT; }\\ |
\> \> {\em // Re-activate all (requested) interrupts }\\ |
\> \> {\tt *pic = picv | 0x80000000; }\\ |
\>{\tt \} }\\ |
{\tt \}}\\ |
\end{tabbing} |
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr} |
\end{center}\end{table} |
|
We can work through the interrupt handling process by examining |
Tbl.~\ref{tbl:traditional-isr}. First, remember, the CPU is always running |
either the user or the supervisor context. Once the supervisor switches to |
user mode, control does not return until either an interrupt or a trap |
has taken place. (Okay, there's also the possibility of a bus error, or an |
illegal instruction such as an unimplemented floating point instruction---but |
for now we'll just focus on the trap instruction.) Therefore, if the trap bit |
isn't set, then we know an interrupt has taken place. |
|
To process an interrupt, we steal the user's stack: the R0, CC, and PC registers |
are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}. |
\begin{table}\begin{center} |
\begin{tabbing} |
SAVE\_PARTIAL\_CONTEXT: \\ |
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\ |
\> {\tt MOV -3(uSP),R3} \\ |
\> {\tt MOV uR0,R0} \\ |
\> {\tt MOV uCC,R1} \\ |
\> {\tt MOV uPC,R2} \\ |
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\ |
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\ |
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\ |
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\ |
\end{tabbing} |
\caption{Example Saving Minimal User Context}\label{tbl:save-partial} |
\end{center}\end{table} |
This is much cheaper than the full context swap of a preemptive multitasking |
kernel, but it also depends upon the ISR saving any state it uses. Further, |
if multiple ISR's get called at once, this approach loses its optimality |
property very quickly. |
|
As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which |
interrupts are enabled, and the bottom stores which have tripped. (Interrupts |
may trip without being enabled; they just will not generate an interrupt to the |
CPU.) Our first step is to query the register to find out our interrupt |
state, and then to disable any interrupts that have tripped. To do |
that, we write a one to the enable half of the register while also clearing |
the top bit (master interrupt enable). This has the consequence of disabling |
any and all further interrupts, not just the ones that have tripped. Hence, |
upon completion, we re--enable the master interrupt bit again. Finally, |
we keep track of which interrupts have tripped. |
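
The register fields described above can be sketched in C. This is only a
minimal sketch under assumptions: the layout (bit 31 as the master enable,
bits 30--16 as the per--interrupt enables, bits 14--0 as the tripped
interrupts) is inferred from the masks used in
Tbl.~\ref{tbl:traditional-isr}, and the helper names are hypothetical rather
than part of any Zip CPU library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the handler's masks (not authoritative):
 * bit 31      master interrupt enable
 * bits 30..16 per-interrupt enable bits
 * bits 14..0  interrupts that have tripped */
#define PIC_MASTER_ENABLE 0x80000000u
#define PIC_INT_MASK      0x00007fffu

/* Interrupts that have tripped, whether or not they are enabled. */
static uint32_t pic_tripped(uint32_t pic)
{
    return pic & PIC_INT_MASK;
}

/* Interrupts that are individually enabled. */
static uint32_t pic_enabled(uint32_t pic)
{
    return (pic >> 16) & PIC_INT_MASK;
}

/* Tripped interrupts that would actually reach the CPU: they must be
 * enabled, and the master enable bit must be set. */
static uint32_t pic_pending(uint32_t pic)
{
    if (!(pic & PIC_MASTER_ENABLE))
        return 0;
    return pic_tripped(pic) & pic_enabled(pic);
}

/* Word to write in order to re-enable the interrupts in `ints` together
 * with the master enable, mirroring `*pic = picv | 0x80000000`. */
static uint32_t pic_reenable_word(uint32_t ints)
{
    assert((ints & ~PIC_INT_MASK) == 0); /* only fifteen interrupts exist */
    return PIC_MASTER_ENABLE | (ints << 16);
}
```

For a register value of {\tt 0x80050003}, interrupts 0 and 1 have tripped,
interrupts 0 and 2 are enabled, and so only interrupt 0 is pending.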
|
Using the bit mask of interrupts that have tripped, we walk through all fifteen |
possible interrupts. If there is an ISR installed, we acknowledge and reset |
the interrupt within the PIC, and then call the ISR. The ISR, however, cannot |
re--enable its interrupt without re--enabling the master interrupt bit. Thus, |
to keep things simple, when the ISR is finished it places its interrupt |
mask back into R0, or clears R0. This tells the supervisor mode process which |
interrupts to re--enable. Any other registers that the ISR uses must be |
saved and restored. (This is only truly optimal if only a single ISR is |
called.) As a final instruction, the ISR clears the GIE bit by executing a user |
trap. (Remember, the Zip CPU has no {\tt RETI} instruction to restore the |
stack and return to userland. It needs to go through the supervisor mode to |
get there.) |
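
The walk through the tripped interrupts can be modeled on a host machine as
follows. This is a sketch under assumptions, not the supervisor code itself:
each ISR is represented by a hypothetical C function whose return value
stands in for what the ISR leaves in R0, and the writes of {\tt msk} to the
PIC are collected in {\tt acked} instead of going to hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_INTS 15

/* Hypothetical ISR type: the return value models the contents of R0 on
 * ISR exit, i.e. the ISR's own mask to re-enable it, or zero. */
typedef uint32_t (*isr_fn)(void);

/* Walk all fifteen interrupts. For each tripped interrupt with an ISR
 * installed, record the acknowledgement, run the ISR, and fold its R0
 * value into the enable half of the word to be written back to the PIC. */
static uint32_t dispatch_isrs(uint32_t active, const isr_fn table[NUM_INTS],
                              uint32_t *acked)
{
    uint32_t picv = 0;
    uint32_t msk = 1;

    assert(acked != NULL);
    *acked = 0;
    for (int i = 0; i < NUM_INTS; i++, msk <<= 1) {
        if ((active & msk) && table[i] != NULL) {
            *acked |= msk;              /* models `*pic = msk` */
            uint32_t r0 = table[i]();   /* models running the ISR */
            picv |= (r0 & 0x7fffu) << 16;
        }
    }
    return picv;
}

/* Two example ISRs: one asks to be re-enabled, the other does not. */
static uint32_t isr_rearm_bit0(void) { return 0x0001u; }
static uint32_t isr_one_shot(void)   { return 0; }
```

With ISR's installed for interrupts 0 and 2 only, and interrupts 0, 1, and 2
tripped, interrupt 1 is left unacknowledged for the user process to examine,
and only interrupt 0 is re--enabled.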
|
Then, once all interrupts are handled, the user context is restored in a |
fashion similar to Tbl.~\ref{tbl:restore-partial}. |
\begin{table}\begin{center} |
\begin{tabbing} |
RESTORE\_PARTIAL\_CONTEXT: \\ |
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\ |
\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\ |
\> {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\ |
\> {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\ |
\> {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\ |
\> {\tt MOV R0,uR0} \\ |
\> {\tt MOV R1,uCC} \\ |
\> {\tt MOV R2,uPC} \\ |
\> {\tt MOV 3(R3),uSP} \\ |
\end{tabbing} |
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial} |
\end{center}\end{table} |
Again, this is short and sweet simply because any other registers that needed |
saving were saved within the ISR. |
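
The stack offsets used by the save and restore sequences can be checked with
a small host--side model. This is a sketch under assumptions: memory is
modeled as an array of 32--bit words indexed by word address, and the
{\tt user\_regs} structure and function names are hypothetical, introduced
only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the user registers touched by the two sequences.
 * sp is a word address into the memory array. */
typedef struct {
    uint32_t r0, cc, pc, sp;
} user_regs;

/* Models SAVE_PARTIAL_CONTEXT: drop the stack pointer by three words,
 * then store R0, CC, and PC at offsets 1, 2, and 3 from the new pointer. */
static void save_partial(uint32_t *mem, user_regs *u)
{
    assert(u->sp >= 3);
    uint32_t r3 = u->sp - 3;    /* MOV -3(uSP),R3 */
    mem[r3 + 1] = u->r0;        /* STO R0,1(R3)   */
    mem[r3 + 2] = u->cc;        /* STO R1,2(R3)   */
    mem[r3 + 3] = u->pc;        /* STO R2,3(R3)   */
    u->sp = r3;                 /* MOV R3,uSP     */
}

/* Models RESTORE_PARTIAL_CONTEXT: read the three words back from the
 * same offsets, then raise the stack pointer by three words. */
static void restore_partial(const uint32_t *mem, user_regs *u)
{
    uint32_t r3 = u->sp;        /* MOV uSP,R3     */
    u->r0 = mem[r3 + 1];
    u->cc = mem[r3 + 2];
    u->pc = mem[r3 + 3];
    u->sp = r3 + 3;             /* MOV 3(R3),uSP  */
}
```

Note that, with these offsets, the last store lands on the word the user
stack pointer originally addressed, so the model assumes that word is free
for the supervisor to use.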
|
There you have it: the Zip CPU, with its non-traditional interrupt architecture, |
can still process interrupts in a very traditional fashion. |
|
\section{Example: Idle Task} |
One task every operating system needs is the idle task, the task that takes |
place when nothing else can run. On the Zip CPU, this task is quite simple, |
1381,7 → 1781,7
Third, we've optimized the conditional jump to a return instruction into a |
conditional return instruction. |
|
\section{Context Switch} |
\section{Example: Context Switch} |
|
Fundamental to any multiprocessing system is the ability to switch from one |
task to the next. In the ZipSystem, this is accomplished in one of a couple |
2051,8 → 2451,6
experiment with building and adding particular features, the Zip CPU |
makes a good starting point--it is fairly simple. Modifications should |
be simple enough. |
\item As an estimate of the ``weight'' of this implementation, the CPU has |
cost me less than 150 hours to implement from its inception. |
\item The Zip CPU was designed to be an implementable soft core that could be |
placed within an FPGA, controlling actions internal to the FPGA. It |
fits this role rather nicely. It does not fit the role of a system on |
2069,6 → 2467,8
\item This simplified instruction set is easy to decode. |
\item The simplified bus transactions (32-bit words only) were also very easy |
to implement. |
\item The pipelined load/store approach is novel, and can be used to greatly |
increase the speed of the processor. |
\item The novel approach of having a single interrupt vector, which just |
brings the CPU back to the instruction it left off at within the last |
interrupt context, doesn't appear to have been that much of a problem. |
2098,10 → 2498,7
\item The fact that the instruction width equals the bus width means that the |
instruction fetch cycle will always be interfering with any load or |
store memory operation, with the only exception being if the |
instruction is already in the cache. {\em This has become the |
fundamental limit on the speed and performance of the CPU!} |
Those familiar with the Von--Neumann approach of sharing a bus |
between data and instructions will not be surprised by this assessment. |
instruction is already in the cache. |
|
This could be fixed in one of three ways: the instruction set |
architecture could be modified to handle Very Long Instruction Words |
2146,6 → 2543,9
the addition of a data cache can greatly increase the LUT count of |
a soft core. |
|
The Zip CPU compensates for this via its pipelined load and store |
instructions. |
|
\item Many other instruction sets offer three operand instructions, whereas |
the Zip CPU only offers two operand instructions. This means that it |
takes the Zip CPU more instructions to do many of the same operations. |
2371,4 → 2771,36
% Index |
\end{document} |
|
|
% |
% |
% Symbol table relocation types: |
% |
% Only 3 types of instructions truly need relocations: those that modify the |
% PC register, and those that access memory. |
% |
% - LDI Addr,Rx // Loads an absolute address into Rx, 24 bits |
% |
% - LDILO Addr,Rx // Loads an absolute address into Rx, 32 bits |
% LDIHI Addr,Rx // requires two instructions |
% |
% - JMP Rx // Jump to any address in Rx |
% // Can be prefixed with two instructions to load Rx |
% // from any 32-bit immediate |
% - JMP #Addr // Jump to any 24'bit (signed) address, 23'b uns |
% |
% - ADD x,PC // Any PC relative jump (20 bits) |
% |
% - ADD.C x,PC // Any PC relative conditional jump (20 bits) |
% |
% - LDIHI Addr,Rx // Load from any 32-bit address, clobbers Rx, |
% LOD Addr(Rx),Rx // unconditional, requires second instruction |
% |
% - LOD.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond |
% |
% - STO.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid |
% |
% - FARJMP #Addr: // Arbitrary 32-bit jumps require a jump table |
% BRA +1 // memory address. The BRA +1 can be skipped, |
% .WORD Addr // but only if the address is placed at the end |
% LOD -2(PC),PC // of an executable section |
% |
/trunk/doc/Makefile
1,4 → 1,4
all: gpl-3.0.pdf spec.pdf |
all: pdftex gpl-3.0.pdf spec.pdf |
DSRC := src |
|
gpl-3.0.pdf: $(DSRC)/gpl-3.0.tex |
8,6 → 8,10
ps2pdf -dAutoRotatePages=/All gpl-3.0.ps gpl-3.0.pdf |
rm gpl-3.0.dvi gpl-3.0.log gpl-3.0.aux gpl-3.0.ps |
|
.PHONY: pdftex |
pdftex: |
@cd gfx; make --no-print-directory |
|
spec.pdf: $(DSRC)/spec.tex $(DSRC)/gqtekspec.cls $(DSRC)/GT.eps |
cd $(DSRC)/; latex spec.tex |
cd $(DSRC)/; latex spec.tex |