OpenCores

Rev 39	Rev 68
Line 42...	Line 42...
`%% http://www.gnu.org/licenses/gpl.html`	`%% http://www.gnu.org/licenses/gpl.html`
`%%`	`%%`
`%%`	`%%`
`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`	`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
`\documentclass{gqtekspec}`	`\documentclass{gqtekspec}`
	`\usepackage{import}`
	`% \graphicspath{{../gfx}}`
`\project{Zip CPU}`	`\project{Zip CPU}`
`\title{Specification}`	`\title{Specification}`
`\author{Dan Gisselquist, Ph.D.}`	`\author{Dan Gisselquist, Ph.D.}`
`\email{dgisselq (at) opencores.org}`	`\email{dgisselq (at) opencores.org}`
`\revision{Rev.~0.5}`	`\revision{Rev.~0.6}`
`\definecolor{webred}{rgb}{0.2,0,0}`	`\definecolor{webred}{rgb}{0.2,0,0}`
`\definecolor{webgreen}{rgb}{0,0.2,0}`	`\definecolor{webgreen}{rgb}{0,0.2,0}`
`\usepackage[dvips,ps2pdf,colorlinks=true,`	`\usepackage[dvips,ps2pdf,colorlinks=true,`
`anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,`	`anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,`
`pdfauthor={Dan Gisselquist},`	`pdfauthor={Dan Gisselquist},`
Line 74...	Line 76...
`You should have received a copy of the GNU General Public License along`	`You should have received a copy of the GNU General Public License along`
`with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a`	`with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a`
`copy.`	`copy.`
`\end{license}`	`\end{license}`
`\begin{revisionhistory}`	`\begin{revisionhistory}`
	`0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline`
`0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline`	`0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline`
`0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline`	`0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline`
`0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline`	`0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline`
`0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline`	`0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline`
`0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline`	`0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline`
Line 165...	Line 168...
`\item Completely open source, licensed under the GPL.\footnote{Should you`	`\item Completely open source, licensed under the GPL.\footnote{Should you`
`need a copy of the Zip CPU licensed under other terms, please`	`need a copy of the Zip CPU licensed under other terms, please`
`contact me.}`	`contact me.}`
`\end{itemize}`	`\end{itemize}`

	`The Zip CPU also has one very unique feature: the ability to do pipelined loads`
	`and stores. This allows the CPU to access on-chip memory at one access per`
	`clock, minus a stall for the initial access.`

	`\section{Characteristics of a SwiC}`

	Here, we shall define a soft core internal to an FPGA as a ``System within a
	`Chip,'' or a SwiC. SwiCs have some very unique properties internal to them`
	`that have influenced the design of the Zip CPU. Among these are the bus,`
	`memory, and available peripherals.`

	`Most other approaches to soft core CPUs employ a Harvard architecture.`
	`This allows these other CPUs to have two separate bus structures: one for the`
	`program fetch, and the other for the memory. The Zip CPU is fairly unique in`
	`its approach because it uses a von Neumann architecture. This was done for`
	`simplicity. By using a von Neumann architecture, only one bus needs to be`
	`implemented within any FPGA. This helps to minimize real-estate, while`
	`maintaining a high clock speed. The disadvantage is that it can severely`
	`degrade the overall instructions per clock count.`

	`Soft cores within an FPGA have an additional characteristic regarding`
	`memory access: it is slow. Memory on chip may be accessed at a single`
	`cycle per access, but small FPGAs have a limited amount of memory on chip.`
	`Going off chip, however, is expensive. Two examples will prove this point. On`
	`the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,`
	`or 64~cycles per subsequent word in a pipelined architecture. Likewise, the`
	`SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles`
	`per read, and 2~cycles for any subsequent pipelined access read or write.`
	`Either way you look at it, this memory access will be slow and this doesn't`
	`account for any logic delays should the bus implementation logic get`
	`complicated.`
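
	`As a rough worked example of these figures, and assuming (an assumption of`
	`this example, not a measured result) that the first--access cost is paid once`
	`and that every following word of an $N$--word pipelined burst costs only the`
	`subsequent--access figure, the cycle counts quoted above give`
	`\begin{eqnarray*}`
	`T_{\mbox{\scriptsize flash}}(N) &=& 128 + 64\,(N-1), \\`
	`T_{\mbox{\scriptsize SDRAM read}}(N) &=& 10 + 2\,(N-1).`
	`\end{eqnarray*}`
	`For an illustrative burst of $N=8$ words, this works out to 576~cycles`
	`from flash (versus $8\times 128=1024$ unpipelined) and 24~cycles from SDRAM`
	`(versus $8\times 10=80$ unpipelined).`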

	`As may be noticed from the above discussion about memory speed, a second`
	`characteristic of memory is that all memory accesses may be pipelined, and`
	`that pipelined memory access is faster than non--pipelined access. Therefore,`
	`a SwiC soft core should support pipelined operations, but it should also`
	`allow a higher priority subsystem to get access to the bus (no starvation).`

	`As a further characteristic of SwiC memory options, on-chip caches are`
	`expensive. If you want to have a minimum of logic, cache logic may not be`
	`the highest on the priority list.`

	`In sum, memory is slow. While one processor on one FPGA may be able to fill`
	`its pipeline, the same processor on another FPGA may struggle to get more than`
	`one instruction at a time into the pipeline. Any SwiC must be able to deal`
	`with both cases: fast and slow memories.`

	`A final characteristic of SwiCs within FPGAs is the peripherals.`
	`Specifically, FPGAs are highly reconfigurable. Soft peripherals can easily`
	`be created on chip to support the SwiC if necessary. As an example, a simple`
	`30-bit peripheral could easily support reversing 30-bit numbers: a read from`
	`the peripheral returns its bit--reversed address. This is cheap within an`
	`FPGA, but expensive in instructions.`
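
	`To make the bit--reversal example concrete (the symbols $a$, $a_i$ and $r(a)$`
	`are introduced here for illustration only), write a 30--bit address as`
	`$a=\sum_{i=0}^{29} a_i 2^i$ with each $a_i\in\{0,1\}$. A read from such a`
	`peripheral at address $a$ would then return`
	`\[ r(a) = \sum_{i=0}^{29} a_i\, 2^{29-i}, \]`
	`so that, for instance, $r(1)=2^{29}$ and $r(2^{29})=1$.`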

	`Indeed, anything that must be done fast within an FPGA is likely to already`
	`be done--elsewhere in the fabric. This leaves the CPU with the role of handling`
	`sequential tasks that need a lot of state.`

	`This means that the SwiC needs to live within a very unique environment,`
	`separate and different from the traditional SoC. That isn't to say that a`
	`SwiC cannot be turned into a SoC, just that this SwiC has not been designed`
	`for that purpose.`

	`\section{Lessons Learned}`

`Now, however, that I've worked on the Zip CPU for a while, it is not nearly`	`Now, however, that I've worked on the Zip CPU for a while, it is not nearly`
`as simple as I originally hoped. Worse, I've had to adjust to create`	`as simple as I originally hoped. Worse, I've had to adjust to create`
`capabilities that I was never expecting to need. These include:`	`capabilities that I was never expecting to need. These include:`
`\begin{itemize}`	`\begin{itemize}`
`\item {\bf External Debug:} Once placed upon an FPGA, some external means is`	`\item {\bf External Debug:} Once placed upon an FPGA, some external means is`
Line 237...	Line 305...
`\item {\bf Pipeline Stalls:} My original plan was to not support pipeline`	`\item {\bf Pipeline Stalls:} My original plan was to not support pipeline`
`stalls at all, but rather to require the compiler to properly schedule`	`stalls at all, but rather to require the compiler to properly schedule`
`all instructions so that stalls would never be necessary. After trying`	`all instructions so that stalls would never be necessary. After trying`
`to build such an architecture, I gave up, having learned some things:`	`to build such an architecture, I gave up, having learned some things:`

`For example, in order to facilitate interrupt handling and debug`	`First, an ideal pipeline might look something like`
`stepping, the CPU needs to know what instructions have finished, and`	`Fig.~\ref{fig:ideal-pipeline}.`
`which have not. In other words, it needs to know where it can restart`	`\begin{figure}`
`the pipeline from. Once restarted, it must act as though it had`	`\begin{center}`
`never stopped. This killed my idea of delayed branching, since what`	`\includegraphics[width=4in]{../gfx/fullpline.eps}`
`would be the appropriate program counter to restart at? The one the`	`\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}`
`CPU was going to branch to, or the ones in the delay slots? This`	`\end{center}\end{figure}`
`also makes the idea of compressed instruction codes difficult, since,`	`Notice that, in this figure, all the pipeline stages are complete and`
`again, where do you restart on interrupt?`	`full. Every instruction takes one clock and there are no delays.`
	`However, as the discussion above pointed out, the memory associated`
	`with a SwiC may not allow single clock access. It may be instead`
	`that you can only read every two clocks. In that case, what shall`
	`the pipeline look like? Should it look like`
	`Fig.~\ref{fig:waiting-pipeline},`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/stuttra.eps}`
	`\caption{Instructions wait for each other}\label{fig:waiting-pipeline}`
	`\end{center}\end{figure}`
	`where instructions are held back until the pipeline is full, or should`
	`it look like Fig.~\ref{fig:independent-pipeline},`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/stuttrb.eps}`
	`\caption{Instructions proceed independently}\label{fig:independent-pipeline}`
	`\end{center}\end{figure}`
	`where each instruction is allowed to move through the pipeline`
	`independently? For better or worse, the Zip CPU allows instructions`
	`to move through the pipeline independently.`

	`One approach to avoiding stalls is to use a branch delay slot,`
	`such as is shown in Fig.~\ref{fig:brdelay}.`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/bdly.eps}`
	`\caption{A typical branch delay slot approach}\label{fig:brdelay}`
	`\end{center}\end{figure}`
	`In this figure, the instructions`
	`{\tt BR} (a branch) and {\tt BD} (a branch delay instruction)`
	`are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.`
	`Since it takes a processor a clock cycle to execute a branch, the`
	`delay slot allows the processor to do something useful in that`
	`cycle. The problem the Zip CPU has with this approach is, what`
	`happens when the pipeline looks like Fig.~\ref{fig:brbroken}?`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/bdbroken.eps}`
	`\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}`
	`\end{center}\end{figure}`
	`In this case, the branch delay slot never gets filled in the first`
	`place, and so the pipeline squashes it before it gets executed.`
	`If not that, then what happens when handling interrupts or`
	`debug stepping: when has the CPU finished an instruction?`
	`When the {\tt BR} instruction has finished, or must {\tt BD}`
	`follow every {\tt BR}? And, again, what if the pipeline isn't`
	`full?`
	`These thoughts killed any hopes of doing delayed branching.`

`So I switched to a model of discrete execution: Once an instruction`	`So I switched to a model of discrete execution: Once an instruction`
`enters into either the ALU or memory unit, the instruction is`	`enters into either the ALU or memory unit, the instruction is`
`guaranteed to complete. If the logic recognizes a branch or a`	`guaranteed to complete. If the logic recognizes a branch or a`
`condition that would render the instruction entering into this stage`	`condition that would render the instruction entering into this stage`
`possibly inappropriate (e.g. a conditional branch preceding a store`	`possibly inappropriate (e.g. a conditional branch preceding a store`
`instruction), then the pipeline stalls for one cycle`	`instruction), then the pipeline stalls for one cycle`
`until the conditional branch completes. Then, if it generates a new`	`until the conditional branch completes. Then, if it generates a new`
`PC address, the stages preceding are all wiped clean.`	`PC address, the stages preceding are all wiped clean.`

`The discrete execution model allows such things as sleeping: if the`	`This model, however, generated too many pipeline stalls, so the`
	`discrete execution model was modified to allow instructions to go`
	`through the ALU unit and be canceled before writeback. This removed`
	`the stall associated with ALU instructions before untaken branches.`

	`The discrete execution model allows such things as sleeping, as`
	`outlined in Fig.~\ref{fig:sleeping}.`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/sleep.eps}`
	`\caption{How the CPU halts when sleeping}\label{fig:sleeping}`
	`\end{center}\end{figure}`
	`If the`
CPU is put to ``sleep,'' the ALU and memory stages stall and back up	CPU is put to ``sleep,'' the ALU and memory stages stall and back up
`everything before them. Likewise, anything that has entered the ALU`	`everything before them. Likewise, anything that has entered the ALU`
`or memory stage when the CPU is placed to sleep continues to completion.`	`or memory stage when the CPU is placed to sleep continues to completion.`
`To handle this logic, each pipeline stage has three control signals:`	`To handle this logic, each pipeline stage has three control signals:`
`a valid signal, a stall signal, and a clock enable signal. In`	`a valid signal, a stall signal, and a clock enable signal. In`
`general, a stage stalls if its contents are valid and the next step`	`general, a stage stalls if its contents are valid and the next step`
`is stalled. This allows the pipeline to fill any time a later stage`	`is stalled. This allows the pipeline to fill any time a later stage`
`stalls.`	`stalls, as illustrated in Fig.~\ref{fig:stacking}.`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/stacking.eps}`
	`\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}`
	`\end{center}\end{figure}`

`This approach is also different from other pipeline approaches. Instead`	`This approach is also different from other pipeline approaches.`
`of keeping the entire pipeline filled, each stage is treated`	`Instead of keeping the entire pipeline filled, each stage is treated`
`independently. Therefore, individual stages may move forward as long`	`independently. Therefore, individual stages may move forward as long`
`as the subsequent stage is available, regardless of whether the stage`	`as the subsequent stage is available, regardless of whether the stage`
`behind it is filled.`	`behind it is filled.`

`\item {\bf Verilog Modules:} When examining how other processors worked`	`\item {\bf Verilog Modules:} When examining how other processors worked`
Line 334...	Line 461...
`\begin{table}\begin{center}`	`\begin{table}\begin{center}`
`\begin{bitlist}`	`\begin{bitlist}`
`31\ldots 11 & R/W & Reserved for future uses\\\hline`	`31\ldots 11 & R/W & Reserved for future uses\\\hline`
`10 & R & (Reserved for) Bus-Error Flag\\\hline`	`10 & R & (Reserved for) Bus-Error Flag\\\hline`
`9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline`	`9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline`
`8 & R & (Reserved for) Illegal Instruction Flag\\\hline`	`8 & R & Illegal Instruction Flag\\\hline`
`7 & R/W & Break--Enable\\\hline`	`7 & R/W & Break--Enable\\\hline`
`6 & R/W & Step\\\hline`	`6 & R/W & Step\\\hline`
`5 & R/W & Global Interrupt Enable (GIE)\\\hline`	`5 & R/W & Global Interrupt Enable (GIE)\\\hline`
`4 & R/W & Sleep. When GIE is also set, the CPU waits for an interrupt.\\\hline`	`4 & R/W & Sleep. When GIE is also set, the CPU waits for an interrupt.\\\hline`
`3 & R/W & Overflow\\\hline`	`3 & R/W & Overflow\\\hline`
Line 399...	Line 526...
`%`	`%`

`This functionality was added to enable an external debugger to`	`This functionality was added to enable an external debugger to`
`set and manage breakpoints.`	`set and manage breakpoints.`

`The ninth bit is reserved for an illegal instruction bit. When the CPU`	`The ninth bit is an illegal instruction bit. When the CPU`
`tries to execute either a non-existent instruction, or an instruction from`	`tries to execute either a non-existent instruction, or an instruction from`
`an address that produces a bus error, the CPU will (once implemented) switch`	`an address that produces a bus error, the CPU will (if implemented) switch`
`to supervisor mode while setting this bit. The bit will automatically be`	`to supervisor mode while setting this bit. The bit will automatically be`
`cleared upon any return to user mode.`	`cleared upon any return to user mode.`

`The tenth bit is a trap bit. It is set whenever the user requests a soft`	`The tenth bit is a trap bit. It is set whenever the user requests a soft`
`interrupt, and cleared on any return to userspace command. This allows the`	`interrupt, and cleared on any return to userspace command. This allows the`
Line 416...	Line 543...
`\begin{table}`	`\begin{table}`
`\begin{center}`	`\begin{center}`
`\begin{tabular}{l\|l}`	`\begin{tabular}{l\|l}`
`Bit & Meaning \\\hline`	`Bit & Meaning \\\hline`
`9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline`	`9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline`
`8 & (Reserved for) Floating point enable \\\hline`	`8 & Illegal instruction error flag \\\hline`
`7 & Halt on break, to support an external debugger \\\hline`	`7 & Halt on break, to support an external debugger \\\hline`
`6 & Step, single step the CPU in user mode\\\hline`	`6 & Step, single step the CPU in user mode\\\hline`
`5 & GIE, or Global Interrupt Enable \\\hline`	`5 & GIE, or Global Interrupt Enable \\\hline`
`4 & Sleep \\\hline`	`4 & Sleep \\\hline`
`3 & V, or overflow bit.\\\hline`	`3 & V, or overflow bit.\\\hline`
Line 452...	Line 579...
`\end{center}`	`\end{center}`
`\end{table}`	`\end{table}`
`There is no condition code for less than or equal, not C or not V. Sorry,`	`There is no condition code for less than or equal, not C or not V. Sorry,`
`I ran out of space in 3--bits. Conditioning on a non--supported condition`	`I ran out of space in 3--bits. Conditioning on a non--supported condition`
`is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})`	`is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})`
	`As an alternative, the condition may often be reversed, recovering those`
	`extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;}`
	`\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.`

`Conditionally executed ALU instructions will not further adjust the`	`Conditionally executed ALU instructions will not further adjust the`
`condition codes.`	`condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}`
	`instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions`
	`will adjust conditions whenever their conditionals are true. In this way,`
	`multiple conditions may be evaluated without branches. For example, to do`
	`something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try`
	`code such as Tbl.~\ref{tbl:dbl-condition}.`
	`\begin{table}\begin{center}`
	`\begin{tabular}{l}`
	`{\tt CMP 1,R0} \\`
	`{;\em Condition codes are now set based upon R0-1} \\`
	`{\tt CMP.Z 2,R1} \\`
	`{;\em If R0 $\neq$ 1, conditions are unchanged.} \\`
	`{;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\`
	`{;\em Now do something based upon the conjunction of both conditions.} \\`
	`{;\em While we use the example of a STO, it could be any instruction.} \\`
	`{\tt STO.Z R0,(R2)} \\`
	`\end{tabular}`
	`\caption{An example of a double conditional}\label{tbl:dbl-condition}`
	`\end{center}\end{table}`

	`\section{Traditional Interrupt Handling}`

`\section{Operand B}`	`\section{Operand B}`
Many instruction forms have a 21-bit source ``Operand B'' associated with them.	Many instruction forms have a 21-bit source ``Operand B'' associated with them.
`This Operand B is either equal to a register plus a signed immediate offset,`	`This Operand B is either equal to a register plus a signed immediate offset,`
`or an immediate offset by itself. This value is encoded as shown in`	`or an immediate offset by itself. This value is encoded as shown in`
Line 999...	Line 1149...
`prefetch cache is loaded to support the next instruction.`	`prefetch cache is loaded to support the next instruction.`

`\item While waiting for the pipeline to load following any taken branch, jump,`	`\item While waiting for the pipeline to load following any taken branch, jump,`
`return from interrupt or switch to interrupt context (5 stall cycles)`	`return from interrupt or switch to interrupt context (5 stall cycles)`

`If the PC suddenly changes, the pipeline is subsequently cleared and needs to`	`Fig.~\ref{fig:bcstalls}`
`be reloaded. Given that there are five stages to the pipeline, that accounts`	`\begin{figure}\begin{center}`
`for four of the five stalls. The stall cycle is lost in the pipelined prefetch`	`\includegraphics[width=3.5in]{../gfx/bc.eps}`
`stage which needs at least one clock with a valid PC before it can produce`	`\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}`
`a new output.`	`\end{center}\end{figure}`
	`illustrates the situation for a conditional branch. In this case, the branch`
	`instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so`
	`forth. However, since the branch is taken, the next instruction must be`
	`{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded.`
	`Given that there are five stages to the pipeline, that accounts`
	`for four of the five stalls. The last stall cycle is lost in the pipelined`
	`prefetch stage which needs at least one clock with a valid PC before it can`
	`produce a new output. {\Large\bf Note: When I did this myself, I counted`
	`six stall cycles, for a total of seven cycles for this instruction. Is five`
	`really the right answer?}`

`The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and`	`The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and`
`{\tt LDI \$X,PC} instructions specially, however. These instructions, when`	`{\tt LDI \$X,PC} instructions specially, however. These instructions, when`
`not conditioned on the flags, can execute with only 3~stall cycles.`	`not conditioned on the flags, can execute with only 2~stall cycles, such as`
	`is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is`
	`slated to be improved upon in subsequent releases. With a better prefetch,`
	`it should be possible to drop this down to one or zero stall cycles.}`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=4in]{../gfx/bra.eps}`
	`\caption{An expedited branch costs only 2~stall cycles}\label{fig:branch}`
	`\end{center}\end{figure}`
	`In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction`
	`following the branch in memory, while {\tt IA} is the first instruction at the`
	`branch address. ({\tt CLR} denotes a clear--pipeline operation, and does`
	`not represent any instruction.)`

`\item When reading from a prior register while also adding an immediate offset`	`\item When reading from a prior register while also adding an immediate offset`
`\begin{enumerate}`	`\begin{enumerate}`
`\item\ {\tt OPCODE ?,RA}`	`\item\ {\tt OPCODE ?,RA}`
`\item\ {\em (stall)}`	`\item\ {\em (stall)}`
Line 1024...	Line 1195...
`opcode that will read and apply an immediate offset by one instruction. The`	`opcode that will read and apply an immediate offset by one instruction. The`
`good news is that this stall can easily be mitigated by proper scheduling.`	`good news is that this stall can easily be mitigated by proper scheduling.`
`That is, any instruction that does not add an immediate to {\tt RA} may be`	`That is, any instruction that does not add an immediate to {\tt RA} may be`
`scheduled into the stall slot.`	`scheduled into the stall slot.`

`\item When any write to either the CC or PC Register is followed by a memory`	`\item When any (conditional) write to either the CC or PC Register is followed`
`operation`	`by a memory operation`
`\begin{enumerate}`	`\begin{enumerate}`
`\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}`	`\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}`
`\item\ {\em (stall, even if jump not taken)}`	`\item\ {\em (stall, even if jump not taken)}`
`\item\ {\tt LOD \$X(RA),RB}`	`\item\ {\tt LOD \$X(RA),RB}`
`\end{enumerate}`	`\end{enumerate}`
	`A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=2in]{../gfx/bcmem.eps}`
	`\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}`
	`\end{center}\end{figure}`
	`for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which`
	`must be a load here), and ALU instructions {\tt I1} and so forth.`
`Since branches take place in the writeback stage, the Zip CPU will stall the`	`Since branches take place in the writeback stage, the Zip CPU will stall the`
`pipeline for one clock anytime there may be a possible jump. This prevents`	`pipeline for one clock anytime there may be a possible jump--forcing the`
	`memory operation to stay in the operand decode stage. This prevents`
`an instruction from executing a memory access after the jump but before the`	`an instruction from executing a memory access after the jump but before the`
`jump is recognized.`	`jump is recognized.`

`This stall may be mitigated by shuffling the operations immediately following`	`This stall may be mitigated by shuffling the operations immediately following`
`a potential branch so that an ALU operation follows the branch instead of a`	`a potential branch so that an ALU operation follows the branch instead of a`
Line 1048...	Line 1227...
`\item\ {\em (stall)}`	`\item\ {\em (stall)}`
`\item\ {\tt TST sys.ccv,CC}`	`\item\ {\tt TST sys.ccv,CC}`
`\item\ {\tt BZ somewhere}`	`\item\ {\tt BZ somewhere}`
`\end{enumerate}`	`\end{enumerate}`

`The reason for this stall is simply performance. Many of the flags are`	`The reason for this stall is simply performance: many of the flags are`
`determined via combinatorial logic after the writeback instruction is`	`determined via combinatorial logic {\em during} the writeback cycle.`
`determined. Trying to then place these into the input for one of the operands`	`Trying to then place these into the input for one of the operands for an`
	`ALU instruction during the same cycle`
`created a time delay loop that would no longer execute in a single 100~MHz`	`created a time delay loop that would no longer execute in a single 100~MHz`
`clock cycle. (The time delay of the multiply within the ALU wasn't helping`	`clock cycle. (The time delay of the multiply within the ALU wasn't helping`
`either \ldots).`	`either \ldots).`

`This stall may be eliminated via proper scheduling, by placing an instruction`	`This stall may be eliminated via proper scheduling, by placing an instruction`
`that does not set flags in between the ALU operation and the instruction`	`that does not set flags in between the ALU operation and the instruction`
`that references the CC register. For example, {\tt MOV \$addr+PC,uPC}`	`that references the CC register. For example, {\tt MOV \$addr+PC,uPC}`
`followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur`	`followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur`
`this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}`	`this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}`
`will incur the stall, while a {\tt LDI \$BREAKEN\|\$STEP,CC} will not.`	`will incur the stall, while a {\tt LDI \$BREAKEN\|\$STEP,CC} will not since`
	`it doesn't read the condition codes.`

`\item When waiting for a memory read operation to complete`	`\item When waiting for a memory read operation to complete`
`\begin{enumerate}`	`\begin{enumerate}`
`\item\ {\tt LOD address,RA}`	`\item\ {\tt LOD address,RA}`
`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`	`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`
`\item\ {\tt OPCODE I+RA,RB}`	`\item\ {\tt OPCODE I+RA,RB}`
`\end{enumerate}`	`\end{enumerate}`

`Remember, the Zip CPU does not support out of order execution. Therefore,`	`Remember, the Zip CPU does not support out of order execution. Therefore,`
`anytime the memory unit becomes busy both the memory unit and the ALU must`	`anytime the memory unit becomes busy both the memory unit and the ALU must`
`stall until the memory unit is cleared. This is especially true of a load`	`stall until the memory unit is cleared. This is illustrated in`
	`Fig.~\ref{fig:memrd},`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=5in]{../gfx/memrd.eps}`
	`\caption{Pipeline handling of a load instruction}\label{fig:memrd}`
	`\end{center}\end{figure}`
	`and is especially true of a load`
`instruction, which must still write its operand back to the register file.`	`instruction, which must still write its operand back to the register file.`
`Store instructions are different, since they can be busy with no impact on`	`Note that there is an extra stall at the end of the memory cycle, so that`
`later ALU write back operations. Hence, only loads stall the pipeline.`	`the memory unit will be idle for one clock before an instruction will be`
	`accepted into the ALU.`
	`Store instructions are different, as shown in Fig.~\ref{fig:memwr},`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=5in]{../gfx/memwr.eps}`
	`\caption{Pipeline handling of a store instruction}\label{fig:memwr}`
	`\end{center}\end{figure}`
	`since they can be busy with the bus without impacting later write back`
	`pipeline stages. Hence, only loads stall the pipeline.`

`This also assumes that the memory being accessed is a single cycle memory.`	`This, of course, also assumes that the memory being accessed is a single cycle`
	`memory and that there are no stalls to get to the memory.`
`Slower memories, such as the Quad SPI flash, will take longer--perhaps even`	`Slower memories, such as the Quad SPI flash, will take longer--perhaps even`
`as long as forty clocks. During this time the CPU and the external bus`	`as long as forty clocks. During this time the CPU and the external bus`
`will be busy, and unable to do anything else.`	`will be busy, and unable to do anything else. Likewise, if it takes a couple`
	`of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}`
	`and~\ref{fig:memwr}, there will be stalls.`

`\item Memory operation followed by a memory operation`	`\item Memory operation followed by a memory operation`
`\begin{enumerate}`	`\begin{enumerate}`
`\item\ {\tt STO address,RA}`	`\item\ {\tt STO address,RA}`
`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`	`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`
`\item\ {\tt LOD address,RB}`	`\item\ {\tt LOD address,RB}`
`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`	`\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}`
`\end{enumerate}`	`\end{enumerate}`

`In this case, the LOD instruction cannot start until the STO is finished.`	`In this case, the LOD instruction cannot start until the STO is finished,`
	`as illustrated by Fig.~\ref{fig:mstld}.`
	`\begin{figure}\begin{center}`
	`\includegraphics[width=5.5in]{../gfx/mstld.eps}`
	`\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}`
	`\end{center}\end{figure}`
`With proper scheduling, it is possible to do something in the ALU while the`	`With proper scheduling, it is possible to do something in the ALU while the`
`memory unit is busy with the STO instruction, but otherwise this pipeline will`	`memory unit is busy with the STO instruction, but otherwise this pipeline will`
`stall waiting for it to complete.`	`stall while waiting for it to complete before the load instruction can`
	`start.`

`The Zip CPU does have the capability of supporting pipelined memory access,`	`The Zip CPU does have the capability of supporting pipelined memory access,`
`but only under the following conditions: all accesses within the pipeline`	`but only under the following conditions: all accesses within the pipeline`
`must all be reads or all be writes, all must use the same register for their`	`must all be reads or all be writes, all must use the same register for their`
`address, and there can be no stalls or other instructions between pipelined`	`address, and there can be no stalls or other instructions between pipelined`
`memory access instructions. Further, the offset to memory must be increasing`	`memory access instructions. Further, the offset to memory must be increasing`
`by one address each instruction. These conditions work well for saving or`	`by one address each instruction. These conditions work well for saving or`
`restoring registers on the stack.`	`restoring registers on the stack. Indeed, if you noticed, both`
	`Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory`
	`accesses.`

`\item When waiting for a conditional memory read operation to complete`	`\item When waiting for a conditional memory read operation to complete`
`\begin{enumerate}`	`\begin{enumerate}`
`\item\ {\tt LOD.Z address,RA}`	`\item\ {\tt LOD.Z address,RA}`
`\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}`	`\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}`
Line 1132...	Line 1338...
`\caption{Zip System Peripherals}\label{fig:zipsystem}`	`\caption{Zip System Peripherals}\label{fig:zipsystem}`
`\end{center}\end{figure}`	`\end{center}\end{figure}`
`and described here. They are designed to make`	`and described here. They are designed to make`
`the Zip CPU more useful in an Embedded Operating System environment.`	`the Zip CPU more useful in an Embedded Operating System environment.`

`\section{Interrupt Controller}`	`\section{Interrupt Controller}\label{sec:pic}`

`Perhaps the most important peripheral within the Zip System is the interrupt`	`Perhaps the most important peripheral within the Zip System is the interrupt`
`controller. While the Zip CPU itself can only handle one interrupt, and has`	`controller. While the Zip CPU itself can only handle one interrupt, and has`
`only the one interrupt state: disabled or enabled, the interrupt controller`	`only the one interrupt state: disabled or enabled, the interrupt controller`
`can make things more interesting.`	`can make things more interesting.`

`The Zip System interrupt controller module supports up to 15 interrupts, all`	`The Zip System interrupt controller module supports up to 15 interrupts, all`
`controlled from one register. Bit~31 of the interrupt controller controls`	`controlled from one register. Bit~31 of the interrupt controller controls`
`overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30`	`overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30`
`control whether individual interrupts are enabled (1'b0) or disabled (1'b0).`	`control whether individual interrupts are enabled (1'b1) or disabled (1'b0).`
`Bit~15 is an indicator showing whether or not any interrupt is active, and`	`Bit~15 is an indicator showing whether or not any interrupt is active, and`
`bits~0--14 indicate whether or not an individual interrupt is active.`	`bits~0--14 indicate whether or not an individual interrupt is active.`
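
As a minimal sketch of the register layout just described, the following Python decodes a 32-bit word read from the interrupt controller. The function name and the field names are hypothetical, chosen only for illustration; only the bit positions come from the text. The sketch takes the individual active flags to occupy bits 0 through 14, since bit 15 is the any-interrupt-active indicator and the controller supports 15 interrupts.

```python
def decode_pic(value):
    """Split a 32-bit interrupt controller word into its fields.

    Field names are illustrative only; bit positions follow the text.
    """
    value &= 0xFFFFFFFF
    return {
        # Bit 31: overall interrupt enable
        "master_enable": bool((value >> 31) & 1),
        # Bits 16--30: individual interrupt enables
        "enabled": [i for i in range(15) if (value >> (16 + i)) & 1],
        # Bit 15: set when any interrupt is active
        "any_active": bool((value >> 15) & 1),
        # Bits 0--14: individual interrupt active flags (assumed range)
        "active": [i for i in range(15) if (value >> i) & 1],
    }


fields = decode_pic(0x80048004)
print(fields)
```

Here the example word has the overall enable set, interrupt 2 enabled, and interrupt 2 active, so the any-active indicator is also set.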

`The interrupt controller has been designed so that bits can be controlled`	`The interrupt controller has been designed so that bits can be controlled`
`individually without having any knowledge of the rest of the controller`	`individually without having any knowledge of the rest of the controller`
Line 1165...	Line 1371...
`interrupt and clears the active indicator. This also has the side effect of`	`interrupt and clears the active indicator. This also has the side effect of`
`disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary`	`disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary`
`to re-enable any other interrupts.`	`to re-enable any other interrupts.`

`The Zip System currently hosts two interrupt controllers, a primary and a`	`The Zip System currently hosts two interrupt controllers, a primary and a`
`secondary. The primary interrupt controller has one interrupt line which may`	`secondary. The primary interrupt controller has one interrupt line (perhaps`
`come from an external interrupt controller, and one interrupt line from the`	`more if you configure it for more) which may come from an external interrupt`
`secondary controller. Other primary interrupts include the system timers,`	`controller, and one interrupt line from the secondary controller. Other`
`the jiffies interrupt, and the manual cache interrupt. The secondary interrupt`	`primary interrupts include the system timers, the jiffies interrupt, and the`
`controller maintains an interrupt state for all of the processor accounting`	`manual cache interrupt. The secondary interrupt controller maintains an`
`counters.`	`interrupt state for all of the processor accounting counters.`

`\section{Counter}`	`\section{Counter}`

`The Zip Counter is a very simple counter: it just counts. It cannot be`	`The Zip Counter is a very simple counter: it just counts. It cannot be`
`halted. When it rolls over, it issues an interrupt. Writing a value to the`	`halted. When it rolls over, it issues an interrupt. Writing a value to the`
Line 1193...	Line 1399...
`bit is set when writing to the timer, the timer becomes an interval timer and`	`bit is set when writing to the timer, the timer becomes an interval timer and`
`reloads its last start time on any interrupt. Hence, to mark seconds, one`	`reloads its last start time on any interrupt. Hence, to mark seconds, one`
`might set the timer to 100~million (the number of clocks per second), and`	`might set the timer to 100~million (the number of clocks per second), and`
`set the high bit. Ever after, the timer will interrupt the CPU once per`	`set the high bit. Ever after, the timer will interrupt the CPU once per`
`second (assuming a 100~MHz clock). This reload capability also limits the`	`second (assuming a 100~MHz clock). This reload capability also limits the`
`maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.`	`maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),`
	`rather than $2^{32}-1$.`
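
The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (a 100 MHz clock, and the high bit of the written word selecting interval mode); the names {\tt timer\_word}, {\tt INTERVAL\_BIT} and {\tt MAX\_COUNT} are hypothetical and do not come from the Zip System sources.

```python
CLOCK_HZ = 100_000_000      # assumed 100 MHz system clock, as in the text
INTERVAL_BIT = 1 << 31      # high bit: reload the start value on interrupt
MAX_COUNT = (1 << 31) - 1   # largest count that fits below the high bit


def timer_word(seconds, interval=True, clock_hz=CLOCK_HZ):
    """Return the 32-bit word to write for a timeout of `seconds`."""
    count = round(seconds * clock_hz)
    if not 0 < count <= MAX_COUNT:
        raise ValueError("timeout does not fit in 31 bits")
    return count | (INTERVAL_BIT if interval else 0)


one_second = timer_word(1.0)            # 100 million clocks, interval mode
longest_seconds = MAX_COUNT / CLOCK_HZ  # roughly 21.5 seconds at 100 MHz
print(hex(one_second), longest_seconds)
```

Writing the resulting word once would, under these assumptions, give the once-per-second interrupt described above.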

`\section{Watchdog Timer}`	`\section{Watchdog Timer}`

`The watchdog timer is no different from any of the other timers, save for one`	`The watchdog timer is no different from any of the other timers, save for one`
`critical difference: the interrupt line from the watchdog`	`critical difference: the interrupt line from the watchdog`
Line 1207...	Line 1414...
`write any other number to it---as with the other timers.`	`write any other number to it---as with the other timers.`

`While the watchdog timer supports interval mode, it doesn't make as much sense`	`While the watchdog timer supports interval mode, it doesn't make as much sense`
`as it did with the other timers.`	`as it did with the other timers.`

Line 42...

%%              http://www.gnu.org/licenses/gpl.html

%%              http://www.gnu.org/licenses/gpl.html

%%

%%

%%

%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass{gqtekspec}

\documentclass{gqtekspec}

\usepackage{import}

% \graphicspath{{../gfx}}

\project{Zip CPU}

\project{Zip CPU}

\title{Specification}

\title{Specification}

\author{Dan Gisselquist, Ph.D.}

\author{Dan Gisselquist, Ph.D.}

\email{dgisselq (at) opencores.org}

\email{dgisselq (at) opencores.org}

\revision{Rev.~0.5}

\revision{Rev.~0.6}

\definecolor{webred}{rgb}{0.2,0,0}

\definecolor{webred}{rgb}{0.2,0,0}

\definecolor{webgreen}{rgb}{0,0.2,0}

\definecolor{webgreen}{rgb}{0,0.2,0}

\usepackage[dvips,ps2pdf,colorlinks=true,

\usepackage[dvips,ps2pdf,colorlinks=true,

        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,

        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,

        pdfauthor={Dan Gisselquist},

        pdfauthor={Dan Gisselquist},

Line 74...

Line 76...

You should have received a copy of the GNU General Public License along

You should have received a copy of the GNU General Public License along

with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a

with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a

copy.

copy.

\end{license}

\end{license}

\begin{revisionhistory}

\begin{revisionhistory}

0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline

0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline

0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline

0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline

0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline

0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline

0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline

0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline

0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline

0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline

0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline

Line 165...

Line 168...

\item Completely open source, licensed under the GPL.\footnote{Should you

\item Completely open source, licensed under the GPL.\footnote{Should you

        need a copy of the Zip CPU licensed under other terms, please

        need a copy of the Zip CPU licensed under other terms, please

        contact me.}

        contact me.}

\end{itemize}

\end{itemize}

The Zip CPU also has one unique feature: the ability to do pipelined loads

and stores.  This allows the CPU to access on-chip memory at one access per

clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a

Chip,'' or a SwiC.  SwiCs have some unique properties internal to them

that have influenced the design of the Zip CPU.  Among these are the bus,

memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture.

This allows these other CPUs to have two separate bus structures: one for the

program fetch, and the other for the memory.  The Zip CPU is fairly unique in

its approach because it uses a von Neumann architecture.  This was done for

simplicity.  By using a von Neumann architecture, only one bus needs to be

implemented within any FPGA.  This helps to minimize real-estate, while

maintaining a high clock speed.  The disadvantage is that it can severely

degrade the overall instructions per clock count.

Soft cores within an FPGA have an additional characteristic regarding

memory access: it is slow.  Memory on chip may be accessed at a single

cycle per access, but small FPGAs have a limited amount of memory on chip.

Going off chip, however, is expensive.  Two examples will prove this point.  On

the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,

or 64~cycles per subsequent word in a pipelined architecture.  Likewise, the

SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles

per read, and 2~cycles for any subsequent pipelined access read or write.

Either way you look at it, this memory access will be slow and this doesn't

account for any logic delays should the bus implementation logic get

complicated.
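
To make these figures concrete, here is a small Python sketch of the cost of a burst of consecutive words. It assumes the simplest possible model, in which the first access pays the full latency and each later word pays only the pipelined cost; the function name {\tt burst\_cycles} is hypothetical, and the cycle counts are the XuLA2 numbers quoted above.

```python
def burst_cycles(words, first, subsequent):
    """Clocks needed to transfer `words` consecutive 32-bit words."""
    if words < 1:
        raise ValueError("a burst needs at least one word")
    return first + (words - 1) * subsequent


# Eight-word bursts, using the XuLA2 figures from the text
flash_read = burst_cycles(8, first=128, subsequent=64)
sdram_read = burst_cycles(8, first=10, subsequent=2)
sdram_write = burst_cycles(8, first=6, subsequent=2)
print(flash_read, sdram_read, sdram_write)
```

Even under this optimistic model, every off-chip burst costs far more than the one clock per word that on-chip memory can deliver.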

As may be noticed from the above discussion about memory speed, a second

characteristic of memory is that all memory accesses may be pipelined, and

that pipelined memory access is faster than non--pipelined access.  Therefore,

a SwiC soft core should support pipelined operations, but it should also

allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are

expensive.  If you want to have a minimum of logic, cache logic may not be

the highest on the priority list.

In sum, memory is slow.  While one processor on one FPGA may be able to fill

its pipeline, the same processor on another FPGA may struggle to get more than

one instruction at a time into the pipeline.  Any SwiC must be able to deal

with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.

Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily

be created on chip to support the SwiC if necessary.  As an example, a simple

30-bit peripheral could easily support reversing 30-bit numbers: a read from

the peripheral returns its bit--reversed address.  This is cheap within an

FPGA, but expensive in instructions.

Indeed, anything that must be done fast within an FPGA is likely to already

be done--elsewhere in the fabric.  This leaves the CPU with the role of handling

sequential tasks that need a lot of state.

This means that the SwiC needs to live within a unique environment,

separate and different from the traditional SoC.  That isn't to say that a

SwiC cannot be turned into a SoC, just that this SwiC has not been designed

for that purpose.

\section{Lessons Learned}

Now, however, that I've worked on the Zip CPU for a while, it is not nearly

Now, however, that I've worked on the Zip CPU for a while, it is not nearly

as simple as I originally hoped.  Worse, I've had to adjust to create

as simple as I originally hoped.  Worse, I've had to adjust to create

capabilities that I was never expecting to need.  These include:

capabilities that I was never expecting to need.  These include:

\begin{itemize}

\begin{itemize}

\item {\bf External Debug:} Once placed upon an FPGA, some external means is

\item {\bf External Debug:} Once placed upon an FPGA, some external means is

Line 237...

Line 305...

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline

        stalls at all, but rather to require the compiler to properly schedule

        stalls at all, but rather to require the compiler to properly schedule

        all instructions so that stalls would never be necessary.  After trying

        all instructions so that stalls would never be necessary.  After trying

        to build such an architecture, I gave up, having learned some things:

        to build such an architecture, I gave up, having learned some things:

        For example, in  order to facilitate interrupt handling and debug

        First, an ideal pipeline might look something like

        stepping, the CPU needs to know what instructions have finished, and

        Fig.~\ref{fig:ideal-pipeline}.

        which have not.  In other words, it needs to know where it can restart

\begin{figure}

        the pipeline from.  Once restarted, it must act as though it had

\begin{center}

        never stopped.  This killed my idea of delayed branching, since what

\includegraphics[width=4in]{../gfx/fullpline.eps}

        would be the appropriate program counter to restart at?  The one the

\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}

        CPU was going to branch to, or the ones in the delay slots?  This

\end{center}\end{figure}

        also makes the idea of compressed instruction codes difficult, since,

        Notice that, in this figure, all the pipeline stages are complete and

        again, where do you restart on interrupt?

        full.  Every instruction takes one clock and there are no delays.

        However, as the discussion above pointed out, the memory associated

        with a SwiC may not allow single clock access.  It may be instead

        that you can only read every two clocks.  In that case, what shall

        the pipeline look like?  Should it look like

        Fig.~\ref{fig:waiting-pipeline},

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stuttra.eps}

\caption{Instructions wait for each other}\label{fig:waiting-pipeline}

\end{center}\end{figure}

        where instructions are held back until the pipeline is full, or should

        it look like Fig.~\ref{fig:independent-pipeline},

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stuttrb.eps}

\caption{Instructions proceed independently}\label{fig:independent-pipeline}

\end{center}\end{figure}

        where each instruction is allowed to move through the pipeline

        independently?  For better or worse, the Zip CPU allows instructions

        to move through the pipeline independently.

        One approach to avoiding stalls is to use a branch delay slot,

        such as is shown in Fig.~\ref{fig:brdelay}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bdly.eps}

\caption{A typical branch delay slot approach}\label{fig:brdelay}

\end{center}\end{figure}

        In this figure, instructions

        {\tt BR} (a branch), {\tt BD} (a branch delay instruction),

        are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.

        Since it takes a processor a clock cycle to execute a branch, the

        delay slot allows the processor to do something useful in that

        cycle.  The problem the Zip CPU has with this approach is, what

        happens when the pipeline looks like Fig.~\ref{fig:brbroken}?

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bdbroken.eps}

\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}

\end{center}\end{figure}

        In this case, the branch delay slot never gets filled in the first

        place, and so the pipeline squashes it before it gets executed.

        If not that, then what happens when handling interrupts or

        debug stepping: when has the CPU finished an instruction?

        When the {\tt BR} instruction has finished, or must {\tt BD}

        follow every {\tt BR}?  And, again, what if the pipeline isn't

        full?

        These thoughts killed any hopes of doing delayed branching.

        So I switched to a model of discrete execution: Once an instruction

        So I switched to a model of discrete execution: Once an instruction

        enters into either the ALU or memory unit, the instruction is

        enters into either the ALU or memory unit, the instruction is

        guaranteed to complete.  If the logic recognizes a branch or a

        guaranteed to complete.  If the logic recognizes a branch or a

        condition that would render the instruction entering into this stage

        condition that would render the instruction entering into this stage

        possibly inappropriate (i.e. a conditional branch preceding a store

        possibly inappropriate (i.e. a conditional branch preceding a store

        instruction for example), then the pipeline stalls for one cycle

        instruction for example), then the pipeline stalls for one cycle

        until the conditional branch completes.  Then, if it generates a new

        until the conditional branch completes.  Then, if it generates a new

        PC address, the stages preceding are all wiped clean.

        PC address, the stages preceding are all wiped clean.

        The discrete execution model allows such things as sleeping: if the

        This model, however, generated too many pipeline stalls, so the

        discrete execution model was modified to allow instructions to go

        through the ALU unit and be canceled before writeback.  This removed

        the stall associated with ALU instructions before untaken branches.

        The discrete execution model allows such things as sleeping, as

        outlined in Fig.~\ref{fig:sleeping}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/sleep.eps}

\caption{How the CPU halts when sleeping}\label{fig:sleeping}

\end{center}\end{figure}

        If the

        CPU is put to ``sleep,'' the ALU and memory stages stall and back up

        CPU is put to ``sleep,'' the ALU and memory stages stall and back up

        everything before them.  Likewise, anything that has entered the ALU

        everything before them.  Likewise, anything that has entered the ALU

        or memory stage when the CPU is placed to sleep continues to completion.

        or memory stage when the CPU is placed to sleep continues to completion.

        To handle this logic, each pipeline stage has three control signals:

        To handle this logic, each pipeline stage has three control signals:

        a valid signal, a stall signal, and a clock enable signal.  In

        a valid signal, a stall signal, and a clock enable signal.  In

        general, a stage stalls if its contents are valid and the next step

        general, a stage stalls if its contents are valid and the next step

        is stalled.  This allows the pipeline to fill any time a later stage

        is stalled.  This allows the pipeline to fill any time a later stage

        stalls.

        stalls, as illustrated in Fig.~\ref{fig:stacking}.

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/stacking.eps}

\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}

\end{center}\end{figure}

        This approach is also different from other pipeline approaches.  Instead

        This approach is also different from other pipeline approaches.

        of keeping the entire pipeline filled, each stage is treated

        Instead of keeping the entire pipeline filled, each stage is treated

        independently.  Therefore, individual stages may move forward as long

        independently.  Therefore, individual stages may move forward as long

        as the subsequent stage is available, regardless of whether the stage

        as the subsequent stage is available, regardless of whether the stage

        behind it is filled.

        behind it is filled.

\item {\bf Verilog Modules:} When examining how other processors worked

\item {\bf Verilog Modules:} When examining how other processors worked

Line 334...

Line 461...

\begin{table}\begin{center}

\begin{table}\begin{center}

\begin{bitlist}

\begin{bitlist}

31\ldots 11 & R/W & Reserved for future uses\\\hline

31\ldots 11 & R/W & Reserved for future uses\\\hline

10 & R & (Reserved for) Bus-Error Flag\\\hline

10 & R & (Reserved for) Bus-Error Flag\\\hline

9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline

9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline

8 & R & (Reserved for) Illegal Instruction Flag\\\hline

8 & R & Illegal Instruction Flag\\\hline

7 & R/W & Break--Enable\\\hline

7 & R/W & Break--Enable\\\hline

6 & R/W & Step\\\hline

6 & R/W & Step\\\hline

5 & R/W & Global Interrupt Enable (GIE)\\\hline

5 & R/W & Global Interrupt Enable (GIE)\\\hline

4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline

4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline

3 & R/W & Overflow\\\hline

3 & R/W & Overflow\\\hline

Line 399...

Line 526...

This functionality was added to enable an external debugger to

This functionality was added to enable an external debugger to

        set and manage breakpoints.

        set and manage breakpoints.

The ninth bit is reserved for an illegal instruction bit.  When the CPU

The ninth bit is an illegal instruction bit.  When the CPU

tries to execute either a nonexistent instruction, or an instruction from

tries to execute either a nonexistent instruction, or an instruction from

an address that produces a bus error, the CPU will (once implemented) switch

an address that produces a bus error, the CPU will (if implemented) switch

to supervisor mode while setting this bit.  The bit will automatically be

to supervisor mode while setting this bit.  The bit will automatically be

cleared upon any return to user mode.

cleared upon any return to user mode.

The tenth bit is a trap bit.  It is set whenever the user requests a soft

The tenth bit is a trap bit.  It is set whenever the user requests a soft

interrupt, and cleared on any return to userspace command.  This allows the

interrupt, and cleared on any return to userspace command.  This allows the

Line 416...

Line 543...

\begin{table}

\begin{table}

\begin{center}

\begin{center}

\begin{tabular}{l|l}

\begin{tabular}{l|l}

Bit & Meaning \\\hline

Bit & Meaning \\\hline

9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline

9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline

8 & (Reserved for) Floating point enable \\\hline

8 & Illegal instruction error flag \\\hline

7 & Halt on break, to support an external debugger \\\hline

7 & Halt on break, to support an external debugger \\\hline

6 & Step, single step the CPU in user mode\\\hline

6 & Step, single step the CPU in user mode\\\hline

5 & GIE, or Global Interrupt Enable \\\hline

5 & GIE, or Global Interrupt Enable \\\hline

4 & Sleep \\\hline

4 & Sleep \\\hline

3 & V, or overflow bit.\\\hline

3 & V, or overflow bit.\\\hline

Line 452...

Line 579...

\end{center}

\end{center}

\end{table}

\end{table}

There is no condition code for less than or equal, not C or not V.  Sorry,

There is no condition code for less than or equal, not C or not V.  Sorry,

I ran out of space in 3--bits.  Conditioning on a non--supported condition

I ran out of space in 3--bits.  Conditioning on a non--supported condition

is still possible, but it will take an extra instruction and a pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})

is still possible, but it will take an extra instruction and a pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})

As an alternative, the condition may often be reversed, recovering those

extra two clocks.  Thus instead of \hbox{\tt CMP Rx,Ry;}

\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.

Conditionally executed ALU instructions will not further adjust the

Conditionally executed ALU instructions will not further adjust the

condition codes.

condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}

instructions.   Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions

will adjust conditions whenever their conditionals are true.  In this way,

multiple conditions may be evaluated without branches.  For example, to do

something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try

code such as Tbl.~\ref{tbl:dbl-condition}.

\begin{table}\begin{center}

\begin{tabular}{l}

        {\tt CMP 1,R0} \\

        {;\em Condition codes are now set based upon R0-1} \\

        {\tt CMP.Z 2,R1} \\

        {;\em If R0 $\neq$ 1, conditions are unchanged.} \\

        {;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\

        {;\em Now do something based upon the conjunction of both conditions.} \\

        {;\em While we use the example of a STO, it could be any instruction.} \\

        {\tt STO.Z R0,(R2)} \\

\end{tabular}

\caption{An example of a double conditional}\label{tbl:dbl-condition}

\end{center}\end{table}
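
The behavior of this double conditional can be checked with a small Python model. This is only a sketch: it tracks the Z flag alone, the helper names {\tt cmp} and {\tt double\_conditional} are hypothetical, and it assumes, as the comments in the table state, that {\tt CMP 1,R0} sets the flags from R0-1 and that a conditional {\tt CMP} whose condition is false leaves the flags untouched.

```python
def cmp(flags, imm, reg, cond=None):
    """Model CMP imm,reg, tracking only the Z flag."""
    if cond == "Z" and not flags["Z"]:
        return flags                # condition false: flags are unchanged
    return {"Z": (reg - imm) == 0}  # Z is set when reg equals imm


def double_conditional(r0, r1):
    """Return True when the final STO.Z would execute."""
    flags = cmp({"Z": False}, 1, r0)     # CMP 1,R0
    flags = cmp(flags, 2, r1, cond="Z")  # CMP.Z 2,R1
    return flags["Z"]                    # STO.Z R0,(R2) runs only if Z is set


for r0, r1 in [(1, 2), (1, 3), (0, 2)]:
    print(r0, r1, double_conditional(r0, r1))
```

In this model the store executes only when R0 is one and R1 is two, which is the conjunction the table is after.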

\section{Traditional Interrupt Handling}

\section{Operand B}

\section{Operand B}

Many instruction forms have a 21-bit source ``Operand B'' associated with them.

Many instruction forms have a 21-bit source ``Operand B'' associated with them.

This Operand B is either equal to a register plus a signed immediate offset,

This Operand B is either equal to a register plus a signed immediate offset,

or an immediate offset by itself.  This value is encoded as shown in

or an immediate offset by itself.  This value is encoded as shown in

Line 999...

Line 1149...

prefetch cache is loaded to support the next instruction.

prefetch cache is loaded to support the next instruction.

\item While waiting for the pipeline to load following any taken branch, jump,

\item While waiting for the pipeline to load following any taken branch, jump,

        return from interrupt or switch to interrupt context (5 stall cycles)

        return from interrupt or switch to interrupt context (5 stall cycles)

If the PC suddenly changes, the pipeline is subsequently cleared and needs to

Fig.~\ref{fig:bcstalls}

be reloaded.  Given that there are five stages to the pipeline, that accounts

\begin{figure}\begin{center}

for four of the five stalls.  The stall cycle is lost in the pipelined prefetch

\includegraphics[width=3.5in]{../gfx/bc.eps}

stage which needs at least one clock with a valid PC before it can produce

\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}

a new output.

\end{center}\end{figure}

illustrates the situation for a conditional branch.  In this case, the branch

instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so

forth.  However, since the branch is taken, the next instruction must be

{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.

Given that there are five stages to the pipeline, that accounts

for four of the five stalls.  The last stall cycle is lost in the pipelined

prefetch stage which needs at least one clock with a valid PC before it can

produce a new output.  {\Large\bf Note: When I did this myself, I counted

six stall cycles, for a total of seven cycles for this instruction.  Is five

really the right answer?}

The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and

The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and

{\tt LDI \$X,PC} instructions specially, however.  These instructions, when

{\tt LDI \$X,PC} instructions specially, however.  These instructions, when

not conditioned on the flags, can execute with only 3~stall cycles.

not conditioned on the flags, can execute with only 2~stall cycles, such as

is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is

slated to be improved upon in subsequent releases.  With a better prefetch,

it should be possible to drop this down to one or zero stall cycles.}

\begin{figure}\begin{center}

\includegraphics[width=4in]{../gfx/bra.eps}

\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch}

\end{center}\end{figure}

In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction

following the branch in memory, while {\tt IA} is the first instruction at the

branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does

not represent any instruction.)

\item When reading from a prior register while also adding an immediate offset

\item When reading from a prior register while also adding an immediate offset

\begin{enumerate}

\begin{enumerate}

\item\ {\tt OPCODE ?,RA}

\item\ {\tt OPCODE ?,RA}

\item\ {\em (stall)}

\item\ {\em (stall)}

Line 1024...

Line 1195...

opcode that will read and apply an immediate offset by one instruction.  The

opcode that will read and apply an immediate offset by one instruction.  The

good news is that this stall can easily be mitigated by proper scheduling.

good news is that this stall can easily be mitigated by proper scheduling.

That is, any instruction that does not add an immediate to {\tt RA} may be

That is, any instruction that does not add an immediate to {\tt RA} may be

scheduled into the stall slot.

scheduled into the stall slot.

\item When any write to either the CC or PC Register is followed by a memory

\item When any (conditional) write to either the CC or PC Register is followed

        operation

        by a memory operation

\begin{enumerate}

\begin{enumerate}

\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}

\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}

\item\ {\em (stall, even if jump not taken)}

\item\ {\em (stall, even if jump not taken)}

\item\ {\tt LOD \$X(RA),RB}

\item\ {\tt LOD \$X(RA),RB}

\end{enumerate}

\end{enumerate}

A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},

\begin{figure}\begin{center}

\includegraphics[width=2in]{../gfx/bcmem.eps}

\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}

\end{center}\end{figure}

for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which

must be a load here), and ALU instructions {\tt I1} and so forth.

Since branches take place in the writeback stage, the Zip CPU will stall the

Since branches take place in the writeback stage, the Zip CPU will stall the

pipeline for one clock anytime there may be a possible jump.  This prevents

pipeline for one clock anytime there may be a possible jump--forcing the

memory operation to stay in the operand decode stage.  This prevents

an instruction from executing a memory access after the jump but before the

an instruction from executing a memory access after the jump but before the

jump is recognized.

jump is recognized.

This stall may be mitigated by shuffling the operations immediately following

This stall may be mitigated by shuffling the operations immediately following

a potential branch so that an ALU operation follows the branch instead of a

a potential branch so that an ALU operation follows the branch instead of a

Line 1048...

Line 1227...

\item\ {\em (stall)}

\item\ {\em (stall)}

\item\ {\tt TST sys.ccv,CC}

\item\ {\tt TST sys.ccv,CC}

\item\ {\tt BZ somewhere}

\item\ {\tt BZ somewhere}

\end{enumerate}

\end{enumerate}

The reason for this stall is simply performance.  Many of the flags are

The reason for this stall is simply performance: many of the flags are

determined via combinatorial logic after the writeback instruction is

determined via combinatorial logic {\em during} the writeback cycle.

determined.  Trying to then place these into the input for one of the operands

Trying to then place these into the input for one of the operands for an

ALU instruction during the same cycle

created a time delay loop that would no longer execute in a single 100~MHz

created a time delay loop that would no longer execute in a single 100~MHz

clock cycle.  (The time delay of the multiply within the ALU wasn't helping

clock cycle.  (The time delay of the multiply within the ALU wasn't helping

either \ldots).

either \ldots).

This stall may be eliminated via proper scheduling, by placing an instruction

This stall may be eliminated via proper scheduling, by placing an instruction

that does not set flags in between the ALU operation and the instruction

that does not set flags in between the ALU operation and the instruction

that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}

that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}

followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur

followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur

this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}

this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}

will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.

will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since

it doesn't read the condition codes.

\item When waiting for a memory read operation to complete

\item When waiting for a memory read operation to complete

\begin{enumerate}

\begin{enumerate}

\item\ {\tt LOD address,RA}

\item\ {\tt LOD address,RA}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\item\ {\tt OPCODE I+RA,RB}

\item\ {\tt OPCODE I+RA,RB}

\end{enumerate}

\end{enumerate}

Remember, the Zip CPU does not support out of order execution.  Therefore,

Remember, the Zip CPU does not support out of order execution.  Therefore,

anytime the memory unit becomes busy both the memory unit and the ALU must

anytime the memory unit becomes busy both the memory unit and the ALU must

stall until the memory unit is cleared.  This is especially true of a load

stall until the memory unit is cleared.  This is illustrated in

Fig.~\ref{fig:memrd},

\begin{figure}\begin{center}

\includegraphics[width=5in]{../gfx/memrd.eps}

\caption{Pipeline handling of a load instruction}\label{fig:memrd}

\end{center}\end{figure}

and is especially true of a load

instruction, which must still write its operand back to the register file.

instruction, which must still write its operand back to the register file.

Store instructions are different, since they can be busy with no impact on

Note that there is an extra stall at the end of the memory cycle, so that

later ALU write back operations.  Hence, only loads stall the pipeline.

the memory unit will be idle for one clock before an instruction will be

accepted into the ALU.

Store instructions are different, as shown in Fig.~\ref{fig:memwr},

\begin{figure}\begin{center}

\includegraphics[width=5in]{../gfx/memwr.eps}

\caption{Pipeline handling of a store instruction}\label{fig:memwr}

\end{center}\end{figure}

since they can be busy with the bus without impacting later write back

pipeline stages.  Hence, only loads stall the pipeline.

This also assumes that the memory being accessed is a single cycle memory.

This, of course, also assumes that the memory being accessed is a single cycle

memory and that there are no stalls to get to the memory.

Slower memories, such as the Quad SPI flash, will take longer--perhaps even

Slower memories, such as the Quad SPI flash, will take longer--perhaps even

as long as forty clocks.   During this time the CPU and the external bus

as long as forty clocks.   During this time the CPU and the external bus

will be busy, and unable to do anything else.

will be busy, and unable to do anything else.  Likewise, if it takes a couple

of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}

and~\ref{fig:memwr}, there will be stalls.

\item Memory operation followed by a memory operation

\item Memory operation followed by a memory operation

\begin{enumerate}

\begin{enumerate}

\item\ {\tt STO address,RA}

\item\ {\tt STO address,RA}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\item\ {\tt LOD address,RB}

\item\ {\tt LOD address,RB}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}

\end{enumerate}

\end{enumerate}

In this case, the LOD instruction cannot start until the STO is finished.

In this case, the LOD instruction cannot start until the STO is finished,

as illustrated by Fig.~\ref{fig:mstld}.

\begin{figure}\begin{center}

\includegraphics[width=5.5in]{../gfx/mstld.eps}

\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}

\end{center}\end{figure}

With proper scheduling, it is possible to do something in the ALU while the

With proper scheduling, it is possible to do something in the ALU while the

memory unit is busy with the STO instruction, but otherwise this pipeline will

memory unit is busy with the STO instruction, but otherwise this pipeline will

stall waiting for it to complete.

stall while waiting for it to complete before the load instruction can

start.

The Zip CPU does have the capability of supporting pipelined memory access,

The Zip CPU does have the capability of supporting pipelined memory access,

but only under the following conditions: all accesses within the pipeline

but only under the following conditions: all accesses within the pipeline

must all be reads or all be writes, all must use the same register for their

must all be reads or all be writes, all must use the same register for their

address, and there can be no stalls or other instructions between pipelined

address, and there can be no stalls or other instructions between pipelined

memory access instructions.  Further, the offset to memory must be increasing

memory access instructions.  Further, the offset to memory must be increasing

by one address each instruction.  These conditions work well for saving

by one address each instruction.  These conditions work well for saving

registers to, or restoring them from, the stack.

registers to, or restoring them from, the stack.  Indeed, if you noticed, both

Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory

accesses.

\item When waiting for a conditional memory read operation to complete

\item When waiting for a conditional memory read operation to complete

\begin{enumerate}

\begin{enumerate}

\item\ {\tt LOD.Z address,RA}

\item\ {\tt LOD.Z address,RA}

\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}

\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}

Line 1132...

Line 1338...

\caption{Zip System Peripherals}\label{fig:zipsystem}

\caption{Zip System Peripherals}\label{fig:zipsystem}

\end{center}\end{figure}

\end{center}\end{figure}

and described here.  They are designed to make

and described here.  They are designed to make

the Zip CPU more useful in an Embedded Operating System environment.

the Zip CPU more useful in an Embedded Operating System environment.

\section{Interrupt Controller}

\section{Interrupt Controller}\label{sec:pic}

Perhaps the most important peripheral within the Zip System is the interrupt

Perhaps the most important peripheral within the Zip System is the interrupt

controller.  While the Zip CPU itself can only handle one interrupt, and has

controller.  While the Zip CPU itself can only handle one interrupt, and has

only the one interrupt state: disabled or enabled, the interrupt controller

only the one interrupt state: disabled or enabled, the interrupt controller

can make things more interesting.

can make things more interesting.

The Zip System interrupt controller module supports up to 15 interrupts, all

The Zip System interrupt controller module supports up to 15 interrupts, all

controlled from one register.  Bit~31 of the interrupt controller controls

controlled from one register.  Bit~31 of the interrupt controller controls

overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30

overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30

control whether individual interrupts are enabled (1'b0) or disabled (1'b0).

control whether individual interrupts are enabled (1'b1) or disabled (1'b0).

Bit~15 is an indicator showing whether or not any interrupt is active, and

Bit~15 is an indicator showing whether or not any interrupt is active, and

bits~0--14 indicate whether or not an individual interrupt is active.

bits~0--14 indicate whether or not an individual interrupt is active.
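The register layout just described can be sketched in C as a set of decoding helpers.  This is only an illustration: the macro and function names are hypothetical, and it assumes the fifteen individual active flags occupy bits~0--14, with bit~15 reserved for the any--interrupt indicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit positions follow the text above. */
#define PIC_MASTER_ENABLE 0x80000000u /* bit 31: overall interrupt enable  */
#define PIC_ENABLE_SHIFT  16          /* bits 16-30: per-line enable bits  */
#define PIC_ANY_ACTIVE    0x00008000u /* bit 15: some interrupt is active  */
#define PIC_LINE_MASK     0x00007fffu /* bits 0-14 (assumed): active lines */

static bool pic_master_enabled(uint32_t reg)
{
    return (reg & PIC_MASTER_ENABLE) != 0;
}

static bool pic_any_active(uint32_t reg)
{
    return (reg & PIC_ANY_ACTIVE) != 0;
}

/* One bit per line, shifted down so that line n is bit n. */
static uint32_t pic_enabled_lines(uint32_t reg)
{
    return (reg >> PIC_ENABLE_SHIFT) & PIC_LINE_MASK;
}

static uint32_t pic_active_lines(uint32_t reg)
{
    return reg & PIC_LINE_MASK;
}

/* True when the given line is both enabled and active. */
static bool pic_line_pending(uint32_t reg, unsigned line)
{
    assert(line < 15);
    return ((pic_enabled_lines(reg) & pic_active_lines(reg)) >> line) & 1u;
}
```

Keeping the enable and active fields in the same bit order, sixteen bits apart, is what lets a single shift line the two halves up for comparison.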

The interrupt controller has been designed so that bits can be controlled

The interrupt controller has been designed so that bits can be controlled

individually without having any knowledge of the rest of the controller

individually without having any knowledge of the rest of the controller

Line 1165...

Line 1371...

interrupt and clears the active indicator.  This also has the side effect of

interrupt and clears the active indicator.  This also has the side effect of

disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary

disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary

to re-enable any other interrupts.

to re-enable any other interrupts.

The Zip System currently hosts two interrupt controllers, a primary and a

The Zip System currently hosts two interrupt controllers, a primary and a

secondary.  The primary interrupt controller has one interrupt line which may

secondary.  The primary interrupt controller has one interrupt line (perhaps

come from an external interrupt controller, and one interrupt line from the

more if you configure it for more) which may come from an external interrupt

secondary controller.  Other primary interrupts include the system timers,

controller, and one interrupt line from the secondary controller.  Other

the jiffies interrupt, and the manual cache interrupt.  The secondary interrupt

primary interrupts include the system timers, the jiffies interrupt, and the

controller maintains an interrupt state for all of the processor accounting

manual cache interrupt.  The secondary interrupt controller maintains an

counters.

interrupt state for all of the processor accounting counters.

\section{Counter}

\section{Counter}

The Zip Counter is a very simple counter: it just counts.  It cannot be

The Zip Counter is a very simple counter: it just counts.  It cannot be

halted.  When it rolls over, it issues an interrupt.  Writing a value to the

halted.  When it rolls over, it issues an interrupt.  Writing a value to the

Line 1193...

Line 1399...

bit is set when writing to the timer, the timer becomes an interval timer and

bit is set when writing to the timer, the timer becomes an interval timer and

reloads its last start time on any interrupt.  Hence, to mark seconds, one

reloads its last start time on any interrupt.  Hence, to mark seconds, one

might set the timer to 100~million (the number of clocks per second), and

might set the timer to 100~million (the number of clocks per second), and

set the high bit.  Ever after, the timer will interrupt the CPU once per

set the high bit.  Ever after, the timer will interrupt the CPU once per

second (assuming a 100~MHz clock).  This reload capability also limits the

second (assuming a 100~MHz clock).  This reload capability also limits the

maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.

maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),

rather than $2^{32}-1$.
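As a worked example of the arithmetic above, the following C sketch builds the word one might write to a timer and computes the longest period available at a given clock rate.  The names {\tt timer\_word} and {\tt timer\_max\_ms} are hypothetical and chosen for illustration only; the sketch assumes nothing beyond the high bit selecting interval mode and the remaining 31~bits holding the count.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_INTERVAL_BIT 0x80000000u /* high bit: reload on interrupt */
#define TIMER_MAX_COUNT    0x7fffffffu /* 2^31 - 1                      */

/* Build the word to write for a period given in clock ticks. */
static uint32_t timer_word(uint32_t ticks, int interval)
{
    assert(ticks <= TIMER_MAX_COUNT);
    return interval ? (ticks | TIMER_INTERVAL_BIT) : ticks;
}

/* Longest period, in whole milliseconds, at the given clock rate. */
static uint32_t timer_max_ms(uint32_t clock_hz)
{
    assert(clock_hz != 0);
    return (uint32_t)(((uint64_t)TIMER_MAX_COUNT * 1000u) / clock_hz);
}
```

With a 100~MHz clock, a one second interval timer corresponds to {\tt timer\_word(100000000, 1)}, and {\tt timer\_max\_ms(100000000)} evaluates to 21474, matching the figure of about 21~seconds.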

\section{Watchdog Timer}

\section{Watchdog Timer}

The watchdog timer is no different from any of the other timers, save for one

The watchdog timer is no different from any of the other timers, save for one

critical difference: the interrupt line from the watchdog

critical difference: the interrupt line from the watchdog

Line 1207...

Line 1414...

write any other number to it---as with the other timers.

write any other number to it---as with the other timers.

While the watchdog timer supports interval mode, it doesn't make as much sense

While the watchdog timer supports interval mode, it doesn't make as much sense

as it did with the other timers.

as it did with the other timers.

\section{Bus Watchdog}

There is an additional watchdog timer on the Wishbone bus.  This timer,

however, is hardware configured and not software configured.  The timer is

reset at the beginning of any bus transaction, and only counts clocks during

such bus transactions.  If the bus transaction takes longer than the number

of counts the timer allots, it will raise a bus error flag to terminate the

transaction.  This is useful in the case of any peripherals that are

misbehaving.  If the bus watchdog terminates a bus transaction, the CPU may

then read from its port to find out which memory location created the problem.

Aside from its unusual configuration, the bus watchdog is just another

implementation of the fundamental timer described above.
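The behavior of the bus watchdog can be modeled in a few lines of C.  This is a software sketch of the description above, not the hardware implementation: the function name is hypothetical, and the exact clock on which the error is raised is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model one bus transaction lasting `transaction_clocks` clocks.
 * Returns true if the watchdog would raise a bus error, which is
 * assumed here to happen once the count exceeds `timeout`.
 */
static bool bus_watchdog_trips(uint32_t timeout, uint32_t transaction_clocks)
{
    uint32_t count = 0; /* reset at the beginning of the transaction */

    for (uint32_t clk = 0; clk < transaction_clocks; clk++) {
        /* The timer only counts while the transaction is in progress. */
        if (++count > timeout)
            return true;
    }
    return false;
}
```

Because the count restarts with every transaction, a long run of short transactions never trips the watchdog; only a single transaction that overstays its allotment does.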

\section{Jiffies}

\section{Jiffies}

This peripheral is motivated by the Linux use of `jiffies' whereby a process

This peripheral is motivated by the Linux use of `jiffies' whereby a process

can request to be put to sleep until a certain number of `jiffies' have

can request to be put to sleep until a certain number of `jiffies' have

elapsed.  Using this interface, the CPU can read the number of `jiffies'

elapsed.  Using this interface, the CPU can read the number of `jiffies'

Line 1305...

Line 1525...

complete.

complete.

Eventually, I intend to place an operating system onto the ZipSystem; I'm

Eventually, I intend to place an operating system onto the ZipSystem; I'm

just not there yet.

just not there yet.

The rest of this chapter examines some common programming constructs, and

The rest of this chapter examines some common programming models, and how they

how they might be applied to the Zip System.

might be applied to the Zip System, and then finishes with a couple of examples.

\section{System High}

The easiest and simplest way to run the Zip CPU is in the system high mode.

In this mode, the CPU runs your program in supervisor mode from reboot to

power down, and is never interrupted.  You will need to poll the interrupt

controller to determine when any external condition has become active.  This

mode is useful, and can handle many microcontroller tasks.
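A polling loop of this kind might look like the C sketch below.  The function name, the {\tt max\_polls} bound, and the choice to acknowledge by writing the tripped bits back to the controller are all assumptions made for illustration; the controller is passed in as a pointer so that no particular address is implied.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_LINE_MASK 0x00007fffu /* bits 0-14: individual active flags */

/*
 * Poll the interrupt controller until one of the lines in `wanted`
 * trips, giving up after `max_polls` reads.  Returns the tripped
 * lines, or zero on timeout.
 */
static uint32_t poll_pic(volatile uint32_t *pic, uint32_t wanted,
                         unsigned max_polls)
{
    assert((wanted & ~PIC_LINE_MASK) == 0);

    for (unsigned i = 0; i < max_polls; i++) {
        uint32_t tripped = *pic & wanted;
        if (tripped != 0) {
            /* Assumed convention: writing a one to an active bit
             * acknowledges that interrupt. */
            *pic = tripped;
            return tripped;
        }
    }
    return 0;
}
```

Since the CPU is never interrupted in this mode, leaving the master enable bit clear on the acknowledging write costs nothing.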

Even better, in system high mode, all of the user registers are available

to the system high program as variables.  Each access takes a single clock

cycle, moving a user register into the active register set or moving it

back.  While this may look like a load or store instruction, none of these

register accesses will suffer from memory delays.

The one thing that cannot be done in supervisor mode is a wait for interrupt

instruction.  This, however, is easily rectified by jumping to a user task

within the supervisor's memory space, as shown in Tbl.~\ref{tbl:shi-idle}.

\begin{table}\begin{center}

\begin{tabbing}

{\tt supervisor\_idle:} \\

\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\

\>      {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\

\>      {\em ; outside of the CPU's address space when it switches to user} \\

\>      {\em ; mode.} \\

\>      {\tt MOV supervisor\_idle\_continue,uPC} \\

\>      {\em ; Put the processor into user mode and to sleep in the same} \\

\>      {\em ; instruction. } \\

\>      {\tt OR \$SLEEP|\$GIE,CC} \\

{\tt supervisor\_idle\_continue:} \\

\>      {\em ; Now, if we haven't done this inline, we need to return} \\

\>      {\em ; to whatever function called us.} \\

\>      {\tt RETN} \\

\end{tabbing}

\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}

\end{center}\end{table}

\section{Traditional Interrupt Handling}

Although the Zip CPU does not have a traditional interrupt architecture,

it is possible to create the more traditional interrupt approach via software.

In this mode, the programmable interrupt controller is used together with the

supervisor state to create the illusion of more traditional interrupt handling.

To set this up, upon reboot the supervisor task:

\begin{enumerate}

\item Creates a (single) user context, a user stack, and sets the user

        program counter to the entry of the user task

\item Creates a task table of ISR entries

\item Enables the master interrupt enable via the interrupt controller, albeit

        without enabling any of the fifteen potential underlying interrupts.

\item Switches to user mode, as the first part of the while loop in

        Tbl.~\ref{tbl:traditional-isr}.

\end{enumerate}

\begin{table}\begin{center}

\begin{tabbing}

{\tt while(true) \{} \\

\hbox to 0.25in{}\= {\tt rtu();}\\

        \> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\

        \>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\

        \> {\tt \} else \{} \\

        \> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\

\\

        \> \> {\em // Save the user context before running any ISRs.  This could easily be}\\

        \> \> {\em // implemented as an inline assembly routine or macro}\\

        \> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\

        \> \> {\em // At this point, we know an interrupt has taken place:  Ask the programmable}\\

        \> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\

        \> \>   {\tt int        picv = *pic;}\\

        \> \>   {\em // Turn off all active interrupts}\\

        \> \>   {\em // Globally disable interrupt generation in the process}\\

        \> \>   {\tt int        active = (picv >> 16) \& picv \& 0x07fff;}\\

        \> \>   {\tt *pic = (active<<16);}\\

        \> \>   {\em // We build a mask of interrupts to re-enable in picv.}\\

        \> \>   {\tt picv = 0;}\\

        \> \>   {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\

        \> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\

        \> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\

        \> \>\>\>       {\em // Acknowledge this particular interrupt.  While we could acknowledge all}\\

        \> \>\>\>       {\em // interrupts at once, by acknowledging only those with ISRs we allow}\\

        \> \>\>\>       {\em // the user process to use peripherals manually, and to manually check}\\

        \> \>\>\>       {\em // whether or not those other interrupts had occurred.}\\

        \> \>\>\>       {\tt *pic = msk; }\\

        \> \>\>\>       {\tt rtu(); }\\

        \> \>\>\>       {\em // The ISR will only exit on a trap in the Zip architecture.  There is}\\

        \> \>\>\>       {\em // no {\tt RETI} instruction.  Since the PIC holds all interrupts disabled,}\\

        \> \>\>\>       {\em // there is no need to check for further interrupts.}\\

        \> \>\>\>       {\em // }\\

        \> \>\>\>       {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\

        \>\>\>\>        {\em // re-enable its own interrupt without re-enabling all interrupts.  Hence, we}\\

        \>\>\>\>        {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\

        \> \>\>\>       {\em // re-enabled. }\\

        \> \>\>\>       {\tt mov(uR0,tmp); }\\

        \> \>\>\>       {\tt picv |= (tmp \& 0x7fff) << 16; }\\

        \> \>\>         {\tt \} }\\

        \> \>   {\tt \} }\\

        \> \>   {\tt RESTORE\_PARTIAL\_CONTEXT; }\\

        \> \>   {\em // Re-activate all (requested) interrupts }\\

        \> \>   {\tt *pic = picv | 0x80000000; }\\

        \>{\tt \} }\\

{\tt \}}\\

\end{tabbing}

\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}

\end{center}\end{table}

We can work through the interrupt handling process by examining

Tbl.~\ref{tbl:traditional-isr}.  First, remember, the CPU is always running

either the user or the supervisor context.  Once the supervisor switches to

user mode, control does not return until either an interrupt or a trap

has taken place.  (Okay, there's also the possibility of a bus error, or an

illegal instruction such as an unimplemented floating point instruction---but

for now we'll just focus on the trap instruction.)  Therefore, if the trap bit

isn't set, then we know an interrupt has taken place.

To process an interrupt, we steal the user's stack: the R0, CC, and PC registers

are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.

\begin{table}\begin{center}

\begin{tabbing}

SAVE\_PARTIAL\_CONTEXT: \\

\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\

\>        {\tt MOV -3(uSP),R3} \\

\>        {\tt MOV uR0,R0} \\

\>        {\tt MOV uCC,R1} \\

\>        {\tt MOV uPC,R2} \\

\>        {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\

\>        {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\

\>        {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\

\>        {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\

\end{tabbing}

\caption{Example Saving Minimal User Context}\label{tbl:save-partial}

\end{center}\end{table}

This is much cheaper than the full context swap of a preemptive multitasking

kernel, but it also depends upon the ISR saving any state it uses.  Further,

if multiple ISRs get called at once, this loses its optimality property

very quickly.

As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which

interrupts are enabled, and the bottom stores which have tripped.  (Interrupts

may trip without being enabled; they just will not generate an interrupt to the

CPU.)  Our first step is to query the register to find out our interrupt

state, and then to disable any interrupts that have tripped.  To do

that, we write a one to the enable half of the register while also clearing

the top bit (master interrupt enable).  This has the consequence of disabling

any and all further interrupts, not just the ones that have tripped.  Hence,

upon completion, we re--enable the master interrupt bit again.   Finally,

we keep track of which interrupts have tripped.
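The three register words involved in this step can be worked out in C as follows.  The helper names are hypothetical, and the sketch assumes only the layout described in Sec.~\ref{sec:pic}: enables in bits~16--30, tripped flags in the low fifteen bits, and the master enable in bit~31.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u /* bit 31 */
#define PIC_LINE_MASK     0x00007fffu /* fifteen interrupt lines */

/* Lines that are both enabled (upper half) and tripped (lower half). */
static uint32_t pic_active(uint32_t picv)
{
    return (picv >> 16) & picv & PIC_LINE_MASK;
}

/* Word that disables the tripped lines.  Bit 31 stays clear, which
 * also turns off the master enable as the text describes. */
static uint32_t pic_disable_word(uint32_t active)
{
    assert((active & ~PIC_LINE_MASK) == 0);
    return active << 16;
}

/* Word that re-enables the requested lines and the master enable. */
static uint32_t pic_reenable_word(uint32_t requested)
{
    return ((requested & PIC_LINE_MASK) << 16) | PIC_MASTER_ENABLE;
}
```

For instance, a register value of {\tt 0x80038002} has lines 0 and 1 enabled and line 1 tripped, so {\tt pic\_active} returns {\tt 0x2}, the disabling word is {\tt 0x00020000}, and re--enabling line 1 afterwards writes {\tt 0x80020000}.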

Using the bit mask of interrupts that have tripped, we walk through all fifteen

possible interrupts.  If there is an ISR installed, we acknowledge and reset

the interrupt within the PIC, and then call the ISR.  The ISR, however, cannot

re--enable its interrupt without re-enabling the master interrupt bit.  Thus,

to keep things simple, when the ISR is finished it places its interrupt

mask back into R0, or clears R0.  This tells the supervisor mode process which

interrupts to re--enable.  Any other registers that the ISR uses must be

saved and restored.  (This is only truly optimal if only a single ISR is

called.)  As a final instruction, the ISR clears the GIE bit, executing a user

trap.  (Remember, the Zip CPU has no {\tt RETI} instruction to restore the

stack and return to userland.  It needs to go through the supervisor mode to

get there.)

Then, once all interrupts are handled, the user context is restored in  a

fashion similar to Tbl.~\ref{tbl:restore-partial}.

\begin{table}\begin{center}

\begin{tabbing}

RESTORE\_PARTIAL\_CONTEXT: \\

\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\

\>        {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\

\>        {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\

\>        {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\

\>        {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\

\>        {\tt MOV R0,uR0} \\

\>        {\tt MOV R1,uCC} \\

\>        {\tt MOV R2,uPC} \\

\>        {\tt MOV 3(R3),uSP} \\

\end{tabbing}

\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}

\end{center}\end{table}

Again, this is short and sweet simply because any other registers that needed

saving were saved within the ISR.

There you have it: the Zip CPU, with its non-traditional interrupt architecture,

can still process interrupts in a very traditional fashion.

\section{Example: Idle Task}

\section{Example: Idle Task}

One task every operating system needs is the idle task, the task that takes

One task every operating system needs is the idle task, the task that takes

place when nothing else can run.  On the Zip CPU, this task is quite simple,

place when nothing else can run.  On the Zip CPU, this task is quite simple,

and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.

and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.

Line 1379...

Line 1779...

data types.  The Zip CPU does not have explicit support for smaller or larger

data types.  The Zip CPU does not have explicit support for smaller or larger

data types, and so this memory copy cannot be applied at a byte level.

data types, and so this memory copy cannot be applied at a byte level.

Third, we've optimized the conditional jump to a return instruction into a

Third, we've optimized the conditional jump to a return instruction into a

conditional return instruction.

conditional return instruction.

\section{Context Switch}

\section{Example: Context Switch}

Fundamental to any multiprocessing system is the ability to switch from one

Fundamental to any multiprocessing system is the ability to switch from one

task to the next.  In the ZipSystem, this is accomplished in one of a couple

task to the next.  In the ZipSystem, this is accomplished in one of a couple

of ways.  The first step is that an interrupt happens.  Anytime an interrupt

of ways.  The first step is that an interrupt happens.  Anytime an interrupt

happens, the CPU needs to execute the following tasks in supervisor mode:

happens, the CPU needs to execute the following tasks in supervisor mode:

Line 2049...

Line 2449...

\item The Zip CPU is light weight and fully featured as it exists today. For

\item The Zip CPU is light weight and fully featured as it exists today. For

        anyone who wishes to build a general purpose CPU and then to

        anyone who wishes to build a general purpose CPU and then to

        experiment with building and adding particular features, the Zip CPU

        experiment with building and adding particular features, the Zip CPU

        makes a good starting point--it is fairly simple. Modifications should

        makes a good starting point--it is fairly simple. Modifications should

        be simple enough.

        be simple enough.

\item As an estimate of the ``weight'' of this implementation, the CPU has

        cost me less than 150 hours to implement from its inception.

\item The Zip CPU was designed to be an implementable soft core that could be

\item The Zip CPU was designed to be an implementable soft core that could be

        placed within an FPGA, controlling actions internal to the FPGA. It

        placed within an FPGA, controlling actions internal to the FPGA. It

        fits this role rather nicely. It does not fit the role of a system on

        fits this role rather nicely. It does not fit the role of a system on

        a chip very well, but then it was never intended to be a system on a

        a chip very well, but then it was never intended to be a system on a

        chip but rather a system within a chip.

        chip but rather a system within a chip.

Line 2067...

Line 2465...

        to do with two exceptions: bytewise character access and accelerated

        to do with two exceptions: bytewise character access and accelerated

        floating-point support.

        floating-point support.

\item This simplified instruction set is easy to decode.

\item This simplified instruction set is easy to decode.

\item The simplified bus transactions (32-bit words only) were also very easy

\item The simplified bus transactions (32-bit words only) were also very easy

        to implement.

        to implement.

\item The pipelined load/store approach is novel, and can be used to greatly

        increase the speed of the processor.

\item The novel approach of having a single interrupt vector, which just

\item The novel approach of having a single interrupt vector, which just

        brings the CPU back to the instruction it left off at within the last

        brings the CPU back to the instruction it left off at within the last

        interrupt context doesn't appear to have been that much of a problem.

        interrupt context doesn't appear to have been that much of a problem.

        If most modern systems handle interrupt vectoring in software anyway,

        If most modern systems handle interrupt vectoring in software anyway,

        why maintain hardware support for it?

        why maintain hardware support for it?

Line 2096...

Line 2496...

        Still, \ldots it's not bad, it's just not astonishingly good.

        Still, \ldots it's not bad, it's just not astonishingly good.

\item The fact that the instruction width equals the bus width means that the

\item The fact that the instruction width equals the bus width means that the

        instruction fetch cycle will always be interfering with any load or

        instruction fetch cycle will always be interfering with any load or

        store memory operation, with the only exception being if the

        store memory operation, with the only exception being if the

        instruction is already in the cache.  {\em This has become the

        instruction is already in the cache.

        fundamental limit on the speed and performance of the CPU!}

        Those familiar with the von Neumann approach of sharing a bus

        between data and instructions will not be surprised by this assessment.

        This could be fixed in one of three ways: the instruction set

        This could be fixed in one of three ways: the instruction set

        architecture could be modified to handle Very Long Instruction Words

        architecture could be modified to handle Very Long Instruction Words

        (VLIW) so that each 32--bit word would encode two or more instructions,

        (VLIW) so that each 32--bit word would encode two or more instructions,

        the instruction fetch bus width could be increased from 32--bits to

        the instruction fetch bus width could be increased from 32--bits to

Line 2144...

Line 2541...

        This may also be written off as a ``feature'' of the Zip CPU, since

        This may also be written off as a ``feature'' of the Zip CPU, since

        the addition of a data cache can greatly increase the LUT count of

        the addition of a data cache can greatly increase the LUT count of

        a soft core.

        a soft core.

        The Zip CPU compensates for this via its pipelined load and store

        instructions.

\item Many other instruction sets offer three operand instructions, whereas

\item Many other instruction sets offer three operand instructions, whereas

        the Zip CPU only offers two operand instructions. This means that it

        the Zip CPU only offers two operand instructions. This means that it

        takes the Zip CPU more instructions to do many of the same operations.

        takes the Zip CPU more instructions to do many of the same operations.

        The good part of this is that it gives the Zip CPU a greater amount of

        The good part of this is that it gives the Zip CPU a greater amount of

        flexibility in its immediate operand mode, although that increased

        flexibility in its immediate operand mode, although that increased

Line 2369...

Line 2769...

% Appendices

% Appendices

% Index

% Index

\end{document}

\end{document}

% Symbol table relocation types:

% Only 3-types of instructions truly need relocations: those that modify the

% PC register, and those that access memory.

% -     LDI     Addr,Rx         // Loads an absolute address into Rx, 24 bits

% -     LDILO   Addr,Rx         // Loads an absolute address into Rx, 32 bits

%       LDIHI   Addr,Rx         //   requires two instructions

% -     JMP     Rx              // Jump to any address in Rx

%                       // Can be prefixed with two instructions to load Rx

%                       // from any 32-bit immediate

% -     JMP     #Addr           // Jump to any 24'bit (signed) address, 23'b uns

% -     ADD     x,PC            // Any PC relative jump (20 bits)

% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)

% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,

%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction

% -     LOD.C   Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond

% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid

% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table

%       BRA     +1              // memory address.  The BRA +1 can be skipped,

%       .WORD   Addr            // but only if the address is placed at the end

%       LOD     -2(PC),PC       // of an executable section

 No newline at end of file

 No newline at end of file


[/] [zipcpu/] [trunk/] [doc/] [src/] [spec.tex] - Diff between revs 39 and 68