URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

# Subversion Repositories zipcpu

## Compare Revisions

• This comparison shows the changes necessary to convert path
/zipcpu/trunk/doc/src
from Rev 39 to Rev 68

## Rev 39 → Rev 68

/spec.tex
44,11 → 44,13
 %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass{gqtekspec} \usepackage{import} % \graphicspath{{../gfx}} \project{Zip CPU} \title{Specification} \author{Dan Gisselquist, Ph.D.} \email{dgisselq (at) opencores.org} \revision{Rev.~0.5} \revision{Rev.~0.6} \definecolor{webred}{rgb}{0.2,0,0} \definecolor{webgreen}{rgb}{0,0.2,0} \usepackage[dvips,ps2pdf,colorlinks=true,
76,6 → 78,7
 copy. \end{license} \begin{revisionhistory} 0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline 0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline 0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline 0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
167,6 → 170,71
  contact me.} \end{itemize}

The Zip CPU also has one unique feature: the ability to do pipelined loads and stores. This allows the CPU to access on-chip memory at one access per clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a Chip,'' or a SwiC. SwiCs have some unique properties internal to them that have influenced the design of the Zip CPU. Among these are the bus, memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture. This allows these other CPUs to have two separate bus structures: one for the program fetch, and the other for the memory. The Zip CPU is fairly unique in its approach because it uses a von Neumann architecture. This was done for simplicity. By using a von Neumann architecture, only one bus needs to be implemented within any FPGA. This helps to minimize real-estate, while maintaining a high clock speed. The disadvantage is that it can severely degrade the overall instructions per clock count.

Soft cores within an FPGA have an additional characteristic regarding memory access: it is slow. Memory on chip may be accessed at a single cycle per access, but small FPGAs have a limited amount of memory on chip. Going off chip, however, is expensive. Two examples illustrate this point. On the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word, or 64~cycles per subsequent word in a pipelined architecture. Likewise, the SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles per read, and 2~cycles for any subsequent pipelined access read or write. Either way you look at it, this memory access will be slow, and this doesn't account for any logic delays should the bus implementation logic get complicated.

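To put the XuLA2 figures quoted above into perspective, the following is a minimal sketch of the cycle cost of a burst transfer. It assumes a simple model that the specification does not state explicitly: the first word pays the full access cost and every later word in the same pipelined burst pays only the "subsequent" cost. The function and constant names are hypothetical and exist only for this illustration.

```python
def burst_cycles(first_word, next_word, n_words):
    """Cycles to move n_words, assuming the first word costs first_word
    cycles and each later pipelined word costs next_word cycles."""
    if n_words <= 0:
        return 0
    return first_word + next_word * (n_words - 1)


# (first access, subsequent pipelined access) in clock cycles, as quoted
# in the text for the XuLA2 board. The tuple layout is an assumption.
XULA2_FLASH_READ = (128, 64)
XULA2_SDRAM_READ = (10, 2)
XULA2_SDRAM_WRITE = (6, 2)

if __name__ == "__main__":
    devices = [
        ("flash read", XULA2_FLASH_READ),
        ("sdram read", XULA2_SDRAM_READ),
        ("sdram write", XULA2_SDRAM_WRITE),
    ]
    for name, (first, nxt) in devices:
        for n in (1, 4, 16):
            print(f"{name}: {n} words -> {burst_cycles(first, nxt, n)} cycles")
```

Under this model a sixteen-word SDRAM read costs far less per word than sixteen independent reads would, which is the motivation for supporting pipelined access.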
As may be noticed from the above discussion about memory speed, a second characteristic of memory is that all memory accesses may be pipelined, and that pipelined memory access is faster than non--pipelined access. Therefore, a SwiC soft core should support pipelined operations, but it should also allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are expensive. If you want to have a minimum of logic, cache logic may not be the highest on the priority list.

In sum, memory is slow. While one processor on one FPGA may be able to fill its pipeline, the same processor on another FPGA may struggle to get more than one instruction at a time into the pipeline. Any SwiC must be able to deal with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals. Specifically, FPGAs are highly reconfigurable. Soft peripherals can easily be created on chip to support the SwiC if necessary. As an example, a simple 30-bit peripheral could easily support reversing 30-bit numbers: a read from the peripheral returns its bit--reversed address. This is cheap within an FPGA, but expensive in instructions.

Indeed, anything that must be done fast within an FPGA is likely to already be done--elsewhere in the fabric. This leaves the CPU with the role of handling sequential tasks that need a lot of state.

This means that the SwiC needs to live within a unique environment, separate and different from the traditional SoC. That isn't to say that a SwiC cannot be turned into a SoC, just that this SwiC has not been designed for that purpose.

\section{Lessons Learned}

Now, however, that I've worked on the Zip CPU for a while, it is not nearly as simple as I originally hoped. Worse, I've had to adjust to create capabilities that I was never expecting to need. These include:
239,16 → 307,60
  all instructions so that stalls would never be necessary. After trying to build such an architecture, I gave up, having learned some things:

 For example, in order to facilitate interrupt handling and debug stepping, the CPU needs to know what instructions have finished, and which have not. In other words, it needs to know where it can restart the pipeline from. Once restarted, it must act as though it had never stopped. This killed my idea of delayed branching, since what would be the appropriate program counter to restart at? The one the CPU was going to branch to, or the ones in the delay slots? This also makes the idea of compressed instruction codes difficult, since, again, where do you restart on interrupt?

 First, an ideal pipeline might look something like Fig.~\ref{fig:ideal-pipeline}.
\begin{figure} \begin{center} \includegraphics[width=4in]{../gfx/fullpline.eps} \caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline} \end{center}\end{figure}
 Notice that, in this figure, all the pipeline stages are complete and full. Every instruction takes one clock and there are no delays. However, as the discussion above pointed out, the memory associated with a SwiC may not allow single clock access. It may be instead that you can only read every two clocks. In that case, what shall the pipeline look like? Should it look like Fig.~\ref{fig:waiting-pipeline},
\begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/stuttra.eps} \caption{Instructions wait for each other}\label{fig:waiting-pipeline} \end{center}\end{figure}
 where instructions are held back until the pipeline is full, or should it look like Fig.~\ref{fig:independent-pipeline},
\begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/stuttrb.eps} \caption{Instructions proceed independently}\label{fig:independent-pipeline} \end{center}\end{figure}
 where each instruction is allowed to move through the pipeline independently?
For better or worse, the Zip CPU allows instructions  to move through the pipeline independently.    One approach to avoiding stalls is to use a branch delay slot,  such as is shown in Fig.~\ref{fig:brdelay}. \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/bdly.eps} \caption{A typical branch delay slot approach}\label{fig:brdelay} \end{center}\end{figure}  In this figure, instructions  {\tt BR} (a branch), {\tt BD} (a branch delay instruction),  are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.  Since it takes a processor a clock cycle to execute a branch, the  delay slot allows the processor to do something useful in that  branch. The problem the Zip CPU has with this approach is, what  happens when the pipeline looks like Fig.~\ref{fig:brbroken}? \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/bdbroken.eps} \caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken} \end{center}\end{figure}  In this case, the branch delay slot never gets filled in the first  place, and so the pipeline squashes it before it gets executed.  If not that, then what happens when handling interrupts or  debug stepping: when has the CPU finished an instruction?  When the {\tt BR} instruction has finished, or must {\tt BD}  follow every {\tt BR}? and, again, what if the pipeline isn't  full?  These thoughts killed any hopes of doing delayed branching.    So I switched to a model of discrete execution: Once an instruction  enters into either the ALU or memory unit, the instruction is  guaranteed to complete. If the logic recognizes a branch or a
258,7 → 370,18
  until the conditional branch completes. Then, if it generates a new  PC address, the stages preceding are all wiped clean.    The discrete execution model allows such things as sleeping: if the  This model, however, generated too many pipeline stalls, so the  discrete execution model was modified to allow instructions to go  through the ALU unit and be canceled before writeback. This removed  the stall associated with ALU instructions before untaken branches.    The discrete execution model allows such things as sleeping, as  outlined in Fig.~\ref{fig:sleeping}.  \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/sleep.eps} \caption{How the CPU halts when sleeping}\label{fig:sleeping} \end{center}\end{figure}  If the  CPU is put to ``sleep,'' the ALU and memory stages stall and back up  everything before them. Likewise, anything that has entered the ALU  or memory stage when the CPU is placed to sleep continues to completion.
266,10 → 389,14
  a valid signal, a stall signal, and a clock enable signal. In  general, a stage stalls if its contents are valid and the next step  is stalled. This allows the pipeline to fill any time a later stage  stalls.  stalls, as illustrated in Fig.~\ref{fig:stacking}. \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/stacking.eps} \caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking} \end{center}\end{figure}    This approach is also different from other pipeline approaches. Instead  of keeping the entire pipeline filled, each stage is treated  This approach is also different from other pipeline approaches.   Instead of keeping the entire pipeline filled, each stage is treated  independently. Therefore, individual stages may move forward as long  as the subsequent stage is available, regardless of whether the stage  behind it is filled.
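The stall rule described in this hunk (a stage stalls only if its contents are valid and the next stage is stalled) can be sketched as a small software model. This is an illustration under simplifying assumptions, not the Zip CPU's actual logic: each stage holds either an instruction name or `None` (not valid), the clock enable signal is not modelled, and the function name `step` and the `last_stalled` argument are hypothetical.

```python
def step(stages, incoming=None, last_stalled=False):
    """Advance a toy pipeline by one clock.

    stages       -- list of instruction names, None meaning "not valid";
                    index 0 is the first stage, the last index retires.
    incoming     -- instruction offered to stage 0 this clock, if any.
    last_stalled -- True if the final stage cannot retire this clock.
    Returns (new_stages, retired, accepted).
    """
    n = len(stages)
    stalled = [False] * n
    downstream_stalled = last_stalled
    # A stage stalls only if it holds valid contents AND the stage after
    # it is stalled, so stalls propagate backwards until they hit a bubble.
    for i in reversed(range(n)):
        stalled[i] = stages[i] is not None and downstream_stalled
        downstream_stalled = stalled[i]

    new_stages = [None] * n
    retired = None
    for i in range(n):
        if stages[i] is None:
            continue
        if stalled[i]:
            new_stages[i] = stages[i]      # hold in place
        elif i == n - 1:
            retired = stages[i]            # leaves the pipeline
        else:
            new_stages[i + 1] = stages[i]  # move forward independently

    accepted = incoming is not None and not stalled[0]
    if accepted:
        new_stages[0] = incoming
    return new_stages, retired, accepted
```

Starting from `["I3", None, "I1"]` with the last stage stalled, one call to `step` moves `I3` into the empty middle stage and accepts a new instruction behind it, so instructions stack up behind the stalled one. A second call with the stall still asserted changes nothing and refuses the incoming instruction.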
336,7 → 463,7
 31\ldots 11 & R/W & Reserved for future uses\\\hline 10 & R & (Reserved for) Bus-Error Flag\\\hline 9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline 8 & R & (Reserved for) Illegal Instruction Flag\\\hline 8 & R & Illegal Instruction Flag\\\hline 7 & R/W & Break--Enable\\\hline 6 & R/W & Step\\\hline 5 & R/W & Global Interrupt Enable (GIE)\\\hline
401,9 → 528,9
 This functionality was added to enable an external debugger to  set and manage breakpoints.   The ninth bit is reserved for an illegal instruction bit. When the CPU The ninth bit is an illegal instruction bit. When the CPU tries to execute either a nonexistent instruction, or an instruction from an address that produces a bus error, the CPU will (once implemented) switch an address that produces a bus error, the CPU will (if implemented) switch to supervisor mode while setting this bit. The bit will automatically be cleared upon any return to user mode. 
418,7 → 545,7
 \begin{tabular}{l|l} Bit & Meaning \\\hline 9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline 8 & (Reserved for) Floating point enable \\\hline 8 & Illegal instruction error flag \\\hline 7 & Halt on break, to support an external debugger \\\hline 6 & Step, single step the CPU in user mode\\\hline 5 & GIE, or Global Interrupt Enable \\\hline
454,10 → 581,33
 There is no condition code for less than or equal, not C or not V. Sorry, I ran out of space in 3--bits. Conditioning on a non--supported condition is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)}) As an alternative, the condition may often be reversed, recovering those extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;} \hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.   Conditionally executed ALU instructions will not further adjust the  condition codes. condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST} instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust conditions whenever their conditionals are true. In this way, multiple conditions may be evaluated without branches. For example, to do something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try code such as Tbl.~\ref{tbl:dbl-condition}. \begin{table}\begin{center} \begin{tabular}{l}  {\tt CMP 1,R0} \\  {;\em Condition codes are now set based upon R0-1} \\  {\tt CMP.Z 2,R1} \\  {;\em If R0$\neq$1, conditions are unchanged.} \\  {;\em If R0$=$1, conditions are set based upon R1-2.} \\  {;\em Now do something based upon the conjunction of both conditions.} \\  {;\em While we use the example of a STO, it could be any instruction.} \\  {\tt STO.Z R0,(R2)} \\ \end{tabular} \caption{An example of a double conditional}\label{tbl:dbl-condition} \end{center}\end{table}   \section{Traditional Interrupt Handling}   \section{Operand B} Many instruction forms have a 21-bit source Operand B'' associated with them.  
This Operand B is either equal to a register plus a signed immediate offset,
1001,15 → 1151,36
 \item While waiting for the pipeline to load following any taken branch, jump, return from interrupt or switch to interrupt context (5 stall cycles)   If the PC suddenly changes, the pipeline is subsequently cleared and needs to be reloaded. Given that there are five stages to the pipeline, that accounts for four of the five stalls. The stall cycle is lost in the pipelined prefetch stage which needs at least one clock with a valid PC before it can produce a new output. Fig.~\ref{fig:bcstalls} \begin{figure}\begin{center} \includegraphics[width=3.5in]{../gfx/bc.eps} \caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls} \end{center}\end{figure} illustrates the situation for a conditional branch. In this case, the branch instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so forth. However, since the branch is taken, the next instruction must be {\tt IA}. Therefore, the pipeline needs to be cleared and reloaded. Given that there are five stages to the pipeline, that accounts for four of the five stalls. The last stall cycle is lost in the pipelined prefetch stage which needs at least one clock with a valid PC before it can produce a new output. {\Large\bf Note: When I did this myself, I counted six stall cycles, for a total of seven cycles for this instruction. Is five really the right answer?}   The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and {\tt LDI \$X,PC} instructions specially, however. These instructions, when not conditioned on the flags, can execute with only 3~stall cycles. not conditioned on the flags, can execute with only 2~stall cycles, such as is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is slated to be improved upon in subsequent releases. 
With a better prefetch, it should be possible to drop this down to one or zero stall cycles.} \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/bra.eps} \caption{An expedited delay costs only 2~stall cycles}\label{fig:branch} \end{center}\end{figure} In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction following the branch in memory, while {\tt IA} is the first instruction at the branch address. ({\tt CLR} denotes a clear--pipeline operation, and does not represent any instruction.)   \item When reading from a prior register while also adding an immediate offset \begin{enumerate}
1026,15 → 1197,23
 That is, any instruction that does not add an immediate to {\tt RA} may be scheduled into the stall slot.   \item When any write to either the CC or PC Register is followed by a memory  operation \item When any (conditional) write to either the CC or PC Register is followed  by a memory operation \begin{enumerate} \item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode} \item\ {\em (stall, even if jump not taken)} \item\ {\tt LOD \$X(RA),RB} \end{enumerate} A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem}, \begin{figure}\begin{center} \includegraphics[width=2in]{../gfx/bcmem.eps} \caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem} \end{center}\end{figure} for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which must be a load here), and ALU instructions {\tt I1} and so forth. Since branches take place in the writeback stage, the Zip CPU will stall the pipeline for one clock anytime there may be a possible jump. This prevents pipeline for one clock anytime there may be a possible jump--forcing the  memory operation to stay in the operand decode stage. This prevents an instruction from executing a memory access after the jump but before the jump is recognized.
1050,9 → 1229,10
 \item\ {\tt BZ somewhere} \end{enumerate}   The reason for this stall is simply performance. Many of the flags are determined via combinatorial logic after the writeback instruction is determined. Trying to then place these into the input for one of the operands The reason for this stall is simply performance: many of the flags are determined via combinatorial logic {\em during} the writeback cycle. Trying to then place these into the input for one of the operands for an ALU instruction during the same cycle created a time delay loop that would no longer execute in a single 100~MHz clock cycle. (The time delay of the multiply within the ALU wasn't helping either \ldots).  
1062,7 → 1242,8
 that references the CC register. For example, {\tt MOV \$addr+PC,uPC} followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC} will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not. will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since it doesn't read the condition codes.   \item When waiting for a memory read operation to complete \begin{enumerate}
1073,15 → 1254,32
 Remember, the Zip CPU does not support out of order execution. Therefore, anytime the memory unit becomes busy both the memory unit and the ALU must stall until the memory unit is cleared. This is especially true of a load stall until the memory unit is cleared. This is illustrated in Fig.~\ref{fig:memrd}, \begin{figure}\begin{center} \includegraphics[width=5in]{../gfx/memrd.eps} \caption{Pipeline handling of a load instruction}\label{fig:memrd} \end{center}\end{figure} since it is especially true of a load instruction, which must still write its operand back to the register file.  Store instructions are different, since they can be busy with no impact on later ALU write back operations. Hence, only loads stall the pipeline. Note that there is an extra stall at the end of the memory cycle, so that the memory unit will be idle for one clock before an instruction will be accepted into the ALU. Store instructions are different, as shown in Fig.~\ref{fig:memwr}, \begin{figure}\begin{center} \includegraphics[width=5in]{../gfx/memwr.eps} \caption{Pipeline handling of a store instruction}\label{fig:memwr} \end{center}\end{figure} since they can be busy with the bus without impacting later write back pipeline stages. Hence, only loads stall the pipeline.   This also assumes that the memory being accessed is a single cycle memory. 
This, of course, also assumes that the memory being accessed is a single cycle memory and that there are no stalls to get to the memory. Slower memories, such as the Quad SPI flash, will take longer--perhaps even as long as forty clocks. During this time the CPU and the external bus  will be busy, and unable to do anything else. will be busy, and unable to do anything else. Likewise, if it takes a couple of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd} and~\ref{fig:memwr}, there will be stalls.   \item Memory operation followed by a memory operation \begin{enumerate}
1091,10 → 1289,16
 \item\ {\em (multiple stalls, bus dependent, 4 clocks best)} \end{enumerate}   In this case, the LOD instruction cannot start until the STO is finished. In this case, the LOD instruction cannot start until the STO is finished, as illustrated by Fig.~\ref{fig:mstld}. \begin{figure}\begin{center} \includegraphics[width=5.5in]{../gfx/mstld.eps} \caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld} \end{center}\end{figure} With proper scheduling, it is possible to do something in the ALU while the memory unit is busy with the STO instruction, but otherwise this pipeline will stall waiting for it to complete. stall while waiting for it to complete before the load instruction can start.   The Zip CPU does have the capability of supporting pipelined memory access, but only under the following conditions: all accesses within the pipeline
1102,7 → 1306,9
 address, and there can be no stalls or other instructions between pipelined memory access instructions. Further, the offset to memory must be increasing by one address each instruction. These conditions work well for saving or storing registers to the stack. storing registers to the stack. Indeed, if you noticed, both Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory accesses.   
\item When waiting for a conditional memory read operation to complete \begin{enumerate}
1134,7 → 1340,7
 and described here. They are designed to make the Zip CPU more useful in an Embedded Operating System environment.   \section{Interrupt Controller} \section{Interrupt Controller}\label{sec:pic}   Perhaps the most important peripheral within the Zip System is the interrupt controller. While the Zip CPU itself can only handle one interrupt, and has
1144,7 → 1350,7
 The Zip System interrupt controller module supports up to 15 interrupts, all controlled from one register. Bit~31 of the interrupt controller controls overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30 control whether individual interrupts are enabled (1'b0) or disabled (1'b0). control whether individual interrupts are enabled (1'b1) or disabled (1'b0). Bit~15 is an indicator showing whether or not any interrupt is active, and  bits~0--14 indicate whether or not an individual interrupt is active.
1167,12 → 1373,12
 to re-enable any other interrupts.   The Zip System currently hosts two interrupt controllers, a primary and a  secondary. The primary interrupt controller has one interrupt line which may come from an external interrupt controller, and one interrupt line from the secondary controller. Other primary interrupts include the system timers, the jiffies interrupt, and the manual cache interrupt. The secondary interrupt controller maintains an interrupt state for all of the processor accounting counters. secondary. The primary interrupt controller has one interrupt line (perhaps more if you configure it for more) which may come from an external interrupt controller, and one interrupt line from the secondary controller. Other primary interrupts include the system timers, the jiffies interrupt, and the manual cache interrupt. The secondary interrupt controller maintains an interrupt state for all of the processor accounting counters.   
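As a rough illustration of the interrupt controller register layout described in this hunk, the sketch below decodes a 32-bit register value into its fields. It takes the fifteen individual interrupt flags to occupy bits 0 through 14, which is an assumption based on bit 15 being described as the any-interrupt indicator. The names `decode_pic` and `pending` are hypothetical and are not part of the Zip System sources.

```python
MASTER_ENABLE = 1 << 31   # bit 31: overall interrupt enable
ANY_ACTIVE = 1 << 15      # bit 15: set when any interrupt is active
INT_MASK = 0x7FFF         # fifteen individual interrupt lines (assumed bits 0-14)


def decode_pic(value):
    """Split a 32-bit interrupt controller value into its fields."""
    return {
        "master_enable": bool(value & MASTER_ENABLE),
        "enabled": (value >> 16) & INT_MASK,   # bits 16-30
        "any_active": bool(value & ANY_ACTIVE),
        "active": value & INT_MASK,            # bits 0-14
    }


def pending(value):
    """Interrupt lines that are both enabled and active."""
    return (value >> 16) & value & INT_MASK
```

For instance, a register value with the master enable set, lines 0 and 2 enabled, and only line 2 active decodes to `enabled == 0b101` and `active == 0b100`, and `pending` returns `0b100`.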
\section{Counter}
1195,7 → 1401,8
 might set the timer to 100~million (the number of clocks per second), and set the high bit. Ever after, the timer will interrupt the CPU once per second (assuming a 100~MHz clock). This reload capability also limits the maximum timer value to $2^{31}-1$, rather than $2^{32}-1$. maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock), rather than $2^{32}-1$.   \section{Watchdog Timer}
1209,6 → 1416,19
 While the watchdog timer supports interval mode, it doesn't make as much sense as it did with the other timers.   \section{Bus Watchdog} There is an additional watchdog timer on the Wishbone bus. This timer, however, is hardware configured and not software configured. The timer is reset at the beginning of any bus transaction, and only counts clocks during such bus transactions. If the bus transaction takes longer than the number of counts the timer allots, it will raise a bus error flag to terminate the transaction. This is useful in the case of any peripherals that are misbehaving. If the bus watchdog terminates a bus transaction, the CPU may then read from its port to find out which memory location created the problem.   Aside from its unusual configuration, the bus watchdog is just another implementation of the fundamental timer described above.   \section{Jiffies}   This peripheral is motivated by the Linux use of `jiffies' whereby a process
1307,9 → 1527,189
 Eventually, I intend to place an operating system onto the ZipSystem; I'm just not there yet.   The rest of this chapter examines some common programming constructs, and how they might be applied to the Zip System. The rest of this chapter examines some common programming models, and how they might be applied to the Zip System, and then finishes with a couple of examples.   \section{System High} The easiest and simplest way to run the Zip CPU is in the system high mode. 
In this mode, the CPU runs your program in supervisor mode from reboot to power down, and is never interrupted. You will need to poll the interrupt controller to determine when any external condition has become active. This mode is useful, and can handle many microcontroller tasks.

Even better, in system high mode, all of the user registers are available to the system high program as variables. Accessing these registers can be done in a single clock cycle, which would move them to the active register set or move them back. While this may seem like a load or store instruction, none of these register accesses will suffer from memory delays.

The one thing that cannot be done in supervisor mode is a wait for interrupt instruction. This, however, is easily rectified by jumping to a user task within the supervisor's memory space, such as Tbl.~\ref{tbl:shi-idle}.
\begin{table}\begin{center}
\begin{tabbing}
{\tt supervisor\_idle:} \\
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
\> {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
\> {\em ; outside of the CPU's address space when it switches to user} \\
\> {\em ; mode.} \\
\> {\tt MOV supervisor\_idle\_continue,uPC} \\
\> {\em ; Put the processor into user mode and to sleep in the same} \\
\> {\em ; instruction. } \\
\> {\tt OR \$SLEEP|\$GIE,CC} \\
{\tt supervisor\_idle\_continue:} \\
\> {\em ; Now, if we haven't done this inline, we need to return} \\
\> {\em ; to whatever function called us.} \\
\> {\tt RETN} \\
\end{tabbing}
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
\end{center}\end{table}

\section{Traditional Interrupt Handling}
Although the Zip CPU does not have a traditional interrupt architecture, it is possible to create the more traditional interrupt approach via software.
In this mode, the programmable interrupt controller is used together with the supervisor state to create the illusion of more traditional interrupt handling.   To set this up, upon reboot the supervisor task: \begin{enumerate} \item Creates a (single) user context, a user stack, and sets the user  program counter to the entry of the user task \item Creates a task table of ISR entries \item Enables the master interrupt enable via the interrupt controller, albeit  without enabling any of the fifteen potential underlying interrupts. \item Switches to user mode, as the first part of the while loop in   Tbl.~\ref{tbl:traditional-isr}. \end{enumerate} \begin{table}\begin{center} \begin{tabbing} {\tt while(true) \{} \\ \hbox to 0.25in{}\= {\tt rtu();}\\  \> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\  \>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\  \> {\tt \} else \{} \\  \> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\ \\  \> \> {\em // Save the user context before running any ISRs. This could easily be}\\  \> \> {\em // implemented as an inline assembly routine or macro}\\  \> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\  \> \> {\em // At this point, we know an interrupt has taken place: Ask the programmable}\\  \> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\  \> \> {\tt int picv = *pic;}\\  \> \> {\em // Turn off all active interrupts}\\  \> \> {\em // Globally disable interrupt generation in the process}\\  \> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\  \> \> {\tt *pic = (active<<16);}\\  \> \> {\em // We build a mask of interrupts to re-enable in picv.}\\  \> \> {\tt picv = 0;}\\  \> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\  \> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\  \> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\  \> \>\>\> {\em // Acknowledge this particular interrupt. 
While we could acknowledge all}\\
 \> \>\>\> {\em // interrupts at once, by acknowledging only those with ISRs we allow}\\
 \> \>\>\> {\em // the user process to use peripherals manually, and to manually check}\\
 \> \>\>\> {\em // whether or not those other interrupts had occurred.}\\
 \> \>\>\> {\tt *pic = msk; }\\
 \> \>\>\> {\tt rtu(); }\\
 \> \>\>\> {\em // The ISR will only exit on a trap in the Zip architecture. There is}\\
 \> \>\>\> {\em // no {\tt RETI} instruction. Since the PIC holds all interrupts disabled,}\\
 \> \>\>\> {\em // there is no need to check for further interrupts.}\\
 \> \>\>\> {\em // }\\
 \> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
 \>\>\>\> {\em // re-enable its own interrupt without re-enabling all interrupts. Hence, we}\\
 \>\>\>\> {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
 \> \>\>\> {\em // re-enabled. }\\
 \> \>\>\> {\tt mov(uR0,tmp); }\\
 \> \>\>\> {\tt picv |= (tmp \& 0x7fff) << 16; }\\
 \> \>\> {\tt \} }\\
 \> \> {\tt \} }\\
 \> \> {\tt RESTORE\_PARTIAL\_CONTEXT; }\\
 \> \> {\em // Re-activate all (requested) interrupts }\\
 \> \> {\tt *pic = picv | 0x80000000; }\\
 \>{\tt \} }\\
{\tt \}}\\
\end{tabbing}
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
\end{center}\end{table}

We can work through the interrupt handling process by examining Tbl.~\ref{tbl:traditional-isr}. First, remember, the CPU is always running either the user or the supervisor context. Once the supervisor switches to user mode, control does not return until either an interrupt or a trap has taken place. (Okay, there's also the possibility of a bus error, or an illegal instruction such as an unimplemented floating point instruction---but for now we'll just focus on the trap instruction.) Therefore, if the trap bit isn't set, then we know an interrupt has taken place.

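The bit manipulation in the interrupt branch of Tbl.~\ref{tbl:traditional-isr} can be traced with a small host-side model. This is only a sketch under stated assumptions: the PIC is reduced to a list of recorded writes rather than real hardware, each ISR is stood in for by a Python callable that returns the value it would leave in R0, and the name `dispatch` is hypothetical.

```python
INT_MASK = 0x7FFF
MASTER_ENABLE = 0x80000000


def dispatch(picv, isr_table):
    """Model one pass of the interrupt branch of the supervisor loop.

    picv      -- value read from the interrupt controller register
    isr_table -- list of 15 entries, each None or a callable that
                 returns the value the ISR leaves in R0
    Returns (writes, ran): the values written to the PIC in order, and
    the indices of the ISRs that were run.
    """
    writes, ran = [], []
    # Interrupts that are both enabled (upper half) and tripped (lower half)
    active = (picv >> 16) & picv & INT_MASK
    # Disable the tripped interrupts; the master enable bit is written as 0
    writes.append(active << 16)
    reenable = 0
    for i in range(15):
        msk = 1 << i
        isr = isr_table[i]
        if (active & msk) and isr is not None:
            writes.append(msk)        # acknowledge only this interrupt
            r0 = isr()                # stands in for mov/rtu/mov(uR0,tmp)
            ran.append(i)
            reenable |= (r0 & INT_MASK) << 16
    # Re-activate the requested interrupts and the master enable
    writes.append(reenable | MASTER_ENABLE)
    return writes, ran
```

With lines 0 and 1 enabled, lines 1 and 2 tripped, and an ISR installed only for line 1 that asks for its own re-enable, the model records three writes: `0x00020000`, `0x00000002`, and `0x80020000`.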
To process an interrupt, we steal the user's stack: the R0, CC, and PC registers are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
\begin{table}\begin{center}
\begin{tabbing}
SAVE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
\> {\tt MOV -3(uSP),R3} \\
\> {\tt MOV uR0,R0} \\
\> {\tt MOV uCC,R1} \\
\> {\tt MOV uPC,R2} \\
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
\end{tabbing}
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
\end{center}\end{table}
This is much cheaper than the full context swap of a preemptive multitasking kernel, but it also depends upon the ISR saving any state it uses. Further, if multiple ISRs get called at once, this loses its optimality property very quickly.

As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which interrupts are enabled, and the bottom stores which have tripped. (Interrupts may trip without being enabled; they just will not generate an interrupt to the CPU.) Our first step is to query the register to find out our interrupt state, and then to disable any interrupts that have tripped. To do that, we write a one to the enable half of the register while also clearing the top bit (master interrupt enable). This has the consequence of disabling any and all further interrupts, not just the ones that have tripped. Hence, upon completion, we re--enable the master interrupt bit again. Finally, we keep track of which interrupts have tripped.

Using the bit mask of interrupts that have tripped, we walk through all fifteen possible interrupts. If there is an ISR installed, we acknowledge and reset the interrupt within the PIC, and then call the ISR.
The ISR, however, cannot re--enable its interrupt without re-enabling the
master interrupt bit.  Thus, to keep things simple, when the ISR is finished
it places its interrupt mask back into R0, or clears R0.  This tells the
supervisor mode process which interrupts to re--enable.  Any other registers
that the ISR uses must be saved and restored.  (This is only truly optimal if
only a single ISR is called.)  As its final instruction, the ISR clears the
GIE bit, executing a user trap.  (Remember, the Zip CPU has no {\tt RETI}
instruction to restore the stack and return to userland.  It needs to go
through the supervisor mode to get there.)

Then, once all interrupts are handled, the user context is restored in a
fashion similar to Tbl.~\ref{tbl:restore-partial}.
\begin{table}\begin{center}
\begin{tabbing}
RESTORE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
\> {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\
\> {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\
\> {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\
\> {\tt MOV R0,uR0} \\
\> {\tt MOV R1,uCC} \\
\> {\tt MOV R2,uPC} \\
\> {\tt MOV 3(R3),uSP} \\
\end{tabbing}
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
\end{center}\end{table}
Again, this is short and sweet simply because any other registers that needed
saving were saved within the ISR.

There you have it: the Zip CPU, with its non-traditional interrupt
architecture, can still process interrupts in a very traditional fashion.

\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.  On the Zip CPU, this task is quite simple,
1381,7 → 1781,7
 Third, we've optimized the conditional jump to a return instruction into a conditional return instruction.   \section{Context Switch} \section{Example: Context Switch}   Fundamental to any multiprocessing system is the ability to switch from one task to the next. In the ZipSystem, this is accomplished in one of a couple
2051,8 → 2451,6
	experiment with building and adding particular features, the Zip CPU
	makes a good starting point: it is fairly simple.  Modifications should
	be simple enough.
\item As an estimate of the ``weight'' of this implementation, the CPU has
	cost me less than 150 hours to implement from its inception.
\item The Zip CPU was designed to be an implementable soft core that could be
	placed within an FPGA, controlling actions internal to the FPGA.  It
	fits this role rather nicely.  It does not fit the role of a system on
2069,6 → 2467,8
 \item This simplified instruction set is easy to decode. \item The simplified bus transactions (32-bit words only) were also very easy  to implement. \item The pipelined load/store approach is novel, and can be used to greatly  increase the speed of the processor. \item The novel approach of having a single interrupt vector, which just  brings the CPU back to the instruction it left off at within the last  interrupt context doesn't appear to have been that much of a problem.
2098,10 → 2498,7
 \item The fact that the instruction width equals the bus width means that the  instruction fetch cycle will always be interfering with any load or  store memory operation, with the only exception being if the  instruction is already in the cache. {\em This has become the  fundamental limit on the speed and performance of the CPU!}  Those familiar with the Von--Neumann approach of sharing a bus  between data and instructions will not be surprised by this assessment.  instruction is already in the cache.     This could be fixed in one of three ways: the instruction set   architecture could be modified to handle Very Long Instruction Words
2146,6 → 2543,9
  the addition of a data cache can greatly increase the LUT count of  a soft core.    The Zip CPU compensates for this via its pipelined load and store  instructions.   \item Many other instruction sets offer three operand instructions, whereas  the Zip CPU only offers two operand instructions. This means that it  takes the Zip CPU more instructions to do many of the same operations.
2371,4 → 2771,36
% Index
\end{document}


%
%
% Symbol table relocation types:
%
% Only 3-types of instructions truly need relocations: those that modify the
% PC register, and those that access memory.
%
% - LDI Addr,Rx         // Loads an absolute address into Rx, 24 bits
%
% - LDILO Addr,Rx       // Loads an absolute address into Rx, 32 bits
%   LDIHI Addr,Rx       //   requires two instructions
%
% - JMP Rx              // Jump to any address in Rx
%                       // Can be prefixed with two instructions to load Rx
%                       // from any 32-bit immediate
% - JMP #Addr           // Jump to any 24-bit (signed) address, 23-bit unsigned
%
% - ADD x,PC            // Any PC relative jump (20 bits)
%
% - ADD.C x,PC          // Any PC relative conditional jump (20 bits)
%
% - LDIHI Addr,Rx       // Load from any 32-bit address, clobbers Rx,
%   LOD Addr(Rx),Rx     // unconditional, requires second instruction
%
% - LOD.C Addr(Ry),Rx   // Any 16-bit relative address load, poss. cond
%
% - STO.C Rx,Addr(Ry)   // Any 16-bit rel addr, Rx and Ry must be valid
%
% - FARJMP #Addr:       // Arbitrary 32-bit jumps require a jump table
%       BRA +1          // memory address.  The BRA +1 can be skipped,
%       .WORD Addr      // but only if the address is placed at the end
%       LOD -2(PC),PC   // of an executable section
%
