URL
https://opencores.org/ocsvn/zipcpu/zipcpu/trunk
Subversion Repositories zipcpu
Compare Revisions
- This comparison shows the changes necessary to convert path
/zipcpu/trunk/doc/src
- from Rev 69 to Rev 68
Rev 69 → Rev 68
/spec.tex
45,24 → 45,18
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
\documentclass{gqtekspec} |
\usepackage{import} |
\usepackage{bytefield} |
% \graphicspath{{../gfx}} |
\project{Zip CPU} |
\title{Specification} |
\author{Dan Gisselquist, Ph.D.} |
\email{dgisselq (at) opencores.org} |
\revision{Rev.~0.7} |
\definecolor{webred}{rgb}{0.5,0,0} |
\definecolor{webgreen}{rgb}{0,0.4,0} |
\revision{Rev.~0.6} |
\definecolor{webred}{rgb}{0.2,0,0} |
\definecolor{webgreen}{rgb}{0,0.2,0} |
\usepackage[dvips,ps2pdf,colorlinks=true, |
anchorcolor=black,pdfpagelabels,hypertexnames, |
anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames, |
pdfauthor={Dan Gisselquist}, |
pdfsubject={Zip CPU}]{hyperref} |
\hypersetup{ |
colorlinks = true, |
linkcolor = webred, |
citecolor = webgreen |
} |
\begin{document} |
\pagestyle{gqtekspecplain} |
\titlepage |
84,7 → 78,6
copy. |
\end{license} |
\begin{revisionhistory} |
0.7 & 12/22/2015 & Gisselquist & New Instruction Set Architecture \\\hline |
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline |
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline |
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline |
146,11 → 139,11
The original goal of the Zip CPU was to be a very simple CPU. You might |
think of it as a poor man's alternative to the OpenRISC architecture. |
For this reason, all instructions have been designed to be as simple as |
possible, and the base instructions are all designed to be executed in one |
instruction cycle per instruction, barring pipeline stalls. Indeed, even the |
bus has been simplified to a constant 32-bit width, with no option for more |
or less. This has resulted in the choice to drop push and pop instructions, |
pre-increment and post-decrement addressing modes, and more. |
possible, and are all designed to be executed in one instruction cycle per |
instruction, barring pipeline stalls. Indeed, even the bus has been simplified |
to a constant 32-bit width, with no option for more or less. This has |
resulted in the choice to drop push and pop instructions, pre-increment and |
post-decrement addressing modes, and more. |
|
For those who like buzz words, the Zip CPU is: |
\begin{itemize} |
165,11 → 158,8
\item A Von-Neumann architecture. (The instructions and data share a |
common bus.) |
\item A pipelined architecture, having stages for {\bf Prefetch}, |
{\bf Decode}, {\bf Read-Operand}, a |
combined stage containing the {\bf ALU}, |
{\bf Memory}, {\bf Divide}, and {\bf Floating Point} |
units, and then the final {\bf Write-back} stage. |
See Fig.~\ref{fig:cpu} |
{\bf Decode}, {\bf Read-Operand}, the {\bf ALU/Memory} |
unit, and {\bf Write-back}. See Fig.~\ref{fig:cpu} |
\begin{figure}\begin{center} |
\includegraphics[width=3.5in]{../gfx/cpu.eps} |
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu} |
193,7 → 183,7
|
Most other approaches to soft core CPU's employ a Harvard architecture. |
This allows these other CPU's to have two separate bus structures: one for the |
program fetch, and the other for the memory. The Zip CPU is fairly unique in |
program fetch, and the other for thememory. The Zip CPU is fairly unique in |
its approach because it uses a von Neumann architecture. This was done for |
simplicity. By using a von Neumann architecture, only one bus needs to be |
implemented within any FPGA. This helps to minimize real-estate, while |
201,13 → 191,12
degrade the overall instructions per clock count. |
|
Soft cores within an FPGA have an additional characteristic regarding |
memory access: it is slow. While memory on chip may be accessed at a single |
cycle per access, small FPGA's often have only a limited amount of memory on |
chip. Going off chip, however, is expensive. Two examples will prove this |
point. On |
memory access: it is slow. Memory on chip may be accessed at a single |
cycle per access, but small FPGA's have a limited amount of memory on chip. |
Going off chip, however, is expensive. Two examples will prove this point. On |
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word, |
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the |
SDRAM chip on the XuLA2 board allows a 6~cycle access for a write, 10~cycles |
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles |
per read, and 2~cycles for any subsequent pipelined access read or write. |
Either way you look at it, this memory access will be slow and this doesn't |
account for any logic delays should the bus implementation logic get |
233,13 → 222,11
be created on chip to support the SwiC if necessary. As an example, a simple |
30-bit peripheral could easily support reversing 30-bit numbers: a read from |
the peripheral returns its bit--reversed address. This is cheap within an |
FPGA, but expensive in instructions. Reading from another 16--bit peripheral |
might calculate a sine function, where the 16--bit address internal to the |
peripheral was the angle of the sine wave. |
FPGA, but expensive in instructions. |
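
To make the cost comparison concrete, the following Python sketch reverses a
30--bit value the way software would have to: one shift, mask, and OR per bit.
The function name and the default width are illustrative choices only; they
are not part of the Zip CPU or of any peripheral described here.

```python
def bit_reverse(value: int, width: int = 30) -> int:
    """Reverse the low `width` bits of `value`, one bit per loop pass."""
    result = 0
    for _ in range(width):
        # Shift the result left and append the current lowest bit of value.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    # Bit 0 of a 30-bit word moves to bit 29.
    print(hex(bit_reverse(0x1)))
```

Each pass of the loop stands in for several CPU instructions, whereas in the
fabric the same result is nothing more than a rewiring of the address lines.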
|
Indeed, anything that must be done fast within an FPGA is likely to already |
be done--elsewhere in the fabric. This leaves the CPU with the simple role |
of solely handling sequential tasks that need a lot of state. |
be done--elsewhere in the fabric. This leaves the CPU with the role of handling |
sequential tasks that need a lot of state. |
|
This means that the SwiC needs to live within a very unique environment, |
separate and different from the traditional SoC. That isn't to say that a |
276,30 → 263,17
an instruction in memory, you need to make sure the cache is reloaded |
with the new instruction. |
|
\item {\bf Prefetch Cache:} My original implementation, {\tt prefetch}, had |
a very simple prefetch stage. Any time the PC changed the prefetch |
would go and fetch the new instruction. While this was perhaps the |
simplest approach, it cost roughly five clocks for every instruction. |
This was deemed unacceptable, as I wanted a CPU that could execute |
instructions in one cycle. |
\item {\bf Prefetch Cache:} My original implementation had a very |
simple prefetch stage. Any time the PC changed the prefetch would go |
and fetch the new instruction. While this was perhaps the simplest |
approach, it cost roughly five clocks for every instruction. This |
was deemed unacceptable, as I wanted a CPU that could execute |
instructions in one cycle. I therefore have a prefetch cache that |
issues pipelined wishbone accesses to memory and then pushes |
instructions at the CPU. Sadly, this accounts for about 20\% of the |
logic in the entire CPU, or 15\% of the logic in the entire system. |
|
My second implementation, {\tt pipefetch}, attempted to make the most |
use of pipelined memory. When a new CPU address was issued, it would |
start reading |
memory in a pipelined fashion, and issuing instructions as soon as they |
were ready. This cache was a sliding window in memory. This suffered |
from some difficult performance problems, though. If the CPU was |
alternating between two diverse sections of code, both could never be |
in the cache at the same time--causing lots of cache misses. Further, |
the extra logic to implement this window cost an extra clock cycle |
in the cache implementation, slowing down branches. |
|
The Zip CPU now has a third cache implementation, {\tt pfcache}. This |
new implementation takes only a single cycle per access, but costs a |
full cache line miss on any miss. While configurable, a full cache |
line miss might mean that the CPU needs to read 256~instructions from |
memory before it can execute the first one of them. |
|
\item {\bf Operating System:} In order to support an operating system, |
interrupts and so forth, the CPU needs to support supervisor and |
user modes, as well as a means of switching between them. For example, |
328,19 → 302,6
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}. |
The timer module itself is found in {\tt ziptimer.v}. |
|
\item {\bf Bus Errors:} My original implementation had no logic to handle |
what would happen if the CPU attempted to read or write a non-existent |
memory address. This changed after I needed to troubleshoot a failure |
caused by a subroutine return to a non-existent address. |
|
My next bus problem was caused by a misbehaving peripheral. |
Whenever the CPU attempted to read from or write to this peripheral, |
the peripheral would take control of the wishbone bus and not return |
it. For example, it might never return an {\tt ACK} to signal |
the end of the bus transaction. This led to the implementation of |
a wishbone bus watchdog that would create a bus error if any particular |
bus action didn't complete in a timely fashion. |
|
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline |
stalls at all, but rather to require the compiler to properly schedule |
all instructions so that stalls would never be necessary. After trying |
433,8 → 394,6
\includegraphics[width=4in]{../gfx/stacking.eps} |
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking} |
\end{center}\end{figure} |
However, if a pipeline hazard is detected, a stage can stall in order |
to prevent the previous from moving forward. |
|
This approach is also different from other pipeline approaches. |
Instead of keeping the entire pipeline filled, each stage is treated |
441,6 → 400,26
independently. Therefore, individual stages may move forward as long |
as the subsequent stage is available, regardless of whether the stage |
behind it is filled. |
|
\item {\bf Verilog Modules:} When examining how other processors worked |
here on open cores, many of them had one separate module per pipeline |
stage. While this appeared to me to be a fascinating and commendable |
idea, my own implementation didn't work out quite so nicely. |
|
As an example, the decode module produces a {\em lot} of |
control wires and registers. Creating a module out of this, with |
only the simplest of logic within it, seemed to be more a lesson |
in passing wires around, rather than encapsulating logic. |
|
Another example was the register writeback section. I would love |
this section to be a module in its own right, and many have made them |
such. However, other modules depend upon writeback results other |
than just what's placed in the register (i.e., the control wires). |
For these reasons, I didn't manage to fit this section into its |
own module. |
|
The result is that the majority of the CPU code can be found in |
the {\tt zipcpu.v} file. |
\end{itemize} |
|
With that introduction out of the way, let's move on to the instruction |
473,8 → 452,7
code register |
(CC) is register 14. By convention, the stack pointer will be register 13 and |
noted as (SP)--although there is nothing special about this register other |
than this convention. Also by convention register~12 will point to a global |
offset table, and may be abbreviated as the (GBL) register. |
than this convention. |
The CPU can access both register sets via move instructions from the |
supervisor state, whereas the user state can only access the user registers. |
|
482,10 → 460,8
Tbl.~\ref{tbl:cc-register}, |
\begin{table}\begin{center} |
\begin{bitlist} |
31\ldots 13 & R/W & Reserved for future uses\\\hline |
12 & R & (Reserved for) Floating Point Exception\\\hline |
11 & R & Division by Zero Exception\\\hline |
10 & R & Bus-Error Flag\\\hline |
31\ldots 11 & R/W & Reserved for future uses\\\hline |
10 & R & (Reserved for) Bus-Error Flag\\\hline |
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline |
8 & R & Illegal Instruction Flag\\\hline |
7 & R/W & Break--Enable\\\hline |
508,21 → 484,11
Carry (C), |
Negative (N), |
and Overflow (V). |
On those instructions that set the flags, these flags will be set based upon |
the output of the instruction. If the result is zero, the Z flag will be set. |
If the high order bit is set, the N flag will be set. If the instruction |
caused a bit to fall off the end, the carry bit will be set. Finally, if |
the instruction causes a signed integer overflow, the V flag will be set |
afterwards. |
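
As a rough illustration of these rules, the Python sketch below models a
32--bit add and the four flags it would produce. It is a behavioral sketch
only, not the CPU's RTL. The flag positions (Z in bit~0, C in bit~1, N in
bit~2, V in bit~3) follow the status register bit assignments, while the
function name and the choice to treat carry as the carry out of bit~31 on an
add are assumptions made for the example.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF

# Flag bit positions within the CC register (Z, C, N, V in bits 0..3).
Z_BIT, C_BIT, N_BIT, V_BIT = 0, 1, 2, 3


def add_with_flags(a: int, b: int) -> Tuple[int, int]:
    """Return (32-bit result, flag bits) for a + b. Hypothetical helper."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32

    z = int(result == 0)        # result is zero
    c = int(full > MASK32)      # a bit fell off the top end
    n = (result >> 31) & 1      # high order bit of the result is set
    # Signed overflow: the operands share a sign that the result does not.
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1

    flags = (z << Z_BIT) | (c << C_BIT) | (n << N_BIT) | (v << V_BIT)
    return result, flags


if __name__ == "__main__":
    # 1 + (-1) wraps to zero: Z and C are set, N and V are clear.
    print(add_with_flags(1, 0xFFFFFFFF))
    # Largest positive value plus one: N and V are set.
    print(add_with_flags(0x7FFFFFFF, 1))
```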
|
The next bit is a sleep bit. Set this bit to one to disable instruction |
execution and place the CPU to sleep, or to zero to keep the pipeline |
running. Setting this bit will cause the CPU to wait for an interrupt |
(if interrupts are enabled), or to completely halt (if interrupts are |
disabled). In order to prevent users from halting the CPU, only the |
supervisor is allowed to both put the CPU to sleep and disable |
interrupts. Any user attempt to do so will simply result in a switch |
to supervisor mode. |
The next bit is a clock enable (0 to enable) or sleep bit (1 to put |
the CPU to sleep). Setting this bit will cause the CPU to |
wait for an interrupt (if interrupts are enabled), or to |
completely halt (if interrupts are disabled). |
|
The sixth bit is a global interrupt enable bit (GIE). When this |
sixth bit is a `1' interrupts will be enabled, else disabled. When |
535,12 → 501,13
GIE register at the same time, with clearing the GIE register taking |
precedence. |
|
The seventh bit is a step bit. This bit can be set from supervisor mode only. |
After setting this bit, should the supervisor mode process switch to |
user mode, it would then accomplish one instruction in user mode |
before returning to supervisor mode. Then, upon return to supervisor |
mode, this bit will be automatically cleared. This bit has no effect |
on the CPU while in supervisor mode. |
The seventh bit is a step bit. This bit can be |
set from supervisor mode only. After setting this bit, should |
the supervisor mode process switch to user mode, it would then |
accomplish one instruction in user mode before returning to supervisor |
mode. Then, upon return to supervisor mode, this bit will |
be automatically cleared. This bit has no effect on the CPU while in |
supervisor mode. |
|
This functionality was added to enable a userspace debugger |
functionality on a user process, working through supervisor mode |
572,204 → 539,56
supervisor, in supervisor mode, to determine whether it got to supervisor |
mode from a trap or from an external interrupt or both. |
|
\section{Instruction Format} |
All Zip CPU instructions fit in one of the formats shown in |
Fig.~\ref{fig:iset-format}. |
\begin{figure}\begin{center} |
\begin{bytefield}[endianness=big]{32} |
\bitheader{0-31}\\ |
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR} |
\bitbox[lrt]{5}{OpCode} |
\bitbox[lrt]{3}{Cnd} |
\bitbox{1}{0} |
\bitbox{18}{18-bit Signed Immediate} \\ |
\bitbox{1}{0}\bitbox{4}{DR} |
\bitbox[lrb]{5}{} |
\bitbox[lrb]{3}{} |
\bitbox{1}{1} |
\bitbox{4}{BR} |
\bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\ |
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR} |
\bitbox[lrt]{5}{5'hf} |
\bitbox[lrt]{3}{Cnd} |
\bitbox{1}{A} |
\bitbox{4}{BR} |
\bitbox{1}{B} |
\bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\ |
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR} |
\bitbox{4}{4'hb} |
\bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\ |
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7} |
\bitbox{1}{} |
\bitbox{2}{11} |
\bitbox{3}{xxx} |
\bitbox{22}{Ignored} |
\end{leftwordgroup} \\ |
\begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR} |
\bitbox[lrt]{5}{OpCode} |
\bitbox[lrt]{3}{Cnd} |
\bitbox{1}{0} |
\bitbox{4}{Imm.} |
\bitbox{14}{---} \\ |
\bitbox{1}{1}\bitbox[lr]{4}{} |
\bitbox[lrb]{5}{} |
\bitbox[lr]{3}{} |
\bitbox{1}{1} |
\bitbox{4}{BR} |
\bitbox{14}{---} \\ |
\bitbox{1}{1}\bitbox[lrb]{4}{} |
\bitbox{4}{4'hb} |
\bitbox{1}{} |
\bitbox[lrb]{3}{} |
\bitbox{5}{5'b Imm} |
\bitbox{14}{---} \\ |
\bitbox{1}{1}\bitbox{9}{---} |
\bitbox[lrt]{3}{Cnd} |
\bitbox{5}{---} |
\bitbox[lrt]{4}{DR} |
\bitbox[lrt]{5}{OpCode} |
\bitbox{1}{0} |
\bitbox{4}{Imm} |
\\ |
\bitbox{1}{1}\bitbox{9}{---} |
\bitbox[lr]{3}{} |
\bitbox{5}{---} |
\bitbox[lr]{4}{} |
\bitbox[lrb]{5}{} |
\bitbox{1}{1} |
\bitbox{4}{Reg} \\ |
\bitbox{1}{1}\bitbox{9}{---} |
\bitbox[lrb]{3}{} |
\bitbox{5}{---} |
\bitbox[lrb]{4}{} |
\bitbox{4}{4'hb} |
\bitbox{1}{} |
\bitbox{5}{5'b Imm} |
\end{leftwordgroup} \\ |
\end{bytefield} |
\caption{Zip Instruction Set Format}\label{fig:iset-format} |
\end{center}\end{figure} |
The basic format is that some operation, defined by the OpCode, is applied |
if a condition, Cnd, is true in order to produce a result which is placed in |
the destination register, or DR. The Load 23--bit signed immediate instruction |
is different in that it requires no conditions, and uses only a 4-bit opcode. |
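
The field boundaries of the standard format can be sketched in Python as
shown below. The sketch assumes the leftmost box of the figure is bit~31, so
that the VLIW bit is the most significant bit, and it covers only the
standard row of the figure, not the MOV, LDI, NOOP or VLIW rows. The helper
names and the dictionary keys are made up for the example.

```python
def encode_standard(dr: int, opcode: int, cnd: int, opb: int) -> int:
    """Pack a standard (non-VLIW) instruction word. Hypothetical helper."""
    return (((dr & 0xF) << 27) | ((opcode & 0x1F) << 22)
            | ((cnd & 0x7) << 19) | (opb & 0x7FFFF))


def decode_standard(word: int) -> dict:
    """Split a 32-bit instruction word into the standard-format fields."""
    return {
        "vliw": (word >> 31) & 0x1,     # 0 for a standard instruction
        "dr": (word >> 27) & 0xF,       # destination register
        "opcode": (word >> 22) & 0x1F,  # 5-bit OpCode
        "cnd": (word >> 19) & 0x7,      # 3-bit condition
        "opb": word & 0x7FFFF,          # selector bit plus immediate/register
    }


if __name__ == "__main__":
    # OpCode 5'h02 (ADD), destination R3, unconditional, immediate of 5.
    word = encode_standard(dr=3, opcode=0x02, cnd=0, opb=5)
    print(hex(word), decode_standard(word))
```

One consequence of this layout is that the 4--bit LDI opcode, 4'hb, occupies
the upper four bits of the 5--bit OpCode field, which is why LDI appears as
both 5'h16 and 5'h17 when viewed through the standard decoder.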
|
This is actually a second version of instruction set definition, given certain |
lessons learned. For example, the original instruction set had the following |
problems: |
\begin{enumerate} |
\item No opcodes were available for divide or floating point extensions to be |
made available. Although there was space in the instruction set to |
add these types of instructions, this instruction space was going to |
require extra logic to use. |
\item The carveouts for instructions such as NOOP and LDIHI/LDILO required |
extra logic to process. |
\item The instruction set wasn't very compact. One bus operation was required |
for every instruction. |
\end{enumerate} |
This second version was designed with two criteria. The first was that the |
new instruction set needed to be compatible, at the assembly language level, |
with the previous instruction set. Thus, it must be able to support all of |
the previous mnemonics and more. This was achieved with the sole exception |
that instruction immediates are generally two bits shorter than before. |
(One bit was lost to the VLIW bit in front, another from changing from 4--bit |
to 5--bit opcodes.) Second, the new instruction set needed to be a drop--in |
replacement for the decoder, modifying nothing else. This was almost achieved, |
save for two issues: the ALU unit needed to be replaced since the OpCodes |
were reordered, and some condition code logic needed to be adjusted since the |
condition codes were renumbered as well. In the end, maximum reuse of the |
existing RTL (Verilog) code was achieved in this upgrade. |
|
As of this second version of the Zip CPU instruction set, the Zip CPU also |
supports a very long instruction word (VLIW) set of instructions. These |
instruction formats pack two instructions into a single instruction word, |
trading immediate instruction space to do this, but in just about all other |
respects these are identical to two standard instructions. Other than |
instruction format, the only basic difference is that the CPU will not switch |
to interrupt mode in between the two instructions. Likewise a new job given |
to the assembler is that of automatically packing as many instructions as |
possible into the VLIW format. Where necessary to place both VLIW instructions |
on the same line, they will be separated by a vertical bar. |
|
\section{Instruction OpCodes} |
With a 5--bit opcode field, there are 32~possible instructions as shown in |
Tbl.~\ref{tbl:iset-opcodes}. |
\begin{table}\begin{center} |
\begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85} |
OpCode & & Instruction &Sets CC \\\hline\hline |
5'h00 & SUB & Subtract & \\\cline{1-3} |
5'h01 & AND & Bitwise And & \\\cline{1-3} |
5'h02 & ADD & Add two numbers & \\\cline{1-3} |
5'h03 & OR & Bitwise Or & Y \\\cline{1-3} |
5'h04 & XOR & Bitwise Exclusive Or & \\\cline{1-3} |
5'h05 & LSR & Logical Shift Right & \\\cline{1-3} |
5'h06 & LSL & Logical Shift Left & \\\cline{1-3} |
5'h07 & ASR & Arithmetic Shift Right & \\\hline |
5'h08 & LDIHI & Load Immediate High & N \\\cline{1-3} |
5'h09 & LDILO & Load Immediate Low & \\\hline |
5'h0a & MPYU & Unsigned 16--bit Multiply & \\\cline{1-3} |
5'h0b & MPYS & Signed 16--bit Multiply & Y \\\cline{1-3} |
5'h0c & BREV & Bit Reverse & \\\cline{1-3} |
5'h0d & POPC& Population Count & \\\cline{1-3} |
5'h0e & ROL & Rotate left & \\\hline |
5'h0f & MOV & Move register & N \\\hline |
5'h10 & CMP & Compare & Y \\\cline{1-3} |
5'h11 & TST & Test (AND w/o setting result) & \\\hline |
5'h12 & LOD & Load from memory & N \\\cline{1-3} |
5'h13 & STO & Store a register into memory & \\\hline\hline |
5'h14 & DIVU & Divide, unsigned & Y \\\cline{1-3} |
5'h15 & DIVS & Divide, signed & \\\hline\hline |
5'h16/7 & LDI & Load 23--bit signed immediate & N \\\hline\hline |
5'h18 & FPADD & Floating point add & \\\cline{1-3} |
5'h19 & FPSUB & Floating point subtract & \\\cline{1-3} |
5'h1a & FPMPY & Floating point multiply & Y \\\cline{1-3} |
5'h1b & FPDIV & Floating point divide & \\\cline{1-3} |
5'h1c & FPCVT & Convert integer to floating point & \\\cline{1-3} |
5'h1d & FPINT & Convert to integer & \\\hline |
5'h1e & & {\em Reserved for future use} &\\\hline |
5'h1f & & {\em Reserved for future use} &\\\hline |
These status register bits are summarized in Tbl.~\ref{tbl:ccbits}. |
\begin{table} |
\begin{center} |
\begin{tabular}{l|l} |
Bit & Meaning \\\hline |
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline |
8 & Illegal instruction error flag \\\hline |
7 & Halt on break, to support an external debugger \\\hline |
6 & Step, single step the CPU in user mode\\\hline |
5 & GIE, or Global Interrupt Enable \\\hline |
4 & Sleep \\\hline |
3 & V, or overflow bit.\\\hline |
2 & N, or negative bit.\\\hline |
1 & C, or carry bit.\\\hline |
0 & Z, or zero bit. \\\hline |
\end{tabular} |
\caption{Zip CPU OpCodes}\label{tbl:iset-opcodes} |
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits} |
\end{center}\end{table} |
% |
Of these opcodes, the {\tt BREV} and {\tt POPC} are experimental, and may be |
replaced later, and two floating point instruction opcodes are reserved for |
future use. |
|
\section{Conditional Instructions} |
Most, although not quite all, instructions may be conditionally executed. |
The 23--bit load immediate instruction, together with the {\tt NOOP}, |
{\tt BREAK}, and {\tt LOCK} instructions are the only exceptions to this rule. |
|
From the four condition code flags, eight conditions are defined for standard |
instructions. These are shown in Tbl.~\ref{tbl:conditions}. |
\begin{table}\begin{center} |
Most, although not quite all, instructions may be conditionally executed. From |
the four condition code flags, eight conditions are defined. These are shown |
in Tbl.~\ref{tbl:conditions}. |
\begin{table} |
\begin{center} |
\begin{tabular}{l|l|l} |
Code & Mnemonic & Condition \\\hline |
3'h0 & None & Always execute the instruction \\ |
3'h1 & {\tt .LT} & Less than ('N' set) \\ |
3'h2 & {\tt .Z} & Only execute when 'Z' is set \\ |
3'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\ |
3'h1 & {\tt .Z} & Only execute when 'Z' is set \\ |
3'h2 & {\tt .NE} & Only execute when 'Z' is not set \\ |
3'h3 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\ |
3'h4 & {\tt .GT} & Greater than ('N' not set, 'Z' not set) \\ |
3'h5 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\ |
3'h5 & {\tt .LT} & Less than ('N' set) \\ |
3'h6 & {\tt .C} & Carry set\\ |
3'h7 & {\tt .V} & Overflow set\\ |
\end{tabular} |
\caption{Conditions for conditional operand execution}\label{tbl:conditions} |
\end{center}\end{table} |
There is no condition code for less than or equal, not C or not V---there |
just wasn't enough space in 3--bits. Conditioning on a non--supported |
condition is still possible, but it will take an extra instruction and a |
pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt |
STO.NZ R0,(R1)}) As an alternative, it is often possible to reverse the |
condition, thus recovering those extra two clocks. Thus instead of |
\hbox{\tt CMP Rx,Ry;} \hbox{\tt BNV label} you can issue a |
\hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}. |
\end{center} |
\end{table} |
There is no condition code for less than or equal, not C or not V. Sorry, |
I ran out of space in 3--bits. Conditioning on a non--supported condition |
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)}) |
As an alternative, the condition may often be reversed, recovering those |
extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;} |
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}. |
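
The eight conditions can be modeled with a small lookup, sketched here in
Python. The sketch uses the encoding in which 3'h1 is {\tt .LT}, 3'h2 is
{\tt .Z}, 3'h3 is {\tt .NZ}, 3'h4 is {\tt .GT} and 3'h5 is {\tt .GE}, and it
takes the flags as a 4--bit value with Z in bit~0, C in bit~1, N in bit~2 and
V in bit~3. The function and constant names are illustrative only.

```python
# Flag masks within the low four bits of the CC register.
Z, C, N, V = 0x1, 0x2, 0x4, 0x8


def condition_met(cnd: int, flags: int) -> bool:
    """Return True when an instruction with condition `cnd` would execute."""
    z = bool(flags & Z)
    c = bool(flags & C)
    n = bool(flags & N)
    v = bool(flags & V)
    table = (
        True,               # 3'h0: always
        n,                  # 3'h1: .LT, N set
        z,                  # 3'h2: .Z,  Z set
        not z,              # 3'h3: .NZ, Z clear
        not n and not z,    # 3'h4: .GT, N clear and Z clear
        not n,              # 3'h5: .GE, N clear, Z irrelevant
        c,                  # 3'h6: .C,  carry set
        v,                  # 3'h7: .V,  overflow set
    )
    return table[cnd & 0x7]


if __name__ == "__main__":
    # After a compare that produced a zero result, .GE passes and .GT fails.
    print(condition_met(5, Z), condition_met(4, Z))
```

Less than or equal would be ``N set or Z set'', which no single row of the
lookup expresses; this is the gap that the extra {\tt TST} instruction, or the
reversed comparison, works around.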
|
Conditionally executed instructions will not further adjust the |
Conditionally executed ALU instructions will not further adjust the |
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST} |
instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions |
will adjust conditions whenever they are executed. In this way, |
will adjust conditions whenever their conditionals are true. In this way, |
multiple conditions may be evaluated without branches. For example, to do |
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try |
code such as Tbl.~\ref{tbl:dbl-condition}. |
787,56 → 606,32
\caption{An example of a double conditional}\label{tbl:dbl-condition} |
\end{center}\end{table} |
|
In the case of VLIW instructions, only four conditions are defined as shown |
in Tbl.~\ref{tbl:vliw-conditions}. |
\begin{table}\begin{center} |
\begin{tabular}{l|l|l} |
Code & Mnemonic & Condition \\\hline |
2'h0 & None & Always execute the instruction \\ |
2'h1 & {\tt .LT} & Less than ('N' set) \\ |
2'h2 & {\tt .Z} & Only execute when 'Z' is set \\ |
2'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\ |
\end{tabular} |
\caption{VLIW Conditions}\label{tbl:vliw-conditions} |
\end{center}\end{table} |
Further, the first bit is given a special meaning. If the first bit is set, |
the conditions apply to the second half of the instruction, otherwise the |
conditions will only apply to the first half of a conditional instruction. |
\section{Traditional Interrupt Handling} |
|
\section{Operand B} |
Many instruction forms have a 19-bit source ``Operand B'' associated with them. |
This ``Operand B'' is shown in Fig.~\ref{fig:iset-format} as part of the |
standard instructions. This Operand B is either equal to a register plus a |
14--bit signed immediate offset, or an 18--bit signed immediate offset by |
itself. This value is encoded as shown in Tbl.~\ref{tbl:opb}. |
Many instruction forms have a 21-bit source ``Operand B'' associated with them. |
This Operand B is either equal to a register plus a signed immediate offset, |
or an immediate offset by itself. This value is encoded as shown in |
Tbl.~\ref{tbl:opb}. |
\begin{table}\begin{center} |
\begin{bytefield}[endianness=big]{19} |
\bitheader{0-18} \\ |
\bitbox{1}{0}\bitbox{18}{18-bit Signed Immediate} \\ |
\bitbox{1}{1}\bitbox{4}{Reg}\bitbox{14}{14-bit Signed Immediate} |
\end{bytefield} |
\begin{tabular}{|l|l|l|}\hline |
Bit 20 & 19 \ldots 16 & 15 \ldots 0 \\\hline |
1'b0 & \multicolumn{2}{l|}{20--bit Signed Immediate value} \\\hline |
1'b1 & 4-bit Register & 16--bit Signed immediate offset \\\hline |
\end{tabular} |
\caption{Bit allocation for Operand B}\label{tbl:opb} |
\end{center}\end{table} |
|
Fourteen and eighteen bit immediate values don't make sense for all |
instructions. For example, what is the point of an 18--bit immediate when |
executing a 16--bit multiply? Or a 16--bit load--immediate? In these cases, |
the extra bits are simply ignored. |
Sixteen and twenty bit immediate values don't make sense for all instructions. |
For example, what is the point of a 20--bit immediate when executing a 16--bit |
multiply? Likewise, why have a 16--bit immediate when adding to a logical |
or arithmetic shift? In these cases, the extra bits are reserved for future |
instruction possibilities. |
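
For the 19--bit form of Operand B (one selector bit, followed by either an
18--bit signed immediate or a 4--bit register and a 14--bit signed immediate),
decoding might look like the Python sketch below. It assumes the selector is
the most significant of the 19 bits, as drawn, and both helper names are
invented for the example. The 21--bit form decodes the same way with 20--bit
and 16--bit immediates.

```python
from typing import Optional, Tuple


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_operand_b(opb: int) -> Tuple[Optional[int], int]:
    """Return (register or None, signed immediate) for a 19-bit Operand B."""
    opb &= 0x7FFFF
    if opb & (1 << 18):
        # Selector set: 4-bit register plus a 14-bit signed offset.
        reg = (opb >> 14) & 0xF
        return reg, sign_extend(opb & 0x3FFF, 14)
    # Selector clear: an 18-bit signed immediate by itself.
    return None, sign_extend(opb & 0x3FFFF, 18)


if __name__ == "__main__":
    # Register 13 (SP by convention) with an offset of -1.
    print(decode_operand_b((1 << 18) | (13 << 14) | 0x3FFF))
```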
|
VLIW instructions still use the same operand B, only there was no room for any |
instruction plus immediate addressing. Therefore, VLIW instructions have either |
a register or a 4--bit signed immediate as their operand B. The only exception |
is the load immediate instruction, which permits a 5--bit signed operand |
B.\footnote{Although the space exists to extend this VLIW load immediate |
instruction to six bits, the 5--bit limit was chosen to simplify the |
disassembler. This may change in the future.} |
|
\section{Address Modes} |
The Zip CPU supports two addressing modes: register plus immediate, and |
immediate address. Addresses are therefore encoded in the same fashion as |
Operand B's, shown above. Practically, the VLIW instruction set only offers |
register addressing, necessitating a non--VLIW instruction for most memory |
operations. |
Operand B's, shown above. |
|
A lot of long hard thought was put into whether to allow pre/post increment |
and decrement addressing modes. Finding no way to use these operators without |
856,105 → 651,206
\ldots when in supervisory mode. To keep the compiler simple, the extra bits |
are ignored in non-supervisory mode (as though they didn't exist), rather than |
being mapped to new instructions or additional capabilities. The bits |
indicating which register set each register lies within are the A-User, marked |
`A' in Fig.~\ref{fig:iset-format}, and B-User bits, marked as `B'. When set |
to a one, these refer to a user mode register. When set to a zero, these |
refer to a register in the current mode, whether user or supervisor. Further, |
because a load immediate instruction exists, there is no move capability |
between an immediate and a register: all moves come from either a register or |
a register plus an offset. |
indicating which register set each register lies within are the A-Usr and |
B-Usr bits. When set to a one, these refer to a user mode register. When set |
to a zero, these refer to a register in the current mode, whether user or |
supervisor. Further, because a load immediate instruction exists, there is no |
move capability between an immediate and a register: all moves come from either |
a register or a register plus an offset. |
|
This actually leads to a bit of a problem: since the {\tt MOV} instruction |
encodes which register set each register is coming from or moving to, how shall |
a compiler or assembler know how to compile a MOV instruction without knowing |
This actually leads to a bit of a problem: since the MOV instruction encodes |
which register set each register is coming from or moving to, how shall a |
compiler or assembler know how to compile a MOV instruction without knowing |
the mode of the CPU at the time? For this reason, the compiler will assume |
all MOV registers are supervisor registers, and display them as normal. |
Anything with the user bit set will be treated as a user register and displayed |
specially. Since the CPU quietly ignores the supervisor bits while in user mode, |
anything marked as a user register will always be specific. |
Anything with the user bit set will be treated as a user register. The CPU |
will quietly ignore the supervisor bits while in user mode, and anything |
marked as a user register will always be valid. |
|
\section{Multiply Operations} |
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply |
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). A 32--bit |
multiply, should it be desired, needs to be created via software from this |
16x16 bit multiply. |
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). In both |
cases, the operand is a register plus a 16-bit immediate, subject to the |
rule that the register cannot be the PC or CC registers. The PC register |
field has been stolen to create a multiply by immediate instruction. The |
CC register field is reserved. |
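
As a rough illustration of how software might build a 32--bit product from 16x16 pieces, the Python sketch below models {\tt MPYU} as a 16x16 unsigned multiply and combines three partial products. The helper names {\tt mpyu} and {\tt mul32\_lo} are hypothetical and are not part of the Zip CPU toolchain. Only the low 32 bits of the product are kept, on the assumption that the result must fit in a single register.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mpyu(a, b):
    """Model of a 16x16 unsigned multiply: only the low 16 bits of each operand are used."""
    return (a & MASK16) * (b & MASK16)

def mul32_lo(a, b):
    """Low 32 bits of a 32x32 product, built from three 16x16 multiplies."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    low = mpyu(a_lo, b_lo)
    # The high-by-high partial product only affects bits 32 and up, so it is skipped.
    # Only the low 16 bits of the cross terms survive the shift into the upper half.
    cross = (mpyu(a_hi, b_lo) + mpyu(a_lo, b_hi)) & MASK16
    return (low + (cross << 16)) & MASK32
```

On the CPU itself each call to {\tt mpyu} would correspond to one multiply instruction, with shifts and adds in between.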
|
\section{Divide Unit} |
The Zip CPU also has a divide unit which can be built alongside the ALU. |
This divide unit provides the Zip CPU with its first two instructions that |
cannot be executed in a single cycle: {\tt DIVS}, or signed divide, and |
{\tt DIVU}, the unsigned divide. These are both 32--bit divide instructions, |
dividing one 32--bit number by another. In this case, the Operand B field, |
whether it be register or register plus immediate, constitutes the denominator, |
whereas the numerator is given by the other register. |
\section{Floating Point} |
The Zip CPU does not (yet) support floating point operations. However, the |
instruction set reserves two possibilities for future floating point |
operations. |
|
The Divide is also a multi--clock instruction. While the divide is running, |
the ALU, memory unit, and floating point unit (if installed) will be idle. |
Once the divide completes, other units may continue. |
The first floating point operation hole in the instruction set involves |
setting a proposed (but non-existent) floating point bit in the CC register. |
The next instruction |
would then simply interpret its operands as floating point values. |
Not all instructions, however, have floating point equivalents. Further, the |
immediate fields do not apply in floating point mode, and must be set to |
zero. Not all instructions make sense as floating point operations. |
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as |
floating point instructions. Other instructions allow the examining of the |
floating point bit in the CC register. In all cases, the floating point bit |
is cleared one instruction after it is set. |
|
Of course, divides can have errors: division by zero. In the case of division |
by zero, an exception will be caused that will send the CPU either from |
user mode to supervisor mode, or halt the CPU if it is already in supervisor |
mode. |
The other possibility for floating point operations involves exploiting the |
hole in the instruction set that the NOOP and BREAK instructions reside within. |
These two instructions use 24--bits of address space, when only a single bit |
is necessary. A simple adjustment to this space could create instructions |
with 4--bit register addresses for each register, a 3--bit field for |
conditional execution, and a 2--bit field for which operation. |
In this fashion, such a floating point capability would only fill 13--bits of |
the 24--bit field, still leaving lots of room for expansion. |
|
\section{NOOP, BREAK, and Bus Lock Instruction} |
Three instructions are not listed in the opcode list in |
Tbl.~\ref{tbl:iset-opcodes}, yet fit in the NOOP type instruction format of |
Fig.~\ref{fig:iset-format}. These are the {\tt NOOP}, {\tt Break}, and |
bus {\tt LOCK} instructions. These are encoded according to |
Fig.~\ref{fig:iset-noop}, and have the following meanings: |
\begin{figure}\begin{center} |
\begin{bytefield}[endianness=big]{32} |
\bitheader{0-31}\\ |
\begin{leftwordgroup}{NOOP} |
\bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{} |
\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Ignored} \\ |
\bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{} |
\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{---} \\ |
\bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---} |
\bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001} |
\bitbox{5}{Ignored} |
\end{leftwordgroup} \\ |
\begin{leftwordgroup}{BREAK} |
\bitbox{1}{0}\bitbox{3}{3'h7} |
\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored} |
\end{leftwordgroup} \\ |
\begin{leftwordgroup}{LOCK} |
\bitbox{1}{0}\bitbox{3}{3'h7} |
\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{100}\bitbox{22}{Ignored} |
\end{leftwordgroup} \\ |
\end{bytefield} |
\caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop} |
\end{center}\end{figure} |
In both cases, the Zip CPU would support 32--bit single precision floats |
only, since other choices would complicate the pipeline. |
|
The {\tt NOOP} instruction is just that: an instruction that does not perform |
any operation. While many other instructions, such as a move from a register to |
itself, could also fit these roles, only the NOOP instruction guarantees that |
it will not stall waiting for a register to be available. For this reason, |
it gets its own place in the instruction set. |
The current architecture does not support a floating point not-implemented |
interrupt. Any soft floating point emulation must be done deliberately. |
|
The {\tt BREAK} instruction is useful for creating a debug instruction that |
will halt the CPU without executing. If in user mode, depending upon the |
setting of the break enable bit, it will either switch to supervisor mode or |
halt the CPU--depending upon where the user wishes to do his debugging. |
\section{Native Instructions} |
The instruction set for the Zip CPU is summarized in |
Tbl.~\ref{tbl:zip-instructions}. |
\begin{table}\begin{center} |
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline |
\rowcolor[gray]{0.85} |
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16} |
& \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0} |
& Sets CC? \\\hline\hline |
{\tt CMP(Sub)} & \multicolumn{4}{l|}{4'h0} |
& \multicolumn{4}{l|}{D. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt TST(And)} & \multicolumn{4}{l|}{4'h1} |
& \multicolumn{4}{l|}{D. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt MOV} & \multicolumn{4}{l|}{4'h2} |
& \multicolumn{4}{l|}{D. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& A-Usr |
& \multicolumn{4}{l|}{B-Reg} |
& B-Usr |
& \multicolumn{15}{l|}{15'bit signed offset} |
& \\\hline |
{\tt LODI} & \multicolumn{4}{l|}{4'h3} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{24}{l|}{24'bit Signed Immediate} |
& \\\hline |
{\tt NOOP} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{4'he} |
& \multicolumn{24}{l|}{24'h00} |
& \\\hline |
{\tt BREAK} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{4'he} |
& \multicolumn{24}{l|}{24'h01} |
& \\\hline |
{\em Reserved} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{4'he} |
& \multicolumn{24}{l|}{24'bits, but not 0 or 1.} |
& \\\hline |
{\tt LODIHI }& \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{4'hf} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b1 |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{16}{l|}{16-bit Immediate} |
& \\\hline |
{\tt LODILO} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{4'hf} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b0 |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{16}{l|}{16-bit Immediate} |
& \\\hline |
16-b {\tt MPYU} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b0 & \multicolumn{4}{l|}{Reg} |
& \multicolumn{16}{l|}{16-bit Offset} |
& Yes \\\hline |
16-b {\tt MPYU}(I) & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b0 & \multicolumn{4}{l|}{4'hf} |
& \multicolumn{16}{l|}{16-bit Offset} |
& Yes \\\hline |
16-b {\tt MPYS} & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b1 & \multicolumn{4}{l|}{Reg} |
& \multicolumn{16}{l|}{16-bit Offset} |
& Yes \\\hline |
16-b {\tt MPYS}(I) & \multicolumn{4}{l|}{4'h4} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& 1'b1 & \multicolumn{4}{l|}{4'hf} |
& \multicolumn{16}{l|}{16-bit Offset} |
& Yes \\\hline |
{\tt ROL} & \multicolumn{4}{l|}{4'h5} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B, truncated to low order 5 bits} |
& \\\hline |
{\tt LOD} & \multicolumn{4}{l|}{4'h6} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B address} |
& \\\hline |
{\tt STO} & \multicolumn{4}{l|}{4'h7} |
& \multicolumn{4}{l|}{D. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B address} |
& \\\hline |
{\tt SUB} & \multicolumn{4}{l|}{4'h8} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt AND} & \multicolumn{4}{l|}{4'h9} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt ADD} & \multicolumn{4}{l|}{4'ha} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt OR} & \multicolumn{4}{l|}{4'hb} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt XOR} & \multicolumn{4}{l|}{4'hc} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B} |
& Yes \\\hline |
{\tt LSL/ASL} & \multicolumn{4}{l|}{4'hd} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits} |
& Yes \\\hline |
{\tt ASR} & \multicolumn{4}{l|}{4'he} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits} |
& Yes \\\hline |
{\tt LSR} & \multicolumn{4}{l|}{4'hf} |
& \multicolumn{4}{l|}{R. Reg} |
& \multicolumn{3}{l|}{Cond.} |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits} |
& Yes \\\hline |
\end{tabular} |
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions} |
\end{center}\end{table} |
|
Finally, the {\tt LOCK} instruction was added in order to make a test and |
set multi--CPU operation possible. Following a LOCK instruction, the next |
two instructions, if they are memory LOD/STO instructions, will execute without |
dropping the wishbone {\tt CYC} line between the instructions. Thus a |
{\tt LOCK} followed by {\tt LOD (Rx),Ry} and a {\tt STO Rz,(Rx)}, where Rz |
is initially set, can be used to set an address while guaranteeing that Ry |
was the value before setting the address to Rz. This is a useful instruction |
while trying to achieve concurrency among multiple CPUs. |
As you can see, there's lots of room for instruction set expansion. The |
NOOP and BREAK instructions are the only instructions within one particular |
24--bit hole. The rest of this space is reserved for future enhancements. |
|
\section{Floating Point} |
Although the Zip CPU does not (yet) have a floating point unit, the current |
instruction set offers eight opcodes for floating point operations, and treats |
floating point exceptions like divide by zero errors. Once this unit is built |
and integrated together with the rest of the CPU, the Zip CPU will support |
32--bit floating point instructions natively. Any 64--bit floating point |
instructions will still need to be emulated in software. |
|
\section{Derived Instructions} |
The Zip CPU supports many other common instructions, but not all of them |
are single cycle instructions. The derived instruction tables, |
974,7 → 870,7
& Add with carry \\\hline |
{\tt BRA.Cond +/-\$Addr} |
& \hbox{\tt MOV.cond \$Addr+PC,PC} |
& Branch or jump on condition. Works for 13--bit |
& Branch or jump on condition. Works for 15--bit |
signed address offsets.\\\hline |
{\tt BRA.Cond +/-\$Addr} |
& \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC} |
1000,7 → 896,7
& {\tt Or \$SLEEP,CC} |
& This only works when issued in interrupt/supervisor mode. In user |
mode this is simply a wait until interrupt instruction. \\\hline |
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline |
{\tt INT } & {\tt LDI \$0,CC} & \\\hline |
{\tt IRET} |
& {\tt OR \$GIE,CC} |
& Also known as an RTU instruction (Return to Userspace) \\\hline |
1007,18 → 903,34
{\tt JMP R6+\$Addr} |
& {\tt MOV \$Addr(R6),PC} |
& \\\hline |
{\tt LJMP \$Addr} |
& \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }} |
& Although this only works for an unconditional jump, and it only |
works in a Von Neumann architecture, this instruction combination makes |
for a nice combination that can be adjusted by a linker at a later |
time.\\\hline |
{\tt JSR PC+\$Addr} |
& \parbox[t]{1.5in}{\tt SUB \$1,SP \\\ |
MOV \$3+PC,R0 \\ |
STO R0,1(SP) \\ |
MOV \$Addr+PC,PC \\ |
ADD \$1,SP} |
& Jump to Subroutine. Note the required cleanup instruction after |
returning. This could easily be turned into a three instruction |
operation, removing the preliminary stack instruction before and |
the cleanup after, by adjusting how any stack frame was built for |
this routine to include space at the top of the stack for the PC. |
Note also that jumping to a subroutine costs a copy register, {\tt R0} |
in this case. |
\\\hline |
{\tt JSR PC+\$Addr } |
& \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ MOV \$addr+PC,PC} |
& This is similar to the jump and link instructions from other |
architectures, save only that it requires a specific link |
instruction, also known as the {\tt MOV} instruction on the |
left.\\\hline |
& \parbox[t]{1.5in}{\tt MOV \$3+PC,R12 \\ MOV \$addr+PC,PC} |
&This is the high speed |
version of a subroutine call, necessitating a register to hold the |
last PC address. In its favor, this method doesn't suffer the |
mandatory memory access of the other approach. \\\hline |
{\tt LDI.l \$val,Rx } |
& \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\ |
LDILO (\$val\&0x0ffff),Rx} |
& Sadly, there's not enough instruction |
space to load a complete immediate value into any register. |
Therefore, fully loading any register takes two cycles. |
The LDIHI (load immediate high) and LDILO (load immediate low) |
instructions have been created to facilitate this. \\\hline |
\end{tabular} |
\caption{Derived Instructions}\label{tbl:derived-1} |
\end{center}\end{table} |
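
The {\tt LDI.l} derived instruction splits a 32--bit value into the two 16--bit immediates given to {\tt LDIHI} and {\tt LDILO}. The Python sketch below mirrors the expressions in the table. The function names are hypothetical and only illustrate the arithmetic an assembler or linker would perform.

```python
def split_ldi(val):
    """Return the (LDIHI, LDILO) immediates for a 32-bit value."""
    val &= 0xFFFFFFFF          # treat negative inputs as 32-bit two's complement
    hi = (val >> 16) & 0xFFFF  # ($val >> 16) & 0x0ffff
    lo = val & 0xFFFF          # $val & 0x0ffff
    return hi, lo

def join_ldi(hi, lo):
    """Rebuild the register value that the LDIHI/LDILO pair would produce."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
```

Joining the two halves again recovers the original 32--bit value.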
1025,18 → 937,6
\begin{table}\begin{center} |
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline |
Mapped & Actual & Notes \\\hline |
{\tt LDI.l \$val,Rx } |
& \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\ |
LDILO (\$val\&0x0ffff),Rx} |
& \parbox[t]{3.0in}{Sadly, there's not enough instruction |
space to load a complete immediate value into any register. |
Therefore, fully loading any register takes two cycles. |
The LDIHI (load immediate high) and LDILO (load immediate low) |
instructions have been created to facilitate this. |
\\ |
This is also the appropriate means for setting a register value |
to an arbitrary 32--bit value in a post--assembly link |
operation.}\\\hline |
{\tt LOD.b \$addr,Rx} |
& \parbox[t]{1.5in}{\tt % |
LDI \$addr,Ra \\ |
1083,8 → 983,12
instruction. \\\hline |
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline |
{\tt POP Rx } |
& \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP} |
& \\\hline |
& \parbox[t]{1.5in}{\tt LOD \$1(SP),Rx \\ ADD \$1,SP} |
& Note |
that for interrupt purposes, one can never depend upon the value at |
(SP). Hence you read from it, then increment it, lest having |
incremented it first something then comes along and writes to that |
value before you can read the result. \\\hline |
\end{tabular} |
\caption{Derived Instructions, continued}\label{tbl:derived-2} |
\end{center}\end{table} |
1091,18 → 995,16
\begin{table}\begin{center} |
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline |
{\tt PUSH Rx} |
& \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP} |
\hbox{\tt STO Rx,\$(SP)}} |
& \parbox[t]{1.5in}{SUB \$1,SP \\ |
STO Rx,\$1(SP)} |
& Note that for pipelined operation, it helps to coalesce all the |
{\tt SUB}'s into one command, and place the {\tt STO}'s right |
after each other. Further, to avoid a pipeline stall, the |
immediate value for the store must be zero. |
\\\hline |
after each other.\\\hline |
{\tt PUSH Rx-Ry} |
& \parbox[t]{1.5in}{\tt SUB \$$n$,SP \\ |
STO Rx,\$(SP) |
& \parbox[t]{1.5in}{\tt SUB \$n,SP \\ |
STO Rx,\$n(SP) |
\ldots \\ |
STO Ry,\$$\left(n-1\right)$(SP)} |
STO Ry,\$1(SP)} |
& Multiple pushes at once only need the single subtract from the |
stack pointer. This derived instruction is analogous to a similar one |
on the Motorola 68k architecture, although the Zip Assembler |
1111,13 → 1013,16
{\tt RESET} |
& \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP} |
& This depends upon the peripheral base address being |
preloaded into R12. |
in R12. |
|
Another opportunity might be to jump to the reset address from within |
supervisor mode.\\\hline |
{\tt RET} & {\tt MOV R0,PC} |
& This depends upon the form of the {\tt JSR} given on the previous |
page that stores the return address into R0. |
{\tt RET} & \parbox[t]{1.5in}{\tt LOD \$1(SP),PC} |
& Note that this depends upon the calling context to clean up the |
stack, as outlined for the JSR instruction. \\\hline |
{\tt RET} & {\tt MOV R12,PC} |
& This is the high(er) speed version, that doesn't touch the stack. |
As such, it doesn't suffer a stall on memory read/write to the stack. |
\\\hline |
{\tt STEP Rr,Rt} |
& \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr} |
1146,14 → 1051,12
Further, this instruction implies a byte ordering, |
such as big or little endian.} \\\hline |
{\tt SWAP Rx,Ry } |
& \parbox[t]{1.5in}{\tt XOR Ry,Rx \\ XOR Rx,Ry \\ XOR Ry,Rx} |
& \parbox[t]{1.5in}{\tt |
XOR Ry,Rx \\ |
XOR Rx,Ry \\ |
XOR Ry,Rx} |
& While no extra registers are needed, this example |
does take 3-clocks. \\\hline |
\end{tabular} |
\caption{Derived Instructions, continued}\label{tbl:derived-3} |
\end{center}\end{table} |
\begin{table}\begin{center} |
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline |
{\tt TRAP \#X} |
& \parbox[t]{1.5in}{\tt LDI \$x,R0 \\ AND \~\$GIE,CC } |
& This works because whenever a user lowers the \$GIE flag, it sets |
1163,23 → 1066,17
the LDI and the AND conditional. In that case, the assembler would |
quietly turn the LDI instruction into an LDILO and LDIHI pair, |
but the effect would be the same. \\\hline |
{\tt TS Rx,Ry,(Rz)} |
& \hbox{\tt LDI 1,Rx} |
\hbox{\tt LOCK} |
\hbox{\tt LOD (Rz),Ry} |
\hbox{\tt STO Rx,(Rz)} |
& A test and set instruction. The {\tt LOCK} instruction ensures |
that the next two instructions lock the bus between the instructions, |
so no one else can use it. This guarantees that the operation is |
atomic. |
\\\hline |
\end{tabular} |
\caption{Derived Instructions, continued}\label{tbl:derived-3} |
\end{center}\end{table} |
\begin{table}\begin{center} |
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline |
{\tt TST Rx} |
& {\tt TST \$-1,Rx} |
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx, |
ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc. The TST and CMP |
approaches won't stall future pipeline stages looking for the value |
of Rx. (Future versions of the assembler may shorten this to a |
{\tt TST Rx} instruction.)\\\hline |
of Rx. \\\hline |
{\tt WAIT} |
& {\tt Or \$GIE | \$SLEEP,CC} |
& Wait until the next interrupt, then jump to supervisor/interrupt |
1187,17 → 1084,6
\end{tabular} |
\caption{Derived Instructions, continued}\label{tbl:derived-4} |
\end{center}\end{table} |
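
The three--instruction {\tt SWAP} sequence in the derived instruction tables relies on the usual XOR exchange identity. The Python sketch below steps through it, assuming the source, destination operand order used elsewhere in this document. The function name is hypothetical.

```python
def xor_swap(rx, ry):
    """Exchange two register values using three XORs and no scratch register."""
    rx ^= ry   # XOR Ry,Rx : Rx = Rx ^ Ry
    ry ^= rx   # XOR Rx,Ry : Ry = Ry ^ (Rx ^ Ry) = original Rx
    rx ^= ry   # XOR Ry,Rx : Rx = (Rx ^ Ry) ^ original Rx = original Ry
    return rx, ry
```

The model passes the two values separately, so it does not cover naming the same register for both operands. In that case the first XOR would clear the register.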
|
\section{Interrupt Handling} |
The Zip CPU does not maintain any interrupt vector tables. If an interrupt |
takes place, the CPU simply switches to interrupt mode. The supervisor code |
continues in this interrupt mode from where it left off before, after |
executing a return to userspace {\tt RTU} instruction. |
|
At this point, the supervisor code needs to determine first whether an |
interrupt has occurred, and then whether it is in interrupt mode due to |
an exception and handle each case appropriately. |
|
\section{Pipeline Stages} |
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu}, |
the Zip CPU supports a five stage pipeline. |
1208,13 → 1094,13
ever changes. Stalls are also created here if the instruction isn't |
in the prefetch cache. |
|
The Zip CPU supports one of three prefetch methods, depending upon a |
flag set at build time within the {\tt cpudefs.v} file. The simplest |
is a non--cached implementation of a prefetch. This implementation is |
fairly small, and ideal for users of the Zip CPU who need the extra |
space on the FPGA fabric. However, because this non--cached version |
has no cache, the maximum number of instructions per clock is limited |
to about one per five. |
The Zip CPU supports one of two prefetch methods, depending upon a flag |
set at build time within the {\tt zipcpu.v} file. The simplest is a |
non--cached implementation of a prefetch. This implementation is |
fairly small, and ideal for |
users of the Zip CPU who need the extra space on the FPGA fabric. |
However, because this non--cached version has no cache, the maximum |
number of instructions per clock is limited to about one per five. |
|
The second prefetch module is a pipelined prefetch with a cache. This |
module tries to keep the instruction address within a window of valid |
1224,47 → 1110,33
feature, though, was that it needs an extra internal pipeline stage |
to be implemented. |
|
The third prefetch and cache module implements a more traditional cache. |
While the resulting code tends to be twice as fast as the pipelined |
cache architecture, this implementation uses a large amount of |
distributed FPGA RAM to be successful. This then inflates the Zip CPU's |
FPGA usage statistics. |
|
\item {\bf Decode}: Decodes an instruction into OpCode, register(s) to read, |
and immediate offset. This stage also determines whether the flags |
will be set or whether the result will be written back. |
|
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read, |
and immediate offset. This stage also determines whether the flags will |
be set or whether the result will be written back. |
\item {\bf Read Operands}: Read registers and apply any immediate values to |
them. There is no means of detecting or flagging arithmetic overflow |
or carry when adding the immediate to the operand. This stage will |
stall if any source operand is pending. |
|
\item Split into one of four tracks: An {\bf ALU} track which will accomplish |
a simple instruction, the {\bf MemOps} stage which handles {\tt LOD} |
(load) and {\tt STO} (store) instructions, the {\bf divide} unit, |
and the {\bf floating point} unit. |
\item Split into two tracks: An {\bf ALU} which will accomplish a simple |
instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load) |
and {\tt STO} (store) instructions. |
\begin{itemize} |
\item Loads will stall instructions in the decode stage until the |
load is complete, lest a register be read in |
the read operands stage only to be updated unseen by the |
Load. |
\item Condition codes are available upon completion of the ALU, |
divide, or FPU stage. |
\item Issuing a non--pipelined memory instruction to the memory unit |
while the memory unit is busy will stall the entire pipeline. |
\item Loads will stall the entire pipeline until complete. |
\item Condition codes are available upon completion of the ALU stage |
\item Issuing an instruction to the memory unit while the memory unit |
is busy will stall the entire pipeline. If the bus deadlocks, |
only a reset will release the CPU. (Watchdog timer, anyone?) |
\item The Zip CPU currently has no means of reading and acting on any |
error conditions on the bus. |
\end{itemize} |
\item {\bf Write-Back}: Conditionally write back the result to the register |
set, applying the condition. This routine is quad-entrant: either the |
ALU, the memory, the divide, or the FPU may write back a register. |
The only design rule is that no more than a single register may be |
written back in any given clock. |
set, applying the condition. This routine is bi-entrant: either the |
memory or the simple instruction may request a register write. |
\end{enumerate} |
|
The Zip CPU does not support out of order execution. Therefore, if the memory |
unit stalls, every other instruction stalls. The same is true for divide or |
floating point instructions--all other instructions will stall while waiting |
for these to complete. Memory stores, however, can take place concurrently |
with non--memory operations, although memory reads (loads) cannot. |
unit stalls, every other instruction stalls. Memory stores, however, can take |
place concurrently with ALU operations, although memory reads (loads) cannot. |
|
\section{Pipeline Stalls} |
The processing pipeline can and will stall for a variety of reasons. Some of |
1273,38 → 1145,37
\item When the prefetch cache is exhausted |
|
This reason should be obvious. If the prefetch cache doesn't have the |
instruction in memory, the entire pipeline must stall until an instruction |
can be made ready. In the case of the {\tt pipefetch} windowed approach |
to the prefetch cache, this means the pipeline will stall until enough of the |
prefetch cache is loaded to support the next instruction. In the case |
of the more traditional {\tt pfcache} approach, the entire cache line must |
fill before instruction execution can continue. |
instruction in memory, the entire pipeline must stall until enough of the |
prefetch cache is loaded to support the next instruction. |
|
\item While waiting for the pipeline to load following any taken branch, jump, |
return from interrupt or switch to interrupt context (4 stall cycles) |
return from interrupt or switch to interrupt context (5 stall cycles) |
|
Fig.~\ref{fig:bcstalls} |
\begin{figure}\begin{center} |
\includegraphics[width=3.5in]{../gfx/bc.eps} |
\caption{A conditional branch generates 4 stall cycles}\label{fig:bcstalls} |
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls} |
\end{center}\end{figure} |
illustrates the situation for a conditional branch. In this case, the branch |
instruction, {\tt BC}, is nominally followed by instructions {\tt I1} and so |
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so |
forth. However, since the branch is taken, the next instruction must be |
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded. |
Given that there are five stages to the pipeline, that accounts |
for the four stalls. (Were the {\tt pipefetch} cache chosen, there would |
be another stall internal to the {\tt pipefetch} cache.) |
for four of the five stalls. The last stall cycle is lost in the pipelined |
prefetch stage which needs at least one clock with a valid PC before it can |
produce a new output. {\Large\bf Note: When I did this myself, I counted |
six stall cycles, for a total of seven cycles for this instruction. Is five |
really the right answer?} |
|
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and |
{\tt LDI \$X,PC} instructions specially, however. These instructions, when |
not conditioned on the flags, can execute with only a single stall cycle, |
such as is shown in Fig.~\ref{fig:branch}.\footnote{Note that when using the |
{\tt pipefetch} cache, this requires an additional stall cycle due to that |
cache's implementation.} |
not conditioned on the flags, can execute with only 2~stall cycles, such as |
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is |
slated to be improved upon in subsequent releases. With a better prefetch, |
it should be possible to drop this down to one or zero stall cycles.} |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock |
\caption{An expedited branch costs a single stall cycle}\label{fig:branch} |
\includegraphics[width=4in]{../gfx/bra.eps} |
\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch} |
\end{center}\end{figure} |
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction |
following the branch in memory, while {\tt IA} is the first instruction at the |
1326,20 → 1197,33
That is, any instruction that does not add an immediate to {\tt RA} may be |
scheduled into the stall slot. |
|
This is also the reason why, when setting up a stack frame, the top of the |
stack frame is used first: it eliminates this stall cycle. Hence, to save |
registers at the top of a procedure, one would write: |
\item When any (conditional) write to either the CC or PC Register is followed |
by a memory operation |
\begin{enumerate} |
\item\ {\tt SUB 2,SP} |
\item\ {\tt STO R1,(SP)} |
\item\ {\tt STO R2,1(SP)} |
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode} |
\item\ {\em (stall, even if jump not taken)} |
\item\ {\tt LOD \$X(RA),RB} |
\end{enumerate} |
Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack, |
there would've been an extra stall in setting up the stack frame. |
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem}, |
\begin{figure}\begin{center} |
\includegraphics[width=2in]{../gfx/bcmem.eps} |
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem} |
\end{center}\end{figure} |
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which |
must be a load here), and ALU instructions {\tt I1} and so forth. |
Since branches take place in the writeback stage, the Zip CPU will stall the |
pipeline for one clock anytime there may be a possible jump--forcing the |
memory operation to stay in the operand decode stage. This prevents |
an instruction from executing a memory access after the jump but before the |
jump is recognized. |
|
This stall may be mitigated by shuffling the operations immediately following |
a potential branch so that an ALU operation follows the branch instead of a |
memory operation. |
|
\item When reading from the CC register after setting the flags |
\begin{enumerate} |
\item\ {\tt ALUOP RA,RB} {\em ; Ex: a compare opcode} |
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode} |
\item\ {\em (stall)} |
\item\ {\tt TST sys.ccv,CC} |
\item\ {\tt BZ somewhere} |
1359,7 → 1243,7
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur |
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC} |
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since |
it doesn't read the condition codes before executing. |
it doesn't read the condition codes. |
|
\item When waiting for a memory read operation to complete |
\begin{enumerate} |
1373,20 → 1257,17
stall until the memory unit is cleared. This is illustrated in |
Fig.~\ref{fig:memrd}, |
\begin{figure}\begin{center} |
\includegraphics[width=5.6in]{../gfx/memrd.eps} |
\includegraphics[width=5in]{../gfx/memrd.eps} |
\caption{Pipeline handling of a load instruction}\label{fig:memrd} |
\end{center}\end{figure} |
since it is especially true of a load |
instruction, which must still write its operand back to the register file. |
Further, note that on a pipelined memory operation, the instruction must |
stall in the decode operand stage, lest it try to read a result from the |
register file before the load result has been written to it. Finally, note |
that there is an extra stall at the end of the memory cycle, so that |
the memory unit will be idle for two clocks before an instruction will be |
accepted into the ALU. Store instructions are different, as shown in |
Fig.~\ref{fig:memwr}, |
instruction, which must still write its operand back to the register file. |
Note that there is an extra stall at the end of the memory cycle, so that |
the memory unit will be idle for one clock before an instruction will be |
accepted into the ALU. |
Store instructions are different, as shown in Fig.~\ref{fig:memwr}, |
\begin{figure}\begin{center} |
\includegraphics[width=4in]{../gfx/memwr.eps} |
\includegraphics[width=5in]{../gfx/memwr.eps} |
\caption{Pipeline handling of a store instruction}\label{fig:memwr} |
\end{center}\end{figure} |
since they can be busy with the bus without impacting later write back |
1429,6 → 1310,21
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory |
accesses. |
|
\item When waiting for a conditional memory read operation to complete |
\begin{enumerate} |
\item\ {\tt LOD.Z address,RA} |
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)} |
\item\ {\tt OPCODE I+RA,RB} |
\end{enumerate} |
|
In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus |
two cycles before using the bus, so there's a potential for an extra three |
cycle cost due to bus contention between the prefetch and the CPU. |
|
This is true for both the LOD and the STO instructions, with the exception that |
the STO instruction will continue in parallel with any ALU instructions that |
follow it. |
|
\end{itemize} |
|
|
1477,12 → 1373,12
to re-enable any other interrupts. |
|
The Zip System currently hosts two interrupt controllers, a primary and a |
secondary. The primary interrupt controller has one (or more) interrupt line(s) |
which may come from an external interrupt source, and one interrupt line from |
the secondary controller. Other primary interrupts include the system timers, |
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt |
controller maintains an interrupt state for all of the processor accounting |
counters. |
secondary. The primary interrupt controller has one interrupt line (perhaps |
more if you configure it for more) which may come from an external interrupt |
controller, and one interrupt line from the secondary controller. Other |
primary interrupts include the system timers, the jiffies interrupt, and the |
manual cache interrupt. The secondary interrupt controller maintains an |
interrupt state for all of the processor accounting counters. |
|
\section{Counter} |
|
1531,8 → 1427,7
then read from its port to find out which memory location created the problem. |
|
Aside from its unusual configuration, the bus watchdog is just another |
implementation of the fundamental timer described above--stripped down |
for simplicity. |
implementation of the fundamental timer described above. |
|
\section{Jiffies} |
|
1540,8 → 1435,7
can request to be put to sleep until a certain number of `jiffies' have |
elapsed. Using this interface, the CPU can read the number of `jiffies' |
from the peripheral (it only has the one location in address space), add the |
sleep length to it, and write the result back to the peripheral. The |
{\tt zipjiffies} |
sleep length to it, and write the result back to the peripheral. The zipjiffies |
peripheral will record the value written to it only if it is nearer the current |
counter value than the last current waiting interrupt time. If no other |
interrupts are waiting, and this time is in the future, it will be enabled. |
1563,8 → 1457,8
Jiffies value, and $N$, and write it back to the Jiffies register. The |
O/S must also keep track of values written to the Jiffies register. Thus, |
when an `alarm' trips, it should be removed from the list of alarms, the list |
should be resorted, and the next alarm in terms of Jiffies should be written |
to the register--possibly for a second time. |
should be sorted, and the next alarm in terms of Jiffies should be written |
to the register. |
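
The bookkeeping described above can be sketched in Python. This is only an illustration of the O/S side, not the peripheral's logic: the class and function names are hypothetical, and both the 32-bit wrap-around and the signed-distance comparison are assumptions made for the sake of the example.

```python
MASK32 = 0xFFFFFFFF  # assumption: the Jiffies counter is 32 bits and wraps


def jiffies_until(now, when):
    """Signed 32-bit distance from now to when; negative means already past."""
    d = (when - now) & MASK32
    return d - (1 << 32) if d & 0x80000000 else d


class AlarmList:
    """Hypothetical O/S-side record of values written to the Jiffies register."""

    def __init__(self):
        self.alarms = []

    def sleep(self, now, n):
        """Record an alarm N jiffies from now; return the value to write."""
        when = (now + n) & MASK32
        self.alarms.append(when)
        return when

    def on_interrupt(self, now):
        """Drop tripped alarms, sort the rest, return the next value to write."""
        self.alarms = [a for a in self.alarms if jiffies_until(now, a) > 0]
        if not self.alarms:
            return None
        self.alarms.sort(key=lambda a: jiffies_until(now, a))
        return self.alarms[0]
```

Sorting by signed distance rather than by raw value keeps the ordering correct when the counter wraps past zero.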
|
\section{Direct Memory Access Controller} |
|
1575,19 → 1469,22
any DMA memory move will by nature be faster than a corresponding program |
accomplishing the same move. To put this to numbers, it may take a program |
18~clocks per word transferred, whereas this DMA controller can move one |
word in two clocks--provided it has bus access. (The CPU gets priority over |
the bus.) |
word in two clocks--provided it has bus access. (The CPU gets priority over the |
bus.) |
|
When copying memory from one location to another, the DMA controller will |
copy in units of a given transfer length--up to 1024 words at a time. It will |
read that transfer length into its internal buffer, and then write to the |
destination address from that buffer. |
destination address from that buffer. If the CPU interrupts a DMA transfer, |
it will release the bus, let the CPU complete whatever it needs to do, and then |
restart its transfer by writing the contents of its internal buffer and then |
re-entering its read cycle again. |
|
When coupled with a peripheral, the DMA controller can be configured to start |
a memory copy when any interrupt line goes high. Further, the controller can |
be configured to issue reads from (or to) the same address instead of |
incrementing the address at each clock. The DMA completes once the total |
number of items specified (not the transfer length) have been transferred. |
a memory copy on an interrupt line going high. Further, the controller can be |
configured to issue reads from (or to) the same address instead of incrementing |
the address at each clock. The DMA completes once the total number of items |
specified (not the transfer length) have been transferred. |
|
In each case, once the transfer is complete and the DMA unit returns to |
idle, the DMA will issue an interrupt. |
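
A behavioral sketch of the copy described above, in Python, may help. It models memory as a plain list and ignores bus arbitration, interrupts, and timing entirely; the function name and its arguments are hypothetical and do not correspond to the controller's register interface.

```python
def dma_copy(mem, src, dst, total, chunk, inc_src=True, inc_dst=True):
    """Copy `total` words in bursts of at most `chunk` words.

    Each burst is first read into an internal buffer and then written out,
    as the text describes. Clearing inc_src or inc_dst holds that address
    fixed, as one would for a peripheral register.
    """
    if not 1 <= chunk <= 1024:
        raise ValueError("transfer length must be between 1 and 1024 words")
    done = 0
    while done < total:
        n = min(chunk, total - done)
        buf = []
        for _ in range(n):              # read phase
            buf.append(mem[src])
            if inc_src:
                src += 1
        for word in buf:                # write phase
            mem[dst] = word
            if inc_dst:
                dst += 1
        done += n
    return done                         # total items transferred
```

Note that the loop ends on the total item count, not on the transfer length, which only sets the burst size.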
1603,29 → 1500,20
must exist, and must be connected to the Zip CPU for it to work. |
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}. This is |
the program counter of the first instruction following a reset. |
\item To conserve logic, you'll want to set the {\tt ADDRESS\_WIDTH} parameter |
to the number of address bits on your wishbone bus. |
\item Likewise, the {\tt LGICACHE} parameter sets the number of bits in |
the instruction cache address. This means that the instruction cache |
will have $2^{\mbox{\tiny\tt LGICACHE}}$ locations within it. |
\item If you want the Zip System to start up on its own, you will need to |
set the {\tt START\_HALTED} parameter to zero. Otherwise, if you |
wish to manually start the CPU, that is if upon reset you want the |
	CPU to start in its halted, reset state, then set this parameter to |
one. This latter configuration is useful for a CPU that should be |
idle (i.e. halted) until given an explicit instruction from somewhere |
else to start. |
one. |
\item The third parameter to set is the number of interrupts you will be |
providing from external to the CPU. This can be anything from one |
to sixteen, but it cannot be zero. (Set this to 1 and wire the single |
interrupt line to a 1'b0 if you do not wish to support any external |
interrupts.) |
to nine, but it cannot be zero. (Wire this line to a 1'b0 if you |
do not wish to support any external interrupts.) |
\item Finally, you need to place into some wishbone accessible address, whether |
RAM or (more likely) ROM, the initial instructions for the CPU. |
\end{enumerate} |
If you have enabled your CPU to start automatically, then upon power up the |
CPU will immediately start executing your instructions, starting at the given |
{\tt RESET\_ADDRESS}. |
CPU will immediately start executing your instructions. |
|
This is, however, not how I have used the Zip CPU. I have instead used the |
Zip CPU in a more controlled environment. For me, the CPU starts in a |
1633,8 → 1521,8
location in RAM. After bringing up the board I am using, and further the |
bus that is on it, the RAM memory is then loaded externally with the program |
I wish the Zip System to run. Once the RAM is loaded, I release the CPU. |
The CPU then runs until either its halt condition or an exception occurs in |
supervisor mode, at which point its task is complete. |
The CPU then runs until its halt condition, at which point its task is |
complete. |
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm |
just not there yet. |
1763,9 → 1651,9
\> {\tt MOV uR0,R0} \\ |
\> {\tt MOV uCC,R1} \\ |
\> {\tt MOV uPC,R2} \\ |
\> {\tt STO R0,(R3)} {\em ; Exploit memory pipelining: }\\ |
\> {\tt STO R1,1(R3)} {\em ; All instructions write to stack }\\ |
\> {\tt STO R2,2(R3)} {\em ; All offsets increment by one }\\ |
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\ |
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\ |
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\ |
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\ |
\end{tabbing} |
\caption{Example Saving Minimal User Context}\label{tbl:save-partial} |
1806,9 → 1694,9
RESTORE\_PARTIAL\_CONTEXT: \\ |
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\ |
\> {\tt MOV uSP,R3} {\em ; Return the updated stack pointer } \\ |
\> {\tt LOD R0,(R3),R0} {\em ; Exploit memory pipelining: }\\ |
\> {\tt LOD R1,1(R3),R1} {\em ; All instructions write to stack }\\ |
\> {\tt LOD R2,2(R3),R2} {\em ; All offsets increment by one }\\ |
\> {\tt LOD R0,1(R3),R0} {\em ; Exploit memory pipelining: }\\ |
\> {\tt LOD R1,2(R3),R1} {\em ; All instructions write to stack }\\ |
\> {\tt LOD R2,3(R3),R2} {\em ; All offsets increment by one }\\ |
\> {\tt MOV R0,uR0} \\ |
\> {\tt MOV R1,uCC} \\ |
\> {\tt MOV R2,uPC} \\ |
1868,29 → 1756,19
\begin{table}\begin{center} |
\begin{tabular}{ll} |
memcp: \\ |
& {\em ; R0 = *dest, R1 = *src, R2 = LEN, R3 = return addr} \\ |
& {\em ; The following will operate in $12N+19$ clocks.} \\ |
& {\tt CMP 0,R2} \\ % 8 clocks per setup |
& {\tt MOV.Z R3,PC} {\em ; A conditional return }\\ |
& {\tt SUB 1,SP} {\em ; Create a stack frame}\\ |
& {\tt STO R4,(SP)} {\em ; and a local variable}\\ |
& {\em ; (4 stalls, cannot be further scheduled away)} \\ |
loop: \\ % 12 clocks per loop |
& {\tt LOD (R1),R4} \\ |
& {\em ; R0 = *dest, R1 = *src, R2 = LEN} \\ |
& {\em ; The following will operate in 17 clocks per word minus one clock} \\ |
& {\tt CMP 0,R2} \\ |
& {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\ |
& {\em ; (One stall on potentially writing to PC)} \\ |
& {\tt LOD (R1),R3} \\ |
& {\em ; (4 stalls, cannot be scheduled away)} \\ |
& {\tt STO R4,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\ |
& {\tt STO R3,(R2)} {\em ; (4 schedulable stalls, has no impact now)} \\ |
& {\tt ADD 1,R1} \\ |
& {\tt SUB 1,R2} \\ |
& {\tt BZ memcpend} \\ |
& {\tt ADD 1,R0} \\ |
& {\tt ADD 1,R1} \\ |
& {\tt BRA loop} \\ |
& {\em ; (1 stall on a BRA instruction)} \\ |
memcpend: % 11 clocks |
& {\tt LOD (SP),R4} \\ |
& {\em ; (4 stalls, cannot be further scheduled away)} \\ |
& {\tt ADD 1,SP} \\ |
& {\tt JMP R3} \\ |
& {\em ; (4 stalls)} \\ |
& {\tt BNZ loop} \\ |
& {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\ |
& {\tt RET} \\ |
\end{tabular} |
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm} |
\end{center}\end{table} |
1910,11 → 1788,9
ways. The first step is that an interrupt happens. Anytime an interrupt |
happens, the CPU needs to execute the following tasks in supervisor mode: |
\begin{enumerate} |
\item Check for a trap instruction, or other user exception such as a break, |
bus error, division by zero error, or floating point exception. That |
is, if the user process needs attending then we may not wish to adjust |
the context, check interrupts, or call the scheduler. |
Tbl.~\ref{tbl:trap-check} |
\item Check for a trap instruction. That is, if the user task requested a |
trap, we may not wish to adjust the context, check interrupts, or call |
the scheduler. Tbl.~\ref{tbl:trap-check} |
\begin{table}\begin{center} |
\begin{tabular}{ll} |
{\tt return\_to\_user:} \\ |
1924,13 → 1800,13
& {\tt RTU} \\ |
{\tt trap\_check:} \\ |
& {\tt MOV uCC,R0} \\ |
& {\tt TST \$TRAP \textbar \$BUSERR \textbar \$DIVE \textbar \$FPE,R0} \\ |
& {\tt TST \$TRAP,R0} \\ |
& {\tt BNZ swap\_out} \\ |
& {; \em Do something here to execute the trap} \\ |
& {; \em Don't need to call the scheduler, so we can just return} \\ |
& {\tt BRA return\_to\_user} \\ |
\end{tabular} |
\caption{Checking for whether the user task needs our attention}\label{tbl:trap-check} |
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check} |
\end{center}\end{table} |
shows the rudiments of this code, while showing nothing of how the |
actual trap would be implemented. |
1960,11 → 1836,11
& {\tt MOV uR2,R2} \\ |
& {\tt MOV uR3,R3} \\ |
& {\tt MOV uR4,R4} \\ |
& {\tt STO R0,(R5)} {\em ; Exploit memory pipelining: }\\ |
& {\tt STO R1,1(R5)} {\em ; All instructions write to stack }\\ |
& {\tt STO R2,2(R5)} {\em ; All offsets increment by one }\\ |
& {\tt STO R3,3(R5)} {\em ; Longest pipeline is 5 cycles.}\\ |
& {\tt STO R4,4(R5)} \\ |
& {\tt STO R0,1(R5)} {\em ; Exploit memory pipelining: }\\ |
& {\tt STO R1,2(R5)} {\em ; All instructions write to stack }\\ |
& {\tt STO R2,3(R5)} {\em ; All offsets increment by one }\\ |
& {\tt STO R3,4(R5)} {\em ; Longest pipeline is 5 cycles.}\\ |
& {\tt STO R4,5(R5)} \\ |
& \ldots {\em ; Need to repeat for all user registers} \\ |
\iffalse |
& {\tt MOV uR5,R0} \\ |
1972,11 → 1848,11
& {\tt MOV uR7,R2} \\ |
& {\tt MOV uR8,R3} \\ |
& {\tt MOV uR9,R4} \\ |
& {\tt STO R0,5(R5) }\\ |
& {\tt STO R1,6(R5) }\\ |
& {\tt STO R2,7(R5) }\\ |
& {\tt STO R3,8(R5) }\\ |
& {\tt STO R4,9(R5)} \\ |
& {\tt STO R0,6(R5) }\\ |
& {\tt STO R1,7(R5) }\\ |
& {\tt STO R2,8(R5) }\\ |
& {\tt STO R3,9(R5) }\\ |
& {\tt STO R4,10(R5)} \\ |
\fi |
& {\tt MOV uR10,R0} \\ |
& {\tt MOV uR11,R1} \\ |
1983,11 → 1859,11
& {\tt MOV uR12,R2} \\ |
& {\tt MOV uCC,R3} \\ |
& {\tt MOV uPC,R4} \\ |
& {\tt STO R0,10(R5)}\\ |
& {\tt STO R1,11(R5)}\\ |
& {\tt STO R2,12(R5)}\\ |
& {\tt STO R3,13(R5)}\\ |
& {\tt STO R4,14(R5)} \\ |
& {\tt STO R0,11(R5)}\\ |
& {\tt STO R1,12(R5)}\\ |
& {\tt STO R2,13(R5)}\\ |
& {\tt STO R3,14(R5)}\\ |
& {\tt STO R4,15(R5)} \\ |
& {\em ; We can skip storing the stack, uSP, since it'll be stored}\\ |
& {\em ; elsewhere (in the task structure) }\\ |
\end{tabular} |
2087,11 → 1963,11
& {\tt LOD stack(R12),R5} \\ |
& {\tt MOV 15(R1),uSP} \\ |
& {\em ; Be sure to exploit the memory pipelining capability} \\ |
& {\tt LOD (R5),R0} \\ |
& {\tt LOD 1(R5),R1} \\ |
& {\tt LOD 2(R5),R2} \\ |
& {\tt LOD 3(R5),R3} \\ |
& {\tt LOD 4(R5),R4} \\ |
& {\tt LOD 1(R5),R0} \\ |
& {\tt LOD 2(R5),R1} \\ |
& {\tt LOD 3(R5),R2} \\ |
& {\tt LOD 4(R5),R3} \\ |
& {\tt LOD 5(R5),R4} \\ |
& {\tt MOV R0,uR0} \\ |
& {\tt MOV R1,uR1} \\ |
& {\tt MOV R2,uR2} \\ |
2098,11 → 1974,11
& {\tt MOV R3,uR3} \\ |
& {\tt MOV R4,uR4} \\ |
& \ldots {\em ; Need to repeat for all user registers} \\ |
& {\tt LOD 10(R5),R0} \\ |
& {\tt LOD 11(R5),R1} \\ |
& {\tt LOD 12(R5),R2} \\ |
& {\tt LOD 13(R5),R3} \\ |
& {\tt LOD 14(R5),R4} \\ |
& {\tt LOD 11(R5),R0} \\ |
& {\tt LOD 12(R5),R1} \\ |
& {\tt LOD 13(R5),R2} \\ |
& {\tt LOD 14(R5),R3} \\ |
& {\tt LOD 15(R5),R4} \\ |
& {\tt MOV R0,uR10} \\ |
& {\tt MOV R1,uR11} \\ |
& {\tt MOV R2,uR12} \\ |
2140,7 → 2016,7
\begin{center}\begin{reglist} |
PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline |
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline |
& \scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus error \\\hline |
& \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline |
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline |
TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline |
TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline |
2177,15 → 2053,14
and when so accessed will respond as described in Chapt.~\ref{chap:periph}. |
These registers will be discussed briefly again here. |
|
\subsection{Interrupt Controller(s)} |
The Zip CPU Interrupt controller has four different types of bits, as shown in |
Tbl.~\ref{tbl:picbits}. |
\begin{table}\begin{center} |
\begin{bitlist} |
31 & R/W & Master Interrupt Enable\\\hline |
30\ldots 16 & R/W & Interrupt Enables, write `1' to change\\\hline |
30\ldots 16 & R/W & Interrupt Enables, write '1' to change\\\hline |
15 & R & Current Master Interrupt State\\\hline |
15\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline |
15\ldots 0 & R/W & Input Interrupt states, write '1' to clear\\\hline |
\end{bitlist} |
\caption{Interrupt Controller Register Bits}\label{tbl:picbits} |
\end{center}\end{table} |
2194,11 → 2069,10
will switch to supervisor mode, etc. |
|
Bits 30~\ldots 16 are interrupt enable bits. Should the interrupt line go |
high while enabled, an interrupt will be generated. (All interrupts are positive |
edge triggered.) To set an interrupt enable bit, one needs to write the |
master interrupt enable while writing a `1' to this bit. To clear, one |
need only write a `0' to the master interrupt enable, while leaving this line |
high. |
high while enabled, an interrupt will be generated. To set an interrupt enable |
bit, one needs to write the master interrupt enable while writing a `1' to this |
bit. To clear, one need only write a `0' to the master interrupt enable, |
while leaving this line high. |
|
Bits 15\ldots 0 are the current state of the interrupt vector. Interrupt lines |
trip when they go high, and remain tripped until they are acknowledged. If |
2217,13 → 2091,13
interrupt handler to call. |
\item If the interrupt handler expects more interrupts, it will clear its |
current interrupt when it is done handling the interrupt in question. |
To do this, it will write a `1' to the low order interrupt mask, |
such as writing a {\tt 32'h0000\_0001}. |
To do this, it will write a '1' to the low order interrupt mask, |
such as writing a {\tt 32'h80000001}. |
\item If the interrupt handler does not expect any more interrupts, it will |
instead clear the interrupt from the controller by writing a |
{\tt 32'h0001\_0001} to the controller. |
{\tt 32'h00010001} to the controller. |
\item Once all interrupts have been handled, the supervisor will write a |
{\tt 32'h8000\_0000} to the interrupt register to re-enable interrupt |
{\tt 32'h80000000} to the interrupt register to re-enable interrupt |
generation. |
\item The supervisor should also check the user trap bit, and possible soft |
interrupt bits here, but this action has nothing to do with the |
2233,8 → 2107,6
command. |
\end{enumerate} |
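
The write values used in the steps above can be assembled from the bit fields of the controller register. The Python sketch below shows one way to do so. The names are hypothetical, and treating bit 31 as the set/clear selector for the enable bits is one reading of the description above, not something confirmed against the hardware source.

```python
MASTER_ENABLE = 1 << 31  # bit 31: master interrupt enable


def enable_bit(line):
    """Enable bit (bits 30..16) for interrupt line 0..14."""
    if not 0 <= line <= 14:
        raise ValueError("interrupt line must be in 0..14")
    return 1 << (16 + line)


def state_bit(line):
    """State bit (bits 14..0) for interrupt line 0..14; write '1' to clear."""
    if not 0 <= line <= 14:
        raise ValueError("interrupt line must be in 0..14")
    return 1 << line


def pic_enable_word(line):
    """Set one enable bit: write it together with bit 31 high."""
    return MASTER_ENABLE | enable_bit(line)


def pic_disable_word(line):
    """Clear one enable bit: write it with bit 31 low."""
    return enable_bit(line)


def pic_ack_word(line):
    """Acknowledge (clear) a tripped interrupt, leaving enables alone."""
    return state_bit(line)


def pic_disable_and_ack_word(line):
    """Clear both the enable and the tripped state of one interrupt."""
    return enable_bit(line) | state_bit(line)
```

For interrupt line zero, the last helper yields the {\tt 32'h0001\_0001} value from the list above, and the master enable constant alone is the {\tt 32'h8000\_0000} written to re-enable interrupt generation.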
|
\subsection{Timer Register} |
|
Leaving the interrupt controller, we show the timer registers bit definitions |
in Tbl.~\ref{tbl:tmrbits}. |
\begin{table}\begin{center} |
2253,8 → 2125,6
the auto--reload option was written to it. To clear and stop the timer, |
just simply write a `32'h00' to this register. |
|
\subsection{Jiffies} |
|
The Jiffies register is somewhat similar in that the register always changes. |
In this case, the register counts up, whereas the timer always counted down. |
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits}, |
2272,8 → 2142,6
may either be written to it, or it will just continue counting without |
activating any more interrupts. |
|
\subsection{Performance Counters} |
|
The Zip CPU also supports several counter peripherals, mostly in the way of |
process accounting. These peripherals have a single register associated with |
them, shown in Tbl.~\ref{tbl:ctrbits}. |
2299,12 → 2167,8
the fact. Second, whenever activating a user task, the Operating System will |
set the four user counters to zero. When the user task has completed, the |
Operating System will read the timers back off, to determine how much of the |
CPU the process had consumed. To keep this accurate, the user counters will |
only increment when the GIE bit is set to indicate that the processor is |
in user mode. |
CPU the process had consumed. |
|
\subsection{DMA Controller} |
|
The final peripheral to discuss is the DMA controller. This controller |
has four registers. Of these four, the length, source and destination address |
registers should need no further explanation. They are full 32--bit registers |
2319,15 → 2183,15
31 & R & DMA Active\\\hline |
30 & R & Wishbone error, transaction aborted. This bit is cleared the next time |
this register is written to.\\\hline |
29 & R/W & Set to `1' to prevent the controller from incrementing the source address, `0' for normal memory copy. \\\hline |
28 & R/W & Set to `1' to prevent the controller from incrementing the |
destination address, `0' for normal memory copy. \\\hline |
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline |
28 & R/W & Set to '1' to prevent the controller from incrementing the |
destination address, '0' for normal memory copy. \\\hline |
27 \ldots 16 & W & The DMA Key. Write a 12'hfed to these bits to |
	activate any DMA transfer. \\\hline |
27 & R & Always reads `0', to force the deliberate writing of the key. \\\hline |
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline |
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that |
have yet to be written. \\\hline |
15 & R/W & Set to `1' to trigger on an interrupt, or `0' to start immediately |
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately |
upon receiving a valid key.\\\hline |
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline |
9\ldots 0 & R/W & Intermediate transfer length minus one. Thus, to transfer |
2357,15 → 2221,14
Tbl.~\ref{tbl:dbgctrl}. |
\begin{table}\begin{center} |
\begin{bitlist} |
31\ldots 14 & R & External interrupt state. Bit 14 is valid for one |
interrupt only, bit 15 for two, etc.\\\hline |
31\ldots 14 & R & Reserved\\\hline |
13 & R & CPU GIE setting\\\hline |
12 & R & CPU is sleeping\\\hline |
11 & W & Command clear PF cache\\\hline |
10 & R/W & Command HALT, Set to `1' to halt the CPU\\\hline |
9 & R & Stall Status, `1' if CPU is busy (i.e., not halted yet)\\\hline |
8 & R/W & Step Command, set to `1' to step the CPU, also sets the halt bit\\\hline |
7 & R & Interrupt Request Pending\\\hline |
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline |
9 & R & Stall Status, '1' if CPU is busy\\\hline |
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline |
7 & R & Interrupt Request \\\hline |
6 & R/W & Command RESET \\\hline |
5\ldots 0 & R/W & Debug Register Address \\\hline |
\end{bitlist} |
2373,7 → 2236,7
\end{center}\end{table} |
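
As a rough aid to reading the debug control register, the Python sketch below names the bits from the table and decodes the halted condition (bit 10 high and bit 9 low). All names are hypothetical. Keeping the halt bit set while selecting a register address is an assumption made here so that the example does not appear to release the CPU; it is not a documented requirement.

```python
DBG_GIE         = 1 << 13  # R:   CPU GIE setting
DBG_SLEEPING    = 1 << 12  # R:   CPU is sleeping
DBG_CLEAR_CACHE = 1 << 11  # W:   clear the prefetch cache
DBG_HALT        = 1 << 10  # R/W: command HALT
DBG_STALL       = 1 << 9   # R:   CPU is busy
DBG_STEP        = 1 << 8   # R/W: step the CPU
DBG_INT         = 1 << 7   # R:   interrupt request
DBG_RESET       = 1 << 6   # R/W: command RESET
DBG_ADDR_MASK   = 0x3F     # R/W: debug register address, bits 5..0


def halt_command():
    """Control word requesting that the CPU halt."""
    return DBG_HALT


def select_register(addr):
    """Control word selecting a debug register while keeping the CPU halted."""
    if not 0 <= addr <= DBG_ADDR_MASK:
        raise ValueError("debug register address must fit in six bits")
    return DBG_HALT | addr


def is_halted(status):
    """True once the halt bit is set and the stall (busy) bit has dropped."""
    return bool(status & DBG_HALT) and not (status & DBG_STALL)
```

A debugger following this sketch would write the halt command, poll the register until the halted test passes, and only then select and read registers.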
|
The first step in debugging access is to determine whether or not the CPU |
is halted, and to halt it if not. To do this, first write a `1' to the |
is halted, and to halt it if not. To do this, first write a '1' to the |
Command HALT bit. This will halt the CPU and place it into debug mode. |
Once the CPU is halted, the stall status bit will drop to zero. Thus, |
if bit 10 is high and bit 9 low, the debug port is open to examine the |
2398,7 → 2261,6
uPC & 31 & 32 & R/W & User Program Counter\\\hline |
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline |
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline |
BUS & 34 & 32 & R & Last Bus Error\\\hline |
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline |
TMRA & 36 & 32 & R/W & Timer A\\\hline |
TMRB & 37 & 32 & R/W & Timer B\\\hline |
2452,8 → 2314,7
Port granularity & 32--bit \\\hline |
Maximum Operand Size & 32--bit \\\hline |
Data transfer ordering & (Irrelevant) \\\hline |
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a |
XuLA2--LX25\\\hline |
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline |
Signal Names & \begin{tabular}{ll} |
Signal Name & Wishbone Equivalent \\\hline |
{\tt i\_clk} & {\tt CLK\_I} \\ |
2475,13 → 2336,12
\begin{wishboneds} |
Revision level of wishbone & WB B4 spec \\\hline |
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline |
Address Width & (Zip System parameter, can be up to 32~bits) \\\hline |
Address Width & 32~bits \\\hline |
Port size & 32--bit \\\hline |
Port granularity & 32--bit \\\hline |
Maximum Operand Size & 32--bit \\\hline |
Data transfer ordering & (Irrelevant) \\\hline |
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a |
XuLA2--LX25\\\hline |
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline |
Signal Names & \begin{tabular}{ll} |
Signal Name & Wishbone Equivalent \\\hline |
{\tt i\_clk} & {\tt CLK\_O} \\ |
2492,8 → 2352,7
{\tt o\_wb\_data} & {\tt DAT\_O} \\ |
{\tt i\_wb\_ack} & {\tt ACK\_I} \\ |
{\tt i\_wb\_stall} & {\tt STALL\_I} \\ |
{\tt i\_wb\_data} & {\tt DAT\_I} \\ |
{\tt i\_wb\_err} & {\tt ERR\_I} |
{\tt i\_wb\_data} & {\tt DAT\_I} |
\end{tabular}\\\hline |
\end{wishboneds} |
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master} |
2502,12 → 2361,11
Rather, the debug port of the CPU should be accessible regardless of the state |
of the master bus. |
|
You may wish to notice that neither the {\tt LOCK} nor the {\tt RTY} (retry) |
wires have been connected to the CPU's master interface. If necessary, a |
rudimentary {\tt LOCK} may be created by tying the wire to the {\tt wb\_cyc} |
line. As for the {\tt RTY}, all the CPU recognizes at this point are bus |
errors---it cannot tell the difference between a temporary and a permanent bus |
error. |
You may wish to notice that neither the {\tt ERR} nor the {\tt RETRY} wires |
have been implemented. What this means is that the CPU is currently unable |
to detect a bus error condition, and so may stall indefinitely (hang) should |
it choose to access a value not on the bus, or a peripheral that is not |
yet properly configured. |
|
\chapter{Clocks}\label{chap:clocks} |
|
2525,7 → 2383,6
had struggled with various timing violations to keep it at 100~MHz. So, for |
now, I will only state that it can run at 100~MHz. |
|
On a SPARTAN 6, the clock can run successfully at 80~MHz. |
|
\chapter{I/O Ports}\label{chap:ioports} |
The I/O ports to the Zip CPU may be grouped into three categories. The first |
2543,7 → 2400,6
{\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline |
{\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline |
{\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline |
{\tt i\_wb\_err} & 1 & Input & Bus Error indication\\\hline |
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table} |
and~\ref{tbl:iowb-slave} respectively. |
\begin{table} |
2565,21 → 2421,21
\begin{center}\begin{portlist} |
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline |
{\tt i\_rst} & 1 & Input & Active high reset line \\\hline |
{\tt i\_ext\_int} & 1\ldots 16 & Input & Incoming external interrupts, actual |
value set by implementation parameter \\\hline |
{\tt i\_ext\_int} & 1\ldots 6 & Input & Incoming external interrupts \\\hline |
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline |
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table} |
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}. We |
typically run it at 100~MHz, although we've needed to slow it down to 80~MHz |
for some implementations. The reset line is an active high reset. When |
typically run it at 100~MHz. The reset line is an active high reset. When |
asserted, the CPU will start running again from its reset address in |
memory. Further, depending upon how the CPU is configured and specifically |
based upon how the {\tt START\_HALTED} parameter is set, the CPU may or may |
not start running automatically following a reset. The {\tt i\_ext\_int} |
line is for an external interrupt. This line may actually be as wide as |
16~external interrupts, depending upon the setting of |
the {\tt EXTERNAL\_INTERRUPTS} parameter. Finally, the Zip System produces one |
external interrupt whenever the entire CPU halts to wait for the debugger. |
memory. Further, depending upon how the CPU is configured and specifically on |
the {\tt START\_HALTED} parameter, it may or may not start running |
automatically. The {\tt i\_ext\_int} line is for an external interrupt. This |
line may be as wide as 6~external interrupts, depending upon the setting of |
the {\tt EXTERNAL\_INTERRUPTS} parameter. As currently configured, the ZipSystem |
only supports one such interrupt line by default. For us, this line is the |
output of another interrupt controller, but that's a board specific setup |
detail. Finally, the Zip System produces one external interrupt whenever |
the CPU halts to wait for the debugger. |
|
\chapter{Initial Assessment}\label{chap:assessment} |
|
2590,15 → 2446,11
|
\section{The Good} |
\begin{itemize} |
\item The Zip CPU can be configured to be relatively light weight and fully |
featured as it exists today. For anyone who wishes to build a general |
purpose CPU and then to experiment with building and adding particular |
features, the Zip CPU makes a good starting point--it is fairly simple. |
Modifications should be simple enough. Indeed, a non--pipelined |
version of the bare ZipBones (with no peripherals) has been built that |
only uses 1.1k~LUTs. When using pipelining, the full cache, and all |
of the peripherals, the ZipSystem can top 5~k LUTs. Where it fits |
in between is a function of your needs. |
\item The Zip CPU is light weight and fully featured as it exists today. For |
anyone who wishes to build a general purpose CPU and then to |
experiment with building and adding particular features, the Zip CPU |
makes a good starting point--it is fairly simple. Modifications should |
be simple enough. |
\item The Zip CPU was designed to be an implementable soft core that could be |
placed within an FPGA, controlling actions internal to the FPGA. It |
fits this role rather nicely. It does not fit the role of a system on |
2636,6 → 2488,39
|
\section{The Not so Good} |
\begin{itemize} |
\item While one of the stated goals was to use a small amount of logic, |
3k~LUTs isn't that impressively small. Indeed, it's really much |
too expensive when compared against other 8 and 16-bit CPUs that have |
	fewer than 1k LUTs. |
|
Still, \ldots it's not bad, it's just not astonishingly good. |
|
\item The fact that the instruction width equals the bus width means that the |
instruction fetch cycle will always be interfering with any load or |
store memory operation, with the only exception being if the |
instruction is already in the cache. |
|
This could be fixed in one of three ways: the instruction set |
architecture could be modified to handle Very Long Instruction Words |
(VLIW) so that each 32--bit word would encode two or more instructions, |
the instruction fetch bus width could be increased from 32--bits to |
64--bits or more, or the instruction bus could be separated from the |
data bus. Any and all of these approaches would increase the overall |
LUT count. |
|
\item The (non-existent) floating point unit was an after-thought, isn't even |
	built as a potential option, and most likely won't support the full |
	IEEE standard set of FPU instructions--even for single precision. |
	This (non-existent) capability would benefit the most from an |
out-of-order execution capability, which the Zip CPU does not have. |
|
Still, sharing FPU registers with the main register set was a good |
idea and worth preserving, as it simplifies context swapping. |
|
Perhaps this really isn't a problem, but rather a feature. By not |
implementing FPU instructions, the Zip CPU maintains a lower LUT count |
than it would have if it did implement these instructions. |
|
\item The CPU has no character support. This is both good and bad. |
Realistically, the CPU works just fine without it. Characters can be |
supported as subsets of 32-bit words without any problem. Practically, |
2668,6 → 2553,11
flexibility in its immediate operand mode, although that increased |
flexibility isn't necessarily as valuable as one might like. |
|
\item The Zip CPU does not currently detect and trap on either illegal |
instructions or bus errors. Attempts to access non--existent |
memory quietly return erroneous results, rather than halting the |
process (user mode) or halting or resetting the CPU (supervisor mode). |
|
\item The Zip CPU doesn't support out of order execution. I suppose it could |
be modified to do so, but then it would no longer be the ``simple'' |
and low LUT count CPU it was designed to be. The two primary results |
2693,18 → 2583,190
shot. My dream of having binutils and gcc support has not been |
realized and at this rate may not be realized. (I've been intimidated |
	by the challenge every time I've looked through those codes.) |
|
\iffalse |
\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle |
execution, the Zip CPU is unable to exploit this parallelism. Instead, |
apart from the DMA and the pipelined prefetch, all loads and stores |
are single Wishbone bus operations requiring a minimum of 3 clocks. |
(In practice, this has turned into 7 clocks.) |
% Addressed, 20150929 |
|
\item There is no control over whether or not an instruction sets the |
condition codes--certain instructions always set the condition codes, |
other instructions never set them. This effectively limits conditional |
instructions to a single instruction only (with two or more |
instructions as an exception), as the first instruction that sets |
condition codes will break the condition code chain. |
|
{\em (A proposed change below addresses this.)} |
|
\item Using the CC register as a trap address was a bad idea--it limits the CC |
register's ability to be used in future expansion, such as by adding |
exception indication flags: bus error, floating point exception, etc. |
|
{\em (This can be changed by a different O/S implementation of the trap |
instruction.)} |
\item The current implementation suffers from too many stalls on any |
branch--even if the branch is known early on. |
|
{\em (This is addressed in proposals below.)} |
% Addressed, 20150918 |
|
\item In a similar fashion, a switch to interrupt context forces the pipeline |
to be cleared, whereas it might make more sense to just continue |
executing the instructions already in the pipeline while the prefetch |
stage is working on switching to the interrupt context. |
|
{\em (Also addressed in proposals below.)} |
% This should happen so rarely that it is not really a problem |
\fi |
|
\end{itemize} |
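
The item above on character support notes that characters can be carried as
subsets of 32-bit words. What follows is only a minimal sketch of that idea
in C, not code from the Zip CPU project. The function names
({\tt pack\_chars}, {\tt get\_char}, {\tt set\_char}), the choice of four
8-bit characters per word, and the most-significant-byte-first ordering are
all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: four 8-bit characters per 32-bit word, with the first
 * character stored in the most significant byte.  The specification does
 * not fix this ordering; it is chosen here only for the sketch. */

/* Pack up to four characters from s into one word; missing ones become 0. */
uint32_t pack_chars(const char *s, size_t n)
{
    uint32_t w = 0;
    for (size_t i = 0; i < 4; i++) {
        uint32_t c = (i < n) ? (uint32_t)(uint8_t)s[i] : 0u;
        w = (w << 8) | c;
    }
    return w;
}

/* Extract character idx (0..3) from a packed word. */
char get_char(uint32_t w, unsigned idx)
{
    return (char)((w >> (8u * (3u - idx))) & 0xFFu);
}

/* Return a copy of w with character idx (0..3) replaced by c. */
uint32_t set_char(uint32_t w, unsigned idx, char c)
{
    unsigned shift = 8u * (3u - idx);
    return (w & ~((uint32_t)0xFFu << shift))
         | ((uint32_t)(uint8_t)c << shift);
}
```

Each character access costs a shift and a mask on top of the word load or
store, which is the price paid for not having byte-wide instructions.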
|
\section{The Next Generation} |
This section could also be labeled as my ``To do'' list. Today's list is |
much different than it was for the last version of this document, as much of |
the prior to do list (such as VLIW instructions, and a more traditional |
instruction cache) has now been implemented. The only things really and |
truly waiting on my list today are assembler support for the VLIW instruction |
set, linker and compiler support. |
|
Stay tuned, these are likely to be coming next. |
Given the feedback listed above, perhaps it's time to consider what changes could be made to improve the Zip CPU in the future. I offer the following as proposals: |
|
\begin{itemize} |
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the |
proposals below will only increase the amount of logic the Zip CPU |
requires. While I expect that the Zip CPU will always be somewhat |
of a lightweight, it will never be the smallest kid on the block. |
|
I'm actually struggling with this idea. The whole goal of the Zip |
CPU was to be lightweight. Wouldn't it make more sense to create and |
maintain options whereby it would remain lightweight? For example, if |
the process accounting registers are anything but lightweight, why |
keep them? Why not instead make some compile flags that just turn them |
off, keeping the CPU lightweight? The same holds for the prefetch |
cache. |
|
\item The `{\tt .V}' condition was never used in any code other than my test |
code. I suggest changing it to a `{\tt .LE}' condition, which seems |
to be more useful. |
|
\item {\bf Consider a more traditional Instruction Cache.} The current |
pipelined instruction cache just reads a window of memory into |
its cache. If the CPU leaves that window, the entire cache is |
invalidated. A more traditional cache, however, might allow |
common subroutines to stay within the cache without invalidating the |
entire cache structure. |
|
\iffalse |
\item {\bf Adjust the Zip CPU so that conditional instructions do not set |
flags}, although they may explicitly set condition codes if writing |
to the CC register. |
|
This is a simple change to the core, and may show up in new releases. |
% Fixed, 20150918 |
|
\item Add in an {\bf unpredictable branch delay slot}, so that on any branch |
the delay slot may or may not be executed before the branch. |
Instructions that do not depend upon the branch, and that should be |
executed were the branch not taken, could be placed into the delay |
slot. Thus, if the branch isn't taken, we wouldn't suffer the stall, |
whereas it wouldn't affect the timing of the branch if taken. It would |
just do something irrelevant. |
|
% Changes made, 20150918, make this option no longer relevant |
|
\item {\bf Re-engineer Branch Processing.} There's no reason why a {\tt BRA} |
instruction should create five stall cycles. The decode stage, plus |
the prefetch engine, should be able to drop this number of stalls via |
better branch handling. |
|
Indeed, this could turn into a simple means of branch prediction: |
if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C} |
suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would |
be faster than a {\tt BRA.C} instruction. This would then allow a |
compiler to do explicit branch optimizations. |
|
Of course, longer branches using {\tt ADD X,PC} would still not be |
optimized. |
|
% DONE: 20150918 -- to include the ADD X,PC instructions |
|
\item {\bf Request bus access for Load/Store two cycles earlier.} The problem |
here is the contention for the bus between the memory unit and the |
prefetch unit. Currently, the memory unit must ask the prefetch |
unit to release the bus if it is in the middle of a bus cycle. At this |
point, the prefetch drops the {\tt STB} line on the next clock and must |
then wait for the last {\tt ACK} before releasing the bus. If the |
request takes one clock, dropping the strobe line the next, waiting |
for an acknowledgement takes another, and then the bus must be idle |
for one cycle before starting again, these extra cycles add up. |
It should be possible to tell the prefetch stage to give up the bus |
as soon as the decoder knows the instruction will need the bus. |
Indeed, if done in the decode stage, this might drop the seven cycle |
access down by two cycles. |
% FIXED: 20150918 |
|
\item {\bf Very Long Instruction Word (VLIW).} Now, to speed up operation, I |
propose that the Zip CPU instruction set be modified towards a Very |
Long Instruction Word (VLIW) implementation. In this implementation, |
an instruction word may contain either one or two separate |
instructions. The first instruction would take up the high order bits, |
the second optional instruction the lower 16-bits. Further, I propose |
that any of the ALU instructions (SUB through LSR) automatically have |
a second instruction whenever their operand `B' is a register, and use |
the full 20-bit immediate if not. This will effectively eliminate the |
register plus immediate mode for all of these instructions. |
|
This is the minimal required change to increase the number of |
instructions per clock cycle. Other changes would need to take place |
as well to support this. These include: |
\begin{itemize} |
\item Instruction words containing two instructions would take two |
clocks to complete, while requiring only a single cycle |
instruction fetch. |
\item Instructions preceded by a label in the assembler must always |
start in the high order word. |
\item VLIW's, once started, must always execute to completion. The |
upper word may set the PC, the lower word may not. Regardless |
of whether the upper word sets the PC, the lower word must |
still be guaranteed to complete before the PC changes. On any |
switch to (or from) interrupt context, both instructions must |
complete or none of the instructions in the word shall |
complete prior to the switch. |
\item STEP commands and BREAK instructions will only take place after |
the entire word is executed. |
\end{itemize} |
|
If done well, the assembler should be able to handle these changes |
with the biggest impacts to the user being increased performance and |
a loss of the register plus immediate ALU modes. (These weren't really |
relevant for the XOR, OR, AND, etc. operations anyway.) Machine code |
compatibility will not be maintained. |
|
A proposed secondary instruction set might consist of: a four-bit |
opcode (any of the prior instructions would be supported, with some |
exceptions such as moves to and from user registers while in |
supervisor mode not being supported), a 4-bit register result (PC not |
allowed), a 3-bit conditional (identical to the conditional for the |
upper word), a single bit indicating whether or not an immediate is |
present, followed by either a 4-bit register or a 4-bit signed |
immediate. The multiply instruction would steal the immediate flag to |
be used as a sign indication, forcing both operands to be registers |
without any immediate offsets. |
|
{\em Initial conversion of several library functions to this secondary |
instruction set has demonstrated little to no gain. The problem was |
that the new instruction set was made by joining a rarely used |
instruction (ALU with register and not immediate) with a more common |
instruction. The utility was then limited by the utility of the rare |
instruction, which limited the impact of the entire approach. } |
\else |
\item {\bf Very Long Instruction Word (VLIW).} The goal here would be to |
create a new instruction set whereby two instructions would be encoded |
in each 32--bit word. While this may speed up |
CPU operation, it would necessitate an instruction redesign. |
\fi |
|
\end{itemize} |
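
To make the VLIW proposal above more concrete, here is a minimal sketch in C
of how a 16-bit secondary instruction might be packed and unpacked. The field
widths follow the proposal (a 4-bit opcode, a 4-bit result register, a 3-bit
condition, a 1-bit immediate flag, and a 4-bit register or signed immediate).
The bit positions, the struct, and the function names are assumptions of this
sketch and are not taken from any Zip CPU implementation. The multiply
instruction's reuse of the immediate flag as a sign indication is not modeled.

```c
#include <stdint.h>

/* Hypothetical field layout, high bits first:
 *   [15:12] opcode   [11:8] result register   [7:5] condition
 *   [4]     immediate flag   [3:0] register number or signed immediate
 * Only the field widths come from the proposal; the ordering is assumed. */
typedef struct {
    uint8_t opcode;   /* 4 bits */
    uint8_t dest;     /* 4 bits; the proposal disallows the PC here */
    uint8_t cond;     /* 3 bits */
    uint8_t has_imm;  /* 1 bit: nonzero if operand is an immediate */
    int8_t  operand;  /* register 0..15, or signed immediate -8..7 */
} zip_half_insn;

/* Pack the fields into one 16-bit half word. */
uint16_t zip_half_encode(zip_half_insn in)
{
    return (uint16_t)(((in.opcode & 0xFu) << 12) |
                      ((in.dest & 0xFu) << 8) |
                      ((in.cond & 0x7u) << 5) |
                      ((in.has_imm ? 1u : 0u) << 4) |
                      ((uint8_t)in.operand & 0xFu));
}

/* Unpack a 16-bit half word, sign-extending the operand when it is an
 * immediate and leaving it as an unsigned register number otherwise. */
zip_half_insn zip_half_decode(uint16_t w)
{
    zip_half_insn out;
    int v = w & 0xF;

    out.opcode  = (uint8_t)((w >> 12) & 0xFu);
    out.dest    = (uint8_t)((w >> 8) & 0xFu);
    out.cond    = (uint8_t)((w >> 5) & 0x7u);
    out.has_imm = (uint8_t)((w >> 4) & 0x1u);
    if (out.has_imm && (v & 0x8))
        v -= 16;                      /* sign-extend the 4-bit immediate */
    out.operand = (int8_t)v;
    return out;
}
```

The sketch makes the main cost of the proposal visible: with only four bits
left for the second operand, an immediate is limited to the range $-8$ to $7$.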
|
|
% Appendices |
% Index |
\end{document} |