Line 43... |
Line 43... |
%%
|
%%
|
%%
|
%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
\documentclass{gqtekspec}
|
\documentclass{gqtekspec}
|
\usepackage{import}
|
\usepackage{import}
|
|
\usepackage{bytefield}
|
% \graphicspath{{../gfx}}
|
% \graphicspath{{../gfx}}
|
\project{Zip CPU}
|
\project{Zip CPU}
|
\title{Specification}
|
\title{Specification}
|
\author{Dan Gisselquist, Ph.D.}
|
\author{Dan Gisselquist, Ph.D.}
|
\email{dgisselq (at) opencores.org}
|
\email{dgisselq (at) opencores.org}
|
\revision{Rev.~0.6}
|
\revision{Rev.~0.7}
|
\definecolor{webred}{rgb}{0.2,0,0}
|
\definecolor{webred}{rgb}{0.5,0,0}
|
\definecolor{webgreen}{rgb}{0,0.2,0}
|
\definecolor{webgreen}{rgb}{0,0.4,0}
|
\usepackage[dvips,ps2pdf,colorlinks=true,
|
\usepackage[dvips,ps2pdf,colorlinks=true,
|
anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
|
anchorcolor=black,pdfpagelabels,hypertexnames,
|
pdfauthor={Dan Gisselquist},
|
pdfauthor={Dan Gisselquist},
|
pdfsubject={Zip CPU}]{hyperref}
|
pdfsubject={Zip CPU}]{hyperref}
|
|
\hypersetup{
|
|
colorlinks = true,
|
|
linkcolor = webred,
|
|
citecolor = webgreen
|
|
}
|
\begin{document}
|
\begin{document}
|
\pagestyle{gqtekspecplain}
|
\pagestyle{gqtekspecplain}
|
\titlepage
|
\titlepage
|
\begin{license}
|
\begin{license}
|
Copyright (C) \theyear\today, Gisselquist Technology, LLC
|
Copyright (C) \theyear\today, Gisselquist Technology, LLC
|
Line 76... |
Line 82... |
You should have received a copy of the GNU General Public License along
|
You should have received a copy of the GNU General Public License along
|
with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a
|
with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a
|
copy.
|
copy.
|
\end{license}
|
\end{license}
|
\begin{revisionhistory}
|
\begin{revisionhistory}
|
|
0.7 & 12/22/2015 & Gisselquist & New Instruction Set Architecture \\\hline
|
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
|
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
|
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
|
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
|
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
|
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
|
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
|
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
|
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
|
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
|
Line 137... |
Line 144... |
|
|
|
|
The original goal of the Zip CPU was to be a very simple CPU. You might
|
The original goal of the Zip CPU was to be a very simple CPU. You might
|
think of it as a poor man's alternative to the OpenRISC architecture.
|
think of it as a poor man's alternative to the OpenRISC architecture.
|
For this reason, all instructions have been designed to be as simple as
|
For this reason, all instructions have been designed to be as simple as
|
possible, and are all designed to be executed in one instruction cycle per
|
possible, and the base instructions are all designed to be executed in one
|
instruction, barring pipeline stalls. Indeed, even the bus has been simplified
|
instruction cycle per instruction, barring pipeline stalls. Indeed, even the
|
to a constant 32-bit width, with no option for more or less. This has
|
bus has been simplified to a constant 32-bit width, with no option for more
|
resulted in the choice to drop push and pop instructions, pre-increment and
|
or less. This has resulted in the choice to drop push and pop instructions,
|
post-decrement addressing modes, and more.
|
pre-increment and post-decrement addressing modes, and more.
|
|
|
For those who like buzz words, the Zip CPU is:
|
For those who like buzz words, the Zip CPU is:
|
\begin{itemize}
|
\begin{itemize}
|
\item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,
|
\item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,
|
instructions are 32-bits wide, etc.
|
instructions are 32-bits wide, etc.
|
Line 156... |
Line 163... |
\item Wishbone compliant. All peripherals are accessed just like
|
\item Wishbone compliant. All peripherals are accessed just like
|
memory across this bus.
|
memory across this bus.
|
\item A Von-Neumann architecture. (The instructions and data share a
|
\item A Von-Neumann architecture. (The instructions and data share a
|
common bus.)
|
common bus.)
|
\item A pipelined architecture, having stages for {\bf Prefetch},
|
\item A pipelined architecture, having stages for {\bf Prefetch},
|
{\bf Decode}, {\bf Read-Operand}, the {\bf ALU/Memory}
|
{\bf Decode}, {\bf Read-Operand}, a
|
unit, and {\bf Write-back}. See Fig.~\ref{fig:cpu}
|
combined stage containing the {\bf ALU},
|
|
{\bf Memory}, {\bf Divide}, and {\bf Floating Point}
|
|
units, and then the final {\bf Write-back} stage.
|
|
See Fig.~\ref{fig:cpu}
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=3.5in]{../gfx/cpu.eps}
|
\includegraphics[width=3.5in]{../gfx/cpu.eps}
|
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
|
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
for a diagram of this structure.
|
for a diagram of this structure.
|
Line 189... |
Line 199... |
implemented within any FPGA. This helps to minimize real-estate, while
|
implemented within any FPGA. This helps to minimize real-estate, while
|
maintaining a high clock speed. The disadvantage is that it can severely
|
maintaining a high clock speed. The disadvantage is that it can severely
|
degrade the overall instructions per clock count.
|
degrade the overall instructions per clock count.
|
|
|
Soft cores within an FPGA have an additional characteristic regarding
|
Soft cores within an FPGA have an additional characteristic regarding
|
memory access: it is slow. Memory on chip may be accessed at a single
|
memory access: it is slow. While memory on chip may be accessed at a single
|
cycle per access, but small FPGA's have a limited amount of memory on chip.
|
cycle per access, small FPGA's often have only a limited amount of memory on
|
Going off chip, however, is expensive. Two examples will prove this point. On
|
chip. Going off chip, however, is expensive. Two examples will prove this
|
|
point. On
|
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
|
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
|
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the
|
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the
|
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles
|
SDRAM chip on the XuLA2 board allows a 6~cycle access for a write, 10~cycles
|
per read, and 2~cycles for any subsequent pipelined access read or write.
|
per read, and 2~cycles for any subsequent pipelined access read or write.
|
Either way you look at it, this memory access will be slow and this doesn't
|
Either way you look at it, this memory access will be slow and this doesn't
|
account for any logic delays should the bus implementation logic get
|
account for any logic delays should the bus implementation logic get
|
complicated.
|
complicated.
|
|
|
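To put rough numbers on the access costs quoted above, the following Python sketch totals the cycles for a burst of words. The function name {\tt burst\_cycles} and the model itself (one full-cost first access, then a fixed cost for each subsequent pipelined word) are illustrative assumptions, not part of the Zip CPU sources.

```python
def burst_cycles(first_word: int, subsequent_word: int, n_words: int) -> int:
    """Cycles to transfer n_words, assuming one full-cost first access
    followed by (n_words - 1) pipelined accesses (an illustrative model)."""
    if n_words <= 0:
        return 0
    return first_word + subsequent_word * (n_words - 1)


# Figures quoted in the text for the XuLA2 board:
# (cycles for the first word, cycles for each subsequent pipelined word)
FLASH = (128, 64)
SDRAM_READ = (10, 2)
SDRAM_WRITE = (6, 2)

flash_8 = burst_cycles(*FLASH, 8)              # 128 + 7 * 64
sdram_read_8 = burst_cycles(*SDRAM_READ, 8)    # 10 + 7 * 2
sdram_write_8 = burst_cycles(*SDRAM_WRITE, 8)  # 6 + 7 * 2
```

Under this simple model an eight-word read from Flash costs far more than the same read from SDRAM, and both remain well above the single cycle per access of on-chip memory.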
Line 220... |
Line 231... |
A final characteristic of SwiC's within FPGA's is the peripherals.
|
A final characteristic of SwiC's within FPGA's is the peripherals.
|
Specifically, FPGA's are highly reconfigurable. Soft peripherals can easily
|
Specifically, FPGA's are highly reconfigurable. Soft peripherals can easily
|
be created on chip to support the SwiC if necessary. As an example, a simple
|
be created on chip to support the SwiC if necessary. As an example, a simple
|
30-bit peripheral could easily support reversing 30-bit numbers: a read from
|
30-bit peripheral could easily support reversing 30-bit numbers: a read from
|
the peripheral returns its bit--reversed address. This is cheap within an
|
the peripheral returns its bit--reversed address. This is cheap within an
|
FPGA, but expensive in instructions.
|
FPGA, but expensive in instructions. Reading from another 16--bit peripheral
|
|
might calculate a sine function, where the 16--bit address internal to the
|
|
peripheral was the angle of the sine wave.
|
|
|
Indeed, anything that must be done fast within an FPGA is likely to already
|
Indeed, anything that must be done fast within an FPGA is likely to already
|
be done--elsewhere in the fabric. This leaves the CPU with the role of handling
|
be done--elsewhere in the fabric. This leaves the CPU with the simple role
|
sequential tasks that need a lot of state.
|
of solely handling sequential tasks that need a lot of state.
|
|
|
This means that the SwiC needs to live within a very unique environment,
|
This means that the SwiC needs to live within a very unique environment,
|
separate and different from the traditional SoC. That isn't to say that a
|
separate and different from the traditional SoC. That isn't to say that a
|
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
|
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
|
for that purpose.
|
for that purpose.
|
Line 261... |
Line 274... |
pipeline: if you change an instruction partially through the pipeline,
|
pipeline: if you change an instruction partially through the pipeline,
|
the whole pipeline needs to be cleansed. Likewise if you change
|
the whole pipeline needs to be cleansed. Likewise if you change
|
an instruction in memory, you need to make sure the cache is reloaded
|
an instruction in memory, you need to make sure the cache is reloaded
|
with the new instruction.
|
with the new instruction.
|
|
|
\item {\bf Prefetch Cache:} My original implementation had a very
|
\item {\bf Prefetch Cache:} My original implementation, {\tt prefetch}, had
|
simple prefetch stage. Any time the PC changed the prefetch would go
|
a very simple prefetch stage. Any time the PC changed the prefetch
|
and fetch the new instruction. While this was perhaps the simplest
|
would go and fetch the new instruction. While this was perhaps the
|
approach, it cost roughly five clocks for every instruction. This
|
simplest approach, it cost roughly five clocks for every instruction.
|
was deemed unacceptable, as I wanted a CPU that could execute
|
This was deemed unacceptable, as I wanted a CPU that could execute
|
instructions in one cycle. I therefore have a prefetch cache that
|
instructions in one cycle.
|
issues pipelined wishbone accesses to memory and then pushes
|
|
instructions at the CPU. Sadly, this accounts for about 20\% of the
|
My second implementation, {\tt pipefetch}, attempted to make the most
|
logic in the entire CPU, or 15\% of the logic in the entire system.
|
use of pipelined memory. When a new CPU address was issued, it would
|
|
start reading
|
|
memory in a pipelined fashion, and issuing instructions as soon as they
|
|
were ready. This cache was a sliding window in memory. This suffered
|
|
from some difficult performance problems, though. If the CPU was
|
|
alternating between two diverse sections of code, both could never be
|
|
in the cache at the same time--causing lots of cache misses. Further,
|
|
the extra logic to implement this window cost an extra clock cycle
|
|
in the cache implementation, slowing down branches.
|
|
|
|
The Zip CPU now has a third cache implementation, {\tt pfcache}. This
|
|
new implementation takes only a single cycle per access, but costs a
|
|
full cache line miss on any miss. While configurable, a full cache
|
|
line miss might mean that the CPU needs to read 256~instructions from
|
|
memory before it can execute the first one of them.
|
|
|
\item {\bf Operating System:} In order to support an operating system,
|
\item {\bf Operating System:} In order to support an operating system,
|
interrupts and so forth, the CPU needs to support supervisor and
|
interrupts and so forth, the CPU needs to support supervisor and
|
user modes, as well as a means of switching between them. For example,
|
user modes, as well as a means of switching between them. For example,
|
the user needs a means of executing a system call. This is the
|
the user needs a means of executing a system call. This is the
|
Line 300... |
Line 326... |
Modern timesharing systems also depend upon a {\bf Timer} interrupt
|
Modern timesharing systems also depend upon a {\bf Timer} interrupt
|
to handle task swapping. For the Zip CPU, this interrupt is handled
|
to handle task swapping. For the Zip CPU, this interrupt is handled
|
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
|
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
|
The timer module itself is found in {\tt ziptimer.v}.
|
The timer module itself is found in {\tt ziptimer.v}.
|
|
|
|
\item {\bf Bus Errors:} My original implementation had no logic to handle
|
|
what would happen if the CPU attempted to read or write a non-existent
|
|
memory address. This changed after I needed to troubleshoot a failure
|
|
caused by a subroutine return to a non-existent address.
|
|
|
|
My next bus problem was caused by a misbehaving peripheral.
|
|
Whenever the CPU attempted to read from or write to this peripheral,
|
|
the peripheral would take control of the wishbone bus and not return
|
|
it. For example, it might never return an {\tt ACK} to signal
|
|
the end of the bus transaction. This led to the implementation of
|
|
a wishbone bus watchdog that would create a bus error if any particular
|
|
bus action didn't complete in a timely fashion.
|
|
|
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
|
\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
|
stalls at all, but rather to require the compiler to properly schedule
|
stalls at all, but rather to require the compiler to properly schedule
|
all instructions so that stalls would never be necessary. After trying
|
all instructions so that stalls would never be necessary. After trying
|
to build such an architecture, I gave up, having learned some things:
|
to build such an architecture, I gave up, having learned some things:
|
|
|
Line 392... |
Line 431... |
stalls, as illustrated in Fig.~\ref{fig:stacking}.
|
stalls, as illustrated in Fig.~\ref{fig:stacking}.
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=4in]{../gfx/stacking.eps}
|
\includegraphics[width=4in]{../gfx/stacking.eps}
|
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
|
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
|
However, if a pipeline hazard is detected, a stage can stall in order
|
|
to prevent the previous stage from moving forward.
|
|
|
This approach is also different from other pipeline approaches.
|
This approach is also different from other pipeline approaches.
|
Instead of keeping the entire pipeline filled, each stage is treated
|
Instead of keeping the entire pipeline filled, each stage is treated
|
independently. Therefore, individual stages may move forward as long
|
independently. Therefore, individual stages may move forward as long
|
as the subsequent stage is available, regardless of whether the stage
|
as the subsequent stage is available, regardless of whether the stage
|
behind it is filled.
|
behind it is filled.
|
|
|
\item {\bf Verilog Modules:} When examining how other processors worked
|
|
here on open cores, many of them had one separate module per pipeline
|
|
stage. While this appeared to me to be a fascinating and commendable
|
|
idea, my own implementation didn't work out quite so nicely.
|
|
|
|
As an example, the decode module produces a {\em lot} of
|
|
control wires and registers. Creating a module out of this, with
|
|
only the simplest of logic within it, seemed to be more a lesson
|
|
in passing wires around, rather than encapsulating logic.
|
|
|
|
Another example was the register writeback section. I would love
|
|
this section to be a module in its own right, and many have made them
|
|
such. However, other modules depend upon writeback results other
|
|
than just what's placed in the register (i.e., the control wires).
|
|
For these reasons, I didn't manage to fit this section into its
|
|
own module.
|
|
|
|
The result is that the majority of the CPU code can be found in
|
|
the {\tt zipcpu.v} file.
|
|
\end{itemize}
|
\end{itemize}
|
|
|
With that introduction out of the way, let's move on to the instruction
|
With that introduction out of the way, let's move on to the instruction
|
set.
|
set.
|
|
|
Line 450... |
Line 471... |
whereas the user set is used otherwise. Of this register set, the Program
|
whereas the user set is used otherwise. Of this register set, the Program
|
Counter (PC) is register 15, whereas the status register (SR) or condition
|
Counter (PC) is register 15, whereas the status register (SR) or condition
|
code register
|
code register
|
(CC) is register 14. By convention, the stack pointer will be register 13 and
|
(CC) is register 14. By convention, the stack pointer will be register 13 and
|
noted as (SP)--although there is nothing special about this register other
|
noted as (SP)--although there is nothing special about this register other
|
than this convention.
|
than this convention. Also by convention register~12 will point to a global
|
|
offset table, and may be abbreviated as the (GBL) register.
|
The CPU can access both register sets via move instructions from the
|
The CPU can access both register sets via move instructions from the
|
supervisor state, whereas the user state can only access the user registers.
|
supervisor state, whereas the user state can only access the user registers.
|
|
|
The status register is special, and bears further mention. As shown in
|
The status register is special, and bears further mention. As shown in
|
Tbl.~\ref{tbl:cc-register},
|
Tbl.~\ref{tbl:cc-register},
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
31\ldots 11 & R/W & Reserved for future uses\\\hline
|
31\ldots 13 & R/W & Reserved for future uses\\\hline
|
10 & R & (Reserved for) Bus-Error Flag\\\hline
|
12 & R & (Reserved for) Floating Point Exception\\\hline
|
|
11 & R & Division by Zero Exception\\\hline
|
|
10 & R & Bus-Error Flag\\\hline
|
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline
|
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline
|
8 & R & Illegal Instruction Flag\\\hline
|
8 & R & Illegal Instruction Flag\\\hline
|
7 & R/W & Break--Enable\\\hline
|
7 & R/W & Break--Enable\\\hline
|
6 & R/W & Step\\\hline
|
6 & R/W & Step\\\hline
|
5 & R/W & Global Interrupt Enable (GIE)\\\hline
|
5 & R/W & Global Interrupt Enable (GIE)\\\hline
|
Line 482... |
Line 506... |
Of the condition codes, the bottom four bits are the current flags:
|
Of the condition codes, the bottom four bits are the current flags:
|
Zero (Z),
|
Zero (Z),
|
Carry (C),
|
Carry (C),
|
Negative (N),
|
Negative (N),
|
and Overflow (V).
|
and Overflow (V).
|
|
On those instructions that set the flags, these flags will be set based upon
|
The next bit is a clock enable (0 to enable) or sleep bit (1 to put
|
the output of the instruction. If the result is zero, the Z flag will be set.
|
the CPU to sleep). Setting this bit will cause the CPU to
|
If the high order bit is set, the N flag will be set. If the instruction
|
wait for an interrupt (if interrupts are enabled), or to
|
caused a bit to fall off the end, the carry bit will be set. Finally, if
|
completely halt (if interrupts are disabled).
|
the instruction causes a signed integer overflow, the V flag will be set
|
|
afterwards.
|
|
|
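As an illustration of the flag rules just described, here is a minimal Python sketch for a 32--bit addition. The function {\tt add\_flags} is hypothetical. It assumes that the carry flag means a carry out of bit 31 and that overflow is the usual two's-complement signed overflow; the actual ALU may differ in its details.

```python
MASK32 = 0xFFFFFFFF


def add_flags(a: int, b: int) -> dict:
    """Add two 32-bit values and derive Z, N, C and V from the result.

    Assumptions (not taken from the Zip CPU RTL): C is the carry out of
    bit 31, V is two's-complement signed overflow.
    """
    a &= MASK32
    b &= MASK32
    full = a + b                # unbounded sum, may need 33 bits
    result = full & MASK32      # what would land in the register
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    return {
        "result": result,
        "Z": result == 0,                           # result is zero
        "N": sign_r == 1,                           # high order bit set
        "C": full > MASK32,                         # a bit fell off the end
        "V": sign_a == sign_b and sign_r != sign_a, # signed overflow
    }
```

For example, adding 1 to {\tt 0xFFFFFFFF} wraps to zero and sets both Z and C, while adding 1 to {\tt 0x7FFFFFFF} sets N and V without setting C.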
|
The next bit is a sleep bit. Set this bit to one to disable instruction
|
|
execution and place the CPU to sleep, or to zero to keep the pipeline
|
|
running. Setting this bit will cause the CPU to wait for an interrupt
|
|
(if interrupts are enabled), or to completely halt (if interrupts are
|
|
disabled). In order to prevent users from halting the CPU, only the
|
|
supervisor is allowed to both put the CPU to sleep and disable
|
|
interrupts. Any user attempt to do so will simply result in a switch
|
|
to supervisor mode.
|
|
|
The sixth bit is a global interrupt enable bit (GIE). When this
|
The sixth bit is a global interrupt enable bit (GIE). When this
|
sixth bit is a `1' interrupts will be enabled, else disabled. When
|
sixth bit is a `1' interrupts will be enabled, else disabled. When
|
interrupts are disabled, the CPU will be in supervisor mode, otherwise
|
interrupts are disabled, the CPU will be in supervisor mode, otherwise
|
it is in user mode. Thus, to execute a context switch, one only
|
it is in user mode. Thus, to execute a context switch, one only
|
Line 499... |
Line 533... |
and deals with its context switch.) Special logic has been added to
|
and deals with its context switch.) Special logic has been added to
|
keep the user mode from setting the sleep register and clearing the
|
keep the user mode from setting the sleep register and clearing the
|
GIE register at the same time, with clearing the GIE register taking
|
GIE register at the same time, with clearing the GIE register taking
|
precedence.
|
precedence.
|
|
|
The seventh bit is a step bit. This bit can be
|
The seventh bit is a step bit. This bit can be set from supervisor mode only.
|
set from supervisor mode only. After setting this bit, should
|
After setting this bit, should the supervisor mode process switch to
|
the supervisor mode process switch to user mode, it would then
|
user mode, it would then accomplish one instruction in user mode
|
accomplish one instruction in user mode before returning to supervisor
|
before returning to supervisor mode. Then, upon return to supervisor
|
mode. Then, upon return to supervisor mode, this bit will
|
mode, this bit will be automatically cleared. This bit has no effect
|
be automatically cleared. This bit has no effect on the CPU while in
|
on the CPU while in supervisor mode.
|
supervisor mode.
|
|
|
|
This functionality was added to enable a userspace debugger
|
This functionality was added to enable a userspace debugger
|
functionality on a user process, working through supervisor mode
|
functionality on a user process, working through supervisor mode
|
of course.
|
of course.
|
|
|
Line 537... |
Line 570... |
The tenth bit is a trap bit. It is set whenever the user requests a soft
|
The tenth bit is a trap bit. It is set whenever the user requests a soft
|
interrupt, and cleared on any return to userspace command. This allows the
|
interrupt, and cleared on any return to userspace command. This allows the
|
supervisor, in supervisor mode, to determine whether it got to supervisor
|
supervisor, in supervisor mode, to determine whether it got to supervisor
|
mode from a trap or from an external interrupt or both.
|
mode from a trap or from an external interrupt or both.
|
|
|
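The bit assignments described above can be collected into a small decoder. The Python sketch below is illustrative only: the dictionary keys and the helper names {\tt decode\_cc} and {\tt is\_user\_mode} are assumptions chosen for readability, while the bit positions follow the status register tables in this section.

```python
# Bit position -> short name, following the status register tables.
# The names themselves are illustrative, not identifiers from the RTL.
CC_BITS = {
    0: "Z",             # zero
    1: "C",             # carry
    2: "N",             # negative
    3: "V",             # overflow
    4: "SLEEP",         # sleep / clock disable
    5: "GIE",           # global interrupt enable (set in user mode)
    6: "STEP",          # single step in user mode
    7: "BREAK_ENABLE",  # halt on break
    8: "ILLEGAL_INSN",  # illegal instruction flag
    9: "TRAP",          # soft trap from user mode
    10: "BUS_ERROR",    # bus error flag
    11: "DIV_BY_ZERO",  # division by zero exception
    12: "FP_EXCEPTION", # (reserved for) floating point exception
}


def decode_cc(value: int) -> dict:
    """Return a mapping of flag name -> bool for a status register value."""
    return {name: bool((value >> bit) & 1) for bit, name in CC_BITS.items()}


def is_user_mode(value: int) -> bool:
    """Per the text, the CPU is in user mode whenever GIE is set."""
    return decode_cc(value)["GIE"]
```

A supervisor routine could use such a decoder to check, for instance, whether it was entered through a trap (bit 9) or a bus error (bit 10).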
These status register bits are summarized in Tbl.~\ref{tbl:ccbits}.
|
\section{Instruction Format}
|
\begin{table}
|
All Zip CPU instructions fit in one of the formats shown in
|
\begin{center}
|
Fig.~\ref{fig:iset-format}.
|
\begin{tabular}{l|l}
|
\begin{figure}\begin{center}
|
Bit & Meaning \\\hline
|
\begin{bytefield}[endianness=big]{32}
|
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
|
\bitheader{0-31}\\
|
8 & Illegal instruction error flag \\\hline
|
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR}
|
7 & Halt on break, to support an external debugger \\\hline
|
\bitbox[lrt]{5}{OpCode}
|
6 & Step, single step the CPU in user mode\\\hline
|
\bitbox[lrt]{3}{Cnd}
|
5 & GIE, or Global Interrupt Enable \\\hline
|
\bitbox{1}{0}
|
4 & Sleep \\\hline
|
\bitbox{18}{18-bit Signed Immediate} \\
|
3 & V, or overflow bit.\\\hline
|
\bitbox{1}{0}\bitbox{4}{DR}
|
2 & N, or negative bit.\\\hline
|
\bitbox[lrb]{5}{}
|
1 & C, or carry bit.\\\hline
|
\bitbox[lrb]{3}{}
|
0 & Z, or zero bit. \\\hline
|
\bitbox{1}{1}
|
|
\bitbox{4}{BR}
|
|
\bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR}
|
|
\bitbox[lrt]{5}{5'hf}
|
|
\bitbox[lrt]{3}{Cnd}
|
|
\bitbox{1}{A}
|
|
\bitbox{4}{BR}
|
|
\bitbox{1}{B}
|
|
\bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR}
|
|
\bitbox{4}{4'hb}
|
|
\bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
|
|
\bitbox{1}{}
|
|
\bitbox{2}{11}
|
|
\bitbox{3}{xxx}
|
|
\bitbox{22}{Ignored}
|
|
\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR}
|
|
\bitbox[lrt]{5}{OpCode}
|
|
\bitbox[lrt]{3}{Cnd}
|
|
\bitbox{1}{0}
|
|
\bitbox{4}{Imm.}
|
|
\bitbox{14}{---} \\
|
|
\bitbox{1}{1}\bitbox[lr]{4}{}
|
|
\bitbox[lrb]{5}{}
|
|
\bitbox[lr]{3}{}
|
|
\bitbox{1}{1}
|
|
\bitbox{4}{BR}
|
|
\bitbox{14}{---} \\
|
|
\bitbox{1}{1}\bitbox[lrb]{4}{}
|
|
\bitbox{4}{4'hb}
|
|
\bitbox{1}{}
|
|
\bitbox[lrb]{3}{}
|
|
\bitbox{5}{5'b Imm}
|
|
\bitbox{14}{---} \\
|
|
\bitbox{1}{1}\bitbox{9}{---}
|
|
\bitbox[lrt]{3}{Cnd}
|
|
\bitbox{5}{---}
|
|
\bitbox[lrt]{4}{DR}
|
|
\bitbox[lrt]{5}{OpCode}
|
|
\bitbox{1}{0}
|
|
\bitbox{4}{Imm}
|
|
\\
|
|
\bitbox{1}{1}\bitbox{9}{---}
|
|
\bitbox[lr]{3}{}
|
|
\bitbox{5}{---}
|
|
\bitbox[lr]{4}{}
|
|
\bitbox[lrb]{5}{}
|
|
\bitbox{1}{1}
|
|
\bitbox{4}{Reg} \\
|
|
\bitbox{1}{1}\bitbox{9}{---}
|
|
\bitbox[lrb]{3}{}
|
|
\bitbox{5}{---}
|
|
\bitbox[lrb]{4}{}
|
|
\bitbox{4}{4'hb}
|
|
\bitbox{1}{}
|
|
\bitbox{5}{5'b Imm}
|
|
\end{leftwordgroup} \\
|
|
\end{bytefield}
|
|
\caption{Zip Instruction Set Format}\label{fig:iset-format}
|
|
\end{center}\end{figure}
|
|
The basic format is that some operation, defined by the OpCode, is applied
|
|
if a condition, Cnd, is true in order to produce a result which is placed in
|
|
the destination register, or DR. The Load 23--bit signed immediate instruction
|
|
is different in that it requires no conditions, and uses only a 4-bit opcode.
|
|
|
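The field layout can be sketched as a small decoder. The Python below is a hedged illustration, not the CPU's decoder: it assumes the leftmost field in Fig.~\ref{fig:iset-format} is the most significant bit, it handles only the two standard forms and the {\tt LDI} form, and it does not treat the {\tt MOV}, {\tt NOOP} or VLIW carve-outs specially. The names {\tt sign\_extend} and {\tt decode\_standard} are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_standard(word: int) -> dict:
    """Split a 32-bit instruction word into its fields.

    Assumed layout (bit 31 leftmost): VLIW flag, 4-bit DR, 5-bit OpCode,
    3-bit Cnd, one select bit, then an 18-bit immediate, or a 4-bit BR
    followed by a 14-bit immediate.
    """
    if (word >> 31) & 1:
        raise ValueError("VLIW words are not handled by this sketch")
    dr = (word >> 27) & 0xF
    # LDI uses a 4-bit opcode (4'hb) and a 23-bit signed immediate.
    if (word >> 23) & 0xF == 0xB:
        return {"format": "LDI", "dr": dr,
                "imm": sign_extend(word & 0x7FFFFF, 23)}
    fields = {"dr": dr,
              "opcode": (word >> 22) & 0x1F,
              "cnd": (word >> 19) & 0x7}
    if (word >> 18) & 1:
        # Register form: BR plus a 14-bit signed immediate.
        fields.update(format="reg", br=(word >> 14) & 0xF,
                      imm=sign_extend(word & 0x3FFF, 14))
    else:
        # Immediate form: 18-bit signed immediate.
        fields.update(format="imm", imm=sign_extend(word & 0x3FFFF, 18))
    return fields
```

Checking the {\tt LDI} pattern before extracting the 5--bit opcode mirrors the way opcodes 5'h16 and 5'h17 share the same upper four bits.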
|
This is actually the second version of the instruction set definition, given certain
|
|
lessons learned. For example, the original instruction set had the following
|
|
problems:
|
|
\begin{enumerate}
|
|
\item No opcodes were available for divide or floating point extensions to be
|
|
made available. Although there was space in the instruction set to
|
|
add these types of instructions, this instruction space was going to
|
|
require extra logic to use.
|
|
\item The carveouts for instructions such as NOOP and LDIHI/LDILO required
|
|
extra logic to process.
|
|
\item The instruction set wasn't very compact. One bus operation was required
|
|
for every instruction.
|
|
\end{enumerate}
|
|
This second version was designed with two criteria. The first was that the
|
|
new instruction set needed to be compatible, at the assembly language level,
|
|
with the previous instruction set. Thus, it must be able to support all of
|
|
the previous mnemonics and more. This was achieved with the sole exception
|
|
that instruction immediates are generally two bits shorter than before.
|
|
(One bit was lost to the VLIW bit in front, another from changing from 4--bit
|
|
to 5--bit opcodes.) Second, the new instruction set needed to be a drop--in
|
|
replacement for the decoder, modifying nothing else. This was almost achieved,
|
|
save for two issues: the ALU unit needed to be replaced since the OpCodes
|
|
were reordered, and some condition code logic needed to be adjusted since the
|
|
condition codes were renumbered as well. In the end, maximum reuse of the
|
|
existing RTL (Verilog) code was achieved in this upgrade.
|
|
|
|
As of this second version of the Zip CPU instruction set, the Zip CPU also
|
|
supports a very long instruction word (VLIW) set of instructions. These
|
|
instruction formats pack two instructions into a single instruction word,
|
|
trading immediate instruction space to do this, but in just about all other
|
|
respects these are identical to two standard instructions. Other than
|
|
instruction format, the only basic difference is that the CPU will not switch
|
|
to interrupt mode in between the two instructions. Likewise a new job given
|
|
to the assembler is that of automatically packing as many instructions as
|
|
possible into the VLIW format. Where necessary to place both VLIW instructions
|
|
on the same line, they will be separated by a vertical bar.
|
|
|
|
\section{Instruction OpCodes}
|
|
With a 5--bit opcode field, there are 32~possible instructions as shown in
|
|
Tbl.~\ref{tbl:iset-opcodes}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85}
|
|
OpCode & & Instruction &Sets CC \\\hline\hline
|
|
5'h00 & SUB & Subtract & \\\cline{1-3}
|
|
5'h01 & AND & Bitwise And & \\\cline{1-3}
|
|
5'h02 & ADD & Add two numbers & \\\cline{1-3}
|
|
5'h03 & OR & Bitwise Or & Y \\\cline{1-3}
|
|
5'h04 & XOR & Bitwise Exclusive Or & \\\cline{1-3}
|
|
5'h05 & LSR & Logical Shift Right & \\\cline{1-3}
|
|
5'h06 & LSL & Logical Shift Left & \\\cline{1-3}
|
|
5'h07 & ASR & Arithmetic Shift Right & \\\hline
|
|
5'h08 & LDIHI & Load Immediate High & N \\\cline{1-3}
|
|
5'h09 & LDILO & Load Immediate Low & \\\hline
|
|
5'h0a & MPYU & Unsigned 16--bit Multiply & \\\cline{1-3}
|
|
5'h0b & MPYS & Signed 16--bit Multiply & Y \\\cline{1-3}
|
|
5'h0c & BREV & Bit Reverse & \\\cline{1-3}
|
|
5'h0d & POPC& Population Count & \\\cline{1-3}
|
|
5'h0e & ROL & Rotate left & \\\hline
|
|
5'h0f & MOV & Move register & N \\\hline
|
|
5'h10 & CMP & Compare & Y \\\cline{1-3}
|
|
5'h11 & TST & Test (AND w/o setting result) & \\\hline
|
|
5'h12 & LOD & Load from memory & N \\\cline{1-3}
|
|
5'h13 & STO & Store a register into memory & \\\hline\hline
|
|
5'h14 & DIVU & Divide, unsigned & Y \\\cline{1-3}
|
|
5'h15 & DIVS & Divide, signed & \\\hline\hline
|
|
5'h16/7 & LDI & Load 23--bit signed immediate & N \\\hline\hline
|
|
5'h18 & FPADD & Floating point add & \\\cline{1-3}
|
|
5'h19 & FPSUB & Floating point subtract & \\\cline{1-3}
|
|
5'h1a & FPMPY & Floating point multiply & Y \\\cline{1-3}
|
|
5'h1b & FPDIV & Floating point divide & \\\cline{1-3}
|
|
5'h1c & FPCVT & Convert integer to floating point & \\\cline{1-3}
|
|
5'h1d & FPINT & Convert to integer & \\\hline
|
|
5'h1e & & {\em Reserved for future use} &\\\hline
|
|
5'h1f & & {\em Reserved for future use} &\\\hline
|
\end{tabular}
|
\end{tabular}
|
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
|
\caption{Zip CPU OpCodes}\label{tbl:iset-opcodes}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
%
|
|
Of these opcodes, the {\tt BREV} and {\tt POPC} instructions are experimental, and may be
|
|
replaced later, and two floating point instruction opcodes are reserved for
|
|
future use.
|
|
|
\section{Conditional Instructions}
|
\section{Conditional Instructions}
|
Most, although not quite all, instructions may be conditionally executed. From
|
Most, although not quite all, instructions may be conditionally executed.
|
the four condition code flags, eight conditions are defined. These are shown
|
The 23--bit load immediate instruction, together with the {\tt NOOP},
|
in Tbl.~\ref{tbl:conditions}.
|
{\tt BREAK}, and {\tt LOCK} instructions are the only exceptions to this rule.
|
\begin{table}
|
|
\begin{center}
|
From the four condition code flags, eight conditions are defined for standard
|
|
instructions. These are shown in Tbl.~\ref{tbl:conditions}.
|
|
\begin{table}\begin{center}
|
\begin{tabular}{l|l|l}
|
\begin{tabular}{l|l|l}
|
Code & Mnemonic & Condition \\\hline
|
Code & Mnemonic & Condition \\\hline
|
3'h0 & None & Always execute the instruction \\
|
3'h0 & None & Always execute the instruction \\
|
3'h1 & {\tt .Z} & Only execute when 'Z' is set \\
|
3'h1 & {\tt .LT} & Less than ('N' set) \\
|
3'h2 & {\tt .NE} & Only execute when 'Z' is not set \\
|
3'h2 & {\tt .Z} & Only execute when 'Z' is set \\
|
3'h3 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\
|
3'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\
|
3'h4 & {\tt .GT} & Greater than ('N' not set, 'Z' not set) \\
|
3'h4 & {\tt .GT} & Greater than ('N' not set, 'Z' not set) \\
|
3'h5 & {\tt .LT} & Less than ('N' set) \\
|
3'h5 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\
|
3'h6 & {\tt .C} & Carry set\\
|
3'h6 & {\tt .C} & Carry set\\
|
3'h7 & {\tt .V} & Overflow set\\
|
3'h7 & {\tt .V} & Overflow set\\
|
\end{tabular}
|
\end{tabular}
|
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
|
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
|
\end{center}
|
\end{center}\end{table}
|
\end{table}
|
There is no condition code for less than or equal, not C or not V---there
|
There is no condition code for less than or equal, not C or not V. Sorry,
|
just wasn't enough space in 3--bits. Conditioning on a non--supported
|
I ran out of space in 3--bits. Conditioning on a non--supported condition
|
condition is still possible, but it will take an extra instruction and a
|
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
|
pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt
|
As an alternative, the condition may often be reversed, recovering those
|
STO.NZ R0,(R1)}) As an alternative, it is often possible to reverse the
|
extra two clocks. Thus instead of \hbox{\tt CMP Rx,Ry;}
|
condition, and thus recover those extra two clocks. Thus instead of
|
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.
|
\hbox{\tt CMP Rx,Ry;} \hbox{\tt BNV label} you can issue a
|
|
\hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.
|
|
|
Conditionally executed ALU instructions will not further adjust the
|
Conditionally executed instructions will not further adjust the
|
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
|
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
|
instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
|
instructions. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
|
will adjust conditions whenever their conditionals are true. In this way,
|
will adjust conditions whenever they are executed. In this way,
|
multiple conditions may be evaluated without branches. For example, to do
|
multiple conditions may be evaluated without branches. For example, to do
|
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
|
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
|
code such as Tbl.~\ref{tbl:dbl-condition}.
|
code such as Tbl.~\ref{tbl:dbl-condition}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{l}
|
\begin{tabular}{l}
|
Line 604... |
Line 785... |
{\tt STO.Z R0,(R2)} \\
|
{\tt STO.Z R0,(R2)} \\
|
\end{tabular}
|
\end{tabular}
|
\caption{An example of a double conditional}\label{tbl:dbl-condition}
|
\caption{An example of a double conditional}\label{tbl:dbl-condition}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
|
\section{Traditional Interrupt Handling}
|
In the case of VLIW instructions, only four conditions are defined as shown
|
|
in Tbl.~\ref{tbl:vliw-conditions}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{l|l|l}
|
|
Code & Mnemonic & Condition \\\hline
|
|
2'h0 & None & Always execute the instruction \\
|
|
2'h1 & {\tt .LT} & Less than ('N' set) \\
|
|
2'h2 & {\tt .Z} & Only execute when 'Z' is set \\
|
|
2'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\
|
|
\end{tabular}
|
|
\caption{VLIW Conditions}\label{tbl:vliw-conditions}
|
|
\end{center}\end{table}
|
|
Further, the first bit is given a special meaning. If the first bit is set,
|
|
the conditions apply to the second half of the instruction, otherwise the
|
|
conditions will only apply to the first half of a conditional instruction.
|
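As a rough illustration of the condition tables in this section, the following Python sketch evaluates a condition code against the four flags. It assumes the revised encoding (the one that places {\tt .LT} at code 1), and it relies on the observation that the four VLIW codes coincide with the first four standard codes. The names `CONDITIONS` and `condition_met` are hypothetical and do not come from the Zip CPU sources.

```python
# Hypothetical model of the condition codes, using the revised encoding.
# Each entry maps a code to (mnemonic, predicate over the Z, C, N, V flags).
CONDITIONS = {
    0: ("",    lambda z, c, n, v: True),             # always execute
    1: (".LT", lambda z, c, n, v: n),                # 'N' set
    2: (".Z",  lambda z, c, n, v: z),                # 'Z' set
    3: (".NZ", lambda z, c, n, v: not z),            # 'Z' not set
    4: (".GT", lambda z, c, n, v: not n and not z),  # 'N' and 'Z' both clear
    5: (".GE", lambda z, c, n, v: not n),            # 'N' clear, 'Z' irrelevant
    6: (".C",  lambda z, c, n, v: c),                # carry set
    7: (".V",  lambda z, c, n, v: v),                # overflow set
}


def condition_met(code, z=False, c=False, n=False, v=False, vliw=False):
    """Return True when an instruction carrying this condition code executes.

    VLIW instructions only have a 2-bit condition field, so only codes 0..3
    are accepted when vliw is True.
    """
    limit = 4 if vliw else 8
    if not 0 <= code < limit:
        raise ValueError("condition code out of range")
    _, predicate = CONDITIONS[code]
    return bool(predicate(z, c, n, v))
```

Note that a "less than or equal" test has no single code in this model either, which mirrors the limitation described in this section.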
|
|
\section{Operand B}
|
\section{Operand B}
|
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
|
Many instruction forms have a 19-bit source ``Operand B'' associated with them.
|
This Operand B is either equal to a register plus a signed immediate offset,
|
This ``Operand B'' is shown in Fig.~\ref{fig:iset-format} as part of the
|
or an immediate offset by itself. This value is encoded as shown in
|
standard instructions. This Operand B is either equal to a register plus a
|
Tbl.~\ref{tbl:opb}.
|
14--bit signed immediate offset, or an 18--bit signed immediate offset by
|
\begin{table}\begin{center}
|
itself. This value is encoded as shown in Tbl.~\ref{tbl:opb}.
|
\begin{tabular}{|l|l|l|}\hline
|
\begin{table}\begin{center}
|
Bit 20 & 19 \ldots 16 & 15 \ldots 0 \\\hline
|
\begin{bytefield}[endianness=big]{19}
|
1'b0 & \multicolumn{2}{l|}{20--bit Signed Immediate value} \\\hline
|
\bitheader{0-18} \\
|
1'b1 & 4-bit Register & 16--bit Signed immediate offset \\\hline
|
\bitbox{1}{0}\bitbox{18}{18-bit Signed Immediate} \\
|
\end{tabular}
|
\bitbox{1}{1}\bitbox{4}{Reg}\bitbox{14}{14-bit Signed Immediate}
|
|
\end{bytefield}
|
\caption{Bit allocation for Operand B}\label{tbl:opb}
|
\caption{Bit allocation for Operand B}\label{tbl:opb}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
|
Sixteen and twenty bit immediate values don't make sense for all instructions.
|
Fourteen and eighteen bit immediate values don't make sense for all
|
For example, what is the point of a 20--bit immediate when executing a 16--bit
|
instructions. For example, what is the point of an 18--bit immediate when
|
multiply? Likewise, why have a 16--bit immediate when adding to a logical
|
executing a 16--bit multiply? Or a 16--bit load--immediate? In these cases,
|
or arithmetic shift? In these cases, the extra bits are reserved for future
|
the extra bits are simply ignored.
|
instruction possibilities.
|
|
|
VLIW instructions still use the same operand B, only there was no room for any
|
|
register plus immediate addressing. Therefore, VLIW instructions have either
|
|
a register or a 4--bit signed immediate as their operand B. The only exception
|
|
is the load immediate instruction, which permits a 5--bit signed operand
|
|
B.\footnote{Although the space exists to extend this VLIW load immediate
|
|
instruction to six bits, the 5--bit limit was chosen to simplify the
|
|
disassembler. This may change in the future.}
|
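The 19--bit Operand B layout of the revised instruction set can be sketched as a small decoder. This is only an illustration in Python, assuming the bit ordering shown in the bytefield (bit 18 selects the form, bits 17..14 hold the register, and the remaining low bits hold the signed immediate). The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_operand_b(field):
    """Decode a 19-bit Operand B field into (register, immediate).

    register is None for the immediate-only form (bit 18 clear).
    """
    if not 0 <= field < (1 << 19):
        raise ValueError("Operand B is a 19-bit field")
    if field >> 18:
        register = (field >> 14) & 0xF         # 4-bit register index
        return register, sign_extend(field, 14)
    return None, sign_extend(field, 18)        # 18-bit signed immediate
```

For example, `decode_operand_b((1 << 18) | (5 << 14) | 8)` yields register 5 with an offset of 8, while a field with bit 18 clear is treated purely as a signed immediate.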
|
|
\section{Address Modes}
|
\section{Address Modes}
|
The Zip CPU supports two addressing modes: register plus immediate, and
|
The Zip CPU supports two addressing modes: register plus immediate, and
|
immediate address. Addresses are therefore encoded in the same fashion as
|
immediate address. Addresses are therefore encoded in the same fashion as
|
Operand B's, shown above.
|
Operand B's, shown above. Practically, the VLIW instruction set only offers
|
|
register addressing, necessitating a non--VLIW instruction for most memory
|
|
operations.
|
|
|
A lot of long hard thought was put into whether to allow pre/post increment
|
A lot of long hard thought was put into whether to allow pre/post increment
|
and decrement addressing modes. Finding no way to use these operators without
|
and decrement addressing modes. Finding no way to use these operators without
|
taking two or more clocks per instruction,\footnote{The two clocks figure
|
taking two or more clocks per instruction,\footnote{The two clocks figure
|
comes from the design of the register set, allowing only one write per clock.
|
comes from the design of the register set, allowing only one write per clock.
|
Line 649... |
Line 854... |
the CPU needs access to non--supervisory registers while in supervisory mode.
|
the CPU needs access to non--supervisory registers while in supervisory mode.
|
Therefore, the MOV instruction is special and offers access to these registers
|
Therefore, the MOV instruction is special and offers access to these registers
|
\ldots when in supervisory mode. To keep the compiler simple, the extra bits
|
\ldots when in supervisory mode. To keep the compiler simple, the extra bits
|
are ignored in non-supervisory mode (as though they didn't exist), rather than
|
are ignored in non-supervisory mode (as though they didn't exist), rather than
|
being mapped to new instructions or additional capabilities. The bits
|
being mapped to new instructions or additional capabilities. The bits
|
indicating which register set each register lies within are the A-Usr and
|
indicating which register set each register lies within are the A-User, marked
|
B-Usr bits. When set to a one, these refer to a user mode register. When set
|
`A' in Fig.~\ref{fig:iset-format}, and B-User bits, marked as `B'. When set
|
to a zero, these refer to a register in the current mode, whether user or
|
to a one, these refer to a user mode register. When set to a zero, these
|
supervisor. Further, because a load immediate instruction exists, there is no
|
refer to a register in the current mode, whether user or supervisor. Further,
|
move capability between an immediate and a register: all moves come from either
|
because a load immediate instruction exists, there is no move capability
|
a register or a register plus an offset.
|
between an immediate and a register: all moves come from either a register or
|
|
a register plus an offset.
|
This actually leads to a bit of a problem: since the MOV instruction encodes
|
|
which register set each register is coming from or moving to, how shall a
|
This actually leads to a bit of a problem: since the {\tt MOV} instruction
|
compiler or assembler know how to compile a MOV instruction without knowing
|
encodes which register set each register is coming from or moving to, how shall
|
|
a compiler or assembler know how to compile a MOV instruction without knowing
|
the mode of the CPU at the time? For this reason, the compiler will assume
|
the mode of the CPU at the time? For this reason, the compiler will assume
|
all MOV registers are supervisor registers, and display them as normal.
|
all MOV registers are supervisor registers, and display them as normal.
|
Anything with the user bit set will be treated as a user register. The CPU
|
Anything with the user bit set will be treated as a user register and displayed
|
will quietly ignore the supervisor bits while in user mode, and anything
|
special. Since the CPU quietly ignores the supervisor bits while in user mode,
|
marked as a user register will always be valid.
|
anything marked as a user register will always be specific.
|
|
|
\section{Multiply Operations}
|
\section{Multiply Operations}
|
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
|
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
|
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). In both
|
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). A 32--bit
|
cases, the operand is a register plus a 16-bit immediate, subject to the
|
multiply, should it be desired, needs to be created via software from this
|
rule that the register cannot be the PC or CC registers. The PC register
|
16x16 bit multiply.
|
field has been stolen to create a multiply by immediate instruction. The
|
|
CC register field is reserved.
|
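Since a 32--bit multiply has to be built in software from the 16x16 bit multiply, a sketch of one way to do so may be helpful. The Python model below assumes that {\tt MPYU} multiplies the low 16 bits of each operand and yields a 32--bit result; that behavior, and the names `mpyu` and `mul32`, are assumptions made for illustration only.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mpyu(a, b):
    """Assumed model of a 16x16 unsigned multiply with a 32-bit result."""
    return ((a & MASK16) * (b & MASK16)) & MASK32


def mul32(a, b):
    """Low 32 bits of a 32x32 product, built from three 16x16 multiplies."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    low = mpyu(a_lo, b_lo)
    # The two cross terms only contribute to the upper half of the result.
    cross = (mpyu(a_hi, b_lo) + mpyu(a_lo, b_hi)) & MASK32
    # a_hi * b_hi would be shifted left by 32 bits, so it is dropped entirely.
    return (low + ((cross << 16) & MASK32)) & MASK32
```

The low 32 bits of the product are the same whether the operands are viewed as signed or unsigned, so this one routine serves both cases as long as only a 32--bit result is wanted.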
\section{Divide Unit}
|
|
The Zip CPU also has a divide unit which can be built alongside the ALU.
|
|
This divide unit provides the Zip CPU with its first two instructions that
|
|
cannot be executed in a single cycle: {\tt DIVS}, or signed divide, and
|
|
{\tt DIVU}, the unsigned divide. These are both 32--bit divide instructions,
|
|
dividing one 32--bit number by another. In this case, the Operand B field,
|
|
whether it be register or register plus immediate, constitutes the denominator,
|
|
whereas the numerator is given by the other register.
|
|
|
|
The Divide is also a multi--clock instruction. While the divide is running,
|
|
the ALU, memory unit, and floating point unit (if installed) will be idle.
|
|
Once the divide completes, other units may continue.
|
|
|
|
Of course, divides can have errors: division by zero. In the case of division
|
|
by zero, an exception will be caused that will send the CPU either from
|
|
user mode to supervisor mode, or halt the CPU if it is already in supervisor
|
|
mode.
|
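The divide-by-zero behavior described in this section can be modeled with a short Python sketch. Only the unsigned divide is shown, since the text does not state how {\tt DIVS} rounds. The exception class and the state names "supervisor" and "halted" are hypothetical stand-ins for what the CPU does next.

```python
MASK32 = 0xFFFFFFFF


class DivideByZero(Exception):
    """Raised by the model when the denominator is zero."""

    def __init__(self, next_state):
        super().__init__("division by zero")
        self.next_state = next_state  # where the CPU ends up afterwards


def divu(numerator, denominator, user_mode):
    """Model of a 32-bit unsigned divide: numerator / Operand B."""
    numerator &= MASK32
    denominator &= MASK32
    if denominator == 0:
        # User mode traps to supervisor mode; supervisor mode halts the CPU.
        raise DivideByZero("supervisor" if user_mode else "halted")
    return numerator // denominator
```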
|
|
\section{Floating Point}
|
\section{NOOP, BREAK, and Bus Lock Instruction}
|
The Zip CPU does not (yet) support floating point operations. However, the
|
Three instructions are not listed in the opcode list in
|
instruction set reserves two possibilities for future floating point
|
Tbl.~\ref{tbl:iset-opcodes}, yet fit in the NOOP type instruction format of
|
operations.
|
Fig.~\ref{fig:iset-format}. These are the {\tt NOOP}, {\tt BREAK}, and
|
|
bus {\tt LOCK} instructions. These are encoded according to
|
|
Fig.~\ref{fig:iset-noop}, and have the following meanings:
|
|
\begin{figure}\begin{center}
|
|
\begin{bytefield}[endianness=big]{32}
|
|
\bitheader{0-31}\\
|
|
\begin{leftwordgroup}{NOOP}
|
|
\bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{}
|
|
\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Ignored} \\
|
|
\bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{}
|
|
\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{---} \\
|
|
\bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---}
|
|
\bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}
|
|
\bitbox{5}{Ignored}
|
|
\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{BREAK}
|
|
\bitbox{1}{0}\bitbox{3}{3'h7}
|
|
\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored}
|
|
\end{leftwordgroup} \\
|
|
\begin{leftwordgroup}{LOCK}
|
|
\bitbox{1}{0}\bitbox{3}{3'h7}
|
|
\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{100}\bitbox{22}{Ignored}
|
|
\end{leftwordgroup} \\
|
|
\end{bytefield}
|
|
\caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop}
|
|
\end{center}\end{figure}
|
|
|
The first floating point operation hole in the instruction set involves
|
The {\tt NOOP} instruction is just that: an instruction that does not perform
|
setting a proposed (but non-existent) floating point bit in the CC register.
|
any operation. While many other instructions, such as a move from a register to
|
The next instruction
|
itself, could also fit these roles, only the NOOP instruction guarantees that
|
would then simply interpret its operands as floating point instructions.
|
it will not stall waiting for a register to be available. For this reason,
|
Not all instructions, however, have floating point equivalents. Further, the
|
it gets its own place in the instruction set.
|
immediate fields do not apply in floating point mode, and must be set to
|
|
zero. Not all instructions make sense as floating point operations.
|
The {\tt BREAK} instruction is useful for creating a debug instruction that
|
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as
|
will halt the CPU without executing. If in user mode, depending upon the
|
floating point instructions. Other instructions allow the examining of the
|
setting of the break enable bit, it will either switch to supervisor mode or
|
floating point bit in the CC register. In all cases, the floating point bit
|
halt the CPU--depending upon where the user wishes to do his debugging.
|
is cleared one instruction after it is set.
|
|
|
Finally, the {\tt LOCK} instruction was added in order to make a test and
|
The other possibility for floating point operations involves exploiting the
|
set multi--CPU operation possible. Following a LOCK instruction, the next
|
hole in the instruction set that the NOOP and BREAK instructions reside within.
|
two instructions, if they are memory LOD/STO instructions, will execute without
|
These two instructions use 24--bits of address space, when only a single bit
|
dropping the wishbone {\tt CYC} line between the instructions. Thus a
|
is necessary. A simple adjustment to this space could create instructions
|
{\tt LOCK} followed by {\tt LOD (Rx),Ry} and a {\tt STO Rz,(Rx)}, where Rz
|
with 4--bit register addresses for each register, a 3--bit field for
|
is initially set, can be used to set an address while guaranteeing that Ry
|
conditional execution, and a 2--bit field for which operation.
|
was the value before setting the address to Rz. This is a useful instruction
|
In this fashion, such a floating point capability would only fill 13--bits of
|
while trying to achieve concurrency among multiple CPUs.
|
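The effect of a {\tt LOCK} followed by a {\tt LOD} and a {\tt STO} can be approximated in software as an atomic test and set. The Python sketch below is not a model of the wishbone bus itself: a `threading.Lock` merely stands in for holding the {\tt CYC} line across both accesses, and the class and method names are hypothetical.

```python
import threading


class LockedBus:
    """Toy memory whose test_and_set mimics LOCK; LOD (Rx),Ry; STO Rz,(Rx)."""

    def __init__(self, words):
        self.mem = [0] * words
        self._cyc = threading.Lock()  # stands in for holding CYC high

    def test_and_set(self, addr, new_value):
        with self._cyc:
            old = self.mem[addr]        # LOD (Rx),Ry
            self.mem[addr] = new_value  # STO Rz,(Rx)
        return old                      # Ry: the value before the store
```

A caller that gets back zero from `test_and_set(addr, 1)` has acquired the flag at `addr`; any other caller sees the nonzero value already stored there and knows it must retry.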
the 24--bit field, still leaving lots of room for expansion.
|
|
|
|
In both cases, the Zip CPU would support 32--bit single precision floats
|
|
only, since other choices would complicate the pipeline.
|
|
|
|
The current architecture does not support a floating point not-implemented
|
|
interrupt. Any soft floating point emulation must be done deliberately.
|
|
|
|
\section{Native Instructions}
|
|
The instruction set for the Zip CPU is summarized in
|
|
Tbl.~\ref{tbl:zip-instructions}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
|
|
\rowcolor[gray]{0.85}
|
|
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
|
|
& \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
|
|
& Sets CC? \\\hline\hline
|
|
{\tt CMP(Sub)} & \multicolumn{4}{l|}{4'h0}
|
|
& \multicolumn{4}{l|}{D. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt TST(And)} & \multicolumn{4}{l|}{4'h1}
|
|
& \multicolumn{4}{l|}{D. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt MOV} & \multicolumn{4}{l|}{4'h2}
|
|
& \multicolumn{4}{l|}{D. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& A-Usr
|
|
& \multicolumn{4}{l|}{B-Reg}
|
|
& B-Usr
|
|
& \multicolumn{15}{l|}{15'bit signed offset}
|
|
& \\\hline
|
|
{\tt LODI} & \multicolumn{4}{l|}{4'h3}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{24}{l|}{24'bit Signed Immediate}
|
|
& \\\hline
|
|
{\tt NOOP} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{4'he}
|
|
& \multicolumn{24}{l|}{24'h00}
|
|
& \\\hline
|
|
{\tt BREAK} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{4'he}
|
|
& \multicolumn{24}{l|}{24'h01}
|
|
& \\\hline
|
|
{\em Reserved} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{4'he}
|
|
& \multicolumn{24}{l|}{24'bits, but not 0 or 1.}
|
|
& \\\hline
|
|
{\tt LODIHI }& \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{4'hf}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b1
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{16}{l|}{16-bit Immediate}
|
|
& \\\hline
|
|
{\tt LODILO} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{4'hf}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b0
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{16}{l|}{16-bit Immediate}
|
|
& \\\hline
|
|
16-b {\tt MPYU} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b0 & \multicolumn{4}{l|}{Reg}
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
|
& Yes \\\hline
|
|
16-b {\tt MPYU}(I) & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b0 & \multicolumn{4}{l|}{4'hf}
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
|
& Yes \\\hline
|
|
16-b {\tt MPYS} & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b1 & \multicolumn{4}{l|}{Reg}
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
|
& Yes \\\hline
|
|
16-b {\tt MPYS}(I) & \multicolumn{4}{l|}{4'h4}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& 1'b1 & \multicolumn{4}{l|}{4'hf}
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
|
& Yes \\\hline
|
|
{\tt ROL} & \multicolumn{4}{l|}{4'h5}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B, truncated to low order 5 bits}
|
|
& \\\hline
|
|
{\tt LOD} & \multicolumn{4}{l|}{4'h6}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B address}
|
|
& \\\hline
|
|
{\tt STO} & \multicolumn{4}{l|}{4'h7}
|
|
& \multicolumn{4}{l|}{D. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B address}
|
|
& \\\hline
|
|
{\tt SUB} & \multicolumn{4}{l|}{4'h8}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt AND} & \multicolumn{4}{l|}{4'h9}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt ADD} & \multicolumn{4}{l|}{4'ha}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt OR} & \multicolumn{4}{l|}{4'hb}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt XOR} & \multicolumn{4}{l|}{4'hc}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B}
|
|
& Yes \\\hline
|
|
{\tt LSL/ASL} & \multicolumn{4}{l|}{4'hd}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
|
& Yes \\\hline
|
|
{\tt ASR} & \multicolumn{4}{l|}{4'he}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
|
& Yes \\\hline
|
|
{\tt LSR} & \multicolumn{4}{l|}{4'hf}
|
|
& \multicolumn{4}{l|}{R. Reg}
|
|
& \multicolumn{3}{l|}{Cond.}
|
|
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
|
& Yes \\\hline
|
|
\end{tabular}
|
|
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
|
|
\end{center}\end{table}
|
|
|
|
As you can see, there's lots of room for instruction set expansion. The
|
\section{Floating Point}
|
NOOP and BREAK instructions are the only instructions within one particular
|
Although the Zip CPU does not (yet) have a floating point unit, the current
|
24--bit hole. The rest of this space is reserved for future enhancements.
|
instruction set offers eight opcodes for floating point operations, and treats
|
|
floating point exceptions like divide by zero errors. Once this unit is built
|
|
and integrated together with the rest of the CPU, the Zip CPU will support
|
|
32--bit floating point instructions natively. Any 64--bit floating point
|
|
instructions will still need to be emulated in software.
|
|
|
\section{Derived Instructions}
|
\section{Derived Instructions}
|
The Zip CPU supports many other common instructions, but not all of them
|
The Zip CPU supports many other common instructions, but not all of them
|
are single cycle instructions. The derived instruction tables,
|
are single cycle instructions. The derived instruction tables,
|
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
|
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
|
Line 868... |
Line 972... |
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
|
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
|
& \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
|
& \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
|
& Add with carry \\\hline
|
& Add with carry \\\hline
|
{\tt BRA.Cond +/-\$Addr}
|
{\tt BRA.Cond +/-\$Addr}
|
& \hbox{\tt MOV.cond \$Addr+PC,PC}
|
& \hbox{\tt MOV.cond \$Addr+PC,PC}
|
& Branch or jump on condition. Works for 15--bit
|
& Branch or jump on condition. Works for 13--bit
|
signed address offsets.\\\hline
|
signed address offsets.\\\hline
|
{\tt BRA.Cond +/-\$Addr}
|
{\tt BRA.Cond +/-\$Addr}
|
& \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
|
& \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
|
& Branch/jump on condition. Works for
|
& Branch/jump on condition. Works for
|
23 bit address offsets, but costs a register, an extra instruction,
|
23 bit address offsets, but costs a register, an extra instruction,
|
Line 894... |
Line 998... |
& Exchanges the top and bottom 16--bit words of Rx \\\hline
|
& Exchanges the top and bottom 16--bit words of Rx \\\hline
|
{\tt HALT }
|
{\tt HALT }
|
& {\tt Or \$SLEEP,CC}
|
& {\tt Or \$SLEEP,CC}
|
& This only works when issued in interrupt/supervisor mode. In user
|
& This only works when issued in interrupt/supervisor mode. In user
|
mode this is simply a wait until interrupt instruction. \\\hline
|
mode this is simply a wait until interrupt instruction. \\\hline
|
{\tt INT } & {\tt LDI \$0,CC} & \\\hline
|
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
|
{\tt IRET}
|
{\tt IRET}
|
& {\tt OR \$GIE,CC}
|
& {\tt OR \$GIE,CC}
|
& Also known as an RTU instruction (Return to Userspace) \\\hline
|
& Also known as an RTU instruction (Return to Userspace) \\\hline
|
{\tt JMP R6+\$Addr}
|
{\tt JMP R6+\$Addr}
|
& {\tt MOV \$Addr(R6),PC}
|
& {\tt MOV \$Addr(R6),PC}
|
& \\\hline
|
& \\\hline
|
|
{\tt LJMP \$Addr}
|
|
& \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }}
|
|
& Although this only works for an unconditional jump, and it only
|
|
works in a Von Neumann architecture, this instruction combination makes
|
|
for a nice combination that can be adjusted by a linker at a later
|
|
time.\\\hline
|
{\tt JSR PC+\$Addr}
|
{\tt JSR PC+\$Addr}
|
& \parbox[t]{1.5in}{\tt SUB \$1,SP \\\
|
& \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ MOV \$addr+PC,PC}
|
MOV \$3+PC,R0 \\
|
& This is similar to the jump and link instructions from other
|
STO R0,1(SP) \\
|
architectures, save only that it requires a specific link
|
MOV \$Addr+PC,PC \\
|
instruction, also known as the {\tt MOV} instruction on the
|
ADD \$1,SP}
|
left.\\\hline
|
& Jump to Subroutine. Note the required cleanup instruction after
|
|
returning. This could easily be turned into a three instruction
|
|
operand, removing the preliminary stack instruction before and
|
|
the cleanup after, by adjusting how any stack frame was built for
|
|
this routine to include space at the top of the stack for the PC.
|
|
Note also that jumping to a subroutine costs a copy register, {\tt R0}
|
|
in this case.
|
|
\\\hline
|
|
{\tt JSR PC+\$Addr }
|
|
& \parbox[t]{1.5in}{\tt MOV \$3+PC,R12 \\ MOV \$addr+PC,PC}
|
|
&This is the high speed
|
|
version of a subroutine call, necessitating a register to hold the
|
|
last PC address. In its favor, this method doesn't suffer the
|
|
mandatory memory access of the other approach. \\\hline
|
|
{\tt LDI.l \$val,Rx }
|
|
& \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
|
|
LDILO (\$val\&0x0ffff),Rx}
|
|
& Sadly, there's not enough instruction
|
|
space to load a complete immediate value into any register.
|
|
Therefore, fully loading any register takes two cycles.
|
|
The LDIHI (load immediate high) and LDILO (load immediate low)
|
|
instructions have been created to facilitate this. \\\hline
|
|
\end{tabular}
|
\end{tabular}
|
\caption{Derived Instructions}\label{tbl:derived-1}
|
\caption{Derived Instructions}\label{tbl:derived-1}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
Mapped & Actual & Notes \\\hline
|
Mapped & Actual & Notes \\\hline
|
|
{\tt LDI.l \$val,Rx }
|
|
& \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
|
|
LDILO (\$val\&0x0ffff),Rx}
|
|
& \parbox[t]{3.0in}{Sadly, there's not enough instruction
|
|
space to load a complete immediate value into any register.
|
|
Therefore, fully loading any register takes two cycles.
|
|
The LDIHI (load immediate high) and LDILO (load immediate low)
|
|
instructions have been created to facilitate this.
|
|
\\
|
|
This is also the appropriate means for setting a register value
|
|
to an arbitrary 32--bit value in a post--assembly link
|
|
operation.}\\\hline
|
{\tt LOD.b \$addr,Rx}
|
{\tt LOD.b \$addr,Rx}
|
& \parbox[t]{1.5in}{\tt %
|
& \parbox[t]{1.5in}{\tt %
|
LDI \$addr,Ra \\
|
LDI \$addr,Ra \\
|
LDI \$addr,Rb \\
|
LDI \$addr,Rb \\
|
LSR \$2,Ra \\
|
LSR \$2,Ra \\
|
Line 981... |
Line 1081... |
operations have consequences in that they might stall the bus if
|
operations have consequences in that they might stall the bus if
|
Rx isn't ready yet. For this reason, we have a dedicated NOOP
|
Rx isn't ready yet. For this reason, we have a dedicated NOOP
|
instruction. \\\hline
|
instruction. \\\hline
|
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline
|
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline
|
{\tt POP Rx }
|
{\tt POP Rx }
|
& \parbox[t]{1.5in}{\tt LOD \$1(SP),Rx \\ ADD \$1,SP}
|
& \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP}
|
& Note
|
& \\\hline
|
that for interrupt purposes, one can never depend upon the value at
|
|
(SP). Hence you read from it, then increment it, lest having
|
|
incremented it first something then comes along and writes to that
|
|
value before you can read the result. \\\hline
|
|
\end{tabular}
|
\end{tabular}
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
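The {\tt LDI.l} derivation splits a 32--bit value into the two 16--bit immediates given to {\tt LDIHI} and {\tt LDILO}. A minimal Python sketch of that split follows. The function `load_pair` assumes that {\tt LDIHI} replaces only the upper half of the register and {\tt LDILO} only the lower half, which the tables here do not spell out, and both function names are hypothetical.

```python
def split_ldi(value):
    """Return the (LDIHI, LDILO) immediates for a 32-bit value."""
    value &= 0xFFFFFFFF
    high = (value >> 16) & 0x0FFFF
    low = value & 0x0FFFF
    return high, low


def load_pair(high, low, register=0):
    """Assumed effect of LDIHI high,Rx followed by LDILO low,Rx."""
    register = (register & 0x0000FFFF) | ((high & 0xFFFF) << 16)  # LDIHI
    register = (register & 0xFFFF0000) | (low & 0xFFFF)           # LDILO
    return register
```

A linker fixing up an address after assembly would, under the same assumption, only need to rewrite the two 16--bit immediate fields.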
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
{\tt PUSH Rx}
|
{\tt PUSH Rx}
|
& \parbox[t]{1.5in}{SUB \$1,SP \\
|
& \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP}
|
STO Rx,\$1(SP)}
|
\hbox{\tt STO Rx,\$(SP)}}
|
& Note that for pipelined operation, it helps to coalesce all the
|
& Note that for pipelined operation, it helps to coalesce all the
|
{\tt SUB}'s into one command, and place the {\tt STO}'s right
|
{\tt SUB}'s into one command, and place the {\tt STO}'s right
|
after each other.\\\hline
|
after each other. Further, to avoid a pipeline stall, the
|
|
immediate value for the store must be zero.
|
|
\\\hline
|
{\tt PUSH Rx-Ry}
|
{\tt PUSH Rx-Ry}
|
& \parbox[t]{1.5in}{\tt SUB \$n,SP \\
|
& \parbox[t]{1.5in}{\tt SUB \$$n$,SP \\
|
STO Rx,\$n(SP)
|
STO Rx,\$(SP)
|
\ldots \\
|
\ldots \\
|
STO Ry,\$1(SP)}
|
STO Ry,\$$\left(n-1\right)$(SP)}
|
& Multiple pushes at once only need the single subtract from the
|
& Multiple pushes at once only need the single subtract from the
|
stack pointer. This derived instruction is analogous to a similar one
|
stack pointer. This derived instruction is analogous to a similar one
|
on the Motorola 68k architecture, although the Zip Assembler
|
on the Motorola 68k architecture, although the Zip Assembler
|
does not support this instruction (yet). This instruction
|
does not support this instruction (yet). This instruction
|
also supports pipelined memory access.\\\hline
|
also supports pipelined memory access.\\\hline
|
{\tt RESET}
|
{\tt RESET}
|
& \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
|
& \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
|
& This depends upon the peripheral base address being
|
& This depends upon the peripheral base address being
|
in R12.
|
preloaded into R12.
|
|
|
Another opportunity might be to jump to the reset address from within
|
Another opportunity might be to jump to the reset address from within
|
supervisor mode.\\\hline
|
supervisor mode.\\\hline
|
{\tt RET} & \parbox[t]{1.5in}{\tt LOD \$1(SP),PC}
|
{\tt RET} & {\tt MOV R0,PC}
|
& Note that this depends upon the calling context to clean up the
|
& This depends upon the form of the {\tt JSR} given on the previous
|
stack, as outlined for the JSR instruction. \\\hline
|
page that stores the return address into R0.
|
{\tt RET} & {\tt MOV R12,PC}
|
|
& This is the high(er) speed version, that doesn't touch the stack.
|
|
As such, it doesn't suffer a stall on memory read/write to the stack.
|
|
\\\hline
|
\\\hline
|
{\tt STEP Rr,Rt}
|
{\tt STEP Rr,Rt}
|
& \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
|
& \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
|
& Step a Galois implementation of a Linear Feedback Shift Register, Rr,
|
& Step a Galois implementation of a Linear Feedback Shift Register, Rr,
|
using taps Rt \\\hline
|
using taps Rt \\\hline
|
Line 1049... |
Line 1144... |
32-bit word address. This also limits the address space
|
32-bit word address. This also limits the address space
|
of character accesses from 16 MB down to 4 MB.
|
of character accesses from 16 MB down to 4 MB.
|
Further, this instruction implies a byte ordering,
|
Further, this instruction implies a byte ordering,
|
such as big or little endian.} \\\hline
|
such as big or little endian.} \\\hline
|
{\tt SWAP Rx,Ry }
|
{\tt SWAP Rx,Ry }
|
& \parbox[t]{1.5in}{\tt
|
& \parbox[t]{1.5in}{\tt XOR Ry,Rx \\ XOR Rx,Ry \\ XOR Ry,Rx}
|
XOR Ry,Rx \\
|
|
XOR Rx,Ry \\
|
|
XOR Ry,Rx}
|
|
& While no extra registers are needed, this example
|
& While no extra registers are needed, this example
|
does take 3-clocks. \\\hline
|
does take 3-clocks. \\\hline
|
|
\end{tabular}
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
|
\end{center}\end{table}
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
{\tt TRAP \#X}
|
{\tt TRAP \#X}
|
& \parbox[t]{1.5in}{\tt LDI \$x,R0 \\ AND \~\$GIE,CC }
|
& \parbox[t]{1.5in}{\tt LDI \$x,R0 \\ AND \~\$GIE,CC }
|
& This works because whenever a user lowers the \$GIE flag, it sets
|
& This works because whenever a user lowers the \$GIE flag, it sets
|
a TRAP bit within the CC register. Therefore, upon entering the
|
a TRAP bit within the CC register. Therefore, upon entering the
|
supervisor state, the CPU only need check this bit to know that it
|
supervisor state, the CPU only need check this bit to know that it
|
got there via a TRAP. The trap could be made conditional by making
|
got there via a TRAP. The trap could be made conditional by making
|
the LDI and the AND conditional. In that case, the assembler would
|
the LDI and the AND conditional. In that case, the assembler would
|
quietly turn the LDI instruction into an LDILO and LDIHI pair,
|
quietly turn the LDI instruction into an LDILO and LDIHI pair,
|
but the effect would be the same. \\\hline
|
but the effect would be the same. \\\hline
|
\end{tabular}
|
{\tt TS Rx,Ry,(Rz)}
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
& \hbox{\tt LDI 1,Rx}
|
\end{center}\end{table}
|
\hbox{\tt LOCK}
|
\begin{table}\begin{center}
|
\hbox{\tt LOD (Rz),Ry}
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
\hbox{\tt STO Rx,(Rz)}
|
|
& A test and set instruction. The {\tt LOCK} instruction ensures
|
|
that the next two instructions lock the bus between the instructions,
|
|
so no one else can use it. This guarantees that the operation is
|
|
atomic.
|
|
\\\hline
|
{\tt TST Rx}
|
{\tt TST Rx}
|
& {\tt TST \$-1,Rx}
|
& {\tt TST \$-1,Rx}
|
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx,
|
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx,
|
ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc. The TST and CMP
|
ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc. The TST and CMP
|
approaches won't stall future pipeline stages looking for the value
|
approaches won't stall future pipeline stages looking for the value
|
of Rx. \\\hline
|
of Rx. (Future versions of the assembler may shorten this to a
|
|
{\tt TST Rx} instruction.)\\\hline
|
{\tt WAIT}
|
{\tt WAIT}
|
& {\tt OR \$GIE | \$SLEEP,CC}
|
& {\tt OR \$GIE | \$SLEEP,CC}
|
& Wait until the next interrupt, then jump to supervisor/interrupt
|
& Wait until the next interrupt, then jump to supervisor/interrupt
|
mode.
|
mode.
|
\end{tabular}
|
\end{tabular}
|
\caption{Derived Instructions, continued}\label{tbl:derived-4}
|
\caption{Derived Instructions, continued}\label{tbl:derived-4}
|
\end{center}\end{table}
|
\end{center}\end{table}
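
As an illustration of the {\tt STEP} derived instruction listed in the tables above, the following Python sketch models one step of a Galois LFSR. It assumes that {\tt LSR} leaves the bit shifted out in the carry flag, so that {\tt XOR.C} applies the taps only when that bit was set. The function name, the {\tt width} parameter, and the 16-bit tap value are illustrative choices and not part of the Zip CPU tool chain.

```python
def lfsr_step(state, taps, width=32):
    """Model of: LSR $1,Rr ; XOR.C Rt,Rr  (Rr = state, Rt = taps)."""
    mask = (1 << width) - 1
    carry = state & 1              # bit shifted out by LSR (assumed to land in carry)
    state = (state >> 1) & mask    # LSR $1,Rr
    if carry:                      # XOR.C executes only when carry is set
        state ^= taps              # XOR Rt,Rr
    return state & mask


if __name__ == "__main__":
    # Hypothetical 16-bit example; the tap mask 0xB400 is chosen for illustration.
    s = 0xACE1
    for _ in range(4):
        s = lfsr_step(s, 0xB400, width=16)
        print(hex(s))
```

A register holding zero never leaves zero under this update, so a nonzero seed is required.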
|
|
|
|
\section{Interrupt Handling}
|
|
The Zip CPU does not maintain any interrupt vector tables. If an interrupt
|
|
takes place, the CPU simply switches to interrupt mode. The supervisor code
|
|
continues in this interrupt mode from where it left off before, after
|
|
executing a return to userspace {\tt RTU} instruction.
|
|
|
|
At this point, the supervisor code needs to determine first whether an
|
|
interrupt has occurred, and then whether it is in interrupt mode due to
|
|
an exception and handle each case appropriately.
|
|
|
\section{Pipeline Stages}
|
\section{Pipeline Stages}
|
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
the Zip CPU supports a five stage pipeline.
|
the Zip CPU supports a five stage pipeline.
|
\begin{enumerate}
|
\begin{enumerate}
|
\item {\bf Prefetch}: Reads instruction from memory and into a cache, if so
|
\item {\bf Prefetch}: Reads instruction from memory and into a cache, if so
|
configured. This
|
configured. This
|
stage is actually pipelined itself, and so it will stall if the PC
|
stage is actually pipelined itself, and so it will stall if the PC
|
ever changes. Stalls are also created here if the instruction isn't
|
ever changes. Stalls are also created here if the instruction isn't
|
in the prefetch cache.
|
in the prefetch cache.
|
|
|
The Zip CPU supports one of two prefetch methods, depending upon a flag
|
The Zip CPU supports one of three prefetch methods, depending upon a
|
set at build time within the {\tt zipcpu.v} file. The simplest is a
|
flag set at build time within the {\tt cpudefs.v} file. The simplest
|
non--cached implementation of a prefetch. This implementation is
|
is a non--cached implementation of a prefetch. This implementation is
|
fairly small, and ideal for
|
fairly small, and ideal for users of the Zip CPU who need the extra
|
users of the Zip CPU who need the extra space on the FPGA fabric.
|
space on the FPGA fabric. However, because this non--cached version
|
However, because this non--cached version has no cache, the maximum
|
has no cache, the maximum number of instructions per clock is limited
|
number of instructions per clock is limited to about one per five.
|
to about one per five.
|
|
|
The second prefetch module is a pipelined prefetch with a cache. This
|
The second prefetch module is a pipelined prefetch with a cache. This
|
module tries to keep the instruction address within a window of valid
|
module tries to keep the instruction address within a window of valid
|
instruction addresses. While effective, it is not a traditional
|
instruction addresses. While effective, it is not a traditional
|
cache implementation. One unique feature of this cache implementation,
|
cache implementation. One unique feature of this cache implementation,
|
however, is that it can be cleared in a single clock. A disappointing
|
however, is that it can be cleared in a single clock. A disappointing
|
feature, though, is that it needs an extra internal pipeline stage
|
feature, though, is that it needs an extra internal pipeline stage
|
to be implemented.
|
to be implemented.
|
|
|
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read,
|
The third prefetch and cache module implements a more traditional cache.
|
and immediate offset. This stage also determines whether the flags will
|
While the resulting code tends to be twice as fast as the pipelined
|
be set or whether the result will be written back.
|
cache architecture, this implementation uses a large amount of
|
|
distributed FPGA RAM to be successful. This then inflates the Zip CPU's
|
|
FPGA usage statistics.
|
|
|
|
\item {\bf Decode}: Decodes an instruction into OpCode, register(s) to read,
|
|
and immediate offset. This stage also determines whether the flags
|
|
will be set or whether the result will be written back.
|
|
|
\item {\bf Read Operands}: Read registers and apply any immediate values to
|
\item {\bf Read Operands}: Read registers and apply any immediate values to
|
them. There is no means of detecting or flagging arithmetic overflow
|
them. There is no means of detecting or flagging arithmetic overflow
|
or carry when adding the immediate to the operand. This stage will
|
or carry when adding the immediate to the operand. This stage will
|
stall if any source operand is pending.
|
stall if any source operand is pending.
|
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
|
|
instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load)
|
\item Split into one of four tracks: An {\bf ALU} track which will accomplish
|
and {\tt STO} (store) instructions.
|
a simple instruction, the {\bf MemOps} stage which handles {\tt LOD}
|
|
(load) and {\tt STO} (store) instructions, the {\bf divide} unit,
|
|
and the {\bf floating point} unit.
|
\begin{itemize}
|
\begin{itemize}
|
\item Loads will stall the entire pipeline until complete.
|
\item Loads will stall instructions in the decode stage until the
|
\item Condition codes are available upon completion of the ALU stage
|
load completes, lest a register be read in
|
\item Issuing an instruction to the memory unit while the memory unit
|
the read operands stage only to be updated unseen by the
|
is busy will stall the entire pipeline. If the bus deadlocks,
|
Load.
|
only a reset will release the CPU. (Watchdog timer, anyone?)
|
\item Condition codes are available upon completion of the ALU,
|
\item The Zip CPU currently has no means of reading and acting on any
|
divide, or FPU stage.
|
error conditions on the bus.
|
\item Issuing a non--pipelined memory instruction to the memory unit
|
|
while the memory unit is busy will stall the entire pipeline.
|
\end{itemize}
|
\end{itemize}
|
\item {\bf Write-Back}: Conditionally write back the result to the register
|
\item {\bf Write-Back}: Conditionally write back the result to the register
|
set, applying the condition. This routine is bi-entrant: either the
|
set, applying the condition. This routine is quad-entrant: either the
|
memory or the simple instruction may request a register write.
|
ALU, the memory, the divide, or the FPU may write back a register.
|
|
The only design rule is that no more than a single register may be
|
|
written back in any given clock.
|
\end{enumerate}
|
\end{enumerate}
|
|
|
The Zip CPU does not support out of order execution. Therefore, if the memory
|
The Zip CPU does not support out of order execution. Therefore, if the memory
|
unit stalls, every other instruction stalls. Memory stores, however, can take
|
unit stalls, every other instruction stalls. The same is true for divide or
|
place concurrently with ALU operations, although memory reads (loads) cannot.
|
floating point instructions--all other instructions will stall while waiting
|
|
for these to complete. Memory stores, however, can take place concurrently
|
|
with non--memory operations, although memory reads (loads) cannot.
|
|
|
\section{Pipeline Stalls}
|
\section{Pipeline Stalls}
|
The processing pipeline can and will stall for a variety of reasons. Some of
|
The processing pipeline can and will stall for a variety of reasons. Some of
|
these are obvious, some less so. These reasons are listed below:
|
these are obvious, some less so. These reasons are listed below:
|
\begin{itemize}
|
\begin{itemize}
|
\item When the prefetch cache is exhausted
|
\item When the prefetch cache is exhausted
|
|
|
This reason should be obvious. If the prefetch cache doesn't have the
|
This reason should be obvious. If the prefetch cache doesn't have the
|
instruction in memory, the entire pipeline must stall until enough of the
|
instruction in memory, the entire pipeline must stall until an instruction
|
prefetch cache is loaded to support the next instruction.
|
can be made ready. In the case of the {\tt pipefetch} windowed approach
|
|
to the prefetch cache, this means the pipeline will stall until enough of the
|
|
prefetch cache is loaded to support the next instruction. In the case
|
|
of the more traditional {\tt pfcache} approach, the entire cache line must
|
|
fill before instruction execution can continue.
|
|
|
\item While waiting for the pipeline to load following any taken branch, jump,
|
\item While waiting for the pipeline to load following any taken branch, jump,
|
return from interrupt or switch to interrupt context (5 stall cycles)
|
return from interrupt or switch to interrupt context (4 stall cycles)
|
|
|
Fig.~\ref{fig:bcstalls}
|
Fig.~\ref{fig:bcstalls}
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=3.5in]{../gfx/bc.eps}
|
\includegraphics[width=3.5in]{../gfx/bc.eps}
|
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}
|
\caption{A conditional branch generates 4 stall cycles}\label{fig:bcstalls}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
illustrates the situation for a conditional branch. In this case, the branch
|
illustrates the situation for a conditional branch. In this case, the branch
|
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so
|
instruction, {\tt BC}, is nominally followed by instructions {\tt I1} and so
|
forth. However, since the branch is taken, the next instruction must be
|
forth. However, since the branch is taken, the next instruction must be
|
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded.
|
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded.
|
Given that there are five stages to the pipeline, that accounts
|
Given that there are five stages to the pipeline, that accounts
|
for four of the five stalls. The last stall cycle is lost in the pipelined
|
for the four stalls. (Were the {\tt pipefetch} cache chosen, there would
|
prefetch stage which needs at least one clock with a valid PC before it can
|
be another stall internal to the {\tt pipefetch} cache.)
|
produce a new output. {\Large\bf Note: When I did this myself, I counted
|
|
six stall cycles, for a total of seven cycles for this instruction. Is five
|
|
really the right answer?}
|
|
|
|
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
|
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
|
{\tt LDI \$X,PC} instructions specially, however. These instructions, when
|
{\tt LDI \$X,PC} instructions specially, however. These instructions, when
|
not conditioned on the flags, can execute with only 2~stall cycles, such as
|
not conditioned on the flags, can execute with only a single stall cycle,
|
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is
|
such as is shown in Fig.~\ref{fig:branch}.\footnote{Note that when using the
|
slated to be improved upon in subsequent releases. With a better prefetch,
|
{\tt pipefetch} cache, this requires an additional stall cycle due to that
|
it should be possible to drop this down to one or zero stall cycles.}
|
cache's implementation.}
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=4in]{../gfx/bra.eps}
|
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
|
\caption{An expedited delay costs only 2~stall cycles}\label{fig:branch}
|
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
|
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
|
following the branch in memory, while {\tt IA} is the first instruction at the
|
following the branch in memory, while {\tt IA} is the first instruction at the
|
branch address. ({\tt CLR} denotes a clear--pipeline operation, and does
|
branch address. ({\tt CLR} denotes a clear--pipeline operation, and does
|
not represent any instruction.)
|
not represent any instruction.)
|
Line 1195... |
Line 1324... |
opcode that will read and apply an immediate offset by one instruction. The
|
opcode that will read and apply an immediate offset by one instruction. The
|
good news is that this stall can easily be mitigated by proper scheduling.
|
good news is that this stall can easily be mitigated by proper scheduling.
|
That is, any instruction that does not add an immediate to {\tt RA} may be
|
That is, any instruction that does not add an immediate to {\tt RA} may be
|
scheduled into the stall slot.
|
scheduled into the stall slot.
|
|
|
\item When any (conditional) write to either the CC or PC Register is followed
|
This is also the reason why, when setting up a stack frame, the top of the
|
by a memory operation
|
stack frame is used first: it eliminates this stall cycle. Hence, to save
|
|
registers at the top of a procedure, one would write:
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
|
\item\ {\tt SUB 2,SP}
|
\item\ {\em (stall, even if jump not taken)}
|
\item\ {\tt STO R1,(SP)}
|
\item\ {\tt LOD \$X(RA),RB}
|
\item\ {\tt STO R2,1(SP)}
|
\end{enumerate}
|
\end{enumerate}
|
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},
|
Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack,
|
\begin{figure}\begin{center}
|
there would've been an extra stall in setting up the stack frame.
|
\includegraphics[width=2in]{../gfx/bcmem.eps}
|
|
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}
|
|
\end{center}\end{figure}
|
|
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which
|
|
must be a load here), and ALU instructions {\tt I1} and so forth.
|
|
Since branches take place in the writeback stage, the Zip CPU will stall the
|
|
pipeline for one clock anytime there may be a possible jump--forcing the
|
|
memory operation to stay in the operand decode stage. This prevents
|
|
an instruction from executing a memory access after the jump but before the
|
|
jump is recognized.
|
|
|
|
This stall may be mitigated by shuffling the operations immediately following
|
|
a potential branch so that an ALU operation follows the branch instead of a
|
|
memory operation.
|
|
|
|
\item When reading from the CC register after setting the flags
|
\item When reading from the CC register after setting the flags
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode}
|
\item\ {\tt ALUOP RA,RB} {\em ; Ex: a compare opcode}
|
\item\ {\em (stall)}
|
\item\ {\em (stall)}
|
\item\ {\tt TST sys.ccv,CC}
|
\item\ {\tt TST sys.ccv,CC}
|
\item\ {\tt BZ somewhere}
|
\item\ {\tt BZ somewhere}
|
\end{enumerate}
|
\end{enumerate}
|
|
|
Line 1241... |
Line 1357... |
that does not set flags in between the ALU operation and the instruction
|
that does not set flags in between the ALU operation and the instruction
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
|
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
|
it doesn't read the condition codes.
|
it doesn't read the condition codes before executing.
|
|
|
\item When waiting for a memory read operation to complete
|
\item When waiting for a memory read operation to complete
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt LOD address,RA}
|
\item\ {\tt LOD address,RA}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
Line 1255... |
Line 1371... |
Remember, the Zip CPU does not support out of order execution. Therefore,
|
Remember, the Zip CPU does not support out of order execution. Therefore,
|
anytime the memory unit becomes busy both the memory unit and the ALU must
|
anytime the memory unit becomes busy both the memory unit and the ALU must
|
stall until the memory unit is cleared. This is illustrated in
|
stall until the memory unit is cleared. This is illustrated in
|
Fig.~\ref{fig:memrd},
|
Fig.~\ref{fig:memrd},
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=5in]{../gfx/memrd.eps}
|
\includegraphics[width=5.6in]{../gfx/memrd.eps}
|
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
|
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
since it is especially true of a load
|
since it is especially true of a load
|
instruction, which must still write its operand back to the register file.
|
instruction, which must still write its operand back to the register file.
|
Note that there is an extra stall at the end of the memory cycle, so that
|
Further, note that on a pipelined memory operation, the instruction must
|
the memory unit will be idle for one clock before an instruction will be
|
stall in the decode operand stage, lest it try to read a result from the
|
accepted into the ALU.
|
register file before the load result has been written to it. Finally, note
|
Store instructions are different, as shown in Fig.~\ref{fig:memwr},
|
that there is an extra stall at the end of the memory cycle, so that
|
|
the memory unit will be idle for two clocks before an instruction will be
|
|
accepted into the ALU. Store instructions are different, as shown in
|
|
Fig.~\ref{fig:memwr},
|
\begin{figure}\begin{center}
|
\begin{figure}\begin{center}
|
\includegraphics[width=5in]{../gfx/memwr.eps}
|
\includegraphics[width=4in]{../gfx/memwr.eps}
|
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
|
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
|
\end{center}\end{figure}
|
\end{center}\end{figure}
|
since they can be busy with the bus without impacting later write back
|
since they can be busy with the bus without impacting later write back
|
pipeline stages. Hence, only loads stall the pipeline.
|
pipeline stages. Hence, only loads stall the pipeline.
|
|
|
Line 1308... |
Line 1427... |
by one address each instruction. These conditions work well for saving or
|
by one address each instruction. These conditions work well for saving or
|
storing registers to the stack. Indeed, if you noticed, both
|
storing registers to the stack. Indeed, if you noticed, both
|
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory
|
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory
|
accesses.
|
accesses.
|
|
|
\item When waiting for a conditional memory read operation to complete
|
|
\begin{enumerate}
|
|
\item\ {\tt LOD.Z address,RA}
|
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
|
\item\ {\tt OPCODE I+RA,RB}
|
|
\end{enumerate}
|
|
|
|
In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus
|
|
two cycles before using the bus, so there's a potential for an extra three
|
|
cycle cost due to bus contention between the prefetch and the CPU.
|
|
|
|
This is true for both the LOD and the STO instructions, with the exception that
|
|
the STO instruction will continue in parallel with any ALU instructions that
|
|
follow it.
|
|
|
|
\end{itemize}
|
\end{itemize}
|
|
|
|
|
\chapter{Peripherals}\label{chap:periph}
|
\chapter{Peripherals}\label{chap:periph}
|
|
|
Line 1371... |
Line 1475... |
interrupt and clears the active indicator. This also has the side effect of
|
interrupt and clears the active indicator. This also has the side effect of
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
to re-enable any other interrupts.
|
to re-enable any other interrupts.
|
|
|
The Zip System currently hosts two interrupt controllers, a primary and a
|
The Zip System currently hosts two interrupt controllers, a primary and a
|
secondary. The primary interrupt controller has one interrupt line (perhaps
|
secondary. The primary interrupt controller has one (or more) interrupt line(s)
|
more if you configure it for more) which may come from an external interrupt
|
which may come from an external interrupt source, and one interrupt line from
|
controller, and one interrupt line from the secondary controller. Other
|
the secondary controller. Other primary interrupts include the system timers,
|
primary interrupts include the system timers, the jiffies interrupt, and the
|
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt
|
manual cache interrupt. The secondary interrupt controller maintains an
|
controller maintains an interrupt state for all of the processor accounting
|
interrupt state for all of the processor accounting counters.
|
counters.
|
|
|
\section{Counter}
|
\section{Counter}
|
|
|
The Zip Counter is a very simple counter: it just counts. It cannot be
|
The Zip Counter is a very simple counter: it just counts. It cannot be
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
Line 1425... |
Line 1529... |
transaction. This is useful in the case of any peripherals that are
|
transaction. This is useful in the case of any peripherals that are
|
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may
|
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may
|
then read from its port to find out which memory location created the problem.
|
then read from its port to find out which memory location created the problem.
|
|
|
Aside from its unusual configuration, the bus watchdog is just another
|
Aside from its unusual configuration, the bus watchdog is just another
|
implementation of the fundamental timer described above.
|
implementation of the fundamental timer described above--stripped down
|
|
for simplicity.
|
|
|
\section{Jiffies}
|
\section{Jiffies}
|
|
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
can request to be put to sleep until a certain number of `jiffies' have
|
can request to be put to sleep until a certain number of `jiffies' have
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
from the peripheral (it only has the one location in address space), add the
|
from the peripheral (it only has the one location in address space), add the
|
sleep length to it, and write the result back to the peripheral. The zipjiffies
|
sleep length to it, and write the result back to the peripheral. The
|
|
{\tt zipjiffies}
|
peripheral will record the value written to it only if it is nearer the current
|
peripheral will record the value written to it only if it is nearer the current
|
counter value than the currently waiting interrupt time. If no other
|
counter value than the currently waiting interrupt time. If no other
|
interrupts are waiting, and this time is in the future, it will be enabled.
|
interrupts are waiting, and this time is in the future, it will be enabled.
|
(There is currently no way to disable a jiffie interrupt once set, other
|
(There is currently no way to disable a jiffie interrupt once set, other
|
than to disable the interrupt line in the interrupt controller.) The processor
|
than to disable the interrupt line in the interrupt controller.) The processor
|
Line 1455... |
Line 1561... |
The purpose of this register is to support alarm times within a CPU. To
|
The purpose of this register is to support alarm times within a CPU. To
|
set an alarm for a particular process $N$ clocks in advance, read the current
|
set an alarm for a particular process $N$ clocks in advance, read the current
|
Jiffies value, add $N$, and write it back to the Jiffies register. The
|
Jiffies value, add $N$, and write it back to the Jiffies register. The
|
O/S must also keep track of values written to the Jiffies register. Thus,
|
O/S must also keep track of values written to the Jiffies register. Thus,
|
when an `alarm' trips, it should be removed from the list of alarms, the list
|
when an `alarm' trips, it should be removed from the list of alarms, the list
|
should be sorted, and the next alarm in terms of Jiffies should be written
|
should be resorted, and the next alarm in terms of Jiffies should be written
|
to the register.
|
to the register--possibly for a second time.
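
The bookkeeping described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not O/S code from the Zip CPU distribution: the class and method names are hypothetical, the Jiffies counter is assumed to be 32 bits wide and to wrap around, and the value returned by each method is the value the O/S would then write to the Jiffies register.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit Jiffies counter


def ticks_until(now, when):
    """Signed distance from now to when, allowing for counter wrap-around."""
    diff = (when - now) & MASK
    return diff - (1 << 32) if diff & 0x80000000 else diff


class AlarmList:
    """Hypothetical O/S-side list of pending alarm times."""

    def __init__(self):
        self.pending = []  # absolute Jiffies values

    def next_write(self, now):
        """Nearest pending alarm, or None when no alarm is pending."""
        if not self.pending:
            return None
        return min(self.pending, key=lambda w: ticks_until(now, w))

    def set_alarm(self, now, n):
        """Read Jiffies (now), add n, and remember the result."""
        self.pending.append((now + n) & MASK)
        return self.next_write(now)

    def on_interrupt(self, now):
        """Drop alarms that have tripped, then pick the next one to write."""
        self.pending = [w for w in self.pending if ticks_until(now, w) > 0]
        return self.next_write(now)
```

Comparing alarms by their signed distance from the current count, rather than by their absolute values, keeps the ordering correct when the counter rolls over.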
|
|
|
\section{Direct Memory Access Controller}
|
\section{Direct Memory Access Controller}
|
|
|
The Direct Memory Access (DMA) controller can be used to either move memory
|
The Direct Memory Access (DMA) controller can be used to either move memory
|
from one location to another, to read from a peripheral into memory, or to
|
from one location to another, to read from a peripheral into memory, or to
|
write from memory to a peripheral, all without CPU intervention. Further,
|
write from memory to a peripheral, all without CPU intervention. Further,
|
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
|
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
|
any DMA memory move will by nature be faster than a corresponding program
|
any DMA memory move will by nature be faster than a corresponding program
|
accomplishing the same move. To put this to numbers, it may take a program
|
accomplishing the same move. To put this to numbers, it may take a program
|
18~clocks per word transferred, whereas this DMA controller can move one
|
18~clocks per word transferred, whereas this DMA controller can move one
|
word in two clocks--provided it has bus access. (The CPU gets priority over the
|
word in two clocks--provided it has bus access. (The CPU gets priority over
|
bus.)
|
the bus.)
|
|
|
When copying memory from one location to another, the DMA controller will
|
When copying memory from one location to another, the DMA controller will
|
copy in units of a given transfer length--up to 1024 words at a time. It will
|
copy in units of a given transfer length--up to 1024 words at a time. It will
|
read that transfer length into its internal buffer, and then write to the
|
read that transfer length into its internal buffer, and then write to the
|
destination address from that buffer. If the CPU interrupts a DMA transfer,
|
destination address from that buffer.
|
it will release the bus, let the CPU complete whatever it needs to do, and then
|
|
restart its transfer by writing the contents of its internal buffer and then
|
|
re-entering its read cycle again.
|
|
|
|
When coupled with a peripheral, the DMA controller can be configured to start
|
When coupled with a peripheral, the DMA controller can be configured to start
|
a memory copy on an interrupt line going high. Further, the controller can be
|
a memory copy when any interrupt line goes high. Further, the controller can
|
configured to issue reads from (or to) the same address instead of incrementing
|
be configured to issue reads from (or to) the same address instead of
|
the address at each clock. The DMA completes once the total number of items
|
incrementing the address at each clock. The DMA completes once the total
|
specified (not the transfer length) have been transferred.
|
number of items specified (not the transfer length) have been transferred.
|
|
|
In each case, once the transfer is complete and the DMA unit returns to
|
In each case, once the transfer is complete and the DMA unit returns to
|
idle, the DMA will issue an interrupt.
|
idle, the DMA will issue an interrupt.
|
|
|
|
|
Line 1498... |
Line 1601... |
\begin{enumerate}
|
\begin{enumerate}
|
\item The Zip System depends upon an external 32-bit Wishbone bus. This
|
\item The Zip System depends upon an external 32-bit Wishbone bus. This
|
must exist, and must be connected to the Zip CPU for it to work.
|
must exist, and must be connected to the Zip CPU for it to work.
|
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}. This is
|
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}. This is
|
the program counter of the first instruction following a reset.
|
the program counter of the first instruction following a reset.
|
|
\item To conserve logic, you'll want to set the {\tt ADDRESS\_WIDTH} parameter
|
|
to the number of address bits on your wishbone bus.
|
|
\item Likewise, the {\tt LGICACHE} parameter sets the number of bits in
|
|
the instruction cache address. This means that the instruction cache
|
|
will have $2^{\mbox{\tiny\tt LGICACHE}}$ locations within it.
|
\item If you want the Zip System to start up on its own, you will need to
|
\item If you want the Zip System to start up on its own, you will need to
|
set the {\tt START\_HALTED} parameter to zero. Otherwise, if you
|
set the {\tt START\_HALTED} parameter to zero. Otherwise, if you
|
wish to manually start the CPU, that is if upon reset you want the
|
wish to manually start the CPU, that is if upon reset you want the
|
	CPU to start in its halted, reset state, then set this parameter to
|
	CPU to start in its halted, reset state, then set this parameter to
|
one.
|
one. This latter configuration is useful for a CPU that should be
|
|
idle (i.e. halted) until given an explicit instruction from somewhere
|
|
else to start.
|
\item The third parameter to set is the number of interrupts you will be
|
\item The third parameter to set is the number of interrupts you will be
|
providing from external to the CPU. This can be anything from one
|
providing from external to the CPU. This can be anything from one
|
to nine, but it cannot be zero. (Wire this line to a 1'b0 if you
|
to sixteen, but it cannot be zero. (Set this to 1 and wire the single
|
do not wish to support any external interrupts.)
|
interrupt line to a 1'b0 if you do not wish to support any external
|
|
interrupts.)
|
\item Finally, you need to place into some wishbone accessible address, whether
|
\item Finally, you need to place into some wishbone accessible address, whether
|
RAM or (more likely) ROM, the initial instructions for the CPU.
|
RAM or (more likely) ROM, the initial instructions for the CPU.
|
\end{enumerate}
|
\end{enumerate}
|
If you have enabled your CPU to start automatically, then upon power up the
|
If you have enabled your CPU to start automatically, then upon power up the
|
CPU will immediately start executing your instructions.
|
CPU will immediately start executing your instructions, starting at the given
|
|
{\tt RESET\_ADDRESS}.
|
|
|
This is, however, not how I have used the Zip CPU. I have instead used the
|
This is, however, not how I have used the Zip CPU. I have instead used the
|
Zip CPU in a more controlled environment. For me, the CPU starts in a
|
Zip CPU in a more controlled environment. For me, the CPU starts in a
|
halted state, and waits to be told to start. Further, the RESET address is a
|
halted state, and waits to be told to start. Further, the RESET address is a
|
location in RAM. After bringing up the board I am using, and further the
|
location in RAM. After bringing up the board I am using, and further the
|
bus that is on it, the RAM memory is then loaded externally with the program
|
bus that is on it, the RAM memory is then loaded externally with the program
|
I wish the Zip System to run. Once the RAM is loaded, I release the CPU.
|
I wish the Zip System to run. Once the RAM is loaded, I release the CPU.
|
The CPU then runs until its halt condition, at which point its task is
|
The CPU then runs until either its halt condition or an exception occurs in
|
complete.
|
supervisor mode, at which point its task is complete.
|
|
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
just not there yet.
|
just not there yet.
|
|
|
The rest of this chapter examines some common programming models, and how they
|
The rest of this chapter examines some common programming models, and how they
|
Line 1649... |
Line 1761... |
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
|
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
|
\> {\tt MOV -3(uSP),R3} \\
|
\> {\tt MOV -3(uSP),R3} \\
|
\> {\tt MOV uR0,R0} \\
|
\> {\tt MOV uR0,R0} \\
|
\> {\tt MOV uCC,R1} \\
|
\> {\tt MOV uCC,R1} \\
|
\> {\tt MOV uPC,R2} \\
|
\> {\tt MOV uPC,R2} \\
|
\> {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\
|
\> {\tt STO R0,(R3)} {\em ; Exploit memory pipelining: }\\
|
\> {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\
|
\> {\tt STO R1,1(R3)} {\em ; All instructions write to stack }\\
|
\> {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\
|
\> {\tt STO R2,2(R3)} {\em ; All offsets increment by one }\\
|
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
|
\> {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
|
\end{tabbing}
|
\end{tabbing}
|
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
|
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
This is much cheaper than the full context swap of a preemptive multitasking
|
This is much cheaper than the full context swap of a preemptive multitasking
|
Line 1692... |
Line 1804... |
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabbing}
|
\begin{tabbing}
|
RESTORE\_PARTIAL\_CONTEXT: \\
|
RESTORE\_PARTIAL\_CONTEXT: \\
|
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
|
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
|
	\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
|
	\> {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
|
	\> {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\
|
	\> {\tt LOD (R3),R0} {\em ; Exploit memory pipelining: }\\
|
	\> {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\
|
	\> {\tt LOD 1(R3),R1} {\em ; All instructions read from stack }\\
|
	\> {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\
|
	\> {\tt LOD 2(R3),R2} {\em ; All offsets increment by one }\\
|
\> {\tt MOV R0,uR0} \\
|
\> {\tt MOV R0,uR0} \\
|
\> {\tt MOV R1,uCC} \\
|
\> {\tt MOV R1,uCC} \\
|
\> {\tt MOV R2,uPC} \\
|
\> {\tt MOV R2,uPC} \\
|
\> {\tt MOV 3(R3),uSP} \\
|
\> {\tt MOV 3(R3),uSP} \\
|
\end{tabbing}
|
\end{tabbing}
|
Line 1754... |
Line 1866... |
This same code can be translated in Zip Assembly as shown in
|
This same code can be translated in Zip Assembly as shown in
|
Tbl.~\ref{tbl:memcp-asm}.
|
Tbl.~\ref{tbl:memcp-asm}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{ll}
|
\begin{tabular}{ll}
|
memcp: \\
|
memcp: \\
|
& {\em ; R0 = *dest, R1 = *src, R2 = LEN} \\
|
& {\em ; R0 = *dest, R1 = *src, R2 = LEN, R3 = return addr} \\
|
& {\em ; The following will operate in 17 clocks per word minus one clock} \\
|
& {\em ; The following will operate in $12N+19$ clocks.} \\
|
& {\tt CMP 0,R2} \\
|
& {\tt CMP 0,R2} \\ % 8 clocks per setup
|
& {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\
|
& {\tt MOV.Z R3,PC} {\em ; A conditional return }\\
|
& {\em ; (One stall on potentially writing to PC)} \\
|
& {\tt SUB 1,SP} {\em ; Create a stack frame}\\
|
& {\tt LOD (R1),R3} \\
|
& {\tt STO R4,(SP)} {\em ; and a local variable}\\
|
|
& {\em ; (4 stalls, cannot be further scheduled away)} \\
|
|
loop: \\ % 12 clocks per loop
|
|
& {\tt LOD (R1),R4} \\
|
& {\em ; (4 stalls, cannot be scheduled away)} \\
|
& {\em ; (4 stalls, cannot be scheduled away)} \\
|
& {\tt STO R3,(R2)} {\em ; (4 schedulable stalls, has no impact now)} \\
|
& {\tt STO R4,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
|
& {\tt ADD 1,R1} \\
|
|
& {\tt SUB 1,R2} \\
|
& {\tt SUB 1,R2} \\
|
& {\tt BNZ loop} \\
|
& {\tt BZ memcpend} \\
|
& {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\
|
& {\tt ADD 1,R0} \\
|
& {\tt RET} \\
|
& {\tt ADD 1,R1} \\
|
|
& {\tt BRA loop} \\
|
|
& {\em ; (1 stall on a BRA instruction)} \\
|
|
memcpend: % 11 clocks
|
|
& {\tt LOD (SP),R4} \\
|
|
& {\em ; (4 stalls, cannot be further scheduled away)} \\
|
|
& {\tt ADD 1,SP} \\
|
|
& {\tt JMP R3} \\
|
|
& {\em ; (4 stalls)} \\
|
\end{tabular}
|
\end{tabular}
|
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
|
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
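Taking the per-section clock counts annotated in the listing at face value
(8~clocks of setup, 12~clocks per pass through the loop, and 11~clocks to
restore the stack frame and return), the quoted total for a copy of $N$ words
follows as
\[ T(N) = 8 + 12N + 11 = 12N + 19 \mbox{ clocks,} \]
so that, for instance, $T(100) = 1219$. These counts are the author's
estimates for this listing, not measurements.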
This example points out several things associated with the Zip CPU. First,
|
This example points out several things associated with the Zip CPU. First,
|
a straightforward implementation of a for loop is not the fastest loop
|
a straightforward implementation of a for loop is not the fastest loop
|
Line 1786... |
Line 1908... |
Fundamental to any multiprocessing system is the ability to switch from one
|
Fundamental to any multiprocessing system is the ability to switch from one
|
task to the next. In the ZipSystem, this is accomplished in one of a couple
|
task to the next. In the ZipSystem, this is accomplished in one of a couple
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
\begin{enumerate}
|
\begin{enumerate}
|
\item Check for a trap instruction. That is, if the user task requested a
|
\item Check for a trap instruction, or other user exception such as a break,
|
trap, we may not wish to adjust the context, check interrupts, or call
|
bus error, division by zero error, or floating point exception. That
|
the scheduler. Tbl.~\ref{tbl:trap-check}
|
is, if the user process needs attending then we may not wish to adjust
|
|
the context, check interrupts, or call the scheduler.
|
|
Tbl.~\ref{tbl:trap-check}
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{ll}
|
\begin{tabular}{ll}
|
{\tt return\_to\_user:} \\
|
{\tt return\_to\_user:} \\
|
& {\em; The instruction before the context switch processing must} \\
|
& {\em; The instruction before the context switch processing must} \\
|
& {\em; be the RTU instruction that enacted user mode in the first} \\
|
& {\em; be the RTU instruction that enacted user mode in the first} \\
|
& {\em; place. We show it here just for reference.} \\
|
& {\em; place. We show it here just for reference.} \\
|
& {\tt RTU} \\
|
& {\tt RTU} \\
|
{\tt trap\_check:} \\
|
{\tt trap\_check:} \\
|
& {\tt MOV uCC,R0} \\
|
& {\tt MOV uCC,R0} \\
|
& {\tt TST \$TRAP,R0} \\
|
& {\tt TST \$TRAP \textbar \$BUSERR \textbar \$DIVE \textbar \$FPE,R0} \\
|
& {\tt BNZ swap\_out} \\
|
& {\tt BNZ swap\_out} \\
|
& {; \em Do something here to execute the trap} \\
|
& {; \em Do something here to execute the trap} \\
|
& {; \em Don't need to call the scheduler, so we can just return} \\
|
& {; \em Don't need to call the scheduler, so we can just return} \\
|
& {\tt BRA return\_to\_user} \\
|
& {\tt BRA return\_to\_user} \\
|
\end{tabular}
|
\end{tabular}
|
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check}
|
\caption{Checking for whether the user task needs our attention}\label{tbl:trap-check}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
shows the rudiments of this code, while showing nothing of how the
|
shows the rudiments of this code, while showing nothing of how the
|
actual trap would be implemented.
|
actual trap would be implemented.
|
|
|
You may also wish to note that the instruction before the first instruction
|
You may also wish to note that the instruction before the first instruction
|
Line 1834... |
Line 1958... |
& {\tt MOV uR0,R0} \\
|
& {\tt MOV uR0,R0} \\
|
& {\tt MOV uR1,R1} \\
|
& {\tt MOV uR1,R1} \\
|
& {\tt MOV uR2,R2} \\
|
& {\tt MOV uR2,R2} \\
|
& {\tt MOV uR3,R3} \\
|
& {\tt MOV uR3,R3} \\
|
& {\tt MOV uR4,R4} \\
|
& {\tt MOV uR4,R4} \\
|
& {\tt STO R0,1(R5)} {\em ; Exploit memory pipelining: }\\
|
& {\tt STO R0,(R5)} {\em ; Exploit memory pipelining: }\\
|
& {\tt STO R1,2(R5)} {\em ; All instructions write to stack }\\
|
& {\tt STO R1,1(R5)} {\em ; All instructions write to stack }\\
|
& {\tt STO R2,3(R5)} {\em ; All offsets increment by one }\\
|
& {\tt STO R2,2(R5)} {\em ; All offsets increment by one }\\
|
& {\tt STO R3,4(R5)} {\em ; Longest pipeline is 5 cycles.}\\
|
& {\tt STO R3,3(R5)} {\em ; Longest pipeline is 5 cycles.}\\
|
& {\tt STO R4,5(R5)} \\
|
& {\tt STO R4,4(R5)} \\
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
\iffalse
|
\iffalse
|
& {\tt MOV uR5,R0} \\
|
& {\tt MOV uR5,R0} \\
|
& {\tt MOV uR6,R1} \\
|
& {\tt MOV uR6,R1} \\
|
& {\tt MOV uR7,R2} \\
|
& {\tt MOV uR7,R2} \\
|
& {\tt MOV uR8,R3} \\
|
& {\tt MOV uR8,R3} \\
|
& {\tt MOV uR9,R4} \\
|
& {\tt MOV uR9,R4} \\
|
& {\tt STO R0,6(R5) }\\
|
& {\tt STO R0,5(R5) }\\
|
& {\tt STO R1,7(R5) }\\
|
& {\tt STO R1,6(R5) }\\
|
& {\tt STO R2,8(R5) }\\
|
& {\tt STO R2,7(R5) }\\
|
& {\tt STO R3,9(R5) }\\
|
& {\tt STO R3,8(R5) }\\
|
& {\tt STO R4,10(R5)} \\
|
& {\tt STO R4,9(R5)} \\
|
\fi
|
\fi
|
& {\tt MOV uR10,R0} \\
|
& {\tt MOV uR10,R0} \\
|
& {\tt MOV uR11,R1} \\
|
& {\tt MOV uR11,R1} \\
|
& {\tt MOV uR12,R2} \\
|
& {\tt MOV uR12,R2} \\
|
& {\tt MOV uCC,R3} \\
|
& {\tt MOV uCC,R3} \\
|
& {\tt MOV uPC,R4} \\
|
& {\tt MOV uPC,R4} \\
|
& {\tt STO R0,11(R5)}\\
|
& {\tt STO R0,10(R5)}\\
|
& {\tt STO R1,12(R5)}\\
|
& {\tt STO R1,11(R5)}\\
|
& {\tt STO R2,13(R5)}\\
|
& {\tt STO R2,12(R5)}\\
|
& {\tt STO R3,14(R5)}\\
|
& {\tt STO R3,13(R5)}\\
|
& {\tt STO R4,15(R5)} \\
|
& {\tt STO R4,14(R5)} \\
|
& {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
|
& {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
|
& {\em ; elsewhere (in the task structure) }\\
|
& {\em ; elsewhere (in the task structure) }\\
|
\end{tabular}
|
\end{tabular}
|
\caption{Example Storing User Task Context}\label{tbl:context-out}
|
\caption{Example Storing User Task Context}\label{tbl:context-out}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
Line 1961... |
Line 2085... |
\begin{tabular}{ll}
|
\begin{tabular}{ll}
|
{\tt swap\_in:} \\
|
{\tt swap\_in:} \\
|
& {\tt LOD stack(R12),R5} \\
|
& {\tt LOD stack(R12),R5} \\
|
& {\tt MOV 15(R1),uSP} \\
|
& {\tt MOV 15(R1),uSP} \\
|
& {\em ; Be sure to exploit the memory pipelining capability} \\
|
& {\em ; Be sure to exploit the memory pipelining capability} \\
|
& {\tt LOD 1(R5),R0} \\
|
& {\tt LOD (R5),R0} \\
|
& {\tt LOD 2(R5),R1} \\
|
& {\tt LOD 1(R5),R1} \\
|
& {\tt LOD 3(R5),R2} \\
|
& {\tt LOD 2(R5),R2} \\
|
& {\tt LOD 4(R5),R3} \\
|
& {\tt LOD 3(R5),R3} \\
|
& {\tt LOD 5(R5),R4} \\
|
& {\tt LOD 4(R5),R4} \\
|
& {\tt MOV R0,uR0} \\
|
& {\tt MOV R0,uR0} \\
|
& {\tt MOV R1,uR1} \\
|
& {\tt MOV R1,uR1} \\
|
& {\tt MOV R2,uR2} \\
|
& {\tt MOV R2,uR2} \\
|
& {\tt MOV R3,uR3} \\
|
& {\tt MOV R3,uR3} \\
|
& {\tt MOV R4,uR4} \\
|
& {\tt MOV R4,uR4} \\
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
& {\tt LOD 11(R5),R0} \\
|
& {\tt LOD 10(R5),R0} \\
|
& {\tt LOD 12(R5),R1} \\
|
& {\tt LOD 11(R5),R1} \\
|
& {\tt LOD 13(R5),R2} \\
|
& {\tt LOD 12(R5),R2} \\
|
& {\tt LOD 14(R5),R3} \\
|
& {\tt LOD 13(R5),R3} \\
|
& {\tt LOD 15(R5),R4} \\
|
& {\tt LOD 14(R5),R4} \\
|
& {\tt MOV R0,uR10} \\
|
& {\tt MOV R0,uR10} \\
|
& {\tt MOV R1,uR11} \\
|
& {\tt MOV R1,uR11} \\
|
& {\tt MOV R2,uR12} \\
|
& {\tt MOV R2,uR12} \\
|
& {\tt MOV R3,uCC} \\
|
& {\tt MOV R3,uCC} \\
|
& {\tt MOV R4,uPC} \\
|
& {\tt MOV R4,uPC} \\
|
Line 2014... |
Line 2138... |
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
|
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
|
\begin{table}[htbp]
|
\begin{table}[htbp]
|
\begin{center}\begin{reglist}
|
\begin{center}\begin{reglist}
|
PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
|
PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
|
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
|
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
|
& \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline
|
& \scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus error \\\hline
|
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
|
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
|
TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
|
TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
|
TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
|
TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
|
TMRC & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
|
TMRC & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
|
JIFF & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
|
JIFF & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
|
Line 2051... |
Line 2175... |
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
|
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
|
CPU's address space. These may be accessed by the CPU at these addresses,
|
CPU's address space. These may be accessed by the CPU at these addresses,
|
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
|
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
|
These registers will be discussed briefly again here.
|
These registers will be discussed briefly again here.
|
|
|
|
\subsection{Interrupt Controller(s)}
|
The Zip CPU Interrupt controller has four different types of bits, as shown in
|
The Zip CPU Interrupt controller has four different types of bits, as shown in
|
Tbl.~\ref{tbl:picbits}.
|
Tbl.~\ref{tbl:picbits}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
31 & R/W & Master Interrupt Enable\\\hline
|
31 & R/W & Master Interrupt Enable\\\hline
|
30\ldots 16 & R/W & Interrupt Enables, write '1' to change\\\hline
|
30\ldots 16 & R/W & Interrupt Enables, write `1' to change\\\hline
|
15 & R & Current Master Interrupt State\\\hline
|
15 & R & Current Master Interrupt State\\\hline
|
15\ldots 0 & R/W & Input Interrupt states, write '1' to clear\\\hline
|
15\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline
|
\end{bitlist}
|
\end{bitlist}
|
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
|
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
The high order bit, or bit--31, is the master interrupt enable bit. When this
|
The high order bit, or bit--31, is the master interrupt enable bit. When this
|
bit is set, then any time an interrupt occurs the CPU will be interrupted and
|
bit is set, then any time an interrupt occurs the CPU will be interrupted and
|
will switch to supervisor mode, etc.
|
will switch to supervisor mode, etc.
|
|
|
Bits 30~\ldots 16 are interrupt enable bits. Should the interrupt line go
|
Bits 30~\ldots 16 are interrupt enable bits. Should the interrupt line go
|
high while enabled, an interrupt will be generated. To set an interrupt enable
|
high while enabled, an interrupt will be generated. (All interrupts are positive
|
bit, one needs to write the master interrupt enable while writing a `1' to this
|
edge triggered.) To set an interrupt enable bit, one needs to write the
|
bit. To clear, one need only write a `0' to the master interrupt enable,
|
master interrupt enable while writing a `1' to this bit. To clear, one
|
while leaving this line high.
|
need only write a `0' to the master interrupt enable, while leaving this line
|
|
high.
|
|
|
Bits 15\ldots 0 are the current state of the interrupt vector. Interrupt lines
|
Bits 15\ldots 0 are the current state of the interrupt vector. Interrupt lines
|
trip when they go high, and remain tripped until they are acknowledged. If
|
trip when they go high, and remain tripped until they are acknowledged. If
|
the interrupt goes high for longer than one pulse, it may be high when a clear
|
the interrupt goes high for longer than one pulse, it may be high when a clear
|
is requested. If so, the interrupt will not clear. The line must go low
|
is requested. If so, the interrupt will not clear. The line must go low
|
Line 2089... |
Line 2215... |
\item When an interrupt occurs, the supervisor will switch to the interrupt
|
\item When an interrupt occurs, the supervisor will switch to the interrupt
|
state. It will then cycle through the interrupt bits to learn which
|
state. It will then cycle through the interrupt bits to learn which
|
interrupt handler to call.
|
interrupt handler to call.
|
\item If the interrupt handler expects more interrupts, it will clear its
|
\item If the interrupt handler expects more interrupts, it will clear its
|
current interrupt when it is done handling the interrupt in question.
|
current interrupt when it is done handling the interrupt in question.
|
To do this, it will write a '1' to the low order interrupt mask,
|
To do this, it will write a `1' to the low order interrupt mask,
|
such as writing a {\tt 32'h80000001}.
|
such as writing a {\tt 32'h0000\_0001}.
|
\item If the interrupt handler does not expect any more interrupts, it will
|
\item If the interrupt handler does not expect any more interrupts, it will
|
instead clear the interrupt from the controller by writing a
|
instead clear the interrupt from the controller by writing a
|
{\tt 32'h00010001} to the controller.
|
{\tt 32'h0001\_0001} to the controller.
|
\item Once all interrupts have been handled, the supervisor will write a
|
\item Once all interrupts have been handled, the supervisor will write a
|
{\tt 32'h80000000} to the interrupt register to re-enable interrupt
|
{\tt 32'h8000\_0000} to the interrupt register to re-enable interrupt
|
generation.
|
generation.
|
\item The supervisor should also check the user trap bit, and possible soft
|
\item The supervisor should also check the user trap bit, and possible soft
|
interrupt bits here, but this action has nothing to do with the
|
interrupt bits here, but this action has nothing to do with the
|
interrupt control register.
|
interrupt control register.
|
\item The supervisor will then leave interrupt mode, possibly adjusting
|
\item The supervisor will then leave interrupt mode, possibly adjusting
|
whichever task is running, by executing a return from interrupt
|
whichever task is running, by executing a return from interrupt
|
command.
|
command.
|
\end{enumerate}
|
\end{enumerate}
|
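The write values in the list above can be composed from the bit positions of
Tbl.~\ref{tbl:picbits}. The C fragment below is a sketch under stated
assumptions rather than a definitive driver: the function names are
hypothetical, and it assumes that interrupt line $n$, with $0\le n\le 14$,
uses state bit $n$ and enable bit $16+n$.

```c
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u  /* bit 31: master interrupt enable */

/* Hypothetical helpers; n must be in the range 0..14. */

/* Acknowledge (clear) a tripped interrupt: write '1' to state bit n. */
uint32_t pic_ack(unsigned n)
{
    return 1u << n;
}

/* Enable line n: write bit 31 together with a '1' in enable bit 16+n. */
uint32_t pic_enable(unsigned n)
{
    return PIC_MASTER_ENABLE | (1u << (16 + n));
}

/* Disable line n: write a '1' in enable bit 16+n with bit 31 left at '0'. */
uint32_t pic_disable(unsigned n)
{
    return 1u << (16 + n);
}
```

With these definitions, {\tt pic\_disable(0) | pic\_ack(0)} yields the
{\tt 32'h0001\_0001} value used above to retire interrupt zero, and
{\tt PIC\_MASTER\_ENABLE} alone is the {\tt 32'h8000\_0000} written to
re-enable interrupt generation.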
|
|
|
\subsection{Timer Register}
|
|
|
Leaving the interrupt controller, we show the timer registers bit definitions
|
Leaving the interrupt controller, we show the timer registers bit definitions
|
in Tbl.~\ref{tbl:tmrbits}.
|
in Tbl.~\ref{tbl:tmrbits}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
31 & R/W & Auto-Reload\\\hline
|
31 & R/W & Auto-Reload\\\hline
|
Line 2123... |
Line 2251... |
In this mode, upon setting its interrupt bit for one cycle, the timer will
|
In this mode, upon setting its interrupt bit for one cycle, the timer will
|
also reset itself back to the value of the timer that was written to it when
|
also reset itself back to the value of the timer that was written to it when
|
the auto--reload option was written to it. To clear and stop the timer,
|
the auto--reload option was written to it. To clear and stop the timer,
|
just simply write a `32'h00' to this register.
|
just simply write a `32'h00' to this register.
|
|
|
|
\subsection{Jiffies}
|
|
|
The Jiffies register is somewhat similar in that the register always changes.
|
The Jiffies register is somewhat similar in that the register always changes.
|
In this case, the register counts up, whereas the timer always counted down.
|
In this case, the register counts up, whereas the timer always counted down.
|
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
|
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
Line 2140... |
Line 2270... |
greater than zero while ignoring truncation, will set a new Jiffy interrupt
|
greater than zero while ignoring truncation, will set a new Jiffy interrupt
|
time. At that time, the Jiffy vector will clear, and another interrupt time
|
time. At that time, the Jiffy vector will clear, and another interrupt time
|
may either be written to it, or it will just continue counting without
|
may either be written to it, or it will just continue counting without
|
activating any more interrupts.
|
activating any more interrupts.
|
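The phrase ``greater than zero while ignoring truncation'' describes a
comparison that must survive the counter wrapping around. A minimal sketch
of such a test follows, assuming the quantity compared is the written
interrupt time minus the current count, taken modulo $2^{32}$. The function
name is hypothetical and the fragment illustrates the arithmetic only, not
the peripheral's logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when `when` lies ahead of `now` on a 32-bit counter that
 * wraps. Assumption: "greater than zero ignoring truncation" means the
 * wrapped difference (when - now) is nonzero and below 2^31. */
bool jiffies_is_future(uint32_t now, uint32_t when)
{
    uint32_t delta = when - now;   /* unsigned subtraction wraps mod 2^32 */
    return delta != 0 && delta < 0x80000000u;
}
```

Under this reading, a time of {\tt 0x10} counts as being in the future when
the counter stands at {\tt 0xfffffff0}, even though it is numerically
smaller.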
|
|
|
\subsection{Performance Counters}
|
|
|
The Zip CPU also supports several counter peripherals, mostly in the way of
|
The Zip CPU also supports several counter peripherals, mostly in the way of
|
process accounting. These peripherals have a single register associated with
|
process accounting. These peripherals have a single register associated with
|
them, shown in Tbl.~\ref{tbl:ctrbits}.
|
them, shown in Tbl.~\ref{tbl:ctrbits}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
Line 2165... |
Line 2297... |
It is envisioned that these counters will be used as follows: First, every time
|
It is envisioned that these counters will be used as follows: First, every time
|
a master counter rolls over, the supervisor (Operating System) will record
|
a master counter rolls over, the supervisor (Operating System) will record
|
the fact. Second, whenever activating a user task, the Operating System will
|
the fact. Second, whenever activating a user task, the Operating System will
|
set the four user counters to zero. When the user task has completed, the
|
set the four user counters to zero. When the user task has completed, the
|
Operating System will read the timers back off, to determine how much of the
|
Operating System will read the timers back off, to determine how much of the
|
CPU the process had consumed.
|
CPU the process had consumed. To keep this accurate, the user counters will
|
|
only increment when the GIE bit is set to indicate that the processor is
|
|
in user mode.
|
|
|
|
\subsection{DMA Controller}
|
|
|
The final peripheral to discuss is the DMA controller. This controller
|
The final peripheral to discuss is the DMA controller. This controller
|
has four registers. Of these four, the length, source and destination address
|
has four registers. Of these four, the length, source and destination address
|
registers should need no further explanation. They are full 32--bit registers
|
registers should need no further explanation. They are full 32--bit registers
|
specifying the entire transfer length, the starting address to read from, and
|
specifying the entire transfer length, the starting address to read from, and
|
Line 2181... |
Line 2317... |
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
31 & R & DMA Active\\\hline
|
31 & R & DMA Active\\\hline
|
30 & R & Wishbone error, transaction aborted. This bit is cleared the next time
|
30 & R & Wishbone error, transaction aborted. This bit is cleared the next time
|
this register is written to.\\\hline
|
this register is written to.\\\hline
|
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline
|
29 & R/W & Set to `1' to prevent the controller from incrementing the source address, `0' for normal memory copy. \\\hline
|
28 & R/W & Set to '1' to prevent the controller from incrementing the
|
28 & R/W & Set to `1' to prevent the controller from incrementing the
|
destination address, '0' for normal memory copy. \\\hline
|
destination address, `0' for normal memory copy. \\\hline
|
27 \ldots 16 & W & The DMA Key. Write a 12'hfed to these bits to
|
27 \ldots 16 & W & The DMA Key. Write a 12'hfed to these bits to
|
activate any DMA transfer. \\\hline
|
activate any DMA transfer. \\\hline
|
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline
|
27 & R & Always reads `0', to force the deliberate writing of the key. \\\hline
|
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
|
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
|
have yet to be written. \\\hline
|
have yet to be written. \\\hline
|
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately
|
15 & R/W & Set to `1' to trigger on an interrupt, or `0' to start immediately
|
upon receiving a valid key.\\\hline
|
upon receiving a valid key.\\\hline
|
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
|
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
|
9\ldots 0 & R/W & Intermediate transfer length minus one. Thus, to transfer
|
9\ldots 0 & R/W & Intermediate transfer length minus one. Thus, to transfer
|
one item at a time set this value to 0. To transfer 1024 at a time,
|
one item at a time set this value to 0. To transfer 1024 at a time,
|
	set it to 1023.\\\hline
|
	set it to 1023.\\\hline
|
Line 2219... |
Line 2355... |
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
|
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
|
Access to the Zip System begins with the Debug Control register, shown in
|
Access to the Zip System begins with the Debug Control register, shown in
|
Tbl.~\ref{tbl:dbgctrl}.
|
Tbl.~\ref{tbl:dbgctrl}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{bitlist}
|
\begin{bitlist}
|
31\ldots 14 & R & Reserved\\\hline
|
31\ldots 14 & R & External interrupt state. Bit 14 is valid for one
|
|
interrupt only, bit 15 for two, etc.\\\hline
|
13 & R & CPU GIE setting\\\hline
|
13 & R & CPU GIE setting\\\hline
|
12 & R & CPU is sleeping\\\hline
|
12 & R & CPU is sleeping\\\hline
|
11 & W & Command clear PF cache\\\hline
|
11 & W & Command clear PF cache\\\hline
|
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline
|
10 & R/W & Command HALT, Set to `1' to halt the CPU\\\hline
|
9 & R & Stall Status, '1' if CPU is busy\\\hline
|
9 & R & Stall Status, `1' if CPU is busy (i.e., not halted yet)\\\hline
|
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline
|
8 & R/W & Step Command, set to `1' to step the CPU, also sets the halt bit\\\hline
|
7 & R & Interrupt Request \\\hline
|
7 & R & Interrupt Request Pending\\\hline
|
6 & R/W & Command RESET \\\hline
|
6 & R/W & Command RESET \\\hline
|
5\ldots 0 & R/W & Debug Register Address \\\hline
|
5\ldots 0 & R/W & Debug Register Address \\\hline
|
\end{bitlist}
|
\end{bitlist}
|
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
|
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
|
The first step in debugging access is to determine whether or not the CPU
|
The first step in debugging access is to determine whether or not the CPU
|
is halted, and to halt it if not. To do this, first write a '1' to the
|
is halted, and to halt it if not. To do this, first write a `1' to the
|
Command HALT bit. This will halt the CPU and place it into debug mode.
|
Command HALT bit. This will halt the CPU and place it into debug mode.
|
Once the CPU is halted, the stall status bit will drop to zero. Thus,
|
Once the CPU is halted, the stall status bit will drop to zero. Thus,
|
if bit 10 is high and bit 9 low, the debug port is open to examine the
|
if bit 10 is high and bit 9 low, the debug port is open to examine the
|
internal state of the CPU.
|
internal state of the CPU.
|
|
|
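The halt sequence described above can be sketched in C. This is only an illustration under stated assumptions: the bit positions come from Tbl.~\ref{tbl:dbgctrl}, but the accessor callbacks, the function names, and the polling limit are hypothetical and are not part of the Zip CPU distribution. The sketch also writes the HALT bit with a debug register address of zero, which is an assumption about what a debugger would want at this point.

```c
#include <stdint.h>

/* Bit positions taken from the Debug Control register table. */
#define DBG_ADDR_MASK 0x0000003fu /* bits 5..0: debug register address    */
#define DBG_RESET     (1u << 6)   /* Command RESET                        */
#define DBG_STEP      (1u << 8)   /* Step Command (also sets the halt bit) */
#define DBG_STALL     (1u << 9)   /* '1' while the CPU is not yet halted  */
#define DBG_HALT      (1u << 10)  /* Command HALT                         */

/* Hypothetical accessors for the debug control register; how they reach
 * the debug port (UART bridge, JTAG, simulation) is left to the caller. */
typedef uint32_t (*dbg_read_fn)(void *ctx);
typedef void (*dbg_write_fn)(void *ctx, uint32_t value);

/* The debug port is open when bit 10 is high and bit 9 is low. */
int dbg_is_open(uint32_t ctrl)
{
    return (ctrl & DBG_HALT) != 0 && (ctrl & DBG_STALL) == 0;
}

/* Halt the CPU if it is not already halted, then poll until the stall
 * bit drops.  Returns 0 on success, -1 if max_polls reads were not
 * enough (the limit is an assumption of this sketch). */
int dbg_halt(void *ctx, dbg_read_fn rd, dbg_write_fn wr, unsigned max_polls)
{
    unsigned i;

    if (dbg_is_open(rd(ctx)))
        return 0;               /* already halted */
    wr(ctx, DBG_HALT);          /* write a '1' to the Command HALT bit */
    for (i = 0; i < max_polls; i++) {
        if (dbg_is_open(rd(ctx)))
            return 0;
    }
    return -1;
}
```

A debugger would call the hypothetical {\tt dbg\_halt()} once before touching any CPU register, and treat a failure as a sign that the CPU is stuck, for example on a bus cycle that never completes.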
Line 2259... |
Line 2396... |
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
|
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
|
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
|
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
|
uPC & 31 & 32 & R/W & User Program Counter\\\hline
|
uPC & 31 & 32 & R/W & User Program Counter\\\hline
|
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
|
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
|
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
|
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
|
|
BUS & 34 & 32 & R & Last Bus Error\\\hline
|
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
|
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
|
TMRA & 36 & 32 & R/W & Timer A\\\hline
|
TMRA & 36 & 32 & R/W & Timer A\\\hline
|
TMRB & 37 & 32 & R/W & Timer B\\\hline
|
TMRB & 37 & 32 & R/W & Timer B\\\hline
|
TMRC & 38 & 32 & R/W & Timer C\\\hline
|
TMRC & 38 & 32 & R/W & Timer C\\\hline
|
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
|
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
|
Line 2312... |
Line 2450... |
Address Width & 1--bit \\\hline
|
Address Width & 1--bit \\\hline
|
Port size & 32--bit \\\hline
|
Port size & 32--bit \\\hline
|
Port granularity & 32--bit \\\hline
|
Port granularity & 32--bit \\\hline
|
Maximum Operand Size & 32--bit \\\hline
|
Maximum Operand Size & 32--bit \\\hline
|
Data transfer ordering & (Irrelevant) \\\hline
|
Data transfer ordering & (Irrelevant) \\\hline
|
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
|
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
|
|
XuLA2--LX25\\\hline
|
Signal Names & \begin{tabular}{ll}
|
Signal Names & \begin{tabular}{ll}
|
Signal Name & Wishbone Equivalent \\\hline
|
Signal Name & Wishbone Equivalent \\\hline
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
{\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
|
{\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
|
{\tt i\_dbg\_stb} & {\tt STB\_I} \\
|
{\tt i\_dbg\_stb} & {\tt STB\_I} \\
|
Line 2334... |
Line 2473... |
\begin{table}[htbp]
|
\begin{table}[htbp]
|
\begin{center}
|
\begin{center}
|
\begin{wishboneds}
|
\begin{wishboneds}
|
Revision level of wishbone & WB B4 spec \\\hline
|
Revision level of wishbone & WB B4 spec \\\hline
|
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
|
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
|
Address Width & 32~bits \\\hline
|
Address Width & (Zip System parameter, can be up to 32~bits) \\\hline
|
Port size & 32--bit \\\hline
|
Port size & 32--bit \\\hline
|
Port granularity & 32--bit \\\hline
|
Port granularity & 32--bit \\\hline
|
Maximum Operand Size & 32--bit \\\hline
|
Maximum Operand Size & 32--bit \\\hline
|
Data transfer ordering & (Irrelevant) \\\hline
|
Data transfer ordering & (Irrelevant) \\\hline
|
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
|
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
|
|
XuLA2--LX25\\\hline
|
Signal Names & \begin{tabular}{ll}
|
Signal Names & \begin{tabular}{ll}
|
Signal Name & Wishbone Equivalent \\\hline
|
Signal Name & Wishbone Equivalent \\\hline
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
{\tt o\_wb\_cyc} & {\tt CYC\_O} \\
|
{\tt o\_wb\_cyc} & {\tt CYC\_O} \\
|
{\tt o\_wb\_stb} & {\tt STB\_O} \\
|
{\tt o\_wb\_stb} & {\tt STB\_O} \\
|
{\tt o\_wb\_we} & {\tt WE\_O} \\
|
{\tt o\_wb\_we} & {\tt WE\_O} \\
|
{\tt o\_wb\_addr} & {\tt ADR\_O} \\
|
{\tt o\_wb\_addr} & {\tt ADR\_O} \\
|
{\tt o\_wb\_data} & {\tt DAT\_O} \\
|
{\tt o\_wb\_data} & {\tt DAT\_O} \\
|
{\tt i\_wb\_ack} & {\tt ACK\_I} \\
|
{\tt i\_wb\_ack} & {\tt ACK\_I} \\
|
{\tt i\_wb\_stall} & {\tt STALL\_I} \\
|
{\tt i\_wb\_stall} & {\tt STALL\_I} \\
|
{\tt i\_wb\_data} & {\tt DAT\_I}
|
{\tt i\_wb\_data} & {\tt DAT\_I} \\
|
|
{\tt i\_wb\_err} & {\tt ERR\_I}
|
\end{tabular}\\\hline
|
\end{tabular}\\\hline
|
\end{wishboneds}
|
\end{wishboneds}
|
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
|
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
I do not recommend that you connect these together through the interconnect.
|
I do not recommend that you connect these together through the interconnect.
|
Rather, the debug port of the CPU should be accessible regardless of the state
|
Rather, the debug port of the CPU should be accessible regardless of the state
|
of the master bus.
|
of the master bus.
|
|
|
You may wish to notice that neither the {\tt ERR} nor the {\tt RETRY} wires
|
You may wish to notice that neither the {\tt LOCK} nor the {\tt RTY} (retry)
|
have been implemented. What this means is that the CPU is currently unable
|
wires have been connected to the CPU's master interface. If necessary, a
|
to detect a bus error condition, and so may stall indefinitely (hang) should
|
rudimentary {\tt LOCK} may be created by tying the wire to the {\tt wb\_cyc}
|
it choose to access a value not on the bus, or a peripheral that is not
|
line. As for the {\tt RTY}, all the CPU recognizes at this point are bus
|
yet properly configured.
|
errors---it cannot tell the difference between a temporary and a permanent bus
|
|
error.
|
|
|
\chapter{Clocks}\label{chap:clocks}
|
\chapter{Clocks}\label{chap:clocks}
|
|
|
This core is based upon the Basys--3 development board sold by Digilent.
|
This core is based upon the Basys--3 development board sold by Digilent.
|
The Basys--3 development board contains one external 100~MHz clock, which is
|
The Basys--3 development board contains one external 100~MHz clock, which is
|
Line 2381... |
Line 2523... |
\end{center}\end{table}
|
\end{center}\end{table}
|
I hesitate to suggest that the core can run faster than 100~MHz, since I have
|
I hesitate to suggest that the core can run faster than 100~MHz, since I have
|
struggled with various timing violations to keep it at 100~MHz. So, for
|
struggled with various timing violations to keep it at 100~MHz. So, for
|
now, I will only state that it can run at 100~MHz.
|
now, I will only state that it can run at 100~MHz.
|
|
|
|
On a SPARTAN 6, the clock can run successfully at 80~MHz.
|
|
|
\chapter{I/O Ports}\label{chap:ioports}
|
\chapter{I/O Ports}\label{chap:ioports}
|
The I/O ports to the Zip CPU may be grouped into three categories. The first
|
The I/O ports to the Zip CPU may be grouped into three categories. The first
|
is that of the master wishbone used by the CPU, then the slave wishbone used
|
is that of the master wishbone used by the CPU, then the slave wishbone used
|
to command the CPU via a debugger, and then the rest. The first two of these
|
to command the CPU via a debugger, and then the rest. The first two of these
|
Line 2398... |
Line 2541... |
{\tt o\_wb\_addr} & 32 & Output & Bus address \\\hline
|
{\tt o\_wb\_addr} & 32 & Output & Bus address \\\hline
|
{\tt o\_wb\_data} & 32 & Output & Data on WB write\\\hline
|
{\tt o\_wb\_data} & 32 & Output & Data on WB write\\\hline
|
{\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline
|
{\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline
|
{\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline
|
{\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline
|
{\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline
|
{\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline
|
|
{\tt i\_wb\_err} & 1 & Input & Bus Error indication\\\hline
|
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
|
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
|
and~\ref{tbl:iowb-slave} respectively.
|
and~\ref{tbl:iowb-slave} respectively.
|
\begin{table}
|
\begin{table}
|
\begin{center}\begin{portlist}
|
\begin{center}\begin{portlist}
|
{\tt i\_wb\_cyc} & 1 & Input & Indicates an active Wishbone cycle\\\hline
|
{\tt i\_wb\_cyc} & 1 & Input & Indicates an active Wishbone cycle\\\hline
|
Line 2419... |
Line 2563... |
line. These are shown in Tbl.~\ref{tbl:ioports}.
|
line. These are shown in Tbl.~\ref{tbl:ioports}.
|
\begin{table}
|
\begin{table}
|
\begin{center}\begin{portlist}
|
\begin{center}\begin{portlist}
|
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
|
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
|
{\tt i\_rst} & 1 & Input & Active high reset line \\\hline
|
{\tt i\_rst} & 1 & Input & Active high reset line \\\hline
|
{\tt i\_ext\_int} & 1\ldots 6 & Input & Incoming external interrupts \\\hline
|
{\tt i\_ext\_int} & 1\ldots 16 & Input & Incoming external interrupts, actual
|
|
value set by implementation parameter \\\hline
|
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
|
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
|
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
|
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
|
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}. We
|
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}. We
|
typically run it at 100~MHz. The reset line is an active high reset. When
|
typically run it at 100~MHz, although we've needed to slow it down to 80~MHz
|
|
for some implementations. The reset line is an active high reset. When
|
asserted, the CPU will start running again from its reset address in
|
asserted, the CPU will start running again from its reset address in
|
memory. Further, depending upon how the CPU is configured and specifically on
|
memory. Further, depending upon how the CPU is configured and specifically
|
the {\tt START\_HALTED} parameter, it may or may not start running
|
based upon how the {\tt START\_HALTED} parameter is set, the CPU may or may
|
automatically. The {\tt i\_ext\_int} line is for an external interrupt. This
|
not start running automatically following a reset. The {\tt i\_ext\_int}
|
line may be as wide as 6~external interrupts, depending upon the setting of
|
line is for an external interrupt. This line may actually be as wide as
|
the {\tt EXTERNAL\_INTERRUPTS} line. As currently configured, the ZipSystem
|
16~external interrupts, depending upon the setting of
|
only supports one such interrupt line by default. For us, this line is the
|
the {\tt EXTERNAL\_INTERRUPTS} parameter. Finally, the Zip System produces one
|
output of another interrupt controller, but that's a board specific setup
|
external interrupt whenever the entire CPU halts to wait for the debugger.
|
detail. Finally, the Zip System produces one external interrupt whenever
|
|
the CPU halts to wait for the debugger.
|
|
|
|
\chapter{Initial Assessment}\label{chap:assessment}
|
\chapter{Initial Assessment}\label{chap:assessment}
|
|
|
Having now worked with the Zip CPU for a while, it is worth offering an
|
Having now worked with the Zip CPU for a while, it is worth offering an
|
honest assessment of how well it works and how well it was designed. At the
|
honest assessment of how well it works and how well it was designed. At the
|
end of this assessment, I will propose some changes that may take place in a
|
end of this assessment, I will propose some changes that may take place in a
|
later version of this Zip CPU to make it better.
|
later version of this Zip CPU to make it better.
|
|
|
\section{The Good}
|
\section{The Good}
|
\begin{itemize}
|
\begin{itemize}
|
\item The Zip CPU is light weight and fully featured as it exists today. For
|
\item The Zip CPU can be configured to be relatively light weight and fully
|
anyone who wishes to build a general purpose CPU and then to
|
featured as it exists today. For anyone who wishes to build a general
|
experiment with building and adding particular features, the Zip CPU
|
purpose CPU and then to experiment with building and adding particular
|
makes a good starting point--it is fairly simple. Modifications should
|
features, the Zip CPU makes a good starting point--it is fairly simple.
|
be simple enough.
|
Modifications should be simple enough. Indeed, a non--pipelined
|
|
version of the bare ZipBones (with no peripherals) has been built that
|
|
only uses 1.1k~LUTs. When using pipelining, the full cache, and all
|
|
of the peripherals, the ZipSystem can top 5k~LUTs. Where it fits
|
|
in between is a function of your needs.
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
fits this role rather nicely. It does not fit the role of a system on
|
fits this role rather nicely. It does not fit the role of a system on
|
a chip very well, but then it was never intended to be a system on a
|
a chip very well, but then it was never intended to be a system on a
|
chip but rather a system within a chip.
|
chip but rather a system within a chip.
|
Line 2486... |
Line 2634... |
cycle per access again.
|
cycle per access again.
|
\end{itemize}
|
\end{itemize}
|
|
|
\section{The Not so Good}
|
\section{The Not so Good}
|
\begin{itemize}
|
\begin{itemize}
|
\item While one of the stated goals was to use a small amount of logic,
|
|
3k~LUTs isn't that impressively small. Indeed, it's really much
|
|
too expensive when compared against other 8 and 16-bit CPUs that have
|
|
less than 1k LUTs.
|
|
|
|
Still, \ldots it's not bad, it's just not astonishingly good.
|
|
|
|
\item The fact that the instruction width equals the bus width means that the
|
|
instruction fetch cycle will always be interfering with any load or
|
|
store memory operation, with the only exception being if the
|
|
instruction is already in the cache.
|
|
|
|
This could be fixed in one of three ways: the instruction set
|
|
architecture could be modified to handle Very Long Instruction Words
|
|
(VLIW) so that each 32--bit word would encode two or more instructions,
|
|
the instruction fetch bus width could be increased from 32--bits to
|
|
64--bits or more, or the instruction bus could be separated from the
|
|
data bus. Any and all of these approaches would increase the overall
|
|
LUT count.
|
|
|
|
\item The (non-existent) floating point unit was an afterthought, isn't even
|

built as a potential option, and most likely won't support the full
|

IEEE standard set of FPU instructions--even for single precision.
|

This (non-existent) capability would benefit the most from an
|
|
out-of-order execution capability, which the Zip CPU does not have.
|
|
|
|
Still, sharing FPU registers with the main register set was a good
|
|
idea and worth preserving, as it simplifies context swapping.
|
|
|
|
Perhaps this really isn't a problem, but rather a feature. By not
|
|
implementing FPU instructions, the Zip CPU maintains a lower LUT count
|
|
than it would have if it did implement these instructions.
|
|
|
|
\item The CPU has no character support. This is both good and bad.
|
\item The CPU has no character support. This is both good and bad.
|
Realistically, the CPU works just fine without it. Characters can be
|
Realistically, the CPU works just fine without it. Characters can be
|
supported as subsets of 32-bit words without any problem. Practically,
|
supported as subsets of 32-bit words without any problem. Practically,
|
though, it will make compiling non-Zip CPU code difficult--especially
|
though, it will make compiling non-Zip CPU code difficult--especially
|
anything that assumes sizeof(int)=4*sizeof(char), or that tries to
|
anything that assumes sizeof(int)=4*sizeof(char), or that tries to
|
Line 2551... |
Line 2666... |
takes the Zip CPU more instructions to do many of the same operations.
|
takes the Zip CPU more instructions to do many of the same operations.
|
The good part of this is that it gives the Zip CPU a greater amount of
|
The good part of this is that it gives the Zip CPU a greater amount of
|
flexibility in its immediate operand mode, although that increased
|
flexibility in its immediate operand mode, although that increased
|
flexibility isn't necessarily as valuable as one might like.
|
flexibility isn't necessarily as valuable as one might like.
|
|
|
\item The Zip CPU does not currently detect and trap on either illegal
|
|
instructions or bus errors. Attempts to access non--existent
|
|
memory quietly return erroneous results, rather than halting the
|
|
process (user mode) or halting or resetting the CPU (supervisor mode).
|
|
|
|
\item The Zip CPU doesn't support out of order execution. I suppose it could
|
\item The Zip CPU doesn't support out of order execution. I suppose it could
|
be modified to do so, but then it would no longer be the ``simple''
|
be modified to do so, but then it would no longer be the ``simple''
|
and low LUT count CPU it was designed to be. The two primary results
|
and low LUT count CPU it was designed to be. The two primary results
|
are that 1) loads may unnecessarily stall the CPU, even if other
|
are that 1) loads may unnecessarily stall the CPU, even if other
|
things could be done while waiting for the load to complete, 2)
|
things could be done while waiting for the load to complete, 2)
|
Line 2581... |
Line 2691... |
\item While the Zip CPU has its own assembler, it has no linker and does not
|
\item While the Zip CPU has its own assembler, it has no linker and does not
|
(yet) support a compiler. The standard C library is an even longer
|
(yet) support a compiler. The standard C library is an even longer
|
shot. My dream of having binutils and gcc support has not been
|
shot. My dream of having binutils and gcc support has not been
|
realized and at this rate may not be realized. (I've been intimidated
|
realized and at this rate may not be realized. (I've been intimidated
|
by the challenge every time I've looked through those codes.)
|
by the challenge every time I've looked through those codes.)
|
|
|
\iffalse
|
|
\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle
|
|
execution, the Zip CPU is unable to exploit this parallelism. Instead,
|
|
apart from the DMA and the pipelined prefetch, all loads and stores
|
|
are single wishbone bus operations requiring a minimum of 3 clocks.
|
|
(In practice, this has turned into 7-clocks.)
|
|
% Addressed, 20150929
|
|
|
|
\item There is no control over whether or not an instruction sets the
|
|
condition codes--certain instructions always set the condition codes,
|
|
other instructions never set them. This effectively limits conditional
|
|
instructions to a single instruction only (with two or more
|
|
instructions as an exception), as the first instruction that sets
|
|
condition codes will break the condition code chain.
|
|
|
|
{\em (A proposed change below addresses this.)}
|
|
|
|
\item Using the CC register as a trap address was a bad idea--it limits the CC
|
|
register's ability to be used in future expansion, such as by adding
|
|
exception indication flags: bus error, floating point exception, etc.
|
|
|
|
{\em (This can be changed by a different O/S implementation of the trap
|
|
instruction.)}
|
|
\item The current implementation suffers from too many stalls on any
|
|
branch--even if the branch is known early on.
|
|
|
|
{\em (This is addressed in proposals below.)}
|
|
% Addressed, 20150918
|
|
|
|
\item In a similar fashion, a switch to interrupt context forces the pipeline
|
|
to be cleared, whereas it might make more sense to just continue
|
|
executing the instructions already in the pipeline while the prefetch
|
|
stage is working on switching to the interrupt context.
|
|
|
|
{\em (Also addressed in proposals below.)}
|
|
% This should happen so rarely that it is not really a problem
|
|
\fi
|
|
|
|
\end{itemize}
|
\end{itemize}
|
|
|
\section{The Next Generation}
|
\section{The Next Generation}
|
This section could also be labeled as my ``To do'' list.
|
This section could also be labeled as my ``To do'' list. Today's list is
|
|
much different than it was for the last version of this document, as much of
|
Given the feedback listed above, perhaps it's time to consider what changes could be made to improve the Zip CPU in the future. I offer the following as proposals:
|
the prior to-do list (such as VLIW instructions, and a more traditional
|
|
instruction cache) has now been implemented. The only things really and
|
\begin{itemize}
|
truly waiting on my list today are assembler support for the VLIW instruction
|
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the
|
set, linker and compiler support.
|
proposals below will only increase the amount of logic the Zip CPU
|
|
requires. While I expect that the Zip CPU will always be somewhat
|
|
of a light weight, it will never be the smallest kid on the block.
|
|
|
|
I'm actually struggling with this idea. The whole goal of the Zip
|
|
CPU was to be light weight. Wouldn't it make more sense to create and
|
|
maintain options whereby it would remain lightweight? For example, if
|
|
the process accounting registers are anything but light weight, why
|
|
keep them? Why not instead make some compile flags that just turn them
|
|
off, keeping the CPU lightweight? The same holds for the prefetch
|
|
cache.
|
|
|
|
\item The `{\tt .V}' condition was never used in any code other than my test
|
|
code. Suggest changing it to a `{\tt .LE}' condition, which seems
|
|
to be more useful.
|
|
|
|
\item {\bf Consider a more traditional Instruction Cache.} The current
|
|
pipelined instruction cache just reads a window of memory into
|
|
its cache. If the CPU leaves that window, the entire cache is
|
|
invalidated. A more traditional cache, however, might allow
|
|
common subroutines to stay within the cache without invalidating the
|
|
entire cache structure.
|
|
|
|
\iffalse
|
|
\item {\bf Adjust the Zip CPU so that conditional instructions do not set
|
|
flags}, although they may explicitly set condition codes if writing
|
|
to the CC register.
|
|
|
|
This is a simple change to the core, and may show up in new releases.
|
|
% Fixed, 20150918
|
|
|
|
\item Add in an {\bf unpredictable branch delay slot}, so that on any branch
|
|
the delay slot may or may not be executed before the branch.
|
|
Instructions that do not depend upon the branch, and that should be
|
|
executed were the branch not taken, could be placed into the delay
|
|
slot. Thus, if the branch isn't taken, we wouldn't suffer the stall,
|
|
whereas it wouldn't affect the timing of the branch if taken. It would
|
|
just do something irrelevant.
|
|
|
|
% Changes made, 20150918, make this option no longer relevant
|
|
|
|
\item {\bf Re-engineer Branch Processing.} There's no reason why a {\tt BRA}
|
|
instruction should create five stall cycles. The decode stage, plus
|
|
the prefetch engine, should be able to drop this number of stalls via
|
|
better branch handling.
|
|
|
|
Indeed, this could turn into a simple means of branch prediction:
|
|
if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C}
|
|
suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would
|
|
be faster than a {\tt BRA.C} instruction. This would then allow a
|
|
compiler to do explicit branch optimizations.
|
|
|
|
Of course, longer branches using {\tt ADD X,PC} would still not be
|
|
optimized.
|
|
|
|
% DONE: 20150918 -- to include the ADD X,PC instructions
|
|
|
|
\item {\bf Request bus access for Load/Store two cycles earlier.} The problem
|
|
here is the contention for the bus between the memory unit and the
|
|
prefetch unit. Currently, the memory unit must ask the prefetch
|
|
unit to release the bus if it is in the middle of a bus cycle. At this
|
|
point, the prefetch drops the {\tt STB} line on the next clock and must
|
|
then wait for the last {\tt ACK} before releasing the bus. If the
|
|
request takes one clock, dropping the strobe line the next, waiting
|
|
for an acknowledgement takes another, and then the bus must be idle
|
|
for one cycle before starting again, these extra cycles add up.
|
|
It should be possible to tell the prefetch stage to give up the bus
|
|
as soon as the decoder knows the instruction will need the bus.
|
|
Indeed, if done in the decode stage, this might drop the seven cycle
|
|
access down by two cycles.
|
|
% FIXED: 20150918
|
|
|
|
\item {\bf Very Long Instruction Word (VLIW).} Now, to speed up operation, I
|
|
propose that the Zip CPU instruction set be modified towards a Very
|
|
Long Instruction Word (VLIW) implementation. In this implementation,
|
|
an instruction word may contain either one or two separate
|
|
instructions. The first instruction would take up the high order bits,
|
|
the second optional instruction the lower 16-bits. Further, I propose
|
|
that any of the ALU instructions (SUB through LSR) automatically have
|
|
a second instruction whenever their operand `B' is a register, and use
|
|
the full 20-bit immediate if not. This will effectively eliminate the
|
|
register plus immediate mode for all of these instructions.
|
|
|
|
This is the minimal required change to increase the number of
|
|
instructions per clock cycle. Other changes would need to take place
|
|
as well to support this. These include:
|
|
\begin{itemize}
|
|
\item Instruction words containing two instructions would take two
|
|
clocks to complete, while requiring only a single cycle
|
|
instruction fetch.
|
|
\item Instructions preceded by a label in the assembler must always
|
|
start in the high order word.
|
|
\item VLIW's, once started, must always execute to completion. The
|
|
upper word may set the PC, the lower word may not. Regardless
|
|
of whether the upper word sets the PC, the lower word must
|
|
still be guaranteed to complete before the PC changes. On any
|
|
switch to (or from) interrupt context, both instructions must
|
|
complete or none of the instructions in the word shall
|
|
complete prior to the switch.
|
|
\item STEP commands and BREAK instructions will only take place after
|
|
the entire word is executed.
|
|
\end{itemize}
|
|
|
|
If done well, the assembler should be able to handle these changes
|
|
with the biggest impacts to the user being increased performance and
|
|
a loss of the register plus immediate ALU modes. (These weren't really
|
|
relevant for the XOR, OR, AND, etc. operations anyway.) Machine code
|
|
compatibility will not be maintained.
|
|
|
|
A proposed secondary instruction set might consist of: a four bit
|
|
operand (any of the prior instructions would be supported, with some
|
|
exceptions such as moves to and from user registers while in
|
|
supervisor mode not being supported), a 4-bit register result (PC not
|
|
allowed), a 3-bit conditional (identical to the conditional for the
|
|
upper word), a single bit for whether or not an immediate is present
|
|
or not, followed by either a 4-bit register or a 4-bit signed
|
|
immediate. The multiply instruction would steal the immediate flag to
|
|
be used as a sign indication, forcing both operands to be registers
|
|
without any immediate offsets.
|
|
|
|
{\em Initial conversion of several library functions to this secondary
|
|
instruction set has demonstrated little to no gain. The problem was
|
|
that the new instruction set was made by joining a rarely used
|
|
instruction (ALU with register and not immediate) with a more common
|
|
instruction. The utility was then limited by the utility of the rare
|
|
instruction, which limited the impact of the entire approach. }
|
|
\else
|
|
\item {\bf Very Long Instruction Word (VLIW).} The goal here would be to
|
|
create a new instruction set whereby two instructions would be encoded
|
|
in each 32--bit word. While this may speed up
|
|
CPU operation, it would necessitate an instruction redesign.
|
|
\fi
|
|
|
|
\end{itemize}
|
|
|
|
|
Stay tuned, these are likely to be coming next.
|
|
|
% Appendices
|
% Appendices
|
% Index
|
% Index
|
\end{document}
|
\end{document}
|
|
|