%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename: spec.tex
%%
%% Project: Zip CPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose: This LaTeX file contains all of the documentation/description
%% currently provided with this Zip CPU soft core. It supersedes
%% any information about the instruction set or CPUs found
%% elsewhere. It's not nearly as interesting, though, as the PDF
%% file it creates, so I'd recommend reading that before diving
%% into this file. You should be able to find the PDF file in
%% the SVN distribution together with this LaTeX file and a copy of
%% the GPL-3.0 license this file is distributed under. If not,
%% just type 'make' in the doc directory and it (should) build
%% without a problem.
%%
%%
%% Creator: Dan Gisselquist
%% Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program. (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.) If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License: GPL, v3, as defined and found on www.gnu.org,
%% http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.4}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
pdfauthor={Dan Gisselquist},
pdfsubject={Zip CPU}]{hyperref}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the Zip CPU. ARM processors are
good and effective. Xilinx makes and markets Microblaze, Altera Nios, and both
have better toolsets than the Zip CPU will ever have. OpenRISC is also
available, and RISC--V may be replacing it. Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it, though. There's a lot that I would like to do with a
processor, and I want to be able to do it in a vendor independent fashion.
First, I would like to be able to place this processor inside an FPGA. Without
paying royalties, ARM is out of the question. I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on both Xilinx and Altera chips, and that can be
easily ported from one manufacturer's chipsets to another. Even more, before
purchasing a chip or a board, I would like to know that my soft core works. I
would like to build a test bench to test components with, and Verilator is my
chosen test bench. This forces me to use all Verilog, and it prevents me from
using any proprietary cores. For this reason, Microblaze and Nios are out of
the question.

Why not OpenRISC? That's a hard question. The OpenRISC team has done some
wonderful work on an amazing processor, and I'll have to admit that I am
envious of what they've accomplished. I would like to port binutils to the
Zip CPU, as I would like to port GCC and GDB. They are way ahead of me. The
OpenRISC processor, however, is complex and hefty at about 4,500 LUTs. It has
a lot of features of modern CPUs within it that ... well, let's just say it's
not the little guy on the block. The Zip CPU is lighter weight, costing only
about 2,300 LUTs with no peripherals, and 3,200 LUTs with some very basic
peripherals.

My final reason is that I'm building the Zip CPU as a learning experience. The
Zip CPU has allowed me to learn a lot about how CPUs work on a very micro
level. For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is light
weight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}


The original goal of the Zip CPU was to be a very simple CPU. You might
think of it as a poor man's alternative to the OpenRISC architecture.
For this reason, all instructions have been designed to be as simple as
possible, and are all designed to be executed in one instruction cycle per
instruction, barring pipeline stalls. Indeed, even the bus has been simplified
to a constant 32-bit width, with no option for more or less. This has
resulted in the choice to drop push and pop instructions, pre-increment and
post-decrement addressing modes, and more.

For those who like buzz words, the Zip CPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,
instructions are 32-bits wide, etc.
\item A RISC CPU. There is no microcode for executing instructions. All
instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture. (Only load and store instructions
can access memory.)
\item Wishbone compliant. All peripherals are accessed just like
memory across this bus.
\item A Von-Neumann architecture. (The instructions and data share a
common bus.)
\item A pipelined architecture, having stages for {\bf Prefetch},
{\bf Decode}, {\bf Read-Operand}, the {\bf ALU/Memory}
unit, and {\bf Write-back}. See Fig.~\ref{fig:cpu}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
for a diagram of this structure.
\item Completely open source, licensed under the GPL.\footnote{Should you
need a copy of the Zip CPU licensed under other terms, please
contact me.}
\end{itemize}

Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped. Worse, I've had to adjust to create
capabilities that I was never expecting to need. These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
still necessary to debug this CPU. That means that there needs to be
an external register that can control the CPU: reset it, halt it, step
it, and tell whether it is running or not. My chosen interface
includes a second register similar to this control register. This
second register allows the external controller or debugger to examine
registers internal to the CPU.

\item {\bf Internal Debug:} Being able to run a debugger from within
a user process requires an ability to step a user process from
within a debugger. It also requires a break instruction that can
be substituted for any other instruction, and substituted back.
The break is actually difficult: the break instruction cannot be
allowed to execute. That way, upon a break, the debugger should
be able to jump back into the user process to step the instruction
that would've been at the break point initially, and then to
replace the break after passing it.

Incidentally, this break messes with the prefetch cache and the
pipeline: if you change an instruction partially through the pipeline,
the whole pipeline needs to be cleansed. Likewise if you change
an instruction in memory, you need to make sure the cache is reloaded
with the new instruction.

\item {\bf Prefetch Cache:} My original implementation had a very
simple prefetch stage. Any time the PC changed the prefetch would go
and fetch the new instruction. While this was perhaps the simplest
approach, it cost roughly five clocks for every instruction. This
was deemed unacceptable, as I wanted a CPU that could execute
instructions in one cycle. I therefore have a prefetch cache that
issues pipelined wishbone accesses to memory and then pushes
instructions at the CPU. Sadly, this accounts for about 20\% of the
logic in the entire CPU, or 15\% of the logic in the entire system.


\item {\bf Operating System:} In order to support an operating system,
interrupts and so forth, the CPU needs to support supervisor and
user modes, as well as a means of switching between them. For example,
the user needs a means of executing a system call. This is the
purpose of the {\bf `trap'} instruction. This instruction needs to
place the CPU into supervisor mode (here equivalent to disabling
interrupts), as well as handing it a parameter such as identifying
which O/S function was called.

My initial approach to building a trap instruction was to create an external
peripheral which, when written to, would generate an interrupt and could
return the last value written to it. In practice, this approach didn't work
at all: the CPU executed two instructions while waiting for the
trap interrupt to take place. Since then, I've decided to keep the rest of
the CC register for that purpose so that a write to the CC register, with the
GIE bit cleared, could be used to execute a trap. This has other problems,
though, primarily in the limitation of the uses of the CC register. In
particular, the CC register is the best place to put CPU state information and
to ``announce'' special CPU features (floating point, etc). So the trap
instruction still switches to interrupt mode, but the CC register is not
nearly as useful for telling the supervisor mode processor what trap is being
executed.

Modern timesharing systems also depend upon a {\bf Timer} interrupt
to handle task swapping. For the Zip CPU, this interrupt is handled
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
The timer module itself is found in {\tt ziptimer.v}.

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
stalls at all, but rather to require the compiler to properly schedule
all instructions so that stalls would never be necessary. After trying
to build such an architecture, I gave up, having learned some things:

For example, in order to facilitate interrupt handling and debug
stepping, the CPU needs to know what instructions have finished, and
which have not. In other words, it needs to know where it can restart
the pipeline from. Once restarted, it must act as though it had
never stopped. This killed my idea of delayed branching, since what
would be the appropriate program counter to restart at? The one the
CPU was going to branch to, or the ones in the delay slots? This
also makes the idea of compressed instruction codes difficult, since,
again, where do you restart on interrupt?

So I switched to a model of discrete execution: Once an instruction
enters into either the ALU or memory unit, the instruction is
guaranteed to complete. If the logic recognizes a branch or a
condition that would render the instruction entering into this stage
possibly inappropriate (e.g., a conditional branch preceding a store
instruction), then the pipeline stalls for one cycle
until the conditional branch completes. Then, if it generates a new
PC address, the stages preceding are all wiped clean.

The discrete execution model allows such things as sleeping: if the
CPU is put to ``sleep,'' the ALU and memory stages stall and back up
everything before them. Likewise, anything that has entered the ALU
or memory stage when the CPU is placed to sleep continues to completion.
To handle this logic, each pipeline stage has three control signals:
a valid signal, a stall signal, and a clock enable signal. In
general, a stage stalls if its contents are valid and the next step
is stalled. This allows the pipeline to fill any time a later stage
stalls.

This approach is also different from other pipeline approaches. Instead
of keeping the entire pipeline filled, each stage is treated
independently. Therefore, individual stages may move forward as long
as the subsequent stage is available, regardless of whether the stage
behind it is filled.

\item {\bf Verilog Modules:} When examining how other processors worked
here on OpenCores, many of them had one separate module per pipeline
stage. While this appeared to me to be a fascinating and commendable
idea, my own implementation didn't work out quite so nicely.

As an example, the decode module produces a {\em lot} of
control wires and registers. Creating a module out of this, with
only the simplest of logic within it, seemed to be more a lesson
in passing wires around, rather than encapsulating logic.

Another example was the register writeback section. I would love
this section to be a module in its own right, and many have made them
such. However, other modules depend upon writeback results other
than just what's placed in the register (i.e., the control wires).
For these reasons, I didn't manage to fit this section into its
own module.

The result is that the majority of the CPU code can be found in
the {\tt zipcpu.v} file.
\end{itemize}
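
The stall rule given under {\bf Pipeline Stalls} above (a stage stalls if its
contents are valid and the next step is stalled) can be sketched in a few lines
of Python. This is only an illustrative model, not the logic found in
{\tt zipcpu.v}: the function name {\tt stall\_signals}, the list representation
of the stages, and the {\tt downstream\_busy} input (standing in for whatever
holds up the final stage) are assumptions made for the example.

```python
def stall_signals(valid, downstream_busy):
    """Return one stall flag per pipeline stage, earliest stage first.

    valid[i] is True when stage i holds a valid instruction.
    downstream_busy is a hypothetical input: True when the last stage
    cannot hand off its contents this cycle.
    """
    stalls = [False] * len(valid)
    next_stalled = downstream_busy
    # Walk from the last stage back toward the prefetch.
    for i in reversed(range(len(valid))):
        # A stage stalls if its contents are valid and the next step is stalled.
        stalls[i] = valid[i] and next_stalled
        next_stalled = stalls[i]
    return stalls


# Assumed stage order: prefetch, decode, read-operand, ALU/memory.
full = stall_signals([True, True, True, True], downstream_busy=True)
bubble = stall_signals([True, True, False, True], downstream_busy=True)
print(full)    # a full pipeline backs up all the way to the prefetch
print(bubble)  # an empty read-operand stage absorbs the stall
```

In the second call the empty stage is not stalled, so the stages behind it may
still move forward, which is how the pipeline fills while a later stage waits.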

With that introduction out of the way, let's move on to the instruction
set.

\chapter{CPU Architecture}\label{chap:arch}

The Zip CPU supports a set of two operand instructions, where the second operand
(always a register) is the result. The only exception is the store instruction,
where the first operand (always a register) is the source of the data to be
stored.

\section{Simplified Bus}
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
It has been simplified in this fashion: all operations are 32--bit operations.
Because every access is a full 32--bit word, the bus is neither little endian
nor big endian. All words are 32--bits, and all instructions are also
32--bits wide. Everything has been built around the 32--bit word.

\section{Register Set}
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/regset.eps}
\caption{Zip CPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used in interrupt mode when interrupts are disabled,
whereas the user set is used otherwise. Of this register set, the Program
Counter (PC) is register 15, whereas the status register (SR) or condition
code register
(CC) is register 14. By convention, the stack pointer will be register 13 and
noted as (SP)--although there is nothing special about this register other
than this convention.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.

The status register is special, and bears further mention. As shown in
Tbl.~\ref{tbl:cc-register},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline
8 & R & (Reserved for) Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep. When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative. The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}
the lower 11~bits of the status register form
a set of CPU state and condition codes. Writes to other bits of this register
are preserved.

Of the condition codes, the bottom four bits are the current flags:
Zero (Z),
Carry (C),
Negative (N),
and Overflow (V).

The next bit is a clock enable (0 to enable) or sleep bit (1 to put
the CPU to sleep). Setting this bit will cause the CPU to
wait for an interrupt (if interrupts are enabled), or to
completely halt (if interrupts are disabled).

The sixth bit is a global interrupt enable bit (GIE). When this
sixth bit is a `1' interrupts will be enabled, else disabled. When
interrupts are disabled, the CPU will be in supervisor mode, otherwise
it is in user mode. Thus, to execute a context switch, one only
need enable or disable interrupts. (When an interrupt line goes
high, interrupts will automatically be disabled, as the CPU goes
and deals with its context switch.) Special logic has been added to
keep the user mode from setting the sleep bit and clearing the
GIE bit at the same time, with clearing the GIE bit taking
precedence.

The seventh bit is a step bit. This bit can be
set from supervisor mode only. After setting this bit, should
the supervisor mode process switch to user mode, it would then
accomplish one instruction in user mode before returning to supervisor
mode. Then, upon return to supervisor mode, this bit will
be automatically cleared. This bit has no effect on the CPU while in
supervisor mode.

This functionality was added to allow a userspace debugger
to single step a user process, working through supervisor mode
of course.


The eighth bit is a break enable bit. This controls whether a break
instruction in user mode will halt the processor for an external debugger
(break enabled), or whether the break instruction will simply send the
CPU into interrupt mode. Encountering a break in supervisor mode will
halt the CPU independent of the break enable bit. This bit can only be set
within supervisor mode.

% Should break enable be a supervisor mode bit, while the break enable bit
% in user mode is a break has taken place bit?
%

This functionality was added to enable an external debugger to
set and manage breakpoints.

The ninth bit is reserved for an illegal instruction bit. When the CPU
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (once implemented) switch
to supervisor mode while setting this bit. The bit will automatically be
cleared upon any return to user mode.

The tenth bit is a trap bit. It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command. This allows the
supervisor, in supervisor mode, to determine whether it got to supervisor
mode from a trap or from an external interrupt or both.

These status register bits are summarized in Tbl.~\ref{tbl:ccbits}.
\begin{table}
\begin{center}
\begin{tabular}{l|l}
Bit & Meaning \\\hline
10 & (Reserved for) Bus error flag \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & (Reserved for) Illegal instruction flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
4 & Sleep \\\hline
3 & V, or overflow bit.\\\hline
2 & N, or negative bit.\\\hline
1 & C, or carry bit.\\\hline
0 & Z, or zero bit.\\\hline
\end{tabular}
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
\end{center}\end{table}
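
As a rough illustration of how software might interpret these bits, the
following Python sketch turns a status register value into the CPU's mode and
activity. The bit positions follow the text above (Z in bit~0 through the trap
flag in bit~9); the constant and function names are invented for this example
and do not come from the Zip CPU sources. The reserved bits are left out.

```python
# Bit masks for the status (CC) register; names are illustrative only.
CC_Z = 1 << 0             # Zero
CC_C = 1 << 1             # Carry
CC_N = 1 << 2             # Negative
CC_V = 1 << 3             # Overflow
CC_SLEEP = 1 << 4         # Sleep (0 leaves the clock enabled)
CC_GIE = 1 << 5           # Global Interrupt Enable
CC_STEP = 1 << 6          # Single step in user mode
CC_BREAK_ENABLE = 1 << 7  # Halt on break for an external debugger
CC_TRAP = 1 << 9          # Soft trap requested from user mode


def cpu_mode(cc):
    """Interrupts enabled means user mode; disabled means supervisor mode."""
    return "user" if cc & CC_GIE else "supervisor"


def cpu_activity(cc):
    """Describe what the sleep bit implies for a given register value."""
    if not cc & CC_SLEEP:
        return "running"
    # Asleep with interrupts enabled waits; asleep without them halts.
    return "waiting for interrupt" if cc & CC_GIE else "halted"


print(cpu_mode(CC_GIE | CC_Z), cpu_activity(CC_GIE | CC_SLEEP))
```

The two helpers mirror the rules stated above: the GIE bit alone selects the
register set in use, and the sleep bit only halts the CPU outright when
interrupts are disabled.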

\section{Conditional Instructions}
Most, although not quite all, instructions may be conditionally executed. From
the four condition code flags, eight conditions are defined. These are shown
in Tbl.~\ref{tbl:conditions}.
\begin{table}
\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .Z} & Only execute when `Z' is set \\
3'h2 & {\tt .NE} & Only execute when `Z' is not set \\
3'h3 & {\tt .GE} & Greater than or equal (`N' not set, `Z' irrelevant) \\
3'h4 & {\tt .GT} & Greater than (`N' not set, `Z' not set) \\
3'h5 & {\tt .LT} & Less than (`N' set) \\
3'h6 & {\tt .C} & Carry set\\
3'h7 & {\tt .V} & Overflow set\\
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}
\end{table}
There is no condition code for less than or equal, not C or not V. Sorry,
I ran out of space in 3--bits. Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})

Conditionally executed ALU instructions will not further adjust the
condition codes.
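
To make the condition table concrete, here is a small Python sketch that
decides whether an instruction with a given 3--bit condition code would
execute for a given set of flags. The function name {\tt condition\_met} and
its argument order are assumptions made for this example; only the eight
conditions themselves come from Tbl.~\ref{tbl:conditions}.

```python
# One predicate per 3-bit condition code, indexed by the code value.
# Each takes the flags (z, c, n, v) as booleans.
CONDITIONS = [
    lambda z, c, n, v: True,             # 3'h0: always
    lambda z, c, n, v: z,                # 3'h1: .Z
    lambda z, c, n, v: not z,            # 3'h2: .NE
    lambda z, c, n, v: not n,            # 3'h3: .GE
    lambda z, c, n, v: not n and not z,  # 3'h4: .GT
    lambda z, c, n, v: n,                # 3'h5: .LT
    lambda z, c, n, v: c,                # 3'h6: .C
    lambda z, c, n, v: v,                # 3'h7: .V
]


def condition_met(code, z, c, n, v):
    """Return True when the instruction's condition holds for these flags."""
    if not 0 <= code <= 7:
        raise ValueError("condition code must fit in three bits")
    return bool(CONDITIONS[code](z, c, n, v))


# After comparing equal values (Z set, N clear): .GE holds, .GT does not.
print(condition_met(3, z=True, c=False, n=False, v=False))
print(condition_met(4, z=True, c=False, n=False, v=False))
```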

\section{Operand B}
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
or an immediate offset by itself. This value is encoded as shown in
Tbl.~\ref{tbl:opb}.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
Bit 20 & 19 \ldots 16 & 15 \ldots 0 \\\hline
1'b0 & \multicolumn{2}{l|}{20--bit Signed Immediate value} \\\hline
1'b1 & 4-bit Register & 16--bit Signed immediate offset \\\hline
\end{tabular}
\caption{Bit allocation for Operand B}\label{tbl:opb}
\end{center}\end{table}

Sixteen and twenty bit immediate values don't make sense for all instructions.
For example, what is the point of a 20--bit immediate when executing a 16--bit
multiply? Likewise, why have a 16--bit immediate when adding to a logical
or arithmetic shift? In these cases, the extra bits are reserved for future
instruction possibilities.
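
The Operand B layout of Tbl.~\ref{tbl:opb} can be decoded with a few shifts
and masks. The Python sketch below is one way to do it, assuming only the
general encoding in that table and ignoring any per--instruction special
cases. The names {\tt sign\_extend} and {\tt decode\_operand\_b} are made up
for the example.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_operand_b(opb):
    """Split a 21-bit Operand B field into (register, immediate).

    register is None when the field holds a bare 20-bit immediate.
    """
    opb &= 0x1FFFFF                      # keep the 21 bits of the field
    if opb & (1 << 20):                  # bit 20 set: register plus offset
        register = (opb >> 16) & 0xF     # bits 19..16
        return register, sign_extend(opb & 0xFFFF, 16)
    return None, sign_extend(opb & 0xFFFFF, 20)


# Register 13 with an offset whose low 16 bits are 0xFFFC.
print(decode_operand_b((1 << 20) | (13 << 16) | 0xFFFC))
# A bare immediate with all twenty bits set.
print(decode_operand_b(0xFFFFF))
```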

\section{Address Modes}
The Zip CPU supports two addressing modes: register plus immediate, and
immediate address. Addresses are therefore encoded in the same fashion as
Operand B's, shown above.

A lot of long hard thought was put into whether to allow pre/post increment
and decrement addressing modes. Finding no way to use these operators without
taking two or more clocks per instruction,\footnote{The two clocks figure
comes from the design of the register set, allowing only one write per clock.
That write is either from the memory unit or the ALU, but never both.} these
addressing modes have been
removed from the realm of possibilities. This means that the Zip CPU has no
native way of executing push, pop, return, or jump to subroutine operations.
Each of these instructions can be emulated with a set of instructions from the
existing set.

\section{Move Operands}
The previous set of operands would be perfect and complete, save only that
the CPU needs access to non--supervisory registers while in supervisory mode.
Therefore, the MOV instruction is special and offers access to these registers
\ldots when in supervisory mode. To keep the compiler simple, the extra bits
are ignored in non-supervisory mode (as though they didn't exist), rather than
being mapped to new instructions or additional capabilities. The bits
indicating which register set each register lies within are the A-Usr and
B-Usr bits. When set to a one, these refer to a user mode register. When set
to a zero, these refer to a register in the current mode, whether user or
supervisor. Further, because a load immediate instruction exists, there is no
move capability between an immediate and a register: all moves come from either
a register or a register plus an offset.

This actually leads to a bit of a problem: since the MOV instruction encodes
which register set each register is coming from or moving to, how shall a
compiler or assembler know how to compile a MOV instruction without knowing
the mode of the CPU at the time? For this reason, the compiler will assume
all MOV registers are supervisor registers, and display them as normal.
Anything with the user bit set will be treated as a user register. The CPU
will quietly ignore the supervisor bits while in user mode, and anything
marked as a user register will always be valid.
|
| 518 |
21 |
dgisselq |
|
| 519 |
|
|
\section{Multiply Operations}
|
| 520 |
36 |
dgisselq |
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
|
| 521 |
|
|
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). In both
|
| 522 |
21 |
dgisselq |
cases, the operand is a register plus a 16-bit immediate, subject to the
|
| 523 |
|
|
rule that the register cannot be the PC or CC registers. The PC register
|
| 524 |
|
|
field has been stolen to create a multiply by immediate instruction. The
|
| 525 |
|
|
CC register field is reserved.
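
As a rough illustration of the difference between the two multiplies, the
following Python sketch models them in software. It assumes that only the low
16~bits of each operand take part and that the product is kept as a 32--bit
value; neither detail is spelled out in this section, and the helper names are
hypothetical. The argument {\tt b} stands for the already computed register
plus immediate operand.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def mpyu(a, b):
    """Unsigned 16x16 multiply of the low halves (assumed MPYU behaviour)."""
    return ((a & MASK16) * (b & MASK16)) & MASK32


def mpys(a, b):
    """Signed 16x16 multiply of the low halves (assumed MPYS behaviour)."""
    return (sign_extend16(a) * sign_extend16(b)) & MASK32
```

With these assumptions, the same bit pattern {\tt 0xFFFF} acts as 65535 under
the unsigned model and as $-1$ under the signed model.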
|
| 526 |
|
|
|
| 527 |
|
|
\section{Floating Point}
|
| 528 |
36 |
dgisselq |
The Zip CPU does not (yet) support floating point operations. However, the
|
| 529 |
32 |
dgisselq |
instruction set reserves two possibilities for future floating point
|
| 530 |
|
|
operations.
|
| 531 |
21 |
dgisselq |
|
| 532 |
32 |
dgisselq |
The first floating point operation hole in the instruction set involves
|
| 533 |
36 |
dgisselq |
setting a proposed (but non-existent) floating point bit in the CC register.
|
| 534 |
|
|
The next instruction
|
| 535 |
|
|
would then simply interpret its operands as floating point values.
|
| 536 |
32 |
dgisselq |
Not all instructions, however, have floating point equivalents. Further, the
|
| 537 |
|
|
immediate fields do not apply in floating point mode, and must be set to
|
| 538 |
|
|
zero.
|
| 539 |
|
|
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as
|
| 540 |
|
|
floating point instructions. Other instructions allow the examining of the
|
| 541 |
|
|
floating point bit in the CC register. In all cases, the floating point bit
|
| 542 |
|
|
is cleared one instruction after it is set.
|
| 543 |
21 |
dgisselq |
|
| 544 |
32 |
dgisselq |
The other possibility for floating point operations involves exploiting the
|
| 545 |
|
|
hole in the instruction set that the NOOP and BREAK instructions reside within.
|
| 546 |
36 |
dgisselq |
These two instructions use 24~bits of address space, when only a single bit
|
| 547 |
|
|
is necessary. A simple adjustment to this space could create instructions
|
| 548 |
|
|
with 4--bit register addresses for each register, a 3--bit field for
|
| 549 |
|
|
conditional execution, and a 2--bit field for which operation.
|
| 550 |
|
|
In this fashion, such a floating point capability would only fill 13~bits of
|
| 551 |
|
|
the 24--bit field, still leaving lots of room for expansion.
|
| 552 |
32 |
dgisselq |
|
| 553 |
|
|
In both cases, the Zip CPU would support 32--bit single precision floats
|
| 554 |
36 |
dgisselq |
only, since other choices would complicate the pipeline.
|
| 555 |
32 |
dgisselq |
|
| 556 |
|
|
The current architecture does not support a floating point not-implemented
|
| 557 |
|
|
interrupt. Any soft floating point emulation must be done deliberately.
|
| 558 |
|
|
|
| 559 |
21 |
dgisselq |
\section{Native Instructions}
|
| 560 |
|
|
The instruction set for the Zip CPU is summarized in
|
| 561 |
|
|
Tbl.~\ref{tbl:zip-instructions}.
|
| 562 |
|
|
\begin{table}\begin{center}
|
| 563 |
|
|
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
|
| 564 |
36 |
dgisselq |
\rowcolor[gray]{0.85}
|
| 565 |
21 |
dgisselq |
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
|
| 566 |
|
|
& \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
|
| 567 |
36 |
dgisselq |
& Sets CC? \\\hline\hline
|
| 568 |
21 |
dgisselq |
CMP(Sub) & \multicolumn{4}{l|}{4'h0}
|
| 569 |
|
|
& \multicolumn{4}{l|}{D. Reg}
|
| 570 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 571 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 572 |
|
|
& Yes \\\hline
|
| 573 |
24 |
dgisselq |
TST(And) & \multicolumn{4}{l|}{4'h1}
|
| 574 |
21 |
dgisselq |
& \multicolumn{4}{l|}{D. Reg}
|
| 575 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 576 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 577 |
|
|
& Yes \\\hline
|
| 578 |
|
|
MOV & \multicolumn{4}{l|}{4'h2}
|
| 579 |
|
|
& \multicolumn{4}{l|}{D. Reg}
|
| 580 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 581 |
|
|
& A-Usr
|
| 582 |
|
|
& \multicolumn{4}{l|}{B-Reg}
|
| 583 |
|
|
& B-Usr
|
| 584 |
|
|
& \multicolumn{15}{l|}{15-bit signed offset}
|
| 585 |
|
|
& \\\hline
|
| 586 |
|
|
LODI & \multicolumn{4}{l|}{4'h3}
|
| 587 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 588 |
|
|
& \multicolumn{24}{l|}{24-bit Signed Immediate}
|
| 589 |
|
|
& \\\hline
|
| 590 |
|
|
NOOP & \multicolumn{4}{l|}{4'h4}
|
| 591 |
|
|
& \multicolumn{4}{l|}{4'he}
|
| 592 |
|
|
& \multicolumn{24}{l|}{24'h00}
|
| 593 |
|
|
& \\\hline
|
| 594 |
|
|
BREAK & \multicolumn{4}{l|}{4'h4}
|
| 595 |
|
|
& \multicolumn{4}{l|}{4'he}
|
| 596 |
|
|
& \multicolumn{24}{l|}{24'h01}
|
| 597 |
|
|
& \\\hline
|
| 598 |
36 |
dgisselq |
{\em Reserved} & \multicolumn{4}{l|}{4'h4}
|
| 599 |
21 |
dgisselq |
& \multicolumn{4}{l|}{4'he}
|
| 600 |
|
|
& \multicolumn{24}{l|}{24 bits, but not 0 or 1}
|
| 601 |
|
|
& \\\hline
|
| 602 |
|
|
LODIHI & \multicolumn{4}{l|}{4'h4}
|
| 603 |
|
|
& \multicolumn{4}{l|}{4'hf}
|
| 604 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 605 |
|
|
& 1'b1
|
| 606 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 607 |
|
|
& \multicolumn{16}{l|}{16-bit Immediate}
|
| 608 |
|
|
& \\\hline
|
| 609 |
|
|
LODILO & \multicolumn{4}{l|}{4'h4}
|
| 610 |
|
|
& \multicolumn{4}{l|}{4'hf}
|
| 611 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 612 |
|
|
& 1'b0
|
| 613 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 614 |
|
|
& \multicolumn{16}{l|}{16-bit Immediate}
|
| 615 |
|
|
& \\\hline
|
| 616 |
|
|
16-b MPYU & \multicolumn{4}{l|}{4'h4}
|
| 617 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 618 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 619 |
|
|
& 1'b0 & \multicolumn{4}{l|}{Reg}
|
| 620 |
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
| 621 |
|
|
& Yes \\\hline
|
| 622 |
|
|
16-b MPYU(I) & \multicolumn{4}{l|}{4'h4}
|
| 623 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 624 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 625 |
|
|
& 1'b0 & \multicolumn{4}{l|}{4'hf}
|
| 626 |
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
| 627 |
|
|
& Yes \\\hline
|
| 628 |
|
|
16-b MPYS & \multicolumn{4}{l|}{4'h4}
|
| 629 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 630 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 631 |
|
|
& 1'b1 & \multicolumn{4}{l|}{Reg}
|
| 632 |
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
| 633 |
|
|
& Yes \\\hline
|
| 634 |
|
|
16-b MPYS(I) & \multicolumn{4}{l|}{4'h4}
|
| 635 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 636 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 637 |
|
|
& 1'b1 & \multicolumn{4}{l|}{4'hf}
|
| 638 |
|
|
& \multicolumn{16}{l|}{16-bit Offset}
|
| 639 |
|
|
& Yes \\\hline
|
| 640 |
|
|
ROL & \multicolumn{4}{l|}{4'h5}
|
| 641 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 642 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 643 |
|
|
& \multicolumn{21}{l|}{Operand B, truncated to low order 5 bits}
|
| 644 |
|
|
& \\\hline
|
| 645 |
|
|
LOD & \multicolumn{4}{l|}{4'h6}
|
| 646 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 647 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 648 |
|
|
& \multicolumn{21}{l|}{Operand B address}
|
| 649 |
|
|
& \\\hline
|
| 650 |
|
|
STO & \multicolumn{4}{l|}{4'h7}
|
| 651 |
|
|
& \multicolumn{4}{l|}{D. Reg}
|
| 652 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 653 |
|
|
& \multicolumn{21}{l|}{Operand B address}
|
| 654 |
|
|
& \\\hline
|
| 655 |
|
|
SUB & \multicolumn{4}{l|}{4'h8}
|
| 656 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 657 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 658 |
32 |
dgisselq |
& \multicolumn{21}{l|}{Operand B}
|
| 659 |
21 |
dgisselq |
& Yes \\\hline
|
| 660 |
|
|
AND & \multicolumn{4}{l|}{4'h9}
|
| 661 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 662 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 663 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 664 |
|
|
& Yes \\\hline
|
| 665 |
|
|
ADD & \multicolumn{4}{l|}{4'ha}
|
| 666 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 667 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 668 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 669 |
|
|
& Yes \\\hline
|
| 670 |
|
|
OR & \multicolumn{4}{l|}{4'hb}
|
| 671 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 672 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 673 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 674 |
|
|
& Yes \\\hline
|
| 675 |
|
|
XOR & \multicolumn{4}{l|}{4'hc}
|
| 676 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 677 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 678 |
|
|
& \multicolumn{21}{l|}{Operand B}
|
| 679 |
|
|
& Yes \\\hline
|
| 680 |
|
|
LSL/ASL & \multicolumn{4}{l|}{4'hd}
|
| 681 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 682 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 683 |
33 |
dgisselq |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
| 684 |
21 |
dgisselq |
& Yes \\\hline
|
| 685 |
|
|
ASR & \multicolumn{4}{l|}{4'he}
|
| 686 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 687 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 688 |
33 |
dgisselq |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
| 689 |
21 |
dgisselq |
& Yes \\\hline
|
| 690 |
|
|
LSR & \multicolumn{4}{l|}{4'hf}
|
| 691 |
|
|
& \multicolumn{4}{l|}{R. Reg}
|
| 692 |
|
|
& \multicolumn{3}{l|}{Cond.}
|
| 693 |
33 |
dgisselq |
& \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
|
| 694 |
21 |
dgisselq |
& Yes \\\hline
|
| 695 |
|
|
\end{tabular}
|
| 696 |
|
|
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
|
| 697 |
|
|
\end{center}\end{table}
|
| 698 |
|
|
|
| 699 |
|
|
As you can see, there's lots of room for instruction set expansion. The
|
| 700 |
24 |
dgisselq |
NOOP and BREAK instructions are the only instructions within one particular
|
| 701 |
36 |
dgisselq |
24--bit hole. The rest of this space is reserved for future enhancements.
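
To make the table concrete, here is a minimal Python sketch that packs the
fields of the MOV row into a 32--bit word. It assumes the columns are laid out
from bit 31 down to bit 0 in the order the row lists them (op code, D.~Reg,
Cond., A-Usr, B-Reg, B-Usr, offset). The function names are hypothetical, and
the condition field is passed through as a raw 3--bit number because its
encoding is not given in this section.

```python
MOV_OPCODE = 0x2  # 4'h2 in the instruction table


def encode_mov(d_reg, b_reg, offset=0, cond=0, a_usr=0, b_usr=0):
    """Pack the MOV fields into a 32-bit word, most significant field first."""
    if not (0 <= d_reg <= 15 and 0 <= b_reg <= 15):
        raise ValueError("register numbers must fit in 4 bits")
    if not 0 <= cond <= 7:
        raise ValueError("condition must fit in 3 bits")
    if a_usr not in (0, 1) or b_usr not in (0, 1):
        raise ValueError("user bits must be 0 or 1")
    if not -(1 << 14) <= offset < (1 << 14):
        raise ValueError("offset must fit in 15 signed bits")
    return (
        (MOV_OPCODE << 28)      # bits 31..28: op code
        | (d_reg << 24)         # bits 27..24: D. Reg
        | (cond << 21)          # bits 23..21: Cond.
        | (a_usr << 20)         # bit  20:     A-Usr
        | (b_reg << 16)         # bits 19..16: B-Reg
        | (b_usr << 15)         # bit  15:     B-Usr
        | (offset & 0x7FFF)     # bits 14..0:  signed offset
    )


def decode_mov_offset(word):
    """Recover the signed 15-bit offset from an encoded MOV word."""
    raw = word & 0x7FFF
    return raw - 0x8000 if raw & 0x4000 else raw
```

The field widths sum to $4+4+3+1+4+1+15=32$ bits, which is why the MOV offset
is limited to 15 signed bits.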
|
| 702 |
21 |
dgisselq |
|
| 703 |
|
|
\section{Derived Instructions}
|
| 704 |
36 |
dgisselq |
The Zip CPU supports many other common instructions, but not all of them
|
| 705 |
24 |
dgisselq |
are single cycle instructions. The derived instruction tables,
|
| 706 |
36 |
dgisselq |
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
|
| 707 |
|
|
and~\ref{tbl:derived-4},
|
| 708 |
21 |
dgisselq |
help to capture some of how these other instructions may be implemented on
|
| 709 |
36 |
dgisselq |
the Zip CPU. Many of these instructions will have assembly equivalents,
|
| 710 |
21 |
dgisselq |
such as the branch instructions, to facilitate working with the CPU.
|
| 711 |
|
|
\begin{table}\begin{center}
|
| 712 |
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
| 713 |
|
|
Mapped & Actual & Notes \\\hline
|
| 714 |
36 |
dgisselq |
ABS Rx
|
| 715 |
|
|
& \parbox[t]{1.5in}{TST -1,Rx\\NEG.LT Rx}
|
| 716 |
|
|
& Absolute value, depends upon derived NEG.\\\hline
|
| 717 |
21 |
dgisselq |
\parbox[t]{1.4in}{ADD Ra,Rx\\ADDC Rb,Ry}
|
| 718 |
|
|
& \parbox[t]{1.5in}{ADD Ra,Rx\\ADD.C \$1,Ry\\ADD Rb,Ry}
|
| 719 |
|
|
& Add with carry \\\hline
|
| 720 |
|
|
BRA.Cond +/-\$Addr
|
| 721 |
33 |
dgisselq |
& \hbox{MOV.cond \$Addr+PC,PC}
|
| 722 |
24 |
dgisselq |
& Branch or jump on condition. Works for 15--bit
|
| 723 |
|
|
signed address offsets.\\\hline
|
| 724 |
21 |
dgisselq |
BRA.Cond +/-\$Addr
|
| 725 |
|
|
& \parbox[t]{1.5in}{LDI \$Addr,Rx \\ ADD.cond Rx,PC}
|
| 726 |
|
|
& Branch/jump on condition. Works for
|
| 727 |
|
|
23 bit address offsets, but costs a register, an extra instruction,
|
| 728 |
33 |
dgisselq |
and sets the flags. \\\hline
|
| 729 |
21 |
dgisselq |
BNC PC+\$Addr
|
| 730 |
|
|
& \parbox[t]{1.5in}{TST \$Carry,CC \\ MOV.Z \$Addr+PC,PC}
|
| 731 |
|
|
& Example of a branch on an unsupported
|
| 732 |
|
|
condition, in this case a branch on not carry \\\hline
|
| 733 |
|
|
BUSY & MOV \$-1(PC),PC & Execute an infinite loop \\\hline
|
| 734 |
|
|
CLRF.NZ Rx
|
| 735 |
|
|
& XOR.NZ Rx,Rx
|
| 736 |
|
|
& Clear Rx, and flags, if the Z-bit is not set \\\hline
|
| 737 |
|
|
CLR Rx
|
| 738 |
|
|
& LDI \$0,Rx
|
| 739 |
|
|
& Clears Rx, leaves flags untouched. This instruction cannot be
|
| 740 |
|
|
conditional. \\\hline
|
| 741 |
|
|
EXCH.W Rx
|
| 742 |
|
|
& ROL \$16,Rx
|
| 743 |
|
|
& Exchanges the top and bottom 16-bit words of Rx \\\hline
|
| 744 |
|
|
HALT
|
| 745 |
|
|
& OR \$SLEEP,CC
|
| 746 |
|
|
& Executed while in interrupt mode. In user mode this is simply a
|
| 747 |
33 |
dgisselq |
wait until interrupt instruction. \\\hline
|
| 748 |
21 |
dgisselq |
INT & LDI \$0,CC
|
| 749 |
|
|
& Since we're using the CC register as a trap vector as well, this
|
| 750 |
|
|
executes TRAP \#0. \\\hline
|
| 751 |
|
|
IRET
|
| 752 |
|
|
& OR \$GIE,CC
|
| 753 |
|
|
& Also an RTU instruction (Return to Userspace) \\\hline
|
| 754 |
|
|
JMP R6+\$Addr
|
| 755 |
|
|
& MOV \$Addr(R6),PC
|
| 756 |
|
|
& \\\hline
|
| 757 |
|
|
JSR PC+\$Addr
|
| 758 |
|
|
& \parbox[t]{1.5in}{SUB \$1,SP \\
|
| 759 |
|
|
MOV \$3+PC,R0 \\
|
| 760 |
|
|
STO R0,1(SP) \\
|
| 761 |
|
|
MOV \$Addr+PC,PC \\
|
| 762 |
|
|
ADD \$1,SP}
|
| 763 |
24 |
dgisselq |
& Jump to Subroutine. Note the required cleanup instruction after
|
| 764 |
36 |
dgisselq |
returning. This could easily be turned into a three instruction
|
| 765 |
|
|
sequence, removing the preliminary stack instruction before and
|
| 766 |
|
|
the cleanup after, by adjusting how any stack frame was built for
|
| 767 |
|
|
this routine to include space at the top of the stack for the PC.
|
| 768 |
|
|
\\\hline
|
| 769 |
21 |
dgisselq |
JSR PC+\$Addr
|
| 770 |
|
|
& \parbox[t]{1.5in}{MOV \$3+PC,R12 \\ MOV \$addr+PC,PC}
|
| 771 |
|
|
&This is the high speed
|
| 772 |
|
|
version of a subroutine call, necessitating a register to hold the
|
| 773 |
|
|
last PC address. In its favor, this method doesn't suffer the
|
| 774 |
|
|
mandatory memory access of the other approach. \\\hline
|
| 775 |
|
|
LDI.l \$val,Rx
|
| 776 |
|
|
& \parbox[t]{1.5in}{LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
|
| 777 |
|
|
LDILO (\$val \& 0x0ffff), Rx}
|
| 778 |
|
|
& Sadly, there's not enough instruction
|
| 779 |
|
|
space to load a complete immediate value into any register.
|
| 780 |
|
|
Therefore, fully loading any register takes two cycles.
|
| 781 |
|
|
The LDIHI (load immediate high) and LDILO (load immediate low)
|
| 782 |
|
|
instructions have been created to facilitate this. \\\hline
|
| 783 |
|
|
\end{tabular}
|
| 784 |
|
|
\caption{Derived Instructions}\label{tbl:derived-1}
|
| 785 |
|
|
\end{center}\end{table}
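
The {\tt LDI.l} row above splits a 32--bit constant into two 16--bit halves.
A minimal Python sketch of that split follows. It assumes that LDIHI replaces
only the upper half of the register and LDILO only the lower half, which is
what the instruction names suggest but is not stated outright here; the
function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split_immediate(val):
    """Return the (high, low) 16-bit halves used by LDIHI and LDILO."""
    val &= MASK32
    return (val >> 16) & 0xFFFF, val & 0xFFFF


def load_long(val, reg=0):
    """Model LDIHI followed by LDILO on a 32-bit register value."""
    hi, lo = split_immediate(val)
    reg = (hi << 16) | (reg & 0x0000FFFF)   # LDIHI: assumed to keep the low half
    reg = (reg & 0xFFFF0000) | lo           # LDILO: assumed to keep the high half
    return reg & MASK32
```

Under these assumptions the pair leaves the full constant in the register
regardless of the register's previous contents.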
|
| 786 |
|
|
\begin{table}\begin{center}
|
| 787 |
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
| 788 |
|
|
Mapped & Actual & Notes \\\hline
|
| 789 |
|
|
LOD.b \$addr,Rx
|
| 790 |
|
|
& \parbox[t]{1.5in}{%
|
| 791 |
|
|
LDI \$addr,Ra \\
|
| 792 |
|
|
LDI \$addr,Rb \\
|
| 793 |
|
|
LSR \$2,Ra \\
|
| 794 |
|
|
AND \$3,Rb \\
|
| 795 |
|
|
LOD (Ra),Rx \\
|
| 796 |
|
|
LSL \$3,Rb \\
|
| 797 |
|
|
SUB \$32,Rb \\
|
| 798 |
|
|
ROL Rb,Rx \\
|
| 799 |
|
|
AND \$0ffh,Rx}
|
| 800 |
|
|
& \parbox[t]{3in}{This CPU is designed for 32-bit word
|
| 801 |
|
|
length instructions. Byte addressing is not supported by the CPU or
|
| 802 |
|
|
the bus, so it therefore takes more work to do.
|
| 803 |
|
|
|
| 804 |
|
|
Note also that in this example, \$addr is a byte-wise address, whereas
|
| 805 |
24 |
dgisselq |
all other addresses in this document are 32-bit wordlength addresses.
|
| 806 |
|
|
For this reason,
|
| 807 |
21 |
dgisselq |
we needed to drop the bottom two bits. This also limits the address
|
| 808 |
|
|
space of character accesses using this method from 16 MB down to 4 MB.}
|
| 809 |
|
|
\\\hline
|
| 810 |
|
|
\parbox[t]{1.5in}{LSL \$1,Rx\\ LSLC \$1,Ry}
|
| 811 |
|
|
& \parbox[t]{1.5in}{LSL \$1,Ry \\
|
| 812 |
|
|
LSL \$1,Rx \\
|
| 813 |
|
|
OR.C \$1,Ry}
|
| 814 |
|
|
& Logical shift left with carry. Note that the
|
| 815 |
|
|
instruction order is now backwards, to keep the conditions valid.
|
| 816 |
33 |
dgisselq |
That is, LSL sets the carry flag, so if we did this the other way
|
| 817 |
21 |
dgisselq |
with Rx before Ry, then the condition flag wouldn't have been right
|
| 818 |
|
|
for an OR correction at the end. \\\hline
|
| 819 |
|
|
\parbox[t]{1.5in}{LSR \$1,Rx \\ LSRC \$1,Ry}
|
| 820 |
|
|
& \parbox[t]{1.5in}{CLR Rz \\
|
| 821 |
|
|
LSR \$1,Ry \\
|
| 822 |
|
|
LDIHI.C \$8000h,Rz \\
|
| 823 |
|
|
LSR \$1,Rx \\
|
| 824 |
|
|
OR Rz,Rx}
|
| 825 |
|
|
& Logical shift right with carry \\\hline
|
| 826 |
|
|
NEG Rx & \parbox[t]{1.5in}{XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
|
| 827 |
36 |
dgisselq |
NEG.C Rx & \parbox[t]{1.5in}{MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx} & \\\hline
|
| 828 |
21 |
dgisselq |
NOOP & NOOP & While there are many
|
| 829 |
|
|
operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
|
| 830 |
|
|
operations have consequences in that they might stall the bus if
|
| 831 |
|
|
Rx isn't ready yet. For this reason, we have a dedicated NOOP
|
| 832 |
|
|
instruction. \\\hline
|
| 833 |
|
|
NOT Rx & XOR \$-1,Rx & \\\hline
|
| 834 |
|
|
POP Rx
|
| 835 |
|
|
& \parbox[t]{1.5in}{LOD \$1(SP),Rx \\ ADD \$1,SP}
|
| 836 |
|
|
& Note
|
| 837 |
|
|
that for interrupt purposes, one can never depend upon the value at
|
| 838 |
|
|
(SP). Hence you read from it, then increment it, lest having
|
| 839 |
33 |
dgisselq |
incremented it first something then comes along and writes to that
|
| 840 |
21 |
dgisselq |
value before you can read the result. \\\hline
|
| 841 |
36 |
dgisselq |
\end{tabular}
|
| 842 |
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
| 843 |
|
|
\end{center}\end{table}
|
| 844 |
|
|
\begin{table}\begin{center}
|
| 845 |
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
| 846 |
21 |
dgisselq |
PUSH Rx
|
| 847 |
33 |
dgisselq |
& \parbox[t]{1.5in}{SUB \$1,SP \\
|
| 848 |
21 |
dgisselq |
STO Rx,\$1(SP)}
|
| 849 |
|
|
& \\\hline
|
| 850 |
36 |
dgisselq |
PUSH Rx-Ry
|
| 851 |
|
|
& \parbox[t]{1.5in}{SUB \$n,SP \\
|
| 852 |
|
|
STO Rx,\$n(SP) \\
|
| 853 |
|
|
\ldots \\
|
| 854 |
|
|
STO Ry,\$1(SP)}
|
| 855 |
|
|
& Multiple pushes at once only need the single subtract from the
|
| 856 |
|
|
stack pointer. This derived instruction is analogous to a similar one
|
| 857 |
|
|
on the Motorola 68k architecture, although the Zip Assembler
|
| 858 |
|
|
does not support this instruction (yet).\\\hline
|
| 859 |
21 |
dgisselq |
RESET
|
| 860 |
|
|
& \parbox[t]{1in}{STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
|
| 861 |
|
|
& \parbox[t]{3in}{This depends upon the peripheral base address being
|
| 862 |
|
|
in R12.
|
| 863 |
|
|
|
| 864 |
|
|
Another opportunity might be to jump to the reset address from within
|
| 865 |
|
|
supervisor mode.}\\\hline
|
| 866 |
36 |
dgisselq |
RET & \parbox[t]{1.5in}{LOD \$1(SP),PC}
|
| 867 |
24 |
dgisselq |
& Note that this depends upon the calling context to clean up the
|
| 868 |
|
|
stack, as outlined for the JSR instruction. \\\hline
|
| 869 |
21 |
dgisselq |
RET & MOV R12,PC
|
| 870 |
|
|
& This is the high(er) speed version, that doesn't touch the stack.
|
| 871 |
|
|
As such, it doesn't suffer a stall on memory read/write to the stack.
|
| 872 |
|
|
\\\hline
|
| 873 |
|
|
STEP Rr,Rt
|
| 874 |
|
|
& \parbox[t]{1.5in}{LSR \$1,Rr \\ XOR.C Rt,Rr}
|
| 875 |
|
|
& Step a Galois implementation of a Linear Feedback Shift Register, Rr,
|
| 876 |
|
|
using taps Rt \\\hline
|
| 877 |
|
|
STO.b Rx,\$addr
|
| 878 |
|
|
& \parbox[t]{1.5in}{%
|
| 879 |
|
|
LDI \$addr,Ra \\
|
| 880 |
|
|
LDI \$addr,Rb \\
|
| 881 |
|
|
LSR \$2,Ra \\
|
| 882 |
|
|
AND \$3,Rb \\
|
| 883 |
|
|
SUB \$32,Rb \\
|
| 884 |
|
|
LOD (Ra),Ry \\
|
| 885 |
|
|
AND \$0ffh,Rx \\
|
| 886 |
|
|
AND \$-0ffh,Ry \\
|
| 887 |
|
|
ROL Rb,Rx \\
|
| 888 |
|
|
OR Rx,Ry \\
|
| 889 |
|
|
STO Ry,(Ra) }
|
| 890 |
|
|
& \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
|
| 891 |
|
|
for byte-wise operations.
|
| 892 |
|
|
|
| 893 |
|
|
Note that in this example, \$addr is a
|
| 894 |
|
|
byte-wise address, whereas in all of our other examples it is a
|
| 895 |
|
|
32-bit word address. This also limits the address space
|
| 896 |
|
|
of character accesses from 16 MB down to 4 MB.
|
| 897 |
|
|
Further, this instruction implies a byte ordering,
|
| 898 |
|
|
such as big or little endian.} \\\hline
|
| 899 |
|
|
SWAP Rx,Ry
|
| 900 |
|
|
& \parbox[t]{1.5in}{
|
| 901 |
|
|
XOR Ry,Rx \\
|
| 902 |
|
|
XOR Rx,Ry \\
|
| 903 |
|
|
XOR Ry,Rx}
|
| 904 |
|
|
& While no extra registers are needed, this example
|
| 905 |
|
|
does take 3-clocks. \\\hline
|
| 906 |
|
|
TRAP \#X
|
| 907 |
36 |
dgisselq |
& \parbox[t]{1.5in}{LDI \$x,R0 \\ AND \~{}\$GIE,CC }
|
| 908 |
|
|
& This works because whenever a user lowers the \$GIE flag, it sets
|
| 909 |
|
|
a TRAP bit within the CC register. Therefore, upon entering the
|
| 910 |
|
|
supervisor state, the CPU only need check this bit to know that it
|
| 911 |
|
|
got there via a TRAP. The trap could be made conditional by making
|
| 912 |
|
|
the LDI and the AND conditional. In that case, the assembler would
|
| 913 |
|
|
quietly turn the LDI instruction into an LDILO and LDIHI pair,
|
| 914 |
|
|
but the effect would be the same. \\\hline
|
| 915 |
|
|
\end{tabular}
|
| 916 |
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
| 917 |
|
|
\end{center}\end{table}
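
Two of the rows above, STEP and SWAP, are easy to check in software. The
Python sketch below mirrors each derived sequence one instruction per line.
It assumes that {\tt LSR} leaves the bit it shifts out in the carry flag, so
that {\tt XOR.C} applies the taps only when that bit was set; the function
names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def lfsr_step(state, taps):
    """One Galois LFSR step, following LSR $1,Rr then XOR.C Rt,Rr."""
    state &= MASK32
    carry = state & 1           # bit shifted out by LSR (assumed to set carry)
    state >>= 1                 # LSR $1,Rr
    if carry:
        state ^= taps & MASK32  # XOR.C Rt,Rr
    return state


def xor_swap(x, y):
    """Exchange two register values with three XORs and no scratch register."""
    x ^= y   # XOR Ry,Rx
    y ^= x   # XOR Rx,Ry
    x ^= y   # XOR Ry,Rx
    return x, y
```

The swap also works when both registers hold the same value, but not when Rx
and Ry name the same register, since the first XOR would then clear it.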
|
| 918 |
|
|
\begin{table}\begin{center}
|
| 919 |
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
| 920 |
21 |
dgisselq |
TST Rx
|
| 921 |
|
|
& TST \$-1,Rx
|
| 922 |
|
|
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx,
|
| 923 |
|
|
ADD \$0,Rx, SUB \$0,Rx, AND \$-1,Rx, etc. The TST and CMP
|
| 924 |
|
|
approaches won't stall future pipeline stages looking for the value
|
| 925 |
|
|
of Rx. \\\hline
|
| 926 |
|
|
WAIT
|
| 927 |
|
|
& OR \$SLEEP,CC
|
| 928 |
|
|
& Wait 'til interrupt. In an interrupts disabled context, this
|
| 929 |
|
|
becomes a HALT instruction. \\\hline
|
| 930 |
|
|
\end{tabular}
|
| 931 |
36 |
dgisselq |
\caption{Derived Instructions, continued}\label{tbl:derived-4}
|
| 932 |
21 |
dgisselq |
\end{center}\end{table}
|
| 933 |
|
|
\section{Pipeline Stages}
|
| 934 |
32 |
dgisselq |
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
| 935 |
|
|
the Zip CPU supports a five stage pipeline.
|
| 936 |
21 |
dgisselq |
\begin{enumerate}
|
| 937 |
36 |
dgisselq |
\item {\bf Prefetch}: Reads instructions from memory into a cache, if so
|
| 938 |
|
|
configured. This
|
| 939 |
21 |
dgisselq |
stage is actually pipelined itself, and so it will stall if the PC
|
| 940 |
|
|
ever changes. Stalls are also created here if the instruction isn't
|
| 941 |
|
|
in the prefetch cache.
|
| 942 |
36 |
dgisselq |
|
| 943 |
|
|
The Zip CPU supports one of two prefetch methods, depending upon a flag
|
| 944 |
|
|
set at build time within the {\tt zipcpu.v} file. The simplest is a
|
| 945 |
|
|
non--cached implementation of a prefetch. This implementation is
|
| 946 |
|
|
fairly small, and ideal for
|
| 947 |
|
|
users of the Zip CPU who need the extra space on the FPGA fabric.
|
| 948 |
|
|
However, because this non--cached version has no cache, the maximum
|
| 949 |
|
|
instruction rate is limited to about one instruction per five clocks.
|
| 950 |
|
|
|
| 951 |
|
|
The second prefetch module is a pipelined prefetch with a cache. This
|
| 952 |
|
|
module tries to keep the instruction address within a window of valid
|
| 953 |
|
|
instruction addresses. While effective, it is not a traditional
|
| 954 |
|
|
cache implementation. One unique feature of this cache implementation,
|
| 955 |
|
|
however, is that it can be cleared in a single clock. A disappointing
|
| 956 |
|
|
feature, though, is that it needs an extra internal pipeline stage
|
| 957 |
|
|
to be implemented.
|
| 958 |
|
|
|
| 959 |
|
|
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read,
|
| 960 |
|
|
and immediate offset. This stage also determines whether the flags will
|
| 961 |
32 |
dgisselq |
be set or whether the result will be written back.
|
| 962 |
21 |
dgisselq |
\item {\bf Read Operands}: Read registers and apply any immediate values to
|
| 963 |
24 |
dgisselq |
them. There is no means of detecting or flagging arithmetic overflow
|
| 964 |
|
|
or carry when adding the immediate to the operand. This stage will
|
| 965 |
|
|
stall if any source operand is pending.
|
| 966 |
21 |
dgisselq |
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
|
| 967 |
36 |
dgisselq |
instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load)
|
| 968 |
|
|
and {\tt STO} (store) instructions.
|
| 969 |
21 |
dgisselq |
\begin{itemize}
|
| 970 |
36 |
dgisselq |
\item Loads will stall the entire pipeline until complete.
|
| 971 |
|
|
\item Condition codes are available upon completion of the ALU stage.
|
| 972 |
|
|
\item Issuing an instruction to the memory unit while the memory unit
|
| 973 |
|
|
is busy will stall the entire pipeline. If the bus deadlocks,
|
| 974 |
|
|
only a reset will release the CPU. (Watchdog timer, anyone?)
|
| 975 |
24 |
dgisselq |
\item The Zip CPU currently has no means of reading and acting on any
|
| 976 |
|
|
error conditions on the bus.
|
| 977 |
21 |
dgisselq |
\end{itemize}
|
| 978 |
32 |
dgisselq |
\item {\bf Write-Back}: Conditionally write back the result to the register
|
| 979 |
36 |
dgisselq |
set, applying the condition. This routine is bi-entrant: either the
|
| 980 |
21 |
dgisselq |
memory or the simple instruction may request a register write.
|
| 981 |
|
|
\end{enumerate}
|
| 982 |
|
|
|
| 983 |
24 |
dgisselq |
The Zip CPU does not support out of order execution. Therefore, if the memory
|
| 984 |
|
|
unit stalls, every other instruction stalls. Memory stores, however, can take
|
| 985 |
36 |
dgisselq |
place concurrently with ALU operations, although memory reads (loads) cannot.
|
| 986 |
24 |
dgisselq |
|
| 987 |
32 |
dgisselq |
\section{Pipeline Stalls}
|
| 988 |
|
|
The processing pipeline can and will stall for a variety of reasons. Some of
|
| 989 |
|
|
these are obvious, some less so. These reasons are listed below:
|
| 990 |
|
|
\begin{itemize}
|
| 991 |
|
|
\item When the prefetch cache is exhausted
|
| 992 |
21 |
dgisselq |
|
| 993 |
36 |
dgisselq |
This reason should be obvious. If the prefetch cache doesn't have the
|
| 994 |
|
|
instruction in memory, the entire pipeline must stall until enough of the
|
| 995 |
|
|
prefetch cache is loaded to support the next instruction.
|
| 996 |
21 |
dgisselq |
|
| 997 |
32 |
dgisselq |
\item While waiting for the pipeline to load following any taken branch, jump,
|
| 998 |
36 |
dgisselq |
return from interrupt or switch to interrupt context (5 stall cycles)
|
| 999 |
32 |
dgisselq |
|
| 1000 |
|
|
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
|
| 1001 |
|
|
be reloaded. Given that there are five stages to the pipeline, that accounts
|
| 1002 |
36 |
dgisselq |
for four of the five stalls. The fifth stall cycle is lost in the pipelined prefetch
|
| 1003 |
32 |
dgisselq |
stage which needs at least one clock with a valid PC before it can produce
|
| 1004 |
36 |
dgisselq |
a new output.
|
| 1005 |
32 |
dgisselq |
|
| 1006 |
36 |
dgisselq |
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
|
| 1007 |
|
|
{\tt LDI \$X,PC} instructions specially, however. These instructions, when
|
| 1008 |
|
|
not conditioned on the flags, can execute with only 3~stall cycles.
|
| 1009 |
|
|
|
| 1010 |
32 |
dgisselq |
\item When reading from a prior register while also adding an immediate offset
|
| 1011 |
|
|
\begin{enumerate}
|
| 1012 |
|
|
\item\ {\tt OPCODE ?,RA}
|
| 1013 |
|
|
\item\ {\em (stall)}
|
| 1014 |
|
|
\item\ {\tt OPCODE I+RA,RB}
|
| 1015 |
|
|
\end{enumerate}
|
| 1016 |
|
|
|
| 1017 |
|
|
Since the addition of the immediate to the register within OpB decoding gets applied
|
| 1018 |
|
|
during the read operand stage so that it can be nicely settled before the ALU,
|
| 1019 |
|
|
any instruction that will write back an operand must be separated from the
|
| 1020 |
|
|
opcode that will read and apply an immediate offset by one instruction. The
|
| 1021 |
|
|
good news is that this stall can easily be mitigated by proper scheduling.
|
| 1022 |
36 |
dgisselq |
That is, any instruction that does not add an immediate to {\tt RA} may be
|
| 1023 |
|
|
scheduled into the stall slot.
|
| 1024 |
32 |
dgisselq |
|
| 1025 |
36 |
dgisselq |
\item When any write to either the CC or PC Register is followed by a memory
|
| 1026 |
|
|
operation
|
| 1027 |
32 |
dgisselq |
\begin{enumerate}
|
| 1028 |
|
|
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
|
| 1029 |
|
|
\item\ {\em (stall, even if jump not taken)}
|
| 1030 |
36 |
dgisselq |
\item\ {\tt LOD \$X(RA),RB}
|
| 1031 |
32 |
dgisselq |
\end{enumerate}
|
| 1032 |
|
|
Since branches take place in the writeback stage, the Zip CPU will stall the
|
| 1033 |
|
|
pipeline for one clock anytime there may be a possible jump. This prevents
|
| 1034 |
|
|
an instruction from executing a memory access after the jump but before the
|
| 1035 |
|
|
jump is recognized.
|
| 1036 |
|
|
|
| 1037 |
36 |
dgisselq |
This stall may be mitigated by shuffling the operations immediately following
|
| 1038 |
|
|
a potential branch so that an ALU operation follows the branch instead of a
|
| 1039 |
|
|
memory operation.
|
| 1040 |
33 |
dgisselq |
|
| 1041 |
32 |
dgisselq |
\item When reading from the CC register after setting the flags
|
| 1042 |
|
|
\begin{enumerate}
|
| 1043 |
36 |
dgisselq |
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode}
|
| 1044 |
|
|
\item\ {\em (stall)}
|
| 1045 |
32 |
dgisselq |
\item\ {\tt TST sys.ccv,CC}
|
| 1046 |
|
|
\item\ {\tt BZ somewhere}
|
| 1047 |
|
|
\end{enumerate}
|
| 1048 |
|
|
|
| 1049 |
|
|
The reason for this stall is simply performance. Many of the flags are
|
| 1050 |
|
|
determined via combinatorial logic after the writeback instruction is
|
| 1051 |
|
|
determined. Trying to then place these into the input for one of the operands
|
| 1052 |
|
|
created a time delay loop that would no longer execute in a single 100~MHz
|
| 1053 |
|
|
clock cycle. (The time delay of the multiply within the ALU wasn't helping
|
| 1054 |
|
|
either \ldots).
|
| 1055 |
|
|
|
| 1056 |
33 |
dgisselq |
This stall may be eliminated via proper scheduling, by placing an instruction
|
| 1057 |
|
|
that does not set flags in between the ALU operation and the instruction
|
| 1058 |
|
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
| 1059 |
|
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
| 1060 |
|
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
| 1061 |
36 |
dgisselq |
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.
|
| 1062 |
33 |
dgisselq |
|
| 1063 |
32 |
dgisselq |
\item When waiting for a memory read operation to complete
|
| 1064 |
|
|
\begin{enumerate}
|
| 1065 |
|
|
\item\ {\tt LOD address,RA}
|
| 1066 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
| 1067 |
32 |
dgisselq |
\item\ {\tt OPCODE I+RA,RB}
|
| 1068 |
|
|
\end{enumerate}
|
| 1069 |
|
|
|
| 1070 |
36 |
dgisselq |
Remember, the Zip CPU does not support out of order execution. Therefore,
|
| 1071 |
32 |
dgisselq |
anytime the memory unit becomes busy both the memory unit and the ALU must
|
| 1072 |
|
|
stall until the memory unit is cleared. This is especially true of a load
|
| 1073 |
33 |
dgisselq |
instruction, which must still write its operand back to the register file.
|
| 1074 |
|
|
Store instructions are different, since they can be busy with no impact on
|
| 1075 |
|
|
later ALU write back operations. Hence, only loads stall the pipeline.
|
| 1076 |
32 |
dgisselq |
|
| 1077 |
|
|
This also assumes that the memory being accessed is a single cycle memory.
|
| 1078 |
|
|
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
|
| 1079 |
33 |
dgisselq |
as long as forty clocks. During this time the CPU and the external bus
|
| 1080 |
32 |
dgisselq |
will be busy, and unable to do anything else.
|
| 1081 |
|
|
|
| 1082 |
|
|
\item Memory operation followed by a memory operation
|
| 1083 |
|
|
\begin{enumerate}
|
| 1084 |
|
|
\item\ {\tt STO address,RA}
|
| 1085 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
| 1086 |
32 |
dgisselq |
\item\ {\tt LOD address,RB}
|
| 1087 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
| 1088 |
32 |
dgisselq |
\end{enumerate}
|
| 1089 |
|
|
|
| 1090 |
36 |
dgisselq |
In this case, the LOD instruction cannot start until the STO is finished.
|
| 1091 |
32 |
dgisselq |
With proper scheduling, it is possible to do something in the ALU while the
|
| 1092 |
36 |
dgisselq |
memory unit is busy with the STO instruction, but otherwise this pipeline will
|
| 1093 |
|
|
stall waiting for it to complete.
|
| 1094 |
32 |
dgisselq |
|
| 1095 |
|
|
Note that even though the Wishbone bus can support pipelined accesses at
|
| 1096 |
|
|
one access per clock, only the prefetch stage can take advantage of this.
|
| 1097 |
|
|
Load and Store instructions are stuck at one Wishbone cycle per instruction.
|
| 1098 |
36 |
dgisselq |
|
| 1099 |
|
|
\item When waiting for a conditional memory read operation to complete
|
| 1100 |
|
|
\begin{enumerate}
|
| 1101 |
|
|
\item\ {\tt LOD.Z address,RA}
|
| 1102 |
|
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
| 1103 |
|
|
\item\ {\tt OPCODE I+RA,RB}
|
| 1104 |
|
|
\end{enumerate}
|
| 1105 |
|
|
|
| 1106 |
|
|
In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus
|
| 1107 |
|
|
two cycles before using the bus, so there's a potential for an extra three
|
| 1108 |
|
|
cycle cost due to bus contention between the prefetch and the CPU.
|
| 1109 |
|
|
|
| 1110 |
|
|
This is true for both the LOD and the STO instructions, with the exception that
|
| 1111 |
|
|
the STO instruction will continue in parallel with any ALU instructions that
|
| 1112 |
|
|
follow it.
|
| 1113 |
|
|
|
| 1114 |
32 |
dgisselq |
\end{itemize}
|
| 1115 |
|
|
|
| 1116 |
|
|
|
| 1117 |
21 |
dgisselq |
\chapter{Peripherals}\label{chap:periph}
|
| 1118 |
24 |
dgisselq |
|
| 1119 |
|
|
While the previous chapter describes a CPU in isolation, the Zip System
|
| 1120 |
|
|
includes a minimum set of peripherals as well. These peripherals are shown
|
| 1121 |
|
|
in Fig.~\ref{fig:zipsystem}
|
| 1122 |
|
|
\begin{figure}\begin{center}
|
| 1123 |
|
|
\includegraphics[width=3.5in]{../gfx/system.eps}
|
| 1124 |
|
|
\caption{Zip System Peripherals}\label{fig:zipsystem}
|
| 1125 |
|
|
\end{center}\end{figure}
|
| 1126 |
|
|
and described here. They are designed to make
|
| 1127 |
|
|
the Zip CPU more useful in an Embedded Operating System environment.
|
| 1128 |
|
|
|
| 1129 |
21 |
dgisselq |
\section{Interrupt Controller}
|
| 1130 |
24 |
dgisselq |
|
| 1131 |
|
|
Perhaps the most important peripheral within the Zip System is the interrupt
|
| 1132 |
|
|
controller. While the Zip CPU itself can only handle one interrupt, and has
|
| 1133 |
|
|
only the one interrupt state: disabled or enabled, the interrupt controller
|
| 1134 |
|
|
can make things more interesting.
|
| 1135 |
|
|
|
| 1136 |
|
|
The Zip System interrupt controller module supports up to 15 interrupts, all
|
| 1137 |
|
|
controlled from one register. Bit~31 of the interrupt controller controls
|
| 1138 |
|
|
overall whether interrupts are enabled (1'b1) or disabled (1'b0). Bits~16--30
|
| 1139 |
|
|
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
|
| 1140 |
|
|
Bit~15 is an indicator showing whether or not any interrupt is active, and
|
| 1141 |
|
|
bits~0--14 indicate whether or not an individual interrupt is active.
|
| 1142 |
|
|
|
| 1143 |
|
|
The interrupt controller has been designed so that bits can be controlled
|
| 1144 |
|
|
individually without having any knowledge of the rest of the controller
|
| 1145 |
|
|
setting. To enable an interrupt, write to the register with the high order
|
| 1146 |
|
|
global enable bit set and the respective interrupt enable bit set. No other
|
| 1147 |
|
|
bits will be affected. To disable an interrupt, write to the register with
|
| 1148 |
|
|
the high order global enable bit cleared and the respective interrupt enable
|
| 1149 |
|
|
bit set. To clear an interrupt, write a `1' to that interrupt's status bit.
|
| 1150 |
|
|
Zeros written to the register have no effect, save that a zero written to the
|
| 1151 |
|
|
master enable will disable all interrupts.
|
| 1152 |
|
|
|
| 1153 |
|
|
As an example, suppose you wished to enable interrupt \#4. You would then
|
| 1154 |
|
|
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
|
| 1155 |
|
|
any past active state. When you later wish to disable this interrupt, you would
|
| 1156 |
|
|
write a {\tt 0x00100010} to the register. As before, this both disables the
|
| 1157 |
|
|
interrupt and clears the active indicator. This also has the side effect of
|
| 1158 |
|
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
| 1159 |
|
|
to re-enable any other interrupts.
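
The words in the example above follow directly from the bit layout. As a minimal sketch (the helper name {\tt pic\_word} is hypothetical and not part of the Zip System sources), the value to write can be built as follows, assuming interrupt $n$ uses status bit $n$ and enable bit $16+n$:

```python
GLOBAL_ENABLE = 1 << 31  # bit 31: master interrupt enable


def pic_word(irq: int, enable: bool) -> int:
    """Word to write to the controller to enable or disable interrupt `irq`.

    Bit 16+irq selects the per-interrupt enable, and bit irq acknowledges
    (clears) any pending state for that interrupt.  This is a model of the
    write semantics described in the text, not driver code.
    """
    if not 0 <= irq <= 14:
        raise ValueError("the controller has 15 interrupt lines, 0..14")
    word = (1 << (16 + irq)) | (1 << irq)
    if enable:
        # With bit 31 set the selected enable bit is turned on; with it
        # clear the same bit is turned off (and all interrupts are disabled).
        word |= GLOBAL_ENABLE
    return word
```

Here {\tt pic\_word(4, True)} reproduces the enable word for interrupt \#4, and {\tt pic\_word(4, False)} the disable word.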
|
| 1160 |
|
|
|
| 1161 |
|
|
The Zip System currently hosts two interrupt controllers, a primary and a
|
| 1162 |
|
|
secondary. The primary interrupt controller has one interrupt line which may
|
| 1163 |
|
|
come from an external interrupt controller, and one interrupt line from the
|
| 1164 |
|
|
secondary controller. Other primary interrupts include the system timers,
|
| 1165 |
|
|
the jiffies interrupt, and the manual cache interrupt. The secondary interrupt
|
| 1166 |
|
|
controller maintains an interrupt state for all of the processor accounting
|
| 1167 |
|
|
counters.
|
| 1168 |
|
|
|
| 1169 |
21 |
dgisselq |
\section{Counter}
|
| 1170 |
|
|
|
| 1171 |
|
|
The Zip Counter is a very simple counter: it just counts. It cannot be
|
| 1172 |
|
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
| 1173 |
|
|
counter just sets the current value, and it starts counting again from that
|
| 1174 |
|
|
value.
|
| 1175 |
|
|
|
| 1176 |
|
|
Eight counters are implemented in the Zip System for process accounting.
|
| 1177 |
|
|
This may change in the future, as nothing as yet uses these counters.
|
| 1178 |
|
|
|
| 1179 |
|
|
\section{Timer}
|
| 1180 |
|
|
|
| 1181 |
|
|
The Zip Timer is also very simple: it simply counts down to zero. When it
|
| 1182 |
|
|
transitions from a one to a zero it creates an interrupt.
|
| 1183 |
|
|
|
| 1184 |
|
|
Writing any non-zero value to the timer starts the timer. If the high order
|
| 1185 |
|
|
bit is set when writing to the timer, the timer becomes an interval timer and
|
| 1186 |
|
|
reloads its last start time on any interrupt. Hence, to mark seconds, one
|
| 1187 |
|
|
might set the timer to 100~million (the number of clocks per second), and
|
| 1188 |
|
|
set the high bit. Ever after, the timer will interrupt the CPU once per
|
| 1189 |
24 |
dgisselq |
second (assuming a 100~MHz clock). This reload capability also limits the
|
| 1190 |
|
|
maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.
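
As a minimal sketch of the arithmetic, assuming a 100~MHz system clock and that bit~31 is the interval (reload) flag as described above, the value written to a timer could be computed as follows. The helper names {\tt ticks\_for\_ms} and {\tt timer\_word} are hypothetical:

```python
INTERVAL_BIT = 1 << 31  # high-order bit: reload the start value on expiry


def ticks_for_ms(milliseconds: int, clock_hz: int = 100_000_000) -> int:
    """Number of clock ticks in the given number of milliseconds."""
    return milliseconds * clock_hz // 1000


def timer_word(ticks: int, interval: bool = False) -> int:
    """Value to write to a timer: a 31-bit count plus the interval flag."""
    if not 0 < ticks < (1 << 31):
        raise ValueError("timer count must be in 1 .. 2**31 - 1")
    return ticks | (INTERVAL_BIT if interval else 0)
```

For a once-per-second interval timer, {\tt timer\_word(ticks\_for\_ms(1000), interval=True)} yields {\tt 0x85f5e100}.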
|
| 1191 |
21 |
dgisselq |
|
| 1192 |
|
|
\section{Watchdog Timer}
|
| 1193 |
|
|
|
| 1194 |
|
|
The watchdog timer is no different from any of the other timers, save for one
|
| 1195 |
|
|
critical difference: the interrupt line from the watchdog
|
| 1196 |
|
|
timer is tied to the reset line of the CPU. Hence writing a `1' to the
|
| 1197 |
|
|
watchdog timer will always reset the CPU.
|
| 1198 |
32 |
dgisselq |
To stop the Watchdog timer, write a `0' to it. To start it,
|
| 1199 |
21 |
dgisselq |
write any other number to it---as with the other timers.
|
| 1200 |
|
|
|
| 1201 |
|
|
While the watchdog timer supports interval mode, it doesn't make as much sense
|
| 1202 |
|
|
as it did with the other timers.
|
| 1203 |
|
|
|
| 1204 |
|
|
\section{Jiffies}
|
| 1205 |
|
|
|
| 1206 |
|
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
| 1207 |
|
|
can request to be put to sleep until a certain number of `jiffies' have
|
| 1208 |
|
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
| 1209 |
|
|
from the peripheral (it only has the one location in address space), add the
|
| 1210 |
24 |
dgisselq |
sleep length to it, and write the result back to the peripheral. The zipjiffies
|
| 1211 |
21 |
dgisselq |
peripheral will record the value written to it only if it is nearer the current
|
| 1212 |
|
|
counter value than the currently waiting interrupt time. If no other
|
| 1213 |
|
|
interrupts are waiting, and this time is in the future, it will be enabled.
|
| 1214 |
|
|
(There is currently no way to disable a jiffie interrupt once set, other
|
| 1215 |
24 |
dgisselq |
than to disable the interrupt line in the interrupt controller.) The processor
|
| 1216 |
21 |
dgisselq |
may then place this sleep request into a list among other sleep requests.
|
| 1217 |
|
|
Once the timer expires, it would write the next Jiffy request to the peripheral
|
| 1218 |
|
|
and wake up the process whose timer had expired.
|
| 1219 |
|
|
|
| 1220 |
|
|
Indeed, the Jiffies register is nothing more than a glorified counter with
|
| 1221 |
|
|
an interrupt. Unlike the other counters, the Jiffies register cannot be set.
|
| 1222 |
|
|
Writes to the jiffies register create an interrupt time. When the Jiffies
|
| 1223 |
|
|
register later equals the value written to it, an interrupt will be asserted
|
| 1224 |
|
|
and the register then continues counting as though no interrupt had taken
|
| 1225 |
|
|
place.
|
| 1226 |
|
|
|
| 1227 |
|
|
The purpose of this register is to support alarm times within a CPU. To
|
| 1228 |
|
|
set an alarm for a particular process $N$ clocks in advance, read the current
|
| 1229 |
|
|
Jiffies value, add $N$, and write the sum back to the Jiffies register. The
|
| 1230 |
|
|
O/S must also keep track of values written to the Jiffies register. Thus,
|
| 1231 |
32 |
dgisselq |
when an `alarm' trips, it should be removed from the list of alarms, the list
|
| 1232 |
21 |
dgisselq |
should be sorted, and the next alarm in terms of Jiffies should be written
|
| 1233 |
|
|
to the register.
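
The bookkeeping described above can be sketched as a small model. This is an illustration only: the function names are hypothetical, the counter is assumed to be 32~bits wide and to wrap, and every pending alarm is assumed to lie less than $2^{32}$ ticks in the future.

```python
MASK32 = 0xFFFFFFFF  # the Jiffies counter is assumed to be 32 bits wide


def ticks_until(now, deadline):
    """Ticks from `now` to `deadline`, computed modulo 2**32."""
    return (deadline - now) & MASK32


def add_alarm(alarms, now, n):
    """Queue an alarm n ticks ahead; return the value to write to Jiffies."""
    alarms.append((now + n) & MASK32)
    # Sort by distance from `now` so the ordering survives counter wrap.
    alarms.sort(key=lambda d: ticks_until(now, d))
    return alarms[0]


def alarm_tripped(alarms, now):
    """Drop the alarm that just fired; return the next value to write, if any."""
    alarms.pop(0)
    alarms.sort(key=lambda d: ticks_until(now, d))
    return alarms[0] if alarms else None
```

Sorting by the wrapped distance rather than by the raw deadline keeps an alarm just past the wrap point from being mistaken for the earliest one.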
|
| 1234 |
|
|
|
| 1235 |
36 |
dgisselq |
\section{Direct Memory Access Controller}
|
| 1236 |
24 |
dgisselq |
|
| 1237 |
36 |
dgisselq |
The Direct Memory Access (DMA) controller can be used to either move memory
|
| 1238 |
|
|
from one location to another, to read from a peripheral into memory, or to
|
| 1239 |
|
|
write from memory to a peripheral, all without CPU intervention. Further,
|
| 1240 |
|
|
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
|
| 1241 |
|
|
any DMA memory move will by nature be faster than a corresponding program
|
| 1242 |
|
|
accomplishing the same move. To put this to numbers, it may take a program
|
| 1243 |
|
|
18~clocks per word transferred, whereas this DMA controller can move one
|
| 1244 |
|
|
word in two clocks--provided it has bus access. (The CPU gets priority over the
|
| 1245 |
|
|
bus.)
|
| 1246 |
24 |
dgisselq |
|
| 1247 |
36 |
dgisselq |
When copying memory from one location to another, the DMA controller will
|
| 1248 |
|
|
copy in units of a given transfer length--up to 1024 words at a time. It will
|
| 1249 |
|
|
read that transfer length into its internal buffer, and then write to the
|
| 1250 |
|
|
destination address from that buffer. If the CPU interrupts a DMA transfer,
|
| 1251 |
|
|
it will release the bus, let the CPU complete whatever it needs to do, and then
|
| 1252 |
|
|
restart its transfer by writing the contents of its internal buffer and then
|
| 1253 |
|
|
re-entering its read cycle again.
|
| 1254 |
24 |
dgisselq |
|
| 1255 |
36 |
dgisselq |
When coupled with a peripheral, the DMA controller can be configured to start
|
| 1256 |
|
|
a memory copy on an interrupt line going high. Further, the controller can be
|
| 1257 |
|
|
configured to issue reads from (or writes to) the same address instead of incrementing
|
| 1258 |
|
|
the address at each clock. The DMA completes once the total number of items
|
| 1259 |
|
|
specified (not the transfer length) have been transferred.
|
| 1260 |
|
|
|
| 1261 |
|
|
In each case, once the transfer is complete and the DMA unit returns to
|
| 1262 |
|
|
idle, the DMA will issue an interrupt.
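
The read-then-write behaviour described above can be sketched as a behavioural model. The function and its parameters are hypothetical names chosen for illustration; they do not correspond to the controller's register interface, and bus arbitration is not modelled.

```python
def dma_copy(mem, src, dst, total, chunk=1024, inc_src=True, inc_dst=True):
    """Model a DMA transfer of `total` words in bursts of at most `chunk`.

    Each burst is first read into an internal buffer and then written out.
    Clearing inc_src or inc_dst keeps that address fixed, as when reading
    from or writing to a peripheral register.
    """
    if not 0 < chunk <= 1024:
        raise ValueError("transfer length must be 1..1024 words")
    moved = 0
    while moved < total:
        n = min(chunk, total - moved)
        buf = []
        for _ in range(n):          # read phase: fill the internal buffer
            buf.append(mem[src])
            if inc_src:
                src += 1
        for word in buf:            # write phase: drain the buffer
            mem[dst] = word
            if inc_dst:
                dst += 1
        moved += n
    return moved                    # the completion interrupt would fire here
```

The transfer ends when the total item count is reached, not when a single burst of the transfer length completes.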
|
| 1263 |
|
|
|
| 1264 |
|
|
|
| 1265 |
21 |
dgisselq |
\chapter{Operation}\label{chap:ops}
|
| 1266 |
|
|
|
| 1267 |
33 |
dgisselq |
The Zip CPU, and even the Zip System, is not a System on a Chip (SoC). It
|
| 1268 |
|
|
needs to be connected to its operational environment in order to be used.
|
| 1269 |
|
|
Specifically, some per system adjustments need to be made:
|
| 1270 |
|
|
\begin{enumerate}
|
| 1271 |
|
|
\item The Zip System depends upon an external 32-bit Wishbone bus. This
|
| 1272 |
|
|
must exist, and must be connected to the Zip CPU for it to work.
|
| 1273 |
|
|
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}. This is
|
| 1274 |
|
|
the program counter of the first instruction following a reset.
|
| 1275 |
|
|
\item If you want the Zip System to start up on its own, you will need to
|
| 1276 |
|
|
set the {\tt START\_HALTED} parameter to zero. Otherwise, if you
|
| 1277 |
|
|
wish to manually start the CPU, that is if upon reset you want the
|
| 1278 |
|
|
CPU to start in its halted, reset state, then set this parameter to
|
| 1279 |
|
|
one.
|
| 1280 |
|
|
\item The third parameter to set is the number of interrupts you will be
|
| 1281 |
|
|
providing from external to the CPU. This can be anything from one
|
| 1282 |
|
|
to nine, but it cannot be zero. (Wire this line to a 1'b0 if you
|
| 1283 |
|
|
do not wish to support any external interrupts.)
|
| 1284 |
|
|
\item Finally, you need to place into some wishbone accessible address, whether
|
| 1285 |
|
|
RAM or (more likely) ROM, the initial instructions for the CPU.
|
| 1286 |
|
|
\end{enumerate}
|
| 1287 |
|
|
If you have enabled your CPU to start automatically, then upon power up the
|
| 1288 |
|
|
CPU will immediately start executing your instructions.
|
| 1289 |
|
|
|
| 1290 |
|
|
This is, however, not how I have used the Zip CPU. I have instead used the
|
| 1291 |
36 |
dgisselq |
Zip CPU in a more controlled environment. For me, the CPU starts in a
|
| 1292 |
33 |
dgisselq |
halted state, and waits to be told to start. Further, the RESET address is a
|
| 1293 |
|
|
location in RAM. After bringing up the board I am using, and further the
|
| 1294 |
|
|
bus that is on it, the RAM memory is then loaded externally with the program
|
| 1295 |
|
|
I wish the Zip System to run. Once the RAM is loaded, I release the CPU.
|
| 1296 |
|
|
The CPU then runs until its halt condition, at which point its task is
|
| 1297 |
|
|
complete.
|
| 1298 |
|
|
|
| 1299 |
|
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
| 1300 |
|
|
just not there yet.
|
| 1301 |
|
|
|
| 1302 |
36 |
dgisselq |
The rest of this chapter examines some common programming constructs, and
|
| 1303 |
|
|
how they might be applied to the Zip System.
|
| 1304 |
33 |
dgisselq |
|
| 1305 |
36 |
dgisselq |
\section{Example: Idle Task}
|
| 1306 |
|
|
One task every operating system needs is the idle task, the task that takes
|
| 1307 |
|
|
place when nothing else can run. On the Zip CPU, this task is quite simple,
|
| 1308 |
|
|
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
|
| 1309 |
|
|
\begin{table}\begin{center}
|
| 1310 |
|
|
\begin{tabular}{ll}
|
| 1311 |
|
|
{\tt idle\_task:} \\
|
| 1312 |
|
|
& {\em ; Wait for the next interrupt, then switch to supervisor task} \\
|
| 1313 |
|
|
& {\tt WAIT} \\
|
| 1314 |
|
|
& {\em ; When we come back, it's because the supervisor wishes to} \\
|
| 1315 |
|
|
& {\em ; wait for an interrupt again, so go back to the top.} \\
|
| 1316 |
|
|
& {\tt BRA idle\_task} \\
|
| 1317 |
|
|
\end{tabular}
|
| 1318 |
|
|
\caption{Example Idle Loop}\label{tbl:idle-asm}
|
| 1319 |
|
|
\end{center}\end{table}
|
| 1320 |
|
|
When this task runs, the CPU will fill up all of the pipeline stages up to the
|
| 1321 |
|
|
ALU. The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
|
| 1322 |
|
|
a sleep state where nothing more moves. Sure, there may be some more settling,
|
| 1323 |
|
|
the pipe cache may continue to read until full, other instructions may issue until
|
| 1324 |
|
|
the pipeline fills, but then everything will stall. Then, once an interrupt
|
| 1325 |
|
|
takes place, control passes to the supervisor task to handle the interrupt.
|
| 1326 |
|
|
When control passes back to this task, it will be on the next instruction.
|
| 1327 |
|
|
Since that next instruction sends us back to the top of the task, the idle
|
| 1328 |
|
|
task thus does nothing but wait for an interrupt.
|
| 1329 |
|
|
|
| 1330 |
|
|
This should be the lowest priority task, the task that runs when nothing else
|
| 1331 |
|
|
can. It will help lower the FPGA power usage overall---at least its dynamic
|
| 1332 |
|
|
power usage.
|
| 1333 |
|
|
|
| 1334 |
|
|
\section{Example: Memory Copy}
|
| 1335 |
|
|
One common operation is that of a memory move or copy. Consider the C code
|
| 1336 |
|
|
shown in Tbl.~\ref{tbl:memcp-c}.
|
| 1337 |
|
|
\begin{table}\begin{center}
|
| 1338 |
|
|
\parbox{4in}{\begin{tabbing}
|
| 1339 |
|
|
{\tt void} \= {\tt memcp(void *dest, void *src, int len) \{} \\
|
| 1340 |
|
|
\> {\tt for(int i=0; i<len; i++)} \\
|
| 1341 |
|
|
\> \hspace{0.2in} {\tt *dest++ = *src++;} \\
|
| 1342 |
|
|
\}
|
| 1343 |
|
|
\end{tabbing}}
|
| 1344 |
|
|
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
|
| 1345 |
|
|
\end{center}\end{table}
|
| 1346 |
|
|
This same code can be translated into Zip Assembly as shown in
|
| 1347 |
|
|
Tbl.~\ref{tbl:memcp-asm}.
|
| 1348 |
|
|
\begin{table}\begin{center}
|
| 1349 |
|
|
\begin{tabular}{ll}
|
| 1350 |
|
|
memcp: \\
|
| 1351 |
|
|
& {\em ; R0 = *dest, R1 = *src, R2 = LEN} \\
|
| 1352 |
|
|
& {\em ; The following will operate in 17 clocks per word minus one clock} \\
|
| 1353 |
|
|
& {\tt CMP 0,R2} \\
|
| 1354 |
|
|
& {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\
|
| 1355 |
|
|
& {\em ; (One stall on potentially writing to PC)} \\
|
| 1356 |
|
|
{\tt loop:} \\
& {\tt LOD (R1),R3} \\
|
| 1357 |
|
|
& {\em ; (4 stalls, cannot be scheduled away)} \\
|
| 1358 |
|
|
& {\tt STO R3,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
|
| 1359 |
|
|
& {\tt ADD 1,R0} \\
& {\tt ADD 1,R1} \\
|
| 1360 |
|
|
& {\tt SUB 1,R2} \\
|
| 1361 |
|
|
& {\tt BNZ loop} \\
|
| 1362 |
|
|
& {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\
|
| 1363 |
|
|
& {\tt RET} \\
|
| 1364 |
|
|
\end{tabular}
|
| 1365 |
|
|
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
|
| 1366 |
|
|
\end{center}\end{table}
|
| 1367 |
|
|
This example points out several things associated with the Zip CPU. First,
|
| 1368 |
|
|
a straightforward implementation of a for loop is not the fastest loop
|
| 1369 |
|
|
structure. For this reason, we have placed the test to continue at the
|
| 1370 |
|
|
end. Second, all pointers are {\tt void} pointers to arbitrary 32--bit
|
| 1371 |
|
|
data types. The Zip CPU does not have explicit support for smaller or larger
|
| 1372 |
|
|
data types, and so this memory copy cannot be applied at a byte level.
|
| 1373 |
|
|
Third, we've optimized the conditional jump to a return instruction into a
|
| 1374 |
|
|
conditional return instruction.
|
| 1375 |
|
|
|
| 1376 |
|
|
\section{Context Switch}
|
| 1377 |
|
|
|
| 1378 |
|
|
Fundamental to any multiprocessing system is the ability to switch from one
|
| 1379 |
|
|
task to the next. In the ZipSystem, this is accomplished in one of a couple
|
| 1380 |
|
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
| 1381 |
|
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
| 1382 |
|
|
\begin{enumerate}
|
| 1383 |
|
|
\item Check for a trap instruction. That is, if the user task requested a
|
| 1384 |
|
|
trap, we may not wish to adjust the context, check interrupts, or call
|
| 1385 |
|
|
the scheduler. Tbl.~\ref{tbl:trap-check}
|
| 1386 |
|
|
\begin{table}\begin{center}
|
| 1387 |
|
|
\begin{tabular}{ll}
|
| 1388 |
|
|
{\tt return\_to\_user:} \\
|
| 1389 |
|
|
& {\em; The instruction before the context switch processing must} \\
|
| 1390 |
|
|
& {\em; be the RTU instruction that enacted user mode in the first} \\
|
| 1391 |
|
|
& {\em; place. We show it here just for reference.} \\
|
| 1392 |
|
|
& {\tt RTU} \\
|
| 1393 |
|
|
{\tt trap\_check:} \\
|
| 1394 |
|
|
& {\tt MOV uCC,R0} \\
|
| 1395 |
|
|
& {\tt TST \$TRAP,R0} \\
|
| 1396 |
|
|
& {\tt BZ swap\_out} \\
|
| 1397 |
|
|
& {; \em Do something here to execute the trap} \\
|
| 1398 |
|
|
& {; \em Don't need to call the scheduler, so we can just return} \\
|
| 1399 |
|
|
& {\tt BRA return\_to\_user} \\
|
| 1400 |
|
|
\end{tabular}
|
| 1401 |
|
|
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check}
|
| 1402 |
|
|
\end{center}\end{table}
|
| 1403 |
|
|
shows the rudiments of this code, while showing nothing of how the
|
| 1404 |
|
|
actual trap would be implemented.
|
| 1405 |
|
|
|
| 1406 |
|
|
You may also wish to note that the instruction before the first instruction
|
| 1407 |
|
|
in our context swap {\em must be} a return to userspace instruction.
|
| 1408 |
|
|
Remember, the supervisor process is re--entered where it left off. This is
|
| 1409 |
|
|
different from many other processors that enter interrupt mode at some vector
|
| 1410 |
|
|
or other. In this case, we always enter supervisor mode right where we last
|
| 1411 |
|
|
left.\footnote{The one exception to this rule is upon reset where supervisor
|
| 1412 |
|
|
mode is entered at a pre--programmed wishbone memory address.}
|
| 1413 |
|
|
|
| 1414 |
|
|
\item Capture user counters. If the operating system is keeping track of
|
| 1415 |
|
|
system usage via the accounting counters, those counters need to be
|
| 1416 |
|
|
copied and accumulated into some master counter at this point.
|
| 1417 |
|
|
|
| 1418 |
|
|
\item Preserve the old context. This involves pushing all the user registers
|
| 1419 |
|
|
onto the user stack and then copying the resulting stack address
|
| 1420 |
|
|
into the task's task structure, as shown in Tbl.~\ref{tbl:context-out}.
|
| 1421 |
|
|
\begin{table}\begin{center}
|
| 1422 |
|
|
\begin{tabular}{ll}
|
| 1423 |
|
|
{\tt swap\_out:} \\
|
| 1424 |
|
|
& {\tt MOV -15(uSP),R1} \\
|
| 1425 |
|
|
& {\tt STO R1,stack(R12)} \\
|
| 1426 |
|
|
& {\tt MOV uPC,R0} \\
|
| 1427 |
|
|
& {\tt STO R0,15(R1)} \\
|
| 1428 |
|
|
& {\tt MOV uCC,R0} \\
|
| 1429 |
|
|
& {\tt STO R0,14(R1)} \\
|
| 1430 |
|
|
& {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
|
| 1431 |
|
|
& {\em ; elsewhere (in the task structure) }\\
|
| 1432 |
|
|
& {\tt MOV uR12,R0} \\
|
| 1433 |
|
|
& {\tt STO R0,13(R1)} \\
|
| 1434 |
|
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
| 1435 |
|
|
& {\tt MOV uR0,R0} \\
|
| 1436 |
|
|
& {\tt STO R0,1(R1)} \\
|
| 1437 |
|
|
\end{tabular}
|
| 1438 |
|
|
\caption{Example Storing User Task Context}\label{tbl:context-out}
|
| 1439 |
|
|
\end{center}\end{table}
|
| 1440 |
|
|
For the sake of discussion, we assume the supervisor maintains a
|
| 1441 |
|
|
pointer to the current task's structure in supervisor register
|
| 1442 |
|
|
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
|
| 1443 |
|
|
structure indicating where the stack pointer is to be kept within it.
|
| 1444 |
|
|
|
| 1445 |
|
|
For those who are still interested, the full code for this context
|
| 1446 |
|
|
save can be found as an assembler macro within the assembler
|
| 1447 |
|
|
include file, {\tt sys.i}.
|
| 1448 |
|
|
|
| 1449 |
|
|
\item Reset the watchdog timer. If you are using the watchdog timer, it should
|
| 1450 |
|
|
be reset on a context swap, to know that things are still working.
|
| 1451 |
|
|
Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
|
| 1452 |
|
|
\begin{table}\begin{center}
|
| 1453 |
|
|
\begin{tabular}{ll}
|
| 1454 |
|
|
\multicolumn{2}{l}{{\tt `define WATCHDOG\_ADDRESS 32'hc000\_0001}}\\
|
| 1455 |
|
|
\multicolumn{2}{l}{{\tt `define WATCHDOG\_TICKS 32'd1\_000\_000} {; \em = 10 ms}}\\
|
| 1456 |
|
|
& {\tt LDI WATCHDOG\_ADDRESS,R0} \\
|
| 1457 |
|
|
& {\tt LDI WATCHDOG\_TICKS,R1} \\
|
| 1458 |
|
|
& {\tt STO R1,(R0)}
|
| 1459 |
|
|
\end{tabular}
|
| 1460 |
|
|
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
|
| 1461 |
|
|
\end{center}\end{table}
|
| 1462 |
|
|
|
| 1463 |
|
|
\item Interrupt handling. An interrupt handler within the Zip System is nothing
|
| 1464 |
|
|
more than a task. At context swap time, the supervisor needs to
|
| 1465 |
|
|
disable all of the interrupts that have tripped, and then enable
|
| 1466 |
|
|
all of the tasks that would deal with each of these interrupts.
|
| 1467 |
|
|
These can be user tasks, run at higher priority than any other user
|
| 1468 |
|
|
tasks. Either way, they will need to re--enable their own interrupt
|
| 1469 |
|
|
themselves, if the interrupt is still relevant.
|
| 1470 |
|
|
|
| 1471 |
|
|
An example of this master interrupt handling is shown in
|
| 1472 |
|
|
Tbl.~\ref{tbl:pre-handler}.
|
| 1473 |
|
|
\begin{table}\begin{center}
|
| 1474 |
|
|
\begin{tabular}{ll}
|
| 1475 |
|
|
{\tt pre\_handler:} \\
|
| 1476 |
|
|
& {\tt LDI PIC\_ADDRESS,R0 } \\
|
| 1477 |
|
|
& {\em ; Start by grabbing the interrupt state from the interrupt}\\
|
| 1478 |
|
|
& {\em ; controller. We'll store this into the register R7 so that }\\
|
| 1479 |
|
|
& {\em ; we can keep and preserve this information for the scheduler}\\
|
| 1480 |
|
|
& {\em ; to use later. }\\
|
| 1481 |
|
|
& {\tt LOD (R0),R1} \\
|
| 1482 |
|
|
& {\tt MOV R1,R7 } \\
|
| 1483 |
|
|
& {\em ; As a next step, we need to acknowledge and disable all active}\\
|
| 1484 |
|
|
& {\em ; interrupts. We'll start by calculating all of our active}\\
|
| 1485 |
|
|
& {\em ; interrupts.}\\
|
| 1486 |
|
|
& {\tt AND 0x07fff,R1 } \\
|
| 1487 |
|
|
& {\em ; Put the active interrupts into the upper half of R1} \\
|
| 1488 |
|
|
& {\tt ROL 16,R1 } \\
|
| 1489 |
|
|
& {\tt LDILO 0x0ffff,R1 } \\
|
| 1490 |
|
|
& {\tt AND R7,R1}\\
|
| 1491 |
|
|
& {\em ; Acknowledge and disable active interrupts}\\
|
| 1492 |
|
|
& {\em ; This also disables all interrupts from the controller, so}\\
|
| 1493 |
|
|
& {\em ; we'll need to re-enable interrupts in general shortly } \\
|
| 1494 |
|
|
& {\tt STO R1,(R0) } \\
|
| 1495 |
|
|
& {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
|
| 1496 |
|
|
& {\em ; release any tasks that depended upon them. } \\
|
| 1497 |
|
|
\end{tabular}
|
| 1498 |
|
|
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
|
| 1499 |
|
|
\end{center}\end{table}
|
| 1500 |
|
|
|
| 1501 |
|
|
\item Calling the scheduler. This needs to be done to pick the next task
|
| 1502 |
|
|
to switch to. It may be an interrupt handler, or it may be a normal
|
| 1503 |
|
|
user task. From a priority standpoint, it would make sense that the
|
| 1504 |
|
|
interrupt handlers all have a higher priority than the user tasks,
|
| 1505 |
|
|
and that once they have been called the user tasks may then be called
|
| 1506 |
|
|
again. If no task is ready to run, run the idle task to wait for an
|
| 1507 |
|
|
interrupt.
|
| 1508 |
|
|
|
| 1509 |
|
|
This suggests a minimum of four task priorities:
|
| 1510 |
|
|
\begin{enumerate}
|
| 1511 |
|
|
\item Interrupt handlers, executed with their interrupts disabled
|
| 1512 |
|
|
\item Device drivers, executed with interrupts re-enabled
|
| 1513 |
|
|
\item User tasks
|
| 1514 |
|
|
\item The idle task, executed when nothing else is able to execute
|
| 1515 |
|
|
\end{enumerate}
|
| 1516 |
|
|
|
| 1517 |
|
|
For our purposes here, we'll just assume that a pointer to the current
|
| 1518 |
|
|
task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
|
| 1519 |
|
|
called, and that the next current task is likewise placed into
|
| 1520 |
|
|
{\tt R12}.
|
| 1521 |
|
|
|
| 1522 |
|
|
\item Restore the new task's context. Given that the scheduler has returned a
|
| 1523 |
|
|
task that can be run at this time, the stack pointer needs to be
|
| 1524 |
|
|
pulled out of the task's task structure, placed into the user
|
| 1525 |
|
|
register, and then the rest of the user registers need to be popped
|
| 1526 |
|
|
back off of the stack to run this task. An example of this is
|
| 1527 |
|
|
shown in Tbl.~\ref{tbl:context-in},
|
| 1528 |
|
|
\begin{table}\begin{center}
|
| 1529 |
|
|
\begin{tabular}{ll}
|
| 1530 |
|
|
{\tt swap\_in:} \\
|
| 1531 |
|
|
& {\tt LOD stack(R12),R1} \\
|
| 1532 |
|
|
& {\tt MOV 15(R1),uSP} \\
|
| 1533 |
|
|
& {\tt LOD 15(R1),R0} \\
|
| 1534 |
|
|
& {\tt MOV R0,uPC} \\
|
| 1535 |
|
|
& {\tt LOD 14(R1),R0} \\
|
| 1536 |
|
|
& {\tt MOV R0,uCC} \\
|
| 1537 |
|
|
& {\tt LOD 13(R1),R0} \\
|
| 1538 |
|
|
& {\tt MOV R0,uR12} \\
|
| 1539 |
|
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
| 1540 |
|
|
& {\tt LOD 1(R1),R0} \\
|
| 1541 |
|
|
& {\tt MOV R0,uR0} \\
|
| 1542 |
|
|
& {\tt BRA return\_to\_user} \\
|
| 1543 |
|
|
\end{tabular}
|
| 1544 |
|
|
\caption{Example Restoring User Task Context}\label{tbl:context-in}
|
| 1545 |
|
|
\end{center}\end{table}
|
| 1546 |
|
|
assuming as before that the task
|
| 1547 |
|
|
pointer is found in supervisor register {\tt R12}.
|
| 1548 |
|
|
As with storing the user context, the full code associated with
|
| 1549 |
|
|
restoring the user context can be found in the assembler include
|
| 1550 |
|
|
file, {\tt sys.i}.
|
| 1551 |
|
|
|
| 1552 |
|
|
\item Clear the userspace accounting registers. In order to keep track of
|
| 1553 |
|
|
per process system usage, these registers need to be cleared before
|
| 1554 |
|
|
reactivating the userspace process. That way, upon the next
|
| 1555 |
|
|
interrupt, we'll know how many clocks the userspace program has
|
| 1556 |
|
|
encountered, and how many instructions it was able to issue in
|
| 1557 |
|
|
that many clocks.
|
| 1558 |
|
|
|
| 1559 |
|
|
\item Jump back to the instruction just before saving the last task's context,
|
| 1560 |
|
|
because that location in memory contains the return from interrupt
|
| 1561 |
|
|
command that we are going to need to execute, in order to guarantee
|
| 1562 |
|
|
that we return back here again.
|
| 1563 |
|
|
\end{enumerate}
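
The mask arithmetic in the {\tt pre\_handler} listing of Tbl.~\ref{tbl:pre-handler} can be checked with a short model. This is a sketch only: the function name {\tt ack\_word} is hypothetical, {\tt ROL} is assumed to be a 32-bit rotate left, and {\tt LDILO} is assumed to replace the low 16~bits of the register.

```python
def ack_word(state: int) -> int:
    """Word stored back to the interrupt controller by the pre-handler.

    `state` is the value read from the controller (kept in R7 in the
    listing).  The result has the enable bit set for every interrupt that
    is both active and enabled, plus the active status bits themselves,
    with bit 31 clear so that the write disables and acknowledges them.
    """
    r1 = state & 0x7FFF             # AND 0x07fff,R1 : active interrupts
    r1 = (r1 << 16) & 0xFFFFFFFF    # ROL 16,R1 : upper half was zero, so
                                    #   the rotate acts as a plain shift
    r1 |= 0xFFFF                    # LDILO 0x0ffff,R1 : low half all ones
    return r1 & state               # AND R7,R1
```

For example, with interrupts 0, 1, 4 and 5 enabled and interrupts 0 and 4 active ({\tt 0x80330011}), the word written back is {\tt 0x00110011}: only the two active interrupts are disabled and acknowledged.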
|
| 1564 |
|
|
|
| 1565 |
21 |
dgisselq |
\chapter{Registers}\label{chap:regs}
|
| 1566 |
|
|
|
| 1567 |
24 |
dgisselq |
The ZipSystem registers fall into two categories, ZipSystem internal registers
|
| 1568 |
|
|
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
|
| 1569 |
|
|
\begin{table}[htbp]
|
| 1570 |
|
|
\begin{center}\begin{reglist}
|
| 1571 |
32 |
dgisselq |
PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
|
| 1572 |
|
|
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
|
| 1573 |
36 |
dgisselq |
& \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline
|
| 1574 |
32 |
dgisselq |
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
|
| 1575 |
|
|
TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
|
| 1576 |
|
|
TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
|
| 1577 |
|
|
TMRC & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
|
| 1578 |
|
|
JIFF & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
|
| 1579 |
|
|
MTASK & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
|
| 1580 |
|
|
MMSTL & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
|
| 1581 |
|
|
MPSTL & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
|
| 1582 |
|
|
MICNT & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
|
| 1583 |
|
|
UTASK & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
|
| 1584 |
|
|
UMSTL & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
|
| 1585 |
|
|
UPSTL & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
|
| 1586 |
|
|
UICNT & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
|
| 1587 |
36 |
dgisselq |
DMACTRL & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
|
| 1588 |
|
|
DMALEN & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
|
| 1589 |
|
|
DMASRC & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
|
| 1590 |
|
|
DMADST & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
|
| 1591 |
32 |
dgisselq |
% Cache & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
|
| 1592 |
24 |
dgisselq |
\end{reglist}
|
| 1593 |
|
|
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
|
| 1594 |
|
|
\end{center}\end{table}
|
| 1595 |
33 |
dgisselq |
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
|
| 1596 |
24 |
dgisselq |
\begin{table}[htbp]
|
| 1597 |
|
|
\begin{center}\begin{reglist}
|
| 1598 |
|
|
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
|
| 1599 |
|
|
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
|
| 1600 |
|
|
\end{reglist}
|
| 1601 |
|
|
\caption{Zip System Debug Registers}\label{tbl:dbgregs}
|
| 1602 |
|
|
\end{center}\end{table}
|
| 1603 |
|
|
|
| 1604 |
33 |
dgisselq |
\section{Peripheral Registers}
|
| 1605 |
|
|
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
|
| 1606 |
|
|
CPU's address space. These may be accessed by the CPU at these addresses,
|
| 1607 |
|
|
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
|
| 1608 |
|
|
These registers will be discussed briefly again here.
|
| 1609 |
24 |
dgisselq |
|
| 1610 |
33 |
dgisselq |
The Zip CPU Interrupt controller has four different types of bits, as shown in
|
| 1611 |
|
|
Tbl.~\ref{tbl:picbits}.
|
| 1612 |
|
|
\begin{table}\begin{center}
|
| 1613 |
|
|
\begin{bitlist}
|
| 1614 |
|
|
31 & R/W & Master Interrupt Enable\\\hline
|
| 1615 |
|
|
30\ldots 16 & R/W & Interrupt Enables, write '1' to change\\\hline
|
| 1616 |
|
|
15 & R & Current Master Interrupt State\\\hline
|
| 1617 |
|
|
14\ldots 0 & R/W & Input Interrupt states, write '1' to clear\\\hline
|
| 1618 |
|
|
\end{bitlist}
|
| 1619 |
|
|
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
|
| 1620 |
|
|
\end{center}\end{table}
|
| 1621 |
|
|
The high order bit, or bit--31, is the master interrupt enable bit. When this
|
| 1622 |
|
|
bit is set, then any time an interrupt occurs the CPU will be interrupted and
|
| 1623 |
|
|
will switch to supervisor mode, etc.
|
| 1624 |
|
|
|
| 1625 |
|
|
Bits 30~\ldots 16 are interrupt enable bits. Should the interrupt line go
|
| 1626 |
|
|
high while enabled, an interrupt will be generated. To set an interrupt enable
|
| 1627 |
|
|
bit, one needs to write the master interrupt enable while writing a `1' to this
|
| 1628 |
|
|
bit. To clear, one need only write a `0' to the master interrupt enable,
|
| 1629 |
|
|
while leaving this line high.
|
| 1630 |
|
|
|
| 1631 |
|
|
Bits 14\ldots 0 are the current state of the interrupt vector. Interrupt lines
|
| 1632 |
|
|
trip when they go high, and remain tripped until they are acknowledged. If
|
| 1633 |
|
|
the interrupt goes high for longer than one pulse, it may be high when a clear
|
| 1634 |
|
|
is requested. If so, the interrupt will not clear. The line must go low
|
| 1635 |
|
|
again before the status bit can be cleared.
|
| 1636 |
|
|
|
| 1637 |
|
|
As an example, consider the following scenario where the Zip CPU supports four
|
| 1638 |
|
|
interrupts, 3\ldots0.
|
| 1639 |
|
|
\begin{enumerate}
|
| 1640 |
|
|
\item The Supervisor will first, while in the interrupts disabled mode,
|
| 1641 |
|
|
write a {\tt 32'h800f000f} to the controller. The supervisor may then
|
| 1642 |
|
|
switch to the user state with interrupts enabled.
|
| 1643 |
|
|
\item When an interrupt occurs, the supervisor will switch to the interrupt
|
| 1644 |
|
|
state. It will then cycle through the interrupt bits to learn which
|
| 1645 |
|
|
interrupt handler to call.
|
| 1646 |
|
|
\item If the interrupt handler expects more interrupts, it will clear its
|
| 1647 |
|
|
current interrupt when it is done handling the interrupt in question.
|
| 1648 |
|
|
To do this, it will write a '1' to the low order interrupt mask,
|
| 1649 |
|
|
such as writing a {\tt 32'h80000001}.
|
| 1650 |
|
|
\item If the interrupt handler does not expect any more interrupts, it will
|
| 1651 |
|
|
instead clear the interrupt from the controller by writing a
|
| 1652 |
|
|
{\tt 32'h00010001} to the controller.
|
| 1653 |
|
|
\item Once all interrupts have been handled, the supervisor will write a
|
| 1654 |
|
|
{\tt 32'h80000000} to the interrupt register to re-enable interrupt
|
| 1655 |
|
|
generation.
|
| 1656 |
|
|
\item The supervisor should also check the user trap bit, and possible soft
|
| 1657 |
|
|
interrupt bits here, but this action has nothing to do with the
|
| 1658 |
|
|
interrupt control register.
|
| 1659 |
|
|
\item The supervisor will then leave interrupt mode, possibly adjusting
|
| 1660 |
|
|
whichever task is running, by executing a return from interrupt
|
| 1661 |
|
|
command.
|
| 1662 |
|
|
\end{enumerate}
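
The register writes in the scenario above can be sketched in C. This is only an illustration: the macro and function names ({\tt PIC\_MASTER\_ENABLE}, {\tt pic\_enable\_word}, {\tt pic\_ack\_word}, {\tt pic\_disable\_word}) are hypothetical and are not taken from the Zip CPU sources. Only the bit positions and the three example words come from the text.

```c
#include <stdint.h>

/* Bit layout as described for the interrupt controller register.
 * All names here are illustrative assumptions. */
#define PIC_MASTER_ENABLE 0x80000000u /* bit 31 */
#define PIC_ENABLE_SHIFT  16          /* enable bits start at bit 16 */
#define PIC_LINE_MASK     0x7fffu     /* at most 15 interrupt lines */

/* Word that enables the lines in `lines` and clears any pending state
 * for them. Bit 31 is set, as required when setting enable bits. */
static uint32_t pic_enable_word(uint32_t lines)
{
    lines &= PIC_LINE_MASK;
    return PIC_MASTER_ENABLE | (lines << PIC_ENABLE_SHIFT) | lines;
}

/* Word that acknowledges (clears) the pending state of `lines`
 * without touching any enable bit, as in the 32'h80000001 example. */
static uint32_t pic_ack_word(uint32_t lines)
{
    return PIC_MASTER_ENABLE | (lines & PIC_LINE_MASK);
}

/* Word that disables `lines` and clears their pending state.
 * Bit 31 is written as zero, as in the 32'h00010001 example. */
static uint32_t pic_disable_word(uint32_t lines)
{
    lines &= PIC_LINE_MASK;
    return (lines << PIC_ENABLE_SHIFT) | lines;
}
```

With these helpers, {\tt pic\_enable\_word(0xf)} yields the {\tt 32'h800f000f} of the first step, while {\tt pic\_ack\_word(1)} and {\tt pic\_disable\_word(1)} yield the two words a handler for interrupt zero would write.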
|
| 1663 |
|
|
|
| 1664 |
|
|
Leaving the interrupt controller, we show the timer register bit definitions
|
| 1665 |
|
|
in Tbl.~\ref{tbl:tmrbits}.
|
| 1666 |
|
|
\begin{table}\begin{center}
|
| 1667 |
|
|
\begin{bitlist}
|
| 1668 |
|
|
31 & R/W & Auto-Reload\\\hline
|
| 1669 |
|
|
30\ldots 0 & R/W & Current timer value\\\hline
|
| 1670 |
|
|
\end{bitlist}
|
| 1671 |
|
|
\caption{Timer Register Bits}\label{tbl:tmrbits}
|
| 1672 |
|
|
\end{center}\end{table}
|
| 1673 |
|
|
As you may recall, the timer just counts down to zero and then trips an
|
| 1674 |
|
|
interrupt. Writing to the current timer value sets that value, and reading
|
| 1675 |
|
|
from it returns that value. Writing to the current timer value while also
|
| 1676 |
|
|
setting the auto--reload bit will send the timer into an auto--reload mode.
|
| 1677 |
|
|
In this mode, upon setting its interrupt bit for one cycle, the timer will
|
| 1678 |
|
|
also reset itself back to the value of the timer that was written to it when
|
| 1679 |
|
|
the auto--reload option was written to it. To clear and stop the timer,
|
| 1680 |
|
|
just simply write a `32'h00' to this register.
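
As a small sketch of the two timer modes just described, the following C helpers build the word to be written to a timer register. The names {\tt TMR\_AUTO\_RELOAD}, {\tt timer\_oneshot\_word} and {\tt timer\_periodic\_word} are assumptions made for this example only.

```c
#include <stdint.h>

/* Timer register layout from the bit table; names are illustrative. */
#define TMR_AUTO_RELOAD 0x80000000u /* bit 31 */
#define TMR_VALUE_MASK  0x7fffffffu /* bits 30..0 */

/* Count down from `ticks` once, then trip the interrupt and stop. */
static uint32_t timer_oneshot_word(uint32_t ticks)
{
    return ticks & TMR_VALUE_MASK;
}

/* Count down from `ticks` and reload the same value on every expiry. */
static uint32_t timer_periodic_word(uint32_t ticks)
{
    return TMR_AUTO_RELOAD | (ticks & TMR_VALUE_MASK);
}
```

Assuming the timer decrements once per clock of a 100~MHz system clock, writing {\tt timer\_periodic\_word(100000)} would request an interrupt roughly once per millisecond, and writing {\tt timer\_oneshot\_word(0)} is the all zeros word that stops the timer.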
|
| 1681 |
|
|
|
| 1682 |
|
|
The Jiffies register is somewhat similar in that the register always changes.
|
| 1683 |
|
|
In this case, the register counts up, whereas the timer always counted down.
|
| 1684 |
|
|
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
|
| 1685 |
|
|
\begin{table}\begin{center}
|
| 1686 |
|
|
\begin{bitlist}
|
| 1687 |
|
|
31\ldots 0 & R & Current jiffy value\\\hline
|
| 1688 |
|
|
31\ldots 0 & W & Value/time of next interrupt\\\hline
|
| 1689 |
|
|
\end{bitlist}
|
| 1690 |
|
|
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
|
| 1691 |
|
|
\end{center}\end{table}
|
| 1692 |
|
|
always return the time value contained in the register. Writes greater than
|
| 1693 |
|
|
the current Jiffy value, that is where the new value minus the old value is
|
| 1694 |
|
|
greater than zero while ignoring truncation, will set a new Jiffy interrupt
|
| 1695 |
|
|
time. At that time, the Jiffy vector will clear, and another interrupt time
|
| 1696 |
|
|
may either be written to it, or it will just continue counting without
|
| 1697 |
|
|
activating any more interrupts.
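
The ``greater than zero while ignoring truncation'' test above is a comparison modulo $2^{32}$. A minimal C sketch of such a comparison follows. The function name {\tt jiffies\_is\_future} is hypothetical, and this models the rule as stated in the text rather than the actual hardware logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when `when` lies ahead of `now` on a 32-bit counter
 * that wraps around. The difference is taken modulo 2^32, so a
 * deadline just past the wrap point still counts as being in the
 * future. A difference of zero, or of 2^31 or more, is treated as
 * "not in the future". */
static bool jiffies_is_future(uint32_t now, uint32_t when)
{
    uint32_t diff = when - now; /* wraps modulo 2^32 */
    return diff != 0u && diff <= 0x7fffffffu;
}
```

Under this rule a write of {\tt 0x00000010} while the counter reads {\tt 0xfffffff0} is accepted as a time 32 jiffies ahead, whereas the reverse case is treated as a time in the past.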
|
| 1698 |
|
|
|
| 1699 |
|
|
The Zip CPU also supports several counter peripherals, mostly in the way of
|
| 1700 |
|
|
process accounting. These peripherals have a single register associated with
|
| 1701 |
|
|
them, shown in Tbl.~\ref{tbl:ctrbits}.
|
| 1702 |
|
|
\begin{table}\begin{center}
|
| 1703 |
|
|
\begin{bitlist}
|
| 1704 |
|
|
31\ldots 0 & R/W & Current counter value\\\hline
|
| 1705 |
|
|
\end{bitlist}
|
| 1706 |
|
|
\caption{Counter Register Bits}\label{tbl:ctrbits}
|
| 1707 |
|
|
\end{center}\end{table}
|
| 1708 |
|
|
Writes to this register set the new counter value. Reads read the current
|
| 1709 |
|
|
counter value.
|
| 1710 |
|
|
|
| 1711 |
|
|
The current design operation of these counters is that of performance counting.
|
| 1712 |
|
|
Two sets of four registers are available for keeping track of performance.
|
| 1713 |
|
|
The first is a task counter. This just counts clock ticks. The second
|
| 1714 |
|
|
counter is a prefetch stall counter, then a master stall counter. These
|
| 1715 |
|
|
allow the CPU to be evaluated as to how efficient it is. The fourth and
|
| 1716 |
|
|
final counter is an instruction counter, which counts how many instructions the
|
| 1717 |
|
|
CPU has issued.
|
| 1718 |
|
|
|
| 1719 |
|
|
It is envisioned that these counters will be used as follows: First, every time
|
| 1720 |
|
|
a master counter rolls over, the supervisor (Operating System) will record
|
| 1721 |
|
|
the fact. Second, whenever activating a user task, the Operating System will
|
| 1722 |
|
|
set the four user counters to zero. When the user task has completed, the
|
| 1723 |
|
|
Operating System will read the counters back, to determine how much of the
|
| 1724 |
|
|
CPU the process had consumed.
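
The accounting scheme just outlined might look as follows in C. This is a sketch under assumptions: the four user counters are taken to be consecutive words in the order of Tbl.~\ref{tbl:zpregs} ({\tt UTASK}, {\tt UMSTL}, {\tt UPSTL}, {\tt UICNT}), the caller supplies a pointer to the first of them, and the type and function names are invented for the example.

```c
#include <stdint.h>

/* Index of each user counter relative to UTASK (assumed layout). */
enum { UTASK = 0, UMSTL = 1, UPSTL = 2, UICNT = 3, NUM_USER_CTRS = 4 };

/* Snapshot of what one user task consumed while it was running. */
typedef struct {
    uint32_t clocks;       /* UTASK: clock ticks */
    uint32_t stalls;       /* UMSTL: stall cycles */
    uint32_t pf_stalls;    /* UPSTL: prefetch stall cycles */
    uint32_t instructions; /* UICNT: instructions issued */
} task_usage;

/* Zero the four user counters before activating a user task. */
static void user_counters_clear(volatile uint32_t *uctr)
{
    for (int i = 0; i < NUM_USER_CTRS; i++)
        uctr[i] = 0;
}

/* Read the four user counters back once the task has been suspended. */
static task_usage user_counters_read(const volatile uint32_t *uctr)
{
    task_usage u;
    u.clocks = uctr[UTASK];
    u.stalls = uctr[UMSTL];
    u.pf_stalls = uctr[UPSTL];
    u.instructions = uctr[UICNT];
    return u;
}
```

Taking the base pointer as a parameter keeps the sketch independent of how a particular tool chain maps the Zip CPU's word addresses onto C pointers.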
|
| 1725 |
|
|
|
| 1726 |
36 |
dgisselq |
The final peripheral to discuss is the DMA controller. This controller
|
| 1727 |
|
|
has four registers. Of these four, the length, source and destination address
|
| 1728 |
|
|
registers should need no further explanation. They are full 32--bit registers
|
| 1729 |
|
|
specifying the entire transfer length, the starting address to read from, and
|
| 1730 |
|
|
the starting address to write to. The registers can be written to when the
|
| 1731 |
|
|
DMA is idle, and read at any time. The control register, however, will need
|
| 1732 |
|
|
some more explanation.
|
| 1733 |
|
|
|
| 1734 |
|
|
The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
|
| 1735 |
|
|
\begin{table}\begin{center}
|
| 1736 |
|
|
\begin{bitlist}
|
| 1737 |
|
|
31 & R & DMA Active\\\hline
|
| 1738 |
|
|
30 & R & Wishbone error, transaction aborted (cleared on any write)\\\hline
|
| 1739 |
|
|
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline
|
| 1740 |
|
|
28 & R/W & Set to '1' to prevent the controller from incrementing the
|
| 1741 |
|
|
destination address, '0' for normal memory copy. \\\hline
|
| 1742 |
|
|
27 \ldots 16 & W & The DMA Key. Write a 12'hfed to these bits to start
any DMA transfer. \\\hline
|
| 1744 |
|
|
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline
|
| 1745 |
|
|
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
|
| 1746 |
|
|
have yet to be written. \\\hline
|
| 1747 |
|
|
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately
|
| 1748 |
|
|
upon receiving a valid key.\\\hline
|
| 1749 |
|
|
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
|
| 1750 |
|
|
9\ldots 0 & R/W & Intermediate transfer length minus one. Thus, to transfer
|
| 1751 |
|
|
one item at a time set this value to 0. To transfer 1024 at a time,
|
| 1752 |
|
|
set it to 1023.\\\hline
|
| 1753 |
|
|
\end{bitlist}
|
| 1754 |
|
|
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
|
| 1755 |
|
|
\end{center}\end{table}
|
| 1756 |
|
|
This control register has been designed so that the common case of memory
|
| 1757 |
|
|
access need only set the key and the transfer length. Hence, writing a
|
| 1758 |
|
|
\hbox{32'h0fed03ff} to the control register will start any memory transfer.
|
| 1759 |
|
|
On the other hand, if you wished to read from a serial port (constant address)
|
| 1760 |
|
|
and put the result into a buffer every time a word was available, you
|
| 1761 |
|
|
might wish to write \hbox{32'h2fed8000}--this assumes, of course, that you
|
| 1762 |
|
|
have a serial port wired to the zero bit of this interrupt control. (The
|
| 1763 |
|
|
DMA controller does not use the interrupt controller, and cannot clear
|
| 1764 |
|
|
interrupts.) As a third example, if you wished to write to an external
|
| 1765 |
|
|
FIFO anytime it was less than half full (had fewer than 512 items), and
|
| 1766 |
|
|
interrupt line 3 indicated this condition, you might wish to issue a
|
| 1767 |
|
|
\hbox{32'h1fed8dff} to this port.
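
To make the field packing concrete, here is a hedged C sketch that assembles a DMA control word from its parts. The macro names and the function {\tt dma\_ctrl\_word} are assumptions for illustration. The bit positions follow Tbl.~\ref{tbl:dmacbits}, and the intermediate length is stored as the chunk size minus one.

```c
#include <stdint.h>

/* Field positions from the DMA control register table.
 * Macro and function names are illustrative only. */
#define DMA_KEY          (0xfedu << 16) /* bits 27..16: start key */
#define DMA_FIXED_SRC    (1u << 29)     /* do not increment source */
#define DMA_FIXED_DST    (1u << 28)     /* do not increment destination */
#define DMA_ON_INTERRUPT (1u << 15)     /* wait for an interrupt line */

/* Build a control word.
 *   flags:    any OR of DMA_FIXED_SRC, DMA_FIXED_DST, DMA_ON_INTERRUPT
 *   int_line: interrupt line select, 0..31 (bits 14..10)
 *   chunk:    items moved per intermediate transfer, 1..1024,
 *             stored in bits 9..0 as chunk - 1 */
static uint32_t dma_ctrl_word(uint32_t flags, uint32_t int_line,
                              uint32_t chunk)
{
    return DMA_KEY | flags
         | ((int_line & 0x1fu) << 10)
         | ((chunk - 1u) & 0x3ffu);
}
```

Here {\tt dma\_ctrl\_word(0, 0, 1024)} reproduces the plain memory copy word {\tt 32'h0fed03ff}, and {\tt dma\_ctrl\_word(DMA\_FIXED\_SRC | DMA\_ON\_INTERRUPT, 0, 1)} reproduces the serial port word {\tt 32'h2fed8000}.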
|
| 1768 |
|
|
|
| 1769 |
33 |
dgisselq |
\section{Debug Port Registers}
|
| 1770 |
|
|
Accessing the Zip System via the debug port isn't as straightforward as
|
| 1771 |
|
|
accessing the system via the wishbone bus. The debug port itself has been
|
| 1772 |
|
|
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
|
| 1773 |
|
|
Access to the Zip System begins with the Debug Control register, shown in
|
| 1774 |
|
|
Tbl.~\ref{tbl:dbgctrl}.
|
| 1775 |
|
|
\begin{table}\begin{center}
|
| 1776 |
|
|
\begin{bitlist}
|
| 1777 |
|
|
31\ldots 14 & R & Reserved\\\hline
|
| 1778 |
|
|
13 & R & CPU GIE setting\\\hline
|
| 1779 |
|
|
12 & R & CPU is sleeping\\\hline
|
| 1780 |
|
|
11 & W & Command clear PF cache\\\hline
|
| 1781 |
|
|
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline
|
| 1782 |
|
|
9 & R & Stall Status, '1' if CPU is busy\\\hline
|
| 1783 |
36 |
dgisselq |
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline
|
| 1784 |
33 |
dgisselq |
7 & R & Interrupt Request \\\hline
|
| 1785 |
|
|
6 & R/W & Command RESET \\\hline
|
| 1786 |
|
|
5\ldots 0 & R/W & Debug Register Address \\\hline
|
| 1787 |
|
|
\end{bitlist}
|
| 1788 |
|
|
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
|
| 1789 |
|
|
\end{center}\end{table}
|
| 1790 |
|
|
|
| 1791 |
|
|
The first step in debugging access is to determine whether or not the CPU
|
| 1792 |
|
|
is halted, and to halt it if not. To do this, first write a '1' to the
|
| 1793 |
|
|
Command HALT bit. This will halt the CPU and place it into debug mode.
|
| 1794 |
|
|
Once the CPU is halted, the stall status bit will drop to zero. Thus,
|
| 1795 |
|
|
if bit 10 is high and bit 9 low, the debug port is open to examine the
|
| 1796 |
|
|
internal state of the CPU.
|
| 1797 |
|
|
|
| 1798 |
|
|
At this point, the external debugger may examine internal state information
|
| 1799 |
|
|
from within the CPU. To do this, first write again to the command register
|
| 1800 |
|
|
a value (with command halt still high) containing the address of an internal
|
| 1801 |
|
|
register of interest in the bottom 6~bits. Internal registers that may be
|
| 1802 |
|
|
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
|
| 1803 |
|
|
\begin{table}\begin{center}
|
| 1804 |
|
|
\begin{reglist}
|
| 1805 |
|
|
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
|
| 1806 |
|
|
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
|
| 1807 |
|
|
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
|
| 1808 |
|
|
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
|
| 1809 |
|
|
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
|
| 1810 |
|
|
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
|
| 1811 |
|
|
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
|
| 1812 |
|
|
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
|
| 1813 |
|
|
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
|
| 1814 |
|
|
uPC & 31 & 32 & R/W & User Program Counter\\\hline
|
| 1815 |
|
|
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
|
| 1816 |
|
|
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
|
| 1817 |
|
|
CCHE & 34 & 32 & R/W & Manual Cache Controller\\\hline
|
| 1818 |
|
|
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
|
| 1819 |
|
|
TMRA & 36 & 32 & R/W & Timer A\\\hline
|
| 1820 |
|
|
TMRB & 37 & 32 & R/W & Timer B\\\hline
|
| 1821 |
|
|
TMRC & 38 & 32 & R/W & Timer C\\\hline
|
| 1822 |
|
|
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
|
| 1823 |
|
|
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
|
| 1824 |
|
|
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
|
| 1825 |
|
|
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
|
| 1826 |
|
|
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
|
| 1827 |
|
|
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
|
| 1828 |
|
|
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
|
| 1829 |
|
|
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
|
| 1830 |
|
|
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
|
| 1831 |
|
|
\end{reglist}
|
| 1832 |
|
|
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
|
| 1833 |
|
|
\end{center}\end{table}
|
| 1834 |
|
|
Primarily, these ``registers'' include access to the entire CPU register
|
| 1835 |
36 |
dgisselq |
set, as well as the internal peripherals. To read one of these registers
|
| 1836 |
33 |
dgisselq |
once the address is set, simply issue a read from the data port. To write
|
| 1837 |
|
|
one of these registers or peripheral ports, simply write to the data port
|
| 1838 |
|
|
after setting the proper address.
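
The control words used in this procedure can be sketched in C as below. The macro names and the helpers {\tt zipctrl\_select} and {\tt zipctrl\_is\_open} are hypothetical. The bit positions are those of Tbl.~\ref{tbl:dbgctrl}, and how the words actually reach the debug port depends on the surrounding design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Debug control register bits, per the bit table.
 * Names are illustrative assumptions. */
#define ZIPCTRL_CLEAR_CACHE (1u << 11) /* command: clear PF cache */
#define ZIPCTRL_HALT        (1u << 10) /* command: halt the CPU */
#define ZIPCTRL_STALL       (1u << 9)  /* status: CPU is busy */
#define ZIPCTRL_STEP        (1u << 8)  /* command: step the CPU */
#define ZIPCTRL_ADDR_MASK   0x3fu      /* bits 5..0: register address */

/* Command word that keeps the CPU halted and selects internal
 * register `reg` for the next access to the data port. */
static uint32_t zipctrl_select(uint32_t reg)
{
    return ZIPCTRL_HALT | (reg & ZIPCTRL_ADDR_MASK);
}

/* True when a value read from the control register shows the halt
 * bit high and the stall bit low, i.e. the debug port is open. */
static bool zipctrl_is_open(uint32_t status)
{
    return (status & (ZIPCTRL_HALT | ZIPCTRL_STALL)) == ZIPCTRL_HALT;
}
```

A debugger following the steps above would, for example, write {\tt zipctrl\_select(31)} to {\tt ZIPCTRL} and then read {\tt ZIPDATA} to fetch the user program counter.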
|
| 1839 |
|
|
|
| 1840 |
|
|
In this manner, all of the CPU's internal state may be read and adjusted.
|
| 1841 |
|
|
|
| 1842 |
|
|
As an example of how to use this, consider what would happen in the case
|
| 1843 |
|
|
of an external break point. If and when the CPU hits a break point that
|
| 1844 |
|
|
causes it to halt, the Command HALT bit will activate on its own, the CPU
|
| 1845 |
|
|
will then raise an external interrupt line and wait for a debugger to examine
|
| 1846 |
|
|
its state. After examining the state, the debugger will need to remove
|
| 1847 |
|
|
the breakpoint by writing a different instruction into memory and by writing
|
| 1848 |
|
|
to the command register while holding the clear cache, command halt, and
|
| 1849 |
|
|
step CPU bits high, (32'hd00). The debugger may then replace the breakpoint
|
| 1850 |
|
|
now that the CPU has gone beyond it, and clear the cache again (32'hc00).
|
| 1851 |
|
|
|
| 1852 |
|
|
To leave this debug mode, simply write a `32'h0' value to the command register.
|
| 1853 |
|
|
|
| 1854 |
|
|
\chapter{Wishbone Datasheets}\label{chap:wishbone}
|
| 1855 |
32 |
dgisselq |
The Zip System supports two wishbone ports, a slave debug port and a master
|
| 1856 |
21 |
dgisselq |
port for the system itself. These are shown in Tbl.~\ref{tbl:wishbone-slave}
|
| 1857 |
|
|
\begin{table}[htbp]
|
| 1858 |
|
|
\begin{center}
|
| 1859 |
|
|
\begin{wishboneds}
|
| 1860 |
|
|
Revision level of wishbone & WB B4 spec \\\hline
|
| 1861 |
|
|
Type of interface & Slave, Read/Write, single words only \\\hline
|
| 1862 |
24 |
dgisselq |
Address Width & 1--bit \\\hline
|
| 1863 |
21 |
dgisselq |
Port size & 32--bit \\\hline
|
| 1864 |
|
|
Port granularity & 32--bit \\\hline
|
| 1865 |
|
|
Maximum Operand Size & 32--bit \\\hline
|
| 1866 |
|
|
Data transfer ordering & (Irrelevant) \\\hline
|
| 1867 |
|
|
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
|
| 1868 |
|
|
Signal Names & \begin{tabular}{ll}
|
| 1869 |
|
|
Signal Name & Wishbone Equivalent \\\hline
|
| 1870 |
|
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
| 1871 |
|
|
{\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
|
| 1872 |
|
|
{\tt i\_dbg\_stb} & {\tt STB\_I} \\
|
| 1873 |
|
|
{\tt i\_dbg\_we} & {\tt WE\_I} \\
|
| 1874 |
|
|
{\tt i\_dbg\_addr} & {\tt ADR\_I} \\
|
| 1875 |
|
|
{\tt i\_dbg\_data} & {\tt DAT\_I} \\
|
| 1876 |
|
|
{\tt o\_dbg\_ack} & {\tt ACK\_O} \\
|
| 1877 |
|
|
{\tt o\_dbg\_stall} & {\tt STALL\_O} \\
|
| 1878 |
|
|
{\tt o\_dbg\_data} & {\tt DAT\_O}
|
| 1879 |
|
|
\end{tabular}\\\hline
|
| 1880 |
|
|
\end{wishboneds}
|
| 1881 |
22 |
dgisselq |
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
|
| 1882 |
21 |
dgisselq |
\end{center}\end{table}
|
| 1883 |
|
|
and Tbl.~\ref{tbl:wishbone-master} respectively.
|
| 1884 |
|
|
\begin{table}[htbp]
|
| 1885 |
|
|
\begin{center}
|
| 1886 |
|
|
\begin{wishboneds}
|
| 1887 |
|
|
Revision level of wishbone & WB B4 spec \\\hline
|
| 1888 |
24 |
dgisselq |
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
|
| 1889 |
|
|
Address Width & 32--bit \\\hline
|
| 1890 |
21 |
dgisselq |
Port size & 32--bit \\\hline
|
| 1891 |
|
|
Port granularity & 32--bit \\\hline
|
| 1892 |
|
|
Maximum Operand Size & 32--bit \\\hline
|
| 1893 |
|
|
Data transfer ordering & (Irrelevant) \\\hline
|
| 1894 |
|
|
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
|
| 1895 |
|
|
Signal Names & \begin{tabular}{ll}
|
| 1896 |
|
|
Signal Name & Wishbone Equivalent \\\hline
|
| 1897 |
|
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
| 1898 |
|
|
{\tt o\_wb\_cyc} & {\tt CYC\_O} \\
|
| 1899 |
|
|
{\tt o\_wb\_stb} & {\tt STB\_O} \\
|
| 1900 |
|
|
{\tt o\_wb\_we} & {\tt WE\_O} \\
|
| 1901 |
|
|
{\tt o\_wb\_addr} & {\tt ADR\_O} \\
|
| 1902 |
|
|
{\tt o\_wb\_data} & {\tt DAT\_O} \\
|
| 1903 |
|
|
{\tt i\_wb\_ack} & {\tt ACK\_I} \\
|
| 1904 |
|
|
{\tt i\_wb\_stall} & {\tt STALL\_I} \\
|
| 1905 |
|
|
{\tt i\_wb\_data} & {\tt DAT\_I}
|
| 1906 |
|
|
\end{tabular}\\\hline
|
| 1907 |
|
|
\end{wishboneds}
|
| 1908 |
22 |
dgisselq |
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
|
| 1909 |
21 |
dgisselq |
\end{center}\end{table}
|
| 1910 |
|
|
I do not recommend that you connect these together through the interconnect.
|
| 1911 |
24 |
dgisselq |
Rather, the debug port of the CPU should be accessible regardless of the state
|
| 1912 |
|
|
of the master bus.
|
| 1913 |
21 |
dgisselq |
|
| 1914 |
24 |
dgisselq |
You may wish to notice that neither the {\tt ERR} nor the {\tt RETRY} wires
|
| 1915 |
|
|
have been implemented. What this means is that the CPU is currently unable
|
| 1916 |
|
|
to detect a bus error condition, and so may stall indefinitely (hang) should
|
| 1917 |
|
|
it choose to access a value not on the bus, or a peripheral that is not
|
| 1918 |
|
|
yet properly configured.
|
| 1919 |
21 |
dgisselq |
|
| 1920 |
|
|
\chapter{Clocks}\label{chap:clocks}
|
| 1921 |
|
|
|
| 1922 |
32 |
dgisselq |
This core is based upon the Basys--3 development board sold by Digilent.
|
| 1923 |
|
|
The Basys--3 development board contains one external 100~MHz clock, which is
|
| 1924 |
36 |
dgisselq |
sufficient to run the Zip CPU core.
|
| 1925 |
21 |
dgisselq |
\begin{table}[htbp]
|
| 1926 |
|
|
\begin{center}
|
| 1927 |
|
|
\begin{clocklist}
|
| 1928 |
|
|
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
|
| 1929 |
|
|
\end{clocklist}
|
| 1930 |
|
|
\caption{List of Clocks}\label{tbl:clocks}
|
| 1931 |
|
|
\end{center}\end{table}
|
| 1932 |
|
|
I hesitate to suggest that the core can run faster than 100~MHz, since I have
|
| 1933 |
|
|
struggled with various timing violations to keep it at 100~MHz. So, for
|
| 1934 |
|
|
now, I will only state that it can run at 100~MHz.
|
| 1935 |
|
|
|
| 1936 |
|
|
|
| 1937 |
|
|
\chapter{I/O Ports}\label{chap:ioports}
|
| 1938 |
33 |
dgisselq |
The I/O ports to the Zip CPU may be grouped into three categories. The first
|
| 1939 |
|
|
is that of the master wishbone used by the CPU, then the slave wishbone used
|
| 1940 |
|
|
to command the CPU via a debugger, and then the rest. The first two of these
|
| 1941 |
|
|
were already discussed in the wishbone chapter. They are listed here
|
| 1942 |
|
|
for completeness in Tbl.~\ref{tbl:iowb-master}
|
| 1943 |
|
|
\begin{table}
|
| 1944 |
|
|
\begin{center}\begin{portlist}
|
| 1945 |
|
|
{\tt o\_wb\_cyc} & 1 & Output & Indicates an active Wishbone cycle\\\hline
|
| 1946 |
|
|
{\tt o\_wb\_stb} & 1 & Output & WB Strobe signal\\\hline
|
| 1947 |
|
|
{\tt o\_wb\_we} & 1 & Output & Write enable\\\hline
|
| 1948 |
|
|
{\tt o\_wb\_addr} & 32 & Output & Bus address \\\hline
|
| 1949 |
|
|
{\tt o\_wb\_data} & 32 & Output & Data on WB write\\\hline
|
| 1950 |
|
|
{\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline
|
| 1951 |
|
|
{\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline
|
| 1952 |
|
|
{\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline
|
| 1953 |
|
|
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
|
| 1954 |
|
|
and~\ref{tbl:iowb-slave} respectively.
|
| 1955 |
|
|
\begin{table}
|
| 1956 |
|
|
\begin{center}\begin{portlist}
|
| 1957 |
|
|
{\tt i\_dbg\_cyc} & 1 & Input & Indicates an active Wishbone cycle\\\hline
{\tt i\_dbg\_stb} & 1 & Input & WB Strobe signal\\\hline
{\tt i\_dbg\_we} & 1 & Input & Write enable\\\hline
{\tt i\_dbg\_addr} & 1 & Input & Bus address, command or data port \\\hline
{\tt i\_dbg\_data} & 32 & Input & Data on WB write\\\hline
{\tt o\_dbg\_ack} & 1 & Output & Slave has completed a R/W cycle\\\hline
{\tt o\_dbg\_stall} & 1 & Output & WB bus slave not ready\\\hline
{\tt o\_dbg\_data} & 32 & Output & Outgoing bus data\\\hline
|
| 1965 |
|
|
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}
|
| 1966 |
21 |
dgisselq |
|
| 1967 |
33 |
dgisselq |
There are only four other lines to the CPU: the external clock, external
|
| 1968 |
|
|
reset, incoming external interrupt line(s), and the outgoing debug interrupt
|
| 1969 |
|
|
line. These are shown in Tbl.~\ref{tbl:ioports}.
|
| 1970 |
|
|
\begin{table}
|
| 1971 |
|
|
\begin{center}\begin{portlist}
|
| 1972 |
|
|
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
|
| 1973 |
|
|
{\tt i\_rst} & 1 & Input & Active high reset line \\\hline
|
| 1974 |
|
|
{\tt i\_ext\_int} & 1\ldots 6 & Input & Incoming external interrupts \\\hline
|
| 1975 |
|
|
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
|
| 1976 |
|
|
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
|
| 1977 |
|
|
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}. We
|
| 1978 |
|
|
typically run it at 100~MHz. The reset line is an active high reset. When
|
| 1979 |
|
|
asserted, the CPU will start running again from its reset address in
|
| 1980 |
|
|
memory. Further, depending upon how the CPU is configured and specifically on
|
| 1981 |
|
|
the {\tt START\_HALTED} parameter, it may or may not start running
|
| 1982 |
|
|
automatically. The {\tt i\_ext\_int} line is for an external interrupt. This
|
| 1983 |
|
|
line may be as wide as 6~external interrupts, depending upon the setting of
|
| 1984 |
|
|
the {\tt EXTERNAL\_INTERRUPTS} parameter. As currently configured, the ZipSystem
|
| 1985 |
|
|
only supports one such interrupt line by default. For us, this line is the
|
| 1986 |
|
|
output of another interrupt controller, but that's a board specific setup
|
| 1987 |
|
|
detail. Finally, the Zip System produces one external interrupt whenever
|
| 1988 |
|
|
the CPU halts to wait for the debugger.
|
| 1989 |
|
|
|
| 1990 |
36 |
dgisselq |
\chapter{Initial Assessment}\label{chap:assessment}
|
| 1991 |
|
|
|
| 1992 |
|
|
Having now worked with the Zip CPU for a while, it is worth offering an
|
| 1993 |
|
|
honest assessment of how well it works and how well it was designed. At the
|
| 1994 |
|
|
end of this assessment, I will propose some changes that may take place in a
|
| 1995 |
|
|
later version of this Zip CPU to make it better.
|
| 1996 |
|
|
|
| 1997 |
|
|
\section{The Good}
|
| 1998 |
|
|
\begin{itemize}
|
| 1999 |
|
|
\item The Zip CPU is lightweight and fully featured as it exists today. For
|
| 2000 |
|
|
anyone who wishes to build a general purpose CPU and then to
|
| 2001 |
|
|
experiment with building and adding particular features, the Zip CPU
|
| 2002 |
|
|
makes a good starting point--it is fairly simple. Modifications should
|
| 2003 |
|
|
be simple enough.
|
| 2004 |
|
|
\item As an estimate of the ``weight'' of this implementation, the CPU has
|
| 2005 |
|
|
cost me less than 150 hours to implement from its inception.
|
| 2006 |
|
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
| 2007 |
|
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
| 2008 |
|
|
fits this role rather nicely. It does not fit the role of a system on
|
| 2009 |
|
|
a chip very well, but then it was never intended to be a system on a
|
| 2010 |
|
|
chip but rather a system within a chip.
|
| 2011 |
|
|
\item The extremely simplified instruction set of the Zip CPU was a good
|
| 2012 |
|
|
choice. Although it does not have many of the commonly used
|
| 2013 |
|
|
instructions, PUSH, POP, JSR, and RET among them, the simplified
|
| 2014 |
|
|
instruction set has demonstrated an amazing versatility. I will contend
|
| 2015 |
|
|
therefore and for anyone who will listen, that this instruction set
|
| 2016 |
|
|
offers a full and complete capability for whatever a user might wish
|
| 2017 |
|
|
to do with two exceptions: bytewise character access and accelerated
|
| 2018 |
|
|
floating-point support.
|
| 2019 |
|
|
\item This simplified instruction set is easy to decode.
|
| 2020 |
|
|
\item The simplified bus transactions (32-bit words only) were also very easy
|
| 2021 |
|
|
to implement.
|
| 2022 |
|
|
\item The novel approach of having a single interrupt vector, which just
|
| 2023 |
|
|
brings the CPU back to the instruction it left off at within the last
|
| 2024 |
|
|
interrupt context doesn't appear to have been that much of a problem.
|
| 2025 |
|
|
If most modern systems handle interrupt vectoring in software anyway,
|
| 2026 |
|
|
why maintain hardware support for it?
|
| 2027 |
|
|
\item My goal of a high rate of instructions per clock may not be the proper
|
| 2028 |
|
|
measure. For example, if instructions are being read from a SPI flash
|
| 2029 |
|
|
device, such as is common among FPGA implementations, these same
|
| 2030 |
|
|
instructions may suffer stalls of between 64 and 128 cycles per
|
| 2031 |
|
|
instruction just to read the instruction from the flash. Executing the
|
| 2032 |
|
|
instruction in a single clock cycle is no longer the appropriate
|
| 2033 |
|
|
measure. At the same time, it should be possible to use the DMA
|
| 2034 |
|
|
peripheral to copy instructions from the FLASH to a temporary memory
|
| 2035 |
|
|
location, after which they may again be executed at one instruction
per clock cycle.
|
| 2037 |
|
|
\end{itemize}

\section{The Not so Good}
\begin{itemize}
\item While one of the stated goals was to use a small amount of logic,
3k~LUTs isn't that impressively small. Indeed, it's really much
too expensive when compared against other 8- and 16-bit CPUs that use
fewer than 1k~LUTs.

Still, \ldots it's not bad, it's just not astonishingly good.

\item The fact that the instruction width equals the bus width means that the
instruction fetch cycle will always be interfering with any load or
store memory operation, with the only exception being if the
instruction is already in the cache. {\em This has become the
fundamental limit on the speed and performance of the CPU!}
Those familiar with the von~Neumann approach of sharing a bus
between data and instructions will not be surprised by this assessment.

This could be fixed in one of three ways: the instruction set
architecture could be modified to handle Very Long Instruction Words
(VLIW) so that each 32-bit word would encode two or more instructions,
the instruction fetch bus width could be increased from 32~bits to
64~bits or more, or the instruction bus could be separated from the
data bus. Any of these approaches would increase the overall
LUT count.

\item The (non-existent) floating point unit was an afterthought, isn't even
built as a potential option, and most likely won't support the full
IEEE standard set of FPU instructions--even for single precision.
This (non-existent) capability would benefit the most from an
out-of-order execution capability, which the Zip CPU does not have.

Still, sharing FPU registers with the main register set was a good
idea and worth preserving, as it simplifies context swapping.

Perhaps this really isn't a problem, but rather a feature. By not
implementing FPU instructions, the Zip CPU maintains a lower LUT count
than it would have if it did implement these instructions.

\item The CPU has no character support. This is both good and bad.
Realistically, the CPU works just fine without it. Characters can be
supported as subsets of 32-bit words without any problem. Practically,
though, it will make compiling non-Zip CPU code difficult--especially
anything that assumes {\tt sizeof(int) = 4*sizeof(char)}, or that tries to
create unions with characters and integers and then attempts to
reference the address of the characters within that union.

\item The Zip CPU does not support a data cache. One can still be built
externally, but this is a limitation of the CPU proper as built.
Further, under the theory of the Zip CPU design (that of an embedded
soft-core processor within an FPGA, where any ``address'' may reference
either memory or a peripheral that may have side-effects), any data
cache would need to know in advance whether a given address refers to
memory (cacheable) or to a peripheral (not cacheable). This knowledge
must exist somewhere, and that somewhere is currently (and by design)
external to the CPU.

This may also be written off as a ``feature'' of the Zip CPU, since
the addition of a data cache can greatly increase the LUT count of
a soft core.

\item Many other instruction sets offer three-operand instructions, whereas
the Zip CPU only offers two-operand instructions. This means that it
takes the Zip CPU more instructions to do many of the same operations.
The good part of this is that it gives the Zip CPU a greater amount of
flexibility in its immediate operand mode, although that increased
flexibility isn't necessarily as valuable as one might like.

\item The Zip CPU does not currently detect and trap on either illegal
instructions or bus errors. Attempts to access non-existent
memory quietly return erroneous results, rather than halting the
process (user mode) or halting or resetting the CPU (supervisor mode).

\item The Zip CPU doesn't support out-of-order execution. I suppose it could
be modified to do so, but then it would no longer be the ``simple''
and low LUT count CPU it was designed to be. The three primary results
are that 1) loads may unnecessarily stall the CPU, even if other
things could be done while waiting for the load to complete, 2)
bus errors on stores will never be caught at the point of the error,
and 3) branch prediction becomes more difficult.

\item Although switching to an interrupt context in the Zip CPU design doesn't
itself require a tremendous swapping of registers, in practice it still
does, since any task swap still requires saving and restoring all
16~user registers. That's a lot of memory movement just to service
an interrupt.

\item The Zip CPU is by no means generic: it will never handle addresses
larger than 32~bits (16~GB of word-addressed memory) without a complete
and total redesign.
This may limit its utility as a generic CPU in the future, although
as an embedded CPU within an FPGA this isn't really much of a limit
or restriction.

\item While the Zip CPU has its own assembler, it has no linker and does not
(yet) support a compiler. The standard C library is an even longer
shot. My dream of having binutils and gcc support has not been
realized and at this rate may not be realized. (I've been intimidated
by the challenge every time I've looked through that code.)

\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle
execution, the Zip CPU is unable to exploit this parallelism. Instead,
apart from the DMA and the pipelined prefetch, all loads and stores
are single Wishbone bus operations requiring a minimum of 3~clocks.
(In practice, this has turned into 7~clocks.)

\iffalse
\item There is no control over whether or not an instruction sets the
condition codes--certain instructions always set the condition codes,
other instructions never set them. This effectively limits conditional
instructions to a single instruction only (with two or more
instructions as an exception), as the first instruction that sets
condition codes will break the condition code chain.

{\em (A proposed change below addresses this.)}

\item Using the CC register as a trap address was a bad idea--it limits the CC
register's ability to be used in future expansion, such as by adding
exception indication flags: bus error, floating point exception, etc.

{\em (This can be changed by a different O/S implementation of the trap
instruction.)}
\item The current implementation suffers from too many stalls on any
branch--even if the branch is known early on.

{\em (This is addressed in proposals below.)}
% Addressed, 20150918

\item In a similar fashion, a switch to interrupt context forces the pipeline
to be cleared, whereas it might make more sense to just continue
executing the instructions already in the pipeline while the prefetch
stage is working on switching to the interrupt context.

{\em (Also addressed in proposals below.)}
% This should happen so rarely that it is not really a problem
\fi

\end{itemize}
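
Two of the points above can be made concrete with a little arithmetic. Both
are sketches under stated assumptions rather than statements about the
implementation. First, if four 8-bit characters are packed into one 32-bit
word $w$ (the ordering, with character $k=0$ in the least significant byte,
is an assumption made only for this example), character $c_k$ is recovered by
a shift and a mask,
\[
c_k = \left\lfloor \frac{w}{2^{8k}} \right\rfloor \bmod 2^{8},
\qquad k = 0, 1, 2, 3 ,
\]
and replaced by a new value $c'_k$ via
\[
w' = w - c_k\, 2^{8k} + c'_k\, 2^{8k} .
\]
For instance, $w = \mathtt{0x5A495021}$ yields $c_3=\mathtt{0x5A}$ (`Z'),
$c_2=\mathtt{0x49}$ (`I'), $c_1=\mathtt{0x50}$ (`P') and
$c_0=\mathtt{0x21}$ (`!'). Second, because every address selects a 32-bit
word rather than a byte, a 32-bit address reaches
\[
2^{32}\ \mathrm{words} \times 4\ \mathrm{bytes/word} = 2^{34}\ \mathrm{bytes}
= 16\ \mathrm{GB} .
\]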

\section{The Next Generation}
This section could also be labeled as my ``To do'' list.

Given the feedback listed above, perhaps it's time to consider what changes
could be made to improve the Zip CPU in the future. I offer the following as
proposals:

\begin{itemize}
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the
proposals below will only increase the amount of logic the Zip CPU
requires. While I expect that the Zip CPU will always be somewhat
of a lightweight, it will never be the smallest kid on the block.

I'm actually struggling with this idea. The whole goal of the Zip
CPU was to be lightweight. Wouldn't it make more sense to create and
maintain options whereby it would remain lightweight? For example, if
the process accounting registers are anything but lightweight, why
keep them? Why not instead make some compile flags that just turn them
off, keeping the CPU lightweight? The same holds for the prefetch
cache.

\iffalse
\item {\bf Adjust the Zip CPU so that conditional instructions do not set
flags}, although they may explicitly set condition codes if writing
to the CC register.

This is a simple change to the core, and may show up in new releases.
% Fixed, 20150918
\fi

\item The `{\tt .V}' condition was never used in any code other than my test
code. I suggest changing it to a `{\tt .LE}' condition, which seems
to be more useful.

\iffalse
\item Add in an {\bf unpredictable branch delay slot}, so that on any branch
the delay slot may or may not be executed before the branch.
Instructions that do not depend upon the branch, and that should be
executed were the branch not taken, could be placed into the delay
slot. Thus, if the branch isn't taken, we wouldn't suffer the stall,
whereas it wouldn't affect the timing of the branch if taken. It would
just do something irrelevant.

% Changes made, 20150918, make this option no longer relevant

\item {\bf Re-engineer Branch Processing.} There's no reason why a {\tt BRA}
instruction should create five stall cycles. The decode stage, plus
the prefetch engine, should be able to drop this number of stalls via
better branch handling.

Indeed, this could turn into a simple means of branch prediction:
if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C}
suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would
be faster than a {\tt BRA.C} instruction. This would then allow a
compiler to do explicit branch optimizations.

Of course, longer branches using {\tt ADD X,PC} would still not be
optimized.

% DONE: 20150918 -- to include the ADD X,PC instructions

\item {\bf Request bus access for Load/Store two cycles earlier.} The problem
here is the contention for the bus between the memory unit and the
prefetch unit. Currently, the memory unit must ask the prefetch
unit to release the bus if it is in the middle of a bus cycle. At this
point, the prefetch drops the {\tt STB} line on the next clock and must
then wait for the last {\tt ACK} before releasing the bus. If the
request takes one clock, dropping the strobe line takes the next, waiting
for an acknowledgement takes another, and then the bus must be idle
for one cycle before starting again, these extra cycles add up.
It should be possible to tell the prefetch stage to give up the bus
as soon as the decoder knows the instruction will need the bus.
Indeed, if done in the decode stage, this might drop the seven cycle
access down by two cycles.

% FIXED: 20150918
\fi

\item {\bf Consider a more traditional Instruction Cache.} The current
pipelined instruction cache just reads a window of memory into
its cache. If the CPU leaves that window, the entire cache is
invalidated. A more traditional cache, however, might allow
common subroutines to stay within the cache without invalidating the
entire cache structure.

\iffalse
\item {\bf Very Long Instruction Word (VLIW).} Now, to speed up operation, I
propose that the Zip CPU instruction set be modified towards a Very
Long Instruction Word (VLIW) implementation. In this implementation,
an instruction word may contain either one or two separate
instructions. The first instruction would take up the high order bits,
the second optional instruction the lower 16~bits. Further, I propose
that any of the ALU instructions (SUB through LSR) automatically have
a second instruction whenever their operand `B' is a register, and use
the full 20-bit immediate if not. This will effectively eliminate the
register plus immediate mode for all of these instructions.

This is the minimal required change to increase the number of
instructions per clock cycle. Other changes would need to take place
as well to support this. These include:
\begin{itemize}
\item Instruction words containing two instructions would take two
clocks to complete, while requiring only a single cycle
instruction fetch.
\item Instructions preceded by a label in the assembler must always
start in the high order word.
\item VLIWs, once started, must always execute to completion. The
upper word may set the PC, the lower word may not. Regardless
of whether the upper word sets the PC, the lower word must
still be guaranteed to complete before the PC changes. On any
switch to (or from) interrupt context, both instructions must
complete or none of the instructions in the word shall
complete prior to the switch.
\item STEP commands and BREAK instructions will only take place after
the entire word is executed.
\end{itemize}

If done well, the assembler should be able to handle these changes
with the biggest impacts to the user being increased performance and
a loss of the register plus immediate ALU modes. (These weren't really
relevant for the XOR, OR, AND, etc. operations anyway.) Machine code
compatibility will not be maintained.

A proposed secondary instruction set might consist of: a four bit
operand (any of the prior instructions would be supported, with some
exceptions such as moves to and from user registers while in
supervisor mode not being supported), a 4-bit register result (PC not
allowed), a 3-bit conditional (identical to the conditional for the
upper word), a single bit indicating whether or not an immediate is
present, followed by either a 4-bit register or a 4-bit signed
immediate. The multiply instruction would steal the immediate flag to
be used as a sign indication, forcing both operands to be registers
without any immediate offsets.

{\em Initial conversion of several library functions to this secondary
instruction set has demonstrated little to no gain. The problem was
that the new instruction set was made by joining a rarely used
instruction (ALU with register and not immediate) with a more common
instruction. The utility was then limited by the utility of the rare
instruction, which limited the impact of the entire approach. }
\else
\item {\bf Very Long Instruction Word (VLIW).} The goal here would be to
create a new instruction set whereby two instructions would be encoded
in each 32-bit word. While this may speed up
CPU operation, it would necessitate an instruction set redesign.
\fi

\end{itemize}
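
To see why packing two instructions into each word might help, consider a
simple model (an illustration only; the symbol $f$ and the numbers below are
assumptions, not measurements). Let $f$ be the fraction of instructions that
are loads or stores, and suppose every instruction fetch has to go to the
shared bus. With one instruction per 32-bit word, the bus carries
\[
B_1 = 1 + f
\]
transactions per instruction, whereas with two instructions in every word
(assuming each word is fully packed) it carries
\[
B_2 = \frac{1}{2} + f .
\]
For $f = 0.25$, this gives $B_1 = 1.25$ and $B_2 = 0.75$, a 40\% reduction in
bus traffic per instruction. Words that cannot be fully packed would reduce
this benefit accordingly.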


% Appendices
% Index
\end{document}