%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename: spec.tex
%%
%% Project: ZipCPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose: This LaTeX file contains all of the documentation/description
%% currently provided with this ZipCPU soft core. It supersedes
%% any information about the instruction set or CPUs found
%% elsewhere. It's not nearly as interesting, though, as the PDF
%% file it creates, so I'd recommend reading that before diving
%% into this file. You should be able to find the PDF file in
%% the SVN distribution together with this LaTeX file and a copy of
%% the GPL-3.0 license this file is distributed under. If not,
%% just type 'make' in the doc directory and it (should) build
%% without a problem.
%%
%%
%% Creator: Dan Gisselquist
%% Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program. (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.) If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License: GPL, v3, as defined and found on www.gnu.org,
%% http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%
% From TI about DSPs vs FPGAs:
% www.ti.com/general/docs/video/foldersGallery.tsp?bkg=gray
% &gpn=35145&familyid=1622&keyMatch=DSP Breaktime Episode Three
% &tisearch=Search-EN-Everything&DCMP=leadership
% &HQS=ep-pro-dsp-leadership-problog-150518-v-en
%
% FPGAs are annoyingly faster, cheaper, and not quite as power hungry
% as they used to be.
%
% Why would you choose DSPs over FPGAs? If you care about size,
% if you care about power, or happen to have a complicated algorithm
% that isn't simply doing the same thing over and over
%
% For complex algorithms that change over time. Each has its strengths;
% sometimes you can use both.
%
% "No assembly required" -- TI tools all C programming, very GUI based
% environment, very little optimization by hand ...
%
%
% The FPGA's Achilles' heel: Reconfigurability. It is very difficult, although
% I'm sure major vendors will tell you not impossible, to reconfigure an FPGA
% based upon the need to process time-sensitive data. If you need one of two
% algorithms, both of which will fit on the FPGA individually but not together,
% switching between them on the fly is next to impossible, whereas switching
% algorithms within a CPU is not difficult at all. For example, imagine
% receiving a packet and needing to apply one of two data algorithms on the
% packet before sending it back out, and needing to do so fast. If both
% algorithms don't fit in memory, where does the packet go when you need to
% swap one algorithm out for the other? And what is the cost of that "context"
% swap?
%
%
\documentclass{gqtekspec}
\usepackage{import}
\usepackage{bytefield} % Install via apt-get install texlive-science
% \graphicspath{{../gfx}}
\project{ZipCPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) ieee.org}
\revision{Rev.~1.1}
\definecolor{webred}{rgb}{0.5,0,0}
\definecolor{webgreen}{rgb}{0,0.4,0}
\hypersetup{
	ps2pdf,
	pdfpagelabels,
	hypertexnames,
	pdfauthor={Dan Gisselquist},
	pdfsubject={ZipCPU},
	anchorcolor= black,
	colorlinks = true,
	linkcolor = webred,
	citecolor = webgreen
}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see \hbox{$<$http://www.gnu.org/licenses/$>$} for
a copy.
\end{license}
\begin{revisionhistory}
2.0 & 1/18/2017 & Gisselquist & Switched from 32--bit to 8--bit bytes.\\\hline
1.1 & 11/28/2016 & Gisselquist & Moved the ZipSystem address to {\tt 0xff000000} base.\\\hline
1.0 & 11/4/2016 & Gisselquist & Major rewrite,
	includes compiler info\\\hline
0.91& 7/16/2016 & Gisselquist & Described three more CC bits\\\hline
0.9 & 4/20/2016 & Gisselquist & Modified ISA: LDIHI replaced with MPY,
	MPYU and MPYS replaced with MPYUHI, and MPYSHI respectively. LOCK
	instruction now permits an intermediate ALU operation. \\\hline
0.8 & 1/28/2016 & Gisselquist & Reduced complexity early branching \\\hline
0.7 & 12/22/2015 & Gisselquist & New Instruction Set Architecture \\\hline
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the ZipCPU. ARM processors are
good and effective. Xilinx makes and markets Microblaze, Altera Nios, and both
have better toolsets than the ZipCPU will ever have. OpenRISC is also
available, and RISC--V may be replacing it. Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it though. There are a lot of things that I would like to do
with a processor, and I want to be able to do them in a vendor independent
fashion.
First, I would like to be able to place this processor inside an FPGA. Without
paying royalties, ARM is out of the question. I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on Xilinx, Altera, and Lattice chips, and that
can be easily ported from one manufacturer's chipsets to another. Even more,
before purchasing a chip or a board, I would like to know that my soft core
works. I would like to build a test bench to test components with, and
Verilator is my chosen test bench. This forces me to use all Verilog, and it
prevents me from using any proprietary cores. For this reason, Microblaze and
Nios are out of the question.

Why not OpenRISC? Because the ZipCPU has different goals. OpenRISC is designed
to be a full featured CPU. The ZipCPU was designed to be a simple, resource
friendly CPU. The result is that it is easy to get a ZipCPU program running
on bare hardware for a special purpose application, such as those FPGAs were
designed for, but getting a full featured
Linux distribution running on the ZipCPU may just be beyond my grasp. Further,
the OpenRISC ISA is very complex, defining over 200~instructions, even though
it has never been fully implemented. The ZipCPU on the other hand has only
a small handful of instructions, and all but the Floating Point instructions
have already been fully implemented.

My final reason is that I'm building the ZipCPU as a learning experience. The
ZipCPU has allowed me to learn a lot about how CPUs work on a very micro
level. For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is
lightweight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}

The goal of the ZipCPU was to be a very simple CPU. You might think of it as
a poor man's alternative to the OpenRISC architecture. You might also think of
it as an Open Source microcontroller.
For this reason, all instructions have been designed to be as simple as
possible, and the base instructions are all designed to be executed in one
instruction cycle per instruction, barring pipeline stalls.\footnote{The
exceptions to this rule are the multiply, divide, and load/store instructions.
Once the floating point unit is built, I anticipate its instructions will also
be exceptions to this rule.} Indeed, even the bus has been simplified to a
constant 32-bit width, with no option for more or less. This pursuit of
simplicity has resulted in
the choice to drop push and pop instructions, pre-increment and post-decrement
addressing modes, the integrated memory management unit (MMU), and
more.\footnote{A not--so integrated MMU is currently under development.}

For those who like buzz words, the ZipCPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32~bits wide, addresses are 32~bits
	wide, instructions are 32~bits wide, etc.
\item A RISC CPU. There is no microcode for executing instructions. All
	instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture. (Only load and store instructions
	can access memory.)
\item Wishbone compliant. All peripherals are accessed just like
	memory across this bus.
\item A Von Neumann architecture. The instructions and data share a
	common bus.
\item A pipelined architecture, having stages for {\bf Prefetch},
	{\bf Decode}, {\bf Read-Operand}, a combined stage containing
	the {\bf ALU}, {\bf Memory}, {\bf Divide}, and {\bf Floating Point}
	units, and then the final {\bf Write-back} stage.
	See Fig.~\ref{fig:cpu} for a diagram of this structure.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{ZipCPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
\item Completely open source, licensed under the GPL.\footnote{Should you
	need a copy of the ZipCPU licensed under other terms, please
	contact me.}
\end{itemize}

The ZipCPU also has one unusual feature: the ability to do pipelined loads
and stores. This allows the CPU to access on-chip memory at one access per
clock, minus any stalls for the initial access.

\section{Characteristics of a SwiC}

This section might also be called {\em the ZipCPU philosophy}. It discusses
the basis for the ZipCPU design decisions, and why a low logic count CPU
is or can be a good thing.

Many other FPGA processors have been defined to be good Systems on a Chip, or
SoCs. The entire goal of such designs, then, is to provide an interface
between the processor and its external environment. This is not the case with
the ZipCPU. Instead, we shall define a new concept, that of a soft core
internal to an FPGA, as a ``System within a Chip,'' or a SwiC. SwiCs have some
unique properties internal to them that have influenced the design of the
ZipCPU. Among these are the bus, memory, and available peripherals.

Many other approaches to soft core CPUs employ a Harvard architecture.
This allows these other CPUs to have two separate bus structures: one for the
program fetch, and the other for the memory. Indeed, Xilinx's proprietary
Microblaze processor
goes so far as to support four busses: two for cacheable memory, and two for
peripherals, with each of those split between instructions and data. The
ZipCPU on the other hand is fairly unique in its approach because it uses a
Von Neumann architecture, requiring only one bus within any FPGA. This
structure was chosen for its simplicity. Having only the one bus helps to
minimize real-estate, logic, and the number of wires that need to be passed
back and forth, while maintaining a high clock speed. The disadvantage is
that both prefetch and memory access units need to contend for time on the
same bus.

Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow. While memory on chip may be accessed at a single
cycle per access, small FPGAs often have only a limited amount of memory on
chip. Going off chip, however, is expensive. Two examples will illustrate this
point. On
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
or 64~cycles per subsequent word in a pipelined architecture. Likewise, the
SDRAM chip on the XuLA2 board allows a 6~cycle access for a write, 10~cycles
per read, and 2~cycles for any subsequent pipelined access, read or write.
Either way you look at it, this memory access will be slow, and this doesn't
account for any logic delays should the bus implementation logic get
complicated.
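
To put rough numbers on this, suppose (as a simplifying assumption, not a
measurement) that a burst of $N$ sequential words costs the full first-access
latency once, and the pipelined cost for each word after that. Using the XuLA2
figures quoted above, and counting $T$ in clock cycles, this model gives
\begin{eqnarray*}
T_{\mbox{\scriptsize flash}}(N) &=& 128 + 64\,(N-1), \\
T_{\mbox{\scriptsize SDRAM read}}(N) &=& 10 + 2\,(N-1).
\end{eqnarray*}
For $N=8$ words, that is $128 + 64\cdot 7 = 576$~cycles from flash, against
$8\cdot 128 = 1024$~cycles if every word paid the full latency, and
$10 + 2\cdot 7 = 24$~cycles from SDRAM, against $8\cdot 10 = 80$.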

As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that sequential memory accesses may be optimized
for minimal delays (pipelined), and that pipelined memory access is faster than
non--pipelined access. Therefore, a SwiC soft core should support pipelined
operations, but it should also allow a higher priority subsystem to get access
to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are
expensive. If you want to have a minimum of logic, cache logic may not be
the highest on the priority list. Any SwiC capable processor must be able
to either be built without caches, or to scale up or down the logic required
by any cache.

In sum, memory is slow. While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline. Any SwiC must be able to deal
with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.
Specifically, FPGAs are highly reconfigurable. Soft peripherals can easily
be created on chip to support the SwiC if necessary. As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address. This is cheap within an
FPGA, but expensive in instructions. Reading from another 16--bit peripheral
might calculate a sine function, where the 16--bit address internal to the
peripheral was the angle of the sine wave.
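
The bit--reversing peripheral described above might be sketched as follows.
This is an illustration only, not a part of the ZipCPU distribution: the
module name, the port names, and the single--clock acknowledgement are all
assumptions, and a real bus slave would need whatever other signals the
surrounding interconnect requires.

\begin{verbatim}
// A hypothetical bit-reversing peripheral: a sketch only.
// All port names are assumptions made for this example.
module bitrev(
        input  wire        i_clk,
        input  wire        i_wb_stb,   // a bus request is present
        input  wire [29:0] i_wb_addr,  // the number to be reversed
        output reg         o_wb_ack,   // acknowledge, one clock later
        output reg  [31:0] o_wb_data); // the reversed result

        // Reversing the order of the bits costs wires only, no logic
        wire [29:0] reversed;
        genvar k;
        generate for(k=0; k<30; k=k+1)
        begin : REVERSE
                assign reversed[k] = i_wb_addr[29-k];
        end endgenerate

        initial o_wb_ack = 1'b0;
        always @(posedge i_clk)
        begin
                o_wb_ack  <= i_wb_stb;
                o_wb_data <= { 2'b00, reversed };
        end
endmodule
\end{verbatim}

In software, the same reversal would take a loop or a sequence of
shift--and--mask steps, which is the instruction cost referred to above.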

Indeed, anything that must be done fast within an FPGA is likely to already
be done, elsewhere in the fabric. Further, the application designer gets to
choose which tasks are so important they need fabric dedicated to them, and
which ones can be done more slowly in a CPU. This leaves the CPU with the
simple role of solely handling sequential tasks, and tasks that need a lot of
state.

This means that the SwiC needs to live within a unique environment,
separate and different from the traditional SoC. That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose. Indeed, some of the best examples of the ZipCPU are
System on a Chip examples.

\section{Scope}\label{sec:limits}

The ZipCPU is itself nothing more than a CPU that can be placed within a
larger design. It is not a System on a Chip, but it can be used to create
a system on a chip. As a result, this document will not discuss more than a
small handful of CPU--related peripherals, as the actual peripherals used
within a design will vary from one design to the next. Further, because
control access will vary from one environment to the next, this document will
not discuss any host control programs, leaving those to be discussed and defined
together with the environments the ZipCPU is placed within.

\chapter{CPU Architecture}\label{chap:arch}

This chapter describes the general architecture of the ZipCPU. It first
discusses
the configuration options to the CPU and then breaks into two threads. The
first thread is a discussion of the internals of the ZipCPU, such as its
instruction set architecture and the details and consequences of it. The
second is
the external architecture, describing how the ZipCPU fits into the systems
surrounding it, and what those systems must do to support it. Specifically,
the external architecture section will discuss the ZipSystem, the
peripherals provided by it, and the debug interface.

\section{Build Options/defines}\label{ssec:build-options}

One problem with a simple goal such as being light on logic is that different
architectures have different needs. What is light logic
in some architectures might consume all the available logic in others.
As an example, the CMod~S6 board built by Digilent uses a very small Xilinx
Spartan~6 LX4 FPGA. This FPGA doesn't have enough look up tables (LUTs) to
support pipelined mode, whereas another project running on a XuLA2 LX25 board
made by Xess, having a Spartan~6 LX25 on board, has more than enough logic
to support a pipelined mode. Very quickly it becomes clear that LUTs can be
traded for performance.

To make this possible, the ZipCPU has both a configuration file and a
set of parameters that it can be built with. Often, those parameters can
override the configuration file, but not all configuration file changes can
be overridden. Several options are available within the configuration file,
such as making the ZipCPU pipelined or not, able to handle a faster clock
with more stalls or a slower clock with no stalls, etc.

The {\tt cpudefs.v} file encapsulates those control options. It contains a
series of {\tt `define} statements that can either be commented out or left
active. If active, the option is considered to be in effect. The number of
LUTs the ZipCPU uses varies dramatically with the options defined in this file.
This section will outline the various configuration options captured by this
file.

The first couple of options control the ZipCPU instruction set, and how
it handles various instructions within the set:

{\tt OPT\_MULTIPLY} controls whether or not the multiply is built and included
in the ALU by default and, if it is, which of several multiply options is
selected. Unlike many of the defines that follow within {\tt cpudefs.v}, which
are either defined or not, this option requires a value. A value of zero means
no multiply support, whereas a value of one, two, or three means that a
multiply will be included that takes one, two, or three clock cycles to
complete. This option, however, only sets the default value of the CPU's
{\tt IMPLEMENT\_MPY} parameter, which has the same interpretation.
Because this is just the default value, it can easily be overridden
upon instantiation. If the {\tt IMPLEMENT\_MPY} parameter is set to zero,
then any attempt to execute a multiply instruction will cause an illegal
instruction exception.
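
As a minimal sketch of how such a default might be wired up, consider the
three fragments below. They are illustrative only: the exact text of
{\tt cpudefs.v} and of the CPU's parameter list may differ, and the module
name {\tt zipcpu}, the instance name {\tt thecpu}, and the elided port list
are assumptions made for this example.

\begin{verbatim}
// In cpudefs.v: request a three-clock multiply by default
`define OPT_MULTIPLY    3

// Inside the CPU: the define only supplies the parameter's default
parameter IMPLEMENT_MPY = `OPT_MULTIPLY;

// At instantiation: override that default, here removing the multiply
zipcpu #(.IMPLEMENT_MPY(0)) thecpu( /* ports as in your design */ );
\end{verbatim}

Keeping the define as nothing more than a default means a single
{\tt cpudefs.v} can be shared between designs that need different multiply
implementations.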

{\tt OPT\_DIVIDE} controls whether or not the divide instruction is built and
included into the ZipCPU by default. Set this option and the
{\tt IMPLEMENT\_DIVIDE} parameter will have a default value of one, meaning that
unless it is overridden with zero, the divide unit will be included. If the
divide is not included, then any attempt to use a divide instruction will
create an illegal instruction exception that will send the CPU into supervisor
mode.

{\tt OPT\_IMPLEMENT\_FPU} will (one day) control whether or not the floating
point unit (once I have one) is built and included into the ZipCPU by default.
This option sets the default value of the {\tt IMPLEMENT\_FPU} parameter to
one, so it can alternatively be adjusted upon instantiation. If the floating
point unit is
not included then, as with the multiply and divide, any floating point
instruction will result in an illegal instruction exception that will send the
CPU into supervisor mode.

{\tt OPT\_SINGLE\_FETCH} controls whether or not the prefetch has a cache, and
whether or not it can issue one instruction per clock. When set, the
prefetch has no cache, and only one instruction is fetched at any given time.
This effectively sets the CPU so that only one instruction is ever
in the pipeline at a time, and hence you may think of this as a ``no
pipeline'' option. Since the pipeline uses so much area on the FPGA,
this is an important option for trimming down logic usage when necessary,
and it is maintained for that purpose. Be aware, though, that setting
this option will disable all pipelining, and therefore will drop your
performance by a factor of eight or even more.

I recommend only defining or enabling this option if you {\em need} to,
such as if area is tight and speed isn't as important. Otherwise, leave the
option undefined, since the pipelined options perform much better.

The next several options are pipeline optimization options. They make no
sense in a single instruction fetch mode, hence they are all disabled if
{\tt OPT\_SINGLE\_FETCH} is defined.

{\tt OPT\_PIPELINED} is the natural result and opposite of using the single
instruction fetch unit. It is an internal parameter that doesn't need user
adjustment, but if you look through the {\tt cpudefs.v} file you may
notice it. If you have not set the {\tt OPT\_SINGLE\_FETCH} parameter,
{\tt cpudefs.v} will set the {\tt OPT\_PIPELINED} option. This is more for
readability than anything else, since {\tt OPT\_PIPELINED} is more
intuitive to read than the absence of {\tt OPT\_SINGLE\_FETCH}. In other words,
define or comment out {\tt OPT\_SINGLE\_FETCH}, and let {\tt OPT\_PIPELINED} be
taken care of automatically.
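
The relationship between these two options can be sketched with Verilog's
conditional compilation directives. This is a guess at the general shape, not
a copy of {\tt cpudefs.v}; the actual file may arrange its guards differently.

\begin{verbatim}
// `define OPT_SINGLE_FETCH     // uncomment only if area is tight

`ifndef OPT_SINGLE_FETCH
`define OPT_PIPELINED           // derived; not meant to be set by hand
`endif

`ifdef  OPT_PIPELINED
// The pipeline optimization options belong here, so that they are
// all disabled together whenever OPT_SINGLE_FETCH is defined.
`endif
\end{verbatim}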

Assuming you have chosen not to define {\tt OPT\_SINGLE\_FETCH},
{\tt OPT\_TRADITIONAL\_PFCACHE} allows you to select between two
prefetch cache modules. If enabled (recommended), a more traditional cache
will be implemented in the CPU. This more traditional cache reduces
the stall count tremendously over the alternative pipeline cache, and its
LUT usage is quite competitive. As there is little downside to defining
this option if pipelining is enabled, I would recommend including it.

The alternative prefetch and cache, sometimes called the pipeline cache, tries
to read instructions ahead of where they are needed, while maintaining what it
has read in a cache. That cache is cleared anytime you jump outside of its
window, and it often competes with the CPU for access to the bus. These
two characteristics often make this alternative cache less than optimal.

{\tt OPT\_EARLY\_BRANCHING} is an attempt to execute a {\tt BRA} (branch or
jump) statement as early
in the pipeline as possible, to avoid as many pipeline stalls on a branch as
possible. As an example, if you have {\tt OPT\_TRADITIONAL\_PFCACHE} defined
as well, then branches within the cache will only cost a single stall cycle.
Indeed, using early branching, a {\tt BRA} instruction can be used as the
compiler's branch prediction optimizer: {\tt BRA} instructions barely stall,
while branches on conditions will always suffer about 6~stall cycles. Setting
this option causes the {\tt EARLY\_BRANCHING} parameter to default to one,
so it can still be overridden upon instantiation.

Given the performance benefits achieved by early branching, setting this flag
is highly recommended.

{\tt OPT\_PIPELINED\_BUS\_ACCESS} controls whether or not memory
instructions can take advantage of the pipelined wishbone bus. To be
eligible, the operations to be pipelined must be adjacent, must be all
loads or all stores, and the addresses must all use the same base
address register and either have identical immediate offsets, or immediate
offsets that increase by one for each instruction. Further, the
string of load (or store) instructions must all have the same conditional
(if any). Currently, this approach is most effectively used
when saving registers to or restoring registers from the stack at the
beginning/end of a procedure, when using assembly optimized programs, or
when doing a context swap.

I recommend setting this flag, for performance reasons, especially if your
wishbone bus implementation can handle pipelined bus accesses. The logic
impact of this setting is minimal, while the performance impact can be
significant.

{\tt OPT\_CIS} includes within the instruction set the Very Long Instruction
Word packing, which packs up to two instructions within each instruction word.
Non--packed instructions will still execute as normal; this option just enables
the decoding and running of packed instructions.

The next two options, {\tt INCLUDE\_DMA\_CONTROLLER} and
{\tt INCLUDE\_ACCOUNTING\_COUNTERS},
control whether the DMA controller is included in the ZipSystem, and
whether or not the eight accounting counters are also included. Define these to
include the respective peripherals, or comment them out to leave the
peripherals out. These only
affect the ZipSystem implementation, and not any ZipBones implementations.

Finally, if you find yourself needing to debug the core, and in particular
needing to get a trace from the core to find out why something
failed, you may find it useful to define {\tt DEBUG\_SCOPE}. This will add a
32--bit debug output from the core, as the last argument to the core, to the
ZipSystem, or even to ZipBones. The actual definition and composition of
this debugging bit--field changes from one implementation to the next,
depending upon needs and necessities, so please look at the code at the
bottom of {\tt zipcpu.v} for more details.
495 |
199 |
dgisselq |
|
496 |
202 |
dgisselq |
That ends our discussion of CPU options, but there remain several
|
497 |
|
|
implementation parameters that can be defined with the CPU as well. Some of
|
498 |
|
|
these, such as {\tt IMPLEMENT\_MPY}, {\tt IMPLEMENT\_DIVIDE},
|
499 |
|
|
{\tt IMPLEMENT\_FPU}, and {\tt EARLY\_BRANCHING} have already been discussed.
|
500 |
|
|
The remainder shall be discussed quickly here.
|
501 |
199 |
dgisselq |
|
502 |
|
|
The {\tt RESET\_ADDRESS} parameter controls what address the CPU attempts to
|
503 |
|
|
fetch its first instruction from upon any CPU reset. The default value is
|
504 |
|
|
not likely to be particularly useful, so overriding the default is recommended
|
505 |
|
|
for every implementation.
|
506 |
|
|
|
507 |
|
|
The {\tt ADDRESS\_WIDTH} parameter can be used to trim down the width of
|
508 |
|
|
addresses used by the CPU. For example, although the Wishbone Bus definition
|
509 |
202 |
dgisselq |
used by the CPU has 30--address lines, particular implementations may have
|
510 |
199 |
dgisselq |
fewer. By setting this value to the actual number of wires in the address
|
511 |
202 |
dgisselq |
bus, some logic can be spared within the CPU. The default is also the maximum,
|
512 |
|
|
a 30--bit address width. Two additional bits are used internally by the CPU
|
513 |
|
|
to create the appearance of an 8--bit bus, by using the wishbone select lines.
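
As a worked example of the address arithmetic (the value 24 is chosen purely
for illustration), each address line doubles the number of addressable
32--bit words, and the two internal bits select a byte within a word:
\[
2^{30}\ \mbox{words} \times 4\ \mbox{bytes per word} = 2^{32}\ \mbox{bytes}
\qquad \mbox{({\tt ADDRESS\_WIDTH} of 30, the default)}
\]
\[
2^{24}\ \mbox{words} \times 4\ \mbox{bytes per word} = 2^{26}\ \mbox{bytes}
= 64\ \mbox{MB}
\qquad \mbox{({\tt ADDRESS\_WIDTH} of 24)}
\]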

The {\tt LGICACHE} parameter specifies the log base two of the instruction
cache size. If no instruction cache is used, this option has no effect.
Otherwise it sets the size of the instruction cache to be
$2^{\mbox{\tiny\tt LGICACHE}}$ words. The traditional prefetch cache, if used,
will split this cache size into up to thirty two separate cache lines.
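
For instance, taking {\tt LGICACHE} to be 10 (an illustrative value, not
necessarily the default) together with 32--bit instruction words gives
\[
2^{10} = 1024\ \mbox{words} = 4096\ \mbox{bytes},
\]
and if the prefetch cache were to split this into all thirty two lines, each
line would hold $1024/32 = 32$ words.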

The {\tt IMPLEMENT\_LOCK} parameter controls whether or not the {\tt LOCK}
instruction is implemented. If set to zero, the {\tt LOCK} instruction will
cause an illegal instruction exception; otherwise it will be implemented if
pipelining is enabled.

Other parameters are defined within the ZipSystem parent module, and affect
the performance of the system as a whole.

The {\tt START\_HALTED} parameter, if set to non--zero, will cause the
CPU to be halted upon startup. This is useful for debugging, since it prevents
the CPU from doing anything without supervision. Of course, once all pieces
of your design are in place and proven, you'll probably want to set this to
zero, so that the CPU will then start up immediately upon power up.

The {\tt EXTERNAL\_INTERRUPTS} parameter controls the number of interrupt
wires coming into the CPU. This number must be between one and sixteen,
or if the performance counters are disabled, between one and twenty four.

\section{Internal Architecture}\label{sec:internals}

This section discusses the general architecture of the CPU itself, separated
from its environment. As such, it focuses on the instruction set layout
and how those instructions are implemented.

\subsection{Register Set}

Fundamental to the understanding of the ZipCPU is its register set, and the
performance model associated with it.
The ZipCPU register set contains two sets of sixteen 32-bit registers, a
supervisor and a user set as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\multicolumn{2}{c}{Supervisor Register Set} &
	\multicolumn{1}{c}{} &
	\multicolumn{2}{c}{User Register Set} \\
\multicolumn{2}{c}{\#'s 0-15} & \multicolumn{1}{c}{} &
	\multicolumn{2}{c}{\#'s 16-31} \\\hline\hline
sR0(LR) & sR8 && uR0(LR) & uR8 \\\cline{1-2}\cline{4-5}
sR1 & sR9 && uR1 & uR9 \\\cline{1-2}\cline{4-5}
sR2 & sR10 && uR2 & uR10 \\\cline{1-2}\cline{4-5}
sR3 & sR11 && uR3 & uR11 \\\cline{1-2}\cline{4-5}
sR4 & sR12(FP)&& uR4& uR12(FP)\\\cline{1-2}\cline{4-5}
sR5 & sSP && uR5 & uSP \\\cline{1-2}\cline{4-5}
sR6 & sCC && uR6 & uCC \\\cline{1-2}\cline{4-5}
sR7 & sPC && uR7 & uPC \\\hline\hline
\multicolumn{2}{c}{Interrupts Disabled} &
	\multicolumn{1}{c}{} &
	\multicolumn{2}{c}{Interrupts Enabled} \\
\end{tabular}
\caption{ZipCPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used when interrupts are disabled, whereas the user set
is used any time interrupts are enabled. This choice makes it easy to set up
a working context upon any interrupt, as the supervisor register set remains
what it was when interrupts were enabled. This sets up one of two modes
the CPU can run within: a {\em supervisor mode}, which runs with interrupts
disabled using the supervisor register set, and {\em user mode}, which runs
with interrupts enabled using the user register set.

This separation is so fundamental to the CPU that it is impossible to enable
interrupts without switching to the user register set. Further, on any
interrupt, exception, or trap, the CPU simply clears the pipeline and switches
register sets.

In each register set, the Program Counter (PC) is register 15, whereas
the status register (SR) or condition code register (CC) is register 14. All
other registers are identical in their hardware functionality.\footnote{Jumps
to {\tt R0}, an instruction used to implement a return from a subroutine, may
be optimized in the future within the early branch logic.} By convention, the
stack pointer is register 13 and noted as (SP). This is more than a convention
in one respect: word accesses to offsets of the stack pointer are compressed
when using the CIS instruction set. Also by convention, if the compiler needs
a frame pointer it will be placed into register~12, and may be abbreviated by
FP. Finally, by convention, R0 will hold a subroutine's return address,
sometimes called the link register (LR).

When the CPU is in supervisor mode, instructions can access both register sets
via the {\tt MOV} instruction, whereas when the CPU is in user mode, {\tt MOV}
instructions will only offer access to user registers. We'll discuss this
further in Sec.~\ref{sec:isa-mov}.
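
Since the two sets are numbered 0--15 and 16--31 in Fig.~\ref{fig:regset},
one way to picture a full register address is as a five bit number whose top
bit selects the set. As a sketch, with $n$ standing for the register's number
within its set (a symbol introduced here only for illustration), the address
is $n$ for a supervisor register and $16+n$ for a user register:
\begin{center}\begin{tabular}{l|l}
Register & Address \\\hline
sSP & $0 + 13 = 13$ \\
sCC & $0 + 14 = 14$ \\
uSP & $16 + 13 = 29$ \\
uPC & $16 + 15 = 31$ \\
\end{tabular}\end{center}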

\subsection{The Status Register, CC}
The status register (CC) is special, and bears further mention. As shown in
Tbl.~\ref{tbl:cc-register},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 23 & R & Reserved for future uses\\\hline
22\ldots 16 & R/W & Reserved for future uses\\\hline
15 & R & Reserved for MMU exceptions\\\hline
14 & W & Clear I-Cache command, always reads zero\\\hline
13 & R & CIS instruction phase (1 for first half)\\\hline
12 & R & (Reserved for) Floating Point Exception\\\hline
11 & R & Division by Zero Exception\\\hline
10 & R & Bus-Error Flag\\\hline
9 & R & Trap Flag (or user interrupt). Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable (sCC), or user break (uCC)\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep. When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative. The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero. The result of the last ALU instruction was zero.\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}
the lower sixteen bits of the status register form a set of CPU state and
condition codes. The other bits are reserved for future uses.

Of the condition codes, the bottom four bits are the current flags:
Zero (Z),
Carry (C),
Negative (N),
and Overflow (V).
These flags maintain their usual definition from other CPUs that use them, for
all but the shift right instructions. On those instructions that set the flags,
the flags are set based upon the instruction's result. If the
result is zero, the Z (zero) flag will be set. If the high order bit is set,
the N (negative) flag will be set. If the instruction caused a bit to fall off
the end, the carry bit will be set. In comparisons, this is equivalent to a
less--than unsigned comparison. Finally, if the instruction causes a signed
integer overflow, the V (overflow) flag will be set afterwards.
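
As a worked illustration of these definitions, consider {\tt CMP 2,R0}, which
sets the flags according to {\tt R0}$-2$. The register values below are
arbitrary, and the flag columns simply apply the rules as stated above.
\begin{center}\begin{tabular}{l|l|c|c|c|c}
{\tt R0} & {\tt R0}$-2$ & Z & C & N & V \\\hline
{\tt 0x00000002} & {\tt 0x00000000} & 1 & 0 & 0 & 0 \\
{\tt 0x00000001} & {\tt 0xffffffff} & 0 & 1 & 1 & 0 \\
{\tt 0x00000005} & {\tt 0x00000003} & 0 & 0 & 0 & 0 \\
{\tt 0x80000000} & {\tt 0x7ffffffe} & 0 & 0 & 0 & 1 \\
\end{tabular}\end{center}
In the second row, carry is set because 1 is less than 2 as an unsigned number.
In the last row, the most negative signed value minus two cannot be
represented, hence the overflow.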

We'll walk through the remaining bits of the status register in order from
least significant to most significant.

\begin{enumerate}
\setcounter{enumi}{3}
\item The next bit is a sleep bit. Set this bit to one to disable instruction
	execution and put the CPU to sleep, or to zero to keep the pipeline
	running. Setting this bit will cause the CPU to wait for an interrupt
	(if interrupts are enabled), or to completely halt (if interrupts are
	disabled). This leads to the {\tt WAIT} and {\tt HALT} opcodes
	which will be discussed more later. In order to prevent users from
	halting the CPU, only the supervisor is allowed to both put the CPU to
	sleep and disable interrupts. Any user attempt to do so will simply
	result in a switch to supervisor mode.

\item The sixth bit is a global interrupt enable bit (GIE). This bit also
	forms the top, or fifth, bit of any register address. When this
	sixth bit is a `1' interrupts will be enabled, else disabled. When
	interrupts are disabled, the CPU will be in supervisor mode, otherwise
	it is in user mode. Thus, to execute a context switch, one only
	need enable or disable interrupts. (When an interrupt line goes
	high, interrupts will automatically be disabled, as the CPU goes
	and deals with its context switch.) Special logic has been added to
	keep the user mode from setting the sleep bit and clearing the
	GIE bit at the same time, with clearing the GIE bit taking
	precedence.

	Whenever read, the supervisor CC register will always have this bit
	cleared, whereas the user CC register will always have this bit set.

\item The seventh bit is a step bit in the user CC register, and zero in the
	supervisor CC register. This bit can only be set from supervisor
	mode. After setting this bit, should the supervisor mode process switch
	to user mode, it would then execute one instruction in user mode
	before returning to supervisor mode. This bit has no effect
	on the CPU while in supervisor mode.

	This functionality was added to enable userspace debugger
	functionality on a user process, working through supervisor mode
	of course.

	The CPU can be stepped in supervisor mode. Doing so requires the
	CPU debug functionality, not the step bit.

\item The eighth bit is a break enable bit. When applied to the supervisor CC
	register, this controls whether a break instruction in user mode will
	halt the processor for an external debugger (break enabled), or
	whether the break instruction will simply send the CPU into
	interrupt mode. This bit can only be set within supervisor mode.
	However, when applied to the user CC register, from supervisor mode,
	this bit will indicate whether or not the reason the CPU entered
	supervisor mode was a break instruction. This break
	reason bit is automatically cleared upon any transition to user mode,
	although it can also be cleared by the supervisor writing to the
	user CC register.

	Encountering a break in supervisor mode will halt the CPU independent
	of the break enable bit.

	This functionality was added to enable a debugger to set and manage
	breakpoints in a user mode process.

\item The ninth bit is an illegal instruction bit. When the CPU tries to
	execute either a non-existent instruction, or an instruction from
	an address that produces a bus error, the CPU will (if implemented)
	switch to supervisor mode while setting this bit. The bit will
	automatically be cleared upon any return to user mode.

\item The tenth bit is a trap bit. It is set whenever the user requests a
	soft interrupt, and cleared on any return to userspace command. This
	allows the supervisor, in supervisor mode, to determine whether it got
	to supervisor mode from a trap, from an external interrupt or both.

\item The eleventh bit is a bus error flag. If the user program encountered
	a bus error, this bit will be set in the user CC register and the CPU
	will switch to supervisor mode. The bit may be cleared by the
	supervisor, otherwise it is automatically cleared upon any return to
	user mode. If the supervisor encounters a bus error, this bit will be
	set in the supervisor CC register and the CPU will halt. In that
	case, either a CPU reset or a write to the supervisor CC register will
	clear this bit.

\item The twelfth bit is a division by zero exception flag. This operates
	in a fashion similar to the bus error flag. If the user attempts
	to use the divide instruction with a zero denominator, the system
	will switch to supervisor mode and set this bit in the user CC
	register. The bit is automatically cleared upon any return to user
	mode, although it can also be manually cleared by the supervisor. In
	a similar fashion, if the supervisor attempts to execute a divide by
	zero, the CPU will halt and set the division by zero exception flag in
	the supervisor's CC register. This will automatically be cleared upon
	any CPU reset, or it may be manually cleared by the external debugger
	writing to this register.

\item The thirteenth bit will operate in a similar fashion to both the bus
	error and division by zero flags, only it will be set upon a (yet to
	be determined) floating point error.

\item In the case of CIS instructions, if an exception occurs after the first
	instruction but before the second, the fourteenth bit of the CC
	register will be set to indicate this fact. This can be combined with
	the user PC to determine the address of the half-word where the fault
	occurred.

\item The fifteenth bit references a clear cache bit. The supervisor may
	write a one to this bit in order to clear the CPU instruction cache.
	The bit always reads as a zero.

\item Last, but not least, the sixteenth bit is reserved for a page not found
	memory exception to be created by the memory management unit.

\end{enumerate}

Some of the upper bits have been temporarily assigned to indicate CPU
capabilities. This is not a permanent feature, as these upper bits officially
remain reserved.
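
Because each flag occupies a single bit, software can test bit $n$ with a mask
of $2^n$. The masks implied by Tbl.~\ref{tbl:cc-register} are tabulated below.
The short descriptions are labels chosen here for illustration, and are not
taken from any header file.
\begin{center}\begin{tabular}{|c|l|l||c|l|l|}\hline
Bit & Mask & Meaning & Bit & Mask & Meaning \\\hline\hline
0 & {\tt 0x0001} & Zero & 8 & {\tt 0x0100} & Illegal instruction \\\hline
1 & {\tt 0x0002} & Carry & 9 & {\tt 0x0200} & Trap \\\hline
2 & {\tt 0x0004} & Negative & 10 & {\tt 0x0400} & Bus error \\\hline
3 & {\tt 0x0008} & Overflow & 11 & {\tt 0x0800} & Division by zero \\\hline
4 & {\tt 0x0010} & Sleep & 12 & {\tt 0x1000} & Floating point exception \\\hline
5 & {\tt 0x0020} & GIE & 13 & {\tt 0x2000} & CIS phase \\\hline
6 & {\tt 0x0040} & Step & 14 & {\tt 0x4000} & Clear I-Cache \\\hline
7 & {\tt 0x0080} & Break & 15 & {\tt 0x8000} & MMU exception \\\hline
\end{tabular}\end{center}
For example, a user CC value of {\tt 0x0224} decomposes as
{\tt 0x0200}$+${\tt 0x0020}$+${\tt 0x0004}, meaning that the trap, GIE, and
negative bits are set.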

\subsection{Instruction Format}\label{sec:isa-fmt}
All ZipCPU instructions fit in one of the formats shown in
Fig.~\ref{fig:iset-format}.
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox[tlr]{4}{}
	\bitbox[lrt]{5}{OpCode}
	\bitbox[lrt]{3}{}
	\bitbox{1}{0}
	\bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{0}\bitbox[lr]{4}{DR}
	\bitbox[lrb]{5}{}
	\bitbox[lr]{3}{Cnd}
	\bitbox{1}{1}
	\bitbox{4}{BR}
	\bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox[lr]{4}{}
	\bitbox[lrt]{5}{5'hd}
	\bitbox[lrb]{3}{}
	\bitbox{1}{A}
	\bitbox{4}{BR}
	\bitbox{1}{B}
	\bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox[lrb]{4}{}
	\bitbox{4}{4'hc}
	\bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
	\bitbox{1}{}
	\bitbox{2}{11}
	\bitbox{3}{xxx}
	\bitbox{22}{Ignored}
	\end{leftwordgroup} \\
\end{bytefield}
\caption{Zip Instruction Set Format}\label{fig:iset-format}
\end{center}\end{figure}
The basic format is that some operation, defined by the OpCode, is applied
if a condition, Cnd, is true in order to produce a result which is placed in
the destination register (DR).

There are three basic exceptions to this general instruction model. The
first is the {\tt MOV} instruction, which steals bits~13 and~18
to allow supervisor access to user registers. In supervisor mode, these
are set to one to reference user registers, zero otherwise. They are ignored
in user mode. The second exception is the load 23--bit
signed immediate instruction ({\tt LDI}), in that it accepts no conditions and
uses only a 4-bit opcode. The last exception is the {\tt NOOP} instruction
group, containing the {\tt BREAK}, {\tt LOCK}, {\tt SIM}, and {\tt NOOP}
opcodes. These instructions ignore their register and immediate settings.
Further, the immediate bits used by these opcodes are available for simulation
or debug facilities, but otherwise ignored by the CPU.
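
To make the standard format concrete, here is a worked encoding of
{\tt ADD 4,R1}. It is a sketch that assumes the leftmost field of
Fig.~\ref{fig:iset-format} occupies bit~31, and it takes the opcode value for
{\tt ADD} (5'h02) from Tbl.~\ref{tbl:iset-opcodes}.
\begin{center}\begin{tabular}{l|c|l|l}
Field & Bits & Value & Contribution \\\hline
Fixed zero & 31 & 0 & {\tt 0x00000000} \\
DR, here R1 & 30--27 & 4'h1 & {\tt 0x08000000} \\
OpCode, here ADD & 26--22 & 5'h02 & {\tt 0x00800000} \\
Cnd, here always & 21--19 & 3'h0 & {\tt 0x00000000} \\
Immediate select & 18 & 0 & {\tt 0x00000000} \\
Immediate & 17--0 & 18'h4 & {\tt 0x00000004} \\\hline
\multicolumn{3}{l|}{Instruction word (sum of the above)} & {\tt 0x08800004} \\
\end{tabular}\end{center}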

\subsection{Instruction OpCodes}\label{sec:isa-opcodes}
With a 5--bit opcode field, there are 32~possible instructions as shown in
Tbl.~\ref{tbl:iset-opcodes}.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|l|c|} \hline \rowcolor[gray]{0.85}
OpCode & & A-Reg & Instruction &Sets CC \\\hline\hline
5'h00 & {\tt SUB} & \multicolumn{2}{l|}{Subtract} & \\\cline{1-4}
5'h01 & {\tt AND} & \multicolumn{2}{l|}{Bitwise And} & \\\cline{1-4}
5'h02 & {\tt ADD} & \multicolumn{2}{l|}{Add two numbers} & \\\cline{1-4}
5'h03 & {\tt OR} & \multicolumn{2}{l|}{Bitwise Or} & Y \\\cline{1-4}
5'h04 & {\tt XOR} & \multicolumn{2}{l|}{Bitwise Exclusive Or} & \\\cline{1-4}
5'h05 & {\tt LSR} & \multicolumn{2}{l|}{Logical Shift Right} & \\\cline{1-4}
5'h06 & {\tt LSL} & \multicolumn{2}{l|}{Logical Shift Left} & \\\cline{1-4}
5'h07 & {\tt ASR} & \multicolumn{2}{l|}{Arithmetic Shift Right} & \\\hline
%
5'h08 & {\tt BREV} & \multicolumn{2}{l|}{Bit Reverse B operand into result}& \\\cline{1-4}
5'h09 & {\tt LDILO} & \multicolumn{2}{l|}{Load Immediate Low} & N\\\hline
5'h0a & {\tt MPYUHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from an unsigned 32x32 multiply} & \\\cline{1-4}
5'h0b & {\tt MPYSHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from a signed 32x32 multiply} & Y \\\cline{1-4}
5'h0c & {\tt MPY} & \multicolumn{2}{l|}{32x32 bit multiply} & \\\hline
5'h0d & {\tt MOV} & \multicolumn{2}{l|}{Move OpB into Ra} & N \\\hline
5'h0e & {\tt DIVU} & R0-R13 & Divide, unsigned & Y \\\cline{1-4}
5'h0f & {\tt DIVS} & R0-R13 & Divide, signed & \\\hline\hline
%
5'h10 & {\tt CMP} & \multicolumn{2}{l|}{Compare (Ra-OpB) to zero} & Y \\\cline{1-4}
5'h11 & {\tt TST} & \multicolumn{2}{l|}{Test (AND w/o setting result)} & \\\hline
5'h12 & {\tt LW} & \multicolumn{2}{l|}{Load a 32-bit word from memory (OpB) into Ra} & \\\cline{1-4}
5'h13 & {\tt SW} & \multicolumn{2}{l|}{Store a 32-bit word from Ra into memory at (OpB)} & \\\cline{1-4}
5'h14 & {\tt LH} & \multicolumn{2}{l|}{Load 16-bits from memory (OpB) into Ra, clear upper 16 bits} & N \\\cline{1-4}
5'h15 & {\tt SH} & \multicolumn{2}{l|}{Store the lower 16-bits of Ra into memory at (OpB)} & \\\cline{1-4}
5'h16 & {\tt LB} & \multicolumn{2}{l|}{Load 8-bits from memory (OpB) into Ra, clear upper 24 bits} & \\\cline{1-4}
5'h17 & {\tt SB} & \multicolumn{2}{l|}{Store the lower 8-bits of Ra into memory at (OpB)} & \\\hline\hline
5'h18/9 & {\tt LDI} & \multicolumn{2}{l|}{Load 23--bit signed immediate} & N \\\hline\hline
5'h1a & {\tt FPADD} & R0-R13 & Floating point add & \\\cline{1-4}
5'h1b & {\tt FPSUB} & R0-R13 & Floating point subtract & \\\cline{1-4}
5'h1c & {\tt FPMPY} & R0-R13 & Floating point multiply & Y \\\cline{1-4}
5'h1d & {\tt FPDIV} & R0-R13 & Floating point divide & \\\cline{1-4}
5'h1e & {\tt FPI2F} & R0-R13 & Convert integer to floating point & \\\cline{1-4}
5'h1f & {\tt FPF2I} & R0-R13 & Convert floating point to integer & \\\hline\hline
5'h1c & {\tt BREAK} &None(15)&& \\\cline{1-4}
5'h1d & {\tt LOCK} &None(15)&& N\\\cline{1-4}
5'h1e & {\tt SIM} &None(15)&&\\\cline{1-4}
5'h1f & {\tt NOOP} &None(15)&&\\\hline
\end{tabular}
\caption{ZipCPU OpCodes}\label{tbl:iset-opcodes}
\end{center}\end{table}
%
\subsection{Conditional Instructions}\label{sec:isa-cond}
Most, although not quite all, instructions may be conditionally executed.
The 23--bit load immediate instruction, together with the {\tt NOOP},
{\tt BREAK}, and {\tt LOCK} instructions, are the exceptions to this rule.
All other instructions may be conditionally executed.

From the four condition code flags, eight conditions are defined, as shown in
Tbl.~\ref{tbl:conditions}.
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .Z} & Only execute when `Z' is set \\
3'h2 & {\tt .LT}& Less than (`N' set) \\
3'h3 & {\tt .C} & Carry set (Also known as less-than unsigned) \\
3'h4 & {\tt .V} & Overflow set\\
3'h5 & {\tt .NZ}& Only execute when `Z' is not set \\
3'h6 & {\tt .GE}& Greater than or equal (`N' not set) \\
3'h7 & {\tt .NC}& Not carry (also known as greater-than or equal, unsigned) \\
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}\end{table}
There are no condition codes for either less than or equal or greater than,
whether signed or unsigned. In a similar fashion, there is no condition
code for not V: there just wasn't enough space in 3~bits. Ways of handling
non--supported conditions are discussed in Sec.~\ref{sec:in-mcond}.

With the exception of \hbox{\tt CMP} and \hbox{\tt TST} instructions,
conditionally executed instructions will not further adjust the condition
codes. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust
conditions whenever they are executed. In this way, multiple conditions may
be evaluated without branches, creating a sort of logical and, but only if all
the conditions are the same. For example, to do something if \hbox{\tt R0} is
one and \hbox{\tt R1} is two, one might try code such as
Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{tabular}{l}
	{\tt CMP 1,R0} \\
	{\em ; Condition codes are now set based upon R0-1} \\
	{\tt CMP.Z 2,R1} \\
	{\em ; If R0 $\neq$ 1, conditions are unchanged, {\tt Z} is still false.} \\
	{\em ; If R0 $=$ 1, conditions are now set based upon R1-2.} \\
	{\em ; Now some instruction could be done based upon the conjunction} \\
	{\em ; of both conditions.} \\
	{\em ; While we use the example of a {\tt SW}, it could easily be any
		instruction.} \\
	{\tt SW.Z R0,(R2)} \\
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}

The real utility of conditionally executed instructions is that, unlike
conditional branches, conditionally executed instructions will not stall
the bus if they are not executed.
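
As a further sketch of what conditional execution buys, the larger of two
signed values can be selected without any branch. The listing assumes the
{\tt MOV Rsrc,Rdst} operand order, and it ignores the possibility of signed
overflow within the compare.
\begin{center}\begin{tabular}{l}
	{\tt CMP R1,R0} \\
	{\em ; Condition codes are now set based upon R0-R1} \\
	{\tt MOV.LT R1,R0} \\
	{\em ; If R0 $<$ R1, R0 now holds R1. Otherwise R0 is left alone.} \\
	{\em ; Either way, R0 ends up holding the larger of the two values.} \\
\end{tabular}\end{center}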

\subsection{Modifying Conditions}\label{sec:in-mcond}
A quick look at the list of conditions supported by the ZipCPU and listed
in Tbl.~\ref{tbl:conditions} reveals that the ZipCPU does not have a full set
of conditions. In particular, neither less--than or equal nor greater--than is
supported directly, whether signed or unsigned.
Therefore, Tbl.~\ref{tbl:creating-conditions}
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
Original & Modified & Name \\\hline\hline
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
	& \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BLT label}
	& Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
	& \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLT label\\BZ label}
	& Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline\hline
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGT label} % if (Ry > Rx) -> Rx < Ry
	& \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BGE label}
	& Greater-than (immediate) \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGT label} % if (Ry > Rx) -> Rx < Ry
	& \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BLT label}
	& Greater-than (register) \\[4mm]\hline\hline
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLEU label}
	& \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BC label}
	& Less-than or equal unsigned immediate \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}
	& \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BNC label}
	& Less-than or equal unsigned register\\[4mm]\hline\hline
\parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGTU label} % if (Ry > Rx) -> Rx < Ry
	& \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BNC label}
	& Greater-than unsigned (immediate)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label} % if (Ry > Rx) -> Rx < Ry
	& \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}
	& Greater-than unsigned \\[4mm]\hline
\end{tabular}
\caption{Modifying conditions}\label{tbl:creating-conditions}
\end{center}\end{table}
951 |
|
|
shows examples of how these unsupported conditions can be created
|
952 |
|
|
simply by adjusting the compare instruction, for no extra cost in clocks.
|
953 |
|
|
Of course, if the compare originally had an immediate within it, that immediate
|
954 |
199 |
dgisselq |
would need to be loaded into a register in order to do make some of these
|
955 |
|
|
adjustments. That case is shown as the last case above.
|
956 |
139 |
dgisselq |
|
957 |
199 |
dgisselq |
Many of these alternate conditions are chosen by the compiler implementation.
|
958 |
21 |
dgisselq |
|
959 |
199 |
dgisselq |
Users should be aware of any signed overflow that might take place within the
|
960 |
|
|
modified conditions, especially when numbers close to the limit are used.
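As one illustration of how such an overflow can arise, the Python sketch below
models the greater-than rewrite from Tbl.~\ref{tbl:creating-conditions}. It
assumes the adjusted comparison value {\tt 1+Imm} is formed in 32--bit two's
complement arithmetic; the helper names are hypothetical and exist only for
this example. The two forms agree until {\tt 1+Imm} wraps around.

```python
def wrap32(value):
    """Wrap an integer to a signed 32-bit two's complement value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def bgt_original(ry, imm):
    """CMP Imm,Ry / BGT: branch when Ry > Imm (signed)."""
    return ry > imm


def bgt_rewritten(ry, imm):
    """CMP 1+Imm,Ry / BGE: branch when Ry >= (Imm + 1), with wraparound."""
    return ry >= wrap32(imm + 1)


INT_MAX = 0x7FFFFFFF  # largest signed 32-bit value (assumed operand width)
```

For ordinary values the two functions return the same answer. At
{\tt Imm = INT\_MAX} the adjusted value wraps to the most negative number, so
the rewritten test branches even though nothing can be greater than the
original immediate.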
|
961 |
21 |
dgisselq |
|
962 |
|
|
|
963 |
199 |
dgisselq |
\subsection{Operand B}\label{sec:isa-opb}
|
964 |
|
|
Many instruction forms have a 19-bit source ``Operand B'', or OpB for short,
|
965 |
|
|
associated with them. This ``Operand B'' is shown in
|
966 |
|
|
Fig.~\ref{fig:iset-format} as part of the standard instructions. An Operand B
|
967 |
|
|
is either equal to a register plus a 14--bit signed immediate offset, or an
|
968 |
|
|
18--bit signed immediate offset by itself. This value is encoded as shown in
|
969 |
|
|
Tbl.~\ref{tbl:opb}.
|
970 |
|
|
\begin{table}\begin{center}
|
971 |
|
|
\begin{bytefield}[endianness=big]{19}
|
972 |
|
|
\bitheader{0-18} \\
|
973 |
|
|
\bitbox{1}{0}\bitbox{18}{18-bit Signed Immediate} \\
|
974 |
|
|
\bitbox{1}{1}\bitbox{4}{Reg}\bitbox{14}{14-bit Signed Immediate}
|
975 |
|
|
\end{bytefield}
|
976 |
|
|
\caption{Bit allocation for Operand B}\label{tbl:opb}
|
977 |
|
|
\end{center}\end{table}
|
978 |
|
|
This format represents a deviation from many other RISC architectures that use
|
979 |
|
|
{\tt R0} to represent zero, such as OpenRISC and RISC-V. Here, instead, we use
|
980 |
|
|
a bit within the instruction to note whether or not an immediate is used.
|
981 |
|
|
The result is that ZipCPU instructions can encode larger immediates within their
|
982 |
|
|
instruction space.
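The encoding in Tbl.~\ref{tbl:opb} can be sketched in Python as follows. This
is a minimal model, not the CPU's decoder: it assumes the selector is the most
significant bit (bit~18) of the 19--bit field, followed by the 4--bit register
and the 14--bit immediate, as the big-endian layout of the table suggests. The
function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def decode_opb(field):
    """Decode a 19-bit Operand B field into (register or None, immediate)."""
    field &= 0x7FFFF
    if field & (1 << 18):
        # Selector bit set: 4-bit register plus 14-bit signed immediate.
        reg = (field >> 14) & 0xF
        return reg, sign_extend(field, 14)
    # Selector bit clear: 18-bit signed immediate by itself.
    return None, sign_extend(field, 18)
```

For example, {\tt decode\_opb(0x3FFFF)} yields an immediate of $-1$ with no
register, while setting bit~18 shrinks the immediate to 14~bits in exchange for
a register index.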
|
983 |
139 |
dgisselq |
|
984 |
199 |
dgisselq |
In those cases where a fourteen or eighteen bit immediate doesn't make sense,
|
985 |
|
|
such as for {\tt LDILO}, the extra bits associated with the immediate are
|
986 |
|
|
simply ignored. (This rule does not apply to the shift instructions,
|
987 |
|
|
{\tt ASR}, {\tt LSR}, and {\tt LSL}--which all use all of their immediate bits.)
|
988 |
|
|
|
989 |
|
|
\subsection{Address Modes}\label{sec:isa-addr}
|
990 |
|
|
The ZipCPU supports two addressing modes: register plus immediate, and
|
991 |
|
|
immediate addressing. Addresses are encoded in the same fashion as
|
992 |
|
|
Operand B's, discussed above.
|
993 |
|
|
|
994 |
|
|
\subsection{Move Operands}\label{sec:isa-mov}
|
995 |
|
|
The previous set of operands would be perfect and complete, save only that the
|
996 |
|
|
CPU needs access to non--supervisory registers while in supervisory mode. The
|
997 |
|
|
MOV instruction has been modified to fit that purpose. The two bits,
|
998 |
|
|
shown as {\tt A} and {\tt B} in Fig.~\ref{fig:iset-format} above, are designed
|
999 |
|
|
to contain the high order bit of the 5--bit register index. If the {\tt B}
|
1000 |
|
|
bit is a `1', the source operand comes from the user register set. If the
|
1001 |
|
|
{\tt A} bit is a `1', the destination operand is in the user register set. A
|
1002 |
|
|
zero bit indicates the current register set.
|
1003 |
|
|
|
1004 |
|
|
This encoding has been chosen to keep the compiler simple. For the most part,
|
1005 |
|
|
the extra bits are quietly set to zero by the compiler. Assembly instructions,
|
1006 |
|
|
or particular built--in instructions, can be used to get access to these
|
1007 |
|
|
cross register set move instructions.
|
1008 |
|
|
|
1009 |
|
|
Further, the {\tt MOV} instruction lacks the full OpB capability to use either
an immediate or a register plus immediate as a source, since a load immediate
instruction already exists. As a result, all moves come from a register plus a
potential offset.
|
1013 |
|
|
|
1014 |
|
|
\subsection{Multiply Operations}\label{sec:isa-mpy}
|
1015 |
|
|
|
1016 |
|
|
The ZipCPU supports three separate 32x32-bit multiply
|
1017 |
139 |
dgisselq |
instructions: {\tt MPY}, {\tt MPYUHI}, and {\tt MPYSHI}. The first of these
|
1018 |
|
|
produces the low 32-bits of a 32x32-bit multiply result. The other two
produce the upper 32-bits: {\tt MPYUHI} produces the upper 32-bits
assuming the multiply was unsigned, whereas {\tt MPYSHI} assumes it was signed.
|
1021 |
|
|
Each multiply instruction is independent of every other in execution, although
|
1022 |
|
|
the compiler is likely to use them in a dependent fashion.
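A behavioral sketch of the three results, written in Python, may help make the
distinction concrete. It models only the arithmetic described above, not the
timing or the hardware implementation, and the lower-case function names are
hypothetical stand-ins for the instructions.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret a 32-bit pattern as a signed two's complement value."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mpy(a, b):
    """Low 32 bits of the product; the same for signed and unsigned inputs."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mpyuhi(a, b):
    """Upper 32 bits of the 64-bit product, treating both inputs as unsigned."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def mpyshi(a, b):
    """Upper 32 bits of the 64-bit product, treating both inputs as signed."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32
```

A full 64--bit unsigned product is then {\tt (mpyuhi(a, b) << 32) | mpy(a, b)},
which is the dependent usage a compiler would typically generate.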
|
1023 |
139 |
dgisselq |
|
1024 |
199 |
dgisselq |
In an effort to maintain a fast clock speed, all three of these multiplies
|
1025 |
|
|
have been slowed down in logic. Thus, depending upon the setting of
|
1026 |
|
|
{\tt OPT\_MULTIPLY} within {\tt cpudefs.v}, or the corresponding
|
1027 |
|
|
{\tt IMPLEMENT\_MPY} parameter that may override it, the multiply instructions
|
1028 |
|
|
will either 1)~cause an ILLEGAL instruction error ({\tt OPT\_MULTIPLY=0}, or
|
1029 |
|
|
no multiply support), 2)~take one additional clock ({\tt OPT\_MULTIPLY=2}),
|
1030 |
|
|
or 3)~take two additional clock cycles ({\tt OPT\_MULTIPLY=3}).\footnote{Support
|
1031 |
|
|
also exists for a one clock multiply (no clock slowdown), or a four clock
|
1032 |
|
|
multiply, and I am anticipating supporting a much longer multiply for FPGA
|
1033 |
|
|
architectures with no accelerated hardware multiply support.}
|
1034 |
139 |
dgisselq |
|
1035 |
199 |
dgisselq |
\subsection{Divide Unit}
|
1036 |
|
|
The ZipCPU also has an optional divide unit which can be built alongside the
|
1037 |
|
|
ALU. This divide unit provides the ZipCPU with another two instructions that
|
1038 |
69 |
dgisselq |
cannot be executed in a single cycle: {\tt DIVS}, or signed divide, and
|
1039 |
|
|
{\tt DIVU}, the unsigned divide. These are both 32--bit divide instructions,
|
1040 |
|
|
dividing one 32--bit number by another. In this case, the Operand B field,
|
1041 |
|
|
whether it be register or register plus immediate, constitutes the denominator,
|
1042 |
|
|
whereas the numerator is given by the other register.
|
1043 |
21 |
dgisselq |
|
1044 |
199 |
dgisselq |
As with the multiply, the divide instructions are also multi--clock
instructions. While the divide is running, the ALU, any memory loads, and the
|
1046 |
|
|
floating point unit (if installed) will be idle. Once the divide completes,
|
1047 |
|
|
other units may continue.
|
1048 |
21 |
dgisselq |
|
1049 |
199 |
dgisselq |
Of course, any divide instruction can result in a division by zero exception.
|
1050 |
|
|
If this happens the CPU will either suddenly transition from user mode to
|
1051 |
|
|
supervisor mode, or it will halt if the CPU is already in supervisor mode. Upon
|
1052 |
|
|
exception, the divide by zero bit will be set in the CC register. In the
|
1053 |
|
|
case of a user mode divide by zero, this will be cleared by any return to user
|
1054 |
|
|
mode command. The supervisor bit may be cleared either by a reboot or by the
|
1055 |
|
|
external debugger.
|
1056 |
32 |
dgisselq |
|
1057 |
202 |
dgisselq |
\section{CIS Instructions}
|
1058 |
|
|
|
1059 |
|
|
The ZipCPU also supports a compressed instruction set (CIS), outlined in
|
1060 |
|
|
Fig.~\ref{fig:iset-cis},
|
1061 |
|
|
\begin{figure}\begin{center}
|
1062 |
|
|
\begin{bytefield}[endianness=big]{16}
|
1063 |
|
|
\bitheader{0-15}\\
|
1064 |
|
|
\bitbox[lrt]{1}{}\bitbox[lrt]{4}{}
|
1065 |
|
|
\bitbox[lrt]{3}{COp}
|
1066 |
|
|
\bitbox{1}{0}
|
1067 |
|
|
\bitbox{7}{Imm.} \\
|
1068 |
|
|
\bitbox[lr]{1}{1}\bitbox[lr]{4}{DR}
|
1069 |
|
|
\bitbox[lrb]{3}{}
|
1070 |
|
|
\bitbox{1}{1}
|
1071 |
|
|
\bitbox{4}{BR}
|
1072 |
|
|
\bitbox{3}{Imm} \\
|
1073 |
|
|
\bitbox[lr]{1}{}\bitbox[lr]{4}{}
|
1074 |
|
|
\bitbox{3}{\tt LDI}
|
1075 |
|
|
\bitbox{8}{8'b Imm} \\
|
1076 |
|
|
\bitbox[lrb]{1}{}\bitbox[lrb]{4}{}
|
1077 |
|
|
\bitbox{3}{\tt MOV}
|
1078 |
|
|
\bitbox{1}{1}
|
1079 |
|
|
\bitbox{4}{BR}
|
1080 |
|
|
\bitbox{3}{Imm} \\
|
1081 |
|
|
\end{bytefield}
|
1082 |
|
|
\caption{Zip Compressed Instruction Set (CIS) Format}\label{fig:iset-cis}
|
1083 |
|
|
\end{center}\end{figure}
|
1084 |
|
|
when enabled via {\tt OPT\_CIS}.
|
1085 |
|
|
This compressed instruction set packs two instructions per word. Words
|
1086 |
|
|
must still be aligned, and jumping into the middle of a compressed instruction
|
1087 |
|
|
is not allowed. Further, the CIS only permits the encoding of 8~of the
|
1088 |
|
|
32~opcodes available in the ISA, as listed in Tbl.~\ref{tbl:iset-cisops}.
|
1089 |
|
|
\begin{table}\begin{center}
|
1090 |
|
|
\begin{tabular}{|l|l|l|} \hline \rowcolor[gray]{0.85}
|
1091 |
|
|
COp & & Instruction \\\hline\hline
|
1092 |
|
|
3'h00 & {\tt SUB} & Subtract \\\hline
|
1093 |
|
|
3'h01 & {\tt AND} & Bitwise And \\\hline
|
1094 |
|
|
3'h02 & {\tt ADD} & Add two numbers \\\hline
|
1095 |
|
|
3'h03 & {\tt CMP} & Compare \\\hline
3'h04 & {\tt LW} & Load word \\\hline
3'h05 & {\tt SW} & Store word \\\hline
3'h06 & {\tt LDI} & Load immediate \\\hline
3'h07 & {\tt MOV} & Move \\\hline
|
1100 |
|
|
\end{tabular}
|
1101 |
|
|
\caption{CIS OpCodes}\label{tbl:iset-cisops}
|
1102 |
|
|
\end{center}\end{table}
|
1103 |
|
|
A final feature of the compressed instruction set has to do with {\tt LW} and
|
1104 |
|
|
{\tt SW} instructions. An {\tt LW} or {\tt SW} instruction with bit-7 set
|
1105 |
|
|
low references an offset from the Stack Pointer (SP). Hence the compressed
|
1106 |
|
|
instruction set allows loads and stores to offsets of the Stack Pointer
|
1107 |
|
|
of -128~octets on up to~127 octets. In practice, this gives the compressed
|
1108 |
|
|
load and store instructions, when referencing the stack, thirty--two words
|
1109 |
|
|
that they can reference.
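The field layout of Fig.~\ref{fig:iset-cis} can be sketched as a small Python
decoder for a single 16--bit instruction half. Several details here are
assumptions made for illustration rather than statements about the hardware:
the bit positions are read directly off the figure (bit~15 as the marker,
bits~14--11 as {\tt DR}, bits~10--8 as {\tt COp}, bit~7 as the
register/immediate selector), and every immediate is treated as signed. The
function and table names are hypothetical.

```python
CIS_OPS = ["SUB", "AND", "ADD", "CMP", "LW", "SW", "LDI", "MOV"]


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def decode_cis_half(half):
    """Decode one 16-bit compressed instruction into a dict of fields."""
    dr = (half >> 11) & 0xF
    op = CIS_OPS[(half >> 8) & 0x7]
    if op == "LDI":
        # LDI has no selector bit: all eight low bits form the immediate.
        return {"op": op, "dr": dr, "br": None, "imm": sign_extend(half, 8)}
    if half & 0x80:
        # Selector set: 4-bit base register plus a 3-bit immediate.
        return {"op": op, "dr": dr, "br": (half >> 3) & 0xF,
                "imm": sign_extend(half, 3)}
    # Selector clear: 7-bit immediate; loads and stores are SP relative.
    base = "SP" if op in ("LW", "SW") else None
    return {"op": op, "dr": dr, "br": base, "imm": sign_extend(half, 7)}
```

How the two halves are packed into a 32--bit word is not modeled here.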
|
1110 |
|
|
|
1111 |
|
|
This compressed instruction set is somewhat similar to other architectures that
have a thumb instruction set, with the difference that the ZipCPU can intermix
regular and thumb instructions at will. When using the CIS, instructions are
still issued one at a time; however, interrupts are disabled between
instruction halves, in order to prevent the CPU from stopping mid instruction.
|
1116 |
|
|
Further, it is the silent job of the assembler to compress CIS instructions
|
1117 |
|
|
in an opportunistic fashion.
|
1118 |
|
|
|
1119 |
|
|
The disassembler represents CIS instructions by placing a vertical bar
|
1120 |
|
|
between the two components, while still leaving them on the same line.
|
1121 |
|
|
|
1122 |
|
|
The CIS instruction set does not support conditional execution.
|
1123 |
|
|
|
1124 |
|
|
\subsection{BREAK, Bus LOCK, SIM, and NOOP Instructions}
|
1125 |
|
|
Four instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes}, are
|
1126 |
|
|
somewhat special. These are the {\tt BREAK}, bus {\tt LOCK}, {\tt SIM}, and
|
1127 |
|
|
{\tt NOOP} instructions. These are encoded according to
|
1128 |
199 |
dgisselq |
Fig.~\ref{fig:iset-noop}.
|
1129 |
69 |
dgisselq |
\begin{figure}\begin{center}
|
1130 |
|
|
\begin{bytefield}[endianness=big]{32}
|
1131 |
|
|
\bitheader{0-31}\\
|
1132 |
|
|
\begin{leftwordgroup}{BREAK}
|
1133 |
202 |
dgisselq |
\bitbox[lrt]{1}{}\bitbox[lrt]{3}{}
|
1134 |
|
|
\bitbox{1}{}\bitbox[lrt]{3}{}\bitbox{2}{00}\bitbox{22}{Reserved for debugger}
|
1135 |
69 |
dgisselq |
\end{leftwordgroup} \\
|
1136 |
|
|
\begin{leftwordgroup}{LOCK}
|
1137 |
202 |
dgisselq |
\bitbox[lr]{1}{0}\bitbox[lr]{3}{3'h7}
|
1138 |
|
|
\bitbox{1}{}\bitbox[lr]{3}{111}\bitbox{2}{01}\bitbox{22}{Ignored}
|
1139 |
69 |
dgisselq |
\end{leftwordgroup} \\
|
1140 |
202 |
dgisselq |
\begin{leftwordgroup}{SIM}
|
1141 |
|
|
\bitbox[lr]{1}{}\bitbox[lr]{3}{}\bitbox{1}{}
|
1142 |
|
|
\bitbox[lr]{3}{}\bitbox{2}{10}\bitbox[lrt]{22}{Reserved for Simulator}
|
1143 |
|
|
\end{leftwordgroup} \\
|
1144 |
|
|
\begin{leftwordgroup}{NOOP}
|
1145 |
|
|
\bitbox[lrb]{1}{}\bitbox[lrb]{3}{}\bitbox{1}{}
|
1146 |
|
|
\bitbox[lrb]{3}{}\bitbox{2}{11}\bitbox[lrb]{22}{}
|
1147 |
|
|
\end{leftwordgroup} \\
|
1148 |
69 |
dgisselq |
\end{bytefield}
|
1149 |
|
|
\caption{BREAK/LOCK/SIM/NOOP Instruction Format}\label{fig:iset-noop}
|
1150 |
|
|
\end{center}\end{figure}
|
1151 |
32 |
dgisselq |
|
1152 |
69 |
dgisselq |
The {\tt BREAK} instruction is useful for creating a debug instruction that
|
1153 |
|
|
will halt the CPU without executing. If in user mode, depending upon the
|
1154 |
|
|
setting of the break enable bit, it will either switch to supervisor mode or
|
1155 |
199 |
dgisselq |
halt the CPU--depending upon where the user wishes to do his debugging. The
|
1156 |
202 |
dgisselq |
lower 22~bits of this instruction are reserved for the debugger's use.
|
1157 |
21 |
dgisselq |
|
1158 |
202 |
dgisselq |
The {\tt LOCK} instruction provides the ZipCPU's atomic operation support,
|
1159 |
|
|
although it only works when the CPU is configured for pipeline
|
1160 |
|
|
mode.\footnote{The reason for not allowing {\tt LOCK} support in
|
1161 |
|
|
non-pipelined mode is that the instruction fetch is not allowed to interrupt
|
1162 |
|
|
a lock cycle. In non-pipelined mode, the instruction fetch must take place
|
1163 |
|
|
between every bus access, negating this utility.} It works by stalling the
|
1164 |
|
|
ALU pipeline stack until all prior stages are filled, and then it guarantees
|
1165 |
|
|
that once a bus cycle is started, the wishbone {\tt CYC} line will remain
|
1166 |
|
|
asserted for up to three instructions. This allows the execution of one
|
1167 |
|
|
memory load (ex. {\tt LW}), one ALU operation (ex. {\tt ADD}), and then
|
1168 |
|
|
another memory instruction (ex. {\tt SW}), to take place in an uninterrupted
|
1169 |
|
|
fashion. Example uses of this capability include an atomic increment, such
|
1170 |
|
|
as {\tt LOCK}, {\tt LW (Rx),Ry}, {\tt ADD \#1,Ry}, {\tt SW Ry,(Rx)}, or even
|
1171 |
|
|
a two instruction pair such as a test and set sequence: {\tt LDI 1,Rz},
|
1172 |
|
|
{\tt LOCK}, {\tt LW (Rx),Ry}, {\tt SW Rz,(Rx)}.
|
1173 |
21 |
dgisselq |
|
1174 |
202 |
dgisselq |
The {\tt SIM} and {\tt NOOP} instructions need a touch more explaining.
|
1175 |
|
|
From the standpoint of the CPU, when running from Verilog within an FPGA,
|
1176 |
|
|
the {\tt SIM} instruction is an illegal instruction--generating an illegal
|
1177 |
|
|
instruction exception. Likewise the {\tt NOOP} instruction is just that:
|
1178 |
|
|
an instruction that consumes a clock, but does not perform any operation.
|
1179 |
|
|
In both cases, the lower 22--bits are ignored.
|
1180 |
|
|
|
1181 |
|
|
Both {\tt SIM} and {\tt NOOP} instructions, though, contain 22--bits that can
|
1182 |
|
|
be used by a simulator if present. The encoding of these 22-bits is identical,
|
1183 |
|
|
so that programs that run in a simulator may run on actual hardware as well
|
1184 |
|
|
(using the {\tt NOOP} encoding), or they may complain that they were not intended
|
1185 |
|
|
to run on actual hardware, such as if the {\tt SIM} encoding were used.
|
1186 |
|
|
Particular encodings allow for exiting the simulation with a known exit
|
1187 |
|
|
code, {\tt $x$EXIT}, dumping either one or all registers, {\tt $x$DUMP},
|
1188 |
|
|
or simply sending a character to the simulator's standard output stream,
|
1189 |
|
|
{\tt $x$OUT}--where $x$ is either {\tt N} for the {\tt NOOP} version of the
|
1190 |
|
|
instruction, or {\tt S} for the {\tt SIM} version of the opcode.
|
1191 |
|
|
|
1192 |
|
|
The {\tt SIM} instruction is currently a new facility for the ZipCPU, and
|
1193 |
|
|
so its functionality remains under test.
|
1194 |
|
|
|
1195 |
199 |
dgisselq |
\subsection{Floating Point}
|
1196 |
|
|
Although the ZipCPU does not (yet) have a floating point unit, the current
|
1197 |
202 |
dgisselq |
instruction set offers six opcodes for floating point operations, and treats
|
1198 |
69 |
dgisselq |
floating point exceptions like divide by zero errors. Once this unit is built
|
1199 |
199 |
dgisselq |
and integrated together with the rest of the CPU, the ZipCPU will support
|
1200 |
69 |
dgisselq |
32--bit floating point instructions natively. Any 64--bit floating point
|
1201 |
199 |
dgisselq |
instructions will either need to be emulated in software, or else they will
|
1202 |
|
|
need an external floating point peripheral.
|
1203 |
69 |
dgisselq |
|
1204 |
202 |
dgisselq |
Until this FPU is built and integrated, or even afterwards if the floating
|
1205 |
|
|
point unit is not installed by option, floating point instructions will
|
1206 |
|
|
trigger an illegal instruction exception, which may be trapped and then
|
1207 |
|
|
implemented in software.
|
1208 |
139 |
dgisselq |
|
1209 |
199 |
dgisselq |
\subsection{Derived Instructions}
|
1210 |
|
|
The ZipCPU supports many other common instructions by construction, although
|
1211 |
|
|
not all of them are single cycle instructions. Tables~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3} and~\ref{tbl:derived-4} show how these
|
1212 |
|
|
other instructions may be implemented on the ZipCPU. Many of these
|
1213 |
|
|
instructions will have assembly equivalents,
|
1214 |
21 |
dgisselq |
such as the branch instructions, to facilitate working with the CPU.
|
1215 |
|
|
\begin{table}\begin{center}
|
1216 |
202 |
dgisselq |
\begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline
|
1217 |
21 |
dgisselq |
Mapped & Actual & Notes \\\hline
|
1218 |
39 |
dgisselq |
{\tt ABS Rx}
|
1219 |
|
|
& \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
|
1220 |
199 |
dgisselq |
& Absolute value, depends upon the derived {\tt NEG} instruction
|
1221 |
|
|
below, and so this expands into three instructions total.\\\hline
|
1222 |
39 |
dgisselq |
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
|
1223 |
|
|
& \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
|
1224 |
21 |
dgisselq |
& Add with carry \\\hline
|
1225 |
202 |
dgisselq |
\hbox{\tt BRA.$x$ +/-\$Addr}
|
1226 |
199 |
dgisselq |
& \hbox{\tt ADD.$x$ \$Addr+PC,PC}
|
1227 |
|
|
& Branch or jump on condition $x$. Works for 18--bit
|
1228 |
24 |
dgisselq |
signed address offsets.\\\hline
|
1229 |
199 |
dgisselq |
% {\tt BRA.Cond +/-\$Addr}
|
1230 |
|
|
% & \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
|
1231 |
|
|
% & Branch/jump on condition. Works for 23 bit address offsets, but
|
1232 |
|
|
% costs a register and an extra instruction. With LDIHI and LDILO
|
1233 |
|
|
% this can be made to work anywhere in the 32-bit address space, but yet
|
1234 |
|
|
% cost an additional instruction still. \\\hline
|
1235 |
|
|
% {\tt BNC PC+\$Addr}
|
1236 |
|
|
% & \parbox[t]{1.5in}{\tt Test \$Carry,CC \\ ADD.Z PC+\$Addr,PC}
|
1237 |
|
|
% & Example of a branch on an unsupported
|
1238 |
|
|
% condition, in this case a branch on not carry \\\hline
|
1239 |
|
|
{\tt BUSY } & {\tt ADD \$-1,PC} & Execute an infinite loop. This is used
|
1240 |
|
|
	within ZipCPU simulations as the exit simulation on error
|
1241 |
|
|
instruction. \\\hline
|
1242 |
39 |
dgisselq |
{\tt CLRF.NZ Rx }
|
1243 |
|
|
& {\tt XOR.NZ Rx,Rx}
|
1244 |
21 |
dgisselq |
& Clear Rx, and flags, if the Z-bit is not set \\\hline
|
1245 |
39 |
dgisselq |
{\tt CLR Rx }
|
1246 |
|
|
& {\tt LDI \$0,Rx}
|
1247 |
199 |
dgisselq |
& Clears Rx, leaving the flags untouched. This instruction cannot be
|
1248 |
21 |
dgisselq |
conditional. \\\hline
|
1249 |
199 |
dgisselq |
{\tt CLR.NZ Rx }
|
1250 |
|
|
& {\tt BREV.NZ \$0,Rx}
|
1251 |
|
|
& Clears Rx, leaving the flags untouched. This instruction can be
|
1252 |
|
|
executed conditionally. The assembler will quietly choose
|
1253 |
|
|
between {\tt LDI} and {\tt BREV} depending upon the existence
|
1254 |
|
|
of the condition.\\\hline
|
1255 |
39 |
dgisselq |
{\tt EXCH.W Rx }
|
1256 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt MOV Rx,Rh \\
|
1257 |
|
|
LSL \$16,Rh \\
|
1258 |
|
|
LSR \$16,Rx \\
|
1259 |
|
|
OR Rh,Rx }
|
1260 |
21 |
dgisselq |
	& Exchanges the top and bottom 16--bit words of Rx \\\hline
|
1261 |
39 |
dgisselq |
{\tt HALT }
|
1262 |
|
|
& {\tt Or \$SLEEP,CC}
|
1263 |
|
|
& This only works when issued in interrupt/supervisor mode. In user
|
1264 |
199 |
dgisselq |
mode this is simply a wait until interrupt instruction.
|
1265 |
|
|
|
1266 |
|
|
This is also used within the simulator as an exit simulation on
|
1267 |
|
|
success instruction.\\\hline
|
1268 |
69 |
dgisselq |
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
|
1269 |
39 |
dgisselq |
{\tt IRET}
|
1270 |
|
|
& {\tt OR \$GIE,CC}
|
1271 |
|
|
& Also known as an RTU instruction (Return to Userspace) \\\hline
|
1272 |
202 |
dgisselq |
\hbox{\tt JMP R6+\$Offset}
|
1273 |
92 |
dgisselq |
& {\tt MOV \$Offset(R6),PC}
|
1274 |
199 |
dgisselq |
& Only works for 13--bit offsets. Other offsets require adding the
|
1275 |
|
|
offset first to R6 before jumping.\\\hline
|
1276 |
69 |
dgisselq |
{\tt LJMP \$Addr}
|
1277 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt LW (PC),PC \\ {\em Address }}
|
1278 |
69 |
dgisselq |
& Although this only works for an unconditional jump, and it only
|
1279 |
199 |
dgisselq |
works in an architecture with a unified instruction and data address
|
1280 |
|
|
space, this instruction combination makes for a nice combination that
|
1281 |
|
|
can be adjusted by a linker at a later time.\\\hline
|
1282 |
|
|
{\tt LJMP.x \$Addr}
|
1283 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt LW.x 4(PC),PC \\ ADD 4,PC \\ {\em Address }}
|
1284 |
|
|
	& Conditional long jump, although not necessarily the best way to do this. \\\hline
|
1285 |
199 |
dgisselq |
\end{tabular}
|
1286 |
|
|
\caption{Derived Instructions}\label{tbl:derived-1}
|
1287 |
|
|
\end{center}\end{table}
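The add-with-carry expansion in Tbl.~\ref{tbl:derived-1} can be checked with a
short Python model. This is a sketch under the assumption that {\tt ADD}
sets the carry flag on unsigned overflow and that {\tt ADD.C} executes only
when that flag is set; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add_flags(a, b):
    """32-bit ADD: return (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32


def add64(rx, ry, ra, rb):
    """Add the pair (Rb:Ra) into (Ry:Rx), low word first."""
    rx, carry = add_flags(rx, ra)      # ADD   Ra,Rx
    if carry:                          # ADD.C $1,Ry  (only if carry is set)
        ry, _ = add_flags(ry, 1)
    ry, _ = add_flags(ry, rb)          # ADD   Rb,Ry
    return rx, ry
```

The model returns only the 64--bit sum; it makes no claim about the flags left
behind by the final {\tt ADD}.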
|
1288 |
|
|
\begin{table}\begin{center}
|
1289 |
|
|
\begin{tabular}{p{1.1in}p{1.8in}p{3in}}\\\hline
|
1290 |
|
|
Mapped & Actual & Notes \\\hline
|
1291 |
|
|
{\tt LJSR \$Addr }
|
1292 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt MOV \$8+PC,R0 \\ LW (PC),PC \\ {\em Address}}
|
1293 |
199 |
dgisselq |
& Similar to LJMP, but it handles the return address properly.
|
1294 |
|
|
\\\hline
|
1295 |
92 |
dgisselq |
{\tt JSR PC+\$Offset }
|
1296 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt MOV \$4+PC,R0 \\ ADD \$Offset,PC}
|
1297 |
69 |
dgisselq |
& This is similar to the jump and link instructions from other
|
1298 |
|
|
architectures, save only that it requires a specific link
|
1299 |
199 |
dgisselq |
instruction, seen here as the {\tt MOV} instruction on the
|
1300 |
69 |
dgisselq |
left.\\\hline
|
1301 |
199 |
dgisselq |
{\tt LDI \$val,Rx }
|
1302 |
202 |
dgisselq |
& \parbox[t]{1.8in}{\tt BREV REV($val$)\&0x0ffff,Rx \\
|
1303 |
199 |
dgisselq |
LDILO ($val$\&0x0ffff),Rx}
|
1304 |
69 |
dgisselq |
& \parbox[t]{3.0in}{Sadly, there's not enough instruction
|
1305 |
21 |
dgisselq |
space to load a complete immediate value into any register.
|
1306 |
|
|
Therefore, fully loading any register takes two cycles.
|
1307 |
199 |
dgisselq |
The {\tt LDILO} (load immediate low) instruction has been
|
1308 |
|
|
created to facilitate this together with {\tt BREV}.
|
1309 |
69 |
dgisselq |
\\
|
1310 |
|
|
This is also the appropriate means for setting a register value
|
1311 |
|
|
to an arbitrary 32--bit value in a post--assembly link
|
1312 |
|
|
operation.}\\\hline
|
1313 |
39 |
dgisselq |
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
|
1314 |
|
|
& \parbox[t]{1.5in}{\tt LSL \$1,Ry \\
|
1315 |
21 |
dgisselq |
LSL \$1,Rx \\
|
1316 |
|
|
OR.C \$1,Ry}
|
1317 |
|
|
& Logical shift left with carry. Note that the
|
1318 |
|
|
instruction order is now backwards, to keep the conditions valid.
|
1319 |
33 |
dgisselq |
That is, LSL sets the carry flag, so if we did this the other way
|
1320 |
21 |
dgisselq |
with Rx before Ry, then the condition flag wouldn't have been right
|
1321 |
199 |
dgisselq |
for an {\tt OR} correction at the end. \\\hline
|
1322 |
39 |
dgisselq |
\parbox[t]{1.5in}{\tt LSR \$1,Rx \\ LSRC \$1,Ry}
|
1323 |
|
|
& \parbox[t]{1.5in}{\tt CLR Rz \\
|
1324 |
21 |
dgisselq |
LSR \$1,Ry \\
|
1325 |
199 |
dgisselq |
BREV.C \$1,Rz \\
|
1326 |
21 |
dgisselq |
LSR \$1,Rx \\
|
1327 |
|
|
OR Rz,Rx}
|
1328 |
199 |
dgisselq |
& Logical shift right with carry. Unlike the shift left, this
|
1329 |
|
|
approach doesn't extend well to numbers larger than two words. \\\hline
|
1330 |
|
|
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & Negates Rx\\\hline
|
1331 |
|
|
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx}
|
1332 |
|
|
& Conditionally negates Rx\\\hline
|
1333 |
|
|
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & One's complement\\\hline
|
1334 |
|
|
{\tt POP Rx }
|
1335 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt LW \$(SP),Rx \\ ADD \$4,SP}
|
1336 |
199 |
dgisselq |
& The compiler avoids the need for this instruction and the similar
|
1337 |
|
|
{\tt PUSH} instruction when setting up the stack by coalescing all
|
1338 |
|
|
the stack address modifications into a single instruction at the
|
1339 |
|
|
beginning of any stack frame.\\\hline
|
1340 |
39 |
dgisselq |
{\tt PUSH Rx}
|
1341 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\hbox{\tt SUB \$4,SP}
|
1342 |
|
|
\hbox{\tt SW Rx,\$(SP)}}
|
1343 |
39 |
dgisselq |
& Note that for pipelined operation, it helps to coalesce all the
|
1344 |
202 |
dgisselq |
{\tt SUB}'s into one command, and place the {\tt SW}'s right
|
1345 |
69 |
dgisselq |
after each other. Further, to avoid a pipeline stall, the
|
1346 |
199 |
dgisselq |
immediate value for the first store must be zero.
|
1347 |
69 |
dgisselq |
\\\hline
|
1348 |
202 |
dgisselq |
\end{tabular}
|
1349 |
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
1350 |
|
|
\end{center}\end{table}
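The two-instruction {\tt LDI} expansion in Tbl.~\ref{tbl:derived-2} relies on
bit reversal to place the upper half of the value. The Python sketch below
walks through it, assuming that {\tt BREV} writes the bit-reversed immediate to
the register and that {\tt LDILO} replaces only the low 16~bits of the
register. The function names are hypothetical.

```python
def brev32(value):
    """Reverse the bit order of a 32-bit value."""
    value &= 0xFFFFFFFF
    return int(format(value, "032b")[::-1], 2)


def load_imm32(val):
    """Model BREV REV(val)&0xffff,Rx followed by LDILO (val&0xffff),Rx."""
    imm = brev32(val) & 0xFFFF                 # immediate encoded in BREV
    rx = brev32(imm)                           # upper half of val, low half zero
    rx = (rx & 0xFFFF0000) | (val & 0xFFFF)    # LDILO fills in the low half
    return rx
```

Reversing the value before masking keeps its upper 16~bits within the
immediate field, and the second reversal performed by {\tt BREV} restores them
to the top of the register.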
|
1351 |
|
|
\begin{table}\begin{center}
|
1352 |
|
|
\begin{tabular}{p{1.0in}p{1.5in}p{3.2in}}\\\hline
|
1353 |
39 |
dgisselq |
{\tt PUSH Rx-Ry}
|
1354 |
202 |
dgisselq |
& \parbox[t]{1.5in}{\tt SUB \$$4n$,SP \\
|
1355 |
|
|
SW Rx,\$(SP)
|
1356 |
36 |
dgisselq |
\ldots \\
|
1357 |
202 |
dgisselq |
SW Ry,\$$4\left(n-1\right)$(SP)}
|
1358 |
36 |
dgisselq |
& Multiple pushes at once only need the single subtract from the
|
1359 |
|
|
stack pointer. This derived instruction is analogous to a similar one
|
1360 |
199 |
dgisselq |
on the Motorola 68k architecture, although the Zip Assembler
|
1361 |
|
|
does not support the combined instruction. This instruction
|
1362 |
39 |
dgisselq |
also supports pipelined memory access.\\\hline
|
1363 |
|
|
{\tt RESET}
|
1364 |
202 |
dgisselq |
& \parbox[t]{1in}{\tt LDI~0xff000000,R2\\LDI 1,R1\\\hbox{SW R1,\$watchdog(R2)}\\BUSY}
|
1365 |
199 |
dgisselq |
& This depends upon the existence of a watchdog peripheral, and the
|
1366 |
|
|
		peripheral base address being preloaded into {\tt R2}. The BUSY
|
1367 |
|
|
instructions are required because the CPU will continue until the
|
1368 |
202 |
dgisselq |
{\tt SW} has completed.
|
1369 |
21 |
dgisselq |
|
1370 |
|
|
Another opportunity might be to jump to the reset address from within
|
1371 |
39 |
dgisselq |
supervisor mode.\\\hline
|
1372 |
69 |
dgisselq |
{\tt RET} & {\tt MOV R0,PC}
|
1373 |
|
|
& This depends upon the form of the {\tt JSR} given on the previous
|
1374 |
|
|
page that stores the return address into R0.
|
1375 |
21 |
dgisselq |
\\\hline
|
1376 |
202 |
dgisselq |
{\tt SEXB Rx }
|
1377 |
199 |
dgisselq |
& \parbox[t]{1.5in}{\tt LSL 24,Rx \\ ASR 24,Rx}
|
1378 |
|
|
	& Sign extend an 8--bit value into a full word.\\\hline
|
1379 |
202 |
dgisselq |
{\tt SEXH Rx }
|
1380 |
199 |
dgisselq |
& \parbox[t]{1.5in}{\tt LSL 16,Rx \\ ASR 16,Rx}
|
1381 |
|
|
& Sign extend a 16--bit value into a full word.\\\hline
|
1382 |
39 |
dgisselq |
{\tt STEP Rr,Rt}
|
1383 |
|
|
& \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
|
1384 |
21 |
dgisselq |
& Step a Galois implementation of a Linear Feedback Shift Register, Rr,
|
1385 |
|
|
using taps Rt \\\hline
|
1386 |
199 |
dgisselq |
{\tt STEP}
|
1387 |
|
|
& \parbox[t]{1.5in}{\tt OR \$Step|\$GIE,CC}
|
1388 |
|
|
& Steps a user mode process by one instruction\\\hline
|
1389 |
|
|
{\tt SUBR Rx,Ry }
|
1390 |
202 |
dgisselq |
% & \parbox[t]{1.5in}{\tt SUB 1+Rx,Ry\\ XOR -1,Ry}
|
1391 |
|
|
& \parbox[t]{1.5in}{\tt XOR -1,Ry\\ADD 1+Rx,Ry}
|
1392 |
199 |
dgisselq |
& Ry is set to Rx-Ry, rather than the normal subtract which
|
1393 |
|
|
sets Ry to Ry-Rx. \\\hline
|
1394 |
|
|
\parbox[t]{1.4in}{\tt SUB Ra,Rx\\SUBC Rb,Ry}
|
1395 |
|
|
& \parbox[t]{1.5in}{\tt SUB Ra,Rx\\SUB.C \$1,Ry\\SUB Rb,Ry}
|
1396 |
|
|
& Subtract with carry. Note that the overflow flag may not be
|
1397 |
|
|
set correctly after this operation.\\\hline
|
1398 |
39 |
dgisselq |
{\tt SWAP Rx,Ry }
|
1399 |
69 |
dgisselq |
& \parbox[t]{1.5in}{\tt XOR Ry,Rx \\ XOR Rx,Ry \\ XOR Ry,Rx}
|
1400 |
21 |
dgisselq |
& While no extra registers are needed, this example
|
1401 |
|
|
does take 3-clocks. \\\hline
|
1402 |
39 |
dgisselq |
{\tt TRAP \#X}
|
1403 |
199 |
dgisselq |
& \parbox[t]{1.5in}{\tt LDI \$x,R1 \\ AND \textasciitilde\$GIE,CC }
|
1404 |
36 |
dgisselq |
& This works because whenever a user lowers the \$GIE flag, it sets
|
1405 |
199 |
dgisselq |
a TRAP bit within the uCC register. Therefore, upon entering the
|
1406 |
36 |
dgisselq |
supervisor state, the CPU only need check this bit to know that it
|
1407 |
|
|
got there via a TRAP. The trap could be made conditional by making
|
1408 |
|
|
the LDI and the AND conditional. In that case, the assembler would
|
1409 |
199 |
dgisselq |
quietly turn the LDI instruction into a {\tt BREV}/{\tt LDILO} pair,
|
1410 |
37 |
dgisselq |
but the effect would be the same. \\\hline
|
1411 |
69 |
dgisselq |
{\tt TS Rx,Ry,(Rz)}
|
1412 |
|
|
& \hbox{\tt LDI 1,Rx}
|
1413 |
|
|
\hbox{\tt LOCK}
|
1414 |
202 |
dgisselq |
\hbox{\tt LW (Rz),Ry}
|
1415 |
|
|
\hbox{\tt SW Rx,(Rz)}
|
1416 |
69 |
dgisselq |
	& A test and set instruction. The {\tt LOCK} instruction ensures
	that the next two instructions lock the bus between the instructions,
	so no one else can use it. This guarantees that the operation is
|
1419 |
|
|
atomic.
|
1420 |
|
|
\\\hline
|
1421 |
202 |
dgisselq |
%
|
1422 |
|
|
%
|
1423 |
|
|
\end{tabular}
|
1424 |
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
1425 |
|
|
\end{center}\end{table}
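Several of the expansions in Tbl.~\ref{tbl:derived-3} are easy to verify with a
small Python model of 32--bit registers. The sketch below covers {\tt SEXB},
{\tt SUBR}, {\tt SWAP}, and the two-operand {\tt STEP}; it assumes that
{\tt LSR} leaves the bit shifted out in the carry flag. All function names are
hypothetical.

```python
MASK32 = 0xFFFFFFFF


def asr(value, count):
    """32-bit arithmetic shift right."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32
    return (value >> count) & MASK32


def sexb(rx):
    """SEXB: LSL 24,Rx then ASR 24,Rx."""
    return asr((rx << 24) & MASK32, 24)


def subr(rx, ry):
    """SUBR: XOR -1,Ry then ADD 1+Rx,Ry, leaving Ry = Rx - Ry."""
    ry = (ry ^ MASK32) & MASK32
    return (ry + 1 + rx) & MASK32


def swap(rx, ry):
    """SWAP: three XORs exchange the registers without a temporary."""
    rx ^= ry
    ry ^= rx
    rx ^= ry
    return rx, ry


def lfsr_step(rr, rt):
    """STEP Rr,Rt: LSR 1,Rr then XOR.C Rt,Rr (carry is the bit shifted out)."""
    carry = rr & 1
    rr = (rr & MASK32) >> 1
    if carry:
        rr ^= rt
    return rr & MASK32
```

In {\tt subr}, the one's complement plus one is the two's complement negation
of {\tt Ry}, so adding {\tt Rx} yields {\tt Rx-Ry} modulo $2^{32}$.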
|
1426 |
|
|
\begin{table}\begin{center}
|
1427 |
|
|
\begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline
|
1428 |
39 |
dgisselq |
{\tt TST Rx}
|
1429 |
|
|
& {\tt TST \$-1,Rx}
|
1430 |
199 |
dgisselq |
& Set the condition codes based upon Rx without changing Rx.
|
1431 |
|
|
Equivalent to a CMP \$0,Rx.\\\hline
|
1432 |
39 |
dgisselq |
{\tt WAIT}
|
1433 |
|
|
& {\tt Or \$GIE | \$SLEEP,CC}
|
1434 |
|
|
& Wait until the next interrupt, then jump to supervisor/interrupt
|
1435 |
|
|
mode.
|
1436 |
21 |
dgisselq |
\end{tabular}
|
1437 |
36 |
dgisselq |
\caption{Derived Instructions, continued}\label{tbl:derived-4}
|
1438 |
21 |
dgisselq |
\end{center}\end{table}
|
1439 |
69 |
dgisselq |
|
1440 |
199 |
dgisselq |
\subsection{Interrupt Handling}
|
1441 |
|
|
The ZipCPU does not maintain any interrupt vector tables. If an interrupt
|
1442 |
|
|
takes place, the CPU simply switches from user to supervisor (interrupt)
|
1443 |
202 |
dgisselq |
mode. Since getting to user mode in the first place required a return to
|
1444 |
|
|
userspace instruction, {\tt RTU}, once the interrupt takes place the
|
1445 |
|
|
supervisor just simply starts executing code immediately after that
|
1446 |
|
|
{\tt RTU} instruction.
|
1447 |
69 |
dgisselq |
|
1448 |
202 |
dgisselq |
Since the CPU may return from userspace after either an interrupt (hardware
|
1449 |
|
|
generated), a trap (software generated), or an exception (a fault of some
|
1450 |
|
|
type), it is up to the supervisor code that handles the transition to
|
1451 |
|
|
determine which of the three has taken place.
|
1452 |
69 |
dgisselq |
|
1453 |
199 |
dgisselq |
\subsection{Pipeline Stages}
|
1454 |
32 |
dgisselq |
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
1455 |
199 |
dgisselq |
the ZipCPU supports a five stage pipeline.
|
1456 |
21 |
dgisselq |
\begin{enumerate}
|
1457 |
199 |
dgisselq |
\item {\bf Prefetch}: Reads instructions from memory. If the CPU has been
|
1458 |
|
|
configured with a cache, the cache has been integrated into the
|
1459 |
|
|
prefetch. Stalls are also created here if the instruction isn't
|
1460 |
21 |
dgisselq |
in the prefetch cache.
|
1461 |
36 |
dgisselq |
|
1462 |
199 |
dgisselq |
The ZipCPU supports one of three prefetch methods, depending upon the
|
1463 |
|
|
flags set at build time within the {\tt cpudefs.v} file.
|
1464 |
|
|
|
1465 |
|
|
The simplest
|
1466 |
69 |
dgisselq |
is a non--cached implementation of a prefetch. This implementation is
|
1467 |
199 |
dgisselq |
fairly small, and ideal for users of the ZipCPU who need the extra
|
1468 |
69 |
dgisselq |
space on the FPGA fabric. However, because this version
has no cache, throughput is limited
to about one instruction per eight clocks, depending upon the bus/memory delay.
|
1471 |
|
|
This prefetch option is set by leaving the {\tt OPT\_SINGLE\_FETCH}
|
1472 |
|
|
line uncommented within the {\tt cpudefs.v} file. Using this option
|
1473 |
|
|
will also turn off the ZipCPU pipeline.
|
1474 |
36 |
dgisselq |
|
1475 |
199 |
dgisselq |
The second prefetch module is a non--traditional pipelined prefetch
|
1476 |
|
|
with a cache. This module tries to keep the instruction address
|
1477 |
|
|
within a window of valid instruction addresses. While effective, it
|
1478 |
|
|
is not a traditional cache implementation. A disappointing feature of
|
1479 |
|
|
this implementation is that it needs an extra internal pipeline stage
|
1480 |
36 |
dgisselq |
to be implemented.
|
1481 |
|
|
|
1482 |
199 |
dgisselq |
The third prefetch and cache module implements a more traditional
|
1483 |
|
|
cache. This cache provides for the fastest CPU speed. The only
|
1484 |
|
|
drawback is that, when a cache line is loading, the CPU will be stalled
|
1485 |
|
|
until the cache is completely loaded.
|
1486 |
69 |
dgisselq |
|
1487 |
199 |
dgisselq |
\item {\bf Decode}: Decodes an instruction into its OpCode, register(s) to
|
1488 |
|
|
read, condition code, and immediate offset. This stage also
|
1489 |
|
|
determines whether the flags will be read or set, whether registers
|
1490 |
|
|
will be read (and hence the pipeline may need to stall), or whether the
|
1491 |
202 |
dgisselq |
result will be written back. In many ways, simplifying the CPU has
|
1492 |
|
|
meant simplifying this particular pipeline stage and hence the
|
1493 |
|
|
instruction set architecture that it implements.
|
1494 |
69 |
dgisselq |
|
1495 |
202 |
dgisselq |
This stage is also responsible for both normal and CIS decoding.
|
1496 |
|
|
Hence, following this stage, little information remains regarding
|
1497 |
|
|
whether or not the CPU was executing a CIS instruction.
|
1498 |
|
|
|
1499 |
199 |
dgisselq |
\item {\bf Read Operands}: Reads from the register file and applies any
|
1500 |
|
|
immediate values to the result. There is no means of detecting or
|
1501 |
|
|
flagging arithmetic overflow or carry when adding the immediate to the
|
1502 |
|
|
operand. This stage will stall if any source operand is pending
|
1503 |
|
|
and the immediate value is non--zero.
|
1504 |
69 |
dgisselq |
|
1505 |
199 |
dgisselq |
\item At this point, the processing flow splits into one of four tracks: an
{\bf ALU} track which will accomplish a simple instruction, the
{\bf MemOps} track which handles {\tt LW} (load) and {\tt SW}
(store) instructions, the {\bf divide} unit, and the
{\bf floating point} unit.
|
1510 |
21 |
dgisselq |
\begin{itemize}
|
1511 |
202 |
dgisselq |
\item Loads will stall instructions in the read operands stage until
|
1512 |
|
|
the entire memory operation is complete, lest a register be
|
1513 |
|
|
read from the register file only to be updated unseen by the
|
1514 |
|
|
Load.
|
1515 |
199 |
dgisselq |
\item Condition codes are set upon completion of the ALU, divide,
|
1516 |
|
|
or FPU stage. (Memory operations do not set conditions.)
|
1517 |
69 |
dgisselq |
\item Issuing a non--pipelined memory instruction to the memory unit
|
1518 |
199 |
dgisselq |
while the memory unit is busy will stall the entire pipeline
|
1519 |
|
|
until the memory unit is idle and ready to accept another
|
1520 |
|
|
instruction.
|
1521 |
21 |
dgisselq |
\end{itemize}
|
1522 |
32 |
dgisselq |
\item {\bf Write-Back}: Conditionally write back the result to the register
|
1523 |
199 |
dgisselq |
set, applying the condition and any special CC logic. This routine is
|
1524 |
|
|
quad-entrant: either the ALU, the memory, the divide, or the FPU may
|
1525 |
|
|
commit a result. The only design rule is that no more than a single
|
1526 |
|
|
register may be written in any given clock cycle.
|
1527 |
|
|
|
1528 |
|
|
This is also the stage where any special condition code logic takes
|
1529 |
|
|
place.
|
1530 |
21 |
dgisselq |
\end{enumerate}
|
1531 |
|
|
|
1532 |
199 |
dgisselq |
The ZipCPU does not support out of order execution. Therefore, if the memory
|
1533 |
69 |
dgisselq |
unit stalls, every other instruction stalls. The same is true for divide or
|
1534 |
|
|
floating point instructions--all other instructions will stall while waiting
|
1535 |
|
|
for these to complete. Memory stores, however, can take place concurrently
|
1536 |
199 |
dgisselq |
with non--memory operations, although memory reads (loads) cannot. This is
|
1537 |
|
|
likely to change with the integration of a memory management unit (MMU), in which case a store
|
1538 |
|
|
instruction must stall the CPU until it is known whether or not the store
|
1539 |
|
|
address can be mapped by the MMU.
|
1540 |
24 |
dgisselq |
|
1541 |
199 |
dgisselq |
% \subsection{Instruction Cache}
|
1542 |
|
|
% \subsection{Data Cache}
|
1543 |
|
|
|
1544 |
|
|
\subsection{Pipeline Stalls}
|
1545 |
32 |
dgisselq |
The processing pipeline can and will stall for a variety of reasons. Some of
|
1546 |
|
|
these are obvious, some less so. These reasons are listed below:
|
1547 |
|
|
\begin{itemize}
|
1548 |
|
|
\item When the prefetch cache is exhausted
|
1549 |
21 |
dgisselq |
|
1550 |
36 |
dgisselq |
This reason should be obvious. If the prefetch cache doesn't have the
|
1551 |
69 |
dgisselq |
instruction in memory, the entire pipeline must stall until an instruction
|
1552 |
|
|
can be made ready. In the case of the {\tt pipefetch} windowed approach
|
1553 |
|
|
to the prefetch cache, this means the pipeline will stall until enough of the
|
1554 |
|
|
prefetch cache is loaded to support the next instruction. In the case
|
1555 |
|
|
of the more traditional {\tt pfcache} approach, the entire cache line must
|
1556 |
|
|
fill before instruction execution can continue.
|
1557 |
21 |
dgisselq |
|
1558 |
32 |
dgisselq |
\item While waiting for the pipeline to load following any taken branch, jump,
|
1559 |
199 |
dgisselq |
return from interrupt or switch to interrupt context (4 stall cycles,
|
1560 |
|
|
minimum)
|
1561 |
32 |
dgisselq |
|
1562 |
68 |
dgisselq |
Fig.~\ref{fig:bcstalls}
|
1563 |
|
|
\begin{figure}\begin{center}
|
1564 |
|
|
\includegraphics[width=3.5in]{../gfx/bc.eps}
|
1565 |
69 |
dgisselq |
\caption{A conditional branch generates 4 stall cycles}\label{fig:bcstalls}
|
1566 |
68 |
dgisselq |
\end{center}\end{figure}
|
1567 |
|
|
illustrates the situation for a conditional branch. In this case, the branch
|
1568 |
69 |
dgisselq |
instruction, {\tt BC}, is nominally followed by instructions {\tt I1} and so
|
1569 |
68 |
dgisselq |
forth. However, since the branch is taken, the next instruction must be
|
1570 |
|
|
{\tt IA}. Therefore, the pipeline needs to be cleared and reloaded.
|
1571 |
|
|
Given that there are five stages to the pipeline, that accounts
|
1572 |
69 |
dgisselq |
for the four stalls. (Were the {\tt pipefetch} cache chosen, there would
|
1573 |
|
|
be another stall internal to the {\tt pipefetch} cache.)
|
1574 |
32 |
dgisselq |
|
1575 |
199 |
dgisselq |
When {\tt EARLY\_BRANCHING} is enabled, however, the decode stage can handle
the {\tt ADD \$X,PC}, {\tt LDI \$X,PC}, and {\tt LW (PC),PC} instructions
specially. These instructions, when
not conditioned on the flags, can execute with only a single stall cycle (two
for the {\tt LW (PC),PC} instruction),
as shown in Fig.~\ref{fig:branch}.
|
1581 |
68 |
dgisselq |
\begin{figure}\begin{center}
|
1582 |
69 |
dgisselq |
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
|
1583 |
|
|
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
|
1584 |
68 |
dgisselq |
\end{center}\end{figure}
|
1585 |
|
|
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
|
1586 |
|
|
following the branch in memory, while {\tt IA} is the first instruction at the
|
1587 |
|
|
branch address. ({\tt CLR} denotes a clear--pipeline operation, and does
|
1588 |
|
|
not represent any instruction.)
|
1589 |
36 |
dgisselq |
|
1590 |
32 |
dgisselq |
\item When reading from a prior register while also adding an immediate offset
|
1591 |
|
|
\begin{enumerate}
|
1592 |
|
|
\item\ {\tt OPCODE ?,RA}
|
1593 |
|
|
\item\ {\em (stall)}
|
1594 |
|
|
\item\ {\tt OPCODE I+RA,RB}
|
1595 |
|
|
\end{enumerate}
|
1596 |
|
|
|
1597 |
|
|
Since the immediate is added to the register within OpB decoding during the
read operands stage, so that the sum can be nicely settled before the ALU,
any instruction that will write back an operand must be separated by one
instruction from the opcode that will read that operand and apply an
immediate offset. The
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.
|
1604 |
32 |
dgisselq |
|
1605 |
69 |
dgisselq |
This is also the reason why, when setting up a stack frame, the top of the
|
1606 |
199 |
dgisselq |
stack frame is used first: it eliminates this stall cycle.\footnote{This only
|
1607 |
|
|
applies if there is no local memory to allocate on the stack as well.} Hence,
|
1608 |
|
|
to save registers at the top of a procedure, one would write:
|
1609 |
32 |
dgisselq |
\begin{enumerate}
|
1610 |
202 |
dgisselq |
\item\ {\tt SUB 16,SP}
|
1611 |
|
|
\item\ {\tt SW R1,(SP)}
|
1612 |
|
|
\item\ {\tt SW R2,4(SP)}
|
1613 |
32 |
dgisselq |
\end{enumerate}
|
1614 |
69 |
dgisselq |
Had {\tt R1} instead been stored at a non--zero offset, such as {\tt 4(SP)},
there would have been an extra stall in setting up the stack frame.
|
1616 |
32 |
dgisselq |
|
1617 |
|
|
\item When reading from the CC register after setting the flags
|
1618 |
|
|
\begin{enumerate}
|
1619 |
69 |
dgisselq |
\item\ {\tt ALUOP RA,RB} {\em ; Ex: a compare opcode}
|
1620 |
36 |
dgisselq |
\item\ {\em (stall)}
|
1621 |
32 |
dgisselq |
\item\ {\tt TST sys.ccv,CC}
|
1622 |
|
|
\item\ {\tt BZ somewhere}
|
1623 |
|
|
\end{enumerate}
|
1624 |
|
|
|
1625 |
68 |
dgisselq |
The reason for this stall is simply performance: many of the flags are
|
1626 |
|
|
determined via combinatorial logic {\em during} the writeback cycle.
|
1627 |
|
|
Trying to then place these into the input for one of the operands for an
|
1628 |
|
|
ALU instruction during the same cycle
|
1629 |
32 |
dgisselq |
created a time delay loop that would no longer execute in a single 100~MHz
|
1630 |
|
|
clock cycle. (The time delay of the multiply within the ALU wasn't helping
|
1631 |
|
|
either \ldots).
|
1632 |
|
|
|
1633 |
33 |
dgisselq |
This stall may be eliminated via proper scheduling, by placing an instruction
|
1634 |
|
|
that does not set flags in between the ALU operation and the instruction
|
1635 |
|
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
1636 |
|
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
1637 |
|
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
1638 |
68 |
dgisselq |
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
|
1639 |
69 |
dgisselq |
it doesn't read the condition codes before executing.
|
1640 |
33 |
dgisselq |
|
1641 |
32 |
dgisselq |
\item When waiting for a memory read operation to complete
|
1642 |
|
|
\begin{enumerate}
|
1643 |
202 |
dgisselq |
\item\ {\tt LW address,RA}
|
1644 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
1645 |
32 |
dgisselq |
\item\ {\tt OPCODE I+RA,RB}
|
1646 |
|
|
\end{enumerate}
|
1647 |
|
|
|
1648 |
199 |
dgisselq |
Remember, the ZipCPU does not support out of order execution. Therefore,
|
1649 |
32 |
dgisselq |
anytime the memory unit becomes busy both the memory unit and the ALU must
|
1650 |
68 |
dgisselq |
stall until the memory unit is cleared. This is illustrated in
|
1651 |
|
|
Fig.~\ref{fig:memrd},
|
1652 |
|
|
\begin{figure}\begin{center}
|
1653 |
69 |
dgisselq |
\includegraphics[width=5.6in]{../gfx/memrd.eps}
|
1654 |
68 |
dgisselq |
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
|
1655 |
|
|
\end{center}\end{figure}
|
1656 |
|
|
since it is especially true of a load
|
1657 |
69 |
dgisselq |
instruction, which must still write its operand back to the register file.
|
1658 |
|
|
Further, note that on a pipelined memory operation, the instruction must
|
1659 |
|
|
stall in the read operands stage, lest it try to read a result from the
|
1660 |
|
|
register file before the load result has been written to it. Finally, note
|
1661 |
|
|
that there is an extra stall at the end of the memory cycle, so that
|
1662 |
|
|
the memory unit will be idle for two clocks before an instruction will be
|
1663 |
|
|
accepted into the ALU. Store instructions are different, as shown in
|
1664 |
|
|
Fig.~\ref{fig:memwr},
|
1665 |
68 |
dgisselq |
\begin{figure}\begin{center}
|
1666 |
69 |
dgisselq |
\includegraphics[width=4in]{../gfx/memwr.eps}
|
1667 |
68 |
dgisselq |
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
|
1668 |
|
|
\end{center}\end{figure}
|
1669 |
|
|
since they can be busy with the bus without impacting later write back
|
1670 |
|
|
pipeline stages. Hence, only loads stall the pipeline.
|
1671 |
32 |
dgisselq |
|
1672 |
68 |
dgisselq |
This, of course, also assumes that the memory being accessed is a single cycle
|
1673 |
|
|
memory and that there are no stalls to get to the memory.
|
1674 |
32 |
dgisselq |
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
|
1675 |
33 |
dgisselq |
as long as forty clocks. During this time the CPU and the external bus
|
1676 |
68 |
dgisselq |
will be busy, and unable to do anything else. Likewise, if it takes a couple
|
1677 |
|
|
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
|
1678 |
|
|
and~\ref{fig:memwr}, there will be stalls.
|
1679 |
32 |
dgisselq |
|
1680 |
|
|
\item Memory operation followed by a memory operation
|
1681 |
|
|
\begin{enumerate}
|
1682 |
202 |
dgisselq |
\item\ {\tt SW address,RA}
|
1683 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
1684 |
202 |
dgisselq |
\item\ {\tt LW address,RB}
|
1685 |
36 |
dgisselq |
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
1686 |
32 |
dgisselq |
\end{enumerate}
|
1687 |
|
|
|
1688 |
202 |
dgisselq |
In this case, the LW instruction cannot start until the SW is finished,
|
1689 |
68 |
dgisselq |
as illustrated by Fig.~\ref{fig:mstld}.
|
1690 |
|
|
\begin{figure}\begin{center}
|
1691 |
|
|
\includegraphics[width=5.5in]{../gfx/mstld.eps}
|
1692 |
|
|
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
|
1693 |
|
|
\end{center}\end{figure}
|
1694 |
32 |
dgisselq |
With proper scheduling, it is possible to do something in the ALU while the
|
1695 |
202 |
dgisselq |
memory unit is busy with the SW instruction, but otherwise this pipeline will
|
1696 |
68 |
dgisselq |
stall while waiting for it to complete before the load instruction can
|
1697 |
|
|
start.
|
1698 |
32 |
dgisselq |
|
1699 |
199 |
dgisselq |
The ZipCPU has the capability of supporting a form of burst memory access,
|
1700 |
|
|
often called pipelined memory access within this document due to its use of
|
1701 |
|
|
the Wishbone B4 pipelined access mode.
|
1702 |
|
|
When using this mode, the CPU may issue multiple loads or stores at a time,
|
1703 |
|
|
to the extent that all but the first take only a single clock. Doing this
|
1704 |
|
|
requires several conditions to be true:
|
1705 |
|
|
\begin{enumerate}
|
1706 |
|
|
\item All accesses within the burst must be all reads or all writes,
\item All must use the same base register for their address,
\item There can be no stalls or other instructions between memory access
instructions within the burst, and
\item The immediate offset to memory must be either identical or
increasing by one address each instruction.
|
1711 |
|
|
\end{enumerate}
|
1712 |
|
|
These conditions work well for saving or restoring registers on the stack in a
|
1713 |
|
|
burst operation. Indeed, if you noticed, both Fig.~\ref{fig:memrd} and
|
1714 |
|
|
Fig.~\ref{fig:memwr} illustrated pipelined memory accesses. Beyond saving and
|
1715 |
|
|
restoring registers to the stack, the compiler does not optimize well (yet)
|
1716 |
|
|
for using this burst mode.
|
1717 |
36 |
dgisselq |
|
1718 |
32 |
dgisselq |
\end{itemize}
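
The burst conditions listed above can be modeled on a host machine. The
following Python sketch is only an illustration: the {\tt MemOp} record, the
{\tt can\_burst} function, and the {\tt step} parameter (the assumed distance
between successive offsets) are hypothetical names introduced for this example
and are not part of the ZipCPU sources. The sketch also assumes the
instructions are adjacent in the instruction stream with no stalls between
them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MemOp:
    """Hypothetical record describing one load or store instruction."""
    is_store: bool   # True for a store (SW), False for a load (LW)
    base_reg: str    # register supplying the base address, e.g. "SP"
    offset: int      # immediate offset added to the base register


def can_burst(ops: Sequence[MemOp], step: int = 1) -> bool:
    """Return True if back-to-back memory ops meet the burst conditions.

    `step` is the assumed distance between successive addresses; it is a
    parameter because the addressing granularity is not fixed by this sketch.
    """
    if not ops:
        return False
    first = ops[0]
    for prev, cur in zip(ops, ops[1:]):
        if cur.is_store != first.is_store:             # all reads or all writes
            return False
        if cur.base_reg != first.base_reg:             # same base register
            return False
        if cur.offset - prev.offset not in (0, step):  # identical or +step
            return False
    return True
```

A register save such as two stores through {\tt SP} at consecutive offsets
passes this check, while a store followed by a load does not.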
|
1719 |
|
|
|
1720 |
199 |
dgisselq |
% \subsection{Debug}
|
1721 |
32 |
dgisselq |
|
1722 |
199 |
dgisselq |
\section{External Architecture}
|
1723 |
24 |
dgisselq |
|
1724 |
199 |
dgisselq |
Having now described the CPU registers, instructions, and instruction formats,
|
1725 |
|
|
we turn our attention to how the CPU interacts with the rest of the world.
|
1726 |
|
|
Specifically, we shall discuss how the bus is implemented, and the memory
|
1727 |
|
|
model assumed by the CPU.
|
1728 |
|
|
|
1729 |
|
|
\subsection{Simplified Wishbone Bus}\label{ssec:bus}
|
1730 |
|
|
The bus architecture of the ZipCPU is that of a simplified, pipelined, WISHBONE
|
1731 |
|
|
bus built according to the B4 specification. Several changes have been made to
|
1732 |
|
|
simplify this bus. First, all unnecessary ancillary information has been
|
1733 |
|
|
removed. This includes the retry, tag, lock, cycle type indicator, and burst
|
1734 |
202 |
dgisselq |
indicator signals. The bus supports big endian operation where the high order
|
1735 |
|
|
octet occupies the low order address. Second, we insist that all
|
1736 |
199 |
dgisselq |
accesses be pipelined, and simplify that further by insisting that pipelined
|
1737 |
|
|
accesses not cross peripherals---although we leave it to the user to keep that
|
1738 |
|
|
from happening in practice. Finally, we insist that the wishbone strobe line
|
1739 |
|
|
be zero any time the cycle line is inactive. This makes decoding simpler
|
1740 |
|
|
in slave logic: a transaction is initiated whenever the strobe line is high
|
1741 |
|
|
and the stall line is low. For those peripherals that do not generate stalls,
|
1742 |
|
|
only the strobe line needs to be tested for access. The transaction completes
|
1743 |
|
|
whenever either the ACK or the ERR lines go high.
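
The handshake rules above can be illustrated with a small host-side model.
The Python sketch below is an illustration under stated assumptions, not part
of the ZipCPU: the {\tt BusCycle} record and its field names are hypothetical,
and the model simply counts requests accepted (strobe high, stall low) and
requests completed (ACK or ERR high) over a list of per-clock samples.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BusCycle:
    """Hypothetical snapshot of the bus lines on one clock."""
    cyc: bool            # cycle line
    stb: bool            # strobe line
    stall: bool = False  # stall line, driven by the slave
    ack: bool = False    # acknowledge line
    err: bool = False    # bus error line


def count_transactions(clocks: Iterable[BusCycle]) -> Tuple[int, int]:
    """Return (accepted, completed) request counts for a clock trace."""
    accepted = 0
    completed = 0
    for c in clocks:
        # The simplified bus requires STB to be zero whenever CYC is inactive.
        if c.stb and not c.cyc:
            raise ValueError("strobe asserted while cycle line is inactive")
        # A transaction is initiated when STB is high and STALL is low.
        if c.stb and not c.stall:
            accepted += 1
        # A transaction completes when either ACK or ERR goes high.
        if c.ack or c.err:
            completed += 1
    return accepted, completed
```

Because accesses are pipelined, several requests may be accepted before the
first acknowledgement returns, so the two counts only need to match at the end
of the bus cycle.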
|
1744 |
|
|
|
1745 |
|
|
\subsection{Memory Model}\label{ssec:memory}
|
1746 |
|
|
The memory model of the ZipCPU is that of a uniform 32--bit address space.
|
1747 |
|
|
The CPU knows nothing about which addresses reference on--chip or off-chip
|
1748 |
|
|
memory, or even which reference peripherals. Indeed, there is no indication
|
1749 |
|
|
within the CPU if a particular piece of memory can be cached or not, save that
|
1750 |
|
|
the CPU assumes any and all instruction words can be cached.
|
1751 |
|
|
|
1752 |
202 |
dgisselq |
The one exception to this rule revolves around addresses whose top eight bits
are all ones. These addresses are used to access a
|
1754 |
199 |
dgisselq |
variety of optional peripherals that will be discussed more in
|
1755 |
|
|
Sec.~\ref{sec:zipsys}, but that are only present within the {\tt ZipSystem}.
|
1756 |
|
|
When used with the bare {\tt ZipBones}, these addresses will cause a bus error.
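
As a minimal sketch of this address test, assuming a 32--bit address whose
top eight bits select the peripheral region, one might write the following in
Python. The name {\tt is\_zipsystem\_address} is hypothetical and exists only
for this illustration.

```python
ZIPSYSTEM_TOP_BITS = 0xFF  # top eight address bits all ones


def is_zipsystem_address(addr: int) -> bool:
    """True if a 32-bit address falls in the optional peripheral region."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    return (addr >> 24) == ZIPSYSTEM_TOP_BITS
```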
|
1757 |
|
|
|
1758 |
|
|
The prefetch cache currently has no means of detecting an instruction that
|
1759 |
|
|
was changed, save by clearing the instruction cache. This may be necessary
|
1760 |
|
|
when loading programs into previously used memory, or when creating
|
1761 |
|
|
self--modifying code.
|
1762 |
|
|
|
1763 |
|
|
Should the memory management unit (MMU) be integrated into the ZipCPU, the MMU
|
1764 |
202 |
dgisselq |
configuration will tell the ZipCPU which addresses may be cached and which not.
|
1765 |
199 |
dgisselq |
|
1766 |
|
|
This topic is discussed further in the linker section, Sec.~\ref{sec:ld-mem}
|
1767 |
|
|
of the ABI chapter, Chap.~\ref{chap:abi}.
|
1768 |
|
|
|
1769 |
|
|
% \subsection{Measured Performance}\label{sec:perf}
|
1770 |
|
|
|
1771 |
|
|
\subsection{ZipSystem}\label{sec:zipsys}
|
1772 |
|
|
|
1773 |
|
|
While the previous chapter describes a CPU in isolation, the ZipSystem
|
1774 |
|
|
includes a minimal set of peripherals that can be tightly integrated into
|
1775 |
|
|
the CPU. These peripherals are shown in Fig.~\ref{fig:zipsystem}
|
1776 |
24 |
dgisselq |
\begin{figure}\begin{center}
|
1777 |
|
|
\includegraphics[width=3.5in]{../gfx/system.eps}
|
1778 |
199 |
dgisselq |
\caption{ZipSystem Peripherals}\label{fig:zipsystem}
|
1779 |
24 |
dgisselq |
\end{center}\end{figure}
|
1780 |
|
|
and described here. They are designed to make
|
1781 |
199 |
dgisselq |
the ZipCPU more useful in an Embedded Operating System environment.
|
1782 |
24 |
dgisselq |
|
1783 |
199 |
dgisselq |
\subsubsection{Interrupt Controller}\label{sec:pic}
|
1784 |
24 |
dgisselq |
|
1785 |
199 |
dgisselq |
Perhaps the most important peripheral within the ZipSystem is the interrupt
|
1786 |
|
|
controller. While the ZipCPU itself can only handle one interrupt, and has
|
1787 |
24 |
dgisselq |
only the one interrupt state: disabled or enabled, the interrupt controller
|
1788 |
|
|
can make things more interesting.
|
1789 |
|
|
|
1790 |
199 |
dgisselq |
The ZipSystem interrupt controller module supports up to 15 interrupts, all
|
1791 |
|
|
controlled from one register. Further, it has been designed so that individual
|
1792 |
|
|
interrupts can be enabled or disabled without having any knowledge
|
1793 |
|
|
of the rest of the controller setting. To enable an interrupt, write to the
|
1794 |
|
|
register with the high order global enable bit set and the respective interrupt
|
1795 |
|
|
enable bit set. No other bits will be affected. To disable an interrupt,
|
1796 |
|
|
write to the register with the high order global enable bit cleared and the
|
1797 |
|
|
respective interrupt enable bit set. To clear an interrupt, write a `1' to
|
1798 |
|
|
that interrupt's status bit. A zero written to the register has the sole
|
1799 |
|
|
effect of disabling the master interrupt enable bit.
|
1800 |
24 |
dgisselq |
|
1801 |
|
|
As an example, suppose you wished to enable interrupt \#4. You would then
|
1802 |
|
|
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
|
1803 |
|
|
any past active state. When you later wish to disable this interrupt, you would
|
1804 |
199 |
dgisselq |
write a {\tt 0x00100010} to the register. This both disables the
|
1805 |
24 |
dgisselq |
interrupt and clears the active indicator. This also has the side effect of
|
1806 |
|
|
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
|
1807 |
|
|
to re-enable any other interrupts.
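
As a hedged illustration of the register words used in this example, the
Python sketch below builds the enable and disable values for a given
interrupt number. The bit positions it uses (an enable bit at position $16+n$
and a status bit at position $n$ for interrupt \#$n$) are inferred from the
example values {\tt 0x80100010} and {\tt 0x00100010} above, and the function
names are hypothetical; check the interrupt controller's register definition
before relying on them.

```python
GLOBAL_ENABLE = 1 << 31  # high order (master) enable bit


def _check(irq: int) -> None:
    # The text describes up to 15 interrupts, numbered here as 0 through 14.
    if not 0 <= irq <= 14:
        raise ValueError("interrupt number must be between 0 and 14")


def enable_word(irq: int) -> int:
    """Word that enables `irq` and clears any past active state."""
    _check(irq)
    return GLOBAL_ENABLE | (1 << (16 + irq)) | (1 << irq)


def disable_word(irq: int) -> int:
    """Word that disables `irq` and clears its active indicator."""
    _check(irq)
    return (1 << (16 + irq)) | (1 << irq)
```

For interrupt \#4, {\tt enable\_word(4)} evaluates to {\tt 0x80100010} and
{\tt disable\_word(4)} to {\tt 0x00100010}, matching the values in the text.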
|
1808 |
|
|
|
1809 |
199 |
dgisselq |
The ZipSystem hosts two interrupt controllers: a primary and a secondary. The
|
1810 |
|
|
primary interrupt controller is the one that interrupts the CPU. It has
|
1811 |
|
|
six local interrupt lines, the rest coming from external interrupt sources.
|
1812 |
|
|
One of those interrupt lines to the primary controller comes from the secondary
|
1813 |
|
|
interrupt controller. This controller maintains an interrupt state for the
|
1814 |
|
|
process accounting counters, and any other external interrupts that didn't fit
|
1815 |
|
|
into the primary interrupt controller.
|
1816 |
24 |
dgisselq |
|
1817 |
199 |
dgisselq |
As a word of caution, because the interrupt controller is an external
|
1818 |
|
|
peripheral, and because memory writes take place concurrently with any following
|
1819 |
|
|
instructions, any attempt to clear interrupts on one instruction followed by
|
1820 |
|
|
an immediate Return to Userspace ({\tt RTU}) instruction may not have the
|
1821 |
|
|
effect of having interrupts cleared before the {\tt RTU} instruction executes.
|
1822 |
21 |
dgisselq |
|
1823 |
199 |
dgisselq |
\subsubsection{Counter}
|
1824 |
|
|
|
1825 |
21 |
dgisselq |
The Zip Counter is a very simple counter: it just counts. It cannot be
|
1826 |
|
|
halted. When it rolls over, it issues an interrupt. Writing a value to the
|
1827 |
|
|
counter just sets the current value, and it starts counting again from that
|
1828 |
|
|
value.
|
1829 |
|
|
|
1830 |
199 |
dgisselq |
Eight counters are implemented in the ZipSystem for process accounting if
|
1831 |
|
|
the {\tt INCLUDE\_ACCOUNTING\_COUNTERS} define is set within {\tt cpudefs.v}.
|
1832 |
|
|
Four of those measure the performance of the system as a whole, while four are
|
1833 |
|
|
used for measuring user CPU usage.
|
1834 |
21 |
dgisselq |
This may change in the future, as nothing as yet uses these counters.
|
1835 |
|
|
|
1836 |
199 |
dgisselq |
\subsubsection{Timer}
|
1837 |
21 |
dgisselq |
|
1838 |
199 |
dgisselq |
The Zip Timer is also very simple: it is a 31--bit counter that simply counts
|
1839 |
|
|
down to zero. When it transitions from a one to a zero it creates an interrupt.
|
1840 |
21 |
dgisselq |
|
1841 |
|
|
Writing any non-zero value to the timer starts the timer. If the high order
|
1842 |
|
|
bit is set when writing to the timer, the timer becomes an interval timer and
|
1843 |
|
|
reloads its last start time on any interrupt. Hence, to mark seconds, one
|
1844 |
199 |
dgisselq |
might set the 31--bits of the timer to the number of clocks per second and the
|
1845 |
|
|
top bit to one. Ever after, the timer will interrupt the CPU once per
|
1846 |
|
|
second--until a non--interrupt interval is set in the timer. This reload
|
1847 |
|
|
capability also limits the maximum timer value to $2^{31}-1$, rather than
|
1848 |
|
|
$2^{32}-1$.
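
The value written to the timer can be sketched as follows. This Python
fragment is an illustration only: the function name {\tt timer\_word} is
hypothetical, and the 100~MHz clock rate in the usage note is an assumed
figure, not a property of the timer.

```python
INTERVAL_BIT = 1 << 31            # high order bit: reload on every interrupt
MAX_TIMER_VALUE = (1 << 31) - 1   # largest count that fits in 31 bits


def timer_word(clocks: int, interval: bool = False) -> int:
    """Build the word that starts the timer counting down from `clocks`."""
    if not 1 <= clocks <= MAX_TIMER_VALUE:
        raise ValueError("count must be non-zero and fit in 31 bits")
    return (INTERVAL_BIT if interval else 0) | clocks
```

With an assumed 100~MHz system clock, a once per second interval timer would
be started by writing {\tt timer\_word(100\_000\_000, interval=True)}, which
is {\tt 0x85F5E100}.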
|
1849 |
21 |
dgisselq |
|
1850 |
199 |
dgisselq |
\subsubsection{Watchdog Timer}
|
1851 |
21 |
dgisselq |
|
1852 |
199 |
dgisselq |
The watchdog timer differs from the other timers in only two ways.
|
1853 |
|
|
The first difference is that it is a one--shot timer. The second difference,
|
1854 |
|
|
though, is critical: the interrupt line from the watchdog timer is tied to the
|
1855 |
|
|
reset line of the CPU. Hence writing a `1' to the watchdog timer will always
|
1856 |
|
|
reset the CPU. To stop the Watchdog timer, write a `0' to it. To start it,
|
1857 |
21 |
dgisselq |
write any other number to it---as with the other timers.
|
1858 |
|
|
|
1859 |
|
|
|
1860 |
199 |
dgisselq |
\subsubsection{Bus Watchdog}
|
1861 |
|
|
There is an additional watchdog timer on the Wishbone bus of the ZipSystem.
|
1862 |
|
|
This timer,
|
1863 |
68 |
dgisselq |
however, is hardware configured and not software configured. The timer is
|
1864 |
|
|
reset at the beginning of any bus transaction, and only counts clocks during
|
1865 |
|
|
such bus transactions. If the bus transaction takes longer than the number
|
1866 |
|
|
of counts the timer allots, it will raise a bus error flag to terminate the
|
1867 |
|
|
transaction. This is useful in the case of any peripherals that are
|
1868 |
|
|
misbehaving. If the bus watchdog terminates a bus transaction, the CPU may
|
1869 |
|
|
then read from its port to find out which memory location created the problem.
|
1870 |
|
|
|
1871 |
|
|
Aside from its unusual configuration, the bus watchdog is just another
|
1872 |
69 |
dgisselq |
implementation of the fundamental timer described above--stripped down
|
1873 |
|
|
for simplicity.
|
1874 |
68 |
dgisselq |
|
1875 |
199 |
dgisselq |
\subsubsection{Jiffies}
|
1876 |
21 |
dgisselq |
|
1877 |
|
|
This peripheral is motivated by the Linux use of `jiffies' whereby a process
|
1878 |
|
|
can request to be put to sleep until a certain number of `jiffies' have
|
1879 |
|
|
elapsed. Using this interface, the CPU can read the number of `jiffies'
|
1880 |
|
|
from the peripheral (it only has the one location in address space), add the
|
1881 |
69 |
dgisselq |
sleep length to it, and write the result back to the peripheral. The
|
1882 |
|
|
{\tt zipjiffies}
|
1883 |
21 |
dgisselq |
peripheral will record the value written to it only if it is nearer the current
counter value than the interrupt time it is currently waiting on. If no other
|
1885 |
|
|
interrupts are waiting, and this time is in the future, it will be enabled.
|
1886 |
|
|
(There is currently no way to disable a jiffie interrupt once set, other
|
1887 |
24 |
dgisselq |
than to disable the interrupt line in the interrupt controller.) The processor
|
1888 |
21 |
dgisselq |
may then place this sleep request into a list among other sleep requests.
|
1889 |
|
|
Once the timer expires, it would write the next Jiffy request to the peripheral
|
1890 |
|
|
and wake up the process whose timer had expired.
|
1891 |
|
|
|
1892 |
|
|
Indeed, the Jiffies register is nothing more than a glorified counter with
|
1893 |
199 |
dgisselq |
an interrupt. Unlike the other counters, the internal Jiffies counter can only
|
1894 |
|
|
be read, never set.
|
1895 |
21 |
dgisselq |
Writes to the jiffies register create an interrupt time. When the Jiffies
|
1896 |
|
|
register later equals the value written to it, an interrupt will be asserted
|
1897 |
|
|
and the register then continues counting as though no interrupt had taken
|
1898 |
|
|
place.
|
1899 |
|
|
|
1900 |
199 |
dgisselq |
Finally, if the new value written to the Jiffies register is within the past
|
1901 |
|
|
$2^{31}-1$ clock ticks, the Jiffies register will immediately cause an interrupt
|
1902 |
|
|
and otherwise ignore the new request.
|
1903 |
|
|
|
1904 |
21 |
dgisselq |
The purpose of this register is to support alarm times within a CPU. To
|
1905 |
|
|
set an alarm for a particular process $N$ clocks in advance, read the current
|
1906 |
199 |
dgisselq |
Jiffies value, add $N$, and write it back to the Jiffies register. The
|
1907 |
21 |
dgisselq |
O/S must also keep track of values written to the Jiffies register. Thus,
|
1908 |
32 |
dgisselq |
when an `alarm' trips, it should be removed from the list of alarms, the list
|
1909 |
69 |
dgisselq |
should be resorted, and the next alarm in terms of Jiffies should be written
|
1910 |
|
|
to the register--possibly for a second time.
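
The bookkeeping an O/S might perform around this register can be sketched in
Python. Everything named here ({\tt signed\_diff}, {\tt is\_in\_past},
{\tt alarm\_time}, {\tt next\_alarm}) is hypothetical, and the sketch assumes
the counter is 32 bits wide and that "past" versus "future" is decided by the
sign of the 32--bit difference. The exact boundary the hardware uses is an
assumption of this example.

```python
from typing import Iterable, Optional

MASK32 = 0xFFFFFFFF


def signed_diff(a: int, b: int) -> int:
    """Return (a - b) interpreted as a signed 32-bit value."""
    d = (a - b) & MASK32
    return d - (1 << 32) if d & 0x80000000 else d


def is_in_past(when: int, now: int) -> bool:
    """True if `when` is not strictly ahead of `now`, modulo 2**32."""
    return signed_diff(when, now) <= 0


def alarm_time(now: int, n: int) -> int:
    """Value to write to the Jiffies register for an alarm n ticks ahead."""
    return (now + n) & MASK32


def next_alarm(alarms: Iterable[int], now: int) -> Optional[int]:
    """Pick the pending alarm nearest to `now`, or None if none remain."""
    return min(alarms, key=lambda t: signed_diff(t, now), default=None)
```

Comparing by signed difference, rather than by raw value, keeps the ordering
correct when the counter wraps around between the current time and the alarm.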
|
1911 |
21 |
dgisselq |
|
1912 |
199 |
dgisselq |
\subsubsection{Direct Memory Access Controller}
|
1913 |
24 |
dgisselq |
|
1914 |
36 |
dgisselq |
The Direct Memory Access (DMA) controller can be used to move memory
from one location to another, to read from a peripheral into memory, or to
write from memory to a peripheral, all without CPU intervention. Further,
since the DMA controller can issue (and does issue) pipelined wishbone accesses,
|
1918 |
|
|
any DMA memory move will by nature be faster than a corresponding program
|
1919 |
|
|
accomplishing the same move. To put this to numbers, it may take a program
|
1920 |
199 |
dgisselq |
running on the CPU 18~clocks per word transferred, whereas this DMA controller
|
1921 |
|
|
can move one word in eight clocks--provided it has bus
|
1922 |
|
|
access\footnote{The pipeline cost of the DMA controller, including setup cost,
|
1923 |
|
|
is a minimum of $14+2N$ clocks.} (The CPU gets priority over the bus, but once
|
1924 |
|
|
bus access is granted to the DMA peripheral, it will not be revoked mid--read
|
1925 |
|
|
or mid--write.)
|
1926 |
24 |
dgisselq |
|
1927 |
202 |
dgisselq |
The DMA controller supports only aligned word accesses. It does not support
|
1928 |
|
|
byte or half-word accesses.
|
1929 |
|
|
|
1930 |
36 |
dgisselq |
When copying memory from one location to another, the DMA controller will
|
1931 |
|
|
copy in units of a given transfer length--up to 1024 words at a time. It will
|
1932 |
|
|
read that transfer length into its internal buffer, and then write to the
|
1933 |
69 |
dgisselq |
destination address from that buffer.
|
1934 |
24 |
dgisselq |
|
1935 |
36 |
dgisselq |
When coupled with a peripheral, the DMA controller can be configured to start
|
1936 |
199 |
dgisselq |
a memory copy when any interrupt line goes high. Further, the controller can
|
1937 |
69 |
dgisselq |
be configured to issue reads from (or writes to) the same address instead of
|
1938 |
|
|
incrementing the address at each clock. The DMA completes once the total
|
1939 |
|
|
number of items specified (not the transfer length) have been transferred.
|
1940 |
36 |
dgisselq |
|
1941 |
|
|
In each case, once the transfer is complete and the DMA unit returns to
|
1942 |
|
|
idle, the DMA will issue an interrupt.
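
The burst behavior described above can be illustrated with a small software
model, written in C and meant to run on a host rather than on the ZipCPU. It
is only a sketch: the name {\tt dma\_model\_copy} and its arguments are
hypothetical, and it models the read--then--write buffering and the optional
fixed addresses, not the controller's actual register interface or timing.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_TLEN 1024  /* largest transfer length, per the text */

/* Software model only: copy `total` words in bursts of at most
 * `tlen` words, reading each burst into a local buffer before
 * writing it out. A nonzero src_fixed or dst_fixed holds that
 * address constant, as when it names a peripheral register.
 * Returns the number of bursts used, or 0 if tlen is invalid. */
size_t dma_model_copy(uint32_t *dst, const uint32_t *src,
                      size_t total, size_t tlen,
                      int src_fixed, int dst_fixed)
{
    uint32_t buf[DMA_MAX_TLEN];
    size_t done = 0, bursts = 0;

    if (tlen == 0 || tlen > DMA_MAX_TLEN)
        return 0;
    while (done < total) {
        size_t n = total - done, i;
        if (n > tlen)
            n = tlen;
        for (i = 0; i < n; i++)          /* read phase */
            buf[i] = src_fixed ? src[0] : src[done + i];
        for (i = 0; i < n; i++) {        /* write phase */
            if (dst_fixed)
                dst[0] = buf[i];
            else
                dst[done + i] = buf[i];
        }
        done += n;
        bursts++;
    }
    return bursts;
}
\end{verbatim}

In this model the transfer finishes when {\tt total} words have moved, not when
one burst of {\tt tlen} words completes, which mirrors the distinction the text
draws between the number of items and the transfer length.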
|
1943 |
|
|
|
1944 |
199 |
dgisselq |
% \subsubsection{Memory Management Unit}
|
1945 |
36 |
dgisselq |
|
1946 |
199 |
dgisselq |
\section{Debug Interface}\label{sec:debug}
|
1947 |
21 |
dgisselq |
|
1948 |
199 |
dgisselq |
The ZipCPU supports an external debug port. Access to the port is the
|
1949 |
|
|
same as accessing a two register peripheral on a wishbone bus, so the basic
|
1950 |
|
|
interface is fairly simple. Using this interface, it is possible both to
control the CPU and to read register values and current status from the
|
1952 |
|
|
CPU.
|
1953 |
|
|
|
1954 |
|
|
While a more detailed discussion will be reserved for Sec.~\ref{sec:reg-debug},
|
1955 |
|
|
here we'll just discuss how it is put together. The debug interface allows
|
1956 |
|
|
a controller access to the CPU reset line, and a halt line. By raising the
|
1957 |
|
|
reset line, the CPU will be caused to clear its cache, to clear any internal
|
1958 |
|
|
exception or error conditions, and then to start execution at the
|
1959 |
|
|
{\tt RESET\_ADDRESS}--just like a normal reboot. In a similar fashion, the
|
1960 |
|
|
debug interface allows you to control the {\tt cpu\_halt} line into the
|
1961 |
|
|
CPU. Holding this line high will hold the CPU in an externally halted state.
|
1962 |
|
|
Toggling the line low for one clock allows one to step the CPU by one
|
1963 |
|
|
instruction. Lowering the line causes the CPU to go. A final control wire,
|
1964 |
|
|
controlled by the debug interface, will force the CPU to clear its cache.
|
1965 |
|
|
All of these control wires are set or cleared from the debug control register.
|
1966 |
|
|
|
1967 |
|
|
The two debug command registers also make it possible to read and write
|
1968 |
|
|
all 32 registers within the CPU. In this fashion, a debugger can halt the
|
1969 |
|
|
CPU, investigate its state, and even modify registers so as to have the
|
1970 |
|
|
CPU restart from a different state.
|
1971 |
|
|
|
1972 |
|
|
Finally, without halting the CPU, the debug controller can read from any
|
1973 |
|
|
single register, and it can see if the CPU is still actively running, whether
|
1974 |
|
|
it is in user or supervisor mode, and whether or not it is sleeping. This
|
1975 |
|
|
alone is useful for detecting deadlocks or other difficult problems.
|
1976 |
|
|
|
1977 |
|
|
\chapter{Application Binary Interface}\label{chap:abi}
|
1978 |
|
|
|
1979 |
|
|
This chapter discusses not the CPU itself, but rather how the GCC and binutils
|
1980 |
|
|
toolchains have been configured to support the ZipCPU.
|
1981 |
|
|
|
1982 |
|
|
% ELF Format
|
1983 |
|
|
% Stack:
|
1984 |
|
|
% R13 is the stack register.
|
1985 |
|
|
% The stack grows downward.
|
1986 |
|
|
% Memory at the current stack pointer is allocated.
|
1987 |
202 |
dgisselq |
% Hence, a PUSH is : SUB 1,SP; SW Rx,(SP)
|
1988 |
199 |
dgisselq |
% Heap:
|
1989 |
|
|
% In general, not yet implemented.
|
1990 |
|
|
% A less than adequate Heap has been implemented as a pointer, from which
|
1991 |
|
|
% malloc requests simply decrement it. Free's are NOOPs, leaving
|
1992 |
|
|
% allocated memory allocated forever.
|
1993 |
|
|
|
1994 |
|
|
\section{Executable File Format}\label{sec:abi-elf}
|
1995 |
|
|
ZipCPU executable files are stored in the Executable and Linkable Format
|
1996 |
|
|
(ELF), prior to being placed in flash, or whatever memory they will be
|
1997 |
202 |
dgisselq |
executed from.
|
1998 |
199 |
dgisselq |
|
1999 |
202 |
dgisselq |
The ZipCPU described by this specification uses the 16--bit value {\tt 16'hdad1}
|
2000 |
|
|
to identify itself against other CPUs. This is not an officially registered
|
2001 |
|
|
number, and may change in the future.
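
A host--side tool might check for this machine number as sketched below. The
function name {\tt elf\_is\_zipcpu} is hypothetical. The field offsets are those
of the generic ELF header ({\tt e\_ident[EI\_DATA]} at byte~5, {\tt e\_machine}
at byte~18), and the constant is simply the value quoted above, which as noted
may change.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define ZIPCPU_EM 0xdad1u  /* value quoted in the text; unregistered */

/* Returns 1 if `hdr` looks like an ELF header whose e_machine
 * field equals ZIPCPU_EM, and 0 otherwise. */
int elf_is_zipcpu(const unsigned char *hdr, size_t len)
{
    uint16_t machine;

    if (len < 20)
        return 0;
    if (hdr[0] != 0x7f || hdr[1] != 'E' ||
        hdr[2] != 'L' || hdr[3] != 'F')
        return 0;
    if (hdr[5] == 2)        /* ELFDATA2MSB: big-endian fields */
        machine = (uint16_t)((hdr[18] << 8) | hdr[19]);
    else if (hdr[5] == 1)   /* ELFDATA2LSB: little-endian fields */
        machine = (uint16_t)((hdr[19] << 8) | hdr[18]);
    else
        return 0;
    return machine == ZIPCPU_EM;
}
\end{verbatim}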
|
2002 |
|
|
|
2003 |
199 |
dgisselq |
The ZipCPU does not (yet) have a dynamic linker/loader. All linking is
|
2004 |
|
|
currently static, and done prior to run time.
|
2005 |
|
|
|
2006 |
|
|
\section{Stack}\label{sec:abi-stack}
|
2007 |
202 |
dgisselq |
Register {\tt R13} (also known as the {\tt SP} register) is the stack register.
|
2008 |
|
|
The compiler generates code that grows the stack from
|
2009 |
199 |
dgisselq |
high addresses to lower addresses. That means that the stack will usually
|
2010 |
|
|
start out set to a very large value, such as one past the last RAM address,
|
2011 |
|
|
and it will grow to lower and lower values--hopefully never mixing with the
|
2012 |
|
|
heap. Memory at the current stack position is assumed to be allocated.
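
The convention that memory at the stack pointer is already allocated can be
shown with a toy model in C. This is only an illustration: the structure and
function names are hypothetical, the stack pointer is modeled as a word index
rather than an address, and no overflow checking is done.

\begin{verbatim}
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model of a descending stack. Memory at mem[sp] is in use,
 * so a push first moves sp down and then stores, while a pop
 * loads and then moves sp back up. */
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    unsigned sp;            /* starts one past the last word */
};

void toy_init(struct toy_stack *s)
{
    s->sp = STACK_WORDS;
}

void toy_push(struct toy_stack *s, uint32_t v)
{
    s->mem[--s->sp] = v;
}

uint32_t toy_pop(struct toy_stack *s)
{
    return s->mem[s->sp++];
}
\end{verbatim}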
|
2013 |
|
|
|
2014 |
|
|
When creating a stack frame for a function, the compiler will subtract
|
2015 |
|
|
the size of the stack frame from the stack register. It will then store
|
2016 |
|
|
any registers used by the function, from {\tt R5} to {\tt R12} (including
|
2017 |
|
|
the link register {\tt R0}) onto offsets given by the stack pointer plus a
|
2018 |
202 |
dgisselq |
constant. If a frame pointer is used, the compiler uses {\tt R12} (or
|
2019 |
|
|
{\tt FP}) for this purpose. The frame pointer is set by moving the stack
|
2020 |
|
|
pointer plus an offset into {\tt FP}. This {\tt MOV} instruction effectively
|
2021 |
|
|
limits the size of any individual stack frame to $2^{12}-1$ octets.
|
2022 |
199 |
dgisselq |
|
2023 |
|
|
Once a subroutine is complete, the frame is unwound. If the frame pointer,
|
2024 |
|
|
{\tt FP}, was used, then {\tt FP} is copied directly to the stack pointer,
|
2025 |
|
|
{\tt SP}. Registers are restored, starting with {\tt R0} all the way to
|
2026 |
|
|
{\tt R12} ({\tt FP}). This also restores, and obliterates, the subroutine
|
2027 |
202 |
dgisselq |
frame pointer. Once complete, a value is added to the stack pointer to
|
2028 |
|
|
return it to its original value, and a jump is made to the value located
|
2029 |
|
|
within {\tt R0}.
|
2030 |
199 |
dgisselq |
|
2031 |
|
|
\section{Relocations}\label{sec:abi-reloc}
|
2032 |
|
|
|
2033 |
202 |
dgisselq |
The ZipCPU binutils back end supports several types of relocations, although
|
2034 |
|
|
the two most common are the 32--bit relocations for register load and long
|
2035 |
|
|
jump.
|
2036 |
199 |
dgisselq |
|
2037 |
|
|
The first of these is for loading an arbitrary 32--bit value into a register.
|
2038 |
|
|
Such instructions are broken into a pair of {\tt BREV} and {\tt LDILO}
|
2039 |
202 |
dgisselq |
instructions, and once the value of the parameter is known their immediate
|
2040 |
|
|
values can be filled in.
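
The way such a pair can rebuild a full 32--bit value is sketched below as a
host--side C model. It assumes that {\tt BREV} reverses all 32~bits of its
immediate and that {\tt LDILO} replaces only the low 16~bits of the register;
the function names are hypothetical, and the actual immediate field widths and
relocation encodings are not modeled.

\begin{verbatim}
#include <stdint.h>

/* Reverse all 32 bits, modeling what BREV is assumed to do. */
uint32_t model_brev(uint32_t v)
{
    uint32_t r = 0;
    int i;

    for (i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Replace the low 16 bits of reg, modeling the assumed LDILO. */
uint32_t model_ldilo(uint32_t reg, uint32_t imm16)
{
    return (reg & 0xffff0000u) | (imm16 & 0xffffu);
}

/* Immediate for the BREV half: the upper 16 bits of the value,
 * bit-reversed so that they occupy the low 16 bits. */
uint32_t brev_immediate(uint32_t value)
{
    return model_brev(value & 0xffff0000u);
}

/* Apply the modeled pair and return the resulting register. */
uint32_t load_via_pair(uint32_t value)
{
    uint32_t reg = model_brev(brev_immediate(value));
    return model_ldilo(reg, value & 0xffffu);
}
\end{verbatim}

Under these assumptions the {\tt BREV} immediate always fits in 16~bits, and
the pair reproduces the original value exactly.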
|
2041 |
199 |
dgisselq |
|
2042 |
|
|
The second type of 32--bit relocation is for jumps to arbitrary addresses.
|
2043 |
202 |
dgisselq |
These jumps are supported by the \hbox{\tt LW (PC),PC} instruction, followed
|
2044 |
199 |
dgisselq |
by the 32--bit address to be filled in later by the linker. If the jump is
|
2045 |
202 |
dgisselq |
conditional, then a conditional \hbox{\tt LW.$x$ 4(PC),PC} instruction is
|
2046 |
|
|
used, followed by an {\tt ADD 4,PC} and then the 32--bit relocation value.
|
2047 |
199 |
dgisselq |
|
2048 |
202 |
dgisselq |
If a branch distance is known and within reach, then it will be implemented
|
2049 |
|
|
with an {\tt ADD \#,PC} instruction, possibly conditional, as necessary.
|
2050 |
199 |
dgisselq |
|
2051 |
|
|
While other relocations are supported, they tend not to be used nearly as much
|
2052 |
|
|
as these two.
|
2053 |
|
|
|
2054 |
|
|
\section{Call format}\label{sec:abi-jsr}
|
2055 |
|
|
|
2056 |
202 |
dgisselq |
One unique feature of the ZipCPU is that it has no {\tt JSR} instruction. The assembler
|
2057 |
|
|
attempts to minimize this problem by replacing a {\tt JSR}~{\em address}
|
2058 |
|
|
instruction with a {\tt MOV \#(PC),R0} followed by a jump to the requested
|
2059 |
|
|
address. In this case, the offset to the PC for the {\tt MOV} instruction
|
2060 |
|
|
is determined by whether or not the jump can be accomplished with a local
|
2061 |
|
|
branch or a long jump.
|
2062 |
199 |
dgisselq |
|
2063 |
202 |
dgisselq |
While this works well in practice, this implementation prevents such things
|
2064 |
199 |
dgisselq |
as {\tt JSR}'s followed by {\tt BRA}'s from being combined together.
|
2065 |
|
|
|
2066 |
202 |
dgisselq |
Finally, GCC will place the first five operands passed to the subroutine into
|
2067 |
199 |
dgisselq |
registers R1--R5. Any additional operands are placed upon the stack.
|
2068 |
|
|
|
2069 |
|
|
\section{Built-ins}\label{sec:abi-builtin}
|
2070 |
|
|
The ZipCPU ABI supports a number of built--in functions. The compiler
|
2071 |
|
|
maps these functions directly to assembly language equivalents, essentially
|
2072 |
|
|
providing the C~programmer with access to several assembly language
|
2073 |
|
|
instructions. These are:
|
2074 |
33 |
dgisselq |
\begin{enumerate}
|
2075 |
199 |
dgisselq |
\item {\tt zip\_bitrev(int)} reverses the bits in the given integer, returning
|
2076 |
|
|
the result. This utilizes the internal {\tt BREV} instruction, and is
|
2077 |
|
|
designed to be used with FFT's as necessary.
|
2078 |
202 |
dgisselq |
\item {\tt zip\_busy()} executes an {\tt ADD -4,PC} instruction, essentially
|
2079 |
199 |
dgisselq |
forcing the CPU into a very tight infinite loop.
|
2080 |
|
|
\item {\tt zip\_cc()} returns the value of the current CC register. This may
|
2081 |
|
|
be used within both user and supervisor code to determine in which
|
2082 |
|
|
mode the CPU is running.
|
2083 |
|
|
\item {\tt zip\_halt()} executes an \hbox{\tt OR \$SLEEP,CC} instruction to
|
2084 |
|
|
put the processor to sleep. If the processor is in supervisor mode,
|
2085 |
|
|
this halts the processor.
|
2086 |
|
|
\item {\tt zip\_rtu()} executes an \hbox{\tt OR \$GIE,CC} instruction. This
|
2087 |
|
|
will place the CPU into user mode, and has no effect if the CPU is
|
2088 |
|
|
already in user mode.
|
2089 |
|
|
\item {\tt zip\_step()} executes an \hbox{\tt OR \$STEP|\$GIE,CC} instruction.
|
2090 |
|
|
This will place the CPU into user mode in order to step one instruction,
|
2091 |
|
|
and then return to supervisor mode.
|
2092 |
|
|
It has no effect if the CPU is already in user mode.
|
2093 |
|
|
\item {\tt zip\_system()} executes an \hbox{\tt AND \textasciitilde\$GIE,CC}
|
2094 |
|
|
instruction to return the CPU to supervisor mode. This essentially
|
2095 |
|
|
executes a trap, setting the trap bit for the supervisor to examine.
|
2096 |
|
|
What this instruction does not do is arrange for the trap arguments to
|
2097 |
|
|
be placed into the registers {\tt R1} through {\tt R5}. Since this is
|
2098 |
|
|
a wholly inadequate solution, a function call may be made to an
|
2099 |
|
|
assembly routine that executes a trap if necessary.
|
2100 |
|
|
\item {\tt zip\_wait()} executes an \hbox{\tt OR \$SLEEP|\$GIE,CC} instruction.
|
2101 |
|
|
Unlike {\tt zip\_halt()}, this {\tt zip\_wait()} instruction places
|
2102 |
|
|
the CPU into a wait state regardless of whether or not the CPU is
|
2103 |
|
|
in supervisor mode. When this function, i.e.\ instruction,
|
2104 |
|
|
completes, it will leave the CPU in supervisor mode upon an interrupt
|
2105 |
|
|
having taken place.
|
2106 |
|
|
|
2107 |
|
|
You may wish to set the user program counter prior to this instruction,
|
2108 |
|
|
as the prefetch unit will try to load instructions from the address
|
2109 |
|
|
contained within the user program counter. Attempts to read from
|
2110 |
|
|
addresses with side effects may not produce the desired outcome.
|
2111 |
|
|
However, once that cache fails (or succeeds), the CPU will have been
|
2112 |
|
|
put to sleep and will do no more.
|
2113 |
|
|
|
2114 |
|
|
\item {\tt zip\_restore\_context(context *)} inserts the 32~assembly
|
2115 |
|
|
instructions necessary to load all sixteen user registers from a memory
|
2116 |
|
|
region pointed to by the given context pointer, starting with {\tt uR0}
|
2117 |
|
|
on up to {\tt uPC}.
|
2118 |
|
|
|
2119 |
|
|
\item {\tt zip\_save\_context(context *)} inserts the 32~assembly instructions
|
2120 |
|
|
necessary to copy all sixteen user registers to a memory region pointed
|
2121 |
|
|
to by the given context pointer argument, starting
|
2122 |
|
|
with {\tt uR0} on up to {\tt uPC}.
|
2123 |
|
|
\item {\tt zip\_ucc()} returns the value of the user CC register.
|
2124 |
33 |
dgisselq |
\end{enumerate}
|
2125 |
|
|
|
2126 |
199 |
dgisselq |
% Builtin functions:
|
2127 |
|
|
% zip_break();
|
2128 |
|
|
% zip_idle();
|
2129 |
|
|
% zip_syscall(a,b,c,d)
|
2130 |
|
|
%
|
2131 |
|
|
\section{Linker Scripts}\label{sec:ld}
|
2132 |
|
|
The ZipCPU makes no assumptions about its memory layout. The result, though,
|
2133 |
|
|
is that the memory layout of a given project is board specific. This
|
2134 |
|
|
is accomplished via a board specific linker script. This section will discuss
|
2135 |
|
|
some of the specifics of a ZipCPU linker script.
|
2136 |
33 |
dgisselq |
|
2137 |
199 |
dgisselq |
Because the ZipCPU uses a modified binutils package as part of its tool chain,
|
2138 |
|
|
the format for this linker script is defined by the GNU LD project within
|
2139 |
|
|
binutils. Further details on that format may be found within the GNU LD
|
2140 |
|
|
documentation within the binutils package.
|
2141 |
33 |
dgisselq |
|
2142 |
199 |
dgisselq |
This discussion will focus on those parts of the script specific to the ZipCPU.
|
2143 |
33 |
dgisselq |
|
2144 |
199 |
dgisselq |
\subsection{Memory Types}\label{sec:ld-mem}
|
2145 |
|
|
Of the FPGA boards that the ZipCPU has been applied to, most of them have some
|
2146 |
|
|
combination of three types of memory: flash, block RAM, and Synchronous
|
2147 |
|
|
Dynamic RAM (SDRAM). Of these three, only the flash is non--volatile. The
|
2148 |
|
|
block RAM is the fastest, and the SDRAM the largest. While other memory types
|
2149 |
|
|
are available, such as files on external media like an SD card or a
|
2150 |
|
|
network drive, these three types have so far been sufficient for our purposes.
|
2151 |
|
|
|
2152 |
|
|
To support these memories, the linker script has three memory lines identifying
|
2153 |
|
|
where each memory exists on the bus, the size of the memory, and any protections
|
2154 |
|
|
associated with it. For example,
|
2155 |
|
|
\begin{eqnarray*}
|
2156 |
|
|
\mbox{blkram (wx) : ORIGIN = 0x0008000, LENGTH = 0x0008000}
|
2157 |
|
|
\end{eqnarray*}
|
2158 |
|
|
specifies that there is a region of memory, called blkram, that can be read and
|
2159 |
|
|
written, and that programs can execute from. This section starts at address
|
2160 |
202 |
dgisselq |
{\tt 0x8000} and extends for another {\tt 0x8000} bytes. The other memories
|
2161 |
199 |
dgisselq |
are defined in a similar manner, with names {\tt flash} and {\tt sdram}.
|
2162 |
|
|
|
2163 |
|
|
Following the memory section, three specific symbols are defined:
|
2164 |
|
|
{\tt \_flash}, defining the beginning of flash memory,
|
2165 |
|
|
{\tt \_blkram}, defining the beginning of on--chip block RAM,
|
2166 |
|
|
and
|
2167 |
|
|
{\tt \_sdram}, defining the beginning of SDRAM.
|
2168 |
|
|
These symbols are used to make the bootloader's task easier.
|
2169 |
|
|
|
2170 |
|
|
\subsection{The Entry Function}\label{sec:ld-entry}
|
2171 |
|
|
The ZipCPU has, as a parameter, a {\tt RESET\_ADDRESS}. It is important
|
2172 |
|
|
that this address contain a valid instruction (or more), since this is the
|
2173 |
|
|
first instruction the ZipCPU will execute. Traditionally, this address is also
|
2174 |
|
|
the first address in instruction memory as well.
|
2175 |
|
|
|
2176 |
|
|
To make this happen, the ZipCPU defines two additional segments: the
|
2177 |
|
|
{\tt .start} and the {\tt .boot} segments. The {\tt .start} segment is to
|
2178 |
|
|
have nothing in it but the very initial startup code. This code needs to run
|
2179 |
|
|
from flash (or other ROM). By placing this segment at the very beginning of
|
2180 |
|
|
the ZipCPU's flash address space, and in particular at the first valid flash
|
2181 |
|
|
address, the ZipCPU will boot from this address. This is the purpose of the
|
2182 |
|
|
{\tt .start} section.
|
2183 |
|
|
|
2184 |
|
|
The {\tt .boot} section has a similar purpose. This section includes anything
|
2185 |
|
|
associated with the bootloader. It is a special section because, when loading
|
2186 |
|
|
from flash, the bootloader {\em cannot} be placed in RAM, but must be placed
|
2187 |
|
|
in flash--since it is the code that loads things from flash into RAM.
|
2188 |
|
|
|
2189 |
|
|
It may also make sense to place any code executed once only within flash as
|
2190 |
|
|
well. Such code may run slower than the main system code, but by leaving it in
|
2191 |
|
|
flash it can be kept from consuming any higher speed RAM. To do this, place
|
2192 |
|
|
this other code into the {\tt .boot} section.
|
2193 |
|
|
|
2194 |
|
|
You may also find that large data structures that are best left in flash
|
2195 |
|
|
can also be placed into this {\tt .boot} section as well for that purpose.
|
2196 |
|
|
|
2197 |
|
|
\subsection{Bootloader Tags}\label{sec:ld-boot}
|
2198 |
|
|
|
2199 |
|
|
The bootloader needs to know a couple things from the linker script. It needs
|
2200 |
|
|
to know what code/data to copy to block RAM from the flash, what code/data to
|
2201 |
|
|
copy to SDRAM, and finally what initial data area needs to be zeroed. Four
|
2202 |
|
|
additional pointers, set within a linker script, can define these regions.
|
2203 |
|
|
|
2204 |
|
|
\begin{enumerate}
|
2205 |
|
|
\item {\tt \_kernel\_image\_start}
|
2206 |
|
|
|
2207 |
|
|
This is the first location in flash containing data that the bootloader
|
2208 |
|
|
needs to move.
|
2209 |
|
|
|
2210 |
|
|
\item {\tt \_kernel\_image\_end}
|
2211 |
|
|
|
2212 |
|
|
This is a pointer to one past the last location in block RAM to place
|
2213 |
|
|
things into. If this pointer is equal to {\tt \_kernel\_image\_start},
|
2214 |
|
|
then no information is placed into block RAM.
|
2215 |
|
|
|
2216 |
|
|
\item {\tt \_sdram\_image\_start}
|
2217 |
|
|
|
2218 |
|
|
This should be equal to {\tt \_kernel\_image\_end}. It is a pointer,
|
2219 |
|
|
within block RAM address space, of the first location to be moved
|
2220 |
|
|
into SDRAM. By adding the difference between
|
2221 |
|
|
{\tt \_sdram\_image\_start} and {\tt \_blkram} to the flash address
|
2222 |
|
|
in {\tt \_kernel\_image\_start}, the actual source address within the
|
2223 |
|
|
flash of the code/data that needs to be copied into SDRAM can be
|
2224 |
|
|
determined.
|
2225 |
|
|
|
2226 |
|
|
\item {\tt \_sdram\_image\_end}
|
2227 |
|
|
|
2228 |
|
|
This is the ending address of any code/data to be copied into SDRAM.
|
2229 |
|
|
The distance between this pointer and {\tt \_sdram} should be the
|
2230 |
|
|
amount of data to be placed into SDRAM.
|
2231 |
|
|
|
2232 |
|
|
\item {\tt \_bss\_image\_end}
|
2233 |
|
|
|
2234 |
|
|
The BSS segment contains data that starts with an initial value of
|
2235 |
|
|
zero. Such data are usually not placed in the executable file, nor
|
2236 |
|
|
are they placed into any flash image. This address points to the
|
2237 |
|
|
last location in SDRAM used by the BSS segment. The bootloader
|
2238 |
|
|
is responsible then for clearing the SDRAM between
|
2239 |
|
|
{\tt \_sdram\_image\_end} and {\tt \_bss\_image\_end}.
|
2240 |
|
|
|
2241 |
|
|
The bootloader must also be robust enough to handle the cases where
|
2242 |
|
|
1) there is no SDRAM, 2) there is no block RAM, and 3) where there
|
2243 |
|
|
is no requirement to move memory at all---such as when the program
|
2244 |
|
|
is placed into memory and started from there.
|
2245 |
|
|
\end{enumerate}
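
The copy and clear steps implied by these tags can be sketched in C as follows.
This is a minimal sketch under stated assumptions, not the bootloader shipped
with any particular design: the function name is hypothetical, the linker
symbols are passed in as ordinary pointer arguments so the routine can be
exercised on a host, and {\tt bss\_image\_end} is treated as one past the last
BSS word.

\begin{verbatim}
#include <stdint.h>

/* Sketch of a bootloader's three steps. The arguments stand in
 * for the linker symbols _kernel_image_start, _blkram,
 * _kernel_image_end, _sdram, _sdram_image_end and
 * _bss_image_end, respectively. */
void bootload_sketch(const uint32_t *kernel_image_start,
                     uint32_t *blkram, uint32_t *kernel_image_end,
                     uint32_t *sdram, uint32_t *sdram_image_end,
                     uint32_t *bss_image_end)
{
    const uint32_t *src = kernel_image_start;
    uint32_t *dst;

    /* 1. Flash to block RAM; skipped if the range is empty. */
    for (dst = blkram; dst < kernel_image_end; dst++)
        *dst = *src++;
    /* 2. Flash to SDRAM. The source simply continues from where
     * the block RAM image ended within the flash. */
    for (dst = sdram; dst < sdram_image_end; dst++)
        *dst = *src++;
    /* 3. Zero the BSS area that follows the SDRAM image. */
    for (dst = sdram_image_end; dst < bss_image_end; dst++)
        *dst = 0;
}
\end{verbatim}

Because each loop does nothing when its range is empty, the same routine
covers a design with no block RAM image or no SDRAM image, provided the
corresponding start and end pointers are equal.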
|
2246 |
|
|
\subsection{Other required linker symbols}\label{sec:ld-other}
|
2247 |
|
|
|
2248 |
|
|
Two other symbols need to be defined in the linker script, which are used
|
2249 |
|
|
by the startup code. These are:
|
2250 |
|
|
\begin{enumerate}
|
2251 |
|
|
\item {\tt \_top\_of\_stack}
|
2252 |
|
|
|
2253 |
|
|
This is the address that the startup code will set the stack pointer
|
2254 |
|
|
to point to. It may be one past the last location of a RAM memory,
|
2255 |
|
|
whether block RAM or SDRAM.
|
2256 |
|
|
|
2257 |
|
|
\item {\tt \_top\_of\_heap}
|
2258 |
|
|
|
2259 |
|
|
This is the first location past the end of the {\tt .bss} segment.
|
2260 |
|
|
Equivalently, this is the address of the first unused piece of
|
2261 |
|
|
memory, or the location from whence to start any dynamic memory
|
2262 |
|
|
subsystem.
|
2263 |
|
|
\end{enumerate}
|
2264 |
|
|
|
2265 |
202 |
dgisselq |
All of these symbols need to reference word aligned addresses.
|
2266 |
|
|
|
2267 |
199 |
dgisselq |
\section{Loading ZipCPU Programs}
|
2268 |
|
|
There are two basic ways to load a ZipCPU program, depending upon whether or
|
2269 |
|
|
not the ZipCPU is active within the current configuration. If the ZipCPU
|
2270 |
|
|
is not a part of the current FPGA configuration, one need only write the
|
2271 |
|
|
flash and then switch configurations. It will be the CPU's responsibility
|
2272 |
|
|
to place itself in RAM then.
|
2273 |
|
|
|
2274 |
|
|
The more practical alternative is a little more involved, and there are
|
2275 |
|
|
several steps to it.
|
2276 |
|
|
\begin{enumerate}
|
2277 |
|
|
\item Halt the CPU by writing 0x0440 to the CPU control register. This
|
2278 |
|
|
both halts and resets the CPU. Doing so prevents bus contention while
the new instructions are written to memory, and it keeps the CPU from
running some instructions from one program and other instructions
from another.
|
2282 |
|
|
\item Load the program into memory. For many programs this will involve
|
2283 |
|
|
loading the program into flash, and necessitate having and using a
|
2284 |
|
|
flash controller. The ZipCPU also supports being loaded straight into
|
2285 |
|
|
RAM as well, as though the bootloader had completed
|
2286 |
|
|
its task.
|
2287 |
|
|
\item You may optionally, at this point, clear all of the CPU's registers,
|
2288 |
|
|
to make certain the reboot is clean.
|
2289 |
|
|
\item Set the sPC register to the starting address.
|
2290 |
|
|
\item Clear the instruction cache in order to force the CPU to reload its
|
2291 |
|
|
cache upon start.
|
2292 |
|
|
\item Release the CPU by writing to the CPU debug control register a number
|
2293 |
|
|
between 0 and 31. This number will correspond to the register number
|
2294 |
|
|
of the register that can be ``peeked'' at while the CPU is running.
|
2295 |
|
|
\end{enumerate}
|
2296 |
|
|
|
2297 |
|
|
%
|
2298 |
|
|
\section{Starting a ZipCPU program}
|
2299 |
|
|
\subsection{CRT0}
|
2300 |
|
|
|
2301 |
|
|
Most computers have a section of code, conventionally called {\tt crt0}, which
|
2302 |
|
|
loads a program into memory. On the ZipCPU, this code starts at {\tt \_start}.
|
2303 |
|
|
It is responsible for setting the stack pointer, calling the boot loader,
|
2304 |
|
|
and then calling the main entry function, {\tt entry()}.
|
2305 |
|
|
|
2306 |
|
|
Because {\tt \_start} {\em must} be the first symbol in a program, and because
|
2307 |
|
|
that first symbol is located at the boot address for the CPU, the {\tt \_start}
|
2308 |
|
|
is placed into the {\tt .start} segment. It is the only routine placed there.
|
2309 |
|
|
|
2310 |
|
|
On those CPUs that don't have enough logic space for a debugger, it may be
|
2311 |
|
|
useful to place a routine to dump any registers, stack values and/or kernel
|
2312 |
|
|
traces to an output routine at this time. That way, on any kernel fault, the
|
2313 |
|
|
kernel can be brought back up with a debug trace. This works because rebooting
|
2314 |
|
|
the CPU doesn't reset any register values save the {\tt sCC} and {\tt sPC}.
|
2315 |
|
|
|
2316 |
|
|
\subsection{The Bootloader}
|
2317 |
|
|
|
2318 |
|
|
As discussed in Sec.~\ref{sec:ld-boot}, the bootloader must be placed into
|
2319 |
|
|
flash if it is used. It can be a small C program (it need not be assembly,
|
2320 |
|
|
like {\tt \_start}), and it only needs to copy memory. First, it copies any
|
2321 |
|
|
memory from flash to block RAM. Second, it copies any necessary memory from
|
2322 |
|
|
flash to SDRAM. Then, it zeros any memory necessary in SDRAM (or block RAM,
|
2323 |
|
|
if there is no SDRAM).
|
2324 |
|
|
|
2325 |
|
|
These memory copies may be done with the DMA, or they may be done one word at a
time, at a performance penalty.
|
2327 |
|
|
|
2328 |
|
|
\subsection{Kernel Entry}
|
2329 |
|
|
|
2330 |
|
|
After calling the boot loader, execution returns to the {\tt \_start} routine
|
2331 |
|
|
which then calls the main program entry function, {\tt entry()}. No
|
2332 |
|
|
requirements are laid upon this entry function regarding where it must reside.
|
2333 |
|
|
The simplest place to put it is in Block RAM--and just to put all code and
|
2334 |
|
|
variables there. In reality, this entry function may easily be left in flash.
|
2335 |
|
|
It often doesn't need to run particularly fast, since there may easily be
|
2336 |
|
|
one--time setup functions that are independent of the program's main loop.
|
2337 |
|
|
|
2338 |
|
|
\subsection{Kernel Main}
|
2339 |
|
|
|
2340 |
|
|
If the kernel entry function, {\tt entry()}, is placed in flash, it should call
|
2341 |
|
|
a separate function to run the main while loop once it has been set up. In
|
2342 |
|
|
this fashion, the main while loop may be kept in the fastest memory necessary
|
2343 |
|
|
(that it will fit within), to ensure good performance.
|
2344 |
|
|
|
2345 |
|
|
\chapter{Operation}\label{chap:ops}
|
2346 |
|
|
|
2347 |
|
|
This chapter will explore how to perform common tasks with the ZipCPU,
|
2348 |
|
|
offering examples in both C and assembly for those tasks.
|
2349 |
|
|
|
2350 |
|
|
\section{CRT0}
|
2351 |
|
|
|
2352 |
|
|
Of course, the one task that every CPU must do is start the CPU for other
|
2353 |
|
|
tasks. The ZipCPU is no different. This is the one ZipCPU task that must
|
2354 |
|
|
take place in assembly, since no assumptions can be made about the state of
|
2355 |
|
|
the ZipCPU upon entry. In particular, the stack pointer, SP, needs to be
|
2356 |
|
|
loaded with a valid memory location before any higher level language can work.
|
2357 |
|
|
Once that has taken place, it is then possible to call other higher level
|
2358 |
|
|
routines.
|
2359 |
|
|
|
2360 |
|
|
Tbl.~\ref{tbl:op-init}
|
2361 |
|
|
\begin{table}\begin{center}
|
2362 |
|
|
\begin{tabbing}
|
2363 |
|
|
{\em ; By starting our loader in the .start section, we guarantee through our}\\
|
2364 |
|
|
{\em ; linker script that these are the very first instructions the CPU sees.}\\
|
2365 |
|
|
\hbox to 0.25in{}\={\tt .section .start} \\
|
2366 |
|
|
\> {\tt .global \_start} \\
|
2367 |
|
|
{\em ; \_start is to be placed at our reboot/reset address, so it will be}\\
|
2368 |
|
|
{\em ; called upon any reboot.}\\
|
2369 |
|
|
{\tt \_start:} \\
|
2370 |
|
|
\> {\em ; The most important step: creating a stack pointer. The value}\\
|
2371 |
|
|
\> {\em ; {\tt \_top\_of\_stack} is created by the linker based upon the linker script.}\\
|
2372 |
|
|
\> {\tt LDI \_top\_of\_stack,SP} \\
|
2373 |
|
|
\> {\em ; We then call the bootloader to load our code into memory.}\\
|
2374 |
|
|
\> {\tt MOV \_after\_bootloader(PC),R0} \\
|
2375 |
|
|
\> {\tt BRA bootloader} \\
|
2376 |
|
|
{\tt \_after\_bootloader:} \\
|
2377 |
|
|
\> {\em ; Just in case the bootloader messed up the stack, we'll reset it here.}\\
|
2378 |
|
|
\> {\tt LDI \_top\_of\_stack,SP} \\
|
2379 |
|
|
\> {\em ; Finally, we call the entry function for the entire design.}\\
|
2380 |
|
|
\> {\tt MOV \_kernel\_exit(PC),R0}\\
|
2381 |
|
|
\> {\tt BRA entry}\\
|
2382 |
|
|
{\em ; The {\tt \_kernel\_exit} routine that follows isn't strictly necessary,}\\
|
2383 |
|
|
{\em ; since the CPU should never return from the {\tt entry()} function. However,}\\
|
2384 |
|
|
{\em ; since returning from such a function is valid C code, and just in case}\\
|
2385 |
|
|
{\em ; it does return, we provide this function as a fail safe to make sure}\\
|
2386 |
|
|
{\em ; the kernel halts upon completion.}\\
|
2387 |
|
|
{\tt \_kernel\_exit:}\\
|
2388 |
|
|
\> {\tt HALT}\\
|
2389 |
|
|
\> {\tt BRA \_kernel\_exit}
|
2390 |
|
|
\end{tabbing}
|
2391 |
|
|
\caption{Setting up a stack frame and starting the CPU}\label{tbl:op-init}
|
2392 |
|
|
\end{center}\end{table}
|
2393 |
|
|
presents an example of one such initialization routine
|
2394 |
|
|
that first sets up the stack, then calls a bootloader routine. Upon completion,
|
2395 |
|
|
the initialization routine then calls the main entry point for the CPU. Should
|
2396 |
|
|
that main entry point ever return, a short routine following halts the CPU.
|
2397 |
|
|
|
2398 |
|
|
The example also highlights one optimization that didn't take place. Instead
|
2399 |
|
|
of placing the {\tt \_after\_bootloader} address into {\tt R0}, this script
|
2400 |
|
|
could have placed the {\tt entry()} address into {\tt R0}. Had it done so, the
|
2401 |
|
|
CPU would not have suffered the pipeline stalls associated with two long jumps:
|
2402 |
|
|
the first to {\tt R0}, and the second to {\tt entry}. Instead, it would have
|
2403 |
|
|
suffered such stalls only once: when jumping to {\tt entry()}.
|
2404 |
|
|
|
2405 |
|
|
% \section{Example bootloader}
|
2406 |
|
|
|
2407 |
68 |
dgisselq |
\section{System High}
|
2408 |
199 |
dgisselq |
The easiest and simplest way to run the ZipCPU is in the system high mode.
|
2409 |
68 |
dgisselq |
In this mode, the CPU runs your program in supervisor mode from reboot to
|
2410 |
|
|
power down, and is never interrupted. You will need to poll the interrupt
|
2411 |
|
|
controller to determine when any external condition has become active. This
|
2412 |
199 |
dgisselq |
mode is incredibly useful, and can handle many microcontroller--type tasks.
|
2413 |
68 |
dgisselq |
|
2414 |
|
|
Even better, in system high mode, all of the user registers are available
|
2415 |
|
|
to the system high program as variables. Accessing these registers can be
|
2416 |
|
|
done in a single clock cycle, which would move them to the active register
|
2417 |
|
|
set or move them back. While this may seem like a load or store instruction,
|
2418 |
|
|
none of these register accesses will suffer from memory delays.
|
2419 |
|
|
|
2420 |
|
|
The one thing that cannot be done in supervisor mode is a wait for interrupt
|
2421 |
|
|
instruction. This, however, is easily rectified by jumping to a user task
|
2422 |
|
|
within the supervisor's memory space, such as the one in Tbl.~\ref{tbl:shi-idle}.
|
2423 |
|
|
\begin{table}\begin{center}
|
2424 |
|
|
\begin{tabbing}
|
2425 |
|
|
{\tt supervisor\_idle:} \\
|
2426 |
|
|
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
|
2427 |
|
|
\> {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
|
2428 |
|
|
\> {\em ; outside of the CPU's address space when it switches to user} \\
|
2429 |
|
|
\> {\em ; mode.} \\
|
2430 |
|
|
\> {\tt MOV supervisor\_idle\_continue,uPC} \\
|
2431 |
|
|
\> {\em ; Put the processor into user mode and to sleep in the same} \\
|
2432 |
|
|
\> {\em ; instruction. } \\
|
2433 |
|
|
\> {\tt OR \$SLEEP|\$GIE,CC} \\
|
2434 |
|
|
{\tt supervisor\_idle\_continue:} \\
|
2435 |
|
|
\> {\em ; Now, if we haven't done this inline, we need to return} \\
|
2436 |
|
|
\> {\em ; to whatever function called us.} \\
|
2437 |
|
|
\> {\tt RETN} \\
|
2438 |
|
|
\end{tabbing}
|
2439 |
|
|
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
|
2440 |
|
|
\end{center}\end{table}
|
2441 |
|
|
|
2442 |
199 |
dgisselq |
There are some problems with this model, however. For example, even though
|
2443 |
|
|
the user registers can be accessed in a single cycle, there is currently
|
2444 |
|
|
no way to do so other than with assembly instructions.
|
2445 |
|
|
|
2446 |
|
|
An alternative to this approach is to use the {\tt zip\_wait()} built--in
|
2447 |
|
|
function. This places the ZipCPU into an idle/sleep mode to wait for
|
2448 |
|
|
interrupts. Because the supervisor puts the CPU to sleep, rather than the
|
2449 |
|
|
user, no user context needs to be set up.
|
2450 |
|
|
|
2451 |
|
|
\section{A Programmable Delay}
|
2452 |
|
|
|
2453 |
|
|
One common task in microcontrollers, whether in a user task or supervisor
|
2454 |
|
|
task, is to wait for a programmable amount of time. Using the ZipSystem,
|
2455 |
|
|
there are several peripherals that can be used to create such a delay.
|
2456 |
|
|
It can be done with one of the three timers, the jiffies, or even an off-chip
|
2457 |
|
|
ZipCounter.
|
2458 |
|
|
|
2459 |
|
|
Here, in Tbl.~\ref{tbl:shi-timer},
|
2460 |
|
|
\begin{table}\begin{center}
|
2461 |
|
|
\begin{tabbing}
|
2462 |
|
|
{\tt \#define EINT(A) (0x80000000|(A<<16))} \= {\em // Enable interrupt A}\\
|
2463 |
|
|
{\tt \#define DINT(A) (A<<16)} \>{\em // Just disable the interrupts in A}\\
|
2464 |
|
|
{\tt \#define DISABLEALL 0x7fff0000} \>{\em // Disable all interrupts}\\
|
2465 |
|
|
{\tt \#define CLEARPIC 0x7fff7fff} \>{\em // Clears and disables all interrupts}\\
|
2466 |
|
|
{\tt \#define SYSINT\_TMA 0x10} \>{\em // The Timer--A interrupt mask}\\
|
2467 |
|
|
\\
|
2468 |
|
|
{\tt void timer\_delay(int nclocks) \{} \\
|
2469 |
|
|
\hbox to 0.25in{}\= {\em // Clear the PIC. We want to exit from here on timer counts alone}\\
|
2470 |
|
|
\> {\tt zip->pic = DISABLEALL|SYSINT\_TMA;}\\
|
2471 |
|
|
\> {\tt if (nclocks > 10) \{}\\
|
2472 |
|
|
\> \hbox to 0.25in{}\= {\em // Set our timer to count down the given number of counts}\\
|
2473 |
202 |
dgisselq |
\> \> {\tt zip->tma = nclocks;} \\
|
2474 |
199 |
dgisselq |
\> \> {\tt zip->pic = EINT(SYSINT\_TMA);} \\
|
2475 |
|
|
\> \> {\tt zip\_wait();} \\
|
2476 |
|
|
\> \> {\tt zip->pic = CLEARPIC;} \\
|
2477 |
202 |
dgisselq |
\> {\tt \} }{\em // else anything less has likely already passed} \\
|
2478 |
199 |
dgisselq |
{\tt \}}\\
|
2479 |
|
|
\end{tabbing}
|
2480 |
|
|
\caption{Waiting on a timer}\label{tbl:shi-timer}
|
2481 |
|
|
\end{center}\end{table}
|
2482 |
|
|
we present one means of waiting for a programmable amount of time using a
|
2483 |
|
|
timer. If exact timing is important, you may wish to calibrate the method
|
2484 |
|
|
by subtracting from {\tt nclocks} the number of clocks it takes to run the
routine. Otherwise, the routine is guaranteed to wait at least {\tt nclocks}
|
2486 |
|
|
ticks.
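
To make the register values concrete, the sketch below evaluates the macros of Tbl.~\ref{tbl:shi-timer} and shows one way the calibration might be done. The macros are rewritten here with unsigned constants and extra parentheses so that they are well defined in portable C; {\tt calibrated\_counts} and its {\tt overhead} argument are hypothetical, and the overhead itself would have to be measured on the target.

```c
#include <assert.h>
#include <stdint.h>

#define EINT(A)    (0x80000000u | ((uint32_t)(A) << 16)) /* enable A, master enable set */
#define DINT(A)    ((uint32_t)(A) << 16)                 /* disable the interrupts in A */
#define DISABLEALL 0x7fff0000u                           /* disable all interrupts      */
#define SYSINT_TMA 0x10u                                 /* the Timer-A interrupt mask  */

/* Hypothetical helper: subtract a measured call overhead from the
 * requested delay, never returning a negative count. */
int calibrated_counts(int nclocks, int overhead)
{
    assert(overhead >= 0);
    return (nclocks > overhead) ? (nclocks - overhead) : 0;
}
```

With these definitions, {\tt EINT(SYSINT\_TMA)} evaluates to {\tt 0x80100000} and {\tt DISABLEALL|SYSINT\_TMA} to {\tt 0x7fff0010}.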
|
2487 |
|
|
|
2488 |
|
|
Notice that the routine clears the PIC early on. While one might expect
|
2489 |
|
|
that this could be done in the instruction immediately before {\tt zip\_wait()},
this isn't the case. The reason is a race condition created by the fact that
the write to the PIC completes after the {\tt zip\_wait()} instruction. As a
|
2492 |
|
|
result, you might find yourself with a zero delay simply because the timer
|
2493 |
|
|
had tripped some time earlier.
|
2494 |
|
|
|
2495 |
|
|
The routine is also careful not to clear any other interrupts beyond the timer
|
2496 |
|
|
interrupt, lest some other condition trip that the user was also waiting on.
|
2497 |
|
|
|
2498 |
68 |
dgisselq |
\section{Traditional Interrupt Handling}
|
2499 |
199 |
dgisselq |
Although the ZipCPU does not have a traditional interrupt architecture,
|
2500 |
68 |
dgisselq |
it is possible to create the more traditional interrupt approach via software.
|
2501 |
|
|
In this mode, the programmable interrupt controller is used together with the
|
2502 |
|
|
supervisor state to create the illusion of more traditional interrupt handling.
|
2503 |
|
|
|
2504 |
|
|
To set this up, upon reboot the supervisor task:
|
2505 |
|
|
\begin{enumerate}
|
2506 |
|
|
\item Creates a (single) user context, a user stack, and sets the user
|
2507 |
|
|
program counter to the entry of the user task
|
2508 |
|
|
\item Creates a task table of ISR entries
|
2509 |
|
|
\item Enables the master interrupt enable via the interrupt controller, albeit
|
2510 |
|
|
without enabling any of the fifteen potential underlying interrupts.
|
2511 |
|
|
\item Switches to user mode, as the first part of the while loop in
|
2512 |
|
|
Tbl.~\ref{tbl:traditional-isr}.
|
2513 |
|
|
\end{enumerate}
|
2514 |
|
|
\begin{table}\begin{center}
|
2515 |
|
|
\begin{tabbing}
|
2516 |
|
|
{\tt while(true) \{} \\
|
2517 |
199 |
dgisselq |
\hbox to 0.25in{}\= {\tt zip\_rtu();}\\
|
2518 |
|
|
\> {\tt if (zip\_ucc() \& CC\_TRAPBIT) \{} {\em // Here, we allow users to install ISRs, or} \\
|
2519 |
68 |
dgisselq |
\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
|
2520 |
199 |
dgisselq |
\>\> {\tt \ldots} \\
|
2521 |
|
|
\> {\tt \} else if (zip\_ucc() \& (CC\_BUSERR|CC\_FPUERR|CC\_DIVERR)) \{}\\
|
2522 |
|
|
\>\> {\em // Here we handle any faults that the CPU may have
|
2523 |
|
|
encountered }\\
|
2524 |
|
|
\>\> {\em // The easiest solution is often to print a trace and reboot}\\
|
2525 |
|
|
\>\> {\em // the CPU.}\\
|
2526 |
|
|
\>\> {\tt \_start();} \\
|
2527 |
68 |
dgisselq |
\> {\tt \} else \{} \\
|
2528 |
|
|
\> \> {\em // At this point, we know an interrupt has taken place: Ask the programmable}\\
|
2529 |
|
|
\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
|
2530 |
199 |
dgisselq |
\> \> {\tt int picv = zip->pic;}\\
|
2531 |
68 |
dgisselq |
\> \> {\em // Turn off all active interrupts}\\
|
2532 |
|
|
\> \> {\em // Globally disable interrupt generation in the process}\\
|
2533 |
|
|
\> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\
|
2534 |
199 |
dgisselq |
\> \> {\tt zip->pic = (active<<16);}\\
|
2535 |
68 |
dgisselq |
\> \> {\em // We build a mask of interrupts to re-enable in picv.}\\
|
2536 |
|
|
\> \> {\tt picv = 0;}\\
|
2537 |
|
|
\> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
|
2538 |
|
|
\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
|
2539 |
199 |
dgisselq |
\> \>\>\hbox to 0.25in{}\={\em // Here we call our interrupt service routine.}\\
|
2540 |
|
|
\> \>\>\hbox to 0.25in{}\= {\tt tmp = (isr\_table[i])(); }\\
|
2541 |
68 |
dgisselq |
\> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
|
2542 |
199 |
dgisselq |
\>\>\>\> {\em // re-enable its own interrupts without re-enabling all interrupts. Hence, we}\\
|
2543 |
|
|
\>\>\>\> {\em // look at the return value from the ISR to know if an interrupt needs to be }\\
|
2544 |
68 |
dgisselq |
\> \>\>\> {\em // re-enabled. }\\
|
2545 |
199 |
dgisselq |
\> \>\>\> {\tt if (tmp)} \\
|
2546 |
|
|
\> \>\>\> \hbox to 0.25in{}\={\tt picv |= msk; }\\
|
2547 |
68 |
dgisselq |
\> \>\> {\tt \} }\\
|
2548 |
|
|
\> \> {\tt \} }\\
|
2549 |
199 |
dgisselq |
\> \> {\em // Re-activate, but do not clear, all (requested) interrupts }\\
|
2550 |
|
|
\> \> {\tt zip->pic = (picv<<16) | 0x80000000; }\\
|
2551 |
68 |
dgisselq |
\>{\tt \} }\\
|
2552 |
|
|
{\tt \}}\\
|
2553 |
|
|
\end{tabbing}
|
2554 |
|
|
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
|
2555 |
|
|
\end{center}\end{table}
|
2556 |
|
|
|
2557 |
|
|
We can work through the interrupt handling process by examining
|
2558 |
|
|
Tbl.~\ref{tbl:traditional-isr}. First, remember, the CPU is always running
|
2559 |
|
|
either the user or the supervisor context. Once the supervisor switches to
|
2560 |
199 |
dgisselq |
user mode, control does not return until either an interrupt, a trap, or an
|
2561 |
|
|
exception has taken place. Therefore, if neither the trap bit nor any of the
|
2562 |
|
|
exception bits have been set, then we know an interrupt has taken place.
|
2563 |
68 |
dgisselq |
|
2564 |
199 |
dgisselq |
It is also possible that an interrupt will occur coincident with a trap or
|
2565 |
|
|
exception. If this is the case, the subsequent {\tt zip\_rtu()} instruction
|
2566 |
|
|
will return immediately, since the interrupt has yet to be cleared.
|
2567 |
68 |
dgisselq |
|
2568 |
|
|
As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which
|
2569 |
|
|
interrupts are enabled, and the bottom stores which have tripped. (Interrupts
|
2570 |
|
|
may trip without being enabled, they just will not generate an interrupt to the
|
2571 |
|
|
CPU.) Our first step is to query the register to find out our interrupt
|
2572 |
|
|
state, and then to disable any interrupts that have tripped. To do
|
2573 |
|
|
that, we write a one to the enable half of the register while also clearing
|
2574 |
|
|
the top bit (master interrupt enable). This has the consequence of disabling
|
2575 |
|
|
any and all further interrupts, not just the ones that have tripped. Hence,
|
2576 |
|
|
upon completion, we re--enable the master interrupt bit again. Finally,
|
2577 |
|
|
we keep track of which interrupts have tripped.
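
The bit manipulation described above can be checked off target. This sketch assumes the layout just given (enables in bits 16--30, tripped interrupts in bits 0--14, master enable in bit 31); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupts that are both enabled (bits 16..30) and tripped (bits 0..14). */
uint32_t pic_active(uint32_t picv)
{
    return (picv >> 16) & picv & 0x7fffu;
}

/* Word to write back: disables the active interrupts and, because bit 31
 * is left clear, the master interrupt enable as well. */
uint32_t pic_disable_word(uint32_t active)
{
    assert((active & ~0x7fffu) == 0); /* only fifteen interrupt bits exist */
    return active << 16;
}
```

For example, a PIC value of {\tt 0x80110013} has interrupts 0 and 4 enabled and interrupts 0, 1, and 4 tripped, so only interrupts 0 and 4 are reported as active.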
|
2578 |
|
|
|
2579 |
|
|
Using the bit mask of interrupts that have tripped, we walk through all fifteen
|
2580 |
199 |
dgisselq |
possible interrupts. If there is an ISR installed, we simply call it here. The
|
2581 |
|
|
ISR, however, cannot re--enable its interrupt without re-enabling the master
|
2582 |
|
|
interrupt bit. Thus, to keep things simple, when the ISR is finished it returns
|
2583 |
|
|
a boolean into R0 to indicate whether or not the interrupt needs to be
|
2584 |
|
|
re-enabled. Put together, this tells the kernel which interrupts to re--enable.
|
2585 |
|
|
As a final step, interrupts are re--enabled before continuing
|
2586 |
|
|
the while loop.
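
The walk through the ISR table can likewise be sketched in portable C. Here {\tt dispatch\_isrs}, {\tt isr\_fn}, and the two stand--in ISRs are hypothetical names chosen for illustration; the function returns the mask of interrupts whose ISR asked to be re--enabled, which the caller then combines with the master interrupt enable when writing the PIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*isr_fn)(void);   /* returns nonzero to request re-enabling */

/* Call the installed ISR for each active interrupt and collect the bits
 * whose ISR asked to be re-enabled. */
uint32_t dispatch_isrs(uint32_t active, isr_fn const isr_table[15])
{
    uint32_t reenable = 0;

    assert(isr_table != NULL);
    for (int i = 0; i < 15; i++) {
        uint32_t msk = (uint32_t)1 << i;
        if ((active & msk) && isr_table[i] != NULL) {
            if (isr_table[i]())
                reenable |= msk;
        }
    }
    return reenable;
}

/* Two stand-in ISRs, for illustration only. */
int isr_keep(void) { return 1; }
int isr_drop(void) { return 0; }
```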
|
2587 |
68 |
dgisselq |
|
2588 |
199 |
dgisselq |
There you have it: the ZipCPU, with its non-traditional interrupt architecture,
|
2589 |
|
|
can still process interrupts in a very traditional fashion.
|
2590 |
|
|
|
2591 |
|
|
\section{Idle Task}
|
2592 |
|
|
One task every operating system needs is the idle task, the task that takes
|
2593 |
|
|
place when nothing else can run. On the ZipCPU, this task is quite simple,
|
2594 |
|
|
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm},
|
2595 |
68 |
dgisselq |
\begin{table}\begin{center}
|
2596 |
|
|
\begin{tabbing}
|
2597 |
199 |
dgisselq |
{\tt idle\_task:} \\
|
2598 |
|
|
\hbox to 0.25in{}\= {\em ; Wait for the next interrupt, then switch to supervisor task} \\
|
2599 |
|
|
\> {\tt WAIT} \\
|
2600 |
|
|
\> {\em ; When we come back, it's because the supervisor wishes to} \\
|
2601 |
|
|
\> {\em ; wait for an interrupt again, so go back to the top.} \\
|
2602 |
|
|
\> {\tt BRA idle\_task} \\
|
2603 |
68 |
dgisselq |
\end{tabbing}
|
2604 |
199 |
dgisselq |
\caption{Example Idle Task in Assembly}\label{tbl:idle-asm}
|
2605 |
68 |
dgisselq |
\end{center}\end{table}
|
2606 |
199 |
dgisselq |
or equivalently in C in Tbl.~\ref{tbl:idle-c}.
|
2607 |
|
|
\begin{table}\begin{center}
|
2608 |
|
|
\begin{tabbing}
|
2609 |
|
|
{\tt void idle\_task(void) \{} \\
|
2610 |
|
|
\hbox to 0.25in{}\={\tt while(true) \{} {\em // Never exit}\\
|
2611 |
|
|
\> {\em // Wait for the next interrupt, then switch to supervisor task} \\
|
2612 |
68 |
dgisselq |
|
2613 |
199 |
dgisselq |
\> {\tt zip\_wait();} \\
|
2614 |
|
|
\> {\em // } \\
|
2615 |
|
|
\> {\em // When we come back, it's because the supervisor wishes to} \\
|
2616 |
|
|
\> {\em // wait for an interrupt again, so go back to the top.} \\
|
2617 |
|
|
\> {\tt \}} \\
|
2618 |
|
|
{\tt \}}
|
2619 |
|
|
\end{tabbing}
|
2620 |
|
|
\caption{Example Idle Task in C}\label{tbl:idle-c}
|
2621 |
|
|
\end{center}\end{table}
|
2622 |
68 |
dgisselq |
|
2623 |
36 |
dgisselq |
When this task runs, the CPU will fill up all of the pipeline stages up to the
|
2624 |
|
|
ALU. The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
|
2625 |
199 |
dgisselq |
a sleep state where nothing more moves. Then, once an interrupt takes place,
|
2626 |
|
|
control passes to the supervisor task to handle the interrupt. When control
|
2627 |
|
|
passes back to this task, it will be on the next instruction. Since that next
|
2628 |
|
|
instruction sends us back to the top of the task, the idle task thus does
|
2629 |
|
|
nothing but wait for an interrupt.
|
2630 |
36 |
dgisselq |
|
2631 |
|
|
This should be the lowest priority task, the task that runs when nothing else
|
2632 |
|
|
can. It will help lower the FPGA power usage overall---at least its dynamic
|
2633 |
|
|
power usage.
|
2634 |
|
|
|
2635 |
199 |
dgisselq |
For those highly interested in reducing power consumption, the clock could
|
2636 |
|
|
even be disabled at this time, requiring only some small modifications to the
|
2637 |
|
|
core.
|
2638 |
|
|
|
2639 |
|
|
\section{Memory Copy}
|
2640 |
|
|
One common operation is that of a memory move or copy. This section will
|
2641 |
|
|
present several methods available to the ZipCPU for performing a memory
|
2642 |
|
|
copy, starting with the C code shown in Tbl.~\ref{tbl:memcp-c}.
|
2643 |
36 |
dgisselq |
\begin{table}\begin{center}
|
2644 |
|
|
\parbox{4in}{\begin{tabbing}
|
2645 |
202 |
dgisselq |
{\tt void} \= {\tt memcpy(char *dest, char *src, int len) \{} \\
|
2646 |
36 |
dgisselq |
\> {\tt for(int i=0; i<len; i++)} \\
|
2647 |
|
|
\> \hspace{0.2in} {\tt *dest++ = *src++;} \\
|
2648 |
|
|
{\tt \}}
|
2649 |
|
|
\end{tabbing}}
|
2650 |
|
|
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
|
2651 |
|
|
\end{center}\end{table}
|
2652 |
199 |
dgisselq |
Each successive example will further speed up the memory copying process.
|
2653 |
|
|
|
2654 |
|
|
This memory copy code can be translated into Zip Assembly as shown in
|
2655 |
36 |
dgisselq |
Tbl.~\ref{tbl:memcp-asm}.
|
2656 |
|
|
\begin{table}\begin{center}
|
2657 |
199 |
dgisselq |
\begin{tabbing}
|
2658 |
|
|
memcpy: \\
|
2659 |
|
|
\hbox to 0.35in{}\={\em ; R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
|
2660 |
|
|
\> {\em ; The following will operate in 6 ($N=0$), or $2+12N$ clocks ($N\neq 0$).} \\
|
2661 |
|
|
\> {\tt CMP 0,R3} \\ % 8 clocks per setup
|
2662 |
202 |
dgisselq |
\> {\tt RETN.Z} \hbox to 0.3in{}\= {\em ; A conditional return }\\
|
2663 |
199 |
dgisselq |
\> {\em ; No stack frame needs to be set up to use {\tt R4}, since the compiler}\\
|
2664 |
|
|
\> {\em ; assumes {\tt R1}-{\tt R4} may be used and changed by any subroutine} \\
|
2665 |
|
|
memcpy\_loop: \\ % 12 clocks per loop
|
2666 |
202 |
dgisselq |
\> {\tt LB (R2),R4} \\
|
2667 |
199 |
dgisselq |
\> {\em ; (4 stalls, cannot be scheduled away)} \\
|
2668 |
202 |
dgisselq |
\> {\tt SB R4,(R1)} \> {\em ; (4 schedulable stalls, has no impact now)} \\
|
2669 |
199 |
dgisselq |
\> {\em ; Update our count of the number of remaining values to copy}\\
|
2670 |
|
|
\> {\tt SUB 1,R3} \> {\em ; This will be zero when we have copied our last}\\
|
2671 |
202 |
dgisselq |
\> {\tt RETN.Z} \> {\em ; + 4 stalls, if taken}\\
|
2672 |
199 |
dgisselq |
\> {\tt ADD 1,R1} \> {\em ; Increment the destination pointer }\\
\> {\tt ADD 1,R2} \> {\em ; Increment the source pointer }\\
|
2674 |
|
|
\> {\tt BRA memcpy\_loop} \\
|
2675 |
|
|
\> {\em ; (1 stall on a BRA instruction)} \\
|
2676 |
|
|
\end{tabbing}
|
2677 |
|
|
\caption{Example Memory Copy code in Zip Assembly, Unoptimized}\label{tbl:memcp-asm}
|
2678 |
36 |
dgisselq |
\end{center}\end{table}
|
2679 |
199 |
dgisselq |
This example points out several things associated with the ZipCPU. First,
|
2680 |
36 |
dgisselq |
a straightforward implementation of a for loop is not the fastest loop
|
2681 |
|
|
structure. For this reason, we have placed the test to continue at the
|
2682 |
202 |
dgisselq |
end. Second, notice that we can use {\tt R4} without storing it, since the
|
2683 |
|
|
C~ABI allows for subroutines to use {\tt R1}--{\tt R4} without saving them.
|
2684 |
|
|
This means that we can return from this subroutine using conditional jumps to
|
2685 |
199 |
dgisselq |
{\tt R0}.
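
For comparison, the same loop shape can be written in C. This is only a sketch of the structure used in Tbl.~\ref{tbl:memcp-asm}: one test on entry, then a loop whose continue test sits at the bottom. The name {\tt byte\_copy} is hypothetical, chosen to avoid clashing with the C library, and unlike the assembly this version also returns early for a negative length.

```c
#include <assert.h>
#include <stddef.h>

/* Byte copy with the continue test at the end of the loop. */
void byte_copy(char *dest, const char *src, int len)
{
    if (len <= 0)
        return;                 /* CMP 0,R3 / RETN.Z on entry */
    assert(dest != NULL && src != NULL);
    do {
        *dest++ = *src++;       /* LB/SB pair plus the two pointer increments */
    } while (--len != 0);       /* SUB 1,R3 / RETN.Z at the bottom */
}
```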
|
2686 |
36 |
dgisselq |
|
2687 |
199 |
dgisselq |
Still, there's more that could be done. Suppose we wished to use the pipeline
|
2688 |
|
|
bus capability? We might then write something closer to
|
2689 |
|
|
Tbl.~\ref{tbl:memcp-opt}.
|
2690 |
|
|
\begin{table}\begin{center}\scalebox{0.8}{\parbox{\textwidth}{%
|
2691 |
|
|
\begin{tabbing}
|
2692 |
|
|
% 27 cycles for 0
|
2693 |
|
|
% 52 cycles for 3
|
2694 |
|
|
{\em ; Upon entry, R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\
|
2695 |
|
|
{\em ; Achieves roughly $32+17\left\lfloor\frac{N}{4}\right\rfloor$ clocks,
|
2696 |
|
|
after the initial pipeline delay}\\
|
2697 |
|
|
memcpy\_opt: \\
|
2698 |
|
|
\hbox to 0.35in{}\=\hbox to 1.4in{\tt CMP 4,R3}\= {\em ; Check for short lengths, len $<$ 4}\\
|
2699 |
202 |
dgisselq |
\> {\tt BC \_memcpy\_finish} \> {\em ; Jump to the end if so}\\
|
2700 |
|
|
\hbox to 0.35in{}\=\hbox to 1.4in{\tt SUB 12,SP}\= {\em ; Otherwise, create a stack frame, storing the registers}\\
|
2701 |
|
|
\> {\tt SW R5,(SP)} \> {\em ; we will be using. Note that this is a pipelined store, so}\\
|
2702 |
|
|
\> {\tt SW R6,4(SP)} \> {\em ; subsequent stores only cost 1 clock.}\\
|
2703 |
|
|
\> {\tt SW R7,8(SP)}\\
|
2704 |
199 |
dgisselq |
\> {\tt ADD 4,R2} \> {\em ; Pre-Increment our pointers, for a 4-stage pipeline. This}\\
|
2705 |
|
|
\> {\tt ADD 4,R1} \> {\em ; also fills up 3 of the 4 stall states following the}\\
|
2706 |
|
|
\> {\tt SUB 5,R3} \> {\em ; stores. Also, leave {\tt R3} as the number left minus one.}\\
|
2707 |
202 |
dgisselq |
\> {\tt LB -4(R2),R4} \> {\em ; Load the first four values into }\\
|
2708 |
|
|
\> {\tt LB -3(R2),R5} \> {\em ; registers, using pipelined loads.}\\
|
2709 |
|
|
\> {\tt LB -2(R2),R6}\\
|
2710 |
|
|
\> {\tt LB -1(R2),R7}\\
|
2711 |
|
|
{\tt \_mcopy\_next\_four\_chars:} \>\> {\em ; Here's the top of our copy loop}\\
|
2712 |
|
|
\> {\tt SB R4,-4(R1)} \> {\em ; Store four values, using a burst memory operation.}\\
|
2713 |
|
|
\> {\tt SB R5,-3(R1)} \> {\em ; One clock for subsequent stores.}\\
|
2714 |
|
|
\> {\tt SB R6,-2(R1)} \> {\em ; None of these affect the flags, which were set when}\\
|
2715 |
|
|
\> {\tt SB R7,-1(R1)} \> {\em ; we last adjusted {\tt R3}}\\
|
2716 |
|
|
\> {\tt BC \_preend\_memcpy} \> {\em ; +4 stall cycles, but only when taken}\\
|
2717 |
199 |
dgisselq |
\> {\tt ADD 4,R1} \> {\em ; ALU ops don't stall during stores, so}\\
|
2718 |
|
|
\> {\tt ADD 4,R2} \> {\em ; increment our pointers here.} \\
|
2719 |
|
|
\> {\tt SUB 4,R3} \> {\em ; Calculate whether or not we have a next round}\\
|
2720 |
202 |
dgisselq |
\> {\tt LB -4(R2),R4} \> {\em ; Preload the values for the next round}\\
|
2721 |
|
|
\> {\tt LB -3(R2),R5}\> {\em ; Notice that these are also pipelined}\\
|
2722 |
|
|
\> {\tt LB -2(R2),R6}\> {\em ; loads, as before.}\\
|
2723 |
|
|
\> {\tt LB -1(R2),R7}\> {\em ; The four stall cycles, though, are concurrent w/ the branch.}\\
|
2724 |
|
|
\> {\tt BRA \_mcopy\_next\_four\_chars}\hspace{0.25in} {\em ; Early branching avoids the full memory pipeline stall} \\
|
2725 |
|
|
{\tt \_preend\_memcpy:}\\
|
2726 |
199 |
dgisselq |
\> {\tt ADD 1,R3} \>{\em ; R3 is now the remaining length, rather than one less than it}\\
|
2727 |
202 |
dgisselq |
\> {\tt LW (SP),R5} \> {\em ; Restore our saved registers, since the remainder of the routine}\\
|
2728 |
|
|
\> {\tt LW 4(SP),R6} \> {\em ; doesn't use these registers}\\
|
2729 |
|
|
\> {\tt LW 8(SP),R7} \> {\em ;}\\
|
2730 |
|
|
\> {\tt ADD 12,SP} \>{\em ; Adjust the stack pointer back to what it was}\\
|
2731 |
|
|
{\tt \_memcpy\_finish:}\>\>{\em ; At this point, there are $0\leq$ {\tt R3}$<4$ values left}\\
|
2732 |
199 |
dgisselq |
\> {\tt CMP 1,R3} \> {\em ; Check if any ops are remaining }\\
|
2733 |
202 |
dgisselq |
\> {\tt RETN.LT} \> {\em ; Return now if nothing is left}\\
|
2734 |
|
|
\> {\tt LB (R2),R4} \> {\em ; Load and store the first item}\\
\> {\tt SB R4,(R1)} \> {\em ;}\\
\> {\tt RETN.Z} \> {\em ; Return if that was our only value}\\
\> {\tt LB 1(R2),R4}\>{\em ; Load and store the second item (if necessary)} \\
\> {\tt SB R4,1(R1)}\\
\> {\tt CMP 2,R3} \> {\em ; Check whether a third item remains}\\
\> {\tt RETN.Z} \> {\em ; Return if there were only two values}\\
\> {\tt LB 2(R2),R4}\>{\em ; Load and store the third item (if necessary)} \\
\> {\tt SB R4,2(R1)}\>{\em ; {\tt LB}, {\tt SB}, {\tt RETN} will cost 10 cycles}\\
|
2743 |
|
|
\> {\tt RETN} \> {\em ; Finally, we return}\\
|
2744 |
199 |
dgisselq |
\end{tabbing}}}
|
2745 |
|
|
\caption{Example Memory Copy code in Zip Assembly, Hand Optimized}\label{tbl:memcp-opt}
|
2746 |
|
|
\end{center}\end{table}
|
2747 |
|
|
This pipeline memory example, though, provides some neat things to discuss
|
2748 |
|
|
about optimizing code using the ZipCPU.
|
2749 |
36 |
dgisselq |
|
2750 |
199 |
dgisselq |
First, note that all of the loads and stores, except the three following
|
2751 |
|
|
{\tt \_memcpy\_finish}, are pipelined. To do this, we needed to unroll the
|
2752 |
|
|
copy loop by a factor of four. This means that each time through the loop,
|
2753 |
|
|
we can read and store four values. At 17~clocks to copy four values, that's
|
2754 |
|
|
roughly three times faster than our previous example. The down side, though,
|
2755 |
|
|
is that the loop now needs a final cleanup section where the last 0--3 values
|
2756 |
|
|
will be copied.
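
The shape of the unrolled loop and its cleanup section may be easier to see in C. This sketch mirrors the structure only, not the register allocation or timing of Tbl.~\ref{tbl:memcp-opt}; {\tt copy\_by\_four} is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Copy in groups of four, then clean up the last 0..3 values. */
void copy_by_four(char *dest, const char *src, size_t len)
{
    assert(len == 0 || (dest != NULL && src != NULL));
    while (len >= 4) {
        /* Four loads, then four stores: each group can be issued as a burst */
        char a = src[0], b = src[1], c = src[2], d = src[3];
        dest[0] = a; dest[1] = b; dest[2] = c; dest[3] = d;
        dest += 4; src += 4; len -= 4;
    }
    while (len > 0) {           /* cleanup section: 0..3 values remain */
        *dest++ = *src++;
        len--;
    }
}
```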
|
2757 |
|
|
|
2758 |
|
|
Second, note that we used our remaining length minus one as our loop variable.
|
2759 |
|
|
This was done so that the conditions set by subtracting four from our loop
|
2760 |
|
|
variable could be used without a separate compare. Speaking of the compare,
|
2761 |
|
|
note that we have chosen to use a branch if carry ({\tt BC}) comparison, which
|
2762 |
|
|
is equivalent to a less--than unsigned comparison.
|
2763 |
|
|
|
2764 |
|
|
Third, notice how we packed four ALU instructions, two adds, a subtract, and a
|
2765 |
|
|
conditional branch, after the four store instructions. These instructions can
|
2766 |
|
|
complete while the memory unit is busy, preparing us to start the subsequent
|
2767 |
|
|
load without any further stalls (unless the memory is particularly slow.)
|
2768 |
|
|
|
2769 |
|
|
Next, you may wish to notice that the four memory loads within the loop are
|
2770 |
|
|
followed by the early branching instruction. As a result, the branch costs
|
2771 |
|
|
no extra clocks, and the time between the loads at the bottom of the loop
|
2772 |
|
|
is dominated by the load to store time frame.
|
2773 |
|
|
|
2774 |
|
|
Finally, notice how the comparison at the end has been stacked. By comparing
|
2775 |
|
|
against one, we can return when there are zero items left, or one item left,
|
2776 |
|
|
without needing a new comparison. Hence, zero to three separate values can be
|
2777 |
|
|
copied using only two compares.
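
The stacked comparison can be written out in C as follows. This is an illustrative sketch with a hypothetical name, {\tt copy\_tail}; the C source shows three tests, but the first two correspond to the less--than and zero results of a single compare against one.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the last 0..3 values left over after the unrolled loop. */
void copy_tail(char *dest, const char *src, int remaining)
{
    assert(remaining >= 0 && remaining < 4);
    if (remaining < 1)      /* first compare: less than one, nothing left */
        return;
    dest[0] = src[0];
    if (remaining == 1)     /* same compare, zero result: no new comparison */
        return;
    dest[1] = src[1];
    if (remaining == 2)     /* second (and last) compare */
        return;
    dest[2] = src[2];
}
```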
|
2778 |
|
|
|
2779 |
|
|
However, this discussion wouldn't be complete without an example of how
|
2780 |
|
|
this memory operation would be made even simpler using the direct memory
|
2781 |
202 |
dgisselq |
access controller. In that case, we can return to the C language with the
|
2782 |
|
|
code in Tbl.~\ref{tbl:memcp-dmac}.
|
2783 |
199 |
dgisselq |
\begin{table}\begin{center}
|
2784 |
|
|
\begin{tabbing}
|
2785 |
|
|
{\tt \#define DMACOPY 0x0fed0000} {\em // Copy memory, largest chunk at a time possible} \\
|
2786 |
|
|
\\
|
2787 |
|
|
{\tt void} \= {\tt memcpy\_dma(void *dest, void *src, int len) \{} \\
|
2788 |
|
|
\> {\em // This assumes we have access to the DMA, that the DMA is not}\\
|
2789 |
|
|
\> {\em // busy, and that we are running in system high mode ...}\\
|
2790 |
|
|
\> {\tt zip->dma.len = len;} \= {\em // Set up the DMA }\\
|
2791 |
|
|
\> {\tt zip->dma.rd = src;}\\
|
2792 |
|
|
\> {\tt zip->dma.wr = dest;}\\
|
2793 |
|
|
\> {\em // Command the DMA to start copying} \\
|
2794 |
|
|
\> {\tt zip->dma.ctrl= DMACOPY;} \\
|
2795 |
|
|
\> {\em // Note that we take two clocks to set up our PIC. This is }\\
|
2796 |
|
|
\> {\em // required because the PIC takes at least a clock cycle to clear.} \\
|
2797 |
|
|
\> {\tt zip->pic = DISABLEALL|SYSINT\_DMA;} \\
|
2798 |
|
|
\> {\em // Now that our PIC is actually clear, with no more DMA }\\
|
2799 |
|
|
\> {\em // interrupt within it, now we enable the DMA interrupt, and}\\
|
2800 |
|
|
\> {\em // only the DMA interrupt.}\\
|
2801 |
|
|
\> {\tt zip->pic = EINT(SYSINT\_DMA);}\\
|
2802 |
|
|
\> {\em // And wait for the DMA to complete.} \\
|
2803 |
|
|
\> {\tt zip\_wait();}\\
|
2804 |
|
|
{\tt \}}
|
2805 |
|
|
\end{tabbing}
|
2806 |
|
|
\caption{Example Memory Copy code using the DMA}\label{tbl:memcp-dmac}
|
2807 |
|
|
\end{center}\end{table}
|
2808 |
202 |
dgisselq |
The DMA, however, will only work with an integer number of 32--bit aligned
|
2809 |
|
|
words. Still, for large memory amounts, the cost of this approach will scale
|
2810 |
|
|
at roughly 2~clocks per word transferred.
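
Because of this restriction, a caller may wish to check its arguments before handing a copy to the DMA. The helper below is a sketch under the assumption that {\tt len} counts bytes; {\tt dma\_copy\_ok} is a hypothetical name, and whether the DMA length register itself counts bytes or words is not settled by this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when both pointers are 32-bit aligned and len (in bytes, by
 * assumption) is a whole number of 32-bit words. */
bool dma_copy_ok(const void *dest, const void *src, size_t len)
{
    return (((uintptr_t)dest | (uintptr_t)src | (uintptr_t)len) & 3u) == 0;
}
```

A copy that fails this test could fall back to one of the byte--oriented routines above.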
|
2811 |
199 |
dgisselq |
|
2812 |
|
|
Notice how much simpler this memory copy has become to write by using the DMA.
|
2813 |
|
|
But also consider, the system has only one direct memory access controller.
|
2814 |
|
|
What happens if one task tries to use the controller when it is already in use
|
2815 |
|
|
by another task? The result is that the direct memory access controller may
|
2816 |
|
|
need some special protections to make certain that only one task uses it at a
|
2817 |
|
|
time---much like any other hardware peripheral.
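
One simple form such protection might take is a busy flag that a task must claim before programming the controller. The sketch below uses the C11 {\tt atomic\_flag} type; the names are hypothetical, and it assumes a toolchain that provides {\tt stdatomic.h}, which this document does not establish for the ZipCPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One flag guards the single DMA controller (hypothetical names). */
static atomic_flag dma_busy = ATOMIC_FLAG_INIT;

/* Try to claim the DMA; returns true on success, false if already claimed. */
bool dma_try_acquire(void)
{
    return !atomic_flag_test_and_set(&dma_busy);
}

/* Release the DMA once the transfer has completed. */
void dma_release(void)
{
    atomic_flag_clear(&dma_busy);
}
```

A task would call {\tt dma\_try\_acquire()} before writing the DMA registers, and {\tt dma\_release()} once the transfer has completed.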
|
2818 |
|
|
|
2819 |
|
|
\section{Memset}
|
2820 |
|
|
Another example worth discussing is the {\tt memset()} library function.
|
2821 |
|
|
A straightforward implementation of this function in C might look like
|
2822 |
|
|
Tbl.~\ref{tbl:memset-c}.
|
2823 |
|
|
\begin{table}\begin{center}
|
2824 |
|
|
\begin{tabbing}
|
2825 |
202 |
dgisselq |
\hbox to 0.4in{\tt void} \= {\tt *memset(char *s, int c, size\_t n) \{} \\
|
2826 |
199 |
dgisselq |
\> {\tt for(size\_t i=0; i<n; i++)} \\
|
2827 |
|
|
\> \hspace{0.4in} {\tt s[i] = c;} \\
|
2828 |
|
|
\> {\tt return s;}\\
|
2829 |
|
|
{\tt \}}
|
2830 |
|
|
\end{tabbing}
|
2831 |
|
|
\caption{Example Memset code}\label{tbl:memset-c}
|
2832 |
|
|
\end{center}\end{table}
|
2833 |
|
|
The function is simple enough to compile by hand into the assembly listing in
|
2834 |
|
|
Tbl.~\ref{tbl:memset-unop}.
|
2835 |
|
|
\begin{table}\begin{center}
|
2836 |
|
|
\begin{tabbing}
|
2837 |
|
|
{\em ; Upon entry, R0 = return address, R1 = s, R2 = c, R3 = len}\\
|
2838 |
|
|
{\em ; Cost: Roughly $4+6N$ clocks}\\
|
2839 |
|
|
{\tt memset:}\\
|
2840 |
|
|
\hbox to 0.25in{}\=\hbox to 1in{\tt TST R3}\={\em ; Return immediately if len (R3) is zero}\\
|
2841 |
202 |
dgisselq |
\> {\tt RETN.Z}\\
|
2842 |
199 |
dgisselq |
\> {\tt MOV R1,R4} \> {\em ; Keep our return value in R1, use R4 as a local}\\
|
2843 |
|
|
{\tt memset\_loop:}\>\> {\em ; Here, we know we have at least one more to go}\\
|
2844 |
202 |
dgisselq |
\> {\tt SB R2,(R4)} \> {\em ; Store one value (no pipelining)} \\
|
2845 |
199 |
dgisselq |
\> {\tt SUB 1,R3} \> {\em; Subtract during the store}\\
|
2846 |
202 |
dgisselq |
\> {\tt RETN.Z} \> {\em; Return (during store) if all done}\\
|
2847 |
199 |
dgisselq |
\> {\tt ADD 1,R4} \> {\em; Otherwise increment our pointer}\\
|
2848 |
|
|
\> {\tt BRA memset\_loop} {\em ; and repeat}\\
|
2849 |
|
|
\end{tabbing}
|
2850 |
|
|
\caption{Example Memset code, minimally optimized}\label{tbl:memset-unop}
|
2851 |
|
|
\end{center}\end{table}
|
2852 |
|
|
Note that we grab {\tt R4} as a local variable, so that we can maintain the
|
2853 |
|
|
original pointer in {\tt R1} as our result upon return. This is valid, since the
|
2854 |
|
|
compiler assumes that {\tt R1}--{\tt R4} will be clobbered upon any function
|
2855 |
|
|
call and so they are not saved.
|
2856 |
|
|
|
2857 |
|
|
You can also see that this straightforward implementation costs about
|
2858 |
|
|
six clocks per value to be set.
|
2859 |
|
|
|
2860 |
|
|
Were we to pipeline the memory accesses, we might choose to unroll the loop
|
2861 |
|
|
and do something more like Tbl.~\ref{tbl:memset-pipe}.
|
2862 |
|
|
\begin{table}\begin{center}
|
2863 |
|
|
\begin{tabbing}
|
2864 |
|
|
{\em ; Upon entry, R0 = return address, R1 = s, R2 = c, R3 = len}\\
|
2865 |
|
|
{\em ; Cost: roughly $20+9\left\lfloor\frac{N}{4}\right\rfloor$}\\
|
2866 |
|
|
{\tt memset\_pipe:}\\
|
2867 |
|
|
\hbox to 0.25in{}\=\hbox to 0.6in{\tt MOV}\=\hbox to 1.0in{\tt R1,R4}\={\em ; Make a local copy of *s, so we can return R1}\\
|
2868 |
|
|
\> {\tt CMP}\>{\tt 4,R3}\>{\em ; Jump to non--unrolled section}\\
|
2869 |
|
|
\> {\tt JMP.C}\>{\tt memset\_pipe\_tail}\\
|
2870 |
|
|
\> {\tt SUB}\>{\tt 1,R3}\> {\em ; R3 is now one less than the number to finish}\\
|
2871 |
|
|
{\tt memset\_pipe\_unrolled:}\>\>\> {\em ; Here, we know we have at least four more to go}\\
|
2872 |
202 |
dgisselq |
\> {\tt SB}\>{\tt R2,(R4)} \> {\em ; Store our four values, pipelining our}\\
|
2873 |
|
|
\> {\tt SB}\>{\tt R2,1(R4)} \> {\em ; access across the bus }\\
|
2874 |
|
|
\> {\tt SB}\>{\tt R2,2(R4)} \\
|
2875 |
|
|
\> {\tt SB}\>{\tt R2,3(R4)} \\
|
2876 |
199 |
dgisselq |
\> {\tt SUB}\>{\tt 4,R3} \> {\em; If there are zero left, this will be a -1 result}\\
|
2877 |
202 |
dgisselq |
\> {\tt BC}\>{\tt prememset\_pipe\_tail}\> \hbox to 0.5in{}\= {\em ; So we can use our (unsigned) less--than condition}\\
|
2878 |
199 |
dgisselq |
\> {\tt ADD}\>{\tt 4,R4} \> {\em ; Otherwise increment our pointer}\\
|
2879 |
202 |
dgisselq |
\> {\tt BRA}\>{\tt memset\_pipe\_unrolled} {\em ; and repeat using an early branchable instruction}\\
|
2880 |
199 |
dgisselq |
{\tt prememset\_pipe\_tail:} \\
|
2881 |
|
|
\> {\tt ADD}\>{\tt 1,R3}\>{\em ; Restore {\tt R3} to the true count remaining}\\
|
2882 |
|
|
{\tt memset\_pipe\_tail:}\>\>\>{\em ; At this point, we have R3=0-3 remaining}\\
|
2883 |
|
|
\> {\tt CMP}\>{\tt 1,R3} \> {\em ; If there's less than one left}\\
|
2884 |
202 |
dgisselq |
\> {\tt RETN.C}\> \> {\em ; then return early.}\\
|
2885 |
|
|
\> {\tt SB}\>{\tt R2,(R4)} \> {\em ; If we've got one left, store it}\\
|
2886 |
|
|
\> {\tt SB.GT}\>{\tt R2,1(R4)} \> {\em ; if two, do a burst store}\\
|
2887 |
199 |
dgisselq |
\> {\tt CMP}\>{\tt 3,R3} \> {\em ; Check if we have another left}\\
|
2888 |
202 |
dgisselq |
\> {\tt SB.Z}\>{\tt R2,2(R4)} \> {\em ; and store it if so.}\\
|
2889 |
|
|
\> {\tt RETN}\> \> {\em ; Return now that we are complete.}
|
2890 |
199 |
dgisselq |
\end{tabbing}
|
2891 |
|
|
\caption{Example Memset after loop unrolling, using pipelined memory ops}\label{tbl:memset-pipe}
|
2892 |
|
|
\end{center}\end{table}
|
2893 |
|
|
Note that in this example, as in the {\tt memcpy} example, our loop variable
is one less than the number of operations remaining.  This is because the
ZipCPU has no less--than--or--equal comparison, only a less--than comparison.
With one subtracted from the loop variable, a less--than comparison is all
the loop needs.  At the end of the loop we branch to
{\tt prememset\_pipe\_tail}, one instruction ahead of the tail code, which
adds one back to restore the true remaining count.
|
2899 |
199 |
dgisselq |
|
2900 |
|
|
You may also notice that, despite the four possibilities in the end game, we
can carefully arrange the logic to use only two compares.  The first compare
tests for fewer than one store remaining, and returns if none are left.  The
same compare also tells us whether more than one store remains.
Hence, we can create a burst memory operation with one or two stores.
|
2905 |
|
|
|
2906 |
202 |
dgisselq |
The three examples given so far discuss and demonstrate solutions appropriate
|
2907 |
|
|
for memory accesses that are not necessarily aligned. Were the accesses
|
2908 |
|
|
aligned, the operation could be done about four times faster. To do this,
|
2909 |
|
|
the {\tt LB} and {\tt SB} instructions would need to be replaced by {\tt LW}
|
2910 |
|
|
and {\tt SW} instructions.
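
As a rough illustration of the aligned case, the C sketch below fills memory
one word at a time.  It is not taken from the ZipCPU sources: the name
{\tt memset\_words} and its interface are assumptions made for this example,
which presumes a word--aligned destination and a length given in 32--bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any ZipCPU library: fill nwords
 * aligned 32-bit words with the byte value c replicated four times. */
uint32_t *memset_words(uint32_t *dst, int c, size_t nwords)
{
	/* Replicate the low byte of c into all four byte lanes. */
	uint32_t fill = (uint32_t)(unsigned char)c * 0x01010101u;

	/* Precondition for this sketch: dst is word aligned. */
	assert(((uintptr_t)dst & 3u) == 0);

	for (size_t i = 0; i < nwords; i++)
		dst[i] = fill;	/* one word store replaces four byte stores */
	return dst;
}
```

Each iteration issues one store where the byte version issues four, which is
where the roughly fourfold gain comes from.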
|
2911 |
|
|
|
2912 |
|
|
If all accesses can be aligned, we might also use the
DMA for this operation.  The DMA version makes our final example, shown in
Tbl.~\ref{tbl:memset-dma}.
|
2915 |
|
|
\begin{table}\begin{center}
|
2916 |
|
|
\begin{tabbing}
|
2917 |
|
|
{\tt \#define DMA\_CONSTSRC 0x20000000} {\em // Don't increment the source address}
|
2918 |
|
|
\\
|
2919 |
202 |
dgisselq |
{\tt int *} \= {\tt memset\_dma(int *s, int c, size\_t len) \{} \\
|
2920 |
199 |
dgisselq |
\> {\em // As before, this assumes we have access to the DMA, and that}\\
|
2921 |
|
|
\> {\em // we are running in system high mode ...}\\
|
2922 |
|
|
\> {\tt zip->dma.len = len;} \= {\em // Set up the DMA }\\
|
2923 |
|
|
\> {\tt zip->dma.rd = \&c;}\\
|
2924 |
|
|
\> {\tt zip->dma.wr = s;}\\
|
2925 |
|
|
\> {\em // Command the DMA to start copying, but not to increment the} \\
|
2926 |
|
|
\> {\em // source address during the copy.} \\
|
2927 |
|
|
\> {\tt zip->dma.ctrl= DMACOPY|DMA\_CONSTSRC;} \\
|
2928 |
|
|
\> {\em // Note that we take two clocks to set up our PIC. This is }\\
|
2929 |
|
|
\> {\em // required because the PIC takes at least a clock cycle to clear.} \\
|
2930 |
|
|
\> {\tt zip->pic = DISABLEALL|SYSINT\_DMA;} \\
|
2931 |
|
|
\> {\em // Now that our PIC is actually clear, with no more DMA }\\
|
2932 |
|
|
\> {\em // interrupt within it, now we enable the DMA interrupt, and}\\
|
2933 |
|
|
\> {\em // only the DMA interrupt.}\\
|
2934 |
|
|
\> {\tt zip->pic = EINT(SYSINT\_DMA);}\\
|
2935 |
|
|
\> {\em // And wait for the DMA to complete.} \\
|
2936 |
|
|
\> {\tt zip\_wait();}\\
|
2937 |
202 |
dgisselq |
\> {\em // Return the original destination pointer, so as to} \\
|
2938 |
|
|
\> {\em // match the library definition.} \\
|
2939 |
|
|
\> {\tt return s;}\\
|
2940 |
199 |
dgisselq |
{\tt \}}
|
2941 |
|
|
\end{tabbing}
|
2942 |
|
|
\caption{Example Memset code, only this time with the DMA}\label{tbl:memset-dma}
|
2943 |
|
|
\end{center}\end{table}
|
2944 |
|
|
This is almost identical to the {\tt memcpy} function above that used the
DMA, save that the pointer for the value read is the address
of {\tt c}, and that the DMA is instructed not to increment its source pointer.
The DMA will still do {\tt len} reads, so the asymptotic cost will never
be less than $2N$ clocks for a transfer of $N$ words.
|
2949 |
|
|
|
2950 |
|
|
|
2951 |
|
|
\section{Context Switch}
|
2952 |
|
|
|
2953 |
36 |
dgisselq |
Fundamental to any multiprocessing system is the ability to switch from one
|
2954 |
199 |
dgisselq |
task to the next. In the ZipSystem, this is accomplished in one of a couple of
|
2955 |
|
|
ways. The first step is that an interrupt, trap, or exception takes place.
|
2956 |
|
|
This will pull the CPU out of user mode and into supervisor mode. At this
|
2957 |
|
|
point, the CPU needs to execute the following tasks:
|
2958 |
36 |
dgisselq |
\begin{enumerate}
|
2959 |
199 |
dgisselq |
\item Check why we returned from user mode.  Did the user
execute a trap instruction, or did some other user exception occur, such as a
break, bus error, division by zero error, or floating point exception?
If the user process itself needs attention, then we may not
wish to adjust the context, check interrupts, or call the scheduler.
|
2964 |
69 |
dgisselq |
Tbl.~\ref{tbl:trap-check}
|
2965 |
36 |
dgisselq |
\begin{table}\begin{center}
|
2966 |
199 |
dgisselq |
\begin{tabbing}
|
2967 |
|
|
{\tt while(true) \{} \\
|
2968 |
|
|
\hbox to 0.25in{}\={\em // The instruction before the context switch processing must} \\
|
2969 |
|
|
\> {\em // be the RTU instruction that enacted user mode in the first} \\
|
2970 |
|
|
\> {\em // place. We show it here just for reference.} \\
|
2971 |
|
|
\> {\tt zip\_rtu();} \\
|
2972 |
|
|
\\
|
2973 |
|
|
\> {\tt if (zip\_ucc() \& (CC\_FAULT)) \{} \\
|
2974 |
|
|
\> \hbox to 0.25in{}\={\em // The user program has experienced an unrecoverable fault and must die.}\\
|
2975 |
|
|
\>\> {\em // Do something here to kill the task, recover any resources} \\
|
2976 |
|
|
\>\> {\em // it was using, and report/record the problem.}\\
|
2977 |
|
|
\>\> \ldots \\
|
2978 |
|
|
\> {\tt \} else if (zip\_ucc() \& (CC\_TRAPBIT)) \{} \\
|
2979 |
|
|
\>\> {\em // Handle any user request} \\
|
2980 |
|
|
\>\> {\tt zip\_restore\_context(userregs);} \\
|
2981 |
|
|
\>\> {\em // If the request ID is in uR1, that is now userregs[1]}\\
|
2982 |
|
|
\>\> {\tt switch(userregs[1]) \{} \\
|
2983 |
|
|
\>\> {\tt case $x$:} {\em // Perform some user requested function} \\
|
2984 |
|
|
\>\> \hbox to 0.25in{}\= {\tt break;}\\
|
2985 |
|
|
\>\> {\tt \}} \\
|
2986 |
|
|
\> {\tt \}}\\
|
2987 |
|
|
\\
|
2988 |
|
|
{\tt \}}
|
2989 |
|
|
\end{tabbing}
|
2990 |
69 |
dgisselq |
\caption{Checking for whether the user task needs our attention}\label{tbl:trap-check}
|
2991 |
36 |
dgisselq |
\end{center}\end{table}
|
2992 |
|
|
shows the rudiments of this code, while showing nothing of how the
|
2993 |
|
|
actual trap would be implemented.
|
2994 |
|
|
|
2995 |
|
|
You may also wish to note that the instruction before the first instruction
|
2996 |
|
|
in our context swap {\em must be} a return to userspace instruction.
|
2997 |
|
|
Remember, the supervisor process is re--entered where it left off. This is
|
2998 |
|
|
different from many other processors that enter interrupt mode at some vector
|
2999 |
|
|
or other. In this case, we always enter supervisor mode right where we last
|
3000 |
199 |
dgisselq |
left.
|
3001 |
36 |
dgisselq |
|
3002 |
199 |
dgisselq |
\item Capture user accounting counters. If the operating system is keeping
|
3003 |
|
|
track of system usage via the accounting counters, those counters need
|
3004 |
|
|
to be copied and accumulated into some master counter at this point.
|
3005 |
36 |
dgisselq |
|
3006 |
199 |
dgisselq |
\item Preserve the old context. This involves recording all of the user
|
3007 |
|
|
registers to some supervisor memory structure, such as is shown in
|
3008 |
|
|
Tbl.~\ref{tbl:context-out}.
|
3009 |
36 |
dgisselq |
\begin{table}\begin{center}
|
3010 |
199 |
dgisselq |
\begin{tabbing}
|
3011 |
|
|
{\tt save\_context:} \\
|
3012 |
202 |
dgisselq |
\hbox to 0.25in{}\={\tt SUB 4,SP}\hbox to 0.5in{}\= {\em ; Function prologue: create a stack}\\
|
3013 |
|
|
\> {\tt SW R5,(SP)} \> {\em ; frame and save R5. (R1--R4 are assumed}\\
\> {\tt MOV uR0,R2} \> {\em ; to be free for use, needing no save.) Then}\\
|
3015 |
|
|
\> {\tt MOV uR1,R3} \> {\em ; copy the user registers, four at a time to }\\
|
3016 |
|
|
\> {\tt MOV uR2,R4} \> {\em ; supervisor registers, where they can be}\\
|
3017 |
|
|
\> {\tt MOV uR3,R5} \> {\em ; stored, while exploiting memory pipelining}\\
|
3018 |
202 |
dgisselq |
\> {\tt SW R2,(R1)} \>{\em ; Exploit memory pipelining: }\\
|
3019 |
|
|
\> {\tt SW R3,4(R1)} \>{\em ; All instructions write to same base memory}\\
|
3020 |
|
|
\> {\tt SW R4,8(R1)} \>{\em ; All offsets increment by one word (four bytes)}\\
|
3021 |
|
|
\> {\tt SW R5,12(R1)} \\
|
3022 |
199 |
dgisselq |
\> \ldots {\em ; Need to repeat for all user registers} \\
|
3023 |
|
|
\> {\tt MOV uR12,R2} \> {\em ; Finish copying ... } \\
|
3024 |
|
|
\> {\tt MOV uSP,R3} \\
|
3025 |
|
|
\> {\tt MOV uCC,R4} \\
|
3026 |
|
|
\> {\tt MOV uPC,R5} \\
|
3027 |
202 |
dgisselq |
\> {\tt SW R2,48(R1)} \> {\em ; and saving the last registers.}\\
|
3028 |
|
|
\> {\tt SW R3,52(R1)} \> {\em ; Note that even the special user registers }\\
|
3029 |
|
|
\> {\tt SW R4,56(R1)} \> {\em ; are saved just like any others. }\\
|
3030 |
|
|
\> {\tt SW R5,60(R1)} \\
|
3031 |
|
|
\> {\tt LW (SP),R5} \> {\em ; Restore our one saved register}\\
|
3032 |
|
|
\> {\tt ADD 4,SP} \> {\em ; our stack frame,} \\
|
3033 |
|
|
\> {\tt RETN} \> {\em ; and return }\\
|
3034 |
199 |
dgisselq |
\end{tabbing}
|
3035 |
36 |
dgisselq |
\caption{Example Storing User Task Context}\label{tbl:context-out}
|
3036 |
|
|
\end{center}\end{table}
|
3037 |
199 |
dgisselq |
Since this task is so fundamental, the ZipCPU compiler back end provides
|
3038 |
|
|
the {\tt zip\_save\_context(int *)} function.
|
3039 |
36 |
dgisselq |
|
3040 |
|
|
\item Reset the watchdog timer.  If you are using the watchdog timer, it should
be reset on a context swap, to show that things are still working.
|
3042 |
|
|
|
3043 |
199 |
dgisselq |
\item Interrupt handling.  How you handle interrupts on the ZipCPU is up to
you.  You can activate a sleeping task if you like, or for smaller,
faster interrupt routines, such as copying a character to or from a
serial port or providing a sample to an audio port, you might choose
to do the task within the kernel main loop.  The difference may
depend upon how you have your hardware set up, and how fast the
kernel main loop is.
|
3050 |
36 |
dgisselq |
|
3051 |
|
|
\item Calling the scheduler. This needs to be done to pick the next task
|
3052 |
|
|
to switch to. It may be an interrupt handler, or it may be a normal
|
3053 |
|
|
user task. From a priority standpoint, it would make sense that the
|
3054 |
|
|
interrupt handlers all have a higher priority than the user tasks,
|
3055 |
|
|
and that once they have been called the user tasks may then be called
|
3056 |
|
|
again. If no task is ready to run, run the idle task to wait for an
|
3057 |
|
|
interrupt.
|
3058 |
|
|
|
3059 |
|
|
This suggests a minimum of four task priorities:
|
3060 |
|
|
\begin{enumerate}
|
3061 |
|
|
\item Interrupt handlers, executed with their interrupts disabled
|
3062 |
|
|
\item Device drivers, executed with interrupts re-enabled
|
3063 |
|
|
\item User tasks
|
3064 |
|
|
\item The idle task, executed when nothing else is able to execute
|
3065 |
|
|
\end{enumerate}
|
3066 |
|
|
|
3067 |
199 |
dgisselq |
% For our purposes here, we'll just assume that a pointer to the current
|
3068 |
|
|
% task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
|
3069 |
|
|
% called, and that the next current task is likewise placed into
|
3070 |
|
|
% {\tt R12}.
|
3071 |
36 |
dgisselq |
|
3072 |
|
|
\item Restore the new task's context.  Given that the scheduler has returned a
task that can be run at this time, the saved register values need to be
read from the memory at the user context pointer and then placed into
the user registers.  An example of this is shown in
Tbl.~\ref{tbl:context-in}.
|
3077 |
36 |
dgisselq |
\begin{table}\begin{center}
|
3078 |
199 |
dgisselq |
\begin{tabbing}
|
3079 |
|
|
{\tt restore\_context:} \\
|
3080 |
202 |
dgisselq |
\hbox to 0.25in{}\= {\tt SUB 4,SP}\hbox to 0.4in{}\={\em ; Set up a stack frame} \\
|
3081 |
|
|
\> {\tt SW R5,(SP)} \> {\em ; and store a local register onto it.}\\
|
3082 |
199 |
dgisselq |
\\
|
3083 |
202 |
dgisselq |
\> {\tt LW (R1),R2} \> {\em ; By doing four loads at a time, we are }\\
|
3084 |
|
|
\> {\tt LW 4(R1),R3} \> {\em ; making sure we are using our pipelined}\\
|
3085 |
|
|
\> {\tt LW 8(R1),R4} \> {\em ; memory capability. }\\
|
3086 |
|
|
\> {\tt LW 12(R1),R5} \\
|
3087 |
199 |
dgisselq |
\> {\tt MOV R2,uR0} \> {\em ; Once the registers are loaded, copy them }\\
\> {\tt MOV R3,uR1} \> {\em ; into the user registers that they need to}\\
\> {\tt MOV R4,uR2} \> {\em ; be placed within.} \\
\> {\tt MOV R5,uR3} \\
|
3091 |
|
|
\> \ldots {\em ; Need to repeat for all user registers} \\
|
3092 |
202 |
dgisselq |
\> {\tt LW 48(R1),R2} \> {\em ; Now for our last four registers ...}\\
|
3093 |
|
|
\> {\tt LW 52(R1),R3} \\
\> {\tt LW 56(R1),R4} \\
\> {\tt LW 60(R1),R5} \\
|
3096 |
199 |
dgisselq |
\> {\tt MOV R2,uR12} \> {\em ; These are the special purpose ones, restored }\\
|
3097 |
|
|
\> {\tt MOV R3,uSP} \> {\em ; just like any others.}\\
|
3098 |
|
|
\> {\tt MOV R4,uCC} \\
|
3099 |
|
|
\> {\tt MOV R5,uPC} \\
|
3100 |
39 |
dgisselq |
|
3101 |
202 |
dgisselq |
\> {\tt LW (SP),R5} \> {\em ; Restore our saved register, } \\
|
3102 |
|
|
\> {\tt ADD 4,SP} \> {\em ; and the stack frame, }\\
|
3103 |
|
|
\> {\tt RETN} \> {\em ; and return to where we were called from.}\\
|
3104 |
199 |
dgisselq |
\end{tabbing}
|
3105 |
36 |
dgisselq |
\caption{Example Restoring User Task Context}\label{tbl:context-in}
|
3106 |
|
|
\end{center}\end{table}
|
3107 |
199 |
dgisselq |
Because this is such an important task, the ZipCPU GCC provides a
|
3108 |
|
|
built--in function, {\tt zip\_restore\_context(int *)}, which can be
|
3109 |
|
|
used for this task.
|
3110 |
36 |
dgisselq |
|
3111 |
|
|
\item Clear the userspace accounting registers.  In order to keep track of
per--process system usage, these registers need to be cleared before
reactivating the userspace process.  That way, upon the next
interrupt, we'll know how many clocks the userspace program has
consumed, and how many instructions it was able to issue in
that many clocks.
|
3117 |
|
|
|
3118 |
199 |
dgisselq |
\item Return back to the top of our loop in order to execute {\tt zip\_rtu()}
|
3119 |
|
|
again.
|
3120 |
36 |
dgisselq |
\end{enumerate}
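
The scheduler step above can be sketched in C as a simple priority scan.
This is only an illustration of the four suggested priority classes: the
{\tt task} structure, the {\tt task\_class} names, and {\tt pick\_next()} are
all hypothetical, and a real scheduler would keep considerably more state.

```c
#include <assert.h>
#include <stddef.h>

/* Priority classes from the list above; a lower value runs first. */
enum task_class { TASK_IRQ_HANDLER, TASK_DRIVER, TASK_USER, TASK_IDLE };

/* Hypothetical task record, for this sketch only. */
struct task {
	enum task_class cls;
	int ready;		/* nonzero if the task may run now */
};

/* Return the index of the ready task with the highest priority.
 * The idle task is assumed to be present and always ready. */
size_t pick_next(const struct task *tasks, size_t ntasks)
{
	size_t best = ntasks;	/* ntasks means "none found yet" */

	for (size_t i = 0; i < ntasks; i++) {
		if (!tasks[i].ready)
			continue;
		if (best == ntasks || tasks[i].cls < tasks[best].cls)
			best = i;
	}
	/* Precondition: at least one task (the idle task) was ready. */
	assert(best < ntasks);
	return best;
}
```

Ties within a class go to the task listed first, which is an arbitrary
choice made to keep the sketch short.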
|
3121 |
|
|
|
3122 |
199 |
dgisselq |
|
3123 |
21 |
dgisselq |
\chapter{Registers}\label{chap:regs}
|
3124 |
199 |
dgisselq |
This chapter covers the definitions and locations of the various registers
|
3125 |
|
|
associated with both the ZipSystem, and the ZipCPU contained within it.
|
3126 |
|
|
These registers fall into two separate categories: the registers belonging
|
3127 |
|
|
to the ZipSystem, and then the two debug port registers belonging to the CPU
|
3128 |
|
|
itself. In this chapter, we'll discuss the ZipSystem peripheral registers
|
3129 |
|
|
first, followed by the two ZipCPU registers.
|
3130 |
21 |
dgisselq |
|
3131 |
199 |
dgisselq |
|
3132 |
|
|
\section{ZipSystem Peripheral Registers}
|
3133 |
|
|
The ZipSystem currently maintains 20 register locations, as shown
in Tbl.~\ref{tbl:zpregs}.
|
3135 |
24 |
dgisselq |
\begin{table}[htbp]
|
3136 |
|
|
\begin{center}\begin{reglist}
|
3137 |
202 |
dgisselq |
PIC & \scalebox{0.8}{\tt 0xff000000} & 32 & R/W & Primary Interrupt Controller \\\hline
|
3138 |
|
|
WDT & \scalebox{0.8}{\tt 0xff000004} & 32 & R/W & Watchdog Timer \\\hline
|
3139 |
|
|
WBU &\scalebox{0.8}{\tt 0xff000008} & 32 & R & Address of last bus timeout error\\\hline
|
3140 |
|
|
CTRIC & \scalebox{0.8}{\tt 0xff00000c} & 32 & R/W & Secondary Interrupt Controller \\\hline
|
3141 |
|
|
TMRA & \scalebox{0.8}{\tt 0xff000010} & 32 & R/W & Timer A\\\hline
|
3142 |
|
|
TMRB & \scalebox{0.8}{\tt 0xff000014} & 32 & R/W & Timer B\\\hline
|
3143 |
|
|
TMRC & \scalebox{0.8}{\tt 0xff000018} & 32 & R/W & Timer C\\\hline
|
3144 |
|
|
JIFF & \scalebox{0.8}{\tt 0xff00001c} & 32 & R/W & Jiffies \\\hline
|
3145 |
|
|
MTASK & \scalebox{0.8}{\tt 0xff000020} & 32 & R/W & Master Task Clock Counter \\\hline
|
3146 |
|
|
MMSTL & \scalebox{0.8}{\tt 0xff000024} & 32 & R/W & Master Stall Counter \\\hline
|
3147 |
|
|
MPSTL & \scalebox{0.8}{\tt 0xff000028} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
|
3148 |
|
|
MICNT & \scalebox{0.8}{\tt 0xff00002c} & 32 & R/W & Master Instruction Counter\\\hline
|
3149 |
|
|
UTASK & \scalebox{0.8}{\tt 0xff000030} & 32 & R/W & User Task Clock Counter \\\hline
|
3150 |
|
|
UMSTL & \scalebox{0.8}{\tt 0xff000034} & 32 & R/W & User Stall Counter \\\hline
|
3151 |
|
|
UPSTL & \scalebox{0.8}{\tt 0xff000038} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
|
3152 |
|
|
UICNT & \scalebox{0.8}{\tt 0xff00003c} & 32 & R/W & User Instruction Counter\\\hline
|
3153 |
|
|
DMACTRL& \scalebox{0.8}{\tt 0xff000040} & 32 & R/W & DMA Control Register\\\hline
|
3154 |
|
|
DMALEN & \scalebox{0.8}{\tt 0xff000044} & 32 & R/W & DMA total transfer length\\\hline
|
3155 |
|
|
DMASRC & \scalebox{0.8}{\tt 0xff000048} & 32 & R/W & DMA source address\\\hline
|
3156 |
|
|
DMADST & \scalebox{0.8}{\tt 0xff00004c} & 32 & R/W & DMA destination address\\\hline
|
3157 |
24 |
dgisselq |
\end{reglist}
|
3158 |
199 |
dgisselq |
\caption{ZipSystem Internal/Peripheral Registers}\label{tbl:zpregs}
|
3159 |
24 |
dgisselq |
\end{center}\end{table}
|
3160 |
202 |
dgisselq |
These registers are all 32--bit registers.  Writes of fewer than 32~bits
may have unexpected results.  Further, they are located in a reserved location
within the CPU's address space.  As a result, references to these locations
by a ZipBones based system will generate a bus error.
|
3164 |
24 |
dgisselq |
|
3165 |
199 |
dgisselq |
Here in this section, we'll walk through the definition of each of these
|
3166 |
|
|
registers in turn, together with any bit fields that may be associated with
|
3167 |
|
|
them, and how to set those fields.
|
3168 |
24 |
dgisselq |
|
3169 |
69 |
dgisselq |
\subsection{Interrupt Controller(s)}
|
3170 |
199 |
dgisselq |
Any CPU with only a single interrupt line, such as the ZipCPU, really needs an
interrupt controller to give it access to more than one interrupt source.
When the ZipCPU is built as part of the ZipSystem,
this interrupt controller comes integrated into the system.
|
3174 |
|
|
|
3175 |
|
|
Looking into the bits that define this controller, you can see from
|
3176 |
|
|
Tbl.~\ref{tbl:picbits},
|
3177 |
33 |
dgisselq |
\begin{table}\begin{center}
|
3178 |
|
|
\begin{bitlist}
|
3179 |
|
|
31 & R/W & Master Interrupt Enable\\\hline
|
3180 |
199 |
dgisselq |
30\ldots 16 & R/W & Interrupt Enable lines\\\hline
|
3181 |
33 |
dgisselq |
15 & R & Current Master Interrupt State\\\hline
|
3182 |
69 |
dgisselq |
14\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline
|
3183 |
33 |
dgisselq |
\end{bitlist}
|
3184 |
|
|
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
|
3185 |
|
|
\end{center}\end{table}
|
3186 |
199 |
dgisselq |
that the ZipCPU Interrupt controller has four different types of bits.
|
3187 |
33 |
dgisselq |
The high order bit, or bit--31, is the master interrupt enable bit. When this
|
3188 |
|
|
bit is set, then any time an interrupt occurs the CPU will be interrupted and
|
3189 |
|
|
will switch to supervisor mode, etc.
|
3190 |
|
|
|
3191 |
199 |
dgisselq |
Bits 30~\ldots 16 are interrupt enable bits.  Should the interrupt line ever be
high while enabled, an interrupt will be generated.  Further, interrupts are
level triggered.  Hence, if the interrupt is cleared while the line feeding
the controller remains high, then the interrupt will re--trip.  To set one of
these interrupt enable bits, write a `1' to the master interrupt enable bit
while also writing a `1' to the enable bit in question.  To clear an enable
bit, write a `0' to the master interrupt enable bit while writing a `1' to
the enable bit to be cleared.
|
3198 |
33 |
dgisselq |
|
3199 |
|
|
Bits 14\ldots 0 are the current state of the interrupt vector.  Interrupt lines
trip whenever they are high, and remain tripped until the input is lowered and
the interrupt is acknowledged.  Thus, if the interrupt line is high when the
controller receives a clear request, then the interrupt will not clear.
The incoming line must go low again before the status bit can be cleared.
|
3204 |
33 |
dgisselq |
|
3205 |
199 |
dgisselq |
As an example, consider the following scenario where the ZipCPU supports four
|
3206 |
33 |
dgisselq |
interrupts, 3\ldots0.
|
3207 |
|
|
\begin{enumerate}
|
3208 |
|
|
\item The Supervisor will first, while in the interrupts disabled mode,
|
3209 |
|
|
write a {\tt 32'h800f000f} to the controller. The supervisor may then
|
3210 |
|
|
switch to the user state with interrupts enabled.
|
3211 |
|
|
\item When an interrupt occurs, the supervisor will switch to the interrupt
|
3212 |
|
|
state. It will then cycle through the interrupt bits to learn which
|
3213 |
|
|
interrupt handler to call.
|
3214 |
|
|
\item If the interrupt handler expects more interrupts, it will clear its
|
3215 |
|
|
current interrupt when it is done handling the interrupt in question.
|
3216 |
69 |
dgisselq |
To do this, it will write a `1' to the low order interrupt mask,
|
3217 |
|
|
such as writing a {\tt 32'h0000\_0001}.
|
3218 |
33 |
dgisselq |
\item If the interrupt handler does not expect any more interrupts, it will
|
3219 |
|
|
instead clear the interrupt from the controller by writing a
|
3220 |
69 |
dgisselq |
{\tt 32'h0001\_0001} to the controller.
|
3221 |
33 |
dgisselq |
\item Once all interrupts have been handled, the supervisor will write a
|
3222 |
69 |
dgisselq |
{\tt 32'h8000\_0000} to the interrupt register to re-enable interrupt
|
3223 |
33 |
dgisselq |
generation.
|
3224 |
|
|
\item The supervisor should also check the user trap bit, and possible soft
|
3225 |
|
|
interrupt bits here, but this action has nothing to do with the
|
3226 |
|
|
interrupt control register.
|
3227 |
|
|
\item The supervisor will then leave interrupt mode, possibly adjusting
|
3228 |
|
|
whichever task is running, by executing a return from interrupt
|
3229 |
|
|
command.
|
3230 |
|
|
\end{enumerate}
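
The command words used in this scenario follow directly from the bit layout
in Tbl.~\ref{tbl:picbits}, and can be built with a few helpers such as the
ones sketched below.  The helper names are hypothetical and are not part of
any ZipCPU header.  The sketch assumes at most 15 interrupt lines, matching
the 15 enable bits in positions 30 down to 16.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u	/* bit 31 */

/* Command word that sets the master enable together with the enable
 * bits for the interrupt lines selected in mask. */
uint32_t pic_enable(uint32_t mask)
{
	assert((mask & ~0x7fffu) == 0);	/* assumed: 15 lines at most */
	return PIC_MASTER_ENABLE | (mask << 16);
}

/* Command word that clears the enable bits selected in mask.  Bit 31
 * is written as zero, so the master enable is left cleared as well. */
uint32_t pic_disable(uint32_t mask)
{
	assert((mask & ~0x7fffu) == 0);
	return mask << 16;
}

/* Command word that acknowledges (clears) the tripped lines in mask. */
uint32_t pic_ack(uint32_t mask)
{
	assert((mask & ~0x7fffu) == 0);
	return mask;
}
```

With these, the initial write in the scenario is
{\tt pic\_enable(0xf) | pic\_ack(0xf)}, and the write that retires
interrupt zero for good is {\tt pic\_disable(0x1) | pic\_ack(0x1)}.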
|
3231 |
|
|
|
3232 |
199 |
dgisselq |
\subsection{Timer Register}\label{sec:reg-timer}
|
3233 |
69 |
dgisselq |
|
3234 |
33 |
dgisselq |
Leaving the interrupt controller, we show the timer register's bit definitions
in Tbl.~\ref{tbl:tmrbits}.
|
3236 |
|
|
\begin{table}\begin{center}
|
3237 |
|
|
\begin{bitlist}
|
3238 |
|
|
31 & R/W & Auto-Reload\\\hline
|
3239 |
|
|
30\ldots 0 & R/W & Current timer value\\\hline
|
3240 |
|
|
\end{bitlist}
|
3241 |
|
|
\caption{Timer Register Bits}\label{tbl:tmrbits}
|
3242 |
|
|
\end{center}\end{table}
|
3243 |
|
|
As you may recall, the timer just counts down to zero and then trips an
|
3244 |
|
|
interrupt. Writing to the current timer value sets that value, and reading
|
3245 |
|
|
from it returns that value. Writing to the current timer value while also
|
3246 |
|
|
setting the auto--reload bit will send the timer into an auto--reload mode.
|
3247 |
|
|
In this mode, upon setting its interrupt bit for one cycle, the timer will
also reset itself to the value that was written to it together with
the auto--reload bit.  To clear and stop the timer,
|
3250 |
|
|
just simply write a `32'h00' to this register.
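
To make the auto--reload behavior concrete, here is a small software model
of one timer written in C.  It is a sketch of the description above, not of
the hardware itself: the structure and function names are hypothetical, and
the exact clock on which the interrupt trips relative to the count reaching
zero is an assumption of the model.

```c
#include <assert.h>
#include <stdint.h>

#define TMR_AUTO_RELOAD 0x80000000u	/* bit 31 */
#define TMR_VALUE_MASK  0x7fffffffu	/* bits 30..0 */

/* Software model of one timer; the field names are hypothetical. */
struct timer_model {
	uint32_t value;		/* current count, bits 30..0 */
	uint32_t reload;	/* value to reload, or 0 if not reloading */
};

/* Model a write to the timer register. */
void timer_write(struct timer_model *t, uint32_t word)
{
	assert(t);
	t->value = word & TMR_VALUE_MASK;
	t->reload = (word & TMR_AUTO_RELOAD) ? t->value : 0;
}

/* Advance one clock.  Returns 1 on the clock the interrupt trips. */
int timer_tick(struct timer_model *t)
{
	assert(t);
	if (t->value == 0)
		return 0;		/* a zero count means stopped */
	if (--t->value != 0)
		return 0;		/* still counting down */
	t->value = t->reload;		/* restart if auto-reload was set */
	return 1;
}
```

Writing zero through {\tt timer\_write()} clears both fields, which matches
the rule that a write of zero stops the timer.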
|
3251 |
|
|
|
3252 |
199 |
dgisselq |
\subsection{Jiffies}\label{sec:reg-jiffies}
|
3253 |
69 |
dgisselq |
|
3254 |
199 |
dgisselq |
The Jiffies register is first and foremost a counter. It counts up one on
|
3255 |
|
|
every clock. Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
|
3256 |
33 |
dgisselq |
\begin{table}\begin{center}
|
3257 |
|
|
\begin{bitlist}
|
3258 |
|
|
31\ldots 0 & R & Current jiffy value\\\hline
|
3259 |
|
|
31\ldots 0 & W & Value/time of next interrupt\\\hline
|
3260 |
|
|
\end{bitlist}
|
3261 |
|
|
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
|
3262 |
|
|
\end{center}\end{table}
|
3263 |
199 |
dgisselq |
always return the time value contained in the register.
|
3264 |
33 |
dgisselq |
|
3265 |
199 |
dgisselq |
The register accepts writes as well.  Writes to the register set the time of
the next Jiffy interrupt.  If the time written is between 0~and $2^{31}$
clocks in the past, the peripheral will immediately create an interrupt.
Otherwise, the register will compare the new value against the currently
stored interrupt value.  The value nearest in time to the current jiffies value
will be kept, and so the jiffies register will trip at that value.  Prior
values are forgotten.
|
3272 |
|
|
|
3273 |
|
|
When the Jiffy counter value equals the value in its trigger register, then
|
3274 |
|
|
the jiffies peripheral will trigger an interrupt. At this point, the internal
|
3275 |
|
|
register is cleared. It will create no more interrupts unless a new value
|
3276 |
|
|
is written to it.
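
Because the counter rolls over, ``in the past'' has to be judged modulo
$2^{32}$.  The C sketch below shows one way software might mirror the two
comparisons described above.  The function names are hypothetical, and the
code illustrates the arithmetic only, not the peripheral's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* True if `when` lies between 0 and 2^31 clocks in the past of `now`.
 * The subtraction is done modulo 2^32, so rollover is handled. */
int jiffies_in_past(uint32_t now, uint32_t when)
{
	return (uint32_t)(now - when) < 0x80000000u;
}

/* Of two future trigger times, return the one nearest to `now`. */
uint32_t jiffies_nearest(uint32_t now, uint32_t a, uint32_t b)
{
	/* Precondition for this sketch: both times are in the future. */
	assert(!jiffies_in_past(now, a) && !jiffies_in_past(now, b));
	return ((uint32_t)(a - now) <= (uint32_t)(b - now)) ? a : b;
}
```

Note that a time equal to the current count is treated as zero clocks in the
past, and so counts as already due.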
|
3277 |
|
|
|
3278 |
69 |
dgisselq |
\subsection{Performance Counters}
|
3279 |
|
|
|
3280 |
199 |
dgisselq |
The ZipCPU also supports several counter peripherals, mostly for the purpose of
|
3281 |
|
|
process accounting. These counters each contain a single register, as shown
|
3282 |
|
|
in Tbl.~\ref{tbl:ctrbits}.
|
3283 |
33 |
dgisselq |
\begin{table}\begin{center}
|
3284 |
|
|
\begin{bitlist}
|
3285 |
|
|
31\ldots 0 & R/W & Current counter value\\\hline
|
3286 |
|
|
\end{bitlist}
|
3287 |
|
|
\caption{Counter Register Bits}\label{tbl:ctrbits}
|
3288 |
|
|
\end{center}\end{table}
|
3289 |
|
|
Writes to this register set the new counter value. Reads read the current
|
3290 |
|
|
counter value.
|
3291 |
|
|
|
3292 |
199 |
dgisselq |
These counters can be configured to count upwards upon any event. Using this
|
3293 |
|
|
capability, eight counters have been assigned the task of performance counting.
|
3294 |
33 |
dgisselq |
Two sets of four registers are available for keeping track of performance.
|
3295 |
199 |
dgisselq |
The first set tracks master performance, including both supervisor as well as
|
3296 |
|
|
user CPU statistics. The second set tracks user statistics only, and will not
|
3297 |
|
|
count in supervisor mode.
|
3298 |
33 |
dgisselq |
|
3299 |
199 |
dgisselq |
Of the four registers in each set, the first is a task counter that just counts
clock ticks.  The second and third are stall counters: a master stall counter
and a prefetch stall counter, in the order listed in Tbl.~\ref{tbl:zpregs}.
These allow the CPU's efficiency to be evaluated.
The fourth and final counter in each set is an instruction counter, which
counts how many instructions the CPU has issued.
|
3304 |
|
|
|
3305 |
33 |
dgisselq |
It is envisioned that these counters will be used as follows: First, every time
|
3306 |
|
|
a master counter rolls over, the supervisor (Operating System) will record
|
3307 |
|
|
the fact. Second, whenever activating a user task, the Operating System will
|
3308 |
|
|
set the four user counters to zero. When the user task has completed, the
|
3309 |
|
|
Operating System will read the counters back, to determine how much of the
|
3310 |
69 |
dgisselq |
CPU the process had consumed. To keep this accurate, the user counters will
|
3311 |
|
|
only increment when the GIE bit is set to indicate that the processor is
|
3312 |
|
|
in user mode.
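
The bookkeeping described above might look something like the following C
sketch.  Everything in it is an assumption made for illustration: the
structure names are hypothetical, and a plain structure stands in for the
four user counter registers, which on real hardware would be reached through
{\tt volatile} accesses at the addresses listed in Tbl.~\ref{tbl:zpregs}.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the four user counters, in register-table order. */
struct perf_counters {
	uint32_t clocks;	/* UTASK */
	uint32_t stalls;	/* UMSTL */
	uint32_t pf_stalls;	/* UPSTL */
	uint32_t instructions;	/* UICNT */
};

/* Hypothetical per-task totals kept by the operating system. */
struct task_usage {
	uint64_t clocks, stalls, pf_stalls, instructions;
};

/* Add one time slice's counts to a task's totals, then zero the
 * counters so that the next task starts from a clean slate. */
void account_slice(struct task_usage *u, struct perf_counters *c)
{
	assert(u && c);
	u->clocks       += c->clocks;
	u->stalls       += c->stalls;
	u->pf_stalls    += c->pf_stalls;
	u->instructions += c->instructions;
	c->clocks = c->stalls = c->pf_stalls = c->instructions = 0;
}
```

Keeping the totals in 64--bit variables means the 32--bit hardware counters
only need to avoid rolling over within a single time slice.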
|
3313 |
33 |
dgisselq |
|
3314 |
199 |
dgisselq |
\subsection{DMA Controller}\label{sec:reg-dmac}
|
3315 |
69 |
dgisselq |
|
3316 |
36 |
dgisselq |
The final peripheral to discuss is the DMA controller. This controller
|
3317 |
|
|
has four registers. Of these four, the length, source and destination address
|
3318 |
|
|
registers should need no further explanation. They are full 32--bit registers
|
3319 |
|
|
specifying the entire transfer length, the starting address to read from, and
|
3320 |
|
|
the starting address to write to. The registers can be written to when the
|
3321 |
|
|
DMA is idle, and read at any time. The control register, however, will need
|
3322 |
|
|
some more explanation.
|
3323 |
|
|
|
3324 |
|
|
The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
|
3325 |
|
|
\begin{table}\begin{center}
|
3326 |
|
|
\begin{bitlist}
|
3327 |
|
|
31 & R & DMA Active\\\hline
|
3328 |
199 |
dgisselq |
30 & R & Wishbone error, transaction aborted. This bit is cleared the next
|
3329 |
|
|
time this register is written to.\\\hline
|
3330 |
|
|
29 & R/W & Set to `1' to prevent the controller from incrementing the source
|
3331 |
|
|
address, `0' for normal memory copy. \\\hline
|
3332 |
69 |
dgisselq |
28 & R/W & Set to `1' to prevent the controller from incrementing the
|
3333 |
|
|
destination address, `0' for normal memory copy. \\\hline
|
3334 |
36 |
dgisselq |
27 \ldots 16 & W & The DMA Key.  Write a 12'hfed to these bits to
activate any DMA transfer. \\\hline
|
3336 |
69 |
dgisselq |
27 & R & Always reads `0', to force the deliberate writing of the key. \\\hline
|
3337 |
36 |
dgisselq |
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
|
3338 |
|
|
have yet to be written. \\\hline
|
3339 |
69 |
dgisselq |
15 & R/W & Set to `1' to trigger on an interrupt, or `0' to start immediately
|
3340 |
36 |
dgisselq |
upon receiving a valid key.\\\hline
|
3341 |
|
|
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
|
3342 |
199 |
dgisselq |
9\ldots 0 & R/W & Intermediate transfer length. Thus, to transfer
|
3343 |
|
|
one item at a time set this value to 1. To transfer the maximum number,
|
3344 |
|
|
1024, at a time set it to 0.\\\hline
|
3345 |
36 |
dgisselq |
\end{bitlist}
|
3346 |
|
|
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
|
3347 |
|
|
\end{center}\end{table}
|
3348 |
|
|
This control register has been designed so that the common case of memory
|
3349 |
|
|
access need only set the key and the transfer length. Hence, writing a
|
3350 |
199 |
dgisselq |
\hbox{32'h0fed0000} to the control register will start any memory transfer.
|
3351 |
36 |
dgisselq |
On the other hand, if you wished to read from a serial port (constant address)
|
3352 |
|
|
and put the result into a buffer every time a word was available, you
|
3353 |
199 |
dgisselq |
might wish to write \hbox{32'h2fed8001}--this assumes, of course, that you
|
3354 |
36 |
dgisselq |
have a serial port wired to the zero bit of this interrupt control. (The
|
3355 |
|
|
DMA controller does not use the interrupt controller, and cannot clear
|
3356 |
|
|
interrupts.) As a third example, if you wished to write to an external
|
3357 |
|
|
FIFO anytime it was less than half full (had fewer than 512 items), and
|
3358 |
167 |
dgisselq |
interrupt line 3 indicated this condition, you might wish to issue a
|
3359 |
36 |
dgisselq |
\hbox{32'h1fed8dff} to this port.
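
The three control words above can be derived mechanically from
Tbl.~\ref{tbl:dmacbits}.  The C sketch below does so.  Of the names used,
only {\tt DMA\_CONSTSRC} appears in the earlier examples; the other macros
and the {\tt dma\_ctrl()} helper are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_KEY      0x0fed0000u	/* bits 27..16 hold 12'hfed */
#define DMA_CONSTSRC 0x20000000u	/* bit 29: hold source address */
#define DMA_CONSTDST 0x10000000u	/* bit 28: hold destination address */
#define DMA_ONINT    0x00008000u	/* bit 15: trigger on an interrupt */

/* Build a DMA control word.
 *   flags:    any combination of DMA_CONSTSRC and DMA_CONSTDST
 *   int_line: interrupt line 0..31 to trigger on, or -1 to start at once
 *   chunk:    items per intermediate transfer, 1..1024 */
uint32_t dma_ctrl(uint32_t flags, int int_line, unsigned chunk)
{
	uint32_t word = DMA_KEY | flags;

	assert((flags & ~(DMA_CONSTSRC | DMA_CONSTDST)) == 0);
	assert(int_line >= -1 && int_line <= 31);
	assert(chunk >= 1 && chunk <= 1024);

	if (int_line >= 0)
		word |= DMA_ONINT | ((uint32_t)int_line << 10);
	return word | (chunk & 0x3ffu);	/* 1024 is encoded as 0 */
}
```

Here {\tt dma\_ctrl(0, -1, 1024)} gives the plain memory copy word,
{\tt dma\_ctrl(DMA\_CONSTSRC, 0, 1)} the serial port word, and
{\tt dma\_ctrl(DMA\_CONSTDST, 3, 511)} the FIFO word.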
|
3360 |
|
|
|
3361 |
199 |
dgisselq |
\section{Debug Port Registers}\label{sec:reg-debug}
|
3362 |
|
|
Accessing the ZipSystem via the debug port isn't as straightforward as
accessing the system via the Wishbone bus.  The debug port itself has been
reduced to two addresses, as outlined in Tbl.~\ref{tbl:dbgregs}.
|
3365 |
199 |
dgisselq |
\begin{table}[htbp]
|
3366 |
|
|
\begin{center}\begin{reglist}
|
3367 |
|
|
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
|
3368 |
202 |
dgisselq |
ZIPDATA & 4 & 32 & R/W & Debug Data Register \\\hline
|
3369 |
199 |
dgisselq |
\end{reglist}
|
3370 |
|
|
\caption{ZipSystem Debug Registers}\label{tbl:dbgregs}
|
3371 |
|
|
\end{center}\end{table}
|
3372 |
|
|
|
3373 |
|
|
Access to the ZipSystem begins with the Debug Control register, shown in
|
3374 |
33 |
dgisselq |
Tbl.~\ref{tbl:dbgctrl}.
|
3375 |
|
|
\begin{table}\begin{center}
|
3376 |
|
|
\begin{bitlist}
|
3377 |
69 |
dgisselq |
31\ldots 14 & R & External interrupt state. Bit 14 is valid for one
|
3378 |
|
|
interrupt only, bit 15 for two, etc.\\\hline
|
3379 |
33 |
dgisselq |
13 & R & CPU GIE setting\\\hline
|
3380 |
|
|
12 & R & CPU is sleeping\\\hline
|
3381 |
|
|
11 & W & Command clear PF cache\\\hline
|
3382 |
69 |
dgisselq |
10 & R/W & Command HALT, set to `1' to halt the CPU\\\hline
|
3383 |
|
|
9 & R & Stall Status, `1' if CPU is busy (i.e., not halted yet)\\\hline
|
3384 |
|
|
8 & R/W & Step Command, set to `1' to step the CPU, also sets the halt bit\\\hline
|
3385 |
|
|
7 & R & Interrupt Request Pending\\\hline
|
3386 |
33 |
dgisselq |
6 & R/W & Command RESET \\\hline
|
3387 |
|
|
5\ldots 0 & R/W & Debug Register Address \\\hline
|
3388 |
|
|
\end{bitlist}
|
3389 |
|
|
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
|
3390 |
|
|
\end{center}\end{table}
|
3391 |
|
|
|
3392 |
|
|
The first step in debugging access is to determine whether or not the CPU
|
3393 |
69 |
dgisselq |
is halted, and to halt it if not. To do this, first write a `1' to the
|
3394 |
33 |
dgisselq |
Command HALT bit. This will halt the CPU and place it into debug mode.
|
3395 |
|
|
Once the CPU is halted, the stall status bit will drop to zero. Thus,
|
3396 |
|
|
if bit 10 is high and bit 9 low, the debug port is open to examine the
|
3397 |
|
|
internal state of the CPU.
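
The halt test can be written as a single predicate on the value read back from
the control register. This is a minimal Python sketch that assumes only the bit
positions of Tbl.~\ref{tbl:dbgctrl}; the function name
{\tt debug\_port\_open} is hypothetical.

```python
HALT_BIT = 1 << 10    # Command HALT
STALL_BIT = 1 << 9    # Stall status, set while the CPU has not yet halted


def debug_port_open(ctrl):
    """True when the halt bit is high and the stall bit is low."""
    return bool(ctrl & HALT_BIT) and not (ctrl & STALL_BIT)


print(debug_port_open(0x400))   # halted and idle
print(debug_port_open(0x600))   # halt requested, CPU still busy
```

A debugger would typically write the halt bit once and then poll the control
register until this predicate becomes true.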
|
3398 |
|
|
|
3399 |
|
|
At this point, the external debugger may examine internal state information
|
3400 |
|
|
from within the CPU. To do this, first write again to the command register
|
3401 |
|
|
a value (with command halt still high) containing the address of an internal
|
3402 |
|
|
register of interest in the bottom 6~bits. Internal registers that may be
|
3403 |
|
|
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
|
3404 |
|
|
\begin{table}\begin{center}
|
3405 |
|
|
\begin{reglist}
|
3406 |
|
|
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
|
3407 |
|
|
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
|
3408 |
|
|
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
|
3409 |
|
|
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
|
3410 |
|
|
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
|
3411 |
|
|
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
|
3412 |
|
|
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
|
3413 |
|
|
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
|
3414 |
|
|
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
|
3415 |
|
|
uPC & 31 & 32 & R/W & User Program Counter\\\hline
|
3416 |
|
|
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
|
3417 |
|
|
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
|
3418 |
199 |
dgisselq |
WBUS & 34 & 32 & R & Last Bus Error\\\hline
|
3419 |
33 |
dgisselq |
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
|
3420 |
|
|
TMRA & 36 & 32 & R/W & Timer A\\\hline
|
3421 |
|
|
TMRB & 37 & 32 & R/W & Timer B\\\hline
|
3422 |
|
|
TMRC & 38 & 32 & R/W & Timer C\\\hline
|
3423 |
|
|
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
|
3424 |
|
|
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
|
3425 |
|
|
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
|
3426 |
|
|
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
|
3427 |
|
|
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
|
3428 |
|
|
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
|
3429 |
|
|
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
|
3430 |
|
|
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
|
3431 |
|
|
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
|
3432 |
39 |
dgisselq |
DMACMD & 48 & 32 & R/W & DMA command and status register\\\hline
|
3433 |
|
|
DMALEN & 49 & 32 & R/W & DMA transfer length\\\hline
|
3434 |
|
|
DMARD & 50 & 32 & R/W & DMA read address\\\hline
|
3435 |
|
|
DMAWR & 51 & 32 & R/W & DMA write address\\\hline
|
3436 |
33 |
dgisselq |
\end{reglist}
|
3437 |
|
|
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
|
3438 |
|
|
\end{center}\end{table}
|
3439 |
|
|
Primarily, these ``registers'' include access to the entire CPU register
|
3440 |
36 |
dgisselq |
set, as well as the internal peripherals. To read one of these registers
|
3441 |
33 |
dgisselq |
once the address is set, simply issue a read from the data port. To write
|
3442 |
|
|
one of these registers or peripheral ports, simply write to the data port
|
3443 |
|
|
after setting the proper address.
|
3444 |
|
|
|
3445 |
|
|
In this manner, all of the CPU's internal state may be read and adjusted.
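
The select-then-access sequence might look as follows from the debugger's side.
This Python sketch is not the ZipCPU debugger itself: {\tt FakeDebugPort} is a
hypothetical stand-in that models the two port addresses of
Tbl.~\ref{tbl:dbgregs} in software, and {\tt read\_reg} and {\tt write\_reg}
are illustrative helper names. The register addresses come from
Tbl.~\ref{tbl:dbgaddrs}.

```python
ZIPCTRL, ZIPDATA = 0, 4    # debug port addresses
HALT_BIT = 1 << 10         # keep the CPU halted while accessing registers
ADDR_MASK = 0x3F           # bottom six bits select the internal register
SPC, UR0 = 15, 16          # two entries from the register address table


class FakeDebugPort:
    """Hypothetical software model: 64 registers behind two bus addresses."""

    def __init__(self):
        self.ctrl = 0
        self.regs = [0] * 64

    def write(self, addr, value):
        if addr == ZIPCTRL:
            self.ctrl = value
        else:
            self.regs[self.ctrl & ADDR_MASK] = value & 0xFFFFFFFF

    def read(self, addr):
        if addr == ZIPCTRL:
            return self.ctrl
        return self.regs[self.ctrl & ADDR_MASK]


def read_reg(port, reg):
    """Select a register through the control port, then read the data port."""
    port.write(ZIPCTRL, HALT_BIT | (reg & ADDR_MASK))
    return port.read(ZIPDATA)


def write_reg(port, reg, value):
    """Select a register through the control port, then write the data port."""
    port.write(ZIPCTRL, HALT_BIT | (reg & ADDR_MASK))
    port.write(ZIPDATA, value)


port = FakeDebugPort()
write_reg(port, UR0, 0x12345678)
write_reg(port, SPC, 0x00100000)
print(hex(read_reg(port, UR0)), hex(read_reg(port, SPC)))
```

Each access costs two bus transactions, one to set the address and one to move
the data, which is the price of reducing the debug port to two addresses.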
|
3446 |
|
|
|
3447 |
|
|
As an example of how to use this, consider what would happen in the case
of an external break point. If and when the CPU hits a break point that
causes it to halt, the Command HALT bit will activate on its own. The CPU
will then raise an external interrupt line and wait for a debugger to examine
its state. After examining the state, the debugger will need to remove
the breakpoint by writing a different instruction into memory and by writing
to the command register while holding the clear cache, command halt, and
step CPU bits high (32'hd00). The debugger may then replace the breakpoint
now that the CPU has gone beyond it, and clear the cache again (32'h500).
|
3456 |
|
|
|
3457 |
|
|
To leave this debug mode, simply write a `32'h0' value to the command register.
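
When composing or checking command words such as these, it can help to decode
them against Tbl.~\ref{tbl:dbgctrl}. The Python sketch below assumes only the
bit positions listed in that table; the helper name {\tt command\_bits} is
hypothetical.

```python
COMMAND_BITS = {
    "CLEAR_CACHE": 1 << 11,
    "HALT": 1 << 10,
    "STEP": 1 << 8,
    "RESET": 1 << 6,
}


def command_bits(word):
    """List the command bits set in a debug control word, highest bit first."""
    return [name for name, mask in COMMAND_BITS.items() if word & mask]


for word in (0xD00, 0x500, 0x000):
    print(f"32'h{word:03x}", command_bits(word))
```

With these bit positions, 32'hd00 sets the clear cache, halt and step bits,
32'h500 sets the halt and step bits, and 32'h0 releases the CPU.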
|
3458 |
|
|
|
3459 |
|
|
\chapter{Wishbone Datasheets}\label{chap:wishbone}
|
3460 |
199 |
dgisselq |
The ZipSystem supports two wishbone ports, a slave debug port and a master
|
3461 |
21 |
dgisselq |
port for the system itself. These are shown in Tbl.~\ref{tbl:wishbone-slave}
|
3462 |
|
|
\begin{table}[htbp]
|
3463 |
|
|
\begin{center}
|
3464 |
|
|
\begin{wishboneds}
|
3465 |
|
|
Revision level of wishbone & WB B4 spec \\\hline
|
3466 |
|
|
Type of interface & Slave, Read/Write, single words only \\\hline
|
3467 |
24 |
dgisselq |
Address Width & 1--bit \\\hline
|
3468 |
21 |
dgisselq |
Port size & 32--bit \\\hline
|
3469 |
|
|
Port granularity & 32--bit \\\hline
|
3470 |
|
|
Maximum Operand Size & 32--bit \\\hline
|
3471 |
|
|
Data transfer ordering & (Irrelevant) \\\hline
|
3472 |
69 |
dgisselq |
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
|
3473 |
|
|
XuLA2--LX25\\\hline
|
3474 |
21 |
dgisselq |
Signal Names & \begin{tabular}{ll}
|
3475 |
|
|
Signal Name & Wishbone Equivalent \\\hline
|
3476 |
|
|
{\tt i\_clk} & {\tt CLK\_I} \\
|
3477 |
|
|
{\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
|
3478 |
199 |
dgisselq |
{\tt i\_dbg\_stb} & {\tt (CYC\_I)\&(STB\_I)} \\
|
3479 |
21 |
dgisselq |
{\tt i\_dbg\_we} & {\tt WE\_I} \\
|
3480 |
|
|
{\tt i\_dbg\_addr} & {\tt ADR\_I} \\
|
3481 |
|
|
{\tt i\_dbg\_data} & {\tt DAT\_I} \\
|
3482 |
|
|
{\tt o\_dbg\_ack} & {\tt ACK\_O} \\
|
3483 |
|
|
{\tt o\_dbg\_stall} & {\tt STALL\_O} \\
|
3484 |
|
|
{\tt o\_dbg\_data} & {\tt DAT\_O}
|
3485 |
|
|
\end{tabular}\\\hline
|
3486 |
|
|
\end{wishboneds}
|
3487 |
22 |
dgisselq |
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
|
3488 |
21 |
dgisselq |
\end{center}\end{table}
|
3489 |
|
|
and Tbl.~\ref{tbl:wishbone-master} respectively.
|
3490 |
|
|
\begin{table}[htbp]
|
3491 |
|
|
\begin{center}
|
3492 |
|
|
\begin{wishboneds}
|
3493 |
|
|
Revision level of wishbone & WB B4 spec \\\hline
|
3494 |
202 |
dgisselq |
Type of interface & Master, Read/Write, pipelined\\\hline
|
3495 |
|
|
Address Width & (ZipSystem parameter, up to 30~bits) \\\hline
|
3496 |
21 |
dgisselq |
Port size & 32--bit \\\hline
|
3497 |
202 |
dgisselq |
Port granularity & 8--bit \\\hline
|
3498 |
21 |
dgisselq |
Maximum Operand Size & 32--bit \\\hline
|
3499 |
202 |
dgisselq |
Data transfer ordering & Big--Endian \\\hline
|
3500 |
69 |
dgisselq |
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
|
3501 |
|
|
XuLA2--LX25\\\hline
|
3502 |
21 |
dgisselq |
Signal Names & \begin{tabular}{ll}
|
3503 |
|
|
Signal Name & Wishbone Equivalent \\\hline
|
3504 |
|
|
{\tt i\_clk} & {\tt CLK\_O} \\
|
3505 |
|
|
{\tt o\_wb\_cyc} & {\tt CYC\_O} \\
|
3506 |
199 |
dgisselq |
{\tt o\_wb\_stb} & {\tt (CYC\_O)\&(STB\_O)} \\
|
3507 |
21 |
dgisselq |
{\tt o\_wb\_we} & {\tt WE\_O} \\
|
3508 |
|
|
{\tt o\_wb\_addr} & {\tt ADR\_O} \\
|
3509 |
|
|
{\tt o\_wb\_data} & {\tt DAT\_O} \\
|
3510 |
202 |
dgisselq |
{\tt o\_wb\_sel} & {\tt SEL\_O} \\
|
3511 |
21 |
dgisselq |
{\tt i\_wb\_ack} & {\tt ACK\_I} \\
|
3512 |
|
|
{\tt i\_wb\_stall} & {\tt STALL\_I} \\
|
3513 |
69 |
dgisselq |
{\tt i\_wb\_data} & {\tt DAT\_I} \\
|
3514 |
|
|
{\tt i\_wb\_err} & {\tt ERR\_I}
|
3515 |
21 |
dgisselq |
\end{tabular}\\\hline
|
3516 |
|
|
\end{wishboneds}
|
3517 |
22 |
dgisselq |
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
|
3518 |
21 |
dgisselq |
\end{center}\end{table}
|
3519 |
199 |
dgisselq |
I do not recommend that you connect these together through the interconnect,
|
3520 |
|
|
since 1) it doesn't make sense that the CPU should be able to halt itself,
|
3521 |
|
|
and 2) it helps to be able to reboot the CPU in case something has gone
|
3522 |
|
|
terribly wrong and the CPU is stalling the entire interconnect.
|
3523 |
24 |
dgisselq |
Rather, the debug port of the CPU should be accessible regardless of the state
|
3524 |
|
|
of the master bus.
|
3525 |
21 |
dgisselq |
|
3526 |
69 |
dgisselq |
You may wish to notice that neither the {\tt LOCK} nor the {\tt RTY} (retry)
|
3527 |
|
|
wires have been connected to the CPU's master interface. If necessary, a
|
3528 |
199 |
dgisselq |
rudimentary {\tt LOCK} may be created by tying this wire to the {\tt wb\_cyc}
|
3529 |
69 |
dgisselq |
line. As for the {\tt RTY}, all the CPU recognizes at this point are bus
|
3530 |
|
|
errors---it cannot tell the difference between a temporary and a permanent bus
|
3531 |
199 |
dgisselq |
error. Therefore, one might logically OR the bus error and bus retry flags on
|
3532 |
|
|
input to the CPU if necessary.
|
3533 |
21 |
dgisselq |
|
3534 |
199 |
dgisselq |
The final simplification made to the standard wishbone B4 specification is
that the strobe lines are assumed to be zero in any slave if {\tt CYC\_I} is
zero, and the master is responsible for ensuring that {\tt STB\_O} is never
true when {\tt CYC\_O} is false in order to make this work. All of the ZipCPU
and ZipSystem masters and peripherals have been created with this assumption.
Converting peripherals that have made this assumption to work with masters
that don't guarantee this property is as simple as ANDing the slave's
{\tt CYC\_I} and {\tt STB\_I} lines together. No change needs to be made to
any ZipCPU master, however, in order to access any peripheral that hasn't been
so simplified.
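
Both pieces of glue logic described in this chapter, the strobe gating and the
folding of a retry into the bus error input, are single gates. The Python
sketch below only models their truth tables; in a real design they would be
one line of HDL each, and the function names are hypothetical.

```python
def slave_strobe(cyc_i, stb_i):
    """Strobe as seen by a slave that assumes STB is never set without CYC."""
    return cyc_i and stb_i


def cpu_bus_error(err, rty):
    """Fold a retry request into the CPU's single bus error input."""
    return err or rty


for cyc in (False, True):
    for stb in (False, True):
        print(cyc, stb, slave_strobe(cyc, stb))
```

The only row where the gated strobe differs from the raw strobe is the one a
ZipCPU master never produces: strobe set while the cycle line is low.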
|
3544 |
|
|
|
3545 |
21 |
dgisselq |
\chapter{Clocks}\label{chap:clocks}
|
3546 |
|
|
|
3547 |
199 |
dgisselq |
This core has now been tested and proven on the Xilinx Spartan~6 FPGA as well
|
3548 |
|
|
as the Artix--7 FPGA.
|
3549 |
21 |
dgisselq |
\begin{table}[htbp]
|
3550 |
|
|
\begin{center}
|
3551 |
|
|
\begin{clocklist}
|
3552 |
199 |
dgisselq |
i\_clk & External & 100~MHz & & System clock, Artix--7/35T\\\hline
|
3553 |
|
|
& & 80~MHz & & System clock, Spartan 6\\\hline
|
3554 |
21 |
dgisselq |
\end{clocklist}
|
3555 |
|
|
\caption{List of Clocks}\label{tbl:clocks}
|
3556 |
|
|
\end{center}\end{table}
|
3557 |
|
|
I hesitate to suggest that the core can run faster than 100~MHz, since I have
struggled with various timing violations to keep it at 100~MHz. So, for
now, I will only state that it can run at 100~MHz.
|
3560 |
|
|
|
3561 |
69 |
dgisselq |
On a SPARTAN 6, the clock can run successfully at 80~MHz.
|
3562 |
21 |
dgisselq |
|
3563 |
199 |
dgisselq |
A second Artix--7 design on Digilent's Arty board is limited to 81.25~MHz
|
3564 |
|
|
by the memory interface generated core used to access SDRAM.
|
3565 |
|
|
|
3566 |
21 |
dgisselq |
\chapter{I/O Ports}\label{chap:ioports}
|
3567 |
199 |
dgisselq |
This chapter presents and outlines the various I/O lines in and out of the
|
3568 |
|
|
ZipSystem. Since the ZipCPU is meant to be one component of a larger design,
|
3569 |
|
|
this makes sense.
|
3570 |
|
|
|
3571 |
|
|
The I/O ports to the ZipSystem may be grouped into three categories. The first
|
3572 |
33 |
dgisselq |
is that of the master wishbone used by the CPU, then the slave wishbone used
|
3573 |
|
|
to command the CPU via a debugger, and then the rest. The first two of these
|
3574 |
|
|
were already discussed in the wishbone chapter. They are listed here
|
3575 |
|
|
for completeness in Tbl.~\ref{tbl:iowb-master}
|
3576 |
|
|
\begin{table}
|
3577 |
|
|
\begin{center}\begin{portlist}
|
3578 |
|
|
{\tt o\_wb\_cyc} & 1 & Output & Indicates an active Wishbone cycle\\\hline
|
3579 |
|
|
{\tt o\_wb\_stb} & 1 & Output & WB Strobe signal\\\hline
|
3580 |
|
|
{\tt o\_wb\_we} & 1 & Output & Write enable\\\hline
|
3581 |
202 |
dgisselq |
{\tt o\_wb\_addr} & 30 & Output & Bus address \\\hline
|
3582 |
33 |
dgisselq |
{\tt o\_wb\_data} & 32 & Output & Data on WB write\\\hline
|
3583 |
202 |
dgisselq |
{\tt o\_wb\_sel} & 4 & Output & Select lines\\\hline
|
3584 |
33 |
dgisselq |
{\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline
|
3585 |
|
|
{\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline
|
3586 |
|
|
{\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline
|
3587 |
69 |
dgisselq |
{\tt i\_wb\_err} & 1 & Input & Bus Error indication\\\hline
|
3588 |
33 |
dgisselq |
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
|
3589 |
|
|
and~\ref{tbl:iowb-slave} respectively.
|
3590 |
|
|
\begin{table}
|
3591 |
|
|
\begin{center}\begin{portlist}
|
3592 |
199 |
dgisselq |
{\tt i\_dbg\_cyc} & 1 & Input & Indicates an active Wishbone cycle\\\hline
|
3593 |
|
|
{\tt i\_dbg\_stb} & 1 & Input & WB Strobe signal\\\hline
|
3594 |
|
|
{\tt i\_dbg\_we} & 1 & Input & Write enable\\\hline
|
3595 |
|
|
{\tt i\_dbg\_addr} & 1 & Input & Bus address, command or data port \\\hline
|
3596 |
|
|
{\tt i\_dbg\_data} & 32 & Input & Data on WB write\\\hline
|
3597 |
|
|
{\tt o\_dbg\_ack} & 1 & Output & Slave has completed a R/W cycle\\\hline
|
3598 |
|
|
{\tt o\_dbg\_stall} & 1 & Output & WB bus slave not ready\\\hline
|
3599 |
|
|
{\tt o\_dbg\_data} & 32 & Output & Outgoing bus data\\\hline
|
3600 |
33 |
dgisselq |
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}
|
3601 |
21 |
dgisselq |
|
3602 |
33 |
dgisselq |
There are only four other lines to the CPU: the external clock, external
|
3603 |
|
|
reset, incoming external interrupt line(s), and the outgoing debug interrupt
|
3604 |
|
|
line. These are shown in Tbl.~\ref{tbl:ioports}.
|
3605 |
|
|
\begin{table}
|
3606 |
|
|
\begin{center}\begin{portlist}
|
3607 |
|
|
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
|
3608 |
|
|
{\tt i\_rst} & 1 & Input & Active high reset line \\\hline
|
3609 |
69 |
dgisselq |
{\tt i\_ext\_int} & 1\ldots 16 & Input & Incoming external interrupts, actual
|
3610 |
199 |
dgisselq |
value set by implementation parameter. This is only ever one
|
3611 |
|
|
for the ZipBones implementation.\\\hline
|
3612 |
33 |
dgisselq |
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
|
3613 |
|
|
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
|
3614 |
199 |
dgisselq |
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}. The reset
|
3615 |
|
|
line is an active high reset. When
|
3616 |
|
|
asserted, the CPU will start running again from its {\tt RESET\_ADDRESS} in
|
3617 |
69 |
dgisselq |
memory. Further, depending upon how the CPU is configured and specifically
|
3618 |
|
|
based upon how the {\tt START\_HALTED} parameter is set, the CPU may or may
|
3619 |
|
|
not start running automatically following a reset. The {\tt i\_ext\_int}
|
3620 |
199 |
dgisselq |
bus is the set of external interrupt lines to the ZipSystem. This bus may
|
3621 |
|
|
actually be as wide as 16~external interrupts, depending upon the setting of
|
3622 |
|
|
the {\tt EXTERNAL\_INTERRUPTS} parameter. Finally, the ZipSystem produces one
|
3623 |
69 |
dgisselq |
external interrupt whenever the entire CPU halts to wait for the debugger.
|
3624 |
33 |
dgisselq |
|
3625 |
199 |
dgisselq |
The I/O lines to the ZipBones package are identical to those of the ZipSystem,
|
3626 |
|
|
with the only exception that the ZipBones package has only a single interrupt
|
3627 |
|
|
line input. This means that the ZipBones implementation practically depends
|
3628 |
|
|
upon an external interrupt controller.
|
3629 |
|
|
|
3630 |
36 |
dgisselq |
\chapter{Initial Assessment}\label{chap:assessment}
|
3631 |
|
|
|
3632 |
199 |
dgisselq |
Having now worked with the ZipCPU for a while, it is worth offering an
|
3633 |
36 |
dgisselq |
honest assessment of how well it works and how well it was designed. At the
|
3634 |
|
|
end of this assessment, I will propose some changes that may take place in a
|
3635 |
199 |
dgisselq |
later version of this ZipCPU to make it better.
|
3636 |
36 |
dgisselq |
|
3637 |
|
|
\section{The Good}
|
3638 |
|
|
\begin{itemize}
|
3639 |
199 |
dgisselq |
\item The ZipCPU was designed to be a simple and lightweight CPU. It has
|
3640 |
|
|
achieved this end nicely. The proof of this is the full multitasking
|
3641 |
|
|
operating system built for Digilent's CMod S6 board, based around
|
3642 |
|
|
a very small Spartan~6/LX4 FPGA.
|
3643 |
|
|
|
3644 |
|
|
As a result, the ZipCPU also makes a good starting point for anyone
|
3645 |
|
|
who wishes to build a general purpose CPU and then to experiment with
|
3646 |
|
|
building and adding particular features. Modifications should be
|
3647 |
|
|
simple enough.
|
3648 |
|
|
|
3649 |
|
|
Indeed, a non--pipelined version of the bare ZipBones (with no
|
3650 |
|
|
peripherals) has been built that only uses 1.3k~6--LUTs. When using
|
3651 |
|
|
pipelining, the full cache, and all of the peripherals, the ZipSystem
|
3652 |
|
|
can take up to 4.5~k LUTs. Where it fits in between is a function of
|
3653 |
|
|
your needs.
|
3654 |
|
|
|
3655 |
|
|
A new implementation using an iCE40 FPGA suggests that the ZipCPU
|
3656 |
|
|
will fit within the 4k~4--way LUTs of the iCE40 HX4K FPGA, but only
|
3657 |
|
|
just barely.
|
3658 |
|
|
|
3659 |
|
|
\item The ZipCPU was designed to be an implementable soft core that could be
|
3660 |
202 |
dgisselq |
placed within an FPGA, controlling actions internal to the FPGA. This
|
3661 |
|
|
version of the CPU in particular has been updated so that it would
|
3662 |
|
|
support a more general purpose CPU, since as of version~2.0 the ZipCPU
|
3663 |
|
|
now supports octet level access across the bus.
|
3664 |
199 |
dgisselq |
|
3665 |
202 |
dgisselq |
Still, it fits this role rather nicely. Other capabilities common
|
3666 |
|
|
to more general purpose CPUs, such as
|
3667 |
|
|
double--precision floating point capability, vector registers and
|
3668 |
|
|
vector operations have been left out. However, it was never designed
|
3669 |
|
|
to be such a general purpose CPU but rather a system within a chip.
|
3670 |
|
|
|
3671 |
199 |
dgisselq |
\item The extremely simplified instruction set of the ZipCPU was a good
|
3672 |
36 |
dgisselq |
choice. Although it does not have many of the commonly used
|
3673 |
|
|
instructions, PUSH, POP, JSR, and RET among them, the simplified
|
3674 |
|
|
instruction set has demonstrated an amazing versatility. I will contend
|
3675 |
|
|
therefore, to anyone who will listen, that this instruction set
|
3676 |
|
|
offers a full and complete capability for whatever a user might wish
|
3677 |
|
|
to do with two exceptions: bytewise character access and accelerated
|
3678 |
|
|
floating-point support.
|
3679 |
199 |
dgisselq |
\item The burst load/store approach using the wishbone pipelining mode is
|
3680 |
|
|
novel, and can be used to greatly increase the speed of the processor.
|
3681 |
|
|
\item The novel approach to interrupts greatly facilitates the development of
|
3682 |
|
|
interrupt handlers from within high level languages.
|
3683 |
|
|
|
3684 |
|
|
The approach involves a single interrupt ``vector'' only, and simply
|
3685 |
|
|
switches the CPU back to the instruction it left off at. By using
|
3686 |
|
|
this approach, interrupt handlers no longer need careful assembly
|
3687 |
|
|
language scripting in order to save their context upon any interrupt.
|
3688 |
|
|
|
3689 |
|
|
At the same time, if most modern systems handle interrupt vectoring in
|
3690 |
|
|
software anyway, why maintain complicated hardware support for it?
|
3691 |
|
|
|
3692 |
36 |
dgisselq |
\item My goal of a high rate of instructions per clock may not be the proper
|
3693 |
199 |
dgisselq |
measure of this CPU. For example, if instructions are being read from a
|
3694 |
|
|
SPI flash device, such as is common among FPGA implementations, these
|
3695 |
|
|
same instructions may suffer stalls of between 64 and 128 cycles per
|
3696 |
36 |
dgisselq |
instruction just to read the instruction from the flash. Executing the
|
3697 |
|
|
instruction in a single clock cycle is no longer the appropriate
|
3698 |
|
|
measure. At the same time, it should be possible to use the DMA
|
3699 |
|
|
peripheral to copy instructions from the FLASH to a temporary memory
|
3700 |
|
|
location, after which they may again be executed at a rate of one
instruction per clock cycle.
|
3702 |
199 |
dgisselq |
|
3703 |
|
|
\item Both GCC and binutils back ends exist for the ZipCPU.
|
3704 |
202 |
dgisselq |
\item As of this version of the CPU, a newlib version of the C--library
|
3705 |
|
|
now exists.
|
3706 |
36 |
dgisselq |
\end{itemize}
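
To put the SPI flash item above into numbers, the following Python sketch
estimates the instruction rate when every instruction fetch stalls. It is a
back-of-the-envelope model only: it assumes one execute cycle per instruction
plus the quoted 64 to 128 stall cycles, a 100~MHz clock, and no caching or
overlap of any kind. The function name is hypothetical.

```python
def instructions_per_second(clock_hz, stall_cycles):
    """Assume one execute cycle plus stall_cycles of fetch delay per instruction."""
    return clock_hz / (stall_cycles + 1)


for stalls in (0, 64, 128):
    rate = instructions_per_second(100e6, stalls)
    print(f"{stalls:3d} stall cycles: {rate / 1e6:.2f} M instructions/s")
```

Under these assumptions the flash fetch time dominates completely, which is
why copying code into faster memory first is attractive.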
|
3707 |
|
|
|
3708 |
|
|
\section{The Not so Good}
|
3709 |
|
|
\begin{itemize}
|
3710 |
199 |
dgisselq |
\item The ZipCPU does not (yet) support a data cache. One is currently under
|
3711 |
|
|
development.
|
3712 |
36 |
dgisselq |
|
3713 |
199 |
dgisselq |
The ZipCPU compensates for this lack via its burst memory capability.
|
3714 |
|
|
Further, performance tests using Dhrystone suggest that the ZipCPU is
|
3715 |
|
|
no slower than other processors containing a data cache.
|
3716 |
68 |
dgisselq |
|
3717 |
36 |
dgisselq |
\item Many other instruction sets offer three operand instructions, whereas
|
3718 |
199 |
dgisselq |
the ZipCPU only offers two operand instructions. This means that it
|
3719 |
|
|
may take the ZipCPU more instructions to do many of the same operations.
|
3720 |
|
|
The good part of this is that it gives the ZipCPU a greater amount of
|
3721 |
36 |
dgisselq |
flexibility in its immediate operand mode, although that increased
|
3722 |
|
|
flexibility isn't necessarily as valuable as one might like.
|
3723 |
|
|
|
3724 |
199 |
dgisselq |
The impact of this lack of three operand instructions is application
|
3725 |
|
|
dependent, but does not appear to be too severe.
|
3726 |
36 |
dgisselq |
|
3727 |
199 |
dgisselq |
\item The ZipCPU doesn't support out of order execution.
|
3728 |
|
|
|
3729 |
|
|
I suppose it could be modified to do so, but then it would no longer
|
3730 |
|
|
be the ``simple'' and low LUT count CPU it was designed to be.
|
3731 |
|
|
|
3732 |
|
|
\item Although switching to an interrupt context in the ZipCPU design doesn't
|
3733 |
36 |
dgisselq |
require a tremendous swapping of registers, in reality it still
|
3734 |
199 |
dgisselq |
does--since any task swap (such as swapping to a task waiting on an
|
3735 |
|
|
interrupt) still requires saving and restoring all 16~user registers.
|
3736 |
|
|
That's a lot of memory movement just to service an interrupt.
|
3737 |
36 |
dgisselq |
|
3738 |
199 |
dgisselq |
This isn't nearly as bad as it sounds, however, since most RISC
|
3739 |
|
|
architectures have 32~registers that will need to be swapped upon any
|
3740 |
|
|
context swap.
|
3741 |
|
|
|
3742 |
|
|
\item The ZipCPU is by no means generic: it will never handle addresses
|
3743 |
202 |
dgisselq |
larger than 32-bits (4GB) without a complete and total redesign.
|
3744 |
36 |
dgisselq |
This may limit its utility as a generic CPU in the future, although
|
3745 |
199 |
dgisselq |
as an embedded CPU within an FPGA this isn't really much of a
|
3746 |
|
|
restriction.
|
3747 |
36 |
dgisselq |
|
3748 |
199 |
dgisselq |
\item While a toolchain does exist for the ZipCPU, it isn't yet fully featured.
|
3749 |
202 |
dgisselq |
The ZipCPU does not yet have any support for soft floating point
|
3750 |
|
|
arithmetic, nor does it have gdb support. These may be provided
|
3751 |
|
|
in future versions.
|
3752 |
36 |
dgisselq |
\end{itemize}
|
3753 |
|
|
|
3754 |
|
|
\section{The Next Generation}
|
3755 |
199 |
dgisselq |
This section could also be labeled as my ``To do'' list. It outlines where
|
3756 |
|
|
you may expect features in the future. Currently, there are four primary
|
3757 |
|
|
items on my to do list:
|
3758 |
|
|
\begin{enumerate}
|
3759 |
|
|
\item Soft Floating Point capability
|
3760 |
36 |
dgisselq |
|
3761 |
199 |
dgisselq |
The lack of any floating point capability, either hard or soft, makes
|
3762 |
|
|
porting math software to the ZipCPU difficult. Simply building a
|
3763 |
|
|
soft floating point library will solve this problem.
|
3764 |
36 |
dgisselq |
|
3765 |
199 |
dgisselq |
\item A data cache
|
3766 |
|
|
|
3767 |
|
|
A preliminary data cache implemented as a write through cache has
|
3768 |
|
|
been developed. Adding this into the CPU should require few changes
|
3769 |
|
|
internal to the CPU. I expect future versions of the CPU will permit
|
3770 |
|
|
this as an option.
|
3771 |
|
|
|
3772 |
|
|
\item A Memory Management Unit
|
3773 |
|
|
|
3774 |
|
|
The first version of such an MMU has already been written. It is
|
3775 |
|
|
available for examination in the ZipCPU repository. This MMU exists
|
3776 |
|
|
as a peripheral of the ZipCPU. Integrating this MMU into the ZipCPU
|
3777 |
202 |
dgisselq |
will involve slowing down memory stores so that they can be
|
3778 |
|
|
accomplished synchronously, as well as determining how and when
|
3779 |
|
|
particular cache lines need to be invalidated.
|
3780 |
199 |
dgisselq |
|
3781 |
|
|
\item An integrated floating point unit (FPU)
|
3782 |
|
|
|
3783 |
|
|
Why a small scale CPU needs a hefty floating point unit, I'm not
|
3784 |
|
|
certain, but many application contexts require the ability to do
|
3785 |
|
|
floating point math.
|
3786 |
|
|
\end{enumerate}
|
3787 |
|
|
|
3788 |
|
|
|
3789 |
21 |
dgisselq |
\end{document}
|
3790 |
|
|
|
3791 |
68 |
dgisselq |
%
|
3792 |
|
|
%
|
3793 |
|
|
% Symbol table relocation types:
|
3794 |
|
|
%
|
3795 |
|
|
% Only 3-types of instructions truly need relocations: those that modify the
|
3796 |
|
|
% PC register, and those that access memory.
|
3797 |
|
|
%
|
3798 |
|
|
% - LDI Addr,Rx // Load's an absolute address into Rx, 24 bits
|
3799 |
|
|
%
|
3800 |
|
|
% - LDILO Addr,Rx // Load's an absolute address into Rx, 32 bits
|
3801 |
|
|
% LDIHI Addr,Rx // requires two instructions
|
3802 |
|
|
%
|
3803 |
|
|
% - JMP Rx // Jump to any address in Rx
|
3804 |
|
|
% // Can be prefixed with two instructions to load Rx
|
3805 |
|
|
% // from any 32-bit immediate
|
3806 |
|
|
% - JMP #Addr // Jump to any 24'bit (signed) address, 23'b uns
|
3807 |
|
|
%
|
3808 |
|
|
% - ADD x,PC // Any PC relative jump (20 bits)
|
3809 |
|
|
%
|
3810 |
|
|
% - ADD.C x,PC // Any PC relative conditional jump (20 bits)
|
3811 |
|
|
%
|
3812 |
|
|
% - LDIHI Addr,Rx // Load from any 32-bit address, clobbers Rx,
|
3813 |
202 |
dgisselq |
% LW Addr(Rx),Rx // unconditional, requires second instruction
|
3814 |
68 |
dgisselq |
%
|
3815 |
202 |
dgisselq |
% - LW.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond
|
3816 |
68 |
dgisselq |
%
|
3817 |
202 |
dgisselq |
% - SW.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid
|
3818 |
68 |
dgisselq |
%
|
3819 |
|
|
% - FARJMP #Addr: // Arbitrary 32-bit jumps require a jump table
|
3820 |
|
|
% BRA +1 // memory address. The BRA +1 can be skipped,
|
3821 |
|
|
% .WORD Addr // but only if the address is placed at the end
|
3822 |
202 |
dgisselq |
% LW -2(PC),PC // of an executable section
|
3823 |
68 |
dgisselq |
%
|