%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename:    spec.tex
%%
%% Project:     Zip CPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose:     This LaTeX file contains all of the documentation/description
%%              currently provided with this Zip CPU soft core.  It supersedes
%%              any information about the instruction set or CPUs found
%%              elsewhere.  It's not nearly as interesting, though, as the PDF
%%              file it creates, so I'd recommend reading that before diving
%%              into this file.  You should be able to find the PDF file in
%%              the SVN distribution together with this LaTeX file and a copy
%%              of the GPL-3.0 license this file is distributed under.  If not,
%%              just type 'make' in the doc directory and it should build
%%              without a problem.
%%
%%
%% Creator:     Dan Gisselquist
%%              Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program.  (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.)  If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License:     GPL, v3, as defined and found on www.gnu.org,
%%              http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%
% From TI about DSPs vs FPGAs:
%       www.ti.com/general/docs/video/foldersGallery.tsp?bkg=gray
%       &gpn=35145&familyid=1622&keyMatch=DSP Breaktime Episode Three
%       &tisearch=Search-EN-Everything&DCMP=leadership
%       &HQS=ep-pro-dsp-leadership-problog-150518-v-en
%
%       FPGAs are annoyingly faster, cheaper, and not quite as power hungry
%       as they used to be.
%
%       Why would you choose DSPs over FPGAs?  If you care about size,
%       if you care about power, or happen to have a complicated algorithm
%       that just isn't simply doing the same thing over and over.
%
%       For complex algorithms that change over time.  Each has its strengths;
%       sometimes you can use both.
%
%       "No assembly required" -- TI tools all C programming, very GUI based
%       environment, very little optimization by hand ...
%
%
% The FPGA's Achilles' heel: Reconfigurability.  It is very difficult, although
% I'm sure major vendors will tell you not impossible, to reconfigure an FPGA
% based upon the need to process time-sensitive data.  If you need one of two
% algorithms, both of which will fit on the FPGA individually but not together,
% switching between them on the fly is next to impossible, whereas switching
% algorithms within a CPU is not difficult at all.  For example, imagine
% receiving a packet and needing to apply one of two data algorithms on the
% packet before sending it back out, and needing to do so fast.  If both
% algorithms don't fit in memory, where does the packet go when you need to
% swap one algorithm out for the other?  And what is the cost of that "context"
% swap?
%
%
\documentclass{gqtekspec}
\usepackage{import}
\usepackage{bytefield}  % Install via apt-get install texlive-science
% \graphicspath{{../gfx}}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.91}
\definecolor{webred}{rgb}{0.5,0,0}
\definecolor{webgreen}{rgb}{0,0.4,0}
\hypersetup{
        ps2pdf,
        pdfpagelabels,
        hypertexnames,
        pdfauthor={Dan Gisselquist},
        pdfsubject={Zip CPU},
        anchorcolor= black,
        colorlinks = true,
        linkcolor  = webred,
        citecolor  = webgreen
}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.91& 7/16/2016 & Gisselquist & Described three more CC bits\\\hline
0.9 & 4/20/2016 & Gisselquist & Modified ISA: LDIHI replaced with MPY; MPYU and MPYS replaced with MPYUHI and MPYSHI, respectively.  LOCK instruction now
permits an intermediate ALU operation. \\\hline
0.8 & 1/28/2016 & Gisselquist & Reduced complexity early branching \\\hline
0.7 & 12/22/2015 & Gisselquist & New Instruction Set Architecture \\\hline
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the Zip CPU. ARM processors are
good and effective. Xilinx makes and markets Microblaze, Altera markets Nios,
and both have better toolsets than the Zip CPU will ever have. OpenRISC is
also available, and RISC--V may be replacing it. Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it, though. There's a lot that I would like to do with a
processor, and I want to be able to do it in a vendor-independent fashion.
First, I would like to be able to place this processor inside an FPGA.  Without
paying royalties, ARM is out of the question.  I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on both Xilinx and Altera chips, and that can be
easily ported from one manufacturer's chipsets to another. Even more, before
purchasing a chip or a board, I would like to know that my soft core works. I
would like to build a test bench to test components with, and Verilator is my
chosen simulator. This forces me to use all Verilog, and it prevents me from
using any proprietary cores. For this reason, Microblaze and Nios are out of
the question.

Why not OpenRISC? That's a hard question. The OpenRISC team has done some
wonderful work on an amazing processor, and I'll have to admit that I am
envious of what they've accomplished. I would like to port binutils, GCC, and
GDB to the Zip CPU, and in this they are way ahead of me. The
OpenRISC processor, however, is complex and hefty at about 4,500 LUTs. It has
a lot of features of modern CPUs within it that ... well, let's just say it's
not the little guy on the block. The Zip CPU is lighter weight, costing only
about 2,300 LUTs with no peripherals, and 3,200 LUTs with some very basic
peripherals.

My final reason is that I'm building the Zip CPU as a learning experience. The
Zip CPU has allowed me to learn a lot about how CPUs work on a very micro
level. For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is
lightweight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}


The original goal of the Zip CPU was to be a very simple CPU.  You might
think of it as a poor man's alternative to the OpenRISC architecture.
For this reason, all instructions have been designed to be as simple as
possible, and the base instructions are all designed to execute in one
clock cycle per instruction, barring pipeline stalls.  Indeed, even the
bus has been simplified to a constant 32-bit width, with no option for more
or less.  This has resulted in the choice to drop push and pop instructions,
pre-increment and post-decrement addressing modes, and more.

For those who like buzz words, the Zip CPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32~bits, addresses are 32~bits,
                instructions are 32~bits wide, etc.
\item A RISC CPU.  There is no microcode for executing instructions.  All
        instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture.  (Only load and store instructions
                can access memory.)
\item Wishbone compliant.  All peripherals are accessed just like
                memory across this bus.
\item A von Neumann architecture.  (The instructions and data share a
                common bus.)
\item A pipelined architecture, having stages for {\bf Prefetch},
                {\bf Decode}, {\bf Read-Operand}, a
                combined stage containing the {\bf ALU},
                {\bf Memory}, {\bf Divide}, and {\bf Floating Point}
                units, and then the final {\bf Write-back} stage.
                See Fig.~\ref{fig:cpu}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
                for a diagram of this structure.
\item Completely open source, licensed under the GPL.\footnote{Should you
        need a copy of the Zip CPU licensed under other terms, please
        contact me.}
\end{itemize}

The Zip CPU also has one unusual feature: the ability to do pipelined loads
and stores.  This allows the CPU to access on-chip memory at one access per
clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a
Chip,'' or a SwiC.  SwiCs have some unique properties internal to them
that have influenced the design of the Zip CPU.  Among these are the bus,
memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture.
This allows these other CPUs to have two separate bus structures: one for the
program fetch, and the other for the memory.  The Zip CPU is fairly unusual in
its approach because it uses a von Neumann architecture.  This was done for
simplicity.  By using a von Neumann architecture, only one bus needs to be
implemented within any FPGA.  This helps to minimize real-estate, while
maintaining a high clock speed.  The disadvantage is that it can severely
degrade the overall instructions per clock count, since instruction fetches
and data accesses must take turns on the same bus.

Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow.  While memory on chip may be accessed at a single
cycle per access, small FPGAs often have only a limited amount of memory on
chip.  Going off chip, however, is expensive.  Two examples will illustrate
this point.  On the XuLA2 board, Flash can be accessed at 128~cycles per
32--bit word, or 64~cycles per subsequent word in a pipelined architecture.
Likewise, the SDRAM chip on the XuLA2 board allows a 6~cycle access for a
write, 10~cycles per read, and 2~cycles for any subsequent pipelined access,
read or write.  Either way you look at it, this memory access will be slow,
and this doesn't account for any logic delays should the bus implementation
logic get complicated.

As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that all memory accesses may be pipelined, and
that pipelined memory access is faster than non--pipelined access.  Therefore,
a SwiC soft core should support pipelined operations, but it should also
allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are
expensive.  If you want to have a minimum of logic, cache logic may not be
the highest on the priority list.

In sum, memory is slow.  While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline.  Any SwiC must be able to deal
with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.
Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily
be created on chip to support the SwiC if necessary.  As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address.  This is cheap within an
FPGA, but expensive in instructions.  Reading from another 16--bit peripheral
might calculate a sine function, where the 16--bit address internal to the
peripheral is the angle of the sine wave.

Indeed, anything that must be done fast within an FPGA is likely to already
be done elsewhere in the fabric.  This leaves the CPU with the simple role
of solely handling sequential tasks that need a lot of state.

This means that the SwiC needs to live within a unique environment,
separate and different from the traditional SoC.  That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose.

\section{Lessons Learned}

Now that I've worked on the Zip CPU for a while, however, I've found that it
is not nearly as simple as I originally hoped.  Worse, I've had to adjust to
create capabilities that I was never expecting to need.  These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
        still necessary to debug this CPU.  That means that there needs to be
        an external register that can control the CPU: reset it, halt it, step
        it, and tell whether it is running or not.  My chosen interface
        includes a second register similar to this control register.  This
        second register allows the external controller or debugger to examine
        registers internal to the CPU.

\item {\bf Internal Debug:} Being able to run a debugger from within
        a user process requires an ability to step a user process from
        within a debugger.  It also requires a break instruction that can
        be substituted for any other instruction, and substituted back.
        The break is actually difficult: the break instruction cannot be
        allowed to execute.  That way, upon a break, the debugger should
        be able to jump back into the user process to step the instruction
        that would've been at the break point initially, and then to
        replace the break after passing it.

        Incidentally, this break messes with the prefetch cache and the
        pipeline: if you change an instruction partially through the pipeline,
        the whole pipeline needs to be flushed.  Likewise, if you change
        an instruction in memory, you need to make sure the cache is reloaded
        with the new instruction.

\item {\bf Prefetch Cache:} My original implementation, {\tt prefetch}, had
        a very simple prefetch stage.  Any time the PC changed, the prefetch
        would go and fetch the new instruction.  While this was perhaps the
        simplest approach, it cost roughly five clocks for every instruction.
        This was deemed unacceptable, as I wanted a CPU that could execute
        instructions in one cycle.

        My second implementation, {\tt pipefetch}, attempted to make the best
        use of pipelined memory.  When a new CPU address was issued, it would
        start reading memory in a pipelined fashion, and issuing instructions
        as soon as they were ready.  This cache was a sliding window in memory.
        It suffered from some difficult performance problems, though.  If the
        CPU was alternating between two widely separated sections of code, both
        could never be in the cache at the same time, causing lots of cache
        misses.  Further, the extra logic to implement this window cost an
        extra clock cycle in the cache implementation, slowing down branches.

        The Zip CPU now has a third cache implementation, {\tt pfcache}.  This
        new implementation takes only a single cycle per access, but costs a
        full cache line miss on any miss.  While configurable, a full cache
        line miss might mean that the CPU needs to read 256~instructions from
        memory before it can execute the first one of them.

\item {\bf Operating System:} In order to support an operating system,
        interrupts and so forth, the CPU needs to support supervisor and
        user modes, as well as a means of switching between them.  For example,
        the user needs a means of executing a system call.  This is the
        purpose of the {\bf `trap'} instruction.  This instruction needs to
        place the CPU into supervisor mode (here equivalent to disabling
        interrupts), as well as hand it a parameter, such as one identifying
        which O/S function was called.

My initial approach to building a trap instruction was to create an external
peripheral which, when written to, would generate an interrupt and could
return the last value written to it.  In practice, this approach didn't work
at all: the CPU executed two more instructions while waiting for the
trap interrupt to take place.  Since then, I've decided to use the rest of
the CC register for that purpose, so that a write to the CC register, with the
GIE bit cleared, could be used to execute a trap.  This has other problems,
though, primarily in the limitation of the uses of the CC register.  In
particular, the CC register is the best place to put CPU state information and
to ``announce'' special CPU features (floating point, etc.).  So the trap
instruction still switches to interrupt mode, but the CC register is not
nearly as useful for telling the supervisor mode processor what trap is being
executed.

Modern timesharing systems also depend upon a {\bf Timer} interrupt
to handle task swapping.  For the Zip CPU, this interrupt is handled
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
The timer module itself is found in {\tt ziptimer.v}.

\item {\bf Bus Errors:} My original implementation had no logic to handle
        what would happen if the CPU attempted to read or write a non-existent
        memory address.  This changed after I needed to troubleshoot a failure
        caused by a subroutine return to a non-existent address.

        My next bus problem was caused by a misbehaving peripheral.
        Whenever the CPU attempted to read from or write to this peripheral,
        the peripheral would take control of the Wishbone bus and not return
        it.  For example, it might never return an {\tt ACK} to signal
        the end of the bus transaction.  This led to the implementation of
        a Wishbone bus watchdog that would create a bus error if any particular
        bus action didn't complete in a timely fashion.

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
        stalls at all, but rather to require the compiler to properly schedule
        all instructions so that stalls would never be necessary.  After trying
        to build such an architecture, I gave up, having learned some things:

        First, an ideal pipeline might look something like
        Fig.~\ref{fig:ideal-pipeline}.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{../gfx/fullpline.eps}
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
\end{center}\end{figure}
        Notice that, in this figure, all the pipeline stages are complete and
        full.  Every instruction takes one clock and there are no delays.
        However, as the discussion above pointed out, the memory associated
        with a SwiC may not allow single clock access.  It may instead be
        that you can only read every two clocks.  In that case, what shall
        the pipeline look like?  Should it look like
        Fig.~\ref{fig:waiting-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttra.eps}
\caption{Instructions wait for each other}\label{fig:waiting-pipeline}
\end{center}\end{figure}
        where instructions are held back until the pipeline is full, or should
        it look like Fig.~\ref{fig:independent-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttrb.eps}
\caption{Instructions proceed independently}\label{fig:independent-pipeline}
\end{center}\end{figure}
        where each instruction is allowed to move through the pipeline
        independently?  For better or worse, the Zip CPU allows instructions
        to move through the pipeline independently.

        One approach to avoiding stalls is to use a branch delay slot,
        such as is shown in Fig.~\ref{fig:brdelay}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdly.eps}
\caption{A typical branch delay slot approach}\label{fig:brdelay}
\end{center}\end{figure}
        In this figure, instructions
        {\tt BR} (a branch) and {\tt BD} (a branch delay instruction)
        are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.
        Since it takes a processor a clock cycle to execute a branch, the
        delay slot allows the processor to do something useful in that
        cycle.  The problem the Zip CPU has with this approach is, what
        happens when the pipeline looks like Fig.~\ref{fig:brbroken}?
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdbroken.eps}
\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}
\end{center}\end{figure}
        In this case, the branch delay slot never gets filled in the first
        place, and so the pipeline squashes it before it gets executed.
        If not that, then what happens when handling interrupts or
        debug stepping: when has the CPU finished an instruction?
        When the {\tt BR} instruction has finished, or must {\tt BD}
        follow every {\tt BR}?  And, again, what if the pipeline isn't
        full?
        These thoughts killed any hopes of doing delayed branching.

        So I switched to a model of discrete execution: Once an instruction
        enters into either the ALU or memory unit, the instruction is
        guaranteed to complete.  If the logic recognizes a branch or a
        condition that would render the instruction entering into this stage
        possibly inappropriate (e.g., a conditional branch preceding a store
        instruction), then the pipeline stalls for one cycle
        until the conditional branch completes.  Then, if it generates a new
        PC address, the stages preceding are all wiped clean.

        This model, however, generated too many pipeline stalls, so the
        discrete execution model was modified to allow instructions to go
        through the ALU unit and be canceled before writeback.  This removed
        the stall associated with ALU instructions before untaken branches.

        The discrete execution model allows such things as sleeping, as
        outlined in Fig.~\ref{fig:sleeping}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/sleep.eps}
\caption{How the CPU halts when sleeping}\label{fig:sleeping}
\end{center}\end{figure}
        If the
        CPU is put to ``sleep,'' the ALU and memory stages stall and back up
        everything before them.  Anything that has already entered the ALU
        or memory stage when the CPU is put to sleep, however, continues to
        completion.
        To handle this logic, each pipeline stage has three control signals:
        a valid signal, a stall signal, and a clock enable signal.  In
        general, a stage stalls if its contents are valid and the next stage
        is stalled.  This allows the pipeline to fill any time a later stage
        stalls, as illustrated in Fig.~\ref{fig:stacking}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stacking.eps}
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
\end{center}\end{figure}
        However, if a pipeline hazard is detected, a stage can stall in order
        to prevent the previous stage from moving forward.

        This approach is also different from other pipeline approaches.
        Instead of keeping the entire pipeline filled, each stage is treated
        independently.  Therefore, individual stages may move forward as long
        as the subsequent stage is available, regardless of whether the stage
        behind it is filled.
\end{itemize}

With that introduction out of the way, let's move on to the instruction
set.

\chapter{CPU Architecture}\label{chap:arch}

The Zip CPU supports a set of two-operand instructions, where the second operand
(always a register) is the result.  The only exception is the store instruction,
where the first operand (always a register) is the source of the data to be
stored.

\section{Simplified Bus}
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
It has been simplified in this fashion: all operations are 32--bit operations.
Because every access is a full 32--bit word, the bus is neither little endian
nor big endian.  All words are 32~bits, and all instructions are also 32~bits
wide.  Everything has been built around the 32--bit word.

\section{Register Set}
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set, as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/regset.eps}
\caption{Zip CPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used in interrupt mode when interrupts are disabled,
whereas the user set is used otherwise.  In each register set, the Program
Counter (PC) is register 15, whereas the status register (SR) or condition
code register (CC) is register 14.  By convention, the stack pointer will be
register 13 and noted as (SP), although there is nothing special about this
register other than this convention.  Also by convention, register~12 will
point to a global offset table, and may be abbreviated as the (GBL) register.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.
522
 
The status register is special, and bears further mention.  As shown in
Tbl.~\ref{tbl:cc-register},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 23 & R & Reserved for future use\\\hline
22\ldots 15 & R/W & Reserved for future use\\\hline
14 & W & Clear I-Cache command\\\hline
13 & R & VLIW instruction phase (1 for first half)\\\hline
12 & R & (Reserved for) Floating Point Exception\\\hline
11 & R & Division by Zero Exception\\\hline
10 & R & Bus-Error Flag\\\hline
9 & R & Trap Flag (or user interrupt).  Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break enable (sCC), or user break (uCC)\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative.  The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}
the lower 15~bits of the status register form
a set of CPU state and condition codes.  Writes to other bits of this register
are preserved.
 
Of the condition codes, the bottom four bits are the current flags:
                Zero (Z),
                Carry (C),
                Negative (N),
                and Overflow (V).
On those instructions that set the flags, these flags will be set based upon
the output of the instruction.  If the result is zero, the Z flag will be set.
If the high order bit is set, the N flag will be set.  If the instruction
caused a bit to fall off the end of the register, the carry bit will be set.
Finally, if the instruction causes a signed integer overflow, the V flag will
be set afterwards.
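
The flag rules above can be illustrated with a short C model.  This is a
sketch, not the Zip CPU's Verilog.  The mask names ({\tt CC\_Z}, {\tt CC\_C},
{\tt CC\_N}, {\tt CC\_V}) and the helpers {\tt zip\_add\_flags()} and
{\tt zip\_sub\_flags()} are hypothetical.  The bit positions follow
Tbl.~\ref{tbl:cc-register}.  Treating the carry after a subtraction as a
borrow is an assumption, chosen to be consistent with {\tt .C} being described
as less-than unsigned in Tbl.~\ref{tbl:conditions}.

\begin{verbatim}
```c
#include <stdint.h>

/* Hypothetical masks for the low four CC bits (see the CC bit table). */
#define CC_Z 0x1u  /* bit 0: result was zero             */
#define CC_C 0x2u  /* bit 1: carry (assumed borrow for -) */
#define CC_N 0x4u  /* bit 2: result's sign bit was set    */
#define CC_V 0x8u  /* bit 3: signed overflow              */

/* Flags for result = a + b. */
static uint32_t zip_add_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    uint32_t f = 0;
    if (r == 0)          f |= CC_Z;
    if (r & 0x80000000u) f |= CC_N;
    if (r < a)           f |= CC_C;  /* a carry fell off bit 31 */
    /* Overflow: both operands share a sign that the result lacks. */
    if (((a ^ r) & (b ^ r)) & 0x80000000u) f |= CC_V;
    return f;
}

/* Flags for result = a - b, as CMP b,a would compute them. */
static uint32_t zip_sub_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    uint32_t f = 0;
    if (r == 0)          f |= CC_Z;
    if (r & 0x80000000u) f |= CC_N;
    if (a < b)           f |= CC_C;  /* assumed: borrow, a < b unsigned */
    /* Overflow: operand signs differ and the result's sign differs from a. */
    if (((a ^ b) & (a ^ r)) & 0x80000000u) f |= CC_V;
    return f;
}
```
\end{verbatim}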
 
The fifth bit (bit~4) is a sleep bit.  Set this bit to one to disable
        instruction execution and put the CPU to sleep, or to zero to keep the
        pipeline running.  Setting this bit will cause the CPU to wait for an
        interrupt (if interrupts are enabled), or to halt completely (if
        interrupts are disabled).  In order to prevent users from halting the
        CPU, only the supervisor is allowed to both put the CPU to sleep and
        disable interrupts.  Any user attempt to do so will simply result in a
        switch to supervisor mode.
 
The sixth bit (bit~5) is a global interrupt enable bit (GIE).  When this
        bit is a `1', interrupts will be enabled, else disabled.  When
        interrupts are disabled, the CPU will be in supervisor mode, otherwise
        it is in user mode.  Thus, to execute a context switch, one need only
        enable or disable interrupts.  (When an interrupt line goes
        high, interrupts will automatically be disabled, as the CPU goes
        and deals with its context switch.)  Special logic has been added to
        keep the user mode from setting the sleep bit and clearing the
        GIE bit at the same time, with clearing the GIE bit taking
        precedence.
 
The seventh bit (bit~6) is a step bit.  This bit can be set from supervisor
        mode only.  After setting this bit, should the supervisor mode process
        switch to user mode, the CPU would then execute one instruction in user
        mode before returning to supervisor mode.  Then, upon return to
        supervisor mode, this bit will be automatically cleared.  This bit has
        no effect on the CPU while in supervisor mode.

        This functionality was added to support debugging a user process one
        instruction at a time, working through supervisor mode.
 
 
The eighth bit (bit~7) is a break enable bit.  When applied to the supervisor
CC register, this controls whether a break instruction in user mode will halt
the processor for an external debugger (break enabled), or whether the break
instruction will simply send the CPU into interrupt mode.  Encountering
a break in supervisor mode will halt the CPU independent of the break enable
bit.  This bit can only be set within supervisor mode.  However, when applied
to the user CC register, from supervisor mode, this bit will indicate whether
the CPU entered supervisor mode because of a break instruction.  This break
reason bit is automatically cleared upon any transition to user mode, although
it can also be cleared by the supervisor writing to the user CC register.
 
% Should break enable be a supervisor mode bit, while the break enable bit
% in user mode is a break has taken place bit?
%
 
This functionality was added to enable an external debugger to
        set and manage breakpoints.
 
The ninth bit (bit~8) is an illegal instruction bit.  When the CPU
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit.  The bit will automatically be
cleared upon any return to user mode.
 
The tenth bit (bit~9) is a trap bit.  It is set whenever the user requests a
soft interrupt, and cleared on any return to userspace command.  This allows
the supervisor, in supervisor mode, to determine whether it got to supervisor
mode from a trap, from an external interrupt, or both.
 
The eleventh bit (bit~10) is a bus error flag.  If the user program encounters
a bus error, this bit will be set in the user CC register and the CPU will
switch to supervisor mode.  The bit may be cleared by the supervisor, otherwise
it is automatically cleared upon any return to user mode.  If the supervisor
encounters a bus error, this bit will be set in the supervisor CC register
and the CPU will halt.  In that case, either a CPU reset or a write to the
supervisor CC register will clear this bit.
 
The twelfth bit (bit~11) is a division by zero exception flag.  This operates
in a fashion similar to the bus error flag.  If the user attempts to use the
divide instruction with a zero denominator, the system will switch to
supervisor mode and set this bit in the user CC register.  The bit is
automatically cleared upon any return to user mode, although it can also be
manually cleared by the supervisor.  In a similar fashion, if the supervisor
attempts to execute a divide by zero, the CPU will halt and set the division
by zero exception flag in the supervisor's CC register.  This will
automatically be cleared upon any CPU reset, or it may be manually cleared by
the external debugger writing to this register.
 
The thirteenth bit (bit~12) will operate in a similar fashion to both the bus
error and division by zero flags, only it will be set upon a (yet to be
determined) floating point error.
 
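The fourteenth bit (bit~13) is read--only and reports the VLIW instruction
phase: it reads as a one during the first half of a VLIW instruction.

Finally, the fifteenth bit (bit~14) is a clear cache bit.  The supervisor may
write a one to this bit in order to clear the CPU instruction cache.  The
bit always reads as a zero.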
 
Some of the upper bits have been temporarily assigned to indicate CPU
capabilities.  This is not a permanent feature, as these upper bits officially
remain reserved.
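
The C function below is a sketch of how a supervisor might use these flags.
It classifies why the CPU left user mode, working from a copy of the user CC
register.  The enumeration and function names are hypothetical, and the bit
positions come from Tbl.~\ref{tbl:cc-register}.  The order in which
simultaneously set bits are checked is an arbitrary choice made for
illustration.  The sketch does not cover reading the user CC register itself,
for example with a {\tt MOV} from the user register set.

\begin{verbatim}
```c
#include <stdint.h>

/* Bit positions from the condition code register table. */
#define CC_BREAK  (1u << 7)   /* user break (uCC)            */
#define CC_ILL    (1u << 8)   /* illegal instruction         */
#define CC_TRAP   (1u << 9)   /* trap / user soft interrupt  */
#define CC_BUSERR (1u << 10)  /* bus error                   */
#define CC_DIVERR (1u << 11)  /* division by zero            */
#define CC_FPERR  (1u << 12)  /* floating point (reserved)   */

/* Hypothetical names for why the CPU entered supervisor mode. */
enum entry_cause {
    CAUSE_OTHER,    /* none of the bits below, e.g. an external interrupt */
    CAUSE_BREAK,
    CAUSE_ILLEGAL,
    CAUSE_TRAP,
    CAUSE_BUSERR,
    CAUSE_DIVZERO,
    CAUSE_FPERR
};

/* Classify a user CC value.  Exceptions are checked before the trap bit,
 * an ordering chosen only for this sketch. */
static enum entry_cause classify_entry(uint32_t ucc)
{
    if (ucc & CC_BUSERR) return CAUSE_BUSERR;
    if (ucc & CC_DIVERR) return CAUSE_DIVZERO;
    if (ucc & CC_FPERR)  return CAUSE_FPERR;
    if (ucc & CC_ILL)    return CAUSE_ILLEGAL;
    if (ucc & CC_BREAK)  return CAUSE_BREAK;
    if (ucc & CC_TRAP)   return CAUSE_TRAP;
    return CAUSE_OTHER;
}
```
\end{verbatim}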
 
\section{Instruction Format}
All Zip CPU instructions fit in one of the formats shown in
Fig.~\ref{fig:iset-format}.
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR}
                \bitbox[lrt]{5}{OpCode}
                \bitbox[lrt]{3}{Cnd}
                \bitbox{1}{0}
                \bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{0}\bitbox{4}{DR}
                \bitbox[lrb]{5}{}
                \bitbox[lrb]{3}{}
                \bitbox{1}{1}
                \bitbox{4}{BR}
                \bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR}
                \bitbox[lrt]{5}{5'hf}
                \bitbox[lrt]{3}{Cnd}
                \bitbox{1}{A}
                \bitbox{4}{BR}
                \bitbox{1}{B}
                \bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR}
                \bitbox{4}{4'hb}
                \bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
                \bitbox{1}{}
                \bitbox{2}{11}
                \bitbox{3}{xxx}
                \bitbox{22}{Ignored}
                \end{leftwordgroup} \\
\begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR}
                \bitbox[lrt]{5}{OpCode}
                \bitbox[lrt]{3}{Cnd}
                \bitbox{1}{0}
                \bitbox{4}{Imm.}
                \bitbox{14}{---} \\
\bitbox{1}{1}\bitbox[lr]{4}{}
                \bitbox[lrb]{5}{}
                \bitbox[lr]{3}{}
                \bitbox{1}{1}
                \bitbox{4}{BR}
                \bitbox{14}{---}        \\
\bitbox{1}{1}\bitbox[lrb]{4}{}
                \bitbox{4}{4'hb}
                \bitbox{1}{}
                \bitbox[lrb]{3}{}
                \bitbox{5}{5'b Imm}
                \bitbox{14}{---}        \\
\bitbox{1}{1}\bitbox{9}{---}
                \bitbox[lrt]{3}{Cnd}
                \bitbox{5}{---}
                \bitbox[lrt]{4}{DR}
                \bitbox[lrt]{5}{OpCode}
                \bitbox{1}{0}
                \bitbox{4}{Imm}
                \\
\bitbox{1}{1}\bitbox{9}{---}
                \bitbox[lr]{3}{}
                \bitbox{5}{---}
                \bitbox[lr]{4}{}
                \bitbox[lrb]{5}{}
                \bitbox{1}{1}
                \bitbox{4}{Reg} \\
\bitbox{1}{1}\bitbox{9}{---}
                \bitbox[lrb]{3}{}
                \bitbox{5}{---}
                \bitbox[lrb]{4}{}
                \bitbox{4}{4'hb}
                \bitbox{1}{}
                \bitbox{5}{5'b Imm}
                \end{leftwordgroup} \\
\end{bytefield}
\caption{Zip Instruction Set Format}\label{fig:iset-format}
\end{center}\end{figure}
The basic format is that some operation, defined by the OpCode, is applied
if a condition, Cnd, is true in order to produce a result which is placed in
the destination register, or DR.  The 23--bit signed load immediate
instruction (LDI) is different in that it accepts no conditions, and uses only
a 4-bit opcode.
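
A small C decoder makes the formats of Fig.~\ref{fig:iset-format} concrete.
This is a sketch written from the figure, not a transcription of the Verilog
decoder, and the structure and function names are hypothetical.  It covers
only the standard and {\tt LDI} formats.  It returns false for VLIW words and
for {\tt MOV}, whose A and B bits would need separate handling, and it does not
single out {\tt NOOP}, {\tt BREAK}, or {\tt LOCK}.  Note that the {\tt LDI}
immediate overlaps the low bit of the opcode field.  This is why both 5'h16
and 5'h17 decode as {\tt LDI}.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded form of a non-VLIW Zip CPU instruction word. */
struct zip_insn {
    unsigned dr;      /* destination register, bits 30..27      */
    unsigned op;      /* 5-bit opcode, bits 26..22              */
    unsigned cnd;     /* condition, bits 21..19 (0 for LDI)     */
    bool     has_reg; /* Operand B includes a register (bit 18) */
    unsigned br;      /* Operand B register, bits 17..14        */
    int32_t  imm;     /* sign-extended immediate                */
};

/* Sign-extend the low 'bits' bits of v (1 <= bits <= 31). */
static int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t m = 1u << (bits - 1);
    v &= (1u << bits) - 1u;
    return (int32_t)((v ^ m) - m);
}

/* Decode a standard or LDI word.  Returns false for VLIW words (bit 31
 * set) and for MOV (5'h0f), whose A/B user bits this sketch ignores. */
static bool zip_decode(uint32_t w, struct zip_insn *d)
{
    if (w & 0x80000000u)
        return false;
    d->dr = (w >> 27) & 0xfu;
    d->op = (w >> 22) & 0x1fu;
    d->cnd = 0;
    d->has_reg = false;
    d->br = 0;
    if ((d->op >> 1) == 0xbu) {   /* 5'h16 or 5'h17: LDI */
        d->imm = sext(w, 23);     /* immediate overlaps the opcode LSB */
        return true;
    }
    if (d->op == 0x0fu)           /* MOV: not handled here */
        return false;
    d->cnd = (w >> 19) & 0x7u;
    d->has_reg = (w >> 18) & 1u;
    if (d->has_reg) {
        d->br = (w >> 14) & 0xfu;
        d->imm = sext(w, 14);
    } else {
        d->imm = sext(w, 18);
    }
    return true;
}
```
\end{verbatim}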
 
This is actually the second version of the instruction set definition, given
certain lessons learned.  For example, the original instruction set had the
following problems:
\begin{enumerate}
\item No opcodes were available for divide or floating point extensions.
        Although there was space in the instruction set to
        add these types of instructions, this instruction space was going to
        require extra logic to use.
\item The carveouts for instructions such as NOOP and LDIHI/LDILO required
        extra logic to process.
\item The instruction set wasn't very compact.  One bus operation was required
        for every instruction.
\item While the CPU supported multiplies, they were only 16x16 bit multiplies.
\end{enumerate}
This second version was designed with two criteria.  The first was that the
new instruction set needed to be compatible, at the assembly language level,
with the previous instruction set.  Thus, it must be able to support all of
the previous mnemonics and more.  This was achieved with the sole exception
that instruction immediates are generally two bits shorter than before.
(One bit was lost to the VLIW bit in front, another from changing from 4--bit
to 5--bit opcodes.)  Second, the new instruction set needed to be a drop--in
replacement for the decoder, modifying nothing else.  This was almost achieved,
save for two issues: the ALU unit needed to be replaced since the OpCodes
were reordered, and some condition code logic needed to be adjusted since the
condition codes were renumbered as well.  In the end, maximum reuse of the
existing RTL (Verilog) code was achieved in this upgrade.
 
As of this second version of the Zip CPU instruction set, the Zip CPU also
supports a very long instruction word (VLIW) set of instructions.  These
instruction formats pack two instructions into a single instruction word,
trading immediate instruction space to do this, but in just about all other
respects these are identical to two standard instructions.  Other than
instruction format, the only basic difference is that the CPU will not switch
to interrupt mode in between the two instructions.  Likewise, a new job given
to the assembler is that of automatically packing as many instructions as
possible into the VLIW format.  Where necessary to place both VLIW instructions
on the same line, they will be separated by a vertical bar.
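
The bit positions in Fig.~\ref{fig:iset-format} can be used to sketch in C
how a VLIW word splits into two half--instructions.  The names are
hypothetical.  The shared three--bit condition field is returned raw rather
than interpreted.  Once the condition bits (21--19) are removed, the first half
has the same layout as the second half's 14~bits: DR, opcode, register flag,
then a 4--bit register or immediate.  One helper therefore decodes both halves.
A half whose opcode starts with 4'hb is treated as the 5--bit load immediate.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded half of a VLIW instruction word. */
struct vliw_half {
    unsigned dr;      /* destination register                */
    unsigned op;      /* 5-bit opcode                        */
    bool     is_ldi;  /* 4'hb prefix: 5-bit load immediate   */
    bool     has_reg; /* operand B is a register             */
    unsigned br;      /* operand B register (if has_reg)     */
    int32_t  imm;     /* sign-extended immediate otherwise   */
};

static int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t m = 1u << (bits - 1);
    v &= (1u << bits) - 1u;
    return (int32_t)((v ^ m) - m);
}

/* Decode a 14-bit half: DR in bits 13..10, opcode in 9..5, register flag
 * in bit 4, register/immediate in 3..0 (5-bit immediate in 4..0 for LDI). */
static struct vliw_half decode_half(uint32_t h)
{
    struct vliw_half d = {0};
    d.dr = (h >> 10) & 0xfu;
    d.op = (h >> 5) & 0x1fu;
    d.is_ldi = (d.op >> 1) == 0xbu;
    if (d.is_ldi) {
        d.imm = sext(h, 5);
    } else if ((h >> 4) & 1u) {
        d.has_reg = true;
        d.br = h & 0xfu;
    } else {
        d.imm = sext(h, 4);
    }
    return d;
}

/* Split a VLIW word (bit 31 set) into its halves and the shared Cnd bits. */
static bool split_vliw(uint32_t w, struct vliw_half *a,
                       struct vliw_half *b, unsigned *cnd)
{
    if (!(w & 0x80000000u))
        return false;
    *cnd = (w >> 19) & 0x7u;
    /* First half: bits 30..22 (DR, opcode) followed by bits 18..14. */
    *a = decode_half((((w >> 22) & 0x1ffu) << 5) | ((w >> 14) & 0x1fu));
    /* Second half: bits 13..0. */
    *b = decode_half(w & 0x3fffu);
    return true;
}
```
\end{verbatim}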
 
One belated change to the instruction set violates some of the above
principles.  This later instruction set change replaced the {\tt LDIHI}
instruction with a 32--bit multiply instruction {\tt MPY}, and then replaced
the two 16--bit multiply instructions {\tt MPYU} and {\tt MPYS} with
{\tt MPYUHI} and {\tt MPYSHI} respectively.  This creates a 32--bit
multiply capability, while removing the 16--bit multiply that wasn't very
useful.  Further, the {\tt LDIHI} instruction was being used primarily by the
assembler and linker to create a 32--bit load immediate pair of instructions.
This instruction pair, {\tt LDIHI} followed by {\tt LDILO}, was
replaced with an equivalent pair, {\tt BREV} followed by {\tt LDILO},
although linking has become more complicated in the process.
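
A short C model can check the {\tt BREV}/{\tt LDILO} replacement for a 32--bit
load immediate.  The sketch makes two assumptions.  First, {\tt BREV} places
the bit--reversed 32--bit Operand~B into the destination register.  Second,
{\tt LDILO} replaces only the low 16~bits of the destination and keeps the
upper 16.  The helper names are hypothetical.  Under these assumptions the
{\tt BREV} immediate is the bit--reversal of the constant's upper half.  That
value always fits in the 18--bit signed immediate field as a non--negative
number.

\begin{verbatim}
```c
#include <stdint.h>

/* Reverse all 32 bits of v (assumed BREV semantics). */
static uint32_t brev32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Assumed LDILO semantics: replace the low 16 bits, keep the upper 16. */
static uint32_t ldilo(uint32_t reg, uint32_t imm)
{
    return (reg & 0xffff0000u) | (imm & 0xffffu);
}

/* BREV immediate that sets the upper 16 bits of 'value' (0..65535). */
static uint32_t brev_immediate(uint32_t value)
{
    return brev32(value) & 0xffffu;   /* bit-reversed upper half */
}

/* Simulate: BREV brev_immediate(value),Rx ; LDILO (value & 0xffff),Rx */
static uint32_t load32(uint32_t value)
{
    uint32_t rx = brev32(brev_immediate(value));  /* upper half, low = 0 */
    return ldilo(rx, value & 0xffffu);
}
```
\end{verbatim}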
 
\section{Instruction OpCodes}
With a 5--bit opcode field, there are 32 possible instructions as shown in
Tbl.~\ref{tbl:iset-opcodes}.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85}
OpCode & Mnemonic & Instruction &Sets CC \\\hline\hline
5'h00 & SUB & Subtract &   \\\cline{1-3}
5'h01 & AND & Bitwise And &   \\\cline{1-3}
5'h02 & ADD & Add two numbers &   \\\cline{1-3}
5'h03 & OR  & Bitwise Or & Y \\\cline{1-3}
5'h04 & XOR & Bitwise Exclusive Or &   \\\cline{1-3}
5'h05 & LSR & Logical Shift Right &   \\\cline{1-3}
5'h06 & LSL & Logical Shift Left &   \\\cline{1-3}
5'h07 & ASR & Arithmetic Shift Right &   \\\hline
5'h08 & MPY & 32x32 bit multiply & Y \\\hline
5'h09 & LDILO & Load Immediate Low & N\\\hline
5'h0a & MPYUHI & Upper 32 of 64 bits from an unsigned 32x32 multiply &  \\\cline{1-3}
5'h0b & MPYSHI & Upper 32 of 64 bits from a signed 32x32 multiply & Y \\\cline{1-3}
5'h0c & BREV & Bit Reverse &  \\\cline{1-3}
5'h0d & POPC& Population Count &  \\\cline{1-3}
5'h0e & ROL & Rotate left &   \\\hline
5'h0f & MOV & Move register & N \\\hline
5'h10 & CMP & Compare & Y \\\cline{1-3}
5'h11 & TST & Test (AND w/o setting result) &   \\\hline
5'h12 & LOD & Load from memory & N \\\cline{1-3}
5'h13 & STO & Store a register into memory &  \\\hline\hline
5'h14 & DIVU & Divide, unsigned & Y \\\cline{1-3}
5'h15 & DIVS & Divide, signed &  \\\hline\hline
5'h16/7 & LDI & Load 23--bit signed immediate & N \\\hline\hline
5'h18 & FPADD & Floating point add &  \\\cline{1-3}
5'h19 & FPSUB & Floating point subtract &   \\\cline{1-3}
5'h1a & FPMPY & Floating point multiply & Y \\\cline{1-3}
5'h1b & FPDIV & Floating point divide &   \\\cline{1-3}
5'h1c & FPCVT & Convert integer to floating point &   \\\cline{1-3}
5'h1d & FPINT & Convert to integer &   \\\hline
5'h1e & & {\em Reserved for future use} &\\\hline
5'h1f & & {\em Reserved for future use} &\\\hline
5'h18 & & NOOP (A-register = PC)&\\\cline{1-3}
5'h19 & & BREAK (A-register = PC)& N\\\cline{1-3}
5'h1a & & LOCK (A-register = PC)&\\\hline
\end{tabular}
\caption{Zip CPU OpCodes}\label{tbl:iset-opcodes}
\end{center}\end{table}
%
Of these opcodes, the {\tt BREV} and {\tt POPC} are experimental, and may be
replaced later, and two floating point instruction opcodes, 5'h1e and 5'h1f,
are reserved for future use.
 
\section{Conditional Instructions}
Most, although not quite all, instructions may be conditionally executed.
The 23--bit load immediate instruction, together with the {\tt NOOP},
{\tt BREAK}, and {\tt LOCK} instructions, are the only exceptions to this rule.
 
From the four condition code flags, eight conditions are defined for standard
instructions.  These are shown in Tbl.~\ref{tbl:conditions}.
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .LT} & Less than (`N' set) \\
3'h2 & {\tt .Z} & Only execute when `Z' is set \\
3'h3 & {\tt .NZ} & Only execute when `Z' is not set \\
3'h4 & {\tt .GT} & Greater than (`N' not set, `Z' not set) \\
3'h5 & {\tt .GE} & Greater than or equal (`N' not set, `Z' irrelevant) \\
3'h6 & {\tt .C} & Carry set (also known as less-than unsigned) \\
3'h7 & {\tt .V} & Overflow set\\
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}\end{table}
There is no condition code for less than or equal, not C, or not V: there
just wasn't enough space in three bits.  Conditioning on an unsupported
condition is still possible, but it will take an extra instruction and a
pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$2,CC;} \hbox{\tt
STO.Z R0,(R1)}, which stores only when C is clear.)  As an alternative, it is
often possible to reverse the condition, thus recovering those extra two
clocks.  Thus, instead of
\hbox{\tt CMP Rx,Ry;} \hbox{\tt BNC label} you can issue
\hbox{\tt CMP 1+Ry,Rx;} \hbox{\tt BC label}.
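
The conditions of Tbl.~\ref{tbl:conditions} can be written as a C predicate
over the four flags.  The function names are hypothetical, and the flag masks
follow Tbl.~\ref{tbl:cc-register}.  The second function shows the {\tt TST}
approach for a condition that has no code of its own.  {\tt TST} sets
{\tt Z} when the tested CC bit is clear.  A following {\tt .Z} instruction
therefore runs exactly when, in this case, the carry flag is clear.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

#define CC_Z 0x1u
#define CC_C 0x2u
#define CC_N 0x4u
#define CC_V 0x8u

/* Evaluate a 3-bit standard condition code against the CC flags. */
static bool zip_cond(unsigned cnd, uint32_t cc)
{
    bool z = cc & CC_Z, c = cc & CC_C, n = cc & CC_N, v = cc & CC_V;
    switch (cnd & 7u) {
    case 0:  return true;        /* always */
    case 1:  return n;           /* .LT    */
    case 2:  return z;           /* .Z     */
    case 3:  return !z;          /* .NZ    */
    case 4:  return !n && !z;    /* .GT    */
    case 5:  return !n;          /* .GE    */
    case 6:  return c;           /* .C     */
    default: return v;           /* .V     */
    }
}

/* "Carry clear" via TST: TST sets Z when (CC & mask) == 0, so a following
 * .Z instruction runs exactly when the tested bit is clear. */
static bool carry_clear_via_tst(uint32_t cc)
{
    uint32_t flags_after_tst = ((cc & CC_C) == 0) ? CC_Z : 0u;
    return zip_cond(2, flags_after_tst);   /* the .Z condition */
}
```
\end{verbatim}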
 
Conditionally executed instructions will not further adjust the
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
instructions.  Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
will adjust conditions whenever they are executed.  In this way,
multiple conditions may be evaluated without branches.  For example, to do
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
code such as Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{tabular}{l}
        {\tt CMP 1,R0} \\
        {;\em Condition codes are now set based upon R0-1} \\
        {\tt CMP.Z 2,R1} \\
        {;\em If R0 $\neq$ 1, conditions are unchanged.} \\
        {;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\
        {;\em Now do something based upon the conjunction of both conditions.} \\
        {;\em While we use the example of a STO, it could be any instruction.} \\
        {\tt STO.Z R0,(R2)} \\
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}
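
The double conditional of Tbl.~\ref{tbl:dbl-condition} relies on a
conditional {\tt CMP} leaving the flags untouched when its condition fails.
Below is a minimal C model with hypothetical function names.  It assumes that
{\tt CMP b,a} sets {\tt Z} exactly when {\tt a-b} is zero, and it ignores the
other flags.  The model shows that the final {\tt .Z} test holds only when
both comparisons are equal.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

/* Z flag after CMP b,a (only Z is modelled here). */
static bool cmp_z(uint32_t a, uint32_t b)
{
    return a - b == 0;
}

/* CMP 1,R0 ; CMP.Z 2,R1 ; STO.Z ...  Returns whether the STO runs. */
static bool double_condition(uint32_t r0, uint32_t r1)
{
    bool z = cmp_z(r0, 1);   /* CMP 1,R0: flags from R0-1            */
    if (z)                   /* CMP.Z executes only while Z is set   */
        z = cmp_z(r1, 2);    /* CMP.Z 2,R1: flags from R1-2          */
    return z;                /* STO.Z executes when Z is set         */
}
```
\end{verbatim}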
 
In the case of VLIW instructions, only four conditions are defined as shown
in Tbl.~\ref{tbl:vliw-conditions}.
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
2'h0 & None & Always execute the instruction \\
2'h1 & {\tt .LT} & Less than (`N' set) \\
2'h2 & {\tt .Z} & Only execute when `Z' is set \\
2'h3 & {\tt .NZ} & Only execute when `Z' is not set \\
\end{tabular}
\caption{VLIW Conditions}\label{tbl:vliw-conditions}
\end{center}\end{table}
Further, the first bit of the three--bit condition field is given a special
meaning.  If this bit is set, the conditions apply to the second half of the
instruction, otherwise the conditions will only apply to the first half of a
conditional instruction.
Of course, the other conditions are still available by mingling
non--VLIW instructions with VLIW instructions.
 
\section{Operand B}
Many instruction forms have a 19-bit source ``Operand B'' associated with them.
This ``Operand B'' is shown in Fig.~\ref{fig:iset-format} as part of the
standard instructions.  This Operand B is either equal to a register plus a
14--bit signed immediate offset, or an 18--bit signed immediate offset by
itself.  This value is encoded as shown in Tbl.~\ref{tbl:opb}.
\begin{table}\begin{center}
\begin{bytefield}[endianness=big]{19}
\bitheader{0-18}  \\
\bitbox{1}{0}\bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{1}\bitbox{4}{Reg}\bitbox{14}{14-bit Signed Immediate}
\end{bytefield}
\caption{Bit allocation for Operand B}\label{tbl:opb}
\end{center}\end{table}
 
Fourteen and eighteen bit immediate values don't make sense for all
instructions.  For example, what is the point of an 18--bit immediate when
executing a 16--bit load--immediate ({\tt LDILO})?  In such cases,
the extra bits are simply ignored.
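
Evaluating Operand~B from its 19 bits can be sketched in C as follows.  The
register file array and the function name are hypothetical.  Any special
handling of the PC or CC as a base register is ignored.  The sketch only
shows the two encodings of Tbl.~\ref{tbl:opb} and the sign extension each one
needs.

\begin{verbatim}
```c
#include <stdint.h>

/* Sign-extend the low 'bits' bits of v (1 <= bits <= 31). */
static int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t m = 1u << (bits - 1);
    v &= (1u << bits) - 1u;
    return (int32_t)((v ^ m) - m);
}

/* Evaluate a 19-bit Operand B field against a hypothetical register file. */
static uint32_t operand_b(uint32_t field, const uint32_t regs[16])
{
    if (field & (1u << 18)) {                 /* register + 14-bit offset */
        unsigned r = (field >> 14) & 0xfu;
        return regs[r] + (uint32_t)sext(field, 14);
    }
    return (uint32_t)sext(field, 18);         /* 18-bit immediate alone  */
}
```
\end{verbatim}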
 
VLIW instructions still use the same operand B, only there was no room for any
register plus immediate addressing.  Therefore, VLIW instructions have either
a register or a 4--bit signed immediate as their operand B.  The only exception
is the load immediate instruction, which permits a 5--bit signed operand
B.\footnote{Although the space exists to extend this VLIW load immediate
instruction to six bits, the 5--bit limit was chosen to simplify the
disassembler.  This may change in the future.}
 
\section{Address Modes}
The Zip CPU supports two addressing modes: register plus immediate, and
immediate address.  Addresses are therefore encoded in the same fashion as
Operand B's, shown above.  Practically, the VLIW instruction set only offers
register addressing, necessitating a non--VLIW instruction for most memory
operations.
 
A lot of long hard thought was put into whether to allow pre/post increment
and decrement addressing modes.  Finding no way to use these operators without
taking two or more clocks per instruction,\footnote{The two clocks figure
comes from the design of the register set, allowing only one write per clock.
That write is either from the memory unit or the ALU, but never both.} these
addressing modes have been
removed from the realm of possibilities.  This means that the Zip CPU has no
native way of executing push, pop, return, or jump to subroutine operations.
Each of these instructions can be emulated with a set of instructions from the
existing set.
 
\section{Modifying Conditions}
A quick look at the list of conditions supported by the Zip CPU and listed
in Tbl.~\ref{tbl:conditions} reveals that the Zip CPU does not have a full set
of conditions.  In particular, only one explicit unsigned condition is
supported.  Therefore, Tbl.~\ref{tbl:creating-conditions}
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
Original & Modified & Name \\\hline\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
        & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BLT label}
        & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}
        & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BC label}
        & Less-than or equal unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label}    % if (Ry > Rx) -> Rx < Ry
        & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}
        & Greater-than unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGEU label}    % if (Ry >= Rx) -> Rx <= Ry -> Rx < Ry+1
        & \parbox[t]{1.5in}{\tt CMP 1+Ry,Rx\\BC label}
        & Greater-than or equal unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP A+Rx,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A
        & \parbox[t]{1.5in}{\tt CMP (1-A)+Ry,Rx\\BC label}
        & Greater-than or equal unsigned (with offset)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP A,Ry\\BGEU label} % if (Ry >= A) -> A-1 < Ry
        & \parbox[t]{1.5in}{\tt LDI (A-1),Rx\\CMP Ry,Rx\\BC label}
        & Greater-than or equal comparison with a constant\\[4mm]\hline
\end{tabular}
\caption{Modifying conditions}\label{tbl:creating-conditions}
\end{center}\end{table}
shows examples of how these unsupported conditions can be created
simply by adjusting the compare instruction, usually at no extra cost in
clocks.  Of course, if the compare originally had an immediate within it, that
immediate would need to be loaded into a register in order to do some of
these compares.  This case is shown as the last case above.
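
A C model of the compare can check the rewrites in
Tbl.~\ref{tbl:creating-conditions} over a set of sample values.  The sketch
assumes that {\tt CMP B,A} sets the carry flag exactly when {\tt A < B} as
unsigned 32--bit values, which is how Tbl.~\ref{tbl:conditions} describes the
{\tt .C} condition.  The function names are hypothetical.  The model also
exposes one caveat of the {\tt 1+Rx} forms.  When the incremented register
holds 32'hffffffff, the sum wraps to zero, and the rewritten comparison no
longer matches the original.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: after CMP b,a the C flag is set when a < b (unsigned). */
static bool cmp_c(uint32_t a, uint32_t b) { return a < b; }

/* BLEU after CMP Rx,Ry means Ry <= Rx; rewritten as CMP 1+Rx,Ry ; BC. */
static bool bleu_rewrite(uint32_t rx, uint32_t ry) { return cmp_c(ry, rx + 1u); }

/* BGTU after CMP Rx,Ry means Ry > Rx; rewritten as CMP Ry,Rx ; BC. */
static bool bgtu_rewrite(uint32_t rx, uint32_t ry) { return cmp_c(rx, ry); }

/* BGEU after CMP Rx,Ry means Ry >= Rx; rewritten as CMP 1+Ry,Rx ; BC. */
static bool bgeu_rewrite(uint32_t rx, uint32_t ry) { return cmp_c(rx, ry + 1u); }

/* Count mismatches against the intended unsigned relations.  The sample
 * set deliberately excludes 0xffffffff, where the +1 forms wrap. */
static int count_mismatches(void)
{
    static const uint32_t v[] = { 0u, 1u, 2u, 0x7fffffffu,
                                  0x80000000u, 0xfffffffeu };
    const unsigned n = sizeof v / sizeof v[0];
    int bad = 0;
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++) {
            uint32_t rx = v[i], ry = v[j];
            bad += bleu_rewrite(rx, ry) != (ry <= rx);
            bad += bgtu_rewrite(rx, ry) != (ry > rx);
            bad += bgeu_rewrite(rx, ry) != (ry >= rx);
        }
    return bad;
}
```
\end{verbatim}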
 
\section{Move Operands}
The previous set of operands would be perfect and complete, save only that
the CPU needs access to non--supervisory registers while in supervisory mode.
Therefore, the MOV instruction is special and offers access to these registers
\ldots when in supervisory mode.  To keep the compiler simple, the extra bits
are ignored in non-supervisory mode (as though they didn't exist), rather than
being mapped to new instructions or additional capabilities.  The bits
indicating which register set each register lies within are the A-User bit,
marked `A' in Fig.~\ref{fig:iset-format}, and the B-User bit, marked `B'.
When set to a one, these refer to a user mode register.  When set to a zero,
these refer to a register in the current mode, whether user or supervisor.
Further, because a load immediate instruction exists, there is no move
capability between an immediate and a register: all moves come from either a
register or a register plus an offset.
 
This actually leads to a bit of a problem: since the {\tt MOV} instruction
encodes which register set each register is coming from or moving to, how shall
a compiler or assembler know how to compile a MOV instruction without knowing
the mode of the CPU at the time?  For this reason, the compiler will assume
all MOV registers are supervisor registers, and display them as normal.
Anything with the user bit set will be treated as a user register and displayed
specially.  Since the CPU quietly ignores these bits while in user mode,
anything marked as a user register will always refer to a user register,
whichever mode the CPU is in.
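
A small C function can summarize the register--set selection for {\tt MOV}.
Its name is hypothetical, as is its flat numbering of the 32 registers, with
the supervisor set at 0--15 and the user set at 16--31.  The function only
encodes the rules stated above.  A set A or B bit selects a user register,
and a clear bit selects the register in the current mode.  Both bits are
ignored in user mode.

\begin{verbatim}
```c
#include <stdbool.h>

/* Hypothetical flat numbering: 0..15 supervisor set, 16..31 user set. */
static unsigned mov_reg_index(unsigned reg, bool user_bit,
                              bool cpu_in_user_mode)
{
    reg &= 0xfu;
    if (cpu_in_user_mode)        /* the A/B bits are ignored in user mode */
        return 16u + reg;
    return user_bit ? 16u + reg  /* explicit user register               */
                    : reg;       /* current (supervisor) register set    */
}
```
\end{verbatim}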
 
\section{Multiply Operations}
 
The ZipCPU originally only supported 16x16 multiply operations.  GCC, however,
wanted 32x32-bit operations, and building these from 16x16-bit multiplies
is painful.  Therefore, the ZipCPU was modified to support 32x32-bit multiplies.
 
In particular, the ZipCPU supports three separate 32x32-bit multiply
instructions: {\tt MPY}, {\tt MPYUHI}, and {\tt MPYSHI}.  The first of these
produces the low 32-bits of a 32x32-bit multiply result.  The other two
produce the upper 32-bits.  Of these, {\tt MPYUHI} produces the upper 32-bits
assuming the multiply was unsigned, whereas {\tt MPYSHI} assumes it was signed.
The three multiply instructions execute independently of each other, although
the compiler may use them together.
 
In an effort to maintain single clock pipeline timing, all three of these
multiplies have been slowed down.  Depending upon the setting
of {\tt OPT\_MULTIPLY} within {\tt cpudefs.v}, the multiply instructions
will either 1)~cause an ILLEGAL instruction error, 2)~take one additional clock,
or 3)~take two additional clocks.
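
The results of the three multiply instructions can be modelled in C with a
64--bit product.  This is a reference sketch of what each instruction
returns, not of how the hardware computes it, and the function names are
hypothetical.  The sketch also shows one way the instructions combine:
{\tt MPY} together with {\tt MPYUHI} (or {\tt MPYSHI}) yields the full unsigned
(or signed) 64--bit product.

\begin{verbatim}
```c
#include <stdint.h>

/* Low 32 bits of the product: identical for signed and unsigned. */
static uint32_t mpy(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Upper 32 bits of the unsigned 64-bit product. */
static uint32_t mpyuhi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Upper 32 bits of the signed 64-bit product. */
static uint32_t mpyshi(uint32_t a, uint32_t b)
{
    int64_t p = (int64_t)(int32_t)a * (int32_t)b;
    return (uint32_t)((uint64_t)p >> 32);
}
```
\end{verbatim}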
 
 
\section{Divide Unit}
The Zip CPU also has a divide unit which can be built alongside the ALU.
This divide unit provides the Zip CPU with another two instructions that
cannot be executed in a single cycle: {\tt DIVS}, or signed divide, and
{\tt DIVU}, the unsigned divide.  These are both 32--bit divide instructions,
dividing one 32--bit number by another.  In this case, the Operand B field,
whether it be register or register plus immediate, constitutes the denominator,
whereas the numerator is given by the other register, the destination
register DR.
 
The Divide is also a multi--clock instruction.  While the divide is running,
the ALU, any memory loads, and the floating point unit (if installed) will be
idle.  Once the divide completes, other units may continue.
 
Of course, divides can have errors: division by zero.  In the case of division
by zero, an exception will be raised that will either send the CPU from
user mode to supervisor mode, or halt the CPU if it is already in supervisor
mode.
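
The divide semantics above can be sketched in C.  The names are hypothetical.
The sketch assumes that the quotient replaces the numerator in DR, as the
general instruction format suggests, and that {\tt DIVS} truncates toward zero
as C does.  The text does not state what {\tt DIVS} returns for the most
negative integer divided by $-1$.  The sketch therefore refuses that case
rather than guess.

\begin{verbatim}
```c
#include <stdbool.h>
#include <stdint.h>

/* Returns false (a division by zero exception) when the denominator is 0. */
static bool zip_divu(uint32_t *dr, uint32_t opb)
{
    if (opb == 0)
        return false;          /* the CPU would trap or halt here */
    *dr = *dr / opb;
    return true;
}

/* Signed divide, assumed to truncate toward zero like C. */
static bool zip_divs(int32_t *dr, int32_t opb)
{
    if (opb == 0)
        return false;          /* division by zero exception */
    if (*dr == INT32_MIN && opb == -1)
        return false;          /* result not specified above; not modelled */
    *dr = *dr / opb;
    return true;
}
```
\end{verbatim}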
 
\section{NOOP, BREAK, and Bus Lock Instructions}
Three instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes} are
somewhat special.  These are the {\tt NOOP}, {\tt BREAK}, and bus {\tt LOCK}
instructions.  These are encoded according to
Fig.~\ref{fig:iset-noop}, and have the following meanings:
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\begin{leftwordgroup}{NOOP}
\bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{}
        \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{Ignored} \\
\bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{}
        \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{---} \\
\bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---}
        \bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{000}
        \bitbox{5}{Ignored}
                \end{leftwordgroup} \\
\begin{leftwordgroup}{BREAK}
\bitbox{1}{0}\bitbox{3}{3'h7}
                \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Ignored}
                \end{leftwordgroup} \\
\begin{leftwordgroup}{LOCK}
\bitbox{1}{0}\bitbox{3}{3'h7}
                \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored}
                \end{leftwordgroup} \\
\end{bytefield}
\caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop}
\end{center}\end{figure}
 
The {\tt NOOP} instruction is just that: an instruction that does not perform
any operation.  While many other instructions, such as a move from a register
to itself, could also fill this role, only the NOOP instruction guarantees that
it will not stall waiting for a register to be available.  For this reason,
it gets its own place in the instruction set.
 
The {\tt BREAK} instruction is useful for creating a debug instruction that
will halt the CPU without executing.  If in user mode, depending upon the
setting of the break enable bit, it will either switch to supervisor mode or
halt the CPU, depending upon where the user wishes to do his debugging.
 
Finally, the {\tt LOCK} instruction was added in order to provide for
atomic operations.  The {\tt LOCK} instruction only works in pipeline mode.
It works by stalling the ALU pipeline stage until all prior stages are
filled, and then it guarantees that once a bus cycle is started, the
wishbone {\tt CYC} line will remain asserted until the LOCK is deasserted.
This allows the execution of one instruction that was waiting in the load
operands pipeline stage, and one instruction that was waiting in the
instruction decode stage.  Further, if the instruction waiting in the decode
stage was a VLIW instruction, then it may be possible to execute a third
instruction.
 
This was originally written to implement an atomic test and set instruction,
such as a {\tt LOCK} followed by {\tt LOD (Rx),Ry} and a {\tt STO Rz,(Rx)},
where {\tt Rz} has previously been set to one.
 
Other atomic operations, built from a VLIW instruction that combines a single
ALU instruction with a store, should be possible as well.  An atomic increment,
for example, would be {\tt LOCK}, {\tt LOD (Rx),Ry}, {\tt ADD 1,Ry},
{\tt STO Ry,(Rx)}.  Many of these combinations remain to be tested.
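
For comparison, the test and set and atomic increment sequences above have
the same semantics as the C11 {\tt atomic\_exchange} and
{\tt atomic\_fetch\_add} operations.  The sketch below shows that
correspondence in portable C.  The function names {\tt test\_and\_set} and
{\tt atomic\_increment} are hypothetical, and whether any compiler targeting
the Zip CPU lowers these C11 operations to {\tt LOCK} sequences is an
assumption, not something this document states.

```c
#include <stdatomic.h>

/* Hypothetical helper: the C11 equivalent of
 *   LOCK ; LOD (Rx),Ry ; STO Rz,(Rx)   with Rz == 1.
 * Returns the previous value: 0 if the lock was free, 1 if it was held. */
static int test_and_set(atomic_int *lock)
{
	return atomic_exchange(lock, 1);
}

/* Hypothetical helper: the C11 equivalent of
 *   LOCK ; LOD (Rx),Ry ; ADD 1,Ry ; STO Ry,(Rx)
 * Returns the incremented value, i.e. what ends up in Ry. */
static int atomic_increment(atomic_int *counter)
{
	return atomic_fetch_add(counter, 1) + 1;
}
```

Calling {\tt test\_and\_set} twice on a zero--initialized lock returns 0 and
then 1, while {\tt atomic\_increment} on a counter holding 41 returns 42.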
\section{Floating Point}
Although the Zip CPU does not (yet) have a floating point unit, the current
instruction set offers eight opcodes for floating point operations, and treats
floating point exceptions like divide by zero errors.  Once this unit is built
and integrated together with the rest of the CPU, the Zip CPU will support
32--bit floating point instructions natively.  Any 64--bit floating point
instructions will still need to be emulated in software.
 
Until that time, or even afterwards if the floating point unit is not
installed, floating point instructions will trigger an illegal instruction
exception, which may be trapped and then implemented in software.
 
\section{Derived Instructions}
The Zip CPU supports many other common instructions, but not all of them
are single cycle instructions.  The derived instruction tables,
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
and~\ref{tbl:derived-4},
illustrate how some of these other instructions may be implemented on
the Zip CPU.  Many of these instructions will have assembly equivalents,
such as the branch instructions, to facilitate working with the CPU.
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
{\tt ABS Rx}
        & \parbox[t]{1.5in}{\tt TST \$-1,Rx\\NEG.LT Rx}
        & Absolute value, depends upon the derived conditional {\tt NEG}.\\\hline
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
        & \parbox[t]{1.5in}{\tt ADD Ra,Rx\\ADD.C \$1,Ry\\ADD Rb,Ry}
        & Add with carry \\\hline
{\tt BRA.Cond +/-\$Addr}
        & \hbox{\tt ADD.cond \$Addr+PC,PC}
        & Branch or jump on condition.  Works for 18--bit
                signed address offsets.\\\hline
{\tt BRA.Cond +/-\$Addr}
        & \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
        & Branch/jump on condition.  Works for 23--bit address offsets, but
        costs a register and an extra instruction.  With LDIHI and LDILO
        this can be made to work anywhere in the 32-bit address space, but
        at the cost of yet another instruction. \\\hline
{\tt BNC PC+\$Addr}
        & \parbox[t]{1.5in}{\tt TST \$Carry,CC \\ ADD.Z PC+\$Addr,PC}
        & Example of a branch on an unsupported
                condition, in this case a branch on not carry \\\hline
{\tt BUSY } & {\tt ADD \$-1,PC} & Execute an infinite loop \\\hline
{\tt CLRF.NZ Rx }
        & {\tt XOR.NZ Rx,Rx}
        & Clear Rx, and flags, if the Z-bit is not set \\\hline
{\tt CLR Rx }
        & {\tt LDI \$0,Rx}
        & Clears Rx, leaves flags untouched.  This instruction cannot be
                conditional. \\\hline
{\tt EXCH.W Rx }
        & {\tt ROL \$16,Rx}
        & Exchanges the top and bottom 16--bit words of Rx \\\hline
{\tt HALT }
        & {\tt OR \$SLEEP,CC}
        & This only works when issued in interrupt/supervisor mode.  In user
        mode this is simply a wait until interrupt instruction. \\\hline
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
{\tt IRET}
        & {\tt OR \$GIE,CC}
        & Also known as an RTU instruction (Return to Userspace) \\\hline
{\tt JMP R6+\$Offset}
        & {\tt MOV \$Offset(R6),PC}
        & \\\hline
{\tt LJMP \$Addr}
        & \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }}
        & Although this only works for an unconditional jump, and only
        on a Von Neumann architecture, this instruction pair is convenient
        since the address word can be adjusted by a linker at a later
        time.\\\hline
{\tt JSR PC+\$Offset  }
        & \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ ADD \$Offset,PC}
        & This is similar to the jump and link instructions from other
        architectures, save only that it requires a separate link
        instruction: the {\tt MOV} instruction shown to the left.\\\hline
\end{tabular}
\caption{Derived Instructions}\label{tbl:derived-1}
\end{center}\end{table}
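
As a rough check on the carry handling in the {\tt ADD}/{\tt ADDC} entry,
and on {\tt EXCH.W}, the following C sketch models each derived sequence one
instruction at a time.  It assumes {\tt Rx} holds the low word and {\tt Ry}
the high word of the 64--bit value, as the instruction order implies.  The
function names are hypothetical, and the sketch models the arithmetic only,
not the flags.

```c
#include <stdint.h>

/* Hypothetical model of the derived add-with-carry sequence:
 *   ADD   Ra,Rx   ; add the low words, setting the carry flag
 *   ADD.C $1,Ry   ; if a carry occurred, increment the high word
 *   ADD   Rb,Ry   ; add the high words
 * Computes (Ry:Rx) += (Rb:Ra), each word wrapping modulo 2^32. */
static void add_with_carry(uint32_t *rx, uint32_t *ry, uint32_t ra, uint32_t rb)
{
	uint32_t sum = *rx + ra;   /* ADD Ra,Rx                              */
	int carry = sum < ra;      /* unsigned wrap-around means a carry out */
	*rx = sum;
	if (carry)
		*ry += 1;              /* ADD.C $1,Ry                            */
	*ry += rb;                 /* ADD Rb,Ry                              */
}

/* Hypothetical model of EXCH.W: ROL $16,Rx swaps the two 16-bit halves. */
static uint32_t exch_w(uint32_t rx)
{
	return (rx << 16) | (rx >> 16);
}
```

For instance, adding {\tt 0x2:0x1} to {\tt 0x1:0xffffffff} yields
{\tt 0x4:0x0}: the low--word overflow is carried into the high word by the
conditional {\tt ADD.C}.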
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
{\tt LDI.l \$val,Rx }
        & \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
                        LDILO (\$val\&0x0ffff),Rx}
        & \parbox[t]{3.0in}{Sadly, there's not enough instruction
                space to load a complete immediate value into any register.
                Therefore, fully loading any register takes two cycles.
                The LDIHI (load immediate high) and LDILO (load immediate low)
                instructions have been created to facilitate this.
                \\
        This is also the appropriate means for setting a register value
        to an arbitrary 32--bit value in a post--assembly link
        operation.}\\\hline
{\tt LOD.b \$addr,Rx}
        & \parbox[t]{1.5in}{\tt %
        LDI     \$addr,Ra \\
        LDI     \$addr,Rb \\
        LSR     \$2,Ra \\
        AND     \$3,Rb \\
        LOD     (Ra),Rx \\
        LSL     \$3,Rb \\
        SUB     \$32,Rb \\
        ROL     Rb,Rx \\
        AND \$0ffh,Rx}
        & \parbox[t]{3in}{This CPU is designed for 32--bit word
        length instructions.  Byte addressing is not supported by the CPU or
        the bus, so byte accesses take more work.

        Note also that in this example, \$addr is a byte-wise address, whereas
        all other addresses in this document are 32-bit wordlength addresses.
        For this reason,
        we needed to drop the bottom two bits.  This also limits the address
        space of character accesses using this method from 16~MB down to 4~MB.}
                \\\hline
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
        & \parbox[t]{1.5in}{\tt LSL \$1,Ry \\
        LSL \$1,Rx \\
        OR.C \$1,Ry}
        & Logical shift left with carry.  Note that the
        instruction order is now backwards, to keep the conditions valid.
        That is, LSL sets the carry flag, so had Rx been shifted before Ry,
        the carry left for the final {\tt OR.C} correction would have come
        from Ry rather than from Rx. \\\hline
\parbox[t]{1.5in}{\tt LSR \$1,Rx \\ LSRC \$1,Ry}
        & \parbox[t]{1.5in}{\tt CLR Rz \\
        LSR \$1,Rx \\
        LDIHI.C \$8000h,Rz \\
        LSR \$1,Ry \\
        OR Rz,Ry}
        & Logical shift right with carry.  Rx is shifted first, so that the
        bit shifted out of it can be moved into the top bit of Ry. \\\hline
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx}
        & Conditional negate, since $-x$ is the bitwise complement of
        $x-1$.  Because {\tt MOV} does not set the flags, the condition is
        still valid for the {\tt XOR}. \\\hline
{\tt NOOP} & {\tt NOOP} & While there are many
        operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
        operations have consequences in that they might stall the pipeline if
        Rx isn't ready yet.  For this reason, we have a dedicated NOOP
        instruction. \\\hline
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline
{\tt POP Rx }
        & \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP}
        & \\\hline
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-2}
\end{center}\end{table}
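
The byte load and the 64--bit shifts above can be modeled in C as a rough
check on the bookkeeping involved.  The byte load below is not a
transcription of the {\tt LOD.b} register sequence: it only shows the
underlying idea (drop the bottom two address bits to find the word, then use
them to select the byte lane), and it assumes a big--endian byte order, with
byte~0 in bits 31--24, since the table leaves the byte order open.  The shift
helpers treat {\tt hi} and {\tt lo} as the high and low words of a 64--bit
value.  All function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical byte load from word-addressed memory.
 * mem[] holds 32-bit words; byte_addr is a byte address.
 * Assumes big-endian byte order: byte 0 lives in bits 31..24. */
static uint8_t load_byte(const uint32_t *mem, uint32_t byte_addr)
{
	uint32_t word  = mem[byte_addr >> 2];   /* drop the bottom two bits     */
	uint32_t lane  = byte_addr & 3;         /* byte position in the word    */
	uint32_t shift = 8 * (3 - lane);        /* big-endian lane -> bit shift */
	return (uint8_t)((word >> shift) & 0xff);
}

/* Logical shift left by one of the 64-bit value (hi:lo), mirroring
 * LSL $1,hi ; LSL $1,lo ; OR.C $1,hi  (the high word is shifted first). */
static void lsl64_by1(uint32_t *hi, uint32_t *lo)
{
	uint32_t carry = *lo >> 31;   /* bit shifted out of lo: the carry flag */
	*hi <<= 1;
	*lo <<= 1;
	*hi |= carry;                 /* OR.C $1,hi                            */
}

/* Logical shift right by one of (hi:lo), mirroring
 * CLR Rz ; LSR $1,hi ; LDIHI.C $8000h,Rz ; LSR $1,lo ; OR Rz,lo */
static void lsr64_by1(uint32_t *hi, uint32_t *lo)
{
	uint32_t fill = (*hi & 1) ? 0x80000000u : 0;  /* Rz after LDIHI.C */
	*hi >>= 1;
	*lo = (*lo >> 1) | fill;
}
```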
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
{\tt PUSH Rx}
        & \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP}
        \hbox{\tt STO Rx,\$(SP)}}
        & Note that for pipelined operation, it helps to coalesce all the
        {\tt SUB}'s into one command, and place the {\tt STO}'s right
        after each other.  Further, to avoid a pipeline stall, the
        immediate value for the store must be zero.
        \\\hline
{\tt PUSH Rx-Ry}
        & \parbox[t]{1.5in}{\tt SUB \$$n$,SP \\
        STO Rx,\$(SP)
        \ldots \\
        STO Ry,\$$\left(n-1\right)$(SP)}
        & Multiple pushes at once only need the single subtract from the
        stack pointer.  This derived instruction is analogous to a similar one
        on the Motorola 68k architecture, although the Zip Assembler
        does not support this instruction (yet).  This instruction
        also supports pipelined memory access.\\\hline
{\tt RESET}
        & \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
        & This depends upon the peripheral base address being
        preloaded into R12.

        Another opportunity might be to jump to the reset address from within
        supervisor mode.\\\hline
{\tt RET} & {\tt MOV R0,PC}
        & This depends upon the form of the {\tt JSR} given in
        Tbl.~\ref{tbl:derived-1}, which stores the return address into R0.
        \\\hline
{\tt STEP Rr,Rt}
        & \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
        & Step a Galois implementation of a Linear Feedback Shift Register, Rr,
                using taps Rt \\\hline
%
%
{\tt SEX.b Rx }
        & \parbox[t]{1.5in}{\tt LSL \$24,Rx \\ ASR \$24,Rx}
        & Sign extend a byte into a full word.\\\hline
{\tt SEX.h Rx }
        & \parbox[t]{1.5in}{\tt LSL \$16,Rx \\ ASR \$16,Rx}
        & Sign extend a half word into a full word.\\\hline
%
{\tt STO.b Rx,\$addr}
        & \parbox[t]{1.5in}{\tt %
        LDI \$addr,Ra \\
        LDI \$addr,Rb \\
        LSR \$2,Ra \\
        AND \$3,Rb \\
        SUB \$32,Rb \\
        LOD (Ra),Ry \\
        AND \$0ffh,Rx \\
        AND \~\$0ffh,Ry \\
        ROL Rb,Rx \\
        OR Rx,Ry \\
        STO Ry,(Ra) }
        & \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
        for byte-wise operations.

        Note that in this example, \$addr is a
        byte-wise address, whereas in all of our other examples it is a
        32-bit word address. This also limits the address space
        of character accesses from 16~MB down to 4~MB.
        Further, this instruction implies a byte ordering,
        such as big or little endian.} \\\hline
{\tt SWAP Rx,Ry }
        & \parbox[t]{1.5in}{\tt XOR Ry,Rx \\ XOR Rx,Ry \\ XOR Ry,Rx}
        & While no extra registers are needed, this example
        does take three clocks. \\\hline
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-3}
\end{center}\end{table}
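
Several of the entries above are easy to check against a C model.  The sketch
below mirrors {\tt STEP}, {\tt SEX.b}, {\tt SEX.h}, and {\tt SWAP} one
instruction at a time, and models {\tt STO.b} as a read--modify--write of the
containing word.  The byte store follows the idea of the {\tt STO.b} entry
rather than its exact register sequence, and it assumes big--endian byte
order (byte~0 in bits 31--24), since the table leaves the ordering open.  The
sign--extension helpers avoid relying on C's implementation--defined right
shift of negative values.  All function names are hypothetical.

```c
#include <stdint.h>

/* STEP Rr,Rt  ==  LSR $1,Rr ; XOR.C Rt,Rr
 * One step of a Galois LFSR: shift right by one and, if the bit shifted
 * out (the carry) was set, XOR in the taps. */
static uint32_t lfsr_step(uint32_t rr, uint32_t rt)
{
	uint32_t carry = rr & 1;   /* LSR sets the carry from bit 0 */
	rr >>= 1;
	if (carry)
		rr ^= rt;              /* XOR.C Rt,Rr                    */
	return rr;
}

/* SEX.b and SEX.h: same result as LSL $24/ASR $24 and LSL $16/ASR $16,
 * written without relying on a signed right shift. */
static int32_t sex_b(uint32_t rx)
{
	return (int32_t)((rx & 0xffu) ^ 0x80u) - 0x80;
}

static int32_t sex_h(uint32_t rx)
{
	return (int32_t)((rx & 0xffffu) ^ 0x8000u) - 0x8000;
}

/* Byte store as a read-modify-write of the containing word.
 * Assumes big-endian byte order: byte 0 lives in bits 31..24. */
static void store_byte(uint32_t *mem, uint32_t byte_addr, uint8_t value)
{
	uint32_t shift = 8 * (3 - (byte_addr & 3));   /* lane -> bit position */
	uint32_t word  = mem[byte_addr >> 2];          /* LOD (Ra),Ry          */
	word &= ~(UINT32_C(0xff) << shift);            /* clear the old byte   */
	word |= (uint32_t)value << shift;              /* insert the new byte  */
	mem[byte_addr >> 2] = word;                    /* STO Ry,(Ra)          */
}

/* SWAP Rx,Ry  ==  XOR Ry,Rx ; XOR Rx,Ry ; XOR Ry,Rx (no scratch register). */
static void xor_swap(uint32_t *rx, uint32_t *ry)
{
	*rx ^= *ry;
	*ry ^= *rx;
	*rx ^= *ry;
}
```

With the 16--bit taps {\tt 0xB400}, for example, {\tt lfsr\_step} cycles
through all 65535 nonzero states before returning to its starting value.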
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
{\tt TRAP \#X}
        & \parbox[t]{1.5in}{\tt LDI \$X,R0 \\ AND \~\$GIE,CC }
        & This works because whenever a user lowers the \$GIE flag, it sets
        a TRAP bit within the CC register.  Therefore, upon entering the
        supervisor state, the supervisor need only check this bit to know
        that it got there via a TRAP.  The trap could be made conditional by
        making the LDI and the AND conditional.  In that case, the assembler
        would quietly turn the LDI instruction into an LDILO and LDIHI pair,
        but the effect would be the same. \\\hline
{\tt TS Rx,Ry,(Rz)}
        & \hbox{\tt LDI 1,Rx}
                \hbox{\tt LOCK}
                \hbox{\tt LOD (Rz),Ry}
                \hbox{\tt STO Rx,(Rz)}
        & A test and set instruction.  The {\tt LOCK} instruction ensures
        that the bus remains locked between the next two instructions,
        so no one else can use it.  This guarantees that the operation is
        atomic.
        \\\hline
{\tt TST Rx}
        & {\tt TST \$-1,Rx}
        & Set the condition codes based upon Rx.  One could also use
        {\tt CMP \$0,Rx}, {\tt ADD \$0,Rx}, {\tt SUB \$0,Rx}, or
        {\tt AND \$-1,Rx}.  However, since the TST and CMP approaches do not
        write Rx back, they won't stall later pipeline stages waiting for the
        value of Rx.  (Future versions of the assembler may shorten this to a
        {\tt TST Rx} instruction.)\\\hline
{\tt WAIT}
        & {\tt OR \$GIE | \$SLEEP,CC}
        & Wait until the next interrupt, then jump to supervisor/interrupt
        mode. \\\hline
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-4}
\end{center}\end{table}
 
\section{Interrupt Handling}
The Zip CPU does not maintain any interrupt vector tables.  If an interrupt
takes place, the CPU simply switches to interrupt mode.  The supervisor code
then continues, in interrupt mode, from where it left off when it last
executed a return to userspace ({\tt RTU}) instruction.

At this point, the supervisor code needs to determine first whether an
interrupt has occurred, and then whether it is in interrupt mode due to
an exception, and handle each case appropriately.
 
\section{Pipeline Stages}
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
the Zip CPU supports a five stage pipeline.
\begin{enumerate}
\item {\bf Prefetch}: Reads instructions from memory, and into a cache if so
        configured.  This
        stage is actually pipelined itself, and so it will stall if the PC
        ever changes.  Stalls are also created here if the instruction isn't
        in the prefetch cache.
 
        The Zip CPU supports one of three prefetch methods, depending upon a
        flag set at build time within the {\tt cpudefs.v} file.  The simplest
        is a non--cached implementation of a prefetch.  This implementation is
        fairly small, and ideal for users of the Zip CPU who need the extra
        space on the FPGA fabric.  However, because this non--cached version
        has no cache, the maximum number of instructions per clock is limited
        to about one per five.
 
        The second prefetch module is a pipelined prefetch with a cache.  This
        module tries to keep the instruction address within a window of valid
        instruction addresses.  While effective, it is not a traditional
        cache implementation.  One unique feature of this cache implementation,
        however, is that it can be cleared in a single clock.  A disappointing
        feature, though, is that it needs an extra internal pipeline stage
        to be implemented.
 
        The third prefetch and cache module implements a more traditional cache.
        While the resulting code tends to be twice as fast as the pipelined
        cache architecture, this implementation uses a large amount of
        distributed FPGA RAM to be successful.  This then inflates the Zip CPU's
        FPGA usage statistics.

\item {\bf Decode}: Decodes an instruction into OpCode, register(s) to read,
        and immediate offset.  This stage also determines whether the flags
        will be set or whether the result will be written back.

\item {\bf Read Operands}: Read registers and apply any immediate values to
        them.  There is no means of detecting or flagging arithmetic overflow
        or carry when adding the immediate to the operand.  This stage will
        stall if any source operand is pending.
 
\item Split into one of four tracks: An {\bf ALU} track which will accomplish
        a simple instruction, the {\bf MemOps} stage which handles {\tt LOD}
        (load) and {\tt STO} (store) instructions, the {\bf divide} unit,
        and the {\bf floating point} unit.
        \begin{itemize}
        \item Loads will stall instructions in the decode stage until the
                load is complete, lest a register be read in the read
                operands stage only to be updated, unseen, by the load.
        \item Condition codes are available upon completion of the ALU,
                divide, or FPU stage.
        \item Issuing a non--pipelined memory instruction to the memory unit
                while the memory unit is busy will stall the entire pipeline.
        \end{itemize}
\item {\bf Write-Back}: Conditionally write back the result to the register
        set, applying the condition.  This routine is quad-entrant: any of the
        ALU, the memory, the divide, or the FPU may write back a register.
        The only design rule is that no more than a single register may be
        written back in any given clock.
\end{enumerate}
 
The Zip CPU does not support out of order execution.  Therefore, if the memory
unit stalls, every other instruction stalls.  The same is true for divide or
floating point instructions: all other instructions will stall while waiting
for these to complete.  Memory stores, however, can take place concurrently
with non--memory operations, although memory reads (loads) cannot.
 
\section{Pipeline Stalls}
The processing pipeline can and will stall for a variety of reasons.  Some of
these are obvious, some less so.  These reasons are listed below:
\begin{itemize}
\item When the prefetch cache is exhausted

This reason should be obvious.  If the prefetch cache doesn't have the
instruction in memory, the entire pipeline must stall until an instruction
can be made ready.  In the case of the {\tt pipefetch} windowed approach
to the prefetch cache, this means the pipeline will stall until enough of the
prefetch cache is loaded to support the next instruction.  In the case
of the more traditional {\tt pfcache} approach, the entire cache line must
fill before instruction execution can continue.

\item While waiting for the pipeline to load following any taken branch, jump,
        return from interrupt or switch to interrupt context (4 stall cycles)

Fig.~\ref{fig:bcstalls}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/bc.eps}
\caption{A conditional branch generates 4 stall cycles}\label{fig:bcstalls}
\end{center}\end{figure}
illustrates the situation for a conditional branch.  In this case, the branch
instruction, {\tt BC}, is nominally followed by instructions {\tt I1} and so
forth.  However, since the branch is taken, the next instruction must be
{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.
Given that there are five stages to the pipeline, that accounts
for the four stalls.  (Were the {\tt pipefetch} cache chosen, there would
be another stall internal to the {\tt pipefetch} cache.)

The Zip CPU handles the {\tt ADD \$X,PC} and
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
not conditioned on the flags, can execute with only a single stall cycle,
such as is shown in Fig.~\ref{fig:branch}.\footnote{Note that when using the
{\tt pipefetch} cache, this requires an additional stall cycle due to that
cache's implementation.}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
\end{center}\end{figure}
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
following the branch in memory, while {\tt IA} is the first instruction at the
branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does
not represent any instruction.)

\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
\item\ {\tt OPCODE ?,RA}
\item\ {\em (stall)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}
 
Since the addition of the immediate value within OpB decoding gets applied
during the read operand stage so that it can be nicely settled before the ALU,
any instruction that will write back an operand must be separated from the
opcode that will read and apply an immediate offset by one instruction.  The
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.
 
This is also the reason why, when setting up a stack frame, the top of the
stack frame is used first: it eliminates this stall cycle.  Hence, to save
registers at the top of a procedure, one would write:
\begin{enumerate}
\item\ {\tt SUB 2,SP}
\item\ {\tt STO R1,(SP)}
\item\ {\tt STO R2,1(SP)}
\end{enumerate}
Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack,
there would have been an extra stall in setting up the stack frame.
 
\item When reading from the CC register after setting the flags
\begin{enumerate}
\item\ {\tt ALUOP RA,RB} {\em ; Ex: a compare opcode}
\item\ {\em (stall)}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt BZ somewhere}
\end{enumerate}
 
The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic {\em during} the writeback cycle.
Trying to then place these into the input for one of the operands for an
ALU instruction during the same cycle
created a timing path that would no longer fit within a single 100~MHz
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
either \ldots).
 
This stall may be eliminated via proper scheduling, by placing an instruction
that does not set flags in between the ALU operation and the instruction
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall.  On the other hand, an {\tt OR \$BREAKEN,CC} followed by an
{\tt OR \$STEP,CC} will incur the stall, whereas a single
{\tt LDI \$BREAKEN|\$STEP,CC} will not, since it doesn't read the condition
codes before executing.
 
\item When waiting for a memory read operation to complete
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

Remember, the Zip CPU does not support out of order execution.  Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared.  This is illustrated in
Fig.~\ref{fig:memrd},
\begin{figure}\begin{center}
\includegraphics[width=5.6in]{../gfx/memrd.eps}
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
\end{center}\end{figure}
since it is especially true of a load
instruction, which must still write its operand back to the register file.
Further, note that on a pipelined memory operation, the instruction must
stall in the decode operand stage, lest it try to read a result from the
register file before the load result has been written to it.  Finally, note
that there is an extra stall at the end of the memory cycle, so that
the memory unit will be idle for two clocks before an instruction will be
accepted into the ALU.  Store instructions are different, as shown in
Fig.~\ref{fig:memwr},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/memwr.eps}
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
\end{center}\end{figure}
since they can be busy with the bus without impacting later write back
pipeline stages.  Hence, only loads stall the pipeline.
 
This, of course, also assumes that the memory being accessed is a single cycle
memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer, perhaps even
as long as forty clocks.  During this time the CPU and the external bus
will be busy, and unable to do anything else.  Likewise, if it takes a couple
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.
 
\item Memory operation followed by a memory operation
\begin{enumerate}
\item\ {\tt STO address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt LOD address,RB}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}

In this case, the LOD instruction cannot start until the STO is finished,
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
stall while waiting for it to complete before the load instruction can
start.
 
The Zip CPU does have the capability of supporting pipelined memory access,
but only under the following conditions: the accesses within the pipeline
must either all be reads or all be writes, all must use the same register for
their address, and there can be no stalls or other instructions between
pipelined memory access instructions.  Further, the offset to memory must
increase by one address with each instruction.  These conditions work well
for saving registers to, or restoring them from, the stack.  Indeed, both
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrate pipelined memory
accesses.
 
\end{itemize}
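
To make the two register--related stall rules above concrete, here is a toy
C model that counts the single--cycle stalls in a straight--line instruction
sequence.  It is only a sketch under simplifying assumptions: the
{\tt struct insn} record and its fields are hypothetical, register numbers
are arbitrary, and branches, prefetch misses, and memory stalls are ignored
entirely.  It does reproduce the stack--frame example: storing {\tt R1} at
{\tt (SP)} right after {\tt SUB 2,SP} costs nothing, while storing it at
{\tt 1(SP)} costs one stall.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, highly simplified instruction record holding only what the
 * two register-related stall rules need. */
struct insn {
	int  dst;       /* register written back, or -1 for none         */
	int  src;       /* register read as operand B, or -1 for none    */
	bool imm_add;   /* true if a nonzero immediate is added to src    */
	bool sets_cc;   /* true if the instruction sets the flags         */
	bool reads_cc;  /* true if the instruction reads the CC register  */
};

/* Count one stall cycle for each adjacent pair where either
 *  (1) the second instruction adds an immediate to a register that the
 *      first one writes back, or
 *  (2) the second instruction reads CC right after the first set the flags. */
static unsigned count_stalls(const struct insn *prog, size_t n)
{
	unsigned stalls = 0;
	for (size_t i = 1; i < n; i++) {
		const struct insn *prev = &prog[i - 1], *cur = &prog[i];
		bool imm_hazard = cur->imm_add && prev->dst >= 0 && cur->src == prev->dst;
		bool cc_hazard  = cur->reads_cc && prev->sets_cc;
		if (imm_hazard || cc_hazard)
			stalls++;
	}
	return stalls;
}
```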
 
\chapter{Peripherals}\label{chap:periph}

While the previous chapter describes a CPU in isolation, the Zip System
includes a minimum set of peripherals as well.  These peripherals are shown
in Fig.~\ref{fig:zipsystem}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/system.eps}
\caption{Zip System Peripherals}\label{fig:zipsystem}
\end{center}\end{figure}
and described here.  They are designed to make
the Zip CPU more useful in an Embedded Operating System environment.

\section{Interrupt Controller}\label{sec:pic}

Perhaps the most important peripheral within the Zip System is the interrupt
controller.  While the Zip CPU itself can only handle one interrupt, and has
only the one interrupt state: disabled or enabled, the interrupt controller
can make things more interesting.
 
The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register.  Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.
 
The interrupt controller has been designed so that bits can be controlled
individually without having any knowledge of the rest of the controller
setting.  To enable an interrupt, write to the register with the high order
global enable bit set and the respective interrupt enable bit set.  No other
bits will be affected.  To disable an interrupt, write to the register with
the high order global enable bit cleared and the respective interrupt enable
bit set.  To clear an interrupt, write a `1' to that interrupt's status bit.
Zeros written to the register have no effect, save that a zero written to the
master enable will disable all interrupts.
 
1645
As an example, suppose you wished to enable interrupt \#4.  You would then
1646
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
1647
any past active state.  When you later wish to disable this interrupt, you would
1648
write a {\tt 0x00100010} to the register.  As before, this both disables the
1649
interrupt and clears the active indicator.  This also has the side effect of
1650
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
1651
to re-enable any other interrupts.
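
The register encoding above can be captured in a few C helpers.  This is a
minimal sketch that assumes only the bit layout described in this section;
the macro and function names ({\tt pic\_enable\_word} and friends) are
hypothetical and are not part of any Zip System header.

```c
#include <stdint.h>

/* Hypothetical helpers built from the register layout described above:
 *   bit 31      master (global) interrupt enable
 *   bits 16--30 per-interrupt enables
 *   bit 15      "any interrupt active" indicator
 *   bits 0--14  per-interrupt active flags (write a one to clear)
 * n is an interrupt number in the range 0..14. */
#define PIC_MASTER_ENABLE 0x80000000u

/* Enable interrupt n and clear any stale active state. */
uint32_t pic_enable_word(unsigned n)
{
	return PIC_MASTER_ENABLE | (1u << (16 + n)) | (1u << n);
}

/* Disable interrupt n and clear its active state.  Since bit 31 is
 * written as zero, this also disables all interrupts globally. */
uint32_t pic_disable_word(unsigned n)
{
	return (1u << (16 + n)) | (1u << n);
}

/* Clear interrupt n's active flag only.  Bit 31 is written as one so the
 * master enable is not cleared; no enable bits are set, so no individual
 * enables change. */
uint32_t pic_ack_word(unsigned n)
{
	return PIC_MASTER_ENABLE | (1u << n);
}

/* Nonzero if interrupt n is flagged as active in the register value picv. */
int pic_is_active(uint32_t picv, unsigned n)
{
	return (int)((picv >> n) & 1u);
}
```

For interrupt \#4 these helpers produce {\tt 0x80100010} and
{\tt 0x00100010}, matching the example above.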

The Zip System currently hosts two interrupt controllers, a primary and a
secondary.  The primary interrupt controller has one (or more) interrupt line(s)
which may come from an external interrupt source, and one interrupt line from
the secondary controller.  Other primary interrupts include the system timers,
the jiffies interrupt, and the manual cache interrupt.  The secondary interrupt
controller maintains an interrupt state for all of the processor accounting
counters.

\section{Counter}

The Zip Counter is a very simple counter: it just counts.  It cannot be
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
counter just sets the current value, and it starts counting again from that
value.

Eight counters are implemented in the Zip System for process accounting.
This may change in the future, as nothing as yet uses these counters.
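
To make the counter's behavior concrete, the following is a minimal
behavioral model in C, written from the description above rather than from
the Verilog.  The type and function names are hypothetical, and the exact
clock on which the hardware raises its interrupt may differ from this model.

```c
#include <stdint.h>

/* Hypothetical behavioral model of one Zip Counter (not the RTL).  The
 * counter advances on every clock, cannot be halted, and raises its
 * interrupt when it rolls over from 0xffffffff to zero. */
typedef struct {
	uint32_t value;
} zip_counter;

/* Writing the counter just sets its current value; counting continues
 * from there on the next clock. */
void counter_write(zip_counter *c, uint32_t v)
{
	c->value = v;
}

/* Advance one clock.  Returns 1 on the clock the counter rolls over (when
 * the interrupt would be issued), and 0 otherwise. */
int counter_tick(zip_counter *c)
{
	c->value += 1u;   /* uint32_t arithmetic wraps modulo 2^32 */
	return c->value == 0u;
}
```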

\section{Timer}

The Zip Timer is also very simple: it simply counts down to zero.  When it
transitions from a one to a zero it creates an interrupt.

Writing any non-zero value to the timer starts the timer.  If the high order
bit is set when writing to the timer, the timer becomes an interval timer and
reloads its last start value on every interrupt.  Hence, to mark seconds, one
might set the timer to 100~million (the number of clocks per second), and
set the high bit.  Ever after, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock).  Because the high order bit selects
interval mode, this reload capability also limits the maximum timer value to
$2^{31}-1$ (about 21~seconds using a 100~MHz clock), rather than $2^{32}-1$.
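
The timer's one-shot and interval modes can be sketched the same way.  This
is a hypothetical behavioral model, not the RTL: it assumes the reload happens
on the same clock as the interrupt, so an interval of $N$ produces one
interrupt every $N$ clocks.  The names {\tt zip\_timer},
{\tt TIMER\_INTERVAL}, and {\tt TIMER\_MAX} are illustrative only.

```c
#include <stdint.h>

/* Hypothetical behavioral model of the Zip Timer, written from the prose
 * above.  Bit 31 of a written value selects interval (auto-reload) mode;
 * bits 0--30 hold the count, so the largest count is 2^31-1. */
#define TIMER_INTERVAL 0x80000000u
#define TIMER_MAX      0x7fffffffu

typedef struct {
	uint32_t count;    /* current down-count; zero means stopped */
	uint32_t reload;   /* start value, reloaded in interval mode */
	int      interval; /* nonzero if the high bit was set on write */
} zip_timer;

void timer_write(zip_timer *t, uint32_t v)
{
	t->interval = (v & TIMER_INTERVAL) != 0;
	t->count    = v & TIMER_MAX;
	t->reload   = t->count;
}

/* Advance one clock.  Returns 1 on the clock the count goes from one to
 * zero, which is when the interrupt is generated. */
int timer_tick(zip_timer *t)
{
	if (t->count == 0)
		return 0;             /* stopped: nothing happens */
	if (--t->count != 0)
		return 0;
	if (t->interval)
		t->count = t->reload; /* interval timer: start over */
	return 1;
}
```

For example, writing {\tt TIMER\_INTERVAL|100000000u}, that is
{\tt 0x85f5e100}, would give one interrupt every 100~million clocks, or once
per second at 100~MHz.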

\section{Watchdog Timer}

The watchdog timer is no different from any of the other timers, save for one
critical difference: the interrupt line from the watchdog
timer is tied to the reset line of the CPU.  Hence writing a `1' to the
watchdog timer will always reset the CPU, since its count will transition
from one to zero on the very next clock.
To stop the watchdog timer, write a `0' to it.  To start it,
write any other number to it, as with the other timers.

While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.
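
Picking a watchdog period comes down to converting time into clocks while
staying clear of the two special values described above.  The helper below is
a hypothetical sketch; its name and its convention of returning zero for an
unusable period are assumptions.  At 100~MHz, a 10~ms period works out to
1,000,000 clocks.

```c
#include <stdint.h>

/* Hypothetical helper: convert a desired watchdog period in milliseconds
 * into the count to write to the watchdog timer, for a given clock rate.
 * Returns 0 ("do not write") when the period cannot be represented:
 *   - a count of 0 would stop the watchdog,
 *   - a count of 1 would reset the CPU on the very next clock,
 *   - counts above 2^31-1 would collide with the interval-mode bit. */
uint32_t watchdog_ticks(uint32_t clock_hz, uint32_t period_ms)
{
	uint64_t ticks = (uint64_t)clock_hz * period_ms / 1000u;

	if (ticks < 2u || ticks > 0x7fffffffu)
		return 0u;
	return (uint32_t)ticks;
}
```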

\section{Bus Watchdog}
There is an additional watchdog timer on the Wishbone bus.  This timer,
however, is configured in hardware rather than in software.  The timer is
reset at the beginning of any bus transaction, and only counts clocks during
such bus transactions.  If the bus transaction takes longer than the number
of counts the timer allots, it will raise a bus error flag to terminate the
transaction.  This is useful in the case of any peripherals that are
misbehaving.  If the bus watchdog terminates a bus transaction, the CPU may
then read from its port to find out which memory location created the problem.

Aside from its unusual configuration, the bus watchdog is just another
implementation of the fundamental timer described above, stripped down
for simplicity.
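
The bus watchdog's behavior can be sketched in the same style.  In this
hypothetical model, {\tt BUS\_TIMEOUT} stands in for the hardware-configured
limit (its value here is an arbitrary assumption, not the core's setting), and
{\tt fault\_addr} plays the role of the port the CPU reads to find the
offending address.

```c
#include <stdint.h>

/* Hypothetical model of the bus watchdog: a counter that restarts at the
 * beginning of each bus transaction, counts only while that transaction is
 * outstanding, and aborts it with a bus error once a hardware-configured
 * limit is reached.  BUS_TIMEOUT is an assumed value. */
#define BUS_TIMEOUT 64u

typedef struct {
	uint32_t count;      /* clocks spent in the current transaction */
	int      busy;       /* nonzero while a transaction is outstanding */
	uint32_t addr;       /* address of the current transaction */
	uint32_t fault_addr; /* latched address of the last timed-out access */
} bus_watchdog;

void bw_start(bus_watchdog *w, uint32_t addr)
{
	w->count = 0;        /* reset at the start of every transaction */
	w->busy  = 1;
	w->addr  = addr;
}

/* The peripheral acknowledged in time: the transaction ends normally. */
void bw_ack(bus_watchdog *w)
{
	w->busy = 0;
}

/* Advance one clock.  Returns 1 if the transaction is terminated with a
 * bus error on this clock. */
int bw_tick(bus_watchdog *w)
{
	if (!w->busy)
		return 0;            /* only counts during transactions */
	if (++w->count < BUS_TIMEOUT)
		return 0;
	w->busy = 0;             /* terminate the transaction */
	w->fault_addr = w->addr; /* remember which address misbehaved */
	return 1;
}
```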

\section{Jiffies}

This peripheral is motivated by the Linux use of `jiffies' whereby a process
can request to be put to sleep until a certain number of `jiffies' have
elapsed.  Using this interface, the CPU can read the number of `jiffies'
from the peripheral (it only has the one location in address space), add the
sleep length to it, and write the result back to the peripheral.  The
{\tt zipjiffies}
peripheral will record the value written to it only if it is nearer to the
current counter value than the interrupt time already waiting, if any.  If no
other interrupts are waiting, and this time is in the future, it will be
enabled.  (There is currently no way to disable a jiffies interrupt once set,
other than to disable the interrupt line in the interrupt controller.)  The
processor may then place this sleep request into a list among other sleep
requests.  Once the timer expires, the processor would write the next Jiffies
request to the peripheral and wake up the process whose timer had expired.

Indeed, the Jiffies register is nothing more than a glorified counter with
an interrupt.  Unlike the other counters, the Jiffies register cannot be set.
Writes to the Jiffies register create an interrupt time.  When the Jiffies
register later equals the value written to it, an interrupt will be asserted
and the register then continues counting as though no interrupt had taken
place.
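
The phrase ``nearer the current counter value'' hides one subtlety: the
Jiffies count wraps, so ``nearer'' has to be measured as a distance modulo
$2^{32}$.  The sketch below models the peripheral in C under that assumption.
The names are hypothetical, and ignoring writes that lie in the past is a
guess rather than documented behavior.

```c
#include <stdint.h>

/* Hypothetical behavioral model of the zipjiffies peripheral, written from
 * the prose above.  The count itself cannot be set; a write only proposes
 * an interrupt time. */
typedef struct {
	uint32_t now;      /* free-running jiffies count */
	uint32_t alarm;    /* interrupt time currently waiting, if any */
	int      pending;  /* nonzero while an alarm is waiting */
} zip_jiffies;

/* Wrap-aware signed distance from now to t, meaningful while the two are
 * less than 2^31 jiffies apart.  (The unsigned-to-signed conversion is
 * implementation-defined in C, but is modulo 2^32 on common compilers.) */
static int32_t jiffies_until(const zip_jiffies *j, uint32_t t)
{
	return (int32_t)(t - j->now);
}

/* A write is recorded only if it lies in the future and is nearer than any
 * alarm already waiting.  Past times are simply ignored in this model. */
void jiffies_write(zip_jiffies *j, uint32_t t)
{
	int32_t dt = jiffies_until(j, t);

	if (dt <= 0)
		return;
	if (!j->pending || dt < jiffies_until(j, j->alarm)) {
		j->alarm   = t;
		j->pending = 1;
	}
}

/* Advance one clock.  Returns 1 when the count reaches the waiting alarm;
 * counting then simply continues. */
int jiffies_tick(zip_jiffies *j)
{
	j->now += 1u;
	if (j->pending && j->now == j->alarm) {
		j->pending = 0;
		return 1;
	}
	return 0;
}
```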

The purpose of this register is to support alarm times within a CPU.  To
set an alarm for a particular process $N$ clocks in advance, read the current
Jiffies value, add $N$, and write the sum back to the Jiffies register.  The
O/S must also keep track of values written to the Jiffies register.  Thus,
when an `alarm' trips, it should be removed from the list of alarms, the list
should be re-sorted, and the next alarm in terms of Jiffies should be written
to the register, possibly for a second time.
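
On the O/S side, the bookkeeping might look like the following sketch.
Rather than re-sorting after every trip, it keeps the list sorted on insertion
(by wrap-aware distance from the current Jiffies value), so the head of the
list is always the next value to write.  {\tt MAX\_ALARMS}, the structure,
and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical O/S-side bookkeeping for Jiffies alarms.  Alarms are kept
 * sorted by their distance from "now" modulo 2^32, so the head of the list
 * is always the next value to write to the Jiffies register. */
#define MAX_ALARMS 16

typedef struct {
	uint32_t when[MAX_ALARMS];  /* jiffies time of each alarm */
	int      task[MAX_ALARMS];  /* task to wake when it trips */
	size_t   n;                 /* number of alarms queued */
} alarm_list;

/* Queue an alarm for `task` at jiffies time `when` (assumed to lie in the
 * future relative to `now`).  Returns 0 if the list is full. */
int alarm_add(alarm_list *l, uint32_t now, uint32_t when, int task)
{
	size_t i;

	if (l->n == MAX_ALARMS)
		return 0;
	/* Insertion step: shift later alarms up one slot.  Unsigned
	 * subtraction gives each alarm's distance from now modulo 2^32. */
	for (i = l->n; i > 0 &&
	     (uint32_t)(l->when[i-1] - now) > (uint32_t)(when - now); i--) {
		l->when[i] = l->when[i-1];
		l->task[i] = l->task[i-1];
	}
	l->when[i] = when;
	l->task[i] = task;
	l->n++;
	return 1;
}

/* Called from the Jiffies interrupt: removes the expired head alarm and
 * reports its task through *woken.  Returns 1 and sets *next when another
 * alarm remains (write *next to the Jiffies register), 0 when the list is
 * now empty, and -1 if nothing was queued. */
int alarm_pop(alarm_list *l, int *woken, uint32_t *next)
{
	size_t i;

	if (l->n == 0)
		return -1;
	*woken = l->task[0];
	for (i = 1; i < l->n; i++) {
		l->when[i-1] = l->when[i];
		l->task[i-1] = l->task[i];
	}
	l->n--;
	if (l->n == 0)
		return 0;
	*next = l->when[0];
	return 1;
}
```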

\section{Direct Memory Access Controller}

The Direct Memory Access (DMA) controller can be used to move memory
from one location to another, to read from a peripheral into memory, or to
write from memory into a peripheral, all without CPU intervention.  Further,
since the DMA controller issues pipelined Wishbone accesses,
any DMA memory move will by nature be faster than a corresponding program
accomplishing the same move.  To put this to numbers, it may take a program
18~clocks per word transferred, whereas this DMA controller can move one
word in two clocks, provided it has bus access.  (The CPU gets priority over
the bus.)

When copying memory from one location to another, the DMA controller will
copy in units of a given transfer length, up to 1024 words at a time.  It will
read that transfer length into its internal buffer, and then write to the
destination address from that buffer.

When coupled with a peripheral, the DMA controller can be configured to start
a memory copy when a selected interrupt line goes high.  Further, the
controller can be configured to issue reads from (or writes to) the same
address instead of incrementing the address at each clock.  The DMA completes
once the total number of items specified (not the transfer length) has been
transferred.

In each case, once the transfer is complete and the DMA unit returns to
idle, the DMA will issue an interrupt.
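
The block-by-block data movement just described can be modelled in a few
lines of C.  This is a hypothetical software sketch of the transfer pattern
only, not of the controller's registers: the name {\tt dma\_model} and its
parameters are assumptions, with {\tt inc\_src} and {\tt inc\_dst} choosing
between incrementing and fixed addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the DMA transfer pattern described above:
 * `total` words are moved in blocks of at most `txlen` words, each block
 * read into an internal buffer and then written out.  inc_src/inc_dst
 * select an incrementing address or a fixed one, such as a peripheral's
 * data register. */
#define DMA_MAX_TXLEN 1024u

/* Returns the number of blocks moved; 0 if nothing was moved, including
 * when txlen is out of range. */
unsigned dma_model(uint32_t *dst, const uint32_t *src, size_t total,
		   size_t txlen, int inc_src, int inc_dst)
{
	uint32_t buf[DMA_MAX_TXLEN];
	unsigned blocks = 0;

	if (txlen == 0 || txlen > DMA_MAX_TXLEN)
		return 0;
	while (total > 0) {
		size_t n = (total < txlen) ? total : txlen;
		size_t i;

		for (i = 0; i < n; i++) {   /* read phase: fill the buffer */
			buf[i] = *src;
			if (inc_src)
				src++;
		}
		for (i = 0; i < n; i++) {   /* write phase: drain the buffer */
			*dst = buf[i];
			if (inc_dst)
				dst++;
		}
		total -= n;
		blocks++;
	}
	return blocks;   /* the hardware would now go idle and interrupt */
}
```

With the figures quoted earlier, moving 1024 words would take on the order of
$18 \times 1024 = 18{,}432$ clocks in software, against roughly
$2 \times 1024 = 2{,}048$ clocks by DMA, bus contention aside.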
1768
 
1769
 
1770 21 dgisselq
\chapter{Operation}\label{chap:ops}
1771
 
1772 33 dgisselq
The Zip CPU, and even the Zip System, is not a System on a Chip (SoC).  It
1773
needs to be connected to its operational environment in order to be used.
1774
Specifically, some per system adjustments need to be made:
1775
\begin{enumerate}
1776
\item The Zip System depends upon an external 32-bit Wishbone bus.  This
1777
        must exist, and must be connected to the Zip CPU for it to work.
1778
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}.  This is
1779
        the program counter of the first instruction following a reset.
1780 69 dgisselq
\item To conserve logic, you'll want to set the {\tt ADDRESS\_WIDTH} parameter
1781
        to the number of address bits on your wishbone bus.
1782
\item Likewise, the {\tt LGICACHE} parameter sets the number of bits in
1783
        the instruction cache address.  This means that the instruction cache
1784
        will have $2^{\mbox{\tiny\tt LGICACHE}}$ locations within it.
1785 33 dgisselq
\item If you want the Zip System to start up on its own, you will need to
1786
        set the {\tt START\_HALTED} parameter to zero.  Otherwise, if you
1787
        wish to manually start the CPU, that is if upon reset you want the
1788
        CPU start start in its halted, reset state, then set this parameter to
1789 69 dgisselq
        one.  This latter configuration is useful for a CPU that should be
1790
        idle (i.e. halted) until given an explicit instruction from somewhere
1791
        else to start.
1792 33 dgisselq
\item The third parameter to set is the number of interrupts you will be
1793
        providing from external to the CPU.  This can be anything from one
1794 69 dgisselq
        to sixteen, but it cannot be zero.  (Set this to 1 and wire the single
1795
        interrupt line to a 1'b0 if you do not wish to support any external
1796
        interrupts.)
1797 33 dgisselq
\item Finally, you need to place into some wishbone accessible address, whether
1798
        RAM or (more likely) ROM, the initial instructions for the CPU.
1799
\end{enumerate}
If you have enabled your CPU to start automatically, then upon power up the
CPU will immediately start executing your instructions, starting at the given
{\tt RESET\_ADDRESS}.

This is, however, not how I have used the Zip CPU.  I have instead used the
Zip CPU in a more controlled environment.  For me, the CPU starts in a
halted state, and waits to be told to start.  Further, the RESET address is a
location in RAM.  After bringing up the board I am using, and further the
bus that is on it, the RAM memory is then loaded externally with the program
I wish the Zip System to run.  Once the RAM is loaded, I release the CPU.
The CPU then runs until either its halt condition or an exception occurs in
supervisor mode, at which point its task is complete.

Eventually, I intend to place an operating system onto the Zip System; I'm
just not there yet.

The rest of this chapter examines some common programming models, and how they
might be applied to the Zip System, and then finishes with a couple of examples.

\section{System High}
The easiest and simplest way to run the Zip CPU is in the system high mode.
In this mode, the CPU runs your program in supervisor mode from reboot to
power down, and is never interrupted.  You will need to poll the interrupt
controller to determine when any external condition has become active.  This
mode is useful, and can handle many microcontroller tasks.
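
A polling loop of this kind might look like the following C sketch.  The
function name is hypothetical, and the controller's address is system
specific and is left to the caller.  A plain variable can stand in for the
register when trying this on a host, though it will not mimic the
controller's clear-on-write behavior.

```c
#include <stdint.h>

/* Hypothetical polling helper for system high mode.  `pic` points at the
 * interrupt controller register; its address is system specific and is
 * not given here.  The loop spins until interrupt n (0..14) is flagged
 * active, then clears the flag by writing a one to that bit.  Writing bit
 * 31 as zero also clears the master enable, which is harmless here since
 * the CPU is never interrupted in this mode. */
void pic_wait_and_ack(volatile uint32_t *pic, unsigned n)
{
	const uint32_t active = 1u << n;

	while ((*pic & active) == 0)
		;                /* busy-wait: poll the controller */
	*pic = active;       /* write-one-to-clear the active flag */
}
```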

Even better, in system high mode, all of the user registers are available
to the system high program as variables.  A single-cycle {\tt MOV} can copy
a user register into the active (supervisor) register set, or copy it back.
While this may seem like a load or store instruction,
none of these register accesses will suffer from memory delays.

The one thing that cannot be done in supervisor mode is a wait for interrupt
instruction.  This, however, is easily rectified by jumping to a user task
within the supervisor's memory space, as shown in Tbl.~\ref{tbl:shi-idle}.
\begin{table}\begin{center}
\begin{tabbing}
{\tt supervisor\_idle:} \\
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
\>	{\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
\>	{\em ; outside of the CPU's address space when it switches to user} \\
\>	{\em ; mode.} \\
\>	{\tt MOV supervisor\_idle\_continue,uPC} \\
\>	{\em ; Put the processor into user mode and to sleep in the same} \\
\>	{\em ; instruction. } \\
\>	{\tt OR \$SLEEP|\$GIE,CC} \\
{\tt supervisor\_idle\_continue:} \\
\>	{\em ; Now, if we haven't done this inline, we need to return} \\
\>	{\em ; to whatever function called us.} \\
\>	{\tt RETN} \\
\end{tabbing}
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
\end{center}\end{table}

\section{Traditional Interrupt Handling}
Although the Zip CPU does not have a traditional interrupt architecture,
it is possible to create the more traditional interrupt approach via software.
In this mode, the programmable interrupt controller is used together with the
supervisor state to create the illusion of more traditional interrupt handling.

To set this up, upon reboot the supervisor task:
\begin{enumerate}
\item Creates a (single) user context, a user stack, and sets the user
	program counter to the entry of the user task.
\item Creates a task table of ISR entries.
\item Enables the master interrupt enable via the interrupt controller, albeit
	without enabling any of the fifteen potential underlying interrupts.
\item Switches to user mode, as the first part of the while loop in
	Tbl.~\ref{tbl:traditional-isr}.
\end{enumerate}
\begin{table}\begin{center}
\begin{tabbing}
{\tt while(true) \{} \\
\hbox to 0.25in{}\= {\tt rtu();}\\
	\> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\
	\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
	\> {\tt \} else \{} \\
	\> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\
\\
	\> \> {\em // Save the user context before running any ISRs.  This could easily be}\\
	\> \> {\em // implemented as an inline assembly routine or macro}\\
	\> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\
	\> \> {\em // At this point, we know an interrupt has taken place:  Ask the programmable}\\
	\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
	\> \>	{\tt int	picv = *pic;}\\
	\> \>	{\em // Turn off all active interrupts}\\
	\> \>	{\em // Globally disable interrupt generation in the process}\\
	\> \>	{\tt int	active = (picv >> 16) \& picv \& 0x07fff;}\\
	\> \>	{\tt *pic = (active<<16);}\\
	\> \>	{\em // We build a mask of interrupts to re-enable in picv.}\\
	\> \>	{\tt picv = 0;}\\
	\> \>	{\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
	\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
	\> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\
	\> \>\>\>	{\em // Acknowledge this particular interrupt.  While we could acknowledge all}\\
	\> \>\>\>	{\em // interrupts at once, by acknowledging only those with ISRs we allow}\\
	\> \>\>\>	{\em // the user process to use peripherals manually, and to manually check}\\
	\> \>\>\>	{\em // whether or not those other interrupts had occurred.}\\
	\> \>\>\>	{\tt *pic = msk; }\\
	\> \>\>\>	{\tt rtu(); }\\
	\> \>\>\>	{\em // The ISR will only exit on a trap in the Zip architecture.  There is}\\
	\> \>\>\>	{\em // no {\tt RETI} instruction.  Since the PIC holds all interrupts disabled,}\\
	\> \>\>\>	{\em // there is no need to check for further interrupts.}\\
	\> \>\>\>	{\em // }\\
	\> \>\>\>	{\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
	\>\>\>\>	{\em // re-enable its own interrupt without re-enabling all interrupts.  Hence, we}\\
	\>\>\>\>	{\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
	\> \>\>\>	{\em // re-enabled. }\\
	\> \>\>\>	{\tt mov(uR0,tmp); }\\
	\> \>\>\>	{\tt picv |= (tmp \& 0x7fff) << 16; }\\
	\> \>\>		{\tt \} }\\
	\> \>	{\tt \} }\\
	\> \>	{\tt RESTORE\_PARTIAL\_CONTEXT; }\\
	\> \>	{\em // Re-activate all (requested) interrupts }\\
	\> \>	{\tt *pic = picv | 0x80000000; }\\
	\>{\tt \} }\\
{\tt \}}\\
\end{tabbing}
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
\end{center}\end{table}

We can work through the interrupt handling process by examining
Tbl.~\ref{tbl:traditional-isr}.  First, remember, the CPU is always running
either the user or the supervisor context.  Once the supervisor switches to
user mode, control does not return until either an interrupt or a trap
has taken place.  (Okay, there's also the possibility of a bus error, or an
illegal instruction such as an unimplemented floating point instruction, but
for now we'll just focus on the trap instruction.)  Therefore, if the trap bit
isn't set, then we know an interrupt has taken place.

To process an interrupt, we steal the user's stack: the user's R0, CC, and PC
registers are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
\begin{table}\begin{center}
\begin{tabbing}
SAVE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
\>	{\tt MOV -3(uSP),R3} \\
\>	{\tt MOV uR0,R0} \\
\>	{\tt MOV uCC,R1} \\
\>	{\tt MOV uPC,R2} \\
\>	{\tt STO R0,(R3)} {\em ; Exploit memory pipelining: }\\
\>	{\tt STO R1,1(R3)} {\em ; All instructions write to stack }\\
\>	{\tt STO R2,2(R3)} {\em ; All offsets increment by one }\\
\>	{\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
\end{tabbing}
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
\end{center}\end{table}
This is much cheaper than the full context swap of a preemptive multitasking
kernel, but it also depends upon the ISR saving any state it uses.  Further,
if multiple ISRs get called at once, this loses its optimality property
very quickly.

As Sec.~\ref{sec:pic} discusses, the top half of the PIC register stores which
interrupts are enabled, and the bottom half stores which have tripped.
(Interrupts may trip without being enabled; they just will not generate an
interrupt to the CPU.)  Our first step is to query the register to find out
our interrupt state, and then to disable any interrupts that have tripped.  To
do that, we write ones to the enable bits of the tripped interrupts while also
clearing the top bit (master interrupt enable).  This has the consequence of
disabling any and all further interrupts, not just the ones that have tripped.
Hence, upon completion, we re-enable the master interrupt bit again.  Finally,
we keep track of which interrupts have tripped.

Using the bit mask of interrupts that have tripped, we walk through all fifteen
possible interrupts.  If there is an ISR installed, we acknowledge and reset
the interrupt within the PIC, and then call the ISR.  The ISR, however, cannot
re-enable its interrupt without re-enabling the master interrupt bit.  Thus,
to keep things simple, when the ISR is finished it places its interrupt
mask back into R0, or clears R0.  This tells the supervisor mode process which
interrupts to re-enable.  Any other registers that the ISR uses must be
saved and restored.  (This is only truly optimal if only a single ISR is
called.)  As a final instruction, the ISR clears the GIE bit by executing a
user trap.  (Remember, the Zip CPU has no {\tt RETI} instruction to restore the
stack and return to userland.  It needs to go through the supervisor mode to
get there.)
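
The bit manipulation in Tbl.~\ref{tbl:traditional-isr} is easy to get wrong,
so it may help to pull it out into pure functions that can be checked on a
host.  The following sketch restates those expressions in C; the function
names are hypothetical, and no hardware access is performed.

```c
#include <stdint.h>

/* Hypothetical, host-testable restatement of the PIC arithmetic used by
 * the traditional interrupt handler: no hardware access, only the values
 * the supervisor computes and writes. */

/* Interrupts that are both enabled (bits 16--30) and active (bits 0--14).
 * The 0x7fff mask drops bit 15, the "any active" indicator. */
uint32_t pic_pending(uint32_t picv)
{
	return (picv >> 16) & picv & 0x7fffu;
}

/* Word that disables the pending interrupts.  The master enable is left
 * clear, which globally disables interrupt generation while ISRs run. */
uint32_t pic_disable_pending(uint32_t pending)
{
	return pending << 16;
}

/* After an ISR returns its re-enable mask in R0, fold that mask into the
 * enable half of the word that will be written back at the end. */
uint32_t pic_collect_reenable(uint32_t picv_acc, uint32_t isr_r0)
{
	return picv_acc | ((isr_r0 & 0x7fffu) << 16);
}

/* Final write: requested enables plus the master enable. */
uint32_t pic_final_write(uint32_t picv_acc)
{
	return picv_acc | 0x80000000u;
}
```

For example, with interrupts 0 and 4 enabled and interrupts 0, 1, and 4
active (the register reading {\tt 0x80118013}), only interrupts 0 and 4 are
pending; if only the ISR for interrupt 4 asks to be re-enabled, the final
write is {\tt 0x80100000}.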

Then, once all interrupts are handled, the user context is restored in a
fashion similar to Tbl.~\ref{tbl:restore-partial}.
\begin{table}\begin{center}
\begin{tabbing}
RESTORE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
\>	{\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
\>	{\tt LOD (R3),R0} {\em ; Exploit memory pipelining: }\\
\>	{\tt LOD 1(R3),R1} {\em ; All instructions read from the stack }\\
\>	{\tt LOD 2(R3),R2} {\em ; All offsets increment by one }\\
\>	{\tt MOV R0,uR0} \\
\>	{\tt MOV R1,uCC} \\
\>	{\tt MOV R2,uPC} \\
\>	{\tt MOV 3(R3),uSP} {\em ; Pop the frame off the user stack } \\
\end{tabbing}
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
\end{center}\end{table}
Again, this is short and sweet simply because any other registers that needed
saving were saved within the ISR.
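
The stack frame shared by the save and restore sequences can be checked with a
small C model.  This sketch assumes a word-addressed stack held in an array;
the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of SAVE_PARTIAL_CONTEXT and RESTORE_PARTIAL_CONTEXT
 * on a word-addressed stack.  The frame holds uR0 at uSP-3, uCC at uSP-2
 * and uPC at uSP-1; restoring pops the frame so uSP returns to its
 * original value. */
typedef struct {
	uint32_t r0, cc, pc;  /* the user registers that are saved */
	uint32_t sp;          /* user stack pointer, as an index into mem[] */
} user_regs;

void save_partial(user_regs *u, uint32_t *mem)
{
	uint32_t r3 = u->sp - 3;  /* MOV -3(uSP),R3 */

	mem[r3]     = u->r0;      /* STO R0,(R3)    */
	mem[r3 + 1] = u->cc;      /* STO R1,1(R3)   */
	mem[r3 + 2] = u->pc;      /* STO R2,2(R3)   */
	u->sp = r3;               /* MOV R3,uSP     */
}

void restore_partial(user_regs *u, const uint32_t *mem)
{
	uint32_t r3 = u->sp;      /* MOV uSP,R3     */

	u->r0 = mem[r3];          /* load offset 0, then MOV R0,uR0 */
	u->cc = mem[r3 + 1];      /* load offset 1, then MOV R1,uCC */
	u->pc = mem[r3 + 2];      /* load offset 2, then MOV R2,uPC */
	u->sp = r3 + 3;           /* MOV 3(R3),uSP  */
}
```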

There you have it: the Zip CPU, with its non-traditional interrupt architecture,
can still process interrupts in a very traditional fashion.

\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.  On the Zip CPU, this task is quite simple,
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt idle\_task:} \\
&	{\em ; Wait for the next interrupt, then switch to supervisor task} \\
&	{\tt WAIT} \\
&	{\em ; When we come back, it's because the supervisor wishes to} \\
&	{\em ; wait for an interrupt again, so go back to the top.} \\
&	{\tt BRA idle\_task} \\
\end{tabular}
\caption{Example Idle Loop}\label{tbl:idle-asm}
\end{center}\end{table}
When this task runs, the CPU will fill up all of the pipeline stages up to the
ALU.  The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
a sleep state where nothing more moves.  Sure, there may be some more settling:
the prefetch cache may continue to read until it is full, and other
instructions may issue until the pipeline fills, but then everything will
stall.  Then, once an interrupt
takes place, control passes to the supervisor task to handle the interrupt.
When control passes back to this task, it will be on the next instruction.
Since that next instruction sends us back to the top of the task, the idle
task thus does nothing but wait for an interrupt.

This should be the lowest priority task, the task that runs when nothing else
can.  It will help lower the FPGA power usage overall, or at least its dynamic
power usage.

\section{Example: Memory Copy}
One common operation is that of a memory move or copy.  Consider the C code
shown in Tbl.~\ref{tbl:memcp-c}.
\begin{table}\begin{center}
\parbox{4in}{\begin{tabbing}
{\tt void} \= {\tt memcp(unsigned *dest, const unsigned *src, int len) \{} \\
	\> {\tt for(int i=0; i<len; i++)} \\
	\> \hspace{0.2in} {\tt *dest++ = *src++;} \\
{\tt \}}
\end{tabbing}}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
\end{center}\end{table}
This same code can be translated into Zip Assembly as shown in
Tbl.~\ref{tbl:memcp-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
memcp: \\
&	{\em ; R0 = dest, R1 = src, R2 = LEN, R3 = return addr} \\
&	{\em ; The following will operate in $12N+19$ clocks.} \\
&	{\tt CMP 0,R2} \\ % 8 clocks per setup
&	{\tt MOV.Z R3,PC} {\em ; A conditional return }\\
&	{\tt SUB 1,SP} {\em ; Create a stack frame}\\
&	{\tt STO R4,(SP)} {\em ; and a local variable}\\
&	{\em ; (4 stalls, cannot be further scheduled away)} \\
loop: \\ % 12 clocks per loop
&	{\tt LOD (R1),R4} \\
&	{\em ; (4 stalls, cannot be scheduled away)} \\
&	{\tt STO R4,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
&	{\tt SUB 1,R2} \\
&	{\tt BZ memcpend} \\
&	{\tt ADD 1,R0} \\
&	{\tt ADD 1,R1} \\
&	{\tt BRA loop} \\
&	{\em ; (1 stall on a BRA instruction)} \\
memcpend: \\ % 11 clocks
&	{\tt LOD (SP),R4} \\
&	{\em ; (4 stalls, cannot be further scheduled away)} \\
&	{\tt ADD 1,SP} \\
&	{\tt JMP R3} \\
&	{\em ; (4 stalls)} \\
\end{tabular}
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
\end{center}\end{table}
This example points out several things associated with the Zip CPU.  First,
a straightforward implementation of a for loop is not the fastest loop
structure.  For this reason, we have placed the test to continue at the
end.  Second, all pointers point to 32-bit words.  The Zip CPU does not have
explicit support for smaller or larger data types, and so this memory copy
cannot be applied at a byte level.
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.
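
For comparison, the following C sketch mirrors the structure of the assembly:
an early return when there is nothing to copy, followed by a loop whose
continuation test sits at the bottom.  The name {\tt memcp\_words} is
hypothetical, {\tt uint32\_t} stands in for the Zip CPU's 32-bit word, and
unlike the assembly (which only returns early when {\tt len} is zero) it also
treats a negative length as nothing to copy.

```c
#include <stdint.h>

/* Hypothetical C rendering of the memcp assembly above: a guard that
 * returns early (the CMP 0,R2 / MOV.Z R3,PC pair), then a loop that tests
 * its count at the bottom rather than the top. */
void memcp_words(uint32_t *dest, const uint32_t *src, int len)
{
	if (len <= 0)
		return;             /* nothing to copy */
	do {
		*dest++ = *src++;   /* LOD (R1),R4 ; STO R4,(R0) */
	} while (--len != 0);   /* SUB 1,R2 ; BZ memcpend */
}
```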

\section{Example: Context Switch}

Fundamental to any multiprocessing system is the ability to switch from one
task to the next.  In the Zip System, this is accomplished in one of a couple
of ways.  The first step is that an interrupt happens.  Anytime an interrupt
happens, the CPU needs to execute the following tasks in supervisor mode:
\begin{enumerate}
\item Check for a trap instruction, or other user exception such as a break,
	bus error, division by zero error, or floating point exception.  That
	is, if the user process needs attending then we may not wish to adjust
	the context, check interrupts, or call the scheduler.
	Tbl.~\ref{tbl:trap-check}
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt return\_to\_user:} \\
&	{\em ; The instruction before the context switch processing must} \\
&	{\em ; be the RTU instruction that enacted user mode in the first} \\
&	{\em ; place.  We show it here just for reference.} \\
&	{\tt RTU} \\
{\tt trap\_check:} \\
&	{\tt MOV uCC,R0} \\
&	{\tt TST \$TRAP \textbar \$BUSERR \textbar \$DIVE \textbar \$FPE,R0} \\
&	{\tt BZ swap\_out} {\em ; No trap or error: an interrupt, so swap tasks} \\
&	{\em ; Do something here to execute the trap} \\
&	{\em ; Don't need to call the scheduler, so we can just return} \\
&	{\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Checking whether the user task needs our attention}\label{tbl:trap-check}
\end{center}\end{table}
	shows the rudiments of this code, while showing nothing of how the
	actual trap would be implemented.

You may also wish to note that the instruction before the first instruction
in our context swap {\em must be} a return to userspace instruction.
Remember, the supervisor process is re-entered where it left off.  This is
different from many other processors that enter interrupt mode at some vector
or other.  In this case, we always enter supervisor mode right where we last
left.\footnote{The one exception to this rule is upon reset where supervisor
mode is entered at a pre-programmed Wishbone memory address.}

\item Capture user counters.  If the operating system is keeping track of
	system usage via the accounting counters, those counters need to be
	copied and accumulated into some master counter at this point.

\item Preserve the old context.  This involves pushing all the user registers
	onto the user stack and then copying the resulting stack address
	into the task's task structure, as shown in Tbl.~\ref{tbl:context-out}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_out:} \\
&	{\tt MOV -15(uSP),R5} \\
&	{\tt STO R5,stack(R12)} \\
&	{\tt MOV uR0,R0} \\
&	{\tt MOV uR1,R1} \\
&	{\tt MOV uR2,R2} \\
&	{\tt MOV uR3,R3} \\
&	{\tt MOV uR4,R4} \\
&	{\tt STO R0,(R5)} {\em ; Exploit memory pipelining: }\\
&	{\tt STO R1,1(R5)} {\em ; All instructions write to stack }\\
&	{\tt STO R2,2(R5)} {\em ; All offsets increment by one }\\
&	{\tt STO R3,3(R5)} {\em ; Longest pipeline is 5 cycles.}\\
&	{\tt STO R4,4(R5)} \\
	& \ldots {\em ; Need to repeat for all user registers} \\
\iffalse
&	{\tt MOV uR5,R0} \\
&	{\tt MOV uR6,R1} \\
&	{\tt MOV uR7,R2} \\
&	{\tt MOV uR8,R3} \\
&	{\tt MOV uR9,R4} \\
&	{\tt STO R0,5(R5) }\\
&	{\tt STO R1,6(R5) }\\
&	{\tt STO R2,7(R5) }\\
&	{\tt STO R3,8(R5) }\\
&	{\tt STO R4,9(R5)} \\
\fi
&	{\tt MOV uR10,R0} \\
&	{\tt MOV uR11,R1} \\
&	{\tt MOV uR12,R2} \\
&	{\tt MOV uCC,R3} \\
&	{\tt MOV uPC,R4} \\
&	{\tt STO R0,10(R5)}\\
&	{\tt STO R1,11(R5)}\\
&	{\tt STO R2,12(R5)}\\
&	{\tt STO R3,13(R5)}\\
&	{\tt STO R4,14(R5)} \\
&	{\em ; We can skip storing the stack, uSP, since it'll be stored}\\
&	{\em ; elsewhere (in the task structure) }\\
\end{tabular}
\caption{Example Storing User Task Context}\label{tbl:context-out}
\end{center}\end{table}
For the sake of discussion, we assume the supervisor maintains a
pointer to the current task's structure in supervisor register
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
structure indicating where the stack pointer is to be kept within it.

	For those who are still interested, the full code for this context
	save can be found as an assembler macro within the assembler
	include file, {\tt sys.i}.

\item Reset the watchdog timer.  If you are using the watchdog timer, it should
	be reset on every context swap, as evidence that things are still
	working.  Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
\begin{table}\begin{center}
\begin{tabular}{ll}
\multicolumn{2}{l}{{\tt `define WATCHDOG\_ADDRESS 32'hc000\_0002}}\\
\multicolumn{2}{l}{{\tt `define WATCHDOG\_TICKS 32'd1\_000\_000} {\em ; = 10~ms at 100~MHz}}\\
&	{\tt LDI WATCHDOG\_ADDRESS,R0} \\
&	{\tt LDI WATCHDOG\_TICKS,R1} \\
&	{\tt STO R1,(R0)}
\end{tabular}
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
\end{center}\end{table}

\item Interrupt handling.  An interrupt handler within the Zip System is nothing
        more than a task.  At context swap time, the supervisor needs to
        disable all of the interrupts that have tripped, and then enable
        all of the tasks that would deal with each of these interrupts.
        These handler tasks can even be user tasks, run at a higher priority
        than any other user task.  Either way, each handler will need to
        re--enable its own interrupt itself, if that interrupt is still
        relevant.

        An example of this master interrupt handling is shown in
        Tbl.~\ref{tbl:pre-handler}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt pre\_handler:} \\
&       {\tt LDI PIC\_ADDRESS,R0 } \\
&       {\em ; Start by grabbing the interrupt state from the interrupt}\\
&       {\em ; controller.  We'll store this into the register R7 so that }\\
&       {\em ; we can keep and preserve this information for the scheduler}\\
&       {\em ; to use later. }\\
&       {\tt LOD (R0),R1} \\
&       {\tt MOV R1,R7 } \\
&       {\em ; As a next step, we need to acknowledge and disable all active}\\
&       {\em ; interrupts. We'll start by calculating all of our active}\\
&       {\em ; interrupts.}\\
&       {\tt AND 0x07fff,R1 } \\
&       {\em ; Put the active interrupts into the upper half of R1} \\
&       {\tt ROL 16,R1 } \\
&       {\tt LDILO 0x0ffff,R1   } \\
&       {\tt AND R7,R1}\\
&       {\em ; Acknowledge and disable active interrupts}\\
&       {\em ; This also disables all interrupts from the controller, so}\\
&       {\em ; we'll need to re-enable interrupts in general shortly } \\
&       {\tt STO R1,(R0) } \\
&       {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
&       {\em ; release any tasks that depended upon them. } \\
\end{tabular}
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
\end{center}\end{table}

\item Calling the scheduler.  This needs to be done to pick the next task
        to switch to.  It may be an interrupt handler, or it may be a normal
        user task.  From a priority standpoint, it would make sense that the
        interrupt handlers all have a higher priority than the user tasks,
        and that once they have been called the user tasks may then be called
        again.  If no task is ready to run, run the idle task to wait for an
        interrupt.

        This suggests a minimum of four task priorities:
        \begin{enumerate}
        \item Interrupt handlers, executed with their interrupts disabled
        \item Device drivers, executed with interrupts re-enabled
        \item User tasks
        \item The idle task, executed when nothing else is able to execute
        \end{enumerate}

        For our purposes here, we'll just assume that a pointer to the current
        task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
        called, and that a pointer to the next task to run is likewise
        placed into {\tt R12}.

\item Restore the new task's context.  Given that the scheduler has returned a
        task that can be run at this time, the stack pointer needs to be
        pulled out of the task's structure and placed into the user stack
        pointer, {\tt uSP}, and then the rest of the user registers need to be
        popped back off of the stack to run this task.  An example of this is
        shown in Tbl.~\ref{tbl:context-in},
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_in:} \\
&       {\tt LOD stack(R12),R5} \\
&       {\tt MOV 15(R5),uSP} \\
        & {\em ; Be sure to exploit the memory pipelining capability} \\
&       {\tt LOD (R5),R0} \\
&       {\tt LOD 1(R5),R1} \\
&       {\tt LOD 2(R5),R2} \\
&       {\tt LOD 3(R5),R3} \\
&       {\tt LOD 4(R5),R4} \\
&       {\tt MOV R0,uR0} \\
&       {\tt MOV R1,uR1} \\
&       {\tt MOV R2,uR2} \\
&       {\tt MOV R3,uR3} \\
&       {\tt MOV R4,uR4} \\
        & \ldots {\em ; Need to repeat for all user registers} \\
&       {\tt LOD 10(R5),R0} \\
&       {\tt LOD 11(R5),R1} \\
&       {\tt LOD 12(R5),R2} \\
&       {\tt LOD 13(R5),R3} \\
&       {\tt LOD 14(R5),R4} \\
&       {\tt MOV R0,uR10} \\
&       {\tt MOV R1,uR11} \\
&       {\tt MOV R2,uR12} \\
&       {\tt MOV R3,uCC} \\
&       {\tt MOV R4,uPC} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Example Restoring User Task Context}\label{tbl:context-in}
\end{center}\end{table}
        assuming as before that the task
        pointer is found in supervisor register {\tt R12}.
        As with storing the user context, the full code associated with
        restoring the user context can be found in the assembler include
        file, {\tt sys.i}.

\item Clear the userspace accounting registers.  In order to keep track of
        per-process system usage, these registers need to be cleared before
        reactivating the userspace process.  That way, upon the next
        interrupt, we'll know how many clocks the userspace program has
        consumed, and how many instructions it was able to issue in
        those clocks.

\item Jump back to the instruction just before saving the last task's context,
        because that location in memory contains the return from interrupt
        command that we are going to need to execute, in order to guarantee
        that we return back here again.
\end{enumerate}
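The save and restore steps above can be modeled on a host with the following C sketch.  It treats the saved user context as fifteen 32--bit words: {\tt uR0} through {\tt uR12} at offsets 0 to 12, {\tt uCC} at 13 and {\tt uPC} at 14, matching the offsets in Tbls.~\ref{tbl:context-out} and~\ref{tbl:context-in}.  It assumes that the frame sits immediately below the user stack pointer, so that {\tt uSP} is the frame address plus 15, as the restore example suggests.  The type and function names are hypothetical, and memory is modeled as a plain word array rather than the real bus.

```c
#include <stdint.h>

enum { FRAME_WORDS = 15, REG_SP = 13, REG_CC = 14, REG_PC = 15 };

/* Hypothetical model of the user register file: R0..R12, SP, CC, PC. */
typedef struct {
    uint32_t r[16];
} user_regs;

/* Hypothetical task structure: only the saved frame address is modeled. */
typedef struct {
    uint32_t stack; /* word address of the 15-word saved frame */
} task_ctx;

/* Frame slot of each saved register: R0..R12 -> 0..12, CC -> 13, PC -> 14. */
static int frame_slot(int reg) {
    return (reg < REG_SP) ? reg : reg - 1;
}

/* Save: store every register except uSP just below uSP, record the frame. */
static void save_context(uint32_t *mem, task_ctx *t, const user_regs *u) {
    uint32_t frame = u->r[REG_SP] - FRAME_WORDS;
    for (int reg = 0; reg < 16; reg++) {
        if (reg == REG_SP)
            continue; /* uSP is implied by the frame address */
        mem[frame + frame_slot(reg)] = u->r[reg];
    }
    t->stack = frame;
}

/* Restore: reload the registers from the frame, then uSP = frame + 15. */
static void restore_context(const uint32_t *mem, const task_ctx *t,
                            user_regs *u) {
    uint32_t frame = t->stack;
    for (int reg = 0; reg < 16; reg++) {
        if (reg == REG_SP)
            continue;
        u->r[reg] = mem[frame + frame_slot(reg)];
    }
    u->r[REG_SP] = frame + FRAME_WORDS;
}
```

Keeping only the frame address in the task structure is what lets both routines skip {\tt uSP} itself.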
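The bit manipulation in {\tt pre\_handler} (Tbl.~\ref{tbl:pre-handler}) may be easier to follow in C.  The sketch below mirrors, one line per instruction, the four steps that build the acknowledgement word from the value read out of the interrupt controller; the function names are hypothetical.  Because only the low 15~bits of the state are rotated into the enable field, bit~31 of the result is always zero, which is why writing it back also turns off the master interrupt enable.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32), like ROL. */
static uint32_t rol32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

/* Hypothetical C rendering of the word pre_handler writes back to the PIC,
 * given the value it read (kept in R7). */
static uint32_t pic_ack_word(uint32_t pic) {
    uint32_t r1 = pic & 0x07fffu;          /* AND 0x07fff,R1: tripped lines   */
    r1 = rol32(r1, 16);                    /* ROL 16,R1: into enable field    */
    r1 = (r1 & 0xffff0000u) | 0x0000ffffu; /* LDILO 0x0ffff,R1: low half = 1s */
    return r1 & pic;                       /* AND R7,R1                       */
}
```

For example, with lines 0 and 1 tripped and lines 0 through 3 enabled, the controller reads {\tt 0x800f8003} and the word written back is {\tt 0x00038003}: both states are acknowledged and both enables are cleared.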
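The four priority classes listed above suggest a very simple selection rule: scan the classes from highest to lowest and run the first ready task found.  The sketch below, with hypothetical types and names, shows only that rule.  It keeps no round--robin state, and it assumes the idle task is always ready so that something is always returned.

```c
#include <stddef.h>

/* The four priority classes suggested above; a lower value runs first. */
typedef enum {
    PRI_INTERRUPT = 0, /* interrupt handlers, their interrupts disabled */
    PRI_DRIVER,        /* device drivers, interrupts re-enabled */
    PRI_USER,          /* ordinary user tasks */
    PRI_IDLE,          /* waits for an interrupt */
    PRI_COUNT
} priority;

/* Hypothetical task record for this sketch only. */
typedef struct {
    const char *name;
    priority pri;
    int ready;
} task;

/* Return the first ready task in the highest non-empty priority class. */
static const task *schedule(const task *tasks, size_t n) {
    for (int p = PRI_INTERRUPT; p < PRI_COUNT; p++)
        for (size_t i = 0; i < n; i++)
            if ((int)tasks[i].pri == p && tasks[i].ready)
                return &tasks[i];
    return NULL; /* only if no idle task was supplied */
}
```

A real scheduler would likely rotate among equal--priority tasks; this scan always favors the earliest entry in the table.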

\chapter{Registers}\label{chap:regs}

The ZipSystem registers fall into two categories: the ZipSystem internal
registers accessed via the ZipCPU, shown in Tbl.~\ref{tbl:zpregs},
\begin{table}[htbp]
\begin{center}\begin{reglist}
PIC   & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT   & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
BUS   & \scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus error \\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA  & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRB  & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRC  & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
JIFF  & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
MTASK  & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
MMSTL  & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
MPSTL  & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MICNT  & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
UTASK  & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL  & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UPSTL  & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT  & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL  & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMALEN  & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMASRC  & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMADST  & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
% Cache  & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
\end{reglist}
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
\begin{table}[htbp]
\begin{center}\begin{reglist}
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
\end{reglist}
\caption{Zip System Debug Registers}\label{tbl:dbgregs}
\end{center}\end{table}

\section{Peripheral Registers}
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are mapped into the
CPU's address space.  These may be accessed by the CPU at these addresses,
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
These registers will be discussed briefly again here.

\subsection{Interrupt Controller(s)}
The Zip CPU Interrupt controller has four different types of bits, as shown in
Tbl.~\ref{tbl:picbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Master Interrupt Enable\\\hline
30\ldots 16 & R/W & Interrupt Enables, write `1' to change\\\hline
15 & R & Current Master Interrupt State\\\hline
14\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline
\end{bitlist}
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
\end{center}\end{table}
The high order bit, or bit--31, is the master interrupt enable bit.  When this
bit is set, any time an enabled interrupt occurs the CPU will be interrupted
and will switch to supervisor mode.

Bits 30~\ldots 16 are interrupt enable bits.  Should the interrupt line go
high while enabled, an interrupt will be generated.  (All interrupts are
positive edge triggered.)  To set an interrupt enable bit, write a `1' to the
master interrupt enable while also writing a `1' to that bit.  To clear it,
write a `0' to the master interrupt enable while leaving that bit high.

Bits 14\ldots 0 are the current state of the interrupt vector.  Interrupt lines
trip when they go high, and remain tripped until they are acknowledged.  If
the interrupt goes high for longer than one pulse, it may still be high when a
clear is requested.  If so, the interrupt will not clear.  The line must go low
again before the status bit can be cleared.

As an example, consider the following scenario where the Zip CPU supports four
interrupts, 3\ldots0.
\begin{enumerate}
\item The Supervisor will first, while in the interrupts disabled mode,
        write a {\tt 32'h800f000f} to the controller.  The supervisor may then
        switch to the user state with interrupts enabled.
\item When an interrupt occurs, the supervisor will switch to the interrupt
        state.  It will then cycle through the interrupt bits to learn which
        interrupt handler to call.
\item If the interrupt handler expects more interrupts, it will clear its
        current interrupt when it is done handling the interrupt in question.
        To do this, it will write a `1' to the low order interrupt mask,
        such as writing a {\tt 32'h0000\_0001}.
\item If the interrupt handler does not expect any more interrupts, it will
        instead both clear and disable the interrupt at the controller by
        writing a {\tt 32'h0001\_0001} to the controller.
\item Once all interrupts have been handled, the supervisor will write a
        {\tt 32'h8000\_0000} to the interrupt register to re-enable interrupt
        generation.
\item The supervisor should also check the user trap bit, and possibly the
        soft interrupt bits, here, but this action has nothing to do with the
        interrupt control register.
\item The supervisor will then leave interrupt mode, possibly adjusting
        whichever task is running, by executing a return from interrupt
        command.
\end{enumerate}
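The control words used in this scenario follow directly from Tbl.~\ref{tbl:picbits}.  A minimal C sketch that composes them is shown below; the macro and function names are hypothetical, and the enable mask is assumed to cover at most the fifteen lines that fit in the enable field.

```c
#include <stdint.h>

#define PIC_MIE       0x80000000u        /* bit 31: master interrupt enable */
#define PIC_ENABLE(n) (1u << (16 + (n))) /* bits 30..16: per-line enables   */
#define PIC_STATE(n)  (1u << (n))        /* low bits: tripped state, 1 clears */

/* Enable the lines in mask and clear any stale state (step 1). */
static uint32_t pic_enable_word(uint32_t mask) {
    mask &= 0x7fffu; /* at most fifteen lines */
    return PIC_MIE | (mask << 16) | mask;
}

/* Acknowledge line n but leave it enabled (step 3). */
static uint32_t pic_clear_word(int n) {
    return PIC_STATE(n);
}

/* Acknowledge and disable line n: MIE written as 0, enable bit high (step 4). */
static uint32_t pic_disable_word(int n) {
    return PIC_ENABLE(n) | PIC_STATE(n);
}
```

Step~5 is simply {\tt PIC\_MIE} written on its own.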

\subsection{Timer Register}

Leaving the interrupt controller, we show the timer register's bit definitions
in Tbl.~\ref{tbl:tmrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Auto-Reload\\\hline
30\ldots 0 & R/W & Current timer value\\\hline
\end{bitlist}
\caption{Timer Register Bits}\label{tbl:tmrbits}
\end{center}\end{table}
As you may recall, the timer just counts down to zero and then trips an
interrupt.  Writing to the current timer value sets that value, and reading
from it returns that value.  Writing to the current timer value while also
setting the auto--reload bit will send the timer into an auto--reload mode.
In this mode, upon setting its interrupt bit for one cycle, the timer will
also reset itself back to the value that was written to it along with the
auto--reload bit.  To clear and stop the timer, simply write a {\tt 32'h00}
to this register.
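As a sketch of the timer register's encoding, the C fragment below builds the value to write and models the count--down and auto--reload behavior one clock at a time.  The model is a simplification: the names are hypothetical, and the exact clock on which the interrupt trips is assumed here rather than taken from the RTL.

```c
#include <stdint.h>

#define TMR_AUTO_RELOAD 0x80000000u /* bit 31 */
#define TMR_VALUE_MASK  0x7fffffffu /* bits 30..0 */

/* Value to write to TMRA, TMRB or TMRC. */
static uint32_t timer_word(uint32_t ticks, int auto_reload) {
    return (ticks & TMR_VALUE_MASK) | (auto_reload ? TMR_AUTO_RELOAD : 0u);
}

/* Hypothetical behavioral model of one timer. */
typedef struct {
    uint32_t value;  /* current count */
    uint32_t reload; /* value written along with the auto-reload bit */
    int auto_reload;
} timer_model;

static void timer_write(timer_model *t, uint32_t word) {
    t->value = word & TMR_VALUE_MASK;
    t->reload = t->value;
    t->auto_reload = (word & TMR_AUTO_RELOAD) != 0;
}

/* Advance one clock; returns 1 on the clock the interrupt trips. */
static int timer_tick(timer_model *t) {
    if (t->value == 0)
        return 0;             /* cleared and stopped */
    if (--t->value != 0)
        return 0;
    if (t->auto_reload)
        t->value = t->reload; /* restart from the written value */
    return 1;
}
```

For instance, {\tt timer\_word(1000000, 1)} yields {\tt 0x800f4240}, a 10~ms periodic interrupt at 100~MHz.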

\subsection{Jiffies}

The Jiffies register is somewhat similar in that the register always changes.
In this case, however, the register counts up, whereas the timer counts down.
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R & Current jiffy value\\\hline
31\ldots 0 & W & Value/time of next interrupt\\\hline
\end{bitlist}
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
\end{center}\end{table}
always return the time value contained in the register.  Writes greater than
the current Jiffy value, that is, writes where the new value minus the old
value is greater than zero when the 32--bit difference is treated as a signed
number, will set a new Jiffy interrupt time.  At that time, the Jiffy
interrupt will trip and the pending interrupt time will clear.  Another
interrupt time may then be written to the register; otherwise it will just
continue counting without activating any more interrupts.
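This comparison is the usual wrap--safe test on a free--running counter.  The C sketch below, with hypothetical function names, performs the subtraction in unsigned 32--bit arithmetic and then checks that the difference is positive when read as a signed value, so a target just past a wrap of the counter still counts as lying in the future.

```c
#include <stdint.h>

/* Would writing `target` to JIFF, while the register holds `current`,
 * set a new interrupt time?  True when (target - current), taken modulo
 * 2^32, is positive as a signed 32-bit value. */
static int jiffy_is_future(uint32_t current, uint32_t target) {
    uint32_t diff = target - current;        /* wraps modulo 2^32 */
    return diff != 0u && diff < 0x80000000u; /* 0 < (signed) diff */
}

/* Hypothetical helper: the target for an interrupt `ticks` clocks from now. */
static uint32_t jiffy_deadline(uint32_t current, uint32_t ticks) {
    return current + ticks; /* may wrap, which the comparison tolerates */
}
```

Doing the check on an unsigned difference avoids relying on signed overflow, and keeps deadlines valid as long as they lie less than $2^{31}$ clocks ahead.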

\subsection{Performance Counters}

The Zip CPU also supports several counter peripherals, mostly in the way of
process accounting.  These peripherals each have a single register associated
with them, shown in Tbl.~\ref{tbl:ctrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R/W & Current counter value\\\hline
\end{bitlist}
\caption{Counter Register Bits}\label{tbl:ctrbits}
\end{center}\end{table}
Writes to this register set the new counter value.  Reads return the current
counter value.

The current design operation of these counters is that of performance counting.
Two sets of four registers, one master set and one user set, are available for
keeping track of performance.  The first is a task counter, which just counts
clock ticks.  The second is a memory stall counter, and the third a prefetch
stall counter.  These allow the CPU to be evaluated as to how efficient it is.
The fourth and final counter is an instruction counter, which counts how many
instructions the CPU has issued.

It is envisioned that these counters will be used as follows: First, every time
a master counter rolls over, the supervisor (Operating System) will record
the fact.  Second, whenever activating a user task, the Operating System will
set the four user counters to zero.  When the user task has completed, the
Operating System will read the counters back off, to determine how much of the
CPU the process had consumed.  To keep this accurate, the user counters will
only increment when the GIE bit is set to indicate that the processor is
in user mode.
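The accounting scheme described above can be sketched in C as follows.  The supervisor extends a master counter to 64~bits by counting its rollovers, and derives a figure of merit from a snapshot of the four user counters read back after a task has run.  The structure and function names are hypothetical, and instructions per thousand clocks is only one example of what can be computed from these counters.

```c
#include <stdint.h>

/* Hypothetical supervisor bookkeeping for one 32-bit master counter. */
typedef struct {
    uint32_t rollovers; /* incremented each time the counter wraps */
} master_count;

/* 64-bit total from the recorded rollovers and the current counter value. */
static uint64_t master_total(const master_count *m, uint32_t counter_now) {
    return ((uint64_t)m->rollovers << 32) | counter_now;
}

/* Snapshot of the user counters, zeroed when the task was activated. */
typedef struct {
    uint32_t utask; /* UTASK: user-mode clocks */
    uint32_t umstl; /* UMSTL: memory stall clocks */
    uint32_t upstl; /* UPSTL: prefetch stall clocks */
    uint32_t uicnt; /* UICNT: instructions issued */
} user_counters;

/* Instructions issued per 1000 user-mode clocks, in integer arithmetic. */
static uint32_t insns_per_kiloclock(const user_counters *c) {
    if (c->utask == 0)
        return 0;
    return (uint32_t)(((uint64_t)c->uicnt * 1000u) / c->utask);
}
```

Integer arithmetic is used so the same calculation could run on a core without floating point support.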

\subsection{DMA Controller}

The final peripheral to discuss is the DMA controller.  This controller
has four registers.  Of these four, the length, source and destination address
registers should need no further explanation.  They are full 32--bit registers
specifying the entire transfer length, the starting address to read from, and
the starting address to write to.  The registers can be written to when the
DMA is idle, and read at any time.  The control register, however, will need
some more explanation.

The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R & DMA Active\\\hline
30 & R & Wishbone error, transaction aborted.  This bit is cleared the next time
        this register is written to.\\\hline
29 & R/W & Set to `1' to prevent the controller from incrementing the source address, `0' for normal memory copy. \\\hline
28 & R/W & Set to `1' to prevent the controller from incrementing the
        destination address, `0' for normal memory copy. \\\hline
27 \ldots 16 & W & The DMA Key.  Write a 12'hfed to these bits to activate
        any DMA transfer.  \\\hline
27 & R & Always reads `0', to force the deliberate writing of the key. \\\hline
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
        have yet to be written. \\\hline
15 & R/W & Set to `1' to trigger on an interrupt, or `0' to start immediately
        upon receiving a valid key.\\\hline
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
9\ldots 0 & R/W & Intermediate transfer length minus one.  Thus, to transfer
        one item at a time set this value to 0. To transfer 1024 at a time,
        set it to 1023.\\\hline
\end{bitlist}
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
\end{center}\end{table}
This control register has been designed so that the common case of memory
access need only set the key and the transfer length.  Hence, writing a
\hbox{32'h0fed03ff} to the control register will start any memory transfer,
moving up to 1024~words at a time.
On the other hand, if you wished to read from a serial port (constant address)
and put the result into a buffer every time a word was available, you
might wish to write \hbox{32'h2fed8000}.  This assumes, of course, that you
have a serial port wired to interrupt line zero of the DMA controller.  (The
DMA controller does not use the interrupt controller, and cannot clear
interrupts.)  As a third example, if you wished to write to an external
FIFO anytime it was less than half full (had fewer than 512 items), and
interrupt line 3 indicated this condition, you might wish to issue a
\hbox{32'h1fed8dff} to this port, transferring 512~words each time.
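The three example control words above can be reproduced from Tbl.~\ref{tbl:dmacbits} with a small C helper.  The sketch below is hypothetical (the names are not from the Zip System sources); it takes the intermediate transfer length as a count from 1 to 1024 and stores that count minus one in bits 9\ldots 0, as the table requires.

```c
#include <stdint.h>

#define DMA_KEY          0xfedu     /* must be written to bits 27..16 */
#define DMA_CONST_SRC    (1u << 29) /* don't increment the source address */
#define DMA_CONST_DST    (1u << 28) /* don't increment the destination */
#define DMA_ON_INTERRUPT (1u << 15) /* wait for the selected interrupt */

/* Build a DMA control word; chunk is the intermediate length, 1..1024. */
static uint32_t dma_ctrl_word(int const_src, int const_dst, int on_interrupt,
                              unsigned int_line, unsigned chunk) {
    uint32_t w = DMA_KEY << 16;
    if (const_src)
        w |= DMA_CONST_SRC;
    if (const_dst)
        w |= DMA_CONST_DST;
    if (on_interrupt)
        w |= DMA_ON_INTERRUPT;
    w |= (int_line & 0x1fu) << 10; /* bits 14..10: interrupt select */
    w |= (chunk - 1u) & 0x3ffu;    /* bits 9..0: length minus one */
    return w;
}
```

With this helper, the memory copy is {\tt dma\_ctrl\_word(0,0,0,0,1024)}, the serial port read is {\tt dma\_ctrl\_word(1,0,1,0,1)}, and the FIFO fill is {\tt dma\_ctrl\_word(0,1,1,3,512)}.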

\section{Debug Port Registers}
Accessing the Zip System via the debug port isn't as straightforward as
accessing the system via the wishbone bus.  The debug port itself has been
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
Access to the Zip System begins with the Debug Control register, shown in
Tbl.~\ref{tbl:dbgctrl}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 14 & R & External interrupt state.  Bit 14 is valid for one
        interrupt only, bit 15 for two, etc.\\\hline
13 & R & CPU GIE setting\\\hline
12 & R & CPU is sleeping\\\hline
11 & W & Command clear PF cache\\\hline
10 & R/W & Command HALT, Set to `1' to halt the CPU\\\hline
9 & R & Stall Status, `1' if CPU is busy (i.e., not halted yet)\\\hline
8 & R/W & Step Command, set to `1' to step the CPU, also sets the halt bit\\\hline
7 & R & Interrupt Request Pending\\\hline
6 & R/W & Command RESET \\\hline
5\ldots 0 & R/W & Debug Register Address \\\hline
\end{bitlist}
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
\end{center}\end{table}

The first step in debugging access is to determine whether or not the CPU
is halted, and to halt it if not.  To do this, first write a `1' to the
Command HALT bit.  This will halt the CPU and place it into debug mode.
Once the CPU is halted, the stall status bit will drop to zero.  Thus,
if bit 10 is high and bit 9 low, the debug port is open to examine the
internal state of the CPU.

At this point, the external debugger may examine internal state information
from within the CPU.  To do this, first write again to the command register
a value (with command halt still high) containing the address of an internal
register of interest in the bottom 6~bits.  Internal registers that may be
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
\begin{table}\begin{center}
\begin{reglist}
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
uPC & 31 & 32 & R/W & User Program Counter\\\hline
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
BUS & 34 & 32 & R & Last Bus Error\\\hline
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
TMRA & 36 & 32 & R/W & Timer A\\\hline
TMRB & 37 & 32 & R/W & Timer B\\\hline
TMRC & 38 & 32 & R/W & Timer C\\\hline
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
DMACMD & 48 & 32 & R/W & DMA command and status register\\\hline
DMALEN & 49 & 32 & R/W & DMA transfer length\\\hline
DMARD & 50 & 32 & R/W & DMA read address\\\hline
DMAWR & 51 & 32 & R/W & DMA write address\\\hline
\end{reglist}
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
\end{center}\end{table}
Primarily, these ``registers'' include access to the entire CPU register
set, as well as the internal peripherals.  To read one of these registers
once the address is set, simply issue a read from the data port.  To write
one of these registers or peripheral ports, simply write to the data port
after setting the proper address.
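The debug addresses in Tbl.~\ref{tbl:dbgaddrs} follow a regular pattern: supervisor registers at 0 through 15, user registers at 16 through 31, and each peripheral at 32 plus its offset from {\tt 0xc0000000}.  The C sketch below encodes that pattern together with the select--then--access sequence just described.  The bus accessor type is hypothetical: it stands in for whatever drives the two--word debug port, with address~0 for {\tt ZIPCTRL} and address~1 for {\tt ZIPDATA}.

```c
#include <stdint.h>

#define DBG_CTRL    0u         /* debug port address of ZIPCTRL */
#define DBG_DATA    1u         /* debug port address of ZIPDATA */
#define DBG_HALT    (1u << 10) /* keep the CPU halted while selecting */
#define PERIPH_BASE 0xc0000000u

/* Debug register addresses (bits 5..0 of ZIPCTRL). */
static unsigned dbg_sreg(unsigned n) { return n & 0x0fu; }         /* sR0..sPC */
static unsigned dbg_ureg(unsigned n) { return 16u + (n & 0x0fu); } /* uR0..uPC */
static unsigned dbg_periph(uint32_t bus_addr) {                    /* PIC..DMAWR */
    return 32u + (unsigned)(bus_addr - PERIPH_BASE);
}

/* Hypothetical accessors for the two-word debug port. */
typedef struct {
    uint32_t (*read)(unsigned addr);
    void (*write)(unsigned addr, uint32_t value);
} dbg_bus;

/* Select a register (with HALT still set), then read it from the data port. */
static uint32_t dbg_read_reg(const dbg_bus *b, unsigned reg) {
    b->write(DBG_CTRL, DBG_HALT | (reg & 0x3fu));
    return b->read(DBG_DATA);
}

/* Select a register (with HALT still set), then write it via the data port. */
static void dbg_write_reg(const dbg_bus *b, unsigned reg, uint32_t value) {
    b->write(DBG_CTRL, DBG_HALT | (reg & 0x3fu));
    b->write(DBG_DATA, value);
}
```

For example, selecting {\tt uPC} writes {\tt 0x41f} to the control register: the halt bit plus address~31.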

In this manner, all of the CPU's internal state may be read and adjusted.

As an example of how to use this, consider what would happen in the case
of an external break point.  If and when the CPU hits a break point that
causes it to halt, the Command HALT bit will activate on its own, and the CPU
will then raise an external interrupt line and wait for a debugger to examine
its state.  After examining the state, the debugger will need to remove
the breakpoint by writing a different instruction into memory, and then
write to the command register while holding the clear cache, command halt, and
step CPU bits high (32'hd00).  Now that the CPU has stepped beyond the
breakpoint, the debugger may replace it and clear the cache again while
keeping the CPU halted (32'hc00).

To leave this debug mode, simply write a `32'h0' value to the command register.
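The halt check and the breakpoint sequence reduce to a handful of bit masks from Tbl.~\ref{tbl:dbgctrl}.  The C sketch below polls the control register until the halt bit is set and the stall bit has cleared.  The read and write callbacks are hypothetical stand--ins for the debug bus, and the poll limit is an arbitrary safeguard rather than part of the interface.

```c
#include <stdint.h>

#define DBG_CLR_CACHE (1u << 11) /* W: clear the prefetch cache */
#define DBG_HALT      (1u << 10) /* R/W: halt the CPU */
#define DBG_STALL     (1u << 9)  /* R: CPU still busy, not yet halted */
#define DBG_STEP      (1u << 8)  /* R/W: single step, also sets HALT */

/* The debug port is open once HALT is set and the stall bit has dropped. */
static int dbg_is_halted(uint32_t ctrl) {
    return (ctrl & DBG_HALT) != 0 && (ctrl & DBG_STALL) == 0;
}

/* Hypothetical callbacks that read and write ZIPCTRL. */
typedef uint32_t (*ctrl_read_fn)(void);
typedef void (*ctrl_write_fn)(uint32_t value);

/* Request a halt, then poll until the CPU reports it has stopped.
 * Returns 1 on success, 0 if max_polls reads elapse first. */
static int dbg_halt(ctrl_read_fn rd, ctrl_write_fn wr, int max_polls) {
    wr(DBG_HALT);
    for (int i = 0; i < max_polls; i++)
        if (dbg_is_halted(rd()))
            return 1;
    return 0;
}

/* Command words for stepping past a breakpoint and re-clearing the cache. */
static const uint32_t DBG_STEP_PAST = DBG_CLR_CACHE | DBG_HALT | DBG_STEP;
static const uint32_t DBG_RECLEAR   = DBG_CLR_CACHE | DBG_HALT;
```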

\chapter{Wishbone Datasheets}\label{chap:wishbone}
The Zip System supports two wishbone ports, a slave debug port and a master
port for the system itself.  These are shown in Tbl.~\ref{tbl:wishbone-slave}
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Slave, Read/Write, single words only \\\hline
Address Width & 1--bit \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
                XuLA2--LX25\\\hline
Signal Names & \begin{tabular}{ll}
                Signal Name & Wishbone Equivalent \\\hline
                {\tt i\_clk} & {\tt CLK\_I} \\
                {\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
                {\tt i\_dbg\_stb} & {\tt STB\_I} \\
                {\tt i\_dbg\_we} & {\tt WE\_I} \\
                {\tt i\_dbg\_addr} & {\tt ADR\_I} \\
                {\tt i\_dbg\_data} & {\tt DAT\_I} \\
                {\tt o\_dbg\_ack} & {\tt ACK\_O} \\
                {\tt o\_dbg\_stall} & {\tt STALL\_O} \\
                {\tt o\_dbg\_data} & {\tt DAT\_O}
                \end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
\end{center}\end{table}
and Tbl.~\ref{tbl:wishbone-master} respectively.
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
Address Width & (Zip System parameter, can be up to 32~bits) \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
                XuLA2--LX25\\\hline
Signal Names & \begin{tabular}{ll}
                Signal Name & Wishbone Equivalent \\\hline
                {\tt i\_clk} & {\tt CLK\_I} \\
                {\tt o\_wb\_cyc} & {\tt CYC\_O} \\
                {\tt o\_wb\_stb} & {\tt STB\_O} \\
                {\tt o\_wb\_we} & {\tt WE\_O} \\
                {\tt o\_wb\_addr} & {\tt ADR\_O} \\
                {\tt o\_wb\_data} & {\tt DAT\_O} \\
                {\tt i\_wb\_ack} & {\tt ACK\_I} \\
                {\tt i\_wb\_stall} & {\tt STALL\_I} \\
                {\tt i\_wb\_data} & {\tt DAT\_I} \\
                {\tt i\_wb\_err} & {\tt ERR\_I}
                \end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
\end{center}\end{table}
I do not recommend that you connect these together through the interconnect.
Rather, the debug port of the CPU should be accessible regardless of the state
of the master bus.

You may wish to notice that neither the {\tt LOCK} nor the {\tt RTY} (retry)
wires have been connected to the CPU's master interface.  If necessary, a
rudimentary {\tt LOCK} may be created by tying the wire to the {\tt o\_wb\_cyc}
line.  As for the {\tt RTY}, all the CPU recognizes at this point are bus
errors: it cannot tell the difference between a temporary and a permanent bus
error.

\chapter{Clocks}\label{chap:clocks}

This core is based upon the Basys--3 development board sold by Digilent.
The Basys--3 development board contains one external 100~MHz clock, which is
sufficient to run the Zip CPU core.
\begin{table}[htbp]
\begin{center}
\begin{clocklist}
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
\end{clocklist}
\caption{List of Clocks}\label{tbl:clocks}
\end{center}\end{table}
I hesitate to suggest that the core can run faster than 100~MHz, since I have
struggled with various timing violations to keep it at 100~MHz.  So, for
now, I will only state that it can run at 100~MHz.

On a Spartan--6, the clock can run successfully at 80~MHz.

\chapter{I/O Ports}\label{chap:ioports}
The I/O ports to the Zip CPU may be grouped into three categories.  The first
is that of the master wishbone used by the CPU, then the slave wishbone used
to command the CPU via a debugger, and then the rest.  The first two of these
were already discussed in the wishbone chapter.  They are listed here
for completeness in Tbl.~\ref{tbl:iowb-master}
\begin{table}
\begin{center}\begin{portlist}
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
{\tt o\_wb\_addr}  & 32 & Output & Bus address \\\hline
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
{\tt i\_wb\_err}   &  1 & Input  & Bus Error indication\\\hline
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
and~\ref{tbl:iowb-slave} respectively.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_dbg\_cyc}   &  1 & Input & Indicates an active Wishbone cycle\\\hline
{\tt i\_dbg\_stb}   &  1 & Input & WB Strobe signal\\\hline
{\tt i\_dbg\_we}    &  1 & Input & Write enable\\\hline
{\tt i\_dbg\_addr}  &  1 & Input & Bus address, command or data port \\\hline
{\tt i\_dbg\_data}  & 32 & Input & Data on WB write\\\hline
{\tt o\_dbg\_ack}   &  1 & Output  & Slave has completed a R/W cycle\\\hline
{\tt o\_dbg\_stall} &  1 & Output  & WB bus slave not ready\\\hline
{\tt o\_dbg\_data}  & 32 & Output  & Outgoing data on WB read\\\hline
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}
2734 21 dgisselq
 
There are only four other lines to the CPU: the external clock, external
reset, incoming external interrupt line(s), and the outgoing debug interrupt
line.  These are shown in Tbl.~\ref{tbl:ioports}.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
{\tt i\_rst} & 1 & Input &  Active high reset line \\\hline
{\tt i\_ext\_int} & 1\ldots 16 & Input &  Incoming external interrupts; actual
                width set by the {\tt EXTERNAL\_INTERRUPTS} parameter \\\hline
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}.  We
typically run it at 100~MHz, although we've needed to slow it down to 80~MHz
for some implementations.  The reset line is an active high reset.  Asserting
it returns the CPU to its reset address in memory.  Whether the CPU then
starts running automatically following the reset depends upon how the CPU is
configured, specifically upon the setting of the {\tt START\_HALTED}
parameter.  The {\tt i\_ext\_int} line is for an external interrupt.  This
line may actually be as wide as 16~external interrupts, depending upon the
setting of
the {\tt EXTERNAL\_INTERRUPTS} parameter.  Finally, the Zip System produces one
external interrupt whenever the entire CPU halts to wait for the debugger.
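
Since only the low {\tt EXTERNAL\_INTERRUPTS} bits of {\tt i\_ext\_int} exist, any code that reasons about these lines has to ignore the higher bits.  The C sketch below shows one way to build that mask and to pick the lowest-numbered pending line.  The function names and the lowest-bit-first priority are assumptions made for illustration; they are not the Zip CPU's own interrupt handling.

\begin{verbatim}
#include <stdint.h>

/* Mask of the i_ext_int bits that exist for a given
   EXTERNAL_INTERRUPTS setting; 0 if the setting is outside 1..16. */
uint32_t ext_int_mask(unsigned external_interrupts)
{
    if (external_interrupts < 1 || external_interrupts > 16)
        return 0;
    return (UINT32_C(1) << external_interrupts) - 1u;
}

/* Lowest-numbered asserted interrupt line, or -1 if none is asserted.
   Bits above the configured width are ignored. */
int lowest_pending(uint32_t ext_int, unsigned external_interrupts)
{
    uint32_t pending = ext_int & ext_int_mask(external_interrupts);
    for (int bit = 0; bit < 16; bit++)
        if (pending & (UINT32_C(1) << bit))
            return bit;
    return -1;
}
\end{verbatim}

For example, with the parameter set to 3, a pattern that only asserts line~3 or above reports no pending interrupt at all.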

\chapter{Initial Assessment}\label{chap:assessment}

Having now worked with the Zip CPU for a while, it is worth offering an
honest assessment of how well it works and how well it was designed. At the
end of this assessment, I will propose some changes that may take place in a
later version of this Zip CPU to make it better.

\section{The Good}
\begin{itemize}
\item The Zip CPU, as it exists today, can be configured to be anywhere from
        relatively lightweight to fully featured.  For anyone who wishes to
        build a general purpose CPU and then to experiment with building and
        adding particular features, the Zip CPU makes a good starting point:
        it is fairly simple, so modifications should be straightforward.
        Indeed, a non-pipelined version of the bare ZipBones (with no
        peripherals) has been built that uses only 1.1k~LUTs.  When using
        pipelining, the full cache, and all of the peripherals, the ZipSystem
        can top 5k~LUTs.  Where it fits in between is a function of your
        needs.
\item The Zip CPU was designed to be an implementable soft core that could be
        placed within an FPGA, controlling actions internal to the FPGA. It
        fits this role rather nicely. It does not fit the role of a system on
        a chip very well, but then it was never intended to be a system on a
        chip but rather a system within a chip.
\item The extremely simplified instruction set of the Zip CPU was a good
        choice.  Although it lacks many commonly used instructions, PUSH,
        POP, JSR, and RET among them, the simplified instruction set has
        demonstrated an amazing versatility.  I will therefore contend, to
        anyone who will listen, that this instruction set offers a full and
        complete capability for whatever a user might wish to do, with two
        exceptions: bytewise character access and accelerated floating-point
        support.
\item This simplified instruction set is easy to decode.
\item The simplified bus transactions (32-bit words only) were also very easy
        to implement.
\item The pipelined load/store approach is novel and can be used to greatly
        increase the speed of the processor.
\item The novel approach of having a single interrupt vector, which just
        brings the CPU back to the instruction it left off at within the last
        interrupt context, doesn't appear to have been that much of a problem.
        If most modern systems handle interrupt vectoring in software anyway,
        why maintain hardware support for it?
\item My goal of a high rate of instructions per clock may not be the proper
        measure.  For example, if instructions are being read from an SPI
        flash device, as is common among FPGA implementations, these same
        instructions may suffer stalls of between 64 and 128 cycles per
        instruction just to read the instruction from the flash.  Executing
        the instruction in a single clock cycle is then no longer the
        appropriate measure.  At the same time, it should be possible to use
        the DMA peripheral to copy instructions from the flash to a temporary
        memory location, after which they may again be executed at one
        instruction per clock cycle.
\end{itemize}
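The cost of the flash stalls described in the last item can be put in rough numbers.  The C sketch below assumes that each instruction fetched from flash stalls for a fixed number of clocks and then executes in one clock, and that copying a routine to RAM pays that same per-word cost once.  Both are simplifying assumptions (a sequential DMA burst from flash is likely cheaper per word), and the function names are hypothetical.

\begin{verbatim}
#include <stdint.h>

/* Instructions per second if every fetch stalls for stall_cycles
   clocks and the instruction then executes in one clock. */
uint32_t effective_ips(uint32_t clk_hz, uint32_t stall_cycles)
{
    return clk_hz / (stall_cycles + 1u);
}

/* Clocks to run an n-instruction loop iters times straight from flash. */
uint64_t clocks_from_flash(uint32_t n, uint32_t iters, uint32_t stall_cycles)
{
    return (uint64_t)n * iters * (stall_cycles + 1u);
}

/* Clocks to copy the loop to RAM once (same per-word cost assumed),
   then run it iters times at one instruction per clock. */
uint64_t clocks_after_copy(uint32_t n, uint32_t iters, uint32_t stall_cycles)
{
    return (uint64_t)n * (stall_cycles + 1u) + (uint64_t)n * iters;
}
\end{verbatim}

At a 100~MHz clock, stalls of 64 and 128 cycles give roughly 1.5 and 0.78 million instructions per second.  Under the same assumptions, a 100-instruction loop run ten times with 64-cycle stalls costs 65,000 clocks from flash, but only 7,500 clocks if it is copied to RAM first.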

\section{The Not so Good}
\begin{itemize}
\item The CPU has no character support.  This is both good and bad.
        Realistically, the CPU works just fine without it: characters can be
        supported as subsets of 32-bit words without any problem.
        Practically, though, it will make compiling non-Zip CPU code
        difficult, especially anything that assumes
        {\tt sizeof(int)==4*sizeof(char)}, or that tries to create unions
        with characters and integers and then attempts to reference the
        address of the characters within that union.

\item The Zip CPU does not support a data cache.  One can still be built
        externally, but the lack of one is a limitation of the CPU proper as
        built.  Further, under the theory of the Zip CPU design (that of an
        embedded soft-core processor within an FPGA, where any ``address'' may
        reference either memory or a peripheral that may have side-effects),
        any data cache would need to be based upon an initial knowledge of
        whether it is supporting memory (cacheable) or peripherals.  This
        knowledge must exist somewhere, and that somewhere is currently (and
        by design) external to the CPU.

        This may also be written off as a ``feature'' of the Zip CPU, since
        the addition of a data cache can greatly increase the LUT count of
        a soft core.

        The Zip CPU compensates for this via its pipelined load and store
        instructions.

\item Many other instruction sets offer three-operand instructions, whereas
        the Zip CPU only offers two-operand instructions.  This means that it
        takes the Zip CPU more instructions to do many of the same operations.
        For example, computing $a=b+c$ requires first copying $b$ into $a$
        and then adding $c$, where a three-operand machine needs only a
        single instruction.  The good part of this is that it gives the Zip
        CPU a greater amount of flexibility in its immediate operand mode,
        although that increased flexibility isn't necessarily as valuable as
        one might like.

\item The Zip CPU doesn't support out-of-order execution.  I suppose it could
        be modified to do so, but then it would no longer be the ``simple''
        and low LUT count CPU it was designed to be.  The three primary
        results are that 1)~loads may unnecessarily stall the CPU, even if
        other things could be done while waiting for the load to complete,
        2)~bus errors on stores will never be caught at the point of the
        error, and 3)~branch prediction becomes more difficult.

\item Although switching to an interrupt context in the Zip CPU design doesn't
        require a tremendous swapping of registers, in reality it still
        does, since any task swap still requires saving and restoring all
        16~user registers: 16~stores and 16~loads.  That's a lot of memory
        movement just to service an interrupt.

\item The Zip CPU is by no means generic: it will never handle addresses
        larger than 32~bits ($2^{32}$ 32-bit words, or 16GB) without a
        complete and total redesign.  This may limit its utility as a generic
        CPU in the future, although as an embedded CPU within an FPGA this
        isn't really much of a limit or restriction.

\item While the Zip CPU has its own assembler, it has no linker and does not
        (yet) support a compiler.  The standard C library is an even longer
        shot.  My dream of having binutils and gcc support has not been
        realized and at this rate may not be realized.  (I've been intimidated
        by the challenge every time I've looked through those code bases.)
\end{itemize}
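To illustrate the first item above, characters can be packed four to a 32-bit word and reached with shifts and masks.  The C sketch below assumes big-endian packing (character~0 in bits 31 through 24) and uses hypothetical helper names.  It shows why, without byte instructions, each character load costs a word load plus a shift, and each character store costs a read-modify-write of the whole word.

\begin{verbatim}
#include <stdint.h>

/* Bit position of character `index` within its word, assuming
   big-endian packing: character 0 occupies bits 31..24. */
static uint32_t char_shift(uint32_t index)
{
    return 24u - 8u * (index & 3u);
}

/* Read character `index` from an array of packed 32-bit words. */
uint8_t get_char(const uint32_t *words, uint32_t index)
{
    return (uint8_t)(words[index >> 2] >> char_shift(index));
}

/* Write character `index`: read the word, clear the old byte, and
   merge in the new one (a read-modify-write). */
void put_char(uint32_t *words, uint32_t index, uint8_t c)
{
    uint32_t shift = char_shift(index);
    uint32_t mask  = UINT32_C(0xFF) << shift;
    words[index >> 2] = (words[index >> 2] & ~mask) | ((uint32_t)c << shift);
}
\end{verbatim}

The same read-modify-write is what a compiler would have to emit for every {\tt char} store, which is part of why code written for byte-addressed machines ports poorly.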

\section{The Next Generation}
This section could also be labeled as my ``To do'' list.  Today's list is
much different from what it was in the last version of this document, since
much of the prior to-do list (such as VLIW instructions and a more traditional
instruction cache) has now been implemented.  The only items truly still
waiting on my list today are assembler support for the VLIW instruction set,
and linker and compiler support.

Stay tuned: these are likely to be coming next.
 
% Appendices
% Index
\end{document}

%
%
% Symbol table relocation types:
%
% Only three types of instructions truly need relocations: those that load an
% address into a register, those that modify the PC register, and those that
% access memory.
%
% -     LDI     Addr,Rx         // Loads an absolute address into Rx, 24 bits
%
% -     LDILO   Addr,Rx         // Loads an absolute address into Rx, 32 bits
%       LDIHI   Addr,Rx         //   requires two instructions
%
% -     JMP     Rx              // Jump to any address in Rx
%                       // Can be prefixed with two instructions to load Rx
%                       // from any 32-bit immediate
% -     JMP     #Addr           // Jump to any 24-bit (signed) address, 23-bit unsigned
%
% -     ADD     x,PC            // Any PC relative jump (20 bits)
%
% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)
%
% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,
%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction
%
% -     LOD.C   Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond
%
% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid
%
% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table
%       BRA     +1              // memory address.  The BRA +1 can be skipped,
%       .WORD   Addr            // but only if the address is placed at the end
%       LOD     -2(PC),PC       // of an executable section
%
