%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename:    spec.tex
%%
%% Project:     Zip CPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose:     This LaTeX file contains all of the documentation/description
%%              currently provided with this Zip CPU soft core.  It supersedes
%%              any information about the instruction set or CPUs found
%%              elsewhere.  It's not nearly as interesting, though, as the PDF
%%              file it creates, so I'd recommend reading that before diving
%%              into this file.  You should be able to find the PDF file in
%%              the SVN distribution together with this file and a copy of
%%              the GPL-3.0 license this file is distributed under.  If not,
%%              just type 'make' in the doc directory and it should build
%%              without a problem.
%%
%%
%% Creator:     Dan Gisselquist
%%              Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program.  (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.)  If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License:     GPL, v3, as defined and found on www.gnu.org,
%%              http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\usepackage{import}
% \graphicspath{{../gfx}}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.6}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
        pdfauthor={Dan Gisselquist},
        pdfsubject={Zip CPU}]{hyperref}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the Zip CPU. ARM processors are
good and effective. Xilinx makes and markets Microblaze, Altera Nios, and both
have better toolsets than the Zip CPU will ever have. OpenRISC is also
available, and RISC--V may be replacing it. Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it, though. There's a lot that I would like to do with a
processor, and I want to be able to do it in a vendor-independent fashion.
First, I would like to be able to place this processor inside an FPGA.  Without
paying royalties, ARM is out of the question.  I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on both Xilinx and Altera chips, and that can be
easily ported from one manufacturer's chipsets to another. Even more, before
purchasing a chip or a board, I would like to know that my soft core works. I
would like to build a test bench to test components with, and Verilator is my
chosen test bench. This forces me to use all Verilog, and it prevents me from
using any proprietary cores. For this reason, Microblaze and Nios are out of
the question.

Why not OpenRISC? That's a hard question. The OpenRISC team has done some
wonderful work on an amazing processor, and I'll have to admit that I am
envious of what they've accomplished. I would like to port binutils to the
Zip CPU, as I would like to port GCC and GDB. They are way ahead of me. The
OpenRISC processor, however, is complex and hefty at about 4,500 LUTs. It has
a lot of features of modern CPUs within it that ... well, let's just say it's
not the little guy on the block. The Zip CPU is lighter weight, costing only
about 2,300 LUTs with no peripherals, and 3,200 LUTs with some very basic
peripherals.

My final reason is that I'm building the Zip CPU as a learning experience. The
Zip CPU has allowed me to learn a lot about how CPUs work on a very micro
level. For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is
lightweight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}


The original goal of the Zip CPU was to be a very simple CPU.   You might
think of it as a poor man's alternative to the OpenRISC architecture.
For this reason, all instructions have been designed to be as simple as
possible, and are all designed to be executed in one clock cycle per
instruction, barring pipeline stalls.  Indeed, even the bus has been simplified
to a constant 32-bit width, with no option for more or less.  This has
resulted in the choice to drop push and pop instructions, pre-increment and
post-decrement addressing modes, and more.

For those who like buzz words, the Zip CPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32~bits wide, addresses are 32~bits wide,
                instructions are 32~bits wide, etc.
\item A RISC CPU.  There is no microcode for executing instructions.  All
        instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture.  (Only load and store instructions
                can access memory.)
\item Wishbone compliant.  All peripherals are accessed just like
                memory across this bus.
\item A von Neumann architecture.  (The instructions and data share a
                common bus.)
\item A pipelined architecture, having stages for {\bf Prefetch},
                {\bf Decode}, {\bf Read-Operand}, the {\bf ALU/Memory}
                unit, and {\bf Write-back}.  See Fig.~\ref{fig:cpu}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
                for a diagram of this structure.
\item Completely open source, licensed under the GPL.\footnote{Should you
        need a copy of the Zip CPU licensed under other terms, please
        contact me.}
\end{itemize}

The Zip CPU also has one unique feature: the ability to do pipelined loads
and stores.  This allows the CPU to access on-chip memory at one access per
clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a
Chip,'' or a SwiC.  SwiCs have some unique properties internal to them
that have influenced the design of the Zip CPU.  Among these are the bus,
memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture.
This allows these other CPUs to have two separate bus structures: one for the
program fetch, and the other for the memory.  The Zip CPU is fairly unusual in
its approach because it uses a von Neumann architecture.  This was done for
simplicity.  By using a von Neumann architecture, only one bus needs to be
implemented within any FPGA.  This helps to minimize real-estate, while
maintaining a high clock speed.  The disadvantage is that it can severely
degrade the overall instructions per clock count.

Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow.  Memory on chip may be accessed at a single
cycle per access, but small FPGAs have a limited amount of memory on chip.
Going off chip, however, is expensive.  Two examples will illustrate this
point.  On
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
or 64~cycles per subsequent word in a pipelined architecture.  Likewise, the
SDRAM chip on the XuLA2 board allows 6~cycle access for a write, 10~cycles
per read, and 2~cycles for any subsequent pipelined access read or write.
Either way you look at it, this memory access will be slow, and this doesn't
account for any logic delays should the bus implementation logic get
complicated.
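
As a rough illustration of these figures, the following Python sketch
estimates the cost of moving a block of words on the XuLA2 board.  It assumes
that a pipelined burst costs the first-access figure once, plus the
subsequent-access figure for every further word.  That reading of the numbers,
and the names {\tt XULA2\_CYCLES}, {\tt burst\_cycles}, and
{\tt single\_cycles}, are illustrative assumptions and not part of the Zip CPU
distribution.

```python
# Cycle counts quoted in the text for the XuLA2 board, stored as
# (first access, each subsequent pipelined access).
XULA2_CYCLES = {
    "flash_read": (128, 64),
    "sdram_read": (10, 2),
    "sdram_write": (6, 2),
}


def burst_cycles(kind, words):
    """Estimated cycles to move `words` 32-bit words in one pipelined burst."""
    if words < 1:
        raise ValueError("a burst moves at least one word")
    first, subsequent = XULA2_CYCLES[kind]
    return first + (words - 1) * subsequent


def single_cycles(kind, words):
    """Estimated cycles when every word is a separate, non-pipelined access."""
    if words < 1:
        raise ValueError("at least one word must be moved")
    first, _ = XULA2_CYCLES[kind]
    return words * first


for kind in XULA2_CYCLES:
    print(kind, burst_cycles(kind, 8), single_cycles(kind, 8))
```

Under these assumptions an eight word Flash burst costs 576~cycles rather than
1024, and an eight word SDRAM read costs 24~cycles rather than 80.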

As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that all memory accesses may be pipelined, and
that pipelined memory access is faster than non--pipelined access.  Therefore,
a SwiC soft core should support pipelined operations, but it should also
allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are
expensive.  If you want to have a minimum of logic, cache logic may not be
the highest on the priority list.

In sum, memory is slow.  While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline.  Any SwiC must be able to deal
with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.
Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily
be created on chip to support the SwiC if necessary.  As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address.  This is cheap within an
FPGA, but expensive in instructions.
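
To make the instruction cost concrete, here is a minimal Python sketch of the
same reversal done in software.  In fabric the reversal is only a rewiring of
thirty wires; in software it takes a shift, a mask, and an OR for every bit.
The function name {\tt bit\_reverse} and the constant {\tt WIDTH} are
hypothetical and chosen only for this illustration.

```python
WIDTH = 30  # width of the hypothetical bit-reversal peripheral


def bit_reverse(value, width=WIDTH):
    """Reverse the low `width` bits of `value`, one bit per loop pass."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # move the lowest bit across
        value >>= 1
    return result


print(hex(bit_reverse(0x1)))
print(hex(bit_reverse(0x6)))
```

Reversing twice returns the original value, which is a convenient check on
any such routine.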

Indeed, anything that must be done fast within an FPGA is likely to already
be done elsewhere in the fabric.  This leaves the CPU with the role of handling
sequential tasks that need a lot of state.

This means that the SwiC needs to live within a unique environment,
separate and different from the traditional SoC.  That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose.

\section{Lessons Learned}

Now that I've worked on the Zip CPU for a while, however, it is not nearly
as simple as I originally hoped.  Worse, I've had to adjust to create
capabilities that I was never expecting to need.  These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
        still necessary to debug this CPU.  That means that there needs to be
        an external register that can control the CPU: reset it, halt it, step
        it, and tell whether it is running or not.  My chosen interface
        includes a second register similar to this control register.  This
        second register allows the external controller or debugger to examine
        registers internal to the CPU.

\item {\bf Internal Debug:} Being able to run a debugger from within
        a user process requires an ability to step a user process from
        within a debugger.  It also requires a break instruction that can
        be substituted for any other instruction, and substituted back.
        The break is actually difficult: the break instruction cannot be
        allowed to execute.  That way, upon a break, the debugger should
        be able to jump back into the user process to step the instruction
        that would've been at the break point initially, and then to
        replace the break after passing it.

        Incidentally, this break messes with the prefetch cache and the
        pipeline: if you change an instruction partially through the pipeline,
        the whole pipeline needs to be cleansed.  Likewise, if you change
        an instruction in memory, you need to make sure the cache is reloaded
        with the new instruction.

\item {\bf Prefetch Cache:} My original implementation had a very
        simple prefetch stage.  Any time the PC changed the prefetch would go
        and fetch the new instruction.  While this was perhaps the simplest
        approach, it cost roughly five clocks for every instruction.  This
        was deemed unacceptable, as I wanted a CPU that could execute
        instructions in one cycle.  I therefore have a prefetch cache that
        issues pipelined wishbone accesses to memory and then pushes
        instructions at the CPU.  Sadly, this accounts for about 20\% of the
        logic in the entire CPU, or 15\% of the logic in the entire system.


\item {\bf Operating System:} In order to support an operating system,
        interrupts and so forth, the CPU needs to support supervisor and
        user modes, as well as a means of switching between them.  For example,
        the user needs a means of executing a system call.  This is the
        purpose of the {\bf `trap'} instruction.  This instruction needs to
        place the CPU into supervisor mode (here equivalent to disabling
        interrupts), as well as to hand it a parameter identifying
        which O/S function was called.

My initial approach to building a trap instruction was to create an external
peripheral which, when written to, would generate an interrupt and could
return the last value written to it.  In practice, this approach didn't work
at all: the CPU executed two instructions while waiting for the
trap interrupt to take place.  Since then, I've decided to keep the rest of
the CC register for that purpose so that a write to the CC register, with the
GIE bit cleared, could be used to execute a trap.  This has other problems,
though, primarily in the limitation of the uses of the CC register.  In
particular, the CC register is the best place to put CPU state information and
to ``announce'' special CPU features (floating point, etc.).  So the trap
instruction still switches to interrupt mode, but the CC register is not
nearly as useful for telling the supervisor mode processor what trap is being
executed.

Modern timesharing systems also depend upon a {\bf Timer} interrupt
to handle task swapping.  For the Zip CPU, this interrupt is handled
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
The timer module itself is found in {\tt ziptimer.v}.

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
        stalls at all, but rather to require the compiler to properly schedule
        all instructions so that stalls would never be necessary.  After trying
        to build such an architecture, I gave up, having learned some things:

        First, an ideal pipeline might look something like
        Fig.~\ref{fig:ideal-pipeline}.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{../gfx/fullpline.eps}
\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
\end{center}\end{figure}
        Notice that, in this figure, all the pipeline stages are complete and
        full.  Every instruction takes one clock and there are no delays.
        However, as the discussion above pointed out, the memory associated
        with a SwiC may not allow single clock access.  It may be instead
        that you can only read every two clocks.  In that case, what shall
        the pipeline look like?  Should it look like
        Fig.~\ref{fig:waiting-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttra.eps}
\caption{Instructions wait for each other}\label{fig:waiting-pipeline}
\end{center}\end{figure}
        where instructions are held back until the pipeline is full, or should
        it look like Fig.~\ref{fig:independent-pipeline},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stuttrb.eps}
\caption{Instructions proceed independently}\label{fig:independent-pipeline}
\end{center}\end{figure}
        where each instruction is allowed to move through the pipeline
        independently?  For better or worse, the Zip CPU allows instructions
        to move through the pipeline independently.

        One approach to avoiding stalls is to use a branch delay slot,
        such as is shown in Fig.~\ref{fig:brdelay}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdly.eps}
\caption{A typical branch delay slot approach}\label{fig:brdelay}
\end{center}\end{figure}
        In this figure, the instructions
        {\tt BR} (a branch) and {\tt BD} (a branch delay instruction)
        are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.
        Since it takes a processor a clock cycle to execute a branch, the
        delay slot allows the processor to do something useful in that
        cycle.  The problem the Zip CPU has with this approach is, what
        happens when the pipeline looks like Fig.~\ref{fig:brbroken}?
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bdbroken.eps}
\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}
\end{center}\end{figure}
        In this case, the branch delay slot never gets filled in the first
        place, and so the pipeline squashes it before it gets executed.
        If not that, then what happens when handling interrupts or
        debug stepping: when has the CPU finished an instruction?
        When the {\tt BR} instruction has finished, or must {\tt BD}
        follow every {\tt BR}?  And, again, what if the pipeline isn't
        full?
        These thoughts killed any hopes of doing delayed branching.

        So I switched to a model of discrete execution: Once an instruction
        enters into either the ALU or memory unit, the instruction is
        guaranteed to complete.  If the logic recognizes a branch or a
        condition that would render the instruction entering into this stage
        possibly inappropriate (e.g., a conditional branch preceding a store
        instruction), then the pipeline stalls for one cycle
        until the conditional branch completes.  Then, if it generates a new
        PC address, the stages preceding are all wiped clean.

        This model, however, generated too many pipeline stalls, so the
        discrete execution model was modified to allow instructions to go
        through the ALU and be canceled before writeback.  This removed
        the stall associated with ALU instructions before untaken branches.

        The discrete execution model allows such things as sleeping, as
        outlined in Fig.~\ref{fig:sleeping}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/sleep.eps}
\caption{How the CPU halts when sleeping}\label{fig:sleeping}
\end{center}\end{figure}
        If the
        CPU is put to ``sleep,'' the ALU and memory stages stall and back up
        everything before them.  Likewise, anything that has entered the ALU
        or memory stage when the CPU is placed to sleep continues to completion.
        To handle this logic, each pipeline stage has three control signals:
        a valid signal, a stall signal, and a clock enable signal.  In
        general, a stage stalls if its contents are valid and the next stage
        is stalled.  This allows the pipeline to fill any time a later stage
        stalls, as illustrated in Fig.~\ref{fig:stacking}.
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/stacking.eps}
\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
\end{center}\end{figure}

        This approach is also different from other pipeline approaches.
        Instead of keeping the entire pipeline filled, each stage is treated
        independently.  Therefore, individual stages may move forward as long
        as the subsequent stage is available, regardless of whether the stage
        behind it is filled.

\item {\bf Verilog Modules:} When examining how other processors worked
        here on OpenCores, many of them had one separate module per pipeline
        stage.  While this appeared to me to be a fascinating and commendable
        idea, my own implementation didn't work out quite so nicely.

        As an example, the decode module produces a {\em lot} of
        control wires and registers.  Creating a module out of this, with
        only the simplest of logic within it, seemed to be more a lesson
        in passing wires around, rather than encapsulating logic.

        Another example was the register writeback section.  I would love
        this section to be a module in its own right, and many have made them
        such.  However, other modules depend upon writeback results other
        than just what's placed in the register (i.e., the control wires).
        For these reasons, I didn't manage to fit this section into its
        own module.

        The result is that the majority of the CPU code can be found in
        the {\tt zipcpu.v} file.
\end{itemize}
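
The stall rule described under {\bf Pipeline Stalls} above (a stage stalls if
its contents are valid and the next stage is stalled) can be sketched as a
small behavioural model.  This is a Python illustration only, not the Verilog
found in {\tt zipcpu.v}: the function {\tt step}, its arguments, and the use of
{\tt None} to mark an invalid stage are all assumptions made for the example.

```python
def step(stages, fetched=None, final_stalled=False):
    """Advance a pipeline model by one clock.

    stages[0] is the first stage and stages[-1] the last; None marks a
    stage whose contents are not valid.  `final_stalled` says whether the
    last stage is prevented from completing this clock.
    Returns (new_stages, retired, accepted).
    """
    n = len(stages)
    valid = [s is not None for s in stages]

    # A stage stalls if its contents are valid and the next stage is stalled.
    stalled = [False] * n
    stalled[-1] = valid[-1] and final_stalled
    for i in range(n - 2, -1, -1):
        stalled[i] = valid[i] and stalled[i + 1]

    new_stages = []
    for i in range(n):
        if stalled[i]:
            new_stages.append(stages[i])      # hold: clock enable is low
        elif i == 0:
            new_stages.append(fetched)        # take a new instruction
        else:
            # Stage i-1 cannot be stalled here, so its contents move forward.
            new_stages.append(stages[i - 1])

    retired = None if stalled[-1] else stages[-1]
    accepted = fetched is not None and not stalled[0]
    return new_stages, retired, accepted


# Instruction "A" is stuck in the last stage while B..F arrive behind it.
pipe = [None, None, None, None, "A"]
for name in "BCDEF":
    pipe, retired, accepted = step(pipe, fetched=name, final_stalled=True)
    print(pipe, accepted)
```

In this run the instructions stack up behind the stalled one, each stage
advancing independently until the pipeline is full, at which point the last
fetch is refused.  Once the final stage is released, everything moves forward
again by one stage per clock.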

With that introduction out of the way, let's move on to the instruction
set.

\chapter{CPU Architecture}\label{chap:arch}

The Zip CPU supports a set of two-operand instructions, where the second operand
(always a register) is the result.  The only exception is the store instruction,
where the first operand (always a register) is the source of the data to be
stored.

\section{Simplified Bus}
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
It has been simplified in this fashion: all operations are 32--bit operations.
Because no access is narrower than a word, the bus is neither little endian
nor big endian.  All words are 32~bits wide.  All instructions are also
32~bits wide.  Everything has been
built around the 32--bit word.

\section{Register Set}
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/regset.eps}
\caption{Zip CPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used in interrupt mode, when interrupts are disabled,
whereas the user set is used otherwise.  Of this register set, the Program
Counter (PC) is register 15, whereas the status register (SR) or condition
code register
(CC) is register 14.  By convention, the stack pointer will be register 13 and
noted as (SP), although there is nothing special about this register other
than this convention.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.
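
The banked register arrangement can be pictured with a short behavioural
sketch in Python.  The class {\tt RegisterFile}, its method names, and the
choice to raise an error when user mode asks for a supervisor register are
assumptions made for illustration; this text does not say how the hardware
responds to such an access, and the assumed reset state (interrupts disabled)
is likewise a guess.

```python
NUM_REGS = 16
SP, CC, PC = 13, 14, 15  # register numbers given in the text


class RegisterFile:
    """Behavioural model: a supervisor bank and a user bank of 16 registers."""

    def __init__(self):
        self.banks = {"supervisor": [0] * NUM_REGS, "user": [0] * NUM_REGS}
        self.gie = False  # assumption: start with interrupts disabled

    def mode(self):
        # Interrupts disabled means supervisor mode; enabled means user mode.
        return "user" if self.gie else "supervisor"

    def _select(self, bank):
        bank = bank or self.mode()
        if bank == "supervisor" and self.mode() == "user":
            # Assumption: the sketch simply rejects this access.
            raise PermissionError("user mode cannot reach supervisor registers")
        return self.banks[bank]

    def read(self, reg, bank=None):
        return self._select(bank)[reg]

    def write(self, reg, value, bank=None):
        self._select(bank)[reg] = value & 0xFFFFFFFF  # registers are 32 bits


rf = RegisterFile()
rf.write(SP, 0x1000)               # supervisor stack pointer
rf.write(SP, 0x2000, bank="user")  # supervisor prepares the user stack
rf.gie = True                      # enabling interrupts enters user mode
print(hex(rf.read(SP)))            # reads the user bank
```

Here the supervisor sets up both stack pointers before enabling interrupts,
after which ordinary register reads see only the user bank.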

The status register is special, and bears further mention.  As shown in
Tbl.~\ref{tbl:cc-register},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative.  The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}
the lower 11~bits of the status register form
a set of CPU state and condition codes.  Writes to other bits of this register
are preserved.

Of the condition codes, the bottom four bits are the current flags:
                Zero (Z),
                Carry (C),
                Negative (N),
                and Overflow (V).

The next bit (bit~4) is a clock enable (0 to enable) or sleep bit (1 to put
        the CPU to sleep).  Setting this bit will cause the CPU to
        wait for an interrupt (if interrupts are enabled), or to
        completely halt (if interrupts are disabled).

The sixth bit (bit~5) is a global interrupt enable bit (GIE).  When this
        bit is a `1' interrupts will be enabled, else disabled.  When
        interrupts are disabled, the CPU will be in supervisor mode, otherwise
        it is in user mode.  Thus, to execute a context switch, one only
        need enable or disable interrupts.  (When an interrupt line goes
        high, interrupts will automatically be disabled, as the CPU goes
        and deals with its context switch.)  Special logic has been added to
        keep the user mode from setting the sleep bit and clearing the
        GIE bit at the same time, with clearing the GIE bit taking
        precedence.

The seventh bit (bit~6) is a step bit.  This bit can be
        set from supervisor mode only.  After setting this bit, should
        the supervisor mode process switch to user mode, it would then
        accomplish one instruction in user mode before returning to supervisor
        mode.  Then, upon return to supervisor mode, this bit will
        be automatically cleared.  This bit has no effect on the CPU while in
        supervisor mode.

        This functionality was added so that a userspace debugger can
        step a user process, working through supervisor mode
        of course.


The eighth bit (bit~7) is a break enable bit.  This controls whether a break
instruction in user mode will halt the processor for an external debugger
(break enabled), or whether the break instruction will simply send the
CPU into interrupt mode.  Encountering a break in supervisor mode will
halt the CPU independent of the break enable bit.  This bit can only be set
within supervisor mode.

% Should break enable be a supervisor mode bit, while the break enable bit
% in user mode is a break has taken place bit?
%

This functionality was added to enable an external debugger to
        set and manage breakpoints.

The ninth bit (bit~8) is an illegal instruction bit.  When the CPU
tries to execute either a nonexistent instruction, or an instruction from
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit.  The bit will automatically be
cleared upon any return to user mode.

The tenth bit (bit~9) is a trap bit.  It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command.  This allows the
supervisor, in supervisor mode, to determine whether it got to supervisor
mode from a trap or from an external interrupt or both.

These status register bits are summarized in Tbl.~\ref{tbl:ccbits}.
\begin{table}
\begin{center}
\begin{tabular}{l|l}
Bit & Meaning \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & Illegal instruction error flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
4 & Sleep \\\hline
3 & V, or overflow bit.\\\hline
2 & N, or negative bit.\\\hline
1 & C, or carry bit.\\\hline
0 & Z, or zero bit.\\\hline
\end{tabular}
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
\end{center}\end{table}
560
 
561 21 dgisselq
\section{Conditional Instructions}
562 36 dgisselq
Most, although not quite all, instructions may be conditionally executed.  From
563 21 dgisselq
the four condition code flags, eight conditions are defined.  These are shown
564
in Tbl.~\ref{tbl:conditions}.
565
\begin{table}
566
\begin{center}
567
\begin{tabular}{l|l|l}
568
Code & Mnemonic & Condition \\\hline
569
3'h0 & None & Always execute the instruction \\
570
3'h1 & {\tt .Z} & Only execute when 'Z' is set \\
571
3'h2 & {\tt .NE} & Only execute when 'Z' is not set \\
572
3'h3 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\
573
3'h4 & {\tt .GT} & Greater than ('N' not set, 'Z' not set) \\
574 24 dgisselq
3'h5 & {\tt .LT} & Less than ('N' set) \\
575 21 dgisselq
3'h6 & {\tt .C} & Carry set\\
576
3'h7 & {\tt .V} & Overflow set\\
577
\end{tabular}
578
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
579
\end{center}
580
\end{table}
581 24 dgisselq
There is no condition code for less than or equal, not C, or not V.  Sorry,
I ran out of space in 3--bits.  Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall
(e.g.\ \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)}).
As an alternative, the condition may often be reversed, recovering those
extra two clocks.  Thus instead of \hbox{\tt CMP Rx,Ry;}
\hbox{\tt BNV label} you can issue a \hbox{\tt CMP Ry,Rx;} \hbox{\tt BV label}.
587 21 dgisselq
 
588 36 dgisselq
Conditionally executed ALU instructions will not further adjust the
589 68 dgisselq
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
590
instructions.   Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
591
will adjust conditions whenever their conditionals are true.  In this way,
592
multiple conditions may be evaluated without branches.  For example, to do
593
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
594
code such as Tbl.~\ref{tbl:dbl-condition}.
595
\begin{table}\begin{center}
596
\begin{tabular}{l}
597
        {\tt CMP 1,R0} \\
598
        {;\em Condition codes are now set based upon R0-1} \\
599
        {\tt CMP.Z 2,R1} \\
600
        {;\em If R0 $\neq$ 1, conditions are unchanged.} \\
601
        {;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\
602
        {;\em Now do something based upon the conjunction of both conditions.} \\
603
        {;\em While we use the example of a STO, it could be any instruction.} \\
604
        {\tt STO.Z R0,(R2)} \\
605
\end{tabular}
606
\caption{An example of a double conditional}\label{tbl:dbl-condition}
607
\end{center}\end{table}
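
As a sketch of why the sequence in Tbl.~\ref{tbl:dbl-condition} works, write
$Z_1$ for the zero flag after the first compare and $Z_2$ for the zero flag
seen by the final instruction.  Both names are introduced here for
illustration only.  Since the {\tt CMP.Z} executes, and therefore updates the
flags, only when $Z_1$ is set:
\[
Z_1 = (\mbox{\tt R0} = 1), \qquad
Z_2 = \left\{ \begin{array}{ll}
        (\mbox{\tt R1} = 2) & \mbox{if } Z_1 \mbox{ is set} \\
        Z_1 & \mbox{otherwise}
      \end{array}\right.
\]
Hence $Z_2$ is set exactly when {\tt R0} is one and {\tt R1} is two, and the
{\tt STO.Z} executes only in that case.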
608 36 dgisselq
 
609 68 dgisselq
\section{Traditional Interrupt Handling}
610
 
611 21 dgisselq
\section{Operand B}
612 24 dgisselq
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
613 21 dgisselq
This Operand B is either equal to a register plus a signed immediate offset,
614
or an immediate offset by itself.  This value is encoded as shown in
615
Tbl.~\ref{tbl:opb}.
616
\begin{table}\begin{center}
617
\begin{tabular}{|l|l|l|}\hline
618
Bit 20 & 19 \ldots 16 & 15 \ldots 0 \\\hline
619 24 dgisselq
1'b0 & \multicolumn{2}{l|}{20--bit Signed Immediate value} \\\hline
620
1'b1 & 4-bit Register & 16--bit Signed immediate offset \\\hline
621 21 dgisselq
\end{tabular}
622
\caption{Bit allocation for Operand B}\label{tbl:opb}
623
\end{center}\end{table}
624 24 dgisselq
 
625 33 dgisselq
Sixteen and twenty bit immediate values don't make sense for all instructions.
626
For example, what is the point of a 20--bit immediate when executing a 16--bit
627 24 dgisselq
multiply?  Likewise, why have a 16--bit immediate when adding to a logical
628
or arithmetic shift?  In these cases, the extra bits are reserved for future
629
instruction possibilities.
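
Read as arithmetic, Tbl.~\ref{tbl:opb} can be sketched as follows.  Here $B$
is the 21--bit field, $R[\cdot]$ is a register lookup, and
$\mbox{sext}_n(\cdot)$ is sign extension of an $n$--bit value to the register
width.  This notation is assumed for this sketch and is not used elsewhere
in this document.
\[
\mbox{Operand B} = \left\{ \begin{array}{ll}
  \mbox{sext}_{20}(B[19:0]) & \mbox{if } B[20] = 0 \\
  R\left[B[19:16]\right] + \mbox{sext}_{16}(B[15:0]) & \mbox{if } B[20] = 1
  \end{array}\right.
\]
The immediate--only form therefore spans $-2^{19} \ldots 2^{19}-1$, while the
offset in the register form spans $-2^{15} \ldots 2^{15}-1$.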
630
 
631 21 dgisselq
\section{Address Modes}
632 36 dgisselq
The Zip CPU supports two addressing modes: register plus immediate, and
633 21 dgisselq
immediate address.  Addresses are therefore encoded in the same fashion as
634
Operand B's, shown above.
635
 
636
A lot of long hard thought was put into whether to allow pre/post increment
637
and decrement addressing modes.  Finding no way to use these operators without
638 32 dgisselq
taking two or more clocks per instruction,\footnote{The two clocks figure
639
comes from the design of the register set, allowing only one write per clock.
640
That write is either from the memory unit or the ALU, but never both.} these
641
addressing modes have been
642 21 dgisselq
removed from the realm of possibilities.  This means that the Zip CPU has no
643
native way of executing push, pop, return, or jump to subroutine operations.
644 24 dgisselq
Each of these instructions can be emulated with a set of instructions from the
645
existing set.
646 21 dgisselq
 
647
\section{Move Operands}
648
The previous set of operands would be perfect and complete, save only that
649 24 dgisselq
the CPU needs access to non--supervisory registers while in supervisory mode.
650
Therefore, the MOV instruction is special and offers access to these registers
651
\ldots when in supervisory mode.  To keep the compiler simple, the extra bits
652
are ignored in non-supervisory mode (as though they didn't exist), rather than
653
being mapped to new instructions or additional capabilities.  The bits
654
indicating which register set each register lies within are the A-Usr and
655
B-Usr bits.  When set to a one, these refer to a user mode register.  When set
656
to a zero, these refer to a register in the current mode, whether user or
657
supervisor.  Further, because a load immediate instruction exists, there is no
658
move capability between an immediate and a register: all moves come from either
659
a register or a register plus an offset.
660 21 dgisselq
 
661 24 dgisselq
This actually leads to a bit of a problem: since the MOV instruction encodes
662
which register set each register is coming from or moving to, how shall a
663
compiler or assembler know how to compile a MOV instruction without knowing
664
the mode of the CPU at the time?  For this reason, the compiler will assume
665
all MOV registers are supervisor registers, and display them as normal.
666
Anything with the user bit set will be treated as a user register.  The CPU
667
will quietly ignore the supervisor bits while in user mode, and anything
668 36 dgisselq
marked as a user register will always be valid.
669 21 dgisselq
 
670
\section{Multiply Operations}
671 36 dgisselq
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
672
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}).  In both
673 21 dgisselq
cases, the operand is a register plus a 16-bit immediate, subject to the
674
rule that the register cannot be the PC or CC registers.  The PC register
675
field has been stolen to create a multiply by immediate instruction.  The
676
CC register field is reserved.
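
As a quick check on operand sizes, and assuming (the text above does not say
so explicitly) that the product is kept as a full 32--bit value, the largest
unsigned product is
\[
(2^{16}-1)^2 = 2^{32} - 2^{17} + 1 < 2^{32},
\]
while any signed product $a \cdot b$ of two 16--bit values satisfies
\[
-2^{15}\,(2^{15}-1) \le a \cdot b \le (-2^{15})^2 = 2^{30}.
\]
Under that assumption, neither {\tt MPYU} nor {\tt MPYS} can overflow a
32--bit result.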
677
 
678
\section{Floating Point}
679 36 dgisselq
The Zip CPU does not (yet) support floating point operations.  However, the
680 32 dgisselq
instruction set reserves two possibilities for future floating point
681
operations.
682 21 dgisselq
 
683 32 dgisselq
The first floating point operation hole in the instruction set involves
setting a proposed (but non-existent) floating point bit in the CC register.
The next instruction would then simply interpret its operands as floating
point values.  Not all instructions, however, make sense as floating point
operations: only the CMP, SUB, ADD, and MPY instructions may be issued as
floating point instructions.  Further, the immediate fields do not apply in
floating point mode, and must be set to zero.  Other instructions allow the
examining of the floating point bit in the CC register.  In all cases, the
floating point bit is cleared one instruction after it is set.
694 21 dgisselq
 
695 32 dgisselq
The other possibility for floating point operations involves exploiting the
696
hole in the instruction set that the NOOP and BREAK instructions reside within.
697 36 dgisselq
These two instructions use 24--bits of address space, when only a single bit
698
is necessary.  A simple adjustment to this space could create instructions
699
with a 4--bit register address for each of two registers, a 3--bit field for
conditional execution, and a 2--bit field for which operation.
701
In this fashion, such a floating point capability would only fill 13--bits of
702
the 24--bit field, still leaving lots of room for expansion.
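
The 13--bit figure can be reproduced by assuming two register fields per
instruction.  That count is inferred here from the total, not stated
separately:
\[
\underbrace{4 + 4}_{\mbox{registers}} + \underbrace{3}_{\mbox{condition}}
        + \underbrace{2}_{\mbox{operation}} = 13,
\qquad 24 - 13 = 11 \mbox{ bits remaining.}
\]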
703 32 dgisselq
 
704
In both cases, the Zip CPU would support 32--bit single precision floats
705 36 dgisselq
only, since other choices would complicate the pipeline.
706 32 dgisselq
 
707
The current architecture does not support a floating point not-implemented
708
interrupt.  Any soft floating point emulation must be done deliberately.
709
 
710 21 dgisselq
\section{Native Instructions}
711
The instruction set for the Zip CPU is summarized in
712
Tbl.~\ref{tbl:zip-instructions}.
713
\begin{table}\begin{center}
714
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
715 36 dgisselq
\rowcolor[gray]{0.85}
716 21 dgisselq
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
717
        & \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
718 36 dgisselq
        & Sets CC? \\\hline\hline
719 39 dgisselq
{\tt CMP(Sub)} & \multicolumn{4}{l|}{4'h0}
720 21 dgisselq
                & \multicolumn{4}{l|}{D. Reg}
721
                & \multicolumn{3}{l|}{Cond.}
722
                & \multicolumn{21}{l|}{Operand B}
723
                & Yes \\\hline
724 39 dgisselq
{\tt TST(And)} & \multicolumn{4}{l|}{4'h1}
725 21 dgisselq
                & \multicolumn{4}{l|}{D. Reg}
726
                & \multicolumn{3}{l|}{Cond.}
727
                & \multicolumn{21}{l|}{Operand B}
728
        & Yes \\\hline
729 39 dgisselq
{\tt MOV} & \multicolumn{4}{l|}{4'h2}
730 21 dgisselq
                & \multicolumn{4}{l|}{D. Reg}
731
                & \multicolumn{3}{l|}{Cond.}
732
                & A-Usr
733
                & \multicolumn{4}{l|}{B-Reg}
734
                & B-Usr
735
                & \multicolumn{15}{l|}{15--bit signed offset}
736
                & \\\hline
737 39 dgisselq
{\tt LODI} & \multicolumn{4}{l|}{4'h3}
738 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
739
                & \multicolumn{24}{l|}{24--bit Signed Immediate}
740
                & \\\hline
741 39 dgisselq
{\tt NOOP} & \multicolumn{4}{l|}{4'h4}
742 21 dgisselq
                & \multicolumn{4}{l|}{4'he}
743
                & \multicolumn{24}{l|}{24'h00}
744
                & \\\hline
745 39 dgisselq
{\tt BREAK} & \multicolumn{4}{l|}{4'h4}
746 21 dgisselq
                & \multicolumn{4}{l|}{4'he}
747
                & \multicolumn{24}{l|}{24'h01}
748
                & \\\hline
749 36 dgisselq
{\em Reserved} & \multicolumn{4}{l|}{4'h4}
750 21 dgisselq
                & \multicolumn{4}{l|}{4'he}
751
                & \multicolumn{24}{l|}{24--bits, but not 0 or 1.}
752
                & \\\hline
753 39 dgisselq
{\tt LODIHI }& \multicolumn{4}{l|}{4'h4}
754 21 dgisselq
                & \multicolumn{4}{l|}{4'hf}
755
                & \multicolumn{3}{l|}{Cond.}
756
                & 1'b1
757
                & \multicolumn{4}{l|}{R. Reg}
758
                & \multicolumn{16}{l|}{16-bit Immediate}
759
                & \\\hline
760 39 dgisselq
{\tt LODILO} & \multicolumn{4}{l|}{4'h4}
761 21 dgisselq
                & \multicolumn{4}{l|}{4'hf}
762
                & \multicolumn{3}{l|}{Cond.}
763
                & 1'b0
764
                & \multicolumn{4}{l|}{R. Reg}
765
                & \multicolumn{16}{l|}{16-bit Immediate}
766
                & \\\hline
767 39 dgisselq
16-b {\tt MPYU} & \multicolumn{4}{l|}{4'h4}
768 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
769
                & \multicolumn{3}{l|}{Cond.}
770
                & 1'b0 & \multicolumn{4}{l|}{Reg}
771
                & \multicolumn{16}{l|}{16-bit Offset}
772
                & Yes \\\hline
773 39 dgisselq
16-b {\tt MPYU}(I) & \multicolumn{4}{l|}{4'h4}
774 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
775
                & \multicolumn{3}{l|}{Cond.}
776
                & 1'b0 & \multicolumn{4}{l|}{4'hf}
777
                & \multicolumn{16}{l|}{16-bit Offset}
778
                & Yes \\\hline
779 39 dgisselq
16-b {\tt MPYS} & \multicolumn{4}{l|}{4'h4}
780 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
781
                & \multicolumn{3}{l|}{Cond.}
782
                & 1'b1 & \multicolumn{4}{l|}{Reg}
783
                & \multicolumn{16}{l|}{16-bit Offset}
784
                & Yes \\\hline
785 39 dgisselq
16-b {\tt MPYS}(I) & \multicolumn{4}{l|}{4'h4}
786 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
787
                & \multicolumn{3}{l|}{Cond.}
788
                & 1'b1 & \multicolumn{4}{l|}{4'hf}
789
                & \multicolumn{16}{l|}{16-bit Offset}
790
                & Yes \\\hline
791 39 dgisselq
{\tt ROL} & \multicolumn{4}{l|}{4'h5}
792 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
793
                & \multicolumn{3}{l|}{Cond.}
794
                & \multicolumn{21}{l|}{Operand B, truncated to low order 5 bits}
795
                & \\\hline
796 39 dgisselq
{\tt LOD} & \multicolumn{4}{l|}{4'h6}
797 21 dgisselq
                & \multicolumn{4}{l|}{R. Reg}
798
                & \multicolumn{3}{l|}{Cond.}
799
                & \multicolumn{21}{l|}{Operand B address}
800
                & \\\hline
801 39 dgisselq
{\tt STO} & \multicolumn{4}{l|}{4'h7}
802 21 dgisselq
                & \multicolumn{4}{l|}{D. Reg}
803
                & \multicolumn{3}{l|}{Cond.}
804
                & \multicolumn{21}{l|}{Operand B address}
805
                & \\\hline
806 39 dgisselq
{\tt SUB} & \multicolumn{4}{l|}{4'h8}
807 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
808
        &       \multicolumn{3}{l|}{Cond.}
809 32 dgisselq
        &       \multicolumn{21}{l|}{Operand B}
810 21 dgisselq
        & Yes \\\hline
811 39 dgisselq
{\tt AND} & \multicolumn{4}{l|}{4'h9}
812 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
813
        &       \multicolumn{3}{l|}{Cond.}
814
        &       \multicolumn{21}{l|}{Operand B}
815
        & Yes \\\hline
816 39 dgisselq
{\tt ADD} & \multicolumn{4}{l|}{4'ha}
817 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
818
        &       \multicolumn{3}{l|}{Cond.}
819
        &       \multicolumn{21}{l|}{Operand B}
820
        & Yes \\\hline
821 39 dgisselq
{\tt OR} & \multicolumn{4}{l|}{4'hb}
822 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
823
        &       \multicolumn{3}{l|}{Cond.}
824
        &       \multicolumn{21}{l|}{Operand B}
825
        & Yes \\\hline
826 39 dgisselq
{\tt XOR} & \multicolumn{4}{l|}{4'hc}
827 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
828
        &       \multicolumn{3}{l|}{Cond.}
829
        &       \multicolumn{21}{l|}{Operand B}
830
        & Yes \\\hline
831 39 dgisselq
{\tt LSL/ASL} & \multicolumn{4}{l|}{4'hd}
832 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
833
        &       \multicolumn{3}{l|}{Cond.}
834 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
835 21 dgisselq
        & Yes \\\hline
836 39 dgisselq
{\tt ASR} & \multicolumn{4}{l|}{4'he}
837 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
838
        &       \multicolumn{3}{l|}{Cond.}
839 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
840 21 dgisselq
        & Yes \\\hline
841 39 dgisselq
{\tt LSR} & \multicolumn{4}{l|}{4'hf}
842 21 dgisselq
        &       \multicolumn{4}{l|}{R. Reg}
843
        &       \multicolumn{3}{l|}{Cond.}
844 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
845 21 dgisselq
        & Yes \\\hline
846
\end{tabular}
847
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
848
\end{center}\end{table}
849
 
850
As you can see, there's lots of room for instruction set expansion.  The
851 24 dgisselq
NOOP and BREAK instructions are the only instructions within one particular
852 36 dgisselq
24--bit hole.  The rest of this space is reserved for future enhancements.
853 21 dgisselq
 
854
\section{Derived Instructions}
855 36 dgisselq
The Zip CPU supports many other common instructions, but not all of them
856 24 dgisselq
are single cycle instructions.  The derived instruction tables,
857 36 dgisselq
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
858
and~\ref{tbl:derived-4},
859 21 dgisselq
help to capture some of how these other instructions may be implemented on
860 36 dgisselq
the Zip CPU.  Many of these instructions will have assembly equivalents,
861 21 dgisselq
such as the branch instructions, to facilitate working with the CPU.
862
\begin{table}\begin{center}
863
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
864
Mapped & Actual  & Notes \\\hline
865 39 dgisselq
{\tt ABS Rx}
866
        & \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
867 36 dgisselq
        & Absolute value, depends upon derived NEG.\\\hline
868 39 dgisselq
\parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}
869
        & \parbox[t]{1.5in}{\tt ADD Ra,Rx\\ADD.C \$1,Ry\\ADD Rb,Ry}
870 21 dgisselq
        & Add with carry \\\hline
871 39 dgisselq
{\tt BRA.Cond +/-\$Addr}
872
        & \hbox{\tt MOV.cond \$Addr+PC,PC}
873 24 dgisselq
        & Branch or jump on condition.  Works for 15--bit
874
                signed address offsets.\\\hline
875 39 dgisselq
{\tt BRA.Cond +/-\$Addr}
876
        & \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
877 21 dgisselq
        & Branch/jump on condition.  Works for
878
        23 bit address offsets, but costs a register, an extra instruction,
879 33 dgisselq
        and sets the flags. \\\hline
880 39 dgisselq
{\tt BNC PC+\$Addr}
881
        & \parbox[t]{1.5in}{\tt TST \$Carry,CC \\ MOV.Z PC+\$Addr,PC}
882 21 dgisselq
        & Example of a branch on an unsupported
883
                condition, in this case a branch on not carry \\\hline
884 39 dgisselq
{\tt BUSY } & {\tt MOV \$-1(PC),PC} & Execute an infinite loop \\\hline
885
{\tt CLRF.NZ Rx }
886
        & {\tt XOR.NZ Rx,Rx}
887 21 dgisselq
        & Clear Rx, and flags, if the Z-bit is not set \\\hline
888 39 dgisselq
{\tt CLR Rx }
889
        & {\tt LDI \$0,Rx}
890 21 dgisselq
        & Clears Rx, leaves flags untouched.  This instruction cannot be
891
                conditional. \\\hline
892 39 dgisselq
{\tt EXCH.W Rx }
893
        & {\tt ROL \$16,Rx}
894 21 dgisselq
        & Exchanges the top and bottom 16--bit words of Rx \\\hline
895 39 dgisselq
{\tt HALT }
896
        & {\tt OR \$SLEEP,CC}
897
        & This only works when issued in interrupt/supervisor mode.  In user
898
        mode this is simply a wait until interrupt instruction. \\\hline
899
{\tt INT } & {\tt LDI \$0,CC} &  \\\hline
900
{\tt IRET}
901
        & {\tt OR \$GIE,CC}
902
        & Also known as an RTU instruction (Return to Userspace) \\\hline
903
{\tt JMP R6+\$Addr}
904
        & {\tt MOV \$Addr(R6),PC}
905 21 dgisselq
        & \\\hline
906 39 dgisselq
{\tt JSR PC+\$Addr}
907
        & \parbox[t]{1.5in}{\tt SUB \$1,SP \\
908 21 dgisselq
        MOV \$3+PC,R0 \\
909
        STO R0,1(SP) \\
910
        MOV \$Addr+PC,PC \\
911
        ADD \$1,SP}
912 24 dgisselq
        & Jump to Subroutine. Note the required cleanup instruction after
913 36 dgisselq
        returning.  This could easily be turned into a three instruction
914
        sequence, removing the preliminary stack instruction before and
915
        the cleanup after, by adjusting how any stack frame was built for
916
        this routine to include space at the top of the stack for the PC.
917 39 dgisselq
        Note also that jumping to a subroutine costs a copy register, {\tt R0}
918
        in this case.
919 36 dgisselq
        \\\hline
920 39 dgisselq
{\tt JSR PC+\$Addr  }
921
        & \parbox[t]{1.5in}{\tt MOV \$3+PC,R12 \\ MOV \$addr+PC,PC}
922 21 dgisselq
        &This is the high speed
923
        version of a subroutine call, necessitating a register to hold the
924
        last PC address.  In its favor, this method doesn't suffer the
925
        mandatory memory access of the other approach. \\\hline
926 39 dgisselq
{\tt LDI.l \$val,Rx }
927
        & \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
928
                        LDILO (\$val\&0x0ffff),Rx}
929 21 dgisselq
        & Sadly, there's not enough instruction
930
                space to load a complete immediate value into any register.
931
                Therefore, fully loading any register takes two cycles.
932
                The LDIHI (load immediate high) and LDILO (load immediate low)
933
                instructions have been created to facilitate this. \\\hline
934
\end{tabular}
935
\caption{Derived Instructions}\label{tbl:derived-1}
936
\end{center}\end{table}
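
As a worked example of the {\tt LDI.l} split in Tbl.~\ref{tbl:derived-1},
take the arbitrarily chosen value {\tt 0x12345678}, with $\gg$ and $\ll$
denoting right and left shifts:
\begin{eqnarray*}
\mbox{high half} &=& (\mbox{\tt 0x12345678} \gg 16) \,\&\, \mbox{\tt 0xffff}
        = \mbox{\tt 0x1234} \\
\mbox{low half} &=& \mbox{\tt 0x12345678} \,\&\, \mbox{\tt 0xffff}
        = \mbox{\tt 0x5678} \\
\mbox{value} &=& (\mbox{\tt 0x1234} \ll 16) \,|\, \mbox{\tt 0x5678}
\end{eqnarray*}
That is, {\tt LDIHI} would carry {\tt 0x1234} and {\tt LDILO} would carry
{\tt 0x5678}.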
937
\begin{table}\begin{center}
938
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
939
Mapped & Actual  & Notes \\\hline
940 39 dgisselq
{\tt LOD.b \$addr,Rx}
941
        & \parbox[t]{1.5in}{\tt %
942 21 dgisselq
        LDI     \$addr,Ra \\
943
        LDI     \$addr,Rb \\
944
        LSR     \$2,Ra \\
945
        AND     \$3,Rb \\
946
        LOD     (Ra),Rx \\
947
        LSL     \$3,Rb \\
948
        SUB     \$32,Rb \\
949
        ROL     Rb,Rx \\
950
        AND \$0ffh,Rx}
951
        & \parbox[t]{3in}{This CPU is designed for 32--bit word
952
        length instructions.  Byte addressing is not supported by the CPU or
953
        the bus, so it therefore takes more work to do.
954
 
955
        Note also that in this example, \$Addr is a byte-wise address, where
956 24 dgisselq
        all other addresses in this document are 32-bit wordlength addresses.
957
        For this reason,
958 21 dgisselq
        we needed to drop the bottom two bits.  This also limits the address
959
        space of character accesses using this method from 16 MB down to 4MB.}
960
                \\\hline
961 39 dgisselq
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
962
        & \parbox[t]{1.5in}{\tt LSL \$1,Ry \\
963 21 dgisselq
        LSL \$1,Rx \\
964
        OR.C \$1,Ry}
965
        & Logical shift left with carry.  Note that the
966
        instruction order is now backwards, to keep the conditions valid.
967 33 dgisselq
        That is, LSL sets the carry flag, so if we did this the other way
968 21 dgisselq
        with Rx before Ry, then the condition flag wouldn't have been right
969
        for an OR correction at the end. \\\hline
970 39 dgisselq
\parbox[t]{1.5in}{\tt LSR \$1,Rx \\ LSRC \$1,Ry}
971
        & \parbox[t]{1.5in}{\tt CLR Rz \\
972 21 dgisselq
        LSR \$1,Ry \\
973
        LDIHI.C \$8000h,Rz \\
974
        LSR \$1,Rx \\
975
        OR Rz,Rx}
976
        & Logical shift right with carry \\\hline
977 39 dgisselq
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
978
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx} & \\\hline
979
{\tt NOOP} & {\tt NOOP} & While there are many
980 21 dgisselq
        operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
981
        operations have consequences in that they might stall the bus if
982
        Rx isn't ready yet.  For this reason, we have a dedicated NOOP
983
        instruction. \\\hline
984 39 dgisselq
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline
985
{\tt POP Rx }
986
        & \parbox[t]{1.5in}{\tt LOD \$1(SP),Rx \\ ADD \$1,SP}
987 21 dgisselq
        & Note
        that for interrupt purposes, one can never depend upon the value at
        (SP).  Hence you read the value first and only then increment the
        stack pointer; had the pointer been incremented first, something
        might come along and overwrite that value before you could read
        it. \\\hline
992 36 dgisselq
\end{tabular}
993
\caption{Derived Instructions, continued}\label{tbl:derived-2}
994
\end{center}\end{table}
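
The address arithmetic at the start of the {\tt LOD.b} sequence in
Tbl.~\ref{tbl:derived-2} can be followed with an arbitrarily chosen byte
address of {\tt 0x1003}:
\begin{eqnarray*}
\mbox{word address ({\tt Ra})} &=& \mbox{\tt 0x1003} \gg 2 = \mbox{\tt 0x400} \\
\mbox{byte within word ({\tt Rb})} &=& \mbox{\tt 0x1003} \,\&\, 3 = 3 \\
\mbox{bit offset} &=& 3 \ll 3 = 24
\end{eqnarray*}
Which of the four bytes a given bit offset selects depends on the byte
ordering chosen, and this sketch makes no assumption about it.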
995
\begin{table}\begin{center}
996
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
997 39 dgisselq
{\tt PUSH Rx}
998 33 dgisselq
        & \parbox[t]{1.5in}{\tt SUB \$1,SP \\
999 21 dgisselq
        STO Rx,\$1(SP)}
1000 39 dgisselq
        & Note that for pipelined operation, it helps to coalesce all the
1001
        {\tt SUB}'s into one command, and place the {\tt STO}'s right
1002
        after each other.\\\hline
1003
{\tt PUSH Rx-Ry}
1004
        & \parbox[t]{1.5in}{\tt SUB \$n,SP \\
1005 36 dgisselq
        STO Rx,\$n(SP) \\
1006
        \ldots \\
1007
        STO Ry,\$1(SP)}
1008
        & Multiple pushes at once only need the single subtract from the
1009
        stack pointer.  This derived instruction is analogous to a similar one
1010
        on the Motorola 68k architecture, although the Zip Assembler
1011 39 dgisselq
        does not support this instruction (yet).  This instruction
1012
        also supports pipelined memory access.\\\hline
1013
{\tt RESET}
1014
        & \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
1015
        & This depends upon the peripheral base address being
1016 21 dgisselq
        in R12.
1017
 
1018
        Another opportunity might be to jump to the reset address from within
1019 39 dgisselq
        supervisor mode.\\\hline
1020
{\tt RET} & \parbox[t]{1.5in}{\tt LOD \$1(SP),PC}
1021 24 dgisselq
        & Note that this depends upon the calling context to clean up the
1022
        stack, as outlined for the JSR instruction.  \\\hline
1023 39 dgisselq
{\tt RET} & {\tt MOV R12,PC}
1024 21 dgisselq
        & This is the high(er) speed version, that doesn't touch the stack.
1025
        As such, it doesn't suffer a stall on memory read/write to the stack.
1026
        \\\hline
1027 39 dgisselq
{\tt STEP Rr,Rt}
1028
        & \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
1029 21 dgisselq
        & Step a Galois implementation of a Linear Feedback Shift Register, Rr,
1030
                using taps Rt \\\hline
1031 39 dgisselq
{\tt STO.b Rx,\$addr}
1032
        & \parbox[t]{1.5in}{\tt %
1033 21 dgisselq
        LDI \$addr,Ra \\
1034
        LDI \$addr,Rb \\
1035
        LSR \$2,Ra \\
1036
        AND \$3,Rb \\
1037
        SUB \$32,Rb \\
1038
        LOD (Ra),Ry \\
1039
        AND \$0ffh,Rx \\
1040 39 dgisselq
        AND \textasciitilde\$0ffh,Ry \\
1041 21 dgisselq
        ROL Rb,Rx \\
1042
        OR Rx,Ry \\
1043
        STO Ry,(Ra) }
1044
        & \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
1045
        for byte-wise operations.
1046
 
1047
        Note that in this example, \$addr is a
1048
        byte-wise address, whereas in all of our other examples it is a
1049
        32-bit word address. This also limits the address space
1050
        of character accesses from 16 MB down to 4MB.
1051
        Further, this instruction implies a byte ordering,
1052
        such as big or little endian.} \\\hline
1053 39 dgisselq
{\tt SWAP Rx,Ry }
1054
        & \parbox[t]{1.5in}{\tt
1055 21 dgisselq
        XOR Ry,Rx \\
1056
        XOR Rx,Ry \\
1057
        XOR Ry,Rx}
1058
        & While no extra registers are needed, this example
1059
        does take 3-clocks. \\\hline
1060 39 dgisselq
{\tt TRAP \#X}
1061
        & \parbox[t]{1.5in}{\tt LDI \$x,R0 \\ AND \textasciitilde\$GIE,CC }
1062 36 dgisselq
        & This works because whenever a user lowers the \$GIE flag, it sets
1063
        a TRAP bit within the CC register.  Therefore, upon entering the
1064
        supervisor state, the CPU only need check this bit to know that it
1065
        got there via a TRAP.  The trap could be made conditional by making
1066
        the LDI and the AND conditional.  In that case, the assembler would
1067
        quietly turn the LDI instruction into an LDILO and LDIHI pair,
1068 37 dgisselq
        but the effect would be the same. \\\hline
1069 36 dgisselq
\end{tabular}
1070
\caption{Derived Instructions, continued}\label{tbl:derived-3}
1071
\end{center}\end{table}
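
The three--{\tt XOR} {\tt SWAP} of Tbl.~\ref{tbl:derived-3} can be checked by
hand.  Write $x$ and $y$ for the initial contents of {\tt Rx} and {\tt Ry}
(symbols introduced here for illustration), and assume, as in the other
examples in these tables, that the last operand is the destination:
\begin{eqnarray*}
\mbox{\tt XOR Ry,Rx}: & & \mbox{\tt Rx} = x \oplus y \\
\mbox{\tt XOR Rx,Ry}: & & \mbox{\tt Ry} = y \oplus (x \oplus y) = x \\
\mbox{\tt XOR Ry,Rx}: & & \mbox{\tt Rx} = (x \oplus y) \oplus x = y
\end{eqnarray*}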
1072
\begin{table}\begin{center}
1073
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
1074 39 dgisselq
{\tt TST Rx}
1075
        & {\tt TST \$-1,Rx}
1076 21 dgisselq
        & Set the condition codes based upon Rx.  Could also do a CMP \$0,Rx,
1077
        ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc.  The TST and CMP
1078
        approaches won't stall future pipeline stages looking for the value
1079
        of Rx. \\\hline
1080 39 dgisselq
{\tt WAIT}
1081
        & {\tt OR \$GIE | \$SLEEP,CC}
1082
        & Wait until the next interrupt, then jump to supervisor/interrupt
1083
        mode. \\\hline
1084 21 dgisselq
\end{tabular}
1085 36 dgisselq
\caption{Derived Instructions, continued}\label{tbl:derived-4}
1086 21 dgisselq
\end{center}\end{table}
1087
\section{Pipeline Stages}
1088 32 dgisselq
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
1089
the Zip CPU supports a five stage pipeline.
1090 21 dgisselq
\begin{enumerate}
1091 36 dgisselq
\item {\bf Prefetch}: Reads instruction from memory and into a cache, if so
1092
        configured.  This
1093 21 dgisselq
        stage is actually pipelined itself, and so it will stall if the PC
1094
        ever changes.  Stalls are also created here if the instruction isn't
1095
        in the prefetch cache.
1096 36 dgisselq
 
1097
        The Zip CPU supports one of two prefetch methods, depending upon a flag
1098
        set at build time within the {\tt zipcpu.v} file.  The simplest is a
1099
        non--cached implementation of a prefetch.  This implementation is
1100
        fairly small, and ideal for
1101
        users of the Zip CPU who need the extra space on the FPGA fabric.
1102
        However, because this non--cached version has no cache, the maximum
        throughput is limited to about one instruction per five clocks.
1104
 
1105
        The second prefetch module is a pipelined prefetch with a cache.  This
1106
        module tries to keep the instruction address within a window of valid
1107
        instruction addresses.  While effective, it is not a traditional
1108
        cache implementation.  One unique feature of this cache implementation,
1109
        however, is that it can be cleared in a single clock.  A disappointing
        feature, though, is that it needs an extra internal pipeline stage
        to be implemented.
1112
 
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read,
        and immediate offset.  This stage also determines whether the flags will
        be set or whether the result will be written back.
\item {\bf Read Operands}: Read registers and apply any immediate values to
        them.  There is no means of detecting or flagging arithmetic overflow
        or carry when adding the immediate to the operand.  This stage will
        stall if any source operand is pending.
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
        instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load)
        and {\tt STO} (store) instructions.
        \begin{itemize}
        \item Loads will stall the entire pipeline until complete.
        \item Condition codes are available upon completion of the ALU stage.
        \item Issuing an instruction to the memory unit while the memory unit
                is busy will stall the entire pipeline.  If the bus deadlocks,
                only a reset will release the CPU.  (Watchdog timer, anyone?)
        \item The Zip CPU currently has no means of reading and acting on any
        error conditions on the bus.
        \end{itemize}
\item {\bf Write-Back}: Conditionally write back the result to the register
        set, applying the condition.  This routine is bi-entrant: either the
        memory or the simple instruction may request a register write.
\end{enumerate}

The Zip CPU does not support out of order execution.  Therefore, if the memory
unit stalls, every other instruction stalls.  Memory stores, however, can take
place concurrently with ALU operations, although memory reads (loads) cannot.
 
\section{Pipeline Stalls}
The processing pipeline can and will stall for a variety of reasons.  Some of
these are obvious, some less so.  These reasons are listed below:
\begin{itemize}
\item When the prefetch cache is exhausted

This reason should be obvious.  If the prefetch cache doesn't have the
instruction in memory, the entire pipeline must stall until enough of the
prefetch cache is loaded to support the next instruction.

\item While waiting for the pipeline to load following any taken branch, jump,
        return from interrupt or switch to interrupt context (5 stall cycles)

Fig.~\ref{fig:bcstalls}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/bc.eps}
\caption{A conditional branch generates 5 stall cycles}\label{fig:bcstalls}
\end{center}\end{figure}
illustrates the situation for a conditional branch.  In this case, the branch
instruction, {\tt BC}, is nominally followed by instructions {\tt I0} and so
forth.  However, since the branch is taken, the next instruction must be
{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.
Given that there are five stages to the pipeline, that accounts
for four of the five stalls.  The last stall cycle is lost in the pipelined
prefetch stage which needs at least one clock with a valid PC before it can
produce a new output.  {\Large\bf Note: When I did this myself, I counted
six stall cycles, for a total of seven cycles for this instruction.  Is five
really the right answer?}
 
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
not conditioned on the flags, can execute with only 2~stall cycles, such as
is shown in Fig.~\ref{fig:branch}.\footnote{Note that this behavior is
slated to be improved upon in subsequent releases.  With a better prefetch,
it should be possible to drop this down to one or zero stall cycles.}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps}
\caption{An expedited branch costs only 2~stall cycles}\label{fig:branch}
\end{center}\end{figure}
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
following the branch in memory, while {\tt IA} is the first instruction at the
branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does
not represent any instruction.)
 
\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
\item\ {\tt OPCODE ?,RA}
\item\ {\em (stall)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

Since the immediate is added to the register, as part of OpB decoding,
during the read operand stage so that it can be nicely settled before the ALU,
any instruction that will write back an operand must be separated from the
opcode that will read and apply an immediate offset by one instruction.  The
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.
 
\item When any (conditional) write to either the CC or PC Register is followed
        by a memory operation
\begin{enumerate}
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
\item\ {\em (stall, even if jump not taken)}
\item\ {\tt LOD \$X(RA),RB}
\end{enumerate}
A timing diagram of this pipeline situation is shown in Fig.~\ref{fig:bcmem},
\begin{figure}\begin{center}
\includegraphics[width=2in]{../gfx/bcmem.eps}
\caption{A (not taken) conditional branch followed by a memory operation}\label{fig:bcmem}
\end{center}\end{figure}
for a conditional branch, {\tt BC}, a memory operation, {\tt Mem} (which
must be a load here), and ALU instructions {\tt I1} and so forth.
Since branches take place in the writeback stage, the Zip CPU will stall the
pipeline for one clock anytime there may be a possible jump--forcing the
memory operation to stay in the operand decode stage.  This prevents
an instruction from executing a memory access after the jump but before the
jump is recognized.

This stall may be mitigated by shuffling the operations immediately following
a potential branch so that an ALU operation follows the branch instead of a
memory operation.
 
\item When reading from the CC register after setting the flags
\begin{enumerate}
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode}
\item\ {\em (stall)}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt BZ somewhere}
\end{enumerate}

The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic {\em during} the writeback cycle.
Trying to then place these into the input for one of the operands for an
ALU instruction during the same cycle
created a time delay loop that would no longer execute in a single 100~MHz
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
either \ldots).

This stall may be eliminated via proper scheduling, by placing an instruction
that does not set flags in between the ALU operation and the instruction
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
it doesn't read the condition codes.
 
\item When waiting for a memory read operation to complete
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

Remember, the Zip CPU does not support out of order execution.  Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared.  This is illustrated in
Fig.~\ref{fig:memrd},
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memrd.eps}
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
\end{center}\end{figure}
since it is especially true of a load
instruction, which must still write its operand back to the register file.
Note that there is an extra stall at the end of the memory cycle, so that
the memory unit will be idle for one clock before an instruction will be
accepted into the ALU.
Store instructions are different, as shown in Fig.~\ref{fig:memwr},
\begin{figure}\begin{center}
\includegraphics[width=5in]{../gfx/memwr.eps}
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
\end{center}\end{figure}
since they can be busy with the bus without impacting later write back
pipeline stages.  Hence, only loads stall the pipeline.

This, of course, also assumes that the memory being accessed is a single cycle
memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
as long as forty clocks.   During this time the CPU and the external bus
will be busy, and unable to do anything else.  Likewise, if it takes a couple
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.
 
\item Memory operation followed by a memory operation
\begin{enumerate}
\item\ {\tt STO address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt LOD address,RB}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}

In this case, the LOD instruction cannot start until the STO is finished,
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
stall while waiting for it to complete before the load instruction can
start.

The Zip CPU does have the capability of supporting pipelined memory access,
but only under the following conditions: the accesses within the pipeline
must be either all reads or all writes, all must use the same register for
their address, and there can be no stalls or other instructions between
pipelined memory access instructions.  Further, the offset to memory must be
increasing by one address each instruction.  These conditions work well for
saving registers to, or restoring them from, the stack.  Indeed, if you
noticed, both Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated
pipelined memory accesses.
 
\item When waiting for a conditional memory read operation to complete
\begin{enumerate}
\item\ {\tt LOD.Z address,RA}
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus
two cycles before using the bus, so there's a potential for an extra three
cycle cost due to bus contention between the prefetch and the CPU.

This is true for both the LOD and the STO instructions, with the exception that
the STO instruction will continue in parallel with any ALU instructions that
follow it.

\end{itemize}
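
The conditions listed above for pipelined memory access can be written down as
a small predicate.  The following C sketch is only an illustration of those
rules: the {\tt mem\_op} structure and the {\tt can\_pipeline} function are
hypothetical names invented here, not part of the Zip CPU sources, and the
sketch assumes the instructions are issued back to back with no stalls or
other instructions between them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memory instruction (illustration only). */
typedef struct {
    bool is_store;  /* true for STO, false for LOD */
    int  base_reg;  /* register supplying the address */
    int  offset;    /* immediate offset added to that register */
} mem_op;

/*
 * Returns true when ops[0..n-1], assumed to be issued back to back, meet
 * the stated conditions: all loads or all stores, one shared address
 * register, and offsets that increase by exactly one per instruction.
 */
bool can_pipeline(const mem_op *ops, size_t n)
{
    if (n < 2)
        return false;   /* a single access has nothing to pipeline with */
    for (size_t i = 1; i < n; i++) {
        if (ops[i].is_store != ops[0].is_store)
            return false;   /* mixed reads and writes */
        if (ops[i].base_reg != ops[0].base_reg)
            return false;   /* different address register */
        if (ops[i].offset != ops[i - 1].offset + 1)
            return false;   /* offset must step by one */
    }
    return true;
}
```

Three stores through the same register at offsets 1, 2 and 3 satisfy the
predicate, while a store followed by a load, or two loads whose offsets differ
by two, do not.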
 
\chapter{Peripherals}\label{chap:periph}

While the previous chapter describes a CPU in isolation, the Zip System
includes a minimum set of peripherals as well.  These peripherals are shown
in Fig.~\ref{fig:zipsystem}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/system.eps}
\caption{Zip System Peripherals}\label{fig:zipsystem}
\end{center}\end{figure}
and described here.  They are designed to make
the Zip CPU more useful in an Embedded Operating System environment.
 
\section{Interrupt Controller}\label{sec:pic}

Perhaps the most important peripheral within the Zip System is the interrupt
controller.  While the Zip CPU itself can only handle one interrupt, and has
only the one interrupt state: disabled or enabled, the interrupt controller
can make things more interesting.

The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register.  Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.

The interrupt controller has been designed so that bits can be controlled
individually without having any knowledge of the rest of the controller
setting.  To enable an interrupt, write to the register with the high order
global enable bit set and the respective interrupt enable bit set.  No other
bits will be affected.  To disable an interrupt, write to the register with
the high order global enable bit cleared and the respective interrupt enable
bit set.  To clear an interrupt, write a `1' to that interrupt's status bit.
Zeros written to the register have no effect, save that a zero written to the
master enable will disable all interrupts.

As an example, suppose you wished to enable interrupt \#4.  You would then
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
any past active state.  When you later wish to disable this interrupt, you would
write a {\tt 0x00100010} to the register.  As before, this both disables the
interrupt and clears the active indicator.  This also has the side effect of
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
to re-enable any other interrupts.
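
The register words used in this example follow directly from the bit layout
given above.  As a minimal sketch, the C helpers below build those words for
interrupt number {\tt n} (0 through 14).  The function and macro names are
made up for this illustration and do not come from any Zip CPU header.

```c
#include <stdint.h>

/* Bit 31: master (global) interrupt enable. */
#define PIC_MASTER_ENABLE 0x80000000u

/* Enable interrupt n (0..14) and clear any stale active state for it. */
uint32_t pic_enable_word(unsigned n)
{
    return PIC_MASTER_ENABLE | (1u << (16 + n)) | (1u << n);
}

/*
 * Disable interrupt n and clear its active state.  Bit 31 is written as
 * zero, so this also disables all interrupts until the master enable is
 * written again.
 */
uint32_t pic_disable_word(unsigned n)
{
    return (1u << (16 + n)) | (1u << n);
}

/* Word that restores the master enable without touching any other bit. */
uint32_t pic_master_enable_word(void)
{
    return PIC_MASTER_ENABLE;
}
```

For interrupt \#4 these helpers reproduce the values quoted in the text:
{\tt 0x80100010} to enable, {\tt 0x00100010} to disable, and
{\tt 0x80000000} to turn the master enable back on.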
 
The Zip System currently hosts two interrupt controllers, a primary and a
secondary.  The primary interrupt controller has one interrupt line (perhaps
more if you configure it for more) which may come from an external interrupt
controller, and one interrupt line from the secondary controller.  Other
primary interrupts include the system timers, the jiffies interrupt, and the
manual cache interrupt.  The secondary interrupt controller maintains an
interrupt state for all of the processor accounting counters.

\section{Counter}

The Zip Counter is a very simple counter: it just counts.  It cannot be
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
counter just sets the current value, and it starts counting again from that
value.

Eight counters are implemented in the Zip System for process accounting.
This may change in the future, as nothing as yet uses these counters.
 
\section{Timer}

The Zip Timer is also very simple: it simply counts down to zero.  When it
transitions from a one to a zero it creates an interrupt.

Writing any non-zero value to the timer starts the timer.  If the high order
bit is set when writing to the timer, the timer becomes an interval timer and
reloads its last start time on any interrupt.  Hence, to mark seconds, one
might set the timer to 100~million (the number of clocks per second), and
set the high bit.  Ever after, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock).  This reload capability also limits the
maximum timer value to $2^{31}-1$ (about 21~seconds using a 100~MHz clock),
rather than $2^{32}-1$.
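
As a small sketch of the rule just described, the C helper below forms the
word one would write to a timer.  The names {\tt timer\_word},
{\tt TIMER\_INTERVAL\_BIT} and {\tt TIMER\_MAX\_COUNT} are assumptions made
for this example only.

```c
#include <stdint.h>

#define TIMER_INTERVAL_BIT 0x80000000u  /* high order bit: reload on interrupt */
#define TIMER_MAX_COUNT    0x7fffffffu  /* 2^31 - 1, the largest usable count */

/*
 * Build the value to write to a timer register.  Returns 0, a value that
 * does not start the timer, when the requested count cannot be represented.
 */
uint32_t timer_word(uint32_t clocks, int interval)
{
    if (clocks == 0 || clocks > TIMER_MAX_COUNT)
        return 0;
    return interval ? (clocks | TIMER_INTERVAL_BIT) : clocks;
}
```

With a 100~MHz clock, a once per second interval timer is therefore
{\tt timer\_word(100000000, 1)}, which is {\tt 0x85f5e100}.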
 
\section{Watchdog Timer}

The watchdog timer is no different from any of the other timers, save for one
critical difference: the interrupt line from the watchdog
timer is tied to the reset line of the CPU.  Hence writing a `1' to the
watchdog timer will always reset the CPU.
To stop the Watchdog timer, write a `0' to it.  To start it,
write any other number to it---as with the other timers.

While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.

\section{Bus Watchdog}
There is an additional watchdog timer on the Wishbone bus.  This timer,
however, is hardware configured and not software configured.  The timer is
reset at the beginning of any bus transaction, and only counts clocks during
such bus transactions.  If the bus transaction takes longer than the number
of counts the timer allots, it will raise a bus error flag to terminate the
transaction.  This is useful in the case of any peripherals that are
misbehaving.  If the bus watchdog terminates a bus transaction, the CPU may
then read from its port to find out which memory location created the problem.

Aside from its unusual configuration, the bus watchdog is just another
implementation of the fundamental timer described above.
 
\section{Jiffies}

This peripheral is motivated by the Linux use of `jiffies' whereby a process
can request to be put to sleep until a certain number of `jiffies' have
elapsed.  Using this interface, the CPU can read the number of `jiffies'
from the peripheral (it only has the one location in address space), add the
sleep length to it, and write the result back to the peripheral.  The zipjiffies
peripheral will record the value written to it only if it is nearer the current
counter value than the last current waiting interrupt time.  If no other
interrupts are waiting, and this time is in the future, it will be enabled.
(There is currently no way to disable a jiffies interrupt once set, other
than to disable the interrupt line in the interrupt controller.)  The processor
may then place this sleep request into a list among other sleep requests.
Once the timer expires, it would write the next jiffies request to the
peripheral and wake up the process whose timer had expired.

Indeed, the Jiffies register is nothing more than a glorified counter with
an interrupt.  Unlike the other counters, the Jiffies register cannot be set.
Writes to the Jiffies register create an interrupt time.  When the Jiffies
register later equals the value written to it, an interrupt will be asserted
and the register then continues counting as though no interrupt had taken
place.

The purpose of this register is to support alarm times within a CPU.  To
set an alarm for a particular process $N$ clocks in advance, read the current
Jiffies value, add $N$, and write it back to the Jiffies register.  The
O/S must also keep track of values written to the Jiffies register.  Thus,
when an `alarm' trips, it should be removed from the list of alarms, the list
should be sorted, and the next alarm in terms of Jiffies should be written
to the register.
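
The bookkeeping described here can be sketched in a few lines of C.  The
sketch assumes the Jiffies counter is 32~bits wide and simply wraps around on
overflow, and every name in it ({\tt jiffies\_alarm}, {\tt jiffies\_before},
{\tt next\_alarm}) is invented for illustration.  It is not the code of any
existing Zip CPU operating system.

```c
#include <stdint.h>
#include <stddef.h>

/* Alarm time for an event n ticks after `now`; wraps modulo 2^32. */
uint32_t jiffies_alarm(uint32_t now, uint32_t n)
{
    return now + n;
}

/*
 * True when a comes before b, even across a counter rollover.  This relies
 * on the usual two's complement conversion and is valid as long as the two
 * values are less than 2^31 ticks apart.
 */
int jiffies_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Index of the pending alarm nearest to `now`, or -1 if the list is empty.
 * All alarms are assumed to lie in the future.  The value at that index is
 * the one the O/S would write to the Jiffies register next.
 */
int next_alarm(const uint32_t *alarms, size_t count, uint32_t now)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (best < 0 ||
            (uint32_t)(alarms[i] - now) < (uint32_t)(alarms[best] - now))
            best = (int)i;
    }
    return best;
}
```

Comparing differences rather than raw values is what keeps the ordering
correct when an alarm time has wrapped past zero but the counter has not yet.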
 
\section{Direct Memory Access Controller}

The Direct Memory Access (DMA) controller can be used to either move memory
from one location to another, to read from a peripheral into memory, or to
write from memory into a peripheral, all without CPU intervention.  Further,
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
any DMA memory move will by nature be faster than a corresponding program
accomplishing the same move.  To put this to numbers, it may take a program
18~clocks per word transferred, whereas this DMA controller can move one
word in two clocks--provided it has bus access.  (The CPU gets priority over the
bus.)

When copying memory from one location to another, the DMA controller will
copy in units of a given transfer length--up to 1024 words at a time.  It will
read that transfer length into its internal buffer, and then write to the
destination address from that buffer.  If the CPU interrupts a DMA transfer,
it will release the bus, let the CPU complete whatever it needs to do, and then
restart its transfer by writing the contents of its internal buffer and then
re-entering its read cycle again.
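
The read-then-write behavior described above can be modeled in software.
The C function below is only a behavioral sketch under simplifying
assumptions: it ignores bus arbitration and CPU preemption entirely, and the
names {\tt dma\_copy\_model} and {\tt DMA\_BUFFER\_WORDS} are hypothetical.
It returns the number of read bursts needed so the effect of the transfer
length can be seen.

```c
#include <stdint.h>
#include <stddef.h>

#define DMA_BUFFER_WORDS 1024   /* largest transfer length, per the text */

/*
 * Software model of a memory to memory DMA copy of `total` words using a
 * transfer length of `tlen` words.  Each pass fills the internal buffer
 * from the source, then drains it to the destination.  Returns the number
 * of passes, or 0 if tlen is not in 1..DMA_BUFFER_WORDS.
 */
size_t dma_copy_model(uint32_t *dst, const uint32_t *src,
                      size_t total, size_t tlen)
{
    uint32_t buffer[DMA_BUFFER_WORDS];
    size_t bursts = 0;

    if (tlen == 0 || tlen > DMA_BUFFER_WORDS)
        return 0;
    while (total > 0) {
        size_t n = (total < tlen) ? total : tlen;
        for (size_t i = 0; i < n; i++)  /* read phase */
            buffer[i] = src[i];
        for (size_t i = 0; i < n; i++)  /* write phase */
            dst[i] = buffer[i];
        src += n;
        dst += n;
        total -= n;
        bursts++;
    }
    return bursts;
}
```

Copying 2500~words with a transfer length of 1024, for instance, takes three
passes in this model: two full buffers and one partial buffer of 452~words.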
 
When coupled with a peripheral, the DMA controller can be configured to start
a memory copy on an interrupt line going high.  Further, the controller can be
configured to issue reads from (or writes to) the same address instead of
incrementing the address at each clock.  The DMA completes once the total
number of items specified (not the transfer length) has been transferred.

In each case, once the transfer is complete and the DMA unit returns to
idle, the DMA will issue an interrupt.
 
\chapter{Operation}\label{chap:ops}

The Zip CPU, and even the Zip System, is not a System on a Chip (SoC).  It
needs to be connected to its operational environment in order to be used.
Specifically, some per system adjustments need to be made:
\begin{enumerate}
\item The Zip System depends upon an external 32-bit Wishbone bus.  This
        must exist, and must be connected to the Zip CPU for it to work.
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}.  This is
        the program counter of the first instruction following a reset.
\item If you want the Zip System to start up on its own, you will need to
        set the {\tt START\_HALTED} parameter to zero.  Otherwise, if you
        wish to manually start the CPU, that is if upon reset you want the
        CPU to start in its halted, reset state, then set this parameter to
        one.
\item The third parameter to set is the number of interrupts you will be
        providing from external to the CPU.  This can be anything from one
        to nine, but it cannot be zero.  (Wire this line to a 1'b0 if you
        do not wish to support any external interrupts.)
\item Finally, you need to place into some Wishbone accessible address, whether
        RAM or (more likely) ROM, the initial instructions for the CPU.
\end{enumerate}
If you have enabled your CPU to start automatically, then upon power up the
CPU will immediately start executing your instructions.

This is, however, not how I have used the Zip CPU.  I have instead used the
Zip CPU in a more controlled environment.  For me, the CPU starts in a
halted state, and waits to be told to start.  Further, the RESET address is a
location in RAM.  After bringing up the board I am using, and further the
bus that is on it, the RAM memory is then loaded externally with the program
I wish the Zip System to run.  Once the RAM is loaded, I release the CPU.
The CPU then runs until its halt condition, at which point its task is
complete.

Eventually, I intend to place an operating system onto the ZipSystem; I'm
just not there yet.

The rest of this chapter examines some common programming models, and how they
might be applied to the Zip System, and then finishes with a couple of examples.
 
\section{System High}
The easiest and simplest way to run the Zip CPU is in the system high mode.
In this mode, the CPU runs your program in supervisor mode from reboot to
power down, and is never interrupted.  You will need to poll the interrupt
controller to determine when any external condition has become active.  This
mode is useful, and can handle many microcontroller tasks.

Even better, in system high mode, all of the user registers are available
to the system high program as variables.  Accessing these registers can be
done in a single clock cycle, which would move them to the active register
set or move them back.  While this may seem like a load or store instruction,
none of these register accesses will suffer from memory delays.

The one thing that cannot be done in supervisor mode is a wait for interrupt
instruction.  This, however, is easily rectified by jumping to a user task
within the supervisor's memory space, such as Tbl.~\ref{tbl:shi-idle}.
\begin{table}\begin{center}
\begin{tabbing}
{\tt supervisor\_idle:} \\
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
\>      {\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
\>      {\em ; outside of the CPU's address space when it switches to user} \\
\>      {\em ; mode.} \\
\>      {\tt MOV supervisor\_idle\_continue,uPC} \\
\>      {\em ; Put the processor into user mode and to sleep in the same} \\
\>      {\em ; instruction. } \\
\>      {\tt OR \$SLEEP|\$GIE,CC} \\
{\tt supervisor\_idle\_continue:} \\
\>      {\em ; Now, if we haven't done this inline, we need to return} \\
\>      {\em ; to whatever function called us.} \\
\>      {\tt RETN} \\
\end{tabbing}
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
\end{center}\end{table}
 
\section{Traditional Interrupt Handling}
Although the Zip CPU does not have a traditional interrupt architecture,
it is possible to create the more traditional interrupt approach via software.
In this mode, the programmable interrupt controller is used together with the
supervisor state to create the illusion of more traditional interrupt handling.

To set this up, upon reboot the supervisor task:
\begin{enumerate}
\item Creates a (single) user context, a user stack, and sets the user
        program counter to the entry of the user task
\item Creates a task table of ISR entries
\item Enables the master interrupt enable via the interrupt controller, albeit
        without enabling any of the fifteen potential underlying interrupts.
\item Switches to user mode, as the first part of the while loop in
        Tbl.~\ref{tbl:traditional-isr}.
\end{enumerate}
\begin{table}\begin{center}
\begin{tabbing}
{\tt while(true) \{} \\
\hbox to 0.25in{}\= {\tt rtu();}\\
        \> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\
        \>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
        \> {\tt \} else \{} \\
        \> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\
\\
        \> \> {\em // Save the user context before running any ISRs.  This could easily be}\\
        \> \> {\em // implemented as an inline assembly routine or macro}\\
        \> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\
        \> \> {\em // At this point, we know an interrupt has taken place:  Ask the programmable}\\
        \> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
        \> \>   {\tt int        picv = *pic;}\\
        \> \>   {\em // Turn off all active interrupts}\\
        \> \>   {\em // Globally disable interrupt generation in the process}\\
        \> \>   {\tt int        active = (picv >> 16) \& picv \& 0x07fff;}\\
        \> \>   {\tt *pic = (active<<16);}\\
        \> \>   {\em // We build a mask of interrupts to re-enable in picv.}\\
        \> \>   {\tt picv = 0;}\\
        \> \>   {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
        \> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
        \> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\
        \> \>\>\>       {\em // Acknowledge this particular interrupt.  While we could acknowledge all}\\
        \> \>\>\>       {\em // interrupts at once, by acknowledging only those with ISR's we allow}\\
        \> \>\>\>       {\em // the user process to use peripherals manually, and to manually check}\\
        \> \>\>\>       {\em // whether or not those other interrupts had occurred.}\\
        \> \>\>\>       {\tt *pic = msk; }\\
        \> \>\>\>       {\tt rtu(); }\\
        \> \>\>\>       {\em // The ISR will only exit on a trap in the Zip architecture.  There is}\\
        \> \>\>\>       {\em // no {\tt RETI} instruction.  Since the PIC holds all interrupts disabled,}\\
        \> \>\>\>       {\em // there is no need to check for further interrupts.}\\
        \> \>\>\>       {\em // }\\
        \> \>\>\>       {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
        \>\>\>\>        {\em // re-enable its own interrupt without re-enabling all interrupts.  Hence, we}\\
        \>\>\>\>        {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
        \> \>\>\>       {\em // re-enabled. }\\
        \> \>\>\>       {\tt mov(uR0,tmp); }\\
        \> \>\>\>       {\tt picv |= (tmp \& 0x7fff) << 16; }\\
        \> \>\>         {\tt \} }\\
        \> \>   {\tt \} }\\
        \> \>   {\tt RESTORE\_PARTIAL\_CONTEXT; }\\
        \> \>   {\em // Re-activate all (requested) interrupts }\\
        \> \>   {\tt *pic = picv | 0x80000000; }\\
        \>{\tt \} }\\
{\tt \}}\\
\end{tabbing}
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
\end{center}\end{table}
 
We can work through the interrupt handling process by examining
Tbl.~\ref{tbl:traditional-isr}.  First, remember, the CPU is always running
either the user or the supervisor context.  Once the supervisor switches to
user mode, control does not return until either an interrupt or a trap
has taken place.  (Okay, there's also the possibility of a bus error, or an
illegal instruction such as an unimplemented floating point instruction---but
for now we'll just focus on the trap instruction.)  Therefore, if the trap bit
isn't set, then we know an interrupt has taken place.
 
To process an interrupt, we steal the user's stack: the R0, CC, and PC
registers are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
\begin{table}\begin{center}
\begin{tabbing}
SAVE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
\>        {\tt MOV -3(uSP),R3} \\
\>        {\tt MOV uR0,R0} \\
\>        {\tt MOV uCC,R1} \\
\>        {\tt MOV uPC,R2} \\
\>        {\tt STO R0,1(R3)} {\em ; Exploit memory pipelining: }\\
\>        {\tt STO R1,2(R3)} {\em ; All instructions write to stack }\\
\>        {\tt STO R2,3(R3)} {\em ; All offsets increment by one }\\
\>        {\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
\end{tabbing}
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
\end{center}\end{table}
This is much cheaper than the full context swap of a preemptive multitasking
kernel, but it also depends upon the ISR saving any state it uses.  Further,
if multiple ISR's get called at once, this loses its optimality property
very quickly.
 
As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which
interrupts are enabled, and the bottom stores which have tripped.  (Interrupts
may trip without being enabled; they just will not generate an interrupt to the
CPU.)  Our first step is to query the register to find out our interrupt
state, and then to disable any interrupts that have tripped.  To do
that, we write a one to the enable half of the register while also clearing
the top bit (master interrupt enable).  This has the consequence of disabling
any and all further interrupts, not just the ones that have tripped.  Hence,
upon completion, we re--enable the master interrupt bit.  Finally,
we keep track of which interrupts have tripped.
 
Using the bit mask of interrupts that have tripped, we walk through all fifteen
possible interrupts.  If there is an ISR installed, we acknowledge and reset
the interrupt within the PIC, and then call the ISR.  The ISR, however, cannot
re--enable its interrupt without re--enabling the master interrupt bit.  Thus,
to keep things simple, when the ISR is finished it places its interrupt
mask back into R0, or clears R0.  This tells the supervisor mode process which
interrupts to re--enable.  Any other registers that the ISR uses must be
saved and restored.  (This is only truly optimal if only a single ISR is
called.)  As a final instruction, the ISR clears the GIE bit by executing a user
trap.  (Remember, the Zip CPU has no {\tt RETI} instruction to restore the
stack and return to userland.  It needs to go through the supervisor mode to
get there.)
 
Then, once all interrupts are handled, the user context is restored in a
fashion similar to Tbl.~\ref{tbl:restore-partial}.
\begin{table}\begin{center}
\begin{tabbing}
RESTORE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
\>        {\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
\>        {\tt LOD 1(R3),R0} {\em ; Exploit memory pipelining: }\\
\>        {\tt LOD 2(R3),R1} {\em ; All instructions read from stack }\\
\>        {\tt LOD 3(R3),R2} {\em ; All offsets increment by one }\\
\>        {\tt MOV R0,uR0} \\
\>        {\tt MOV R1,uCC} \\
\>        {\tt MOV R2,uPC} \\
\>        {\tt MOV 3(R3),uSP} \\
\end{tabbing}
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
\end{center}\end{table}
Again, this is short and sweet simply because any other registers that needed
saving were saved within the ISR.
 
There you have it: the Zip CPU, with its non-traditional interrupt architecture,
can still process interrupts in a very traditional fashion.
 
\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.  On the Zip CPU, this task is quite simple,
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt idle\_task:} \\
&        {\em ; Wait for the next interrupt, then switch to supervisor task} \\
&        {\tt WAIT} \\
&        {\em ; When we come back, it's because the supervisor wishes to} \\
&        {\em ; wait for an interrupt again, so go back to the top.} \\
&        {\tt BRA idle\_task} \\
\end{tabular}
\caption{Example Idle Loop}\label{tbl:idle-asm}
\end{center}\end{table}
When this task runs, the CPU will fill up all of the pipeline stages up to the
ALU.  The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
a sleep state where nothing more moves.  Sure, there may be some more settling:
the pipe cache may continue to read until full, and other instructions may
issue until the pipeline fills, but then everything will stall.  Then, once an
interrupt takes place, control passes to the supervisor task to handle the
interrupt.  When control passes back to this task, it will be on the next
instruction.  Since that next instruction sends us back to the top of the
task, the idle task thus does nothing but wait for an interrupt.
 
This should be the lowest priority task, the task that runs when nothing else
can.  It will help lower the FPGA power usage overall---at least its dynamic
power usage.
 
\section{Example: Memory Copy}
One common operation is that of a memory move or copy.  Consider the C code
shown in Tbl.~\ref{tbl:memcp-c}.
\begin{table}\begin{center}
\parbox{4in}{\begin{tabbing}
{\tt void} \= {\tt memcp(void *dest, void *src, int len) \{} \\
        \> {\tt for(int i=0; i<len; i++)} \\
        \> \hspace{0.2in} {\tt *dest++ = *src++;} \\
\}
\end{tabbing}}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
\end{center}\end{table}
This same code can be translated into Zip Assembly as shown in
Tbl.~\ref{tbl:memcp-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
memcp: \\
&        {\em ; R0 = *dest, R1 = *src, R2 = LEN} \\
&        {\em ; The following will operate in 17 clocks per word minus one clock} \\
&        {\tt CMP 0,R2} \\
&        {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\
&        {\em ; (One stall on potentially writing to PC)} \\
loop: \\
&        {\tt LOD (R1),R3} \\
&        {\em ; (4 stalls, cannot be scheduled away)} \\
&        {\tt STO R3,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
&        {\tt ADD 1,R0} \\
&        {\tt ADD 1,R1} \\
&        {\tt SUB 1,R2} \\
&        {\tt BNZ loop} \\
&        {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\
&        {\tt RET} \\
\end{tabular}
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
\end{center}\end{table}
This example points out several things associated with the Zip CPU.  First,
a straightforward implementation of a for loop is not the fastest loop
structure.  For this reason, we have placed the test to continue at the
end.  Second, all pointers are {\tt void} pointers to arbitrary 32--bit
data types.  The Zip CPU does not have explicit support for smaller or larger
data types, and so this memory copy cannot be applied at a byte level.
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.
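
The loop shape described above, with the continuation test at the end, can
also be written directly in C.  The following is only a minimal sketch and not
part of the Zip CPU sources: it assumes 32--bit words ({\tt uint32\_t}), and
the name {\tt memcp\_words} is hypothetical.  It also rejects negative lengths,
whereas the assembly only tests for zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical word-wise copy with the loop test placed at the end. */
void memcp_words(uint32_t *dest, const uint32_t *src, int len)
{
    if (len <= 0)          /* early (conditional) return for an empty copy */
        return;
    do {
        *dest++ = *src++;  /* load one word, store it, advance both pointers */
    } while (--len != 0);  /* decrement the count, loop while it is non-zero */
}
```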
 
\section{Example: Context Switch}
 
Fundamental to any multitasking system is the ability to switch from one
task to the next.  In the ZipSystem, this is accomplished in one of a couple
ways.  The first step is that an interrupt happens.  Anytime an interrupt
happens, the CPU needs to execute the following tasks in supervisor mode:
\begin{enumerate}
\item Check for a trap instruction.  That is, if the user task requested a
        trap, we may not wish to adjust the context, check interrupts, or call
        the scheduler.  Tbl.~\ref{tbl:trap-check}
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt return\_to\_user:} \\
&       {\em ; The instruction before the context switch processing must} \\
&       {\em ; be the RTU instruction that enacted user mode in the first} \\
&       {\em ; place.  We show it here just for reference.} \\
&       {\tt RTU} \\
{\tt trap\_check:} \\
&       {\tt MOV uCC,R0} \\
&       {\tt TST \$TRAP,R0} \\
&       {\em ; No trap was requested, so go on to the context swap} \\
&       {\tt BZ swap\_out} \\
&       {\em ; Do something here to execute the trap} \\
&       {\em ; Don't need to call the scheduler, so we can just return} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check}
\end{center}\end{table}
        shows the rudiments of this code, while showing nothing of how the
        actual trap would be implemented.
 
You may also wish to note that the instruction before the first instruction
in our context swap {\em must be} a return to userspace instruction.
Remember, the supervisor process is re--entered where it left off.  This is
different from many other processors that enter interrupt mode at some vector
or other.  In this case, we always enter supervisor mode right where we last
left.\footnote{The one exception to this rule is upon reset where supervisor
mode is entered at a pre--programmed wishbone memory address.}
 
\item Capture user counters.  If the operating system is keeping track of
        system usage via the accounting counters, those counters need to be
        copied and accumulated into some master counter at this point.
 
\item Preserve the old context.  This involves pushing all the user registers
        onto the user stack and then copying the resulting stack address
        into the task's task structure, as shown in Tbl.~\ref{tbl:context-out}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_out:} \\
&        {\tt MOV -15(uSP),R5} \\
&        {\tt STO R5,stack(R12)} \\
&        {\tt MOV uR0,R0} \\
&        {\tt MOV uR1,R1} \\
&        {\tt MOV uR2,R2} \\
&        {\tt MOV uR3,R3} \\
&        {\tt MOV uR4,R4} \\
&        {\tt STO R0,1(R5)} {\em ; Exploit memory pipelining: }\\
&        {\tt STO R1,2(R5)} {\em ; All instructions write to stack }\\
&        {\tt STO R2,3(R5)} {\em ; All offsets increment by one }\\
&        {\tt STO R3,4(R5)} {\em ; Longest pipeline is 5 cycles.}\\
&        {\tt STO R4,5(R5)} \\
        & \ldots {\em ; Need to repeat for all user registers} \\
\iffalse
&        {\tt MOV uR5,R0} \\
&        {\tt MOV uR6,R1} \\
&        {\tt MOV uR7,R2} \\
&        {\tt MOV uR8,R3} \\
&        {\tt MOV uR9,R4} \\
&        {\tt STO R0,6(R5) }\\
&        {\tt STO R1,7(R5) }\\
&        {\tt STO R2,8(R5) }\\
&        {\tt STO R3,9(R5) }\\
&        {\tt STO R4,10(R5)} \\
\fi
&        {\tt MOV uR10,R0} \\
&        {\tt MOV uR11,R1} \\
&        {\tt MOV uR12,R2} \\
&        {\tt MOV uCC,R3} \\
&        {\tt MOV uPC,R4} \\
&        {\tt STO R0,11(R5)}\\
&        {\tt STO R1,12(R5)}\\
&        {\tt STO R2,13(R5)}\\
&        {\tt STO R3,14(R5)}\\
&        {\tt STO R4,15(R5)} \\
&       {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
&       {\em ; elsewhere (in the task structure) }\\
\end{tabular}
\caption{Example Storing User Task Context}\label{tbl:context-out}
\end{center}\end{table}
For the sake of discussion, we assume the supervisor maintains a
pointer to the current task's structure in supervisor register
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
structure indicating where the stack pointer is to be kept within it.
 
        For those who are still interested, the full code for this context
        save can be found as an assembler macro within the assembler
        include file, {\tt sys.i}.
 
\item Reset the watchdog timer.  If you are using the watchdog timer, it should
        be reset on a context swap, to know that things are still working.
        Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
\begin{table}\begin{center}
\begin{tabular}{ll}
\multicolumn{2}{l}{{\tt `define WATCHDOG\_ADDRESS 32'hc000\_0001}}\\
\multicolumn{2}{l}{{\tt `define WATCHDOG\_TICKS 32'd1\_000\_000} {\em ; = 10 ms}}\\
&       {\tt LDI WATCHDOG\_ADDRESS,R0} \\
&       {\tt LDI WATCHDOG\_TICKS,R1} \\
&       {\tt STO R1,(R0)}
\end{tabular}
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
\end{center}\end{table}
 
\item Interrupt handling.  An interrupt handler within the Zip System is nothing
        more than a task.  At context swap time, the supervisor needs to
        disable all of the interrupts that have tripped, and then enable
        all of the tasks that would deal with each of these interrupts.
        These can be user tasks, run at higher priority than any other user
        tasks.  Either way, they will need to re--enable their own interrupt
        themselves, if the interrupt is still relevant.
 
        An example of this master interrupt handling is shown in
        Tbl.~\ref{tbl:pre-handler}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt pre\_handler:} \\
&       {\tt LDI PIC\_ADDRESS,R0 } \\
&       {\em ; Start by grabbing the interrupt state from the interrupt}\\
&       {\em ; controller.  We'll store this into the register R7 so that }\\
&       {\em ; we can keep and preserve this information for the scheduler}\\
&       {\em ; to use later. }\\
&       {\tt LOD (R0),R1} \\
&       {\tt MOV R1,R7 } \\
&       {\em ; As a next step, we need to acknowledge and disable all active}\\
&       {\em ; interrupts. We'll start by calculating all of our active}\\
&       {\em ; interrupts.}\\
&       {\tt AND 0x07fff,R1 } \\
&       {\em ; Put the active interrupts into the upper half of R1} \\
&       {\tt ROL 16,R1 } \\
&       {\tt LDILO 0x0ffff,R1   } \\
&       {\tt AND R7,R1}\\
&       {\em ; Acknowledge and disable active interrupts}\\
&       {\em ; This also disables all interrupts from the controller, so}\\
&       {\em ; we'll need to re-enable interrupts in general shortly } \\
&       {\tt STO R1,(R0) } \\
&       {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
&       {\em ; release any tasks that depended upon them. } \\
\end{tabular}
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
\end{center}\end{table}
 
\item Calling the scheduler.  This needs to be done to pick the next task
        to switch to.  It may be an interrupt handler, or it may be a normal
        user task.  From a priority standpoint, it would make sense that the
        interrupt handlers all have a higher priority than the user tasks,
        and that once they have been called the user tasks may then be called
        again.  If no task is ready to run, run the idle task to wait for an
        interrupt.
 
        This suggests a minimum of four task priorities:
        \begin{enumerate}
        \item Interrupt handlers, executed with their interrupts disabled
        \item Device drivers, executed with interrupts re-enabled
        \item User tasks
        \item The idle task, executed when nothing else is able to execute
        \end{enumerate}
 
        For our purposes here, we'll just assume that a pointer to the current
        task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
        called, and that the next current task is likewise placed into
        {\tt R12}.
 
\item Restore the new task's context.  Given that the scheduler has returned a
        task that can be run at this time, the stack pointer needs to be
        pulled out of the task's task structure, placed into the user
        register, and then the rest of the user registers need to be popped
        back off of the stack to run this task.  An example of this is
        shown in Tbl.~\ref{tbl:context-in},
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_in:} \\
&       {\tt LOD stack(R12),R5} \\
&       {\tt MOV 15(R5),uSP} \\
        & {\em ; Be sure to exploit the memory pipelining capability} \\
&       {\tt LOD 1(R5),R0} \\
&       {\tt LOD 2(R5),R1} \\
&       {\tt LOD 3(R5),R2} \\
&       {\tt LOD 4(R5),R3} \\
&       {\tt LOD 5(R5),R4} \\
&       {\tt MOV R0,uR0} \\
&       {\tt MOV R1,uR1} \\
&       {\tt MOV R2,uR2} \\
&       {\tt MOV R3,uR3} \\
&       {\tt MOV R4,uR4} \\
        & \ldots {\em ; Need to repeat for all user registers} \\
&       {\tt LOD 11(R5),R0} \\
&       {\tt LOD 12(R5),R1} \\
&       {\tt LOD 13(R5),R2} \\
&       {\tt LOD 14(R5),R3} \\
&       {\tt LOD 15(R5),R4} \\
&       {\tt MOV R0,uR10} \\
&       {\tt MOV R1,uR11} \\
&       {\tt MOV R2,uR12} \\
&       {\tt MOV R3,uCC} \\
&       {\tt MOV R4,uPC} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Example Restoring User Task Context}\label{tbl:context-in}
\end{center}\end{table}
        assuming as before that the task
        pointer is found in supervisor register {\tt R12}.
        As with storing the user context, the full code associated with
        restoring the user context can be found in the assembler include
        file, {\tt sys.i}.
 
\item Clear the userspace accounting registers.  In order to keep track of
        per--process system usage, these registers need to be cleared before
        reactivating the userspace process.  That way, upon the next
        interrupt, we'll know how many clocks the userspace program has
        consumed, and how many instructions it was able to issue in
        that many clocks.
 
\item Jump back to the instruction just before saving the last task's context,
        because that location in memory contains the return from interrupt
        command that we are going to need to execute, in order to guarantee
        that we return back here again.
\end{enumerate}
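
The mask arithmetic of Tbl.~\ref{tbl:pre-handler} can be checked on a host
machine.  The following C sketch models only the register arithmetic, not the
bus accesses.  It assumes that {\tt ROL} rotates left and that {\tt LDILO}
replaces the low 16~bits of its register while leaving the upper 16~bits
alone.  The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits, as assumed for the ROL instruction. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    return (v << n) | (v >> (32 - n));
}

/* Value the pre-handler stores back to the PIC, given the value it read. */
uint32_t pic_ack_word(uint32_t pic)
{
    uint32_t r1 = pic & 0x7fffu;        /* AND 0x07fff,R1: tripped lines      */
    r1 = rol32(r1, 16);                 /* ROL 16,R1: move into enable half   */
    r1 = (r1 & 0xffff0000u) | 0xffffu;  /* LDILO 0x0ffff,R1 (assumed meaning) */
    /* AND R7,R1: the upper half keeps lines that are tripped and enabled,
     * the lower half keeps every status bit that was set when read. */
    return r1 & pic;
}
```

With interrupts 0 and 1 enabled and only interrupt 0 tripped, the read value
{\tt 32'h80030001} produces {\tt 32'h00010001}: the master enable is cleared,
and interrupt 0 is both disabled and acknowledged.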
 
\chapter{Registers}\label{chap:regs}
 
The ZipSystem registers fall into two categories, ZipSystem internal registers
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
\begin{table}[htbp]
\begin{center}\begin{reglist}
PIC   & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT   & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
  & \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA  & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRB  & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRC  & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
JIFF  & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
MTASK  & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
MMSTL  & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
MPSTL  & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MICNT  & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
UTASK  & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL  & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UPSTL  & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT  & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL  & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMALEN  & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMASRC  & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMADST  & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
% Cache  & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
\end{reglist}
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
\begin{table}[htbp]
\begin{center}\begin{reglist}
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
\end{reglist}
\caption{Zip System Debug Registers}\label{tbl:dbgregs}
\end{center}\end{table}
 
\section{Peripheral Registers}
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
CPU's address space.  These may be accessed by the CPU at these addresses,
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
These registers will be discussed briefly again here.
 
The Zip CPU Interrupt controller has four different types of bits, as shown in
Tbl.~\ref{tbl:picbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Master Interrupt Enable\\\hline
30\ldots 16 & R/W & Interrupt Enables, write '1' to change\\\hline
15 & R & Current Master Interrupt State\\\hline
14\ldots 0 & R/W & Input Interrupt states, write '1' to clear\\\hline
\end{bitlist}
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
\end{center}\end{table}
The high order bit, or bit--31, is the master interrupt enable bit.  When this
bit is set, then any time an interrupt occurs the CPU will be interrupted and
will switch to supervisor mode, etc.
 
Bits 30~\ldots 16 are interrupt enable bits.  Should the interrupt line go
high while enabled, an interrupt will be generated.  To set an interrupt enable
bit, one needs to write the master interrupt enable while writing a `1' to
this bit.  To clear, one need only write a `0' to the master interrupt enable,
while leaving this line high.
 
Bits 14\ldots 0 are the current state of the interrupt vector.  Interrupt lines
trip when they go high, and remain tripped until they are acknowledged.  If
the interrupt goes high for longer than one pulse, it may be high when a clear
is requested.  If so, the interrupt will not clear.  The line must go low
again before the status bit can be cleared.
 
As an example, consider the following scenario where the Zip CPU supports four
interrupts, 3\ldots0.
\begin{enumerate}
\item The Supervisor will first, while in the interrupts disabled mode,
        write a {\tt 32'h800f000f} to the controller.  The supervisor may then
        switch to the user state with interrupts enabled.
\item When an interrupt occurs, the supervisor will switch to the interrupt
        state.  It will then cycle through the interrupt bits to learn which
        interrupt handler to call.
\item If the interrupt handler expects more interrupts, it will clear its
        current interrupt when it is done handling the interrupt in question.
        To do this, it will write a `1' to the low order interrupt mask,
        such as writing a {\tt 32'h80000001}.
\item If the interrupt handler does not expect any more interrupts, it will
        instead clear the interrupt from the controller by writing a
        {\tt 32'h00010001} to the controller.
\item Once all interrupts have been handled, the supervisor will write a
        {\tt 32'h80000000} to the interrupt register to re-enable interrupt
        generation.
\item The supervisor should also check the user trap bit, and possibly soft
        interrupt bits here, but this action has nothing to do with the
        interrupt control register.
\item The supervisor will then leave interrupt mode, possibly adjusting
        whichever task is running, by executing a return from interrupt
        command.
\end{enumerate}
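
The values written in the scenario above all follow the same pattern, which
can be captured in a few helpers.  This is a sketch only: the function names
are hypothetical, and the helpers simply build the words quoted in the list
from a mask of interrupt lines.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u   /* bit 31 */

/* Enable the lines in mask and clear any stale tripped state. */
uint32_t pic_enable(uint32_t mask)
{
    assert((mask & ~0x7fffu) == 0);     /* fifteen interrupt lines */
    return PIC_MASTER_ENABLE | (mask << 16) | mask;
}

/* Acknowledge (clear) tripped lines while leaving them enabled. */
uint32_t pic_ack(uint32_t mask)
{
    assert((mask & ~0x7fffu) == 0);
    return PIC_MASTER_ENABLE | mask;
}

/* Disable the lines in mask and clear their tripped state. */
uint32_t pic_disable(uint32_t mask)
{
    assert((mask & ~0x7fffu) == 0);
    return (mask << 16) | mask;
}
```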
 
Leaving the interrupt controller, we show the timer registers' bit definitions
in Tbl.~\ref{tbl:tmrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Auto-Reload\\\hline
30\ldots 0 & R/W & Current timer value\\\hline
\end{bitlist}
\caption{Timer Register Bits}\label{tbl:tmrbits}
\end{center}\end{table}
As you may recall, the timer just counts down to zero and then trips an
interrupt.  Writing to the current timer value sets that value, and reading
from it returns that value.  Writing to the current timer value while also
setting the auto--reload bit will send the timer into an auto--reload mode.
In this mode, upon setting its interrupt bit for one cycle, the timer will
also reset itself back to the value that was written to it when
the auto--reload option was set.  To clear and stop the timer,
simply write a `32'h00' to this register.
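
As a small illustration of the timer register layout, the following C sketch
builds the word to be written to a timer.  The helper name
{\tt timer\_word} is hypothetical, and the tick count in the example below is
an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>

#define TMR_AUTO_RELOAD 0x80000000u   /* bit 31 */

/* Build a timer register value from a tick count and a reload flag. */
uint32_t timer_word(uint32_t ticks, int auto_reload)
{
    assert(ticks <= 0x7fffffffu);     /* the value field is bits 30..0 */
    return ticks | (auto_reload ? TMR_AUTO_RELOAD : 0u);
}
```

For example, {\tt timer\_word(1000000, 1)} yields {\tt 32'h800f4240}, a
repeating timer of one million ticks, while {\tt timer\_word(0, 0)} yields
the zero word that clears and stops the timer.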
 
The Jiffies register is somewhat similar in that the register always changes.
In this case, the register counts up, whereas the timer always counted down.
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R & Current jiffy value\\\hline
31\ldots 0 & W & Value/time of next interrupt\\\hline
\end{bitlist}
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
\end{center}\end{table}
always return the time value contained in the register.  Writes greater than
the current Jiffy value, that is, where the new value minus the old value is
greater than zero while ignoring truncation, will set a new Jiffy interrupt
time.  At that time, the Jiffy vector will clear, and another interrupt time
may either be written to it, or it will just continue counting without
activating any more interrupts.
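
The comparison described above, a difference that is greater than zero while
ignoring truncation, is the usual wraparound--safe time comparison.  A minimal
C sketch follows.  The function name is hypothetical, and the conversion of the
unsigned difference to {\tt int32\_t} relies on two's complement behaviour,
which is implementation--defined in C but common in practice.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when `when` lies ahead of `now`, allowing for 32-bit wraparound. */
int jiffies_is_future(uint32_t now, uint32_t when)
{
    /* The subtraction wraps modulo 2^32; the sign of the result tells
     * whether `when` is ahead of (positive) or behind (negative) `now`. */
    return (int32_t)(when - now) > 0;
}
```

A time just after the counter wraps is thus still treated as lying in the
future of a time just before the wrap.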
 
The Zip CPU also supports several counter peripherals, mostly in the way of
process accounting.  These peripherals each have a single register associated
with them, shown in Tbl.~\ref{tbl:ctrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R/W & Current counter value\\\hline
\end{bitlist}
\caption{Counter Register Bits}\label{tbl:ctrbits}
\end{center}\end{table}
Writes to this register set the new counter value.  Reads read the current
counter value.
 
The current design operation of these counters is that of performance counting.
Two sets of four registers are available for keeping track of performance.
The first is a task counter.  This just counts clock ticks.  The second
counter is a stall counter, and the third a prefetch stall counter, following
the register order of Tbl.~\ref{tbl:zpregs}.  These
allow the CPU to be evaluated as to how efficient it is.  The fourth and
final counter is an instruction counter, which counts how many instructions the
CPU has issued.
 
It is envisioned that these counters will be used as follows: First, every time
a master counter rolls over, the supervisor (Operating System) will record
the fact.  Second, whenever activating a user task, the Operating System will
set the four user counters to zero.  When the user task has completed, the
Operating System will read the counters back off, to determine how much of the
CPU the process had consumed.
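
One way an operating system might record counter rollovers is to fold each
32--bit reading into a wider running total.  The sketch below is an
assumption about how this could be done, not code from the Zip CPU sources.
The structure and function names are hypothetical, and the approach only works
if the counter is read at least once every $2^{32}$ counts.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit hardware counter extended to 64 bits in software. */
struct wide_counter {
    uint32_t last;   /* previous raw reading          */
    uint64_t total;  /* accumulated count, 64 bits    */
};

/* Fold a fresh raw reading into the running total. */
void wide_counter_update(struct wide_counter *c, uint32_t now)
{
    assert(c != 0);
    /* The difference is taken modulo 2^32, so a single rollover
     * between two readings is absorbed automatically. */
    c->total += (uint32_t)(now - c->last);
    c->last = now;
}
```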
 
The final peripheral to discuss is the DMA controller.  This controller
has four registers.  Of these four, the length, source and destination address
registers should need no further explanation.  They are full 32--bit registers
specifying the entire transfer length, the starting address to read from, and
the starting address to write to.  The registers can be written to when the
DMA is idle, and read at any time.  The control register, however, will need
some more explanation.
 
The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R & DMA Active\\\hline
30 & R & Wishbone error, transaction aborted.  This bit is cleared the next time
        this register is written to.\\\hline
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline
28 & R/W & Set to '1' to prevent the controller from incrementing the
        destination address, '0' for normal memory copy. \\\hline
27 \ldots 16 & W & The DMA Key.  Write a 12'hfed to these bits to start or
        activate any DMA transfer.  \\\hline
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
        have yet to be written. \\\hline
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately
        upon receiving a valid key.\\\hline
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
9\ldots 0 & R/W & Intermediate transfer length minus one.  Thus, to transfer
        one item at a time set this value to 0. To transfer 1024 at a time,
        set it to 1023.\\\hline
\end{bitlist}
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
\end{center}\end{table}
This control register has been designed so that the common case of memory
access need only set the key and the transfer length.  Hence, writing a
\hbox{32'h0fed03ff} to the control register will start any memory transfer.
On the other hand, if you wished to read from a serial port (constant address)
and put the result into a buffer every time a word was available, you
might wish to write \hbox{32'h2fed8000}.  This assumes, of course, that you
have a serial port wired to the zero bit of this interrupt control.  (The
DMA controller does not use the interrupt controller, and cannot clear
interrupts.)  As a third example, if you wished to write to an external
FIFO anytime it was less than half full (had fewer than 512 items), and
interrupt line 2 indicated this condition, you might wish to issue a
\hbox{32'h1fed89ff} to this port.
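
The field layout of Tbl.~\ref{tbl:dmacbits} can be turned into a small encoder,
which is a convenient way to double check control words by hand.  This is a
sketch under the bit assignments of that table only.  The macro and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_KEY        0x0fedu       /* written to bits 27..16          */
#define DMA_FIXED_SRC  0x20000000u   /* bit 29: do not increment source */
#define DMA_FIXED_DST  0x10000000u   /* bit 28: do not increment dest   */
#define DMA_ON_INT     0x00008000u   /* bit 15: trigger on an interrupt */

/* Build a control word; chunk is the number of items moved per burst. */
uint32_t dma_ctrl(uint32_t flags, unsigned int_line, unsigned chunk)
{
    assert(int_line < 32);                /* bits 14..10 */
    assert(chunk >= 1 && chunk <= 1024);  /* stored as chunk - 1 in 9..0 */
    return flags | ((uint32_t)DMA_KEY << 16)
                 | ((uint32_t)int_line << 10)
                 | (uint32_t)(chunk - 1);
}
```

Here {\tt dma\_ctrl(0, 0, 1024)} gives {\tt 32'h0fed03ff}, and
{\tt dma\_ctrl(DMA\_FIXED\_SRC | DMA\_ON\_INT, 0, 1)} gives
{\tt 32'h2fed8000}.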
 
\section{Debug Port Registers}
Accessing the Zip System via the debug port isn't as straightforward as
accessing the system via the wishbone bus.  The debug port itself has been
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
Access to the Zip System begins with the Debug Control register, shown in
Tbl.~\ref{tbl:dbgctrl}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 14 & R & Reserved\\\hline
13 & R & CPU GIE setting\\\hline
12 & R & CPU is sleeping\\\hline
11 & W & Command clear PF cache\\\hline
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline
9 & R & Stall Status, '1' if CPU is busy\\\hline
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline
7 & R & Interrupt Request \\\hline
6 & R/W & Command RESET \\\hline
5\ldots 0 & R/W & Debug Register Address \\\hline
\end{bitlist}
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
\end{center}\end{table}
 
2238
The first step in debugging access is to determine whether or not the CPU
is halted, and to halt it if not.  To do this, first write a '1' to the
Command HALT bit.  This will halt the CPU and place it into debug mode.
Once the CPU is halted, the stall status bit will drop to zero.  Thus,
if bit 10 is high and bit 9 low, the debug port is open to examine the
internal state of the CPU.
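
The halt check just described can be sketched in a few lines of host-side
code.  The following Python fragment is only an illustration: the bit
positions come from Tbl.~\ref{tbl:dbgctrl}, but the names {\tt DBG\_HALT},
{\tt DBG\_STALL} and {\tt debug\_port\_open} are hypothetical and are not part
of any Zip CPU tool.
\begin{verbatim}
```python
# Bit positions are taken from the Debug Control Register table.  The
# constant and function names are illustrative assumptions only.
DBG_HALT = 1 << 10   # Command HALT (R/W)
DBG_STALL = 1 << 9   # Stall status, '1' if the CPU is busy (R)


def debug_port_open(ctrl):
    """Return True when bit 10 is high and bit 9 is low."""
    return (ctrl & DBG_HALT) != 0 and (ctrl & DBG_STALL) == 0
```
\end{verbatim}
A debugger built along these lines would write the halt bit once and then
re-read the control register until {\tt debug\_port\_open} returns true.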

At this point, the external debugger may examine internal state information
from within the CPU.  To do this, write to the command register again,
keeping the command halt bit high and placing the address of the internal
register of interest in the bottom 6~bits.  Internal registers that may be
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
\begin{table}\begin{center}
\begin{reglist}
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
uPC & 31 & 32 & R/W & User Program Counter\\\hline
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
TMRA & 36 & 32 & R/W & Timer A\\\hline
TMRB & 37 & 32 & R/W & Timer B\\\hline
TMRC & 38 & 32 & R/W & Timer C\\\hline
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
DMACMD & 48 & 32 & R/W & DMA command and status register\\\hline
DMALEN & 49 & 32 & R/W & DMA transfer length\\\hline
DMARD & 50 & 32 & R/W & DMA read address\\\hline
DMAWR & 51 & 32 & R/W & DMA write address\\\hline
\end{reglist}
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
\end{center}\end{table}
Primarily, these ``registers'' include access to the entire CPU register
set, as well as the internal peripherals.  To read one of these registers
once the address is set, simply issue a read from the data port.  To write
one of these registers or peripheral ports, simply write to the data port
after setting the proper address.

In this manner, all of the CPU's internal state may be read and adjusted.
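
The address-then-data access pattern can be illustrated with a small
software model.  The Python sketch below is not the Zip System's debug
interface: {\tt FakeDebugPort} is a hypothetical in-memory stand-in, the
ordering of the two port addresses (control at 0, data at 1) is an
assumption, and {\tt read\_reg} and {\tt write\_reg} are illustrative helper
names.  Only the halt bit and the 6-bit register address field are taken
from the tables above.
\begin{verbatim}
```python
DBG_HALT = 1 << 10   # Command HALT bit of the debug control register
SPC = 15             # supervisor program counter, from the address table


class FakeDebugPort:
    """Hypothetical in-memory model of the two-address debug port."""
    CTRL, DATA = 0, 1    # assumed ordering of the two addresses

    def __init__(self):
        self.ctrl = 0
        self.regs = [0] * 64    # one slot per 6-bit debug address

    def write(self, addr, value):
        if addr == self.CTRL:
            self.ctrl = value & 0xFFFFFFFF
        else:
            self.regs[self.ctrl & 0x3F] = value & 0xFFFFFFFF

    def read(self, addr):
        if addr == self.CTRL:
            return self.ctrl
        return self.regs[self.ctrl & 0x3F]


def read_reg(port, regaddr):
    """Select a register with halt held high, then read the data port."""
    port.write(port.CTRL, DBG_HALT | (regaddr & 0x3F))
    return port.read(port.DATA)


def write_reg(port, regaddr, value):
    """Select a register with halt held high, then write the data port."""
    port.write(port.CTRL, DBG_HALT | (regaddr & 0x3F))
    port.write(port.DATA, value)
```
\end{verbatim}
With this model, {\tt write\_reg(port, SPC, 0x1000)} followed by
{\tt read\_reg(port, SPC)} returns the value just written.  A real debugger
would replace the model's {\tt read} and {\tt write} methods with wishbone
accesses to the debug port.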

As an example of how to use this, consider what would happen in the case
of an external breakpoint.  If and when the CPU hits a breakpoint that
causes it to halt, the Command HALT bit will activate on its own.  The CPU
will then raise an external interrupt line and wait for a debugger to examine
its state.  After examining the state, the debugger will need to remove
the breakpoint by writing a different instruction into memory, and then write
to the command register while holding the clear cache, command halt, and
step CPU bits high (\hbox{32'hd00}).  The debugger may then replace the
breakpoint now that the CPU has gone beyond it, and clear the cache again
(\hbox{32'h500}).
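
The first command word in this example is simply the bitwise OR of three
bits from Tbl.~\ref{tbl:dbgctrl}.  A minimal Python sketch of that
arithmetic follows; the constant names and the {\tt command\_word} helper
are illustrative assumptions rather than part of any existing tool.
\begin{verbatim}
```python
# Bit positions from the Debug Control Register table; names are assumed.
DBG_CLEAR_CACHE = 1 << 11   # Command clear PF cache
DBG_HALT = 1 << 10          # Command HALT
DBG_STEP = 1 << 8           # Step Command


def command_word(*bits):
    """OR together the requested command bits into one control word."""
    word = 0
    for bit in bits:
        word |= bit
    return word


step_past_breakpoint = command_word(DBG_CLEAR_CACHE, DBG_HALT, DBG_STEP)
```
\end{verbatim}
Here {\tt step\_past\_breakpoint} equals {\tt 0xd00}, the value quoted for
holding the clear cache, command halt, and step CPU bits high.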

To leave this debug mode, simply write \hbox{32'h0} to the command register.

\chapter{Wishbone Datasheets}\label{chap:wishbone}
The Zip System supports two wishbone ports, a slave debug port and a master
port for the system itself.  These are shown in Tbl.~\ref{tbl:wishbone-slave}
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Slave, Read/Write, single words only \\\hline
Address Width & 1--bit \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
Signal Names & \begin{tabular}{ll}
                Signal Name & Wishbone Equivalent \\\hline
                {\tt i\_clk} & {\tt CLK\_I} \\
                {\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
                {\tt i\_dbg\_stb} & {\tt STB\_I} \\
                {\tt i\_dbg\_we} & {\tt WE\_I} \\
                {\tt i\_dbg\_addr} & {\tt ADR\_I} \\
                {\tt i\_dbg\_data} & {\tt DAT\_I} \\
                {\tt o\_dbg\_ack} & {\tt ACK\_O} \\
                {\tt o\_dbg\_stall} & {\tt STALL\_O} \\
                {\tt o\_dbg\_data} & {\tt DAT\_O}
                \end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
\end{center}\end{table}
and Tbl.~\ref{tbl:wishbone-master} respectively.
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
Address Width & 32--bit \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
Signal Names & \begin{tabular}{ll}
                Signal Name & Wishbone Equivalent \\\hline
                {\tt i\_clk} & {\tt CLK\_O} \\
                {\tt o\_wb\_cyc} & {\tt CYC\_O} \\
                {\tt o\_wb\_stb} & {\tt STB\_O} \\
                {\tt o\_wb\_we} & {\tt WE\_O} \\
                {\tt o\_wb\_addr} & {\tt ADR\_O} \\
                {\tt o\_wb\_data} & {\tt DAT\_O} \\
                {\tt i\_wb\_ack} & {\tt ACK\_I} \\
                {\tt i\_wb\_stall} & {\tt STALL\_I} \\
                {\tt i\_wb\_data} & {\tt DAT\_I}
                \end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
\end{center}\end{table}
I do not recommend that you connect these together through the interconnect.
Rather, the debug port of the CPU should be accessible regardless of the state
of the master bus.

You may wish to notice that neither the {\tt ERR} nor the {\tt RETRY} wires
have been implemented.  What this means is that the CPU is currently unable
to detect a bus error condition, and so may stall indefinitely (hang) should
it attempt to access an address not on the bus, or a peripheral that is not
yet properly configured.

\chapter{Clocks}\label{chap:clocks}

This core is based upon the Basys--3 development board sold by Digilent.
The Basys--3 development board contains one external 100~MHz clock, which is
sufficient to run the Zip CPU core.
\begin{table}[htbp]
\begin{center}
\begin{clocklist}
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
\end{clocklist}
\caption{List of Clocks}\label{tbl:clocks}
\end{center}\end{table}
I hesitate to suggest that the core can run faster than 100~MHz, since I have
struggled with various timing violations to keep it at 100~MHz.  So, for
now, I will only state that it can run at 100~MHz.


\chapter{I/O Ports}\label{chap:ioports}
The I/O ports to the Zip CPU may be grouped into three categories.  The first
is that of the master wishbone used by the CPU, then the slave wishbone used
to command the CPU via a debugger, and then the rest.  The first two of these
were already discussed in the wishbone chapter.  They are listed here
for completeness in Tbl.~\ref{tbl:iowb-master}
\begin{table}
\begin{center}\begin{portlist}
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
{\tt o\_wb\_addr}  & 32 & Output & Bus address \\\hline
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
and Tbl.~\ref{tbl:iowb-slave} respectively.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_dbg\_cyc}   &  1 & Input & Indicates an active Wishbone cycle\\\hline
{\tt i\_dbg\_stb}   &  1 & Input & WB Strobe signal\\\hline
{\tt i\_dbg\_we}    &  1 & Input & Write enable\\\hline
{\tt i\_dbg\_addr}  &  1 & Input & Bus address, command or data port \\\hline
{\tt i\_dbg\_data}  & 32 & Input & Data on WB write\\\hline
{\tt o\_dbg\_ack}   &  1 & Output  & Slave has completed a R/W cycle\\\hline
{\tt o\_dbg\_stall} &  1 & Output  & WB bus slave not ready\\\hline
{\tt o\_dbg\_data}  & 32 & Output  & Outgoing bus data\\\hline
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}

There are only four other lines to the CPU: the external clock, external
reset, incoming external interrupt line(s), and the outgoing debug interrupt
line.  These are shown in Tbl.~\ref{tbl:ioports}.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
{\tt i\_rst} & 1 & Input &  Active high reset line \\\hline
{\tt i\_ext\_int} & 1\ldots 6 & Input &  Incoming external interrupts \\\hline
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}.  We
typically run it at 100~MHz.  The reset line is an active high reset.  When
asserted, the CPU returns to its reset address in memory.  Whether it then
starts running automatically depends upon how the CPU is configured, and
specifically upon the {\tt START\_HALTED} parameter.  The {\tt i\_ext\_int}
line is for an external interrupt.  This line may be as wide as 6~external
interrupts, depending upon the setting of the {\tt EXTERNAL\_INTERRUPTS}
parameter.  As currently configured, the Zip System only supports one such
interrupt line by default.  For us, this line is the output of another
interrupt controller, but that's a board specific setup detail.  Finally, the
Zip System produces one external interrupt whenever the CPU halts to wait for
the debugger.

\chapter{Initial Assessment}\label{chap:assessment}

Having now worked with the Zip CPU for a while, it is worth offering an
honest assessment of how well it works and how well it was designed. At the
end of this assessment, I will propose some changes that may take place in a
later version of this Zip CPU to make it better.

\section{The Good}
\begin{itemize}
\item The Zip CPU is lightweight and fully featured as it exists today. For
        anyone who wishes to build a general purpose CPU and then to
        experiment with building and adding particular features, the Zip CPU
        makes a good starting point--it is fairly simple. Modifications should
        be simple enough.
\item The Zip CPU was designed to be an implementable soft core that could be
        placed within an FPGA, controlling actions internal to the FPGA. It
        fits this role rather nicely. It does not fit the role of a system on
        a chip very well, but then it was never intended to be a system on a
        chip but rather a system within a chip.
\item The extremely simplified instruction set of the Zip CPU was a good
        choice. Although it does not have many of the commonly used
        instructions, PUSH, POP, JSR, and RET among them, the simplified
        instruction set has demonstrated an amazing versatility. I will
        therefore contend, to anyone who will listen, that this instruction
        set offers a full and complete capability for whatever a user might
        wish to do, with two exceptions: bytewise character access and
        accelerated floating-point support.
\item This simplified instruction set is easy to decode.
\item The simplified bus transactions (32-bit words only) were also very easy
        to implement.
\item The pipelined load/store approach is novel, and can be used to greatly
        increase the speed of the processor.
\item The novel approach of having a single interrupt vector, which just
        brings the CPU back to the instruction it left off at within the last
        interrupt context, doesn't appear to have been that much of a problem.
        If most modern systems handle interrupt vectoring in software anyway,
        why maintain hardware support for it?
\item My goal of a high rate of instructions per clock may not be the proper
        measure. For example, if instructions are being read from a SPI flash
        device, such as is common among FPGA implementations, these same
        instructions may suffer stalls of between 64 and 128 cycles per
        instruction just to read the instruction from the flash. Executing the
        instruction in a single clock cycle is no longer the appropriate
        measure. At the same time, it should be possible to use the DMA
        peripheral to copy instructions from the flash to a temporary memory
        location, after which they may again be executed at one instruction
        per clock.
\end{itemize}
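
To put a rough number on the SPI flash example in the last item above,
assume (purely for illustration) that each instruction costs its fetch stall
plus one execution clock, and that the core runs from the 100~MHz system
clock.  The resulting instruction rate then lies between
\[
\frac{100\times 10^{6}}{128+1} \approx 0.78\times 10^{6}
\quad\mbox{and}\quad
\frac{100\times 10^{6}}{64+1} \approx 1.5\times 10^{6}
\]
instructions per second, against the $100\times 10^{6}$ that one instruction
per clock would give.  Under these assumptions the fetch path, not the
execution pipeline, sets the instruction rate.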

\section{The Not so Good}
\begin{itemize}
\item While one of the stated goals was to use a small amount of logic,
        3k~LUTs isn't that impressively small. Indeed, it's really much
        too expensive when compared against other 8 and 16-bit CPUs that have
        fewer than 1k~LUTs.

        Still, \ldots it's not bad, it's just not astonishingly good.

\item The fact that the instruction width equals the bus width means that the
        instruction fetch cycle will always be interfering with any load or
        store memory operation, with the only exception being if the
        instruction is already in the cache.

        This could be fixed in one of three ways: the instruction set
        architecture could be modified to handle Very Long Instruction Words
        (VLIW) so that each 32--bit word would encode two or more instructions,
        the instruction fetch bus width could be increased from 32--bits to
        64--bits or more, or the instruction bus could be separated from the
        data bus.  Any and all of these approaches would increase the overall
        LUT count.
 
\item The (nonexistent) floating point unit was an afterthought, isn't even
        built as a potential option, and most likely won't support the full
        IEEE standard set of FPU instructions--even for single precision.
        This (nonexistent) capability would benefit the most from an
        out-of-order execution capability, which the Zip CPU does not have.

        Still, sharing FPU registers with the main register set was a good
        idea and worth preserving, as it simplifies context swapping.

        Perhaps this really isn't a problem, but rather a feature.  By not
        implementing FPU instructions, the Zip CPU maintains a lower LUT count
        than it would have if it did implement these instructions.

\item The CPU has no character support. This is both good and bad.
        Realistically, the CPU works just fine without it. Characters can be
        supported as subsets of 32-bit words without any problem. Practically,
        though, it will make compiling non-Zip CPU code difficult--especially
        anything that assumes sizeof(int)=4*sizeof(char), or that tries to
        create unions with characters and integers and then attempts to
        reference the address of the characters within that union.
 
\item The Zip CPU does not support a data cache. One can still be built
        externally, but this is a limitation of the CPU proper as built.
        Further, under the theory of the Zip CPU design (that of an embedded
        soft-core processor within an FPGA, where any ``address'' may reference
        either memory or a peripheral that may have side-effects), any data
        cache would need to be based upon an initial knowledge of whether or
        not it is supporting memory (cacheable) or peripherals. This knowledge
        must exist somewhere, and that somewhere is currently (and by design)
        external to the CPU.

        This may also be written off as a ``feature'' of the Zip CPU, since
        the addition of a data cache can greatly increase the LUT count of
        a soft core.

        The Zip CPU compensates for this via its pipelined load and store
        instructions.

\item Many other instruction sets offer three operand instructions, whereas
        the Zip CPU only offers two operand instructions. This means that it
        takes the Zip CPU more instructions to do many of the same operations.
        The good part of this is that it gives the Zip CPU a greater amount of
        flexibility in its immediate operand mode, although that increased
        flexibility isn't necessarily as valuable as one might like.
 
\item The Zip CPU does not currently detect and trap on either illegal
        instructions or bus errors.  Attempts to access non--existent
        memory quietly return erroneous results, rather than halting the
        process (user mode) or halting or resetting the CPU (supervisor mode).

\item The Zip CPU doesn't support out of order execution. I suppose it could
        be modified to do so, but then it would no longer be the ``simple''
        and low LUT count CPU it was designed to be. The three primary results
        are that 1) loads may unnecessarily stall the CPU, even if other
        things could be done while waiting for the load to complete, 2)
        bus errors on stores will never be caught at the point of the error,
        and 3) branch prediction becomes more difficult.

\item Although switching to an interrupt context in the Zip CPU design doesn't
        require a tremendous swapping of registers, in reality it still
        does--since any task swap still requires saving and restoring all
        16~user registers. That's a lot of memory movement just to service
        an interrupt.

\item The Zip CPU is by no means generic: it will never handle addresses
        larger than 32~bits (16GB, since each address names a 32-bit word)
        without a complete and total redesign.
        This may limit its utility as a generic CPU in the future, although
        as an embedded CPU within an FPGA this isn't really much of a limit
        or restriction.

\item While the Zip CPU has its own assembler, it has no linker and does not
        (yet) support a compiler. The standard C library is an even longer
        shot. My dream of having binutils and gcc support has not been
        realized and at this rate may not be realized. (I've been intimidated
        by the challenge every time I've looked through that code.)
 
\iffalse
\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle
        execution, the Zip CPU is unable to exploit this parallelism. Instead,
        apart from the DMA and the pipelined prefetch, all loads and stores
        are single wishbone bus operations requiring a minimum of 3 clocks.
        (In practice, this has turned into 7-clocks.)
        % Addressed, 20150929

\item There is no control over whether or not an instruction sets the
        condition codes--certain instructions always set the condition codes,
        other instructions never set them. This effectively limits conditional
        instructions to a single instruction only (with two or more
        instructions as an exception), as the first instruction that sets
        condition codes will break the condition code chain.

        {\em (A proposed change below addresses this.)}

\item Using the CC register as a trap address was a bad idea--it limits the CC
        register's ability to be used in future expansion, such as by adding
        exception indication flags: bus error, floating point exception, etc.

        {\em (This can be changed by a different O/S implementation of the trap
        instruction.)}
\item The current implementation suffers from too many stalls on any
        branch--even if the branch is known early on.

        {\em (This is addressed in proposals below.)}
        % Addressed, 20150918

\item In a similar fashion, a switch to interrupt context forces the pipeline
        to be cleared, whereas it might make more sense to just continue
        executing the instructions already in the pipeline while the prefetch
        stage is working on switching to the interrupt context.

        {\em (Also addressed in proposals below.)}
        % This should happen so rarely that it is not really a problem
\fi

\end{itemize}
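
The remark above that characters can be supported as subsets of 32-bit words
comes down to shifts and masks.  The following Python sketch illustrates the
idea for a single word.  The helper names, and the choice to keep the first
character in the most significant byte, are assumptions made for
illustration only and do not describe any Zip CPU toolchain convention.
\begin{verbatim}
```python
def pack_chars(text):
    """Pack up to four 8-bit characters into one 32-bit word."""
    word = 0
    # Pad short strings with NUL so that exactly four bytes are packed.
    for ch in text[:4].ljust(4, "\0"):
        word = (word << 8) | (ord(ch) & 0xFF)
    return word


def get_char(word, index):
    """Extract character `index`; index 0 is the most significant byte."""
    shift = 8 * (3 - index)
    return chr((word >> shift) & 0xFF)


def set_char(word, index, ch):
    """Return `word` with character `index` replaced by `ch`."""
    shift = 8 * (3 - index)
    mask = 0xFF << shift
    return (word & ~mask & 0xFFFFFFFF) | ((ord(ch) & 0xFF) << shift)
```
\end{verbatim}
Each character access therefore costs a word load plus a shift and a mask,
and each character store costs a read-modify-write of the whole word.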

\section{The Next Generation}
This section could also be labeled as my ``To do'' list.

Given the feedback listed above, perhaps it's time to consider what changes
could be made to improve the Zip CPU in the future. I offer the following as
proposals:

\begin{itemize}
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the
        proposals below will only increase the amount of logic the Zip CPU
        requires.  While I expect that the Zip CPU will always be somewhat
        of a lightweight, it will never be the smallest kid on the block.

        I'm actually struggling with this idea.  The whole goal of the Zip
        CPU was to be lightweight.  Wouldn't it make more sense to create and
        maintain options whereby it would remain lightweight?  For example, if
        the process accounting registers are anything but lightweight, why
        keep them?  Why not instead make some compile flags that just turn them
        off, keeping the CPU lightweight?  The same holds for the prefetch
        cache.

\item The `{\tt .V}' condition was never used in any code other than my test
        code.  I suggest changing it to a `{\tt .LE}' condition, which seems
        to be more useful.

\item {\bf Consider a more traditional Instruction Cache.}  The current
        pipelined instruction cache just reads a window of memory into
        its cache.  If the CPU leaves that window, the entire cache is
        invalidated.  A more traditional cache, however, might allow
        common subroutines to stay within the cache without invalidating the
        entire cache structure.

\iffalse
\item {\bf Adjust the Zip CPU so that conditional instructions do not set
        flags}, although they may explicitly set condition codes if writing
        to the CC register.

        This is a simple change to the core, and may show up in new releases.
        % Fixed, 20150918

\item Add in an {\bf unpredictable branch delay slot}, so that on any branch
        the delay slot may or may not be executed before the branch.
        Instructions that do not depend upon the branch, and that should be
        executed were the branch not taken, could be placed into the delay
        slot. Thus, if the branch isn't taken, we wouldn't suffer the stall,
        whereas it wouldn't affect the timing of the branch if taken. It would
        just do something irrelevant.

        % Changes made, 20150918, make this option no longer relevant

\item {\bf Re-engineer Branch Processing.}  There's no reason why a {\tt BRA}
        instruction should create five stall cycles.  The decode stage, plus
        the prefetch engine, should be able to drop this number of stalls via
        better branch handling.

        Indeed, this could turn into a simple means of branch prediction:
        if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C}
        suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would
        be faster than a {\tt BRA.C} instruction.  This would then allow a
        compiler to do explicit branch optimizations.

        Of course, longer branches using {\tt ADD X,PC} would still not be
        optimized.

        % DONE: 20150918 -- to include the ADD X,PC instructions

\item {\bf Request bus access for Load/Store two cycles earlier.}  The problem
        here is the contention for the bus between the memory unit and the
        prefetch unit.  Currently, the memory unit must ask the prefetch
        unit to release the bus if it is in the middle of a bus cycle.  At this
        point, the prefetch drops the {\tt STB} line on the next clock and must
        then wait for the last {\tt ACK} before releasing the bus.  If the
        request takes one clock, dropping the strobe line takes the next,
        waiting for an acknowledgement takes another, and then the bus must be
        idle for one cycle before starting again, these extra cycles add up.
        It should be possible to tell the prefetch stage to give up the bus
        as soon as the decoder knows the instruction will need the bus.
        Indeed, if done in the decode stage, this might drop the seven cycle
        access down by two cycles.
        % FIXED: 20150918

\item {\bf Very Long Instruction Word (VLIW).}  Now, to speed up operation, I
        propose that the Zip CPU instruction set be modified towards a Very
        Long Instruction Word (VLIW) implementation. In this implementation,
        an instruction word may contain either one or two separate
        instructions. The first instruction would take up the high order bits,
        the second optional instruction the lower 16-bits. Further, I propose
        that any of the ALU instructions (SUB through LSR) automatically have
        a second instruction whenever their operand `B' is a register, and use
        the full 20-bit immediate if not. This will effectively eliminate the
        register plus immediate mode for all of these instructions.

        This is the minimal required change to increase the number of
        instructions per clock cycle. Other changes would need to take place
        as well to support this. These include:
        \begin{itemize}
        \item Instruction words containing two instructions would take two
                clocks to complete, while requiring only a single cycle
                instruction fetch.
        \item Instructions preceded by a label in the assembler must always
                start in the high order word.
        \item VLIW's, once started, must always execute to completion. The
                upper word may set the PC, the lower word may not. Regardless
                of whether the upper word sets the PC, the lower word must
                still be guaranteed to complete before the PC changes. On any
                switch to (or from) interrupt context, both instructions must
                complete or none of the instructions in the word shall
                complete prior to the switch.
        \item STEP commands and BREAK instructions will only take place after
                the entire word is executed.
        \end{itemize}

        If done well, the assembler should be able to handle these changes
        with the biggest impacts to the user being increased performance and
        a loss of the register plus immediate ALU modes. (These weren't really
        relevant for the XOR, OR, AND, etc. operations anyway.) Machine code
        compatibility will not be maintained.

        A proposed secondary instruction set might consist of: a 4-bit
        opcode (any of the prior instructions would be supported, with some
        exceptions such as moves to and from user registers while in
        supervisor mode not being supported), a 4-bit register result (PC not
        allowed), a 3-bit conditional (identical to the conditional for the
        upper word), a single bit for whether or not an immediate is present,
        followed by either a 4-bit register or a 4-bit signed
        immediate. The multiply instruction would steal the immediate flag to
        be used as a sign indication, forcing both operands to be registers
        without any immediate offsets.

        {\em Initial conversion of several library functions to this secondary
        instruction set has demonstrated little to no gain.   The problem was
        that the new instruction set was made by joining a rarely used
        instruction (ALU with register and not immediate) with a more common
        instruction.  The utility was then limited by the utility of the rare
        instruction, which limited the impact of the entire approach.  }
\else
\item {\bf Very Long Instruction Word (VLIW).}  The goal here would be to
        create a new instruction set whereby two instructions would be encoded
        in each 32--bit word.  While this may speed up
        CPU operation, it would necessitate an instruction set redesign.
\fi

\end{itemize}


% Appendices
% Index
\end{document}
 
%
%
% Symbol table relocation types:
%
% Only 3-types of instructions truly need relocations: those that modify the
% PC register, and those that access memory.
%
% -     LDI     Addr,Rx         // Loads an absolute address into Rx, 24 bits
%
% -     LDILO   Addr,Rx         // Loads an absolute address into Rx, 32 bits
%       LDIHI   Addr,Rx         //   requires two instructions
%
% -     JMP     Rx              // Jump to any address in Rx
%                       // Can be prefixed with two instructions to load Rx
%                       // from any 32-bit immediate
% -     JMP     #Addr           // Jump to any 24'bit (signed) address, 23'b uns
%
% -     ADD     x,PC            // Any PC relative jump (20 bits)
%
% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)
%
% -     LDIHI   Addr,Rx         // Load from any 32-bit address, clobbers Rx,
%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction
%
% -     LOD.C   Addr(Ry),Rx     // Any 16-bit relative address load, poss. cond
%
% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid
%
% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table
%       BRA     +1              // memory address.  The BRA +1 can be skipped,
%       .WORD   Addr            // but only if the address is placed at the end
%       LOD     -2(PC),PC       // of an executable section
%
