%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename:    spec.tex
%%
%% Project:     Zip CPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose:     This LaTeX file contains all of the documentation/description
%%              currently provided with this Zip CPU soft core.  It supersedes
%%              any information about the instruction set or CPUs found
%%              elsewhere.  It's not nearly as interesting, though, as the PDF
%%              file it creates, so I'd recommend reading that before diving
%%              into this file.  You should be able to find the PDF file in
%%              the SVN distribution together with this LaTeX file and a copy
%%              of the GPL-3.0 license this file is distributed under.  If not,
%%              just type 'make' in the doc directory and it (should) build
%%              without a problem.
%%
%%
%% Creator:     Dan Gisselquist
%%              Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program.  (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.)  If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License:     GPL, v3, as defined and found on www.gnu.org,
%%              http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{gqtekspec}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.4}
\definecolor{webred}{rgb}{0.2,0,0}
\definecolor{webgreen}{rgb}{0,0.2,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
        anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
        pdfauthor={Dan Gisselquist},
        pdfsubject={Zip CPU}]{hyperref}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the Zip CPU. ARM processors are
good and effective. Xilinx makes and markets Microblaze, Altera Nios, and both
have better toolsets than the Zip CPU will ever have. OpenRISC is also
available, and RISC--V may be replacing it. Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it, though. There's a lot that I would like to do with a
processor, and I want to be able to do it in a vendor independent fashion.
First, I would like to be able to place this processor inside an FPGA.  Without
paying royalties, ARM is out of the question.  I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on both Xilinx and Altera chips, and that can be
easily ported from one manufacturer's chipsets to another. Even more, before
purchasing a chip or a board, I would like to know that my soft core works. I
would like to build a test bench to test components with, and Verilator is my
chosen test bench. This forces me to use all Verilog, and it prevents me from
using any proprietary cores. For this reason, Microblaze and Nios are out of
the question.

Why not OpenRISC? That's a hard question. The OpenRISC team has done some
wonderful work on an amazing processor, and I'll have to admit that I am
envious of what they've accomplished. I would like to port binutils to the
Zip CPU, as I would like to port GCC and GDB. They are way ahead of me. The
OpenRISC processor, however, is complex and hefty at about 4,500 LUTs. It has
a lot of features of modern CPUs within it that ... well, let's just say it's
not the little guy on the block. The Zip CPU is lighter weight, costing only
about 2,300 LUTs with no peripherals, and 3,200 LUTs with some very basic
peripherals.

My final reason is that I'm building the Zip CPU as a learning experience. The
Zip CPU has allowed me to learn a lot about how CPUs work on a very micro
level. For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is
lightweight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}


The original goal of the Zip CPU was to be a very simple CPU.   You might
think of it as a poor man's alternative to the OpenRISC architecture.
For this reason, all instructions have been designed to be as simple as
possible, and to execute in one clock cycle each, barring pipeline stalls.
Indeed, even the bus has been simplified
to a constant 32-bit width, with no option for more or less.  This has
resulted in the choice to drop push and pop instructions, pre-increment and
post-decrement addressing modes, and more.

For those who like buzz words, the Zip CPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32~bits wide, addresses are 32~bits
                wide, instructions are 32~bits wide, etc.
\item A RISC CPU.  There is no microcode for executing instructions.  All
        instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture.  (Only load and store instructions
                can access memory.)
\item Wishbone compliant.  All peripherals are accessed just like
                memory across this bus.
\item A Von Neumann architecture.  (The instructions and data share a
                common bus.)
\item A pipelined architecture, having stages for {\bf Prefetch},
                {\bf Decode}, {\bf Read-Operand}, the {\bf ALU/Memory}
                unit, and {\bf Write-back}.  See Fig.~\ref{fig:cpu}
                for a diagram of this structure.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
\item Completely open source, licensed under the GPL.\footnote{Should you
        need a copy of the Zip CPU licensed under other terms, please
        contact me.}
\end{itemize}

Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped.  Worse, I've had to adjust to create
capabilities that I was never expecting to need.  These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
        still necessary to debug this CPU.  That means that there needs to be
        an external register that can control the CPU: reset it, halt it, step
        it, and tell whether it is running or not.  My chosen interface
        includes a second register similar to this control register.  This
        second register allows the external controller or debugger to examine
        registers internal to the CPU.

\item {\bf Internal Debug:} Being able to run a debugger from within
        a user process requires an ability to step a user process from
        within a debugger.  It also requires a break instruction that can
        be substituted for any other instruction, and substituted back.
        The break is actually difficult: the break instruction cannot be
        allowed to execute.  That way, upon a break, the debugger should
        be able to jump back into the user process to step the instruction
        that would've been at the break point initially, and then to
        replace the break after passing it.

        Incidentally, this break messes with the prefetch cache and the
        pipeline: if you change an instruction partially through the pipeline,
        the whole pipeline needs to be cleansed.  Likewise, if you change
        an instruction in memory, you need to make sure the cache is reloaded
        with the new instruction.

\item {\bf Prefetch Cache:} My original implementation had a very
        simple prefetch stage.  Any time the PC changed the prefetch would go
        and fetch the new instruction.  While this was perhaps the simplest
        approach, it cost roughly five clocks for every instruction.  This
        was deemed unacceptable, as I wanted a CPU that could execute
        instructions in one cycle.  I therefore have a prefetch cache that
        issues pipelined Wishbone accesses to memory and then pushes
        instructions at the CPU.  Sadly, this accounts for about 20\% of the
        logic in the entire CPU, or 15\% of the logic in the entire system.

\item {\bf Operating System:} In order to support an operating system,
        interrupts and so forth, the CPU needs to support supervisor and
        user modes, as well as a means of switching between them.  For example,
        the user needs a means of executing a system call.  This is the
        purpose of the {\bf `trap'} instruction.  This instruction needs to
        place the CPU into supervisor mode (here equivalent to disabling
        interrupts), as well as handing it a parameter such as identifying
        which O/S function was called.

My initial approach to building a trap instruction was to create an external
peripheral which, when written to, would generate an interrupt and could
return the last value written to it.  In practice, this approach didn't work
at all: the CPU executed two instructions while waiting for the
trap interrupt to take place.  Since then, I've decided to keep the rest of
the CC register for that purpose so that a write to the CC register, with the
GIE bit cleared, could be used to execute a trap.  This has other problems,
though, primarily in the limitation of the uses of the CC register.  In
particular, the CC register is the best place to put CPU state information and
to ``announce'' special CPU features (floating point, etc).  So the trap
instruction still switches to interrupt mode, but the CC register is not
nearly as useful for telling the supervisor mode processor what trap is being
executed.

Modern timesharing systems also depend upon a {\bf Timer} interrupt
to handle task swapping.  For the Zip CPU, this interrupt is handled
external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
The timer module itself is found in {\tt ziptimer.v}.

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
        stalls at all, but rather to require the compiler to properly schedule
        all instructions so that stalls would never be necessary.  After trying
        to build such an architecture, I gave up, having learned some things:

        For example, in order to facilitate interrupt handling and debug
        stepping, the CPU needs to know what instructions have finished, and
        which have not.  In other words, it needs to know where it can restart
        the pipeline from.  Once restarted, it must act as though it had
        never stopped.  This killed my idea of delayed branching, since what
        would be the appropriate program counter to restart at?  The one the
        CPU was going to branch to, or the ones in the delay slots?  This
        also makes the idea of compressed instruction codes difficult, since,
        again, where do you restart on interrupt?

        So I switched to a model of discrete execution: Once an instruction
        enters into either the ALU or memory unit, the instruction is
        guaranteed to complete.  If the logic recognizes a branch or a
        condition that would render the instruction entering into this stage
        possibly inappropriate (e.g., a conditional branch preceding a store
        instruction), then the pipeline stalls for one cycle
        until the conditional branch completes.  Then, if it generates a new
        PC address, the stages preceding are all wiped clean.

        The discrete execution model allows such things as sleeping: if the
        CPU is put to ``sleep,'' the ALU and memory stages stall and back up
        everything before them.  Likewise, anything that has entered the ALU
        or memory stage when the CPU is placed to sleep continues to completion.
        To handle this logic, each pipeline stage has three control signals:
        a valid signal, a stall signal, and a clock enable signal.  In
        general, a stage stalls if its contents are valid and the next step
        is stalled.  This allows the pipeline to fill any time a later stage
        stalls.

        This approach is also different from other pipeline approaches.  Instead
        of keeping the entire pipeline filled, each stage is treated
        independently.  Therefore, individual stages may move forward as long
        as the subsequent stage is available, regardless of whether the stage
        behind it is filled.

\item {\bf Verilog Modules:} When examining how other processors worked
        here on OpenCores, many of them had one separate module per pipeline
        stage.  While this appeared to me to be a fascinating and commendable
        idea, my own implementation didn't work out quite so nicely.

        As an example, the decode module produces a {\em lot} of
        control wires and registers.  Creating a module out of this, with
        only the simplest of logic within it, seemed to be more a lesson
        in passing wires around, rather than encapsulating logic.

        Another example was the register writeback section.  I would love
        this section to be a module in its own right, and many have made them
        such.  However, other modules depend upon writeback results other
        than just what's placed in the register (i.e., the control wires).
        For these reasons, I didn't manage to fit this section into its
        own module.

        The result is that the majority of the CPU code can be found in
        the {\tt zipcpu.v} file.
\end{itemize}
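
The stall rule given under {\bf Pipeline Stalls} above (a stage stalls if its
contents are valid and the next stage is stalled) can be illustrated with a
small software model.  This is only a sketch and is not taken from
{\tt zipcpu.v}: the function name {\tt stage\_stalls} and the
{\tt sink\_stalled} argument are hypothetical, and the clock enable signal is
left out entirely.

\begin{verbatim}
```python
def stage_stalls(valid, sink_stalled):
    """Return one stall flag per pipeline stage, first stage first.

    valid        -- list of booleans, True where a stage holds an instruction
    sink_stalled -- True if whatever follows the last stage cannot accept
                    a result this cycle (hypothetical input for this sketch)
    """
    stalls = []
    next_stalled = sink_stalled
    # Walk from the last stage back to the first: a stage stalls only if
    # its own contents are valid AND the stage after it is stalled.
    for stage_valid in reversed(valid):
        stalled = stage_valid and next_stalled
        stalls.append(stalled)
        next_stalled = stalled
    stalls.reverse()
    return stalls


# Five stages with a bubble (an invalid stage) in the second slot,
# while the end of the pipeline is held up.
print(stage_stalls([True, False, True, True, True], sink_stalled=True))
```
\end{verbatim}

In this example the last three stages stall, but the bubble in the second
stage does not, so the first stage remains free to advance.  That is the sense
in which the pipeline can fill behind a stalled stage.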

With that introduction out of the way, let's move on to the instruction
set.

\chapter{CPU Architecture}\label{chap:arch}

The Zip CPU supports a set of two--operand instructions, where the second
operand (always a register) is the result.  The only exception is the store
instruction, where the first operand (always a register) is the source of the
data to be stored.

\section{Simplified Bus}
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
It has been simplified in this fashion: all operations are 32--bit operations.
Because every access is a full word, the bus is neither little endian nor big
endian.  All words are 32--bits wide, as are all instructions.  Everything has
been built around the 32--bit word.

\section{Register Set}
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/regset.eps}
\caption{Zip CPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used in interrupt mode when interrupts are disabled,
whereas the user set is used otherwise.  Of this register set, the Program
Counter (PC) is register 15, whereas the status register (SR) or condition
code register (CC) is register 14.  By convention, the stack pointer will be
register 13 and noted as (SP), although there is nothing special about this
register other than this convention.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.

The status register is special, and bears further mention.  As shown in
Tbl.~\ref{tbl:cc-register}, the lower 11~bits of the status register form
a set of CPU state and condition codes.  Writes to other bits of this register
are preserved.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 11 & R/W & Reserved for future uses\\\hline
10 & R & (Reserved for) Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
8 & R & (Reserved for) Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative.  The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}

Of the condition codes, the bottom four bits are the current flags:
                Zero (Z),
                Carry (C),
                Negative (N),
                and Overflow (V).

The fifth bit is a clock enable (0 to enable) or sleep bit (1 to put
        the CPU to sleep).  Setting this bit will cause the CPU to
        wait for an interrupt (if interrupts are enabled), or to
        completely halt (if interrupts are disabled).

The sixth bit is a global interrupt enable bit (GIE).  When this
        sixth bit is a `1' interrupts will be enabled, else disabled.  When
        interrupts are disabled, the CPU will be in supervisor mode, otherwise
        it is in user mode.  Thus, to execute a context switch, one only
        need enable or disable interrupts.  (When an interrupt line goes
        high, interrupts will automatically be disabled, as the CPU goes
        and deals with its context switch.)  Special logic has been added to
        keep the user mode from setting the sleep bit and clearing the
        GIE bit at the same time, with clearing the GIE bit taking
        precedence.

The seventh bit is a step bit.  This bit can be
        set from supervisor mode only.  After setting this bit, should
        the supervisor mode process switch to user mode, it would then
        execute one instruction in user mode before returning to supervisor
        mode.  Then, upon return to supervisor mode, this bit will
        be automatically cleared.  This bit has no effect on the CPU while in
        supervisor mode.

        This functionality was added to allow a userspace debugger
        to step a user process, working through supervisor mode
        of course.

The eighth bit is a break enable bit.  This controls whether a break
instruction in user mode will halt the processor for an external debugger
(break enabled), or whether the break instruction will simply send the
CPU into interrupt mode.  Encountering a break in supervisor mode will
halt the CPU independent of the break enable bit.  This bit can only be set
within supervisor mode.

% Should break enable be a supervisor mode bit, while the break enable bit
% in user mode is a break has taken place bit?
%

This functionality was added to enable an external debugger to
        set and manage breakpoints.

The ninth bit is reserved for an illegal instruction bit.  When the CPU
tries to execute either a nonexistent instruction, or an instruction from
an address that produces a bus error, the CPU will (once implemented) switch
to supervisor mode while setting this bit.  The bit will automatically be
cleared upon any return to user mode.

The tenth bit is a trap bit.  It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command.  This allows the
supervisor, in supervisor mode, to determine whether it got to supervisor
mode from a trap or from an external interrupt or both.

These status register bits are summarized in Tbl.~\ref{tbl:ccbits}.
\begin{table}
\begin{center}
\begin{tabular}{l|l}
Bit & Meaning \\\hline
10 & (Reserved for) Bus error flag \\\hline
9 & Soft trap, set on a trap from user mode, cleared when returning to user mode\\\hline
8 & (Reserved for) Illegal instruction flag \\\hline
7 & Halt on break, to support an external debugger \\\hline
6 & Step, single step the CPU in user mode\\\hline
5 & GIE, or Global Interrupt Enable \\\hline
4 & Sleep \\\hline
3 & V, or overflow bit.\\\hline
2 & N, or negative bit.\\\hline
1 & C, or carry bit.\\\hline
0 & Z, or zero bit.\\\hline
\end{tabular}
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
\end{center}\end{table}
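
For readers who find code easier to follow than bit tables, the status
register layout can be sketched in software.  The names below
({\tt CC\_BITS}, {\tt decode\_cc}, {\tt cpu\_mode}, {\tt cpu\_activity}) are
hypothetical and do not come from the Zip CPU sources.  Bit~0 is taken to be
the Zero flag, following the order in which the prose lists the bottom four
flags, and bit~8 follows Tbl.~\ref{tbl:cc-register} and the prose in treating
it as the (reserved) illegal instruction flag.

\begin{verbatim}
```python
# Bit positions within the CC/status register.  Bit 0 = Z is an assumption
# drawn from the prose; bits 8 and 10 are marked "reserved" in the text.
CC_BITS = {
    "Z": 0, "C": 1, "N": 2, "V": 3,
    "SLEEP": 4, "GIE": 5, "STEP": 6, "BREAK_ENABLE": 7,
    "ILLEGAL": 8, "TRAP": 9, "BUS_ERROR": 10,
}


def decode_cc(value):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((value >> bit) & 1) for name, bit in CC_BITS.items()}


def cpu_mode(value):
    """GIE set means user mode; GIE clear means supervisor mode."""
    return "user" if decode_cc(value)["GIE"] else "supervisor"


def cpu_activity(value):
    """Describe what the sleep bit means for a given CC value."""
    flags = decode_cc(value)
    if not flags["SLEEP"]:
        return "running"
    # Sleeping with interrupts enabled waits for an interrupt;
    # sleeping with interrupts disabled halts the CPU.
    return "waiting for interrupt" if flags["GIE"] else "halted"


print(cpu_mode(0x20), cpu_activity(0x30), cpu_activity(0x10))
```
\end{verbatim}

The last line exercises the three cases described above: GIE alone, sleep
with GIE set, and sleep with GIE clear.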

\section{Conditional Instructions}
Most, although not quite all, instructions may be conditionally executed.  From
the four condition code flags, eight conditions are defined.  These are shown
in Tbl.~\ref{tbl:conditions}.
\begin{table}
\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .Z} & Only execute when `Z' is set \\
3'h2 & {\tt .NE} & Only execute when `Z' is not set \\
3'h3 & {\tt .GE} & Greater than or equal (`N' not set, `Z' irrelevant) \\
3'h4 & {\tt .GT} & Greater than (`N' not set, `Z' not set) \\
3'h5 & {\tt .LT} & Less than (`N' set) \\
3'h6 & {\tt .C} & Carry set\\
3'h7 & {\tt .V} & Overflow set\\
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}
\end{table}
There is no condition code for less than or equal, not C or not V.  Sorry,
I ran out of space in 3--bits.  Conditioning on a non--supported condition
is still possible, but it will take an extra instruction and a pipeline stall.
(Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NE R0,(R1)})

Conditionally executed ALU instructions will not further adjust the
condition codes.
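
The condition table above translates directly into a small lookup.  The sketch
below is illustrative only: the function name {\tt condition\_met} is
hypothetical, and the flag masks assume that Z, C, N and V occupy bits 0
through 3 of the CC register in that order.

\begin{verbatim}
```python
# Flag masks, assuming Z, C, N, V sit in CC bits 0..3 respectively.
Z_FLAG, C_FLAG, N_FLAG, V_FLAG = 0x1, 0x2, 0x4, 0x8


def condition_met(code, cc):
    """Return True if the 3-bit condition `code` holds for CC value `cc`."""
    if not 0 <= code <= 7:
        raise ValueError("condition codes are three bits wide")
    z = bool(cc & Z_FLAG)
    c = bool(cc & C_FLAG)
    n = bool(cc & N_FLAG)
    v = bool(cc & V_FLAG)
    outcomes = (
        True,                 # 3'h0: always
        z,                    # 3'h1: .Z
        not z,                # 3'h2: .NE
        not n,                # 3'h3: .GE ('N' clear, 'Z' irrelevant)
        (not n) and (not z),  # 3'h4: .GT
        n,                    # 3'h5: .LT
        c,                    # 3'h6: .C
        v,                    # 3'h7: .V
    )
    return outcomes[code]


# After a compare of equal values (Z set, N clear): .GE holds, .GT does not.
print(condition_met(3, Z_FLAG), condition_met(4, Z_FLAG))
```
\end{verbatim}

Note that nothing in this lookup expresses ``less than or equal''; as the text
says, such a test needs a separate instruction first.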

\section{Operand B}
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
or an immediate offset by itself.  This value is encoded as shown in
Tbl.~\ref{tbl:opb}.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
Bit 20 & 19 \ldots 16 & 15 \ldots 0 \\\hline
1'b0 & \multicolumn{2}{l|}{20--bit Signed Immediate value} \\\hline
1'b1 & 4-bit Register & 16--bit Signed immediate offset \\\hline
\end{tabular}
\caption{Bit allocation for Operand B}\label{tbl:opb}
\end{center}\end{table}

Sixteen-- and twenty--bit immediate values don't make sense for all
instructions.
For example, what is the point of a 20--bit immediate when executing a 16--bit
multiply?  Likewise, why have a 16--bit immediate when adding to a logical
or arithmetic shift?  In these cases, the extra bits are reserved for future
instruction possibilities.
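
The Operand B layout of Tbl.~\ref{tbl:opb} can be decoded in a few lines.  The
following is a sketch rather than anything taken from the Zip CPU tool chain;
the helper names {\tt sign\_extend} and {\tt decode\_operand\_b} are
hypothetical.

\begin{verbatim}
```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign


def decode_operand_b(field):
    """Decode a 21-bit Operand B field into (register, immediate).

    The register is None when bit 20 is clear, i.e. when the operand
    is a plain 20-bit signed immediate.
    """
    field &= 0x1FFFFF                    # keep only the 21 bits of Operand B
    if field & (1 << 20):
        register = (field >> 16) & 0xF   # bits 19..16
        return register, sign_extend(field, 16)
    return None, sign_extend(field, 20)


# Register 13 with an offset of -1, and a bare immediate of -1.
print(decode_operand_b((1 << 20) | (13 << 16) | 0xFFFF))
print(decode_operand_b(0x0FFFFF))
```
\end{verbatim}

The two example fields differ only in bit~20 and the register bits, yet the
immediate is sign extended from a different width in each case.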

\section{Address Modes}
The Zip CPU supports two addressing modes: register plus immediate, and
immediate address.  Addresses are therefore encoded in the same fashion as
Operand B, shown above.

A lot of long, hard thought was put into whether to allow pre/post increment
and decrement addressing modes.  Finding no way to use these operators without
taking two or more clocks per instruction,\footnote{The two clocks figure
comes from the design of the register set, allowing only one write per clock.
That write is either from the memory unit or the ALU, but never both.} these
addressing modes have been
removed from the realm of possibilities.  This means that the Zip CPU has no
native way of executing push, pop, return, or jump to subroutine operations.
Each of these instructions can be emulated with a set of instructions from the
existing set.

\section{Move Operands}
The previous set of operands would be perfect and complete, save only that
the CPU needs access to non--supervisory registers while in supervisory mode.
Therefore, the MOV instruction is special and offers access to these registers
\ldots when in supervisory mode.  To keep the compiler simple, the extra bits
are ignored in non-supervisory mode (as though they didn't exist), rather than
being mapped to new instructions or additional capabilities.  The bits
indicating which register set each register lies within are the A-Usr and
B-Usr bits.  When set to a one, these refer to a user mode register.  When set
to a zero, these refer to a register in the current mode, whether user or
supervisor.  Further, because a load immediate instruction exists, there is no
move capability between an immediate and a register: all moves come from either
a register or a register plus an offset.

This actually leads to a bit of a problem: since the MOV instruction encodes
which register set each register is coming from or moving to, how shall a
compiler or assembler know how to compile a MOV instruction without knowing
the mode of the CPU at the time?  For this reason, the compiler will assume
all MOV registers are supervisor registers, and display them as normal.
Anything with the user bit set will be treated as a user register.  The CPU
will quietly ignore the supervisor bits while in user mode, and anything
marked as a user register will always be valid.
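
To make the MOV encoding concrete, the sketch below splits a MOV instruction
word into the fields listed for MOV in Tbl.~\ref{tbl:zip-instructions}: a
4--bit op code, the D.~Reg field, a 3--bit condition, the A-Usr bit, the B-Reg
field, the B-Usr bit and a 15--bit signed offset.  The function and key names
are hypothetical, and the example word is made up purely for illustration.

\begin{verbatim}
```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign


def decode_mov(word):
    """Split a 32-bit MOV instruction word into its fields."""
    if ((word >> 28) & 0xF) != 0x2:
        raise ValueError("not a MOV instruction (op code is not 4'h2)")
    return {
        "d_reg":  (word >> 24) & 0xF,    # bits 27..24
        "cond":   (word >> 21) & 0x7,    # bits 23..21
        "a_usr":  (word >> 20) & 0x1,    # bit 20
        "b_reg":  (word >> 16) & 0xF,    # bits 19..16
        "b_usr":  (word >> 15) & 0x1,    # bit 15
        "offset": sign_extend(word, 15),  # bits 14..0
    }


# A made-up word: unconditional, both Usr bits set, B-Reg = 13, offset = -2.
example = (0x2 << 28) | (0x3 << 24) | (1 << 20) | (0xD << 16) | (1 << 15) | 0x7FFE
print(decode_mov(example))
```
\end{verbatim}

Since the CPU ignores the two Usr bits in user mode, a decoder like this one
would report them regardless of mode; whether they take effect depends on the
mode in which the instruction executes.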

\section{Multiply Operations}
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}).  In both
cases, the operand is a register plus a 16-bit immediate, subject to the
rule that the register cannot be the PC or CC registers.  The PC register
field has been stolen to create a multiply by immediate instruction.  The
CC register field is reserved.
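
The difference between the signed and unsigned multiplies can be seen in a
short software model.  This sketch rests on two assumptions that the text
above does not spell out: that each multiply uses only the low 16~bits of its
two inputs, and that the result is the full 32--bit product.  The function
names are hypothetical.

\begin{verbatim}
```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def to_signed16(value):
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def mpyu(a, b):
    """Model of a 16x16 unsigned multiply (ASSUMED: low halves in, 32 bits out)."""
    return ((a & MASK16) * (b & MASK16)) & MASK32


def mpys(a, b):
    """Model of a 16x16 signed multiply (ASSUMED: low halves in, 32 bits out)."""
    return (to_signed16(a) * to_signed16(b)) & MASK32


# 0xFFFF is 65535 when unsigned but -1 when signed.
print(hex(mpyu(0xFFFF, 0xFFFF)), hex(mpys(0xFFFF, 0xFFFF)))
```
\end{verbatim}

The same bit pattern therefore gives very different products depending on
which of the two instructions is chosen.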

\section{Floating Point}
The Zip CPU does not (yet) support floating point operations.  However, the
instruction set reserves two possibilities for future floating point
operations.

The first floating point operation hole in the instruction set involves
setting a proposed (but non-existent) floating point bit in the CC register.
The next instruction would then simply interpret its operands as floating
point values.  Not all instructions, however, have floating point equivalents
or make sense as floating point operations.
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as
floating point instructions.  Further, the
immediate fields do not apply in floating point mode, and must be set to
zero.  Other instructions allow the examining of the
floating point bit in the CC register.  In all cases, the floating point bit
is cleared one instruction after it is set.

The other possibility for floating point operations involves exploiting the
hole in the instruction set that the NOOP and BREAK instructions reside within.
These two instructions use 24--bits of encoding space, when only a single bit
is necessary.  A simple adjustment to this space could create instructions
with 4--bit register addresses for each register, a 3--bit field for
conditional execution, and a 2--bit field for which operation.
In this fashion, such a floating point capability would only fill 13--bits of
the 24--bit field, still leaving lots of room for expansion.

In both cases, the Zip CPU would support 32--bit single precision floats
only, since other choices would complicate the pipeline.

The current architecture does not support a floating point not-implemented
interrupt.  Any soft floating point emulation must be done deliberately.
558
 
559 21 dgisselq
\section{Native Instructions}
560
The instruction set for the Zip CPU is summarized in
561
Tbl.~\ref{tbl:zip-instructions}.
562
\begin{table}\begin{center}
563
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
564 36 dgisselq
\rowcolor[gray]{0.85}
565 21 dgisselq
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
566
        & \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
567 36 dgisselq
        & Sets CC? \\\hline\hline
568 21 dgisselq
CMP(Sub) & \multicolumn{4}{l|}{4'h0}
569
                & \multicolumn{4}{l|}{D. Reg}
570
                & \multicolumn{3}{l|}{Cond.}
571
                & \multicolumn{21}{l|}{Operand B}
572
                & Yes \\\hline
573 24 dgisselq
TST(And) & \multicolumn{4}{l|}{4'h1}
574 21 dgisselq
                & \multicolumn{4}{l|}{D. Reg}
575
                & \multicolumn{3}{l|}{Cond.}
576
                & \multicolumn{21}{l|}{Operand B}
577
        & Yes \\\hline
578
MOV & \multicolumn{4}{l|}{4'h2}
579
                & \multicolumn{4}{l|}{D. Reg}
580
                & \multicolumn{3}{l|}{Cond.}
581
                & A-Usr
582
                & \multicolumn{4}{l|}{B-Reg}
583
                & B-Usr
584
                & \multicolumn{15}{l|}{15-bit signed offset}
585
                & \\\hline
586
LODI & \multicolumn{4}{l|}{4'h3}
587
                & \multicolumn{4}{l|}{R. Reg}
588
                & \multicolumn{24}{l|}{24-bit Signed Immediate}
589
                & \\\hline
590
NOOP & \multicolumn{4}{l|}{4'h4}
591
                & \multicolumn{4}{l|}{4'he}
592
                & \multicolumn{24}{l|}{24'h00}
593
                & \\\hline
594
BREAK & \multicolumn{4}{l|}{4'h4}
595
                & \multicolumn{4}{l|}{4'he}
596
                & \multicolumn{24}{l|}{24'h01}
597
                & \\\hline
598 36 dgisselq
{\em Reserved} & \multicolumn{4}{l|}{4'h4}
599 21 dgisselq
                & \multicolumn{4}{l|}{4'he}
600
                & \multicolumn{24}{l|}{Any 24-bit value other than 0 or 1}
601
                & \\\hline
602
LODIHI & \multicolumn{4}{l|}{4'h4}
603
                & \multicolumn{4}{l|}{4'hf}
604
                & \multicolumn{3}{l|}{Cond.}
605
                & 1'b1
606
                & \multicolumn{4}{l|}{R. Reg}
607
                & \multicolumn{16}{l|}{16-bit Immediate}
608
                & \\\hline
609
LODILO & \multicolumn{4}{l|}{4'h4}
610
                & \multicolumn{4}{l|}{4'hf}
611
                & \multicolumn{3}{l|}{Cond.}
612
                & 1'b0
613
                & \multicolumn{4}{l|}{R. Reg}
614
                & \multicolumn{16}{l|}{16-bit Immediate}
615
                & \\\hline
616
16-b MPYU & \multicolumn{4}{l|}{4'h4}
617
                & \multicolumn{4}{l|}{R. Reg}
618
                & \multicolumn{3}{l|}{Cond.}
619
                & 1'b0 & \multicolumn{4}{l|}{Reg}
620
                & \multicolumn{16}{l|}{16-bit Offset}
621
                & Yes \\\hline
622
16-b MPYU(I) & \multicolumn{4}{l|}{4'h4}
623
                & \multicolumn{4}{l|}{R. Reg}
624
                & \multicolumn{3}{l|}{Cond.}
625
                & 1'b0 & \multicolumn{4}{l|}{4'hf}
626
                & \multicolumn{16}{l|}{16-bit Offset}
627
                & Yes \\\hline
628
16-b MPYS & \multicolumn{4}{l|}{4'h4}
629
                & \multicolumn{4}{l|}{R. Reg}
630
                & \multicolumn{3}{l|}{Cond.}
631
                & 1'b1 & \multicolumn{4}{l|}{Reg}
632
                & \multicolumn{16}{l|}{16-bit Offset}
633
                & Yes \\\hline
634
16-b MPYS(I) & \multicolumn{4}{l|}{4'h4}
635
                & \multicolumn{4}{l|}{R. Reg}
636
                & \multicolumn{3}{l|}{Cond.}
637
                & 1'b1 & \multicolumn{4}{l|}{4'hf}
638
                & \multicolumn{16}{l|}{16-bit Offset}
639
                & Yes \\\hline
640
ROL & \multicolumn{4}{l|}{4'h5}
641
                & \multicolumn{4}{l|}{R. Reg}
642
                & \multicolumn{3}{l|}{Cond.}
643
                & \multicolumn{21}{l|}{Operand B, truncated to low order 5 bits}
644
                & \\\hline
645
LOD & \multicolumn{4}{l|}{4'h6}
646
                & \multicolumn{4}{l|}{R. Reg}
647
                & \multicolumn{3}{l|}{Cond.}
648
                & \multicolumn{21}{l|}{Operand B address}
649
                & \\\hline
650
STO & \multicolumn{4}{l|}{4'h7}
651
                & \multicolumn{4}{l|}{D. Reg}
652
                & \multicolumn{3}{l|}{Cond.}
653
                & \multicolumn{21}{l|}{Operand B address}
654
                & \\\hline
655
SUB & \multicolumn{4}{l|}{4'h8}
656
        &       \multicolumn{4}{l|}{R. Reg}
657
        &       \multicolumn{3}{l|}{Cond.}
658 32 dgisselq
        &       \multicolumn{21}{l|}{Operand B}
659 21 dgisselq
        & Yes \\\hline
660
AND & \multicolumn{4}{l|}{4'h9}
661
        &       \multicolumn{4}{l|}{R. Reg}
662
        &       \multicolumn{3}{l|}{Cond.}
663
        &       \multicolumn{21}{l|}{Operand B}
664
        & Yes \\\hline
665
ADD & \multicolumn{4}{l|}{4'ha}
666
        &       \multicolumn{4}{l|}{R. Reg}
667
        &       \multicolumn{3}{l|}{Cond.}
668
        &       \multicolumn{21}{l|}{Operand B}
669
        & Yes \\\hline
670
OR & \multicolumn{4}{l|}{4'hb}
671
        &       \multicolumn{4}{l|}{R. Reg}
672
        &       \multicolumn{3}{l|}{Cond.}
673
        &       \multicolumn{21}{l|}{Operand B}
674
        & Yes \\\hline
675
XOR & \multicolumn{4}{l|}{4'hc}
676
        &       \multicolumn{4}{l|}{R. Reg}
677
        &       \multicolumn{3}{l|}{Cond.}
678
        &       \multicolumn{21}{l|}{Operand B}
679
        & Yes \\\hline
680
LSL/ASL & \multicolumn{4}{l|}{4'hd}
681
        &       \multicolumn{4}{l|}{R. Reg}
682
        &       \multicolumn{3}{l|}{Cond.}
683 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
684 21 dgisselq
        & Yes \\\hline
685
ASR & \multicolumn{4}{l|}{4'he}
686
        &       \multicolumn{4}{l|}{R. Reg}
687
        &       \multicolumn{3}{l|}{Cond.}
688 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
689 21 dgisselq
        & Yes \\\hline
690
LSR & \multicolumn{4}{l|}{4'hf}
691
        &       \multicolumn{4}{l|}{R. Reg}
692
        &       \multicolumn{3}{l|}{Cond.}
693 33 dgisselq
        &       \multicolumn{21}{l|}{Operand B, imm. truncated to 6 bits}
694 21 dgisselq
        & Yes \\\hline
695
\end{tabular}
696
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
697
\end{center}\end{table}
698
 
699
As you can see, there's lots of room for instruction set expansion.  The
700 24 dgisselq
NOOP and BREAK instructions are the only instructions within one particular
701 36 dgisselq
24--bit hole.  The rest of this space is reserved for future enhancements.
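
The generic field layout in Tbl.~\ref{tbl:zip-instructions} can be pulled apart
with a few shifts and masks.  The Python sketch below is illustrative only: the
function and field names are hypothetical, and it covers just the layout shared
by most rows (op code, register, condition, Operand~B), not the special cases
such as LODI or the 4'h4 group.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the generic fields of the table.

    Hypothetical helper: the names are illustrative and are not taken
    from the Zip CPU sources.
    """
    word &= 0xFFFFFFFF
    return {
        "opcode": (word >> 28) & 0xF,   # bits 31..28
        "reg":    (word >> 24) & 0xF,   # bits 27..24 (D. Reg / R. Reg)
        "cond":   (word >> 21) & 0x7,   # bits 23..21
        "opb":    word & 0x1FFFFF,      # bits 20..0, the 21-bit Operand B field
    }

# Op code 4'ha (ADD), register 3, condition 0, Operand B field = 5
fields = decode_fields(0xA3000005)
```

A real decoder would next have to interpret the Operand~B field and the
exceptions in the table; this sketch stops at the raw fields.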
702 21 dgisselq
 
703
\section{Derived Instructions}
704 36 dgisselq
The Zip CPU supports many other common instructions, but not all of them
705 24 dgisselq
are single cycle instructions.  The derived instruction tables,
706 36 dgisselq
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
707
and~\ref{tbl:derived-4},
708 21 dgisselq
help to capture some of how these other instructions may be implemented on
709 36 dgisselq
the Zip CPU.  Many of these instructions will have assembly equivalents,
710 21 dgisselq
such as the branch instructions, to facilitate working with the CPU.
711
\begin{table}\begin{center}
712
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
713
Mapped & Actual  & Notes \\\hline
714 36 dgisselq
ABS Rx
715
        & \parbox[t]{1.5in}{TST -1,Rx\\NEG.LT Rx}
716
        & Absolute value, depends upon derived NEG.\\\hline
717 21 dgisselq
\parbox[t]{1.4in}{ADD Ra,Rx\\ADDC Rb,Ry}
718
        & \parbox[t]{1.5in}{ADD Ra,Rx\\ADD.C \$1,Ry\\ADD Rb,Ry}
719
        & Add with carry \\\hline
720
BRA.Cond +/-\$Addr
721 33 dgisselq
        & \hbox{MOV.cond \$Addr+PC,PC}
722 24 dgisselq
        & Branch or jump on condition.  Works for 15--bit
723
                signed address offsets.\\\hline
724 21 dgisselq
BRA.Cond +/-\$Addr
725
        & \parbox[t]{1.5in}{LDI \$Addr,Rx \\ ADD.cond Rx,PC}
726
        & Branch/jump on condition.  Works for
727
        23 bit address offsets, but costs a register, an extra instruction,
728 33 dgisselq
        and sets the flags. \\\hline
729 21 dgisselq
BNC PC+\$Addr
730
        & \parbox[t]{1.5in}{TST \$Carry,CC \\ MOV.Z PC+\$Addr,PC}
731
        & Example of a branch on an unsupported
732
                condition, in this case a branch on not carry \\\hline
733
BUSY & MOV \$-1(PC),PC & Execute an infinite loop \\\hline
734
CLRF.NZ Rx
735
        & XOR.NZ Rx,Rx
736
        & Clear Rx, and flags, if the Z-bit is not set \\\hline
737
CLR Rx
738
        & LDI \$0,Rx
739
        & Clears Rx, leaves flags untouched.  This instruction cannot be
740
                conditional. \\\hline
741
EXCH.W Rx
742
        & ROL \$16,Rx
743
        & Exchanges the top and bottom 16-bit words of Rx \\\hline
744
HALT
745
        & OR \$SLEEP,CC
746
        & Executed while in interrupt mode.  In user mode this is simply a
747 33 dgisselq
        wait until interrupt instruction. \\\hline
748 21 dgisselq
INT & LDI \$0,CC
749
        & Since we're using the CC register as a trap vector as well, this
750
        executes TRAP \#0. \\\hline
751
IRET
752
        & OR \$GIE,CC
753
        & Also an RTU instruction (Return to Userspace) \\\hline
754
JMP R6+\$Addr
755
        & MOV \$Addr(R6),PC
756
        & \\\hline
757
JSR PC+\$Addr
758
        & \parbox[t]{1.5in}{SUB \$1,SP \\
759
        MOV \$3+PC,R0 \\
760
        STO R0,1(SP) \\
761
        MOV \$Addr+PC,PC \\
762
        ADD \$1,SP}
763 24 dgisselq
        & Jump to Subroutine. Note the required cleanup instruction after
764 36 dgisselq
        returning.  This could easily be turned into a three instruction
765
        operation, removing the preliminary stack instruction before and
766
        the cleanup after, by adjusting how any stack frame was built for
767
        this routine to include space at the top of the stack for the PC.
768
        \\\hline
769 21 dgisselq
JSR PC+\$Addr
770
        & \parbox[t]{1.5in}{MOV \$3+PC,R12 \\ MOV \$Addr+PC,PC}
        & This is the high speed
772
        version of a subroutine call, necessitating a register to hold the
773
        last PC address.  In its favor, this method doesn't suffer the
774
        mandatory memory access of the other approach. \\\hline
775
LDI.l \$val,Rx
776
        & \parbox[t]{1.5in}{LDIHI (\$val$>>$16)\&0x0ffff,Rx \\
                        LDILO (\$val \& 0x0ffff),Rx}
778
        & Sadly, there's not enough instruction
779
                space to load a complete immediate value into any register.
780
                Therefore, fully loading any register takes two cycles.
781
                The LDIHI (load immediate high) and LDILO (load immediate low)
782
                instructions have been created to facilitate this. \\\hline
783
\end{tabular}
784
\caption{Derived Instructions}\label{tbl:derived-1}
785
\end{center}\end{table}
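
The {\tt LDI.l} expansion above amounts to splitting a 32--bit value into two
16--bit halves.  The following Python sketch shows that arithmetic only; the
helper names are hypothetical, and it assumes that the LDIHI/LDILO pair leaves
the two halves side by side in the register.

```python
def split_immediate(val):
    """Return the (high, low) 16-bit halves of a 32-bit value."""
    val &= 0xFFFFFFFF
    hi = (val >> 16) & 0xFFFF   # operand for LDIHI
    lo = val & 0xFFFF           # operand for LDILO
    return hi, lo

def join_immediate(hi, lo):
    """Rebuild the 32-bit value the two halves are assumed to produce."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

hi, lo = split_immediate(0xDEADBEEF)
```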
786
\begin{table}\begin{center}
787
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
788
Mapped & Actual  & Notes \\\hline
789
LOD.b \$addr,Rx
790
        & \parbox[t]{1.5in}{%
791
        LDI     \$addr,Ra \\
792
        LDI     \$addr,Rb \\
793
        LSR     \$2,Ra \\
794
        AND     \$3,Rb \\
795
        LOD     (Ra),Rx \\
796
        LSL     \$3,Rb \\
797
        SUB     \$32,Rb \\
798
        ROL     Rb,Rx \\
799
        AND \$0ffh,Rx}
800
        & \parbox[t]{3in}{This CPU is designed for 32-bit word
801
        length instructions.  Byte addressing is not supported by the CPU or
802
        the bus, so it therefore takes more work to do.
803
 
804
        Note also that in this example, \$Addr is a byte-wise address, where
805 24 dgisselq
        all other addresses in this document are 32-bit wordlength addresses.
806
        For this reason,
807 21 dgisselq
        we needed to drop the bottom two bits.  This also limits the address
808
        space of character accesses using this method from 16 MB down to 4 MB.}
809
                \\\hline
810
\parbox[t]{1.5in}{LSL \$1,Rx\\ LSLC \$1,Ry}
811
        & \parbox[t]{1.5in}{LSL \$1,Ry \\
812
        LSL \$1,Rx \\
813
        OR.C \$1,Ry}
814
        & Logical shift left with carry.  Note that the
815
        instruction order is now backwards, to keep the conditions valid.
816 33 dgisselq
        That is, LSL sets the carry flag, so if we did this the other way
817 21 dgisselq
        with Rx before Ry, then the condition flag wouldn't have been right
818
        for an OR correction at the end. \\\hline
819
\parbox[t]{1.5in}{LSR \$1,Rx \\ LSRC \$1,Ry}
820
        & \parbox[t]{1.5in}{CLR Rz \\
821
        LSR \$1,Ry \\
822
        LDIHI.C \$8000h,Rz \\
823
        LSR \$1,Rx \\
824
        OR Rz,Rx}
825
        & Logical shift right with carry \\\hline
826
NEG Rx & \parbox[t]{1.5in}{XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
827 36 dgisselq
NEG.C Rx & \parbox[t]{1.5in}{MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx} & \\\hline
828 21 dgisselq
NOOP & NOOP & While there are many
829
        operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
830
        operations have consequences in that they might stall the bus if
831
        Rx isn't ready yet.  For this reason, we have a dedicated NOOP
832
        instruction. \\\hline
833
NOT Rx & XOR \$-1,Rx & \\\hline
834
POP Rx
835
        & \parbox[t]{1.5in}{LOD \$-1(SP),Rx \\ ADD \$1,SP}
836
        & Note
837
        that for interrupt purposes, one can never depend upon the value at
838
        (SP).  Hence you read from it, then increment it, lest having
839 33 dgisselq
        incremented it first something then comes along and writes to that
840 21 dgisselq
        value before you can read the result. \\\hline
841 36 dgisselq
\end{tabular}
842
\caption{Derived Instructions, continued}\label{tbl:derived-2}
843
\end{center}\end{table}
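
The address arithmetic behind the {\tt LOD.b} expansion can be modelled in a
few lines.  This Python sketch is not a transcription of the instruction
sequence above: the function name is hypothetical, memory is modelled as a list
of 32--bit words, and big--endian ordering within a word is an assumption made
purely for the example.

```python
def load_byte(memory, byte_addr):
    """Read one byte from a word-addressed memory (a list of 32-bit ints).

    Assumes big-endian ordering within a word; this is one possible
    choice, picked only for illustration.
    """
    word = memory[byte_addr >> 2] & 0xFFFFFFFF   # drop the bottom two bits
    lane = byte_addr & 3                         # which byte within the word
    shift = 8 * (3 - lane)                       # big-endian: lane 0 is the MSB
    return (word >> shift) & 0xFF

mem = [0x11223344, 0xAABBCCDD]
```

Dropping the bottom two address bits is what shrinks the reachable address
space by a factor of four, as noted in the table.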
844
\begin{table}\begin{center}
845
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
846 21 dgisselq
PUSH Rx
847 33 dgisselq
        & \parbox[t]{1.5in}{SUB \$1,SP \\
848 21 dgisselq
        STO Rx,\$1(SP)}
849
        & \\\hline
850 36 dgisselq
PUSH Rx-Ry
851
        & \parbox[t]{1.5in}{SUB \$n,SP \\
852
        STO Rx,\$n(SP) \\
853
        \ldots \\
854
        STO Ry,\$1(SP)}
855
        & Multiple pushes at once only need the single subtract from the
856
        stack pointer.  This derived instruction is analogous to a similar one
857
        on the Motorola 68k architecture, although the Zip Assembler
858
        does not support this instruction (yet).\\\hline
859 21 dgisselq
RESET
860
        & \parbox[t]{1in}{STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
861
        & \parbox[t]{3in}{This depends upon the peripheral base address being
862
        in R12.
863
 
864
        Another opportunity might be to jump to the reset address from within
865
        supervisor mode.}\\\hline
866 36 dgisselq
RET & \parbox[t]{1.5in}{LOD \$1(SP),PC}
867 24 dgisselq
        & Note that this depends upon the calling context to clean up the
868
        stack, as outlined for the JSR instruction.  \\\hline
869 21 dgisselq
RET & MOV R12,PC
870
        & This is the high(er) speed version, that doesn't touch the stack.
871
        As such, it doesn't suffer a stall on memory read/write to the stack.
872
        \\\hline
873
STEP Rr,Rt
874
        & \parbox[t]{1.5in}{LSR \$1,Rr \\ XOR.C Rt,Rr}
875
        & Step a Galois implementation of a Linear Feedback Shift Register, Rr,
876
                using taps Rt \\\hline
877
STO.b Rx,\$addr
878
        & \parbox[t]{1.5in}{%
879
        LDI \$addr,Ra \\
880
        LDI \$addr,Rb \\
881
        LSR \$2,Ra \\
882
        AND \$3,Rb \\
883
        SUB \$32,Rb \\
884
        LOD (Ra),Ry \\
885
        AND \$0ffh,Rx \\
886
        AND \$-0ffh,Ry \\
887
        ROL Rb,Rx \\
888
        OR Rx,Ry \\
889
        STO Ry,(Ra) }
890
        & \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
891
        for byte-wise operations.
892
 
893
        Note that in this example, \$addr is a
894
        byte-wise address, whereas in all of our other examples it is a
895
        32-bit word address. This also limits the address space
896
        of character accesses from 16 MB down to 4 MB.
897
        Further, this instruction implies a byte ordering,
898
        such as big or little endian.} \\\hline
899
SWAP Rx,Ry
900
        & \parbox[t]{1.5in}{
901
        XOR Ry,Rx \\
902
        XOR Rx,Ry \\
903
        XOR Ry,Rx}
904
        & While no extra registers are needed, this example
905
        does take 3-clocks. \\\hline
906
TRAP \#X
907 36 dgisselq
        & \parbox[t]{1.5in}{LDI \$x,R0 \\ AND \~{}\$GIE,CC }
908
        & This works because whenever a user lowers the \$GIE flag, it sets
909
        a TRAP bit within the CC register.  Therefore, upon entering the
910
        supervisor state, the CPU only need check this bit to know that it
911
        got there via a TRAP.  The trap could be made conditional by making
912
        the LDI and the AND conditional.  In that case, the assembler would
913
        quietly turn the LDI instruction into an LDILO and LDIHI pair,
914
        but the effect would be the same. \\\hline
915
\end{tabular}
916
\caption{Derived Instructions, continued}\label{tbl:derived-3}
917
\end{center}\end{table}
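
The {\tt STEP} entry above is a standard Galois LFSR update.  The Python sketch
below mirrors the two instructions under the assumption that {\tt LSR} leaves
the bit it shifts out in the carry flag; the function name and the tap value
used in the test are illustrative only.

```python
def lfsr_step(state, taps):
    """One step of a 32-bit Galois LFSR, mirroring LSR $1,Rr / XOR.C Rt,Rr.

    Assumes the bit shifted out by LSR becomes the carry flag.
    """
    carry = state & 1                    # bit shifted out by the LSR
    state = (state & 0xFFFFFFFF) >> 1    # logical shift right by one
    if carry:
        state ^= taps & 0xFFFFFFFF       # conditional XOR with the tap mask
    return state
```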
918
\begin{table}\begin{center}
919
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
920 21 dgisselq
TST Rx
921
        & TST \$-1,Rx
922
        & Set the condition codes based upon Rx.  Could also do a CMP \$0,Rx,
        ADD \$0,Rx, SUB \$0,Rx, AND \$-1,Rx, etc.  The TST and CMP
924
        approaches won't stall future pipeline stages looking for the value
925
        of Rx. \\\hline
926
WAIT
927
        & OR \$SLEEP,CC
928
        & Wait 'til interrupt.  In an interrupts disabled context, this
929
        becomes a HALT instruction.
930
\end{tabular}
931 36 dgisselq
\caption{Derived Instructions, continued}\label{tbl:derived-4}
932 21 dgisselq
\end{center}\end{table}
933
\section{Pipeline Stages}
934 32 dgisselq
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
935
the Zip CPU supports a five stage pipeline.
936 21 dgisselq
\begin{enumerate}
937 36 dgisselq
\item {\bf Prefetch}: Reads instruction from memory and into a cache, if so
938
        configured.  This
939 21 dgisselq
        stage is actually pipelined itself, and so it will stall if the PC
940
        ever changes.  Stalls are also created here if the instruction isn't
941
        in the prefetch cache.
942 36 dgisselq
 
943
        The Zip CPU supports one of two prefetch methods, depending upon a flag
944
        set at build time within the {\tt zipcpu.v} file.  The simplest is a
945
        non--cached implementation of a prefetch.  This implementation is
946
        fairly small, and ideal for
947
        users of the Zip CPU who need the extra space on the FPGA fabric.
948
        However, because this non--cached version has no cache, throughput
        is limited to about one instruction every five clocks.
950
 
951
        The second prefetch module is a pipelined prefetch with a cache.  This
952
        module tries to keep the instruction address within a window of valid
953
        instruction addresses.  While effective, it is not a traditional
954
        cache implementation.  One unique feature of this cache implementation,
955
        however, is that it can be cleared in a single clock.  A disappointing
956
        feature, though, was that it needs an extra internal pipeline stage
957
        to be implemented.
958
 
959
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read,
960
        and immediate offset.  This stage also determines whether the flags will
961 32 dgisselq
        be set or whether the result will be written back.
962 21 dgisselq
\item {\bf Read Operands}: Read registers and apply any immediate values to
963 24 dgisselq
        them.  There is no means of detecting or flagging arithmetic overflow
964
        or carry when adding the immediate to the operand.  This stage will
965
        stall if any source operand is pending.
966 21 dgisselq
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
967 36 dgisselq
        instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load)
968
        and {\tt STO} (store) instructions.
969 21 dgisselq
        \begin{itemize}
970 36 dgisselq
        \item Loads will stall the entire pipeline until complete.
971
        \item Condition codes are available upon completion of the ALU stage
972
        \item Issuing an instruction to the memory unit while the memory unit
973
                is busy will stall the entire pipeline.  If the bus deadlocks,
974
                only a reset will release the CPU.  (Watchdog timer, anyone?)
975 24 dgisselq
        \item The Zip CPU currently has no means of reading and acting on any
976
        error conditions on the bus.
977 21 dgisselq
        \end{itemize}
978 32 dgisselq
\item {\bf Write-Back}: Conditionally write back the result to the register
979 36 dgisselq
        set, applying the condition.  This routine is bi-entrant: either the
980 21 dgisselq
        memory or the simple instruction may request a register write.
981
\end{enumerate}
982
 
983 24 dgisselq
The Zip CPU does not support out of order execution.  Therefore, if the memory
984
unit stalls, every other instruction stalls.  Memory stores, however, can take
985 36 dgisselq
place concurrently with ALU operations, although memory reads (loads) cannot.
986 24 dgisselq
 
987 32 dgisselq
\section{Pipeline Stalls}
988
The processing pipeline can and will stall for a variety of reasons.  Some of
989
these are obvious, some less so.  These reasons are listed below:
990
\begin{itemize}
991
\item When the prefetch cache is exhausted
992 21 dgisselq
 
993 36 dgisselq
This reason should be obvious.  If the prefetch cache doesn't have the
994
instruction in memory, the entire pipeline must stall until enough of the
995
prefetch cache is loaded to support the next instruction.
996 21 dgisselq
 
997 32 dgisselq
\item While waiting for the pipeline to load following any taken branch, jump,
998 36 dgisselq
        return from interrupt or switch to interrupt context (5 stall cycles)
999 32 dgisselq
 
1000
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
1001
be reloaded.  Given that there are five stages to the pipeline, that accounts
1002 36 dgisselq
for four of the five stalls.  The fifth stall cycle is lost in the pipelined prefetch
1003 32 dgisselq
stage which needs at least one clock with a valid PC before it can produce
1004 36 dgisselq
a new output.
1005 32 dgisselq
 
1006 36 dgisselq
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
1007
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
1008
not conditioned on the flags, can execute with only 3~stall cycles.
1009
 
1010 32 dgisselq
\item When reading from a prior register while also adding an immediate offset
1011
\begin{enumerate}
1012
\item\ {\tt OPCODE ?,RA}
1013
\item\ {\em (stall)}
1014
\item\ {\tt OPCODE I+RA,RB}
1015
\end{enumerate}
1016
 
1017
Since the addition of the immediate to the register within OpB decoding gets applied
1018
during the read operand stage so that it can be nicely settled before the ALU,
1019
any instruction that will write back an operand must be separated from the
1020
opcode that will read and apply an immediate offset by one instruction.  The
1021
good news is that this stall can easily be mitigated by proper scheduling.
1022 36 dgisselq
That is, any instruction that does not add an immediate to {\tt RA} may be
1023
scheduled into the stall slot.
1024 32 dgisselq
 
1025 36 dgisselq
\item When any write to either the CC or PC Register is followed by a memory
1026
        operation
1027 32 dgisselq
\begin{enumerate}
1028
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
1029
\item\ {\em (stall, even if jump not taken)}
1030 36 dgisselq
\item\ {\tt LOD \$X(RA),RB}
1031 32 dgisselq
\end{enumerate}
1032
Since branches take place in the writeback stage, the Zip CPU will stall the
1033
pipeline for one clock anytime there may be a possible jump.  This prevents
1034
an instruction from executing a memory access after the jump but before the
1035
jump is recognized.
1036
 
1037 36 dgisselq
This stall may be mitigated by shuffling the operations immediately following
1038
a potential branch so that an ALU operation follows the branch instead of a
1039
memory operation.
1040 33 dgisselq
 
1041 32 dgisselq
\item When reading from the CC register after setting the flags
1042
\begin{enumerate}
1043 36 dgisselq
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode}
1044
\item\ {\em (stall)}
1045 32 dgisselq
\item\ {\tt TST sys.ccv,CC}
1046
\item\ {\tt BZ somewhere}
1047
\end{enumerate}
1048
 
1049
The reason for this stall is simply performance.  Many of the flags are
1050
determined via combinatorial logic after the writeback instruction is
1051
determined.  Trying to then place these into the input for one of the operands
1052
created a time delay loop that would no longer execute in a single 100~MHz
1053
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
1054
either \ldots).
1055
 
1056 33 dgisselq
This stall may be eliminated via proper scheduling, by placing an instruction
1057
that does not set flags in between the ALU operation and the instruction
1058
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
1059
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
1060
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
1061 36 dgisselq
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.
1062 33 dgisselq
 
1063 32 dgisselq
\item When waiting for a memory read operation to complete
1064
\begin{enumerate}
1065
\item\ {\tt LOD address,RA}
1066 36 dgisselq
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
1067 32 dgisselq
\item\ {\tt OPCODE I+RA,RB}
1068
\end{enumerate}
1069
 
1070 36 dgisselq
Remember, the Zip CPU does not support out of order execution.  Therefore,
1071 32 dgisselq
anytime the memory unit becomes busy both the memory unit and the ALU must
1072
stall until the memory unit is cleared.  This is especially true of a load
1073 33 dgisselq
instruction, which must still write its operand back to the register file.
1074
Store instructions are different, since they can be busy with no impact on
1075
later ALU write back operations.  Hence, only loads stall the pipeline.
1076 32 dgisselq
 
1077
This also assumes that the memory being accessed is a single cycle memory.
1078
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
1079 33 dgisselq
as long as forty clocks.   During this time the CPU and the external bus
1080 32 dgisselq
will be busy, and unable to do anything else.
1081
 
1082
\item Memory operation followed by a memory operation
1083
\begin{enumerate}
1084
\item\ {\tt STO RA,address}
1085 36 dgisselq
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
1086 32 dgisselq
\item\ {\tt LOD address,RB}
1087 36 dgisselq
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
1088 32 dgisselq
\end{enumerate}
1089
 
1090 36 dgisselq
In this case, the LOD instruction cannot start until the STO is finished.
1091 32 dgisselq
With proper scheduling, it is possible to do something in the ALU while the
1092 36 dgisselq
memory unit is busy with the STO instruction, but otherwise this pipeline will
1093
stall waiting for it to complete.
1094 32 dgisselq
 
1095
Note that even though the Wishbone bus can support pipelined accesses at
1096
one access per clock, only the prefetch stage can take advantage of this.
1097
Load and Store instructions are stuck at one wishbone cycle per instruction.
1098 36 dgisselq
 
1099
\item When waiting for a conditional memory read operation to complete
1100
\begin{enumerate}
1101
\item\ {\tt LOD.Z address,RA}
1102
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
1103
\item\ {\tt OPCODE I+RA,RB}
1104
\end{enumerate}
1105
 
1106
In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus
1107
two cycles before using the bus, so there's a potential for an extra three
1108
cycle cost due to bus contention between the prefetch and the CPU.
1109
 
1110
This is true for both the LOD and the STO instructions, with the exception that
1111
the STO instruction will continue in parallel with any ALU instructions that
1112
follow it.
1113
 
1114 32 dgisselq
\end{itemize}
1115
 
1116
 
1117 21 dgisselq
\chapter{Peripherals}\label{chap:periph}
1118 24 dgisselq
 
1119
While the previous chapter describes a CPU in isolation, the Zip System
1120
includes a minimum set of peripherals as well.  These peripherals are shown
1121
in Fig.~\ref{fig:zipsystem}
1122
\begin{figure}\begin{center}
1123
\includegraphics[width=3.5in]{../gfx/system.eps}
1124
\caption{Zip System Peripherals}\label{fig:zipsystem}
1125
\end{center}\end{figure}
1126
and described here.  They are designed to make
1127
the Zip CPU more useful in an Embedded Operating System environment.
1128
 
1129 21 dgisselq
\section{Interrupt Controller}
1130 24 dgisselq
 
1131
Perhaps the most important peripheral within the Zip System is the interrupt
1132
controller.  While the Zip CPU itself can only handle one interrupt, and has
1133
only the one interrupt state: disabled or enabled, the interrupt controller
1134
can make things more interesting.
1135
 
1136
The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register.  Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.
1142
 
1143
The interrupt controller has been designed so that bits can be controlled
1144
individually without having any knowledge of the rest of the controller
1145
setting.  To enable an interrupt, write to the register with the high order
1146
global enable bit set and the respective interrupt enable bit set.  No other
1147
bits will be affected.  To disable an interrupt, write to the register with
1148
the high order global enable bit cleared and the respective interrupt enable
1149
bit set.  To clear an interrupt, write a `1' to that interrupt's status bit.
Zeros written to the register have no effect, save that a zero written to the
master enable will disable all interrupts.
1152
 
1153
As an example, suppose you wished to enable interrupt \#4.  You would then
1154
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
1155
any past active state.  When you later wish to disable this interrupt, you would
1156
write a {\tt 0x00100010} to the register.  As before, this both disables the
1157
interrupt and clears the active indicator.  This also has the side effect of
1158
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
1159
to re-enable any other interrupts.
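
The control words in this example follow directly from the bit layout
described above.  As a minimal sketch, with hypothetical helper names and
interrupts numbered from zero, they could be built as follows in Python:

```python
GLOBAL_ENABLE = 1 << 31   # bit 31: master interrupt enable

def enable_word(n):
    """Word that enables interrupt n (0..14) and clears its active bit."""
    assert 0 <= n <= 14
    return GLOBAL_ENABLE | (1 << (16 + n)) | (1 << n)

def disable_word(n):
    """Word that disables interrupt n (0..14) and clears its active bit.

    Bit 31 is left clear, so writing this word also turns off the
    global enable, as described in the text.
    """
    assert 0 <= n <= 14
    return (1 << (16 + n)) | (1 << n)
```

For interrupt \#4 these reproduce the two values given in the text,
{\tt 0x80100010} and {\tt 0x00100010}.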
1160
 
The Zip System currently hosts two interrupt controllers, a primary and a
secondary.  The primary interrupt controller has one interrupt line which may
come from an external interrupt controller, and one interrupt line from the
secondary controller.  Other primary interrupts include the system timers,
the jiffies interrupt, and the manual cache interrupt.  The secondary interrupt
controller maintains an interrupt state for all of the processor accounting
counters.

\section{Counter}

The Zip Counter is a very simple counter: it just counts.  It cannot be
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
counter just sets the current value, and it starts counting again from that
value.

Eight counters are implemented in the Zip System for process accounting.
This may change in the future, as nothing as yet uses these counters.

\section{Timer}

The Zip Timer is also very simple: it simply counts down to zero.  When it
transitions from a one to a zero it creates an interrupt.

Writing any non-zero value to the timer starts the timer.  If the high order
bit is set when writing to the timer, the timer becomes an interval timer and
reloads its last start time on any interrupt.  Hence, to mark seconds, one
might set the timer to 100~million (the number of clocks per second), and
set the high bit.  From then on, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock).  This reload capability also limits the
maximum timer value to $2^{31}-1$, rather than $2^{32}-1$.
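
As a minimal sketch of the interval timer setup just described, the C
fragment below builds the word to write to a timer.  It assumes only that
bit~31 selects the reload mode and that the remaining 31~bits hold the count.
The function name and the choice to return zero for an out of range count are
illustrative assumptions, not part of the Zip System sources.
\begin{verbatim}
#include <stdint.h>

#define TIMER_AUTO_RELOAD UINT32_C(0x80000000) /* bit 31 */
#define TIMER_MAX_TICKS   UINT32_C(0x7fffffff) /* 2^31 - 1 */

/* Returns the word to write for an interval timer of `ticks`
 * clocks.  Returns 0 (a value that leaves the timer stopped)
 * if `ticks` is zero or does not fit in 31 bits. */
static uint32_t timer_interval_word(uint32_t ticks)
{
    if (ticks == 0 || ticks > TIMER_MAX_TICKS)
        return 0;
    return TIMER_AUTO_RELOAD | ticks;
}
\end{verbatim}
For the once per second example with a 100~MHz clock,
{\tt timer\_interval\_word(100000000)} evaluates to {\tt 0x85f5e100}.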
 
\section{Watchdog Timer}

The watchdog timer is no different from any of the other timers, save for one
critical difference: the interrupt line from the watchdog
timer is tied to the reset line of the CPU.  Hence writing a `1' to the
watchdog timer will always reset the CPU.
To stop the Watchdog timer, write a `0' to it.  To start it,
write any other number to it, as with the other timers.

While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.

\section{Jiffies}

This peripheral is motivated by the Linux use of `jiffies' whereby a process
can request to be put to sleep until a certain number of `jiffies' have
elapsed.  Using this interface, the CPU can read the number of `jiffies'
from the peripheral (it only has the one location in address space), add the
sleep length to it, and write the result back to the peripheral.  The zipjiffies
peripheral will record the value written to it only if it is nearer the current
counter value than the interrupt time it is currently waiting on.  If no other
interrupts are waiting, and this time is in the future, it will be enabled.
(There is currently no way to disable a Jiffies interrupt once set, other
than to disable the interrupt line in the interrupt controller.)  The processor
may then place this sleep request into a list among other sleep requests.
Once the timer expires, the processor would write the next Jiffies request to
the peripheral and wake up the process whose timer had expired.

Indeed, the Jiffies register is nothing more than a glorified counter with
an interrupt.  Unlike the other counters, the Jiffies register cannot be set.
Writes to the Jiffies register create an interrupt time.  When the Jiffies
register later equals the value written to it, an interrupt will be asserted
and the register then continues counting as though no interrupt had taken
place.

The purpose of this register is to support alarm times within a CPU.  To
set an alarm for a particular process $N$ clocks in advance, read the current
Jiffies value, add $N$, and write the result back to the Jiffies register.  The
O/S must also keep track of values written to the Jiffies register.  Thus,
when an `alarm' trips, it should be removed from the list of alarms, the list
should be sorted, and the next alarm in terms of Jiffies should be written
to the register.
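
Because the Jiffies counter is a free running 32--bit value, any software
that sorts alarm times has to compare them in a way that survives the counter
wrapping around.  The C fragment below is a minimal sketch of one common way
to do this.  The function names are illustrative assumptions, and the
comparison relies on the usual two's complement conversion from
{\tt uint32\_t} to {\tt int32\_t}.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

/* Alarm time that lies n jiffies after `now`.  The addition is
 * allowed to wrap, just as the hardware counter does. */
static uint32_t jiffies_alarm(uint32_t now, uint32_t n)
{
    return now + n;
}

/* True if time a lies strictly after time b, provided the two
 * are less than 2^31 jiffies apart. */
static bool jiffies_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
\end{verbatim}
An O/S alarm list sorted with {\tt jiffies\_after} keeps its order across a
rollover: an alarm at {\tt 0x00000005} still sorts after one at
{\tt 0xfffffff0}.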
 
\section{Direct Memory Access Controller}

The Direct Memory Access (DMA) controller can be used to move memory
from one location to another, to read from a peripheral into memory, or to
write from memory to a peripheral, all without CPU intervention.  Further,
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
any DMA memory move will by nature be faster than a corresponding program
accomplishing the same move.  To put this to numbers, it may take a program
18~clocks per word transferred, whereas this DMA controller can move one
word in two clocks--provided it has bus access.  (The CPU gets priority over the
bus.)

When copying memory from one location to another, the DMA controller will
copy in units of a given transfer length--up to 1024 words at a time.  It will
read that transfer length into its internal buffer, and then write to the
destination address from that buffer.  If the CPU interrupts a DMA transfer,
the controller will release the bus, let the CPU complete whatever it needs to
do, and then restart its transfer by writing the contents of its internal
buffer and then re-entering its read cycle again.

When coupled with a peripheral, the DMA controller can be configured to start
a memory copy on an interrupt line going high.  Further, the controller can be
configured to issue reads from (or writes to) the same address instead of
incrementing the address at each clock.  The DMA completes once the total
number of items specified (not the transfer length) have been transferred.

In each case, once the transfer is complete and the DMA unit returns to
idle, the DMA will issue an interrupt.
 
\chapter{Operation}\label{chap:ops}

The Zip CPU, and even the Zip System, is not a System on a Chip (SoC).  It
needs to be connected to its operational environment in order to be used.
Specifically, some per system adjustments need to be made:
\begin{enumerate}
\item The Zip System depends upon an external 32-bit Wishbone bus.  This
        must exist, and must be connected to the Zip CPU for it to work.
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}.  This is
        the program counter of the first instruction following a reset.
\item If you want the Zip System to start up on its own, you will need to
        set the {\tt START\_HALTED} parameter to zero.  Otherwise, if you
        wish to manually start the CPU, that is if upon reset you want the
        CPU to start in its halted, reset state, then set this parameter to
        one.
\item The third parameter to set is the number of interrupts you will be
        providing from external to the CPU.  This can be anything from one
        to nine, but it cannot be zero.  (Wire this line to a 1'b0 if you
        do not wish to support any external interrupts.)
\item Finally, you need to place into some Wishbone accessible address, whether
        RAM or (more likely) ROM, the initial instructions for the CPU.
\end{enumerate}
If you have enabled your CPU to start automatically, then upon power up the
CPU will immediately start executing your instructions.

This is, however, not how I have used the Zip CPU.  I have instead used the
Zip CPU in a more controlled environment.  For me, the CPU starts in a
halted state, and waits to be told to start.  Further, the RESET address is a
location in RAM.  After bringing up the board I am using, and further the
bus that is on it, the RAM is then loaded externally with the program
I wish the Zip System to run.  Once the RAM is loaded, I release the CPU.
The CPU then runs until its halt condition, at which point its task is
complete.

Eventually, I intend to place an operating system onto the ZipSystem; I'm
just not there yet.

The rest of this chapter examines some common programming constructs, and
how they might be applied to the Zip System.
 
\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.  On the Zip CPU, this task is quite simple,
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt idle\_task:} \\
&        {\em ; Wait for the next interrupt, then switch to supervisor task} \\
&        {\tt WAIT} \\
&        {\em ; When we come back, it's because the supervisor wishes to} \\
&        {\em ; wait for an interrupt again, so go back to the top.} \\
&        {\tt BRA idle\_task} \\
\end{tabular}
\caption{Example Idle Loop}\label{tbl:idle-asm}
\end{center}\end{table}
When this task runs, the CPU will fill up all of the pipeline stages up to the
ALU.  The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
a sleep state where nothing more moves.  Sure, there may be some more settling:
the pipe cache continues to read until full, and other instructions may issue
until the pipeline fills, but then everything will stall.  Then, once an
interrupt takes place, control passes to the supervisor task to handle the
interrupt.
When control passes back to this task, it will be on the next instruction.
Since that next instruction sends us back to the top of the task, the idle
task thus does nothing but wait for an interrupt.

This should be the lowest priority task, the task that runs when nothing else
can.  It will help lower the FPGA power usage overall, or at least its dynamic
power usage.
 
\section{Example: Memory Copy}
One common operation is that of a memory move or copy.  Consider the C code
shown in Tbl.~\ref{tbl:memcp-c}.
\begin{table}\begin{center}
\parbox{4in}{\begin{tabbing}
{\tt void} \= {\tt memcp(void *dest, void *src, int len) \{} \\
        \> {\tt int *d = dest, *s = src;} \\
        \> {\tt for(int i=0; i<len; i++)} \\
        \> \hspace{0.2in} {\tt *d++ = *s++;} \\
\}
\end{tabbing}}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
\end{center}\end{table}
This same code can be translated into Zip Assembly as shown in
Tbl.~\ref{tbl:memcp-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
memcp: \\
&        {\em ; R0 = dest, R1 = src, R2 = LEN} \\
&        {\em ; The following will operate in 17 clocks per word minus one clock} \\
&        {\tt CMP 0,R2} \\
&        {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\
&        {\em ; (One stall on potentially writing to PC)} \\
loop: \\
&        {\tt LOD (R1),R3} \\
&        {\em ; (4 stalls, cannot be scheduled away)} \\
&        {\tt STO R3,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
&        {\tt ADD 1,R0} \\
&        {\tt ADD 1,R1} \\
&        {\tt SUB 1,R2} \\
&        {\tt BNZ loop} \\
&        {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\
&        {\tt RET} \\
\end{tabular}
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
\end{center}\end{table}
This example points out several things associated with the Zip CPU.  First,
a straightforward implementation of a for loop is not the fastest loop
structure.  For this reason, we have placed the test to continue at the
end.  Second, all pointers are {\tt void} pointers to arbitrary 32--bit
data types.  The Zip CPU does not have explicit support for smaller or larger
data types, and so this memory copy cannot be applied at a byte level.
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.
 
\section{Context Switch}

Fundamental to any multitasking system is the ability to switch from one
task to the next.  In the ZipSystem, this is accomplished in a series of
steps, the first of which is that an interrupt happens.  Anytime an interrupt
happens, the CPU needs to execute the following tasks in supervisor mode:
\begin{enumerate}
\item Check for a trap instruction.  That is, if the user task requested a
        trap, we may not wish to adjust the context, check interrupts, or call
        the scheduler.  Tbl.~\ref{tbl:trap-check}
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt return\_to\_user:} \\
&       {\em ; The instruction before the context switch processing must} \\
&       {\em ; be the RTU instruction that enacted user mode in the first} \\
&       {\em ; place.  We show it here just for reference.} \\
&       {\tt RTU} \\
{\tt trap\_check:} \\
&       {\tt MOV uCC,R0} \\
&       {\tt TST \$TRAP,R0} \\
&       {\em ; No trap was requested, so go on to the context swap} \\
&       {\tt BZ swap\_out} \\
&       {\em ; Do something here to execute the trap} \\
&       {\em ; Don't need to call the scheduler, so we can just return} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check}
\end{center}\end{table}
        shows the rudiments of this code, while showing nothing of how the
        actual trap would be implemented.

You may also wish to note that the instruction before the first instruction
in our context swap {\em must be} a return to userspace instruction.
Remember, the supervisor process is re--entered where it left off.  This is
different from many other processors that enter interrupt mode at some vector
or other.  In this case, we always enter supervisor mode right where we last
left.\footnote{The one exception to this rule is upon reset where supervisor
mode is entered at a pre--programmed Wishbone memory address.}
 
\item Capture user counters.  If the operating system is keeping track of
        system usage via the accounting counters, those counters need to be
        copied and accumulated into some master counter at this point.

\item Preserve the old context.  This involves pushing all the user registers
        onto the user stack and then copying the resulting stack address
        into the task's task structure, as shown in Tbl.~\ref{tbl:context-out}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_out:} \\
&        {\tt MOV -15(uSP),R1} \\
&        {\tt STO R1,stack(R12)} \\
&        {\tt MOV uPC,R0} \\
&        {\tt STO R0,15(R1)} \\
&        {\tt MOV uCC,R0} \\
&        {\tt STO R0,14(R1)} \\
&       {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
&       {\em ; elsewhere (in the task structure) }\\
&        {\tt MOV uR12,R0} \\
&        {\tt STO R0,13(R1)} \\
        & \ldots {\em ; Need to repeat for all user registers} \\
&        {\tt MOV uR0,R0} \\
&        {\tt STO R0,1(R1)} \\
\end{tabular}
\caption{Example Storing User Task Context}\label{tbl:context-out}
\end{center}\end{table}
For the sake of discussion, we assume the supervisor maintains a
pointer to the current task's structure in supervisor register
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
structure indicating where the stack pointer is to be kept within it.

        For those who are still interested, the full code for this context
        save can be found as an assembler macro within the assembler
        include file, {\tt sys.i}.

\item Reset the watchdog timer.  If you are using the watchdog timer, it should
        be reset on a context swap, to know that things are still working.
        Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
\begin{table}\begin{center}
\begin{tabular}{ll}
\multicolumn{2}{l}{{\tt `define WATCHDOG\_ADDRESS 32'hc000\_0001}}\\
\multicolumn{2}{l}{{\tt `define WATCHDOG\_TICKS 32'd1\_000\_000} {\em ; = 10 ms}}\\
&       {\tt LDI WATCHDOG\_ADDRESS,R0} \\
&       {\tt LDI WATCHDOG\_TICKS,R1} \\
&       {\tt STO R1,(R0)}
\end{tabular}
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
\end{center}\end{table}
 
\item Interrupt handling.  An interrupt handler within the Zip System is nothing
        more than a task.  At context swap time, the supervisor needs to
        disable all of the interrupts that have tripped, and then enable
        all of the tasks that would deal with each of these interrupts.
        These can be user tasks, run at higher priority than any other user
        tasks.  Either way, they will need to re--enable their own interrupt
        themselves, if the interrupt is still relevant.

        An example of this master interrupt handling is shown in
        Tbl.~\ref{tbl:pre-handler}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt pre\_handler:} \\
&       {\tt LDI PIC\_ADDRESS,R0 } \\
&       {\em ; Start by grabbing the interrupt state from the interrupt}\\
&       {\em ; controller.  We'll store this into the register R7 so that }\\
&       {\em ; we can keep and preserve this information for the scheduler}\\
&       {\em ; to use later. }\\
&       {\tt LOD (R0),R1} \\
&       {\tt MOV R1,R7 } \\
&       {\em ; As a next step, we need to acknowledge and disable all active}\\
&       {\em ; interrupts. We'll start by calculating all of our active}\\
&       {\em ; interrupts.}\\
&       {\tt AND 0x07fff,R1 } \\
&       {\em ; Put the active interrupts into the upper half of R1} \\
&       {\tt ROL 16,R1 } \\
&       {\tt LDILO 0x0ffff,R1   } \\
&       {\tt AND R7,R1}\\
&       {\em ; Acknowledge and disable active interrupts}\\
&       {\em ; This also disables all interrupts from the controller, so}\\
&       {\em ; we'll need to re-enable interrupts in general shortly } \\
&       {\tt STO R1,(R0) } \\
&       {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
&       {\em ; release any tasks that depended upon them. } \\
\end{tabular}
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
\end{center}\end{table}
 
\item Calling the scheduler.  This needs to be done to pick the next task
        to switch to.  It may be an interrupt handler, or it may be a normal
        user task.  From a priority standpoint, it would make sense that the
        interrupt handlers all have a higher priority than the user tasks,
        and that once they have been called the user tasks may then be called
        again.  If no task is ready to run, run the idle task to wait for an
        interrupt.

        This suggests a minimum of four task priorities:
        \begin{enumerate}
        \item Interrupt handlers, executed with their interrupts disabled
        \item Device drivers, executed with interrupts re-enabled
        \item User tasks
        \item The idle task, executed when nothing else is able to execute
        \end{enumerate}

        For our purposes here, we'll just assume that a pointer to the current
        task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
        called, and that the next current task is likewise placed into
        {\tt R12}.

\item Restore the new task's context.  Given that the scheduler has returned a
        task that can be run at this time, the stack pointer needs to be
        pulled out of the task's task structure, placed into the user
        register, and then the rest of the user registers need to be popped
        back off of the stack to run this task.  An example of this is
        shown in Tbl.~\ref{tbl:context-in},
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_in:} \\
&       {\tt LOD stack(R12),R1} \\
&       {\tt MOV 15(R1),uSP} \\
&       {\tt LOD 15(R1),R0} \\
&       {\tt MOV R0,uPC} \\
&       {\tt LOD 14(R1),R0} \\
&       {\tt MOV R0,uCC} \\
&       {\tt LOD 13(R1),R0} \\
&       {\tt MOV R0,uR12} \\
        & \ldots {\em ; Need to repeat for all user registers} \\
&       {\tt LOD 1(R1),R0} \\
&       {\tt MOV R0,uR0} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Example Restoring User Task Context}\label{tbl:context-in}
\end{center}\end{table}
        assuming as before that the task
        pointer is found in supervisor register {\tt R12}.
        As with storing the user context, the full code associated with
        restoring the user context can be found in the assembler include
        file, {\tt sys.i}.

\item Clear the userspace accounting registers.  In order to keep track of
        per process system usage, these registers need to be cleared before
        reactivating the userspace process.  That way, upon the next
        interrupt, we'll know how many clocks the userspace program has
        encountered, and how many instructions it was able to issue in
        that many clocks.

\item Jump back to the instruction just before saving the last task's context,
        because that location in memory contains the return from interrupt
        command that we are going to need to execute, in order to guarantee
        that we return back here again.
\end{enumerate}
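
The mask arithmetic in Tbl.~\ref{tbl:pre-handler} may be easier to follow
when written out in C.  The fragment below is a minimal sketch that mirrors
the {\tt AND}, {\tt ROL}, {\tt LDILO} and {\tt AND} sequence of that table one
step at a time.  The function name {\tt pic\_ack\_word} is an illustrative
assumption and does not appear in the Zip System sources.
\begin{verbatim}
#include <stdint.h>

/* Given the word read from the interrupt controller, build the
 * word that acknowledges every active interrupt and disables
 * those active interrupts that are currently enabled.  Bit 31
 * of the result is zero, so the write also turns off the
 * master enable until the supervisor restores it. */
static uint32_t pic_ack_word(uint32_t state)
{
    /* AND 0x07fff,R1 : keep the individual active bits */
    uint32_t active = state & UINT32_C(0x7fff);
    /* ROL 16,R1 ; LDILO 0x0ffff,R1 : active bits move to the
     * enable field, low half becomes all ones */
    uint32_t mask = (active << 16) | UINT32_C(0xffff);
    /* AND R7,R1 : restrict to what the controller reported */
    return mask & state;
}
\end{verbatim}
For instance, a controller state of {\tt 0x800f8005} (interrupts 0 through 3
enabled, interrupts 0 and 2 active) produces {\tt 0x00058005}.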
 
\chapter{Registers}\label{chap:regs}

The ZipSystem registers fall into two categories, ZipSystem internal registers
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
\begin{table}[htbp]
\begin{center}\begin{reglist}
PIC   & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT   & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
  & \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA  & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRB  & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRC  & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
JIFF  & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
MTASK  & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
MMSTL  & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
MPSTL  & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MICNT  & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
UTASK  & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL  & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UPSTL  & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT  & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL  & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMALEN  & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMASRC  & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMADST  & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
% Cache  & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
\end{reglist}
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
\begin{table}[htbp]
\begin{center}\begin{reglist}
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
\end{reglist}
\caption{Zip System Debug Registers}\label{tbl:dbgregs}
\end{center}\end{table}
 
\section{Peripheral Registers}
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
CPU's address space.  These may be accessed by the CPU at these addresses,
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
These registers will be discussed briefly again here.

The Zip CPU Interrupt controller has four different types of bits, as shown in
Tbl.~\ref{tbl:picbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Master Interrupt Enable\\\hline
30\ldots 16 & R/W & Interrupt Enables, write `1' to change\\\hline
15 & R & Current Master Interrupt State\\\hline
14\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline
\end{bitlist}
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
\end{center}\end{table}
The high order bit, or bit--31, is the master interrupt enable bit.  When this
bit is set, then any time an interrupt occurs the CPU will be interrupted and
will switch to supervisor mode, etc.

Bits 30\ldots 16 are interrupt enable bits.  Should the interrupt line go
high while enabled, an interrupt will be generated.  To set an interrupt enable
bit, one needs to write a `1' to the master interrupt enable while writing a
`1' to this bit.  To clear it, one need only write a `0' to the master
interrupt enable while writing a `1' to this bit.

Bits 14\ldots 0 are the current state of the interrupt vector.  Interrupt lines
trip when they go high, and remain tripped until they are acknowledged.  If
an interrupt line stays high for longer than one pulse, it may be high when a
clear is requested.  If so, the interrupt will not clear.  The line must go low
again before the status bit can be cleared.

As an example, consider the following scenario where the Zip CPU supports four
interrupts, 3\ldots0.
\begin{enumerate}
\item The Supervisor will first, while in the interrupts disabled mode,
        write a {\tt 32'h800f000f} to the controller.  The supervisor may then
        switch to the user state with interrupts enabled.
\item When an interrupt occurs, the supervisor will switch to the interrupt
        state.  It will then cycle through the interrupt bits to learn which
        interrupt handler to call.
\item If the interrupt handler expects more interrupts, it will clear its
        current interrupt when it is done handling the interrupt in question.
        To do this, it will write a `1' to the low order interrupt mask,
        such as writing a {\tt 32'h80000001}.
\item If the interrupt handler does not expect any more interrupts, it will
        instead clear the interrupt from the controller by writing a
        {\tt 32'h00010001} to the controller.
\item Once all interrupts have been handled, the supervisor will write a
        {\tt 32'h80000000} to the interrupt register to re-enable interrupt
        generation.
\item The supervisor should also check the user trap bit, and possibly soft
        interrupt bits here, but this action has nothing to do with the
        interrupt control register.
\item The supervisor will then leave interrupt mode, possibly adjusting
        whichever task is running, by executing a return from interrupt
        command.
\end{enumerate}
 
Leaving the interrupt controller, we show the timer register's bit definitions
in Tbl.~\ref{tbl:tmrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Auto-Reload\\\hline
30\ldots 0 & R/W & Current timer value\\\hline
\end{bitlist}
\caption{Timer Register Bits}\label{tbl:tmrbits}
\end{center}\end{table}
As you may recall, the timer just counts down to zero and then trips an
interrupt.  Writing to the current timer value sets that value, and reading
from it returns that value.  Writing to the current timer value while also
setting the auto--reload bit will send the timer into an auto--reload mode.
In this mode, upon setting its interrupt bit for one cycle, the timer will
also reset itself back to the value of the timer that was written to it when
the auto--reload option was written to it.  To clear and stop the timer,
simply write a `32'h00' to this register.

The Jiffies register is somewhat similar in that the register always changes.
In this case, the register counts up, whereas the timer always counted down.
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R & Current jiffy value\\\hline
31\ldots 0 & W & Value/time of next interrupt\\\hline
\end{bitlist}
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
\end{center}\end{table}
always return the time value contained in the register.  Writes greater than
the current Jiffy value, that is where the new value minus the old value is
greater than zero while ignoring truncation, will set a new Jiffy interrupt
time.  At that time, the Jiffy vector will clear, and another interrupt time
may either be written to it, or it will just continue counting without
activating any more interrupts.

The Zip CPU also supports several counter peripherals, mostly in the way of
process accounting.  These peripherals have a single register associated with
them, shown in Tbl.~\ref{tbl:ctrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R/W & Current counter value\\\hline
\end{bitlist}
\caption{Counter Register Bits}\label{tbl:ctrbits}
\end{center}\end{table}
Writes to this register set the new counter value.  Reads read the current
counter value.

The current design operation of these counters is that of performance counting.
Two sets of four registers are available for keeping track of performance.
The first is a task counter.  This just counts clock ticks.  The second
and third are stall counters: a master stall counter and a prefetch stall
counter.  These allow the CPU to be evaluated as to how efficient it is.  The
fourth and final counter is an instruction counter, which counts how many
instructions the CPU has issued.

It is envisioned that these counters will be used as follows: First, every time
a master counter rolls over, the supervisor (Operating System) will record
the fact.  Second, whenever activating a user task, the Operating System will
set the four user counters to zero.  When the user task has completed, the
Operating System will read the counters back, to determine how much of the
CPU the process had consumed.
 
1726 36 dgisselq
The final peripheral to discuss is the DMA controller.  This controller
1727
has four registers.  Of these four, the length, source and destination address
1728
registers should need no further explanation.  They are full 32--bit registers
1729
specifying the entire transfer length, the starting address to read from, and
1730
the starting address to write to.  The registers can be written to when the
1731
DMA is idle, and read at any time.  The control register, however, will need
1732
some more explanation.
1733
 
1734
The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
1735
\begin{table}\begin{center}
1736
\begin{bitlist}
1737
31 & R & DMA Active\\\hline
1738
30 & R & Wishbone error, transaction aborted (cleared on any write)\\\hline
1739
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline
1740
28 & R/W & Set to '1' to prevent the controller from incrementing the
        destination address, '0' for normal memory copy. \\\hline
27 \ldots 16 & W & The DMA Key.  Write a 12'hfed to these bits to
        activate any DMA transfer.  \\\hline
1744
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline
1745
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
1746
        have yet to be written. \\\hline
1747
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately
1748
        upon receiving a valid key.\\\hline
1749
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
1750
9\ldots 0 & R/W & Intermediate transfer length minus one.  Thus, to transfer
        one item at a time set this value to 0. To transfer 1024 at a time,
        set it to 1023.\\\hline
1753
\end{bitlist}
1754
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
1755
\end{center}\end{table}
1756
This control register has been designed so that the common case of memory
1757
access need only set the key and the transfer length.  Hence, writing a
1758
\hbox{32'h0fed03ff} to the control register will start any memory transfer.
1759
On the other hand, if you wished to read from a serial port (constant address)
1760
and put the result into a buffer every time a word was available, you
1761
might wish to write \hbox{32'h2fed8000}--this assumes, of course, that you
1762
have a serial port wired to the zero bit of this interrupt control.  (The
1763
DMA controller does not use the interrupt controller, and cannot clear
1764
interrupts.)  As a third example, if you wished to write to an external
1765
FIFO anytime it was less than half full (had fewer than 512 items), and
1766
interrupt line 3 indicated this condition, you might wish to issue a
1767
\hbox{32'h1fed8dff} to this port.
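
The control words quoted above can be reproduced from the bit fields in
Tbl.~\ref{tbl:dmacbits}.  The helper below is a minimal sketch in Python; the
function and its argument names are hypothetical.  It assumes, as the hex
examples suggest, that setting bit 29 or bit 28 holds the corresponding
address fixed.

```python
DMA_KEY = 0xFED  # 12-bit key, written to bits 27..16


def dma_control(chunk, fixed_src=False, fixed_dst=False,
                on_interrupt=False, int_line=0):
    """Build a DMA control word from the control register bit fields.

    chunk is the intermediate transfer length, from 1 to 1024 items.
    """
    if not 1 <= chunk <= 1024:
        raise ValueError("chunk must be between 1 and 1024")
    if not 0 <= int_line <= 31:
        raise ValueError("int_line must be between 0 and 31")
    word = DMA_KEY << 16
    word |= int(fixed_src) << 29     # bit 29: do not increment the source
    word |= int(fixed_dst) << 28     # bit 28: do not increment the destination
    word |= int(on_interrupt) << 15  # bit 15: wait for an interrupt
    word |= int_line << 10           # bits 14..10: interrupt line select
    word |= chunk - 1                # bits 9..0: transfer length minus one
    return word
```

With these field positions, {\tt dma\_control(1024)} yields 32'h0fed03ff, and
the serial port example is a fixed source triggered by interrupt line 0 with
a chunk of one.  The word 32'h1fed8dff decodes to a fixed destination, a
chunk of 512, and a value of 3 in the interrupt select field.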
1768
 
1769 33 dgisselq
\section{Debug Port Registers}
1770
Accessing the Zip System via the debug port isn't as straightforward as
1771
accessing the system via the wishbone bus.  The debug port itself has been
1772
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
1773
Access to the Zip System begins with the Debug Control register, shown in
1774
Tbl.~\ref{tbl:dbgctrl}.
1775
\begin{table}\begin{center}
1776
\begin{bitlist}
1777
31\ldots 14 & R & Reserved\\\hline
1778
13 & R & CPU GIE setting\\\hline
1779
12 & R & CPU is sleeping\\\hline
1780
11 & W & Command clear PF cache\\\hline
1781
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline
1782
9 & R & Stall Status, '1' if CPU is busy\\\hline
1783 36 dgisselq
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline
1784 33 dgisselq
7 & R & Interrupt Request \\\hline
1785
6 & R/W & Command RESET \\\hline
1786
5\ldots 0 & R/W & Debug Register Address \\\hline
1787
\end{bitlist}
1788
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
1789
\end{center}\end{table}
1790
 
1791
The first step in debugging access is to determine whether or not the CPU
1792
is halted, and to halt it if not.  To do this, first write a '1' to the
1793
Command HALT bit.  This will halt the CPU and place it into debug mode.
1794
Once the CPU is halted, the stall status bit will drop to zero.  Thus,
1795
if bit 10 is high and bit 9 low, the debug port is open to examine the
1796
internal state of the CPU.
1797
 
1798
At this point, the external debugger may examine internal state information
1799
from within the CPU.  To do this, write to the command register again, with
command halt still high, a value containing the address of the internal
register of interest in the bottom 6~bits.  Internal registers that may be
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
1803
\begin{table}\begin{center}
1804
\begin{reglist}
1805
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
1806
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
1807
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
1808
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
1809
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
1810
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
1811
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
1812
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
1813
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
1814
uPC & 31 & 32 & R/W & User Program Counter\\\hline
1815
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
1816
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
1817
CCHE & 34 & 32 & R/W & Manual Cache Controller\\\hline
1818
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
1819
TMRA & 36 & 32 & R/W & Timer A\\\hline
1820
TMRB & 37 & 32 & R/W & Timer B\\\hline
1821
TMRC & 38 & 32 & R/W & Timer C\\\hline
1822
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
1823
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
1824
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
1825
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
1826
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
1827
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
1828
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
1829
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
1830
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
1831
\end{reglist}
1832
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
1833
\end{center}\end{table}
1834
Primarily, these ``registers'' include access to the entire CPU register
1835 36 dgisselq
set, as well as the internal peripherals.  To read one of these registers
1836 33 dgisselq
once the address is set, simply issue a read from the data port.  To write
1837
one of these registers or peripheral ports, simply write to the data port
1838
after setting the proper address.
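
The halt, address, and data steps above can be sketched as follows.  This is
an illustration only: the bus object is a hypothetical stand--in for whatever
link reaches the debug port, and the two port addresses (0 for the control
register, 1 for the data register) are assumptions rather than values taken
from this section.

```python
CMD_HALT = 1 << 10   # Command HALT
CMD_STALL = 1 << 9   # Stall status, '1' while the CPU is busy
ADDR_MASK = 0x3F     # bits 5..0: debug register address

CTRL_PORT = 0        # assumed address of the control register
DATA_PORT = 1        # assumed address of the data register


class FakeDebugBus:
    """Stand-in for a Wishbone debug link, for illustration only."""

    def __init__(self):
        self.ctrl = 0
        self.regs = [0] * 64

    def write(self, port, value):
        if port == CTRL_PORT:
            self.ctrl = value & (CMD_HALT | ADDR_MASK)
        else:
            self.regs[self.ctrl & ADDR_MASK] = value & 0xFFFFFFFF

    def read(self, port):
        if port == CTRL_PORT:
            return self.ctrl  # this model never reports a stall
        return self.regs[self.ctrl & ADDR_MASK]


def halt_cpu(bus):
    """Request a halt, then wait for halt high and stall low."""
    bus.write(CTRL_PORT, CMD_HALT)
    while True:
        status = bus.read(CTRL_PORT)
        if (status & CMD_HALT) and not (status & CMD_STALL):
            return


def read_register(bus, addr):
    """Select a debug register, keeping halt high, then read the data port."""
    bus.write(CTRL_PORT, CMD_HALT | (addr & ADDR_MASK))
    return bus.read(DATA_PORT)


def write_register(bus, addr, value):
    """Select a debug register, keeping halt high, then write the data port."""
    bus.write(CTRL_PORT, CMD_HALT | (addr & ADDR_MASK))
    bus.write(DATA_PORT, value)
```

A real debugger would replace {\tt FakeDebugBus} with its own transport and
would probably bound the polling loop with a timeout.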
1839
 
1840
In this manner, all of the CPU's internal state may be read and adjusted.
1841
 
1842
As an example of how to use this, consider what would happen in the case
1843
of an external breakpoint.  If and when the CPU hits a breakpoint that
causes it to halt, the Command HALT bit will activate on its own.  The CPU
will then raise an external interrupt line and wait for a debugger to examine
1846
its state.  After examining the state, the debugger will need to remove
1847
the breakpoint by writing a different instruction into memory and by writing
1848
to the command register while holding the clear cache, command halt, and
1849
step CPU bits high (32'hd00).  The debugger may then replace the breakpoint
1850
now that the CPU has gone beyond it, and clear the cache again (32'hc00).
1851
 
1852
To leave this debug mode, simply write a `32'h0' value to the command register.
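
The command words used in this procedure follow directly from the bit
positions in Tbl.~\ref{tbl:dbgctrl}.  A small sketch, with hypothetical
constant and function names:

```python
CMD_CLEAR_CACHE = 1 << 11  # Command clear PF cache
CMD_HALT = 1 << 10         # Command HALT
CMD_STEP = 1 << 8          # Step Command
CMD_RESET = 1 << 6         # Command RESET


def debug_command(*flags, addr=0):
    """OR together command flags and a 6-bit debug register address."""
    word = addr & 0x3F
    for flag in flags:
        word |= flag
    return word


# Clear the prefetch cache and step once while remaining halted.
STEP_PAST_BREAKPOINT = debug_command(CMD_CLEAR_CACHE, CMD_HALT, CMD_STEP)

# Writing zero drops the halt request and leaves debug mode.
RELEASE = debug_command()
```

Here {\tt STEP\_PAST\_BREAKPOINT} evaluates to 32'hd00 and {\tt RELEASE} to
32'h0, matching the values given in the text.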
1853
 
1854
\chapter{Wishbone Datasheets}\label{chap:wishbone}
1855 32 dgisselq
The Zip System supports two wishbone ports, a slave debug port and a master
1856 21 dgisselq
port for the system itself.  These are shown in Tbl.~\ref{tbl:wishbone-slave}
1857
\begin{table}[htbp]
1858
\begin{center}
1859
\begin{wishboneds}
1860
Revision level of wishbone & WB B4 spec \\\hline
1861
Type of interface & Slave, Read/Write, single words only \\\hline
1862 24 dgisselq
Address Width & 1--bit \\\hline
1863 21 dgisselq
Port size & 32--bit \\\hline
1864
Port granularity & 32--bit \\\hline
1865
Maximum Operand Size & 32--bit \\\hline
1866
Data transfer ordering & (Irrelevant) \\\hline
1867
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
1868
Signal Names & \begin{tabular}{ll}
1869
                Signal Name & Wishbone Equivalent \\\hline
1870
                {\tt i\_clk} & {\tt CLK\_I} \\
1871
                {\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
1872
                {\tt i\_dbg\_stb} & {\tt STB\_I} \\
1873
                {\tt i\_dbg\_we} & {\tt WE\_I} \\
1874
                {\tt i\_dbg\_addr} & {\tt ADR\_I} \\
1875
                {\tt i\_dbg\_data} & {\tt DAT\_I} \\
1876
                {\tt o\_dbg\_ack} & {\tt ACK\_O} \\
1877
                {\tt o\_dbg\_stall} & {\tt STALL\_O} \\
1878
                {\tt o\_dbg\_data} & {\tt DAT\_O}
1879
                \end{tabular}\\\hline
1880
\end{wishboneds}
1881 22 dgisselq
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
1882 21 dgisselq
\end{center}\end{table}
1883
and Tbl.~\ref{tbl:wishbone-master} respectively.
1884
\begin{table}[htbp]
1885
\begin{center}
1886
\begin{wishboneds}
1887
Revision level of wishbone & WB B4 spec \\\hline
1888 24 dgisselq
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
1889
Address Width & 32--bit \\\hline
1890 21 dgisselq
Port size & 32--bit \\\hline
1891
Port granularity & 32--bit \\\hline
1892
Maximum Operand Size & 32--bit \\\hline
1893
Data transfer ordering & (Irrelevant) \\\hline
1894
Clock constraints & Works at 100~MHz on a Basys--3 board\\\hline
1895
Signal Names & \begin{tabular}{ll}
1896
                Signal Name & Wishbone Equivalent \\\hline
1897
                {\tt i\_clk} & {\tt CLK\_I} \\
1898
                {\tt o\_wb\_cyc} & {\tt CYC\_O} \\
1899
                {\tt o\_wb\_stb} & {\tt STB\_O} \\
1900
                {\tt o\_wb\_we} & {\tt WE\_O} \\
1901
                {\tt o\_wb\_addr} & {\tt ADR\_O} \\
1902
                {\tt o\_wb\_data} & {\tt DAT\_O} \\
1903
                {\tt i\_wb\_ack} & {\tt ACK\_I} \\
1904
                {\tt i\_wb\_stall} & {\tt STALL\_I} \\
1905
                {\tt i\_wb\_data} & {\tt DAT\_I}
1906
                \end{tabular}\\\hline
1907
\end{wishboneds}
1908 22 dgisselq
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
1909 21 dgisselq
\end{center}\end{table}
1910
I do not recommend that you connect these together through the interconnect.
1911 24 dgisselq
Rather, the debug port of the CPU should be accessible regardless of the state
1912
of the master bus.
1913 21 dgisselq
 
1914 24 dgisselq
You may wish to notice that neither the {\tt ERR} nor the {\tt RETRY} wires
1915
have been implemented.  What this means is that the CPU is currently unable
1916
to detect a bus error condition, and so may stall indefinitely (hang) should
1917
it choose to access a value not on the bus, or a peripheral that is not
1918
yet properly configured.
1919 21 dgisselq
 
1920
\chapter{Clocks}\label{chap:clocks}
1921
 
1922 32 dgisselq
This core is based upon the Basys--3 development board sold by Digilent.
1923
The Basys--3 development board contains one external 100~MHz clock, which is
1924 36 dgisselq
sufficient to run the Zip CPU core.
1925 21 dgisselq
\begin{table}[htbp]
1926
\begin{center}
1927
\begin{clocklist}
1928
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
1929
\end{clocklist}
1930
\caption{List of Clocks}\label{tbl:clocks}
1931
\end{center}\end{table}
1932
I hesitate to suggest that the core can run faster than 100~MHz, since I have
struggled with various timing violations to keep it at 100~MHz.  So, for
1934
now, I will only state that it can run at 100~MHz.
1935
 
1936
 
1937
\chapter{I/O Ports}\label{chap:ioports}
1938 33 dgisselq
The I/O ports to the Zip CPU may be grouped into three categories.  The first
1939
is that of the master wishbone used by the CPU, then the slave wishbone used
1940
to command the CPU via a debugger, and then the rest.  The first two of these
1941
were already discussed in the wishbone chapter.  They are listed here
1942
for completeness in Tbl.~\ref{tbl:iowb-master}
1943
\begin{table}
1944
\begin{center}\begin{portlist}
1945
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
1946
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
1947
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
1948
{\tt o\_wb\_addr}  & 32 & Output & Bus address \\\hline
1949
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
1950
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
1951
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
1952
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
1953
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
1954
and~\ref{tbl:iowb-slave} respectively.
1955
\begin{table}
1956
\begin{center}\begin{portlist}
1957
{\tt i\_dbg\_cyc}   &  1 & Input & Indicates an active Wishbone cycle\\\hline
{\tt i\_dbg\_stb}   &  1 & Input & WB Strobe signal\\\hline
{\tt i\_dbg\_we}    &  1 & Input & Write enable\\\hline
{\tt i\_dbg\_addr}  &  1 & Input & Bus address, command or data port \\\hline
{\tt i\_dbg\_data}  & 32 & Input & Data on WB write\\\hline
{\tt o\_dbg\_ack}   &  1 & Output  & Slave has completed a R/W cycle\\\hline
{\tt o\_dbg\_stall} &  1 & Output  & WB bus slave not ready\\\hline
{\tt o\_dbg\_data}  & 32 & Output  & Outgoing bus data\\\hline
1965
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}
1966 21 dgisselq
 
1967 33 dgisselq
There are only four other lines to the CPU: the external clock, external
1968
reset, incoming external interrupt line(s), and the outgoing debug interrupt
1969
line.  These are shown in Tbl.~\ref{tbl:ioports}.
1970
\begin{table}
1971
\begin{center}\begin{portlist}
1972
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
1973
{\tt i\_rst} & 1 & Input &  Active high reset line \\\hline
1974
{\tt i\_ext\_int} & 1\ldots 6 & Input &  Incoming external interrupts \\\hline
1975
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
1976
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
1977
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}.  We
1978
typically run it at 100~MHz.  The reset line is an active high reset.  When
asserted, the CPU will restart from its reset address in memory.  Whether it
then starts running automatically depends upon how the CPU is configured,
specifically upon the {\tt START\_HALTED} parameter.
The {\tt i\_ext\_int} line is for external interrupts.  This
line may be as wide as 6~external interrupts, depending upon the setting of
the {\tt EXTERNAL\_INTERRUPTS} parameter.  As currently configured, the ZipSystem
1985
only supports one such interrupt line by default.  For us, this line is the
1986
output of another interrupt controller, but that's a board specific setup
1987
detail.  Finally, the Zip System produces one external interrupt whenever
1988
the CPU halts to wait for the debugger.
1989
 
1990 36 dgisselq
\chapter{Initial Assessment}\label{chap:assessment}
1991
 
1992
Having now worked with the Zip CPU for a while, it is worth offering an
1993
honest assessment of how well it works and how well it was designed. At the
1994
end of this assessment, I will propose some changes that may take place in a
1995
later version of this Zip CPU to make it better.
1996
 
1997
\section{The Good}
1998
\begin{itemize}
1999
\item The Zip CPU is light weight and fully featured as it exists today. For
2000
        anyone who wishes to build a general purpose CPU and then to
2001
        experiment with building and adding particular features, the Zip CPU
2002
        makes a good starting point--it is fairly simple. Modifications should
2003
        be simple enough.
2004
\item As an estimate of the ``weight'' of this implementation, the CPU has
2005
        cost me less than 150 hours to implement from its inception.
2006
\item The Zip CPU was designed to be an implementable soft core that could be
2007
        placed within an FPGA, controlling actions internal to the FPGA. It
2008
        fits this role rather nicely. It does not fit the role of a system on
2009
        a chip very well, but then it was never intended to be a system on a
2010
        chip but rather a system within a chip.
2011
\item The extremely simplified instruction set of the Zip CPU was a good
2012
        choice. Although it does not have many of the commonly used
2013
        instructions, PUSH, POP, JSR, and RET among them, the simplified
2014
        instruction set has demonstrated an amazing versatility. I will
        therefore contend, to anyone who will listen, that this instruction set
2016
        offers a full and complete capability for whatever a user might wish
2017
        to do with two exceptions: bytewise character access and accelerated
2018
        floating-point support.
2019
\item This simplified instruction set is easy to decode.
2020
\item The simplified bus transactions (32-bit words only) were also very easy
2021
        to implement.
2022
\item The novel approach of having a single interrupt vector, which just
2023
        brings the CPU back to the instruction it left off at within the last
2024
        interrupt context, doesn't appear to have been that much of a problem.
2025
        If most modern systems handle interrupt vectoring in software anyway,
2026
        why maintain hardware support for it?
2027
\item My goal of a high rate of instructions per clock may not be the proper
2028
        measure. For example, if instructions are being read from a SPI flash
2029
        device, such as is common among FPGA implementations, these same
2030
        instructions may suffer stalls of between 64 and 128 cycles per
2031
        instruction just to read the instruction from the flash. Executing the
2032
        instruction in a single clock cycle is no longer the appropriate
2033
        measure. At the same time, it should be possible to use the DMA
2034
        peripheral to copy instructions from the FLASH to a temporary memory
2035
        location, after which they may again be executed at a single clock
        cycle per instruction.
2037
\end{itemize}
2038
 
2039
\section{The Not so Good}
2040
\begin{itemize}
2041
\item While one of the stated goals was to use a small amount of logic,
2042
        3k~LUTs isn't that impressively small. Indeed, it's really much
2043
        too expensive when compared against other 8- and 16-bit CPUs that use
        fewer than 1k~LUTs.
2045
 
2046
        Still, \ldots it's not bad, it's just not astonishingly good.
2047
 
2048
\item The fact that the instruction width equals the bus width means that the
2049
        instruction fetch cycle will always be interfering with any load or
2050
        store memory operation, with the only exception being if the
2051
        instruction is already in the cache.  {\em This has become the
2052
        fundamental limit on the speed and performance of the CPU!}
2053
        Those familiar with the von~Neumann approach of sharing a bus
2054
        between data and instructions will not be surprised by this assessment.
2055
 
2056
        This could be fixed in one of three ways: the instruction set
2057
        architecture could be modified to handle Very Long Instruction Words
2058
        (VLIW) so that each 32--bit word would encode two or more instructions,
2059
        the instruction fetch bus width could be increased from 32--bits to
2060
        64--bits or more, or the instruction bus could be separated from the
2061
        data bus.  Any and all of these approaches would increase the overall
2062
        LUT count.
2063
 
2064
\item The (non-existent) floating point unit was an after-thought, isn't even
        built as a potential option, and most likely won't support the full
        IEEE standard set of FPU instructions--even for single precision.
        This (non-existent) capability would benefit the most from an
        out-of-order execution capability, which the Zip CPU does not have.
2069
 
2070
        Still, sharing FPU registers with the main register set was a good
2071
        idea and worth preserving, as it simplifies context swapping.
2072
 
2073
        Perhaps this really isn't a problem, but rather a feature.  By not
2074
        implementing FPU instructions, the Zip CPU maintains a lower LUT count
2075
        than it would have if it did implement these instructions.
2076
 
2077
\item The CPU has no character support. This is both good and bad.
2078
        Realistically, the CPU works just fine without it. Characters can be
2079
        supported as subsets of 32-bit words without any problem. Practically,
2080
        though, it will make compiling non-Zip CPU code difficult--especially
2081
        anything that assumes sizeof(int)=4*sizeof(char), or that tries to
2082
        create unions with characters and integers and then attempts to
2083
        reference the address of the characters within that union.
2084
 
2085
\item The Zip CPU does not support a data cache. One can still be built
2086
        externally, but this is a limitation of the CPU proper as built.
2087
        Further, under the theory of the Zip CPU design (that of an embedded
2088
        soft-core processor within an FPGA, where any ``address'' may reference
2089
        either memory or a peripheral that may have side-effects), any data
2090
        cache would need to be based upon an initial knowledge of whether or
2091
        not it is supporting memory (cachable) or peripherals. This knowledge
2092
        must exist somewhere, and that somewhere is currently (and by design)
2093
        external to the CPU.
2094
 
2095
        This may also be written off as a ``feature'' of the Zip CPU, since
2096
        the addition of a data cache can greatly increase the LUT count of
2097
        a soft core.
2098
 
2099
\item Many other instruction sets offer three operand instructions, whereas
2100
        the Zip CPU only offers two operand instructions. This means that it
2101
        takes the Zip CPU more instructions to do many of the same operations.
2102
        The good part of this is that it gives the Zip CPU a greater amount of
2103
        flexibility in its immediate operand mode, although that increased
2104
        flexibility isn't necessarily as valuable as one might like.
2105
 
2106
\item The Zip CPU does not currently detect and trap on either illegal
2107
        instructions or bus errors.  Attempts to access non--existent
2108
        memory quietly return erroneous results, rather than halting the
2109
        process (user mode) or halting or resetting the CPU (supervisor mode).
2110
 
2111
\item The Zip CPU doesn't support out of order execution. I suppose it could
2112
        be modified to do so, but then it would no longer be the ``simple''
2113
        and low LUT count CPU it was designed to be. The three primary results
2114
        are that 1) loads may unnecessarily stall the CPU, even if other
2115
        things could be done while waiting for the load to complete, 2)
2116
        bus errors on stores will never be caught at the point of the error,
2117
        and 3) branch prediction becomes more difficult.
2118
 
2119
\item Although switching to an interrupt context in the Zip CPU design doesn't
2120
        require a tremendous swapping of registers, in reality it still
2121
        does--since any task swap still requires saving and restoring all
2122
        16~user registers. That's a lot of memory movement just to service
2123
        an interrupt.
2124
 
2125
\item The Zip CPU is by no means generic: it will never handle addresses
2126
        larger than 32-bits (16GB) without a complete and total redesign.
2127
        This may limit its utility as a generic CPU in the future, although
2128
        as an embedded CPU within an FPGA this isn't really much of a limit
2129
        or restriction.
2130
 
2131
\item While the Zip CPU has its own assembler, it has no linker and does not
2132
        (yet) support a compiler. The standard C library is an even longer
2133
        shot. My dream of having binutils and gcc support has not been
2134
        realized and at this rate may not be realized. (I've been intimidated
2135
        by the challenge every time I've looked through that code.)
2136
 
2137
\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle
2138
        execution, the Zip CPU is unable to exploit this parallelism. Instead,
2139
        apart from the DMA and the pipelined prefetch, all loads and stores
2140
        are single wishbone bus operations requiring a minimum of 3 clocks.
2141
        (In practice, this has turned into 7~clocks.)
2142
 
2143
\iffalse
2144
\item There is no control over whether or not an instruction sets the
2145
        condition codes--certain instructions always set the condition codes,
2146
        other instructions never set them. This effectively limits conditional
2147
        instructions to a single instruction only (with two or more
2148
        instructions as an exception), as the first instruction that sets
2149
        condition codes will break the condition code chain.
2150
 
2151
        {\em (A proposed change below addresses this.)}
2152
 
2153
\item Using the CC register as a trap address was a bad idea--it limits the CC
2154
        register's ability to be used in future expansion, such as by adding
2155
        exception indication flags: bus error, floating point exception, etc.
2156
 
2157
        {\em (This can be changed by a different O/S implementation of the trap
2158
        instruction.)}
2159
\item The current implementation suffers from too many stalls on any
2160
        branch--even if the branch is known early on.
2161
 
2162
        {\em (This is addressed in proposals below.)}
2163
        % Addressed, 20150918
2164
 
2165
\item In a similar fashion, a switch to interrupt context forces the pipeline
2166
        to be cleared, whereas it might make more sense to just continue
2167
        executing the instructions already in the pipeline while the prefetch
2168
        stage is working on switching to the interrupt context.
2169
 
2170
        {\em (Also addressed in proposals below.)}
2171
        % This should happen so rarely that it is not really a problem
2172
\fi
2173
 
2174
\end{itemize}
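
The remark in the list above, that characters can be supported as subsets of
32--bit words, amounts to shifting and masking.  The following is a minimal
sketch in Python, assuming four 8--bit characters per word with the first
character in the most significant byte.  That packing order is an assumption;
the text does not specify one, and the function names are hypothetical.

```python
def get_char(words, index):
    """Return character number `index` from a list of 32-bit words."""
    word = words[index // 4]
    shift = 24 - 8 * (index % 4)  # first character in the top byte (assumed)
    return (word >> shift) & 0xFF


def set_char(words, index, value):
    """Store an 8-bit character at position `index`, in place."""
    shift = 24 - 8 * (index % 4)
    mask = 0xFF << shift
    old = words[index // 4]
    words[index // 4] = (old & ~mask & 0xFFFFFFFF) | ((value & 0xFF) << shift)
```

Every character access thus costs a word load plus a shift and a mask, and
every character store costs a load, a merge, and a store.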
2175
 
2176
\section{The Next Generation}
2177
This section could also be labeled as my ``To do'' list.
2178
 
2179
Given the feedback listed above, perhaps it's time to consider what changes
could be made to improve the Zip CPU in the future. I offer the following as
proposals:
2180
 
2181
\begin{itemize}
2182
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the
2183
        proposals below will only increase the amount of logic the Zip CPU
2184
        requires.  While I expect that the Zip CPU will always be somewhat
2185
        of a light weight, it will never be the smallest kid on the block.
2186
 
2187
        I'm actually struggling with this idea.  The whole goal of the Zip
2188
        CPU was to be light weight.  Wouldn't it make more sense to create and
2189
        maintain options whereby it would remain lightweight?  For example, if
2190
        the process accounting registers are anything but light weight, why
2191
        keep them?  Why not instead make some compile flags that just turn them
2192
        off, keeping the CPU lightweight?  The same holds for the prefetch
2193
        cache.
2194
 
2195
\iffalse
2196
\item {\bf Adjust the Zip CPU so that conditional instructions do not set
2197
        flags}, although they may explicitly set condition codes if writing
2198
        to the CC register.
2199
 
2200
        This is a simple change to the core, and may show up in new releases.
2201
        % Fixed, 20150918
2202
\fi
2203
 
2204
\item The `{\tt .V}' condition was never used in any code other than my test
2205
        code.  I suggest changing it to a `{\tt .LE}' condition, which seems
2206
        to be more useful.
2207
 
2208
\iffalse
2209
\item Add in an {\bf unpredictable branch delay slot}, so that on any branch
2210
        the delay slot may or may not be executed before the branch.
2211
        Instructions that do not depend upon the branch, and that should be
2212
        executed were the branch not taken, could be placed into the delay
2213
        slot. Thus, if the branch isn't taken, we wouldn't suffer the stall,
2214
        whereas it wouldn't affect the timing of the branch if taken. It would
2215
        just do something irrelevant.
2216
 
2217
        % Changes made, 20150918, make this option no longer relevant
2218
 
2219
\item {\bf Re-engineer Branch Processing.}  There's no reason why a {\tt BRA}
2220
        instruction should create five stall cycles.  The decode stage, plus
2221
        the prefetch engine, should be able to drop this number of stalls via
2222
        better branch handling.
2223
 
2224
        Indeed, this could turn into a simple means of branch prediction:
2225
        if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C}
2226
        suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would
2227
        be faster than a {\tt BRA.C} instruction.  This would then allow a
2228
        compiler to do explicit branch optimizations.
2229
 
2230
        Of course, longer branches using {\tt ADD X,PC} would still not be
2231
        optimized.
2232
 
2233
        % DONE: 20150918 -- to include the ADD X,PC instructions
2234
 
2235
\item {\bf Request bus access for Load/Store two cycles earlier.}  The problem
2236
        here is the contention for the bus between the memory unit and the
2237
        prefetch unit.  Currently, the memory unit must ask the prefetch
2238
        unit to release the bus if it is in the middle of a bus cycle.  At this
2239
        point, the prefetch drops the {\tt STB} line on the next clock and must
2240
        then wait for the last {\tt ACK} before releasing the bus.  If the
2241
        request takes one clock, dropping the strobe line the next, waiting
2242
        for an acknowledgement takes another, and then the bus must be idle
2243
        for one cycle before starting again, these extra cycles add up.
2244
        It should be possible to tell the prefetch stage to give up the bus
2245
        as soon as the decoder knows the instruction will need the bus.
2246
        Indeed, if done in the decode stage, this might drop the seven cycle
2247
        access down by two cycles.
2248
 
2249
        % FIXED: 20150918
2250
\fi
2251
 
2252
\item {\bf Consider a more traditional Instruction Cache.}  The current
2253
        pipelined instruction cache just reads a window of memory into
2254
        its cache.  If the CPU leaves that window, the entire cache is
2255
        invalidated.  A more traditional cache, however, might allow
2256
        common subroutines to stay within the cache without invalidating the
2257
        entire cache structure.

\iffalse
\item {\bf Very Long Instruction Word (VLIW).}  To speed up operation, I
        propose that the Zip CPU instruction set be modified towards a Very
        Long Instruction Word (VLIW) implementation.  In this implementation,
        an instruction word may contain either one or two separate
        instructions.  The first instruction would take up the high-order bits,
        the optional second instruction the lower 16~bits.  Further, I propose
        that any of the ALU instructions (SUB through LSR) automatically have
        a second instruction whenever their operand `B' is a register, and use
        the full 20-bit immediate if not.  This will effectively eliminate the
        register plus immediate mode for all of these instructions.

        This is the minimal change required to increase the number of
        instructions per clock cycle.  Other changes would need to take place
        as well to support this.  These include:
        \begin{itemize}
        \item Instruction words containing two instructions would take two
                clocks to complete, while requiring only a single cycle
                instruction fetch.
        \item Instructions preceded by a label in the assembler must always
                start in the high-order word.
        \item VLIWs, once started, must always execute to completion.  The
                upper word may set the PC, the lower word may not.  Regardless
                of whether the upper word sets the PC, the lower word must
                still be guaranteed to complete before the PC changes.  On any
                switch to (or from) interrupt context, either both
                instructions in the word must complete prior to the switch,
                or neither shall.
        \item STEP commands and BREAK instructions will only take place after
                the entire word is executed.
        \end{itemize}

        If done well, the assembler should be able to handle these changes,
        with the biggest impacts to the user being increased performance and
        a loss of the register plus immediate ALU modes.  (These weren't really
        relevant for the XOR, OR, AND, etc.\ operations anyway.)  Machine code
        compatibility will not be maintained.

        A proposed secondary instruction set might consist of: a 4-bit
        opcode (any of the prior instructions would be supported, with some
        exceptions such as moves to and from user registers while in
        supervisor mode not being supported), a 4-bit register result (PC not
        allowed), a 3-bit conditional (identical to the conditional for the
        upper word), a single bit indicating whether or not an immediate is
        present, followed by either a 4-bit register or a 4-bit signed
        immediate.  The multiply instruction would steal the immediate flag to
        be used as a sign indication, forcing both operands to be registers
        without any immediate offsets.

        {\em Initial conversion of several library functions to this secondary
        instruction set has demonstrated little to no gain.  The problem was
        that the new instruction set was made by joining a rarely used
        instruction (ALU with register and not immediate) with a more common
        instruction.  The utility was then limited by the utility of the rare
        instruction, which limited the impact of the entire approach.  }
\else
\item {\bf Very Long Instruction Word (VLIW).}  The goal here would be to
        create a new instruction set whereby two instructions would be encoded
        in each 32-bit word.  While this may speed up
        CPU operation, it would necessitate an instruction set redesign.
\fi

\end{itemize}


% Appendices
% Index
\end{document}