% URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Filename:    spec.tex
%%
%% Project:     Zip CPU -- a small, lightweight, RISC CPU soft core
%%
%% Purpose:     This LaTeX file contains all of the documentation/description
%%              currently provided with this Zip CPU soft core.  It supersedes
%%              any information about the instruction set or CPUs found
%%              elsewhere.  It's not nearly as interesting, though, as the PDF
%%              file it creates, so I'd recommend reading that before diving
%%              into this file.  You should be able to find the PDF file in
%%              the SVN distribution together with this PDF file and a copy of
%%              just type 'make' in the doc directory and it (should) build
%%              without a problem.
%%
%%
%% Creator:     Dan Gisselquist
%%              Gisselquist Technology, LLC
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% Copyright (C) 2015, Gisselquist Technology, LLC
%%
%% This program is free software (firmware): you can redistribute it and/or
%% modify it under the terms of  the GNU General Public License as published
%% by the Free Software Foundation, either version 3 of the License, or (at
%% your option) any later version.
%%
%% This program is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details.
%%
%% You should have received a copy of the GNU General Public License along
%% with this program.  (It's in the $(ROOT)/doc directory, run make with no
%% target there if the PDF file isn't present.)  If not, see
%% <http://www.gnu.org/licenses/> for a copy.
%%
%% License:     GPL, v3, as defined and found on www.gnu.org,
%%              http://www.gnu.org/licenses/gpl.html
%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%
% From TI about DSPs vs FPGAs:
%	www.ti.com/general/docs/video/foldersGallery.tsp?bkg=gray
%	&gpn=35145&familyid=1622&keyMatch=DSP Breaktime Episode Three
%	&tisearch=Search-EN-Everything&DCMP=leadership
%	&HQS=ep-pro-dsp-leadership-problog-150518-v-en
%
%	FPGA's are annoyingly faster, cheaper, and not quite as power hungry
%	as they used to be.
%
%	Why would you choose DSPs over FPGAs?  If you care about size,
%	if you care about power, or happen to have a complicated algorithm
%	that just isn't simply doing the same thing over and over
%
%	For complex algorithms that change over time.  Each have their strengths
%	sometimes you can use both.
%
%	"No assembly required" -- TI tools all C programming, very GUI based
%	environment, very little optimization by hand ...
%
%
% The FPGA's achilles heel: Reconfigurability.  It is very difficult, although
% I'm sure major vendors will tell you not impossible, to reconfigure an FPGA
% based upon the need to process time-sensitive data.  If you need one of two
% algorithms, both which will fit on the FPGA individually but not together,
% switching between them on the fly is next to impossible, whereas switching
% algorithm within a CPU is not difficult at all.  For example, imagine
% receiving a packet and needing to apply one of two data algorithms on the
% packet before sending it back out, and needing to do so fast.
% If both algorithms don't fit in memory, where does the packet go when you
% need to swap one algorithm out for the other?  And what is the cost of that
% "context" swap?
%
%
\documentclass{gqtekspec}
\usepackage{import}
\usepackage{bytefield}	% Install via apt-get install texlive-science
% \graphicspath{{../gfx}}
\project{Zip CPU}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.9}
\definecolor{webred}{rgb}{0.5,0,0}
\definecolor{webgreen}{rgb}{0,0.4,0}
\usepackage[dvips,ps2pdf,colorlinks=true,
	anchorcolor=black,pdfpagelabels,hypertexnames,
	pdfauthor={Dan Gisselquist},
	pdfsubject={Zip CPU}]{hyperref}
\hypersetup{
	colorlinks = true,
	linkcolor  = webred,
	citecolor  = webgreen
}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC

This project is free software (firmware): you can redistribute it and/or
modify it under the terms of  the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a
copy.
\end{license}
\begin{revisionhistory}
0.9 & 4/20/2016 & Gisselquist & Modified ISA: LDIHI replaced with MPY; MPYU and MPYS replaced with MPYUHI and MPYSHI respectively.  LOCK instruction now
	permits an intermediate ALU operation. \\\hline
0.8 & 1/28/2016 & Gisselquist & Reduced complexity early branching \\\hline
0.7 & 12/22/2015 & Gisselquist & New Instruction Set Architecture \\\hline
0.6 & 11/17/2015 & Gisselquist & Added graphics to illustrate pipeline discussion.\\\hline
0.5 & 9/29/2015 & Gisselquist & Added pipelined memory access discussion.\\\hline
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
Many people have asked me why I am building the Zip CPU.  ARM processors are
good and effective.  Xilinx makes and markets Microblaze, Altera Nios, and both
have better toolsets than the Zip CPU will ever have.  OpenRISC is also
available, and RISC--V may be replacing it.  Why build a new processor?

The easiest, most obvious answer is the simple one: Because I can.

There's more to it, though.  There's a lot that I would like to do with a
processor, and I want to be able to do it in a vendor independent fashion.
First, I would like to be able to place this processor inside an FPGA.  Without
paying royalties, ARM is out of the question.
I would then like to be able to
generate Verilog code, both for the processor and the system it sits within,
that can run equivalently on both Xilinx and Altera chips, and that can be
easily ported from one manufacturer's chipsets to another.  Even more, before
purchasing a chip or a board, I would like to know that my soft core works.  I
would like to build a test bench to test components with, and Verilator is my
chosen test bench.  This forces me to use all Verilog, and it prevents me from
using any proprietary cores.  For this reason, Microblaze and Nios are out of
the question.

Why not OpenRISC?  That's a hard question.  The OpenRISC team has done some
wonderful work on an amazing processor, and I'll have to admit that I am
envious of what they've accomplished.  I would like to port binutils to the
Zip CPU, as I would like to port GCC and GDB.  They are way ahead of me.  The
OpenRISC processor, however, is complex and hefty at about 4,500 LUTs.  It has
a lot of features of modern CPUs within it that ... well, let's just say it's
not the little guy on the block.  The Zip CPU is lighter weight, costing only
about 2,300 LUTs with no peripherals, and 3,200 LUTs with some very basic
peripherals.

My final reason is that I'm building the Zip CPU as a learning experience.  The
Zip CPU has allowed me to learn a lot about how CPUs work on a very micro
level.  For the first time, I am beginning to understand many of the Computer
Architecture lessons from years ago.

To summarize: Because I can, because it is open source, because it is light
weight, and as an exercise in learning.

\end{preface}

\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}


The original goal of the Zip CPU was to be a very simple CPU.
You might
think of it as a poor man's alternative to the OpenRISC architecture.
For this reason, all instructions have been designed to be as simple as
possible, and the base instructions are all designed to be executed in one
instruction cycle per instruction, barring pipeline stalls.  Indeed, even the
bus has been simplified to a constant 32-bit width, with no option for more
or less.  This has resulted in the choice to drop push and pop instructions,
pre-increment and post-decrement addressing modes, and more.

For those who like buzz words, the Zip CPU is:
\begin{itemize}
\item A 32-bit CPU: All registers are 32~bits wide, addresses are 32~bits,
		instructions are 32~bits wide, etc.
\item A RISC CPU.  There is no microcode for executing instructions.  All
	instructions are designed to be completed in one clock cycle.
\item A Load/Store architecture.  (Only load and store instructions
		can access memory.)
\item Wishbone compliant.  All peripherals are accessed just like
		memory across this bus.
\item A von Neumann architecture.  (The instructions and data share a
		common bus.)
\item A pipelined architecture, having stages for {\bf Prefetch},
		{\bf Decode}, {\bf Read-Operand}, a
		combined stage containing the {\bf ALU},
		{\bf Memory}, {\bf Divide}, and {\bf Floating Point}
		units, and then the final {\bf Write-back} stage.
		See Fig.~\ref{fig:cpu}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/cpu.eps}
\caption{Zip CPU internal pipeline architecture}\label{fig:cpu}
\end{center}\end{figure}
		for a diagram of this structure.
\item Completely open source, licensed under the GPL.\footnote{Should you
	need a copy of the Zip CPU licensed under other terms, please
	contact me.}
\end{itemize}

The Zip CPU also has one unique feature: the ability to do pipelined loads
and stores.  This allows the CPU to access on-chip memory at one access per
clock, minus a stall for the initial access.

\section{Characteristics of a SwiC}

Here, we shall define a soft core internal to an FPGA as a ``System within a
Chip,'' or a SwiC.  SwiCs have some unique properties internal to them
that have influenced the design of the Zip CPU.  Among these are the bus,
memory, and available peripherals.

Most other approaches to soft core CPUs employ a Harvard architecture.
This allows these other CPUs to have two separate bus structures: one for the
program fetch, and the other for the memory.  The Zip CPU is fairly unique in
its approach because it uses a von Neumann architecture.  This was done for
simplicity.  By using a von Neumann architecture, only one bus needs to be
implemented within any FPGA.  This helps to minimize real-estate, while
maintaining a high clock speed.  The disadvantage is that it can severely
degrade the overall instructions per clock count.

Soft cores within an FPGA have an additional characteristic regarding
memory access: it is slow.  While memory on chip may be accessed at a single
cycle per access, small FPGAs often have only a limited amount of memory on
chip.  Going off chip, however, is expensive.  Two examples will illustrate
this point.  On
the XuLA2 board, Flash can be accessed at 128~cycles per 32--bit word,
or 64~cycles per subsequent word in a pipelined architecture.
Likewise, the
SDRAM chip on the XuLA2 board allows a 6~cycle access for a write, 10~cycles
per read, and 2~cycles for any subsequent pipelined access read or write.
Either way you look at it, this memory access will be slow, and this doesn't
account for any logic delays should the bus implementation logic get
complicated.

As may be noticed from the above discussion about memory speed, a second
characteristic of memory is that all memory accesses may be pipelined, and
that pipelined memory access is faster than non--pipelined access.  Therefore,
a SwiC soft core should support pipelined operations, but it should also
allow a higher priority subsystem to get access to the bus (no starvation).

As a further characteristic of SwiC memory options, on-chip caches are
expensive.  If you want to have a minimum of logic, cache logic may not be
the highest on the priority list.

In sum, memory is slow.  While one processor on one FPGA may be able to fill
its pipeline, the same processor on another FPGA may struggle to get more than
one instruction at a time into the pipeline.  Any SwiC must be able to deal
with both cases: fast and slow memories.

A final characteristic of SwiCs within FPGAs is the peripherals.
Specifically, FPGAs are highly reconfigurable.  Soft peripherals can easily
be created on chip to support the SwiC if necessary.  As an example, a simple
30-bit peripheral could easily support reversing 30-bit numbers: a read from
the peripheral returns its bit--reversed address.  This is cheap within an
FPGA, but expensive in instructions.  Reading from another 16--bit peripheral
might calculate a sine function, where the 16--bit address internal to the
peripheral was the angle of the sine wave.
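
To make the ``expensive in instructions'' point concrete, here is a minimal
software sketch of the same 30-bit reversal.  It is written in Python purely
for illustration and is not part of the Zip CPU distribution; the function
name {\tt reverse\_bits} and its {\tt width} parameter are hypothetical.  Each
pass through the loop costs roughly a shift, a mask, an OR, and another shift,
repeated once per bit, whereas the soft peripheral described above would
return the answer in a single bus read.

```python
def reverse_bits(value: int, width: int = 30) -> int:
    """Reverse the low `width` bits of `value`, one bit per loop pass.

    Hypothetical helper for illustration only: it models the work a CPU
    would do in software, not any Zip CPU instruction or peripheral.
    """
    result = 0
    for _ in range(width):
        # Shift the partial result left and append the lowest input bit.
        result = (result << 1) | (value & 1)
        # Drop the bit just consumed.
        value >>= 1
    return result


# Bit 0 of a 30-bit word moves to bit 29.
print(hex(reverse_bits(0x1)))  # prints 0x20000000
```

With the default width, the loop body runs thirty times per word, which is the
cost the FPGA fabric avoids by simply wiring the bits in reverse order.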

Indeed, anything that must be done fast within an FPGA is likely to already
be done--elsewhere in the fabric.  This leaves the CPU with the simple role
of solely handling sequential tasks that need a lot of state.

This means that the SwiC needs to live within a unique environment,
separate and different from the traditional SoC.  That isn't to say that a
SwiC cannot be turned into a SoC, just that this SwiC has not been designed
for that purpose.

\section{Lessons Learned}

Now that I've worked on the Zip CPU for a while, however, it is not nearly
as simple as I originally hoped.  Worse, I've had to adjust to create
capabilities that I was never expecting to need.  These include:
\begin{itemize}
\item {\bf External Debug:} Once placed upon an FPGA, some external means is
	still necessary to debug this CPU.  That means that there needs to be
	an external register that can control the CPU: reset it, halt it, step
	it, and tell whether it is running or not.  My chosen interface
	includes a second register similar to this control register.  This
	second register allows the external controller or debugger to examine
	registers internal to the CPU.

\item {\bf Internal Debug:} Being able to run a debugger from within
	a user process requires an ability to step a user process from
	within a debugger.  It also requires a break instruction that can
	be substituted for any other instruction, and substituted back.
	The break is actually difficult: the break instruction cannot be
	allowed to execute.  That way, upon a break, the debugger should
	be able to jump back into the user process to step the instruction
	that would've been at the break point initially, and then to
	replace the break after passing it.

	Incidentally, this break messes with the prefetch cache and the
	pipeline: if you change an instruction partially through the pipeline,
	the whole pipeline needs to be cleansed.  Likewise, if you change
	an instruction in memory, you need to make sure the cache is reloaded
	with the new instruction.

\item {\bf Prefetch Cache:} My original implementation, {\tt prefetch}, had
	a very simple prefetch stage.  Any time the PC changed, the prefetch
	would go and fetch the new instruction.  While this was perhaps the
	simplest approach, it cost roughly five clocks for every instruction.
	This was deemed unacceptable, as I wanted a CPU that could execute
	instructions in one cycle.

	My second implementation, {\tt pipefetch}, attempted to make the most
	use of pipelined memory.  When a new CPU address was issued, it would
	start reading
	memory in a pipelined fashion, and issuing instructions as soon as they
	were ready.  This cache was a sliding window in memory.  This suffered
	from some difficult performance problems, though.  If the CPU was
	alternating between two diverse sections of code, both could never be
	in the cache at the same time--causing lots of cache misses.  Further,
	the extra logic to implement this window cost an extra clock cycle
	in the cache implementation, slowing down branches.

	The Zip CPU now has a third cache implementation, {\tt pfcache}.  This
	new implementation takes only a single cycle per access, but costs a
	full cache line miss on any miss.  While configurable, a full cache
	line miss might mean that the CPU needs to read 256~instructions from
	memory before it can execute the first one of them.

\item {\bf Operating System:} In order to support an operating system,
	interrupts and so forth, the CPU needs to support supervisor and
	user modes, as well as a means of switching between them.  For example,
	the user needs a means of executing a system call.  This is the
	purpose of the {\bf `trap'} instruction.  This instruction needs to
	place the CPU into supervisor mode (here equivalent to disabling
	interrupts), as well as handing it a parameter identifying
	which O/S function was called.

My initial approach to building a trap instruction was to create an external
peripheral which, when written to, would generate an interrupt and could
return the last value written to it.  In practice, this approach didn't work
at all: the CPU executed two instructions while waiting for the
trap interrupt to take place.  Since then, I've decided to keep the rest of
the CC register for that purpose so that a write to the CC register, with the
GIE bit cleared, could be used to execute a trap.  This has other problems,
though, primarily in the limitation of the uses of the CC register.  In
particular, the CC register is the best place to put CPU state information and
to ``announce'' special CPU features (floating point, etc).  So the trap
instruction still switches to interrupt mode, but the CC register is not
nearly as useful for telling the supervisor mode processor what trap is being
executed.

	Modern timesharing systems also depend upon a {\bf Timer} interrupt
	to handle task swapping.  For the Zip CPU, this interrupt is handled
	external to the CPU as part of the CPU System, found in {\tt zipsystem.v}.
	The timer module itself is found in {\tt ziptimer.v}.

\item {\bf Bus Errors:} My original implementation had no logic to handle
	what would happen if the CPU attempted to read or write a non-existent
	memory address.  This changed after I needed to troubleshoot a failure
	caused by a subroutine return to a non-existent address.

	My next bus problem was caused by a misbehaving peripheral.
	Whenever the CPU attempted to read from or write to this peripheral,
	the peripheral would take control of the wishbone bus and not return
	it.  For example, it might never return an {\tt ACK} to signal
	the end of the bus transaction.  This led to the implementation of
	a wishbone bus watchdog that would create a bus error if any particular
	bus action didn't complete in a timely fashion.

\item {\bf Pipeline Stalls:} My original plan was to not support pipeline
	stalls at all, but rather to require the compiler to properly schedule
	all instructions so that stalls would never be necessary.  After trying
	to build such an architecture, I gave up, having learned some things:

	First, an ideal pipeline might look something like
	Fig.~\ref{fig:ideal-pipeline}.
	\begin{figure}
	\begin{center}
	\includegraphics[width=4in]{../gfx/fullpline.eps}
	\caption{An Ideal Pipeline: One instruction per clock cycle}\label{fig:ideal-pipeline}
	\end{center}\end{figure}
	Notice that, in this figure, all the pipeline stages are complete and
	full.  Every instruction takes one clock and there are no delays.
	However, as the discussion above pointed out, the memory associated
	with a SwiC may not allow single clock access.  It may be instead
	that you can only read every two clocks.  In that case, what shall
	the pipeline look like?
	Should it look like
	Fig.~\ref{fig:waiting-pipeline},
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/stuttra.eps}
	\caption{Instructions wait for each other}\label{fig:waiting-pipeline}
	\end{center}\end{figure}
	where instructions are held back until the pipeline is full, or should
	it look like Fig.~\ref{fig:independent-pipeline},
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/stuttrb.eps}
	\caption{Instructions proceed independently}\label{fig:independent-pipeline}
	\end{center}\end{figure}
	where each instruction is allowed to move through the pipeline
	independently?  For better or worse, the Zip CPU allows instructions
	to move through the pipeline independently.

	One approach to avoiding stalls is to use a branch delay slot,
	such as is shown in Fig.~\ref{fig:brdelay}.
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/bdly.eps}
	\caption{A typical branch delay slot approach}\label{fig:brdelay}
	\end{center}\end{figure}
	In this figure, instructions
	{\tt BR} (a branch) and {\tt BD} (a branch delay instruction)
	are followed by instructions after the branch: {\tt IA}, {\tt IB}, etc.
	Since it takes a processor a clock cycle to execute a branch, the
	delay slot allows the processor to do something useful during that
	cycle.  The problem the Zip CPU has with this approach is, what
	happens when the pipeline looks like Fig.~\ref{fig:brbroken}?
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/bdbroken.eps}
	\caption{The branch delay slot breaks with a slow memory}\label{fig:brbroken}
	\end{center}\end{figure}
	In this case, the branch delay slot never gets filled in the first
	place, and so the pipeline squashes it before it gets executed.
	If not that, then what happens when handling interrupts or
	debug stepping: when has the CPU finished an instruction?
	When the {\tt BR} instruction has finished, or must {\tt BD}
	follow every {\tt BR}?  And, again, what if the pipeline isn't
	full?
	These thoughts killed any hopes of doing delayed branching.

	So I switched to a model of discrete execution: Once an instruction
	enters into either the ALU or memory unit, the instruction is
	guaranteed to complete.  If the logic recognizes a branch or a
	condition that would render the instruction entering into this stage
	possibly inappropriate (e.g., a conditional branch preceding a store
	instruction), then the pipeline stalls for one cycle
	until the conditional branch completes.  Then, if it generates a new
	PC address, the stages preceding are all wiped clean.

	This model, however, generated too many pipeline stalls, so the
	discrete execution model was modified to allow instructions to go
	through the ALU unit and be canceled before writeback.  This removed
	the stall associated with ALU instructions before untaken branches.

	The discrete execution model allows such things as sleeping, as
	outlined in Fig.~\ref{fig:sleeping}.
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/sleep.eps}
	\caption{How the CPU halts when sleeping}\label{fig:sleeping}
	\end{center}\end{figure}
	If the
	CPU is put to ``sleep,'' the ALU and memory stages stall and back up
	everything before them.  Likewise, anything that has entered the ALU
	or memory stage when the CPU is placed to sleep continues to completion.
	To handle this logic, each pipeline stage has three control signals:
	a valid signal, a stall signal, and a clock enable signal.  In
	general, a stage stalls if its contents are valid and the next step
	is stalled.
	This allows the pipeline to fill any time a later stage
	stalls, as illustrated in Fig.~\ref{fig:stacking}.
	\begin{figure}\begin{center}
	\includegraphics[width=4in]{../gfx/stacking.eps}
	\caption{Instructions can stack up behind a stalled instruction}\label{fig:stacking}
	\end{center}\end{figure}
	However, if a pipeline hazard is detected, a stage can stall in order
	to prevent the previous stage from moving forward.

	This approach is also different from other pipeline approaches.
	Instead of keeping the entire pipeline filled, each stage is treated
	independently.  Therefore, individual stages may move forward as long
	as the subsequent stage is available, regardless of whether the stage
	behind it is filled.
\end{itemize}

With that introduction out of the way, let's move on to the instruction
set.

\chapter{CPU Architecture}\label{chap:arch}

The Zip CPU supports a set of two-operand instructions, where the second operand
(always a register) is the result.  The only exception is the store instruction,
where the first operand (always a register) is the source of the data to be
stored.

\section{Simplified Bus}
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
It has been simplified in this fashion: all operations are 32--bit operations.
The bus is neither little endian nor big endian.  For this reason, all words
are 32--bits.  All instructions are also 32--bits wide.  Everything has been
built around the 32--bit word.

\section{Register Set}
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set as shown in Fig.~\ref{fig:regset}.
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/regset.eps}
\caption{Zip CPU Register File}\label{fig:regset}
\end{center}\end{figure}
The supervisor set is used in interrupt mode when interrupts are disabled,
whereas the user set is used otherwise.  Of this register set, the Program
Counter (PC) is register 15, whereas the status register (SR) or condition
code register
(CC) is register 14.  By convention, the stack pointer will be register 13 and
noted as (SP)--although there is nothing special about this register other
than this convention.  Also by convention, register~12 will point to a global
offset table, and may be abbreviated as the (GBL) register.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.

The status register is special, and bears further mention.  As shown in
Table~\ref{tbl:cc-register},
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 13 & R/W & Reserved for future uses\\\hline
12 & R & (Reserved for) Floating Point Exception\\\hline
11 & R & Division by Zero Exception\\\hline
10 & R & Bus-Error Flag\\\hline
9 & R & Trap, or user interrupt, Flag.  Cleared on return to userspace.\\\hline
8 & R & Illegal Instruction Flag\\\hline
7 & R/W & Break--Enable\\\hline
6 & R/W & Step\\\hline
5 & R/W & Global Interrupt Enable (GIE)\\\hline
4 & R/W & Sleep.  When GIE is also set, the CPU waits for an interrupt.\\\hline
3 & R/W & Overflow\\\hline
2 & R/W & Negative.
 The sign bit was set as a result of the last ALU instruction.\\\hline
1 & R/W & Carry\\\hline
0 & R/W & Zero\\\hline
\end{bitlist}
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
\end{center}\end{table}
the lower 13~bits of the status register form
a set of CPU state and condition codes.  Writes to other bits of this register
are preserved.

Of the condition codes, the bottom four bits are the current flags:
		Zero (Z),
		Carry (C),
		Negative (N),
		and Overflow (V).
On those instructions that set the flags, these flags will be set based upon
the output of the instruction.  If the result is zero, the Z flag will be set.
If the high order bit is set, the N flag will be set.  If the instruction
caused a bit to fall off the end, the carry bit will be set.  Finally, if
the instruction causes a signed integer overflow, the V flag will be set
afterwards.

The next bit is a sleep bit.  Set this bit to one to disable instruction
	execution and place the CPU to sleep, or to zero to keep the pipeline
	running.  Setting this bit will cause the CPU to wait for an interrupt
	(if interrupts are enabled), or to completely halt (if interrupts are
	disabled).  In order to prevent users from halting the CPU, only the
	supervisor is allowed to both put the CPU to sleep and disable
	interrupts.  Any user attempt to do so will simply result in a switch
	to supervisor mode.

The sixth bit is a global interrupt enable bit (GIE).  When this
	sixth bit is a `1', interrupts will be enabled, else disabled.  When
	interrupts are disabled, the CPU will be in supervisor mode, otherwise
	it is in user mode.  Thus, to execute a context switch, one only
	need enable or disable interrupts.
	(When an interrupt line goes
	high, interrupts will automatically be disabled, as the CPU goes
	and deals with its context switch.)  Special logic has been added to
	keep the user mode from setting the sleep bit and clearing the
	GIE bit at the same time, with clearing the GIE bit taking
	precedence.

The seventh bit is a step bit.  This bit can be set from supervisor mode only.
	After setting this bit, should the supervisor mode process switch to
	user mode, it would then accomplish one instruction in user mode
	before returning to supervisor mode.  Then, upon return to supervisor
	mode, this bit will be automatically cleared.  This bit has no effect
	on the CPU while in supervisor mode.

	This functionality was added to enable userspace debugger
	functionality on a user process, working through supervisor mode
	of course.


The eighth bit is a break enable bit.  This controls whether a break
instruction in user mode will halt the processor for an external debugger
(break enabled), or whether the break instruction will simply send the
CPU into interrupt mode.  Encountering a break in supervisor mode will
halt the CPU independent of the break enable bit.  This bit can only be set
within supervisor mode.

% Should break enable be a supervisor mode bit, while the break enable bit
% in user mode is a break has taken place bit?
%

	This functionality was added to enable an external debugger to
	set and manage breakpoints.

The ninth bit is an illegal instruction bit.  When the CPU
tries to execute either a non-existent instruction, or an instruction from
an address that produces a bus error, the CPU will (if implemented) switch
to supervisor mode while setting this bit.
The bit will automatically be
cleared upon any return to user mode.

The tenth bit is a trap bit.  It is set whenever the user requests a soft
interrupt, and cleared on any return to userspace command.  This allows the
supervisor, in supervisor mode, to determine whether it got to supervisor
mode from a trap or from an external interrupt or both.

\section{Instruction Format}
All Zip CPU instructions fit in one of the formats shown in
Fig.~\ref{fig:iset-format}.
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\bitheader{0-31}\\
\begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR}
		\bitbox[lrt]{5}{OpCode}
		\bitbox[lrt]{3}{Cnd}
		\bitbox{1}{0}
		\bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{0}\bitbox{4}{DR}
		\bitbox[lrb]{5}{}
		\bitbox[lrb]{3}{}
		\bitbox{1}{1}
		\bitbox{4}{BR}
		\bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR}
		\bitbox[lrt]{5}{5'hf}
		\bitbox[lrt]{3}{Cnd}
		\bitbox{1}{A}
		\bitbox{4}{BR}
		\bitbox{1}{B}
		\bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR}
		\bitbox{4}{4'hb}
		\bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\
\begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}
		\bitbox{1}{}
		\bitbox{2}{11}
		\bitbox{3}{xxx}
		\bitbox{22}{Ignored}
		\end{leftwordgroup} \\
\begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR}
		\bitbox[lrt]{5}{OpCode}
		\bitbox[lrt]{3}{Cnd}
		\bitbox{1}{0}
		\bitbox{4}{Imm.}
		\bitbox{14}{---} \\
\bitbox{1}{1}\bitbox[lr]{4}{}
		\bitbox[lrb]{5}{}
		\bitbox[lr]{3}{}
		\bitbox{1}{1}
		\bitbox{4}{BR}
		\bitbox{14}{---} \\
\bitbox{1}{1}\bitbox[lrb]{4}{}
		\bitbox{4}{4'hb}
		\bitbox{1}{}
		\bitbox[lrb]{3}{}
		\bitbox{5}{5'b Imm}
		\bitbox{14}{---} \\
\bitbox{1}{1}\bitbox{9}{---}
		\bitbox[lrt]{3}{Cnd}
		\bitbox{5}{---}
		\bitbox[lrt]{4}{DR}
		\bitbox[lrt]{5}{OpCode}
		\bitbox{1}{0}
		\bitbox{4}{Imm}
		\\
\bitbox{1}{1}\bitbox{9}{---}
		\bitbox[lr]{3}{}
		\bitbox{5}{---}
		\bitbox[lr]{4}{}
		\bitbox[lrb]{5}{}
		\bitbox{1}{1}
		\bitbox{4}{Reg} \\
\bitbox{1}{1}\bitbox{9}{---}
		\bitbox[lrb]{3}{}
		\bitbox{5}{---}
		\bitbox[lrb]{4}{}
		\bitbox{4}{4'hb}
		\bitbox{1}{}
		\bitbox{5}{5'b Imm}
		\end{leftwordgroup} \\
\end{bytefield}
\caption{Zip Instruction Set Format}\label{fig:iset-format}
\end{center}\end{figure}
The basic format is that some operation, defined by the OpCode, is applied
if a condition, Cnd, is true in order to produce a result which is placed in
the destination register, or DR.  The load 23--bit signed immediate instruction
(LDI) is different in that it accepts no conditions, and uses only a 4-bit
opcode.

This is actually a second version of the instruction set definition, given
certain lessons learned.  For example, the original instruction set had the
following problems:
\begin{enumerate}
\item No opcodes were available for divide or floating point extensions to be
	made available.  Although there was space in the instruction set to
	add these types of instructions, this instruction space was going to
	require extra logic to use.
\item The carveouts for instructions such as NOOP and LDIHI/LDILO required
	extra logic to process.
\item The instruction set wasn't very compact.  One bus operation was required
	for every instruction.
\item While the CPU supported multiplies, they were only 16x16 bit multiplies.
\end{enumerate}
This second version was designed with two criteria.  The first was that the
new instruction set needed to be compatible, at the assembly language level,
with the previous instruction set.
Thus, it must be able to support all of
the previous mnemonics and more.  This was achieved with the sole exception
that instruction immediates are generally two bits shorter than before.
(One bit was lost to the VLIW bit in front, another from changing from 4--bit
to 5--bit opcodes.)  Second, the new instruction set needed to be a drop--in
replacement for the decoder, modifying nothing else.  This was almost achieved,
save for two issues: the ALU unit needed to be replaced since the OpCodes
were reordered, and some condition code logic needed to be adjusted since the
condition codes were renumbered as well.  In the end, maximum reuse of the
existing RTL (Verilog) code was achieved in this upgrade.

As of this second version of the Zip CPU instruction set, the Zip CPU also
supports a very long instruction word (VLIW) set of instructions.   These
instruction formats pack two instructions into a single instruction word,
trading immediate instruction space to do this, but in just about all other
respects these are identical to two standard instructions.  Other than
instruction format, the only basic difference is that the CPU will not switch
to interrupt mode in between the two instructions.  Likewise a new job given
to the assembler is that of automatically packing as many instructions as
possible into the VLIW format.  Where necessary to place both VLIW instructions
on the same line, they will be separated by a vertical bar.

One belated change to the instruction set violates some of the above
principles.  This latter instruction set change replaced the {\tt LDIHI}
instruction with a 32--bit multiply instruction {\tt MPY}, and then exchanged
the two 16--bit multiply instructions {\tt MPYU} and {\tt MPYS} for
{\tt MPYUHI} and {\tt MPYSHI} respectively.  This creates a 32--bit
multiply capability, while removing the 16--bit multiply that wasn't very
useful.
Further, the {\tt LDIHI} instruction was being used primarily by the
assembler and linker to create a 32--bit load immediate pair of instructions.
This instruction pair, {\tt LDIHI} followed by {\tt LDILO}, was
replaced with an equivalent pair, {\tt BREV} followed by {\tt LDILO},
save that linking has been made more complicated in the process.

\section{Instruction OpCodes}
With a 5--bit opcode field, there are 32 possible instructions as shown in
Tbl.~\ref{tbl:iset-opcodes}.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85}
OpCode & & Instruction &Sets CC \\\hline\hline
5'h00 & SUB & Subtract &				\\\cline{1-3}
5'h01 & AND & Bitwise And &				\\\cline{1-3}
5'h02 & ADD & Add two numbers &				\\\cline{1-3}
5'h03 & OR  & Bitwise Or & Y				\\\cline{1-3}
5'h04 & XOR & Bitwise Exclusive Or &			\\\cline{1-3}
5'h05 & LSR & Logical Shift Right &			\\\cline{1-3}
5'h06 & LSL & Logical Shift Left &			\\\cline{1-3}
5'h07 & ASR & Arithmetic Shift Right &			\\\hline
5'h08 & MPY & 32x32 bit multiply & Y \\\hline
5'h09 & LDILO & Load Immediate Low & N\\\hline
5'h0a & MPYUHI & Upper 32 of 64 bits from an unsigned 32x32 multiply &  \\\cline{1-3}
5'h0b & MPYSHI & Upper 32 of 64 bits from a signed 32x32 multiply & Y \\\cline{1-3}
5'h0c & BREV & Bit Reverse &				\\\cline{1-3}
5'h0d & POPC& Population Count &			\\\cline{1-3}
5'h0e & ROL & Rotate left &				\\\hline
5'h0f & MOV & Move register & N				\\\hline
5'h10 & CMP & Compare & Y				\\\cline{1-3}
5'h11 & TST & Test (AND w/o setting result) &		\\\hline
5'h12 & LOD & Load from memory & N			\\\cline{1-3}
5'h13 & STO & Store a register into memory &		\\\hline\hline
5'h14 & DIVU & Divide, unsigned & Y			\\\cline{1-3}
5'h15 & DIVS & Divide, signed  &			\\\hline\hline
5'h16/7 & LDI & Load 23--bit signed immediate & N	\\\hline\hline
5'h18 & FPADD &  Floating point add &			\\\cline{1-3}
5'h19 & FPSUB &  Floating point subtract &		\\\cline{1-3}
5'h1a & FPMPY &  Floating point multiply & Y		\\\cline{1-3}
5'h1b & FPDIV &  Floating point divide &		\\\cline{1-3}
5'h1c & FPCVT &  Convert integer to floating point &	\\\cline{1-3}
5'h1d & FPINT &  Convert to integer &			\\\hline
5'h1e & & {\em Reserved for future use} &\\\hline
5'h1f & & {\em Reserved for future use} &\\\hline
5'h18 & & NOOP (A-register = PC)&\\\cline{1-3}
5'h19 & & BREAK (A-register = PC)& N\\\cline{1-3}
5'h1a & & LOCK (A-register = PC)&\\\hline
\end{tabular}
\caption{Zip CPU OpCodes}\label{tbl:iset-opcodes}
\end{center}\end{table}
%
Of these opcodes, the {\tt BREV} and {\tt POPC} are experimental, and may be
replaced later, and two floating point instruction opcodes are reserved for
future use.

\section{Conditional Instructions}
Most, although not quite all, instructions may be conditionally executed.
The 23--bit load immediate instruction, together with the {\tt NOOP},
{\tt BREAK}, and {\tt LOCK} instructions are the only exceptions to this rule.

From the four condition code flags, eight conditions are defined for standard
instructions.  These are shown in Tbl.~\ref{tbl:conditions}.
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
3'h0 & None & Always execute the instruction \\
3'h1 & {\tt .LT} & Less than ('N' set) \\
3'h2 & {\tt .Z} & Only execute when 'Z' is set \\
3'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\
3'h4 & {\tt .GT} & Greater than ('N' not set, 'Z' not set) \\
3'h5 & {\tt .GE} & Greater than or equal ('N' not set, 'Z' irrelevant) \\
3'h6 & {\tt .C} & Carry set (Also known as less-than unsigned) \\
3'h7 & {\tt .V} & Overflow set\\
\end{tabular}
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
\end{center}\end{table}
There is no condition code for less than or equal, not C, or not V: there
just wasn't enough space in 3~bits.  Conditioning on an unsupported
condition is still possible, but it will take an extra instruction and a
pipeline stall.  (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt
STO.NZ R0,(R1)})  As an alternative, it is often possible to reverse the
condition, thus recovering those extra two clocks.  Thus instead of
\hbox{\tt CMP Rx,Ry;} \hbox{\tt BNC label} you can issue a
\hbox{\tt CMP 1+Ry,Rx;} \hbox{\tt BC label}.

Conditionally executed instructions will not further adjust the
condition codes, with the exception of \hbox{\tt CMP} and \hbox{\tt TST}
instructions.   Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions
will adjust conditions whenever they are executed.  In this way,
multiple conditions may be evaluated without branches.  For example, to do
something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try
code such as Tbl.~\ref{tbl:dbl-condition}.
\begin{table}\begin{center}
\begin{tabular}{l}
    {\tt CMP 1,R0} \\
    {;\em Condition codes are now set based upon R0-1} \\
    {\tt CMP.Z 2,R1} \\
    {;\em If R0 $\neq$ 1, conditions are unchanged.} \\
    {;\em If R0 $=$ 1, conditions are set based upon R1-2.} \\
    {;\em Now do something based upon the conjunction of both conditions.} \\
    {;\em While we use the example of a STO, it could be any instruction.} \\
    {\tt STO.Z R0,(R2)} \\
\end{tabular}
\caption{An example of a double conditional}\label{tbl:dbl-condition}
\end{center}\end{table}

In the case of VLIW instructions, only four conditions are defined as shown
in Tbl.~\ref{tbl:vliw-conditions}.
\begin{table}\begin{center}
\begin{tabular}{l|l|l}
Code & Mnemonic & Condition \\\hline
2'h0 & None & Always execute the instruction \\
2'h1 & {\tt .LT} & Less than ('N' set) \\
2'h2 & {\tt .Z} & Only execute when 'Z' is set \\
2'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\
\end{tabular}
\caption{VLIW Conditions}\label{tbl:vliw-conditions}
\end{center}\end{table}
Further, the first bit of the three--bit condition field is given a special
meaning.  If the first bit is set,
the conditions apply to the second half of the instruction, otherwise the
conditions will only apply to the first half of a conditional instruction.
Of course, the other conditions are still available by mingling
non--VLIW instructions with VLIW instructions.

\section{Operand B}
Many instruction forms have a 19-bit source ``Operand B'' associated with them.
This ``Operand B'' is shown in Fig.~\ref{fig:iset-format} as part of the
standard instructions.  This Operand B is either equal to a register plus a
14--bit signed immediate offset, or an 18--bit signed immediate offset by
itself.  This value is encoded as shown in Tbl.~\ref{tbl:opb}.
\begin{table}\begin{center}
\begin{bytefield}[endianness=big]{19}
\bitbox{1}{0}\bitbox{18}{18-bit Signed Immediate} \\
\bitbox{1}{1}\bitbox{4}{Reg}\bitbox{14}{14-bit Signed Immediate}
\end{bytefield}
\caption{Bit allocation for Operand B}\label{tbl:opb}
\end{center}\end{table}

Fourteen and eighteen bit immediate values don't make sense for all
instructions.  For example, what is the point of an 18--bit immediate when
executing a 16--bit multiply?  Or a 16--bit load--immediate?  In these cases,
the extra bits are simply ignored.

VLIW instructions still use the same operand B, only there was no room for any
register plus immediate addressing.  Therefore, VLIW instructions have either
a register or a 4--bit signed immediate as their operand B.  The only exception
is the load immediate instruction, which permits a 5--bit signed operand
B.\footnote{Although the space exists to extend this VLIW load immediate
instruction to six bits, the 5--bit limit was chosen to simplify the
disassembler.  This may change in the future.}
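
As a quick check on the immediate widths quoted above, an $n$--bit signed
immediate $I$ can take the values
\[
	-2^{n-1} \le I \le 2^{n-1}-1.
\]
Assuming the immediates are ordinary two's complement fields, as the
``Signed Immediate'' labels in Fig.~\ref{fig:iset-format} suggest, the field
widths used by the instruction formats work out as follows:
\begin{center}
\begin{tabular}{l|r|r}
Field & Minimum & Maximum \\\hline
23--bit ({\tt LDI}) & $-4194304$ & $4194303$ \\
18--bit (Operand B, immediate only) & $-131072$ & $131071$ \\
14--bit (Operand B, register plus immediate) & $-8192$ & $8191$ \\
13--bit ({\tt MOV}) & $-4096$ & $4095$ \\
5--bit (VLIW load immediate) & $-16$ & $15$ \\
4--bit (VLIW Operand B) & $-8$ & $7$ \\
\end{tabular}
\end{center}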

The Zip CPU supports two addressing modes: register plus immediate, and
immediate address.  Addresses are therefore encoded in the same fashion as
Operand B's, shown above.  Practically, the VLIW instruction set only offers
register addressing, necessitating a non--VLIW instruction for most memory
operations.

A lot of long hard thought was put into whether to allow pre/post increment
and decrement addressing modes.  Finding no way to use these operators without
taking two or more clocks per instruction,\footnote{The two clocks figure
comes from the design of the register set, allowing only one write per clock.
That write is either from the memory unit or the ALU, but never both.} these
addressing modes have been
removed from the realm of possibilities.  This means that the Zip CPU has no
native way of executing push, pop, return, or jump to subroutine operations.
Each of these instructions can be emulated with a set of instructions from the
existing set.
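
For illustration, the sequences below sketch how these four operations might
be emulated.  They are drawn from the derived instruction tables later in this
document rather than being new instructions.  They assume that {\tt SP} names
the stack pointer register and that {\tt R0} is used to hold the return
address; {\tt \$Offset} stands in for the distance to the subroutine.
\begin{center}
\begin{tabular}{l}
    {;\em Push Rx: make room on the stack, then store} \\
    {\tt SUB \$1,SP} \\
    {\tt STO Rx,(SP)} \\
    {;\em Pop Rx: load from the stack, then release the space} \\
    {\tt LOD (SP),Rx} \\
    {\tt ADD \$1,SP} \\
    {;\em Jump to subroutine: save the return address in R0, then branch} \\
    {\tt MOV \$1+PC,R0} \\
    {\tt ADD \$Offset,PC} \\
    {;\em Return: copy the saved address back into the PC} \\
    {\tt MOV R0,PC} \\
\end{tabular}
\end{center}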

\section{Modifying Conditions}
A quick look at the list of conditions supported by the Zip CPU and listed
in Tbl.~\ref{tbl:conditions} reveals that the Zip CPU does not have a full set
of conditions.  In particular, only one explicit unsigned condition is
supported.  Therefore, Tbl.~\ref{tbl:creating-conditions}
shows examples of how these unsupported conditions can be created
simply by adjusting the compare instruction, for no extra cost in clocks.
Of course, if the compare originally had an immediate within it, that immediate
would need to be loaded into a register in order to do some of these compares.
This case is shown in the last row of the table.
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|}\hline
Original & Modified & Name \\\hline\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1
	& \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BLT label}
	& Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}
	& \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BC label}
	& Less-than or equal unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label}	% if (Ry > Rx) -> Rx < Ry
	& \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}
	& Greater-than unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGEU label}	% if (Ry >= Rx) -> Rx <= Ry -> Rx < Ry+1
	& \parbox[t]{1.5in}{\tt CMP 1+Ry,Rx\\BC label}
	& Greater-than or equal unsigned \\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP A+Rx,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A
	& \parbox[t]{1.5in}{\tt CMP (1-A)+Ry,Rx\\BC label}
	& Greater-than or equal unsigned (with offset)\\[4mm]\hline
\parbox[t]{1.5in}{\tt CMP A,Ry\\BGEU label} % if (Ry >= A) -> A <= Ry -> A-1 < Ry
	& \parbox[t]{1.5in}{\tt LDI (A-1),Rx\\CMP Ry,Rx\\BC label}
	& Greater-than or equal comparison with a constant\\[4mm]\hline
\end{tabular}
\caption{Modifying conditions}\label{tbl:creating-conditions}
\end{center}\end{table}
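
To see why these substitutions work, recall that {\tt CMP B,A} sets the flags
based upon $A-B$, and that the {\tt .C} condition then means $A<B$ when both
are treated as unsigned numbers.  Taking the unsigned greater--than or equal
case as a worked example, and writing $R_x$ and $R_y$ for the register values,
\[
	R_y \ge R_x \;\Longleftrightarrow\; R_x \le R_y
		\;\Longleftrightarrow\; R_x < R_y + 1,
\]
which is the carry condition following {\tt CMP 1+Ry,Rx}.  This is only a
sketch of the reasoning: it assumes that $R_y+1$ does not wrap around, so the
rewritten compare would need separate handling whenever {\tt Ry} holds the
largest unsigned value.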

\section{Move Operands}
The previous set of operands would be perfect and complete, save only that
the CPU needs access to non--supervisory registers while in supervisory mode.
Therefore, the MOV instruction is special and offers access to these registers
\ldots when in supervisory mode.  To keep the compiler simple, the extra bits
are ignored in non-supervisory mode (as though they didn't exist), rather than
being mapped to new instructions or additional capabilities.  The bits
indicating which register set each register lies within are the A-User, marked
`A' in Fig.~\ref{fig:iset-format}, and B-User bits, marked as `B'.  When set
to a one, these refer to a user mode register.  When set to a zero, these
refer to a register in the current mode, whether user or supervisor.  Further,
because a load immediate instruction exists, there is no move capability
between an immediate and a register: all moves come from either a register or
a register plus an offset.

960 69 dgisselq
This actually leads to a bit of a problem: since the {\tt MOV} instruction
961
encodes which register set each register is coming from or moving to, how shall
962
a compiler or assembler know how to compile a MOV instruction without knowing
963 24 dgisselq
the mode of the CPU at the time?  For this reason, the compiler will assume
964
all MOV registers are supervisor registers, and display them as normal.
965 69 dgisselq
Anything with the user bit set will be treated as a user register and displayed
966
special.  Since the CPU quietly ignores the supervisor bits while in user mode,
967
anything marked as a user register will always be specific.
968 21 dgisselq

\section{Multiply Operations}

The ZipCPU originally only supported 16x16 multiply operations.  GCC, however,
wanted 32x32-bit operations and building these from 16x16-bit multiplies
is painful.  Therefore, the ZipCPU was modified to support 32x32-bit multiplies.

In particular, the ZipCPU supports three separate 32x32-bit multiply
instructions: {\tt MPY}, {\tt MPYUHI}, and {\tt MPYSHI}.  The first of these
produces the low 32~bits of a 32x32-bit multiply result.  The other two
produce the upper 32~bits: {\tt MPYUHI} produces the upper 32~bits
assuming the multiply was unsigned, whereas {\tt MPYSHI} assumes it was signed.
Each multiply instruction executes independently of the others, although
the compiler may use them together.
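
As an illustration, a full 64--bit unsigned product might be assembled from
two of these instructions as sketched below.  The register names are
placeholders, and the sketch assumes that, as with the other two--operand
instructions, the second register named is both an input and the destination.
\begin{center}
\begin{tabular}{l}
    {;\em Ra and Rb hold the two 32--bit factors} \\
    {\tt MOV Ra,Rlo} \\
    {\tt MPY Rb,Rlo} \\
    {;\em Rlo now holds the low 32 bits of the product} \\
    {\tt MOV Ra,Rhi} \\
    {\tt MPYUHI Rb,Rhi} \\
    {;\em Rhi now holds the upper 32 bits of the unsigned product} \\
\end{tabular}
\end{center}
In other words, writing $P = R_a R_b$ for the 64--bit unsigned product,
{\tt MPY} returns $P \bmod 2^{32}$ while {\tt MPYUHI} returns
$\lfloor P/2^{32} \rfloor$.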

In an effort to maintain single clock pipeline timing, all three of these
multiplies have been slowed down in logic.  Thus, depending upon the setting
of {\tt OPT\_MULTIPLY} within {\tt cpudefs.v}, the multiply instructions
will either 1)~cause an ILLEGAL instruction error, 2)~take one additional clock,

\section{Divide Unit}
The Zip CPU also has a divide unit which can be built alongside the ALU.
This divide unit provides the Zip CPU with another two instructions that
cannot be executed in a single cycle: {\tt DIVS}, or signed divide, and
{\tt DIVU}, the unsigned divide.  These are both 32--bit divide instructions,
dividing one 32--bit number by another.  In this case, the Operand B field,
whether it be register or register plus immediate, constitutes the denominator,
whereas the numerator is given by the other register.
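
For example, a quotient and remainder might be computed as sketched below.
This assumes that the quotient is written back over the numerator, following
the destination register convention of the standard instruction format; the
register names are placeholders.
\begin{center}
\begin{tabular}{l}
    {;\em Compute Rq = Ra / Rb, unsigned} \\
    {\tt MOV Ra,Rq} \\
    {\tt DIVU Rb,Rq} \\
    {;\em Compute Rr = Ra - Rq * Rb, the remainder} \\
    {\tt MOV Rq,Rt} \\
    {\tt MPY Rb,Rt} \\
    {\tt MOV Ra,Rr} \\
    {\tt SUB Rt,Rr} \\
\end{tabular}
\end{center}
Since the quotient times the denominator can never exceed the numerator, the
low 32~bits returned by {\tt MPY} are sufficient here.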

The Divide is also a multi--clock instruction.  While the divide is running,
the ALU, any memory loads, and the floating point unit (if installed) will be
idle.  Once the divide completes, other units may continue.

Of course, divides can have errors: division by zero.  In the case of division
by zero, an exception will be caused that will send the CPU either from
user mode to supervisor mode, or halt the CPU if it is already in supervisor
mode.

\section{NOOP, BREAK, and Bus Lock Instructions}
Three instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes} are
somewhat special.  These are the {\tt NOOP}, {\tt BREAK}, and bus {\tt LOCK}
instructions.  These are encoded according to
Fig.~\ref{fig:iset-noop}, and have the following meanings:
\begin{figure}\begin{center}
\begin{bytefield}[endianness=big]{32}
\begin{leftwordgroup}{NOOP}
\bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{}
	\bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{Ignored} \\
\bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{}
	\bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{---} \\
\bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---}
	\bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}
	\bitbox{5}{Ignored}
	\end{leftwordgroup} \\
\begin{leftwordgroup}{BREAK}
\bitbox{1}{0}\bitbox{3}{3'h7}
	\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Ignored}
	\end{leftwordgroup} \\
\begin{leftwordgroup}{LOCK}
\bitbox{1}{0}\bitbox{3}{3'h7}
	\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored}
	\end{leftwordgroup} \\
\end{bytefield}
\caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop}
\end{center}\end{figure}

The {\tt NOOP} instruction is just that: an instruction that does not perform
any operation.  While many other instructions, such as a move from a register to
itself, could also fit this role, only the NOOP instruction guarantees that
it will not stall waiting for a register to be available.   For this reason,
it gets its own place in the instruction set.

The {\tt BREAK} instruction is useful for creating a debug instruction that
will halt the CPU without executing.  If in user mode, depending upon the
setting of the break enable bit, it will either switch to supervisor mode or
halt the CPU, depending upon where the user wishes to do the debugging.

Finally, the {\tt LOCK} instruction was added in order to provide for
atomic operations.  The {\tt LOCK} instruction only works in pipeline mode.
It works by stalling the ALU pipeline stack until all prior stages are
filled, and then it guarantees that once a bus cycle is started, the
Wishbone {\tt CYC} line will remain asserted until the LOCK is deasserted.
This allows the execution of one instruction that was waiting in the load
operands pipeline stage, and one instruction that was waiting in the
instruction decode stage.  Further, if the instruction waiting in the decode
stage was a VLIW instruction, then it may be possible to execute a third
instruction.

This was originally written to implement an atomic test and set instruction,
such as a {\tt LOCK} followed by {\tt LOD (Rx),Ry} and a {\tt STO Rz,(Rx)},
where Rz is initially set.
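
A sketch of how such a test and set might be used to acquire a lock follows.
It assumes that a lock word of zero means ``free,'' that {\tt BNZ} is available
as a branch--if--not--zero mnemonic, and that the register names and the
{\tt retry} label are placeholders.  It has not been tested.
\begin{center}
\begin{tabular}{l}
    {;\em Rx holds the address of the lock word} \\
    {\tt LDI \$1,Rz} \\
    {\tt retry:} \\
    {\tt LOCK} \\
    {\tt LOD (Rx),Ry} \\
    {\tt STO Rz,(Rx)} \\
    {;\em Ry holds the previous lock value: non--zero means already held} \\
    {\tt TST -1,Ry} \\
    {\tt BNZ retry} \\
\end{tabular}
\end{center}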

Other sequences that use a VLIW instruction to combine a single ALU instruction
with a store should be possible as well, such as an atomic increment:
{\tt LOCK}, {\tt LOD (Rx),Ry},
{\tt ADD 1,Ry}, {\tt STO Ry,(Rx)}.  Many of these
combinations remain to be tested.

\section{Floating Point}
Although the Zip CPU does not (yet) have a floating point unit, the current
instruction set offers eight opcodes for floating point operations, and treats
floating point exceptions like divide by zero errors.  Once this unit is built
and integrated together with the rest of the CPU, the Zip CPU will support
32--bit floating point instructions natively.  Any 64--bit floating point
instructions will still need to be emulated in software.

Until that time, or even after if the floating point unit is not installed,
floating point instructions will trigger an illegal instruction exception,
which may be trapped and then implemented in software.

\section{Derived Instructions}
The Zip CPU supports many other common instructions, but not all of them
are single cycle instructions.  The derived instruction tables,
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
and~\ref{tbl:derived-4},
help to capture some of how these other instructions may be implemented on
the Zip CPU.  Many of these instructions will have assembly equivalents,
such as the branch instructions, to facilitate working with the CPU.
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
Mapped & Actual  & Notes \\\hline
{\tt ABS Rx}
	& \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
	& Absolute value, depends upon derived NEG.\\\hline
	& \parbox[t]{1.5in}{\tt ADD Ra,Rx\\ADD.C \$1,Ry\\ADD Rb,Ry}
	& Add with carry \\\hline
{\tt BRA.Cond +/-\$Addr}
	& \hbox{\tt ADD.cond \$Addr+PC,PC}
	& Branch or jump on condition.  Works for 18--bit
		signed address offsets.\\\hline
{\tt BRA.Cond +/-\$Addr}
	& \parbox[t]{1.5in}{\tt LDI \$Addr,Rx \\ ADD.cond Rx,PC}
	& Branch/jump on condition.  Works for 23--bit address offsets, but
	costs a register and an extra instruction.  With LDIHI and LDILO
	this can be made to work anywhere in the 32--bit address space, at
	the cost of yet another instruction. \\\hline
{\tt BNC PC+\$Addr}
	& \parbox[t]{1.5in}{\tt TST \$Carry,CC \\ ADD.Z PC+\$Addr,PC}
	& Example of a branch on an unsupported
		condition, in this case a branch on not carry \\\hline
{\tt BUSY } & {\tt ADD \$-1,PC} & Execute an infinite loop \\\hline
{\tt CLRF.NZ Rx }
	& {\tt XOR.NZ Rx,Rx}
	& Clear Rx, and flags, if the Z-bit is not set \\\hline
{\tt CLR Rx }
	& {\tt LDI \$0,Rx}
	& Clears Rx, leaves flags untouched.  This instruction cannot be
		conditional. \\\hline
{\tt EXCH.W Rx }
	& {\tt ROL \$16,Rx}
	& Exchanges the top and bottom 16--bit words of Rx \\\hline
{\tt HALT }
	& {\tt OR \$SLEEP,CC}
	& This only works when issued in interrupt/supervisor mode.  In user
	mode this is simply a wait until interrupt instruction. \\\hline
{\tt INT } & {\tt LDI \$0,CC} & This is also known as a trap instruction\\\hline
{\tt IRET}
	& {\tt OR \$GIE,CC}
	& Sets the GIE bit, returning the CPU to user mode. \\\hline
{\tt JMP R6+\$Offset}
	& {\tt MOV \$Offset(R6),PC}
	& \\\hline
{\tt LJMP \$Addr}
	& \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }}
	& Although this only works for an unconditional jump, and it only
	works in a Von Neumann architecture, this instruction combination
	can be adjusted by a linker at a later
	time.\\\hline
{\tt JSR PC+\$Offset  }
	& \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ ADD \$Offset,PC}
	& This is similar to the jump and link instructions from other
	architectures, save only that it requires a specific link
	instruction, also known as the {\tt MOV} instruction on the
	left.\\\hline
\end{tabular}
\caption{Derived Instructions}\label{tbl:derived-1}
\end{center}\end{table}
\begin{table}\begin{center}
1149
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
1150
Mapped & Actual  & Notes \\\hline
1151 39 dgisselq
{\tt LDI.l \$val,Rx } 1152 & \parbox[t]{1.8in}{\tt LDIHI (\$val$>>$16)\&0x0ffff, Rx \\
1153
LDILO (\$val\&0x0ffff),Rx} 1154 69 dgisselq & \parbox[t]{3.0in}{Sadly, there's not enough instruction 1155 21 dgisselq space to load a complete immediate value into any register. 1156 Therefore, fully loading any register takes two cycles. 1157 The LDIHI (load immediate high) and LDILO (load immediate low) 1158 69 dgisselq instructions have been created to facilitate this. 1159 \\ 1160 This is also the appropriate means for setting a register value 1161 to an arbitrary 32--bit value in a post--assembly link 1162 operation.}\\\hline 1163 39 dgisselq {\tt LOD.b \$addr,Rx}
1164
& \parbox[t]{1.5in}{\tt %
1165 21 dgisselq
LDI     \$addr,Ra \\ 1166 LDI \$addr,Rb \\
1167
LSR     \$2,Ra \\ 1168 AND \$3,Rb \\
1169
LOD     (Ra),Rx \\
1170
LSL     \$3,Rb \\ 1171 SUB \$32,Rb \\
1172
ROL     Rb,Rx \\
1173
AND \$0ffh,Rx} 1174 & \parbox[t]{3in}{This CPU is designed for 32'bit word 1175 length instructions. Byte addressing is not supported by the CPU or 1176 the bus, so it therefore takes more work to do. 1177 1178 Note also that in this example, \$Addr is a byte-wise address, where
1179 24 dgisselq
1180
For this reason,
1181 21 dgisselq
we needed to drop the bottom two bits.  This also limits the address
1182
space of character accesses using this method from 16 MB down to 4MB.}
1183
\\\hline
1184 39 dgisselq
\parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}
1185
& \parbox[t]{1.5in}{\tt LSL \$1,Ry \\ 1186 21 dgisselq LSL \$1,Rx \\
1187
OR.C \$1,Ry} 1188 & Logical shift left with carry. Note that the 1189 instruction order is now backwards, to keep the conditions valid. 1190 33 dgisselq That is, LSL sets the carry flag, so if we did this the other way 1191 21 dgisselq with Rx before Ry, then the condition flag wouldn't have been right 1192 for an OR correction at the end. \\\hline 1193 39 dgisselq \parbox[t]{1.5in}{\tt LSR \$1,Rx \\ LSRC \$1,Ry} 1194 & \parbox[t]{1.5in}{\tt CLR Rz \\ 1195 21 dgisselq LSR \$1,Ry \\
1196
LDIHI.C \$8000h,Rz \\ 1197 LSR \$1,Rx \\
1198
OR Rz,Rx}
1199
& Logical shift right with carry \\\hline
1200 39 dgisselq
{\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
{\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx} & \\\hline
{\tt NOOP} & {\tt NOOP} & While there are many
	operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
	operations have consequences in that they might stall the bus if
	Rx isn't ready yet.  For this reason, we have a dedicated NOOP
	instruction. \\\hline
{\tt NOT Rx } & {\tt XOR \$-1,Rx } & \\\hline
{\tt POP Rx }
	& \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP}
	& \\\hline
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-2}
\end{center}\end{table}
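
As a rough illustration of two of the derived sequences in Tbl.~\ref{tbl:derived-2}, the following C sketch mimics the two-instruction {\tt NEG} and the three-instruction shift left with carry.  It is not part of the Zip CPU sources: the function names {\tt neg\_two\_step} and {\tt shl1\_pair} are hypothetical, and the carry flag is modeled simply as the bit shifted out of the low word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of NEG Rx as XOR $-1,Rx followed by ADD $1,Rx. */
uint32_t neg_two_step(uint32_t rx)
{
    rx ^= 0xFFFFFFFFu;  /* XOR $-1,Rx : one's complement */
    rx += 1u;           /* ADD $1,Rx  : two's complement */
    return rx;
}

/*
 * Hypothetical model of LSL $1,Ry ; LSL $1,Rx ; OR.C $1,Ry, where Rx holds
 * the low word and Ry the high word of a 64-bit value.  The high word is
 * shifted first, so that the carry produced by shifting the low word is
 * still valid when the conditional OR executes.
 */
uint64_t shl1_pair(uint32_t ry_high, uint32_t rx_low)
{
    ry_high <<= 1;                   /* LSL $1,Ry */
    uint32_t carry = rx_low >> 31;   /* bit that LSL $1,Rx shifts out */
    rx_low <<= 1;                    /* LSL $1,Rx */
    if (carry)
        ry_high |= 1u;               /* OR.C $1,Ry */
    return ((uint64_t)ry_high << 32) | rx_low;
}
```

Shifting the pair {\tt (0x00000001, 0x80000000)} this way moves the top bit of the low word into bit zero of the high word, which is the behavior the instruction ordering in the table is meant to preserve.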
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
{\tt PUSH Rx}
	& \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP}
	\hbox{\tt STO Rx,\$(SP)}}
	& Note that for pipelined operation, it helps to coalesce all the
	{\tt SUB}'s into one command, and place the {\tt STO}'s right
	after each other.  Further, to avoid a pipeline stall, the
	immediate value for the store must be zero.
	\\\hline
{\tt PUSH Rx-Ry}
	& \parbox[t]{1.5in}{\tt SUB \$$n$,SP \\
	STO Rx,\$(SP)
	\ldots \\
	STO Ry,\$$\left(n-1\right)$(SP)}
	& Multiple pushes at once only need the single subtract from the
	stack pointer.  This derived instruction is analogous to a similar one
	on the Motorola 68k architecture, although the Zip Assembler
	does not support this instruction (yet).  This instruction
	also supports pipelined memory access.\\\hline
{\tt RESET}
	& \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
	& This depends upon the peripheral base address being
	preloaded into R12.

	Another opportunity might be to jump to the reset address from within
	supervisor mode.\\\hline
{\tt RET} & {\tt MOV R0,PC}
	& This depends upon the form of the {\tt JSR} given on the previous
	page that stores the return address into R0.
	\\\hline
{\tt STEP Rr,Rt}
	& \parbox[t]{1.5in}{\tt LSR \$1,Rr \\ XOR.C Rt,Rr}
	& Step a Galois implementation of a Linear Feedback Shift Register, Rr,
		using taps Rt \\\hline
%
%
{\tt SEX.b Rx }
	& \parbox[t]{1.5in}{\tt LSL 24,Rx \\ ASR 24,Rx}
	& Sign extend a byte into a full word.\\\hline
{\tt SEX.h Rx }
	& \parbox[t]{1.5in}{\tt LSL 16,Rx \\ ASR 16,Rx}
	& Sign extend a half word into a full word.\\\hline
%
{\tt STO.b Rx,\$addr}
	& \parbox[t]{1.5in}{\tt %
	LDI \$addr,Ra \\
	LDI \$addr,Rb \\
	LSR \$2,Ra \\
	AND \$3,Rb \\
	SUB \$32,Rb \\
	LOD (Ra),Ry \\
	AND \$0ffh,Rx \\
	AND \~\$0ffh,Ry \\
	ROL Rb,Rx \\
	OR Rx,Ry \\
	STO Ry,(Ra) }
	& \parbox[t]{3in}{This CPU and its bus are {\em not} optimized
	for byte-wise operations.

	Note that in this example, \$addr is a
	byte-wise address, whereas in all of our other examples it is a
	32-bit word address.  This also limits the address space
	of character accesses from 16~MB down to 4~MB.
	Further, this instruction implies a byte ordering,
	such as big or little endian.} \\\hline
{\tt SWAP Rx,Ry }
	& \parbox[t]{1.5in}{\tt XOR Ry,Rx \\ XOR Rx,Ry \\ XOR Ry,Rx}
	& While no extra registers are needed, this example
	does take three clocks. \\\hline
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-3}
\end{center}\end{table}
\begin{table}\begin{center}
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
{\tt TRAP \#X}
	& \parbox[t]{1.5in}{\tt LDI \$x,R0 \\ AND \~\$GIE,CC }
	& This works because whenever a user lowers the \$GIE flag, it sets
	a TRAP bit within the CC register.  Therefore, upon entering the
	supervisor state, the CPU only need check this bit to know that it
	got there via a TRAP.  The trap could be made conditional by making
	the LDI and the AND conditional.  In that case, the assembler would
	quietly turn the LDI instruction into an LDILO and LDIHI pair,
	but the effect would be the same. \\\hline
{\tt TS Rx,Ry,(Rz)}
	& \hbox{\tt LDI 1,Rx}
		\hbox{\tt LOCK}
		\hbox{\tt LOD (Rz),Ry}
		\hbox{\tt STO Rx,(Rz)}
	& A test and set instruction.  The {\tt LOCK} instruction ensures
	that the bus stays locked across the next two instructions,
	so no one else can use it.  This guarantees that the operation is
	atomic.
	\\\hline
{\tt TST Rx}
	& {\tt TST \$-1,Rx}
	& Set the condition codes based upon Rx.  One could also use
	CMP \$0,Rx, ADD \$0,Rx, SUB \$0,Rx, AND \$-1,Rx, etc.  The TST and CMP
	approaches won't stall future pipeline stages looking for the value
	of Rx. (Future versions of the assembler may shorten this to a
	{\tt TST Rx} instruction.)\\\hline
{\tt WAIT}
	& {\tt OR \$GIE | \$SLEEP,CC}
	& Wait until the next interrupt, then jump to supervisor/interrupt
	mode.
\end{tabular}
\caption{Derived Instructions, continued}\label{tbl:derived-4}
\end{center}\end{table}

\section{Interrupt Handling}
The Zip CPU does not maintain any interrupt vector tables.  If an interrupt
takes place, the CPU simply switches to interrupt mode.  The supervisor code
then continues in this interrupt mode from where it left off, that is, from
just after the return to userspace ({\tt RTU}) instruction it last executed.

At this point, the supervisor code needs to determine first whether an
interrupt has occurred, and then whether it is in interrupt mode due to
an exception, and handle each case appropriately.

\section{Pipeline Stages}
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
the Zip CPU supports a five stage pipeline.
\begin{enumerate}
\item {\bf Prefetch}: Reads instructions from memory and into a cache, if so
	configured.  This
	stage is actually pipelined itself, and so it will stall if the PC
	ever changes.  Stalls are also created here if the instruction isn't
	in the prefetch cache.

	The Zip CPU supports one of three prefetch methods, depending upon a
	flag set at build time within the {\tt cpudefs.v} file.  The simplest
	is a non--cached implementation of a prefetch.  This implementation is
	fairly small, and ideal for users of the Zip CPU who need the extra
	space on the FPGA fabric.
	However, because this non--cached version
	has no cache, throughput is limited
	to about one instruction every five clocks.

	The second prefetch module is a pipelined prefetch with a cache.  This
	module tries to keep the instruction address within a window of valid
	instruction addresses.  While effective, it is not a traditional
	cache implementation.  One unique feature of this cache implementation,
	however, is that it can be cleared in a single clock.  A disappointing
	feature, though, is that it needs an extra internal pipeline stage
	to be implemented.

	The third prefetch and cache module implements a more traditional cache.
	While the resulting code tends to be twice as fast as the pipelined
	cache architecture, this implementation uses a large amount of
	distributed FPGA RAM to be successful.  This then inflates the Zip CPU's
	FPGA usage statistics.

\item {\bf Decode}: Decodes an instruction into OpCode, register(s) to read,
	and immediate offset.  This stage also determines whether the flags
	will be set or whether the result will be written back.

\item {\bf Read Operands}: Read registers and apply any immediate values to
	them.  There is no means of detecting or flagging arithmetic overflow
	or carry when adding the immediate to the operand.  This stage will
	stall if any source operand is pending.

\item Split into one of four tracks: An {\bf ALU} track which will accomplish
	a simple instruction, the {\bf MemOps} stage which handles {\tt LOD}
	(load) and {\tt STO} (store) instructions, the {\bf divide} unit,
	and the {\bf floating point} unit.
	\begin{itemize}
	\item Loads will stall instructions in the decode stage, and with
		them the rest of the pipeline, until the load completes,
		lest a register be read in
		the read operands stage only to be updated unseen by the
		load.
	\item Condition codes are available upon completion of the ALU,
		divide, or FPU stage.
	\item Issuing a non--pipelined memory instruction to the memory unit
		while the memory unit is busy will stall the entire pipeline.
	\end{itemize}
\item {\bf Write-Back}: Conditionally write back the result to the register
	set, applying the condition.  This routine is quad-entrant: either the
	ALU, the memory, the divide, or the FPU may write back a register.
	The only design rule is that no more than a single register may be
	written back in any given clock.
\end{enumerate}

The Zip CPU does not support out of order execution.  Therefore, if the memory
unit stalls, every other instruction stalls.  The same is true for divide or
floating point instructions--all other instructions will stall while waiting
for these to complete.  Memory stores, however, can take place concurrently
with non--memory operations, although memory reads (loads) cannot.

\section{Pipeline Stalls}
The processing pipeline can and will stall for a variety of reasons.  Some of
these are obvious, some less so.  These reasons are listed below:
\begin{itemize}
\item When the prefetch cache is exhausted

This reason should be obvious.  If the prefetch cache doesn't hold the
next instruction, the entire pipeline must stall until an instruction
can be made ready.
In the case of the {\tt pipefetch} windowed approach
to the prefetch cache, this means the pipeline will stall until enough of the
prefetch cache is loaded to support the next instruction.  In the case
of the more traditional {\tt pfcache} approach, the entire cache line must
fill before instruction execution can continue.

\item While waiting for the pipeline to load following any taken branch, jump,
	return from interrupt or switch to interrupt context (4 stall cycles)

Fig.~\ref{fig:bcstalls}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/bc.eps}
\caption{A conditional branch generates 4 stall cycles}\label{fig:bcstalls}
\end{center}\end{figure}
illustrates the situation for a conditional branch.  In this case, the branch
instruction, {\tt BC}, is nominally followed by instructions {\tt I1} and so
forth.  However, since the branch is taken, the next instruction must be
{\tt IA}.  Therefore, the pipeline needs to be cleared and reloaded.
Given that there are five stages to the pipeline, that accounts
for the four stalls.  (Were the {\tt pipefetch} cache chosen, there would
be another stall internal to the {\tt pipefetch} cache.)

The Zip CPU handles the {\tt ADD \$X,PC} and
{\tt LDI \$X,PC} instructions specially, however.  These instructions, when
not conditioned on the flags, can execute with only a single stall cycle,
such as is shown in Fig.~\ref{fig:branch}.\footnote{Note that when using the
{\tt pipefetch} cache, this requires an additional stall cycle due to that
cache's implementation.}
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock
\caption{An expedited branch costs a single stall cycle}\label{fig:branch}
\end{center}\end{figure}
In this example, {\tt BR} is a branch always taken, {\tt I1} is the instruction
following the branch in memory, while {\tt IA} is the first instruction at the
branch address.  ({\tt CLR} denotes a clear--pipeline operation, and does
not represent any instruction.)

\item When reading from a prior register while also adding an immediate offset
\begin{enumerate}
\item\ {\tt OPCODE ?,RA}
\item\ {\em (stall)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

The immediate is added to operand B during the read operands stage, so that
the sum can be nicely settled before the ALU.  As a result,
any instruction that will write back an operand must be separated by one
instruction from the
opcode that will read that operand and apply an immediate offset to it.  The
good news is that this stall can easily be mitigated by proper scheduling.
That is, any instruction that does not add an immediate to {\tt RA} may be
scheduled into the stall slot.

This is also the reason why, when setting up a stack frame, the top of the
stack frame is used first: it eliminates this stall cycle.
Hence, to save
registers at the top of a procedure, one would write:
\begin{enumerate}
\item\ {\tt SUB 2,SP}
\item\ {\tt STO R1,(SP)}
\item\ {\tt STO R2,1(SP)}
\end{enumerate}
Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack,
there would've been an extra stall in setting up the stack frame.

\item When reading from the CC register after setting the flags
\begin{enumerate}
\item\ {\tt ALUOP RA,RB} {\em ; Ex: a compare opcode}
\item\ {\em (stall)}
\item\ {\tt TST sys.ccv,CC}
\item\ {\tt BZ somewhere}
\end{enumerate}

The reason for this stall is simply performance: many of the flags are
determined via combinatorial logic {\em during} the writeback cycle.
Trying to then place these into the input for one of the operands for an
ALU instruction during the same cycle
created a logic path too long to settle within a single 100~MHz
clock cycle.  (The time delay of the multiply within the ALU wasn't helping
either \ldots).

This stall may be eliminated via proper scheduling, by placing an instruction
that does not set flags in between the ALU operation and the instruction
that references the CC register.  For example, {\tt MOV \$addr+PC,uPC}
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not since
it doesn't read the condition codes before executing.

\item When waiting for a memory read operation to complete
\begin{enumerate}
\item\ {\tt LOD address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt OPCODE I+RA,RB}
\end{enumerate}

Remember, the Zip CPU does not support out of order execution.  Therefore,
anytime the memory unit becomes busy both the memory unit and the ALU must
stall until the memory unit is cleared.  This is illustrated in
Fig.~\ref{fig:memrd},
\begin{figure}\begin{center}
\includegraphics[width=5.6in]{../gfx/memrd.eps}
\caption{Pipeline handling of a load instruction}\label{fig:memrd}
\end{center}\end{figure}
since it is especially true of a load
instruction, which must still write its operand back to the register file.
Further, note that on a pipelined memory operation, the instruction must
stall in the decode operand stage, lest it try to read a result from the
register file before the load result has been written to it.  Finally, note
that there is an extra stall at the end of the memory cycle, so that
the memory unit will be idle for two clocks before an instruction will be
accepted into the ALU.
Store instructions are different, as shown in
Fig.~\ref{fig:memwr},
\begin{figure}\begin{center}
\includegraphics[width=4in]{../gfx/memwr.eps}
\caption{Pipeline handling of a store instruction}\label{fig:memwr}
\end{center}\end{figure}
since they can be busy with the bus without impacting later write back
pipeline stages.  Hence, only loads stall the pipeline.

This, of course, also assumes that the memory being accessed is a single cycle
memory and that there are no stalls to get to the memory.
Slower memories, such as the Quad SPI flash, will take longer--perhaps even
as long as forty clocks.  During this time the CPU and the external bus
will be busy, and unable to do anything else.  Likewise, if it takes a couple
of clock cycles for the bus to be free, as shown in both Figs.~\ref{fig:memrd}
and~\ref{fig:memwr}, there will be stalls.

\item Memory operation followed by a memory operation
\begin{enumerate}
\item\ {\tt STO address,RA}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\item\ {\tt LOD address,RB}
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
\end{enumerate}

In this case, the LOD instruction cannot start until the STO is finished,
as illustrated by Fig.~\ref{fig:mstld}.
\begin{figure}\begin{center}
\includegraphics[width=5.5in]{../gfx/mstld.eps}
\caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld}
\end{center}\end{figure}
With proper scheduling, it is possible to do something in the ALU while the
memory unit is busy with the STO instruction, but otherwise this pipeline will
stall while waiting for it to complete before the load instruction can
start.

The Zip CPU does have the capability of supporting pipelined memory access,
but only under the following conditions: all accesses within the pipeline
must be reads or all must be writes, all must use the same register for their
address, and there can be no stalls or other instructions between pipelined
memory access instructions.  Further, the offset to memory must be increasing
by one address each instruction.  These conditions work well for saving
registers to, or restoring them from, the stack.  Indeed, if you noticed, both
Fig.~\ref{fig:memrd} and Fig.~\ref{fig:memwr} illustrated pipelined memory
accesses.

\end{itemize}


\chapter{Peripherals}\label{chap:periph}

While the previous chapter describes a CPU in isolation, the Zip System
includes a minimum set of peripherals as well.  These peripherals are shown
in Fig.~\ref{fig:zipsystem}
\begin{figure}\begin{center}
\includegraphics[width=3.5in]{../gfx/system.eps}
\caption{Zip System Peripherals}\label{fig:zipsystem}
\end{center}\end{figure}
and described here.  They are designed to make
the Zip CPU more useful in an Embedded Operating System environment.

\section{Interrupt Controller}\label{sec:pic}

Perhaps the most important peripheral within the Zip System is the interrupt
controller.
While the Zip CPU itself can only handle one interrupt, and has
only the one interrupt state: disabled or enabled, the interrupt controller
can make things more interesting.

The Zip System interrupt controller module supports up to 15 interrupts, all
controlled from one register.  Bit~31 of the interrupt controller controls
overall whether interrupts are enabled (1'b1) or disabled (1'b0).  Bits~16--30
control whether individual interrupts are enabled (1'b1) or disabled (1'b0).
Bit~15 is an indicator showing whether or not any interrupt is active, and
bits~0--14 indicate whether or not an individual interrupt is active.

The interrupt controller has been designed so that bits can be controlled
individually without having any knowledge of the rest of the controller
setting.  To enable an interrupt, write to the register with the high order
global enable bit set and the respective interrupt enable bit set.  No other
bits will be affected.  To disable an interrupt, write to the register with
the high order global enable bit cleared and the respective interrupt enable
bit set.  To clear an interrupt, write a `1' to that interrupt's status bit.
Zeros written to the register have no effect, save that a zero written to the
master enable will disable all interrupts.

As an example, suppose you wished to enable interrupt \#4.  You would then
write to the register a {\tt 0x80100010} to enable interrupt \#4 and to clear
any past active state.  When you later wish to disable this interrupt, you would
write a {\tt 0x00100010} to the register.  As before, this both disables the
interrupt and clears the active indicator.  This also has the side effect of
disabling all interrupts, so a second write of {\tt 0x80000000} may be necessary
to re-enable any other interrupts.
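
The values in this example follow directly from the bit layout described above.  A minimal C sketch of how they might be composed is given here, assuming helper names ({\tt pic\_enable\_word}, {\tt pic\_disable\_word} and {\tt pic\_ack\_word}) that are purely illustrative and not part of the Zip System sources:

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u  /* bit 31: global interrupt enable */
#define PIC_NUM_IRQS      15u          /* interrupts 0..14 */

/* Word that enables interrupt `irq` and clears any stale active state. */
uint32_t pic_enable_word(unsigned irq)
{
    assert(irq < PIC_NUM_IRQS);
    return PIC_MASTER_ENABLE | (1u << (16u + irq)) | (1u << irq);
}

/*
 * Word that disables interrupt `irq` and clears its active state.  Bit 31
 * is written as zero, so this also turns the master enable off.
 */
uint32_t pic_disable_word(unsigned irq)
{
    assert(irq < PIC_NUM_IRQS);
    return (1u << (16u + irq)) | (1u << irq);
}

/* Word that only acknowledges (clears) interrupt `irq`. */
uint32_t pic_ack_word(unsigned irq)
{
    assert(irq < PIC_NUM_IRQS);
    return 1u << irq;
}
```

For interrupt \#4 these helpers yield {\tt 0x80100010} and {\tt 0x00100010}, matching the values quoted in the text.  After a disable, a further write of {\tt PIC\_MASTER\_ENABLE} alone restores the master enable.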

The Zip System currently hosts two interrupt controllers, a primary and a
secondary.  The primary interrupt controller has one (or more) interrupt line(s)
which may come from an external interrupt source, and one interrupt line from
the secondary controller.  Other primary interrupts include the system timers,
the jiffies interrupt, and the manual cache interrupt.  The secondary interrupt
controller maintains an interrupt state for all of the processor accounting
counters.

\section{Counter}

The Zip Counter is a very simple counter: it just counts.  It cannot be
halted.  When it rolls over, it issues an interrupt.  Writing a value to the
counter just sets the current value, and it starts counting again from that
value.

Eight counters are implemented in the Zip System for process accounting.
This may change in the future, as nothing as yet uses these counters.

\section{Timer}

The Zip Timer is also very simple: it simply counts down to zero.  When it
transitions from a one to a zero it creates an interrupt.

Writing any non-zero value to the timer starts the timer.  If the high order
bit is set when writing to the timer, the timer becomes an interval timer and
reloads its last start time on any interrupt.  Hence, to mark seconds, one
might set the timer to 100~million (the number of clocks per second), and
set the high bit.  Ever after, the timer will interrupt the CPU once per
second (assuming a 100~MHz clock).  This reload capability also limits the
maximum timer value to~$2^{31}-1$ (about 21~seconds using a 100~MHz clock),
rather than~$2^{32}-1$.
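
The arithmetic behind the one second example can be sketched in C as follows.  The names {\tt timer\_interval\_word} and {\tt timer\_max\_seconds} are hypothetical and chosen only for illustration; the sketch merely assumes, as described above, that the top bit selects interval mode and the low 31~bits hold the count.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_INTERVAL_BIT 0x80000000u  /* high order bit: reload on interrupt */
#define TIMER_MAX_COUNT    0x7FFFFFFFu  /* 2^31 - 1, the largest count */

/* Word to write for an interval timer that fires every `clocks` clocks. */
uint32_t timer_interval_word(uint32_t clocks)
{
    assert(clocks > 0u && clocks <= TIMER_MAX_COUNT);
    return TIMER_INTERVAL_BIT | clocks;
}

/* Longest period, in seconds, that fits in the count at a given clock rate. */
double timer_max_seconds(double clock_hz)
{
    return (double)TIMER_MAX_COUNT / clock_hz;
}
```

With a 100~MHz clock, {\tt timer\_interval\_word(100000000)} gives {\tt 0x85F5E100}, and {\tt timer\_max\_seconds(100e6)} comes to roughly 21.5~seconds, in line with the limit quoted above.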

\section{Watchdog Timer}

The watchdog timer is no different from any of the other timers, save for one
critical difference: the interrupt line from the watchdog
timer is tied to the reset line of the CPU.  Hence writing a `1' to the
watchdog timer will always reset the CPU.
To stop the watchdog timer, write a `0' to it.  To start it,
write any other number to it, as with the other timers.

While the watchdog timer supports interval mode, it doesn't make as much sense
as it did with the other timers.

\section{Bus Watchdog}
There is an additional watchdog timer on the Wishbone bus.  This timer,
however, is hardware configured and not software configured.  The timer is
reset at the beginning of any bus transaction, and only counts clocks during
such bus transactions.  If the bus transaction takes longer than the number
of counts the timer allots, it will raise a bus error flag to terminate the
transaction.  This is useful in the case of any peripherals that are
misbehaving.  If the bus watchdog terminates a bus transaction, the CPU may
then read from its port to find out which memory location created the problem.

Aside from its unusual configuration, the bus watchdog is just another
implementation of the fundamental timer described above--stripped down
for simplicity.

\section{Jiffies}

This peripheral is motivated by the Linux use of `jiffies' whereby a process
can request to be put to sleep until a certain number of `jiffies' have
elapsed.  Using this interface, the CPU can read the number of `jiffies'
from the peripheral (it only has the one location in address space), add the
sleep length to it, and write the result back to the peripheral.
The
{\tt zipjiffies}
peripheral will record the value written to it only if it is nearer the current
counter value than the currently pending interrupt time.  If no other
interrupts are waiting, and this time is in the future, it will be enabled.
(There is currently no way to disable a jiffie interrupt once set, other
than to disable the interrupt line in the interrupt controller.)  The processor
may then place this sleep request into a list among other sleep requests.
Once the timer expires, it would write the next Jiffy request to the peripheral
and wake up the process whose timer had expired.

Indeed, the Jiffies register is nothing more than a glorified counter with
an interrupt.  Unlike the other counters, the Jiffies register cannot be set.
Writes to the jiffies register create an interrupt time.  When the Jiffies
register later equals the value written to it, an interrupt will be asserted
and the register then continues counting as though no interrupt had taken
place.

The purpose of this register is to support alarm times within a CPU.  To
set an alarm for a particular process $N$ clocks in advance, read the current
Jiffies value, add $N$, and write it back to the Jiffies register.  The
O/S must also keep track of values written to the Jiffies register.  Thus,
when an `alarm' trips, it should be removed from the list of alarms, the list
should be resorted, and the next alarm in terms of Jiffies should be written
to the register--possibly for a second time.

\section{Direct Memory Access Controller}

The Direct Memory Access (DMA) controller can be used to either move memory
from one location to another, to read from a peripheral into memory, or to
write from memory to a peripheral, all without CPU intervention.
Further,
since the DMA controller can issue (and does issue) pipelined wishbone accesses,
any DMA memory move will by nature be faster than a corresponding program
accomplishing the same move.  To put this to numbers, it may take a program
18~clocks per word transferred, whereas this DMA controller can move one
word in two clocks--provided it has bus access.  (The CPU gets priority over
the bus.)

When copying memory from one location to another, the DMA controller will
copy in units of a given transfer length--up to 1024 words at a time.  It will
read that transfer length into its internal buffer, and then write to the
destination address from that buffer.

When coupled with a peripheral, the DMA controller can be configured to start
a memory copy when an interrupt line goes high.  Further, the controller can
be configured to issue reads from (or writes to) the same address instead of
incrementing the address at each clock.  The DMA completes once the total
number of items specified (not the transfer length) have been transferred.

In each case, once the transfer is complete and the DMA unit returns to
idle, the DMA will issue an interrupt.


\chapter{Operation}\label{chap:ops}

The Zip CPU, and even the Zip System, is not a System on a Chip (SoC).  It
needs to be connected to its operational environment in order to be used.
Specifically, some per system adjustments need to be made:
\begin{enumerate}
\item The Zip System depends upon an external 32-bit Wishbone bus.  This
	must exist, and must be connected to the Zip CPU for it to work.
\item The Zip System needs to be told of its {\tt RESET\_ADDRESS}.  This is
	the program counter of the first instruction following a reset.
\item To conserve logic, you'll want to set the {\tt ADDRESS\_WIDTH} parameter
	to the number of address bits on your wishbone bus.
\item Likewise, the {\tt LGICACHE} parameter sets the number of bits in
	the instruction cache address.  This means that the instruction cache
	will have $2^{\mbox{\tiny\tt LGICACHE}}$ locations within it.
\item If you want the Zip System to start up on its own, you will need to
	set the {\tt START\_HALTED} parameter to zero.  Otherwise, if you
	wish to manually start the CPU, that is if upon reset you want the
	CPU to start in its halted, reset state, then set this parameter to
	one.  This latter configuration is useful for a CPU that should be
	idle (i.e. halted) until given an explicit instruction from somewhere
	else to start.
\item Another parameter to set is the number of interrupts you will be
	providing from external to the CPU.  This can be anything from one
	to sixteen, but it cannot be zero.  (Set this to 1 and wire the single
	interrupt line to a 1'b0 if you do not wish to support any external
	interrupts.)
\item Finally, you need to place into some wishbone accessible address, whether
	RAM or (more likely) ROM, the initial instructions for the CPU.
\end{enumerate}
If you have enabled your CPU to start automatically, then upon power up the
CPU will immediately start executing your instructions, starting at the given
{\tt RESET\_ADDRESS}.

This is, however, not how I have used the Zip CPU.  I have instead used the
Zip CPU in a more controlled environment.  For me, the CPU starts in a
halted state, and waits to be told to start.  Further, the RESET address is a
location in RAM.
After bringing up the board I am using, and further the
bus that is on it, the RAM is then loaded externally with the program
I wish the Zip System to run.  Once the RAM is loaded, I release the CPU.
The CPU then runs until either its halt condition or an exception occurs in
supervisor mode, at which point its task is complete.

Eventually, I intend to place an operating system onto the Zip System; I'm
just not there yet.

The rest of this chapter examines some common programming models, and how they
might be applied to the Zip System, and then finishes with a couple of examples.

\section{System High}
The easiest and simplest way to run the Zip CPU is in the system high mode.
In this mode, the CPU runs your program in supervisor mode from reboot to
power down, and is never interrupted.  You will need to poll the interrupt
controller to determine when any external condition has become active.  This
mode is useful, and can handle many microcontroller tasks.

Even better, in system high mode, all of the user registers are available
to the system high program as variables.  Accessing these registers can be
done in a single clock cycle, which would move them to the active register
set or move them back.  While this may seem like a load or store instruction,
none of these register accesses will suffer from memory delays.

The one thing that cannot be done in supervisor mode is a wait for interrupt
instruction.  This, however, is easily rectified by jumping to a user task
within the supervisor's memory space, such as Tbl.~\ref{tbl:shi-idle}.
\begin{table}\begin{center}
\begin{tabbing}
{\tt supervisor\_idle:} \\
\hbox to 0.25in{}\={\em ; While not strictly required, the following move helps to} \\
\>	{\em ; ensure that the prefetch doesn't try to fetch an instruction} \\
\>	{\em ; outside of the CPU's address space when it switches to user} \\
\>	{\em ; mode.} \\
\>	{\tt MOV supervisor\_idle\_continue,uPC} \\
\>	{\em ; Put the processor into user mode and to sleep in the same} \\
\>	{\em ; instruction. } \\
\>	{\tt OR \$SLEEP|\$GIE,CC} \\
{\tt supervisor\_idle\_continue:} \\
\>	{\em ; Now, if we haven't done this inline, we need to return} \\
\>	{\em ; to whatever function called us.} \\
\>	{\tt RETN} \\
\end{tabbing}
\caption{Executing an idle from supervisor mode}\label{tbl:shi-idle}
\end{center}\end{table}

\section{Traditional Interrupt Handling}
Although the Zip CPU does not have a traditional interrupt architecture,
it is possible to create the more traditional interrupt approach via software.
In this mode, the programmable interrupt controller is used together with the
supervisor state to create the illusion of more traditional interrupt handling.

To set this up, upon reboot the supervisor task:
\begin{enumerate}
\item Creates a (single) user context, a user stack, and sets the user
	program counter to the entry of the user task
\item Creates a task table of ISR entries
\item Enables the master interrupt enable via the interrupt controller, albeit
	without enabling any of the fifteen potential underlying interrupts.
\item Switches to user mode, as the first part of the while loop in
	Tbl.~\ref{tbl:traditional-isr}.
\end{enumerate}
\begin{table}\begin{center}
\begin{tabbing}
{\tt while(true) \{} \\
\hbox to 0.25in{}\= {\tt rtu();}\\
	\> {\tt if (trap) \{} {\em // Here, we allow users to install ISRs, or} \\
	\>\hbox to 0.25in{}\= {\em // whatever else they may wish to do in supervisor mode.} \\
	\> {\tt \} else \{} \\
	\> \> {\tt volatile int *pic = PIC\_ADDRESS;} \\
\\
	\> \> {\em // Save the user context before running any ISRs.  This could easily be}\\
	\> \> {\em // implemented as an inline assembly routine or macro}\\
	\> \> {\tt SAVE\_PARTIAL\_CONTEXT; }\\
	\> \> {\em // At this point, we know an interrupt has taken place:  Ask the programmable}\\
	\> \> {\em // interrupt controller (PIC) which interrupts are enabled and which are active.}\\
	\> \> {\tt int picv = *pic;}\\
	\> \> {\em // Turn off all active interrupts}\\
	\> \> {\em // Globally disable interrupt generation in the process}\\
	\> \> {\tt int active = (picv >> 16) \& picv \& 0x07fff;}\\
	\> \> {\tt *pic = (active<<16);}\\
	\> \> {\em // We build a mask of interrupts to re-enable in picv.}\\
	\> \> {\tt picv = 0;}\\
	\> \> {\tt for(int i=0,msk=1; i<15; i++, msk<<=1) \{}\\
	\> \>\hbox to 0.25in{}\={\tt if ((active \& msk)\&\&(isr\_table[i])) \{}\\
	\> \>\>\hbox to 0.25in{}\= {\tt mov(isr\_table[i],uPC); }\\
	\> \>\>\> {\em // Acknowledge this particular interrupt.  While we could acknowledge all}\\
	\> \>\>\> {\em // interrupts at once, by acknowledging only those with ISRs we allow}\\
	\> \>\>\> {\em // the user process to use peripherals manually, and to manually check}\\
	\> \>\>\> {\em // whether or not those other interrupts had occurred.}\\
	\> \>\>\> {\tt *pic = msk; }\\
	\> \>\>\> {\tt rtu(); }\\
	\> \>\>\> {\em // The ISR will only exit on a trap in the Zip architecture.  There is}\\
	\> \>\>\> {\em // no {\tt RETI} instruction.  Since the PIC holds all interrupts disabled,}\\
	\> \>\>\> {\em // there is no need to check for further interrupts.}\\
	\> \>\>\> {\em // }\\
	\> \>\>\> {\em // The tricky part is that, because of how the PIC is built, the ISR cannot}\\
	\>\>\>\> {\em // re-enable its own interrupt without re-enabling all interrupts.  Hence, we}\\
	\>\>\>\> {\em // look at R0 upon ISR completion to know if an interrupt needs to be }\\
	\> \>\>\> {\em // re-enabled. }\\
	\> \>\>\> {\tt mov(uR0,tmp); }\\
	\> \>\>\> {\tt picv |= (tmp \& 0x7fff) << 16; }\\
	\> \>\> {\tt \} }\\
	\> \> {\tt \} }\\
	\> \> {\tt RESTORE\_PARTIAL\_CONTEXT; }\\
	\> \> {\em // Re-activate all (requested) interrupts }\\
	\> \> {\tt *pic = picv | 0x80000000; }\\
	\>{\tt \} }\\
{\tt \}}\\
\end{tabbing}
\caption{Traditional Interrupt handling}\label{tbl:traditional-isr}
\end{center}\end{table}

We can work through the interrupt handling process by examining
Tbl.~\ref{tbl:traditional-isr}.  First, remember, the CPU is always running
either the user or the supervisor context.  Once the supervisor switches to
user mode, control does not return until either an interrupt or a trap
has taken place.  (Okay, there's also the possibility of a bus error, or an
illegal instruction such as an unimplemented floating point instruction---but
for now we'll just focus on the trap instruction.)  Therefore, if the trap bit
isn't set, then we know an interrupt has taken place.

To process an interrupt, we steal the user's stack: the R0, CC, and PC
registers are saved on the stack, as outlined in Tbl.~\ref{tbl:save-partial}.
\begin{table}\begin{center}
\begin{tabbing}
SAVE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We save R0, CC, and PC only} \\
\>	{\tt MOV -3(uSP),R3} \\
\>	{\tt MOV uR0,R0} \\
\>	{\tt MOV uCC,R1} \\
\>	{\tt MOV uPC,R2} \\
\>	{\tt STO R0,(R3)} {\em ; Exploit memory pipelining: }\\
\>	{\tt STO R1,1(R3)} {\em ; All instructions write to stack }\\
\>	{\tt STO R2,2(R3)} {\em ; All offsets increment by one }\\
\>	{\tt MOV R3,uSP} {\em ; Return the updated stack pointer } \\
\end{tabbing}
\caption{Example Saving Minimal User Context}\label{tbl:save-partial}
\end{center}\end{table}
This is much cheaper than the full context swap of a preemptive multitasking
kernel, but it also depends upon the ISR saving any state it uses.  Further,
if multiple ISRs get called at once, this loses its optimality property
very quickly.

As Sec.~\ref{sec:pic} discusses, the top of the PIC register stores which
interrupts are enabled, and the bottom stores which have tripped.  (Interrupts
may trip without being enabled; they just will not generate an interrupt to the
CPU.)  Our first step is to query the register to find out our interrupt
state, and then to disable any interrupts that have tripped.  To do
that, we write a one to the enable half of the register while also clearing
the top bit (master interrupt enable).  This has the consequence of disabling
any and all further interrupts, not just the ones that have tripped.  Hence,
upon completion, we re--enable the master interrupt bit again.   Finally,
we keep track of which interrupts have tripped.

Using the bit mask of interrupts that have tripped, we walk through all fifteen
possible interrupts.  If there is an ISR installed, we acknowledge and reset
the interrupt within the PIC, and then call the ISR.
The ISR, however, cannot
re--enable its interrupt without re--enabling the master interrupt bit.  Thus,
to keep things simple, when the ISR is finished it places its interrupt
mask back into R0, or clears R0.  This tells the supervisor mode process which
interrupts to re--enable.  Any other registers that the ISR uses must be
saved and restored.  (This is only truly optimal if only a single ISR is
called.)  As a final instruction, the ISR clears the GIE bit, executing a user
trap.  (Remember, the Zip CPU has no {\tt RETI} instruction to restore the
stack and return to userland.  It needs to go through the supervisor mode to
get there.)

Then, once all interrupts are handled, the user context is restored in a
fashion similar to Tbl.~\ref{tbl:restore-partial}.
\begin{table}\begin{center}
\begin{tabbing}
RESTORE\_PARTIAL\_CONTEXT: \\
\hbox to 0.25in{}\= {\em ; We restore R0, CC, and PC only} \\
\>	{\tt MOV uSP,R3} {\em ; Fetch the user stack pointer } \\
\>	{\tt LOD (R3),R0} {\em ; Exploit memory pipelining: }\\
\>	{\tt LOD 1(R3),R1} {\em ; All instructions read from the stack }\\
\>	{\tt LOD 2(R3),R2} {\em ; All offsets increment by one }\\
\>	{\tt MOV R0,uR0} \\
\>	{\tt MOV R1,uCC} \\
\>	{\tt MOV R2,uPC} \\
\>	{\tt MOV 3(R3),uSP} \\
\end{tabbing}
\caption{Example Restoring Minimal User Context}\label{tbl:restore-partial}
\end{center}\end{table}
Again, this is short and sweet simply because any other registers that needed
saving were saved within the ISR.

There you have it: the Zip CPU, with its non-traditional interrupt architecture,
can still process interrupts in a very traditional fashion.

\section{Example: Idle Task}
One task every operating system needs is the idle task, the task that takes
place when nothing else can run.
On the Zip CPU, this task is quite simple,
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt idle\_task:} \\
&	{\em ; Wait for the next interrupt, then switch to supervisor task} \\
&	{\tt WAIT} \\
&	{\em ; When we come back, it's because the supervisor wishes to} \\
&	{\em ; wait for an interrupt again, so go back to the top.} \\
&	{\tt BRA idle\_task} \\
\end{tabular}
\caption{Example Idle Loop}\label{tbl:idle-asm}
\end{center}\end{table}
When this task runs, the CPU will fill up all of the pipeline stages up to the
ALU.  The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
a sleep state where nothing more moves.  Sure, there may be some more settling:
the pipe cache may continue to read until full, and other instructions may
issue until the pipeline fills, but then everything will stall.  Then, once an
interrupt takes place, control passes to the supervisor task to handle the
interrupt.  When control passes back to this task, it will be on the next
instruction.  Since that next instruction sends us back to the top of the task,
the idle task thus does nothing but wait for an interrupt.

This should be the lowest priority task, the task that runs when nothing else
can.  It will help lower the FPGA power usage overall---at least its dynamic
power usage.

\section{Example: Memory Copy}
One common operation is that of a memory move or copy.  Consider the C code
shown in Tbl.~\ref{tbl:memcp-c}.
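Taken literally, the C of Tbl.~\ref{tbl:memcp-c} is shorthand: standard C does
not permit dereferencing a {\tt void} pointer.  A compilable sketch of the same
loop, assuming every element is one 32--bit word and using the hypothetical
name {\tt memcp\_words}, might look like the following.

```c
#include <stdint.h>

/*
 * Word-oriented memory copy.  `len` counts 32-bit words, not bytes,
 * matching a CPU without explicit support for smaller data types.
 */
void memcp_words(uint32_t *dest, const uint32_t *src, int len)
{
    for (int i = 0; i < len; i++)
        *dest++ = *src++;
}
```

The loop body is unchanged from the table; only the pointer types differ.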
\begin{table}\begin{center}
\parbox{4in}{\begin{tabbing}
{\tt void} \= {\tt memcp(void *dest, void *src, int len) \{} \\
	\> {\tt for(int i=0; i<len; i++)} \\
	\> \hspace{0.2in} {\tt *dest++ = *src++;} \\
{\tt \}}
\end{tabbing}}
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
\end{center}\end{table}
This same code can be translated in Zip Assembly as shown in
Tbl.~\ref{tbl:memcp-asm}.
\begin{table}\begin{center}
\begin{tabular}{ll}
memcp: \\
&	{\em ; R0 = *dest, R1 = *src, R2 = LEN, R3 = return addr} \\
&	{\em ; The following will operate in $12N+19$ clocks.} \\
&	{\tt CMP 0,R2} \\ % 8 clocks per setup
&	{\tt MOV.Z R3,PC} {\em ; A conditional return }\\
&	{\tt SUB 1,SP} {\em ; Create a stack frame}\\
&	{\tt STO R4,(SP)} {\em ; and a local variable}\\
&	{\em ; (4 stalls, cannot be further scheduled away)} \\
loop: \\ % 12 clocks per loop
&	{\tt LOD (R1),R4} \\
&	{\em ; (4 stalls, cannot be scheduled away)} \\
&	{\tt STO R4,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
&	{\tt SUB 1,R2} \\
&	{\tt BZ memcpend} \\
&	{\tt ADD 1,R0} \\
&	{\tt ADD 1,R1} \\
&	{\tt BRA loop} \\
&	{\em ; (1 stall on a BRA instruction)} \\
memcpend: % 11 clocks
&	{\tt LOD (SP),R4} \\
&	{\em ; (4 stalls, cannot be further scheduled away)} \\
&	{\tt ADD 1,SP} \\
&	{\tt JMP R3} \\
&	{\em ; (4 stalls)} \\
\end{tabular}
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
\end{center}\end{table}
This example points out several things associated with the Zip CPU.  First,
a straightforward implementation of a for loop is not the fastest loop
structure.  For this reason, we have placed the test to continue at the
end.
Second, all pointers are {\tt void} pointers to arbitrary 32--bit
data types.  The Zip CPU does not have explicit support for smaller or larger
data types, and so this memory copy cannot be applied at a byte level.
Third, we've optimized the conditional jump to a return instruction into a
conditional return instruction.

\section{Example: Context Switch}

Fundamental to any multitasking system is the ability to switch from one
task to the next.  In the ZipSystem, this is accomplished in one of a couple
of ways.  The first step is that an interrupt happens.  Anytime an interrupt
happens, the CPU needs to execute the following tasks in supervisor mode:
\begin{enumerate}
\item Check for a trap instruction, or other user exception such as a break,
	bus error, division by zero error, or floating point exception.  That
	is, if the user process needs attending then we may not wish to adjust
	the context, check interrupts, or call the scheduler.
	Tbl.~\ref{tbl:trap-check}
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt return\_to\_user:} \\
&	{\em ; The instruction before the context switch processing must} \\
&	{\em ; be the RTU instruction that enacted user mode in the first} \\
&	{\em ; place.  We show it here just for reference.} \\
&	{\tt RTU} \\
{\tt trap\_check:} \\
&	{\tt MOV uCC,R0} \\
&	{\tt TST \$TRAP \textbar \$BUSERR \textbar \$DIVE \textbar \$FPE,R0} \\
&       {\tt BZ swap\_out} {\em ; No trap or exception bit set: an interrupt} \\
&       {\em ; Do something here to execute the trap} \\
&       {\em ; Don't need to call the scheduler, so we can just return} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\caption{Checking for whether the user task needs our attention}\label{tbl:trap-check}
\end{center}\end{table}
shows the rudiments of this code, while showing nothing of how the
actual trap would be implemented.

You may also wish to note that the instruction before the first instruction
in our context swap {\em must be} a return to userspace instruction.
Remember, the supervisor process is re--entered where it left off.  This is
different from many other processors that enter interrupt mode at some vector
or other.  In this case, we always enter supervisor mode right where we last
left.\footnote{The one exception to this rule is upon reset where supervisor
mode is entered at a pre--programmed wishbone memory address.}

\item Capture user counters.  If the operating system is keeping track of
system usage via the accounting counters, those counters need to be
copied and accumulated into some master counter at this point.

\item Preserve the old context.  This involves pushing all the user registers
onto the user stack and then copying the resulting stack address
into the task structure.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_out:} \\
&        {\tt MOV -15(uSP),R5} \\
&        {\tt STO R5,stack(R12)} \\
&        {\tt MOV uR0,R0} \\
&        {\tt MOV uR1,R1} \\
&        {\tt MOV uR2,R2} \\
&        {\tt MOV uR3,R3} \\
&        {\tt MOV uR4,R4} \\
&        {\tt STO R0,(R5)} {\em ; Exploit memory pipelining: }\\
&        {\tt STO R1,1(R5)} {\em ; All instructions write to stack }\\
&        {\tt STO R2,2(R5)} {\em ; All offsets increment by one }\\
&        {\tt STO R3,3(R5)} {\em ; Longest pipeline is 5 cycles.}\\
&        {\tt STO R4,4(R5)} \\
& \ldots {\em ; Need to repeat for all user registers} \\
\iffalse
&        {\tt MOV uR5,R0} \\
&        {\tt MOV uR6,R1} \\
&        {\tt MOV uR7,R2} \\
&        {\tt MOV uR8,R3} \\
&        {\tt MOV uR9,R4} \\
&        {\tt STO R0,5(R5) }\\
&        {\tt STO R1,6(R5) }\\
&        {\tt STO R2,7(R5) }\\
&        {\tt STO R3,8(R5) }\\
&        {\tt STO R4,9(R5)} \\
\fi
&        {\tt MOV uR10,R0} \\
&        {\tt MOV uR11,R1} \\
&        {\tt MOV uR12,R2} \\
&        {\tt MOV uCC,R3} \\
&        {\tt MOV uPC,R4} \\
&        {\tt STO R0,10(R5)}\\
&        {\tt STO R1,11(R5)}\\
&        {\tt STO R2,12(R5)}\\
&        {\tt STO R3,13(R5)}\\
&        {\tt STO R4,14(R5)} \\
&       {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
&       {\em ; elsewhere (in the task structure) }\\
\end{tabular}
\end{center}\end{table}
For the sake of discussion, we assume the supervisor maintains a
pointer to the current task's structure in supervisor register
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
structure indicating where the stack pointer is to be kept within it.

For those who are still interested, the full code for this context
save can be found as an assembler macro within the assembler
include file, {\tt sys.i}.

\item Reset the watchdog timer.  If you are using the watchdog timer, it should
be reset on a context swap, to know that things are still working.
Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
\begin{table}\begin{center}
\begin{tabular}{ll}
\multicolumn{2}{l}{{\tt define WATCHDOG\_TICKS 32'd1\_000\_000} {\em ; = 10 ms at 100 MHz}}\\
&       {\tt LDI 0xc0000001,R0} {\em ; WDT address, per Tbl.~\ref{tbl:zpregs}} \\
&       {\tt LDI WATCHDOG\_TICKS,R1} \\
&       {\tt STO R1,(R0)}
\end{tabular}
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
\end{center}\end{table}

\item Interrupt handling.  An interrupt handler within the Zip System is nothing
more than a task.  At context swap time, the supervisor needs to
disable all of the interrupts that have tripped, and then enable
all of the tasks that would deal with each of these interrupts.
These can be user tasks, run at higher priority than any other user
tasks.  Either way, they will need to re--enable their own interrupt
themselves, if the interrupt is still relevant.

An example of this master interrupt handling is shown in
Tbl.~\ref{tbl:pre-handler}.
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt pre\_handler:} \\
&       {\tt LDI PIC\_ADDRESS,R0 } \\
&       {\em ; Start by grabbing the interrupt state from the interrupt}\\
&       {\em ; controller.  We'll store this into the register R7 so that }\\
&       {\em ; we can keep and preserve this information for the scheduler}\\
&       {\em ; to use later. }\\
&       {\tt LOD (R0),R1} \\
&       {\tt MOV R1,R7 } \\
&       {\em ; As a next step, we need to acknowledge and disable all active}\\
&       {\em ; interrupts. We'll start by calculating all of our active}\\
&       {\em ; interrupts.}\\
&       {\tt AND 0x07fff,R1 } \\
&       {\em ; Put the active interrupts into the upper half of R1} \\
&       {\tt ROL 16,R1 } \\
&       {\tt LDILO 0x0ffff,R1   } \\
&       {\tt AND R7,R1}\\
&       {\em ; Acknowledge and disable active interrupts}\\
&       {\em ; This also disables all interrupts from the controller, so}\\
&       {\em ; we'll need to re-enable interrupts in general shortly } \\
&       {\tt STO R1,(R0) } \\
&       {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
&       {\em ; release any tasks that depended upon them. } \\
\end{tabular}
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
\end{center}\end{table}

\item Calling the scheduler.  This needs to be done to pick the next task
to switch to.  It may be an interrupt handler, or it may  be a normal
user task.  From a priority standpoint, it would make sense that the
interrupt handlers all have a higher priority than the user tasks,
and that once they have been called the user tasks may then be called
interrupt.

This suggests a minimum of four task priorities:
\begin{enumerate}
\item Interrupt handlers, executed with their interrupts disabled
\item Device drivers, executed with interrupts re-enabled
\item User tasks
\item The idle task, executed when nothing else is able to execute
\end{enumerate}

For our purposes here, we'll just assume that a pointer to the current
task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
called, and that the next current task is likewise placed into
{\tt R12}.

\item Restore the new task's context.  Given that the scheduler has returned a
task that can be run at this time, the stack pointer needs to be
read back out of the task structure and placed into the user stack pointer
register, and then the rest of the user registers need to be popped
back off of the stack to run this task.  An example of this is
shown in Tbl.~\ref{tbl:context-in},
\begin{table}\begin{center}
\begin{tabular}{ll}
{\tt swap\_in:} \\
&       {\tt LOD stack(R12),R5} \\
&       {\tt MOV 15(R5),uSP} \\
& {\em ; Be sure to exploit the memory pipelining capability} \\
&       {\tt LOD (R5),R0} \\
&       {\tt LOD 1(R5),R1} \\
&       {\tt LOD 2(R5),R2} \\
&       {\tt LOD 3(R5),R3} \\
&       {\tt LOD 4(R5),R4} \\
&       {\tt MOV R0,uR0} \\
&       {\tt MOV R1,uR1} \\
&       {\tt MOV R2,uR2} \\
&       {\tt MOV R3,uR3} \\
&       {\tt MOV R4,uR4} \\
& \ldots {\em ; Need to repeat for all user registers} \\
&       {\tt LOD 10(R5),R0} \\
&       {\tt LOD 11(R5),R1} \\
&       {\tt LOD 12(R5),R2} \\
&       {\tt LOD 13(R5),R3} \\
&       {\tt LOD 14(R5),R4} \\
&       {\tt MOV R0,uR10} \\
&       {\tt MOV R1,uR11} \\
&       {\tt MOV R2,uR12} \\
&       {\tt MOV R3,uCC} \\
&       {\tt MOV R4,uPC} \\
&       {\tt BRA return\_to\_user} \\
\end{tabular}
\end{center}\end{table}
assuming as before that the task
pointer is found in supervisor register {\tt R12}.
As with storing the user context, the full code associated with
restoring the user context can be found in the assembler include
file, {\tt sys.i}.

\item Clear the userspace accounting registers.  In order to keep track of
per process system usage, these registers need to be cleared before
reactivating the userspace process.  That way, upon the next
interrupt, we'll know how many clocks the userspace program has
encountered, and how many instructions it was able to issue in
those many clocks.

\item Jump back to the instruction just before saving the last task's context,
because that location in memory contains the return from interrupt
command that we are going to need to execute, in order to guarantee
that we return back here again.
\end{enumerate}
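The save and restore steps above can be modeled in C to make the stack
arithmetic explicit.  This is a sketch only, not the code from {\tt sys.i}:
the names {\tt struct task}, {\tt save\_user\_context}, and
{\tt restore\_user\_context} are hypothetical, the register file is passed in
as a plain array, and memory is assumed to be addressed in 32--bit words so
that ordinary pointer arithmetic matches the offsets used by {\tt swap\_out}
and {\tt swap\_in}.

```c
#include <stdint.h>

#define NUSER_SAVED 15   /* uR0..uR12, uCC, uPC; uSP lives in the task */

/* Hypothetical task structure: only the saved stack address matters here. */
struct task {
    uint32_t *stack;
};

/*
 * regs[0..12] = uR0..uR12, regs[13] = uCC, regs[14] = uPC.
 * Reserve 15 words below the user stack pointer, record that address in
 * the task structure, then store the registers at offsets 0..14.
 */
void save_user_context(struct task *t, uint32_t *usp,
                       const uint32_t regs[NUSER_SAVED])
{
    uint32_t *frame = usp - NUSER_SAVED;

    t->stack = frame;
    for (int i = 0; i < NUSER_SAVED; i++)
        frame[i] = regs[i];
}

/* Reload the registers from the saved frame and return the restored uSP. */
uint32_t *restore_user_context(const struct task *t,
                               uint32_t regs[NUSER_SAVED])
{
    const uint32_t *frame = t->stack;

    for (int i = 0; i < NUSER_SAVED; i++)
        regs[i] = frame[i];
    return t->stack + NUSER_SAVED;
}
```

The point of the model is the symmetry: the stack pointer moves down by
fifteen words on the way out and back up by the same fifteen on the way in,
so the only thing the task structure must remember is the frame address.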

\chapter{Registers}\label{chap:regs}

The ZipSystem registers fall into two categories: the ZipSystem internal
registers accessed via the ZipCPU, shown in Tbl.~\ref{tbl:zpregs},
\begin{table}[htbp]
\begin{center}\begin{reglist}
PIC   & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT   & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
      & \scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus error \\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA  & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRB  & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRC  & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
JIFF  & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
MTASK  & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline
MMSTL  & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline
MPSTL  & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline
MICNT  & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
UTASK  & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL  & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UPSTL  & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT  & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL  & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMALEN  & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMASRC  & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMADST  & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
% Cache  & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
\end{reglist}
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
\begin{table}[htbp]
\begin{center}\begin{reglist}
ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline
ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline
\end{reglist}
\caption{Zip System Debug Registers}\label{tbl:dbgregs}
\end{center}\end{table}

\section{Peripheral Registers}
The peripheral registers, listed in Tbl.~\ref{tbl:zpregs}, are shown in the
CPU's address space.  These may be accessed by the CPU at these addresses,
and when so accessed will respond as described in Chapt.~\ref{chap:periph}.
These registers will be discussed briefly again here.

\subsection{Interrupt Controller(s)}
The Zip CPU Interrupt controller has four different types of bits, as shown in
Tbl.~\ref{tbl:picbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Master Interrupt Enable\\\hline
30\ldots 16 & R/W & Interrupt Enables, write `1' to change\\\hline
15 & R & Current Master Interrupt State\\\hline
14\ldots 0 & R/W & Input Interrupt states, write `1' to clear\\\hline
\end{bitlist}
\caption{Interrupt Controller Register Bits}\label{tbl:picbits}
\end{center}\end{table}
The high order bit, or bit--31, is the master interrupt enable bit.  When this
bit is set, then any time an interrupt occurs the CPU will be interrupted and
will switch to supervisor mode, etc.

Bits 30~\ldots 16 are interrupt enable bits.  Should the interrupt line go
high while enabled, an interrupt will be generated.  (All interrupts are
positive edge triggered.)  To set an interrupt enable bit, one needs to write
the master interrupt enable while writing a `1' to this bit.  To clear, one
need only write a `0' to the master interrupt enable, while leaving this bit
high.

Bits 14\ldots 0 are the current state of the interrupt vector.  Interrupt lines
trip when they go high, and remain tripped until they are acknowledged.  If
the interrupt goes high for longer than one pulse, it may be high when a clear
is requested.  If so, the interrupt will not clear.  The line must go low
again before the status bit can be cleared.
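As a rough illustration of the bit layout just described, the following C
sketch splits a value read from the controller into its four fields.  The
names {\tt struct pic\_fields}, {\tt pic\_decode}, and {\tt pic\_pending} are
hypothetical and chosen only for this example, and the sketch assumes the
fifteen interrupt states occupy the bits below the master state bit.

```c
#include <stdint.h>

/* The four kinds of bits in the interrupt controller register. */
struct pic_fields {
    unsigned master_enable;  /* bit 31 */
    uint32_t enables;        /* bits 30..16, shifted down to 14..0 */
    unsigned master_state;   /* bit 15 */
    uint32_t states;         /* bits 14..0 */
};

struct pic_fields pic_decode(uint32_t word)
{
    struct pic_fields f;

    f.master_enable = (word >> 31) & 1u;
    f.enables       = (word >> 16) & 0x7fffu;
    f.master_state  = (word >> 15) & 1u;
    f.states        = word & 0x7fffu;
    return f;
}

/* Lines that are both enabled and tripped, i.e. those needing service. */
uint32_t pic_pending(uint32_t word)
{
    return (word >> 16) & word & 0x7fffu;
}
```

The expression in {\tt pic\_pending} is the same one used to compute
{\tt active} in Tbl.~\ref{tbl:traditional-isr}.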

As an example, consider the following scenario where the Zip CPU supports four
interrupts, 3\ldots0.
\begin{enumerate}
\item The Supervisor will first, while in the interrupts disabled mode,
write a {\tt 32'h800f000f} to the controller.  The supervisor may then
switch to the user state with interrupts enabled.
\item When an interrupt occurs, the supervisor will switch to the interrupt
state.  It will then cycle through the interrupt bits to learn which
interrupt handler to call.
\item If the interrupt handler expects more interrupts, it will clear its
current interrupt when it is done handling the interrupt in question.
To do this, it will write a `1' to the low order interrupt mask,
such as writing a {\tt 32'h0000\_0001}.
\item If the interrupt handler does not expect any more interrupts, it will
instead clear the interrupt from the controller by writing a
{\tt 32'h0001\_0001} to the controller.
\item Once all interrupts have been handled, the supervisor will write a
{\tt 32'h8000\_0000} to the interrupt register to re-enable interrupt
generation.
\item The supervisor should also check the user trap bit, and possible soft
interrupt bits here, but this action has nothing to do with the
interrupt control register.
\item The supervisor will then leave interrupt mode, possibly adjusting
whichever task is running, by executing a return from interrupt
command.
\end{enumerate}
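The constants written in the scenario above follow a simple pattern, which a
few C helpers can make explicit.  These helpers are a sketch for illustration:
the names {\tt pic\_enable\_word}, {\tt pic\_ack\_word}, and
{\tt pic\_ack\_disable\_word} are hypothetical, and each simply rebuilds one of
the values quoted in the steps from a mask of interrupt lines.

```c
#include <stdint.h>

#define PIC_MASTER_ENABLE 0x80000000u   /* bit 31, as written in step 5 */
#define PIC_LINE_MASK     0x00007fffu   /* fifteen interrupt lines */

/* Step 1: set the master enable, enable `lines`, and clear stale state. */
uint32_t pic_enable_word(uint32_t lines)
{
    lines &= PIC_LINE_MASK;
    return PIC_MASTER_ENABLE | (lines << 16) | lines;
}

/* Step 3: acknowledge `lines`, leaving the enable half written as zeros. */
uint32_t pic_ack_word(uint32_t lines)
{
    return lines & PIC_LINE_MASK;
}

/* Step 4: acknowledge `lines` and name them in the enable half as well. */
uint32_t pic_ack_disable_word(uint32_t lines)
{
    lines &= PIC_LINE_MASK;
    return (lines << 16) | lines;
}
```

With four interrupts, {\tt pic\_enable\_word(0xf)} gives the
{\tt 32'h800f000f} of the first step, while a handler for interrupt zero would
use {\tt pic\_ack\_word(1)} or {\tt pic\_ack\_disable\_word(1)} for the third
and fourth steps respectively.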

\subsection{Timer Register}

Leaving the interrupt controller, we show the timer register's bit definitions
in Tbl.~\ref{tbl:tmrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R/W & Auto--Reload\\\hline
30\ldots 0 & R/W & Current timer value\\\hline
\end{bitlist}
\caption{Timer Register Bits}\label{tbl:tmrbits}
\end{center}\end{table}
As you may recall, the timer just counts down to zero and then trips an
interrupt.  Writing to the current timer value sets that value, and reading
from it returns that value.  Writing to the current timer value while also
setting the auto--reload bit will send the timer into an auto--reload mode.
In this mode, upon setting its interrupt bit for one cycle, the timer will
also reset itself back to the value of the timer that was written to it when
the auto--reload option was written to it.  To clear and stop the timer,
simply write a {\tt 32'h00} to this register.
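A small C sketch may help show how a timer word is put together.  It rests on
an assumption that should be checked against the timer's implementation: that
the auto--reload bit is bit 31, the one bit above the 31--bit timer value.  The
names {\tt timer\_word} and {\tt timer\_ticks\_from\_us} are hypothetical, and
the clock rate is whatever the caller supplies.

```c
#include <stdint.h>

#define TIMER_AUTO_RELOAD 0x80000000u  /* assumed position of the reload bit */
#define TIMER_VALUE_MASK  0x7fffffffu  /* bits 30..0: current timer value */

/* Build the word to write: a tick count, optionally with auto-reload set. */
uint32_t timer_word(uint32_t ticks, int auto_reload)
{
    uint32_t word = ticks & TIMER_VALUE_MASK;

    if (auto_reload)
        word |= TIMER_AUTO_RELOAD;
    return word;
}

/* Convert a period in microseconds into clock ticks at `clock_hz`. */
uint32_t timer_ticks_from_us(uint32_t clock_hz, uint32_t us)
{
    return (uint32_t)(((uint64_t)clock_hz * us) / 1000000u);
}
```

For instance, at a 100~MHz clock a 10~ms period is one million ticks, the same
figure used for the watchdog earlier, and {\tt timer\_word(0, 0)} yields the
all--zero word that clears and stops the timer.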

2388 69 dgisselq
\subsection{Jiffies}
2389

2390 33 dgisselq
The Jiffies register is somewhat similar in that the register always changes.
2391
In this case, the register counts up, whereas the timer always counted down.
2392
Reads from this register, as shown in Tbl.~\ref{tbl:jiffybits},
2393
\begin{table}\begin{center}
2394
\begin{bitlist}
2395
31\ldots 0 & R & Current jiffy value\\\hline
2396
31\ldots 0 & W & Value/time of next interrupt\\\hline
2397
\end{bitlist}
2398
\caption{Jiffies Register Bits}\label{tbl:jiffybits}
2399
\end{center}\end{table}
2400
always return the time value contained in the register.  Writes greater than
2401
the current Jiffy value, that is where the new value minus the old value is
2402
greater than zero while ignoring truncation, will set a new Jiffy interrupt
2403
time.  At that time, the Jiffy vector will clear, and another interrupt time
2404
may either be written to it, or it will just continue counting without
2405
activating any more interrupts.
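
The comparison ``new minus current is greater than zero while ignoring
truncation'' can be sketched in Python as below.  This is only an
illustration: the function names are hypothetical, and reading the test as a
wrapping 32--bit subtraction whose result is interpreted as a signed number is
an assumption about what the hardware does.

```python
MASK32 = 0xFFFFFFFF


def signed32(x):
    """Interpret the low 32 bits of x as a two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def jiffies_write_accepted(new, now):
    """True if `new` lies in the future of `now`, modulo 2**32 (assumed rule)."""
    return signed32(new - now) > 0
```

Under this reading a write still behaves sensibly when the counter is about
to roll over: {\tt jiffies\_write\_accepted(5, 0xFFFFFFF0)} is true, because 5
is only 21 ticks ahead of {\tt 0xFFFFFFF0} once the counter wraps.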

\subsection{Performance Counters}

The Zip CPU also supports several counter peripherals, mostly in the way of
process accounting.  These peripherals each have a single register associated
with them, shown in Tbl.~\ref{tbl:ctrbits}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 0 & R/W & Current counter value\\\hline
\end{bitlist}
\caption{Counter Register Bits}\label{tbl:ctrbits}
\end{center}\end{table}
Writes to this register set the new counter value.  Reads return the current
counter value.

The current design operation of these counters is that of performance counting.
Two sets of four registers are available for keeping track of performance.
The first is a task counter.  This just counts clock ticks.  The second
counter is a prefetch stall counter, and the third a master stall counter.  These
allow the CPU to be evaluated as to how efficient it is.  The fourth and
final counter is an instruction counter, which counts how many instructions the
CPU has issued.

It is envisioned that these counters will be used as follows: First, every time
a master counter rolls over, the supervisor (Operating System) will record
the fact.  Second, whenever activating a user task, the Operating System will
set the four user counters to zero.  When the user task has completed, the
Operating System will read the counters back, to determine how much of the
CPU the process had consumed.  To keep this accurate, the user counters will
only increment when the GIE bit is set to indicate that the processor is
in user mode.
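
As a sketch of how an operating system might summarize the four user counters
once a task completes, consider the Python helper below.  The function name
{\tt task\_stats} and the derived figures are hypothetical and are not part of
the Zip CPU distribution; the sketch only assumes that the four values have
already been read from the counters described above.

```python
def task_stats(task_clocks, mem_stalls, prefetch_stalls, instructions):
    """Summarize one task's counters as clocks-per-instruction and stall shares."""
    if task_clocks <= 0 or instructions <= 0:
        raise ValueError("task must have run for at least one instruction")
    return {
        "clocks_per_instruction": task_clocks / instructions,
        "mem_stall_fraction": mem_stalls / task_clocks,
        "prefetch_stall_fraction": prefetch_stalls / task_clocks,
    }
```

For example, a task that ran for 1000~clocks, stalled 100~clocks on memory and
150~clocks on the prefetch, and issued 500~instructions averaged two clocks per
instruction.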

\subsection{DMA Controller}

The final peripheral to discuss is the DMA controller.  This controller
has four registers.  Of these four, the length, source and destination address
registers should need no further explanation.  They are full 32--bit registers
specifying the entire transfer length, the starting address to read from, and
the starting address to write to.  The registers can be written to when the
DMA is idle, and read at any time.  The control register, however, will need
some more explanation.

The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
\begin{table}\begin{center}
\begin{bitlist}
31 & R & DMA Active\\\hline
30 & R & Wishbone error, transaction aborted.  This bit is cleared the next time
this register is written to.\\\hline
29 & R/W & Set to `1' to prevent the controller from incrementing the
source address, `0' for normal memory copy. \\\hline
28 & R/W & Set to `1' to prevent the controller from incrementing the
destination address, `0' for normal memory copy. \\\hline
27 \ldots 16 & W & The DMA Key.  Write a 12'hfed to these bits to
activate any DMA transfer.  \\\hline
27 & R & Always reads `0', to force the deliberate writing of the key. \\\hline
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
have yet to be written. \\\hline
15 & R/W & Set to `1' to trigger on an interrupt, or `0' to start immediately
upon receiving a valid key.\\\hline
14\ldots 10 & R/W & Select among one of 32~possible interrupt lines.\\\hline
9\ldots 0 & R/W & Intermediate transfer length minus one.  Thus, to transfer
one item at a time set this value to 0. To transfer 1024 at a time,
set it to 1023.\\\hline
\end{bitlist}
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
\end{center}\end{table}
This control register has been designed so that the common case of memory
access need only set the key and the transfer length.  Hence, writing a
\hbox{32'h0fed03ff} to the control register will start any memory transfer.
On the other hand, if you wished to read from a serial port (constant address)
and put the result into a buffer every time a word was available, you
might wish to write \hbox{32'h2fed8000}--this assumes, of course, that you
have a serial port wired to the zero bit of this interrupt control.  (The
DMA controller does not use the interrupt controller, and cannot clear
interrupts.)  As a third example, if you wished to write to an external
FIFO anytime it was less than half full (had fewer than 512 items), and
interrupt line 3 indicated this condition, you might wish to issue a
\hbox{32'h1fed8dff} to this port.
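
To make the field packing concrete, here is a small Python sketch that
assembles a control word from the fields in Tbl.~\ref{tbl:dmacbits}.  The
helper {\tt dma\_control} and its parameter names are hypothetical and exist
only for illustration; the bit positions and the key value are those given in
the table.

```python
DMA_KEY = 0xFED  # key value from the control register table


def dma_control(chunk_len, *, hold_src=False, hold_dst=False,
                on_interrupt=False, int_line=0):
    """Build a DMA control word from the bit fields in the table above."""
    if not 1 <= chunk_len <= 1024:
        raise ValueError("chunk_len must be between 1 and 1024")
    if not 0 <= int_line < 32:
        raise ValueError("int_line must be between 0 and 31")
    word = (DMA_KEY << 16) | (chunk_len - 1)   # bits 27..16 and bits 9..0
    word |= int(hold_src) << 29                # do not increment the source
    word |= int(hold_dst) << 28                # do not increment the destination
    word |= int(on_interrupt) << 15            # wait for the selected interrupt
    word |= int_line << 10                     # bits 14..10
    return word


# Plain memory copy, 1024 words per burst
assert dma_control(1024) == 0x0FED03FF
# Constant source address, one word per interrupt on line 0
assert dma_control(1, hold_src=True, on_interrupt=True) == 0x2FED8000
```

Because the length field holds the burst length minus one, the largest burst
of 1024~words is encoded as {\tt 0x3ff}.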

\section{Debug Port Registers}
Accessing the Zip System via the debug port isn't as straightforward as
accessing the system via the wishbone bus.  The debug port itself has been
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
Access to the Zip System begins with the Debug Control register, shown in
Tbl.~\ref{tbl:dbgctrl}.
\begin{table}\begin{center}
\begin{bitlist}
31\ldots 14 & R & External interrupt state.  Bit 14 is valid for one
interrupt only, bit 15 for two, etc.\\\hline
13 & R & CPU GIE setting\\\hline
12 & R & CPU is sleeping\\\hline
11 & W & Command clear PF cache\\\hline
10 & R/W & Command HALT, Set to `1' to halt the CPU\\\hline
9 & R & Stall Status, `1' if CPU is busy (i.e., not halted yet)\\\hline
8 & R/W & Step Command, set to `1' to step the CPU, also sets the halt bit\\\hline
7 & R & Interrupt Request Pending\\\hline
6 & R/W & Command RESET \\\hline
5\ldots 0 & R/W & Debug Register Address \\\hline
\end{bitlist}
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
\end{center}\end{table}

The first step in debugging access is to determine whether or not the CPU
is halted, and to halt it if not.  To do this, first write a `1' to the
Command HALT bit.  This will halt the CPU and place it into debug mode.
Once the CPU is halted, the stall status bit will drop to zero.  Thus,
if bit 10 is high and bit 9 low, the debug port is open to examine the
internal state of the CPU.
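
A debugger might decode a value read from the Debug Control register along the
lines of the Python sketch below.  The function and field names are
hypothetical; only the bit positions come from Tbl.~\ref{tbl:dbgctrl}.

```python
def decode_debug_control(word):
    """Split a Debug Control register value into named fields."""
    return {
        "ext_interrupts": (word >> 14) & 0x3FFFF,  # bits 31..14
        "gie": bool(word & (1 << 13)),
        "sleeping": bool(word & (1 << 12)),
        "halt": bool(word & (1 << 10)),
        "stalled": bool(word & (1 << 9)),
        "step": bool(word & (1 << 8)),
        "irq_pending": bool(word & (1 << 7)),
        "reset": bool(word & (1 << 6)),
        "reg_addr": word & 0x3F,                   # bits 5..0
    }


def debug_port_open(word):
    """True when HALT (bit 10) is high and the stall status (bit 9) is low."""
    fields = decode_debug_control(word)
    return fields["halt"] and not fields["stalled"]
```

A debugger would typically poll the control register until
{\tt debug\_port\_open} returns true before touching any internal register.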

At this point, the external debugger may examine internal state information
from within the CPU.  To do this, first write again to the command register
a value (with command halt still high) containing the address of an internal
register of interest in the bottom 6~bits.  Internal registers that may be
accessed this way are listed in Tbl.~\ref{tbl:dbgaddrs}.
\begin{table}\begin{center}
\begin{reglist}
sR0 & 0 & 32 & R/W & Supervisor Register R0 \\\hline
sR1 & 1 & 32 & R/W & Supervisor Register R1 \\\hline
sSP & 13 & 32 & R/W & Supervisor Stack Pointer\\\hline
sCC & 14 & 32 & R/W & Supervisor Condition Code Register \\\hline
sPC & 15 & 32 & R/W & Supervisor Program Counter\\\hline
uR0 & 16 & 32 & R/W & User Register R0 \\\hline
uR1 & 17 & 32 & R/W & User Register R1 \\\hline
uSP & 29 & 32 & R/W & User Stack Pointer\\\hline
uCC & 30 & 32 & R/W & User Condition Code Register \\\hline
uPC & 31 & 32 & R/W & User Program Counter\\\hline
PIC & 32 & 32 & R/W & Primary Interrupt Controller \\\hline
WDT & 33 & 32 & R/W & Watchdog Timer\\\hline
BUS & 34 & 32 & R & Last Bus Error\\\hline
CTRIC & 35 & 32 & R/W & Secondary Interrupt Controller\\\hline
TMRA & 36 & 32 & R/W & Timer A\\\hline
TMRB & 37 & 32 & R/W & Timer B\\\hline
TMRC & 38 & 32 & R/W & Timer C\\\hline
JIFF & 39 & 32 & R/W & Jiffies peripheral\\\hline
MTASK & 40 & 32 & R/W & Master task clock counter\\\hline
MMSTL & 41 & 32 & R/W & Master memory stall counter\\\hline
MPSTL & 42 & 32 & R/W & Master Pre-Fetch Stall counter\\\hline
MICNT & 43 & 32 & R/W & Master instruction counter\\\hline
UTASK & 44 & 32 & R/W & User task clock counter\\\hline
UMSTL & 45 & 32 & R/W & User memory stall counter\\\hline
UPSTL & 46 & 32 & R/W & User Pre-Fetch Stall counter\\\hline
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
DMACMD & 48 & 32 & R/W & DMA command and status register\\\hline
DMALEN & 49 & 32 & R/W & DMA transfer length\\\hline
DMAWR & 51 & 32 & R/W & DMA write address\\\hline
\end{reglist}
\end{center}\end{table}
Primarily, these ``registers'' include access to the entire CPU register
set, as well as the internal peripherals.  To read one of these registers
once the address is set, simply issue a read from the data port.  To write
one of these registers or peripheral ports, simply write to the data port
once the address is set.

In this manner, all of the CPU's internal state may be read and adjusted.
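
The select--then--access sequence can be sketched in Python as follows.  Since
how a debugger actually reaches the two debug addresses depends on the host
interface, the {\tt FakeDebugPort} class below is a purely hypothetical
in--memory stand--in, as are the helper names; the only details drawn from this
section are the halt bit, the 6--bit address field, and the order of operations.

```python
CMD_HALT = 1 << 10   # Command HALT bit of the Debug Control register


class FakeDebugPort:
    """In-memory stand-in for the two-address debug port (illustration only)."""

    def __init__(self):
        self.regs = [0] * 64
        self.ctrl = 0

    def write_ctrl(self, word):
        self.ctrl = word

    def read_data(self):
        return self.regs[self.ctrl & 0x3F]

    def write_data(self, value):
        self.regs[self.ctrl & 0x3F] = value & 0xFFFFFFFF


def select(port, addr):
    """Write the register address to the control port, keeping HALT high."""
    if not 0 <= addr < 64:
        raise ValueError("debug register address must fit in 6 bits")
    port.write_ctrl(CMD_HALT | addr)


def read_debug_reg(port, addr):
    """Select a register, then read it through the data port."""
    select(port, addr)
    return port.read_data()


def write_debug_reg(port, addr, value):
    """Select a register, then write it through the data port."""
    select(port, addr)
    port.write_data(value)
```

With a real transport in place of {\tt FakeDebugPort}, reading the supervisor
program counter would then be {\tt read\_debug\_reg(port, 15)}.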

As an example of how to use this, consider what would happen in the case
of an external breakpoint.  If and when the CPU hits a breakpoint that
causes it to halt, the Command HALT bit will activate on its own.  The CPU
will then raise an external interrupt line and wait for a debugger to examine
its state.  After examining the state, the debugger will need to remove
the breakpoint by writing a different instruction into memory and by writing
to the command register while holding the clear cache, command halt, and
step CPU bits high (32'hd00).  The debugger may then replace the breakpoint
now that the CPU has gone beyond it, and clear the cache again (32'h500).

To leave this debug mode, simply write a `32'h0' value to the command register.
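
The command words used in this sequence are simply the OR of the relevant
control bits.  A minimal Python sketch, with hypothetical constant and
function names and the bit positions of Tbl.~\ref{tbl:dbgctrl}, is:

```python
CMD_CLEAR_CACHE = 1 << 11
CMD_HALT = 1 << 10
CMD_STEP = 1 << 8
CMD_RESET = 1 << 6


def debug_command(*flags, reg_addr=0):
    """OR together command bits and an optional 6-bit register address."""
    if not 0 <= reg_addr < 64:
        raise ValueError("debug register address must fit in 6 bits")
    word = reg_addr
    for flag in flags:
        word |= flag
    return word


# Clear the cache, stay halted, and step one instruction
assert debug_command(CMD_CLEAR_CACHE, CMD_HALT, CMD_STEP) == 0xD00
# Writing zero drops every command bit and releases the CPU
assert debug_command() == 0
```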

\chapter{Wishbone Datasheets}\label{chap:wishbone}
The Zip System supports two wishbone ports, a slave debug port and a master
port for the system itself.  These are shown in Tbl.~\ref{tbl:wishbone-slave}
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Slave, Read/Write, single words only \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
XuLA2--LX25\\\hline
Signal Names & \begin{tabular}{ll}
Signal Name & Wishbone Equivalent \\\hline
{\tt i\_clk} & {\tt CLK\_I} \\
{\tt i\_dbg\_cyc} & {\tt CYC\_I} \\
{\tt i\_dbg\_stb} & {\tt STB\_I} \\
{\tt i\_dbg\_we} & {\tt WE\_I} \\
{\tt i\_dbg\_data} & {\tt DAT\_I} \\
{\tt o\_dbg\_ack} & {\tt ACK\_O} \\
{\tt o\_dbg\_stall} & {\tt STALL\_O} \\
{\tt o\_dbg\_data} & {\tt DAT\_O}
\end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the Debug Interface}\label{tbl:wishbone-slave}
\end{center}\end{table}
and Tbl.~\ref{tbl:wishbone-master} respectively.
\begin{table}[htbp]
\begin{center}
\begin{wishboneds}
Revision level of wishbone & WB B4 spec \\\hline
Type of interface & Master, Read/Write, single cycle or pipelined\\\hline
Address Width & (Zip System parameter, can be up to 32~bits) \\\hline
Port size & 32--bit \\\hline
Port granularity & 32--bit \\\hline
Maximum Operand Size & 32--bit \\\hline
Data transfer ordering & (Irrelevant) \\\hline
Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a
XuLA2--LX25\\\hline
Signal Names & \begin{tabular}{ll}
Signal Name & Wishbone Equivalent \\\hline
{\tt i\_clk} & {\tt CLK\_I} \\
{\tt o\_wb\_cyc} & {\tt CYC\_O} \\
{\tt o\_wb\_stb} & {\tt STB\_O} \\
{\tt o\_wb\_we} & {\tt WE\_O} \\
{\tt o\_wb\_data} & {\tt DAT\_O} \\
{\tt i\_wb\_ack} & {\tt ACK\_I} \\
{\tt i\_wb\_stall} & {\tt STALL\_I} \\
{\tt i\_wb\_data} & {\tt DAT\_I} \\
{\tt i\_wb\_err} & {\tt ERR\_I}
\end{tabular}\\\hline
\end{wishboneds}
\caption{Wishbone Datasheet for the CPU as Master}\label{tbl:wishbone-master}
\end{center}\end{table}
I do not recommend that you connect these together through the interconnect.
Rather, the debug port of the CPU should be accessible regardless of the state
of the master bus.

You may wish to notice that neither the {\tt LOCK} nor the {\tt RTY} (retry)
wires have been connected to the CPU's master interface.  If necessary, a
rudimentary {\tt LOCK} may be created by tying the wire to the {\tt wb\_cyc}
line.  As for the {\tt RTY}, all the CPU recognizes at this point are bus
errors; it cannot tell the difference between a temporary and a permanent bus
error.

\chapter{Clocks}\label{chap:clocks}

This core is based upon the Basys--3 development board sold by Digilent.
The Basys--3 development board contains one external 100~MHz clock, which is
sufficient to run the Zip CPU core.
\begin{table}[htbp]
\begin{center}
\begin{clocklist}
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
\end{clocklist}
\caption{List of Clocks}\label{tbl:clocks}
\end{center}\end{table}
I hesitate to suggest that the core can run faster than 100~MHz, since I have
struggled with various timing violations to keep it at 100~MHz.  So, for
now, I will only state that it can run at 100~MHz.

On a SPARTAN 6, the clock can run successfully at 80~MHz.

\chapter{I/O Ports}\label{chap:ioports}
The I/O ports to the Zip CPU may be grouped into three categories.  The first
is that of the master wishbone used by the CPU, then the slave wishbone used
to command the CPU via a debugger, and then the rest.  The first two of these
were already discussed in the wishbone chapter.  They are listed here
for completeness in Tbl.~\ref{tbl:iowb-master}
\begin{table}
\begin{center}\begin{portlist}
{\tt o\_wb\_cyc}   &  1 & Output & Indicates an active Wishbone cycle\\\hline
{\tt o\_wb\_stb}   &  1 & Output & WB Strobe signal\\\hline
{\tt o\_wb\_we}    &  1 & Output & Write enable\\\hline
{\tt o\_wb\_data}  & 32 & Output & Data on WB write\\\hline
{\tt i\_wb\_ack}   &  1 & Input  & Slave has completed a R/W cycle\\\hline
{\tt i\_wb\_stall} &  1 & Input  & WB bus slave not ready\\\hline
{\tt i\_wb\_data}  & 32 & Input  & Incoming bus data\\\hline
{\tt i\_wb\_err}   &  1 & Input  & Bus Error indication\\\hline
\end{portlist}\caption{CPU Master Wishbone I/O Ports}\label{tbl:iowb-master}\end{center}\end{table}
and~\ref{tbl:iowb-slave} respectively.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_dbg\_cyc}   &  1 & Input & Indicates an active Wishbone cycle\\\hline
{\tt i\_dbg\_stb}   &  1 & Input & WB Strobe signal\\\hline
{\tt i\_dbg\_we}    &  1 & Input & Write enable\\\hline
{\tt i\_dbg\_addr}  &  1 & Input & Bus address, command or data port \\\hline
{\tt i\_dbg\_data}  & 32 & Input & Data on WB write\\\hline
{\tt o\_dbg\_ack}   &  1 & Output  & Slave has completed a R/W cycle\\\hline
{\tt o\_dbg\_stall} &  1 & Output  & WB bus slave not ready\\\hline
{\tt o\_dbg\_data}  & 32 & Output  & Outgoing bus data\\\hline
\end{portlist}\caption{CPU Debug Wishbone I/O Ports}\label{tbl:iowb-slave}\end{center}\end{table}

There are only four other lines to the CPU: the external clock, external
reset, incoming external interrupt line(s), and the outgoing debug interrupt
line.  These are shown in Tbl.~\ref{tbl:ioports}.
\begin{table}
\begin{center}\begin{portlist}
{\tt i\_clk} & 1 & Input & The master CPU clock \\\hline
{\tt i\_rst} & 1 & Input &  Active high reset line \\\hline
{\tt i\_ext\_int} & 1\ldots 16 & Input &  Incoming external interrupts, actual
value set by implementation parameter \\\hline
{\tt o\_ext\_int} & 1 & Output & CPU Halted interrupt \\\hline
\end{portlist}\caption{I/O Ports}\label{tbl:ioports}\end{center}\end{table}
The clock line was discussed briefly in Chapt.~\ref{chap:clocks}.  We
typically run it at 100~MHz, although we've needed to slow it down to 80~MHz
for some implementations.  The reset line is an active high reset.  When
asserted, the CPU will start running again from its reset address in
memory.  Further, depending upon how the CPU is configured and specifically
based upon how the {\tt START\_HALTED} parameter is set, the CPU may or may
not start running automatically following a reset.  The {\tt i\_ext\_int}
line is for an external interrupt.  This line may actually be as wide as
16~external interrupts, depending upon the setting of
the {\tt EXTERNAL\_INTERRUPTS} parameter.  Finally, the Zip System produces one
external interrupt whenever the entire CPU halts to wait for the debugger.

\chapter{Initial Assessment}\label{chap:assessment}

Having now worked with the Zip CPU for a while, it is worth offering an
honest assessment of how well it works and how well it was designed. At the
end of this assessment, I will propose some changes that may take place in a
later version of this Zip CPU to make it better.

\section{The Good}
\begin{itemize}
\item The Zip CPU can be configured to be relatively lightweight and fully
featured as it exists today. For anyone who wishes to build a general
purpose CPU and then to experiment with building and adding particular
features, the Zip CPU makes a good starting point--it is fairly simple.
Modifications should be simple enough.  Indeed, a non--pipelined
version of the bare ZipBones (with no peripherals) has been built that
only uses 1.1k~LUTs.  When using pipelining, the full cache, and all
of the peripherals, the ZipSystem can top 5k~LUTs.  Where it fits
in between is a function of your needs.
\item The Zip CPU was designed to be an implementable soft core that could be
placed within an FPGA, controlling actions internal to the FPGA. It
fits this role rather nicely. It does not fit the role of a system on
a chip very well, but then it was never intended to be a system on a
chip but rather a system within a chip.
\item The extremely simplified instruction set of the Zip CPU was a good
choice. Although it does not have many of the commonly used
instructions, PUSH, POP, JSR, and RET among them, the simplified
instruction set has demonstrated an amazing versatility. I will
therefore contend, to anyone who will listen, that this instruction set
offers a full and complete capability for whatever a user might wish
to do with two exceptions: bytewise character access and accelerated
floating-point support.
\item This simplified instruction set is easy to decode.
\item The simplified bus transactions (32-bit words only) were also very easy
to implement.
\item The pipelined load/store approach is novel, and can be used to greatly
increase the speed of the processor.
\item The novel approach of having a single interrupt vector, which just
brings the CPU back to the instruction it left off at within the last
interrupt context, doesn't appear to have been that much of a problem.
If most modern systems handle interrupt vectoring in software anyway,
why maintain hardware support for it?
\item My goal of a high rate of instructions per clock may not be the proper
measure. For example, if instructions are being read from a SPI flash
device, such as is common among FPGA implementations, these same
instructions may suffer stalls of between 64 and 128 cycles per
instruction just to read the instruction from the flash. Executing the
instruction in a single clock cycle is no longer the appropriate
measure. At the same time, it should be possible to use the DMA
peripheral to copy instructions from the flash to a temporary memory
location, after which they may be executed at one instruction per
clock cycle again.
\end{itemize}
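
To put rough numbers on the flash stall mentioned in the last item, the
following Python sketch computes the instruction rate when every instruction
pays a fixed fetch stall.  The function name is hypothetical, and the model
(one execution cycle plus a constant stall per instruction, with nothing
overlapped) is a deliberate simplification rather than a measurement.

```python
def instructions_per_second(clock_hz, fetch_stall_cycles, exec_cycles=1):
    """Instruction rate when each instruction costs exec plus stall cycles."""
    if clock_hz <= 0 or fetch_stall_cycles < 0 or exec_cycles <= 0:
        raise ValueError("clock and cycle counts must be positive")
    return clock_hz / (exec_cycles + fetch_stall_cycles)
```

At the 100~MHz clock discussed in Chapt.~\ref{chap:clocks}, a 64~cycle stall
brings this model down from 100~million to roughly 1.5~million instructions
per second, and a 128~cycle stall to under 0.8~million.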

\section{The Not so Good}
\begin{itemize}
\item The CPU has no character support. This is both good and bad.
Realistically, the CPU works just fine without it. Characters can be
supported as subsets of 32-bit words without any problem. Practically,
though, it will make compiling non-Zip CPU code difficult--especially
anything that assumes sizeof(int)=4*sizeof(char), or that tries to
create unions with characters and integers and then attempts to
reference the address of the characters within that union.

\item The Zip CPU does not support a data cache. One can still be built
externally, but this is a limitation of the CPU proper as built.
Further, under the theory of the Zip CPU design (that of an embedded
soft-core processor within an FPGA, where any ``address'' may reference
either memory or a peripheral that may have side-effects), any data
cache would need to be based upon an initial knowledge of whether or
not it is supporting memory (cacheable) or peripherals. This knowledge
must exist somewhere, and that somewhere is currently (and by design)
external to the CPU.

This may also be written off as a ``feature'' of the Zip CPU, since
the addition of a data cache can greatly increase the LUT count of
a soft core.

The Zip CPU compensates for this via its pipelined load and store
instructions.

\item Many other instruction sets offer three operand instructions, whereas
the Zip CPU only offers two operand instructions. This means that it
takes the Zip CPU more instructions to do many of the same operations.
The good part of this is that it gives the Zip CPU a greater amount of
flexibility in its immediate operand mode, although that increased
flexibility isn't necessarily as valuable as one might like.

\item The Zip CPU doesn't support out of order execution. I suppose it could
be modified to do so, but then it would no longer be the ``simple''
and low LUT count CPU it was designed to be. The three primary results
are that 1) loads may unnecessarily stall the CPU, even if other
things could be done while waiting for the load to complete, 2)
bus errors on stores will never be caught at the point of the error,
and 3) branch prediction becomes more difficult.

\item Although switching to an interrupt context in the Zip CPU design doesn't
require a tremendous swapping of registers, in reality it still
does--since any task swap still requires saving and restoring all
16~user registers. That's a lot of memory movement just to service
an interrupt.

\item The Zip CPU is by no means generic: it will never handle addresses
larger than 32-bits (16GB) without a complete and total redesign.
This may limit its utility as a generic CPU in the future, although
as an embedded CPU within an FPGA this isn't really much of a limit
or restriction.

\item While the Zip CPU has its own assembler, it has no linker and does not
(yet) support a compiler. The standard C library is an even longer
shot. My dream of having binutils and gcc support has not been
realized and at this rate may not be realized. (I've been intimidated
by the challenge every time I've looked through those codes.)
\end{itemize}
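
As an illustration of the first item, the idea of supporting characters ``as
subsets of 32-bit words'' can be modeled in Python as below.  The helper names
are hypothetical, and storing the first character in the most significant
byte of each word is an assumption made only for this sketch.

```python
def pack_chars(text):
    """Pack ASCII text four characters per 32-bit word, zero padded."""
    data = text.encode("ascii")
    data += b"\0" * (-len(data) % 4)          # pad to a whole number of words
    return [int.from_bytes(data[i:i + 4], "big")
            for i in range(0, len(data), 4)]


def get_char(words, index):
    """Extract character number `index` with a shift and a mask."""
    word = words[index // 4]
    shift = 8 * (3 - index % 4)               # first character is the top byte
    return chr((word >> shift) & 0xFF)
```

Every character access thus becomes a word access followed by a shift and a
mask, which is the extra work that software written for byte addressable
machines does not expect to perform.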

\section{The Next Generation}
This section could also be labeled as my ``To do'' list.  Today's list is
much different than it was for the last version of this document, as much of
the prior to do list (such as VLIW instructions, and a more traditional
instruction cache) has now been implemented.  The only things really and
truly waiting on my list today are assembler support for the VLIW instruction

Stay tuned, these are likely to be coming next.

% Appendices
% Index
\end{document}

%
%
% Symbol table relocation types:
%
% Only 3-types of instructions truly need relocations: those that modify the
% PC register, and those that access memory.
%
%
%       LDIHI   Addr,Rx         //   requires two instructions
%
%                       // Can be prefixed with two instructions to load Rx
%                       // from any 32-bit immediate
%
% -     ADD     x,PC            // Any PC relative jump (20 bits)
%
% -     ADD.C   x,PC            // Any PC relative conditional jump (20 bits)
%
%       LOD     Addr(Rx),Rx     //    unconditional, requires second instruction
%
%
% -     STO.C   Rx,Addr(Ry)     // Any 16-bit rel addr, Rx and Ry must be valid
%
% -     FARJMP  #Addr:          // Arbitrary 32-bit jumps require a jump table
%       BRA     +1              // memory address.  The BRA +1 can be skipped,
%       .WORD   Addr            // but only if the address is placed at the end
%       LOD     -2(PC),PC       // of an executable section
%