Line 46... |
Line 46... |
\documentclass{gqtekspec}
|
\documentclass{gqtekspec}
|
\project{Zip CPU}
|
\project{Zip CPU}
|
\title{Specification}
|
\title{Specification}
|
\author{Dan Gisselquist, Ph.D.}
|
\author{Dan Gisselquist, Ph.D.}
|
\email{dgisselq (at) opencores.org}
|
\email{dgisselq (at) opencores.org}
|
\revision{Rev.~0.3}
|
\revision{Rev.~0.4}
|
|
\definecolor{webred}{rgb}{0.2,0,0}
|
|
\definecolor{webgreen}{rgb}{0,0.2,0}
|
|
\usepackage[dvips,ps2pdf,colorlinks=true,
|
|
anchorcolor=black,pagecolor=webgreen,pdfpagelabels,hypertexnames,
|
|
pdfauthor={Dan Gisselquist},
|
|
pdfsubject={Zip CPU}]{hyperref}
|
\begin{document}
|
\begin{document}
|
\pagestyle{gqtekspecplain}
|
\pagestyle{gqtekspecplain}
|
\titlepage
|
\titlepage
|
\begin{license}
|
\begin{license}
|
Copyright (C) \theyear\today, Gisselquist Technology, LLC
|
Copyright (C) \theyear\today, Gisselquist Technology, LLC
|
Line 68... |
Line 74... |
You should have received a copy of the GNU General Public License along
|
You should have received a copy of the GNU General Public License along
|
with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a
|
with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a
|
copy.
|
copy.
|
\end{license}
|
\end{license}
|
\begin{revisionhistory}
|
\begin{revisionhistory}
|
|
0.4 & 9/19/2015 & Gisselquist & Added DMA controller, improved stall information, and self--assessment info.\\\hline
|
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
|
0.3 & 8/22/2015 & Gisselquist & First completed draft\\\hline
|
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
|
0.2 & 8/19/2015 & Gisselquist & Still Draft, more complete \\\hline
|
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
|
0.1 & 8/17/2015 & Gisselquist & Incomplete First Draft \\\hline
|
\end{revisionhistory}
|
\end{revisionhistory}
|
% Revision History
|
% Revision History
|
Line 87... |
Line 94... |
|
|
The easiest, most obvious answer is the simple one: Because I can.
|
The easiest, most obvious answer is the simple one: Because I can.
|
|
|
There's more to it, though. There's a lot that I would like to do with a
|
There's more to it, though. There's a lot that I would like to do with a
|
processor, and I want to be able to do it in a vendor independent fashion.
|
processor, and I want to be able to do it in a vendor independent fashion.
|
I would like to be able to generate Verilog code that can run equivalently
|
First, I would like to be able to place this processor inside an FPGA. Without
|
on both Xilinx and Altera chips, and that can be easily ported from one
|
paying royalties, ARM is out of the question. I would then like to be able to
|
manufacturer's chipsets to another. Even more, before purchasing a chip or a
|
generate Verilog code, both for the processor and the system it sits within,
|
board, I would like to know that my soft core works. I would like to build a test
|
that can run equivalently on both Xilinx and Altera chips, and that can be
|
bench to test components with, and Verilator is my chosen test bench. This
|
easily ported from one manufacturer's chipsets to another. Even more, before
|
forces me to use all Verilog, and it prevents me from using any proprietary
|
purchasing a chip or a board, I would like to know that my soft core works. I
|
cores. For this reason, Microblaze and Nios are out of the question.
|
would like to build a test bench to test components with, and Verilator is my
|
|
chosen test bench. This forces me to use all Verilog, and it prevents me from
|
|
using any proprietary cores. For this reason, Microblaze and Nios are out of
|
|
the question.
|
|
|
Why not OpenRISC? That's a hard question. The OpenRISC team has done some
|
Why not OpenRISC? That's a hard question. The OpenRISC team has done some
|
wonderful work on an amazing processor, and I'll have to admit that I am
|
wonderful work on an amazing processor, and I'll have to admit that I am
|
envious of what they've accomplished. I would like to port binutils to the
|
envious of what they've accomplished. I would like to port binutils to the
|
Zip CPU, as I would like to port GCC and GDB. They are way ahead of me. The
|
Zip CPU, as I would like to port GCC and GDB. They are way ahead of me. The
|
Line 120... |
Line 130... |
\chapter{Introduction}
|
\chapter{Introduction}
|
\pagenumbering{arabic}
|
\pagenumbering{arabic}
|
\setcounter{page}{1}
|
\setcounter{page}{1}
|
|
|
|
|
The original goal of the ZIP CPU was to be a very simple CPU. You might
|
The original goal of the Zip CPU was to be a very simple CPU. You might
|
think of it as a poor man's alternative to the OpenRISC architecture.
|
think of it as a poor man's alternative to the OpenRISC architecture.
|
For this reason, all instructions have been designed to be as simple as
|
For this reason, all instructions have been designed to be as simple as
|
possible, and are all designed to be executed in one instruction cycle per
|
possible, and are all designed to be executed in one instruction cycle per
|
instruction, barring pipeline stalls. Indeed, even the bus has been simplified
|
instruction, barring pipeline stalls. Indeed, even the bus has been simplified
|
to a constant 32-bit width, with no option for more or less. This has
|
to a constant 32-bit width, with no option for more or less. This has
|
Line 295... |
Line 305... |
stored.
|
stored.
|
|
|
\section{Simplified Bus}
|
\section{Simplified Bus}
|
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
|
The bus architecture of the Zip CPU is that of a simplified WISHBONE bus.
|
It has been simplified in this fashion: all operations are 32--bit operations.
|
It has been simplified in this fashion: all operations are 32--bit operations.
|
The bus is neither little endian nor bit endian. For this reason, all words
|
The bus is neither little endian nor big endian. For this reason, all words
|
are 32--bits. All instructions are also 32--bits wide. Everything has been
|
are 32--bits. All instructions are also 32--bits wide. Everything has been
|
built around the 32--bit word.
|
built around the 32--bit word.
|
|
|
\section{Register Set}
|
\section{Register Set}
|
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
|
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
|
Line 316... |
Line 326... |
noted as (SP)--although there is nothing special about this register other
|
noted as (SP)--although there is nothing special about this register other
|
than this convention.
|
than this convention.
|
The CPU can access both register sets via move instructions from the
|
The CPU can access both register sets via move instructions from the
|
supervisor state, whereas the user state can only access the user registers.
|
supervisor state, whereas the user state can only access the user registers.
|
|
|
The status register is special, and bears further mention. The lower
|
The status register is special, and bears further mention. As shown in
|
10 bits of the status register form a set of CPU state and condition codes.
|
Tbl.~\ref{tbl:cc-register},
|
Writes to other bits of this register are preserved.
|
\begin{table}\begin{center}
|
|
\begin{bitlist}
|
|
31\ldots 11 & R/W & Reserved for future uses\\\hline
|
|
10 & R & (Reserved for) Bus-Error Flag\\\hline
|
|
9 & R & Trap, or user interrupt, Flag. Cleared on return to userspace.\\\hline
|
|
8 & R & (Reserved for) Illegal Instruction Flag\\\hline
|
|
7 & R/W & Break--Enable\\\hline
|
|
6 & R/W & Step\\\hline
|
|
5 & R/W & Global Interrupt Enable (GIE)\\\hline
|
|
4 & R/W & Sleep. When GIE is also set, the CPU waits for an interrupt.\\\hline
|
|
3 & R/W & Overflow\\\hline
|
|
2 & R/W & Negative. The sign bit was set as a result of the last ALU instruction.\\\hline
|
|
1 & R/W & Carry\\\hline
|
|
0 & R/W & Zero. The last ALU operation produced a zero.\\\hline
|
|
\end{bitlist}
|
|
\caption{Condition Code Register Bit Assignment}\label{tbl:cc-register}
|
|
\end{center}\end{table}
|
|
the lower 11~bits of the status register form
|
|
a set of CPU state and condition codes. Writes to other bits of this register
|
|
are preserved.
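
The bit assignments above can be written down as C masks, which is handy
when a debugger or simulator needs to pick the status register apart. This is
only a sketch: the names ({\tt ZIP\_CC\_Z}, {\tt zip\_cc\_is\_waiting}, and
so on) are made up for illustration and are not taken from the Zip CPU
sources. Only the bit positions come from the table.

```c
#include <stdint.h>

/* Bit masks for the status (CC) register. The positions follow the bit
 * table; the names are hypothetical. */
enum {
    ZIP_CC_Z       = 1u << 0,  /* Zero */
    ZIP_CC_C       = 1u << 1,  /* Carry */
    ZIP_CC_N       = 1u << 2,  /* Negative */
    ZIP_CC_V       = 1u << 3,  /* Overflow */
    ZIP_CC_SLEEP   = 1u << 4,  /* Sleep */
    ZIP_CC_GIE     = 1u << 5,  /* Global Interrupt Enable */
    ZIP_CC_STEP    = 1u << 6,  /* Step */
    ZIP_CC_BREAKEN = 1u << 7,  /* Break--Enable */
    ZIP_CC_ILL     = 1u << 8,  /* (Reserved for) Illegal Instruction */
    ZIP_CC_TRAP    = 1u << 9,  /* Trap, or user interrupt */
    ZIP_CC_BUSERR  = 1u << 10  /* (Reserved for) Bus-Error */
};

/* Sleep with GIE set: the CPU waits for an interrupt. */
static inline int zip_cc_is_waiting(uint32_t cc)
{
    const uint32_t both = ZIP_CC_SLEEP | ZIP_CC_GIE;
    return (cc & both) == both;
}

/* Sleep with GIE clear: nothing can wake the CPU, so it is halted. */
static inline int zip_cc_is_halted(uint32_t cc)
{
    return (cc & ZIP_CC_SLEEP) != 0 && (cc & ZIP_CC_GIE) == 0;
}
```

The two helpers only restate the Sleep and GIE rows of the table. The halted
case matches the note on the derived WAIT instruction, which becomes a HALT
when interrupts are disabled.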
|
|
|
Of the condition codes, the bottom four bits are the current flags:
|
Of the condition codes, the bottom four bits are the current flags:
|
Zero (Z),
|
Zero (Z),
|
Carry (C),
|
Carry (C),
|
Negative (N),
|
Negative (N),
|
Line 369... |
Line 398... |
%
|
%
|
|
|
This functionality was added to enable an external debugger to
|
This functionality was added to enable an external debugger to
|
set and manage breakpoints.
|
set and manage breakpoints.
|
|
|
The ninth bit is reserved for a floating point enable bit. When set, the
|
The ninth bit is reserved for an illegal instruction bit. When the CPU
|
arithmetic for the next instruction will be sent to a floating point unit.
|
tries to execute either a non-existent instruction, or an instruction from
|
Such a unit may later be added as an extension to the Zip CPU. If the
|
an address that produces a bus error, the CPU will (once implemented) switch
|
CPU does not support floating point instructions, this bit will never be set.
|
to supervisor mode while setting this bit. The bit will automatically be
|
The instruction set could also be simply extended to allow other data types
|
cleared upon any return to user mode.
|
in this fashion, such as two by 16--bit vector operations or four by 8--bit
|
|
vector operations.
|
|
|
|
The tenth bit is a trap bit. It is set whenever the user requests a soft
|
The tenth bit is a trap bit. It is set whenever the user requests a soft
|
interrupt, and cleared on any return to userspace command. This allows the
|
interrupt, and cleared on any return to userspace command. This allows the
|
supervisor, in supervisor mode, to determine whether it got to supervisor
|
supervisor, in supervisor mode, to determine whether it got to supervisor
|
mode from a trap or from an external interrupt or both.
|
mode from a trap or from an external interrupt or both.
|
Line 402... |
Line 429... |
\end{tabular}
|
\end{tabular}
|
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
|
\caption{Condition Code / Status Register Bits}\label{tbl:ccbits}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
|
\section{Conditional Instructions}
|
\section{Conditional Instructions}
|
Most, although not quite all, instructions are conditionally executed. From
|
Most, although not quite all, instructions may be conditionally executed. From
|
the four condition code flags, eight conditions are defined. These are shown
|
the four condition code flags, eight conditions are defined. These are shown
|
in Tbl.~\ref{tbl:conditions}.
|
in Tbl.~\ref{tbl:conditions}.
|
\begin{table}
|
\begin{table}
|
\begin{center}
|
\begin{center}
|
\begin{tabular}{l|l|l}
|
\begin{tabular}{l|l|l}
|
Line 422... |
Line 449... |
\end{tabular}
|
\end{tabular}
|
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
|
\caption{Conditions for conditional operand execution}\label{tbl:conditions}
|
\end{center}
|
\end{center}
|
\end{table}
|
\end{table}
|
There is no condition code for less than or equal, not C or not V. Sorry,
|
There is no condition code for less than or equal, not C or not V. Sorry,
|
I ran out of space in 3--bits. Using these conditions will take an extra
|
I ran out of space in 3--bits. Conditioning on a non--supported condition
|
instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
|
is still possible, but it will take an extra instruction and a pipeline stall. (Ex: \hbox{\em (Stall)}; \hbox{\tt TST \$4,CC;} \hbox{\tt STO.NZ R0,(R1)})
|
|
|
|
Conditionally executed ALU instructions will not further adjust the
|
|
condition codes.
|
|
|
\section{Operand B}
|
\section{Operand B}
|
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
|
Many instruction forms have a 21-bit source ``Operand B'' associated with them.
|
This Operand B is either equal to a register plus a signed immediate offset,
|
This Operand B is either equal to a register plus a signed immediate offset,
|
or an immediate offset by itself. This value is encoded as shown in
|
or an immediate offset by itself. This value is encoded as shown in
|
Line 446... |
Line 476... |
multiply? Likewise, why have a 16--bit immediate when adding to a logical
|
multiply? Likewise, why have a 16--bit immediate when adding to a logical
|
or arithmetic shift? In these cases, the extra bits are reserved for future
|
or arithmetic shift? In these cases, the extra bits are reserved for future
|
instruction possibilities.
|
instruction possibilities.
|
|
|
\section{Address Modes}
|
\section{Address Modes}
|
The ZIP CPU supports two addressing modes: register plus immediate, and
|
The Zip CPU supports two addressing modes: register plus immediate, and
|
immediate address. Addresses are therefore encoded in the same fashion as
|
immediate address. Addresses are therefore encoded in the same fashion as
|
Operand B's, shown above.
|
Operand B's, shown above.
|
|
|
A lot of long hard thought was put into whether to allow pre/post increment
|
A lot of long hard thought was put into whether to allow pre/post increment
|
and decrement addressing modes. Finding no way to use these operators without
|
and decrement addressing modes. Finding no way to use these operators without
|
Line 482... |
Line 512... |
compiler or assembler know how to compile a MOV instruction without knowing
|
compiler or assembler know how to compile a MOV instruction without knowing
|
the mode of the CPU at the time? For this reason, the compiler will assume
|
the mode of the CPU at the time? For this reason, the compiler will assume
|
all MOV registers are supervisor registers, and display them as normal.
|
all MOV registers are supervisor registers, and display them as normal.
|
Anything with the user bit set will be treated as a user register. The CPU
|
Anything with the user bit set will be treated as a user register. The CPU
|
will quietly ignore the supervisor bits while in user mode, and anything
|
will quietly ignore the supervisor bits while in user mode, and anything
|
marked as a user register will always be valid. (Did I just say that in the
|
marked as a user register will always be valid.
|
last paragraph?)
|
|
|
|
\section{Multiply Operations}
|
\section{Multiply Operations}
|
The Zip CPU supports two Multiply operations, a
|
The Zip CPU supports two Multiply operations, a 16x16 bit signed multiply
|
16x16 bit signed multiply (MPYS) and the same but unsigned (MPYU). In both
|
({\tt MPYS}) and a 16x16 bit unsigned multiply ({\tt MPYU}). In both
|
cases, the operand is a register plus a 16-bit immediate, subject to the
|
cases, the operand is a register plus a 16-bit immediate, subject to the
|
rule that the register cannot be the PC or CC registers. The PC register
|
rule that the register cannot be the PC or CC registers. The PC register
|
field has been stolen to create a multiply by immediate instruction. The
|
field has been stolen to create a multiply by immediate instruction. The
|
CC register field is reserved.
|
CC register field is reserved.
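
As a reference model, the two multiplies might be sketched in C as follows.
This sketch assumes that each operand contributes its lower 16~bits and that
the full 32--bit product is returned. The text above says only ``16x16'', so
treat that choice, along with the function names, as assumptions rather than
as a statement of what the hardware does.

```c
#include <stdint.h>

/* Sign extend the low 16 bits of v without relying on
 * implementation-defined narrowing casts. */
static inline int32_t zip_sext16(uint32_t v)
{
    return (int32_t)((v & 0xffffu) ^ 0x8000u) - 0x8000;
}

/* MPYU model: unsigned 16x16 multiply, 32-bit result.
 * Assumes only the low halves of the operands are used. */
static inline uint32_t zip_mpyu(uint32_t a, uint32_t b)
{
    return (a & 0xffffu) * (b & 0xffffu);
}

/* MPYS model: signed 16x16 multiply, 32-bit result.
 * Assumes only the low halves of the operands are used. */
static inline int32_t zip_mpys(uint32_t a, uint32_t b)
{
    return zip_sext16(a) * zip_sext16(b);
}
```

Under these assumptions the same bit pattern gives very different answers:
{\tt 0xffff} times {\tt 0xffff} is {\tt 0xfffe0001} unsigned, but
$(-1)\times(-1)=1$ signed.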
|
|
|
\section{Floating Point}
|
\section{Floating Point}
|
The ZIP CPU does not support floating point operations. However, the
|
The Zip CPU does not (yet) support floating point operations. However, the
|
instruction set reserves two possibilities for future floating point
|
instruction set reserves two possibilities for future floating point
|
operations.
|
operations.
|
|
|
The first floating point operation hole in the instruction set involves
|
The first floating point operation hole in the instruction set involves
|
setting the floating point bit in the CC register. The next instruction
|
setting a proposed (but non-existent) floating point bit in the CC register.
|
will simply interpret its operands as floating point instructions.
|
The next instruction
|
|
would then simply interpret its operands as floating point values.
|
Not all instructions, however, have floating point equivalents. Further, the
|
Not all instructions, however, have floating point equivalents. Further, the
|
immediate fields do not apply in floating point mode, and must be set to
|
immediate fields do not apply in floating point mode, and must be set to
|
zero. Not all instructions make sense as floating point operations.
|
zero. Not all instructions make sense as floating point operations.
|
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as
|
Therefore, only the CMP, SUB, ADD, and MPY instructions may be issued as
|
floating point instructions. Other instructions allow the examining of the
|
floating point instructions. Other instructions allow the examining of the
|
floating point bit in the CC register. In all cases, the floating point bit
|
floating point bit in the CC register. In all cases, the floating point bit
|
is cleared one instruction after it is set.
|
is cleared one instruction after it is set.
|
|
|
The other possibility for floating point operations involves exploiting the
|
The other possibility for floating point operations involves exploiting the
|
hole in the instruction set that the NOOP and BREAK instructions reside within.
|
hole in the instruction set that the NOOP and BREAK instructions reside within.
|
These two instructions use 24--bits of address space. A simple adjustment
|
These two instructions use 24--bits of address space, when only a single bit
|
to this space could create instructions with 4--bit register addresses for
|
is necessary. A simple adjustment to this space could create instructions
|
each register, a 3--bit field for conditional execution, and a 2--bit field
|
with 4--bit register addresses for each register, a 3--bit field for
|
for which operation. In this fashion, such a floating point capability would
|
conditional execution, and a 2--bit field for which operation.
|
only fill 13--bits of the 24--bit field, still leaving lots of room for
|
In this fashion, such a floating point capability would only fill 13--bits of
|
expansion.
|
the 24--bit field, still leaving lots of room for expansion.
|
|
|
In both cases, the Zip CPU would support 32--bit single precision floats
|
In both cases, the Zip CPU would support 32--bit single precision floats
|
only.
|
only, since other choices would complicate the pipeline.
|
|
|
The current architecture does not support a floating point not-implemented
|
The current architecture does not support a floating point not-implemented
|
interrupt. Any soft floating point emulation must be done deliberately.
|
interrupt. Any soft floating point emulation must be done deliberately.
|
|
|
\section{Native Instructions}
|
\section{Native Instructions}
|
The instruction set for the Zip CPU is summarized in
|
The instruction set for the Zip CPU is summarized in
|
Tbl.~\ref{tbl:zip-instructions}.
|
Tbl.~\ref{tbl:zip-instructions}.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
|
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|l|c|}\hline
|
|
\rowcolor[gray]{0.85}
|
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
|
Op Code & \multicolumn{8}{c|}{31\ldots24} & \multicolumn{8}{c|}{23\ldots 16}
|
& \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
|
& \multicolumn{8}{c|}{15\ldots 8} & \multicolumn{8}{c|}{7\ldots 0}
|
& Sets CC? \\\hline
|
& Sets CC? \\\hline\hline
|
CMP(Sub) & \multicolumn{4}{l|}{4'h0}
|
CMP(Sub) & \multicolumn{4}{l|}{4'h0}
|
& \multicolumn{4}{l|}{D. Reg}
|
& \multicolumn{4}{l|}{D. Reg}
|
& \multicolumn{3}{l|}{Cond.}
|
& \multicolumn{3}{l|}{Cond.}
|
& \multicolumn{21}{l|}{Operand B}
|
& \multicolumn{21}{l|}{Operand B}
|
& Yes \\\hline
|
& Yes \\\hline
|
Line 562... |
Line 593... |
& \\\hline
|
& \\\hline
|
BREAK & \multicolumn{4}{l|}{4'h4}
|
BREAK & \multicolumn{4}{l|}{4'h4}
|
& \multicolumn{4}{l|}{4'he}
|
& \multicolumn{4}{l|}{4'he}
|
& \multicolumn{24}{l|}{24'h01}
|
& \multicolumn{24}{l|}{24'h01}
|
& \\\hline
|
& \\\hline
|
{\em Rsrd} & \multicolumn{4}{l|}{4'h4}
|
{\em Reserved} & \multicolumn{4}{l|}{4'h4}
|
& \multicolumn{4}{l|}{4'he}
|
& \multicolumn{4}{l|}{4'he}
|
& \multicolumn{24}{l|}{24'bits, but not 0 or 1.}
|
& \multicolumn{24}{l|}{24'bits, but not 0 or 1.}
|
& \\\hline
|
& \\\hline
|
LODIHI & \multicolumn{4}{l|}{4'h4}
|
LODIHI & \multicolumn{4}{l|}{4'h4}
|
& \multicolumn{4}{l|}{4'hf}
|
& \multicolumn{4}{l|}{4'hf}
|
Line 665... |
Line 696... |
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
|
\caption{Zip CPU Instruction Set}\label{tbl:zip-instructions}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
|
|
As you can see, there's lots of room for instruction set expansion. The
|
As you can see, there's lots of room for instruction set expansion. The
|
NOOP and BREAK instructions are the only instructions within one particular
|
NOOP and BREAK instructions are the only instructions within one particular
|
24--bit hole. This spaces are reserved for future enhancements. For example,
|
24--bit hole. The rest of this space is reserved for future enhancements.
|
floating point operations, consisting of a 3-bit floating point operation,
|
|
two 4-bit registers, no immediate offset, and a 3-bit condition would fit
|
|
nicely into 14--bits of this address space--making it so that the floating
|
|
point bit in the CC register need not be used.
|
|
|
|
\section{Derived Instructions}
|
\section{Derived Instructions}
|
The ZIP CPU supports many other common instructions, but not all of them
|
The Zip CPU supports many other common instructions, but not all of them
|
are single cycle instructions. The derived instruction tables,
|
are single cycle instructions. The derived instruction tables,
|
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, and~\ref{tbl:derived-3},
|
Tbls.~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3}
|
|
and~\ref{tbl:derived-4},
|
help to capture some of how these other instructions may be implemented on
|
help to capture some of how these other instructions may be implemented on
|
the ZIP CPU. Many of these instructions will have assembly equivalents,
|
the Zip CPU. Many of these instructions will have assembly equivalents,
|
such as the branch instructions, to facilitate working with the CPU.
|
such as the branch instructions, to facilitate working with the CPU.
|
\begin{table}\begin{center}
|
\begin{table}\begin{center}
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
Mapped & Actual & Notes \\\hline
|
Mapped & Actual & Notes \\\hline
|
|
ABS Rx
|
|
& \parbox[t]{1.5in}{TST -1,Rx\\NEG.LT Rx}
|
|
& Absolute value, depends upon derived NEG.\\\hline
|
\parbox[t]{1.4in}{ADD Ra,Rx\\ADDC Rb,Ry}
|
\parbox[t]{1.4in}{ADD Ra,Rx\\ADDC Rb,Ry}
|
& \parbox[t]{1.5in}{Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
|
& \parbox[t]{1.5in}{Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}
|
& Add with carry \\\hline
|
& Add with carry \\\hline
|
BRA.Cond +/-\$Addr
|
BRA.Cond +/-\$Addr
|
& \hbox{MOV.cond \$Addr+PC,PC}
|
& \hbox{MOV.cond \$Addr+PC,PC}
|
Line 728... |
Line 759... |
MOV \$3+PC,R0 \\
|
MOV \$3+PC,R0 \\
|
STO R0,1(SP) \\
|
STO R0,1(SP) \\
|
MOV \$Addr+PC,PC \\
|
MOV \$Addr+PC,PC \\
|
ADD \$1,SP}
|
ADD \$1,SP}
|
& Jump to Subroutine. Note the required cleanup instruction after
|
& Jump to Subroutine. Note the required cleanup instruction after
|
returning. \\\hline
|
returning. This could easily be turned into a three instruction
|
|
operation, removing the preliminary stack instruction before and
|
|
the cleanup after, by adjusting how any stack frame was built for
|
|
this routine to include space at the top of the stack for the PC.
|
|
\\\hline
|
JSR PC+\$Addr
|
JSR PC+\$Addr
|
& \parbox[t]{1.5in}{MOV \$3+PC,R12 \\ MOV \$addr+PC,PC}
|
& \parbox[t]{1.5in}{MOV \$3+PC,R12 \\ MOV \$addr+PC,PC}
|
&This is the high speed
|
&This is the high speed
|
version of a subroutine call, necessitating a register to hold the
|
version of a subroutine call, necessitating a register to hold the
|
last PC address. In its favor, this method doesn't suffer the
|
last PC address. In its favor, this method doesn't suffer the
|
Line 787... |
Line 822... |
LDIHI.C \$8000h,Rz \\
|
LDIHI.C \$8000h,Rz \\
|
LSR \$1,Rx \\
|
LSR \$1,Rx \\
|
OR Rz,Rx}
|
OR Rz,Rx}
|
& Logical shift right with carry \\\hline
|
& Logical shift right with carry \\\hline
|
NEG Rx & \parbox[t]{1.5in}{XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
|
NEG Rx & \parbox[t]{1.5in}{XOR \$-1,Rx \\ ADD \$1,Rx} & \\\hline
|
|
NEG.C Rx & \parbox[t]{1.5in}{MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx} & \\\hline
|
NOOP & NOOP & While there are many
|
NOOP & NOOP & While there are many
|
operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
|
operations that do nothing, such as MOV Rx,Rx, or OR \$0,Rx, these
|
operations have consequences in that they might stall the bus if
|
operations have consequences in that they might stall the bus if
|
Rx isn't ready yet. For this reason, we have a dedicated NOOP
|
Rx isn't ready yet. For this reason, we have a dedicated NOOP
|
instruction. \\\hline
|
instruction. \\\hline
|
Line 800... |
Line 836... |
& Note
|
& Note
|
that for interrupt purposes, one can never depend upon the value at
|
that for interrupt purposes, one can never depend upon the value at
|
(SP). Hence you read from it, then increment it, lest having
|
(SP). Hence you read from it, then increment it, lest having
|
incremented it first something then comes along and writes to that
|
incremented it first something then comes along and writes to that
|
value before you can read the result. \\\hline
|
value before you can read the result. \\\hline
|
|
\end{tabular}
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
|
\end{center}\end{table}
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
PUSH Rx
|
PUSH Rx
|
& \parbox[t]{1.5in}{SUB \$1,SP \\
|
& \parbox[t]{1.5in}{SUB \$1,SP \\
|
STO Rx,\$1(SP)}
|
STO Rx,\$1(SP)}
|
& \\\hline
|
& \\\hline
|
|
PUSH Rx-Ry
|
|
& \parbox[t]{1.5in}{SUB \$n,SP \\
|
|
STO Rx,\$n(SP) \\
|
|
\ldots \\
|
|
STO Ry,\$1(SP)}
|
|
& Multiple pushes at once only need the single subtract from the
|
|
stack pointer. This derived instruction is analogous to a similar one
|
|
on the Motorola 68k architecture, although the Zip Assembler
|
|
does not support this instruction (yet).\\\hline
|
RESET
|
RESET
|
& \parbox[t]{1in}{STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
|
& \parbox[t]{1in}{STO \$1,\$watchdog(R12)\\NOOP\\NOOP}
|
& \parbox[t]{3in}{This depends upon the peripheral base address being
|
& \parbox[t]{3in}{This depends upon the peripheral base address being
|
in R12.
|
in R12.
|
|
|
Another opportunity might be to jump to the reset address from within
|
Another opportunity might be to jump to the reset address from within
|
supervisor mode.}\\\hline
|
supervisor mode.}\\\hline
|
RET & \parbox[t]{1.5in}{LOD \$-1(SP),PC}
|
RET & \parbox[t]{1.5in}{LOD \$1(SP),PC}
|
& Note that this depends upon the calling context to clean up the
|
& Note that this depends upon the calling context to clean up the
|
stack, as outlined for the JSR instruction. \\\hline
|
stack, as outlined for the JSR instruction. \\\hline
|
\end{tabular}
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-2}
|
|
\end{center}\end{table}
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
|
RET & MOV R12,PC
|
RET & MOV R12,PC
|
& This is the high(er) speed version, that doesn't touch the stack.
|
& This is the high(er) speed version, that doesn't touch the stack.
|
As such, it doesn't suffer a stall on memory read/write to the stack.
|
As such, it doesn't suffer a stall on memory read/write to the stack.
|
\\\hline
|
\\\hline
|
STEP Rr,Rt
|
STEP Rr,Rt
|
Line 857... |
Line 902... |
XOR Rx,Ry \\
|
XOR Rx,Ry \\
|
XOR Ry,Rx}
|
XOR Ry,Rx}
|
& While no extra registers are needed, this example
|
& While no extra registers are needed, this example
|
does take 3-clocks. \\\hline
|
does take 3-clocks. \\\hline
|
TRAP \#X
|
TRAP \#X
|
& LDILO \$x,CC
|
& \parbox[t]{1.5in}{LDI \$x,R0 \\ AND \textasciitilde\$GIE,CC }
|
& This approach uses the unused bits of the CC register as a TRAP
|
& This works because whenever a user lowers the \$GIE flag, it sets
|
address. The user will need to make certain
|
a TRAP bit within the CC register. Therefore, upon entering the
|
that the SLEEP and GIE bits are not set in \$x. LDI would also work,
|
supervisor state, the CPU only need check this bit to know that it
|
however using LDILO permits the use of conditional traps. (i.e.,
|
got there via a TRAP. The trap could be made conditional by making
|
trap if the zero flag is set.) Should you wish to trap off of a
|
the LDI and the AND conditional. In that case, the assembler would
|
register value, you could equivalently load \$x into the register and
|
quietly turn the LDI instruction into an LDILO and LDIHI pair,
|
then MOV it into the CC register. \\\hline
|
but the effect would be the same. \\\hline
|
|
\end{tabular}
|
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
|
\end{center}\end{table}
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline
|
TST Rx
|
TST Rx
|
& TST \$-1,Rx
|
& TST \$-1,Rx
|
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx,
|
& Set the condition codes based upon Rx. Could also do a CMP \$0,Rx,
|
ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc. The TST and CMP
|
ADD \$0,Rx, SUB \$0,Rx, etc, AND \$-1,Rx, etc. The TST and CMP
|
approaches won't stall future pipeline stages looking for the value
|
approaches won't stall future pipeline stages looking for the value
|
Line 876... |
Line 926... |
WAIT
|
WAIT
|
& Or \$SLEEP,CC
|
& Or \$SLEEP,CC
|
& Wait 'til interrupt. In an interrupts disabled context, this
|
& Wait 'til interrupt. In an interrupts disabled context, this
|
becomes a HALT instruction.
|
becomes a HALT instruction.
|
\end{tabular}
|
\end{tabular}
|
\caption{Derived Instructions, continued}\label{tbl:derived-3}
|
\caption{Derived Instructions, continued}\label{tbl:derived-4}
|
\end{center}\end{table}
|
\end{center}\end{table}
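
Two of the derived sequences above are easy to check in C. The sketch below
models the {\tt NEG} row (XOR with $-1$, then add one) and the
{\tt ADD}/{\tt ADDC} row (add the low words, bump the high word on carry, then
add the high words). The function names are hypothetical, and the carry is
modeled as the unsigned wrap of the 32--bit add, which is an assumption about
how the C flag behaves.

```c
#include <stdint.h>

/* NEG Rx  ==>  XOR $-1,Rx ; ADD $1,Rx  (two's complement negate) */
static inline uint32_t zip_neg(uint32_t rx)
{
    rx ^= 0xffffffffu;  /* XOR $-1,Rx */
    rx += 1u;           /* ADD $1,Rx  */
    return rx;
}

/* ADD Ra,Rx ; ADDC Rb,Ry  ==>  ADD Ra,Rx ; ADD.C $1,Ry ; ADD Rb,Ry
 * Treats Ry:Rx as a 64-bit value and adds Rb:Ra to it. */
static inline void zip_add64(uint32_t ra, uint32_t rb,
                             uint32_t *rx, uint32_t *ry)
{
    uint32_t lo = *rx + ra;   /* ADD Ra,Rx */
    int carry = lo < ra;      /* carry out, assumed to set the C flag */
    *rx = lo;
    if (carry)
        *ry += 1u;            /* ADD.C $1,Ry */
    *ry += rb;                /* ADD Rb,Ry */
}
```

For example, adding one to {\tt 0x00000000ffffffff} with {\tt zip\_add64}
leaves zero in the low word and one in the high word.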
|
\iffalse
|
|
\fi
|
|
\section{Pipeline Stages}
|
\section{Pipeline Stages}
|
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu},
|
the Zip CPU supports a five stage pipeline.
|
the Zip CPU supports a five stage pipeline.
|
\begin{enumerate}
|
\begin{enumerate}
|
\item {\bf Prefetch}: Read instruction from memory (cache if possible). This
|
\item {\bf Prefetch}: Reads instruction from memory and into a cache, if so
|
|
configured. This
|
stage is actually pipelined itself, and so it will stall if the PC
|
stage is actually pipelined itself, and so it will stall if the PC
|
ever changes. Stalls are also created here if the instruction isn't
|
ever changes. Stalls are also created here if the instruction isn't
|
in the prefetch cache.
|
in the prefetch cache.
|
\item {\bf Decode}: Decode instruction into op code, register(s) to read, and
|
|
immediate offset. This stage also determines whether the flags will
|
The Zip CPU supports one of two prefetch methods, depending upon a flag
|
|
set at build time within the {\tt zipcpu.v} file. The simplest is a
|
|
non--cached implementation of a prefetch. This implementation is
|
|
fairly small, and ideal for
|
|
users of the Zip CPU who need the extra space on the FPGA fabric.
|
|
However, because this non--cached version has no cache, the maximum
|
|
throughput is limited to about one instruction per five clocks.
|
|
|
|
The second prefetch module is a pipelined prefetch with a cache. This
|
|
module tries to keep the instruction address within a window of valid
|
|
instruction addresses. While effective, it is not a traditional
|
|
cache implementation. One unique feature of this cache implementation,
|
|
however, is that it can be cleared in a single clock. A disappointing
|
|
feature, though, is that it needs an extra internal pipeline stage
|
|
to be implemented.
|
|
|
|
\item {\bf Decode}: Decodes an instruction into op code, register(s) to read,
|
|
and immediate offset. This stage also determines whether the flags will
|
be set or whether the result will be written back.
|
be set or whether the result will be written back.
|
\item {\bf Read Operands}: Read registers and apply any immediate values to
|
\item {\bf Read Operands}: Read registers and apply any immediate values to
|
them. There is no means of detecting or flagging arithmetic overflow
|
them. There is no means of detecting or flagging arithmetic overflow
|
or carry when adding the immediate to the operand. This stage will
|
or carry when adding the immediate to the operand. This stage will
|
stall if any source operand is pending.
|
stall if any source operand is pending.
|
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
|
\item Split into two tracks: An {\bf ALU} which will accomplish a simple
|
instruction, and the {\bf MemOps} stage which accomplishes memory
|
instruction, and the {\bf MemOps} stage which handles {\tt LOD} (load)
|
read/write.
|
and {\tt STO} (store) instructions.
|
\begin{itemize}
|
\begin{itemize}
|
\item Loads stall instructions that access the register until it is
|
\item Loads will stall the entire pipeline until complete.
|
written to the register set.
|
\item Condition codes are available upon completion of the ALU stage
|
\item Condition codes are available upon completion
|
\item Issuing an instruction to the memory unit while the memory unit
|
\item Issuing an instruction to the memory while the memory is busy will
|
is busy will stall the entire pipeline. If the bus deadlocks,
|
stall the entire pipeline. If the bus deadlocks, only a reset
|
only a reset will release the CPU. (Watchdog timer, anyone?)
|
will release the CPU. (Watchdog timer, anyone?)
|
|
\item The Zip CPU currently has no means of reading and acting on any
|
\item The Zip CPU currently has no means of reading and acting on any
|
error conditions on the bus.
|
error conditions on the bus.
|
\end{itemize}
|
\end{itemize}
|
\item {\bf Write-Back}: Conditionally write back the result to the register
|
\item {\bf Write-Back}: Conditionally write back the result to the register
|
set, applying the condition. This routine is bi-re-entrant: either the
|
set, applying the condition. This routine is bi-entrant: either the
|
memory or the simple instruction may request a register write.
|
memory or the simple instruction may request a register write.
|
\end{enumerate}
|
\end{enumerate}
|
|
|
The Zip CPU does not support out of order execution. Therefore, if the memory
|
The Zip CPU does not support out of order execution. Therefore, if the memory
|
unit stalls, every other instruction stalls. Memory stores, however, can take
|
unit stalls, every other instruction stalls. Memory stores, however, can take
|
place concurrently with ALU operations, although memory reads cannot.
|
place concurrently with ALU operations, although memory reads (loads) cannot.
|
|
|
\iffalse
|
|
|
|
\section{Pipeline Logic}
|
|
How the CPU handles some instruction combinations can be telling when
|
|
determining what happens in the pipeline. The following lists some examples:
|
|
\begin{itemize}
|
|
\item {\bf Delayed Branching}
|
|
|
|
I had originally hoped to implement delayed branching. My goal
|
|
was that the compiler would handle any pipeline stall conditions so
|
|
that the pipeline logic could be simpler within the CPU. I ran into
|
|
two problems with this.
|
|
|
|
The first problem has to deal with debug mode. When the debugger
|
|
single steps an instruction, that instruction goes to completion.
|
|
This means that if the instruction moves a value to the PC register,
|
|
the PC register would now contain that value, indicating that the
|
|
next instruction would be on the other side of the branch. There's
|
|
just no easy way around this: the entire CPU state must be captured
|
|
by the registers, to include the program counter. What value should
|
|
the program counter be equal to? The branch? Fine. The address
|
|
you are branching to? Fine. The address of the delay slot? Problem.
|
|
|
|
The second problem with delayed branching is the idea of suspending
|
|
processing for an interrupt. Which address should the CPU return
|
|
to upon completing the interrupt processing? The branch? Good. The
|
|
address after the branch? Also good. The address of the delay slot?
|
|
Not so good.
|
|
|
|
If you then add into this mess the idea that, if the CPU is running
|
|
from a really slow memory such as the flash, the delay slot may never
|
|
be filled before the branch is determined, then this makes even less
|
|
sense.
|
|
|
|
For all of these reasons, this CPU does not support delayed branching.
|
|
|
|
\item {\bf Register Result:} {\tt MOV R0,R1; MOV R1,R2 }
|
|
|
|
What value does R2 get, the value of R1 before the first move or the
|
|
value of R0? The Zip CPU has been optimized so that neither of these
|
|
instructions require a pipeline stall--unless an immediate were to
|
|
be added to R1 in the second instruction.
|
|
|
|
The ZIP CPU architecture requires that R2 must equal R0 at the end of
|
|
this operation. Even better, such combinations do not (normally)
|
|
stall the pipeline.
|
|
|
|
\item {\bf Condition Codes Result:} {\tt CMP R0,R1;} {\tt MOV.EQ \$x,PC}
|
|
|
|
At issue is the same item as above, save that the CMP instruction
|
|
updates the flags that the MOV instruction depends upon.
|
|
|
|
The Zip CPU architecture requires that condition codes must be updated
|
|
and available immediately for the next instruction without stalling the
|
|
pipeline.
|
|
|
|
\item {\bf Condition Codes Register Result:} {\tt CMP R0,R1; MOV CC,R2}
|
|
|
|
At issue is the
|
|
fact that the logic supporting the CC register is more complicated than
|
|
the logic supporting any other register.
|
|
|
|
The Zip CPU will stall for a cycle on this instruction.
|
|
\item {\bf Condition Codes Register Operand:} {\tt MOV R0,R1; MOV CC,R2}
|
|
|
|
Unlike the previous case, this move prior to reading the {\tt CC}
|
|
register does not impact the {\tt CC} register. Therefore, this
|
|
does not stall the bus, whereas the previous one would.
|
|
\end{itemize}
|
|
|
|
As I've studied this, I find several approaches to handling pipeline
|
|
issues. These approaches (and their consequences) are listed below.
|
|
|
|
\begin{itemize}
|
|
\item {\bf All issued instructions complete, stages stall individually}
|
|
|
|
What about a slow pre-fetch?
|
|
|
|
Nominally, this works well: any issued instruction
|
|
just runs to completion. If there are four issued instructions in the
|
|
pipeline, with the writeback instruction being a write-to-PC
|
|
instruction, the other three instructions naturally finish.
|
|
|
|
This approach fails when reading instructions from the flash,
|
|
since such reads require N clocks to complete. Thus
|
|
there may be only one instruction in the pipeline if reading from flash,
|
|
or a full pipeline if reading from cache. Each of these approaches
|
|
would produce a different response.
|
|
|
|
For this reason, the Zip CPU works off of a different basis: All
|
|
instructions that enter either the ALU or the memory unit will
|
|
complete. Stages still stall individually.
|
|
|
|
\item {\bf Issued instructions may be canceled}
|
|
|
|
The problem here is that
|
|
memory operations cannot be canceled: even reads may have side effects
|
|
on peripherals that cannot be canceled later. Further, in the case of
|
|
an interrupt, it's difficult to know what to cancel. What happens in
|
|
a \hbox{\tt MOV.C \$x,PC} followed by a \hbox{\tt MOV \$y,PC}
|
|
instruction? Which get canceled?
|
|
|
|
Because it isn't clear what would need to be canceled, the Zip CPU
|
|
will not permit this combination. A MOV to the PC register will be
|
|
followed by a stall, and possibly many stalls, so that the second
|
|
move to PC will never be executed.
|
|
|
|
\item {\bf All issued instructions complete.}
|
|
|
|
Here, all issued instructions complete, but the
|
|
entire pipeline stalls if one stage is not filled. In this approach,
|
|
though, we again struggle with the problems associated with
|
|
delayed branching. Upon attempting to restart the processor, where
|
|
do you restart it from?
|
|
|
|
\item {\bf Memory instructions must complete}
|
|
|
|
All instructions that enter into the memory module {\em must}
|
|
complete. Issued instructions from the prefetch, decode, or operand
|
|
read stages may or may not complete. Jumps into code must be valid,
|
|
so that interrupt returns may be valid. All instructions entering the
|
|
ALU complete.
|
|
|
|
This looks to be the simplest approach.
|
|
While the logic may be difficult, this appears to be the only
|
|
re-entrant approach.
|
|
|
|
A {\tt new\_pc} flag will be high anytime the PC changes in an
|
|
unpredictable way (i.e., it doesn't increment). This includes jumps
|
|
as well as interrupts and interrupt returns. Whenever this flag may
|
|
go high, memory operations and ALU operations will stall until the
|
|
result is known. When the flag does go high, anything in the prefetch,
|
|
decode, and read-op stage will be invalidated.
|
|
|
|
\end{itemize}
|
|
\fi
|
|
|
|
\section{Pipeline Stalls}
|
\section{Pipeline Stalls}
|
The processing pipeline can and will stall for a variety of reasons. Some of
|
The processing pipeline can and will stall for a variety of reasons. Some of
|
these are obvious, some less so. These reasons are listed below:
|
these are obvious, some less so. These reasons are listed below:
|
\begin{itemize}
|
\begin{itemize}
|
\item When the prefetch cache is exhausted
|
\item When the prefetch cache is exhausted
|
|
|
This should be obvious. If the prefetch cache doesn't have the instruction
|
This reason should be obvious. If the prefetch cache doesn't have the
|
in memory, the entire pipeline must stall until enough of the prefetch cache
|
instruction in memory, the entire pipeline must stall until enough of the
|
is loaded to support the next instruction.
|
prefetch cache is loaded to support the next instruction.
|
|
|
\item While waiting for the pipeline to load following any taken branch, jump,
|
\item While waiting for the pipeline to load following any taken branch, jump,
|
return from interrupt or switch to interrupt context (6 clocks)
|
return from interrupt or switch to interrupt context (5 stall cycles)
|
|
|
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
|
If the PC suddenly changes, the pipeline is subsequently cleared and needs to
|
be reloaded. Given that there are five stages to the pipeline, that accounts
|
be reloaded. Given that there are five stages to the pipeline, that accounts
|
for five of the six delay clocks. The last clock is lost in the prefetch
|
for four of the five stalls. The last stall cycle is lost in the pipelined prefetch
|
stage which needs at least one clock with a valid PC before it can produce
|
stage which needs at least one clock with a valid PC before it can produce
|
a new output. Hence, six clocks will always be lost anytime the pipeline needs
|
a new output.
|
to be cleared.
|
|
|
The Zip CPU handles {\tt MOV \$X(PC),PC}, {\tt ADD \$X,PC}, and
|
|
{\tt LDI \$X,PC} instructions specially, however. These instructions, when
|
|
not conditioned on the flags, can execute with only 3~stall cycles.
|
|
|
\item When reading from a prior register while also adding an immediate offset
|
\item When reading from a prior register while also adding an immediate offset
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt OPCODE ?,RA}
|
\item\ {\tt OPCODE ?,RA}
|
\item\ {\em (stall)}
|
\item\ {\em (stall)}
|
Line 1086... |
Line 1017... |
Since the addition of the immediate register within OpB decoding gets applied
|
Since the addition of the immediate register within OpB decoding gets applied
|
during the read operand stage so that it can be nicely settled before the ALU,
|
during the read operand stage so that it can be nicely settled before the ALU,
|
any instruction that will write back an operand must be separated from the
|
any instruction that will write back an operand must be separated from the
|
opcode that will read and apply an immediate offset by one instruction. The
|
opcode that will read and apply an immediate offset by one instruction. The
|
good news is that this stall can easily be mitigated by proper scheduling.
|
good news is that this stall can easily be mitigated by proper scheduling.
|
|
That is, any instruction that does not add an immediate to {\tt RA} may be
|
|
scheduled into the stall slot.
|
|
|
\item When writing to the CC or PC Register
|
\item When any write to either the CC or PC Register is followed by a memory
|
|
operation
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
|
\item\ {\tt OPCODE RA,PC} {\em Ex: a branch opcode}
|
\item\ {\em (stall, even if jump not taken)}
|
\item\ {\em (stall, even if jump not taken)}
|
\item\ {\tt OPCODE RA,RB}
|
\item\ {\tt LOD \$X(RA),RB}
|
\end{enumerate}
|
\end{enumerate}
|
Since branches take place in the writeback stage, the Zip CPU will stall the
|
Since branches take place in the writeback stage, the Zip CPU will stall the
|
pipeline for one clock anytime there may be a possible jump. This prevents
|
pipeline for one clock anytime there may be a possible jump. This prevents
|
an instruction from executing a memory access after the jump but before the
|
an instruction from executing a memory access after the jump but before the
|
jump is recognized.
|
jump is recognized.
|
|
|
This stall cannot be mitigated through scheduling.
|
This stall may be mitigated by shuffling the operations immediately following
|
|
a potential branch so that an ALU operation follows the branch instead of a
|
|
memory operation.
|
|
|
\item When reading from the CC register after setting the flags
|
\item When reading from the CC register after setting the flags
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt ALUOP RA,RB}
|
\item\ {\tt ALUOP RA,RB} {\em Ex: a compare opcode}
|
\item\ {\em (stall}
|
\item\ {\em (stall)}
|
\item\ {\tt TST sys.ccv,CC}
|
\item\ {\tt TST sys.ccv,CC}
|
\item\ {\tt BZ somewhere}
|
\item\ {\tt BZ somewhere}
|
\end{enumerate}
|
\end{enumerate}
|
|
|
The reason for this stall is simply performance. Many of the flags are
|
The reason for this stall is simply performance. Many of the flags are
|
Line 1120... |
Line 1056... |
This stall may be eliminated via proper scheduling, by placing an instruction
|
This stall may be eliminated via proper scheduling, by placing an instruction
|
that does not set flags in between the ALU operation and the instruction
|
that does not set flags in between the ALU operation and the instruction
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
that references the CC register. For example, {\tt MOV \$addr+PC,uPC}
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
followed by an {\tt RTU} ({\tt OR \$GIE,CC}) instruction will not incur
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
this stall, whereas an {\tt OR \$BREAKEN,CC} followed by an {\tt OR \$STEP,CC}
|
will incur the stall.
|
will incur the stall, while a {\tt LDI \$BREAKEN|\$STEP,CC} will not.
|
|
|
\item When waiting for a memory read operation to complete
|
\item When waiting for a memory read operation to complete
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt LOD address,RA}
|
\item\ {\tt LOD address,RA}
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\tt OPCODE I+RA,RB}
|
\item\ {\tt OPCODE I+RA,RB}
|
\end{enumerate}
|
\end{enumerate}
|
|
|
Remember, the ZIP CPU does not support out of order execution. Therefore,
|
Remember, the Zip CPU does not support out of order execution. Therefore,
|
anytime the memory unit becomes busy both the memory unit and the ALU must
|
anytime the memory unit becomes busy both the memory unit and the ALU must
|
stall until the memory unit is cleared. This is especially true of a load
|
stall until the memory unit is cleared. This is especially true of a load
|
instruction, which must still write its operand back to the register file.
|
instruction, which must still write its operand back to the register file.
|
Store instructions are different, since they can be busy with no impact on
|
Store instructions are different, since they can be busy with no impact on
|
later ALU write back operations. Hence, only loads stall the pipeline.
|
later ALU write back operations. Hence, only loads stall the pipeline.
|
Line 1144... |
Line 1080... |
will be busy, and unable to do anything else.
|
will be busy, and unable to do anything else.
|
|
|
\item Memory operation followed by a memory operation
|
\item Memory operation followed by a memory operation
|
\begin{enumerate}
|
\begin{enumerate}
|
\item\ {\tt STO address,RA}
|
\item\ {\tt STO address,RA}
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\item\ {\tt LOD address,RB}
|
\item\ {\tt LOD address,RB}
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
\item\ {\em (multiple stalls, bus dependent, 4 clocks best)}
|
\end{enumerate}
|
\end{enumerate}
|
|
|
In this case, the LOD instruction cannot start until the STALL is finished.
|
In this case, the LOD instruction cannot start until the STO is finished.
|
With proper scheduling, it is possible to do something in the ALU while the
|
With proper scheduling, it is possible to do something in the ALU while the
|
STO is busy, but otherwise this pipeline will stall waiting for it to complete.
|
memory unit is busy with the STO instruction, but otherwise this pipeline will
|
|
stall waiting for it to complete.
|
|
|
Note that even though the Wishbone bus can support pipelined accesses at
|
Note that even though the Wishbone bus can support pipelined accesses at
|
one access per clock, only the prefetch stage can take advantage of this.
|
one access per clock, only the prefetch stage can take advantage of this.
|
Load and Store instructions are stuck at one wishbone cycle per instruction.
|
Load and Store instructions are stuck at one wishbone cycle per instruction.
|
|
|
|
\item When waiting for a conditional memory read operation to complete
|
|
\begin{enumerate}
|
|
\item\ {\tt LOD.Z address,RA}
|
|
\item\ {\em (multiple stalls, bus dependent, 7 clocks best)}
|
|
\item\ {\tt OPCODE I+RA,RB}
|
|
\end{enumerate}
|
|
|
|
In this case, the Zip CPU doesn't warn the prefetch cache to get off the bus
|
|
two cycles before using the bus, so there's a potential for an extra three
|
|
cycle cost due to bus contention between the prefetch and the CPU.
|
|
|
|
This is true for both the LOD and the STO instructions, with the exception that
|
|
the STO instruction will continue in parallel with any ALU instructions that
|
|
follow it.
|
|
|
\end{itemize}
|
\end{itemize}
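The operand stall described above (an instruction that adds an immediate to a register written by the instruction directly before it) can be sketched as a small Python model. This is only an illustration under stated assumptions: the tuple layout {\tt (dest, src, has\_immediate)} and the function name {\tt count\_immediate\_stalls} are hypothetical and are not taken from the Zip CPU sources, and the model deliberately ignores every other stall source in the list.

```python
# Hypothetical model of ONE stall rule only: an instruction that reads a
# register with an immediate offset directly after that register was written.
# Each instruction is a tuple (dest, src, has_immediate); these names are
# illustrative assumptions, not identifiers from the Zip CPU sources.

def count_immediate_stalls(program):
    """Return the number of one-cycle stalls this single rule would insert."""
    stalls = 0
    prev_dest = None
    for dest, src, has_immediate in program:
        # Stall when the source was written by the previous instruction
        # and an immediate must be added to it in the read-operand stage.
        if has_immediate and src is not None and src == prev_dest:
            stalls += 1
        prev_dest = dest
    return stalls


# Back to back: R1 is written, then immediately read with an offset.
back_to_back = [("R1", "R0", False), ("R2", "R1", True)]
# Scheduled: an unrelated instruction fills the slot between the two.
scheduled = [("R1", "R0", False), ("R4", "R3", False), ("R2", "R1", True)]

print(count_immediate_stalls(back_to_back))
print(count_immediate_stalls(scheduled))
```

Comparing the two sequences shows why the text calls this stall easy to mitigate: moving one independent instruction into the slot removes it.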
|
|
|
|
|
\chapter{Peripherals}\label{chap:periph}
|
\chapter{Peripherals}\label{chap:periph}
|
|
|
Line 1277... |
Line 1230... |
O/S must also keep track of values written to the Jiffies register. Thus,
|
O/S must also keep track of values written to the Jiffies register. Thus,
|
when an `alarm' trips, it should be removed from the list of alarms, the list
|
when an `alarm' trips, it should be removed from the list of alarms, the list
|
should be sorted, and the next alarm in terms of Jiffies should be written
|
should be sorted, and the next alarm in terms of Jiffies should be written
|
to the register.
|
to the register.
|
|
|
\section{Manual Cache}
|
\section{Direct Memory Access Controller}
|
|
|
|
The Direct Memory Access (DMA) controller can be used to move memory
|
|
from one location to another, to read from a peripheral into memory, or to
|
|
write from memory into a peripheral, all without CPU intervention. Further,
|
|
since the DMA controller can issue (and does issue) pipelined Wishbone accesses,
|
|
any DMA memory move will by nature be faster than a corresponding program
|
|
accomplishing the same move. To put this to numbers, it may take a program
|
|
18~clocks per word transferred, whereas this DMA controller can move one
|
|
word in two clocks--provided it has bus access. (The CPU gets priority over the
|
|
bus.)
|
|
|
|
When copying memory from one location to another, the DMA controller will
|
|
copy in units of a given transfer length--up to 1024 words at a time. It will
|
|
read that transfer length into its internal buffer, and then write to the
|
|
destination address from that buffer. If the CPU interrupts a DMA transfer,
|
|
the DMA controller will release the bus, let the CPU complete whatever it needs to do, and then
|
|
restart its transfer by writing the contents of its internal buffer and then
|
|
re-entering its read cycle again.
|
|
|
|
When coupled with a peripheral, the DMA controller can be configured to start
|
|
a memory copy on an interrupt line going high. Further, the controller can be
|
|
configured to issue reads from (or writes to) the same address instead of incrementing
|
|
the address at each clock. The DMA completes once the total number of items
|
|
specified (not the transfer length) have been transferred.
|
|
|
|
In each case, once the transfer is complete and the DMA unit returns to
|
|
idle, the DMA will issue an interrupt.
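The buffered read-then-write behavior described above can be sketched in Python. This is a behavioral illustration only, not the controller's implementation: memory is modeled as a plain list of words, and the function name {\tt dma\_copy} and its parameters ({\tt inc\_src}, {\tt inc\_dst}, and so on) are assumptions made for this sketch. Bus arbitration, CPU preemption, and the completion interrupt are not modeled.

```python
MAX_TRANSFER_LEN = 1024  # upper limit on the transfer length, per the text


def dma_copy(mem, src, dst, total, transfer_len, inc_src=True, inc_dst=True):
    """Copy `total` words within `mem`, one buffered burst at a time.

    Returns the number of read/write bursts that were needed.
    All names here are hypothetical; they only illustrate the description.
    """
    if not 1 <= transfer_len <= MAX_TRANSFER_LEN:
        raise ValueError("transfer_len must be between 1 and 1024 words")
    bursts = 0
    remaining = total
    while remaining > 0:
        n = min(transfer_len, remaining)
        # Read phase: fill the internal buffer from the source.
        buffer = [mem[src + i if inc_src else src] for i in range(n)]
        # Write phase: drain the buffer to the destination.
        for i, word in enumerate(buffer):
            mem[dst + i if inc_dst else dst] = word
        # A fixed (non-incrementing) address models a peripheral register.
        if inc_src:
            src += n
        if inc_dst:
            dst += n
        remaining -= n
        bursts += 1
    return bursts


memory = list(range(8)) + [0] * 8
print(dma_copy(memory, src=0, dst=8, total=8, transfer_len=3))
print(memory[8:])
```

Note that the loop ends on the total number of words, not on the transfer length, which matches the statement that the DMA completes once the total number of items has been transferred.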
|
|
|
The manual cache is an experimental setting that may not remain with the Zip
|
|
CPU for very long. It is designed to facilitate running from FLASH or ROM
|
|
memory, although the pipeline prefetch cache really makes this need obsolete.
|
|
The manual
|
|
cache works by copying data from a wishbone address (range) into the cache
|
|
register, and then by making that memory available as memory to the Zip System.
|
|
It is a {\em manual cache} because the processor must first specify what
|
|
memory to copy, and then once copied the processor can only access the cache
|
|
memory by the cache memory location. There is no transparency. It is perhaps
|
|
best described as a combination DMA controller and local memory.
|
|
|
|
Worse, this cache is likely going to be removed from the ZipSystem. Having used
|
|
the ZipSystem now for some time, I have yet to find a need or use for the manual
|
|
cache. I will likely replace this peripheral with a proper DMA controller.
|
|
|
|
\chapter{Operation}\label{chap:ops}
|
\chapter{Operation}\label{chap:ops}
|
|
|
The Zip CPU, and even the Zip System, is not a System on a Chip (SoC). It
|
The Zip CPU, and even the Zip System, is not a System on a Chip (SoC). It
|
needs to be connected to its operational environment in order to be used.
|
needs to be connected to its operational environment in order to be used.
|
Line 1320... |
Line 1286... |
\end{enumerate}
|
\end{enumerate}
|
If you have enabled your CPU to start automatically, then upon power up the
|
If you have enabled your CPU to start automatically, then upon power up the
|
CPU will immediately start executing your instructions.
|
CPU will immediately start executing your instructions.
|
|
|
This is, however, not how I have used the Zip CPU. I have instead used the
|
This is, however, not how I have used the Zip CPU. I have instead used the
|
ZIP CPU in a more controlled environment. For me, the CPU starts in a
|
Zip CPU in a more controlled environment. For me, the CPU starts in a
|
halted state, and waits to be told to start. Further, the RESET address is a
|
halted state, and waits to be told to start. Further, the RESET address is a
|
location in RAM. After bringing up the board I am using, and further the
|
location in RAM. After bringing up the board I am using, and further the
|
bus that is on it, the RAM memory is then loaded externally with the program
|
bus that is on it, the RAM memory is then loaded externally with the program
|
I wish the Zip System to run. Once the RAM is loaded, I release the CPU.
|
I wish the Zip System to run. Once the RAM is loaded, I release the CPU.
|
The CPU then runs until its halt condition, at which point its task is
|
The CPU then runs until its halt condition, at which point its task is
|
complete.
|
complete.
|
|
|
Eventually, I intend to place an operating system onto the ZipSystem, I'm
|
Eventually, I intend to place an operating system onto the ZipSystem; I'm
|
just not there yet.
|
just not there yet.
|
|
|
|
The rest of this chapter examines some common programming constructs, and
|
|
how they might be applied to the Zip System.
|
|
|
|
\section{Example: Idle Task}
|
|
One task every operating system needs is the idle task, the task that takes
|
|
place when nothing else can run. On the Zip CPU, this task is quite simple,
|
|
and it is shown in assembly in Tbl.~\ref{tbl:idle-asm}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
{\tt idle\_task:} \\
|
|
& {\em ; Wait for the next interrupt, then switch to supervisor task} \\
|
|
& {\tt WAIT} \\
|
|
& {\em ; When we come back, it's because the supervisor wishes to} \\
|
|
& {\em ; wait for an interrupt again, so go back to the top.} \\
|
|
& {\tt BRA idle\_task} \\
|
|
\end{tabular}
|
|
\caption{Example Idle Loop}\label{tbl:idle-asm}
|
|
\end{center}\end{table}
|
|
When this task runs, the CPU will fill up all of the pipeline stages up to the
|
|
ALU. The {\tt WAIT} instruction, upon leaving the ALU, places the CPU into
|
|
a sleep state where nothing more moves. Sure, there may be some more settling,
|
|
the pipe cache may continue to read until full, other instructions may issue until
|
|
the pipeline fills, but then everything will stall. Then, once an interrupt
|
|
takes place, control passes to the supervisor task to handle the interrupt.
|
|
When control passes back to this task, it will be on the next instruction.
|
|
Since that next instruction sends us back to the top of the task, the idle
|
|
task thus does nothing but wait for an interrupt.
|
|
|
|
This should be the lowest priority task, the task that runs when nothing else
|
|
can. It will help lower the FPGA power usage overall---at least its dynamic
|
|
power usage.
|
|
|
|
\section{Example: Memory Copy}
|
|
One common operation is that of a memory move or copy. Consider the C code
|
|
shown in Tbl.~\ref{tbl:memcp-c}.
|
|
\begin{table}\begin{center}
|
|
\parbox{4in}{\begin{tabbing}
|
|
{\tt void} \= {\tt memcp(void *dest, void *src, int len) \{} \\
|
|
\> {\tt for(int i=0; i<len; i++)} \\
|
|
\> \hspace{0.2in} {\tt *dest++ = *src++;} \\
|
|
\}
|
|
\end{tabbing}}
|
|
\caption{Example Memory Copy code in C}\label{tbl:memcp-c}
|
|
\end{center}\end{table}
|
|
This same code can be translated into Zip Assembly as shown in
|
|
Tbl.~\ref{tbl:memcp-asm}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
memcp: \\
|
|
& {\em ; R0 = *dest, R1 = *src, R2 = LEN} \\
|
|
& {\em ; The following will operate in 17 clocks per word minus one clock} \\
|
|
& {\tt CMP 0,R2} \\
|
|
& {\tt LOD.Z -1(SP),PC} {\em ; A conditional return }\\
|
|
& {\em ; (One stall on potentially writing to PC)} \\
|
|
loop: \\
|
|
& {\tt LOD (R1),R3} \\
|
|
& {\em ; (4 stalls, cannot be scheduled away)} \\
|
|
& {\tt STO R3,(R0)} {\em ; (4 schedulable stalls, has no impact now)} \\
|
|
& {\tt ADD 1,R0} \\
|
|
& {\tt ADD 1,R1} \\
|
|
& {\tt SUB 1,R2} \\
|
|
& {\tt BNZ loop} \\
|
|
& {\em ; (5 stalls, if branch taken, to clear and refill the pipeline)} \\
|
|
& {\tt RET} \\
|
|
\end{tabular}
|
|
\caption{Example Memory Copy code in Zip Assembly}\label{tbl:memcp-asm}
|
|
\end{center}\end{table}
|
|
This example points out several things associated with the Zip CPU. First,
|
|
a straightforward implementation of a for loop is not the fastest loop
|
|
structure. For this reason, we have placed the test to continue at the
|
|
end. Second, all pointers are {\tt void} pointers to arbitrary 32--bit
|
|
data types. The Zip CPU does not have explicit support for smaller or larger
|
|
data types, and so this memory copy cannot be applied at a byte level.
|
|
Third, we've optimized the conditional jump to a return instruction into a
|
|
conditional return instruction.
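The loop structure discussed above (an early return for a zero length, followed by a loop whose test sits at the bottom) can be sketched in Python. This is an illustrative model only: memory is represented as a word-addressed list, and the function signature is an assumption made for this sketch rather than anything defined by the Zip CPU tool chain.

```python
def memcp(mem, dest, src, length):
    """Copy `length` 32-bit words inside the word-addressed list `mem`.

    `dest` and `src` are word indices into `mem` (hypothetical model).
    """
    if length == 0:
        # Mirrors the conditional return taken before the loop is entered.
        return
    while True:
        # Load one word from the source, then store it to the destination.
        mem[dest] = mem[src]
        dest += 1
        src += 1
        length -= 1
        # The test to continue sits at the end of the loop body.
        if length == 0:
            break


words = [1, 2, 3, 0, 0, 0]
memcp(words, dest=3, src=0, length=3)
print(words)
```

Placing the test at the bottom means each iteration executes a single branch, which is the reason the text gives for not using a straightforward for loop.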
|
|
|
|
\section{Context Switch}
|
|
|
|
Fundamental to any multiprocessing system is the ability to switch from one
|
|
task to the next. In the ZipSystem, this is accomplished in one of a couple of
|
|
ways. The first step is that an interrupt happens. Anytime an interrupt
|
|
happens, the CPU needs to execute the following tasks in supervisor mode:
|
|
\begin{enumerate}
|
|
\item Check for a trap instruction. That is, if the user task requested a
|
|
trap, we may not wish to adjust the context, check interrupts, or call
|
|
the scheduler. Tbl.~\ref{tbl:trap-check}
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
{\tt return\_to\_user:} \\
|
|
& {\em; The instruction before the context switch processing must} \\
|
|
& {\em; be the RTU instruction that enacted user mode in the first} \\
|
|
& {\em; place. We show it here just for reference.} \\
|
|
& {\tt RTU} \\
|
|
{\tt trap\_check:} \\
|
|
& {\tt MOV uCC,R0} \\
|
|
& {\tt TST \$TRAP,R0} \\
|
|
& {\tt BNZ swap\_out} \\
|
|
& {\em ; Do something here to execute the trap} \\
|
|
& {\em ; Don't need to call the scheduler, so we can just return} \\
|
|
& {\tt BRA return\_to\_user} \\
|
|
\end{tabular}
|
|
\caption{Checking for whether the user issued a TRAP instruction}\label{tbl:trap-check}
|
|
\end{center}\end{table}
|
|
shows the rudiments of this code, while showing nothing of how the
|
|
actual trap would be implemented.
|
|
|
|
You may also wish to note that the instruction before the first instruction
|
|
in our context swap {\em must be} a return to userspace instruction.
|
|
Remember, the supervisor process is re--entered where it left off. This is
|
|
different from many other processors that enter interrupt mode at some vector
|
|
or other. In this case, we always enter supervisor mode right where we last
|
|
left.\footnote{The one exception to this rule is upon reset where supervisor
|
|
mode is entered at a pre--programmed wishbone memory address.}
|
|
|
|
\item Capture user counters. If the operating system is keeping track of
|
|
system usage via the accounting counters, those counters need to be
|
|
copied and accumulated into some master counter at this point.
|
|
|
|
\item Preserve the old context. This involves pushing all the user registers
|
|
onto the user stack and then copying the resulting stack address
|
|
into the task's task structure, as shown in Tbl.~\ref{tbl:context-out}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
{\tt swap\_out:} \\
|
|
& {\tt MOV -15(uSP),R1} \\
|
|
& {\tt STO R1,stack(R12)} \\
|
|
& {\tt MOV uPC,R0} \\
|
|
& {\tt STO R0,15(R1)} \\
|
|
& {\tt MOV uCC,R0} \\
|
|
& {\tt STO R0,14(R1)} \\
|
|
& {\em ; We can skip storing the stack, uSP, since it'll be stored}\\
|
|
& {\em ; elsewhere (in the task structure) }\\
|
|
& {\tt MOV uR13,R0} \\
|
|
& {\tt STO R0,13(R1)} \\
|
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
|
& {\tt MOV uR0,R0} \\
|
|
& {\tt STO R0,1(R1)} \\
|
|
\end{tabular}
|
|
\caption{Example Storing User Task Context}\label{tbl:context-out}
|
|
\end{center}\end{table}
|
|
For the sake of discussion, we assume the supervisor maintains a
|
|
pointer to the current task's structure in supervisor register
|
|
{\tt R12}, and that {\tt stack} is an offset to the beginning of this
|
|
structure indicating where the stack pointer is to be kept within it.
|
|
|
|
For those who are still interested, the full code for this context
|
|
save can be found as an assembler macro within the assembler
|
|
include file, {\tt sys.i}.
|
|
|
|
\item Reset the watchdog timer. If you are using the watchdog timer, reset it
	on every context swap, so that it only expires when the supervisor
	has stopped running.
	Example code for this is shown in Tbl.~\ref{tbl:reset-watchdog}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
\multicolumn{2}{l}{{\tt `define WATCHDOG\_ADDRESS 32'hc000\_0001}}\\
\multicolumn{2}{l}{{\tt `define WATCHDOG\_TICKS 32'd1\_000\_000} {\em ; = 10 ms at 100~MHz}}\\
|
|
& {\tt LDI WATCHDOG\_ADDRESS,R0} \\
|
|
& {\tt LDI WATCHDOG\_TICKS,R1} \\
|
|
& {\tt STO R1,(R0)}
|
|
\end{tabular}
|
|
\caption{Example Watchdog Reset}\label{tbl:reset-watchdog}
|
|
\end{center}\end{table}
|
|
|
|
\item Interrupt handling. An interrupt handler within the Zip System is nothing
	more than a task. At context swap time, the supervisor needs to
	disable all of the interrupts that have tripped, and then enable
	all of the tasks that would deal with each of these interrupts.
	These can be user tasks, run at a higher priority than any other user
	tasks. In any case, each handler will need to re--enable its own
	interrupt, if that interrupt is still relevant.

	An example of this master interrupt handling is shown in
	Tbl.~\ref{tbl:pre-handler}.
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
{\tt pre\_handler:} \\
|
|
& {\tt LDI PIC\_ADDRESS,R0 } \\
|
|
& {\em ; Start by grabbing the interrupt state from the interrupt}\\
|
|
& {\em ; controller. We'll store this into the register R7 so that }\\
|
|
& {\em ; we can keep and preserve this information for the scheduler}\\
|
|
& {\em ; to use later. }\\
|
|
& {\tt LOD (R0),R1} \\
|
|
& {\tt MOV R1,R7 } \\
|
|
& {\em ; As a next step, we need to acknowledge and disable all active}\\
|
|
& {\em ; interrupts. We'll start by calculating all of our active}\\
|
|
& {\em ; interrupts.}\\
|
|
& {\tt AND 0x07fff,R1 } \\
|
|
& {\em ; Put the active interrupts into the upper half of R1} \\
|
|
& {\tt ROL 16,R1 } \\
|
|
& {\tt LDILO 0x0ffff,R1 } \\
|
|
& {\tt AND R7,R1}\\
|
|
& {\em ; Acknowledge and disable active interrupts}\\
|
|
& {\em ; This also disables all interrupts from the controller, so}\\
|
|
& {\em ; we'll need to re-enable interrupts in general shortly } \\
|
|
& {\tt STO R1,(R0) } \\
|
|
& {\em ; We leave our active interrupt mask in R7 so the scheduler can}\\
|
|
& {\em ; release any tasks that depended upon them. } \\
|
|
\end{tabular}
|
|
\caption{Example checking for active interrupts}\label{tbl:pre-handler}
|
|
\end{center}\end{table}
|
|
|
|
\item Calling the scheduler. This needs to be done to pick the next task
|
|
to switch to. It may be an interrupt handler, or it may be a normal
|
|
user task. From a priority standpoint, it would make sense that the
|
|
interrupt handlers all have a higher priority than the user tasks,
|
|
and that once they have been called the user tasks may then be called
|
|
again. If no task is ready to run, run the idle task to wait for an
|
|
interrupt.
|
|
|
|
This suggests a minimum of four task priorities:
|
|
\begin{enumerate}
|
|
\item Interrupt handlers, executed with their interrupts disabled
|
|
\item Device drivers, executed with interrupts re-enabled
|
|
\item User tasks
|
|
\item The idle task, executed when nothing else is able to execute
|
|
\end{enumerate}
|
|
|
|
For our purposes here, we'll just assume that a pointer to the current
|
|
task is maintained in {\tt R12}, that a {\tt JSR scheduler} is
|
|
called, and that the next current task is likewise placed into
|
|
{\tt R12}.
|
|
|
|
\item Restore the new task's context. Given that the scheduler has returned a
	task that can be run at this time, the stack pointer needs to be
	pulled out of the task's task structure, placed into the user
	register, and then the rest of the user registers need to be popped
	back off of the stack to run this task. An example of this is
	shown in Tbl.~\ref{tbl:context-in},
|
|
\begin{table}\begin{center}
|
|
\begin{tabular}{ll}
|
|
{\tt swap\_in:} \\
|
|
& {\tt LOD stack(R12),R1} \\
|
|
& {\tt MOV 15(R1),uSP} \\
|
|
& {\tt LOD 15(R1),R0} \\
|
|
& {\tt MOV R0,uPC} \\
|
|
& {\tt LOD 14(R1),R0} \\
|
|
& {\tt MOV R0,uCC} \\
|
|
& {\tt LOD 13(R1),R0} \\
|
|
& {\tt MOV R0,uR12} \\
|
|
& \ldots {\em ; Need to repeat for all user registers} \\
|
|
& {\tt LOD 1(R1),R0} \\
|
|
& {\tt MOV R0,uR0} \\
|
|
& {\tt BRA return\_to\_user} \\
|
|
\end{tabular}
|
|
\caption{Example Restoring User Task Context}\label{tbl:context-in}
|
|
\end{center}\end{table}
|
|
assuming as before that the task
|
|
pointer is found in supervisor register {\tt R12}.
|
|
As with storing the user context, the full code associated with
|
|
restoring the user context can be found in the assembler include
|
|
file, {\tt sys.i}.
|
|
|
|
\item Clear the userspace accounting registers. In order to keep track of
	per process system usage, these registers need to be cleared before
	reactivating the userspace process. That way, upon the next
	interrupt, we'll know how many clocks the userspace program has
	consumed, and how many instructions it was able to issue in
	that many clocks.
|
|
|
|
\item Jump back to the instruction just before saving the last task's context,
|
|
because that location in memory contains the return from interrupt
|
|
command that we are going to need to execute, in order to guarantee
|
|
that we return back here again.
|
|
\end{enumerate}
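
The save and restore steps above can be checked with a small host-side model.
The following is only a sketch in Python, not part of the Zip CPU tool chain:
memory is a dictionary of word addresses, the user registers are a dictionary,
and the names {\tt Task}, {\tt swap\_out} and {\tt swap\_in} are illustrative.
The frame layout (slots 1 through 13 for {\tt uR0} through {\tt uR12}, slot 14
for {\tt uCC}, slot 15 for {\tt uPC}) is taken from the restore table, and the
model simply assumes that the fifteen words ending at {\tt uSP} are free for
the frame, as the example code does.

```python
# Toy model of the swap_out / swap_in example tables.
# Assumption: memory is word addressed and modelled as a dict;
# Task, swap_out and swap_in are hypothetical names for illustration.

GENERAL = [f"uR{i}" for i in range(13)]  # uR0 .. uR12


class Task:
    """Stand-in for the task structure; `stack` is the saved frame base."""

    def __init__(self):
        self.stack = None


def swap_out(user, mem, task):
    """Store the user context in the 15 words ending at uSP."""
    base = user["uSP"] - 15            # MOV -15(uSP),R1
    task.stack = base                  # STO R1,stack(R12)
    mem[base + 15] = user["uPC"]       # STO R0,15(R1)
    mem[base + 14] = user["uCC"]       # STO R0,14(R1)
    # uSP itself is not stored: it is recoverable from task.stack.
    for slot, name in enumerate(GENERAL, start=1):
        mem[base + slot] = user[name]  # uR0 -> 1(R1) ... uR12 -> 13(R1)


def swap_in(mem, task):
    """Rebuild the user context from the frame recorded in the task."""
    base = task.stack                  # LOD stack(R12),R1
    user = {"uSP": base + 15}          # MOV 15(R1),uSP
    user["uPC"] = mem[base + 15]
    user["uCC"] = mem[base + 14]
    for slot, name in enumerate(GENERAL, start=1):
        user[name] = mem[base + slot]
    return user
```

A round trip through {\tt swap\_out} and {\tt swap\_in} should return exactly
the register values that were saved, which is the property the assembly
versions must also satisfy.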
|
|
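
The register arithmetic of the {\tt pre\_handler} example can be followed
one instruction at a time in the same way. The sketch below, again in Python,
reproduces only the arithmetic. It assumes, based on the comments in the
example, that the low half of the controller word holds the active interrupt
flags and the upper half the matching enable bits, and that {\tt LDILO}
replaces the low 16~bits of a register while leaving the upper 16 alone. The
function names are hypothetical.

```python
# Model of the pre_handler register arithmetic on 32-bit values.
# Assumptions: low 16 bits = active flags, upper 16 bits = enables,
# and LDILO overwrites only the low half of its register.

MASK32 = 0xFFFFFFFF


def rol(value, count):
    """Rotate a 32-bit value left by count bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def ldilo(imm16, reg):
    """Load a 16-bit immediate into the low half, keeping the high half."""
    return (reg & 0xFFFF0000) | (imm16 & 0xFFFF)


def pre_handler_word(pic_state):
    """Return (word written back to the controller, saved state in R7)."""
    r7 = pic_state & MASK32      # LOD (R0),R1 ; MOV R1,R7
    r1 = r7 & 0x07FFF            # AND 0x07fff,R1
    r1 = rol(r1, 16)             # ROL 16,R1
    r1 = ldilo(0xFFFF, r1)       # LDILO 0x0ffff,R1
    r1 &= r7                     # AND R7,R1
    return r1, r7


# Interrupts 0 and 1 enabled, interrupt 0 active, top bit set:
word, saved = pre_handler_word(0x80030001)
print(hex(word))  # 0x10001
```

With this input the word written back names interrupt 0 in both halves and
leaves interrupt 1, which has not tripped, untouched.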
|
\chapter{Registers}\label{chap:regs}

The ZipSystem registers fall into two categories, ZipSystem internal registers
accessed via the ZipCPU shown in Tbl.~\ref{tbl:zpregs},
\begin{table}[htbp]
\begin{center}\begin{reglist}
PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline
WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline
 & \scalebox{0.8}{\tt 0xc0000002} & 32 & R/W & {\em (Reserved for future use)} \\\hline
CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline
TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline
TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline
TMRC & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline
JIFF & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline
Line 1354... |
Line 1582... |
MICNT & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline
UTASK & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline
UMSTL & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline
UPSTL & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline
UICNT & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline
DMACTRL & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline
DMALEN & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline
DMASRC & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline
DMADST & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline
% Cache & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline
\end{reglist}
\caption{Zip System Internal/Peripheral Registers}\label{tbl:zpregs}
\end{center}\end{table}
and the two debug registers shown in Tbl.~\ref{tbl:dbgregs}.
|
Line 1489... |
Line 1721... |
the fact. Second, whenever activating a user task, the Operating System will
set the four user counters to zero. When the user task has completed, the
Operating System will read the timers back off, to determine how much of the
CPU the process had consumed.
|
|
|
|
The final peripheral to discuss is the DMA controller. This controller
|
|
has four registers. Of these four, the length, source and destination address
|
|
registers should need no further explanation. They are full 32--bit registers
|
|
specifying the entire transfer length, the starting address to read from, and
|
|
the starting address to write to. The registers can be written to when the
|
|
DMA is idle, and read at any time. The control register, however, will need
|
|
some more explanation.
|
|
|
|
The bit allocation of the control register is shown in Tbl.~\ref{tbl:dmacbits}.
|
|
\begin{table}\begin{center}
|
|
\begin{bitlist}
|
|
31 & R & DMA Active\\\hline
|
|
30 & R & Wishbone error, transaction aborted (cleared on any write)\\\hline
|
|
29 & R/W & Set to '1' to prevent the controller from incrementing the source address, '0' for normal memory copy. \\\hline
|
|
28 & R/W & Set to '1' to prevent the controller from incrementing the
	destination address, '0' for normal memory copy. \\\hline
27 \ldots 16 & W & The DMA Key. Write a 12'hfed to these bits to start
	any DMA transfer. \\\hline
27 & R & Always reads '0', to force the deliberate writing of the key. \\\hline
26 \ldots 16 & R & Indicates the number of items in the transfer buffer that
	have yet to be written. \\\hline
15 & R/W & Set to '1' to trigger on an interrupt, or '0' to start immediately
	upon receiving a valid key.\\\hline
14\ldots 10 & R/W & Select one of 32~possible interrupt lines.\\\hline
9\ldots 0 & R/W & Intermediate transfer length minus one. Thus, to transfer
	one item at a time set this value to 0. To transfer 1024 at a time,
	set it to 1023.\\\hline
|
|
\end{bitlist}
|
|
\caption{DMA Control Register Bits}\label{tbl:dmacbits}
|
|
\end{center}\end{table}
|
|
This control register has been designed so that the common case of memory
|
|
access need only set the key and the transfer length. Hence, writing a
|
|
\hbox{32'h0fed03ff} to the control register will start any memory transfer.
|
|
On the other hand, if you wished to read from a serial port (constant address)
|
|
and put the result into a buffer every time a word was available, you
|
|
might wish to write \hbox{32'h2fed8000}--this assumes, of course, that you
|
|
have a serial port wired to the zero bit of this interrupt control. (The
|
|
DMA controller does not use the interrupt controller, and cannot clear
|
|
interrupts.) As a third example, if you wished to write to an external
|
|
FIFO anytime it was less than half full (had fewer than 512 items), and
|
|
interrupt line 2 indicated this condition, you might wish to issue a
|
|
\hbox{32'h1fed89ff} to this port.
|
|
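
The control words quoted above can be assembled field by field from the bit
allocation in Tbl.~\ref{tbl:dmacbits}. The following Python sketch does just
that. It is an illustration only: the function name {\tt dma\_ctrl} and its
parameter names are hypothetical, and it assumes interrupt lines are numbered
from zero, as the serial port example suggests.

```python
# Build a DMA control word from the fields of the control register.
# Assumptions: dma_ctrl and its parameter names are illustrative only,
# and interrupt lines are numbered starting from zero.

DMA_KEY = 0xFED  # 12-bit key, written to bits 27..16


def dma_ctrl(chunk_len, fixed_src=False, fixed_dst=False,
             on_interrupt=False, int_line=0):
    """Return the 32-bit word that starts a transfer.

    chunk_len is the intermediate transfer length in words (1..1024);
    the register field holds that length minus one.
    """
    if not 1 <= chunk_len <= 1024:
        raise ValueError("chunk_len must be between 1 and 1024")
    if not 0 <= int_line < 32:
        raise ValueError("int_line must be between 0 and 31")
    word = DMA_KEY << 16
    word |= int(fixed_src) << 29     # do not increment the source address
    word |= int(fixed_dst) << 28     # do not increment the destination
    word |= int(on_interrupt) << 15  # wait for the selected interrupt
    word |= int_line << 10           # which interrupt line to wait on
    word |= chunk_len - 1            # intermediate length minus one
    return word


# Plain memory copy, 1024 words at a time:
print(hex(dma_ctrl(1024)))  # 0xfed03ff
# Fixed source (e.g. a serial port), one word per interrupt on line 0:
print(hex(dma_ctrl(1, fixed_src=True, on_interrupt=True)))  # 0x2fed8000
```

Building the word from named fields rather than a literal constant makes it
harder to set the wrong interrupt line or length by accident.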
|
\section{Debug Port Registers}
Accessing the Zip System via the debug port isn't as straightforward as
accessing the system via the wishbone bus. The debug port itself has been
reduced to two addresses, as outlined earlier in Tbl.~\ref{tbl:dbgregs}.
Access to the Zip System begins with the Debug Control register, shown in
|
Line 1503... |
Line 1778... |
13 & R & CPU GIE setting\\\hline
12 & R & CPU is sleeping\\\hline
11 & W & Command clear PF cache\\\hline
10 & R/W & Command HALT, Set to '1' to halt the CPU\\\hline
9 & R & Stall Status, '1' if CPU is busy\\\hline
8 & R/W & Step Command, set to '1' to step the CPU, also sets the halt bit\\\hline
7 & R & Interrupt Request \\\hline
6 & R/W & Command RESET \\\hline
5\ldots 0 & R/W & Debug Register Address \\\hline
|
\end{bitlist}
|
\end{bitlist}
|
\caption{Debug Control Register Bits}\label{tbl:dbgctrl}
|
Line 1555... |
Line 1830... |
UICNT & 47 & 32 & R/W & User instruction counter\\\hline
|
\end{reglist}
|
\end{reglist}
|
\caption{Debug Register Addresses}\label{tbl:dbgaddrs}
|
\end{center}\end{table}
|
\end{center}\end{table}
|
Primarily, these ``registers'' include access to the entire CPU register
set, as well as the internal peripherals. To read one of these registers
once the address is set, simply issue a read from the data port. To write
one of these registers or peripheral ports, simply write to the data port
after setting the proper address.

|
In this manner, all of the CPU's internal state may be read and adjusted.
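
As an illustration of the address-then-data sequence, the Python sketch below
builds the word a debugger might write to the control register to select a
debug register address, and decodes the status bits listed in
Tbl.~\ref{tbl:dbgctrl}. It covers only the bits shown in that table. The
helper names are hypothetical, and holding the halt bit set while selecting
an address is an assumption of this sketch rather than a stated requirement.

```python
# Helpers for the debug control register, using only the bits listed in
# the control register table.  Names are illustrative; keeping HALT set
# while selecting a register is an assumption of this sketch.

GIE = 1 << 13        # read only: CPU GIE setting
SLEEPING = 1 << 12   # read only: CPU is sleeping
HALT = 1 << 10       # read/write: halt command
STALLED = 1 << 9     # read only: CPU is busy
STEP = 1 << 8        # read/write: step command
INTERRUPT = 1 << 7   # read only: interrupt request
RESET = 1 << 6       # read/write: reset command
ADDR_MASK = 0x3F     # bits 5..0: debug register address


def select_register(addr, halt=True):
    """Control word selecting a debug register address (0..63)."""
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("debug register address must fit in 6 bits")
    return (HALT if halt else 0) | addr


def decode_status(word):
    """Split a value read from the control register into named fields."""
    return {
        "gie": bool(word & GIE),
        "sleeping": bool(word & SLEEPING),
        "halted": bool(word & HALT),
        "stalled": bool(word & STALLED),
        "interrupt": bool(word & INTERRUPT),
        "address": word & ADDR_MASK,
    }


# Select the user instruction counter (debug address 47) with the CPU halted:
print(hex(select_register(47)))  # 0x42f
```

After writing such a word to the control port, a read of the data port would
return the selected register, provided the stall bit reads as zero.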
|
Line 1644... |
Line 1919... |
|
|
\chapter{Clocks}\label{chap:clocks}

This core is based upon the Basys--3 development board sold by Digilent.
The Basys--3 development board contains one external 100~MHz clock, which is
sufficient to run the Zip CPU core.
|
\begin{table}[htbp]
|
\begin{table}[htbp]
|
\begin{center}
|
\begin{center}
|
\begin{clocklist}
i\_clk & External & 100~MHz & 100~MHz & System clock.\\\hline
\end{clocklist}
|
Line 1710... |
Line 1985... |
only supports one such interrupt line by default. For us, this line is the
output of another interrupt controller, but that's a board specific setup
detail. Finally, the Zip System produces one external interrupt whenever
the CPU halts to wait for the debugger.
|
|
|
|
\chapter{Initial Assessment}\label{chap:assessment}
|
|
|
|
Having now worked with the Zip CPU for a while, it is worth offering an
|
|
honest assessment of how well it works and how well it was designed. At the
|
|
end of this assessment, I will propose some changes that may take place in a
|
|
later version of this Zip CPU to make it better.
|
|
|
|
\section{The Good}
|
|
\begin{itemize}
|
|
\item The Zip CPU is lightweight and fully featured as it exists today. For
	anyone who wishes to build a general purpose CPU and then to
	experiment with building and adding particular features, the Zip CPU
	makes a good starting point--it is fairly simple, so modifications
	should be straightforward.
|
|
\item As an estimate of the ``weight'' of this implementation, the CPU has
|
|
cost me less than 150 hours to implement from its inception.
|
|
\item The Zip CPU was designed to be an implementable soft core that could be
|
|
placed within an FPGA, controlling actions internal to the FPGA. It
|
|
fits this role rather nicely. It does not fit the role of a system on
|
|
a chip very well, but then it was never intended to be a system on a
|
|
chip but rather a system within a chip.
|
|
\item The extremely simplified instruction set of the Zip CPU was a good
|
|
choice. Although it does not have many of the commonly used
|
|
instructions, PUSH, POP, JSR, and RET among them, the simplified
|
|
	instruction set has demonstrated an amazing versatility. I will
	therefore contend, to anyone who will listen, that this instruction set
	offers a full and complete capability for whatever a user might wish
	to do, with two exceptions: bytewise character access and accelerated
	floating-point support.
|
|
\item This simplified instruction set is easy to decode.
|
|
\item The simplified bus transactions (32-bit words only) were also very easy
|
|
to implement.
|
|
\item The novel approach of having a single interrupt vector, which just
	brings the CPU back to the instruction it left off at within the last
	interrupt context, doesn't appear to have been that much of a problem.
|
|
If most modern systems handle interrupt vectoring in software anyway,
|
|
why maintain hardware support for it?
|
|
\item My goal of a high rate of instructions per clock may not be the proper
|
|
measure. For example, if instructions are being read from a SPI flash
|
|
device, such as is common among FPGA implementations, these same
|
|
instructions may suffer stalls of between 64 and 128 cycles per
|
|
instruction just to read the instruction from the flash. Executing the
|
|
instruction in a single clock cycle is no longer the appropriate
|
|
	measure. At the same time, it should be possible to use the DMA
	peripheral to copy instructions from the flash to a temporary memory
	location, after which each instruction fetch may again complete in a
	single cycle.
|
|
\end{itemize}
|
|
|
|
\section{The Not so Good}
|
|
\begin{itemize}
|
|
\item While one of the stated goals was to use a small amount of logic,
	3k~LUTs isn't that impressively small. Indeed, it's really much
	too expensive when compared against other 8 and 16-bit CPUs that use
	fewer than 1k~LUTs.
|
|
|
|
Still, \ldots it's not bad, it's just not astonishingly good.
|
|
|
|
\item The fact that the instruction width equals the bus width means that the
|
|
instruction fetch cycle will always be interfering with any load or
|
|
store memory operation, with the only exception being if the
|
|
instruction is already in the cache. {\em This has become the
|
|
fundamental limit on the speed and performance of the CPU!}
|
|
	Those familiar with the von~Neumann approach of sharing a bus
	between data and instructions will not be surprised by this assessment.
|
|
|
|
This could be fixed in one of three ways: the instruction set
|
|
architecture could be modified to handle Very Long Instruction Words
|
|
(VLIW) so that each 32--bit word would encode two or more instructions,
|
|
the instruction fetch bus width could be increased from 32--bits to
|
|
64--bits or more, or the instruction bus could be separated from the
|
|
data bus. Any and all of these approaches would increase the overall
|
|
LUT count.
|
|
|
|
\item The (non-existent) floating point unit was an after-thought, isn't even
	built as a potential option, and most likely won't support the full
	IEEE standard set of FPU instructions--even for single precision.
	This (non-existent) capability would benefit the most from an
	out-of-order execution capability, which the Zip CPU does not have.
|
|
|
|
Still, sharing FPU registers with the main register set was a good
|
|
idea and worth preserving, as it simplifies context swapping.
|
|
|
|
Perhaps this really isn't a problem, but rather a feature. By not
|
|
implementing FPU instructions, the Zip CPU maintains a lower LUT count
|
|
than it would have if it did implement these instructions.
|
|
|
|
\item The CPU has no character support. This is both good and bad.
|
|
Realistically, the CPU works just fine without it. Characters can be
|
|
supported as subsets of 32-bit words without any problem. Practically,
|
|
though, it will make compiling non-Zip CPU code difficult--especially
|
|
anything that assumes sizeof(int)=4*sizeof(char), or that tries to
|
|
create unions with characters and integers and then attempts to
|
|
reference the address of the characters within that union.
|
|
|
|
\item The Zip CPU does not support a data cache. One can still be built
|
|
externally, but this is a limitation of the CPU proper as built.
|
|
Further, under the theory of the Zip CPU design (that of an embedded
|
|
soft-core processor within an FPGA, where any ``address'' may reference
|
|
either memory or a peripheral that may have side-effects), any data
|
|
cache would need to be based upon an initial knowledge of whether or
|
|
not it is supporting memory (cachable) or peripherals. This knowledge
|
|
must exist somewhere, and that somewhere is currently (and by design)
|
|
external to the CPU.
|
|
|
|
This may also be written off as a ``feature'' of the Zip CPU, since
|
|
the addition of a data cache can greatly increase the LUT count of
|
|
a soft core.
|
|
|
|
\item Many other instruction sets offer three operand instructions, whereas
|
|
the Zip CPU only offers two operand instructions. This means that it
|
|
takes the Zip CPU more instructions to do many of the same operations.
|
|
The good part of this is that it gives the Zip CPU a greater amount of
|
|
flexibility in its immediate operand mode, although that increased
|
|
flexibility isn't necessarily as valuable as one might like.
|
|
|
|
\item The Zip CPU does not currently detect and trap on either illegal
|
|
instructions or bus errors. Attempts to access non--existent
|
|
memory quietly return erroneous results, rather than halting the
|
|
process (user mode) or halting or resetting the CPU (supervisor mode).
|
|
|
|
\item The Zip CPU doesn't support out of order execution. I suppose it could
|
|
be modified to do so, but then it would no longer be the ``simple''
|
|
	and low LUT count CPU it was designed to be. The three primary results
|
|
are that 1) loads may unnecessarily stall the CPU, even if other
|
|
things could be done while waiting for the load to complete, 2)
|
|
bus errors on stores will never be caught at the point of the error,
|
|
and 3) branch prediction becomes more difficult.
|
|
|
|
\item Although switching to an interrupt context in the Zip CPU design doesn't
|
|
require a tremendous swapping of registers, in reality it still
|
|
does--since any task swap still requires saving and restoring all
|
|
16~user registers. That's a lot of memory movement just to service
|
|
an interrupt.
|
|
|
|
\item The Zip CPU is by no means generic: it will never handle addresses
|
|
larger than 32-bits (16GB) without a complete and total redesign.
|
|
This may limit its utility as a generic CPU in the future, although
|
|
as an embedded CPU within an FPGA this isn't really much of a limit
|
|
or restriction.
|
|
|
|
\item While the Zip CPU has its own assembler, it has no linker and does not
|
|
(yet) support a compiler. The standard C library is an even longer
|
|
shot. My dream of having binutils and gcc support has not been
|
|
	realized and at this rate may not be realized. (I've been intimidated
	by the challenge every time I've looked through that code.)
|
|
|
|
\item While the Wishbone Bus (B4) supports a pipelined mode with single cycle
|
|
execution, the Zip CPU is unable to exploit this parallelism. Instead,
|
|
apart from the DMA and the pipelined prefetch, all loads and stores
|
|
are single wishbone bus operations requiring a minimum of 3 clocks.
|
|
	(In practice, this has turned into 7~clocks.)
|
|
|
|
\iffalse
|
|
\item There is no control over whether or not an instruction sets the
|
|
condition codes--certain instructions always set the condition codes,
|
|
other instructions never set them. This effectively limits conditional
|
|
instructions to a single instruction only (with two or more
|
|
instructions as an exception), as the first instruction that sets
|
|
condition codes will break the condition code chain.
|
|
|
|
	{\em (A proposed change below addresses this.)}
|
|
|
|
\item Using the CC register as a trap address was a bad idea--it limits the CC
	register's ability to be used in future expansion, such as by adding
	exception indication flags: bus error, floating point exception, etc.
|
|
|
|
{\em (This can be changed by a different O/S implementation of the trap
|
|
instruction.)}
|
|
\item The current implementation suffers from too many stalls on any
|
|
branch--even if the branch is known early on.
|
|
|
|
{\em (This is addressed in proposals below.)}
|
|
% Addressed, 20150918
|
|
|
|
\item In a similar fashion, a switch to interrupt context forces the pipeline
|
|
to be cleared, whereas it might make more sense to just continue
|
|
executing the instructions already in the pipeline while the prefetch
|
|
stage is working on switching to the interrupt context.
|
|
|
|
{\em (Also addressed in proposals below.)}
|
|
% This should happen so rarely that it is not really a problem
|
|
\fi
|
|
|
|
\end{itemize}
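
To make the remark about character support concrete, the following Python
sketch shows one way that characters can be carried as subsets of 32-bit
words using only shifts and masks, which is all the CPU itself would have
available. The byte order within a word (first character in the most
significant byte) is an assumption made for the example, and the function
names are hypothetical.

```python
# Characters stored four to a 32-bit word, accessed by shift and mask.
# Assumption: the first character of each group sits in the most
# significant byte; the document does not fix a byte order.


def pack_chars(text):
    """Pack an ASCII string into a list of 32-bit words, zero padded."""
    data = text.encode("ascii")
    data += b"\0" * (-len(data) % 4)   # pad to a whole number of words
    return [int.from_bytes(data[i:i + 4], "big")
            for i in range(0, len(data), 4)]


def get_char(words, index):
    """Fetch character number `index` using word loads, shifts and masks."""
    word = words[index // 4]           # one 32-bit load
    shift = 8 * (3 - index % 4)        # position of the byte in the word
    return (word >> shift) & 0xFF


words = pack_chars("Zip!")
print(hex(words[0]))             # 0x5a697021
print(chr(get_char(words, 1)))   # i
```

Each character access costs a load plus a shift and a mask, which is the
price of having no byte-wide bus transactions.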
|
|
|
|
\section{The Next Generation}
|
|
This section could also be labeled as my ``To do'' list.
|
|
|
|
Given the feedback listed above, perhaps it's time to consider what changes could be made to improve the Zip CPU in the future. I offer the following as proposals:
|
|
|
|
\begin{itemize}
|
|
\item {\bf Remove the low LUT goal.} It wasn't really achieved, and the
|
|
proposals below will only increase the amount of logic the Zip CPU
|
|
requires. While I expect that the Zip CPU will always be somewhat
|
|
of a light weight, it will never be the smallest kid on the block.
|
|
|
|
I'm actually struggling with this idea. The whole goal of the Zip
|
|
CPU was to be light weight. Wouldn't it make more sense to create and
|
|
maintain options whereby it would remain lightweight? For example, if
|
|
the process accounting registers are anything but light weight, why
|
|
keep them? Why not instead make some compile flags that just turn them
|
|
off, keeping the CPU lightweight? The same holds for the prefetch
|
|
cache.
|
|
|
|
\iffalse
|
|
\item {\bf Adjust the Zip CPU so that conditional instructions do not set
|
|
flags}, although they may explicitly set condition codes if writing
|
|
to the CC register.
|
|
|
|
This is a simple change to the core, and may show up in new releases.
|
|
% Fixed, 20150918
|
|
\fi
|
|
|
|
\item The `{\tt .V}' condition was never used in any code other than my test
|
|
code. Suggest changing it to a `{\tt .LE}' condition, which seems
|
|
to be more useful.
|
|
|
|
\iffalse
|
|
\item Add in an {\bf unpredictable branch delay slot}, so that on any branch
|
|
the delay slot may or may not be executed before the branch.
|
|
Instructions that do not depend upon the branch, and that should be
|
|
executed were the branch not taken, could be placed into the delay
|
|
slot. Thus, if the branch isn't taken, we wouldn't suffer the stall,
|
|
whereas it wouldn't affect the timing of the branch if taken. It would
|
|
just do something irrelevant.
|
|
|
|
% Changes made, 20150918, make this option no longer relevant
|
|
|
|
\item {\bf Re-engineer Branch Processing.} There's no reason why a {\tt BRA}
|
|
instruction should create five stall cycles. The decode stage, plus
|
|
the prefetch engine, should be able to drop this number of stalls via
|
|
better branch handling.
|
|
|
|
Indeed, this could turn into a simple means of branch prediction:
|
|
if {\tt BRA} suffered a single stall only, whereas {\tt BRA.C}
|
|
suffered five stalls, then {\tt BRA.!C} followed by {\tt BRA} would
|
|
be faster than a {\tt BRA.C} instruction. This would then allow a
|
|
compiler to do explicit branch optimizations.
|
|
|
|
Of course, longer branches using {\tt ADD X,PC} would still not be
|
|
optimized.
|
|
|
|
% DONE: 20150918 -- to include the ADD X,PC instructions
|
|
|
|
\item {\bf Request bus access for Load/Store two cycles earlier.} The problem
|
|
here is the contention for the bus between the memory unit and the
|
|
prefetch unit. Currently, the memory unit must ask the prefetch
|
|
unit to release the bus if it is in the middle of a bus cycle. At this
|
|
point, the prefetch drops the {\tt STB} line on the next clock and must
|
|
then wait for the last {\tt ACK} before releasing the bus. If the
|
|
request takes one clock, dropping the strobe line the next, waiting
|
|
for an acknowledgement takes another, and then the bus must be idle
|
|
for one cycle before starting again, these extra cycles add up.
|
|
It should be possible to tell the prefetch stage to give up the bus
|
|
as soon as the decoder knows the instruction will need the bus.
|
|
Indeed, if done in the decode stage, this might drop the seven cycle
|
|
access down by two cycles.
|
|
|
|
% FIXED: 20150918
|
|
\fi
|
|
|
|
\item {\bf Consider a more traditional Instruction Cache.} The current
|
|
pipelined instruction cache just reads a window of memory into
|
|
its cache. If the CPU leaves that window, the entire cache is
|
|
invalidated. A more traditional cache, however, might allow
|
|
common subroutines to stay within the cache without invalidating the
|
|
entire cache structure.
|
|
|
|
\iffalse
|
|
\item {\bf Very Long Instruction Word (VLIW).} Now, to speed up operation, I
|
|
propose that the Zip CPU instruction set be modified towards a Very
|
|
Long Instruction Word (VLIW) implementation. In this implementation,
|
|
an instruction word may contain either one or two separate
|
|
instructions. The first instruction would take up the high order bits,
|
|
the second optional instruction the lower 16-bits. Further, I propose
|
|
that any of the ALU instructions (SUB through LSR) automatically have
|
|
a second instruction whenever their operand `B' is a register, and use
|
|
the full 20-bit immediate if not. This will effectively eliminate the
|
|
register plus immediate mode for all of these instructions.
|
|
|
|
This is the minimal required change to increase the number of
|
|
instructions per clock cycle. Other changes would need to take place
|
|
as well to support this. These include:
|
|
\begin{itemize}
|
|
\item Instruction words containing two instructions would take two
|
|
clocks to complete, while requiring only a single cycle
|
|
instruction fetch.
|
|
	\item Instructions preceded by a label in the assembler must always
|
|
start in the high order word.
|
|
\item VLIW's, once started, must always execute to completion. The
|
|
upper word may set the PC, the lower word may not. Regardless
|
|
of whether the upper word sets the PC, the lower word must
|
|
still be guaranteed to complete before the PC changes. On any
|
|
switch to (or from) interrupt context, both instructions must
|
|
complete or none of the instructions in the word shall
|
|
complete prior to the switch.
|
|
\item STEP commands and BREAK instructions will only take place after
|
|
the entire word is executed.
|
|
\end{itemize}
|
|
|
|
	If done well, the assembler should be able to handle these changes
	with the biggest impacts to the user being increased performance and
	a loss of the register plus immediate ALU modes. (These weren't really
	relevant for the XOR, OR, AND, etc. operations anyway.) Machine code
	compatibility will not be maintained.

	A proposed secondary instruction set might consist of: a 4-bit
	opcode (any of the prior instructions would be supported, with some
	exceptions such as moves to and from user registers while in
	supervisor mode not being supported), a 4-bit register result (PC not
	allowed), a 3-bit conditional (identical to the conditional for the
	upper word), a single bit indicating whether an immediate is present,
	followed by either a 4-bit register or a 4-bit signed
	immediate. The multiply instruction would steal the immediate flag to
	be used as a sign indication, forcing both operands to be registers
	without any immediate offsets.
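
	As a quick check that the fields just listed fit, their widths can be
	summed. This is only a sketch: the labels under each term are names
	chosen here for illustration, not part of any defined encoding.
	\[
	\underbrace{4}_{\mbox{operation}}
	+ \underbrace{4}_{\mbox{result reg.}}
	+ \underbrace{3}_{\mbox{condition}}
	+ \underbrace{1}_{\mbox{imm. flag}}
	+ \underbrace{4}_{\mbox{reg. or imm.}}
	= 16 \mbox{ bits}
	\]
	The fields would therefore exactly fill the lower 16~bits of the word
	with none to spare, which is consistent with the multiply instruction
	having to reuse the immediate flag as its sign indication.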

	{\em Initial conversion of several library functions to this secondary
	instruction set has demonstrated little to no gain. The problem was
	that the new instruction set was made by joining a rarely used
	instruction (ALU with register and not immediate) with a more common
	instruction. The utility was then limited by the utility of the rare
	instruction, which limited the impact of the entire approach. }
\else
\item {\bf Very Long Instruction Word (VLIW).} The goal here would be to
	create a new instruction set in which two instructions are encoded
	in each 32--bit word. While this may speed up
	CPU operation, it would necessitate an instruction set redesign.
\fi

\end{itemize}


% Appendices
% Index
\end{document}
|
|
|
|