OpenCores
URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

Subversion Repositories zipcpu

Compare Revisions

  • This comparison shows the changes necessary to convert path
    /zipcpu/trunk/doc/gfx
    from Rev 6 to Rev 5

Rev 6 → Rev 5

/iset.html
0,0 → 1,689
<HTML><HEAD><TITLE>Zip CPU Instruction Set</TITLE></HEAD><BODY>
<H1 align=center>Zip CPU Goals</H1>
<P>The original goal of the Zip CPU was to be a simple CPU. For this reason,
all instructions have been designed to be as simple as possible, and
are all designed to be executed in one instruction cycle per
instruction, barring pipeline stalls. This has resulted in the choice
to drop push and pop instructions, pre-increment and post-decrement
addressing modes, and more.
<P>For those who like buzz words, the Zip CPU is:
<UL>
<LI>A 32-bit CPU: All registers are 32 bits, addresses are 32 bits,
instructions are 32 bits wide, etc.
<LI>A RISC CPU. There is no microcode for executing instructions.
<LI>A Load/Store architecture. (Only load and store instructions
can access memory.)
<LI>Wishbone compliant. All peripherals are accessed just like
memory across this bus.
<LI>A Von-Neumann architecture. (The instructions and data share a
common bus.)
<LI>A pipelined architecture, having stages for <B>Prefetch</B>,
<B>Decode</B>, <B>Read-Operand</B>, the <B>ALU/Memory</B>
unit, and <B>Write-back</B>
</UL>
 
<P>Now, however, that I've worked on the Zip CPU for a while, it is not nearly
as simple as I originally hoped. Worse, I've had to adjust to create
capabilities that I was never expecting to need. These include:
<UL>
<LI><B>External Debug:</B> Once placed upon an FPGA, I'm going to need
a means of debugging this CPU. That means that there needs to be an
external register that can control the CPU: reset it, halt it, step
it, and tell
whether it is running or not. Another register is placed similar to
this register, to allow the external controller to examine registers
internal to the CPU.</P>
<LI><B>Internal Debug:</B> Being able to run a debugger from within
a user process requires an ability to step a user process from
within a debugger. It also requires a break instruction that can
be substituted for any other instruction, and substituted back.
The break is actually difficult: the break instruction cannot be
allowed to execute. That way, upon a break, the debugger should
be able to jump back into the user process to step the instruction
that would've been at the break point initially, and then to
replace the break after passing it.</P>
 
<LI><B>Prefetch Cache:</B> My original implementation had a very
simple prefetch stage. Any time the PC changed the prefetch would go
and fetch the new instruction. While this was perhaps the simplest
approach, it cost roughly five clocks for every instruction. This
was deemed unacceptable, as I wanted a CPU that could execute
instructions in one cycle. I therefore have a prefetch cache that
issues pipelined wishbone accesses to memory and then pushes
instructions at the CPU. Sadly, this accounts for about 20% of the
logic in the entire CPU, or 15% of the logic in the entire system.
</P>
 
<LI><B>Operating System:</B> In order to support an operating system,
interrupts and so forth, the CPU needs to support supervisor and
user modes, as well as a means of switching between them. For example,
the user needs a means of executing a system call. This is the
purpose of the <B>'trap'</B> instruction. This instruction needs to
place the CPU into supervisor mode (here equivalent to disabling
interrupts), as well as handing it a parameter such as identifying
which O/S function was called.
 
<P>My initial approach to building a trap instruction was to create
an external peripheral which, when written to, would generate an
interrupt and could return the last value written to it. This failed
timing requirements, however: the CPU executed two instructions while
waiting for the trap interrupt to take place. Since then, I've
decided to keep the rest of the CC register for that purpose so that a
write to the CC register, with the GIE bit cleared, could be used to
execute a trap.
 
<P>Modern timesharing systems also depend upon a <B>Timer</B> interrupt
to handle task swapping. For the Zip CPU, this interrupt is handled
external to the CPU as part of the CPU System, found in
<tt>zipsystem.v</tt>. The timer module itself is found in
<tt>ziptimer.v</tt>.
 
<LI><B>Pipeline Stalls:</B> My original plan was to not support pipeline
stalls at all, but rather to require the compiler to properly schedule
instructions so that stalls would never be necessary. After trying
to build such an architecture, I gave up, having learned some things:
 
<P>For example, in order to facilitate interrupt handling and debug
stepping, the CPU needs to know what instructions have finished, and
which have not. In other words, it needs to know where it can restart
the pipeline from. Once restarted, it must act as though it had
never stopped. This killed my idea of delayed branching, since
what would be the appropriate program counter to restart at?
The one the CPU was going to branch to, or the ones in the
delay slots?
 
<P>So I switched to a model of discrete execution: Once an instruction
enters into either the ALU or memory unit, the instruction is
guaranteed to complete. If the logic recognizes a branch or a
condition that would render the instruction entering into this stage
possibly inappropriate (e.g., a conditional branch preceding a store
instruction), then the pipeline stalls for one cycle
until the conditional branch completes. Then, if it generates a new
PC address, the stages preceding are all wiped clean.
 
<P>The discrete execution model allows such things as sleeping: if the
CPU is put to "sleep", the ALU and memory stages stall and back up
everything before them. Likewise, anything that has entered the ALU
or memory stage when the CPU is placed to sleep continues to completion.
<P>To handle this logic, each pipeline stage has three control signals:
a valid signal, a stall signal, and a clock enable signal. In
general, a stage stalls if its contents are valid and the next step
is stalled. This allows the pipeline to fill any time a later stage
stalls.
 
<LI><B>Verilog Modules:</B> When examining how other processors worked
here on OpenCores, many of them had one separate module per pipeline
stage. While this appeared to me to be a fascinating and commendable
idea, my own implementation didn't work out quite so nicely.
 
<P>As an example, the decode module produces a <em>lot</em> of
control wires and registers. Creating a module out of this, with
only the simplest of logic within it, seemed to be more a lesson
in passing wires around, rather than encapsulating logic.
 
<P>Another example was the register writeback section. I would love
this section to be a module in its own right, and many have made them
such. However, other modules depend upon writeback results other
than just what's placed in the register (i.e., the control wires).
For these reasons, I didn't manage to fit this section into its
own module.
 
<P>The result is that the majority of the CPU code can be found in
the <tt>zipcpu.v</tt> file.
</UL>
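<P>The stall rule described under <B>Pipeline Stalls</B> above can be
sketched in a few lines of Python. This is only an illustration, not the
Verilog found in <tt>zipcpu.v</tt>: the names <tt>propagate_stalls</tt> and
<tt>clock_enables</tt>, the list-of-booleans representation, and the idea
that a clock enable is simply the inverse of a stall are all assumptions.
<PRE>
```python
def propagate_stalls(valid, downstream_stalled):
    """Return one stall flag per stage, ordered first stage to last.

    valid[i] says whether stage i currently holds a valid instruction.
    downstream_stalled says whether whatever follows the last stage
    cannot accept a result this clock (a hypothetical external input).
    """
    stalls = [False] * len(valid)
    next_stalled = downstream_stalled
    # Walk from the last stage back to the first: a stage stalls only if
    # it holds something valid and the stage after it is stalled.
    for i in range(len(valid) - 1, -1, -1):
        stalls[i] = valid[i] and next_stalled
        next_stalled = stalls[i]
    return stalls


def clock_enables(stalls):
    # Assumption: a stage may accept new contents whenever it is not stalled.
    return [not s for s in stalls]
```
</PRE>
<P>With an empty slot in the second of four stages, the stall stops at that
slot, so the earlier stages keep advancing and the pipeline fills while the
later stages wait.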
 
With that introduction out of the way, let's move on to the instruction
set.
 
<H1 align=center>Zip CPU Instruction Set</H1>
The Zip CPU supports a set of two operand instructions, where the first operand
(always a register) is the result. The only exception is the store instruction,
where the first operand (always a register) is the source of the data to be
stored.
<H2>Register Set</H2>
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor
and a user set. The supervisor set is used in interrupt mode, whereas
the user set is used otherwise. Of this register set, the Program Counter (PC)
is register 15, whereas the status register (SR) or condition code register
(CC) is register 14. By convention, the stack pointer will be register 13 and
noted as (SP)--although the instruction set allows it to be anything.
The CPU can access both register sets via move instructions from the
supervisor state, whereas the user state can only access the user registers.
 
<P>The status register is special, and bears further mention. The lower
8 bits of the status register form a set of condition codes. Writes to other
bits are preserved, and can be used as part of the trap architecture--examined
by the O/S upon any interrupt, cleared before returning.
<P>Of the eight condition codes, the bottom four are the current flags:
Zero (Z),
Carry (C),
Negative (N),
and Overflow (V).
 
<P>The next bit is a clock enable (0 to enable) or sleep bit (1 to put
the CPU to sleep). Setting this bit will cause the CPU to
wait for an interrupt (if interrupts are enabled), or to
completely halt (if interrupts are disabled).
<P>The sixth bit is a global interrupt enable bit (GIE). When this
sixth bit is a '1' interrupts will be enabled, else disabled. When
interrupts are disabled, the CPU will be in supervisor mode, otherwise
it is in user mode. Thus, to execute a context switch, one only
need enable or disable interrupts. (When an interrupt line goes
high, interrupts will automatically be disabled, as the CPU goes
and deals with its context switch.)
<P><em>Experimental</em>: The seventh bit is a step bit. This bit can be
set from supervisor mode only. After setting this bit, should
the supervisor mode process switch to user mode, it would then
accomplish one instruction in user mode before returning to supervisor
mode. Then, upon return to supervisor mode, this bit will
be automatically cleared. This bit has no effect on the CPU while in
supervisor mode.
<P>This functionality was added so that a userspace debugger can
step a user process, working through supervisor mode
of course.
<P><em>Experimental</em>: The eighth bit is a break enable bit. This
controls whether a break instruction will halt the processor for an
external debugger (break enabled), or whether the break instruction
will simply set the STEP bit and send the CPU into interrupt mode.
This bit can only be set within supervisor mode.
<P>This functionality was added to enable an external debugger to
set and manage breakpoints.
<P>The status register bits are shown below:
<TABLE border>
<TR><TH>7</TH><TH>6</TH><TH>5</TH><TH>4</TH> <TH>3</TH> <TH>2</TH> <TH>1</TH> <TH>0</TH></TR>
<TR><TD>BREAKEN</TD><TD>STEP</TD><TD>GIE</TD><TD>SLEEP</TD><TD>V</TD><TD>N</TD><TD>C</TD> <TD>Z</TD></TR>
</TABLE>
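<P>As a reading aid, the bit layout in the table above can be decoded as
follows. This is a hedged sketch in Python: the helper names
<tt>decode_cc</tt> and <tt>cpu_mode</tt> are made up for illustration, and
only the bit positions and the GIE/mode relationship come from the text.
<PRE>
```python
# Bit positions taken from the status register table above.
CC_BITS = {
    "Z": 0, "C": 1, "N": 2, "V": 3,
    "SLEEP": 4, "GIE": 5, "STEP": 6, "BREAKEN": 7,
}


def decode_cc(value):
    """Split the low eight bits of a CC value into named booleans."""
    return {name: bool((value >> bit) & 1) for name, bit in CC_BITS.items()}


def cpu_mode(value):
    # Per the text: interrupts enabled means user mode,
    # interrupts disabled means supervisor mode.
    return "user" if decode_cc(value)["GIE"] else "supervisor"
```
</PRE>
<P>For example, a CC value of 0x21 has the Z and GIE bits set, so it
describes a CPU in user mode whose last result was zero.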
<H2>Conditions</H2>
Most, although not quite all, instructions are conditional. From the four
condition code flags, eight conditions are defined. These are:
<TABLE>
<TR><TH>Code</TH><TH>Mnemonic</TH><TH>Condition</TH></TR>
<TR><TD>3'h0</TD><TD>(None)</TD><TD>Always</TD></TR>
<TR><TD>3'h1</TD><TD>.Z</TD><TD>Equal (Zero set)</TD></TR>
<TR><TD>3'h2</TD><TD>.NE</TD><TD>Not equal to (!Z)</TD></TR>
<TR><TD>3'h3</TD><TD>.GE</TD><TD>Greater than or equal (N not set, Z irrelevant)</TD></TR>
<TR><TD>3'h4</TD><TD>.GT</TD><TD>Greater than (N not set, Z not set)</TD></TR>
<TR><TD>3'h5</TD><TD>.LT</TD><TD>Less than (N set)</TD></TR>
<TR><TD>3'h6</TD><TD>.C</TD><TD>Carry set</TD></TR>
<TR><TD>3'h7</TD><TD>.V</TD><TD>Overflow set</TD></TR>
</TABLE>
There is no condition code for less than or equal, not carry, or not overflow.
Using one of these conditions will take an extra instruction.
(Ex: TST $4,CC; STO.NZ R0,(R1))
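<P>The condition table above can be read as a small truth function over the
four flags. The Python below is a sketch under that reading; the function
name <tt>condition_met</tt> is hypothetical, and the flag bit positions are
those of the status register table.
<PRE>
```python
def condition_met(code, cc):
    """Evaluate a 3-bit condition code against a CC register value."""
    z = bool(cc & 1)   # Zero
    c = bool(cc & 2)   # Carry
    n = bool(cc & 4)   # Negative
    v = bool(cc & 8)   # Overflow
    outcomes = [
        True,                  # 3'h0: always
        z,                     # 3'h1: equal (Z set)
        not z,                 # 3'h2: not equal (!Z)
        not n,                 # 3'h3: greater than or equal (N clear)
        (not n) and (not z),   # 3'h4: greater than (N clear, Z clear)
        n,                     # 3'h5: less than (N set)
        c,                     # 3'h6: carry set
        v,                     # 3'h7: overflow set
    ]
    return outcomes[code & 7]
```
</PRE>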
<H2>Operand B</H2>
Many instruction forms have a 21-bit source "Operand B" associated with them.
This Operand B is either equal to a register plus a signed immediate offset,
or an immediate offset by itself. This value is encoded as,
<TABLE border>
<TR><TH>20</TH><TH>19</TH><TH>18</TH><TH>17</TH><TH>16</TH>
<TH>15</TH><TH>14</TH><TH>13</TH><TH>12</TH>
<TH>11</TH><TH>10</TH><TH>9</TH><TH>8</TH>
<TH>7</TH><TH>6</TH><TH>5</TH><TH>4</TH>
<TH>3</TH><TH>2</TH><TH>1</TH><TH>0</TH></TR>
<TR><TD>1'b0</TD><TD colspan=20>Signed Immediate Value</TD></TR>
<TR><TD>1'b1</TD><TD colspan=4>Register</TD><TD colspan=16>Signed immediate offset</TD></TR>
</TABLE>
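<P>To make the Operand B encoding above concrete, here is a minimal Python
sketch that unpacks the 21-bit field. The helper names
<tt>sign_extend</tt> and <tt>decode_operand_b</tt> are illustrative only;
the field widths come from the table.
<PRE>
```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_operand_b(field):
    """Return (register, immediate); register is None for immediate-only."""
    field &= 0x1FFFFF
    if field & (1 << 20):
        # Bit 20 set: bits 19..16 name a register, bits 15..0 are an offset.
        return (field >> 16) & 0xF, sign_extend(field, 16)
    # Bit 20 clear: bits 19..0 are a signed immediate.
    return None, sign_extend(field, 20)
```
</PRE>
<P>The register form trades four bits of immediate range for the register
number, which is why its offset is only 16 bits wide.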
<H2>Address Mode(s)</H2>
The ZIP CPU supports two addressing modes: register plus immediate, and
immediate address. Addresses are therefore encoded in the same fashion as
Operand B's, shown above.
 
<P>A lot of long hard thought was put into whether to allow pre/post increment
and decrement addressing modes. Finding no way to use these operators without
taking two or more clocks per instruction, these addressing modes have been
removed from the realm of possibilities. This means that the Zip CPU has no
native way of executing push, pop, return, or jump to subroutine operations.
 
<H2>Move Operands</H2>
<P>The previous set of operands would be perfect and complete, save only that
the CPU needs access to non-supervisory registers while in supervisory
mode. Therefore, the MOV instruction is special and offers access
to these registers ... when in supervisory mode. To keep the compiler
simple, the extra bits are ignored in non-supervisory mode (as though
they didn't exist), rather than being mapped to new instructions or
additional capabilities. The bits indicating which register set each
register lies within are the A-map and B-map bits. Further, because
a load immediate instruction exists, there is no move capability between
an immediate and a register: all moves come from either a register or
a register plus an offset.
<P>This actually leads to a bit of a problem: since the MOV instruction
encodes which register set each register is coming from or moving to,
how shall a compiler or assembler know how to compile a MOV instruction
without knowing the mode of the CPU at the time? For this reason,
the compiler will assume all MOV registers are supervisor registers,
and display them as normal. Anything with the user bit set will
be treated as a user register. The CPU will quietly ignore the
supervisor bits while in user mode, and anything marked as a user
register will always be valid.
<H2>Native Instructions</H2>
<TABLE border>
<TR><TH rowspan=2>Op Code</TH><TH colspan=8>31..24</TH>
<TH colspan=8>23..16</TH>
<TH colspan=8>15..8</TH>
<TH colspan=8>7..0</TH>
<TH rowspan=2>Sets CC? (Y/N)</TH></TR>
<TR><!-- Opcode -->
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<!-- -->
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<!-- -->
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<!-- -->
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD><TD>&nbsp;</TD>
<!-- Sets CC -->
</TR>
<TR><TH>CMP(Sub)</TH><TD colspan=4>4'h0</TD>
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD><TD align=center rowspan=2>Y</TD></TR>
<TR><TH>BTST(And)</TH><TD colspan=4>4'h1</TD>
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>MOV</TH><TD colspan=4>4'h2</TD>
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=1>A-Map</TD>
<TD colspan=4>B-Reg</TD>
<TD colspan=1>B-Map</TD>
<TD colspan=15>+Immediate</TD><TD align=center rowspan=2>N</TD></TR>
<TR><TH>LODI</TH><TD colspan=4>4'h3</TD><TD colspan=4>Result Reg</TD>
<TD colspan=24>24-bit Signed Immediate</TD></TR>
<TR><TH>NOOP</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'he</TD>
<TD colspan=24>24'h00</TD><TD align=center>N</TD></TR>
<TR><TH>BREAK</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'he</TD>
<TD colspan=24>24'h01</TD><TD align=center>N</TD></TR>
<TR><TH>LODIHI</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD>
<TD colspan=3>Conditions</TD>
<TD>1'b1</TD><TD colspan=4>Result Reg</TD>
<TD colspan=16>16-bit Immediate</TD><TD align=center>N</TD></TR>
<TR><TH>LODILO</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD>
<TD colspan=3>Conditions</TD>
<TD>1'b0</TD><TD colspan=4>Result Reg</TD>
<TD colspan=16>16-bit Immediate</TD><TD align=center>N</TD></TR>
<!--
<TR><TH>LODIB</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD>
<TD colspan=3>Conditions</TD>
<TD colspan=3>3'h7</TD>
<TD colspan=2>Posn</TD>
<TD colspan=8>&nbsp;</TD>
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR>
-->
<TR><TH><FONT color=#444444><em>16-b MPY</em></FONT></TH><TD colspan=4>4'h4</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21><font color=#444444>Operand
B (<em>Reserved for</em>)</font></TD>
<TD align=center>N</TD></TR>
<TR><TH rowspan=2>ROL</TH><TD colspan=4 rowspan=2>4'h5</TD>
<TD colspan=4 rowspan=2>Result Reg</TD>
<TD colspan=3 rowspan=2>Conditions</TD>
<TD colspan=2>2'b11</TD>
<TD colspan=4>Operand Reg</TD>
<TD colspan=6><em>6'h00, Unused/Reserved</em></TD>
<TD>1'b0</TD>
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR>
<TR><!-- -->
<TD colspan=1>1'b0</TD>
<TD colspan=5>Rotate amount</TD>
<TD colspan=6><em>6'h00, Unused/Reserved</em></TD>
<TD>1'b0</TD>
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR>
<TR><TH>LOD</TH><TD colspan=4>4'h6</TD>
<TD colspan=4>Resulting Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Address: Register+Immediate, or Immediate</TD><TD align=center rowspan=2>N</TD></TR>
<TR><TH>STO</TH><TD colspan=4>4'h7</TD>
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Address: Register+Immediate, or Immediate</TD></TR>
<TR><TH>SUB</TH><TD colspan=4>4'h8</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD><TD rowspan=8 align=center>Y</TD></TR>
<TR><TH>AND</TH><TD colspan=4>4'h9</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>ADD</TH><TD colspan=4>4'ha</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>OR</TH><TD colspan=4>4'hb</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>XOR</TH><TD colspan=4>4'hc</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>LSL/ASL</TH><TD colspan=4>4'hd</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>ASR</TH><TD colspan=4>4'he</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
<TR><TH>LSR</TH><TD colspan=4>4'hf</TD>
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD>
<TD colspan=21>Operand B</TD></TR>
</TABLE>
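<P>For the instructions in the table above that share the plain
"register, conditions, Operand B" layout, the 32-bit word splits as
4 + 4 + 3 + 21 bits. The Python below is a sketch of that split, assuming
the fields are packed from bit 31 downward as the column headers suggest;
<tt>split_fields</tt> and <tt>build_word</tt> are hypothetical helper
names, not part of any Zip CPU tool.
<PRE>
```python
# Mnemonics for op codes that use the plain Operand B layout in the table.
OPERAND_B_OPCODES = {
    0x0: "CMP", 0x1: "BTST", 0x8: "SUB", 0x9: "AND", 0xA: "ADD",
    0xB: "OR", 0xC: "XOR", 0xD: "LSL", 0xE: "ASR", 0xF: "LSR",
}


def split_fields(word):
    """Return (opcode, register, condition, operand_b) for a 32-bit word."""
    word &= 0xFFFFFFFF
    opcode = (word >> 28) & 0xF
    reg = (word >> 24) & 0xF
    cond = (word >> 21) & 0x7
    operand_b = word & 0x1FFFFF
    return opcode, reg, cond, operand_b


def build_word(opcode, reg, cond, operand_b):
    """Pack the four fields back into a 32-bit instruction word."""
    return (((opcode & 0xF) << 28) | ((reg & 0xF) << 24)
            | ((cond & 0x7) << 21) | (operand_b & 0x1FFFFF))
```
</PRE>
<P>Under these assumptions an unconditional ADD of the immediate 5 into
register 3 packs to 0xA3000005. MOV, LODI, ROL, and the 4'h4 group use
different layouts and are not covered by this sketch.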
<H2>Derived Instructions</H2>
<TABLE border>
<TR><TH>Mapped</TH><TH>Actual</TH><TH WIDTH=65%>Notes</TH></TR>
<TR><TD valign=top>ADD Ra,Rx<BR>
ADDC Rb,Ry</TD><TD valign=top>ADD Ra,Rx<BR>ADD.C $1,Ry<BR>ADD Rb,Ry</TD><TD valign=top>Add with carry</TD></TR>
<TR><TD valign=top rowspan=2>BRA.cond +/-$Addr</TD><TD>
MOV.cond $Addr+PC,PC</TD><TD>Branch/jump on condition. Works for 14 bit address offsets.</TD></TR>
<TR><TD>LDI $Addr,Rx<BR>
ADD.cond Rx,PC</TD><TD>Branch/jump on condition. Works for
23 bit address offsets, but costs a register, an extra instruction,
and sets the flags.</TD></TR>
<TR><TD valign=top>BNC PC+$Addr</TD><TD>
TEST $Carry,CC<BR>
MOV.Z PC+$addr,PC</TD>
<TD>Example of a branch on an unsupported
condition, in this case a branch on not carry</TD></TR>
<TR><TD valign=top>CLRF.NZ Rx</TD><TD>XOR.NZ Rx,Rx</TD><TD>Clear Rx, and flags, if the Z-bit is not set</TD></TR>
<TR><TD valign=top>CLR Rx</TD><TD>LDI $0,Rx</TD><TD>Clears Rx, leaves flags untouched. This instruction cannot be conditional.</TD></TR>
<TR><TD valign=top>EXCH.W Rx</TD><TD>ROL $16,Rx</TD><TD valign=top>Exchanges the top and bottom 16-bit words of Rx</TD></TR>
<TR><TD valign=top>HALT</TD><TD>OR $SLEEP,CC</TD><TD valign=top>Executed while in interrupt mode. In user mode this is simply a wait-until-interrupt instruction.</TD></TR>
<TR><TD valign=top>INT</TD><TD>AND $!GIE,CC</TD><TD>Without setting an
interrupt flag or trap vector, the O/S might not know what to do with
this instruction. Therefore the trap version is recommended.</TD></TR>
<TR><TD valign=top>IRET</TD><TD>OR $GIE,CC</TD><TD>&nbsp;</TD></TR>
<TR><TD valign=top>JMP R6+$Addr</TD><TD>MOV $Addr(R6),PC</TD><TD>&nbsp;</TD></TR>
<TR><TD valign=top rowspan=2>JSR PC+$Addr</TD><TD>
SUB $1,SP<BR>
MOV $3+PC,R0<BR>
STO R0,1(SP)<BR>
MOV $Addr+PC,PC<br>
ADD $1,SP</TD><TD>Jump to Subroutine.</TD></TR>
<TR><TD>MOV $3+PC,R12<BR>MOV $addr+PC,PC</TD><TD>This is the high speed
version of the call, necessitating a register to hold the last
PC address. In its favor, this method doesn't suffer the mandatory
memory access of the other approach.</TD></TR>
<TR><TD valign=top>JTU</TD><TD>OR $GIE,CC</TD><TD>Also known as a JUMP-To-USER
space command, also known as IRET.</TD></TR>
<TR><TD valign=top>LDI.l $val,Rx</TD><TD>
LDIHI HIBITS($val),Rx<BR>
LDILO LOBITS($val),Rx</TD><TD>Sadly, there's not enough instruction
space to load a complete immediate value into any register.
Therefore, fully loading any register takes two cycles.
The LDIHI (load immediate high) and LDILO (load immediate low)
instructions have been created to facilitate this.</TD></TR>
<TR><TD valign=top>LOD.b $addr,Rx</TD><TD>
LDI $addr,Ra<BR>
LDI $addr,Rb<BR>
LSR $2,Ra<BR>
AND $3,Rb<BR>
LOD (Ra),Rx<BR>
LSL $3,Rb<BR>
SUB $32,Rb<BR>
ROL Rb,Rx<BR>
AND $0ffh,Rx</TD><TD>This CPU is designed for 32-bit word
length instructions. Byte addressing is not supported by the CPU or
the bus, so it therefore takes more work to do.<P>Note that in
this example, $Addr is a byte-wise address, where all other addresses
are 32-bit wordlength addresses. For this reason, we needed to
drop the bottom two bits.</TD></TR>
<TR><TD valign=top>LSL $1,Rx<BR>LSLC $1,Ry</TD>
<TD>LSL $1,Ry<BR>
LSL $1,Rx<BR>
OR.C $1,Ry</TD><TD>Logical shift left with carry. Note that the
instruction order is now backwards, to keep the conditions valid.
That is, LSL sets the carry flag, so if we did this the other way
with Rx before Ry, then the condition flag wouldn't have been right
for an OR correction at the end.</TD></TR>
<TR><TD valign=top>LSR $1,Rx<BR>LSRC $1,Ry</TD><TD>
CLR Rz<BR>
LSR $1,Ry<BR>
LDIHI.C $8000h,Rz<BR>
LSR $1,Rx<BR>
OR Rz,Rx</TD><TD>Logical shift right with carry</TD></TR>
<TR><TD valign=top>NEG Rx</TD><TD>XOR $-1,Rx<BR>ADD $1,Rx</TD><TD>&nbsp;</TD></TR>
<TR><TD valign=top>NOOP</TD><TD>NOOP</TD><TD>While there are many
operations that do nothing, such as MOV Rx,Rx, or OR $0,Rx, these
operations have consequences in that they might stall the bus if
Rx isn't ready yet. For this reason, we have a dedicated NOOP
instruction.</TD></TR>
<TR><TD valign=top>NOT Rx</TD><TD>XOR $-1,Rx</TD><TD>&nbsp;</TD></TR>
<TR><TD valign=top>POP Rx</TD><TD>LOD $-1(SP),Rx<BR>ADD $1,SP</TD><TD>Note
that for interrupt purposes, one can never depend upon the value at
(SP). Hence you read from it, then increment it, lest having
incremented it first something then comes along and writes to that
value before you can read the result.</TD></TR>
<TR><TD valign=top>PUSH Rx</TD><TD>
SUB $1,SP<BR>
STO Rx,$1(SP)</TD><TD>&nbsp;</TD></TR>
<TR><TD valign=top>RESET</TD><TD>STO $1,$watchdog(R12)<BR>NOOP<BR>NOOP</TD><TD>
This depends upon the peripheral base address being in R12.
<P>Another opportunity might be to jump to the reset address from within
supervisor mode.
</TD></TR>
<TR><TD valign=top rowspan=2>RET</TD><TD>LOD $-1(SP),R0<BR>
MOV $-1+SP,SP<BR>
MOV R0,PC</TD><TD>An alternative might be to LOD $-1(SP),PC, followed
by depending upon the calling program to ADD $1,SP.</TD></TR>
<TR><TD>MOV R12,PC</TD><TD>This is the high(er) speed version, that doesn't
touch the stack. As such, it doesn't suffer a stall on memory
read/write to the stack.</TD></TR>
<TR><TD valign=top>STEP Rr,Rt</TD><TD>LSR $1,Rr<BR>XOR.C Rt,Rr</TD><TD>Step a
Galois implementation of a Linear Feedback Shift Register, Rr, using
taps Rt</TD></TR>
<TR><TD valign=top>STO.b Rx,$addr</TD><TD>
LDI $addr,Ra<BR>
LDI $addr,Rb<BR>
LSR $2,Ra<BR>
AND $3,Rb<BR>
SUB $32,Rb<BR>
LOD (Ra),Ry<BR>
AND $0ffh,Rx<BR>
AND $-0ffh,Ry<BR>
ROL Rb,Rx<BR>
OR Rx,Ry<BR>
STO Ry,(Ra)</TD><TD>This CPU and its bus are <em>not</em> optimized
for byte-wise operations.<P>Note that in this example, $addr is a
byte-wise address, whereas in all of our other examples it is a
32-bit word address. Further, this instruction implies a byte ordering,
such as big or little endian.</TD></TR>
<TR><TD valign=top>SWAP Rx,Ry</TD><TD>
XOR Ry,Rx<BR>
XOR Rx,Ry<BR>
XOR Ry,Rx</TD><TD>While no extra registers are needed, this example
does take 3 clocks.</TD></TR>
<TR><TD valign=top>TRAP #X</TD><TD>LDILO $x,CC</TD><TD>
This approach uses the unused bits of the CC register as a TRAP
address. If these bits are zero, no trap has occurred. Unlike my
previous approach, which was to use a trap peripheral, this approach
has no delay associated with it. To work, the supervisor will need
to clear this register following any trap, and the user will need to
be careful to only set this register prior to a trap condition.
Likewise, when setting this value, the user will need to make certain
that the SLEEP and GIE bits are not set in $x. LDI would also work,
however using LDILO permits the use of conditional traps. (i.e.,
trap if the zero flag is set.) Should you wish to trap off of a
register value, you could equivalently load $x into the register and
then MOV it into the CC register.
</TD></TR>
<TR><TD valign=top>TST Rx</TD><TD>TST $-1,Rx</TD><TD valign=top>Set the
condition codes based upon Rx. Could also do a CMP $0,Rx,
ADD $0,Rx, SUB $0,Rx, AND $-1,Rx, etc. The TST and CMP
approaches won't stall future pipeline stages looking for the value
of Rx.</TD></TR>
<TR><TD valign=top>WAIT</TD><TD>OR $SLEEP,CC</TD><TD valign=top>Wait
'til interrupt. In an interrupts disabled context, this becomes a
HALT instruction.</TD></TR>
</TABLE>
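<P>As one worked example from the table above, the STEP entry (LSR $1,Rr
followed by XOR.C Rt,Rr) can be modelled in Python. This sketch assumes
that LSR leaves the bit shifted out of Rr in the carry flag, which is what
the conditional XOR relies on; the function name <tt>lfsr_step</tt> is
made up for illustration.
<PRE>
```python
def lfsr_step(state, taps):
    """Advance a 32-bit Galois LFSR by one step.

    Mirrors LSR $1,Rr (shift right, dropped bit goes to carry)
    followed by XOR.C Rt,Rr (apply the taps only if carry was set).
    """
    state &= 0xFFFFFFFF
    carry = state & 1          # bit that LSR shifts out
    state >>= 1                # LSR $1,Rr
    if carry:
        state ^= taps          # XOR.C Rt,Rr
    return state & 0xFFFFFFFF
```
</PRE>
<P>Because the XOR is conditional on the carry, the whole step costs two
instructions and needs no branch.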
<H2>Pipeline Stages</H2>
<OL>
<LI><B>PREFETCH</B>: Read instruction from memory (cache if possible)
<UL>
<LI>A lack of an instruction, or a waiting memory operation, stalls the
pipeline.
</UL>
<LI><B>DECODE</B>: Decode instruction into op code, register(s) to read, and
immediate offset.
<UL><LI>INPUT: Instruction
<LI>OUTPUT: 5-bit register address of result,
5-bit register address of an input and usage flag (this
register is used), 5-bit register address of second input and
usage flag, 32-bit immediate offset.
<P><TT>decode(i_clk, (i_ce)&amp;(~stall), i_instr, i_gie, i_pc,
o_opcode, o_ccode,
o_wr_back, o_wr_reg, o_ra_read, o_ra_reg,
o_rb_read, o_rb_reg,
o_immediate,
o_memop, o_wr, o_iodec);</TT>
<LI>Move instruction gets one decoder, produces two register addresses,
use flag set to one on register B, address A is unused.
<LI>Load/Store instructions produce two registers, an immediate, and
two flags
<LI>Operand B type instructions produce two registers, an immediate,
and a use flag
<LI>LDI produces one register, a (longer) immediate, and sets the use
flag to zero (second register isn't used)
<LI>This section never stalls. On an external stall it simply doesn't update
its outputs. Outputs are available one clock after
the instruction is valid.
</UL>
<LI><b>READ OPERANDS</B>: Read registers and apply any immediate values to them.
<UL>
<LI>This should stall if a source operand is pending.
</UL>
<LI>Split into two tracks: A) <B>ALU</B>, which accomplishes simple instructions, and B) <B>MEMOPS</B>, which handles memory reads and writes.
<UL>
<LI>Loads stall instructions that access the register until it is
written to the register set.
<LI>Condition codes are available upon completion
<LI>Issuing an instruction to the memory while the memory is busy will
stall the bus. If the bus deadlocks, only a reset will
release the CPU. (Watchdog timer, anyone?)
</UL>
<LI><B>WRITE-BACK</B>: Conditionally write back the result to register set, applying the
condition. This routine is bi-re-entrant. Either the memory or the
simple instruction may request a register write. Memory writes take
priority, stalling the other track.
<UL>
<LI>This stage will stall the pipeline if both memory and op
try to write to the registers at the same time.
</UL>
</OL>
<H2 align=center>Pipeline Logic</H2>
How the CPU handles some instruction combinations can be telling when
determining what happens in the pipeline. For example:
<TABLE>
<TR><TH>Instruction(s)</TH><TH>Issue</TH><TH>Choice</TH></TR>
<TR><TD valign=top>Delayed Branching</TD><TD>What happens in debug mode?
That is, what happens when a debugger tries to single step an
instruction? While I can easily single step the computer in either
user or supervisor mode from externally, this processor does not appear
able to step the CPU in user mode from within user mode--gosh, not even
from within supervisor mode--such as if a process had a debugger
attached. As the processor exists, I would have one result stepping
the CPU from a debugger, and another stepping it externally.
<P>This is unacceptable.
</TD></TR>
<TR><TD valign=top>MOV R0,R1<BR>MOV R1,R2</TD><TD valign=top>What value does
R2 get, the value of R1 before the first move or the value of R0?
Placing the value of R0 into R1 requires a pipeline stall, and possibly
two, as I have the pipeline designed.</TD><TD valign=top>R2 must
equal R0 at the end of this operation. This may stall the pipeline
1-2 cycles.</TD></TR>
<TR><TD valign=top>CMP R0,R1<BR>MOV.EQ $x,PC</TD><TD valign=top>At issue is
the same item as above, save that the CMP instruction updates the
flags that the MOV instruction depends
upon.</TD><TD valign=top>Condition codes must be updated and available
immediately for the next instruction without stalling the
pipeline.</TD></TR>
<TR><TD valign=top>CMP R0,R1<BR>MOV CC,R2</TD><TD valign=top>At issue is the
fact that the logic supporting the CC register is more complicated than
the logic supporting any other register.</TD><TD valign=top>This will
create a stall, of 1-2 clock cycles</TD></TR>
<TR><TD valign=top>ADD $5,R0<BR>BTST $8,CC</TD><TD valign=top>Test for
overflow (or not). At issue is the load of the condition codes for
the BTST instruction, which takes place two clocks before the prior
instruction writes it back.</TD><TD valign=top><em>Negotiable for
simplified logic.</em> Let's stall here, 1-2 clocks.</TD></TR>
<TR><TD valign=top>ADD $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the
instruction following the jump take place before the jump? In
other words, is the MOV to the PC register handled differently from
an ADD to the PC register?</TD><TD valign=top>
MOVs and ADDs use the same logic (simplifies the logic).</TD></TR>
<TR><TD valign=top>MOV $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the
instruction following the jump take place before the jump? Or must the
pipeline "turn off" any outputs associated with a jump once it
recognizes that the jump has taken place? Alternatively, the pipeline
could stall until the result of the MOV was
available.</TD><TD valign=top>
<em>Negotiable for simplified logic</em></TD></TR>
<TR><TD valign=top>MOV $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the
instruction following the jump take place before the jump sometimes
but not all times? Might the pipeline not be full when the jump
takes place, and thus the MOV instruction never gets loaded due to a
stalled pre-fetch? Or is the MOV a dependable instruction, guaranteed
to be executed (or not) no matter what the JMP does? Perhaps the
compiler is required to insert 2-3 NOOP's following a jump, just
to keep the pipeline doing something reliably?
</TD><TD valign=top>
<em>Negotiable (at present). Highly desired that the behavior
is the same regardless of the prefetch speed.</em></TD></TR>
<TR><TD valign=top>MOV.EQ $x,PC<BR>MOV $y,PC</TD><TD valign=top>Where will
instructions take place next? On a delayed jump instruction, this
means that instruction $x will be executed followed by $y, and
execution will not continue at $x as
desired</TD><TD><em>Negotiable.</em></TD></TR>
</TABLE>
<P>As I've studied this, I find several approaches to handling pipeline
issues. These approaches (and their consequences) are listed below.
<TABLE>
<TR><TH>Condition/Case</TH><TH>Discussion</TH></TR>
<TR><TD valign=top>All issued instructions complete
<BR>Stages stall individually</TD><TD valign=top>What about a
slow pre-fetch? <P>Nominally, this works well: any issued instruction
just runs to completion. If there are four issued instructions in the
pipeline, with the writeback instruction being a write-to-PC
instruction, the other three instructions naturally finish.
<P>This approach fails when reading instructions from the flash,
since such reads require N clocks to complete. Thus
there may be only one instruction in the pipeline if reading from flash,
or a full pipeline if reading from cache. Each of these approaches
would produce a different response.
<P>This is unacceptable.</TD></TR>
<TR><TD valign=top>Issued instructions may be canceled
<BR>Stages stall individually</TD><TD valign=top>First problem:
Memory operations cannot be canceled, even reads may have side effects
on peripherals that cannot be canceled later. Further, in the case of
an interrupt, it's difficult to know what to cancel. What happens in
a MOV.C $x,PC followed by a MOV $y,PC instruction? Which get
canceled?<P>Because it isn't clear what would need to be canceled,
this is not doable.</TD></TR>
<TR><TD valign=top>All issued instructions complete.
<BR>All stages are filled, or the entire pipeline
stalls.</TD><TD valign=top>What about debug control? What about
register writes taking an extra clock stage? MOV R0,R1; MOV R1,R2
should place the value of R0 into R2. How do you restart the pipeline
after an interrupt? What address do you use? The last issued
instruction? But the branch delay slots may make that invalid!
<P>Reading from the CPU debug port in this case yields inconsistent
results: the CPU will halt or step with instructions stuck in the
pipeline. Reading registers will give no indication of what is going
on in the pipeline, just the results of completed operations, not of
operations that have been started and not yet completed.
Perhaps we should just report the state of the CPU based upon what
instructions (PC values) have successfully completed? Thus the
debug instruction is the one that will write registers on the next
clock.
<UL>Suggestion: Suppose we load extra information in the two
CC register(s) for debugging intermediate pipeline stages?</UL>
<P>The next problem, though, is how to deal with the read operand
pipeline stage needing the result from the register pipeline.</TD></TR>
<TR><TD valign=top>All instructions that enter into the memory module *must*
complete. Issued instructions from the prefetch, decode, or operand
read stages may or may not complete. Jumps into code must be valid,
so that interrupt returns may be valid. All instructions entering the
ALU complete.</TD><TD valign=top>This looks to be the simplest approach.
While the logic may be difficult, this appears to be the only
re-entrant approach.
 
<P>A <tt>new_pc</tt> flag will be high anytime the PC changes in an
unpredictable way (i.e., it doesn't increment). This includes jumps
as well as interrupts and interrupt returns. Whenever this flag may
go high, memory operations and ALU operations will stall until the
result is known. When the flag does go high, anything in the prefetch,
decode, and read-op stage will be invalidated.
</TD></TR>
</TABLE>
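<P>The effect of the <tt>new_pc</tt> flag described in the last row above
can be sketched as follows. This Python is a toy model, not the CPU's
logic: the stage names, the dictionary of valid flags, and the function
name <tt>apply_new_pc</tt> are all assumptions made for illustration.
<PRE>
```python
# Stages that sit before the ALU/memory stage (names are illustrative).
EARLY_STAGES = ("prefetch", "decode", "read_op")


def apply_new_pc(valid, new_pc):
    """Return the per-stage valid flags after one clock.

    valid maps a stage name to whether it holds a valid instruction.
    When new_pc is high, everything before the ALU/memory stage is
    invalidated; instructions already in later stages are left alone
    so that they run to completion.
    """
    result = dict(valid)
    if new_pc:
        for stage in EARLY_STAGES:
            result[stage] = False
    return result
```
</PRE>
<P>The point of the model is the asymmetry: a jump or interrupt discards
only work that has not yet reached the ALU or memory unit.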
</BODY></HTML>
