URL
https://opencores.org/ocsvn/zipcpu/zipcpu/trunk
Subversion Repositories zipcpu
<HTML><HEAD><TITLE>Zip CPU Instruction Set</TITLE></HEAD><BODY> |
<H1 align=center>Zip CPU Goals</H1> |
<P>The original goal of the ZIP CPU was a simple CPU. For this reason, |
all instructions have been designed to be as simple as possible, and |
are all designed to be executed in one instruction cycle per |
instruction, barring pipeline stalls. This has resulted in the choice |
to drop push and pop instructions, pre-increment and post-decrement |
addressing modes, and more. |
<P>For those who like buzz words, the Zip CPU is: |
<UL> |
<LI>A 32-bit CPU: All registers are 32 bits, addresses are 32 bits,
instructions are 32 bits wide, etc.
<LI>A RISC CPU. There is no microcode for executing instructions. |
<LI>A Load/Store architecture. (Only load and store instructions |
can access memory.) |
<LI>Wishbone compliant. All peripherals are accessed just like |
memory across this bus. |
<LI>A Von Neumann architecture. (The instructions and data share a
common bus.) |
<LI>A pipelined architecture, having stages for <B>Prefetch</B>, |
<B>Decode</B>, <B>Read-Operand</B>, the <B>ALU/Memory</B> |
unit, and <B>Write-back</B> |
</UL> |
|
<P>Now, however, that I've worked on the Zip CPU for a while, it is not nearly |
as simple as I originally hoped. Worse, I've had to adjust to create |
capabilities that I was never expecting to need. These include: |
<UL> |
<LI><B>External Debug:</B> Once placed upon an FPGA, I'm going to need
a means of debugging this CPU. That means that there needs to be an |
external register that can control the CPU: reset it, halt it, step |
it, and tell |
whether it is running or not. Another register is placed similar to |
this register, to allow the external controller to examine registers |
internal to the CPU.</P> |
<LI><B>Internal Debug:</B> Being able to run a debugger from within |
a user process requires an ability to step a user process from |
within a debugger. It also requires a break instruction that can |
be substituted for any other instruction, and substituted back. |
The break is actually difficult: the break instruction cannot be |
allowed to execute. That way, upon a break, the debugger should |
be able to jump back into the user process to step the instruction |
that would've been at the break point initially, and then to |
replace the break after passing it.</P> |
|
<LI><B>Prefetch Cache:</B> My original implementation had a very
simple prefetch stage. Any time the PC changed, the prefetch would go
and fetch the new instruction. While this was perhaps the simplest
approach, it cost roughly five clocks for every instruction. This
was deemed unacceptable, as I wanted a CPU that could execute |
instructions in one cycle. I therefore have a prefetch cache that |
issues pipelined wishbone accesses to memory and then pushes |
instructions at the CPU. Sadly, this accounts for about 20% of the |
logic in the entire CPU, or 15% of the logic in the entire system. |
</P> |
|
<LI><B>Operating System:</B> In order to support an operating system,
interrupts and so forth, the CPU needs to support supervisor and
user modes, as well as a means of switching between them. For example,
the user needs a means of executing a system call. This is the
purpose of the <B>'trap'</B> instruction. This instruction needs to
place the CPU into supervisor mode (here equivalent to disabling
interrupts), as well as hand it a parameter identifying
which O/S function was called.
|
<P>My initial approach to building a trap instruction was to create |
an external peripheral which, when written to, would generate an |
interrupt and could return the last value written to it. This failed |
timing requirements, however: the CPU executed two instructions while |
waiting for the trap interrupt to take place. Since then, I've |
decided to keep the rest of the CC register for that purpose so that a |
write to the CC register, with the GIE bit cleared, could be used to |
execute a trap. |
|
<P>Modern timesharing systems also depend upon a <B>Timer</B> interrupt |
to handle task swapping. For the Zip CPU, this interrupt is handled |
external to the CPU as part of the CPU System, found in |
<tt>zipsystem.v</tt>. The timer module itself is found in |
<tt>ziptimer.v</tt>. |
|
<LI><B>Pipeline Stalls:</B> My original plan was to not support pipeline |
stalls at all, but rather to require the compiler to properly schedule |
instructions so that stalls would never be necessary. After trying |
to build such an architecture, I gave up, having learned some things: |
|
<P>For example, in order to facilitate interrupt handling and debug |
stepping, the CPU needs to know what instructions have finished, and |
which have not. In other words, it needs to know where it can restart |
the pipeline from. Once restarted, it must act as though it had |
never stopped. This killed my idea of delayed branching, since |
what would be the appropriate program counter to restart at? |
The one the CPU was going to branch to, or the ones in the |
delay slots? |
|
<P>So I switched to a model of discrete execution: Once an instruction |
enters into either the ALU or memory unit, the instruction is |
guaranteed to complete. If the logic recognizes a branch or a |
condition that would render the instruction entering into this stage |
possibly inappropriate (e.g., a conditional branch preceding a store
instruction), then the pipeline stalls for one cycle
until the conditional branch completes. Then, if it generates a new
PC address, the preceding stages are all wiped clean.
|
<P>The discrete execution model allows such things as sleeping: if the |
CPU is put to "sleep", the ALU and memory stages stall and back up |
everything before them. Likewise, anything that has entered the ALU |
or memory stage when the CPU is placed to sleep continues to completion. |
<P>To handle this logic, each pipeline stage has three control signals: |
a valid signal, a stall signal, and a clock enable signal. In |
general, a stage stalls if its contents are valid and the next step
is stalled. This allows the pipeline to fill any time a later stage |
stalls. |
|
<LI><B>Verilog Modules:</B> When examining how other processors worked
here on OpenCores, many of them had one separate module per pipeline
stage. While this appeared to me to be a fascinating and commendable |
idea, my own implementation didn't work out quite so nicely. |
|
<P>As an example, the decode module produces a <em>lot</em> of |
control wires and registers. Creating a module out of this, with |
only the simplest of logic within it, seemed to be more a lesson |
in passing wires around, rather than encapsulating logic. |
|
<P>Another example was the register writeback section. I would love |
this section to be a module in its own right, and many have made them |
such. However, other modules depend upon writeback results other |
than just what's placed in the register (i.e., the control wires). |
For these reasons, I didn't manage to fit this section into its
own module.
|
<P>The result is that the majority of the CPU code can be found in |
the <tt>zipcpu.v</tt> file. |
</UL> |
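<P>The stall rule described under Pipeline Stalls above (a stage stalls if
its contents are valid and the next stage is stalled) can be modeled in a
few lines. This is only a sketch in Python, not the logic in
<tt>zipcpu.v</tt>: the function names, the list-of-booleans representation,
and the choice to derive the clock enable as "not stalled" are all
assumptions made for illustration.
<PRE>
```python
def compute_stalls(valid, final_stall):
    """Apply the stall rule from the last stage back to the first.

    valid       -- list of booleans, one per stage, first stage first
    final_stall -- True if whatever follows the last stage cannot accept data
    Both the names and this representation are hypothetical.
    """
    stalls = [False] * len(valid)
    next_stalled = final_stall
    for i in reversed(range(len(valid))):
        # A stage stalls only if it holds something AND the next stage is stalled.
        stalls[i] = valid[i] and next_stalled
        next_stalled = stalls[i]
    return stalls


def clock_enables(valid, final_stall):
    """Assumption: a stage may load new contents whenever it is not stalled."""
    return [not s for s in compute_stalls(valid, final_stall)]


if __name__ == "__main__":
    # An empty middle stage absorbs the stall, so the first stage keeps moving.
    print(compute_stalls([True, False, True], True))
```
</PRE>
In this model an empty (invalid) stage never stalls, which is what lets the
pipeline fill behind a stalled later stage.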
|
With that introduction out of the way, let's move on to the instruction |
set. |
|
<H1 align=center>Zip CPU Instruction Set</H1> |
The Zip CPU supports a set of two operand instructions, where the first operand |
(always a register) is the result. The only exception is the store instruction, |
where the first operand (always a register) is the source of the data to be |
stored. |
<H2>Register Set</H2> |
The Zip CPU supports two sets of sixteen 32-bit registers, a supervisor |
and a user set. The supervisor set is used in interrupt mode, whereas |
the user set is used otherwise. Of this register set, the Program Counter (PC) |
is register 15, whereas the status register (SR) or condition code register |
(CC) is register 14. By convention, the stack pointer will be register 13 and |
noted as (SP)--although the instruction set allows it to be anything. |
The CPU can access both register sets via move instructions from the |
supervisor state, whereas the user state can only access the user registers. |
|
<P>The status register is special, and bears further mention. The lower |
8 bits of the status register form a set of condition codes. Writes to other |
bits are preserved, and can be used as part of the trap architecture--examined |
by the O/S upon any interrupt, cleared before returning. |
<P>Of the eight condition codes, the bottom four are the current flags: |
Zero (Z), |
Carry (C), |
Negative (N), |
and Overflow (V). |
|
<P>The next bit is a clock enable (0 to enable) or sleep bit (1 to put |
the CPU to sleep). Setting this bit will cause the CPU to |
wait for an interrupt (if interrupts are enabled), or to |
completely halt (if interrupts are disabled). |
<P>The sixth bit is a global interrupt enable bit (GIE). When this |
sixth bit is a '1' interrupts will be enabled, else disabled. When |
interrupts are disabled, the CPU will be in supervisor mode, otherwise |
it is in user mode. Thus, to execute a context switch, one only |
need enable or disable interrupts. (When an interrupt line goes |
high, interrupts will automatically be disabled, as the CPU goes |
and deals with its context switch.) |
<P><em>Experimental</em>: The seventh bit is a step bit. This bit can be |
set from supervisor mode only. After setting this bit, should |
the supervisor mode process switch to user mode, it would then |
accomplish one instruction in user mode before returning to supervisor |
mode. Then, upon return to supervisor mode, this bit will |
be automatically cleared. This bit has no effect on the CPU while in |
supervisor mode. |
<P>This functionality was added to allow a userspace debugger
to step a user process, working through supervisor mode
of course.
<P><em>Experimental</em>: The eighth bit is a break enable bit. This |
controls whether a break instruction will halt the processor for an |
external debugger (break enabled), or whether the break instruction
will simply set the STEP bit and send the CPU into interrupt mode. |
This bit can only be set within supervisor mode. |
<P>This functionality was added to enable an external debugger to |
set and manage breakpoints. |
<P>The status register bits are shown below: |
<TABLE border> |
<TR><TH>7</TH><TH>6</TH><TH>5</TH><TH>4</TH> <TH>3</TH> <TH>2</TH> <TH>1</TH> <TH>0</TH></TR> |
<TR><TD>BREAKEN</TD><TD>STEP</TD><TD>GIE</TD><TD>SLEEP</TD><TD>V</TD><TD>N</TD><TD>C</TD> <TD>Z</TD></TR> |
</TABLE> |
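<P>As a reading aid, the bit layout in the table above can be decoded as
follows. This is a minimal Python sketch, not part of the CPU sources;
<tt>decode_cc</tt>, <tt>mode</tt>, and the <tt>"upper"</tt> key are
hypothetical names chosen for this example.
<PRE>
```python
# Bit 0 is Z, bit 7 is BREAKEN, exactly as in the status register table.
CC_BITS = ["Z", "C", "N", "V", "SLEEP", "GIE", "STEP", "BREAKEN"]


def decode_cc(value):
    """Return a dict of the eight named bits plus the remaining upper bits."""
    fields = {name: bool((value >> bit) & 1) for bit, name in enumerate(CC_BITS)}
    # Bits 31..8 carry no flag meaning; the text uses them for trap values.
    fields["upper"] = (value >> 8) & 0xFFFFFF
    return fields


def mode(value):
    """GIE set means user mode; GIE clear means supervisor mode."""
    return "user" if decode_cc(value)["GIE"] else "supervisor"


if __name__ == "__main__":
    print(decode_cc(0x21))  # Z and GIE set
    print(mode(0x21))
```
</PRE>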
<H2>Conditions</H2> |
Most, although not quite all, instructions are conditional. From the four |
condition code flags, eight conditions are defined. These are: |
<TABLE> |
<TR><TH>Code</TH><TH>Mnemonic</TH><TH>Condition</TH></TR>
<TR><TD>3'h0</TD><TD>(None)</TD><TD>Always</TD></TR> |
<TR><TD>3'h1</TD><TD>.Z</TD><TD>Equal (Zero set)</TD></TR> |
<TR><TD>3'h2</TD><TD>.NE</TD><TD>Not equal to (!Z)</TD></TR> |
<TR><TD>3'h3</TD><TD>.GE</TD><TD>Greater than or equal (N not set, Z irrelevant)</TD></TR> |
<TR><TD>3'h4</TD><TD>.GT</TD><TD>Greater than (N not set, Z not set)</TD></TR> |
<TR><TD>3'h5</TD><TD>.LT</TD><TD>Less than (N set)</TD></TR> |
<TR><TD>3'h6</TD><TD>.C</TD><TD>Carry set</TD></TR> |
<TR><TD>3'h7</TD><TD>.V</TD><TD>Overflow set</TD></TR> |
</TABLE> |
There is no condition code for less than or equal, not C or not V. Using |
these conditions will take an extra instruction. |
(Ex: TST $4,CC; STO.NZ R0,(R1)) |
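<P>The condition table can be restated as a small lookup. The sketch below
is in Python and follows the table row by row; the function name
<tt>condition_met</tt> and the flag mask constants are assumptions for
illustration, with the masks taken from the status register bit positions.
<PRE>
```python
# Flag masks, matching bits 0..3 of the status register.
Z, C, N, V = 0x1, 0x2, 0x4, 0x8


def condition_met(code, flags):
    """Return True if the 3-bit condition `code` holds for the given flags."""
    z, c, n, v = (bool(flags & mask) for mask in (Z, C, N, V))
    table = {
        0: True,                 # (None): always
        1: z,                    # .Z
        2: not z,                # .NE
        3: not n,                # .GE (Z irrelevant)
        4: (not n) and (not z),  # .GT
        5: n,                    # .LT
        6: c,                    # .C
        7: v,                    # .V
    }
    return table[code & 0x7]


if __name__ == "__main__":
    print(condition_met(4, Z))  # .GT with Z set
```
</PRE>
Conditions missing from the table, such as "not carry", have no code here
either, which is why they cost an extra test instruction.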
<H2>Operand B</H2> |
Many instruction forms have a 21-bit source "Operand B" associated with them. |
This Operand B is either equal to a register plus a signed immediate offset, |
or an immediate offset by itself. This value is encoded as, |
<TABLE border> |
<TR><TH>20</TH><TH>19</TH><TH>18</TH><TH>17</TH><TH>16</TH> |
<TH>15</TH><TH>14</TH><TH>13</TH><TH>12</TH> |
<TH>11</TH><TH>10</TH><TH>9</TH><TH>8</TH> |
<TH>7</TH><TH>6</TH><TH>5</TH><TH>4</TH> |
<TH>3</TH><TH>2</TH><TH>1</TH><TH>0</TH></TR> |
<TR><TD>1'b0</TD><TD colspan=20>Signed Immediate Value</TD></TR> |
<TR><TD>1'b1</TD><TD colspan=4>Register</TD><TD colspan=16>Signed immediate offset</TD></TR>
</TABLE> |
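<P>To make the two Operand B forms concrete, here is a minimal Python
sketch that decodes a 21-bit field according to the table above. The
helper names (<tt>sign_extend</tt>, <tt>decode_operand_b</tt>) and the
dictionary it returns are hypothetical; only the bit positions come from
the table.
<PRE>
```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_operand_b(opb):
    """Split a 21-bit Operand B into an optional register and an immediate."""
    opb &= 0x1FFFFF
    if (opb >> 20) & 1:
        # Bit 20 set: 4-bit register in bits 19..16, 16-bit signed offset below.
        return {"reg": (opb >> 16) & 0xF, "imm": sign_extend(opb, 16)}
    # Bit 20 clear: the low 20 bits are a signed immediate.
    return {"reg": None, "imm": sign_extend(opb, 20)}


if __name__ == "__main__":
    # Register 13 (SP by convention) with an offset of -1.
    print(decode_operand_b((1 << 20) | (13 << 16) | 0xFFFF))
```
</PRE>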
<H2>Address Mode(s)</H2> |
The ZIP CPU supports two addressing modes: register plus immediate, and |
immediate address. Addresses are therefore encoded in the same fashion as |
Operand B's, shown above. |
|
<P>A lot of long hard thought was put into whether to allow pre/post increment |
and decrement addressing modes. Finding no way to use these operators without |
taking two or more clocks per instruction, these addressing modes have been |
removed from the realm of possibilities. This means that the Zip CPU has no |
native way of executing push, pop, return, or jump to subroutine operations. |
|
<H2>Move Operands</H2> |
<P>The previous set of operands would be perfect and complete, save only that |
the CPU needs access to non-supervisory registers while in supervisory
mode. Therefore, the MOV instruction is special and offers access |
to these registers ... when in supervisory mode. To keep the compiler |
simple, the extra bits are ignored in non-supervisory mode (as though |
they didn't exist), rather than being mapped to new instructions or |
additional capabilities. The bits indicating which register set each |
register lies within are the A-map and B-map bits. Further, because |
a load immediate instruction exists, there is no move capability between |
an immediate and a register: all moves come from either a register or |
a register plus an offset. |
<P>This actually leads to a bit of a problem: since the MOV instruction |
encodes which register set each register is coming from or moving to, |
how shall a compiler or assembler know how to compile a MOV instruction |
without knowing the mode of the CPU at the time? For this reason, |
the compiler will assume all MOV registers are supervisor registers, |
and display them as normal. Anything with the user bit set will |
be treated as a user register. The CPU will quietly ignore the |
supervisor bits while in user mode, and anything marked as a user |
register will always be valid. |
<H2>Native Instructions</H2> |
<TABLE border> |
<TR><TH rowspan=2>Op Code</TH><TH colspan=8>31..24</TH> |
<TH colspan=8>23..16</TH> |
<TH colspan=8>15..8</TH> |
<TH colspan=8>7..0</TH> |
<TH rowspan=2>Sets CC? (Y/N)</TH></TR> |
<TR><!-- Opcode --> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<!-- --> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<!-- --> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<!-- --> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<TD> </TD><TD> </TD><TD> </TD><TD> </TD> |
<!-- Sets CC --> |
</TR> |
<TR><TH>CMP(Sub)</TH><TD colspan=4>4'h0</TD> |
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD><TD align=center rowspan=2>Y</TD></TR> |
<TR><TH>BTST(And)</TH><TD colspan=4>4'h1</TD> |
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>MOV</TH><TD colspan=4>4'h2</TD> |
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=1>A-Map</TD> |
<TD colspan=4>B-Reg</TD> |
<TD colspan=1>B-Map</TD> |
<TD colspan=15>+Immediate</TD><TD align=center rowspan=2>N</TD></TR> |
<TR><TH>LODI</TH><TD colspan=4>4'h3</TD><TD colspan=4>Result Reg</TD> |
<TD colspan=24>24-bit Signed Immediate</TD></TR>
<TR><TH>NOOP</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'he</TD> |
<TD colspan=24>24'h00</TD><TD align=center>N</TD></TR> |
<TR><TH>BREAK</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'he</TD> |
<TD colspan=24>24'h01</TD><TD align=center>N</TD></TR> |
<TR><TH>LODIHI</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD> |
<TD colspan=3>Conditions</TD> |
<TD>1'b1</TD><TD colspan=4>Result Reg</TD> |
<TD colspan=16>16-bit Immediate</TD><TD align=center>N</TD></TR> |
<TR><TH>LODILO</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD> |
<TD colspan=3>Conditions</TD> |
<TD>1'b0</TD><TD colspan=4>Result Reg</TD> |
<TD colspan=16>16-bit Immediate</TD><TD align=center>N</TD></TR> |
<!-- |
<TR><TH>LODIB</TH><TD colspan=4>4'h4</TD><TD colspan=4>4'hf</TD> |
<TD colspan=3>Conditions</TD> |
<TD colspan=3>3'h7</TD> |
<TD colspan=2>Posn</TD> |
<TD colspan=8> </TD> |
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR> |
--> |
<TR><TH><FONT color=#444444><em>16-b MPY</em></FONT></TH><TD colspan=4>4'h4</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21><font color=#444444>Operand |
B (<em>Reserved for</em>)</font></TD> |
<TD align=center>N</TD></TR> |
<TR><TH rowspan=2>ROL</TH><TD colspan=4 rowspan=2>4'h5</TD> |
<TD colspan=4 rowspan=2>Result Reg</TD> |
<TD colspan=3 rowspan=2>Conditions</TD> |
<TD colspan=2>2'b11</TD> |
<TD colspan=4>Operand Reg</TD> |
<TD colspan=6><em>6'h00, Unused/Reserved</em></TD> |
<TD>1'b0</TD> |
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR> |
<TR><!-- --> |
<TD colspan=1>1'b0</TD> |
<TD colspan=5>Rotate amount</TD> |
<TD colspan=6><em>6'h00, Unused/Reserved</em></TD> |
<TD>1'b0</TD> |
<TD colspan=8>Immediate</TD><TD align=center>N</TD></TR> |
<TR><TH>LOD</TH><TD colspan=4>4'h6</TD> |
<TD colspan=4>Resulting Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Address: Register+Immediate, or Immediate</TD><TD align=center rowspan=2>N</TD></TR> |
<TR><TH>STO</TH><TD colspan=4>4'h7</TD> |
<TD colspan=4>Data Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Address: Register+Immediate, or Immediate</TD></TR> |
<TR><TH>SUB</TH><TD colspan=4>4'h8</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD><TD rowspan=8 align=center>Y</TD></TR> |
<TR><TH>AND</TH><TD colspan=4>4'h9</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>ADD</TH><TD colspan=4>4'ha</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>OR</TH><TD colspan=4>4'hb</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>XOR</TH><TD colspan=4>4'hc</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>LSL/ASL</TH><TD colspan=4>4'hd</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>ASR</TH><TD colspan=4>4'he</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
<TR><TH>LSR</TH><TD colspan=4>4'hf</TD> |
<TD colspan=4>Result Reg</TD><TD colspan=3>Conditions</TD> |
<TD colspan=21>Operand B</TD></TR> |
</TABLE> |
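<P>For the instructions in the table that share the common layout (4-bit
op code, 4-bit register, 3-bit condition, 21-bit Operand B), packing and
unpacking a word looks roughly like this. It is a Python sketch for
illustration only: the names <tt>OPB_OPCODES</tt>, <tt>encode_opb</tt> and
<tt>split_fields</tt> are hypothetical, and MOV, LODI, ROL and the 4'h4
group are left out because their layouts differ.
<PRE>
```python
# Op codes for the instructions that take a plain Operand B (from the table).
OPB_OPCODES = {
    "CMP": 0x0, "BTST": 0x1, "LOD": 0x6, "STO": 0x7,
    "SUB": 0x8, "AND": 0x9, "ADD": 0xA, "OR": 0xB,
    "XOR": 0xC, "LSL": 0xD, "ASR": 0xE, "LSR": 0xF,
}


def encode_opb(mnemonic, reg, cond, opb):
    """Pack op code (31..28), register (27..24), condition (23..21), Operand B (20..0)."""
    return ((OPB_OPCODES[mnemonic] << 28) | ((reg & 0xF) << 24)
            | ((cond & 0x7) << 21) | (opb & 0x1FFFFF))


def split_fields(word):
    """Inverse of encode_opb: pull the four fields back out of a 32-bit word."""
    return {
        "opcode": (word >> 28) & 0xF,
        "reg": (word >> 24) & 0xF,
        "cond": (word >> 21) & 0x7,
        "opb": word & 0x1FFFFF,
    }


if __name__ == "__main__":
    # SUB $1,SP with SP as register 13, unconditional.
    print(hex(encode_opb("SUB", 13, 0, 1)))
```
</PRE>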
<H2>Derived Instructions</H2> |
<TABLE border> |
<TR><TH>Mapped</TH><TH>Actual</TH><TH WIDTH=65%>Notes</TH></TR> |
<TR><TD valign=top>ADD Ra,Rx<BR> |
ADDC Rb,Ry</TD><TD valign=top>ADD Ra,Rx<BR>ADD.C $1,Ry<BR>ADD Rb,Ry</TD><TD valign=top>Add with carry</TD></TR> |
<TR><TD valign=top rowspan=2>BRA.cond +/-$Addr</TD><TD> |
MOV.cond $Addr+PC,PC</TD><TD>Branch/jump on condition. Works for 14-bit address offsets.</TD></TR>
<TR><TD>LDI $Addr,Rx<BR>
ADD.cond Rx,PC</TD><TD>Branch/jump on condition. Works for
23-bit address offsets, but costs a register, an extra instruction,
and sets the flags.</TD></TR>
<TR><TD valign=top>BNC PC+$Addr</TD><TD> |
TEST $Carry,CC<BR> |
MOV.Z PC+$addr,PC</TD> |
<TD>Example of a branch on an unsupported |
condition, in this case a branch on not carry</TD></TR> |
<TR><TD valign=top>CLRF.NZ Rx</TD><TD>XOR.NZ Rx,Rx</TD><TD>Clear Rx, and flags, if the Z-bit is not set</TD></TR> |
<TR><TD valign=top>CLR Rx</TD><TD>LDI $0,Rx</TD><TD>Clears Rx, leaves flags untouched. This instruction cannot be conditional.</TD></TR> |
<TR><TD valign=top>EXCH.W Rx</TD><TD>ROL $16,Rx</TD><TD valign=top>Exchanges the top and bottom 16-bit words of Rx</TD></TR>
<TR><TD valign=top>HALT</TD><TD>OR $SLEEP,CC</TD><TD valign=top>Executed while in interrupt mode. In user mode this is simply a wait-until-interrupt instruction.</TD></TR>
<TR><TD valign=top>INT</TD><TD>AND $!GIE,CC</TD><TD>Without setting an
interrupt flag or trap vector, the O/S might not know what to do with
this instruction. Therefore the trap version is recommended.</TD></TR>
<TR><TD valign=top>IRET</TD><TD>OR $GIE,CC</TD><TD></TD></TR>
<TR><TD valign=top>JMP R6+$Addr</TD><TD>MOV $Addr(R6),PC</TD><TD> </TD></TR> |
<TR><TD valign=top rowspan=2>JSR PC+$Addr</TD><TD> |
SUB $1,SP<BR> |
MOV $3+PC,R0<BR> |
STO R0,1(SP)<BR> |
MOV $Addr+PC,PC<br> |
ADD $1,SP</TD><TD>Jump to Subroutine.</TD></TR> |
<TR><TD>MOV $3+PC,R12<BR>MOV $addr+PC,PC</TD><TD>This is the high speed |
version of the call, necessitating a register to hold the last |
PC address. In its favor, this method doesn't suffer the mandatory |
memory access of the other approach.</TD></TR> |
<TR><TD valign=top>JTU</TD><TD>OR $GIE,CC</TD><TD>Also known as a JUMP-To-USER |
space command, also known as IRET.</TD></TR> |
<TR><TD valign=top>LDI.l $val,Rx</TD><TD> |
LDIHI HIBITS($val),Rx<BR> |
LDILO LOBITS($val),Rx</TD><TD>Sadly, there's not enough instruction |
space to load a complete immediate value into any register. |
Therefore, fully loading any register takes two cycles. |
The LDIHI (load immediate high) and LDILO (load immediate low) |
instructions have been created to facilitate this.</TD></TR> |
<TR><TD valign=top>LOD.b $addr,Rx</TD><TD> |
LDI $addr,Ra<BR> |
LDI $addr,Rb<BR> |
LSR $2,Ra<BR> |
AND $3,Rb<BR> |
LOD (Ra),Rx<BR> |
LSL $3,Rb<BR> |
SUB $32,Rb<BR> |
ROL Rb,Rx<BR> |
AND $0ffh,Rx</TD><TD>This CPU is designed for 32-bit word
length instructions. Byte addressing is not supported by the CPU or |
the bus, so it therefore takes more work to do.<P>Note that in |
this example, $Addr is a byte-wise address, where all other addresses |
are 32-bit wordlength addresses. For this reason, we needed to |
drop the bottom two bits.</TD></TR> |
<TR><TD valign=top>LSL $1,Rx<BR>LSLC $1,Ry</TD> |
<TD>LSL $1,Ry<BR> |
LSL $1,Rx<BR> |
OR.C $1,Ry</TD><TD>Logical shift left with carry. Note that the |
instruction order is now backwards, to keep the conditions valid. |
That is, LSL sets the carry flag, so if we did this the other way
with Rx before Ry, then the condition flag wouldn't have been right |
for an OR correction at the end.</TD></TR> |
<TR><TD valign=top>LSR $1,Rx<BR>LSRC $1,Ry</TD><TD> |
CLR Rz<BR> |
LSR $1,Ry<BR> |
LDIHI.C $8000h,Rz<BR> |
LSR $1,Rx<BR> |
OR Rz,Rx</TD><TD>Logical shift right with carry</TD></TR> |
<TR><TD valign=top>NEG Rx</TD><TD>XOR $-1,Rx<BR>ADD $1,Rx</TD><TD> </TD></TR> |
<TR><TD valign=top>NOOP</TD><TD>NOOP</TD><TD>While there are many |
operations that do nothing, such as MOV Rx,Rx, or OR $0,Rx, these |
operations have consequences in that they might stall the bus if |
Rx isn't ready yet. For this reason, we have a dedicated NOOP |
instruction.</TD></TR> |
<TR><TD valign=top>NOT Rx</TD><TD>XOR $-1,Rx</TD><TD> </TD></TR> |
<TR><TD valign=top>POP Rx</TD><TD>LOD $-1(SP),Rx<BR>ADD $1,SP</TD><TD>Note |
that for interrupt purposes, one can never depend upon the value at |
(SP). Hence you read from it, then increment it, lest having |
incremented it first, something then comes along and writes to that
value before you can read the result.</TD></TR> |
<TR><TD valign=top>PUSH Rx</TD><TD> |
SUB $1,SP<BR> |
STO Rx,$1(SP)</TD><TD> </TD></TR> |
<TR><TD valign=top>RESET</TD><TD>STO $1,$watchdog(R12)<BR>NOOP<BR>NOOP</TD><TD> |
This depends upon the peripheral base address being in R12. |
<P>Another opportunity might be to jump to the reset address from within |
supervisor mode. |
</TD></TR> |
<TR><TD valign=top rowspan=2>RET</TD><TD>LOD $-1(SP),R0<BR> |
MOV $-1+SP,SP<BR> |
MOV R0,PC</TD><TD>An alternative might be to LOD $-1(SP),PC, followed |
by depending upon the calling program to ADD $1,SP.</TD></TR> |
<TR><TD>MOV R12,PC</TD><TD>This is the high(er) speed version, that doesn't |
touch the stack. As such, it doesn't suffer a stall on memory |
read/write to the stack.</TD></TR> |
<TR><TD valign=top>STEP Rr,Rt</TD><TD>LSR $1,Rr<BR>XOR.C Rt,Rr</TD><TD>Step a |
Galois implementation of a Linear Feedback Shift Register, Rr, using |
taps Rt</TD></TR> |
<TR><TD valign=top>STO.b Rx,$addr</TD><TD> |
LDI $addr,Ra<BR> |
LDI $addr,Rb<BR> |
LSR $2,Ra<BR> |
AND $3,Rb<BR> |
SUB $32,Rb<BR> |
LOD (Ra),Ry<BR> |
AND $0ffh,Rx<BR> |
AND $-0ffh,Ry<BR> |
ROL Rb,Rx<BR> |
OR Rx,Ry<BR> |
STO Ry,(Ra)</TD><TD>This CPU and its bus are <em>not</em> optimized
for byte-wise operations.<P>Note that in this example, $addr is a |
byte-wise address, whereas in all of our other examples it is a |
32-bit word address. Further, this instruction implies a byte ordering, |
such as big or little endian.</TD></TR> |
<TR><TD valign=top>SWAP Rx,Ry</TD><TD> |
XOR Ry,Rx<BR> |
XOR Rx,Ry<BR> |
XOR Ry,Rx</TD><TD>While no extra registers are needed, this example
does take three clocks.</TD></TR>
<TR><TD valign=top>TRAP #X</TD><TD>LDILO $x,CC</TD><TD> |
This approach uses the unused bits of the CC register as a TRAP |
address. If these bits are zero, no trap has occurred. Unlike my |
previous approach, which was to use a trap peripheral, this approach |
has no delay associated with it. To work, the supervisor will need |
to clear this register following any trap, and the user will need to |
be careful to only set this register prior to a trap condition. |
Likewise, when setting this value, the user will need to make certain |
that the SLEEP and GIE bits are not set in $x. LDI would also work, |
however using LDILO permits the use of conditional traps. (i.e., |
trap if the zero flag is set.) Should you wish to trap off of a |
register value, you could equivalently load $x into the register and |
then MOV it into the CC register. |
</TD></TR> |
<TR><TD valign=top>TST Rx</TD><TD>TST $-1,Rx</TD><TD valign=top>Set the |
condition codes based upon Rx. Could also do a CMP $0,Rx, |
ADD $0,Rx, SUB $0,Rx, etc, AND $-1,Rx, etc. The TST and CMP |
approaches won't stall future pipeline stages looking for the value |
of Rx.</TD></TR> |
<TR><TD valign=top>WAIT</TD><TD>OR $SLEEP,CC</TD><TD valign=top>Wait
'til interrupt. In an interrupts disabled context, this becomes a |
HALT instruction.</TD></TR> |
</TABLE> |
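<P>A few of the derived instructions in the table above can be checked with
a 32-bit software model. The Python sketch below is illustrative only and
rests on assumptions the table does not spell out: that LDIHI and LDILO
each replace one 16-bit half of the register while leaving the other half
alone, and that LSR leaves the bit it shifts out in the carry flag. All
function names are hypothetical.
<PRE>
```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def rol(x, n):
    """Rotate a 32-bit value left by n bits (EXCH.W is rol(x, 16))."""
    x &= MASK
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK


def neg(x):
    """NEG Rx as XOR $-1,Rx followed by ADD $1,Rx."""
    return ((x ^ MASK) + 1) & MASK


def swap(rx, ry):
    """SWAP Rx,Ry as three XORs; the destination is the last operand."""
    rx ^= ry  # XOR Ry,Rx
    ry ^= rx  # XOR Rx,Ry
    rx ^= ry  # XOR Ry,Rx
    return rx, ry


def ldi_long(val):
    """LDI.l as LDIHI then LDILO (assumed to set one half each)."""
    reg = 0
    reg = (reg & 0x0000FFFF) | (((val >> 16) & 0xFFFF) << 16)  # LDIHI
    reg = (reg & 0xFFFF0000) | (val & 0xFFFF)                  # LDILO
    return reg


def lfsr_step(rr, rt):
    """STEP Rr,Rt: LSR $1,Rr then XOR.C Rt,Rr (carry assumed = shifted-out bit)."""
    carry = rr & 1
    rr = (rr & MASK) >> 1
    if carry:
        rr ^= rt
    return rr & MASK


if __name__ == "__main__":
    print(hex(rol(0x12345678, 16)))
    print(swap(1, 2))
```
</PRE>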
<H2>Pipeline Stages</H2> |
<OL> |
<LI><B>PREFETCH</B>: Read instruction from memory (cache if possible) |
<UL> |
<LI>A lack of an instruction, or a waiting memory operation, stalls the |
pipeline. |
</UL> |
<LI><B>DECODE</B>: Decode instruction into op code, register(s) to read, and |
immediate offset. |
<UL><LI>INPUT: Instruction |
<LI>OUTPUT: 5-bit register address of result, |
5-bit register address of an input and usage flag (this |
register is used), 5-bit register address of second input and |
usage flag, 32-bit immediate offset. |
<P><TT>decode(i_clk, (i_ce)&(~stall), i_instr, i_gie, i_pc, |
o_opcode, o_ccode, |
o_wr_back, o_wr_reg, o_ra_read, o_ra_reg, |
o_rb_read, o_rb_reg, |
o_immediate, |
o_memop, o_wr, o_iodec);</TT> |
<LI>Move instruction gets one decoder, produces two register addresses,
use flag set to one on register B, address A is unused. |
<LI>Load/Store instructions produce two registers, an immediate, and |
two flags |
<LI>Operand B type instructions produce two registers, an immediate, |
and a use flag |
<LI>LDI produces one register, a (longer) immediate, and sets the use |
flag to zero (second register isn't used) |
<LI>This section never stalls. On an external stall it simply doesn't update
its outputs. Outputs are available one clock after
the instruction is valid.
</UL> |
<LI><b>READ OPERANDS</B>: Read registers and apply any immediate values to them. |
<UL> |
<LI>This should stall if a source operand is pending. |
</UL> |
<LI>Split into two tracks: A) <B>ALU</B> accomplish simple instruction, B) <B>MEMOPS</B> memory read/write. |
<UL> |
<LI>Loads stall instructions that access the register until it is |
written to the register set. |
<LI>Condition codes are available upon completion |
<LI>Issuing an instruction to the memory while the memory is busy will |
stall the bus. If the bus deadlocks, only a reset will |
release the CPU. (Watchdog timer, anyone?) |
</UL> |
<LI><B>WRITE-BACK</B>: Conditionally write back the result to register set, applying the |
condition. This routine is bi-re-entrant. Either the memory or the |
simple instruction may request a register write. Memory writes take |
priority, stalling the other track. |
<UL> |
<LI>This stage will stall the pipeline if both memory and op |
try to write to the registers at the same time. |
</UL>
</OL> |
<H2 align=center>Pipeline Logic</H2> |
How the CPU handles some instruction combinations can be telling when |
determining what happens in the pipeline. For example: |
<TABLE> |
<TR><TH>Instruction(s)</TH><TH>Issue</TH><TH>Choice</TH></TR> |
<TR><TD valign=top>Delayed Branching</TD><TD>What happens in debug mode?
That is, what happens when a debugger tries to single step an |
instruction? While I can easily single step the computer in either |
user or supervisor mode from externally, this processor does not appear |
able to step the CPU in user mode from within user mode--gosh, not even |
from within supervisor mode--such as if a process had a debugger |
attached. As the processor exists, I would have one result stepping |
the CPU from a debugger, and another stepping it externally. |
<P>This is unacceptable. |
</TD></TR> |
<TR><TD valign=top>MOV R0,R1<BR>MOV R1,R2</TD><TD valign=top>What value does |
R2 get, the value of R1 before the first move or the value of R0? |
Placing the value of R0 into R1 requires a pipeline stall, and possibly |
two, as I have the pipeline designed.</TD><TD valign=top>R2 must |
equal R0 at the end of this operation. This may stall the pipeline |
1-2 cycles.</TD></TR>
<TR><TD valign=top>CMP R0,R1<BR>MOV.EQ $x,PC</TD><TD valign=top>At issue is |
the same item as above, save that the CMP instruction updates the |
flags that the MOV instruction depends |
upon.</TD><TD valign=top>Condition codes must be updated and available |
immediately for the next instruction without stalling the |
pipeline.</TD></TR> |
<TR><TD valign=top>CMP R0,R1<BR>MOV CC,R2</TD><TD valign=top>At issue is the |
fact that the logic supporting the CC register is more complicated than |
the logic supporting any other register.</TD><TD valign=top>This will |
create a stall, of 1-2 clock cycles</TD></TR> |
<TR><TD valign=top>ADD $5,R0<BR>BTST $8,CC</TD><TD valign=top>Test for |
overflow (or not). At issue is the load of the condition codes for |
the BTST instruction, which takes place two clocks before the prior |
instruction writes it back.</TD><TD valign=top><em>Negotiable for |
simplified logic.</em> Let's stall here, 1-2 clocks.</TD></TR> |
<TR><TD valign=top>ADD $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the |
instruction following the jump take place before the jump? In |
other words, is the MOV to the PC register handled differently from |
an ADD to the PC register?</TD><TD valign=top> |
MOVs and ADDs use the same logic (simplifies the logic).</TD></TR>
<TR><TD valign=top>MOV $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the |
instruction following the jump take place before the jump? Or must the |
pipeline "turn off" any outputs associated with a jump once it |
recognizes that the jump has taken place? Alternatively, the pipeline |
could stall until the result of the MOV was |
available.</TD><TD valign=top> |
<em>Negotiable for simplified logic</em></TD></TR> |
<TR><TD valign=top>MOV $x,PC<BR>MOV R0,R1</TD><TD valign=top>Will the |
instruction following the jump take place before the jump sometimes |
but not all times? Might the pipeline not be full when the jump |
takes place, and thus the MOV instruction never gets loaded due to a |
stalled pre-fetch? Or is the MOV a dependable instruction, guaranteed |
to be executed (or not) no matter what the JMP does? Perhaps the |
compiler is required to insert 2-3 NOOPs following a jump, just |
to keep the pipeline doing something reliably? |
</TD><TD valign=top> |
<em>Negotiable (at present). Highly desired that the behavior |
is the same regardless of the prefetch speed.</em></TD></TR> |
<TR><TD valign=top>MOV.EQ $x,PC<BR>MOV $y,PC</TD><TD valign=top>Where will |
instructions take place next? On a delayed jump instruction, this |
means that instruction $x will be executed followed by $y, and |
execution will not continue at $x as |
desired.</TD><TD><em>Negotiable.</em></TD></TR> |
</TABLE> |
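<P>The stall decisions in the table above can be sketched as a small Python
model. This is only an illustration under stated assumptions: the
<tt>Insn</tt> record and the <tt>needs_stall</tt> function are hypothetical
names, not part of the CPU, and the model assumes that a condition such as
<tt>.EQ</tt> gets the flags without a stall while an explicit read of CC as
an operand has to wait.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical, simplified instruction record (not the real encoding)."""
    text: str
    writes: frozenset = frozenset()  # registers written back
    reads: frozenset = frozenset()   # registers read as operands
    sets_flags: bool = False         # True if the condition codes are updated


def needs_stall(prev: Insn, nxt: Insn) -> bool:
    """Return True if nxt reads, as an operand, something prev has yet to write.

    A condition such as .EQ is deliberately not counted as a read of CC,
    since the table asks for flags to be available without a stall.
    """
    pending = set(prev.writes)
    if prev.sets_flags:
        pending.add("CC")  # assumption: flag updates land in the CC register
    return bool(pending & nxt.reads)


mov_r0_r1 = Insn("MOV R0,R1", writes=frozenset({"R1"}), reads=frozenset({"R0"}))
mov_r1_r2 = Insn("MOV R1,R2", writes=frozenset({"R2"}), reads=frozenset({"R1"}))
cmp_r0_r1 = Insn("CMP R0,R1", reads=frozenset({"R0", "R1"}), sets_flags=True)
mov_eq_pc = Insn("MOV.EQ $x,PC", writes=frozenset({"PC"}))
mov_cc_r2 = Insn("MOV CC,R2", writes=frozenset({"R2"}), reads=frozenset({"CC"}))

pairs = [(mov_r0_r1, mov_r1_r2), (cmp_r0_r1, mov_eq_pc), (cmp_r0_r1, mov_cc_r2)]
for prev, nxt in pairs:
    print(f"{prev.text}; {nxt.text}: stall={needs_stall(prev, nxt)}")
```

<P>The model says only whether a stall is needed. The 1-2 cycle length
depends on the pipeline depth and is left out.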
<P>As I've studied this, I find several approaches to handling pipeline |
issues. These approaches (and their consequences) are listed below. |
<TABLE> |
<TR><TH>Condition/Case</TH><TH>Discussion</TH></TR> |
<TR><TD valign=top>All issued instructions complete |
<BR>Stages stall individually</TD><TD valign=top>What about a |
slow pre-fetch? <P>Nominally, this works well: any issued instruction |
just runs to completion. If there are four issued instructions in the |
pipeline, with the writeback instruction being a write-to-PC |
instruction, the other three instructions naturally finish. |
<P>This approach fails when reading instructions from the flash, |
since such reads require N clocks to complete. Thus |
there may be only one instruction in the pipeline if reading from flash, |
or a full pipeline if reading from cache. Each of these cases |
would produce a different result. |
<P>This is unacceptable.</TD></TR> |
<TR><TD valign=top>Issued instructions may be canceled |
<BR>Stages stall individually</TD><TD valign=top>First problem: |
Memory operations cannot be canceled: even reads may have side effects |
on peripherals that cannot be canceled later. Further, in the case of |
an interrupt, it's difficult to know what to cancel. What happens in |
a MOV.C $x,PC followed by a MOV $y,PC instruction? Which gets |
canceled?<P>Because it isn't clear what would need to be canceled, |
this is not doable.</TD></TR> |
<TR><TD valign=top>All issued instructions complete. |
<BR>All stages are filled, or the entire pipeline |
stalls.</TD><TD valign=top>What about debug control? What about |
register writes taking an extra clock stage? MOV R0,R1; MOV R1,R2 |
should place the value of R0 into R2. How do you restart the pipeline |
after an interrupt? What address do you use? The last issued |
instruction? But the branch delay slots may make that invalid! |
<P>Reading from the CPU debug port in this case yields inconsistent |
results: the CPU will halt or step with instructions stuck in the |
pipeline. Reading registers will give no indication of what is going |
on in the pipeline, just the results of completed operations, not of |
operations that have been started and not yet completed. |
Perhaps we should just report the state of the CPU based upon what |
instructions (PC values) have successfully completed? Thus the |
debug instruction is the one that will write registers on the next |
clock. |
<UL>Suggestion: Suppose we load extra information into the two |
CC registers for debugging intermediate pipeline stages?</UL> |
<P>The next problem, though, is how to deal with the read operand |
pipeline stage needing the result from the register pipeline.</TD></TR> |
<TR><TD valign=top>All instructions that enter into the memory module <em>must</em> |
complete. Issued instructions from the prefetch, decode, or operand |
read stages may or may not complete. Jumps into code must be valid, |
so that interrupt returns may be valid. All instructions entering the |
ALU complete.</TD><TD valign=top>This looks to be the simplest approach. |
While the logic may be difficult, this appears to be the only |
re-entrant approach. |
|
<P>A <tt>new_pc</tt> flag will be high anytime the PC changes in an |
unpredictable way (i.e., it doesn't increment). This includes jumps |
as well as interrupts and interrupt returns. Whenever this flag may |
go high, memory operations and ALU operations will stall until the |
result is known. When the flag does go high, anything in the prefetch, |
decode, and read-op stage will be invalidated. |
</TD></TR> |
</TABLE> |
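<P>The effect of the <tt>new_pc</tt> flag described in the last row can be
sketched in Python. This is a minimal model, not the actual logic: the stage
names, the <tt>apply_new_pc</tt> function, and the placement of the jump in
the ALU/memory stage when the flag goes high are all assumptions made for
illustration.

```python
# Stages that are invalidated when new_pc goes high (names are hypothetical).
EARLY_STAGES = ("prefetch", "decode", "read_op")


def apply_new_pc(pipeline: dict, new_pc: bool) -> dict:
    """Return a copy of the pipeline, with early stages emptied if new_pc is high.

    None marks a stage that holds no valid instruction.
    """
    result = dict(pipeline)
    if new_pc:
        for stage in EARLY_STAGES:
            result[stage] = None
    return result


# Fast prefetch (e.g. from cache): every stage behind the jump is occupied.
full = {"prefetch": "insn C", "decode": "insn B", "read_op": "insn A",
        "alu_mem": "MOV $x,PC", "write_back": "earlier insn"}
# Slow prefetch (e.g. from flash): the stages behind the jump are mostly empty.
sparse = {"prefetch": None, "decode": None, "read_op": "insn A",
          "alu_mem": "MOV $x,PC", "write_back": "earlier insn"}

print(apply_new_pc(full, True) == apply_new_pc(sparse, True))
```

<P>Both pipelines end up in the same state after the flush, so in this model
the instructions that complete do not depend on how fast the prefetch was
able to fill the pipeline.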
</BODY></HTML> |