URL https://opencores.org/ocsvn/forwardcom/forwardcom/trunk

# Subversion Repositoriesforwardcom

## [/] [forwardcom/] [manual/] [fwc_microarchitecture_and_pipeline.tex] - Rev 166

% chapter included in forwardcom.tex
\documentclass[forwardcom.tex]{subfiles}
\begin{document}
\RaggedRight

\chapter{Microarchitecture and pipeline design}
The ForwardCom instruction set is intended to facilitate a consistent and efficient design of the pipeline of a superscalar microprocessor. Normal instructions can have no more than one destination operand, up to three source operands, a mask register, and a fallback register. A source operand can be a register, a memory operand, or an immediate constant. An instruction cannot have more than one memory operand. The total number of input registers to an instruction, including source operands, mask, fallback, memory base pointer, index, and vector length specifier cannot exceed five.
\vv

Some instruction formats have multiple immediate operand fields. Any extra immediate operand field can be used for option bits, memory offset, or memory limit.
\vv

A high performance pipeline may be designed as superscalar with the following stages. Simple designs may have fewer stages. The number of pipeline stages should be as few as possible in order to reduce the latency of non-predicted jumps.

\begin{itemize}
\item  Fetch. Fetching blocks of code from the instruction cache, one cache line at a time, or as determined by the branch prediction machinery.

\item  Instruction length decode. Determine the length of each instruction. Distribute the first P instructions into each their pipeline, where P is the number of parallel pipelines implemented. Excess instructions may be queued for the next clock cycle. The length of an instruction is determined by two bits of the first code word in order to simplify this process.

\item  Instruction decode. Identify and classify all operands, opcode, and option bits. Determine input and output dependencies. A consistent template system simplifies this step.

\item  Calculate address and length of memory operand. Check access rights.

\item  Instruction queue.

\item  Put instructions into reservation station.

\item  Read memory operand. Schedule for execution units. Do calculations that depend on immediate operand only.

\item  Execution units.

\item  Retire or branch.
\end{itemize}

A disadvantage of register renaming and out-of-order execution is that it makes the pipeline long. This increases the branch misprediction delay. A simpler design may have one or more in-order pipelines for integer instructions in general purpose registers and one pipeline for vector registers. Such a design will have fewer pipeline stages, for example:
\vv

\begin{itemize}
\item  Fetch. Fetching blocks of code from the instruction cache.

\item  Instruction decode. Identify and classify all operands, opcode and option bits. Determine input and output dependencies.

\item  Register read. Read any registers needed for address calculation. Other register operands are read as well if the values are available at this stage.

\item  Calculate address and length of memory operand. Check access rights.

\item  Read memory operand. Do calculations that depend on immediate operand only.

\item  Execution units.

\end{itemize}
\vv

It is not necessary to split read-operate instructions into micro-operations if the reading of memory operands is done in a separate pipeline stage and instructions are allowed to wait until the memory operand has been read.
\vv

Each stage in the pipeline should ideally require only one clock cycle unless they are waiting for an operand.  Most instructions will use only one clock cycle in the execution unit. Multiplication and floating point addition need a pipelined execution unit with several stages. Division and square root may use a separate state machine.
\vv

Jump, branch, call, and return instructions also fit into this pipeline design.
\vv

A reservation station has to consider all the input and output dependencies of each instruction. Each instruction can have up to four or five input dependencies and one output dependency.
\vv

There may be multiple execution units so that it is possible to run multiple instructions in the same clock cycle if their operands are independent.
\vv

An efficient out-of-order processing requires renaming of the general purpose registers and vector registers, but not necessarily the special registers.
\vv

Some current CPUs have a stack engine'' in order to predict the value of the stack pointer for a push, pop, or call instruction when preceding stack operations are delayed due to operands that are not available yet. Such a system is not needed if we have a separate call stack (see page \pageref{dualStack}). The depth of function calls in most programs is so small that even a moderately small on-chip call stack would rarely need to spill to main memory.
\vv

Branch prediction is important for the performance of a CPUs with long pipelines. We may implement four different branch prediction algorithms: one for ordinary branches, one for loops, one for indirect jumps, and one for function returns. The long form of branch instructions have an option bit for indicating loop behavior. The short form of branch instructions does not have space for such a bit. The initial guess may be to assume loop behavior if the branch goes backwards and ordinary branch behavior if the branch goes forwards. This assumption may be corrected later, if necessary, by the branch prediction machinery. The code following a branch may be executed speculatively until it is determined whether the prediction was right. We may implement features for running both sides of a branch speculatively at the same time. Other ways of reducing branch misprediction delays are discussed on page \pageref{branchProposal} below.
\vv

A full out-of-order design with register renaming requires a lot of chip space for a reservation station typically holding hundreds or pending instructions and hundreds or rename-able temporary registers, including vector registers. A simpler design may prioritize the chip space differently. Instead of a large reservation station, we may have more execution pipelines. Each pipeline is processing its instructions in order, but instructions may execute out of order as long as there are vacant pipelines.
\vv

\section{Vector design}\label{vectorDesign}
The chip layout of a vector processor is typically divided into data lanes''.
An efficient design may divide the vector processing into multiple lanes of 64 (or 128) bits each. Each lane would have its own register file containing a 64-bit slice of each vector register. There would be very little necessary communication between the lanes as long as the output data stay in the same lane as the input data.
\vv

The ForwardCom design allows large vector processors with very long vector registers. The priority of a particular design may be either to have one vector pipeline with very large vectors, or multiple pipelines with a smaller maximum vector lenght.
\vv

A consequence of this lane layout is that transfer of data between the lanes is likely to be slower. We may have one or more data buses connecting the lanes, but long distance connections is often a limited resource.
This means that instructions that transfer data horizontally across a vector, such as permute instructions, may have longer latencies than other vector instructions.
\vv

The scheduler may need to know the instruction latency, and this can be a problem if the latency depends on the distance of data transfer on very long vectors. This problem is addressed by indicating the vector length or the distance of data transfer for such instructions in a separate operand, which always uses the RT register field. This information may be redundant because the vector length is stored in the vector register operands, but the scheduler may need this information as early as possible. The other register operands are typically not ready until the clock cycle where they go to the execution unit, while the vector length is typically known earlier. The microprocessor can read the RT register at the address calculation stage in the pipeline, where it also reads any pointer, index register and vector length for memory operands. This allows the scheduler to predict the latency a few clock cycles earlier. The instruction set provides the extra information about vector length or data transfer length in the RT register for instructions that involve horizontal data transfer, including memory broadcast, permute, insert, extract, and shift instructions, but not broadcasting of immediate constants.
\vv

Each vector lane might have its own data cache. This may be easier to implement than one big data cache with a data bus as wide as the maximum vector length. This solution will be efficient as long as large data arrays are aligned to addresses divisible by the maximum vector length, but it will be less efficient if data arrays are misaligned.
The data traffic to cache and memory is often the limiting bottleneck in modern computer systems. A system with a separate data cache for each vector lane will make it feasible to have a larger total cache size. This may be a useful alternative to having multiple CPU cores. Data transfer between the lanes will be slower than if we have one cache servicing all lanes, but faster than data transfer between different CPU cores. Whether this is an efficient solution depends on the amount of data permutation needed in a specific application.
\vv

\section{Complex instructions}\label{ComplexInstructions}
Some current systems are using a microcode ROM to implement very complex instructions, such as mathematical functions. This has turned out to be inefficient. The use of microcode should be avoided in ForwardCom processors.
\vv

A limited amount of complexity can still be implemented with the flexible design of format templates in ForwardCom. Instructions that do multiple things can be useful because they allow the code to do more work per instruction. Complex instructions should only be implemented if they fit into the streamlined template system, pipeline, and timing constraints.
\vv

A number of complex instructions have been implemented because they offer a definite advantage at relatively low hardware costs. The most complex instructions may be implemented with state machines rather than microcode.
The following instructions have complex functionality:

\begin{itemize}
\item Instructions with memory operand can calculate a memory address, read the value of the memory operand, and do some calculation on this operand. Multiformat instructions can have a memory operand with a choice of different addressing modes. This feature has significant hardware costs because it requires a few extra pipeline stages, but it has a high advantage because it allows a single instruction to do what traditionally requires several instructions in a pure RISC design.

\item Combined arithmetic and branch instructions. All branches depend on the result of some arithmetic or logic operation. Rather than saving the result of the arithmetic instruction in a flag and then make a branch depending on the flag, it is more efficient to combine these two operations into one instruction. The CPU needs the circuitry for both arithmetic and branching anyway, and there are no big costs to combining these two circuits.

\item Loop instructions. It is possible to implement a loop with a single instruction. This instruction can increment a loop counter, compare it with a limit, and branch back if the limit has not been reached. This fits into the general framework of combined arithmetic and branch instructions. Timing problems can be overcome by doing most of the compare operation in parallel with the increment. Another efficient way of making a loop is to decrement a counter down to zero. The special vector loop (see page \pageref{vectorLoops}) is also implemented with a single instruction.

\item Switch-case statements can be coded efficiently with the jump\_relative instruction. This instruction can read a relative pointer from a table of jump addresses, convert it to an absolute address, and jump to this address. Function dispatching can be coded in the same way with the call\_relative instruction. These instructions are straightforward to implement in hardware because they fit into the general pipeline design.

\item Compare instructions with extra boolean operands. Boolean variables are generated by compare instructions and bit test instructions. Boolean variables are often used as input to boolean operations such as AND, OR, XOR. The compare and bit test instructions can use mask registers and fallback registers as extra boolean operands to be combined with the result of the compare. This makes it possible to combine the results of multiple compare operations without the extra instructions or branches for AND, OR, etc. This feature is quite cheap to implement in hardware.

\item mul\_add and add\_add. These instructions have significant hardware costs for floating point operands. The integer versions are less costly.

\item Division and square root. These have high costs in the hardware budget.

\item Push and pop instructions. These instructions are complicated to implement in hardware because they require extra functionality in the decoder to generate multiple micro-operations. They do not require extra functionality in the execution unit. These instructions are so useful that it is recommended to implement them despite the extra complexity.

\item System calls, traps, and interrupts. These are complicated to implement, but necessary in many cases. They may not be needed in small embedded systems.

\item cmp\_swap and other instructions for atomic memory operations. This is complicated to implement, but necessary for efficient mutexes in multithreaded systems.

\end{itemize}

\section{Proposals for reducing branch misprediction delay}\label{branchProposal}
Modern superscalar processors often have quite long pipelines. This gives a long branch misprediction delay. The branch misprediction delay is normally equal to the number of pipeline stages from a branch instruction is fetched until it is executed.
\vv

It may be possible to reduce this delay by executing branch instructions as early as possible in the pipeline. Any preceding instructions that a branch instruction depends on may also be executed early. It is proposed to execute all control transfer instructions (branch, jump, call, and return) in the front end of the pipeline, before any register renaming and scheduler. Any preceding instruction that a branch depends on, could likewise be executed in the front end. This idea is somewhat similar to the principle described in the article:

Sheikh, Rami, James Tuck, and Eric Rotenberg. “Control-Flow Decoupling.” In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 329–340. IEEE Computer Society, 2012.
\vv

The vision is this: There are two sets of ALU's, one in the front end used for control flow and simple integer instructions, and one in the back end used for all other calculations. The front end does not use register renaming, but relies on the permanent register file.
\vv

In most cases, the control flow depends only on simple integer operations such as incrementing a loop counter. The front end ALU's should be able to handle simple integer operations on general purpose registers. The front end may handle all simple integer instructions on general purpose registers if the operands are available. If the capacity of the front end ALU's is insufficient then it may prioritize instructions on those registers that have recently been used as loop counters or branch conditions, but it may be cheaper to implement a few extra ALU's to increase the front end capacity than to implement such a mechanism for prioritization.
\vv

Instructions that are executed in the front end use the permanent register file for both input and output.
The remaining instructions are sent to the out-of-order back end that may use register renaming.
All general purpose register operands are replaced by their value if the value is available before the instruction is sent to the back end.
Instructions that execute in the front end may write their result not only to the permanent register file but also to any pending instruction that needs it.
\vv

The result of this mechanism is that a loop that is controlled only by simple instructions on general purpose registers
will be unrolled in the front end. The instructions in the loop body may be passed on to the back end with the loop counter replaced by its value. The back end may have register renaming so that it can execute the body of the second iteration before it has finished the body of the first iteration, etc.
\vv

All other control flow, such as branches, jumps, calls, returns, etc., will be unrolled in the front end as well so that control flow is effectively resolved before execution as far as it does not depend on delayed data.
\vv

The front end may support a minimum of out-of-order execution in the sense that an instruction that is waiting for an operand should not delay any independent subsequent instructions. But the front end does not need a complicated structure with a long queue, register renaming, and scheduler.
\vv

The front end should have access to the permanent register file, while the back end may use temporary registers with register renaming. The renamed registers will be written to the permanent register file when they retire. We have to keep track of whether the permanent registers are up to date. When the decoder sees an instruction that writes to a register, it will mark this register as invalid in the permanent register file and add a tag to tell which instruction it belongs to. When this instruction is executed (whether in the front end or the back end), the register entry is marked as valid if the tag is matching the instruction that delivered the result.
\vv

The front end may fetch and decode speculatively, but not execute speculatively. It may even fetch both sides of a branch that it expects to have poor prediction. Fetching both sides of a branch becomes cheaper here than with a traditional superscalar design because it only needs to duplicate two pipeline stages (fetch and decode) in order to minimize branch misprediction delays.
\vv

The fetch and decode stages should have higher throughput than the rest of the pipeline so that it can catch up and fill the pipeline after a branch delay. The fetch stage should be able to fetch a large block of code in one clock cycle. The decoder should be able to decode the first one or two instructions in a fetched code block in the first clock cycle, and determine the starting points of the remaining instructions. This will allow it to decode maybe four or more instructions in the second clock cycle after fetching a block of code. The aim of this is to keep the branch misprediction delay as low as possible.
\vv

Multi-way branches with a jump table are particularly critical here because they have a memory operand that may be delayed due to a cache miss. Implementing memory access in the front end may be too complicated. Therefore, such instructions probably have to go to the back end. The out-of-order mechanism may give high priority to these instructions and execute them as soon as the register operands are available. Jump tables are normally placed in the read-only memory block (see page \pageref{memoryModel}). The speed of access to jump tables may be increased by having a separate cache for read-only memory.
\vv

An extra advantage of this design is that the registers used in memory operands for pointer, index, and vector length are likely to be available early so that memory operands can be fetched before they are needed.
\vv

The demand for advanced branch prediction is somewhat relaxed if we can drastically reduce the branch misprediction delay with the measures described here. The branch prediction machinery in modern processors is very complicated and requires a lot of chip space for branch target buffers and pattern/history tables. We can cut down on this and use a simpler branch prediction scheme if branches are resolved in the front end. A few extra ALU's in the front end will use much less chip space than an advanced branch prediction mechanism.
\vv

The early processing of jumps and branches is also possible with a simpler design without register renaming and out-of-order processing. We may have a fast-track in-order pipeline for branches and simple integer instructions without memory operands. Branches often depend only on loop counters and other simple calculations in general purpose registers. These instructions can be executed in the fast-track pipeline in most cases, while the more complicated loop body instructions go to a longer full-featured pipeline. Direct call and return instructions can be executed very early in the pipeline if we have an on-chip call stack because the call stack pointer is independent of the other instructions and registers.
\vv

\end{document}