URL
https://opencores.org/ocsvn/forwardcom/forwardcom/trunk
Subversion Repositories forwardcom
[/] [forwardcom/] [manual/] [fwc_basic_architecture.tex] - Rev 142
Go to most recent revision | Compare with Previous | Blame | View Log
% chapter included in forwardcom.tex \documentclass[forwardcom.tex]{subfiles} \begin{document} \RaggedRight \chapter{Basic architecture} This chapter gives an overview of the most important features of the ForwardCom instruction set architecture. Details are given in the subsequent chapters. \section{A fully orthogonal instruction set} The ForwardCom instruction set is fully orthogonal in all respects. Where other instruction sets have a large number of different instructions for different register types, operand types, operand sizes, addressing modes, etc., ForwardCom has fewer instructions, but many variants of each instruction. This modular design makes the hardware implementation much simpler. The same instruction can use integer operands of all sizes and floating point operands of all precisions. It can use register operands, memory operands or immediate operands. It can use many different addressing modes. Instructions can be coded in short forms with two operands where the same register is used for destination and source operand, or longer forms with three operands. It can work with scalars or vectors of any size. It can have predication or masks for conditional execution, and it can have optional flag inputs for determining rounding mode, exception control and other details, where appropriate. Data constants of all types can be included in the instructions and compressed in various ways to reduce the instruction size. \subsubsection{Rationale} The orthogonality is implemented by a standardized modular design that makes the hardware implementation simpler. It also makes compilation simpler and more flexible and makes it easier for the compiler to convert linear code to vector code. \vv The support for immediate constants of all types is an improvement over current systems. Most current systems store floating point constants in a data segment and access them through a 32-bit address in the instruction code. This is a waste of data cache space and causes many cache misses because the data are scattered around in different sections. Replacing a 32-bit address with a 32-bit immediate constant makes the code more efficient without increasing the code size. Extensions to allow 64-bit immediate constants are possible at the cost of having instructions with triple size. \section{Instruction size} The ForwardCom instruction set uses a 32-bit word size for code. An instruction can consist of one, two, or three 32-bit words. It is possible to add future extensions with instruction sizes of four or more words, but there is currently no need for this. \subsubsection{Rationale} A CISC architecture with many different instruction sizes is inefficient in superscalar processors where we want to execute several instructions per clock cycle. The decoding front end is often a bottleneck, especially in the x86 architecture. The decoder has to determine the length of the first instruction before it knows where the next instruction begins. The ``instruction length decoding'' is a fundamentally serial process which makes it difficult to decode multiple instructions per clock cycle. Modern x86 microprocessors have an extra ``micro-operations cache'' after the decoder in order to circumvent this bottleneck. \vv Here, it is desired to have as few different instruction sizes as possible and to make it easy to determine the length of each instruction. We want a small instruction size for the most common simple instructions, but we also need a larger instruction size in order to accommodate things like a larger register set, instructions with multiple operands, vector operations with advanced features, 32-bit address offsets, and large immediate constants. This proposal is a compromise between code compactness, easy decoding, and space for advanced features. The instruction size is indicated by only two bits. A decoder can find the instruction boundaries in n words by means of a simple Boolean function of 2n inputs. \section{Register set} There are 32 general purpose registers (r0--r31) of 64 bits each, and 32 vector registers (v0--v31) of variable length. The maximum vector length is different for different hardware implementations. The general purpose registers can be used for integers of up to 64 bits as well as for pointers. The vector registers can be used for scalars and vectors of integers and floating point numbers. \vv The following special registers are defined and visible at the application program level. All have 64 bits: \begin{itemize} \item Instruction pointer (IP) \item Data section pointer (DATAP) \item Thread data pointer (THREADP) \item Stack pointer (SP) \item Numeric control register (NUMCONTR) \end{itemize} The stack pointer is identical to r31. The other special registers cannot be accessed as ordinary registers. \vv There is no dedicated flags register. Registers r0--r6 and v0--v6 can be used for masks, predicates and floating point option flags to control attributes such as rounding mode and exception control. \vv The unused part of a register is always set to zero. This means that integer operations with an operand size smaller than 64 bits and vector operations with a vector length smaller than the maximum will always set the unused bits of the destination register to zero. \subsubsection{Rationale} The number of registers is a compromise between code density and flexibility. The cost of spilling registers to memory is usually important only in the critical innermost loop, which is unlikely to need more than 32 registers. \vv We can avoid false dependencies on the previous value of a register by setting all unused register bits to zero rather than leaving them unchanged. The hardware can save power by disabling the unused parts of execution units and data buses. \vv A dedicated flags or status register is unfeasible for vector processing, parallel processing, out-of-order processing, and instruction scheduling. \vv The reason for handling floating point scalars in the vector registers rather than in separate registers is to make it easy for a compiler to convert scalar code including function calls to vector code. Floating point code often contains calls to mathematical library functions. A library function with variable-length vectors as input and output can be used for both scalars and vectors, and the compiler can easily vectorize code that contains such library function calls. \section{Vector support} A vector register can contain signed or unsigned integers of 8, 16, 32, 64, and optionally 128 bits, or floating point numbers of single and double precision. There is limited support for floating point numbers in half precision and optional support for quadruple precision. All elements of a vector must have the same type. The elements of a vector are processed in parallel. For example, a vector addition will produce the sum of two vectors in a single operation. \vv The vector registers have variable length. Each vector register has extra bits for storing the length of the vector. The maximum vector length depends on the hardware. For example, if the hardware supports a maximum vector length of 64 bytes and a particular application needs only 16 bytes, then the vector length is set to 16. \vv Some instructions need to specify the length of a vector explicitly, for example when reading a vector from memory. These instructions use a general purpose register for specifying the vector length. The length is usually indicated as the number of bytes, not the number of vector elements. \vv The maximum length supported by the processor must be a power of 2. The actual length specified does not need to be a power of 2. If the specified length is longer than the maximum length, then the maximum length is used. \vv The contents of a vector register can arbitrarily be interpreted as any of the types and element sizes supported. For example, the hardware does not prevent the application of integer instructions on a vector that contains floating point data. It is the responsibility of the programmer that the code makes sense. \section{Vector loops} \label{vectorLoops} A special addressing mode is provided to make vector loops more compact and efficient. It uses a pointer P to the end of an array, and a negative index J, and calculates the address of a memory operand as P-J, where P and J are general purpose registers. This makes it possible to make a loop through an array as illustrated by the following pseudocode: \vv \begin{lstlisting}[frame=single] P = address of array J = size of array (in bytes) L = maximum vector length (depends on processor) X = a vector register P += J; // point to end of array while (J > 0) { X = whatever_operation(X, [P-J], vector_length = J) J -= L; } \end{lstlisting} \vv This loop works in the following way: P points to the end of the array. J is the remaining number of array bytes; counting down until the loop is finished. The loop reads one vector at a time from the array at the address (P-J). J is larger than the maximum vector length L in all but the last iteration of the loop. This makes the processor use the maximum vector length. If the array size is not divisible by the maximum vector length then the last iteration of the loop will use a smaller vector length that fits the remaining number of elements. Obviously, the loop can contain any number of vector read, vector write, and vector arithmetic instructions, using the same principle. \vv This loop will work on different processors with different maximum vector lengths \textit{without knowing the maximum vector length at compile time}. Thus, the same piece of software will work on different microprocessors with different vector lengths without the need to compile separately for each microprocessor. \vv A further advantage is that no extra code is needed after the loop to handle remaining elements in the case that the array size is not divisible by the vector length. The loop overhead can be reduced to a single instruction (sub\_maxlen/jump\_pos) which subtracts L from J and jumps back if the result is positive. \subsubsection{Rationale} Most current systems have fixed vector lengths. If different processors have different vector lengths then you have to compile the code separately for each vector length. Every time a new processor with longer vectors comes on the market, you have to compile a new version of the code for the new vector length, using newly defined extensions to the instruction set. It usually takes several years for the new software to be developed and to penetrate the mainstream market. It is so costly for software producers to develop, test, and maintain different versions of their code for each vector length that this is rarely done. \vv A further problem with current systems is that it is impossible to save a vector register in a way that is guaranteed to be compatible with future processors with longer vectors. This is no problem with the ForwardCom design because the vector length is stored in the vector register itself. Instructions are provided for saving and restoring vectors of variable length and for storing only the part of a vector register that is actually used. \vv The ForwardCom design makes it possible to take advantage of a new processor with longer vector registers immediately without recompiling the code. The loop method described above makes this easy and very efficient. You do not need different versions of the code for different processors. \vv It is possible to obtain the same effect without the special negative addressing mode by inverting the sign of J and allowing a negative value in the register that specifies the vector length while using the absolute value for the actual vector length. This solution is less elegant and more confusing, but it may possibly be included in other instruction sets by allowing negative values when specifying a vector length. \vv Loop unrolling is generally not necessary. The loop overhead is already reduced to a single instruction and a superscalar processor will execute multiple iterations in parallel if dependency chains are not too long. Loop unrolling with multiple accumulators may be useful for hiding a loop-carried dependency. In this case, you will either insert a loop control instruction after each section in the unrolled code or calculate the loop iteration count before the loop. \vv The ForwardCom design has no practical limit to the vector length that a microprocessor can support. A large microprocessor with very long vectors can be useful for calculations with a high amount of data parallelism. Other solutions to obtain high performance on parallel data processing have been discussed, such as rolling register stacks and software pipelining, but it was concluded that long vectors is the method that can be implemented most efficiently in the microprocessor as well as in the compiler. \section{Maximum vector length} The maximum length of vector registers will be different for different processors. The maximum length must be a power of 2. It can be as large as desired and should be at least 16 bytes. Each instruction can use a smaller length, which does not need to be a power of 2. \vv The maximum length may be different for different element sizes. For example, the maximum length for 32-bit integers can be 32 bytes to contain eight integers, while the maximum length for 8-bit integers could be 16 bytes to contain 16 smaller numbers. However, the maximum length must be the same for different types with the same element size. For example, the maximum length for double precision floating point numbers must be the same as for 64-bit integers because loops are likely to contain both types when integer vectors are used as masks for floating point vectors. The maximum length for a 32-bit element size cannot be less than for any other element size or operand type. This rule guarantees that it is possible to save a complete vector using a 32-bit operand type. \vv The maximum vector length should generally be the same for all instructions for the same data type, but there may be exceptions for instructions that are particularly expensive to implement. \vv It is possible for an application program or the operating system to reduce the maximum vector length. This can be useful if a smaller vector length is more appropriate for a particular purpose. \vv It is also possible to increase the apparent maximum vector length for purposes of emulation. Virtual vector registers that are bigger than what the hardware supports may be emulated through traps (synchronous interrupts) in order to verify the functionality of a program on processors with a longer maximum vector length than is currently available. \vv When an instruction specifies a longer vector than the maximum, then the maximum length is used (unless the emulation of larger vectors is activated). This is necessary for the efficient implementation of vector loops as described above on page \pageref{vectorLoops}. If the specified vector length is zero or negative then the result will be a vector of zero length. \section{Instruction masks} Most instructions can have a mask register which can be used for conditional execution and for specifying various options. Instructions with general purpose registers use one of the registers r0--r6 as a mask register or predicate. Bit 0 of the mask register indicates whether the operation is executed or not. \vv The instruction will produce the normal result when bit 0 of the mask is one, and a fallback value when this bit is zero. The fallback value can be the value of the first source operand, a separate register, or zero. \vv This mechanism can be vectorized. Instructions with vector registers use one of the vector registers v0--v6 as mask register. The calculation of each vector element is conditional on the corresponding element in the mask register. \vv An arithmetic operation with a mask of zero can never generate an error condition. A memory read or write with an illegal address and a mask of zero may or may not generate an error trap. \vv Additional bits in the mask register are used for various options, overriding the values in the numeric control register. See page \pageref{table:maskBits} for details. \section{Addressing modes} All memory addressing is relative to a base pointer. Memory operands are addressed in this general form: \begin{lstlisting} Address = Base pointer + Index * Scale + Offset \end{lstlisting} Where Base pointer is a 64-bit base pointer, Index is a 64-bit index register, Scale is a scale factor, and Offset is a constant. A base pointer is always present; the other elements are optional. \vv The base pointer can be a general purpose register or it can be the instruction pointer (IP), data section pointer (DATAP), thread data pointer (THREADP), or stack pointer (SP). \vv The index register can be one of the registers r0--r30. A value of 31 in the index register field means no index register. \vv A limit can be applied to the index register in the form of an integer constant. A trap is generated if the index register is bigger than the limit in an unsigned comparison. \vv The scale factor is equal to the operand size (in bytes) for scalar operands and broadcasts. The scale factor is 1 for vector operands. A special addressing mode with Scale = -1 is also available, as explained on page \pageref{vectorLoops}. \vv The offset is a sign-extended integer of 8, 16, or 32 bits. 8-bit offsets are multiplied by the operand size. Offsets of 16 and 32 bits have no multiplier. \vv Memory operands in vector instructions can load a vector of a specified length, a scalar, or a broadcast scalar. The length of the loaded or broadcast vector is specified by a general purpose register. The specified length is the number of bytes. The number of vector elements is the number of bytes divided by the operand size. Register r31, which is the stack pointer, cannot be used for specifying vector length. Instead, a value of 31 in the length register field will give a scalar. \vv Jumps and calls specify a target address relative to the instruction pointer. The relative address is specified with a signed offset of 8, 16, 24, or 32 bits, multiplied by the code word size which is 4. This will cover an address range of $\pm$ 8 gigabytes with the 32-bit offset. \vv \subsubsection{Rationale} A 64-bit flat address space is used. Relative addressing is used in order to avoid 64-bit addresses in the instruction code. In the rare case that a 64-bit absolute address is needed, it must be loaded into a register which is then used as a pointer. \vv Addressing with an index scaled by the operand size is useful for arrays. A limit can be applied to the index so that array bounds can be checked without any extra instructions. \vv Addressing with a negative index is useful for the efficient implementation of vector loops described on page \pageref{vectorLoops}. \vv The addressing modes specified here will cover all common applications, including arrays, vectors, structures, classes, and stack frames. \vv Support for addressing modes with both base pointer, index, and direct offset is optional because this requires two adders in the address-calculation stage in the pipeline which might limit the maximum clock frequency. \end{document}
Go to most recent revision | Compare with Previous | Blame | View Log