% chapter included in forwardcom.tex
\documentclass[forwardcom.tex]{subfiles}
\begin{document}
\RaggedRight

\chapter{Description of instructions}
\label{chap:DescriptionOfInstructions}
\vv

\subsection{Data move and conversion instructions}
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A &  6 & vector and g.p. register \\ \hline
1.3 B & 18 & g.p. register, and 8-bit signed constant \\ \hline
2.6   &  6 & g.p. register, and 32-bit signed or float constant \\ \hline
3.1   & 33 & g.p. register, and 64-bit signed or double constant \\ \hline
\end{tabular}
\vv

\vv

Broadcast a constant or the first element of a source vector into all
elements of the destination vector with the length in bytes indicated by a general purpose register.
\vv

This instruction can have a mask but not a fallback register. The fallback value is zero.\\
(This instruction is not called broadcast because that is a reserved keyword).

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 19 & vector and 8-bit signed constant \\ \hline
\end{tabular}
\vv

\vv

Broadcast a small constant to all elements of a vector with maximum length.
\vv

\subsubsection{compress}
\label{table:compressInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 6 & vectors \\ \hline
\end{tabular}
\vv

double v0 = compress(v1, 0)
\vv

All the elements of a vector are converted to half the element size. The length of the output vector will be half the length of the input vector. The OT field specifies the operand type of the input vector. Double precision floating point numbers are converted to single precision. Integer elements are converted to half the size. Support for the following conversions is optional: single precision float to half precision, quadruple precision to double precision, 8-bit integer to 4-bit.
\vv

Overflow options and rounding mode are specified in IM1 as follows:

\label{table:compressOptions}
\begin{tabular}{|p{16mm}|p{130mm}|}
\hline
\bfseries IM1 bits & \bfseries meaning \\ \hline
bit 0-2 & Floating point exception control: \newline
000 = exceptions are controlled by NUMCONTR. See page \pageref{table:FPExceptionResults} \newline
001 = overflow generates NAN code \newline
010 = underflow generates NAN code \newline
011 = overflow and underflow generate NAN code \newline
100 = underflow and inexact generate NAN code \newline
101 = overflow, underflow, and inexact generate NAN code \newline
111 = no conditions generate NAN code
\\ \hline
bit 0-2 & Integer overflow control: \newline
000 = integer overflow wraps around \newline
100 = signed integer overflow gives zero \newline
101 = signed integer overflow gives signed saturation \newline
110 = unsigned integer overflow gives zero \newline
111 = unsigned integer overflow gives unsigned saturation
\\ \hline
bit 3-5 & Floating point rounding mode: \newline
000 = rounding mode determined by NUMCONTR \newline
001 = odd if not exact \newline
100 = nearest or even \newline
101 = down \newline
110 = up \newline
111 = towards zero
\\ \hline
\end{tabular}
\vv

The rounding mode ``odd if not exact'' works in the following way:
Truncate the superfluous mantissa bits. If the result is not exact then set the least significant bit to 1.
This rounding mode is needed to avoid double rounding errors when rounding in multiple steps. Use odd rounding mode except in the last step.
For example, to convert from double precision to half precision, use the odd rounding mode in the first step from double to single precision, then use ``nearest or even'' in the last step from single to half precision.
\vv
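
The double rounding problem can be illustrated with a small integer model. The following is a sketch in Python, not ForwardCom code: mantissas are modelled as non-negative integers from which low-order bits are dropped, and the function names \texttt{round\_nearest\_even} and \texttt{round\_odd} are hypothetical.

```python
def round_nearest_even(x: int, drop: int) -> int:
    """Drop `drop` low bits of x, rounding to nearest with ties to even."""
    q, rem = x >> drop, x & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1
    return q


def round_odd(x: int, drop: int) -> int:
    """Drop `drop` low bits of x; set the lowest bit if anything was lost."""
    q, rem = x >> drop, x & ((1 << drop) - 1)
    return q | 1 if rem else q


x = 0x179  # binary 1.01111001 when the low 8 bits are seen as a fraction

direct = round_nearest_even(x, 8)                                # one step
twice_nearest = round_nearest_even(round_nearest_even(x, 4), 4)  # two steps
odd_then_nearest = round_nearest_even(round_odd(x, 4), 4)        # odd first

print(direct, twice_nearest, odd_then_nearest)  # prints: 1 2 1
```

Rounding to nearest twice gives 2 where a single rounding gives 1. Using odd rounding in the first step reproduces the single-step result.
\vv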

Overflow in integer conversion can be detected by doing the conversion twice, once with an ``overflow gives zero'' option and once with the corresponding saturation option. Overflow has occurred if the two results are different.
\vv
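
The overflow check described above can be sketched as follows. This is a Python model of a single signed element, not ForwardCom code. The mode names \texttt{"wrap"}, \texttt{"zero"} and \texttt{"saturate"} are hypothetical stand-ins for the IM1 settings 000, 100 and 101.

```python
def compress_signed(x: int, bits: int, mode: str) -> int:
    """Narrow a signed integer to `bits` bits.

    mode is "wrap", "zero" or "saturate" (hypothetical names for the
    IM1 overflow options 000, 100 and 101).
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if lo <= x <= hi:
        return x                      # value fits, no overflow
    if mode == "zero":
        return 0
    if mode == "saturate":
        return hi if x > hi else lo
    # wrap around: keep the low bits and reinterpret them as signed
    x &= (1 << bits) - 1
    return x - (1 << bits) if x > hi else x


def overflowed(x: int, bits: int) -> bool:
    """Overflow has occurred if the zero and saturation results differ."""
    return compress_signed(x, bits, "zero") != compress_signed(x, bits, "saturate")


print(overflowed(1000, 16), overflowed(40000, 16), overflowed(0, 16))
# prints: False True False
```

The test is reliable because a saturated result is never zero, so the two results can only be equal when the value fits.
\vv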

NANs are converted by preserving the least significant bits of the payload and the quiet bit. This differs from most other microprocessors, which preserve the most significant bits of binary floating point NAN payloads.
\vv

\subsubsection{compress\_sparse}
\label{table:compressSparseInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 8 & vectors. Optional \\ \hline
\end{tabular}
\vv

int32 v0 = compress\_sparse(v1), mask = v2
\vv

Compress sparse vector elements indicated by mask bits into a contiguous vector.
\vv

The algorithm of this instruction is:
For each element in the mask vector that is true, take an element from the corresponding position in the source vector and append it to the destination vector.
The length of the destination vector will be the number of true mask elements
times the element size.
\vv
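
The algorithm above can be modelled in a few lines of Python. This is only a sketch: vectors are represented as lists of elements and the mask as a list of truth values, which are assumptions made for illustration.

```python
def compress_sparse(source, mask):
    """Pack the source elements whose mask element is true into a new list."""
    dest = []
    for element, selected in zip(source, mask):
        if selected:
            dest.append(element)
    return dest


src = [10, 11, 12, 13, 14, 15]
mask = [1, 0, 0, 1, 1, 0]
dest = compress_sparse(src, mask)

element_size = 4  # bytes per int32 element
print(dest, len(dest) * element_size)  # prints: [10, 13, 14] 12
```

Three mask elements are true, so the destination holds three int32 elements and has a length of 12 bytes.
\vv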

This instruction cannot have a fallback register.
\vv

\subsubsection{concatenate}
\label{table:concatenateInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.6 & 0.1 & vectors \\ \hline
\end{tabular}
\vv

float v0 = concatenate(v1, v2, r3)
\vv

A vector v1 of length r3 bytes and a vector v2 of
length r3 bytes are concatenated into a result vector
of length 2$\cdot$r3, with v2 in the high end.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{expand}
\label{table:expandInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 7 & vectors \\ \hline
\end{tabular}
\vv

float  v0 = expand(v1, 0)
\vv

This is the opposite of compress. The length of the output vector is double the length of the input vector if the maximum vector length is not exceeded.
\vv

The OT field specifies the operand type of the output vector. Single precision floating point numbers are converted to double precision. Integers are converted to the double size by sign-extension or zero-extension. Support for the following conversions is optional: half precision float to single precision, double precision to quadruple precision, 4-bit integer to 8-bit.
\vv

Options are specified in IM1:
\vv

\label{table:expandOptions}
\begin{tabular}{|p{20mm}|p{120mm}|}
\hline
\bfseries IM1 bits & \bfseries meaning \\ \hline
bit 0-1 & integer options: \newline
00 = sign extension \newline
10 = zero extension
\\ \hline
\end{tabular}
\vv

\subsubsection{expand\_sparse}
\label{table:expandSparseInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 9 & vectors. Optional \\ \hline
\end{tabular}
\vv

int32 v0 = expand\_sparse(v1, r2), mask = v3
\vv

This is the opposite of compress\_sparse.

Expand a contiguous vector into a sparse vector with positions indicated by mask bits.

The second operand is a general purpose register indicating the length in bytes of the output vector.
\vv

The algorithm of this instruction is:\\
Set an index i1 to position zero in the source vector.\\
Let another index i2 loop through the elements of the mask vector. For each i2 do:\\
\hspace{4mm} if mask[i2] is true then\\
\hspace{8mm}   destination[i2] = source[i1]; increment i1\\
\hspace{4mm} else\\
\hspace{8mm}   destination[i2] = 0\\
end for\\

\vv
The length of the destination vector will be the number of true mask elements
times the element size. This instruction cannot have a fallback register.
\vv
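
The loop above can be written as a short Python model. This is a sketch under stated assumptions: vectors are lists of elements, the output length is passed as a number of elements (\texttt{num\_elements}, a hypothetical parameter standing in for the byte length in the second operand), and reading past the end of the source is assumed to give zero.

```python
def expand_sparse(source, mask, num_elements):
    """Spread contiguous source elements to the positions where mask is true."""
    dest = []
    i1 = 0                                   # index into the source vector
    for i2 in range(num_elements):           # index into mask and destination
        if i2 < len(mask) and mask[i2]:
            # assumption: a source that is too short supplies zeroes
            dest.append(source[i1] if i1 < len(source) else 0)
            i1 += 1
        else:
            dest.append(0)
    return dest


print(expand_sparse([10, 13, 14], [1, 0, 0, 1, 1, 0], 6))
# prints: [10, 0, 0, 13, 14, 0]
```
\vv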

\subsubsection{extract}
\label{table:extractInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A &  5 & vectors  \\ \hline
1.3 B &  5 & vectors  \\ \hline
\end{tabular}
\vv

float v0 = extract(v1, r2)\\
float v0 = extract(v1, 5)
\vv

Extract one element from the source vector at the given position and
broadcast it into all elements of vector register RD with the same length and operand size as the source vector.
The index can be a constant or a general purpose register.
This index indicates which vector element to extract.
The size of the vector elements must match the operand type.
\vv

An index out of range will produce zero. An operand size of 128 bits can be used, even if this size is not otherwise supported.
This instruction cannot have a mask.
\vv

\subsubsection{float2int}
\label{table:float2intInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 12 & vectors  \\ \hline
\end{tabular}
\vv

int32 v0 = float2int(v1, 0)
\vv

Conversion of floating point values to integers with the same operand size.\\
float16 is converted to int16. float32 is converted to int32. float64 is converted to int64.
\vv

The bits in IM1 specify rounding mode and error control, according to the following table:
\vv

\label{table:float2intOptions}
\begin{tabular}{|p{16mm}|p{120mm}|}
\hline
\bfseries IM1 bit & \bfseries Meaning \\ \hline
0-2 & overflow control: \newline
000 = integer overflow wraps around \newline
100 = signed integer overflow gives zero \newline
101 = signed integer overflow gives signed saturation \newline
110 = unsigned integer overflow gives zero \newline
111 = unsigned integer overflow gives unsigned saturation \\
\hline
3-4 & rounding mode: \newline
00 = nearest or even\newline
01 = down\newline
10 = up\newline
11 = truncate towards zero \\
\hline
5 & 0: NAN gives 0. 1: NAN gives MIN\_INT \\
\hline
\end{tabular}
\vv

To check for overflow: Compare the results for overflow gives zero and overflow gives saturation.

To check if the result is exact: Compare the results for round down and round up.
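
The two checks above can be sketched in Python for a single 32-bit element. This is an illustrative model, not ForwardCom code. The mode names are hypothetical, the input is assumed to be finite, and the exactness check is only meaningful when no overflow occurs.

```python
import math

INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def float2int32(x: float, rounding: str, overflow: str) -> int:
    """Convert a finite float to a signed 32-bit integer.

    rounding: "nearest", "down", "up" or "zero" (hypothetical names)
    overflow: "zero" or "saturate" (hypothetical names)
    """
    if rounding == "nearest":
        n = round(x)          # Python's round() rounds ties to even
    elif rounding == "down":
        n = math.floor(x)
    elif rounding == "up":
        n = math.ceil(x)
    else:
        n = math.trunc(x)
    if INT32_MIN <= n <= INT32_MAX:
        return n
    if overflow == "zero":
        return 0
    return INT32_MAX if n > 0 else INT32_MIN


def has_overflow(x: float) -> bool:
    """Compare the overflow-gives-zero and overflow-gives-saturation results."""
    return float2int32(x, "nearest", "zero") != float2int32(x, "nearest", "saturate")


def is_exact(x: float) -> bool:
    """Compare the round-down and round-up results (assumes no overflow)."""
    return float2int32(x, "down", "saturate") == float2int32(x, "up", "saturate")


print(has_overflow(3e9), has_overflow(-7.25), is_exact(3.0), is_exact(3.5))
# prints: True False True False
```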

\subsubsection{get\_len}
\label{table:getLenInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 0 & vectors  \\ \hline
\end{tabular}
\vv

Get length in bytes of vector register RT into general purpose register RD.
\vv

This instruction cannot have a mask.

\subsubsection{get\_num}
\label{table:getNumInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 1 & vectors  \\ \hline
\end{tabular}
\vv

Get the number of elements in vector register RT into general purpose register RD. This is equal to the length divided by the operand size. The result is a 64-bit integer.
\vv

This instruction cannot have a mask.

\subsubsection{gp2vec}
\label{table:gp2vecInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 0 & g.p. register in, vector register out \\ \hline
\end{tabular}
\vv

int64 v0 = gp2vec(r1)
\vv

Move integer value of general purpose register RS to
scalar in vector register RD.
\vv

\subsubsection{insert}
\label{table:insertInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 4 & vectors \\ \hline
1.3 B & 4 & vectors \\ \hline
\end{tabular}
\vv

float v0 = insert(v0, v1, r2) \\
float v0 = insert(v0, v1, 5)
\vv

Replace one element in the first vector with the first element of the second vector.
The index to the position of replacement can be a constant or a general purpose register. This index indicates which vector element to replace.
The size of the vector elements must match the operand type.
The destination register must be the same as the first source operand.
\vv

An index out of range will leave the vector unchanged. An operand size of 128 bits can be used, even if this size is not otherwise supported.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{insert\_hi}
\label{table:insertHiInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.9 & 1 & general purpose register, 32-bit immediate constant \\ \hline
2.6 & 1 & vector register, 32-bit immediate constant \\ \hline
\end{tabular}
\vv

int64 r0 = insert\_hi(r1, 2) \\
float v0 = insert\_hi(v1, 2.1)
\vv

Insert 32-bit constant into the high part of a
general purpose register, leaving the low part
unchanged. \\
dest = (src1 \& 0xFFFFFFFF) $|$ (IM2 $<<$ 32).
\vv

Make a vector of two elements. A constant is inserted into the second element, leaving the first element unchanged.\\
dest[0] = src1[0], dest[1] = IM2.
\vv
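
The two formulas above can be checked with a small Python model. This is a sketch only: the function names are hypothetical, the general purpose register is modelled as an unsigned 64-bit integer, and the vector is modelled as a list.

```python
MASK32 = 0xFFFFFFFF


def insert_hi_gp(src1: int, im2: int) -> int:
    """dest = (src1 & 0xFFFFFFFF) | (IM2 << 32), kept within 64 bits."""
    return (src1 & MASK32) | ((im2 & MASK32) << 32)


def insert_hi_vec(src1, im2):
    """dest[0] = src1[0], dest[1] = IM2."""
    return [src1[0], im2]


print(hex(insert_hi_gp(0x1111222233334444, 2)))  # prints: 0x233334444
print(insert_hi_vec([1.5, 9.0, 9.0], 2.1))       # prints: [1.5, 2.1]
```
\vv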

\subsubsection{int2float}
\label{table:int2floatInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 13 & vectors \\ \hline
\end{tabular}
\vv

int64 v0 = int2float(v1, 0)
\vv

Conversion of signed or unsigned integers to floating point numbers with the same operand size.\\
int16 is converted to float16. int32 is converted to float32. int64 is converted to float64.
\vv

Options are coded in IM1:

\label{table:int2floatOptions}
\begin{tabular}{|p{20mm}|p{120mm}|}
\hline
\bfseries IM1\newline bit number & \bfseries Meaning \\ \hline
0 & The integer is unsigned \\
2 & Inexact result gives NAN. See page \pageref{table:FPExceptionResults}.
\\ \hline
\end{tabular}
\vv

\subsubsection{interleave}
\label{table:interleaveInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.6 & 2.1 & vectors. Optional \\ \hline
\end{tabular}
\vv

float v0 = interleave(v1, v2, r3)
\vv

Interleave the inputs from two vectors, v1 and v2, so that the even-numbered elements come from v1 and the odd-numbered elements come from v2. The length in bytes of the destination vector is indicated by a general purpose register, r3. The length of each input vector is half the indicated value.
\vv

This instruction can have a mask but not a fallback register. The fallback value is zero.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.5 & 0 & vector. 32 bit immediate constant \\ \hline
\end{tabular}
\vv

\vv

Make a vector of two elements. dest[0] = 0, dest[1] = IM2.
\vv

\subsubsection{move}
\label{table:moveInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  2 & all types \\ \hline
1.1 C &  0 & 32-bit register = 16-bit sign-extended constant \\ \hline
1.1 C &  1 & 64-bit register = 16-bit sign-extended constant \\ \hline
1.1 C &  3 & 64-bit register = 16-bit zero-extended constant \\ \hline
1.1 C &  4 & 32-bit register = 8-bit sign-extended constant with left shift \\ \hline
1.1 C &  5 & 64-bit register = 8-bit sign-extended constant with left shift \\ \hline
1.4 C &  0 & vector register 16-bit scalar = 16-bit constant. Optional  \\ \hline
1.4 C &  8 & vector register 32-bit scalar = 8-bit sign extended constant with left shift. Optional \\ \hline
1.4 C &  9 & vector register 64-bit scalar = 8-bit sign extended constant with left shift. Optional \\ \hline
1.4 C & 32 & vector register single precision scalar = half precision immediate constant. Optional \\ \hline
1.4 C & 33 & vector register double precision scalar = half precision immediate constant. Optional \\ \hline
\end{tabular}
\vv

Copy a value from a register, memory operand or immediate constant to a register. If the destination is a vector register and the source is an immediate constant then the result will be a scalar. The value will not be broadcast because there is no other input operand that specifies the vector length. If a vector is desired then use the broadcast instruction instead.
\vv

The move instruction with an immediate operand is the preferred method for setting a register to zero.
\vv

\subsubsection{permute}
\label{table:permuteInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.6 & 1.1 & vectors \\ \hline
2.6   & 8   & vectors and 32 bit immediate constant \\ \hline
\end{tabular}
\vv

float v0 = permute(v1, v2, r3) \\
float v0 = permute(v1, r3, 5)
\vv

This instruction permutes the elements of a vector v1. The vector is divided into blocks of size r3 bytes each. The block size must be a power of 2 and a multiple of the operand size. Elements can be moved arbitrarily between positions within each block, but not between blocks. Each element of the output vector is a copy of an element in the input vector, selected by the corresponding index in an index vector v2 or a constant. The indexes are relative to the start of the block they belong to, so that an index of zero will select the first element in the block of the input vector and insert it in the corresponding position of the output vector. The same element in the input vector can be copied to multiple elements in the output vector. An index out of range will produce a zero. The indexes are interpreted as integers regardless of the operand type.
\vv

The permute instruction has two versions. The first version specifies the indexes in a vector with the same length and element size as the input vector.
\vv

The second version specifies the indexes as a 32-bit immediate constant with 4 bits per element. This constant is split into a maximum of 8 elements with 4 bits in each, where the least significant four bits are the index for the first element in the block.
If the blocks have more than 8 elements each then the sequence of 8 elements is repeated to fill a block. The same pattern of indexes will be applied to all blocks in the second version of the permute instruction.
\vv
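
Both versions of the permute instruction can be modelled in Python as follows. This is a sketch under stated assumptions: vectors are lists of elements, and the block size is given as a number of elements (\texttt{block\_elems}, a hypothetical parameter) rather than in bytes.

```python
def permute(src, indexes, block_elems):
    """First version: one index per element, relative to the block start."""
    dest = []
    for pos, idx in enumerate(indexes):
        block_start = pos - pos % block_elems
        if 0 <= idx < block_elems and block_start + idx < len(src):
            dest.append(src[block_start + idx])
        else:
            dest.append(0)            # index out of range produces zero
    return dest


def permute_imm(src, imm32, block_elems):
    """Second version: eight 4-bit indexes packed in a 32-bit constant."""
    nibbles = [(imm32 >> (4 * k)) & 0xF for k in range(8)]
    # the pattern of 8 indexes repeats within a block and across all blocks
    indexes = [nibbles[(pos % block_elems) % 8] for pos in range(len(src))]
    return permute(src, indexes, block_elems)


v1 = [10, 11, 12, 13, 20, 21, 22, 23]   # two blocks of four elements

print(permute(v1, [3, 2, 1, 0, 0, 0, 9, 1], 4))
# prints: [13, 12, 11, 10, 20, 20, 0, 21]
print(permute_imm(v1, 0x0123, 4))
# prints: [13, 12, 11, 10, 23, 22, 21, 20]
```

The constant 0x0123 reverses the order within each block because its least significant four bits (3) are the index for the first element.
\vv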

The maximum block size for the permute instruction is implementation-dependent and given by a special register. The reason for this limitation of block size is that the complexity of the hardware grows quadratically with the block size. A full permutation is possible if the vector length does not exceed the maximum block size. A trap is generated if r3 is bigger than the maximum block size.
\vv

The outputs of multiple permute instructions can be combined by using indexes out of range to produce zeroes for unused outputs and then combine the outputs of multiple permutes by bitwise OR.
The fallback value is zero if a mask is used.
\vv
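
The technique of combining permutes by bitwise OR can be sketched in Python. The model below assumes integer elements, a single block covering the whole vector, and two source vectors of equal length. The function \texttt{permute2} is a hypothetical helper, not a ForwardCom instruction.

```python
def permute(src, indexes):
    """Single-block permute: an index out of range produces zero."""
    return [src[i] if 0 <= i < len(src) else 0 for i in indexes]


def permute2(a, b, indexes):
    """Select elements from a (indexes 0..n-1) or b (indexes n..2n-1)."""
    n = len(a)
    from_a = permute(a, indexes)                    # indexes >= n give zero
    from_b = permute(b, [i - n for i in indexes])   # indexes < n give zero
    return [x | y for x, y in zip(from_a, from_b)]  # combine by bitwise OR


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(permute2(a, b, [0, 4, 1, 5]))  # prints: [1, 5, 2, 6]
```

Each output position receives a nonzero contribution from at most one of the two permutes, so the OR merges them without interference.
\vv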

Permute instructions are essential for a vector processor because it is often necessary to rearrange data to facilitate the vector processing. These instructions are useful for reordering data, for transposing a matrix, etc.
\vv

Permute instructions can also be used for parallel table lookup when the block size is big enough to contain the entire table.
\vv

Finally, permute instructions can be used for gathering and scattering data within an area not bigger than the vector length or the block size.
\vv

\subsubsection{read\_insert}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.5 A & 32 & vectors. Optional \\ \hline
\end{tabular}
\vv

int32 v0 = read\_insert(v0, r1, [r2+0x8, scalar])
\vv

Replace one element in vector RD, starting
at offset RT$\cdot$OS, with scalar memory operand
[RS+IM2].

(OS = operand size).

\subsubsection{repeat\_block}
\label{table:repeatBlockInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.7 & 8.1 & vectors. Optional \\ \hline
\end{tabular}
\vv

float v0 = repeat\_block(v1, r2, 8)
\vv

Repeat a block of data to make a longer vector. This is the same as broadcast, but with a larger block of data. v1 is an input vector containing a data block to repeat. A constant (IM2) is the length in bytes of the block to repeat. This must be a multiple of 4. r2 is the length in bytes of the result vector. This instruction is useful for matrix multiplication.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{repeat\_within\_blocks}
\label{table:repeatWithinBlockInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.7 & 9.1 & vectors. Optional \\ \hline
\end{tabular}
\vv

float v0 = repeat\_within\_blocks(v1, r2, 8)
\vv

This divides a vector into blocks and broadcasts the first element of each block to the rest of the block. The block size is given by a constant (IM2). This must be a multiple of the operand size, and at least 4 bytes. There may be a maximum limit to the block size. r2 is the length in bytes of the result vector. This instruction is useful for matrix multiplication.
\vv

For example, if the input vector contains (0,1,2,3,4,5,6,7,8) and the block size is 3 times the operand size, then the result will be (0,0,0,3,3,3,6,6,6).
\vv
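
The example above can be reproduced with a small Python model, shown next to a model of repeat\_block for comparison. This is a sketch: vectors are lists, and block size and result length are given as numbers of elements rather than bytes, which is an assumption made for brevity.

```python
def repeat_within_blocks(src, block_elems, num_elements):
    """Broadcast the first element of each block to the rest of the block."""
    dest = []
    for pos in range(num_elements):
        first = pos - pos % block_elems          # start of the current block
        dest.append(src[first] if first < len(src) else 0)
    return dest


def repeat_block(src, block_elems, num_elements):
    """Repeat the first block of src until the result has num_elements."""
    return [src[pos % block_elems] for pos in range(num_elements)]


v1 = [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(repeat_within_blocks(v1, 3, 9))  # prints: [0, 0, 0, 3, 3, 3, 6, 6, 6]
print(repeat_block(v1, 3, 9))          # prints: [0, 1, 2, 0, 1, 2, 0, 1, 2]
```
\vv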

This instruction cannot have a mask.
\vv

\subsubsection{replace}
\label{table:replaceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.6 & 3 & vectors and 32-bit immediate constant \\ \hline
3.1 & 32 & vectors and 64-bit immediate constant. Optional \\ \hline
\end{tabular}
\vv

int32 v0 = replace(v1, 1), mask=v2, fallback=v3\\
double v0 = replace(v1, 2.3)
\vv

All elements of src1 are replaced by the integer or floating point constant src2.
\vv

When used without a mask, the constant is simply broadcast to make a vector of the same length as src1. This is useful for broadcasting a constant to all elements of a vector. Only the length of src1 (in bytes) is used, not its contents, when this instruction is used without a mask.
\vv

When used with a mask, the elements of src1 are selectively replaced. Elements that are not selected by the mask will be taken from a fallback register.
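
The behavior with and without a mask can be modelled in Python. This is only a sketch: vectors are lists of equal length, and the default of taking the fallback from src1 when no fallback register is given is an assumption of this model.

```python
def replace(src1, constant, mask=None, fallback=None):
    """Model of replace: broadcast a constant, optionally under a mask."""
    if mask is None:
        # only the length of src1 is used, not its contents
        return [constant] * len(src1)
    if fallback is None:
        fallback = src1               # assumption made for this sketch
    return [constant if mask[i] else fallback[i] for i in range(len(src1))]


v1 = [10, 20, 30, 40]
v2 = [1, 0, 1, 0]        # mask
v3 = [-1, -2, -3, -4]    # fallback

print(replace(v1, 1))                         # prints: [1, 1, 1, 1]
print(replace(v1, 1, mask=v2, fallback=v3))   # prints: [1, -2, 1, -4]
```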

\subsubsection{replace\_even}
\label{table:replaceEvenInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.6 & 4 & vectors and 32-bit immediate constant \\ \hline
\end{tabular}
\vv

Same as replace. Only even-numbered vector elements are replaced.

\subsubsection{replace\_odd}
\label{table:replaceOddInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.6 & 5 & vectors and 32-bit immediate constant \\ \hline
\end{tabular}
\vv

Same as replace. Only odd-numbered vector elements are replaced.

\subsubsection{set\_len}
\label{table:setLenInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 2 & vectors \\ \hline
\end{tabular}
\vv

v1 = set\_len(v2, r3)
\vv

Sets the length of a vector register to the number of bytes specified by a general purpose register. If the specified length is more than the maximum length for the specified operand type then the maximum length will be used.
\vv

If the output vector is longer than the input vector then the extra elements will be zero. If the output vector is shorter than the input vector then the extra elements will be discarded.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{set\_num}
\label{table:setNumInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 3 & vectors \\ \hline
\end{tabular}
\vv

v1 = set\_num(v2, r3)
\vv

The length of a vector register is changed to the value of a general purpose register. The length is indicated as a number of elements. If the length is increased then the extra elements will be zero. If the length is decreased then the superfluous elements are lost.

\vv
This instruction differs from set\_len by multiplying the length by the operand size.
This instruction cannot have a mask.

\subsubsection{shift\_down}
\label{table:shiftDownInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 19 & vectors \\ \hline
\end{tabular}
\vv

int32 v0 = shift\_down(v1, r2)
\vv

Shift elements of a vector down by the number of elements (n) indicated by a general purpose register.
The upper n elements of the result will be zero, the lower n elements are lost. The length of the vector is not changed.
\vv

This instruction differs from shift\_reduce by indicating the shift count as a number of elements rather than a number of bytes, and by not changing the length of the vector.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{shift\_expand}
\label{table:shiftExpandInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 16 & vectors \\ \hline
\end{tabular}
\vv

int32 v0 = shift\_expand(v1, r2)
\vv

The length of a vector is expanded by the specified number of bytes by adding zero-bytes at the low end and shifting all bytes up. If the resulting length is more than the maximum vector length for the specified operand type then the upper bytes are lost.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{shift\_reduce}
\label{table:shiftReduceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 17 & vectors \\ \hline
\end{tabular}
\vv

int32 v0 = shift\_reduce(v1, r2)
\vv

The length of a vector is reduced by the specified number of bytes by removing bytes at the low end and shifting all bytes down. If the resulting length is less than zero then the result will be a zero-length vector. The specified operand type is ignored.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{shift\_up}
\label{table:shiftUpInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 & 18 & vectors \\ \hline
\end{tabular}
\vv

int32 v0 = shift\_up(v1, r2)
\vv

Shift elements of a vector up by the number of elements (n) indicated by a general purpose register.
The lower n elements of RD will be zero, the upper n elements are lost. The length of the vector is not changed.
\vv

This instruction differs from shift\_expand by indicating the shift count as a number of elements rather than a number of bytes, and by not changing the length of the vector.
\vv
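
The difference between the four shift instructions can be illustrated with a Python model. This is a sketch under stated assumptions: shift\_up and shift\_down operate on lists of elements, shift\_expand and shift\_reduce operate on byte strings, the shift count is non-negative, and \texttt{max\_len} is a hypothetical parameter for the maximum vector length in bytes.

```python
def shift_down(v, n):
    """Shift elements towards index 0; the length is unchanged."""
    n = min(n, len(v))
    return v[n:] + [0] * n


def shift_up(v, n):
    """Shift elements away from index 0; the length is unchanged."""
    n = min(n, len(v))
    return [0] * n + v[:len(v) - n]


def shift_reduce(b: bytes, n: int) -> bytes:
    """Remove n bytes at the low end; the vector gets shorter."""
    return b[n:]


def shift_expand(b: bytes, n: int, max_len: int) -> bytes:
    """Add n zero bytes at the low end; upper bytes beyond max_len are lost."""
    return (bytes(n) + b)[:max_len]


v = [1, 2, 3, 4]
print(shift_down(v, 1), shift_up(v, 1))
# prints: [2, 3, 4, 0] [0, 1, 2, 3]

b = bytes([1, 2, 3, 4])
print(list(shift_reduce(b, 1)), list(shift_expand(b, 2, 5)))
# prints: [2, 3, 4] [0, 0, 1, 2, 3]
```
\vv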

This instruction cannot have a mask.
\vv

\subsubsection{sign\_extend}
\label{table:signExtendInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 4 & general purpose and integer scalar \\ \hline
\end{tabular}
\vv

int8 r0 = sign\_extend(r1)  // result is 64 bits\\
int8 v0 = sign\_extend(v1)  // lower 8 bits of each 64-bit element is extended to 64 bits\\
int8 v0 = sign\_extend([r1, scalar]) // memory operand is 8 bits, result is 64 bits scalar
\vv

Sign-extend smaller integer to 64 bits.

\vv
The input can be an 8-bit, 16-bit or 32-bit integer. This integer is sign-extended to produce a 64-bit output in a general purpose register or a scalar in a vector register. If the input is a vector then only the first element in each 64-bit block of the input vector is used. Floating point types cannot be used.

\subsubsection{sign\_extend\_add}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 5 & general purpose registers \\ \hline
\end{tabular}
\vv

int8 r0 = sign\_extend\_add(r1, r2) \\
int32 r0 = sign\_extend\_add(r1, [r2]), options = 2
\vv

src2 is an integer of the specified size, often a memory operand.
This integer is sign-extended to produce a 64-bit integer.
The sign-extended value is optionally shifted left by a value of 1 .. 3, specified in the options.
The result is added to the 64-bit integer in src1 and the result is stored in the 64-bit destination register.
\vv

This instruction is useful for converting relative pointers to absolute pointers, where the reference point is in src1. The relative pointer may be scaled by a factor of 1, 2, 4, or 8, corresponding to a shift count of 0, 1, 2, or 3, respectively. Support for larger scale factors is optional.
\vv
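
The conversion of a scaled relative pointer to an absolute pointer can be sketched in Python. This is a model of the arithmetic only. The helper names are hypothetical, and registers are modelled as unsigned 64-bit integers.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def sign_extend_add(src1: int, src2: int, bits: int, shift: int = 0) -> int:
    """dest = src1 + (sign_extend(src2) << shift), wrapping at 64 bits."""
    return (src1 + (sign_extend(src2, bits) << shift)) & MASK64


base = 0x10000         # reference point in src1
rel = 0xFFFE           # 16-bit relative pointer with the value -2
print(hex(sign_extend_add(base, rel, 16, shift=2)))  # prints: 0xfff8
```

The relative pointer -2 is scaled by 4 to give -8, which is added to the reference point 0x10000. Overflow wraps around silently, in line with the statement that this instruction does not generate overflow traps.
\vv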

This instruction does not sign-extend when the operand size is 64 bits, but it can still add and shift 64-bit integers.
\vv

This instruction will not generate traps in case of signed or unsigned overflow.
\vv

\subsubsection{vec2gp}
\label{table:vec2gpInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 1 & vector register in, g.p. register out \\ \hline
\end{tabular}
\vv

int64 r0 = vec2gp(v1)
\vv

Copy value of first element of vector register RS to general purpose register RD. Integers are sign-extended. Single precision floating point values are zero-extended.
\vv

\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.9 A & 32 & general purpose register \\ \hline
\end{tabular}
\vv

\vv

Gives the address of a data object in static memory.
\vv

The value must be shifted two places to the right if used as the target for a jump or call instruction, because code addresses are based on 32-bit words rather than bytes.
\vv

\subsubsection{clear}
\label{table:clearInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 58 & vector. Optional \\ \hline
\end{tabular}
\vv

clear(v5)      // clear one vector register \\
clear(v5, 8)   // clear vector registers v5 - v8
\vv

Clear one or more vector registers by setting the length to zero. A cleared register is regarded as unused.
\vv

It may be advantageous to clear vector registers after use. This will mean that there is less data to save during a task switch.
\vv

\subsubsection{extract\_store}
\label{table:extractStoreInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.5 A & 40 & vector. Optional \\ \hline
\end{tabular}
\vv

int32 [r3+8, scalar] = extract\_store(v1, r2)
\vv

Extract one element from vector RD, starting at offset RT$\cdot$OS, with size OS into memory operand [RS+IM2].

(OS = operand size).

\subsubsection{fence}
\label{table:fenceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.5 B & 16 & memory operand and immediate. Optional \\ \hline
\end{tabular}
\vv

int32   fence([r1], 2)
\vv

\vv

Options indicated by IM1:

\begin{longtable}{|p{20mm}|p{50mm}|}
\hline
\bfseries IM1 value & \bfseries meaning \\ \hline
1 & read fence \\ \hline
2 & write fence \\ \hline
3 & read and write fence \\ \hline
\end{longtable}
\vv

\subsubsection{move}
The move instruction, described on page \pageref{table:moveInstruction}
can read a register from a memory operand.

\subsubsection{pop}
\label{table:popInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 57 & general purpose registers. Optional \\
1.3 B & 57 & vector registers. Optional \\ \hline
\end{tabular}
\vv

\begin{lstlisting}[frame=none]
pop(r5)         // pop 64-bit register r5 off the stack
pop(r1, r2, 6)  // pop registers r2-r6 from stack pointed to by r1
pop(v5)         // pop vector register v5 off the stack
pop(v5, 9)      // pop vector registers v5-v9 off the stack
\end{lstlisting}
\vv

The pop instruction can pop one or more registers from a stack. The registers are popped in reverse order.
\vv

An optional first register (RD) indicates a stack pointer. The default stack pointer (SP) is used if not specified. An optional last operand is an index of the last register to pop. The syntax for the POP instruction has no equal sign. The operand size is 64 bits by default. A different operand type is allowed only for general purpose registers.
\vv

The stack is growing backwards by default. The last register is read from the address pointed to by the stack pointer. Then the stack pointer is incremented by the amount that was occupied by the register. This is 8 bytes by default for a general purpose register or a variable amount for a vector register. This process is repeated if multiple registers are popped. Registers are pushed in forward order and popped in reverse order.
\vv
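
The pop order, and the matching push order described on page \pageref{table:pushInstruction}, can be modelled in C for a backward-growing stack of general purpose registers. This is a minimal sketch and not the hardware implementation: the machine structure, the fixed 8-byte slots, and the function names push\_range and pop\_range are assumptions made for illustration. Vector registers and forward-growing stacks are not modelled.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Hypothetical model: regs[] stands for r0-r31, stack[] holds 64-bit slots. */
typedef struct {
    uint64_t regs[32];
    uint64_t stack[64];
    int sp;              /* index of the top slot; the stack grows towards index 0 */
} machine;

/* Push registers first..last: decrement sp, then store, in forward order. */
void push_range(machine *m, int first, int last) {
    for (int r = first; r <= last; r++) {
        m->sp -= 1;
        m->stack[m->sp] = m->regs[r];
    }
}

/* Pop registers first..last: load, then increment sp, in reverse order. */
void pop_range(machine *m, int first, int last) {
    for (int r = last; r >= first; r--) {
        m->regs[r] = m->stack[m->sp];
        m->sp += 1;
    }
}
\end{lstlisting}

In this model, pushing r2-r6 and then popping r2-r6 restores every register and returns the stack pointer to its starting value.
\vv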

It is possible to make a forward-growing stack for general purpose registers by adding 0x80 to the last operand. A stack containing vector registers cannot grow forwards because the pop instruction needs to read the vector length stored at the beginning of each field before it can read the rest of the vector.
\vv

See the push instruction on page \pageref{table:pushInstruction} for more details.
\vv

\subsubsection{prefetch}
\label{table:prefetchInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 3 & memory operand. Optional \\ \hline
\end{tabular}
\vv

Prefetch memory operand into cache for later read or write.
Different variants (not yet defined) can be specified by option bits in IM3 for formats with E template.
\vv

\subsubsection{push}
\label{table:pushInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 56 & general purpose register. Optional \\
1.3 B & 56 & vector register. Optional \\ \hline
\end{tabular}
\vv

\begin{lstlisting}[frame=none]
push(r5)           // push 64-bit register r5 on the stack
push(r1, r2, 6)    // push registers r2-r6 on stack pointed to by r1
push(r1, r2, 0x86) // push registers r2-r6 on forward growing stack r1
push(v5, 9)        // push vector registers v5-v9 on the stack
\end{lstlisting}
\vv

The push instruction can push one or more registers on a stack.
\vv

An optional first register (RD) indicates a stack pointer. The default stack pointer (SP) is used if not specified. An optional last operand is an index of the last register to push. The syntax for the PUSH instruction has no equal sign. The operand size is 64 bits by default. A different operand type is allowed only for general purpose registers.
\vv

The stack grows backwards by default.
The stack pointer is decremented by the amount of memory that the register will occupy. The first register is then stored at the address pointed to by the stack pointer. This amount is 8 bytes for a full general purpose register, or a variable amount for a vector register. The process is repeated if multiple registers are pushed.
\vv

It is possible to make a forward-growing stack for general purpose registers by adding 0x80 to the last operand. This may be used as an increment-pointer-and-store instruction. A stack containing vector registers cannot grow forwards because a later pop instruction needs to read the vector length stored at the beginning of each field before it can read the rest of the vector.
\vv

Note that vector registers are stored in an implementation-dependent way by the push instruction. The microprocessor may compress the data or it may insert extra space for optimal alignment of memory access. The programmer should make no assumption about how the vector elements are stored. A pushed vector register can only be restored by a pop instruction on the same microprocessor that pushed it, or on an identical one. If the memory image is moved before restoring, it must be moved by a multiple of the maximum vector length. The maximum amount of memory occupied by a pushed vector register is 8 bytes plus the maximum vector length.
\vv

\vv

\subsubsection{store}
\label{table:storeInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  1 & memory operand and g.p. or vector register \\ \hline
2.5 B &  8 & memory operand and 32-bit constant. Optional \\ \hline
\end{tabular}
\vv

int32 [r0+r1*4] = r1\\
float [r0, length = r1] = v2 \\
float [r0 + 0x10] = 2.5
\vv

Write the value of a register or constant to a memory operand.
\vv

The size of the memory operand is determined by the operand size OS when a scalar memory operand is specified, or by the vector length register in RS when a vector operand is specified.
\vv

An immediate constant cannot be bigger than 32 bits. A 64-bit integer constant can only be used if it fits into a 32-bit signed integer. A float64 constant can only be used if it can be represented as single precision without loss of precision.
\vv
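
The two conditions on immediate constants can be checked as in the following C sketch. The function names fits\_int32 and fits\_float are hypothetical. NAN and infinite values are conservatively rejected here, which is an assumption of the sketch and not a statement about the assembler.

\begin{lstlisting}[frame=none,language=C]
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a 64-bit integer constant fits a 32-bit signed immediate. */
bool fits_int32(int64_t c) {
    return c >= INT32_MIN && c <= INT32_MAX;
}

/* True if a finite double survives a round trip through single precision. */
bool fits_float(double d) {
    if (d != d) return false;                       /* NAN: rejected in this sketch */
    if (d > FLT_MAX || d < -FLT_MAX) return false;  /* outside the finite float range */
    return (double)(float)d == d;
}
\end{lstlisting}

For example, 2.5 is exactly representable in single precision and passes the check, while 0.1 is not and fails it.
\vv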

The hardware must be able to handle memory operand sizes that are not powers of 2 without touching additional memory (read and rewrite beyond the memory operand is not allowed unless access from other threads is blocked during the operation and any access violation is suppressed).
It is allowed for the hardware to write the operand in a piecemeal fashion.
\vv

Masked operation with a mask of zero will leave the corresponding memory element untouched. An explicit fallback value cannot be specified.
\vv

\subsection{General arithmetic instructions}

\subsubsection{abs}
\label{table:absInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B &  0 & g.p. registers \\ \hline
1.3 B & 16 & vector registers \\ \hline
\end{tabular}
\vv

int32 r0 = abs(r1, 1)
\vv

Absolute value of signed number.
\vv

Signed integers can overflow when the input is the minimum value.
The handling of overflow for signed integers is controlled by the constant IM1 as follows:

\begin{longtable}{|p{12mm}|p{80mm}|}
\hline
\bfseries IM1 & \bfseries result when input is INT\_MIN \\ \hline
0  & INT\_MIN (wrap around) \\ \hline
1  & INT\_MAX (saturation)  \\ \hline
2  & zero                   \\ \hline
\end{longtable}
\vv
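
The three overflow policies can be sketched in C for a 32-bit operand. The function name abs\_int32 is hypothetical. Values of IM1 other than 0, 1 and 2 are treated like 0 here, which is an assumption of the sketch.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Absolute value of a 32-bit signed integer; im1 selects the INT_MIN policy. */
int32_t abs_int32(int32_t x, unsigned im1) {
    if (x == INT32_MIN) {
        switch (im1) {
        case 1:  return INT32_MAX;   /* saturation */
        case 2:  return 0;           /* zero */
        default: return INT32_MIN;   /* wrap around (IM1 = 0) */
        }
    }
    return x < 0 ? -x : x;           /* cannot overflow: x is not INT_MIN here */
}
\end{lstlisting}
\vv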

\subsubsection{add}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  8 & all standard types \\ \hline
multi & 44 & float16. Optional \\ \hline
1.1 C &  6 & 32-bit register and 16-bit sign-extended constant \\ \hline
1.1 C & 10 & 32-bit register and 8-bit sign-extended constant shifted left by another constant. \\ \hline
1.1 C & 11 & 64-bit register and 8-bit sign-extended constant shifted left by another constant. \\ \hline
1.1 C & 18 & 32-bit register and 16-bit zero-extended constant shifted left by 16 \\ \hline
2.9   &  2 & g.p. register and 32-bit zero-extended constant \\ \hline
2.9   &  4 & g.p. register and 32-bit constant shifted left by 32 \\ \hline
1.4 C &  1 & vector of 16-bit integer elements and broadcast 16-bit integer constant. Optional \\ \hline
1.4 C & 10 & vector of 32-bit integer elements and broadcast 8-bit sign-extended constant shifted left by another constant. Optional \\ \hline
1.4 C & 11 & vector of 64-bit integer elements and broadcast 8-bit sign-extended constant shifted left by another constant. Optional \\ \hline
1.4 C & 34 & single precision floating point vector and broadcast half precision floating point constant. Optional \\ \hline
1.4 C & 35 & double precision floating point vector and broadcast half precision floating point constant. Optional \\ \hline
1.4 C & 40 & half precision floating point vector and broadcast half precision floating point constant. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = r1 + r2 \\
int32 r0 = r1 + 2 \\
int32+ r0 += 4 \\
int32+ r0++ \\
float v0 = v1 + [r2 + 8, length = r5]
\vv

\vv

If you want to add a 64-bit constant to a general purpose register, and triple size instructions are not supported, then add the lower half first using the zero-extended version, and then add the upper half using the shifted version.
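
The arithmetic behind this two-step method can be checked with a short C sketch. The function name add\_const64 is hypothetical; the two additions stand for the zero-extended form (format 2.9, opcode 2) and the shifted form (format 2.9, opcode 4) listed in the table above.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Add a 64-bit constant c to r using two 32-bit immediates. */
uint64_t add_const64(uint64_t r, uint64_t c) {
    uint32_t low  = (uint32_t)c;           /* lower half of the constant */
    uint32_t high = (uint32_t)(c >> 32);   /* upper half of the constant */
    r += (uint64_t)low;                    /* step 1: zero-extended constant */
    r += (uint64_t)high << 32;             /* step 2: constant shifted left by 32 */
    return r;
}
\end{lstlisting}

The zero-extended form is needed for the lower half because a sign-extended constant would also change the upper 32 bits whenever bit 31 of the lower half is set.
\vv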

\subsubsection{add\_add}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 51 & all types. Optional \\ \hline
\end{tabular}
\vspace{3mm}

This gives two additions in one instruction:
\vv

dest = $\pm$ src1 $\pm$ src2 $\pm$ src3
\vv

For optimal precision with floating point operands, the intermediate sum of the two numerically largest operands should preferably be calculated first with extended precision.
\vv

The signs of the operands can be inverted as indicated by the following option bits:

\begin{longtable} {|p{20mm}|p{75mm}|}
\hline
\bfseries Option bits & \bfseries Meaning   \\
\hline
bit 0 & change sign of src1 \\
bit 1 & change sign of src2 \\
bit 2 & change sign of src3 \\
\hline
\end{longtable}

There is no sign change if there are no option bits.
\vv

This instruction may be supported for integer operands or floating point or both.
\vv

\subsubsection{compare} \label{compare}
\label{table:compareInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  7 & all types \\ \hline
\end{tabular}
\vv

Examples:\\
int8 r0 = r1 $>$ r2 \\
uint8 r0 = r1 $>$ r2 \\
float v0 = v1 $<=$ 2.3 \\
int32 r0 = compare(r1, 2), mask=r3, fallback=r4, options=0b1001
\vv

The compare instruction compares two source operands and generates a boolean scalar or vector where bit 0 indicates the result. This instruction can do different compare operations depending on option bits 0-3, defined according to the following table:

\begin{longtable} {|p{14mm}|p{50mm}|p{50mm}|}
\caption{Condition codes for compare instruction}
\label{table:conditionCodesForCompareInstruction} \\
\hline
\bfseries Bit 3-2-1-0 & \bfseries Meaning for integer & \bfseries Meaning for floating point \\
\hline
\_ 0 0 0 & a $=$ b    & a $=$ b \\
\_ 0 0 1 & a $\neq$ b & a $\neq$ b \\
\_ 0 1 0 & a $<$ b    & a $<$ b \\
\_ 0 1 1 & a $\geq$ b & a $\geq$ b \\
\_ 1 0 0 & a $>$ b    & a $>$ b \\
\_ 1 0 1 & a $\leq$ b & a $\leq$ b \\
\_ 1 1 0 &            & abs(a) $<$ abs(b) \\
\_ 1 1 1 &            & abs(a) $\geq$ abs(b) \\
\hline
0 \_ \_ \_ & compare as signed & unordered gives 0 \\
1 \_ \_ \_ & compare as unsigned & unordered gives 1 \\
\hline
\end{longtable}

Option bit 3 indicates how to treat floating point NAN inputs. A compare operation is considered unordered if at least one floating point input operand is NAN. The translation of high level language operators to ordered and unordered compare operations is listed on page \pageref{table:floatCompareJumpInstructions}.
\vv

The result is indicated in bit 0 of the destination register. It is 1 for true and 0 for false. The remaining bits are copied from a mask register, or zero if there is no mask register. The number of mask bits available is implementation dependent.
\vv

The condition code is zero (indicating compare for equal) if there are no option bits.
\vv
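
For integer operands, the decoding of the condition code in option bits 0-3 can be sketched in C as follows. The function name compare\_int32 is hypothetical, and only 32-bit operands are modelled. Codes 6 and 7 are defined for floating point only, so this sketch returns false for them, which is an assumption.

\begin{lstlisting}[frame=none,language=C]
#include <stdbool.h>
#include <stdint.h>

/* Integer compare of 32-bit operands; options holds option bits 0-3. */
bool compare_int32(int32_t a, int32_t b, unsigned options) {
    bool is_unsigned = (options & 8) != 0;       /* bit 3: compare as unsigned */
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    bool less    = is_unsigned ? (ua < ub) : (a < b);
    bool greater = is_unsigned ? (ua > ub) : (a > b);
    switch (options & 7) {
    case 0:  return a == b;
    case 1:  return a != b;
    case 2:  return less;
    case 3:  return !less;       /* a >= b */
    case 4:  return greater;
    case 5:  return !greater;    /* a <= b */
    default: return false;       /* 6 and 7: floating point only */
    }
}
\end{lstlisting}

With a = -1 and b = 1, the code 0b0010 (signed less than) is true, while 0b1010 (unsigned less than) is false because -1 is interpreted as the largest unsigned value.
\vv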

A fallback register can be used as operand for an extra boolean operation, with or without a mask. Only bit 0 of the fallback register is used.
This option is controlled by option bits 4-5:

\begin{longtable} {|p{25mm}|p{50mm}|p{50mm}|}
\caption{Alternative use of fallback register}
\label{table:AlternativeFallbackForCompare} \\
\hline
\bfseries bit 5 bit 4 & \bfseries Output with mask & \bfseries Output without mask \\
\hline
\hspace{5mm} 0 0 & mask ? result : fallback  & result \\
\hline
\hspace{5mm} 0 1 & mask \&\& result \&\& fallback & result \&\& fallback \\
\hline
\hspace{5mm} 1 0 & mask \&\& (result $||$ fallback) & result $||$ fallback \\
\hline
\hspace{5mm} 1 1 & mask \&\& (result \^{} fallback) & result \^{} fallback \\
\hline
\end{longtable}
\vv
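
The table above can be expressed as a small C function operating on bit 0 of each operand. The function name compare\_output and the has\_mask parameter are hypothetical. A missing mask is treated as a mask of 1, which reproduces the right-hand column of the table.

\begin{lstlisting}[frame=none,language=C]
#include <stdbool.h>

/* Combine the compare result with mask and fallback using option bits 4-5. */
bool compare_output(bool result, bool has_mask, bool mask,
                    bool fallback, unsigned options) {
    if (!has_mask) mask = true;                  /* no mask behaves like mask = 1 */
    switch ((options >> 4) & 3) {
    case 0:  return mask ? result : fallback;
    case 1:  return mask && result && fallback;
    case 2:  return mask && (result || fallback);
    default: return mask && (result != fallback);  /* exclusive or */
    }
}
\end{lstlisting}
\vv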

\subsubsection{div}
\label{table:divInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 14 & all types. Optional for integer vectors \\ \hline
\end{tabular}
\vv

int32 r0 = r1 / r2 \\
int32 r0 = div(r1, r2), options = 4\\
float v0 = v1 / [r2 + 8, length = r5]
\vv

Signed division.

\vv
This instruction has multiple rounding modes. The rounding mode for integer operands is controlled by option bits (IM3) as follows:

\begin{longtable} {|p{25mm}|p{80mm}|}
\caption{Rounding modes for integer division}
\label{table:DivInstructions} \\
\hline
\bfseries Option bits 0-3 & \bfseries Meaning   \\
\hline
0 0 0 0 & Truncate towards zero (default) \\
\hline
0 1 0 0 & Nearest or even \\
0 1 0 1 & Down \\
0 1 1 0 & Up \\
0 1 1 1 & Truncate towards zero \\
\hline
other values & Not allowed \\
\hline
\end{longtable}
Truncation is always used with integer operands when there are no option bits.

\vv
The rounding mode for floating point operands is controlled by the mask or numeric control register. Option bits must be zero for floating point operands.

\vv
Division of floating point operands by zero gives $\pm$INF (or NAN if exceptions are enabled).

Division of integer operands by zero gives INT\_MAX or INT\_MIN.

Overflow occurs by division of INT\_MIN by -1. The result will wrap around to give INT\_MIN.
\vv
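
The integer rounding modes can be illustrated with the following C sketch for 32-bit operands. The function name div\_int32 is hypothetical. The sketch assumes that the option bits in the table are written with bit 3 first, so that options = 4 selects nearest or even, in line with the code example above. Division by zero is not modelled, and option values marked as not allowed are not checked.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Signed 32-bit division; options holds option bits 0-3. Requires b != 0. */
int32_t div_int32(int32_t a, int32_t b, unsigned options) {
    int64_t q = (int64_t)a / b;                      /* truncated quotient */
    int64_t r = (int64_t)a % b;                      /* remainder, sign of a */
    int64_t away = ((a < 0) != (b < 0)) ? -1 : 1;    /* direction away from zero */
    if (r != 0) {
        int64_t r2 = 2 * (r < 0 ? -r : r);           /* 2 * |r| */
        int64_t bm = b < 0 ? -(int64_t)b : b;        /* |b| */
        switch (options & 0xF) {
        case 4:                                      /* nearest or even */
            if (r2 > bm || (r2 == bm && (q & 1) != 0)) q += away;
            break;
        case 5:                                      /* down */
            if (away < 0) q -= 1;
            break;
        case 6:                                      /* up */
            if (away > 0) q += 1;
            break;
        default:                                     /* 0 and 7: truncate */
            break;
        }
    }
    if (q > INT32_MAX) return INT32_MIN;             /* INT_MIN / -1 wraps around */
    return (int32_t)q;
}
\end{lstlisting}

For example, 7 / 2 gives 3 with truncation, 4 with nearest or even, 3 with rounding down, and 4 with rounding up.
\vv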

\subsubsection{div\_rev}
\label{table:divRevInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 16 & all types. Optional for integer vectors \\ \hline
\end{tabular}
\vv

int32 r0 = 10 / r2 \\
int32 v0 = div\_rev(v1, v2), options = 4
\vv

Same as div, with the two source operands swapped.

The rounding mode is controlled in the same way as for the div instruction.
\vv

\subsubsection{div\_u}
\label{table:divUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 15 & all integer types. Optional for integer vectors \\ \hline
\end{tabular}
\vv

uint32 r0 = r1 / r2 \\
uint32 v0 = div\_u(v1, v2), options=4
\vv

Unsigned integer division.

The rounding mode is controlled in the same way as for the div instruction, see page \pageref{table:DivInstructions}.

\vv
Division by zero gives UINT\_MAX.
\vv

\subsubsection{div\_ex}
\label{table:divExInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 24 & Integer vectors. Optional for more than one element \\ \hline
\end{tabular}
\vv

Divide a vector of double-size signed integers RS by signed integers RT. RS has element size 2$\cdot$OS. These are divided by the even-numbered
elements of RT with size OS. The truncated results are stored in the even-numbered elements of RD. The remainders are stored in the odd-numbered elements of RD.
(OS = operand size).
\vv

\subsubsection{div\_ex\_u}
\label{table:divExUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 25 & Integer vectors. Optional for more than one element \\ \hline
\end{tabular}
\vv

Divide a vector of double-size unsigned integers RS by unsigned integers RT. RS has element size 2$\cdot$OS. These are divided by the even-numbered elements of RT with size OS. The truncated results are stored in the even-numbered elements of RD. The remainders are stored in the odd-numbered elements of RD.
(OS = operand size).
\vv

\subsubsection{max}
\label{table:maxInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 22 & all types \\ \hline
\end{tabular}
\vv

int32 r0 = max(r1, r2) \\
float v0 = max(v1, v2)
\vv

Get the maximum of two numbers:

max(src1,src2) = src1 \textgreater{} src2 ? src1 : src2
\vv

Integer operands are treated as signed.
\vv

The handling of floating point NAN operands follows the definition of the maximum function in the 2019 revision of the IEEE floating point standard 754, which guarantees the propagation of NANs, unlike the 1985 and 2008 versions of the standard.
\vv

\subsubsection{max\_abs}
\label{table:maxAbsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 23 & all floating point types \\ \hline
\end{tabular}
\vv

float v0 = max\_abs(v1, v2)
\vv

Gives the maximum of the absolute values of two floating point numbers.
\vv

max\_abs(src1, src2) = max(abs(src1), abs(src2))
\vv

NAN values are treated in the same way as for the max instruction.
\vv

\subsubsection{max\_u}
\label{table:maxUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 23 & all integer types \\ \hline
\end{tabular}
\vv

uint32 r0 = max\_u(r1, r2)
\vv

Gives the maximum of two unsigned integers.
\vv

max\_u(src1,src2) = src1 \textgreater{} src2 ? src1 : src2
\vv

\subsubsection{min}
\label{table:minInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 20 & all types \\ \hline
\end{tabular}
\vv

int32 r0 = min(r1, r2)\\
float v0 = min(v1, v2)
\vv

Get the minimum of two numbers:

min(src1,src2) = src1 \textless{} src2 ? src1 : src2
\vv

Integer operands are treated as signed.
\vv

The handling of floating point NAN operands follows the definition of the minimum function in the 2019 revision of the IEEE floating point standard 754, which guarantees the propagation of NANs, unlike the 1985 and 2008 versions of the standard.
\vv

\subsubsection{min\_abs}
\label{table:minAbsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 21 & all floating point types \\ \hline
\end{tabular}
\vv

float v0 = min\_abs(v1, v2)
\vv

Gives the minimum of the absolute values of two floating point numbers.
\vv

min\_abs(src1, src2) = min(abs(src1), abs(src2))
\vv

NAN values are treated in the same way as for the min instruction.
\vv

\subsubsection{min\_u}
\label{table:minUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 21 & all integer types \\ \hline
\end{tabular}
\vv

uint32 r0 = min\_u(r1, r2)
\vv

Gives the minimum of two unsigned integers.
\vv

min\_u(src1,src2) = src1 \textless{} src2 ? src1 : src2
\vv

\subsubsection{mul}
\label{table:mulInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 11 & all standard types \\ \hline
multi & 46 & float16. Optional \\ \hline
1.1 C &  8 & general purpose register and 16-bit sign-extended integer constant \\ \hline
1.4 C & 36 & single precision floating point vector and broadcast half-precision floating point constant. Optional \\ \hline
1.4 C & 37 & double precision floating point vector and broadcast half-precision floating point constant. Optional \\ \hline
1.4 C & 41 & half precision floating point vector and broadcast half-precision floating point constant. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = r1 * r2 \\
float v0 *= 5.0
\vv

Multiplication.
\vv

The same instruction can be used for signed and unsigned integers.
\vv

\subsubsection{mul\_add}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 49 & mul\_add: dest = $\pm$ src1 $\cdot$ src2 $\pm$ src3. All types. Optional \\
multi & 50 & mul\_add2: dest = $\pm$ src1 $\cdot$ src3 $\pm$ src2. All types. Optional \\
multi & 48 & mul\_add. float16. Optional \\ \hline
\hline
\end{tabular}
\vv

\vv

The fused multiply-and-add instruction can often improve the performance of floating point code significantly. The intermediate product is calculated with extended precision according to the IEEE 754-2008 standard.
\vv

The signs of the operands can be inverted as indicated by the following option bits:

\begin{longtable} {|p{20mm}|p{75mm}|}
\hline
\bfseries Option bits &  \bfseries Meaning   \\
\hline
bit 0 & change sign of product in even-numbered vector elements \\
bit 1 & change sign of product in odd-numbered vector elements \\
bit 2 & change sign of addend in even-numbered vector elements \\
bit 3 & change sign of addend in odd-numbered vector elements \\
\hline
\end{longtable}

\vv
These option bits make it possible to do multiply-and-add, multiply-and-subtract, multiply-and-reverse-subtract, etc. It can also do multiply with alternating add and subtract, which is useful in calculations with complex numbers.
There is no sign change if there are no option bits.
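
\vv
The effect of the four option bits on a vector can be sketched in C as follows. The function name mul\_add\_vec is hypothetical. The sketch uses a plain multiplication followed by an addition, so the extended precision of the intermediate product is not modelled.

\begin{lstlisting}[frame=none,language=C]
#include <stddef.h>

/* dest[i] = +-(a[i] * b[i]) +- c[i], with signs chosen by element parity. */
void mul_add_vec(double *dest, const double *a, const double *b,
                 const double *c, size_t n, unsigned options) {
    for (size_t i = 0; i < n; i++) {
        unsigned odd = (unsigned)(i & 1);
        unsigned neg_product = (options >> odd) & 1;        /* bit 0 even, bit 1 odd */
        unsigned neg_addend  = (options >> (2 + odd)) & 1;  /* bit 2 even, bit 3 odd */
        double p = a[i] * b[i];
        double s = c[i];
        dest[i] = (neg_product ? -p : p) + (neg_addend ? -s : s);
    }
}
\end{lstlisting}

With options = 4, the addend is subtracted in even-numbered elements and added in odd-numbered elements, which is the alternating pattern mentioned above for complex number calculations.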

\vv
Support for integer operands is optional. Support for floating point operands is optional but desired.
\vv

\subsubsection{mul\_ex}
\label{table:mulExInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 26 & integer vectors \\ \hline
\end{tabular}
\vv

int32 v0 = mul\_ex(v1, v2)
\vv

Extended multiply, signed.
\vv

Multiply the even-numbered signed integer vector elements to give a double-size result. Each result extends into the following odd-numbered vector element.
\vv

\subsubsection{mul\_ex\_u}
\label{table:mulExUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 27 & integer vectors \\ \hline
\end{tabular}
\vv

uint32 v0 = mul\_ex\_u(v1, v2)
\vv

Extended multiply, unsigned.
\vv

Multiply the even-numbered unsigned integer vector elements to give a double-size result. Each result extends into the following odd-numbered vector element.
\vv

\subsubsection{mul\_hi}
\label{table:mulHiInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 12 & integer vectors \\ \hline
\end{tabular}
\vv

int32 r0 = mul\_hi(r1, r2) \\
int32 v0 = mul\_hi(v1, 2)
\vv

High part of signed integer product.
\vv

dest = (src1 $\cdot$ src2) $>>$ OS

(Signed, OS = operand size in bits).
\vv

\subsubsection{mul\_hi\_u}
\label{table:mulHiUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 13 & integer vectors \\ \hline
\end{tabular}
\vv

uint32 r0 = mul\_hi\_u(r1, r2)
\vv

High part of unsigned integer product.
\vv

dest = (src1 $\cdot$ src2) $>>$ OS

(Unsigned, OS = operand size in bits).
\vv
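
The high part of the signed and unsigned products can be sketched in C for an operand size of 32 bits. The function names mul\_hi\_s32 and mul\_hi\_u32 are hypothetical. The signed version assumes that a right shift of a negative 64-bit integer is an arithmetic shift, as it is on mainstream compilers.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* High 32 bits of the signed 64-bit product (assumes arithmetic shift). */
int32_t mul_hi_s32(int32_t a, int32_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 32);
}

/* High 32 bits of the unsigned 64-bit product. */
uint32_t mul_hi_u32(uint32_t a, uint32_t b) {
    uint64_t product = (uint64_t)a * (uint64_t)b;
    return (uint32_t)(product >> 32);
}
\end{lstlisting}

The two versions differ only for operands with the top bit set. For example, the bit pattern 0xFFFFFFFF multiplied by 1 gives a high part of -1 when signed and 0 when unsigned.
\vv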

\subsubsection{mul\_2pow}
\label{table:mul2PosInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 32 & all floating point types \\ \hline
\end{tabular}
\vv

Multiply by power of 2.

dest = src1 * $2^{src2}$

src1 and dest are floating point vectors, while src2 is interpreted as a signed integer vector with the same element size as src1 and dest.
\vv

Overflow will produce infinity. In case of underflow, the result will be zero rather than a subnormal number, regardless of the control bits in the mask or numeric control register.
The reason is that speed has priority here.
This instruction will typically take a single clock cycle, while a floating point multiplication by a power of 2 takes multiple clock cycles.
This makes it useful for fast multiplication or division by a power of 2.
\vv
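
The reason this operation can be fast is that it only needs to adjust the exponent field. The following C sketch shows the idea for double precision. The function name mul\_2pow\_f64 is hypothetical. Zero, subnormal, infinite and NAN inputs are returned unchanged, and the sign is kept on overflow and underflow; both are assumptions of the sketch, not statements about the hardware.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>
#include <string.h>

/* x * 2^n for a normal double x, done by adding n to the exponent field. */
double mul_2pow_f64(double x, int64_t n) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ULL;
    int64_t exponent = (int64_t)((bits >> 52) & 0x7FF);
    if (exponent == 0 || exponent == 0x7FF) return x;  /* not modelled */
    if (n > 4096) n = 4096;        /* clamp so that exponent + n cannot overflow */
    if (n < -4096) n = -4096;
    exponent += n;
    if (exponent >= 0x7FF) {
        bits = sign | 0x7FF0000000000000ULL;           /* overflow: infinity */
    } else if (exponent <= 0) {
        bits = sign;                                   /* underflow: zero */
    } else {
        bits = (bits & 0x800FFFFFFFFFFFFFULL) | ((uint64_t)exponent << 52);
    }
    memcpy(&x, &bits, sizeof x);
    return x;
}
\end{lstlisting}
\vv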

This instruction has the same op1 code as shift\_left, but applies to floating point types only.
\vv

\subsubsection{rem}
\label{table:remInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 18 & all types. Optional for vectors of more than one element \\ \hline
\end{tabular}
\vv

int32 r0 = r1 \% r2 \\
float v0 = rem(v1, v2)
\vv

Modulo.

\vv
Signed with integer operands or floating point operands.

\vv
A floating point number modulo zero gives NAN.
An integer modulo zero gives zero.
\vv

\subsubsection{rem\_u}
\label{table:remUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 19 & integers. Optional for vectors of more than one element \\ \hline
\end{tabular}
\vv

uint32 r0 = r1 \% r2
\vv

Unsigned modulo or remainder.

\vv
An integer modulo zero gives zero.
\vv

\subsubsection{round}
\label{table:roundInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 14 & floating point vectors \\ \hline
\end{tabular}
\vv

float v0 = round(v1, 0)
\vv

Round floating point number to integer in floating point representation.
\vv

The rounding mode is specified in bits 0-1 of IM1. See table \ref{table:maskBits} on page \pageref{table:maskBits}.
\vv

\subsubsection{roundp2}
\label{table:roundP2Instruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B &  3 & g.p. registers \\ \hline
\end{tabular}
\vv

int64 r0 = roundp2(r1, 1)
\vv

Round unsigned integer up or down to the nearest power of 2.
\vv

Options:

\label{table:roundp2Options}
\begin{tabular}{|p{16mm}|p{122mm}|}
\hline
\bfseries IM1 bits & \bfseries meaning \\ \hline
bit 0 & 0: Round down to power of 2:\newline
dest = 1 \textless\textless{} bitscan\_reverse(src1).\newline
1: Round up to power of 2:\newline
dest = ((src1 \& (src1-1)) == 0) ? src1 : 1 \textless\textless{}  (bitscan\_reverse(src1) + 1)
\\ \hline
bit 4 & 0: returns 0 if the input is 0.\newline
1: returns -1 if the input is 0.\\ \hline
bit 5 & 0: returns 0 if the result overflows.\newline
1: returns -1 if the result overflows.\\ \hline
\end{tabular}
\vv
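
The formulas and options in the table above can be combined into the following C sketch for a 64-bit operand. The function names roundp2\_u64 and bitscan\_reverse\_u64 are hypothetical, and -1 is represented as the all-ones value.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Index of the highest set bit of a nonzero value. */
int bitscan_reverse_u64(uint64_t x) {
    int index = -1;
    while (x != 0) {
        x >>= 1;
        index++;
    }
    return index;
}

/* Round src to a power of 2; im1 holds the option bits 0, 4 and 5. */
uint64_t roundp2_u64(uint64_t src, unsigned im1) {
    if (src == 0) return (im1 & 0x10) ? UINT64_MAX : 0;       /* bit 4 */
    int top = bitscan_reverse_u64(src);
    if ((im1 & 1) == 0) return (uint64_t)1 << top;            /* round down */
    if ((src & (src - 1)) == 0) return src;                   /* already a power of 2 */
    if (top == 63) return (im1 & 0x20) ? UINT64_MAX : 0;      /* overflow, bit 5 */
    return (uint64_t)1 << (top + 1);                          /* round up */
}
\end{lstlisting}
\vv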

\subsubsection{round2n}
\label{table:round2nInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 15 & vector registers. Optional \\ \hline
\end{tabular}
\vv

float v0 = round2n(v1, -4)
\vv

Round to nearest multiple of $2^n$.

dest = $2^n\cdot$ round($2^{-n}\cdot$ src1)

n is a signed integer constant in IM1.
\vv
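
As a worked example, assuming rounding to nearest, take $n = -4$ and src1 = 0.3:

dest = $2^{-4}\cdot$ round($2^{4}\cdot 0.3$) = $2^{-4}\cdot$ round(4.8) = $5/16$ = 0.3125
\vv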

\subsubsection{sqrt}
\label{table:sqrtInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 28 & floating point vectors. Optional \\ \hline
\end{tabular}
\vv

Square root.
\vv

\subsubsection{sub}
\label{table:subInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  9 & all standard types \\ \hline
multi & 45 & float16. Optional \\ \hline
2.9   &  3 & g.p. register and 32-bit zero-extended constant \\ \hline
\end{tabular}
\vv

int32 r0 = r1 - r2 \\
int32 r0 = r1 - 2 \\
int32+ r0 -= 4 \\
int32+ r0-{-} \\
float v0 = v1 - [r2 + 8, length = r5]
\vv

Subtraction.
\vv

\subsubsection{sub\_rev}
\label{table:subRevInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 10 & all types \\ \hline
%1.1 C &  3 & g.p. register and 16-bit sign-extended constant \\ \hline
\end{tabular}
\vv

int32 r0 = 1 - r2 \\
int32 v0 = - v2 + v1 \\
float v0 = -v1 + [r2 + 8, length = r5]
\vv

Reverse subtraction.
\vv

dest = src2 - src1.
\vv

\subsection{Arithmetic instructions with carry, overflow check, or saturation}
These instructions do not generate traps on overflow because they provide alternative ways of handling overflow.
\vv

\subsubsection{abs}
See page \pageref{table:absInstruction}.
\vv

\subsubsection{add\_c}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 42 & integer vectors with two elements. Optional \\ \hline
\end{tabular}
\vv

\vv

The vector has two elements. The upper element of src1 is used as carry in. The upper element of dest is used as carry out. Only the lower element of src2 is used.
\vv

Longer vectors are not supported. See page
\pageref{highPrecisionArithmetic} for an alternative for longer vectors.
\vv
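
The carry scheme described above can be sketched in C with 64-bit elements. The type carry\_pair and the function name add\_with\_carry are hypothetical. The sketch assumes that only bit 0 of the upper element is significant as carry.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Two-element vector: value is the lower element, carry the upper element. */
typedef struct {
    uint64_t value;
    uint64_t carry;
} carry_pair;

/* dest.value = src1.value + src2.value + carry in; dest.carry = carry out. */
carry_pair add_with_carry(carry_pair src1, carry_pair src2) {
    uint64_t carry_in = src1.carry & 1;        /* assumption: bit 0 only */
    uint64_t sum = src1.value + src2.value;
    uint64_t carry_out = sum < src1.value;     /* carry from the first addition */
    sum += carry_in;
    carry_out |= (uint64_t)(sum < carry_in);   /* carry from adding carry in */
    carry_pair dest = { sum, carry_out };
    return dest;
}
\end{lstlisting}

A 128-bit addition can be built in this model by adding the lower halves first and passing the resulting carry element on to the addition of the upper halves.
\vv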

\subsubsection{add\_oc}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 38 & vector registers. Optional \\ \hline
\end{tabular}
\vv

\vv

Instructions with overflow check use the even-numbered vector elements for arithmetic instructions. Each following odd-numbered vector element is used for overflow detection.
\vv

Overflow conditions are indicated with the following bits:
\vv

bit 0. Unsigned integer overflow (carry or borrow).

bit 1. Signed integer overflow.
\vv

The values are propagated so that the overflow result of the operation is OR'ed with the corresponding values of both input operands.
\vv
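
The overflow bits and their propagation can be sketched in C for an addition of 32-bit elements. The type oc\_pair and the function name add\_oc\_32 are hypothetical. Each pair stands for an even-numbered element holding the value and the following odd-numbered element holding the overflow flags.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

typedef struct {
    uint32_t value;   /* even-numbered element */
    uint32_t flags;   /* odd-numbered element: bit 0 unsigned, bit 1 signed */
} oc_pair;

/* Add two values and accumulate overflow flags from both inputs. */
oc_pair add_oc_32(oc_pair a, oc_pair b) {
    uint32_t sum = a.value + b.value;
    uint32_t flags = a.flags | b.flags;             /* propagate earlier overflows */
    if (sum < a.value) flags |= 1;                  /* unsigned overflow (carry) */
    if (((a.value ^ sum) & (b.value ^ sum)) >> 31)  /* both inputs differ in sign */
        flags |= 2;                                 /* from the sum: signed overflow */
    oc_pair dest = { sum, flags };
    return dest;
}
\end{lstlisting}

Because the flags are OR'ed through every operation, a chain of calculations needs only one overflow test at the end.
\vv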

\subsubsection{add\_ss}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 32 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Overflow and underflow produce INT\_MAX and INT\_MIN, respectively.
\vv

\subsubsection{add\_us}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 33 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Overflow produces UINT\_MAX.
\vv
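
Signed and unsigned saturation as described above can be sketched in C for 32-bit addition. The function names add\_sat\_s32 and add\_sat\_u32 are hypothetical.

\begin{lstlisting}[frame=none,language=C]
#include <stdint.h>

/* Signed addition clamped to the range INT_MIN ... INT_MAX. */
int32_t add_sat_s32(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + b;        /* exact in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Unsigned addition clamped to UINT_MAX. */
uint32_t add_sat_u32(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;                /* wraps around on overflow */
    return sum < a ? UINT32_MAX : sum;
}
\end{lstlisting}
\vv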

\subsubsection{compress\_ss}
\label{table:compressSsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 5 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Compress, signed with saturation.
\vv

Same as compress (see page \pageref{table:compressInstruction}). Integers are treated as signed and compressed with saturation. Floating point operands cannot be used.
Masks cannot be used and overflow traps cannot be enabled for this instruction.
\vv
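
Compression with signed saturation can be sketched in C for the conversion of 32-bit elements to 16-bit elements. The function name compress\_ss\_32to16 is hypothetical, and plain arrays stand in for vector registers.

\begin{lstlisting}[frame=none,language=C]
#include <stddef.h>
#include <stdint.h>

/* Narrow n signed 32-bit elements to 16 bits, clamping out-of-range values. */
void compress_ss_32to16(int16_t *dest, const int32_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t v = src[i];
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        dest[i] = (int16_t)v;
    }
}
\end{lstlisting}

The unsigned variant differs only in its limits: values are clamped to the range from 0 to the maximum of the smaller unsigned type.
\vv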

\subsubsection{compress\_us}
\label{table:compressUsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 6 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Compress, unsigned with saturation.
\vv

Same as compress (see page \pageref{table:compressInstruction}). Integers are treated as unsigned and compressed with saturation. Floating point operands cannot be used.
Masks cannot be used and overflow traps cannot be enabled for this instruction.
\vv

\subsubsection{div\_oc}
\label{table:divOcInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 41 & vector registers. Optional \\ \hline
\end{tabular}
\vv

Divide signed integers with overflow check.

\vv

\subsubsection{mul\_oc}
\label{table:mulOcInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 40 & vector registers. Optional \\ \hline
\end{tabular}
\vv

Multiply integers with overflow check.

\vv

\subsubsection{mul\_ss}
\label{table:mulSsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 36 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Multiply signed integers with saturation.

Overflow and underflow produce INT\_MAX and INT\_MIN, respectively.
\vv

\subsubsection{mul\_us}
\label{table:mulUsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 37 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Multiply unsigned integers with saturation.

Overflow produces UINT\_MAX.
\vv

\subsubsection{sub\_b}
\label{table:subBInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 43 & integer vectors with two elements. Optional \\ \hline
\end{tabular}
\vv

Subtraction with borrow.
\vv

The vector has two elements. The upper element of src1 is used as borrow in. The upper element of dest is used as borrow out. Only the lower element of src2 is used.
\vv

Longer vectors are not supported. See page
\pageref{highPrecisionArithmetic} for an alternative for longer vectors.
\vv

\subsubsection{sub\_oc}
\label{table:subOcInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 39 & vector registers. Optional \\ \hline
\end{tabular}
\vv

Subtract integers with overflow check.

\vv

\subsubsection{sub\_ss}
\label{table:subSsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 34 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Subtract signed integers with saturation.

Overflow and underflow produce INT\_MAX and INT\_MIN, respectively.
\vv

\subsubsection{sub\_us}
\label{table:subUsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 35 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

Subtract unsigned integers with saturation.

Overflow and underflow produce UINT\_MAX and 0, respectively.
\vv
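
As an illustration, assuming 8-bit unsigned elements (UINT\_MAX = 255): \\
$5 - 10$ would be negative, so the saturated result is 0. \\
$200 - 50 = 150$ is in range and is returned unchanged.
\vv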

\subsection{Logic and bit manipulation instructions}

\subsubsection{and}
\label{table:andInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 26 & all types \\ \hline
1.1 C & 12 & 32-bit register and 8-bit signed constant shifted left by another constant \\ \hline
1.1 C & 13 & 64-bit register and 8-bit signed constant shifted left by another constant \\ \hline
2.9   &  5 & g.p. register and 32-bit constant shifted left by 32 \\ \hline
1.4 C &  2 & vector of 16-bit integers, and broadcast 16-bit constant. Optional \\ \hline
1.4 C & 12 & vector of 32-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
1.4 C & 13 & vector of 64-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = r1 \& r2 \\
int32 v0 = v1 \& 2
\vv

Bitwise boolean and.
\vv

Floating point operands are treated as integers.

Do not use a floating point type with a constant operand unless you want the operand to be interpreted as floating point.
\vv

\subsubsection{or}
\label{table:orInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 27 & all types \\ \hline
1.1 C & 14 & 32-bit register and 8-bit signed constant shifted left by another constant \\ \hline
1.1 C & 15 & 64-bit register and 8-bit signed constant shifted left by another constant \\ \hline
2.9   &  6 & g.p. register and 32-bit constant shifted left by 32 \\ \hline
1.4 C &  3 & vector of 16-bit integers, and broadcast 16-bit constant. Optional \\ \hline
1.4 C & 14 & vector of 32-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
1.4 C & 15 & vector of 64-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = r1 $|$ r2 \\
int32 v0 = v1 $|$ 2
\vv

Bitwise boolean or.
\vv

Floating point operands are treated as integers.

Do not use a floating point type with a constant operand unless you want the operand to be interpreted as floating point.
\vv

\subsubsection{xor}
\label{table:xorInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 28 & all types \\ \hline
1.1 C & 16 & 32-bit register and 8-bit signed constant shifted left by another constant \\ \hline
1.1 C & 17 & 64-bit register and 8-bit signed constant shifted left by another constant \\ \hline
2.9   &  7 & g.p. register and 32-bit constant shifted left by 32 \\ \hline
1.4 C &  4 & vector of 16-bit integers, and broadcast 16-bit constant. Optional \\ \hline
1.4 C & 16 & vector of 32-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
1.4 C & 17 & vector of 64-bit integers, and broadcast sign-extended 8-bit constant shifted left by another constant. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = r1 \^{} r2 \\
int32 v0 = v1 \^{} 2
\vv

Bitwise boolean exclusive or.
\vv

Floating point operands are treated as integers.

Do not use a floating point type with a constant operand unless you want the operand to be interpreted as floating point.
\vv

\subsubsection{bit\_reverse byte\_reverse}
\label{table:bitReverseInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 20 & vectors \\ \hline
\end{tabular}
\vv

int32 v0 = byte\_reverse(v1, 0)\\
int32 v0 = bit\_reverse(v1, 1)
\vv

IM1 = 0: Reverse the order of bytes within each vector element. This is useful for converting big-endian file data.\\
IM1 = 1: Reverse the order of bits in each element of a vector.
\vv

\subsubsection{bits2bool}
\label{table:bits2boolInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 12 & integer vectors \\ \hline
\end{tabular}
\vv

int32 v0 = bits2bool(r1, v2)
\vv

Expand contiguous bits in a vector register to a boolean vector with each bit of the source going into bit 0 of each element of the destination.
The remaining bits of each element are copied from the first element of the mask or the numeric control register. The number of mask or NUMCONTR bits available is implementation dependent.
\vv

The length in bytes of the result vector is specified by a general purpose register in RS.
\vv

This instruction cannot have a fallback register.
\vv

\subsubsection{bitscan}
\label{table:bitscanInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B &  2 & general purpose registers \\ \hline
1.3 B & 21 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = bitscan(r1, 0)\\
int64 v0 = bitscan(v1, 1)
\vv

Bit scan forward or reverse. Option bits are given in the second operand:
\vv

\label{table:bitscanOptions}
\begin{tabular}{|p{16mm}|p{122mm}|}
\hline
\bfseries IM1 bits & \bfseries meaning \\ \hline
bit 0 & 0: forward scan. Find index to the lowest set bit.\newline
1: reverse scan. Find index to the highest set bit.\\
\hline
bit 4 & 0: returns  0 if the input is 0.\newline
1: returns -1 if the input is 0.\\ \hline
\end{tabular}
\vv
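
A small worked example, assuming r1 contains 0x28 (binary 101000) and following the option bits in the table above: \\
int32 r0 = bitscan(r1, 0) // forward scan: the lowest set bit is bit 3, r0 = 3 \\
int32 r0 = bitscan(r1, 1) // reverse scan: the highest set bit is bit 5, r0 = 5 \\
If the input is zero, the result is 0 when bit 4 of IM1 is clear, and $-1$ when it is set, for example with IM1 = 0x10.
\vv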

\subsubsection{bool\_reduce}
\label{table:boolReduceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 26 & integer vectors \\ \hline
\end{tabular}
\vv

int32 v0 = bool\_reduce(v1)
\vv

A boolean vector is reduced by combining bit 0 of all elements.

The output is a scalar integer where bit 0 is the AND combination of all the bits, and bit 1 is the OR combination of all the bits. The remaining bits are reserved for future use.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{bool2bits}
\label{table:bool2bitsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 25 & integer vectors \\ \hline
\end{tabular}
\vv

int64 v0 = bool2bits(v1)
\vv

A boolean vector with n elements is packed into the lower n bits of RD, taking bit 0 of each element.
The length of RD will be at least sufficient to contain n bits.
\vv

This instruction cannot have a mask.
\vv

\subsubsection{category\_reduce}
\label{table:categoryReduceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 26 & floating point vectors \\ \hline
\end{tabular}
\vv

float v0 = category\_reduce(v1)
\vv

A floating point vector is analyzed and each element is classified as belonging to one of the eight categories listed below. Each bit in the output indicates that at least one element in RT belongs to the corresponding category.
\vv

\begin{tabular}{|p{24mm}|p{115mm}|}
\hline
\bfseries Bit number & \bfseries Category \\ \hline
0 & at least one element is NAN \\
1 & at least one element is zero \\
2 & at least one element is negative subnormal \\
3 & at least one element is positive subnormal \\
4 & at least one element is negative normal \\
5 & at least one element is positive normal \\
6 & at least one element is negative infinity \\
7 & at least one element is positive infinity \\
\hline
\end{tabular}
\vv

This instruction cannot have a mask.
\vv

\subsubsection{clear\_bit}
\label{table:clearBitInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 36 & all types \\ \hline
\end{tabular}
\vv

Clear bit number src2 in src1.
\vv

dest = src1 \& \~{}(1 $<<$ src2).

\vv
Floating point operands are treated as integers.
\vv

\subsubsection{set\_bit}
\label{table:setBitInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 37 & all integer types \\ \hline
\end{tabular}
\vv

Set bit number src2 in src1 to one.
\vv

dest = src1 $|$ (1 $<<$ src2)
\vv

\subsubsection{toggle\_bit}
\label{table:toggleBitInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 38 & all types \\ \hline
\end{tabular}
\vv

Change the value of bit number src2 in src1 to its opposite.
\vv

dest = src1 \^{} (1 $<<$ src2)
\vv
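
The three single-bit instructions clear\_bit, set\_bit, and toggle\_bit can be compared on the same input. As an illustration, assuming src1 = 0x0F in an 8-bit operand: \\
clear\_bit with src2 = 2: 0x0F \& \~{}0x04 = 0x0B \\
set\_bit with src2 = 6: 0x0F $|$ 0x40 = 0x4F \\
toggle\_bit with src2 = 0: 0x0F \^{} 0x01 = 0x0E \\
toggle\_bit with src2 = 4: 0x0F \^{} 0x10 = 0x1F
\vv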

\subsubsection{compare}
See page \pageref{table:compareInstruction}
\vv

\subsubsection{fp\_category}
\label{table:fpCategoryInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B & 17 & floating point vectors \\ \hline
\end{tabular}
\vv

float v0 = fp\_category(v1, 1)
\vv

The input is a floating point vector. The output is a boolean vector where bit 0 of each element indicates if the input RS belongs to any of the categories indicated by the bits in the immediate operand IM1. The remaining bits of the output are taken from the numeric control register. The number of NUMCONTR bits available is implementation dependent.
Any floating point value will belong to one, and only one, of these categories.

\begin{longtable} {|p{20mm}|p{90mm}|}
\caption{Meaning of bits in fp\_category}
\label{table:fpCategoryInstructionBits} \\
\hline
\bfseries Bit number & \bfseries Meaning  \\
\hline
0 & $\pm$ NAN \\
1 & $\pm$ Zero \\
2 & $-$ Subnormal \\
3 & $+$ Subnormal \\
4 & $-$ Normal \\
5 & $+$ Normal \\
6 & $-$ Infinite  \\
7 & $+$ Infinite  \\
\hline
\end{longtable}
\vv
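
The immediate operand is formed by OR'ing the bits of the wanted categories. A few illustrative values, derived from the table above: \\
IM1 = 0x01 (bit 0) tests for NAN. \\
IM1 = 0x0C (bits 2 and 3) tests for subnormal numbers of either sign. \\
IM1 = 0xC0 (bits 6 and 7) tests for infinity of either sign. \\
IM1 = 0x3E (bits 1 through 5) tests for finite values, including zero and subnormal numbers.
\vv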

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.6 &  2 & integer vectors \\ \hline
\end{tabular}
\vv

\vv

Make a mask from the bits of the 32-bit integer constant src2. Each bit of the constant goes into bit 0 of one element of the output. The remaining bits of each element are taken from a mask register, or from NUMCONTR if there is no mask. The number of mask or NUMCONTR bits available is implementation dependent.
The length of the output is the same as the length of src1. If there are more than 32 elements in the vector then the bit pattern of src2 is repeated.
\vv

\subsubsection{make\_sequence}
\label{table:makeSequenceInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.3 B &  4 & all vectors \\ \hline
\end{tabular}
\vv

int32 v0 = make\_sequence(r1, 2)
\vv

Makes a vector of sequential numbers. The number of elements is indicated by a general purpose register.
The first element is equal to the immediate operand IM1, the next element is IM1+1, etc. IM1 must be an integer in the range $-128$ to $127$.
\vv
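
For example, assuming r1 = 4, the instruction int32 v0 = make\_sequence(r1, 2) would produce a vector of four 32-bit elements with the values (2, 3, 4, 5).
\vv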

\subsubsection{mask\_length}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.2.7 & 1.1 & integer vectors \\ \hline
\end{tabular}
\vv

int64 v0 = mask\_length(v1, r2, 0), options=2
\vv

Make a boolean vector to mask the first n bytes of a vector, where n is the value of a general purpose register r2. \\
The result vector will have the same length as the input vector v1. r2 indicates the length of the part that is enabled by the mask.
\vv

The following option bits can be specified: \\
bit 0 = 0: bit 0 will be 1 in the first n bytes in the output and 0 in the rest. \\
bit 0 = 1: bit 0 will be 0 in the first n bytes in the output and 1 in the rest. \\
bit 1 = 1: copy remaining bits from input vector v1 into each vector element. \\
bit 2 = 1: copy remaining bits from the numeric control register. \\
bit 4 = 1: broadcast remaining bits from a constant (IM2) into all 32-bit words of the result. \\
\hspace{17mm} Bit 1-7 of IM2 go to bit 1-7 of the result. \\
\hspace{17mm} Bit 8-11 of IM2 go to bit 20-23 of the result. \\
\hspace{17mm} Bit 12-15 of IM2 go to bit 26-29 of the result. \\
Output bits that are not set by any of these options will be zero.
If multiple options are specified, the results will be OR'ed.

\vv
This instruction can have a mask but not a fallback register. The fallback value is zero.
\vv

\subsubsection{move\_bits}
\label{table:moveBitsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.0.7 & 0.1 & general purpose registers. Optional \\ \hline
2.2.7 & 0.1 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

int16 r0 = move\_bits(r1, r2, 3, 4, 5) \\
int32 v0 = move\_bits(v1, v2, 3, 4, 5) \\
\vv

Extract, insert, or move bit fields.
\vv

Takes one or more contiguous bits from position src3 in the second source operand (src2) and inserts them into position src4 in the first source operand (src1). The remaining bits of src1 are unchanged. \\
The third source operand (src3) is the bit position in src2 to take bits from. \\
The fourth source operand (src4) is the bit position to insert the bits in. \\
The fifth source operand (src5) is the number of bits to move. \\
The first two source operands must be registers, the remaining operands must be constants.
\vv

Definition:\\
m = (1 $<<$ src5) - 1 \\
b = src2 $>>$ src3 \\
dest = (src1 \& \~{}(m $<<$ src4)) $|$ ((b \& m) $<<$ src4)
\vv

Examples:\\
int16 r1 = 0x1234\\
int16 r2 = 0xABCD\\
// extract 4 bits from r2, starting from position 8, and insert into position 0 of r1:\\
int16 r0 = move\_bits(r1, r2, 8, 0, 4) // = 0x123B \\
// insert 8 bits from position 0 of r2 into position 4 of r1:\\
int16 r0 = move\_bits(r1, r2, 0, 4, 8) // = 0x1CD4 \\
// move 4 bits from position 8 in r2 into the same position of r1:\\
int16 r0 = move\_bits(r1, r2, 8, 8, 4) // = 0x1B34 \\
\vv

\subsubsection{popcount}
\label{table:popcountInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B &  4 & general purpose registers. Optional \\ \hline
1.3 B & 22 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = popcount(r1) \\
int32 v0 = popcount(v1)
\vv

The popcount instruction counts the number of 1-bits in an integer. It can also be used for parity generation.
\vv
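
As an illustration, assuming r1 = 0xF0F1: \\
int32 r0 = popcount(r1) // 4 + 0 + 4 + 1 one-bits, r0 = 9 \\
The parity of the input is bit 0 of the result. Here bit 0 of r0 is 1, so the input has odd parity.
\vv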

\subsubsection{rotate}
\label{table:rotateInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 33 & all integer types \\ \hline
\end{tabular}
\vv

dest = rotate(src1, src2)
\vv

Rotate the bits of src1 left if src2 is positive, or right if src2 is negative.
\vv
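
As an illustration, assuming an 8-bit operand src1 = 0x81 (binary 10000001): \\
rotate(src1, 1) = 0x03 (binary 00000011), one position to the left \\
rotate(src1, $-1$) = 0xC0 (binary 11000000), one position to the right
\vv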

\subsubsection{shift\_left}
\label{table:shiftLeftInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 32 & all integer types \\ \hline
\end{tabular}
\vv

Shift integer left.

dest = src1 $<<$ src2
\vv

The result is zero if src2 is outside the range 0 $\leq$ src2 $<$ number\_of\_bits.
\vv

This instruction has the same op1 code as mul\_2pow, but applies to integer operand types only.
\vv

\subsubsection{shift\_right\_s}
\label{table:shiftRightSInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 34 & all integer types \\ \hline
\end{tabular}
\vv

Shift integer right with sign extension (arithmetic shift).
\vv

int32 dest = src1 $>>$ src2
\vv

The result is 0 or -1 if src2 is outside the range 0 $\leq$ src2 $<$ number\_of\_bits.

\vv

\subsubsection{shift\_right\_u}
\label{table:shiftRightUInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 35 & all integer types \\ \hline
\end{tabular}
\vv

Shift integer right with zero extension (logical shift).
\vv

uint32 dest = src1 $>>$ src2
\vv

The result is zero if src2 is outside the range 0 $\leq$ src2 $<$ number\_of\_bits.
\vv
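
The difference between shift\_right\_s and shift\_right\_u shows when the sign bit is set. As an illustration, assuming an 8-bit operand src1 = 0xF0 (binary 11110000) and src2 = 2: \\
shift\_right\_s gives 0xFC (binary 11111100), which is $-16 / 4 = -4$ as a signed number. \\
shift\_right\_u gives 0x3C (binary 00111100), which is $240 / 4 = 60$ as an unsigned number.
\vv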

\subsubsection{funnel\_shift}
\label{table:funnelShiftInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 53 & all integer types \\ \hline
\end{tabular}
\vv

int64 r1 = funnel\_shift(r2, r3, r4) \\
int64 v1 = funnel\_shift(v2, v3, r4) \\
\vv

This instruction concatenates two bit fields and shifts the concatenated field to the right. This is useful for dealing with unaligned bit fields or unaligned vectors.
\vv

dest = (src1 $>>$ src3) $|$ (src2 $<<$ (operand\_size - src3))
\vv

For general purpose registers: Operand 1 (low) and operand 2 (high), with n bits each, are concatenated into a bit field with 2n bits. This bit field is shifted right by the number of bits indicated by the third operand. The lower n bits of the result are returned. The result is zero if src3 is outside the range 0 $\leq$ src3 $<$ n.
\vv

For vector registers: When the operands are vector registers, this instruction shifts the whole vector rather than the bits within each vector element. The shift count counts vector elements rather than bits.
Vector operand 1 (low) with n elements and vector operand 2 (high), with n elements or less, are concatenated into a larger vector with at most 2n elements. This concatenated vector is shifted down by the number of elements indicated by the third operand. The lower n elements of the result are returned. The result is zero if src3 is outside the range 0 $\leq$ src3 $<$ n.
\vv

Some implementations may work slowly for high shift counts.
\vv

This instruction will rotate a vector if both input vectors are the same.
\vv

A funnel shift in the opposite direction can be made by swapping the first two operands and subtracting the shift count from the operand size.
\vv
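
A worked example for general purpose registers, assuming 32-bit operands with r2 = 0x12345678 (low), r3 = 0x9ABCDEF0 (high), and r4 = 8: \\
int32 r1 = funnel\_shift(r2, r3, r4) \\
The concatenated 64-bit field is 0x9ABCDEF012345678. Shifting it right by 8 bits and keeping the lower 32 bits gives r1 = 0xF0123456. \\
With the same register for both inputs, int32 r1 = funnel\_shift(r2, r2, r4) gives 0x78123456, which is r2 rotated right by 8 bits.
\vv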

\subsubsection{select\_bits}
\label{table:selectBitsInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 52 & all integer types \\ \hline
\end{tabular}
\vv

int32 r0 = select\_bits(r1, r2, r3)
\vv

dest = src1 \& src3 \textbar{} src2 \& \~{}src3
\vv

This instruction combines bits from the first two source operands, using the third source operand as selector.
\vv
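
As an illustration, assuming 16-bit operands with src1 = 0xAAAA, src2 = 0x5555, and src3 = 0xFF00: \\
dest = (0xAAAA \& 0xFF00) $|$ (0x5555 \& 0x00FF) = 0xAA00 $|$ 0x0055 = 0xAA55 \\
The upper byte is taken from src1 where the selector bits are 1, and the lower byte is taken from src2 where the selector bits are 0.
\vv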

\subsubsection{test\_bit}
\label{table:testBitInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 39 & all integer types \\ \hline
\end{tabular}
\vv

Test the value of bit number src2 in src1, and make it the least significant bit of the output, to use as a boolean. The result is zero if src2 is out of range.
\vv

result = (src1 $>>$ src2) \& 1.
\vv

The result is indicated in bit 0 of the destination register.
The remaining bits of the output may be taken from a mask register or numeric control register. The number of mask or NUMCONTR bits available is implementation dependent.
\vv

A fallback register can be used as an operand for an extra boolean operation, with or without a mask. Only bit 0 of the fallback register is used.
The boolean operation is controlled by option bits 0-1.
Option bit 2 inverts the result, bit 3 inverts the fallback, and bit 4 inverts the mask. These options are summarized in the following table, giving the value of bit 0 of the destination register.

\begin{longtable} {|p{10mm}|p{10mm}|p{10mm}|p{10mm}|p{10mm}|p{60mm}|}
\caption{Alternative use of mask and fallback register controlled by option bits}
\label{table:AlternativeMaskUseForTestBit} \\
\hline
\bfseries bit 4 & \bfseries bit 3 & \bfseries bit 2 & \bfseries bit 1 & \bfseries bit 0 & \bfseries Output \\
\hline
0 & 0 & 0 & 0 & 0 & mask ? result : fallback \\
0 & 0 & 1 & 0 & 0 & mask ? !result : fallback \\
0 & 1 & 0 & 0 & 0 & mask ? result : !fallback \\
0 & 1 & 1 & 0 & 0 & mask ? !result : !fallback \\
1 & 0 & 0 & 0 & 0 & !mask ? result : fallback \\
1 & 0 & 1 & 0 & 0 & !mask ? !result : fallback \\
1 & 1 & 0 & 0 & 0 & !mask ? result : !fallback \\
1 & 1 & 1 & 0 & 0 & !mask ? !result : !fallback \\
\hline
0 & 0 & 0 & 0 & 1 & mask \& result \& fallback \\
0 & 0 & 1 & 0 & 1 & mask \& !result \& fallback \\
0 & 1 & 0 & 0 & 1 & mask \& result \& !fallback \\
0 & 1 & 1 & 0 & 1 & mask \& !result \& !fallback \\
1 & 0 & 0 & 0 & 1 & !mask \& result \& fallback \\
1 & 0 & 1 & 0 & 1 & !mask \& !result \& fallback \\
1 & 1 & 0 & 0 & 1 & !mask \& result \& !fallback \\
1 & 1 & 1 & 0 & 1 & !mask \& !result \& !fallback \\
\hline
0 & 0 & 0 & 1 & 0 & mask \& (result $|$ fallback) \\
0 & 0 & 1 & 1 & 0 & mask \& (!result $|$ fallback) \\
0 & 1 & 0 & 1 & 0 & mask \& (result $|$ !fallback) \\
0 & 1 & 1 & 1 & 0 & mask \& (!result $|$ !fallback) \\
1 & 0 & 0 & 1 & 0 & !mask \& (result $|$ fallback) \\
1 & 0 & 1 & 1 & 0 & !mask \& (!result $|$ fallback) \\
1 & 1 & 0 & 1 & 0 & !mask \& (result $|$ !fallback) \\
1 & 1 & 1 & 1 & 0 & !mask \& (!result $|$ !fallback) \\
\hline
0 & 0 & 0 & 1 & 1 & mask \& (result \^{} fallback) \\
0 & 0 & 1 & 1 & 1 & mask \& (!result \^{} fallback) \\
0 & 1 & 0 & 1 & 1 & mask \& (result \^{} !fallback) \\
0 & 1 & 1 & 1 & 1 & mask \& (!result \^{} !fallback) \\
1 & 0 & 0 & 1 & 1 & !mask \& (result \^{} fallback) \\
1 & 0 & 1 & 1 & 1 & !mask \& (!result \^{} fallback) \\
1 & 1 & 0 & 1 & 1 & !mask \& (result \^{} !fallback) \\
1 & 1 & 1 & 1 & 1 & !mask \& (!result \^{} !fallback) \\
\hline
\end{longtable}
\vv

The value of mask is 1 if there is no mask register.
The remaining bits are copied from the mask register if option bit 5 is set, or from the numeric control register if there is no mask and bit 5 is set. The remaining bits are zero if option bit 5 is not set. The number of mask or NUMCONTR bits available is implementation dependent.
\vv

\subsubsection{test\_bits\_and}
\label{table:testBitsAndInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 40 & all integer types \\ \hline
\end{tabular}
\vv

Test if the indicated bits are all 1.

result = ((src1 \& src2) == src2)
\vv

The result is indicated in bit 0 of the destination register.
The remaining bits of the output may be taken from a mask register or numeric control register.
\vv

A fallback register can be used as an operand for an extra boolean operation, with or without a mask. Only bit 0 of the fallback register is used. These options are controlled by option bits 0-4 in the same way as for test\_bit, as indicated in table \ref{table:AlternativeMaskUseForTestBit}.
\vv

The remaining bits are copied from the mask register if option bit 5 is set, or from the numeric control register if there is no mask and bit 5 is set. The remaining bits are zero if option bit 5 is not set. The number of mask or NUMCONTR bits available is implementation dependent.
\vv

\subsubsection{test\_bits\_or}
\label{table:testBitsOrInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 41 & all integer types \\ \hline
\end{tabular}
\vv

Test if at least one of the indicated bits is 1.

result = ((src1 \& src2) != 0)
\vv

The result is indicated in bit 0 of the destination register.
The remaining bits of the output may be taken from a mask register or numeric control register.
\vv

A fallback register can be used as an operand for an extra boolean operation, with or without a mask. Only bit 0 of the fallback register is used. These options are controlled by option bits 0-4 in the same way as for test\_bit, as indicated in table \ref{table:AlternativeMaskUseForTestBit}.
\vv

The remaining bits are copied from the mask register if option bit 5 is set, or from the numeric control register if there is no mask and bit 5 is set. The remaining bits are zero if option bit 5 is not set. The number of mask or NUMCONTR bits available is implementation dependent.
\vv
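
The difference between test\_bits\_and and test\_bits\_or can be illustrated with src1 = 0b1010, considering only bit 0 of the result and assuming that no mask or fallback register is used: \\
src2 = 0b0110: src1 \& src2 = 0b0010. test\_bits\_and gives 0 because not all indicated bits are set; test\_bits\_or gives 1. \\
src2 = 0b1010: src1 \& src2 = 0b1010. Both test\_bits\_and and test\_bits\_or give 1. \\
src2 = 0b0101: src1 \& src2 = 0. Both test\_bits\_and and test\_bits\_or give 0.
\vv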

\subsubsection{truth\_tab3}
\label{table:truthTab3Instruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.0.6 & 8.1 & general purpose registers. Optional \\ \hline
2.2.6 & 8.1 & integer vectors. Optional \\ \hline
\end{tabular}
\vv

int32 r0 = truth\_tab3(r1, r2, r3, 0xF2), options=0 \\
int32 v0 = truth\_tab3(v1, v2, v3, 0xF2), options=0
\vv

This instruction can make an arbitrary bitwise boolean function of three integer variables, expressed by an 8-bit truth table in an immediate constant. Each bit of the result is the boolean function of the corresponding bits of the three input registers, calculated for each bit position separately. The three bits from the three input registers are combined into a 3-bit index, where the bit from the first input register goes into the least significant bit and the bit from the last input register goes into the most significant bit. This index then selects one bit from the truth table to go into the result.
\vv

For example, the boolean function F = A $\&$ $\sim$ B $|$ C has the truth table 0b11110010 or 0xF2.
\vv
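
The truth table for this example can be derived by evaluating F for each value of the 3-bit index. The table below is a sketch that assumes A is the first input register (least significant index bit), B is the second, and C is the third (most significant index bit):
\vv

\begin{tabular}{|p{14mm}|p{10mm}|p{10mm}|p{10mm}|p{14mm}|}
\hline
\bfseries index & \bfseries C & \bfseries B & \bfseries A & \bfseries F \\ \hline
0 & 0 & 0 & 0 & 0 \\ \hline
1 & 0 & 0 & 1 & 1 \\ \hline
2 & 0 & 1 & 0 & 0 \\ \hline
3 & 0 & 1 & 1 & 0 \\ \hline
4 & 1 & 0 & 0 & 1 \\ \hline
5 & 1 & 0 & 1 & 1 \\ \hline
6 & 1 & 1 & 0 & 1 \\ \hline
7 & 1 & 1 & 1 & 1 \\ \hline
\end{tabular}
\vv

Reading the F column from index 7 down to index 0 gives 0b11110010 = 0xF2.
\vv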

This can be used as a universal instruction for bitwise logic functions of up to three inputs. Functions of two inputs can be obtained by using the same register for two of the three input registers.
\vv

This instruction can also be used for manipulating masks where only bit 0 contains the boolean result. The remaining bits are controlled by options according to the table below. This is useful when the result is used as a mask for floating point instructions:

\begin{longtable} {|p{20mm}|p{75mm}|}
\caption{Options for truth\_tab3}
\label{table:OptionsForTruthTab3} \\
\hline
\bfseries Options & \bfseries Meaning   \\
\hline
0 & all bits contain boolean results \\ \hline
1 & bit 0 contains a boolean result. The remaining bits are zero \\ \hline
2 & bit 0 contains a boolean result. The remaining bits are taken from a mask or numeric control register. The number of mask or NUMCONTR bits available is implementation dependent. \\ \hline
\end{longtable}
\vv

\subsection{Combined arithmetic/logic and branch instructions with integer operands}
\label{descriptionOfControlTransferInstructions}
These instructions perform an arithmetic or logic operation and a conditional jump
depending on the result. Each instruction can be coded in a number of different formats
described on page \pageref{table:jumpInstructionFormats}.
\vv

The instructions are listed below in pairs, where the second instruction has the branch condition inverted.
\vv

These instructions cannot have a mask.
The destination operand, if any, should preferably be the same as the first source operand for optimal performance. The second source operand may be a register, a memory operand, or an immediate constant with no more than 32 bits.
\vv

\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 16 & add/jump\_zero & integer \\ \hline
all & 17 & add/jump\_nzero & integer\\ \hline
\end{tabular}
\vv

Add two integer operands and jump if the result is zero.

\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 18 & add/jump\_neg & integer \\ \hline
all & 19 & add/jump\_nneg & integer\\ \hline
\end{tabular}
\vv

Add two integer operands and jump if the signed result is negative.

The result will wrap around in the case of overflow and jump if the result has the sign bit set.

\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 20 & add/jump\_pos & integer \\ \hline
all & 21 & add/jump\_npos & integer\\ \hline
\end{tabular}
\vv

Add two integer operands and jump if the signed result is positive.

The result will wrap around in the case of overflow and jump if the result is not zero and does not have the sign bit set.

\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 22 & add/jump\_overflow & integer \\ \hline
all & 23 & add/jump\_noverflow & integer\\ \hline
\end{tabular}
\vv

Add two signed integer operands and jump if the result overflows.
\vv

\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 24 & add/jump\_carry & integer \\ \hline
all & 25 & add/jump\_ncarry & integer\\ \hline
\end{tabular}
\vv

Add two unsigned integer operands and jump if the operation produces a carry.
\vv
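
The difference between overflow and carry can be illustrated with 8-bit operands, chosen here only to keep the numbers small: \\
$100 + 100 = 200$ exceeds the signed maximum of 127, so the signed addition overflows and wraps around to $-56$. There is no carry because 200 does not exceed the unsigned maximum of 255. \\
$200 + 100 = 300$ exceeds 255, so the unsigned addition produces a carry and wraps around to 44. Interpreted as signed numbers this is $-56 + 100 = 44$, which does not overflow.
\vv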

\subsubsection{increment\_compare/jump\_above/below}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 48 & increment\_compare/jump\_below & integer \\ \hline
all & 49 & increment\_compare/jump\_aboveeq & integer \\ \hline
all & 50 & increment\_compare/jump\_above & integer \\ \hline
all & 51 & increment\_compare/jump\_beloweq & integer \\ \hline
\end{tabular}
\vv

Add 1 to the first source operand and jump if the signed result is less than a certain limit. The result is saved in the destination operand. This is useful for implementing a simple ``for'' loop.
\vv

The result will wrap around from INT\_MAX to INT\_MIN in case of overflow.
\vv

\subsubsection{sub/jump\_zero}
\label{table:subJumpZeroInstruction}
\begin{tabular}{|p{20mm}|p{12mm}|p{56mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 &  0 & sub/jump\_zero & integer \\ \hline
Not 1.7 &  1 & sub/jump\_nzero  & integer\\ \hline
\end{tabular}
\vv

Subtract two integer operands and jump if the result is zero.
\vv

Immediate constants are not supported. The assembler will automatically convert a sub/jump\_zero instruction to an add/jump\_zero instruction with the negative constant.
\vv

\subsubsection{sub/jump\_neg}
\label{table:subJumpNegInstruction}
\begin{tabular}{|p{20mm}|p{12mm}|p{56mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 &  2 & sub/jump\_neg & integer \\ \hline
Not 1.7 &  3 & sub/jump\_nneg & integer\\ \hline
\end{tabular}
\vv

Subtract two integer operands and jump if the signed result is negative.

The result will wrap around in the case of overflow and jump if the result has the sign bit set.
\vv

Immediate constants are not supported. The assembler will automatically convert a sub/jump\_neg instruction to an add/jump\_neg instruction with the negative constant.
\vv

\subsubsection{sub/jump\_pos}
\label{table:subJumpPosInstruction}
\begin{tabular}{|p{20mm}|p{12mm}|p{56mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 &  4 & sub/jump\_pos & integer \\ \hline
Not 1.7 &  5 & sub/jump\_npos & integer\\ \hline
\end{tabular}
\vv

Subtract two integer operands and jump if the signed result is positive.

The result will wrap around in the case of overflow and jump if the result is not zero and does not have the sign bit set.
\vv

Immediate constants are not supported. The assembler will automatically convert a sub/jump\_pos instruction to an add/jump\_pos instruction with the negative constant.
\vv

\subsubsection{sub/jump\_overflow}
\label{table:subJumpOverflInstruction}
\begin{tabular}{|p{20mm}|p{12mm}|p{56mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 &  6 & sub/jump\_overflow & integer \\ \hline
Not 1.7 &  7 & sub/jump\_noverflow & integer\\ \hline
\end{tabular}
\vv

Subtract two signed integer operands and jump if the result overflows.
\vv

Immediate constants are not supported. The assembler will automatically convert a sub/jump\_overflow instruction to an add/jump\_overflow instruction with the negative constant.
\vv

\subsubsection{sub/jump\_borrow}
\label{table:subJumpBorrowInstruction}
\begin{tabular}{|p{20mm}|p{12mm}|p{56mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 &  8 & sub/jump\_borrow & integer \\ \hline
Not 1.7 &  9 & sub/jump\_nborrow & integer\\ \hline
\end{tabular}
\vv

Subtract two unsigned integer operands and jump if the operation produces a borrow.
\vv

Immediate constants are not supported. The assembler will automatically convert a sub/jump\_borrow instruction with an immediate constant to an add/jump\_borrow instruction with the negated constant.
\vv
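
The jump conditions of the sub/jump instructions described above can be modeled in C as shown below. This is only an illustrative sketch for 64-bit operands. The function names are hypothetical and are not part of the ForwardCom tools. The subtraction is done on unsigned values so that the wrap-around is well defined in C.

\begin{lstlisting}[frame=none]
#include <stdbool.h>
#include <stdint.h>

/* Subtraction with wrap-around on overflow (hypothetical helper) */
uint64_t sub_wrap(uint64_t a, uint64_t b) { return a - b; }

/* sub/jump_neg: jump if the result has the sign bit set */
bool sub_jump_neg(uint64_t a, uint64_t b)
{
    return (sub_wrap(a, b) >> 63) != 0;
}

/* sub/jump_pos: jump if the result is not zero and the sign bit is clear */
bool sub_jump_pos(uint64_t a, uint64_t b)
{
    uint64_t r = sub_wrap(a, b);
    return r != 0 && (r >> 63) == 0;
}

/* sub/jump_overflow: signed overflow occurs when the operands have
   different signs and the sign of the result differs from the sign of a */
bool sub_jump_overflow(uint64_t a, uint64_t b)
{
    uint64_t r = sub_wrap(a, b);
    return (((a ^ b) & (a ^ r)) >> 63) != 0;
}

/* sub/jump_borrow: an unsigned subtraction borrows when a < b */
bool sub_jump_borrow(uint64_t a, uint64_t b)
{
    return a < b;
}
\end{lstlisting}

Note that sub\_jump\_neg in this model does not jump for the most negative integer minus one, because the wrapped result has the sign bit cleared.
\vv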

\subsubsection{sub\_maxlen/jump\_pos}
\label{table:subMaxlenJumpPosInstruction}
\begin{tabular}{|p{24mm}|p{12mm}|p{52mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
1.7C, 2.5.1B, 2.5.4C & 52 & sub\_maxlen/jump\_pos & integer \\ \hline
1.7C, 2.5.1B, 2.5.4C & 53 & sub\_maxlen/jump\_npos & integer \\ \hline
\end{tabular}
\vv

Subtract the maximum vector length (in bytes) from a general purpose register and jump if the result is positive.
The immediate operand indicates the operand type for which the maximum vector length is obtained. The operand size for the source and destination register is 64 bits in C formats.
\vv

This instruction makes it easy to implement the type of vector loop described on page \pageref{vectorLoops}.
\vv
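
The loop control provided by sub\_maxlen/jump\_pos can be modeled in C as sketched below. The sketch assumes that the loop counter holds the number of remaining bytes, counted from the end of the array, and that each iteration handles at most one maximum-length vector. The function name and the parameter max\_len\_bytes are hypothetical, and the inner loop stands in for a single vector instruction.

\begin{lstlisting}[frame=none]
#include <stddef.h>
#include <stdint.h>

/* Adds 1 to each of the n floats in a[], handling at most max_len_bytes
   bytes per iteration. Returns the number of loop iterations.
   max_len_bytes must be a nonzero multiple of sizeof(float). */
size_t add_one_vector_loop(float *a, size_t n, int64_t max_len_bytes)
{
    int64_t remaining = (int64_t)(n * sizeof(float)); /* bytes left */
    char *end = (char *)(a + n);                      /* end of array */
    size_t iterations = 0;

    if (remaining <= 0) return 0;
    do {
        /* a vector is never longer than the maximum vector length */
        int64_t len = remaining < max_len_bytes ? remaining : max_len_bytes;
        float *p = (float *)(end - remaining);
        for (int64_t i = 0; i < len / (int64_t)sizeof(float); i++) {
            p[i] += 1.0f;
        }
        iterations++;
        remaining -= max_len_bytes;   /* sub_maxlen */
    } while (remaining > 0);          /* jump_pos */
    return iterations;
}
\end{lstlisting}

The counter becomes zero or negative in the last iteration, so no separate compare instruction is needed to terminate the loop.
\vv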

\subsubsection{and/jump\_zero}
\label{table:andJumpZeroInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 & 10 & and/jump\_zero & all \\ \hline
Not 1.7 & 11 & and/jump\_nzero & all \\ \hline
\end{tabular}
\vv

Bitwise and. Jump if zero.
\vv

dest = src1 \& src2

jump if dest == 0
\vv

All operands are treated as integers.
Floating point operands are treated as unsigned integer scalars in vector registers.
\vv

\subsubsection{or/jump\_zero}
\label{table:orJumpZeroInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 & 12 & or/jump\_zero & all \\ \hline
Not 1.7 & 13 & or/jump\_nzero & all \\ \hline
\end{tabular}
\vv

Bitwise or. Jump if zero.
\vv

dest = src1 $|$ src2

jump if dest == 0
\vv

All operands are treated as integers.
Floating point operands are treated as unsigned integer scalars in vector registers.
\vv

\subsubsection{xor/jump\_zero}
\label{table:xorJumpZeroInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
Not 1.7 & 14 & xor/jump\_zero & all \\ \hline
Not 1.7 & 15 & xor/jump\_nzero & all \\ \hline
\end{tabular}
\vv

Bitwise exclusive or. Jump if zero.
\vv

dest = src1 \^{ } src2

jump if dest == 0
\vv

All operands are treated as integers.
Floating point operands are treated as unsigned integer scalars in vector registers.

\subsubsection{test\_bit/jump\_true}
\label{table:testBitJumpTrueInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 26 & test\_bit/jump\_true & all \\ \hline
all & 27 & test\_bit/jump\_false & all \\ \hline
\end{tabular}
\vv

int test\_bit(r1, 3), jump\_true target \\
if (int r1 \& 8) \{jump target\}
\vv

Test a single bit in the first source operand, as indicated by an index in the second source operand, and jump if the indicated bit is 1. There is no destination operand.
\vv

jump if ((src1 $>>$ src2) \& 1) == 1
\vv

All operands are treated as unsigned integers.
Floating point operands are treated as integer scalars in vector registers.
\vv

\subsubsection{test\_bits\_and/jump\_true}
\label{table:testBitsAndJumpInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 28 & test\_bits\_and/jump\_true & all \\ \hline
all & 29 & test\_bits\_and/jump\_false & all \\ \hline
\end{tabular}
\vv

int test\_bits\_and(r1, 7), jump\_true target \\
if (int (r1 \& 7) == 7) \{jump target\}
\vv

Test the AND combination of the bits indicated by the second source operand. Jump if the indicated bits are all 1. There is no destination operand.
\vv

jump if (src1 \& src2) == src2
\vv

All operands are treated as unsigned integers.
Floating point operands are treated as integer scalars in vector registers.
\vv

\subsubsection{test\_bits\_or/jump\_true}
\label{table:testBitsOrJumpInstruction}
\begin{tabular}{|p{16mm}|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries operands \\ \hline
all & 30 & test\_bits\_or/jump\_true & all \\ \hline
all & 31 & test\_bits\_or/jump\_false & all \\ \hline
\end{tabular}
\vv

int test\_bits\_or(r1, 7), jump\_true target \\
if (int r1 \& 7) \{jump target\}
\vv

Test the OR combination of the bits indicated by the second source operand. Jump if at least one of the indicated bits is 1. There is no destination operand.
\vv

jump if (src1 \& src2) != 0
\vv

All operands are treated as unsigned integers.
Floating point operands are treated as integer scalars in vector registers.
\vv
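
The three bit test conditions above can be compared side by side in C. This is a minimal sketch for 64-bit operands with hypothetical function names. It assumes that the bit index is less than 64, because larger shift counts are undefined in C.

\begin{lstlisting}[frame=none]
#include <stdbool.h>
#include <stdint.h>

/* test_bit: true if bit number src2 of src1 is 1 (src2 < 64 assumed) */
bool test_bit(uint64_t src1, unsigned src2)
{
    return ((src1 >> src2) & 1) == 1;
}

/* test_bits_and: true if all the bits selected by src2 are 1 in src1 */
bool test_bits_and(uint64_t src1, uint64_t src2)
{
    return (src1 & src2) == src2;
}

/* test_bits_or: true if at least one of the bits selected by src2 is 1 */
bool test_bits_or(uint64_t src1, uint64_t src2)
{
    return (src1 & src2) != 0;
}
\end{lstlisting}

The jump\_false variants jump when the corresponding function returns false.
\vv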

\subsubsection{integer compare and branch instructions}
int64 compare(r1, r2), jump\_equal target
\vv

Compare instructions have no destination operand.
Overflow cannot occur.
\vv

\label{table:integerCompareJumpInstructions}
\begin{tabular}{|p{12mm}|p{60mm}|p{50mm}|}
\hline
\bfseries opcode & \bfseries instruction & \bfseries jump condition \\ \hline
32 & compare/jump\_equal & r1 = r2 \\ \hline
33 & compare/jump\_nequal  & r1 $\neq$ r2 \\ \hline
34 & compare/jump\_sbelow & r1 $<$ r2, signed \\ \hline
35 & compare/jump\_saboveeq & r1 $\geq$ r2, signed \\ \hline
36 & compare/jump\_sabove & r1 $>$ r2, signed  \\ \hline
37 & compare/jump\_sbeloweq  & r1 $\leq$ r2, signed \\ \hline
38 & compare/jump\_ubelow & r1 $<$ r2, unsigned \\ \hline
39 & compare/jump\_uaboveeq  & r1 $\geq$ r2, unsigned \\ \hline
40 & compare/jump\_uabove & r1 $>$ r2, unsigned \\ \hline
41 & compare/jump\_ubeloweq  & r1 $\leq$ r2, unsigned \\ \hline
\end{tabular}
\vv
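
The difference between the signed and unsigned conditions in the table above only shows when the sign bit of an operand is set. The following C sketch, with hypothetical function names, illustrates this for 64-bit operands.

\begin{lstlisting}[frame=none]
#include <stdbool.h>
#include <stdint.h>

/* compare/jump_sbelow: operands interpreted as signed integers */
bool jump_sbelow(int64_t r1, int64_t r2)
{
    return r1 < r2;
}

/* compare/jump_ubelow: the same bit patterns interpreted as unsigned */
bool jump_ubelow(uint64_t r1, uint64_t r2)
{
    return r1 < r2;
}
\end{lstlisting}

A register with all bits set is -1 when interpreted as signed, which is below 1, but it is the largest possible value when interpreted as unsigned.
\vv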

\subsection{floating point branch instructions}
The conditional jump instructions use general purpose registers for integer operands with at most 64 bits, and vector registers when a floating point type is specified. Only the first element of a floating point vector is used.
\vv

Addition and subtraction instructions with conditional branching do not support floating point operands.
\vv

\subsubsection{floating point compare and branch instructions}
double compare(v1, v2), jump\_above target
\vv

Compare instructions have no destination operand.
Overflow cannot occur. \\
0.0 and -0.0 are treated as equal.
\vv

The unordered versions of floating point compare instructions are true when any input operand is NAN. The versions without \_uo suffix are false when any operand is NAN.
The unordered versions are needed because conditions are often inverted in the compilation process. For example, the inverse of compare/jump\_below is not compare/jump\_aboveeq but compare/jump\_aboveeq\_uo. This is a consequence of the rule that all comparisons except '!=' return false when the inputs are unordered, i.e. when at least one operand is NAN, according to the IEEE-754 standard for floating point arithmetic.
\vspace{4mm}

\label{table:floatCompareJumpInstructions}
\begin{tabular}{|p{12mm}|p{60mm}|p{40mm}|p{40mm}|}
\hline
\bfseries opcode & \bfseries instruction & \bfseries jump condition & \bfseries high level language \\ \hline
32 & compare/jump\_equal & v1 = v2 & a == b \\ \hline
0 & compare/jump\_equal\_uo & v1 = v2 &  \\ \hline
33 & compare/jump\_nequal  & v1 $\neq$ v2 &  \\ \hline
1 & compare/jump\_nequal\_uo  & v1 $\neq$ v2 & a != b \\ \hline
34 & compare/jump\_below & v1 $<$ v2 & a < b  \\ \hline
2 & compare/jump\_below\_uo & v1 $<$ v2 & !(a >= b)  \\ \hline
35 & compare/jump\_aboveeq & v1 $\geq$ v2 & a >= b  \\ \hline
3 & compare/jump\_aboveeq\_uo & v1 $\geq$ v2 & !(a < b)  \\ \hline
36 & compare/jump\_above & v1 $>$ v2 & a > b  \\ \hline
4 & compare/jump\_above\_uo & v1 $>$ v2 & !(a <= b)   \\ \hline
37 & compare/jump\_beloweq  & v1 $\leq$ v2 & a <= b  \\ \hline
5 & compare/jump\_beloweq\_uo  & v1 $\leq$ v2 & !(a > b)  \\ \hline

38 & compare/jump\_abs\_below & abs(v1) $<$ abs(v2) &   \\ \hline
6 & compare/jump\_abs\_below\_uo & abs(v1) $<$ abs(v2) &   \\ \hline
39 & compare/jump\_abs\_aboveeq & abs(v1) $\geq$ abs(v2) &   \\ \hline
7 & compare/jump\_abs\_aboveeq\_uo & abs(v1) $\geq$ abs(v2) &   \\ \hline
40 & compare/jump\_abs\_above & abs(v1) $>$ abs(v2) &    \\ \hline
8 & compare/jump\_abs\_above\_uo & abs(v1) $>$ abs(v2) &    \\ \hline
41 & compare/jump\_abs\_beloweq  & abs(v1) $\leq$ abs(v2) &   \\ \hline
9 & compare/jump\_abs\_beloweq\_uo  & abs(v1) $\leq$ abs(v2) &   \\ \hline
24 & fp\_category/jump\_true & value belongs to one of the indicated categories &   \\ \hline
25 &  fp\_category/jump\_false & value does not belong to any of the indicated categories &  \\ \hline
\hline
\end{tabular}
\vv
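
The need for the unordered variants can be demonstrated with ordinary C comparisons, which follow the same IEEE-754 rule on common platforms. The function names in this sketch are hypothetical. Each function returns true when the corresponding instruction would jump.

\begin{lstlisting}[frame=none]
#include <stdbool.h>

/* compare/jump_below: false if any operand is NAN */
bool jump_below(double a, double b)
{
    return a < b;
}

/* compare/jump_aboveeq: false if any operand is NAN */
bool jump_aboveeq(double a, double b)
{
    return a >= b;
}

/* compare/jump_aboveeq_uo: the exact inverse of jump_below,
   true if any operand is NAN */
bool jump_aboveeq_uo(double a, double b)
{
    return !(a < b);
}
\end{lstlisting}

With a NAN operand, jump\_below and jump\_aboveeq are both false, so neither can serve as the inverse of the other. Only jump\_aboveeq\_uo is the logical complement of jump\_below for all inputs.
\vv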

The \_abs conditions ignore the sign bits and compare the absolute values of the two operands.
\vv

The fp\_category/jump\_true instruction tests if the value of the first operand belongs to any of the categories indicated by the second source operand, which is an integer. The categories are indicated according to table \ref{table:fpCategoryInstructionBits} on page \pageref{table:fpCategoryInstructionBits}.
\vv

\subsection{Unconditional and indirect jump, call, and return instructions}
Control transfer instructions are available in a number of different formats, described on
page \pageref{table:jumpInstructionFormats}.

\subsubsection{Direct jump}
\label{table:jumpInstruction}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
%1.7 C & 58 & jump with 16 bit relative address (not supported) \\ \hline
1.7 D &  0 & jump with 24 bit relative address \\ \hline
2.5.4 C & 58 & jump with 32 bit relative address \\ \hline
3.1.1 B & 58 & jump with 64 bit absolute address (optional) \\ \hline
\end{tabular}
\vv

Unconditional jump.

\subsubsection{Direct function call}
\label{table:callInstruction}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
%1.7 C & 59 & call with 16 bit relative address (not supported) \\ \hline
1.7 D &  8 & call with 24 bit relative address \\ \hline
2.5.4 C & 59 & call with 32 bit relative address \\ \hline
3.1.1 B & 59 & call with 64 bit absolute address (optional) \\ \hline
\end{tabular}
\vv

Function call.
\vv

The return address is stored on the call stack. The calling conventions are described in chapter \ref{chap:functionCallingConventions}.

\subsubsection{Indirect jump}

\label{table:indirectJumpInstruction}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.6 B & 58 & 64 bit absolute address in memory operand with 8 bit offset \\ \hline
1.7 C & 60 & 64 bit absolute address in register \\ \hline
1.6 A & 60 & Multi-way jump with table of relative addresses (see below) \\ \hline
2.5.2 B & 58 & Absolute address in memory operand with 32 bit offset \\ \hline
\end{tabular}
\vv

\subsubsection{Indirect call}
\label{table:IndirectCallInstruction}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.6 B & 59 & 64 bit absolute address in memory operand with 8 bit offset \\ \hline
1.7 C & 61 & 64 bit absolute address in register \\ \hline
1.6 A & 61 & Multi-way call with table of relative addresses (see below) \\ \hline
2.5.2 B & 59 & Absolute address in memory operand with 32 bit offset \\ \hline
\end{tabular}
\vv

\subsubsection{Relative and multi-way jump and call}
\label{table:multiwayJumpCallInstructions}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.6 A   & 60 & Jump with table of relative addresses. \linebreak Has reference point, base and scaled index  \\ \hline
2.5.2 B & 60 & Jump with relative address. \linebreak Has reference point, base and offset  \\ \hline
1.6 A   & 61 & Call with table of relative addresses. \linebreak Has reference point, base and scaled index    \\ \hline
2.5.2 B & 61 & Call with relative address. \linebreak Has reference point, base and offset \\ \hline
\end{tabular}
\vv

\label{relativeJumpInstruction}
The multi-way and relative jump and call instructions, jump\_relative and call\_relative, use pointers stored in memory relative to an arbitrary reference point.
These instructions are intended to facilitate multi-way branches
(switch/case statements), function tables in code interpreters, virtual function tables in object oriented languages with polymorphism, and general use of relative pointers. The relative pointers stored in memory use 8, 16, or 32 bits, depending on the distance to the reference point, while absolute pointers need 64 bits. This saves memory space and cache space.
\vv

Relative pointers to jump or call addresses are stored in memory as signed offsets relative to an arbitrary reference point. The reference point may be the table address, the ip\_base, or any reference point defined by the programmer. The operand type specifies the size of the table entries.
\vv

This instruction works as follows. Calculate the address of a table entry as the base pointer plus the offset (unscaled) or the index (RT) scaled by the operand size. Read a relative pointer from this address, sign-extend to 64 bits, and scale by 4. Then add the reference point (RD). Jump or call to the calculated address. The array index (RT) is scaled by the operand size, while the table entries are scaled by the instruction word size (4). The reference point must be aligned by 4.
\vv

This instruction in format 1.6A has base pointer in RS, scaled index in RT, and reference point in RD. Format 2.5.2B has base pointer in RS, unscaled offset in IM2, and reference point in RD.
\vv
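
The address calculation of the table-based form (format 1.6A) can be sketched in C as follows. The function name and parameters are hypothetical. The sketch models only the arithmetic described above and assumes table entries of 1, 2, or 4 bytes.

\begin{lstlisting}[frame=none]
#include <stdint.h>
#include <string.h>

/* Returns the target address of a table-based relative jump or call.
   table_base = base pointer (RS), index = array index (RT),
   entry_size = operand size in bytes, ref_point = reference point (RD).
   Any entry size other than 1 or 2 is treated as 4 in this sketch. */
uint64_t jump_relative_target(const uint8_t *table_base, uint64_t index,
                              unsigned entry_size, uint64_t ref_point)
{
    /* the index is scaled by the operand size */
    const uint8_t *entry = table_base + index * entry_size;
    int64_t rel;  /* relative pointer, sign-extended to 64 bits */

    switch (entry_size) {
    case 1:  { int8_t  v; memcpy(&v, entry, 1); rel = v; break; }
    case 2:  { int16_t v; memcpy(&v, entry, 2); rel = v; break; }
    default: { int32_t v; memcpy(&v, entry, 4); rel = v; break; }
    }
    /* the table entry is scaled by the instruction word size (4)
       and added to the reference point, with wrap-around */
    return ref_point + (uint64_t)rel * 4;
}
\end{lstlisting}

For example, a 16-bit table entry with the value -2 and a reference point of 0x1000 gives the target address 0x0FF8.
\vv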

A table of pointers used by the table-based jump\_relative and call\_relative instructions is preferably placed in the constant data section (CONST). This makes it possible to use the table base as reference point. This also improves security by giving read-only access to the table.
\vv

These instructions cannot have a mask and will not generate overflow traps in case of overflow in the address calculation, but you will get access violation traps when attempting to access an illegal memory address.
\vv

\subsubsection{return}
\label{table:returnInstruction}
\begin{tabular}{|p{14mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.6 B & 62 & \\ \hline
\end{tabular}
\vv

Return from function call. The return address is taken from the call stack.
\vv

Return instructions do not need a stack offset when the calling conventions specified in chapter \ref{chap:functionCallingConventions} are used.

\subsubsection{breakpoint}
\label{table:breakpointInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.7 C & 63 & \\ \hline
\end{tabular}
\vv

This instruction is used as a debug breakpoint.
\vv

It is the same as trap(1). The complete instruction code word is 0x7FE00001.
\vv

\subsubsection{filler}
\label{table:fillerInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.7 C & 63 & \\ \hline
\end{tabular}
\vv

This instruction is used for filling unused code memory. It will generate a trap (interrupt) if executed.
\vv

All fields are filled with ones. The complete instruction code word is 0x7FFFFFFF.
\vv

\subsubsection{System call, system return, and traps}
See page \pageref{table:sysCallInstruction}.
\vv

\subsection{Miscellaneous instructions}

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.9 B & 32 & g.p. registers \\ \hline
\end{tabular}
\vv

\vv

Calculate an address relative to a pointer by adding a 32-bit sign-extended constant to a special pointer register. The pointer register can be THREADP (28), DATAP (29), IP (30) or SP (31).
\vv

\subsubsection{compare\_swap}
\label{table:compareSwapInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
2.5 A & 18 & g. p. registers and memory operand with 32 bit offset. Optional \\ \hline
\end{tabular}
\vv

int32 r1 = compare\_swap(r1, r2, [r3+0x100])
\vv

Atomic compare and swap instruction, used for thread synchronization and for lock-free data sharing between threads. src1 and src2 are register operands, src3 is a memory operand, which must be aligned to a natural address. All operands are treated as integers, regardless of the specified operand type. The operation is:

\begin{lstlisting}[frame=none]
temp = src3;
if (temp == src1) src3 = src2;
return temp;
\end{lstlisting}

This instruction cannot have a mask.
\vv
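
For comparison, the same operation can be expressed with the standard C11 atomics library. This is only a sketch of the semantics for 32-bit operands, not a statement about how a compiler will map C code to compare\_swap. The function name is chosen to match the instruction.

\begin{lstlisting}[frame=none]
#include <stdatomic.h>
#include <stdint.h>

/* Atomically replaces *src3 with src2 if *src3 equals src1.
   Returns the value that *src3 had before the operation. */
int32_t compare_swap(int32_t src1, int32_t src2, _Atomic int32_t *src3)
{
    int32_t expected = src1;
    /* On failure, the current memory value is stored in expected.
       On success, expected is unchanged and equals the old value. */
    atomic_compare_exchange_strong(src3, &expected, src2);
    return expected;
}
\end{lstlisting}

The caller can detect success by checking whether the return value equals src1.
\vv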

Further atomic instructions can be implemented if needed, preferably with the same format and consecutive values of OP1.
\vv

\subsubsection{nop}
\label{table:nopInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi &  0 & \\ \hline
3.0   &  0 & \\ \hline
\end{tabular}
\vv

No operation. Used as a filler to replace removed code or to align code entries.
\vv

Unused bits may be used for debugging information, etc.
\vv

The processor is allowed to skip NOPs as fast as it can at an early stage in the pipeline. These NOPs cannot be used as timing delays, only as fillers.
\vv

\subsubsection{undef}
\label{table:undefInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 63 & \\ \hline
\end{tabular}
\vv

Undefined code. Guaranteed to generate a trap (interrupt) in all future implementations.
\vv

\subsubsection{userdef}
\label{table:userdefInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
multi & 56-62 & any types \\ \hline
\end{tabular}
\vv

Reserved for user-defined instructions.
\vv

\subsection{System instructions}
These instructions cannot have a mask.
\vv

\subsubsection{input}

\label{table:inputInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 62 & general purpose registers \\ \hline
1.2 A & 62 & vector registers \\ \hline
\end{tabular}
\vv

int32 r0 = input(r1, 4) \\
int64 v0 = input(r1, r2)
\vv

Read from input port into register RD. Privileged instruction.
\vv

General purpose register input with immediate port address:\\
The immediate operand contains a port address in the interval 0 - 254. Register RS is ignored.
\vv

General purpose register input with port address in register:\\
The immediate operand is 255. Register RS contains a 64 bit port address.
\vv

Vector register input with port address in register:\\
RS = port address. RT = vector length in bytes. \\
Vector input is not necessarily supported for all input ports.\\
\vv

\subsubsection{output}
\label{table:outputInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 63 & general purpose registers \\ \hline
1.2 A & 63 & vector registers \\ \hline
\end{tabular}
\vv

int32 output(r1, r2, 4)\\
int64 output(v0, r1, r2)
\vv

Write register value RD to output port. Privileged instruction.
\vv

General purpose register output with immediate port address:\\
The immediate operand contains a port address in the interval 0 - 254. Register RS is ignored.
\vv

General purpose register output with port address in register:\\
The immediate operand is 255. Register RS contains a 64 bit port address.
\vv

Vector register output with port address in register:\\
RS = port address. RT = vector length in bytes. \\
Vector output is not necessarily supported for all output ports.\\
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 34 & read\_capabilities(capabilities register, constant) \\ \hline
1.8 B & 35 & write\_capabilities(g.p. register, constant) \\ \hline
\end{tabular}
\vv

Preliminary specification.
\vv

Read or write processor capabilities register. These registers are used for indicating capabilities of the processor, such as support for optional instructions and limitations to vector lengths. These registers are initialized with their default values at program start.
\vv

The immediate constant in IM1 may determine details of the operation.
\vv

\begin{longtable} {|p{20mm}|p{90mm}|}
\caption{List of capabilities registers}
\label{table:capabilitiesRegisters} \\
\hline
\bfseries Capabilities register number & \bfseries Meaning  \\
\hline
capab0 & Microprocessor model or brand ID  \\
capab1 & Microprocessor version number  \\
\hline
capab2 & Disable error traps. Bit 0: unknown instructions, bit 1: wrong instruction operands, bit 2: array overflow, bit 3: memory read violation, bit 4: memory write violation, bit 5: misaligned memory access. \\
\hline
capab4 & Code cache size, level 1  \\
capab5 & Data cache size, level 1  \\
\hline
capab8  &  Support for operand sizes in general purpose registers. Bit 0: int8, bit 1: int16, bit 2: int32, bit 3: int64 \\
capab9  &  Support for operand sizes in vector registers. \linebreak
Bit 0: int8, bit 1: int16, bit 2: int32, bit 3: int64, bit 4: int128, bit 5: float32, bit 6: float64, bit 7: float128, bit 8: float16.\\
\hline

capab12  &  Maximum vector length for general instructions. \\
capab13  &  Maximum vector length for permute instructions. \\
capab14  &  Maximum block size for permute instructions. \\
capab15  &  Maximum vector length for compress\_sparse and expand\_sparse. \\
\hline

\hline
\end{longtable}

Some capabilities registers can be modified for test purposes or to tell the software not to use a specific instruction.
\vv

Setting bits in capab2 will suppress error traps. Instead, the errors will be counted in performance counter registers described on page \pageref{table:performanceCounters}. To test if a particular instruction is supported, set bit 0 in capab2, reset the performance counter, try to execute the instruction, and read the performance counter again.

\vv
Changing the values of the maximum vector length has the following effects. If the maximum length is reduced below the physical capability then any attempt to make a longer vector will result in the reduced length. The behavior of vector registers that already had a longer length before the maximum length was reduced, is implementation dependent. If the maximum vector length is set to a higher value than the physical capability then any attempt to make a vector longer than the physical capability will cause a trap to facilitate emulation, if the platform supports emulation.
\vv

Capabilities registers 12-15 can be increased for the purpose of emulation. The values of capabilities registers 12-15 must be powers of 2.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 60 & vector = read\_memory\_map(base, index) \\ \hline
1.2 A & 61 & write\_memory\_map(vector, base, index) \\ \hline
\end{tabular}
\vv

Preliminary specification.
\vv

\vv

Read memory map and save it to a vector register. Privileged instruction.\\
RD = destination vector register, RT-RS = internal address.
\vv

int64 write\_memory\_map(v1, r2, r3)
\vv

Write a vector register to memory map. RD = vector register source. RT-RS = internal address. Privileged instruction.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 58 & read\_call\_stack(r1, r2) \\
\hline
1.2 A & 59 & write\_call\_stack(v1, r2, r3) \\ \hline
\end{tabular}
\vv

Preliminary specification.
\vv

\vv

Read the internal call stack into a vector register. This instruction is used for saving the internal call stack to system memory in case of overflow.
Privileged instruction.
\vv

RD = destination vector register, RT-RS = internal address.
\vv

int64 write\_call\_stack(v1, r2, r3)
\vv

Write a vector register to the internal call stack. This instruction is used for restoring the internal call stack.
Privileged instruction.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 36 & performance counter register, constant \\ \hline
\end{tabular}
\vv

\vv

A number of internal registers are used for counting performance related events.
This instruction reads performance counter registers and performance related information. Some performance counters may be implementation-specific.
\vv

\begin{longtable} {|p{15mm}|p{15mm}|p{85mm}|}
\caption{List of performance counter registers}
\label{table:performanceCounters} \\
\hline
\bfseries Performance counter & \bfseries Second operand & \bfseries Meaning  \\
\hline
perf0  & -1 & Reset all performance counters \\
\hline
perf1  & 1 & CPU clock cycles \\
perf1  & 0 & Reset CPU clock cycles counter \\
\hline
perf2  & 1 & Number of instructions executed \\
perf2  & 2 & Number of double size instructions \\
perf2  & 3 & Number of triple size instructions \\
perf2  & 4 & General purpose register instructions \\
perf2  & 5 & G. p. register instructions with mask zero \\
perf2  & 0 & Reset counters \\
\hline
perf3  & 1 & Vector instructions executed \\
perf3  & 0 & Reset counter \\
\hline
perf4  & 1 & Vector registers in use. Returns one bit for each vector register \\
\hline
perf5  & 1 & Jumps, calls, and return instructions \\
perf5  & 2 & Direct, unconditional jumps, calls, and returns \\
perf5  & 3 & Indirect jumps and calls \\
perf5  & 4 & Conditional jumps \\
perf5  & 0 & Reset counters \\
\hline
perf16 & 1  & Unknown instructions attempted \\
perf16 & 2  & Wrong operands for instruction \\
perf16 & 3  & Array overflow  \\
perf16 & 4  & Memory read violation \\
perf16 & 5  & Memory write violation \\
perf16 & 6  & Memory access misaligned \\
perf16 & 62 & Code address where first error occurred  \\
perf16 & 63 & Type of first error \\
perf16 & 0  & Reset error counters \\
\hline
\end{longtable}
\vv

The perf16 register is useful for detecting errors when error traps are disabled using the capabilities registers described on page \pageref{table:capabilitiesRegisters}.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 37 & performance counter register, constant \\ \hline
\end{tabular}
\vv

This is the same as the read\_perf instruction, but serializing. The pipeline is flushed before reading the counter so that no instruction can execute out of order with read\_perfs.
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 32 & read\_spec(special register, constant)\\ \hline
1.8 B & 33 & write\_spec(g.p. register, constant) \\ \hline
\end{tabular}
\vv

int64 r0 = read\_spec(spec1, 0) \\
\vv

Read a special system register. The following special registers are currently defined. The size is 64 bits. These registers are initialized with their default values at program start.
\vv

The immediate operand (IM1) is currently unused. This instruction cannot have a mask.
\vv

\begin{longtable} {|p{25mm}|p{15mm}|p{80mm}|}
\caption{List of special registers}
\label{table:specialRegisters} \\
\hline
\bfseries Special register name & \bfseries number & \bfseries Meaning  \\
\hline
numcontr & spec0  & Numeric control register \\
datap    & spec2  & Data section pointer \\
\hline
\end{longtable}

\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.2 A & 56 & special vector register, general purpose register \\ \hline
\end{tabular}
\vv

\vv

Read special vector register spev1 into vector register result with length r2 bytes.
\vv

The following special registers are currently defined:

\begin{longtable} {|p{15mm}|p{100mm}|}
\caption{Special registers that can be read into vectors}
\label{table:specialVectorRegisters} \\
\hline
\bfseries Special register number & \bfseries Meaning  \\
\hline
spec0 & Numeric control register (NUMCONTR). The value is broadcast into all elements of the destination register with the indicated operand size and length.  \\
\hline
spec48 & Name of processor. The output is a zero-terminated UTF-8 string containing the brand name and model name of the microprocessor. \\
\hline
\end{longtable}
\vv

\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.8 B & 38 & read\_sys(system register, constant) \\ \hline
1.8 B & 39 & write\_sys(g.p. register, constant) \\ \hline
\end{tabular}
\vv

Read or write system register. Details are not defined yet. These instructions are privileged.

\vv

\subsubsection{sys\_call}
\label{systemCallInstruction}
System calls use ID numbers rather than addresses to identify system functions.
The ID is the combination of a module ID identifying a particular system module or device driver and a function ID identifying a particular function within this module. The module ID and the function ID are both 16 or 32 bits, so that the combined system call ID is up to 64 bits.
The sys\_call instruction has the following variants:

\begin{longtable}
{|p{20mm}|p{20mm}|p{20mm}|p{30mm}|p{30mm}|}
\caption{Variants of system call instruction}
\label{table:sysCallInstruction} \\
\hline
Format & Operand type & Register operands & Module ID & Function ID \\
\hline
1.6 A & 32 bit & 3 & RT bit 16-31 & RT bit 0-15 \\
\hline
1.6 A & 64 bit & 3 & RT bit 32-63 & RT bit 0-31 \\
\hline
2.5.7 C & 64 bit & 0  & IM3 bit 0-31 & IM1,IM2 bit 0-15 \\
\hline
3.1.2 B & 64 bit & 2  & IM3 bit 0-31 & IM2 bit 0-31 \\
\hline
\end{longtable}
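
The register layout of the two format 1.6 A variants in the table above can be sketched in C as follows. The function names are hypothetical. The sketch only shows how a module ID and a function ID are combined into the value placed in RT.

\begin{lstlisting}[frame=none]
#include <stdint.h>

/* 32-bit operand type: module ID in bits 16-31, function ID in bits 0-15 */
uint32_t sys_call_id32(uint16_t module_id, uint16_t function_id)
{
    return ((uint32_t)module_id << 16) | function_id;
}

/* 64-bit operand type: module ID in bits 32-63, function ID in bits 0-31 */
uint64_t sys_call_id64(uint32_t module_id, uint32_t function_id)
{
    return ((uint64_t)module_id << 32) | function_id;
}
\end{lstlisting}
\vv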

The sys\_call instruction can indicate a block of memory to be shared with the system function. The address of the memory block is pointed to by the register specified in RD and the length is in register RS. This memory block, which the caller must have access rights to, is shared with the system function. The system function will get the same access rights to this block as the calling thread has, i.e. read access and/or write access. This is useful for fast transfer of data between the caller and the system function. No other memory is accessible to both the caller and the called function. If the RD and RS fields are both r0 then no memory block is shared. If RD and RS are both SP then all the application's data memory is shared. The sys\_call instruction in format 2.5.7 has no register operands and no shared memory block. System calls cannot have a mask.
\vv

Parameters for system functions are transferred in registers, following the same calling conventions as normal functions. The registers used for function parameters are usually different from the registers in the RD, RS and RT fields. Function parameters that do not fit into registers must reside in the shared memory block.

\subsubsection{sys\_return}
\label{table:sysReturnInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{110mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries operands \\ \hline
1.7 C & 62 & \\ \hline
\end{tabular}
\vv

Return from system call.

\subsubsection{trap}
\label{traps}
\label{table:trapInstruction}
\begin{tabular}{|p{12mm}|p{12mm}|p{30mm}|p{80mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries immediate operand \\ \hline
1.7 C & 63 & trap & 0-254 \\ \hline
1.7 C & 63 & filler & 255 \\ \hline
\end{tabular}
\vv

Traps work like interrupts. The unconditional trap has an 8-bit interrupt number in IM1. This is an index into the interrupt vector table, which initially starts at absolute address zero. The unconditional trap instruction may use IM2 for additional information.
\vv

A trap instruction with all 1's in all fields (the 32-bit instruction word 0x7FFFFFFF) can be used as filler in unused parts of code memory.

\subsubsection{conditional trap}
\label{table:conditionalTrapInstructions}
\begin{tabular}{|p{12mm}|p{12mm}|p{30mm}|p{80mm}|}
\hline
\bfseries format & \bfseries opcode & \bfseries instruction & \bfseries immediate operand \\ \hline
2.5.5 C & 63 & compare, trap\_uabove & limit \\ \hline
%2.5.5 C & 63 & conditional trap & IM2 = interrupt number, IM3 = operand \\ \hline
\end{tabular}
\vv

Conditional traps are currently not supported.
\vv

The conditional trap generates a trap if the specified condition is true.\\
IM2 contains the interrupt number. \\
IM3 contains an immediate operand.
%the condition code OPJ, specified in table \ref{table:controlTransferInstructions}.
\vv

Compare/trap\_uabove will generate a trap if RD $>$ IM3. This is useful for checking if an array index exceeds the upper bound. The lower bound does not have to be checked because the comparison is unsigned: a negative index is interpreted as a large unsigned value, which also exceeds the limit.
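\vv

That the unsigned comparison also covers the lower bound can be spelled out as follows, assuming an $n$-bit register and a limit below $2^{n-1}$ (both are assumptions of this sketch, as are the numbers in the example). A signed index $i$ is read by the unsigned comparison as
\[
u(i) = \left\{ \begin{array}{ll}
i & \mbox{if } i \ge 0 \\
i + 2^{n} & \mbox{if } i < 0
\end{array} \right.
\]
so every negative index gives $u(i) \ge 2^{n-1} > \mathrm{IM3}$ and triggers the trap. For a hypothetical array of 10 elements with $n = 32$ and $\mathrm{IM3} = 9$, the index $-1$ is compared as $2^{32} - 1 = 4294967295 > 9$ and traps, the index 10 traps, and the index 9 does not.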
\vv

\section{Common operations that have no dedicated instruction}
This section discusses some common operations that are not implemented as single instructions, and how to code these operations in software.

\subsubsection{Change sign}
For integer operands, do a reverse subtract from zero. For floating point operands, use the toggle\_bit instruction on the sign bit.
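\vv

Both cases can be sketched as follows, assuming two's complement integers of $n$ bits and IEEE 754 floating point numbers with the sign in the most significant bit (bit 31 for single precision, bit 63 for double precision). The symbol $b(x)$ for the bit pattern of $x$ is introduced here for illustration only. The first relation applies to integers and the second to floating point numbers:
\[
-x \equiv 0 - x \pmod{2^{n}}, \qquad
b(-x) = b(x) \oplus 2^{n-1}
\]
For example, the single precision value $1.0$ has the bit pattern 0x3F800000, and $\mathrm{0x3F800000} \oplus \mathrm{0x80000000} = \mathrm{0xBF800000}$, which is the bit pattern of $-1.0$.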

\subsubsection{Not}
To invert all bits in an integer, do an XOR with -1. To invert a Boolean, do an XOR with 1.
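\vv

Both identities can be checked bit by bit. In two's complement, $-1$ has all bits set, so an XOR with $-1$ flips every bit. The following sketch assumes an $n$-bit integer $x$ read as an unsigned number and a Boolean $b$ stored as 0 or 1. The example value is arbitrary:
\[
x \oplus (-1) = (2^{n} - 1) - x, \qquad b \oplus 1 = 1 - b
\]
For $n = 8$ and $x = \mathrm{0x0F}$ this gives $\mathrm{0xFF} - \mathrm{0x0F} = \mathrm{0xF0}$.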

\subsubsection{Rotate through carry}
Rotates through carry are rarely used, and common implementations can be very inefficient. A left rotate through carry can be replaced by an add\_c with the same register in both source operands.
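\vv

The equivalence can be seen from the arithmetic. This is a sketch that assumes add\_c computes the sum of its two source operands plus an incoming carry $c_{\mathrm{in}}$ and delivers an outgoing carry $c_{\mathrm{out}}$. How the carry is passed between instructions is not covered here. For an $n$-bit operand $x$ used as both sources:
\[
r = (x + x + c_{\mathrm{in}}) \bmod 2^{n}, \qquad
c_{\mathrm{out}} = \lfloor (x + x + c_{\mathrm{in}}) / 2^{n} \rfloor = \lfloor x / 2^{n-1} \rfloor
\]
The result $r$ is $x$ shifted left by one position with $c_{\mathrm{in}}$ in the lowest bit, and $c_{\mathrm{out}}$ is the former top bit of $x$, which is a left rotate through carry by one position. For $n = 8$, $x = \mathrm{0xB1}$ and $c_{\mathrm{in}} = 0$ (values chosen for illustration), $r = 354 \bmod 256 = \mathrm{0x62}$ and $c_{\mathrm{out}} = 1$.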

\vv

\section{Unused instructions} \label{unusedInstructions}
Unused instructions and opcodes can be divided into three types:

\begin{enumerate}
\item The opcode is reserved for future use. Attempts to execute it will trigger a trap (synchronous interrupt) which can be used for generating an error message or for emulating instructions that are not supported.
\item The opcode is guaranteed to generate a trap, not only in the present version, but also in all future versions. This can be used as a filler in unused parts of the memory or for indicating unrecoverable errors. It can also be used for emulating user-specific instructions.
\item The opcode is ignored and does not trigger a trap. It can be used for future extensions that improve performance or functionality, but which can be safely ignored when not supported.
\end{enumerate}

All three types are implemented. Type 1 is the most common.
\vv

Nop instructions with nonzero values in unused fields are type 3. These instructions are ignored.
\vv

Prefetch and fence instructions with no memory operand, with nonzero values in unused fields, or with undefined values in IM3 are type 3. These instructions are ignored.
\vv

Unused bits in masks and in the numeric control register are type 3. These bits are ignored.
\vv

Trap instructions and conditional trap instructions with nonzero values in unused fields or undefined values in any field are type 2. These instructions are guaranteed to generate a trap. A special version of the trap instruction is intended as filler in unused or inaccessible parts of code memory.
\vv

The undef instruction is type 2. It is guaranteed to generate a trap in all systems. It can be used for testing purposes and emulation.
\vv

The userdef\_\_ instructions are type 1. These instructions are reserved for user-defined and application-specific purposes.
\vv

Instructions with erroneous coding should preferably behave as type 1. This includes instruction codes with nonzero values in unused fields, operand types not supported, or any other bit pattern with no defined meaning in any field. Type 3 behavior may alternatively be allowed in these cases. If so, the instruction should behave as if it were coded correctly.
\vv

All other opcodes not explicitly defined are type 1. These may be used for future instructions.
\vv

Small systems with no operating system and no trap support should define alternative behavior.

\end{document}