<<<
:sectnums:
=== Custom Functions Unit (CFU)

The Custom Functions Unit is the central part of the <<_zxcfu_custom_instructions_extension_cfu>> and represents
the actual hardware module that is used to implement _custom RISC-V instructions_. The concept of the NEORV32
CFU was strongly inspired by https://github.com/google/CFU-Playground[Google's CFU-Playground].

The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented in pure software. Some potential application fields and exemplary
use-cases might include:

* **AI:** sub-word / vector / SIMD operations like adding all four bytes of a 32-bit data word
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to Gray code
* **Image processing:** look-up tables for color space transformations
* implementing instructions from other RISC-V ISA extensions that are not yet supported by the NEORV32

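To make two of these use-cases concrete, the following is a minimal pure-software sketch of operations that a single CFU instruction could replace. The helper names `sw_byte_sum` and `sw_bin_to_gray` are hypothetical and not part of the NEORV32 software framework.

```c
#include <stdint.h>

// Sum of the four bytes of a 32-bit word (hypothetical helper).
// In pure software this needs several shift, mask and add instructions.
static inline uint32_t sw_byte_sum(uint32_t x) {
  return (x & 0xffu) + ((x >> 8) & 0xffu) + ((x >> 16) & 0xffu) + (x >> 24);
}

// Binary to Gray code conversion (hypothetical helper).
static inline uint32_t sw_bin_to_gray(uint32_t x) {
  return x ^ (x >> 1);
}
```

Such functions can also serve as a reference model when verifying a hardware implementation of the same operation.
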
[NOTE]
The CFU is not intended for complex and autonomous functional units that implement complete accelerators
(like block-based AES en-/decryption). Such accelerators can be implemented within the <<_custom_functions_subsystem_cfs>>.
A comparison of all chip-internal hardware extension options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].

:sectnums:
==== Custom CFU Instructions - General

Custom instructions use a specific instruction space that has been explicitly reserved for user-defined
extensions by the RISC-V specifications ("_Guaranteed Non-Standard Encoding Space_"). The NEORV32 CFU uses the
_CUSTOM0_ opcode to identify custom instructions. The binary encoding of this opcode is `0001011`.

The custom instructions processed by the CFU use the 32-bit **R2-type** RISC-V instruction format, which consists
of six bit-fields:

* `funct7`: 7-bit immediate
* `rs2`: address of second source register
* `rs1`: address of first source register
* `funct3`: 3-bit immediate
* `rd`: address of destination register
* `opcode`: always `0001011` to identify custom instructions

.CFU instruction format (RISC-V R2-type)
image::cfu_r2type_instruction.png[align=center]

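The packing of the six bit-fields into one instruction word can be sketched in C. This is an illustration only: it assumes the fields sit at the standard RISC-V R-type bit positions, and the names `CFU_OPCODE_CUSTOM0` and `cfu_encode` are hypothetical.

```c
#include <stdint.h>

#define CFU_OPCODE_CUSTOM0 0x0bu // 0b0001011 (hypothetical macro name)

// Assemble a 32-bit instruction word from its bit-fields (hypothetical helper).
// Assumes the standard RISC-V R-type field positions.
static inline uint32_t cfu_encode(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                                  uint32_t funct3, uint32_t rd) {
  return ((funct7 & 0x7fu) << 25) | // bits 31:25
         ((rs2    & 0x1fu) << 20) | // bits 24:20
         ((rs1    & 0x1fu) << 15) | // bits 19:15
         ((funct3 & 0x07u) << 12) | // bits 14:12
         ((rd     & 0x1fu) <<  7) | // bits 11:7
         CFU_OPCODE_CUSTOM0;        // bits 6:0
}
```

For example, `cfu_encode(0x27, 11, 10, 6, 10)` describes an instruction with `funct7 = 0b0100111`, `funct3 = 0b110`, source registers `x10` and `x11` and destination register `x10`.
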
[NOTE]
All bit-fields, including the immediates, have to be static at compile time.

.Custom Instructions - Exceptions
[NOTE]
The CPU control logic can only check the _CUSTOM0_ opcode to decide whether a custom instruction word
is valid. It cannot check the `funct3` and `funct7` bit-fields since they are
implementation-defined. Hence, a custom CFU instruction can never raise an illegal instruction exception
because of its `funct3` or `funct7` value. However, custom instructions will raise an illegal instruction
exception if the CFU is not enabled/implemented (i.e. the `Zxcfu` ISA extension is not enabled).

The CFU operates on the two source operands and returns the processing result to the destination register.
The actual operation to be performed can be selected using the `funct7` and `funct3` bit-fields.
These immediate bit-fields can also be used to pass additional data to the CFU, such as offsets, look-up table
addresses or shift amounts. However, the actual functionality is completely user-defined.

:sectnums:
==== Using Custom Instructions in Software

The custom instructions provided by the CFU are used in plain C code via **intrinsics**. Intrinsics
behave like "normal" functions, but under the hood they are a set of macros that hide the complexity of inline assembly.
Using such intrinsics removes the need to modify the compiler, built-in libraries and the assembler when including custom
instructions.

The NEORV32 software framework provides eight pre-defined custom instruction macros, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`. Each intrinsic provides an implicit definition of the instruction word's
`funct3` bit-field:

.CFU instruction prototypes
[source,c]
----
neorv32_cfu_cmd0(funct7, rs1, rs2) // funct3 = 000
neorv32_cfu_cmd1(funct7, rs1, rs2) // funct3 = 001
neorv32_cfu_cmd2(funct7, rs1, rs2) // funct3 = 010
neorv32_cfu_cmd3(funct7, rs1, rs2) // funct3 = 011
neorv32_cfu_cmd4(funct7, rs1, rs2) // funct3 = 100
neorv32_cfu_cmd5(funct7, rs1, rs2) // funct3 = 101
neorv32_cfu_cmd6(funct7, rs1, rs2) // funct3 = 110
neorv32_cfu_cmd7(funct7, rs1, rs2) // funct3 = 111
----

Each intrinsic function returns a 32-bit value (the processing result) and
requires three arguments:

* `funct7` - 7-bit immediate
* `rs1` - source operand 1, 32-bit
* `rs2` - source operand 2, 32-bit

The `funct7` argument is used to pass a 7-bit literal to the CFU. The `rs1` and `rs2` arguments pass the
actual data to the CFU and can be populated with variables or literals. The following example
shows how to pass arguments when executing `neorv32_cfu_cmd6`: `funct7` is set to all-zero, `rs1` is given
the literal _2751_ and `rs2` is given a variable that contains the return value from `some_function()`.

.CFU instruction usage example
[source,c]
----
uint32_t opb = some_function();
uint32_t res = neorv32_cfu_cmd6(0b0000000, 2751, opb);
----

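To illustrate how a macro can hide the inline assembly behind a function-like interface, here is a minimal sketch. It is not the implementation from `neorv32_cpu_cfu.h`. The names `my_cfu_cmd` and `my_cfu_host_model` are hypothetical, the RISC-V branch assumes a GCC-compatible compiler and the GNU assembler's `.insn r` directive, and the non-RISC-V branch is a placeholder model with a made-up operation so the calling code can be compiled on a host.

```c
#include <stdint.h>

#if defined(__riscv)
// RISC-V target: emit one R-type instruction word with the CUSTOM0 opcode (0x0b).
// funct7 and funct3 must be compile-time constants ("i" constraints).
// Uses a GCC statement expression so the macro yields the result value.
#define my_cfu_cmd(funct7, funct3, rs1, rs2)                       \
  ({                                                               \
    uint32_t my_rd_;                                               \
    __asm__ volatile(".insn r 0x0b, %3, %4, %0, %1, %2"            \
                     : "=r"(my_rd_)                                \
                     : "r"((uint32_t)(rs1)), "r"((uint32_t)(rs2)), \
                       "i"(funct3), "i"(funct7));                  \
    my_rd_;                                                        \
  })
#else
// Any other target: placeholder software model (made-up operation: rs1 + rs2).
static inline uint32_t my_cfu_host_model(uint32_t funct7, uint32_t funct3,
                                         uint32_t rs1, uint32_t rs2) {
  (void)funct7; // unused by this placeholder
  (void)funct3; // unused by this placeholder
  return rs1 + rs2;
}
#define my_cfu_cmd(funct7, funct3, rs1, rs2) \
  my_cfu_host_model((funct7), (funct3), (uint32_t)(rs1), (uint32_t)(rs2))
#endif
```

With this sketch, `my_cfu_cmd(0b0000000, 6, 2751, opb)` would play the role of `neorv32_cfu_cmd6(0b0000000, 2751, opb)` from the example above.
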
.CFU Example Program
[TIP]
There is a simple example program for the CFU, which shows how to use the _default_ CFU hardware module.
The example program is located in `sw/example/demo_cfu`.

:sectnums:
==== Custom Instructions Hardware

The actual functionality of the CFU's custom instructions is defined by the logic in the CFU itself.
It is the responsibility of the designer to implement this logic within the CFU hardware module
`rtl/core/neorv32_cpu_cp_cfu.vhd`.

The CFU hardware module receives the data from the instruction word's immediate bit-fields as well as
the operand data, which is fetched from the CPU's register file.

.CFU instruction data passing example
[source,c]
----
uint32_t opb = 0x12345678;
uint32_t res = neorv32_cfu_cmd6(0b0100111, 0x00cafe00, opb);
----

In this example the CFU hardware module receives the two source operands as 32-bit signals
and the immediate values as 7-bit and 3-bit signals:

* `rs1_i` (32-bit) contains the data from the `rs1` register (here = `0x00cafe00`)
* `rs2_i` (32-bit) contains the data from the `rs2` register (here = `0x12345678`)
* `control.funct3` (3-bit) contains the immediate value from the `funct3` bit-field (here = `0b110`; "cmd6")
* `control.funct7` (7-bit) contains the immediate value from the `funct7` bit-field (here = `0b0100111`)

The CFU executes the corresponding instruction (selected, for example, by the `control.funct3` signal)
and provides the operation result in the 32-bit `control.result` signal. The processing can be entirely
combinatorial, so the result is available at the end of the current clock cycle. Processing can also
take several clock cycles and may include internal states and memories. As soon as the CFU has
completed its operation it sets the `control.done` signal high.

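The selection by `funct3` described above can be sketched as a behavioral model in C. This is only an illustration of a purely combinatorial CFU. The type `cfu_out_t`, the function `cfu_model` and both operations are hypothetical, and the real logic lives in the VHDL module.

```c
#include <stdbool.h>
#include <stdint.h>

// Output of one modeled CFU operation; fields mirror control.result / control.done.
typedef struct {
  uint32_t result;
  bool     done;
} cfu_out_t;

// Behavioral model of a combinatorial CFU (hypothetical, made-up operations).
static inline cfu_out_t cfu_model(uint32_t rs1, uint32_t rs2,
                                  uint32_t funct3, uint32_t funct7) {
  cfu_out_t out = { .result = 0u, .done = true }; // combinatorial: done at once
  switch (funct3 & 0x7u) {
    case 0: // "cmd0": plain addition
      out.result = rs1 + rs2;
      break;
    case 6: // "cmd6": XOR of the operands plus the funct7 immediate
      out.result = (rs1 ^ rs2) + (funct7 & 0x7fu);
      break;
    default: // all other funct3 values are unused in this sketch
      out.result = 0u;
      break;
  }
  return out;
}
```

Feeding the model with the values from the data passing example, `cfu_model(0x00cafe00, 0x12345678, 0b110, 0b0100111)`, selects the made-up "cmd6" operation. A multi-cycle CFU would instead keep internal state and assert `done` only in a later cycle.
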
.CFU Hardware Example & More Details
[TIP]
The default CFU module already implements some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is extensively commented to explain the available signals and the handshake with the CPU pipeline.

.CFU Execution Time
[NOTE]
The CFU is not required to finish processing within a bounded time.
However, the designer should keep in mind that the CPU is **stalled** until the CFU has finished processing.
This also means that the CPU cannot react to pending interrupts while the CFU is busy. Nevertheless, interrupt requests will still be queued.