<<<
:sectnums:
== Application-Specific Processor Configuration

The processor's configuration options, most of which are defined via the VHDL generics of the top entity, allow the SoC
to be tailored to application-specific requirements. Note that this chapter does not focus on optional
_SoC features_ like IO/peripheral modules. Instead, it gives ideas on how to optimize for _overall goals_
like performance and area.

[NOTE]
Please keep in mind that optimizing the design in one direction (like performance) will also affect other potential
optimization goals (like area and energy).

=== Optimize for Performance

The following points show some concepts to optimize the processor for performance regardless of the costs
(i.e. increasing area and energy requirements):

* Enable all performance-related RISC-V CPU extensions that implement dedicated hardware accelerators instead
of emulating operations entirely in software: `M`, `C`, `Zfinx`
* Enable mapping of complex CPU operations to dedicated hardware: `FAST_MUL_EN => true` to use DSP slices for
multiplications, `FAST_SHIFT_EN => true` to use a fast barrel shifter for shift operations.
* Implement the instruction cache: `ICACHE_EN => true`
* Use as much _internal_ memory as possible to reduce memory access latency: `MEM_INT_IMEM_EN => true` and
`MEM_INT_DMEM_EN => true`, maximize `MEM_INT_IMEM_SIZE` and `MEM_INT_DMEM_SIZE`
* Increase the CPU's instruction prefetch buffer size: `CPU_IPB_ENTRIES`
* _To be continued..._
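
As a rough illustration, the points above might map to a generic map excerpt like the one below. This is a sketch under assumptions, not a reference configuration: the generic names `CPU_EXTENSION_RISCV_M` and `CPU_EXTENSION_RISCV_Zfinx` are assumed by analogy with `CPU_EXTENSION_RISCV_C`, and the memory sizes and the prefetch buffer depth are hypothetical placeholder values. Check the top entity of your NEORV32 version for the exact generic names and legal value ranges.

[source,vhdl]
----
-- excerpt of the top entity's generic map (sketch; all other generics omitted)
generic map (
  CPU_EXTENSION_RISCV_M     => true,    -- assumed name: hardware multiplication/division
  CPU_EXTENSION_RISCV_C     => true,    -- compressed instructions
  CPU_EXTENSION_RISCV_Zfinx => true,    -- assumed name: floating-point unit
  FAST_MUL_EN               => true,    -- use DSP slices for multiplications
  FAST_SHIFT_EN             => true,    -- use a fast barrel shifter
  ICACHE_EN                 => true,    -- implement the instruction cache
  MEM_INT_IMEM_EN           => true,    -- implement processor-internal IMEM
  MEM_INT_IMEM_SIZE         => 64*1024, -- hypothetical value (assumed to be in bytes)
  MEM_INT_DMEM_EN           => true,    -- implement processor-internal DMEM
  MEM_INT_DMEM_SIZE         => 32*1024, -- hypothetical value (assumed to be in bytes)
  CPU_IPB_ENTRIES           => 8        -- hypothetical value
)
----

Every option enabled here costs additional area and energy, so in practice only the options that the application actually benefits from should be turned on.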


=== Optimize for Size

The NEORV32 is a size-optimized processor system that is intended to fit into tiny niches within large SoC
designs or to be used as a customized microcontroller in really tiny / low-power FPGAs (like Lattice iCE40).
Here are some ideas on how to make the processor even smaller while maintaining its _general purpose system_
concept and maximum RISC-V compatibility.

**SoC**

* It may be obvious, but exclude all unused optional IO/peripheral modules from synthesis via the processor
configuration generics.
* If an IO module provides an option to configure the number of "channels", constrain this number to the
value that is actually required (e.g. the PWM module `IO_PWM_NUM_CH` or the external interrupt controller `XIRQ_NUM_CH`).
* Reduce the FIFO sizes of implemented modules (e.g. `SLINK_TX_FIFO`).
* Disable the instruction cache (`ICACHE_EN => false`) if the design only uses processor-internal IMEM
and DMEM memories.
* _To be continued..._
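
The SoC-level points above could look like this in the generic map. This is only a sketch: the channel counts and the FIFO depth are hypothetical values chosen for illustration, and the legal ranges of these generics depend on your NEORV32 version.

[source,vhdl]
----
-- excerpt of the top entity's generic map (sketch; all other generics omitted)
generic map (
  IO_PWM_NUM_CH => 2,     -- hypothetical: only two PWM channels are needed
  XIRQ_NUM_CH   => 4,     -- hypothetical: only four external interrupt channels are needed
  SLINK_TX_FIFO => 1,     -- hypothetical: smallest assumed TX FIFO depth
  ICACHE_EN     => false  -- no instruction cache (code runs from internal IMEM)
)
----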

**CPU**

* Use the _embedded_ RISC-V CPU architecture extension (`CPU_EXTENSION_RISCV_E`) to reduce block RAM utilization.
* The compressed instructions extension (`CPU_EXTENSION_RISCV_C`) requires additional logic for the decoder but
also reduces program code size by approximately 30%.
* If not explicitly used/required, constrain the CPU's counter sizes: `CPU_CNT_WIDTH` for the `[m]instret[h]`
(number of instructions) and `[m]cycle[h]` (number of cycles) counters. You can even remove these counters
by setting `CPU_CNT_WIDTH => 0` if they are not used at all (note that this is not RISC-V compliant).
* Reduce the CPU's prefetch buffer size (`CPU_IPB_ENTRIES`).
* Map CPU shift operations to a small and iterative shifter unit (`FAST_SHIFT_EN => false`).
* If you have unused DSP blocks available, you can map multiplication operations to those slices instead of
using LUTs to implement the multiplier (`FAST_MUL_EN => true`).
* If there is no need to execute division in hardware, use the `Zmmul` extension instead of the full-scale
`M` extension.
* Disable CPU extensions that are not explicitly used (`A`, `U`, `Zfinx`).
* _To be continued..._
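
A size-oriented CPU configuration along these lines might look like the following generic map excerpt. Treat it as a sketch: `CPU_EXTENSION_RISCV_Zmmul` is an assumed generic name (derived from the `CPU_EXTENSION_RISCV_*` naming used above), and the prefetch buffer depth is a hypothetical value.

[source,vhdl]
----
-- excerpt of the top entity's generic map (sketch; all other generics omitted)
generic map (
  CPU_EXTENSION_RISCV_E     => true,  -- embedded CPU architecture extension
  CPU_EXTENSION_RISCV_C     => true,  -- optional: trades decoder logic for smaller program code
  CPU_EXTENSION_RISCV_Zmmul => true,  -- assumed name: multiplication only, no hardware divider
  CPU_CNT_WIDTH             => 0,     -- remove the cycle and instret counters (not RISC-V compliant)
  CPU_IPB_ENTRIES           => 2,     -- hypothetical small value
  FAST_SHIFT_EN             => false, -- small iterative shifter
  FAST_MUL_EN               => false  -- set to true only if unused DSP blocks are available
)
----

Whether `CPU_EXTENSION_RISCV_C` pays off depends on the design: it adds decoder logic but can shrink the instruction memory that has to be implemented.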

=== Optimize for Clock Speed

The NEORV32 Processor and CPU are designed to provide minimal logic between register stages to keep the
critical path as short as possible. When enabling additional extensions or modules, the impact on the existing
logic is also kept at a minimum to prevent timing degradation. If there is a major impact on existing
logic (example: many physical memory protection address configuration registers), the VHDL code automatically
adds additional register stages to maintain the critical path length. Obviously, this increases operation latency.

In order to optimize for a minimal critical path (= maximum clock speed), the following points should be considered:

* Complex CPU extensions (in terms of hardware requirements) should be avoided (examples: floating-point unit, physical memory protection).
* Large carry chains (>32-bit) should be avoided (constrain CPU counter sizes: e.g. `CPU_CNT_WIDTH => 32` and `HPM_CNT_WIDTH => 32`).
* If the target FPGA provides sufficient DSP resources, CPU multiplication operations can be mapped to DSP slices (`FAST_MUL_EN => true`),
reducing LUT usage and critical path impact while also increasing overall performance.
* Use the synchronous (registered) RX path configuration of the external memory interface (`MEM_EXT_ASYNC_RX => false`).
* _To be continued..._
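
Expressed as generics, a timing-oriented setup could include the settings below. This is a minimal sketch that only uses the generics named in this section; the extensions to leave disabled (floating-point unit, physical memory protection) are not shown because their generic names are not given here.

[source,vhdl]
----
-- excerpt of the top entity's generic map (sketch; all other generics omitted)
generic map (
  CPU_CNT_WIDTH    => 32,    -- limit the counter carry chains to 32 bit
  FAST_MUL_EN      => true,  -- map multiplications to DSP slices (if available)
  MEM_EXT_ASYNC_RX => false  -- registered (synchronous) RX path of the external memory interface
)
----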

[NOTE]
The short and fixed-length critical path allows integrating the core into existing clock domains,
so no clock domain crossing and no sub-clock generation is required. However, for very high clock
frequencies (this is technology / platform dependent), clock domain crossing becomes crucial for chip-internal
connections.


=== Optimize for Energy

There are no _dedicated_ configuration options to optimize the processor for energy (minimal consumption;
energy/instruction ratio) yet. However, a reduced processor area (<<_optimize_for_size>>) will also reduce
static energy consumption.

To optimize your setup for low-power applications, you can make use of the CPU sleep mode (`wfi` instruction).
Put the CPU into sleep mode whenever possible. Disable all processor modules that are not actually used (exclude them
from synthesis if they will _never_ be used; disable the module via its control register if the module is not
_currently_ used). When in sleep mode, you can keep a timer module running (MTIME or the watchdog) to wake up
the CPU again. Since the wake-up is triggered by _any_ interrupt, the external interrupt controller can also
be used to wake up the CPU again. This way, all timers (and all other modules) can be deactivated as well.

.Processor-internal clock generator shutdown
[TIP]
If _no_ IO/peripheral module is currently enabled, the processor's internal clock generator circuit will be
shut down, reducing switching activity and thus dynamic energy consumption.