URL https://opencores.org/ocsvn/neorv32/neorv32/trunk

Subversion Repositories neorv32

[/] [neorv32/] [trunk/] [docs/] [userguide/] [application_specific_configuration.adoc] - Blame information for rev 69


<<<
:sectnums:
== Application-Specific Processor Configuration
 
Due to the processor's configuration options, which are mainly defined via the top entity VHDL generics, the SoC
can be tailored to the application-specific requirements. Note that this chapter does not focus on optional
_SoC features_ like IO/peripheral modules. It rather gives ideas on how to optimize for _overall goals_
like performance and area.
 
[NOTE]
Please keep in mind that optimizing the design in one direction (like performance) will also affect other potential
optimization goals (like area and energy).
 
=== Optimize for Performance
 
The following points show some concepts to optimize the processor for performance regardless of the costs
(i.e. increasing area and energy requirements):
 
* Enable all performance-related RISC-V CPU extensions that implement dedicated hardware accelerators instead
of emulating operations entirely in software:  `M`, `C`, `Zfinx`
* Enable mapping of complex CPU operations to dedicated hardware: `FAST_MUL_EN => true` to use DSP slices for
multiplications, `FAST_SHIFT_EN => true` to use a fast barrel shifter for shift operations.
* Implement the instruction cache: `ICACHE_EN => true`
* Use as much _internal_ memory as possible to reduce memory access latency: `MEM_INT_IMEM_EN => true` and
`MEM_INT_DMEM_EN => true`, maximize `MEM_INT_IMEM_SIZE` and `MEM_INT_DMEM_SIZE`
* Increase the CPU's instruction prefetch buffer size: `CPU_IPB_ENTRIES`
* _To be continued..._
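
As a rough illustration, the points above could translate into the following fragment of a `neorv32_top` generic map.
This is a sketch, not a recommended configuration: the generic names for the `M` and `Zfinx` extensions are assumed by
analogy with `CPU_EXTENSION_RISCV_C`, and all numeric values (memory sizes, assumed to be given in bytes, and the
prefetch buffer depth) are placeholders that have to be adapted to the target device.

[source,vhdl]
----
-- Sketch only: fragment of a processor top entity instantiation.
-- Ports and all remaining generics are omitted.
generic map (
  -- hardware extensions instead of software emulation
  CPU_EXTENSION_RISCV_C     => true,
  CPU_EXTENSION_RISCV_M     => true,    -- assumed generic name
  CPU_EXTENSION_RISCV_Zfinx => true,    -- assumed generic name
  -- map complex operations to dedicated hardware
  FAST_MUL_EN               => true,    -- multiplier in DSP slices
  FAST_SHIFT_EN             => true,    -- fast barrel shifter
  -- instruction cache
  ICACHE_EN                 => true,
  -- internal memories (sizes are illustrative, assumed to be in bytes)
  MEM_INT_IMEM_EN           => true,
  MEM_INT_IMEM_SIZE         => 64*1024,
  MEM_INT_DMEM_EN           => true,
  MEM_INT_DMEM_SIZE         => 64*1024,
  -- deeper instruction prefetch buffer (illustrative value)
  CPU_IPB_ENTRIES           => 8
)
----

Every setting in this fragment trades area and energy for speed, so it is worth checking the synthesis report after
each change rather than enabling everything at once.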
 
 
=== Optimize for Size
 
The NEORV32 is a size-optimized processor system that is intended to fit into tiny niches within large SoC
designs or to be used as a customized microcontroller in really tiny / low-power FPGAs (like Lattice iCE40).
Here are some ideas on how to make the processor even smaller while maintaining its _general purpose system_
concept and maximum RISC-V compatibility.
 
**SoC**
 
* This is obvious, but exclude all unused optional IO/peripheral modules from synthesis via the processor
configuration generics.
* If an IO module provides an option to configure the number of "channels", constrain this number to the
actually required value (e.g. the PWM module `IO_PWM_NUM_CH` or the external interrupt controller `XIRQ_NUM_CH`).
* Reduce the FIFO sizes of implemented modules (e.g. `SLINK_TX_FIFO`).
* Disable the instruction cache (`ICACHE_EN => false`) if the design only uses processor-internal IMEM
and DMEM memories.
* _To be continued..._
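
A minimal sketch of how the SoC-level points above might look in a generic map fragment. The channel counts and the
FIFO depth are purely illustrative placeholders; the permitted value ranges are not stated in this chapter and have to
be taken from the generic descriptions of the respective modules.

[source,vhdl]
----
-- Sketch only: fragment of a processor top entity instantiation.
-- Ports and all remaining generics are omitted.
generic map (
  -- only as many channels as the application actually needs (illustrative)
  IO_PWM_NUM_CH => 2,
  XIRQ_NUM_CH   => 4,
  -- smallest FIFO that still satisfies the application (illustrative)
  SLINK_TX_FIFO => 1,
  -- no instruction cache when only internal IMEM/DMEM are used
  ICACHE_EN     => false
)
----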
 
**CPU**
 
* Use the _embedded_ RISC-V CPU architecture extension (`CPU_EXTENSION_RISCV_E`) to reduce block RAM utilization.
* The compressed instructions extension (`CPU_EXTENSION_RISCV_C`) requires additional logic for the decoder but
also reduces program code size by approximately 30%.
* If not explicitly used/required, constrain the CPU's counter sizes: `CPU_CNT_WIDTH` for `[m]instret[h]`
(number of instructions) and `[m]cycle[h]` (number of cycles) counters. You can even remove these counters
by setting `CPU_CNT_WIDTH => 0` if they are not used at all (note, this is not RISC-V compliant).
* Reduce the CPU's prefetch buffer size (`CPU_IPB_ENTRIES`).
* Map CPU shift operations to a small and iterative shifter unit (`FAST_SHIFT_EN => false`).
* If you have unused DSP blocks available, you can map multiplication operations to those slices instead of
using LUTs to implement the multiplier (`FAST_MUL_EN => true`).
* If there is no need to execute division in hardware, use the `Zmmul` extension instead of the full-scale
`M` extension.
* Disable CPU extensions that are not explicitly used (`A`, `U`, `Zfinx`).
* _To be continued..._
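
The CPU-related points above could be combined as in the following generic map fragment. It is only a sketch under
stated assumptions: the generic names for `Zmmul`, `A`, `U` and `Zfinx` are assumed by analogy with
`CPU_EXTENSION_RISCV_E` and `CPU_EXTENSION_RISCV_C`, and the counter width and prefetch buffer depth are illustrative
values.

[source,vhdl]
----
-- Sketch only: fragment of a processor top entity instantiation.
-- Ports and all remaining generics are omitted.
generic map (
  -- embedded base ISA
  CPU_EXTENSION_RISCV_E     => true,
  -- compressed instructions: more decoder logic, smaller program image
  CPU_EXTENSION_RISCV_C     => true,
  -- multiplication only, no hardware divider (assumed generic name)
  CPU_EXTENSION_RISCV_Zmmul => true,
  -- extensions not used by the application (assumed generic names)
  CPU_EXTENSION_RISCV_A     => false,
  CPU_EXTENSION_RISCV_U     => false,
  CPU_EXTENSION_RISCV_Zfinx => false,
  -- narrower cycle/instret counters (illustrative; 0 removes them, not RISC-V compliant)
  CPU_CNT_WIDTH             => 32,
  -- small prefetch buffer (illustrative value)
  CPU_IPB_ENTRIES           => 2,
  -- small iterative shifter instead of a barrel shifter
  FAST_SHIFT_EN             => false,
  -- use DSP blocks for the multiplier only if spare DSP blocks exist
  FAST_MUL_EN               => true
)
----

Whether `CPU_EXTENSION_RISCV_C` pays off depends on the design: it costs logic in the CPU but can allow a smaller
instruction memory.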
 
=== Optimize for Clock Speed
 
The NEORV32 Processor and CPU are designed to provide minimal logic between register stages to keep the
critical path as short as possible. When enabling additional extensions or modules the impact on the existing
logic is also kept at a minimum to prevent timing degradation. If there is a major impact on existing
logic (example: many physical memory protection address configuration registers) the VHDL code automatically
adds additional register stages to maintain critical path length. Obviously, this increases operation latency.
 
In order to optimize for a minimal critical path (= maximum clock speed) the following points should be considered:
 
* Complex CPU extensions (in terms of hardware requirements) should be avoided (examples: floating-point unit, physical memory protection).
* Large carry chains (>32-bit) should be avoided (constrain CPU counter sizes: e.g. `CPU_CNT_WIDTH => 32` and `HPM_CNT_WIDTH => 32`).
* If the target FPGA provides sufficient DSP resources, CPU multiplication operations can be mapped to DSP slices (`FAST_MUL_EN => true`)
reducing LUT usage and critical path impact while also increasing overall performance.
* Use the synchronous (registered) RX path configuration of the external memory interface (`MEM_EXT_ASYNC_RX => false`).
* _To be continued..._
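
As a hedged example, a timing-oriented generic map fragment based on the points above might look as follows. The
generic name used to disable the floating-point unit is an assumption (by analogy with `CPU_EXTENSION_RISCV_C`);
physical memory protection is simply left at whatever setting disables it and is not shown.

[source,vhdl]
----
-- Sketch only: fragment of a processor top entity instantiation.
-- Ports and all remaining generics are omitted.
generic map (
  -- avoid hardware-heavy extensions such as the FPU (assumed generic name)
  CPU_EXTENSION_RISCV_Zfinx => false,
  -- keep counter carry chains at 32 bits
  CPU_CNT_WIDTH             => 32,
  -- multiplier in DSP slices (only if the FPGA has enough of them)
  FAST_MUL_EN               => true,
  -- registered (synchronous) RX path of the external memory interface
  MEM_EXT_ASYNC_RX          => false
)
----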
 
[NOTE]
The short and fixed-length critical path allows the core to be integrated into existing clock domains.
Hence, no clock domain crossing and no sub-clock generation is required. However, for very high clock
frequencies (this is technology / platform dependent) clock domain crossing becomes crucial for chip-internal
connections.
 
 
=== Optimize for Energy
 
There are no _dedicated_ configuration options to optimize the processor for energy (minimal consumption;
energy/instruction ratio) yet. However, a reduced processor area (<<_optimize_for_size>>) will also reduce
static energy consumption.
 
To optimize your setup for low-power applications, you can make use of the CPU sleep mode (`wfi` instruction).
Put the CPU into sleep mode whenever possible. Disable all processor modules that are not actually used (exclude them
from synthesis if they will _never_ be used; disable the module via its control register if the module is not
_currently_ used). While in sleep mode, you can keep a timer module running (MTIME or the watchdog) to wake up
the CPU again. Since the wake-up is triggered by _any_ interrupt, the external interrupt controller can also
be used to wake up the CPU. In that case, all timers (and all other modules) can be deactivated as well.
 
.Processor-internal clock generator shutdown
[TIP]
If _no_ IO/peripheral module is currently enabled, the processor's internal clock generator circuit will be
shut down, reducing switching activity and thus dynamic energy consumption.

Lines 1-105: rev 69, author zero_gravi.
