Line 90... |
Line 90... |
|
|
|
|
### To-Do / Wish List / Help Wanted
|
### To-Do / Wish List / Help Wanted
|
|
|
* Use LaTeX for data sheet
|
* Use LaTeX for data sheet
|
* Further size and performance optimization *(work in progress)*
|
* Further size and performance optimization *[work in progress]*
|
* Add associativity configuration for instruction cache
|
* Add associativity configuration for instruction cache
|
* Add a data cache
|
* Add *data* cache
|
* Burst mode for the external memory/bus interface
|
* Burst mode for the external memory/bus interface
|
* RISC-V `B` extension ([bitmanipulation](https://github.com/riscv/riscv-bitmanip)) *(shelved)*
|
* RISC-V `F` (using `Zfinx`?) CPU extension (single-precision floating point) *[planning]*
|
|
* RISC-V `B` CPU extension ([bitmanipulation](https://github.com/riscv/riscv-bitmanip)) *[shelved]*
|
* Synthesis results (+ wrappers?) for more/specific platforms
|
* Synthesis results (+ wrappers?) for more/specific platforms
|
* More support for FreeRTOS
|
* More support for FreeRTOS (like *all* traps)
|
* Port additional RTOSs (like [Zephyr](https://github.com/zephyrproject-rtos/zephyr) or [RIOT](https://www.riot-os.org))
|
* Port additional RTOSs (like [Zephyr](https://github.com/zephyrproject-rtos/zephyr) or [RIOT](https://www.riot-os.org))
|
* Single-precision floating point unit (`F`) *(planned)*
|
|
* Implement further RISC-V (or custom?) CPU extensions
|
* Implement further RISC-V (or custom?) CPU extensions
|
* Add debugger ([RISC-V debug spec](https://github.com/riscv/riscv-debug-spec))
|
* Add debugger ([RISC-V debug spec](https://github.com/riscv/riscv-debug-spec))
|
* Add memory-mapped trigger to testbench to quit simulation (using VHDL2008's `use std.env.finish;`) - but how? :thinking:
|
* Add memory-mapped trigger to testbench to quit simulation (maybe using VHDL2008's `use std.env.finish`?)
|
* ...
|
* ...
|
* [Ideas?](#ContributeFeedbackQuestions)
|
* [Ideas?](#ContributeFeedbackQuestions)
|
|
|
|
|
|
|
Line 187... |
Line 187... |
**Privileged architecture / CSR access** (`Zicsr` extension):
|
**Privileged architecture / CSR access** (`Zicsr` extension):
|
* Privilege levels: `M-mode` (Machine mode)
|
* Privilege levels: `M-mode` (Machine mode)
|
* CSR access instructions: `CSRRW` `CSRRS` `CSRRC` `CSRRWI` `CSRRSI` `CSRRCI`
|
* CSR access instructions: `CSRRW` `CSRRS` `CSRRC` `CSRRWI` `CSRRSI` `CSRRCI`
|
* System instructions: `MRET` `WFI`
|
* System instructions: `MRET` `WFI`
|
* Pseudo-instructions are not listed
|
* Pseudo-instructions are not listed
|
* Counter CSRs: `cycle` `cycleh` `instret` `instreth` `time` `timeh` `mcycle` `mcycleh` `minstret` `minstreth` `mcounteren` `mcountinhibit`
|
* Counter CSRs: `[m]cycle[h]` `[m]instret[h]` `time[h]` `[m]hpmcounter*[h]`(3..31, configurable) `mcounteren` `mcountinhibit` `mhpmevent*`(3..31, configurable)
|
* Machine CSRs: `mstatus` `mstatush` `misa`(read-only!) `mie` `mtvec` `mscratch` `mepc` `mcause` `mtval` `mip` `mvendorid` [`marchid`](https://github.com/riscv/riscv-isa-manual/blob/master/marchid.md) `mimpid` `mhartid` `mzext`(custom)
|
* Machine CSRs: `mstatus[h]` `misa`(read-only!) `mie` `mtvec` `mscratch` `mepc` `mcause` `mtval` `mip` `mvendorid` [`marchid`](https://github.com/riscv/riscv-isa-manual/blob/master/marchid.md) `mimpid` `mhartid` `mzext`(custom)
|
* Supported exceptions and interrupts:
|
* Supported exceptions and interrupts:
|
* Misaligned instruction address
|
* Misaligned instruction address
|
* Instruction access fault (via unacknowledged bus access after timeout)
|
* Instruction access fault (via unacknowledged bus access after timeout)
|
* Illegal instruction
|
* Illegal instruction
|
* Breakpoint (via `ebreak` instruction)
|
* Breakpoint (via `ebreak` instruction)
|
Line 212... |
Line 212... |
|
|
**Privileged architecture / instruction stream synchronization** (`Zifencei` extension):
|
**Privileged architecture / instruction stream synchronization** (`Zifencei` extension):
|
* System instructions: `FENCE.I` (among others, used to clear and reload instruction cache)
|
* System instructions: `FENCE.I` (among others, used to clear and reload instruction cache)
|
|
|
**Privileged architecture / Physical memory protection** (`PMP`, requires `Zicsr` extension):
|
**Privileged architecture / Physical memory protection** (`PMP`, requires `Zicsr` extension):
|
* Additional machine CSRs: `pmpcfg0` `pmpcfg1` `pmpaddr0` `pmpaddr1` `pmpaddr2` `pmpaddr3` `pmpaddr4` `pmpaddr5` `pmpaddr6` `pmpaddr7`
|
* Configurable number of regions
|
|
* Additional machine CSRs: `pmpcfg*`(0..15) `pmpaddr*`(0..63)
|
|
|
|
|
### Non-RISC-V-Compliant Issues
|
### Non-RISC-V-Compliant Issues
|
|
|
* CPU and Processor are BIG-ENDIAN, but this should be no problem as the external memory bus interface provides big- and little-endian configurations
|
* CPU and Processor are BIG-ENDIAN, but this should be no problem as the external memory bus interface provides big- and little-endian configurations
|
* `misa` CSR is read-only - no dynamic enabling/disabling of synthesized CPU extensions during runtime; for compatibility: write accesses (in m-mode) are ignored and do not cause an exception
|
* `misa` CSR is read-only - no dynamic enabling/disabling of synthesized CPU extensions during runtime; for compatibility: write accesses (in m-mode) are ignored and do not cause an exception
|
* The physical memory protection (**PMP**) only supports `NAPOT` mode, a minimal granularity of 8 bytes and only up to 8 regions
|
* The physical memory protection (**PMP**) currently supports only `NAPOT` mode and a minimal granularity of 8 bytes
|
* The `A` extension currently implements only the `lr.w` and `sc.w` instructions. However, these instructions are sufficient to emulate all further AMO operations
|
* The `A` extension currently implements only the `lr.w` and `sc.w` instructions. However, these instructions are sufficient to emulate all further AMO operations
|
* The `mcause` trap code `0x80000000` (originally reserved in the RISC-V specs) is used to indicate a hardware reset (non-maskable reset).
|
* The `mcause` trap code `0x80000000` (originally reserved in the RISC-V specs) is used to indicate a hardware reset (as "non-maskable interrupt").
|
|
|
|
|
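Since the PMP is limited to `NAPOT` regions, a region's base and size are packed into a single `pmpaddr*` value. The following is a minimal sketch of that packing, based on the NAPOT encoding defined in the RISC-V privileged specification rather than on the NEORV32 sources. The function names and the example base address `0x80000000` are hypothetical; the 8-byte lower bound mirrors the minimal granularity stated above.

```python
def napot_pmpaddr(base, size):
    """Encode a naturally aligned power-of-two region as a pmpaddr value."""
    if size < 8 or size & (size - 1):
        raise ValueError("size must be a power of two and at least 8 bytes")
    if base % size:
        raise ValueError("base must be aligned to size")
    # pmpaddr holds address bits [33:2]; the trailing one-bits encode the size
    return ((base >> 2) | ((size >> 3) - 1)) & 0xFFFFFFFF


def napot_region(pmpaddr):
    """Decode a NAPOT pmpaddr value back into (base, size)."""
    trailing_ones = 0
    while (pmpaddr >> trailing_ones) & 1:
        trailing_ones += 1
    size = 1 << (trailing_ones + 3)
    base = (pmpaddr & ~((1 << trailing_ones) - 1)) << 2
    return base, size


# Hypothetical example: a 64kB region starting at 0x80000000
value = napot_pmpaddr(0x80000000, 64 * 1024)
print(hex(value))
print(tuple(hex(x) for x in napot_region(value)))
```

The matching `pmpcfg*` entry still has to select the `NAPOT` address-matching mode and the access permissions; that part is omitted here.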
### NEORV32-Specific CPU Extensions
|
### NEORV32-Specific CPU Extensions
|
|
|
The NEORV32-specific extensions are always enabled and are indicated via the `X` bit in the `misa` CSR.
|
The NEORV32-specific extensions are always enabled and are indicated via the `X` bit in the `misa` CSR.
|
Line 241... |
Line 242... |
### NEORV32 CPU
|
### NEORV32 CPU
|
|
|
This chapter shows exemplary implementation results of the NEORV32 CPU for an **Intel Cyclone IV EP4CE22F17C6N FPGA** on
|
This chapter shows exemplary implementation results of the NEORV32 CPU for an **Intel Cyclone IV EP4CE22F17C6N FPGA** on
|
a DE0-nano board. The design was synthesized using **Intel Quartus Prime Lite 20.1** ("balanced implementation"). The timing
|
a DE0-nano board. The design was synthesized using **Intel Quartus Prime Lite 20.1** ("balanced implementation"). The timing
|
information is derived from the Timing Analyzer / Slow 1200mV 0C Model. If not otherwise specified, the default configuration
|
information is derived from the Timing Analyzer / Slow 1200mV 0C Model. If not otherwise specified, the default configuration
|
of the CPU's generics is assumed (for example no PMP). No constraints were used at all. The `u` and `Zifencei` extensions have
|
of the CPU's generics is assumed (e.g. no physical memory protection, no hardware performance monitors).
|
a negligible impact on the hardware requirements.
|
No constraints were used at all. The `u` and `Zifencei` extensions have a negligible impact on the hardware requirements.
|
|
|
Results generated for hardware version [`1.4.9.2`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
Results generated for hardware version [`1.4.9.2`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
|
|
| CPU Configuration | LEs | FFs | Memory bits | DSPs | f_max |
|
| CPU Configuration | LEs | FFs | Memory bits | DSPs | f_max |
|
|:----------------------------------------|:----------:|:--------:|:-----------:|:----:|:-------:|
|
|:----------------------------------------|:----------:|:--------:|:-----------:|:----:|:-------:|
|
Line 304... |
Line 305... |
The FPGA-specific memory components can be found in [`rtl/fpga_specific`](https://github.com/stnolting/neorv32/blob/master/rtl/fpga_specific/lattice_ice40up).
|
The FPGA-specific memory components can be found in [`rtl/fpga_specific`](https://github.com/stnolting/neorv32/blob/master/rtl/fpga_specific/lattice_ice40up).
|
* The clock frequencies marked with a "c" are constrained clocks. The remaining ones are _f_max_ results from the place and route timing reports.
|
* The clock frequencies marked with a "c" are constrained clocks. The remaining ones are _f_max_ results from the place and route timing reports.
|
* The Upduino and the Arty board have on-board SPI flash memories for storing the FPGA configuration. These devices can also be used by the default NEORV32
|
* The Upduino and the Arty board have on-board SPI flash memories for storing the FPGA configuration. These devices can also be used by the default NEORV32
|
bootloader to store and automatically boot an application program after reset (both tested successfully).
|
bootloader to store and automatically boot an application program after reset (both tested successfully).
|
* The setups with `PMP` implement 2 regions with a minimal granularity of 64kB.
|
* The setups with `PMP` implement 2 regions with a minimal granularity of 64kB.
|
|
* No HPM counters are implemented.
|
|
|
|
|
|
|
## Performance
|
## Performance
|
|
|
Line 315... |
Line 317... |
|
|
The [CoreMark CPU benchmark](https://www.eembc.org/coremark) was executed on the NEORV32 and is available in the
|
The [CoreMark CPU benchmark](https://www.eembc.org/coremark) was executed on the NEORV32 and is available in the
|
[sw/example/coremark](https://github.com/stnolting/neorv32/blob/master/sw/example/coremark) project folder. This benchmark
|
[sw/example/coremark](https://github.com/stnolting/neorv32/blob/master/sw/example/coremark) project folder. This benchmark
|
tests the capabilities of a CPU itself rather than the functions provided by the whole system / SoC.
|
tests the capabilities of a CPU itself rather than the functions provided by the whole system / SoC.
|
|
|
Results generated for hardware version [`1.4.7.0`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
|
|
|
~~~
|
~~~
|
**Configuration**
|
**Configuration**
|
Hardware: 32kB IMEM, 16kB DMEM, no caches, 100MHz clock
|
Hardware: 32kB IMEM, 16kB DMEM, no caches(!), 100MHz clock
|
CoreMark: 2000 iterations, MEM_METHOD is MEM_STACK
|
CoreMark: 2000 iterations, MEM_METHOD is MEM_STACK
|
Compiler: RISCV32-GCC 10.1.0 (rv32i toolchain)
|
Compiler: RISCV32-GCC 10.1.0 (rv32i toolchain)
|
Compiler flags: default, see makefile
|
Compiler flags: default, see makefile
|
Peripherals: UART for printing the results
|
Peripherals: UART for printing the results
|
~~~
|
~~~
|
|
|
| CPU | Executable Size | Optimization | CoreMark Score | CoreMarks/MHz |
|
Results generated for hardware version [`1.4.9.8`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
|
|
|
| CPU (including `Zicsr`) | Executable Size | Optimization | CoreMark Score | CoreMarks/MHz |
|
|:--------------------------------------------|:---------------:|:------------:|:--------------:|:-------------:|
|
|:--------------------------------------------|:---------------:|:------------:|:--------------:|:-------------:|
|
| `rv32i` | 27 424 bytes | `-O3` | 35.71 | **0.3571** |
|
| `rv32i` | 28 756 bytes | `-O3` | 36.36 | **0.3636** |
|
| `rv32im` | 26 232 bytes | `-O3` | 66.66 | **0.6666** |
|
| `rv32im` | 27 516 bytes | `-O3` | 68.97 | **0.6897** |
|
| `rv32imc` | 20 876 bytes | `-O3` | 66.66 | **0.6666** |
|
| `rv32imc` | 22 008 bytes | `-O3` | 68.97 | **0.6897** |
|
| `rv32imc` + `FAST_MUL_EN` | 20 876 bytes | `-O3` | 83.33 | **0.8333** |
|
| `rv32imc` + `FAST_MUL_EN` | 22 008 bytes | `-O3` | 86.96 | **0.8696** |
|
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` | 20 876 bytes | `-O3` | 86.96 | **0.8696** |
|
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` | 22 008 bytes | `-O3` | 90.91 | **0.9091** |
|
|
|
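The CoreMarks/MHz column is the CoreMark score divided by the clock frequency in MHz. A minimal sketch of that normalization, using the 100MHz clock from the configuration block and the scores listed for hardware version `1.4.9.8`; the names `CLOCK_MHZ`, `scores` and `coremarks_per_mhz` are illustrative only:

```python
CLOCK_MHZ = 100  # clock frequency taken from the configuration block

# CoreMark scores copied from the table (hardware version 1.4.9.8)
scores = {
    "rv32i": 36.36,
    "rv32im": 68.97,
    "rv32imc": 68.97,
    "rv32imc + FAST_MUL_EN": 86.96,
    "rv32imc + FAST_MUL_EN + FAST_SHIFT_EN": 90.91,
}


def coremarks_per_mhz(score, clock_mhz=CLOCK_MHZ):
    """Normalize a CoreMark score to the clock frequency."""
    return score / clock_mhz


for cpu, score in scores.items():
    print(f"{cpu:40s} {coremarks_per_mhz(score):.4f}")
```

Normalizing to the clock makes the results comparable across boards that run the same configuration at a different frequency.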
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
|
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
|
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
|
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
|
|
|
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
|
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
|
Line 345... |
Line 347... |
### Instruction Cycles
|
### Instruction Cycles
|
|
|
The NEORV32 CPU is based on a two-stage pipelined architecture. Each stage uses a multi-cycle processing scheme. Hence,
|
The NEORV32 CPU is based on a two-stage pipelined architecture. Each stage uses a multi-cycle processing scheme. Hence,
|
each instruction requires several clock cycles to execute (2 cycles for ALU operations, ..., 40 cycles for divisions).
|
each instruction requires several clock cycles to execute (2 cycles for ALU operations, ..., 40 cycles for divisions).
|
The average CPI (cycles per instruction) depends on the instruction mix of a specific application and also on the available
|
The average CPI (cycles per instruction) depends on the instruction mix of a specific application and also on the available
|
CPU extensions.
|
CPU extensions. *By default* the CPU-internal shifter (e.g. for the `SLL` instruction) as well as the multiplier and divider of the
|
|
|
Please note that by default the CPU-internal shifter (e.g. for the `SLL` instruction) as well as the multiplier and divider of the
|
|
`M` extension use a bit-serial approach and require several cycles for completion.
|
`M` extension use a bit-serial approach and require several cycles for completion.
|
|
|
The following table shows the performance results for successfully running 2000 CoreMark
|
The following table shows the performance results for successfully running 2000 CoreMark
|
iterations, which reflects a pretty good "real-life" workload. The average CPI is computed by
|
iterations, which reflects a pretty good "real-life" workload. The average CPI is computed by
|
dividing the total number of required clock cycles (only the timed core to avoid distortion due to IO wait cycles; sampled via the `cycle[h]` CSRs)
|
dividing the total number of required clock cycles (only the timed core to avoid distortion due to IO wait cycles; sampled via the `cycle[h]` CSRs)
|
by the number of executed instructions (`instret[h]` CSRs). The executables were generated using optimization `-O3`.
|
by the number of executed instructions (`instret[h]` CSRs). The executables were generated using optimization `-O3`.
|
|
|
Results generated for hardware version [`1.4.7.0`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
Results generated for hardware version [`1.4.9.8`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
|
|
|
| CPU | Required Clock Cycles | Executed Instructions | Average CPI |
|
| CPU (including `Zicsr`) | Required Clock Cycles | Executed Instructions | Average CPI |
|
|:--------------------------------------------|----------------------:|----------------------:|:-----------:|
|
|:--------------------------------------------|----------------------:|----------------------:|:-----------:|
|
| `rv32i` | 5 648 997 774 | 1 469 233 238 | **3.84** |
|
| `rv32i` | 5 595 750 503 | 1 466 028 607 | **3.82** |
|
| `rv32im` | 3 036 749 774 | 601 871 338 | **5.05** |
|
| `rv32im` | 2 966 086 503 | 598 651 143 | **4.95** |
|
| `rv32imc` | 3 036 959 882 | 615 034 616 | **4.94** |
|
| `rv32imc` | 2 981 786 734 | 611 814 918 | **4.87** |
|
| `rv32imc` + `FAST_MUL_EN` | 2 454 407 882 | 615 034 588 | **3.99** |
|
| `rv32imc` + `FAST_MUL_EN` | 2 399 234 734 | 611 814 918 | **3.92** |
|
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` | 2 320 308 322 | 615 034 676 | **3.77** |
|
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` | 2 265 135 174 | 611 814 948 | **3.70** |
|
|
|
|
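The Average CPI column can be reproduced from the other two columns. The sketch below recomputes it for the hardware version `1.4.9.8` rows; the numbers are copied from the table, and the names `results` and `average_cpi` are illustrative only:

```python
# (required clock cycles, executed instructions) per configuration, from the table
results = {
    "rv32i": (5_595_750_503, 1_466_028_607),
    "rv32im": (2_966_086_503, 598_651_143),
    "rv32imc": (2_981_786_734, 611_814_918),
    "rv32imc + FAST_MUL_EN": (2_399_234_734, 611_814_918),
    "rv32imc + FAST_MUL_EN + FAST_SHIFT_EN": (2_265_135_174, 611_814_948),
}


def average_cpi(cycles, instructions):
    """Average cycles per instruction over the timed benchmark core."""
    return cycles / instructions


for cpu, (cycles, instructions) in results.items():
    print(f"{cpu:40s} {average_cpi(cycles, instructions):.2f}")
```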
|
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
|
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
|
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
|
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
|
|
|
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
|
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
|
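The cycle and instruction counts above are 64-bit values that a 32-bit CPU exposes as a low/high CSR pair (`cycle`/`cycleh`, `instret`/`instreth`). A common way to sample such a pair without a torn read is the high-low-high sequence described in the RISC-V specification. The sketch below models it in Python; it is not taken from the NEORV32 software, and `FakeCounter` is a hypothetical stand-in for the CSR pair that advances on every read:

```python
class FakeCounter:
    """Hypothetical stand-in for a low/high CSR pair; ticks once per read."""

    def __init__(self, start):
        self.value = start

    def _tick(self):
        current = self.value
        self.value += 1
        return current

    def low(self):
        return self._tick() & 0xFFFFFFFF

    def high(self):
        return self._tick() >> 32


def read_counter64(read_low, read_high):
    """Read a 64-bit counter that is split across two 32-bit registers."""
    while True:
        high = read_high()
        low = read_low()
        # Retry if the low word overflowed into the high word between the reads
        if read_high() == high:
            return (high << 32) | low


# Start right before the low word wraps to force one retry
counter = FakeCounter(0xFFFFFFFF)
print(hex(read_counter64(counter.low, counter.high)))
```

Without the second read of the high word, the first pass in this example would combine a stale high word with an already wrapped low word and report a value of zero.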
Line 591... |
Line 590... |
|
|
> S. Nolting, "The NEORV32 Processor", github.com/stnolting/neorv32
|
> S. Nolting, "The NEORV32 Processor", github.com/stnolting/neorv32
|
|
|
#### BSD 3-Clause License
|
#### BSD 3-Clause License
|
|
|
Copyright (c) 2020, Stephan Nolting. All rights reserved.
|
Copyright (c) 2021, Stephan Nolting. All rights reserved.
|
|
|
Redistribution and use in source and binary forms, with or without modification, are
|
Redistribution and use in source and binary forms, with or without modification, are
|
permitted provided that the following conditions are met:
|
permitted provided that the following conditions are met:
|
|
|
1. Redistributions of source code must retain the above copyright notice, this list of
|
1. Redistributions of source code must retain the above copyright notice, this list of
|