OpenCores
URL https://opencores.org/ocsvn/neorv32/neorv32/trunk

Subversion Repositories neorv32

[/] [neorv32/] [trunk/] [README.md] - Diff between revs 41 and 42

Go to most recent revision | Show entire file | Details | Blame | View Log

Rev 41 Rev 42
Line 90... Line 90...
 
 
 
 
### To-Do / Wish List / Help Wanted
### To-Do / Wish List / Help Wanted
 
 
* Use LaTeX for data sheet
* Use LaTeX for data sheet
* Further size and performance optimization *(work in progress)*
* Further size and performance optimization *[work in progress]*
* Add associativity configuration for instruction cache
* Add associativity configuration for instruction cache
* Add a data cache
* Add *data* cache
* Burst mode for the external memory/bus interface
* Burst mode for the external memory/bus interface
* RISC-V `B` extension ([bitmanipulation](https://github.com/riscv/riscv-bitmanip)) *(shelved)*
* RISC-V `F` (using `Zfinx`?) CPU extension (single-precision floating point) *[planning]*
 
* RISC-V `B` CPU extension ([bitmanipulation](https://github.com/riscv/riscv-bitmanip)) *[shelved]*
* Synthesis results (+ wrappers?) for more/specific platforms
* Synthesis results (+ wrappers?) for more/specific platforms
* More support for FreeRTOS
* More support for FreeRTOS (like *all* traps)
* Port additional RTOSs (like [Zephyr](https://github.com/zephyrproject-rtos/zephyr) or [RIOT](https://www.riot-os.org))
* Port additional RTOSs (like [Zephyr](https://github.com/zephyrproject-rtos/zephyr) or [RIOT](https://www.riot-os.org))
* Single-precision floating point unit (`F`) *(planned)*
 
* Implement further RISC-V (or custom?) CPU extensions
* Implement further RISC-V (or custom?) CPU extensions
* Add debugger ([RISC-V debug spec](https://github.com/riscv/riscv-debug-spec))
* Add debugger ([RISC-V debug spec](https://github.com/riscv/riscv-debug-spec))
* Add memory-mapped trigger to testbench to quit simulation (using VHDL2008's `use std.env.finish;`) - but how? :thinking:
* Add memory-mapped trigger to testbench to quit simulation (maybe using VHDL2008's `use std.env.finish`?)
* ...
* ...
* [Ideas?](#ContributeFeedbackQuestions)
* [Ideas?](#ContributeFeedbackQuestions)
 
 
 
 
 
 
Line 187... Line 187...
**Privileged architecture / CSR access** (`Zicsr` extension):
**Privileged architecture / CSR access** (`Zicsr` extension):
  * Privilege levels: `M-mode` (Machine mode)
  * Privilege levels: `M-mode` (Machine mode)
  * CSR access instructions: `CSRRW` `CSRRS` `CSRRC` `CSRRWI` `CSRRSI` `CSRRCI`
  * CSR access instructions: `CSRRW` `CSRRS` `CSRRC` `CSRRWI` `CSRRSI` `CSRRCI`
  * System instructions: `MRET` `WFI`
  * System instructions: `MRET` `WFI`
  * Pseudo-instructions are not listed
  * Pseudo-instructions are not listed
  * Counter CSRs: `cycle` `cycleh` `instret` `instreth` `time` `timeh` `mcycle` `mcycleh` `minstret` `minstreth` `mcounteren` `mcountinhibit`
  * Counter CSRs: `[m]cycle[h]` `[m]instret[m]` `time[h]` `[m]hpmcounter*[h]`(3..31, configurable) `mcounteren` `mcountinhibit` `mhpmevent*`(3..31, configurable)
  * Machine CSRs: `mstatus` `mstatush` `misa`(read-only!) `mie` `mtvec` `mscratch` `mepc` `mcause` `mtval` `mip` `mvendorid` [`marchid`](https://github.com/riscv/riscv-isa-manual/blob/master/marchid.md) `mimpid` `mhartid` `mzext`(custom)
  * Machine CSRs: `mstatus[h]` `misa`(read-only!) `mie` `mtvec` `mscratch` `mepc` `mcause` `mtval` `mip` `mvendorid` [`marchid`](https://github.com/riscv/riscv-isa-manual/blob/master/marchid.md) `mimpid` `mhartid` `mzext`(custom)
  * Supported exceptions and interrupts:
  * Supported exceptions and interrupts:
    * Misaligned instruction address
    * Misaligned instruction address
    * Instruction access fault (via unacknowledged bus access after timeout)
    * Instruction access fault (via unacknowledged bus access after timeout)
    * Illegal instruction
    * Illegal instruction
    * Breakpoint (via `ebreak` instruction)
    * Breakpoint (via `ebreak` instruction)
Line 212... Line 212...
 
 
**Privileged architecture / instruction stream synchronization** (`Zifencei` extension):
**Privileged architecture / instruction stream synchronization** (`Zifencei` extension):
  * System instructions: `FENCE.I` (among others, used to clear and reload instruction cache)
  * System instructions: `FENCE.I` (among others, used to clear and reload instruction cache)
 
 
**Privileged architecture / Physical memory protection** (`PMP`, requires `Zicsr` extension):
**Privileged architecture / Physical memory protection** (`PMP`, requires `Zicsr` extension):
  * Additional machine CSRs: `pmpcfg0` `pmpcfg1` `pmpaddr0` `pmpaddr1` `pmpaddr2` `pmpaddr3` `pmpaddr4` `pmpaddr5` `pmpaddr6` `pmpaddr7`
  * Configurable number of regions
 
  * Additional machine CSRs: `pmpcfg*`(0..15) `pmpaddr*`(0..63)
 
 
 
 
### Non-RISC-V-Compliant Issues
### Non-RISC-V-Compliant Issues
 
 
* CPU and Processor are BIG-ENDIAN, but this should be no problem as the external memory bus interface provides big- and little-endian configurations
* CPU and Processor are BIG-ENDIAN, but this should be no problem as the external memory bus interface provides big- and little-endian configurations
* `misa` CSR is read-only - no dynamic enabling/disabling of synthesized CPU extensions during runtime; for compatibility: write accesses (in m-mode) are ignored and do not cause an exception
* `misa` CSR is read-only - no dynamic enabling/disabling of synthesized CPU extensions during runtime; for compatibility: write accesses (in m-mode) are ignored and do not cause an exception
* The physical memory protection (**PMP**) only supports `NAPOT` mode, a minimal granularity of 8 bytes and only up to 8 regions
* The physical memory protection (**PMP**) only supports `NAPOT` mode yet and a minimal granularity of 8 bytes
* The `A` extension only implements `lr.w` and `sc.w` instructions yet. However, these instructions are sufficient to emulate all further AMO operations
* The `A` extension only implements `lr.w` and `sc.w` instructions yet. However, these instructions are sufficient to emulate all further AMO operations
* The `mcause` trap code `0x80000000` (originally reserved in the RISC-V specs) is used to indicate a hardware reset (non-maskable reset).
* The `mcause` trap code `0x80000000` (originally reserved in the RISC-V specs) is used to indicate a hardware reset (as "non-maskable interrupt").
 
 
 
 
### NEORV32-Specific CPU Extensions
### NEORV32-Specific CPU Extensions
 
 
The NEORV32-specific extensions are always enabled and are indicated via the `X` bit in the `misa` CSR.
The NEORV32-specific extensions are always enabled and are indicated via the `X` bit in the `misa` CSR.
Line 241... Line 242...
### NEORV32 CPU
### NEORV32 CPU
 
 
This chapter shows exemplary implementation results of the NEORV32 CPU for an **Intel Cyclone IV EP4CE22F17C6N FPGA** on
This chapter shows exemplary implementation results of the NEORV32 CPU for an **Intel Cyclone IV EP4CE22F17C6N FPGA** on
a DE0-nano board. The design was synthesized using **Intel Quartus Prime Lite 20.1** ("balanced implementation"). The timing
a DE0-nano board. The design was synthesized using **Intel Quartus Prime Lite 20.1** ("balanced implementation"). The timing
information is derived from the Timing Analyzer / Slow 1200mV 0C Model. If not otherwise specified, the default configuration
information is derived from the Timing Analyzer / Slow 1200mV 0C Model. If not otherwise specified, the default configuration
of the CPU's generics is assumed (for example no PMP). No constraints were used at all. The `u` and `Zifencei` extensions have
of the CPU's generics is assumed (e.g. no physical memory protection, no hardware performance monitors).
a negligible impact on the hardware requirements.
No constraints were used at all. The `u` and `Zifencei` extensions have a negligible impact on the hardware requirements.
 
 
Results generated for hardware version [`1.4.9.2`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
Results generated for hardware version [`1.4.9.2`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
 
 
| CPU Configuration                       | LEs        | FFs      | Memory bits | DSPs | f_max   |
| CPU Configuration                       | LEs        | FFs      | Memory bits | DSPs | f_max   |
|:----------------------------------------|:----------:|:--------:|:-----------:|:----:|:-------:|
|:----------------------------------------|:----------:|:--------:|:-----------:|:----:|:-------:|
Line 304... Line 305...
The FPGA-specific memory components can be found in [`rtl/fpga_specific`](https://github.com/stnolting/neorv32/blob/master/rtl/fpga_specific/lattice_ice40up).
The FPGA-specific memory components can be found in [`rtl/fpga_specific`](https://github.com/stnolting/neorv32/blob/master/rtl/fpga_specific/lattice_ice40up).
* The clock frequencies marked with a "c" are constrained clocks. The remaining ones are _f_max_ results from the place and route timing reports.
* The clock frequencies marked with a "c" are constrained clocks. The remaining ones are _f_max_ results from the place and route timing reports.
* The Upduino and the Arty board have on-board SPI flash memories for storing the FPGA configuration. These device can also be used by the default NEORV32
* The Upduino and the Arty board have on-board SPI flash memories for storing the FPGA configuration. These device can also be used by the default NEORV32
bootloader to store and automatically boot an application program after reset (both tested successfully).
bootloader to store and automatically boot an application program after reset (both tested successfully).
* The setups with `PMP` implement 2 regions with a minimal granularity of 64kB.
* The setups with `PMP` implement 2 regions with a minimal granularity of 64kB.
 
* No HPM counters are implemented.
 
 
 
 
 
 
## Performance
## Performance
 
 
Line 315... Line 317...
 
 
The [CoreMark CPU benchmark](https://www.eembc.org/coremark) was executed on the NEORV32 and is available in the
The [CoreMark CPU benchmark](https://www.eembc.org/coremark) was executed on the NEORV32 and is available in the
[sw/example/coremark](https://github.com/stnolting/neorv32/blob/master/sw/example/coremark) project folder. This benchmark
[sw/example/coremark](https://github.com/stnolting/neorv32/blob/master/sw/example/coremark) project folder. This benchmark
tests the capabilities of a CPU itself rather than the functions provided by the whole system / SoC.
tests the capabilities of a CPU itself rather than the functions provided by the whole system / SoC.
 
 
Results generated for hardware version [`1.4.7.0`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
 
 
 
~~~
~~~
**Configuration**
**Configuration**
Hardware:       32kB IMEM, 16kB DMEM, no caches, 100MHz clock
Hardware:       32kB IMEM, 16kB DMEM, no caches(!), 100MHz clock
CoreMark:       2000 iterations, MEM_METHOD is MEM_STACK
CoreMark:       2000 iterations, MEM_METHOD is MEM_STACK
Compiler:       RISCV32-GCC 10.1.0 (rv32i toolchain)
Compiler:       RISCV32-GCC 10.1.0 (rv32i toolchain)
Compiler flags: default, see makefile
Compiler flags: default, see makefile
Peripherals:    UART for printing the results
Peripherals:    UART for printing the results
~~~
~~~
 
 
| CPU                                         | Executable Size | Optimization | CoreMark Score | CoreMarks/MHz |
Results generated for hardware version [`1.4.9.8`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
 
 
 
| CPU (including `Zicsr`)                     | Executable Size | Optimization | CoreMark Score | CoreMarks/MHz |
|:--------------------------------------------|:---------------:|:------------:|:--------------:|:-------------:|
|:--------------------------------------------|:---------------:|:------------:|:--------------:|:-------------:|
| `rv32i`                                     |    27 424 bytes |        `-O3` |          35.71 |    **0.3571** |
| `rv32i`                                     |    28 756 bytes |        `-O3` |          36.36 |    **0.3636** |
| `rv32im`                                    |    26 232 bytes |        `-O3` |          66.66 |    **0.6666** |
| `rv32im`                                    |    27 516 bytes |        `-O3` |          68.97 |    **0.6897** |
| `rv32imc`                                   |    20 876 bytes |        `-O3` |          66.66 |    **0.6666** |
| `rv32imc`                                   |    22 008 bytes |        `-O3` |          68.97 |    **0.6897** |
| `rv32imc` + `FAST_MUL_EN`                   |    20 876 bytes |        `-O3` |          83.33 |    **0.8333** |
| `rv32imc` + `FAST_MUL_EN`                   |    22 008 bytes |        `-O3` |          86.96 |    **0.8696** |
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` |    20 876 bytes |        `-O3` |          86.96 |    **0.8696** |
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` |    22 008 bytes |        `-O3` |          90.91 |    **0.9091** |
 
 
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
 
 
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
When the `C` extension is enabled, branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
Line 345... Line 347...
### Instruction Cycles
### Instruction Cycles
 
 
The NEORV32 CPU is based on a two-stages pipelined architecutre. Each stage uses a multi-cycle processing scheme. Hence,
The NEORV32 CPU is based on a two-stages pipelined architecutre. Each stage uses a multi-cycle processing scheme. Hence,
each instruction requires several clock cycles to execute (2 cycles for ALU operations, ..., 40 cycles for divisions).
each instruction requires several clock cycles to execute (2 cycles for ALU operations, ..., 40 cycles for divisions).
The average CPI (cycles per instruction) depends on the instruction mix of a specific applications and also on the available
The average CPI (cycles per instruction) depends on the instruction mix of a specific applications and also on the available
CPU extensions.
CPU extensions. *By default* the CPU-internal shifter (e.g. for the `SLL` instruction) as well as the multiplier and divider of the
 
 
Please note that by default the CPU-internal shifter (e.g. for the `SLL` instruction) as well as the multiplier and divider of the
 
`M` extension use a bit-serial approach and require several cycles for completion.
`M` extension use a bit-serial approach and require several cycles for completion.
 
 
The following table shows the performance results for successfully running 2000 CoreMark
The following table shows the performance results for successfully running 2000 CoreMark
iterations, which reflects a pretty good "real-life" work load. The average CPI is computed by
iterations, which reflects a pretty good "real-life" work load. The average CPI is computed by
dividing the total number of required clock cycles (only the timed core to avoid distortion due to IO wait cycles; sampled via the `cycle[h]` CSRs)
dividing the total number of required clock cycles (only the timed core to avoid distortion due to IO wait cycles; sampled via the `cycle[h]` CSRs)
by the number of executed instructions (`instret[h]` CSRs). The executables were generated using optimization `-O3`.
by the number of executed instructions (`instret[h]` CSRs). The executables were generated using optimization `-O3`.
 
 
Results generated for hardware version [`1.4.7.0`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
Results generated for hardware version [`1.4.9.8`](https://github.com/stnolting/neorv32/blob/master/CHANGELOG.md).
 
 
| CPU                                         | Required Clock Cycles | Executed Instructions | Average CPI |
| CPU  (including `Zicsr`)                    | Required Clock Cycles | Executed Instructions | Average CPI |
|:--------------------------------------------|----------------------:|----------------------:|:-----------:|
|:--------------------------------------------|----------------------:|----------------------:|:-----------:|
| `rv32i`                                     |         5 648 997 774 |         1 469 233 238 |    **3.84** |
| `rv32i`                                     |         5 595 750 503 |         1 466 028 607 |    **3.82** |
| `rv32im`                                    |         3 036 749 774 |           601 871 338 |    **5.05** |
| `rv32im`                                    |         2 966 086 503 |           598 651 143 |    **4.95** |
| `rv32imc`                                   |         3 036 959 882 |           615 034 616 |    **4.94** |
| `rv32imc`                                   |         2 981 786 734 |           611 814 918 |    **4.87** |
| `rv32imc` + `FAST_MUL_EN`                   |         2 454 407 882 |           615 034 588 |    **3.99** |
| `rv32imc` + `FAST_MUL_EN`                   |         2 399 234 734 |           611 814 918 |    **3.92** |
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` |         2 320 308 322 |           615 034 676 |    **3.77** |
| `rv32imc` + `FAST_MUL_EN` + `FAST_SHIFT_EN` |         2 265 135 174 |           611 814 948 |    **3.70** |
 
 
 
 
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
The `FAST_MUL_EN` configuration uses DSPs for the multiplier of the `M` extension (enabled via the `FAST_MUL_EN` generic). The `FAST_SHIFT_EN` configuration
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
uses a barrel shifter for CPU shift operations (enabled via the `FAST_SHIFT_EN` generic).
 
 
When the `C` extension is enabled branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
When the `C` extension is enabled branches to an unaligned uncompressed instruction require additional instruction fetch cycles.
Line 591... Line 590...
 
 
> S. Nolting, "The NEORV32 Processor", github.com/stnolting/neorv32
> S. Nolting, "The NEORV32 Processor", github.com/stnolting/neorv32
 
 
#### BSD 3-Clause License
#### BSD 3-Clause License
 
 
Copyright (c) 2020, Stephan Nolting. All rights reserved.
Copyright (c) 2021, Stephan Nolting. All rights reserved.
 
 
Redistribution and use in source and binary forms, with or without modification, are
Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:
permitted provided that the following conditions are met:
 
 
1. Redistributions of source code must retain the above copyright notice, this list of
1. Redistributions of source code must retain the above copyright notice, this list of

powered by: WebSVN 2.1.0

© copyright 1999-2024 OpenCores.org, equivalent to Oliscience, all rights reserved. OpenCores®, registered trademark.