# DarkRISCV
Open-source RISC-V implemented from scratch in one night!
 
## Table of Contents

- [Introduction](#introduction)
- [Project Background](#project-background)
- [Directory Description](#directory-description)
- ["src" Directory](#src-directory)
- ["sim" Directory](#sim-directory)
- ["rtl" Directory](#rtl-directory)
- ["board" Directory](#board-directory)
- [Implementation Notes](#implementation-notes)
- [Development Tools](#development-tools)
- [Development Boards](#development-boards)
- [Creating a RISCV from scratch](#creating-a-riscv-from-scratch)
- [Acknowledgments](#acknowledgments)
- [References](#references)
 
## Introduction

Developed in a magic night of 19 Aug, 2018 between 2am and 8am, the
*DarkRISCV* softcore started as a proof of concept for the open-source
RISC-V instruction set.

The initial concept was based on my earlier 16-bit RISC processors and
composed of a simplified two-stage pipeline, where an instruction is fetched
from the instruction memory in the first clock and then decoded/executed in
the second clock.  The pipeline is overlapped without interlocks, so the
*DarkRISCV* can reach a performance of one clock per instruction most of
the time, except for a taken branch, where one clock is lost in the pipeline
flush.  Of course, in order to perform read operations in blockrams in a
single clock, either a single-phase clock with combinational memory or a
two-phase clock with blockram memory is required, so that no wait states
are needed in those cases.

As a result, the code is very compact, with around three hundred lines of
obfuscated but beautiful Verilog code.  After lots of exciting sleepless
nights of work and the help of lots of colleagues, the *DarkRISCV* reached a
very good quality level, to the point that code compiled by the standard
GCC for RV32I worked fine.
 
Nowadays, after two years of development, a three-stage pipeline working
with a single clock phase is also available, resulting in a better
distribution between the decode and execute stages.  In this case the
instruction is fetched from a blockram in the first clock, decoded in the
second clock and executed in the third clock.

Since the load instruction cannot load data from a blockram in a
single clock, the external logic inserts one extra clock in IO operations.
Also, there are two extra clocks in order to flush the pipeline in the case
of taken branches.  The impact of the pipeline flush depends on the compiler
optimizations, but according to the latest measurements, the 3-stage
pipeline version reaches 0.7 instructions per clock (IPC), lower
than the measured IPC of 0.85 for the 2-stage pipeline version.

Anyway, with the 3-stage pipeline and some other expensive optimizations,
the *DarkRISCV* can reach 100MHz in a low-cost Spartan-6, which results in
more performance when compared with the 2-stage pipeline version (typically
50MHz).
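
As a rough cross-check of the trade-off described above, throughput can be estimated as the clock frequency multiplied by the IPC. The sketch below is only an illustration in C; the function name `effective_mips` is an assumption and is not part of the DarkRISCV sources:

```c
#include <assert.h>

/* Estimated throughput in MIPS for a core clocked at `clock_mhz`
 * that retires `ipc` instructions per clock on average. */
double effective_mips(double clock_mhz, double ipc)
{
    assert(clock_mhz >= 0.0 && ipc >= 0.0);
    return clock_mhz * ipc;
}
```

With the figures quoted above, `effective_mips(100.0, 0.7)` gives about 70 MIPS for the 3-stage pipeline, against about 42.5 MIPS for `effective_mips(50.0, 0.85)` in the 2-stage case.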
 
Although the code is small and crude when compared with other RISC-V
implementations, the *DarkRISCV* has lots of impressive features:

- implements most of the RISC-V RV32I instruction set (missing `csr*`, `e*` and `fence*`)
- works up to 100MHz (Spartan-6) and sustains 1 clock per instruction most of the time
- flexible Harvard architecture (easy to integrate a cache controller)
- works fine in real Xilinx and Lattice FPGAs
- works fine with gcc 9.0.0 for RISC-V (no patches required!)
- uses between 1000 and 1500 LUTs, depending on the enabled features (Xilinx LUT6)
- optional RV32E support (works better with LUT4 FPGAs)
- optional 16x16-bit MAC instruction (for signal processing)
- optional coarse-grained multi-threading (MT)
- no interlock between pipeline stages
- BSD license: can be used anywhere with no restrictions!
 
Some extra features are planned for the future or under development:

- interrupt controller (under tests)
- cache controller (under tests)
- gpio and timer (under tests)
- sdram controller w/ data scrambler
- branch predictor (under tests)
- ethernet controller (GbE)
- multi-processing (SMP)
- network on chip (NoC)
- rv64i support (not as easy as it appears...)
- dynamic bus size and big-endian support
- user/supervisor modes
- debug support

And many other features!

Feel free to make suggestions and good hacking! o/
 
## Project Background

The main motivation for the *DarkRISCV* was to create a migration path for
some projects around the 680x0/Coldfire family.

Although there are lots of 680x0 cores available, they are designed around
different concepts and requirements, so I found few options matching
my requirements (more than 50MIPS with around 1000LUTs).  The best
option at this moment, the TG68, requires at least 2400LUTs (after removing
the MUL/DIV instructions) and works up to 40MHz in a Spartan-6.  In addition,
the TG68 core requires at least 2 clocks per instruction, which means a peak
performance of 20MIPS.  Since the 680x0 instruction set is very complex, this
result is really not bad at all and, at this moment, it is probably the best
open-source option to replace the 68000.

Anyway, it does not match my requirements regarding space and
performance.  As part of the investigation, I tested other cores, but I
found few options as good as the TG68 and I even started to design a
risclized-68000 core, in order to try to find a solution.

Unfortunately, due to compiler requirements (standard GCC), I found few
ways to reduce the space and increase the performance, so I
started to investigate non-680x0 cores.
 
After lots of tests with different cores, I found the *picorv32* core and
the whole ecosystem around RISC-V.  The *picorv32* is a very nice
project and can peak up to 150MHz in a low-cost Spartan-6.  Although most
instructions require 3 or 4 clocks, the *picorv32*
resembles the 68020 in some ways, but running at 150MHz and providing a peak
performance of 50MIPS, which is very impressive.

Although the *picorv32* is a very good option to directly replace the 680x0
family, it is not powerful enough to replace some Coldfire processors (more
than 75MIPS).

Since I had some good experience with experimental 16-bit RISC cores for
DSP-like applications, I started to code the *DarkRISCV* only to check the
level of complexity and compare it with my risclized-68000.  To my surprise,
in the first night I mapped almost all instructions of the rv32i
specification and the *DarkRISCV* started to execute the first instructions
correctly at 75MHz and with one clock per instruction, which not only
resembles a fast and nice 68040, but can also beat some Coldfires!  wow!  :)

After the success of the first night of work, I started working on
fixing small details in the hardware and software implementation.
 
## Directory Description

Although the *DarkRISCV* is only a small processor core, a small ecosystem
is required in order to test the core, including RISCV-compatible software,
support for simulations and support for peripherals, so that the
processor core produces observable results. Each element is stored with
similar elements in directories, and the top level has the
following organization:

- [README.md](README.md): the top level README file (points to this document)
- [LICENSE](LICENSE): unlimited freedom! o/
- [Makefile](Makefile): the show starts here!
- [src](src): the source code for the test firmware (boot.c, main.c etc. in C language)
- [rtl](rtl): the source code for the *DarkRISCV* core and the support logic (Verilog)
- [sim](sim): the source code for the simulation to test the rtl files (currently via icarus)
- [board](board): support and examples for different boards (currently via Xilinx ISE)
- [tmp](tmp): empty, but the ISE will create lots of files here
 
Setup Instructions:

Step 1: Clone the DarkRISCV repo to your local machine using the command below.

        git clone https://github.com/darklife/darkriscv.git

Pre Setup Guide for macOS:

This guide covers the dependencies, and the steps to install them, needed to use the DarkRISCV ecosystem on macOS.

Essentially, the ecosystem cannot be used natively on macOS because one of the dependencies, the Xilinx ISE 14.7 Design Suite, currently does not support macOS.

To overcome this issue, we need to run Linux or Windows software on macOS using one of the two methods below:

a) WineSkin, which is a kind of Windows emulator that runs the Windows application natively but intercepts the Windows calls and maps them directly to macOS.

b) VirtualBox (or VMware, Parallels, etc.) in order to run a complete Windows or Linux OS, which appears to be far better than the WineSkin option.

I used the second method and installed VMware Fusion to run Linux Mint. The links I used to obtain the download files are listed below.
 
Dependencies:

1.  Icarus Verilog, which needs:
    - Bison
    - GCC
    - G++
    - Flex
2.  Xilinx ISE 14.7
 
Icarus Verilog Setup:

The steps below are condensed for Linux. Complete steps for all other platforms are available at https://iverilog.fandom.com/wiki/Installation_Guide.

Step 1: Download the Verilog tar file from ftp://ftp.icarus.com/pub/eda/verilog/ . Always install the latest version. Verilog 10.3 is the latest version as of this writing.

Step 2: Extract the tar file using `tar -zxvf verilog-version.tar.gz`.

Step 3: Go to the Verilog folder using `cd verilog-version`. Here it is `cd verilog-10.3`.

Step 4: Check that you have the following packages installed: Flex, Bison, g++ and gcc. If not, use `sudo apt-get install flex bison g++ gcc` in a terminal to install them. Restart the system once for the changes to take effect.

Step 5: Run the commands below in the verilog-10.3 directory:

1.  `./configure`
2.  `make`
3.  `sudo make install`

Step 6: Use `sudo apt-get install verilog` to install Verilog.

Optional Step: `sudo apt-get install gtkwave`
 
Xilinx Setup:

Follow the YouTube video below for the complete installation.

https://www.youtube.com/watch?v=meO-b6Ib17Y

Note: Make sure you have the libncurses libraries installed in Linux.

If not, use the commands below:

1.  For 64-bit architecture: `sudo apt-get install libncurses5 libncursesw-dev`
2.  For 32-bit architecture: `sudo apt-get install libncurses5:i386`

Once all prerequisites are installed, go to the root directory of the repository and run the commands below:

        cd darkriscv
        make

Use *sudo* if required.
 
The top level *Makefile* is responsible for building everything, but it must
be edited first: at the very least, the user must select the compiler
path and the target board.

By default, the top level *Makefile* uses:

        CROSS = riscv32-embedded-elf
        CCPATH = /usr/local/share/gcc-$(CROSS)/bin/
        ICARUS = /usr/local/bin/iverilog
        BOARD  = avnet_microboard_lx9

Just update the configuration according to your system configuration,
type *make* and hope everything is in the correct location! You will probably
need to fix some paths and set some others in the PATH environment variable, but
it will eventually work.

And, when everything is correctly configured, the result will be something like this:
 
```
# make
make -C src all             CROSS=riscv32-embedded-elf CCPATH=/usr/local/share/gcc-riscv32-embedded-elf/bin/ ARCH=rv32e HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S boot.c -o boot.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c boot.s -o boot.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S stdio.c -o stdio.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c stdio.s -o stdio.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S main.c -o main.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c main.s -o main.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S io.c -o io.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c io.s -o io.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S banner.c -o banner.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c banner.s -o banner.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-cpp -P  -DHARVARD=1 darksocv.ld.src darksocv.ld
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld -Tdarksocv.ld -Map=darksocv.map -m elf32lriscv  boot.o stdio.o main.o io.o banner.o -o darksocv.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld: warning: section `.data' type changed to PROGBITS
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objdump -d darksocv.o > darksocv.lst
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.text --only-section .text*
hexdump -ve '1/4 "%08x\n"' darksocv.text > darksocv.rom.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.rom.mem
rm darksocv.text
wc -l darksocv.rom.mem
1016 darksocv.rom.mem
echo rom ok.
rom ok.
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.data --only-section .*data*
hexdump -ve '1/4 "%08x\n"' darksocv.data > darksocv.ram.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.ram.mem
rm darksocv.data
wc -l darksocv.ram.mem
317 darksocv.ram.mem
echo ram ok.
ram ok.
echo sources ok.
sources ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
make -C sim all             ICARUS=/usr/local/bin/iverilog HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
/usr/local/bin/iverilog -I ../rtl -o darksocv darksimv.v ../rtl/darksocv.v ../rtl/darkuart.v ../rtl/darkriscv.v
./darksocv
WARNING: ../rtl/darksocv.v:280: $readmemh(../src/darksocv.rom.mem): Not enough words in the file for the requested range [0:1023].
WARNING: ../rtl/darksocv.v:281: $readmemh(../src/darksocv.ram.mem): Not enough words in the file for the requested range [0:1023].
VCD info: dumpfile darksocv.vcd opened for output.
reset (startup)

              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv
rr                vvvvvvvvvvvvvvvvvvvvvv
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: simulation only (id=0)
build: darkriscv fw build Sat, 30 May 2020 00:55:21 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
> no UART input, finishing simulation...
echo simulation ok.
simulation ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
make -C boards all          BOARD=piswords_rs485_lx9 HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/boards'
cd ../tmp && xst -intstyle ise -ifn ../boards/piswords_rs485_lx9/darksocv.xst -ofn ../tmp/darksocv.syr
Reading design: ../boards/piswords_rs485_lx9/darksocv.prj

*** lots of weird FPGA related messages here ***

cd ../tmp && bitgen -intstyle ise -f ../boards/avnet_microboard_lx9/darksocv.ut ../tmp/darksocv.ncd
echo done.
done.
```
 
This means that the software compiled and linked correctly, the simulation
worked correctly and the FPGA build produced an image that can be loaded into
your FPGA board with a *make install* (in case you have an FPGA board and, of
course, a JTAG support script in the board directory).

If the FPGA is correctly programmed and the UART is attached to a terminal
emulator, the FPGA will be configured with the DarkRISCV, which will run the
test software and produce the following result:
 
```
              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv
rr                vvvvvvvvvvvvvvvvvvvvvv
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: piswords rs485 lx9 (id=6)
build: darkriscv fw build Fri, 29 May 2020 23:56:39 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
>
```
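
The *uart0* and *timr0* lines in the boot log can be checked by hand. The following is a minimal sketch in C, assuming that `div` is the core clock divided by the baud rate and that the timer fires every `io.timer + 1` clocks. Both relations are inferred from the printed values, not taken from the firmware, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Clocks per UART bit: core clock divided by the baud rate
 * (integer division, assumed to match the "div=" value in the log). */
uint32_t uart_div(uint32_t clk_hz, uint32_t baud)
{
    assert(baud != 0);
    return clk_hz / baud;
}

/* Timer rate, assuming the counter reloads every (io_timer + 1) clocks. */
uint32_t timer_hz(uint32_t clk_hz, uint32_t io_timer)
{
    assert(io_timer != UINT32_MAX);   /* avoid wrapping to a zero divisor */
    return clk_hz / (io_timer + 1u);
}
```

For the 100MHz core shown in the log, `uart_div(100000000, 115200)` yields 868 and `timer_hz(100000000, 99)` yields 1000000, which agrees with the printed values.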
 
The beautiful ASCII RISCV logo was produced by Andrew Waterman! [6]

As long as the build works, it is possible to start making changes, but my
recommendation when working with soft processors is *not to work* on the
hardware and software *at the same time*!  This means that it is better to
freeze the hardware and work only on the software *or* freeze the software
and work only on the hardware.  It is perfectly possible to do your research
on both, but not at the same time, otherwise you will find the *DarkRISCV* in
a non-working state after software and hardware changes and you will not be
sure where the problem is.
 
### "src" Directory

The *src* directory contains the source code for the test firmware, which
includes the boot code, the main process and auxiliary libraries. The code is
compiled via *gcc* in a way that produces some auxiliary files,
for example:

- boot.c: the original C code for the boot process
- boot.s: the assembler version of the C code, generated automatically by gcc
- boot.o: the compiled version of the C code, generated automatically by gcc

When all .o files are produced, the result is linked in a *darksocv.o* ELF
file, which is used to produce the *darksocv.bin* file, which is converted to
hexadecimal and separated into ROM and RAM files (which are loaded by the
Verilog code into the blockRAMs). The build also produces a *darksocv.lst*
with a complete listing of the generated code and the *darksocv.map*, which
shows the map of all functions and variables in the produced code.
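
The conversion to hexadecimal is done in the build log by `hexdump -ve '1/4 "%08x\n"'`, which prints one 32-bit word per line. As an illustration only, the same idea can be sketched in C as below, assuming a little-endian image; the function names and the zero padding of a trailing partial word are assumptions of this sketch, not behaviour taken from the project:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assemble one 32-bit word from 4 bytes stored in little-endian order. */
uint32_t le_word(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Write `len` bytes of a raw image as one 8-digit hex word per line,
 * the layout read by $readmemh. A trailing partial word is zero-padded
 * (an assumption of this sketch). Returns the number of lines written. */
size_t write_mem_words(FILE *out, const uint8_t *image, size_t len)
{
    size_t lines = 0;
    assert(out != NULL && (image != NULL || len == 0));
    for (size_t i = 0; i < len; i += 4) {
        uint8_t w[4] = {0, 0, 0, 0};
        for (size_t j = 0; j < 4 && i + j < len; j++)
            w[j] = image[i + j];
        fprintf(out, "%08" PRIx32 "\n", le_word(w));
        lines++;
    }
    return lines;
}
```

Calling `write_mem_words(stdout, image, len)` on the raw `.text` bytes would print lines in the same style as *darksocv.rom.mem*, one instruction word per line.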
 
The firmware concept is very simple:

- boot.c contains the boot code
- main.c contains the main application code (shell)
- banner.c contains the riscv banner
- stdio.c contains a small version of stdio
- io.c contains the IO interfaces

Extra code can be easily added to the compilation by editing the *src/Makefile*.

For example, in order to add a Lempel-Ziv code *lz.c*, it is necessary to make
the Makefile know that we need the *lz.s* and *lz.o*:

        OBJS = boot.o stdio.o main.o io.o banner.o lz.o
        ASMS = boot.s stdio.s main.s io.s banner.s lz.s
        SRCS = boot.c stdio.c main.c io.c banner.c lz.c

And add a "lz" command in the *main.c*, so that it is possible to call
the function via the prompt. Alternatively, it is possible to entirely replace
the provided firmware and use your own firmware.
 
### "sim" Directory

The simulation, on the other hand, shows some waveforms, making it possible to
check the *DarkRISCV* operation when running the example code.

The main simulation tool for *DarkRISCV* is the iSIM from Xilinx ISE 14.7,
but the Icarus simulator is also supported via the Makefile in the *sim*
directory (the changes regarding Icarus are active when the symbol
`__ICARUS__` is detected). I also included a workaround for ModelSim, as
pointed out by our friend HYF (the changes regarding ModelSim are active when
the symbol `MODEL_TECH` is detected).

The simulation runs the same firmware as the real FPGA, but in order to
improve the simulation performance, the UART code is not simulated, since
115200 bps requires lots of dead simulation time.
 
### "rtl" Directory

The RTL directory contains the *DarkRISCV* core and some auxiliary files,
such as the DarkSoCV (a small system-on-chip with ROM, RAM and IO),
the DarkUART (a small UART for debug) and the configuration file, where it is
possible to enable and disable some features that are described in the
Implementation Notes section.
 
### "board" Directory

The currently supported boards are:

- avnet_microboard_lx9
- lattice_brevia2_xp2
- piswords_rs485_lx9
- qmtech_sdram_lx16
- qmtech_spartan7_s15
- xilinx_ac701_a200

The organization is self-explanatory, with the vendor, board and FPGA model
in the name of the directory. Each *board* directory contains the project
files to be opened in the Xilinx ISE 14.x, as well as Makefiles to build the
FPGA image for that board model. Although a *ucf* file is provided in
order to generate a complete build with a UART and some LEDs, the FPGA is
NOT fully wired in any particular configuration and you must add the
pins that you will use in your FPGA board.

Anyway, although not wired, the build always gives you a good estimation
of the FPGA utilization and of the timing (because the UART output
ensures that the complete processor must be synthesized).

Since there are many supported boards, there is no way to test all boards
every time, which means that sometimes changes made for one board may
affect another board in a wrong way.
 
## Implementation Notes*

[*This section is kept for reference, but the description may not match
exactly with the current code]

Since my target is the ultra-low-cost Xilinx Spartan-6 family of FPGAs, the
project is currently based on the Xilinx ISE 14.7 for Linux, which is the
latest available ISE version.  However, there is no explicit reference to
Xilinx elements and all logic is inferred directly from Verilog, which means
that the project is easily portable to other FPGA families and
to other environments, as can be observed in the case of the Lattice
XP2 support.  Anyway, keep in mind that certain Verilog structures may not
work well in some FPGAs.
 
In the last update I included a way to test the firmware on the x86 host,
which helps a lot, since it is possible to interact with the firmware and
quickly fix some obvious bugs. Of course, the x86 code does not run the boot.c
code, since it makes no sense (?) to run the RISCV boot code on the x86.

Anyway, as the main recommendation when working with softcores, try never to
work on the hardware and the software at the same time!  Start with the
minimum software configuration possible and freeze the software.  When
implementing new software updates, use the minimum hardware configuration
possible and freeze the hardware.
 
The RV32I specification itself is really impressive and easy to implement
(see [1], page 16).  Of course, there are some drawbacks, such as the funny
little-endian bus (as opposed to the network-oriented big-endian bus found in
the 680x0 family), but after some empirical tests it is easy to make it work.

The funny thing here is that, after lots of research on adding
big-endian support to the *DarkRISCV*, I found no way to make GCC
generate the code and data correctly.

Another drawback in the specification is the lack of delayed branches.
Although I understand that they are bad from the conceptual point of view,
they are a good trick to extract more performance. As a reference, the
lack of delayed branches or a branch predictor in the *DarkRISCV* may reduce
the performance by between 20 and 30%, so that the real measured
performance may be between 1.25 and 1.66 clocks per instruction.

Although branch prediction is not complex to implement, I found the
experimental multi-threading support far more interesting, since it makes it
possible to use the idle time in the branches to swap the processor thread.
Anyway, I will try to debug the branch prediction code in order to improve the
single-thread performance.
 
The core supports 2 or 3-stage pipelines and, although the main logic is
almost the same, there are huge differences in how they work. Just for
reference, the following section reflects the historic evolution of the
core and may not reflect the current core code.

The original 2-stage pipeline design has a small problem concerning
the ROM and RAM timing: in order to pre-fetch and execute the
instruction in two clocks and keep the pre-fetch continuously working at the
rate of 1 instruction per clock (and the same for the execution), the ROM and
RAM must respond before the next clock.  This means that the memories must
be combinational or, at least, use a 2-phase clock.

The first solution for the 2-stage pipeline version, with a 2-phase clock, is
the default solution and makes the *DarkRISCV* work as a pseudo 4-stage
pipeline:

- 1/2 stage for instruction pre-fetch (rom)
- 1/2 stage for static instruction decode (core)
- 1/2 stage for address generation, register read and data read/write (ram)
- 1/2 stage for data write (register write)

From the processor point of view, there are only 2 stages and from the
memory point of view, there are also 2 stages, but they are in different
clock phases. In normal conditions, this is not recommended because it
decreases the performance by a factor of 2, but in the case of *DarkRISCV*
the performance is always limited by the combinational logic of the
instruction execution.
 
The second solution with a 2-stage pipeline is to use combinational logic in
order to provide the needed results before the next clock edge, so
that it is possible to use a single-phase clock.  This solution is composed of
instruction and data caches: when the operand is stored in a
small LUT-based combinational cache, the processor can perform the memory
operation with no extra wait states.  However, when the operand is not
stored in the cache, extra wait-states are inserted in order to fetch the
operand from a blockram or external memory.  According to some preliminary
tests, the instruction cache with 64 direct-mapped instructions can reach a
hit ratio of 91%.  The data cache, although its performance is not as good
(with a hit ratio of only 68%), will be a requirement in order to access
external memory and reduce the impact of slow SDRAMs and FLASHes.
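
To make the idea of a 64-entry direct-mapped instruction cache more concrete, here is a small behavioural model in C. It is a generic sketch of the technique and not the DarkRISCV cache logic: the type `icache_t`, the function names and the index/tag split (word-address bits [7:2] as index, the upper bits as tag) are all assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ICACHE_LINES 64u  /* one 32-bit instruction per line, as in the text */

typedef struct {
    bool     valid[ICACHE_LINES];
    uint32_t tag[ICACHE_LINES];
    uint32_t data[ICACHE_LINES];
} icache_t;

/* Index: address bits [7:2]; tag: the remaining upper bits (assumed split). */
static uint32_t icache_index(uint32_t addr) { return (addr >> 2) % ICACHE_LINES; }
static uint32_t icache_tag(uint32_t addr)   { return addr >> 8; }

/* Returns true on a hit and stores the cached word in *out. */
bool icache_lookup(const icache_t *c, uint32_t addr, uint32_t *out)
{
    assert(c != NULL && out != NULL);
    uint32_t i = icache_index(addr);
    if (c->valid[i] && c->tag[i] == icache_tag(addr)) {
        *out = c->data[i];
        return true;
    }
    return false;   /* miss: this is where wait-states would be inserted */
}

/* Fill the line after fetching `word` from the slower memory. */
void icache_fill(icache_t *c, uint32_t addr, uint32_t word)
{
    assert(c != NULL);
    uint32_t i = icache_index(addr);
    c->valid[i] = true;
    c->tag[i]   = icache_tag(addr);
    c->data[i]  = word;
}
```

In this model two addresses that are 256 bytes apart share the same line, so a loop that alternates between them misses every time. Conflicts of this kind are one reason why a direct-mapped cache does not reach a 100% hit ratio.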
 
Unfortunately, the instruction and data caches are not working anymore for
the 2-stage pipeline version and only the instruction cache is working in
the 3-stage pipeline.  The problem is probably related to the HLT signal
and/or to the write byte enable in the cache memory.

Neither the cache nor the 2-phase clock performs well.  For
this reason, a 3-stage pipeline version is provided, in order to use a single
clock phase with blockrams.
 
587
The concept in this case is separate the pre-fetch and decode, in a way that
588
the pre-fetch can be done entirely in the blockram side for the instruction
589
bus. The decode, in a different stage, provides extra performance and the
590
execute stage works with one clock almost all the time, except when the load
591
instruction is executed. In this case, the external memory logic inserts one
592
wait-state. The write operation, however, is executed in a single clock.
593
 
594
The solution with wait-states can be used in the 2-stage pipeline version,
595
but decreases the performance too much. Case is possible run all versions
596
with the same, clock, the theorical performance in clocks per instruction
597
CPI), number of clocks to flush the pipeline in the taken branch (FLUSH) and
598
memory wait-states (WSMEM) will be:
599
 
600
- 2-stage pipe w/ 2-phase clock: CPI=1, FLUSH=1, WSMEM=0: real CPI=~1.25
601
- 3-stage pipe w/ 1-phase clock: CPI=1, FLUSH=2, WSMEM=1: real CPI=˜1.66
602
- 2-stage pipe w/ 1-phase clock: CPI=2, FLUSH=1, WSMEM=1, real CPI=~2.00
603
 
Empirically, the impact of the FLUSH in the 2-stage pipeline is around 20%
and in the 3-stage pipeline is 30%.  The real impact depends on the code
itself, of course...  The impact of the memory wait-states on the load
instruction ranges between 5 and 10%, again depending on the code.
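
As a rough sanity check of the numbers above, the real CPI can be modelled
as the base CPI plus the average penalty per instruction.  The sketch below
is a Python model, not part of the core.  The function name and the
instruction-mix fractions (25% taken branches, 16% loads) are assumptions
picked only because they reproduce the two pipelined rows of the list; they
are not measurements.

```python
def effective_cpi(base_cpi, flush, wsmem, p_taken, p_load):
    """Additive CPI model: every taken branch costs `flush` extra clocks
    and every load costs `wsmem` extra clocks."""
    return base_cpi + flush * p_taken + wsmem * p_load


# Assumed instruction mix (hypothetical, for illustration only).
P_TAKEN = 0.25  # fraction of instructions that are taken branches
P_LOAD = 0.16   # fraction of instructions that are loads

two_stage_two_phase = effective_cpi(1, 1, 0, P_TAKEN, P_LOAD)
three_stage_one_phase = effective_cpi(1, 2, 1, P_TAKEN, P_LOAD)

print(f"2-stage, 2-phase clock: CPI ~ {two_stage_two_phase:.2f}")
print(f"3-stage, 1-phase clock: CPI ~ {three_stage_one_phase:.2f}")
```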
 
However, the clock in the case of the 3-stage pipeline is far better than in
the 2-stage pipeline, especially because of the better distribution of the
logic between the decode and execute stages.
 
Currently, the most expensive path in the Spartan-6 is the address bus
for the data side of the core (connected to RAM and peripherals).  The
problem is that the following actions must be done in a single clock:

- generate the DADDR[31:0] = REG[SPTR][31:0]+EXTSIG(IMM[11:0])
- generate the BE[3:0] according to the operand size and DADDR[1:0]
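
The two steps above can be sketched as a small Python model.  This is only
an illustration of the arithmetic, not the Verilog from the core: the
function names are hypothetical, and the byte-enable encoding assumes
little-endian byte lanes and naturally aligned accesses.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def data_address(reg_value, imm12):
    """DADDR[31:0] = REG[SPTR][31:0] + EXTSIG(IMM[11:0]), wrapped to 32 bits."""
    return (reg_value + sign_extend(imm12, 12)) & 0xFFFFFFFF


def byte_enable(size, daddr):
    """BE[3:0] from the operand size in bytes and DADDR[1:0] (assumed encoding)."""
    lane = daddr & 3
    if size == 1:
        return 1 << lane              # one of the four byte lanes
    if size == 2:
        return 0b0011 << (lane & 2)   # lower or upper halfword
    if size == 4:
        return 0b1111                 # full word
    raise ValueError("operand size must be 1, 2 or 4 bytes")


addr = data_address(0x00001000, 0xFFC)   # base register plus immediate -4
print(hex(addr), bin(byte_enable(1, addr + 3)))
```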
 
In the case of a read operation, the DATAI path also includes a small mux
in order to separate the RAM and peripheral buses, as well as to separate
the different peripherals, which means that the path grows as the number of
peripherals and the complexity increase.
 
Of course, the best performance setup uses a 3-stage pipeline and a
single clock phase (posedge) in the entire logic, so the 2-stage pipeline
and dual clock phase will be kept only for reference.

The only disadvantages of the 3-stage pipeline are one extra wait-state in
the load operation and the longer pipeline flush of two clocks in the taken
branches.
 
Just for reference, I registered some details regarding the performance
measurements:

The current firmware example, running in the 3-stage pipeline version
clocked at 100MHz, reaches a verified performance of 62 MIPS.  The
theoretical 100MIPS performance is not reached: 5% is lost due to the extra
wait-state in the load instruction and 32% due to pipeline flushes after
taken branches.  The 2-stage pipeline version, on the other hand, runs at a
verified performance of 79MIPS with the same clock.  The only loss is 20%
due to pipeline flushes after a taken branch.
 
Of course, the impact of the pipeline flush also depends on the software
and, so far, the software is optimized for size.  When compiled with -O2
instead of -Os, the performance increases to 68MIPS in the 3-stage pipeline
and the loss changes to 6% for load and 25% for the pipeline flush.  The -O3
option resulted in 67MIPS, and the best result was the -O1 option, which
produced 70MIPS in the 3-stage version and 85MIPS in the 2-stage version.

So, if performance is a requirement, the src/Makefile must be changed in
order to use the -O1 optimization instead of the -Os default.

And although the 2-stage version is 15% faster than the 3-stage version, the
3-stage version can reach better clocks and, for this reason, will provide
better performance.
 
Regarding the pipeline flush, it is required after a taken branch, as the
RISCV does not support delayed branches.  The solution for this problem is
to implement a branch cache (branch predictor), so that the core populates a
cache with the last branches and can predict the future branches.  In some
initial tests, the branch prediction with a 4-entry cache appears to reach
a hit ratio of 60%.
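
To make the branch cache idea more concrete, here is a minimal Python model
of a 4-entry cache that remembers the last taken branches and predicts the
same target the next time.  It is only a sketch: the class name, the
direct-mapped indexing and the replacement policy are assumptions, and it is
not the predictor tested in the core.

```python
class BranchCache:
    """Tiny direct-mapped branch cache (hypothetical model)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.table = [None] * entries   # each slot holds (branch_pc, target_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # instructions are word aligned

    def predict(self, pc):
        """Return the predicted target, or None for 'not taken'."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, taken, target):
        """Remember taken branches, forget a branch that fell through."""
        index = self._index(pc)
        if taken:
            self.table[index] = (pc, target)
        elif self.table[index] is not None and self.table[index][0] == pc:
            self.table[index] = None


def hit_ratio(trace, cache):
    """trace: list of (pc, taken, target) tuples seen at execute time."""
    hits = 0
    for pc, taken, target in trace:
        actual = target if taken else None
        if cache.predict(pc) == actual:
            hits += 1
        cache.update(pc, taken, target)
    return hits / len(trace)


# A loop branch at 0x40 jumping back to 0x20 nine times, then falling through.
loop_trace = [(0x40, True, 0x20)] * 9 + [(0x40, False, None)]
print(hit_ratio(loop_trace, BranchCache()))
```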
 
Another possibility is to use the flush time for other tasks, for example
to handle interrupts.  As the interrupt handling and, in a general way,
threading require flushing the current pipeline in order to change context,
matching the interrupt/threading with the pipeline flush makes some sense!

With the option __THREADING__ it is possible to test this feature.
 
The implementation is in a very early stage of development and does not
correctly handle the initial SP and PC.  Anyway, it works and lets the
main() code stop in a gets() while the interrupt handling changes the GPIO
at a rate of more than 1 million interrupts per second without affecting the
execution and with little impact on the performance!  :)

The interrupt support can be expanded to a more complete threading support,
but it requires some tricks in the hardware and in the software, in order to
populate the different threads with the correct SP and PC.
 
The interrupt handling uses a concept built around threading and, with some
extra effort, it is probably possible to support 4, 8 or even 16 threads.
The drawback in this case is that the register bank increases in size, which
explains why the rv32e is an interesting option for threading: with half the
number of registers, it is possible to store two more threads in the core.
 
Currently, the time to switch the context in the *darkriscv* is two clocks
in the 3-stage pipeline, which matches the pipeline flush itself.  At
100MHz, the maximum empirical number of context switches per second is
around 2.94 million.

NOTE: the interrupt controller is currently working only with the -Os
flag in gcc!
 
About the new MAC instruction, it is implemented in a very preliminary way
with the OPCODE 7'b1111111 (this works, but it is a very bad decision!).  I
am checking the possibility of using the p.mac instruction, but at this
time the instruction is hand-encoded in the mac() function available in
stdio.c (i.e. the darkriscv libc).  Details about how to include new
instructions and make them work with GCC can be found in reference [5].
 
The preliminary tests pointed out, as expected, that the performance
decreases to 90MHz.  Although it was possible to run at 100MHz with a
non-zero timing score and reach a peak performance of 100MMAC/s, the small
32-bit accumulator saturates too fast and requires extra tricks in order to
avoid overflows.
 
The mul operation uses two 16-bit integers and the result is added to a
separate 32-bit register, which works as the accumulator.  As the operation
is always signed and the sign always uses the MSB, this means that the
15x15 mul produces a 30-bit result which is added to a 31-bit value, which
means that the overflow is reached after only two MAC operations.
 
In order to avoid overflows, it is possible to shift the input operands.
For example, in the case of G711 w/ u-law encoding, the effective resolution
is 14 bits (13 bits for the integer and 1 bit for the sign), which means
that a 13x13 bit mul will be used and a 26-bit result produced to be added
to a 31-bit integer, enough to run 32 MAC operations before overflow (in
this case, when the ACC reaches a negative value):

    # awk 'BEGIN { ACC=2**31-1; A=2**13-1; B=-A; for(i=0;ACC>=0;i++) print i,A,B,A*B,ACC+=A*B }'

    1 8191 -8191 -67092481 2013298685
    2 8191 -8191 -67092481 1946206204
    ...
    30 8191 -8191 -67092481 67616736
    31 8191 -8191 -67092481 524255
    32 8191 -8191 -67092481 -66568226

Is this theory correct? I am not sure, but it looks good! :)
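
The same worst-case experiment can be repeated for other operand widths
with a small Python model.  It mirrors the awk one-liner above (the
accumulator starts at 2^31-1 and every MAC adds the most negative product),
so it is a model of the arithmetic only; the function name is hypothetical
and nothing here comes from the mac() implementation itself.

```python
def macs_before_overflow(operand_bits, acc_bits=31):
    """Count worst-case MAC operations that keep the accumulator >= 0.

    The accumulator starts at its maximum positive value and every
    operation adds A * (-A), with A the largest `operand_bits`-bit value.
    """
    a = (1 << operand_bits) - 1
    product = a * -a
    acc = (1 << acc_bits) - 1
    count = 0
    while acc + product >= 0:
        acc += product
        count += 1
    return count


for bits in (15, 13):
    print(f"{bits}x{bits} bit operands: {macs_before_overflow(bits)} MACs")
```

With 13-bit operands the model gives 32 operations, and with the full
15-bit operands only 2, which agrees with the text above.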
 
As a complement, I included in stdio.c the support for the GCC functions
regarding the native *, / and % (mul, div and mod) operations with 32-bit
signed and unsigned integers, which means true 32x32 bit operations
producing 32-bit results.  The code was derived from an old 68000-related
project (as most of the code in stdio.c) and, although it is not so fast, I
guess it is working.  Once the MAC instruction is better defined in syntax
and features, I think it is possible to optimize the mul/div/mod in order
to try to use it and increase the performance.
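
For readers who never had to write such helpers, a software 32x32 multiply
on a core without a hardware multiplier is usually a shift-and-add loop.
The Python sketch below only illustrates that technique with 32-bit
wrap-around; it is not the code from stdio.c, and the function names are
hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32(a, b):
    """Shift-and-add multiply, keeping only the low 32 bits of the result."""
    a &= MASK32
    b &= MASK32
    result = 0
    while b:
        if b & 1:                       # add the shifted multiplicand
            result = (result + a) & MASK32
        a = (a << 1) & MASK32           # next bit weighs twice as much
        b >>= 1
    return result


def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


print(mul32(6, 7), to_signed32(mul32(-3, 5)))
```

Because only the low 32 bits are kept, the same loop serves both signed and
unsigned operands, as long as the result is reinterpreted accordingly.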
 
Here are some additional performance results (synthesis only, 3-stage
version) for other Xilinx devices available in ISE, for speed grade 2:

- Spartan-6:    100MHz (measured 70MIPS w/ gcc -O1)
- Artix-7:      178MHz
- Kintex-7:     225MHz

For speed grade 3:

- Spartan-6:    117MHz
- Artix-7:      202MHz
- Kintex-7:     266MHz

The Kintex-7 can reach, theoretically, 186MIPS w/ gcc -O1.
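
The 186MIPS figure follows from scaling the measured Spartan-6 result by the
clock.  A short Python sketch of that estimate is shown below; it assumes
that the ratio of 70 MIPS per 100MHz measured with gcc -O1 stays the same on
the other devices, which is an assumption and not a measurement.

```python
# Measured on the Spartan-6: 70 MIPS at 100 MHz with gcc -O1.
MIPS_PER_MHZ = 70 / 100

# Synthesis-only clocks for speed grade 3, taken from the list above.
SPEED_GRADE_3_MHZ = {"Spartan-6": 117, "Artix-7": 202, "Kintex-7": 266}


def estimated_mips(clock_mhz, mips_per_mhz=MIPS_PER_MHZ):
    """Scale the measured MIPS/MHz ratio to another clock frequency."""
    return clock_mhz * mips_per_mhz


for device, mhz in SPEED_GRADE_3_MHZ.items():
    print(f"{device}: {mhz}MHz -> ~{estimated_mips(mhz):.0f}MIPS (estimate)")
```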
 
This performance is reached w/o the MAC and THREADING activated.  Thanks to
the RV32E option, the synthesis for the Spartan-3E is now possible,
resulting in 95% of LUT occupation in the case of the low-cost 100E model
and a 70MHz clock (synthesis only and speed grade 5):

- Spartan-3E:   70MHz

For the 2-stage version and speed grade 2, we have less impact from the
pipeline flush (20%), no impact in the load and some impact in the clock due
to the use of a 2-phase clock:

- Spartan-6:    56MHz (measured 47MIPS w/ -O1)
 
About the compiler performance, from boot until the prompt, tested w/ the
3-stage pipeline core at 100MHz and no interrupts, with rom and ram measured
in 32-bit words:

- gcc w/ -O3: t=289us rom=876 ram=211
- gcc w/ -O2: t=291us rom=799 ram=211
- gcc w/ -O1: t=324us rom=660 ram=211
- gcc w/ -O0: t=569us rom=886 ram=211
- gcc w/ -Os: t=398us rom=555 ram=211

Due to the reduced ROM space in the FPGA, -Os is the default option.
 
On the other hand, regarding the support for Vivado, it is possible to
convert the Artix-7 project (Xilinx AC701, available in the ise/boards
directory) to Vivado and make some interesting tests.  The only problem in
the conversion is that the UCF file is not converted, which means that a new
XDC file with the pin description must be created.

Vivado is very slow compared to ISE and needs *lots of time* to synthesize
and give minimal feedback about the performance...  but after some weeks of
waiting, and lots of empirical calculations, I got some numbers for speed
grade 2 devices:

- Artix-7:      147MHz
- Spartan-7:    146MHz

And one number for speed grade 3 devices:

- Kintex-7:     221MHz

Although Vivado is far slower and shows pessimistic numbers for the same
FPGAs when compared with ISE, I guess Vivado is more realistic and, at
least, it supports the new Spartan-7, which shows very good numbers (almost
the same as the Artix-7!).
 
Those values are only for reference.  The real values depend on some options
in the core, such as the number of pipeline stages, how the memories are
connected, etc.  Basically, the best clock is reached by the 3-stage
pipeline version (up to 100MHz in a Spartan-6), but it requires at least 1
wait state in the load instruction and 2 extra clocks in the taken branches
in order to flush the pipeline.  The 2-stage pipeline requires no extra wait
states and only 1 extra clock in the taken branches, but runs with less
performance (56MHz).

Well, my conclusion after some years of research is that the branch
prediction solves lots of problems regarding the performance.
 
## Development Tools

About the gcc compiler, I am working with the experimental gcc 9.0.0 for
RISC-V.  No patches or updates are required for the *DarkRISCV* other than
-march=rv32i.  Although the fence*, e* and csr* instructions are not
implemented in the core, gcc appears not to use those instructions.

Although it is possible to use the compiler set available on the official
RISC-V site, our colleagues from the *lowRISC* project pointed out a more
clever way to build the toolchain:

https://www.lowrisc.org/blog/2017/09/building-upstream-risc-v-gccbinutilsnewlib-the-quick-and-dirty-way/

Basically:
 
        git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
        git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
        git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
        mkdir combined
        cd combined
        ln -s ../newlib-cygwin/* .
        ln -sf ../binutils-gdb/* .
        ln -sf ../gcc/* .
        mkdir build
        cd build
        ../configure --target=riscv32-unknown-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32ima --with-abi=ilp32 --prefix=/usr/local/share/gcc-riscv32-unknown-elf
        make -j4
        make
        make install
        export PATH=$PATH:/usr/local/share/gcc-riscv32-unknown-elf/bin/
        riscv32-unknown-elf-gcc -v

and everything will magically work! (:
 
In case you have no success building the compiler, have no interest in
changing the firmware or are just curious about the darkriscv running in an
FPGA, the project includes the compiled ROM and RAM, so it is possible to
examine all derived objects, sources and correlated files generated by the
compiler without needing to compile anything.

Finally, as the *DarkRISCV* is not yet fully tested, sometimes it is a
very good idea to compare the code execution with another stable reference!

In this case, I am working with the project *picorv32*:

https://github.com/cliffordwolf/picorv32
 
When I have some time, I will try to create better organized support in
order to easily test both the *DarkRISCV* and *picorv32* in the same cache,
memory and IO sub-systems, making it possible to select the core according
to the desired features, for example, use the *DarkRISCV* for more
performance or *picorv32* for more features.
 
About the software, the most complex issue is to make the memory design
match the linker layout.  Of course, it is a gcc issue and it is not even a
problem; in fact, it is the way that the software guys work when linking the
code and data!

In the most simplified version, directly connected to blockRAMs, the
*DarkRISCV* is a pure Harvard architecture processor and requires the
separation between the instruction and data blocks!

When the cache controller is activated, it provides separate memories for
instruction and data, but provides an interface for a more conventional
von Neumann memory architecture.

In both cases, a properly designed linker script (darksocv.ld) probably
solves the problem!
 
The current memory map in the linker script is the following:

- 0x00000000: 4KB ROM
- 0x00001000: 4KB RAM

Also, the linker maps the IO in the following positions:

- 0x80000000: UART status
- 0x80000004: UART xmit/recv buffer
- 0x80000008: LED buffer

The RAM memory contains the .data area, the .bss area (after the .data
and initialized with zero), the .rodata and the stack area at the end of RAM.
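
As a quick illustration, the memory map above can be written down as a
small address decoder.  The Python sketch below uses only the addresses and
sizes listed above; the constant and function names are hypothetical and
this is not how the SoC or the linker script expresses the map.

```python
ROM_BASE, ROM_SIZE = 0x00000000, 4 * 1024
RAM_BASE, RAM_SIZE = 0x00001000, 4 * 1024

IO_REGISTERS = {
    0x80000000: "UART status",
    0x80000004: "UART xmit/recv buffer",
    0x80000008: "LED buffer",
}


def decode(address):
    """Return the name of the region or IO register at `address`."""
    if ROM_BASE <= address < ROM_BASE + ROM_SIZE:
        return "ROM"
    if RAM_BASE <= address < RAM_BASE + RAM_SIZE:
        return "RAM"
    return IO_REGISTERS.get(address, "unmapped")


for addr in (0x00000000, 0x00001FFC, 0x80000004, 0x00002000):
    print(f"0x{addr:08x}: {decode(addr)}")
```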
 
Although the RISCV is defined as little-endian, it appears to be easy to
change the configuration in GCC.  In this case, it is supposed that all
variables are stored in the big-endian format.  Of course, the change
requires a similar change in the core itself, which is not so complex, as
it affects only the load and store instructions.  In the future, I will
try to test a big-endian version of GCC and darkriscv, in order to evaluate
possible performance enhancements in the case of network oriented
applications! :)

Finally, the last update regarding the software included a new option to
build an x86 version, in order to help the development by testing exactly
the same firmware on x86.
 
In a preliminary way, it is possible to build gcc for RV32E with the following configuration:

    git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
    git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
    git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
    mkdir combined
    cd combined
    ln -s ../newlib-cygwin/* .
    ln -sf ../binutils-gdb/* .
    ln -sf ../gcc/* .
    mkdir build
    cd build
    ../configure --target=riscv32-embedded-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32e --with-abi=ilp32e --prefix=/usr/local/share/gcc-riscv32-embedded-elf
    make -j4
    make
    make install
    export PATH=$PATH:/usr/local/share/gcc-riscv32-embedded-elf/bin/
    riscv32-embedded-elf-gcc -v
 
Currently, I found no easy way to make GCC build big-endian code for
RISCV.  Instead, the easy way is to make the endian switch directly in the
IO device or in the memory region.
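
The endian switch itself is just a byte swap on the data path.  The Python
sketch below shows the operation for 32-bit and 16-bit values; the function
names are hypothetical and this is a model of the swap only, not the logic
used in any IO device of the project.

```python
def swap32(word):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def swap16(half):
    """Reverse the byte order of a 16-bit halfword."""
    return int.from_bytes((half & 0xFFFF).to_bytes(2, "little"), "big")


print(hex(swap32(0x11223344)), hex(swap16(0x1234)))
```

Applying the swap twice returns the original value, so the same logic can
be used in both the read and the write direction.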
 
As it is not so easy to build GCC on some machines, I left in a public
share the source and the pre-compiled binary set of GCC tools for RV32E:

https://drive.google.com/drive/folders/1GYkqDg5JBVeocUIG2ljguNUNX0TZ-ic6?usp=sharing

As far as I remember, it was compiled on Slackware Linux or something like
that; anyway, it worked fine in Windows 10 w/ WSL and in other Linux-like
environments.
 
## Development Boards

Currently, the following boards are supported:

- Avnet Microboard LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
- Xilinx AC701 A200: equipped with a Xilinx Artix-7 A200 running at 90MHz
- QMTech SDRAM LX16: equipped with a Xilinx Spartan-6 LX16 running at 100MHz
- QMTech NORAM S15: equipped with a Xilinx Spartan-7 S15 running at 100MHz
- Lattice Brevia2 XP2: equipped with a Lattice XP2-6 running at 50MHz
- Piswords RS485 LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
- Digilent S3 Starter Board: equipped with a Xilinx Spartan-3 S200 running at 50MHz
 
The speeds are related to the clocks available on the boards, and different
clocks may be generated by programming a clock generator.  The Spartan-6 is
found on most boards and the core runs fine at ~100MHz, regardless of the
frequency of the main oscillator (typically 50MHz).

All Xilinx based boards typically support a 115200 bps UART for console,
some LEDs for debug and on-chip 4KB ROM and 4KB RAM (as well as the RESET
button to restart the core and the DEBUG signals for an oscilloscope).

In the case of the QMTech boards, which include neither the JTAG nor the
UART/USB port, an external USB/UART converter and a low-cost JTAG adapter
can solve the problem easily!

The Lattice Brevia is clocked by the on-board 50MHz oscillator, with the
UART operating at 115200bps and the LED and DEBUG ports wired to the
on-board LEDs.

The Digilent Spartan-3 Starter Board is a very useful board to work as a
reference for LUT4 technology, making it possible to improve the support in
the future for alternative low-cost LUT4 FPGAs.
 
On the software side, a small shell is available with some basic commands:

- clear: clears the display
- dump : dumps an area of the RAM
- led : changes the LED register (which turns on/off the LEDs)
- timer : changes the timer prescaler, which affects the interrupt rate
- gpio : changes the GPIO register (which changes the DEBUG lines)

The purpose of the shell is to provide some basic test features which can
give a go/no-go status about the current hardware.
 
Useful memory areas:

- 4096: the start of RAM (data)
- 4608: the start of RAM (data)
- 5120: empty area
- 5632: empty area
- 6144: empty area
- 6656: empty area
- 7168: empty area
- 7680: the end of RAM (stack)
 
As the *DarkRISCV* uses separate instruction and data buses, it is not
possible to dump the ROM area.  However, this limitation is not present when
the option __HARVARD__ is activated, as the core is constructed in a way
that the ROM bus is connected to one port of a dual-ported memory and the
RAM bus is connected to a different port of the same dual-ported memory.
From the *DarkRISCV* point of view, they are fully separated and independent
buses, but in reality they are in the same memory area, which makes it
possible for the data bus to change the area where the code is stored.  With
this feature, it will be possible in the future to create loadable code from
the FLASH memory! :)
 
## Creating a RISCV from scratch

I found that some people are very skeptical about the possibility of
designing a RISC-V processor in one night.  Of course, it is not as easy
as it appears and, in fact, it requires a lot of experience, planning and
luck.  Also, the fact that the processor correctly runs a few instructions
and puts some garbage on the serial port does not really mean that the
design is perfect; instead, you will need lots and lots of debug time
in order to fix all hidden problems.

Just in case, I found a set of online videos from my friend (Lucas Teske)
that shows the design of a RISC-V processor from scratch:
 
- https://www.twitch.tv/videos/840983740 Register bank (4h50)
- https://www.twitch.tv/videos/845651672 Program counter and ALU (3h49)
- https://www.twitch.tv/videos/846763347 ALU tests, CPU top level (3h47)
- https://www.twitch.tv/videos/848921415 Computer problems and microcode planning (08h19)
- https://www.twitch.tv/videos/850859857 instruction decode and execute - part 1/3 (08h56)
- https://www.twitch.tv/videos/852082786 instruction decode and execute - part 2/3 (10h56)
- https://www.twitch.tv/videos/858055433 instruction decode and execute - part 3/3 - SoC simulation (10h24)
- TBD tests in the Lattice FPGA
 
Unfortunately, the video set is currently in Portuguese only and there are a
lot of parallel discussions about technology, including the fix of Teske's
notebook online!  I hope that in the future it will be possible to edit the
video set and, maybe, create English subtitles.

About the processor itself, it is a microcode oriented concept with a classic
von Neumann architecture, designed to more easily support different ISAs.  It
is really very different from the traditional RISC cores that we find around!
Also, it includes a very good eco-system around opensource tools, such as
Icarus, Yosys and gtkWave!
 
Although not finished yet (95% done!), I think it is very illustrative of the RISC-V design:

- rv32e instruction set: very reduced (37) and very orthogonal bit patterns (6)
- rv32e register set: 16x32-bit register bank and a 32-bit program counter
- rv32e ALU with basic operations for reg/imm and reg/reg instructions
- rv32e instruction decode: very simple to understand, very direct to implement
- rv32e software support: the GCC support provides an easy way to generate code and test it!

Teske's proposal is not to design the fastest RISC-V core ever (we already
have lots of faster cores with CPI ~ 1, such as the darkriscv, vexriscv, etc),
but to create a clean, reliable and comprehensible RISC-V core.

You can check the code in the following repository:

- https://github.com/racerxdl/riskow
 
## Acknowledgments

Special thanks to my old colleagues from the Verilog/VHDL/IT area:

- Paulo Matias (jedi master and verilog/bluespec/riscv guru)
- Paulo Bernard (co-worker and verilog guru)
- Evandro Hauenstein (co-worker and git guru)
- Lucas Mendes (technology guru)
- Marcelo Toledo (technology guru)
- Fabiano Silos (technology guru)

Also, special thanks to the "friends of darkriscv" that found the project on
the internet and contributed in any way to make it better:

- Guilherme Barile (technology guru and first guy to post anything about the darkriscv! [2])
- Alasdair Allan (technology guru, posted an article about the darkriscv [3])
- Gareth Halfacree (technology guru, posted an article about the DarkRISCV [4])
- Ivan Vasilev (ported DarkRISCV to the Lattice Brevia XP2!)
- timdudu from github (fix in the LDATA and found a bug in the BCC instruction)
- hyf6661669 from github (lots of contributions, including the fixes regarding the AUIPC and S{B,W,L} instructions, ModelSIM simulation, the memory byte select used by store/load instructions and much more!)
- zmeiresearch from github (support for the Lattice XP2 Brevia board)
- All other colleagues from github that contributed with fixes, corrections and suggestions.

Finally, thanks to all the people who directly and indirectly contributed to
this project, including the company I work for and all colleagues that
tested the *DarkRISCV*.
 
## References

        [1] https://www.amazon.com/RISC-V-Reader-Open-Architecture-Atlas/dp/099924910X
        [2] https://news.ycombinator.com/item?id=17852876
        [3] https://blog.hackster.io/the-rise-of-the-dark-risc-v-ddb49764f392
        [4] https://abopen.com/news/darkriscv-an-overnight-bsd-licensed-risc-v-implementation/
        [5] http://quasilyte.dev/blog/post/riscv32-custom-instruction-and-its-simulation/
        [6] https://github.com/riscv/riscv-pk/blob/master/bbl/riscv_logo.txt
