# DarkRISCV
Opensource RISC-V implemented from scratch in one night!

## Table of Contents

- [Introduction](#introduction)
- [History](#history)
- [Project Background](#project-background)
- [Directory Description](#directory-description)
- ["src" Directory](#src-directory)
- ["sim" Directory](#sim-directory)
- ["rtl" Directory](#rtl-directory)
- ["board" Directory](#board-directory)
- [Implementation Notes](#implementation-notes)
- [Development Tools](#development-tools)
- [Development Boards](#development-boards)
- [Creating a RISCV from scratch](#creating-a-riscv-from-scratch)
- [Acknowledgments](#acknowledgments)
- [References](#references)

## Introduction

Developed in a magic night of 19 Aug, 2018 between 2am and 8am, the
*DarkRISCV* softcore started as a proof of concept for the open-source
RISC-V instruction set.

Although the code is small and crude when compared with other RISC-V
implementations, the *DarkRISCV* has lots of impressive features:

- implements most of the RISC-V RV32E instruction set
- implements most of the RISC-V RV32I instruction set (missing csr*, e* and fence*)
- works up to 220MHz in a Kintex-7 and up to 100MHz in a cheap Spartan-6
- can sustain 1 clock per instruction most of the time
- flexible Harvard architecture (easy to integrate a cache controller)
- works fine in real Xilinx, Altera and Lattice FPGAs
- works fine with gcc 9.0.0 for RISC-V (no patches required!)
- uses between 1000 and 1500 LUTs (core only with LUT6 technology, depending on the enabled features)
- optional RV32E support (works better with LUT4 FPGAs)
- optional 16x16-bit MAC instruction (for digital signal processing)
- optional coarse-grained multi-threading (MT)
- no interlock between pipeline stages!
- BSD license: can be used anywhere with no restrictions!

Some extra features are planned for the future or under development:

- interrupt controller (under tests)
- cache controller (under tests)
- gpio and timer (under tests)
- sdram controller w/ data scrambler
- branch predictor (under tests)
- ethernet controller (GbE)
- multi-processing (SMP)
- network on chip (NoC)
- rv64i support (not so easy as it appears...)
- dynamic bus sizing and big-endian support
- user/supervisor modes
- debug support
- misaligned memory access
- bridge for 8/16/32-bit buses

And many other features!

Feel free to make suggestions and good hacking! o/

## History

The initial concept was based on my earlier 16-bit RISC processors and
consisted of a simplified two-stage pipeline, where an instruction is fetched
from the instruction memory in the first clock and then decoded and executed
in the second clock.  The pipeline is overlapped without interlocks, so the
*DarkRISCV* can reach one clock per instruction most of the time, except for
a taken branch, where one clock is lost in the pipeline flush.  Of course, in
order to perform read operations in a single clock, either a single-phase
clock with combinational memory or a two-phase clock with blockram memory is
required, so that no wait states are needed in those cases.

As a result, the code is very compact, with around three hundred lines of
obfuscated but beautiful Verilog code.  After lots of exciting sleepless
nights of work and the help of lots of colleagues, the *DarkRISCV* reached a
very good level of quality, to the point that code compiled by the standard
GCC for RV32I worked fine.

After two years of development, a three-stage pipeline working
with a single clock phase is also available, resulting in a better
distribution of the logic between the decode and execute stages.  In this
case the instruction is fetched from a blockram in the first clock, decoded
in the second clock and executed in the third clock.

Since the load instruction cannot load the data from a blockram in a
single clock, the external logic inserts one extra clock in IO operations.
Also, there are two extra clocks in order to flush the pipeline in the case
of taken branches.  The impact of the pipeline flush depends on the compiler
optimizations, but according to the latest measurements, the 3-stage
pipeline version reaches 0.7 instructions per clock (IPC), lower
than the measured IPC of 0.85 for the 2-stage pipeline version.

Anyway, with the 3-stage pipeline and some other expensive optimizations,
the *DarkRISCV* can reach up to 100MHz in a low-cost Spartan-6, which results in
more performance when compared with the 2-stage pipeline version (typically
50MHz).

## Project Background

The main motivation for the *DarkRISCV* was to create a migration path for
some projects around the 680x0/Coldfire family.

Although there are lots of 680x0 cores available, they are designed around
different concepts and requirements, so I found few options that matched
my requirements (more than 50MIPS with around 1000LUTs).  The best
option at this moment, the TG68, requires at least 2400LUTs (after removing
the MUL/DIV instructions) and works up to 40MHz in a Spartan-6.  In addition,
the TG68 core requires at least 2 clocks per instruction, which means a peak
performance of 20MIPS.  Since the 680x0 instruction set is very complex, this
result is really not bad at all and, at this moment, it is probably the best
open-source option to replace the 68000.

Anyway, it does not match my requirements regarding space and
performance.  As part of the investigation, I tested other cores, but I
found few options as good as the TG68, and I even started to design a
risclized-68000 core in order to try to find a solution.

Unfortunately, due to compiler requirements (standard GCC), I found few
ways to reduce the space and increase the performance, so I
started to investigate non-680x0 cores.

After lots of tests with different cores, I found the *picorv32* core and
the whole ecosystem around RISC-V.  The *picorv32* is a very nice
project and can peak at up to 150MHz in a low-cost Spartan-6.  Most
instructions require 3 or 4 clocks, so the *picorv32*
resembles the 68020 in some ways, but running at 150MHz and providing a peak
performance of 50MIPS, which is very impressive.

Although the *picorv32* is a very good option to directly replace the 680x0
family, it is not powerful enough to replace some Coldfire processors (more
than 75MIPS).

Since I had some good experience with experimental 16-bit RISC cores for
DSP-like applications, I started to code the *DarkRISCV* only to check the
level of complexity and compare it with my risclized-68000.  To my surprise,
in the first night I mapped almost all instructions of the rv32i
specification and the *DarkRISCV* started to execute the first instructions
correctly at 75MHz and with one clock per instruction, which not only
resembles a fast and nice 68040, but also can beat some Coldfires!  wow!  :)

After the success of the first night of work, I started to work on
fixing small details in the hardware and software implementation.

## Directory Description

Although the *DarkRISCV* is only a small processor core, a small ecosystem
is required in order to test the core, including RISC-V compatible software,
support for simulations and support for peripherals, so that the
processor core produces observable results. Each element is stored with
similar elements in a directory, so that the top level has the
following organization:

- [README.md](README.md): the top level README file (points to this document)
- [LICENSE](LICENSE): unlimited freedom! o/
- [Makefile](Makefile): the show starts here!
- [src](src): the source code for the test firmware (boot.c, main.c etc. in C language)
- [rtl](rtl): the source code for the *DarkRISCV* core and the support logic (Verilog)
- [sim](sim): the source code for the simulation to test the rtl files (currently via icarus)
- [board](board): support and examples for different boards (currently via Xilinx ISE)
- [tmp](tmp): empty, but the ISE will create lots of files here


Setup Instructions:

Step 1: Clone the DarkRISCV repository to your local machine using the command below.

        git clone https://github.com/darklife/darkriscv.git

Pre Setup Guide for MacOS:

This guide covers all the dependencies, and the steps to install them,
needed to use the DarkRISCV ecosystem on MacOS.

The ecosystem cannot be used natively on MacOS because one of the
dependencies, the Xilinx ISE 14.7 Design Suite, currently does not support
MacOS.

In order to overcome this issue, we need to run Linux or Windows on MacOS
using one of the two methods below:

a) WineSkin, which is a kind of Windows emulator that runs the Windows
application natively but intercepts the Windows calls and maps them
directly to macOS.

b) VirtualBox (or VMware, Parallels, etc.) in order to run a complete Windows
or Linux OS, which appears to be far better than the WineSkin option.

I used the second method and installed VMware Fusion to run Linux Mint.
Please find below the links I used to obtain the download files.

Dependencies:

1.  Icarus Verilog
    a.  Bison
    b.  GNU gcc
    c.  G++
    d.  FLEX

2.  Xilinx 14.7 ISE


Icarus Verilog Setup:

The steps have been condensed for the Linux operating system.  Complete steps
for all other OS platforms are available at
https://iverilog.fandom.com/wiki/Installation_Guide.

Step 1: Download the Icarus Verilog tar file from
ftp://ftp.icarus.com/pub/eda/verilog/ .  Always install the latest version.
Verilog-10.3 is the latest version as of now.

Step 2: Extract the tar file using `tar -zxvf verilog-version.tar.gz`.

Step 3: Go to the Verilog folder using `cd verilog-version`.  Here it is
`cd verilog-10.3`.

Step 4: Check if you have the following packages installed: flex, bison,
g++ and gcc.  If not, use `sudo apt-get install flex bison g++ gcc` in a
terminal to install them.  Restart the system once for the changes to take effect.

Step 5: Run the commands below in the verilog-10.3 directory:

1.  ./configure
2.  make
3.  sudo make install

Step 6: Use `sudo apt-get install verilog` to install Verilog.

Optional Step: sudo apt-get install gtkwave

Xilinx Setup:

Follow the YouTube video below for the complete installation.

https://www.youtube.com/watch?v=meO-b6Ib17Y

Note: Make sure you have the libncurses libraries installed in Linux.

If not, use the commands below:

1.  For 64-bit architecture
    a.  sudo apt-get install libncurses5 libncursesw-dev
2.  For 32-bit architecture
    a.  sudo apt-get install libncurses5:i386

Once all prerequisites are installed, go to the root directory of the
repository and run the commands below:

        cd darkriscv
        make (use sudo if required)


The top level *Makefile* is responsible for building everything, but it must
be edited first: the user must at least select the compiler
path and the target board.

By default, the top level *Makefile* uses:

        CROSS = riscv32-embedded-elf
        CCPATH = /usr/local/share/gcc-$(CROSS)/bin/
        ICARUS = /usr/local/bin/iverilog
        BOARD  = avnet_microboard_lx9

Just update the configuration according to your system, type
*make* and hope everything is in the correct location!  You will probably
need to fix some paths and set some others in the PATH environment variable,
but it will eventually work.

And, when everything is correctly configured, the result will be something
like this:

```
# make
make -C src all             CROSS=riscv32-embedded-elf CCPATH=/usr/local/share/gcc-riscv32-embedded-elf/bin/ ARCH=rv32e HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S boot.c -o boot.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c boot.s -o boot.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S stdio.c -o stdio.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c stdio.s -o stdio.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S main.c -o main.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c main.s -o main.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S io.c -o io.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c io.s -o io.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S banner.c -o banner.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c banner.s -o banner.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-cpp -P  -DHARVARD=1 darksocv.ld.src darksocv.ld
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld -Tdarksocv.ld -Map=darksocv.map -m elf32lriscv  boot.o stdio.o main.o io.o banner.o -o darksocv.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld: warning: section `.data' type changed to PROGBITS
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objdump -d darksocv.o > darksocv.lst
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.text --only-section .text*
hexdump -ve '1/4 "%08x\n"' darksocv.text > darksocv.rom.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.rom.mem
rm darksocv.text
wc -l darksocv.rom.mem
1016 darksocv.rom.mem
echo rom ok.
rom ok.
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.data --only-section .*data*
hexdump -ve '1/4 "%08x\n"' darksocv.data > darksocv.ram.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.ram.mem
rm darksocv.data
wc -l darksocv.ram.mem
317 darksocv.ram.mem
echo ram ok.
ram ok.
echo sources ok.
sources ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
make -C sim all             ICARUS=/usr/local/bin/iverilog HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
/usr/local/bin/iverilog -I ../rtl -o darksocv darksimv.v ../rtl/darksocv.v ../rtl/darkuart.v ../rtl/darkriscv.v
./darksocv
WARNING: ../rtl/darksocv.v:280: $readmemh(../src/darksocv.rom.mem): Not enough words in the file for the requested range [0:1023].
WARNING: ../rtl/darksocv.v:281: $readmemh(../src/darksocv.ram.mem): Not enough words in the file for the requested range [0:1023].
VCD info: dumpfile darksocv.vcd opened for output.
reset (startup)

              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv
rr                vvvvvvvvvvvvvvvvvvvvvv
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: simulation only (id=0)
build: darkriscv fw build Sat, 30 May 2020 00:55:21 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
> no UART input, finishing simulation...
echo simulation ok.
simulation ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
make -C boards all          BOARD=piswords_rs485_lx9 HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/boards'
cd ../tmp && xst -intstyle ise -ifn ../boards/piswords_rs485_lx9/darksocv.xst -ofn ../tmp/darksocv.syr
Reading design: ../boards/piswords_rs485_lx9/darksocv.prj

*** lots of weird FPGA related messages here ***

cd ../tmp && bitgen -intstyle ise -f ../boards/avnet_microboard_lx9/darksocv.ut ../tmp/darksocv.ncd
echo done.
done.
```

This means that the software compiled and linked correctly, the simulation
worked correctly and the FPGA build produced an image that can be loaded into
your FPGA board with a *make install* (in case you have an FPGA board and, of
course, a JTAG support script in the board directory).

If the FPGA is correctly programmed and the UART is attached to a terminal
emulator, the FPGA will be configured with the DarkRISCV, which will run the
test software and produce the following result:

```
              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv
rr                vvvvvvvvvvvvvvvvvvvvvv
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: piswords rs485 lx9 (id=6)
build: darkriscv fw build Fri, 29 May 2020 23:56:39 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
>
```

The beautiful ASCII RISCV logo was produced by Andrew Waterman! [6]

As long as the build works, it is possible to start making changes, but my
recommendation when working with soft processors is to *not work* on the
hardware and software *at the same time*!  This means that it is better to freeze
the hardware and work only on the software *or* freeze the software and
work only on the hardware.  It is perfectly possible to do your research on
both, but not at the same time, otherwise you may find the *DarkRISCV* in a
non-working state after software and hardware changes and you will not be
sure where the problem is.

### "src" Directory

The *src* directory contains the source code for the test firmware, which
includes the boot code, the main process and auxiliary libraries. The code is
compiled via *gcc*, and some auxiliary files are produced along the way,
for example:

- boot.c: the original C code for the boot process
- boot.s: the assembler version of the C code, generated automatically by gcc
- boot.o: the compiled version of the C code, generated automatically by gcc

When all .o files are produced, they are linked into the *darksocv.o* ELF
file, from which the binary code and data sections are extracted, converted to
hexadecimal and stored in separate ROM and RAM files (*darksocv.rom.mem* and
*darksocv.ram.mem*, which are loaded by the Verilog code into the blockRAMs).
The build also produces *darksocv.lst*, with a complete listing of the
generated code, and *darksocv.map*, which shows the
map of all functions and variables in the produced code.

The firmware concept is very simple:

- boot.c contains the boot code
- main.c contains the main application code (shell)
- banner.c contains the riscv banner
- stdio.c contains a small version of stdio
- io.c contains the IO interfaces

Extra code can be easily added to the build by editing the *src/Makefile*.

For example, in order to add Lempel-Ziv code in *lz.c*, the
Makefile must know that we need *lz.s* and *lz.o*:

        OBJS = boot.o stdio.o main.o io.o banner.o lz.o
        ASMS = boot.s stdio.s main.s io.s banner.s lz.s
        SRCS = boot.c stdio.c main.c io.c banner.c lz.c

Then add an "lz" command to *main.c*, so that it is possible to call
the function from the prompt. Alternatively, it is possible to replace
the provided firmware entirely with your own.

### "sim" Directory

The simulation, on the other hand, shows some waveforms, making it possible to
check the *DarkRISCV* operation when running the example code.

The main simulation tool for *DarkRISCV* is iSIM from Xilinx ISE 14.7,
but the Icarus simulator is also supported via the Makefile in the *sim*
directory (the changes regarding Icarus are active when the symbol
`__ICARUS__` is detected). I also included a workaround for ModelSim, as
pointed out by our friend HYF (the changes regarding ModelSim are active when the
symbol `MODEL_TECH` is detected).

The simulation runs the same firmware as the real FPGA but, in order to
improve the simulation performance, the UART code is not simulated, since
115200 bps requires lots of dead simulation time.

### "rtl" Directory

The *rtl* directory contains the *DarkRISCV* core and some auxiliary files,
such as the DarkSoCV (a small system-on-chip with ROM, RAM and IO),
the DarkUART (a small UART for debug) and the configuration file, where it is
possible to enable and disable some features that are described in the
Implementation Notes section.

### "board" Directory

The currently supported boards are:

- avnet_microboard_lx9
- lattice_brevia2_xp2
- piswords_rs485_lx9
- qmtech_sdram_lx16
- qmtech_spartan7_s15
- xilinx_ac701_a200

The organization is self-explanatory, with the vendor, board and FPGA model
in the name of the directory. Each *board* directory contains the project
files to be opened in Xilinx ISE 14.x, as well as Makefiles to build the
FPGA image for that board model. Although a *ucf* file is provided in
order to generate a complete build with a UART and some LEDs, the FPGA is
NOT fully wired in any particular configuration and you must add the
pins that you will use in your FPGA board.

Anyway, although not wired, the build always gives you a good estimate
of the FPGA utilization and of the timing (because the UART output
ensures that the complete processor must be synthesized).

Since there are many supported boards, there is no way to test all boards
every time, which means that sometimes changes made for one board may
break another board.

## Implementation Notes*

[*This section is kept for reference, but the description may not match
exactly with the current code]

Since my target is the ultra-low-cost Xilinx Spartan-6 family of FPGAs, the
project is currently based on Xilinx ISE 14.7 for Linux, which is the
latest available ISE version.  However, there is no explicit reference to
Xilinx elements and all logic is inferred directly from Verilog, which means
that the project is easily portable to other FPGA families and
to other environments, as can be observed in the case of the Lattice
XP2 support.  Anyway, keep in mind that certain Verilog structures may not
work well in some FPGAs.

In the last update I included a way to test the firmware on the x86 host,
which helps a lot, since it is possible to interact with the firmware and
quickly fix some obvious bugs. Of course, the x86 build does not run the boot.c
code, since it makes no sense (?) to run the RISCV boot code on the x86.

Anyway, as the main recommendation when working with softcores: try never to
work on the hardware and the software at the same time!  Start with the minimum
software configuration possible and freeze the software.  When implementing
new software updates, use the minimum hardware configuration possible and
freeze the hardware.

The RV32I specification itself is really impressive and easy to implement
(see [1], page 16).  Of course, there are some drawbacks, such as the funny
little-endian bus (as opposed to the network-oriented big-endian bus found in
the 680x0 family), but after some empirical tests it is easy to make it work.

The funny thing here is that, after lots of research on adding
big-endian support to the *DarkRISCV*, I found no way to make GCC
generate the code and data correctly.

Another drawback in the specification is the lack of delayed branches.
Although I understand that they are bad from the conceptual point of view,
they are a good trick to extract more performance. As a reference, the
lack of delayed branches or a branch predictor in the *DarkRISCV* may reduce
the performance by 20 to 30%, so that the real measured
performance may be between 1.25 and 1.66 clocks per instruction.

Although branch prediction is not complex to implement, I found the
experimental multi-threading support far more interesting, since it makes it
possible to use the idle time in the branches to swap the processor thread.
Anyway, I will try to debug the branch prediction code in order to improve the
single-thread performance.

The core supports 2 or 3-stage pipelines and, although the main logic is
almost the same, there are huge differences in how they work. Just for
reference, the following section reflects the historic evolution of the
core and may not reflect the current core code.

The original 2-stage pipeline design has a small problem concerning
the ROM and RAM timing: in order to pre-fetch and execute the
instruction in two clocks and keep the pre-fetch continuously working at the
rate of 1 instruction per clock (and the same for the execution), the ROM and
RAM must respond before the next clock.  This means that the memories must
be combinational or, at least, use a 2-phase clock.

The first solution for the 2-stage pipeline version, with a 2-phase clock, is
the default solution and makes the *DarkRISCV* work as a pseudo 4-stage
pipeline:

- 1/2 stage for instruction pre-fetch (rom)
- 1/2 stage for static instruction decode (core)
- 1/2 stage for address generation, register read and data read/write (ram)
- 1/2 stage for data write (register write)

From the processor point of view, there are only 2 stages and from the
memory point of view, there are also 2 stages, but they are in different
clock phases. In normal conditions this is not recommended, because it decreases the
performance by a factor of 2, but in the case of *DarkRISCV* the performance
is always limited by the combinational logic of the instruction
execution.

The second solution for the 2-stage pipeline is to use combinational logic in
order to provide the needed results before the next clock edge, so
that it is possible to use a single-phase clock.  This solution consists of
instruction and data caches: when the operand is stored in a
small LUT-based combinational cache, the processor can perform the memory
operation with no extra wait states.  However, when the operand is not
stored in the cache, extra wait-states are inserted in order to fetch the
operand from a blockram or external memory.  According to some preliminary
tests, the instruction cache with 64 direct-mapped instructions can reach a
hit ratio of 91%.  The data cache, although its performance is not so good (with
a hit ratio of only 68%), will be a requirement in order to access external
memory and reduce the impact of slow SDRAMs and FLASHes.

Unfortunately, the instruction and data caches are not working anymore in
the 2-stage pipeline version and only the instruction cache is working in
the 3-stage pipeline.  The problem is probably related to the HLT signal
and/or to the write byte enable in the cache memory.

Neither the cache nor the 2-phase clock performs well.  For
this reason, a 3-stage pipeline version is provided, in order to use a single
clock phase with blockrams.
611
 
The concept in this case is to separate the pre-fetch and the decode, so
that the pre-fetch can be done entirely on the blockram side of the
instruction bus.  The decode, in a different stage, provides extra
performance, and the execute stage works with one clock almost all the time,
except when a load instruction is executed.  In this case, the external
memory logic inserts one wait-state.  The write operation, however, is
executed in a single clock.
 
The solution with wait-states can be used in the 2-stage pipeline version,
but it decreases the performance too much.  If it were possible to run all
versions with the same clock, the theoretical performance in clocks per
instruction (CPI), the number of clocks to flush the pipeline in a taken
branch (FLUSH) and the memory wait-states (WSMEM) would be:

- 2-stage pipe w/ 2-phase clock: CPI=1, FLUSH=1, WSMEM=0: real CPI=~1.25
- 3-stage pipe w/ 1-phase clock: CPI=1, FLUSH=2, WSMEM=1: real CPI=~1.66
- 2-stage pipe w/ 1-phase clock: CPI=2, FLUSH=1, WSMEM=1: real CPI=~2.00
 
Empirically, the impact of the FLUSH is around 20% in the 2-stage pipeline
and 30% in the 3-stage pipeline.  The real impact depends on the code
itself, of course.  The impact of the wait-states in the memory access for
the load instruction ranges between 5 and 10%, again depending on the code.
 
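The real CPI of the first two entries in the list above is consistent with
a simple model in which the quoted impacts are treated as the fraction of
clocks that are lost.  The Python sketch below is only that reading of the
numbers: the function name and the model itself are assumptions, not
measurements.

```python
def real_cpi(base_cpi, flush_loss, load_loss=0.0):
    """Effective CPI when a fraction of all clocks is lost to flushes and waits."""
    useful = 1.0 - flush_loss - load_loss   # fraction of clocks doing real work
    return base_cpi / useful


print(round(real_cpi(1, 0.20), 2))          # 2-stage pipe w/ 2-phase clock
print(round(real_cpi(1, 0.30, 0.10), 2))    # 3-stage pipe, worst-case load impact
```
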
However, the clock of the 3-stage pipeline is far better than that of the
2-stage pipeline, especially because of the better distribution of the logic
between the decode and execute stages.
 
Currently, the most expensive path in the Spartan-6 is the address bus
on the data side of the core (connected to RAM and peripherals).  The
problem is that the following actions must be done in a single clock:

- generate the DADDR[31:0] = REG[SPTR][31:0]+EXTSIG(IMM[11:0])
- generate the BE[3:0] according to the operand size and DADDR[1:0]
 
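The two steps above can be modeled in a few lines of Python.  This is a
behavioral sketch only: the function names are hypothetical and the
byte-lane assignment assumes the usual little-endian layout, which may
differ in detail from the actual Verilog.

```python
MASK32 = 0xFFFFFFFF


def sign_extend12(imm):
    """EXTSIG(IMM[11:0]): extend a 12-bit immediate to a signed integer."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm


def data_address(base, imm):
    """DADDR[31:0] = REG[SPTR][31:0] + EXTSIG(IMM[11:0]), wrapped to 32 bits."""
    return (base + sign_extend12(imm)) & MASK32


def byte_enable(size, daddr):
    """BE[3:0] for a 1, 2 or 4 byte access (little-endian lanes assumed)."""
    lane = daddr & 3                      # DADDR[1:0]
    if size == 1:
        return 0b0001 << lane             # one byte lane
    if size == 2:
        return 0b0011 << (lane & 2)       # halfwords sit on even lanes
    return 0b1111                         # full 32-bit word
```
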
In the case of a read operation, the DATAI path also includes a small mux
in order to separate the RAM and peripheral buses, as well as to separate
the different peripherals, which means that the path grows as the number
and the complexity of the peripherals increase.
 
Of course, the best performance setup uses a 3-stage pipeline and a
single clock phase (posedge) in the entire logic, so the 2-stage pipeline
and the dual clock phase will be kept only for reference.

The only disadvantages of the 3-stage pipeline are one extra wait-state in
the load operation and the longer pipeline flush of two clocks in the taken
branches.
 
Just for reference, I recorded some details about the performance
measurements:

The current firmware example, running in the 3-stage pipeline version
clocked at 100MHz, reaches a verified performance of 62MIPS.  The
theoretical 100MIPS performance is not reached: 5% is lost due to the extra
wait-state in the load instruction and 32% due to pipeline flushes after
taken branches.  The 2-stage pipeline version, on the other hand, runs at a
verified performance of 79MIPS with the same clock.  The only loss is 20%
due to pipeline flushes after taken branches.
 
Of course, the impact of the pipeline flush also depends on the software,
and the software is currently optimized for size.  When compiled with -O2
instead of -Os, the performance increases to 68MIPS in the 3-stage pipeline
and the loss changes to 6% for load and 25% for the pipeline flush.  The -O3
option resulted in 67MIPS and the best result was the -O1 option, which
produced 70MIPS in the 3-stage version and 85MIPS in the 2-stage version.

For this reason, if performance is a requirement, the src/Makefile must be
changed in order to use the -O1 optimization instead of the -Os default.
 
And although the 2-stage version is 15% faster than the 3-stage version at
the same clock, the 3-stage version can reach better clocks and, in this
way, will provide better performance.
 
Regarding the pipeline flush, it is required after a taken branch, as the
RISC-V does not support delayed branches.  The solution for this problem
is to implement a branch cache (branch predictor), so that the core
populates a cache with the last branches and can predict the future
branches.  In some initial tests, a branch predictor with 4 entries
appears to reach a hit ratio of 60%.
 
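To make the idea more concrete, here is a minimal Python model of a branch
cache that remembers the last outcome of each branch.  The direct-mapped
organization, the "not taken" default and all names are assumptions for
this sketch; the predictor used in the tests mentioned above may be
organized differently.

```python
class BranchCache:
    """Tiny direct-mapped cache holding the last outcome of recent branches."""

    def __init__(self, entries=4):
        self.entries = entries
        self.table = [None] * entries      # each slot: (pc, taken) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries    # low word-address bits pick the slot

    def predict(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]                 # repeat the last outcome of this branch
        return False                       # unknown branch: assume not taken

    def update(self, pc, taken):
        self.table[self._index(pc)] = (pc, taken)


def hit_ratio(trace, entries=4):
    """Fraction of (pc, taken) pairs in the trace that are predicted correctly."""
    cache = BranchCache(entries)
    hits = 0
    for pc, taken in trace:
        if cache.predict(pc) == taken:
            hits += 1
        cache.update(pc, taken)
    return hits / len(trace)
```
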
Another possibility is to use the flush time for other tasks, for example
to handle interrupts.  As interrupt handling and, in a general way,
threading require flushing the current pipeline in order to change context,
matching the interrupt/threading with the pipeline flush makes some sense!

With the option __THREADING__ it is possible to test this feature.
 
The implementation is in a very early stage of development and does not
correctly handle the initial SP and PC.  Anyway, it works and lets the
main() code stop in a gets() while the interrupt handling changes the GPIO
at a rate of more than 1 million interrupts per second without affecting the
execution and with little impact on the performance!  :)

The interrupt support can be expanded to a more complete threading support,
but this requires some tricks in the hardware and in the software, in order
to populate the different threads with the correct SP and PC.
 
The interrupt handling uses a concept built around threading and, with some
extra effort, it is probably possible to support 4, 8 or even 16 threads.
The drawback in this case is that the register bank increases in size, which
explains why the RV32E is an interesting option for threading: with half the
number of registers, it is possible to store two more threads in the core.

Currently, the time to switch the context in the *DarkRISCV* is two clocks in
the 3-stage pipeline, which matches the pipeline flush itself.  At 100MHz,
the maximum empirical number of context switches per second is around 2.94
million.

NOTE: the interrupt controller currently works only with the -Os flag in
the gcc!
 
About the new MAC instruction, it is implemented in a very preliminary way
with the opcode 7'b1111111 (this works, but it is a very bad decision!).  I
am checking the possibility of using the p.mac instruction, but at this
time the instruction is hand-encoded in the mac() function available in
stdio.c (i.e. the darkriscv libc).  Details about how to include new
instructions and make them work with GCC can be found in reference [5].

The preliminary tests showed, as expected, that the maximum clock decreases
to 90MHz.  Although it was possible to run at 100MHz with a non-zero timing
score and reach a peak performance of 100MMAC/s, the small 32-bit
accumulator saturates too fast and requires extra tricks in order to avoid
overflows.
 
The mul operation uses two 16-bit integers and the result is added to a
separate 32-bit register, which works as the accumulator.  As the operation
is always signed and the sign always uses the MSB, this means that the
15x15 mul produces a 30-bit result which is added to a 31-bit value, which
means that the overflow is reached after only two MAC operations.

In order to avoid overflows, it is possible to shift the input operands.
For example, in the case of G.711 w/ u-law encoding, the effective
resolution is 14 bits (13 bits for the magnitude and 1 bit for the sign),
which means that a 13x13-bit mul will be used and a 26-bit result produced
to be added to a 31-bit integer, enough to run 32 MAC operations before
overflow (in this case, when the ACC reaches a negative value):
 
    # awk 'BEGIN { ACC=2**31-1; A=2**13-1; B=-A; for(i=0;ACC>=0;i++) print i,A,B,A*B,ACC+=A*B }'

    1 8191 -8191 -67092481 2013298685
    2 8191 -8191 -67092481 1946206204
    ...
    30 8191 -8191 -67092481 67616736
    31 8191 -8191 -67092481 524255
    32 8191 -8191 -67092481 -66568226
 
Is this theory correct? I am not sure, but it looks good! :)
 
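One way to cross-check the reasoning above is to count how many worst-case
products fit in the positive range of the 32-bit accumulator, starting from
zero.  The Python sketch below does exactly that; the function name is made
up for this example and it models only the arithmetic, not the MAC hardware.

```python
ACC_MAX = 2**31 - 1                        # largest positive 32-bit signed value


def macs_before_overflow(magnitude_bits):
    """Worst-case MAC operations that fit in the accumulator without overflow."""
    worst_operand = 2**magnitude_bits - 1  # e.g. 8191 for 13 bits
    worst_product = worst_operand * worst_operand
    return ACC_MAX // worst_product


print(macs_before_overflow(15))            # 2: full 16-bit signed operands
print(macs_before_overflow(13))            # 32: 14-bit signed operands
```
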
As a complement, I included in stdio.c the support for the GCC functions
behind the native *, / and % (mul, div and mod) operations with 32-bit
signed and unsigned integers, which means true 32x32-bit operations
producing 32-bit results.  The code was derived from an old 68000-related
project (as most of the code in stdio.c) and, although it is not so fast, I
guess it is working.  Once the MAC instruction is better defined in
syntax and features, I think it is possible to optimize the mul/div/mod in
order to use it and increase the performance.
 
Here are some additional performance results (synthesis only, 3-stage
version) for other Xilinx devices available in the ISE, for speed grade 2:

- Spartan-6:    100MHz (measured 70MIPS w/ gcc -O1)
- Artix-7:      178MHz
- Kintex-7:     225MHz

For speed grade 3:

- Spartan-6:    117MHz
- Artix-7:      202MHz
- Kintex-7:     266MHz
 
The Kintex-7 can theoretically reach 186MIPS w/ gcc -O1.
 
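This estimate follows from scaling the measured Spartan-6 result (70MIPS at
100MHz w/ gcc -O1) to the synthesis clock of the Kintex-7 at speed grade 3.
A small Python sketch of that scaling, assuming the MIPS/MHz ratio stays the
same across devices (the names are chosen only for this example):

```python
MEASURED_MIPS = 70       # Spartan-6, 3-stage pipeline, gcc -O1
MEASURED_MHZ = 100


def projected_mips(clock_mhz):
    """Scale the measured MIPS linearly with the clock (an assumption)."""
    return clock_mhz * MEASURED_MIPS / MEASURED_MHZ


print(projected_mips(266))   # Kintex-7, speed grade 3 (synthesis only)
```
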
This performance is reached without the MAC and THREADING options
activated.  Thanks to the RV32E option, the synthesis for the Spartan-3E is
now possible, resulting in 95% LUT occupation in the case of the low-cost
100E model and a 70MHz clock (synthesis only and speed grade 5):

- Spartan-3E:   70MHz

For the 2-stage version and speed grade 2, we have less impact from the
pipeline flush (20%), no impact in the load and some impact on the clock due
to the use of a 2-phase clock:

- Spartan-6:    56MHz (measured 47MIPS w/ -O1)
 
About the compiler performance, measured from boot until the prompt w/ the
3-stage pipeline core at 100MHz and no interrupts (rom and ram measured in
32-bit words):

- gcc w/ -O3: t=289us rom=876 ram=211
- gcc w/ -O2: t=291us rom=799 ram=211
- gcc w/ -O1: t=324us rom=660 ram=211
- gcc w/ -O0: t=569us rom=886 ram=211
- gcc w/ -Os: t=398us rom=555 ram=211

Due to the reduced ROM space in the FPGA, -Os is the default option.
 
On the other hand, regarding the support for Vivado, it is possible to
convert the Artix-7 project (Xilinx AC701, available in the ise/boards
directory) to Vivado and make some interesting tests.  The only problem in
the conversion is that the UCF file is not converted, which means that a new
XDC file with the pin description must be created.

Vivado is very slow compared to ISE and needs *lots of time* to
synthesize and give minimal feedback about the performance...  but after
some weeks waiting, and lots of empirical calculations, I got some numbers
for speed grade 2 devices:

- Artix-7:      147MHz
- Spartan-7:    146MHz

And one number for speed grade 3 devices:

- Kintex-7:     221MHz
 
Although Vivado is far slower and shows pessimistic numbers for the same
FPGAs when compared with ISE, I guess Vivado is more realistic and, at
least, it supports the new Spartan-7, which shows very good numbers (almost
the same as the Artix-7!).
 
These values are only for reference.  The real values depend on some options
in the core, such as the number of pipeline stages, how the memories are
connected, etc.  Basically, the best clock is reached by the 3-stage
pipeline version (up to 100MHz in a Spartan-6), but it requires at least 1
wait-state in the load instruction and 2 extra clocks in the taken branches
in order to flush the pipeline.  The 2-stage pipeline requires no extra
wait-states and only 1 extra clock in the taken branches, but runs at a
lower clock (56MHz).

Well, my conclusion after some years of research is that branch
prediction solves lots of problems regarding the performance.
 
## Development Tools

About the gcc compiler, I am working with the experimental gcc 9.0.0 for
RISC-V.  No patches or updates are required for the *DarkRISCV* other than
the -march=rv32i option.  Although the fence*, e* and csr* instructions are
not implemented in the core, the gcc appears not to use those instructions.
 
Although it is possible to use the compiler set available on the official
RISC-V site, our colleagues from the *lowRISC* project pointed out a more
clever way to build the toolchain:

https://www.lowrisc.org/blog/2017/09/building-upstream-risc-v-gccbinutilsnewlib-the-quick-and-dirty-way/

Basically:
 
        git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
        git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
        git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
        mkdir combined
        cd combined
        ln -s ../newlib-cygwin/* .
        ln -sf ../binutils-gdb/* .
        ln -sf ../gcc/* .
        mkdir build
        cd build
        ../configure --target=riscv32-unknown-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32ima --with-abi=ilp32 --prefix=/usr/local/share/gcc-riscv32-unknown-elf
        make -j4
        make
        make install
        export PATH=$PATH:/usr/local/share/gcc-riscv32-unknown-elf/bin/
        riscv32-unknown-elf-gcc -v
 
and everything will magically work! (:

In case you have no success building the compiler, have no interest in
changing the firmware or are just curious about the darkriscv running in an
FPGA, the project includes the compiled ROM and RAM, so it is possible to
examine all derived objects, sources and correlated files generated by the
compiler without needing to compile anything.

Finally, as the *DarkRISCV* is not yet fully tested, sometimes it is a
very good idea to compare the code execution with another stable reference!
 
In this case, I am working with the project *picorv32*:

https://github.com/cliffordwolf/picorv32

When I have some time, I will try to create better-organized support in
order to easily test both the *DarkRISCV* and *picorv32* with the same
cache, memory and IO sub-systems, making it possible to select the core
according to the desired features, for example, the *DarkRISCV* for more
performance or the *picorv32* for more features.
 
About the software, the most complex issue is making the memory design match
the linker layout.  Of course, it is a gcc issue and it is not even a
problem: in fact, it is the way that the software guys work when linking the
code and data!

In the most simplified version, directly connected to blockRAMs, the
*DarkRISCV* is a pure Harvard architecture processor and requires the
separation between the instruction and data blocks!

When the cache controller is activated, it provides separate memories for
instruction and data, but also provides an interface for a more conventional
von Neumann memory architecture.

In both cases, a properly designed linker script (darksocv.ld) probably
solves the problem!
 
The current memory map in the linker script is the following:

- 0x00000000: 4KB ROM
- 0x00001000: 4KB RAM

Also, the linker maps the IO in the following positions:

- 0x80000000: UART status
- 0x80000004: UART xmit/recv buffer
- 0x80000008: LED buffer
 
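For quick reference, the memory map above can be written as a small address
decoder.  The Python sketch below is illustrative only: the table mirrors
the addresses listed above, while the 4-byte width of each IO register, the
function name and the "unmapped" result are assumptions for this example.

```python
KB = 1024

# (base address, size in bytes, name), following the lists above
MEMORY_MAP = [
    (0x00000000, 4 * KB, "ROM"),
    (0x00001000, 4 * KB, "RAM"),
    (0x80000000, 4, "UART status"),            # register width is assumed
    (0x80000004, 4, "UART xmit/recv buffer"),
    (0x80000008, 4, "LED buffer"),
]


def decode(addr):
    """Return the name of the region that contains addr, or 'unmapped'."""
    for base, size, name in MEMORY_MAP:
        if base <= addr < base + size:
            return name
    return "unmapped"


print(decode(0x00001FFC))    # last word of the RAM
print(decode(0x80000004))
```
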
The RAM memory contains the .data area, the .bss area (after the .data
and initialized with zero), the .rodata and the stack area at the end of RAM.
 
Although the RISC-V is defined as little-endian, it appears to be easy to
change the configuration in the GCC.  In this case, it is assumed that all
variables are stored in big-endian format.  Of course, the change
requires a similar change in the core itself, which is not so complex, as
it affects only the load and store instructions.  In the future, I will
try to test a big-endian version of GCC and darkriscv, in order to evaluate
possible performance enhancements in the case of network-oriented
applications! :)

Finally, the last update regarding the software included a new option to
build an x86 version, in order to help the development by testing exactly
the same firmware on x86.

In a preliminary way, it is possible to build the gcc for RV32E with the following configuration:
 
    git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
    git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
    git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
    mkdir combined
    cd combined
    ln -s ../newlib-cygwin/* .
    ln -sf ../binutils-gdb/* .
    ln -sf ../gcc/* .
    mkdir build
    cd build
    ../configure --target=riscv32-embedded-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32e --with-abi=ilp32e --prefix=/usr/local/share/gcc-riscv32-embedded-elf
    make -j4
    make
    make install
    export PATH=$PATH:/usr/local/share/gcc-riscv32-embedded-elf/bin/
    riscv32-embedded-elf-gcc -v
 
Currently, I found no easy way to make the GCC build big-endian code for
RISC-V.  Instead, the easy way is to make the endian switch directly in the
IO device or in the memory region.
 
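As an illustration of what such an endian switch does to a 32-bit word,
here is a minimal Python model.  The function name is hypothetical; in the
hardware the same effect would be obtained by crossing the byte lanes of
the IO device or memory.

```python
def swap32(word):
    """Reverse the byte order of a 32-bit word (little-endian <-> big-endian)."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


print(hex(swap32(0x11223344)))   # 0x44332211
```
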
As it is not so easy to build the GCC on some machines, I left in a public
share the source and the pre-compiled binary set of GCC tools for RV32E:

https://drive.google.com/drive/folders/1GYkqDg5JBVeocUIG2ljguNUNX0TZ-ic6?usp=sharing

As far as I remember, it was compiled on Slackware Linux or something
similar; anyway, it worked fine on Windows 10 w/ WSL and in other Linux-like
environments.
 
## Development Boards

Currently, the following boards are supported:

- Avnet Microboard LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
- Xilinx AC701 A200: equipped with a Xilinx Artix-7 A200 running at 90MHz
- QMTech SDRAM LX16: equipped with a Xilinx Spartan-6 LX16 running at 100MHz
- QMTech NORAM S15: equipped with a Xilinx Spartan-7 S15 running at 100MHz
- Lattice Brevia2 XP2: equipped with a Lattice XP2-6 running at 50MHz
- Piswords RS485 LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
- Digilent S3 Starter Board: equipped with a Xilinx Spartan-3 S200 running at 50MHz
 
The speeds are related to the clocks available on the boards, and different
clocks may be generated by programming a clock generator.  The Spartan-6 is
found on most boards and the core runs fine at ~100MHz, regardless of the
frequency of the main oscillator (typically 50MHz).

All Xilinx-based boards typically support a 115200 bps UART for console,
some LEDs for debug and on-chip 4KB ROM and 4KB RAM (as well as the RESET
button to restart the core and the DEBUG signals for an oscilloscope).

In the case of the QMTech boards, which include neither the JTAG nor the
UART/USB port, an external USB/UART converter and a low-cost JTAG adapter
can solve the problem easily!
 
The Lattice Brevia is clocked by the on-board 50MHz oscillator, with the
UART operating at 115200bps and the LED and DEBUG ports wired to the
on-board LEDs.

The Digilent Spartan-3 Starter Board, in turn, is a very useful board
to work as a reference for LUT4 technology, making it possible to improve
the support for alternative low-cost LUT4 FPGAs in the future.
 
On the software side, a small shell is available with some basic commands:

- clear: clear display
- dump : dumps an area of the RAM
- led : change the LED register (which turns on/off the LEDs)
- timer : change the timer prescaler, which affects the interrupt rate
- gpio : change the GPIO register (which changes the DEBUG lines)

The purpose of the shell is to provide some basic test features which can
give a go/no-go status about the current hardware.
 
Useful memory areas:

- 4096: the start of RAM (data)
- 4608: the start of RAM (data)
- 5120: empty area
- 5632: empty area
- 6144: empty area
- 6656: empty area
- 7168: empty area
- 7680: the end of RAM (stack)
 
As the *DarkRISCV* uses separate instruction and data buses, it is not
possible to dump the ROM area.  However, this limitation is not present when
the option __HARVARD__ is activated, as the core is constructed in a
way that the ROM bus is connected to one port of a dual-ported memory and
the RAM bus is connected to a different port of the same dual-ported
memory.  From the *DarkRISCV* point of view, they are fully separated and
independent buses, but in reality they are in the same memory area, which
makes it possible for the data bus to change the area where the code is
stored.  With this feature, it will be possible in the future to create
loadable code from the FLASH memory! :)
 
## Creating a RISCV from scratch

I found that some people are very skeptical about the possibility of
designing a RISC-V processor in one night.  Of course, it is not as easy
as it appears and, in fact, it requires a lot of experience, planning and
luck.  Also, the fact that the processor correctly runs a few instructions
and puts some garbage on the serial port does not really mean that the
design is perfect; instead, you will need lots and lots of debug time
in order to fix all hidden problems.
 
Just in case, I found a set of online videos from my friend (Lucas Teske)
that show the design of a RISC-V processor from scratch:

- https://www.twitch.tv/videos/840983740 Register bank (4h50)
- https://www.twitch.tv/videos/845651672 Program counter and ALU (3h49)
- https://www.twitch.tv/videos/846763347 ALU tests, CPU top level (3h47)
- https://www.twitch.tv/videos/848921415 Computer problems and microcode planning (08h19)
- https://www.twitch.tv/videos/850859857 instruction decode and execute - part 1/3 (08h56)
- https://www.twitch.tv/videos/852082786 instruction decode and execute - part 2/3 (10h56)
- https://www.twitch.tv/videos/858055433 instruction decode and execute - part 3/3 - SoC simulation (10h24)
- TBD tests in the Lattice FPGA
 
Unfortunately, the video set is currently in Portuguese only and there are a
lot of parallel discussions about technology, including the fixing of
Teske's notebook online!  I hope that in the future it will be possible to
edit the video set and, maybe, create English subtitles.

About the processor itself, it is a microcode-oriented concept with a
classic von Neumann architecture, designed to more easily support different
ISAs.  It is really very different from the traditional RISC cores that we
find around!  Also, it includes a very good ecosystem around opensource
tools, such as Icarus, Yosys and GTKWave!
 
Although not finished yet (95% done!), I think it is very illustrative of the RISC-V design:

- rv32e instruction set: very reduced (37) and very orthogonal bit patterns (6)
- rv32e register set: 16x32-bit register bank and a 32-bit program counter
- rv32e ALU with basic operations for reg/imm and reg/reg instructions
- rv32e instruction decode: very simple to understand, very direct to implement
- rv32e software support: the GCC support provides an easy way to generate code and test it!
 
Teske's goal is not to design the fastest RISC-V core ever (we already
have lots of fast cores with CPI ~ 1, such as the darkriscv, vexriscv,
etc), but to create a clean, reliable and comprehensible RISC-V core.

You can check the code in the following repository:

- https://github.com/racerxdl/riskow
 
## Acknowledgments

Special thanks to my old colleagues from the Verilog/VHDL/IT area:

- Paulo Matias (jedi master and verilog/bluespec/riscv guru)
- Paulo Bernard (co-worker and verilog guru)
- Evandro Hauenstein (co-worker and git guru)
- Lucas Mendes (technology guru)
- Marcelo Toledo (technology guru)
- Fabiano Silos (technology guru)
 
Also, special thanks to the "friends of darkriscv" that found the project on
the internet and contributed in any way to make it better:

- Guilherme Barile (technology guru and first guy to post anything about the darkriscv! [2]).
- Alasdair Allan (technology guru, posted an article about the darkriscv [3])
- Gareth Halfacree (technology guru, posted an article about the DarkRISCV [4])
- Ivan Vasilev (ported DarkRISCV for Lattice Brevia XP2!)
- timdudu from github (fix in the LDATA and found a bug in the BCC instruction)
- hyf6661669 from github (lots of contributions, including the fixes regarding the AUIPC and S{B,W,L} instructions, ModelSIM simulation, the memory byte select used by store/load instructions and much more!)
- zmeiresearch from github (support for Lattice XP2 Brevia board)
- All other colleagues from github that contributed with fixes, corrections and suggestions.
 
Finally, thanks to all people who directly and indirectly contributed to
this project, including the company I work for and all colleagues that
tested the *DarkRISCV*.

## References
 
        [1] https://www.amazon.com/RISC-V-Reader-Open-Architecture-Atlas/dp/099924910X
        [2] https://news.ycombinator.com/item?id=17852876
        [3] https://blog.hackster.io/the-rise-of-the-dark-risc-v-ddb49764f392
        [4] https://abopen.com/news/darkriscv-an-overnight-bsd-licensed-risc-v-implementation/
        [5] http://quasilyte.dev/blog/post/riscv32-custom-instruction-and-its-simulation/
        [6] https://github.com/riscv/riscv-pk/blob/master/bbl/riscv_logo.txt
