% Manual for ForwardCom softcore model A

\documentclass[11pt,a4paper,oneside,openright]{report}
\usepackage[bindingoffset=5mm,left=20mm,right=20mm,top=20mm,bottom=20mm,footskip=10mm]{geometry}
\usepackage[utf8x]{inputenc}
\usepackage{hyperref}
\usepackage[english]{babel}
\usepackage{listings}
\usepackage{subfiles}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{ragged2e}
\usepackage{amsthm} % example numbering
\usepackage{color}
\usepackage[T1]{fontenc} % fix problem with underscore not searchable
\usepackage{fontspec}
\usepackage{caption}
\usepackage{graphicx}
\defaultfontfeatures{Mapping=tex-text, Ligatures=NoCommon}
\setmainfont{Arial}[Ligatures=NoCommon]
\setsansfont{Arial}
\renewcommand{\familydefault}{\sfdefault}

% modify style
\newtheorem{example}{Example}[chapter]  % example numbering
\lstset{language=C}                     % formatting for code listing
\lstset{basicstyle=\ttfamily,breaklines=true}
\definecolor{darkGreen}{rgb}{0,0.4,0}
\lstset{commentstyle=\color{darkGreen}}
\captionsetup{justification=raggedright,singlelinecheck=false}

\newcommand{\vv}{ \vspace{2mm} }   % vertical space


\begin{document}
\title{ForwardCom softcore model A}
\author{Agner Fog}
\date{\today}
\maketitle
\RaggedRight

\tableofcontents
\setcounter{secnumdepth}{1}


\chapter{Introduction}
ForwardCom is a new instruction set architecture with corresponding hardware and software standards. An introduction to ForwardCom can be found at the website \href{https://www.forwardcom.info}{www.forwardcom.info}.
A full description and definition is found in the manual
\href{https://github.com/ForwardCom/manual/raw/master/forwardcom.pdf}{ForwardCom: An open-standard instruction set for high-performance microprocessors}.
\vv

The present document describes the softcore model A version 1.00
for implementing a ForwardCom CPU in a Xilinx Artix-7 FPGA.
\vv

\subsection{Features of model A version 1.00}
\begin{itemize}
\item Runs on Nexys A7-100T FPGA board
\item Maximum clock frequency 50--70 MHz, depending on configuration
\item 32-bit or 64-bit registers
\item Can execute one instruction per clock cycle
\item Data memory 32 kB. Code memory 64 kB. Call stack 1023 entries
\item Implements all integer instructions, except multiplication, division, push, pop
\item Implements all instruction formats and all addressing modes defined by the ForwardCom standard version 1.11
\item No vector registers. No floating point instructions
\item No system calls, no memory protection. Useful for embedded designs
\item Memory reads and writes must be aligned
\item RS232 serial interface for standard input and output
\item On-chip loader (uses 2 kB code memory)
\item On-chip debug interface
\item On-chip event counter
\item Code examples and test suite provided
\end{itemize}
\vv

\subsection{Prerequisites}
\begin{itemize}
\item FPGA board Digilent Nexys A7-100T or Nexys 4
\item Xilinx Vivado tools version 2020.1
\item ForwardCom softcore
\item ForwardCom binary tools ``forw'' version 1.11 or later
\item Serial terminal program, such as RealTerm or Tera Term
\item Windows or Linux PC
\end{itemize}
\vv


\chapter{Getting started} \label{Chap:GettingStarted}
The softcore is available at
\href{https://github.com/ForwardCom/softcoreA}{github.com/ForwardCom/softcoreA}. \\

Device-specific files for the Nexys A7-100T FPGA board are provided in
\href{https://github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}{the subfolder deviceA7100T}. \\


\begin{itemize}
\item Install Vivado version 2020.1. If you are using a later version, make sure that it supports Artix-7 series FPGAs.
\item Get the softcore project zip file A1.xpr.zip and unpack it. Open the project file A1.xpr with Vivado.
\item In Vivado, go to the Flow Navigator window and open Hardware Manager.
\item Connect the FPGA board to the PC with the USB cable that comes with the board. Turn on the FPGA board with the slide switch at the top. You will see a moving pattern on the display. Install the FTDI driver on the PC if it does not install automatically.
\item In Vivado Hardware Manager, click Open target $\rightarrow$ Auto Connect. (If the connection fails, restart the computer and use a different USB port.)
\item Click Program device, and select the file \{my directory\}/A1/A1.runs/impl\_1/top.bit \\
You will see numbers on the display.
\item Now you can upload a program to the softcore CPU. Choose hello.ex as a start. Upload this file in binary mode to the virtual serial port that corresponds to the USB port. Use 57600 Baud, 8 data bits, 1 stop bit, no parity, no flow control. See below how to do this with a terminal program.
\item If you are using RealTerm: \\
Open RealTerm. On the Display tab, choose Display as All Chars, newLine mode. \\
On the Port tab, set Baud = 57600. In the Port field, find the port number that corresponds to the USB connection. In my case, it shows 4=VCP0, but it may have a different number on your computer. The number may change if you use a different USB port. Press the Open button twice to close the connection and open it again. Look at the Status to the right and make sure it is not Closed.\\
Go to the Send tab. Set checkmarks in ``+LF'' in the first two lines. Set delays to 0. Under ``Dump file to port'', find the file you want to upload (hello.ex) and press Send. (If you want to upload again, press the ``cpu reset'' button on the FPGA board to activate the loader.)\\
Press the Up button on the FPGA board if the program asks you to press Run. The hello.ex program writes a welcoming text on the terminal.
\item If you are using Tera Term: \\
On New Connection, click Serial. In the Port window, choose USB Serial Port. \\
Click on the Setup menu $\rightarrow$ Serial Port. Set speed = 57600, and click New Setting.\\
Go to the File menu $\rightarrow$ Send file. Set Option = Binary. Open the file hello.ex \\
Press the Up button on the FPGA board to start the program. You should see a Hello text on the terminal.
\item If you are using Linux: \\
I have not tried this on Linux yet. You need a serial terminal program that can transfer files.
\end{itemize}
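
For Linux, the upload step could in principle also be done with a small program instead of a terminal program. The following C code is an untested sketch, assuming that the FTDI virtual serial port shows up as a device file (for example /dev/ttyUSB1, which is only a guess) and that the standard POSIX termios interface is available. The function names open\_serial\_57600\_8n1 and send\_file are hypothetical and not part of the ForwardCom software. The sketch sets the port to 57600 Baud, 8 data bits, 1 stop bit, no parity, no flow control, and sends the file unmodified.

\begin{lstlisting}[language=C]
#define _DEFAULT_SOURCE   /* for cfmakeraw() and CRTSCTS on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Open a serial device and configure it for 57600 Baud, 8 data bits,
   1 stop bit, no parity, no flow control.
   Returns a file descriptor, or -1 on failure. */
int open_serial_57600_8n1(const char *device)
{
    int fd = open(device, O_RDWR | O_NOCTTY);
    if (fd < 0) return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) { close(fd); return -1; }

    cfmakeraw(&tio);                    /* binary mode, 8 data bits */
    tio.c_cflag &= ~(PARENB | CSTOPB | CRTSCTS); /* no parity, 1 stop bit,
                                                    no flow control */
    tio.c_cflag |= CLOCAL | CREAD;
    cfsetispeed(&tio, B57600);
    cfsetospeed(&tio, B57600);

    if (tcsetattr(fd, TCSANOW, &tio) != 0) { close(fd); return -1; }
    return fd;
}

/* Send the contents of a file unmodified to an open serial port.
   Returns the number of bytes sent, or -1 on failure. */
long send_file(int fd, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char buf[256];
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        size_t done = 0;
        while (done < n) {              /* write() may send only a part */
            ssize_t w = write(fd, buf + done, n - done);
            if (w < 0) { fclose(f); return -1; }
            done += (size_t)w;
        }
        total += (long)n;
    }
    fclose(f);
    return total;
}
\end{lstlisting}

A caller would open the port, call send\_file with the path of hello.ex, and then read the program output from the same file descriptor. Press the ``cpu reset'' button on the FPGA board before sending so that the loader is active.
\vv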


\chapter{Examples and test programs}
Some code examples are available as assembly code at \\
\href{https://github.com/ForwardCom/code-examples}{github.com/ForwardCom/code-examples}.
\vv

The same examples compiled for the softcore are available as .ex files at \\
\href{https://github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}{github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}.
\vv

The following examples work on the softcore model A:
\begin{itemize}
\item hello.ex: Write a welcoming message on the serial terminal
\item calculator.ex: Enter two integers a and b. The program will calculate a+b, a-b, a*b, a/b, and a modulo b.
\item guess\_number.ex: A guessing game. The user has to guess a random number. The program tells you whether the number is bigger or smaller than your guess.
\end{itemize}

These examples are compiled from the assembly code and the libc\_light.li function library. Upload one of the .ex files as described above to run these examples on the softcore.
\vv

The same examples can run on the emulator when linked with libc.li instead of libc\_light.li.

\subsubsection{Test suite}
A test suite is provided for testing all instructions and formats. The source code for the test programs is available at:\\
\href{https://github.com/ForwardCom/test-suite}{github.com/ForwardCom/test-suite}
\vv

The same test programs compiled for the softcore are available as .ex files at \\
\href{https://github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}{github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}.
\vv

The following test programs work on the softcore model A:
\begin{itemize}
\item tests\_arithmetics.ex: Test all integer arithmetic instructions
\item tests\_bool\_bit.ex: Test all boolean and bit manipulation instructions
\item tests\_branch.ex: Test all jump, call, and branch instructions
\item tests\_formats.ex: Test all instruction formats
\end{itemize}
\vv

The 32-bit version of the softcore obviously fails the tests on 64-bit operands.
\vv


\chapter{User interface}
The FPGA board is connected to a PC through a USB cable. Standard input and output are provided as an RS232 serial connection tunneled through the USB cable with an FTDI driver installed on the PC. A terminal program running on the PC may be used for user input and output. The transmission parameters are 57600 Baud, 8 data bits, 1 stop bit, no parity, no flow control,
and LF (\textbackslash n) as line feed. The Baud rate can be changed by modifying a configuration file (see page \pageref{SettingConfiguration}) and rebuilding the hardware code.
\vv
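
The transmission parameters determine how long an upload takes. With 1 start bit, 8 data bits, 1 stop bit, and no parity, each byte occupies 10 bit periods on the line, so 57600 Baud corresponds to at most 5760 bytes per second. The following C sketch shows this arithmetic. The function names are hypothetical and chosen only for this illustration.

\begin{lstlisting}[language=C]
/* Maximum number of bytes per second on a serial line with
   1 start bit, 8 data bits, 1 stop bit, and no parity */
unsigned long bytes_per_second(unsigned long baud)
{
    return baud / 10;
}

/* Minimum time in seconds for transferring a number of bytes */
double transfer_seconds(unsigned long bytes, unsigned long baud)
{
    return (double)bytes / (double)bytes_per_second(baud);
}
\end{lstlisting}

For example, a file that fills the entire 64 kB code memory (65536 bytes) needs at least about 11.4 seconds at 57600 Baud.
\vv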

The pushbuttons on the FPGA board are used as follows:
\vv

\begin{tabular}{|l|l|}
\hline
\bfseries Operation  & \bfseries Use pushbutton  \\ \hline
Run  &  $\uparrow$  \\ \hline
Single step  &  $\rightarrow$  \\ \hline
Restart  &  $\downarrow$  \\ \hline
Load program  &  CPU reset  \\ \hline
\end{tabular}
\vv

The Run button will make the program run if it has stopped at a breakpoint. Test programs and examples often start with a breakpoint so that they will run only when you press the Run button.
\vv

The Single step button allows you to execute CPU instructions one by one for debugging purposes. See chapter \ref{Chap:Debugger}.
\vv

The Restart button makes the program restart from the beginning. All registers are reset, but static data are not reset.
\vv

The Load program button lets you load a new program. Press the Load program button and send a new .ex file from the terminal program, as explained on page \pageref{Chap:GettingStarted}.
\vv

The pushbuttons on the FPGA board are small and impractical. You may add external pushbuttons as described on page \pageref{Chap:ExternalPushbuttons}.
\vv

Runtime errors are indicated by a flashing LED. The type of error is shown on the LED displays with the following codes:
\vv

\begin{tabular}{|l|l|l|}
\hline
\bfseries Error code  & \bfseries Error when loading & \bfseries Error when running  \\ \hline
E1  & Error in .ex file & Unknown instruction \\ \hline
E2  & Code size too big & Wrong operands for instruction \\ \hline
E3  & Data size too big & Array index overflow \\ \hline
E4  &  & Memory read violation \\ \hline
E5  &  & Memory write violation \\ \hline
E6  &  & Misaligned memory read or write \\ \hline
\end{tabular}
\vv

The error code is followed by the code address where the error was detected, as a hexadecimal number.
\vv

The LED display shows debugging information when there is no error (no flashing LED). See chapter \ref{Chap:Debugger}.
\vv

\chapter{Serial input and output} \label{Chap:SerialIO}

The following port numbers are provided for serial input and output.
\vv

\begin{tabular}{|l|l|l|}
\hline
\bfseries Port number & \bfseries Function & \bfseries Bits \\ \hline
Input port 8 & serial input & bit 0-7: received byte \\
             & RS232        & bit 8: (Data valid. Unreliable) \\
             &              & bit 9: At least one more input byte is ready \\
             &              & bit 12: Input buffer overflow error \\
             &              & bit 13: Frame error in start bit or stop bit \\ \hline
Input port 9 & serial input status & bit 0-15: Number of bytes in input buffer \\
             &              & bit 16: Input buffer overflow error \\
             &              & bit 17: Frame error in start bit or stop bit\\ \hline
Output port 9 & serial input control & bit 0: Clear buffer. Delete all data in input buffer and clear error status \\
             &              & bit 1: Clear error status but keep data \\ \hline
Output port 10 & serial output & bit 0-7: byte to send \\
              & RS232         & other bits are reserved  \\ \hline
Input port 11 & serial output status & bit 0-15: Number of bytes in output buffer \\
              &              & bit 16: Output buffer overflow error \\
              &              & bit 18: Ready. Output buffer is not full \\ \hline
Output port 11 & serial output control & bit 0: Clear buffer. Delete all data in output buffer and clear error status \\
             &              & bit 1: Clear error status but keep data \\ \hline
\end{tabular}
\vv

The size of the input buffer is 1 kB. The size of the output buffer is 2 kB. The Baud rate and buffer sizes can be changed in a configuration file (see page \pageref{SettingConfiguration}).
\vv
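
The bit fields in the table above can be extracted with shifts and masks. The following C code is a minimal sketch of this decoding, intended for example for a host-side tool or a simulator. On the softcore itself, the ports are accessed from ForwardCom assembly code. The type serial\_input and the function names are hypothetical and not part of the ForwardCom software.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Fields of the value read from input port 8 (serial input) */
typedef struct {
    uint8_t byte;        /* bit 0-7: received byte */
    int     more_ready;  /* bit 9: at least one more input byte is ready */
    int     overflow;    /* bit 12: input buffer overflow error */
    int     frame_error; /* bit 13: frame error in start or stop bit */
} serial_input;

serial_input decode_input_port8(uint32_t v)
{
    serial_input s;
    s.byte        = (uint8_t)(v & 0xFF);
    s.more_ready  = (int)((v >> 9) & 1);
    s.overflow    = (int)((v >> 12) & 1);
    s.frame_error = (int)((v >> 13) & 1);
    return s;
}

/* Input port 9 and input port 11, bit 0-15:
   number of bytes in the input or output buffer */
unsigned buffered_bytes(uint32_t v)
{
    return (unsigned)(v & 0xFFFF);
}

/* Input port 11, bit 18: ready, output buffer is not full */
int output_ready(uint32_t v)
{
    return (int)((v >> 18) & 1);
}
\end{lstlisting}

Bit 8 of input port 8 is deliberately left out because the table marks it as unreliable. Bit 9 or the byte count on input port 9 can be used instead for checking whether more input is available.
\vv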

\chapter{Control registers and performance counters} \label{Chap:ControlRegisters}

The following capabilities registers are supported:
\vv

\begin{tabular}{|c|c|l|}
\hline
\bfseries Register & \bfseries Read/Write & \bfseries Functionality \\ \hline
capab0 & R   & Microprocessor model = ``A'' \\ \hline
capab1 & R  & Microprocessor version = 0x10000 \\ \hline
capab2 & RW & Disable error traps \\ \hline
capab4 & R  & Code memory/cache size in bytes \\ \hline
capab5 & R  & Data memory/cache size in bytes \\ \hline
capab8 & R  & Support for operand sizes \\ \hline
capab9 & R  & Support for operand sizes in vectors \\ \hline
capab12 & R & Maximum vector length in bytes \\ \hline
capab13 & R & Maximum vector length for permute instructions \\ \hline
capab14 & R & Maximum block size for permute instructions \\ \hline
capab15 & R & Maximum vector length for compress\_sparse and expand\_sparse \\ \hline
\end{tabular}
\vv
\vv
\vv

The following performance counters are supported:
\vv

\begin{tabular}{|c|l|}
\hline
\bfseries Register & \bfseries Functionality \\ \hline
perf1  & CPU clock cycles \\ \hline
perf2  & Number of instructions executed \\ \hline
perf3  & Number of vector instructions executed \\ \hline
perf4  & Vector registers in use \\ \hline
perf5  & Jumps, calls, and return instructions \\ \hline
perf8  & Runtime errors \\ \hline
\end{tabular}
\vv

The performance counter registers are read-only, but it is possible to reset these registers to zero.
\vv

Details of these registers are provided in the ForwardCom manual.
\vv
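
The counters perf1 and perf2 can be combined to measure how well a piece of code utilizes the pipeline. The following C sketch shows the arithmetic, assuming that both counters were reset to zero before the code of interest and read afterwards. The function names are hypothetical. How the counters are read and reset is described in the ForwardCom manual and is not shown here.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Average number of instructions executed per clock cycle,
   calculated from perf1 (clock cycles) and perf2 (instructions) */
double instructions_per_clock(uint64_t perf1_cycles,
                              uint64_t perf2_instructions)
{
    if (perf1_cycles == 0) return 0.0;   /* avoid division by zero */
    return (double)perf2_instructions / (double)perf1_cycles;
}

/* Execution time in seconds for a given number of clock cycles
   at a given clock frequency in Hz */
double seconds_elapsed(uint64_t perf1_cycles, double clock_hz)
{
    return (double)perf1_cycles / clock_hz;
}
\end{lstlisting}

A value close to 1 indicates that the softcore is executing almost one instruction per clock cycle. Lower values indicate pipeline bubbles or stalls.
\vv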


\chapter{Debugger} \label{Chap:Debugger}
The softcore has two debugging interfaces: an LED display for showing internal signals in the CPU, and an external LCD display for showing the contents of each pipeline stage as well as the result of each instruction.
\vv

Both displays are useful when single-stepping through the code. Set a breakpoint instruction in the assembly code at the point where you want to begin single-stepping. Use the single-step button ($\rightarrow$) to execute instructions one by one. Use the run button ($\uparrow$) to run to the next breakpoint or end of program. It is possible to use external buttons instead of the small pushbuttons on the FPGA board, as described in chapter \ref{Chap:AddOnBoards}.
\vv

\section{LED display}
The LED display on the FPGA board can show any internal signal in the CPU during single-stepping. The sixteen switches on the FPGA board are used for selecting which signal to view. All signals are shown as hexadecimal numbers on the LED display.
\vv

Switches 7-4 select which pipeline stage or unit to show, and switches 3-0 select which signal from this stage or unit to show.
For example, if you set switches 7-0 to 0101 1000, you can see the output value from the ALU.
\vv

All the possible signals to view are defined in the file debugger.vh.
You may modify the file debugger.vh to show any signal that is not already defined and assigned to a switch combination.
\vv

All pipeline stages and units have registers (D flip-flops) at the output, showing the values after one clock pulse. If you want to see an internal signal that is not output through a register from the unit in question, then it is recommended to output this signal through a register before sending it to the debug display. This makes sure that signals are properly synchronized so that you do not see signals that belong to different clock periods at the same time on the display (which can be very confusing). See the output debug1\_out in the file decoder.sv for an example of how to do this.
\vv

The LED display can also show register values. Set switch 8 high and set switches 0-5 to the register number as a binary value. The LED display will show the lower 28 bits of the register as a hexadecimal value on the lower seven LED display digits. The leftmost LED display digit will show 8 if the register value is not yet known because it is waiting for a value to be calculated by the ALU. The two rightmost digits show the tag of the instruction that will provide the value in this case.
\vv
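
The switch settings described in this section can be summarized as bit operations on a number where bit n represents switch n. The following C sketch is only an illustration of the encoding. The function names are hypothetical, and the actual assignment of signals to switch combinations is the one in debugger.vh.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Switches 7-4 select the pipeline stage or unit */
unsigned selected_unit(unsigned switches)
{
    return (switches >> 4) & 0xF;
}

/* Switches 3-0 select the signal within that stage or unit */
unsigned selected_signal(unsigned switches)
{
    return switches & 0xF;
}

/* Switch setting for viewing a register:
   switch 8 high, register number on switches 0-5 */
unsigned switches_for_register(unsigned reg)
{
    return (1u << 8) | (reg & 0x3Fu);
}

/* The seven lower display digits show the lower 28 bits
   of the register value */
uint32_t displayed_register_bits(uint64_t value)
{
    return (uint32_t)(value & 0x0FFFFFFFu);
}
\end{lstlisting}

With the setting 0101 1000 from the example above, selected\_unit gives 5 and selected\_signal gives 8. To view register number 5, the switches are set to 1 0000 0101.
\vv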

\section{External LCD display}
It is possible to add an external LCD display to show results and how instructions propagate through the pipeline when single-stepping. Instructions for building the external LCD display are given in chapter \ref{Chap:AddOnBoards}.
\vv

The LCD display has eight lines (four on each LCD unit). The eight lines show the contents of the pipeline as follows:
\vv

\begin{tabular}{|l|l|l|}
\hline
\bfseries Line & \bfseries Symbol & \bfseries Showing  \\ \hline
1  & F & Fetch stage instruction \\ \hline
2  & D & Decode stage instruction \\ \hline
3  & R & Register read stage instruction \\ \hline
4  & A & Address generation stage instruction \\ \hline
5  & d & Data read stage instruction \\ \hline
6  & X & Execute (ALU) stage instruction \\ \hline
7  & : & ALU inputs \\ \hline
8  & = & ALU result \\ \hline
\end{tabular}
\vv

Each pipeline stage is represented by a line showing the code address (lower 16 bits as hexadecimal), instruction format, operand type, and instruction name. See figure \ref{fig:addOnBoard} on page \pageref{fig:addOnBoard}.
\vv

The operand type is coded with one letter as follows: b=int8, h=int16, w=int32, d=int64, q=int128, F=float32, D=float64, Q=float128.
\vv
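
The operand type letters follow a simple pattern: lower case letters denote integer types and upper case letters denote floating point types. The following C sketch expresses the list above as a lookup, for example for a script that interprets text copied from the display. The function names are hypothetical.

\begin{lstlisting}[language=C]
/* Operand size in bits for the operand type letter shown on the
   LCD display. Returns 0 for an unknown letter. */
int operand_bits(char letter)
{
    switch (letter) {
    case 'b':           return 8;    /* int8 */
    case 'h':           return 16;   /* int16 */
    case 'w': case 'F': return 32;   /* int32, float32 */
    case 'd': case 'D': return 64;   /* int64, float64 */
    case 'q': case 'Q': return 128;  /* int128, float128 */
    default:            return 0;
    }
}

/* Nonzero if the letter denotes a floating point type */
int operand_is_float(char letter)
{
    return letter == 'F' || letter == 'D' || letter == 'Q';
}
\end{lstlisting}
\vv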

Line 7, indicated as ``:'', shows three hexadecimal numbers representing the lower 16 bits of the values of up to three input operands of the current instruction. The first field shows the first operand of 3-operand instructions, or a fallback value for instructions with fewer than 3 operands. The second field shows the second operand of 3-operand instructions or the first operand of 2-operand instructions. The third field shows the last operand.
\vv

Line 8, indicated as ``='', shows the result of the current instruction as a hexadecimal number (32 bits). ``m1'' indicates a mask bit of 1. ``m0'' indicates a mask bit of 0, giving a fallback value. The results of conditional and indirect jump instructions are indicated as ``jump'' or ``no jump''.
\vv

Direct unconditional jump, call, and return instructions are only visible on the first line because they are handled by the fetch unit only.
\vv

Pipeline bubbles (jump delays) are indicated by ``-''.
\vv

Pipeline stalls are indicated by ``\#''. This occurs when the address generator stage or the execute stage (ALU) is waiting for a register value.
\vv


\chapter{Add-on boards} \label{Chap:AddOnBoards}

\subsection{Adding external pushbuttons} \label{Chap:ExternalPushbuttons}
It is convenient to add external pushbuttons to replace the small buttons on the FPGA board. The following components are needed.
\vv

\begin{table}[h]
%\centering
\begin{tabular}{|l|l|}
\hline
\bfseries Component & \bfseries Number  \\ \hline
VERO board or similar & 1 \\ \hline
Pushbutton, for example Digitast & 4 \\ \hline
2x6 pin male connector & 1 \\ \hline
Resistor 10 k$\Omega$ & 4 \\ \hline
\end{tabular}
\caption{Component list for external pushbuttons}
\label{table:ComponentListPushbuttons}
\end{table}
\vv

The pushbuttons can be connected to the JB socket on the FPGA board according to this diagram:

\begin{figure}[ht]
\centering
\includegraphics{SchematicDebugButtons.png}
\caption{Circuit diagram for connecting external pushbuttons}
\label{fig:PushbuttonsDiagram}
\end{figure}
\vv

An add-on board with both LCD display and pushbuttons is shown in figure \ref{fig:addOnBoard}. The buttons on the photo are: Black: single step, Green: run, Red: restart, Blue: load.
\vv

\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{LCDboard.jpg}
\caption{Prototype add-on board with LCD display and pushbuttons connected to FPGA board}
\label{fig:addOnBoard}
\end{figure}
%\vv
\clearpage


\subsection{Adding external LCD display} \label{Chap:ExternalLCDDisplay}
An external LCD display is useful when debugging. The display shows all stages of the pipeline as well as the result of each instruction during single-stepping. The display has eight lines divided between two four-line LCD modules (see figure \ref{fig:addOnBoard}).
\vv

The components listed in table \ref{table:ComponentListLCD} are needed.
\vv

\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{SchematicLCDdisplay.png}
\caption{Circuit diagram for connecting external LCD display}
\label{fig:LCDdisplayDiagram}
\end{figure}


\begin{table}[h]
\begin{tabular}{|l|l|}
\hline
\bfseries Component & \bfseries Number  \\ \hline
VERO board or similar & 1 \\ \hline
LCD display, 4 lines x 20 columns & 2 \\ \hline
16 pin male pin row for display & 2 \\ \hline
16 pin female connector for display & 2 \\ \hline
2x6 pin male connector & 1 \\ \hline
Schmitt trigger 74HC14 or 40106 & 1 \\ \hline
Diode 1N4148 & 2 \\ \hline
Resistor 100 k$\Omega$ & 1 \\ \hline
Trimpot 10 k$\Omega$ or 22 k$\Omega$ & 2 \\ \hline
Capacitor 10 nF & 1 \\ \hline
Capacitor 10 $\mu$F & 2 \\ \hline
\multicolumn{2}{|l|}{Component values are not critical} \\ \hline
\end{tabular}
\caption{Component list for external LCD display board}
\label{table:ComponentListLCD}
\end{table}


A circuit diagram for connecting the external LCD display is provided in figure \ref{fig:LCDdisplayDiagram}. Explanation of the diagram: The external LCD display board is connected to the JA socket on the FPGA board. Pins 1-4 on JA are connected to DB4-DB7 on both displays. Pin 7 on JA is connected to RS on both displays. Pin 8 on JA is connected to Enable on the upper display. Pin 9 on JA is connected to Enable on the lower display. Pins 11 and 12 on JA provide ground and +3.3V supply.
\vv

3.3V is sufficient for the control logic of the LCD displays, but not enough for the backplane voltage. The circuit driven by a 74HC14 Schmitt trigger is a charge pump generating a negative voltage for the LCD backplanes. Use the two trimming potentiometers to adjust the contrast on the two display modules.
\vv

\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{LCDboardApart.jpg}
\caption{Prototype add-on board with one display removed, showing the underlying components}
\label{fig:AddOnBoardDisassembled}
\end{figure}
\vv


\chapter{Test suite}
A suite of test programs is provided for testing the softcore as well as the emulator. The test programs are written in ForwardCom assembly language.
\vv

The following test programs are currently available:
\begin{itemize}
\item tests\_arithmetics.as: Test all arithmetic instructions with general purpose registers with operand size int8, int16, int32, and int64.
\item tests\_bool\_bit.as: Test boolean and bit manipulation instructions with general purpose registers with operand size int8, int16, int32, and int64.
\item tests\_branch.as: Test jump, call, and branch instructions with general purpose registers with operand size int8, int16, int32, and int64.
\item tests\_formats.as: Test all instruction formats for general purpose registers, including formats for multi-format instructions, single-format instructions, and jump instructions.
\end{itemize}
\vv

The results of the tests are sent via the standard serial output to a terminal.
\vv

The tests are not fully exhaustive. ForwardCom instructions have an almost infinite number of different variants and special cases. It is impossible to test all combinations of instruction variants and operand values, but the tests are intended to cover as many options and variants as possible.
\vv

The test programs must be linked with the library libc\_light.li if the softcore does not support multiplication, division, push, and pop instructions.
Example of building a test program:

\begin{lstlisting}
forw -ass -debug=1 tests_arithmetics.as -list=test.txt
forw -link -debug=1 tests_arithmetics.ex tests_arithmetics.ob libc_light.li
\end{lstlisting}
\vv

See the ForwardCom manual for details about how to use the binary tools.
\vv

The softcore model A 1.00 will obviously fail tests for the instructions that are not supported, i.e. instructions that involve multiplication or division.
\vv

The tests for operand size int64 will fail for the 32-bit version of the softcore. Instructions that have a fixed operand size of int64 will be shown as working in the 32-bit softcore if the lower 32 bits of the result are correct.
\vv

\chapter{Function library libc\_light.li}
libc\_light.li is a limited version of the standard C library intended for small ForwardCom cores that lack certain instructions. libc\_light.li avoids the use of multiplication, division, push, pop, and sys\_call instructions. Input and output functions access the input and output ports directly rather than using system calls.
\vv

Programs for the softcore model A 1.00 must be linked with libc\_light.li rather than with libc.li.
\vv

The following functions are provided in libc\_light.li:
\vv

\begin{tabular}{|p{25mm}|p{100mm}|p{30mm}|}
\hline
\bfseries Function name & \bfseries Functionality & \bfseries Source file \\ \hline
\multicolumn{3}{|l|}{Common C functions:} \\ \hline
atoi   & Interpret ASCII string as integer & atoi\_light.as  \\ \hline
atol   & Interpret ASCII string as integer & atoi\_light.as  \\ \hline
getch  & Wait for input from stdin. Returns one character & getch\_light.as \\ \hline
getche  & Same as getch, with echo & getch\_light.as \\ \hline
gets\_s  & Read a string from standard input until newline & gets\_light.as \\ \hline
\_kbhit & Return 1 if there is at least one character in the input buffer, 0 otherwise & getch\_light.as \\ \hline
printf  & Print formatted string to stdout & printf\_light.as \\ \hline
sprintf  & Print formatted string to string buffer & printf\_light.as \\ \hline
snprintf  & Same as sprintf, with length limit & printf\_light.as \\ \hline
putchar    & Print a single character to stdout & putchar\_light.as \\ \hline
puts   & Print a string to stdout, append newline & puts\_light.as \\ \hline
\multicolumn{3}{|l|}{Non-standard functions:} \\ \hline
clear\_input  & Clear stdin input buffer & clear\_input\_light.as \\ \hline
divide\_int & Divide two 32-bit signed integers. Returns quotient and remainder & divide\_int\_light.as \\ \hline
getch\_nonblocking & Same as getch. Does not wait for input. Returns -1 if the input buffer is empty & getch\_light.as \\ \hline
multiply\_int  & Multiply two 32-bit signed integers. No overflow check & multiply\_int\_light.as \\ \hline
print\_characters  & Print a very short string to stdout. The input is a 32-bit integer holding up to 4 characters, or a 64-bit integer holding up to 8 characters & print\_characters\_light.as \\ \hline
print\_hexadecimal  & Print an integer as a hexadecimal number to stdout & print\_hexadecimal\_light.as \\ \hline
print\_integer & Print a 32-bit signed integer as decimal number to stdout & print\_integer\_light.as \\ \hline
print\_string  & Same as puts, does not append newline. Returns a pointer to the first character after the string & print\_string\_light.as \\ \hline
\_entry\_point & This startup code just calls main, unlike the startup code in libc.li that checks for event handlers & startup\_light.as  \\ \hline
\end{tabular}
\vv
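
A multiplication can be carried out without a multiplication instruction by using shifts and additions. The following C code is a sketch of this general technique. It is not taken from multiply\_int\_light.as, and the actual implementation in the library may work differently. The function name is hypothetical.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Multiply two 32-bit signed integers using only shifts and
   additions. The result wraps around on overflow; there is no
   overflow check. The conversion back to int32_t assumes two's
   complement wraparound, as on all common compilers. */
int32_t multiply_shift_add(int32_t a, int32_t b)
{
    uint32_t x = (uint32_t)a;     /* multiplicand, shifted left */
    uint32_t y = (uint32_t)b;     /* multiplier, shifted right */
    uint32_t product = 0;
    while (y != 0) {
        if (y & 1) product += x;  /* add if the current bit is set */
        x <<= 1;
        y >>= 1;
    }
    return (int32_t)product;
}
\end{lstlisting}

The calculation is done on unsigned values because the lower 32 bits of a product are the same for signed and unsigned operands. The loop needs at most 32 iterations.
\vv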

The functions printf, sprintf, and snprintf follow the C standard with the following limitations: They cannot print floating point numbers, octal numbers, or decimal integers with more than 32 bits. 64-bit integers can be printed with the format ``\%lX''.
\vv


\chapter{Technical description}

\section{Setting the configuration} \label{SettingConfiguration}
Precompiled bitstream files for different configurations are provided in \\
\href{https://github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}{github.com/ForwardCom/softcoreA/tree/main/deviceA7100T}:
\begin{itemize}
\item A1T100R32.bit: 32-bit registers. 32 kB data memory. 64 kB code memory. 68 MHz
\item A1T100R64.bit: 64-bit registers. 32 kB data memory. 64 kB code memory. 58 MHz
\end{itemize}
\vv

Other configurations can be built as follows.
\vv

The file defines.vh contains a choice of configuration files to include.
Use config\_r32.vh for a configuration with 32-bit registers, or config\_r64.vh for 64-bit registers. You may make your own configuration with different memory sizes or other parameters.
\vv

Make sure you set the clock frequency to the value specified in the configuration file. To change the clock frequency, open the Sources window in Vivado, expand the ``top'' node, double click on clock\_generator, click on the Output Clocks tab, and set the requested output frequency to the desired value.
\vv

The parameters in the configuration files are described here:
\vv
\vv

\textbf{32 or 64 bit registers} \\
The general purpose registers can be 32 or 64 bits. The line\\
\hspace{10mm} \textasciigrave define SUPPORT\_64BIT \\
enables support for 64-bit registers and calculations. Comment out this line for 32-bit registers. The 32-bit configuration uses significantly fewer hardware resources than the 64-bit version. This has consequences for the maximum clock frequency.
\vv

All calculations work the same with 32-bit registers as with 64-bit registers, except that only the lower 32 bits of a 64-bit result are saved.
\vv
590
\vv
591
 
592
\textbf{Clock frequency} \\
593
The CPU clock frequency defined in the line \\
594
\hspace{10mm} \textasciigrave define CLOCK\_FREQUENCY \\
595
must match the value set with the clocking wizard.
596
\vv
597
 
598
The maximum clock frequency depends on the configuration. Differences in register size, memory sizes, and other details influence the maximum clock frequency.
599
If you change the configuration, you may need to experiment to find a suitable clock frequency. Vivado will give a warning if the timing requirements cannot be met at the specified clock frequenty.
\vv
\vv


\textbf{Serial input and output} \\
The baud rate for the RS232 serial input and output is defined by the line \\
\hspace{10mm} \textasciigrave define BAUD\_RATE \\
\vv

The serial input and output have queues with power-of-2 sizes defined by the lines \\
\hspace{10mm} \textasciigrave define IN\_BUFFER\_SIZE\_LOG2 \\
\hspace{10mm} \textasciigrave define OUT\_BUFFER\_SIZE\_LOG2 \\
\vv
\vv


\textbf{Memory sizes} \\
The code memory and data memory have power-of-2 sizes defined by the lines \\
\hspace{10mm} \textasciigrave define CODE\_ADDR\_WIDTH \\
\hspace{10mm} \textasciigrave define DATA\_ADDR\_WIDTH \\
\vv

The number of entries in the call stack is a power of 2 defined by the line \\
\hspace{10mm} \textasciigrave define STACK\_POINTER\_BITS \\
\vv
\vv


\textbf{Mask register size} \\
Not all of the bits of a register are needed when the register is used as a mask. In most cases, only one bit is used. We can save hardware resources by not propagating all the bits of a mask register to the execute stage. The number of mask bits to use is defined by the line \\
\hspace{10mm} \textasciigrave define MASKSZ \\
\vv

This size must be sufficient to contain an instruction tag, i.e. at least TAG\_WIDTH bits. More bits may be needed if the mask register contains option bits. A full register size may be needed if the mask register is used for other purposes.
\vv
\vv


\textbf{Instruction tag size} \\
All instructions are identified by a tag when they propagate through the pipeline. The instruction tag must have enough bits to distinguish the maximum number of instructions in flight or pending. This is defined by the line \\
\hspace{10mm} \textasciigrave define TAG\_WIDTH \\
\vv
\vv


\textbf{Loader} \\
The loader is supplied as a file containing the compiled loader code in hex format. The loader file loader.mem is assembled from the source file loader.as and linked with the option -hex2, which makes sure that the loader file is organized with 16 hexadecimal digits (64 bits) per line.
\vv

The name and maximum size of the loader are defined by the lines \\
\hspace{10mm} \textasciigrave define LOADER\_FILE \\
\hspace{10mm} \textasciigrave define MAX\_LOADER\_SIZE \\
\vv

The loader code is explained in section \ref{Chap:Loader}.
\vv
\vv


\section{Hardware in/out connections}
The input and output pins on the FPGA chip are defined in the constraints file Nexys-A7-100T.xdc. This file is based on a template provided by Xilinx and modified to connect the internal signals to specific pins for switches, LEDs, board headers, etc.
\vv

This constraints file is specific to the Nexys A7-100T development board. A different file and different connections are needed for different FPGA models and different development boards.
\vv

A further constraints file, bitstream\_settings\_a.xdc, contains mainly code that is autogenerated by the Vivado tool.
\vv
\vv


\section{Building the code}
The A softcore version 1.00 has been developed with Xilinx Vivado version 2020.1. Note that some versions of Vivado do not support the 7 series FPGAs.
\vv

The Vivado tool places source files in a deep and inflexible directory structure that is not very convenient for version control. Therefore, the GitHub repository includes a zip file named A1.xpr.zip containing the entire project directory and files as well as all the individual source files. This file was generated with the menu: File $\rightarrow$ Project $\rightarrow$ Archive.
\vv

Users of the softcore will need to unpack the zip file into an appropriate base directory to get started, and then replace individual source files with any versions that are newer than the zip file.
\vv

The following source file directories contain files that you may modify or replace with newer versions:\\
\{base\}/A1/A1.srcs/sources\_1/new: Verilog files, header files, memory file \\
\{base\}/A1/A1.srcs/constrs\_1/new: device-specific constraints files \\
\vv

All the other directories contain only files generated by the Vivado tool. They should not be modified manually, but may be deleted.
\vv

To rebuild the FPGA bitstream after any modifications, open the project file
A1.xpr with Vivado. Press the Generate Bitstream icon (green down arrow). This will take approximately 20 minutes. If successful, you can load the new softcore into the FPGA board as described in chapter \ref{Chap:GettingStarted}, page \pageref{Chap:GettingStarted}.
\vv

Prebuilt bitstream files for different configurations are provided at GitHub under devices.
\vv

\subsubsection{Warnings during build}
Vivado produces several warnings during the build process. The following warnings can safely be ignored:
\vv

\begin{itemize}
\item {}[Netlist 29-345] The value of SIM\_DEVICE on instance 'clock\_generator\_inst/inst/clkout1\_buf' of type 'BUFGCE' is 'ULTRASCALE'; ... source netlist.

\item {}[Designutils 20-4072] Reference run did not run incremental synthesis because the design is too small; reverting to default synthesis

\item {}[Constraints 18-5210] No constraints selected for write...

\item {}[Power 33-332] Found switching activity that implies high-fanout reset nets being asserted for excessive periods of time which may result in inaccurate power analysis.

\item {}[Vivado 12-1017] Failed to delete one or more files in run directory .../A1/A1.runs/synth\_1

\item {}[Synth 8-6841] Block RAM (ram\_reg) originally specified as a Byte Wide Write Enable RAM cannot take advantage of ByteWide feature and is implemented with single write enable per RAM

\item {}[Synth 8-7129] Port instruction\_in[29] in module ... is either unconnected or has no load

\item {}[Synth 8-6841] Block RAM (ram\_reg) originally specified as a Byte Wide Write Enable RAM cannot take advantage of ByteWide feature and is implemented with single write enable per RAM due to following reason...

\end{itemize}
\vv

You may disable these unwanted warnings, because a long list of harmless warnings creates a risk that other warnings are overlooked. Warnings other than these may indicate serious problems.
\vv

\subsubsection{Timing problems}
A modified softcore often fails to meet the timing requirements when it is built. The Vivado tool is not always able to find an optimal routing pattern. Small changes in the code may result in a different routing pattern that is either faster or slower. If only a few signal paths fail to meet the timing requirements by a small negative slack, then it is possible that the problem will go away after any small change in the code, for example in the debug signals or the number of bits in the mask register.
\vv

If the timing problem is more severe, then you need to redesign the code that is causing the problem, or lower the clock frequency.
\vv

\section{Description of the hardware code} \label{Chap:DescriptionHWCode}
The hardware code for the ForwardCom softcore is written in the hardware description language SystemVerilog.
\vv

The CPU is based on a Harvard architecture with separate code memory and data memory that can be accessed simultaneously. There is write access to the code memory for the sake of the on-chip loader, but it is not possible to read from the code memory by any means other than fetching code.
\vv

The CPU design has six pipeline stages:
\vv

\begin{enumerate}
\item Fetch
\item Decode
\item Register read
\item Address generation
\item Data read
\item Execute
\end{enumerate}
\vv

Instructions propagate one by one through the pipeline, advancing one stage per clock cycle.
\vv

The hardware design consists of modules. Each pipeline stage is a module. In addition to this, there are modules for code memory, data memory, call stack, input/output, etc. Each module has registers (D flip-flops) on its outputs, but not on its inputs. This means that it takes one clock cycle to send a signal from one module to another. The complexity within each module is limited to what can be completed in one clock period.
\vv

The following is a detailed description of each pipeline stage and each module.
\vv

\subsection{Fetch stage}
The fetch stage in the pipeline fetches code from the code memory at a rate of two words (64 bits) per clock cycle from even addresses. The code words are loaded into a buffer with a size of 8 words.
\vv

While dispatching one instruction to the decode stage, the fetch unit looks at the next instruction in the buffer in order to identify jump, call, and return instructions as early as possible. The ISA is designed so that these instructions can be identified easily. Direct, unconditional jump and call instructions with a 24-bit relative address are handled immediately by the fetch unit so that the target instruction can be loaded from the code memory with a delay of only one clock cycle. Return instructions are handled equally fast, using a prefetched return address from the call stack. This is possible only because the call stack and data stack are separate and independent.
\vv

Conditional jumps, indirect jumps and calls, and any other control transfer instructions have to go through the pipeline. In this case, the fetch unit has to wait for a target address from the ALU. This causes a delay of 7 clock cycles for taken jumps. Not-taken conditional jumps are one clock cycle faster because the target is already in the code buffer.
\vv

The fetch unit sends a return address to the call stack for all kinds of call instructions.
\vv

It takes two clock cycles to fetch code from the code memory: one clock cycle for sending the address to the code memory, and one clock cycle for getting the result back to the fetch unit. The fetch unit keeps prefetching code from the next address as long as the code buffer is not full. A shift register named next\_underway keeps track of whether the next code words are underway from the code memory. Another shift register, target\_underway, does the same after a jump. Jump addresses are fed directly from the ALU to the code memory in order to save one clock cycle.
\vv

A reset signal will cause a jump to the start address of the loader. A restart signal will cause a jump to the restart address, which is the loader address plus 1.
\vv

\subsection{Decode stage}
The decoder identifies the instruction format, category, number of operands, register use, result type, etc. It also assigns a unique tag to identify each instruction in the pipeline.
\vv

The outputs from the decoder include the variables listed below. Possible values for each variable are defined in the configuration file, described on page \pageref{SettingConfiguration}.
\vv

\begin{itemize}
\item vector\_out: 1 if vector instruction
\item category\_out: single format, multiformat, or jump instruction
\item format\_out: instruction format A, B, C, E
\item rs\_status\_out: use of RS register field (source register, vector, pointer, etc.)
\item rt\_status\_out: use of RT register field (source register, vector, index, vector length)
\item ru\_status\_out: use of RU register field (source register, vector)
\item rd\_status\_out: use of RD register field (source register, vector)
\item mask\_status\_out: use of mask register field (source register, vector)
\item mask\_options\_out: mask may contain option bits (floating point only)
\item mask\_alternative\_out: mask register and fallback register used for alternative purposes
\item fallback\_use\_out: which register is used for fallback
\item num\_operands\_out: number of source operands, 0 to 3
\item result\_type\_out: type of result (normal register, system register, memory, other)
\item offset\_field\_out: size of memory address field
\item immediate\_field\_out: size of immediate operand field
\item scale\_factor\_out: scale factor for array index
\item index\_limit\_out: array index has a limit
\end{itemize}
\vv


\subsection{Register read stage}
The register file is included in the same module as this pipeline stage so that it is possible to read register values without delay. This includes the 32 general purpose registers as well as the numeric control register (NUMCONTR), data pointer (DATAP), and thread data pointer (THREADP), but not other system registers.
\vv

All registers needed by an instruction are read at this stage, and the values are propagated through the pipeline to the stage where they are needed, i.e. the address generation stage or the execution stage.
\vv

The instruction tag is written to the destination register in the register file. This indicates to subsequent instructions that the value of this register is not available yet. The tag identifies the instruction that is going to deliver the value. Subsequent stages in the pipeline must monitor the result buses for results marked with the tag of any missing register value. The tag is then replaced by the new value and propagated through the rest of the pipeline.
This mechanism may be considered a form of register renaming. Multiple values of the same logical register may be in flight in the pipeline at the same time. The tag of a missing register value is replaced by the actual value as soon as it appears on one of the result buses. All stages in the pipeline after the register read stage monitor the result buses for missing register values.
\vv

Each entry in the register file includes an extra bit which is 1 if the entry contains a tag for a pending value, or 0 if the entry contains the actual value of the register. The same applies to values propagating through the pipeline.
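\vv

The tag mechanism can be illustrated with a small software model. The following C sketch is not taken from the softcore source code. The names reg\_entry, mark\_pending, and snoop\_result\_bus are hypothetical, and the hardware performs the comparison in parallel logic rather than by function calls.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

/* One register entry: either an actual value or the tag of
   the instruction that will deliver the value (model only) */
typedef struct {
    bool     is_tag;  /* 1: data is a pending tag, 0: a value */
    uint64_t data;
} reg_entry;

/* Register read stage: mark destination register as pending */
void mark_pending(reg_entry *r, uint64_t tag) {
    r->is_tag = true;
    r->data   = tag;
}

/* Later stages: watch a result bus, replace a matching tag */
void snoop_result_bus(reg_entry *r, bool bus_valid,
                      uint64_t bus_tag, uint64_t bus_value) {
    if (bus_valid && r->is_tag && r->data == bus_tag) {
        r->is_tag = false;
        r->data   = bus_value;
    }
}
\end{lstlisting}

A result with a non-matching tag leaves the entry unchanged, so the entry keeps waiting for the instruction that owns its tag.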
\vv

It is important that tags are written to the register file in the same pipeline stage as all register reads. This makes sure that we get the correct version of a changing register.
\vv

The register read stage sends the values or tags of up to five registers, including a mask register, to the address generation stage in the pipeline.
\vv


\subsection{Address generation stage}
The address generator calculates the address of a memory operand.
All memory addresses are relative to a base pointer, which may be a general purpose register, DATAP, THREADP, or IP.
The address may consist of three components: a base pointer, a scaled array index, and a constant offset. A base pointer register is always present; the other components are optional.
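\vv

As a minimal sketch of this calculation, the following C function adds the three components. The function name and parameter names are hypothetical and do not come from the softcore source code. An absent index or offset is modelled by passing zero.

\begin{lstlisting}
#include <stdint.h>

/* address = base pointer + index * scale factor + offset.
   The arithmetic is done on unsigned values so that negative
   indexes, scale factors, and offsets wrap around correctly */
uint64_t memory_address(uint64_t base, int64_t index,
                        int64_t scale, int64_t offset) {
    uint64_t scaled = (uint64_t)index * (uint64_t)scale;
    return base + scaled + (uint64_t)offset;
}
\end{lstlisting}

For example, a base pointer of 0x1000, an index of 3 with scale factor 8, and an offset of -4 give the address 0x1014.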
\vv

The address generator sends the calculated address of a memory operand to the data memory for any instruction that involves memory read or write. Memory read and memory write operations are initiated in the same pipeline stage so that a memory read never has to wait for a preceding write.
\vv

The address generator needs to wait for any missing register value needed for base pointer, array index, vector length, or a value to write. This will generate stalls in the preceding pipeline stages and bubbles in the subsequent pipeline stages whenever a needed register value is not available yet.
\vv

Testing the validity of the calculated memory address should ideally be implemented in this stage, but unfortunately this leads to timing problems because the address calculation involves two additions. An extra compare operation after the two additions is difficult to fit into the time slot of a single clock period. Therefore, the detection of memory access violations is currently postponed to the next pipeline stage.
\vv

The address generation stage does some additional decoding in parallel with the address calculation. It sorts the source operands of the instruction according to the order of priority defined by the ForwardCom standard:
immediate operand, memory operand, RT register, RS register, RU register, RD register. Up to three source operands are taken from the fields defined by the instruction format according to this order of priority, with the last operand first. Immediate operands are identified and sign-extended or uncompressed if needed. The values of all operands are sent to the next pipeline stage as operand1\_out, operand2\_out, and operand3\_out if the values are available. The output memory\_operand\_out indicates that one source operand is a memory operand that will be available after two clock cycles.
\vv

\subsection{Data read stage}
The main purpose of the data read stage is to wait one clock cycle for the value of a memory operand to arrive from the data memory. Normally, the value of a memory operand will arrive just in time for the execution stage to use it. However, the data read stage can catch and store the value in case the instruction is delayed by a pipeline stall. The data read stage also checks if a memory address is valid.
\vv

The operation to perform in the execution stage is identified in the data read stage and indicated by the variable opx\_out. The operation for multiformat instructions is identified by the OP1 field only, so that opx\_out = OP1.
\vv

Some single-format instructions are short forms of instructions that also exist in multiformat versions. These instructions are translated to the multiformat equivalents in opx\_out. Single-format instructions that do not have a multiformat equivalent are given a unique value of opx\_out.
\vv

Branch instructions are split into the corresponding ALU operation, such as "add" or "compare", in opx\_out, and the branch condition, indicated by opj\_out.
\vv

The operand type ot\_out is identified for instruction formats that do not have an operand type field.
\vv

The data read stage decides which module to use for the execution stage. This is indicated by exe\_unit\_out.
\vv


\subsection{Execution stage, ALU}
The ALU module implements the execution stage for most integer instructions, such as addition, bit operations, shift, branch, etc. The latency is one clock cycle for all these instructions.
\vv

This is the most critical pipeline stage in terms of timing, and the one that determines the maximum clock frequency. The critical path goes through three steps:
\begin{enumerate}
\item Input multiplexers. Retrieve each source operand value from the previous pipeline stage (propagated value), from one of the result buses if the value was output in the preceding clock cycle, or directly from data memory. Data memory values with a size less than 64 bits must be selected from any position within a 64-bit memory line.
\item Do the actual calculation or operation
\item Cut off the result to the desired operand size, or replace it with a fallback value in case the mask is zero
\end{enumerate}
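\vv

The third step can be sketched in C as follows. This is an illustration only, not the softcore code. The function name finish\_result is hypothetical, and the sketch assumes that bit 0 of the mask selects between the result and the fallback value and that the fallback value is cut off to the operand size as well.

\begin{lstlisting}
#include <stdint.h>

/* Select the result or the fallback value, then keep the
   low size_bits bits. size_bits is 8, 16, 32, or 64.
   Assumption: mask bit 0 enables the operation */
uint64_t finish_result(uint64_t result, uint64_t fallback,
                       uint64_t mask, unsigned size_bits) {
    uint64_t value = (mask & 1) ? result : fallback;
    uint64_t keep  = (size_bits >= 64) ? UINT64_MAX
                   : ((UINT64_C(1) << size_bits) - 1);
    return value & keep;
}
\end{lstlisting}

In hardware, the multiplexer and the size cut-off add delay after the calculation itself, which is one reason why this stage is critical for timing.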
\vv

The result is output to result bus 1 as result\_out together with register\_a\_out that identifies the destination register, and tag\_val\_out that indicates the instruction tag.
\vv

Jump and branch instructions have the additional outputs jump\_out and nojump\_out, indicating whether a branch is taken or not, and jump\_pointer\_out that gives the address of the jump target in case the jump is taken.
\vv

The ALU module contains several general units that can service more than one instruction each: a compare unit that is used for signed and unsigned compare instructions, max, min, and compare/branch instructions; a bit manipulation unit for bit operations such as clear, set, toggle, test, etc.; a barrel shifter for all shift, rotate, and move\_bits instructions; and a bit scan module for bitscan and roundp2 instructions. There are also several single-purpose units.
\vv

The bitscan instruction appears to be particularly critical in terms of timing. Future implementations may move the bitscan instruction to a unit with a 2-clock latency.
\vv


\subsection{Execution stage, in\_out\_ports}
The in\_out\_ports unit implements the execution stage for interfaces to external communication as well as several capabilities registers, performance counters, etc.
\vv

The RS232 serial interface consists of a UART receiver, a UART transmitter, and FIFO buffers on both input and output.
\vv

System registers that can be renamed, i.e. NUMCONTR, DATAP, and THREADP, are accessed in the register read stage, while all other system registers are implemented or accessed in the in\_out\_ports module. Some of these registers are read-only, while others can also be written or reset. It is necessary that reads and writes of the same register happen in the same pipeline stage.
\vv

Register results are output to result bus 1 in the same way as for the ALU unit. The instruction latency is one clock cycle.
\vv

\subsection{Execution stage, mul\_div}
The mul\_div unit implements the execution stage for instructions that have a latency higher than one clock cycle, such as multiplication and division. The result is output to result bus 2.
\vv

The multiplication and division units are not implemented yet.
\vv


\subsection{Pipeline synchronization signals}
There are no instruction queues between the pipeline stages. This means that if one pipeline stage is waiting for an operand that is not available yet, then all preceding pipeline stages must stand still and keep their data until the next pipeline stages are ready to receive a new instruction. This is called a pipeline stall. While one pipeline stage is waiting for an operand, all subsequent pipeline stages will be idle. This is called a pipeline bubble.
\vv

There are three units that can cause pipeline stalls and bubbles in the present design:
\vv

\begin{enumerate}
\item The fetch unit is waiting for an instruction to be fetched from the code memory after a jump or after the code buffer has been exhausted by multiple triple-size instructions. This causes a bubble to propagate through the entire pipeline
\item The address generation stage is waiting for a register value needed for the address calculation. This causes a stall in the preceding pipeline stages and a bubble in the subsequent stages
\item The execution stage is waiting for the value of a register operand. This causes a stall in the entire pipeline
\end{enumerate}
\vv

The handling of pipeline stalls and bubbles is delicate because stalls must be predicted one clock cycle in advance, and we must make sure that every instruction is dispatched once, and only once, from each pipeline stage to the next. The data read stage sends out predict\_tag signals giving the instruction tag of the result that is predicted to be sent to result bus 1 in the next clock cycle. The predict\_tag signal is used by the register read stage to predict whether the necessary register values will be available for the address generation stage in the next clock cycle. In the same way, the data read stage uses the predict\_tag signal to predict whether all operands will be ready for the execution stage in the next clock cycle.
\vv

The register read stage and the data read stage will send out a stall\_predict signal if they predict a stall in the next clock cycle. These signals are fed to stall\_in inputs of all preceding pipeline stages, telling them to halt and keep their data until the pipeline is ready to proceed.
\vv

A stalled pipeline stage must keep all its data unchanged, with one exception. It must watch the result buses for any results that match a missing register value for the contained instruction.
\vv

Pipeline bubbles are controlled by valid\_in and valid\_out signals. The valid\_out signal of each pipeline stage is connected to the valid\_in signal of the next stage. This signal is 1 if a valid instruction is being propagated, or 0 if there is a pipeline bubble. All other inputs must be ignored when the valid\_in signal is zero.
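\vv

The behavior of a single pipeline stage under these signals can be modelled in a few lines of C. This is a simplified sketch with hypothetical names (stage\_reg, clock\_stage). It leaves out the result bus monitoring and reduces all data held by the stage to one instruction word.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;        /* drives valid_in of next stage */
    uint32_t instruction;  /* stands in for all stage data  */
} stage_reg;

/* One clock edge of a pipeline stage (model only).
   stall_in: keep all data unchanged.
   valid_in: 0 means the inputs are a bubble, ignore them */
void clock_stage(stage_reg *s, bool stall_in,
                 bool valid_in, uint32_t instruction_in) {
    if (stall_in) return;          /* hold the instruction  */
    s->valid = valid_in;           /* pass on valid or bubble */
    if (valid_in) s->instruction = instruction_in;
}
\end{lstlisting}

A stalled stage keeps its instruction so that it is dispatched exactly once when the stall ends, and a bubble only clears the valid flag.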
\vv

Future designs may need handshaking signals for detecting whether a result bus is vacant, but this is not relevant yet.
\vv


\subsection{Data memory}
The module named data\_memory contains a RAM for data only. The size is a power of 2 determined by the variable DATA\_ADDR\_WIDTH defined in a configuration file.
\vv

The data RAM has one read port and one write port, each 64 bits wide. Reads and writes of less than 64 bits are allowed, but all reads and writes must have natural alignment.
\vv

The input read\_data\_size indicates data sizes of 8, 16, 32, or 64 bits. Reads of less than 64 bits are left-justified in the output read\_data\_out.
\vv

The 64-bit input write\_data\_in is divided into eight bytes, each enabled separately by one bit of the 8-bit input write\_enable. Memory writes of less than 64 bits are controlled by the address generator by placing the data in the right position and enabling only the bytes that contain actual data.
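\vv

The following C sketch shows how the data position and the byte enable bits might be derived for a naturally aligned write. It is an illustration under the assumption that byte lane $i$ of the 64-bit line corresponds to bits $8i$ to $8i+7$ (little endian). The names write\_request and make\_write are hypothetical.

\begin{lstlisting}
#include <stdint.h>

typedef struct {
    uint64_t write_data;    /* data moved to its byte lanes */
    uint8_t  write_enable;  /* one enable bit per byte lane */
} write_request;

/* size_bytes is 1, 2, 4, or 8. The address must be
   naturally aligned. Assumption: little endian byte lanes */
write_request make_write(uint64_t address, uint64_t value,
                         unsigned size_bytes) {
    unsigned lane = (unsigned)(address & 7);
    write_request w;
    w.write_enable =
        (uint8_t)(((1u << size_bytes) - 1u) << lane);
    w.write_data = value << (8 * lane);
    return w;
}
\end{lstlisting}

A 16-bit write to an address ending in 6, for example, enables only the two uppermost byte lanes (write\_enable = 0xC0).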
\vv


\subsection{Code memory}
The module named code\_memory contains a memory for code only. The size is a power of 2 determined by the variable CODE\_ADDR\_WIDTH defined in a configuration file.
\vv

The code memory has one read port that is 64 bits wide. Unaligned reads are not possible. A read of a 32-bit word from an odd address must be implemented as a 64-bit read from the preceding even address.
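\vv

A read of a single code word can be sketched in C as shown below. The function name read\_code\_word is hypothetical, and the sketch assumes that the word at the even address occupies the low half of the 64-bit line.

\begin{lstlisting}
#include <stdint.h>

/* code_lines: code memory as an array of 64-bit lines.
   word_address: address in units of 32-bit words.
   Assumption: the even word is in the low half of a line */
uint32_t read_code_word(const uint64_t *code_lines,
                        uint32_t word_address) {
    /* 64-bit read from the even address */
    uint64_t line = code_lines[word_address >> 1];
    return (word_address & 1) ? (uint32_t)(line >> 32)
                              : (uint32_t)line;
}
\end{lstlisting}

The lowest address bit is never sent to the memory. It only selects which half of the line is used.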
\vv

The code memory has one write port that can accept 64-bit writes to an even address or 32-bit writes to an even or odd address. The write port is intended for the loader only.
\vv

A loader can be placed at the top of the code memory. A hex file with a name defined by LOADER\_FILE is placed at an address calculated as the end address of the code memory minus MAX\_LOADER\_SIZE. This file must be organized as 64-bit lines.
\vv


\subsection{Call stack}
The call stack is used for function return addresses only. The size is a power of 2 determined by the variable CALL\_STACK\_POINTER\_BITS defined in a configuration file.
\vv

The call stack is accessed by the instruction fetch unit only. A return address is pushed on the call stack when a call instruction is detected, and popped from the call stack when a return instruction is detected. The call stack cannot be accessed by data push and pop instructions.
\vv

Return addresses are communicated via the push\_data input and the pop\_data output. The value of pop\_data is prefetched in advance so that it can be accessed immediately by the fetch unit. The value of pop\_data is valid one clock cycle after a push or pop operation. Therefore, there must be a delay of one clock cycle between a call or return and the next return. This is not a problem in the current design because the fetch unit has a delay of at least one clock cycle after any call or return anyway.
\vv


\subsection{Debug displays}
There are two debug displays as explained in chapter \ref{Chap:Debugger}: an LED display on the FPGA board, and an optional LCD display on an add-on board.
\vv

The LED display is controlled by a Verilog header file, debugger.vh. This file is included in the top module so that it has direct access to all global signals. Local signals inside a module are accessed through debug outputs from the module in question. You may modify the file debugger.vh to get access to a particular signal. debugger.vh includes an instance of a 7-segment display driver seg7.sv. The debugger writes an error code in the 7-segment LED display instead of the selected debug information in case of a runtime error.
\vv

The external LCD display is driven by the module debug\_display.
It includes an instance of an LCD driver named lcd.
The LCD driver contains a state machine that writes to the LCD displays, sending 4 bits at a time.
\vv

The debug\_display module writes information about the contents of each pipeline stage to the LCD displays. The debug\_display module has access only to the outputs of each module. Source operand values in the execute stage are reconstructed inside the debug\_display module rather than transferred from the execute modules. The translation of operation codes to opx, which occurs in the data read stage, is mirrored inside the debug\_display module for the display of pipeline stages prior to the data read stage. This allows the debug display to write the names of instructions in all pipeline stages.
\vv

The debug\_display module has internal registers on most signals to avoid timing problems and to diminish restrictions on signal routing. It is no problem that the data are delayed a few clock cycles because the LCD display is only useful during single-stepping.
\vv


\subsection{Top module}
The top module instantiates the pipeline stages and other modules and makes the connections between them.
\vv

Input and output connections to external devices are defined by the top module.
\vv


\subsection{Byte based and word based addresses}
Read and write address signals for the data memory are absolute addresses with byte granularity. Address signals for the code memory have word granularity (4 bytes) and are relative to the start of the code memory. The same applies to the instruction pointer.
\vv

Translation from word addresses to byte addresses occurs in the address generator when calculating an address relative to IP. Translation from byte addresses to word addresses occurs in the ALU when executing an indirect jump, and in the code memory module when writing to the code memory (only the loader should write to code memory).
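\vv

The two translations can be written as the following C sketch. The function names are hypothetical, and the parameter code\_base, the absolute byte address where the code memory begins, is an assumption introduced for this illustration.

\begin{lstlisting}
#include <stdint.h>

/* Word address relative to the start of code memory
   -> absolute byte address. code_base is an assumed
   parameter: the byte address where code memory begins */
uint64_t word_to_byte(uint32_t word_address,
                      uint64_t code_base) {
    return code_base + 4 * (uint64_t)word_address;
}

/* Absolute byte address -> word address relative to the
   start of code memory, e.g. for an indirect jump */
uint32_t byte_to_word(uint64_t byte_address,
                      uint64_t code_base) {
    return (uint32_t)((byte_address - code_base) >> 2);
}
\end{lstlisting}

Because the factor is a power of 2, the multiplication and division reduce to a shift by two bit positions in hardware.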
1014
\vv
1015
 
1016
 
1017
\section{Loader} \label{Chap:Loader}
1018
The loader is a piece of software written in ForwardCom assembly language. The loader can load an executable program (.ex file) into the program memory and data memory. The source code for the loader is in the file loader.as.
1019
\vv
1020
 
1021
The loader is placed at the end of code memory. The start address of the loader is calculated as the size of the code memory minus MAX\_LOADER\_SIZE. This address is hard-coded into the softcore. The CPU will jump to this address at start and when the load button is pressed.
1022
\vv
1023
 
1024
Debugging the loader is easier when  MAX\_LOADER\_SIZE is a multiple of 0x100 so that the last two digits of addresses on the debug display match addresses in the loader assembly listing.
\vv
 
A loaded program can be re-started by jumping to the RESTART code, which is placed at the loader start address plus 1.
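\vv

The address arithmetic in this section can be illustrated with made-up numbers. Assume, purely for illustration, a code memory of 0x4000 words and MAX\_LOADER\_SIZE = 0x200 words. Neither value is taken from the actual configuration, and both are assumed to be counted in 32-bit words:
\[ \mathrm{loader\ start} = \mathrm{0x4000} - \mathrm{0x200} = \mathrm{0x3E00} \]
\[ \mathrm{RESTART} = \mathrm{0x3E00} + 1 = \mathrm{0x3E01} \]
Because both sizes in this example are multiples of 0x100, the loader start address ends in 00. A loader instruction at offset 0x2A in the assembly listing would then appear at address 0x3E2A on the debug display, with the same last two digits.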
\vv
 
The loader works as follows. The user uploads the desired program as an executable file (.ex file), which the loader reads through the RS232 serial interface. The first part of the .ex file contains a file header and program headers, according to the official ForwardCom definition in the file elf\_forwardcom.h. These headers are read into the beginning of data RAM and analyzed to find code sections, const data sections, and data sections. The raw data of these sections are then read and placed in code memory and data memory. Program code sections are placed in code memory. The program memory starts at an address immediately following the end of data memory, so constant (read-only) IP-based data sections are placed at the end of data RAM, where they can be addressed relative to the instruction pointer, IP.
\vv
 
Data sections with read/write access are read into the first vacant data RAM address and later moved down to address zero when the loader has finished using the program headers. The rest of data RAM is then cleared to zero.
\vv
 
The data stack pointer (r31) is set to the top of vacant data RAM, which is the beginning of const data. The program start address, DATAP, and THREADP are set to the addresses defined in the file header. These addresses are not stored in data RAM, but inserted in the RESTART code so that all data RAM is available to the running program. When the loader has finished loading all sections, it jumps to the RESTART code. The RESTART code sets the stack pointer, DATAP, and THREADP, clears all other registers, and jumps to the program start address.
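\vv

The resulting layout of data RAM can be summarized symbolically. The symbols below are introduced here for illustration only and do not appear in loader.as: $D$ is the size of data RAM in bytes, $R$ is the total size of the read/write data sections, and $K$ is the total size of the const data sections. Alignment padding is ignored in this sketch.

\begin{tabular}{|l|l|}
\hline
\bfseries Byte address range & \bfseries Contents \\ \hline
$0 \le a < R$ & Read/write data sections \\ \hline
$R \le a < D - K$ & Vacant data RAM \\ \hline
$D - K \le a < D$ & Const data sections \\ \hline
$a \ge D$ & Code memory \\ \hline
\end{tabular}

The initial stack pointer is then $\mathrm{r31} = D - K$. With hypothetical values $D = \mathrm{0x8000}$, $R = \mathrm{0x1000}$, and $K = \mathrm{0x400}$, this gives $\mathrm{r31} = \mathrm{0x7C00}$ and leaves $\mathrm{0x7C00} - \mathrm{0x1000} = \mathrm{0x6C00}$ bytes of vacant data RAM below the stack pointer.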
\vv
 
The loader works with both the 32-bit version and the 64-bit version of the softcore.
\vv
 
The loader assumes that the .ex file is fully position-independent. It is not able to relocate code containing absolute addresses.
\vv
 
 
\section{Resource use}
This section discusses how different aspects of the softcore design affect the use of hardware resources.
\vv
 
The numbers of LUTs (lookup tables) and registers (flip flops) used in the 32-bit and 64-bit configurations are listed in table \ref{table:ResourceUse}.
 
\begin{table}[h]
\begin{tabular}{|l|l|l|l|l|}
\hline
\bfseries Module & \multicolumn{2}{c|}{\bfseries Slice registers} & \multicolumn{2}{c|}{\bfseries Slice LUTs}  \\
 & \bfseries 32-bit version & \bfseries 64-bit version & \bfseries 32-bit version & \bfseries 64-bit version  \\ \hline
Fetch  &  351 &  350  & 1309 & 1286 \\ \hline
Decode  & 175  &  188 & 400 & 465 \\ \hline
Register read & 1635  &  3010 & 3296 & 4711 \\ \hline
Address generator  & 558 & 785 & 697 & 1550 \\ \hline
Data read  & 356  & 561 & 2994 & 5267  \\ \hline
Execute ALU  & 80  & 112 & 680 & 2437 \\ \hline
Execute in/out, uart  & 703 & 1213 & 501 & 361 \\ \hline
Execute mul/div  &   &  &  &   \\ \hline
Code memory  & 15  & 15  & 103 & 115 \\ \hline
Data memory  & 3  &  3  & 99 & 98 \\ \hline
Call stack  & 29  &  29  & 442 & 445 \\ \hline
Debug display  & 485  & 485  & 747 & 752 \\ \hline
Total  & 4639  & 7032  & 12026 & 17873 \\ \hline
\end{tabular}
\caption{Resource use of softcore}
\label{table:ResourceUse}
\end{table}
\vv
 
The data memory uses a disproportionate amount of FPGA slices because it is implemented as distributed RAM. I have problems making it use block RAM instead.
\vv
 
The propagation of source operand values through the pipeline uses 32 or 64 bits per operand in each pipeline stage. If a pipeline stage is stalled, then it needs to keep the old instruction on its output until the next stage is ready to receive it. At the same time, it needs to catch operand values from the result buses for the pending next instruction. This doubles the required number of registers. If we allow three source operands and a mask of 64 bits each, then we need $4 \times 64 \times 2 = 512$ flip flops just for catching and propagating data in each pipeline stage. This number of flip flops is not a problem for an FPGA with 100,000 cells or more, but the associated logic and routing consume a significant amount of resources. We may reduce the resource use for smaller applications by reducing the register size to 32 bits. The mask register may even be reduced to 16 bits.
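\vv

The register count can be written out as a worked calculation. Three source operands plus one mask give four values per instruction, each value is 64 bits wide, and each needs two sets of flip flops:
\[ (3 + 1) \times 64 \times 2 = 512 \]
If the operands are reduced to 32 bits and the mask to 16 bits, the same reasoning would give approximately
\[ (3 \times 32 + 16) \times 2 = 224 \]
flip flops per pipeline stage. This second figure is only an estimate derived from the same formula. It is not a measured result.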
\vv
 
The result buses also consume a significant amount of resources because each pipeline stage must monitor all result buses and compare the result tags with all four source operands (including the mask register). This is a particular concern in the execution stage, where timing is critical.
Therefore, we may let multiple execution units share the same result bus and use handshaking logic to check when a result bus is vacant.
\vv
 
The timing problems that we encounter are often dominated by routing delays rather than logic delays. It is necessary to take this into account and limit the number of cross connections between modules.
\vv
 
An important feature that distinguishes ForwardCom from other ISAs is that each instruction can have many different variants. This feature is actually fairly cheap in terms of hardware resources. There are many different instruction formats, and it takes many lines of Verilog code to decode these formats, but the resulting hardware is fairly trivial combinational logic with a limited number of inputs and outputs. It requires only a few bits to tell where an operand is or what a register is used for, in contrast to the many bits used for handling and propagating up to three or four operands of 64 bits each. The timing of the decoding process is not critical at all because the logic to handle all aspects of the decoding process can be divided over several pipeline stages. Likewise, we have several pipeline stages to handle the decoding and unpacking of immediate constants and option bits, and it is not important which pipeline stage we place this code in.
\vv
 
The fact that ForwardCom has instructions that combine memory read with arithmetic or logic instructions comes at a significant price, though, because it makes the pipeline longer.
\vv
 
 
\section{Future development}
This section discusses some prospective future extensions to the ForwardCom softcore.
\vv
 
Multiplication and division instructions need to have a latency of more than one clock cycle. The mul\_div module is prepared for this purpose but not implemented yet. These instructions will use result bus 2.
\vv
 
Push and pop instructions will be split into multiple micro-operations in the decode stage. It will be possible to push or pop multiple registers by a single instruction.
\vv
 
Vector instructions will be implemented with a separate pipeline for each 64-bit vector lane. Each lane will have its own register file holding a 64-bit slice of each vector register. Designs with small vectors may have direct access to a common data RAM. Designs with long vectors may have one data cache or RAM for each vector lane. Memory access will be fast when aligned by the maximum vector length but slower when not aligned because it requires a very big barrel shifter to exchange data between lanes. The vector pipeline probably needs an extra stage for data alignment.
\vv
 
A superscalar softcore will be able to fetch and decode at least two instructions per clock cycle. There may be two pipelines for non-vector instructions: one for instructions with memory operands, and a shorter pipeline for instructions without memory operands. There will probably be only one pipeline for vector instructions. It will be possible to execute two instructions per clock cycle if they can go to two different pipelines. The short pipeline for instructions without memory operands will make jump delays shorter. The instructions will be executed in order within each pipeline.
\vv
 
 
\section{Performance metrics}
The maximum clock frequency varies from 50 to 70 MHz depending on register size, memory sizes, and other configuration details. The current settings are 67.5 MHz for the 32-bit register version and 58 MHz for the 64-bit register version of the softcore. The maximum clock frequency can be increased by a few MHz when the memory sizes are reduced or some instructions removed from the ALU. The bitscan instruction, in particular, is critical.
\vv
 
The maximum throughput is one instruction per clock cycle. The latency is one clock cycle for most instructions.
\vv
 
Unconditional direct jumps, calls, and returns have a latency of 2 clock cycles.
\vv
 
Conditional jumps have a latency of 7 clocks when taken and 6 clocks when not taken. Indirect and multiway jumps and calls have a latency of 7 clocks.
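\vv

These cycle counts can be converted to time using the clock frequencies of the current settings. The clock period is
\[ 1 / 67.5\ \mathrm{MHz} \approx 14.8\ \mathrm{ns} \qquad \mathrm{and} \qquad 1 / 58\ \mathrm{MHz} \approx 17.2\ \mathrm{ns} \]
for the 32-bit and 64-bit versions, respectively. A taken conditional jump with a latency of 7 clocks therefore costs roughly
\[ 7 / 67.5\ \mathrm{MHz} \approx 104\ \mathrm{ns} \qquad \mathrm{or} \qquad 7 / 58\ \mathrm{MHz} \approx 121\ \mathrm{ns} , \]
while the maximum throughput of one instruction per clock cycle corresponds to 67.5 or 58 million instructions per second. These figures are simple arithmetic on the numbers in this section, not separate measurements.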
\vv
 
Memory reads have a delay of 2 clocks after the modification of a pointer or index register that is needed in the address calculation. There is no delay for memory reads if address registers are not modified in the preceding two instructions.
\vv
 
Memory writes have a delay similar to memory reads if address registers are modified within the preceding two instructions. The same delay applies if the register holding the value to write is modified within the preceding two instructions.
\vv
 
 
\section{Portability}
The hardware code written in SystemVerilog avoids device-specific code. It should be possible to port this to other FPGA models, other FPGA brands, or even to an ASIC.
\vv
 
The clock generator is auto-generated by the Vivado Clocking Wizard. It needs to be adjusted or redesigned for other FPGAs.
\vv
 
The 7-segment display on the Nexys A7 FPGA board is used for debugging and error messaging. Other FPGA boards may have other kinds of display.
\vv
 
The constraints files contain device-specific information about pin connections etc. The connections to external devices need to be redesigned for other FPGAs.
\vv
 
 
\chapter{Revision history}
 
\subsubsection{Version 1.00, 2021-08-07}
 
\begin{itemize}
\item First published ForwardCom softcore. Integer only. No vectors.
\end{itemize}
\vv
 
 
\chapter{License}
The hardware description code is published under the \href{https://cern-ohl.web.cern.ch/}{CERN Open Hardware License weakly-reciprocal}, version 2 or later. This is a license for hardware similar to the GNU General Public License for software.
\vv
 
All the code is written from scratch without importing any code from elsewhere. This makes sure that all the code has the same license.
\vv
 
This manual is copyrighted 2021 by Agner Fog with a
\href{https://creativecommons.org/licenses/by/4.0/legalcode}{Creative Commons license CC-BY}.
 
 
\end{document}