% chapter included in forwardcom.tex
\documentclass[forwardcom.tex]{subfiles}
\begin{document}
\RaggedRight
%\newtheorem{example}{Example}[chapter]  % example numbering
\lstset{language=C}            % formatting for code listing
\lstset{basicstyle=\ttfamily,breaklines=true}

\chapter{Programming manual}\label{chap:programmingManual}
This manual is based primarily on assembly language.
Instructions for other programming languages should be described in the manuals for the respective compilers.
\vv


\section{Assembly language syntax} \label{AssemblyLanguageSyntax}

\subsection{Introduction}

The ForwardCom assembly language is standardized to avoid the confusion
that we have seen with other instruction sets.
The basic syntax of an instruction looks like a function call:
\vv

\begin{lstlisting}
    datatype destination_operand = instruction(source_operands)
\end{lstlisting}
\vv

This syntax leaves no doubt about which operands are source and destination.
You can also use common operators, such as + - * / etc. instead of instructions that correspond
to these operators.
\vv

Branches and loops can use conditional jump instructions or the high-level language keywords:
if, else, for, do, while, break, continue.
\vv

Before defining the syntax details we will look at a few examples.
\vv

The following example shows a function that calculates the factorial of an integer:


\begin{example}
\label{exampleFactorial}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
code section execute        // define executable code section

// factorial function calculates n!
// input: r0, output: r0
_factorial function public
if (uint64 r0 <= 20) {      // check for overflow, 64 bit unsigned
   uint64 r1 = 1            // start with 1
   while (uint64 r0 > 1) {  // loop through r0 values
      uint64 r1 *= r0       // multiply all values
      uint64 r0--           // count down to 1
   }
   uint64 r0 = r1           // put result in r0
   return                   // normal return from function
}
int64 r0 = -1               // overflow. return max unsigned value
return                      // error return
_factorial end              // end of function

code end                    // end of code section
\end{lstlisting}
\vspace{4mm}

The next example illustrates the use of the efficient vector loop described on page \pageref{vectorLoops}.
It calculates the polynomial $y = 0.5 x^2 - 4 x + 1$ for all elements of an array $x$
and stores the results in an array $y$.

\begin{example}
\label{examplePolyn}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
data section read write datap       // define data section
% arraysize = 100                   // define constant
float x[arraysize], y[arraysize]    // define arrays
data end                            // end of data section

code section execute                // define code section

// This function calculates a polynomial on all elements of an
// array x and stores the results in an array y
_polyn function public
int64 r1 = address([x+arraysize*4]) // end of array x
int64 r2 = address([y+arraysize*4]) // end of array y
int64 r0 = arraysize*4              // array size in bytes = 400

for (float v0 in [r1-r0]) {         // vector loop
   float v0 = [r1-r0, length=r0]    // read x vector
   float v1 = v0 * 0.5              // 0.5 * x
   float v1 -= 4                    // 0.5 * x - 4
   float v0 = v0 * v1 + 1           // (0.5 * x - 4) * x + 1
   float [r2-r0, length=r0] = v0    // save y vector
}
return                              // return from function
_polyn end                          // end of function

code end                            // end of code section
\end{lstlisting}
\vv

While this looks very much like high-level language code, you have to explicitly specify which register to use for each variable, and you cannot put more code on one line than fits into a single instruction.
In example \ref{examplePolyn}, the line {\ttfamily float v0 = v0 * v1 + 1 } is allowed because it fits the {\ttfamily mul\_add2} instruction, but we cannot write {\ttfamily float v1 = v0 * 0.5 - 4 } because this instruction cannot have two immediate constants.
The line {\ttfamily int64 r1 = address([x+arraysize*4])} also fits a single instruction because
{\ttfamily arraysize*4} can be calculated at assembly time (it involves only constants) and the result can be added to the relative address of {\ttfamily x} by the linker. The only thing the {\ttfamily address} instruction has to do at runtime is to add a constant to the {\ttfamily datap} pointer.
\vv

More code examples are given in chapter \ref{chap:codeExamples}.
\vv


\subsection{File format} \label{assemblyFileFormat}
An assembly file can be in ASCII or UTF-8 format. A UTF-8 byte order mark is allowed at the beginning of the file, but not required.
\vv

Whitespace can be spaces or tabs. The use of tabs is discouraged because different editors may have different tab stops.
\vv

Line breaks can be UNIX style (\textbackslash n), classic Mac style (\textbackslash r), or Windows style (\textbackslash r\textbackslash n). There is no limit to the line length.
\vv

Comments are C style:  /* */ or // \\
\vv

Nested comments are allowed. This makes it possible to comment out a piece of code that already contains comments.
\vv
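
As a small sketch of why nesting is useful, the following fragment disables an instruction that already carries a block comment of its own. The instructions are arbitrary placeholders, and the fragment has not been run through an assembler.

\begin{lstlisting}[frame=single]
/* temporarily disabled:
int r1 = r2 + 18     /* this inner comment ends here */
this line is still inside the outer comment
*/
int r1 = r2 + 1      // first line that is assembled
\end{lstlisting}
\vv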


\subsection{Sections} \label{assemblySections}
A section containing code or data is defined as follows:
\begin{lstlisting}[frame=single]
name section options
...
name end
\end{lstlisting}
\vv

The following options can be defined for a section. Multiple options are separated by space or comma.

\begin{tabular}{|p{28mm}p{130mm}|}
\hline
read & Readable data section. \\
write & Writeable data section. \\
execute & Executable code section. This does not imply read access. Write access is usually not allowed for executable sections. \\
ip & Addressed relative to the instruction pointer. This is the default for executable and read-only sections. \\
datap & Addressed relative to the data pointer. This is the default for writeable data sections. \\
threadp & Addressed relative to the thread data pointer. The section will have one instance for each thread. \\
communal \label{communal} & Communal section. Allows identical sections in different modules, where only one of the
communal sections with the same name is included by the linker.
Unreferenced communal sections may be removed.
Public symbols in communal sections must be weak. \\
uninitialized & Data section containing only zeroes. The data of this section does not take up space in object files and executable files.\\
exception\_hand & Exception handler and stack unroll information.\\
event\_hand & Event handler information, including static constructors and destructors.\\
debug\_info & Debug information.\\
comment\_info & Comments, including copyright and required library names.\\
align=n & Align the beginning of the section to an address divisible by n, which must be a power of 2.
The default alignment for executable sections is 4. The default alignment for a data section is the size of the biggest data type in the section. A higher alignment can be specified with the align directive.\\
\hline
\end{tabular}
%\end{table}
\vv

Sections with the same name are placed consecutively in the executable file. They must have the same attributes, except for alignment.
\vv

Sections with different names but the same attributes (except for alignment) are placed in alphabetical order in the executable file.
\vv

All code must be placed in executable sections.
Data can be placed in read-only sections or writeable sections. Read-only sections are used for constants and tables.
Function pointers and jump tables are preferably placed in read-only sections for security reasons.

Variables can be placed in writeable data sections, on the stack, or in registers.
\vv
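
The options in the table above might be combined as in the following sketch. The section names ({\ttfamily consts}, {\ttfamily perthread}, {\ttfamily buffers}) and the symbols inside them are hypothetical, and the fragment is an untested illustration of the documented syntax rather than code taken from a working program.

\begin{lstlisting}[frame=single]
consts section read ip                  // read-only, ip-relative
int32 primes[4] = {2, 3, 5, 7}
consts end

perthread section read write threadp    // one instance per thread
int64 counter = 0
perthread end

buffers section read write datap uninitialized, align=64
int8 scratch[4096]                      // zeroes, takes no file space
buffers end
\end{lstlisting}
\vv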



\subsection{Registers} \label{AsmRegisters}
There are 32 general purpose registers r0 - r31, and 32 vector registers of variable length v0 - v31.
\vv

r31 is the stack pointer, also called sp. r0 - r27 can be used as general pointers. r28 - r30 can only be used as pointers if there is no offset or a scaled offset that fits into 8 bits. r0 - r30 can be used as array indexes.
\vv

r0 - r6 and v0 - v6 can be used as masks. r0 - r30 and v0 - v30 can be used as fallback values.

\vv
The use of registers in function calls must obey the function calling conventions described in chapter \ref{chap:functionCallingConventions} and the register usage conventions in chapter
\ref{chap:registerUsageConvention}.
\vv


\subsection{Names of symbols} \label{assemblerNames}
The names of data symbols, code labels, functions, etc. are case sensitive.
A name can consist of letters a-z, A-Z, numbers 0-9, the special characters \_ \$ @,
and Unicode letters. A name cannot begin with a number. There is no limit to the length of a name.
\vv

All names are prefixed with an underscore or mangled in some other way when compiling high-level language code. This is to prevent high-level language names from clashing with assembly keywords, register names, etc.
\vv

The names of keywords and instructions are not case sensitive.
The following are reserved keywords:
\begin{lstlisting}[frame=single]
align
break broadcast
capab0-capab31 case comment_info communal constant continue
datap debug_info do double
else end event_hand exception_hand execute extern
fallback false float float16 float32 float64 float128 for function
if in int int8 int16 int32 int64 int128 ip
length limit
mask
option options
perf0-perf31 pop public push
r0-r31 read
scalar section sp spec0-spec31 string switch sys0-sys31
threadp true
uint8 uint16 uint32 uint64 uint128 uninitialized
v0-v31
weak while write
\end{lstlisting}
\vv


\subsection{Constant expressions} \label{assemblyConstantExpressions}
Integer constants can be expressed in the following ways:

\begin{tabular}{|p{47mm}p{115mm}|}
\hline
decimal numbers & contains only digits 0-9 \newline Numbers beginning with 0 are interpreted as decimal, not octal. \\
binary numbers & 0b followed by digits 0-1\\
octal numbers & 0O followed by digits 0-7\\
hexadecimal numbers & 0x followed by digits 0-9, a-f\\
character constants & 1-8 ASCII characters enclosed in single quotes '{ }'. The first character will be contained in the lowest byte of a 64 bit integer. For example 'AB' = 0x4241 \\
imported constant & extern name: constant \\
difference between addresses & symbol1 - symbol2. The two symbols must have the same base pointer, either ip, datap, threadp, or none. \\
\hline
\end{tabular}
%\end{table}
\vv

Integer constants can be combined with all common operators: \\
+ \hspace{2mm} - \hspace{2mm} * \hspace{2mm} / \hspace{2mm} \% \hspace{2mm} \& \hspace{2mm} $\vert$ \hspace{2mm} \^{} \hspace{2mm} \textasciitilde \hspace{2mm}
 \&\& \hspace{2mm} $\vert\vert$ \hspace{2mm} ! \hspace{2mm}
$<<$ \hspace{2mm} $>>$ \hspace{2mm}
\textless \hspace{2mm} \textless= \hspace{2mm} \textgreater \hspace{2mm} \textgreater= \hspace{2mm}
 == \hspace{2mm} != \hspace{2mm} ?:
\vv

The operators have the same order of precedence as in C.
\vv

Additional operators not found in C are: \\
$>>>$ \hspace{2mm} shift right unsigned, \\
\^{}\^{} \hspace{7.5mm} logical exclusive or.
\vv


The results are calculated as signed 64-bit integers, except for $>>>$ \hspace{0.5mm} which is unsigned.
\vv
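
The rules above can be illustrated with a few data definitions. This is an untested sketch: the section name {\ttfamily consts} and the symbols {\ttfamily c1} to {\ttfamily c7} are hypothetical, and the values in the comments follow from the rules stated above (64-bit evaluation, C precedence), not from assembler output.

\begin{lstlisting}[frame=single]
consts section read ip
int64 c1 = 'AB'           // 0x4241: 'A' = 0x41 is the lowest byte
int32 c2 = 010            // 10: a leading zero does not mean octal
int32 c3 = 0O10           // 8: octal needs the 0O prefix
int32 c4 = 0b1010 | 0x10  // 26
int64 c5 = -16 >> 2       // -4: signed shift right
int64 c6 = -16 >>> 2      // 0x3FFFFFFFFFFFFFFC: unsigned shift right
int32 c7 = 1 ^^ 1         // false: logical exclusive or of two true values
consts end
\end{lstlisting}
\vv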

Imported constants and differences between addresses are not allowed in general calculations.
The only operations allowed for these are addition of a local constant, and division by a power of 2.
These calculations are done by the linker except for a difference between two local symbols in the same section.
\vv

Floating point constants must contain a dot or an E and at least one digit, for example 1.23E-4
\vv

Floating point expressions are calculated with double precision. The following operators can be used: \\
+ \hspace{2mm} - \hspace{2mm} * \hspace{2mm} / \hspace{2mm}
\textless \hspace{2mm} \textless= \hspace{2mm} \textgreater \hspace{2mm} \textgreater= \hspace{2mm}
 == \hspace{2mm} != \hspace{2mm} ?:
\vv

String constants are sequences of ASCII or UTF-8 characters enclosed in " ".\\
The following escape sequences are recognized: \textbackslash\textbackslash \hspace{1mm}
\textbackslash n \textbackslash r  \textbackslash t  \textbackslash " \\
String constants can be concatenated with the + operator as in Java.
\vv

Numeric constants can be included in the instruction codes as immediate operands. This reduces the pressure on the data cache because such constants do not have to be loaded from data memory.
\vv



\subsection{Data types} \label{assemblyDataTypes}
The following data types are used in data definitions and instructions:

\begin{tabular}{|p{15mm}p{125mm}|}
\hline
int8    & 8 bit signed integer\\
uint8   & 8 bit unsigned integer\\
int16   & 16 bit signed integer\\
uint16  & 16 bit unsigned integer\\
int     & 32 bit signed integer\\
int32   & 32 bit signed integer\\
uint32  & 32 bit unsigned integer\\
int64   & 64 bit signed integer\\
uint64  & 64 bit unsigned integer\\
int128  & 128 bit signed integer (optional)\\
uint128 & 128 bit unsigned integer (optional)\\
float16 & half precision floating point\\
float   & single precision floating point\\
float32 & single precision floating point\\
double  & double precision floating point\\
float64 & double precision floating point\\
float128 & quadruple precision floating point (optional)\\
\hline
\end{tabular}
\vv

A '+' after an integer type indicates that the assembler can use a bigger type if this makes
the instruction smaller or more efficient, for example int16+.
\vv
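
A minimal sketch of the difference, with arbitrary registers and an arbitrary constant. Which type the assembler actually selects for the second line is not specified here, so the comment describes only what is permitted. The '+' form is presumably appropriate when the program does not depend on the operation being carried out in exactly 16 bits.

\begin{lstlisting}[frame=single]
int16  r1 = r2 + 1000   // the operation uses exactly 16 bits
int16+ r1 = r2 + 1000   // at least 16 bits; a bigger type may be chosen
\end{lstlisting}
\vv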


\subsection{Data definitions} \label{assemblyDataDefinitions}
Static data can be defined inside a data section. Several different forms are allowed:
\vv

\begin{tabular}{|p{140mm}|}
\hline
Assembly style data definition:\\
\hspace{4mm} label : datatype value, value, ...\\
C style data definition:\\
\hspace{4mm} datatype name = value, name = value, ...\\
C style array definition, uninitialized:\\
\hspace{4mm} datatype name[number]\\
C style array definition, initialized:\\
\hspace{4mm} datatype name[number] = \{value, value, ...\}\\
Assembly style string definition:\\
\hspace{4mm} label : int8 "string", "string", ...\\
C style string definition:\\
\hspace{4mm} int8 name = "string" \\
\hline
\end{tabular}
\vv

A terminating zero after a string must be added explicitly if needed.
\vv

Examples:
\vspace{1mm}

\begin{lstlisting}[frame=single, showstringspaces=false]
mydata section read write datap
alpha: float 0.1, 0.2, 0.3
int16 beta = 4, gamma = 5
int8 delta[25]
align (8)
int8 epsilon[] = {6, 7, 8, 9}
zeta: int8 "Dear reader", 0
int8 eta = "Nice to meet you\0"
mydata end
\end{lstlisting}
\vv


\subsection{Function definitions and labels} \label{assemblyFunctionDef}
A function can be defined inside an executable section. It is defined like this:
\vspace{1mm}

\begin{tabular}{|p{140mm}|}
\hline
\hspace{4mm} name function attributes\\
\hspace{4mm} ...\\
\hspace{4mm} name end\\
\hline
\end{tabular}
\vspace{4mm}

Attributes can be 'public', 'weak', and 'reguse=value,value'.
The function will be local if no attributes are specified and
the name does not appear in a public declaration.
\vv

Example:
\vspace{1mm}

\begin{lstlisting}[frame=single]
mycode section execute
// this function calculates the square of a double
_square function public, reguse = 0, 1
double v0 *= v0
return
_square end
mycode end
\end{lstlisting}
\vv

The calling conventions for functions are defined in chapter \ref{chap:functionCallingConventions}.
\vv


\subsection{Instructions} \label{assemblyInstructions}
It is convenient to have only one instruction per line. Multiple instructions on the same line must be separated by semicolon.
\vv

Instructions can be defined only inside an executable section. The general form looks like this:

\begin{tabular}{|p{130mm}|}
\hline
\hspace{4mm} label : datatype destination = instruction(source\_operands), options \\
\hline
\end{tabular}
%\end{table}
\vspace{4mm}

For example:
\begin{lstlisting}[frame=single]
A: int32 r1 = add(r2, 18), mask = r3, fallback = r4
\end{lstlisting}
\vv

The destination is a register in most cases. A few instructions allow a memory operand as destination.
\vv

The source operands can be registers, memory operands, or immediate constants.
No instruction can have more than one memory operand.
\vv

The source operands should be written in the following order: registers, memory operand, immediate operand. The assembler will reorder the operands automatically in certain cases. For example, int r2 = 1 + r1 will be coded as int r2 = r1 + 1. Reordering of operands is not always possible. For example, there is no way to code r2 = 8 $>>$ r1 as a single instruction. Vector operands are not reordered automatically because this will give an invalid result in case the vectors have different lengths.
\vspace{4mm}
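
The following sketch shows both cases from the paragraph above. The two-instruction workaround is an assumption, not something prescribed by this manual: it presumes that r3 is free and that $>>$ is accepted as an operator between two registers. It has not been tested with an assembler.

\begin{lstlisting}[frame=single]
int r2 = 1 + r1      // accepted: coded as  int r2 = r1 + 1
// int r2 = 8 >> r1  // rejected: does not fit a single instruction
int r3 = 8           // assumed workaround: constant into a register
int r2 = r3 >> r1    // then shift register by register
\end{lstlisting}
\vv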

The following options are possible:\\
\begin{tabular}{|p{140mm}|}
\hline
options = integer constant\\
mask = register\\
fallback = register\\
fallback = 0\\
\hline
\end{tabular}
\vv

Options specify various option bits for specific instructions.
Only certain instructions can have options.
\vv

Instructions can be written in a simpler form with an operator instead of the instruction name if the instruction does the same as the operator. The general form is:
\vv

\begin{tabular}{|p{140mm}|}
\hline
\hspace{4mm} label : datatype destination = operand1 operator operand2 \\
\hline
\end{tabular}
%\end{table}
\vspace{4mm}

For example:
\begin{lstlisting}[frame=single]
int r1 = r2 + 18
\end{lstlisting}
\vv


Masks are used for conditional execution. The mask registers may also contain bits specifying various numerical options. See page \pageref{MaskAndFallback} for details.
Almost all multi-format instructions can have a mask. Single-format instructions with A or E templates can also have a mask. Jump instructions cannot have a mask.
\vv

The mask register can be r0-r6 or v0-v6. The fallback register can be r0-r30 or v0-v30.
The mask and fallback registers must be the same type of register as the destination (general purpose or vector register). The fallback specifies the result when bit 0 of the mask is zero.
\vv

The default fallback register is the same as the first source register. A different fallback register is possible only if there is a vacant register field in the code template.
The fallback register cannot be different from the first source register in the following cases:
\begin{itemize}
\item the instruction has three source operands
\item the instruction needs 64 bits for an immediate constant
\item the instruction has a memory operand with index
\item the instruction has vector registers and a memory operand
\end{itemize}
\vv


The mask and fallback can be indicated as mask = register, and fallback = register,
or in a simple way with the ?: operator:
\vv

\begin{tabular}{|p{140mm}|}
\hline
\hspace{4mm} datatype destination = mask ? operand1 operator operand2 : fallback\\
\hline
\end{tabular}
\vspace{4mm}

For example:
\begin{lstlisting}[frame=single]
int r1 = r3 ? r2 + 18 : r4
\end{lstlisting}
\vv
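
To make the correspondence between the two notations explicit, the sketch below writes the same masked addition in both forms and then varies the fallback. The register choices are arbitrary and the comments restate the rules given above; the fragment has not been assembled.

\begin{lstlisting}[frame=single]
int r1 = add(r2, 18), mask = r3, fallback = r4  // explicit options
int r1 = r3 ? r2 + 18 : r4                      // same, with ?:
// in both lines: r1 = r2 + 18 if bit 0 of r3 is 1, otherwise r1 = r4

int r1 = add(r2, 18), mask = r3, fallback = 0   // otherwise r1 = 0
int r1 = add(r2, 18), mask = r3                 // otherwise r1 = r2,
                                                // the first source register
\end{lstlisting}
\vv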

A memory operand must be enclosed in square brackets. Memory references are always relative to a base pointer.
A memory operand can contain:
\vv

\begin{description}

\item[A base pointer]
This is a general purpose register or a special pointer (ip, datap, threadp, sp). The base pointer contains a memory address.
The base pointer is implicit for data labels in an ip, datap, or threadp section.

\item[A scaled index]
This is a general purpose register multiplied by a scale factor.
The scaled index is added to the base pointer. This is useful for accessing an array element where the base pointer contains the address of the beginning of the array.
The scale factor is the same as the operand size in most cases. Instructions with a general purpose register destination can also have a scale factor of 1. Instructions with a vector register destination can also have a scale factor of -1.

\item[An offset]
This is an integer constant that will be added to the address.

\item[limit=constant]
This specifies a maximum value for the index. An error trap will be generated if the unsigned index exceeds this value.

\end{description}

Memory operands in vector instructions must have one of the following options:

\begin{description}

\item[scalar]
Only one element is read.

\item[length=register]
The length of the vector, in bytes, is specified by a general purpose register.
The number of vector elements is equal to the value of this register divided by the number of bytes used by each element.

\item[broadcast=register]
Only one element is read. This element is broadcast to a vector with a total length specified by a general purpose register. The number of identical elements is equal to the value of this register divided by the number of bytes used by one element.

\end{description}
\vspace{4mm}


Examples:
\vv

\begin{lstlisting}[frame=single]
int32 r1 = r2 + [alpha] // the base pointer is implicit
int32 r1 = r2 + [r3 + 0x10]
int32 r1 = r2 + [r3 + 4*r4, limit = 100]
int32 v1 = v2 + [r3 - r4, length = r4]
float v1 = v2 + [beta, scalar]
float v1 = v2 + [beta, broadcast=r3]
\end{lstlisting}
\vv

Remember that the base pointer, index, and offset of a memory operand must all be inside the square bracket. Otherwise they will be interpreted as separate operands.
\vv

The details for each instruction are described in chapter \ref{chap:DescriptionOfInstructions}.
\vv


\subsection{Unconditional jumps, calls, and returns} \label{assemblyJumps}
Direct unconditional jumps, calls, and returns are coded as follows:
\vv

\begin{tabular}{|p{140mm}|}
\hline
\hspace{4mm} jump target\\
\hspace{4mm} call target\\
\hspace{4mm} return\\
\hline
\end{tabular}
\vv

Example:
\begin{lstlisting}[frame=single]
my_func function public
    nop
    return
my_func end

...

call my_func
\end{lstlisting}
\vspace{4mm}


\subsection{Indirect jumps and calls}
Indirect jumps and calls can use an absolute 64-bit address in
a register or memory operand:

\begin{tabular}{|p{140mm}|}
\hline
\hspace{4mm} jump register\\
\hspace{4mm} call register\\
\hspace{4mm} jump ([base\_register+offset])\\
\hspace{4mm} call ([base\_register+offset])\\
\hline
\end{tabular}
\vv

Example:
\begin{lstlisting}[frame=single]
my_func function public
    nop
    return
my_func end

...

int64 r1 = address([my_func])
call r1
\end{lstlisting}
\vv

The absolute address must be divisible by 4 because code words are 32 bits (4 bytes).
\vspace{4mm}

It is often more efficient to use relative addresses rather than absolute addresses for indirect jumps and calls with memory operands. A relative address contains an offset relative to an arbitrary reference point. The relative address needs fewer bits because it contains the difference between the target address and the reference point. All code addresses are divisible by 4 so we can save two more bits by dividing the relative address by 4.
\vv

Indirect jumps and calls using a relative pointer can have the following forms:

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} datatype jump\_relative (ref\_register, [base\_register+offset])\\
\hspace{4mm} datatype call\_relative (ref\_register, [base\_register+offset])\\
\hspace{4mm} datatype jump\_relative (ref\_register, [base\_register+index\_register*scale])\\
\hspace{4mm} datatype call\_relative (ref\_register, [base\_register+index\_register*scale])\\
\hline
\end{tabular}
\vspace{4mm}

The datatype specifies the size of the relative pointer stored in memory. This must be a signed integer type. The reference point must be contained in a 64-bit register.
\vv

Example:
\vv

\begin{lstlisting}[frame=single]
extern function1: function, function2: function

rodata section read ip  // read-only data section
// relative address of function1, scaled by 4
funcpoint1: int16 (function1-reference_point) / 4
// relative address of function2, scaled by 4
funcpoint2: int16 (function2-reference_point) / 4
rodata end

code section execute ip
reference_point:
  // load address of reference_point:
  int64 r1 = address([reference_point])
  int16 call_relative (r1, [funcpoint1])  // call function1
  int16 call_relative (r1, [funcpoint2])  // call function2
code end
\end{lstlisting}

See page \pageref{relativeDataPointer} for further explanation of relative pointers.
\vspace{4mm}
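
The form with a scaled index lends itself to jump tables. The sketch below is an untested illustration built only from the syntax shown above: the labels ({\ttfamily jumptable}, {\ttfamily target0} to {\ttfamily target2}, {\ttfamily finish}) and the register assignment are hypothetical, and the code assumes that r3 already holds a valid selector in the range 0 to 2. Because each entry is a signed 16-bit value scaled by 4, every target must lie within roughly 128 kB of the reference point.

\begin{lstlisting}[frame=single]
rodata section read ip
// relative addresses of the targets, scaled by 4
jumptable: int16 (target0-reference_point) / 4, (target1-reference_point) / 4, (target2-reference_point) / 4
rodata end

code section execute ip
reference_point:
  int64 r1 = address([reference_point])  // reference point
  int64 r2 = address([jumptable])        // start of table
  int16 jump_relative (r1, [r2 + r3*2])  // 2 bytes per table entry
target0:
  nop                                    // selector 0
  jump finish
target1:
  nop                                    // selector 1
  jump finish
target2:
  nop                                    // selector 2
finish:
  nop
code end
\end{lstlisting}
\vv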


\subsection{Conditional jumps and loops} \label{assemblyConditionalJumps}

Conditional jumps always involve an arithmetic or logic operation and a jump conditional upon the result:

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} datatype destination = instruction(source\_operands), jump\_condition target \\
\hline
\end{tabular}
\vspace{4mm}

Examples:
\vv

\begin{lstlisting}[frame=single]
  int32+ r1 = add(r1, r2), jump_pos L1
  uint64      compare(r2, 5), jump_above L1
  float       test_bit(v1, 31), jump_nzero L1
  L1:
\end{lstlisting}
\vv

Vector registers can be used only when the data type does not fit into a general purpose register. Floating point addition and subtraction cannot be used with conditional jumps.
\vv

The most common conditional jumps can be coded more conveniently using the high-level language keywords: if, else, for, do, while, break, continue.
\vv

An 'if' statement can have the form:
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} if (datatype register operator operand) \{...\} else \{...\}\\
\hline
\end{tabular}
\vv

Line breaks are allowed before and after '\{' and '\}'. The curly brackets cannot be omitted.

Comparing a register value with another register or a constant is done with one of the operators:\\
\textless \hspace{2mm} \textless= \hspace{2mm} \textgreater \hspace{2mm} \textgreater= \hspace{2mm}
 == \hspace{2mm} != \hspace{2mm}
\vv

Unsigned tests can be indicated with an unsigned integer type, for example uint32.
Bit tests can be coded with the \& operator as described on page \pageref{table:testBitJumpTrueInstruction}.
\vv

Examples:
\vv

\begin{lstlisting}[frame=single]
  if (int r1 >= r2) {
    nop     // r1 >= r2
  } else {  // r1 < r2
    if (int !(r3 & (1 << 20))) {
      nop   // bit 20 in r3 is not set
    }
  }
\end{lstlisting}
\vv

The conditions cannot be combined with \&\&, $\vert\vert$, etc. because the statement must fit into a single instruction.
\vv

731
The 'while', 'do-while', and 'for' loops are written in the same way as if statements:
732
 
733
\begin{tabular}{|p{150mm}|}
734
\hline
735
\hspace{4mm} while (datatype condition) \{...\}\\
736
\hspace{4mm} do \{...\} while (datatype condition) \\
737
\hspace{4mm} for (datatype initialization; condition; increment) \{...\}\\
738
\hline
739
\end{tabular}
740
\vv
741
 
742
Examples:
743
\vv

\begin{lstlisting}[frame=single]
  for (int r1 = 1; r1 <= 100; r1++) {
    int64 r2 = 4  // executed 100 times
    while (uint64 r2 > 0) {
      int64 r2--  // executed 400 times
    }
  }
  do {
    int32 r2 = r1 - 1     // executed once
  } while (int32 r2 > r1) // never true
\end{lstlisting}
\vv

The initialization and increment instructions in the 'for' loop can be anything that fits into a single instruction with the specified data type.
\vspace{4mm}

ForwardCom has a very efficient way of doing vector loops as explained on page \pageref{vectorLoops}. Vector loops are written in the following way:

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} for (datatype vector\_register in [end\_pointer - index\_register]) \{...\}\\
\hline
\end{tabular}
\vspace{4mm}

Example:

\begin{lstlisting}[frame=single]
  int64 r1 = address([my_array])      // address of my_array
  int64 r2 = my_arraysize * 4         // size of my_array in bytes
  int64 r1 += r2                      // point to end of my_array
  for (float v1 in [r1 - r2]) {       // loop through my_array
    float v1 = [r1 - r2, length = r2] // read vector
    float v1 *= 2                     // multiply values by 2
    float [r1 - r2, length = r2] = v1 // store results in my_array
  }
\end{lstlisting}
\vv

This loop will subtract the maximum vector length from r2 for each iteration and continue as long as r2 (interpreted as a signed 64-bit integer) is positive. It will use the maximum vector length as long as r2 is bigger than the maximum vector length.
The maximum vector length may depend on the data type. The loop will use the maximum vector length for the data type specified in the for statement (float in this case). It is possible to use more vector registers and more end-of-array pointers than the ones specified in the 'for' statement.
\vv
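
The loop control described above can be modelled in C. This is only a sketch of the semantics, not the code that the assembler generates. The maximum vector length of 64 bytes (MAX\_VECTOR\_BYTES) and the function name double\_array are assumptions made for the illustration.

\begin{lstlisting}[frame=single]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical maximum vector length in bytes.
   The real value is implementation dependent. */
#define MAX_VECTOR_BYTES 64

/* Scalar model of the vector loop: doubles n floats in place
   and returns the number of loop iterations. */
static int double_array(float *my_array, size_t n)
{
    char *end = (char *)my_array + n * sizeof(float); /* r1 */
    int64_t remaining = (int64_t)(n * sizeof(float)); /* r2 */
    int iterations = 0;

    while (remaining > 0) {            /* signed comparison */
        /* vector length is limited to the maximum */
        int64_t len = remaining < MAX_VECTOR_BYTES
                    ? remaining : MAX_VECTOR_BYTES;
        float *v = (float *)(end - remaining); /* [r1 - r2] */
        for (int64_t i = 0; i < len / (int64_t)sizeof(float); i++) {
            v[i] *= 2.0f;
        }
        remaining -= MAX_VECTOR_BYTES; /* may become negative */
        iterations++;
    }
    return iterations;
}
\end{lstlisting}
\vv

With 20 floats (80 bytes), this model runs two iterations: one with a full 64-byte vector and one with the remaining 16 bytes.
\vv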

It is recommended to use optimization level 1 or higher when assembling branches and loops.
\vv


\subsection{Boolean operations} \label{BooleanOperations}
Boolean variables use only bit 0 to indicate false or true, according to the standard defined on page \pageref{booleanRepresentation}. The remaining bits may be used for other purposes.
\vv

Boolean variables can be generated with the compare instruction or the bit test instructions. For example:

\begin{lstlisting}[frame=single]
// C code:
// if (r2 > r3) r4 += 5
// Assembly code:
  int r1 = r2 > r3              // true if r2 > r3
  int r4 = r1 ? r4 + 5 : r4     // conditional add
\end{lstlisting}
\vv

The boolean variable r1 is 1 if the condition is true, and 0 if false.
\vv

Boolean variables may be combined with the operators
\hspace{2mm} \& \hspace{2mm} $\vert$ \hspace{2mm} \^{}. Boolean variables are negated by XOR'ing with 1:

\begin{lstlisting}[frame=single]
  int r1 = r2 > r3              // true if r2 > r3
  int r4 = r5 != 0              // true if r5 != 0
  int r6 = r1 & r4              // r1 && r4
  int r7 = r6 ^ 1               // r7 = !r6
\end{lstlisting}
\vv

The compare and bit test instructions can have extra boolean operands.
An extra boolean operand to a compare instruction can be specified conveniently with the logical operators
\hspace{2mm} \&\& \hspace{2mm} $\vert\vert$ \hspace{2mm} \^{}\^{} \hspace{2mm}  \\
This makes it possible to improve the above example:

\begin{lstlisting}[frame=single]
  int r1 = r2 > r3              // true if r2 > r3
  int r6 = r5 != 0 && r1        // true if r5 != 0 && r2 > r3
  int r7 = r6 ^ 1               // r7 = !r6
\end{lstlisting}
\vv

It is even possible to add yet another boolean operand to a compare or bit test instruction by using a mask register. This requires manual coding of the option bits:

\begin{lstlisting}[frame=single]
  int r1 = r2 > r3              // true if r2 > r3
  int r4 = r5 < 0               // true if r5 < 0
  int r6 = compare(r7, r8), mask=r1, fallback=r4, options=0b010001
  // r6 is true if r2 > r3 && r5 < 0 && r7 != r8

  int r9 = test_bits_or(r1,5), mask=r2, fallback=r3, options=0b10010
  // r9 is true if (((r1 & 5) != 0) || r3) && !r2
\end{lstlisting}
\vv

See page \pageref{table:compareInstruction} for details of the compare instruction, and page \pageref{table:testBitInstruction} for the test\_bit instructions.
\vv

Boolean variables may be used for conditional jump instructions by testing bit 0 of the register:
\begin{lstlisting}[frame=single]
  if (int r1 & 1) {
    // do this if boolean r1 is true
  }
\end{lstlisting}
\vv

It is possible to combine two boolean variables in a conditional jump instruction with \hspace{2mm} \& \hspace{2mm} $\vert$ \hspace{2mm} \^{} \hspace{2mm} operations. Note that this will test all the bits of the result, not just bit 0:

\begin{lstlisting}[frame=single]
  int r1 = r1 | r2, jump_nzero Label
\end{lstlisting}
\vv
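
The difference between testing all bits and testing only bit 0 can be illustrated with a small C model. This is a sketch of the behavior described above, and the function names are invented for the illustration.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Model of "int r1 = r1 | r2, jump_nzero Label":
   the jump is taken if ANY bit of the result is set. */
static int jump_nzero_taken(uint32_t r1, uint32_t r2)
{
    return (r1 | r2) != 0;
}

/* Boolean OR that isolates bit 0 before testing,
   so that the remaining bits are ignored. */
static int bool_or(uint32_t r1, uint32_t r2)
{
    return (int)((r1 | r2) & 1);
}
\end{lstlisting}
\vv

If r1 = 0x100 and r2 = 0, both boolean values are false because bit 0 is clear, but the combined jump is taken anyway because bit 8 is set. The combined form is therefore safe only when the unused bits of both registers are known to be zero.
\vv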

The condition is defined only by bit zero when a boolean variable is used as a mask. The remaining bits of the mask may be used for specifying various options, such as floating point exception handling. These mask bits are defined on page \pageref{table:maskBits}. It may be necessary to insert these option bits before using a boolean variable as mask for floating point operations:

\begin{lstlisting}[frame=single]
  float v1 = 1.2
  float v2 = 3.4
  int32 v3 = v1 < v2           // boolean variable
  int32 v4 = make_mask(v3, 0)  // copy option bits from NUMCONTR
  int32 v3 |= v4               // combine boolean and option bits
  float v5 = v3 ? v1 * v2 : v1 // conditional multiplication
\end{lstlisting}
\vv


\subsection{Absolute and relative pointers} \label{AbsoluteAndRelativePointers}

Absolute addresses of code and data will usually need 64 bits because the program may be loaded at any memory address. 32 bits will suffice if the program is never run on a machine with more than 2 GB of address space.
\vv

Relative addresses can be coded with fewer bits because they specify an address relative to some reference point within the code or data sections of the running program. The necessary number of bits can be further reduced by dividing the relative address by a power of 2. If the address of the target as well as the reference point are known to be divisible by, for example, 4 then we can save two bits by dividing the relative address by 4. In this case, we only need 16 bits for the relative pointer if the distance between target and reference point is less than 128 kB and the relative address is divided by 4.
\vv
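
As a worked instance of these figures, assume a signed relative pointer of $n$ bits and a scale factor of $2^s$. The distance $d$ in bytes between the target and the reference point must then satisfy
\[
-2^{n-1} \cdot 2^{s} \le d \le (2^{n-1} - 1) \cdot 2^{s} .
\]
For $n = 16$ and $s = 2$, this gives $2^{15} \cdot 2^{2} = 2^{17}$ bytes = 128 kB in the negative direction, and 4 bytes less in the positive direction.
\vv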

A further advantage of relative addresses is that they are always position-independent, while absolute addresses must be calculated after the program has been loaded at an arbitrary address in RAM.
\vv

Absolute addresses are calculated with the address instruction. For example:

\begin{example}
\label{addressInstruction2Pointer}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
data section datap                  // define data section
float alpha                         // define variable
data end

code section execute                // define code section
int64 r1 = address([alpha])         // absolute address of alpha
float v1 = [r1, scalar]             // load alpha through pointer
code end
\end{lstlisting}
\vv

A variable can be initialized with an absolute address only if the loader supports relocation. The following example shows how to use an absolute address, though this is not recommended:

\begin{example}
\label{absolutePointer}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
data section datap                  // define data section
float alpha                         // define variable
int64 pointer_to_alpha = alpha      // contains address of alpha
data end

code section execute                // define code section
int64 r1 = [pointer_to_alpha]       // pointer to alpha
float v1 = [r1, scalar]             // load alpha through pointer
code end
\end{lstlisting}
\vv

Here, the address of alpha is inserted into pointer\_to\_alpha. Note that this makes the code position-dependent. The address of alpha must be calculated by the loader rather than by the linker. Not all platforms support position-dependent code. Therefore, it is preferred to use relative pointers.
\vspace{4mm}

A relative pointer contains an address relative to an arbitrary reference point. The reference point must be placed in a section with the same base pointer (IP, DATAP, or THREADP) as the target of the pointer. A relative address can be converted to an absolute address by the instruction sign\_extend\_add. Example:

\begin{example}
\label{relativeDataPointer}
\end{example}
\begin{lstlisting}[frame=single]
data section datap                  // define data section
float alpha                         // define variable
int16 relative_pointer = (alpha - reference_point) // relative address
float reference_point               // arbitrary reference point
data end

code section execute                // define code section
int64 r1 = address ([reference_point]) // address of reference point
// convert relative address to absolute address:
int16 r2 = sign_extend_add(r1, [relative_pointer])
float v1 = [r2, scalar]             // load alpha through pointer
code end
\end{lstlisting}
\vv

In example \ref{relativeDataPointer}, the sign\_extend\_add instruction will sign-extend the relative pointer to 64 bits and add the address of the reference point to get the address of the variable alpha. Note that the type int16 on the sign\_extend\_add instruction indicates the size of relative\_pointer, while the size of the reference address r1 and the result r2 are both 64 bits.
\vv

The range of addresses that can be covered by a 16-bit relative address is $\pm 2^{15}$ bytes, or $\pm$ 32 kB. This range can be increased by scaling the relative address. If the target and the reference point are both aligned to addresses divisible by 4, then we can divide the relative address by 4 without losing information. The scale factor must be a power of 2. The same code with a relative address scaled by 4 looks like this:

\begin{example}
\label{relativeDataPointerScaled}
\end{example}
\begin{lstlisting}[frame=single]
data section datap                  // define data section
float alpha                         // define variable
// relative address divided by 4:
int16 relative_pointer = (alpha - reference_point) / 4
float reference_point               // arbitrary reference point
data end

code section execute                // define code section
int64 r1 = address ([reference_point]) // address of reference point
// Convert the relative address to absolute address.
// options = 2 is a shift count that shifts the relative address
// two bits to the left, so that it is multiplied by 4:
int16 r2 = sign_extend_add(r1, [relative_pointer]), options = 2
float v1 = [r2, scalar]             // load alpha through pointer
code end
\end{lstlisting}
\vv

This works in the following way. The linker has support for calculating relative addresses and for scaling addresses by a power of 2. The linker calculates
$(\mathrm{alpha} - \mathrm{reference\_point}) / 4$ and inserts the value at
relative\_pointer.
The sign\_extend\_add instruction loads the relative pointer, sign-extends it to 64 bits, shifts it left by 2, which corresponds to multiplying by $2^2 = 4$, and adds the absolute address in r1. The maximum shift count supported by this instruction is usually 3, corresponding to a scale factor of 8. Support for higher shift counts is optional.
\vv
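
The address calculation can be summarized in C. This is a sketch of the semantics described above for a 16-bit operand type. The function name sign\_extend\_add16 is invented for the illustration, and the shift count is assumed to be within the supported range.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Model of sign_extend_add with operand type int16:
   result = base + (sign-extended rel, scaled by 2^shift). */
static uint64_t sign_extend_add16(uint64_t base, int16_t rel,
                                  unsigned shift)
{
    int64_t extended = (int64_t)rel;   /* sign-extend to 64 bits */
    /* multiply by 2^shift; avoids left-shifting a negative value */
    int64_t scaled = extended * ((int64_t)1 << shift);
    return base + (uint64_t)scaled;    /* wraps modulo 2^64 */
}
\end{lstlisting}
\vv

For example, a reference point at address 0x1000 and a stored relative pointer of -1 with a shift count of 2 give the address 0x1000 - 4 = 0xFFC.
\vv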

Variables in static memory are always aligned to a natural address, i.e. an address divisible by the size of a data element of the specified type. Thus, int8 is not aligned, int16 is aligned by 2, int32 is aligned by 4, int64 is aligned by 8, float is aligned by 4, and double is aligned by 8. The reference point must have at least the same alignment when relative pointers are scaled. The predefined symbols on page \pageref{SpecialAddressSymbols} may be used as reference points.
\vspace{4mm}


Relative pointers are also used for function pointers and jump tables. Relative code pointers are always scaled by 4. The function pointer can be calculated with the sign\_extend\_add instruction as in example \ref{relativeDataPointerScaled}, but it is easier to use the relative jump or call instruction:

\begin{example}
\label{exampleRelFuncPointer}
\end{example}
\begin{lstlisting}[frame=single]
const section read ip               // define constant data section
// relative function pointer divided by 4:
int32 function1_pointer = (function1 - reference_point) / 4
const end

code section execute                // define code section
reference_point:
int64 r1 = address ([reference_point]) // address of reference point
int32 call_relative (r1, [function1_pointer])
...
function1 function public
    nop
    return
function1 end

code end
\end{lstlisting}
\vv

The jump\_relative and call\_relative instructions work by sign extending the relative address, multiplying by 4, adding the reference point (first parameter), and then jumping or calling to the calculated address. Remember that the operand type on the call\_relative instruction must match the size of the relative function pointer, which is int32 in example \ref{exampleRelFuncPointer}.
\vv

A switch-case multiway branch can be implemented in a similar way, using a table of relative addresses. This table may be placed in read-only memory for security reasons.
\vv

\begin{example}
\label{exampleSwitchCase}
\end{example}
\begin{lstlisting}[frame=single]
/* C code:
int j, x;
switch (j) {
case 1:
   x = 10;  break;
case 2:
   x = 12;  break;
case 5:
   x = 20;  break;
default:
   x = 99;  break;
}
*/
// ForwardCom code:

rodata section read ip
// table of relative addresses, using DEFAULT as reference point
align (4)
jumptable: int16 0            // case 0
int16 (CASE1 - DEFAULT) / 4   // case 1
int16 (CASE2 - DEFAULT) / 4   // case 2
int16 0                       // case 3
int16 0                       // case 4
int16 (CASE5 - DEFAULT) / 4   // case 5
rodata end

code section execute ip
// r0 = j
// r1 = x
// Test if j is outside of the range,
// use unsigned test to avoid testing for r0 < 0
if (uint32 r0 > 5) {
   jump DEFAULT
}
int64 r2 = address([jumptable])
int64 r3 = address([DEFAULT])
// relative jump with r3 = DEFAULT as reference point,
// r2 as table base and r0 as index
int16 jump_relative (r3, [r2 + r0 * 2])

CASE1:
   int32 r1 = 10
   jump FINISH
CASE2:
   int32 r1 = 12
   jump FINISH
CASE5:
   int32 r1 = 20
   jump FINISH
DEFAULT:
   int32 r1 = 99
FINISH:

code end
\end{lstlisting}
\vv

The operand type for the multiway jump\_relative instruction must match the size of the entries in the table of relative jump addresses, which is int16 in example \ref{exampleSwitchCase}. The scale factor must also match this size, which is 2 in this case. The size of the table entries must be big enough to contain the distance between the jump target and the reference point, divided by 4, as a signed integer.
The address of jumptable is loaded into a register because the jump\_relative instruction does not have space for both the address of the jump table and the index.
\vspace{4mm}
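
The target calculation of the multiway jump can be modelled in C. The following is a sketch of the semantics only. Code addresses are represented as plain integers, and the function names are invented for the illustration.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Model of "int16 jump_relative(ref, [table + index * 2])":
   target = ref + 4 * sign-extended table entry. */
static uint64_t jump_relative16(uint64_t ref, const int16_t *table,
                                uint32_t index)
{
    return ref + (uint64_t)((int64_t)table[index] * 4);
}

/* Multiway branch with a range check. The unsigned comparison
   also rejects negative values of j. */
static uint64_t switch_target(int32_t j, const int16_t *table,
                              uint32_t entries, uint64_t default_addr)
{
    if ((uint32_t)j >= entries) {
        return default_addr;
    }
    return jump_relative16(default_addr, table, (uint32_t)j);
}
\end{lstlisting}
\vv

Table entries of zero lead to the reference point itself, which is why the unused cases 0, 3 and 4 in example \ref{exampleSwitchCase} end up at DEFAULT without any extra test.
\vv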

A multiway call\_relative instruction is coded in the same way, using a table of relative function pointers. This can be useful for virtual functions in an object-oriented programming language with polymorphism. An example is provided on page \pageref{exampleVirtualFunctions}.
\vspace{4mm}


\subsection{Imports and exports} \label{assemblyImportExport}
A function, data symbol, or constant defined in one assembly module can be accessed from another module if it is exported from the first module and imported to the second module. The necessary cross references are calculated and inserted by the linker.
\vv

Symbols are imported as follows:
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} extern symbolname : attributes, symbolname : attributes, ...\\
\hline
\end{tabular}
\vv

The following attributes can be specified:
\vv

\begin{tabular}{|p{20mm}p{130mm}|}
\hline
function & the symbol is an executable function\\
ip & the symbol is a data object addressed relative to the ip pointer\\
datap & the symbol is a data object addressed relative to the datap pointer\\
threadp & the symbol is a data object addressed relative to the threadp pointer\\
constant & the symbol is a constant with no address\\
read & the symbol is in a readable data section\\
write & the symbol is in a writeable data section\\
execute & the symbol is executable code\\
data type & the data type for the data symbol\\
weak & weak linking: the symbol will be resolved only if it exists\\
reguse & register use, indicating which registers are modified by a function\\
\hline
\end{tabular}
\vv

An external symbol must have one, and only one, of the following attributes: function, ip, datap, threadp, constant.
The other attributes are optional. Multiple attributes are separated by space or comma.
\vv

The reguse option indicates which registers are modified by a function.
It is followed by two numbers indicating the use of general purpose registers and vector registers, respectively. Bit number n indicates that register number n is used. For example: reguse=0x1F,1 indicates that general purpose registers r0-r4 and vector register v0 are modified by the function.
\vv
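
The bit layout of the reguse numbers can be sketched in C as follows. The function names are invented for the illustration, and the limit of 32 registers per register file is an assumption of this sketch.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Returns 1 if register number n (0-31) is marked as modified
   in a reguse bit mask. */
static int reg_is_modified(uint32_t mask, unsigned n)
{
    return (int)((mask >> n) & 1);
}

/* Builds the mask for the register range first..last. */
static uint32_t reg_range_mask(unsigned first, unsigned last)
{
    uint32_t mask = 0;
    for (unsigned n = first; n <= last && n < 32; n++) {
        mask |= (uint32_t)1 << n;
    }
    return mask;
}
\end{lstlisting}
\vv

Here, reg\_range\_mask(0, 4) gives 0x1F, which matches the first number in reguse=0x1F,1.
\vv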

A weak external symbol will only be linked if it exists. The linker will not search function libraries to find the weak symbol, but the symbol will be resolved if it exists in a module that is linked in for other reasons.
A weak external constant or data symbol that is not resolved will be zero. A weak function that has not been resolved will return zero.
\vv

Symbols are exported as follows:
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} public symbolname : attributes, symbolname : attributes, ...\\
\hline
\end{tabular}
\vv

The possible attributes are the same as for external symbols. It may not be necessary to specify all the attributes because the attributes of locally defined symbols are already known.
\vv

It is convenient to place extern and public declarations at the beginning of the assembly file. Forward references are allowed.
\vv

Weak public symbols work as follows. If more than one module contains a weak public symbol with the same name then the linker will not issue an error message but link the first symbol.
If there is one occurrence of a non-weak symbol with this name then the non-weak symbol will be chosen. If there is more than one non-weak public symbol with the same name then the linker will issue an error message.
\vv
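
These rules can be expressed as a small decision procedure. The C code below is a sketch of the rules as stated here, not the linker's actual implementation. The structure, the function name, and the interpretation of "first" as first in link order are assumptions.

\begin{lstlisting}[frame=single]
#include <stddef.h>

struct public_symbol {
    int module;  /* index of the defining module, in link order */
    int weak;    /* nonzero for a weak public symbol */
};

/* Returns the index of the definition to link,
   -1 if there is no definition, or
   -2 if there is more than one non-weak definition (error). */
static int resolve_public(const struct public_symbol *defs,
                          size_t count)
{
    int first_weak = -1;
    int strong = -1;
    for (size_t i = 0; i < count; i++) {
        if (defs[i].weak) {
            if (first_weak < 0) first_weak = (int)i;
        } else {
            if (strong >= 0) return -2; /* duplicate non-weak */
            strong = (int)i;
        }
    }
    return strong >= 0 ? strong : first_weak;
}
\end{lstlisting}
\vv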

\subsection{Special address symbols} \label{SpecialAddressSymbols}
The following symbol names are defined by the linker:

\begin{longtable} {|p{35mm}|p{16mm}|p{85mm}|}
\caption{Special symbols}
\label{table:specialSymbols}\\
\endfirsthead
\endhead
\hline
\bfseries Name & \bfseries Base & \bfseries Meaning  \\
\hline
\_\_ip\_base & ip & The linker will place this symbol where read-only data ends and code begins.
It is possible to explicitly place this symbol elsewhere in an ip-addressable section.\\
\hline
\_\_datap\_base & datap & The linker will place this symbol where static initialized writeable data ends and uninitialized data begins.
It is possible to explicitly place this symbol elsewhere in a datap-addressable section.\\
\hline
\_\_threadp\_base & threadp & The linker will place this symbol at the beginning of thread-local data.
It is possible to explicitly place this symbol elsewhere in a threadp-addressable section.\\
\hline
\_\_program\_entry & ip & This symbol must be specified. It marks the first instructions to execute.
This must be a startup code that makes any necessary initialization and then calls \_main.
The library forwc.li includes a suitable startup code that will be included automatically
if \_\_program\_entry is not defined elsewhere.\\
\hline
\_\_event\_table & ip & Table of event handler records.\\
\hline
\_\_event\_table\_num & constant & Number of event handler records.\\
\hline
\end{longtable}

\vv


\subsection{Other directives} \label{assemblyOtherDirectives}

\subsubsection{Align}
The 'align' directive can align data or code:
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} align (number)\\
\hline
\end{tabular}
\vv

The number must be a power of 2. The next data or code will be aligned to an address divisible by the specified number. The beginning of the section will automatically be aligned to at least the same size.
\vv
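
The effect of alignment on an address can be sketched in C. The function names are invented for the illustration. The bit trick in align\_up relies on the alignment being a power of 2.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Returns 1 if n is a power of 2. */
static int is_power_of_2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Rounds address up to the next multiple of alignment.
   alignment must be a power of 2. */
static uint64_t align_up(uint64_t address, uint64_t alignment)
{
    return (address + alignment - 1) & ~(alignment - 1);
}
\end{lstlisting}
\vv

For example, align\_up(13, 4) gives 16, while an address that is already aligned is left unchanged.
\vv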


\subsubsection{Options} \label{optionsDirective}
The 'options' directive can change certain parameters with effect from the place of the directive. The syntax is:
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} options name = value, name = value, ...\\
\hline
\end{tabular}
\vv

The value must be an integer constant or an expression that can be evaluated immediately to an integer constant.
\vv

\begin{tabular}{|p{150mm}|}
\hline
\hspace{4mm} options codesize = 0x100000 \\
\hspace{4mm} options datasize = 0x4000 \\
\hline
\end{tabular}
\vv

These options change the codesize or datasize so that subsequent instructions
will use address sizes that fit the specified codesize or datasize as explained on page \pageref{SpecifyDataSize}. Setting codesize = 0 or datasize = 0 will reset these parameters to the value specified on the assembler command line, or a default value if no value is specified on the command line.
\vv


\subsection{Combining vectors of different lengths} \label{vectorsDifferentLengths}

The length of the destination register is the same as the length of the first vector register source operand when vectors of different lengths are combined. For example:
\vv

\begin{lstlisting}[frame=single]
float v1 = v2 - v3             // v1 has length of v2
float v1 = -v3 + v2            // Same, but v1 has length of v3
float v1 = v2 - [r3,length=r4] // Has length of v2, even if memory
                               // operand has a different length
\end{lstlisting}
\vspace{4mm}


\subsection{Event handlers} \label{EventHandlers}
You can specify event handlers for all the events defined in the file elf\_forwardcom.h.
An event handler consists of a function to be called when the event occurs, and a record
defining the event. The event record must contain the following fields:

\begin{longtable} {|p{20mm}|p{15mm}|p{100mm}|}
\caption{Event record structure}
\label{table:EventRecordStructure}\\
\endfirsthead
\endhead
\hline
\bfseries Name & \bfseries Size & \bfseries Meaning  \\
\hline
functionPtr & 32 bit & Pointer to the event function relative to \_\_ip\_base, scaled by 4.\\
\hline
priority    & 32 bit & If there is more than one handler for the same event then the ones with
the highest value of priority will be called first.
Normal priority = 0x1000.\\
\hline
key         & 32 bit & Subdivision of event. This can define a hotkey, menu item, or icon id for user command events.\\
\hline
event       & 32 bit & Event ID. Possible values are defined in the file \newline elf\_forwardcom.h.\\
\hline
\end{longtable}
\vv

The linker will make a table of all the event handler records, sorted by event, key, and priority.
\vv

The event function uses register r0 as status. The function should
return a value of 1 in r0 in the normal case. The function can return 0 in r0 to bypass further
event handlers with lower priority.
\vv

r1 can be used for further parameters. r2 can point to a zero-terminated string.
\vv

A common use of event handlers is to call initialization functions that must be called before \_main (constructors) and cleanup functions that must be called after \_main (destructors).
The following example shows an event handler for the event EVT\_CONSTRUCT (= 1), initializing some data.

\begin{example}
\label{exampleEventHandler}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]

// section for event handler records only
events section read ip event_hand
int32 (init_func - __ip_base) / 4, 0x1000, 0, 1
events end

// data section
data section read write datap
somedata: int64 0
data end

// code section
code section execute ip
init_func function
int64 r1 = 5
int64 [somedata] = r1 // initialize somedata to 5
int64 r0 = 1          // return status = 1
return
init_func end
code end

\end{lstlisting}
\vspace{4mm}

You can activate an event by calling the function \_call\_event in the library libc.li.
This function will search the table of event records for an event with the specified
event and key number and call all corresponding event handler functions.
\vv

The table of events cannot be changed while the application is running.
A module or library that is added by dynamic linking while the application is running
cannot add new event handler records.
\vv
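
The sorting and dispatching of event records can be modelled in C. This is a sketch under several assumptions and not the implementation in the linker or in \_call\_event: handlers are represented by C function pointers rather than scaled relative pointers, priority is treated as an unsigned number, events and keys are sorted in ascending order, and all names are invented for the illustration.

\begin{lstlisting}[frame=single]
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*handler_fn)(void); /* returns the status from r0 */

struct event_record {
    handler_fn function; /* stands in for functionPtr */
    uint32_t priority;   /* normal priority = 0x1000 */
    uint32_t key;
    uint32_t event;
};

/* Order by event, then key, then priority (highest first). */
static int compare_records(const void *a, const void *b)
{
    const struct event_record *x = a, *y = b;
    if (x->event != y->event) return x->event < y->event ? -1 : 1;
    if (x->key != y->key) return x->key < y->key ? -1 : 1;
    if (x->priority != y->priority)
        return x->priority > y->priority ? -1 : 1;
    return 0;
}

static void sort_event_table(struct event_record *table, size_t count)
{
    qsort(table, count, sizeof table[0], compare_records);
}

/* Calls the handlers for (event, key) in a sorted table and
   returns the number of handlers called. A handler that
   returns 0 bypasses the handlers with lower priority. */
static int call_event_model(const struct event_record *table,
                            size_t count, uint32_t event, uint32_t key)
{
    int called = 0;
    for (size_t i = 0; i < count; i++) {
        if (table[i].event != event || table[i].key != key) continue;
        called++;
        if (table[i].function() == 0) break;
    }
    return called;
}

/* Two example handlers for experiments. */
static int handler_continue(void) { return 1; }
static int handler_stop(void) { return 0; }
\end{lstlisting}
\vv

If three handlers with priorities 0x2000, 0x1000 and 0x0800 are registered for the same event and key, and the second one returns 0, then only the first two are called.
\vv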



\section{Metaprogramming} \label{assemblyMetaprogramming}
Metaprogramming refers to variables and instructions that do not form part of the final executable code but
are useful for controlling the assembly process.
\vv

Traditional assemblers use macros for this purpose. The syntax for this is often confusing, especially when macros are nested.
\vv

Instead, the ForwardCom assembler will implement metaprogramming involving the following features:

\begin{itemize}
\item variables that are used only during the assembly process
\item include files
\item assembly-time branches and loops
\item assembly-time functions
\item generate a text string and emit this string as assembly code
\end{itemize}

Metaprogramming commands are indicated by a percent sign at the beginning of the line.
At present, only variables are implemented. The other metaprogramming features will be added later.
\vv


\subsection{Metaprogramming variables} \label{MetaprogrammingVariables}

Metaprogramming variables are defined and assigned on a separate line in the following way:
\vv

\begin{tabular}{|p{154mm}|}
\hline
\hspace{4mm} \% name = value\\
\hline
\end{tabular}
\vspace{4mm}

Meta-variables are weakly typed. They can have one of the following types:

\begin{tabular}{|p{30mm}p{120mm}|}
\hline
integer & evaluated as signed 64-bit integer \\
floating point & evaluated as double precision\\
string & ASCII or UTF-8 text string\\
register & the variable can be an alias for any register\\
memory operand & the variable can be an alias for any memory operand\\
type name & the variable can be an alias for a type\\
\hline
\end{tabular}
\vv

It makes no difference whether this meta-code is inside a section or not.
\vv

Meta-variables can be redefined or reassigned at any time. Meta-code is sequential so that the same variable can have different values at different places in the code.
\vv

An integer meta-variable can be exported as a constant with a public declaration if it has only one value. This is the only case where a forward reference to a meta-symbol is allowed.
\vv

While general assembly code allows forward references to data and code labels, meta-code cannot in general have forward references.
\vv

Example:
\vv

\begin{lstlisting}[frame=single]
  % A = 5         // meta-variable integer A = 5
  % R = r1        // alias for register r1
  int32 R = A     // r1 = 5
  % A++           // change value of A to 6
  int32 R = A     // r1 = 6
\end{lstlisting}
\vv



\section{Code examples}\label{chap:codeExamples}

This section contains examples of assembly code to illustrate the features of the ForwardCom instruction set.
The syntax for assembly language is described in chapter \ref{chap:assembler}.
The function calling conventions are described in chapter \ref{chap:functionCallingConventions}.
\vv


\subsection{Horizontal vector add} \label{horizontalVectorAdd}
ForwardCom has no instruction for adding all elements of a vector because this would be a complex instruction with variable latency depending on the vector length.
\vv

The sum of all elements in a vector can be calculated by repeatedly adding the lower half and the upper half of the vector. This method is illustrated by the following example, finding the horizontal sum of a vector of single precision floats.

\begin{example}
\label{exampleHorizontalAdd}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
v0 = my_vector              // we want the horizontal sum of this
int64 r0 = get_len(v0)      // length of vector in bytes
int64 r0 = roundp2(r0, 1)   // round up to nearest power of 2
float v0 = set_len(v0, r0)  // adjust vector length
while (uint64 r0 > 4) {     // loop to calculate horizontal sum
   uint64 r0 >>= 1          // the vector length is halved
   float v1 = shift_reduce(v0,r0) // get upper half of vector
   // the result vector has the length of the first operand:
   float v0 = v1 + v0       // Add upper half and lower half
}
// The sum is now a scalar in v0
\end{lstlisting}
\vspace{4mm}
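
The halving method can be modelled on an ordinary array in C. This is a sketch of the algorithm and not a translation of the instructions: lengths are counted in elements rather than bytes, the capacity MAX\_ELEMENTS is an arbitrary assumption, and the function names are invented for the illustration.

\begin{lstlisting}[frame=single]
#include <stddef.h>

#define MAX_ELEMENTS 64  /* hypothetical capacity of the model */

/* Smallest power of 2 that is >= n, for n >= 1. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Horizontal sum by repeated halving.
   n must be in the range 1..MAX_ELEMENTS. */
static float horizontal_sum(const float *x, size_t n)
{
    float v[MAX_ELEMENTS] = {0};   /* padding elements are zero */
    size_t len = round_up_pow2(n);
    for (size_t i = 0; i < n; i++) v[i] = x[i];
    while (len > 1) {
        len >>= 1;                 /* the vector length is halved */
        for (size_t i = 0; i < len; i++) {
            v[i] += v[i + len];    /* lower half += upper half */
        }
    }
    return v[0];
}
\end{lstlisting}
\vv

The zero padding is harmless here because adding zero does not change a sum. The number of halving steps grows with the logarithm of the vector length.
\vv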


\subsection{Horizontal vector minimum} \label{horizontalVectorMin}
The same method can be used for other horizontal operations. It may cause problems that the set\_len instruction inserts elements of zero if the vector length is not a power of 2. Special care is needed if the operation does not allow extra elements of zero, for example if the operation involves multiplication or finding the minimum element. A possible solution is to mask off the unused elements in the first iteration. The following example finds the smallest element in a vector of floating point numbers:

\begin{example}
\label{exampleHorizontalMin}
\end{example}
\begin{lstlisting}[frame=single]
v0 = my_vector             // find the smallest element in this
int64 r0 = get_len(v0)     // length of vector in bytes
int64 r1 = roundp2(r0, 1)  // round up to nearest power of 2
uint64 r1 >>= 1            // half length
float v1 = shift_reduce(v0, r1) // upper part of vector
int64 r2 = r0 - r1         // length of v1
float v0 = set_len(v0, r1) // reduce length of v0
// make mask for length of v1 because the two operands may
// have different length
int64 v2 = mask_length(v0, r2, 0), options=4
// Get minimum. Elements of v0 fall through where v1 is empty
float v0 = min(v0, v1), mask=v2, fallback=v0 // minimum
// loop to calculate the rest. vector length is now a power of 2
while (uint64 r1 > 4) {
   // Halve the vector length
   uint64 r1 >>= 1
   // Get upper half of vector
   float v1 = shift_reduce(v0, r1)
   // Get minimum of upper half and lower half
   float v0 = min(v1, v0)  // has the length of the first operand
}
// The minimum is now a scalar in v0
\end{lstlisting}
\vspace{4mm}
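
The handling of the incomplete upper part can be modelled in C. This is a sketch of the method only: lengths are counted in elements rather than bytes, the masked first step is represented by a loop over the elements that actually exist in the upper part, and the function name is invented for the illustration.

\begin{lstlisting}[frame=single]
#include <stddef.h>

/* Horizontal minimum by repeated halving, for n >= 1.
   The array is modified in place. */
static float horizontal_min(float *v, size_t n)
{
    /* half = n rounded up to a power of 2, divided by 2 */
    size_t half = 1;
    while (half * 2 < n) {
        half <<= 1;
    }
    /* First step: the upper part may be shorter than half.
       Elements without a partner are left unchanged. */
    size_t upper = n - half;
    for (size_t i = 0; i < upper; i++) {
        if (v[i + half] < v[i]) v[i] = v[i + half];
    }
    /* The remaining length is a power of 2. */
    for (size_t len = half; len > 1; ) {
        len >>= 1;
        for (size_t i = 0; i < len; i++) {
            if (v[i + len] < v[i]) v[i] = v[i + len];
        }
    }
    return v[0];
}
\end{lstlisting}
\vv

For the input 5, 4, 8 the result is 4. Padding the vector with a zero element instead would have given the wrong result 0.
\vv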




\subsection{Boolean operations} \label{BooleanOperations}
Boolean combinations of conditions can be implemented with branches as shown in this example.
\vv

\begin{example}
\label{exampleBooleanOperations1}
\end{example}
\begin{lstlisting}[frame=single]
/* C code:
float condfunc (float a, float b) {
   if (a >= 0 && (a < 20 || a == b)) {
      a = sqrt(a);
   }
   return a;
}
*/

// ForwardCom code:

code section execute ip
extern _sqrtf : function

// v0 = a, v1 = b
_condfunc1 function public
if (float v0 >= 0) {
   if (float v0 < 20) {jump L1}
   if (float v0 == v1) {
      L1:
      call _sqrtf
   }
}
return  // return value is in v0
_condfunc1 end

code end
\end{lstlisting}
\vspace{4mm}

Branches can be quite slow, especially if they are poorly predictable.
It is often faster to generate boolean variables for each condition and use bit operations to combine them.
This corresponds to replacing \&\& with \& and $\vert\vert$ with $\vert$.
The code below shows the same example where three conditional jumps are reduced to one conditional jump and two bit operations.
Note that this transformation is not valid if the evaluation of unused conditions has side effects.

\begin{example}
\label{exampleBooleanOperations2}
\end{example}
\begin{lstlisting}[frame=single]
_condfunc2 function public
float v2 = v0 >= 0     // boolean variable for a >= 0
float v3 = v0 < 20     // boolean variable for a < 20
float v4 = v0 == v1    // boolean variable for a == b
int32+ v3 |= v4        // a < 20 || a == b
int32+ v2 &= v3        // a >= 0 && (a < 20 || a == b)
if (float v2 & 1) {    // test bit 0 of boolean v2
   call _sqrtf
}
return
_condfunc2 end
\end{lstlisting}
\vspace{4mm}
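
The replacement of \&\& and $\vert\vert$ by bit operations can also be illustrated in C. The sketch below is not taken from any ForwardCom tool, and the function names are hypothetical. Both functions return the same truth value, but the second one evaluates all three comparisons unconditionally, which is only valid because the comparisons have no side effects. An optimizing C compiler may generate branch-free code for both versions.

\begin{lstlisting}[frame=single]
/* Short-circuit evaluation: may compile to several
   conditional jumps. */
int cond_logical(float a, float b) {
    return a >= 0 && (a < 20 || a == b);
}

/* One boolean variable per condition, combined with
   bit operations. */
int cond_bitwise(float a, float b) {
    int c1 = a >= 0;   /* each comparison yields 0 or 1 */
    int c2 = a < 20;
    int c3 = a == b;
    return c1 & (c2 | c3);
}
\end{lstlisting}
\vv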
1525
 
1526
We can reduce the number of instructions and make the code still faster by using a special feature of the compare instruction that uses the mask and fallback registers as extra boolean operands:
1527
 
1528
\begin{example}
1529
\label{exampleBooleanOperations3}
1530
\end{example}
1531
\begin{lstlisting}[frame=single]
1532
_condfunc3 function public
1533
float v2 = v0 >= 0     // boolean variable for a >= 0
1534
float v3 = v0 == v1    // boolean variable for a == b
1535
// use mask and fallback register as extra boolean operands
1536
float v4 = compare(v0, 20), options=0x22, mask=v2, fallback=v3
1537
if (float v4 & 1) {    // test bit 0 of boolean v4
1538
   call _sqrtf
1539
}
1540
return
1541
_condfunc3 end
1542
\end{lstlisting}
1543
\vspace{4mm}
1544
 
1545
The high-level operators \&\&, $\vert\vert$, and \^{}\^{} allow more intuitive coding:

\begin{example}
\label{exampleBooleanOperations4}
\end{example}
\begin{lstlisting}[frame=single]
_condfunc4 function public
float v3 = v0 < 20          // boolean variable for a < 20
float v3 = (v0 == v1) || v3 //  a == b || a < 20
float v3 = v0 >= 0 && v3    // a >= 0 && (a == b || a < 20)
if (float v3 & 1) {         // test bit 0 of boolean v3
   call _sqrtf
}
return
_condfunc4 end
\end{lstlisting}
\vspace{4mm}


\subsection{Virtual functions} \label{virtualFunctions}
Virtual functions are used in C++ for polymorphic classes.
This example shows how to implement a class with virtual functions in ForwardCom.
We can save space by using 32-bit relative pointers rather than 64-bit absolute pointers as other systems do.
\vv

\begin{example}
\label{exampleVirtualFunctions}
\end{example}
\begin{lstlisting}[frame=single]
/* C++ code:

class VirtClass {
public:
   // constructor:
   VirtClass() {x = 0;}
   // virtual functions:
   virtual void func1() {x++;}
   virtual int  func2() {return x;}
protected:
   int x;
};

int test() {
   VirtClass obj;      // create object
   obj.func1();        // call virtual function 1
   return obj.func2(); // call virtual function 2
}
*/

// ForwardCom code:
rodata section read ip align = 4
// table of virtual function pointers for VirtClass
// with REFPOINT as reference point
VirtClasstable:
int32 (VirtClass_func1 - REFPOINT) / 4
int32 (VirtClass_func2 - REFPOINT) / 4
rodata end

code section execute ip
// choose any reference point for the relative pointers,
// for example the beginning of the code section:
REFPOINT: nop

VirtClass_constructor function public
// The pointer 'this' is in r0
// At [r0] is a relative pointer to VirtClasstable,
// next the class data members, in this case: x
int32 r1 = VirtClasstable - REFPOINT
int32 [r0] = r1
int32 [r0+4] = 0
return
VirtClass_constructor end

VirtClass_func1 function
// The pointer 'this' is in r0
// x is in [r0+4]
int32 r1 = [r0+4]
int32 r1++
int32 [r0+4] = r1
return
VirtClass_func1 end

VirtClass_func2 function
int32 r0 = [r0+4]
return
VirtClass_func2 end

_test function public
push (r16)       // save register
// get the address of the reference point
int64 r16 = address([REFPOINT])
// the object 'obj' uses 8 bytes, allocate space on the
// stack for it
int64 sp -= 8
// call the constructor. The 'this' pointer must be in r0
int64 r0 = sp
call VirtClass_constructor
// r0 still points to the object because a constructor
// always returns a reference to the object.
// Get the address of the virtual table
int32 r1 = sign_extend_add(r16, [r0])
// call VirtClass_func1 as the first table entry
int32 call_relative (r16, [r1])
// get a pointer to the object again
int64 r0 = sp
// Get the address of the virtual table
int32 r1 = sign_extend_add(r16, [r0])
// call VirtClass_func2 as the second table entry
int32 call_relative (r16, [r1+4])
// release space allocated for 'obj'
int64 sp += 8
pop (r16)        // restore register
// the return value from VirtClass_func2 is in r0
return
_test end

code end
\end{lstlisting}
\vspace{4mm}
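
The address arithmetic behind the relative pointers can be sketched in C, with plain integers standing in for addresses. The helper names {\ttfamily make\_relative} and {\ttfamily resolve\_relative} are hypothetical. The scale factor of 4 for function pointers is read off the division by 4 in the table of the example above, while the pointer to the table itself is stored unscaled. The sketch assumes that the distance is divisible by the scale factor and that the quotient fits in 32 bits.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Convert an address to a 32-bit pointer relative to ref.
   In the example, scale is 4 for function addresses and
   1 for the address of the table. */
int32_t make_relative(int64_t target, int64_t ref,
                      int64_t scale) {
    return (int32_t)((target - ref) / scale);
}

/* Recover the address:
   ref + sign-extended relative pointer * scale */
int64_t resolve_relative(int64_t ref, int32_t rel,
                         int64_t scale) {
    return ref + (int64_t)rel * scale;
}
\end{lstlisting}

Each table entry occupies 4 bytes in this scheme, compared with 8 bytes for an absolute 64-bit pointer.
\vv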


% \subsection{Memory copying} \label{memcpy}
% Needs revision to obey alignment


\subsection{High precision arithmetic} \label{highPrecisionArithmetic}
Function libraries for high precision arithmetic typically use a long sequence of add-with-carry instructions for adding integers with a very large number of bits. A more efficient method for big number calculation is to use vector addition and a carry-look-ahead method. The following algorithm calculates A + B, where A and B are big integers represented as two vectors of n$\cdot$64 bits each, where n \textless{} 64.
\vv

\begin{example}
\label{exampleHighPrecisionArithmetic}
\end{example}
\begin{lstlisting}[frame=single]
uint64 v0 = A                // first vector,  n*64 bits
uint64 v1 = B                // second vector, n*64 bits
uint64 v2 = carry_in         // single bit in vector register
uint64 v0 += v1              // sum without intermediate carries
uint64 v3 = v0 < v1          // carry generate = (SUM < B)
uint64 v4 = v0 == -1         // carry propagate = (SUM == -1)
uint64 r0 = get_len(v0)      // length of vector in bytes
uint64 v3 = bool2bits(v3)    // compressed to bitfield
uint64 v4 = bool2bits(v4)    // compressed to bitfield
// calculate propagated additional carry:
// CA = CP ^ (CP + (CG<<1) + CIN)
uint64 v3 <<= 1              // shift left carry generate
uint64 v2 = v2 + v3 + v4
uint64 v2 ^= v4
uint64 v1 = bits2bool(r0,v2) // expand additional carry to vector
uint64 v0 += v1              // add correction to sum
uint64 r0 >>= 3              // n = number of elements in vectors
uint64 v3 = gp2vec(r0)       // copy to vector register
uint64 v2 >>= v3             // carry out
// v0 = sum, v2 = carry out
\end{lstlisting}
\vv

If the numbers A and B are longer than the maximum vector length then the algorithm is repeated. If the vector length is more than 64 * 8 bytes then the calculation of the additional carry involves more than 64 bits, which again requires a big number algorithm.
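\vv

The carry calculation in the example above can be checked with a scalar C model. This is a sketch only: the function name {\ttfamily bigadd\_lookahead} is hypothetical, the numbers are stored with the least significant 64-bit element first, and n must be less than 64 so that all carry bits fit in one 64-bit word.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* sum = a + b + carry_in for n-element big integers, n < 64.
   Returns the carry out (0 or 1). */
unsigned bigadd_lookahead(uint64_t *sum, const uint64_t *a,
                          const uint64_t *b, unsigned n,
                          unsigned carry_in) {
    uint64_t cg = 0;  /* carry generate, one bit per element  */
    uint64_t cp = 0;  /* carry propagate, one bit per element */
    for (unsigned i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];   /* sum without carries */
        cg |= (uint64_t)(sum[i] < b[i]) << i;
        cp |= (uint64_t)(sum[i] == UINT64_MAX) << i;
    }
    /* propagated additional carry:
       CA = CP ^ (CP + (CG<<1) + CIN) */
    uint64_t ca = cp ^ (cp + (cg << 1) + (carry_in & 1));
    for (unsigned i = 0; i < n; i++) {
        sum[i] += (ca >> i) & 1;   /* add correction to sum */
    }
    return (unsigned)((ca >> n) & 1);   /* carry out */
}
\end{lstlisting}

The formula works because an element cannot both generate and propagate a carry: adding the shifted generate bits to the propagate bits lets the ordinary integer addition ripple each carry through consecutive propagate bits.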
1702
 
1703
\subsection{Matrix multiplication} \label{matrixMultiplication}
1704
Matrix operations can be difficult because they involve a lot of permutations. The following example shows the multiplication of two $4\times4$ matrixes of floating point numbers, assuming that the vector registers are long enough to contain an entire matrix.
1705
\vv
1706
 
1707
\begin{example}
1708
\label{exampleMatrixMultiplication}
1709
\end{example}
1710
\begin{lstlisting}[frame=single]
1711
float v1 = first_matrix          // first matrix,  4x4 floats
1712
float v2 = second_matrix         // second matrix, 4x4 floats
1713
int64 r1 = 64                    // size of entire matrix in bytes
1714
int64 r2 = 1                     // shift count, elements
1715
int64 r3 = 4                     // shift count, elements
1716
float v0 = replace(v1,0)         // make a matrix of zeroes
1717
for (int64 r0 = 0; r0 < 4; r0++) { // repeat 4 times
1718
   float v3 = repeat_within_blocks(v1,r1,16) // repeat column
1719
   float v4 = repeat_block(v2,r1,16)         // repeat row
1720
   float v0 = v0 + v3 * v4       // multiply rows and columns
1721
   float v1 = shift_down(v1, r2) // next column
1722
   float v2 = shift_down(v2, r3) // next row
1723
}
1724
// Result is in v0.
1725
\end{lstlisting}
1726
\vv
1727
You may roll out the loop and calculate partial sums separately to
1728
reduce the loop-carried dependency chain of v0
1729
\vv
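
The loop in the example above accumulates the product as a sum of four terms, each combining one column of the first matrix with one row of the second. A scalar C model of the same calculation is sketched below. It assumes row-major storage, which is how the comments in the example are interpreted here, and the function name {\ttfamily matmul4x4} is hypothetical.

\begin{lstlisting}[frame=single]
/* c = a * b for 4x4 matrices in row-major order */
void matmul4x4(float c[16], const float a[16],
               const float b[16]) {
    for (int i = 0; i < 16; i++) c[i] = 0.0f;
    for (int k = 0; k < 4; k++) {       /* repeat 4 times */
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                /* column k of a, repeated within each row,
                   times row k of b, repeated for every row */
                c[4*i + j] += a[4*i + k] * b[4*k + j];
            }
        }
    }
}
\end{lstlisting}
\vv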
1730
 
1731
 
1732
\section{Detecting support for particular instructions}
1733
\label{DetectingSupportForInstructions}
1734
The capabilities registers can give information about the capabilities of the processor the code is running on, including support for certain instructions and features, and maximum vector length. See page \pageref{table:capabilitiesRegisters} for details.
1735
\vv
1736
 
1737
While ForwardCom is in a phase of experimental development, there may not be specific bits in the capabilities registers for every instruction that may be supported. An alternative way of testing whether a particular instruction is supported is to disable error trapping and try to execute the instruction.
1738
This can be done without system access. For example:
1739
 
1740
\begin{example}
1741
\label{exampleDetectInstructionSupport}
1742
\end{example}
1743
\begin{lstlisting}[frame=single]
1744
// Disable error traps for unknown instructions and wrong operands
1745
int r0 = 3
1746
int64 capab2 = write_capabilities(r0, 0)
1747
// Reset counter registers
1748
int r0 = read_perfs(perf16, 0)
1749
// Try to execute instruction
1750
int r0 = userdef56(r0, r0)
1751
// Read counters
1752
int r1 = read_perfs(perf16, 1)   // counter for unknown instruction
1753
int r2 = read_perfs(perf16, 2)   // counter for wrong operands
1754
// Enable error traps again
1755
int r0 = 0;
1756
int64 capab2 = write_capabilities(r0, 0)
1757
// Test if any of the two counters is nonzero, and
1758
// jump to some code if the instruction is not supported
1759
int r1 = r1 | r2, jump_nzero INSTRUCTION_NOT_SUPPORTED
1760
\end{lstlisting}
1761
\vv
1762
 
1763
 
1764
 
1765
\section{Optimization of code}
\label{OptimizationOfCode}

The ForwardCom system is designed with high performance as a top priority. The following guidelines may help programmers and compiler makers obtain optimal performance.
\vv

\subsubsection{Use general purpose registers for control code}
Any variables that control program execution, such as branch conditions and loop counters, should preferably be stored in general purpose registers rather than memory or vector registers. This may enable the microprocessor to resolve branches early and prefetch the code after the branch earlier.
\vv

\subsubsection{Efficient loops}
The overhead of a loop can be reduced to a single instruction in most cases.
A loop that counts up is most efficient when the loop counter is a 32-bit signed integer incremented by 1 and the loop condition is expressed as counter \textless{} limit or counter \textless{}= limit. A loop that counts down is most efficient when the condition is expressed as counter \textgreater{} 0 or counter \textgreater{}= 0.
Examples:

\begin{lstlisting}[frame=single]
for (int32 r1 = 0; r1 < r2; r1++) {  }
for (int32 r1 = 1; r1 <= 100; r1++) {  }
for (int32 r1 = 100; r1 > 0; r1--) {  }
\end{lstlisting}

The assembler will not insert an initial check before the loop if the start and end values are both constants.
\vv

Array loops are particularly efficient if the vector loop feature is used. See the example on page \pageref{examplePolyn}.
Loops containing function calls can be vectorized if the functions allow vector parameters.
\vv

\subsubsection{Avoid long dependency chains}
ForwardCom may be implemented on a superscalar processor that can execute multiple instructions simultaneously. This works most efficiently if the code does not contain long dependency chains.
\vv
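
A common way to shorten a dependency chain is to split a reduction into several independent partial sums. The following C sketch is illustrative only, and the function names are hypothetical. Note that floating point addition is not associative, so the two versions may round differently.

\begin{lstlisting}[frame=single]
/* One accumulator: every addition depends on the
   previous one. */
float sum_chain(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Four accumulators: four independent, shorter chains. */
float sum_partial(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];  /* remaining elements */
    return (s0 + s1) + (s2 + s3);
}
\end{lstlisting}
\vv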
1796
 
1797
\subsubsection{Minimize instruction size}
1798
The assembler will automatically pick the smallest possible version of an instruction.
1799
Instructions can have different versions with 8-bit, 16-bit, 32-bit, and 64-bit integer constants.
1800
A large integer constant with few significant bits can be represented as a smaller constant with a left shift. For example, the constant 0x5000000000 can be represented as 5 $<<$ 36. The assembler will do this automatically when possible.
1801
\vv
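
The decomposition of a constant into a small value and a shift count can be sketched in C as follows. This is not the assembler's actual algorithm. The function name {\ttfamily split\_shifted\_constant} is hypothetical, and the number of bits available for the small constant is left as a parameter because it depends on the instruction format.

\begin{lstlisting}[frame=single]
#include <stdbool.h>
#include <stdint.h>

/* Write value as (small << shift) with the smallest possible
   'small'. Returns true if 'small' has at most 'bits'
   significant bits (1 <= bits <= 63). */
bool split_shifted_constant(uint64_t value, unsigned bits,
                            uint64_t *small, unsigned *shift) {
    unsigned s = 0;
    if (value != 0) {
        while ((value & 1) == 0) {  /* strip trailing zeroes */
            value >>= 1;
            s++;
        }
    }
    *small = value;
    *shift = s;
    return (value >> bits) == 0;    /* does the rest fit? */
}
\end{lstlisting}

With the constant from the text, 0x5000000000, the function finds the small constant 5 and the shift count 36.
\vv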
1802
 
1803
Small floating point constants with no decimals can be represented as 8-bit signed integers. Simple rational numbers where the denominator is a power of 2 can be represented as half precision without loss of precision. The assembler does this automatically, too. For example, the constant 2.25 can be coded in half precision without loss of precision, while the constant 2.24 can not.
1804
\vv
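
Whether a constant survives such a conversion can be tested by inspecting its bits. The C sketch below is an illustration, not the assembler's implementation. The function names are hypothetical, and the code assumes that {\ttfamily double} is an IEEE 754 double precision number. For simplicity it rejects values that would be subnormal, infinite, or NaN in half precision.

\begin{lstlisting}[frame=single]
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if x is an integer in the range of int8 */
bool fits_int8(double x) {
    return x >= -128.0 && x <= 127.0 && x == (double)(int)x;
}

/* True if x is zero or a normal half precision number:
   exponent in the range -14 to 15 and no more than
   10 explicit mantissa bits */
bool fits_half_precision(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* get bit pattern */
    if (x == 0.0) return true;
    int exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    if (exponent < -14 || exponent > 15) return false;
    /* double has 52 mantissa bits, half has 10:
       the lowest 42 bits must be zero */
    return (bits & ((1ULL << 42) - 1)) == 0;
}
\end{lstlisting}
\vv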
1805
 
1806
Memory addresses can use an offset of 8, 16, or 32 bits relative to a base pointer.
1807
A 32-bit offset is needed for data in static memory when the data size is large or not specified (see page \pageref{SpecifyDataSize}). A smaller offset is possible when data are addressed relative to a stack pointer, structure pointer, class pointer, or a pointer to a strategically placed reference point. An smaller offset will often allow the assembler to use a smaller instruction format.
1808
\vv
1809
 
1810
It is recommended to check the output listing from the assembler to see
1811
how much space each instruction takes. Sometimes, you can reduce the instruction size by simple changes in the code. Many instructions will be smaller if the first source register is the same as the destination register.
1812
\vv
1813
 
1814
 
1815
\subsubsection{Optimize cache use}
1816
Memory and cache throughput is often a bottleneck. You can improve caching
1817
in several ways:
1818
 
1819
\begin{itemize}
1820
\item Optimize code caching by minimizing instruction sizes and inlining functions.
1821
\item Optimize data caching by embedding immediate constants in instructions.
1822
\item Use register variables instead of saving variables in memory.
1823
\item Avoid spilling registers to memory by using the information about register use in object files, as described on page \pageref{chap:registerUsageConvention}.
1824
\item Use relative pointers rather than absolute pointers for pointer tables and function pointers in static memory (see page \pageref{relativeDataPointer}).
1825
\end{itemize}
1826
\vv
1827
 
1828
\subsubsection{Calculate pointers early}
1829
Registers used for pointers, array index, or for specifying the vector length of a memory operand are used at an early stage in the pipeline. The CPU may have to wait for these registers if they are not available at the time they are needed. Therefore, it is recommended to calculate pointer, array index, and vector length before the other instruction operands.
1830
\vv
1831
 
1832
\subsubsection{Optimize jumps}
1833
The assembler will merge a jump with a preceding arithmetic instruction if possible, unless optimization is turned off. For example, an integer addition followed by a conditional jump that compares the result with zero may be merged into a single instruction. This is only possible when a number of conditions are fulfilled:
1834
\begin{itemize}
1835
\item the two instructions have the same data type
1836
\item the destination of the arithmetic instruction is the same register as the source of the branch instruction
1837
\item the arithmetic instruction uses the same register for source and destination
1838
\item there are no other instructions between the two, and no jump label at the branch instruction.
1839
\end{itemize}
1840
 
1841
The output listing will show if the two instructions have been merged.
1842
\vv
1843
 
1844
Optimization of chained jumps, etc., is generally the responsibility of the compiler. The assembler will do only a few simple jump optimizations.
1845
\vv
1846
 
1847
\subsubsection{Avoid conditional jumps}
1848
Conditional and indirect jumps are quite slow.
1849
Conditional jumps can sometimes be replaced by conditional execution of instructions with the use of mask registers. This can be advantageous even if it requires a few more instructions. It is good to calculate the mask register before the operands are calculated because some ForwardCom implementations will not wait for delayed operands if the mask is already known to be zero so that the operands are not needed.
1850
\vv
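
The idea of replacing a conditional jump by a masked operation can be modeled in C. The sketch below is an analogy rather than a description of the hardware. The function name {\ttfamily add\_masked} is hypothetical, and the fallback value is chosen here to be the first operand.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Model of a masked addition: the sum is always calculated.
   Bit 0 of the mask selects between the sum and the
   fallback value, here the first operand a. */
uint32_t add_masked(uint32_t a, uint32_t b, uint32_t mask) {
    uint32_t sum  = a + b;
    uint32_t keep = 0u - (mask & 1u); /* all ones or all zeroes */
    return (sum & keep) | (a & ~keep);
}
\end{lstlisting}

The call {\ttfamily add\_masked(a, b, cond)} gives the same result as adding b to a only if bit 0 of cond is set, but without a branch in the source code.
\vv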
1851
 
1852
\subsubsection{Specify data size and code size} \label{SpecifyDataSize}
1853
It is recommended to specify a maximum size for code and static data on the assembler command line (see page \pageref{assemblerCommandLine})
1854
or in a directive (see page \pageref{optionsDirective}).
1855
This allows the assembler to optimize relative addresses for both code and data to the minimum number of bits necessary.
1856
You may use a link map to see how much memory is needed for code and static data, and add some extra for future additions to the code.
1857
A link map can be generated during the link process with the link option
1858
 {\ttfamily -map=filename}, or after linking with the command {\ttfamily forw -dump-L filename.ex}
1859
\vv
1860
 
1861
Direct jump and call instructions use 24 or 32 bits for relative addresses. Conditional jumps use 8, 16, 24, or 32 bits. The table below shows the largest distance you can jump with these numbers of bits, using signed relative offsets scaled by 4:
1862
\vv
1863
 
1864
\begin{tabular}{|p{20mm}|p{140mm}|}
1865
\hline
1866
\bfseries Bits & \bfseries Jump distance \\
1867
\hline
1868
8 & 508 bytes \\
1869
16 & 128 kbytes \\
1870
24 & 32 Mbytes \\
1871
32 & 8 Gbytes \\
1872
\hline
1873
\end{tabular}
1874
\vv
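
The numbers in the table follow from the largest positive value of a signed offset field multiplied by 4. The C sketch below reproduces them and picks the smallest offset size for a given forward distance. The function names are hypothetical, and the list of sizes is the one given above for conditional jumps.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Largest forward jump distance in bytes for a signed
   offset of 'bits' bits scaled by 4 (2 <= bits <= 32) */
int64_t max_jump_bytes(unsigned bits) {
    int64_t max_offset = ((int64_t)1 << (bits - 1)) - 1;
    return max_offset * 4;
}

/* Smallest offset size for a conditional jump over
   'distance' bytes (distance >= 0). Returns 0 if too far. */
unsigned smallest_jump_bits(int64_t distance) {
    static const unsigned sizes[4] = {8, 16, 24, 32};
    for (int i = 0; i < 4; i++) {
        if (distance <= max_jump_bytes(sizes[i])) {
            return sizes[i];
        }
    }
    return 0;
}
\end{lstlisting}

For example, {\ttfamily max\_jump\_bytes(8)} gives 127 $\cdot$ 4 = 508 bytes, and a distance of 100000 bytes fits in a 16-bit offset.
\vv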
1875
 
1876
For example, if you specify codesize=100000, then you will be able to make conditional jumps to external labels using 16 bits for relative addresses.
1877
The distance to local labels within the same assembly file and the same section are calculated by the assembler and optimized to the appropriate number of bits regardless of the specified codesize.
1878
\vv
1879
 
1880
Instructions with a memory operand in static memory are using an offset relative to a base pointer, typically DATAP. The offset can be 8, 16, or 32 bits. 8-bit offsets are scaled by the operand size. 16-bit and 32-bit offsets are not scaled. The maximum distance from the base pointer you can address with different sizes of offset are listed in the following table:
1881
\vv
1882
 
1883
\begin{tabular}{|p{20mm}|p{20mm}|p{116mm}|}
1884
\hline
1885
\bfseries Bits & \bfseries Data type & \bfseries Addressing range \\
1886
\hline
1887
8 & int8 & 127 bytes \\
1888
8 & int16 & 254 bytes \\
1889
8 & int32 & 508 bytes \\
1890
8 & int64 & 1 kbyte \\
1891
16 & any & 32 kbytes \\
1892
32 & any & 2 Gbytes \\
1893
\hline
1894
\end{tabular}
1895
\vv
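
The addressing ranges in the table can be reproduced with a small calculation, sketched here in C. The function name {\ttfamily max\_data\_offset} is hypothetical. Following the text above, only the 8-bit offset is multiplied by the operand size.

\begin{lstlisting}[frame=single]
#include <stdint.h>

/* Largest positive byte offset for a signed offset field of
   'bits' bits (8, 16 or 32) and an operand of
   'operand_size' bytes */
int64_t max_data_offset(unsigned bits, unsigned operand_size) {
    int64_t max_field = ((int64_t)1 << (bits - 1)) - 1;
    /* only 8-bit offsets are scaled by the operand size */
    int64_t scale = (bits == 8) ? (int64_t)operand_size : 1;
    return max_field * scale;
}
\end{lstlisting}

The value for int64 with an 8-bit offset is 127 $\cdot$ 8 = 1016 bytes, which the table rounds to 1 kbyte.
\vv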
1896
 
1897
Specifying a maximum datasize on the assembler command line or in a directive will help the assembler select the smallest possible instruction for addressing static data in a writeable data section.
1898
\vv
1899
 
1900
Note that data in a read-only section are usually addressed with IP as base pointer. The number of bits needed for addressing IP-based read-only data is determined by codesize, not datasize.
1901
\vv
1902
 
1903
Conditional jumps can use small single-word instructions if there is no constant immediate operand bigger than 8 bytes and if the distance to the destination is no more than 127 code words (= 508 bytes). This will often suffice for conditional jumps within the same function or module.
1904
See page \pageref{table:jumpInstructionFormats} for a list of jump instruction formats with different number of bits for the offset.
1905
\vv
1906
 
1907
Instructions with a memory operand in static data are two or three code words long. The three-word version is needed if the datasize is bigger than 32 kbytes and the instruction has an array index or option bits or three input operands. Example:
1908
 
1909
\begin{example}
1910
\label{instructionSizeOptimization}
1911
\end{example} % frame disappears if I put this after end lstlisting
1912
 
1913
\begin{lstlisting}[frame=single]
1914
data section read write datap
1915
alpha: int32 123
1916
data end
1917
 
1918
code section execute
1919
int32 r0 = r1 < [alpha]
1920
code end
1921
 
1922
\end{lstlisting}
1923
 
1924
The compare instruction in this example will need three words if datasize is bigger than 32 kbytes. A smaller 2-word version of this instruction can be used if datasize is specified to a value less than 32 kbytes.
1925
\vv
1926
 
1927
The assembler will automatically choose the smallest version of an instruction that fits the specified datasize and codesize. The linker will give an
``Address overflow'' error message if a relative address does not fit the available number of bits.
\vv

It is not possible to have memory operands with an 8-bit offset for data with IP, DATAP, or THREADP as implicit base pointer, but it is possible to make the instruction smaller if a general purpose register is used as base pointer. Instructions with a memory operand can use a single-word version of the instruction if the following conditions are satisfied:

\begin{itemize}
\item the base pointer is a general purpose register or stack pointer
\item the scaled offset fits into an 8-bit signed integer
\item the instruction has no more than two source operands
\item the first source operand is the same register as the destination
\item there is no mask
\item there are no option bits
\item if vector registers are used, the operation must be scalar
\end{itemize}
\vv

We can make the code more compact by placing a reference point near the data we want to access and loading the address of this reference point into a register. This example shows how:

\begin{example}
\label{ExampleLocalPointer}
\end{example} % frame disappears if I put this after end lstlisting

\begin{lstlisting}[frame=single]
data section read write datap
int32 A, B, C
align 8
refPoint: // reference point, aligned by 8
int32 D
double E
data end

code section execute
// load address of refPoint:
int64 r1 = address ([refPoint])
// address of A relative to refPoint:
int32 r2 = r2 + [r1 + A - refPoint]
// address of E relative to refPoint:
double v3 = v3 * [r1 + E - refPoint], scalar

code end

\end{lstlisting}
\vv

The offset relative to the reference point is scaled by 4 and 8 for A and E, respectively, in this example. The scaled offset is calculated automatically by the assembler. The reference point must be aligned by 8 in this example in order to make the offset to E divisible by 8.
\vv
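
The condition that a scaled offset must fit into an 8-bit signed integer can be expressed as a simple test, sketched here in C. The function name {\ttfamily fits\_scaled\_offset8} is hypothetical. The actual offsets depend on the data layout produced by the assembler and are not calculated here.

\begin{lstlisting}[frame=single]
#include <stdbool.h>
#include <stdint.h>

/* True if a byte offset can be coded as an 8-bit signed
   integer scaled by the operand size (1, 2, 4 or 8 bytes) */
bool fits_scaled_offset8(int64_t offset, int64_t operand_size) {
    if (offset % operand_size != 0) {
        return false;            /* not divisible by the size */
    }
    int64_t scaled = offset / operand_size;
    return scaled >= -128 && scaled <= 127;
}
\end{lstlisting}

An offset of 4 bytes to a double fails the test because it is not divisible by 8, which is why the alignment of the reference point matters.
\vv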
1974
 
1975
This method will make the code more compact if the base pointer is used in more than two instructions. Data on the stack can be accessed with single-word instructions if they are near the address that the stack pointer points to. Data in a structure or class can be accessed in the same way relative to a structure pointer or 'this' pointer.
1976
\vv
1977
 
1978
You can use an assembler listing to check the size and format of each instruction. A table of instruction formats is given on page \pageref{table:instructionFormats}.
1979
\vv
1980
 
1981
 
1982
\end{document}
