# Introduction

CoreMark's primary goals are simplicity and providing a method for testing only a processor's core features. For more information about EEMBC's comprehensive embedded benchmark suites, please see www.eembc.org.

For a more compute-intensive version of CoreMark that uses larger datasets and execution loops taken from common applications, please check out EEMBC's [CoreMark-PRO](https://www.github.com/eembc/coremark-pro) benchmark, also on GitHub.

# Building and Running

To build and run the benchmark, type

`> make`

Full results are available in the files `run1.log` and `run2.log`. The CoreMark result can be found in `run1.log`.
 
## Cross Compiling

For cross-compile platforms, please adjust `core_portme.mak`, `core_portme.h` (and possibly `core_portme.c`) according to the specific platform used. When porting to a new platform, it is recommended to copy one of the default port folders (e.g. `mkdir  && cp linux/* `), adjust the porting files, and run:
~~~
% make PORT_DIR=
~~~
 
## Make Targets
* `run` - Default target, creates `run1.log` and `run2.log`.
* `run1.log` - Run the benchmark with performance parameters, and output to `run1.log`.
* `run2.log` - Run the benchmark with validation parameters, and output to `run2.log`.
* `run3.log` - Run the benchmark with profile generation parameters, and output to `run3.log`.
* `compile` - Compile the benchmark executable.
* `link` - Link the benchmark executable.
* `check` - Test MD5 of sources that may not be modified.
* `clean` - Clean temporary files.
 
### Make flag: `ITERATIONS`
By default, the benchmark will run for between 10 and 100 seconds. To override, use `ITERATIONS=N`:
~~~
% make ITERATIONS=10
~~~
This will run the benchmark for 10 iterations. It is recommended to set a specific number of iterations in certain situations, e.g.:

* Running with a simulator
* Measuring power/energy
* Timing cannot be restarted

Minimum required run time: **Results are only valid for reporting if the benchmark ran for at least 10 secs!**
 
### Make flag: `XCFLAGS`
To add compiler flags from the command line, use `XCFLAGS`, e.g.:

~~~
% make XCFLAGS="-g -DMULTITHREAD=4 -DUSE_FORK=1"
~~~

### Make flag: `CORE_DEBUG`

Define this flag to compile for a debug run if you get an incorrect CRC.

~~~
% make XCFLAGS="-DCORE_DEBUG=1"
~~~

### Make flag: `REBUILD`

Force a rebuild of the executable.
 
## Systems Without `make`
The following files need to be compiled:
* `core_list_join.c`
* `core_main.c`
* `core_matrix.c`
* `core_state.c`
* `core_util.c`
* `PORT_DIR/core_portme.c`

For example:
~~~
% gcc -O2 -o coremark.exe core_list_join.c core_main.c core_matrix.c core_state.c core_util.c simple/core_portme.c -DPERFORMANCE_RUN=1 -DITERATIONS=1000
% ./coremark.exe > run1.log
~~~
The above will compile the benchmark for a performance run and 1000 iterations. Output is redirected to `run1.log`.
 
# Parallel Execution
Use `XCFLAGS=-DMULTITHREAD=N`, where N is the number of threads to run in parallel. Several implementations are available to execute in multiple contexts, or you can implement your own in `core_portme.c`.

~~~
% make XCFLAGS="-DMULTITHREAD=4 -DUSE_PTHREAD"
~~~

The above will compile the benchmark for execution on 4 cores, using the POSIX Threads API.
 
# Run Parameters for the Benchmark Executable
CoreMark's executable takes several parameters as follows (but only if `main()` accepts arguments):
* 1st - A seed value used for initialization of data.
* 2nd - A seed value used for initialization of data.
* 3rd - A seed value used for initialization of data.
* 4th - Number of iterations (0 for auto: default value).
* 5th - Reserved for internal use.
* 6th - Reserved for internal use.
* 7th - For malloc users only, override the size of the input data buffer.

The `run` target from `make` will run CoreMark with 2 different data initialization seeds.
 
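As an illustration of the positional parameters listed above, the following minimal C sketch reads them from `argv`. The `run_params` struct and the `parse_run_params` function are hypothetical names and not part of CoreMark. The defaults (seeds `0,0,0x66` and 0 iterations for auto) are assumptions based on the performance run described in this document.

```c
#include <stdlib.h>

/* Hypothetical container for the documented positional parameters. */
typedef struct {
    long seed1;      /* 1st parameter */
    long seed2;      /* 2nd parameter */
    long seed3;      /* 3rd parameter */
    long iterations; /* 4th parameter, 0 selects the automatic count */
    long data_size;  /* 7th parameter, 0 means "not overridden" in this sketch */
} run_params;

/* Fill *p from argv. Parameters 5 and 6 are reserved and are skipped.
 * Base 0 lets strtol accept both decimal and 0x-prefixed values. */
void parse_run_params(int argc, char *argv[], run_params *p)
{
    /* Assumed defaults: the performance-run seeds 0,0,0x66. */
    p->seed1 = 0;
    p->seed2 = 0;
    p->seed3 = 0x66;
    p->iterations = 0;
    p->data_size = 0;

    if (argc > 1) p->seed1 = strtol(argv[1], NULL, 0);
    if (argc > 2) p->seed2 = strtol(argv[2], NULL, 0);
    if (argc > 3) p->seed3 = strtol(argv[3], NULL, 0);
    if (argc > 4) p->iterations = strtol(argv[4], NULL, 0);
    if (argc > 7) p->data_size = strtol(argv[7], NULL, 0);
}
```

Calling `parse_run_params` with the arguments `0x3415 0x3415 0x66 10` would select the validation seeds and 10 iterations, leaving the buffer size untouched.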
## Alternative parameters:
If `malloc` is not used or command line arguments are not supported, the buffer size
for the algorithms must be defined via the compiler define `TOTAL_DATA_SIZE`.
`TOTAL_DATA_SIZE` must be set to 2000 bytes (default) for standard runs.
The default for such a target when testing different configurations could be:

~~~
% make XCFLAGS="-DTOTAL_DATA_SIZE=6000 -DMAIN_HAS_NOARGC=1"
~~~
 
# Submitting Results

CoreMark results can be submitted on the web. Open a web browser and go to the [submission page](https://www.eembc.org/coremark/submit.php). After registering an account you may enter a score.

# Run Rules
What is and is not allowed.

## Required
1. The benchmark needs to run for at least 10 seconds.
2. All validation must succeed for seeds `0,0,0x66` and `0x3415,0x3415,0x66`, buffer size of 2000 bytes total.
    * If not using command line arguments to main:
~~~
        % make XCFLAGS="-DPERFORMANCE_RUN=1" REBUILD=1 run1.log
        % make XCFLAGS="-DVALIDATION_RUN=1" REBUILD=1 run2.log
~~~
3. If using profile guided optimization, profile must be generated using seeds of `8,8,8`, and buffer size of 1200 bytes total.
~~~
    % make XCFLAGS="-DTOTAL_DATA_SIZE=1200 -DPROFILE_RUN=1" REBUILD=1 run3.log
~~~
4. All source files must be compiled with the same flags.
5. All data type sizes must match size in bits such that:
    * `ee_u8` is an unsigned 8-bit datatype.
    * `ee_s16` is a signed 16-bit datatype.
    * `ee_u16` is an unsigned 16-bit datatype.
    * `ee_s32` is a signed 32-bit datatype.
    * `ee_u32` is an unsigned 32-bit datatype.
 
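The data type requirement in rule 5 can be checked at compile time. The sketch below is one possible approach, assuming a C11 compiler and a platform that provides the exact-width types from `<stdint.h>`. The typedefs are written out here only for illustration (a real port defines them in `core_portme.h`), and `ee_types_have_required_widths` is a hypothetical helper name.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed typedefs for illustration; a real port defines its own. */
typedef uint8_t  ee_u8;
typedef int16_t  ee_s16;
typedef uint16_t ee_u16;
typedef int32_t  ee_s32;
typedef uint32_t ee_u32;

/* Compile-time checks: the build fails if a width or signedness is wrong. */
_Static_assert(sizeof(ee_u8)  * CHAR_BIT == 8,  "ee_u8 must be 8 bits");
_Static_assert(sizeof(ee_s16) * CHAR_BIT == 16, "ee_s16 must be 16 bits");
_Static_assert(sizeof(ee_u16) * CHAR_BIT == 16, "ee_u16 must be 16 bits");
_Static_assert(sizeof(ee_s32) * CHAR_BIT == 32, "ee_s32 must be 32 bits");
_Static_assert(sizeof(ee_u32) * CHAR_BIT == 32, "ee_u32 must be 32 bits");
_Static_assert((ee_s16)-1 < 0, "ee_s16 must be signed");
_Static_assert((ee_s32)-1 < 0, "ee_s32 must be signed");
_Static_assert((ee_u8)-1  > 0, "ee_u8 must be unsigned");
_Static_assert((ee_u16)-1 > 0, "ee_u16 must be unsigned");
_Static_assert((ee_u32)-1 > 0, "ee_u32 must be unsigned");

/* Run-time variant of the width checks, for toolchains without C11. */
int ee_types_have_required_widths(void)
{
    return sizeof(ee_u8)  * CHAR_BIT == 8
        && sizeof(ee_s16) * CHAR_BIT == 16
        && sizeof(ee_u16) * CHAR_BIT == 16
        && sizeof(ee_s32) * CHAR_BIT == 32
        && sizeof(ee_u32) * CHAR_BIT == 32;
}
```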
## Allowed

1. Changing number of iterations
2. Changing toolchain and build/load/run options
3. Changing method of acquiring a data memory block
4. Changing the method of acquiring seed values
5. Changing implementation in `core_portme.c`
6. Changing configuration values in `core_portme.h`
7. Changing `core_portme.mak`

## NOT ALLOWED
1. Changing any source file other than `core_portme*` (use `make check` to validate)
 
# Reporting rules
Use the following syntax to report results on a data sheet:

CoreMark 1.0 : N / C [/ P] [/ M]

N - Number of iterations per second (with seeds 0,0,0x66, size=2000)

C - Compiler version and flags

P - Parameters such as data and code allocation specifics

* This parameter *may* be omitted if all data was allocated on the heap in RAM.
* This parameter *may not* be omitted when reporting CoreMark/MHz.

M - Type of parallel execution (if used) and number of contexts
* This parameter may be omitted if parallel execution was not used.

e.g.:

~~~
CoreMark 1.0 : 128 / GCC 4.1.2 -O2 -fprofile-use / Heap in TCRAM / FORK:2
~~~
or
~~~
CoreMark 1.0 : 1400 / GCC 3.4 -O4
~~~

If reporting scaling results, the results must be reported as follows:

CoreMark/MHz 1.0 : N / C / P [/ M]

P - When reporting scaling results, the memory parameter must also indicate the memory frequency to core frequency ratio.
1. If the core has cache and the cache frequency to core frequency ratio is configurable, that must also be included.

e.g.:

~~~
CoreMark/MHz 1.0 : 1.47 / GCC 4.1.2 -O2 / DDR3(Heap) 30:1 Memory 1:1 Cache
~~~
 
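The reporting syntax above can be produced mechanically. The following C sketch shows one way to build the `CoreMark 1.0 : N / C [/ P] [/ M]` line. `format_coremark_report` is a hypothetical helper and not part of CoreMark, and the use of `%g` for the score is an assumption chosen to match the short data-sheet examples.

```c
#include <stdio.h>

/* Hypothetical helper: writes "CoreMark 1.0 : N / C [/ P] [/ M]" into buf.
 * p and m may be NULL when the optional fields are omitted.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_coremark_report(char *buf, size_t len, double n,
                           const char *c, const char *p, const char *m)
{
    size_t used;
    int r = snprintf(buf, len, "CoreMark 1.0 : %g / %s", n, c);

    if (r < 0 || (size_t)r >= len)
        return -1;
    used = (size_t)r;

    if (p != NULL) {
        r = snprintf(buf + used, len - used, " / %s", p);
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }
    if (m != NULL) {
        r = snprintf(buf + used, len - used, " / %s", m);
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }
    return (int)used;
}
```

Passing `128.0`, `"GCC 4.1.2 -O2 -fprofile-use"`, `"Heap in TCRAM"` and `"FORK:2"` reproduces the first example line above. Note that this sketch does not enforce the rule that P is mandatory for CoreMark/MHz reports.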
# Log File Format
The log files have the following format:

~~~
2K performance run parameters for coremark.     (Run type)
CoreMark Size       : 666                       (Buffer size)
Total ticks         : 25875                     (platform dependent value)
Total time (secs)   : 25.875000                 (actual time in seconds)
Iterations/Sec      : 3864.734300               (Performance value to report)
Iterations          : 100000                    (number of iterations used)
Compiler version    : GCC3.4.4                  (Compiler and version)
Compiler flags      : -O2                       (Compiler and linker flags)
Memory location     : Code in flash, data in on chip RAM
seedcrc             : 0xe9f5                    (identifier for the input seeds)
[0]crclist          : 0xe714                    (validation for list part)
[0]crcmatrix        : 0x1fd7                    (validation for matrix part)
[0]crcstate         : 0x8e3a                    (validation for state part)
[0]crcfinal         : 0x33ff                    (iteration dependent output)
Correct operation validated. See README.md for run and reporting rules.  (*Only when run is successful*)
CoreMark 1.0 : 3864.734300 / GCC3.4.4 -O2 / Heap  (*Only on a successful performance run*)
~~~
 
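The relationship between the timing fields in the sample log can be sketched in C as follows. The helper names are hypothetical, and the tick rate of 1000 ticks per second is an assumption inferred from the sample values (25875 ticks for 25.875 seconds); the real rate is platform dependent.

```c
/* Hypothetical helper: converts a platform tick count to seconds. */
double ticks_to_secs(unsigned long ticks, unsigned long ticks_per_sec)
{
    return (double)ticks / (double)ticks_per_sec;
}

/* Hypothetical helper: the "Iterations/Sec" value, i.e. the score N. */
double iterations_per_sec(unsigned long iterations, double secs)
{
    return (double)iterations / secs;
}

/* Returns nonzero when the run meets the 10 second minimum for reporting. */
int run_is_reportable(double secs)
{
    return secs >= 10.0;
}
```

With the sample values, 100000 iterations over 25.875 seconds give roughly 3864.7343 iterations per second, which matches the `Iterations/Sec` line, and the run is long enough to be reported.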
# Theory of Operation

This section describes the initial goals of CoreMark and their implementation.

## Small and easy to understand

* X number of source code lines for timed portion of the benchmark.
* Meaningful names for variables and functions.
* Comments for each block of code more than 10 lines long.

## Portability

A thin abstraction layer will be provided for I/O and timing in a separate file. All I/O and timing of the benchmark will be done through this layer.

### Code / data size

* Compile with gcc on x86 and make sure all sizes are according to requirements.
* If dynamic memory allocation is used, take total memory allocated into account as well.
* Avoid recursive functions and keep track of stack usage.
* Use the same memory block as data site for all algorithms, and initialize the data before each algorithm. While this means that initialization with data happens during the timed portion, it will only happen once during the timed portion and so will have a negligible effect on the results.

## Controlled output

This may be the most difficult goal. Compilers are constantly improving and getting better at analyzing code. To create work that cannot be computed at compile time and must be computed at run time, we will rely on two assumptions:

* Some system functions (e.g. time, scanf) and parameters cannot be computed at compile time. In most cases, marking a variable volatile means the compiler is forced to read this variable every time it is accessed. This will be used to introduce a factor into the input that cannot be precomputed at compile time. Since the results are input dependent, that will make sure that computation has to happen at run time.

* Either a system function or I/O (e.g. scanf) or command line parameters or volatile variables will be used before the timed portion to generate data which is not available at compile time. The specific method used is not relevant as long as it can be controlled, and as long as it cannot be computed or eliminated by the compiler at compile time. E.g. if the `clock()` function is a compiler stub, it may not be used. The derived values will be reported on the output so that verification can be done on a different machine.

* We cannot rely on command line parameters since some embedded systems do not have the capability to provide command line parameters. All 3 methods above will be implemented (time based, scanf and command line parameters) and all 3 are valid if the compiler cannot determine the value at compile time.

* It is important to note that the actual values that are to be supplied at run time will be standardized. The methodology is not intended to provide random data, but simply to provide controlled data that cannot be precomputed at compile time.

* Printed results must be valid at run time. This will be used to make sure the computation has been executed.

* Some embedded systems do not provide `printf` or other I/O functionality. All I/O will be done through a thin abstraction interface to allow execution on such systems (e.g. allow output via JTAG).
 
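The volatile-variable method described above can be illustrated with a short C sketch. The variable names, `get_seed` and `seeded_work` are assumptions made for this example, and the loop is a toy workload rather than any of the CoreMark algorithms. Because the seed is read through a `volatile` object, the compiler has to load it at run time and cannot fold the loop into a constant.

```c
#include <stdint.h>

/* Assumed names for illustration: seeds held in volatile storage so that
 * every read is an actual load the compiler may not optimize away. */
volatile int32_t seed1_volatile = 0x0;
volatile int32_t seed2_volatile = 0x0;
volatile int32_t seed3_volatile = 0x66;

/* Returns seed number i (1..3), or 0 for any other index. */
int32_t get_seed(int i)
{
    switch (i) {
    case 1:  return seed1_volatile;
    case 2:  return seed2_volatile;
    case 3:  return seed3_volatile;
    default: return 0;
    }
}

/* Toy workload: the result depends on the run-time seed, so it cannot be
 * precomputed at compile time even though the code itself is fixed. */
uint32_t seeded_work(uint32_t n)
{
    uint32_t seed = (uint32_t)get_seed(3);
    uint32_t acc = 0;
    uint32_t i;

    for (i = 0; i < n; i++)
        acc += i ^ seed;
    return acc;
}
```

The inputs stay controlled: with the seed fixed at `0x66`, every conforming run produces the same value, which is what allows results to be verified on a different machine.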
## Key Algorithms

### Linked List

The following linked list structure will be used:

~~~
typedef struct list_data_s {
        ee_s16 data16;
        ee_s16 idx;
} list_data;

typedef struct list_head_s {
        struct list_head_s *next;
        struct list_data_s *info;
} list_head;
~~~

While adding a level of indirection when accessing the data, this structure is realistic and used in many embedded applications for small to medium lists.

The list itself will be initialized on a block of memory that will be passed in to the initialization function. While in general linked lists use malloc for new nodes, embedded applications sometimes control the memory for small data structures such as arrays and lists directly to avoid the overhead of system calls, so this approach is realistic.

The linked list will be initialized such that 1/4 of the list pointers point to sequential areas in memory, and 3/4 of the list pointers are distributed in a non-sequential manner. This is done to emulate a linked list that had add/remove happen for a while, disrupting the neat order, and then a series of adds that are likely to come from sequential memory locations.

For the benchmark itself:
- Multiple find operations are going to be performed. These find operations may result in the whole list being traversed. The result of each find will become part of the output chain.
- The list will be sorted using merge sort based on the data16 value, and then the CRC of the data16 item will be derived, in order, for part of the list. The CRC will become part of the output chain.
- The list will be sorted again using merge sort based on the idx value. This sort will guarantee that the list is returned to the primary state before leaving the function, so that multiple iterations of the function will have the same result. CRC of the data16 for part of the list will again be calculated and become part of the output chain.

The actual `data16` in each cell will be pseudo random based on a single 16b input that cannot be determined at compile time. In addition, the part of the list which is used for CRC will also be passed to the function, and determined based on an input that cannot be determined at compile time.
 
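To make the block-based initialization concrete, here is a minimal C sketch that builds the list inside a caller-supplied memory block and performs a find by `idx`. The functions `list_init_on_block` and `list_find_idx` are hypothetical, the `ee_s16` typedef is assumed, and the seed mixing is a placeholder rather than CoreMark's actual generator. For brevity the nodes are laid out sequentially, and the 1/4 sequential, 3/4 scattered layout and the merge sorts are omitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef int16_t ee_s16; /* assumed 16-bit signed type for this sketch */

typedef struct list_data_s {
    ee_s16 data16;
    ee_s16 idx;
} list_data;

typedef struct list_head_s {
    struct list_head_s *next;
    struct list_data_s *info;
} list_head;

/* Hypothetical helper: builds a list inside a caller-supplied block.
 * The block must be aligned for list_head (e.g. memory from malloc).
 * Heads are placed at the start of the block, data records after them.
 * Returns the first node, or NULL if the block cannot hold one item. */
list_head *list_init_on_block(void *block, size_t size, ee_s16 seed,
                              size_t *count)
{
    size_t n = size / (sizeof(list_head) + sizeof(list_data));
    list_head *heads = (list_head *)block;
    list_data *items;
    size_t i;

    *count = n;
    if (n == 0)
        return NULL;
    items = (list_data *)(void *)(heads + n);
    for (i = 0; i < n; i++) {
        /* Placeholder seed-dependent value, not CoreMark's generator. */
        items[i].data16 = (ee_s16)((((unsigned)i * 31u) ^ (unsigned)seed) & 0x7fffu);
        items[i].idx = (ee_s16)i;
        heads[i].info = &items[i];
        heads[i].next = (i + 1 < n) ? &heads[i + 1] : NULL;
    }
    return heads;
}

/* Walks the list until a node with the given idx is found; may traverse
 * the whole list and returns NULL when there is no match. */
list_head *list_find_idx(list_head *list, ee_s16 idx)
{
    while (list != NULL && list->info->idx != idx)
        list = list->next;
    return list;
}
```

No allocation happens inside these functions: the caller owns the block, which mirrors the point made above about embedded code managing memory for small structures directly.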
### Matrix Multiply

This very simple algorithm forms the basis of many more complex algorithms. The tight inner loop is the focus of many optimizations (compiler as well as hardware based) and is thus relevant for embedded processing.

The total available data space will be divided into 3 parts:
1. NxN matrix A.
2. NxN matrix B.
3. NxN matrix C.

E.g. for 2K we will have 3 12x12 matrices (assuming a 32b data type: 12(len)*12(wid)*4(size)*3(num) = 1728 bytes).

- Matrix A will be initialized with small values (upper 3/4 of the bits all zero).
- Matrix B will be initialized with medium values (upper half of the bits all zero).
- Matrix C will be used for the result.

For the benchmark itself:
- Multiply A by a constant into C, add the upper bits of each of the values in the result matrix. The result will become part of the output chain.
- Multiply A by column X of B into C, add the upper bits of each of the values in the result matrix. The result will become part of the output chain.
- Multiply A by B into C, add the upper bits of each of the values in the result matrix. The result will become part of the output chain.

The actual values for A and B must be derived based on input that is not available at compile time.
 
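The sizing arithmetic and the first matrix step can be sketched in C as below. Both function names are hypothetical, and "upper bits" is interpreted here as the upper 16 bits of each 32-bit result, which is an assumption of this sketch. The sizing follows the description in this section (three NxN matrices of equal element size) and takes the buffer as 2000 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest N such that three NxN matrices of elem_size bytes each
 * fit into total_bytes. */
size_t matrix_dim(size_t total_bytes, size_t elem_size)
{
    size_t n = 0;

    while (3 * (n + 1) * (n + 1) * elem_size <= total_bytes)
        n++;
    return n;
}

/* C = A * k for NxN matrices stored row by row. Returns the sum of the
 * upper 16 bits of every result element (assumed meaning of "upper bits"). */
uint32_t matrix_mul_const(size_t n, const int32_t *a, int32_t k, int32_t *c)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n * n; i++) {
        c[i] = a[i] * k;
        sum += ((uint32_t)c[i] >> 16) & 0xffffu;
    }
    return sum;
}
```

For a 2000-byte buffer and 4-byte elements, `matrix_dim` yields 12, since 3 * 12 * 12 * 4 = 1728 bytes fit while 3 * 13 * 13 * 4 = 2028 bytes do not.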
### State Machine

This part of the code needs to exercise switch and if statements. As such, we will use a small Moore state machine. In particular, this will be a state machine that identifies string input as numbers and divides them according to format.

The state machine will parse the input string until either a "," separator or end of input is encountered. An invalid number will cause the state machine to return an invalid state, and a valid number will cause the state machine to return with the type of number format (int/float/scientific).

This code will perform a realistic task, be small enough to easily understand, and exercise the required functionality. The other option used in embedded systems is a Mealy-based state machine, which is driven by a table. The table then determines the number of states and complexity of transitions. This approach, however, tests mainly the load/store and function call mechanisms and less the handling of branches. If analysis of the final results shows that the load/store functionality of the processor is not exercised thoroughly, it may be a good addition to the benchmark (code size allowing).

For input, the memory block will be initialized with comma separated values of mixed formats, as well as invalid inputs.

For the benchmark itself:
- Invoke the state machine on all of the input and count final states and state transitions. CRC of all final states and transitions will become part of the output chain.
- Modify the input at intervals (inject errors) and repeat the state machine operation.
- Modify the input back to original form.

The actual input must be initialized based on data that cannot be determined at compile time. In addition, the intervals for modification of the input and the actual modification must be based on input that cannot be determined at compile time.
 
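A switch-driven number classifier of the kind described above might look like the following C sketch. The state names, the `num_type` enum and `classify_number` are hypothetical, and the accepted grammar is a simplification chosen for this example (optional sign, digits, optional fraction, optional exponent), so it should not be read as CoreMark's actual state table. Counting of state transitions is omitted.

```c
typedef enum { NUM_INVALID, NUM_INT, NUM_FLOAT, NUM_SCIENTIFIC } num_type;

typedef enum {
    ST_START, ST_SIGN, ST_INT, ST_FLOAT,
    ST_EXP, ST_EXP_SIGN, ST_SCI, ST_INVALID
} parse_state;

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Parses one token up to a ',' separator or end of input, advances *pstr
 * past the token (and its separator), and returns the number format. */
num_type classify_number(const char **pstr)
{
    const char *s = *pstr;
    parse_state st = ST_START;

    for (; *s != '\0' && *s != ','; s++) {
        char c = *s;
        switch (st) {
        case ST_START:
            if (is_digit(c)) st = ST_INT;
            else if (c == '+' || c == '-') st = ST_SIGN;
            else st = ST_INVALID;
            break;
        case ST_SIGN:
            st = is_digit(c) ? ST_INT : ST_INVALID;
            break;
        case ST_INT:
            if (!is_digit(c)) st = (c == '.') ? ST_FLOAT : ST_INVALID;
            break;
        case ST_FLOAT:
            if (!is_digit(c)) st = (c == 'e' || c == 'E') ? ST_EXP : ST_INVALID;
            break;
        case ST_EXP:
            if (is_digit(c)) st = ST_SCI;
            else if (c == '+' || c == '-') st = ST_EXP_SIGN;
            else st = ST_INVALID;
            break;
        case ST_EXP_SIGN:
            st = is_digit(c) ? ST_SCI : ST_INVALID;
            break;
        case ST_SCI:
            if (!is_digit(c)) st = ST_INVALID;
            break;
        case ST_INVALID:
            break; /* stay invalid until the separator is reached */
        }
    }
    if (*s == ',')
        s++; /* skip the separator so the next call starts on a new token */
    *pstr = s;

    switch (st) {
    case ST_INT:   return NUM_INT;
    case ST_FLOAT: return NUM_FLOAT;
    case ST_SCI:   return NUM_SCIENTIFIC;
    default:       return NUM_INVALID;
    }
}
```

Calling `classify_number` repeatedly on an input such as `5012,1.5,-3.2e+10,T0.3e-1F` walks through an int, a float, a scientific value and an invalid token, one per call.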
# Validation

This release was tested on the following platforms:
* x86 cygwin and gcc 3.4 (Quad, dual and single core systems)
* x86 linux (Ubuntu/Fedora) and gcc (4.2/4.1) (Quad and single core systems)
* MIPS64 BE linux and gcc 3.4 (16-core system)
* MIPS32 BE linux with CodeSourcery compiler 4.2-177 on Malta/Linux with a 1004K 3-core system
* PPC simulator with gcc 4.2.2 (No OS)
* PPC 64b BE linux (yellowdog) with gcc 3.4 and 4.1 (Dual core system)
* BF533 with VDSP50
* Renesas R8C/H8 MCU with HEW 4.05
* NXP LPC1700 armcc v4.0.0.524
* NEC 78K with IAR v4.61
* ARM simulator with armcc v4

# Memory Analysis

Valgrind 3.4.0 was used and no errors were reported.

# Balance Analysis

The number of instructions executed for each function was tested with cachegrind and found to be balanced with gcc and -O0.
 
# Statistics

Lines:
~~~
Lines  Blank  Cmnts  Source     AESL
=====  =====  =====  =====  ==========  =======================================
  469     66    170    251       627.5  core_list_join.c  (C)
  330     18     54    268       670.0  core_main.c  (C)
  256     32     80    146       365.0  core_matrix.c  (C)
  240     16     51    186       465.0  core_state.c  (C)
  165     11     20    134       335.0  core_util.c  (C)
  150     23     36     98       245.0  coremark.h  (C)
 1610    166    411   1083      2707.5  ----- Benchmark -----  (6 files)
  293     15     74    212       530.0  linux/core_portme.c  (C)
  235     30    104    104       260.0  linux/core_portme.h  (C)
  528     45    178    316       790.0  ----- Porting -----  (2 files)

* For comparison, here are the stats for Dhrystone
Lines  Blank  Cmnts  Source     AESL
=====  =====  =====  =====  ==========  =======================================
  311     15    242     54       135.0  dhry.h  (C)
  789    132    119    553      1382.5  dhry_1.c  (C)
  186     26     68    107       267.5  dhry_2.c  (C)
 1286    173    429    714      1785.0  ----- C -----  (3 files)
~~~
 
# Credits
Many thanks to all of the individuals who helped with the development or testing of CoreMark, including (sorted by company name; note that company names may no longer be accurate as this was written in 2009):
* Alan Anderson, ADI
* Adhikary Rajiv, ADI
* Elena Stohr, ARM
* Ian Rickards, ARM
* Andrew Pickard, ARM
* Trent Parker, CAVIUM
* Shay Gal-On, EEMBC
* Markus Levy, EEMBC
* Peter Torelli, EEMBC
* Ron Olson, IBM
* Eyal Barzilay, MIPS
* Jens Eltze, NEC
* Hirohiko Ono, NEC
* Ulrich Drees, NEC
* Frank Roscheda, NEC
* Rob Cosaro, NXP
* Shumpei Kawasaki, RENESAS

# Legal
Please refer to LICENSE.md in this repository for a description of your rights to use this code.

# Copyright
Copyright © 2009 EEMBC All rights reserved.
CoreMark is a trademark of EEMBC and EEMBC is a registered trademark of the Embedded Microprocessor Benchmark Consortium.