The pipelined architecture is often used in high-speed FPGA cores.
In complex designs, data processing is often split into multiple
paths performing some (possibly different) operations in parallel.
In such a case it may be difficult to keep the same latency
(measured in clock periods) in all paths.

GUI-based tools (e.g. Xilinx System Generator or Altera DSP Builder)
take care of equalizing (balancing) the latencies (delays)
of different paths.
However, it seems that up to now there has been no good solution for
designers implementing their designs in HDL.
This project offers a methodology that automatically balances the
latencies in different paths of a pipelined core, so that data arriving
at a certain processing block, or appearing on the output, are properly
aligned in time.
As estimating the delay by analyzing the source code may be difficult
and error-prone (if the author uses non-standard solutions), the method
is based on simulation.

Data going through the IP core are labelled (in simulation only) with
an additional "time marker", which the user has to generate on the input.
The directives "-- pragma translate_off" and "-- pragma translate_on" are
used to limit generation and processing of those labels to simulation only.

Wherever the user wants to equalize latency in certain data paths, he/she
places a special block (latency equalizer - "lateq").
The equalized data paths are routed through shift registers whose lengths
are calculated from the results of the previous simulation (the initial
length is 0).
The block may work in two simulation modes and one synthesis mode.

1. In the standard simulation mode it reports the time markers
of the data in different paths. This "delay report" is written to a file
(the next version may use a function written in C++ and called via VHPI
to analyze those reports without writing them to a file, or a dedicated
program connected via a named socket). The delay report file is then
analyzed by another program, "latreadgen.py", which generates a dedicated
function returning the appropriate delay for each path in each equalizer
block.

2. If the delays are already correctly selected, the user may set the
parameter that switches on the "final verification". In this mode the delay
report is not generated, so the simulation is faster and no disk
space is used for the report. In this mode any inconsistency of time
markers on the output of the "lateq" block causes a simulation error.

3. In the synthesis mode, all instructions related to time markers
and their processing are switched off using the "-- pragma translate_off" and
"-- pragma translate_on" directives. Therefore the system does not affect
the performance of the IP core.

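The analysis step of the standard simulation mode can be illustrated with a
short Python sketch. It is not the actual "latreadgen.py": the report layout
(a dictionary mapping path index to the time marker seen at one equalizer on
one clock edge) and the function name are assumptions made for illustration,
and the markers are assumed not to wrap around. A faster path carries a newer
(larger) marker, so it needs as many extra register stages as its marker
exceeds the smallest one.

```python
def extra_delays(markers):
    """Return the extra delay (in clock cycles) needed on each path.

    markers: dict mapping path index -> time marker seen at the equalizer
             on one clock edge (hypothetical report layout).
    The slowest path carries the oldest (smallest) marker and gets delay 0.
    """
    oldest = min(markers.values())
    return {path: mark - oldest for path, mark in markers.items()}


# Example: path 0 carries marker 0 while path 1 still carries marker -1,
# so path 0 is one clock cycle ahead and must be delayed by one stage.
print(extra_delays({0: 0, 1: -1}))
```
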
The proposed system is offered in two versions.
1. The first version, located in the "single_type" subdirectory, assumes that
all time-aligned data in all data paths are of the same type.
This allows the use of a versatile latency checking and equalizing block
implemented in pure VHDL.

2. The second version, located in the "various_types" subdirectory, assumes
that the datapath may carry a varying number of data items of different types.
Unfortunately it is not possible (yet?) to implement such a flexible latency
checking and equalizing block in pure VHDL.
Generic types were introduced only in VHDL-2008, and they are still
not fully supported by most synthesis tools. But to implement the needed block
we would need to have records of different types as ports, and then iterate
through the fields of these records.
To solve that problem, the project contains the tool "lateqgen.py", which
generates such a "latency checker and equalizer" (LECQ) for a particular
number of paths of user-provided types. The calling syntax is:

lateqgen.py entity_name output_file type_for_path0 type_for_path1 ...

The number of paths in the LECQ block is defined by the number of types provided
at the end of the command line. Of course you can use the same type in two paths.
In this case you simply use the same type name more than once.

To allow passing data together with the time markers through the design, the
data must be encapsulated in record types with an optional (simulation-only)
time marker field (lateq_mrk).
Because of the way the VHDL sources are generated, the name of
each user type passing through the LECQ must start with "T_"
(e.g. "T_MY_DATA"). Additionally, the user must define a constant, which will
be used to initialize all registers. The name of this constant is derived from
the type name by replacing the initial "T_" with "C_" and appending "_INIT"
(so in our example it will be "C_MY_DATA_INIT").

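The naming rule for the initialization constant can be written down as a small
Python helper. This is only a sketch of the rule as stated above; the function
name is hypothetical and the code is not taken from "lateqgen.py".

```python
def init_constant_name(type_name):
    """Derive the init-constant name from a user type name (T_X -> C_X_INIT)."""
    if not type_name.startswith("T_"):
        raise ValueError("type name must start with 'T_': " + type_name)
    # Replace the leading "T_" with "C_" and append "_INIT".
    return "C_" + type_name[2:] + "_INIT"


print(init_constant_name("T_MY_DATA"))  # C_MY_DATA_INIT
```
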
One more thing must be explained. Each LECQ block must be uniquely
identified. In theory we could use the VHDL INSTANCE_NAME
attribute for this.
Unfortunately, it appears that the generated instance names differ between
simulators. Most likely they will also be different in synthesis tools,
so it would not be possible to pass the delays found in simulation
to synthesis (maybe for a single set of tools, like Vivado and its simulator,
it would be possible to adapt the tools to convert those instance names
appropriately).
To avoid this problem, the user should pass a unique LEQ_ID generic
to each instance of the delay equalizer.
What if this block is located in another one, used multiple times?
In this case the user should also add the LEQ_ID generic to this
container block, and pass a unique value to it. Then the LEQ_ID generic
value from the instance of the container block should be concatenated with
the ":" string and the unique value used for the particular instance of the
LECQ block.
If the blocks are instantiated in a for-generate loop, the user should also
concatenate the loop variable value converted to a string (with the
integer'image function), again separated with ":".

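The way such hierarchical identifiers are composed can be shown with a short
Python sketch. In the design itself the concatenation is done in VHDL; the
helper below only illustrates the resulting strings. The function name, the
example identifiers "PROC" and "EQ1", and the position of the loop index
between the container ID and the local ID are assumptions.

```python
def make_leq_id(*parts):
    """Join hierarchy components into one identifier separated by ':'."""
    return ":".join(str(p) for p in parts)


# Container instance "PROC", generate-loop index 3, local equalizer "EQ1"
# (all three values are hypothetical).
print(make_leq_id("PROC", 3, "EQ1"))  # PROC:3:EQ1
```
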
To let you try the described method, the project contains a relatively
simple system with 64 inputs from ADCs, measuring the signals from a
particle detector.
The signal generated by a particle passing through the detector is distributed
between neighbouring channels (strips). To find the position and energy of
the particle, the system first finds the strip with the maximal signal level.
Then it selects a predefined number of channels surrounding that "central" strip.
For the selected channels it calculates the sum of the signals
(the energy of the particle) and the sum of each signal multiplied by its
deviation from the central channel (so that the center of gravity of the
registered signal may be calculated to improve the resolution of the detector).
The final calculation of the hit position is done in the testbench, as I didn't
want to increase the project's complexity by implementing a divider block.

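The calculation described above can be modelled in a few lines of Python. This
is a behavioural sketch only, not a translation of the VHDL sources: the
function name, the "half_width" parameter (number of strips taken on each side
of the central one) and the clipping of the window at the detector edges are
assumptions.

```python
def locate_hit(samples, half_width=2):
    """Return (charge, position) for one set of strip samples.

    half_width is a hypothetical parameter: the number of strips taken on
    each side of the strip with the largest signal.
    """
    # Index of the strip with the maximal signal (the "central" strip).
    center = max(range(len(samples)), key=lambda i: samples[i])
    lo = max(0, center - half_width)
    hi = min(len(samples) - 1, center + half_width)
    # Sum of the selected signals (the energy of the particle).
    charge = sum(samples[i] for i in range(lo, hi + 1))
    # Sum of each signal weighted by its deviation from the central strip.
    moment = sum(samples[i] * (i - center) for i in range(lo, hi + 1))
    # This division is the step the demo leaves to the testbench.
    position = center + moment / charge
    return charge, position


strips = [0] * 64
strips[9], strips[10], strips[11] = 10, 30, 20
print(locate_hit(strips))
```

The sketch assumes a non-zero total charge; an all-zero input would cause a
division by zero.
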
Finding the maximum and calculating the sums are performed in hierarchical
tree-based comparators and adders. The user may define the number of inputs
handled in each node of the tree (parameters "EX1_NOF_INS_IN_CMP" and
"EX1_NOF_INS_IN_ADD" in the ex1_trees_pkg.vhd file). You can play with these
values, changing the number of levels, and hence the delay in different paths.

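How the number of inputs per node affects the depth of such a tree can be
estimated with the Python sketch below. The function name is hypothetical, and
relating the number of levels directly to latency assumes one pipeline stage
per tree level, which is not stated in this document.

```python
def tree_levels(n_inputs, ins_per_node):
    """Number of levels in a tree reducing n_inputs to a single output."""
    if ins_per_node < 2:
        raise ValueError("each node needs at least 2 inputs")
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        # Each level groups the remaining signals into nodes (ceiling division).
        remaining = -(-remaining // ins_per_node)
        levels += 1
    return levels


# Depth of a 64-input tree for a few node sizes.
for k in (2, 4, 8):
    print(k, tree_levels(64, k))
```
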
The two versions of the demo also use different implementations of time markers.
The first one simply uses integers starting from -1 and incremented on every
clock pulse.
So the implementation will fail after 2^31 clock cycles.
The second version uses time markers defined as integers starting from -1 and
incremented on every clock pulse until a predefined C_LATEQ_MRK_MAX value is
reached.
On the next pulse the time marker returns to 0. Such an implementation allows
much longer simulations and works correctly as long as the highest latency
difference is below C_LATEQ_MRK_MAX/2-1 (of course the latency difference must
be calculated in a special way).

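One possible form of that "special" difference calculation is ordinary modular
arithmetic, sketched below in Python. This is an assumption about the approach,
not code from the project; the parameter "mrk_max" stands for C_LATEQ_MRK_MAX,
and the initial marker value of -1 is not handled.

```python
def marker_diff(a, b, mrk_max):
    """Signed difference a - b of two markers that count 0..mrk_max and wrap."""
    period = mrk_max + 1         # number of distinct marker values
    diff = (a - b) % period      # in 0..period-1 (Python's % is non-negative)
    if diff > period // 2:
        diff -= period           # map the upper half to negative differences
    return diff


# With mrk_max = 15, marker 1 was issued 3 cycles after marker 14.
print(marker_diff(1, 14, 15))
print(marker_diff(14, 1, 15))
```

The result is unambiguous only while the true difference stays well below half
of the marker range, which matches the limit given above.
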
HOW TO USE PROVIDED DEMOS
To use the demos you need the following packages:
Python3
GHDL
gtkwave

To start with a fresh configuration, with all additional delays set to 0, you should type:
make initial

Then you can test the design with:
make final

This performs the simulation in the "final mode". As the design is not yet synchronized properly,
you should see an error message like this:
src/ex1_eq_mf.vhd:116:16:@80ns:(report failure): EQ1 inequal latencies: out0=0, out1=-1
To synchronize the design correctly, you should type:
make synchro

This runs the simulation in the analysis mode, then creates the correct function for configuration
of the LECQ blocks, and finally runs the simulation once again.
You should see a report about two properly detected pulses:
src/ex1_proc_tb.vhd:122:11:@300ns:(report note): Hit with charge: 2.5e2 at 1.475999999999999e1
src/ex1_proc_tb.vhd:122:11:@380ns:(report note): Hit with charge: 2.649999999999999e2 at 2.549056603773585e1

You can run "make reader" to analyse the signals inside the design with the gtkwave viewer.
You can play with the number of inputs in a single adder and comparator by changing the constants
EX1_NOF_INS_IN_CMP and EX1_NOF_INS_IN_ADD in the file ex1_trees_pkg.vhd.

You will see how the delay in different paths changes, and how the system adapts to these changes.