OpenCores
URL https://opencores.org/ocsvn/lateq/lateq/trunk


The pipelined architecture is often used in high-speed FPGA cores.
In complex designs, data processing is often split into multiple
paths performing some (possibly different) operations in parallel.
In such a case it may be difficult to keep the same latency
(measured in clock periods) in all paths.

GUI-based tools (e.g. Xilinx System Generator or Altera DSP Builder)
take care of equalizing (balancing) the latencies (delays)
of different paths.
However, it seems that up to now there has been no good solution for designers
implementing their designs in HDL.
This project offers a methodology which automatically balances
latencies in different paths of a pipelined core, so that data arriving
at a given processing block, or appearing on the output, are properly
aligned in time.
As estimating the delay based on analysis of the source code may be difficult
and error prone (if the author uses non-standard solutions), the method
is based on simulation.

Data going through the IP core are labelled (in simulation only) with
an additional "time marker", which the user has to generate on the input.
The directives "-- pragma translate_off" and "-- pragma translate_on" are
used to limit generation and processing of those labels to simulation only.

Wherever the user wants to equalize latency in certain data paths, he/she
places a special block (latency equalizer - "lateq").
The equalized data paths are routed through shift registers with lengths
calculated from the results of the previous simulation (the initial length is 0).
The block may work in two simulation modes and one synthesis mode.
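The behaviour of such an equalizing shift register can be modelled, purely for illustration, in a few lines of Python (the actual implementation in the project is of course synthesizable VHDL):

```python
from collections import deque

class DelayLine:
    """Illustrative model of a shift register delaying samples by a
    fixed number of clock periods (as used to equalize path latency)."""
    def __init__(self, length, init=0):
        # length 0 means a pass-through, as in the initial configuration
        self.regs = deque([init] * length, maxlen=length) if length else None

    def clock(self, sample):
        if self.regs is None:
            return sample          # zero-length line: no delay
        out = self.regs[0]         # oldest stored sample leaves the line
        self.regs.append(sample)   # deque with maxlen drops the oldest item
        return out
```

A line of length 2, for example, outputs its initialization value for the first two clocks and then the inputs, each delayed by two periods.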

1. In the standard simulation mode it reports the time markers
of the data in different paths. This "delay report" is written to a file
(the next version may use a function written in C++ and called via VHPI to analyze
those reports without writing them to a file, or a dedicated program
connected via a named socket). The delay report file is then analyzed by
another program, "latreadgen.py", which generates a dedicated function returning the
appropriate delay for each path in each equalizer block.
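The core of the analysis can be illustrated with a hypothetical sketch (the real report format and the VHDL function generated by latreadgen.py differ): at a given clock, the path carrying the oldest time marker has the highest latency, and every other path needs enough extra register stages to match it.

```python
def compute_extra_delays(markers):
    """Given the time markers observed simultaneously on the paths of
    one equalizer (hypothetical simplified input), return the number of
    additional register stages each path needs for alignment."""
    oldest = min(markers)   # the path with the oldest marker is the slowest
    return [m - oldest for m in markers]
```

For example, paths showing markers 10, 8 and 9 need 2, 0 and 1 extra stages respectively.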

2. If the delays are already correctly selected, the user may set the
parameter switching on the "final verification". In this mode the delay
report is not generated, so the simulation is faster and no disk
space is used for the report. In this mode any inconsistency of the time
markers on the output of the "lateq" block causes a simulation error.

3. In the synthesis mode, all instructions related to time markers
and their processing are switched off using the "-- pragma translate_off" and
"-- pragma translate_on" directives. Therefore the system does not affect
the performance of the IP core.

The proposed system is offered in two versions.
1. The first version, located in the "single_type" subdirectory, assumes that
all time-aligned data in all data paths are of the same type.
It allows the use of a versatile latency checking and equalizing block implemented
in pure VHDL.

2. The second version, located in the "various_types" subdirectory, assumes
that the datapath may use a varying number of data of different types.
Unfortunately it is not possible (yet?) to implement such a flexible latency checking
and equalizing block in pure VHDL.
Generic types were introduced only in VHDL-2008, and they are still
not fully supported by most synthesis tools. Moreover, to implement the needed block
we would need to have records of different types as ports, and then iterate through the
fields of those records.
To solve that problem, the project contains the tool "lateqgen.py", which can generate
such a "latency checker and equalizer" (LECQ) for a particular number of paths of user
provided types. The calling syntax is:

lateqgen.py entity_name output_file type_for_path0 type_for_path1 ...

The number of paths in the LECQ block is defined by the number of types provided
at the end of the command line. Of course you can use the same type in two paths.
In this case you simply use the same type name more than once.

To allow passing of data together with the time markers through the design, the
data must be encapsulated in record types with an optional (simulation-only)
time marker field (lateq_mrk).
Due to the way the VHDL sources are generated, the name of
each user type passing through the LECQ must start with "T_"
(e.g. "T_MY_DATA"). Additionally, the user must define a constant which will
be used to initialize all registers. The name of this constant is derived from
the type name by replacing the initial "T_" with "C_" and appending "_INIT"
(so in our example it will be "C_MY_DATA_INIT").
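The naming convention can be summarised with a small helper of the kind lateqgen.py might use internally (hypothetical, shown only to pin down the rule):

```python
def init_constant_name(type_name):
    """Derive the initialization-constant name from a user type name,
    following the project's convention: the leading "T_" is replaced
    with "C_" and "_INIT" is appended."""
    if not type_name.startswith("T_"):
        raise ValueError("user type names must start with 'T_': " + type_name)
    return "C_" + type_name[2:] + "_INIT"
```

So "T_MY_DATA" maps to "C_MY_DATA_INIT", and a type not starting with "T_" is rejected.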

One more thing must be explained. It is necessary that each LECQ block
be uniquely identified. In theory we could use the VHDL INSTANCE_NAME
attribute for that.
Unfortunately it appears that the generated instance names are different in different
simulators. Most likely they will also be different in the synthesis tools,
so it will not be possible to pass the delays found in the simulation
to the synthesis (maybe for a single set of tools, like Vivado and its simulator,
it would be possible to adapt the tools to convert those instance names
appropriately).
To avoid the described problem, the user should pass a unique LEQ_ID generic
to each instance of the delay equalizer.
What if this block is located in another one, used multiple times?
In this case the user should implement the LEQ_ID generic also in this
container block, and pass a unique value to it. Then the LEQ_ID generic
value from the instance of the container block should be concatenated with
a ":" string and the unique value used for the particular instance of the LECQ block.
If the blocks are instantiated in a for-generate loop, the user should also concatenate
the loop variable value converted to a string (with the integer'image function),
again separated with ":".
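The resulting identifier scheme can be sketched in Python (the real code builds these strings in VHDL, e.g. LEQ_ID & ":" & integer'image(i) & ":" & local name; the names below are invented for illustration):

```python
def leq_id(parent_id, *parts):
    """Build a hierarchical LEQ_ID by joining the parent identifier
    with local parts (block names, generate-loop indices) using ":",
    mirroring the VHDL concatenation described in the text."""
    return ":".join([parent_id] + [str(p) for p in parts])
```

For example, the third copy of an equalizer inside a container block instantiated as "TOP" could get the identifier "TOP:ADDER_TREE:3:EQ1".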

To allow you to check the described technology, the project contains a relatively
simple system with 64 inputs from ADC converters, measuring the signals from a certain
particle detector.
The signal generated by a particle passing through the detector is distributed
between neighbouring channels (strips). To find the position and energy of
the particle, the system first finds the strip with the maximal signal level.
Then it selects a predefined number of channels surrounding that "central" strip.
For the selected channels it calculates the sum of the signal
(the energy of the particle) and the sum of each signal multiplied by its deviation
from the central channel (so the center of gravity of the registered signal may be
calculated to improve the resolution of the detector).
The final calculation of the hit position is done in the testbench, as I didn't want to
increase the project's complexity by implementing a divider block.
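The processing described above can be sketched in Python (a simplified, purely illustrative model of the algorithm; the window size, edge handling and function names are assumptions, not taken from the VHDL sources):

```python
def reconstruct_hit(samples, half_window=2):
    """Find the strip with the maximal signal, sum the signal in a
    window around it (the charge), and compute the centre of gravity
    to refine the hit position (simplified illustrative model)."""
    centre = max(range(len(samples)), key=lambda i: samples[i])
    lo = max(0, centre - half_window)
    hi = min(len(samples), centre + half_window + 1)
    charge = sum(samples[lo:hi])
    # sum of each signal multiplied by its deviation from the centre
    weighted = sum(samples[i] * (i - centre) for i in range(lo, hi))
    # the division is done in the testbench in the real design
    position = centre + weighted / charge
    return charge, position
```

A symmetric cluster yields the central strip position exactly; an asymmetric one shifts the position towards the heavier side by the centre-of-gravity correction.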

Finding the maximum and calculating the sums is performed in hierarchical
tree-based comparators and adders. The user may define the number of inputs handled
in each node of the tree (parameters "EX1_NOF_INS_IN_CMP" and "EX1_NOF_INS_IN_ADD"
in the ex1_trees_pkg.vhd file). You can play with these values, changing the number
of levels, and hence the delay in different paths.
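The number of levels in such a reduction tree, and hence its latency contribution, grows logarithmically with the number of inputs. A small sketch of the relation (illustrative only; the real tree is generated recursively in VHDL):

```python
import math

def tree_levels(n_inputs, fanin):
    """Number of levels in a reduction tree in which each node
    combines up to `fanin` inputs (one level per clock stage)."""
    levels = 0
    while n_inputs > 1:
        n_inputs = math.ceil(n_inputs / fanin)
        levels += 1
    return levels
```

For the 64-input demo, a fan-in of 2 gives 6 levels, while a fan-in of 4 gives only 3, which is exactly the kind of latency change the equalizers then absorb.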

The two versions of the demo also use different implementations of time markers.
The first one simply uses integers starting from -1 and increased every clock pulse.
So this implementation will fail after 2^31 clock cycles.
The second version uses time markers defined as integers starting from -1, and then
increasing every clock pulse until a certain predefined C_LATEQ_MRK_MAX value is reached.
On the next pulse the time marker returns to 0. Such an implementation allows much
longer simulations and works correctly as long as the highest latency difference is below
C_LATEQ_MRK_MAX/2-1 (of course the latency difference must be calculated in a special way).
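The wraparound scheme can be modelled as follows (illustrative sketch; C_LATEQ_MRK_MAX is the project's name for the limit, but its value here and the exact form of the "special" difference are assumptions):

```python
C_LATEQ_MRK_MAX = 1000   # value chosen here only for illustration

def next_marker(m):
    """Advance a time marker, wrapping back to 0 after C_LATEQ_MRK_MAX."""
    return 0 if m == C_LATEQ_MRK_MAX else m + 1

def marker_diff(a, b):
    """Wraparound-aware difference of two markers: the result is taken
    modulo the marker period and mapped back to a signed value, which
    is correct while the real latency difference stays below
    C_LATEQ_MRK_MAX/2 - 1."""
    period = C_LATEQ_MRK_MAX + 1
    d = (a - b) % period
    return d if d <= period // 2 else d - period
```

With this comparison, a marker that has just wrapped to a small value is still recognized as being a few cycles ahead of one near C_LATEQ_MRK_MAX, rather than far behind it.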

HOW TO USE PROVIDED DEMOS
To use the demos you need the following packages:
Python3
GHDL
gtkwave

To start with a fresh configuration, with all additional delays set to 0, you should type:
make initial

Then you can test the design with:
make final

This performs the simulation in the "final mode". As the design is not synchronized properly,
you should see an error message like this:
src/ex1_eq_mf.vhd:116:16:@80ns:(report failure): EQ1 inequal latencies: out0=0, out1=-1
To synchronize the design correctly, you should type:
make synchro

This runs the simulation in the analysis mode, then creates the correct function for configuration
of the LECQ blocks, and finally runs the simulation once again.
You should see reports about two properly detected pulses:
src/ex1_proc_tb.vhd:122:11:@300ns:(report note): Hit with charge: 2.5e2 at 1.475999999999999e1
src/ex1_proc_tb.vhd:122:11:@380ns:(report note): Hit with charge: 2.649999999999999e2 at 2.549056603773585e1

You can run "make reader" to analyse the signals inside the design with the gtkwave viewer.
You can play with the number of inputs in a single adder and comparator by changing the constants
EX1_NOF_INS_IN_CMP and EX1_NOF_INS_IN_ADD in the file ex1_trees_pkg.vhd.

You will see how the delay in different paths changes, and how the system adapts to these changes.



