Pipelined architectures are often used in high-speed FPGA cores. In complex designs, data processing is often split into multiple
paths performing (possibly different) operations in parallel. In such a case it may be difficult to keep the same latency
(measured in clock periods) in all paths.
GUI-based tools (e.g. Xilinx System Generator or Altera DSP Builder) take care of equalizing (balancing) the latencies (delays) of different paths.
However, it seems that up to now there has been no good solution for designers implementing their designs in HDL.
This project offers a methodology that automatically balances latencies in different paths of a pipelined core, so that data arriving
at a certain processing block, or appearing on the output, are properly aligned in time.
As estimating the delay from analysis of the source code may be difficult and error prone (especially if the author uses non-standard solutions), the method
is based on simulation (a more detailed description of the method is available in two papers: "Automatic latency equalization in VHDL-implemented complex pipelined systems" and "Automatic latency balancing in VHDL-implemented complex pipelined systems").
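The idea behind simulation-based measurement can be sketched in a few lines of Python (an illustrative model, not part of the project's tooling): each path is an opaque pipeline of unknown depth, and comparing the time markers that meet at a join point directly reveals the latency difference.

```python
from collections import deque

def make_path(latency):
    """Model a pipeline path as a shift register of 'latency' stages."""
    pipe = deque([None] * latency)
    def step(marker):
        pipe.append(marker)
        return pipe.popleft()
    return step

# Two parallel paths with latencies unknown to the "user".
path_a = make_path(3)
path_b = make_path(5)

# Feed both paths with the same time marker on every clock cycle.
diffs = set()
for clk in range(20):
    out_a = path_a(clk)
    out_b = path_b(clk)
    if out_a is not None and out_b is not None:
        diffs.add(out_a - out_b)  # marker difference = latency difference

print(diffs)  # {2}: path_a is 2 cycles ahead and needs 2 extra delay stages
```

The difference is constant over the whole run, which is why the method can check latency alignment during the entire simulation rather than only at the first valid sample.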
A similar method was used in the old "Xilinx Sync Block" available e.g. in the Xilinx System Generator for Simulink
(see https://safe.nrao.edu/wiki/pub/CICADA/WebHome/xilinx_ref_guide.pdf , page 47). That block, however,
analyzed only the moment when the first valid data appeared on its inputs, while the method proposed in this project
analyzes the latency differences during the whole simulation period.
Data going through the IP core are labelled (in simulation only) with an additional "time marker", which the user has to generate on the input.
The directives "-- pragma translate_on" and "-- pragma translate_off" are used to limit the generation and processing of those labels to simulation only.
Wherever the user wants to equalize latency in certain data paths, he/she places a special block (latency checker and equalizer: LCEQ or "lateq").
The equalized data paths are routed through shift registers whose lengths are calculated from the results of the previous simulation (the initial length is 0).
The block may work in two simulation modes and one synthesis mode.
- In the standard simulation mode it reports the time markers of the data in the different paths. This "delay report" is written to a file (the next version may use a C++ function called via VHPI to analyze those reports without writing them to a file, or a dedicated program
connected via a named socket). The delay report file is then analyzed by another program, "latreadgen.py", which generates a dedicated function returning the appropriate delay for each path in each equalizer block.
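The core of such an analysis can be sketched in Python (the actual report format and the code generated by latreadgen.py are defined by the project; here both the input structure and the function name are illustrative assumptions). A larger time marker at a join point means the data took a shorter route, so that path needs the extra delay.

```python
def compute_delays(report):
    """For each LCEQ block, given the time markers observed on its paths
    at the same simulation instant, return the extra delay (in clock
    cycles) that each path needs: marker minus the smallest marker."""
    delays = {}
    for block_id, markers in report.items():
        lowest = min(markers)
        delays[block_id] = [m - lowest for m in markers]
    return delays

# Hypothetical report: block "LCEQ1" saw markers 7 and 5 on its two paths,
# a nested block saw 4, 4 and 9 on its three paths.
report = {"LCEQ1": [7, 5], "TOP:0:LCEQ2": [4, 4, 9]}
print(compute_delays(report))
# {'LCEQ1': [2, 0], 'TOP:0:LCEQ2': [0, 0, 5]}
```

After these delays are compiled into the design, all markers at each LCEQ output coincide, which is exactly what the "final verification" mode checks.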
- If the delays are already correctly selected, the user may set the parameter switching on the "final verification" mode. In this mode the delay
report is not generated, so the simulation is faster and no disk space is used for the report. Any inconsistency of time markers on the output of a "lateq" block then causes a simulation error.
- In the synthesis mode, all instructions related to time markers and their processing are switched off using the "-- pragma translate_off" and
"-- pragma translate_on" directives. Therefore the system does not affect the performance of the IP core.
The proposed system is offered in two versions.
- The first version, located in the "single_type" subdirectory, assumes that the time-aligned data in all data paths are of the same type.
This allows the versatile latency checking and equalizing block to be implemented in pure VHDL.
- The second version, located in the "various_types" subdirectory, assumes that the datapath may use a varying number of data of different types.
Unfortunately, it is not (yet?) possible to implement such a flexible latency checking and equalizing block in pure VHDL.
Generic types were introduced only in VHDL-2008, and they are still not fully supported by most synthesis tools. But to implement the needed block
we would need records of different types as ports, and the ability to iterate through the fields of those records.
To solve that problem, the project contains the tool "lateqgen.py", which can generate such a "latency checker and equalizer" (LCEQ) for a particular number of paths of user-provided types. The calling syntax is:
lateqgen.py entity_name output_file type_for_path0 type_for_path1 ...
The number of paths in the LCEQ block is defined by the number of types provided at the end of the command line. Of course, you can use the same type in two paths; in this case you simply use the same type name more than once.
To allow data to pass through the design together with the time markers, the data must be encapsulated in record types with an optionally (only for simulation) added time marker field (lateq_mrk).
Due to the way the VHDL sources are generated, the name of each user type passing through the LCEQ must start with "T_"
(e.g. "T_MY_DATA"). Additionally, the user must define a constant that will be used to initialize all registers. The name of this constant is derived from the type name by replacing the initial "T_" with "C_" and appending "_INIT" (so in our example it will be "C_MY_DATA_INIT").
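The naming rule above can be expressed as a small helper, roughly as a generator like lateqgen.py might apply it (the function name here is illustrative, not taken from the project):

```python
def init_constant_name(type_name):
    """Derive the name of the required initialization constant from a
    user record type name: replace the leading 'T_' with 'C_' and
    append '_INIT'."""
    if not type_name.startswith("T_"):
        raise ValueError("type name must start with 'T_': " + type_name)
    return "C_" + type_name[2:] + "_INIT"

print(init_constant_name("T_MY_DATA"))  # C_MY_DATA_INIT
```

So for a path carrying "T_MY_DATA" records, the user has to provide "C_MY_DATA_INIT" before the generated LCEQ block will compile.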
One more thing must be explained: each LCEQ block must be uniquely identified. In theory we could use the VHDL INSTANCE_NAME attribute for this. Unfortunately, it appears that the generated instance names differ between simulators. Most likely they also differ in the synthesis tools, so it would not be possible to pass the delays found in simulation to synthesis (perhaps for a single set of tools, like Vivado and its simulator, it would be possible to adapt the tools to convert those instance names appropriately).
To avoid the described problem, the user should pass a unique LEQ_ID generic to each instance of the delay equalizer.
What if this block is located inside another one, which is itself used multiple times?
In this case the user should also implement the LEQ_ID generic in this container block and pass a unique value to it. Then the LEQ_ID
value from the instance of the container block should be concatenated, with a ":" separator, with the unique value used for the particular instance of the LCEQ block.
If the blocks are instantiated in a for-generate loop, the user should also concatenate the loop variable value converted to a string (with the integer'image function), again separated with ":".
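The resulting identifier scheme can be illustrated in Python (the specific names "PROC_TREE" and "LCEQ_SUM" are made up for the example; in VHDL the loop index would be converted with integer'image):

```python
def nested_leq_id(container_id, loop_index, local_id):
    """Build a unique LEQ_ID for an LCEQ instance that sits inside a
    container block instantiated in a for-generate loop: container id,
    loop index and local id, joined with ':' separators."""
    return container_id + ":" + str(loop_index) + ":" + local_id

print(nested_leq_id("PROC_TREE", 3, "LCEQ_SUM"))  # PROC_TREE:3:LCEQ_SUM
```

Because each level of the hierarchy contributes its own unique component, the concatenated string stays unique no matter how many times the container block is reused.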
To allow you to try out the described technology, the project contains a relatively simple system with 64 inputs from ADC converters, measuring the signals from a certain particle detector.
The signal generated by a particle passing through the detector is distributed between neighbouring channels (strips). To find the position and energy of the particle, the system first finds the strip with the maximal signal level. Then it selects a predefined number of channels surrounding that "central" strip. For the selected channels it calculates the sum of the signal (the energy of the particle) and the sum of each signal multiplied by its distance from the central channel (so the center of gravity of the registered signal may be calculated to improve the resolution of the detector).
The final calculation of the hit position is done in the testbench, as I didn't want to increase the project's complexity by implementing a divider block.
Finding the maximum and calculating the sums is performed in hierarchical tree-based comparators and adders. The user may define the number of inputs handled in each node of the tree (parameters "EX1_NOF_INS_IN_CMP" and "EX1_NOF_INS_IN_ADD" in the ex1_trees_pkg.vhd file). You can play with these values, changing the number of levels, and hence the delay in different paths.
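The demo's processing can be sketched in Python to show the arithmetic (this is a behavioural model of the algorithm described above, not the project's VHDL; the window width and sample values are arbitrary):

```python
def hit_parameters(samples, window):
    """Find the strip with the maximal signal, then sum the signal and
    the distance-weighted signal over 'window' strips on each side.
    The final division (center of gravity) is left to the caller, as
    the demo leaves it to the testbench to avoid a divider block."""
    center = max(range(len(samples)), key=lambda i: samples[i])
    lo = max(0, center - window)
    hi = min(len(samples), center + window + 1)
    charge = sum(samples[lo:hi])
    weighted = sum(samples[i] * (i - center) for i in range(lo, hi))
    return center, charge, weighted

# A pulse spread over neighbouring strips, slightly right of strip 4.
center, charge, weighted = hit_parameters([0, 1, 5, 20, 60, 50, 10, 2], 2)
position = center + weighted / charge  # refinement done "in the testbench"
print(center, charge, weighted)  # 4 145 40
```

Note that the charge sum and the weighted sum are computed over the same selected channels, so both adder trees must deliver time-aligned data: this is exactly where the latency equalization matters.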
The two versions of the demo also use different implementations of time markers. The first one simply uses integers starting from -1 and incremented on every clock pulse, so the implementation will fail after 2^31 clock cycles.
The second version uses time markers defined as integers starting from -1 and then incremented on every clock pulse until a certain predefined C_LATEQ_MRK_MAX value is reached; on the next pulse the time marker returns to 0. Such an implementation allows much longer simulations and works correctly as long as the highest latency difference stays below C_LATEQ_MRK_MAX/2-1 (of course, the latency difference must then be calculated in a special way).
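The "special way" of comparing wrap-around markers is ordinary modular arithmetic; a minimal Python sketch, assuming an example C_LATEQ_MRK_MAX of 1000 (the real value is user-defined):

```python
C_LATEQ_MRK_MAX = 1000  # example value; the real constant is user-defined

def marker_diff(m_fast, m_slow):
    """Latency difference between two wrap-around time markers.
    Correct as long as the true difference stays below
    C_LATEQ_MRK_MAX/2 - 1."""
    d = (m_fast - m_slow) % C_LATEQ_MRK_MAX
    # Map the modular difference back to a signed value.
    if d > C_LATEQ_MRK_MAX // 2:
        d -= C_LATEQ_MRK_MAX
    return d

print(marker_diff(3, 998))   # 5  (the faster path's marker has wrapped)
print(marker_diff(10, 15))   # -5
```

The half-range limit exists because, once the two markers drift apart by more than C_LATEQ_MRK_MAX/2, the signed interpretation of the modular difference becomes ambiguous.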
HOW TO USE PROVIDED DEMOS
To use the demos you need the following packages:
To start with a fresh configuration, with all additional delays set to 0, you should type:
Then you can test the design with:
This performs the simulation in the "final mode". As the design is not yet synchronized properly,
you should see an error message like this:
src/ex1_eq_mf.vhd:116:16:@80ns:(report failure): LCEQ1 inequal latencies: out0=0, out1=-1
To synchronize the design correctly, you should type:
This runs the simulation in the analysis mode, then creates the correct function for the configuration
of the LCEQ blocks, and finally runs the simulation once again.
You should see a report about the two properly detected pulses:
src/ex1_proc_tb.vhd:122:11:@300ns:(report note): Hit with charge: 2.5e2 at 1.475999999999999e1
src/ex1_proc_tb.vhd:122:11:@380ns:(report note): Hit with charge: 2.649999999999999e2 at 2.549056603773585e1
You can run "make reader" to analyse the signals inside the design with the gtkwave viewer.
You can play with the number of inputs in a single adder and comparator by changing the constants
EX1_NOF_INS_IN_CMP and EX1_NOF_INS_IN_ADD in the file ex1_trees_pkg.vhd.
You will see how the delay in different paths changes, and how the system adapts to these changes.