1 |
7 |
tomtor |
# HDL-deflate
|
2 |
|
|
FPGA implementation of deflate (de)compress RFC 1950/1951
|
3 |
|
|
|
4 |
|
|
This design is implemented in MyHDL (www.myhdl.org) and can be translated to Verilog or VHDL.
|
5 |
|
|
|
6 |
|
|
It has been verified in Icarus, Xilinx Vivado and on a physical Xilinx device (Digilent Arty).
|
7 |
|
|
|
8 |
|
|
Usage should be clear from the test bench in `test_deflate.py`.
|
9 |
|
|
|
10 |
|
|
# Tunable parameters
|
11 |
|
|
|
12 |
|
|
OBSIZE = 8192 # Size of output buffer (BRAM)
|
13 |
|
|
# You need 32768 to compress ALL valid deflate streams!
|
14 |
|
|
|
15 |
|
|
IBSIZE = 2048 # Size of input buffer (LUT-RAM)
|
16 |
|
|
|
17 |
|
|
CWINDOW = 32 # Search window for compression
|
18 |
|
|
|
19 |
|
|
## Compression efficiency
|
20 |
|
|
|
21 |
|
|
One can use a sliding window to reduce the size of the input buffer and the LUT-usage.
|
22 |
|
|
|
23 |
|
|
The minimal value is 2 * CWINDOW (64 bytes), the UnitTest in `test_deflate.py`
|
24 |
|
|
uses this strategy.
|
25 |
|
|
|
26 |
|
|
By default the compressor will reduce repeated 3/4/5 byte sequences in the search window to 15 bit.
|
27 |
|
|
This will result in a decent compression ratio for many real life input data patterns.
|
28 |
|
|
|
29 |
|
|
At the expense of additional LUTs one can improve this by enlarging the `CWINDOW` or expanding
|
30 |
|
|
the matching code to include 6/7/8/9/10 byte matches. Set `MATCH10` to `True` in the top of `deflate.py`
|
31 |
|
|
to activate this option.
|
32 |
|
|
|
33 |
|
|
Another strategy for data sets with just a small set of used byte values would be
|
34 |
|
|
to use a dedicated pre-computed Huffman tree. I could add this if there is interest, but it is probably
|
35 |
|
|
better to use a more dense coding in your FPGA application data in the first place.
|
36 |
|
|
|
37 |
|
|
## Compression speed
|
38 |
|
|
|
39 |
|
|
To reduce LUT usage the original implementation matched each slot in the search window in a dedicated clock cycle.
|
40 |
|
|
By setting `FAST` to `True` it will generate the logic to match the whole window in a single cycle.
|
41 |
|
|
The effective speed will be around 1 input byte every two cycles.
|
42 |
|
|
|
43 |
|
|
## Disabling functionality to save LUTs
|
44 |
|
|
|
45 |
|
|
The compress mode can be disabled by setting `COMPRESS` to `False`.
|
46 |
|
|
|
47 |
|
|
The decompress mode can be disabled by setting `DECOMPRESS` to `False`.
|
48 |
|
|
|
49 |
|
|
As an option you can disable dynamic tree decompression by setting `DYNAMIC` to `False`.
|
50 |
|
|
This will save a lot of LUT-ram and HDL-Deflate compressed output is always using a static tree,
|
51 |
|
|
but zlib will normally generate dynamic trees. Set zlib option `Z_FIXED` to generate streams with
|
52 |
|
|
a static tree.
|
53 |
|
|
|
54 |
|
|
In general the size of `leaves` and `d_leaves` can be reduced a lot when the maximal length of the input stream
|
55 |
|
|
is less than 32768. One can replace `test_data()` in `test_deflate.py` with a specific version which generates
|
56 |
|
|
typical test data for the intended FPGA application, and repeatedly halve the sizes of the `leaves` arrays
|
57 |
|
|
until the test fails.
|
58 |
|
|
|
59 |
|
|
FAST MATCH10 compress only has quite good resource usage.
|
60 |
|
|
|
61 |
|
|
## Practical considerations
|
62 |
|
|
|
63 |
|
|
In general HDL-Deflate is interesting when speed is important. When speed is not a real issue using a (soft)
|
64 |
|
|
CPU with zlib is probably the better approach. Especially decompression is also quite fast with a CPU and HDL-Deflate
|
65 |
|
|
needs a lot of LUTs when configured to decompress ANY deflate input stream. Compression is another story because it
|
66 |
|
|
is a LOT faster in hardware with the `FAST` option and uses a reasonable amount of LUTs.
|
67 |
|
|
|
68 |
|
|
# FPGA validation
|
69 |
|
|
|
70 |
|
|
## Minimal with smaller leaves arrays
|
71 |
|
|
|
72 |
|
|
Resource|Estimation
|
73 |
|
|
--------|----------
|
74 |
|
|
LUT |7116
|
75 |
|
|
LUTRAM |800
|
76 |
|
|
FF |2265
|
77 |
|
|
BRAM |4
|
78 |
|
|
|
79 |
|
|
## Compress False and smaller leaves arrays
|
80 |
|
|
|
81 |
|
|
Resource|Estimation
|
82 |
|
|
--------|----------
|
83 |
|
|
LUT |5769
|
84 |
|
|
LUTRAM |512
|
85 |
|
|
FF |2169
|
86 |
|
|
BRAM |4
|
87 |
|
|
|
88 |
|
|
## Decompress False and FAST and MATCH10
|
89 |
|
|
|
90 |
|
|
Resource|Estimation
|
91 |
|
|
--------|----------
|
92 |
|
|
LUT |5118
|
93 |
|
|
LUTRAM |84
|
94 |
|
|
FF |1577
|
95 |
|
|
BRAM |2
|
96 |
|
|
|
97 |
|
|
## FAST
|
98 |
|
|
|
99 |
|
|
Resource|Estimation
|
100 |
|
|
--------|----------
|
101 |
|
|
LUT |8246
|
102 |
|
|
LUTRAM |704
|
103 |
|
|
FF |2520
|
104 |
|
|
BRAM |4
|
105 |
|
|
|
106 |
|
|
## FAST and MATCH10
|
107 |
|
|
|
108 |
|
|
Resource|Estimation
|
109 |
|
|
--------|----------
|
110 |
|
|
LUT |11888
|
111 |
|
|
LUTRAM |536
|
112 |
|
|
FF |3308
|
113 |
|
|
BRAM |4
|
114 |
|
|
|
115 |
|
|
## Speed
|
116 |
|
|
|
117 |
|
|
The Vivado timing report fails at 100Mhz, but the test bench runs fine on my Arty at 100Mhz.
|
118 |
|
|
|
119 |
|
|
# Future Improvements (when there is interest)
|
120 |
|
|
|
121 |
|
|
* ~~Reduce LUT usage~~
|
122 |
|
|
* ~~Improve speed from current 80Mhz to 100Mhz~~
|
123 |
|
|
* ~~Improve compression performance~~
|