GENERAL INFORMATION

The DCT soft core is a unit that performs the Discrete Cosine Transform (DCT). In pipelined mode it computes a two-dimensional 8x8-point DCT every 64 clock cycles.

FEATURES

Key features

- more than 300 MHz sampling frequency, 64-cycle calculation period,
- approximately 330 CLBs and 4 DSP48E in Virtex-5 device,
- 2 DSP48E when the scaled output data mode is used,
- 8-bit input data,
- 11-bit coefficients,
- 12-bit results,
- pipelined mode,
- latency from input to output is 132 clock cycles,
- structure optimized for Xilinx Virtex™ and Spartan™ FPGA devices.

Design features

- DCT core intended for transforming data into a format that can be easily compressed.

The characteristics of the DCT make it well suited to image compression algorithms. The DCT removes spatial redundancy in two-dimensional data, which is why it is used in the JPEG and MPEG compression standards.

In general, the two-dimensional DCT transforms an \(n \times n\) data array into an \(n \times n\) result array. First the n-point DCT transforms the columns, then it transforms the rows.

The 8-point DCT is calculated according to the equations

\[ Y(0) = \frac{1}{\sqrt{8}} \sum_{m=0}^{7} X(m), \quad \text{and} \quad Y(k) = \frac{1}{2} \sum_{m=0}^{7} X(m) \cos \left(\frac{(2m+1)k\pi}{16}\right), \quad k=1,2,\ldots,7, \]

where \( Y(k) \) is the \( k \)-th DCT coefficient and \( X(m) \) is the \( m \)-th input sample.
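
As a cross-check, the two equations above can be evaluated directly. The following is a minimal Python sketch of that direct form, not the core's implementation; the function name `dct8` is chosen here for illustration only.

```python
import math


def dct8(x):
    """Evaluate the 8-point DCT directly from its defining sums."""
    if len(x) != 8:
        raise ValueError("dct8 expects exactly 8 samples")
    # Y(0): plain sum scaled by 1/sqrt(8).
    y = [sum(x) / math.sqrt(8.0)]
    # Y(k), k = 1..7: cosine-weighted sum scaled by 1/2.
    for k in range(1, 8):
        acc = sum(x[m] * math.cos((2 * m + 1) * k * math.pi / 16.0)
                  for m in range(8))
        y.append(0.5 * acc)
    return y
```

With this normalization the transform is orthonormal, so a constant input contributes only to Y(0) and the sum of squares of the samples is preserved in the coefficients.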
DCT IP core

The input 8x8 image data block consists of integers in the range 0 to 255. Before the DCT is calculated, the mean value 128 is subtracted from the input data to reduce the redundancy in the input block. The core can also accept input data in the range –128 to 127, in which case the mean value 128 is not subtracted.
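
The level shift and the column-then-row order described earlier can be sketched as follows. This is an illustrative floating-point model under the assumption that the orthonormal 8-point DCT defined above is applied in both passes; `dct2d_8x8` and its `unsigned` flag are hypothetical names, not part of the core.

```python
import math


def dct8(x):
    """8-point DCT evaluated directly from the defining sums."""
    y = [sum(x) / math.sqrt(8.0)]
    for k in range(1, 8):
        y.append(0.5 * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / 16.0)
                           for m in range(8)))
    return y


def dct2d_8x8(block, unsigned=True):
    """2-D DCT of an 8x8 block: optional level shift, columns, then rows.

    `unsigned` is a hypothetical flag: True means samples are 0..255 and
    128 is subtracted first; False means samples are already -128..127.
    """
    shift = 128 if unsigned else 0
    shifted = [[v - shift for v in row] for row in block]
    # Column pass: cols[c][k] is coefficient k of column c.
    cols = [dct8([shifted[r][c] for r in range(8)]) for c in range(8)]
    # Row pass: for each vertical frequency k, transform across the columns.
    return [dct8([cols[c][k] for c in range(8)]) for k in range(8)]
```

For a block in which every sample is 255, only the coefficient at position (0, 0) is non-zero, because the shifted block is constant.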

After the DCT is calculated, the data can be reduced by keeping the few DCT results that carry the important information and setting the remaining coefficients to zero. This works because the image energy is concentrated in a few DCT coefficients.

Compared with the Discrete Fourier Transform, the DCT has the following advantages:

- High image energy compaction
- Lower blocking artifacts
- Real data, coefficients and arithmetic only.

• Efficient DCT algorithm

To minimize the amount of calculation, the Arai, Agui, and Nakajima (AAN) 8-point DCT algorithm is used. This algorithm consists of the following calculations:

\[
\begin{aligned}
b_0 &= a(0) + a(7); \\
b_1 &= a(1) + a(6); \\
b_2 &= a(3) - a(4); \\
b_3 &= a(1) - a(6); \\
b_4 &= a(2) + a(5); \\
b_5 &= a(3) + a(4); \\
b_6 &= a(2) - a(5); \\
b_7 &= a(0) - a(7); \\
c_0 &= b_0 + b_5; \\
c_1 &= b_1 - b_4; \\
c_2 &= b_2 + b_6; \\
c_3 &= b_1 + b_4; \\
c_4 &= b_0 - b_5; \\
c_5 &= b_3 + b_7; \\
c_6 &= b_3 + b_6; \\
c_7 &= b_7; \\
d_0 &= c_0 + c_3; \\
d_1 &= c_0 - c_3; \\
d_2 &= c_2; \\
d_3 &= c_1 + c_4; \\
d_4 &= c_2 - c_5; \\
d_5 &= c_4; \\
d_6 &= c_5; \\
d_7 &= c_6; \\
d_8 &= c_7; \\
e_0 &= d_0; \\
e_1 &= d_1; \\
e_2 &= m_3 \cdot d_2; \\
e_3 &= m_1 \cdot d_7; \\
e_4 &= m_4 \cdot d_6; \\
e_5 &= d_5; \\
e_6 &= m_1 \cdot d_3; \\
e_7 &= m_2 \cdot d_4; \\
e_8 &= d_8; \\
f_0 &= e_0; \\
f_1 &= e_1; \\
f_2 &= e_5 + e_6; \\
f_3 &= e_5 - e_6; \\
f_4 &= e_3 + e_8; \\
f_5 &= e_8 - e_3; \\
f_6 &= e_2 + e_7; \\
f_7 &= e_4 + e_7; \\
\mathrm{sa}(0) &= f_0; \\
\mathrm{sa}(1) &= f_4 + f_7; \\
\mathrm{sa}(2) &= f_2; \\
\mathrm{sa}(3) &= f_5 - f_6; \\
\mathrm{sa}(4) &= f_1; \\
\mathrm{sa}(5) &= f_5 + f_6; \\
\mathrm{sa}(6) &= f_3; \\
\mathrm{sa}(7) &= f_4 - f_7;
\end{aligned}
\]

\[
y(i) = \mathrm{sa}(i) \cdot s(i) \cdot 4.0, \quad i = 0, 1, \ldots, 7,
\]

where
\(a(i), y(i)\) are the input and output data;
\(m_1 = \cos(\pi/4);\)
\(m_2 = \cos(3\pi/8);\)
\(m_3 = \cos(\pi/8) - \cos(3\pi/8);\)
\(m_4 = \cos(\pi/8) + \cos(3\pi/8);\)
\(s(0) = 0.5/\sqrt{2};\)
\(s(i) = 0.25/\cos(\pi i/16), \; i=1,2,\ldots,7.\)
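
The listing can be transcribed almost line for line into software. The Python sketch below is a floating-point illustration, not the core's fixed-point datapath. It takes \(m_2 = \cos(3\pi/8)\), \(m_3 = \cos(\pi/8) - \cos(3\pi/8)\) and \(m_4 = \cos(\pi/8) + \cos(3\pi/8)\), the values for which the result agrees with the defining DCT equations. The names `aan_unscaled`, `aan_dct8`, `dct8_direct` and the `gain` parameter are introduced here for illustration.

```python
import math

M1 = math.cos(math.pi / 4)
M2 = math.cos(3 * math.pi / 8)
M3 = math.cos(math.pi / 8) - M2
M4 = math.cos(math.pi / 8) + M2
# Output scale factors s(0)..s(7).
S = [0.5 / math.sqrt(2.0)] + [0.25 / math.cos(math.pi * i / 16.0)
                              for i in range(1, 8)]


def aan_unscaled(a):
    """Stages b..f of the listing; returns sa(0)..sa(7) (5 multiplications)."""
    b0, b1, b2, b3 = a[0] + a[7], a[1] + a[6], a[3] - a[4], a[1] - a[6]
    b4, b5, b6, b7 = a[2] + a[5], a[3] + a[4], a[2] - a[5], a[0] - a[7]
    c0, c1, c2, c3 = b0 + b5, b1 - b4, b2 + b6, b1 + b4
    c4, c5, c6, c7 = b0 - b5, b3 + b7, b3 + b6, b7
    d0, d1, d2, d3, d4 = c0 + c3, c0 - c3, c2, c1 + c4, c2 - c5
    d5, d6, d7, d8 = c4, c5, c6, c7
    e0, e1, e5, e8 = d0, d1, d5, d8
    e2, e3, e4, e6, e7 = M3 * d2, M1 * d7, M4 * d6, M1 * d3, M2 * d4
    f0, f1, f2, f3 = e0, e1, e5 + e6, e5 - e6
    f4, f5, f6, f7 = e3 + e8, e8 - e3, e2 + e7, e4 + e7
    return [f0, f4 + f7, f2, f5 - f6, f1, f5 + f6, f3, f4 - f7]


def aan_dct8(a, gain=4.0):
    """y(i) = sa(i) * s(i) * gain; gain=4.0 follows the listing."""
    return [sa * s * gain for sa, s in zip(aan_unscaled(a), S)]


def dct8_direct(x):
    """Reference: the 8-point DCT evaluated from its defining sums."""
    y = [sum(x) / math.sqrt(8.0)]
    for k in range(1, 8):
        y.append(0.5 * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / 16.0)
                           for m in range(8)))
    return y
```

With `gain=1.0` the output of `aan_dct8` matches `dct8_direct` to floating-point accuracy; with the listing's factor 4.0 it is four times larger.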

This algorithm needs only 13 multiplications. The last 8 of them, the multiplications by s(i), can be omitted when scaled outputs are acceptable. In that case the proper scaling is usually folded into the stage where the DCT coefficients are divided by the scale factors before Huffman coding.
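
One way to picture this folding: dividing \(y(i)\) by a quantizer step gives the same ratio as dividing the unscaled value sa(i) by a pre-computed divisor that already contains \(s(i)\). The Python sketch below shows the one-dimensional case only; the step values and sample data are illustrative, and the function names are hypothetical. Rounding to an integer would follow either path and is left out so that the two can be compared directly.

```python
import math

# Output scale factors s(0)..s(7) from the listing.
S = [0.5 / math.sqrt(2.0)] + [0.25 / math.cos(math.pi * i / 16.0)
                              for i in range(1, 8)]


def folded_divisors(q_steps, gain=4.0):
    """Merge the factor s(i) * gain into per-coefficient divisors."""
    return [q / (s * gain) for q, s in zip(q_steps, S)]


def divide_scaled(sa, q_steps, gain=4.0):
    """Scaled-output path: unscaled sa(i) divided by the folded divisor."""
    return [v / d for v, d in zip(sa, folded_divisors(q_steps, gain))]


def divide_full(sa, q_steps, gain=4.0):
    """Normal path: apply s(i) * gain first, then divide by the step."""
    return [(v * s * gain) / q for v, s, q in zip(sa, S, q_steps)]
```

The folded divisors depend only on the step table, so they can be computed once and reused for every block. In two dimensions the same idea applies with the product of the row and column factors.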

• Highly pipelined calculations

The 8x8 input data arrays are loaded sample by sample, one sample per clock cycle. The output data arrays are output in the same manner. No gaps are required between consecutive data arrays, and the pipelined calculations give the core a high maximum clock frequency.

Each 64-word array is processed in 64 clock cycles.

The input and output data are in natural order.

• Low hardware cost

Extensive resource sharing keeps the hardware volume small. The core uses 4 multipliers; when the output results are scaled, the number of multipliers drops to 2.

The intermediate result arrays are stored and transposed in FIFO buffers built from SRL16 elements, so no Block RAMs are used. This also keeps the latency between the first data input and the first data output down to 132 clock cycles.

• High precision computations

The quality of the decoded image depends on the computation precision, for example in the compression of photographic pictures. This core has a 12-bit output data width, and the maximum error does not exceed 3 least significant bits (for the 12-bit width), which corresponds to a maximum MSE of 0.25%.

The coefficients s(i) and m(i) have 11-bit precision (10 significant bits plus sign).

INTERFACE

Symbol

Figure 1 shows the DCT_AAN core symbol.

![Figure 1. DCT_AAN core symbol.](image)

Signal description

Table 1 describes the core signals.

<table>
<thead>
<tr>
<th>SIGNAL</th>
<th>TYPE</th>
<th>DESCRIPTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>CLK</td>
<td>input</td>
<td>Global clock</td>
</tr>
<tr>
<td>RST</td>
<td>input</td>
<td>Global reset</td>
</tr>
<tr>
<td>START</td>
<td>input</td>
<td>DCT start</td>
</tr>
<tr>
<td>EN</td>
<td>input</td>
<td>Clock enable</td>
</tr>
<tr>
<td>DATAIN[7:0]</td>
<td>input</td>
<td>Input data</td>
</tr>
<tr>
<td>DATA_OUT[15:0]</td>
<td>output</td>
<td>Output data</td>
</tr>
<tr>
<td>READY</td>
<td>output</td>
<td>Result ready strobe</td>
</tr>
</tbody>
</table>

Table 1. DCT_AAN core signal description.

Data Format

Input data are 8-bit integers. The generic constant d_signed selects whether they are interpreted as signed (two's complement) or unsigned values. In the unsigned case the mean value 128 is subtracted from every sample.

Output data are 12-bit two's complement integers. When the input signal varies in the range –128 to 127, the output signal varies in the range –1024 to 1016. When the output data are scaled, they vary in the range –2048 to 2032.
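
The unscaled limits can be reproduced with a small calculation, assuming the output follows the orthonormal two-dimensional DCT, whose DC term is the sum of the 64 samples divided by 8. The Python sketch below checks this and confirms that both quoted ranges fit a 12-bit two's complement word; the helper names are illustrative.

```python
def fits_signed(value, bits=12):
    """True if value is representable as a two's complement word of `bits`."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi


def dc_of_constant_block(sample):
    """DC term of an orthonormal 8x8 DCT for a block of identical samples."""
    return 64 * sample / 8


# Extremes of the signed input range give the quoted unscaled limits.
DC_MIN = dc_of_constant_block(-128)
DC_MAX = dc_of_constant_block(127)
```

The scaled-mode limits quoted above are exactly twice the unscaled ones, and –2048 is the most negative value a 12-bit word can hold.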

Typical Core Interconnection

A typical core interconnection is shown in Figure 2.

Components:

- **DCT_AAN** is the DCT core
- **INPUT_FRAME_BUFFER** is a RAM buffer that loads 8x8 blocks into the DCT core row by row
- **OUTPUT_FRAME_BUFFER** is a RAM buffer that stores 8x8 blocks of DCT coefficients row by row

IMPLEMENTATION DATA

Performance

Tables 2 and 3 list the DCT_AAN core performance in Xilinx devices for normal and scaled output data respectively. The core was synthesized using Xilinx ISE 9.2.

<table>
<thead>
<tr>
<th>Target device</th>
<th>Spartan3DSP-4</th>
<th>Virtex2P</th>
<th>Virtex4-12</th>
<th>Virtex5-3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area, CLB slices</td>
<td>629</td>
<td>742</td>
<td>685</td>
<td>297</td>
</tr>
<tr>
<td>System clock, Fmax</td>
<td>155 MHz</td>
<td>210 MHz</td>
<td>310 MHz</td>
<td>370 MHz</td>
</tr>
</tbody>
</table>

Table 2. Implementation data for normal output data and with 4 DSP48 units

<table>
<thead>
<tr>
<th>Target device</th>
<th>Spartan3DSP-4</th>
<th>Virtex2P</th>
<th>Virtex4-12</th>
<th>Virtex5-3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area, CLB slices</td>
<td>630</td>
<td>731</td>
<td>719</td>
<td>311</td>
</tr>
<tr>
<td>System clock, Fmax</td>
<td>150 MHz</td>
<td>185 MHz</td>
<td>300 MHz</td>
<td>380 MHz</td>
</tr>
</tbody>
</table>

Table 3. Implementation data for scaled output data and with 2 DSP48 units
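
Because one sample is accepted per clock cycle and a block takes 64 cycles, the Fmax figures translate directly into block throughput and latency. The Python sketch below does this arithmetic for the values in Table 2; it assumes the core actually runs at the reported Fmax, and the helper names are illustrative.

```python
# Fmax in MHz for normal output data (Table 2).
FMAX_MHZ_NORMAL = {
    "Spartan3DSP-4": 155,
    "Virtex2P": 210,
    "Virtex4-12": 310,
    "Virtex5-3": 370,
}


def blocks_per_second(fmax_mhz, cycles_per_block=64):
    """8x8 blocks processed per second at one block every 64 cycles."""
    return fmax_mhz * 1e6 / cycles_per_block


def latency_us(fmax_mhz, latency_cycles=132):
    """Input-to-output latency in microseconds (cycles divided by MHz)."""
    return latency_cycles / fmax_mhz


THROUGHPUT = {dev: blocks_per_second(f) for dev, f in FMAX_MHZ_NORMAL.items()}
```

The sample rate in samples per second equals the clock frequency itself, so only the block rate and the latency need converting.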

Testbench

Figure 3 illustrates the testbench for the core.

The component BMP_GENERATOR generates the stream of test arrays. In one mode it generates predefined arrays, in the other mode it generates randomized ones.
The tested instance DCT_AAN and the reference instance DCT_BEH are connected in parallel to the source BMP_GENERATOR. DCT_BEH is the behavioral model of the DCT processor. It computes the 2D DCT using floating-point calculations, and its results are rounded to 12 bits. It therefore serves as the reference model.

The process ERROR_CALC calculates the differences between the output data of DCT_AAN and DCT_BEH. The resulting signals are ERROR, which is the result of the signal subtraction, SERROR, which is the sum of squared errors for a single data array, and QUADMEAN, which is the resulting mean square error value for the current array.
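
The three quantities can be modelled in a few lines. The Python sketch below is a software illustration, not the testbench source. It assumes QUADMEAN is SERROR divided by the number of samples in the array; whether the testbench also takes a square root is not stated, so none is applied here.

```python
def error_metrics(dut, ref):
    """Return (ERROR list, SERROR, QUADMEAN) for one array of results.

    dut: outputs of the tested instance, ref: outputs of the reference model.
    """
    if len(dut) != len(ref) or not dut:
        raise ValueError("arrays must be non-empty and of equal length")
    errors = [d - r for d, r in zip(dut, ref)]   # ERROR, sample by sample
    serror = sum(e * e for e in errors)          # SERROR, sum of squares
    quadmean = serror / len(errors)              # QUADMEAN, mean square error
    return errors, serror, quadmean
```

For a full block both lists would hold the 64 coefficients of one 8x8 array in the same order.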