VLSI Implementation of Fast Convolution Based 2-D Discrete Wavelet Transform for High Speed, Area Efficient Image Computing

Chandny Ramachandran¹, T.R. Dinesh Kumar²
¹Department of ECE, Vel Tech Engineering College, Affiliated to Anna University, Chennai, India
²Department of ECE, Vel Tech Engineering College, Affiliated to Anna University, Chennai, India
𝑐.𝑟.𝑚𝑒𝑛𝑜𝑛06@gmail.com, 𝑑𝑖𝑛𝑒𝑠ℎ84@gmail.com

Abstract
A VLSI design approach of a high speed and real-time 2-D Discrete Wavelet Transform computing is being presented in the paper. The proposed architecture, based on new and fast convolution approach, reduces the hardware complexity in addition to reduce the critical path to the multiplier delay. Furthermore, an advanced two dimensional (2-D) discrete wavelet transform (DWT) implementation, with an efficient memory area, is designed to produce one output in every clock cycle. As a result, a very high speed is attained. The system is verified, using JPEG2000 coefficients filters, on Xilinx Virtex-II Field Programmable Gate Array (FPGA) device without accessing any external memory. The resulting computing rate is up to 270 M samples/s and the (9,7) 2-D wavelet filter uses only 18 kb of memory (16 kb of first-in-first-out memory) with 256×256 image size. In this way, the developed design requests reduced memory and provide very high-speed processing as well as high PSNR quality.

Keywords: Digital Image Processing, Discrete Wavelet Transform (DWT), FPGA, VLSI.

1. INTRODUCTION
The Discrete Wavelet Transform (DWT) has become one of the most used techniques for image compression and is applied in a large category of applications [1]. The application of DWT comes with digital photography, medical imaging, internet, satellite imaging, FBI fingerprint image compression. DWT can provide significant compression ratios without great loss of visual quality than the previous techniques such as the Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT). The DWT present the main part of the JPEG2000 standard, which permits both lossy and lossless compression of digital images. It allows an encoded image to be reconstructed progressively.

The compression phase is mainly divided into three sequential steps: (1) Discrete Wavelet Transform, (2) Quantization, and (3) Entropy Encoding. After preprocessing, each component is independently analyzed by an appropriate discrete wavelet transform.

Conventionally, programmable DSP chips are used to implement such algorithms for low-rate applications and the VLSI application specific integrated circuits (ASICs) for higher rates [2]. The FPGAs are programmable logic devices that provide sufficient quantities of logic resources that can be adapted to support a large parallel distributed architecture.

Lifting and convolution present the two computing approaches to achieve the discrete wavelet transform [3]. While conventional lifting-based architectures require fewer arithmetic operations compared to the convolution-based approach for DWT, they sometimes have long critical paths. If Tm and Tp are the delays of the adder and multiplier, respectively, then the critical path of the lifting-based architecture for the (9,7) filter is (4×Tm + 8×Tp), while that of the convolution implementation is (Tm + 2×Tp) [4]. In addition to this and for the reason to preserve proper precision, intermediate variables widths are larger in lifting based computing. As a result, the lifting multiplier and adder delays are longer than the convolution ones.

Many VLSI lifting-based DWT architectures have been developed and implemented to reduce the memory requirements and the critical path [5]-[14]. A JPEG2000 single-chip encoder with 81 M samples/s is presented in [8]. A high-speed and area-efficient system is presented in [11]. 7.5 Kb is used for the (5,3) wavelet filter and 30 Kb for the (9,7) one, with 256×256 image size. Recently, the new lifting-based architectures represent the faster option, and the main critical path is about Tm which represents the lifting multiplier delay. The present work introduces new reflection of VLSI design for very high-speed image computing using new convolution based DWT.

Firstly, new and fast convolution-based architecture is developed to reduce the hardware requirement complexity in addition to reduce the critical path to the multiplier delay (Tm). Thus, we reach a very high-speed computing compared to the lifting option.

Secondly, a novel and adapted two-dimensional (2-D) discrete wavelet transform system, with an efficient memory area (e.g. 16 Kb for the (9,7) wavelet filter with 256×256 image size), is designed to produce one output in every clock cycle.

2. DISCRETE WAVELET TRANSFORM
A. One-Dimensional Discrete Wavelet Transform
The basic DWT can be realized by convolution-based implementation using the FIR-filters to do the transform. The input discrete signal X(n) is filtered by a low-pass filter (h) and a high-pass filter (g) at each
transform level. The two output streams are then sub-sampled by simply dropping the alternate output samples in each stream to produce the low-pass sub band \( Y_L \) and high-pass sub band \( Y_H \). The equation is given as (1).

\[
y_L(n) = \sum_{i=0}^{N-1} h(2n-i) \cdot x(i) \cdot y_M(n) = \sum_{i=0}^{N-1} g(2n-i) \cdot x(i)
\]

(Figure 1 shows the signal analysis and reconstruction in one dimensional (1-D) Discrete Wavelet Transform.

**B. Two-Dimensional Discrete Wavelet Transform**

For two-dimensional (Image) analysis and reconstruction the multi-resolution approach for Discrete Wavelet decomposition of signals using a pyramidal filter structure proposed by Mallat can be adopted. Fig. 2 shows the two level multi-resolution wavelet decomposition of signals using pyramidal filter structure. DWT is being computed using a near to perfection reconstruction (PR) filter bank.

\[
y_L(n) = h_0(x_{2n} + 0) + h_1(x_{2n+1} + x_{2n}) + h_2(x_{2n+2} + x_{2n+1}) + h_3(x_{2n+3} + x_{2n+2}) + h_4(x_{2n+4} + x_{2n+3}) + h_5(x_{2n+5} + x_{2n+4}) + h_6(x_{2n+6} + x_{2n+5}) + h_7(x_{2n+7} + x_{2n+6})
\]

Similarly, the high-pass filter coefficients present symmetry as follows:

\[
g_{-2} = g_3, g_{-1} = g_2, g_0 = g_1, g_1 = g_0
\]

As a result (4) can be given.

\[
y_{H0} = g_1(0 + x_0) + g_0(0 + x_1) + g_2(0 + x_2) + g_3(0 + x_3) + g_4(0 + x_4) + g_5(0 + x_5) + g_6(0 + x_6) + g_7(0 + x_7)
\]

The architecture to compute the \( Y_L \) and \( Y_H \) is shown in Fig. 3. Additional registers are added, between multipliers and adders, to speed up the computing. The critical path is reduced to the multiplier delay (Tcm). Furthermore, the outputs \( Y_{L0} \) and \( Y_{H0} \) are obtained alternately at the trailing edges of the even and odd clock cycles. (e.g., \( Y_{L0}, Y_{L1}, Y_{L2}, \ldots \) are obtained at clock cycles 0, 1, 2, … and \( Y_{H0}, Y_{H1}, Y_{H2}, \ldots \) are obtained at clock cycles 1, 2, 3, … respectively).

The preliminary gate count estimates and number of components used in the proposed fast convolution-based architecture are given in Table I.

<table>
<thead>
<tr>
<th>Component</th>
<th>Gate count</th>
<th>Number of units</th>
</tr>
</thead>
<tbody>
<tr>
<td>Adder (3b+3b)</td>
<td>200</td>
<td>4</td>
</tr>
<tr>
<td>Adder (16b+16b)</td>
<td>400</td>
<td>4</td>
</tr>
<tr>
<td>Register (8 bit)</td>
<td>50</td>
<td>8</td>
</tr>
<tr>
<td>Register (9 bit)</td>
<td>56</td>
<td>5</td>
</tr>
<tr>
<td>Register (16 bit)</td>
<td>100</td>
<td>10</td>
</tr>
<tr>
<td>Multiplier (8b x 9b)</td>
<td>810</td>
<td>5</td>
</tr>
</tbody>
</table>
The total number of the used gate and the estimated work frequency for the (9,7) lifting architecture and the proposed one, using the field programmable gate array Xilinx Virtex-II, is provided in Table II.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>The total gate</th>
<th>Work frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>The lifting architecture [6]</td>
<td>13400</td>
<td>210 MHz</td>
</tr>
<tr>
<td>The proposed architecture</td>
<td>8130</td>
<td>270 MHz</td>
</tr>
</tbody>
</table>

*With 7 bits coding coefficients

4. THE PROPOSED FAST CONVOLUTION-BASED 2-D DWT ARCHITECTURE

In this approach to avoid the need of a transpose circuit between the two levels, the system starts column processing as soon as sufficient numbers of rows have been filtered. Fig. 4 presents the main 2-D DWT fast convolution based diagram.

The proposed FPGA implementation of the 2-D Discrete Wavelet Transform is designed with two fast convolution based blocks. The first one, which is similar to the 1-D block, realizes the row Discrete Wavelet Transform and uses D Latch devices for the X (n) storage. The second achieves the Column Discrete Wavelet Transform using FPGA block RAM storage of the computed rows.
Figure 5 The Select RAM based fast convolution computing

$Y_L$ and $Y_H$ present the first level outputs. The $Y_{L1}$, $Y_{LH}$, $Y_{HL}$ and $Y_{HH}$ present the second level and the 2-D DWT outputs. $Y_{L0}$, $Y_{L1}$, $Y_{L2}$... are obtained at clock cycles 9, 11, 13 ... and $Y_{H0}$, $Y_{H1}$, $Y_{H2}$ ... are obtained at clock cycles 8, 10, 12.... Respectively.

The first row of $Y_{LH}$ and $Y_{HH}$ can be obtained after the beginning of the third row storage of the first level outputs. After the beginning of the fifth storage of the first level outputs, we can obtain the second row of $Y_{LH}$ and $Y_{HH}$ and the first row of $Y_{LL}$ and $Y_{HL}$ (Fig. 5).

Nine FPGA block RAM in Dual-Port Mode are required to accomplish the second level of the Parallel Distributed 2-D DWT Architecture with the (9,7) wavelet filters. The Select RAM 1-D DWT fast convolution-based Modules is composed by blocks RAM, interface circuit and a computing circuit (Fig. 6). The computing circuit presents the main architecture of the D-Latch 1-D DWT fast convolution-based.

Figure 6 The Select RAM 1-D DWT fast convolution-based block diagram

At each step of computing, only one block RAM is selected in write mode:

1st step: Block RAM 1 and block RAM 2: read mode
Block RAM 3: write mode

2nd step: Block RAM 1 to block RAM 4: read mode
Block RAM 5: write mode

Advanced step: Block RAM 1 to block RAM 8: read mode
Block RAM 9: write mode

We have implemented the 2-D DWT Parallel Distributed Architecture described in the previous section using one of the XC2V1000 Xilinx Virtex-II FPGA devices. This device contains 1M system gates (5,120 slices), 40 Multiplier Blocks and can operate at a maximum clock speed of 450 MHz [19]. Each Xilinx Virtex-II FPGA block Select RAM cell is an 18 Kbit fully synchronous memory. The two ports have independent inputs and outputs and are independently clocked. The data widths of the two ports can be configured independently, providing built-in bus-width conversion. As a result, one output is produced in every clock cycle and this configuration is able to accomplish 2048×2048 block computing.

We conclude that the fast convolution-based scheme and the appropriate 2-D DWT architecture reduce the memory area and the critical path to the convolution multiplier delay, which is the smallest one.

The PSNR, for the selected images, after forward and inverse Discrete Wavelet Transform with 7 bits coding coefficients, are given in Table III.
Table 3
PSNR values after forward and inverse discrete wavelet transform

<table>
<thead>
<tr>
<th></th>
<th>Gold Hill</th>
<th>Lena</th>
<th>Barbara</th>
<th>Baboon</th>
<th>Peppers</th>
</tr>
</thead>
<tbody>
<tr>
<td>PSNR</td>
<td>91.89</td>
<td>91.36</td>
<td>91.03</td>
<td>90.93</td>
<td>89.80</td>
</tr>
</tbody>
</table>

5. SIMULATION RESULTS

The MATLAB simulation results of 2-D discrete wavelet transform on a 256*256 sized gray scale image of “Lena” is being illustrated in the figures below. Fig 8 shows the average and detail parts of the “Lena” image accurately after one dimensional filtering. Fig 9 shows the approximation image (average), the horizontal, vertical and diagonal details. Progressive transmission of image is one of the main advantages of discrete wavelet transform.

![Figure 7. Original Lena image](image1)

![Figure 8. 1-D DWT coefficients](image2)

![Figure 9. 2-D DWT coefficients](image3)

6. CONCLUSION

In this paper, we have proposed a parallel architecture for very high-speed computing Discrete Wavelet Transform using SRAM. To produce one output in every clock cycle in addition to reduce the critical path as well as an efficient memory area, new fast convolution-based architecture approach is performed. In this approach, the system starts the column processing as soon as sufficient numbers of rows have been filtered. Two fast convolution based blocks, for the two-dimensional (2-D) discrete wavelet transform (DWT), are used to accelerate the computing. The (3,5) 2-D wavelet filter uses only 8 kb of memory with 256x256 image size and the (9,7) one uses only 16 kb.

REFERENCES

