# Enhanced Memory Reliability Against Multiple Cell Upsets Using Decimal Matrix Code

Jing Guo, Liyi Xiao, Member, IEEE, Zhigang Mao, Member, IEEE, and Qiang Zhao

Abstract-Transient multiple cell upsets (MCUs) are becoming major issues in the reliability of memories exposed to radiation environment. To prevent MCUs from causing data corruption, more complex error correction codes (ECCs) are widely used to protect memory, but the main problem is that they would require higher delay overhead. Recently, matrix codes (MCs) based on Hamming codes have been proposed for memory protection. The main issue is that they are double error correction codes and the error correction capabilities are not improved in all cases. In this paper, novel decimal matrix code (DMC) based on divide-symbol is proposed to enhance memory reliability with lower delay overhead. The proposed DMC utilizes decimal algorithm to obtain the maximum error detection capability. Moreover, the encoder-reuse technique (ERT) is proposed to minimize the area overhead of extra circuits without disturbing the whole encoding and decoding processes. ERT uses DMC encoder itself to be part of the decoder. The proposed DMC is compared to well-known codes such as the existing Hamming, MCs, and punctured difference set (PDS) codes. The obtained results show that the mean time to failure (MTTF) of the proposed scheme is 452.9%, 154.6%, and 122.6% of Hamming, MC, and PDS, respectively. At the same time, the delay overhead of the proposed scheme is 73.1%, 69.0%, and 26.2% of Hamming, MC, and PDS, respectively. The only drawback to the proposed scheme is that it requires more redundant bits for memory protection.

*Index Terms*—Decimal algorithm, error correction codes (ECCs), mean time to failure (MTTF), memory, multiple cells upsets (MCUs).

## I. INTRODUCTION

As CMOS technology scales down to nanoscale and memories are combined with an increasing number of electronic systems, the soft error rate in memory cells is rapidly increasing, especially when memories operate in space environments due to ionizing effects of atmospheric neutron, alpha-particle, and cosmic rays [1].

Although single bit upset is a major concern about memory reliability, multiple cell upsets (MCUs) have become a serious reliability concern in some memory applications [2]. In order to make memory cells as fault-tolerant as possible, some error correction codes (ECCs) have been widely used to protect memories against soft errors for years [3]–[6]. For example, the Bose–Chaudhuri–Hocquenghem codes [7], Reed–Solomon codes [8], and punctured difference set (PDS) codes [9] have been used to deal with MCUs in memories. However, these codes require more area, power, and delay overheads because their encoding and decoding circuits are more complex.

Manuscript received July 12, 2012; revised November 4, 2012; accepted December 26, 2012. Date of publication March 26, 2013; date of current version December 20, 2013. This work was supported in part by the Harbin Science and Innovation Research Special Fund under Grant 2012RFXXG042.

The authors are with the Microelectronics Center, Harbin Institute of Technology, Harbin 150001, China (e-mail: guojing19861229@163.com; xiaoly@hit.edu.cn; mao@hit.edu.cn; zq496547199@126.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2013.2238565

The interleaving technique has been used to restrain MCUs [10]. It rearranges cells in the physical layout so that the bits of the same logical word are separated into different physical words. However, interleaving may not be practical in content-addressable memory (CAM) because of the tight coupling of hardware structures between the cells and the comparison circuits [11], [12].

Built-in current sensors (BICS) are proposed to assist with single-error correction and double-error detection codes to provide protection against MCUs [13], [14]. However, this technique can only correct two errors in a word.

More recently, in [15], 2-D matrix codes (MCs) are proposed to efficiently correct MCUs per word with a low decoding delay, in which one word is logically divided into multiple rows and multiple columns. The bits per row are protected by Hamming code, while parity code is added in each column. For the MC [15] based on Hamming, when two errors are detected by Hamming, the vertical syndrome bits are activated so that these two errors can be corrected. As a result, MC is capable of correcting only two errors in all cases. In [16], an approach that combines decimal algorithm with Hamming code has been conceived to be applied at software level. It uses addition of integer values to detect and correct soft errors. The results obtained have shown that this approach has a lower delay overhead than other codes.

In this paper, a novel decimal matrix code (DMC) based on divide-symbol is proposed to provide enhanced memory reliability. The proposed DMC utilizes decimal algorithm (decimal integer addition and decimal integer subtraction) to detect errors. The advantage of using decimal algorithm is that the error detection capability is maximized so that the reliability of memory is enhanced. In addition, the encoder-reuse technique (ERT) is proposed to minimize the area overhead of extra circuits (encoder and decoder) without disturbing the whole encoding and decoding processes, because ERT uses the DMC encoder itself as part of the decoder.

This paper is divided into the following sections. The proposed DMC is introduced and its encoder and decoder circuits are presented in Section II. This section also illustrates the limits of simple binary error detection and the advantage of decimal error detection with some examples. The reliability and overheads of the proposed code are analyzed in Section III. In Section IV, the implementation of decimal error detection together with BICS for error correction in CAM is provided. Finally, some conclusions of this paper are discussed in Section V.

Fig. 1. Proposed schematic of fault-tolerant memory protected with DMC.

## II. PROPOSED DMC

In this section, DMC is proposed to assure reliability in the presence of MCUs with reduced performance overheads, and a 32-bit word is encoded and decoded as an example based on the proposed techniques.

## A. Proposed Schematic of Fault-Tolerant Memory

The proposed schematic of fault-tolerant memory is depicted in Fig. 1. First, during the encoding (write) process, information bits D are fed to the DMC encoder, and then the horizontal redundant bits H and vertical redundant bits V are obtained from the DMC encoder. When the encoding process is completed, the obtained DMC codeword is stored in the memory. If MCUs occur in the memory, these errors can be corrected in the decoding (read) process. Due to the advantage of decimal algorithm, the proposed DMC has higher fault-tolerant capability with lower performance overheads. In the fault-tolerant memory, the ERT technique is proposed to reduce the area overhead of extra circuits and will be introduced in the following sections.

## B. Proposed DMC Encoder

In the proposed DMC, first, the divide-symbol and arrange-matrix ideas are performed, i.e., the *N*-bit word is divided into *k* symbols of *m* bits ( $N = k \times m$ ), and these symbols are arranged in a  $k_1 \times k_2$  2-D matrix ( $k = k_1 \times k_2$ , where the values of  $k_1$  and  $k_2$  represent the numbers of rows and columns in the logical matrix, respectively). Second, the horizontal redundant bits *H* are produced by performing decimal integer addition of selected symbols per row. Here, each symbol is regarded as a decimal integer. Third, the vertical redundant bits *V* are obtained by binary operation among the bits per column. It should be noted that both divide-symbol and arrange-matrix are implemented logically rather than physically. Therefore, the proposed DMC does not require changing the physical structure of the memory.

To explain the proposed DMC scheme, we take a 32-bit word as an example, as shown in Fig. 2. The cells from  $D_0$  to  $D_{31}$  are information bits. This 32-bit word has been divided into eight 4-bit symbols, and  $k_1 = 2$  and  $k_2 = 4$  have been chosen.  $H_0$  through  $H_{19}$  are horizontal check bits;  $V_0$  through  $V_{15}$  are vertical check bits. However, it should be mentioned that the maximum correction capability (i.e., the maximum size of MCUs that can be corrected) and the number of redundant bits differ when different values of k and m are chosen. Therefore, k and m should be carefully adjusted to maximize the correction capability and minimize the number of redundant bits. For example, in this case, when  $k = 2 \times 2$  and m = 8, only 1-bit error can be corrected and the number of redundant bits is 40. When  $k = 4 \times 4$  and m = 2, 3-bit errors can be corrected and the number of redundant bits is reduced to 32. However, when  $k = 2 \times 4$  and m = 4, the maximum correction capability is up to 5 bits and the number of redundant bits is 36. In this paper, in order to enhance the reliability of memory, the error correction capability is considered first, so  $k = 2 \times 4$  and m = 4 are utilized to construct DMC.

The horizontal redundant bits H can be obtained by decimal integer addition as follows:

$$H_4 H_3 H_2 H_1 H_0 = D_3 D_2 D_1 D_0 + D_{11} D_{10} D_9 D_8 \tag{1}$$

$$H_9 H_8 H_7 H_6 H_5 = D_7 D_6 D_5 D_4 + D_{15} D_{14} D_{13} D_{12} \tag{2}$$

and similarly for the horizontal redundant bits  $H_{14}H_{13}H_{12}H_{11}H_{10}$  and  $H_{19}H_{18}H_{17}H_{16}H_{15}$ , where "+" represents decimal integer addition.

For the vertical redundant bits V, we have

$$V_0 = D_0 \oplus D_{16} \tag{3}$$

$$V_1 = D_1 \oplus D_{17} \tag{4}$$

and similarly for the remaining vertical redundant bits.

The encoding can be performed by decimal and binary addition operations from (1) to (4). The encoder that computes the redundant bits using multibit adders and XOR gates is shown in Fig. 3. In this figure,  $H_{19} - H_0$  are horizontal redundant bits,  $V_{15} - V_0$  are vertical redundant bits, and the remaining bits  $U_{31} - U_0$  are the information bits which are directly copied from  $D_{31}$  to  $D_0$ . The enable signal En will be explained in the next section.
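The encoding rules in (1) to (4) can be illustrated with a minimal Python sketch for the 32-bit case ($k = 2 \times 4$, m = 4). It is not the hardware encoder of Fig. 3, only a behavioral model. The function names and the `H_PAIRS` table are hypothetical, and the symbol pairs for the second row, (4, 6) and (5, 7), are assumed by analogy with (1) and (2) because the text only says "similarly".

```python
M_BITS = 4      # bits per symbol (m)
SYMBOLS = 8     # number of symbols (k)

# Symbol pairs added to form each 5-bit horizontal group.
# (0, 2) and (1, 3) follow (1) and (2); (4, 6) and (5, 7) are ASSUMED by analogy.
H_PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7)]


def split_symbols(word: int) -> list[int]:
    """Split a 32-bit word into eight 4-bit symbols; symbol 0 holds D3..D0."""
    mask = (1 << M_BITS) - 1
    return [(word >> (M_BITS * i)) & mask for i in range(SYMBOLS)]


def encode(word: int) -> tuple[list[int], int]:
    """Return (H, V): four horizontal sums (5 bits each) and 16 vertical parity bits."""
    sym = split_symbols(word)
    h = [sym[a] + sym[b] for a, b in H_PAIRS]        # decimal integer addition
    v = (word & 0xFFFF) ^ ((word >> 16) & 0xFFFF)    # V_i = D_i xor D_(i+16)
    return h, v


h, v = encode(0x0000060C)   # symbol 0 = 1100, symbol 2 = 0110
print([format(x, "05b") for x in h], format(v, "016b"))
```

With symbol 0 = 1100 and symbol 2 = 0110, the first horizontal group evaluates to 10010, which agrees with the worked example in (14).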

## C. Proposed DMC Decoder

To obtain a corrected word, the decoding process is required. For example, first, the received redundant bits  $H_4H_3H_2H_1H'_0$  and  $V'_0 - V'_3$  are generated from the received information bits D'. Second, the horizontal syndrome bits  $\Delta H_4H_3H_2H_1H_0$  and the vertical syndrome bits  $S_3 - S_0$  can be calculated as follows:

$$\Delta H_4 H_3 H_2 H_1 H_0 = H_4 H_3 H_2 H_1 H_0' - H_4 H_3 H_2 H_1 H_0 \tag{5}$$

$$S_0 = V_0' \oplus V_0 \tag{6}$$

and similarly for the remaining vertical syndrome bits, where "-" represents decimal integer subtraction.

| Bits | Columns 15–12 | Columns 11–8 | Columns 7–4 | Columns 3–0 | Horizontal check bits |
|---|---|---|---|---|---|
| Information bits, row 0 | $D_{15}D_{14}D_{13}D_{12}$ | $D_{11}D_{10}D_9D_8$ (symbol 2) | $D_7D_6D_5D_4$ | $D_3D_2D_1D_0$ (symbol 0) | $H_9H_8H_7H_6H_5$, $H_4H_3H_2H_1H_0$ |
| Information bits, row 1 | $D_{31}D_{30}D_{29}D_{28}$ (symbol 7) | $D_{27}D_{26}D_{25}D_{24}$ | $D_{23}D_{22}D_{21}D_{20}$ (symbol 5) | $D_{19}D_{18}D_{17}D_{16}$ | $H_{19}H_{18}H_{17}H_{16}H_{15}$, $H_{14}H_{13}H_{12}H_{11}H_{10}$ |
| Vertical check bits | $V_{15}V_{14}V_{13}V_{12}$ | $V_{11}V_{10}V_9V_8$ | $V_7V_6V_5V_4$ | $V_3V_2V_1V_0$ | |

Fig. 2. 32-bit DMC logical organization ( $k = 2 \times 4$  and m = 4). Here, each symbol is regarded as a decimal integer.

Fig. 3. 32-bit DMC encoder structure using multibit adders and XOR gates.

| Extra circuit | En: read signal | En: write signal | Function |
|---|---|---|---|
| Encoder | 0 | 1 | Encoding |
| Encoder | 1 | 0 | Compute syndrome bits |

Fig. 4. 32-bit DMC decoder structure using ERT.

When  $\Delta H_4 H_3 H_2 H_1 H_0$  and  $S_3 - S_0$  are equal to zero, the stored codeword has original information bits in symbol 0 where no errors occur. When  $\Delta H_4 H_3 H_2 H_1 H_0$  and  $S_3 - S_0$  are nonzero, the induced errors (the number of errors is 4 in this case) are detected and located in symbol 0, and then these errors can be corrected by

$$D_{0_{\text{correct}}} = D_0 \oplus S_0. \tag{7}$$

Fig. 5. Limits of binary error detection in simple binary operations.

The proposed DMC decoder is depicted in Fig. 4. It is made up of the following submodules, each of which executes a specific task in the decoding process: syndrome calculator, error locator, and error corrector. It can be observed from this figure that the redundant bits must be recomputed from the received information bits D' and compared to the original set of redundant bits in order to obtain the syndrome bits  $\Delta H$  and S. Then the error locator uses  $\Delta H$  and S to detect and locate the erroneous bits. Finally, in the error corrector, these errors are corrected by inverting the values of the erroneous bits.
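The three decoding steps (syndrome calculation, error location, error correction) can be sketched in Python as a behavioral model of (5) to (7). This is a simplified illustration under stated assumptions, not the circuit of Fig. 4: the stored redundant bits H and V are assumed error-free, the upsets are assumed to be confined to one row of the logical matrix, and the second-row symbol pairs (4, 6) and (5, 7) are assumed by analogy with (1) and (2). All function names are hypothetical.

```python
M_BITS = 4
# (0, 2) and (1, 3) follow (1) and (2); (4, 6) and (5, 7) are ASSUMED by analogy.
H_PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7)]


def encode(word: int) -> tuple[list[int], int]:
    """Horizontal sums H (decimal addition) and 16-bit vertical parity V."""
    sym = [(word >> (M_BITS * i)) & 0xF for i in range(8)]
    h = [sym[a] + sym[b] for a, b in H_PAIRS]
    v = (word & 0xFFFF) ^ ((word >> 16) & 0xFFFF)
    return h, v


def decode(word_rx: int, h_stored: list[int], v_stored: int) -> int:
    """Return the corrected word, assuming H and V themselves are error-free."""
    # Syndrome calculator: the encoder is reused to recompute H' and V'.
    h_rx, v_rx = encode(word_rx)
    delta_h = [hr - hs for hr, hs in zip(h_rx, h_stored)]    # (5)
    s = v_rx ^ v_stored                                      # (6), all 16 bits
    corrected = word_rx
    for (a, b), dh in zip(H_PAIRS, delta_h):
        if dh == 0:
            continue                      # error locator: this pair looks clean
        for symbol in (a, b):
            column = (symbol % 4) * M_BITS        # column offset shared by both rows
            s_sym = (s >> column) & 0xF
            corrected ^= s_sym << (symbol * M_BITS)   # error corrector, (7)
    return corrected


h, v = encode(0x0000060C)               # symbol 0 = 1100, symbol 2 = 0110
print(hex(decode(0x0000070F, h, v)))    # symbols upset to 1111 and 0111
```

The example call restores 0x60c from the corrupted word 0x70f. Upsets that hit the same columns in both rows, or that satisfy the $\Delta H = 0$ condition discussed in Section II-E, are outside what this sketch handles.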

In the proposed scheme, the circuit area of DMC is minimized by reusing its encoder. This is called the ERT. The ERT can reduce the area overhead of DMC without disturbing the whole encoding and decoding processes. From Fig. 4, it can be observed that the DMC encoder is also reused for obtaining the syndrome bits in DMC decoder. Therefore, the whole circuit area of DMC can be minimized as a result of using the existent circuits of encoder. Besides, this figure also shows the proposed decoder with an enable signal En for deciding whether the encoder needs to be a part of the decoder. In other words, the *En* signal is used for distinguishing the encoder from the decoder, and it is under the control of the write and read signals in memory. Therefore, in the encoding (write) process, the DMC encoder is only an encoder to execute the encoding operations. However, in the decoding (read) process, this encoder is employed for computing the syndrome bits in the decoder. These clearly show how the area overhead of extra circuits can be substantially reduced.

## D. Limits of Simple Binary Error Detection

Although the binary error detection technique proposed in [13] requires few redundant bits, its error detection capability is limited. The main reason is that its error detection mechanism is based on binary operations.

Fig. 6. Advantage of decimal error detection. Using decimal algorithm,  $\Delta H_4H_3H_2H_1H_0$  will not be "0" (decimal). This represents that MCUs can be detected and corrected so that the decoding error can be avoided.

Fig. 7. Types of MCUs that can be corrected by our proposed DMC. Type 1 is a single error, type 2 is an inconsecutive error in two consecutive symbols, type 3 is a consecutive error in two consecutive symbols, and type 5 is a consecutive error in four consecutive symbols.

We illustrate the limits of this simple binary error detection [13] using a simple example. Let us suppose that the bits  $B_3$ ,  $B_2$ ,  $B_1$ , and  $B_0$  are original information bits and the bits  $C_0$  and  $C_1$  are redundant bits, as shown in Fig. 5. The bits  $C_0$  and  $C_1$  are obtained using the binary algorithm (XOR)

$$C_0 = B_0 \oplus B_2 = 1 \oplus 0 = 1 \tag{8}$$

$$C_1 = B_1 \oplus B_3 = 0 \oplus 1 = 1. \tag{9}$$

Then assume now that MCUs occur in bits  $B_3$ ,  $B_2$ , and  $B_0$ (i.e.,  $B'_3 = 0$ ,  $B'_2 = 1$ , and  $B'_0 = 0$ ). The received redundant bits  $C'_0$  and  $C'_1$  are computed

$$C_0' = B_0' \oplus B_2' = 0 \oplus 1 = 1 \tag{10}$$

$$C_1' = B_1' \oplus B_3' = 0 \oplus 0 = 0. \tag{11}$$

In order to detect these errors, the syndrome bits  $S_0$  and  $S_1$  are obtained

$$S_0 = C_0' \oplus C_0 = 1 \oplus 1 = 0 \tag{12}$$

$$S_1 = C_1' \oplus C_1 = 0 \oplus 1 = 1. \tag{13}$$

These results mean that the error bits  $B_2$  and  $B_0$  are wrongly regarded as the original bits, so these two error bits are not corrected. This example illustrates that this simple binary operation [13] cannot detect an even number of bit errors within the same parity group.
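The example in (8) to (13) can be replayed with a few lines of Python. The bit values are taken from the equations above; the helper name `parity_bits` is hypothetical.

```python
def parity_bits(b: list[int]) -> tuple[int, int]:
    """b = [B0, B1, B2, B3]; returns (C0, C1) as defined in (8) and (9)."""
    return b[0] ^ b[2], b[1] ^ b[3]


original = [1, 0, 0, 1]     # B0 = 1, B1 = 0, B2 = 0, B3 = 1
received = [0, 0, 1, 0]     # B0, B2, and B3 are upset

c0, c1 = parity_bits(original)
c0_rx, c1_rx = parity_bits(received)
s0, s1 = c0_rx ^ c0, c1_rx ^ c1     # syndrome bits, (12) and (13)

print(s0, s1)   # 0 1
```

The syndrome bit for the group {B0, B2} is 0 even though both of its bits are wrong, because the two flips cancel in the XOR.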

## E. Advantage of Decimal Error Detection

In the previous discussion, it has been shown that error detection [13] based on binary algorithm can only detect a finite number of errors. However, when the decimal algorithm is used to detect errors, these errors can be detected so that the decoding error can be avoided. The reason is that the operation mechanism of decimal algorithm is different from that of binary. The detection procedure of decimal error detection using the proposed structure shown in Fig. 2 is fully described in Fig. 6. First of all, the horizontal redundant bits  $H_4H_3H_2H_1H_0$  are obtained from the original information bits in symbols 0 and 2 according to (1)

$$\begin{aligned} H_4H_3H_2H_1H_0 &= D_3D_2D_1D_0 + D_{11}D_{10}D_9D_8 \\ &= 1100 + 0110 \\ &= 10010. \end{aligned} \tag{14}$$

Suppose that MCUs occur in symbol 0 and symbol 2, i.e., the bits in symbol 0 are upset from "1100" to "1111"  $(D_3D_2D_1D_0' = 1111)$  and the bits in symbol 2 are upset from "0110" to "0111"  $(D_{11}D_{10}D_9D_8' = 0111)$ . During the decoding process, the received horizontal redundant bits  $H_4H_3H_2H_1H_0'$  are first computed as follows:

$$\begin{aligned} H_4H_3H_2H_1H_0' &= D_{11}D_{10}D_9D_8' + D_3D_2D_1D_0' \\ &= 0111 + 1111 \\ &= 10110. \end{aligned} \tag{15}$$

Then, the horizontal syndrome bits  $\Delta H_4 H_3 H_2 H_1 H_0$  can be obtained using decimal integer subtraction

$$\begin{aligned} \Delta H_4 H_3 H_2 H_1 H_0 &= H_4 H_3 H_2 H_1 H_0' - H_4 H_3 H_2 H_1 H_0 \\ &= 10110 - 10010 \\ &= 00100. \end{aligned} \tag{16}$$

The decimal value of  $\Delta H_4 H_3 H_2 H_1 H_0$  is not "0," which represents that errors are detected and located in symbol 0 or symbol 2. Subsequently, the precise location of the bits that were flipped can be located by using the vertical syndrome bits  $S_3 - S_0$  and  $S_{11} - S_8$ . Finally, all these errors can be corrected by (7). Therefore, based on decimal algorithm, the proposed technique has higher tolerance capability for protecting memory against MCUs.
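The arithmetic of (14) to (16), followed by the correction step of (7), can be checked with a short Python sketch. The values of symbols 0 and 2 come from the example above. The second-row symbols `sym4` and `sym6`, which share the vertical check bits with symbols 0 and 2, are not given in the text, so arbitrary error-free values are assumed for illustration.

```python
sym0, sym2 = 0b1100, 0b0110          # original symbols 0 and 2
sym0_rx, sym2_rx = 0b1111, 0b0111    # received symbols after the MCUs
sym4, sym6 = 0b1010, 0b0101          # ASSUMED second-row symbols, error-free

h = sym0 + sym2                      # (14): 10010
h_rx = sym0_rx + sym2_rx             # (15): 10110
delta_h = h_rx - h                   # (16): 00100, nonzero so errors are detected

# Vertical syndromes S3..S0 and S11..S8 from stored and recomputed parity bits.
s_low = (sym0_rx ^ sym4) ^ (sym0 ^ sym4)
s_high = (sym2_rx ^ sym6) ^ (sym2 ^ sym6)

# Correction by (7): invert the bits flagged by the vertical syndromes.
sym0_fixed = sym0_rx ^ s_low
sym2_fixed = sym2_rx ^ s_high

print(format(delta_h, "05b"), format(sym0_fixed, "04b"), format(sym2_fixed, "04b"))
# 00100 1100 0110
```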

Fig. 8. Error type that cannot be corrected by our proposed DMC. The main reason is that  $\Delta H_4H_3H_2H_1H_0$  will be "0" (decimal). Note that even though 7-bit errors occur in symbols 0 and 2 simultaneously, the decoding error can be avoided.

As a result, it is possible that all single and double errors and any types of multiple errors per row can be corrected by the proposed technique, whether these errors are consecutive or inconsecutive (see Fig. 7). The proposed DMC can easily correct upsets of types 1, 2, and 3, because this is an essential property of DMC: correction of any type of single and multiple errors in two consecutive symbols. Upsets of types 4 and 5 introduced in Fig. 7 are also corrected because the multiple errors per row can be detected by the horizontal syndrome bits (see Fig. 6). This shows that the proposed technique is an attractive option to protect memories from large MCUs. However, for upsets of types 4 and 5, it is important to recognize that a decoding error can result when the following two conditions are met simultaneously (this error is typical of its kind).

- 1) The decimal integer sum of information bits in symbols 0 and 2 is equal to  $2^m - 1$ .
- 2) All the bits in symbols 0 and 2 are upset.

A more detailed explanation is shown in Fig. 8. Assuming that these two conditions are met, according to the encoding and decoding processes of DMC,  $H_4H_3H_2H_1H_0$  and  $H_4H_3H_2H_1H'_0$  are computed as follows:

$$\begin{aligned} H_4H_3H_2H_1H_0 &= D_3D_2D_1D_0 + D_{11}D_{10}D_9D_8 \\ &= 0110 + 1001 \\ &= 01111 \end{aligned} \tag{17}$$

$$\begin{aligned} H_4H_3H_2H_1H_0' &= D_3D_2D_1D_0' + D_{11}D_{10}D_9D_8' \\ &= 1001 + 0110 \\ &= 01111. \end{aligned} \tag{18}$$

Then the horizontal syndrome bits  $\Delta H_4 H_3 H_2 H_1 H_0$  can be obtained

$$\begin{aligned} \Delta H_4 H_3 H_2 H_1 H_0 &= H_4 H_3 H_2 H_1 H_0' - H_4 H_3 H_2 H_1 H_0 \\ &= 01111 - 01111 \\ &= 00000. \end{aligned} \tag{19}$$

This result wrongly indicates that no errors have occurred in symbols 0 and 2, so the memory will suffer a failure. However, this case is rare. For example, when m = 4, the probability of decoding errors is

$$P_{\Delta H=0} = 4 \times \left(\frac{1}{2^4}\right)^2 \times P_{\text{MCU8}} \approx 0.001. \tag{20}$$

If m = 8

$$P_{\Delta H=0} = 4 \times \left(\frac{1}{2^8}\right)^2 \times P_{\text{MCU16}} \approx 0.0000011. \tag{21}$$



Fig. 9. J(S)s versus time of different protection codes (M = 32).

Here,  $P_{\text{MCU8}}$  represents the probability of eight upsets in a given word, and similarly for  $P_{\text{MCU16}}$ . Moreover, the radiation experiments in [1], [2], [17], and [18] indicate that a word in a memory usually has a limited number of consecutive errors and that the interval between these errors is not more than three bits. Therefore, this should not be an issue.
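The two conditions listed before (17) can be checked with a short derivation. Let $a$ and $b$ denote the decimal values of the two symbols that are added (these names are introduced here only for illustration). Upsetting every bit of an m-bit symbol replaces its value $a$ by $a' = (2^m - 1) - a$, and likewise for $b$. The horizontal syndrome is then

$$\begin{aligned} \Delta H &= (a' + b') - (a + b) \\ &= \big[(2^m - 1 - a) + (2^m - 1 - b)\big] - (a + b) \\ &= 2\big[(2^m - 1) - (a + b)\big]. \end{aligned}$$

Under condition 2, the syndrome is therefore zero exactly when $a + b = 2^m - 1$, which is condition 1. For m = 4, $a = 6$ and $b = 9$ give $a' = 9$ and $b' = 6$, the case shown in (17) to (19).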

## III. RELIABILITY AND OVERHEADS ANALYSIS

In this section, the proposed DMC has been implemented in HDL, simulated with ModelSim, and tested for functionality with various inputs. The encoder and decoder have been synthesized by the Synopsys Design Compiler in the SMIC 0.18  $\mu$m technology. The area, power, and critical path delay of extra circuits have been obtained. For fair comparisons, Hamming, PDS [9], and MC [15] are used as references. Here, the (64, 45) PDS code used is a triple-error correction code [9], and its information bits are shortened from 45 bits to 32 bits.

## A. Fault Injection

The correction coverage of the PDS [9], MC [15], Hamming, and proposed DMC codes is obtained from one million experiments. The results are shown in Table I. It can be seen that our proposed DMC has a superior protection level compared with other codes. These results show that our proposed technique not only provides single- and double-error correction but also provides effective tolerance against large MCUs that exceeds the performance of other codes.

TABLE I CORRECTION COVERAGE (32-bit)

| ECC Codes (by number of errors in a word) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DMC (%) | 100 | 100 | 100 | 100 | 100 | 92.6 | 84.7 | 76.0 | 66.7 | 60.9 | 54.5 | 47.7 | 40.0 | 31.6 | 22.3 | 11.8 |
| PDS [9] (%) | 100 | 100 | 100 | 0.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MC [15] (%) | 100 | 100 | 76.4 | 54.3 | 35.1 | 14.2 | 6.7 | 0.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hamming (%) | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

## B. Reliability Estimation

The reliability of our proposed code can be analyzed in terms of the mean time to failure (MTTF). It is assumed that MCUs arrive at memories following a Poisson distribution [19]. For one word, the correctable probability R(S) after S radiation events can be given by [14], [15], [20]

$$R(S) = \sum_{i+j+\dots+z \le T} P_i^1 P_j^2 \cdots P_z^S \tag{22}$$

where T is the maximum number of errors and  $P_z^S$  is the correctable probability upon the reception of radiation event S which causes z errors.

For a memory with M words, the correctable probability J(S) after S radiation events can be given by

$$J(S) = \sum_{a+b+\dots+e=S} \frac{C_M^x}{M^x} R_a^1 R_b^2 \cdots R_e^x \tag{23}$$

where  $x \ (x \le S)$  is the number of words affected by radiation events,  $C_M^x$  is the selection of x from M words in memory, and  $R_e^x$  represents the correctable probability when e radiation events affect x words.

If we assume that the number of words M is 32 and the correctable probability  $P_z^S$  is obtained from Table I, the correctable probabilities J(S) of the different protection codes are as shown in Fig. 9. It can be seen that the correctable probability J(S) of the proposed scheme is larger than that of the other codes.

Then the MTTF can be given by (24), which is the integral of the function in (23)

$$MTTF = \int_0^\infty J(t)dt. \tag{24}$$

Table II shows the MTTFs of different codes for different event arrival rates  $\lambda$ . From this table, we can see that the MTTF of the proposed scheme is about 122.6%, 154.6%, and 452.9% of that of PDS [9], MC [15], and Hamming, respectively.
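The percentages quoted for the MTTF can be reproduced from the Table II entries. The following sketch uses the row for $\lambda = 10^{-4}$ only; the dictionary and variable names are hypothetical.

```python
# MTTF values copied from Table II for lambda = 1e-4 upsets/bit per day.
mttf = {"DMC": 1121.9, "PDS": 915.0, "MC": 725.6, "Hamming": 247.7}

# MTTF of DMC expressed as a percentage of each reference code.
ratios = {
    name: round(100 * mttf["DMC"] / value, 1)
    for name, value in mttf.items()
    if name != "DMC"
}
print(ratios)   # {'PDS': 122.6, 'MC': 154.6, 'Hamming': 452.9}
```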

In general cases, for the proposed technique, it can be inferred that the larger the word widths, the higher the tolerance capabilities and the better the reliabilities. For example, for a 64-bit word, when  $k = 2 \times 4$  and m = 8 the correction capability of the proposed technique is up to 9 bits.

TABLE II MTTF (M = 32)

| $\lambda$ (Upsets/bit per Day) | DMC     | PDS [9] | MC [15] | Hamming |
|--------------------------------|---------|---------|---------|---------|
| $10^{-4}$                      | 1121.9  | 915.0   | 725.6   | 247.7   |
| $10^{-5}$                      | 11218.8 | 9150.3  | 7256.5  | 2477.4  |

TABLE III AREA, POWER, AND DELAY ANALYSIS OF ENCODER AND DECODER

| ECC Codes | Area ($\mu m^2$) | Area (%) | Power (mW) | Power (%) | Delay (ns) | Delay (%) |
|-----------|-----------|--------|-------|--------|-------|-------|
| DMC       | 41572.6   | 100    | 10.8  | 100    | 4.9   | 100   |
| PDS* [9]  | 486778.1  | 1170.9 | 221.1 | 2047.2 | 18.7  | 381.6 |
| MC [15]   | 77933.7   | 187.5  | 24.7  | 228.7  | 7.1   | 144.9 |
| Hamming   | 58409.4   | 140.5  | 20.5  | 189.8  | 6.7   | 136.7 |

\*Using parallel decoder instead of serial decoder for fair comparisons

For a 128-bit word, when  $k = 2 \times 4$  and m = 16, the correction capability of the proposed technique is up to 17 bits. However, the correction capabilities of PDS, MC, and Hamming are smaller than that of DMC for the same word widths.

## C. Overheads Analysis

For each protection code, the area, power, and delay overheads of the encoder and decoder are shown in Table III. From Table III, we can observe that the proposed DMC has significantly lower overheads than the other codes. The area and power overheads of PDS are 1170.9% and 2047.2% of those of the proposed scheme, respectively. The delay overhead of DMC is 26.2%, 69.0%, and 73.1% of that of PDS [9], MC [15], and Hamming, respectively. This indicates that a memory with the proposed scheme performs faster than one protected by the other codes. Different decoding algorithms could result in different overheads. The decoding algorithm of PDS [9] is more complex than that of other codes; thus, it has the maximum area, power, and delay overheads. However, the decoding algorithm of the proposed DMC is quite simple, so its overheads are minimal.

The issue is that the proposed technique requires more redundant bits compared with other codes. The redundant bits of these protection codes are shown in Table IV, where a coding efficiency  $\beta$  is used to evaluate the area overhead of memory cell [20]

$$\beta = \frac{\text{Redundant bits}}{\text{Redundant bits} + \text{Information bits}}. \tag{25}$$

TABLE IV REDUNDANT BITS (32-bit)

| ECC     | Information<br>Bits | Redundant<br>Bits | β     | Note                       |
|---------|---------------------|-------------------|-------|----------------------------|
| DMC     | 32                  | 36                | 52.9% | $k = 2 \times 4, m = 4$    |
| DMC     | 32                  | 32                | 50.0% | $k = 4 \times 4, m = 2$    |
| PDS [9] | 32                  | 19                | 37.3% | Shortening and puncturing  |
| MC [15] | 32                  | 28                | 46.7% | Correction capability is 2 |
| Hamming | 32                  | 7                 | 17.9% | Correction capability is 1 |

Fig. 10. CCCs of different protection codes.

If the value of  $\beta$  is small, the code needs lower memory cell overheads. From this table, we can see that Hamming code has the least  $\beta$  value but its correction capability is a constant (1). For the MC [15], its correction capability is also a constant (2) due to the limits of error correction capability of Hamming code. PDS [9] has a lower  $\beta$  value compared with the proposed scheme, but it requires higher delay overhead, which would severely affect the access time of memory. The scaling down of CMOS technology has resulted in a dramatic increase in the number of MCUs. In 90-nm technology, more than three errors have been observed in radiation tests [1], [2], [17], [18], so Hamming, MC, and PDS are not good choices for protecting memory. The proposed scheme needs a higher  $\beta$  value than other codes but it has higher correction capability. Therefore, designers should choose the optimal combination of k and m based on the radiation test to provide a good tradeoff between reliability and redundant bits.
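As a quick check of (25), the $\beta$ column of Table IV can be recomputed from the redundant-bit counts. The function and dictionary names below are hypothetical.

```python
def coding_efficiency(redundant: int, information: int = 32) -> float:
    """Coding efficiency beta from (25): redundant / (redundant + information)."""
    return redundant / (redundant + information)


# Redundant-bit counts copied from Table IV (32 information bits each).
redundant_bits = {
    "DMC (k = 2 x 4, m = 4)": 36,
    "DMC (k = 4 x 4, m = 2)": 32,
    "PDS": 19,
    "MC": 28,
    "Hamming": 7,
}

for name, bits in redundant_bits.items():
    print(f"{name}: {100 * coding_efficiency(bits):.1f}%")
```

The printed values are 52.9%, 50.0%, 37.3%, 46.7%, and 17.9%, matching the table.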

We have also used a metric to assess the efficiency of our proposed scheme compared to PDS, MC, and Hamming, which is called correction coverage per cost (CCC) and can be calculated as follows [15]:

$$\text{CCC} = \frac{\text{Correction Coverage}}{\text{Cost}} \tag{26}$$

where Cost is obtained by

$$\text{Cost} = \text{Area} \cdot \text{Power} \cdot \text{Delay} \cdot \text{Redundant bits}. \tag{27}$$

Fig. 11. (a) Proposed fault-tolerant CAM using decimal error detection technique together with BICS. Note that when errors do not exist in CAM, the stored codeword is directly output without passing through the error detection and correction circuits. (b) 32-bit word organization in CAM ( $k = 1 \times 4$  and m = 8).

Fig. 10 shows the values of this metric for different protection codes. From this figure, we can see that the proposed scheme has a higher CCC value than the other codes except when only one upset occurs. Therefore, based on the results in Table III and Fig. 10, the proposed scheme is quite suitable for high-speed memory applications. It should be mentioned that when the number of errors is more than two per word, Hamming and MC [15] codes cannot correct any errors.

## IV. DECIMAL ERROR DETECTION IN CAM

ECC is a very powerful technique for correcting MCUs in memory, as mentioned before. However, ECC implementation in CAM differs significantly from its implementation in SRAM because all the words in CAM are accessed simultaneously, so ECC is not suitable for protecting CAM directly [12]. In [14], BICS together with Hamming code is used to protect SRAM. Because BICS has zero fault-detection latency for multiple error detection, it is suitable for detecting errors in CAM as well.

The ability of decimal error detection to detect any type of error can be useful in CAM. Consider combining the decimal error detection technique with BICS to protect CAM. The fault-tolerant CAM structure is depicted in Fig. 11(a), and a 32-bit word organization in CAM is proposed in Fig. 11(b). A BICS is added to each column of CAM to detect errors (the basic principle and circuit of BICS are shown in [13] and [14]). When MCUs occur in a word of CAM, a momentary current pulse is generated between power supply and ground for each error column. BICS detects this current pulse and generates an error signal Es, i.e., the Es signal detects and locates the columns in which the errors occur. At the same time, the syndrome calculation is activated to detect the error row, i.e., (5) is performed row by row. The error corrector then corrects these errors. Finally, the corrected word is written back to CAM. Because the proposed decimal error detection technique can detect any number of errors in a word, CAM has an adequate level of immunity to MCUs in a word. For example, when 32-bit errors occur in a word of CAM, the syndrome bits $\Delta H$ can detect these errors ($\Delta H$ can detect errors but cannot locate the precise upset locations, which is sufficient here because BICS locates the columns) and activate the syndrome calculation so that all errors can be corrected with minimal time overhead.
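The row and column localization described above can be sketched in software. This is a minimal behavioral model under stated assumptions, not the circuit of Fig. 11: the horizontal check is modeled as the integer sum of the four 8-bit symbols of a word (an illustrative stand-in for the decimal check), the BICS output is modeled as a list of flagged column indices, and all upsets are assumed to fall in a single word. The names `symbols`, `decimal_check`, and `correct_cam` are hypothetical.

```python
def symbols(word, m=8, count=4):
    """Split an integer word into `count` symbols of `m` bits, least significant first."""
    mask = (1 << m) - 1
    return [(word >> (i * m)) & mask for i in range(count)]


def decimal_check(word):
    """Assumed horizontal check: integer (decimal) sum of the symbols of the word."""
    return sum(symbols(word))


def correct_cam(words, stored_checks, bics_columns):
    """Flip the BICS-flagged columns in every word whose decimal syndrome is nonzero."""
    flip_mask = 0
    for col in bics_columns:
        flip_mask |= 1 << col
    corrected = []
    for word, check in zip(words, stored_checks):
        delta_h = decimal_check(word) - check  # nonzero marks an error row
        corrected.append(word ^ flip_mask if delta_h != 0 else word)
    return corrected


# Three stored 32-bit words and their checks, computed before any upset.
original = [0x12345678, 0xDEADBEEF, 0x0F0F0F0F]
checks = [decimal_check(w) for w in original]

# An MCU flips columns 3, 4, 5, and 12 of the second word.
hit_columns = [3, 4, 5, 12]
upset = list(original)
for col in hit_columns:
    upset[1] ^= 1 << col

restored = correct_cam(upset, checks, hit_columns)
print(restored == original)  # True
```

The column flags alone do not say which word was hit, and the syndrome alone does not say which bits flipped; the correction works only because the two pieces of information are combined.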

## V. CONCLUSION

In this paper, a novel per-word DMC was proposed to assure the reliability of memory. The proposed protection code utilizes the decimal algorithm to detect errors, so that more errors are detected and corrected. The obtained results show that the proposed scheme has a superior protection level against large MCUs in memory. Besides, the proposed decimal error detection technique is an attractive option for detecting MCUs in CAM because it can be combined with BICS to provide an adequate level of immunity.

The only drawback of the proposed DMC is that more redundant bits are required to maintain higher memory reliability, so a reasonable combination of k and m should be chosen, based on radiation experiments, to maximize memory reliability and minimize the number of redundant bits in an actual implementation. Therefore, future work will focus on reducing the redundant bits while maintaining the reliability of the proposed technique.

## References

- [1] D. Radaelli, H. Puchner, S. Wong, and S. Daniel, "Investigation of multi-bit upsets in a 150 nm technology SRAM device," *IEEE Trans. Nucl. Sci.*, vol. 52, no. 6, pp. 2433–2437, Dec. 2005.
- [2] E. Ibe, H. Taniguchi, Y. Yahagi, K. Shimbo, and T. Toba, "Impact of scaling on neutron induced soft error in SRAMs from a 250 nm to a 22 nm design rule," *IEEE Trans. Electron Devices*, vol. 57, no. 7, pp. 1527–1538, Jul. 2010.
- [3] C. Argyrides and D. K. Pradhan, "Improved decoding algorithm for high reliable Reed Muller coding," in *Proc. IEEE Int. Syst. On Chip Conf.*, Sep. 2007, pp. 95–98.
- [4] A. Sanchez-Macian, P. Reviriego, and J. A. Maestro, "Hamming SEC-DAED and extended Hamming SEC-DED-TAED codes through selective shortening and bit placement," *IEEE Trans. Device Mater. Rel.*, to be published.
- [5] S. Liu, P. Reviriego, and J. A. Maestro, "Efficient majority logic fault detection with difference-set codes for memory applications," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 1, pp. 148–156, Jan. 2012.
- [6] M. Zhu, L. Y. Xiao, L. L. Song, Y. J. Zhang, and H. W. Luo, "New mix codes for multiple bit upsets mitigation in fault-secure memories," *Microelectron. J.*, vol. 42, no. 3, pp. 553–561, Mar. 2011.
- [7] R. Naseer and J. Draper, "Parallel double error correcting code design to mitigate multi-bit upsets in SRAMs," in *Proc. 34th Eur. Solid-State Circuits*, Sep. 2008, pp. 222–225.
- [8] G. Neuberger, D. L. Kastensmidt, and R. Reis, "An automatic technique for optimizing Reed-Solomon codes to improve fault tolerance in memories," *IEEE Design Test Comput.*, vol. 22, no. 1, pp. 50–58, Jan.–Feb. 2005.

- [9] P. Reviriego, M. Flanagan, and J. A. Maestro, "A (64,45) triple error correction code for memory applications," *IEEE Trans. Device Mater. Rel.*, vol. 12, no. 1, pp. 101–106, Mar. 2012.
- [10] S. Baeg, S. Wen, and R. Wong, "Interleaving distance selection with a soft error failure model," *IEEE Trans. Nucl. Sci.*, vol. 56, no. 4, pp. 2111–2118, Aug. 2009.
- [11] K. Pagiamtzis and A. Sheikholeslami, "Content addressable memory (CAM) circuits and architectures: A tutorial and survey," *IEEE J. Solid-State Circuits*, vol. 41, no. 3, pp. 712–727, Mar. 2006.
- [12] S. Baeg, S. Wen, and R. Wong, "Minimizing soft errors in TCAM devices: A probabilistic approach to determining scrubbing intervals," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 4, pp. 814–822, Apr. 2010.
- [13] P. Reviriego and J. A. Maestro, "Efficient error detection codes for multiple-bit upset correction in SRAMs with BICS," *ACM Trans. Design Autom. Electron. Syst.*, vol. 14, no. 1, pp. 18:1–18:10, Jan. 2009.
- [14] C. Argyrides, R. Chipana, F. Vargas, and D. K. Pradhan, "Reliability analysis of H-tree random access memories implemented with built in current sensors and parity codes for multiple bit upset correction," *IEEE Trans. Rel.*, vol. 60, no. 3, pp. 528–537, Sep. 2011.
- [15] C. Argyrides, D. K. Pradhan, and T. Kocak, "Matrix codes for reliable and cost efficient memory chips," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 19, no. 3, pp. 420–428, Mar. 2011.
- [16] C. A. Argyrides, C. A. Lisboa, D. K. Pradhan, and L. Carro, "Single element correction in sorting algorithms with minimum delay overhead," in *Proc. IEEE Latin Amer. Test Workshop*, Mar. 2009, pp. 652–657.
- [17] Y. Yahagi, H. Yamaguchi, E. Ibe, H. Kameyama, M. Sato, T. Akioka, and S. Yamamoto, "A novel feature of neutron-induced multi-cell upsets in 130 and 180 nm SRAMs," *IEEE Trans. Nucl. Sci.*, vol. 54, no. 4, pp. 1030–1036, Aug. 2007.
- [18] N. N. Mahatme, B. L. Bhuva, Y. P. Fang, and A. S. Oates, "Impact of strained-Si PMOS transistors on SRAM soft error rates," *IEEE Trans. Nucl. Sci.*, vol. 59, no. 4, pp. 845–850, Aug. 2012.
- [19] C. A. Argyrides, P. Reviriego, D. K. Pradhan, and J. A. Maestro, "Matrix-based codes for adjacent error correction," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 4, pp. 2106–2111, Aug. 2010.
- [20] F. Alzahrani and T. Chen, "On-chip TEC-QED ECC for ultra-large, single-chip memory systems," in *Proc. IEEE Int. Conf. Comput. Design Design, Very-Large-Scale Integr. (VLSI) Syst. Comput. Process.*, Oct. 1994, pp. 132–137.



**Jing Guo** received the B.S. and M.S. degrees in electronic science and technology from Heilongjiang University, Harbin, China, in 2008 and 2011, respectively. He is currently pursuing the Ph.D. degree in microelectronics and solid-state electronics with the Harbin Institute of Technology, Harbin.

His current research interests include fault tolerance in VLSI designs and reliability in memories.



**Liyi Xiao** (M'07) received the B.S., M.S., and Ph.D. degrees in microelectronics and solid-state electronics from the Harbin Institute of Technology, Harbin, China, in 1984, 1989, and 2001, respectively.

She has been a Professor with the Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin, since 2004. Her current research interests include reliability in semiconductor devices and extra-low power design in VLSI, ASIC/SoC.



**Zhigang Mao** (M'07) received the Ph.D. degree from Rennes University, Rennes, France, in 1992. He is currently a Professor with the Microelectronics Center, Harbin Institute of Technology, Harbin, China.

His current research interests include VLSI design methodology, high-speed digital circuit design technology, signal processor architecture, and hardware security technology and reliability in semiconductor devices.



**Qiang Zhao** received the B.S. degree in electronic science and technology from the Harbin Institute of Technology, Harbin, China, in 2012, where he is currently pursuing the M.S. degree in microelectronics and solid-state electronics.

His current research interests include high-speed and low-power VLSI, ASIC/SoC design and reliability in semiconductor devices.