Efficient systolic multiplications in composite fields for cryptographic systems

Multiplications in finite fields play a key role in cryptography and mathematics. We present approaches that exploit systolic architectures for multiplications in composite fields, which are expected to reduce the time-area product substantially. We design a pipelined architecture for multiplications in composite fields $GF((2^n)^2)$, where $n$ is a positive integer. In addition, we design systolic architectures for multiplications and additions in finite fields $GF(2^n)$. By integrating these main improvements and other minor optimizations for multiplications in $GF((2^n)^2)$, the non-pipelined version of our design takes $8n+4$ AND gates and $8n$ XOR gates to compute a multiplication with an execution time of $nT_{AND}+4nT_{XOR}$, where $T_{AND}$ and $T_{XOR}$ are the delays of AND and XOR gates respectively; with the aid of pipelining, the pipelined version of our design has a throughput rate of one result per $2nT_{XOR}$. In other words, the time complexity and area complexity of our design are $O(n)$. Thus, the complexity of the time-area product of our design is $O(n^2)$. Experimental results and comparisons show that our design provides significant reductions in the execution time and area of multiplications.

1. Introduction. Among operations in finite fields, multiplications are crucial to many cryptographic systems, e.g. Multivariate Public Key Cryptography (MPKC) [23], AES [5] and CLEFIA [27]. AES and CLEFIA use a Substitution-Box (S-Box) [5], which is generated by using multiplications and inversions in a finite field; MPKC uses a great many multiplications in a finite field during encryption, decryption, signature generation and verification. Multiplications are also widely used in solving systems of linear equations [7]. Thus, it is desirable to improve multiplications in finite fields, since they play an important role in the implementations of many cryptographic systems and other engineering systems.
Among finite fields, composite fields are popular choices for implementations of cryptographic systems since they allow efficient hardware implementation in terms of the silicon area as well as the execution time. We present approaches that exploit systolic architectures for multiplications in composite fields, which are expected to reduce the time-area product substantially.
The main improvements of this paper over known results are as follows. First, we design a systolic architecture for multiplications in finite fields $GF(2^n)$, where $n$ is a positive integer. Second, we design a systolic architecture for additions in $GF(2^n)$. Third, we design a pipelined architecture for multiplications in composite fields $GF((2^n)^2)$.
By integrating the above improvements and other minor optimizations, non-pipelined and pipelined versions of multiplications in $GF((2^n)^2)$ are designed. The non-pipelined version of our design has an execution time of $nT_{AND} + 4nT_{XOR}$, where $T_{AND}$ and $T_{XOR}$ are the delays of AND and XOR gates respectively; the pipelined version of our design has a throughput rate of one result per $2nT_{XOR}$. The design takes $8n + 4$ AND gates and $8n$ XOR gates to compute a multiplication. In other words, the time complexity and area complexity of our design are $O(n)$. Thus, the complexity of the time-area product of our design is $O(n^2)$.
Our design is well suited for Application Specific Integrated Circuits (ASICs) and for Altera and Xilinx Field Programmable Gate Arrays (FPGAs). We back up these claims with implementations of our design on a TSMC-0.18µm standard cell CMOS ASIC and on Altera and Xilinx FPGAs respectively. Experimental results and comparisons with other multiplications in [9,20,22,34] show that our design provides significant reductions in execution time and area.
The rest of this paper is organized as follows: in Section 2, we introduce the background information; in Section 3, we propose systolic multiplications in composite fields; in Section 4, we present timing and area analysis of our design; in Section 5, we present implementations of our design; in Section 6, we compare our implementations with related methods; in Section 7, we present conclusions of this paper.
2. Preliminaries. In mathematics, a finite field is a field that contains a finite number of elements. As with any field, it is a set on which the basic operations of addition, multiplication and inversion have been defined.
Commonly, the prime field $GF(p)$ of order and characteristic $p$ is constructed as the integers modulo $p$, where $p$ is a prime number. Thus, the elements are represented by integers in the range $0, \ldots, p-1$. Given a prime power $q = 2^n$ with $n > 1$, the field $GF(q)$ can be explicitly constructed. One first chooses an irreducible polynomial $f$ in $GF(2)[X]$ of degree $n$. Then the quotient ring $GF(q) = GF(2)[X]/(f)$ of the polynomial ring $GF(2)[X]$ by the ideal generated by $f$ is a field of order $q$.
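As a minimal sketch of this construction, the following Python function multiplies two elements of $GF(2)[X]/(f)$ by shift-and-reduce. The integer encoding (bit $i$ holds the coefficient of $X^i$), the function name gf2n_mul and the example polynomial $f = X^3 + X + 1$ are illustrative assumptions and are not taken from the design described in this paper.

```python
def gf2n_mul(a, b, f, n):
    """Multiply a and b in GF(2)[X]/(f), where f has degree n.

    Elements are integers whose bit i is the coefficient of X^i.
    """
    result = 0
    for _ in range(n):
        if b & 1:
            result ^= a        # add the current multiple of a (addition is XOR)
        b >>= 1
        a <<= 1                # multiply a by X
        if (a >> n) & 1:
            a ^= f             # degree reached n: reduce modulo f
    return result


# Example in GF(2^3) with f = X^3 + X + 1 (0b1011):
# X * X^2 = X^3 = X + 1, i.e. 0b011.
print(gf2n_mul(0b010, 0b100, 0b1011, 3))  # prints 3
```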
A composite field is a special case of a finite field. The elements of a composite field $GF((2^n)^m)$ can be represented in the standard basis as polynomials of degree at most $m-1$ with coefficients in $GF(2^n)$. The two pairs $\{GF(2^n), p(x)\}$ and $\{GF((2^n)^m), q(y)\}$ constitute a composite field if $GF(2^n)$ is constructed from $GF(2)$ by $p(x)$ and $GF((2^n)^m)$ is constructed from $GF(2^n)$ by $q(y)$, where $p(x)$ and $q(y)$ are field polynomials of degree $n$ and $m$ respectively. The composite field $GF((2^n)^m)$ is isomorphic to the field $GF(2^l)$ if $l = n \times m$. $GF((2^n)^2)$ is the special case of a composite field with $m = 2$. Every finite field $GF(2^l)$ with even $l$ can be expressed in the form $GF((2^n)^2)$ with $n = l/2$.
3. Efficient systolic multiplications in composite fields.

3.1. Pipelined architecture for multiplications in composite fields. We propose a pipelined architecture for multiplications in composite fields, which is depicted in Fig. 1. It computes multiplications in $GF((2^n)^2)$, as illustrated below.
(1) field elements $a_h$, $a_l$, $b_h$, $b_l$ and $e$ in $GF(2^n)$ are the inputs of the architecture;
(4) field elements $c_h$ and $c_l$ in $GF(2^n)$ are the outputs of the architecture;
(5) $c(x) = c_h x + c_l$ is the expected product of $a(x)$ and $b(x)$ in $GF((2^n)^2)$;
(6) the architecture includes three stages, i.e. Stage0, Stage1 and Stage2, where MUL and ADD are multiplication and addition components in $GF(2^n)$;
(10) components in the architecture are designed with AND gates and XOR gates;
(11) elements are sent to the architecture and the multiplications are computed with the aid of pipelining.
Based on the architecture, the multiplications are computed via pipelining as follows.
(1) period 0: $a_h$, $a_l$, $b_h$, $b_l$ are sent to Stage0;
(2) period 1: $a_h$, $a_l$, $b_h$, $b_l$ are sent to Stage0, $e$ is sent to Stage1;
(3) period 2: $a_h$, $a_l$, $b_h$, $b_l$ are sent to Stage0, $e$ is sent to Stage1, $c_h$, $c_l$ are generated via Stage2.
It can be observed that, once the pipeline is full, the architecture completes one multiplication in every period.
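The arithmetic behind such an architecture can be sketched in Python under explicit assumptions. The field polynomial of $GF((2^n)^2)$ is not stated in the text above; the sketch assumes the common choice $q(x) = x^2 + x + e$, so that $x^2 = x + e$, and uses a three-product (Karatsuba-style) split. The names gf_mul and composite_mul, the integer encoding and the small example field are hypothetical, and the mapping of these operations onto Stage0, Stage1 and Stage2 is not claimed.

```python
def gf_mul(a, b, p, n):
    """Shift-and-reduce product in GF(2^n); bit i is the coefficient of x^i."""
    r = 0
    for _ in range(n):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= p
    return r


def composite_mul(ah, al, bh, bl, e, p, n):
    """(ah*x + al)(bh*x + bl) mod (x^2 + x + e), coefficients in GF(2^n).

    ASSUMPTION: q(x) = x^2 + x + e; the source does not state q(x).
    """
    hh = gf_mul(ah, bh, p, n)              # ah * bh
    ll = gf_mul(al, bl, p, n)              # al * bl
    mid = gf_mul(ah ^ al, bh ^ bl, p, n)   # ah*bh + ah*bl + al*bh + al*bl
    ch = mid ^ ll                          # ah*bh + ah*bl + al*bh
    cl = ll ^ gf_mul(e, hh, p, n)          # al*bl + e*ah*bh
    return ch, cl


# Example: GF((2^2)^2) with p(x) = x^2 + x + 1 and e = x (0b10),
# for which x^2 + x + e has no root in GF(4).
print(composite_mul(1, 0, 1, 0, 0b10, 0b111, 2))  # x * x = x + e, prints (1, 2)
```

The split needs three products of operands plus one multiplication by $e$; a schoolbook split with four operand products would give the same result.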
3.2. Systolic component MUL: multiplications in $GF(2^n)$. As described in Fig. 1, the pipelined architecture for multiplications in $GF((2^n)^2)$ includes MUL components, which are used to compute multiplications in $GF(2^n)$. We design MUL in Fig. 2, as illustrated below.
(1) it uses three different kinds of cells, i.e. A, B and C;
(5) $f(x)$ and $g(x)$ are elements in $GF(2^n)$ and the expected product of $f(x)$ and $g(x)$ is $h(x)$, which is an element in $GF(2^n)$.
It can be observed from Fig. 2 that cell A has three ports, i.e. $a_0$, $a_1$ and $a_2$, where $a_0$ is an input and $a_1$, $a_2$ are outputs. In cell $A_i$, the computation is as follows.
(1) $p(x) = x^n + p_{n-1}x^{n-1} + p_{n-2}x^{n-2} + \cdots + p_1 x + 1$ is the irreducible polynomial over $GF(2)$ that defines $GF(2^n)$, where $p_{n-1}, p_{n-2}, \ldots, p_1$ are elements in $GF(2)$, i.e. 0 or 1;
(2) for $i = 0, 1, \ldots, 2(n-1)$, where mod is a modular operation;
(3) it uses an accumulator whose initial value is $k = 0$; if a new value is received via the input port, $k = k + 1$;
(4) when a new $b_i$ is received, for $t = 0, 1, \ldots, n-1$, if $v_{(k+i)t} = 1$, $b_i$ is sent to $d_k$ and $d_k$ is sent to cell $C_k$.
It can be observed from Fig. 2 that cell C has one port, i.e. $c$, where $c$ is an input. In cell $C_i$, the computation is as follows.
(1) when a new $c$ is received, $h_i = h_i + c$ is computed, where $+$ is an addition in $GF(2)$ performed by an XOR gate.
Based on our design, we depict the systolic multiplication in $GF(2^n)$ in Fig. 3. Cell A uses AND gates to compute multiplications in $GF(2)$, cell B is a selector and cell C uses XOR gates to compute additions in $GF(2)$. Thus, the architecture uses $n$ AND gates and $n$ XOR gates, and it takes $2n$ clock cycles to perform a multiplication.
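The division of labour between AND, selector and XOR cells can be illustrated with a functional Python model. It is a sketch only, not a cycle-accurate model of Fig. 2 or Fig. 3. It assumes that $v_k$ denotes $x^k \bmod p(x)$ and that $v_{kt}$ is the coefficient of $x^t$ in $v_k$; the defining equation is missing from the text above, so this reading is an assumption. The names reduction_table and mul_by_routing are hypothetical.

```python
def reduction_table(p, n):
    """v[k] = x^k mod p(x) for k = 0 .. 2(n-1), stored as n-bit integers."""
    v, cur = [], 1
    for _ in range(2 * (n - 1) + 1):
        v.append(cur)
        cur <<= 1                  # next power of x
        if (cur >> n) & 1:
            cur ^= p               # reduce modulo p(x)
    return v


def mul_by_routing(f, g, p, n):
    """Product of f and g in GF(2^n), built from AND, select and XOR steps."""
    v = reduction_table(p, n)
    h = [0] * n                                  # one accumulator per output bit
    for i in range(n):
        for j in range(n):
            bit = (f >> i) & (g >> j) & 1        # AND: multiplication in GF(2)
            for t in range(n):
                if (v[i + j] >> t) & 1:          # selector: x^(i+j) contributes to h_t
                    h[t] ^= bit                  # XOR: addition in GF(2)
    return sum(bit << t for t, bit in enumerate(h))


# Example in GF(2^3) with p(x) = x^3 + x + 1 (0b1011).
print(mul_by_routing(0b101, 0b011, 0b1011, 3))  # (x^2 + 1)(x + 1) = x^2, prints 4
```

The model evaluates all partial products in software loops, whereas the hardware described above reuses $n$ AND gates and $n$ XOR gates over $2n$ clock cycles.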
3.3. Systolic component ADD: additions in $GF(2^n)$. As described in Fig. 1, the pipelined architecture for multiplications in $GF((2^n)^2)$ includes ADD components, which are used to compute additions in $GF(2^n)$. We design ADD in Fig. 4, as illustrated below. Cell A uses an XOR gate to compute additions in $GF(2)$.
Thus, the architecture uses one XOR gate, and it takes $n$ clock cycles to perform an addition.
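A bit-serial addition of this kind can be modelled in a few lines of Python. The sketch assumes that one coefficient pair is processed per clock cycle, lowest degree first; the processing order, the integer encoding and the name serial_add are assumptions made for illustration.

```python
def serial_add(a, b, n):
    """Add two GF(2^n) elements one coefficient per clock cycle."""
    result = 0
    for cycle in range(n):                          # n clock cycles
        bit = ((a >> cycle) ^ (b >> cycle)) & 1     # the single XOR gate
        result |= bit << cycle
    return result


print(serial_add(0b101, 0b011, 3))  # (x^2 + 1) + (x + 1) = x^2 + x, prints 6
```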

4. Timing and area analysis. According to our design, the architecture for multiplications can be built from AND and XOR gates. Thus, in the following, we analyze the timing and area of our design in terms of AND and XOR gates.
Based on the description of our architecture, we analyze and summarize the execution time and area for a multiplication in $GF((2^n)^2)$ in Table 1, which shows that it takes $8n + 4$ AND gates and $8n$ XOR gates to compute a multiplication with an execution time of $nT_{AND} + 4nT_{XOR}$. Thus, the execution time and area of the non-pipelined version of our design are linear in $n$, i.e. logarithmic in the field size $2^{2n}$. In other words, the time complexity and area complexity of non-pipelined multiplications are $O(n)$.
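These expressions are easy to evaluate for a given $n$. The following Python sketch does so; the function name nonpipelined_cost and the unit gate delays in the example are hypothetical placeholders, not measured values.

```python
def nonpipelined_cost(n, t_and, t_xor):
    """Gate counts and execution time of one multiplication in GF((2^n)^2)."""
    and_gates = 8 * n + 4
    xor_gates = 8 * n
    exec_time = n * t_and + 4 * n * t_xor
    return and_gates, xor_gates, exec_time


# Example with n = 61 and hypothetical unit gate delays.
and_gates, xor_gates, exec_time = nonpipelined_cost(61, 1.0, 1.0)
print(and_gates, xor_gates, exec_time)  # prints 492 488 305.0
```

Both the gate count and the execution time grow linearly in $n$, so their product grows quadratically.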
In addition, we can use pipelining in our design to accelerate multiplications in composite fields. Table 2 shows that multiplications in $GF((2^n)^2)$ are computed with a throughput rate of one result per $2nT_{XOR}$ by using pipelining. Thus, the time per multiplication is reduced by more than 50% by using pipelining.
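The 50% figure can be checked directly, assuming the comparison is between the pipelined time per result and the non-pipelined execution time:

```latex
\[
\frac{2nT_{XOR}}{nT_{AND} + 4nT_{XOR}}
  = \frac{2T_{XOR}}{T_{AND} + 4T_{XOR}}
  < \frac{2T_{XOR}}{4T_{XOR}}
  = \frac{1}{2}
  \qquad \text{for } T_{AND} > 0 .
\]
```

The ratio does not depend on $n$, so under this reading the bound holds for every field size.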

5. Implementation. According to the analysis in Section 4, our design takes 4 cells, including $8n+4$ AND gates and $8n$ XOR gates, and $5n$ clock cycles to compute a multiplication in $GF((2^n)^2)$ with an execution time of $nT_{AND} + 4nT_{XOR}$. By using pipelining, it computes multiplications with a throughput rate of one result per $2nT_{XOR}$. We evaluate and summarize the performance of our design in Table 3.
In order to demonstrate that our architectures achieve high multiplication throughput and low area on different devices, the design has been modeled in the Verilog Hardware Description Language (HDL) and implemented on ASICs, Altera FPGAs and Xilinx FPGAs respectively. Since pipelining is used to gain a high throughput, our implementations consist of non-pipelined and pipelined versions.

5.1. Implementations on ASICs. We implement the non-pipelined and pipelined versions of our design in $GF((2^n)^2)$ on TSMC-0.18µm standard cell CMOS ASICs respectively. We use Synopsys Design Vision, which is a GUI for the Synopsys Design Compiler tools. The map effort is set to medium. We report time (ns), throughput (ns) and area (µm$^2$) for implementations in composite fields.
We summarize ASIC implementations of our design for different composite fields in Table 4, which indicates that they achieve high throughput and low area for multiplications in $GF((2^n)^2)$.

5.2. Implementations on Altera FPGAs. In order to demonstrate that our design is applicable to Altera FPGA devices, we implement the non-pipelined and pipelined versions in $GF((2^n)^2)$ on an Altera FPGA (Stratix II EP2S180F1508C3) respectively. Synthesis, Fitting and Place & Route have been carried out by using Quartus II 64-bit version 8.0, which is a GUI for Altera synthesis software. ModelSim PE has been used to perform the circuit simulations. We report time (ns), throughput (ns), area (combinational ALUTs) and the utilization rate of combinational ALUTs for implementations in composite fields.
We summarize Altera FPGA implementations of our design for different composite fields in Table 4, which indicates that they achieve high throughput and low area for multiplications in $GF((2^n)^2)$.

5.3. Implementations on Xilinx FPGAs. In order to demonstrate that our design is applicable to Xilinx FPGA devices, we implement the non-pipelined and pipelined versions in $GF((2^n)^2)$ on a Xilinx FPGA (Virtex 5 XC5VLX110T) respectively. Synthesis, Fitting and Place & Route have been carried out by using ISE Design Suite version 14.4, which is a GUI for Xilinx synthesis software. ModelSim PE has been used to perform the circuit simulations. We report the time (ns), throughput (ns), area (Slice LUTs) and the utilization rate of Slice LUTs for implementations in composite fields.
We summarize Xilinx FPGA implementations of our design for different composite fields in Table 4, which indicates that they achieve high throughput and low area for multiplications in $GF((2^n)^2)$.

Table 5. Comparison of our design with other multiplications in $GF((2^n)^2)$
                Pan et al. [22]    Xie et al. [34]    Namin et al. [20]    Hariri et al. [9]    Ours
O(Time)         $n^2$              $n$                $n\log_2 2n$         $\log_2 2n$          $n$
O(Area)         $n\sqrt{2n}$       $n^2$              $n$                  $n^2$                $n$
O(Time*Area)    $n^3\sqrt{2n}$     $n^3$              $n^2\log_2 2n$       $n^2\log_2 2n$       $n^2$

6. Comparison. Our implementations are compared with related methods for multiplications in finite fields. To be fair, we use the non-pipelined versions in the comparisons, since the other multiplications are non-pipelined designs. Table 5 lists the comparison of our design with the recent proposals for multiplications in [9,20,22,34], which demonstrates that our design is more efficient than the other multiplications, e.g. the time-area product is reduced by 76% in $GF((2^{61})^2)$ and by 87% in $GF((2^{127})^2)$. Thus, our design reduces the time-area product of multiplications in $GF((2^n)^2)$ significantly.
7. Conclusion. Composite fields are popular choices for implementations of cryptographic systems since they allow efficient hardware implementation in terms of the silicon area as well as the execution time. We present approaches that exploit systolic architectures for multiplications in composite fields.
The main improvements of this paper over known results are as follows. First, we design a systolic architecture for multiplications in $GF(2^n)$. Second, we design a systolic architecture for additions in $GF(2^n)$. Third, we design a pipelined architecture for multiplications in $GF((2^n)^2)$. By integrating the above improvements and other minor optimizations, non-pipelined and pipelined versions of multiplications in $GF((2^n)^2)$ are designed. The non-pipelined version of our design has an execution time of $nT_{AND} + 4nT_{XOR}$; the pipelined version of our design has a throughput rate of one result per $2nT_{XOR}$. The design takes $8n+4$ AND gates and $8n$ XOR gates to compute a multiplication. In other words, the time complexity and area complexity of our design are $O(n)$. Thus, the complexity of the time-area product of our design is $O(n^2)$.
Our design is well suited for ASICs and for Altera and Xilinx FPGAs. We back up these claims with implementations of our design on a TSMC-0.18µm standard cell CMOS ASIC and on Altera (Stratix II EP2S180F1508C3) and Xilinx (Virtex 5 XC5VLX110T) FPGAs respectively. Experimental results and comparisons with other multiplications show that our design provides significant reductions in execution time and area.