A PRACTICABLE TIMING ATTACK AGAINST HQC AND ITS COUNTERMEASURE

Abstract. In this paper, we present a practicable chosen ciphertext timing attack retrieving the secret key of HQC. The attack exploits a correlation between the weight of the error to be decoded and the running time of the decoding algorithm of BCH codes. For the 128-bit security parameters of HQC, the attack runs in less than a minute on a desktop computer, using roughly 6000 decoding requests, and has a success probability of approximately 93 percent. To prevent this attack, we provide an implementation of a constant time algorithm for the decoding of BCH codes. Our implementation of the countermeasure achieves a constant time execution of the decoding process without a significant performance penalty.


Introduction
HQC [1,3] is a code-based IND-CCA2-secure public key encryption scheme whose security is based on the hardness of the quasi-cyclic syndrome decoding problem. It is one of the candidate algorithms that advanced to round 2 of the NIST post-quantum standardization project. In particular, HQC relies on tensor product codes (BCH codes tensored with repetition codes) in its decryption algorithm. BCH codes are algebraic codes introduced in two independent works by Bose and Chaudhuri [9] and by Hocquenghem [13]. Algorithms to decode BCH codes use Galois field arithmetic and consist of three steps: syndrome computation, error-locator polynomial computation, and root computation.
So far, BCH codes have been used to mitigate the decryption failure in various public key encryption schemes based on hard problems of either coding theory [1,3] or lattices [17]. However, due to side-channel timing leakage, a straightforward use of BCH codes would introduce a security weakness in the underlying cryptographic schemes when implemented in software. In fact, D'Anvers et al. [10] showed that the security of LAC, a lattice-based cryptosystem [17], could be significantly reduced if there is a side channel leakage during the error correction of BCH codes. Furthermore, HQC shares the same framework as the RQC [2,3] cryptosystem. It has been shown in [4] that this framework is vulnerable to a timing attack in the rank metric setting if the decoding of the underlying Gabidulin codes [11] is implemented in a non constant time fashion.
Note that making the BCH decoder run efficiently while remaining constant time is challenging. In a recent work, Walters and Sinha Roy [18] proposed such a constant time BCH decoding implementation. However, the algorithms they use for syndrome computation and root computation are not the most efficient known in the literature. In 2013, Bernstein et al. [6] presented efficient constant-time algorithms for syndrome computation, root finding and error-locator polynomial computation. These algorithms are based on a bitsliced implementation of an additive Fast Fourier Transform [12] (FFT), a transposed additive FFT and a sorting network. In a follow-up work, Chou [8] proposed a more efficient implementation and described how to implement the Berlekamp-Massey algorithm, used to compute the error-locator polynomial, in a constant-time way. This implementation follows the inversion-free version of the algorithm proposed by Xu in [19]. Recently, Bernstein and Yang [7] developed a fast constant-time Euclidean algorithm that can be used in the decoding of BCH codes.
In this paper, we describe a timing attack against HQC that recovers the secret key. The decryption step in HQC takes as input a couple of vectors (u, v) and proceeds in three steps. First, the secret key y is used to compute a = v − u · y; then a is decoded using a repetition code; finally, the obtained word is decoded using a BCH code. The attack relies on the fact that the secret key y is a sparse vector, hence decoding y using the repetition code results in an all-zero vector. Notice that one can choose u and v so as to make a equal to y. Moreover, adding 1's in a specific block of v can establish a majority of 1's in that block, in which case the repetition-decoded word has weight 1. Therefore, an oracle that measures the running time of the BCH decoding for words of weight 0 and 1 can be used to recover the positions of all 1's in the secret key y.
Contributions. In this paper, we present a practicable timing attack against HQC that completes in under a minute. As a minor contribution, we show how to make the subroutines of the BCH decoder run in constant time, following the guidelines of recent results in this domain.
Paper organisation. In section 2, we give some preliminaries on code-based cryptography, decoding BCH codes as well as the HQC cryptosystem. Next, in section 3, we present a correlation between the weight of the error to be decoded and the decoding time of BCH codes. This observation is the cornerstone of the timing attack detailed in section 4. In section 5, we show how we implemented a constant time decoding to avoid this attack as well as some experimental results. Finally, we conclude this work in section 6.

Preliminaries
In this section, we give some preliminaries regarding the Hamming metric, error-correcting codes and the HQC cryptosystem.
2.1. Coding theory. Let F_2 be the binary finite field and F_2^n the vector space of dimension n over F_2 for some positive integer n. Elements of F_2^n are considered as vectors or polynomials in F_2[X]/(X^n − 1).
Definition 2.2 (Hamming weight). Let x ∈ F_2^n. The Hamming weight of x, denoted by w(x), is the cardinality of its support, i.e. the number of its non-zero coordinates.

Definition 2.3 (Hamming distance). Let x, y ∈ F_2^n. The Hamming distance from x to y, denoted by d(x, y), is defined as w(x − y), i.e. the number of coordinates on which x and y differ.
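As an illustration, both quantities can be computed directly on bit-vectors packed into bytes; the following C sketch (helper names are ours) does exactly that:

```c
#include <stdint.h>
#include <stddef.h>

/* Hamming weight w(x): number of non-zero bits of a bit-vector
 * stored as nbytes bytes. */
static unsigned hamming_weight(const uint8_t *x, size_t nbytes) {
    unsigned w = 0;
    for (size_t i = 0; i < nbytes; i++)
        for (int b = 0; b < 8; b++)
            w += (x[i] >> b) & 1u;
    return w;
}

/* Hamming distance d(x, y) = w(x XOR y): number of coordinates
 * on which x and y differ. */
static unsigned hamming_distance(const uint8_t *x, const uint8_t *y,
                                 size_t nbytes) {
    unsigned d = 0;
    for (size_t i = 0; i < nbytes; i++) {
        uint8_t diff = x[i] ^ y[i];
        for (int b = 0; b < 8; b++)
            d += (diff >> b) & 1u;
    }
    return d;
}
```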
Definition 2.4 (Linear code). A linear [n, k]-code C of length n and dimension k is a linear subspace of F_2^n of dimension k. If its minimum distance is d, the code C is called an [n, k, d]-code.
Theorem 2.9 (Generator polynomial [16]). Let C be a cyclic code over F 2 . There exists a unique polynomial g(x) in C of minimal positive degree. Moreover, a polynomial c(x) is a codeword of C if and only if g(x) divides c(x). The polynomial g(x) is called the generator polynomial of the cyclic code C.
HQC uses a tensor product code obtained as the combination of a BCH code with a repetition code.

Definition 2.10 (Tensor product code [1]). Let C_1 (resp. C_2) be an [n_1, k_1] (resp. [n_2, k_2]) linear code over F_2. The tensor product code of C_1 and C_2, denoted C_1 ⊗ C_2, is defined as the set of all n_2 × n_1 matrices whose rows are codewords of C_1 and whose columns are codewords of C_2. More formally, if C_1 (resp. C_2) is generated by G_1 (resp. G_2), then C_1 ⊗ C_2 is the [n_1 n_2, k_1 k_2] code generated by G_1 ⊗ G_2.

Theorem 2.11 (BCH code [16]). For any positive integers m ≥ 3 and t < 2^{m−1}, there exists a binary cyclic BCH [n, k, d]-code with the following properties: n = 2^m − 1; n − k ≤ mt; d ≥ 2t + 1. We call this code a t-error-correcting BCH code. Let α be a primitive element in F_{2^m}, and let φ_i(x) be the minimal polynomial of α^i for 1 ≤ i ≤ 2t. The generator polynomial g(x) of the BCH [n, k, d]-code is the least common multiple of φ_1(x), φ_2(x), ..., φ_{2t}(x), that is, g(x) = lcm(φ_1(x), φ_2(x), ..., φ_{2t}(x)).

BCH codes encoding. Given the generator polynomial g(x) and a message u(x) = u_0 + u_1 x + ... + u_{k−1} x^{k−1}, the systematic encoding of BCH codes consists of three steps: multiply u(x) by x^{n−k}; compute the remainder b(x) = x^{n−k} u(x) mod g(x); output the codeword c(x) = x^{n−k} u(x) + b(x).

BCH codes decoding. We now describe BCH decoding, which consists of three steps, following [15]:

1. Compute the 2t syndromes from the received polynomial r(x). Let c(x) denote the sent codeword and e(x) the error word; one has r(x) = c(x) + e(x). For 1 ≤ i ≤ 2t, the syndromes S_i are defined as S_i = r(α^i) = e(α^i), since c(α^i) = 0. Let v be the number of errors and let j_1, j_2, ..., j_v be the error positions. Then S_i = α^{i j_1} + α^{i j_2} + ... + α^{i j_v}. Introducing the error locators β_s = α^{j_s}, with s = 1, 2, ..., v, one can write the syndromes more explicitly: S_i = β_1^i + β_2^i + ... + β_v^i. These are known as power sum symmetric functions.

2. They lead to the definition of the error locator polynomial σ(x) = ∏_{s=1}^{v} (1 + β_s x) = Σ_{r=0}^{v} σ_r x^r. The coefficients (σ_i)_{1 ≤ i ≤ v} and the syndromes (S_i)_{1 ≤ i ≤ 2t} are then related by Newton's identities:

    S_i + σ_1 S_{i−1} + ... + σ_{i−1} S_1 + i σ_i = 0,   for 1 ≤ i ≤ v,
    S_i + σ_1 S_{i−1} + ... + σ_v S_{i−v} = 0,           for v < i ≤ 2t.   (1)

3. Compute the roots of the error locator polynomial σ(x). These roots β_1^{−1}, β_2^{−1}, ..., β_v^{−1} are the inverses of the error locators.
Once found, one can retrieve the error positions j_1, j_2, ..., j_v and correct r(x).

Definition 2.12 (Repetition code). The binary repetition code 1_n of length n is the set of two codewords: 1^n (all ones) and 0^n (all zeros). It has dimension 1 and correction capacity ⌊(n−1)/2⌋. Encoding repeats the message bit n times; decoding is done by majority decision, outputting 1 if there is a majority of 1's and 0 otherwise.
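A minimal C sketch of the majority-decision decoder just described (one bit per byte for clarity; the function name is ours):

```c
#include <stdint.h>
#include <stddef.h>

/* Majority-decision decoder for the repetition code 1_n:
 * outputs 1 if the n-bit block a contains a strict majority
 * of 1's, and 0 otherwise. */
static uint8_t repetition_decode(const uint8_t *a, size_t n) {
    size_t ones = 0;
    for (size_t i = 0; i < n; i++)
        ones += a[i] & 1u;
    return (uint8_t)(2 * ones > n);
}
```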

2.2. The HQC public key encryption scheme. Hamming Quasi-Cyclic [1,3] is a code-based IND-CCA2 secure encryption scheme whose security relies on the syndrome decoding problem. It is obtained by applying the HHK transformation [14] to the IND-CPA construction denoted HQC.PKE (depicted in Figure 1). HQC uses two types of codes: a tensor code C with generator matrix G and a random double-circulant [2n, n]-code with parity check matrix (1, h).
The correctness of HQC relies on the decoding capability of the code C. Indeed, Decrypt(sk, Encrypt(pk, m)) = m whenever C.Decode correctly decodes v − u · y, namely whenever w(x · r_2 − r_1 · y + e) ≤ δ.
The tensor product code C is defined by C = B ⊗ R, where B is an [n_1, k, d] BCH code and R is the [n_2, 1, n_2] repetition code 1_{n_2}. Encoding a given message m ∈ F_2^k is done in two steps. First, it is encoded into b ∈ F_2^{n_1} using the aforementioned BCH code B. Second, each coordinate b_i of b is re-encoded into c_i ∈ F_2^{n_2}, for 0 ≤ i ≤ n_1 − 1, with the repetition code R = 1_{n_2}. This yields the codeword (c_0 c_1 ... c_{n_1−1}). Similarly, decoding a = (a_0 a_1 ... a_{n_1−1}), with a_i ∈ F_2^{n_2} for 0 ≤ i ≤ n_1 − 1, is also done in two steps. First, the repetition code R decodes each a_i into a bit b_i. Second, the BCH code B decodes the word b = (b_i)_{0 ≤ i ≤ n_1−1} into the message.
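The second encoding layer can be sketched as follows in C, assuming a BCH encoder producing b exists elsewhere (the function name and bit-per-byte representation are ours):

```c
#include <stdint.h>
#include <stddef.h>

/* Second layer of C = B ⊗ R encoding: expand each bit b_i of the
 * BCH codeword b (length n1) into a block of n2 repeated bits,
 * yielding the tensor codeword c = (c_0 c_1 ... c_{n1-1}). */
static void repetition_expand(const uint8_t *b, size_t n1, size_t n2,
                              uint8_t *c) {
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            c[i * n2 + j] = b[i] & 1u;
}
```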

Correlation between decoding time and error weight
In this section, we show that there exists a correlation between the weight of the error to be decoded and the running time of the BCH decoding algorithm, assuming Berlekamp's simplified algorithm [16] (see appendix A) is used for the second step of decoding. We then describe an oracle distinguishing BCH codewords without errors from those with exactly one error, based on the running time of the HQC.Decrypt algorithm (see Figure 1).
Berlekamp's simplified algorithm (see appendix A) is an iterative algorithm solving the set of equations (1). It completes in t iterations and starts with σ(x) = 1. At iteration µ, it computes a quantity d_µ, called the discrepancy, whose value is 0 if the µ-th equation of system (1) holds. If not, it corrects σ(x) so that equation µ holds. The loop invariant is that after µ iterations, the first µ equations of system (1) are satisfied. Looking at the pseudocode in appendix A, one can see that:
• for a codeword without error, all discrepancies are zero and the algorithm completes without corrections;
• for a codeword with one error, the first syndrome is α^{j_1}, where j_1 is the error position, and one correction is needed.

The oracle O_01^HQC takes as input an HQC public key pk (which implicitly defines a BCH code B) and a ciphertext c = (u, v).
The oracle features an initialization step Init (see Algorithm 1) and an evaluation step Eval (see Algorithm 2). The Init step computes the expected running times T_0 and T_1 when the BCH code corrects 0 and 1 error respectively. To obtain T_0 and T_1, the proper requests have to be submitted to O_Time^HQC. In order to construct them, one has to account for the additional layers of multiplication and R-decoding on top of BCH decoding. The repetition code layer sees its input a, of length n = n_1 n_2, as n_1 blocks of n_2 bits. Each block a_i gives a bit b_i of the output vector b (fed to the BCH decoder), where b_i = 1 if the block contains a majority of 1's and b_i = 0 otherwise. To compute T_0 and T_1, we simply query the timing oracle O_Time^HQC and measure its response time: with u = 0^n and v = 0^n to get an estimate of T_0, and with u = 0^n and v = (1^{n_2} 0^{n−n_2}) to get an estimate of T_1, since then b = (1 0^{n_1−1}).
As described in Algorithm 1, for T_1 we take a sample of p requests and retain their mean as the estimate. The complexity of this initialization step is that of 1 + p decodings, which is negligible with respect to the rest of the attack. Further details about the choice of p are discussed in section 4.5.
The Eval step takes a word c as input and guesses whether or not the BCH code corrects an error during the HQC decryption of c. To this end, it calls O_Time^HQC(pk, c), yielding the running time τ, and outputs the error weight i such that |τ − T_i| is minimal.
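The nearest-expected-time decision of the Eval step can be sketched as follows (a simplified model; the function name and types are ours):

```c
#include <stdint.h>

/* Eval step (sketch): classify a measured running time tau as error
 * weight 0 or 1 by picking the nearest expected time. T0 and T1 are
 * the estimates produced by the Init step. */
static int classify_weight(uint64_t tau, uint64_t T0, uint64_t T1) {
    uint64_t d0 = tau > T0 ? tau - T0 : T0 - tau;  /* |tau - T0| */
    uint64_t d1 = tau > T1 ? tau - T1 : T1 - tau;  /* |tau - T1| */
    return d1 < d0;  /* 1 if tau is closer to T1, else 0 */
}
```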
Algorithm 1: Init. Input: a public key pk and a precision parameter p. Output: a couple (T_0, T_1) of expected running times.

Algorithm 2: Eval. Input: a public key pk, a ciphertext c, and the expected running times T_0 and T_1. Output: the error weight (0 or 1) that the BCH code B corrected during HQC.Decrypt(sk, c).

Practicable timing attack against HQC
In this section, we present a side-channel chosen ciphertext attack against HQC. This attack is a real threat as it has a polynomial complexity and requires a reasonable amount of requests. It proceeds by iterations until the key y is recovered. We first give a brief overview of the attack in section 4.1. We follow by describing its first two iterations in sections 4.2 and 4.3. Finally, we estimate its success probability in section 4.4 and discuss the attack complexity and bandwidth cost in section 4.5.

Attack overview.
The key y has Hamming weight ω, meaning it contains ω bits equal to 1 and n − ω bits equal to 0. The objective of the attack is to recover the support of y, i.e. the positions of all 1's. Consider the secret key y as n_1 blocks of n_2 bits. After initializing the oracle O_01^HQC, the attack proceeds by iterations. At iteration i, the attack searches block by block, finding all 1's of each block containing exactly i of them. This is done by querying the oracle with appropriate requests. For all requests, the vector u is chosen such that u · y = y, so that a = v ⊕ y. The input a is not fed directly to the BCH decoder but goes through the repetition code decoder first. So one wants to pick v such that v ⊕ y establishes a majority of 1's in a block that v alone would not. This naturally leads us to consider vectors v having a 1 in ⌊n_2/2⌋ positions of a block v_i. Doing so:
• either block y_i has a 1 in one of the remaining positions, which leads (v ⊕ y)_i to have a majority of 1's, and the oracle returns 1;
• or block y_i has no 1's in the remaining positions, (v ⊕ y)_i has no majority of 1's, and the oracle returns 0.

Either way, the oracle response leaks information on block y_i's content. Nevertheless, this strategy does not always work, as y can have multiple 1's per block. When it does, these 1's could cancel those we set in v and break our majority, preventing us from gaining information. This complicates our task and is the reason why we split the attack into different iterations, each designed to search within y's blocks for a certain number of 1's. For the sake of clarity and simplicity, we only describe the first two iterations.

First iteration.
During the first iteration, we aim to recover all 1's of y that are alone in their block. Consider the (i+1)-th block y_i of y (0 ≤ i ≤ n_1 − 1) and v_i the corresponding block of v. In order to determine the position of an eventual lone 1 in y_i, we start by querying the oracle with (ū, v) such that v_j = 0^{n_2} for j ≠ i and v_i has 1's in its first ⌊n_2/2⌋ positions.
• If the oracle response is 1, it means B corrected an error, thus y_i has a 1 in one of its last ⌈n_2/2⌉ positions. Proceeding by dichotomy, we can then submit to the oracle further queries (ū, v), placing the 1's of v_i so as to halve the remaining candidate range each time.
• If we get a response 0 to our first request, the same number of requests is enough to either find the position of the lone 1 or know there is none. However, since there are many more blocks without a 1 than blocks with one, one can reduce the number of requests: instead, the second request is (ū, v) where the 1's of v_i are placed in the other half of the block, testing all remaining positions at once. This way, if the oracle returns 0 to this second request as well, one can immediately dismiss the block, as it does not contain exactly one 1. This implies performing an extra request if it turns out there is a 1 to find, but saves log_2 n_2 − 1 requests most of the time.

Since there are a total of n_1 blocks, and y has at most ω blocks containing a single 1, the first iteration requires at most 2(n_1 − ω) + ω(⌈log_2 n_2⌉ + 1) requests.

Let's examine the complexity of this iteration. A request amounts to:
• the computation of v − u · y. The product complexity is 2ωn + (ω − 1)n (rotating ω arrays of size n and summing the resulting vectors). With the final addition, the complexity of this computation is 3ωn;
• n_1 R-decodings of total complexity n_1((n_2 − 1) + 1) = n (for each of the n_1 blocks, its n_2 bits are summed and a comparison is done).

We recall that in HQC the parameter n is chosen to be the first primitive prime greater than n_1 · n_2. Therefore, under the assumption ω = O(√n), we get a request complexity of O(n√n) and an overall complexity of O(n^{5/2}) for the first iteration.
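The dichotomy above can be sketched in a simplified model where the oracle directly answers whether the secret block contains a 1 among the tested positions (in the real attack, this answer is obtained by placing ⌊n_2/2⌋ 1's in v_i and observing whether the BCH decoder corrects an error); `find_lone_one` is our illustrative name:

```c
#include <stddef.h>

/* Locate the lone 1 of a secret block of n2 bits using roughly
 * log2(n2) oracle-style queries. Each loop iteration models one
 * oracle request: "does the block contain a 1 in [lo, mid)?" */
static size_t find_lone_one(const unsigned char *block, size_t n2) {
    size_t lo = 0, hi = n2;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        int left_has_one = 0;  /* simulated oracle answer */
        for (size_t j = lo; j < mid; j++)
            left_has_one |= block[j];
        if (left_has_one) hi = mid; else lo = mid;
    }
    return lo;  /* position of the lone 1 */
}
```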
The probability that the attack is successful after this first iteration is low enough (see section 4.4) that it calls for a second iteration.

4.3. Second iteration. The first iteration of the attack identified all 1's that are alone in their block. We now look for blocks of y containing exactly two 1's. In order to do so, we need to analyze what happens when such a block is encountered during the first iteration. There are two kinds of situations:
• case a: both 1's are in the same half of the block (including the middle position if n_2 is odd). If they are in the upper half, our first request gets a response 1 and we end up identifying the position of the 1 closest to the middle of the block. If they are in the lower half, our first request gets a response 0 but our second request gets a response 1, and we again end up identifying the position of the 1 closest to the middle of the block.
• case b: each half contains one of the two 1's (note that the case where n_2 is odd and there is a 1 in the middle would already have been detected). In that case, the first two requests return 0 and the block is discarded.
The second iteration will be divided in two phases treating blocks falling in each case. One can remark that there should be roughly the same amount of blocks falling in each case, simply because if one fixes a position in a block and randomly picks another position of the block, there's almost as many positions left in the same half as in the other half.

Phase 1.
Here the search is focused on blocks in which a 1 has already been identified. Clearly, this situation is very similar to the first iteration. We can simply ignore the 1 we know of, consider the block to be of length n_2 − 1, and assume we need one less 1 to achieve majority. This can be done using dichotomy as in the first iteration, except each time we pick ⌊n_2/2⌋ − 1 positions out of these n_2 − 1. This phase can be performed efficiently, as at most ω/2 blocks have to be looked into. This makes a maximum of (ω/2)(⌈log_2 n_2⌉ + 2) requests. Under the hypothesis ω = O(√n), this phase's complexity is O(n² log n).

Phase 2. Now we turn to the remaining blocks. We want to catch those containing precisely two 1's. Let's recall that in the event of such a block, it has a 1 in each block half (and none in the middle if n_2 is odd). Generalizing the strategy of the first iteration, we can distinguish whether or not the block contains such a pair of 1's in four requests (ū, v) with v_j = 0^{n_2} for j ≠ i. Since one knows the 1's are in different halves of the block, there are only four different pairs of quarters they can be in; each of the four requests tests one such pair. Therefore, if the oracle returns 0 to these four requests, the block contains either no 1's or more than two. If the oracle answers 1 to one of these requests, one retrieves two ranges of indices, both containing a 1. Then, proceeding by dichotomy for each range, one can narrow it down to a singleton in ⌈log_2((⌈n_2/2⌉ + 1)/2)⌉ + 1 requests. In the worst case scenario, we have ω/2 blocks containing two 1's, none of which have been detected yet. This takes (ω/2)(4 + 2(⌈log_2((⌈n_2/2⌉ + 1)/2)⌉ + 2)) + 4(n_1 − ω/2) requests to find them all, from which we derive a second iteration complexity of O(n^{5/2}).

4.4. Success probability estimation. Let's calculate the probability that y has been retrieved after each iteration. For 0 ≤ i ≤ n_1/2, define the events A_i: "y has exactly i blocks with two 1's and no block with more."
A: "y has at most two 1's per block." The event A_0 can also be described as the attack being successful after the first iteration: it means y has ω blocks containing a single 1, for each of which there are n_2 positions to choose from, and n_1 − ω blocks containing none. Therefore, with the HQC-128-1 parameters (see Table 1) n_1 = 796, n_2 = 31 and ω = 67, one has P(A_0) ≈ 0.0625: one recovers 6.25 percent of potential keys y after the first iteration.
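Under the simplifying assumption (ours) that keys are uniform weight-ω vectors of length n_1 n_2 (n is only slightly larger than n_1 n_2), the probability that every 1 of y falls in a distinct block can be written as:

```latex
\Pr(A_0) \;=\; \frac{\binom{n_1}{\omega}\, n_2^{\,\omega}}{\binom{n_1 n_2}{\omega}}
\;=\; \prod_{i=0}^{\omega-1} \frac{(n_1 - i)\, n_2}{n_1 n_2 - i}
\;\approx\; 0.0625
\quad \text{for } n_1 = 796,\ n_2 = 31,\ \omega = 67.
```

The product form places the ω ones one by one: the i-th 1 must land in one of the (n_1 − i) still-empty blocks, i.e. in (n_1 − i) n_2 of the n_1 n_2 − i remaining positions.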
Let's now compute the probability P(A) that the attack is successful after at most two iterations. A is the disjoint union of the A_i, hence P(A) = Σ_i P(A_i). With n_1 = 796, n_2 = 31 and ω = 67, one finds P(A) ≈ 0.9344: 93 percent of potential keys have been retrieved after the second iteration. One could show that the attack success probability after three iterations is above 99 percent. Table 3 presents the attack complexity and the number of required requests with respect to HQC parameters. The presented numerical values are obtained using the complexity formulas of sections 4.2 and 4.3. In particular, the first iteration requires at most 2(n_1 − ω) + ω(⌈log_2 n_2⌉ + 1) requests, while the second iteration requires at most (ω/2)(⌈log_2 n_2⌉ + 2) requests for its first phase and (ω/2)(4 + 2(⌈log_2((⌈n_2/2⌉ + 1)/2)⌉ + 2)) + 4(n_1 − ω/2) for its second phase.

Attack complexity and bandwidth cost.
Since the multiplication takes up most of the decryption workload, we took twice its complexity (i.e. 6ωn) as an upper bound on a request's complexity. We implemented the attack locally for HQC-128-1. Table 3 assumes each oracle request is made once. However, in a real-life scenario, different runs of the same request usually yield slightly different execution times. This derails the attack whenever the real execution time is closer to T_i but the measured execution time is closer to T_{1−i}, for i = 0, 1. To mitigate this effect, we take the standard approach of repeating each request several times, measuring the execution time of each run, and taking the median of the batch as the execution time estimate. The precision parameter p (see Algorithm 1) is another variable to take into account, as it plays a determinant role in the quality of T_0 and T_1. We carried out simulations to fine-tune p by observing successful and failed attacks. Table 2 shows the results obtained by running the attack 1000 times for p = 100 while varying the number of repeated requests to the oracle O_Time^HQC. We count the attack as successful only if the entire support of y is retrieved; recovering only part of the support counts as a failure. One can see that repeating each request nine times allows us to reach the theoretical failure percentage of 7%, which corresponds to the key y having a block with at least three 1's.
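The median-of-repeats estimate can be sketched as follows (a straightforward qsort-based median; names are ours):

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparator for 64-bit cycle counts (avoids overflow of x - y). */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of a batch of n timing measurements: sort in place and
 * return the middle element (the upper median when n is even). */
static uint64_t median_time(uint64_t *t, size_t n) {
    qsort(t, n, sizeof *t, cmp_u64);
    return t[n / 2];
}
```

Using the median rather than the mean makes the estimate robust against the occasional large outliers (cache misses, interrupts) that would skew an average.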
The tests were performed on a machine with 16GB of memory, equipped with an Intel Core i7-7820X CPU @ 3.60GHz, with Hyper-Threading, Turbo Boost and SpeedStep disabled. On this machine, the attack against HQC-128-1 takes less than a minute to complete.

Table 2. Percentage of failed attacks for an increasing number of repeated requests: 100%, 99.8%, 49.7%, 8.3%, 7.6%.

Table 3. Attack complexity and bandwidth cost against HQC.

Constant time decoding of BCH codes
A constant time BCH decoding algorithm naturally thwarts the attack. In this section, we detail how we implemented a constant-time decoding algorithm, following the standard techniques used in this domain and using results from previous related works. We start by specifying the constant time model we consider and discuss how one can transform a non constant time algorithm into a constant time one (section 5.1). We then apply these techniques to finite field arithmetic (section 5.2), syndrome and root computation (section 5.3) and ELP computation (section 5.4). This allows us to provide two variants of a constant time BCH decoding algorithm. Finally, we report our test results and discuss which variant should be preferred depending on the chosen BCH code and the target platform (section 5.5). Since an attacker can force the BCH decoder to use the secret y as its input (with the ciphertext (0^n, 1^{n_2} 0^{n−n_2}) for example), we hereafter consider the full constant time model.
There are three kinds of obstacles to a constant time implementation: loops whose bound is input-dependent, branches whose condition is input-dependent, and input-dependent memory accesses. Natural fixes for each of these obstacles are respectively [18]:
• To patch loops whose bound depends upon inputs by supplying a constant bound (the maximum number of iterations) and performing dummy operations once the original bound has been reached.
• To patch branches whose condition depends upon inputs by executing both branches and using a flag to control which branch is effectively executed.
• To patch array accesses whose index depends upon inputs either by eliminating them or by ensuring the corresponding address is already cached.
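The second patch, executing both branches under a flag, is the classic constant-time selection idiom; a minimal C sketch (function names are ours):

```c
#include <stdint.h>

/* Constant-time conditional select: returns a if flag == 1 and b if
 * flag == 0, without a data-dependent branch. The mask is either
 * all-ones or all-zeros. */
static uint32_t ct_select(uint32_t flag, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (flag & 1u);  /* 0xFFFFFFFF or 0 */
    return (a & mask) | (b & ~mask);
}

/* Constant-time "is x zero?" flag (useful e.g. to test a discrepancy
 * d_mu without branching): returns 1 if x == 0, else 0. */
static uint32_t ct_is_zero(uint32_t x) {
    return (uint32_t)(((uint64_t)x - 1) >> 63) & 1u;
}
```

Both branches of the original code are evaluated, and the mask merely selects which result is kept, so the executed instruction sequence no longer depends on the secret condition.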
Dealing with leaking array accesses can be done in several ways. Walters and Sinha Roy [18] suggest patching each such access by scanning the whole array to load it into the cache. For nested array accesses, this operation may induce a huge performance penalty. One may scan the array less often, but this requires being careful about addresses not being evicted from the L1 cache. One also has to be wary of the compiler with this approach, as compilers tend to identify these kinds of "do nothing" loops and optimize them out. We will refer to the approach of scanning an array with potentially leaking accesses once (and only once) as a cache-dependent patch, as it works only if the cache is big enough or if the code parameters are small enough. Note that even if the access no longer leaks, it still, strictly speaking, depends on the inputs. The second approach is a cache-independent patch, which consists of removing the array access entirely: first determine the range of indices that can potentially be accessed, then loop over all these indices, each time performing either a dummy operation or the real one as needed.

Recall from section 2.1 that BCH decoding has three steps: syndrome computation, ELP computation and root computation. To provide a constant time implementation of BCH decoding, we need to achieve constant time for Galois field arithmetic as well as for each of these three steps. We propose two variants: one with some cache-dependent array accesses and one without any.
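A cache-independent array read following this idea can be sketched as follows (illustrative code, not taken from the HQC implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Cache-independent patch for a secret-index array read: instead of
 * table[secret], scan the whole index range and accumulate the wanted
 * entry through a constant-time mask, so the sequence of memory
 * accesses is independent of the secret index. */
static uint16_t ct_array_read(const uint16_t *table, size_t len,
                              size_t secret) {
    uint16_t r = 0;
    for (size_t i = 0; i < len; i++) {
        size_t x = i ^ secret;  /* x == 0 iff i == secret */
        /* top bit of (x | -x) is 1 exactly when x != 0 */
        size_t top = (x | ((size_t)0 - x)) >> (sizeof(size_t) * 8 - 1);
        uint16_t mask = (uint16_t)(top - 1);  /* all-ones iff i == secret */
        r |= table[i] & mask;
    }
    return r;
}
```

The trade-off is explicit: every read now costs O(len) instead of O(1), which is why the cache-dependent patch remains attractive when the tables fit in L1.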

5.2. Constant time field arithmetic. All three steps of decoding make abundant use of field operations (mostly additions and multiplications) that need to be constant time.
Addition. For addition we use coefficient-wise xor.
Multiplication. We propose two implementations of multiplication:
• Lookup tables. Given log and antilog tables (relative to a primitive element α ∈ F_{2^m}), multiplying two elements of F_{2^m} is done by taking their logarithms, adding them modulo 2^m − 1, and taking the antilog of the result.
• The CLMUL instruction set. This is an extension to the x86 instruction set for microprocessors from Intel and AMD. The pclmulqdq instruction computes the 128-bit carry-less product of two 64-bit values. We then reduce modulo the primitive polynomial using bitwise operations.

Implementation 2 is constant time but requires support for the CLMUL instruction set. Note that if one knows a more efficient multiplication implementation, or if the CLMUL instruction set is not available, any other multiplication implementation can be used as long as it is constant time. Implementation 1 is faster but not constant time by itself, because it uses three input-dependent array accesses. However, using the aforementioned cache-dependent patch, that is, scanning both the log and antilog tables at the beginning of decoding, implementation 1 can run in constant time, depending on cache size and code parameters. These two underlying implementations of field multiplication distinguish our two constant time implementation variants.
Squaring. For squaring we use bitwise operations with constant shift amounts.
Inversion. For inversion we use fast exponentiation.
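The operations above can be illustrated on the small field F_256 (with the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1, a choice of ours for illustration; HQC's BCH codes live in larger fields, but the pattern is identical). Multiplication here is a branch-free shift-and-xor with mask-based reduction, standing in for the pclmulqdq route when CLMUL is unavailable, squaring is simply reused multiplication, and inversion uses fast exponentiation with a fixed, public exponent schedule:

```c
#include <stdint.h>

/* Branch-free carry-less multiply in GF(2^8), reducing on the fly by
 * x^8 + x^4 + x^3 + x^2 + 1 (low byte 0x1D). */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r ^= a & (uint8_t)-(b & 1);       /* add a iff current bit of b set */
        uint8_t hi = (uint8_t)-(a >> 7);  /* overflow mask for the doubling */
        a = (uint8_t)(a << 1) ^ (0x1D & hi);
        b >>= 1;
    }
    return r;
}

/* Squaring (a dedicated bit-spreading version would be faster). */
static uint8_t gf256_sq(uint8_t a) { return gf256_mul(a, a); }

/* Inversion by fast exponentiation: a^(2^8 - 2) = a^254.
 * The exponent is a public constant, so the schedule is fixed. */
static uint8_t gf256_inv(uint8_t a) {
    uint8_t r = 1;
    for (int bit = 7; bit >= 0; bit--) {
        r = gf256_mul(r, r);              /* square */
        if ((254 >> bit) & 1)             /* public data: branch is fine */
            r = gf256_mul(r, a);          /* multiply */
    }
    return r;
}
```

Addition, per the text, is just xor and needs no helper.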

Constant time syndromes computation and roots computation.
We start with steps 1 and 3 of BCH decoding, i.e. the computation of syndromes and roots. For both we benefit from fast algorithms developed by Bernstein et al. [6], who built on previous work by Gao and Mateer [12]. They use an additive Fast Fourier Transform (FFT) algorithm to compute the syndromes and its transpose to compute the ELP roots. Both algorithms are constant time. We refer the reader to the aforementioned papers for more details on the additive FFT.
We describe a small adjustment to these algorithms. The additive FFT is a recursive algorithm which calls two copies of itself. At each recursion level, some constants (called gammas and deltas) are computed using field operations. Bernstein et al. propose a bitsliced version of the algorithm. Since we use a non-bitsliced version here, field operations are more costly. As a result, recomputing these constants is more expensive than reading them from an array (even factoring in some L1 cache misses). Therefore, we compute these constants only once and store them in lookup tables for subsequent use. Note that the array accesses to these tables are not subject to timing leaks.

5.4. Constant time error locator polynomial computation. Here we start with Berlekamp's simplified algorithm [5,15] (see appendix A). We then use the standard techniques described in section 5.1 to make it constant time, opting for the cache-independent approach when we encounter input-dependent array accesses. Because pseudocode hides implementation details by nature, whereas constant time is an implementation-sensitive property, we give a constant time C implementation of Berlekamp's simplified algorithm in appendix B.

5.5. Test results. The benchmarks are performed on a machine with 16GB of memory, equipped with an Intel Core i7-7820X CPU @ 3.60GHz. Hyper-Threading, Turbo Boost and SpeedStep are disabled. The L1 data cache is 32 kilobytes. We pick six BCH codes of various parameters. For each chosen BCH code [n, k, t], we conduct two tests (one for each implementation of field multiplication) as follows. We generate 10 000 erroneous codewords with error weights distributed between 0 and 1.1t, where error positions are picked at random. Each codeword is decoded 100 times, and the average execution time over the batch is taken as the estimated execution time for decoding that codeword. We use all the optimizations suggested by Bernstein et al. [6] regarding the additive FFT, namely picking an ideal basis to avoid twisting, dealing with 2-coefficient and 3-coefficient polynomials more efficiently, and unrolling both the FFT and its transpose. Note that these codes are shortened BCH codes; because this doesn't fundamentally impact our case, we won't discuss it here but refer the reader to [1] for more details. An implementation will be made available at pqc-hqc.org. We give the results in the form of graphs (see Figures 2 and 3). Figure 2 features the decoding of all six codes using lookup tables for field multiplication, whereas Figure 3 features these same codes using the pclmulqdq instruction for field multiplication. Each graph is vertically centered around the mean execution time t_mean. Vertical axes spread from 0.95 t_mean to 1.05 t_mean, except for the last code [32767, 16412, 2631], where they stretch from 0.85 t_mean to 1.15 t_mean. The results for the specific case of decoding errors of weight 0 or 1 are given in Tables 5 and 6.
As expected, the second implementation of multiplication looks perfectly constant time (see Figure 3): for all six codes, regardless of the number of errors, the relative difference between any extremum and the mean decoding time always stays under 1%. The first implementation, on the other hand, appears to be constant time only for the first three codes, that is for m ≤ 12, i.e. up to F_4096 (see the first three graphs of Figure 2). Above that, the first implementation runs into cache issues. Indeed, our implementation uses uint16_t to represent field elements, which means two bytes per element. For F_8192, the log and antilog tables require 2 * 2 * 8192 = 32768 bytes, which completely fills the 32-kilobyte L1 data cache of the considered machine. From there, any computation will lead to addresses being evicted from the cache, which in turn causes timing leaks (see the last three graphs of Figure 2). For F_4096, the lookup tables take only half that memory, which seems to leave enough room for our decoding needs. However, for the small fields where it is constant time, the first implementation performs better than the second (see Tables 7 and 8); for HQC, the observed decoding times are 30% faster. So our recommendation is to use the first multiplication implementation (lookup tables) for BCH codes over F_4096 or smaller fields, which is the case of HQC, and to use the second multiplication implementation (via pclmulqdq) for larger fields. We integrated the constant time BCH decoding algorithm into the optimized implementation of HQC IND-CCA2 to measure the performance overhead. We restrict our measurements to the lookup table variant of the BCH decoding.

Table 6. Mean decoding execution time (in CPU cycles) of various BCH codes for errors of weight 0 and 1, with field multiplication implemented via the pclmulqdq instruction (variant 2).
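The first multiplication variant (log/antilog tables) can be sketched as follows for the toy field F_16 = F_2[x]/(x^4 + x + 1); the real implementation works the same way over larger F_2^m, and the helper names here are illustrative, not taken from the reference code.

```c
#include <stdint.h>

#define GF_ORDER 15  /* order of the multiplicative group of F_16 */

static uint16_t gf_exp[2 * GF_ORDER]; /* antilog table, doubled to skip a mod */
static int      gf_log[GF_ORDER + 1];

/* Build the log/antilog tables by iterating the generator x of F_16 */
static void gf_init(void) {
    uint16_t a = 1;
    for (int i = 0; i < GF_ORDER; ++i) {
        gf_exp[i] = a;
        gf_exp[i + GF_ORDER] = a;   /* duplicate so log sums need no reduction */
        gf_log[a] = i;
        a <<= 1;
        if (a & 0x10)
            a ^= 0x13;              /* reduce modulo x^4 + x + 1 */
    }
}

/* Variant 1 multiplication: two log lookups and one antilog lookup. The
 * accesses are input dependent, hence the cache concerns discussed above;
 * the zero test is branchy here for clarity, while a constant time version
 * would handle zero operands with masking instead. */
static uint16_t gf_mul(uint16_t x, uint16_t y) {
    if (x == 0 || y == 0)
        return 0;
    return gf_exp[gf_log[x] + gf_log[y]];
}
```

For F_2^m both tables have 2^m entries of two bytes each, which is exactly the 2 * 2 * 2^m byte count whose L1 footprint is discussed above.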
In Table 4 we report CPU cycle counts for the decapsulation step of HQC across the different security levels, with either the original BCH implementation or the constant time variant. One can see that our constant time implementation only adds a small overhead, between 3.21% and 11.06%.

Table 8. Decoding of some BCH codes with multiplication by pclmulqdq.

Conclusion
In this work, we have highlighted a correlation between the weight of the error to be decoded and the running time of BCH decoding when Berlekamp's simplified algorithm is implemented straightforwardly. Building on that correlation, we devised an efficient chosen ciphertext timing attack against HQC, implemented it in software, and carried it out against the different security levels of HQC. The attack is very efficient, as it recovers the secret key y often enough within a couple of iterations, and its overall complexity is O(n).

Appendix B. Constant time ELP computation
The C function below computes the error locator polynomial using a constant time version of Berlekamp's simplified algorithm. It has the following features:
• The constant PARAM_DELTA is the correction capacity t > 1 of the BCH code.
• Elements of F_2^m are represented by uint16_t as polynomials (m ≤ 15).
• gf_mul is the Galois field multiplication; it takes two elements and returns their product.
• gf_inverse computes the inverse of an element; it returns 0 for input 0.
• syndromes is an array of size 2*PARAM_DELTA storing the 2t syndromes.
• Instead of maintaining a list of σ^(i)(X), we update in place both σ(X) (array sigma) and the corrective term X^(2(µ−ρ)) σ^(ρ)(X) (array X_sigma_p).
• We do not care about σ(X) if its degree exceeds PARAM_DELTA [15], so we do not care about X_sigma_p if its degree exceeds PARAM_DELTA either.
• sigma_copy serves as a temporary copy of sigma in case we need it to update X_sigma_p. We only need to save the first PARAM_DELTA−1 coefficients of sigma.
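The masking style underlying such a constant time implementation can be illustrated with the following branchless helpers (hypothetical names, not the exact appendix code): in Berlekamp's algorithm, whether the corrective term is applied depends on whether the discrepancy is zero, and a mask lets the same instructions execute either way.

```c
#include <stdint.h>

/* Returns 0xFFFF if x != 0, and 0 otherwise, without branching */
static uint16_t mask_nonzero(uint16_t x) {
    uint32_t t = (uint32_t)x;
    /* the top bit of (t | -t) is set exactly when t is nonzero */
    return (uint16_t)(-((t | (uint32_t)-t) >> 31));
}

/* Constant time select: returns a if mask == 0xFFFF, and b if mask == 0 */
static uint16_t ct_select(uint16_t mask, uint16_t a, uint16_t b) {
    return (uint16_t)((a & mask) | (b & (uint16_t)~mask));
}
```

A coefficient update guarded by a secret condition then becomes, e.g., `sigma[i] = ct_select(mask_nonzero(d), sigma[i] ^ corr, sigma[i]);`, where d is the discrepancy and corr the corrective term: the arithmetic is always performed, and the mask decides whether its result is kept.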