1 Introduction

Cryptographers’ deep interest in lattices has multiple reasons. Besides possessing highly desirable post-quantum security features, lattice-based cryptography relies on simple structures, allows efficient asymptotic complexities, and is quite flexible in practice. Beyond encryption/signature schemes [7, 15, 18, 21, 22, 25], identity-based encryption [8], and multilinear maps [10, 16], lattices are also involved in homomorphic encryption (HE). The discovery of this property by Gentry in 2009 [12], through the use of ideal rings, was a major breakthrough which has opened the door to many opportunities in terms of applications, especially when coupled with cloud computing.

HE is generally built on a basic layer, a Somewhat Homomorphic Encryption scheme (SHE). Such a scheme allows us to compute a limited number of additions and multiplications on ciphertexts. This limitation stems from the fact that any ciphertext contains an inherent noise which grows after each homomorphic operation. Beyond a certain limit, the noise becomes too large to allow correct decryption. This drawback may be tackled by using bootstrapping, which however constitutes a bottleneck in terms of efficiency. Further improvements in noise management [5, 6] have been suggested so that, in practice, and given an applicative context, it may be wiser to select an efficient SHE with parameters enabling a sufficient number of operations. For instance, schemes like FV [9] and YASHE [4] have been implemented and tested for evaluating the SIMON Feistel cipher [17]. Among today’s most practical SHE schemes, FV is arguably one of the most competitive. It is currently being considered by major stakeholders, such as the European H2020 HEAT consortium [26], as a viable candidate for practical homomorphic encryption.

Our Contribution. This work focuses on the practical improvement of SHE schemes, in particular FV. Even though the security of YASHE has recently been called into question [1], that scheme can also benefit from the present work. These schemes handle elements of a polynomial ring \(\mathbb {Z}_q[X]/(X^n+1)\). The modulus q can be chosen as a product of small moduli fitting practical hardware requirements (machine word size, etc.), which avoids the need for multi-precision arithmetic in almost the whole scheme. However, this CRT representation (a.k.a. Residue Number System, or RNS) is hardly compatible with two core operations: the coefficient-wise division and rounding occurring in multiplication and decryption, and a noise-management technique within homomorphic multiplication that relies on access to a positional number system.

We show how to efficiently avoid any switch between RNS and the positional system for performing these operations. We present a full RNS variant of FV and analyse the new bounds on noise growth. A software implementation highlights the practical benefits of the new RNS variant.

It is important to note that this work is related to the arithmetic at the coefficient level. Thus, the security features of the original scheme are not modified.

Outline. Section 2 provides some preliminaries about FV and RNS. Section 3 presents a full RNS variant of decryption. Section 4 gives a full RNS variant of homomorphic multiplication. Results of a software implementation are presented in Sect. 5. Finally, some conclusions are drawn.

2 Preliminaries

Proofs of lemmas, propositions and theorems of this article can be found in its extended version [2].

Context. High-level operations occur in a polynomial ring \(\mathcal {R}=\mathbb {Z}[X]\) \(/(X^n+1)\) with n a power of 2. \(\mathcal {R}\) is one-to-one mapped to integer polynomials of degree strictly smaller than n. Most of the time, elements of \(\mathcal {R}\) are denoted by lower-case boldface letters and identified with their coefficients. Polynomial arithmetic is done modulo \((X^n+1)\). The infinity norm of \(\varvec{a}=(a_0,\ldots ,a_{n-1})\in \mathcal {R}\) is defined by \(\Vert \varvec{a}\Vert _\infty =\max _{0\leqslant i\leqslant n-1}(|a_i|)\). Ciphertexts will be managed as polynomials (of degree 1) in \(\mathcal {R}[Y]\). For \(\texttt {ct}\in \mathcal {R}[Y]\), we write \(\Vert \texttt {ct}\Vert _\infty =\max _i\Vert \texttt {ct}[i]\Vert _\infty \), \(\texttt {ct}[i]\) being the coefficient of degree i in Y. The multiplicative law of \(\mathcal {R}[Y]\) is denoted by \(\star \).

Behind lattice-based cryptosystems in general, and FV in particular, lies the principle of noisy encryption. Additionally to the plaintext, a ciphertext contains a noise (revealed by using the secret key) which grows after each homomorphic operation. Since the homomorphic multiplication involves multiplications in \(\mathcal {R}\), it is crucial that the size of a product in \(\mathcal {R}\) does not increase too much. This increase is related to the ring constant \(\delta =\sup \{\Vert \varvec{f}\cdot \varvec{g}\Vert _\infty /(\Vert \varvec{f}\Vert _\infty \Vert \varvec{g}\Vert _\infty ): (\varvec{f},\varvec{g})\in (\mathcal {R}\setminus \{\varvec{0}\})^2\}\). It means that \(\Vert \varvec{f}\cdot \varvec{g}\Vert _\infty \leqslant \delta \Vert \varvec{f}\Vert _\infty \cdot \Vert \varvec{g}\Vert _\infty \). For the specific ring \(\mathcal {R}\) used here, \(\delta \) is equal to n.

For our subsequent discussions on decryption and homomorphic multiplication, we denote the “Division and Rounding” in \(\mathcal {R}[Y]\) (depending on parameters t and q, which are defined thereafter) by:

$$\begin{aligned} \texttt {DR}_i:\texttt {ct}=\displaystyle \sum _{j=0}^i\texttt {ct}[j]Y^j\in \mathcal {R}[Y]\mapsto \displaystyle \sum _{j=0}^i\left\lfloor \frac{t}{q}{} \texttt {ct}[j]\right\rceil Y^j\in \mathcal {R}[Y]. \end{aligned}$$
(1)

The notation \(\lfloor \tfrac{t}{q}\varvec{c}\rceil \), for any \(\varvec{c}\in \mathcal {R}\) (e.g. \(\texttt {ct}[j]\) in (1)), means a coefficient-wise division-and-rounding.
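For concreteness, this coefficient-wise operator can be computed in exact integer arithmetic; the following Python sketch (toy values of t and q, not the scheme’s actual sizes, and a round-half-up convention) illustrates it:

```python
# Coefficient-wise division-and-rounding of DR_0 in exact integer arithmetic
# (toy t, q; round-half-up): floor(t*c/q + 1/2) = (2*t*c + q) // (2*q).
t, q = 16, 4199

def dr0(coeffs):
    return [(2 * t * c + q) // (2 * q) for c in coeffs]

# t*262/q ~ 0.998 rounds to 1; t*4198/q ~ 15.996 rounds to 16
assert dr0([0, 262, 4198]) == [0, 1, 16]
```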

Plaintext and Ciphertext Spaces. The plaintext space is determined by an integer parameter t (\(t\geqslant 2\)). A message is an element of \(\mathcal {R}_t=\mathcal {R}/(t\mathcal {R})\), i.e. a polynomial of degree at most \(n-1\) with coefficients in \(\mathbb {Z}_t\). The notation \([\varvec{m}]_t\) (resp. \(|\varvec{m}|_t\)) means that coefficients lie in \([-t/2,t/2)\) (resp. [0, t)). Ciphertexts will lie in \(\mathcal {R}_q[Y]\) with q a parameter of the scheme. On one side, some considerations about security imply a relationship between q and n which, for a given degree n, establish an upper bound for \(\log _2(q)\) (cf. (6) in [9]). On the other side, the ratio \(\varDelta =\lfloor \tfrac{q}{t}\rfloor \) will basically determine the maximal number of homomorphic operations which can be done in a row to ensure a correct decryption.

RNS Representation. Beyond the upper bound on \(\log _2(q)\) due to security requirements, the composition of q has no restriction. So, q can be chosen as a product of small pairwise coprime moduli \(q_1\ldots q_k\). The reason for such a choice is the Chinese Remainder Theorem (CRT) which offers a ring isomorphism \(\mathbb {Z}_q\overset{\sim }{\rightarrow }\textstyle \prod _{i=1}^k\mathbb {Z}_{q_i}\). Thus, the CRT implies the existence of a non-positional number system (RNS) in which large integers (\(\bmod \ q\)) are mapped to sets of small residues. Furthermore, the arithmetic modulo q over large integers can be substituted by k independent arithmetics in the small rings \(\mathbb {Z}_{q_i}\). The isomorphism can be naturally extended to polynomials: \(\mathcal {R}_q\simeq \mathcal {R}_{q_1}\times \ldots \times \mathcal {R}_{q_k}\). It means that RNS can be used at the coefficient level to accelerate the arithmetic in \(\mathcal {R}_q\).
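The isomorphism and its use can be illustrated by the following Python sketch (toy moduli; scalar values stand in for polynomial coefficients):

```python
# CRT/RNS on toy moduli: arithmetic modulo q = q_1*...*q_k splits into k
# independent channels, with no carries crossing between residues.
from math import prod

q_base = [13, 17, 19]                  # pairwise coprime small moduli
q = prod(q_base)                       # q = 4199

def to_rns(x):
    return [x % qi for qi in q_base]

def from_rns(res):                     # CRT reconstruction
    x = 0
    for xi, qi in zip(res, q_base):
        hat = q // qi
        x += (xi * pow(hat, -1, qi) % qi) * hat
    return x % q

a, b = 1234, 2345
mul = [(x * y) % qi for x, y, qi in zip(to_rns(a), to_rns(b), q_base)]
assert from_rns(mul) == (a * b) % q
```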

In the rest of the paper, the letter q may refer either to the product \(q_1\ldots q_k\) or to the “RNS base” \(\{q_1,\ldots ,q_k\}\). Symbol \(\nu \) denotes the “width” of the moduli. In other words, from now on, any modulus m (should it belong to q or to any other RNS base) is assumed to verify \(m<2^\nu \).

Asymmetric Keys. The secret key \(\varvec{s}\) is sampled from \(\mathcal {R}\) according to a discrete distribution \(\chi _{key}\) on \(\mathcal {R}\) (in practice, bounded by \(B_{key}=1\), i.e. \(\Vert \varvec{s}\Vert _\infty \leqslant 1\)).

For creating the public key, an “error” distribution \(\chi _{err}\) over \(\mathcal {R}\) is used. In practice, this is a discrete distribution statistically close to a Gaussian (with mean 0 and standard deviation \(\sigma _{err}\)) truncated at \(B_{err}\) (e.g. \(B_{err}=6\sigma _{err}\)). \(\chi _{err}\) is related to the hardness of the underlying (search version of the) RLWE problem (in which the goal is, given samples \(\left( [-(\varvec{a}_i\varvec{s}+\varvec{e}_i)]_q,\varvec{a}_i\right) \) with \(\varvec{e}_i\leftarrow \chi _{err}\) and \(\varvec{a}_i\leftarrow \mathcal {U}(\mathcal {R}_q)\), to find \(\varvec{s}\); \(\mathcal {U}(\mathcal {R}_q)\) is the uniform distribution on \(\mathcal {R}_q\)). The public key pk is created as follows: sample \(\varvec{a}\leftarrow \mathcal {U}(\mathcal {R}_q)\) and \(\varvec{e}\leftarrow \chi _{err}\), then output \(\texttt {pk}=(\varvec{p}_0,\varvec{p}_1)=([-(\varvec{as}+\varvec{e})]_q,\varvec{a})\).

Encryption, Addition, Inherent Noise of a Ciphertext. Encryption and homomorphic addition are already fully compliant with RNS arithmetic. They are recalled hereafter:

  • \(\texttt {Enc}_\texttt {FV}([\varvec{m}]_t)\): from samples \(\varvec{e}_1,\varvec{e}_2\leftarrow \chi _{err}\), \(\varvec{u}\leftarrow \chi _{key}\), output

    \(\texttt {ct}=(\texttt {ct}[0],\texttt {ct}[1])=\left( [\varDelta [\varvec{m}]_t+\varvec{p}_0\varvec{u} +\varvec{e}_1]_q,[\varvec{p}_1\varvec{u}+\varvec{e}_2]_q\right) \).

  • \(\texttt {Add}_\texttt {FV}(\texttt {ct}_1,\texttt {ct}_2)\): output \(\left( [\texttt {ct}_1[0]+\texttt {ct}_2[0]]_q,[\texttt {ct}_1[1]+\texttt {ct}_2[1]]_q\right) \).

By definition, the inherent noise of \(\texttt {ct}\) (encrypting \([\varvec{m}]_t\)) is the polynomial \(\varvec{v}\) such that \([\texttt {ct}(\varvec{s})]_q=[\texttt {ct}[0]+\texttt {ct}[1]\varvec{s}]_q=[\varDelta [\varvec{m}]_t+\varvec{v}]_q\). Thus, it is revealed by evaluating \(\texttt {ct}\in \mathcal {R}_q[Y]\) on the secret key \(\varvec{s}\).

Elementary Operations. A basic word will fit in \(\nu \) bits. In RNS, an “inner modular multiplication” (IMM) in a small ring like \(\mathbb {Z}_m\) is a core operation. If EM stands for an elementary multiplication of two words, an IMM is in practice costlier than an EM, but its cost can be well controlled. For instance, the moduli provided in the NFLlib library [19] (cf. Sect. 5) enable a modular reduction which reduces to one \(\texttt {EM}\) followed by a multiplication modulo \(2^\nu \). Furthermore, the cost of an inner reduction can be limited by using lazy reduction, e.g. during the RNS base conversions used throughout this paper. NTT and invNTT denote the Number Theoretic Transform and its inverse in a ring \(\mathcal {R}_{m}\) for a modulus m. They enable an efficient polynomial multiplication (\(\texttt {NTT},\texttt {invNTT}\in \mathcal {O}(n\log _2(n))\)).

3 Towards a Full RNS Decryption

This section deals with the creation of a variant of the original decryption function \(\texttt {Dec}_\texttt {FV}\), which will only involve RNS representation. The definition of \(\texttt {Dec}_\texttt {FV}\) is recalled hereafter.

  • \(\texttt {Dec}_\texttt {FV}(\texttt {ct}=(\varvec{c}_0,\varvec{c}_1)\in \mathcal {R}_q[Y])\): compute \([\texttt {DR}_0([\texttt {ct}(\varvec{s})]_q)]_t=\left[ \left\lfloor \frac{t}{q}[\varvec{c}_0+\varvec{c}_1\varvec{s}]_q\right\rceil \right] _t\).

The idea is that computing \([\varvec{c}_0+\varvec{c}_1\varvec{s}]_q=[\varDelta [\varvec{m}]_t+\varvec{v}]_q\) reveals the noise. If this noise is small enough, and given that \([\varvec{m}]_t\) has been scaled by \(\varDelta \), the function \(\texttt {DR}_{0}\) cancels the noise while scaling \(\varDelta [\varvec{m}]_t\) down to recover \([\varvec{m}]_t\). Concretely, decryption is correct as long as \(\Vert \varvec{v}\Vert _\infty < (\varDelta -|q|_t)/2\), i.e. the noise must not exceed this bound after homomorphic operations.
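This correctness condition can be checked on a scalar toy example in Python (illustrative q and t; a single coefficient stands in for the polynomials, and messages are taken centered in \([-t/2,t/2)\)):

```python
# Scalar toy of FV decryption correctness: round((t/q)*[Delta*m + v]_q) mod t
# recovers m whenever |v| < (Delta - |q|_t)/2.
q, t = 4199, 16
Delta, q_mod_t = q // t, q % t         # Delta = 262, |q|_t = 7

def cmod(x, m):                        # centered remainder in [-m/2, m/2)
    return (x + m // 2) % m - m // 2

# (Delta - |q|_t)/2 = 127.5, so any noise |v| <= 127 must decrypt correctly
for m in range(-t // 2, t // 2):
    for v in (-127, 0, 127):
        c = (Delta * m + v) % q
        assert cmod(round(t * cmod(c, q) / q), t) == m
```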

The division-and-rounding operation makes \(\texttt {Dec}_\texttt {FV}\) hardly compatible with RNS at first sight. Because RNS is of a non-positional nature, only exact integer division can be performed naturally (by multiplying by a modular inverse), which is not the case here. Moreover, the rounding operation involves comparisons, which require switching from RNS to a positional system anyway, be it the classical binary system or a mixed-radix one [11]. To get an efficient RNS variant of \(\texttt {Dec}_\texttt {FV}\), we use an idea of [3]. To this end, we introduce relevant RNS tools.

3.1 Fast RNS Base Conversion

At some point, the decryption requires, among other things, a polynomial to be converted from \(\mathcal {R}_q\) to \(\mathcal {R}_t\). To perform such operations as efficiently as possible, we suggest using a “fast base conversion”. In order to convert the residues of \(x\in [0,q)\) from base q to a base \(\mathcal {B}\) (e.g. \(\{t\}\)) coprime to q, we compute:

$$\begin{aligned} \texttt {FastBconv}(x,q,\mathcal {B})=\left( \sum _{i=1}^k \left| x_i\frac{q_i}{q}\right| _{q_i}\times \frac{q}{q_i}\bmod m\right) _{m\in \mathcal {B}}. \end{aligned}$$
(2)

This conversion is fast precisely because it is approximate: the sum should ideally be reduced modulo q in order to provide the exact value x, but computing the required correction term \(\alpha _x\) in RNS involves costly operations. This step is therefore bypassed, at the cost of an approximate result: (2) provides \(x+\alpha _x q\) for some integer \(\alpha _x\in [0,k-1]\).

\(\texttt {FastBconv}\) naturally extends to polynomials of \(\mathcal {R}\) by applying it coefficient-wise.
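A scalar Python sketch of (2) follows (toy moduli; \(\alpha _x\) is exposed here only for illustration, since in RNS it is precisely what we avoid computing):

```python
# FastBconv (2) on scalars: the unreduced CRT sum yields x + alpha_x*q,
# with alpha_x in [0, k-1].
from math import prod

q_base, B = [13, 17, 19], [23, 29]     # base q and a coprime base B
q = prod(q_base)

def fast_bconv(x_res, m):
    s = 0
    for xi, qi in zip(x_res, q_base):
        hat = q // qi                  # q/q_i
        s += (xi * pow(hat, -1, qi) % qi) * hat
    return s % m, s // q               # residue mod m, and alpha_x

x = 3000
for m in B:
    r, alpha = fast_bconv([x % qi for qi in q_base], m)
    assert 0 <= alpha < len(q_base)
    assert r == (x + alpha * q) % m
```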

3.2 Approximate RNS Rounding

The above mentioned fast conversion allows us to efficiently compute an approximation of \(\lfloor \tfrac{t}{q}[\varvec{c}_0+\varvec{c}_1\varvec{s}]_q\rceil \) modulo t. The next step consists in correcting this approximation.

First, we remark that \(|\texttt {ct}(\varvec{s})|_q\) can be used instead of \([\texttt {ct}(\varvec{s})]_q\). Indeed, the difference between these two polynomials is a multiple of q, so the division-and-rounding turns it into a polynomial multiple of t, which is cancelled by the final reduction modulo t. Second, a rounding would involve, at some point, a comparison, which is hardly compatible with RNS and is therefore avoided. We thus propose to simplify the computation, albeit at the price of possible errors, by replacing the rounding with a flooring. To this end, we use the following formula:

$$\left\lfloor \frac{t}{q}|\texttt {ct}(\varvec{s})|_q\right\rfloor =\frac{t |\texttt {ct}(\varvec{s})|_q-|t\cdot \texttt {ct}(\varvec{s})|_q}{q}.$$

The division is now exact, so it can be done in RNS. Since this computation has to be done modulo t, the term \(t|\texttt {ct}(\varvec{s})|_q\) cancels. Furthermore, the term \((|t\cdot \texttt {ct}(\varvec{s})|_q\bmod t)\) is obtained through a fast conversion.
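This exact-division identity is easily checked on scalars (toy t and q for illustration):

```python
# The exact-division form of the flooring: t*|x|_q - |t*x|_q is a multiple
# of q, so the quotient can be computed in RNS by multiplying with q^{-1}
# modulo each small modulus.
t, q = 16, 4199

for x in [0, 1, 1000, 4198]:
    num = t * (x % q) - (t * x) % q
    assert num % q == 0                       # division by q is exact
    assert num // q == (t * (x % q)) // q     # equals floor(t*|x|_q / q)
```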

Lemma 1 sums up the strategy, with \(|\texttt {ct}(\varvec{s})|_q\) replaced by \(\gamma |\texttt {ct}(\varvec{s})|_q\), where \(\gamma \) is an integer which will help correct the approximation error.

Lemma 1

Let \(\texttt {ct}\) be such that \([\texttt {ct}(\varvec{s})]_q=\varDelta [\varvec{m}]_t+\varvec{v}+q\varvec{r}\), and let \(\varvec{v_c}:=t\varvec{v}-[\varvec{m}]_t|q|_t\). Let \(\gamma \) be an integer coprime to q. Then, for \(m\in \{t,\gamma \}\), the following equalities hold modulo m:

$$\begin{aligned} \texttt {FastBconv}(|\gamma t \cdot \texttt {ct}(\varvec{s})|_q,q,\{t,\gamma \})\times |-q^{-1}|_m&= \left\lfloor \gamma \frac{t}{q}[\texttt {ct}(\varvec{s})]_q\right\rceil -\varvec{e}\nonumber \\&=\gamma \left( [\varvec{m}]_t+t\varvec{r}\right) +\left\lfloor \gamma \frac{\varvec{v_c}}{q}\right\rceil -\varvec{e} \end{aligned}$$
(3)

where each integer coefficient of the error polynomial \(\varvec{e}\in \mathcal {R}\) lies in [0, k].

The error \(\varvec{e}\) is due to the fast conversion and to the replacement of the rounding by a flooring. It is the same error for the residues modulo t and \(\gamma \). The residues modulo \(\gamma \) will enable a fast correction of this error and of the term \(\lfloor \gamma \tfrac{\varvec{v_c}}{q}\rceil \) at the same time. Also, note that \(\varvec{r}\) vanishes since it is multiplied by both t and \(\gamma \).

3.3 Correcting the Approximate RNS Rounding

The next step is to show how \(\gamma \) in (3) can be used to correct the term \((\lfloor \gamma \tfrac{\varvec{v_c}}{q}\rceil -\varvec{e})\). It can be done efficiently when the polynomial \(\varvec{v_c}\) is such that \(\Vert \varvec{v_c}\Vert _\infty \leqslant q(\tfrac{1}{2}-\varepsilon )\), for some real number \(\varepsilon \in (0,1/2]\).

Lemma 2

Let \(\Vert \varvec{v_c}\Vert _\infty \leqslant q(\tfrac{1}{2}-\varepsilon )\), \(\varvec{e}\in \mathcal {R}\) with coefficients in [0, k], and \(\gamma \) an integer. Then,

$$\begin{aligned} \gamma \varepsilon \geqslant k\Rightarrow \left[ \left\lfloor \gamma \frac{\varvec{v_c}}{q}\right\rceil -\varvec{e}\right] _\gamma =\left\lfloor \gamma \frac{\varvec{v_c}}{q}\right\rceil -\varvec{e}. \end{aligned}$$
(4)

Lemma 2 enables an efficient and correct RNS rounding as long as \(k(\tfrac{1}{2}-\tfrac{\Vert \varvec{v_c}\Vert _\infty }{q})^{-1}\sim \gamma \) has the size of a modulus. Concretely, one computes (3) and uses the centered remainder modulo \(\gamma \) to obtain \(\gamma \left( [\varvec{m}]_t+t\varvec{r}\right) \) modulo t, which reduces to \(\gamma [\varvec{m}]_t\bmod t\). And it remains to multiply by \(|\gamma ^{-1}|_t\) to recover \([\varvec{m}]_t\).
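Lemma 2 can be exercised on scalars (toy parameters, with the boundary case \(\gamma \varepsilon =k\)):

```python
# Lemma 2, scalar toy version: when gamma*eps >= k, the quantity
# w = round(gamma*vc/q) - e lies in [-gamma/2, gamma/2), so the centered
# remainder mod gamma recovers it exactly.
import random

q, k, gamma = 4199, 3, 64
eps = k / gamma                        # boundary case gamma*eps = k

def cmod(x, m):                        # centered remainder in [-m/2, m/2)
    return (x + m // 2) % m - m // 2

random.seed(0)
bound = int(q * (0.5 - eps))           # ||vc|| <= q*(1/2 - eps)
for _ in range(100):
    vc = random.randint(-bound, bound)
    e = random.randint(0, k)
    w = round(gamma * vc / q) - e
    assert cmod(w, gamma) == w
```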

3.4 A Full RNS Variant of \(\texttt {Dec}_\texttt {FV}\)

The new variant of the decryption is detailed in Algorithm 1. The main modification for the proposed RNS decryption is due to Lemma 2. As stated by Theorem 1, given a \(\gamma \), the correctness of rounding requires a new bound on the noise to make the \(\gamma \)-correction technique successful.

Theorem 1

Let \(\texttt {ct}(\varvec{s})=\varDelta [\varvec{m}]_t+\varvec{v}\ (\bmod \ q)\). Let \(\gamma \) be a positive integer coprime to t and q such that \(\gamma >2k/(1-\frac{t|q|_t}{q})\). For Algorithm 1 returning \([\varvec{m}]_t\), it suffices that \(\varvec{v}\) satisfies the following bound:

$$\begin{aligned} \Vert \varvec{v}\Vert _\infty \leqslant \frac{q}{t}\left( \frac{1}{2}-\frac{k}{\gamma }\right) -\frac{|q|_t}{2}. \end{aligned}$$
(5)

There is a trade-off between the size of \(\gamma \) and the bound in (5). Ideally, \(\gamma \sim 2k\) at the price of a (a priori) quite small bound on the noise. But by choosing \(\gamma \sim 2^{p+1} k\) for \(p<\nu -1-\lceil \log _2(k)\rceil \) (i.e. \(\gamma <2^\nu \) is a standard modulus), the bound \((\varDelta (1-2^{-p})-|q|_t)/2\) for a correct decryption should be close to the original bound \((\varDelta -|q|_t)/2\) for practical values of \(\nu \). A concrete estimation of \(\gamma \) in Sect. 5.1 will show that \(\gamma \) can be chosen very close to 2k in practice, and thus fitting on a basic word by far.

[Algorithm 1: the full RNS variant of the decryption]

3.5 Staying in RNS is Asymptotically Better

In any decryption technique, \((\texttt {ct}(\varvec{s})\bmod q)\) has to be computed first. To optimize this polynomial product, one basically performs \(k\texttt {NTT}\rightarrow kn\texttt {IMM}\rightarrow k\texttt {invNTT}\). For next steps, a simple strategy is to compute \((\lfloor \tfrac{t}{q}[\texttt {ct}(\varvec{s})]_q\rceil \bmod t)\) by doing an RNS-to-binary conversion in order to perform the division and rounding. By denoting \(\varvec{x}_i=|\texttt {ct}(\varvec{s})\tfrac{q_i}{q}|_{q_i}\), one computes \(\textstyle \sum _{i=1}^k \varvec{x}_i\frac{q}{q_i}\bmod q\), compares it to q/2 so as to center the result, and performs division and rounding. Hence, the division-and-rounding requires \(\mathcal {O}(k^2 n)\texttt {EM}\). In practice, security analysis (cf. e.g. [4, 9, 17]) requires that \(k\nu =\lceil \log _2(q)\rceil \in \mathcal {O}(n)\). So, the cost of leaving RNS to access a positional system is dominant in the asymptotic computational complexity.

Staying in RNS yields a better asymptotic complexity. Indeed, it is easy to see that Algorithm 1 requires \(\mathcal {O}(kn)\) operations (excluding the polynomial product), so the cost of the NTT is dominant in this case. By considering \(k\in \mathcal {O}(n)\), we deduce \(\mathcal {C}(\texttt {Dec}_\texttt {FV})\in \mathcal {O}(n^3)\), while \(\mathcal {C}(\texttt {Dec}_\texttt {RNS})\in \mathcal {O}(n^2\log _2(n))\). But the hidden constant in “\(k\in \mathcal {O}(n)\)” is small, and the NTT, common to both variants, should prevent any noticeable divergence (cf. Sect. 5.4) for practical parameter ranges.

In order to provide optimized RNS variants of decryption, we make two remarks. First, the reduction modulo q is unnecessary. Indeed, any extra multiple of q in the sum \(\textstyle \sum _{i=1}^k \varvec{x}_i\frac{q}{q_i}\) is multiplied by \(\tfrac{t}{q}\), making the resulting term a multiple of t, which is not affected by the rounding and is finally cancelled modulo t. Second, it is possible to precompute \(\tfrac{t}{q}\) as a multiprecision floating point number in order to avoid a costly integer division. But given the first remark, it suffices to precompute the floating point numbers \(\mathcal {Q}_i\sim \tfrac{t}{q_i}\) with \(2\nu +\log _2(k)-\log _2(t)\) bits (\({\sim }2\) words) of precision. In this case, using standard double or quadruple (depending on \(\nu \)) precision is sufficient. Finally, it is sufficient to compute \(\lfloor \textstyle \sum _{i=1}^k\varvec{x}_i\mathcal {Q}_i\rceil \bmod t\). This represents about \(2kn\texttt {EM}\). Reducing modulo t is nearly free of cost when t is a power of 2.
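The floating-point variant can be sketched on scalars as follows (toy parameters; double precision is ample at this size, and the extra \(\alpha t\) carried by the unreduced sum vanishes modulo t):

```python
# Floating-point variant on scalars: round(sum_i x_i * (t/q_i)) mod t, with
# x_i = |X * (q_i/q)|_{q_i}, matches round(t*X/q) mod t.
from math import prod

q_base, t = [13, 17, 19], 16
q = prod(q_base)
Q = [t / qi for qi in q_base]          # precomputed floating-point Q_i ~ t/q_i

def dec_float(X):
    total = 0.0
    for qi, Qi in zip(q_base, Q):
        hat = q // qi
        xi = (X % qi) * pow(hat, -1, qi) % qi   # x_i = |X*(q_i/q)|_{q_i}
        total += xi * Qi
    return round(total) % t

for X in [0, 1, 2100, 4198]:
    assert dec_float(X) == round(t * X / q) % t
```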

A second optimized RNS variant, using only integer arithmetic, is based on Algorithm 1, in which \(\gamma \) is assumed coprime to t. One can be slightly more efficient by noticing that this coprimality assumption can be avoided, because the division by \(\gamma \) is exact. To this end, the for loop can be performed modulo \(\gamma \times t\). For instance, even if t is a power of 2, one can choose \(\gamma \) as a power of 2 too, and use the following lemma to finish the decryption efficiently.

Lemma 3

Let \(\gamma \) be a power of 2. Let \(\varvec{z}:=|\gamma [\varvec{m}]_t+\lfloor \gamma \tfrac{\varvec{v_c}}{q}\rceil -\varvec{e}|_{\gamma t}\) coming from (3) when computed modulo \(\gamma t\). If \(\gamma \) satisfies (4), then (\(\gg \) denotes the right bit-shifting, and \(\varvec{v}_1\) is the polynomial with all its coefficients being equal to 1)

$$\begin{aligned} \left[ (\varvec{z}+\tfrac{\gamma }{2}\varvec{v}_1)\gg \log _2(\gamma )\right] _t=[\varvec{m}]_t. \end{aligned}$$
(6)

Lemma 3 can be adapted to other values of \(\gamma \). The important remark is that \([\varvec{m}]_t\) is contained in the \(\lfloor \log _2(t)\rfloor +1\) most significant bits of \((\varvec{z}+\tfrac{\gamma }{2}\varvec{v}_1)\bmod \gamma t\). So, by choosing \(\gamma \) as a power of 2, a simple bit shift recovers these bits. Finally, as soon as \(\gamma t\) fits in one word, the cost of this variant (besides the polynomial product) reduces to \(kn\texttt {IMM}\), or simply to \(kn\texttt {EM}\) modulo \(2^{\log _2(\gamma t)}\) whenever t is a power of 2.
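The shift-based recovery of Lemma 3 can be checked on scalars (toy parameters; representatives are taken in \([0,t)\) here instead of the paper’s centered \([\cdot ]_t\), which is equivalent modulo t):

```python
# Lemma 3, scalar toy version: with gamma and t powers of 2, the message is
# read off the top bits of z + gamma/2 by a right shift.
import random

q, t, gamma, k = 4199, 16, 64, 3
shift = gamma.bit_length() - 1         # log2(gamma)

random.seed(1)
bound = int(q * (0.5 - k / gamma))     # noise range allowed by (4)
for _ in range(100):
    m = random.randrange(t)
    vc = random.randint(-bound, bound)
    e = random.randint(0, k)
    z = (gamma * m + round(gamma * vc / q) - e) % (gamma * t)
    assert ((z + gamma // 2) >> shift) % t == m
```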

Remark 1

In the previous discussion, the product \(\gamma t\) is assumed to fit in one machine word to simplify the complexity analysis. However, for some applications, the plaintext modulus t can be bigger than a machine word (e.g. homomorphic neural networks [13], where \(t>2^{80}\)). In such cases, either the plaintexts directly lie in \(\mathcal {R}_t\), or t can be decomposed into a product of smaller moduli \(t_1,\ldots ,t_\ell \), enabling the use of RNS for encoding plaintexts (and then allowing a better homomorphic multiplicative depth for a given dimension n). In the first case, the optimized RNS decryption (given by Lemma 3) remains available, but the residues modulo t must be handled with several words. In the second case, a plaintext is recovered by decrypting its residues modulo each of the \(t_i\)’s. These \(\ell \) decryptions can be done as in Lemma 3, by choosing \(\gamma \) as a power of 2 (whatever the \(t_i\)’s are). Finally, the plaintext is reconstructed from the residues modulo the \(t_i\)’s by a classical RNS-to-binary conversion. However, this conversion only concerns the way the plaintexts are encoded; it is not handled by the RNS decryption described in this paper, which only deals with the representation of ciphertexts (i.e. modulo q).

4 Towards a Full RNS Homomorphic Multiplication

4.1 Preliminaries About \(\texttt {Mult}_\texttt {FV}\)

Below, we recall the main mechanisms of the homomorphic multiplication \(\texttt {Mult}_\texttt {FV}\) from [9]. More precisely, we focus on the variant using version 1 of the relinearisation step. First, we recall two functions whose purpose is to limit an overly fast noise growth during multiplication (they are denoted as in [4]). They are applicable to any \(\varvec{a}\in \mathcal {R}\), for any radix \(\omega \), with the parameter \(\ell _{\omega ,q}=\lfloor \log _{\omega }(q)\rfloor +1\). \(\mathcal {D}_{\omega ,q}\) is a decomposition in radix \(\omega \), while \(\mathcal {P}_{\omega ,q}\) recovers the powers of \(\omega \) which are lost in the decomposition process.

$$\begin{aligned} \forall \varvec{a}\in \mathcal {R},\left\{ \begin{array}{rcl} \mathcal {D}_{\omega ,q}(\varvec{a})&{}=&{}([\varvec{a}]_\omega ,[\lfloor \varvec{a}\omega ^{-1}\rfloor ]_\omega ,\ldots ,[\lfloor \varvec{a}\omega ^{-(\ell _{\omega ,q}-1)}\rfloor ]_\omega )\in \mathcal {R}_\omega ^{\ell _{\omega ,q}} \\ \mathcal {P}_{\omega ,q}(\varvec{a})&{}=&{}([\varvec{a}]_q,[\varvec{a}\omega ]_q,\ldots ,[\varvec{a}\omega ^{\ell _{\omega ,q}-1}]_q)\in \mathcal {R}_q^{\ell _{\omega ,q}} \end{array}\right. . \end{aligned}$$
(7)

In particular, for any \( (\varvec{a},\varvec{b})\in \mathcal {R}^2\), \(\langle \mathcal {D}_{\omega ,q}(\varvec{a}),\mathcal {P}_{\omega ,q}(\varvec{b})\rangle \equiv \varvec{ab}\bmod q\).
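This inner-product identity can be checked on plain integers (toy parameters; the paper uses centered digits \([\cdot ]_\omega \), but plain digits suffice to verify the identity):

```python
# The gadget pair (7) on integers: D decomposes a into radix-omega digits,
# P scales b by powers of omega, and <D(a), P(b)> = a*b mod q.
q, omega = 4199, 8
ell = 1
while omega ** ell < q:               # l_{omega,q} = floor(log_omega(q)) + 1
    ell += 1

def D(a):
    return [(a // omega ** j) % omega for j in range(ell)]

def P(b):
    return [(b * omega ** j) % q for j in range(ell)]

a, b = 3000, 4000
assert sum(d * p for d, p in zip(D(a), P(b))) % q == (a * b) % q
```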

Next, \(\texttt {Mult}_\texttt {FV}\) is built as follows (\(\texttt {rlk}_\texttt {FV}\) is a public relinearisation key):

  • \(\texttt {rlk}_\texttt {FV}=\left( [\mathcal {P}_{\omega ,q}(\varvec{s}^2)-(\overrightarrow{\varvec{e}}+\varvec{s}\overrightarrow{\varvec{a}})]_q,\overrightarrow{\varvec{a}}\right) \) where \(\overrightarrow{\varvec{e}}\leftarrow \chi _{err}^{\ell _{\omega ,q}}\), \(\overrightarrow{\varvec{a}}\leftarrow \mathcal {U}(\mathcal {R}_q)^{\ell _{\omega ,q}}\),

  • \(\texttt {Relin}_\texttt {FV}(\varvec{c}_0,\varvec{c}_1,\varvec{c}_2,\texttt {rlk}_\texttt {FV})\):

    compute \(\left( [\varvec{c}_0+\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2),\texttt {rlk}_\texttt {FV}[0]\rangle ]_q,[\varvec{c}_1+\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2),\texttt {rlk}_\texttt {FV}[1]\rangle ]_q\right) \),

  • \(\texttt {Mult}_\texttt {FV}(\texttt {ct}_1,\texttt {ct}_2)\): denote \(\texttt {ct}_\star =\texttt {ct}_1\star \texttt {ct}_2\) (degree-2 element of \(\mathcal {R}[Y]\)),

    • Step 1: \(\widetilde{\texttt {ct}}_{mult}=[\texttt {DR}_2(\texttt {ct}_{\star })]_q=([\texttt {DR}_0(\texttt {ct}_{\star }[i])]_q)_{i\in \{0,1,2\}}\),

    • Step 2: \(\texttt {ct}_{mult}= \texttt {Relin}_\texttt {FV}(\widetilde{\texttt {ct}}_{mult})\).

There are two main obstacles to a full RNS variant. First, the three calls to \(\texttt {DR}_{0}\) in Step 1, whose context differs from that of the decryption: while in the decryption we work with a noise whose size can be controlled, and reduce a value from base q to \(\{t\}\), here the integer coefficients of the product \(\texttt {ct}_1\star \texttt {ct}_2\) have essentially random magnitude modulo q and have to be brought back to base q. Second, the function \(\mathcal {D}_{\omega ,q}\) (in \(\texttt {Relin}_\texttt {FV}\)) requires, by definition, access to a positional system (in radix \(\omega \)), which is hardly compatible with RNS.

4.2 Auxiliary RNS Bases

Step 1 requires enough moduli to represent any product, over \(\mathcal {R}[Y]\) (i.e. over \(\mathbb {Z}\)), of degree-1 elements of \(\mathcal {R}_q[Y]\). So, we need an auxiliary base \(\mathcal {B}\) in addition to the base q. We assume that \(\mathcal {B}\) contains \(\ell \) moduli (while q contains k). A sufficient size for \(\ell \) will be given later. An extra modulus \(m_{\texttt {sk}}\) is added to \(\mathcal {B}\) to create an extended base \(\mathcal {B}_{\texttt {sk}}\); it will be used for the transition between the new steps 1 and 2. Computing the residues of ciphertexts in \(\mathcal {B}_{\texttt {sk}}\) is done through a fast conversion from q. In order to reduce the extra multiples of q (called “q-overflows” in further discussions) this conversion can produce, a single-modulus base \(\tilde{m}\) is introduced. All these bases are assumed to be pairwise coprime.

Reducing (mod q) a Ciphertext in \(\varvec{\mathcal {B}_{\texttt {sk}}.}\) A FastBconv from q can create q-overflows (i.e. unnecessary multiples of q) in the output. To limit their impact on the noise growth (because of the division by q in Step 1), we give an efficient way to reduce a polynomial \(\varvec{c}+q\varvec{u}\) in \(\mathcal {B}_{\texttt {sk}}\). It should be done prior to each multiplication. For that purpose, we use the residues modulo \(\tilde{m}\), as described in Algorithm 2.

[Algorithm 2: \(\texttt {SmMRq}_{\tilde{m}}\), small Montgomery reduction modulo q]

Lemma 4

On input \(\varvec{c}_m''=|[\tilde{m}\varvec{c}]_q+q\varvec{u}|_m\) for all \(m\in \mathcal {B}_{\texttt {sk}}\cup \{\tilde{m}\}\), with \(\Vert \varvec{u}\Vert _\infty \leqslant \tau \), and given a parameter \(\rho >0\), then Algorithm 2 returns \(\varvec{c}'\) in \(\mathcal {B}_{\texttt {sk}}\) with \(\varvec{c}'\equiv \varvec{c}\bmod q\) and \(\Vert \varvec{c}'\Vert _\infty \leqslant \tfrac{q}{2}(1+\rho )\) if \(\tilde{m}\) satisfies:

$$\begin{aligned} \tilde{m}\rho \geqslant 2\tau +1. \end{aligned}$$
(8)

To use this fast reduction, the ciphertexts have to be handled, in base q, in the Montgomery [20] representation with respect to \(\tilde{m}\) (i.e. \(|\tilde{m}\varvec{c}|_q\) instead of \(|\varvec{c}|_q\)). This can be done at no extra cost during the base conversions (in (2), multiply the residues of \(\varvec{c}\) by the precomputed \(|\tfrac{\tilde{m}q_i}{q}|_{q_i}\) instead of \(|\tfrac{q_i}{q}|_{q_i}\)). Since \(\{\tilde{m}\}\) is a single-modulus base, the conversion of \(\varvec{r}_{\tilde{m}}\) from \(\{\tilde{m}\}\) to \(\mathcal {B}_{\texttt {sk}}\) (l. 3 of Algorithm 2) is a simple copy when \(\tilde{m}<m_i\). Finally, if \(\texttt {SmMRq}_{\tilde{m}}\) is performed right after a \(\texttt {FastBconv}\) from q (for converting \(|\tilde{m}\varvec{c}|_q\)), \(\tau \) is nothing but k.
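Since Algorithm 2 itself is given as a figure, the following scalar Python sketch is only our reading of the Montgomery-style reduction idea behind \(\texttt {SmMRq}_{\tilde{m}}\) (illustrative parameters; the real algorithm operates on residues in \(\mathcal {B}_{\texttt {sk}}\cup \{\tilde{m}\}\), where the slack \(\rho \) of Lemma 4 appears):

```python
# Hedged scalar sketch: from c'' = m_tilde*c + q*u, a centered step
# r = [-c''*q^{-1}]_{m_tilde} removes the q-overflow u (here exactly,
# because m_tilde > 2*tau + 1).
import random

q, m_tilde, tau = 4199, 256, 6        # illustrative parameters

def cmod(x, m):                       # centered remainder in [-m/2, m/2)
    return (x + m // 2) % m - m // 2

def sm_mrq(c2):
    r = cmod(-c2 * pow(q, -1, m_tilde), m_tilde)
    assert (c2 + q * r) % m_tilde == 0     # the division below is exact
    return (c2 + q * r) // m_tilde

random.seed(2)
for _ in range(100):
    c = random.randrange(q)
    u = random.randint(0, tau)             # q-overflow, |u| <= tau
    assert sm_mrq(m_tilde * c + q * u) == c
```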

4.3 Adapting the First Step

We recall that originally this step is the computation of \([\texttt {DR}_2(\texttt {ct}_\star )]_q\). Unlike for the decryption, a \(\gamma \)-correction technique does not guarantee an exact rounding here. Indeed, for the decryption we wanted to obtain \(\texttt {DR}_0([\texttt {ct}(\varvec{s})]_q)\), and through \(\varvec{s}\) we had access to the noise of \(\texttt {ct}\), over which we have some control. In the present context, we cannot ensure a condition like \(\Vert [t\cdot \texttt {ct}_\star ]_{q}\Vert _\infty \leqslant q(\tfrac{1}{2}-\varepsilon )\), for some \(\varepsilon ^{-1}\sim 2^\nu \), which would enable an efficient \(\gamma \)-correction. Thus, we suggest performing a simple uncorrected RNS flooring. For that purpose, we define:

$$\begin{aligned} \forall \varvec{a}\in \mathcal {R}, \texttt {fastRNSFloor}_q(\varvec{a},m):=(\varvec{a}-\texttt {FastBconv}(|\varvec{a}|_q, q, m))\times |q^{-1}|_m\bmod m. \end{aligned}$$
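A scalar illustration of this flooring may help. In the sketch below (plain integers in place of polynomial coefficients, and nonnegative input, assumptions of this simplification), the fast conversion computes \(\sum _i |x\tfrac{q_i}{q}|_{q_i}\cdot \tfrac{q}{q_i}\), which equals \(|x|_q+\alpha q\) for some \(0\leqslant \alpha \leqslant k-1\); subtracting it from x and dividing by q therefore yields \(\lfloor x/q\rfloor \) up to a small error, absorbed into the \(\varvec{b}_j\) of Lemma 5.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Modular inverse by extended Euclid (gcd(a, m) = 1 assumed).
int64_t inv_mod(int64_t a, int64_t m) {
    int64_t t = 0, nt = 1, r = m, nr = ((a % m) + m) % m;
    while (nr != 0) {
        int64_t q = r / nr, tmp;
        tmp = t - q * nt; t = nt; nt = tmp;
        tmp = r - q * nr; r = nr; nr = tmp;
    }
    return ((t % m) + m) % m;
}

// Unreduced fast base conversion of x, given through its residues mod the
// q_i: returns sum_i |x*(q/q_i)^{-1}|_{q_i} * (q/q_i). Since no final
// reduction mod q is applied, the result is |x|_q + alpha*q, 0 <= alpha < k.
int64_t fast_bconv(int64_t x, const std::vector<int64_t>& qs) {
    int64_t q = 1;
    for (int64_t qi : qs) q *= qi;
    int64_t acc = 0;
    for (int64_t qi : qs) {
        int64_t qhat = q / qi;
        int64_t xi = ((x % qi) + qi) % qi;             // residue of x mod q_i
        acc += ((xi * inv_mod(qhat, qi)) % qi) * qhat; // CRT term in [0, q)
    }
    return acc;
}

// Uncorrected RNS flooring: returns floor(x/q) - alpha for the alpha above.
int64_t fast_rns_floor(int64_t x, const std::vector<int64_t>& qs) {
    int64_t q = 1;
    for (int64_t qi : qs) q *= qi;
    return (x - fast_bconv(x, qs)) / q;  // exact: both terms agree mod q
}
```

For example, with \(q=7\cdot 11=77\) and \(x=500\), the conversion returns \(115=|500|_{77}+77\) (so \(\alpha =1\)), and the computed floor is 5 instead of \(\lfloor 500/77\rfloor =6\): the error stays within \(k-1\).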

Algorithm 2 should be executed first. Consequently, by Lemma 4, if \(\tilde{m}\) satisfies the bound in (8) for a given parameter \(\rho >0\), we may assume that we have, in \(\mathcal {B}_{\texttt {sk}}\), the residues of \(\texttt {ct}_i'\) (\({\equiv }{} \texttt {ct}_i\bmod q\)) such that:

$$\begin{aligned} \Vert \texttt {ct}_\star ':=\texttt {ct}_1'\star \texttt {ct}_2'\Vert _\infty \leqslant \delta \tfrac{q^2}{2}(1+\rho )^2. \end{aligned}$$
(9)

The parameter \(\rho \) is determined in practice (cf. Sect. 5.2). Notice that, in base q, \(\texttt {ct}_i'\) and \(\texttt {ct}_i\) are equal.

Lemma 5

Let the residues of \(\texttt {ct}_i'\equiv \texttt {ct}_i\bmod q\) be given in base \(q\cup \mathcal {B}_{\texttt {sk}}\), with \(\Vert \texttt {ct}_i'\Vert _\infty \leqslant \tfrac{q}{2}(1+\rho )\) for \(i\in \{1,2\}\). Let \(\texttt {ct}_\star '=\texttt {ct}_1'\star \texttt {ct}_2'\). Then, for \(j\in \{0,1,2\}\),

$$\begin{aligned} \texttt {fastRNSFloor}_q(t\cdot \texttt {ct}_\star '[j],\mathcal {B}_{\texttt {sk}}) = \left\lfloor \frac{t}{q}{} \texttt {ct}_\star '[j]\right\rceil +\varvec{b}_j\text { in }\mathcal {B}_{\texttt {sk}},\text { with }\Vert \varvec{b}_{j}\Vert _\infty \leqslant k. \end{aligned}$$
(10)

A first part of the noise growth is detailed in the following proposition.

Proposition 1

Let \(\widetilde{\texttt {ct}}_{mult}=\texttt {DR}_2(\texttt {ct}_{\star }')\) with (9) satisfied, and \(r_\infty :=\tfrac{1+\rho }{2}(1+\delta B_{key})+1\). Let \(\varvec{v}_i\) be the inherent noise of \(\texttt {ct}_i'\). Then \(\widetilde{\texttt {ct}}_{mult}(\varvec{s})=\varDelta \left[ \varvec{m}_1\varvec{m}_2\right] _t+\varvec{\tilde{v}}_{mult}(\bmod \ q)\) with:

$$\begin{aligned} \Vert \varvec{\tilde{v}}_{mult}\Vert _\infty&<\delta t (r_\infty +\tfrac{1}{2})(\Vert \varvec{v}_1\Vert _\infty +\Vert \varvec{v}_2\Vert _\infty )+\tfrac{\delta }{2}\min \Vert \varvec{v}_i\Vert _\infty +\delta t |q|_t(r_\infty +1)\nonumber \\&+\tfrac{1}{2}(3+|q|_t+\delta B_{key}(1+\delta B_{key})). \end{aligned}$$
(11)

4.4 Transitional Step

Lemma 5 states that, so far, we have recovered \(\texttt {DR}_2(\texttt {ct}_{\star }')+\texttt {b}\) in \(\mathcal {B}_{\texttt {sk}}\), where \(\texttt {b}\) denotes \((\varvec{b}_0,\varvec{b}_1,\varvec{b}_2)\). To perform the second step of the multiplication, we need to convert it into base q. This conversion has to be exact, because extra multiples of \(M=m_1\ldots m_\ell \) cannot be tolerated. The modulus \(m_{\texttt {sk}}\) allows us to perform a complete Shenoy-and-Kumaresan-like conversion [24]. The next lemma describes this kind of conversion in a more general context, where the input can be either positive or negative, and larger, in absolute value, than M.

Lemma 6

Let \(\mathcal {B}\) be an RNS base and \(m_{\texttt {sk}}\) be a modulus coprime to \(M=\textstyle \prod _{m\in \mathcal {B}}m\). Let x be an integer such that \(|x|<\lambda M\) (for some real number \(\lambda \geqslant 1\)) and whose residues are given in \(\mathcal {B}_\texttt {sk}\). Let’s assume that \(m_{\texttt {sk}}\) satisfies \(m_{\texttt {sk}}\geqslant 2(|\mathcal {B}|+\lceil \lambda \rceil )\). Let \(\alpha _{\texttt {sk},x}\) be the following integer:

$$\begin{aligned} \alpha _{\texttt {sk},x}:=\left[ (\texttt {FastBconv}(x,\mathcal {B},\{m_\texttt {sk}\})-x_{\texttt {sk}})M^{-1}\right] _{m_{\texttt {sk}}}. \end{aligned}$$
(12)

Then, for x being either positive or negative, the following equality holds:

$$\begin{aligned} \texttt {FastBconvSK}(x,\mathcal {B}_{\texttt {sk}}, q):=\left( \texttt {FastBconv}(x,\mathcal {B},q)-\alpha _{\texttt {sk},x}M\right) \bmod q=x\bmod q. \end{aligned}$$
(13)
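The conversion of Lemma 6 can be sketched on scalars as follows (plain signed integers in place of polynomial coefficients, an assumption of this simplification): the redundant residue \(x_{\texttt {sk}}\) reveals the overflow \(\alpha _{\texttt {sk},x}\) of the fast conversion, which is then subtracted so that no multiple of M survives.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Centered remainder: representative of x mod m in (-m/2, m/2].
int64_t crem(int64_t x, int64_t m) {
    int64_t r = ((x % m) + m) % m;
    return (r > m / 2) ? r - m : r;
}

// Modular inverse by extended Euclid (gcd(a, m) = 1 assumed).
int64_t inv_mod(int64_t a, int64_t m) {
    int64_t t = 0, nt = 1, r = m, nr = ((a % m) + m) % m;
    while (nr != 0) {
        int64_t q = r / nr, tmp;
        tmp = t - q * nt; t = nt; nt = tmp;
        tmp = r - q * nr; r = nr; nr = tmp;
    }
    return ((t % m) + m) % m;
}

// Unreduced fast conversion: S = x + alpha*M for some small alpha >= 0.
int64_t fast_bconv(int64_t x, const std::vector<int64_t>& B) {
    int64_t M = 1;
    for (int64_t m : B) M *= m;
    int64_t acc = 0;
    for (int64_t m : B) {
        int64_t Mhat = M / m;
        int64_t xm = ((x % m) + m) % m;               // residue of x mod m
        acc += ((xm * inv_mod(Mhat, m)) % m) * Mhat;  // CRT term
    }
    return acc;
}

// FastBconvSK: exact conversion of x (possibly negative, |x| < lambda*M)
// from B u {m_sk} to a target modulus p, following (12) and (13).
int64_t fast_bconv_sk(int64_t x, const std::vector<int64_t>& B,
                      int64_t m_sk, int64_t p) {
    int64_t M = 1;
    for (int64_t m : B) M *= m;
    int64_t S = fast_bconv(x, B);               // S = x + alpha*M
    int64_t x_sk = ((x % m_sk) + m_sk) % m_sk;  // redundant residue
    int64_t alpha = crem((S - x_sk) * inv_mod(M, m_sk), m_sk);  // (12)
    return (((S - alpha * M) % p) + p) % p;     // (13): exactly x mod p
}
```

For instance, with \(\mathcal {B}=\{7,11\}\), \(m_{\texttt {sk}}=13\) and target \(p=97\), the negative input \(x=-30\) converts exactly: `fast_bconv_sk(-30, {7, 11}, 13, 97)` returns `67`, which is \(-30 \bmod 97\).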

Consequently, since \(\Vert \texttt {DR}_2(\texttt {ct}_{\star }')+\texttt {b}\Vert _\infty \leqslant \delta t \tfrac{q}{2} (1+\rho )^2+\tfrac{1}{2}+k\), we can establish the following proposition.

Proposition 2

Given a positive real number \(\lambda \), let \(m_{\texttt {sk}}\) and \(\mathcal {B}\) be such that:

$$\begin{aligned} \lambda M>\delta t \tfrac{q}{2} (1+\rho )^2+\tfrac{1}{2}+k,\ m_{\texttt {sk}}\geqslant 2(|\mathcal {B}|+\lceil \lambda \rceil ). \end{aligned}$$
(14)

Let’s assume that \(\texttt {DR}_2(\texttt {ct}_{\star }') + \texttt {b}\) is given in \(\mathcal {B}_{\texttt {sk}}\), with \(\Vert \texttt {b}\Vert _\infty \leqslant k\). Then,

$$\begin{aligned} \texttt {FastBconvSK}(\texttt {DR}_2(\texttt {ct}_{\star }') + \texttt {b},\mathcal {B}_{\texttt {sk}}, q)=\left( \texttt {DR}_2(\texttt {ct}_{\star }') + \texttt {b}\right) \bmod q. \end{aligned}$$

4.5 Adapting the Second Step

At this point, \(\widetilde{\texttt {ct}}_{mult}+\texttt {b}=(\overline{\varvec{c}}_0,\overline{\varvec{c}}_1,\overline{\varvec{c}}_2)\) is known in base q (\(\widetilde{\texttt {ct}}_{mult}:=\texttt {DR}_2(\texttt {ct}_{\star }')\)). We recall that the original second step of homomorphic multiplication would be done as follows:

$$\begin{aligned} \texttt {ct}_{mult}=\left( [\overline{\varvec{c}}_0+\langle \mathcal {D}_{\omega ,q}(\overline{\varvec{c}}_2),\mathcal {P}_{\omega ,q}(\varvec{s}^2)-(\overrightarrow{\varvec{e}}+\varvec{s}\overrightarrow{\varvec{a}})\rangle ]_q,[\overline{\varvec{c}}_1+\langle \mathcal {D}_{\omega ,q}(\overline{\varvec{c}}_2),\overrightarrow{\varvec{a}}\rangle ]_q\right) \end{aligned}$$
(15)

where \(\overrightarrow{\varvec{e}}\leftarrow \chi _{err}^{\ell _{\omega ,q}}\), \(\overrightarrow{\varvec{a}}\leftarrow \mathcal {U}(\mathcal {R}_q)^{\ell _{\omega ,q}}\). The decomposition of \(\overline{\varvec{c}}_2\) in radix \(\omega \) enables a crucial reduction of the noise growth due to the multiplications by the terms \(\varvec{e}_i+\varvec{sa}_i\). It cannot be done directly in RNS as is: it would require a costly switch between RNS and positional representation in radix \(\omega \). However, we can do something very similar. We recall that we can write \(\overline{\varvec{c}}_2=\textstyle \sum _{i=1}^{k}|\overline{\varvec{c}}_2\tfrac{q_i}{q}|_{q_i}\times \tfrac{q}{q_i}(\bmod \ q).\) If \(\omega \) has the same order of magnitude as \(2^\nu \) (the size of the moduli in q), we obtain a similar limitation of the noise growth by using the vectors \(\xi _q(\overline{\varvec{c}}_2)=(|\overline{\varvec{c}}_2\tfrac{q_1}{q}|_{q_1},\ldots ,|\overline{\varvec{c}}_2\tfrac{q_k}{q}|_{q_k})\) and \(\mathcal {P}_{\texttt {RNS},q}(\varvec{s}^2) = (|\varvec{s}^2\tfrac{q}{q_1}|_{q},\ldots ,|\varvec{s}^2\tfrac{q}{q_k}|_{q})\), both in \(\mathcal {R}^k\). This is justified by the following lemma.

Lemma 7

For any \(\varvec{c}\in \mathcal {R}\), \(\langle \xi _q(\varvec{c}),\mathcal {P}_{\texttt {RNS},q}(\varvec{s}^2)\rangle \equiv \varvec{cs}^2\bmod q.\)
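Lemma 7 can be checked on small integers. The sketch below (scalars in place of ring elements, an assumption of this simplification) builds \(\xi _q(c)\) and \(\mathcal {P}_{\texttt {RNS},q}(s^2)\) and returns their inner product modulo q, which the CRT guarantees to be \(cs^2 \bmod q\).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Modular inverse by extended Euclid (gcd(a, m) = 1 assumed).
int64_t inv_mod(int64_t a, int64_t m) {
    int64_t t = 0, nt = 1, r = m, nr = ((a % m) + m) % m;
    while (nr != 0) {
        int64_t q = r / nr, tmp;
        tmp = t - q * nt; t = nt; nt = tmp;
        tmp = r - q * nr; r = nr; nr = tmp;
    }
    return ((t % m) + m) % m;
}

// Inner product <xi_q(c), P_RNS,q(s2)> mod q for scalars c and s2, where
//   xi_q(c)_i      = |c * (q/q_i)^{-1}|_{q_i}
//   P_RNS,q(s2)_i  = |s2 * (q/q_i)|_q
int64_t relin_inner_product(int64_t c, int64_t s2,
                            const std::vector<int64_t>& qs) {
    int64_t q = 1;
    for (int64_t qi : qs) q *= qi;
    int64_t acc = 0;
    for (int64_t qi : qs) {
        int64_t qhat = q / qi;
        int64_t xi = ((((c % qi) + qi) % qi) * inv_mod(qhat, qi)) % qi;
        int64_t p  = (s2 * qhat) % q;
        acc = (acc + xi * p) % q;   // accumulate mod q
    }
    return acc;                     // == (c * s2) mod q by the CRT
}
```

For example, with \(q=7\cdot 11=77\), \(c=25\), \(s^2=4\): `relin_inner_product(25, 4, {7, 11})` returns `23`, which is indeed \(25\cdot 4 \bmod 77\).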

Thus, we replace the public relinearisation key \(\texttt {rlk}_\texttt {FV}\) by the following one: \(\texttt {rlk}_\texttt {RNS}=\left( [\mathcal {P}_{\texttt {RNS},q}(\varvec{s}^2)-(\overrightarrow{\varvec{e}}+\varvec{s}\overrightarrow{\varvec{a}})]_q,\overrightarrow{\varvec{a}}\right) \). The next lemma helps provide a bound on the extra noise introduced by this step.

Lemma 8

Let \(\overrightarrow{\varvec{e}}\leftarrow \chi _{err}^k\), \(\overrightarrow{\varvec{a}}\leftarrow \mathcal {U}(\mathcal {R}_q)^k\), and \(\varvec{c}\in \mathcal {R}\). Then,

$$\begin{aligned} \Vert \left( \langle \xi _q(\varvec{c}),-(\overrightarrow{\varvec{e}}+\overrightarrow{\varvec{a}}\varvec{s})\rangle +\langle \xi _q(\varvec{c}),\overrightarrow{\varvec{a}}\rangle \varvec{s}\right) \bmod q\Vert _\infty < \delta B_{err}k2^\nu . \end{aligned}$$
(16)

Remark 2

It is still possible to add a second level of decomposition (as in the original approach, but applied to the residues) to further limit the noise growth. Furthermore, Sect. 4.6 details how the size of \(\texttt {rlk}_\texttt {RNS}\) can be reduced, in a way similar to what the method described in ([4], Sect. 5.4) achieves for \(\texttt {rlk}_\texttt {FV}\).

Finally, the output of the new variant of multiplication is the following one:

$$\begin{aligned} \texttt {ct}_{mult}=\left( \left[ \overline{\varvec{c}}_0+\langle \xi _q(\overline{\varvec{c}}_2),\mathcal {P}_{\texttt {RNS},q}(\varvec{s}^2)-(\overrightarrow{\varvec{e}}+\overrightarrow{\varvec{a}}\varvec{s})\rangle \right] _q, \left[ \overline{\varvec{c}}_1+\langle \xi _q(\overline{\varvec{c}}_2),\overrightarrow{\varvec{a}}\rangle \right] _q\right) . \end{aligned}$$
(17)

Proposition 3

Let \(\texttt {ct}_{mult}\) be as in (17), and \(\varvec{v}_{mult}\) (resp. \(\widetilde{\varvec{v}}_{mult}\)) the inherent noise of \(\texttt {ct}_{mult}\) (resp. \(\widetilde{\texttt {ct}}_{mult}\)). Then \(\texttt {ct}_{mult}(\varvec{s})=\varDelta \left[ \varvec{m}_1\varvec{m}_2\right] _t+\varvec{v}_{mult}(\bmod \ q)\) with:

$$\begin{aligned} \Vert \varvec{v}_{mult}\Vert _\infty <\Vert \widetilde{\varvec{v}}_{mult}\Vert _\infty +k(1+\delta B_{key}(1+\delta B_{key}))+ \delta k B_{err}2^{\nu +1}. \end{aligned}$$
(18)

Algorithm 3 summarizes the new full RNS variant \(\texttt {Mult}_{\texttt {RNS}}\).

[Algorithm 3: \(\texttt {Mult}_{\texttt {RNS}}\)]

4.6 Reducing the Size of the Relinearization Key \(\texttt {rlk}_\texttt {RNS}\)

In [4], Sect. 5.4, a method is described to significantly reduce the size of the public evaluation key evk (by truncating the ciphertext); it is applicable to the original FV scheme. We provide an efficient adaptation of this kind of optimization to the RNS variant of the relinearisation step.

We recall that the relinearisation is applied to a degree-2 ciphertext, denoted here by \((\varvec{c}_0,\varvec{c}_1,\varvec{c}_2)\). The initial suggestion was to set to zero, say, the i least significant components of the vector \(\mathcal {D}_{\omega ,q}(\varvec{c}_2)\). Doing so is equivalent to replacing \(\varvec{c}_2\) by \(\varvec{c}_2'=\omega ^i\lfloor \varvec{c}_2\omega ^{-i}\rfloor =\varvec{c}_2-|\varvec{c}_2|_{\omega ^i}\). Thus, only the \(\ell _{\omega ,q}-i\) most significant components of \(\texttt {rlk}_\texttt {FV}[0]\) (and of \(\texttt {rlk}_{\texttt {FV}}[1]\)) are required (in other words, when \(\texttt {rlk}_{\texttt {FV}}[0]\) is viewed as an \((\ell _{\omega ,q},k)\) RNS matrix by decomposing each component in base q, ik entries are set to zero in this way). This optimization causes a larger noise growth than the one in Lemma 4 of [4]. Given \((\varvec{c}_0,\varvec{c}_1,\varvec{c}_2)\) decryptable under \(\varvec{s}\), the relinearisation step provides the following ciphertext:

$$(\varvec{\tilde{c}}_0, \varvec{\tilde{c}}_1):=(\varvec{c}_0+\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2'), \mathcal {P}_{\omega ,q}(\varvec{s}^2)-(\overrightarrow{\varvec{e}}+\overrightarrow{\varvec{a}}\varvec{s})\rangle ,\varvec{c}_1+\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2'), \overrightarrow{\varvec{a}}\rangle ).$$

Thus, \((\varvec{\tilde{c}}_0, \varvec{\tilde{c}}_1)(\varvec{s})=\varvec{c}_0+\varvec{c}_1\varvec{s}+\varvec{c}_2'\varvec{s}^2-\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2'), \overrightarrow{\varvec{e}}\rangle \bmod q\). Consequently, the extra noise comes from the following term:

$$\begin{aligned} \begin{array}{rcl} \Vert -|\varvec{c}_2|_{\omega ^i}\varvec{s}^2-\langle \mathcal {D}_{\omega ,q}(\varvec{c}_2'), \overrightarrow{\varvec{e}}\rangle \Vert _\infty &{}=&{} \Vert -|\varvec{c}_2|_{\omega ^i}\varvec{s}^2-\textstyle \sum _{j=i}^{\ell _{\omega ,q}-1}\mathcal {D}_{\omega ,q}(\varvec{c}_2)_j \varvec{e}_j\Vert _\infty \\ &{}<&{} \delta ^2 \omega ^i B_{key}^2+(\ell _{\omega ,q}-i)\delta \omega B_{err}. \end{array} \end{aligned}$$
(19)

In the present RNS variant, the computation of \(\lfloor \varvec{c}_2\omega ^{-i}\rfloor \) is not straightforward. It could be replaced by \(\lfloor \varvec{c}_2(q_1\ldots q_i)^{-1}\rfloor \) through a Newton-like interpolation (also known as mixed-radix conversion [11]). Though the result would be quite similar to the original optimization in terms of noise growth, its efficiency is not satisfactory. Indeed, although ik entries of the RNS matrix \(\texttt {rlk}_\texttt {RNS}[0]\) can be set to zero in this way, such a Newton interpolation is intrinsically sequential, whereas the division by \(\omega ^{i}\) followed by a flooring is simply achieved by an immediate zeroing of the least significant coefficients in radix-\(\omega \) representation.

For our approach, we rely on the fact that \(\texttt {rlk}_\texttt {RNS}\) contains the RLWE-encryptions of the polynomials \(|\varvec{s}^2\tfrac{q}{q_j}|_q\). We then notice that only the \(j^{\text {th}}\) residue of \(|\varvec{s}^2\tfrac{q}{q_j}|_q\) can be non-zero. So, let’s assume that we want to cancel ik entries in \(\texttt {rlk}_\texttt {RNS}[0]\) (as was done in \(\texttt {rlk}_{\texttt {FV}}\) with the previous optimization). Then we choose, for each index j, a subset of indices \(\mathcal {I}_j\subseteq [ 1,k]\setminus \{j\}\) with cardinality i (i.e. at line j of \(\texttt {rlk}_\texttt {RNS}\), choose i columns, excluding the diagonal one; these terms will be set to zero). Next, for each j, we introduce an RLWE-encryption of \(|\varvec{s}^2\tfrac{q}{q_jq_{\mathcal {I}_j}}|_q\), where \(q_{\mathcal {I}_j}=\prod _{s\in \mathcal {I}_j}q_s\), which is \((|\varvec{s}^2\tfrac{q}{q_jq_{\mathcal {I}_j}}-(\varvec{e}_j+\varvec{sa}_j)|_q,\varvec{a}_j)\). So far, the underlying security features remain unaffected. It then remains to multiply this encryption by \(q_{\mathcal {I}_j}\), which gives in particular \(|\varvec{s}^2\tfrac{q}{q_j}-q_{\mathcal {I}_j}(\varvec{e}_j+\varvec{sa}_j)|_q\). This is the \(j^{\text {th}}\) line of the new matrix \(\texttt {rlk}_\texttt {RNS}'[0]\). This line clearly contains zeros at the columns indexed by \(\mathcal {I}_j\). \(\texttt {rlk}_\texttt {RNS}[1]=(\varvec{a}_1,\ldots ,\varvec{a}_k)\) is replaced by \(\texttt {rlk}_\texttt {RNS}'[1]=(|q_{\mathcal {I}_1}\varvec{a}_1|_q,\ldots ,|q_{\mathcal {I}_k}\varvec{a}_k|_q)\).

Let’s analyse the new noise growth. By evaluating in \(\varvec{s}\) the output of relinearisation with this new \(\texttt {rlk}_\texttt {RNS}'\), we obtain:

$$\begin{aligned}&\varvec{c}_0+\langle \xi _q(\varvec{c}_2),\texttt {rlk}_\texttt {RNS}'[0]\rangle + \left( \varvec{c}_1+\langle \xi _q(\varvec{c}_2),\texttt {rlk}_\texttt {RNS}'[1]\rangle \right) \varvec{s} \\= & {} \varvec{c}_0+\textstyle \sum _{j=1}^k |\varvec{c}_2\tfrac{q_j}{q}|_{q_j}\left( \varvec{s}^2\tfrac{q}{q_j}-q_{\mathcal {I}_j}(\varvec{e}_j+\varvec{sa}_j)\right) +\left( \varvec{c}_1+\textstyle \sum _{j=1}^k |\varvec{c}_2\tfrac{q_j}{q}|_{q_j}q_{\mathcal {I}_j}\varvec{a}_j\right) \varvec{s}\\= & {} \varvec{c}_0+\varvec{c}_1\varvec{s}+\varvec{c}_2\varvec{s}^2-\textstyle \sum _{j=1}^k |\varvec{c}_2\tfrac{q_j}{q}|_{q_j}q_{\mathcal {I}_j}\varvec{e}_j. \end{aligned}$$

Consequently, the cancellation of ik terms in the public matrix \(\texttt {rlk}_\texttt {RNS}[0]\) by this method causes an extra noise growth bounded as follows (for a fair comparison with (19), take \(\omega =2^\nu \), i.e. \(k=\ell _{\omega ,q}\)):

$$\Vert \textstyle \sum _{j=1}^k |\varvec{c}_2\tfrac{q_j}{q}|_{q_j}q_{\mathcal {I}_j}\varvec{e}_j\Vert _\infty<\textstyle \sum _{j=1}^k \delta q_jq_{\mathcal {I}_j}B_{err}<\delta k2^{\nu (i+1)}B_{err}.$$

Therefore, the truncation of ciphertexts can be efficiently adapted to RNS representation without causing more significant noise growth.

4.7 About Computational Complexity

In a classical multi-precision (MP) variant, the ciphertext product can be implemented efficiently with NTT-based polynomial multiplication (see e.g. [23] for such an implementation on FPGA). This approach requires the use of a base \(\mathcal {B}'\) (besides q) with \(|\mathcal {B}'|=k+1\) for storing the product. Notice that, in the RNS variant, we also have \(|\mathcal {B}_{sk}|=k+1\) (the same amount of information is kept, but used differently). Thus, it is easy to show that the RNS and MP variants involve the same number of NTT and invNTT operations in the case \(\ell _{\omega ,q}=k\) (e.g. the keys \(\texttt {rlk}_\texttt {FV}\) and \(\texttt {rlk}_\texttt {RNS}\) have the same size in this situation). In other words, the same number of polynomial products is performed in both cases when \(\omega =2^\nu \).

Above all, the RNS variant enables us to decrease the cost of other parts of the computation. Despite the fact that the asymptotic computational complexity of these parts remains identical for both variants, i.e. \(\mathcal {O}(k^2n)\) elementary multiplications, the RNS variant only involves single-precision integer arithmetic.

To sum up, because of the \(\mathcal {O}(k^2n\log _2(n))\) complexity of the NTTs, the asymptotic computational complexity is unchanged: \(\mathcal {C}(\texttt {Mult}_\texttt {FV})\underset{n\rightarrow +\infty }{\sim }\mathcal {C}(\texttt {Mult}_\texttt {RNS})\). However, the most important fact is that the multi-precision multiplications of the MP variant are replaced in RNS by fast base conversions, which are simple matrix-vector products. Thus, \(\texttt {Mult}_\texttt {RNS}\) retains all the benefits of RNS properties and is highly parallelizable.

5 Software Implementation

The C++ NFLlib library [19] was used to implement the arithmetic in \(\mathcal {R}_q\) efficiently. It provides an efficient NTT-based product in \(\mathcal {R}_{q}\) for q a product of 30-bit or 62-bit prime integers, and for degree n a power of 2, up to \(2^{15}\).

5.1 Concrete Examples of Parameter Settings

In this part, we analyse the depth that can be reached in a multiplicative tree, and for which parameters.

The initial noise is at most \(V= B_{err}(1+2\delta B_{key})\) [17]. The output of a tree of depth L has a noise bounded by \(C_{\texttt {RNS},1}^LV+LC_{\texttt {RNS},1}^{L-1}C_{\texttt {RNS},2}\) (cf. [4], Lemma 9) with, for the present RNS variant:

$$\begin{aligned} \left\{ \begin{array}{rcl} C_{\texttt {RNS},1}&{}=&{}2\delta ^2 t \tfrac{(1+\rho )}{2}B_{key}+\delta t(4+\rho )+\tfrac{\delta }{2},\\ C_{\texttt {RNS},2}&{}=&{}(1+\delta B_{key})(\delta t|q|_t\tfrac{1+\rho }{2}+\delta B_{key}(k+\tfrac{1}{2}))+2\delta t |q|_t+ k(\delta B_{err}2^{\nu +1}+1)\\ &{} &{}+\tfrac{1}{2}(3+|q|_t). \end{array}\right. \end{aligned}$$
(20)

We denote by \(L_{\texttt {RNS}}=\max \{L\in \mathbb {N}\mid C_{\texttt {RNS},1}^LV+LC_{\texttt {RNS},1}^{L-1}C_{\texttt {RNS},2}\leqslant \tfrac{q}{t}(\tfrac{1}{2}-\tfrac{k}{\gamma })-\tfrac{|q|_t}{2}\}\) the depth allowed by \(\texttt {Mult}_\texttt {RNS}\), when \(\texttt {Dec}_\texttt {RNS}\) is used for decryption.
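This maximum can be found by a direct search. The helper below is a hypothetical sketch (not from the paper's code; `long double` is used, assuming the magnitudes involved fit): it returns the largest L such that \(C_1^LV+LC_1^{L-1}C_2\leqslant \text {bound}\).

```cpp
#include <cassert>

// Largest depth L >= 0 such that C1^L * V + L * C1^(L-1) * C2 <= bound,
// following the noise bound of a multiplicative tree of depth L.
int max_depth(long double C1, long double C2, long double V,
              long double bound) {
    int L = 0;
    long double pw = 1.0L;  // invariant: pw == C1^L
    while (L < 64) {        // depth cap; realistic parameters stay far below
        // noise after depth L+1: C1^(L+1)*V + (L+1)*C1^L*C2
        long double noise = pw * C1 * V + (L + 1) * pw * C2;
        if (noise > bound) break;
        ++L;
        pw *= C1;
    }
    return L;
}
```

For instance, with the toy values \(C_1=4\), \(C_2=10\), \(V=3\) and a bound of 1000, depth 3 gives noise 672 while depth 4 gives 3328, so `max_depth(4, 10, 3, 1000)` returns `3`.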

Table 1. Examples of parameter settings, using the 30-bit moduli of NFLlib.

For an 80-bit security level and parameters \(B_{key}=1\), \(\sigma _{err}=8\), \(B_{err}=6\sigma _{err}\), we consider the security analysis in [17], which provides ranges for \((\log _2(q),n)\) (cf. [17], Table 2). We analyse our parameters by using the moduli given in NFLlib, because those were used for concrete testing. For a 32-bit (resp. 64-bit) implementation, a set of 291 30-bit (resp. 1000 62-bit) moduli is available. These moduli are chosen to enable an efficient modular reduction (cf. [19], Algorithm 2).

Table 1 lists parameters when q and \(\mathcal {B}\) are built with the 30-bit moduli of NFLlib. These parameters were determined by choosing the largest \(\rho \) (up to \(2k-1\)) that still reaches the depth \(L_{\texttt {RNS}}\). \(L_{\texttt {std}}\) corresponds to the bounds given in [17] for a classical approach. Sufficient sizes for \(\gamma \) and \(m_{\texttt {sk}}\) are provided (for \(m_{\texttt {sk}}\), sufficient to have \(|\mathcal {B}|=k\) through (14), after choosing for q the k greatest available moduli). For these specific parameters, the new bounds on the noise for the RNS variant cause a smaller depth in only one case.

5.2 Influence of \(\tilde{m}\) Over Noise Growth

After a fast conversion from q, ciphertexts in \(\mathcal {B}_{\texttt {sk}}\) can contain q-overflows and satisfy \(\Vert \texttt {ct}_i'\Vert _\infty < \tfrac{q}{2}(1+\tau )\). In a multiplicative tree without any addition, \(\tau \leqslant 2k-1\). By applying Algorithm 2, this bound decreases to \(\tfrac{q}{2}(1+\rho )\), for some \(0<\rho \leqslant 2k-1\). Having \(\rho =2k-1\) in Table 1 means that \(\texttt {SmMRq}_{\tilde{m}}\) is unnecessary for reaching the best depth; this happens only three times. Most of the time, such a reduction is necessary before a multiplication in order to reach the highest depth. Moreover, choosing a lower \(\rho \) (i.e. a higher \(\tilde{m}\)) than necessary can decrease the size of \(\gamma \) and \(m_{\texttt {sk}}\) (as shown in Table 1, one set of parameters leads to \(\lceil \log _2(m_{\texttt {sk}})\rceil =31\), preventing the use of a 30-bit modulus; this can be solved by taking a larger \(\tilde{m}\)).

Fig. 1. Example of noise growth; \(n=2^{13}\), \(\log _2(q)=390\) (\(\nu =30,k=13\)), \(t=2\), \(\sigma _{err}=8\), \(B_{key}=1\) (dashed line: bound using (20); plain line: measurements).

To illustrate the impact of \(\texttt {SmMRq}_{\tilde{m}}\), Fig. 1 depicts the noise growth for \(\tilde{m}\in \{0,2^8,2^{16}\}\). Given Table 1, \(\tilde{m}=2^8\) is sufficient in this scenario to reach \(L_{\texttt {RNS}}=13\). Compared to a computation with no reduction at all (\(\tilde{m}=0\), implying \(L_{\texttt {RNS}}=11\) in this case), choosing \(\tilde{m}=2^8\) yields an average noise reduction of \(25\%\). By using \(\tilde{m}=2^{16}\), we gain around \(32\%\). Therefore, \(\texttt {SmMRq}_{\tilde{m}}\) has been systematically integrated within the implementation of \(\texttt {Mult}_\texttt {RNS}\) for the timing measurements in the next section.

5.3 Some Remarks

Convenient \(\tilde{m}\) and \(\gamma \). Given the values of \(\rho \) in Table 1, \(\tilde{m}=2^{8}\) (resp. \(\tilde{m}=2^{16}\)) comfortably satisfies every analysed parameter set. This enables an efficient and straightforward modular arithmetic through standard types like uint8_t (resp. uint16_t), and a type conversion towards the signed int8_t (resp. int16_t) immediately gives the centered remainder. Parameter analysis with such \(\tilde{m}\) shows that \(\gamma =2^8\) is sufficient to ensure a correct decryption for all configurations. A reduction modulo \(\gamma \) can then be achieved by a simple type conversion to uint8_t.
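This type-conversion trick can be sketched as follows (a minimal illustration; note that since C++20 the conversion to a signed integer type is defined as two's-complement wrap-around, which is also what mainstream compilers did before):

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic mod m_t = 2^8: keeping residues in uint8_t makes every
// addition/multiplication wrap mod 256 for free. Converting the byte to
// int8_t reinterprets it as the centered remainder in [-128, 127]
// (two's-complement wrap-around, well-defined since C++20).
int centered_mod_256(unsigned int x) {
    uint8_t r = static_cast<uint8_t>(x);  // r = x mod 256, via truncation
    return static_cast<int8_t>(r);        // centered representative
}
```

For example, `centered_mod_256(250)` returns `-6` and `centered_mod_256(300)` returns `44`; the reduction modulo \(\gamma =2^8\) mentioned above is just the first cast.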

Tested Algorithms. The code\(^{1}\) we compared with was implemented in the context of HEAT [26]. It is also based on NFLlib. Multi-precision arithmetic is handled with GMP 6.1.0 [14], and multiplications by \(\tfrac{t}{q}\) are performed with integer divisions. \(\texttt {Mult}_\texttt {MP}\) and \(\texttt {Dec}_\texttt {MP}\) denote functions from this code.

\(\texttt {Mult}_\texttt {RNS}\) was implemented as described by Algorithm 3. Even when the use of \(\texttt {SmMRq}_{\tilde{m}}\) could have been avoided while still reaching the maximal theoretical depth, it was systematically applied: its cost is negligible and it noticeably decreases the noise growth.

Two variants of \(\texttt {Dec}_\texttt {RNS}\) (cf. Sect. 3.5) have been implemented. Depending on \(\nu \), the one with floating-point arithmetic (named \(\texttt {Dec}_\texttt {RNS-flp}\) hereafter) uses double (resp. long double) for double (resp. quadruple) precision, and thus does not rely on any external library.

5.4 Results

The tests have been run on a laptop running Fedora 22, with an Intel Core i7-4810MQ CPU @ 2.80 GHz, using the GNU compiler g++ version 5.3.1. Hyper-Threading and Turbo Boost were deactivated. The timings have been measured over \(2^{12}\) decryptions/multiplications for each configuration.

Fig. 2. Decryption time (\(t=2^{10}\)), with \(\nu =30\) (plain lines) and \(\nu =62\) (dashed lines).

Fig. 3. Multiplication time (\(t=2^{10}\)), with \(\nu =30\) (plain lines) and \(\nu =62\) (dashed lines).

Figure 2 presents timings for \(\texttt {Dec}_\texttt {MP}\), \(\texttt {Dec}_\texttt {RNS}\) and \(\texttt {Dec}_\texttt {RNS-flp}\), and Fig. 3 depicts timings for \(\texttt {Mult}_\texttt {MP}\) and \(\texttt {Mult}_\texttt {RNS}\). Both figures gather data for two modulus sizes: \(\nu =30\) and \(\nu =62\). Step 2 of \(\texttt {Mult}_\texttt {MP}\) uses a decomposition in radix-base \(\omega =2^{32}\) when \(\nu =30\), and \(\omega =2^{62}\) when \(\nu =62\). The auxiliary bases \(\mathcal {B}_{sk}\) and \(\mathcal {B}'\) involved in \(\texttt {Mult}_\texttt {RNS}\) and \(\texttt {Mult}_\texttt {MP}\) contain \(k+1\) moduli each. Table 2 shows which values of k have been tested (depending on n). Multiplication timing for \((n,\nu ,k)=(2^{11},62,1)\) is not given since \(L=1\) already causes decryption failures.

Table 2. Parameter k used in the tests (i.e. \(\lceil \log _2(q)\rceil =k\nu \)).

It should be noted that the raw performance is mainly due to NFLlib; our contribution appears in the speed-ups between the RNS and MP variants.

In Fig. 3, the convergence of the complexities of \(\texttt {Mult}_\texttt {RNS}\) and \(\texttt {Mult}_\texttt {MP}\) (as explained in Sect. 4.7) is noticeable. The new algorithm presented in this paper allows speed-up factors from 4.3 down to 1.7 for degree n from \(2^{11}\) to \(2^{15}\) when \(\nu =30\), and from 3.6 down to 1.9 for n from \(2^{12}\) to \(2^{15}\) when \(\nu =62\).

In Fig. 2, the two variants of decryption described in Sect. 3.5 are almost equally fast. Indeed, they perform the same number of elementary (floating-point or integer) operations. Between degrees \(2^{11}\) and \(2^{15}\), the RNS variants allow speed-ups varying from 6.1 down to 4.4 when \(\nu =30\), and from 20.4 down to 5.6 when \(\nu =62\). All the implemented decryption functions take as input a ciphertext in NTT representation; thus, only one invNTT is performed (after the product of residues) within each decryption. As explained in Sect. 3.5, despite a better asymptotic computational complexity for RNS decryption, the efficiency remains, in practice, highly related to this invNTT procedure, which may explain the slight convergence between MP and RNS decryption times observed in Fig. 2.

6 Conclusion

In this paper, the somewhat homomorphic encryption scheme FV has been fully adapted to Residue Number Systems. Prior to this work, RNS was used to accelerate polynomial additions and multiplications. However, the decryption and the homomorphic multiplication involve operations at the coefficient level which are hardly compatible with RNS, such as division and rounding.

Our proposed solutions overcome these incompatibilities, without modifying the security features of the original scheme. As a consequence, we have provided a SHE scheme which only involves RNS arithmetic. It means that only single-precision integer arithmetic is required, and the new variant fully benefits from the properties of RNS, such as parallelization.

The proposed scheme has been implemented in software in C++. Because the arithmetic on polynomials (in particular the polynomial product) is not affected by the optimizations presented here, the implementation has been based on the NFLlib library, which embeds a very efficient NTT-based polynomial product. Our implementation has been compared to a classical version of FV (based on NFLlib and GMP). For degrees from \(2^{11}\) to \(2^{15}\), the new decryption (resp. homomorphic multiplication) offers speed-up factors from 20 down to 5 (resp. 4 down to 2) for cryptographic parameters.

Further work should demonstrate the high potential of the new variant by exploiting all the concurrency properties of RNS, in particular through dedicated hardware implementations.