1 Introduction

The computing power of general-purpose processors is enhanced by different parallelism designs. Firstly, single-instruction-multiple-data (SIMD) enables the elements of a vector to be processed in parallel. General-purpose CPUs are usually equipped with vector instruction extensions, such as Intel MMX/SSE/AVX, ARM NEON and AMD 3DNow. Graphics processing units (GPUs) follow a different parallelism structure, single-instruction-multiple-thread (SIMT), where thousands of independent threads execute the same instructions concurrently. Finally, simultaneous-multithreading (SMT) is adopted by both CPUs and GPUs, to enable instructions from multiple threads (in a GPU thread block or a CPU core) to be executed in any given pipeline stage at a time.

The potential of GPUs for public-key cryptographic computing has been investigated for several years. Thread-level parallelism and thousands of scalar stream processors in GPUs produce very high throughput on a great number of simultaneous tasks, but with greater latency than scalar-instruction cryptographic implementations on CPUs [22]. Note that the clock frequency of GPUs is much lower than that of general-purpose CPUs; for example, an Intel Core i7 CPU reaches up to 3.5 GHz while the NVIDIA Tesla K20 runs at only 706 MHz [30]. This latency deficiency limits the application of GPUs as public-key cryptographic engines in many scenarios.

In November 2012, Intel announced the first product family of the Many Integrated Core (MIC) architecture, named Intel Xeon Phi. Xeon Phi provides an opportunity to implement public-key algorithms with both high throughput and low latency. For example, Xeon Phi 7120P consists of 61 cores, and each core is equipped with (a) a 512-bit SIMD unit supporting 16-way 32-bit vector instructions, and (b) a 4-way SMT unit, i.e., 4 hyperthreads per core for instruction pipelining. Intel Xeon Phi, with computing power measured in tera floating-point operations per second (TFLOPS), has been applied in supercomputing fields such as molecular dynamics [25], sparse matrix multiplication [27] and large integer arithmetic [8, 16]. In fact, similar 512-bit SIMD units are supported in Intel Xeon Skylake and Skylake-E CPUs and will be supported in Intel Cannonlake CPUs.

This paper presents the first implementation of public-key cryptographic algorithms with 512-bit SIMD instructions on Xeon Phi, called PhiRSA. In particular, we evaluate 1024-bit, 2048-bit and 4096-bit RSA on vector instructions. PhiRSA fully exploits the computing power of Xeon Phi 7120P with the following designs. Firstly, to perform 512-bit Montgomery multiplication (see Algorithm 1 for details), the most expensive step of RSA, the intermediate products are organized into four 512-bit vectors; these vectors are then added using the vector-add-with-carry instruction vpadcd in each round of the main loop of the Montgomery multiplication. After n rounds, the corresponding 512-bit vectors of the successive rounds compose a vector carry propagation chain (VCPC). This design exploits the vector mask registers and does not need to handle the carry bits after each addition in a round. Secondly, we exploit vector instructions to compute q (see Algorithm 3 for details) without breaking the flow of the VCPCs. When a vector is used to compute q, the carry bits act as the write-mask, which is read-only in the operation; therefore, the correct q is obtained in each round of the VCPCs without breaking the chains.

The features of SIMD are fully exploited in PhiRSA, as our design magnifies the advantages of the vector instruction extensions of Xeon Phi. Our method greatly outperforms the commonly used redundant representation method in [3, 5, 10,11,12, 21]. To avoid handling the carry bits after each large-integer addition during Montgomery multiplication, the redundant representation stores only a 29-bit operand in each 64-bit vector element; every product of two elements is then at most 58 bits wide, and the remaining 6 bits hold the carries of subsequent additions without overflow. As a result, it requires extra instructions and vectors to finish the computations.
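The headroom argument behind the redundant representation can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the limb and lane widths quoted above (29 and 64 bits); the helper names `to_limbs` and `from_limbs` are hypothetical and are not taken from the cited works.

```python
# Sketch of the headroom arithmetic for the redundant representation.
# LIMB_BITS and LANE_BITS come from the text; everything else is illustrative.
LIMB_BITS = 29
LANE_BITS = 64

max_limb = (1 << LIMB_BITS) - 1
max_product = max_limb * max_limb            # widest product of two limbs
headroom_bits = LANE_BITS - 2 * LIMB_BITS    # spare bits left in each lane

# Number of worst-case products that can be summed in one 64-bit lane
# before the lane could overflow.
max_accumulations = ((1 << LANE_BITS) - 1) // max_product


def to_limbs(x, n_limbs):
    """Split a non-negative integer into 29-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & max_limb for i in range(n_limbs)]


def from_limbs(limbs):
    """Recombine limbs; a limb may exceed 29 bits while carries are deferred."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


# A 512-bit operand needs ceil(512 / 29) limbs in this representation,
# compared with 16 words when all 32 bits of each word are used.
limbs_for_512_bits = -(-512 // LIMB_BITS)
```

The extra limbs, and the 64-bit lanes that hold them, are one way to see why this representation needs more vectors than a dense 32-bit layout for the same operand size.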

We implement 1024/2048-bit Montgomery multiplication (and then 2048/4096-bit RSA) based on 512-bit vectors. Two (or four) 512-bit vectors compose a 1024-bit (or 2048-bit) large integer, and the specific vector instruction valignd is used to right shift multiple 512-bit vectors of the large integer during the main loop of Montgomery multiplication. The operations of right shift and assignment are performed in only one vector instruction, for each 512-bit vector.

Meanwhile, the benefit of SMT is also kept in PhiRSA. The execution order of vector instructions is manually optimized to fully activate the pipeline of vector processing units (VPUs). When 4 threads are launched to perform RSA computations, the VPU utilization exceeds 90%, that is, almost one instruction is executed in each cycle.

Our contributions are as follows. Firstly, vector-oriented designs are proposed to fully exploit the computing power of vector instructions for RSA. Secondly, we implement these designs efficiently on Intel Xeon Phi 7120P. To the best of our knowledge, this is the first implementation of public-key cryptography on Intel Xeon Phi. The experimental results exhibit both high throughput and low latency: for 1024-bit, 2048-bit and 4096-bit RSA, PhiRSA achieves throughputs of 258370, 41803 and 5358 decryptions per second with 244 parallel tasks, and latencies of 0.94 ms, 5.84 ms and 45.54 ms, respectively. This throughput is about 40 times that of OpenSSL [23] on a single core of an Intel Haswell i7-4770R, and the latency is about 5 times that of OpenSSL. Our throughput is higher than that of the best GPU implementation [32], and the latency is reduced to only about 25%.

The rest of the paper is organized as follows. Section 2 reviews the related work. The preliminaries on Intel Xeon Phi and Montgomery multiplication are presented in Sect. 3. Section 4 describes the design of our Montgomery multiplication. In Sect. 5, we show how to implement Montgomery multiplication and RSA on Intel Xeon Phi. In Sect. 6, the performance results of our Montgomery multiplication and RSA implementations are given and compared with other works. We conclude in Sect. 7.

2 Related Work

There have been a number of studies using vector instructions to implement large integer multiplication, Montgomery multiplication and public-key cryptography. These works can be classified into three groups. The first group, which is also the main choice, stores the large integers in vectors horizontally for fine-grained parallelism. The Intel SSE2 instruction set has been exploited for large integer multiplication in [21] and cryptographic pairing computation in [11]. The redundant representation method proposed in [21] is widely used in vector implementations to help delay the carry propagation. The Intel AVX2 instruction set is also applied to modular exponentiation in [12] and a Curve25519 implementation in [10]. The ARM NEON instruction set is explored to implement Montgomery multiplication in [28], Curve41417 in [3] and RSA in [29]. On the Cell platform, an approach to implementing Montgomery multiplication is described in [5]. The second group splits the Montgomery multiplication into two parts that are computed in parallel. This approach is studied in [6] for 2-way vector instruction sets such as Intel SSE2 and ARM NEON. The third group uses the vector instructions to carry out multiple tasks in parallel. Computing multiple Montgomery multiplications simultaneously in vector elements is investigated on the Intel SSE2 instruction set in [24] and on the Cell processor in [4].

Many previous studies have shown that GPUs are suitable for asymmetric cryptography. Most of them are based on the integer computing power of GPUs, such as [1, 31]. The floating-point computing power is also explored in [2, 32]. For 2048-bit RSA implementations on GPUs, the highest throughput is reported in [32] and the lowest latency is obtained by [31].

Intel Xeon Phi was launched as a brand-new high-performance computing platform, whose performance has been evaluated in [

$$ vpadcd~~~~~(zmm2/memory),~~k2,~~zmm1\{k1\} $$

This instruction performs an element-by-element three-input addition between the int32 vector zmm1, an int32 vector in memory or the int32 vector zmm2, and the carry bits in k2. The result is written into zmm1, and the carry bits produced by the addition are written into k2. The execution of the instruction is controlled by the write-mask k1. Some other vector instructions are also used in this paper. The instructions vpmulhud and vpmulld perform element-by-element multiplications between int32 vectors and store the high 32 bits or the low 32 bits of each result, respectively. The instruction vpermd performs an element permutation by using int32 vector elements as source indices. The instruction valignd concatenates two vectors and shifts the result right by several 32-bit elements.
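The semantics described above can be emulated in software on lists of sixteen 32-bit lanes. The following is a minimal sketch for illustration only, not the hardware definition: the Python function signatures and argument orders are hypothetical, and the behaviour assumed for lanes masked off by k1 (old value and old carry bit are kept) is an assumption.

```python
MASK32 = 0xFFFFFFFF
LANES = 16  # one 512-bit vector holds sixteen 32-bit elements


def vpadcd(zmm1, k1, k2, zmm2):
    """Lane-wise zmm1 + zmm2 + carry-in bit of k2; returns (new_zmm1, new_k2).

    k1 and k2 are 16-bit integer masks. Lanes whose k1 bit is clear keep
    their old value and old carry bit (assumption, see lead-in).
    """
    out = list(zmm1)
    new_k2 = k2
    for i in range(LANES):
        if (k1 >> i) & 1:
            s = zmm1[i] + zmm2[i] + ((k2 >> i) & 1)
            out[i] = s & MASK32
            # The carry-out of lane i is stored in bit i of the mask.
            new_k2 = (new_k2 & ~(1 << i)) | ((s >> 32) << i)
    return out, new_k2


def vpmulld(a, b):
    """Low 32 bits of each lane-wise unsigned product."""
    return [(x * y) & MASK32 for x, y in zip(a, b)]


def vpmulhud(a, b):
    """High 32 bits of each lane-wise unsigned product."""
    return [(x * y) >> 32 for x, y in zip(a, b)]


def vpermd(indices, src):
    """Lane i of the result is src[indices[i]] (index taken modulo 16)."""
    return [src[i % LANES] for i in indices]


def valignd(hi, lo, shift):
    """Concatenate hi:lo (lo in the low lanes), shift right by `shift`
    32-bit elements and keep the low sixteen lanes."""
    return (lo + hi)[shift:shift + LANES]
```

Note that in this model the carry out of lane i lands in bit i of the mask rather than being added to lane i+1, so no carry crosses a lane boundary within a single addition.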

3.2 Montgomery Multiplication

The major computation in RSA is modular multiplication. Modular reduction would be very costly if performed with division operations. Montgomery multiplication [20] replaces the division operations with cheaper multiplication and shift operations. Let M be an odd modulus and \(R=2^n\) with \(M<R\); Montgomery multiplication is then defined as \(MontMul(A,~B)= A\cdot B\cdot R^{-1}\pmod M\). The product \(A\cdot B \pmod M\) can be computed with Montgomery multiplication as follows: \(\widetilde{A}=MontMul(A,~R^2)\), \(\widetilde{B}=MontMul(B,~R^2)\), then \(\widetilde{C}=MontMul(\widetilde{A},~\widetilde{B})\), and finally \(C=MontMul(\widetilde{C},~1)\), where C is the result. When a sequence of modular multiplications is executed, as in modular exponentiation, the conversions are needed only at the beginning and the end, so each modular multiplication costs only one Montgomery multiplication.
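The four-step conversion can be verified numerically. The following is a minimal reference sketch, assuming Python 3.8 or later for the modular inverse via `pow`; it computes MontMul directly from its definition (with an explicit inverse of R) rather than with the division-free procedure, and it passes \(R^2 \bmod M\) instead of \(R^2\), which is equivalent modulo M. The function names are hypothetical.

```python
def montmul(a, b, M, n):
    """Reference MontMul(a, b) = a * b * R^{-1} mod M, with R = 2**n.

    Computed straight from the definition, for checking only; a real
    implementation avoids the explicit inverse and the division by M.
    """
    R = 1 << n
    r_inv = pow(R, -1, M)  # exists because M is odd (Python 3.8+)
    return (a * b * r_inv) % M


def modmul_via_montgomery(A, B, M, n):
    """Compute A * B mod M with the four Montgomery multiplications in the text."""
    R2 = pow(1 << n, 2, M)            # R^2 mod M, precomputed once per modulus
    A_t = montmul(A, R2, M, n)        # A~ = A * R mod M
    B_t = montmul(B, R2, M, n)        # B~ = B * R mod M
    C_t = montmul(A_t, B_t, M, n)     # C~ = A * B * R mod M
    return montmul(C_t, 1, M, n)      # C  = A * B mod M


# Small usage example with M = 101 and R = 2**7 = 128 > M.
example = modmul_via_montgomery(55, 77, 101, 7)
```

In a modular exponentiation the first two conversions are done once for the base and the last one once for the result, so every multiplication in between operates directly on values in Montgomery form.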

Algorithm 1. CIOS Montgomery multiplication [19] (listing not reproduced).

Koç et al. proposed an interleaved Montgomery multiplication, named the Coarsely Integrated Operand Scanning (CIOS) method [19], which is described in Algorithm 1. This method interleaves multiplication and Montgomery reduction, which makes it suitable for implementation with vector instructions.
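The interleaving of multiplication and reduction can be sketched at the word level as follows. This is a scalar illustration of the commonly described CIOS formulation with 32-bit words, not a reproduction of Algorithm 1 and not the vectorized design of this paper; the function and variable names are hypothetical, the operands are assumed to be smaller than M, and Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
W = 32                      # word size in bits
WORD_MASK = (1 << W) - 1


def to_words(x, s):
    """Split x into s W-bit words, least significant first."""
    return [(x >> (W * i)) & WORD_MASK for i in range(s)]


def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def cios_montmul(a, b, m, s):
    """Return a * b * R^{-1} mod M as an integer, with R = 2**(W*s).

    a, b, m are lists of s words; M must be odd and a, b < M.
    """
    m_prime = (-pow(m[0], -1, 1 << W)) & WORD_MASK   # -M^{-1} mod 2^W
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t += a * b[i]
        carry = 0
        for j in range(s):
            cs = t[j] + a[j] * b[i] + carry
            t[j], carry = cs & WORD_MASK, cs >> W
        cs = t[s] + carry
        t[s], t[s + 1] = cs & WORD_MASK, cs >> W
        # Reduction step: add q * M so the lowest word becomes zero,
        # then shift the accumulator right by one word.
        q = (t[0] * m_prime) & WORD_MASK
        carry = (t[0] + q * m[0]) >> W
        for j in range(1, s):
            cs = t[j] + q * m[j] + carry
            t[j - 1], carry = cs & WORD_MASK, cs >> W
        cs = t[s] + carry
        t[s - 1] = cs & WORD_MASK
        t[s] = t[s + 1] + (cs >> W)
    # Final conditional subtraction brings the result below M.
    result = from_words(t[:s + 1])
    M = from_words(m)
    return result - M if result >= M else result
```

Each outer iteration performs one row of the multiplication followed immediately by one reduction and a one-word right shift, so the accumulator never grows beyond s + 2 words.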