Optimized threshold implementations: securing cryptographic accelerators for low-energy and low-latency applications

Božilov, Dušan; Knežević, Miroslav; Nikov, Ventzislav

doi:10.1007/s13389-021-00276-5


Regular Paper
Published: 25 November 2021

Volume 12, pages 15–51, (2022)
Journal of Cryptographic Engineering

Dušan Božilov¹,² (ORCID: orcid.org/0000-0001-8629-4115), Miroslav Knežević¹ & Ventzislav Nikov¹


Abstract

Threshold implementations (TI) have emerged as one of the most popular masking countermeasures for hardware implementations of cryptographic primitives. In this work, we provide three TI optimization techniques. First, a generic construction for $d+1$ TI sharing achieves the minimal number of output shares for any $n$-input Boolean function of degree $t=n-1$ and for any $d$. Second, we present a methodology for finding the minimal number of output shares in $d+1$ TI when $t<n-1$. Third, we propose a heuristic for minimizing the number of output shares for higher-order $td + 1$ TI for any $n$, any $t$ and $d \le 2$. In addition, we describe an optimization for the secure AES schedule which achieves maximum throughput for a serial implementation. We then demonstrate the applicability of our results on $d+1$ and $td+1$ TI versions of first- and second-order secure, low-latency and low-energy implementations of the PRINCE block cipher. We show the fastest and the most energy-efficient known TI-protected implementations of PRINCE.


References

Arribas, V., Bilgin, B., Petrides, G., Nikova, S., Rijmen, V.: Rhythmic Keccak: SCA security and low latency in HW. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(1), 269–290 (2018)
Banik, S., Bogdanov, A., Isobe, T., Shibutani, K., Hiwatari, H., Akishita, T., Regazzoni, F.: Midori: a block cipher for low energy. In: ASIACRYPT 2015, pp. 411–436. Springer, New York (2015)
Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B., Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P.: PRINCE: a low-latency block cipher for pervasive computing applications. In: ASIACRYPT 2012, LNCS, pp. 208–225. Springer, Berlin (2012)
Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: The Keccak reference. http://keccak.noekeon.org/ (2011)
Balasch, J., Gierlichs, B., Grosso, V., Reparaz, O., Standaert, F.X.: On the cost of lazy engineering for masked software implementations. In: Joye, M., Moradi, A. (eds) 13th International Conference on Smart Card Research and Advanced Applications, CARDIS 2014, Paris, France, November 5–7, 2014. Revised Selected Papers, Volume 8968 of Lecture Notes in Computer Science, pp. 64–81. Springer (2014)
Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-order threshold implementations. In: ASIACRYPT 2014, LNCS. pp. 326–343. Springer (2014)
Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: A more efficient AES threshold implementation. In: Pointcheval, D., Vergnaud, D. (eds) Progress in Cryptology–AFRICACRYPT 2014, pp. 267–284. Springer, Cham (2014)
Bilgin, B.: Threshold implementations: as countermeasure against higher-order differential power analysis. PhD thesis, University of Twente, Enschede, Netherlands (2015)
Brusco, M.J., Jacobs, L.W., Thompson, G.M.: A morphing procedure to supplement a simulated annealing heuristic for cost-and-coverage-correlated set-covering problems. Ann. Oper. Res. 86, 611–627 (1999)
Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J., Seurin, Y., Vikkelsoe, C.: PRESENT: An ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) Cryptographic Hardware and Embedded Systems–CHES 2007, pp. 450–466. Springer, Berlin (2007)
Bilgin, B., Nikova, S., Nikov, V., Rijmen, V., Stütz, G.: Threshold implementations of all $3 \times 3$ and $4 \times 4$ s-boxes. In: CHES 2012, LNCS, pp. 76–91. Springer (2012)
Božilov, D.: PRINCE s-boxes verilog implementation (2021). https://github.com/dusanbozilov/PRINCETI
Cooper, J., DeMulder, E., Goodwill, G., Jaffe, J., Kenworthy, G., Rohatgi, P.: Test vector leakage assessment (TVLA) methodology in practice. In: International Cryptographic Module Conference (2013)
Cassiers, G., Grégoire, B., Levi, I., Standaert, F.-X.: Hardware private circuits: from trivial composition to full verification. Cryptology ePrint Archive, Report 2020/185. https://eprint.iacr.org/2020/185 (2020)
De Cnudde, T., Reparaz, O., Bilgin, B., Nikova, S., Nikov, V., Rijmen, V.: Masking AES with d+1 shares in hardware. In: Cryptographic Hardware and Embedded Systems—CHES 2016, pp. 194–212 (2016)
Chu, G., Stuckey, P.J.: Chuffed solver description. https://github.com/chuffed/chuffed (2014)
Daemen, J.: Changing of the guards: a simple and efficient method for achieving uniformity in threshold sharing. In: Fischer, W., Homma, N. (eds) Proceedings of 19th International Conference on Cryptographic Hardware and Embedded Systems—CHES 2017, Taipei, Taiwan, September 25–28, 2017, Volume 10529 of Lecture Notes in Computer Science, pp. 137–153. Springer (2017)
Dantzig, G.: Linear Programming and Extensions. Rand Corporation Research Study. Princeton University Press, Princeton (1963)
De Meyer, L., Bilgin, B., Reparaz, O.: Consolidating security notions in hardware masking. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(3), 119–147 (2019)
Daemen, J., Dobraunig, C.E., Eichlseder, M., Gross, H., Mendel, F., Primas, R.: Protecting against statistical ineffective fault attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(3), 508–543 (2020)
De Meyer, L., Arribas Abril, V., Nikova, S., Nikov, V., Rijmen, V.: M&M: masks and macs against physical attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 25–50 (2018)
Gross, H., Iusupov, R., Bloem, R.: Generic low-latency masking in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(2), 1–21 (2018)
Gross, H., Mangard, S.: Reconciling $d+1$ masking in hardware and software. In: Cryptographic Hardware and Embedded Systems—CHES, Springer (2017)
Gross, H., Mangard, S., Korak, T.: Domain-oriented masking: compact masked hardware implementations with arbitrary protection order. In: Proceedings of the ACM Workshop on Theory of Implementation Security, TIS@CCS 2016 Vienna, Austria, p. 3 (2016)
Gross, H., Mangard, S., Korak, T.: An efficient side-channel protected AES implementation with arbitrary protection order. In: Handschuh, H. (ed.) Topics in Cryptology—CT-RSA, vol. 2017, pp. 95–112 (2017)
LLC Gurobi Optimization. Gurobi optimizer reference manual. http://www.gurobi.com (2020)
Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware against probing attacks. In: CRYPTO 2003, pp. 463–481. Springer, Berlin (2003)
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
Knezević, M., Nikov, V., Rombouts, P.: Low-latency encryption—Is “Lightweight = Light + Wait”? In: CHES 2012, LNCS, pp. 426–446. Springer (2012)
Minotra, D.: A study of heuristic-algorithms for set-covering problems (2008)
Moos, T., Moradi, A., Schneider, T., Standaert, F.X.: Glitch-resistant masking revisited: or why proofs in the robust probing model are needed. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(2), 256–292 (2019)
Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.) Advances in Cryptology—EUROCRYPT 2011, pp. 69–88 (2011)
Moradi, A., Schneider, T.: Side-channel analysis protection and low-latency in action—case study of PRINCE and Midori. In: ASIACRYPT 2016, LNCS. Springer (2016)
Nikova, S.: TI tools for the 3 x 3 and 4 x 4 S-boxes. http://homes.esat.kuleuven.be/~snikova/ti_tools.html (2012)
Nikova, S., Nikov, V., Rijmen, V.: Decomposition of permutations in a finite field. Cryptogr. Commun. 11, 379–384 (2019)
Nikova, S., Rechberger, C., Rijmen, V.: Threshold implementations against side-channel attacks and glitches. In: ICICS 2006, LNCS, pp. 529–545. Springer (2006)
Nethercote, N., Stuckey, P.J., Becket, R., Brand, S., Duck, G.J., Tack, G.: MiniZinc: towards a standard CP modelling language. In: Christian Bessière, (ed.) Principles and Practice of Constraint Programming–CP 2007, pp. 529–543. Springer, Berlin (2007)
Papapagiannopoulos, K.: High throughput in slices: the case of PRESENT, PRINCE and KATAN64 ciphers. In: RFIDSec 2014, LNCS, pp. 137–155. Springer (2014)
Perron, L., Furnon, V.: OR-Tools. https://developers.google.com/optimization/ (2020)
Poschmann, A., Moradi, A., Khoo, K., Lim, C.W., Wang, H., Ling, S.: Side-channel resistant crypto for less than 2,300 GE. J. Cryptol. 24(2), 322–345 (2011)
Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., Verbauwhede, I.: Consolidating masking schemes. In: CRYPTO 2015, LNCS, pp. 764–783. Springer (2015)
Rossi, F., Van Beek, P., Walsh, T.: Handbook of Constraint Programming (Foundations of Artificial Intelligence). Elsevier, Amsterdam (2006)
Reparaz, O., Gierlichs, B., Verbauwhede, I.: Fast leakage assessment. In: Cryptographic Hardware and Embedded Systems—CHES, vol. 2017, pp. 387–399 (2017)
Sasdrich, P., Bilgin, B., Hutter, M., Marson, M.E.: Low-latency hardware masking with application to AES. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(2), 300–326 (2020)
Schrijver, A.: Theory of Linear and Integer Programming. Wiley, Hoboken (1986)
Ueno, R., Homma, N., Aoki, T.: A systematic design of tamper-resistant galois-field arithmetic circuits based on threshold implementation with (d+1) input shares. In: IEEE 47th International Symposium on Multiple-Valued Logic (ISMVL), pp. 136–141 (2017)
Ueno, R., Homma, N., Aoki, T.: Toward more efficient DPA-resistant AES hardware architecture based on threshold implementation. In: Constructive Side-Channel Analysis and Secure Design—COSADE, vol. 2017, pp. 50–64 (2017)
Wegener, F., De Meyer, L., Moradi, A.: Spin me right round rotational symmetry for FPGA-specific AES: extended version. J. Cryptol. 33:1114 (2020)

Acknowledgements

We would like to thank Amir Moradi for providing us with the HDL code of the PRINCE TI presented in [33].

Author information

Authors and Affiliations

NXP Semiconductors, Leuven, Belgium
Dušan Božilov, Miroslav Knežević & Ventzislav Nikov
imec-COSIC, KU Leuven, Leuven, Belgium
Dušan Božilov

Corresponding author

Correspondence to Dušan Božilov.

Appendices

Appendix A

A.1 First-order secure $td+1$ TI of $Q_{294}$

We use first-order $td + 1$ direct TI sharing [11] with three shares. Here, we recall that $d=1$ and $t=2$. The actual sharing is given in Eq. (21).

$$\begin{aligned} x_1&= a_1 \qquad \qquad&z_1&= a_1b_1 \oplus a_1b_2 \oplus a_2b_1 \oplus c_1\nonumber \\ x_2&= a_2 \qquad \qquad&z_2&= a_2b_2 \oplus a_2b_3 \oplus a_3b_2 \oplus c_2\nonumber \\ x_3&= a_3 \qquad \qquad&z_3&= a_3b_3 \oplus a_3b_1 \oplus a_1b_3 \oplus c_3\nonumber \\ y_1&= b_1 \qquad \qquad&w_1&= a_1c_1 \oplus a_1c_2 \oplus a_2c_1 \oplus d_1\nonumber \\ y_2&= b_2 \qquad \qquad&w_2&= a_2c_2 \oplus a_2c_3 \oplus a_3c_2 \oplus d_2\nonumber \\ y_3&= b_3 \qquad \qquad&w_3&= a_3c_3 \oplus a_3c_1 \oplus a_1c_3 \oplus d_3. \nonumber \\ \end{aligned}$$

(21)

Figure 13 depicts the hardware implementation of the $td + 1$ version of $Q_{294}$.
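The sharing of the component function $z = ab \oplus c$ in Eq. (21) can be checked exhaustively. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, share indices are 0-based, and only the $z$ column is covered (the $w$ column has the same shape with $c, d$ in place of $b, c$). It tests that the three output shares recombine to the unshared value and that each output share ignores one input share index.

```python
from itertools import product

def share_z(a, b, c):
    """Output shares z_1..z_3 of z = ab XOR c, transcribed from Eq. (21)."""
    out = []
    for i in range(3):
        j = (i + 1) % 3  # the only other share index used by output share i
        out.append((a[i] & b[i]) ^ (a[i] & b[j]) ^ (a[j] & b[i]) ^ c[i])
    return out

def is_correct():
    """XOR of the output shares equals ab XOR c on the unshared values."""
    for bits in product((0, 1), repeat=9):
        a, b, c = bits[0:3], bits[3:6], bits[6:9]
        z = share_z(a, b, c)
        want = ((a[0] ^ a[1] ^ a[2]) & (b[0] ^ b[1] ^ b[2])) ^ (c[0] ^ c[1] ^ c[2])
        if z[0] ^ z[1] ^ z[2] != want:
            return False
    return True

def omits_one_share_index():
    """Output share i never changes when a share with index i+2 is flipped."""
    for bits in product((0, 1), repeat=9):
        a, b, c = bits[0:3], bits[3:6], bits[6:9]
        z = share_z(a, b, c)
        for i in range(3):
            k = (i + 2) % 3
            for var in range(3):  # flip a_k, then b_k, then c_k
                shares = [list(a), list(b), list(c)]
                shares[var][k] ^= 1
                if share_z(*shares)[i] != z[i]:
                    return False
    return True

if __name__ == "__main__":
    print(is_correct(), omits_one_share_index())
```

The second check is the first-order non-completeness property in executable form: every output share is independent of at least one index of the input sharing.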

A.2 Second-order secure $td+1$ TI of $Q_{294}$

We use the second-order $td + 1$ TI sharing of $Q_{294}$ with five input shares and ten output shares as shown in Eq. (22). In this case, we have $d=2$ and $t=2$. The shares are first processed and thus expanded, then refreshed and stored into a register. Next, they are compressed into five shares using the method explained in [6]. Values in Eq. (22) denoted with the overline represent the output after the compression step.

$$\begin{aligned} x_1&= a_1&&&y_1&= b_1\nonumber \\ x_2&= a_2&&&y_2&= b_2\nonumber \\ x_3&= a_3&&&y_3&= b_3\nonumber \\ x_4&= a_4&&&y_4&= b_4\nonumber \\ x_5&= a_5&&&y_5&= b_5\nonumber \\ z_1&= a_1b_3 \oplus a_3b_1 \qquad \qquad&z_6&= a_1b_1 \oplus a_1b_2 \oplus a_2b_1 \oplus c_1 \qquad \qquad&\bar{z}_1&= z_1 \oplus z_6\nonumber \\ z_2&= a_2b_4 \oplus a_4b_2 \qquad \qquad&z_7&= a_2b_2 \oplus a_2b_3 \oplus a_3b_2 \oplus c_2 \qquad \qquad&\bar{z}_2&= z_2 \oplus z_7\nonumber \\ z_3&= a_3b_5 \oplus a_5b_3 \qquad \qquad&z_8&= a_3b_3 \oplus a_3b_4 \oplus a_4b_3 \oplus c_3 \qquad \qquad&\bar{z}_3&= z_3 \oplus z_8\nonumber \\ z_4&= a_4b_1 \oplus a_1b_4 \qquad \qquad&z_9&= a_4b_4 \oplus a_4b_5 \oplus a_5b_4 \oplus c_4 \qquad \qquad&\bar{z}_4&= z_4 \oplus z_9\nonumber \\ z_5&= a_5b_2 \oplus a_2b_5 \qquad \qquad&z_{10}&= a_5b_5 \oplus a_5b_1 \oplus a_1b_5 \oplus c_5 \qquad \qquad&\bar{z}_5&= z_5 \oplus z_{10}\nonumber \\ w_1&= a_1c_3 \oplus a_3c_1 \qquad \qquad&w_6&= a_1c_1 \oplus a_1c_2 \oplus a_2c_1 \oplus d_1 \qquad \qquad&\bar{w}_1&= w_1 \oplus w_6\nonumber \\ w_2&= a_2c_4 \oplus a_4c_2 \qquad \qquad&w_7&= a_2c_2 \oplus a_2c_3 \oplus a_3c_2 \oplus d_2 \qquad \qquad&\bar{w}_2&= w_2 \oplus w_7\nonumber \\ w_3&= a_3c_5 \oplus a_5c_3 \qquad \qquad&w_8&= a_3c_3 \oplus a_3c_4 \oplus a_4c_3 \oplus d_3 \qquad \qquad&\bar{w}_3&= w_3 \oplus w_8\nonumber \\ w_4&= a_4c_1 \oplus a_1c_4 \qquad \qquad&w_9&= a_4c_4 \oplus a_4c_5 \oplus a_5c_4 \oplus d_4 \qquad \qquad&\bar{w}_4&= w_4 \oplus w_9\nonumber \\ w_5&= a_5c_2 \oplus a_2c_5 \qquad \qquad&w_{10}&= a_5c_5 \oplus a_5c_1 \oplus a_1c_5 \oplus d_5 \qquad \qquad&\bar{w}_5&= w_5 \oplus w_{10}. \end{aligned}$$

(22)

Please note that in order to avoid multivariate attacks, where the attacker probes values from different time samples, only nonlinear parts need to be refreshed, namely $z_1, \ldots , z_{5}$ and $w_1, \ldots , w_5$. Therefore, we need ten random bits for each shared $Q_{294}$ function.

The sub-circuit used to generate two output bits of a partial evaluation of the shared nonlinear function $xy + z$ is shown in Fig. 14. Figure 15 shows the hardware implementation of the second-order $td+1$ version of $Q_{294}$.
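As a sanity check of Eq. (22), the sketch below rebuilds the ten expanded shares of $z = ab \oplus c$ and their compression into five shares. It is an illustration under stated assumptions: the names are hypothetical, indices are 0-based, the refresh step is omitted because it does not affect correctness, and the index sets are transcribed by hand from Eq. (22).

```python
from itertools import product, combinations

N = 5  # input shares for d = 2, t = 2

def expanded_shares(a, b, c):
    """Shares z_1..z_10 of z = ab XOR c, transcribed from Eq. (22)."""
    first = [(a[i] & b[(i + 2) % N]) ^ (a[(i + 2) % N] & b[i]) for i in range(N)]
    second = [
        (a[i] & b[i]) ^ (a[i] & b[(i + 1) % N]) ^ (a[(i + 1) % N] & b[i]) ^ c[i]
        for i in range(N)
    ]
    return first + second

def compress(z):
    """Compression after the register: zbar_i = z_i XOR z_(i+5)."""
    return [z[i] ^ z[i + N] for i in range(N)]

def compression_is_correct():
    for bits in product((0, 1), repeat=3 * N):
        a, b, c = bits[:N], bits[N:2 * N], bits[2 * N:]
        zbar = compress(expanded_shares(a, b, c))
        got = 0
        for s in zbar:
            got ^= s
        xa = a[0] ^ a[1] ^ a[2] ^ a[3] ^ a[4]
        xb = b[0] ^ b[1] ^ b[2] ^ b[3] ^ b[4]
        xc = c[0] ^ c[1] ^ c[2] ^ c[3] ^ c[4]
        if got != (xa & xb) ^ xc:
            return False
    return True

# Input share indices touched by each expanded share (hand-transcribed).
INDEX_SETS = [{i, (i + 2) % N} for i in range(N)] + [{i, (i + 1) % N} for i in range(N)]

def second_order_non_complete():
    """No pair of expanded shares covers all five input share indices."""
    return all(len(s | t) < N for s, t in combinations(INDEX_SETS, 2))

if __name__ == "__main__":
    print(compression_is_correct(), second_order_non_complete())
```

Note that the non-completeness check applies to the ten expanded shares, which is why the text places the register before the compression step.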

A.3 First-order secure $d+1$ TI of $Q_{294}$

We use the first-order sharing given in [41] and shown in Eq. (23). Here, $d=1$. Unlike $td+1$ TI, the first-order secure sharing here has four output shares for the nonlinear component functions. For the linear parts, however, we need only two shares instead of three. Compression is needed to reduce the number of output shares, and mask refreshing is needed to make the output uniform.

$$\begin{aligned} x_1&= a_1&y_1&= b_1\nonumber \\ x_2&= a_2&y_2&= b_2\nonumber \\ z_1&= a_1b_1 \oplus c_1 \qquad \qquad&w_1&= a_1c_1 \oplus d_1\nonumber \\ z_2&= a_1b_2&w_2&= a_1c_2 \qquad \nonumber \\ z_3&= a_2b_2 \oplus c_2&w_3&= a_2c_2 \oplus d_2\nonumber \\ z_4&= a_2b_1&w_4&= a_2c_1\nonumber \\ \bar{z}_1&= z_1 \oplus z_2&\bar{w}_1&= w_1 \oplus w_2\nonumber \\ \bar{z}_2&= z_3 \oplus z_4&\bar{w}_2&= w_3 \oplus w_4. \end{aligned}$$

(23)

Shares that contain quadratic terms are refreshed as given in Eq. (2) before being stored into a register. We have two shared output component functions with four shares, for which we need six random bits. As in the second-order secure $td+1$ version, we set the appropriate register bits to 0 during initial loading to ensure correctness of the execution. A detailed hardware implementation of the $d+1$ TI sharing of $Q_{294}$ is depicted in Fig. 16.
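To illustrate why the expanded shares of Eq. (23) are registered before compression, the following sketch computes by brute force which input bits each share depends on. It is a hedged illustration with hypothetical helper names, it covers only the $z$ column, and it leaves out the refresh of Eq. (2), which is not reproduced in this section.

```python
from itertools import product

NAMES = ("a1", "a2", "b1", "b2", "c1", "c2")

def expanded(bits):
    """Four expanded shares z_1..z_4 of z = ab XOR c, as in Eq. (23)."""
    a1, a2, b1, b2, c1, c2 = bits
    return [(a1 & b1) ^ c1, a1 & b2, (a2 & b2) ^ c2, a2 & b1]

def compressed(bits):
    """Two compressed shares: zbar_1 = z_1 XOR z_2, zbar_2 = z_3 XOR z_4."""
    z = expanded(bits)
    return [z[0] ^ z[1], z[2] ^ z[3]]

def support(fn, k):
    """Names of the input bits that can change output k of fn."""
    deps = set()
    for bits in product((0, 1), repeat=6):
        for i in range(6):
            flipped = list(bits)
            flipped[i] ^= 1
            if fn(bits)[k] != fn(tuple(flipped))[k]:
                deps.add(NAMES[i])
    return deps

def is_correct():
    """The two compressed shares recombine to ab XOR c."""
    for bits in product((0, 1), repeat=6):
        a1, a2, b1, b2, c1, c2 = bits
        zbar = compressed(bits)
        if zbar[0] ^ zbar[1] != ((a1 ^ a2) & (b1 ^ b2)) ^ (c1 ^ c2):
            return False
    return True

if __name__ == "__main__":
    print(sorted(support(expanded, 0)))    # one share of each variable
    print(sorted(support(compressed, 0)))  # both shares of b appear
    print(is_correct())
```

Each expanded share touches at most one share of every variable, whereas $\bar{z}_1$ depends on both $b_1$ and $b_2$; combining the shares only after the register keeps a glitching gate from seeing the full sharing of $b$.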

A.4 Second-order secure $d+1$ TI of $Q_{294}$

Next, we create a second-order secure masking of $Q_{294}$ following the work of [41]. In this case, $d=2$. Three input shares are needed for all the operations. However, sharing a nonlinear operation $xy + z$ produces nine output shares that need to be first refreshed, then stored into a register and finally compressed. We give the formula for $d+1$ second-order secure sharing in Eq. (24).

$$\begin{aligned} x_1&= a_1&y_1&= b_1\nonumber \\ x_2&= a_2&y_2&= b_2\nonumber \\ x_3&= a_3&y_3&= b_3\nonumber \\ z_1&= a_1b_1 \oplus c_1 \qquad \qquad&w_1&= a_1c_1 \oplus d_1\nonumber \\ z_2&= a_1b_2 \qquad&w_2&= a_1c_2\nonumber \\ z_3&= a_1b_3 \qquad&w_3&= a_1c_3\nonumber \\ z_4&= a_2b_1&w_4&= a_2c_1\nonumber \\ z_5&= a_2b_2 \oplus c_2&w_5&= a_2c_2 \oplus d_2 \nonumber \\ z_6&= a_2b_3&w_6&= a_2c_3 \nonumber \\ z_7&= a_3b_1&w_7&= a_3c_1 \nonumber \\ z_8&= a_3b_2&w_8&= a_3c_2 \nonumber \\ z_9&= a_3b_3 \oplus c_3&w_9&= a_3c_3 \oplus d_3 \nonumber \\ \bar{z}_1&= z_1\oplus z_2 \oplus z_3 \qquad&\bar{w}_1&= w_1\oplus w_2 \oplus w_3\nonumber \\ \bar{z}_2&= z_4\oplus z_5 \oplus z_6 \qquad&\bar{w}_2&= w_4\oplus w_5 \oplus w_6\nonumber \\ \bar{z}_3&= z_7\oplus z_8 \oplus z_9 \qquad&\bar{w}_3&= w_7\oplus w_8 \oplus w_9. \end{aligned}$$

(24)

A hardware diagram of this sharing is depicted in Fig. 17.
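The nine-share pattern used for $xy + z$ in Eq. (24) follows a regular rule: expanded share number $3(i-1)+j$ is the product $a_ib_j$, with $c_i$ added on the diagonal $i=j$, and each group of three consecutive shares is compressed into one. The sketch below writes that rule for a general share count $s = d+1$. The generalization beyond $s=3$, the function names and the 0-based indexing are assumptions made for illustration, and the refresh step is omitted.

```python
from functools import reduce
from itertools import product
from operator import xor

def expand(a, b, c):
    """s*s expanded shares of ab XOR c: a_i b_j, plus c_i when i == j."""
    s = len(a)
    return [(a[i] & b[j]) ^ (c[i] if i == j else 0)
            for i in range(s) for j in range(s)]

def compress(z, s):
    """Compress each run of s consecutive expanded shares into one share."""
    return [reduce(xor, z[i * s:(i + 1) * s]) for i in range(s)]

def is_correct(s):
    """Exhaustively check that the compressed shares recombine to ab XOR c."""
    for bits in product((0, 1), repeat=3 * s):
        a, b, c = bits[:s], bits[s:2 * s], bits[2 * s:]
        zbar = compress(expand(a, b, c), s)
        want = (reduce(xor, a) & reduce(xor, b)) ^ reduce(xor, c)
        if reduce(xor, zbar) != want:
            return False
    return True

if __name__ == "__main__":
    print(len(expand((0, 0, 0), (0, 0, 0), (0, 0, 0))))  # nine shares for d = 2
    print(is_correct(3))
```

For $s=3$ this yields nine expanded shares and three compressed ones, matching the share counts stated above; every expanded share uses a single share of each input variable.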

Appendix B

The scheduling of the AES control for a single S-box implementation, where the S-box latency is 6, 7, 8, 10 or 11 cycles, is given in Figs. 18, 19, 20, 21 and 22. For the 11-cycle S-box latency schedule, the MixColumn input of the last byte is obtained directly from the S-box output instead of being written to and then read from the state, unlike in the other cases presented here.

Appendix C

Here, we give a quick reference for the sharings found for the cases examined in Sect. 3.3. Again, we use the succinct notation, where we only give the chosen shares in their lexicographical order.

About this article

Cite this article

Božilov, D., Knežević, M. & Nikov, V. Optimized threshold implementations: securing cryptographic accelerators for low-energy and low-latency applications. J Cryptogr Eng 12, 15–51 (2022). https://doi.org/10.1007/s13389-021-00276-5

Received: 02 July 2020
Accepted: 03 October 2021
Published: 25 November 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s13389-021-00276-5
