Optimized threshold implementations: securing cryptographic accelerators for low-energy and low-latency applications

  • Regular Paper
  • Journal of Cryptographic Engineering

Abstract

Threshold implementations (TI) have emerged as one of the most popular masking countermeasures for hardware implementations of cryptographic primitives. In this work, we provide three TI optimization techniques. First, a generic construction for \(d+1\) TI sharing achieves the minimal number of output shares for any n-input Boolean function of degree \(t=n-1\) and for any d. Next, we present a methodology for finding the minimal number of output shares in \(d+1\) TI when \(t<n-1\). Third, we propose a heuristic for minimizing the number of output shares for higher-order \(td + 1\) TI for any n, any t and \(d \le 2\). In addition, we describe an optimization for the secure AES schedule which achieves maximum throughput for a serial implementation. Then, we demonstrate the applicability of our results on \(d+1\) and \(td+1\) TI versions, for first- and second-order secure, low-latency and low-energy implementations of the PRINCE block cipher. We show the fastest and the most energy-efficient known TI-protected implementations of PRINCE.

References

  1. Arribas, V., Bilgin, B., Petrides, G., Nikova, S., Rijmen, V.: Rhythmic Keccak: SCA security and low latency in HW. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(1), 269–290 (2018)

  2. Banik, S., Bogdanov, A., Isobe, T., Shibutani, K., Hiwatari, H., Akishita, T., Regazzoni, F.: Midori: a block cipher for low energy. In: ASIACRYPT 2015, pp. 411–436. Springer, New York (2015)

  3. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B., Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P.: PRINCE: a low-latency block cipher for pervasive computing applications. In: ASIACRYPT 2012, LNCS, pp. 208–225. Springer, Berlin (2012)

  4. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: The Keccak reference. http://keccak.noekeon.org/ (2011)

  5. Balasch, J., Gierlichs, B., Grosso, V., Reparaz, O., Standaert, F.X.: On the cost of lazy engineering for masked software implementations. In: Joye, M., Moradi, A. (eds) 13th International Conference on Smart Card Research and Advanced Applications, CARDIS 2014, Paris, France, November 5–7, 2014. Revised Selected Papers, Volume 8968 of Lecture Notes in Computer Science, pp. 64–81. Springer (2014)

  6. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-order threshold implementations. In: ASIACRYPT 2014, LNCS, pp. 326–343. Springer (2014)

  7. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: A more efficient AES threshold implementation. In: Pointcheval, D., Vergnaud, D. (eds) Progress in Cryptology–AFRICACRYPT 2014, pp. 267–284. Springer, Cham (2014)

  8. Bilgin, B.: Threshold implementations: as countermeasure against higher-order differential power analysis. PhD thesis, University of Twente, Enschede, Netherlands (2015)

  9. Brusco, M.J., Jacobs, L.W., Thompson, G.M.: A morphing procedure to supplement a simulated annealing heuristic for cost-and-coverage-correlated set-covering problems. Ann. Oper. Res. 86, 611–627 (1999)

  10. Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J., Seurin, Y., Vikkelsoe, C.: PRESENT: An ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) Cryptographic Hardware and Embedded Systems–CHES 2007, pp. 450–466. Springer, Berlin (2007)

  11. Bilgin, B., Nikova, S., Nikov, V., Rijmen, V., Stütz, G.: Threshold implementations of all 3 \(\times \)3 and 4 \(\times \)4 s-boxes. In: CHES 2012, LNCS, pp. 76–91. Springer (2012)

  12. Božilov, D.: PRINCE s-boxes verilog implementation (2021). https://github.com/dusanbozilov/PRINCETI

  13. Cooper, J., DeMulder, E., Goodwill, G., Jaffe, J., Kenworthy, G., Rohatgi, P.: Test vector leakage assessment (TVLA) methodology in practice. In: International Cryptographic Module Conference (2013)

  14. Cassiers, G., Grégoire, B., Levi, I., Standaert, F.-X.: Hardware private circuits: from trivial composition to full verification. Cryptology ePrint Archive, Report 2020/185. https://eprint.iacr.org/2020/185 (2020)

  15. De Cnudde, T., Reparaz, O., Bilgin, B., Nikova, S., Nikov, V., Rijmen, V.: Masking AES with d+1 shares in hardware. In: Cryptographic Hardware and Embedded Systems—CHES 2016, pp. 194–212 (2016)

  16. Chu, G., Stuckey, P.J.: Chuffed solver description. https://github.com/chuffed/chuffed (2014)

  17. Daemen, J.: Changing of the guards: a simple and efficient method for achieving uniformity in threshold sharing. In: Fischer, W., Homma, N. (eds) Proceedings of 19th International Conference on Cryptographic Hardware and Embedded Systems—CHES 2017, Taipei, Taiwan, September 25–28, 2017, Volume 10529 of Lecture Notes in Computer Science, pp. 137–153. Springer (2017)

  18. Dantzig, G.: Linear Programming and Extensions. Rand Corporation Research Study. Princeton University Press, Princeton (1963)

  19. De Meyer, L., Bilgin, B., Reparaz, O.: Consolidating security notions in hardware masking. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(3), 119–147 (2019)

  20. Daemen, J., Dobraunig, C.E., Eichlseder, M., Gross, H., Mendel, F., Primas, R.: Protecting against statistical ineffective fault attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(3), 508–543 (2020)

  21. De Meyer, L., Arribas Abril, V., Nikova, S., Nikov, V., Rijmen, V.: M&M: masks and macs against physical attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 25–50 (2018)

  22. Gross, H., Iusupov, R., Bloem, R.: Generic low-latency masking in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(2), 1–21 (2018)

  23. Gross, H., Mangard, S.: Reconciling \(d+1\) masking in hardware and software. In: Cryptographic Hardware and Embedded Systems—CHES, Springer (2017)

  24. Gross, H., Mangard, S., Korak, T.: Domain-oriented masking: compact masked hardware implementations with arbitrary protection order. In: Proceedings of the ACM Workshop on Theory of Implementation Security, TIS@CCS 2016 Vienna, Austria, p. 3 (2016)

  25. Gross, H., Mangard, S., Korak, T.: An efficient side-channel protected AES implementation with arbitrary protection order. In: Handschuh, H. (ed.) Topics in Cryptology—CT-RSA 2017, pp. 95–112 (2017)

  26. Gurobi Optimization, LLC: Gurobi optimizer reference manual. http://www.gurobi.com (2020)

  27. Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware against probing attacks. In: CRYPTO 2003, pp. 463–481. Springer, Berlin (2003)

  28. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)

  29. Knezević, M., Nikov, V., Rombouts, P.: Low-latency encryption—Is “Lightweight = Light + Wait”? In: CHES 2012, LNCS, pp. 426–446. Springer (2012)

  30. Minotra, D.: A study of heuristic-algorithms for set-covering problems (2008)

  31. Moos, T., Moradi, A., Schneider, T., Standaert, F.X.: Glitch-resistant masking revisited, or why proofs in the robust probing model are needed. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(2), 256–292 (2019)

  32. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.) Advances in Cryptology—EUROCRYPT 2011, pp. 69–88 (2011)

  33. Moradi, A., Schneider, T.: Side-channel analysis protection and low-latency in action—case study of PRINCE and Midori. In: ASIACRYPT 2016, LNCS. Springer (2016)

  34. Nikova, S.: TI tools for the 3 x 3 and 4 x 4 S-boxes. http://homes.esat.kuleuven.be/~snikova/ti_tools.html (2012)

  35. Nikova, S., Nikov, V., Rijmen, V.: Decomposition of permutations in a finite field. Cryptogr. Commun. 11, 379–384 (2019)

  36. Nikova, S., Rechberger, C., Rijmen, V.: Threshold implementations against side-channel attacks and glitches. In: ICICS 2006, LNCS, pp. 529–545. Springer (2006)

  37. Nethercote, N., Stuckey, P.J., Becket, R., Brand, S., Duck, G.J., Tack, G.: MiniZinc: towards a standard CP modelling language. In: Christian Bessière, (ed.) Principles and Practice of Constraint Programming–CP 2007, pp. 529–543. Springer, Berlin (2007)

  38. Papapagiannopoulos, K.: High throughput in slices: the case of PRESENT, PRINCE and KATAN64 ciphers. In: RFIDSec 2014, LNCS, pp. 137–155. Springer (2014)

  39. Perron, L., Furnon, V.: OR-Tools. https://developers.google.com/optimization/ (2020)

  40. Poschmann, A., Moradi, A., Khoo, K., Lim, C.W., Wang, H., Ling, S.: Side-channel resistant crypto for less than 2,300 GE. J. Cryptol. 24(2), 322–345 (2011)

  41. Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., Verbauwhede, I.: Consolidating masking schemes. In: CRYPTO 2015, LNCS, pp. 764–783. Springer (2015)

  42. Rossi, F., Van Beek, P., Walsh, T.: Handbook of Constraint Programming (Foundations of Artificial Intelligence). Elsevier, Amsterdam (2006)

  43. Reparaz, O., Gierlichs, B., Verbauwhede, I.: Fast leakage assessment. In: Cryptographic Hardware and Embedded Systems—CHES 2017, pp. 387–399 (2017)

  44. Sasdrich, P., Bilgin, B., Hutter, M., Marson, M.E.: Low-latency hardware masking with application to AES. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(2), 300–326 (2020)

  45. Schrijver, A.: Theory of Linear and Integer Programming. Wiley, Hoboken (1986)

  46. Ueno, R., Homma, N., Aoki, T.: A systematic design of tamper-resistant galois-field arithmetic circuits based on threshold implementation with (d+1) input shares. In: IEEE 47th International Symposium on Multiple-Valued Logic (ISMVL), pp. 136–141 (2017)

  47. Ueno, R., Homma, N., Aoki, T.: Toward more efficient DPA-resistant AES hardware architecture based on threshold implementation. In: Constructive Side-Channel Analysis and Secure Design—COSADE 2017, pp. 50–64 (2017)

  48. Wegener, F., De Meyer, L., Moradi, A.: Spin me right round rotational symmetry for FPGA-specific AES: extended version. J. Cryptol. 33:1114 (2020)

Acknowledgements

We would like to thank Amir Moradi for providing us with the HDL code of the PRINCE TI presented in [33].

Author information

Correspondence to Dušan Božilov.

Appendices

Appendix A

A.1 First-order secure \(td+1\) TI of \(Q_{294}\)

We use first-order \(td + 1\) direct TI sharing [11] with three shares. Here, we recall that \(d=1\) and \(t=2\). The actual sharing is given in Eq. (21).

$$\begin{aligned} x_1&= a_1 \qquad \qquad&z_1&= a_1b_1 \oplus a_1b_2 \oplus a_2b_1 \oplus c_1\nonumber \\ x_2&= a_2 \qquad \qquad&z_2&= a_2b_2 \oplus a_2b_3 \oplus a_3b_2 \oplus c_2\nonumber \\ x_3&= a_3 \qquad \qquad&z_3&= a_3b_3 \oplus a_3b_1 \oplus a_1b_3 \oplus c_3\nonumber \\ y_1&= b_1 \qquad \qquad&w_1&= a_1c_1 \oplus a_1c_2 \oplus a_2c_1 \oplus d_1\nonumber \\ y_2&= b_2 \qquad \qquad&w_2&= a_2c_2 \oplus a_2c_3 \oplus a_3c_2 \oplus d_2\nonumber \\ y_3&= b_3 \qquad \qquad&w_3&= a_3c_3 \oplus a_3c_1 \oplus a_1c_3 \oplus d_3. \nonumber \\ \end{aligned}$$
(21)
Fig. 13. First-order secure sharing of \(Q_{294}\) with \(td+1\) TI

Figure 13 depicts the hardware implementation of the \(td + 1\) version of \(Q_{294}\).
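The sharing in Eq. (21) can be checked exhaustively in software. The following is a minimal sketch, not the authors' tooling: it assumes the unshared function is \((x, y, z, w) = (a, b, ab \oplus c, ac \oplus d)\), which is what XORing the shares of Eq. (21) yields, and all function names are hypothetical. Shares are 0-indexed, so output share i touches only input shares i and i+1 (mod 3).

```python
from itertools import product


def xor_all(bits):
    """XOR of all bits in an iterable."""
    acc = 0
    for bit in bits:
        acc ^= bit
    return acc


def q294(a, b, c, d):
    """Unshared function implied by Eq. (21) (an assumption of this sketch)."""
    return a, b, (a & b) ^ c, (a & c) ^ d


def shared_q294_3(a, b, c, d):
    """Three-share direct sharing of Eq. (21); each input is a 3-tuple of bits."""
    x, y = list(a), list(b)
    z, w = [], []
    for i in range(3):
        j = (i + 1) % 3  # the only other input share index used by output share i
        z.append((a[i] & b[i]) ^ (a[i] & b[j]) ^ (a[j] & b[i]) ^ c[i])
        w.append((a[i] & c[i]) ^ (a[i] & c[j]) ^ (a[j] & c[i]) ^ d[i])
    return x, y, z, w


def check_correctness_3():
    """True if the shared outputs recombine to q294 for all 2**12 share values."""
    for bits in product((0, 1), repeat=12):
        a, b, c, d = bits[0:3], bits[3:6], bits[6:9], bits[9:12]
        got = tuple(xor_all(s) for s in shared_q294_3(a, b, c, d))
        want = q294(xor_all(a), xor_all(b), xor_all(c), xor_all(d))
        if got != want:
            return False
    return True
```

Calling `check_correctness_3()` covers correctness only. Non-completeness follows from the index pattern in the loop, and uniformity is not examined here.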

A.2 Second-order secure \(td+1\) TI of \(Q_{294}\)

We use the second-order \(td + 1\) TI sharing of \(Q_{294}\) with five input shares and ten output shares as shown in Eq. (22). In this case, we have \(d=2\) and \(t=2\). The shares are first processed and thus expanded, then refreshed and stored in a register. Next, they are compressed into five shares using the method explained in [6]. Values marked with an overline in Eq. (22) are the outputs of the compression step.

$$\begin{aligned} x_1&= a_1&&&y_1&= b_1\nonumber \\ x_2&= a_2&&&y_2&= b_2\nonumber \\ x_3&= a_3&&&y_3&= b_3\nonumber \\ x_4&= a_4&&&y_4&= b_4\nonumber \\ x_5&= a_5&&&y_5&= b_5\nonumber \\ z_1&= a_1b_3 \oplus a_3b_1 \qquad \qquad&z_6&= a_1b_1 \oplus a_1b_2 \oplus a_2b_1 \oplus c_1 \qquad \qquad&\bar{z}_1&= z_1 \oplus z_6\nonumber \\ z_2&= a_2b_4 \oplus a_4b_2 \qquad \qquad&z_7&= a_2b_2 \oplus a_2b_3 \oplus a_3b_2 \oplus c_2 \qquad \qquad&\bar{z}_2&= z_2 \oplus z_7\nonumber \\ z_3&= a_3b_5 \oplus a_5b_3 \qquad \qquad&z_8&= a_3b_3 \oplus a_3b_4 \oplus a_4b_3 \oplus c_3 \qquad \qquad&\bar{z}_3&= z_3 \oplus z_8\nonumber \\ z_4&= a_4b_1 \oplus a_1b_4 \qquad \qquad&z_9&= a_4b_4 \oplus a_4b_5 \oplus a_5b_4 \oplus c_4 \qquad \qquad&\bar{z}_4&= z_4 \oplus z_9\nonumber \\ z_5&= a_5b_2 \oplus a_2b_5 \qquad \qquad&z_{10}&= a_5b_5 \oplus a_5b_1 \oplus a_1b_5 \oplus c_5 \qquad \qquad&\bar{z}_5&= z_5 \oplus z_{10}\nonumber \\ w_1&= a_1c_3 \oplus a_3c_1 \qquad \qquad&w_6&= a_1c_1 \oplus a_1c_2 \oplus a_2c_1 \oplus d_1 \qquad \qquad&\bar{w}_1&= w_1 \oplus w_6\nonumber \\ w_2&= a_2c_4 \oplus a_4c_2 \qquad \qquad&w_7&= a_2c_2 \oplus a_2c_3 \oplus a_3c_2 \oplus d_2 \qquad \qquad&\bar{w}_2&= w_2 \oplus w_7\nonumber \\ w_3&= a_3c_5 \oplus a_5c_3 \qquad \qquad&w_8&= a_3c_3 \oplus a_3c_4 \oplus a_4c_3 \oplus d_3 \qquad \qquad&\bar{w}_3&= w_3 \oplus w_8\nonumber \\ w_4&= a_4c_1 \oplus a_1c_4 \qquad \qquad&w_9&= a_4c_4 \oplus a_4c_5 \oplus a_5c_4 \oplus d_4 \qquad \qquad&\bar{w}_4&= w_4 \oplus w_9\nonumber \\ w_5&= a_5c_2 \oplus a_2c_5 \qquad \qquad&w_{10}&= a_5c_5 \oplus a_5c_1 \oplus a_1c_5 \oplus d_5 \qquad \qquad&\bar{w}_5&= w_5 \oplus w_{10}. \end{aligned}$$
(22)

Please note that in order to avoid multivariate attacks, where the attacker probes values from different time samples, only nonlinear parts need to be refreshed, namely \(z_1, \ldots , z_{5}\) and \(w_1, \ldots , w_5\). Therefore, we need ten random bits for each shared \(Q_{294}\) function.

The sub-circuit used to generate two output bits of a partial evaluation of the shared nonlinear function \(xy + z\) is shown in Fig. 14. Figure 15 shows the hardware implementation of the \(td+1\) version of \(Q_{294}\).
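As an illustration of the expansion and compression of the shared function \(xy + z\) in Eq. (22), the sketch below models the ten output shares and their compression to five. It is a functional model under stated assumptions: the refresh and register stages are left out, shares are 0-indexed, and all names are hypothetical. It checks correctness over all \(2^{15}\) share values and checks that any two of the ten expanded shares together leave out at least one input share index.

```python
from itertools import combinations, product

N = 5  # input shares for t = 2, d = 2


def xor_all(bits):
    """XOR of all bits in an iterable."""
    acc = 0
    for bit in bits:
        acc ^= bit
    return acc


def expand(x, y, z):
    """Ten output shares of x*y + z, following the z column of Eq. (22)."""
    # z_1..z_5: cross terms between shares i and i+2
    far = [(x[i] & y[(i + 2) % N]) ^ (x[(i + 2) % N] & y[i]) for i in range(N)]
    # z_6..z_10: terms between shares i and i+1, plus the linear share z[i]
    near = [
        (x[i] & y[i]) ^ (x[i] & y[(i + 1) % N]) ^ (x[(i + 1) % N] & y[i]) ^ z[i]
        for i in range(N)
    ]
    return far + near


def compress(shares):
    """Five compressed shares: output i is share i XOR share i+5."""
    return [shares[i] ^ shares[i + N] for i in range(N)]


def support():
    """Input share indices used by each of the ten expanded shares."""
    far = [{i, (i + 2) % N} for i in range(N)]
    near = [{i, (i + 1) % N} for i in range(N)]
    return far + near


def is_second_order_non_complete():
    """True if every pair of expanded shares misses some input share index."""
    return all(len(s | t) < N for s, t in combinations(support(), 2))


def is_correct():
    """True if the compressed shares recombine to x*y + z for all inputs."""
    for bits in product((0, 1), repeat=3 * N):
        x, y, z = bits[:N], bits[N:2 * N], bits[2 * N:]
        out = compress(expand(x, y, z))
        if xor_all(out) != (xor_all(x) & xor_all(y)) ^ xor_all(z):
            return False
    return True
```

The w column of Eq. (22) has the same structure, so `expand(a, c, d)` models it as well.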

Fig. 14. Generating two output bits for partial evaluation of \(xy + z\)

Fig. 15. Second-order secure sharing of \(Q_{294}\) with \(td+1\) TI

A.3 First-order secure \(d+1\) TI of \(Q_{294}\)

We use the first-order sharing given in [41] and shown in Eq. (23). In this case, \(d=1\). Unlike \(td+1\) TI, the first-order secure sharing here has four output shares for the nonlinear component functions. For the linear parts, however, we need only two shares instead of three. Compression and mask refreshing are needed to reduce the number of output shares and make the output uniform, respectively.

$$\begin{aligned} x_1&= a_1&y_1&= b_1\nonumber \\ x_2&= a_2&y_2&= b_2\nonumber \\ z_1&= a_1b_1 \oplus c_1 \qquad \qquad&w_1&= a_1c_1 \oplus d_1\nonumber \\ z_2&= a_1b_2&w_2&= a_1c_2 \qquad \nonumber \\ z_3&= a_2b_2 \oplus c_2&w_3&= a_2c_2 \oplus d_2\nonumber \\ z_4&= a_2b_1&w_4&= a_2c_1\nonumber \\ \bar{z}_1&= z_1 \oplus z_2&\bar{w}_1&= w_1 \oplus w_2\nonumber \\ \bar{z}_2&= z_3 \oplus z_4&\bar{w}_2&= w_3 \oplus w_4. \end{aligned}$$
(23)
Fig. 16. First-order secure sharing of \(Q_{294}\) with \(d+1\) TI

Shares that contain quadratic terms are refreshed as given in Eq. (2) before storing into a register. We have two shared output component functions with four shares, for which we need six random bits. As in the second-order secure \(td+1\) version we set appropriate register bits to 0 during initial loading to ensure correctness of the execution. A detailed hardware implementation of the \(d+1\) TI sharing of \(Q_{294}\) is depicted in Fig. 16.
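To make the expand, refresh and compress flow concrete, here is a small sketch for one nonlinear component \(xy + z\) with two input shares, following the z column of Eq. (23). The refresh used below is a stand-in chosen for illustration, not Eq. (2), which is not reproduced in this section: it XORs four masks derived from three random bits so that the masks sum to zero. All function names are hypothetical.

```python
from itertools import product


def expand_d1(x, y, z):
    """Four output shares of x*y + z from two input shares (z column of Eq. (23))."""
    return [
        (x[0] & y[0]) ^ z[0],  # z_1
        x[0] & y[1],           # z_2
        (x[1] & y[1]) ^ z[1],  # z_3
        x[1] & y[0],           # z_4
    ]


def refresh_zero_sum(shares, r):
    """Stand-in refresh (assumption, not Eq. (2)): XOR masks that sum to zero."""
    masks = [r[0], r[1], r[2], r[0] ^ r[1] ^ r[2]]
    return [s ^ m for s, m in zip(shares, masks)]


def compress_d1(shares):
    """Two compressed shares: (z_1 XOR z_2, z_3 XOR z_4)."""
    return [shares[0] ^ shares[1], shares[2] ^ shares[3]]


def check_d1():
    """True if the result recombines to x*y + z for every input and mask value."""
    for bits in product((0, 1), repeat=9):
        x, y, z, r = bits[0:2], bits[2:4], bits[4:6], bits[6:9]
        out = compress_d1(refresh_zero_sum(expand_d1(x, y, z), r))
        want = ((x[0] ^ x[1]) & (y[0] ^ y[1])) ^ z[0] ^ z[1]
        if out[0] ^ out[1] != want:
            return False
    return True
```

Because the masks cancel when all shares are XORed, correctness holds for every choice of the three random bits. Whether a given refresh also provides the required uniformity is a separate question that this sketch does not address.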

A.4 Second-order secure \(d+1\) TI of \(Q_{294}\)

Next, we create a second-order secure masking of \(Q_{294}\) following the work of [41]. In this case, \(d=2\). Three input shares are needed for all the operations. However, sharing a nonlinear operation \(xy + z\) produces nine output shares that need to be first refreshed, then stored into a register and finally compressed. We give the formula for \(d+1\) second-order secure sharing in Eq. (24).

$$\begin{aligned} x_1&= a_1&y_1&= b_1\nonumber \\ x_2&= a_2&y_2&= b_2\nonumber \\ x_3&= a_3&y_3&= b_3\nonumber \\ z_1&= a_1b_1 \oplus c_1 \qquad \qquad&w_1&= a_1c_1 \oplus d_1\nonumber \\ z_2&= a_1b_2 \qquad&w_2&= a_1c_2\nonumber \\ z_3&= a_1b_3 \qquad&w_3&= a_1c_3\nonumber \\ z_4&= a_2b_1&w_4&= a_2c_1\nonumber \\ z_5&= a_2b_2 \oplus c_2&w_5&= a_2c_2 \oplus d_2 \nonumber \\ z_6&= a_2b_3&w_6&= a_2c_3 \nonumber \\ z_7&= a_3b_1&w_7&= a_3c_1 \nonumber \\ z_8&= a_3b_2&w_8&= a_3c_2 \nonumber \\ z_9&= a_3b_3 \oplus c_3&w_9&= a_3c_3 \oplus d_3 \nonumber \\ \bar{z}_1&= z_1\oplus z_2 \oplus z_3 \qquad&\bar{w}_1&= w_1\oplus w_2 \oplus w_3\nonumber \\ \bar{z}_2&= z_4\oplus z_5 \oplus z_6 \qquad&\bar{w}_2&= w_4\oplus w_5 \oplus w_6\nonumber \\ \bar{z}_3&= z_7\oplus z_8 \oplus z_9 \qquad&\bar{w}_3&= w_7\oplus w_8 \oplus w_9. \end{aligned}$$
(24)

A hardware diagram of this sharing is depicted in Fig. 17.
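The nine-share expansion of \(xy + z\) and its row-wise compression can be sketched as follows. This is an illustrative functional model with hypothetical names: it omits the refresh and register stages, uses 0-indexed shares, and takes output share \(3i + j\) to be \(x_i y_j\), with \(z_i\) added when \(i = j\). The w component is assumed to follow the same pattern with \((a, c, d)\) in place of \((a, b, c)\).

```python
from itertools import product


def xor_all(bits):
    """XOR of all bits in an iterable."""
    acc = 0
    for bit in bits:
        acc ^= bit
    return acc


def expand_9(x, y, z):
    """Nine output shares of x*y + z from three input shares, in row-major order."""
    n = len(x)
    return [
        (x[i] & y[j]) ^ (z[i] if i == j else 0)
        for i in range(n)
        for j in range(n)
    ]


def compress_rows(shares, n=3):
    """Compressed share i is the XOR of expanded shares 3i, 3i+1 and 3i+2."""
    return [xor_all(shares[i * n:(i + 1) * n]) for i in range(n)]


def check_9():
    """True if the compressed shares recombine to x*y + z for all 2**9 inputs."""
    for bits in product((0, 1), repeat=9):
        x, y, z = bits[0:3], bits[3:6], bits[6:9]
        out = compress_rows(expand_9(x, y, z))
        if xor_all(out) != (xor_all(x) & xor_all(y)) ^ xor_all(z):
            return False
    return True
```

Each expanded share uses at most one share of x and one share of y, which is why three input shares suffice here while the output grows to nine shares before compression.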

Fig. 17. Second-order secure sharing of \(Q_{294}\) with \(d+1\) TI

Fig. 18. S-box pipeline schedule with six-cycle latency

Appendix B

The scheduling of the AES control for a single S-box implementation with an S-box latency of 6, 7, 8, 10 or 11 cycles is given in Figs. 18, 19, 20, 21 and 22. For the 11-cycle S-box latency schedule, the MixColumn input of the last byte is obtained directly from the S-box output instead of being written to and then read from the state, unlike in the other cases presented here.

Fig. 19. S-box pipeline schedule with seven-cycle latency

Fig. 20. S-box pipeline schedule with eight-cycle latency

Fig. 21. S-box pipeline schedule with ten-cycle latency

Fig. 22. S-box pipeline schedule with 11-cycle latency

Appendix C

Here, we give a quick reference for the sharings found for the cases examined in Sect. 3.3. Again, we use the succinct notation, where we give only the chosen shares in their lexicographical order.

Cite this article

Božilov, D., Knežević, M. & Nikov, V. Optimized threshold implementations: securing cryptographic accelerators for low-energy and low-latency applications. J Cryptogr Eng 12, 15–51 (2022). https://doi.org/10.1007/s13389-021-00276-5

  • DOI: https://doi.org/10.1007/s13389-021-00276-5
