Abstract
Decentralized federated learning with fully reliable nodes has drawn attention in modern statistical learning. However, owing to data corruption, device malfunction, malicious attacks and other unexpected behaviors, not all nodes follow the prescribed estimation process, and existing decentralized federated learning methods may fail. An unknown number of abnormal nodes, called Byzantine nodes, deviate arbitrarily from their intended behavior, send wrong messages to their neighbors and, by passing polluted messages, affect honest nodes across the entire network. In this paper, we focus on decentralized federated learning in the presence of Byzantine attacks and propose a unified Byzantine-resilient framework based on network gradient descent and several robust aggregation rules. Theoretically, the convergence of the proposed algorithm is guaranteed under weakly balanced conditions on the network structure. The finite-sample performance is studied through simulations under different network topologies and various Byzantine attacks. An application to the Communities and Crime data is also presented.
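As a concrete illustration of the setting described above — a sketch, not the paper's exact algorithm — the following toy example runs network gradient descent in which each node takes a local gradient step and then aggregates its neighbors' messages with a trimmed mean, one common robust aggregation rule. The quadratic local losses, the complete-graph topology and all names here are our own hypothetical choices.

```python
def trimmed_mean(values, b):
    # Robust aggregation: drop the b smallest and b largest messages,
    # then average the rest (scalars suffice for this illustration).
    s = sorted(values)
    kept = s[b:len(s) - b]
    return sum(kept) / len(kept)

def run(T=200, alpha=0.1, b=1):
    # Five nodes on a complete graph; node 4 is Byzantine and always
    # broadcasts a huge value. Honest node n minimizes (theta - c[n])^2 / 2.
    c = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
    theta = {n: 0.0 for n in range(5)}
    neighbors = {n: range(5) for n in range(5)}
    for _ in range(T):
        # local gradient half-step by each honest node
        half = {n: theta[n] - alpha * (theta[n] - c[n]) for n in c}
        half[4] = 1e6  # the Byzantine message
        # robust aggregation over each node's neighborhood
        theta = {n: trimmed_mean([half[m] for m in neighbors[n]], b)
                 for n in range(5)}
    return theta
```

Despite the attacker, the honest nodes reach consensus at a point inside the honest range [1, 4]; with plain averaging (b = 0), the 1e6 message would drag every iterate far away.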
Data availability
The real dataset used in this study is publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized), as described in the “Real data” section.
Acknowledgements
We are grateful to the Editor, an Associate Editor and two anonymous referees for their insightful comments and suggestions on this article, which have led to significant improvements. Lei Wang’s research was supported by the Fundamental Research Funds for the Central Universities and the National Natural Science Foundation of China (12271272, 12001295).
Appendix
Lemma 1
Assume that \(\{\mathcal{A}_n\}_{n\in \mathcal{N}}\) in Algorithm 1 satisfy (1). Under A1–A3 and a constant step size \(\alpha \le 1/(6L)\), it holds that
Lemma 1 characterizes the connection between \(\big \Vert \widehat{\theta }_n^{t+\frac{1}{2}}-\overline{\theta }^{t+\frac{1}{2}}\big \Vert ^2\) and \(\big \Vert \widehat{\theta }_n^{t}-\overline{\theta }^{t}\big \Vert ^2\), which is pivotal for relating \(\widehat{\theta }_n^{t+1}\) to \(\widehat{\theta }_n^{t}\) and hence for establishing the convergence rate of \(\big \Vert \nabla f\big (\overline{\theta }^t\big )\big \Vert\).
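Lemma 1 presupposes the contractive property (1) of the aggregation rules \(\mathcal{A}_n\). Without reproducing the paper's specific rules, the coordinate-wise median illustrates why such rules resist a minority of Byzantine messages: whenever honest neighbors form a strict majority, the aggregate stays inside the honest range in every coordinate. A minimal sketch (names are ours):

```python
from statistics import median

def coordwise_median(vectors):
    # Aggregate a list of parameter vectors coordinate by coordinate.
    return [median(col) for col in zip(*vectors)]

honest = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]   # 3 honest messages
byzantine = [[1e9, -1e9], [-1e9, 1e9]]          # 2 arbitrary messages
agg = coordwise_median(honest + byzantine)       # -> [1.0, 2.0]
```

In each coordinate the two extreme Byzantine values are outvoted, so the median lands on an honest value.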
Proof of Lemma 1:
Observe that
With Assumptions 1 and 3, the second term on the RHS of (4) can be bounded by
Substituting (5) and \(\alpha \le \frac{1}{6L}\) into (4) yields
The derivation from (4) to (6) still holds if we replace \(\max _{n \in \mathcal {N}}\) with \(N^{-1}\sum _{n\in \mathcal{N}}\), and (3) follows immediately. \(\square\)
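Two elementary facts recur in the proof of Theorem 1 below: the polarization identity invoked around (9) and the Young-type bound \(\Vert \theta _1+\theta _2+\theta _3\Vert ^2 \le \frac{1}{1-u}\Vert \theta _1\Vert ^2+\frac{2}{u}\Vert \theta _2\Vert ^2+\frac{2}{u}\Vert \theta _3\Vert ^2\) invoked after (18). Both can be sanity-checked numerically; the helper names here are ours:

```python
def sq_norm(u):
    # squared Euclidean norm ||u||^2
    return sum(x * x for x in u)

def inner(u, v):
    # Euclidean inner product <u, v>
    return sum(x * y for x, y in zip(u, v))

def polarization_gap(u, v):
    # <u, v> minus (||u+v||^2 - ||u||^2 - ||v||^2)/2; zero up to rounding
    s = [x + y for x, y in zip(u, v)]
    return inner(u, v) - 0.5 * (sq_norm(s) - sq_norm(u) - sq_norm(v))

def three_term_slack(t1, t2, t3, u):
    # RHS minus LHS of the Young-type bound; nonnegative for u in (0, 1)
    s = [x + y + z for x, y, z in zip(t1, t2, t3)]
    return (sq_norm(t1) / (1 - u)
            + 2 * (sq_norm(t2) + sq_norm(t3)) / u
            - sq_norm(s))
```

Evaluating `three_term_slack` over a grid of \(u\) values confirms the slack is nonnegative for any fixed vectors, as the inequality asserts.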
Define the average and maximal consensus errors as
respectively. To simultaneously analyze the errors, we consider
Proof of Theorem 1:
By the L-smoothness in Assumption 1, we have
With the equality \(\langle \theta _1, \theta _2\rangle =\frac{1}{2}\Vert \theta _1+\theta _2\Vert ^2-\frac{1}{2}\Vert \theta _1\Vert ^2-\frac{1}{2}\Vert \theta _2\Vert ^2\), we can rewrite the second term on the RHS of (8) as
Because \(\alpha \le \frac{1}{6L}\), substituting (9) into (8) yields
According to the update rule, we expand the second term on the RHS of (10) as
Denote the \(\ell _2\) norms of the three terms on the RHS of (11) by \(T_1, T_2\) and \(T_3\). We establish their upper bounds as follows.
Upper bound of \(T_1\). For \(T_1\), it holds that
Further, according to Assumption 1, we have
With this inequality, \(T_1\) can be bounded by
Upper bound of \(T_2\). By \(\overline{\theta }^{t+\frac{1}{2}}=N^{-1}\sum _{n\in \mathcal{N}}\overline{\theta }^t_n-\alpha N^{-1} \sum _{n \in \mathcal{N}} \nabla f_n\big (\overline{\theta }^t_n\big )\), we can rewrite the second term on the RHS of (11) as
Stacking all local models into \(\widehat{\Theta }\) and applying the Cauchy–Schwarz inequality, we have
Upper bound of \(T_3\). From the contractive property of robust aggregation rules \(\big \{\mathcal {A}_n\big \}_{n \in \mathcal {N}}\), \(T_3\) can be bounded by
With inequality
we have
According to the derived upper bounds (12), (13) and (14), from (11) we have
Plugging (2) and (3) in Lemma 1 into (15) yields
Reorganizing the terms in (10) and then substituting (16), we have
Averaging (17) over \(t=1, \ldots , T\) gives
Since \(H^t=\frac{1}{2} (H_\beta ^t+ H_\lambda ^t)\) by definition, we further show the convergence of \(H^t\). Writing the maximal consensus error in matrix form, we know that for any \(u \in (0,1)\), it holds that
where the inequality comes from \(\Vert \theta _1+\theta _2+\theta _3\Vert ^2 \le \frac{1}{1-u}\Vert \theta _1\Vert ^2+\frac{2}{u}\Vert \theta _2\Vert ^2+\frac{2}{u}\Vert \theta _3\Vert ^2\). Since \(W\) is a row-stochastic matrix, \(W \textbf{1}=\textbf{1}\), with which the first term on the RHS of (18) can be bounded by
With the contractive property (1) of the robust aggregation rules \(\big \{\mathcal {A}_n\big \}_{n \in \mathcal {N}}\), we can bound the second term on the RHS of (18) by
A similar technique can be applied to the third term on the RHS of (18) to get
where the first equality holds because all rows of the matrix \(N^{-1} \textbf{1}\textbf{1}^{T}\widehat{\Theta }^{t+1}-N^{-1} \textbf{1}\textbf{1}^{T} W \widehat{\Theta }^{t+\frac{1}{2}}\) are identical. Substituting (19)–(21) back into (18), we have
Applying Lemma 1, we have
With similar techniques, we can also prove that
Then we conclude that
Now we designate constants such that the RHS of (22) involves only \(H^t\) and an \(O\big (\delta ^2\big )\) term. Choose a proper \(u\) such that
which means
Since \(\rho < \sqrt{\frac{\beta +\lambda -1}{32}}\), we have \(0<\gamma _1<1\). Applying telescopic cancellation to (24) from \(t=0\) to \(k\), we deduce that
which completes the proof. \(\square\)
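The final telescoping step has a simple numerical counterpart: a recursion of the form \(H^{t+1} \le \gamma _1 H^t + c\) with \(0<\gamma _1<1\) decays geometrically toward an error floor \(c/(1-\gamma _1)\), which is how the \(O\big (\delta ^2\big )\) term survives in the limit. A worst-case (equality) unrolling with hypothetical constants:

```python
def unroll(h0, gamma, c, k):
    # Worst case of H^{t+1} <= gamma * H^t + c, taken with equality,
    # so that h_k = gamma^k * h0 + c * (1 - gamma^k) / (1 - gamma).
    h = h0
    for _ in range(k):
        h = gamma * h + c
    return h
```

For any \(k\), the unrolled value is bounded by \(\gamma ^k h_0 + c/(1-\gamma )\), matching the telescoped bound in the proof.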
Yang, Y., Wang, L. Byzantine-resilient decentralized network learning. J. Korean Stat. Soc. 53, 349–380 (2024). https://doi.org/10.1007/s42952-023-00249-w