Abstract
Decentralized federated learning with fully reliable nodes has drawn attention in modern statistical learning. However, owing to data corruption, device malfunction, malicious attacks and other unexpected behaviors, not all nodes follow the prescribed estimation process, and existing decentralized federated learning methods may fail. An unknown number of abnormal nodes, called Byzantine nodes, deviate arbitrarily from their intended behavior, send wrong messages to their neighbors and, by passing polluted messages, affect honest nodes across the entire network. In this paper, we focus on decentralized federated learning in the presence of Byzantine attacks and propose a unified Byzantine-resilient framework based on network gradient descent and several robust aggregation rules. Theoretically, the convergence of the proposed algorithm is guaranteed under weakly balanced conditions on the network structure. The finite-sample performance is studied through simulations under different network topologies and various Byzantine attacks. An application to the Communities and Crime data is also presented.
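As a concrete illustration of the setting described above — a sketch, not the paper's exact algorithm — the following toy example runs network gradient descent in which each node takes a local gradient step and then aggregates its neighbors' messages with a trimmed mean, one common robust aggregation rule. The quadratic local losses, the complete-graph topology and all names here are our own hypothetical choices.

```python
def trimmed_mean(values, b):
    # Robust aggregation: drop the b smallest and b largest messages,
    # then average the rest (scalars suffice for this illustration).
    s = sorted(values)
    kept = s[b:len(s) - b]
    return sum(kept) / len(kept)

def run(T=200, alpha=0.1, b=1):
    # Five nodes on a complete graph; node 4 is Byzantine and always
    # broadcasts a huge value. Honest node n minimizes (theta - c[n])^2 / 2.
    c = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
    theta = {n: 0.0 for n in range(5)}
    neighbors = {n: range(5) for n in range(5)}
    for _ in range(T):
        # local gradient half-step by each honest node
        half = {n: theta[n] - alpha * (theta[n] - c[n]) for n in c}
        half[4] = 1e6  # the Byzantine message
        # robust aggregation over each node's neighborhood
        theta = {n: trimmed_mean([half[m] for m in neighbors[n]], b)
                 for n in range(5)}
    return theta
```

Despite the attacker, the honest nodes reach consensus at a point inside the honest range [1, 4]; with plain averaging (b = 0), the 1e6 message would drag every iterate far away.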
Data availability
The real dataset used in this study is publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized), as described in the “Real data” section.
Acknowledgements
We are grateful to the Editor, an Associate Editor and two anonymous referees for their insightful comments and suggestions on this article, which have led to significant improvements. Lei Wang’s research was supported by the Fundamental Research Funds for the Central Universities and the National Natural Science Foundation of China (12271272, 12001295).
Appendix
Lemma 1
Assume that \(\{\mathcal{A}_n\}_{n\in \mathcal{N}}\) in Algorithm 1 satisfy (1). Under A1–A3 and a constant step size \(\alpha \le 1/(6L)\), it holds that
Lemma 1 characterizes the connection between \(\big \Vert \widehat{\theta }_n^{t+\frac{1}{2}}-\overline{\theta }^{t+\frac{1}{2}}\big \Vert ^2\) and \(\big \Vert \widehat{\theta }_n^{t}-\overline{\theta }^{t}\big \Vert ^2\), which is pivotal for relating \(\widehat{\theta }_n^{t+1}\) to \(\widehat{\theta }_n^{t}\) and hence for establishing the convergence rate of \(\big \Vert \nabla f\big (\overline{\theta }^t\big )\big \Vert\).
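Lemma 1 presupposes the contractive property (1) of the aggregation rules \(\mathcal{A}_n\). Without reproducing the paper's specific rules, the coordinate-wise median illustrates why such rules resist a minority of Byzantine messages: whenever honest neighbors form a strict majority, the aggregate stays inside the honest range in every coordinate. A minimal sketch (names are ours):

```python
from statistics import median

def coordwise_median(vectors):
    # Aggregate a list of parameter vectors coordinate by coordinate.
    return [median(col) for col in zip(*vectors)]

honest = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]   # 3 honest messages
byzantine = [[1e9, -1e9], [-1e9, 1e9]]          # 2 arbitrary messages
agg = coordwise_median(honest + byzantine)       # -> [1.0, 2.0]
```

In each coordinate the two extreme Byzantine values are outvoted, so the median lands on an honest value.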
Proof of Lemma 1:
Observe that
With Assumptions 1 and 3, the second term on the RHS of (4) can be bounded by
Substituting (5) and \(\alpha \le \frac{1}{6L}\) into (4) yields
The derivation from (4) to (6) still holds if we replace \(\max _{n \in \mathcal {N}}\) with \(N^{-1}\sum _{n\in \mathcal{N}}\), and (3) follows immediately. \(\square\)
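Two elementary facts recur in the proof of Theorem 1 below: the polarization identity invoked around (9) and the Young-type bound \(\Vert \theta _1+\theta _2+\theta _3\Vert ^2 \le \frac{1}{1-u}\Vert \theta _1\Vert ^2+\frac{2}{u}\Vert \theta _2\Vert ^2+\frac{2}{u}\Vert \theta _3\Vert ^2\) invoked after (18). Both can be sanity-checked numerically; the helper names here are ours:

```python
def sq_norm(u):
    # squared Euclidean norm ||u||^2
    return sum(x * x for x in u)

def inner(u, v):
    # Euclidean inner product <u, v>
    return sum(x * y for x, y in zip(u, v))

def polarization_gap(u, v):
    # <u, v> minus (||u+v||^2 - ||u||^2 - ||v||^2)/2; zero up to rounding
    s = [x + y for x, y in zip(u, v)]
    return inner(u, v) - 0.5 * (sq_norm(s) - sq_norm(u) - sq_norm(v))

def three_term_slack(t1, t2, t3, u):
    # RHS minus LHS of the Young-type bound; nonnegative for u in (0, 1)
    s = [x + y + z for x, y, z in zip(t1, t2, t3)]
    return (sq_norm(t1) / (1 - u)
            + 2 * (sq_norm(t2) + sq_norm(t3)) / u
            - sq_norm(s))
```

Evaluating `three_term_slack` over a grid of \(u\) values confirms the slack is nonnegative for any fixed vectors, as the inequality asserts.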
Define the average and maximal consensus errors as
respectively. To simultaneously analyze the errors, we consider
Proof of Theorem 1:
By the L-smoothness in Assumption 1, we have
With the equality \(\langle \theta _1, \theta _2\rangle =\frac{1}{2}\Vert \theta _1+\theta _2\Vert ^2-\frac{1}{2}\Vert \theta _1\Vert ^2-\frac{1}{2}\Vert \theta _2\Vert ^2\), we can rewrite the second term on the RHS of (8) as
Because \(\alpha \le \frac{1}{6L}\), substituting (9) into (8) yields
According to the update rule, we expand the second term on the RHS of (10) as
Denote the \(\ell _2\) norms of the three terms on the RHS of (11) by \(T_1, T_2\) and \(T_3\). We establish their upper bounds as follows.
Upper bound of \(T_1\). For \(T_1\), it holds that
Further, according to Assumption 1, we have
With this inequality, \(T_1\) can be bounded by
Upper bound of \(T_2\). By \(\overline{\theta }^{t+\frac{1}{2}}=N^{-1}\sum _{n\in \mathcal{N}}\overline{\theta }^t_n-\alpha N^{-1} \sum _{n \in \mathcal{N}} \nabla f_n\big (\overline{\theta }^t_n\big )\), we can rewrite the second term on the RHS of (11) as
Stacking all local models into \(\widehat{\Theta }\) and applying the Cauchy–Schwarz inequality, we have
Upper bound of \(T_3\). From the contractive property of robust aggregation rules \(\big \{\mathcal {A}_n\big \}_{n \in \mathcal {N}}\), \(T_3\) can be bounded by
With inequality
we have
According to the derived upper bounds (12), (13) and (14), from (11) we have
Plugging (2) and (3) in Lemma 1 into (15) yields
Reorganizing the terms in (10) and then substituting (16), we have
Averaging (17) over \(t=1, \ldots , T\) gives
Since \(H^t=\frac{1}{2} (H_\beta ^t+ H_\lambda ^t)\) by definition, we further show the convergence of \(H^t\). Writing the maximal consensus error in matrix form, we know that for any \(u \in (0,1)\), it holds that
where the inequality comes from \(\Vert \theta _1+\theta _2+\theta _3\Vert ^2 \le \frac{1}{1-u}\Vert \theta _1\Vert ^2+\frac{2}{u}\Vert \theta _2\Vert ^2+\frac{2}{u}\Vert \theta _3\Vert ^2\). Since \(W\) is a row-stochastic matrix, \(W \textbf{1}=\textbf{1}\), with which the first term on the RHS of (18) can be bounded by
With the contractive property (1) of the robust aggregation rules \(\big \{\mathcal {A}_n\big \}_{n \in \mathcal {N}}\), we can bound the second term on the RHS of (18) by
A similar technique can be applied to the third term on the RHS of (18) to get
where the first equality holds because all rows of the matrix \(N^{-1} \textbf{1}\textbf{1}^{T}\widehat{\Theta }^{t+1}-N^{-1} \textbf{1}\textbf{1}^{T} W \widehat{\Theta }^{t+\frac{1}{2}}\) are identical. Substituting (19)–(21) back into (18), we have
Applying Lemma 1, we have
With similar techniques, we can also prove that
Then we conclude that
Now we designate constants such that the RHS of (22) involves only \(H^t\) and an \(O\big (\delta ^2\big )\) term. Choose a proper \(u\) such that
which means
Since \(\rho < \sqrt{\frac{\beta +\lambda -1}{32}}\), we have \(0<\gamma _1<1\). Applying telescopic cancellation to (24) from \(t=0\) to \(k\), we deduce that
which completes the proof. \(\square\)
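The final telescoping step has a simple numerical counterpart: a recursion of the form \(H^{t+1} \le \gamma _1 H^t + c\) with \(0<\gamma _1<1\) decays geometrically toward an error floor \(c/(1-\gamma _1)\), which is how the \(O\big (\delta ^2\big )\) term survives in the limit. A worst-case (equality) unrolling with hypothetical constants:

```python
def unroll(h0, gamma, c, k):
    # Worst case of H^{t+1} <= gamma * H^t + c, taken with equality,
    # so that h_k = gamma^k * h0 + c * (1 - gamma^k) / (1 - gamma).
    h = h0
    for _ in range(k):
        h = gamma * h + c
    return h
```

For any \(k\), the unrolled value is bounded by \(\gamma ^k h_0 + c/(1-\gamma )\), matching the telescoped bound in the proof.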
Yang, Y., Wang, L. Byzantine-resilient decentralized network learning. J. Korean Stat. Soc. 53, 349–380 (2024). https://doi.org/10.1007/s42952-023-00249-w