A Multi-source Domain Adaption Approach to Minority Disk Failure Prediction

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14488)

Abstract

Frequent disk failures undermine the reliability of storage systems: they can cause performance jitter or even data loss for services and thus seriously threaten quality of service. Although many machine learning and deep learning-based disk failure prediction approaches have been proposed to prevent system breakdown caused by unexpected disk failures, they achieve high performance only under the assumption that the disk model has plenty of samples, especially failure samples. However, new disk models continuously appear in data centers as disk manufacturing technology evolves and storage system capacity expands. Because they have been deployed only for a short time, these disk models have few failure samples and are called minority disks. Minority disks are widespread in large-scale data centers and account for a large number of disks, yet existing approaches cannot reach satisfactory performance on them because failure samples are lacking. Worse still, failure prediction models trained on other disk models cannot be applied directly to minority disks either, because of the distribution shift that commonly exists among disk models. In this work, we propose DiskDA, a novel multi-source domain adaption-based solution that can fully utilize knowledge from other disk models to predict failures for minority disks that have no failure samples. Our experimental results on real-world datasets show that DiskDA outperforms previous approaches on minority disks with a few failure samples. Moreover, DiskDA adapts well to minority disks with no failure samples, on which previous approaches are unusable.

Notes

  1. https://www.backblaze.com/b2/hard-drive-test-data.html

References

  1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223. PMLR (2017)

  2. Botezatu, M.M., Giurgiu, I., Bogojeska, J., Wiesmann, D.: Predicting disk replacement towards reliable data centers. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48 (2016)

  3. Chakraborttii, C., Litz, H.: Improving the accuracy, adaptability, and interpretability of SSD failure prediction models. In: Proceedings of the 11th ACM Symposium on Cloud Computing, pp. 120–133 (2020)

  4. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)

  5. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003)

  6. Jiang, T., Zeng, J., Zhou, K., Huang, P., Yang, T.: Lifelong disk failure prediction via GAN-based anomaly detection. In: 2019 IEEE 37th International Conference on Computer Design (ICCD), pp. 199–207. IEEE (2019)

  7. Jiang, W., Hu, C., Zhou, Y., Kanevsky, A.: Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. ACM Trans. Storage (TOS) 4(3), 1–25 (2008)

  8. Johnson, R., Zhang, T.: Learning nonlinear functions using regularized greedy forest. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 942–954 (2013)

  9. Lan, X., et al.: Adversarial domain adaptation with correlation-based association networks for longitudinal disk fault prediction. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2021)

  10. Li, J., et al.: Hard drive failure prediction using classification and regression trees. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 383–394. IEEE (2014)

  11. Lu, S., Luo, B., Patel, T., Yao, Y., Tiwari, D., Shi, W.: Making disk failure predictions smarter! In: FAST, pp. 151–167 (2020)

  12. Mikolov, T., Kombrink, S., Burget, L., Černocký, J., Khudanpur, S.: Extensions of recurrent neural network language model. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE (2011)

  13. Schroeder, B., Gibson, G.A.: Understanding disk failure rates: what does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage (TOS) 3(3), 8-es (2007)

  14. Shen, J., Qu, Y., Zhang, W., Yu, Y.: Wasserstein distance guided representation learning for domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  15. Sun, X., et al.: System-level hardware failure prediction using deep learning. In: Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6 (2019)

  16. Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 193–204 (2010)

  17. Wang, Y., Miao, Q., Ma, E.W., Tsui, K.L., Pecht, M.G.: Online anomaly detection for hard disk drives based on Mahalanobis distance. IEEE Trans. Reliab. 62(1), 136–145 (2013)

  18. Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. 11(5), 1–46 (2020)

  19. Xie, Y., Feng, D., Wang, F., Zhang, X., Han, J., Tang, X.: OME: an optimized modeling engine for disk failure prediction in heterogeneous datacenter. In: 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 561–564. IEEE (2018)

  20. Xu, C., Wang, G., Liu, X., Guo, D., Liu, T.Y.: Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65(11), 3502–3508 (2016)

  21. Xu, F., Han, S., Lee, P.P., Liu, Y., He, C., Liu, J.: General feature selection for failure prediction in large-scale SSD deployment. In: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 263–270. IEEE (2021)

  22. Yang, W., Hu, D., Liu, Y., Wang, S., Jiang, T.: Hard drive failure prediction using big data. In: 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW), pp. 13–18. IEEE (2015)

  23. Zhang, J., Huang, P., Zhou, K., Xie, M., Schelter, S.: HDDSE: enabling high-dimensional disk state embedding for generic failure detection system of heterogeneous disks in large data centers. In: Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, pp. 111–126 (2020)

  24. Zhang, J., et al.: Minority disk failure prediction based on transfer learning in large data centers of heterogeneous disk systems. IEEE Trans. Parallel Distrib. Syst. 31(9), 2155–2169 (2020)

  25. Zhou, H., et al.: A proactive failure tolerant mechanism for SSDs storage systems based on unsupervised learning. In: 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQoS), pp. 1–10. IEEE (2021)

Author information

Corresponding author

Correspondence to Xuehai Tang.

7 Appendix

Proof. DiskDA measures the discrepancy between the source and target domains using the Wasserstein distance. Specifically, the \(p\)-th Wasserstein distance between two Borel probability measures \(\mathbb {P}\) and \(\mathbb {Q}\) is defined as:

$$\begin{aligned} W_{p}(\mathbb {P},\mathbb {Q}) = \left( \inf _{\mu \in \Gamma (\mathbb {P},\mathbb {Q})}\int \rho (x,y)^{p}\,d\mu (x,y)\right) ^{1/p} \end{aligned}$$
(9)

where \(\Gamma (\mathbb {P},\mathbb {Q})\) is the set of all joint distributions \(\mu (x,y)\) whose marginals are \(\mathbb {P}\) and \(\mathbb {Q}\). A joint distribution \(\mu (x,y)\) can be viewed as a policy for transporting a unit quantity of material from x to y, and \(\rho (x,y)\) is the corresponding cost. The Wasserstein distance between \(\mathbb {P}\) and \(\mathbb {Q}\) therefore represents the minimum expected transport cost. Because the Wasserstein distance satisfies the triangle inequality, the following inequality holds:

$$\begin{aligned} W_p(\mathbb {P}_{s},\mathbb {P}_{t})\le W_{p}(\mathbb {P}_{s},\mathbb {P}_{t_{H}})+W_{p}(\mathbb {P}_{t_{H}},\mathbb {P}_{t}) \end{aligned}$$
(10)
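
For intuition, the distance in (9) and the triangle inequality in (10) can be checked numerically. The following is a minimal Python sketch, not part of DiskDA: it assumes one-dimensional samples of equal size and the cost \(\rho (x,y)=|x-y|\), in which case the optimal transport plan simply pairs the sorted samples. The function name `wasserstein_1d` and the three synthetic Gaussian samples standing in for \(\mathbb {P}_{s}\), \(\mathbb {P}_{t_{H}}\) and \(\mathbb {P}_{t}\) are illustrative assumptions, not data from the paper.

```python
import random


def wasserstein_1d(xs, ys, p=1):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    Assumes the cost rho(x, y) = |x - y|. For one-dimensional samples of
    equal size, the optimal plan matches the i-th smallest x to the i-th
    smallest y, so no explicit search over transport plans is needed.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("expected two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return mean_cost ** (1.0 / p)


# Synthetic stand-ins (assumptions for illustration, not SMART data).
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(1000)]        # plays P_s
intermediate = [rng.gauss(0.5, 1.0) for _ in range(1000)]  # plays P_{t_H}
target = [rng.gauss(1.5, 1.2) for _ in range(1000)]        # plays P_t

w_st = wasserstein_1d(source, target)
w_sh = wasserstein_1d(source, intermediate)
w_ht = wasserstein_1d(intermediate, target)

# Inequality (10): the direct distance never exceeds the two-leg path.
print(f"W1(s, t)             = {w_st:.4f}")
print(f"W1(s, tH) + W1(tH,t) = {w_sh + w_ht:.4f}")
print("triangle inequality holds:", w_st <= w_sh + w_ht + 1e-9)
```

In higher dimensions the sorted-pairing shortcut no longer applies, so this sketch only illustrates the definition and the inequality.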

Shen et al. [14] prove that, for unsupervised domain adaption based on the Wasserstein distance, the generalization error of a classification function h in the target domain is bounded as

$$\begin{aligned} \epsilon _{t}(h) \le \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_s,\mathbb {P}_t) + \lambda \end{aligned}$$
(11)

where K is a constant such that all hypotheses h are K-Lipschitz continuous; \(\lambda \) is the combined error of the optimal hypothesis \(h^{*}\), which minimizes the combined error \(\epsilon _{s}(h)+\epsilon _{t}(h)\); and \(\mathbb {P}_s\) and \(\mathbb {P}_t\) are the distributions of the source and target domains, respectively. Let C denote \(2KW_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t})\). Substituting inequality (10) with \(p=1\) into inequality (11) yields Theorem 3.1.
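
The last step can be written out as a chain of inequalities. This is a sketch that takes \(p=1\) in (10); Theorem 3.1 is not reproduced in this appendix, so the final line is the form the bound would take under that reading rather than a quotation of the theorem:

$$\begin{aligned} \epsilon _{t}(h)&\le \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t}) + \lambda \\&\le \epsilon _{s}(h) + 2K\left[ W_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}}) + W_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t})\right] + \lambda \\&= \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}}) + C + \lambda \end{aligned}$$

The first line is (11), the second applies (10) to the middle term, and the third uses the definition of C.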

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, W. et al. (2024). A Multi-source Domain Adaption Approach to Minority Disk Failure Prediction. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, Singapore. https://doi.org/10.1007/978-981-97-0801-7_4

  • DOI: https://doi.org/10.1007/978-981-97-0801-7_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0800-0

  • Online ISBN: 978-981-97-0801-7

  • eBook Packages: Computer Science, Computer Science (R0)
