Abstract
Hadoop is the industry's de facto tool for Big Data computation. Its native fault tolerance procedure, however, is slow and leads to performance degradation, and it fails to fully account for computational overhead and storage cost. Moreover, the dynamic nature and complexity of MapReduce are further parameters that affect job response time. A robust failure-handling technique is therefore essential. In this paper, we analyze notable reactive fault tolerance techniques and measure their impact under different performance metrics, variable datasets, and variable fault injections. The results show that, in terms of response time, the byzantine fault tolerance technique outperforms the retrying and checkpointing techniques when a single node is killed. In terms of throughput, task-level byzantine fault tolerance again ranks highest compared with checkpointing and retrying under network-disconnect failures. Overall, this comparative study highlights the strengths and weaknesses of the different fault-tolerant techniques and helps determine the best technique for a given environment.
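The two conventional reactive techniques compared above differ in what happens after a fault: retrying re-executes the failed task from scratch (as Hadoop does, governed by the real configuration properties `mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts`), while checkpointing resumes from the last persisted state. The following minimal Python sketch, with hypothetical helper names and an injected fault, illustrates the difference; it is a simplified model, not Hadoop's actual implementation:

```python
def run_with_retry(task, max_attempts=4):
    """Retrying: on failure, re-execute the whole task from the beginning.
    Mirrors Hadoop's default behaviour (cf. mapreduce.map.maxattempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # attempts exhausted: the job fails


def run_with_checkpoint(steps, initial_state, max_attempts=4):
    """Checkpointing: record progress after each step so a failed attempt
    resumes from the last completed step instead of restarting."""
    checkpoint = {"step": 0, "state": initial_state}  # stands in for persistent storage
    for _ in range(max_attempts):
        try:
            for i in range(checkpoint["step"], len(steps)):
                checkpoint["state"] = steps[i](checkpoint["state"])
                checkpoint["step"] = i + 1  # commit progress
            return checkpoint["state"]
        except RuntimeError:
            continue  # retry, but resume from checkpoint["step"]
    raise RuntimeError("task failed after all attempts")


if __name__ == "__main__":
    calls = {"inc1": 0, "flaky": 0}

    def inc1(s):
        calls["inc1"] += 1
        return s + 1

    def flaky(s):
        calls["flaky"] += 1
        if calls["flaky"] == 1:
            raise RuntimeError("injected fault")  # simulated node failure
        return s + 2

    result = run_with_checkpoint([inc1, flaky, lambda s: s + 3], 0)
    print(result, calls["inc1"])  # step inc1 is NOT re-executed after the fault
```

Under retrying, the fault would force `inc1` to run twice; with checkpointing it runs once, which is the recovery-cost difference the response-time comparison in this study quantifies. Byzantine fault tolerance goes further by running replicas and voting on outputs, which detects corrupted results rather than only crashes.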
[Figures 1–18 of the published article appear here; captions are not available in this extract.]
Change history
11 February 2021
A Correction to this paper has been published: https://doi.org/10.1007/s11227-021-03651-5
Cite this article
Asghar, H., Nazir, B. Analysis and implementation of reactive fault tolerance techniques in Hadoop: a comparative study. J Supercomput 77, 7184–7210 (2021). https://doi.org/10.1007/s11227-020-03491-9