Abstract
Reliability has become a first-class design consideration for architects, along with performance and energy efficiency. Continued technology scaling and the resulting supply-voltage reductions, together with temperature fluctuations, increase the susceptibility of architectures to errors. Previous approaches have tried to provide fault tolerance. However, they usually present critical drawbacks in terms of either hardware duplication or performance degradation, which are unacceptable for the majority of common users.
RMT (Redundant Multi-Threading) is a family of techniques based on SMT processors in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions in order to detect faults by checking their outputs. In this paper, we study the under-explored architectural support that RMT techniques need to reliably execute shared-memory applications. We show how atomic operations induce serialization points between master and slave threads. This bottleneck has a 34% impact on execution time for several parallel scientific benchmarks. To address this issue, we present REPAS (Reliable execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism that provides reliable execution for shared-memory applications.
Whereas previous proposals achieve the same goal with a large amount of extra hardware (usually twice the number of cores in the system), the REPAS architecture needs only a small amount of extra hardware, since the redundant execution takes place within 2-way SMT cores in which most of the hardware is shared. Our evaluation shows that REPAS provides full coverage against soft errors with a lower performance slowdown, relative to a non-redundant system, than previous proposals, while also using fewer hardware resources.
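The output-checking idea behind RMT described above can be illustrated with a minimal software sketch. Everything here is an assumption made for illustration: the toy instruction set, the function names `run` and `redundant_run`, and the single bit flip used to model a soft error are hypothetical, and the two executions run one after the other rather than as SMT threads. It is not the REPAS mechanism itself.

```python
from typing import List, Optional, Tuple

# Toy instruction: (opcode, operand). Hypothetical ISA, for illustration only.
Instr = Tuple[str, int]


def run(program: List[Instr], x: int, flip_at: Optional[int] = None) -> List[int]:
    """Execute a toy program and return the values it would store to memory.

    If flip_at is set, the lowest bit of the accumulator is flipped right
    after the instruction at that index, modelling a transient fault.
    """
    acc = x
    stores: List[int] = []
    for i, (op, arg) in enumerate(program):
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op == "store":
            stores.append(acc)  # the operand of "store" is ignored
        else:
            raise ValueError(f"unknown opcode: {op}")
        if i == flip_at:
            acc ^= 1  # injected single-bit error
    return stores


def redundant_run(program: List[Instr], x: int,
                  flip_at: Optional[int] = None) -> Optional[int]:
    """Run a fault-free 'master' and a possibly faulty 'slave' on the same
    input and compare their stores. Returns the index of the first
    mismatching store, or None if all outputs agree."""
    master = run(program, x)
    slave = run(program, x, flip_at)
    for idx, (m, s) in enumerate(zip(master, slave)):
        if m != s:
            return idx
    return None


program = [("add", 3), ("store", 0), ("mul", 2), ("store", 0)]
print(redundant_run(program, 1))             # no fault injected
print(redundant_run(program, 1, flip_at=0))  # fault before the first store
print(redundant_run(program, 1, flip_at=3))  # fault after the last store
```

In this sketch a fault is reported only when it reaches a compared output; a flip injected after the last store changes no output and goes unnoticed, which is harmless here because it never becomes visible outside the redundant pair.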
This work has been jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 05831/PI/07, and by the Spanish MEC and European Commission FEDER funds under grants “Consolider Ingenio-2010 CSD2006-00046” and “TIN2006-15516-C04-03”.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sánchez, D., Aragón, J.L., García, J.M. (2009). REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_32
DOI: https://doi.org/10.1007/978-3-642-03869-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science, Computer Science (R0)