Why This Matters

Scientific computing clusters suffer from both transient and persistent failures that degrade application performance and cause significant economic losses. The innovation of this work is applying model-based design to cluster reliability, enabling mitigation policies to be specified and verified for correctness. This systematic approach bridges autonomous fault-management concepts and their practical implementation on distributed systems.

What We Did

This paper presents a model-based autonomic reliability framework for computing clusters that periodically monitors vital health parameters and provides automated fault mitigation. The work develops fault-prediction techniques that analyze correlations between system parameters such as CPU utilization and temperature, enabling proactive mitigation before failures occur. The framework combines discrete event-based scheduling with feedback control mechanisms to manage sensor execution across cluster nodes.
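The correlation-based prediction idea can be illustrated with a minimal sketch. This is not SCARF's actual implementation; the window data, thresholds (`corr_floor`, `temp_ceiling`), and function names below are illustrative assumptions. The intuition: on a healthy node, temperature tracks CPU load, so a node that is hot while its temperature is decoupled from utilization suggests a cooling or hardware fault and is a candidate for proactive mitigation.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def flag_anomalous_node(cpu_util, temperature,
                        corr_floor=0.6, temp_ceiling=75.0):
    """Flag a node whose temperature is high yet weakly correlated
    with its CPU utilization over the monitoring window: heat that
    load does not explain hints at a fault, not a busy node.
    (Thresholds are illustrative, not from the paper.)"""
    r = pearson(cpu_util, temperature)
    hot = max(temperature) > temp_ceiling
    return hot and r < corr_floor
```

A busy-but-healthy window (temperature rising and falling with load) passes, while a nearly idle node whose temperature climbs steadily would be flagged for mitigation before an outright failure.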

Key Results

The framework demonstrates model-based reliability engineering on Lattice Quantum Chromodynamics (LQCD) clusters of 127-600 computing nodes across multiple systems. The monitoring and mitigation components detect and respond to a range of failure modes, including power outages, hardware failures, and non-responsive jobs. Experimental data illustrates the effectiveness of the approach in maintaining cluster productivity despite hardware and software faults.

Full Abstract

Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromo Dynamics project at Fermi Lab.

Cite This Paper

@inproceedings{Dubey2008,
  author = {Dubey, Abhishek and Neema, Sandeep and Kowalkowski, Jim and Singh, Amitoj},
  booktitle = {Fourth International Conference on e-Science, e-Science 2008, 7-12 December 2008, Indianapolis, IN, {USA}},
  title = {Scientific Computing Autonomic Reliability Framework},
  year = {2008},
  pages = {352--353},
  abstract = {Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromo Dynamics project at Fermi Lab.},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/bib/conf/eScience/DubeyNKS08},
  category = {poster},
  contribution = {lead},
  doi = {10.1109/eScience.2008.113},
  file = {:Dubey2008-Scientific_Computing_Autonomic_Reliability_Framework.pdf:PDF},
  keywords = {cluster reliability, autonomous systems, fault tolerance, model-based design, monitoring, mitigation strategies, distributed computing},
  project = {cps-middleware,cps-reliability},
  timestamp = {Wed, 16 Oct 2019 14:14:49 +0200},
  url = {https://doi.org/10.1109/eScience.2008.113}
}
Quick Info
Year 2008
Keywords
cluster reliability, autonomous systems, fault tolerance, model-based design, monitoring, mitigation strategies, distributed computing
Research Areas
CPS, scalable AI, emergency