Why This Matters

Scientific computing clusters suffer from both transient and persistent failures that degrade application performance and cause significant economic losses. The innovation of this work is applying model-based design to cluster reliability, enabling mitigation policies to be specified and verified for correctness. This systematic approach bridges autonomous fault-management concepts and their practical implementation on distributed systems.

What We Did

This paper presents a model-based autonomic reliability framework for computing clusters that periodically monitors vital health parameters and provides automated fault mitigation. The work develops fault-prediction techniques that analyze correlations between system parameters such as CPU utilization and temperature, enabling proactive mitigation before failures occur. The framework combines discrete event-based scheduling with feedback control mechanisms to manage sensor execution across cluster nodes.
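The correlation-based prediction idea can be illustrated with a minimal sketch. This is not SCARF's actual implementation; the window data, thresholds (`corr_floor`, `temp_ceiling`), and function names below are illustrative assumptions. The intuition: on a healthy node, temperature tracks CPU load, so a node that is hot while its temperature is decoupled from utilization suggests a cooling or hardware fault and is a candidate for proactive mitigation.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def flag_anomalous_node(cpu_util, temperature,
                        corr_floor=0.6, temp_ceiling=75.0):
    """Flag a node whose temperature is high yet weakly correlated
    with its CPU utilization over the monitoring window: heat that
    load does not explain hints at a fault, not a busy node.
    (Thresholds are illustrative, not from the paper.)"""
    r = pearson(cpu_util, temperature)
    hot = max(temperature) > temp_ceiling
    return hot and r < corr_floor
```

A busy-but-healthy window (temperature rising and falling with load) passes, while a nearly idle node whose temperature climbs steadily would be flagged for mitigation before an outright failure.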

Key Results

The framework demonstrates model-based reliability engineering on Lattice Quantum Chromodynamics (LQCD) clusters of 127-600 computing nodes across multiple systems. The monitoring and mitigation components detect and respond to a range of failure modes, including power outages, hardware failures, and non-responsive jobs. Experimental data illustrates the effectiveness of the approach in maintaining cluster productivity despite hardware and software faults.

Full Abstract

Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromo Dynamics project at Fermi Lab.

Cite This Paper

@inproceedings{Dubey2008,
  author = {Dubey, Abhishek and Neema, Sandeep and Kowalkowski, Jim and Singh, Amitoj},
  booktitle = {Fourth International Conference on e-Science, e-Science 2008, 7-12 December 2008, Indianapolis, IN, {USA}},
  title = {Scientific Computing Autonomic Reliability Framework},
  year = {2008},
  pages = {352--353},
  abstract = {Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromo Dynamics project at Fermi Lab.},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/bib/conf/eScience/DubeyNKS08},
  category = {poster},
  contribution = {lead},
  doi = {10.1109/eScience.2008.113},
  file = {:Dubey2008-Scientific_Computing_Autonomic_Reliability_Framework.pdf:PDF},
  keywords = {cluster reliability, autonomous systems, fault tolerance, model-based design, monitoring, mitigation strategies, distributed computing},
  project = {cps-middleware,cps-reliability},
  timestamp = {Wed, 16 Oct 2019 14:14:49 +0200},
  url = {https://doi.org/10.1109/eScience.2008.113}
}
Quick Info
Year 2008
Keywords
cluster reliability, autonomous systems, fault tolerance, model-based design, monitoring, mitigation strategies, distributed computing
Research Areas
CPS, scalable AI, emergency