Why This Matters

Large scientific computing clusters require fault tolerance mechanisms that maintain productivity despite frequent component failures, as downtime directly impacts expensive research campaigns. The innovation lies in applying hierarchical reflex and healing architectures specifically to scientific computing environments, providing model-based fault mitigation that enables both fault prediction and autonomous recovery. This addresses a critical gap between theoretical reliability frameworks and practical deployment needs in high-performance computing.

What We Did

This paper introduces the Scientific Computing Autonomic Reliability Framework (SCARF) designed for large computing clusters used in scientific applications like the Large Hadron Collider. The framework provides hierarchical fault mitigation using reflex engines and includes components for monitoring system health parameters, diagnosing faults through distributed reasoning, and allocating resources optimally during failures. SCARF enables coordinated fault management across cluster nodes with mechanisms for workflow reallocation and predictive failure detection.
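The hierarchical reflex idea described above can be sketched in miniature: node-level reflex engines apply known local mitigations immediately, and escalate faults they cannot handle to a regional manager that coordinates job reallocation. All class names, fault labels, and actions below are illustrative assumptions, not the paper's actual API.

```python
class RegionalManager:
    """Coordinates recovery that a single node cannot perform alone."""

    def __init__(self, spare_nodes):
        self.spare_nodes = list(spare_nodes)

    def escalate(self, node, fault):
        # Reallocate the affected workload to a spare node if one exists.
        if self.spare_nodes:
            target = self.spare_nodes.pop(0)
            return f"reallocate jobs from {node} to {target} (fault: {fault})"
        return f"{node}: {fault} queued for operator attention"


class ReflexEngine:
    """Node-level engine: applies a local reflex action when one is known."""

    def __init__(self, node, manager):
        self.node = node
        self.manager = manager
        # Hypothetical local fault -> reflex action table.
        self.local_actions = {"cpu_overheat": "throttle", "disk_full": "purge_scratch"}

    def handle(self, fault):
        action = self.local_actions.get(fault)
        if action is not None:
            return f"{self.node}: {action}"              # mitigated locally
        return self.manager.escalate(self.node, fault)   # needs the global view


manager = RegionalManager(spare_nodes=["node42"])
engine = ReflexEngine("node07", manager)
print(engine.handle("cpu_overheat"))  # handled by the node's own reflex
print(engine.handle("link_down"))     # escalated: regional job reallocation
```

The design point this illustrates is the separation of concerns: fast, local reflexes at the leaves, slower resource-level healing decisions at the managers.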

Key Results

SCARF implements distributed monitoring on Lattice QCD (LQCD) computing clusters at Fermi National Accelerator Laboratory, enabling detection of multiple fault classes, including communication errors, storage failures, and CPU issues. The framework provides automated mitigation strategies that can reallocate jobs and resources, with experimental data showing improvements in cluster reliability and job completion rates. The hierarchical organization allows the framework to scale from individual nodes to multiple regional managers.
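A toy version of the detection-and-mitigation loop described above might map monitored health parameters to the fault classes named in the results (communication, storage, CPU) and select a mitigation per class. The thresholds, sensor names, and strategies here are assumptions for illustration only.

```python
# Illustrative per-node sensor thresholds (assumed values).
THRESHOLDS = {"packet_loss": 0.05, "disk_errors": 10, "cpu_temp_c": 90}

# Assumed mitigation strategy per fault class.
MITIGATION = {
    "communication": "reroute traffic and restart the network daemon",
    "storage": "remount the volume and migrate job output",
    "cpu": "throttle frequency and reallocate the job",
}


def classify(readings):
    """Return the fault classes detected in one node's sensor readings."""
    faults = []
    if readings.get("packet_loss", 0) > THRESHOLDS["packet_loss"]:
        faults.append("communication")
    if readings.get("disk_errors", 0) > THRESHOLDS["disk_errors"]:
        faults.append("storage")
    if readings.get("cpu_temp_c", 0) > THRESHOLDS["cpu_temp_c"]:
        faults.append("cpu")
    return faults


readings = {"packet_loss": 0.12, "disk_errors": 3, "cpu_temp_c": 95}
for fault in classify(readings):
    print(f"{fault}: {MITIGATION[fault]}")
```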

Full Abstract

One of the primary problems with computing clusters is to ensure that they maintain a reliable working state most of the time to justify economics of operation. In this paper, we introduce a model-based hierarchical reliability framework that enables periodic monitoring of vital health parameters across the cluster and provides for autonomic fault mitigation. We also discuss some of the challenges faced by autonomic reliability frameworks in cluster environments such as non-determinism in task scheduling in standard operating systems such as Linux and need for synchronized execution of monitoring sensors across the cluster. Additionally, we present a solution to these problems in the context of our framework, which utilizes a feedback controller based approach to compensate for the scheduling jitter in non real-time operating systems. Finally, we present experimental data that illustrates the effectiveness of our approach.

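The abstract describes a feedback-controller approach that compensates for scheduling jitter in non-real-time operating systems so that monitoring sensors fire at a near-constant period. A minimal sketch of that idea follows; the proportional gain, target period, and jitter values are illustrative assumptions, not the paper's parameters.

```python
def controlled_periods(target, jitter, gain=0.8):
    """Adjust the requested sleep so the observed period converges to target.

    target: desired sensor period (ms); jitter: per-cycle OS scheduling
    delay (ms); gain: proportional correction factor (assumed value).
    """
    request = target
    observed = []
    for j in jitter:             # j = delay the OS scheduler adds this cycle
        actual = request + j     # period we actually observed
        error = target - actual
        request += gain * error  # proportional correction of the next sleep
        observed.append(actual)
    return observed


# With a constant 20 ms jitter on a 1000 ms target, the period error
# shrinks each cycle instead of accumulating.
periods = controlled_periods(1000.0, [20.0] * 6)
print([round(p, 2) for p in periods])
```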
Cite This Paper

@inproceedings{Dubey2008a,
  author = {Dubey, Abhishek and Nordstrom, S. and Keskinpala, T. and Neema, S. and Bapty, T. and Karsai, G.},
  booktitle = {Fifth IEEE Workshop on Engineering of Autonomic and Autonomous Systems (EASe 2008)},
  title = {Towards A Model-Based Autonomic Reliability Framework for Computing Clusters},
  year = {2008},
  month = {mar},
  pages = {75-85},
  abstract = {One of the primary problems with computing clusters is to ensure that they maintain a reliable working state most of the time to justify economics of operation. In this paper, we introduce a model-based hierarchical reliability framework that enables periodic monitoring of vital health parameters across the cluster and provides for autonomic fault mitigation. We also discuss some of the challenges faced by autonomic reliability frameworks in cluster environments such as non-determinism in task scheduling in standard operating systems such as Linux and need for synchronized execution of monitoring sensors across the cluster. Additionally, we present a solution to these problems in the context of our framework, which utilizes a feedback controller based approach to compensate for the scheduling jitter in non real-time operating systems. Finally, we present experimental data that illustrates the effectiveness of our approach.},
  category = {conference},
  contribution = {lead},
  doi = {10.1109/EASe.2008.15},
  file = {:Dubey2008a-Towards_a_model-based_autonomic_reliability_framework_for_computing_clusters.pdf:PDF},
  issn = {2168-1872},
  keywords = {scientific computing, cluster reliability, fault mitigation, distributed monitoring, hierarchical management, reflex engines, job reallocation},
  month_numeric = {3}
}
Quick Info

Year: 2008
Keywords: scientific computing, cluster reliability, fault mitigation, distributed monitoring, hierarchical management, reflex engines, job reallocation
Research Areas: scalable AI, CPS, emergency