Why This Matters

Large computing clusters used for scientific processing suffer from systemic failures when operated over long continuous periods, and an entire workflow can fail because of a single job failure, a communication delay, or a synchronization issue. This work integrates workflow specification with formal reliability and monitoring methods to provide explicit fault isolation and mitigation. The model-based approach enables recovery from failures without manual intervention while preserving the data provenance needed for scientific reproducibility.

What We Did

This paper introduces a model-based, hierarchical, reliable execution framework for scientific workflows that integrates workflow and reliability subsystems to enable data provenance tracking, execution monitoring, and online fault tolerance. The work proposes parameterized abstract workflow templates that are instantiated with specific input parameters to define concrete workflows executed in a distributed environment. The framework supports both configuration-generation and analysis-campaign workflows, with execution tracking and monitoring of vital health parameters on the compute nodes allocated to each workflow task.
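The template-to-concrete-workflow step can be illustrated with a minimal sketch. This is not the paper's actual API; the `WorkflowTemplate` and `Participant` names, and the two-step example campaign, are hypothetical, assuming only that a template lists named steps with parameter slots and data dependencies that instantiation binds to concrete values:

```python
# Hypothetical sketch (not the framework's real API): an abstract workflow
# template whose parameter slots are bound at instantiation time, yielding a
# concrete, data-dependency-ordered list of participants (workflow tasks).
from dataclasses import dataclass, field

@dataclass
class Participant:
    name: str
    args: dict
    depends_on: list = field(default_factory=list)

class WorkflowTemplate:
    def __init__(self, steps):
        # steps: list of (name, required_arg_names, depends_on)
        self.steps = steps

    def instantiate(self, **params):
        """Bind concrete parameter values to produce an executable workflow."""
        concrete = []
        for name, arg_names, deps in self.steps:
            args = {a: params[a] for a in arg_names}  # fail fast if missing
            concrete.append(Participant(name, args, deps))
        return concrete

# Example: a two-step configuration-generation campaign (illustrative values).
template = WorkflowTemplate([
    ("generate_config", ["lattice_size", "seed"], []),
    ("measure", ["lattice_size"], ["generate_config"]),
])
workflow = template.instantiate(lattice_size="32x64", seed=42)
print([p.name for p in workflow])  # ['generate_config', 'measure']
```

The same template can be re-instantiated with different parameters to launch many concrete workflows of one campaign.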

Key Results

The framework successfully executes LQCD workflows on computing clusters, with reliable recovery from job failures and performance tracking across participants. Results demonstrate effective fault isolation and mitigation using the reflex engine architecture, in which pre-specified mitigating actions respond to detected faults. The system maintains data provenance, enabling verification of workflow results and recovery from failure points using stored intermediate results.
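The fault-mitigation behavior described above can be sketched as a tiny state machine. The `ReflexEngine` class, rule predicates, and the `node_overload` fault are hypothetical illustrations, assuming only the abstract's description of reflex engines as automata that change state and fire pre-specified mitigation actions when monitored health parameters violate a rule:

```python
# Hypothetical sketch: a reflex engine as a simple state machine. Rules map
# fault names to predicates over monitored health parameters; each fault has a
# pre-specified mitigation action that fires when its predicate holds.
class ReflexEngine:
    def __init__(self, rules, mitigations):
        self.rules = rules              # {fault_name: predicate(health) -> bool}
        self.mitigations = mitigations  # {fault_name: action(health)}
        self.state = "NOMINAL"

    def monitor(self, health):
        """Check one health sample; mitigate and report the first fault found."""
        for fault, predicate in self.rules.items():
            if predicate(health):
                self.state = "MITIGATING"
                self.mitigations[fault](health)
                self.state = "NOMINAL"
                return fault
        return None

# Example: restart jobs on an overloaded node (illustrative rule/action).
restarted = []
engine = ReflexEngine(
    rules={"node_overload": lambda h: h["load"] > 0.9},
    mitigations={"node_overload": lambda h: restarted.append(h["node"])},
)
engine.monitor({"node": "n01", "load": 0.5})   # no fault, no action
engine.monitor({"node": "n02", "load": 0.95})  # fires the mitigation
print(restarted)  # ['n02']
```

In the paper these engines form a hierarchical, decentralized network, so unmitigated faults propagate upward to engines with a wider view of the cluster.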

Cite This Paper

@article{Piccoli2010,
  author = {Piccoli, Luciano and Dubey, Abhishek and Simone, James N and Kowalkowski, James B},
  journal = {Journal of Physics: Conference Series},
  title = {{LQCD} workflow execution framework: Models, provenance and fault-tolerance},
  year = {2010},
  month = {apr},
  number = {7},
  pages = {072047},
  volume = {219},
  abstract = {Large computing clusters used for scientific processing suffer from systemic failures when operated over long continuous periods for executing workflows. Diagnosing job problems and faults leading to eventual failures in this complex environment is difficult, specifically when the success of an entire workflow might be affected by a single job failure. In this paper, we introduce a model-based, hierarchical, reliable execution framework that encompass workflow specification, data provenance, execution tracking and online monitoring of each workflow task, also referred to as participants. The sequence of participants is described in an abstract parameterized view, which is translated into a concrete data dependency based sequence of participants with defined arguments. As participants belonging to a workflow are mapped onto machines and executed, periodic and on-demand monitoring of vital health parameters on allocated nodes is enabled according to pre-specified rules. These rules specify conditions that must be true pre-execution, during execution and post-execution. Monitoring information for each participant is propagated upwards through the reflex and healing architecture, which consists of a hierarchical network of decentralized fault management entities, called reflex engines. They are instantiated as state machines or timed automatons that change state and initiate reflexive mitigation action(s) upon occurrence of certain faults. We describe how this cluster reliability framework is combined with the workflow execution framework using formal rules and actions specified within a structure of first order predicate logic that enables a dynamic management design that reduces manual administrative workload, and increases cluster-productivity.},
  contribution = {colab},
  doi = {10.1088/1742-6596/219/7/072047},
  file = {:Piccoli2010-LQCD_workflow_execution_framework.pdf:PDF},
  keywords = {scientific workflows, fault tolerance, reliability, data provenance, workflow execution, distributed computing},
  publisher = {{IOP} Publishing},
  month_numeric = {4}
}
Quick Info
Year 2010
Keywords
scientific workflows, fault tolerance, reliability, data provenance, workflow execution, distributed computing
Research Areas
middleware, scalable AI, emergency