Why This Matters

Scientific workflows that run for long periods on unreliable computing infrastructure need fault tolerance and recovery with minimal manual intervention. This work integrates workflow management with a distributed monitoring infrastructure, using publish-subscribe middleware to decouple workflow components. The hierarchical management structure supports both rapid local response to failures and global coordination of recovery actions across the distributed system.

What We Did

This paper extends earlier work on autonomic scientific workflow management with dynamic workflow management and monitoring built on OMG Data Distribution Service (DDS) middleware. It presents a hierarchical framework in which workflow managers coordinate job execution and participant managers track the execution of individual tasks. Monitoring of infrastructure resources is combined with workflow status, so that faults can be detected and workflows recovered by stopping and restarting the failed portions.
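The abstract describes tracking job execution state with timed state machines. The following is a minimal sketch of that idea, not the paper's implementation: the state names, event names, deadline values, and the `TimedJobTracker` class are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Hypothetical per-state deadlines in seconds; the paper gives no values.
DEADLINES = {JobState.SUBMITTED: 30.0, JobState.RUNNING: 600.0}

# Allowed event-driven transitions: (current state, event) -> next state.
TRANSITIONS = {
    (JobState.SUBMITTED, "started"): JobState.RUNNING,
    (JobState.RUNNING, "finished"): JobState.DONE,
    (JobState.RUNNING, "error"): JobState.FAILED,
}


@dataclass
class TimedJobTracker:
    job_id: str
    state: JobState = JobState.SUBMITTED
    entered_at: float = 0.0  # time at which the current state was entered

    def tick(self, now: float) -> JobState:
        """Mark the job FAILED if it overstayed its current state's deadline."""
        deadline = DEADLINES.get(self.state)
        if deadline is not None and now - self.entered_at > deadline:
            self.state = JobState.FAILED
            self.entered_at = now
        return self.state

    def on_event(self, event: str, now: float) -> JobState:
        """Apply a status event reported for this job at time `now`."""
        self.tick(now)  # a late event must not mask an expired deadline
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is not None:
            self.state = next_state
            self.entered_at = now
        return self.state
```

In this sketch a manager would call `on_event` whenever a status message arrives and `tick` periodically; a job that stays silent past its deadline is treated as failed even though no error was reported.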

Key Results

The framework manages scientific workflows with distributed monitoring that tracks both infrastructure status and workflow execution. Results demonstrate fault detection and workflow recovery by stopping the failed portions of a workflow and restarting them from known checkpoints. Building the monitoring on the Data Distribution Service (DDS) enables scalable monitoring across large computing clusters without centralized bottlenecks.
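Recovery by stopping and restarting the failed portion of a workflow directed acyclic graph can be sketched as follows. This is an illustration under stated assumptions rather than the authors' algorithm: it assumes the "failed portion" is the failed task plus everything downstream of it, and the function names and the adjacency-list representation are hypothetical.

```python
from collections import deque


def failed_portion(dag: dict[str, list[str]], failed: str) -> set[str]:
    """Return the failed task plus every task downstream of it.

    `dag` maps each task to the tasks that depend on it (its children).
    """
    affected = {failed}
    queue = deque([failed])
    while queue:
        task = queue.popleft()
        for child in dag.get(task, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


def restart_order(dag: dict[str, list[str]], portion: set[str]) -> list[str]:
    """Order the affected tasks so that each restarts after its affected parents."""
    # Kahn's algorithm restricted to the affected subgraph.
    indegree = {task: 0 for task in portion}
    for task in portion:
        for child in dag.get(task, []):
            if child in portion:
                indegree[child] += 1
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dag.get(task, []):
            if child in portion:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order
```

Tasks outside the returned set are left untouched, which is what keeps the restart limited to part of the workflow instead of rerunning all of it.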

Full Abstract

Cite This Paper

@inproceedings{Pan2010,
  author = {Pan, P. and Dubey, Abhishek and Piccoli, L.},
  booktitle = {2010 Seventh IEEE International Conference and Workshops on Engineering of Autonomic and Autonomous Systems},
  title = {Dynamic Workflow Management and Monitoring Using DDS},
  year = {2010},
  month = {mar},
  pages = {20--29},
  abstract = {Large scientific computing data-centers require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures to improve the reliability of scientific workflows. This paper extends our previous work on the autonomic scientific workflow management systems by presenting a hierarchical dynamic workflow management system that tracks the state of job execution using timed state machines. Workflow monitoring is achieved using a reliable distributed monitoring framework, which employs publish-subscribe middleware built upon OMG Data Distribution Service standard. Failure recovery is achieved by stopping and restarting the failed portions of workflow directed acyclic graph.},
  category = {conference},
  contribution = {lead},
  doi = {10.1109/EASe.2010.12},
  file = {:Pan2010-Dynamic_workflow_management_and_monitoring_using_dds.pdf:PDF},
  issn = {2168-1872},
  keywords = {workflow management, monitoring, fault tolerance, distributed systems, publish-subscribe middleware, scientific computing},
  month_numeric = {3}
}
Quick Info
Year 2010
Keywords
workflow management, monitoring, fault tolerance, distributed systems, publish-subscribe middleware, scientific computing
Research Areas
middleware, scalable AI