Why This Matters

Autonomous computing infrastructure imposes stringent requirements for consistency, synchronization, and security across multiple nodes, and manual monitoring of large-scale systems is neither scalable nor reliable. RFDMon uses Data Distribution Service (DDS) middleware to decouple publishers from subscribers, which keeps the monitoring architecture flexible and extensible. Its hierarchical organization with regional coordination provides scalability for large systems and resilience through distributed fault management, without a central bottleneck.
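
The publisher/subscriber decoupling described above can be illustrated with a minimal in-process topic bus. This is only a sketch of the general publish/subscribe idea, not the DDS API and not RFDMon's implementation; the `TopicBus` class, the topic name `node/cpu`, and the sample fields are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class TopicBus:
    """Hypothetical in-process stand-in for topic-based middleware."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks registered for it.
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, sample: Any) -> int:
        """Deliver a sample to every subscriber of the topic; return the delivery count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(sample)
        return len(callbacks)


bus = TopicBus()
received = []

# The subscriber knows only the topic name, not who publishes to it.
bus.subscribe("node/cpu", received.append)

# The publisher knows only the topic name, not who is listening.
delivered = bus.publish("node/cpu", {"node": "n1", "cpu_percent": 42.0})
```

Because both sides depend only on the topic name, a new consumer of monitoring data can be added without touching any sensor code, which is the extensibility property the paragraph refers to.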

What We Did

This paper introduces RFDMon, a distributed, event-based monitoring framework for autonomous systems that performs real-time fault diagnosis and recovery through a scalable hierarchical monitoring architecture. The system monitors infrastructure resources, including CPU utilization, memory usage, network bandwidth, and hardware health, across 100 to 800 computing nodes. Monitoring is organized into regions with local managers and global membership managers, which enables efficient information dissemination while keeping communication overhead and latency low.
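
The region-based organization can be sketched as follows. This is a minimal illustration under stated assumptions, not RFDMon's actual design: the class names `RegionalManager` and `GlobalMembershipManager`, the heartbeat-timeout failure rule, and the 5-second timeout are hypothetical. The point it shows is that the global level receives only a per-region summary rather than every raw sample.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegionalManager:
    """Hypothetical local manager: tracks the last heartbeat of each node in one region."""
    region: str
    timeout_s: float = 5.0  # assumed failure threshold, not taken from the paper
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        # A node is considered failed if it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


@dataclass
class GlobalMembershipManager:
    """Hypothetical global manager: sees regions, not individual samples."""
    regions: Dict[str, RegionalManager] = field(default_factory=dict)

    def register(self, manager: RegionalManager) -> None:
        self.regions[manager.region] = manager

    def summary(self, now: float) -> Dict[str, List[str]]:
        # Collect only the per-region failure lists.
        return {name: m.failed_nodes(now) for name, m in self.regions.items()}


east = RegionalManager("east")
west = RegionalManager("west")
membership = GlobalMembershipManager()
membership.register(east)
membership.register(west)

east.heartbeat("e1", now=0.0)
east.heartbeat("e2", now=0.0)
west.heartbeat("w1", now=0.0)
east.heartbeat("e1", now=8.0)  # e1 and w1 keep reporting; e2 goes silent
west.heartbeat("w1", now=8.0)

report = membership.summary(now=10.0)
```

In this sketch, traffic to the global level grows with the number of regions rather than the number of nodes, which is one plausible reading of how a hierarchy keeps communication overhead low.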

Key Results

The framework monitors large heterogeneous computing clusters with minimal resource consumption and detects infrastructure faults automatically with low latency. Experimental deployments in production environments demonstrate scalability to hundreds of nodes, with automatic self-reconfiguration when failures occur. The system supports diagnostic monitoring: it can identify the specific failed components and adapts automatically to infrastructure changes.

Cite This Paper

@inbook{Abdelwahed2011,
  author = {Abdelwahed, Sherif and Dubey, Abhishek and Karsai, Gabor and Mahadevan, Nagabhushan},
  chapter = {Chapter 9},
  pages = {285},
  publisher = {CRC Press},
  title = {Model-based Tools and Techniques for Real-Time System and Software Health Management},
  year = {2011},
  abstract = {The ultimate challenge in system health management is the theory for and application of the technology to systems, for instance to an entire vehicle. The main problem the designer faces is complexity; simply the sheer size of the system, the number of data points, anomalies, and failure modes can be overwhelming. Furthermore, systems are heterogeneous and one has to have a systems engineer{\textquoteright}s view to understand interactions among systems. Yet, system-level health management is crucial, as faults increasingly arise from system-level effects and interactions. While individual subsystems tend to have built-in redundancy or local anomaly detection, fault management, and prognostics features, the system integrators are required to provide the same capabilities for the entire vehicle, across different engineering subsystems and areas.},
  booktitle = {Machine Learning and Knowledge Discovery for Engineering Systems Health Management},
  contribution = {colab},
  doi = {10.1201/b11580-15},
  keywords = {distributed monitoring, autonomous systems, fault detection, hierarchical architecture, quality of service, middleware},
  organization = {CRC Press},
  tag = {platform},
  url = {https://doi.org/10.1201/b11580}
}
Quick Info
Year: 2011
Keywords
distributed monitoring, autonomous systems, fault detection, hierarchical architecture, quality of service, middleware
Research Areas
middleware, scalable AI, CPS