RFDMon: A Real-time and Fault-tolerant Distributed System Monitoring Approach

Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, Krisa W. Rowland

The 8th International Conference on Autonomic and Autonomous Systems (ICAS 2012), 2012

Why This Matters

As distributed systems grow in complexity and scale, monitoring becomes critical for identifying failures and performance bottlenecks before they affect users. RFDMon combines the flexibility of Data Distribution Services (DDS) middleware with fault-tolerant hierarchical management to provide low-latency monitoring without overwhelming system resources. The work addresses the gap between traditional client-server monitoring models and the needs of modern large-scale computing infrastructure by supporting dynamic node additions and automatic fault diagnosis.

What We Did

This work presents RFDMon, a real-time and fault-tolerant distributed system monitoring framework built on Data Distribution Services (DDS) middleware. The framework measures system variables such as CPU utilization, memory usage, and network bandwidth, along with application performance metrics, across heterogeneous computing nodes. It organizes monitoring sensors into regions, each with local managers and a regional leader, under a global membership manager, which enables hierarchical and scalable monitoring. The approach uses spatial and temporal partitioning to isolate monitoring data collection and ensure periodic updates.
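
To make the hierarchy concrete, here is a minimal Python sketch of how a region might track heartbeats from its nodes and replace a failed leader. It is an illustration under stated assumptions, not RFDMon's implementation: the `Region` class, the `timeout_s` parameter, and the rule of electing the alphabetically first live node are all hypothetical, and no DDS middleware is involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Region:
    """Hypothetical region that records the last heartbeat time of each node."""
    name: str
    timeout_s: float  # assumption: a node counts as failed after this much silence
    last_seen: Dict[str, float] = field(default_factory=dict)
    leader: Optional[str] = None

    def heartbeat(self, node_id: str, now: float) -> None:
        """Record a heartbeat; the first node heard from becomes the leader."""
        self.last_seen[node_id] = now
        if self.leader is None:
            self.leader = node_id

    def alive_nodes(self, now: float) -> List[str]:
        """Return, in sorted order, the nodes heard from within the timeout."""
        return sorted(
            n for n, t in self.last_seen.items() if now - t <= self.timeout_s
        )

    def check(self, now: float) -> List[str]:
        """Return timed-out nodes and re-elect the leader if it is among them."""
        alive = self.alive_nodes(now)
        failed = sorted(set(self.last_seen) - set(alive))
        if self.leader not in alive:
            # Assumed election rule: pick the alphabetically first live node.
            self.leader = alive[0] if alive else None
        return failed


region = Region("region-1", timeout_s=5.0)
region.heartbeat("node-a", now=0.0)
region.heartbeat("node-b", now=0.0)
region.heartbeat("node-b", now=4.0)   # node-a stays silent
failed_nodes = region.check(now=6.0)  # node-a has exceeded the timeout
```

Timestamps are passed in explicitly so the behavior is deterministic; a real deployment would read a clock and would receive heartbeats over the middleware rather than through direct method calls.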

Key Results

The framework monitors large clusters of 100 to 800 computing nodes with minimal overhead on computational resources. Experimental results show that the system identifies infrastructure faults in real time and can reconfigure itself automatically to resume monitoring. The hierarchical architecture scales while keeping communication overhead low, and the framework integrates with fault diagnosis modules to support comprehensive system health management.

@inproceedings{Mehrotra2012a,
  author = {Mehrotra, Rajat and Dubey, Abhishek and Abdelwahed, Sherif and Rowland, Krisa W.},
  booktitle = {The 8th International Conference on Autonomic and Autonomous Systems {ICAS} 2012},
  title = {RFDMon: A Real-time and Fault-tolerant Distributed System Monitoring Approach},
  year = {2012},
  abstract = {One of the main requirements for building an autonomic system is to have a robust monitoring framework. In this paper, a systematic distributed event based (DEB) system monitoring framework {\textquotedblleft}RFDMon{\textquotedblright} is presented for measuring system variables (CPU utilization, memory utilization, disk utilization, network utilization, etc.), system health (temperature and voltage of Motherboard and CPU) application performance variables (application response time, queue size, and throughput), and scientific application data structures (PBS information and MPI variables) accurately with minimum latency at a specified rate and with controllable resource utilization. This framework is designed to be tolerant to faults in monitoring framework, self-configuring (can start and stop monitoring the nodes and configure monitors for threshold values/changes for publishing the measurements), aware of execution of the framework on multiple nodes through HEARTBEAT messages, extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (computational resources can be limited for monitors), and expandable for adding extra monitors on the fly. Since RFDMon uses a Data Distribution Services (DDS) middleware, it can be used for deploying in systems with heterogeneous nodes. Additionally, it provides a functionality to limit the maximum cap on resources consumed by monitoring processes such that it reduces the effect on the availability of resources for the applications.},
  category = {selectiveconference},
  contribution = {lead},
  acceptance = {23},
  file = {:Mehrotra2012a-RFDMon_A_real-time_and_fault-tolerant_distributed_system_monitoring_approach.pdf:PDF},
  keywords = {distributed monitoring, fault tolerance, quality of service, middleware, hierarchical management, ARINC-653, real-time systems},
  tag = {platform}
}
