Why This Matters

Cloud system reliability depends on predicting hard disk failures before they cause service outages. Disk failure prediction is difficult because the data is highly imbalanced, failure modes vary, and clear failure indicators are often lacking. This work combines offline pre-training with online transfer learning so that failure prediction models adapt continuously without leaking future information into training, which makes deployment in cloud environments practical.

What We Did

This paper proposes a two-layered architecture for predicting the remaining useful life (RUL) of hard disk drives in cloud systems using Deep LSTM networks. The system combines data-driven anomaly detection for early failure identification with online prediction mechanisms using transfer learning. To keep training and test data from overlapping, the approach uses pre-trained models that are continuously updated with new failure patterns.
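
As a minimal sketch of how RUL training samples might be prepared for a sequence model such as a Deep LSTM, the following Python turns one failed disk's daily feature history into fixed-length windows, each labeled with the days remaining at the end of the window. The function names, the window length, and the convention that the last record is the failure day are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Window = List[List[float]]


def rul_labels(num_days: int) -> List[int]:
    """Days remaining for each daily record, assuming the last record is the failure day."""
    return [num_days - 1 - day for day in range(num_days)]


def sliding_windows(
    features: Sequence[Sequence[float]], window: int
) -> List[Tuple[Window, int]]:
    """Pair each run of `window` consecutive days with the RUL on its final day."""
    labels = rul_labels(len(features))
    samples = []
    for end in range(window, len(features) + 1):
        chunk = [list(row) for row in features[end - window:end]]
        samples.append((chunk, labels[end - 1]))
    return samples


# Toy history: 5 daily records with 2 made-up features each.
history = [[float(day), 0.0] for day in range(5)]
samples = sliding_windows(history, window=3)
for chunk, rul in samples:
    print(rul, chunk)
```

Each `(window, label)` pair is the kind of input a sequence model consumes. Healthy disks have no failure day, so they would need a different labeling convention that this sketch does not cover.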

Key Results

The two-layered architecture predicts whether a disk will fail within the next ten days with an average precision of 0.8435, and it performs especially well when predicting RUL near the critical zone of a device approaching failure. These predictions give cloud operators time to replace disks and migrate workloads before failures occur. The transfer learning approach enabled the system to adapt to new disk models and failure patterns through incremental online updates.
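
The abstract reports the ten-day result as an average precision of 0.8435. Precision is not the same as accuracy when failures are rare, and the sketch below shows the difference. It thresholds predicted RUL at ten days and scores the resulting failure flags against the truth. The RUL values are invented toy numbers and the helper names are hypothetical; this is not the paper's evaluation code.

```python
from typing import List, Sequence


def fails_within(rul_days: Sequence[float], horizon: int = 10) -> List[bool]:
    """Flag a disk as failing soon when its RUL is at most `horizon` days."""
    return [rul <= horizon for rul in rul_days]


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of flagged disks that really fail within the horizon."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of all disks whose flag matches the truth."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Invented toy values: true and predicted RUL (in days) for six disks.
true_rul = [3, 8, 25, 40, 12, 60]
pred_rul = [5, 14, 9, 35, 7, 55]

actual = fails_within(true_rul)
predicted = fails_within(pred_rul)
print(f"precision={precision(predicted, actual):.3f}")
print(f"accuracy={accuracy(predicted, actual):.3f}")
```

In this toy case three disks are flagged and one of them truly fails within ten days, so precision is 1/3 while accuracy is 1/2. For an operator, precision measures how often a replacement triggered by a flag was actually needed.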

Cite This Paper

@misc{Basak2018,
  author = {Basak, Sanchita and Sengupta, Saptarshi and Dubey, Abhishek},
  title = {A Data-driven Prognostic Architecture for Online Monitoring of Hard Disks Using Deep {LSTM} Networks},
  year = {2018},
  abstract = {With the advent of pervasive cloud computing technologies, service reliability and availability are becoming major concerns, especially as we start to integrate cyber-physical systems with the cloud networks. A number of smart and connected community systems such as emergency response systems utilize cloud networks to analyze real-time data streams and provide context-sensitive decision support. Improving overall system reliability requires us to study all the aspects of the end-to-end of this distributed system, including the backend data servers. In this paper, we describe a bi-layered prognostic architecture for predicting the Remaining Useful Life (RUL) of components of backend servers, especially those that are subjected to degradation. We show that our architecture is especially good at predicting the remaining useful life of hard disks. A Deep LSTM Network is used as the backbone of this fast, data-driven decision framework and dynamically captures the pattern of the incoming data. In the article, we discuss the architecture of the neural network and describe the mechanisms to choose the various hyper-parameters. We describe the challenges faced in extracting effective training sets from highly unorganized and class-imbalanced big data and establish methods for online predictions with extensive data pre-processing, feature extraction and validation through test sets with unknown remaining useful lives of the hard disks. Our algorithm performs especially well in predicting RUL near the critical zone of a device approaching failure. The proposed architecture is able to predict whether a disk is going to fail in next ten days with an average precision of 0.8435. In future, we will extend this architecture to learn and predict the RUL of the edge devices in the end-to-end distributed systems of smart communities, taking into consideration context-sensitive external features such as weather.},
  archiveprefix = {arXiv},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/bib/journals/corr/abs-1810-08985},
  contribution = {lead},
  eprint = {1810.08985},
  file = {:Basak2018-A_Data-driven_Prognostic_Architecture_for_Online_Monitoring_of_Hard_Disks_Using_Deep_LSTM_Networks.pdf:PDF},
  journal = {CoRR},
  tag = {ai4cps},
  timestamp = {Wed, 31 Oct 2018 00:00:00 +0100},
  url = {http://arxiv.org/abs/1810.08985},
  volume = {abs/1810.08985},
  keywords = {remaining useful life prediction, deep learning, LSTM networks, hard disk failures, cloud systems}
}
Quick Info

Year: 2018
Keywords: remaining useful life prediction, deep learning, LSTM networks, hard disk failures, cloud systems
Research Areas: ML for CPS, Explainable AI