Component Models for Design of Complex Cyber-Physical Systems

Context: Cyber-physical systems underlie many modern engineered systems, including smart transit, smart emergency response, and the smart grid. The central challenge in these systems is constructing and operating them safely and efficiently. The research community has pursued many approaches over the years; my lab has focused on component-based software engineering (CBSE) for these systems. The guiding principles of CBSE are interfaces with well-defined execution models, compositional semantics, and compositional analysis. However, a number of challenges remain to be resolved: (a) performance management; (b) modularization and adaptation of the system as requirements and the environment change; (c) safe and secure design of the system itself, ensuring that new designs and component additions can be compositionally analyzed and operated throughout the system's life cycle; (d) fault diagnostics and failure isolation, to detect and triage problems rapidly; and (e) reconfiguration and recovery, to adapt dynamically to failures and environmental changes and ensure the safe completion of mission tasks.

Background: Our work on system integration and middleware for cyber-physical systems has spanned more than a decade. Together with Prof. Gabor Karsai at the Institute for Software Integrated Systems and Prof. Aniruddha Gokhale and Prof. Doug Schmidt at the Distributed Object Computing Group at Vanderbilt University, we have been working on CORBA, DDS, and system performance modeling. One of our key contributions is the ARINC-653 Component Model (ACM), which combines the principles of spatial and temporal partitioning with the interaction patterns derived from the CORBA Component Model (CCM). The main extensions over CCM are as follows: (a) the synchronous (call-return) and asynchronous (publish-subscribe) interfaces can be equipped with monitors that validate pre- and post-conditions over the data passed on the respective interface; (b) the relevant portions of a component's state can be observed via a dedicated state interface, enabling the monitoring of invariants; (c) the component's resource usage can be monitored via a resource interface that the component uses to allocate and release resources; and (d) the timing of component execution can be observed via a control interface, so that execution-time violations of a component instance can be detected. Together, these extensions enable component-level monitoring that evaluates pre- and post-conditions on method invocations, verifies state invariants, tracks resource usage, and monitors the timing behavior of the component, as sketched below.
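To make the monitoring idea concrete, the following is a minimal sketch of an ACM-style monitored interface. It is illustrative only: the names (MonitoredPort, invoke) are hypothetical, and the actual ACM realizes these checks inside ARINC-653 partitions rather than in Python.

```python
# Hypothetical sketch of ACM-style interface monitoring; the names are
# illustrative, not the actual ACM API.
import time

class MonitoredPort:
    """Wraps a component operation with pre-/post-condition and timing checks."""

    def __init__(self, operation, precondition, postcondition, deadline_ms):
        self.operation = operation
        self.precondition = precondition
        self.postcondition = postcondition
        self.deadline_ms = deadline_ms

    def invoke(self, *args):
        if not self.precondition(*args):
            raise ValueError("precondition violated")   # reported to health manager
        start = time.monotonic()
        result = self.operation(*args)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        if elapsed_ms > self.deadline_ms:
            print(f"timing violation: {elapsed_ms:.1f} ms > {self.deadline_ms} ms")
        if not self.postcondition(result):
            raise ValueError("postcondition violated")
        return result

# Example: guard a setpoint command so it stays within physical limits.
port = MonitoredPort(
    operation=lambda sp: sp,                 # stand-in for the real component method
    precondition=lambda sp: 0.0 <= sp <= 100.0,
    postcondition=lambda out: out is not None,
    deadline_ms=5.0,
)
port.invoke(42.0)
```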

This work was eventually extended and incorporated into the DREMS (Distributed Real-Time Embedded Managed Systems) component model for networked CPS. DREMS prescribes a single-threaded execution model for components, which avoids the synchronization primitives that often lead to non-analyzable code and can cause run-time deadlocks and race conditions. One of the key innovations in DREMS was the development of fine-grained privileges for controlling access to system services. As part of this effort we developed a novel Multi-Level Security (MLS) information-sharing policy for distributed architectures. More recently, this model has been extended to a decentralized architecture for the smart grid within the Resilient Information Architecture Platform for Smart Grid (RIAPS).
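The sketch below conveys the flavor of a lattice-based label check underlying such a policy. It is an assumption-laden illustration: the actual DREMS labels and dominance rules are richer than this two-field example.

```python
# Illustrative lattice-based label check in the spirit of an MLS policy.
# The actual DREMS labels and dominance rules are richer; this is a sketch.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2}

def dominates(subject, obj):
    """subject dominates obj if its level is >= and its categories are a superset."""
    s_level, s_cats = subject
    o_level, o_cats = obj
    return LEVELS[s_level] >= LEVELS[o_level] and o_cats <= s_cats

def may_read(subject, obj):
    return dominates(subject, obj)            # "no read up"

def may_write(subject, obj):
    return dominates(obj, subject)            # "no write down"

alice = ("SECRET", {"grid", "ops"})
sensor_msg = ("CONFIDENTIAL", {"grid"})
assert may_read(alice, sensor_msg) and not may_write(alice, sensor_msg)
```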

Innovations: The component-based middleware foundation also enables us to explore the problem of component placement, both in response to performance concerns and in response to failures. For this purpose, we have developed methods to create models that assist in performance prediction and capacity planning for these components. Further, we have developed mechanisms for online anomaly detection and fault-source isolation.
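As a flavor of the kind of model involved, the sketch below uses a textbook M/M/1 queueing approximation to estimate a component's mean response time from its load. This is a deliberate simplification for illustration, not the prediction models we actually built.

```python
# Back-of-the-envelope capacity check, assuming each component behaves like an
# M/M/1 queue (a simplification; the actual prediction models are richer).
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue; stable only if utilization < 1."""
    utilization = arrival_rate / service_rate
    if utilization >= 1.0:
        raise ValueError("component saturated: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

# A component serving 80 req/s on a node with capacity for 100 req/s:
print(mm1_response_time(80.0, 100.0))   # 0.05 s mean response time
```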

Our approach to this challenge mixes data-driven and model-based techniques. First, we use statistical neighborhood measures to identify the areas affected by a discrepancy and to form a hypothesis about whether the failure is indeed a physical anomaly. Then we use a discrete-event model that captures the causal and temporal relationships between failure modes (causes) and discrepancies (effects) in a system, thereby modeling failure cascades while taking into account the propagation constraints imposed by operating modes, protection elements, and timing delays. This formalism, called the Temporal Causal Diagram (TCD), can model the effects of faults and protection mechanisms, and can incorporate fine-grain, physics-based diagnostics into an integrated, system-level diagnostics scheme. The uniqueness of the approach is that it does not require complex real-time computations over high-fidelity models; instead, it reasons with efficient graph algorithms over the anomalies observed in the system. TCD builds on prior work on Timed Failure Propagation Graphs (TFPG). When fine-grain results are needed and computing resources and time are available, the diagnostic hypotheses can be refined with the help of physics-based diagnostics. Finally, we use both data-driven approaches, such as LSTMs and graph neural networks, and the TCD models to prognosticate the effects of failures.
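The sketch below illustrates the TFPG-style back-reasoning step: given timestamped discrepancy alarms, it keeps only the failure modes from which every alarm is reachable within the cumulative edge-delay bounds. The graph, delays, and names are invented for illustration; real TCD reasoning also accounts for operating modes and protection elements.

```python
# Minimal sketch of TFPG-style back-reasoning over a hypothetical graph.
EDGES = {  # source -> [(target, min_delay_s, max_delay_s)]
    "FM_breaker_stuck": [("D_overcurrent", 0.0, 0.5)],
    "D_overcurrent":    [("D_voltage_sag", 0.1, 1.0)],
    "FM_sensor_bias":   [("D_voltage_sag", 0.0, 0.2)],
}

def reachable_in_window(src, dst, t, lo=0.0, hi=0.0):
    """True if dst is reachable from src with cumulative delay bounds covering t."""
    if src == dst:
        return lo <= t <= hi
    return any(
        reachable_in_window(nxt, dst, t, lo + dmin, hi + dmax)
        for nxt, dmin, dmax in EDGES.get(src, [])
    )

def consistent(fm, alarms):
    """A failure mode is a candidate if it explains every timestamped alarm."""
    return all(reachable_in_window(fm, disc, t) for disc, t in alarms.items())

alarms = {"D_overcurrent": 0.3, "D_voltage_sag": 0.9}
candidates = [fm for fm in ("FM_breaker_stuck", "FM_sensor_bias")
              if consistent(fm, alarms)]
print(candidates)   # ['FM_breaker_stuck']
```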

One of the key benefits of our formalized component-based construction is that we can generate a Timed Failure Propagation Graph (TFPG) from software assemblies and then use it at runtime to isolate faulty components. This is possible because the data and behavioral dependencies (and hence the fault propagation paths) across an assembly of software components can be deduced from the well-defined and restricted set of interaction patterns supported by the framework, as sketched below. We have also shown that fault-containment techniques can provide the primary protection against failures propagating into high-criticality components, and can protect the system health management framework itself.
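A minimal sketch of the generation step, under the assumption that every declared port connection induces a potential fault-propagation edge (the assembly below is hypothetical):

```python
# Sketch: deriving failure-propagation edges from a component assembly.
# Because the framework restricts interactions to declared ports, a fault in a
# publisher can only reach its subscribers; the connection list is illustrative.
ASSEMBLY = [
    ("SensorDriver", "pub", "Filter", "sub"),           # publish-subscribe link
    ("Filter", "pub", "Controller", "sub"),
    ("Controller", "client", "ActuatorSvc", "server"),  # call-return link
]

def propagation_edges(assembly):
    """Each declared connection induces a potential fault-propagation edge."""
    return [(src, dst) for src, _, dst, _ in assembly]

def downstream(component, edges):
    """Components a failure in `component` can reach (transitive closure)."""
    reached, frontier = set(), {component}
    while frontier:
        nxt = {d for s, d in edges if s in frontier} - reached
        reached |= nxt
        frontier = nxt
    return reached

edges = propagation_edges(ASSEMBLY)
print(downstream("SensorDriver", edges))  # {'Filter', 'Controller', 'ActuatorSvc'}
```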

Beyond fault containment, we have also researched mechanisms to recover from component failures, either by reinstalling the components automatically or by recovering the system functionality with alternative compositions in case of device and hardware failures. The key idea is to encode and use the design space of the cyber-physical system. This design space represents the state of an entire platform: the different resources available, well-known faults, system goals and objectives, the functionalities that help achieve those goals, the components that provide those functionalities, and the different ways in which these components can be deployed and configured. The design space is captured using a domain-specific language, as sketched below.
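A minimal sketch of such an encoding, using Python dataclasses in place of the actual domain-specific language (all names and fields here are illustrative):

```python
# Hypothetical encoding of a design space; the real system uses a domain-specific
# modeling language, so these dataclasses are only illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    memory_mb: int          # resource available on this physical node

@dataclass
class Component:
    name: str
    provides: str           # functionality this component implements
    memory_mb: int          # resource the component needs when deployed

@dataclass
class DesignSpace:
    nodes: list
    components: list
    goals: list = field(default_factory=list)   # functionalities the mission needs

space = DesignSpace(
    nodes=[Node("rpi-1", 1024), Node("rpi-2", 512)],
    components=[Component("CameraFeed", "sensing", 300),
                Component("Planner", "planning", 600)],
    goals=["sensing", "planning"],
)
```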

The design space can expand or shrink as related entities are added or removed. A configuration point represents a valid configuration: a specific deployment scenario given a set of component instances and the physical nodes on which those instances can be deployed. A change in the state of the platform is represented by a transition from one configuration point to another in the same design space. An initial configuration point represents the initial state, whereas the current configuration point represents the current state of the platform. Configuration points and their transitions are central to the self-reconfiguration mechanism that I have developed. The key idea is to reconfigure by migrating from a faulty configuration point to a new configuration point, encoding the migration as a constraint problem that efficient SMT solvers can handle, as sketched below. Additionally, if we have past information about component failures, we can choose the new configuration so as to maximize the likelihood that the mission will succeed.
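The sketch below conveys the flavor of the SMT encoding using the z3 solver (pip install z3-solver): components are assigned node indices, failed nodes are excluded, and per-node memory capacities are respected. It is a toy instance; the real design space includes goals, functionalities, and far richer constraints.

```python
# Minimal reconfiguration sketch with the z3 SMT solver.
# Components are assigned to node indices; node 1 is assumed to have failed.
from z3 import Int, If, Solver, Sum, sat

nodes_mem = [1024, 512, 2048]        # memory capacity per node (node 1 faulty)
comps_mem = {"CameraFeed": 300, "Planner": 600, "Logger": 200}
faulty = {1}

assign = {c: Int(f"place_{c}") for c in comps_mem}
s = Solver()
for c, v in assign.items():
    s.add(v >= 0, v < len(nodes_mem))                 # valid node index
    for f in faulty:
        s.add(v != f)                                 # avoid failed nodes
for n, cap in enumerate(nodes_mem):                   # respect node capacity
    s.add(Sum([If(assign[c] == n, m, 0) for c, m in comps_mem.items()]) <= cap)

if s.check() == sat:
    model = s.model()
    print({c: model[v].as_long() for c, v in assign.items()})
```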