The TMS is considered failed

2020-07-30

The TMS is considered failed if either the monitoring application provides the operator with wrong information that does not reflect the real status of the railway (functional failure), or the service stops being provided altogether (non-functional failure). The architecture currently adopted by ASTS therefore aims to achieve a high degree of availability through redundancy. However, redundant HW subsystem units alone are not sufficient for SIL2 certification, especially if the system incorporates COTS components.

Unlike the IXL, the TMS provides non-vital functions. A failure of the TMS could lead, for example, to the planning of colliding routes. In such a case the underlying SIL4 fail-safe IXL would prevent loss of life by avoiding the actual collision, e.g. by stopping the trains. However, stopping the trains would degrade QoS and reduce service availability, which in turn means loss of money and reputation for the train company. For this reason, the TMS is classified as a Safety-Related System.

The TMS is based on a client-server architecture. It is a collection of commercial redundant equipment (dual application servers, dual data servers and at least dual workstations) connected to a high-speed LAN. The core of the TMS subsystem is the Application Server, which runs the most important SW modules. The TMS Application Server (for simplicity, from now on only TMS) is composed of two server machines clustered in an Active/Standby configuration to provide uninterrupted service. To further enhance reliability, redundancy is applied not only in terms of replicated machines but also in terms of subsystem units (e.g. RAM, disks, fans, power supply) in order to eliminate Single Points of Failure (SPF). Unit replicas are identical, that is, ASTS does not enforce a diversity-based fault-tolerance mechanism, which would involve the use of subsystems of different technologies.
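The effect of replicating identical units can be illustrated with the standard availability formulas for parallel and series arrangements. The sketch below is only an illustration: it assumes independent failures, the function names are hypothetical, and the per-unit availability of 0.99 is a made-up value, not a measurement of any ASTS component. Because the replicas are identical, common-mode faults are not captured, so the parallel figure should be read as an optimistic bound.

```python
def parallel_availability(unit_availability: float, replicas: int) -> float:
    """Availability of identical replicas where one working unit suffices.

    Assumes independent failures; identical replicas may share
    common-mode faults, which this model ignores.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # The arrangement is down only if every replica is down at once.
    return 1.0 - (1.0 - unit_availability) ** replicas


def series_availability(availabilities) -> float:
    """Availability of a chain in which every element must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    a = 0.99  # hypothetical per-unit availability, for illustration only
    print("single unit:   ", parallel_availability(a, 1))
    print("duplicated unit:", parallel_availability(a, 2))
```

Under these assumptions a non-replicated unit bounds the availability of the whole chain, which is why redundancy is applied at both the machine and the subsystem-unit level.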
Each server node has, connected to its motherboard: 2 hot-swap power supplies, 2 hot-swap fans, 2 RAM modules, 1 CPU, 2 HDDs in RAID1 configuration, and 1 Network Interface Card (NIC). On top of the hardware, ASTS employs SUSE Linux with the High Availability (HA) Extension: a COTS OS provided with two clustering software components, the Pacemaker Cluster Resource Manager (CRM) and Corosync, responsible for resource orchestration, failure diagnosis, node coordination, and fail-over management. The Pacemaker CRM monitors the health and status of node resources, manages dependencies, and automatically stops and starts services. It relies on the Corosync messaging layer, which is in charge of reliable messaging, membership, and the quorum information needed by the cluster for node orchestration. For simplicity, in the rest of this paper, Pacemaker and Corosync are referred to as a single entity, the CRM.

The topmost layer of the system is the ASTS proprietary TMS application. This consists of several SW modules that communicate with the IXL through a CORBA message broker. Since the ASTS proprietary application software is beyond the scope of our work, the reliability of the broker itself is not addressed in our study (except for aspects related to its interaction with other software modules). The decision to use a CORBA broker was taken by the ASTS proprietary application software development team. In this study, we assume that the ASTS proprietary application software (which includes the broker) is the result of a rigorous development process and of thorough reliability testing. The ASTS proprietary application software uses FT-CORBA, which embeds fault-tolerance mechanisms that help to ensure a resilient and highly available message broker service. FT-CORBA has been used, e.g., for an Air Traffic Control System, which is indeed a safety-critical system [12].
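The Active/Standby fail-over behaviour attributed to the CRM can be pictured with a toy model. This is a minimal sketch under stated assumptions, not the Pacemaker or Corosync API: the `Node` and `ToyCluster` classes, the `healthy` flag, and the node names are all hypothetical, and quorum, fencing, and resource dependencies are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool = True  # outcome of a (simulated) resource health check


@dataclass
class ToyCluster:
    """Two-node Active/Standby pair; a stand-in for the CRM, not its API."""
    active: Node
    standby: Node
    log: List[str] = field(default_factory=list)

    def monitor(self) -> Optional[Node]:
        """Run one monitoring cycle and fail over if the active node is unhealthy.

        Returns the node providing the service, or None if no node can.
        """
        if self.active.healthy:
            return self.active
        if self.standby.healthy:
            self.log.append(f"failover: {self.active.name} -> {self.standby.name}")
            # Swap roles: the former standby now provides the service.
            self.active, self.standby = self.standby, self.active
            return self.active
        self.log.append("service down: no healthy node")
        return None


if __name__ == "__main__":
    cluster = ToyCluster(Node("tms-a"), Node("tms-b"))  # hypothetical node names
    cluster.active.healthy = False  # simulate a failure on the active node
    serving = cluster.monitor()
    print(serving.name if serving else "no service")
    print(cluster.log)
```

In this model the service is lost (the non-functional failure defined earlier) only when both nodes are unhealthy in the same monitoring cycle.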