
Monday, April 4, 2011

CRITICAL SYSTEMS

INTRODUCTION

Software failures are relatively common. In most cases, these failures cause inconvenience but no serious, long-term damage. However, in some systems failure can result in significant economic losses, physical damage or threats to human life. These systems are called critical systems. Critical systems are technical or socio-technical systems that people or businesses depend on. If these systems fail to deliver their services as expected, then serious problems and significant losses may result. There are three main types of critical systems:

- Safety-critical systems. A system whose failure may result in injury, loss of life or serious environmental damage. An example of a safety-critical system is a control system for a chemical manufacturing plant.
- Mission-critical systems. A system whose failure may result in the failure of some goal-directed activity. An example of a mission-critical system is a navigational system for a spacecraft.
- Business-critical systems. A system whose failure may result in very high costs for the business using that system. An example of a business-critical system is the customer accounting system in a bank.

The most important emergent property of a critical system is its dependability. The term dependability was proposed by Laprie (Laprie, 1995) to cover the related system attributes of availability, reliability, safety and security. There are several reasons why dependability is the most important emergent property for critical systems:

- Systems that are unreliable, unsafe or insecure are often rejected by their users. If users don't trust a system, they will refuse to use it. Furthermore, they may also refuse to buy or use products from the same company as the untrustworthy system, believing that these products cannot be trusted either.
- System failure costs may be enormous. For some applications, such as a reactor control system or an aircraft navigation system, the cost of system failure is orders of magnitude greater than the cost of the control system.
- Untrustworthy systems may cause information loss. Data is very expensive to collect and maintain; it may sometimes be worth more than the computer system on which it is processed. A great deal of effort and money may have to be spent duplicating valuable data to guard against data corruption.

The high cost of critical systems failure means that trusted methods and techniques must be used for development. Consequently, critical systems are usually developed using well-tried techniques rather than newer techniques that have not been subject to extensive practical experience. Rather than embrace new techniques and methods, critical systems developers are naturally conservative. They prefer to use older techniques whose strengths and weaknesses are understood, rather than new techniques which may appear to be better but whose long-term problems are unknown.

There are three system components where critical systems failures may occur:

- System hardware may fail because of mistakes in its design, because components fail as a result of manufacturing errors, or because the components have reached the end of their natural life.
- System software may fail because of mistakes in its specification, design or implementation.
- Human operators of the system may fail to operate the system correctly. As hardware and software have become more reliable, failures in operation are now probably the largest single cause of system failures.

These failures can be interrelated.
A failed hardware component may mean system operators have to cope with an unexpected situation and additional workload. This puts them under stress, and people under stress often make mistakes. This can cause the software to fail, which means more work for the operators, even more stress, and so on.

A Simple Safety-Critical System

There are many types of critical computer-based systems, ranging from control systems for devices and machinery to information and e-commerce systems. Diabetes is a relatively common condition where the human pancreas is unable to produce sufficient quantities of a hormone called insulin. Insulin metabolises glucose in the blood. The conventional treatment of diabetes involves regular injections of genetically engineered insulin. Diabetics measure their blood sugar levels using an external meter and then calculate the dose of insulin that they should inject.

The problem with this treatment is that the level of insulin in the blood does not just depend on the blood glucose level but is a function of the time when the insulin injection was taken. This can lead to very low levels of blood glucose (if there is too much insulin) or very high levels of blood sugar (if there is too little insulin). Low blood sugar can lead to impaired brain functioning and, ultimately, unconsciousness and death. In the long term, continual high levels of blood sugar can lead to eye damage, kidney damage and heart problems.

Current advances in developing miniaturised sensors have meant that it is now possible to develop automated insulin delivery systems. These systems monitor blood sugar levels and deliver an appropriate dose of insulin when required. Insulin delivery systems like this already exist for the treatment of hospital patients. In the future, it may be possible for many diabetics to have such systems permanently attached to their bodies.

A software-controlled insulin delivery system might work by using a micro-sensor embedded in the patient to measure some blood parameter that is proportional to the sugar level. This is then sent to the pump controller. This controller computes the sugar level and the amount of insulin that is needed. It then sends signals to a miniaturised pump to deliver the insulin via a permanently attached needle. The components and organisation of the insulin pump can be described by a data flow model that illustrates how an input blood sugar level is transformed into a sequence of pump control commands.

There are two high-level dependability requirements for this insulin pump system:

- The system shall be available to deliver insulin when required.
- The system shall perform reliably and deliver the correct amount of insulin to counteract the current level of blood sugar.

Failure of the system could, in principle, cause excessive doses of insulin to be delivered and this could threaten the life of the user. It is particularly important that overdoses of insulin should not occur.
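The sensor-to-controller-to-pump loop described above can be sketched in a few lines of code. This is only an illustrative outline: the names (read_blood_sugar, compute_insulin_dose, Pump), the target glucose level and the sensitivity constant are all invented for this sketch and are not taken from any real device or from the data flow model itself.

```python
# Hypothetical sketch of the insulin pump control loop described above.
# All names and numbers are illustrative, not from a real device.

TARGET_SUGAR = 6.0   # assumed target blood glucose level (mmol/L)
SENSITIVITY = 0.5    # assumed units of insulin per mmol/L above target


def compute_insulin_dose(sugar_level: float) -> float:
    """Very simple proportional dose calculation (illustrative only)."""
    if sugar_level <= TARGET_SUGAR:
        return 0.0
    return (sugar_level - TARGET_SUGAR) * SENSITIVITY


class Pump:
    def deliver(self, dose: float) -> None:
        print(f"Delivering {dose:.1f} units of insulin")


def control_cycle(read_blood_sugar, pump: Pump) -> None:
    """One pass of the sensor -> controller -> pump data flow."""
    sugar_level = read_blood_sugar()          # micro-sensor reading
    dose = compute_insulin_dose(sugar_level)  # controller computation
    if dose > 0:
        pump.deliver(dose)                    # command the pump


if __name__ == "__main__":
    control_cycle(lambda: 9.5, Pump())   # simulated reading of 9.5 mmol/L
```

In a real controller this cycle would run continuously and the dose computation would be far more sophisticated; the point here is only the flow of data from sensor to pump command.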
System Dependability

The dependability of a computer system is a property of the system that equates to its trustworthiness. Trustworthiness essentially means the degree of user confidence that the system will operate as they expect and that the system will not fail in normal use. There are four principal dimensions to dependability:

- Availability. Informally, the availability of a system is the probability that it will be up and running and able to deliver useful services at any given time.
- Reliability. Informally, the reliability of a system is the probability, over a given period of time, that the system will correctly deliver services as expected by the user.
- Safety. Informally, the safety of a system is a judgement of how likely it is that the system will cause damage to people or its environment.
- Security. Informally, the security of a system is a judgement of how likely it is that the system can resist accidental or deliberate intrusions.

As well as these four main dimensions, other system properties can also be considered under the heading of dependability:

- Repairability. System failures are inevitable, but the disruption caused by failure can be minimised if the system can be repaired quickly. In order for that to happen, it must be possible to diagnose the problem, access the component that has failed and make changes to fix that component.
- Maintainability. As systems are used, new requirements emerge. It is important to maintain the usefulness of a system by changing it to accommodate these new requirements. Maintainable software is software that can be adapted economically to cope with new requirements, and where there is a low probability that making changes will introduce new errors into the system.
- Survivability. A very important attribute for Internet-based systems is survivability, which is closely related to security and availability. Survivability is the ability of a system to continue to deliver service whilst it is under attack and, potentially, while part of the system is disabled. Work on survivability focuses on identifying key system components and ensuring that they can deliver a minimal service. Three strategies are used to enhance survivability: resistance to attack, attack recognition and recovery from the damage caused by an attack.
- Error tolerance. This property can be considered as part of usability and reflects the extent to which the system has been designed so that user input errors are avoided and tolerated. When user errors occur, the system should, as far as possible, detect these errors and either fix them automatically or request the user to re-input their data.

Because of additional design, implementation and validation costs, increasing the dependability of a system can significantly increase development costs. In particular, validation costs are high for critical systems. The relationship between costs and incremental improvements in dependability is steep: the higher the dependability that you need, the more you have to spend on testing to check that you have reached that level.

Critical Systems Specification

The need for dependability in critical systems generates both functional and non-functional system requirements:

- System functional requirements may be generated to define error checking and recovery facilities and features that provide protection against system failures.
- Non-functional requirements may be generated to define the required reliability and availability of the system.

In addition to these requirements, safety and security considerations can generate a further type of requirement that is difficult to classify as functional or non-functional. These are high-level requirements that are perhaps best described as 'shall not' requirements. By contrast with normal functional requirements that define what the system shall do, 'shall not' requirements define system behaviour that is unacceptable.
Examples of 'shall not' requirements are:

- The system shall not allow users to modify access permissions on any files that they have not created. (security)
- The system shall not allow reverse thrust mode to be selected when the aircraft is in flight. (safety)
- The system shall not allow the simultaneous activation of more than three alarm signals. (safety)

Risk-Driven Specification

Critical systems specification supplements the normal requirements specification process by focusing on the dependability of the system. Its objective is to understand the risks faced by the system and to generate dependability requirements to cope with them. Risk-driven specification has been widely used by safety and security critical systems developers. In safety-critical systems, the risks are hazards that can result in accidents; in security-critical systems, the risks are vulnerabilities that can lead to a successful attack on a system. The risk-driven specification process involves understanding the risks faced by the system, discovering their root causes and generating requirements to manage these risks. The iterative process of risk analysis involves:

- Risk identification. Potential risks that might arise are identified. These are dependent on the environment in which the system is to be used.
- Risk analysis and classification. The risks are considered separately. Those that are potentially serious and not implausible are selected for further analysis. At this stage, some risks may be eliminated simply because they are very unlikely ever to arise.
- Risk decomposition. Each risk is analysed individually to discover potential root causes of that risk. Techniques such as fault-tree analysis may be used.
- Risk reduction assessment. Proposals for ways in which the identified risks may be reduced or eliminated are made. These then generate system dependability requirements that define the defences against the risk and how the risk will be managed if it arises.

Risk Identification and Classification

The risk analysis and classification process is primarily concerned with understanding the likelihood that a risk will arise and the potential consequences if an accident or incident associated with that risk should occur. We need to make this analysis to understand whether a risk is a serious threat to the system or environment, and to provide a basis for deciding the resources that should be used to manage the risk. The outcome of the classification process is a statement of acceptability. Risks can be placed in three categories:

- Intolerable. The system must be designed in such a way that either the risk cannot arise or, if it does arise, it will not result in an accident. Intolerable risks would typically be those that threaten human life or the financial stability of a business and which have a significant probability of occurrence. An example of an intolerable risk for an e-commerce system in an Internet bookstore, say, would be a risk of the system going down for more than a day.
- As low as reasonably practical (ALARP). The system must be designed so that the probability of an accident arising because of the hazard is minimised, subject to other considerations such as cost and delivery. ALARP risks are those which have less serious consequences or which have a low probability of occurrence. An ALARP risk for an e-commerce system might be corruption of the web page images that present the brand of the company. This is commercially undesirable but is unlikely to have serious short-term consequences.
- Acceptable. While the system designers should take all possible steps to reduce the probability of an 'acceptable' hazard arising, these should not increase costs, delivery time or other non-functional system attributes. An example of an acceptable risk for an e-commerce system is the risk that people using beta-release web browsers could not successfully complete orders.
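As a rough illustration of the classification step, the sketch below maps an estimated probability and severity onto these three categories. The scales, thresholds, class names and example risks are invented for illustration; in a real project they would come from the application domain, applicable regulation and the organisation's risk policy.

```python
# Illustrative sketch of risk analysis and classification.
# Probability/severity scales and thresholds are invented for this example.

from dataclasses import dataclass

INTOLERABLE, ALARP, ACCEPTABLE = "intolerable", "ALARP", "acceptable"


@dataclass
class Risk:
    description: str
    probability: float   # estimated likelihood of the risk arising (0..1)
    severity: int        # 1 = minor inconvenience .. 5 = threat to life/business


def classify(risk: Risk) -> str:
    """Map likelihood and consequence onto the three acceptability classes.
    Real classification matrices are finer-grained than this sketch."""
    if risk.severity >= 4 and risk.probability >= 0.01:
        return INTOLERABLE
    if risk.severity >= 2:
        return ALARP
    return ACCEPTABLE


risks = [
    Risk("insulin overdose delivered", probability=0.02, severity=5),
    Risk("web page branding corrupted", probability=0.05, severity=2),
    Risk("beta browser cannot complete order", probability=0.001, severity=1),
]

for r in risks:
    print(f"{r.description}: {classify(r)}")
# -> intolerable, ALARP, acceptable respectively
```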
Risk Decomposition

Risk decomposition is the process of discovering the root causes of risks in a particular system. Techniques for risk decomposition have been primarily derived from safety-critical systems development, where hazard analysis is a central part of the safety process. Risk analysis can be either deductive or inductive. Deductive, top-down techniques, which tend to be easier to use, start with the risk and work from that to the possible system failure; inductive, bottom-up techniques start with a proposed system failure and identify which hazards might arise that could lead to that failure.

Various techniques have been proposed as possible approaches to risk decomposition. These include reviews and checklists, as well as more formal techniques such as Petri net analysis (Peterson, 1981), formal logic (Jahanian and Mok, 1986) and fault-tree analysis. The fault-tree analysis technique was developed for safety-critical systems and is relatively easy to understand without specialist domain knowledge. Fault-tree analysis involves identifying the undesired event and working backwards from that event to discover the possible causes of the hazard. You put the hazard at the root of the tree and identify the states that can lead to that hazard. For each of these states, you then identify the states that can lead to it, and continue this decomposition until you identify the root causes of the risk. States can be linked with 'and' and 'or' symbols. Risks that require a combination of root causes are usually less probable than risks that can result from a single root cause.

A fault tree can be constructed for the software-related hazards in the insulin delivery system. Insulin underdose and insulin overdose really represent a single hazard, namely 'incorrect insulin dose administered', and a single fault tree can be drawn. Fault trees are also used to identify potential hardware problems. A fault tree may provide insights into requirements for software to detect and, perhaps, correct these problems. Hardware errors such as sensor, pump or timer errors can be discovered and warnings issued before they have a serious effect on the patient.
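A fault tree can be represented as a small data structure of events joined by 'and'/'or' gates, as sketched below. The event names are only loosely inspired by the insulin pump discussion and do not reproduce any published fault tree; the structure simply illustrates how the hazard at the root is traced back to root causes at the leaves.

```python
# Toy fault-tree representation with AND/OR gates (illustrative only).

from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "leaf"                 # "leaf", "or", or "and"
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active_events: Set[str]) -> bool:
        """Evaluate whether this node's event occurs, given the set of
        leaf events (root causes) that are currently active."""
        if self.gate == "leaf":
            return self.name in active_events
        results = [c.occurs(active_events) for c in self.children]
        return any(results) if self.gate == "or" else all(results)


# Hazard at the root; possible root causes at the leaves.
tree = Node("incorrect insulin dose administered", "or", [
    Node("incorrect sugar level measured", "or", [
        Node("sensor failure"),
        Node("sugar computation error", "and", [   # needs both causes
            Node("algorithm error"),
            Node("incorrect input data"),
        ]),
    ]),
    Node("incorrect insulin dose computed", "or", [
        Node("arithmetic error"),
        Node("algorithm error"),
    ]),
    Node("pump signals incorrect"),
])

print(tree.occurs({"arithmetic error"}))       # True: a single root cause suffices
print(tree.occurs({"incorrect input data"}))   # False: the AND gate needs both causes
```

The second evaluation illustrates the point made above: a risk that requires a combination of root causes (the 'and' gate) is harder to trigger than one reachable from a single root cause.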
Risk Reduction Assessment

Once potential risks and their root causes have been identified, you should then derive system dependability requirements that manage the risks and ensure that incidents or accidents do not occur. There are three possible strategies that you can use:

- Risk avoidance. The system is designed so that the risk or hazard cannot arise.
- Risk detection and removal. The system is designed so that risks are detected and neutralised before they result in an accident.
- Damage limitation. The system is designed so that the consequences of an accident are minimised.

Normally, designers of critical systems use a combination of these approaches. In a safety-critical system, intolerable hazards may be handled by minimising their probability and adding a protection system that provides a safety backup. For example, in a chemical plant control system, the system will attempt to detect and avoid excess pressure in the reactor. However, there should also be an independent protection system that monitors the pressure and opens a relief valve if high pressure is detected.

In the insulin delivery system, a 'safe state' is a shutdown state where no insulin is injected. Over a short period this will not pose a threat to the diabetic's health. If the potential software problems identified are considered, the following 'solutions' might be developed:

- Arithmetic error. This arises when some arithmetic computation causes a representation failure. The specification must identify all possible arithmetic errors that may occur; these depend on the algorithm used. The specification might state that an exception handler must be included for each identified arithmetic error, and it should set out the action to be taken for each of these errors if they arise. A safe action is to shut down the delivery system and activate a warning alarm.
- Algorithmic error. This is a more difficult situation, as no definite anomalous situation can be detected. It might be detected by comparing the required insulin dose computed with the previously delivered dose. If it is much higher, this may mean that the amount has been computed incorrectly. The system may also keep track of the dose sequence. After a number of above-average doses have been delivered, a warning may be issued and further dosage limited.

Examples of safety requirements for the insulin pump are:

- SR1: The system shall not deliver a single dose of insulin that is greater than a specified maximum dose for a system user.
- SR2: The system shall not deliver a daily cumulative dose of insulin that is greater than a specified maximum for a system user.
- SR3: The system shall include a hardware diagnostic facility that shall be executed at least four times per hour.
- SR4: The system shall include an exception handler for all of the exceptions that are identified in the table.
- SR5: The audible alarm shall be sounded when any hardware or software anomaly is discovered, and a diagnostic message as defined in the table should be displayed.
- SR6: In the event of an alarm in the system, insulin delivery shall be suspended until the user has reset the system and cleared the alarm.
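A sketch of how the dose-related requirements (SR1, SR2 and SR6) might be enforced in the controller is shown below. The maximum dose values, the class and the method names are placeholders invented for illustration; they are not taken from any real pump specification.

```python
# Sketch of enforcing SR1, SR2 and SR6 in the pump controller.
# The limits and names are invented placeholders.

MAX_SINGLE_DOSE = 4.0    # SR1: assumed per-user maximum single dose (units)
MAX_DAILY_DOSE = 25.0    # SR2: assumed per-user maximum cumulative daily dose


class SafetyError(Exception):
    """Raised when a requested delivery would violate a safety requirement."""


class SafeDeliveryController:
    def __init__(self) -> None:
        self.delivered_today = 0.0
        self.alarm_active = False

    def deliver(self, dose: float) -> None:
        if self.alarm_active:
            # SR6: no delivery until the user has reset the system.
            raise SafetyError("insulin delivery suspended until alarm is cleared")
        if dose > MAX_SINGLE_DOSE:
            self.raise_alarm("single dose exceeds maximum")            # SR1
        if self.delivered_today + dose > MAX_DAILY_DOSE:
            self.raise_alarm("cumulative daily dose exceeds maximum")  # SR2
        self.delivered_today += dose
        print(f"delivered {dose:.1f} units (today: {self.delivered_today:.1f})")

    def raise_alarm(self, reason: str) -> None:
        # Sketch of SR5: record the anomaly and refuse delivery.
        self.alarm_active = True
        raise SafetyError(f"ALARM: {reason}")

    def reset(self) -> None:
        """User resets the system and clears the alarm (SR6)."""
        self.alarm_active = False


if __name__ == "__main__":
    controller = SafeDeliveryController()
    controller.deliver(2.0)
    try:
        controller.deliver(6.0)      # violates SR1, raises the alarm
    except SafetyError as e:
        print(e)                     # further deliveries blocked until reset()
```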
Safety Specification

The process of safety specification and assurance is part of an overall safety life cycle that is defined in an international standard for safety management, IEC 61508 (IEC, 1998). This standard was developed specifically for protection systems, such as a system that stops a train if it passes a red signal, although it can also be used for more general safety-critical systems such as control systems. In the system model assumed by the IEC 61508 standard, the control system controls some equipment that has associated high-level safety requirements. These high-level requirements generate two types of more detailed safety requirements that apply to the protection system for the equipment:

- Functional safety requirements, which define the safety functions of the system.
- Safety integrity requirements, which define the reliability and availability of the protection system. These are based on the expected usage of the protection system and are intended to ensure that it will work when it is needed. Systems are classified using a safety integrity level (SIL) from 1 to 4. Each SIL represents a higher level of reliability; the more critical the system, the higher the SIL required.

The first stages of the IEC 61508 safety life cycle define the scope of the system, assess the potential system hazards and estimate the risks they pose. This is followed by safety requirements specification and the allocation of these safety requirements to different sub-systems. The development activity involves planning and implementation. The safety-critical system itself is designed and implemented, as are related external systems that may provide additional protection. In parallel with this, the safety validation, installation, and operation and maintenance of the system are planned.

Safety management does not stop on delivery of the system. After delivery, the system must be installed as planned so that the hazard analysis remains valid. Safety validation is then carried out before the system is put into use. Safety must also be managed during the operation and (particularly) the maintenance of the system. Many safety-related systems problems arise because of a poor maintenance process, so it is particularly important that the system is designed for maintainability. Finally, safety considerations that may apply during decommissioning (e.g., disposal of hazardous material in circuit boards) should also be taken into account.

Security Specification

The specification of security requirements for systems has something in common with safety requirements. It is impractical to specify them quantitatively, and security requirements are often 'shall not' requirements that define unacceptable system behaviour rather than required system functionality. However, there are important differences between these types of requirements:

- The notion of a safety life cycle that covers all aspects of safety management is well developed. The area of security specification and management is still immature, and there is no accepted equivalent of a security life cycle.
- Although some security threats are system specific, many are common to all types of systems. All systems must protect themselves against intrusion, denial of service, and so on. By contrast, hazards in safety-critical systems are domain-specific.
- Security techniques and technologies such as encryption and authentication devices are fairly mature. However, using this technology effectively often requires a high level of technical sophistication. It can be difficult to install, configure and keep up to date. Consequently, system managers make mistakes, leaving vulnerabilities in the system.
- The dominance of one software supplier in world markets means that a huge number of systems may be affected if security in their programs is breached. There is insufficient diversity in the computing infrastructure, and consequently it is more vulnerable to external threats. Safety-critical systems are usually specialised, custom systems, so this situation does not arise.

The conventional (non-computerised) approach to security analysis is based around the assets to be protected and their value to an organisation. The stages in this process are:

- Asset identification and evaluation. The assets (data and programs) and their required degree of protection are identified. The required protection depends on the asset value, so a password file is normally more valuable than a set of public web pages, because a successful attack on the password file has serious system-wide consequences.
- Threat analysis and risk assessment. Possible security threats are identified and the risks associated with each of these threats are estimated.
- Threat assignment. Identified threats are related to the assets so that, for each identified asset, there is a list of associated threats.
- Technology analysis. Available security technologies and their applicability against the identified threats are assessed.
- Security requirements specification. The security requirements are specified. Where appropriate, they explicitly identify the security technologies that may be used to protect against threats to the system.
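The first three of these stages can be pictured with a small sketch like the one below, which values assets, assigns threats to them and ranks the resulting exposures so that requirements can address the largest exposures first. All names, asset values and likelihoods are invented; a real analysis would use organisation-specific valuations and threat catalogues.

```python
# Illustrative sketch of asset-driven security analysis.
# Asset values and threat likelihoods are invented.

from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    name: str
    value: int                 # relative value to the organisation (1..10)


@dataclass
class Threat:
    description: str
    likelihood: float          # estimated probability of a successful attack


@dataclass
class AssignedThreat:
    asset: Asset
    threat: Threat

    @property
    def exposure(self) -> float:
        # Simple value-times-likelihood ranking, used only for illustration.
        return self.asset.value * self.threat.likelihood


# Stage 1: asset identification and evaluation
password_file = Asset("password file", value=10)
public_pages = Asset("public web pages", value=2)

# Stages 2 and 3: threat analysis and threat assignment
register: List[AssignedThreat] = [
    AssignedThreat(password_file, Threat("unauthorised disclosure", 0.05)),
    AssignedThreat(public_pages, Threat("defacement", 0.20)),
]

for item in sorted(register, key=lambda a: a.exposure, reverse=True):
    print(f"{item.asset.name}: {item.threat.description} "
          f"(exposure {item.exposure:.2f})")
```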
Security specification and security management are essential for all critical systems. If a system is insecure, then it is subject to infection with viruses and worms, corruption and unauthorised modification of data, and denial of service attacks. All of this means that we cannot be confident that the efforts made to ensure safety and reliability will be effective.

Different types of security requirements address the different threats faced by a system. Firesmith (Firesmith, 2003) identifies 10 types of security requirements that may be included in a system:

- Identification requirements specify whether a system should identify its users before interacting with them.
- Authentication requirements specify how users are identified.
- Authorisation requirements specify the privileges and access permissions of identified users.
- Immunity requirements specify how a system should protect itself against viruses, worms and similar threats.
- Integrity requirements specify how data corruption can be avoided.
- Intrusion detection requirements specify what mechanisms should be used to detect attacks on the system.
- Non-repudiation requirements specify that a party in a transaction cannot deny its involvement in that transaction.
- Privacy requirements specify how data privacy is to be maintained.
- Security auditing requirements specify how system use can be audited and checked.
- System maintenance security requirements specify how an application can prevent authorised changes from accidentally defeating its security mechanisms.

Software Reliability Specification

Reliability is a complex concept that should always be considered at the system rather than the individual component level. Because the components in a system are interdependent, a failure in one component can be propagated through the system and affect the operation of other components. In a computer-based system, you have to consider three dimensions when specifying the overall system reliability:

- Hardware reliability. What is the probability of a hardware component failing, and how long would it take to repair that component?
- Software reliability. How likely is it that a software component will produce an incorrect output? Software failures are different from hardware failures in that software does not wear out: it can continue operating correctly after producing an incorrect result.
- Operator reliability. How likely is it that the operator of a system will make an error?

All of these are closely linked. Hardware failure can cause spurious signals to be generated that are outside the range of inputs expected by software. The software can then behave unpredictably. Unexpected system behaviour may confuse the operator and result in operator stress. The operator may then act incorrectly and choose inputs that are inappropriate for the current failure situation. These inputs further confuse the system and more errors are generated. A single sub-system failure that is recoverable can thus rapidly develop into a serious problem requiring a complete system shutdown.

System reliability should be specified as a non-functional requirement that, ideally, is expressed quantitatively using one of the metrics discussed in the next section. To meet the non-functional reliability requirements, it may be necessary to specify additional functional and design requirements on the system that specify how failures may be avoided or tolerated.

Reliability Metrics

Reliability metrics were first devised for hardware components. Hardware component failure is inevitable due to physical factors such as mechanical abrasion and electrical heating. Components have limited life spans, which is reflected in the most widely used hardware reliability metric, mean time to failure (MTTF). The MTTF is the mean time for which a component is expected to be operational. Hardware component failure is usually permanent, so the mean time to repair (MTTR), which reflects the time needed to repair or replace the component, is also significant. However, these hardware metrics are not directly applicable to software reliability specification because software component failures are often transient rather than permanent: they show up only with some inputs, and if the data is undamaged, the system can often continue in operation after a failure has occurred. The metrics that have been used for specifying software reliability and availability are:

- POFOD (probability of failure on demand). The likelihood that the system will fail when a service request is made. A POFOD of 0.001 means that one out of a thousand service requests may result in failure.
- ROCOF (rate of occurrence of failures). The frequency with which unexpected behaviour is likely to occur. A ROCOF of 2/100 means that two failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.
- MTTF (mean time to failure). The average time between observed system failures. An MTTF of 500 means that one failure can be expected every 500 time units.
- AVAIL (availability). The probability that the system is available for use at a given time. An availability of 0.998 means that the system is likely to be available for 998 of every 1,000 time units.

The choice of which metric should be used depends on the type of system to which it applies and the requirements of the application domain. Some examples of the types of system where these different metrics may be used are:

- Probability of failure on demand. This metric is most appropriate for systems where services are demanded at unpredictable or relatively long time intervals, and where there are serious consequences if the service is not delivered. It might be used to specify protection systems such as the reliability of a pressure relief system in a chemical plant or an emergency shutdown system in a power plant.
- Rate of occurrence of failures. This metric should be used where regular demands are made on system services and where it is important that these services are correctly delivered. It might be used in the specification of a bank teller system that processes customer transactions, or in a hotel reservation system.
- Mean time to failure. This metric should be used in systems where there are long transactions; that is, where people use the system for a long time. The MTTF should be longer than the average length of each transaction. Examples of systems where this metric may be used are word processor systems and CAD systems.
- Availability. This metric should be used in non-stop systems where users expect the system to deliver a continuous service. Examples of such systems are telephone switching systems and railway signalling systems.

There are three kinds of measurements that can be made when assessing the reliability of a system (a sketch of the corresponding calculations appears at the end of this section):

- The number of system failures given a number of requests for system services. This is used to measure the POFOD.
- The time (or number of transactions) between system failures. This is used to measure ROCOF and MTTF.
- The elapsed repair or restart time when a system failure occurs. Given that the system must be continuously available, this is used to measure AVAIL.

Non-Functional Reliability Requirements

The types of failure that can occur are system specific, and the consequences of a system failure depend on the nature of that failure. When writing a reliability specification, you should identify the different types of failure and think about whether these should be treated differently in the specification. Examples of different failure classes include transient, recoverable and corrupting failures; obviously, combinations of these, such as a failure that is transient, recoverable and corrupting, can occur. Most large systems are composed of several sub-systems with different reliability requirements. Because very high-reliability software is expensive, you should assess the reliability requirements of each sub-system separately, rather than impose the same reliability requirement on all sub-systems.

The steps involved in establishing a reliability specification are:

- For each sub-system, identify the types of system failure that may occur and analyse the consequences of these failures.
- From the system failure analysis, partition failures into appropriate classes. A reasonable starting point is to use the failure types described above.
- For each failure class identified, define the reliability requirement using an appropriate reliability metric. It is not necessary to use the same metric for different classes of failure. If a failure requires some intervention to recover from it, the probability of that failure occurring on demand might be the most appropriate metric. When automatic recovery is possible and the effect of the failure is user inconvenience, ROCOF might be more appropriate.
- Where appropriate, identify functional reliability requirements that define system functionality to reduce the probability of critical failures.
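As a closing illustration, the sketch below computes the four metrics from the kinds of measurement listed earlier. The availability calculation uses the usual steady-state approximation MTTF / (MTTF + MTTR), and the figures are invented so that the results match the example values given in the metric definitions above.

```python
# Sketch of computing reliability metrics from measured failure data.
# All input figures are invented for illustration.

from typing import List


def pofod(failures: int, demands: int) -> float:
    """Probability of failure on demand: failures per service request."""
    return failures / demands


def rocof(failures: int, operational_time: float) -> float:
    """Rate of occurrence of failures per unit of operational time."""
    return failures / operational_time


def mttf(times_between_failures: List[float]) -> float:
    """Mean time to failure: average time between observed failures."""
    return sum(times_between_failures) / len(times_between_failures)


def availability(mean_time_to_failure: float, mean_time_to_repair: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mean_time_to_failure / (mean_time_to_failure + mean_time_to_repair)


print(pofod(failures=1, demands=1000))            # 0.001
print(rocof(failures=2, operational_time=100))    # 0.02, i.e. 2/100
print(mttf([480.0, 510.0, 510.0]))                # 500.0
print(availability(499.0, 1.0))                   # 0.998
```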
