Software Fault Reporting Processes in Business-Critical Systems. Jon Arvid Børretzen. Doctoral Thesis.
Safety-critical: A safety-critical system is a computer, electronic or electromechanical system where a hazardous event may cause injury or even death to human beings, or physical harm to other objects that interact with the system. Examples are aircraft control systems and nuclear power-station control systems, where an accident will in most cases lead to economic losses as well as injury and other physical damage. Common tools for designing safety-critical systems are redundancy and formal methods, and a spectrum of specialized techniques exists for safety-critical systems (e.g., HAZOP and fault-tree analysis). The IEC 61508 standard is intended to be a basic functional safety standard applicable to all kinds of industry, and is also used to define the safety standards of some safety-critical systems [IEC 61508].
Mission-critical: The term mission-critical system reflects military usage and is used to describe activities, processing etc. that are deemed vital to the organization's business success and, possibly, its very existence. A major software system is described as mission-critical if a failure or unavailability of the system, product or service would have a significant negative impact upon the organization. Such systems typically include support for accounts/billing, customer balances, computer-controlled machinery and production lines, just-in-time ordering, and delivery scheduling. Examples of related technologies are Enterprise Resource Planning tools, such as SAP [SAP].
Performance-critical: The SEI defines performance-criticality as the ability of software-intensive systems to perform successfully under adverse circumstances, e.g., under heavy or unexpected load or in the presence of subsystem failures. One simple example is the performance of SMS telecom services on New Year's Eve. Some services like this can have critical functions, and yet the behaviour of systems under such circumstances is often less than acceptable [SEI].
Business-critical: The difference between a business-critical and a regular commercial software system is really defined by the business. There is no established general definition telling us which software applications are critical to an operation. In a retail business, a Customer Relationship Management (CRM) system may be the most important; in a manufacturing business, on the other hand, it may be the manufacturing or supplier management software that is the most important. We need to consider the impact of relevant software services on the business operations, and determine how much value each brings to the business and the impact of such software parts being unavailable. The impact can be lost revenue, corrupted data or lost user time, as well as indirect and more elusive losses in customer reputation, goodwill, slipped deadlines, and increased levels of stress among employees and customers.
Non-critical: Although important enough, some types of software will simply not be classified as critical. Word processors, spreadsheets and graphical design software are examples of such software. Of course it is expected that such tools are reasonably fault-free and stable, but should they fail, the damage will usually be limited, typically a person-day of effort in the worst-case scenario.
Figure 2-5 shows the relationship between business-criticality and the other types of criticality defined here. As we see, safety-, performance-, and mission-critical systems can also be business-critical, but a business-critical system need not be one of the others. Table 2-1 illustrates the overlap between the different categories.
Figure 2-5 Relationship of business-critical and other types of criticality
Table 2-1 Examples of different systems' criticality

Criticality category   Example
Safety-critical        Nuclear reactor control system.
Performance-critical   Electronic toll collection in traffic; must process and transfer information quickly enough to keep up with traffic.
Mission-critical       Software handling financial transactions between banks. Functional and non-functional aspects of such applications are considered.
Business-critical      Software handling financial transactions between banks. As mission-critical, but wider consequences are also considered.
Non-critical           Computer games, word processor application.
2.7 Techniques and methods used to develop safety-critical systems
There are a number of methods and techniques that are commonly employed when making safety-critical systems. Some of them will be presented here and related to business-critical computing. According to [Leveson95] and [Rausand91], the most common ones are the following:
o PHA (Preliminary Hazard analysis): Preliminary Hazard Analysis (PHA) is used in the early project life cycle stages to identify critical system functions and broad system hazards, so as to enable hazard elimination, reduction or control further on in the project. The identified hazards are assessed and prioritized, and safety design criteria and requirements are identified. A PHA is started early in the concept exploration phase so that safety considerations are included in tradeoff studies and design alternatives. This process is iterative, with the PHA being updated as more information about the design is obtained and as changes are being made. The results serve as a baseline for later analysis and are used in developing system safety requirements and in the preparation of performance and design specifications. Since PHA starts at the concept formation stage of a project, little detail is available, and the assessments of hazard and risk levels are therefore qualitative. A PHA should be performed by a small group with good knowledge about the system specifications.
o HAZOP (Hazard and Operability Analysis): This is a method to identify possible safety-related or operational problems that can occur during the use and maintenance of a system. Both Preliminary Hazard Analysis and Hazard and Operability Analysis (HAZOP) are performed to identify hazards and potential problems that the stakeholders see at the conceptual stage, and that could be created by system usage. A HAZOP study is a systematic analysis of how deviations from the intended design specifications in a system can arise, and whether these deviations can result in hazards. Both analysis methods build on information that is available at an early stage of the project. This information can be used to reduce the severity of, or build safeguards against, the effects of the identified hazards. HAZOP is a creative team method, using a set of guidewords to trigger creative thinking among the stakeholders and the cross-functional team in RUP. The guidewords are applied to all parts and aspects of the system concept plan and early design documents, to find and eliminate possible deviations from design intentions. An example of a guideword is MORE, which signifies an increase of some quantity in the system. For example, applying the MORE guideword to “a customer client application” yields “MORE customer client applications”, which could spark ideas like “How will the system react if the servers get swamped with customer client requests?” and “How will we deal with many different client application versions making requests to the servers?” A HAZOP study is conducted by a team consisting of four to eight persons with detailed knowledge of the system to be analysed. The main difference between HAZOP and PHA is that PHA is a lighter method that needs less effort and less available information than HAZOP. Since HAZOP is a more thorough and systematic analysis method, its results will be more specific. If there is enough information available for a HAZOP study, and the development team can spare the effort, a HAZOP study will most likely produce more precise and suitable results for a safety requirement specification.
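The guideword mechanism described above can be sketched programmatically: crossing each guideword with each system element produces the deviation questions the team works through. The guidewords and system elements below are illustrative, not drawn from any particular HAZOP standard.

```python
# Sketch of HAZOP-style prompt generation: each guideword is combined with
# each system element to trigger deviation questions for the analysis team.
GUIDEWORDS = ["NO", "MORE", "LESS", "REVERSE", "OTHER THAN"]
elements = ["customer client requests", "payment confirmation messages"]

# One deviation question per (element, guideword) pair.
prompts = [
    f"What if there are {gw} {elem}?" for elem in elements for gw in GUIDEWORDS
]
```

In a real study each prompt would be discussed by the team and either dismissed or recorded as a hazard with a cause and a safeguard; the cross product merely guarantees systematic coverage.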
o FMEA (Failure Modes and Effects Analysis): The method of Failure Modes and Effects Analysis, alternatively the variant Failure Modes, Effects and Criticality Analysis (FMECA), is used to study the potential effects of fault occurrences in a system. Failure Modes and Effects Analysis is a method for analyzing potential reliability problems early in the development cycle, where it is easier to overcome such issues, thereby enhancing reliability through design. FMEA is used to identify potential failure modes, determine their effect on the operation of the system, and identify actions to mitigate such failures. A crucial step is anticipating what might go wrong with a product. While anticipating every failure mode is not possible, the development team should formulate an extensive list of potential failure modes. Early and consistent use of FMEAs in the design process can help the engineer to design out failures and produce more reliable and safe products. FMEAs can also be used to capture historical information for use in future product improvement.
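The FMECA variant mentioned above typically ranks failure modes by a risk priority number (RPN), the product of severity, occurrence and detection ratings. The failure modes and ratings below are hypothetical, chosen only to illustrate the calculation.

```python
# FMECA-style risk priority ranking: RPN = severity * occurrence * detection,
# each rated 1..10 (higher detection rating = harder to detect before release).
failure_modes = [
    # (failure mode, severity, occurrence, detection)
    ("database deadlock on commit",     7, 4, 3),
    ("message queue overflow",          5, 6, 2),
    ("silent data corruption on write", 9, 2, 8),
]

# Rank failure modes by descending RPN; the top entries get mitigation first.
ranked = sorted(
    ((name, s * o * d) for name, s, o, d in failure_modes),
    key=lambda item: item[1],
    reverse=True,
)
for name, rpn in ranked:
    print(f"RPN {rpn:4d}  {name}")
```

Note how the ranking surfaces the corruption mode (rare but severe and hard to detect) ahead of the more frequent but benign queue overflow, which is exactly the prioritization effect FMECA aims for.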
o FTA (Fault Tree Analysis): A Fault Tree Analysis diagram is a logical diagram which illustrates the connection between an unwanted event and the causes of this event. The causes can include environmental factors, human error, strange combinations of “innocent” events, normal events and outright component failures.
The two main results are: 1) The fault tree diagram which shows the logical structure of failure effects. 2) The cut-sets, which show the sets of events which can cause the top event – system failure. If we can assign probability values or failure rates to each basic event, we can also get quantitative predictions for Mean Time To Failure (MTTF) and failure rate for the system.
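The quantitative side of FTA can be illustrated with a small sketch: given independent basic-event probabilities, AND gates multiply probabilities while OR gates combine them as 1 minus the product of the complements. The tree and the probability values below are invented for illustration.

```python
# Minimal fault-tree evaluation under the assumption of independent basic events.
#   AND gate: P = prod(p_i)        OR gate: P = 1 - prod(1 - p_i)
from math import prod

def evaluate(node, probs):
    """Return the probability of a fault-tree node.

    A node is either a basic-event name (str) or a tuple (gate, children)."""
    if isinstance(node, str):
        return probs[node]
    gate, children = node
    p = [evaluate(child, probs) for child in children]
    if gate == "AND":
        return prod(p)
    if gate == "OR":
        return 1 - prod(1 - x for x in p)
    raise ValueError(f"unknown gate: {gate}")

# Top event "system failure" occurs if the pump fails, OR both redundant
# sensors fail (the AND gate models the redundancy).
tree = ("OR", ["pump_fail", ("AND", ["sensor_a_fail", "sensor_b_fail"])])
probs = {"pump_fail": 0.01, "sensor_a_fail": 0.05, "sensor_b_fail": 0.05}
p_top = evaluate(tree, probs)
```

The cut-sets of this small tree are {pump_fail} and {sensor_a_fail, sensor_b_fail}; with failure rates instead of probabilities, the same structure yields the system failure rate and hence an MTTF estimate.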
o ETA (Event-Tree Analysis): An event tree is a graphical representation of a sequence of related events. Each branching point in the tree is a point in time where one of two or more possible consequences can occur. The event tree can be described with or without branching probabilities. In economic analyses it is customary to assign a benefit or cost to each possible alternative, or branch. An event tree can help our understanding and documentation of one or more sequences of events in a system or part of a system. Areas where event trees can be used are: 1) Study of error propagation through a complete system – people, operational procedures, hardware, and software. 2) Building usage scenarios to enhance HAZOP: “what could happen if…?”

o CCA (Cause-Consequence Analysis): Cause-consequence analysis (CCA) is a two-part system safety analysis technique that combines Fault Tree Analysis and Event Tree Analysis. Fault Tree Analysis considers the “causes” and Event Tree Analysis considers the “consequences”, so both deductive and inductive analysis are used. The purpose of CCA is to identify chains of events that can result in unwanted consequences. Given the probabilities of the various events in a CCA diagram, the probabilities of the various consequences can be calculated, thus establishing the risk level of the system. A CCA starts with a critical event and determines the causes of the event (using top-down or backward search) and the consequences it might create (using forward search). The cause-consequence diagram can show both temporal dependencies and causal relationships among events. The notation builds on the FTA and ETA notations, and extends these with timing, condition and decision alternatives. The result is a diagram (along with elaborated documentation) showing both the logical structure of the causes of a critical event and a graphical representation of the effects the critical event can have on the system. CCA enables probability assessments of success/failure outcomes at staged increments of system examination. The CCA method also helps create a link between the FTA and ETA methods. CCA shows the sequence of events explicitly, which makes CCA diagrams especially useful in studying start-up, shutdown and other sequential control issues. Other advantages are that multiple outcomes are analyzed from each critical event, and different levels of success/failure are distinguishable, as CCA may be used for quantitative assessment.
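The quantitative use of branching probabilities in ETA (and, by extension, on the consequence side of a CCA diagram) amounts to multiplying the probabilities along each path from the initiating event to a leaf. The fire-alarm scenario and its probabilities below are invented for illustration.

```python
# Event-tree sketch: multiply branch probabilities along each path to obtain
# the probability of each outcome of the initiating event.
def expand(event, branches, path=(), p=1.0):
    """Yield (outcome path, probability) for every leaf of the event tree."""
    if event not in branches:          # a leaf: no further branching
        yield path + (event,), p
        return
    for nxt, branch_p in branches[event]:
        yield from expand(nxt, branches, path + (event,), p * branch_p)

# Each event maps to its possible successors with branching probabilities.
branches = {
    "fire starts":  [("alarm sounds", 0.9), ("alarm fails", 0.1)],
    "alarm sounds": [("evacuation ok", 0.95), ("evacuation delayed", 0.05)],
    "alarm fails":  [("fire spreads", 1.0)],
}
outcomes = {" -> ".join(path): p for path, p in expand("fire starts", branches)}
```

Because the branches at each node are exhaustive and mutually exclusive, the leaf probabilities sum to 1; attaching a cost or benefit to each leaf gives the economic variant mentioned above.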
In addition to these techniques, we included the Safety Case method for use alongside the other safety-criticality analysis methods. Its purpose is to keep track of the requirements and information acquired when using these analysis methods. Usage of the Safety Case method is also presented in paper P1.
o Safety Case: The Safety Case method seeks to minimise safety risks and commercial risks by constructing a demonstrable safety case. Bishop and Bloomfield [Adelard98, Bishop98] define a safety case as: “A documented body of evidence that provides a convincing and valid argument that a system is adequately safe for a given application in a given environment”. The safety case method is a vehicle for managing safety claims, containing a reasoned argument that a system is or will be safe. It is manifested as a collection of data, metadata and logical arguments. The Safety Case documents will answer questions like “How will we argue that this system can be trusted/ is safe?” The Safety Case shows how safety requirements are decomposed and addressed, and will provide an appropriate answer to the above questions. The layered structure of the Safety Case allows lifetime evolution and helps to establish the safety requirements at different detail levels.
Table 2-2 shows a comparison of the safety-criticality analysis methods we have considered. The properties shown are relevant when choosing between such analysis techniques. The costs involved are described for each method by the properties “Formalization” and “Effort needed”. Another property is the required available system information, which can range from a sketchy system description to a full system description including all technical documentation and code. The process stage is also important, as it tells us where in the development cycle the technique is best suited.
2.8 Empirical Software Engineering

Empirical software engineering is not software development per se, but a branch of software engineering research and practice which emphasizes empirical studies to investigate processes, methods, techniques and technology.