Software Fault Reporting Processes in Business-Critical Systems
Jon Arvid Børretzen
Doctoral Thesis
Submitted for the partial fulfilment of the ...
The effort in software engineering technology was stepped up due to the “software crisis” (a term coined in 1968), which identified many problems of software development [Glass94]. Many software projects ran over budget and schedule. Some projects caused property damage, and a few actually caused loss of life. The software crisis was originally defined in terms of productivity, but evolved to emphasize quality. The most common result of failed software development projects is schedule and budget overruns, but more serious consequences may also follow from poorly executed software projects.

Cost and Budget Overruns: A survey conducted at the Simula Research Laboratory in 2003 showed that 37% of the investigated projects used more effort than estimated. The average effort overrun was 41%, with 67% in projects with a public client, and 21% in projects with a private client [Moløkken04].
Property Damage: Software defects can cause property damage. Poor software security allows hackers to steal identities, and defective control systems can damage the physical systems the software controls. The result is lost time, money, and a damaged reputation. The expensive European Ariane 5 rocket exploded on its maiden flight in 1996, because its software operated under flight conditions different from those it was designed for [Kropp98].
Life and Death: Defects in software can be lethal. Some software systems used in radiotherapy machines failed so gravely that they administered lethal doses of radiation to patients [Leveson95].
The use of the term “software crisis” has been slowly fading out, perhaps because the software engineering community has come to the understanding that it is unrealistic and unproductive to remain in crisis mode for this many years. Software engineers are accepting that the problems of software engineering are truly difficult, and that only hard work over a long period of time can solve them. Processes and methods have become major parts of software engineering, e.g. object-orientation (OO) and the Rational Unified Process (RUP). Studies have, however, shown that many practitioners resist formalized processes, which often treat them impersonally, like machines rather than creative people [Thomas96]. The profession of software engineering is important, and has made big improvements since 1968, even though it is not perfect. Software engineering is a relatively young field, and practitioners and researchers continually work to improve its technologies and practices, in order to improve the final products and to better comply with the needs of users and customers.
Peter G. Neumann has done much work on this subject, and edits a contemporary list of software problems and disasters on his website http://catless.ncl.ac.uk/Risks/ [Neumann07].
2.3 Software Quality

The word quality can have several meanings and definitions, even though most of these definitions try to communicate practically the same idea. Often, the context in which quality is to be judged decides which definition will be used. The context could be user-oriented, product-oriented, production-oriented or even emotionally oriented.
ISO defines quality as “The totality of features and characteristics of a product/service that bears upon its ability to satisfy stated or implied needs” [ISO 8402]. Another ISO definition is “Quality: ability of a set of inherent characteristics of a product, system or process to fulfil requirements of customers and other interested parties” [ISO 9001].
Aune presents the following simplified definitions from the ISO 8402 standard [Aune00]:
1. Quality: Conformity with requirements (or needs, expectations, specifications)
2. Quality: The satisfaction of the customer
3. Quality: Suitability for use (at a given time)

Software quality in terms of reliability is often related to faults and failures, e.g. the number of faults found, or the failure rate over a period of time during use. In addition, as the aforementioned definitions imply, there are other quality factors that are important, e.g. the software’s ability to be used in an effective way (i.e. its usability).
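As a small illustration of such reliability measures, the sketch below computes a failure rate and mean time between failures from a log of failure timestamps. The timestamps and the observation window are invented for illustration; they are not data from the studies in this thesis.

```python
from datetime import datetime

# Hypothetical failure timestamps observed during one month of operation.
failure_times = [
    datetime(2006, 3, 2, 14, 5),
    datetime(2006, 3, 9, 10, 30),
    datetime(2006, 3, 21, 8, 45),
]

# Observation window for the measurement period (31 days).
start = datetime(2006, 3, 1)
end = datetime(2006, 4, 1)

hours = (end - start).total_seconds() / 3600.0
failure_rate = len(failure_times) / hours   # failures per hour of operation
mtbf = hours / len(failure_times)           # mean time between failures

print(f"failures observed: {len(failure_times)}")
print(f"failure rate: {failure_rate:.5f} per hour")
print(f"MTBF: {mtbf:.1f} hours")            # MTBF: 248.0 hours
```

Such numbers only describe observed behaviour under a particular usage profile; as discussed later in this chapter, a low observed failure rate does not prove that few faults remain.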
There is a multitude of concepts that together can be used to define quality, where the importance of a given factor or characteristic depends on the software context.
Reliability, Usability, Safety, Security, Availability and Performance are common examples. The glossary in Appendix A describes some of the relevant quality attributes.
2.3.1 Software Quality Practices
Quality Assurance (QA)

QA is the planned and systematic effort needed to gain sufficient confidence that a product or a service will satisfy stated quality requirements (e.g. a given degree of safety/reliability). Alternatively, QA is control of product and process throughout software development, to increase the probability that the requirements specifications are fulfilled. Software QA involves the entire software development process: monitoring and improving the process, making sure that any agreed-upon standards and procedures are followed, and ensuring that problems are found and dealt with. QA work is oriented towards problem prevention. Solving problems is a high-visibility process; preventing problems is low-visibility.
Among the duties of a QA team are certification and standardization work, as well as internal inspections and reviews. Other relevant QA tasks are inspections, testing, verification and validation, some of which are presented further in section 2.5.4.
Software Process Improvement (SPI)

Software Process Improvement is basically systematic improvement of the work processes used in a software-producing organization, based on organizational goals and backed by empirical studies and results. Capability Maturity Model Integration (CMMI) and ISO 9000 are examples of ways to assess and certify software processes. Statistical Process Control (SPC) and the Goal/Question/Metric (GQM) paradigm are examples of methods used to implement Software Process Improvement [Dybå00], but these require a certain level of stability in an organization to be applicable.
To be able to measure improvement, we have to introduce measurement into software development processes. SPI initiatives are generally based on measurement of processes, with the results and information fed back into the process under study.
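As one example of structuring such measurement, the GQM paradigm refines a goal into questions and then into metrics that answer them. The sketch below shows a minimal GQM tree; the goal, questions and metrics are invented for illustration and are not taken from any study in this thesis.

```python
# A minimal GQM (Goal/Question/Metric) tree, with illustrative content.
gqm = {
    "goal": "Reduce the number of faults introduced during development",
    "questions": [
        {
            "question": "Which development phases introduce the most faults?",
            "metrics": ["faults per phase", "faults per KLOC per phase"],
        },
        {
            "question": "How severe are the reported faults?",
            "metrics": ["fault count per severity class"],
        },
    ],
}

# Read bottom-up: metrics answer questions, and answered questions
# indicate progress toward the measurement goal.
for q in gqm["questions"]:
    print(q["question"], "->", ", ".join(q["metrics"]))
```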
The work in this thesis is directed towards measurement of faults in software, and how this information may be used to improve the software process and product.
2.4 Anomalies: Faults, errors, failures and hazards
Improving software quality is a goal of most software development organizations. This is not a trivial task, and different stakeholders will have different views on what software quality is. In addition, the character of the actual software will influence which quality attributes are considered most important for that software. For many organizations, analyzing routinely collected data could be used to improve their process and product quality. Fault reports are one possible source of such data, and research shows that fault analysis can be a viable approach to certain parts of software process improvement [Grady92]. One important issue in developing business-critical software is to remove possible causes of failure, which may lead to incorrect operation of the system. In our studies we investigate fault reports from business-critical industrial software projects.
Software quality encompasses a great number of properties or attributes. The ISO 9126 standard defines many of these attributes as sub-attributes of the term “quality of use” [ISO91]. When speaking about business-critical systems, the critical quality attribute is often experienced as the dependability of the system. In [Laprie95], Laprie states that “a computer system’s dependability is the quality of the delivered service such that reliance can justifiably be placed on this service.” According to [Avizienis04] and [Littlewood00], dependability is a software quality attribute that encompasses several other attributes, especially reliability, availability, safety, integrity and maintainability.
The term dependability can also be regarded subjectively as the “amount of trust one has in the system”. Quality-of-Service (QoS) is dependability plus performance, usability and certain provision aspects [Emstad03].
Much effort has been put into reducing the probability of software failures, but this has not removed the need for post-release fault-fixing. Faults in the software are detrimental to the software’s quality, to a greater or lesser extent dependent on the nature and severity of the fault. Therefore, one way to improve the quality of developed software is to reduce the number of faults introduced into the system during initial development.
In Laprie’s initial dependability definition, the attribute security was present, while the attributes integrity and maintainability were not [Laprie95].
Faults are potential flaws in a software system (i.e. deviations from explicitly stated requirements) that may later be activated to produce an error (an incorrect internal dynamic state). An error is the execution of a “passive fault”, and may lead to a failure (an incorrect external dynamic state). This relationship is illustrated in Figure 2-1. A failure results in observable and incorrect external behaviour and system state. The remedies for errors and failures aim to limit the consequences of an active error or failure, in order to resume service. This may take the form of duplication, repair, containment etc. These kinds of remedies do work, but studies have shown that this kind of downstream (late) protection is more expensive than preventing the faults from being introduced into the code [Leveson95].
Figure 2-1 Relationship between faults, errors, failures and reliability

Faults that have been unintentionally introduced into the system during some life-cycle phase can be discovered either by formal proof or manual inspection before the system is run, by testing during development, or when the application is run on site. The discovered faults are then recorded in some fault reporting system, as candidates for later correction. Software may very well have faults that never lead to failure, since they may never be executed, given the actual context and usage profile. Many such faults will remain in the system, unknown to the developers and users. That is, a system with few discovered faults is not necessarily the same as a system with few faults.
Indeed, many reported faults may be deemed too “exotic” or irrelevant to correct.
Conversely, a system with many reported faults may be a very reliable system, since most relevant faults may have been eliminated. Faults are also commonly known as defects or bugs, while a more extensive concept is anomaly, as used in the IEEE 1044 standard [IEEE 1044].
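The distinction between a latent fault and an observed failure can be made concrete with a small sketch. The function and the inputs are invented for illustration: the fault below, an incorrect boundary check, only produces a failure when a particular input is executed, so a usage profile that never exercises that input never reveals it.

```python
def accept_percentage(value):
    # Fault: the check should be `value > 100`, so the legal input 100
    # is wrongly rejected. The fault stays passive until an execution
    # actually reaches it with value == 100.
    if value >= 100:            # incorrect versus the stated requirement
        raise ValueError("percentage out of range")
    return value

# Usage profile A never exercises the faulty boundary: no failure is
# observed, so the fault remains latent and the system appears reliable.
observed_failures = 0
for v in (0, 25, 50, 99):
    accept_percentage(v)

# Usage profile B activates the fault: the error propagates to a failure.
try:
    accept_percentage(100)
except ValueError:
    observed_failures += 1

print("failures observed:", observed_failures)   # failures observed: 1
```

The same code thus yields zero or one observed failure depending purely on the usage profile, which is why counting discovered faults cannot by itself establish that few faults remain.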
The relationship between faults, errors and failures concerns the reliability dimension. If we look at the safety dimension, we have a relationship between hazards and accidents.
A hazard is a state or set of conditions of a system or an object that, together with other conditions in the environment of the system or object, may lead to an accident (the safety dimension) [Leveson95]. Leveson defines an accident as “an undesired and unplanned (but not necessarily unexpected) event that results in at least a specified level of loss.” The connection between hazards and safety is given by Leveson’s definition of safety: “Safety is freedom from accidents or losses”. Figure 2-2 illustrates this relationship.
Figure 2-2 Relationship between hazards, accidents and safety

To reduce the chances of critical faults existing in a software system, the latter should be analyzed in the context of its environment and operation to identify possible hazardous events [Leveson95]. Hazard analysis techniques like Failure Mode and Effect Analysis (FMEA) and Hazard and Operability Study (Hazop) can help us reduce the product risk stemming from such accidents. Hazards encompass a greater scope than faults, because a system can be prone to many hazards even if it has no faults. Hazards are related to the system’s environment, not just to the software itself. Therefore, they may be present even though the system fulfils the requirements specifications completely, i.e. has no faults.
The solid lines in Figure 2-3 show the common view of how faults are related to reliability and hazards to safety. In parts of the thesis we also suggest that faults may influence safety and hazards may influence reliability, as shown by the dotted lines. Literature searches show that little work has been done in this specific area, but the fact that faults and hazards share some characteristics makes connections between faults and safety, and between hazards and reliability, plausible as well, at least from a pragmatic viewpoint.
Avizienis et al. emphasize that fault prevention and fault tolerance aim to provide the ability to deliver a service that can be trusted, while fault removal and fault forecasting aim to reach confidence in that ability, by justifying that the functional, dependability and security specifications are adequate, and that the system is likely to meet them [Avizienis04]. Hence, by working towards techniques that can prevent faults and reduce the number and severity of faults in a system, the quality of the system can be improved in the area of reliability (and thus dependability).
A usual course of events leading to a fault report is that someone observes a failure through testing or operation, whereupon a report is logged. This report could initially be classed as a failure report, as it describes what happened when the system failed. As developers examine the report, they will eliminate reported “problems” that were not real failures (often caused by wrong user commands) or that are duplicates of previously reported ones. Primarily, they work to establish what caused the failure, i.e. the original fault. When they identify the fault, they can choose to repair it and report what the fault was and how it was repaired. The failure report has thus become a fault report.
When looking at a large collection of fault/failure reports in a system in testing or operation, some faults have been repaired, while others have not (and may never be).
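The course of events described above can be sketched as a simple report life-cycle. The field names and states below are illustrative only; they are not taken from any specific fault reporting system studied in this thesis.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Status(Enum):
    REPORTED = auto()      # a failure was observed in testing or operation
    REJECTED = auto()      # not a real failure (e.g. user error) or a duplicate
    FAULT_FOUND = auto()   # the underlying fault has been identified
    REPAIRED = auto()      # the fault has been corrected
    DEFERRED = auto()      # fault identified but left unrepaired

@dataclass
class Report:
    description: str                    # what happened when the system failed
    status: Status = Status.REPORTED
    fault: Optional[str] = None         # filled in once the cause is found

    def identify_fault(self, fault: str) -> None:
        # The failure report becomes a fault report at this point.
        self.fault = fault
        self.status = Status.FAULT_FOUND

    def repair(self) -> None:
        assert self.fault is not None, "cannot repair before the fault is known"
        self.status = Status.REPAIRED

# A failure report turns into a fault report once its cause is identified.
r = Report("system crashed when saving an empty file")
r.identify_fault("missing null check in the save routine")
r.repair()
print(r.status.name)   # REPAIRED
```

In a large report collection, records left in the FAULT_FOUND or DEFERRED states correspond to the unrepaired faults mentioned above, which is one reason fault report analysis must distinguish reported from repaired faults.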