«Software Fault Reporting Processes in Business-Critical Systems Jon Arvid Børretzen Doctoral Thesis Submitted for the partial fulfilment of the ...»
Abstract. Improving software processes relies on the ability to analyze previous projects and determine which parts of the process should be targeted for improvement. All software projects encounter software faults during development and must put considerable effort into locating and fixing them. Handling faults produces a large amount of information in the form of fault reports. This paper reports a study of fault reports from industrial projects, in which we seek a better understanding of the faults reported during development and how they may affect the quality of the system. We investigated the fault profiles of five business-critical industrial projects by data mining, to explore whether there were significant trends in the way faults appear in these systems. We wanted to see whether any fault types dominate, and whether some fault types were reported as more severe than others.
Our findings show that one specific fault type is generally dominant across reports from all projects, and that some fault types are rated as more severe than others. From this, we suggest that the organization studied should increase its effort in the design phase in order to improve software quality.
1. Introduction
Improving software quality is a goal most software development organizations aim for.
This is not a trivial task, and different stakeholders will have different views on what software quality is. In addition, the character of the actual software will influence which quality attributes are considered the most important for that software. Many organizations could use analysis of routinely collected data to improve their process and product quality. Fault reports are one possible source of such data, and research shows that fault analysis can be a good approach to software process improvement.
The Business-Critical Software (BUCS) project seeks to develop a set of techniques to improve support for analysis, development, operation, and maintenance of business-critical systems. Aside from safety-critical systems, such as air-traffic control and health care systems, there are other systems that we also expect to run correctly because of the possibly severe effects of failure, even if the consequences are mainly economic. These are what we call business-critical systems and software. In these systems, software quality is highly important, and the main goal for developers is to build systems that operate correctly. One important issue in developing such systems is to remove possible causes of failure, which may lead to incorrect operation of the system. In a previous study, we investigated fault reports from four business-critical industrial software projects. Building on the results of that study, the study presented here investigates fault reports from five further industrial software projects. It examines the fault profiles along two main dimensions: fault type and fault severity.
The rest of this paper is organized as follows. Section 2 gives our motivation and related work. Section 3 describes the research design and research questions. Section 4 presents the results found, and Section 5 presents analysis and discussion of the results. The conclusion and further work are presented in Section 6.
2. Motivation and Related Work
The motivation for the work described in this paper is to further the knowledge gained from a previous study on fault reports from industrial projects. We also wanted to present empirical data on the results of fault classification and analysis, and show how this can be of use in a software process improvement setting.
When considering quality improvement in terms of fault analysis, there are several related topics to consider. Several issues in fault reporting are discussed by Mohagheghi et al. General terminology in fault reporting is one problem mentioned; the validity of using fault reports as a means of evaluating software quality is another. One of their conclusions is that “There should be a trade-off between the cost of repairing a fault and its presumed customer value. The number of faults and their severity for users may also be used as a quality indicator for purchased or reused software.”
Software quality is a notion that encompasses a great number of attributes. The ISO 9126 standard defines many of these attributes as sub-attributes of the term “quality of use”. For business-critical systems, the critical quality attribute is often experienced as the dependability of the system. Laprie states that “a computer system’s dependability is the quality of the delivered service such that reliance can justifiably be placed on this service.” According to Littlewood and Strigini, dependability is a software quality attribute that encompasses several other attributes, the most important being reliability, availability, safety and security. The term dependability can also be regarded subjectively as the “amount of trust one has in the system”.
Much effort is being put into reducing the probability of software failures, but this has not removed the need for post-release fault-fixing. Faults in the software are detrimental to the software’s quality, to a greater or lesser extent depending on the nature and severity of the fault. Therefore, one way to improve the quality of developed software is to reduce the number of faults introduced into the system during development. Faults are potential flaws in a software system that may later be activated to produce an error. An error is the execution of a “passive fault”, leading to a failure. A failure results in observable and erroneous external behaviour, system state or data state. The known remedies for errors and failures aim to limit the consequences of an active error or failure in order to resume service. This may take the form of duplication, repair, containment, etc. These kinds of remedies do work, but as Leveson states, studies have shown that this kind of downstream (late) protection is more expensive than preventing the faults from being introduced into the code.
Faults that have been introduced into the system during implementation can be discovered by inspection before the system is run, by testing during development, or when the application is run on site. The discovered faults are then reported in a fault reporting system, to be fixed later. Faults are also commonly known as defects or bugs, while a similar but more extensive concept is anomalies, which is used in the IEEE 1044 standard.
Orthogonal Defect Classification (ODC) is one way of studying defects in software systems, and is mainly suited to design and coding defects. ODC and its use in empirical studies are described in [10, 11, 12, 13, 14]. ODC is a scheme to capture the semantics of each software fault quickly.
Several papers have discussed whether faults can be tied to reliability in a more or less direct cause-effect relationship. Some papers, such as [12, 14, 15], indicate that this kind of connection is valid, while others are more critical of this approach.
Even if many of the studies point towards a connection being present between faults and reliability, they also emphasize that it is not easy to tie faults to reliability directly.
Thus, it is not given that a system with a low number of faults necessarily has a higher reliability than a system with a high number of faults. Still, reducing the number of faults in a system will make the system less prone to failure, so if the faults found can be removed without introducing new ones, there is a good case that the reliability of the system increases. This is the premise of “reliability-growth models”, which are discussed by Hamlet and by Paul et al.
Avizienis et al. state that fault prevention and fault tolerance aim to provide the ability to deliver a service that can be trusted, while fault removal and fault forecasting aim to reach confidence in that ability by justifying that the functional, dependability and security specifications are adequate and that the system is likely to meet them. Hence, by working towards techniques that can prevent faults and reduce the number and severity of faults in a system, the quality of the system can be improved in the area of dependability.
An example of results from a related study is the work by Vinter and Lauesen. That paper used a different fault taxonomy, proposed by Beizer, and reports that in the studied project close to a quarter of the faults found were of the type “Requirements and Features”.
3. Research design
This paper builds on a previous study where we investigated the fault profiles of industrial projects, and it expands on those findings using a similar research design. We want to explore the fault profiles of the studied projects with respect to fault types and fault severity. In order to study the faults, we categorized them into fault types as described in Section 3.2.
3.1 Research questions
Initially, we want to find which types of faults are most frequent, and how the faults are distributed across the different fault types:
RQ1: Which types of faults are most common for the studied projects?
When we know which types of faults dominate and where these faults appear in the systems, we can choose to concentrate on the most serious ones in order to identify the most important issues to target in improvement work (note that the severity of the faults is judged by the developers who report the faults):
RQ2: Which fault types are rated as the most severe faults?
We also want to compare the results from this study with the results we found in the previous study on this topic:
RQ3: How do the results of this study compare with our previous fault report study?
3.2 Fault categorization
There are several taxonomies for fault types; two examples are the ones used in the IEEE 1044 standard and in a variant of the Orthogonal Defect Classification (ODC) scheme by El Emam and Wieczorek. The fault reports we received had already been categorized by the developers and testers, but with a very broad categorization scheme, which mainly placed each fault into one of the categories “fault caused by others”, “change request”, “test environment fault”, “analysis/design fault”, “test fault” and “coding fault”. The fault types used in this study are shown in Table 1. They are very similar to those of the ODC scheme, with the addition of a GUI fault type. This classification scheme was chosen because it is quite simple to use but still distinguishes the fault types well. Further descriptions of the fault types used can be found in Chillarege et al.
Table 1. Fault types used in this study
The categorization of faults in this investigation has been performed by the authors of this paper, based on the fault reports’ textual description and partial categorization.
In addition, grading the faults’ consequences upon the system and system environment enables fault severities to be defined. All severity grading was done by the developers and testers performing the fault reporting in the projects. In the projects under study, the faults have been graded on a severity scale from 1 to 5, where 1 is “critical” and 5 is “change request”. The different severity classifications are shown in Table 2.
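To make the two dimensions concrete, a single categorized fault report can be represented as a small record. The following is a minimal sketch, not the tooling used in the study: the names `FaultType` and `FaultReport` are hypothetical, the enumeration lists only the fault types named in the text of this paper (Table 1 may contain more), and only the two severity labels stated above (1 and 5) are encoded.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    """Fault types named in the text of this paper (assumed to be a partial list)."""
    FUNCTION = "function"
    GUI = "GUI"
    ASSIGNMENT = "assignment"
    DATA = "data"
    DOCUMENTATION = "documentation"
    RELATIONSHIP = "relationship"
    TIMING_SERIALIZATION = "timing/serialization"
    INTERFACE = "interface"


@dataclass(frozen=True)
class FaultReport:
    """One categorized fault report (hypothetical record layout)."""
    project: str           # project label, e.g. "P1"
    fault_type: FaultType  # assigned by the researchers from the report text
    severity: int          # 1..5, assigned by the reporting developers/testers

    def __post_init__(self):
        # The severity scale in the studied projects runs from 1 to 5.
        if not 1 <= self.severity <= 5:
            raise ValueError(f"severity must be in 1..5, got {self.severity}")

    @property
    def is_critical(self) -> bool:
        # Severity 1 is labelled "critical".
        return self.severity == 1

    @property
    def is_change_request(self) -> bool:
        # Severity 5 is labelled "change request".
        return self.severity == 5
```

Keeping the fault type and the severity as separate fields mirrors the fact that they were assigned by different people: the fault type by the authors, the severity by the developers and testers.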
4. Results
4.1 RQ1 – Which types of faults are most frequent?
To answer RQ1, we look at the distribution of the fault type categories for the different projects. Table 4 shows the distribution of fault types across all projects studied, and Table 5 shows the distribution of faults for each project. A plot of Table 5 is shown in Figure 1.
We see that “function” and “GUI” faults are the most common fault types, with “assignment” faults also being quite frequent. Some fault types, such as “documentation”, “relationship”, “timing/serialization” and “interface” faults, are infrequent.
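A fault type distribution of this kind is a simple frequency count over the categorized reports. The sketch below shows one way to compute it, assuming each report has already been reduced to its fault type label; the function name `fault_type_distribution` is hypothetical and the sample labels are made up for illustration, not data from the studied projects.

```python
from collections import Counter


def fault_type_distribution(fault_types):
    """Return {fault_type: percentage of all faults}, most frequent type first."""
    counts = Counter(fault_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ftype: 100.0 * n / total for ftype, n in counts.most_common()}


# Made-up labels for illustration only; NOT data from the studied projects.
sample = ["function", "function", "GUI", "assignment",
          "function", "GUI", "data", "function"]

for ftype, pct in fault_type_distribution(sample).items():
    print(f"{ftype:12s} {pct:5.1f}%")
```

Applying the same function to the reports of one project at a time gives a per-project distribution, and applying it only to reports of a given severity gives the distribution within that severity level.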
If we focus only on the faults that are rated with “critical” severity (7.6% of all faults), the distribution is as shown in Figure 2. “Function” faults dominate not just the total distribution, but also the distribution of “critical” faults. A very similar distribution is seen for faults rated with “can not be circumvented” severity.
When looking at the distribution of faults, especially for the high severity faults, we see that “function” faults dominate the picture. We also see that for all faults, “GUI” faults
4.2 RQ2 – What types of faults are rated as most severe?
As for the severity of fault types, Figure 3 illustrates the distribution of severities for each fault type. The “relationship” fault type has the highest share of “critical” faults, and also the highest share when looking at both “critical” and “can not be circumvented” severity faults. The most numerous fault type, “function”, does not stand out as particularly severe compared with the others. The fault types rated as least severe are “GUI” and “data” faults.
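The per-type severity comparison can be expressed as a cross-tabulation of fault type against severity, normalized within each fault type. The following sketch assumes reports are available as (fault type, severity) pairs with severities 1 to 5; the function names are hypothetical and the sample pairs are made up for illustration, not data from the studied projects.

```python
from collections import Counter, defaultdict


def severity_shares(reports):
    """reports: iterable of (fault_type, severity) pairs, severity in 1..5.

    Returns {fault_type: {severity: share of that type's faults}}.
    """
    counts = defaultdict(Counter)
    for fault_type, severity in reports:
        counts[fault_type][severity] += 1
    shares = {}
    for fault_type, by_severity in counts.items():
        total = sum(by_severity.values())
        # Counter returns 0 for severities that never occurred for this type.
        shares[fault_type] = {s: by_severity[s] / total for s in range(1, 6)}
    return shares


def rank_by_critical_share(shares):
    """Fault types ordered by their share of severity-1 ("critical") faults."""
    return sorted(shares, key=lambda t: shares[t][1], reverse=True)


# Made-up pairs for illustration only; NOT data from the studied projects.
sample = [("relationship", 1), ("relationship", 3),
          ("function", 1), ("function", 3), ("function", 4), ("function", 3),
          ("GUI", 4), ("GUI", 5)]

shares = severity_shares(sample)
print(rank_by_critical_share(shares))
```

Normalizing within each fault type is what separates frequency from severity: a fault type can be the most numerous overall and still have a smaller share of “critical” faults than a rare type.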
4.3 RQ3 – How do the results compare with the previous study?
Previously, we conducted a similar study of fault reports from industrial projects. In that study, “function” faults were the dominant fault type, making up 33.3% to 61.3% of the reported faults in the four investigated projects. The percentage of “function” faults is lower for the five projects studied in this paper, but it is still the dominant fault type, making up 24.0% to 53.7% of the reported faults in P1 to P5, as shown in Table 5.
Among the faults rated with the highest severity, this study also shows that “function” faults are the most numerous, accounting for 35.8% of the “critical” faults, as shown in Figure 2. This is in line with the previous study, where “function” faults were also dominant among the most severe faults reported, with 45.3%.
5. Analysis and discussion