Software Fault Reporting Processes in Business-Critical Systems
Jon Arvid Børretzen
Doctoral Thesis, submitted for the partial fulfilment of the ...
− The urgency of correction from the maintenance engineer’s view is called Priority in , Urgency in  or Severity in IEEE Std. 1044-1993. It should be set during resolution.
Some problem reporting systems include one or the other, or do not distinguish between them at all. Thus, the severity field may be set by the reporter and later changed by the maintenance engineer. Here are some examples of how these fields are used:
− For reports in S1 and S4 there was only one field (S1 used “Consequence”, while S4 used “Priority”), and we do not know if the value has been changed from the first report until the fault has been fixed.
− S2 used the terms “Severity” and “Priority” in the reports.
− S3 used two terms: “Importance” and “Importance Customer”, but these were mostly judged to be the same.
In , it is recommended to use four fields: reporter criticality, maintenance criticality, reporter priority, and maintenance priority. We have not seen examples of such detailed classification. In addition to the problem of ambiguity in the definitions of severity or priority, there are other concerns:
− Ostrand et al. reported that severity ratings were highly subjective and sometimes inaccurate because of political considerations unrelated to the importance of the change to be made. Severity might be downplayed so that friends or colleagues in the development organization “looked better”, provided they agreed to fix the fault with the speed and effort normally reserved for highest-severity faults .
− Severity of defects may be downplayed to allow launching a release.
− Most defects are probably set to medium severity, which reduces the value of such classification. For example, 90% of problem reports in S1, 57% in S2, 72% in S3, 57% in S4, and 57% in release 2 of S5 (containing 1953 problem reports) were set to medium severity.
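A quick tabulation of the severity field makes such skew visible. The following minimal sketch (the records and field values are hypothetical, not taken from S1-S6) shows one way to check a report set for a dominant middle category:

```python
from collections import Counter

# Hypothetical problem-report records; only the severity field matters here.
reports = [
    {"id": 1, "severity": "medium"},
    {"id": 2, "severity": "high"},
    {"id": 3, "severity": "medium"},
    {"id": 4, "severity": "medium"},
    {"id": 5, "severity": "low"},
]

# Count how often each severity value occurs and report its share.
counts = Counter(r["severity"] for r in reports)
total = len(reports)
for severity, n in counts.most_common():
    print(f"{severity}: {n}/{total} ({n / total:.0%})")
```

If one value holds the majority of reports, the field discriminates poorly between faults, which is exactly the concern raised above.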
2. A second problem is related to release-based development. While most systems are developed incrementally or release-based, problem reporting systems and procedures may not be adapted to distinguish between releases of a product. For example, in S6 problem reports did not include a release number, only the date of reporting, so the study assumed that problems were related to the latest release. In S5, we experienced that the size of software components (used to measure defect density) was not collected systematically on the date of a release. Problem report fields had also changed between releases, making the data inconsistent.
3. The third problem is related to the granularity of data. The location of a problem, used to measure defect density or to count defects, may be given for large components or subsystems (as in S6), for fine-grained units (software modules or functional modules as in S4), or for both (as in S5). Too coarse data gives little information, while collecting fine-grained data requires more effort.
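The effect of granularity on defect density can be sketched with made-up numbers (module names, defect counts, and sizes below are hypothetical): a fault-prone module can disappear inside a subsystem-level average.

```python
from collections import defaultdict

# Hypothetical defect counts and sizes (in KLOC) per module.
defects = {"subsys_a/mod1": 12, "subsys_a/mod2": 3, "subsys_b/mod1": 5}
kloc = {"subsys_a/mod1": 2.0, "subsys_a/mod2": 8.0, "subsys_b/mod1": 10.0}

# Fine-grained density per module reveals that subsys_a/mod1 is fault-prone.
fine = {m: defects[m] / kloc[m] for m in defects}

# Coarse-grained density per subsystem averages that variation away.
agg_defects, agg_size = defaultdict(int), defaultdict(float)
for module in defects:
    subsystem = module.split("/")[0]
    agg_defects[subsystem] += defects[module]
    agg_size[subsystem] += kloc[module]
coarse = {s: agg_defects[s] / agg_size[s] for s in agg_defects}
```

Here the module-level view shows a density of 6.0 defects/KLOC for one module, while the subsystem-level view reports only 1.5 for its parent, illustrating why coarse data gives little information.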
4. Finally, data is recorded in different formats and problem reporting tools. The commercial problem reporting tools used in our industrial case studies often did not support data collection and analysis. In S1, data were given to researchers as hardcopies of the problem reports, which were scanned and converted to digital form. In S2, the output of the problem reporting system was an HTML document. In S3 and S4, data were given to researchers in Microsoft Excel spreadsheets, which provide some basic analysis facilities but no advanced ones. In S5, problem reports were stored in text files and were transferred to a SQL database by the researchers. In S6, data were transferred to Microsoft Excel spreadsheets for further analysis. Thus, researchers had to transform the data in most cases. In a large-scale empirical study to identify reuse success factors, data from 25 NASA software projects were inserted by researchers into a relational database for analysis . One plausible conclusion is that the collected data were rarely analyzed by the organizations themselves, beyond collecting simple statistics.
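The transformation step researchers repeatedly had to perform can be sketched as follows, assuming reports arrive as a tabular (CSV-like) export; the layout and field names are hypothetical and not taken from any of the studies S1-S6:

```python
import csv
import io
import sqlite3

# Hypothetical problem-report export; in practice this would be a file
# produced by the commercial reporting tool.
csv_export = io.StringIO(
    "report_id,release,component,severity\n"
    "101,2.0,ui,medium\n"
    "102,2.0,db,high\n"
    "103,2.1,db,medium\n"
)

# Load the export into an in-memory SQL database, as was done for S5.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report (report_id INTEGER, release TEXT, "
    "component TEXT, severity TEXT)"
)
conn.executemany(
    "INSERT INTO report VALUES (:report_id, :release, :component, :severity)",
    csv.DictReader(csv_export),
)

# Once in SQL, per-component or per-release counts become one-line queries.
per_component = dict(
    conn.execute("SELECT component, COUNT(*) FROM report GROUP BY component")
)
```

Moving the data into a queryable store once, rather than re-deriving statistics from spreadsheets or HTML, is what makes the later analysis steps repeatable.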
The main purpose for industry should always be to collect business-specific data and to avoid “information graveyards”. Unused data are costly and lead to poor data quality (low internal validity) and even animosity among developers. Improving tools and routines makes it possible to make sense of the collected data and to give feedback.
4.3 Validity Threats
The data in Table 3 show large variation across studies, but the amount of missing data is significant in some cases. Missing data is often related to problem reporting procedures that allow reporting or closing a problem without filling in all the fields.
We wonder whether problem reporting tools could be improved to force developers to enter sufficient information. In the meantime, researchers have to discuss the introduced bias and how missing data is handled, for example by mean substitution or by verifying that data are missing at random. One observation is that most cases discussed in this paper collected data at least on the product, the location of a fault or defect, severity (reporter, developer, or mixed), and the type of problem. These data may therefore form a minimum basis for comparing systems and releases, but with sufficient care.
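Mean substitution, mentioned above, can be illustrated for a numeric report field; the field (correction effort in hours) and the values below are hypothetical:

```python
# Hypothetical effort values from problem reports; None marks missing entries.
efforts = [4.0, None, 2.5, None, 6.0]

known = [e for e in efforts if e is not None]
missing_rate = (len(efforts) - len(known)) / len(efforts)
mean_effort = sum(known) / len(known)

# Replace each missing value with the mean of the observed values, and keep
# the missing rate so readers can judge the bias this substitution introduces.
imputed = [mean_effort if e is None else e for e in efforts]
print(f"missing: {missing_rate:.0%}, substituted value: {mean_effort:.2f}")
```

Reporting the missing rate alongside the substituted value is the point: substitution shrinks variance, so conclusions drawn from heavily imputed fields deserve the "sufficient care" called for above.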
3. Conclusion validity: Most studies referred to in this paper have applied statistical tests such as the t-test, the Mann-Whitney test, or ANOVA. In most cases, there is no experimental design and no random allocation of subjects to treatments. Often all available data are analyzed rather than samples. Preconditions of the tests, such as the assumptions of normality or equal variances, should be discussed as well.
Studies often chose a fixed significance level and did not discuss the effect size or the power of the tests (see ). The conclusions should therefore be evaluated with care.
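One concrete remedy is to report an effect size alongside the test statistic. The sketch below computes Cohen's d for two independent samples using the pooled standard deviation; the data are made up for illustration:

```python
import statistics

# Hypothetical measurements for two groups (e.g. defect counts per component
# under two development approaches).
group_a = [12, 15, 14, 10, 13, 16]
group_b = [9, 11, 10, 8, 12, 10]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
n_a, n_b = len(group_a), len(group_b)

# Pooled standard deviation for two independent samples.
pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
cohens_d = (mean_a - mean_b) / pooled_sd
```

Reporting d (and, where possible, the power of the test) lets readers judge whether a statistically significant difference is also practically meaningful, which a p-value at a fixed level alone does not show.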
4. External validity or generalization: There are arguments for generalizing on the basis of the cases, e.g., to products in the same company if the case is a typical one. But “formal” generalization, even to future releases of the same system, needs careful discussion . Another type of generalization is to theories or models , which is seldom done. The results of a study may be considered relevant, which is different from being generalizable.
4.4 Publishing the Results
If a study manages to overcome the above barriers in metrics definition, data collection and analysis, there is still the barrier of publishing the results in major conferences or
journals. We have faced the following:
1. The referees will justifiably ask for a discussion of the terminology and the relation between the terms used in the study and those in standards or other studies. We believe that this is not an easy task, and we hope that this paper can help clarify the issue.
2. Collecting evidence in the field requires comparing results across studies, domains, and development technologies. We tried to collect such evidence for studies on software reuse and immediately faced the challenge of inconsistent terminology and ambiguous definitions. More effort should be put into meta-analysis or review-type studies to collect evidence and integrate the results of different studies.
3. Companies may resist publishing results or making data available to other researchers.
5. DISCUSSION AND CONCLUSION
We have described our experience with using problem reports for quality assessment in various industrial studies. While industrial case studies ensure a higher degree of relevance, they offer little control over the collected data. In most cases, researchers have to mine industrial data, transform or recode it, and cope with missing or inconsistent data.
Relevant experiments can give more rigor (such as in ), but the scale is small. We
summarize the contributions of this paper in answering the following questions:
1. What is the meaning of a defect versus other terms such as error, fault or failure?
We identified three questions to answer in Section 2: what (whether the term applies to the manifestation of a problem or to its cause), where (whether problems are related only to software or also to the environment supporting it, and whether they concern executable software or all types of artifacts), and when (whether the problem reporting system records problems detected in all or only some life cycle phases).
We gave examples of how standards and schemes use different terms and are intended for different quality views (Q1 to Q5).
2. How may data from problem reports be used to evaluate quality from different views? We used the model described in  and extended in Section 3. Measures from problem or defect data are among the few measures used in all quality views.
3. How should data from problem reports be collected and analyzed? What are the validity concerns when using such reports for evaluating quality? We discussed these questions with examples in Section 4. The examples show the challenges that researchers face in different phases of research.
One possible remedy to ensure consistent and uniform problem reporting is to use a common tool, cf. the OSS tools Bugzilla and Trac (which store data in SQL databases with search facilities). However, companies will need local relevance (tailoring) of the collected data and will require that such a tool can interplay with existing processes and tools, whether for development or project management, i.e., interoperability. Another problem is related to stability and logistics. Products, processes, and companies are volatile entities, so longitudinal studies may be very difficult to perform. And given the popularity of sub-contracting and outsourcing, it is difficult to impose a standard measurement regime (or, in general, to reuse common artifacts) across subcontractors, possibly in different countries. Nevertheless, we are evaluating adapting an OSS tool, defining a common defect classification scheme for our research purposes, and collecting the results of several studies.
6. REFERENCES
 Basili, V.R., Caldiera, G. and Rombach, H.D. Goal Question Metrics Paradigm. In Encyclopedia of Software Engineering, Wiley, I (1994), 469-476.
 Basili, V.R., Briand, L.C. and Melo, W.L. How software reuse influences productivity in object-oriented systems. Communications of the ACM, 39, 10 (Oct. 1996), 104-116.
 The Bugzilla project: http://www.bugzilla.org/
 Børretzen, J.A. and Conradi, R. Results and experiences from an empirical study of fault reports in industrial projects. Accepted for publication in Proceedings of the 7th International Conference on Product Focused Software Process Improvement (PROFES'2006), 12-14 June, 2006, Amsterdam, Netherlands, 6 p.
 Chillarege, R. and Prasad, K.R. Test and development process retrospective: a case study using ODC triggers. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’02), 2002, 669-678.
 Dybå, T., Kampenes, V. and Sjøberg, D.I.K. A systematic review of statistical power in software engineering experiments. Accepted for publication in Journal of Information and Software Technology.
 Fenton, N.E. and Pfleeger, S.L. Software Metrics: A Rigorous & Practical Approach. International Thomson Computer Press, 1996.
 Florac, W. Software quality measurement: a framework for counting problems and defects. Software Engineering Institute, Technical Report CMU/SEI-92-TR-22, 1992.
 Freimut, B. Developing and using defect classification schemes. IESE-Report No. 072.01/E, Version 1.0, Fraunhofer IESE, Sept. 2001.
 Glass, R.L. Predicting future maintenance cost, and how we’re doing it wrong. IEEE Software, 19, 6 (Nov. 2002), 112, 111.
 Graves, T.L., Karr, A.F., Marron, J.S. and Harvey, S. Predicting fault incidence using software change history. IEEE Trans. Software Eng., 26, 7 (July 2000), 653-661.
 Haug, M.T. and Steen, T.C. An empirical study of software quality and evolution in the context of software reuse. Student project report, Department of Computer and Information Science, NTNU, 2005.
 IEEE standards on http://standards.ieee.org
 Kajko-Mattsson, M. Common concept apparatus within corrective software maintenance. In Proceedings of the 15th IEEE International Conference on Software Maintenance (ICSM'99), IEEE Press, 1999, 287-296.
 Kitchenham, B. and Pfleeger, S.L. Software quality: the elusive target. IEEE Software, 13, 1 (Jan. 1996), 12-21.
 Lee, A.S. and Baskerville, R.L. Generalizing generalizability in information systems research. Information Systems Research, 14, 3 (2003), 221-243.
 Mendonça, M.G. and Basili, V.R. Validation of an approach for improving existing measurement frameworks. IEEE Trans. Software Eng., 26, 6 (June 2000), 484-499.
 Mohagheghi, P., Conradi, R., Killi, O.M. and Schwarz, H. An empirical study of software reuse vs. defect-density and stability. In Proceedings of the 26th International Conference on Software Engineering (ICSE’04), IEEE Press, 2004, 282-292.
 Mohagheghi, P. and Conradi, R. Exploring industrial data repositories: where software development approaches meet. In Proceedings of the 8th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE’04), 2004, 61-77.
 Ostrand, T.J., Weyuker, E.J. and Bell, R.M. Where the bugs are. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA’04), ACM SIGSOFT Software Engineering Notes, 29, 4 (2004), 86–96.
 Schneidewind, N.F. Methodology for validating software metrics. IEEE Trans. Software Eng., 18, 5 (May 1992), 410-422.
 Selby, R.W. Enabling reuse-based software development of large-scale systems. IEEE Trans. Software Eng., 31, 6 (June 2005), 495-510.
 The Trac project: http://projects.edgewall.com/trac/
 UKSMA - United Kingdom Software Metrics Association: http://www.uksma.co.uk/