Software Fault Reporting Processes in Business-Critical Systems
Jon Arvid Børretzen
Doctoral Thesis
Despite many initially positive responses, we ended up being able to use fault report data from only four of them. There were two serious barriers to setting up cooperation with commercial organizations. Firstly, such organizations were often unwilling to disclose information about faults and failures in their systems, despite promises of anonymization. Secondly, many of the organizations decided that they could not spare the effort to facilitate our data collection because of their own deadlines. In addition, a few organizations chose to end their cooperation with us before the data had been analyzed, because of resource issues. Finally, there was the issue of lack of communication: in one instance we were ready to collect data for analysis when it turned out that all but one fault report had been deleted from their fault management system.
When performing the second fault report study, we were in contact with an organization that was already involved in the EVISOFT project as a participating partner, which made establishing contact and a research agreement much simpler.
However, a common issue through all our industrial cooperation was that since we were external researchers who were just collecting and analyzing existing data, we were not part of a planned sequence of events for the organization, and therefore were not prioritized when times were busy.
6 Conclusions and future work
This thesis presents the results from several empirical studies investigating the management of fault reports from a business-critical software perspective. This is augmented by work concerning business-critical software in general. We have combined literature studies, quantitative studies of historical data sources, qualitative studies through interviews of industry representatives, and a case study using both qualitative and quantitative methods. By combining different empirical strategies in a mixed-method research design, we could triangulate results and answer questions that had not been answered previously.
This work analyzed historical fault data that the source organizations had not analyzed in such a manner and to this extent. The results were backed up with interviews and feedback from the involved organizations to improve the validity of the results.
6.1 Conclusions

6.1.1 Fault reporting as a tool for process improvement

Our findings show that there is much to gain from using fault report data to support process improvement through the reduction of faults. Our analyses showed that a large number of faults had their origin in early development phases, something some of the organizations had suspected but had not been able (or willing) to quantify.
We also uncovered a lack of consistency in fault reporting. Fault reports in an organization often did not follow a strict standard, which could make it difficult to use the data analytically. Another finding is that many software organizations possess data resources concerning their own products and processes that they do not fully exploit. Through better recording of available information and simple analysis, many organizations could focus their process improvement initiatives more effectively.
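The kind of simple analysis meant here can be sketched in a few lines. The snippet below tallies fault types, recorded origin phases, and high-severity fault types from a set of fault reports; the records and field names ("type", "severity", "phase") are illustrative assumptions, not the schema of any organization studied in this thesis, where real data would come from an issue-tracker export.

```python
from collections import Counter

# Hypothetical fault records; in practice these would be parsed from an
# organization's fault management system. Field names are illustrative.
fault_reports = [
    {"type": "function", "severity": "high", "phase": "design"},
    {"type": "coding", "severity": "low", "phase": "implementation"},
    {"type": "function", "severity": "high", "phase": "specification"},
    {"type": "interface", "severity": "medium", "phase": "design"},
    {"type": "coding", "severity": "medium", "phase": "implementation"},
]

# Tally fault types, the phases recorded as fault origins, and which
# fault types dominate among the high-severity reports.
type_counts = Counter(r["type"] for r in fault_reports)
origin_counts = Counter(r["phase"] for r in fault_reports)
high_severity = Counter(r["type"] for r in fault_reports
                        if r["severity"] == "high")

print("Most frequent fault types:", type_counts.most_common())
print("Origin phases:", origin_counts.most_common())
print("High-severity fault types:", high_severity.most_common())
```

Even such a minimal tabulation points an improvement initiative somewhere concrete: if high-severity faults cluster in one type, or origins cluster in early phases, that is where review and verification effort should be directed.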
In addition, our work has included literature studies of fault categorization schemes. We have described how fault categorization and subsequent fault report analysis could identify areas of the development process to improve.
6.1.2 Empirical findings

During our fault report studies of several industrial projects, we have presented results on fault type frequency and severity that seem valid and general for larger business-critical applications. Some fault types have been shown to be considerably more frequent than others, and we have identified fault types that are likely to be more severe than others.
Drawing on the experience of others, we have concluded that many occurrences of the most frequently reported fault types have their origins in early phases such as system specification and design.
6.1.3 Software safety and reliability from a fault perspective

This thesis’ overall contribution is to show how a focus on fault management and reporting in the software development process may pinpoint areas of improvement in terms of software safety and reliability. We have also proposed how to utilize techniques taken from safety analysis in software development to elicit and record possible faults in the software. Our conclusion is that such techniques should be used early in development, both because suitable techniques like PHA work well in early process phases, and because identifying and correcting faults early is more efficient than correcting them in later phases.
6.2 Future Work
This work has covered several aspects of fault management and the use of hazard analysis techniques to improve the process of developing business-critical software.
Still, we see the need for more work in these areas, and the following sections propose possible directions for future work.
6.2.1 Following fault reporting throughout the development process
The software projects studied in this thesis have all been more or less completed development projects. Thus, we have not been able to obtain reports from all phases of the development projects. Faults found and fixed in the design phases, and in many cases also during unit testing in the implementation phase, have not been studied. By including this information in fault studies, we could learn even more about the potential of fault report analysis as a process improvement tool.
6.2.3 Further studies of Hazard Analysis results and fault reports
Combining hazard analysis and fault report analysis showed that hazard identification can help elicit possible hazardous events caused by faults that may exist in the system. Unfortunately, the system we studied had a very different fault type profile (mostly coding faults) than the other systems we had studied. This may have contributed to the lack of actual faults identified by hazard analysis, although the number of potential faults found was high.
By performing a similar study on a system whose fault profile is more skewed towards faults introduced in early development phases, a larger portion of the faults might be found by PHA and similar techniques. This would help validate hazard analysis as a useful technique for reducing faults.
Glossary

Term definitions

To address the relevant issues, we need reasonably precise definitions of the terms used. The following is a table of short definitions of some terms. Where relevant, they are re-iterated and elaborated in the thesis. These terms are mostly taken from [Conradi07].
HazOp Hazard and Operability analysis is a systematic method for examining complex facilities or processes to find actual or potentially hazardous procedures and operations so that they may be eliminated or mitigated.
Performance The speed or volume offered by a service, e.g. delay/transmission time for data communication, storage capacity in a database, image resolution on a screen, or sound quality over a telephone line.
Comment: That is, the behavioral properties of a service must be acceptable (of high enough quality) for the user, which can be another system, an end-user, or a social organization. Such properties encompass technical aspects like dependability (i.e. trustworthiness), security, and timely performance (transfer rate, delay, jitter, and loss), as well as human-social aspects (from perceived multimedia reception to sales, billing, and service handling). NB: not defined in IEEE 610.12.
See the popular paper on QoS [Emstad03], where the more subjective term QoE (Quality of Experience) is introduced, and also [Cekro99].
Robustness The ability to limit the consequences of an active error or failure, in order to resume (partial) service. Ways to improve this attribute are duplication, repair, containment etc.
RUP The Rational Unified Process [Kruchten00] [Kroll03], an incremental development process based around UML [Fowler04].
Security Protection against unauthorized access (e.g. read / write / search) of data / information. Remedy: Encryption and strict access control e.g. by passwords and physical hinders.
Software Computer programs, procedures and possibly associated documentation and data pertaining to the operation of a computer system [IEEE 610.12].
Software Safety Features and procedures which ensure that a product performs predictably under normal and abnormal conditions, thereby minimizing the likelihood of an unplanned event occurring, controlling and containing its consequences, and preventing accidental injury, death, destruction of property and/or damage to the environment, whether intentional or unintentional [Herrmann99].
Survivability The degree to which essential services continue to be provided in spite of either accidental or malicious harm [Firesmith03].
References

[Aune00] Aune, A.: Kvalitetsdrevet ledelse, kvalitetsstyrte bedrifter. Gyldendal Norsk Forlag, Oslo, 2000.
[Avison99] Avison, D., Lau, F., Myers, M.D., Nielson, P.A.: Action Research. Communications of the ACM, (42)1, pp. 94-97, January 1999.
[Avizienis04] Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), pp.11-33, Jan.-March 2004.
[Bachmann00] Bachmann, F., Bass, L., Buhman, C., Comella-Dorda, S., Long, F., Robert, J., Seacord, R., and Wallnau, K.: Volume II: Technical Concepts of Component-Based Software Engineering. SEI Technical Report CMU/SEI-2000-TR-008, 2000, available at: http://www.sei.cmu.edu/

[Basili94] Basili, V.R., Caldiera, G., Rombach, H.D.: Goal Question Metric Paradigm. In: Marciniak, J.J. (ed.): Encyclopaedia of Software Engineering, pp. 528-532, Wiley, New York, 1994.
[Basili00] Basili, V., Green, S., Laitenberger, O., Shull, F., Sorumgaard, S., and Zelkowitz, M.: The Empirical Investigation of Perspective-Based Reading. Empirical Software Engineering: An International Journal, 1(2), pp. 133-164, October 1996.
[Beck99] Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley Professional, 1999. ISBN 0201616416.
[Boehm88] Boehm, B.W.: A Spiral Model of Software Development and Enhancement. IEEE Computer, (21)5, pp. 61-72, May 1988.
[Boehm91] Boehm, B.W.: Software Risk Management: Principles and Practices. IEEE Software, (8)1, pp. 32-41, January 1991.
[Boehm03] Boehm B.: Value-Based Software Engineering. ACM Software Engineering Notes, (28)2, pp.1-12, March 2003.
[Bishop98] Bishop, P.G., Bloomfield, R.E.: A Methodology for Safety Case Development. Proceedings of the Safety-critical Systems Symposium, Birmingham, UK, Feb 1998.
[Cekro99] Cekro, Z.: Quality of Service – Overview of Concepts and Standards. Report for COST 256, Free University of Brussels, April 1999, available from http://www.iihe.ac.be/internal-report/1999/COSTqos.doc.
[Charette05] Charette, R.N.: Why Software Fails. IEEE Spectrum, September 2005.
[Chillarege92] Chillarege, R., Bhandari, I.S., Chaar. J.K., Halliday, M.J., Moebus, D.S., Ray, B.K., Wong, M.-Y.: Orthogonal defect classification - a concept for in-process measurements. IEEE Transactions on Software Engineering, 18(11), pp. 943 – 956, Nov. 1992.
[Chillarege02] Chillarege, R., Prasad, K.R.: Test and development process retrospective- a case study using ODC triggers. Proceedings of the International Conference on Dependable Systems and Networks (DSN’02), pp. 669- 678, Bethesda, USA, 2002.
[Conradi03] Reidar Conradi (Ed.): Software engineering mini glossary. IDI, NTNU, available from http://www.idi.ntnu.no/grupper/su/se-defs.html, August 2003.
[Conradi07] Reidar Conradi (Ed.): Mini-glossary of software quality terms, with emphasis on safety. IDI, NTNU, available from http://www.idi.ntnu.no/grupper/su/publ/ese/se-qual-glossary-v3_0-rc-4jun07.doc, June 2007.
[Crnkovic02] Crnkovic, I., Larsson M.: Building reliable component-based software systems. Artech House, Boston, 2002.
[Dawkins97] Dawkins, S., Kelly, T.: Supporting the use of COTS in safety critical applications. IEE Colloquium on COTS and Safety Critical Systems (Digest No. 1997/013), pp. 8/1-8/4, 28 Jan. 1997.
[Dybå00] Dybå, T., Wedde, K.J., Stålhane, T., Moe, N.B., Conradi, R., Dingsøyr, T., Sjøberg, D.I.K., Jørgensen, M.: SPIQ Metodehåndbok. Department of Informatics, University of Oslo, Research Report 282, 2000.
[Eldh07] Eldh, S., Punnekkat, S., Hansson, H., Jönsson, P.: Component Testing Is Not Enough - A Study of Software Faults in Telecom Middleware. Proceedings of the 19th IFIP International Conference on Testing of Communicating Systems TESTCOM/FATES 2007, pp. 74-89, Tallinn, Estonia, June 2007.
[El Emam98] El Emam, K., Wieczorek, I.: The repeatability of code defect classifications. Proceedings of the Ninth International Symposium on Software Reliability Engineering, pp. 322-333, Paderborn, Germany, 4-7 Nov. 1998.

[Emstad03] Emstad, P.J., Helvik, B.E., Knapskog, S.J., Kure, Ø., Perkis, A., Swensson, P.: A Brief Introduction to Quantitative QoS. In: Annual Report for 2003 from Q2S Centre of Excellence, NTNU, pp. 18-29, 2003.
[Fairley85] Fairley, R.: Software Engineering Concepts. McGraw-Hill, 1985.
[Fenton97] Fenton, N., Pfleeger, S.L.: Software metrics (2nd ed.): a rigorous and practical approach. PWS Publishing Co., Boston, 1997.
[Firesmith03] Firesmith, D.G.: Common Concepts Underlying Safety, Security, and Survivability Engineering. Technical Note CMU/SEI-2003-TN-033, Software Engineering Institute, Pittsburgh, Pennsylvania, December 2003.
[Fowler04] Fowler, M.: UML Distilled. Third Edition, Addison-Wesley, 2004.
[Freimut01] Freimut, B.: Developing and using defect classification schemes. IESE-Report No. 072.01/E, Version 1.0, Fraunhofer IESE, Sept. 2001.