Text Mining in Program Code
Alexander Dreweke 1
Computer Science Department 2
Friedrich-Alexander University Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
voice: +49 9131 852 7923
fax: +49 9131 852 8809

Computer Science Department 2
Friedrich-Alexander University Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
voice: +49 9131 852 8865
fax: +49 9131 852 8809
email: firstname.lastname@example.org

Marc Wörlein
Computer Science Department 2
Friedrich-Alexander University Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
voice: +49 9131 852 7623
fax: +49 9131 852 8809
email: email@example.com

Nycomed Chair for Applied Computer Science
University of Konstanz
voice: +49 7531 88 5016
fax: +49 7531 88 5132

1 Authors are in strict alphabetical order.
INTRODUCTION

Computer programs are a special form of text. Words of a programming language are combined to form correct sentences in that language.
There exists a wide variety of programming languages, ranging from high-level object-oriented languages like Java or C++ down to machine code, the language a processor can actually "understand". Programming languages are usually translated from high-level to low-level with the help of compilers. Producing this kind of "text", the computer programs, is the daily work of many programmers; billions of lines of code have been written. Most of this code is poorly documented and no longer really understood by anybody once the original programmer has stopped working on it. Typically, many programmers work on one project, and old code from former versions or other projects is often reused.
A particular problem in large amounts of program code is duplicated code fragments. These duplicates can arise from excessive use of "copy & paste", from simply re-programming a piece of functionality, or from the compiler itself.
When translating from high-level to intermediate or low-level languages, new duplicates can be introduced, e.g. by using code templates for instructions and instruction sequences.
Finding these duplicates has been a focus of interest for many years. Code duplicates are called clones, and clone detection research has produced many different algorithms. If program code is simply viewed as text, clone detection is nothing other than mining this text with the goal of finding duplicate or similar code. Merging the application areas and algorithms of the data mining community with those of clone detection leads to fruitful new insights and results.
Finding duplicated code in programs can serve different goals. First, the duplicates can be visualized as a hint to programmers that something should be done about a specific piece of code. Second, the redundant code can be replaced automatically by subroutine calls, in-lined procedure calls, macros, etc. that produce the same result. This leads to smaller code that is easier to understand and to maintain.
Third, methods to detect and replace duplicated code can be integrated into compilers.
Finally, finding duplicated code can lead to special hardware for the duplicates in the area of embedded systems.
In the case of program code, duplicates are not always "totally equivalent". Not only the one-to-one duplicate of a piece of code is interesting; near duplicates, and even pieces of code that are syntactically different but semantically equivalent, must be found as well. For example, two fragments may differ only in the order of two independent instructions that have no side effects on each other. Variable names can differ, or the registers used in machine code can vary.
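One simple way to catch such near duplicates is to normalize fragments before comparing them. The following is a minimal sketch, not an algorithm from this chapter: it maps every identifier or register name to a canonical placeholder, so that fragments differing only in naming compare as equal. The small opcode list and the helper name `normalize` are illustrative assumptions.

```python
import re

# Illustrative opcode set; anything else that looks like a name is renamed.
OPCODES = {"ADD", "SUB", "MOV", "BL", "B"}

def normalize(fragment):
    """Map each distinct identifier/register to a canonical placeholder,
    so fragments that differ only in naming compare as equal."""
    mapping = {}
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|\S", fragment):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in OPCODES:
            mapping.setdefault(tok, "id%d" % len(mapping))
            tokens.append(mapping[tok])
        else:
            tokens.append(tok)
    return " ".join(tokens)

a = "ADD r1, r1, #1"
b = "ADD r7, r7, #1"
print(normalize(a) == normalize(b))  # True: the fragments match after renaming
```

A comparison on normalized fragments thus treats register or variable renaming as irrelevant, while fragments with genuinely different data flow (e.g. `ADD r1, r2, #1` vs. `ADD r3, r3, #1`) still compare as different.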
The application of clone detection ranges from high-level languages to machine code for embedded systems. The latter is the main topic of this chapter; clone detection algorithms designed especially for embedded systems are described in detail.
Clone Detection and Embedded Systems
In recent years, many small programmable devices appeared on the market.
PDAs, mobile phones, and navigation systems are examples of such systems.
Typical software on these gadgets deals with speech or image processing, encryption, calendars or address books. Therefore, the devices contain small processors, handling the different tasks.
Also in larger technical devices, such as cars, the number of small processors has exploded in recent years. They offer, e.g., efficient engine control, driver assistance functions such as the "Electronic Stability Program" or "Lane Departure Warning", which warns drivers when they are leaving their lane, or comfort functions such as storing different seat settings for different persons. In modern high-end cars a network of up to sixty communicating processors is used.
These processors, also called embedded systems, are much more specialized in nature than general computing systems (e.g. desktop computers or laptops). Typically, they are produced at low cost, have low energy consumption, are small, and meet rigid time constraints for their tasks. Normally, they are fabricated in high quantities.
These requirements place constraints on the underlying architecture. An embedded system consists of a processor core, a Read Only Memory (ROM), Random Access Memory (RAM), and an Application Specific Integrated Circuit (ASIC). The cost of developing an integrated circuit is linked to the size of the system. The largest portion of the integrated circuit is often devoted to the ROM for storing the application.
Nevertheless, this ROM is much smaller than the storage offered on desktop computers.
Developing techniques that reduce code size is therefore extremely important for reducing the cost of producing such systems. Programmers writing for embedded systems must be aware of the memory restrictions and write programs that are small in terms of code size. Fewer lines of code can make a huge difference on embedded systems. As the complexity of embedded systems programs grows, programming machine code directly becomes more complicated. Today it is common to use high-level languages such as C or C++ instead. However, programming in these languages results in larger machine code compared to machine code written directly by programmers.
One reason is that compiler optimization classically focuses on execution speed and much less on code size. Introducing code size optimizations on embedded systems often leads to speed trade-offs, but when execution speed is not critical, minimizing the code size is usually profitable.
Different automated techniques have been proposed over the years to achieve this goal of small code (Beszédes, Ferenc, Gyimóthy, Dolenc & Karsisto, 2003; Debray, Evans, Muth & De Sutter, 2000). These techniques fall into two groups. In code compression, the code is compressed in some way and must be decompressed again at runtime. This results in very small code, but an additional step is necessary to decompress the required parts. Decompression itself does not take too much time (depending on the compression algorithm), but there must be enough memory to store the decompressed code parts. The other method is code compaction: the code is shrunk, but remains directly executable.
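The compression trade-off can be made concrete with a toy illustration (not taken from the chapter): a general-purpose compressor shrinks the stored bytes considerably, but the full-size decompressed code must fit into memory again before it can execute.

```python
import zlib

# Stand-in for a highly repetitive code segment, as produced by
# compiler code templates (illustrative data, not real firmware).
code = b"ADD r1, r1, #1\n" * 100

compressed = zlib.compress(code)
print(len(code), len(compressed))  # compressed storage is far smaller

# ... but at runtime the full original size is needed again:
assert zlib.decompress(compressed) == code
```

Code compaction avoids this runtime cost entirely, at the price of typically smaller size reductions, which is why the rest of this chapter concentrates on compaction-style techniques.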
It is also typical for embedded systems that they are inflexible. Once a mobile phone is fabricated, its functionality cannot be changed. If a new communication standard is introduced, the old phone is thrown away and a new phone is built. For that reason, more flexible embedded systems are being researched. The more applications they can handle, the more they will be used. The problem is that the flexibility of these general-purpose processors comes at the price of performance trade-offs.
To handle these performance trade-offs, sets of instructions that are frequently executed are implemented as fixed logic blocks. Such blocks perform better than the softer, reconfigurable parts of the device. If they are well chosen, the performance of the system is enhanced. The goal is to have the frequent parts in hard logic implementation, while simultaneously affording flexibility to other components.
Recapitulating, three problems in programming embedded systems depend on finding code clones: code compression, code compaction, and identifying candidates for hard logic implementations. An example of code clones in machine code is given in the next section.
An Example

Figure 1. Example instruction sequence and corresponding minimum code sequence

Figure 1 shows, on the left-hand side, a sequence of 18 ARM assembler instructions. The ARM architecture is a popular embedded architecture with a common 32-bit RISC instruction set. Three fragments of the code are highlighted.
These three parts contain the same operations, but in different orders. It is easy to see that most of these operations are independent of each other, so reordering them does not change the semantics of the code. For example, the result of ADD r1, r1, #1 is of no interest to any other operation of the gray shaded area, as register r1, which contains the result of this operation, is not used anywhere else. The instruction sequence on the left of Figure 1 can be compacted to a minimum of 12 instructions (see Figure 1, right-hand side) by reordering the instructions according to their data dependencies. After that, the repeated instructions can be replaced by a call (instruction BL) to a single code instance labeled "extracted". At the end of this extracted code the program counter (pc) must be restored from the link register (lr), which contains the return address of a function, in order for the program to execute the next instruction after the call to the extracted code. We use this example to compare the various text mining approaches and to explain their advantages and limitations.
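The independence test that justifies such reordering can be sketched as follows. This is a deliberately simplified model, not the chapter's implementation: it assumes ARM-like syntax where the first operand is written and the remaining register operands are read, and it ignores memory and flag dependencies.

```python
def regs(instr):
    """Very simplified ARM-like parsing: first operand is written,
    the remaining register operands are read (memory/flags ignored)."""
    op, rest = instr.split(None, 1)
    operands = [o.strip() for o in rest.split(",")]
    written = {operands[0]}
    read = {o for o in operands[1:] if o.startswith("r")}
    return written, read

def independent(a, b):
    """Two instructions may be swapped if neither writes a register
    that the other one reads or writes."""
    wa, ra = regs(a)
    wb, rb = regs(b)
    return not (wa & (wb | rb) or wb & (wa | ra))

print(independent("ADD r1, r1, #1", "SUB r2, r3, #4"))  # True: may be reordered
print(independent("ADD r1, r1, #1", "MOV r4, r1"))      # False: MOV reads r1
```

With such a test, the instructions inside each highlighted fragment can be topologically reordered into a canonical sequence, which is what makes the three syntactically different fragments in Figure 1 detectable as one clone.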
In the remainder of this chapter an overview of mining in program code for embedded systems is given. As a basis, we take several research areas from embedded system design and compiler construction, all sharing the goal of finding repeated code fragments. In the next section, the different application areas in which shrinking program code is of interest are described. The third section focuses on the different methods for finding frequent fragments in program code.
BACKGROUND

Procedural Abstraction and Cross Jumping

In the seventies and eighties of the last century, code size moved into the focus of embedded system research. Until then only speed had been of interest; the limited size of the ROM changed this view. Procedural abstraction and cross jumping are two techniques that can be applied at different stages of compilation, ranging from source code at the beginning, over the intermediate code of the compiler, to machine code at the end. Both methods start by finding repeated code fragments. Several possibilities exist to find these fragments; they are explained in the section "Main Focus" of this chapter. The repeated code fragments are candidates to be extracted into a separate procedure, as shown in Figure 1. This is the basic idea of procedural abstraction (Fraser, Myers & Wendt, 1984; Dreweke, Wörlein, Fischer, Schell, Meinl & Philippsen, 2007). Before extraction, however, the fragments found must be examined as to whether they are really suitable. The main problem is that frequent fragments might overlap in the code. In that case only one of the overlapping fragments can be extracted, while the other is ignored. It must be calculated which group of frequent fragments yields the most savings in the end. Procedural abstraction then turns repeated code fragments into separate procedures: the fragments are deleted at their original positions and procedure calls are inserted instead. Our running example in Figure 1 shows procedural abstraction applied to machine code after compilation.
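The selection among overlapping fragments rests on a simple cost model, which can be sketched as follows. The model is an assumed, back-of-the-envelope simplification (one call instruction per occurrence plus one return in the extracted procedure); real cost models must also account for calling conventions and alignment.

```python
def net_savings(fragment_len, occurrences, call_cost=1, return_cost=1):
    """Instructions saved by extracting a repeated fragment into a
    procedure: all copies are replaced by calls, one shared copy plus
    a return instruction remains."""
    original = occurrences * fragment_len
    abstracted = fragment_len + return_cost + occurrences * call_cost
    return original - abstracted

# A hypothetical 5-instruction fragment occurring 3 times:
print(net_savings(5, 3))  # 15 - (5 + 1 + 3) = 6 instructions saved
```

Note that `net_savings(1, 2)` is negative: very short or rare fragments cost more in call overhead than they save, which is why candidate fragments must be weighed against each other before extraction.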
Cross jumping, also known as tail merging, is applicable if the identified fragments end with a branch instruction to a common destination. One occurrence of the fragment is left as it is, and the other occurrences are replaced by branch instructions to the start of this fragment. The branch instruction at the end of the frequent fragment ensures that program execution continues at the correct position even after the extraction of the frequent fragment.
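An illustrative sketch of this idea (assumed helper names, not the chapter's code): find the longest common instruction suffix of two blocks that branch to the same target, keep one copy, and redirect the other block to it.

```python
def shared_tail(block_a, block_b):
    """Length of the longest common instruction suffix of two blocks."""
    n = 0
    while (n < min(len(block_a), len(block_b))
           and block_a[-1 - n] == block_b[-1 - n]):
        n += 1
    return n

a = ["MOV r0, #1", "ADD r2, r2, r0", "B done"]
b = ["MOV r1, #2", "ADD r2, r2, r0", "B done"]

n = shared_tail(a, b)
# Keep the tail once (inside block a, at a hypothetical label 'a_tail')
# and let block b branch into it instead of duplicating it:
merged_b = b[:-n] + ["B a_tail"]
print(n)  # 2 instructions of the tail can be merged
```

Unlike procedural abstraction, no call/return pair is needed here, because control flow already leaves both blocks through the same branch.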
Both strategies have their benefits and costs in code space and execution time.
While the size of the original program shrinks, execution might take longer with procedural abstraction, as procedure calls take their time. Therefore, it is not advisable to apply procedural abstraction to code that is executed often at runtime (so-called hot code). Procedural abstraction and cross jumping require no hardware support at all.
There are several approaches and enhancements that do not insist on completely equivalent fragments. Fragments of machine code can, e.g., differ in the registers used. It might be possible to extract two code sequences that differ only in their registers into one procedure and map the registers onto each other. If the register mapping does not introduce more code than the extraction into a procedure saves, this approach is worthwhile (Cooper & McIntosh, 1999).
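The break-even condition for such register mapping can be sketched with the same kind of simplified cost model as before (all numbers and the helper name are illustrative assumptions, not figures from Cooper & McIntosh):

```python
def renaming_pays_off(fragment_len, differing_registers,
                      call_cost=1, return_cost=1, move_cost=1):
    """Extracting two fragments that differ only in registers removes one
    copy, but adds a call/return pair plus one register-mapping move per
    differing register; it pays off only if more is saved than added."""
    saved = fragment_len                       # the second copy disappears
    added = (call_cost + return_cost
             + differing_registers * move_cost)
    return saved > added

print(renaming_pays_off(10, 2))  # True: 10 instructions saved vs 4 added
print(renaming_pays_off(3, 2))   # False: mapping overhead eats the gain
```

The sketch makes the cited trade-off concrete: long fragments tolerate register mapping, while for short fragments the extra moves outweigh the duplicate removed.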
Procedural abstraction is applied at different stages of compilation. The problem of fragments using different registers can be avoided when procedural abstraction is applied to the intermediate representation of the compiler, before the register mapping takes place. There are also several approaches where procedural abstraction is applied before or after link time. If applied after link time, library programs etc. are covered as well.