FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:   || 2 | 3 | 4 |

«Text Mining in Program Code Alexander Dreweke 1 Computer Science Department 2 Friedrich-Alexander University Erlangen-Nuremberg Martensstr. 3 91058 ...»

-- [ Page 1 ] --

Text Mining in Program Code

Alexander Dreweke 1

Computer Science Department 2

Friedrich-Alexander University Erlangen-Nuremberg

Martensstr. 3

91058 Erlangen


voice: +49 9131 852 7923

fax: +49 9131 852 8809

email: dreweke@cs.fau.de

Ingrid Fischer*

Nycomed Chair for Applied Computer Science

University of Konstanz

BOX M712

78457 Konstanz


voice: +49 7531 88 5016

fax: +49 7531 88 5132

email: Ingrid.Fischer@inf.uni-konstanz.de

Tobias Werth

Computer Science Department 2 Friedrich-Alexander University Erlangen-Nuremberg Martensstr. 3 91058 Erlangen Germany voice: +49 9131 852 8865 fax: +49 9131 852 8809 email: werth@cs.fau.de Marc Wörlein Computer Science Department 2 Friedrich-Alexander University Erlangen-Nuremberg Martensstr. 3 91058 Erlangen Germany voice: +49 9131 852 7623 fax: +49 9131 852 8809 email: woerlein@cs.fau.de 1 Authors are in strict alphabetical order.

Text Mining in Program Code


Searching for frequent pieces in a database with some sort of text is a well- known problem. A special sort of text is program code as e.g. C++ or machine code for embedded systems. Filtering out duplicates in large software projects leads to more understandable programs and helps avoiding mistakes when reengineering the program. On embedded systems the size of the machine code is an important issue. To ensure small programs, duplicates must be avoided. Fast program execution can be ensured, when frequently used duplicates are encoded in hardware. The most successful approaches for finding code duplicates are based on graphs representing the data and control flow of the program and graph mining algorithms. Compared to applications of suffix tries on the code or fingerprinting, where some kind of special form of program parts is calculated, more duplicates are found.


Computer programs are a special form of text. Words of a programming languages are combined to form correct sentences in this programming language.

There exists a wide variety of programming languages, ranging from high-level object-oriented languages like Java or C++ to machine code, the language a processor can actually "understand". Programming languages are usually translated with the help of compilers from high- to low-level. To produce this kind of “text” - the computer programs - is the daily work of many programmers; billions of lines of code have been written. Mostly, this code is not well documented and not really understood by anybody after the original programmer stopped working. Typically, many programmers are working on one project and often old code from former versions or other projects is used.

A special problem in big amounts of program code are duplicated code fragments. These duplicated fragments can occur because of excessive use of "copy & paste", because something was simply re-programmed or also because of the compiler.

When translating from the high-level to intermediate or low-level languages, new duplicates can be introduced, e.g. by using code templates for instructions and instruction sequences.

Finding these duplicates has been in the focus of interest for many years. Code duplicates are called clones and clone detection has produced many different algorithms. If program code is simply viewed as text, clone detection is nothing else than mining in this text with the goal of finding the duplicate or similar code. Merging application areas and algorithms from the data mining community on the one hand and clone detection leads to fruitful new insights and results.

Finding duplicated code in programs can have different goals. First these duplicates can be visualized as a hint for programmers that something has to be done about this specific piece of code. Second, the redundant code can be replaced automatically by subroutine calls, in-lined procedure calls, macros etc. that produce the same result. This leads to smaller code that is easier to understand or to maintain.

Third, methods to detect and replace duplicated code can be integrated into compilers.

Finally, finding duplicated code can lead to special hardware for the duplicates in the area of embedded systems.

In the case of program code, duplicates are not always "totally equivalent". It is not only the one-to-one duplicate from a piece of code that is interesting. Also near duplicates or even pieces of code, that are syntactically different, but semantically equivalent must be found. E.g. in two fragments only two independent pieces of code having no side effect onto each other can be exchanged. Variable names can be different or registers in machine code can vary.

The application of clone detection ranges from high-level languages to machine code for embedded systems. The latter is the main topic in this chapter. The clone detection algorithms especially for embedded systems are described in detail.

Clone Detection and Embedded Systems

In recent years, many small programmable devices appeared on the market.

PDAs, mobile phones, and navigation systems are examples for these systems.

Typical software on these gadgets deals with speech or image processing, encryption, calendars or address books. Therefore, the devices contain small processors, handling the different tasks.

But also in larger technical devices, such as cars, the number of small processors exploded in the last years. They offer e.g. efficient engine control, drivers assistance functions such as the “Electronic Stability Program” or “Line Departure Warning”, that warns the driver when he/she is leaving the right lane or more comfort functions such storing the different seat features for different persons. In modern highend cars a network of up to sixty communicating processors are used.

These processors, also called embedded systems, are much more specialized in nature than general computing systems (e.g. desktop computers or laptops). Typically, they are produced at low cost, have low energy consumption, are small, and meet rigid time constraints for their tasks. Normally, they are fabricated in high quantities.

These requirements place constraints on the underlying architecture. An embedded system consists of a processor core, a Read Only Memory (ROM), Random Access Memory (RAM), and an Application Specific Integrated Circuit (ASIC). The cost of developing an integrated circuit is linked to the size of the system. The largest portion of the integrated circuit is often devoted to the RAM for storing the application.

Nevertheless, the ROM is much smaller than storage offered on desktop computers.

Developing techniques reducing code size are extremely important in terms of reducing the cost of producing such systems. Programmers writing for embedded systems must be aware of the memory restrictions and write small programs in the sense of code size. Less lines of code can make a huge difference on embedded systems. As the complexity of embedded systems programs grows, programming machine code directly gets more complicated. Today it is common to use high-level languages as C or C++ instead. However programming in these languages results in larger machine code compared to machine code written directly by the programmers.

One reason is that compiler optimization classically focuses on execution speed and not so much on code size. Introducing code size optimizations on embedded systems often leads to speed trade-offs, but when execution speed is not critical, minimizing the code size is usually profitable.

Different automated techniques have been proposed over the years to achieve this goal of small code (Beszdes, Ferenc, Gyimthy, Dolenc, Karsisto, 2003; Debray, Evans, Muth & Sutter, 2000). These techniques can be separated into two groups. In code compression the code is somehow compressed and must be decompressed again at runtime. This technique results in very small code, but an additional step is necessary to decompress necessary parts. Decompression itself does not take too much time (depending on the compression algorithm) but there must be enough memory to store the decompressed code parts again. Another method is code compaction: the code is shrunk, but remains executable.

It is also typical for embedded systems, that they are inflexible. Once the mobile phone is fabricated, the functionality can not be changed. If a new communication standard is proposed, the old mobile phone is thrown away and a new phone is build. For that reason, more flexible embedded systems are researched. The more applications they can handle, the more they will be used. The problem is, that the flexibility of these general purpose processors is paid with performance trade-offs.

To handle these performance trade-offs, sets of instructions that are frequently executed are implemented as fixed logic blocks. Such blocks perform better than the softer, reconfigurable parts of the device. If they are well chosen, the performance of the system is enhanced. The goal is to have the frequent parts in hard logic implementation, while simultaneously affording flexibility to other components.

Recapitulating, three problems in programming embedded systems are based on code clones: code compression, code compactification, and finding candidates for hard logic implementations. A possible example of code clones in machine code is given in the next section.

An Example Figure 1 Example instruction sequence and corresponding minimum code sequence Figure 1 on the left hand side shows a sequence of 18 ARM assembler instructions. The ARM architecture is a popular embedded architecture with a common 32-bit RISC instruction set. Three fragments of the code are highlighted.

These three parts contain the same operations but in different orders. It is easy to see that most of these operations are independent of each other, so reordering the operations does not change the semantics of the code piece. E.g. the result of ADD r1, r1, #1 is not of interest for any other operation of the gray shaded area as register r1 containing the result of this operation is not used anywhere else. The instruction sequence on the left of Figure 1 can be maximal compacted to 12 instructions (see Figure 1 right hand side) by reordering the instructions according to their data dependencies. After doing this one can remove the instructions by a call (operation BL) to a single code instance labeled "extracted". At the end of this new code the program counter (pc) must be restored to the link register (lr) – containing the return address of a function - in order for the program to execute the next instruction after the call to the extracted code. We use this example to compare the various text mining approaches to explains their advantages and limitations.

In the remainder of this paper an overview on mining in program code for embedded systems is given. As a basis, we take several research areas from embedded system design and compiler construction all having the same goal to find repeated code fragments. In the next section, the different application areas, in which shrinking program code is of interest, are described. In the third section the main focus is set on the different methods to find frequent fragments in program code.

BACKGROUND Procedural abstraction and cross jumping In the seventies and eighties of the last century code size moved in the focus of embedded system research. Until then only speed was of interest, the size of the ROM changed this view. Procedural abstraction and cross jumping are two techniques, that can be applied at different stages of compilation ranging from source code at the beginning over the intermediate code of the compiler to machine code at the end. Both methods start with finding repeated code fragments. To find these fragments, several possibilities exist which are explained in the next section “Main Focus” of this chapter. The repeated code fragments are candidates to be extracted into a separate procedure as shown in Figure 1. This is the basic idea of procedural abstraction (Fraser, Myers, & Wendt, 1984; Dreweke, Wörlein, Fischer, Schell, Meinl & Philippsen, 2007). But before extraction the fragments found must be examined whether they are really suitable. The main problem is that frequent fragments might overlap in the code. In this case only one fragment can be extracted while the second fragment is ignored. It must be calculated which groups of frequent fragments has the most savings at the end. Procedural abstraction then turns repeated code fragments into separate procedures. The fragments are deleted in their original positions and procedure calls are inserted instead. Our running example in Figure 1 shows procedural abstraction after the compilation applied to machine code.

Cross jumping is also known as tail merging. It is applicable if the identified fragments have a branch instruction to the same destination in common on the last instruction. One fragment is left as it is and the other occurrences are replaced by branch instructions to the start of this fragment. This branch instruction at the end of the frequent fragment ensures that program execution continuous at the correct position even after the extraction of the frequent fragment.

Both strategies have their benefits and costs in code space and execution time.

While the size of the original program shrinks, execution might take longer for procedural abstraction as procedure calls take their time. Therefore, it is not advisable to apply procedural abstraction to code that is often used during runtime (so called hot code). Procedural abstraction and cross jumping require no hardware support at all.

There are several approaches and enhancements that do not check for completely equivalent fragments. E.g. fragments for machine code can differ in the registers used. It might be possible to extract two code sequences that only differ in their registers into one procedure and map the registers onto each other. If the register mapping does not introduce more code than the extraction into a procedure saves, this approach is worth it (Cooper & McIntosh,1999).

Procedural abtraction is applied on different stages of compilation. The problem of having different registers in fragments can be avoided when procedural abstraction is applied on the intermediate representation of the compiler before the register mapping takes place. There are also several approaches where procedural abstraction is applied before or post link-time. If applied after library programs etc.

Pages:   || 2 | 3 | 4 |

Similar works:

«Le jazz, l’Afrique et la contramétricité1 Marc Chemillier, Jean Pouchelon, Julien André, Jérôme Nika, version 5 mars 2013 Le jazz et la musique africaine partagent de nombreux traits communs : présence d’une pulsation2 régulière avec une tendance de la musique à contrecarrer cette pulsation, lien du rythme avec le mouvement corporel qui s’exprime notamment dans la danse. La notion de pulsation renvoie à un phénomène facilement observable, qui est la capacité du corps humain...»

«                             ! ∀  # ∀ ∃ % &!∋ (  ) ) )   )  ) )   )  ∗   +   (  ) , −.  / ∃  &∋  !0123 ∗∀∀% !4 1!40  5 .3!4 !40! !630 7 ! 8 9 :  ,  /    ;  ;8 8 9 :  ∗   +   (  ) , −.  / 0 ∀ ); !∃   ;       )3!4 !40! !630       ;  1  . ...»

«•Bl~JI~·.~I~~~~~(t1~~·.-'. ~·.-'.'.·~c: ::· ·':'i.::·. ;-:: ~.. :.: (. _: =-~ :. : i...-._,·,· -:;-· HISPANIC, 11 1 S PA N I C S 0 C I ET Y A~1ERICAN SERIES ···A ~\\ERICA OF •,t HISPANIC ~10NOGRAPHS NOTES & E S SAYS, STUD I E S, AND BR I E F BIOGRAPHIES ISSUED BY THE HISPANIC SOCIETY OF AMERICA VII.'f! Digitized by the Internet Archive in 2008 with funding from Microsoft Corporation http://www.archive.org/details/uruguayansoftodaOOparkuoft Entrance to the Museum of...»

«Jungfrau (Winner) Mary Watson IT WAS THE VIRGIN JESSICA who taught me about wickedness. I once asked her why she was called the Virgin Jessica. She looked at me with strange eyes and said that it was because she was a special person, like the Blessed Mary. “A virgin is someone who can do God’s work. And if you’re very, very clean and pure you can be one of the one hundred and forty-four virgins who will be carried in God’s bosom at the end of the world. And if you’re not –” She...»

«Finding the Freedoms of Contemporary Free Verse Allison Porzio Most days Johnny holds the anger inside. He looks like any other teen with baggy jeans and a beat-up white shirt, but he is bigger than most, as though the anger puffs up his stomach with hot air, and waits to escape. I met Johnny one day while working in the local YMCA teen center. He played alone, leaning over the pool table, hitting balls into the pockets at random, not even using the cue ball. “Where do you go to school?” I...»

«DOCUMENT RESUME CG 030 821 ED 451 450 AUTHOR Mirny, Anna I. TITLE Meaning of the Group: Diverging Perspectives of the Early Adolescent Boys and Girls on Their Peer Groups. 2001-04-00 PUB DATE 54p.; Paper presented at the Annual Meeting of the American NOTE Educational Research Association (Seattle, WA, April 10-14, 2001). Speeches/Meeting Papers (150) Research (143) Reports PUB TYPE MF01/PC03 Plus Postage. EDRS PRICE *Adolescents; *Grade 8; *Interpersonal Competence; DESCRIPTORS Interpersonal...»

«Table of Contents Next Page Diagnostic Imaging Prep Guidelines Previous Page Table of Contents Next Page IV Contrast Requirements All patients requiring IV contrast must be screened prior to receiving contrast! For patients that are diabetic, or have kidney disease, or have IV CONTRAST REQUIREMENTS lupus and/or other collagen vascular diseases, a BUN and Creatininte must be available, performed within the 30 dyas prior to the exam date. These results are needed before the patient is injected...»

«FFP 2012 “Go for the Gold”! Diagnostic Protocol Child/Teen Booklets Group & Individual Treatment Ideas Handouts: Past & Present Thanks to the many speech language pathologists who have given time and talents to the development of the Fluency Friday program! 1 Fluency Friday Plus! Fluency Friday Plus (FFP) is an intensive treatment program for children/teens who stutter. This program serves students from Kindergarten through High School and beyond. This project is a collaborative effort...»

«Journal of the Linguistic Association of Nigeria Volume 14 Number 2 2011 (pp. 319-327) A Re-Appraisal of Nasal Vowels in Esan E.O. Osiruemu Department of Linguistics and African Languages, University of Benin, P.O. Box 4563, Benin City E-mail: ofuije@yahoo.com This paper discusses nasal vowels in Esan. It re-evaluates prevailing assumptions about their status and proffers phonetic as well as phonological basis to justify the possibility that these so-called ‘nasal vowels’ may be a...»

«GENERATION OF TEXTURES AND GEOMETRIC PSEUDO-URBAN MODELS WITH THE AID OF IFS Xavier Marsault UMR CNRS MAP, Modèles et simulations pour l'Architecture, l'urbanisme et le Paysage Laboratoire ARIA, Ecole d’Architecture de Lyon 3, rue Maurice Audin, 69512 Vaulx en Velin Xavier.Marsault@lyon.archi.fr http://www.aria.archi.fr ABSTRACT Geometric and functional modelling of cities has become a growing field of interest, raised by the development and democratisation of computers being able to support...»

«reviewed paper Building Smart Applications for Smart Cities – IGIS-based Architectural Framework Alexander Vodyaho, Nataly Zhukova (Dr. Alexander Vodyaho, LETI, 5, Popova str., St. Petersburg, 197376, Russia, aivodyaho@mail.ru) (Dr. Natalia Zhukova, SPIIRAS, 39, 14th Line, V.O., St. Petersburg, 199178, Russia, gna@oogis.ru) 1 ABSTRACT To solve different kinds of complicated problems which arise in context of intensive development of modern cities a great number of various applications are...»

«2013 ANNUAL REPORT Every kiss begins with Kay.® Every kiss begins with Kay.® Signet Jewelers is the largest specialty jewelry retailer in the US and UK. Signet’s US division operates over 1,400 stores in all 50 states primarily under the name brands of Kay Jewelers and Jared The Galleria Of Jewelry. Signet’s UK division operates approximately 500 stores primarily under the name brands of H.Samuel and Ernest Jones. Signet’s goal is to create and offer jewelry and watch products that...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.