
DATABASE SUPPORT FOR TOP-DOWN PROTEOMICS

BY

YONG-BIN KIM

DISSERTATION

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy in Computer Science

in the Graduate College of the

University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois

Doctoral Committee:

Professor Geneva Belford, Chair

Professor Neil L Kelleher

Professor Jiawei Han

Professor Chengxiang Zhai

Abstract

Top-down proteomics is a revolutionary application for the identification and characterization of proteins, a problem known to be one of the most complicated and challenging in biology. In top-down proteomics, the quality and speed of the data warehouse are very important, because the high-accuracy results are returned by a database search. ProSight Warehouse fills the critical role of data warehouse for ProSight PTM, the first publicly available top-down proteomics software suite. MySQL, a free relational database, is the base of this warehouse. Many annotated and predicted protein forms have been successfully incorporated into the organism-specific databases and into the integrated database for human strains. To achieve high quality and efficiency, a database schema (Absolute Mass Search), data annotation methods (Shotgun and Extended Shotgun Annotation), data population strategies (on-the-fly population and bulk loading), and a database integration methodology for human proteins were developed. With the successful implementation of ProSight Warehouse, ProSight PTM achieved its aim of highly accurate protein identification and characterization.

Acknowledgements

It is with a deep feeling of gratitude and appreciation that I write this to acknowledge the many people who have helped me reach my goal. I am very excited to be able to write this page after a long journey towards my Ph.D. degree. The research for this dissertation could not have been completed without direct help from Prof. Neil Kelleher, the members of the ProSight Development team, and the Kelleher group. Prof. Kelleher supported me academically and with a research assistantship, and I cannot thank him enough for that. I especially want to thank Dr. Richard LeDuc, Gregory Taylor, Ryan Fellers, Leonid Zamdborg, Dr. James Pesavento, Dr. Andrew Forbes, Dr. Michael Roth, and Craig Wenger of the Kelleher group for their contributions to my research. Prof. Jiawei Han and Prof. Chengxiang Zhai patiently waited for me to finish my work, and without their advice I could not be here. It is my honor to have them on my Ph.D. committee. Most of all, I would like to express my greatest appreciation to my advisor, Prof. Geneva Belford. She is my mentor, my academic advisor, and my inspiration. She has never stopped encouraging me to finish my degree. As I look back on my years of research here, it is not only her tremendous contribution but also her kindness and heart that stand out most in my mind. I am so privileged to have her as my advisor.

I am so lucky to have friends who have faith in me and have always stood by me. Dr. Soon Pack Park originally inspired me to earn a Ph.D., and he has been unwavering in his support. Mary Beth Kelley, Holy Bagwell, Kathy Runck, and Angie Bingaman in the CS academic office are great friends, and I really appreciate all their help. I would especially like to thank all of the members of the online forum DVD Prime (http://www.dvdprime.com), who gave me so much uplifting support. They really helped me when I was down and searching for my motivation.

–  –  –

Life Church for their prayers. I want to send special gratitude to my closest friend, Matt Moran, who proofread my thesis from beginning to end right when I needed it.

Lastly, I give all my love to my family for their sacrifice. Continuing support from my parents and parents-in-law in Korea was the compelling force behind the pursuit of my goal. I really thank my kids, Michael and Michelle, who never allowed me to give up and who surprised me, cheered me up, and made me smile through this. They are my reason to live. And I cannot possibly express enough appreciation for my wife Soojeong, who is my heart and soul. Having her as my wife is the luckiest thing that ever happened to me. I dedicate my dissertation to her.

Table of Contents

List of Figures 

List of Tables 

Chapter 1. Introduction  

Chapter 2. Background  

2.1. Genomics and Proteomics  

2.2. Mass Spectrometry (MS)  

2.3. Top‐down Proteomics Software  

2.3.1. ProSight PTM 

2.3.2. ProSight Retriever 

2.3.3. ProSight PC 

2.3.4. ProSight Warehouse (PTM Warehouse)  

Chapter 3. Top‐down Proteomics Database  

3.1. Database Design  

3.1.1. Absolute Mass Search Schema 

3.1.2. Automatic Protein Characterization  

3.2. Data Annotation  

3.2.1. Shotgun Annotation  

3.2.2. Extended Shotgun Annotation  

3.3. Data Generation  

3.3.1. Predicting Protein Forms  

3.3.2. PTM Trimming  

3.4. Data Population  





3.4.1. On‐the‐fly Population  

3.4.2. Bulk Loading  

3.5. Data Model  

3.5.1. 2‐Tier Approach  

3.5.2. 3‐Tier Approach  

3.5.3. 4‐Tier Approach  

3.6. Conclusion  

Chapter 4. Database Integration  

4.1. Database Format  

4.1.1. FASTA Format 

4.1.2. SWISS‐PROT Format 

4.1.3. XML Format 

4.1.4. Another Format: dbSNP Database  

4.2. Integrated Database  

4.2.1. Biological Database Integration  

4.2.2. Unified Data Format  

4.2.3. Merging SWISS‐PROT Format Databases  

4.2.4. dbSNP Integration  

4.2.5. HPRD Integration  

4.3. Conclusion  

Chapter 5. Implementation  

5.1. ProSight Warehouse  

5.1.1. DB_index (Database of databases)  

5.1.1.1. DB_info Table  

5.1.1.2. PTM_info Table  

5.1.1.3. PTM_type Table  

5.1.1.4. DB_by_PTM Table  

5.1.2. Organism‐specific Databases  

5.1.2.1. Gene Table  

5.1.2.2. Protein_Form Table  

5.2. Database Loader  

5.2.1. Swissknife and BioPerl Libraries  

5.2.2. dbSNP‐SwissProt Converter  

5.2.3. HPRD‐SwissProt Converter  

5.2.4. Merging Databases  

5.2.5. Basic Forms Generation  

5.2.6. Protein Forms Generation  

5.2.7. N‐Terminal Modifications  

5.2.8. Database Population  

5.2.9. Database Loader Configuration 

5.2.10. Handling Exceptions  

5.2.11. Database Indexing  

5.2.12. Other Features  

5.3. Conclusion  

Chapter 6. Performance and Scalability 

6.1. Histone Experiment  

6.2. Performance  

6.3. Scalability  

6.4. Conclusion  

Chapter 7. Conclusion  

7.1. The First Top‐Down Proteomics Software Suite  

7.2. The Data Warehouse  

7.3. Final Thoughts  

References 

List of Figures 

Figure 1: ProSight PTM 

Figure 2: Screenshot of ProSight PTM 

Figure 3: Search space of many closely‐related protein forms 

Figure 4: Absolute Mass Search schema 

Figure 5: Automatic Protein Characterization schema (Figure courtesy of the Kelleher group) 

Figure 6: Shotgun annotation 

Figure 7: Shotgun annotation and extended shotgun annotation 

Figure 8: PTMs of human in UniProt 

Figure 9: PTM Trimming 

Figure 10: The 2‐tier approach 

Figure 11: The 3‐tier approach 

Figure 12: The 4‐tier approach 

Figure 13: FASTA format 

Figure 14: SWISS‐PROT format file 

Figure 15: XML format file of HPRD (Adapted from the HPRD data) 

Figure 16: Integration of SWISS‐PROT format databases 

Figure 17: dbSNP ER diagram (From ftp://ftp.ncbi.nih.gov/snp/database/b124/mssql/schema/erd_dbSNP.pdf) 

Figure 18: Integration of UniProt, HPI and dbSNP 

Figure 19: Entire process of integrating UniProt, HPI, dbSNP, and HPRD 

Figure 20: Database information on ProSight PTM 

Figure 21: Database Loader flow chart 

Figure 22: Feature keys of UniProt entry 

Figure 23: Configuration file of Database Loader 

Figure 24: Histone H4 sequence 

Figure 25: Retrieval time with the size of the database (No Indexing) 

Figure 26: Retrieval time with the size of the database (Indexed) 

Figure 27: ProSight PTM 2.0

Figure 28: Fragmentation details of ProSight PTM 2.0 

Figure 29: Simple databases on ProSight PTM 

Figure 30: Highly annotated databases on ProSight PTM 

List of Tables 

Table 1: Histone H4 database 

Table 2: Search time results 

Chapter 1. Introduction

Proteins play a vital role in living organisms. Research on the structure and function of proteins expands the scope of biology and chemistry. Because proteins have complicated structures and biological functions, protein research requires significant collaboration with researchers in other areas, such as computer science, which increasingly interacts with this domain because of the vastly enlarged production of data on complex biological systems. Now that the Human Genome Project, which identified genes and determined the sequence of nucleotides in human DNA, has ended, people are paying more attention to the proteins formed from genes. The nucleotide sequences that are translated into amino acids are known; however, the kinds of modifications that happen during protein formation cannot yet be predicted, especially in human cells. Such modifications drastically increase the number of protein types available and the diversity of living organisms. Identifying and characterizing proteins has been one of the most challenging tasks in biology, and many strategies have been devised to accomplish it.

The development of high-resolution analytical tools such as mass spectrometry (MS) has accelerated the protein identification and characterization process. With the high-accuracy data these tools generate, scientists can measure protein mass and analyze the primary structure of proteins. However, researchers still need in-depth analysis of the data from the mass spectrometer, since proteins carry an unknown number of modifications, such as post-translational modifications (PTMs), which regulate the biological functions of proteins and change their mass. Hence many analytical methods have been developed to identify and characterize proteins harboring modifications. One of the most well-known and proven techniques for analyzing a protein is the so-called “bottom-up” approach.

In the bottom-up approach, proteins are digested with enzymes called proteases and cleaved into peptide fragments between 5 and 20 amino acids long. These digested peptide samples are then fed into the mass spectrometer for mass measurement, and the results are compared to proteins in a sequence database. This approach is sometimes called “peptide mass fingerprinting,” since unique peptide fragments serve as fingerprints that identify the protein they came from. The bottom-up approach allows high-throughput protein identification and is now used extensively. However, since this approach does not see the entire protein, it does not cover the whole sequence, which leads to incomplete and insufficient characterization of an intact protein. Although existing software such as SEQUEST [Tabb et al. 2000], Mascot [Perkins et al. 1999], and ProFound [Zhang et al. 2000] has been used to identify proteins more accurately, the bottom-up approach cannot reach 100% sequence coverage. Thus, the characterization process suffers from this limitation of fractional data.
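The notion of sequence coverage used above can be made concrete with a small sketch. The Python below is an illustration only and is not part of any of the tools cited; it computes the fraction of residues in a protein that fall inside at least one identified peptide. The protein sequence and the peptide list are hypothetical values chosen for the example.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


# Hypothetical 22-residue protein and two identified peptides.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
identified = ["TAYIAK", "QISFVK"]
coverage = sequence_coverage(protein, identified)
print(f"coverage = {coverage:.1%}")
```

In this toy case only 12 of the 22 residues are observed, so a modification on any of the other residues would go undetected, which is the limitation the paragraph above describes.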

In contrast, the recently introduced top-down approach achieves 100% sequence coverage by analyzing the entire protein without prior chemical or enzymatic proteolysis. Based on tandem MS (MS/MS), the intact protein mass and fragment ion masses are obtained with the help of MS peak picking and analysis tools. These data are then compared to a sequence database that houses known and predicted sequences with possible modifications fully annotated. This top-down strategy reduces identification and characterization to a single-step process. Once the protein is identified, not only is there 100% sequence coverage information, but the positions of the modifications on the sequence are also known. In this regard, a database that returns the correct protein forms with modifications from the MS-derived mass is crucial to the top-down strategy.
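A minimal sketch of the lookup described above, assuming a toy in-memory list in place of a real database: each candidate protein form stores a precomputed intact mass, and a search returns the forms whose mass lies within a tolerance of the observed mass. The residue masses and PTM mass shifts are standard monoisotopic values; the sequence, the set of forms, the tolerance, and all function names are hypothetical, and the sketch does not reproduce the ProSight Warehouse schema.

```python
# Monoisotopic residue masses in daltons (standard values; subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056  # added once per intact chain
# Mass shifts for a few common PTMs (assumed subset).
PTM_DELTA = {"acetyl": 42.01057, "methyl": 14.01565, "phospho": 79.96633}


def intact_mass(sequence, ptms=()):
    """Monoisotopic mass of an intact chain carrying the given PTMs."""
    residues = sum(RESIDUE_MASS[r] for r in sequence)
    return residues + WATER + sum(PTM_DELTA[p] for p in ptms)


def build_forms(sequence, ptm_sets):
    """Precompute one record per protein form, with its mass stored."""
    forms = []
    for ptms in ptm_sets:
        name = sequence + "".join("+" + p for p in ptms)
        forms.append({"name": name, "ptms": tuple(ptms),
                      "mass": intact_mass(sequence, ptms)})
    return forms


def search_by_mass(forms, observed, tol_da=0.05):
    """Return the forms whose stored mass lies within tol_da of observed."""
    return [f for f in forms if abs(f["mass"] - observed) <= tol_da]


# Hypothetical toy data: a five-residue chain with three candidate forms.
forms = build_forms("SGRGK", [(), ("acetyl",), ("acetyl", "methyl")])
observed = intact_mass("SGRGK", ("acetyl",)) + 0.01  # simulated measurement
hits = search_by_mass(forms, observed)
print([h["name"] for h in hits])
```

Because each stored form carries its modifications, a mass match identifies the protein and names its modifications in the same step. In a relational database the same lookup would be a range query on an indexed mass column.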

In this dissertation, I will describe how I have built a data warehouse for known and predicted protein forms using a strategy I have termed “Shotgun Annotation” [Pesavento et al. 2004]. This strategy significantly accelerates protein identification by allowing automated characterization of multiply-modified proteins by top-down mass spectrometry. I will also discuss ways to regulate the size of the database, which can grow exponentially with the number of PTMs it supports, and I will present the methodology for integrating biological databases into the top-down proteomics data warehouse. Database integration is known to be one of the most challenging problems in database research, due to obstacles such as data models and data transformations, semantic schema and semantic data matching, and schema integration [Davidson et al. 1995]. Here, I will explore the methods that I have used to integrate biological databases. The main challenges were that some biological sources do not provide their data in a downloadable format and that building links between biological databases was not trivial.
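The exponential growth mentioned above follows from simple counting: if each modifiable site can independently be unmodified or carry one of several PTMs, the number of distinct protein forms is the product of the per-site choices. The sketch below illustrates only this combinatorial growth, under the assumption that sites are independent. It is not the Shotgun Annotation procedure itself, and the site positions and PTM names are hypothetical.

```python
from itertools import product


def count_forms(site_options):
    """Number of forms when each site is unmodified or takes one listed PTM."""
    total = 1
    for options in site_options.values():
        total *= 1 + len(options)  # +1 for the unmodified state
    return total


def enumerate_forms(site_options):
    """Yield every form as a dict mapping position to PTM (modified sites only)."""
    positions = sorted(site_options)
    choices = [[None] + list(site_options[p]) for p in positions]
    for combo in product(*choices):
        yield {pos: ptm for pos, ptm in zip(positions, combo) if ptm is not None}


# Hypothetical annotation: four acetylation sites and one site with two options.
sites = {
    5: ["acetyl"], 8: ["acetyl"], 12: ["acetyl"], 16: ["acetyl"],
    20: ["methyl", "dimethyl"],
}
print(count_forms(sites))  # 2 * 2 * 2 * 2 * 3 forms

# With n independent on/off sites the count is 2**n.
for n in (10, 20, 30):
    print(n, count_forms({i: ["acetyl"] for i in range(n)}))
```

Each additional binary site doubles the number of stored forms, which is why a warehouse of this kind needs a way to bound how many modifications it expands.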


