DATABASE SUPPORT FOR TOP-DOWN PROTEOMICS
BY YONG-BIN KIM
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2010
Professor Geneva Belford, Chair
Professor Neil L Kelleher
Professor Jiawei Han
Professor Chengxiang Zhai

Abstract

Top-down proteomics is a revolutionary approach to the identification and characterization of proteins, a task known to be one of the most complicated and challenging in biology. In top-down proteomics, the quality and speed of the data warehouse are very important, because highly accurate results are obtained through a database search. ProSight Warehouse fills this critical role as the data warehouse for ProSight PTM, the first publicly available top-down proteomics software suite. MySQL, a free relational database, is the base of this warehouse. Many annotated and predicted protein forms have been successfully incorporated into the organism-specific databases and into the integrated database for human strains. To achieve high quality and efficiency, a database schema (Absolute Mass Search), data annotation methods (Shotgun and Extended Shotgun Annotation), data population strategies (on-the-fly population and bulk loading), and a database integration methodology for human proteins were developed. With the successful implementation of ProSight Warehouse, ProSight PTM achieved its aim: highly accurate protein identification and characterization.
Acknowledgements

It is with a deep feeling of gratitude and appreciation that I write this to acknowledge the many people who have helped me reach my goal. I am very excited to be able to write this page after a long journey towards my Ph.D. degree. The research for this dissertation could not have been completed without direct help from Prof. Neil Kelleher, the members of the ProSight Development team, and the Kelleher group. Prof. Kelleher supported me academically and with a research assistantship, and I cannot thank him enough for that. I especially want to thank Dr. Richard LeDuc, Gregory Taylor, Ryan Fellers, Leonid Zamdborg, Dr. James Pesavento, Dr. Andrew Forbes, Dr. Michael Roth, and Craig Wenger of the Kelleher group for their contributions to my research. Prof. Jiawei Han and Prof. Chengxiang Zhai patiently waited for me to finish my work, and without their advice I could not be here. It is my honor to have them on my Ph.D. committee. Most of all, I would like to express my greatest appreciation to my advisor, Prof. Geneva Belford. She is my mentor, academic advisor, and inspiration. She has never stopped encouraging me to finish my degree. As I look back on my years of research here, it is not only her tremendous contribution, but also her kindness and heart, that will stand out most in my mind. I am so privileged to have her as my advisor.
I am so lucky to have friends who have faith in me and have always stood by me. Dr. Soon Pack Park originally inspired me to earn a Ph.D., and he has been unwavering in his support.
Mary Beth Kelley, Holy Bagwell, Kathy Runck, and Angie Bingaman in the CS academic office are great friends, and I really appreciate all their help. I would especially like to thank all of the members of the online forum DVD Prime (http://www.dvdprime.com), who gave me so much uplifting support. They really helped me when I was down and searching for my motivation.
Life Church for their prayers. I want to send special gratitude to my closest friend, Matt Moran, who proofread my thesis from beginning to end right when I needed it.
Lastly, I give all my love to my family for their sacrifice. The continuing support of my parents and parents-in-law in Korea was the compelling force that let me pursue my goal. I really thank my kids, Michael and Michelle, who never allowed me to give up and who surprised me, cheered me up, and made me smile through this. They are my reason to live. And I cannot possibly express enough appreciation for my wife Soojeong, who is my heart and soul. Having her as my wife is the luckiest thing that ever happened to me. I dedicate my dissertation to her.
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Background
2.1. Genomics and Proteomics
2.2. Mass Spectrometry (MS)
2.3. Top‐down Proteomics Software
2.3.1. ProSight PTM
2.3.2. ProSight Retriever
2.3.3. ProSight PC
2.3.4. ProSight Warehouse (PTM Warehouse)
Chapter 3. Top‐down Proteomics Database
3.1. Database Design
3.1.1. Absolute Mass Search Schema
3.1.2. Automatic Protein Characterization
3.2. Data Annotation
3.2.1. Shotgun Annotation
3.2.2. Extended Shotgun Annotation
3.3. Data Generation
3.3.1. Predicting Protein Forms
3.3.2. PTM Trimming
3.4. Data Population
3.4.1. On‐the‐fly Population
3.4.2. Bulk Loading
3.5. Data Model
3.5.1. 2‐Tier Approach
3.5.2. 3‐Tier Approach
3.5.3. 4‐Tier Approach
Chapter 4. Database Integration
4.1. Database Format
4.1.1. FASTA Format
4.1.2. SWISS‐PROT Format
4.1.3. XML Format
4.1.4. Another Format: dbSNP Database
4.2. Integrated Database
4.2.1. Biological Database Integration
4.2.2. Unified Data Format
4.2.3. Merging SWISS‐PROT Format Databases
4.2.4. dbSNP Integration
4.2.5. HPRD Integration
Chapter 5. Implementation
5.1. ProSight Warehouse
5.1.1. DB_index (Database of databases)
5.1.1.1. DB_info Table
5.1.1.2. PTM_info Table
5.1.1.3. PTM_type Table
5.1.1.4. DB_by_PTM Table
5.1.2. Organism‐specific Databases
5.1.2.1. Gene Table
5.1.2.2. Protein_Form Table
5.2. Database Loader
5.2.1. Swissknife and BioPerl Libraries
5.2.2. dbSNP‐SwissProt Converter
5.2.3. HPRD‐SwissProt Converter
5.2.4. Merging Databases
5.2.5. Basic Forms Generation
5.2.6. Protein Forms Generation
5.2.7. N‐Terminal Modifications
5.2.8. Database Population
5.2.9. Database Loader Configuration
5.2.10. Handling Exceptions
5.2.11. Database Indexing
5.2.12. Other Features
Chapter 6. Performance and Scalability
6.1. Histone Experiment
Chapter 7. Conclusion
7.1. The First Top‐Down Proteomics Software Suite
7.2. The Data Warehouse
7.3. Final Thoughts
List of Figures
Figure 1: ProSight PTM
Figure 2: Screenshot of ProSight PTM
Figure 3: Search space of many closely‐related protein forms
Figure 4: Absolute Mass Search schema
Figure 5: Automatic Protein Characterization schema (Figure courtesy of the Kelleher group)
Figure 6: Shotgun annotation
Figure 7: Shotgun annotation and extended shotgun annotation
Figure 8: PTMs of human in UniProt
Figure 9: PTM Trimming
Figure 10: The 2‐tier approach
Figure 11: The 3‐tier approach
Figure 12: The 4‐tier approach
Figure 13: FASTA format
Figure 14: SWISS‐PROT format file
Figure 15: XML format file of HPRD (Adapted from the HPRD data)
Figure 16: Integration of SWISS‐PROT format databases
Figure 17: dbSNP ER diagram (From ftp://ftp.ncbi.nih.gov/snp/database/b124/mssql/schema/erd_dbSNP.pdf)
Figure 18: Integration of UniProt, HPI and dbSNP
Figure 19: Entire process of integrating UniProt, HPI, dbSNP, and HPRD
Figure 20: Database information on ProSight PTM
Figure 21: Database Loader flow chart
Figure 22: Feature keys of UniProt entry
Figure 23: Configuration file of Database Loader
Figure 24: Histone H4 sequence
Figure 25: Retrieval time with the size of the database (No Indexing)
Figure 26: Retrieval time with the size of the database (Indexed)
Figure 27: ProSight PTM 2.0
Figure 28: Fragmentation details of ProSight PTM 2.0
Figure 29: Simple databases on ProSight PTM
Figure 30: Highly annotated databases on ProSight PTM
List of Tables
Table 1: Histone H4 database
Table 2: Search time results
Chapter 1. Introduction

Proteins play a vital role in living organisms. Research on the structure and function of proteins expands the scope of biology and chemistry. Because proteins have complicated structures and biological functions, protein research needs significant collaboration from researchers in other areas, such as computer science, which increasingly interacts with this domain as the volume of data produced about complex biological systems grows. Now that the Human Genome Project, which identified genes and determined the sequence of nucleotides in human DNA, has ended, people are paying more attention to the proteins formed from genes. The nucleotide sequences that are translated into amino acids are known; however, the modifications that happen during protein formation cannot yet be predicted, especially in human cells. Such modifications drastically increase the number of protein types and the diversity of living organisms. Identifying and characterizing proteins has been one of the most challenging tasks in biology, and many strategies have been devised to accomplish it.
The development of high-resolution analytical tools, like mass spectrometry (MS), has accelerated the protein identification and characterization process. With the high-accuracy data these tools generate, scientists can measure protein mass and analyze the primary structure of proteins. However, researchers still need in-depth analysis of the data from the mass spectrometer, since a protein may carry an unknown number of modifications, such as “post-translational modifications (PTMs),” which regulate the biological functions of proteins and change their mass. Hence many analytical methods have been developed to identify and characterize proteins harboring modifications. One of the most well-known and proven techniques for analyzing a protein is the so-called “bottom-up” approach.
In the bottom-up approach, proteins are digested with enzymes, called proteases, and are cleaved into peptide fragments between 5 and 20 amino acids long. Then these digested peptide samples are fed into the MS for mass measurement. These results are compared to proteins in a sequence database. This approach is sometimes called “peptide mass fingerprinting,” since unique peptide fragments serve as fingerprints to identify where the protein came from. The bottom-up approach allows high throughput protein identification and is now used extensively.
However, since this approach does not see the entire protein, it does not cover the whole sequence, and this leads to incomplete and insufficient characterization of an intact protein.
Although existing tools such as SEQUEST [Tabb et al. 2000], Mascot [Perkins et al. 1999], and ProFound [Zhang et al. 2000] have been used to identify proteins more accurately, the bottom-up approach cannot reach 100% sequence coverage. Thus, the characterization process suffers from this fractional data limitation.
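The coverage limitation described above can be illustrated with a minimal sketch of an in-silico digest. The cleavage rule (cut after K or R unless the next residue is P) is a common trypsin-like convention assumed here for illustration; the text only speaks of proteases in general. The function names and the toy sequence are hypothetical, and the 5 to 20 residue window is taken from the fragment lengths quoted above.

```python
def digest(sequence):
    """Return (start, end) spans produced by a trypsin-like cleavage rule.

    Assumption: cleave after K or R, except when the next residue is P.
    """
    spans, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            spans.append((start, i + 1))
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        spans.append((start, len(sequence)))
    return spans


def coverage(sequence, min_len=5, max_len=20):
    """Fraction of residues that fall in peptides of observable length."""
    covered = sum(
        end - start
        for start, end in digest(sequence)
        if min_len <= end - start <= max_len
    )
    return covered / len(sequence)


if __name__ == "__main__":
    toy = "AGKSTLLDEVARPGGKMR"  # hypothetical sequence, not real data
    for start, end in digest(toy):
        print(toy[start:end])
    print(f"coverage: {coverage(toy):.2f}")
```

For this toy sequence only the 13-residue peptide falls inside the length window, so 13 of the 18 residues are covered; the short peptides at either end are lost, which is the kind of gap that keeps bottom-up coverage below 100%.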
In contrast, the recently introduced top-down approach achieves 100% sequence coverage by analyzing the entire protein without prior chemical or enzymatic proteolysis. Based on tandem MS (MS/MS), the intact protein mass and fragment ion masses are obtained with the help of MS peak picking and analysis tools. This data is then compared to a sequence database that houses known and predicted sequences with possible modifications fully annotated. This top-down strategy reduces identification and characterization to a single-step process. Once the protein is identified, not only is 100% sequence coverage obtained, but the locations of the modifications on the sequence are also known. In this regard, a database that returns the correct protein forms with modifications from the MS-derived mass is crucial to the top-down strategy.
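As a rough illustration of the database comparison described above, the following sketch matches an observed intact mass against candidate protein forms within a ppm tolerance. It is not the ProSight search; it covers only the intact-mass step and ignores fragment ions. The residue and PTM mass values are standard monoisotopic masses, while the function names, the candidate list, and the 10 ppm default are assumptions made for the example.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565  # added once for the free N- and C-termini
PTM_SHIFT = {"acetyl": 42.010565, "methyl": 14.015650, "phospho": 79.966331}


def intact_mass(sequence, ptms=()):
    """Neutral monoisotopic mass of a protein form with the listed PTMs."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + sum(PTM_SHIFT[p] for p in ptms)


def search(observed_mass, candidates, tolerance_ppm=10.0):
    """Return the candidate forms whose mass lies within the ppm tolerance."""
    hits = []
    for name, sequence, ptms in candidates:
        theoretical = intact_mass(sequence, ptms)
        error_ppm = (observed_mass - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, theoretical, error_ppm))
    return hits


if __name__ == "__main__":
    # Hypothetical candidate forms of a toy peptide, for illustration only.
    candidates = [
        ("unmodified", "SGRGK", ()),
        ("acetylated", "SGRGK", ("acetyl",)),
        ("methylated", "SGRGK", ("methyl",)),
    ]
    observed = intact_mass("SGRGK", ("acetyl",))
    for name, mass, err in search(observed, candidates):
        print(f"{name}: {mass:.4f} Da ({err:+.2f} ppm)")
```

Because each stored form already carries its modifications, a mass match identifies the protein and its modification state in the same lookup, which is why the annotated forms must be present in the database beforehand.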
In this dissertation, I will describe how I have built a data warehouse for known and predicted protein forms using a strategy I have termed “Shotgun Annotation” [Pesavento et al. 2004]. This strategy significantly accelerates protein identification by allowing automated characterization of multiply-modified proteins by top-down mass spectrometry. I will also discuss ways to regulate the size of the database, which can grow exponentially with the number of PTMs it supports. The methodology for integrating biological databases into the top-down proteomics data warehouse will also be shown. Database integration is known to be one of the most challenging problems in database research, due to obstacles like data models and data transformations, semantic schema and semantic data matching, and schema integration [Davidson et al. 1995]. Here, I will explore the methods that I have used to integrate biological databases. The challenges that I faced were that some of the biological sources do not provide their data in a downloadable format and that building links between biological databases was not a trivial job.
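The exponential growth mentioned above can be made concrete with a small counting sketch. It assumes the simplest possible model, in which each annotated site is either modified or unmodified with a single PTM type, so n sites yield 2^n protein forms. This is only an illustration of the combinatorics, not the Shotgun Annotation procedure itself, and the site labels and function names are hypothetical.

```python
from itertools import combinations


def enumerate_forms(sites):
    """Yield every subset of sites, i.e. every modified/unmodified combination."""
    for k in range(len(sites) + 1):
        for subset in combinations(sites, k):
            yield subset


def count_forms(n_sites):
    """Number of forms when each of n_sites is independently modified or not."""
    return 2 ** n_sites


if __name__ == "__main__":
    # Hypothetical acetylation sites, for illustration only.
    sites = [("K5", "acetyl"), ("K8", "acetyl"), ("K12", "acetyl")]
    forms = list(enumerate_forms(sites))
    print(len(forms), "forms for", len(sites), "sites")
    for n in (5, 10, 20):
        print(n, "sites ->", count_forms(n), "forms")
```

Under this model, twenty sites on a single sequence already produce more than a million candidate forms, which is why the size of the database has to be regulated.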