Development of computational methods
for the prediction of
protein structure, protein binding, and mutational effects
using free energy calculations.
Entwicklung computergestützter Methoden
zur Vorhersage von
Proteinstrukturen, Proteinbindung und Mutationseffekten
mithilfe von Freie Energie Berechnungen.
Submitted to the Faculty of Sciences
of the Friedrich-Alexander-Universität Erlangen-Nürnberg
for the attainment of the doctoral degree Dr. rer. nat.
by Caroline Becker, from Saarbrücken

Approved as a dissertation by the Faculty of Sciences of the Friedrich-Alexander-Universität Erlangen-Nürnberg
Date of the oral examination: 10.02.2014
Chair of the doctoral committee: Prof. Dr. Johannes Barth
Reviewers: Prof. Dr. Rainer A. Böckmann, Prof. Dr. Heinrich Sticht, Prof. Dr. Martin Klingler, Prof. Dr. Yves Muller

'In a certain sense you have become a teacher to me, and I almost grew fond even of the mole. Nevertheless, I step aside (...).'
– Franz Kafka, Der Riesenmaulwurf –

For the colorful and cheerful possibilities of life only truly begin beyond that thoroughly tidying catastrophe which is aptly called civil death, and one of the most hopeful situations in life is when things are going so badly for us that they cannot get any worse. (...) Once we have him out in the open, the tide will carry him and, as I confidently hope, guide him to beautiful shores.
– Thomas Mann, Bekenntnisse des Hochstaplers Felix Krull –

Abstract
A molecular understanding of protein-protein or protein-ligand binding is of crucial importance for the design of proteins or ligands with defined binding characteristics. The comprehensive analysis of biomolecular binding and the coupled rational in silico design of protein-ligand interfaces require methods for the prediction of free energies that are both accurate and computationally fast. Accurate free energy methods usually involve atomistic molecular dynamics simulations that are computationally prohibitive for large mutational screens. Fast prediction methods, in turn, are frequently based on empirically derived scoring functions that do not take protein flexibility into account and largely depend on the availability of experimental training data. Here, a novel fast and accurate structure-based method (CC/PBSA) for the prediction of the effect of mutations on the binding affinity was developed, taking both protein and ligand flexibility into account. CC/PBSA is based on a physical effective energy function, combining molecular mechanics force field contributions and continuum solvent energies with a fast sampling of the configurational space.
In the latter step, alternative protein and ligand conformations are sampled based on geometric criteria, starting from random coordinates. This inclusion of full flexibility is shown to dramatically improve the prediction of mutational effects on protein-protein binding affinities. The excellent scalability of the method together with its high accuracy enables full mutational scans of protein-protein interfaces.
Additionally, CC/PBSA was extended to the prediction of absolute affinities in protein-protein binding and successfully tested on a number of experimentally known complexes. The prediction of mutational changes in binding affinities and of absolute affinities was complemented by a structure prediction, in a first step tailored to MHC:peptide complexes. The method is easily extendable to different protein-peptide systems and may serve as a valuable tool in the search for high-affinity binders. It makes use of loosened geometric constraints for non-specific peptide positions in the generation of structural ensembles. Subsequently, the structures are weighted based on a scoring function composed of different energetic contributions.
Understanding the molecular processes governing life is a major driving force in research. As organic processes take place at time and length scales not accessible to our normal perception, instruments are required to bridge these gaps in time and size.
The first instruments of this kind were developed centuries ago. The invention of the microscope at the end of the 16th century doubtlessly started a new era in this field: the Dutch spectacle-maker Hans Janssen is said to be the inventor of the optical microscope and thereby the discoverer of new worlds of magnitude. The development of instruments that help to see particles and processes not observable with the naked eye has advanced considerably since then and is still far from complete. A vast number of experimental techniques, such as X-ray crystallography, NMR spectroscopy, and more recently stimulated emission depletion (STED) microscopy, have helped to reveal the atomic structures, molecular compositions, and many of the interactions occurring in cells.
In this computational work, the riddle will be approached from a different perspective. While experimental methods start from our (human) perception and try to expand and stretch it as far as possible, computational methods aim at rebuilding molecular systems from their basic elements, composing them into larger systems. This way of observation has opened up new lines of research, but it also poses new challenges and difficulties. The benefits are obvious: in silico methods can describe biological processes at the atomistic level and are thus able to address processes in molecular detail, translating them to our perception.
Biological systems are nowadays partially rebuilt in the computer and simulated as realistically as possible, i.e. the motions of the constituents are followed as a function of time. They are composed of as many particles and interactions as current computer systems can cope with. Computational resources are limited, though. The number of particles that can be treated at once and the level of detail at which their interactions are described are limited by memory size and processor speed. Quantum physics provides the general rules underlying the motion and interactions of atoms and molecules. However, calculations of quantum-mechanical forces between all atoms are limited to very small numbers of atoms and short time scales. Simulations of larger systems and longer time spans require a significant reduction of the system complexity. Therefore, a vast number of approximations and simplifications to the rigorous theory of quantum mechanics exist, leading to a gain in simulation length and system size at the cost of accuracy.
Finding suitable approximations for the task at hand, without losing too much accuracy, is the challenge of most computational methods in the biomolecular life sciences.
The main concept when building computational systems – using the necessary approximations – is to prove the correctness of the model or system. If this proof can be given, the method can be applied in research. As the accessible length and time scales differ, computational methods do not compete with experimental methods but rather complement them, by helping to explain observed mechanisms or by suggesting new experiments.
In this work, the field of computational methods is extended by a method to describe and analyze the interaction patterns of proteins. Proteins regulate most processes in the cell. Furthermore, they are involved in a multitude of diseases.
Therefore, they are a focus of pharmaceutical research and contribute to the majority of therapeutic schemes. An introduction to the composition, structure, and experimental analysis of proteins is given in Chapter 1.
Modeling the behavior and the interactions of proteins requires a detailed understanding of the underlying mechanisms and energetics. Rational drug design is not possible without knowledge about the binding properties of the protein at hand. Here, a computational method is established to calculate and characterize the fundamental energetic contributions dictating the protein binding mechanisms.
The underlying physical principles and basic computational tools for this work are described in Chapter 2.
Currently, classical molecular mechanics studies are widely used for the in silico analysis of proteins and their interactions in solution. All-atom molecular dynamics (MD) simulations [4, 5] nowadays reveal the dynamics of systems of up to several million atoms, or cover time scales of up to 100 µs from a single trajectory, or 30 ms when aggregating a multitude of trajectories from the Folding@home computing network. Using approximations, e.g. a coarse-grained atom description, simulations are 100 times faster than classical all-atom simulations, enabling simulations on the millisecond time scale.
Unraveling the free energy landscape of proteins is the key to understanding their folding and binding mechanisms in solution. In general, however, the complete free energy landscape is not accessible; free energy computations are therefore restricted to the analysis of discrete states of the system.
Several free energy prediction methods have been developed in the past for the energetic characterization of protein binding. The most accurate prediction methods are based on MD simulations, namely the free energy perturbation (FEP) method and the thermodynamic integration (TI) scheme. Both methods simulate the transition of a biomolecular system from its initial state to the final state, adding up the energy differences along a (frequently unphysical) path. Due to the underlying all-atom MD simulations, the huge computational cost of FEP and TI limits these methods to the study of up to 100 mutants [13, 14, 15, 16] in calculations of protein stabilities or protein binding affinities.
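As a toy illustration of the two estimators named above, the following sketch computes the free energy difference between two one-dimensional harmonic wells, once with the exponential-averaging (Zwanzig) form of FEP and once with TI along a linear coupling parameter. Everything specific here is an assumption made for illustration and is not taken from the cited studies: the harmonic potentials, the force constants, the value of kT, and the use of exact Gaussian sampling in place of MD.

```python
import numpy as np

KT = 2.494  # thermal energy in kJ/mol at roughly 300 K (illustrative value)


def sample_harmonic(k, n, rng):
    """Draw n Boltzmann-distributed positions for U(x) = 0.5 * k * x**2."""
    return rng.normal(0.0, np.sqrt(KT / k), size=n)


def fep_delta_f(k0, k1, n=200_000, seed=0):
    """FEP (Zwanzig): dF = -kT * ln < exp(-(U1 - U0) / kT) >_0."""
    rng = np.random.default_rng(seed)
    x = sample_harmonic(k0, n, rng)          # ensemble of the initial state
    d_u = 0.5 * (k1 - k0) * x**2             # U1(x) - U0(x)
    return float(-KT * np.log(np.mean(np.exp(-d_u / KT))))


def ti_delta_f(k0, k1, n_lambda=11, n=50_000, seed=0):
    """TI: dF = integral over lambda of < dU/dlambda >_lambda,
    with U_lambda = (1 - lambda) * U0 + lambda * U1."""
    rng = np.random.default_rng(seed)
    lambdas = np.linspace(0.0, 1.0, n_lambda)
    means = []
    for lam in lambdas:
        k_lam = (1.0 - lam) * k0 + lam * k1   # U_lambda is again harmonic
        x = sample_harmonic(k_lam, n, rng)
        means.append(np.mean(0.5 * (k1 - k0) * x**2))  # <dU/dlambda>
    means = np.array(means)
    # Trapezoidal integration over the lambda grid.
    return float(np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(lambdas)))


def analytic_delta_f(k0, k1):
    """Exact result for two harmonic wells: dF = (kT / 2) * ln(k1 / k0)."""
    return 0.5 * KT * np.log(k1 / k0)


if __name__ == "__main__":
    print("FEP     :", fep_delta_f(100.0, 200.0))
    print("TI      :", ti_delta_f(100.0, 200.0))
    print("analytic:", analytic_delta_f(100.0, 200.0))
```

In a real biomolecular application each ensemble average would require its own MD simulation, which is where the computational cost discussed above arises.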
A gain in efficiency was obtained by the introduction of the simplified MM/PBSA and MM/GBSA methods. In these so-called end-state methods, only the conformational space of the initial and final states is sampled by MD simulations. The free energy of the two states includes enthalpic and entropic contributions, combining molecular mechanics contributions for the protein with an implicit continuum solvent contribution. MM/PBSA has recently been reported to reach an accuracy similar to that of the rigorous TI. The decreased computation time of the MM/PBSA method enabled studies of datasets of around 20 to 50 protein-ligand complexes. Mutation studies using MM/PBSA [15, 20, 21, 22, 17] generally cover up to 30 mutants. However, virtual screening approaches, generally applied in the pharmaceutical industry, e.g. for lead discovery, require the evaluation of larger datasets. Protein-protein interfaces range from a few nm² to tens of nm². Most interfaces, however, have a (one-sided) interface size of only 600-800 Å² (thus 1200-1600 Å² buried area), comprising around 30 interface residues. The calculation of alanine scans of the interface – i.e. the sequential mutation of all interface residues to alanine for the analysis of their influence on the binding affinity – is thus within the range of MM/PBSA. However, the computation of complete scans – i.e. the mutation of all interface residues to all other amino acids – and multiple mutation analysis – e.g. for the analysis of interacting residues and cooperative clusters in the interface – both require the calculation of several hundreds or even thousands of mutations and are therefore out of the range of the MM/PBSA method.
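In a commonly used textbook form, the end-state estimate described above can be written as follows. The symbols are generic and are an assumption of this sketch, not necessarily the notation adopted later in this work; angle brackets denote averages over the sampled conformations of the respective state.

```latex
\begin{align}
  \Delta G_{\mathrm{bind}} &\approx
      \langle G_{\mathrm{complex}} \rangle
      - \langle G_{\mathrm{receptor}} \rangle
      - \langle G_{\mathrm{ligand}} \rangle , \\
  G &= E_{\mathrm{MM}}
      + G_{\mathrm{solv}}^{\mathrm{polar}}
      + G_{\mathrm{solv}}^{\mathrm{nonpolar}}
      - T S_{\mathrm{solute}} .
\end{align}
```

Here $E_{\mathrm{MM}}$ is the molecular mechanics energy of the solute, the polar solvation term is obtained from a continuum electrostatics model (Poisson-Boltzmann in MM/PBSA, generalized Born in MM/GBSA), the nonpolar term is typically estimated from the solvent-accessible surface area, and $-T S_{\mathrm{solute}}$ is the entropic contribution of the solute.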
For the analysis of protein interfaces and protein design, such comprehensive mutational scans are mandatory, thus faster methods are required.
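The scale of these scans follows from simple counting, sketched below for the interface of around 30 residues mentioned above. The function names are hypothetical, and the counts are upper bounds under the assumption that every interface position can be mutated to each of the 19 other standard amino acids (and, for the alanine scan, that no interface residue is already alanine).

```python
from math import comb

N_AMINO_ACIDS = 20  # standard proteinogenic amino acids


def alanine_scan_size(n_interface):
    """One mutation per interface residue (each residue -> alanine)."""
    return n_interface


def complete_scan_size(n_interface):
    """Every interface residue mutated to each of the 19 other amino acids."""
    return n_interface * (N_AMINO_ACIDS - 1)


def multi_mutant_count(n_interface, order):
    """Number of simultaneous mutants of a given order: choose the positions,
    then one of the 19 alternative residues at each chosen position."""
    return comb(n_interface, order) * (N_AMINO_ACIDS - 1) ** order


if __name__ == "__main__":
    n = 30
    print("alanine scan     :", alanine_scan_size(n))
    print("complete scan    :", complete_scan_size(n))
    print("all double mutants:", multi_mutant_count(n, 2))
```

For 30 interface residues, a complete single-site scan already comprises 570 mutants, and an exhaustive enumeration of double mutants grows to more than 150,000, which is why multiple-mutation studies are usually restricted to selected residue combinations.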
The computationally cheapest structure-based methods are based on statistical potentials, e.g. the recently developed BeAtMuSiC method. The potential is described based on relative atom positions of the structure, parametrized over a training set. Such empirical methods usually suffer from insufficient or biased training sets. The BeAtMuSiC method achieved a correlation of only 0.55 to experiment over the full dataset. Other fast methods, e.g. FoldX, show better results and are thus well established for the fast prediction of single alanine mutations, but with decreased accuracy for multiple and non-alanine mutations. Methods based on physicochemical force fields, e.g. EGAD, use a physical free energy function for the prediction of mutational folding and binding free energies. Flexibility is included by using a rotamer library for side chain orientations. However, the full flexibility of proteins, including backbone flexibility, is relevant for computational protein design methods. The structure description of EGAD is restricted to a fixed backbone, which precludes the reliable prediction of mutations that are likely to influence the backbone conformation, e.g. mutations to cysteine, glycine, and proline.
Here, a novel method is presented that includes the full protein flexibility required for reliable predictions – both side chain and backbone flexibility – but without applying time-consuming MD simulations. Instead, alternative protein conformations are built by iteratively correcting random starting coordinates until all geometric restraints derived from a starting structure are fulfilled (CONCOORD). The generated structures efficiently sample the accessible conformational space. The free energy change between different states is then calculated based on a physical effective energy function – comparable to the energy function used in MM/PBSA – averaged over the generated ensemble of structures.
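The idea of building conformations by correcting random coordinates can be illustrated with a strongly simplified sketch. It is not the CONCOORD implementation: the uniform fractional tolerance on all pair distances, the correction rule, the small point set, and all function names are assumptions made for illustration, whereas the actual program works with chemically motivated restraints on a full protein structure.

```python
import numpy as np


def distance_bounds(ref, tol=0.15):
    """Lower/upper bounds for every pair distance, taken as (1 -/+ tol) times
    the distance in the reference structure (hypothetical uniform rule)."""
    d = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=-1)
    return (1.0 - tol) * d, (1.0 + tol) * d


def count_violations(x, lower, upper, eps=1e-9):
    """Number of pairs whose distance lies outside its bounds (with slack eps)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return int(np.sum((d[iu] < lower[iu] - eps) | (d[iu] > upper[iu] + eps)))


def correct_coordinates(start, lower, upper, max_sweeps=500):
    """Sweep over all pairs and move violating pairs back inside their bounds.
    Returns the corrected coordinates, or None if no convergence is reached."""
    x = np.array(start, dtype=float)  # working copy
    n = len(x)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                vec = x[j] - x[i]
                d = max(float(np.linalg.norm(vec)), 1e-12)
                lo, hi = lower[i, j], upper[i, j]
                if lo <= d <= hi:
                    continue
                # Aim slightly inside the allowed interval, not at its edge.
                margin = 0.25 * (hi - lo)
                target = min(max(d, lo + margin), hi - margin)
                shift = 0.5 * (target - d) / d * vec
                x[i] -= shift  # both points move by half the correction
                x[j] += shift
                changed = True
        if not changed:  # a full sweep without corrections: all bounds hold
            return x
    return None


def generate_ensemble(ref, n_structures, tol=0.15, seed=0, max_attempts=20):
    """Collect structures that satisfy all bounds, each from a random start."""
    ref = np.asarray(ref, dtype=float)
    rng = np.random.default_rng(seed)
    lower, upper = distance_bounds(ref, tol)
    half_box = 2.0 * np.abs(ref).max()
    ensemble = []
    for _ in range(max_attempts):
        if len(ensemble) == n_structures:
            break
        start = rng.uniform(-half_box, half_box, size=ref.shape)
        x = correct_coordinates(start, lower, upper)
        if x is not None:  # discard attempts that did not converge
            ensemble.append(x)
    return ensemble
```

Because only distances are restrained in this sketch, mirror images of the reference are accepted as well. An ensemble average of an energy function over the returned structures would then take the place of the MD-based average used in MM/PBSA.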
The comprehensive task of protein binding prediction and analysis can be separated into the prediction of mutation-induced changes in binding affinity and the prediction of absolute protein-protein binding affinities.
In the former task, the different energetic contributions and the composition of the binding interfaces of protein-protein complexes can be analyzed by mutational analysis: single (or multiple) amino acids are mutated to infer their contribution to binding. Predictions of the effect of mutations are crucial, e.g., for drug design. Chapter 3 introduces the CC/PBSA method for predicting the effect of mutations on protein-protein binding.
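The two quantities distinguished above are related by standard thermodynamic definitions, summarized here in generic notation (the symbols and the sign convention are assumptions of this sketch and may differ from those used in later chapters):

```latex
\begin{align}
  \Delta G_{\mathrm{bind}} &= RT \ln \frac{K_d}{c^{\circ}} , \\
  \Delta\Delta G_{\mathrm{bind}} &=
      \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
      = RT \ln \frac{K_d^{\mathrm{mut}}}{K_d^{\mathrm{wt}}} .
\end{align}
```

Here $K_d$ is the dissociation constant, $c^{\circ}$ the standard concentration (1 M), $R$ the gas constant, and $T$ the temperature. With this convention, a positive $\Delta\Delta G_{\mathrm{bind}}$ indicates that the mutation weakens binding, and the standard concentration cancels in the relative quantity.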