This thesis is organized as follows: Chapter 2 introduces the application domain of protein structure prediction. We focus our introduction on the information sources that are employed to solve the structure prediction problem. Chapter 3 reviews the related work. Again, we focus on algorithms that leverage information for protein structure prediction. Chapter 4 introduces the technical background in network analysis, machine learning, and cross-linking/mass spectrometry. We recommend this chapter if the reader is unfamiliar with these topics. In Chapter 5, we introduce our approach to contact prediction from protein decoys with physicochemical information (EPC-map). Chapter 6 presents the verification of this approach in the CASP11 experiment. In Chapter 7, we introduce a novel protein structure determination method that combines high-density cross-linking/mass spectrometry data with conformational space search. With this method, we reconstruct the domain structures of human serum albumin from protein samples in complex biological matrices. Chapter 8 describes our approach of refining structural constraints with corroborating evidence. Chapter 9 concludes this thesis and outlines future work. Appendix A summarizes evaluation criteria used in contact and structure prediction. Appendix B lists the features used in EPC-map. Appendix C lists the training set and test set of EPC-map that were constructed for this thesis.
Residue-residue contact prediction uses features calculated on multiple sequence alignments to predict which residue positions of a protein family are interacting. While there has been agreement for over 30 years that contact predictions could be of use for protein structure prediction, the contact prediction methods developed until the late 2000s were not accurate enough for blind 3D structure prediction, as too many sources of noise caused many false-positive predictions. It took until 2008 for the major noise sources of entropic and phylogenetic effects, and until 2009 for indirect couplings, to be eliminated. In 2011, it was finally shown that these improvements could be combined with an NMR-based structure prediction program to fold de novo protein structures using only sequence data, causing a spike in interest in the development of even more accurate contact prediction methods.
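The noise corrections mentioned above can be illustrated with a toy coupling analysis. The sketch below scores alignment column pairs by mutual information and then applies the average-product correction (APC), one standard way to suppress entropic and phylogenetic background signal; the miniature alignment and all function names are illustrative, not part of any published method.

```python
import math
from collections import Counter
from itertools import combinations

def column_entropy(col):
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())

def mutual_information(col_i, col_j):
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))
    h_joint = -sum((c / n) * math.log2(c / n) for c in joint.values())
    return column_entropy(col_i) + column_entropy(col_j) - h_joint

def contact_scores(msa):
    """Score all column pairs of an aligned MSA by mutual information,
    then apply the average-product correction (APC) to suppress
    entropic and phylogenetic background signal."""
    length = len(msa[0])
    cols = [[seq[k] for seq in msa] for k in range(length)]
    mi = {}
    for i, j in combinations(range(length), 2):
        mi[(i, j)] = mutual_information(cols[i], cols[j])
    mean_all = sum(mi.values()) / len(mi)
    mean_col = [0.0] * length
    for (i, j), v in mi.items():
        mean_col[i] += v
        mean_col[j] += v
    mean_col = [m / (length - 1) for m in mean_col]
    # APC-corrected score: raw MI minus the expected background product
    return {pair: v - mean_col[pair[0]] * mean_col[pair[1]] / mean_all
            for pair, v in mi.items()}

# Toy alignment: columns 0 and 2 co-vary, mimicking a contacting pair.
msa = ["ARNDC", "AKNEC", "GRQDC", "GKQEC"]
scores = contact_scores(msa)
top = max(scores, key=scores.get)
```

On this toy alignment the co-varying pairs (0, 2) and (1, 3) receive the highest corrected scores, while independent column pairs score near or below zero.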
Comprehensive application studies of bioinformatics approaches were performed, which primarily targeted autoinflammatory and neurodegenerative diseases. A variety of computational tools was used to analyze medically relevant proteins and to evaluate experimental data. Many bioinformatics methods were applied to predict the molecular structure and function of proteins. The results provided a rationale for the design, prioritization, and interpretation of experiments performed by cooperation partners. Some of the generated biological hypotheses were tested and confirmed by experiments. In addition, the application studies revealed limitations of current bioinformatics techniques, which led to suggestions for novel approaches. Three new computational methods were developed to support the prediction of the secondary and tertiary structure of proteins and the investigation of their interaction networks. First, consensus formation between three different methods for secondary structure prediction was shown to considerably improve the prediction quality and reliability. Second, in order to utilize experimental measurements in tertiary structure prediction, scoring functions were implemented that incorporate distance constraints into the alignment evaluation, thus increasing the fold recognition rate. Third, an automatic procedure for decomposing protein networks into interacting domains was designed to obtain a more detailed molecular view of protein-protein interactions, facilitating further functional and structural analyses.
All tools described in this work use some piecewise geometric description of the backbone. NASTI, CVRRY and ALFONS additionally incorporate information about base pairs, and the treatment of helix and non-helix parts of the RNA backbone was a relevant issue in all chapters of this thesis. Another common principle used by all approaches is the extraction of knowledge from the PDB. Experimentally determined structure files were used to parameterise the NASTI energy terms, to search for an optimal structure alphabet representation and to find parameters for CVRRY and ALFONS alignments. In FREEDOLIN and ALFONS, substring matches were weighted by the distribution of structure alphabet letters found in PDB structures. The choice of the dataset used for the parameterisation, however, is not a trivial task. Approaches to quite different problems face similar difficulties here. In comparison to protein structures, the number of available folds for RNAs seems rather limited. There is also the problem of redundancy within the set of known structures. The large numbers of tRNA and ribosome structures do not provide much independent information. Another problem is the uneven distribution of chain lengths. On the one hand, there is a large number of small (<200 nucleotides) structures. On the other hand, there is a large number of ribosomal structures (>10³ nucleotides). Data between these extremes is rather sparse. These problems were handled in multiple ways in this work. The NASTI parameters were obtained only from ribosomal subunits, relying on the belief that their substructures are representative of the whole RNA fold space. A different approach was used for the optimization of structural alphabets and the search for alignment parameters. Here, ribosomal chains were cut into smaller pieces in order to enable effective optimization procedures. In the future, more sophisticated solutions could help to improve both comparison and prediction of RNA structures.
Finally, one can state that the choices of how to pick, design and combine the molecule representation, the algorithms and the parameters are in all cases subject to a trade-off between computational speed and accuracy. In most cases there is a compromise to be made, and one would like to make it as smart as possible.
also called reference sequences, see Section 3.2.3 for details. It is unknown how many of the cDNA/EST/protein sequences will match the gDNA sequence in some location. The alignment problem thus can be divided into two subproblems: first, identification of the cDNA/EST/protein sequences and corresponding gDNA locations that may constitute high-quality matching pairs, and second, derivation of the optimal alignment (delineating the exons and introns in the gDNA). In GenomeThreader, the first task is solved by fast string matching algorithms based on enhanced suffix arrays [AKO04], with a subsequent chaining phase combining several consistent matches. The second task involves application of classical dynamic programming [Bel57]. The idea is to take an expressed gene product (a cDNA/EST or a protein) and perform a “backward calculation” of the biological process shown in Figure 2.2. In so-called splicing (see Section 2.6.1), the introns are cut out and only the enclosing exons remain (which comprise the cDNA/EST/protein sequences). The goal is to reveal the (previously unknown) gene structure from which the (known) product was derived. That is, one aligns the product against the gDNA, allowing for introns. Therefore, this kind of alignment is called spliced alignment. See Section 2.9 for a general introduction to genes.
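The “alignment allowing for introns” idea can be sketched as a small dynamic program. The toy scoring below (flat intron penalty, canonical GT...AG boundaries) is a drastic simplification of what GenomeThreader actually computes and is meant only to illustrate the principle; all parameters are invented.

```python
def spliced_align(cdna, gdna, mismatch=1, gap=2, intron_pen=3, min_intron=4):
    """Toy spliced-alignment DP: align a cDNA against gDNA, where a
    gDNA stretch flanked by the canonical GT...AG dinucleotides may be
    skipped as an intron for a flat penalty (hypothetical scoring)."""
    n, m = len(cdna), len(gdna)
    INF = float("inf")
    # dp[i][j] = minimal cost of aligning cdna[:i] to gdna[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            cur = dp[i][j]
            if i < n and j < m:  # match / mismatch
                cost = 0 if cdna[i] == gdna[j] else mismatch
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cur + cost)
            if i < n:            # base in cDNA unmatched in gDNA
                dp[i + 1][j] = min(dp[i + 1][j], cur + gap)
            if j < m:            # base in gDNA unmatched in cDNA
                dp[i][j + 1] = min(dp[i][j + 1], cur + gap)
            # intron jump: skip gdna[j:k] if it looks like GT...AG
            for k in range(j + min_intron, m + 1):
                if gdna[j:j + 2] == "GT" and gdna[k - 2:k] == "AG":
                    dp[i][k] = min(dp[i][k], cur + intron_pen)
    return dp[n][m]

# Exon1 + GT...AG intron + Exon2 in the gDNA; the cDNA is the spliced product.
gdna = "ATGGCA" + "GTAAAG" + "TTTGAA"
cost = spliced_align("ATGGCATTTGAA", gdna)
```

The optimal alignment here pays only the single flat intron penalty, whereas threading the intron through ordinary gaps would cost far more; this is the “backward calculation” of splicing in miniature.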
The most frequently observed secondary structures in protein structures (i.e. helices and beta sheets) are actually enforced or consolidated by physical hindrance caused by the steric properties of the protein backbone. The physical size of the atoms or groups of atoms in a protein backbone allows the formation of only a limited number of shapes without any clashes. In this regard, the influence of weaker non-covalent interactions, called hydrogen bonds, is quite significant in stabilizing these shapes (i.e. secondary structures, helices and β sheets) and holding the entire structure together. The strength and effect of hydrogen bonds depend upon the environment. The backbone geometries of helices and β sheets facilitate the establishment of systematic and extensible intramolecular hydrogen bonding. If these intramolecular hydrogen bonding patterns were not formed, the folding equilibrium would shift toward unfolding through the development of intermolecular hydrogen bonds with the surrounding water (Baldwin and Rose 1999, Petsko and Ringe 2004). Hydrogen bonds involve electrostatic attractions, either between actual charges (Glu-Lys) or between peptide dipoles (N-H and C=O), to share a proton. Helices involve a repeated pattern of local hydrogen bonds between residues i and i + 3 (in the 3₁₀ helix) or i + 4 (in α
Dihydropyridines (DHPs), such as amlodipine, nifedipine and nimodipine, are the most widely used drugs. They bind to the inactivated state and to an allosteric pocket, without occluding the ion pore. The recent CavAb crystal structures in complex with amlodipine (5kmd) and nimodipine (5kmf) show that the allosteric pocket is hydrophobic and placed on the lipid-facing surface of the pore (Tang et al., 2016). Upon binding of a single amlodipine molecule, the ion channel undergoes conformational changes and the symmetry of the selectivity filter is broken. Consequently, the ion pore cannot coordinate the calcium ion. However, site-directed mutagenesis assays of the Cav1.2 channel identified distinct residues at the interface of repeats III and IV and the III S6 helix with respect to the CavAb structure. Therefore, further structural analyses are still needed to fully understand the inhibition of human calcium channels.
With the structure 2 family having been observed experimentally at high pressure, our computational results for Dalcetrapib show no indication of having missed the thermodynamically stable form at ambient conditions. Within each family, cross-nucleation and low energy barriers for solid–solid phase transitions make it unlikely to encounter a metastable form that does not readily convert to the most stable form in the family; therefore, none of the structures in the two observed families presents a danger regardless of potential computational errors. The structure 4 family and structure 5 exhibit the same hydrogen bonding and a similar molecular conformation as the experimental structures. Hence, there is in principle no apparent reason why their nucleation or growth should be hindered compared with the observed forms. If any of them were the truly stable form, they should have been observed in the experimental screens, whereas only forms A and B, which are thermodynamically the two most stable forms, were repeatedly isolated. Interestingly, this argument also applies to form C; however, this low-energy form could be isolated experimentally by changing another thermodynamic variable, namely pressure, in the crystallization experiments, effectively moving to and probing another portion of the compound’s phase diagram. Once more experience with interpreting crystal energy landscapes has been built up, it may well turn out that form C should never have been considered a candidate for a missing thermodynamically stable form at ambient conditions. For an unobserved low-energy structure to be flagged as a threat after thorough experimental screening, it should probably also present a structural feature suggesting a high nucleation barrier, such as an unlikely molecular conformation according to the statistics of the Cambridge Structural Database.
Some of the unique structures feature other molecular conformations; however, they are all predicted to be less stable than structure 1 by more than
of a chemical reaction and is often used to facilitate the search for a first-order saddle point, which is the point of highest energy on a MEP. In some theories, such as small-curvature tunneling, the MEP is also used to calculate semi-classical reaction rates that incorporate quantum effects like tunneling to a certain extent. Theoretical chemists concerned with, e.g., reaction kinetics spend a lot of time finding such geometries. Finding a converged geometry can be very tedious and may require a lot of human input. Therefore, developing faster and especially more stable algorithms for geometry optimization can greatly benefit future work in computational chemistry. The currently available algorithms are not stable in every case or require substantial computational power. For example, some algorithms require the calculation of Hessians of the PES. For many electronic structure calculations this is not feasible. Ideally, a geometry optimizer works as a black-box system that needs very little human input and still converges in a stable way without requiring information about the Hessian. It also should require as few energy and gradient evaluations of the PES as possible. In this thesis, new approaches to this problem are presented that are based on concepts from machine learning.
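As a baseline for the Hessian-free requirement stated above, the following sketch implements plain BFGS, which builds an approximate inverse Hessian purely from successive gradients; the toy two-dimensional "PES" is invented for illustration and stands in for an expensive electronic structure calculation.

```python
def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch: the inverse Hessian is estimated from
    gradient differences, so no explicit Hessian of the PES is needed."""
    n = len(x0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # inverse-Hessian estimate
    x = list(x0)
    g = grad(x)
    for _ in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]  # search direction
        # crude backtracking line search: halve the step until the energy drops
        t, fx = 1.0, f(x)
        while f([x[i] + t * p[i] for i in range(n)]) > fx and t > 1e-12:
            t *= 0.5
        x_new = [x[i] + t * p[i] for i in range(n)]
        g_new = grad(x_new)
        s = [x_new[i] - x[i] for i in range(n)]
        y = [g_new[i] - g[i] for i in range(n)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition satisfied: BFGS update of H
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(y[i] * Hy[i] for i in range(n))
            H = [[H[i][j]
                  + (1.0 + rho * yHy) * rho * s[i] * s[j]
                  - rho * (Hy[i] * s[j] + s[i] * Hy[j])
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x

# Toy anharmonic 2D surface with its minimum at (1, 2) (illustrative only).
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] - 2) ** 2 + 0.1 * (x[0] - 1) ** 4
grad = lambda x: [2 * (x[0] - 1) + 0.4 * (x[0] - 1) ** 3, 4 * (x[1] - 2)]
xmin = bfgs_minimize(f, grad, [0.0, 0.0])
```

Only energy and gradient evaluations are consumed, which is exactly the budget a PES-based optimizer must economize on; machine-learned surrogates go further by also modeling the surface between evaluations.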
If sequences of high similarity (and known structure) can be found, the method of choice is homology modelling, where the known structures are used to create a template on which the structure to be predicted is modelled. In biology, the term “homology” is used to state common ancestry of proteins. In the context used here, however, it is not necessary to establish true homology in the above sense for the selection of sequences. Instead, sequences of known structure are selected based on their similarity to the query sequence, which makes homology between the two very likely. Still, “homology modelling” is the standard term used in this context, although “comparative modelling” may be used equivalently and is the more accurate term [8, 66, 67]. Protein structure is more strongly conserved in evolution than sequence , so even a moderate level of similarity over the entire sequence suffices to be confident of high structural similarity – although there are a few notable examples where this does not hold, see e.g. Ref.  where two proteins are engineered at 88% sequence identity but with completely different (α as opposed to α/β) folds or Ref.  for two naturally occurring proteins of 40% sequence identity and different folds. The general rule for natural proteins, though, is that sequence similarity means high structural similarity, and one of the main challenges is to properly incorporate information from remote homologues .
amount of storage tremendously compared to database approaches. Necessary for this purpose is a similarity measure between trajectories, e.g., the QRLCS mentioned above. Each cluster is represented by a trajectory prototype, which is often taken to be the mean of the trajectories associated with the corresponding cluster. Prediction basically consists of finding the cluster that best fits a partially observed trajectory. Usually the cluster prototype is used as the predicted trajectory. The algorithms differ mainly in the way they represent a trajectory, the clustering algorithm that is used and the appropriate distance metric between trajectories. This thesis distinguishes between two main clustering categories: centroid-based clustering and pairwise clustering. In the first kind of algorithm, the clusters are built by optimizing a global criterion. Further, they have the advantage that the trajectory prototypes are a direct result of the clustering process. However, they possess two disadvantages: the number of clusters must be known a priori, and the number of elements of each trajectory has to be the same. The first problem is often addressed by assuming the number of clusters is known, or alternatively iterative greedy search techniques are applied. To solve the second problem, [HXTM04] suggested resampling trajectories in order to normalize their length. As an alternative, [BBT02] proposed to normalize the length by padding the end of each trajectory with its last value up to the length of the longest trajectory. Pairwise clustering, in contrast, is based on a dissimilarity measure (e.g., Manhattan or Euclidean distance, but also QRLCS to be independent of the number of elements). In each iteration, two trajectories are compared by the measure to decide whether they belong to the same cluster. This process groups the data but does not create any prototypes. Hierarchical clustering is one of its best-known representatives. Because of the
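The padding normalization of [BBT02] and prototype-based prediction can be sketched in a few lines; the one-dimensional trajectories and the squared-distance prefix matching below are illustrative choices, not those of any particular cited method.

```python
def pad_trajectories(trajs):
    """Length-normalize by repeating each trajectory's last value,
    as proposed by [BBT02], so centroid methods can average them."""
    max_len = max(len(t) for t in trajs)
    return [t + [t[-1]] * (max_len - len(t)) for t in trajs]

def prototype(cluster):
    """Cluster prototype = pointwise mean of the member trajectories."""
    n = len(cluster)
    return [sum(t[k] for t in cluster) / n for k in range(len(cluster[0]))]

def predict(prototypes, partial):
    """Match a partially observed trajectory against each prototype's
    prefix and return the best-fitting prototype as the prediction."""
    def prefix_dist(proto):
        return sum((p - q) ** 2 for p, q in zip(partial, proto))
    return min(prototypes, key=prefix_dist)

# Two hand-made clusters of 1-D trajectories (positions over time).
rising  = pad_trajectories([[0, 1, 2, 3], [0, 1, 2], [0, 2, 3, 4]])
falling = pad_trajectories([[4, 3, 2, 1], [4, 2, 1]])
protos = [prototype(rising), prototype(falling)]
forecast = predict(protos, [0, 1])  # partial observation matches the rising cluster
```

Here the partially observed trajectory is closest to the rising prototype, which is then returned as the full predicted trajectory, mirroring the prediction step described above.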
At the other extreme, Eq. 9 is also tested to see whether it can predict sediment discharge with very fine bed material that is generally regarded as wash load (sediment size finer than 0.07 mm; Partheniades, 1977). Fig. 5 shows the predictions for laboratory data with d₅₀ = 0.011 mm from Kalinske and Hsia (1945) and field data with d₅₀ = 0.02 to 0.07 mm from Indian Canals by Chitale (1966). For comparison, other equations are also included in Fig. 5, and it can be seen that Eq. 9 provides a reasonable prediction. Hence, it can be concluded that wash load could also be predicted, since the motions of coarse and fine sediment are governed by identical physical laws (Partheniades, 1977).
of the prison guard when he enters the cell in the morning for inspection. If the boots seem wet, or the guard even leaves a wet trail, this can be a clear indication of a rainy day outside. Of course, it is possible that he just crossed a recently cleaned floor, although this seems unlikely. But on the other hand, if the boots are dry, one cannot as easily conclude that it must be a sunny day outside. It is also possible that the officer arrived by car, or has already been on duty for hours and the boots have dried again. All in all, the boot procedure does not seem to give a very accurate prediction of the unknown status of the weather outside. But if we could combine many such perhaps weak prediction techniques, the performance could rise. Let us imagine the prisoner meets 100 other prisoners for lunch. By sharing the information about their guards' boots, they could develop a much more accurate and better-performing prediction.
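The intuition that pooling many weak predictors improves accuracy can be checked with a short calculation. Assuming each prisoner's boot-based guess is independently correct with probability 0.6 (an invented figure), the exact probability that a majority of 101 such guesses is correct follows from the binomial distribution:

```python
from math import comb

def majority_vote_accuracy(n_voters, p):
    """Probability that a majority of n independent predictors, each
    correct with probability p, gives the right answer."""
    return sum(comb(n_voters, k) * p ** k * (1 - p) ** (n_voters - k)
               for k in range(n_voters // 2 + 1, n_voters + 1))

single = 0.6                                  # one guard's boots: a weak predictor
pooled = majority_vote_accuracy(101, single)  # 101 prisoners sharing observations
```

With these assumed numbers the pooled accuracy exceeds 0.95, far above the individual 0.6; the key assumption is that the prisoners' errors are independent, which is also what boosting-style ensemble methods rely on.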
concrete floor in the building-like situation. This shows that the SEA model is able to predict the measured vibrational response of the concrete floor within the acceptable error limits for all ramp durations, with only a few exceptions (e.g. the 125 ms ramp at 50 Hz) where the error is up to 1.4 dB higher than the acceptable error limits. In general, there is no significant offset, although for the 125 ms ramp the prediction appears to slightly overestimate between 50 Hz and 100 Hz, and for the 1 s, 2 s and 5 s ramps the prediction appears to slightly overestimate between 630 Hz and 3.15 kHz.
about the mismatched content itself. Prediction errors are thus as structured and nuanced in their implications as the model-based predictions relative to which they are computed. This means that, in a very real sense, the prediction error signal is not a mere proxy for incoming sensory information – it is sensory information. Thus, suppose you and I play a game in which I (the “higher, predicting level”) try to describe to you (the “lower level”) the scene in front of your eyes. I can’t see the scene directly, but you can. I do, however, believe that you are in some specific room (the living room in my house, say) that I have seen in the past. Recalling that room as best I can, I say to you “there’s a vase of yellow flowers on a table in front of you”. The game then continues like this. If you are silent, I take that as your agreeing to my description. But if I get anything that matters wrong, you must tell me what I got wrong. You might say “the flowers are not yellow”. You thus provide an error signal that invites me to try again in a rather specific fashion—that is, to try again with respect to the colour of the flowers in the vase. The next most probable colour, I conjecture, is red. I now describe the scene in the same way but with red flowers. Silence. We have settled into a mutually agreeable description.
optimal number of selected variables. In other words, we investigate theoretical and numerical properties of the ℓ₀-norm constrained maximum score prediction rules.
To the best of our knowledge, Greenshtein (2006) and Jiang and Tanner (2010) are the only existing papers in the literature that explicitly considered the same prediction problem as ours. Greenshtein (2006) considered a general loss function that includes maximum score prediction as a special case in the i.i.d. setup. Greenshtein (2006) focused on a high dimensional case and established conditions under which the excess risk converges to zero as n → ∞. Jiang and Tanner (2010) focused on the prediction of time series data and obtained an upper bound for the excess risk. Neither Greenshtein (2006) nor Jiang and Tanner (2010) provided any numerical results for the best subset maximum score prediction rule. In contrast, we focus on cross-sectional applications and emphasize computational aspects.
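For intuition, the best subset maximum score rule can be sketched by brute force. The code below enumerates variable subsets of size at most s and grid-searches the coefficients, which is only a crude stand-in for the actual optimization studied in this literature; the data and all parameters are invented.

```python
from itertools import combinations, product

def best_subset_max_score(X, y, s, grid=21):
    """Sketch of l0-constrained maximum score prediction: enumerate
    variable subsets of size <= s and, for each, grid-search the
    coefficients in [-1, 1] (a crude stand-in for exact optimization),
    keeping the rule with the highest in-sample score."""
    n, p = len(X), len(X[0])
    pts = [-1 + 2 * k / (grid - 1) for k in range(grid)]
    best = (-1, None, None)  # (score, subset, coefficients)
    for size in range(1, s + 1):
        for subset in combinations(range(p), size):
            for beta in product(pts, repeat=size):
                # maximum score objective: count correct sign predictions
                score = sum(
                    1 for xi, yi in zip(X, y)
                    if (sum(b * xi[j] for b, j in zip(beta, subset)) >= 0) == (yi == 1)
                )
                if score > best[0]:
                    best = (score, subset, beta)
    return best

# Toy data: y depends on the sign of feature 0 only; feature 1 is noise.
X = [[1.0, 0.3], [2.0, -1.2], [-1.5, 0.8], [-0.5, -0.4], [0.7, 1.1], [-2.0, 0.1]]
y = [1, 1, 0, 0, 1, 0]
score, subset, beta = best_subset_max_score(X, y, s=1)
```

On this toy sample the search correctly selects the informative variable and classifies all six observations; the exponential cost of the enumeration is precisely why the computational aspects emphasized above matter.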
For consistency, all grids used for AePW calculations should conform to the gridding guidelines listed in this section. These guidelines for the Aeroelastic Prediction Workshop are adopted from those developed for the Drag Prediction Workshop and the High Lift Prediction Workshop; see Appendix I for the corresponding internet addresses. They have remained relatively unchanged over the course of these previous workshops and codify much of the collective experience of the applied CFD community in aerodynamic grid generation practices. The effects of grid-related issues on drag prediction error, gleaned from the experience of the DPW, are summarized in reference 28. For the current workshop, a sequence of coarse, medium and fine grids is required for each configuration, and the guidelines can be summarized as follows: