Computational Models PDF
University of the Philippines Manila
Billones
Summary
This document is a set of lecture notes on computational models, specifically focused on quantitative structure-activity relationships (QSAR). It discusses various topics including historical overview, deriving QSAR equations using linear regression, cross-validation techniques, and other analysis methods. The notes are from the University of the Philippines Manila.
4 Computational Models
• Historical Overview
• Deriving a QSAR Equation
• Designing a QSAR Experiment
• Partial Least Squares
• Principal Components Regression
• Molecular Field Analysis
Billones Lecture Notes

4.1 Historical Overview
Molecular discoveries today are the result of an iterative cycle of design, synthesis, testing, and analysis. The analysis stage is the construction of a model that relates the observed activity or properties to the molecular structure.

Hansch and Fujita (1964)
• were the first to use QSARs to explain the biological activity of series of structurally related molecules
• pioneered the use of descriptors related to a molecule's electronic characteristics and to its hydrophobicity

$$\log(1/C) = k_1 \log P + k_2 \sigma + k_3$$

C = concentration of compound required to produce a standard response in a given time
log P = logarithm of the molecule's partition coefficient between 1-octanol and water
σ = appropriate Hammett substitution parameter
k1, k2, k3 = fitted coefficients

Hansch proposed that activity is parabolically dependent on log P:

$$\log(1/C) = -k_1 (\log P)^2 + k_2 \log P + k_3$$

e.g. the narcotic effect of barbiturates on mice

The Hammett parameter, σ, quantifies the electronic characteristics of a molecule, which can be related to activity.
e.g. the Ki of benzoic acid esters could be quantified using the Hammett parameter, σ.
$$\log(k/k_0) = \rho\sigma \qquad \log(K/K_0) = \rho\sigma$$

k = rate constant for a particular substituent
K = equilibrium constant for a particular substituent
k0 = rate constant for the reference (substituent = H)
K0 = equilibrium constant for the reference (substituent = H)
• The substituent parameter σ is determined by the nature of the substituent and by its position (meta or para) relative to the –COOR group.
• The reaction constant ρ is fixed for a particular process; the standard reaction is the dissociation of benzoic acids (ρ = 1).

4.2 Deriving a QSAR Equation
Linear regression: the most widely used technique to derive QSAR models
Multiple linear regression: used when there is more than one independent variable

The simplest type of linear regression equation:

$$y = mx + c$$

y = the dependent variable
x = the independent variable
• In QSAR or QSPR, y is the property being modeled (such as the biological activity) and x is a molecular descriptor such as log P.
• The terms "variable" and "descriptor" are used interchangeably to refer to the independent variable.

The aim of linear regression is to find values for the coefficient m and the constant c that minimize the sum of the squared differences between the values predicted by the equation and the actual observations. In a plot of the data, these differences are represented by the vertical lines from each point to the best-fit line.
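The fit described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture notes: it uses the standard closed-form least-squares expressions for the slope and intercept, and the function names (`fit_line`, `sum_squared_residuals`) and the log P/activity values are hypothetical, invented purely for the example.

```python
def fit_line(xs, ys):
    """Return (m, c) of the least-squares line y = m*x + c."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sxy / sxx
    # Intercept: forces the line through the point (x_mean, y_mean).
    c = y_mean - m * x_mean
    return m, c


def sum_squared_residuals(xs, ys, m, c):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical descriptor (log P) and activity values, for illustration only.
log_p = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [1.1, 1.9, 3.2, 3.8, 5.0]

m, c = fit_line(log_p, activity)
print(f"activity = {m:.2f} * logP + {c:.2f}")
print("sum of squared residuals:", sum_squared_residuals(log_p, activity, m, c))
```

Any other choice of m or c gives a larger sum of squared residuals for the same data, which is what "best fit" means here.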
The values of the coefficient m and the constant c are given by the following equations (in which N is the number of data values):

$$m = \frac{\sum_{i=1}^{N} (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sum_{i=1}^{N} (x_i - \langle x \rangle)^2} \qquad c = \langle y \rangle - m \langle x \rangle$$

The line described by the regression equation passes through the point (⟨x⟩, ⟨y⟩), where ⟨x⟩ and ⟨y⟩ are the means of x and y, respectively:

$$\langle x \rangle = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \langle y \rangle = \frac{1}{N} \sum_{i=1}^{N} y_i$$

4.2.1 The Squared Correlation Coefficient, R²
Squared correlation coefficient (R² or r²)
• assesses the "quality" of a linear regression
• has a value between 0 and 1
• indicates the proportion of the variation in the dependent variable that is explained by the regression equation

Suppose ycalc,i = calculated y values and yi = experimental y values. Then the total, explained, and residual sums of squares are:

$$\mathrm{TSS} = \sum_{i=1}^{N} (y_i - \langle y \rangle)^2 \qquad \mathrm{ESS} = \sum_{i=1}^{N} (y_{calc,i} - \langle y \rangle)^2 \qquad \mathrm{RSS} = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$$

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$

Thus, R² is given by the following relationships:

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

• R² = 0: none of the variation in the observations is explained by the variation in the independent variables
• R² = 1: corresponds to a perfect explanation
• R² is very useful, but taken in isolation it can be misleading

Consider five data sets for which the value of R² is the same (0.7):
✔ okay
✘ with outlier
✘ nonlinear trend
✘ nonlinear trend
✘ with extreme value

4.2.2 Cross Validation
Cross-validation
• overcomes the problems inherent in the use of the R² value alone
• involves
  o removal of some of the values from the data set
  o derivation of a QSAR model using the remaining data (LOO: n − 1 values, or LGO: n − c values)
  o application of the new model to predict the values of the data that have been removed

Leave-one-out approach (LOO)
• just a single data value is removed
• repeating this process for every value in the data set leads to a cross-validated R² (more commonly written Q² or q²)
• Q² is normally lower than R², but it is a truer guide to the predictive ability of the equation.
• R² measures goodness-of-fit whereas Q² measures goodness-of-prediction.
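The difference between R² and the leave-one-out Q² can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the lecturer's own procedure: it handles a single descriptor only, the helper names are hypothetical, the data are invented, and Q² is computed as one minus the sum of squared prediction errors over the total sum of squares about the full-data mean.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    return m, y_mean - m * x_mean


def total_sum_of_squares(ys):
    y_mean = sum(ys) / len(ys)
    return sum((y - y_mean) ** 2 for y in ys)


def r_squared(xs, ys):
    """Goodness-of-fit: every point is used to derive the model."""
    m, c = fit_line(xs, ys)
    rss = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - rss / total_sum_of_squares(ys)


def q_squared_loo(xs, ys):
    """Goodness-of-prediction: each point is predicted by a model
    derived from the other n - 1 points."""
    xs, ys = list(xs), list(ys)
    squared_prediction_errors = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        m, c = fit_line(train_x, train_y)
        squared_prediction_errors += (ys[i] - (m * xs[i] + c)) ** 2
    return 1.0 - squared_prediction_errors / total_sum_of_squares(ys)


# Hypothetical descriptor/activity values, for illustration only.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.2, 1.7, 3.4, 3.6, 5.3, 5.6]

print("R2 =", round(r_squared(xs, ys), 3))
print("Q2 (LOO) =", round(q_squared_loo(xs, ys), 3))
```

Because each left-out point is predicted by a model that never saw it, the prediction errors are at least as large as the fitted residuals, so Q² computed this way cannot exceed R².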
• The simple LOO approach
  o is considered to be inadequate
  o has been superseded by LGO

Leave-group-out (LGO)
• the data set is divided into four or five groups
• each group is left out in turn to generate the Q² value
• repeating this procedure a large number of times (e.g. 100), selecting different groupings at random each time, gives a mean Q²

Predictive Residual Sum of Squares (PRESS)
• another measure of predictive ability
• analogous to the residual sum of squares (RSS), but rather than using values ycalc,i calculated from the model, PRESS uses predicted values ypred,i (values for data not used to derive the model)

$$\mathrm{PRESS} = \sum_{i=1}^{N} (y_i - y_{pred,i})^2$$

Q² is related to PRESS as follows:

$$Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{N} (y_i - \langle y \rangle)^2}$$

• ⟨y⟩ should strictly be calculated as the mean of the values for the appropriate cross-validation group rather than the mean for the entire data set, though the latter value is often used for convenience.

4.2.3 Other Measures of a Regression Equation
Standard error of prediction (s)
• indicates how well the regression function predicts the observed data and is given by:

$$s = \sqrt{\frac{\mathrm{RSS}}{N - p - 1}}$$

p = number of independent variables in the equation

F statistic
• equals the explained mean square (the explained sum of squares, ESS, divided by p) divided by the residual mean square:

$$F = \frac{\mathrm{ESS}/p}{\mathrm{RSS}/(N - p - 1)}$$

• Fcalculated is compared with Ftable
  o if Fcalculated > Ftable, the equation is said to be significant
• a higher Fcalculated corresponds to a higher significance level (i.e. greater confidence)
• the threshold value of F falls as the number of independent variables decreases and/or the number of data points increases
  o this is consistent with the desire to describe as large a number of data points with as few independent variables as possible

t statistic
• indicates the significance of the individual terms in the linear regression equation
If ki is the coefficient in the regression equation associated with a particular variable xi, then the t statistic is given by:

$$t = \left| \frac{k_i}{s(k_i)} \right|$$

s(ki) = standard error of the coefficient
• If tcalculated > ttable, the coefficient is considered significant

4.3 Designing a QSAR Experiment
The goal is generally to derive a parsimonious model.
Parsimonious: uses the smallest number of independent variables to explain as much data as possible.

Data set
• a general rule of thumb: at least five compounds for each descriptor
• simple verification checks should be performed on any descriptors considered for inclusion in the analysis
  o the values vary and have a good "spread"
  o the values are "scaled" if appropriate
• correlations between the descriptors should also be checked
  o remove the descriptors that have the highest correlations with other descriptors, or
  o use a data reduction technique such as PCA to derive a new set of variables that are by definition uncorrelated

4.3.1 Selecting the Descriptors to Include
Forward-stepping regression
• initially generates an equation containing just one variable, the one that contributes the most to the model (assessed using the t statistic)
• second, third, and subsequent terms are added using the same criterion

Backward-stepping regression
• starts with an equation involving all of the descriptors, which are then removed in turn (e.g. starting with the variable with the smallest t statistic)

• Both procedures aim to identify the equation that best fits the data (as assessed using the F statistic or with cross-validation)

4.3.2 Experimental Design
Experimental design techniques can help to extract the maximum information from the smallest number of molecules.

Factorial design
Suppose there are two variables that might affect the outcome.
• The outcome (i.e. response) may be % yield, and the variables (or factors) might be time, t, and temperature, T.
• If each of these factors is restricted to two values, then there are four possible experiments (i.e. T1t1, T1t2, T2t1, T2t2).
• In general, if there are n variables, each restricted to two values, then a full factorial design involves 2^n experiments.
• A full factorial design can in principle provide information about all possible interactions between the factors, but it requires a large number of experiments.

4.3.3 Indicator Variables
Indicator variables
• are used to indicate the presence or absence of particular chemical features
• usually take one of two values (0 or 1)
e.g. the equation for the binding of sulphonamides of the type X–C6H4–SO2NH2 to human carbonic anhydrase uses two indicator variables:
I1 = 1 for meta substituents (0 for others)
I2 = 1 for ortho substituents (0 for others)

4.3.4 Free-Wilson Analysis
Starting from a series of substituted compounds with their associated activity, the aim is to generate an equation of the following form:

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots$$

• x1, x2, … are the various substituents at the different positions in the series of structures.
• these xi variables are essentially equivalent to indicator variables; they take a value of 0 or 1 to indicate the presence or absence of a particular substituent at the relevant position in the compound
• the analysis assumes that
  o each substituent makes a constant contribution to the activity
  o the contributions are additive
  o there are no interactions between substituents
• standard multiple linear regression methods are used to derive the coefficients ai, which indicate the contribution of the corresponding substituent/position combination to the activity.

4.3.5 Non-Linear Terms in QSAR Equations
Bilinear model [Kubinyi 1976]
• the ascending and descending parts of the function have different slopes (unlike the Hansch parabolic model, which is symmetrical)
• may more closely mimic the observed data
• uses an equation of the following form:

$$\log(1/C) = k_1 \log P - k_2 \log(\beta P + 1) + k_3$$

In general, the Hansch parabolic model is more appropriate for modeling complex assays in which the drug must cross several barriers, whereas the bilinear model may be better suited to simpler assay systems.

4.3.6 Interpretation and Application of a QSAR Equation
Interpolative prediction: a prediction within the range of properties of the set of molecules used to derive the QSAR; generally more reliable
Extrapolative prediction: a prediction beyond the range of properties, e.g. predictions intended to improve potency or another property

e.g. Inhibition of alcohol dehydrogenase (ADH) by 4-substituted pyrazoles
σmeta = Hammett constant for meta substituents
Ki = enzyme inhibition constant
• electron-releasing substituents X will increase the electron density on the nitrogen and increase binding to the catalytic Zn.
• a more negative σmeta value (more electron-releasing) gives a smaller Ki (i.e. stronger binding)

4.4 Principal Components Regression
Principal components regression (PCR)
• the PCs are themselves used as variables in a multiple linear regression
• as most data sets provide fewer "significant" PCs than variables, this may often lead to a concise QSAR equation of the form:

$$y = a_1 \mathrm{PC}_1 + a_2 \mathrm{PC}_2 + \dots$$

• the most important PCs (those with the largest eigenvalues) are not necessarily the most important to use in the PCR equation, e.g. if forward-stepping variable selection is used, the PCs will not necessarily be chosen in the order one, two, three, etc.
• nevertheless, frequently at least the first two PCs will give the best correlation with the dependent variable.

4.5 Partial Least Squares
Partial Least Squares [Wold 1982]
• is similar to PCR, the difference being that the quantities calculated are chosen to explain the variation in both the independent (x) variables and the dependent (y) variables
• expresses the dependent variable in terms of quantities called latent variables, comparable to the PCs in PCR
Thus:

$$y = a_1 t_1 + a_2 t_2 + \dots + a_n t_n$$

The latent variables ti are themselves linear combinations of the xi:

$$t_i = b_{i1} x_1 + b_{i2} x_2 + \dots + b_{ip} x_p$$

• as with PCA, the latent variables are orthogonal to each other
• PLS takes into account not only the variance in the x variables but also how this corresponds to the values of the dependent variable
• the first latent variable t1 is a linear combination of the x values that gives a good explanation of the variance in the x-space, similar to PCA.
• t1 is also defined so that, when multiplied by its corresponding coefficient a1, it provides a good approximation to the variation in the y values
• a graph of the observed values y against a1t1 should give a reasonably straight line.
• a graph of y versus a1t1 + a2t2 will show an even better correlation than was the case for just the first latent variable.
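To make the idea of a latent variable concrete, the sketch below computes a first PLS component for a single dependent variable. It is an illustration under stated assumptions: the notes do not say which PLS algorithm is intended, so this uses the common NIPALS-style construction for one y variable (weights proportional to the covariance of each mean-centered descriptor with y), and the function names and the small data set are hypothetical.

```python
import math


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_latent_variable(X, y):
    """First PLS component for one dependent variable.

    X is a list of rows (one row per compound, one column per descriptor).
    Returns (weights, scores, a1), where scores[i] is t1 for compound i
    and a1 is the coefficient of t1 in y = a1*t1 (on mean-centered data).
    """
    n, p = len(X), len(X[0])
    columns = [center([row[j] for row in X]) for j in range(p)]
    y_centered = center(y)
    # Weight of each descriptor: its covariance with y (up to a constant).
    weights = [sum(columns[j][i] * y_centered[i] for i in range(n))
               for j in range(p)]
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("no descriptor covaries with y")
    weights = [w / norm for w in weights]
    # Scores: t1 is a linear combination of the centered descriptors.
    scores = [sum(weights[j] * columns[j][i] for j in range(p))
              for i in range(n)]
    # a1: least-squares regression of centered y on t1.
    a1 = (sum(t * yc for t, yc in zip(scores, y_centered))
          / sum(t * t for t in scores))
    return weights, scores, a1


# Hypothetical data: four compounds, two descriptors.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [1.0, 2.0, 3.0, 4.0]

weights, scores, a1 = first_latent_variable(X, y)
print("weights:", [round(w, 3) for w in weights])
print("a1:", round(a1, 3))
```

Plotting the centered y values against `a1 * t1` for each compound gives the kind of graph the last two bullets describe; later components would be built the same way from what the first one leaves unexplained.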
Residuals
• the differences between the observed and predicted values
• these represent the variation not explained by the model

Figure: Studentized residuals based on the leave-group-out method. Billones LT, Billones JB. Phil Sci Lett. 2013, 6(2): 231–240.

Consider the amino acid data set:
• PLS analysis on the 19 data points suggests that just one component is significant (R² = 0.43, Q² = 0.31)
• although the R² can be improved to 0.52 by the addition of a second component, the Q² value falls to 0.24.
• by comparison, the equivalent multiple linear regression model with all six variables has an R² of 0.70, but the Q² is low (0.142).
• The quality of the one-component PLS model can be assessed graphically by plotting the predicted free energies against the measured values.
• This initial PLS model is poor, but removal of the aromatic amino acids gives an improvement.

PLS is able to cope with data sets with more than one dependent variable (i.e. multivariate problems).
• The first component of a multivariate PLS model consists of two lines, one in the x-space (t1) and one in the y-space (u1).
• Each line gives a good approximation to its respective point distribution, but the two lines also achieve maximum intercorrelation.
• The scores obtained by projecting each point onto these lines are plotted (bottom figure).
• If there is a perfect match between the x and y data, all the points lie on the diagonal of slope one.

Number of latent variables to include in a PLS model
• The fit of the model (i.e. R²) can always be enhanced by the addition of more latent variables.
• However, the predictive ability (i.e. Q²) eventually either passes through a maximum or reaches a plateau.
• This maximum in Q² corresponds to the most appropriate number of latent variables to include.
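The selection rule above can be written as a small helper. This is only a sketch: the function name and the `tolerance` parameter (how close to the best Q² counts as "on the plateau") are assumptions introduced for illustration, not something the notes specify. The first call uses the Q² values quoted above for the amino acid data set; the second list is invented to show the plateau case.

```python
def choose_component_count(q2_values, tolerance=0.0):
    """Return the smallest number of latent variables whose Q2 is within
    `tolerance` of the best Q2.

    q2_values[k - 1] is the cross-validated Q2 of the model with k
    latent variables.
    """
    if not q2_values:
        raise ValueError("need at least one Q2 value")
    best = max(q2_values)
    for count, q2 in enumerate(q2_values, start=1):
        if q2 >= best - tolerance:
            return count


# Q2 for one and two components, as quoted for the amino acid data set.
print(choose_component_count([0.31, 0.24]))

# Hypothetical values that reach a plateau after two components.
print(choose_component_count([0.40, 0.55, 0.56, 0.56], tolerance=0.02))
```

With a tolerance of zero the helper simply returns the position of the maximum; a small positive tolerance prefers the more parsimonious model when extra components add almost nothing.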
Some practitioners use the sPRESS parameter, the standard deviation of the error of the predictions:

$$s_{PRESS} = \sqrt{\frac{\mathrm{PRESS}}{N - c - 1}}$$

c = number of PLS components in the model
N = number of compounds
• use the smallest number of latent variables that gives a reasonably high Q²
• each of these latent variables should produce a fall in the value of sPRESS of at least 5% [Wold 1993]

4.6 Molecular Field Analysis
Comparative Molecular Field Analysis (CoMFA) [Cramer 1988]
• derives a correlation between the activity of a series of molecules and their 3D shape, electrostatic, and hydrogen-bonding characteristics.
• the data structure is derived from a series of superimposed conformations, one for each molecule in the data set.
• these conformations are presumed to be the active structures, overlaid in their common binding mode.
• each conformation is taken in turn, and the molecular fields around it are calculated.
  o this is achieved by placing the structure within a regular lattice and calculating the interaction energy between the molecule and a series of probe groups placed at each lattice point.
• The standard fields used in CoMFA are electrostatic and steric.
  o These are calculated using a probe comprising an sp3-hybridized carbon atom with a charge of +1.
  o The electrostatic potential is calculated using Coulomb's law and the steric field using the Lennard–Jones potential.

Coulomb potential:
$$V_{elec} = \sum_{i} \frac{q_i q_{probe}}{4 \pi \varepsilon_0 r_i}$$

Lennard–Jones potential:
$$V_{steric} = \sum_{i} \left( \frac{A_i}{r_i^{12}} - \frac{B_i}{r_i^{6}} \right)$$

• The results of these calculations are placed into a data matrix; each row corresponds to one molecule and each column corresponds to the interaction energy of a particular probe at a specific lattice point.
• The general form of the equation that results can be written:

$$\mathrm{activity} = C + \sum_{i} \sum_{j} c_{ij} S_{ij}$$

cij = coefficient for the column that corresponds to placing probe j at lattice point i
Sij = energy value
C = constant

Figure: In CoMFA each molecule is placed within a regular grid and interaction energies are computed with probe groups to give a rectangular data structure.
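How one row of the CoMFA data matrix is filled can be sketched as follows. This is a toy illustration, not the CoMFA implementation: the single-atom "molecule", the Lennard–Jones coefficients `a` and `b`, the energy cut-off, and the minimum distance are all hypothetical values chosen for the example, and the Coulomb conversion factor of about 332.06 kcal·Å/(mol·e²) is the usual approximate value for charges in electron units and distances in ångström.

```python
import itertools
import math

COULOMB_CONSTANT = 332.06  # approx. kcal*Angstrom/(mol*e^2)


def make_lattice(lo, hi, spacing):
    """Regular cubic lattice of points covering [lo, hi] on each axis."""
    n = int(round((hi - lo) / spacing)) + 1
    axis = [lo + k * spacing for k in range(n)]
    return list(itertools.product(axis, repeat=3))


def field_row(atoms, grid, probe_charge=1.0, a=1000.0, b=100.0,
              cutoff=30.0, min_distance=1e-3):
    """One row of the data matrix for one molecule.

    atoms: list of (x, y, z, partial_charge).
    Returns [steric_0, elec_0, steric_1, elec_1, ...], two columns per
    lattice point. a, b and cutoff are hypothetical illustration values.
    """
    row = []
    for point in grid:
        steric = 0.0
        electrostatic = 0.0
        for x, y, z, charge in atoms:
            # Clamp r so a probe sitting on an atom does not divide by zero.
            r = max(math.dist(point, (x, y, z)), min_distance)
            steric += a / r ** 12 - b / r ** 6          # Lennard-Jones
            electrostatic += COULOMB_CONSTANT * probe_charge * charge / r
        # Both fields become very steep near atoms, so truncate them.
        row.append(max(-cutoff, min(steric, cutoff)))
        row.append(max(-cutoff, min(electrostatic, cutoff)))
    return row


# Toy "molecule": a single atom at the origin with a small positive charge.
atoms = [(0.0, 0.0, 0.0, 0.1)]
grid = make_lattice(-2.0, 2.0, 2.0)
row = field_row(atoms, grid)
print(len(grid), "lattice points ->", len(row), "columns")
```

Stacking one such row per aligned molecule gives the rectangular data structure described above, with far more columns than molecules, which is why PLS rather than ordinary multiple linear regression is used to analyze it.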
• by connecting points with the same coefficients, a 3D contour plot can be produced that identifies regions of particular interest.
  o this allows identification of regions where, e.g., it is favorable to place positively charged or bulky groups that would increase the activity.
e.g.
• N+ indicates regions where it is favorable to place a negative charge
• N− indicates regions where it is unfavorable to place a negative charge
• S+/S− indicate regions where it would be favorable/unfavorable to place steric bulk

Figure: Contour representation of the key features from a CoMFA analysis of a series of coumarin substrates and inhibitors of cytochrome P450 2A5.

Figure: CoMFA workflow. Panels: Conformational Search and Molecular Alignment; 3D Grid Probing; Partial Least Squares; Creation of Contour Plot with Calculated Steric and Electrostatic Fields. Contour colors: green = steric favored; yellow = steric unfavored; red = (−) charge favored; blue = (+) charge favored.

CoMFA (Comparative Molecular Field Analysis)
• very small changes in position can give rise to significant variation in the energy
• the Lennard–Jones potential used to calculate the steric field becomes very steep close to the vdW surface (this is also true of the Coulomb potential used to calculate the electrostatic field)
• this is remedied by applying arbitrary cut-offs, but the overall effect is a loss of information; contour plots become fragmented and hard to interpret.

CoMSIA (Comparative Molecular Similarity Indices Analysis)
• the fields are replaced by similarity values at the various grid positions.
• the similarities are calculated using much smoother potentials that are not so steep and have a finite value even at the atomic positions.
• has superior contour maps that are easier to interpret

Assignment
Consider this data set.
1. Generate a Multiple Linear Regression (MLR) model involving all six independent variables.
2. Generate a 3-variable MLR model involving the descriptors that are included by the forward-stepping method.
3. Calculate the R² in each case. (Hint: Use the models (equations) to predict the FE.)
4. a) Perform Leave-One-Out (LOO) validation and calculate the Q² in each case. (Hint: Use the models (equations) based on n − 1 observations to predict the FE of the removed residue.) Plot (i.e. scatter plot with trendline) the calculated free energy, FEcalc, versus FEobs.
   b) Perform Leave-Group-Out (LGO) validation, taking out 5 rows (AAs) at a time to generate models based on the n − 5 data set. Do it repeatedly until you obtain 5 predicted values for each amino acid. Plot (i.e. scatter plot with trendline) the calculated free energy, FEcalc (mean of 5 values), versus FEobs. What is the Q²?
5. Plot the residuals in each case. (Hint: Add a column for FEcalc and another for the difference, FEcalc − FEobs. Then plot the residuals, y (i.e. FEcalc − FEobs), versus x (FEobs). Add a trend line to the plot.)
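As a starting point for the leave-group-out step, the mechanics can be sketched for the simplest possible case. The code below is a deliberately reduced illustration: it uses a single descriptor rather than the six-variable MLR the assignment asks for, follows the repeated random-grouping scheme described in Section 4.2.2, and every name, parameter default, and data value in it is hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    return m, y_mean - m * x_mean


def q_squared_lgo(xs, ys, n_groups=5, n_repeats=100, seed=0):
    """Mean Q2 over repeated random leave-group-out splits."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(xs)
    y_mean = sum(ys) / n
    tss = sum((y - y_mean) ** 2 for y in ys)
    q2_values = []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        groups = [order[g::n_groups] for g in range(n_groups)]
        press = 0.0
        for group in groups:
            left_out = set(group)
            train = [i for i in range(n) if i not in left_out]
            m, c = fit_line([xs[i] for i in train], [ys[i] for i in train])
            # Predict only the compounds the model has not seen.
            press += sum((ys[i] - (m * xs[i] + c)) ** 2 for i in group)
        q2_values.append(1.0 - press / tss)
    return sum(q2_values) / len(q2_values)


# Hypothetical descriptor/property values, for illustration only.
xs = [0.2, 0.6, 1.1, 1.5, 1.9, 2.4, 2.8, 3.3, 3.7, 4.1]
ys = [0.9, 1.4, 2.6, 2.9, 4.2, 4.6, 6.1, 6.3, 7.8, 8.0]

print("mean Q2 (LGO):", round(q_squared_lgo(xs, ys), 3))
```

For the assignment itself, `fit_line` would be replaced by a multiple linear regression over the chosen descriptors; the grouping, prediction, and Q² bookkeeping stay the same.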