3) Molecular Descriptors.pptx
Uploaded by DazzlingFreedom
3.1 Introduction
Molecular descriptors
• allow manipulation and analysis of chemical structural information
• are numerical values that characterize properties of molecules
• may represent the physicochemical properties of a molecule, or may be values derived by applying algorithmic techniques to the molecular structure
• in general, the computational requirements increase with the level of discrimination that is achieved, e.g. MW does not convey much about a molecule's properties but is fast to compute
• some descriptors have an experimental counterpart (e.g. the octanol–water partition coefficient); some are purely algorithmic constructs (e.g. 2D fingerprints)
Billones Lecture Notes
3.2 Descriptors Calculated from the 2D Structure
3.2.1 Simple Counts
• simplest descriptors; based on simple counts of features
o number of hydrogen bond donors (HBD)
o number of hydrogen bond acceptors (HBA)
o number of ring systems (Nring)
o number of rotatable bonds (Nrot)
o molecular weight (MW)
• substructures or molecular fragments calculated from a 2D connection table
• low level of discrimination; often used in combination with other descriptors
3.2.2 Physicochemical Properties
• Hydrophobicity is an important property in determining the activity and transport of drugs.
o A molecule's hydrophobicity can affect how tightly it binds to a protein and its ability to pass through a cell membrane.
o It is most commonly modelled using the logarithm of the partition coefficient between n-octanol and water (log P).
• Early estimates of log P were based on an additive scheme whereby the log P for a compound with a substituent X is equal to the log P for the parent compound plus the appropriate substituent constant πX [Fujita 1964].
• Another method for estimating log P is based on breaking the molecule into fragments.
• The partition coefficient for the molecule then equals the sum of fragment values plus a series of “correction factors” that account for interactions between the fragments, such as intramolecular hydrogen bonding [Rekker 1977, 1992]:
$$\log P = \sum_i a_i f_i + \sum_j b_j F_j$$
where there are $a_i$ fragments of type i, with $f_i$ being the corresponding contribution, and $b_j$ occurrences of correction factor j, with $F_j$ being the correction factor.
• The most widely used program of this type is the ClogP program, developed by Leo and Hansch [1993].
• An advantage of the fragment-based approach is that electronic interactions can be taken into account.
• A potential disadvantage is that it fails on molecules containing fragments for which values have not been provided.
3.2.3 Molar Refractivity
The molar refractivity (MR) is given by:
$$MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$$
n = refractive index, d = density, MW = molecular weight
• used as a measure of the steric bulk of a molecule
• the refractive index term accounts for the polarizability of the molecule and does not vary much from one molecule to another
• calculated using atomic contributions
3.2.4 Topological Indices
Topological indices
• single-valued descriptors that can be calculated from the 2D graph representation of molecules
• characterize structures according to their size, degree of branching and overall shape
e.g. the Wiener index involves counting the number of bonds between each pair of atoms and summing the distances, $D_{ij}$, between all such pairs:
$$W = \frac{1}{2} \sum_{i} \sum_{j} D_{ij}$$
The factor of 1/2 ensures that each pair of atoms is counted once.
MOLECULAR CONNECTIVITY INDICES
• introduced by Randić [1975] and developed by Kier and Hall [1986]
Branching index
• calculated from the hydrogen-suppressed graph representation of a molecule
• based on the degree δi of each atom i.
o the degree equals the number of adjacent non-hydrogen atoms
o a bond connectivity value is calculated for each bond as the reciprocal of the square root of the product of the degrees of the two atoms in the bond
• the branching index equals the sum of the bond connectivities over all of the bonds in the molecule:
$$\chi = \sum_{\text{bonds}} (\delta_i \delta_j)^{-1/2}$$
Chi molecular connectivity indices [Kier and Hall]
• the δi value is redefined in terms of the number of sigma electrons and the number of hydrogen atoms associated with an atom
• valence delta values, $\delta_i^v$, are also introduced; these encode atomic and valence-state electronic information through counts of sigma, pi and lone-pair electrons
The simple delta value, $\delta_i$, is given by:
$$\delta_i = \sigma_i - h_i$$
$\sigma_i$ = number of sigma electrons for atom i
$h_i$ = number of hydrogen atoms bonded to atom i
The valence delta value for atom i, $\delta_i^v$, is defined as:
$$\delta_i^v = Z_i^v - h_i$$
$Z_i^v$ = number of valence electrons (sigma, pi and lone pair electrons) for atom i
For elements beyond fluorine in the periodic table the valence delta expression is modified as follows:
$$\delta_i^v = \frac{Z_i^v - h_i}{Z_i - Z_i^v - 1}$$
$Z_i$ = atomic number
Table: Values of $\delta_i$ and $\delta_i^v$ for several common types of atom
• the simple delta value differentiates –CH3 from –CH2–
• –CH3 has the same simple delta value as –NH2 but a different valence delta value, so the two atoms can be differentiated using $\delta_i^v$
The chi molecular connectivity indices are sequential indices that sum terms built from the atomic delta values over bond paths of different lengths.
Zeroth-order chi index ($^0\chi$)
• summation over all atoms in a molecule (i.e. paths of length zero):
$${}^0\chi = \sum_{\text{atoms}} (\delta_i)^{-1/2}$$
First-order chi index ($^1\chi$)
• summation over bonds:
$${}^1\chi = \sum_{\text{bonds}} (\delta_i \delta_j)^{-1/2}$$
• the same as Randić’s branching index when simple delta values are used
Higher-order chi indices
• summations over sequences of two, three, etc. bonds
Figure: Chi indices for the various isomers of hexane.
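As a minimal sketch of the Wiener index and Randić's branching index described in this section, the Python below works on a hydrogen-suppressed graph stored as an adjacency dictionary. The function names, the dictionary layout and the two butane isomers are illustrative assumptions and are not taken from the lecture notes.

```python
from collections import deque
from math import sqrt

# Hydrogen-suppressed graphs as {atom index: [bonded atom indices]}.
# Both molecules are illustrative inputs chosen for this sketch.
N_BUTANE = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # C-C-C-C
ISOBUTANE = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}     # central C bonded to three C


def bond_distances(graph, start):
    """Breadth-first search: bonds on the shortest path from start to each atom."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in graph[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance


def wiener_index(graph):
    """Sum of D_ij over all pairs of atoms, each pair counted once."""
    total = sum(sum(bond_distances(graph, atom).values()) for atom in graph)
    return total // 2  # the double sum visits every pair twice


def randic_index(graph):
    """Sum over bonds of 1 / sqrt(degree_i * degree_j)."""
    degree = {atom: len(neighbours) for atom, neighbours in graph.items()}
    return sum(
        1.0 / sqrt(degree[i] * degree[j])
        for i, neighbours in graph.items()
        for j in neighbours
        if i < j  # visit each bond once
    )


for name, graph in (("n-butane", N_BUTANE), ("isobutane", ISOBUTANE)):
    print(name, wiener_index(graph), round(randic_index(graph), 4))
```

For n-butane the sketch gives W = 10 and a branching index of about 1.914; for the more branched isobutane it gives W = 9 and about 1.732, so both indices fall as branching increases.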
3.2.5 Kappa Shape Indices
Kappa shape indices [Hall and Kier 1991]
• designed to characterize aspects of molecular shape by comparing a molecule with the “extreme shapes” that are possible for that number of atoms
• there are shape indices of various orders (first, second, etc.)
o the first-order shape index involves a count over single-bond fragments
The first-order kappa index is defined as:
$${}^1\kappa = \frac{2\,{}^1P_{\max}\,{}^1P_{\min}}{({}^1P)^2}$$
${}^1P_{\max}$ = number of edges (or paths of length one) in the completely connected graph
${}^1P_{\min}$ = number of bonds in the linear molecule
${}^1P$ = number of bonds in the molecule for which the shape index is being calculated
The two extreme shapes are the linear molecule and the completely connected graph, where every atom is connected to every other atom.
Figure: Extreme shapes used in the first- and second-order kappa indices for graphs containing four, five and six atoms.
• The linear molecule corresponds to the minimum (middle column).
• The maximum for the first-order index corresponds to the completely connected graph (left-hand column) and for the second-order index to the star shape (right-hand column).
For a molecule containing A atoms, ${}^1P_{\min} = A - 1$ and ${}^1P_{\max} = A(A-1)/2$. Thus $^1\kappa$ becomes:
$${}^1\kappa = \frac{A(A-1)^2}{({}^1P)^2}$$
The second-order kappa index is determined by the count of two-bond paths, ${}^2P$. Here ${}^2P_{\min} = A - 2$ and ${}^2P_{\max} = (A-1)(A-2)/2$. Thus,
$${}^2\kappa = \frac{(A-1)(A-2)^2}{({}^2P)^2}$$
• The kappa indices themselves do not include any information about the identity of the atoms.
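The first- and second-order kappa indices can be sketched in a few lines of Python. The sketch assumes the closed forms 1κ = A(A−1)²/(1P)² and 2κ = (A−1)(A−2)²/(2P)², which follow from the Pmin and Pmax expressions given in the text; the adjacency-dictionary layout, function names and example molecules are illustrative assumptions.

```python
# Hydrogen-suppressed graphs as {atom index: [bonded atom indices]}.
# Both molecules are illustrative inputs chosen for this sketch.
N_BUTANE = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # linear, 4 atoms
ISOBUTANE = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}     # star shape, 4 atoms


def count_bonds(graph):
    """1P: every bond appears twice in the adjacency lists."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


def count_two_bond_paths(graph):
    """2P: an atom of degree d is the middle atom of d*(d-1)/2 two-bond paths."""
    return sum(len(n) * (len(n) - 1) // 2 for n in graph.values())


def kappa1(graph):
    """First-order kappa index, A(A-1)^2 / (1P)^2."""
    atoms = len(graph)
    bonds = count_bonds(graph)
    if bonds == 0:
        raise ValueError("kappa1 needs at least one bond")
    return atoms * (atoms - 1) ** 2 / bonds ** 2


def kappa2(graph):
    """Second-order kappa index, (A-1)(A-2)^2 / (2P)^2."""
    atoms = len(graph)
    paths = count_two_bond_paths(graph)
    if paths == 0:
        raise ValueError("kappa2 needs at least one two-bond path")
    return (atoms - 1) * (atoms - 2) ** 2 / paths ** 2


for name, graph in (("n-butane", N_BUTANE), ("isobutane", ISOBUTANE)):
    print(name, kappa1(graph), kappa2(graph))
```

Both four-atom graphs have three bonds, so they share the same first-order value (4.0). The second-order index separates them: the linear chain has two two-bond paths (2κ = 3.0) and the star has three (2κ ≈ 1.33).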
Kappa–alpha indices
• include atom identity
• the alpha value for an atom i is a measure of its size relative to some standard (sp3-hybridized carbon):
$$\alpha_i = \frac{r_i}{r_{\mathrm{C}sp^3}} - 1$$
where $r_i$ and $r_{\mathrm{C}sp^3}$ are the covalent radii of atom i and of an sp3-hybridized carbon.
The kappa–alpha indices:
$${}^1\kappa_\alpha = \frac{(A+\alpha)(A+\alpha-1)^2}{({}^1P+\alpha)^2}$$
$${}^2\kappa_\alpha = \frac{(A+\alpha-1)(A+\alpha-2)^2}{({}^2P+\alpha)^2}$$
$${}^3\kappa_\alpha = \frac{(A+\alpha-1)(A+\alpha-3)^2}{({}^3P+\alpha)^2} \quad \text{if } A \text{ is odd}$$
$${}^3\kappa_\alpha = \frac{(A+\alpha-3)(A+\alpha-2)^2}{({}^3P+\alpha)^2} \quad \text{if } A \text{ is even}$$
where α is the sum of the $\alpha_i$'s for all atoms in the molecule and ${}^3P$ is the count of three-bond paths.
The shape indices are described on page 250 of the Handbook of Molecular Descriptors (Todeschini and Consonni 2000).
The kappa flexibility index (phia) is given by
$$\mathrm{phia} = \frac{{}^1\kappa_\alpha \, {}^2\kappa_\alpha}{A}$$
The flexibility index is described on page 178 of the Handbook of Molecular Descriptors (Todeschini and Consonni 2000).
3.2.6 Electrotopological State Indices
Electrotopological state indices [Hall 1991]
• determined for each atom (including hydrogen atoms, if so desired) rather than for whole molecules
• depend on the intrinsic state of an atom, $I_i$, which for an atom i in the first row of the periodic table is given by:
$$I_i = \frac{\delta_i^v + 1}{\delta_i}$$
• The intrinsic state encodes electronic and topological characteristics of atoms.
• The effects of interactions with the other atoms are incorporated by determining the number of bonds between the atom i and each of the other atoms, j.
For path length $r_{ij}$, the perturbation $\Delta I_{ij}$ is defined as:
$$\Delta I_{ij} = \frac{I_i - I_j}{r_{ij}^2}$$
The electrotopological state (E-state) for an atom is given by the sum of the $\Delta I_{ij}$ and $I_i$:
$$S_i = I_i + \sum_j \Delta I_{ij}$$
Atomic E-states can be combined into a whole-molecule descriptor by calculating the mean-square value over all atoms.
The Molconn-Z program provides access to several hundred different E-state descriptors.
3.2.7 2D Fingerprints
• In dictionary-based fingerprints each bit position often corresponds to a specific substructural fragment.
• Fragments that occur infrequently may be more likely to be useful than fragments which occur very frequently.
• Unfortunately, the optimum set of fragments is often data-set dependent.
• Hashed fingerprints do not depend on a predefined dictionary, so any fragment that is present in a molecule will be encoded.
• It is not possible to map from a bit position back to a unique substructural fragment, so hashed fingerprints are not directly interpretable.
• 2D fingerprints probably “work” as descriptors because a molecule’s properties and biological activity often depend on features such as those the fingerprints encode.
3.3 Descriptors Based on 3D Representations
3.3.1 3D Fragment Screens
3D screens
• originally designed for use in 3D substructure searching
• encode spatial relationships (e.g. distances and angles) between the different features of a molecule, such as atoms, ring centroids and planes
• distance ranges for each pair of features are divided into a series of bins by specifying a bin width
• valence angle descriptors consist of three atoms, ABC
• torsion angle descriptors consist of four atoms, ABCD
• the different types of screens can be combined into a bitstring of length equal to the total number of bins over all feature types
3.3.2 Pharmacophore Keys
Pharmacophore keys
• based on pharmacophoric features, that is, atoms or substructures that are thought to have relevance for receptor binding
• pharmacophoric features typically include hydrogen bond donors, hydrogen bond acceptors, charged centers and aromatic ring centers
Figure: The generation of 3-point pharmacophore keys, illustrated using benperidol. Two different conformations are shown, together with two different combinations of three pharmacophore points.
3.3.3 Other 3D Descriptors
3D topographical indices
• can be calculated from the distance matrix of a molecule
• analogous to the topological indices, which are generated from a 2D connection table
Geometric atom pairs
• extension of atom pair descriptors, which encode all pairs of atoms in a molecule together with the length of the shortest bond-by-bond path between them
Others
• HOMO and LUMO energies
• molecular electrostatic potentials
• dipole moments
• etc.
3.4 Data Verification and Manipulation
• examine the characteristics of the descriptors prior to using them in an analysis
• evaluate the distribution of values for a given descriptor
• check for correlations between different descriptors, which could lead to over-representation of certain information
Manipulation of the data may be required.
• could involve a simple technique such as scaling to ensure that each descriptor contributes equally to the analysis
• may involve a more complex technique such as Principal Components Analysis (PCA), which results in a new set of descriptors with more desirable characteristics
3.4.1 Data Spread and Distribution
• examine the spread of values for the data set
• if the values show no variation, there is nothing to be gained from including the descriptor
• the values of a descriptor should (in some cases) follow a particular distribution, often the normal distribution
The coefficient of variation (CV) can be used to assess the spread of a descriptor.
• equal to the standard deviation (σ) divided by the mean (⟨x⟩):
$$CV = \frac{\sigma}{\langle x \rangle}, \qquad \langle x \rangle = \frac{1}{N} \sum_{i=1}^{N} x_i$$
N = number of data points; $x_i$ = value of descriptor x for data point i
• the larger the coefficient of variation, the better the spread of values
3.4.2 Scaling
• If descriptors have different numerical ranges, they are scaled so that each one has an equal chance of contributing to the overall analysis.
• Scaling is also often referred to as standardization.
Ways to scale data:
Unit variance scaling (also known as auto-scaling)
• each descriptor value is divided by the standard deviation for that descriptor across all observations (molecules).
• each scaled descriptor then has a variance of one.
Unit variance scaling is usually combined with mean centering, in which the average value of a descriptor is subtracted from each individual value. In this way all descriptors are centered on zero and have a standard deviation of one:
$$x_i' = \frac{x_i - \langle x \rangle}{\sigma}$$
$x_i'$ is the new, transformed value of the original $x_i$.
Range scaling
• uses a related expression in which the denominator equals the difference between the maximum and minimum values
3.4.3 Correlations
• Correlations between the descriptors should also be checked as a matter of routine to avoid over-representation.
• Many correlations can be identified from simple scatterplots of pairs of descriptor values.
• Ideally, the points will be distributed with no discernible pattern.
A pair of descriptors with no correlation will have points in all four quadrants of the scatter plot, with no obvious pattern.
When many descriptors need to be considered, it is more convenient to compute a pairwise correlation matrix. This quantifies the degree of correlation between all pairs of descriptors. The correlation coefficient, r, between two descriptors x and y is given by:
$$r = \frac{\sum_{i=1}^{N} (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_{i=1}^{N} (x_i - \langle x \rangle)^2 \, \sum_{i=1}^{N} (y_i - \langle y \rangle)^2}}$$
Each entry (i, j) in the correlation matrix is the correlation coefficient between the i-th and j-th descriptors.
The values of the correlation coefficient range from −1.0 to +1.0.
• r = +1.0 means perfect positive correlation (points on a line with a positive slope)
• r = −1.0 means perfect negative correlation (points on a line with a negative slope)
• r = 0 means no linear relationship between the variables
e.g. Correlation matrix for amino acid data:
• there is a high degree of positive correlation between the two lipophilicity constants (LCE and LCF), and between volume (VOL) and the solvent-accessible surface area (ASA)
• there is a strong negative correlation between FET and both LCE and LCF
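A short Python sketch of auto-scaling and of the pairwise correlation matrix may help make sections 3.4.2 and 3.4.3 concrete. It is a sketch under stated assumptions: the descriptor names and values are invented toy data, the function names are hypothetical, and the population standard deviation (`statistics.pstdev`) is used, whereas a given package may divide by N − 1 instead.

```python
from math import sqrt
from statistics import mean, pstdev

# Invented toy descriptor table: one list of values per descriptor,
# one position per molecule. Not taken from the lecture notes.
DESCRIPTORS = {
    "MW": [100.0, 200.0, 300.0, 400.0],
    "logP": [1.0, 2.0, 3.0, 4.0],
    "HBD": [4.0, 3.0, 2.0, 1.0],
}


def autoscale(values):
    """Mean-centre, then divide by the (population) standard deviation."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("descriptor shows no variation and cannot be scaled")
    return [(x - centre) / spread for x in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two descriptors."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def correlation_matrix(table):
    """Return {(name_a, name_b): r} for every pair of descriptors."""
    return {
        (a, b): pearson_r(table[a], table[b])
        for a in table
        for b in table
    }


scaled_mw = autoscale(DESCRIPTORS["MW"])
matrix = correlation_matrix(DESCRIPTORS)
for (a, b), r in matrix.items():
    print(f"{a:>5} vs {b:<5} r = {r:+.2f}")
```

In this toy table MW and logP rise together (r close to +1) while HBD falls as MW rises (r close to −1), so keeping all three would over-represent essentially one piece of information.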
3.4.4 Reducing the Dimensionality of a Data Set: PCA
The dimensionality of a data set is the number of variables that are used to describe each object.
Principal Components Analysis (PCA)
• commonly used method for reducing the dimensionality of a data set when there are significant correlations between some or all of the descriptors
• provides a new set of variables that have some special properties
• it is often found that much of the variation in the data set can be explained by a small number of principal components
• the principal components are also much more convenient for graphical data display and analysis
In a two-variable example (x1, x2):
• there is a high correlation between the x1 and the x2 values
• most of this variation can be explained by introducing a single variable that is a linear combination of these (i.e. z = x1 − x2)
• the new variable (z) is referred to as a principal component
In general, a principal component is a linear combination of the original variables or descriptors:
$$PC_i = c_{i,1} x_1 + c_{i,2} x_2 + \dots + c_{i,p} x_p$$
where $c_{i,j}$ is the coefficient of descriptor $x_j$ in the i-th principal component and p is the number of descriptors.
The loadings plot indicates the coefficients for each of the descriptors in the various principal components.
The scores plot shows how the various amino acids relate to each other in the space of the first two principal components.
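The idea of replacing two correlated descriptors by one principal component can be sketched in plain Python. This is an illustrative sketch, not the procedure used in the lecture: it extracts only the first principal component, uses power iteration on the covariance matrix as a simple stand-in for a full eigendecomposition, and runs on invented toy data in which x2 falls as x1 rises. All names are hypothetical.

```python
from math import sqrt

# Invented toy data: each row is one object described by (x1, x2).
ROWS = [(1.0, 5.0), (2.0, 3.8), (3.0, 3.2), (4.0, 1.9), (5.0, 1.1)]


def centre_columns(rows):
    """Subtract the column mean from every value (mean centering)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    return [[row[k] - means[k] for k in range(p)] for row in rows]


def covariance_matrix(centred):
    """Sample covariance matrix of mean-centred data."""
    n, p = len(centred), len(centred[0])
    return [
        [sum(row[a] * row[b] for row in centred) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def first_principal_component(cov, iterations=200):
    """Power iteration: dominant eigenvector (loadings) and its eigenvalue.

    The start vector (1, 0, ..., 0) is an arbitrary choice; it would fail
    only if it were exactly orthogonal to the dominant eigenvector.
    """
    p = len(cov)
    vec = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        product = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    eigenvalue = sum(
        vec[a] * sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)
    )
    return vec, eigenvalue


centred = centre_columns(ROWS)
cov = covariance_matrix(centred)
loadings, eigenvalue = first_principal_component(cov)

# Fraction of the total variance captured by the first component.
explained = eigenvalue / sum(cov[k][k] for k in range(len(cov)))

# Scores: coordinates of each object along the first principal component.
scores = [sum(c * x for c, x in zip(loadings, row)) for row in centred]

print("loadings:", [round(c, 3) for c in loadings])
print("explained variance fraction:", round(explained, 4))
```

For this toy data the two loadings have opposite signs and similar magnitude, so the first component behaves like the z = x1 − x2 combination mentioned in the text and accounts for nearly all of the variance.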