Analysis of High-Throughput Screening Data
University of the Philippines Manila
Junie B. Billones, PhD
These lecture notes from the University of the Philippines Manila cover the analysis of high-throughput screening data. Topics include data visualization, cluster analysis, and different methods for selecting compounds.
6 Analysis of High-Throughput Screening Data

Topics: Data Visualization; Data Mining Methods

6.1 Data Visualization

Several packages are now available for the graphical display of large data sets. These packages are able to draw various kinds of graphs, to color data points according to selected properties, and to calculate simple statistics.

Figure: Graphical representation of the property distributions of active (large spheres), moderately active (medium spheres) and inactive (small spheres) compounds from the NCI AIDS data set.

6.1.1 Selecting Diverse Sets of Compounds

HTS data sets are often so large that it may help to divide the molecules into subsets in order to ease navigation through the data.

Cluster analysis
• aims to divide a group of objects into clusters so that the objects within a cluster are similar, but objects taken from different clusters are dissimilar
• clustering algorithms attempt to group together similar objects
• a representative object might then be chosen from each cluster (open circles in the accompanying figure)

Key steps involved in cluster-based compound selection (a code sketch of the full workflow appears at the end of this discussion):
1. Generate descriptors for each compound in the data set.
2. Calculate the similarity or distance between all compounds in the data set.
3. Use a clustering algorithm to group the compounds within the data set.
4. Select a representative subset by selecting one (or more) compounds from each cluster.

Hierarchical Clustering Methods
• organize compounds into clusters of increasing size, with small clusters of related compounds being grouped together into larger clusters
• at one extreme each compound is in a separate cluster; at the other extreme all the compounds are in one single cluster [Murtagh 1983]
• the relationships between the clusters can be visualized using a dendrogram

Figure: A dendrogram representing a hierarchical clustering of seven compounds.

Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods
• differ in the way in which the similarity between two clusters is measured

Single linkage (nearest neighbor) method
• the distance between a pair of clusters is equal to the minimum distance between any two compounds, one from each cluster

Complete linkage (furthest neighbor) method
• the intercluster distance is measured by the distance between the furthest pair of compounds, one from each cluster

Group average method
• measures the intercluster distance as the average of the distances between all pairs of compounds in the two clusters

Figure: Schematic illustration of the methods used by (from left) the single linkage, complete linkage and group average approaches to calculate intercluster distances.

Selecting the Appropriate Number of Clusters
• choose a level from the hierarchy in order to define the number of clusters
• draw an imaginary line across the dendrogram
• the number of vertical lines that it intersects equals the number of clusters

Figure: Choosing the level from the hierarchy defines the number of clusters present (in this case, four clusters).
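The sketch below runs the four selection steps end to end with SciPy, cutting the dendrogram at a chosen number of clusters as just described. It is a minimal sketch: the descriptor matrix is a random placeholder standing in for real computed descriptors, and `method="single"`, `"complete"` or `"average"` selects among the SAHN variants above.

```python
# Minimal sketch of cluster-based compound selection (placeholder data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
descriptors = rng.random((100, 8))      # 100 compounds x 8 descriptors

# Steps 2-3: compute distances and build the hierarchy.
# method="single", "complete" or "average" picks the SAHN variant.
Z = linkage(descriptors, method="average", metric="euclidean")

# Step 4: cut the dendrogram to give a chosen number of clusters...
labels = fcluster(Z, t=4, criterion="maxclust")

# ...and pick one representative per cluster, here the compound
# closest to its cluster centroid.
representatives = []
for k in np.unique(labels):
    members = np.where(labels == k)[0]
    centroid = descriptors[members].mean(axis=0)
    dists = np.linalg.norm(descriptors[members] - centroid, axis=1)
    representatives.append(members[np.argmin(dists)])
print(representatives)
```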
The different cluster groupings obtained can be compared using the Jaccard statistic (a code sketch appears at the end of this section):

$$J(C_1, C_2) = \frac{a}{a + b + c}$$

• C1 and C2 are two different clusterings
• a is the number of pairs of compounds that are clustered together in both clusterings
• b is the number of pairs of compounds that cluster together in the first clustering but not the second
• c is the number of pairs of compounds that cluster together in the second clustering but not the first
• note that the Jaccard statistic is identical to the Tanimoto coefficient

Cell-Based or Partitioning methods
• operate within a predefined low-dimensional chemistry space [Mason 1994]
• each property is plotted along a separate orthogonal axis and is divided into a series of value ranges to give a series of "bins"
• the combinatorial product of the bins for all the properties defines a set of cells that covers the space

If there are N axes (properties) and axis i is divided into $b_i$ bins, then the number of cells in the multidimensional space is:

$$N_{\text{cells}} = \prod_{i=1}^{N} b_i$$

A representative subset can be selected by choosing one or more compounds from each cell (a code sketch appears at the end of this section).

Figure: The construction of a 2D chemistry space. In this case the log P bins are <0, 0–3, 3–7 and >7 and the MW bins are 0–250, 250–500, 500–750 and >750, giving 4 × 4 = 16 cells.

6.1.2 Nonlinear Mapping
• HTS data sets are usually multidimensional in nature
• to visualize a multidimensional data set, map it to a lower-dimensional (2D or 3D) space
• the objective is to reproduce the distances of the higher-dimensional space in the low-dimensional one
• it is often particularly desirable that objects close together in the high-dimensional space remain close together in the low-dimensional space
• with 2D or 3D data one may then use graphics or other visual cues to identify structure–activity relationships, for example by coloring the data points according to their biological activity

Multidimensional Scaling [Kruskal 1964; Cox and Cox 1994]
• an initial set of coordinates is generated in the low-dimensional space, e.g. by principal components analysis (PCA)
• the coordinates are then modified using a mathematical optimization procedure that improves the correspondence between the distances in the low-dimensional space and those in the original multidimensional space

Kruskal's stress:

$$S = \sqrt{\frac{\sum_{i<j} \left(d_{ij} - D_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}$$

where $d_{ij}$ is the distance between the two objects i and j in the low-dimensional space and $D_{ij}$ is the corresponding distance in the original multidimensional space.

The optimization continues until the value of the stress function falls below a threshold value.

Sammon mapping [Sammon 1969]
• uses the stress function

$$E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j} \frac{\left(D_{ij} - d_{ij}\right)^2}{D_{ij}}$$

• places more emphasis on the smaller distances, due to the presence of the $1/D_{ij}$ normalization term (a code sketch of both stress functions appears at the end of this section)

Difference between Kruskal and Sammon mapping:

Figure: A rectangular box with sides of length 1, 1 and 10 (left) was mapped into 2D using both the Kruskal (top graph) and Sammon (bottom graph) stress functions. The Sammon function tends to preserve the shorter distances better than the Kruskal function.

6.2 Data Mining Methods
• widely used to identify relationships in large, multidimensional data sets
• the key objective is the construction of models that enable relationships to be identified between chemical structure and observed activity
• traditional QSAR methods such as multiple linear regression (MLR) are not generally applicable
• HTS data sets classify the molecules as "active" or "inactive", or into a small number of activity classes (e.g. "high", "medium", "low"), rather than using the numerical activity
• the aim is to derive a computational model that enables the activity class of new structures to be predicted
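Before the individual data mining methods are taken up, a few sketches make the Section 6.1 machinery concrete. First, the Jaccard comparison of two clusterings, counting pairs of compounds exactly as defined above; the cluster labels in the example are illustrative placeholders.

```python
# Minimal sketch of the Jaccard comparison of two clusterings.
from itertools import combinations

def jaccard_statistic(c1, c2):
    """c1, c2: cluster labels for the same compounds, in the same order."""
    a = b = c = 0
    for i, j in combinations(range(len(c1)), 2):
        same1 = c1[i] == c1[j]      # pair together in the first clustering?
        same2 = c2[i] == c2[j]      # pair together in the second?
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
    return a / (a + b + c)

print(jaccard_statistic([1, 1, 2, 2, 3], [1, 1, 1, 2, 2]))  # 0.2
```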
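Next, cell-based selection in the 2D chemistry space of the figure above, using `numpy.digitize` to assign each compound to its (log P, MW) cell. The property values here are randomly generated placeholders.

```python
# Minimal sketch of cell-based selection in a 2D chemistry space.
import numpy as np

rng = np.random.default_rng(1)
logp = rng.uniform(-2, 9, 50)           # illustrative property values
mw = rng.uniform(100, 900, 50)

logp_edges = [0, 3, 7]                  # bins: <0, 0-3, 3-7, >7
mw_edges = [250, 500, 750]              # bins: 0-250, 250-500, 500-750, >750

cells = {}
for i, (p, m) in enumerate(zip(logp, mw)):
    cell = (np.digitize(p, logp_edges), np.digitize(m, mw_edges))
    cells.setdefault(cell, []).append(i)

# One compound per occupied cell; 4 x 4 = 16 cells in total.
subset = [members[0] for members in cells.values()]
print(len(subset), "compounds selected from", len(cells), "occupied cells")
```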
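Finally, the Kruskal and Sammon stress functions written directly from the formulas in Section 6.1.2. This assumes D and d are precomputed square distance matrices for the original and low-dimensional spaces; the data below are random placeholders rather than a real embedding.

```python
# Minimal sketch of the two stress functions of Section 6.1.2.
import numpy as np

def kruskal_stress(D, d):
    iu = np.triu_indices_from(D, k=1)            # each pair once
    return np.sqrt(np.sum((d[iu] - D[iu]) ** 2) / np.sum(d[iu] ** 2))

def sammon_stress(D, d):
    iu = np.triu_indices_from(D, k=1)
    # the 1/D_ij term weights short original distances more heavily
    return np.sum((D[iu] - d[iu]) ** 2 / D[iu]) / np.sum(D[iu])

rng = np.random.default_rng(7)
high = rng.random((30, 10))                      # original space points
low = rng.random((30, 2))                        # low-dimensional points
D = np.sqrt(((high[:, None] - high[None]) ** 2).sum(-1))
d = np.sqrt(((low[:, None] - low[None]) ** 2).sum(-1))
print(kruskal_stress(D, d), sammon_stress(D, d))
```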
Such a model could be used, for example, to:
1. select additional compounds for testing
2. design combinatorial libraries
3. select compounds for acquisition from external vendors

6.2.1 Substructural Analysis

Substructural analysis (SSA) [Cramer 1974]
• related to the Free–Wilson approach
• each substructural fragment makes a constant contribution to the activity, independent of the other fragments in the molecule
• the aim is to derive a weight for each substructural fragment that reflects its tendency to be in an active or an inactive molecule
• the sum of the weights for all of the fragments contained within a molecule gives the score for the molecule
• this enables a new set of structures to be ranked in decreasing probability of activity

A simple function that defines the weight of a fragment i is:

$$w_i = \frac{act_i}{act_i + inact_i}$$

where $act_i$ is the number of active molecules that contain the ith fragment and $inact_i$ is the number of inactive molecules that contain the ith fragment (a code sketch follows after Section 6.2.3).

6.2.2 Discriminant Analysis

Discriminant analysis
• separates the molecules into their constituent classes
• the simplest type is linear discriminant analysis [McFarland and Gans 1990]
• a linear discriminant analysis aims to find a line that best separates the active and the inactive molecules

Figure: The dotted line is the discriminant function; the solid line is the corresponding discriminant surface. Note that in this case it is not possible to find a line that completely separates the two types of data points.

A linear discriminant analysis is characterized by a discriminant function W, which is a linear combination of the independent variables $(x_1, x_2, \ldots, x_n)$:

$$W = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$

• feeding the appropriate descriptor values into this equation enables the value of the discriminant function to be computed for a new molecule
• values above a threshold correspond to one activity class and lie on one side of the discriminant surface; values below the threshold correspond to the other activity class (a code sketch follows after Section 6.2.3)

Billones et al. Phil J Health Res Dev. 2019, 23(4): 11-16.

6.2.3 Neural Networks

Feed-Forward Neural Network
• consists of layers of nodes with connections between all pairs of nodes in adjacent layers
• a supervised learning method, i.e. it makes use of the values of the dependent variable during training

Figure: A feed-forward neural network with seven input nodes, three hidden nodes and one output node.

• each node exists in a state between 0 and 1
• the state of each node depends on the states of the nodes to which it is connected in the previous layer and on the strengths (weights) of these connections
• the network must first be "trained"; this is achieved by repeatedly providing it with a set of inputs and the corresponding outputs from a training set
• each node in the input layer may correspond to one of the descriptors used to characterize each molecule
• the weights and other parameters in the network are initially set to random values, so the initial outputs may differ from the desired outputs
• once the network has been trained, it can be used to predict the values for new, unseen molecules
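The following sketches illustrate the three methods just described. First, substructural analysis: fragment weights $w_i = act_i/(act_i + inact_i)$ and summed molecule scores, with molecules represented as plain sets of fragment identifiers (an illustrative simplification of real fragment fingerprints).

```python
# Minimal sketch of substructural-analysis scoring (placeholder data).
def fragment_weights(molecules, active):
    """molecules: list of fragment sets; active: parallel list of bools."""
    weights = {}
    for f in set().union(*molecules):
        act = sum(1 for m, a in zip(molecules, active) if a and f in m)
        inact = sum(1 for m, a in zip(molecules, active) if not a and f in m)
        weights[f] = act / (act + inact)    # w_i = act_i / (act_i + inact_i)
    return weights

def score(molecule, weights):
    # sum of fragment weights gives the molecule's SSA score
    return sum(weights.get(f, 0.0) for f in molecule)

mols = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"C"}]
act = [True, True, False, False]
w = fragment_weights(mols, act)
print(sorted(w.items()))
print(score({"A", "B"}, w))                 # rank new structures by score
```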
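Next, a linear discriminant analysis with scikit-learn; the fitted `coef_` and `intercept_` correspond to the coefficients of the discriminant function above. The descriptor data are synthetic placeholders.

```python
# Minimal sketch of linear discriminant analysis (synthetic data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)),    # "inactive" cloud
               rng.normal(1.5, 1.0, (40, 5))])   # "active" cloud
y = np.array([0] * 40 + [1] * 40)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)         # coefficients of the function
print(lda.predict(rng.normal(0.7, 1.0, (3, 5))))  # classify new molecules
```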
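Finally, a feed-forward network trained with scikit-learn's MLPClassifier, echoing the 7-3-1 architecture of the figure above. The data and the seven-descriptor setup are illustrative placeholders.

```python
# Minimal sketch of training a 7-3-1 feed-forward network.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X_train = rng.random((200, 7))            # 7 input nodes = 7 descriptors
y_train = (X_train.sum(axis=1) > 3.5).astype(int)   # toy activity classes

net = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X_train, y_train)                 # "training": weights adjusted
print(net.predict(rng.random((5, 7))))    # predict unseen molecules
```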
Kohonen Network or Self-Organizing Map (SOM)
• an unsupervised learning method
• consists of a rectangular array of nodes
• each node has an associated vector of values
• during training, the size of the neighborhood around each winning node is gradually reduced
• to eliminate edge effects, a Kohonen network may be constructed so that opposite sides of the rectangle are joined together, as indicated by the arrows in the figure

Figure: Kohonen network showing neighborhood behavior; with wrap-around, the immediate neighbors of the bottom-right node would be those shaded in grey.

• each node has an associated vector that corresponds to the input data (i.e. the molecular descriptors)
• each of these vectors initially consists of small random values
• the data is presented to the network one molecule at a time, and the distance between the molecule vector and each of the node vectors is determined using the following distance metric:

$$d = \sqrt{\sum_i \left(x_i - v_i\right)^2}$$

where $v_i$ is the value of the ith component of the vector v for the node in question and $x_i$ is the corresponding value for the input vector.
• the vector of the winning node (the one closest to the input vector) is identified and then updated

By modifying the vectors of not only the winning node but also its neighbors, the Kohonen network creates regions containing similar nodes (a training sketch appears at the end of these notes).

Figure: Classification of drug and non-drug molecules using a Kohonen network. The distribution of the two types of molecules is indicated by grey shading, with dark regions being dominated by drugs and light regions by non-drugs.

6.2.4 Decision Trees

• a feed-forward neural network does not provide an explanation of the result for a given input, due to the complex nature of the interconnections between the nodes

Decision trees
• are very interpretable
• consist of a set of "rules" that provide the means to associate specific molecular features and/or descriptor values with the activity or property of interest
• a decision tree is commonly depicted as a tree-like structure, with each node corresponding to a specific rule
• each rule corresponds to the presence or absence of a particular feature, or to the value of a descriptor

Figure: A decision tree for a set of 50 active and 12 inactive sumazole and isomazole analogues. The values indicate the number of active and inactive molecules at each node.

• to classify an unknown molecule, a path is followed through the tree according to the values of the relevant properties, until a terminal node is reached

Ensemble approaches involve the construction of collections of trees, each of which is generated by training on a subset of the data set (a combined sketch appears at the end of these notes).
• the data subsets may be generated using a bootstrap method, in which the data set is sampled with replacement (i.e. there may be duplicates in the subset and some molecules may not appear)

Bagging
• trees are repeatedly generated on bootstrap samples, and new molecules are classified using a majority voting mechanism

Random Forest [Breiman 2001]
• an extension of bagging, in which a small subset of the descriptors is randomly selected at each node rather than using the full set

Boosting [Quinlan 1996]
• each tree is designed to improve the performance for data points misclassified by its predecessor, by giving more weight to such points
• a voting scheme then gives a higher weight to the predictions of these more "expert" trees

6.2.5 Support Vector Machines and Kernel Methods

Support Vector Machine (SVM)
• a popular classification technique
• attempts to find a boundary, or hyperplane, that separates two classes of compounds
• the hyperplane is positioned using examples in the training set known as the support vectors
• the use of only a subset of the training data helps to prevent overtraining
• when the data cannot be separated linearly, kernel functions can be used to transform it into a higher-dimensional space
• molecules in the test set are mapped to the same feature space, and their activity is predicted according to which side of the hyperplane they fall on
• the greater the distance to the boundary, the higher the confidence in the prediction

Disadvantage: an SVM is a black-box method, making it more difficult to interpret the results.
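To close these notes, three further sketches. First, a minimal Kohonen/SOM training loop in plain numpy, following the winning-node update described above. The grid size, learning-rate and neighborhood schedules are illustrative choices, and the wrap-around (edge-joining) variant is omitted for brevity.

```python
# Minimal sketch of Kohonen/SOM training (placeholder data).
import numpy as np

rng = np.random.default_rng(4)
data = rng.random((200, 5))              # molecular descriptor vectors
rows, cols = 6, 6
nodes = rng.random((rows, cols, 5))      # random initial node vectors

for epoch in range(20):
    radius = max(1, int(3 * (1 - epoch / 20)))   # shrinking neighborhood
    alpha = 0.5 * (1 - epoch / 20)               # shrinking learning rate
    for x in data:
        # winning node: smallest Euclidean distance to the input vector
        d = np.sqrt(((nodes - x) ** 2).sum(axis=2))
        wr, wc = np.unravel_index(np.argmin(d), d.shape)
        # move the winner and its grid neighbors toward the input
        for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                nodes[r, c] += alpha * (x - nodes[r, c])
```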
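Next, a single decision tree and the three ensemble variants of Section 6.2.4, sketched with scikit-learn; the descriptors and activity classes are synthetic placeholders. Note how `export_text` exposes the interpretable rules that a neural network cannot provide.

```python
# Minimal sketch of a decision tree plus bagging, random forest, boosting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier)

rng = np.random.default_rng(5)
X = rng.random((300, 6))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)    # toy activity classes

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree))                     # the interpretable "rules"

# bagging: trees on bootstrap samples + majority vote
bag = BaggingClassifier(n_estimators=50).fit(X, y)
# random forest: bagging + random descriptor subset at each node
forest = RandomForestClassifier(n_estimators=50).fit(X, y)
# boosting: each tree reweights previously misclassified points
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(bag.predict(X[:5]), forest.predict(X[:5]), boost.predict(X[:5]))
```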
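Finally, a minimal SVM sketch with scikit-learn; the RBF kernel stands in for the kernel transformation described above, and the data are synthetic placeholders. The distance from the boundary, via `decision_function`, serves as the rough confidence measure mentioned above.

```python
# Minimal sketch of an SVM classifier with an RBF kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.random((200, 6))
y = (np.sin(3 * X[:, 0]) + X[:, 1] > 1.0).astype(int)  # non-linear classes

svm = SVC(kernel="rbf").fit(X, y)
print(len(svm.support_))                 # number of support vectors used

X_new = rng.random((5, 6))
print(svm.predict(X_new))                # which side of the hyperplane
print(svm.decision_function(X_new))      # distance from the boundary
```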