
Full Transcript

LECTURE 6, OCTOBER 1, 2024

OUTLINE
- Machine learning intro
- Feature selection
- Decision tree
- Random forest
- KNN
- Classification

WHAT IS THE DIFFERENCE?
Computer science, biological sciences, mathematical sciences, statistics, and, cutting across them, computational science and data science.

MACHINE LEARNING
Machine learning sits at the intersection of biology, mathematics/statistics, and computer science.

DEFINING THE PROBLEM
- Classification vs prediction?
- More accurate, more sensitive, or more specific?
- High throughput or limited use?
- What features are necessary?
- What samples to use? How many samples to use?

CLASSIFICATION VS PREDICTION
- Classification: categorize data into groups based on shared characteristics. Is this a GPCR? Is this a hybrid duck?
- Prediction: determine the outcome of a future event. Will a site be glycosylated? Will the stock market go up or down?

FEATURE SELECTION
How to decide what features are important? Are they meaningful? Correlated?
When to select features?
- Filter (univariate & multivariate) – test for correlations
- Wrapper – selects and tests groups of features
- Embedded – part of the algorithm itself
(A filter-style sketch appears after the transcript.)

DECISION TREE
Can work with continuous, discrete, and categorical data.
General algorithm:
1. Determine the best split
2. Move the samples along the tree
3. Repeat the prior two steps
[Diagram: example tree that splits first on height (>200 cm), then on weight (50 kg), ending in male/female leaves]
(A decision-tree sketch appears after the transcript.)

RANDOM FOREST
General algorithm:
1. Subset the features, sampling with replacement
2. Construct decision trees on the subsets
3. Use an ensemble method to perform the final prediction
(A random-forest sketch appears after the transcript.)

RANDOM FOREST VARIABLE IMPORTANCE
[Figure: "Random Forest with ISOGlyP Values, Top 20": the 20 features with the highest MeanDecreaseGini (roughly 0–20), led by netsurf_asa and netsurf_rsa, followed by the IsoGlyP.T* predictors (T3, T5, T13, T1, T12, Max, T2, T14, T16, T11, T10) and the scratch_*, pseb_s_*, and espritz_* features]
[Figure: MeanDecreaseGini (0–30) plotted across all features by feature index (0–400)]
[Figure: randomForest cross-validation for the number of features to use: cross-validated error (error.cv, about 0.26–0.38) against the number of variables retained (n.var, 1–500)]
(A feature-count cross-validation sketch appears after the transcript.)

K NEAREST NEIGHBORS
- Helps group a sample with its closest K matches
- Supervised classification: uses the K nearest neighbors, those closest to the unknown sample, to determine its class
- Data must be binary or continuous; the data can be transformed, e.g. by principal component analysis
- Uses distance metrics:
  - Euclidean – the Cartesian (straight-line) distance between two points
  - Manhattan distance – the sum of the absolute differences between the points' coordinates across dimensions
  - Jaccard similarity coefficient – presence/absence overlap between two sets
(A KNN sketch appears after the transcript.)

K-MEANS CLUSTERING
- Helps group samples into K clusters
- Unsupervised classification: splits all the samples into K clusters such that each sample is closest to its assigned centroid
- Initially, centroids are determined by randomly selecting points from the data
(A k-means sketch appears after the transcript.)
Image source: https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/
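
CODE SKETCHES

The sketches below are illustrative only: every dataset is a synthetic stand-in generated with scikit-learn, and all variable names are hypothetical, not from the lecture. First, a minimal filter-style (univariate) feature selection pass, assuming scikit-learn's SelectKBest with an ANOVA F-test as the per-feature score:

```python
# Univariate filter: score each feature against the labels
# independently, then keep only the top-k scorers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a real dataset (hypothetical values).
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = selector.fit_transform(X, y)

print("Kept feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_filtered.shape)  # (200, 5)
```

A wrapper method would instead search over groups of features by retraining a model on each candidate subset; an embedded method (e.g. a tree's own split selection) does the selection inside the learning algorithm.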
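
A toy decision tree in the spirit of the slide's height/weight example; the heights, weights, and labels below are invented for illustration:

```python
# Toy decision tree mirroring the slide's height/weight example.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [height_cm, weight_kg]; labels: 0 = female, 1 = male
X = np.array([[210, 95], [205, 48], [180, 60], [175, 45],
              [195, 80], [160, 52], [202, 49], [198, 70]])
y = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# The tree repeatedly finds the best split (Gini impurity by
# default), routes samples down each branch, and recurses.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["height_cm", "weight_kg"]))
```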
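
The variable-importance plots in the lecture appear to come from R's randomForest package. An analogous sketch in scikit-learn, whose feature_importances_ attribute is the mean decrease in Gini impurity (the counterpart of MeanDecreaseGini), again on synthetic data:

```python
# Random forest with impurity-based variable importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=0)

# Each tree is grown on a bootstrap sample; a random subset of
# features is considered at every split, and the forest's final
# call is the ensemble (majority) vote across trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

top = np.argsort(forest.feature_importances_)[::-1][:10]
for i in top:
    print(f"feature {i}: importance {forest.feature_importances_[i]:.3f}")
```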
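
The cross-validation figure matches the output of R's randomForest::rfcv (error.cv against n.var). A simplified scikit-learn analogue that ranks features once and then cross-validates forests on the top k; rfcv proper re-ranks within each fold, so treat this only as an approximation:

```python
# Rough analogue of randomForest::rfcv: estimate cross-validated
# error while shrinking the feature set to the top-k most important.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=100,
                           n_informative=10, random_state=0)

# Rank features once, by importance from a forest fit on all of them.
ranking = np.argsort(
    RandomForestClassifier(n_estimators=200, random_state=0)
    .fit(X, y).feature_importances_)[::-1]

for k in (100, 50, 20, 10, 5, 2, 1):
    cols = ranking[:k]
    acc = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X[:, cols], y, cv=5).mean()
    print(f"n.var={k:3d}  cv error={1 - acc:.3f}")
```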
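
A minimal KNN classification sketch comparing the Euclidean and Manhattan metrics named on the slide; the Jaccard metric applies only to binary presence/absence vectors, so it is noted in a comment rather than run on this continuous data:

```python
# K-nearest-neighbors classification with two of the slide's metrics.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "euclidean" is the Cartesian straight-line distance;
# "manhattan" sums absolute coordinate differences;
# metric="jaccard" exists too, but only for binary data.
for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_tr, y_tr)
    print(metric, "accuracy:", round(knn.score(X_te, y_te), 3))
```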
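
Finally, a k-means sketch; init="random" mirrors the slide's description of choosing initial centroids by randomly selecting points from the data (scikit-learn's default, k-means++, is usually more robust):

```python
# K-means: assign each sample to its nearest centroid, recompute
# the centroids, and repeat until the assignments stabilize.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic blobs stand in for real data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print("Centroids:\n", km.cluster_centers_)
```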
