Summary

This presentation introduces machine learning basics with a focus on applications in bioinformatics. It covers types of machine learning, objective and loss functions, gradient descent, dimensionality reduction, and variable selection.

Full Transcript

Machine Learning Basics (Medical Artificial Intelligence), So Yeon Kim

Outline
- Introduction to Bioinformatics
- Machine learning basics
- Objective function
- Dimensionality reduction
- Variable selection

Introduction to Bioinformatics

What is Bioinformatics?
"Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex." It combines biology, chemistry, physics, computer science, information engineering, mathematics, and statistics to analyze and interpret biological data.

Biological data
- Clinical data
- Medical imaging
- Genetic data
- Medical signal

When to use?
- Which patients are at high risk for developing cancer?
- What are early biomarkers of cancer/disease?
- Which patients are likely to be short/long term cancer survivors?
- What chemotherapeutic might a cancer patient benefit from?
- ... and many complex problems

What can we do?
- Precision medicine (https://www.efpia.eu/about-medicines/development-of-medicines/precision-medicine/)
- Survival analysis / prediction
- Cancer subtype clustering

Tools and Languages

Machine learning basics

What is Machine Learning?
- Traditional programming: Data and a Program go into the computer, which produces the Output.
- Machine learning: Data and the desired Output go into the computer, which produces the Program (the model).

Machine learning analysis pipeline
1. Problem definition
2. Data collection
3. Data preprocessing / cleaning
4. Exploratory Data Analysis
5. Modeling
6. Communicating and interpreting results

Classifying Tumors with Array Data
- Task: classify Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
- Data: 6,817 genes x 38 samples (leukemia patients at the time of diagnosis)
- Select 50 informative genes (variable selection)
- Train a classifier (classification)
- Evaluate (cross-validation / independent-set validation)
- Golub, T. R., et al. "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring." Science 286 (1999).

Types of Machine learning
- Supervised learning: training data includes desired outputs
- Unsupervised learning: training data does not include desired outputs
- Semi-supervised learning: training data includes a few desired outputs

Supervised learning (Classification)
- Support Vector Machine (SVM)
- K-Nearest Neighbor
- Decision trees
- Random forest
- Naïve Bayes
- Logistic regression
- Neural networks
- Bayesian networks

Supervised learning (Regression)
- Simple Linear Regression
- Multivariate Linear Regression

Unsupervised learning (Clustering)
- Cancer subtype clustering
- Clustering on single-cell transcriptomic data

Objective function

Objective functions
- Optimization problem: pick the parameters $\theta$ that minimize the loss, $\min_\theta \mathcal{L}(y, f_\theta(x))$
- $\mathcal{L}$ is the loss/cost/error function

Loss function (Classification)
- Labels are discrete values
- 0-1 loss
- Cross-entropy (CE) loss: $\mathrm{Loss} = \sum_{i=1}^{N} \mathrm{CE}(\hat{y}_i, y_i)$

Loss function (Regression)
- Labels are continuous values
- Mean Absolute Error (MAE, $L_1$ norm)
- Mean Squared Error (MSE, $L_2$ norm): $\mathrm{Loss} = \sum_{i=1}^{N} \mathrm{MSE}(\hat{y}_i, y_i)$

Gradient descent
How do we optimize the objective function, i.e., minimize the loss function? Learn the model parameters $\theta$ by iterating until convergence:
1. For all $i$, compute the derivative $\frac{\partial \mathcal{L}}{\partial \theta_i}$
2. For all $i$, take a step (with learning rate $\eta$) in the direction opposite the derivative: $\theta_i \leftarrow \theta_i - \eta \frac{\partial \mathcal{L}}{\partial \theta_i}$
(Image from https://seamless.tistory.com/38)

Stochastic Gradient Descent (SGD)
- Uses a sampled gradient, an unbiased estimator of the full gradient
- However, there is no guarantee on the rate of convergence, and it often requires tuning of the learning rate
- Other optimizers improve over SGD: Adam, Adagrad, Adadelta, etc.
(Image from https://seamless.tistory.com/38)
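To make the update rule concrete, here is a minimal NumPy sketch of full-batch gradient descent and its stochastic variant on a least-squares (MSE) problem. The toy data, learning rates, and iteration counts are illustrative choices, not values from the lecture.

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise (illustrative, not from the lecture).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=100)

# Design matrix with a bias column; theta holds [b0, b1].
Xb = np.hstack([np.ones((100, 1)), X])
eta = 0.1  # learning rate

def mse_grad(Xb, y, theta):
    """Gradient of the MSE loss (1/N) * sum (y_hat - y)^2 w.r.t. theta."""
    residual = Xb @ theta - y
    return 2.0 / len(y) * Xb.T @ residual

# Full-batch gradient descent: theta_i <- theta_i - eta * dL/dtheta_i
theta = np.zeros(2)
for _ in range(200):
    theta -= eta * mse_grad(Xb, y, theta)

# Stochastic variant: each step uses one random sample, which gives an
# unbiased but noisy estimate of the full gradient.
theta_sgd = np.zeros(2)
for _ in range(2000):
    i = rng.integers(len(y))
    theta_sgd -= 0.01 * mse_grad(Xb[i:i+1], y[i:i+1], theta_sgd)

print(theta, theta_sgd)  # both should approach [2, 3]
```

The stochastic loop is cheaper per step but noisier, which is why in practice the learning rate usually needs more tuning, as the slide notes.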
Dimensionality reduction

High-dimensional data means lots of features: social networks, documents, surveys, gene networks, brain imaging, etc. Such data
- suffers from the curse of dimensionality,
- gets worse with redundant features,
- is hard to interpret and visualize,
- is computationally and statistically challenging.

Two broad approaches:
- Feature selection: select the features relevant to the learning task. What if we have gene expression data with 1,000 genes (variables) but only 50 samples?
- Latent features: some linear or nonlinear combination of features provides a more efficient representation than the observed features. In $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$, which variables should we choose?

Dimensionality reduction methods
- Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
- Nonlinear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE), t-SNE, UMAP, AutoEncoder
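As a concrete example of one of the linear methods listed above, here is a small sketch using scikit-learn's PCA on a synthetic many-genes, few-samples matrix; the data and shapes are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "expression-like" matrix: 50 samples x 1000 features,
# mirroring the many-genes, few-samples setting above (values are random).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))

# Project onto the top 2 principal components for visualization.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)  # shape (50, 2)

print(Z.shape)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```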
Variable selection

Select the variables that are highly associated with the target variable, e.g., the predictors in a multivariate linear regression.

- Exhaustive Search: consider all the possibilities. With $p$ variables, we compare model performance over $2^p$ candidate models.
- Forward Selection: start with no variables; at each iteration, add the most significant variable (e.g., lowest p-value); keep adding until a stopping rule is reached. A sketch follows this list.
- Backward Elimination: start with all variables; at each iteration, remove the least significant variable (e.g., largest p-value); keep removing until a stopping rule is reached.
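Below is a minimal sketch of p-value-driven forward selection using statsmodels OLS. The synthetic dataset (only the first two features drive the target) and the alpha threshold are illustrative assumptions, not from the lecture. Backward elimination is the mirror image: start from all columns and repeatedly drop the largest p-value.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: only features 0 and 1 truly drive y (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=50)

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the candidate variable
    with the lowest p-value, stopping when none falls below alpha."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # Fit one model per candidate; record the candidate's p-value.
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals[j] = fit.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:  # stopping rule
            break
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(X, y))  # expected: [0, 1]
```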
Thank You! Q&A