Hang Li, Machine Learning Methods (Springer, 2023)



Hang Li
Machine Learning Methods
Translated by Lu Lin and Huanqiang Zeng

Hang Li
Bytedance Technology
Beijing, China

Translated by
Lu Lin and Huanqiang Zeng
Huaqiao University
Quanzhou, Fujian, China

ISBN 978-981-99-3916-9    ISBN 978-981-99-3917-6 (eBook)
https://doi.org/10.1007/978-981-99-3917-6

Jointly published with Tsinghua University Press. The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Tsinghua University Press.

Translation from the Chinese Simplified language edition: "统计学习方法 (第2版)" by Hang Li, © Tsinghua University Press 2019. Published by Tsinghua University Press. All Rights Reserved.

© Tsinghua University Press 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore.

Preface

This book is a popular machine learning textbook and reference book in China. The first edition, published in March 2012 under the title Statistical Learning Methods, focused on supervised learning, including the perceptron, k-nearest neighbor method, naive Bayes method, decision tree, logistic regression, maximum entropy model, support vector machine, boosting method, EM algorithm, hidden Markov model, and conditional random field. The second edition was published in May 2019, with additional content on unsupervised learning, including clustering methods, singular value decomposition, principal component analysis, latent semantic analysis, probabilistic latent semantic analysis, the Markov chain Monte Carlo method, latent Dirichlet allocation, and the PageRank algorithm. By the end of 2021, the two editions had been printed more than 30 times and sold more than 350,000 copies.

Many thanks go to Assistant Professor Lin Lu and Prof. Zeng Huanqiang of Huaqiao University for translating this book into English. It is renamed Machine Learning Methods and published by Springer. We would be honored if overseas readers could also benefit from this book.

This book can be used as a reference for teaching statistical machine learning and related courses. Its target readers also include undergraduates and postgraduates in artificial intelligence, machine learning, natural language processing, and other related majors. In this book, we are dedicated to introducing machine learning methods systematically and comprehensively. Regarding content selection, the book focuses on the most important and commonly used methods.
In terms of narrative style, each chapter describes a single method and is relatively independent and complete; at the same time, following a unified framework, the discussion of all methods remains systematic. Readers can read the book from beginning to end or focus on individual chapters. Complicated methods are described in simple terms with necessary proofs and examples, enabling beginners to grasp the basic content and essence of the methods and to use them accurately. Relevant theories are briefly described as well. At the end of each chapter, exercises, references, and notes on relevant research trends are provided to meet readers' needs for further learning.

Currently, we are also adding content on deep learning and reinforcement learning to the existing version. We hope the revised edition will also be translated into English in the future.

Finally, I gratefully acknowledge Dr. Chang Lanlan, the editor at Springer, and Assistant Professor Lin Lu and Prof. Zeng Huanqiang of Huaqiao University for their efforts and support.

Beijing, China
Hang Li

Contents

1 Introduction to Machine Learning and Supervised Learning
  1.1 Machine Learning
    1.1.1 Characteristics of Machine Learning
    1.1.2 The Object of Machine Learning
    1.1.3 The Purpose of Machine Learning
    1.1.4 Methods of Machine Learning
    1.1.5 The Study of Machine Learning
    1.1.6 The Importance of Machine Learning
  1.2 Classification of Machine Learning
    1.2.1 The Basic Classification
    1.2.2 Classification by Model Types
    1.2.3 Classification by Algorithm
    1.2.4 Classification by Technique
  1.3 Three Elements of Machine Learning Methods
    1.3.1 Model
    1.3.2 Strategy
    1.3.3 Algorithm
  1.4 Model Evaluation and Model Selection
    1.4.1 Training Error and Test Error
    1.4.2 Over-Fitting and Model Selection
  1.5 Regularization and Cross-Validation
    1.5.1 Regularization
    1.5.2 Cross-Validation
  1.6 Generalization Ability
    1.6.1 Generalization Error
    1.6.2 Generalization Error Bound
  1.7 Generative Approach and Discriminative Model
  1.8 Supervised Learning Application
    1.8.1 Classification
    1.8.2 Tagging
    1.8.3 Regression
  References
2 Perceptron
  2.1 The Perceptron Model
  2.2 Perceptron Learning Strategy
    2.2.1 Linear Separability of the Dataset
    2.2.2 Perceptron Learning Strategy
  2.3 Perceptron Learning Algorithm
    2.3.1 The Primal Form of the Perceptron Learning Algorithm
    2.3.2 Convergence of the Algorithm
    2.3.3 The Dual Form of the Perceptron Learning Algorithm
  References
3 K-Nearest Neighbor
  3.1 The K-Nearest Neighbor Algorithm
  3.2 The K-Nearest Neighbor Model
    3.2.1 Model
    3.2.2 Distance Metrics
    3.2.3 The Selection of k Value
    3.2.4 Classification Decision Rule
  3.3 Implementation of K-Nearest Neighbor: The kd-Tree
    3.3.1 Constructing the kd-Tree
    3.3.2 Searching for kd-Tree
  References
4 The Naïve Bayes Method
  4.1 The Learning and Classification of Naïve Bayes
    4.1.1 Basic Methods
    4.1.2 Implications of Posterior Probability Maximization
  4.2 Parameter Estimation of the Naïve Bayes Method
    4.2.1 Maximum Likelihood Estimation
    4.2.2 Learning and Classification Algorithms
    4.2.3 Bayesian Estimation
  References
5 Decision Tree
  5.1 Decision Tree Model and Learning
    5.1.1 Decision Tree Model
    5.1.2 Decision Tree and If-Then Rules
    5.1.3 Decision Tree and Conditional Probability Distributions
    5.1.4 Decision Tree Learning
  5.2 Feature Selection
    5.2.1 The Feature Selection Problem
    5.2.2 Information Gain
    5.2.3 Information Gain Ratio
  5.3 Generation of Decision Tree
    5.3.1 ID3 Algorithm
    5.3.2 C4.5 Generation Algorithm
  5.4 Pruning of Decision Tree
  5.5 CART Algorithm
    5.5.1 CART Generation
    5.5.2 CART Pruning
  References
6 Logistic Regression and Maximum Entropy Model
  6.1 Logistic Regression Model
    6.1.1 Logistic Distribution
    6.1.2 Binomial Logistic Regression Model
    6.1.3 Model Parameter Estimation
    6.1.4 Multi-nomial Logistic Regression
  6.2 Maximum Entropy Model
    6.2.1 Maximum Entropy Principle
    6.2.2 Definition of Maximum Entropy Model
    6.2.3 Learning of the Maximum Entropy Model
    6.2.4 Maximum Likelihood Estimation
  6.3 Optimization Algorithm of Model Learning
    6.3.1 Improved Iterative Scaling
    6.3.2 Quasi-Newton Method
  References
7 Support Vector Machine
  7.1 Linear Support Vector Machine in the Linearly Separable Case and Hard Margin Maximization
    7.1.1 Linear Support Vector Machine in the Linearly Separable Case
    7.1.2 Function Margin and Geometric Margin
    7.1.3 Maximum Margin
    7.1.4 Dual Algorithm of Learning
  7.2 Linear Support Vector Machine and Soft Margin Maximization
    7.2.1 Linear Support Vector Machine
    7.2.2 Dual Learning Algorithm
    7.2.3 Support Vector
    7.2.4 Hinge Loss Function
  7.3 Non-Linear Support Vector Machine and Kernel Functions
    7.3.1 Kernel Trick
    7.3.2 Positive Definite Kernel
    7.3.3 Commonly Used Kernel Functions
    7.3.4 Nonlinear Support Vector Classifier
  7.4 Sequential Minimal Optimization Algorithm
    7.4.1 The Method of Solving Two-Variable Quadratic Programming
    7.4.2 Selection Methods of Variables
    7.4.3 SMO Algorithm
  References
8 Boosting
  8.1 AdaBoost Algorithm
    8.1.1 The Basic Idea of Boosting
    8.1.2 AdaBoost Algorithm
    8.1.3 AdaBoost Example
  8.2 Training Error Analysis of AdaBoost Algorithm
  8.3 Explanation of AdaBoost Algorithm
    8.3.1 Forward Stepwise Algorithm
    8.3.2 Forward Stepwise Algorithm and AdaBoost
  8.4 Boosting Tree
    8.4.1 Boosting Tree Model
    8.4.2 Boosting Tree Algorithm
    8.4.3 Gradient Boosting
  References
9 EM Algorithm and Its Extensions
  9.1 Introduction of the EM Algorithm
    9.1.1 EM Algorithm
    9.1.2 Derivation of the EM Algorithm
    9.1.3 Application of the EM Algorithm in Unsupervised Learning
  9.2 The Convergence of the EM Algorithm
  9.3 Application of the EM Algorithm in the Learning of the Gaussian Mixture Model
    9.3.1 Gaussian Mixture Model
    9.3.2 The EM Algorithm for Parameter Estimation of the Gaussian Mixture Model
  9.4 Extensions of the EM Algorithm
    9.4.1 The Maximization-Maximization Algorithm of F-Function
    9.4.2 GEM Algorithm
  9.5 Summary
  9.6 Further Reading
  9.7 Exercises
  References
10 Hidden Markov Model
  10.1 The Basic Concept of Hidden Markov Model
    10.1.1 Definition of Hidden Markov Model
    10.1.2 The Generation Process of the Observation Sequence
    10.1.3 Three Basic Problems of the Hidden Markov Model
  10.2 Probability Calculation Algorithms
    10.2.1 Direct Calculation Method
    10.2.2 Forward Algorithm
    10.2.3 Backward Algorithm
    10.2.4 Calculation of Some Probabilities and Expected Values
  10.3 Learning Algorithms
    10.3.1 Supervised Learning Methods
    10.3.2 Baum-Welch Algorithm
    10.3.3 Baum-Welch Model Parameter Estimation Formula
  10.4 Prediction Algorithm
    10.4.1 Approximation Algorithm
    10.4.2 Viterbi Algorithm
  References
11 Conditional Random Field
  11.1 Probabilistic Undirected Graphical Model
    11.1.1 Model Definition
    11.1.2 Factorization of Probabilistic Undirected Graphical Model
  11.2 The Definition and Forms of Conditional Random Field
    11.2.1 The Definition of Conditional Random Field
    11.2.2 The Parameterized Form of the Conditional Random Field
    11.2.3 The Simplified Form of Conditional Random Field
    11.2.4 The Matrix Form of the Conditional Random Field
  11.3 The Probability Computation Problem of Conditional Random Field
    11.3.1 Forward–Backward Algorithm
    11.3.2 Probability Computation
    11.3.3 The Computation of Expected Value
  11.4 Learning Algorithms of Conditional Random Field
    11.4.1 Improved Iterative Scaling
    11.4.2 Quasi-Newton Method
  11.5 The Prediction Algorithm of Conditional Random Field
  References
12 Summary of Supervised Learning Methods
  12.1 Application
  12.2 Models
  12.3 Learning Strategies
  12.4 Learning Algorithms
13 Introduction to Unsupervised Learning
  13.1 The Fundamentals of Unsupervised Learning
  13.2 Basic Issues
    13.2.1 Clustering
    13.2.2 Dimensionality Reduction
    13.2.3 Probability Model Estimation
  13.3 Three Elements of Machine Learning
  13.4 Unsupervised Learning Methods
    13.4.1 Clustering
    13.4.2 Dimensionality Reduction
    13.4.3 Topic Modeling
    13.4.4 Graph Analytics
  References
14 Clustering
  14.1 Basic Concepts of Clustering
    14.1.1 Similarity or Distance
    14.1.2 Class or Cluster
    14.1.3 Distance Between Classes
  14.2 Hierarchical Clustering
  14.3 k-means Clustering
    14.3.1 Model
    14.3.2 Strategy
    14.3.3 Algorithm
    14.3.4 Algorithm Characteristics
  References
15 Singular Value Decomposition
  15.1 Introduction
  15.2 Definition and Properties of Singular Value Decomposition
    15.2.1 Definition and Theorem
    15.2.2 Compact Singular Value Decomposition and Truncated Singular Value Decomposition
    15.2.3 Geometric Interpretation
    15.2.4 Main Properties
  15.3 Computation of Singular Value Decomposition
  15.4 Singular Value Decomposition and Matrix Approximation
    15.4.1 Frobenius Norm
    15.4.2 Optimal Approximation of the Matrix
    15.4.3 The Outer Product Expansion of Matrix
  References
16 Principal Component Analysis
  16.1 Overall Principal Component Analysis
    16.1.1 Basic Ideas
    16.1.2 Definition and Derivation
    16.1.3 Main Properties
    16.1.4 The Number of Principal Components
    16.1.5 The Overall Principal Components of Normalized Variables
  16.2 Sample Principal Component Analysis
    16.2.1 The Definition and Properties of the Sample Principal Components
    16.2.2 Eigenvalue Decomposition Algorithm of Correlation Matrix
    16.2.3 Singular Value Decomposition Algorithm for Data Matrix
  References
17 Latent Semantic Analysis
  17.1 Word Vector Space and Topic Vector Space
    17.1.1 Word Vector Space
    17.1.2 Topic Vector Space
  17.2 Latent Semantic Analysis Algorithm
    17.2.1 Matrix Singular Value Decomposition Algorithm
    17.2.2 Examples
  17.3 Non-negative Matrix Factorization Algorithm
    17.3.1 Non-negative Matrix Factorization
    17.3.2 Latent Semantic Analysis Model
    17.3.3 Formalization of Non-negative Matrix Factorization
    17.3.4 Algorithm
  References
18 Probabilistic Latent Semantic Analysis
  18.1 Probabilistic Latent Semantic Analysis Model
    18.1.1 Basic Ideas
    18.1.2 Generative Model
    18.1.3 Co-occurrence Model
    18.1.4 Model Properties
  18.2 Algorithms for Probabilistic Latent Semantic Analysis
  References
19 Markov Chain Monte Carlo Method
  19.1 Monte Carlo Method
    19.1.1 Random Sampling
    19.1.2 Mathematical Expectation Estimate
    19.1.3 Integral Computation
  19.2 Markov Chain
    19.2.1 Basic Definition
    19.2.2 Discrete-Time Markov Chain
    19.2.3 Continuous-Time Markov Chain
    19.2.4 Properties of Markov Chain
  19.3 Markov Chain Monte Carlo Method
    19.3.1 Basic Ideas
    19.3.2 Basic Steps
    19.3.3 Markov Chain Monte Carlo Method and Machine Learning
  19.4 Metropolis–Hastings Algorithm
    19.4.1 Fundamental Concepts
    19.4.2 Metropolis–Hastings Algorithm
    19.4.3 The Single-Component Metropolis–Hastings Algorithm
  19.5 Gibbs Sampling
    19.5.1 Basic Principles
    19.5.2 Gibbs Sampling Algorithm
    19.5.3 Sampling Computation
  References
20 Latent Dirichlet Allocation
  20.1 Dirichlet Distribution
    20.1.1 Definition of Distribution
    20.1.2 Conjugate Prior
  20.2 Latent Dirichlet Allocation Model
    20.2.1 Basic Ideas
    20.2.2 Model Definition
    20.2.3 Probability Graphical Model
    20.2.4 The Changeability of Random Variable Sequences
    20.2.5 Probability Formula
  20.3 Gibbs Sampling Algorithm for LDA
    20.3.1 Basic Ideas
    20.3.2 Major Parts of Algorithm
    20.3.3 Algorithm Post-processing
    20.3.4 Algorithm
  20.4 Variational EM Algorithm for LDA
    20.4.1 Variational Reasoning
    20.4.2 Variational EM Algorithm
    20.4.3 Algorithm Derivation
    20.4.4 Algorithm Summary
  References
21 The PageRank Algorithm
  21.1 The Definition of PageRank
    21.1.1 Basic Ideas
    21.1.2 The Directed Graph and Random Walk Model
    21.1.3 The Basic Definition of PageRank
    21.1.4 General Definition of PageRank
  21.2 Computation of PageRank
    21.2.1 Iterative Algorithm
    21.2.2 Power Method
    21.2.3 Algebraic Algorithms
  References
22 A Summary of Unsupervised Learning Methods
  22.1 The Relationships and Characteristics of Unsupervised Learning Methods
    22.1.1 The Relationships Between Various Methods
    22.1.2 Unsupervised Learning Methods
    22.1.3 Basic Machine Learning Methods
  22.2 The Relationships and Characteristics of Topic Models
  References
Appendix A: Gradient Descent
Appendix B: Newton Method and Quasi-Newton Method
Appendix C: Lagrange Duality
Appendix D: Basic Subspaces of Matrix
Appendix E: The Definition of KL Divergence and the Properties of Dirichlet Distribution
Color Diagrams
Index

Chapter 1
Introduction to Machine Learning and Supervised Learning

This chapter gives an overview of supervised learning methods. Supervised learning is the machine learning task of inferring a model from labeled training data. It is an important area of machine learning.
In this chapter, some basic concepts of machine learning and supervised learning are introduced, giving readers a general picture of both. Section 1.1 describes the definition, research objects, and methods of machine learning. Section 1.2 covers the classification of machine learning; in general, machine learning can be classified into supervised learning, unsupervised learning, and reinforcement learning. Section 1.3 introduces the three elements of machine learning methods: model, strategy, and algorithm. Sections 1.4–1.7 introduce several essential concepts of supervised learning in turn, including model evaluation and model selection, regularization and cross-validation, the generalization ability of learning, and generative and discriminative models. Finally, Sect. 1.8 discusses applications of supervised learning, including classification, tagging, and regression.

1.1 Machine Learning

1.1.1 Characteristics of Machine Learning

Machine learning is a discipline in which computers build probabilistic statistical models based on data and use these models to predict and analyze data. Machine learning is also known as statistical machine learning. Herbert A. Simon once defined "learning" as "a system's ability to improve its performance by executing a process". According to this view, statistical machine learning is machine learning in which computer systems improve their performance by utilizing data and statistical methods.

The main characteristics of machine learning are:

(1) Machine learning is based on computers and networks.
(2) Machine learning is a data-driven discipline that takes data as its object of study.
(3) The purpose of machine learning is to predict and analyze data.
(4) Machine learning is method-centered: machine learning methods build models and apply them to prediction and analysis.
(5) Machine learning is an interdisciplinary subject drawing on probability theory, statistics, information theory, computation theory, optimization, computer science, etc. Along with its development, it has gradually formed its own theoretical system and methodology.

1.1.2 The Object of Machine Learning

The object of machine learning research is data. The study starts from data, covers feature extraction, abstraction of data models, and knowledge discovery from data, and returns to the analysis and prediction of data. As the object of machine learning, the data involved are diverse, including digital, textual, image, video, and audio data, as well as their combinations, on computers and the Internet.

The basic assumption of machine learning about data is that similar data have certain statistical regularity; this is the premise of machine learning. Here, similar data refers to data with some common properties, such as English articles, Internet web pages, or records in databases. Because of their statistical regularity, such data can be handled with probability and statistics. For example, random variables can describe data features, while probability distributions can depict their statistical regularity.

In machine learning, data are represented by variables or groups of variables, and can be further categorized according to the types of variables (continuous or discrete) that denote them. This book focuses on the discussion of discrete variables and concerns data analysis and prediction with models constructed from data. Issues like data observation and collection are not covered in this book.

1.1.3 The Purpose of Machine Learning

Machine learning is for predicting and analyzing data, especially the prediction and analysis of unknown new data.
The prediction of data can make computers more intelligent or boost their performance; the analysis of data enables people to acquire new knowledge and make discoveries. Data prediction and analysis are achieved by constructing probabilistic statistical models. The general goal of machine learning is to consider what types of models to learn and how to learn them, so that the models can predict and analyze data accurately. It is also important to make learning as efficient as possible.

1.1.4 Methods of Machine Learning

Machine learning methods construct probabilistic statistical models based upon data to predict and analyze the data. Machine learning consists of supervised learning, unsupervised learning, reinforcement learning, etc. Parts 1 and 2 of this book deal with supervised and unsupervised learning, the two primary machine learning approaches, respectively.

Machine learning methods can be summarized as follows: learning starts from a given, finite set of training data, and it is assumed that the data are independently and identically distributed; moreover, the model to be learned is assumed to belong to a set of functions called the hypothesis space; an optimal model is selected from the hypothesis space according to a specific evaluation criterion, so that it gives optimal predictions on both the available training data and unknown test data; the selection of the optimal model is implemented by an algorithm. In this way, a machine learning method comprises the hypothesis space of the model, the criterion of model selection, and the algorithm of model learning, or model, strategy, and algorithm for short. These factors are addressed as the three elements of machine learning.
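The three elements can be made concrete with a minimal sketch (the toy data, the linear hypothesis space, and the least-squares solver below are illustrative assumptions, not taken from the book): the model is a hypothesis space of linear functions f_w(x) = wx indexed by w, the strategy is empirical risk minimization under squared loss, and the algorithm is the closed-form least-squares solution.

```python
# Three elements of a machine learning method, on toy 1-D data.
# Model (hypothesis space): linear functions f_w(x) = w * x, indexed by w.
# Strategy: empirical risk minimization with squared loss.
# Algorithm: the closed-form least-squares minimizer of the empirical risk.

def learn(xs, ys):
    """Select the optimal f_w by minimizing (1/N) * sum((y_i - w * x_i)**2).

    Setting the derivative with respect to w to zero gives
    w = sum(x_i * y_i) / sum(x_i**2).
    """
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x_new: w * x_new

# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]

f = learn(xs, ys)     # the learned optimal model
prediction = f(5.0)   # predict an unknown new input
print(prediction)     # close to 10.0
```

Swapping the loss (strategy) or replacing the closed-form solver with gradient descent (algorithm) changes one element while leaving the others fixed, which is the point of separating the three.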
The steps to implement a machine learning method are as follows:

(1) Acquire a limited set of training data;
(2) Determine the hypothesis space that contains all possible models, that is, the set of candidate models;
(3) Determine the criterion for model selection, that is, the learning strategy;
(4) Implement the algorithm for solving for the optimal model, that is, the learning algorithm;
(5) Choose the optimal model by the learning method;
(6) Use the learned optimal model to predict or analyze new data.

This chapter introduces supervised learning methods, mainly including classification, tagging, and regression methods. These methods are broadly applied in domains such as natural language processing, information retrieval, and text data mining.

1.1.5 The Study of Machine Learning

Machine learning research generally falls into three aspects with different focuses: machine learning methods, machine learning theory, and machine learning applications. The research of machine learning methods aims at developing new learning methods. The study of machine learning theory explores the effectiveness and efficiency of machine learning methods and the fundamental theoretical issues of machine learning, while the research of machine learning applications is concerned with applying machine learning methods to practical problems.

1.1.6 The Importance of Machine Learning

The last two decades have witnessed a remarkable development of machine learning in theory and application. Many significant breakthroughs have been marked by the successful application of machine learning in artificial intelligence, pattern recognition, data mining, natural language processing, speech processing, computer vision, information retrieval, bioinformatics, and many other computer application fields. It has become a core technology in these areas.
Furthermore, it is widely believed that machine learning will play an increasingly crucial role in future scientific development and technology applications. The importance of machine learning in science and technology is mainly reflected in the following aspects: (1) Machine learning is an effective way of processing massive data. We live in an era of information explosion, where the processing and utilization of massive data is an inevitable demand. Data, in reality, are not only large in scale but also often uncertain. Machine learning tends to be the most powerful tool for processing this type of data. (2) Machine learning is an effective means of making computers intelligent. Intelligentization is the inevitable trend of computer development as well as the primary goal of computer technology research and development. In recent decades, research in artificial intelligence and other fields has shown that, despite certain limitations, machine learning is the most effective means of imitating human intelligence. (3) Machine learning is a crucial component of the development of computer science. It can be considered that computer science is composed of three dimensions: system, computation, and information. Machine learning mainly belongs to the information dimension and plays a central role there.

1.2 Classification of Machine Learning

Machine learning is a wide-ranging, diverse, and widely applied field, and there is no (at least not yet) single unified theory covering everything. The following describes the classification of machine learning methods from different perspectives.

1.2.1 The Basic Classification

Machine learning is generally categorized into three major groups: supervised learning, unsupervised learning, and reinforcement learning. Sometimes semi-supervised learning and active learning are also included.
(1) Supervised learning

Supervised learning is a machine learning method in which prediction models are learned from labeled data. The labeled data represent the relationship between input and output, and the prediction model generates the corresponding output for a given input. The essence of supervised learning is to learn the statistical law of the input–output mapping.

(A) Input space, feature space and output space

In supervised learning, the sets of all possible input and output values are called the input space and the output space, respectively. The input and output spaces can be sets of finite elements or entire Euclidean spaces. The input and output spaces can be the same or different, but the output space is usually much smaller than the input space. Each specific input is an instance, usually represented by a feature vector. The space in which all feature vectors live is referred to as the feature space; each dimension of the feature space corresponds to a feature. Sometimes the input space and the feature space are assumed to be the same space and are not distinguished; sometimes they are assumed to be different spaces, and each instance is mapped from the input space to the feature space. The model is defined on the feature space. In supervised learning, the input and output are values of random variables defined on the input (feature) space and the output space. Input and output variables are represented by capital letters: conventionally, the input variable is written as X and the output variable as Y. The values of the input and output variables are written in lowercase letters: the value of the input variable is written as x, and the value of the output variable as y. Variables can be scalars or vectors, both represented by the same type of letters. Unless otherwise stated, the vectors in this book are all column vectors.
The feature vector of an input instance x is denoted as:

x = (x^(1), x^(2), ..., x^(i), ..., x^(n))^T

where x^(i) represents the i-th feature of x. Note that x^(i) is different from x_i. In this book, x_i is generally used to represent the i-th of multiple input variables, namely

x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T

Supervised learning learns a model from a set of training data and makes predictions on test data. The training data consist of input (or feature vector) and output pairs. The training set is usually expressed as:

T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

The test data also consist of input–output pairs. The input–output pairs are also called samples or sample points.

The input variable X and the output variable Y can be of different types, continuous or discrete. Prediction tasks are given different names according to the types of the input and output variables. The prediction problem where the input and output are both continuous variables is called a regression problem. The prediction problem where the output variables take a finite number of discrete values is called a classification problem. The prediction problem where both the input and the output are sequences of variables is called a tagging problem.

(B) Joint probability distribution

Supervised learning assumes that the input and output random variables X and Y follow a joint probability distribution P(X, Y), where P(X, Y) denotes the distribution function or the distribution density function. Note that in the learning process, it is assumed that this joint probability distribution exists; however, its specific definition is unknown to the learning system. The training and test data are regarded as independently and identically distributed according to the joint probability distribution P(X, Y).
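The training-set notation above can be written out on a toy example (all numbers invented): the distinction is between x_i, the i-th whole sample, and x^(j), the j-th feature inside a sample.

```python
# Toy training set T = {(x_1, y_1), ..., (x_N, y_N)} with N = 3 samples,
# each input a feature vector with n = 2 features (values invented).
T = [([5.0, 1.0], 1),   # (x_1, y_1), x_1 = (x_1^(1), x_1^(2))^T
     ([3.0, 4.0], 0),   # (x_2, y_2)
     ([1.0, 2.0], 0)]   # (x_3, y_3)

N = len(T)              # number of samples
n = len(T[0][0])        # number of features

x_2, y_2 = T[1]         # the 2nd sample: x_2 is a whole feature vector
x2_of_each = [x[1] for x, _ in T]  # the 2nd feature x_i^(2) of every sample
```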
Machine learning assumes that the data have specific statistical regularity. The joint probability distribution of X and Y is the basic assumption of supervised learning about the data.

(C) Hypothesis space

The purpose of supervised learning is to learn a mapping from input to output, represented by a model. In other words, learning aims to find the best model. The model belongs to the set of mappings from the input space to the output space, and this set is the hypothesis space. Determining the hypothesis space means determining the scope of learning. The supervised learning model can be a probabilistic model or a non-probabilistic model, represented by the conditional probability distribution P(Y|X) or the decision function Y = f(X), depending on the specific learning method. When making output predictions for specific inputs, they are written as P(y|x) or y = f(x).

(D) Formalization of the problem

Supervised learning uses a training dataset to learn a model and then applies the model to make predictions on test data. Since a labeled training dataset is needed in this process, and the labels are often given manually, it is called supervised learning. Supervised learning comprises two processes, learning and prediction, carried out by a learning system and a prediction system. These processes are described in Fig. 1.1.

Fig. 1.1 Supervised learning

First, a training dataset is given:

T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

In this set, (x_i, y_i), i = 1, 2, ..., N, are called samples or sample points; x_i ∈ χ ⊆ R^n is the input observation, also called the input or instance, and y_i ∈ 𝒴 is the output observation, also called the output. Supervised learning is divided into two processes, learning and prediction, completed by the learning system and the prediction system.
In the learning process, the learning system uses the given training dataset to obtain, through learning (or training), a model, expressed as a conditional probability distribution P̂(Y|X) or a decision function Y = f̂(X). Either describes the mapping relationship between the input and output random variables. In the prediction process, for an input x_{N+1} from the test set, the prediction system gives the corresponding output y_{N+1} by the model:

y_{N+1} = arg max_y P̂(y|x_{N+1})   or   y_{N+1} = f̂(x_{N+1})

In supervised learning, it is hypothesized that the training data and test data are independently and identically distributed according to the joint probability distribution P(X, Y). The learning system (namely, the learning algorithm) tries to learn the model from the information carried by the samples (x_i, y_i) in the training dataset. Specifically, for an input x_i, a specific model y = f(x) generates an output f(x_i), while the corresponding output in the training dataset is y_i. The difference between the training sample output y_i and the model output f(x_i) should be small enough to guarantee good prediction ability of the model. The learning system selects the best model through continuous attempts, so that it predicts the training dataset well enough and generalizes as well as possible to the prediction of unknown test data.

(2) Unsupervised learning

Unsupervised learning refers to the machine learning problem of learning predictive models from unlabeled data. Unlabeled data are naturally obtained data, and the predictive model represents the data's category, conversion, or probability. The essence of unsupervised learning is to learn the statistical regularity or potential structure in the data. The sets of all possible values of the input and the output of the model are called the input space and the output space, respectively.
The input and output spaces can be collections of finite elements or Euclidean spaces. Each input is an instance represented by a feature vector. Each output is the result of analyzing the input, represented by a category, a conversion, or a probability of the input. The model can realize clustering, dimensionality reduction, or probability estimation of the data. Suppose χ is the input space and Z is the implicit structure space. The model to be learned can be expressed as a function z = g(x), a conditional probability distribution P(z|x), or a conditional probability distribution P(x|z), where x ∈ χ is the input and z ∈ Z is the output. The set containing all possible models is called the hypothesis space. Unsupervised learning aims to select, from the hypothesis space, the optimal model under a given evaluation criterion. Unsupervised learning usually learns or trains the model using a large amount of unlabeled data, and each sample is an instance. The training data are expressed as U = {x_1, x_2, ..., x_N}, where x_i, i = 1, 2, ..., N, is a sample. Unsupervised learning can be used to analyze existing data and to predict future data. The analysis uses the learned model, i.e., the function z = ĝ(x), the conditional probability distribution P̂(z|x), or the conditional probability distribution P̂(x|z). Prediction proceeds similarly to supervised learning, and is completed by the learning system and the prediction system, as shown in Fig. 1.2. In the learning process, the learning system learns from the training dataset and obtains an optimal model, expressed as the function z = ĝ(x), the conditional probability distribution P̂(z|x), or the conditional probability distribution P̂(x|z).
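As a minimal illustration of an unsupervised model z = ĝ(x), the sketch below runs a k-means-style procedure on 1-D points; the data and the method choices (two clusters, min/max initialization) are our own, not prescribed by the text.

```python
# Toy illustration (invented data) of learning an unsupervised model that
# assigns each 1-D point to the nearest of two cluster centers.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # unlabeled samples x_1, ..., x_N

# "Learning": alternate assignment and center update (k-means-style),
# starting from the min and max of the data as initial centers.
c = [min(data), max(data)]
for _ in range(10):
    groups = {0: [], 1: []}
    for x in data:
        groups[0 if abs(x - c[0]) <= abs(x - c[1]) else 1].append(x)
    c = [sum(g) / len(g) for g in (groups[0], groups[1])]

# "Prediction": the learned model g assigns a cluster z to a new input x.
def g(x):
    return 0 if abs(x - c[0]) <= abs(x - c[1]) else 1
```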
In the prediction process, for a given input x_{N+1}, the prediction system generates the corresponding output z_{N+1} by the model z_{N+1} = ĝ(x_{N+1}) or z_{N+1} = arg max_z P̂(z|x_{N+1}) to perform clustering or dimensionality reduction; it is also possible to obtain the input probability P̂(x_{N+1}|z_{N+1}) from the model P̂(x|z) and then perform probability estimation.

(3) Reinforcement learning

Fig. 1.2 Unsupervised learning
Fig. 1.3 Interaction between intelligent system and environment

Reinforcement learning is a machine learning problem in which an intelligent system learns the optimal behavior strategy through continuous interaction with the environment. Assuming that the interaction between the intelligent system and the environment is described by a Markov decision process, the intelligent system observes the data sequence obtained from the interaction with the environment. The essence of reinforcement learning is to learn an optimal sequential decision. The interaction between the intelligent system and the environment is shown in Fig. 1.3. At each step t, the intelligent system observes a state s_t and a reward r_t from the environment and takes an action a_t. According to the action selected by the intelligent system, the environment determines the state s_{t+1} and the reward r_{t+1} at the next step t + 1. The strategy to be learned is expressed as the action to take in a given state. The goal of the intelligent system is to maximize not the short-term reward but the long-term cumulative reward. In the process of reinforcement learning, the system continually engages in trial and error in order to learn the optimal strategy. The Markov decision process of reinforcement learning is a stochastic process over sequences of states, rewards, and actions. It consists of the five-tuple ⟨S, A, P, r, γ⟩.
S is a finite set of states;
A is a finite set of actions;
P is the state transition probability function: P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a);
r is the reward function: r(s, a) = E(r_{t+1} | s_t = s, a_t = a);
γ ∈ [0, 1] is the discount factor.

The Markov decision process has the Markov property: the next state depends only on the current state and action, as represented by the state transition probability function P(s′|s, a), and the next reward depends only on the current state and action, as represented by the reward function r(s, a). A strategy π is defined as a function giving the action in a given state, a = f(s), or a conditional probability distribution P(a|s). Given a strategy π, the behavior of the intelligent system interacting with the environment is determined (either deterministically or stochastically). The value function, or state value function, is defined as the mathematical expectation of the long-term cumulative reward obtained by following strategy π starting from a state s:

v_π(s) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· | s_t = s]   (1.1)

The action value function is defined as the mathematical expectation of the long-term cumulative reward obtained by following strategy π starting from a state s and an action a:

q_π(s, a) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· | s_t = s, a_t = a]   (1.2)

The goal of reinforcement learning is to select the strategy π* with the largest value function among all possible strategies. Actual learning often starts from a specific strategy and continuously improves the existing strategy. Here, γ indicates that future rewards decay. Reinforcement learning methods include policy-based and value-based methods, both of which are model-free methods. Besides model-free methods, there are also model-based methods. Model-based methods try to learn the model of the Markov decision process directly, including the transition probability function P(s′|s, a) and the reward function r(s, a).
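A toy, hand-built MDP (all transition probabilities and rewards invented) shows how the state value function (1.1) of a fixed strategy π can be evaluated by iterating the Bellman expectation equation, v(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s′} P(s′|s, a) v(s′) ] — a standard identity, not quoted from the text.

```python
# Toy 2-state MDP (all numbers invented) for evaluating v_pi of a fixed
# equiprobable strategy by iterating the Bellman expectation equation.
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] -> {s': probability}; r[(s, a)] -> expected immediate reward
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

# Fixed stochastic strategy pi(a|s): each action with probability 1/2.
pi = {s: {a: 0.5 for a in actions} for s in states}

v = {s: 0.0 for s in states}
for _ in range(500):  # the update is a gamma-contraction, so this converges
    v = {s: sum(pi[s][a] * (r[(s, a)]
                            + gamma * sum(p * v[s2]
                                          for s2, p in P[(s, a)].items()))
                for a in actions)
         for s in states}
```

For this little MDP the fixed point can be solved by hand (v(s0) = 7.25, v(s1) = 7.75), which the iteration reproduces.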
In this case, the feedback of the environment can be predicted through the model, and the strategy π* with the largest value function can be obtained. Model-free, strategy-based methods do not learn the model directly; instead, they try to learn the optimal strategy π*, expressed as a function a = f*(s) or a conditional probability distribution P*(a|s), which also yields optimal decisions in the environment. Learning usually starts with a specific strategy and proceeds by searching for better strategies. Model-free, value-based methods do not learn the model directly either. Instead, they attempt to solve for the optimal value function, especially the optimal action value function q*(s, a). In this case, the optimal strategy is learned indirectly, with the action in a given state taken according to that strategy. Learning usually starts with a specific value function and proceeds by searching for a better value function.

(4) Semi-supervised learning and active learning

Semi-supervised learning refers to the machine learning problem of using both tagged and untagged data to learn a predictive model. There is typically a small amount of tagged data and a large amount of untagged data, because constructing tagged data tends to require costly labor, while collecting untagged data does not cost much. Semi-supervised learning aims to use the information in the untagged data to assist the tagged data in supervised learning, achieving better learning results at a lower cost. Active learning is a machine learning problem in which the machine actively and continually presents instances to a teacher for tagging, and then uses the tagged data to learn a predictive model. General supervised learning uses given tagged data, which tends to be obtained randomly, so it can be regarded as "passive learning".
The goal of active learning is to find the instances most helpful for learning and present them to the teacher for tagging, achieving better learning results at a small tagging cost. Semi-supervised learning and active learning are closer to supervised learning.

1.2.2 Classification by Model Types

Machine learning or machine learning methods can be classified according to model type.

(1) The probabilistic model and the non-probabilistic model

Machine learning models can be divided into probabilistic and non-probabilistic models (or deterministic models). In supervised learning, the probabilistic model takes the form of the conditional probability distribution P(y|x), and the non-probabilistic model takes the functional form y = f(x), where x is the input and y is the output. In unsupervised learning, the probabilistic model takes the form of the conditional probability distribution P(z|x) or P(x|z), and the non-probabilistic model takes the functional form z = g(x), where x is the input and z is the output. The Decision Tree, Naive Bayes, Hidden Markov Model (HMM), Conditional Random Field (CRF), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Gaussian Mixture Model (GMM) introduced in this book are all probabilistic models. The Perceptron, Support Vector Machine (SVM), k-Nearest Neighbors (KNN), AdaBoost, k-means, Latent Semantic Analysis (LSA), and Neural Networks are non-probabilistic models. Logistic regression can be regarded as both a probabilistic model and a non-probabilistic model. The conditional probability distribution P(y|x) and the function y = f(x) can be converted into each other (likewise the conditional probability distribution P(z|x) and the function z = g(x)). Specifically, a function is obtained by maximizing the conditional probability distribution, and a conditional probability distribution is obtained, in turn, by normalizing the function.
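The two conversions just described can be sketched numerically (toy values): maximizing a conditional distribution gives a deterministic function, and normalizing real-valued scores gives a distribution. The softmax-style normalization used here is one common choice, not the only one.

```python
import math

# distribution -> function: take the arg max of P(y|x).
def to_function(p_y_given_x):
    """P(y|x) as a dict over labels -> the deterministic prediction f(x)."""
    return max(p_y_given_x, key=p_y_given_x.get)

# function (scores) -> distribution: normalize, here via softmax.
def to_distribution(scores):
    """Real-valued scores per label -> a normalized distribution over labels."""
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```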
Therefore, the difference between probabilistic and non-probabilistic models lies not in the mapping relationship between input and output, but in the internal structure of the model. A probabilistic model can usually be expressed as a joint probability distribution, whose variables represent the input, the output, hidden variables, and even parameters. Non-probabilistic models may not have such a joint probability distribution. Representative of probabilistic models are probabilistic graphical models, in which a directed or undirected graph encodes the joint probability distribution. The joint probability distribution can be decomposed into a product of factors according to the structure of the graph. Bayesian Networks, Markov Random Fields, and Conditional Random Fields are all probabilistic graphical models. Regardless of the complexity of the model, probabilistic inference can be carried out using the most basic addition and multiplication rules of probability (see Fig. 1.4).

Fig. 1.4 Basic probability formula

(2) The linear model and the non-linear model

Machine learning models, especially non-probabilistic models, can be divided into linear and non-linear models. If the function y = f(x) or z = g(x) is a linear function, the model is called a linear model; otherwise, it is called a non-linear model. The Perceptron, linear SVM, KNN, k-means, and LSA introduced in this book are linear models. The kernel SVM, AdaBoost, and Neural Networks are non-linear models. Deep learning is the learning of complex Neural Networks, i.e., the learning of complex non-linear models.

(3) The parametric model and the non-parametric model

Machine learning models can be divided into parametric and non-parametric models. Parametric models assume that the dimension of the model parameters is fixed, so the model can be completely described by finite-dimensional parameters.
Non-parametric models assume that the dimension of the model parameters is not fixed or is infinite, increasing as the training data increase. The Perceptron, Naive Bayes, logistic regression, k-means, GMM, LSA, PLSA, and LDA introduced in this book are all parametric models. The Decision Tree, SVM, AdaBoost, and KNN are non-parametric models. Parametric models are suitable for simple problems; however, non-parametric models can be more effective, as real problems tend to be more complicated.

1.2.3 Classification by Algorithm

According to the algorithm, machine learning can be divided into online learning and batch learning. Online learning refers to machine learning that accepts one sample at a time, makes a prediction, then learns the model, and repeats this operation continuously. Correspondingly, batch learning accepts all the data at once, learns the model, and then makes predictions. Some practical application scenarios require online learning: scenarios where the data arrive sequentially and cannot be stored, so the system must process them in a timely manner; or where the scale of the data is too large to be processed all at once. Another case is where the pattern of the data changes dynamically over time, requiring the algorithm to adapt to the new pattern rapidly (the independent and identical distribution assumption is not satisfied). Online learning can be either supervised or unsupervised, and reinforcement learning has the characteristics of online learning. Only online supervised learning is considered below (see Fig. 1.5).

Fig. 1.5 Online learning

Learning and prediction exist in one system: the system takes an input x_t, generates the prediction f̂(x_t) using the current model, and then obtains feedback, i.e., the output y_t corresponding to the input.
The system computes the difference between them using a loss function, updates the model, and repeats the above operations continuously (see Fig. 1.5). The perceptron learning algorithm based on stochastic gradient descent is an online learning algorithm. Online learning is usually harder than batch learning: it is difficult for online learning to reach as high a prediction accuracy, because the available data are limited at each model update.

1.2.4 Classification by Technique

Machine learning methods can be classified according to the techniques they use.

(1) Bayesian learning

Bayesian learning, also known as Bayesian inference, is a crucial method in statistics and machine learning. The main idea is to use Bayes' theorem in the learning and inference of probabilistic models to compute the conditional probability of the model given the data, i.e., the posterior probability, and to apply this principle to model estimation and data prediction. Representing the model, the unobserved elements, and their parameters as random variables, and using a prior distribution over the model, are the characteristics of Bayesian learning. The basic probability formulas are also used in Bayesian learning (see Fig. 1.4). The Naive Bayes and Latent Dirichlet Allocation methods discussed in this book belong to Bayesian learning. Assume that the random variable D represents the data and the random variable θ represents the model parameters. According to Bayes' theorem, the posterior probability P(θ|D) can be computed with the following formula:

P(θ|D) = P(θ)P(D|θ) / P(D)   (1.3)

where P(θ) is the prior probability and P(D|θ) is the likelihood function. In model estimation, the entire posterior probability distribution P(θ|D) is estimated. If a single model is required, the model with the maximum posterior probability is usually taken.
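Formula (1.3) and the maximum-a-posteriori choice can be illustrated with a conjugate coin-flip example (counts and prior invented). The Beta-posterior and mode formulas used here are standard conjugacy facts, not quoted from the text.

```python
# Toy Bernoulli example: estimating a coin's heads probability theta.
# With a conjugate Beta(a, b) prior, Bayes' theorem (1.3) gives
#   posterior = Beta(a + heads, b + tails),
#   MAP estimate = (a + heads - 1) / (a + b + n - 2),   MLE = heads / n.
heads, tails = 7, 3
n = heads + tails

mle = heads / n                        # frequentist point estimate

a, b = 2.0, 2.0                        # a (non-uniform) Beta prior
map_est = (a + heads - 1) / (a + b + n - 2)

# With a uniform prior Beta(1, 1), the MAP estimate reduces to the MLE,
# matching the connection between the two schools described in the text.
map_uniform = (1 + heads - 1) / (1 + 1 + n - 2)
```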
When predicting, compute the expectation of the new data under the posterior probability distribution:

P(x|D) = ∫ P(x|θ, D) P(θ|D) dθ   (1.4)

where x is the new sample. Bayesian estimation is quite different from Maximum Likelihood Estimation (MLE) in its underlying idea, reflecting the different views of the Bayesian school and the frequentist school in statistics. In fact, the two can be simply connected: assuming a uniform prior distribution and taking the model with the largest posterior probability, MLE is recovered from Bayesian estimation. Figure 1.6 shows the comparison between Bayesian estimation and MLE.

Fig. 1.6 Bayesian estimation and maximum likelihood estimation

(2) The kernel method

The kernel method is a machine learning method that uses kernel functions to represent and learn non-linear models, and it can be used for both supervised and unsupervised learning. Some learning methods for linear models are based on similarity computation, more specifically, inner product computation between vectors. The kernel method extends them to learning non-linear models, expanding their scope of application. The kernel SVM, kernel PCA, and kernel k-means introduced in this book belong to the kernel method. A straightforward approach to extending a linear model to a non-linear model is to explicitly define a mapping from the input space (a low-dimensional space) to the feature space (a high-dimensional space) and compute the inner product there. For example, the SVM transforms a linearly inseparable problem in the input space into a linearly separable problem in the feature space, as shown in Fig. 1.7.

Fig. 1.7 Mapping from the input space to the feature space

The technique of the kernel method is not to define the mapping explicitly, but to define the kernel function directly, that is, the inner product in the feature space after the mapping. This technique simplifies the computation while achieving the same effect.
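The claim that the kernel function equals the feature-space inner product can be checked numerically for one classic case, the degree-2 homogeneous polynomial kernel K(x, z) = (⟨x, z⟩)² on R², whose explicit feature map is φ(x) = (x^(1)², x^(2)², √2·x^(1)x^(2)). The vectors are invented; this particular kernel is a standard textbook example, not one singled out by the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, z):
    # K(x, z) = (<x, z>)^2, computed directly in the input space
    return dot(x, z) ** 2

def phi(x):
    # explicit feature map into R^3 for the degree-2 polynomial kernel
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x1, x2 = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(x1, x2)           # input-space computation (cheap)
rhs = dot(phi(x1), phi(x2))    # feature-space computation (explicit map)
```

The two numbers agree, which is exactly the computational shortcut the kernel trick exploits: the inner product in the feature space is obtained without ever constructing φ.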
Assume that x_1 and x_2 are two instances (vectors) of the input space, with inner product ⟨x_1, x_2⟩. Suppose the mapping from the input space to the feature space is φ, so the images of x_1 and x_2 in the feature space are φ(x_1) and φ(x_2), with inner product ⟨φ(x_1), φ(x_2)⟩. The kernel method directly defines a kernel function K(x_1, x_2) on the input space satisfying K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. A theorem introduced later in the book gives the necessary and sufficient condition for a function to serve as such a kernel.

1.3 Three Elements of Machine Learning Methods

Machine learning methods are all composed of a model, a strategy, and an algorithm; that is, a machine learning method consists of three elements, which can be simply expressed as:

Method = Model + Strategy + Algorithm

The three elements of machine learning in supervised learning are discussed below. Unsupervised learning and reinforcement learning also have these three elements. It can be said that constructing a machine learning method amounts to determining these three elements.

1.3.1 Model

The first consideration in machine learning is what kind of model to learn. In supervised learning, the model is the conditional probability distribution or decision function to be learned. The hypothesis space of the model contains all possible conditional probability distributions or decision functions. For example, if the decision function is a linear function of the input variables, then the hypothesis space of the model is the set formed by all these linear functions. Generally, there are infinitely many models in the hypothesis space. The hypothesis space is denoted by F and can be defined as a set of decision functions:

F = { f | Y = f(X) }   (1.5)

where X and Y are variables defined on the input space χ and the output space 𝒴.
In this case, F is usually a family of functions determined by a parameter vector:

F = { f | Y = f_θ(X), θ ∈ R^n }   (1.6)

The parameter vector θ takes values in the n-dimensional Euclidean space R^n, called the parameter space. The hypothesis space can also be defined as a set of conditional probabilities:

F = { P | P(Y|X) }   (1.7)

where X and Y are random variables defined on the input space χ and the output space 𝒴. In this case, F is usually a family of conditional probability distributions determined by a parameter vector:

F = { P | P_θ(Y|X), θ ∈ R^n }   (1.8)

The parameter vector θ takes values in the n-dimensional Euclidean space R^n, also called the parameter space. In this book, models represented by decision functions are referred to as non-probabilistic models, and models represented by conditional probabilities are called probabilistic models. For the sake of simplicity, when discussing models, sometimes only one of the two is used.

1.3.2 Strategy

Once the hypothesis space of models is available, machine learning next needs to consider the criterion for learning or selecting the optimal model, since machine learning aims to select the optimal model from the hypothesis space. First, the concepts of the loss function and the risk function are introduced: the loss function measures the quality of a single prediction by the model, and the risk function measures the quality of the average prediction.

(1) The loss function and the risk function

The supervised learning problem is to select a model f, as the decision function, from the hypothesis space F. For a given input X, f(X) gives the corresponding output. The predicted value f(X) may be consistent or inconsistent with the true value Y, so a loss function or cost function is used to measure the degree of prediction error. The loss function is a non-negative real-valued function of f(X) and Y, denoted L(Y, f(X)).
Loss functions commonly used in machine learning include:

(A) 0–1 loss function

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases} \tag{1.9}$$

(B) Quadratic loss function

$$L(Y, f(X)) = (Y - f(X))^2 \tag{1.10}$$

(C) Absolute loss function

$$L(Y, f(X)) = |Y - f(X)| \tag{1.11}$$

(D) Logarithmic loss function or log-likelihood loss function

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) \tag{1.12}$$

The smaller the loss function value, the better the model. Since the input and output $(X, Y)$ of the model are random variables following the joint distribution $P(X, Y)$, the expected loss is:

$$R_{\exp}(f) = E_P[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x)) P(x, y) \, \mathrm{d}x \, \mathrm{d}y \tag{1.13}$$

This is the theoretical average loss of model $f(X)$ with respect to the joint distribution $P(X, Y)$, called the risk function or expected loss. The goal of learning is to choose the model with the least expected risk. Since the joint distribution $P(X, Y)$ is unknown, $R_{\exp}(f)$ cannot be computed directly. In fact, if the joint distribution $P(X, Y)$ were known, the conditional probability distribution $P(Y \mid X)$ could be obtained from it directly, and no learning would be needed. It is precisely because the joint probability distribution is unknown that learning is needed. In this way, on the one hand, minimizing the expected risk requires the joint distribution; on the other hand, the joint distribution is unknown, so supervised learning becomes an ill-formed problem.

Given a training dataset

$$T = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}$$

the average loss of model $f(X)$ with respect to the training dataset is called the empirical risk or empirical loss, denoted as $R_{emp}(f)$:

$$R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{1.14}$$

The expected risk $R_{\exp}(f)$ is the expected loss of the model with respect to the joint distribution, and the empirical risk $R_{emp}(f)$ is the average loss of the model with respect to the training sample set.
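The four loss functions (1.9)–(1.12) and the empirical risk (1.14) can be sketched directly in code. This is a minimal illustration; the function names and toy data are our own assumptions:

```python
import math

def zero_one_loss(y, fx):
    # 0-1 loss (1.9): 1 if the prediction differs from the label, else 0
    return 1.0 if y != fx else 0.0

def quadratic_loss(y, fx):
    # quadratic loss (1.10)
    return (y - fx) ** 2

def absolute_loss(y, fx):
    # absolute loss (1.11)
    return abs(y - fx)

def log_loss(p_true):
    # logarithmic loss (1.12): -log P(Y|X) for the probability the model
    # assigns to the true label
    return -math.log(p_true)

def empirical_risk(loss, ys, preds):
    # empirical risk (1.14): average loss over the N training samples
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

ys    = [1.0, 0.0, 1.0, 1.0]   # toy labels (assumed)
preds = [1.0, 1.0, 1.0, 0.0]   # toy predictions (assumed)

r01 = empirical_risk(zero_one_loss, ys, preds)   # 2 of 4 wrong -> 0.5
rsq = empirical_risk(quadratic_loss, ys, preds)  # (0+1+0+1)/4   -> 0.5
```

Note that the expected risk (1.13) cannot be computed this way, since $P(X, Y)$ is unknown; the average over the sample is all that is available.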
According to the law of large numbers, the empirical risk $R_{emp}(f)$ converges to the expected risk $R_{\exp}(f)$ as the sample size $N$ tends to infinity. So the natural idea is to use the empirical risk to estimate the expected risk. However, because the number of training samples in reality is limited or even small, estimating the expected risk by the empirical risk often works poorly, and the empirical risk must be corrected in a certain way. This relates to the two basic strategies of supervised learning: empirical risk minimization (ERM) and structural risk minimization (SRM).

(2) Empirical risk minimization and structural risk minimization

When the hypothesis space, the loss function, and the training dataset are determined, the empirical risk function (1.14) is determined. The empirical risk minimization (ERM) strategy holds that the model with the least empirical risk is the optimal model. According to this strategy, seeking the optimal model by ERM means solving the optimization problem:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{1.15}$$

where $\mathcal{F}$ is the hypothesis space.

When the sample size is large enough, the ERM strategy ensures a good learning effect and is widely used in practice. For example, Maximum Likelihood Estimation is an example of ERM: when the model is a conditional probability distribution and the loss function is the log loss function, empirical risk minimization is equivalent to Maximum Likelihood Estimation. However, when the sample size is small, ERM learning may work poorly, and the "over-fitting" phenomenon will occur.

Structural risk minimization (SRM) is a strategy proposed to prevent over-fitting. Structural risk minimization is equivalent to regularization. Structural risk adds to the empirical risk a regularizer or penalty term representing the model's complexity.
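The structural risk just described — empirical risk plus a weighted complexity penalty — can be sketched in a few lines. This is a toy illustration; taking the squared $L_2$ norm of the parameter vector as the complexity $J(f)$ is our assumed choice, and the data are made up:

```python
def structural_risk(ys, preds, theta, lam):
    # empirical risk (average squared loss) plus lam * J(f), where the
    # complexity J(f) is taken here to be the squared L2 norm of theta
    emp = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
    J = sum(t * t for t in theta)
    return emp + lam * J

ys, preds = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]   # toy data (assumed)

# the same fit is penalized more when the parameter vector is larger
r_simple  = structural_risk(ys, preds, theta=[0.5], lam=0.1)
r_complex = structural_risk(ys, preds, theta=[0.5, 2.0, -3.0], lam=0.1)
```

Two models with identical empirical risk thus receive different structural risks, the simpler one being preferred.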
When the hypothesis space, the loss function, and the training dataset are determined, the structural risk is defined as:

$$R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f) \tag{1.16}$$

where $J(f)$ is the complexity of the model, a functional defined on the hypothesis space $\mathcal{F}$. The more complex the model $f$, the greater the complexity $J(f)$; conversely, the simpler the model $f$, the smaller the complexity $J(f)$. In other words, the complexity term represents a penalty on complex models. $\lambda \geq 0$ is a coefficient that trades off the empirical risk against the model complexity. A small structural risk requires both the empirical risk and the model complexity to be small. Models with small structural risk tend to give better predictions on both the training data and unknown test data.

The maximum a posteriori probability estimation (MAP) in Bayesian Estimation is an example of structural risk minimization: when the model is a conditional probability distribution, the loss function is the log loss function, and the model's complexity is represented by the model's prior probability, structural risk minimization is equivalent to maximum a posteriori probability estimation.

The strategy of structural risk minimization holds that the model with the least structural risk is the optimal model. Finding the optimal model consequently means solving the optimization problem:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f) \tag{1.17}$$

In this way, the supervised learning problem becomes the optimization problem (1.15) or (1.17) of the empirical or structural risk function. The empirical or structural risk function is then the objective function of the optimization.

1.3.3 Algorithm

Algorithm refers to the specific computation method for learning the model. Based on the training dataset, machine learning selects the optimal model from the hypothesis space according to the learning strategy.
Finally, it considers what computation method to use to solve for the optimal model. At this point, the machine learning problem boils down to an optimization problem, and the machine learning algorithm becomes the algorithm for solving that optimization problem. If the optimization problem has an explicit analytical solution, the problem is relatively simple. Usually, however, an analytical solution does not exist, and a numerical computation method is required. Ensuring that a globally optimal solution is found and that the solution process is efficient becomes an important issue. Machine learning can use existing optimization algorithms, but sometimes it is necessary to develop dedicated optimization algorithms.

The differences between machine learning methods mainly come from the differences in their models, strategies, and algorithms. Once the model, strategy, and algorithm are determined, the machine learning method is also determined. This is why they are called the three elements of machine learning methods.

The following introduces several important concepts of supervised learning.

1.4 Model Evaluation and Model Selection

1.4.1 Training Error and Test Error

Machine learning aims to make the learned model predict well on both known and unknown data. Different learning methods give different models. When the loss function is fixed, the training error and the test error of the model based on that loss function naturally become the evaluation criteria of the learning method. Note that the loss function actually used by the machine learning method may not be the loss function used in the evaluation; it is, of course, desirable that the two be aligned.

Suppose the learned model is $Y = \hat{f}(X)$. The training error is the average loss of the model $Y = \hat{f}(X)$ over the training dataset:

$$R_{emp}(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)) \tag{1.18}$$

where $N$ is the number of training samples.
The test error is the average loss of the model $Y = \hat{f}(X)$ over the test dataset:

$$e_{test} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)) \tag{1.19}$$

where $N$ is the number of test samples. For example, when the loss function is the 0–1 loss, the test error becomes the error rate on the test dataset:

$$e_{test} = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq \hat{f}(x_i)) \tag{1.20}$$

Here $I$ is the indicator function, which is 1 when $y_i \neq \hat{f}(x_i)$ and 0 otherwise.

Correspondingly, the accuracy on the test dataset is

$$r_{test} = \frac{1}{N} \sum_{i=1}^{N} I(y_i = \hat{f}(x_i)) \tag{1.21}$$

Obviously,

$$r_{test} + e_{test} = 1$$

The size of the training error is relevant for judging whether a given problem is easy to learn, but it is not essential. The test error reflects the predictive ability of the learning method on unknown test data and is an essential concept in learning. Given two learning methods, the one with the smaller test error has the better predictive ability and is more effective. The ability of a learning method to predict unknown data is usually called the generalization ability. This problem will be discussed in Sect. 1.6.

1.4.2 Over-Fitting and Model Selection

When the hypothesis space contains models of different complexity (e.g., different numbers of parameters), the problem of model selection arises. We hope to choose or learn a suitable model. If a true model exists in the hypothesis space, then the selected model should approximate the true model. Specifically, the chosen model should have the same number of parameters as the true model, and the parameter vector of the selected model should be close to that of the true model.

If one blindly seeks to improve the predictive ability on the training data, the complexity of the selected model will tend to be higher than that of the true model. This phenomenon is known as over-fitting.
Over-fitting refers to the phenomenon that the model chosen during learning contains too many parameters, so that the model predicts the known data well but the unknown data poorly. It can be said that model selection aims to avoid over-fitting and improve the model's predictive ability.

The problem of polynomial function fitting is taken as an example to illustrate over-fitting and model selection. It is a regression problem.

Example 1.1 Assume a training dataset¹:

$$T = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}$$

where $x_i \in \mathbf{R}$ is the observed value of the input $x$, and $y_i \in \mathbf{R}$ is the corresponding observed value of the output $y$, $i = 1, 2, \ldots, N$. The task of polynomial function fitting is to assume that the given data are generated by a polynomial function of degree $M$ and to select the polynomial function of degree $M$ that is most likely to have generated the data. In other words, it selects, among the polynomial functions of degree $M$, a function with good predictive ability for both known and unknown data.

Assume that the 10 data points shown in Fig. 1.8 are given, and the data are fitted with polynomial functions of degrees 0–9. The figure shows the data to be fitted with a polynomial function curve.

Fig. 1.8 An example of a polynomial function fitting problem of degree M

Let the polynomial of degree $M$ be

$$f_M(x, \omega) = \omega_0 + \omega_1 x + \omega_2 x^2 + \cdots + \omega_M x^M = \sum_{j=0}^{M} \omega_j x^j \tag{1.22}$$

where $x$ is a single-variable input and $\omega_0, \omega_1, \ldots, \omega_M$ are the $M + 1$ parameters.

The approach to solving this problem can be as follows. First, the complexity of the model is determined, i.e., the degree of the polynomial; then, under the given model complexity, the parameters, that is, the coefficients of the polynomial, are solved according to the strategy of empirical risk minimization.

¹ This example is taken from.
To be specific, the following empirical risk minimization problem is solved:

$$L(\omega) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i, \omega) - y_i)^2 \tag{1.23}$$

In this case, the loss function is the square loss, and the coefficient 1/2 is for convenience of computation. This is then a simple optimization problem. Substituting the model and training data into Eq. (1.23) gives

$$L(\omega) = \frac{1}{2} \sum_{i=1}^{N} \left( \sum_{j=0}^{M} \omega_j x_i^j - y_i \right)^2$$

The least squares method can be used to find the unique solution for the fitted polynomial coefficients, denoted as $\omega_0^*, \omega_1^*, \ldots, \omega_M^*$. The solution process is not described here; interested readers can refer to relevant materials.

Figure 1.8 shows the fits of the polynomial function when $M = 0$, $M = 1$, $M = 3$, and $M = 9$. If $M = 0$, the polynomial curve is a constant, and the fit is poor. If $M = 1$, the polynomial curve is a straight line, and the fit is also poor. In contrast, if $M = 9$, the polynomial curve passes through every data point, and the training error is 0, which is the best result in terms of fitting the given training data. However, because of the inherent noise in the training data, such fitted curves are often poor predictors of unknown data and are not desirable for practical learning purposes. This is where over-fitting occurs. It means that model selection should consider the predictive power not only on known data but also on unknown data. When $M = 3$, the polynomial curve fits the training data well enough, and the model is relatively simple, making it a better choice.

As can be seen in polynomial function fitting, the training error decreases as the polynomial degree (model complexity) increases, until it approaches 0. The test error behaves differently: it first decreases and then increases as the polynomial degree (model complexity) increases. The ultimate goal is to minimize the test error. Thus, in polynomial function fitting, an appropriate polynomial degree should be chosen to achieve this goal.
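The experiment above can be reproduced in a few lines. This sketch assumes noisy samples of a sine curve as a stand-in for the data of Fig. 1.8 (our own assumption) and uses NumPy's least-squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 toy data points: a sine curve plus Gaussian noise (assumed data,
# standing in for the points of Fig. 1.8)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=10)

def training_error(M):
    # fit a degree-M polynomial by least squares (empirical risk
    # minimization with square loss) and return the mean squared
    # training error
    coeffs = np.polyfit(x, y, M)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {M: training_error(M) for M in (0, 1, 3, 9)}
# the training error shrinks as M grows; at M = 9 the curve passes
# through all 10 points, so the training error is numerically zero
```

Evaluating the same fitted polynomials on fresh samples of the curve would show the test error rising again for large $M$, which is the over-fitting pattern described in the text.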
This conclusion also holds for general model selection.

Fig. 1.9 Relationship between training and test errors and the model complexity

Figure 1.9 depicts the relationship between the training and test errors and the model complexity. As the model complexity increases, the training error decreases gradually and approaches 0, while the test error first decreases, reaches a minimum, and then increases again. Over-fitting occurs when the complexity of the chosen model is too high. Therefore, learning must prevent over-fitting and select the optimal model, i.e., choose the model with appropriate complexity, so as to minimize the test error. Two commonly used model selection methods are regularization and cross-validation.

1.5 Regularization and Cross-Validation

1.5.1 Regularization

The typical method of model selection is regularization. Regularization is the realization of the structural risk minimization strategy: it adds a regularizer or penalty term to the empirical risk. The regularization term is generally a monotonically increasing function of the model complexity; the more complex the model, the larger the regularization value. For example, the regularization term can be the norm of the model's parameter vector. Regularization generally has the form:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f) \tag{1.24}$$

where the first term is the empirical risk, the second term is the regularizer, and $\lambda \geq 0$ is a coefficient that adjusts the relationship between the two.

The regularization term (regularizer) can take different forms. For example, in a regression problem where the loss function is the square loss, the regularization term can be the $L_2$ norm of the parameter vector:

$$L(w) = \frac{1}{N} \sum_{i=1}^{N} (f(x_i; w) - y_i)^2 + \frac{\lambda}{2} \|w\|^2 \tag{1.25}$$

Here, $\|w\|$ denotes the $L_2$ norm of the parameter vector $w$.
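For a linear model, minimizing an objective of the form (1.25) has a closed-form solution. The sketch below uses toy data of our own, and reads the penalty as the squared $L_2$ norm (an assumption on our part); it shows how the penalty shrinks the parameters:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimize (1/N)||Xw - y||^2 + (lam/2)||w||^2. Setting the gradient
    # (2/N) X^T (Xw - y) + lam * w to zero gives the normal equations
    # (X^T X + (N * lam / 2) I) w = X^T y.
    N, d = X.shape
    A = X.T @ X + (N * lam / 2.0) * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# toy regression data: roughly y = 2x, with a bias column (assumed)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 2.0, 3.9, 6.1])

w_ols   = ridge_fit(X, y, lam=0.0)    # ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)   # strong penalty shrinks w toward 0
```

Larger $\lambda$ trades a slightly worse fit on the training data for a smaller parameter vector, i.e., a simpler model in the sense of (1.25).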
The regularization term can also be the $L_1$ norm of the parameter vector:

$$L(w) = \frac{1}{N} \sum_{i=1}^{N} (f(x_i; w) - y_i)^2 + \lambda \|w\|_1 \tag{1.26}$$

Here, $\|w\|_1$ denotes the $L_1$ norm of the parameter vector $w$.

A model with smaller empirical risk in the first term may be more complex (with more non-zero parameters), in which case the model complexity in the second term will be larger. The role of regularization is to select a model with both small empirical risk and small model complexity.

Regularization conforms to the principle of Occam's razor. Applied to model selection, Occam's razor becomes the following idea: of all the candidate models, the one that explains the known data well and is as simple as possible is the best model and should be chosen. From the perspective of Bayesian Estimation, the regularization term corresponds to the prior probability of the model: it can be assumed that a complex model has a smaller prior probability and a simple model has a larger prior probability.

1.5.2 Cross-Validation

Another commonly used method of model selection is cross-validation. If the given sample data are sufficient, a simple method of model selection is to randomly divide the dataset into three parts: the training set, the validation set, and the test set. The training set is used to train the model, the validation set is used for model selection, and the test set is used for the final evaluation of the learning method. Among the learned models with different complexity, th
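The three-way split just described can be sketched as a small helper. The function name, split fractions, and seed are our own assumptions, not prescribed by the text:

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=0):
    # Randomly divide a dataset into training, validation, and test sets:
    # the training set trains the model, the validation set selects among
    # candidate models, and the test set gives the final evaluation.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = list(range(100))
train, val, test = split_dataset(data)  # 60 / 20 / 20 split
```

Shuffling before splitting keeps the three parts statistically comparable, which is what makes validation-set comparisons between models meaningful.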
