Machine Learning Overview PDF
Jim Liang
Summary
This document provides an overview of machine learning, discussing its fundamental concepts, algorithms, and applications. The presentation covers supervised learning, unsupervised learning, and related topics such as data cleansing, feature selection, and model evaluation.
Full Transcript
Machine Learning Overview

Objectives
⚫ Upon completion of this course, you will understand:
◼ Learning algorithm definitions and the machine learning process
◼ Related concepts such as hyperparameters, gradient descent, and cross-validation
◼ Common machine learning algorithms

Huawei Confidential

Contents
1. Machine Learning Algorithms
2. Types of Machine Learning
3. Machine Learning Process
4. Important Machine Learning Concepts
5. Common Machine Learning Algorithms

Machine Learning Algorithms (1)
⚫ Machine learning is often combined with deep learning methods to study and observe AI algorithms. A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃 if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.
[Figure: Data (Experience 𝐸) → Learning algorithm (Task 𝑇) → Understanding (Performance measure 𝑃)]

Machine Learning Algorithms (2)
[Figure: two parallel flows. Experience → Summarize → Rules; New problem → Input → Rules → Predict → Future. Historical data → Train → Model; New data → Input → Model → Predict → Future attributes.]
Created by: Jim Liang

Differences Between Machine Learning Algorithms and Traditional Rule-based Methods
⚫ Rule-based method:
◼ Explicit programming is used to solve problems.
◼ Rules can be manually determined.
⚫ Machine learning:
◼ Models are trained on samples.
◼ Decision-making rules are complex or difficult to describe.
◼ Machines automatically learn rules.
[Figure: Training data → Machine learning → Model; New data → Model → Prediction]

When to Use Machine Learning (1)
⚫ Machine learning provides solutions to complex problems, or to problems involving a large amount of data whose distribution function cannot be determined.
⚫ Consider the following scenarios:
◼ Rules are complex or difficult to describe, for example, speech recognition.
◼ Task rules change over time, for example, part-of-speech tagging, in which new words or word meanings can be generated at any time.
◼ Data distribution changes over time and programs need to adapt to new data constantly, for example, sales trend forecast.

When to Use Machine Learning (2)
[Figure: quadrant chart. Vertical axis: complexity of rules (low to high). Horizontal axis: scale of the problem (small to large). High complexity, small scale: manual rules. High complexity, large scale: machine learning algorithms. Low complexity, small scale: simple questions. Low complexity, large scale: rule-based algorithms.]

Rationale of Machine Learning Algorithms
[Figure: Ideal: target function 𝑓: 𝑋 → 𝑌. Actual: training data 𝐷: {(𝑥1, 𝑦1), ⋯, (𝑥𝑛, 𝑦𝑛)} → Learning algorithm → Hypothesis function 𝑔 ≈ 𝑓]
⚫ The target function 𝑓 is unknown, and the learning algorithm cannot obtain a perfect function 𝑓.
⚫ The hypothesis function 𝑔 approximates function 𝑓, but may be different from function 𝑓.

Main Problems Solved by Machine Learning
⚫ Machine learning can solve many types of tasks. The three most common types are:
◼ Classification: To assign the input to one of k categories, the learning algorithm usually outputs a function 𝑓: 𝑅ⁿ → {1, 2, …, 𝑘}. For example, image classification algorithms in computer vision solve classification tasks.
◼ Regression: The program predicts a numeric output for a given input. The learning algorithm usually outputs a function 𝑓: 𝑅ⁿ → 𝑅. Such tasks include predicting the claim amount of a policy holder to set an insurance premium, or predicting the price of a security.
◼ Clustering: Based on internal similarities, the program groups a large amount of unlabeled data into multiple classes. Data in the same class is more similar than data across classes. Clustering tasks include search by image and user profiling.
⚫ Classification and regression are the two major types of prediction tasks. The output of classification is discrete class values, and the output of regression is continuous values.

Types of Machine Learning
⚫ Supervised learning: The program takes a known set of labeled samples and trains an optimal model to generate predictions. The trained model then maps inputs to outputs and performs a simple judgment on the outputs. In this way, unknown data is classified.
⚫ Unsupervised learning: The program builds a model based on unlabeled input data. For example, a clustering model groups objects based on similarities. Unsupervised learning algorithms model highly similar samples, calculate the similarity between new and existing samples, and classify new samples by similarity.
⚫ Semi-supervised learning: The program trains a model on a combination of a small amount of labeled data and a large amount of unlabeled data.
⚫ Reinforcement learning: The learning system learns behavior from the environment to maximize the value of a reward (reinforcement) signal function. Reinforcement learning differs from connectionist supervised learning in that, instead of telling the system the correct action, the environment provides scalar reinforcement signals to evaluate its actions.
⚫ The evolution of machine learning is producing new learning types, for example, self-supervised learning, contrastive learning, and generative learning.

Supervised Learning
[Figure: rows of data features (Feature 1 ⋯ Feature n) with labels (Target) → Supervised learning algorithm]
Weather | Temperature | Wind Speed | Suitable for Exercise
Sunny | High | High | Yes
Rainy | Low | Medium | No
Sunny | Low | Low | Yes

Supervised Learning - Regression
⚫ Regression reflects the features of sample attributes in a dataset. A function is used to express the sample mapping relationship and further discover the dependency between attributes. Examples include:
◼ How much money can I make from stocks next week?
◼ What will the temperature be on Tuesday?
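A minimal sketch of the regression idea above: the following Python fits a straight line to a few temperature readings by ordinary least squares and extrapolates one day ahead. The readings, the function name `fit_line`, and the choice of a linear model are assumptions made for illustration only; the slides do not specify a model or data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical readings (not from the slides): day index -> temperature.
days = [1, 2, 3, 4]
temps = [32.0, 34.0, 36.0, 38.0]

slope, intercept = fit_line(days, temps)
next_day = 5
prediction = slope * next_day + intercept
print(f"Predicted temperature for day {next_day}: {prediction:.1f}")
```

The output is a continuous value, which is what distinguishes regression from classification. Real temperature forecasting would need far richer features than a day index.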
[Figure: Monday: 38°; Tuesday: ?]

Supervised Learning - Classification
⚫ Classification uses a classification model to map samples in a dataset to a given category.
◼ What category of garbage does the plastic bottle belong to?
◼ Is the email spam?

Unsupervised Learning
[Figure: rows of data features (Feature 1 ⋯ Feature n), without labels → Unsupervised learning algorithm → Intra-cluster similarity]
Monthly Sales Volume | Product | Sale Duration
1000-2000 | Badminton racket | 6:00-12:00
500-1000 | Basketball | 18:00-24:00
1000-2000 | Game console | 00:00-6:00
Category (output of clustering): Cluster 1, Cluster 2

Unsupervised Learning - Clustering
⚫ Clustering uses a clustering model to group samples in a dataset into several categories based on similarity.
◼ Identifying fish of the same species.
◼ Recommending movies for users.

Semi-supervised Learning
[Figure: rows of data features (Feature 1 ⋯ Feature n); one row has a label (Target) and the others are Unknown → Semi-supervised learning algorithm]
Weather | Temperature | Wind Speed | Suitable for Exercise
Sunny | High | High | Yes
Rainy | Low | Medium | /
Sunny | Low | Low | /

Reinforcement Learning
⚫ A reinforcement learning model learns from the environment, takes actions, and adjusts the actions based on a system of rewards.
[Figure: the model receives status 𝑠𝑡 and reward 𝑟𝑡 and sends action 𝑎𝑡 to the environment; the environment returns 𝑟𝑡+1 and 𝑠𝑡+1]

Reinforcement Learning - Best Action
⚫ Reinforcement learning always tries to find the best action.
◼ Autonomous vehicles: The traffic lights are flashing yellow. Should the vehicle brake or accelerate?
◼ Robot vacuum: The battery level is 10%, and a small area is not cleaned. Should the robot continue cleaning or recharge?

Machine Learning Process
[Figure: Data preparation → Data cleansing → Feature extraction and selection → Model training → Model evaluation → Model deployment and integration, with feedback and iteration across the stages]

Machine Learning Basic Concept - Dataset
⚫ Dataset: a collection of data used in machine learning tasks, where each piece of data is called a sample. Items or attributes that reflect the presentation or nature of a sample in a particular aspect are called features.
◼ Training set: the dataset used in the training process, where each sample is called a training sample. Learning (or training) is the process of building a model from data.
◼ Test set: the dataset used in the testing process, where each sample is called a test sample. Testing is the process in which the learned model is used for prediction.

Data Overview
⚫ Typical dataset composition (Area, Location, and Orientation are features 1 to 3; House Price is the label):
Set | No. | Area | Location | Orientation | House Price
Training set | 1 | 100 | 8 | South | 1000
Training set | 2 | 120 | 9 | Southwest | 1300
Training set | 3 | 60 | 6 | North | 700
Training set | 4 | 80 | 9 | Southeast | 1100
Test set | 5 | 95 | 3 | South | 850

Importance of Data Processing
⚫ Data is crucial to models and determines the scope of model capabilities. All good models require good data.
⚫ Data preprocessing includes:
◼ Data cleansing: Fill in missing values, and detect and eliminate noise and other abnormal points.
◼ Data standardization: Standardize data to reduce noise and improve model accuracy.
◼ Data dimension reduction: Simplify data attributes to avoid the curse of dimensionality.

Data Cleansing
⚫ Most machine learning models process features, which are usually numeric representations of input variables that can be used in the model.
⚫ In most cases, only preprocessed data can be used by algorithms.
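As a rough sketch of two of the preprocessing steps named above (filling in missing values and standardizing data), the Python below imputes a missing entry with the column mean and then rescales the column to zero mean and unit variance. The missing entry, the helper names `fill_missing` and `standardize`, and the choice of mean imputation are assumptions for illustration; the slides do not prescribe a particular strategy.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical "Area" column with one missing entry (None).
areas = [100, 120, None, 80, 95]

cleaned = fill_missing(areas)
scaled = standardize(cleaned)

print(cleaned)
print([round(v, 2) for v in scaled])
```

Mean imputation is only one option; median imputation or dropping incomplete samples may suit other datasets better.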
Data preprocessing involves the following operations:
◼ Data filtering
◼ Handling of missing data
◼ Handling of possible errors or abnormal values
◼ Merging of data from multiple sources
◼ Data consolidation

Dirty Data
⚫ Raw data usually contains data quality problems:
◼ Incompleteness: Data is incomplete or lacks relevant attributes or values.
◼ Noise: Data contains incorrect records or abnormal points.
◼ Inconsistency: Data contains conflicting records.
[Figure: examples of dirty data: missing value, invalid value, misfielded value, invalid duplicate items, incorrect format, dependent attributes, misspelling]

Data Conversion
⚫ Preprocessed data needs to be converted into a representation suitable for machine learning models. The following are typically used to convert data:
◼ Encoding categorical data as numbers for classification
◼ Converting numeric data into categorical data to reduce the number of distinct values of a variable (for example, segmenting age data)
◼ Other data:
   ◼ Text: word embedding, which converts words into word vectors (typically using models such as word2vec and BERT)
   ◼ Image data processing, such as color space conversion, grayscale image conversion, geometric conversion, Haar-like features, and image enhancement
◼ Feature engineering:
   ◼ Normalizing and standardizing features to ensure that different input variables of a model fall into the same value range
   ◼ Feature augmentation: combining or converting the existing variables to generate new features, such as averages

Necessity of Feature Selection
⚫ Generally, a dataset has many features, some of which may be unnecessary or irrelevant to the values to be predicted.
⚫ Feature selection is necessary in the following aspects:
◼ Simplifies models for easy interpretation
◼ Shortens training time
◼ Avoids the curse of dimensionality
◼ Improves model generalization and avoids overfitting

Feature Selection Methods - Filter
⚫ Filter methods are independent of models during feature selection. By evaluating the correlation between each feature and the target attribute, a filter method scores each feature using a statistical measure and then sorts the features by score. Features can then be kept or eliminated according to their scores.
⚫ Common methods:
◼ Pearson correlation coefficient
◼ Chi-square coefficient
◼ Mutual information
⚫ Filter method process: Traversing all features → Selecting the best feature subset → Learning algorithm → Model evaluation
⚫ Limitations of filter methods: Filter methods tend to select redundant variables because they do not consider the relationships between features.

Feature Selection Methods - Wrapper
⚫ Wrapper methods treat feature selection as a search problem: they use a predictive model to evaluate and compare different feature combinations, and score each feature subset by model accuracy.
⚫ Common method: Recursive feature elimination
⚫ Wrapper method process: Traversing all features → Generating a feature subset → Learning algorithm → Model evaluation → Selecting the best feature subset
⚫ Limitations of wrapper methods: Wrapper methods train a new model for each feature subset, which can be computationally intensive. In return, wrapper methods usually provide high-performance feature sets for a specific type of model.

Feature Selection Methods - Embedded
⚫ Embedded methods treat feature selection as a part of the modeling process. Regularization is the most common type of embedded method.
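To make the filter approach described earlier concrete, here is a small sketch that scores each numeric feature by the absolute Pearson correlation with the target and sorts the features by score. It uses the four training rows of the house price table from the Data Overview slide; the helper names `pearson` and `rank_features` are hypothetical, and four samples are far too few for a meaningful ranking, so treat the result as illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rank_features(features, target):
    """Score each feature by |correlation with target|, highest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Training rows (No. 1 to 4) of the house price table; numeric features only.
features = {
    "area": [100, 120, 60, 80],
    "location": [8, 9, 6, 9],
}
house_price = [1000, 1300, 700, 1100]

ranking = rank_features(features, house_price)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because each feature is scored on its own, two highly correlated features would both rank well, which is the redundancy limitation of filter methods noted in the slides.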
Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm to bias the model toward lower complexity and reduce the number of features.
⚫ Common method: LASSO regression
⚫ Embedded method process: Traversing all features → Generating a feature subset → Learning algorithm + model evaluation → Selecting the most appropriate feature subset

Supervised Learning Example - Learning Phase
⚫ Task: Use a classification model to determine whether a person is a basketball player based on specific features.
Service data, with features (attributes) Name, City, and Age, and target (label) Label:
Name | City | Age | Label
Mike | Miami | 42 | yes
Jerry | New York | 32 | no
Bryan | Orlando | 18 | no
Patricia | Miami | 45 | yes
Elodie | Phoenix | 35 | no
Remy | Chicago | 72 | yes
John | New York | 48 | yes
⚫ The service data is split into two sets:
◼ Training set (cleansed features and labels): data used by the model to determine the relationships between features and targets.
◼ Test set: new data for evaluating model effectiveness.
⚫ Train the model: Each feature or set of features provides a judgment basis for the model.

Supervised Learning Example - Prediction Phase
New data:
Name | City | Age | Label
Marine | Miami | 45 | ?
Julien | Miami | 52 | ?
Fred | Orlando | 20 | ?
Michelle | Boston | 34 | ?
Nicolas | Phoenix | 90 | ?
⚫ Unknown data: For recent data, it is not yet known whether each person is a basketball player.
⚫ Apply the model:
IF city = Miami → Probability = +0.7
IF city = Orlando → Probability = +0.2
IF age > 42 → Probability = +0.05*age + 0.06
IF age