Summary of Machine Learning Concepts PDF
Summary
This document summarizes a series of machine learning lectures covering supervised learning, linear regression and classification, kernel methods, support vector machines, decision trees and ensemble methods, artificial neural networks, feature selection and PCA, and unsupervised learning. The foundational material on linear models discusses distance measures, loss functions, and risk minimization as the basis for accurate prediction.
Full Transcript
Lecture_02 - Linear regression and classification.pdf

Introduction to Supervised Learning
The lecture covers linear regression and classification in the context of machine learning for smart industry, building on the introduction to AI, ML, and supervised learning from the previous lecture. Supervised learning involves a dataset of inputs with corresponding outputs or labels, where the goal is to find a mapping between the inputs and outputs, often referred to as the oracle function. The oracle function is assumed to contain some noise or randomness, represented by adding a random noise term to the output, and the goal is to obtain a "good" approximation of this oracle function from the dataset. The approximation is chosen from a collection of functions known as the hypothesis space, and the closeness of the approximation to the oracle function is defined using a suitable loss function. The loss function is aggregated over all data in the dataset to form the risk, which is then minimized to find the best predictive model. However, the risk minimization problem is often not feasible in practice; instead, empirical risk minimization is used, which evaluates the risk from the samples in the dataset. The challenge of generalization arises because the model needs to perform well on new, unseen data, and this is a key question in supervised learning. The main questions to be addressed are: how to define the hypothesis space, how to perform the optimization, and how well the model generalizes to unseen data.

Classification Example
A classification example is provided in which a model is learned to map images to labels, with the hypothesis space consisting of candidate functions from images to the set of labels. The optimization or training process is used to obtain the model, and the goal is to find a good approximation of the oracle function that maps inputs to outputs.

Linear Models and Related Concepts
The lecture also touches on linear models, distance and norm, linear regression, basis functions, the matrix solution, and residuals, though these are not fully explored in this section. The authors mentioned in the text are Ronald Aarts, Bojana Rosić, and Qianxiao Li. Supervised learning involves classification and regression, where the goal is to make predictions based on labeled data, and generalization is key to achieving accurate results. Linear models are well suited to illustrate machine learning concepts and serve as a baseline for more complex problems, with the goal of minimizing the loss function or risk.

Distance and Norms
Measuring closeness or distance is crucial, and norms such as the Euclidean (2-norm), Manhattan (1-norm), and maximum (∞-norm) are used to quantify the difference between predictions and actual values.

Linear Regression
Linear regression involves finding the best-fitting linear function to the data using the least squares (LSQ) method, which minimizes the sum of the squared errors between predictions and actual values. The hypothesis space for linear regression is the space of linear functions, the Euclidean norm is used for the loss function, and there is an explicit solution for the parameters. Basis functions can be used to extend linear regression to more complex models in which the parameters still appear in a linear relation, and the goal remains to minimize the Euclidean norm of the difference between predictions and actual values.
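As a concrete illustration of the least-squares fit described above, the following minimal NumPy sketch (not taken from the lecture; the toy data and variable names are invented) fits a straight line by minimizing the squared Euclidean norm of the residual, which is the explicit solution discussed next.

```python
import numpy as np

# Toy data: noisy samples from an assumed "oracle" y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)

# Design matrix for the linear hypothesis space f(x) = w0 + w1*x
Phi = np.column_stack([np.ones_like(x), x])

# Least-squares estimate: minimizes ||y - Phi w||^2
# (equivalently, w_hat = pinv(Phi) @ y using the Moore-Penrose pseudoinverse)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residual = y - Phi @ w_hat
print("estimated parameters:", w_hat)
print("mean squared residual:", np.mean(residual**2))
```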
The solution for the parameter estimate involves taking the derivative of the loss function and setting it to zero, resulting in a linear problem that can be solved using the Moore-Penrose pseudoinverse. Examples of linear regression include fitting a linear function to data, which may result in underfitting, and fitting a high-order polynomial, which may result in overfitting. Residual plots can be used to evaluate the fit: random and small residuals indicate a good fit, while structured residuals indicate the need for a more complex hypothesis space or the removal of outliers.

Linearity in Parameters and 2-Norm
Linear regression and classification are discussed in the context of linearity in the parameters combined with the 2-norm, which allows for straightforward mathematical analysis using linear algebra and yields a single analytical solution that is also the global minimum. If the relationship is not linear in the parameters, or a different norm is used, there is typically no explicit analytical solution and multiple solutions are likely to exist. In such cases, an initial guess is made and an iterative approach, such as gradient descent, is used to find a better solution with a smaller norm. This topic is discussed further in the work of Qianxiao Li (2020), specifically on pages 16-17, and will be explored in more detail later in the course.

Lecture_03 - Nonlinear regression using kernels (1).pdf

Linear Regression with Basis Functions and Regularization
Linear regression is defined by a hypothesis space ℋ in which the parameters w appear in a linear relation with the basis functions or feature maps, f(x) = φ(x)^T w. The goal is to minimize the Euclidean or 2-norm of the residual, R_emp(w) = ||y − Φw||²/2, where Φ is the matrix of basis functions evaluated at the inputs. The solution for the parameter estimate is ŵ = Φ† y, where Φ† is the Moore-Penrose pseudoinverse of Φ. A more general form of linear regression is introduced in which the input x_i ∈ ℝ^d can be a vector of dimension d, and the basis functions φ_i map ℝ^d to ℝ. Regularization is introduced to address the issue of a possibly infinite number of solutions by adding a regularization term to the cost function: min_{w ∈ ℝ^M} R_emp(w) = min_{w ∈ ℝ^M} (1/M) ||y − Φw||² + λ C(w). The regularization function C: ℝ^M → ℝ₊ and the parameter λ > 0 control the strength of the regularization. A specific type of regularization, ℓ² regularization or ridge regression, is introduced, where C(w) = ||w||². The solution for the parameter estimate with ℓ² regularization is ŵ = (Φ^T Φ + λ I_M)⁻¹ Φ^T y.

Nonlinear Regression and Optimization
Nonlinear regression is introduced, where the matrix formalism does not apply and there is no explicit solution, requiring iterative methods such as gradient descent. Gradient descent updates the parameters using the local gradient of the cost function R_emp(w), but may end in a local minimum instead of the global minimum. Nonlinear optimization is a common technique in machine learning, and it can be implemented using TensorFlow, which offers a stochastic gradient descent method called the Adam optimizer. The Adam optimizer is computationally efficient, has little memory requirement, and is well suited for large problems, as stated by Kingma et al. in their 2014 paper.

Applying Machine Learning with TensorFlow
To apply machine learning, one needs to select a hypothesis space, choose an optimization procedure, and check the generalization of the model by splitting the data into training, validation, and test sets.
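The workflow just described (choose a hypothesis space, optimize with a stochastic gradient method such as Adam, and check generalization on held-out data) can be sketched with TensorFlow/Keras as follows. This is a minimal sketch, not the lecture's code: the network size, number of epochs, and toy data are invented, and scikit-learn's train_test_split is assumed to be available for the hold-out split.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Toy regression data (invented for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 1)).astype("float32")
y = (np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)).astype("float32")

# Generalization check: hold out a test set; a validation split is used during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothesis space: a small nonlinear model (no explicit solution, so iterative optimization)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(1),
])

# Optimization procedure: the Adam stochastic gradient descent variant
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, validation_split=0.2, epochs=50, verbose=0)

print("test MSE:", model.evaluate(X_test, y_test, verbose=0))
```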
TensorFlow provides a framework for implementing machine learning models, and some tips for using it include importing and preprocessing the data, checking the data format, splitting the dataset, scaling numerical values, defining the model, compiling the model, training the model, evaluating the model, and computing predictions.

Classification
Classification is a type of machine learning problem in which the outputs are discrete labels; it can be solved using a hypothesis space with an activation function that produces binary output labels. For binary classification, the activation function can be a "hard" switch or a "smooth" transition, such as the tanh function; ignoring the activation function results in a linear model, while including it results in a nonlinear model. For multi-class classification, the outputs are multiple discrete labels, and the labels can be represented using one-hot embedding, where each label is a one-hot vector. The hypothesis space for multi-class classification is also multi-dimensional, and the oracle function maps the input vector to a vertex of a hypercube. The hypothesis space can be defined using a weight vector and an activation function that produces an output comparable to a one-hot vector; the activation function can be a theoretical function that selects the maximum value or a smooth approximation such as the softmax function.

Kernel Ridge Regression (Part 1: Derivation)
The least squares with ℓ² regularization, or ridge regression, problem is defined as minimizing the empirical risk R_emp(w) = (1/2M) ||Φw − y||² + (λ/2) ||w||², where Φ is the design matrix, w is the weight vector, y is the target vector, λ is the regularization parameter, and M is the number of samples. The solution to this problem is ŵ = (Φ^T Φ + λI)⁻¹ Φ^T y, where the identity matrix I has the size of the number of basis functions. An alternative (dual) solution is ŵ = Φ^T (Φ Φ^T + λI_M)⁻¹ y, where I_M is the M × M identity matrix; this form is convenient for making predictions for new samples. The prediction for a new sample x can be expressed as f̂(x) = φ(x)^T ŵ, where φ(x) is the feature map of x. By defining α = (Φ Φ^T + λI_M)⁻¹ y, the prediction can be rewritten as f̂(x) = ∑_{i=1}^M α_i k(x, x_i), where k(x, x') is the kernel function defined as k(x, x') = φ(x)^T φ(x'). The kernel function can be defined without an explicit feature map, and the solution depends only on the kernel function. The kernel function k: ℝ^d × ℝ^d → ℝ is given by k(x, x') = φ(x)^T φ(x'), and examples of symmetric positive definite (SPD) kernel functions include the linear, polynomial, and Gaussian/RBF kernels. The Gram matrix G is defined as G_ij = k(x_i, x_j), so that α = (G + λI_M)⁻¹ y and the prediction is f̂(x) = ∑_{i=1}^M α_i k(x, x_i). The relationship between feature maps and kernel functions is demonstrated through examples, showing how a kernel function can be derived from a feature map and vice versa, including a simple linear regression example and a polynomial kernel example.

Kernel Ridge Regression (Part 2: Gaussian Kernel and Implementation)
The Gaussian/RBF kernel function is defined as k(x, x') = exp(−||x − x'||² / (2σ²)), where σ is a hyperparameter. Kernel ridge regression uses a hypothesis space for SPD (symmetric positive definite) kernels, defined as ℋ = {f : f(x) = ∑_{i=0}^∞ w_i φ_i(x)}, where w_i ∈ ℝ and k(x, x') = ∑_{i=0}^∞ φ_i(x) φ_i(x').
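The kernel ridge regression solution summarized next can be sketched in a few lines of NumPy. This is a minimal illustration under invented data and hyperparameters (σ, λ), not the lecture's implementation: build the Gram matrix with a Gaussian/RBF kernel, solve for α = (G + λI)⁻¹y, and predict with f̂(x) = ∑ α_i k(x, x_i).

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.3):
    """Gaussian/RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0.0, 1.0, size=(25, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(25)

lam = 1e-3
G = rbf_kernel(X, X)                                   # Gram matrix G_ij = k(x_i, x_j)
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)   # alpha = (G + lambda I)^-1 y

X_new = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
f_hat = rbf_kernel(X_new, X) @ alpha                   # f_hat(x) = sum_i alpha_i k(x, x_i)
print(f_hat)
```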
The solution to regularized empirical risk minimization is f̂(x) = ∑_{i=1}^M [(G + λI_M)⁻¹ y]_i k(x, x_i), where G is the Gram matrix with elements G_ii' = k(x_i, x_i'); this solution does not require explicit knowledge of the feature maps φ_i. To implement kernel ridge regression in TensorFlow, one needs to define the kernel function, such as the RBF (radial basis function) or polynomial kernel, which depends on the hyperparameter σ (also referred to as gamma); another hyperparameter is the number of basis functions to be used. The implementation also involves splitting the dataset into training, validation, and test data, scaling the numerical values, transforming the data according to the kernel function, and other steps. Choosing hyperparameters, such as the width σ of the RBF kernel, is an important step in kernel ridge regression, and this can be done using techniques such as cross-validation.

Introduction to Support Vector Machines
The next topic to be covered is Support Vector Machines (SVM), a different type of machine learning algorithm.

Lecture_04 - Support vector machines (1).pdf

Introduction to Support Vector Machines
The lecture covers Support Vector Machines (SVMs), a powerful machine learning algorithm used for linear and nonlinear classification, regression, and outlier detection tasks. SVMs have various applications, including text classification, image classification, spam detection, face detection, and anomaly detection. The primary objective of the SVM algorithm is to identify the optimal hyperplane in an N-dimensional space that effectively separates data points into different classes in the feature space. The algorithm ensures that the margin between the closest points of different classes, known as support vectors, is maximized. A hyperplane is a boundary that classifies the data set, and the best hyperplane is the one that maximizes the separation margin between the two classes. The maximal-margin hyperplane is selected by maximizing the distance between the hyperplane and the nearest data point on each side, known as the hard margin. The equation of the linear hyperplane is w·x + b = 0, where w is the normal vector to the hyperplane and b determines the offset of the hyperplane from the origin along w. The support vectors are the data points closest to the hyperplane, and the hyperplane is a flat affine subspace of dimension N−1 in N-dimensional space. The lecture covers SVM classification, including the linear, hard-margin hyperplane, the soft margin, and the use of kernels, as well as SVM regression.

Hyperplanes and Margin Maximization
The set H = {x ∈ ℝ^d : w·x + b = 0} is a hyperplane, where w is a normal vector and l = |b|/||w|| is the distance to the origin. In the context of SVMs, the goal is to find the hyperplane that maximizes the margin between two classes, which can be achieved by minimizing the norm of the weight vector w. The equation of the hyperplane is w·x + b = 0, and the margins are defined by the hyperplanes w·x + b = 1 and w·x + b = −1. The distance between these two hyperplanes is 2/||w||, and the goal is to maximize this distance. The constraints can be written compactly as y_i (w·x_i + b) − 1 ≥ 0, where y_i ∈ {−1, +1} is the label of the data point x_i. The optimization problem is to minimize (1/2)||w||² subject to the constraints y_i (w·x_i + b) − 1 ≥ 0.
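The following short scikit-learn sketch (not from the lecture; the toy data and parameter values are invented) fits a linear SVM with a very large C, which approximates the hard-margin case, and reads off the normal vector w, the offset b, the support vectors, and the margin width 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds with labels -1 and +1 (toy data)
rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

# A very large C leaves almost no room for slack, approximating the hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```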
Soft Margins and Penalty Terms
Not all data is linearly separable; in such cases soft margins are used, which introduce a penalty term into the minimization problem. The penalty term typically has the form λ ∑ penalty(·), and one common penalty function is the hinge loss. Slack variables ξ_i are introduced to allow some data points to violate the margin or the hyperplane, and the cost of misclassification is controlled by the variable C, which multiplies the sum of the slack variables ξ_i over all data points in the objective. The C parameter in the soft-margin SVM optimization therefore determines the trade-off between maximizing the margin and minimizing the classification error. When C is small, the slack variables are allowed to be large, so more misclassifications or margin violations are tolerated; when C is large, the slack variables are forced to be small, so fewer violations are tolerated. A small C effectively ignores outliers and focuses on the margin term, resulting in a large margin, while a large C gives importance to outliers and focuses on the penalty term, resulting in a small margin. As C approaches infinity, no violations are allowed and the problem reduces to the hard margin. The goal of the optimization is to maximize the margin while minimizing the classification error.

Hinge Loss Function
The optimization condition for the soft-margin case can be rewritten using the hinge loss function, defined as ξ_i = max(0, 1 − y_i f(x_i)), where f(x_i) = w^T x_i + b. The hinge loss penalizes margin violations, and its value depends on whether the data point is classified correctly and on its distance to the margin. For correct classification with y_i f(x_i) ≥ 1, the loss is zero; for correct classification with y_i f(x_i) < 1, the loss is 1 − y_i f(x_i). For incorrect classification, the loss is always positive and increases linearly the further the point lies on the wrong side of the hyperplane.

SVMs Using Kernels
Kernels are nonlinear functions that transform data into a higher-dimensional space, making it possible to handle non-linearly separable data. The kernel trick allows similarities between data points in the higher-dimensional space to be computed without explicitly computing the coordinates in that space.

Need for Kernels
Kernels handle non-linearly separable data by transforming the feature space, provide flexible options for choosing diverse kernels depending on the data, and implicitly take care of feature extraction by projecting the data into a space where it becomes linearly separable.

Types of Kernels
Linear kernel: no transformation necessary; suitable for linearly separable data but not for complex, nonlinear data.
Polynomial kernel: allows more complex decision boundaries and can capture interactions between features; suitable for nonlinear data, but computationally more expensive and prone to overfitting.
Radial basis function (RBF) kernel: highly flexible and able to handle very complex nonlinear relations, but computationally expensive on large datasets.

Choosing the Right Kernel
Consider the complexity of the data, consider the computational resources available, and evaluate model performance.

Feature Scaling
Feature scaling is crucial when distances between two observations differ between the non-scaled and scaled cases.
Normalization: maps values into [0, 1].
Standardization: shifts feature values to have a mean of 0 and a standard deviation of 1; it centers the data and is more robust to new values.

SVM Classification and Regression
SVM classification predicts the class of a new point and can be binary or multiclass (one-against-all or one-against-one). SVM regression predicts a function that describes the data set within a maximum absolute deviation. SVM soft classification allows for misclassified points by introducing slack variables, and SVM soft regression allows for outliers by introducing slack variables.

Conclusion
SVMs using kernels provide a powerful tool for handling non-linearly separable data. Choosing the right kernel and applying feature scaling are crucial for optimal performance.

Lecture_05 - Decision trees.pdf

Decision Tree Concepts
A decision tree is a non-parametric supervised learning approach used for classification and regression, using a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. Key terminology includes the root node, decision node (also called internal node), leaf node, branch/sub-tree, and pruning. The decision tree algorithm starts at the root, asks the best question, branches out, and repeats until the final leaf nodes are reached, yielding a predicted outcome or classification. Decision trees have several advantages, including simplicity and interpretability, versatility for both classification and regression, no need for feature scaling, and the ability to handle nonlinear relationships between features and target variables. However, they also have disadvantages, such as overfitting, instability, and bias towards features with many levels. Decision trees assume binary splits, recursive partitioning, feature independence, homogeneity, no missing values or outliers, and a top-down (greedy) approach.

Entropy and Information Gain
Entropy measures the uncertainty in a dataset, with lower entropy indicating higher purity, and is used to determine the best feature on which to split a node in a decision tree. Information gain measures the reduction of uncertainty given some feature and is used to decide which attribute should be selected as a decision node. The hiking example illustrates how to calculate entropy and information gain for different features, with the feature having the highest information gain selected as the next parent node.

Decision Tree Tuning and Gini Index
Decision trees can be tuned by adjusting hyperparameters such as the maximum depth, the minimum number of samples required for a split, the minimum number of samples in a leaf node, and the maximum number of allowable features. The Gini index is a measure of the inequality or impurity of a distribution; in decision trees it evaluates the quality of a split by measuring the difference between the impurity of the parent node and the weighted impurity of the child nodes. The Gini index is faster to compute and more sensitive to changes in class probabilities compared to other impurity measures such as entropy. The source slides also include a table of entropy values that is not explained further in the text.
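To make the entropy and information gain calculation concrete, here is a minimal NumPy sketch in the spirit of the hiking example; the labels and the split are invented for illustration and are not the lecture's numbers.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting the parent node into two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Toy "go hiking?" labels split on some weather feature
parent = np.array(["yes", "yes", "yes", "no", "no", "yes", "no", "no"])
left   = np.array(["yes", "yes", "yes", "yes"])   # e.g. feature value = sunny
right  = np.array(["no", "no", "no", "no"])       # e.g. feature value = rainy

print("parent entropy:", entropy(parent))                       # 1.0 bit (4 yes / 4 no)
print("information gain of split:", information_gain(parent, left, right))  # 1.0
```

The feature whose split yields the highest information gain would be chosen as the next decision node.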
Pruning Decision Trees
Pruning is a technique used to avoid overfitting in decision trees by removing nodes or sub-nodes that are not important. There are two types of pruning: pre-pruning, which prunes the tree while it is growing, and post-pruning, which prunes the tree once it has been built to its full depth.

Decision Tree Algorithms (CART)
The CART (Classification And Regression Trees) algorithm splits the training set into two subsets using a single feature and a threshold. It searches for the feature-threshold pair that produces the purest subsets, minimizing the cost function. CART is a greedy algorithm and does not guarantee an optimal solution.

Ensemble Learning
Ensemble learning combines the predictions of multiple models to produce better predictions than any individual model. It aims to mitigate errors or biases in individual models by leveraging collective intelligence.

Types of Ensemble Learning
Max voting: multiple models make predictions for each data point, and the majority prediction is used as the final prediction.
Averaging: the average of the predictions from all models is used as the final prediction.
Weighted averaging: an extension of averaging in which each model is assigned a weight defining its importance.

Bagging, Pasting, and Random Forest
Bagging and pasting create multiple subsets of the training set, fit a model on each subset, and make a combined prediction; bagging samples with replacement, while pasting samples without replacement. A random forest is an ensemble of decision trees that introduces extra randomness when growing trees: at each node it searches for the best feature among a random subset of features. The random forest algorithm creates random subsets from the original dataset, fits a decision tree on each subset, and calculates the final prediction by averaging the predictions from all decision trees.

Random Forest Considerations
Things to take care of with random forests include overfitting, optimizing computational resources, imbalanced data, overly complex models, and noisy data.

Boosting
Boosting trains predictors sequentially, with each model trying to correct the errors made by its predecessor. Each model is trained on a modified version of the dataset, in which instances that were misclassified by previous models are given more weight. The final prediction is made by weighted voting, and there are several approaches to achieve this, including AdaBoost and Gradient Boost.

AdaBoost
AdaBoost is an ensemble learning method that works by training a base classifier, making predictions on the training set, increasing the relative weight of misclassified instances, and repeating the process with the updated weights. The decision boundaries of consecutive predictors in AdaBoost are adjusted to improve accuracy, with each predictor doing a better job on the instances that were misclassified by the previous predictor. However, AdaBoost is not very scalable, because its sequential nature prevents parallelization.

Gradient Boost
Gradient Boost sequentially adds predictors to an ensemble, each one correcting its predecessor by fitting the new predictor to the residual errors made by the previous predictor.
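The residual-fitting idea behind Gradient Boost can be sketched with a short loop of scikit-learn decision tree regressors. This is a simplified illustration under invented data and hyperparameters (tree depth, learning rate, number of rounds), not a full gradient boosting implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

trees, learning_rate, n_rounds = [], 0.5, 5
residual = y.copy()
for _ in range(n_rounds):
    # Each new tree is fit to the residual errors left by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    residual -= learning_rate * tree.predict(X)

def ensemble_predict(X_new):
    """Sum the (scaled) predictions of all trees in the ensemble."""
    return learning_rate * sum(t.predict(X_new) for t in trees)

X_test = np.array([[-0.5], [0.0], [0.5]])
print(ensemble_predict(X_test))   # approaches [0.25, 0.0, 0.25] as rounds are added
```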
Decision Trees and Ensemble Methods Overview
Decision trees are flowchart-like structures that model decisions and their possible consequences, and they can be used for both classification and regression tasks. Ensemble methods can be divided into two main types: sequential ensemble methods, such as AdaBoost, and parallel ensemble methods, such as random forests. The basic motivation of sequential methods is to exploit the dependence between the base learners, while parallel methods exploit independence between the base learners to achieve error reduction. Random forests are a classic example of an ensemble method, combining multiple decision trees to solve complex problems.

Lecture_06 - Artificial Neural Networks.pdf

This lecture covers the comparison between biological and artificial neurons, the perceptron, the multilayer perceptron (MLP), backpropagation, MLPs for regression and classification, the learning rate, and optimizers. A biological neuron consists of a cell body with a nucleus and several extensions called dendrites, a long extension called the axon, and tiny structures called synapses at the tips of the axon's branches that connect to other neurons. Biological neurons produce short electrical signals that travel along the axon, causing the synapses to release chemical signals called neurotransmitters, and are often organized in consecutive layers. An artificial neuron, first proposed by Warren McCulloch and Walter Pitts, has one or more binary inputs and one binary output, activating its output once a certain number of inputs are active. McCulloch and Pitts demonstrated that this simple artificial neuron model can be used to build complex models, and various examples exist to illustrate the concept.

Introduction to Perceptrons
The perceptron was invented by Frank Rosenblatt in 1957 and is the simplest artificial neural network (ANN) architecture, consisting of a threshold logic unit (TLU) that computes a weighted sum of its inputs and applies a step function to produce an output. Each input is associated with a weight, and the TLU computes the weighted sum z = ∑ x_i w_i. The perceptron then applies a step function to z, giving the output h_w(x) = step(z). Common step functions used in perceptrons include the Heaviside step function and the sign function. A perceptron is composed of a single layer of TLUs, where each TLU is connected to all inputs; if all neurons in a layer are connected to every input neuron, the layer is called a fully connected or dense layer. Perceptrons are limited to linearly separable data, cannot learn complex patterns, have no probabilistic output, and are sensitive to the initial weights.

Multilayer Perceptrons (MLPs)
MLPs are composed of one input layer, one or more hidden layers, and one output layer. Every layer is fully connected to the next one, and every layer except the output includes a bias neuron. The backpropagation algorithm is a gradient descent algorithm that computes the gradient of the network's error with respect to every model parameter in two passes (one forward and one backward). It relies on activation functions such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU) to introduce non-linearity into the model.
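The TLU computation given above (a weighted sum followed by a step function) can be written in a few lines of NumPy. This is a minimal sketch with invented weights chosen so that the unit behaves like a logical AND of two binary inputs; it is not code from the lecture.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

def tlu_output(x, w, b):
    """Threshold logic unit: weighted sum of the inputs followed by a step function."""
    z = np.dot(w, x) + b          # z = sum_i x_i * w_i + bias
    return heaviside(z)

# Illustrative weights implementing a logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", tlu_output(np.array(x, dtype=float), w, b))
```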
MLPs for Regression
MLPs can be used to predict a single value or for multivariate regression. Typical loss functions for regression include the mean squared error and the mean absolute error.

MLPs for Classification
MLPs can be used for binary, multilabel binary, and multiclass classification. Typical architectures use a sigmoid output activation for binary classification and a softmax output activation for multiclass classification.

Learning Rate and Optimizers
Cross entropy is a loss function used for multiclass classification that penalizes the model when it estimates a low probability for the target class. The learning rate controls the step size the model takes towards the minimum of the loss function; a higher learning rate can give faster convergence but may overshoot the minimum. Optimizers such as gradient descent, stochastic gradient descent, and Adam are used to minimize the model's error or loss function. The Adam optimizer is a popular choice that combines the best of stochastic gradient descent with momentum and RMSProp; it is straightforward to implement, has a fast running time and low memory requirements, and requires less tuning than other optimization algorithms.

Lecture_07 - Feature selection and PCA.pdf

Introduction and Background
The current lecture focuses on feature selection and principal component analysis (PCA), emphasizing dimension reduction and linear algebra concepts. Dimension reduction is discussed in the context of both supervised learning (linked to regression) and unsupervised learning (detecting common features in data), with PCA being a key technique.

Mathematical Background for PCA
The mathematical background for PCA involves eigenvalues and eigenvectors, which are fundamental concepts in linear algebra. The relevant linear algebra is detailed, including the properties of real square matrices, eigenvectors, eigenvalues, and the conditions for matrix diagonalization. Symmetric matrices, orthogonal matrices, and their properties are explained, including the conditions for positive semi-definite and positive definite matrices.

PCA Introduction and Process
PCA is introduced as a method for transforming a dataset to a reduced dimension using a linear transformation, with an example of reducing 2D data to 1D. The PCA process involves maximizing variance, assuming the data has zero mean, and constructing a covariance matrix. A linear transformation projects the data onto a new axis, with the goal of maximizing the variance of the projected data. Maximizing the variance leads to an optimization problem solved using a Lagrange multiplier, which identifies the principal component axes. The first principal component corresponds to the eigenvector associated with the largest eigenvalue, and subsequent components are determined similarly.

Principal Component Analysis (PCA) Explained
Principal component analysis reduces the dimensionality of data while minimizing the reconstruction error. The goal is to obtain a lower dimension with the smallest error by finding an orthogonal basis for the data. The data is represented as a linear combination of basis vectors, and the first basis vectors are chosen to capture the most variance in the data. To reduce the dimensionality, the first m basis vectors are selected and the data is projected onto them.
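The projection onto the first m basis vectors described above can be sketched directly with NumPy: center the data, build the covariance matrix, eigendecompose it, and project onto the leading eigenvectors. This is a minimal illustration with invented 2D toy data, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy 2-D data with most of its variance along one direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# 1. Assume zero mean: centre the data
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the centred data
C = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition of the symmetric covariance matrix,
#    sorted so the largest eigenvalue (first principal component) comes first
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first m principal component axes
m = 1
Z = Xc @ eigvecs[:, :m]

print("eigenvalues (variance captured per axis):", eigvals)
print("reduced data shape:", Z.shape)   # (200, 1)
```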
The error is minimized by finding the optimal basis vectors that capture the most variance in the data; these first basis vectors are the first m eigenvectors of the covariance matrix of the data.

Applications of PCA
PCA can be used for both regression and classification tasks. In regression, PCA can be used to find the best-fit line or hyperplane that minimizes the error. In classification, PCA can be used to reduce the dimensionality of the data and improve the performance of the classifier.

Dimensionality Selection and Feature Space PCA
The choice of the dimension m is important and can be made by plotting the eigenvalues of the covariance matrix and selecting the point where the eigenvalues drop off sharply. PCA can also be applied in feature space by applying a feature map to the data and then performing PCA on the transformed data; the feature map transforms the data into a higher-dimensional space in which it is more separable.

PCA and Related Concepts
PCA is related to Singular Value Decomposition (SVD), which factorizes a matrix into the product of three matrices. SVD can be used to find the eigenvectors and eigenvalues of a matrix, which are used in PCA. The Moore-Penrose pseudoinverse can be used to invert a matrix, which is useful in PCA and other machine learning algorithms.

Lecture_08 - Unsupervised learning.pdf

Introduction
The previous lecture covered dimensionality reduction, principal component analysis (PCA), PCA in regression and classification, PCA in feature space, and Singular Value Decomposition. The current lecture, 'Lecture 08: Unsupervised Learning', covers the differences between supervised and unsupervised learning, unsupervised learning itself, and techniques such as clustering, k-means clustering, DBSCAN, the Gaussian mixture model, and association rules. The broader course, 'Machine Learning for Smart Industry' by Ameya Rege, covers a range of topics including neural nets and deep learning (MLP, CNN, RNN, GAN), supervised learning (regression and classification), reinforcement learning, and unsupervised learning (clustering and pattern search). Unsupervised learning is a type of machine learning in which algorithms analyze and cluster unlabeled data sets, discovering hidden patterns or data groupings without human intervention. Unlike supervised learning, unsupervised learning has no output labels, and the goal is to identify patterns and relationships in the data without prior knowledge of their meaning. Unsupervised learning can be compared to supervised learning and reinforcement learning, with unsupervised learning being the foundation, supervised learning the refinement, and reinforcement learning the optimization. Common unsupervised learning algorithms include clustering, association rules, and dimensionality reduction.

Clustering
Clustering involves grouping unlabeled data into clusters based on similarities, with common approaches including exclusive clustering, density-based clustering, hierarchical clustering, and probabilistic clustering. Exclusive clustering, also known as hard clustering, assigns each data point to only one cluster, with the k-means clustering algorithm being a common example. K-means clustering is a centroid-based approach that partitions the data points into a predetermined number of clusters by repeatedly recalculating the centroid of each cluster, but it requires selecting the number of clusters k and is sensitive to the initial centroid positions.
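A short scikit-learn sketch (toy data and parameter values invented for illustration) fits k-means for several values of k and prints the inertia used by the elbow method described next.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs in 2-D
rng = np.random.default_rng(6)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in centres])

# Fit k-means for a range of k and record the inertia
# (km.inertia_ is the sum of squared distances from each point to its closest centroid)
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}")
# The "elbow" of the inertia curve (around k=3 for these three blobs) suggests the number of clusters.
```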
The elbow method is a common technique for selecting k in k-means clustering, based on the inertia metric, which measures the mean squared distance between each data point and its closest centroid. K-means performs well on well-separated data but is not suitable for overlapping data points or non-spherical clusters, and it provides little clear information about cluster quality.

Density-based clustering, such as DBSCAN, finds groups based on data point density, automatically determines the number of clusters, and is less sensitive to initial positions, making it suitable for non-separable data, irregular shapes, or overlapping clusters. DBSCAN separates dense regions from lower-density regions, identifying core points, border points, and noise; its parameters are the radius and the minimum number of points.

The Gaussian mixture model (GMM) is a probabilistic model that assumes the instances were generated from a mixture of Gaussian distributions and estimates the weights and distribution parameters to identify clusters. GMM assumes that each instance is generated from a randomly chosen cluster, with the probability of choosing a cluster defined by its weight, and the location of the instance sampled from a Gaussian distribution with that cluster's mean and covariance matrix. The goal of GMM is to estimate the weights and distribution parameters given the dataset (the estimation procedure itself is not described in the provided text). GMM is a generalization of the k-means approach: it finds not only the cluster centers but also the size, shape, and orientation of the clusters, as well as their relative weights. The GMM algorithm alternates two steps: the expectation step, in which the algorithm estimates the probability that each instance belongs to each cluster, and the maximization step, in which each cluster is updated using all instances in the dataset, weighted by the estimated probability of belonging to that cluster. This approach is also known as soft clustering, since each instance can belong to multiple clusters with different probabilities. In contrast to k-means, GMM can handle clusters with varying densities and shapes.

Association Rules
Association rules are used to discover patterns in binary data, where each instance represents a transaction containing a subset of items. An association rule X ⇒ Y, where X and Y are subsets of items, means that whenever X occurs in a transaction, Y is also likely to occur. Two measures are used to evaluate association rules: support and confidence. Support measures how frequently an itemset appears in the dataset, while confidence measures the percentage of transactions containing X that also contain Y. A minimum threshold can be set for both support and confidence to filter out weak rules; alternatively, the product of support and confidence can be used to evaluate the strength of a rule.
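Support and confidence can be computed directly from a list of transactions. The following minimal Python sketch uses a small invented set of market-basket transactions to evaluate the rule {bread} ⇒ {milk}.

```python
# Minimal example of evaluating an association rule X => Y on toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, transactions) / support(X, transactions)

X, Y = {"bread"}, {"milk"}
print("support(X => Y)    =", support(X | Y, transactions))    # 3/5 = 0.6
print("confidence(X => Y) =", confidence(X, Y, transactions))  # 3/4 = 0.75
```

A rule would be kept only if both values exceed the chosen minimum thresholds (or if their product is sufficiently large).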