AML-1413-topic-6_Lecture-9-10-11 (Exam Preparation).pptx
AML – 1413 Artificial Intelligence

"Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better." (Elaine Rich)

Topic 6: The learning methods and their domains

Major Topics:
◇ Machine learning
◇ Supervised learning
◇ Unsupervised learning
◇ Classification
◇ Regression and its types
◇ Clustering
◇ Dimensionality reduction
◇ Deep learning

Machine learning
"Machine Learning is the study of computer algorithms that improve automatically through experience." (Tom Mitchell)
A branch of Artificial Intelligence (AI) that can act as the brain of AI. ML learns and works by:
- Finding patterns and relationships in data
- Applying these patterns to make useful predictions
Most data science projects at present are closely aligned with the Machine Learning (ML) modelling approach. Learning, in this sense, means finding patterns in data.

Supervised learning
Supervised learning generates a function, based on assigned labels, that maps inputs to desired outputs.
- Prediction (e.g., linear regression)
- Classification (e.g., decision trees, k-nearest neighbors)
- Time-series forecasting

Unsupervised learning
Unsupervised learning looks for patterns native to a dataset and models them, as in clustering.
- Data mining and knowledge discovery
- Association rules
- Cluster analysis

Reinforcement learning
In reinforcement learning the agent learns from a series of reinforcements (rewards or punishments). For example, the lack of a tip at the end of the journey tells a taxi agent that it did something wrong. Other uses include resource management in computer clusters, robotics, etc.

Introduction to Classification
Classification is the process of grouping data based on characteristics or features. In machine learning it is part of the supervised learning approaches.
- First, the model is trained with a training set containing features and labels of different classes.
- Then it tries to classify the test set based on that learning.
Regression is used for continuous variables, whereas classification is used for discrete variables.

Bias vs Variance
Bias is the difference between the average prediction of the model and the correct value the model is trying to predict. A model with high bias pays very little attention to the training set and oversimplifies the model.
Variance is the variability of the predictions for a given data point; it tells us the spread of the predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before.

Overfitting vs Underfitting
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance.
Overfitting happens when our model captures the noise along with the underlying pattern in the data. These models have low bias and high variance.

Bias Variance Trade-off
The trade-off is a good balance without overfitting or underfitting the data. Ensemble methods can be used to:
- Decrease variance (bagging algorithms)
- Decrease bias (boosting algorithms)
- Improve predictions (stacking)

Classification algorithms
Examples of ML classification algorithms:
- Linear classifiers: Logistic Regression, Naive Bayes Classifier
- K-Nearest Neighbors (KNN)
- Decision Trees
- Boosting algorithms: Gradient Boost, AdaBoost classifiers
- Bagging algorithms: Random Forest, Extra Trees classifiers
- Support Vector Machines (SVM)
- Stacking and Voting classifiers

Logistic Regression algorithm
Logistic regression models the probability of a discrete outcome given an input variable. The most common form models a binary outcome: something that can take two values such as true/false, yes/no, and so on. Logistic regression uses a loss function based on Maximum Likelihood Estimation (MLE), which is built on conditional probability. Maximum likelihood estimation is a method that determines values for the parameters of a model. If the predicted probability is greater than 0.5, the prediction is classified as class 1.

Logistic Regression and log odds
Odds are often stated as wins to losses (wins : losses); e.g., a one-to-ten chance of winning is stated as 1 : 10. Given the probability of success (p) predicted by the logistic regression model, we can convert it to the odds of success as the probability of success divided by the probability of not success:
Odds of Success = p / (1 - p)

Why Maximum Likelihood?
Our goal in logistic regression is to draw the best-fitting S-curve for the given data points. In logistic regression we transform the y-axis from probabilities to log(odds):
Log(Odds of Success) = log(p / (1 - p))
We cannot use ordinary least squares to find the best-fitting line because the residuals can be infinitely large in the positive and negative directions.

Naive Bayes algorithm
Naive Bayes relies on Bayes' theorem. In Bayesian classification the model generates the probability of a label given some observed features, i.e., P(L | features). The posterior probability is the product of the prior probability and the likelihood. To classify, we compute the ratio of the posterior probabilities for each label.
The algorithm operates under a couple of key assumptions, earning the title "naive":
- Predictors are conditionally independent.
- All features contribute equally to the outcome.
Despite the naive assumption, the algorithm is well suited to text classification.

Types of Naive Bayes algorithm
- Gaussian Naive Bayes (GaussianNB): a variant used with Gaussian distributions and continuous variables. The model is fitted by finding the mean and standard deviation of each class.
- Multinomial Naive Bayes (MultinomialNB): assumes the features come from multinomial distributions. This variant is useful with discrete data and is typically applied in natural language processing use cases.
- Bernoulli Naive Bayes (BernoulliNB): another variant of the Naive Bayes classifier, used with Boolean variables.

KNN algorithm
During the training phase, the KNN (K-Nearest Neighbors) algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a distance metric such as Euclidean, Manhattan or Minkowski distance, and identifies the K nearest neighbors to the input data point based on those distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point.
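As a concrete illustration of the classification case, here is a minimal sketch, assuming scikit-learn is available; the feature vectors and labels below are invented purely for illustration. The regression case, described next, works the same way with KNeighborsRegressor.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]   # toy feature vectors
    y_train = [0, 0, 1, 1]                                        # toy class labels

    # k = 3 neighbors with the default (Euclidean) distance metric
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)            # "training" simply stores the dataset

    print(knn.predict([[1.2, 1.9]]))     # majority vote among the 3 nearest neighbors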
For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input.

KNN distance metrics
- Minkowski distance: a metric intended for real-valued vector spaces. Minkowski distance is calculated in a normed vector space, which means the distances can be represented as vectors that have a length, and lengths cannot be negative.
- Manhattan distance: the Manhattan distance between two vectors or points is the L1 norm of their difference, i.e., the sum of the absolute differences of their Cartesian coordinates.
- Euclidean distance: the most widely used distance and the default metric that Python's scikit-learn library uses for K-Nearest Neighbors. It measures the true straight-line distance between two points in Euclidean space.
- Cosine distance: used mainly to calculate the similarity between two vectors. It is measured by the cosine of the angle between the two vectors and determines whether they point in the same direction.
- Hamming distance: a metric for comparing two binary data strings; it is the number of bit positions in which the two strings differ. The Hamming distance method looks at the whole data and finds, position by position, where the data points are similar and dissimilar.
- Jaccard distance: the Jaccard approach compares two data sets to find how many 1-to-1 matches occur in comparison to the total number of entries in either set (the intersection relative to the union).

Decision Tree algorithm
A supervised machine learning algorithm that uses a set of rules to make decisions, similarly to how humans make decisions. Decision trees can perform both classification and regression tasks. The intuition is to use the dataset features to create yes/no questions and continually split the dataset until each data point belongs to a single class. Decision trees can handle missing values in the dataset by putting them in a separate node.

Decision tree criteria
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. Gini impurity is used when building decision trees to determine how the features of a dataset should split nodes to form the tree.
Entropy is a measure of chaos within a node; chaos, in the context of decision trees, means a node where all classes are equally present in the data. A skewed distribution has low entropy, whereas a distribution where events have equal probability has larger entropy.
Information gain calculates the reduction in entropy (or surprise) from transforming a dataset in some way. It is commonly used when constructing decision trees from a training dataset, by selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification. Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable.

Information Gain formula
E(S) is the total entropy for the set S.
v stands for the categories of the node variable (A). Sv is the number of observations with category v of the node variable (A). S is the total number of observations. E(Sv) is the entropy of the subset of observations having category v of the node variable. Putting these together:
Gain(S, A) = E(S) - sum over v of (Sv / S) * E(Sv)

Random Forest algorithm
A random forest consists of many individual decision trees that operate as an ensemble. Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes the model's prediction. Bagging, also known as bootstrap aggregation, chooses a random sample/random subset from the entire data set. The steps involved in the Random Forest algorithm are:
- A subset of data points and a subset of features are selected for constructing each decision tree.
- Individual decision trees are constructed for each sample, and each generates an output.
- The results are combined by majority voting for classification or by averaging for regression.

Gradient Boosting algorithm
The idea behind this algorithm is to build models sequentially, where each subsequent model tries to reduce the errors of the previous model (unlike random forest). The objective is to minimize a loss function, e.g., mean squared error (MSE) or log-likelihood, by adding weak learners.

AdaBoost algorithm (Adaptive Boosting)
The most common estimator used with AdaBoost is a decision tree with one level, i.e., a decision tree with only one split. Such trees are also called decision stumps. Decision stumps are like the trees in a random forest, but not "fully grown". The steps are:
1. A decision stump is made on top of the training data based on the weighted samples. Initially, all samples are given equal weights.
2. We create a decision stump for each variable and see how well each stump classifies samples to their target classes.
3. More weight is assigned to the incorrectly classified samples so that they are classified correctly in the next decision stump.
4. Reiterate from Step 2 until all the data points have been correctly classified or a maximum number of iterations is reached.

Extra Trees classifier
Extra Trees is an ensemble ML approach that trains numerous decision trees and aggregates the results from the group of decision trees to output a prediction. There are, however, a few differences between Extra Trees and Random Forest. In terms of computational cost, Extra Trees is much faster than Random Forest, because Extra Trees randomly selects the value at which to split features instead of searching for the optimum split as Random Forest does. Random Forest uses bagging to select different variations of the training data to ensure the decision trees are sufficiently different.

Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. The steps followed in SVM are:
- Plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
- Perform classification by finding the optimal hyper-plane that differentiates the classes.
Support vectors are simply the coordinates of individual observations, and the hyper-plane is a form of SVM visualization. The SVM classifier is a frontier that best segregates the two classes (a hyper-plane/line).
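Before walking through the hyper-plane selection scenarios, here is a minimal sketch of fitting a linear SVM classifier, assuming scikit-learn; the toy points and labels are invented for illustration.

    from sklearn.svm import SVC

    # Toy 2-D points for two classes (invented for illustration)
    X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
    y = [0, 0, 0, 1, 1, 1]

    # A linear kernel searches for the separating hyper-plane with the largest margin
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)

    print(clf.support_vectors_)   # the observations that define the margin
    print(clf.predict([[4, 4]]))  # which side of the hyper-plane a new point falls on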
Support Vector Machine (SVM): choosing the hyper-plane
Identify the right hyper-plane (Scenario 1): here we have three hyper-planes (A, B, and C) and need to identify the right one to classify stars and circles. The thumb rule is: select the hyper-plane that segregates the two classes better.
Identify the right hyper-plane (Scenario 2): here we again have three hyper-planes (A, B, and C), and all segregate the classes well. How can we identify the right one? The margin for hyper-plane C is higher than for both A and B, so C is the right hyper-plane. Another reason for selecting the hyper-plane with the higher margin is robustness.
Identify the right hyper-plane (Scenario 3): use the rules discussed above. SVM selects the hyper-plane that classifies the classes accurately prior to maximizing the margin. Hyper-plane B has a classification error while A classifies the points correctly, so A is the right hyper-plane.

Kernel: Radial Basis Function (RBF)
A kernel function is a mathematical function that converts a low-dimensional input space into a higher-dimensional space. Kernels allow us to apply linear discriminants on non-linear manifolds, which can lead to higher accuracy and robustness than traditional linear models alone. The RBF (Radial Basis Function) kernel can act like a combination of polynomial kernels of different degrees, projecting non-linearly separable data into a higher-dimensional space so that it can be separated by a hyper-plane. The RBF kernel can be written as:
K(X1, X2) = exp(-||X1 - X2||^2 / (2 * sigma^2))
Here ||X1 - X2||^2 is the squared Euclidean distance and sigma is a parameter that can be used to tune the equation.

Voting and Stacking classifiers
A Voting Classifier trains an ensemble of numerous models and predicts an output (class) based on the class receiving the most support. It simply aggregates the findings of each classifier passed into the Voting Classifier and predicts the output class based on the majority of votes. Hard voting uses the predicted class labels from the different classifiers, while soft voting uses their predicted probabilities.
Stacking is an ensemble learning technique that combines multiple classification models via a meta-classifier. The individual classification models are trained on the complete training set; then the meta-classifier is fitted on the outputs of the individual classification models in the ensemble. The meta-classifier can be trained either on the predicted class labels or on the predicted probabilities of the ensemble.

Regression and Prediction
Regression is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. It usually refers to linear regression.
The goal of regression is to fit a line through the data points. Regression models are typically fit by the method of least squares; the goal is to have minimum error.
Regression is used for explanation, prediction and forecasting. Fitting a straight line through the data points to indicate the relationship gives the slope of the line and the y-intercept. For any value of X, the model provides the predicted Y value.
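To make this concrete, here is a minimal sketch of fitting a least-squares line and predicting Y for a new X, assuming scikit-learn; the numbers are invented for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy paired data (invented for illustration); X must be 2-D for scikit-learn
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    model = LinearRegression()        # ordinary least squares
    model.fit(X, y)

    print(model.coef_[0])             # slope of the fitted line
    print(model.intercept_)           # y-intercept
    print(model.predict([[6]]))       # predicted Y for X = 6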
Example use cases:
- Anomaly detection: points far away from the regression line
- Housing price estimation

Regression algorithms
Examples of ML regression algorithms:
- Linear Regression
- Linear Regression with Polynomial Features
- Lasso Regression
- Ridge Regression
- Elastic Net Regression

Linear Regression
Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. Simple linear regression calculators use a "least squares" method to discover the best-fit line for a set of paired data. The most common method for fitting a regression line is the method of least squares, which calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies exactly on the fitted line, its vertical deviation is 0).

The Theory of Heteroscedasticity
Once a regression model has been fitted to a group of data, examining the residuals (the deviations of the observed values from the fitted line) allows the modeller to investigate the validity of the assumption that a linear relationship exists. Plotting the residuals on the y-axis against the explanatory variable on the x-axis reveals any possible non-linear relationship among the variables, or may alert the modeller to investigate lurking variables.
Heteroscedasticity is a systematic change in the spread of the residuals over the range of measured values. It is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population with constant variance. Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values.

Polynomial Regression
Polynomial regression is a form of linear regression, but because of a non-linear relationship between the dependent and independent variables we add polynomial terms to the linear regression:
y = a0 + a1*x1 + a2*x1^2 + ... + an*x1^n
The degree to use is a hyperparameter, and we need to choose it wisely: is the model overfitting?

How to overcome overfitting with regression: reduce model complexity
To reduce the complexity of the model we can use stepwise regression (forward or backward selection), but that way we would not be able to tell anything about the removed variables' effect on the response. Stepwise regression is the step-by-step iterative construction of a regression model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration.
- Forward selection begins with no variables in the model, tests each variable as it is added to the model, then keeps those that are deemed most statistically significant.
- Backward elimination starts with a set of independent variables, deletes one at a time, then tests whether the removed variable is statistically significant.
- Bidirectional elimination is a combination of the first two methods, testing which variables should be included or excluded.

How to overcome overfitting with regression: regularization
Regularization is a regression technique that limits, regulates or shrinks the estimated coefficients towards zero.
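As a brief illustration of this shrinkage (the specific L1 and L2 penalties are defined just below), here is a minimal sketch, assuming scikit-learn, that compares the coefficients of an unregularized polynomial fit with a ridge-regularized one; the noisy data is generated purely for illustration.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.pipeline import make_pipeline

    # Toy noisy data (invented for illustration)
    rng = np.random.RandomState(0)
    X = np.linspace(0, 1, 20).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

    # High-degree polynomial fit with no penalty vs. the same features with an L2 penalty
    plain = make_pipeline(PolynomialFeatures(degree=10), LinearRegression()).fit(X, y)
    ridge = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0)).fit(X, y)

    # The ridge coefficients are shrunk towards zero compared with the unregularized fit
    print(np.abs(plain.named_steps["linearregression"].coef_).max())
    print(np.abs(ridge.named_steps["ridge"].coef_).max())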
Regularization does not encourage the learning of overly complex or flexible models, and so it avoids the risk of overfitting. It is one of the ways to improve how our model works on unseen data, by ignoring the less important features. Regularization minimizes the validation loss and tries to improve the accuracy of the model. It avoids overfitting by adding a penalty to a model with high variance, thereby shrinking the beta coefficients towards zero. You can apply Lasso Regression, Ridge Regression or Elastic Net Regression.

What is Lasso regularization (L1)?
Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds L1 as the penalty: L1 is the sum of the absolute values of the beta coefficients.

What is Ridge regularization (L2)?
Ridge adds L2 as the penalty: L2 is the sum of the squares of the magnitudes of the beta coefficients.

What is Elastic Net regression (L1 and L2)?
Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models. The elastic net method improves on lasso's limitations: where lasso selects only a few variables for high-dimensional data, the elastic net procedure provides for the inclusion of "n" variables until saturation. Where the variables form highly correlated groups, lasso tends to choose one variable from such a group and ignore the rest entirely.

Clustering
Clustering is an unsupervised learning method that learns patterns in unlabeled input. Finding structure is the main point of clustering: it is the task of dividing data points into groups based on similarities. Within a cluster, the distance between neighboring points is typically smaller than the distance between points of different clusters.
Common clustering algorithms:
- K-means / K-means++ clustering (centroid-based models)
- Agglomerative clustering (connectivity-based models)
- DBSCAN clustering (density-based models)
- Gaussian mixture models (distribution-based models)

Hierarchical clustering
Hierarchical clustering, also known as connectivity-based clustering, is based on the principle that every object is connected to its neighbors depending on their proximity distance (degree of relationship). The clusters are represented in extensive hierarchical structures separated by the maximum distance required to connect the cluster parts, and are drawn as dendrograms. Similar data objects, with minimal distance between them, fall in the same cluster, while dissimilar objects fall in different clusters.

Clustering distance metrics
The decision to merge two clusters is taken on the basis of the closeness of those clusters. There are multiple metrics for deciding the closeness of two clusters:
- Euclidean distance: the usual straight-line distance between the two vectors (the 2-norm).
- Minkowski distance: the p-norm, the pth root of the sum of the pth powers of the differences of the components.
- Manhattan distance: the absolute distance between the two vectors (the 1-norm).

The linkage function
The linkage function tells you how to measure the distance between clusters.
- Single linkage looks at all the pairwise distances between the items in the two clusters and takes the minimum as the distance between the clusters.
- Complete linkage, which is more popular, takes the maximum distance.
- Average linkage takes the average of the pairwise distances, which turns out to be fairly similar to complete linkage.
- Centroid linkage sounds the same as average linkage, but instead of using the average distance it creates a new item that is the average of all the individual items and then uses the distance between these averages.

K-means algorithm
K-means is a widely used clustering technique that partitions data into k clusters, with k pre-defined by the user. It iteratively assigns data points to the nearest centroid and recalculates the centroids until convergence. K-means is efficient and effective for data with numerical attributes. The steps are:
1. Specify the desired number of clusters K.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids, and repeat the re-assignment and re-computation until convergence.

Choosing the optimal value of K
- Elbow method: it works by finding the within-cluster sum of squares (WCSS), i.e., the sum of the squared distances between the points in a cluster and the cluster centroid. The elbow graph shows the WCSS values on the y-axis against the different values of K on the x-axis.
- Silhouette score: it assesses the appropriateness of clustering results by providing a quantitative measure of how well-defined and distinct the clusters are. The silhouette score quantifies how well a data point fits into its assigned cluster and how distinct it is from the other clusters.

Dimensionality Reduction
What are dimensions? In the context of data analysis and machine learning, dimensions refer to the features, columns or attributes of the data. For instance, in a dataset of houses, the dimensions could include the house's price, size, number of bedrooms, location, and so on.
What is the curse of dimensionality? As we add more dimensions to our dataset, the volume of the space increases exponentially, which means the data becomes sparse. Think of it this way: if you have a line (1D), it is easy to fill it with a few points. If you have a square (2D), you need more points to cover the area. With a cube (3D) you need even more points to fill the space. This extends to higher dimensions, making the data extremely sparse.

Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a dimensionality reduction method often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and they make analysis much easier and faster for machine learning algorithms, with no extraneous variables to process.
STEP 1: STANDARDIZATION. The aim of this step is to standardize the range of the continuous initial variables so that each of them contributes equally to the analysis.
STEP 2: COVARIANCE MATRIX COMPUTATION. The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see whether there is any relationship between them.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS. Eigenvectors and eigenvalues are the linear algebra concepts we need to compute from the covariance matrix in order to determine the principal components of the data.

t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is an unsupervised non-linear dimensionality reduction technique for data exploration and for visualizing high-dimensional data. Non-linear dimensionality reduction means that the algorithm allows us to separate data that cannot be separated by a straight line.

PCA vs t-SNE
PCA (Principal Component Analysis) is a linear technique that works best with data that has a linear structure. It seeks to identify the underlying principal components in the data by projecting onto lower dimensions while maximizing the retained variance and preserving large pairwise distances. t-SNE, by contrast, is a non-linear technique that focuses on preserving the pairwise similarities between data points in a lower-dimensional space. t-SNE is concerned with preserving small pairwise distances, whereas PCA focuses on maintaining large pairwise distances.

Any questions so far? Any comments?

Supporting materials / References
http://www.alanturing.net/turing_archive/pages/Reference%20Articles/TheTuringTest.html
https://www.businessinsider.com/artificial-intelligence