Study Guide Lessons 5-8 PDF
Summary
This document is a study guide on several machine learning concepts including model selection, classification algorithms such as decision trees and support vector machines (SVMs), and clustering techniques including k-means and hierarchical clustering.
STUDY GUIDE

(lesson 5)
Model selection - a key ingredient in data analysis for reliable and reproducible statistical inference or prediction.
Variable selection - the process of selecting the best subset of predictors for a given problem and predictive model.
Model selection techniques can be broadly classified as probabilistic measures and resampling methods. Probabilistic measures statistically score candidate models using their performance on the training dataset. Resampling methods estimate the performance of a model using a hold-out or test dataset.
Random train/test split - a resampling method in which the model is evaluated on its ability to generalize and predict on an unseen set of data.
Cross-validation - a very popular resampling method for model selection.
Bootstrap - also a resampling method; it can be performed like a random train/test split or like cross-validation.
AIC (Akaike Information Criterion) - a probabilistic measure that estimates model performance on unseen data.

(lesson 6)
Classification - a data mining task that involves assigning a class label to each instance in a dataset based on its features.
Binary classification - classifying instances into two classes, such as "spam" or "not spam".
Multi-class classification - classifying instances into more than two classes.
Feature selection - identifying the most relevant attributes in the dataset for classification.
Correlation analysis - identifying the correlation between the features in the dataset.
Information gain - a measure of the amount of information that a feature provides for classification.
Principal Component Analysis (PCA) - a technique used to reduce the dimensionality of the dataset.
Model selection - choosing the appropriate classification algorithm for the problem at hand.
Decision trees - a simple yet powerful classification algorithm.
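A decision tree classifies by a sequence of threshold tests on features. A minimal sketch of the smallest possible tree, a single-split "stump"; the feature index, threshold, and class labels here are illustrative assumptions, not values from the course:

```python
def stump_predict(x, feature=0, threshold=2.5, left="not spam", right="spam"):
    """One-node decision tree: test a single feature against a threshold.

    feature, threshold, and the class labels are made-up example values.
    """
    return left if x[feature] <= threshold else right

# A full decision tree chains such tests: each internal node routes the
# instance down one branch until a leaf assigns the class label.
```

For example, `stump_predict([1.0])` routes the instance to the left class and `stump_predict([3.0])` to the right class.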
Support Vector Machines (SVMs) - a popular classification algorithm used for both linear and nonlinear classification problems. SVMs are based on the concept of the maximum margin: finding the hyperplane that maximizes the distance between the two classes.
Neural networks - a powerful classification algorithm that can learn complex patterns in the data.
Model training - using the selected classification algorithm to learn the patterns in the data.
Model evaluation - assessing the performance of the trained model on a test set, to ensure that the model generalizes well.
Binary - takes only two values, i.e. True or False.
Nominal - more than two outcomes are possible.
Ordinal - values that must have some meaningful order.
Continuous - may take an infinite number of values; stored as a float type.
Discrete - takes a finite number of values.
Mathematical notation - classification builds a function that takes an input feature vector X and predicts its outcome Y. The classifier (or model) is a supervised function; it can also be designed manually based on expert knowledge.
Discriminative - a very basic type of classifier; it determines just one class for each row of data.
Generative - models the distribution of the individual classes and tries to learn the model that generated the data behind the scenes.
The training set is given to a learning algorithm, which derives a classifier. The classifier is then tested on the test set, where all class values are hidden.

(lesson 7)
Clustering (cluster analysis) - a machine learning technique that groups an unlabelled dataset.
Centroid-based method - a type of clustering that divides the data into non-hierarchical groups.
Hierarchical clustering - can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.
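The centroid-based method above can be sketched as one iteration of the classic k-means (Lloyd's) loop: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Points and centroids are plain tuples here, and the data are made up for illustration:

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid
    (squared Euclidean distance)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Move each centroid to the mean of the points assigned to it.
    (This sketch does not handle the empty-cluster edge case.)"""
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids
```

On made-up points `[(0, 0), (0, 1), (10, 10), (10, 11)]` with starting centroids `[(0, 0), (10, 10)]`, `assign` yields labels `[0, 0, 1, 1]` and `update` moves the centroids to `[(0.0, 0.5), (10.0, 10.5)]`; k-means alternates the two steps until the labels stop changing.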
Dendrogram - the tree-like structure created as the dataset is divided into clusters.
Fuzzy clustering - a soft method in which a data object may belong to more than one group or cluster.
k-means - one of the most popular clustering algorithms.
Mean-shift - tries to find the dense areas in the smooth density of data points.
Expectation-Maximization clustering using GMM - can be used as an alternative to k-means, or for cases where k-means can fail.
Agglomerative hierarchical algorithm - performs bottom-up hierarchical clustering.
Affinity Propagation - differs from other clustering algorithms in that it does not require specifying the number of clusters.

(lesson 8)
Decision tree - a non-parametric supervised learning algorithm used for both classification and regression tasks.
ID3 - Ross Quinlan is credited with the development of ID3, which is shorthand for "Iterative Dichotomiser 3."
C4.5 - a later iteration of ID3, also developed by Quinlan.
CART - an abbreviation for "classification and regression trees," introduced by Leo Breiman.
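ID3 picks the splitting feature by information gain: the entropy of the class labels before the split minus the size-weighted entropy of the resulting child nodes. A minimal sketch of that computation (the label lists below are made-up examples):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent_labels, child_splits):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_splits)
    return entropy(parent_labels) - weighted
```

A split that separates `["a", "a", "b", "b"]` into the pure children `["a", "a"]` and `["b", "b"]` has an information gain of 1.0 bit, the maximum possible for a balanced binary label set; ID3 greedily chooses the feature whose split maximizes this quantity at each node.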