Study Guide Lessons 5-8 PDF
Summary
This document is a study guide on several machine learning concepts including model selection, classification algorithms such as decision trees and support vector machines (SVMs), and clustering techniques including k-means and hierarchical clustering.
STUDY GUIDE

(lesson 5)
Model selection - a key ingredient in data analysis for reliable and reproducible statistical inference or prediction.
Variable selection - the process of selecting the best subset of predictors for a given problem and predictive model.
Model selection techniques can be broadly classified as probabilistic measures and resampling methods. Probabilistic measures statistically score candidate models using their performance on the training dataset. Resampling methods estimate the performance of a model using a hold-out or test dataset.
Random train/test split - a resampling method in which the model is evaluated on its ability to generalize and predict on an unseen set of data.
Cross-validation - a very popular resampling method for model selection.
Bootstrap - also a resampling method; it can be performed like a random train/test split or like cross-validation.
AIC (Akaike Information Criterion) - a probabilistic measure that estimates model performance on unseen data.

(lesson 6)
Classification - a data mining task that involves assigning a class label to each instance in a dataset based on its features.
Binary classification - classifying instances into two classes, such as "spam" or "not spam".
Multi-class classification - classifying instances into more than two classes.
Feature selection - identifying the most relevant attributes in the dataset for classification.
Correlation analysis - identifying the correlation between the features in the dataset.
Information gain - a measure of the amount of information that a feature provides for classification.
Principal Component Analysis (PCA) - a technique used to reduce the dimensionality of the dataset.
Model selection - choosing the appropriate classification algorithm for the problem at hand.
Decision trees - a simple yet powerful classification algorithm.
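A decision tree classifies by a sequence of threshold tests on features. A minimal sketch of the smallest possible tree, a single-split "stump"; the feature index, threshold, and class labels here are illustrative assumptions, not values from the course:

```python
def stump_predict(x, feature=0, threshold=2.5, left="not spam", right="spam"):
    """One-node decision tree: test a single feature against a threshold.

    feature, threshold, and the class labels are made-up example values.
    """
    return left if x[feature] <= threshold else right

# A full decision tree chains such tests: each internal node routes the
# instance down one branch until a leaf assigns the class label.
```

For example, `stump_predict([1.0])` routes the instance to the left class and `stump_predict([3.0])` to the right class.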
Support Vector Machines (SVMs) - a popular classification algorithm used for both linear and nonlinear classification problems. SVMs are based on the concept of the maximum margin: finding the hyperplane that maximizes the distance between the two classes.
Neural networks - a powerful classification algorithm that can learn complex patterns in the data.
Model training - using the selected classification algorithm to learn the patterns in the data.
Model evaluation - assessing the performance of the trained model on a test set, to ensure that the model generalizes well.
Binary - takes only two values, i.e. True or False.
Nominal - more than two outcomes are possible.
Ordinal - values that must have some meaningful order.
Continuous - may take an infinite number of values; stored as a float type.
Discrete - takes a finite number of values.
Mathematical notation - classification builds a function that takes an input feature vector X and predicts its outcome Y. The classifier (or model) is a supervised function; it can also be designed manually based on expert knowledge.
Discriminative - a very basic type of classifier; it determines just one class for each row of data.
Generative - models the distribution of the individual classes and tries to learn the model that generated the data behind the scenes.
The training set is given to a learning algorithm, which derives a classifier. The classifier is then tested on the test set, where all class values are hidden.

(lesson 7)
Clustering (cluster analysis) - a machine learning technique that groups an unlabelled dataset.
Centroid-based method - a type of clustering that divides the data into non-hierarchical groups.
Hierarchical clustering - can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.
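The centroid-based method above can be sketched as one iteration of the classic k-means (Lloyd's) loop: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Points and centroids are plain tuples here, and the data are made up for illustration:

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid
    (squared Euclidean distance)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Move each centroid to the mean of the points assigned to it.
    (This sketch does not handle the empty-cluster edge case.)"""
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids
```

On made-up points `[(0, 0), (0, 1), (10, 10), (10, 11)]` with starting centroids `[(0, 0), (10, 10)]`, `assign` yields labels `[0, 0, 1, 1]` and `update` moves the centroids to `[(0.0, 0.5), (10.0, 10.5)]`; k-means alternates the two steps until the labels stop changing.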
Dendrogram - the tree-like structure created as the dataset is divided into clusters.
Fuzzy clustering - a soft method in which a data object may belong to more than one group or cluster.
k-means - one of the most popular clustering algorithms.
Mean-shift - tries to find the dense areas in the smooth density of data points.
Expectation-Maximization clustering using GMM - can be used as an alternative to k-means, or for cases where k-means can fail.
Agglomerative hierarchical algorithm - performs bottom-up hierarchical clustering.
Affinity Propagation - differs from other clustering algorithms in that it does not require specifying the number of clusters.

(lesson 8)
Decision tree - a non-parametric supervised learning algorithm used for both classification and regression tasks.
ID3 - Ross Quinlan is credited with the development of ID3, which is shorthand for "Iterative Dichotomiser 3."
C4.5 - a later iteration of ID3, also developed by Quinlan.
CART - an abbreviation for "classification and regression trees," introduced by Leo Breiman.
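ID3 picks the splitting feature by information gain: the entropy of the class labels before the split minus the size-weighted entropy of the resulting child nodes. A minimal sketch of that computation (the label lists below are made-up examples):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent_labels, child_splits):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_splits)
    return entropy(parent_labels) - weighted
```

A split that separates `["a", "a", "b", "b"]` into the pure children `["a", "a"]` and `["b", "b"]` has an information gain of 1.0 bit, the maximum possible for a balanced binary label set; ID3 greedily chooses the feature whose split maximizes this quantity at each node.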