Questions and Answers
What are explanatory models built for?
test causal hypotheses
What type of models are based on underlying causal relationships between theoretical constructs?
Supervised learning allows us to make predictions about unseen data.
True
Feature Scaling is a method used to transform the range of independent variables or features of data. It leads to quicker convergence of optimization algorithms such as __________ descent.
Match the libraries with their descriptions:
What does Dimensionality Reduction refer to?
Which method of Dimensionality Reduction involves imposing penalties on parameter values for feature selection?
Principal Component Analysis (PCA) finds uncorrelated features that explain most of the variance in high-dimensional data.
________ is an extension of Principal Component Analysis (PCA) that allows for nonlinear dimensionality reduction.
Match the Dimensionality Reduction method with its description:
What is information gain in decision trees?
What are the measures of impurity used in decision trees?
Tree pruning is a technique used to ___ the complexity of the final model and help prevent overfitting.
Random Forests are ensembles of decision trees.
Match the following hyperparameters with their descriptions:
What does KNN stand for?
What is the key difference between L1 and L2 regularizations?
What is the purpose of an n-gram model?
What does a 2-gram model represent?
In binary classification, what is required for a decision to be made by majority voting?
What are regular expressions used for?
Match the topic modeling term with its description:
What is the main difference between Hard Voting and Soft Voting in ensemble learning?
What does AdaBoost stand for?
DBSCAN is a clustering method based on the density of data points.
_______ is a preliminary step before applying more formal statistical techniques/analytics and can be crucial for understanding data sets.
Which type of regression analysis uses several independent variables?
Mean Squared Error (MSE) is heavily influenced by outliers.
What does R-squared (R2) measure in regression analysis?
An observation or data point that falls within the expected range of a dataset is called an ________.
Match the regression regularization method with its description:
Which ensemble learning algorithm uses multiple decision trees to make predictions?
Decision tree regression models require feature scaling.
Study Notes
Introduction to Analytics
- Explanatory models: built to test causal hypotheses, based on underlying causal relationships between theoretical constructs
- Predictive models: generate accurate predictions of new observations, integrate knowledge from existing theoretical models in a less formal way
Supervised vs Unsupervised Learning
- Supervised Learning:
- Build a model from labeled training data to make predictions about unseen or future data
- Examples: classification, regression
- Unsupervised Learning:
- Discover structure in unlabeled dataset
- Examples: clustering, topic modeling
Classification Algorithms 1
- Types of Classification:
- Binary Classification: predict categorical class labels with two classes (e.g., true/false)
- Multi-class Classification: predict categorical class labels with more than two classes (e.g., buy/sell/hold)
- Perceptron Learning Algorithm:
- Single-layer linear classifier
- Operates as a single-layer neural network
- Weights are updated using the errors computed from the output of the linear activation function and the true class labels (see the sketch after this list)
- Hyperparameters:
- Set by the analyst, not optimized from the data (e.g., learning rate, number of epochs)
- Feature Scaling:
- Method used to transform the range of independent variables or features of data
- Examples: standardization
- Python Libraries:
- Pandas: data manipulation and analysis
- NumPy: working with arrays
- Scikit-learn: machine learning library with various algorithms and utilities
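To tie these pieces together, here is a minimal sketch that trains scikit-learn's Perceptron on standardized features; the dataset, learning rate, and epoch count are illustrative assumptions rather than recommended settings.

```python
# Illustrative sketch: perceptron training with feature scaling (values are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler().fit(X_train)      # learn mean/std on training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

ppn = Perceptron(eta0=0.1, max_iter=40, random_state=1)  # eta0 = learning rate (hyperparameter)
ppn.fit(X_train_std, y_train)
print("Test accuracy:", ppn.score(X_test_std, y_test))
```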
Classification Algorithms 2
- Overfitting and Underfitting:
- Overfitting: model captures "patterns" in the training data that do not repeat in new data
- Underfitting: model cannot capture the underlying trend of the data
- Bias-Variance Tradeoff:
- Balancing model complexity against generalization: overly simple models risk high bias (underfitting), overly complex models risk high variance (overfitting)
- Regularization: technique to prevent overfitting by adding a penalty on the larger magnitudes of model parameters
- Examples: L1 and L2 regularization
- Cost Function:
- Formed by taking the negative of the log likelihood, so that maximising the likelihood becomes minimising the cost
- Used to optimize model parameters
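Written out for a binary classifier with label y⁽ⁱ⁾ and predicted probability ŷ⁽ⁱ⁾, the negative log likelihood cost takes the following standard form; the L2 term shown is one common choice of regularization penalty, not the only one.

```latex
J(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right] + \lambda \lVert \mathbf{w} \rVert_2^2
```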
Classification Algorithms 3
- Logistic Regression:
- Classification algorithm for binary classification tasks
- Models the probability that a given input belongs to a particular class
- Maximum Likelihood Estimation:
- Estimating parameters of a probability distribution by maximizing a likelihood function
- Used to optimize model parameters
- Support Vector Machine (SVM):
- Aims to maximize the margin between the decision boundary and the closest data points from each class
- Less sensitive to outliers than other classification algorithms
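As a minimal sketch of both classifiers, the snippet below fits scikit-learn's LogisticRegression and a linear SVC on synthetic data; the dataset and the C value are assumptions for illustration.

```python
# Illustrative sketch: logistic regression vs. a linear SVM (dataset and parameters are assumptions)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:3]))           # class membership probabilities

svm = SVC(kernel="linear", C=1.0).fit(X, y)  # C controls the softness of the margin
print(svm.predict(X[:3]))
```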
Classification Algorithms 4
- Decision Tree Learning:
- Supervised learning algorithm that models decisions and their possible consequences as a tree-like structure
- Decision trees can be prone to overfitting
- Maximising Information Gain:
- Measure of the difference between the impurity of the parent node and the sum of the child node impurities
- Used to evaluate the quality of a split in the decision tree
- Measures of Impurity:
- Entropy: quantifies the amount of uncertainty or disorder in a system
- Classification error: measures the proportion of misclassified instances in a dataset
- Gini impurity: calculates the probability of incorrect classification by randomly assigning a label to a randomly chosen sample
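The three impurity measures can be written directly as functions of the class probabilities at a node; the sketch below is a from-scratch illustration, not scikit-learn's internal implementation.

```python
# Illustrative sketch: impurity measures for a decision tree node
import numpy as np

def entropy(p):
    """Entropy of class probabilities p (0 log 0 treated as 0)."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity: probability of misclassifying a randomly labelled sample."""
    return 1.0 - np.sum(np.asarray(p) ** 2)

def classification_error(p):
    """Proportion misclassified when always predicting the majority class."""
    return 1.0 - np.max(p)

probs = [0.5, 0.5]  # a maximally impure binary node
print(entropy(probs), gini(probs), classification_error(probs))  # 1.0 0.5 0.5
```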
Random Forest
- Ensemble method that combines multiple decision tree models
- Uses bootstrap sampling and random feature selection to reduce correlation between trees
- Output prediction is the class selected by most trees
- Hyperparameters: number of trees, maximum depth, bootstrap, criterion
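A minimal sketch of a random forest configured with the hyperparameters listed above; all values are illustrative assumptions.

```python
# Illustrative sketch: a random forest with the listed hyperparameters (values are assumptions)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,   # number of trees
    max_depth=4,        # maximum depth of each tree
    bootstrap=True,     # sample training data with replacement
    criterion="gini",   # impurity measure used for splits
    random_state=1,
)
forest.fit(X, y)
print(forest.predict(X[:3]))  # prediction is the class selected by most trees
```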
Data Preprocessing
- Dealing with Missing Data:
- Deleting rows with missing values
- Imputing missing values
- Categorical Data:
- Nominal features: no ordering possible (e.g., color)
- Ordinal features: categorical values that can be ordered or sorted (e.g., shirt size)
- Encoding Categorical Variables:
- Nominal categorical variables: one-hot encoding
- Ordinal categorical variables: mapping into integers
- Feature Scaling:
- Normalization: scales features to a specified range (usually [0, 1])
- Standardization: scales features to have a mean of 0 and a standard deviation of 1
Transformation Formula
- The transformation formula is z = (X − µ) / σ, where µ is the mean and σ is the standard deviation of the feature.
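The steps above can be combined in a few lines of pandas and scikit-learn; the toy DataFrame and the size mapping below are assumptions for illustration.

```python
# Illustrative sketch: missing values, categorical encoding, and scaling (toy data is an assumption)
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "color": ["red", "blue", None],   # nominal feature with a missing value
    "size": ["S", "M", "L"],          # ordinal feature
    "price": [10.0, 12.0, 14.0],
})

df = df.dropna()                                          # or impute, e.g. df.fillna(...)
df["size"] = df["size"].map({"S": 1, "M": 2, "L": 3})     # ordinal -> integers
df = pd.get_dummies(df, columns=["color"])                # nominal -> one-hot encoding

df[["price"]] = StandardScaler().fit_transform(df[["price"]])  # z = (X - mu) / sigma
print(df)
```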
Feature Selection
- L1 regularisation:
- Produces sparse models (only a subset of coefficients are non-zero)
- Robust to outliers (penalises absolute value of coefficients)
- Can lead to multiple solutions
- L2 regularisation:
- Does not produce sparse models (all coefficients are shrunk towards zero)
- Not robust to outliers
- Leads to one solution
- Key differences between L1 and L2 regularisation:
- Sparsity
- Number of solutions
- Robustness to outliers
- Computational difficulty
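The sparsity difference is easy to see empirically; the sketch below fits L1- and L2-penalised logistic regressions on synthetic data (the C value and dataset are assumptions) and counts non-zero coefficients.

```python
# Illustrative sketch: L1 vs. L2 penalties and coefficient sparsity (parameters are assumptions)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# L1 drives many coefficients exactly to zero (a sparse model); L2 only shrinks them
print("L1 non-zero coefficients:", (l1.coef_ != 0).sum())
print("L2 non-zero coefficients:", (l2.coef_ != 0).sum())
```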
Sequential Feature Selection
- Greedy algorithms:
- Make locally optimal choices at each step
- Do not guarantee global optimum
- Sequential Backward Selection (SBS):
- Start with the full feature set
- Evaluate the model's performance after temporarily removing each feature in turn
- Determine the feature whose removal costs the least performance
- Permanently remove that feature
- Repeat until the desired number of features is reached or the model's performance stops improving
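scikit-learn ships a greedy selector in this spirit; note that its SequentialFeatureSelector scores candidates with cross-validation rather than a single validation score, so it is a close variant of the SBS procedure above rather than an exact match. The estimator and feature count below are assumptions.

```python
# Illustrative sketch: greedy backward feature selection (estimator and counts are assumptions)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
sbs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="backward",   # start from the full set and remove greedily
)
sbs.fit(X, y)
print("Selected feature mask:", sbs.get_support())
```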
Feature Importance with Random Forests
- Feature importance can be measured as the averaged impurity decrease from all decision trees in the forest
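A minimal sketch: after fitting, scikit-learn exposes the averaged impurity decrease per feature as feature_importances_.

```python
# Illustrative sketch: impurity-based feature importance from a fitted random forest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(data.data, data.target)
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```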
Dimensionality Reduction
- High dimensionality:
- Requires large amounts of data
- Computationally expensive
- Methods of dimensionality reduction:
- Regularisation (L1 and L2 penalties)
- Sequential feature selection
- Feature extraction (PCA, LDA, etc.)
- Principal Component Analysis (PCA):
- Finds uncorrelated features that explain most of the variance in high-dimensional data
- Used for dimensionality reduction
- Can be used for linearly separable data
- Linear Discriminant Analysis (LDA):
- Supervised dimensionality reduction technique
- Finds features that optimize class separability
- Kernel Principal Component Analysis (KPCA):
- Extension of PCA for nonlinear dimensionality reduction
- Uses kernel methods to transform data onto a lower-dimensional subspace
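A minimal sketch contrasting linear PCA with kernel PCA on data that is not linearly separable; the RBF kernel and the gamma value are illustrative assumptions.

```python
# Illustrative sketch: linear PCA vs. kernel PCA (kernel and gamma are assumptions)
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_moons(n_samples=100, random_state=1)   # data that is not linearly separable

pca = PCA(n_components=2).fit(X)
print("Variance explained:", pca.explained_variance_ratio_)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)  # RBF kernel for a nonlinear mapping
X_kpca = kpca.fit_transform(X)
print(X_kpca[:3])
```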
Model Evaluation and Hyperparameter Tuning
- Pipelines:
- Combine multiple processing steps into a single estimator
- Advantages: simplicity, reproducibility, code maintenance, and parameter tuning (see the sketch after this list)
- Hyperparameter tuning:
- Grid search:
- Define a grid of hyperparameter values
- Set up a model and grid search tool
- Fit the grid search to the data
- Evaluate results
- Holdout cross-validation:
- Divide dataset into training, validation, and test sets
- Use validation set to tune hyperparameters
- Evaluate final model on test set
- K-fold cross-validation:
- Divide dataset into k folds
- Use k-1 folds for training and 1 fold for validation
- Repeat for each fold
- Average performance metrics
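The pieces above compose naturally: a pipeline bundles scaling and the classifier, and grid search with k-fold cross-validation tunes the hyperparameters. The grid values below are assumptions for illustration.

```python
# Illustrative sketch: a pipeline tuned with grid search over k-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]}  # illustrative grid
search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```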
Model Evaluation Metrics
- Learning curves:
- Plot performance metrics against sample size
- Identify overfitting or underfitting
- Validation curves:
- Plot performance metrics against a hyperparameter value
- Identify optimal hyperparameter value
- Confusion matrix:
- Evaluate classification model performance
- Calculate precision, recall, F1 score, etc.
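A minimal sketch computing a confusion matrix and the derived metrics; the label vectors are made up for illustration.

```python
# Illustrative sketch: confusion matrix and derived metrics (labels are assumptions)
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
```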
Ensemble Methods
- Ensemble methods:
- Combine multiple models to improve performance
- Reduce overfitting and bias
- Techniques:
- Bagging (bootstrap aggregating)
- Boosting (AdaBoost)
- Voting (hard and soft)
- Bagging:
- Reduce overfitting
- Create bootstrap samples
- Combine multiple models
- Adaptive Boosting (AdaBoost):
- Reduce bias and variance
- Focus on hard-to-classify instances
- Update instance weights and learner weights
Clustering Techniques
- K-Means: partitions n items into k clusters, where each item belongs to the cluster with the nearest mean
- K-Means++: an algorithm for choosing the initial cluster centres for the K-Means clustering algorithm
- Hierarchical Trees: a method of cluster analysis that seeks to build a hierarchy of clusters
- Agglomerative (Bottom-Up) Method: starts by assuming each example is a single cluster, merges closest pairs of clusters iteratively until only one cluster remains
- Divisive (Top-Down) Method: starts with one cluster, splits the cluster into smaller clusters iteratively until each cluster contains only one example
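A minimal sketch of k-means with k-means++ initialisation; the blob data and the cluster count are assumptions.

```python
# Illustrative sketch: k-means with k-means++ initialisation (data and k are assumptions)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("Cluster labels:", km.labels_[:10])
print("Cluster centres:", km.cluster_centers_)
```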
Measuring Distance
- Single Linkage Approach: computes distances between the most similar members for each pair of clusters and merges the two clusters for which the distance between the most similar members is the smallest
- Complete Linkage Approach: computes the distance between the most dissimilar members for each pair of clusters and merges the two clusters for which the distance between the most dissimilar members is the smallest
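Both linkage criteria are available in scikit-learn's AgglomerativeClustering; a minimal sketch with assumed blob data:

```python
# Illustrative sketch: single vs. complete linkage in agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

single = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)      # most similar members
complete = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)  # most dissimilar members
print(single.labels_[:10], complete.labels_[:10])
```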
DBSCAN
- identifies clusters in datasets based on the density of data points
- classifies clusters based on the idea that a cluster in a dataset is a high-density area surrounded by a low-density area
- key concepts:
- Core Points: a point is considered a core point if it has a minimum number of points (MinPts) within a given radius (ϵ)
- Border Points: a point that is not a core point but falls within the radius of a core point
- Noise Points: any point that is not a core point or a border point is considered noise or an outlier, not belonging to any cluster
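A minimal sketch; in scikit-learn's DBSCAN, eps plays the role of the radius ϵ and min_samples the role of MinPts, and both values below are illustrative assumptions.

```python
# Illustrative sketch: density-based clustering with DBSCAN (eps/min_samples are assumptions)
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)   # eps ~ radius, min_samples ~ MinPts
print("Cluster labels:", set(db.labels_))    # -1 marks noise points
```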
Regression Analysis
- Simple Regression: models a linear relationship between a target/dependent and one independent variable/feature
- Multiple Regression: models a linear relationship between a target/dependent and more than one independent variable/feature
- Ordinary Least Squares (OLS): a method for estimating the parameters of the linear regression line that minimises the sum of squared vertical distances from the estimated line to the training examples
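A minimal sketch of multiple regression fitted by OLS on synthetic data; the true coefficients (3 and −2) are assumptions chosen so the recovered estimates are easy to check.

```python
# Illustrative sketch: OLS with several independent variables (data is an assumption)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                   # two independent variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)              # minimises the sum of squared residuals
print("Coefficients:", ols.coef_, "Intercept:", ols.intercept_)
```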
Evaluating Regression Performance
- Mean Squared Error (MSE): the average of the squared errors, that is, the average squared difference between the predicted and actual values
- R-Squared (R²): a statistical measure of how close the data are to the fitted regression line
- Residual Plot: a graph that shows the residuals on the vertical axis and the predicted values, or an independent variable, on the horizontal axis
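A minimal sketch computing both metrics; the value vectors are made up for illustration.

```python
# Illustrative sketch: MSE and R-squared for regression predictions (values are assumptions)
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.3, 8.9]

print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```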
Regularisation Methods
- L2 Regularisation - Ridge Regression: adds a penalty equal to the square of the magnitude of coefficients to the loss function, reducing the size of coefficients but keeping all variables in the model
- L1 Regularisation - Lasso Regression: introduces a penalty that is the absolute value of the magnitude of coefficients, which can shrink some coefficients to zero, effectively performing feature selection
- L1 + L2 Regularisation - Elastic Net: combines penalties from both Ridge and Lasso, integrating the benefits of both
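A minimal sketch fitting all three regularised regressions on the same synthetic data; the alpha values are illustrative assumptions.

```python
# Illustrative sketch: Ridge, Lasso, and Elastic Net (alphas and data are assumptions)
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only two features matter

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: zeros out irrelevant coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # L1 + L2 combined

print("Lasso non-zero coefficients:", (lasso.coef_ != 0).sum())
```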
Non-Linear Regression Models
- Polynomial Regression: models complex relationships that do not follow simple linear patterns
- Decision Tree Regression: employed to predict a continuous outcome by learning decision rules inferred from the data features
- Random Forest Regression: an ensemble learning algorithm that utilises multiple decision trees to make predictions by averaging the outputs of the individual trees
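A minimal sketch of polynomial and random forest regression on a quadratic target; note that the tree-based model needs no feature scaling. The data and the polynomial degree are assumptions.

```python
# Illustrative sketch: polynomial and tree-based regression on a nonlinear target
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)   # quadratic relationship

poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)  # no scaling required

print(poly.predict([[2.0]]), forest.predict([[2.0]]))  # both should be near 4
```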