Podcast
Questions and Answers
What type of decision trees does CART stand for?
What type of decision trees does CART stand for?
What does the directional derivative of f at θ, denoted Du f (θ), primarily measure?
What does the directional derivative of f at θ, denoted Du f (θ), primarily measure?
CART trees can only be binary trees and cannot be converted from m-trees.
CART trees can only be binary trees and cannot be converted from m-trees.
False
The partial derivative of f with respect to θk represents the directional derivative in the direction of the unit vector ek.
The partial derivative of f with respect to θk represents the directional derivative in the direction of the unit vector ek.
Signup and view all the answers
What is the primary output variable y in the context of predicting the creditworthiness of individuals?
What is the primary output variable y in the context of predicting the creditworthiness of individuals?
Signup and view all the answers
In a decision tree, internal nodes correspond to tests of the form IF (xj < _____), where tr is the threshold value.
In a decision tree, internal nodes correspond to tests of the form IF (xj < _____), where tr is the threshold value.
Signup and view all the answers
What does the notation ∇f (θ) represent?
What does the notation ∇f (θ) represent?
Signup and view all the answers
The example partial derivative of f with respect to θ1 is ___?
The example partial derivative of f with respect to θ1 is ___?
Signup and view all the answers
Which algorithm is used by CART to determine the decision variables and threshold values?
Which algorithm is used by CART to determine the decision variables and threshold values?
Signup and view all the answers
Which is NOT a characteristic of the directional derivative?
Which is NOT a characteristic of the directional derivative?
Signup and view all the answers
The process of classification using a CART tree begins at the leaves of the tree.
The process of classification using a CART tree begins at the leaves of the tree.
Signup and view all the answers
Match the following components of a decision tree with their functions:
Match the following components of a decision tree with their functions:
Signup and view all the answers
Higher-order derivatives are beneficial in solving minimization problems in machine learning.
Higher-order derivatives are beneficial in solving minimization problems in machine learning.
Signup and view all the answers
What does the algorithm do to minimize the cost function G in CART?
What does the algorithm do to minimize the cost function G in CART?
Signup and view all the answers
Calculate the components of the gradient for the function f (θ1, θ2) = θ1^2 - 5θ1 + θ2^3 + 10.
Calculate the components of the gradient for the function f (θ1, θ2) = θ1^2 - 5θ1 + θ2^3 + 10.
Signup and view all the answers
Match the following terms with their definitions:
Match the following terms with their definitions:
Signup and view all the answers
What approach is considered when dealing with categorical features with more than two values?
What approach is considered when dealing with categorical features with more than two values?
Signup and view all the answers
Decision trees are inherently stable and do not show high variance.
Decision trees are inherently stable and do not show high variance.
Signup and view all the answers
What process is employed to reduce the error of a classifier after a decision tree has been created?
What process is employed to reduce the error of a classifier after a decision tree has been created?
Signup and view all the answers
Examples of algorithms that are considered stable include __________ and __________.
Examples of algorithms that are considered stable include __________ and __________.
Signup and view all the answers
Match the following terms with their meanings:
Match the following terms with their meanings:
Signup and view all the answers
What happens when a decision tree is allowed to grow until only a few examples are in the same leaf?
What happens when a decision tree is allowed to grow until only a few examples are in the same leaf?
Signup and view all the answers
Generalized Optimal Sparse Decision Trees (GOSDT) aim to build less interpretable models compared to traditional decision trees.
Generalized Optimal Sparse Decision Trees (GOSDT) aim to build less interpretable models compared to traditional decision trees.
Signup and view all the answers
What is the significance of using threshold values in continuous variables?
What is the significance of using threshold values in continuous variables?
Signup and view all the answers
What is the primary risk of choosing too large a number of weak models M in AdaBoost?
What is the primary risk of choosing too large a number of weak models M in AdaBoost?
Signup and view all the answers
AdaBoost is inherently stochastic, similar to Random Forests.
AdaBoost is inherently stochastic, similar to Random Forests.
Signup and view all the answers
What is the ideal range of leaves (J) for small decision trees to perform well in AdaBoost?
What is the ideal range of leaves (J) for small decision trees to perform well in AdaBoost?
Signup and view all the answers
The loss function used in AdaBoost is known as the __________ loss.
The loss function used in AdaBoost is known as the __________ loss.
Signup and view all the answers
Match the following characteristics with their descriptions:
Match the following characteristics with their descriptions:
Signup and view all the answers
What weight method is used in AdaBoost for weak classifiers that are CART decision trees?
What weight method is used in AdaBoost for weak classifiers that are CART decision trees?
Signup and view all the answers
In the iterative process of AdaBoost, each model is dependent on the previous model, denoted as g_m = g_{m-1}(x) + β_m b(x; γ_m). This dependency portrays __________ modeling.
In the iterative process of AdaBoost, each model is dependent on the previous model, denoted as g_m = g_{m-1}(x) + β_m b(x; γ_m). This dependency portrays __________ modeling.
Signup and view all the answers
Predictions made by AdaBoost can be easily parallelized.
Predictions made by AdaBoost can be easily parallelized.
Signup and view all the answers
What is an eigenvector of matrix C associated with?
What is an eigenvector of matrix C associated with?
Signup and view all the answers
The largest eigenvalue should be chosen to maximize variance.
The largest eigenvalue should be chosen to maximize variance.
Signup and view all the answers
What happens to the coordinates of points after projecting into the space spanned by the largest eigenvectors?
What happens to the coordinates of points after projecting into the space spanned by the largest eigenvectors?
Signup and view all the answers
To maximize variance for the dimension of the subspace q, we project X into the space spanned by the ____ eigenvectors.
To maximize variance for the dimension of the subspace q, we project X into the space spanned by the ____ eigenvectors.
Signup and view all the answers
When performing noise filtering and lossy compression, which part of the eigenvectors are omitted?
When performing noise filtering and lossy compression, which part of the eigenvectors are omitted?
Signup and view all the answers
To select a suitable value for q, one should consider the ratio of eigenvalues.
To select a suitable value for q, one should consider the ratio of eigenvalues.
Signup and view all the answers
What visual indicator might suggest an optimal choice for q in eigenvalue selection?
What visual indicator might suggest an optimal choice for q in eigenvalue selection?
Signup and view all the answers
What does W(C) measure in clustering?
What does W(C) measure in clustering?
Signup and view all the answers
Maximizing B(C) is equivalent to minimizing W(C).
Maximizing B(C) is equivalent to minimizing W(C).
Signup and view all the answers
What is the primary distance metric used in the k-means algorithm?
What is the primary distance metric used in the k-means algorithm?
Signup and view all the answers
The number of points in group k is denoted as _____ .
The number of points in group k is denoted as _____ .
Signup and view all the answers
Which step is NOT part of the K-means algorithm?
Which step is NOT part of the K-means algorithm?
Signup and view all the answers
In K-means, a local minimum of W(C) is reached through iterative _____ .
In K-means, a local minimum of W(C) is reached through iterative _____ .
Signup and view all the answers
There are approximately K^n different partitions in the clustering problem.
There are approximately K^n different partitions in the clustering problem.
Signup and view all the answers
Study Notes
Linear Algebra Review
- A p-vector is a vector with p elements (x₁, x₂, ..., xₚ). Column vectors (p x 1 matrices) are assumed unless otherwise specified.
- eₖ is a p-vector with all elements zero except the k-th element, which is one.
- 1 is a p-vector where all elements are one.
- The inner product (dot product) of p-vectors a and b is a ⋅ b = a₁b₁ + a₂b₂ + ... + aₚbₚ = Σⱼaⱼbⱼ.
- The notation aᵀ indicates that vector a has been transposed, which results in a row vector.
- The norm of vector x is a measure of its length. The 2-norm (Euclidean norm) is ||x||₂ = √(x₁² + x₂² + ... + xₚ²) = √Σⱼxⱼ².
- The 1-norm is ||x||₁ = |x₁| + |x₂| + ... + |xₚ| = Σⱼ|xⱼ|.
- Vectors a and b are perpendicular if a ⋅ b = 0.
- The projection of vector a onto vector b is a ⋅ b / ||b||².
- Vectors x₁, ..., xₖ are linearly independent if a₁x₁ + ... + aₖxₖ = 0 if and only if a₁ = ... = aₖ = 0.
- A matrix is a table of numbers with n rows and m columns.
- Aij represents the element in row i, column j of a matrix A.
- The transpose of matrix A (denoted as Aᵀ) is the matrix where rows and columns are swapped.
- A matrix A is symmetric if A = Aᵀ.
- I is the identity matrix, with 1s on the diagonal and 0s elsewhere.
- The rank of a matrix is the number of linearly independent columns in the matrix.
Supervised Learning
- In supervised learning, we have a dataset X = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾)}. x⁽ⁱ⁾ are input variables (features) and y⁽ⁱ⁾ are output variables.
- Input variables are often vectors (x⁽ⁱ⁾ ∈ Rᵖ) and output variables are often scalars (y⁽ⁱ⁾ ∈ R).
- Example: Predicting apartment prices in Reykjavík—x is the size in m² and y is the price in million ISK.
- Example: Predicting heart attacks—x is patient data (age, maximum pulse, blood pressure) and y is whether patient had a heart attack (0 or 1).
- Example: Image classification—x is a bitmap image and y is a class from a set of possible categories (dog, cat, car, etc.).
Linear Regression
- If there are two input variables, x = (x₁, x₂), and a linear relationship is assumed between input and output variables, then the function f is of the form f(x) = θ₀ + θ₁x₁ + θ₂x₂.
- When the number of input variables is p, the model is f(x) = θ₀x₀ + θ₁x₁ + ... + θₚxₚ = Σⱼθⱼxⱼ, where x₀ = 1 for simplicity.
- The cost function J(θ) = (½)Σᵢ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)² is minimized to find suitable parameters θ.
Function Minimization
- Learning algorithms often find parameters θ that maximize or minimize an objective function.
- The objective function measures how well the model fits the data.
Functions of a Single Variable
- The derivative of a function f(θ) at θ indicates how quickly f changes near θ.
- A necessary condition for a local minimum is that the derivative is zero.
Functions of Multiple Variables
- A directional derivative of a function f(θ₁, θ₂,... θₚ) at point θ indicates how f changes when moving along a line from θ.
- A partial derivative of f at θ = (θ₁, θ₂... θₚ) with respect to θₖ is the directional derivative of f in the direction of the unit vector eₖ (in the k-th coordinate axis.)
Nearest Neighbor Classifiers
- A nearest neighbor classifier assigns a point x to the class that is most frequent among its k nearest neighbors in the dataset.
- Euclidean distance is often used to calculate the distance between points.
- This method is simple but can be computationally expensive for large datasets.
Logistic Regression Classifier
- The output value is in the interval [0, 1] using a sigmoid function.
- Optimization is used to derive suitable parameters θ.
- Error rate in logistic regression are different than in linear regression.
Evaluating Model Quality and Choosing a Model
- Mean Squared Error (MSE): Measures average squared error between predicted values f(x⁽ⁱ⁾) and actual values y⁽ⁱ⁾. A low MSE on the training data does not guarantee a good model.
- Independent dataset error estimation: Split the data into training and testing sets to evaluate a model's performance on unseen data.
Error Estimation from Independent Dataset
- Randomly split the dataset X into training (Xtrain) and testing (Xtest) sets.
- Compute the MSEtest on the test data.
Model Selection
- Evaluate the model's error by iterating over different hyperparameter values and choosing the one that yields the lowest average error on a test dataset.
Performance Metrics
- Root Mean Squared Error (RMSE) is the square root of MSE. It has the same units as the output variable (y).
Classification
- Accuracy in classification: Percentage of correctly classified examples.
- True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN) are categories for assessing classification errors.
- Other metrics might be required if types of errors have different costs or class distributions are uneven.
Class Imbalance
- Class imbalance occurs when the number of examples differs significantly between categories.
- Modifications of the classifier or the objective function might be necessary to address this.
Clustering
- The task in clustering is to partition the examples into groups in X.
- The distance function d between examples is important.
- K-means is a common clustering algorithm.
Principal Component Analysis (PCA)
- PCA finds a subspace where the data lies approximately.
- The variances are maximized by choosing appropriate directions.
Non-negative Matrix Factorization (NMF)
- Approximates a given (non-negative) data matrix with the product of two non-negative matrices (W, H).
- Can be useful for analyzing structure present in the data.
Recommender Systems
- Recommending items users might be interested in using user reviews.
- Use of a matrix or user/item interaction pattern.
Reinforcement Learning
- The agent interacts with an environment (e.g. game, trading algorithms, robotics).
- The goal is to maximize the total reward obtained over a period of time, e.g from playing many rounds of a game.
- A (Q-) function quantifies how good it would be to perform a given action in a state.
Neural Networks
- Multilayer perceptrons (MLP)
- Neural networks with multiple layers
- Output layer with specific activation functions (logistic regression)
Gradient Descent, Variants, and Regularization
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
- Momentum: Technique to accelerate convergence
- Gradient clipping: Technique to limit gradient values to prevent exploding gradients.
- Regularization functions: To avoid overfitting (e.g., L₁, L₂).
Convolutional Neural Networks
- CNNs are feedforward neural networks specific to image data.
- Use convolution, pooling (subsampling) and fully connected layers.
- Weight sharing is one main advantage of this type of neural network.
- They are useful for tasks that involve image classification and other pattern recognition tasks in which spatial properties are important.
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
Test your knowledge on CART decision trees and the concepts of directional derivatives. This quiz covers key terms, functions, and algorithms essential for understanding decision tree classification and gradient measures. Perfect for students in data science or machine learning courses.