CART Decision Trees and Derivatives Quiz
46 Questions

Questions and Answers

What does the acronym CART stand for?

  • Classification And Regression Trees (correct)
  • Cumulative And Regression Trees
  • Classification Algorithm for Regression Trees
  • Categorical And Regression Trees

What does the directional derivative of f at θ, denoted Du f (θ), primarily measure?

  • The change in f when moving from θ to θ + hu. (correct)
  • The curvature of f at θ.
  • The maximum value of f at θ.
  • The overall function growth rate of f.

CART trees can only be binary trees and cannot be converted from m-ary trees.

False (B)

The partial derivative of f with respect to θk represents the directional derivative in the direction of the unit vector ek.

True (A)

What is the primary output variable y in the context of predicting the creditworthiness of individuals?

0 or 1

In a decision tree, internal nodes correspond to tests of the form IF (xj < _____), where tr is the threshold value.

tr

What does the notation ∇f (θ) represent?

The gradient of the function f at point θ.

The example partial derivative of f with respect to θ1 is ___?

2θ1 + 5θ2

Which algorithm is used by CART to determine the decision variables and threshold values?

Greedy Algorithm (C)

Which is NOT a characteristic of the directional derivative?

It can be defined for points that are minimum points. (A)

The process of classification using a CART tree begins at the leaves of the tree.

False (B)

Match the following components of a decision tree with their functions:

  • Internal Nodes = Correspond to tests of attributes
  • Leaves = Indicate predictions for each region
  • Root Node = First decision point in the tree
  • Branches = Show the outcome of tests

Higher-order derivatives are beneficial in solving minimization problems in machine learning.

False (B)

What does the algorithm do to minimize the cost function G in CART?

Partitions the dataset into subsets

Calculate the components of the gradient for the function f (θ1, θ2) = θ1^2 - 5θ1 + θ2^3 + 10.

∇f = [2θ1 - 5, 3θ2^2]
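
As a quick sanity check on this answer, the analytic gradient can be compared with central finite differences. A minimal NumPy sketch (the test point is arbitrary):

```python
import numpy as np

def f(theta):
    t1, t2 = theta
    return t1**2 - 5*t1 + t2**3 + 10

def grad_analytic(theta):
    t1, t2 = theta
    return np.array([2*t1 - 5, 3*t2**2])

def grad_numeric(theta, h=1e-6):
    # Central differences: (f(θ + h·e_k) - f(θ - h·e_k)) / (2h) for each k.
    g = np.zeros(len(theta))
    for k in range(len(theta)):
        e = np.zeros(len(theta))
        e[k] = h
        g[k] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([1.0, 2.0])
print(grad_analytic(theta))  # [-3. 12.]
print(grad_numeric(theta))   # ≈ [-3. 12.]
```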

Match the following terms with their definitions:

  • Directional Derivative = Change of f when moving from θ to θ + hu
  • Partial Derivative = Derivative with respect to one variable, others constant
  • Gradient = Vector of all partial derivatives of f
  • Minimum Point = A point where f does not decrease in any direction

What approach is considered when dealing with categorical features with more than two values?

Sending all examples with one value down the left branch (A)

Decision trees are inherently stable and do not show high variance.

False (B)

What process is employed to reduce the error of a classifier after a decision tree has been created?

Pruning

Examples of algorithms that are considered stable include __________ and __________.

Logistic regression, k-NN

Match the following terms with their meanings:

  • Pruning = Reducing a decision tree by merging nodes
  • One-hot encoding = Converting categorical variables into binary values
  • High variance = Sensitivity to changes in input
  • Dynamic programming = A method to optimize decision tree construction

What happens when a decision tree is allowed to grow until only a few examples are in the same leaf?

It leads to overfitting. (C)

Generalized Optimal Sparse Decision Trees (GOSDT) aim to build less interpretable models compared to traditional decision trees.

False (B)

What is the significance of using threshold values in continuous variables?

To calculate impurity at split points

What is the primary risk of choosing too large a number of weak models M in AdaBoost?

Overfitting the data (A)

AdaBoost is inherently stochastic, similar to Random Forests.

False (B)

What is the ideal range of leaves (J) for small decision trees to perform well in AdaBoost?

4 to 10

The loss function used in AdaBoost is known as the __________ loss.

exponential

Match the following characteristics with their descriptions:

  • Deterministic method = Does not randomize during training
  • Overfitting = Model performs poorly on unseen data
  • Base function = A component like a tree used in learning
  • LogitBoost = Uses logistic regression loss instead of exponential loss

What weight method is used in AdaBoost for weak classifiers that are CART decision trees?

Scaling calculations of Gini impurity (B)

In the iterative process of AdaBoost, each model depends on the previous one: g_m(x) = g_{m-1}(x) + β_m b(x; γ_m). This dependency reflects __________ modeling.

sequential
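
A minimal sketch of this sequential update, assuming labels in {-1, +1} and depth-1 CART stumps as the weak learners; the function names and the use of scikit-learn's DecisionTreeClassifier are illustrative choices, not prescribed by the material:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Minimal AdaBoost sketch: depth-1 CART stumps, labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # example weights
    stumps, betas = [], []
    for _ in range(M):
        b = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = b.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # weighted training error
        beta = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w = w * np.exp(-beta * y * pred)           # up-weight misclassified points
        w = w / w.sum()
        stumps.append(b)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    # g_M(x) = sum_m beta_m * b_m(x): each term is independent of the others,
    # so prediction parallelizes over m even though training was sequential.
    g = sum(beta * b.predict(X) for b, beta in zip(stumps, betas))
    return np.sign(g)
```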

Predictions made by AdaBoost can be easily parallelized.

True (A)

What is an eigenvector of matrix C associated with?

A corresponding eigenvalue (D)

The largest eigenvalue should be chosen to maximize variance.

True (A)

What happens to the coordinates of points after projecting into the space spanned by the largest eigenvectors?

The coordinates become a combination of the projections onto the selected eigenvectors.

To maximize variance for the dimension of the subspace q, we project X into the space spanned by the ____ eigenvectors.

largest

When performing noise filtering and lossy compression, which part of the eigenvectors are omitted?

The contribution from eigenvectors q + 1, ..., p (C)

To select a suitable value for q, one should consider the ratio of eigenvalues.

True (A)

What visual indicator might suggest an optimal choice for q in eigenvalue selection?

A 'kink' or change in slope in the graph of eigenvalues.

What does W(C) measure in clustering?

Dispersion of points within clusters (C)

Maximizing B(C) is equivalent to minimizing W(C).

True (A)

What is the primary distance metric used in the k-means algorithm?

Euclidean distance

The number of points in group k is denoted as _____ .

Nk

Which step is NOT part of the K-means algorithm?

Calculate the total distance from all points to centroids (A)

In K-means, a local minimum of W(C) is reached through iterative _____ .

assignment

There are approximately K^n different partitions in the clustering problem.

False (B)

Flashcards

Directional Derivative

Rate of change of a function in a specific direction.

Du f(θ)

Directional derivative of function f at point θ in direction u.

Partial Derivative

Directional derivative in direction of a coordinate axis.

∂f(θ)/∂θk

Partial derivative of f at θ with respect to θk.

Gradient of f

Vector containing all partial derivatives of the function, specifying the direction of greatest increase.

∇f(θ)

Gradient of function f at point θ

Higher-order derivatives

Derivatives obtained by differentiating a function more than once.

Minimization problems in machine learning

Finds the input that produces the lowest possible value for a function.

CART decision trees

A type of decision tree used for classification and regression, where binary trees split the input space into regions to make predictions.

Internal nodes

Nodes in a decision tree that involve a test (e.g., if 'x' is less than 'someValue').

Threshold value (tr)

The value used in a decision tree test to create a split point.

Input space regions

The areas defined by the splits in a decision tree that contain similar data points.

Leaf nodes

The final nodes in a decision tree; each one provides the prediction for the data points that reach it.

Greedy algorithms

Algorithms that make the best local decision at each step without considering the overall effect, like the CART algorithm used for building decision trees.

Decision tree for classification

A decision tree used to predict a category or class label.

Cost function (G) (impurity)

A function used in the CART algorithm to evaluate how well a split separates the data based on a specific metric.
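
As an illustration of this cost, a minimal sketch of the greedy split search using Gini impurity, with thresholds taken as midpoints of adjacent sorted values (see the Continuous Feature Splitting card below); the helper names and toy data are invented:

```python
import numpy as np

def gini(y):
    # Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def split_cost(x, y, t):
    # Weighted impurity G of splitting feature values x at threshold t.
    left, right = y[x < t], y[x >= t]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
# Midpoint thresholds guarantee both sides of every candidate split are non-empty.
thresholds = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2
best_t = min(thresholds, key=lambda t: split_cost(x, y, t))
print(best_t)  # 2.5 -- this split separates the two classes perfectly
```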

Categorical Feature Splitting

For categorical features with more than two values, a split sends all examples that share a given value down the same branch.

Continuous Feature Splitting

For continuous features, splits are determined by threshold values calculated from the midpoint between adjacent sorted values.

Decision Tree Overfitting

If a decision tree is allowed to grow unbounded, it can fit the training data too closely and generalize poorly.

Decision Tree Pruning

Technique to reduce the complexity of a decision tree by merging nodes to improve generalization ability and reduce error.

Decision Tree Instability

Small changes in input values can significantly impact the structure of the resulting tree.

Greedy Decision Tree Algorithms

Algorithms that make locally optimal decisions at each step while constructing the tree to improve performance.

Dimensionality Reduction

Technique for reducing the number of features in a dataset to improve the speed and effectiveness of the decision tree model.

Generalized Optimal Sparse Decision Trees (GOSDT)

A class of decision tree construction methods designed for seeking optimal or near-optimal trees utilizing techniques such as dynamic programming for speed improvements.

AdaBoost parameter M

A parameter in AdaBoost that determines the number of weak learners in the ensemble. It needs to be chosen through cross-validation.

AdaBoost overfitting

If the parameter M is too large, AdaBoost can overfit the data, unlike Random Forests and Extra Trees.

AdaBoost sequential training

AdaBoost's weak learners depend on each other in training, which makes parallel training difficult compared to methods like Bagging.

AdaBoost parallelizable predictions

AdaBoost predictions can be easily parallelized, despite sequential training.

AdaBoost and tree classifiers

AdaBoost frequently uses tree-based weak learners. Small trees with 4-10 leaves are particularly effective.

AdaBoost's determinism

AdaBoost is deterministic, unlike some methods like Random Forests and Extra Trees.

AdaBoost weights and Gini impurity

If weak learners are CART decision trees, AdaBoost incorporates weights by modifying Gini impurity calculations.

AdaBoost loss function

AdaBoost uses an exponential loss function, which can be sensitive to outliers and mislabeled data points.

Within-Cluster Scatter

A measure of how dispersed data points are within their respective clusters.

Between-Cluster Scatter

A measure of how dispersed the cluster centers are from one another.

Total Scatter

The overall dispersion of data points, encompassing both within-cluster and between-cluster scatter.

K-Means Clustering

An algorithm that aims to partition data points into K distinct clusters by minimizing the within-cluster scatter.

Centroid

The average location of all data points within a cluster.

Euclidean Distance

A measure of the straight-line distance between two points in space.

Local Minimum

A point in the within-cluster scatter function where the value is lower than its immediate neighbors, but not necessarily the lowest possible value.

K-Means Algorithm Steps

  1. Initialize random centroids for each of the K clusters.
  2. Iteratively assign data points to the nearest centroid.
  3. Recalculate centroids based on assigned data points.
  Repeat steps 2-3 until convergence.
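
A minimal NumPy sketch of these steps; it assumes no cluster becomes empty during the iteration:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # step 1
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):          # converged
            break
        centroids = new_centroids
    return labels, centroids
```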

Eigenvector

A vector that, when multiplied by a matrix, results in a scaled version of itself. The scaling factor is the eigenvalue.

Eigenvalue

The scaling factor that results from multiplying an eigenvector by a matrix.

What is variance in PCA?

In Principal Component Analysis (PCA), variance measures how much the data spreads out along a particular direction u. It equals uᵀCu, where C is the covariance matrix of the (centered) data.

Why choose largest eigenvalue?

In PCA, we choose the eigenvector corresponding to the largest eigenvalue because it maximizes variance, capturing the most significant data variation.

What is the purpose of projecting data in PCA?

Projecting data onto a lower-dimensional subspace spanned by the top eigenvectors reduces dimensionality while preserving the most important variations in the data.

How does noise filtering work in PCA?

Noise filtering in PCA is achieved by omitting the contributions from eigenvectors corresponding to smaller eigenvalues. These eigenvectors capture less significant variations, often representing noise.

What is the purpose of the 'kink' in an eigenvalue graph?

The 'kink' in an eigenvalue graph indicates a point where further dimensionality reduction might not significantly improve the representation of the data. It suggests a suitable value for q, the number of principal components to keep.

What is the ratio Pq/Pp used for?

The ratio Pq/Pp is a measure of how much variance is captured by the top q eigenvectors compared to the total variance captured by all p eigenvectors. It provides a way to quantify the effectiveness of dimensionality reduction.
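
Assuming Pq denotes the sum of the q largest eigenvalues (and Pp the sum of all p of them), a minimal sketch of using this ratio to pick q:

```python
import numpy as np

eigvals = np.array([5.2, 2.1, 0.4, 0.2, 0.1])   # sorted, largest first
ratio = np.cumsum(eigvals) / eigvals.sum()       # P_q / P_p for q = 1, ..., p
print(ratio)                                     # [0.65 0.9125 0.9625 0.9875 1.]
# e.g. keep the smallest q with ratio[q - 1] >= 0.95, giving q = 3 here.
```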

Study Notes

Linear Algebra Review

  • A p-vector is a vector with p elements (x₁, x₂, ..., xₚ). Column vectors (p x 1 matrices) are assumed unless otherwise specified.
  • eₖ is a p-vector with all elements zero except the k-th element, which is one.
  • 1 is a p-vector where all elements are one.
  • The inner product (dot product) of p-vectors a and b is a ⋅ b = a₁b₁ + a₂b₂ + ... + aₚbₚ = Σⱼaⱼbⱼ.
  • The notation aᵀ indicates that vector a has been transposed, which results in a row vector.
  • The norm of vector x is a measure of its length. The 2-norm (Euclidean norm) is ||x||₂ = √(x₁² + x₂² + ... + xₚ²) = √Σⱼxⱼ².
  • The 1-norm is ||x||₁ = |x₁| + |x₂| + ... + |xₚ| = Σⱼ|xⱼ|.
  • Vectors a and b are perpendicular if a ⋅ b = 0.
  • The projection of vector a onto vector b is ((a ⋅ b) / ||b||²) b (see the NumPy sketch after this list).
  • Vectors x₁, ..., xₖ are linearly independent if a₁x₁ + ... + aₖxₖ = 0 holds only when a₁ = ... = aₖ = 0.
  • A matrix is a table of numbers with n rows and m columns.
  • Aᵢⱼ represents the element in row i, column j of a matrix A.
  • The transpose of matrix A (denoted as Aᵀ) is the matrix where rows and columns are swapped.
  • A matrix A is symmetric if A = Aᵀ.
  • I is the identity matrix, with 1s on the diagonal and 0s elsewhere.
  • The rank of a matrix is the number of linearly independent columns in the matrix.
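
A short NumPy sketch of several of these definitions; the example vectors and matrix are arbitrary:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

print(a @ b)                      # inner product: 1*3 + 2*0 + 2*4 = 11
print(np.linalg.norm(a))          # 2-norm: sqrt(1 + 4 + 4) = 3
print(np.linalg.norm(a, 1))       # 1-norm: |1| + |2| + |2| = 5
proj = (a @ b) / (b @ b) * b      # projection of a onto b
print(proj)                       # (11/25) * [3, 0, 4]

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.matrix_rank(A))   # 1: the columns are linearly dependent
```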

Supervised Learning

  • In supervised learning, we have a dataset X = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾)}. x⁽ⁱ⁾ are input variables (features) and y⁽ⁱ⁾ are output variables.
  • Input variables are often vectors (x⁽ⁱ⁾ ∈ Rᵖ) and output variables are often scalars (y⁽ⁱ⁾ ∈ R).
  • Example: Predicting apartment prices in Reykjavík—x is the size in m² and y is the price in million ISK.
  • Example: Predicting heart attacks—x is patient data (age, maximum pulse, blood pressure) and y is whether patient had a heart attack (0 or 1).
  • Example: Image classification—x is a bitmap image and y is a class from a set of possible categories (dog, cat, car, etc.).

Linear Regression

  • If there are two input variables, x = (x₁, x₂), and a linear relationship is assumed between input and output variables, then the function f is of the form f(x) = θ₀ + θ₁x₁ + θ₂x₂.
  • When the number of input variables is p, the model is f(x) = θ₀x₀ + θ₁x₁ + ... + θₚxₚ = Σⱼθⱼxⱼ, where x₀ = 1 for simplicity.
  • The cost function J(θ) = (½)Σᵢ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)² is minimized to find suitable parameters θ.
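
A minimal sketch that minimizes J(θ) by least squares, with the x₀ = 1 column prepended as above; the toy data are invented for illustration:

```python
import numpy as np

X = np.array([[50.0], [70.0], [90.0], [120.0]])   # e.g. size in m²
y = np.array([30.0, 42.0, 55.0, 71.0])            # e.g. price in million ISK
Xb = np.hstack([np.ones((len(X), 1)), X])         # prepend the x₀ = 1 column
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # minimizes Σᵢ (f(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
print(theta)                                      # [θ₀, θ₁]
print(Xb @ theta)                                 # predictions f(x⁽ⁱ⁾)
```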

Function Minimization

  • Learning algorithms often find parameters θ that maximize or minimize an objective function.
  • The objective function measures how well the model fits the data.

Functions of a Single Variable

  • The derivative of a function f(θ) at θ indicates how quickly f changes near θ.
  • A necessary condition for a local minimum is that the derivative is zero.

Functions of Multiple Variables

  • A directional derivative of a function f(θ₁, θ₂,... θₚ) at point θ indicates how f changes when moving along a line from θ.
  • A partial derivative of f at θ = (θ₁, θ₂... θₚ) with respect to θₖ is the directional derivative of f in the direction of the unit vector eₖ (in the k-th coordinate axis.)
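
A minimal finite-difference sketch of the directional derivative; the test function and points are arbitrary examples:

```python
import numpy as np

def directional_derivative(f, theta, u, h=1e-6):
    # Estimate Du f(θ) = lim (f(θ + h·u) - f(θ)) / h for a unit vector u.
    u = u / np.linalg.norm(u)
    return (f(theta + h * u) - f(theta)) / h

f = lambda t: t[0]**2 + t[1]**2
theta = np.array([1.0, 1.0])
e1 = np.array([1.0, 0.0])   # direction e₁ gives the partial derivative ∂f/∂θ₁ = 2θ₁
print(directional_derivative(f, theta, e1))                     # ≈ 2.0
print(directional_derivative(f, theta, np.array([1.0, 1.0])))   # ≈ 2√2 ≈ 2.83
```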

Nearest Neighbor Classifiers

  • A nearest neighbor classifier assigns a point x to the class that is most frequent among its k nearest neighbors in the dataset.
  • Euclidean distance is often used to calculate the distance between points.
  • This method is simple but can be computationally expensive for large datasets.
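
A minimal sketch of the majority-vote rule with Euclidean distance; the function name and toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # distance from x to every example
    nearest = np.argsort(d)[:k]               # indices of the k closest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))   # 1
```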

Logistic Regression Classifier

  • The output value is mapped into the interval [0, 1] by a sigmoid function.
  • Optimization is used to derive suitable parameters θ.
  • Error metrics in logistic regression differ from those used in linear regression.
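
A minimal sketch of the sigmoid output; the parameter values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # Squash the linear score θᵀx into [0, 1]; classify as 1 if the value ≥ 0.5.
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.2])
x = np.array([1.0, 0.3])          # x₀ = 1 for the intercept term
print(predict_proba(theta, x))    # ≈ 0.535
```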

Evaluating Model Quality and Choosing a Model

  • Mean Squared Error (MSE): Measures average squared error between predicted values f(x⁽ⁱ⁾) and actual values y⁽ⁱ⁾. A low MSE on the training data does not guarantee a good model.
  • Independent dataset error estimation: Split the data into training and testing sets to evaluate a model's performance on unseen data.

Error Estimation from Independent Dataset

  • Randomly split the dataset X into training (Xtrain) and testing (Xtest) sets.
  • Compute the MSEtest on the test data.
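
A minimal sketch of this procedure; fit and predict are hypothetical placeholders for any regression model's training and prediction routines:

```python
import numpy as np

def train_test_mse(X, y, fit, predict, test_frac=0.3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))             # random split of the dataset
    n_test = int(test_frac * len(y))
    test, train = idx[:n_test], idx[n_test:]
    model = fit(X[train], y[train])           # train on Xtrain only
    residuals = predict(model, X[test]) - y[test]
    return np.mean(residuals**2)              # MSEtest on unseen data
```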

Model Selection

  • Evaluate the model's error by iterating over different hyperparameter values and choosing the one that yields the lowest average error on a test dataset.

Performance Metrics

  • Root Mean Squared Error (RMSE) is the square root of MSE. It has the same units as the output variable (y).

Classification

  • Accuracy in classification: Percentage of correctly classified examples.
  • True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN) are categories for assessing classification errors.
  • Other metrics might be required if types of errors have different costs or class distributions are uneven.

Class Imbalance

  • Class imbalance occurs when the number of examples differs significantly between categories.
  • Modifications of the classifier or the objective function might be necessary to address this.

Clustering

  • The task in clustering is to partition the examples in X into groups.
  • The distance function d between examples is important.
  • K-means is a common clustering algorithm.

Principal Component Analysis (PCA)

  • PCA finds a low-dimensional subspace in which the data approximately lies.
  • Projection directions are chosen to maximize the variance of the projected data.
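
A minimal sketch of this idea via eigendecomposition of the sample covariance matrix; the centering step and the use of eigh for a symmetric matrix are standard choices, not dictated by the notes:

```python
import numpy as np

def pca(X, q):
    Xc = X - X.mean(axis=0)                  # center the data
    C = Xc.T @ Xc / (len(X) - 1)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    V = eigvecs[:, order[:q]]                # the q largest eigenvectors
    return Xc @ V, eigvals[order]            # projected coordinates, spectrum
```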

Non-negative Matrix Factorization (NMF)

  • Approximates a given (non-negative) data matrix with the product of two non-negative matrices (W, H).
  • Can be useful for analyzing structure present in the data.
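
A minimal sketch using scikit-learn's NMF implementation; the data and hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.random.default_rng(0).random((6, 4))    # non-negative data matrix
model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(X)                     # 6 × 2, non-negative
H = model.components_                          # 2 × 4, non-negative
print(np.linalg.norm(X - W @ H))               # reconstruction error ‖X − WH‖
```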

Recommender Systems

  • Recommend items that users might be interested in, based on user reviews or ratings.
  • Typically built on a matrix of user/item interaction patterns.

Reinforcement Learning

  • The agent interacts with an environment (e.g. game, trading algorithms, robotics).
  • The goal is to maximize the total reward obtained over a period of time, e.g. from playing many rounds of a game.
  • A (Q-) function quantifies how good it would be to perform a given action in a state.
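
As one concrete instance, a minimal tabular Q-learning update; the state/action counts and parameter values are invented for illustration:

```python
import numpy as np

# Q-learning update: Q[s, a] ← Q[s, a] + α · (r + γ · max_a' Q[s', a'] − Q[s, a])
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9            # learning rate and discount factor

def q_update(s, a, r, s_next):
    # One update after taking action a in state s, earning reward r.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

q_update(0, 1, r=1.0, s_next=3)    # example transition
print(Q[0, 1])                     # 0.1
```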

Neural Networks

  • Multilayer perceptrons (MLP)
  • Neural networks with multiple layers
  • Output layer with a task-specific activation function (e.g., a sigmoid, as in logistic regression)

Gradient Descent, Variants, and Regularization

  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
  • Momentum: Technique to accelerate convergence
  • Gradient clipping: Technique to limit gradient values to prevent exploding gradients.
  • Regularization functions: To avoid overfitting (e.g., L₁, L₂).
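
A minimal sketch combining three of these ideas: batch gradient descent with momentum and gradient clipping, applied to a function whose gradient includes an L₂ penalty; all constants are arbitrary:

```python
import numpy as np

def gd_momentum(grad, theta0, lr=0.1, mu=0.9, clip=5.0, n_steps=200):
    theta, v = np.asarray(theta0, dtype=float), 0.0
    for _ in range(n_steps):
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm > clip:              # gradient clipping caps the step size
            g = g * (clip / norm)
        v = mu * v - lr * g          # momentum accumulates past gradients
        theta = theta + v
    return theta

# Minimize f(θ) = ‖θ‖² with an L₂ regularization term λ‖θ‖²: gradient 2θ + 2λθ.
lam = 0.1
print(gd_momentum(lambda t: 2 * t + 2 * lam * t, [3.0, -2.0]))  # ≈ [0, 0]
```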

Convolutional Neural Networks

  • CNNs are feedforward neural networks specific to image data.
  • Use convolution, pooling (subsampling) and fully connected layers.
  • Weight sharing is one main advantage of this type of neural network.
  • They are useful for tasks that involve image classification and other pattern recognition tasks in which spatial properties are important.
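
A minimal PyTorch sketch of such a network, assuming 28×28 grayscale inputs and 10 output classes; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 8 filters with shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling: 28×28 → 14×14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # fully connected output layer
)
x = torch.randn(1, 1, 28, 28)                   # one dummy grayscale image
print(model(x).shape)                           # torch.Size([1, 10])
```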

Description

Test your knowledge on CART decision trees and the concepts of directional derivatives. This quiz covers key terms, functions, and algorithms essential for understanding decision tree classification and gradient measures. Perfect for students in data science or machine learning courses.
