CART Decision Trees and Derivatives Quiz
46 Questions
0 Views

Choose a study mode

Play Quiz
Study Flashcards
Spaced Repetition
Chat to lesson

Podcast

Play an AI-generated podcast conversation about this lesson

Questions and Answers

What type of decision trees does CART stand for?

  • Classification And Regression Trees (correct)
  • Cumulative And Regression Trees
  • Classification Algorithm for Regression Trees
  • Categorical And Regression Trees
  • What does the directional derivative of f at θ, denoted Du f (θ), primarily measure?

  • The change in f when moving from θ to θ + hu. (correct)
  • The curvature of f at θ.
  • The maximum value of f at θ.
  • The overall function growth rate of f.
  • CART trees can only be binary trees and cannot be converted from m-trees.

    False

    The partial derivative of f with respect to θk represents the directional derivative in the direction of the unit vector ek.

    <p>True</p> Signup and view all the answers

    What is the primary output variable y in the context of predicting the creditworthiness of individuals?

    <p>0 or 1</p> Signup and view all the answers

    In a decision tree, internal nodes correspond to tests of the form IF (xj < _____), where tr is the threshold value.

    <p>tr</p> Signup and view all the answers

    What does the notation ∇f (θ) represent?

    <p>The gradient of the function f at point θ.</p> Signup and view all the answers

    The example partial derivative of f with respect to θ1 is ___?

    <p>2θ1 + 5θ2</p> Signup and view all the answers

    Which algorithm is used by CART to determine the decision variables and threshold values?

    <p>Greedy Algorithm</p> Signup and view all the answers

    Which is NOT a characteristic of the directional derivative?

    <p>It can be defined for points that are minimum points.</p> Signup and view all the answers

    The process of classification using a CART tree begins at the leaves of the tree.

    <p>False</p> Signup and view all the answers

    Match the following components of a decision tree with their functions:

    <p>Internal Nodes = Correspond to tests of attributes Leaves = Indicate predictions for each region Root Node = First decision point in the tree Branches = Show the outcome of tests</p> Signup and view all the answers

    Higher-order derivatives are beneficial in solving minimization problems in machine learning.

    <p>False</p> Signup and view all the answers

    What does the algorithm do to minimize the cost function G in CART?

    <p>Partitions the dataset into subsets</p> Signup and view all the answers

    Calculate the components of the gradient for the function f (θ1, θ2) = θ1^2 - 5θ1 + θ2^3 + 10.

    <p>∇f = [2θ1 - 5, 3θ2^2]</p> Signup and view all the answers

    Match the following terms with their definitions:

    <p>Directional Derivative = Change of f when moving from θ to θ + hu Partial Derivative = Derivative with respect to one variable, others constant Gradient = Vector of all partial derivatives of f Minimum Point = A point where f does not decrease in any direction</p> Signup and view all the answers

    What approach is considered when dealing with categorical features with more than two values?

    <p>Sending all examples with one value down the left branch</p> Signup and view all the answers

    Decision trees are inherently stable and do not show high variance.

    <p>False</p> Signup and view all the answers

    What process is employed to reduce the error of a classifier after a decision tree has been created?

    <p>Pruning</p> Signup and view all the answers

    Examples of algorithms that are considered stable include __________ and __________.

    <p>Logistic regression, k-NN</p> Signup and view all the answers

    Match the following terms with their meanings:

    <p>Pruning = Reducing a decision tree by merging nodes One-hot encoding = Converting categorical variables into binary values High variance = Sensitivity to changes in input Dynamic programming = A method to optimize decision tree construction</p> Signup and view all the answers

    What happens when a decision tree is allowed to grow until only a few examples are in the same leaf?

    <p>It leads to overfitting.</p> Signup and view all the answers

    Generalized Optimal Sparse Decision Trees (GOSDT) aim to build less interpretable models compared to traditional decision trees.

    <p>False</p> Signup and view all the answers

    What is the significance of using threshold values in continuous variables?

    <p>To calculate impurity at split points</p> Signup and view all the answers

    What is the primary risk of choosing too large a number of weak models M in AdaBoost?

    <p>Overfitting the data</p> Signup and view all the answers

    AdaBoost is inherently stochastic, similar to Random Forests.

    <p>False</p> Signup and view all the answers

    What is the ideal range of leaves (J) for small decision trees to perform well in AdaBoost?

    <p>4 to 10</p> Signup and view all the answers

    The loss function used in AdaBoost is known as the __________ loss.

    <p>exponential</p> Signup and view all the answers

    Match the following characteristics with their descriptions:

    <p>Deterministic method = Does not randomize during training Overfitting = Model performs poorly on unseen data Base function = A component like a tree used in learning LogitBoost = Uses logistic regression loss instead of exponential loss</p> Signup and view all the answers

    What weight method is used in AdaBoost for weak classifiers that are CART decision trees?

    <p>Scaling calculations of Gini impurity</p> Signup and view all the answers

    In the iterative process of AdaBoost, each model is dependent on the previous model, denoted as g_m = g_{m-1}(x) + β_m b(x; γ_m). This dependency portrays __________ modeling.

    <p>sequential</p> Signup and view all the answers

    Predictions made by AdaBoost can be easily parallelized.

    <p>True</p> Signup and view all the answers

    What is an eigenvector of matrix C associated with?

    <p>A corresponding eigenvalue</p> Signup and view all the answers

    The largest eigenvalue should be chosen to maximize variance.

    <p>True</p> Signup and view all the answers

    What happens to the coordinates of points after projecting into the space spanned by the largest eigenvectors?

    <p>The coordinates become a combination of the projections onto the selected eigenvectors.</p> Signup and view all the answers

    To maximize variance for the dimension of the subspace q, we project X into the space spanned by the ____ eigenvectors.

    <p>largest</p> Signup and view all the answers

    When performing noise filtering and lossy compression, which part of the eigenvectors are omitted?

    <p>The contribution from eigenvectors q + 1,..., p</p> Signup and view all the answers

    To select a suitable value for q, one should consider the ratio of eigenvalues.

    <p>True</p> Signup and view all the answers

    What visual indicator might suggest an optimal choice for q in eigenvalue selection?

    <p>A 'kink' or change in slope in the graph of eigenvalues.</p> Signup and view all the answers

    What does W(C) measure in clustering?

    <p>Dispersion of points within clusters</p> Signup and view all the answers

    Maximizing B(C) is equivalent to minimizing W(C).

    <p>True</p> Signup and view all the answers

    What is the primary distance metric used in the k-means algorithm?

    <p>Euclidean distance</p> Signup and view all the answers

    The number of points in group k is denoted as _____ .

    <p>Nk</p> Signup and view all the answers

    Which step is NOT part of the K-means algorithm?

    <p>Calculate the total distance from all points to centroids</p> Signup and view all the answers

    In K-means, a local minimum of W(C) is reached through iterative _____ .

    <p>assignment</p> Signup and view all the answers

    There are approximately K^n different partitions in the clustering problem.

    <p>False</p> Signup and view all the answers

    Study Notes

    Linear Algebra Review

    • A p-vector is a vector with p elements (x₁, x₂, ..., xₚ). Column vectors (p x 1 matrices) are assumed unless otherwise specified.
    • eₖ is a p-vector with all elements zero except the k-th element, which is one.
    • 1 is a p-vector where all elements are one.
    • The inner product (dot product) of p-vectors a and b is a ⋅ b = a₁b₁ + a₂b₂ + ... + aₚbₚ = Σⱼaⱼbⱼ.
    • The notation aᵀ indicates that vector a has been transposed, which results in a row vector.
    • The norm of vector x is a measure of its length. The 2-norm (Euclidean norm) is ||x||₂ = √(x₁² + x₂² + ... + xₚ²) = √Σⱼxⱼ².
    • The 1-norm is ||x||₁ = |x₁| + |x₂| + ... + |xₚ| = Σⱼ|xⱼ|.
    • Vectors a and b are perpendicular if a ⋅ b = 0.
    • The projection of vector a onto vector b is a ⋅ b / ||b||².
    • Vectors x₁, ..., xₖ are linearly independent if a₁x₁ + ... + aₖxₖ = 0 if and only if a₁ = ... = aₖ = 0.
    • A matrix is a table of numbers with n rows and m columns.
    • Aij represents the element in row i, column j of a matrix A.
    • The transpose of matrix A (denoted as Aᵀ) is the matrix where rows and columns are swapped.
    • A matrix A is symmetric if A = Aᵀ.
    • I is the identity matrix, with 1s on the diagonal and 0s elsewhere.
    • The rank of a matrix is the number of linearly independent columns in the matrix.

    Supervised Learning

    • In supervised learning, we have a dataset X = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾)}. x⁽ⁱ⁾ are input variables (features) and y⁽ⁱ⁾ are output variables.
    • Input variables are often vectors (x⁽ⁱ⁾ ∈ Rᵖ) and output variables are often scalars (y⁽ⁱ⁾ ∈ R).
    • Example: Predicting apartment prices in Reykjavík—x is the size in m² and y is the price in million ISK.
    • Example: Predicting heart attacks—x is patient data (age, maximum pulse, blood pressure) and y is whether patient had a heart attack (0 or 1).
    • Example: Image classification—x is a bitmap image and y is a class from a set of possible categories (dog, cat, car, etc.).

    Linear Regression

    • If there are two input variables, x = (x₁, x₂), and a linear relationship is assumed between input and output variables, then the function f is of the form f(x) = θ₀ + θ₁x₁ + θ₂x₂.
    • When the number of input variables is p, the model is f(x) = θ₀x₀ + θ₁x₁ + ... + θₚxₚ = Σⱼθⱼxⱼ, where x₀ = 1 for simplicity.
    • The cost function J(θ) = (½)Σᵢ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)² is minimized to find suitable parameters θ.

    Function Minimization

    • Learning algorithms often find parameters θ that maximize or minimize an objective function.
    • The objective function measures how well the model fits the data.

    Functions of a Single Variable

    • The derivative of a function f(θ) at θ indicates how quickly f changes near θ.
    • A necessary condition for a local minimum is that the derivative is zero.

    Functions of Multiple Variables

    • A directional derivative of a function f(θ₁, θ₂,... θₚ) at point θ indicates how f changes when moving along a line from θ.
    • A partial derivative of f at θ = (θ₁, θ₂... θₚ) with respect to θₖ is the directional derivative of f in the direction of the unit vector eₖ (in the k-th coordinate axis.)

    Nearest Neighbor Classifiers

    • A nearest neighbor classifier assigns a point x to the class that is most frequent among its k nearest neighbors in the dataset.
    • Euclidean distance is often used to calculate the distance between points.
    • This method is simple but can be computationally expensive for large datasets.

    Logistic Regression Classifier

    • The output value is in the interval [0, 1] using a sigmoid function.
    • Optimization is used to derive suitable parameters θ.
    • Error rate in logistic regression are different than in linear regression.

    Evaluating Model Quality and Choosing a Model

    • Mean Squared Error (MSE): Measures average squared error between predicted values f(x⁽ⁱ⁾) and actual values y⁽ⁱ⁾. A low MSE on the training data does not guarantee a good model.
    • Independent dataset error estimation: Split the data into training and testing sets to evaluate a model's performance on unseen data.

    Error Estimation from Independent Dataset

    • Randomly split the dataset X into training (Xtrain) and testing (Xtest) sets.
    • Compute the MSEtest on the test data.

    Model Selection

    • Evaluate the model's error by iterating over different hyperparameter values and choosing the one that yields the lowest average error on a test dataset.

    Performance Metrics

    • Root Mean Squared Error (RMSE) is the square root of MSE. It has the same units as the output variable (y).

    Classification

    • Accuracy in classification: Percentage of correctly classified examples.
    • True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN) are categories for assessing classification errors.
    • Other metrics might be required if types of errors have different costs or class distributions are uneven.

    Class Imbalance

    • Class imbalance occurs when the number of examples differs significantly between categories.
    • Modifications of the classifier or the objective function might be necessary to address this.

    Clustering

    • The task in clustering is to partition the examples into groups in X.
    • The distance function d between examples is important.
    • K-means is a common clustering algorithm.

    Principal Component Analysis (PCA)

    • PCA finds a subspace where the data lies approximately.
    • The variances are maximized by choosing appropriate directions.

    Non-negative Matrix Factorization (NMF)

    • Approximates a given (non-negative) data matrix with the product of two non-negative matrices (W, H).
    • Can be useful for analyzing structure present in the data.

    Recommender Systems

    • Recommending items users might be interested in using user reviews.
    • Use of a matrix or user/item interaction pattern.

    Reinforcement Learning

    • The agent interacts with an environment (e.g. game, trading algorithms, robotics).
    • The goal is to maximize the total reward obtained over a period of time, e.g from playing many rounds of a game.
    • A (Q-) function quantifies how good it would be to perform a given action in a state.

    Neural Networks

    • Multilayer perceptrons (MLP)
    • Neural networks with multiple layers
    • Output layer with specific activation functions (logistic regression)

    Gradient Descent, Variants, and Regularization

    • Batch gradient descent
    • Stochastic gradient descent
    • Mini-batch gradient descent
    • Momentum: Technique to accelerate convergence
    • Gradient clipping: Technique to limit gradient values to prevent exploding gradients.
    • Regularization functions: To avoid overfitting (e.g., L₁, L₂).

    Convolutional Neural Networks

    • CNNs are feedforward neural networks specific to image data.
    • Use convolution, pooling (subsampling) and fully connected layers.
    • Weight sharing is one main advantage of this type of neural network.
    • They are useful for tasks that involve image classification and other pattern recognition tasks in which spatial properties are important.

    Studying That Suits You

    Use AI to generate personalized quizzes and flashcards to suit your learning preferences.

    Quiz Team

    Related Documents

    Machine Learning Notes PDF

    Description

    Test your knowledge on CART decision trees and the concepts of directional derivatives. This quiz covers key terms, functions, and algorithms essential for understanding decision tree classification and gradient measures. Perfect for students in data science or machine learning courses.

    More Like This

    OLD CART Pain Assessment Quiz
    10 questions
    Decision Tree Algorithms Overview
    13 questions
    Decision Tree Classification Algorithm
    5 questions
    Use Quizgecko on...
    Browser
    Browser