Machine Learning CS7641

Questions and Answers

What are the key elements required in breaking down a classification problem?

Instances, concept, target concept, hypotheses, input samples, a candidate hypothesis, and a testing set

What is the main purpose of decision trees in classification learning?

  • To provide a linear classification approach
  • To randomly assign outcomes
  • To map various choices to diverging paths (correct)
  • To create a decision for every possible outcome

According to the ID3 algorithm, the 'information gain' from a particular attribute A can be calculated as the difference between the overall entropy and the weighted sum of the entropies of each subset. The formula is: ______(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v).

Gain

Decision trees for regression problems can rely on information gain to sort attributes.

False

What is the general approach to ensemble learning algorithms?

Learn rules over smaller subsets of the training data, then combine all of the rules into a collective, smarter decision-maker.

What is the reason that taking the average of weak learners trained on subsets of the data can outperform a single learner trained on the entire dataset?

Avoiding overfitting

Ensemble learning algorithms aim to prevent overfitting by learning rules over subsets of data.

True

In boosting, what is the goal of each boosting round?

Find a weak learner that achieves the lowest error.

What is supervised learning?

Supervised learning relies on human input to train a model and involves labeled data.

What is the main difference between classification and regression in supervised learning?

Classification involves mapping complex inputs to labels, while regression maps inputs to numerical values.

What is one of the risks in machine learning related to errors in data?

One risk is errors that can come from sources such as hardware, human mistakes, malicious intent, or unmodeled influences.

Cross-validation is a method used to reduce the risk of overfitting in machine learning.

True

What is the 'Goldilocks zone' in training models?

It is the ideal balance between underfitting and overfitting, where the error in training and cross-validation data is similar.

What is the value of the re-weighting term when the weak learner classifies a sample incorrectly?

f(ε) = exp(−α_t · y_i · H_t(x_i)) = (1 − ε) / ε

What is the value of the re-weighting term when the weak learner classifies a sample correctly?

f(ε) = exp(−α_t · y_i · H_t(x_i)) = ε / (1 − ε)

What is the purpose of boosting in machine learning?

All of the above

Boosting is a method that tends to overfit.

False

What does a support vector machine try to find in the data?

The boundary that will maximize the margin from the nearest data points.

What is the purpose of the classification function f(x) in Support Vector Machines?

To take the sign of the dot product between the support vectors and the new point x for classification purposes.

What is the significance of the Kernel Trick in SVMs?

It allows for linear separation of non-linearly separable data.

The kernel trick involves finding the explicit mapping function Φ in order to apply different kernels in SVMs.

False

In the example provided, the kernel function K(x_i, x_j) = (1 + x_i^T x_j)² is a _________ product in a higher-dimensional space.

dot

What does the choice of K(·) represent in SVMs?

Domain knowledge about the data that can help classify it better.

What is the formula for finding the least squares approximation of the solution in linear regression?

(A^T A)^{-1} A^T y

What is the basic model of an artificial neuron powered by?

Inputs and weights

The perceptron is the foundational building block of neural networks.

True

In the binary AND operation, if x1 = 1 and x2 = 1 with w1 = 1 and w2 = −2, the output is __.

0
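(Worked check: the weighted sum is 1·1 + 1·(−2) = −1, which falls below any non-negative firing threshold, so the perceptron does not fire and outputs 0.)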

Match the following terms with their definitions:

  • Perceptron = The simplest activation function that produces a binary output based on weighted inputs.
  • Bias = An extra input value with a fixed weight that influences computations.
  • Learning Rate = Controls the size of weight adjustments during training.
  • Gradient Descent = An approach that operates on unthresholded summations for adjusting weights.

What is the error metric defined based on the difference between the expected output (y) and the actual output (a)?

E(w) = ½ (y − a)² (the ½ is conventional, so that the derivative below comes out clean)

What is the derivative of the error (E) with respect to a weight (w_i)?

∂E/∂w_i = −(y − a) · x_i

What function is introduced as the activation function that allows using gradient descent?

Sigmoid function
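Putting the last three answers together, here is a minimal sketch (with made-up example values) of one gradient-descent update for a single sigmoid unit; the chain rule adds an a(1 − a) factor to the −(y − a)x_i gradient of the unthresholded unit:

```python
import numpy as np

def sigmoid(s):
    """Smooth, differentiable activation that makes gradient descent possible."""
    return 1.0 / (1.0 + np.exp(-s))

# Made-up example values for illustration.
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # current weights
y = 1.0                          # expected output
eta = 0.1                        # learning rate

a = sigmoid(w @ x)               # actual output of the sigmoid unit
# With E(w) = 1/2 (y - a)^2, the chain rule gives
# dE/dw_i = -(y - a) * a * (1 - a) * x_i for the sigmoid unit.
grad = -(y - a) * a * (1 - a) * x
w = w - eta * grad               # step against the gradient
print(w)
```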

Neural networks prefer simpler explanations over complex ones.

True

Which optimization method allows gradient descent to 'gain speed' when descending steep areas?

Momentum

What is the main downside of kNN according to the text?

kNN gets slow and unwieldy very quickly because it uses the training data directly during querying.

What is the contrasting characteristic of linear regression compared to kNN?

Linear regression calculates a model upfront and makes querying cheap (constant time), while kNN is considered a lazy learner.

What are the two important biases discussed regarding kNN's representation of data?

Preference bias and restriction bias

Which of the following factors affect induction in a learning problem? (Select all that apply)

Complexity of the hypothesis class

A good question in a binary yes-or-no scenario ideally reduces the number of possibilities by half.

True

What is the hypothesis space denoted by 'H' in machine learning?

The hypothesis space 'H' is where the candidate hypothesis 'h' is explored and considered by the learner.

What is the version space in machine learning?

The set of all possible hypotheses consistent with the data

A consistent learner is one that produces the correct result for all of the training samples: _______.

c(x) = h(x)

Training error is the fraction of training examples correctly classified by the candidate hypothesis 'h'.

False

Study Notes

Supervised Learning

• Supervised learning relies on human input (or "supervision") to train a model
• Supervised learning is label-based, and it occurs more often in practice than unsupervised learning
• Supervised learning can be reduced down to function approximation
• An elementary example of supervised learning is a model that "learns" that a dataset represents a function, such as x²

Techniques

• Supervised learning is broken up into two main schools of algorithms: classification and regression
• Classification involves mapping between complex inputs and labels (discrete values)
• Regression involves mapping complex inputs to an arbitrary, often-continuous, numeric value
• Data is everything in machine learning, but it isn't perfect: errors can come from hardware, human mistakes, malicious intent, and unmodeled influences

Classification

• Classification problems require instances, concept, target concept, hypotheses, input samples, and a testing set
• Decision trees are a form of classification learning, which maps various choices to diverging paths that end with a decision
• To create a decision tree, we need to identify pertinent features that describe the concept well
• The "Goldilocks zone" of training is between underfitting and overfitting, where the error across both training data and cross-validation data is relatively similar

Decision Trees

• Decision trees are a representation of our features, and we need to be careful about how accurately we "fit" the training data
• The order in which we apply each feature to our decision tree should be correlated with its ability to reduce our space
• A "best" question is one that divides our data roughly in half
• The ID3 algorithm is a top-down approach to creating a decision tree, which greedily chooses the attribute with the most information gain

Asking Questions: The ID3 Algorithm

• The ID3 algorithm uses information gain to qualify attributes, and the "best attribute" is the one that gives us the maximum information gain
• The algorithm repeats until the labels are correctly classified, and prefers correct decision trees to incorrect ones
• Attributes that give a lot of information are more valuable, and should thus be higher up in the decision tree
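A minimal sketch of the information-gain computation that ID3 uses to rank attributes; the function names and toy data below are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    total = entropy(labels)
    n = len(labels)
    # Partition the labels by the value each example takes on the attribute.
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[attribute], []).append(label)
    weighted = sum((len(sub) / n) * entropy(sub) for sub in subsets.values())
    return total - weighted

# Toy data: ID3 would greedily split on the attribute with the highest gain.
examples = [{"outlook": "sunny"}, {"outlook": "sunny"},
            {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(examples, labels, "outlook"))  # 1.0: a perfect split
```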

Considerations

• The ID3 algorithm has a preference bias towards trees with good splits at the top, and prefers shallower or "shorter" trees
• Asking continuous questions requires binning or discretization to make them Boolean questions

Decision Trees, Continued

• In decision trees, repeating attributes can be acceptable, depending on the attribute
• It makes sense to ask about the same attribute twice down a branch, especially with bucketed continuous values like cost, age, etc.
• Refining buckets as we go further down a branch can be beneficial

Stopping Point

• The ID3 algorithm stops creating the decision tree when all training examples are classified correctly
• This approach may overfit the training set, and on noisy data it can loop forever
• Adopting a termination approach that is a little more general and robust can help avoid overfitting
• Pruning branches whose removal does not incur a large penalty in classification accuracy can be an effective approach

Ensemble Learning

• Ensemble learning combines multiple learners to create a more accurate and robust predictor
• It is powerful when features are weakly indicative of a result on their own but strongly indicative in combination
• The general approach is to learn rules over smaller subsets of the training data and combine them into a collective decision-maker

Bagging

• Bagging (Bootstrap Aggregation) involves creating subsets of the training data by uniformly randomly selecting examples with replacement
• A learner is trained on each subset, and combining the results with a simple average works well
• Bagging helps to avoid overfitting by smoothing out the specifics of each individual learner (see the sketch below)
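As an illustration of why averaging smooths things out, here is a minimal bagging sketch using NumPy; the data, the polynomial degree, and the use of a deliberately overfit-prone polynomial fit as the "weak learner" are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(0, 1, size=200)           # noisy samples of x^2

def fit_poly(x, y, degree=8):
    """One (deliberately overfit-prone) learner: a high-degree polynomial fit."""
    return np.polyfit(x, y, degree)

# Bagging: train each learner on a bootstrap sample (drawn with replacement).
models = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))   # uniform, with replacement
    models.append(fit_poly(x[idx], y[idx]))

# The ensemble prediction is the average of the individual predictions,
# which smooths out the quirks each learner picked up from its subset.
x_test = np.linspace(-3, 3, 5)
bagged = np.mean([np.polyval(m, x_test) for m in models], axis=0)
print(bagged)  # approximately x_test**2
```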

Boosting

• Boosting involves creating a sequence of weak learners that adapt to the errors of the previous learners
• Each weak learner is trained on a weighted version of the training data, with more weight given to examples that were incorrectly classified by the previous learner
• The final classifier is a weighted average of the individual weak learners
• The weights are chosen to minimize the total error of the final classifier
• Boosting can be broken down into a simple loop: construct a distribution, find a weak classifier that minimizes the error, and combine the weak classifiers into a stronger one (sketched below)
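That loop can be written out directly. Below is a schematic AdaBoost-style sketch; taking the weak learners as a fixed list of candidate classifiers (each mapping X to predictions in {−1, +1}) is a simplification of searching a hypothesis space:

```python
import numpy as np

def boost(X, y, weak_learners, rounds=10):
    """Schematic boosting loop. y must be in {-1, +1}; weak_learners is a
    list of candidate classifiers, each mapping X -> predictions in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # start with a uniform distribution
    ensemble = []
    for _ in range(rounds):
        # Find the weak learner with the lowest error under the current distribution.
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        t = int(np.argmin(errors))
        eps = errors[t]
        if eps >= 0.5:                          # no better than chance; stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # better learners get more weight
        # Re-weight via exp(-alpha * y_i * h_t(x_i)): misclassified examples
        # get more attention in the next round.
        D *= np.exp(-alpha * y * weak_learners[t](X))
        D /= D.sum()
        ensemble.append((alpha, weak_learners[t]))
    # The final classifier is the sign of the weighted vote.
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in ensemble))
```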

AdaBoost

• AdaBoost is a specific boosting algorithm that follows a human approach to learning: focus on individual mistakes and adjust the weights accordingly
• The algorithm starts with a uniform distribution and adjusts the weights based on the correctness of the weak learners
• The final classifier is a weighted average of the individual weak learners, with more weight given to learners that did well
• AdaBoost is a robust method that tries hard to avoid overfitting and to achieve high confidence in its predictions

Boosting and Overfitting

• Boosting can overfit when the underlying weak learner is itself a complex model, such as a large artificial neural network
• Boosting can't prevent overfitting if all of the underlying learners overfit and have no way to stop doing so (as complex neural nets can)
• Boosting also suffers under pink noise (uniform noise) and tends to overfit there

Support Vector Machines

• SVMs operate on the notion of finding the boundary that maximizes the margin from the nearest data points
• SVMs focus on examples near the boundaries rather than the entire data set, which reduces computational complexity
• The optimal margin boundary always has some special points that lie exactly on the margin lines, called support vectors
• SVMs try to maximize the margin while also classifying all data points correctly
• The classification function depends only on the dot product between a new point x and the support vectors x_i (see the sketch below)
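Because the classifier depends only on dot products with the support vectors, the decision function can be written as f(x) = sign(Σ_i α_i y_i ⟨x_i, x⟩ + b). A sketch assuming the coefficients α_i, labels y_i, and bias b have already been found by the usual margin-maximization step; the values below are made up to show the call shape, not a trained model:

```python
import numpy as np

def make_svm_classifier(support_vectors, labels, alphas, b):
    """f(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b ), where the sum runs
    only over the support vectors (all other training points have alpha = 0)."""
    def f(x):
        return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)
    return f

# Toy values (made up, not a trained model) just to show the call shape:
f = make_svm_classifier(
    support_vectors=np.array([[1.0, 1.0], [-1.0, -1.0]]),
    labels=np.array([1.0, -1.0]),
    alphas=np.array([0.5, 0.5]),
    b=0.0,
)
print(f(np.array([2.0, 2.0])))  # 1.0: falls on the positive side
```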

Extending SVMs: The Kernel Trick

• SVMs can be extended to find separation boundaries between data points in higher dimensions than the features provide
• The kernel trick rests on the fact that a kernel function K acts as the dot product of some mapping Φ in a higher-dimensional space, without ever computing Φ explicitly
• The kernel trick allows us to apply almost any kernel function, and the SVM will still find a boundary that is linear in the higher-dimensional space

Kernel Functions

• A kernel function is a function that represents the dot product in a higher-dimensional space
• The choice of kernel function encodes domain knowledge about the data and can help classify it better
• Common kernel functions include polynomial kernels and radial basis kernels
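A quick numeric check of the kernel trick for the polynomial kernel from the quiz above: K(x_i, x_j) = (1 + x_i^T x_j)² really equals a dot product Φ(x_i)·Φ(x_j) in a higher-dimensional space. For 2-D inputs the explicit mapping can be written out (in practice we never need it, which is the whole point):

```python
import numpy as np

def K(x, z):
    """Polynomial kernel from the notes: (1 + x^T z)^2."""
    return (1 + x @ z) ** 2

def phi(x):
    """Explicit 6-D mapping whose dot product reproduces K for 2-D inputs."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(K(x, z), phi(x) @ phi(z))  # both 25.0 -- same value, but K never builds phi
```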

Regression

• Regression is a machine learning technique that can approximate real-valued and continuous functions
• Linear regression is the process of finding the "line of best fit" that minimizes the sum of squared errors between the points and the chosen line
• The line of best fit can be rigorously defined by solving a linear system
• The lack of an exact solution to the system means that the vector of y-values isn't in the column space of A, and we need to find the projection of y onto the column space
• The projection is the closest possible vector in the column space to y, which is exactly the distance we were trying to minimize
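The projection described above is exactly what the normal equations compute. A minimal sketch with made-up data points (in practice np.linalg.lstsq is the numerically safer route):

```python
import numpy as np

# Noisy points roughly on y = 2x + 1 (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix A: one column of x-values, one column of ones (the intercept).
A = np.column_stack([x, np.ones_like(x)])

# Least squares solution w = (A^T A)^{-1} A^T y -- the projection of y
# onto the column space of A.
w = np.linalg.inv(A.T @ A) @ A.T @ y
print(w)  # approximately [2.0, 1.0]: slope and intercept of the best-fit line
```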


Description

An unofficial guide to Georgia Institute of Technology's CS7641 Machine Learning course, covering various concepts and topics in machine learning.
