Logistic Regression and Classifiers

Questions and Answers

In the context of classifying images, what does a generative classifier aim to do?

  • Build separate models for each class (e.g., cats and dogs) and assess how well a new image fits each model. (correct)
  • Focus solely on the differences between the images, ignoring any shared characteristics.
  • Distinguish between images based on specific features like collars on dogs.
  • Create a single model that identifies common features across all image types.

What is the key difference between a generative and a discriminative classifier?

  • Generative classifiers model the joint probability distribution $P(d, c)$, while discriminative classifiers model the conditional probability $P(c|d)$ directly. (correct)
  • Generative classifiers are used for regression tasks, while discriminative classifiers are used for classification tasks.
  • Generative classifiers require more computational resources than discriminative classifiers.
  • Generative classifiers directly estimate the posterior probability $P(c|d)$, while discriminative classifiers model the likelihood $P(d|c)$ and prior $P(c)$.

What is the purpose of including a feature representation in a probabilistic machine learning classifier?

  • To directly compute the estimated class without needing a classification function.
  • To reduce the dimensionality of the input data.
  • To estimate the complexity of the model.
  • To convert raw input data into a format that the model can process, typically a vector of numerical features. (correct)

What role do the weights w and bias b play in logistic regression?

The weights determine the importance of each feature, and the bias provides an offset for the decision boundary.

What is the significance of the range of values produced by the sigmoid function in logistic regression?

It maps any real-valued number to a probability between 0 and 1, allowing the output to be interpreted as a probability.

In logistic regression, what is the purpose of the decision boundary?

To separate the data into distinct classes based on the predicted probabilities.

How does logistic regression utilize the sigmoid function in classification problems?

To map the linear combination of input features and weights to a probability between 0 and 1.

Given a sentiment classification task, if a logistic regression model assigns a probability $P(y=1|x) = 0.7$ to a document, how is this value typically used to classify the document?

The document is classified as positive since the probability is greater than 0.5.

What is the main purpose of the cross-entropy loss function in logistic regression?

To quantify the difference between the predicted probabilities and the actual labels.

How does stochastic gradient descent (SGD) contribute to training a logistic regression model?

It iteratively updates the model's parameters (weights and bias) to minimize the loss function.

What is the intuition behind using the negative log likelihood as a loss function in logistic regression?

It penalizes the model more strongly when the predicted probability is far from the true label.

Given that $\hat{y}$ is the predicted output from logistic regression and $y$ is the true label, which formula correctly represents the cross-entropy loss $L_{CE}(\hat{y}, y)$ for a single observation?

$L_{CE}(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$

In the context of gradient descent, what does the learning rate ($\eta$) signify, and how does it affect the training process?

It controls the magnitude of the update to the weights in each iteration; a higher rate can lead to faster but potentially unstable convergence.

In gradient descent, what is the significance of the gradient vector?

It represents the direction in which the loss function increases most rapidly.

What is a hyperparameter in the context of machine learning, and how does it differ from a regular parameter?

A hyperparameter is a special kind of parameter that is not directly learned from the data but is set prior to training to control aspects of the learning process.

In stochastic gradient descent (SGD), what is the primary reason for using mini-batch training instead of processing one example at a time?

To reduce the noise in the gradient estimate, leading to more stable convergence.

What is the purpose of regularization in logistic regression?

To prevent overfitting by penalizing large weights.

Why might a model that achieves 100% accuracy on a training set still be considered problematic?

It suggests that the model has overfit the training data and may not generalize well to new, unseen data.

What is the key difference between L1 and L2 regularization in logistic regression?

L1 regularization adds the sum of the absolute values of the weights to the loss function, while L2 adds the sum of the squares of the weights.

Under what circumstances would multinomial logistic regression be more appropriate than binary logistic regression?

When there are more than two mutually exclusive classes to predict.

Why is the softmax function used in multinomial logistic regression?

To normalize the output into a probability distribution across multiple classes that sums to 1.

What is the key difference in how features are handled in binary versus multinomial logistic regression?

In binary logistic regression each feature has a single weight, while in multinomial logistic regression each feature has separate weights for each class.

Which of the following statements best describes the role of logistic regression in the broader field of machine learning?

It serves as a foundational classification algorithm and is also connected to the architecture of neural networks.

If a review contains the word 'awesome', and the feature x_i represents whether the review contains 'awesome' with a corresponding weight w_i = +10, what does this imply for sentiment analysis using logistic regression?

The review is likely to be positive because the weight is positive, indicating 'awesome' contributes positively to sentiment.

Given a logistic regression model, how is the decision to classify an input x as either class 0 or class 1 made based on the computed value of w•x+b?

If w•x+b > 0, classify as class 1; otherwise, classify as class 0.

In logistic regression, if a feature $x_k$ is 'review contains mediocre' and it has a weight $w_k = -2$, how does it affect the log-odds of the review being positive?

It slightly decreases the log-odds, making a positive classification less likely because the weight is negative.

Suppose you're building a logistic regression model for authorship attribution to determine if Alexander Hamilton or James Madison wrote a document. What constitutes the 'input observation' x in this scenario?

A vector of features derived from the document, such as word frequencies or stylistic elements.

Why is it essential to have a 'principled classifier' that provides probabilities, rather than simply a 'sum is high' approach in logistic regression?

Because probabilities provide a measure of confidence, which is crucial for decision-making and risk assessment.

What distinguishes a convex loss function from a non-convex one in the context of logistic regression?

A convex loss function has only one global minimum, ensuring that gradient descent will converge to the optimal solution.

When visualizing gradient descent for a single scalar weight w, what does moving w in the 'reverse direction from the slope of the function' achieve?

It moves the weight towards a region where the loss function is minimized.

What does it mean for a loss function to be 'parameterized by weights $\theta = (w, b)$'?

The value of the loss function depends on the specific values of the weights and bias, which are adjusted during training.

In the context of L1 regularization, what key benefit does it offer in logistic regression, particularly for high-dimensional data?

It prevents overfitting and reduces model complexity by driving some feature weights to exactly zero, effectively performing feature selection.

How is the derivative of the loss function, with respect to a specific weight, used in updating that weight during stochastic gradient descent?

It is multiplied by the learning rate and subtracted from the weight, moving the weight in the direction that reduces the loss.

In multinomial logistic regression, if $p(y = c \vert x)$ represents the probability of an input $x$ belonging to class $c$, what must be true of the sum of these probabilities across all possible classes?

$\sum_c p(y = c \vert x) = 1$

If a new document containing exclamation points is fed into two sentiment models, one created with binary logistic regression where $w_5 = 3.0$ for the exclamation-point feature $x_5 = 1 \text{ if '!'} \in \text{doc}$, and the other with multinomial logistic regression such that $w_{5,+} = 3.5$, $w_{5,-} = 3.1$, $w_{5,0} = -5.3$, which model can discern the effect of the exclamation point on different sentiments (positive, negative, neutral)?

The multinomial logistic regression model can discern the different effects, since it has separate weights for each class.

Flashcards

Logistic Regression

An important analytic tool used in natural and social sciences.

Logistic Regression

A baseline supervised machine learning tool used for classification tasks.

Generative Classifier

A classifier that builds a model of each class to assign probabilities.

Discriminative Classifier

A classifier that directly distinguishes between classes without modeling each one.

Text Classification Definition

The input is a document (x) and a fixed set of classes (C). The output is a predicted class (ŷ) that belongs to C.

Feature Vector

Representing an observation x(i) using a set of features [x1, x2, ..., xn].

Feature importance

The weight wᵢ tells how important feature xᵢ is for determining the class.

Sigmoid Function Output

A range from 0 to 1. Represents the probability an input belongs to a specific class.

Core logistic regression idea

Computes w•x + b, passes it through a sigmoid function, then treats the result as a probability.

Decision Boundary

The threshold (typically 0.5) used to classify an instance based on its predicted probability.

Logistic Regression Training

Training phase involves learning weights and biases using stochastic gradient descent and cross-entropy loss.

Training a Model

Involves minimizing the distance between the model's estimate and the true value by adjusting weights and biases.

Supervised classification

The correct label y is known for each input x.

Loss Function

A function that quantifies how much the predicted output ŷ differs from the true output y.

Negative log likelihood loss

Choosing the parameters w, b that maximize the log probability of the true y labels given the observations x; minimizing the negative log likelihood is equivalent to this maximization.

Goal of Loss Function

Goal is to maximize the probability of the correct label p(y|x) for a single observation x.

Optimization Algorithm

The algorithm updates the parameters w and b to minimize the loss function.

Stochastic Gradient Descent

An algorithm to minimize the cross-entropy loss. It finds the direction of the steepest decrease in the loss function and moves in the opposite direction.

Learning Rate

A hyperparameter to scale value of the gradient. Higher learning rates lead to faster, but potentially unstable, learning.

Convex

A convex loss function has a single global minimum, so gradient descent starting from any point is guaranteed to find it.

Gradient

The vector pointing in the direction of greatest increase of a function.

Hyperparameter Definition

A special kind of parameter for an ML model: instead of being learned by the algorithm from supervision (like regular parameters), it is chosen by the algorithm designer.

Hyperparameter Tuning

Split off a development set from the data and tune the hyperparameters on that devset to optimize performance.

Overfitting

The model fits noise and matches the training data too closely, then fails at test time; addressed with regularization.

Regularization Goal

Add a regularization term R(θ) to the loss function to avoid overfitting.

L2 Regularization

A type of regularization that sums the squares of the weights

L1 Regularization

A type of regularization that sums the absolute values of the weights.

2+Class examples

Positive, negative, neutral. Parts of speech (noun, verb, adjective, adverb, preposition, etc).

2+Class use case

Use multinomial logistic regression, also called softmax regression or multinomial logit.

Softmax Definition

Takes a vector z = [z1, z2, ..., zk] of k arbitrary values and outputs a probability distribution: each value is in the range [0, 1] and all the values sum to 1.

Study Notes

Logistic Regression Overview

  • An important analytic tool in natural and social sciences.
  • It is a baseline supervised machine learning tool for classification.
  • Logistic Regression is also the foundation of neural networks.

Generative vs Discriminative Classifiers

  • Naive Bayes as a generative classifier contrasts with Logistic Regression, which is a discriminative classifier.
  • Generative classifiers build a model of what's in an image of something like a cat, knowing about whiskers, ears, and eyes.
  • Generative classifiers assign a probability to any image on how cat-like it is.
  • Generative classifiers build a separate model for each class they are trying to classify.
  • To classify a new image, both models are run and the class whose model fits the image better is chosen.
  • Discriminative classifiers just try to distinguish dogs from cats; in the example, they might key on dogs wearing collars and ignore everything else.
  • Naive Bayes finds the class for a document d as ĉ = argmax over c in C of P(d|c)P(c), using the likelihood and the prior.
  • Logistic regression finds the class for a document d as ĉ = argmax over c in C of P(c|d), modeling the posterior directly.

Components of a Probabilistic Machine Learning Classifier

  • Given m input/output pairs (x(i), y(i)), a classifier involves:
    • Feature Representation: each input observation x(i) is represented by a vector of features [x1, x2, ..., xn]; feature j of input x(i) is written xj, xj(i), or fj(x).
    • Classification Function: Computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
    • Objective Function: Needed for learning, such as cross-entropy loss.
    • Optimization Algorithm: Needed for optimizing the objective function, such as stochastic gradient descent.
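
The four components listed above map naturally onto a small amount of code. Below is a minimal, illustrative sketch (the helper names and the toy features are assumptions, not from the lesson): a feature representation, a sigmoid classification function, a cross-entropy objective, and a single stochastic-gradient update.

```python
import numpy as np

def features(doc):                       # 1. feature representation x
    """Toy features: count of 'awesome' and log word count (illustrative)."""
    words = doc.split()
    return np.array([words.count("awesome"), np.log(max(len(words), 1))])

def classify(x, w, b):                   # 2. classification function p(y=1|x)
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def objective(y_hat, y):                 # 3. objective: cross-entropy loss
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def sgd_step(x, y, w, b, eta=0.1):       # 4. optimization: one SGD update
    y_hat = classify(x, w, b)
    return w - eta * (y_hat - y) * x, b - eta * (y_hat - y)

x = features("an awesome , awesome movie")
w, b = np.zeros(2), 0.0
print(objective(classify(x, w, b), 1))   # loss before the update
w, b = sgd_step(x, 1, w, b)
print(objective(classify(x, w, b), 1))   # loss after one update (smaller)
```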

Phases of Logistic Regression

  • Training entails learning weights w and b using stochastic gradient descent and cross-entropy loss.
  • Testing entails computing p(y|x) for a test example x using learned weights w and b, and returning whichever label (y = 1 or y = 0) has a higher probability.

Classification Reminder

  • Examples of classifications are positive/negative sentiment classification, spam/not spam, and authorship attribution such as determining if a document was written by Hamilton or Madison.

Text Classification Definition

  • The input is a document x and a fixed set of classes C = {c1, c2, ..., cj}.
  • The output is a predicted class ŷ ∈ C.

Binary Classification in Logistic Regression

  • Given a series of input/output pairs (x(i), y(i)), represent each x(i) by a feature vector [x1, x2, ..., xn] and compute an output: a predicted class ŷ(i) ∈ {0,1}.

Features in Logistic Regression

  • Weight w for feature x indicates how important x is; for positive sentiment:
    • x = "review contains 'awesome'": w = +10
    • x = "review contains 'abysmal'": w = -10
    • x = "review contains 'mediocre'": w = -2

Logistic Regression for One Observation x

  • The input observation is a vector x = [x1, x2, ..., xn].
  • The weights are one per feature: W = [w1, w2, ..., wn], and are sometimes called the weights θ = [θ1, θ2, ..., θn].
  • The output is a predicted class ŷ ∈ {0,1}.
  • Multinomial logistic regression allows more than two classes, e.g. ŷ ∈ {0, 1, 2, 3, 4}.

How to do Classification

  • For each feature xᵢ, a weight wᵢ tells us how important xᵢ is; there is also a bias term b.
  • Sum all the weighted features and the bias: z = ∑ᵢ wᵢxᵢ + b, which can be written compactly as z = w•x + b.
  • If the sum is high, say y = 1; if low, say y = 0.
  • To get a probabilistic classifier, we need to formalize what "sum is high" means.

The Sigmoid or Logistic Function

  • A principled classifier would give a probability, just like Naive Bayes did.
  • The problem is that 'z' isn't a probability, so a function that goes from 0 to 1 is needed.
  • The solution is the sigmoid: y = σ(z) = 1 / (1 + e^(-z)) = 1 / (1 + exp(-z)).
  • Compute w•x + b, pass it through the sigmoid, and treat the result as a probability: P(y=1) = σ(w•x+b) = 1 / (1 + exp(-(w•x+b))) and P(y=0) = 1 - σ(w•x+b) = exp(-(w•x+b)) / (1 + exp(-(w•x+b))).
  • Note that P(y=0) = 1 - σ(w•x+b) = σ(-(w•x+b)), as the sketch below checks numerically.
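
A minimal sketch of the sigmoid in Python (variable names are illustrative), showing that any real-valued z is squashed into (0, 1) and that 1 - σ(z) = σ(-z):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = 0.833                               # e.g. a value of w.x + b
p_pos = sigmoid(z)                      # P(y=1) = sigma(w.x + b)
p_neg = 1.0 - p_pos                     # P(y=0) = 1 - sigma(w.x + b)
print(p_pos, p_neg)                     # ~0.70 and ~0.30
print(np.isclose(p_neg, sigmoid(-z)))   # True: 1 - sigma(z) == sigma(-z)
```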

Turning a Probability into a Classifier

  • ŷ = 1 if P(y = 1|x) > 0.5, otherwise ŷ = 0.
  • 0.5 is called the decision boundary.
  • P(y = 1|x) > 0.5 is equivalent to w•x + b > 0, and P(y = 1|x) < 0.5 is equivalent to w•x + b < 0.

Logistic Regression: Sentiment Example on Sentiment Classification

  • Given a document, classifying sentiment is determining if y=1 (positive) or y=0 (negative).
  • A sentiment example is, "It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable ? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you."
    • x1 = count(positive lexicon ∈ doc) equaling 3.
    • x2 = count(negative lexicon ∈ doc) equaling 2.
    • x3 equals 1 if "no" is in doc, or 0 otherwise which is 1.
    • x4 is count(1st and 2nd pronouns ∈ doc) equaling 3.
    • x5 equals 1 if "!" ∈ doc, or 0 otherwise which is 0.
    • x6 is log(word count of doc), equaling ln(66) ≈ 4.19.
  • Suppose w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7] and b = 0.1.
  • p(+|x) = P(Y = 1|x) = σ(w•x+b) = σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] • [3, 2, 1, 3, 0, 4.19] + 0.1) = σ(0.833) ≈ 0.70.
  • p(-|x) = P(Y = 0|x) = 1 - σ(w•x+b) ≈ 0.30 (the sketch below reproduces these numbers).
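
A quick numeric check of this worked example (NumPy used for the dot product; the values are those given above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3, 2, 1, 3, 0, 4.19])            # feature vector from the review
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

z = np.dot(w, x) + b                           # ~0.833
p_pos = sigmoid(z)                             # P(Y=1|x) ~ 0.70
p_neg = 1.0 - p_pos                            # P(Y=0|x) ~ 0.30
print(round(z, 3), round(p_pos, 2), round(p_neg, 2))
```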

Features for Logistic Regression

  • Features can be made for any classification task like period disambiguation.
  • For example, deciding whether a period marks the end of a sentence or not (a small feature-extraction sketch follows this list).
    • x1 equals 1 if "Case(wi) = Lower"
    • x2 equals 1 if "wi ∈ AcronymDict"
    • x3 equals 1 if "wi = St. & Case(wi-1) = Cap"
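
A hedged sketch of how such binary features might be extracted in Python; the acronym dictionary and helper name are illustrative assumptions, not part of the lesson:

```python
# Hypothetical acronym dictionary, for illustration only
ACRONYM_DICT = {"Inc.", "Dr.", "Mr.", "St.", "U.S."}

def period_features(word, prev_word):
    """Binary features for deciding whether a period ends a sentence."""
    x1 = 1 if word.islower() else 0                             # Case(w_i) = Lower
    x2 = 1 if word in ACRONYM_DICT else 0                       # w_i in AcronymDict
    x3 = 1 if word == "St." and prev_word[:1].isupper() else 0  # "St." after a capitalized word
    return [x1, x2, x3]

print(period_features("St.", "Main"))   # [0, 1, 1]
```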

Classification in (Binary) Logistic Regression: Summary

  • Given a set of classes with a + sentiment or - sentiment.
  • Given a vector x of features [x1, x2, ..., xn], where x1= count("awesome") and x2 = log(number of words in review).
  • Given a vector w of weights [w1, w2, ..., wn], with one weight per feature, P(y = 1) = σ(w•x+b) can be expressed as 1 / (1 + e^(-(w•x+b))).

Learning: Cross-Entropy Loss

  • Supervised classification means knowing the correct label y (either 0 or 1) for each x, but the system produces an estimate, ŷ.
  • The goal is to set w and b to minimize the distance between the estimate ŷ(i) and the true y(i).
  • We need a distance estimator (a loss or cost function) and an optimization algorithm that updates w and b to minimize the loss.

Learning Components

  • Cross-entropy loss as the loss function.
  • Stochastic gradient descent as the optimization algorithm.

Distance Between ŷ and y

  • The classifier output is ŷ = σ(w•x+b).
  • We measure how far this is from the true output y, which is either 0 or 1.
  • The difference is expressed as a loss L(ŷ, y): how much ŷ differs from the true y.

Intuition of Negative Log Likelihood Loss

  • Negative log likelihood loss equals cross-entropy loss.
  • It is a case of conditional maximum likelihood estimation.
  • Choose the parameters w, b that maximize the log probability of the true y labels in the training data, given the observations x.

Deriving Cross-Entropy Loss for a Single Observation x

  • The Goal: maximize probability of the correct label p(y|x).
  • Since there are only 2 discrete outcomes (0 or 1), the probability p(y|x) from the classifier can be expressed as ŷ^y (1 - ŷ)^(1-y).
  • When y = 1 this simplifies to ŷ, and when y = 0 it simplifies to 1 - ŷ.
  • We then maximize log p(y|x) = log[ŷ^y (1 - ŷ)^(1-y)] = y log ŷ + (1 - y) log(1 - ŷ).
  • Values that maximize log p(y|x) will also maximize p(y|x).
  • To get a loss to minimize, negate this expression and plug in ŷ = σ(w•x+b), giving the cross-entropy loss:
    • LCE(ŷ, y) = -[y log σ(w•x+b) + (1 - y) log(1 - σ(w•x+b))]

How Cross Entropy Works

  • The loss is smaller when the model's estimate is close to correct, e.g. when the true label is y = 1 and the model assigns a high probability to the positive class.
  • The loss is bigger when the model is confused.
  • Take the sentiment example above ("It's hokey..."), where p(+|x) = P(Y = 1|x) = σ(w•x+b) = σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] • [3, 2, 1, 3, 0, 4.19] + 0.1) = σ(0.833) ≈ 0.70.
    • If the true label is positive (y = 1), LCE(ŷ, y) = -[y log σ(w•x+b) + (1 - y) log(1 - σ(w•x+b))] = -log σ(w•x+b) = -log(0.70) ≈ 0.36.
    • If the true label were instead negative (y = 0), then 1 - σ(w•x+b) ≈ 0.30 and LCE(ŷ, y) = -log(1 - σ(w•x+b)) = -log(0.30) ≈ 1.2.
  • The loss when the model is right (0.36) is lower than when it is wrong (1.2), confirming that cross-entropy penalizes confident wrong predictions more heavily; a small code check follows below.
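
A minimal check of these loss values in Python (the function name is illustrative):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """L_CE(y_hat, y) = -[y log y_hat + (1 - y) log(1 - y_hat)]"""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.70                                   # sigma(w.x + b) from the example
print(round(cross_entropy(y_hat, 1), 2))       # true label positive -> 0.36
print(round(cross_entropy(y_hat, 0), 2))       # true label negative -> 1.2
```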

Stochastic Gradient Descent

  • Goal: minimize the average loss over the training set, (1/m) ∑ᵢ LCE(f(x(i)), y(i)).

  • For logistic regression, the loss function is convex.

  • A convex function has just one minimum

  • Gradient descent starting from any point is guaranteed to find the minimum.

  • (Loss for neural networks is non-convex)

  • Consider a single scalar w: should it be made bigger or smaller? We can decide by looking at the slope of the loss at the current value of w.

  • The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function.

  • Gradient Descent finds the gradient of the loss function at the current point and move in the opposite direction.

  • The value of the gradient (slope) is weighted by a learning rate η; a higher learning rate means larger steps: w(t+1) = w(t) - η × (slope of the loss function at w(t)).

  • The gradient is represented as a vector to express the directional components of the sharpest slope along the dimensions.

Real gradients have lots and lots of weights.

  • For each dimension wᵢ, the gradient component i gives the slope with respect to that variable: how much a small change in wᵢ influences the total loss function L.
  • We express this slope as the partial derivative of L with respect to wᵢ.
  • The gradient is then the vector of these partials, one per parameter (a numeric sketch follows below).
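
A small sketch illustrating the gradient as a vector of partial derivatives. For logistic regression with cross-entropy loss, each component works out to (σ(w•x+b) - y)·xⱼ; the finite-difference loop below is only a numerical sanity check, and all variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y = np.array([3.0, 2.0]), 1
w, b = np.array([0.2, -0.1]), 0.05

# Analytic gradient: dL/dw_j = (sigma(w.x + b) - y) * x_j
analytic = (sigmoid(np.dot(w, x) + b) - y) * x

# Each gradient component is a partial derivative; estimate it numerically
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric[j] = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / eps

print(analytic)   # analytic partials
print(numeric)    # finite-difference estimates; should closely agree
```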

Stochastic Gradient Descent Details

  • Compute ŷ(i) = f(x(i); θ): what is the estimated output ŷ?
  • Compute the loss L(ŷ(i), y(i)): how far off is ŷ(i) from the true output y(i)?
  • g ← ∇θ L(f(x(i); θ), y(i)): in which direction would θ move to increase the loss most?
  • θ ← θ - η g: move θ the other way instead.
  • The learning rate η is a hyperparameter, a special kind of parameter for an ML model.
    • If it is too high, the learner takes big steps and overshoots.
    • If it is too low, the learner takes too long.
  • Instead of being learned by the algorithm, hyperparameters are chosen by the algorithm designer.
  • There is also batch training, where the gradient is computed over the entire dataset, and mini-batch training, where it is computed over groups of m examples (see the sketch below).
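
A self-contained sketch of a mini-batch SGD training loop for logistic regression on synthetic data; the data, batch size, and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # toy feature vectors
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy labels

w, b = np.zeros(3), 0.0
eta, batch_size, epochs = 0.1, 8, 20                      # eta is the learning rate

for _ in range(epochs):
    order = rng.permutation(len(X))                       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        y_hat = sigmoid(xb @ w + b)
        grad_w = (y_hat - yb) @ xb / len(idx)             # averaged gradient
        grad_b = (y_hat - yb).mean()
        w -= eta * grad_w                                 # move against the gradient
        b -= eta * grad_b

print(w, b)
```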

Gradients Example

  • Goal: start from w1 = w2 = b = 0, given a mini sentiment example with x1 = 3, x2 = 2, true label y = 1, and learning rate η = 0.1.
  • Update step: θ(t+1) = θ(t) - η ∇θ L(f(x; θ), y).
  • Plugging in the sigmoid and its derivative gives ∇θ L = [(σ(0) - 1)·x1, (σ(0) - 1)·x2, σ(0) - 1] = [-1.5, -1.0, -0.5], so one update step yields θ = [0.15, 0.1, 0.05] (checked numerically in the sketch below).
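
A short numeric check of this example (variable names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.0, 2.0])            # x1 = 3, x2 = 2
y, eta = 1, 0.1
w, b = np.zeros(2), 0.0             # start with w1 = w2 = b = 0

y_hat = sigmoid(np.dot(w, x) + b)   # sigma(0) = 0.5
grad = np.append((y_hat - y) * x, y_hat - y)   # [dL/dw1, dL/dw2, dL/db]
theta = np.append(w, b) - eta * grad

print(grad)    # [-1.5 -1.  -0.5]
print(theta)   # [0.15 0.1  0.05]
```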

Regularization

  • A model that perfectly matches the training data has a problem, called overfitting.
    • Overfitting means modeling noise: a random word that happens to perfectly predict y in the training data will get a very high weight.
    • The model then fails to generalize to a test set that lacks that word.
  • The solution is a model that generalizes: add a regularization term R(θ) to the loss function to penalize large weights.

L2 Regularization

  • Also known as ridge regression, L2 Regularization is found as the sum of the squares of the weights.
  • The name comes from the use of the L2 norm ||θ||₂ in the penalty term.

L1 Regularization

  • Also known as lasso regression, L1 Regularization is found as the sum of the absolute value of the weights.
  • The name comes from the use of the L1 norm ||θ||₁ in the penalty term (a sketch of both penalties follows below).
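
A minimal sketch of adding an L2 (ridge) or L1 (lasso) penalty to the loss; the weighting factor alpha and the function name are illustrative assumptions.

```python
import numpy as np

def regularized_loss(data_loss, w, alpha, kind="l2"):
    """Add a regularization term R(theta) to the data loss."""
    if kind == "l2":
        penalty = np.sum(w ** 2)        # ridge: sum of squared weights
    else:
        penalty = np.sum(np.abs(w))     # lasso: sum of absolute weights
    return data_loss + alpha * penalty

w = np.array([2.5, -5.0, -1.2])
print(regularized_loss(0.36, w, alpha=0.01, kind="l2"))   # 0.36 + 0.01 * 32.69
print(regularized_loss(0.36, w, alpha=0.01, kind="l1"))   # 0.36 + 0.01 * 8.7
```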

Multinomial Logistic Regression

  • When there are more than 2 classes, use multinomial logistic regression.
    • It is also called softmax regression or multinomial logit, and it handles the same multiclass setting addressed by multiclass Naive Bayes.
    • Examples are Positive/negative/neutral, parts of speech (noun, verb, adjective, adverb, preposition, etc.), and classifying emergency SMSs into different actionable classes
  • Binary logistic regression uses only 2 output classes.

Softmax in Multinomial Logistic Regression

  • The softmax is a generalization of the sigmoid: it takes a vector z = [z1, z2, ..., zk] of k arbitrary values and outputs a probability distribution whose values sum to 1 (see the sketch below).
  • The probability of everything must still sum to 1: P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1.
  • Each class score is still a dot product between a weight vector and the input vector x, but there is a separate weight vector for each of the K classes.
  • In binary logistic regression the single weight vector had a direction (positive weights push toward y = 1); in multinomial logistic regression each class has its own weight vector.
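
A minimal softmax sketch in Python; the example scores reuse the exclamation-point weights from the quiz above (3.5, 3.1, -5.3) purely as illustrative inputs.

```python
import numpy as np

def softmax(z):
    """Turn arbitrary scores into a distribution: values in [0, 1] summing to 1."""
    exp_z = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return exp_z / exp_z.sum()

z = np.array([3.5, 3.1, -5.3])       # one score per class: positive, negative, neutral
p = softmax(z)
print(p, p.sum())                    # probabilities that sum to 1.0
```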
