Naive Bayes Classification Algorithm


Questions and Answers

What is the purpose of using Bayes' rule in generative algorithms like Naive Bayes?

  • To directly model the decision boundary between classes.
  • To find the test data `x` that is least likely for a label `y`.
  • To model $P(y|x)$ based on $P(x|y)$ and $P(y)$. (correct)
  • To discriminate between positive and negative samples in the training set.

What does it mean for the Naive Bayes algorithm to be considered 'naive'?

  • It is only suitable for simple datasets with few features.
  • It models $P(y|x)$ directly without any assumptions.
  • It assumes that all features are independent of each other given the class. (correct)
  • It assumes that features are not really correct.

In the Naive Bayes classifier, how is the posterior probability for a given evidence X determined?

  • By using Bayes rule $P(Y/X) = \frac{P(X/Y)P(Y)}{P(X)}$. (correct)
  • By calculating the prior probability $P(Y)$.
  • By directly modeling $P(y|x)$ using logistic regression.
  • By ignoring the evidence and focusing on the posterior probability.

What is the primary challenge in evaluating the term $P(a_1, a_2, ..., a_d | y)$ in the Naive Bayes algorithm, and how is it typically addressed?

It is difficult to identify and extract from a training set. The solution is to make the assumption that each feature is independent. (A)

What is the purpose of Laplace smoothing in Naive Bayes and how is it implemented?

To avoid zero probability terms by adding '1' to the numerator and the count of training data to the denominator when calculating conditional probabilities. (D)

What is the first step toward implementing the Naive Bayes classifier?

Model $P(x/y)$ and $P(y)$. (C)

How does a flow chart differ from a decision tree in the context of classification tasks?

A flow chart shows the tasks in a process and many decisions are taken along the path, whereas a decision tree is expected to make a single classification decision. (D)

What does node purity refer to in the context of decision trees?

The degree to which a node contains samples of only one class. (D)

In a decision tree, what is information gain used for?

To determine the order of the features used in the nodes of a decision tree. (B)

What happens to the entropy of a node if all the data belongs to one class?

It becomes the minimum: $-(1 \cdot \log_2 1) = 0$. (D)

How is the Gini index calculated and how is it used in the context of decision trees?

Calculated by summing the squares of the probabilities of each class and subtracting this sum from 1; used to decide data splitting at the nodes of a decision tree. (A)

Why can the Gini index be preferable to entropy in decision tree models?

It requires less computation. (C)

What is the role of the Adaboost algorithm in ensemble methods?

To strengthen weak learners by weighting the training data based on the misclassified examples of previous iterations. (B)

In the AdaBoost algorithm, what happens in the next step if a classifier makes 0 errors?

$\alpha_m$ is large and positive. (C)

What is the core idea behind ensemble methods?

Decisions made collectively by a large group of people are better than those of an individual. (A)

What is the purpose of 'weighting' data in the AdaBoost algorithm?

To get 'modified' data. (B)

In ensemble methods, what is 'majority voting' and how is it applied?

The prediction of each model is considered a 'vote', and the prediction we get from the majority of the models is used as the final prediction. (D)

What is the name for a decision tree with just one node and two leaves used in AdaBoost?

Decision stump (A)

What is 'bootstrapping' in the context of bagging?

A re-sampling method. The original training data set is distorted by re-sampling and converted to multiple data sets. (B)

What is the primary difference between 'Bagging trees' and 'Random Forest'?

In 'bagging trees', all features of the data are considered when building the different trees; in Random Forest, only a subset of the features is considered for each tree. (B)

Which of the following is a characteristic of sequential ensemble methods?

Models are created sequentially, with later models correcting errors of earlier ones. (B)

What is the purpose of the 'averaging' method in ensemble learning?

The average of the individual predictions is taken as the final result; for example, the average of 5 ratings. (C)

What characteristics define a 'weak learner' in the context of ensemble methods like AdaBoost?

A classifier that performs better than random guessing but is still poor as a classifier on its own. (D)

How does the re-sampling process used in bootstrapping work?

The sampling is performed with replacement, which means that the same data point is allowed to occur more than once in the target data set. (D)

What is the motivation behind using ensemble methods in machine learning?

To avoid the risk of having to rely on the result of a poorly performing ML system. (D)

What is the 'weighted averaging' method in ensemble learning, and when is it most applicable?

In this method, different weights are assigned to each classifier; the predictions of classifiers with higher weights have a higher impact on the final result. (B)

In a Random Forest, how is the subset of features chosen at each node split?

A random subset of the features is considered, with $\sqrt{d}$ suggested as the number of features at each split. (A)

What is the main difference between parallel and sequential ensemble methods?

In parallel methods, the data is manipulated in such a way as to give the illusion of independence between the models, while in sequential methods the different models are created one after another. (B)

Flashcards

Naïve Bayes Algorithm

A classification algorithm known for simplicity and speed, widely used in Natural Language Processing, despite assumptions that are ‘not really correct'.

Discriminative Algorithms

Algorithms that model P(y/x) directly to create a ‘decision boundary' to separate classes.

Generative Algorithms

Algorithms that model P(y) and P(x/y), using Bayes rule to estimate P(y/x).

Evidence (in Bayes Rule)

The probability of observing the evidence (input data).


Prior Probability

The probability of the label (y) before observing the evidence (x).


Likelihood

The probability of observing the evidence (x) given the label (y).


Posterior Probability

The probability of the label (y) given the evidence (x), calculated using Bayes' Rule.


Laplace Smoothing

A technique used to prevent zero probabilities in Naive Bayes, by adding a small number (usually 1) to the numerator of probability estimates.


Decision Tree

A tree-like structure used for decision making, where each node represents a feature and each branch represents a decision rule.


Information Gain

A measure used to determine the best splitting point in a decision tree, based on the reduction in entropy.


Entropy (in Decision Trees)

A measure of the impurity or randomness in a set of data.


Gini Index

A measure used in decision trees to determine the inequality among classes. Lower values represent higher purity.


Node Purity

The percentage of data belonging to a single class in a node of a decision tree.


Ensemble Method

A technique that combines multiple models to improve overall performance.


Majority Voting

An ensemble method where multiple models make predictions, and the final prediction is determined by the majority vote.


Averaging (Ensemble)

An ensemble method where the average of multiple model predictions is taken as the final prediction.


Weighted Averaging

An ensemble method that gives different weights to each classifier's prediction based on their importance.


Sequential Ensemble Method

An ensemble method where models are built sequentially, with each model focusing on the mistakes of the previous ones.


Parallel Ensemble Method

An ensemble method where models are built independently and simultaneously.


AdaBoost

A sequential ensemble method that combines weak learners into a strong learner by weighting data points based on past classification performance.


Decision Stump

A weak learner that is a decision tree with only one node and two leaves.


Random Forest

A parallel ensemble method that creates multiple decision trees on bootstrapped subsets of the data and random subsets of features.


Bootstrapping

Creating random subsets from the original dataset.


Bootstrapping in Bagging.

A re-sampling method where the original training data set is distorted by re-sampling and converted to multiple data sets.


Study Notes

  • A popular classification algorithm that is widely used in many tasks, particularly in Natural Language Processing, is the Naive Bayes Classification Algorithm
  • The name stems from a 'naïve' assumption that is not really correct; despite this, the algorithm still performs well
  • This technique belongs to the ‘generative' class of algorithms, contrasting with the ‘discriminative' class

Discriminative Algorithms

  • Logistic regression classification models center around creating a ‘decision boundary' which models P(y/x) directly
  • Discriminates between positive and negative samples
  • Looks at the test data x and finds a label y which is most likely for it; that is, the test sample finds itself on one side of the decision boundary
  • P(y/x) is modeled directly

Generative Algorithms

  • These algorithms work with the knowledge of negative and positive samples in the training set
  • This means P(y) is known
  • P(x/y) is also known, indicating the values of x for which y is positive or negative
  • Models P(x/y) and P(y) are created, and Bayes' rule is used to get P(y/x)
  • The algorithm builds two models based on y = Positive and y = Negative, and when a test sample comes, matches it with one of these models to see which comes out best
  • The probabilities used are P(x/y) and P(y), where the first is a conditional probability and the second is a 'prior'; these are used to get the class for a test sample
  • The Naive Bayes algorithm is a generative algorithm

Naive Bayes Formulation

  • Bayes rule of conditional probability for the variables X and Y can be written as follows:

  • $P(X, Y) = P(X/Y)\,P(Y)$

  • $P(X, Y) = P(Y/X)\,P(X)$

  • $P(Y/X) = \frac{P(X/Y)\,P(Y)}{P(X)}$

  • $P(Y)$ is the prior, $P(X)$ is the evidence, $P(X/Y)$ is the likelihood, and $P(Y/X)$ is the posterior

  • The aim is to get the posterior $P(Y/X)$ for certain evidence $X$

  • Steps for solving classification problems are:

  • Model P(x/y) and P(y)

  • Use Bayes rules to get P(y/x)

  • Find $\arg\max_y P(y/x) = \arg\max_y \frac{P(x/y)\,P(y)}{P(x)} = \arg\max_y P(x/y)\,P(y)$, since the denominator does not depend on $y$

  • Given training data $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}^d$ and $y_i \in Y$, it is possible to learn a classification function $f: \mathbb{R}^d \to Y$, where $Y = \{+1, -1\}$

  • With training data $(x_i, y_i)$, where $x_i$ is a feature vector with $d$ features and $y_i$ is a discrete label, and there are $n$ samples, a new sample $x_{new}$ arrives with features $(a_1, a_2, \ldots, a_d)$

  • The label $y_{new}$ of this new sample is found using the formula: $y_{new} = \arg\max_{y \in Y} \frac{P(a_1, a_2, \ldots, a_d \mid y)\,P(y)}{P(a_1, a_2, \ldots, a_d)}$

  • The label (Positive or Negative) is the one with the maximum value of $P(y \mid a_1, a_2, \ldots, a_d)$; the denominator depends only on the sample $x_{new}$ with feature values $(a_1, a_2, \ldots, a_d)$, not on the label, so it can be ignored to get $y_{new} = \arg\max_{y \in Y} P(a_1, a_2, \ldots, a_d \mid y)\,P(y)$

  • To evaluate these terms, P(y) is obtained easily from the training data

  • The first term denotes the probability of observing a specific combination of features $(a_1, a_2, \ldots, a_d)$ given a particular label $y$; to evaluate it, a simplifying assumption can be made that each feature is independent

  • The probability of each individual feature given the label is used instead, so we find $P(a_1/y), P(a_2/y), \ldots, P(a_d/y)$

  • This untrue assumption of independence means the method is referred to as ‘naïve.'

  • Based on the assumption of independence of the features, the Naive Bayes classifier formulation is: $y_{new} = \arg\max_{y \in Y} P(y) \prod_{j=1}^{d} P(a_j \mid y)$

Steps of the method

  • Estimate $P(y)$ for each $y$ by counting the number of occurrences of the label
  • Estimate $P(a_j \mid y)$ for all $j = 1$ to $d$
  • Use the equation above to get $y_{new}$ (a code sketch of these steps follows below)
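
The following is a minimal, hypothetical sketch of these three steps in Python for categorical features; the function names (`train_naive_bayes`, `predict`) and the toy data are illustrative assumptions, not part of the lesson:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Steps 1 and 2: estimate P(y) and P(a_j | y) by counting (no smoothing here)."""
    n = len(y)
    label_counts = Counter(y)
    priors = {label: count / n for label, count in label_counts.items()}

    # cond_counts[(j, value, label)] = samples with feature j == value and that label
    cond_counts = defaultdict(int)
    for xi, yi in zip(X, y):
        for j, value in enumerate(xi):
            cond_counts[(j, value, yi)] += 1

    def conditional(j, value, label):
        # Estimate of P(a_j = value | y = label)
        return cond_counts[(j, value, label)] / label_counts[label]

    return priors, conditional

def predict(x_new, priors, conditional):
    """Step 3: return argmax_y P(y) * prod_j P(a_j | y)."""
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for j, value in enumerate(x_new):
            score *= conditional(j, value, label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Example with made-up data: features are (weather, time_of_day), labels are +1 / -1
X = [("sunny", "morning"), ("rainy", "evening"), ("sunny", "evening"), ("rainy", "morning")]
y = [+1, -1, +1, -1]
priors, conditional = train_naive_bayes(X, y)
print(predict(("sunny", "morning"), priors, conditional))  # expected: +1
```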

Laplace Smoothing

  • One problem in the method is that any probability term could be zero, so the whole product becomes zero
  • To avoid this, smoothing is used, such as 'add-one' Laplace smoothing, where 1 is added to the numerator and a count of the training data to the denominator when calculating each conditional probability (a short sketch follows below)
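
A hedged sketch of the smoothed estimate follows; adding the number of distinct feature values to the denominator is a common convention assumed here, while the lesson's wording ("the count of training data") may refer to a slightly different denominator:

```python
def smoothed_conditional(count_value_and_label, count_label, n_values):
    """Add-one (Laplace) smoothed estimate of P(a_j = value | y = label).

    count_value_and_label: training samples with this feature value and this label
    count_label: training samples with this label
    n_values: number of distinct values feature j can take (assumed denominator term)
    """
    return (count_value_and_label + 1) / (count_label + n_values)

# A value never seen with this label no longer yields a zero probability:
print(smoothed_conditional(0, 10, 3))  # 1/13 instead of 0
```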

Decision Trees

  • Decision trees are important algorithms widely used in ML and statistics with names such as ID3, C4.5, CART, and Regression trees, working on the concept of information gain
  • Classification and regression are possible
  • It is easy to visualize because it is like a flow chart
  • In a flow chart, many decisions are taken along the path, whereas a decision tree is expected to make a single classification decision
  • Specific criteria are used to drive the decision-making process at each node
  • ML problems usually have many features; the correct feature to use at each node of the tree is evaluated using the concept of information gain or the Gini index, and decision trees are built in a greedy, top-down manner

Node Purity

  • The main point in decision trees is to choose the feature at each node
  • The root node split is based on the feature of hardware vs software, giving a split of 20:40 for the data
  • This node is not 'pure' but has a purity of 66.67%, because that percentage of the data is on one side of the split, with the remainder on the other side
  • In the example with 60 students, if they had all chosen a software job, the node would be 100% pure because the split would put all data into one class only
  • If all 100 ECE students wanted jobs, as opposed to wanting to go for higher studies, the split would put all data on one side of the node, so it is said to be 100% pure (a small purity calculation follows below)
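
A tiny illustrative purity calculation matching the example above (the helper `node_purity` is hypothetical, not from the lesson):

```python
def node_purity(class_counts):
    """Purity = fraction of samples in the node that belong to the majority class."""
    return max(class_counts) / sum(class_counts)

# The 20:40 hardware-vs-software split from the notes:
print(node_purity([20, 40]))  # 0.666..., i.e. about 66.67% pure
# A node where all 60 samples belong to the same class:
print(node_purity([60, 0]))   # 1.0, i.e. 100% pure
```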

Information Gain at each Node

  • The splitting criterion for each node in the ID3 and C4.5 decision tree methods is information gain
  • Information gain is based on entropy, a quantity well known from communications and information theory
  • For a binary class with $P_+$ and $P_-$ as the probabilities, the definition of entropy for a set of samples $S$ is $Entropy(S) = -P_+ \log_2 P_+ - P_- \log_2 P_-$
  • For $C$ classes, $Entropy = \sum_{i=1}^{C} -P_i \log_2 P_i$
  • Maximum entropy occurs when $P_+ = P_-$; this happens when there is 100% impurity at the node
  • If all data belongs to one class, that class's probability is 1, and the entropy is $-(1 \cdot \log_2 1) = 0$
  • Information gain is a measure of how much information a feature provides about a class; the order of the features used in the nodes of a decision tree is determined using this measure
  • The root node is named the parent node, and the subnodes are child nodes
  • $Information\ gain = E_{parent} - E_{children}$
  • The information gain is the difference between the parent and children entropies; there may be many children, so $E_{children}$ is the weighted average entropy of the children
  • Information gain measures how much a given feature will reduce entropy when splitting examples
  • $Gain(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$, where $S$ is the set of samples and $Gain(S, A)$ is the information gain of set $S$ for attribute (feature) $A$
  • $Entropy(S)$ is the entropy of the parent node and $\frac{|S_v|}{|S|}$ is the proportion of samples in the specific child node $S_v$ (a code sketch of entropy and information gain follows below)
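
As a minimal sketch of these definitions (the helper names `entropy` and `information_gain` are illustrative; class counts are passed as plain lists):

```python
import math

def entropy(class_counts):
    """Entropy(S) = -sum_i P_i * log2(P_i) over the classes present in S."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            ent -= p * math.log2(p)
    return ent

def information_gain(parent_counts, children_counts):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    total = sum(parent_counts)
    weighted_children = sum((sum(child) / total) * entropy(child)
                            for child in children_counts)
    return entropy(parent_counts) - weighted_children

print(entropy([30, 30]))   # 1.0: maximum entropy for a 50/50 binary node
print(entropy([60, 0]))    # 0.0: all data belongs to one class
print(information_gain([30, 30], [[20, 5], [10, 25]]))  # gain for a hypothetical split
```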

Gini Index

  • This index decides the splitting of data at the nodes of a decision tree
  • It does not involve a 'logarithm' term, so it requires less computation and is often preferred
  • It is used in CART (Classification and Regression Trees) and in many Python toolkits (scikit-learn, for example)
  • The Gini index is calculated by summing the squares of the probabilities of each class and subtracting this sum from 1
  • When node impurity is at its maximum, with $P_+ = P_-$ (each equal to 0.5), entropy $= 1$ whereas Gini $= 0.5$ (see the sketch below)
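
A matching sketch of the Gini index (same illustrative conventions as the entropy sketch above):

```python
def gini_index(class_counts):
    """Gini = 1 - sum_i P_i^2; lower values indicate a purer node."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Maximum impurity for a binary node (P+ = P- = 0.5): entropy is 1, Gini is 0.5
print(gini_index([30, 30]))  # 0.5
print(gini_index([60, 0]))   # 0.0 for a 100% pure node
```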

Ensemble Methods

  • The term means "a group of items viewed as a whole rather than individually"
  • The intuition is that decisions made collectively by a large group are better than those made by an individual, even if considered an expert
  • Ensemble methods are also used in machine learning to avoid the risk of relying on the result of a poorly performing ML system
  • Combine the results of many ML systems to get a final prediction that is likely to be better

How Results are Combined

  • With classifiers, the solution is to take a majority vote: each model's prediction is considered a 'vote', and the prediction from the majority of the models is the final prediction; in some methods, each classifier is assigned a different weight, so those with higher weights have a higher impact on the result
  • With numeric predictions, such as ratings on a scale, the average (or weighted average) of the individual predictions is taken as the final result (see the sketch below)
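
A small sketch of majority voting, averaging, and weighted averaging (the helper names are illustrative assumptions):

```python
from collections import Counter

def majority_vote(predictions):
    """Each model's prediction counts as one 'vote'; the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Plain averaging of numeric predictions, e.g. ratings on a scale."""
    return sum(predictions) / len(predictions)

def weighted_average(predictions, weights):
    """Weighted averaging: classifiers with higher weights have more impact."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

print(majority_vote(["spam", "spam", "ham"]))        # 'spam'
print(average([4, 5, 3, 4, 5]))                      # average of 5 ratings
print(weighted_average([4, 5, 3], [0.5, 0.3, 0.2]))  # 4.1
```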

Types of Ensemble Methods

  • Many versions of the same classifier are modeled on the same dataset, which is manipulated in such a way that it appears as if a different dataset is available to each classifier; the most popular base classifier is the decision tree
  • Each tree is called a 'weak learner' or 'base learner'
  • The ensemble algorithm strengthens the weak learners so that better results are obtained

Sequential Methods

  • In the first iteration, a number of weak learners are used
  • In the next iteration, these learners are strengthened and the training data is selectively “weighted” based on the misclassified examples of the previous iteration
  • AdaBoost is a method that produces spectacular results
  • Gradient Boosting and XGBoost are other sequential ensemble methods

Parallel Methods

  • Weak models are generated in parallel
  • The data is manipulated to give the illusion of independence between models; the 'Random Forest' method is a popular example (a bootstrapping sketch follows below)
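
A brief sketch of bootstrapping and the Random Forest style feature sub-sampling (the use of Python's random module and the helper names are assumptions for illustration):

```python
import math
import random

def bootstrap_sample(data, rng=random):
    """Re-sample with replacement: the same item may appear more than once."""
    return [rng.choice(data) for _ in range(len(data))]

def bagging_datasets(data, n_models):
    """One bootstrapped dataset per parallel model (e.g. per tree in a Random Forest)."""
    return [bootstrap_sample(data) for _ in range(n_models)]

def random_feature_subset(n_features, rng=random):
    """Random Forest style: consider roughly sqrt(d) randomly chosen features for a split."""
    k = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

datasets = bagging_datasets(list(range(10)), n_models=3)  # 3 distorted copies of the data
features = random_feature_subset(n_features=16)           # e.g. 4 of 16 feature indices
```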

AdaBoost

  • In this method, boosting is applied to the data and the weak learners after every iteration
  • A weak learner performs better than random guessing but is still a poor classifier on its own
  • A decision stump is used as the base (weak) learner, which means that only binary classification can be done; the algorithm uses a large set of stumps called a 'forest of decision stumps'
  • Actions performed:
    • Weighting of data samples
    • Weighting of the classifiers (a minimal AdaBoost sketch follows below)
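
The following is a minimal sketch of these two weighting actions, assuming numeric features and labels in {+1, -1}; the helper names (`train_stump`, `adaboost`) and the number of rounds are illustrative, not the lesson's exact formulation:

```python
import math

def train_stump(X, y, weights):
    """Find the decision stump (feature, threshold, polarity) with the lowest weighted error."""
    d = len(X[0])
    best, best_err = None, float("inf")
    for j in range(d):
        for threshold in sorted({x[j] for x in X}):
            for polarity in (+1, -1):
                preds = [polarity if x[j] <= threshold else -polarity for x in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if err < best_err:
                    best_err, best = err, (j, threshold, polarity)
    return best, best_err

def stump_predict(stump, x):
    j, threshold, polarity = stump
    return polarity if x[j] <= threshold else -polarity

def adaboost(X, y, n_rounds=10):
    """AdaBoost with decision stumps; labels must be +1 / -1."""
    n = len(X)
    weights = [1.0 / n] * n           # start with uniform data weights
    ensemble = []                     # list of (alpha_m, stump)
    for _ in range(n_rounds):
        stump, err = train_stump(X, y, weights)
        err = max(err, 1e-10)         # avoid division by zero when the stump is perfect
        alpha = 0.5 * math.log((1 - err) / err)   # classifier weight: large when err is small
        ensemble.append((alpha, stump))
        # Re-weight the data: misclassified samples get larger weights, then normalize
        weights = [w * math.exp(-alpha * yi * stump_predict(stump, xi))
                   for w, xi, yi in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of the stumps: sign of sum_m alpha_m * h_m(x)."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

A stump with zero weighted error would make its classifier weight $\alpha_m$ very large and positive, which is why the error is clipped to a small value in this sketch.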
