Questions and Answers
What is the purpose of using Bayes' rule in generative algorithms like Naive Bayes?
- To directly model the decision boundary between classes.
- To find the test data `x` that is least likely for a label `y`.
- To model $P(y|x)$ based on $P(x|y)$ and $P(y)$. (correct)
- To discriminate between positive and negative samples in the training set.
What does it mean for the Naive Bayes algorithm to be considered 'naive'?
- It is only suitable for simple datasets with few features.
- It models $P(y|x)$ directly without any assumptions.
- It assumes that all features are independent of each other given the class. (correct)
- It assumes that features are not really correct.
In the Naive Bayes classifier, how is the posterior probability of a label Y determined for a given piece of evidence X?
- By using Bayes' rule $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$. (correct)
- By calculating the prior probability $P(Y)$.
- By directly modeling $P(y|x)$ using logistic regression.
- By ignoring the evidence and focusing on the posterior probability.
What is the primary challenge in evaluating the term $P(a_1, a_2, ..., a_d | y)$ in the Naive Bayes algorithm, and how is it typically addressed?
What is the purpose of Laplace smoothing in Naive Bayes and how is it implemented?
What is the first step toward implementing the Naive Bayes classifier?
How does a flow chart differ from a decision tree in the context of classification tasks?
What does node purity refer to in the context of decision trees?
In a decision tree, what is information gain used for?
What happens to the entropy of a node if all the data belongs to one class?
How is the Gini index calculated and how is it used in the context of decision trees?
Why can the Gini index be preferable to entropy in decision tree models?
What is the role of the AdaBoost algorithm in ensemble methods?
In the AdaBoost algorithm, what happens in the next step if a classifier makes 0 errors?
What is the core idea behind ensemble methods?
What is the purpose of 'weighting' data in the AdaBoost algorithm?
In ensemble methods, what is 'majority voting' and how is it applied?
What is the name for a decision tree with just one node and two leaves used in AdaBoost?
What is 'bootstrapping' in the context of bagging?
What is the primary difference between 'Bagging trees' and 'Random Forest'?
Which of the following is a characteristic of sequential ensemble methods?
What is the purpose of the 'averaging' method in ensemble learning?
What characteristics define a 'weak learner' in the context of ensemble methods like AdaBoost?
How does the re-sampling process used in bootstrapping work?
What is the motivation behind using ensemble methods in machine learning?
What is the 'weighted averaging' method in ensemble learning, and when is it most applicable?
In a Random Forest, how is the subset of features chosen at each node split?
What is the main difference between parallel and sequential ensemble methods?
Flashcards
Naïve Bayes Algorithm
A classification algorithm known for simplicity and speed, widely used in Natural Language Processing, despite assumptions that are ‘not really correct'.
Discriminative Algorithms
Algorithms that model P(y|x) directly to create a 'decision boundary' that separates classes.
Generative Algorithms
Algorithms that model P(y) and P(x|y), using Bayes' rule to estimate P(y|x).
Evidence (in Bayes Rule)
Prior Probability
Likelihood
Posterior Probability
Laplace Smoothing
Decision Tree
Information Gain
Entropy (in Decision Trees)
Gini Index
Node Purity
Ensemble Method
Majority Voting
Averaging (Ensemble)
Weighted Averaging
Sequential Ensemble Method
Parallel Ensemble Method
AdaBoost
Decision Stump
Random Forest
Bootstrapping
Bootstrapping in Bagging
Study Notes
- A popular classification algorithm that is widely used in many tasks, particularly in Natural Language Processing, is the Naive Bayes Classification Algorithm
- The name stems from the algorithm's 'naïve' assumptions, which are not really correct; the algorithm still performs well despite this
- This technique belongs to the ‘generative' class of algorithms, contrasting with the ‘discriminative' class
Discriminative Algorithms
- Logistic regression classification models center around creating a 'decision boundary'; they model P(y|x) directly
- Discriminates between positive and negative samples
- Looks at the test data x and finds a label y which is most likely for it; that is, the test sample falls on one side of the decision boundary
- P(y|x) is modeled directly
Generative Algorithms
- These algorithms work with the knowledge of negative and positive samples in the training set
- This means P(y) is known
- P(x|y) is also known, indicating the values of x for which y is positive or negative
- Models of P(x|y) and P(y) are created, and Bayes' rule is used to get P(y|x)
- The algorithm builds two models based on y = Positive and y = Negative, and when a test sample comes, matches it with one of these models to see which comes out best
- The probabilities used are P(x|y) and P(y), where the first is a conditional probability and the second is a 'prior'; these are used to get the class for a test sample
- The Naive Bayes algorithm is a generative algorithm
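To make the contrast concrete, here is a minimal sketch comparing a discriminative model (logistic regression, which models P(y|x) directly) with a generative one (Gaussian Naive Bayes, which models P(x|y) and P(y)) on the same data; the dataset and model choices are illustrative assumptions that scikit-learn is available, not something taken from the source material.

```python
# Illustrative comparison: discriminative vs. generative classifiers (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # discriminative: models P(y|x) directly
from sklearn.naive_bayes import GaussianNB           # generative: models P(x|y) and P(y)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```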
Naive Bayes Formulation
- Bayes' rule of conditional probability for the variables X and Y can be written as follows:
- P(X, Y) = P(X|Y)P(Y)
- P(X, Y) = P(Y|X)P(X)
- P(Y|X) = P(X|Y)P(Y) / P(X)
- P(Y) is the prior, P(X) is the evidence, P(X|Y) is the likelihood, and P(Y|X) is the posterior
- The aim is to get the posterior P(Y|X) for certain evidence X
- Steps for solving classification problems are:
- Model P(x|y) and P(y)
- Use Bayes' rule to get P(y|x)
- Find argmax_y P(y|x) = argmax_y P(x|y)P(y) / P(x) ≈ argmax_y P(x|y)P(y)
- Given training data (x₁, y₁), ..., (xₙ, yₙ) where xᵢ ∈ ℝᵈ and yᵢ ∈ Y, it is possible to learn a classification function f: ℝᵈ → Y, where Y = {+1, −1}
- With training data (xᵢ, yᵢ), where xᵢ is a feature vector and yᵢ is a discrete label, each xᵢ has d features and there are n samples; a new sample x_new then comes in with features (a₁, a₂, ..., a_d)
- The label y_new of this new sample is found using the formula: y_new = argmax_{y∈Y} P(a₁, a₂, ..., a_d|y) P(y) / P(a₁, a₂, ..., a_d)
- The label (Positive or Negative) is the one with the max value for P(y|a₁, a₂, ..., a_d); the denominator is the same for a given sample x_new whatever the label, so it can be ignored to get y_new = argmax_{y∈Y} P(a₁, a₂, ..., a_d|y) P(y)
- To evaluate these terms, P(y) is obtained easily from the training data
- The first term denotes the probability of a specific combination of features (a₁, a₂, ..., a_d) given a particular label y; to evaluate it, a simplifying assumption can be made that each feature is independent of the others given the class
- The probability of each individual feature given a specific label is then used, so we find P(a₁|y), P(a₂|y), ..., P(a_d|y)
- This untrue assumption of independence is why the method is referred to as 'naïve'
- Based on the assumption of independence of the features, the Naive Bayes classifier formulation is: y_new = argmax_{y∈Y} P(y) ∏ⱼ P(aⱼ|y)
Steps of the method
- Estimate P(y) for each y by counting the number of occurrences of the label
- Estimate P(aj|y) for all j = 1 to d
- Use the preceding equation to get y_new (a minimal implementation sketch follows)
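Putting these steps together, the following is a minimal sketch of a count-based Naive Bayes classifier for categorical features; the function names, data structures, and toy data are illustrative assumptions, not taken from the source, and Laplace smoothing (discussed next) is deliberately omitted.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Steps 1 and 2: estimate P(y) and P(a_j | y) by counting occurrences in the training data."""
    n = len(y)
    label_counts = Counter(y)
    priors = {label: count / n for label, count in label_counts.items()}
    # cond_counts[label][j][value] = number of samples with this label whose feature j equals value
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in zip(X, y):
        for j, value in enumerate(features):
            cond_counts[label][j][value] += 1
    return priors, cond_counts, label_counts

def predict(x_new, priors, cond_counts, label_counts):
    """Step 3: return argmax_y P(y) * prod_j P(a_j | y); the denominator P(a_1, ..., a_d) is ignored."""
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for j, value in enumerate(x_new):
            score *= cond_counts[label][j][value] / label_counts[label]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage: features are (weather, temperature), labels are +1 / -1.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["+1", "+1", "-1", "-1"]
model = train_naive_bayes(X, y)
print(predict(("rainy", "mild"), *model))  # "-1"
```

Note that a feature value never seen with a given label makes the whole product zero, which is exactly the problem Laplace smoothing addresses.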
Laplace Smoothing
- One problem in the method is that any probability term could be zero, so the whole product becomes zero
- To avoid this, smoothing is used, such as 'add-one' Laplace smoothing, where 1 is added to each count in the numerator and the number of possible values of the feature is added to the count in the denominator
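As a sketch, the smoothed estimate could replace the raw ratio used in the predict function above; the helper name and arguments are illustrative, and n_values_j stands for the number of possible values of feature j.

```python
def smoothed_conditional(value, j, label, cond_counts, label_counts, n_values_j):
    """Add-one Laplace smoothing: (count(a_j = value, y = label) + 1) / (count(y = label) + n_values_j)."""
    return (cond_counts[label][j][value] + 1) / (label_counts[label] + n_values_j)
```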
Decision Trees
- Decision trees are important algorithms widely used in ML and statistics with names such as ID3, C4.5, CART, and Regression trees, working on the concept of information gain
- Classification and regression are possible
- It is easy to visualize because it is like a flow chart
- Many decisions are taken along the path, but a decision tree is expected to make a single classification decision.
- ML uses quantitative criteria to drive this decision-making process
- ML problems usually have many features; the correct one to use at each node of the tree is evaluated using the concept of information gain or the Gini index, and decision trees are built in a greedy, top-down manner
Node Purity
- The main point in decision trees is to choose the feature at each node
- In an example with 60 students choosing between hardware and software jobs, the root node split is based on the hardware vs software feature, giving a split of 20:40 for the data
- This node is not 'pure' but has a purity of 66.67% because that percentage of the data is on one side, with the remainder on the other side
- In the example with 60 students, if they all chose a software job, the node would be 100% pure because the split caused all data to belong to one class only
- If all 100 students of ECE wanted jobs as opposed to wanting to go for higher studies, the split caused all data to be on one side of the node, so it is said to be 100% pure
Information Gain at each Node
- The splitting criterion for each node in the ID3 and C4.5 decision tree methods is information gain
- Information gain is based on entropy, a concept well known in communications
- For a binary class with P₊ and P₋ as the probabilities, the definition of entropy for a set of samples S is Entropy(S) = −P₊log₂P₊ − P₋log₂P₋
- For C classes, Entropy(S) = Σᵢ −Pᵢlog₂Pᵢ
- Maximum entropy occurs when P₊ = P₋; this happens when there is 100% impurity for the node
- If all the data belongs to one class, the probability of that class is 1; then entropy = −1·log₂1 = 0
- Information gain is a measure of how much information a feature provides about a class; the order of features used at the nodes of a decision tree is determined using this measure
- The root node is named the parent node, and the subnodes are child nodes
- Information gain = E_parent − E_children
- The information gain is the difference between the parent and children entropies; there may be many children, so E_children is the weighted average entropy of the children
- Information gain measures how much a given feature will reduce entropy when splitting the examples
- Gain(S, A) = Entropy(S) − Σ_{v∈values(A)} (|S_v|/|S|) Entropy(S_v), where S = the samples in the set and Gain(S, A) = the information gain of set S for attribute (feature) A
- Also, Entropy(S) = the entropy of the parent node and |S_v|/|S| = the proportion of samples in the specific child node
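To make these definitions concrete, here is a minimal sketch of entropy and information gain for a categorical attribute; the function names and toy data are illustrative, not from the source.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -P_i * log2(P_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute_index):
    """Gain(S, A) = Entropy(S) - sum over values v of (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    children = {}
    for sample, label in zip(samples, labels):
        children.setdefault(sample[attribute_index], []).append(label)
    weighted_children = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - weighted_children

# Toy usage: split on feature 0 ("hardware" / "software") for a +/- label.
samples = [("software",), ("software",), ("hardware",), ("software",)]
labels = ["+", "+", "-", "-"]
print(information_gain(samples, labels, 0))  # about 0.31 bits
```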
Gini Index
- This index decides the splitting of data at the nodes of a decision tree
- It does not involve a 'logarithm' term, so there is less computation, which is why it is often preferred
- Used in CART (Classification and Regression Trees) and in many Python toolkits (scikit-learn, for example)
- The Gini index is calculated by summing the squares of the probabilities of each class and subtracting this sum from 1: Gini = 1 − Σᵢ Pᵢ²
- When node impurity is maximum, with P₊ = P₋ (each equal to 0.5), entropy = 1, whereas Gini = 0.5
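A small sketch of the Gini computation matching the formula above (the helper name and toy labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - sum over classes of P_i^2; 0 for a pure node, 0.5 for a 50/50 binary split."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["+", "-", "+", "-"]))  # 0.5, maximum impurity for two classes
print(gini(["+", "+", "+", "+"]))  # 0.0, a pure node
```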
Ensemble Methods
- 'Ensemble' means 'a group of items viewed as a whole rather than individually'
- The intuition is that decisions made collectively by a large group are better than those made by an individual, even if considered an expert
- Ensemble methods are also used in machine learning to avoid the risk of relying on the result of a poorly performing ML system
- Combine the results of many ML systems to get a final prediction that is likely to be better
How Results are Combined
- With classifiers, the solution is to take a majority vote, in which each model's prediction is considered a 'vote' and the prediction made by the majority of the models is the final prediction; in some methods, each classifier is assigned a different weight, so those with higher weights have a higher impact
- For numeric outputs, the results are combined by calculating the average, much like asking several people to rate something on a scale and taking the mean
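A minimal sketch of these two combination rules as plain functions over lists of model predictions (names and example values are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Combine numeric predictions, giving more influence to higher-weighted models."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

print(majority_vote(["+1", "-1", "+1"]))             # "+1"
print(weighted_average([2.0, 4.0, 6.0], [1, 1, 2]))  # 4.5
```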
Types of Ensemble Methods
- Many versions of the same classifier are modeled for the same dataset, which is manipulated in such a way that it seems a different dataset is available to each classifier; the most popular base classifier is the decision tree
- Each tree is called a "weak learner" or "base learner"
- The ensemble algorithm strengthens the weak learners in such a way that better results are obtained
Sequential Methods
- In the first iteration, a number of weak learners are used
- In the next iteration, these learners are strengthened and the training data is selectively “weighted” based on the misclassified examples of the previous iteration
- AdaBoost is a sequential method that produces spectacular results
- Gradient Boosting and XGBoost are other sequential ensemble methods
Parallel Methods
- Weak models are generated in parallel
- The data is manipulated to give the illusion of independence between models; the 'Random Forest' method is a popular example
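A minimal sketch of how the data can be 'manipulated' for a parallel ensemble: each model is trained on a bootstrap sample (drawn with replacement), and, Random Forest-style, only a random subset of features is considered at each split; names and sizes are illustrative.

```python
import random

def bootstrap_sample(X, y):
    """Draw n rows with replacement, so each model sees a slightly different dataset."""
    indices = [random.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in indices], [y[i] for i in indices]

def random_feature_subset(n_features, k):
    """Random Forest-style: pick k of the d features to consider at a node split."""
    return random.sample(range(n_features), k)

X = [[0.1, 1.2, 3.4], [0.5, 0.7, 2.2], [1.1, 0.2, 0.9], [0.9, 1.5, 1.8]]
y = [0, 1, 0, 1]
print(bootstrap_sample(X, y))
print(random_feature_subset(n_features=3, k=2))
```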
AdaBoost
- In this method, boosting re-weights the data and the weak learners after every iteration
- A weak learner performs better than random guessing but is still poor on its own
- A decision stump (a tree with a single decision node and two leaves) is used as the base (weak) learner, which means only binary classification can be done; the algorithm uses a large set of stumps called a 'forest of decision stumps'
- Actions performed:
- Weighting of data samples
- Weighting of the classifiers
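A self-contained sketch of these two weighting steps using decision stumps; all names and the toy data are illustrative, and the update shown follows the standard AdaBoost formulation rather than any specific variant from the source.

```python
import math

def train_stump(X, y, weights):
    """Find the best single-feature threshold split (a decision stump) under the sample weights."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({x[j] for x in X}):
            for sign in (+1, -1):
                preds = [sign if x[j] <= threshold else -sign for x in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or error < best[0]:
                    best = (error, j, threshold, sign)
    return best

def adaboost(X, y, rounds=10):
    """AdaBoost sketch: re-weight misclassified samples and weight each stump by its accuracy."""
    n = len(y)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump)
    for _ in range(rounds):
        error, j, threshold, sign = train_stump(X, y, weights)
        error = max(error, 1e-10)  # avoid division by zero when a stump makes 0 errors
        alpha = 0.5 * math.log((1.0 - error) / error)  # classifier weight
        preds = [sign if x[j] <= threshold else -sign for x in X]
        # Increase the weight of misclassified samples, decrease the others, then normalize.
        weights = [w * math.exp(-alpha * p * t) for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, (j, threshold, sign)))
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (sign if x[j] <= threshold else -sign)
                for alpha, (j, threshold, sign) in ensemble)
    return 1 if score >= 0 else -1

# Toy usage with labels in {+1, -1}.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1, 1, -1, -1]
model = adaboost(X, y, rounds=5)
print([predict(model, x) for x in X])  # [1, 1, -1, -1]
```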