# Bernoulli Naive Bayes

The Bernoulli Naive Bayes classifier is suitable for discrete data with binary features. The difference from multinomial naive Bayes is that multinomial naive Bayes works with occurrence counts, while Bernoulli naive Bayes is designed for binary/boolean features.

## Example

Let's imagine that we have the following dataset:

|            | Word 1 | Word 2 | Word 3 | Class    |
| ---------- | ------ | ------ | ------ | -------- |
| Document 1 | 1      | 0      | 1      | Spam     |
| Document 2 | 0      | 1      | 0      | Not Spam |
| Document 3 | 1      | 0      | 0      | Spam     |
| Document 4 | 0      | 1      | 0      | Not Spam |

Here, 1 means that the word appears in the document and 0 means that it doesn't.

## Math

Bernoulli Naive Bayes applies Bayes’ Theorem:

$P(c \vert d) = \frac{P(d \vert c)P(c)}{P(d)}$

Where:

* $c$ is the class (Spam or Not Spam in our example).
* $d$ is the document (the features).

### Assumptions

Naive Bayes assumes that the features are conditionally independent given the class. In other words, the presence or absence of one word does not affect the presence or absence of another word, given the class.

### Likelihood

The likelihood $P(d \vert c)$ is calculated as:

$P(d \vert c) = \prod_{i=1}^{n} P(w_i \vert c)^{d_i} (1 - P(w_i \vert c))^{(1-d_i)}$

Here:

* $w_i$ is the $i$-th word, and $n$ is the total number of features (words in the vocabulary).
* $d_i$ is 1 if the word appears in the document, and 0 if it doesn't.
* $P(w_i \vert c)$ is the probability of word $i$ appearing in a document of class $c$.

### Calculating $P(w_i \vert c)$

$P(w_i \vert c) = \frac{N_{ci} + \alpha}{N_c + 2\alpha}$

Where:

* $N_{ci}$ is the number of documents of class $c$ in which word $i$ appears.
* $N_c$ is the total number of documents of class $c$.
* $\alpha$ is a smoothing parameter (Laplace smoothing); the factor of 2 in the denominator accounts for the two possible values of a binary feature (present or absent).

## Pros and Cons

### Pros

* Simple and easy to implement.
* Can work well with binary data.

### Cons

* The assumption of feature independence may not hold in real-world scenarios.
* If a feature value does not occur in the training data for a particular class, the conditional probability will be zero, which can cause issues. Smoothing techniques like Laplace smoothing can help with this.
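## Worked Example

As a sketch of how the formulas above combine, take $\alpha = 1$ and suppose we want to classify a new document that contains Word 1 and Word 3 but not Word 2, i.e. $d = (1, 0, 1)$ (this query document is an assumed example for illustration, not part of the original dataset):

* Priors: $P(\text{Spam}) = P(\text{Not Spam}) = 2/4 = 0.5$.
* Smoothed conditionals for Spam ($N_c = 2$): $P(w_1 \vert \text{Spam}) = \frac{2+1}{2+2} = \frac{3}{4}$, $P(w_2 \vert \text{Spam}) = \frac{1}{4}$, $P(w_3 \vert \text{Spam}) = \frac{2}{4} = \frac{1}{2}$.
* Smoothed conditionals for Not Spam: $P(w_1 \vert \text{Not Spam}) = \frac{1}{4}$, $P(w_2 \vert \text{Not Spam}) = \frac{3}{4}$, $P(w_3 \vert \text{Not Spam}) = \frac{1}{4}$.
* Likelihoods: $P(d \vert \text{Spam}) = \frac{3}{4} \cdot \left(1 - \frac{1}{4}\right) \cdot \frac{1}{2} = \frac{9}{32}$ and $P(d \vert \text{Not Spam}) = \frac{1}{4} \cdot \left(1 - \frac{3}{4}\right) \cdot \frac{1}{4} = \frac{1}{64}$.
* Unnormalized posteriors: $0.5 \cdot \frac{9}{32} = \frac{9}{64}$ vs. $0.5 \cdot \frac{1}{64} = \frac{1}{128}$; normalizing gives $P(\text{Spam} \vert d) = \frac{18}{19} \approx 0.95$, so the document is classified as Spam.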
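The same calculation can be expressed in a few lines of code. Below is a minimal from-scratch sketch in Python/NumPy that mirrors the smoothed estimates above; the `predict` helper and the query vector are illustrative choices, not from the original note.

```python
import numpy as np

# The example dataset from the note: rows are documents, columns are
# Word 1..Word 3; 1 = word present, 0 = absent.
X = np.array([
    [1, 0, 1],   # Document 1 -> Spam
    [0, 1, 0],   # Document 2 -> Not Spam
    [1, 0, 0],   # Document 3 -> Spam
    [0, 1, 0],   # Document 4 -> Not Spam
])
y = np.array(["Spam", "Not Spam", "Spam", "Not Spam"])

alpha = 1.0  # Laplace smoothing parameter
classes = np.unique(y)

# Class priors P(c) and smoothed conditionals P(w_i | c)
priors, cond = {}, {}
for c in classes:
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)
    # (class-c documents containing word i + alpha) / (class-c documents + 2*alpha)
    cond[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)

def predict(d):
    """Return the class with the highest posterior for a binary vector d."""
    best_class, best_log_post = None, -np.inf
    for c in classes:
        # Bernoulli likelihood prod_i P(w_i|c)^d_i (1 - P(w_i|c))^(1-d_i),
        # evaluated in log space for numerical stability.
        log_lik = np.sum(d * np.log(cond[c]) + (1 - d) * np.log(1 - cond[c]))
        log_post = np.log(priors[c]) + log_lik
        if log_post > best_log_post:
            best_class, best_log_post = c, log_post
    return best_class

print(predict(np.array([1, 0, 1])))  # document with Word 1 and Word 3 -> "Spam"
```

For practical use, scikit-learn's `sklearn.naive_bayes.BernoulliNB(alpha=1.0)` provides the same estimator; the hand-rolled version above is only meant to make the formulas in this note concrete.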