# Lecture 4: Bayesian Classification

## Bayes Decision Theory

### Minimizing the misclassification rate
- Let $\omega_i$ represent the $i$th state of nature (or class).
- Let $P(\omega_i)$ be the a priori probability that nature is in state $\omega_i$.
  - $P(\omega_i)$ reflects any prior knowledge of how likely you are to encounter an object from class $\omega_i$.
  - E.g. $P(\text{apple}) = 0.7$, $P(\text{orange}) = 0.3$.
- Let $p(x|\omega_i)$ be the class-conditional probability density function.
  - Describes the probability density of observing the feature value $x$ given that the object is from class $\omega_i$.
  - If $\omega_i$ is "apple", $x$ might be "red": "How red are apples in general?"

### Bayes Formula
$$ P(\omega_i | x) = \frac{p(x|\omega_i) P(\omega_i)}{p(x)} $$
- $P(\omega_i | x)$ is the a posteriori probability.
  - Probability of the state of nature being $\omega_i$ given the feature value $x$.
  - "Probability that the object is an apple given that it is red."
- The evidence $p(x)$ is a normalizing factor:
$$ p(x) = \sum_i p(x|\omega_i) P(\omega_i) $$

### Bayes Decision Rule
- Choose $\omega_i$ if $P(\omega_i | x) > P(\omega_j | x)$ for all $j \neq i$.
- Equivalent to choosing $\omega_i$ if $p(x|\omega_i) P(\omega_i) > p(x|\omega_j) P(\omega_j)$ for all $j \neq i$.
- This decision rule minimizes the probability of misclassification.

### Example
- Medical diagnosis:
  - $\omega_1$: patient has cancer
  - $\omega_2$: patient does not have cancer
  - $x$: test result (e.g., positive or negative)
  - $P(\omega_1) = 0.01$ (prior probability of having cancer)
  - $P(\omega_2) = 0.99$ (prior probability of not having cancer)
  - $P(x = \text{positive} | \omega_1) = 0.8$ (sensitivity of the test)
  - $P(x = \text{positive} | \omega_2) = 0.1$ (false positive rate of the test)
$$ P(\omega_1 | x = \text{positive}) = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.1 \times 0.99} \approx 0.075 $$
$$ P(\omega_2 | x = \text{positive}) = \frac{0.1 \times 0.99}{0.8 \times 0.01 + 0.1 \times 0.99} \approx 0.925 $$
- Even with a positive test result, the probability of not having cancer is much higher, because the prior probability of having cancer is so low.

### Minimizing the expected loss
- Let $\lambda_{ij}$ be the loss incurred for choosing action $\alpha_i$ when the state of nature is $\omega_j$.
- Expected loss (conditional risk) for action $\alpha_i$ given $x$:
$$ R(\alpha_i | x) = \sum_j \lambda_{ij} P(\omega_j | x) $$
- Bayes decision rule: Choose the action $\alpha_i$ that minimizes $R(\alpha_i | x)$.

### Special case: minimizing the error rate
- $\lambda_{ij} = 0$ if $i = j$ (correct decision)
- $\lambda_{ij} = 1$ if $i \neq j$ (incorrect decision)
- Risk becomes:
$$ R(\alpha_i | x) = \sum_j \lambda_{ij} P(\omega_j | x) = \sum_{j \neq i} P(\omega_j | x) = 1 - P(\omega_i | x) $$
- Minimizing $R(\alpha_i | x)$ is equivalent to maximizing $P(\omega_i | x)$, which is the same as the Bayes decision rule for minimizing the misclassification rate (see the numerical sketch below).
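The short Python sketch below (not part of the lecture; the variable names are only illustrative) reproduces the medical-diagnosis numbers above and shows that minimizing the zero-one-loss risk picks the maximum-posterior class.

```python
# Minimal sketch: Bayes formula and zero-one-loss risk for the
# medical-diagnosis example above. All numbers come from that example.

priors = {"cancer": 0.01, "no cancer": 0.99}          # P(omega_i)
likelihood_pos = {"cancer": 0.8, "no cancer": 0.1}    # P(x = positive | omega_i)

# Evidence p(x = positive) = sum_i P(positive | omega_i) * P(omega_i)
evidence = sum(likelihood_pos[c] * priors[c] for c in priors)

# Posteriors P(omega_i | x = positive) via Bayes formula
posteriors = {c: likelihood_pos[c] * priors[c] / evidence for c in priors}
print(posteriors)  # {'cancer': ~0.075, 'no cancer': ~0.925}

# Conditional risk under zero-one loss: R(alpha_i | x) = 1 - P(omega_i | x),
# so the minimum-risk action is the maximum-posterior class.
risks = {c: 1.0 - posteriors[c] for c in posteriors}
decision = min(risks, key=risks.get)
print(decision)  # 'no cancer'
```

With a general loss matrix $\lambda_{ij}$, the last step would instead weight the posteriors by the losses; a large enough cost for missing a cancer case can flip the decision even though the posterior favors "no cancer".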
## Discriminant Functions
- A discriminant function $g_i(x)$ assigns a score to each class $\omega_i$ based on the feature value $x$.
- The decision rule is to choose the class $\omega_i$ with the highest discriminant function value.
- Examples of discriminant functions (all three yield the same decisions):
  - $g_i(x) = P(\omega_i | x)$
  - $g_i(x) = p(x | \omega_i) P(\omega_i)$
  - $g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)$

## Gaussian Classifier

### Univariate case
- Assume $p(x | \omega_i) \sim N(\mu_i, \sigma_i^2)$, where $\mu_i$ is the mean and $\sigma_i^2$ is the variance of the Gaussian distribution for class $\omega_i$.
$$ p(x | \omega_i) = \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp \left( -\frac{(x - \mu_i)^2}{2 \sigma_i^2} \right) $$
- Discriminant function:
$$ g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i) = -\frac{1}{2} \ln (2 \pi \sigma_i^2) - \frac{(x - \mu_i)^2}{2 \sigma_i^2} + \ln P(\omega_i) $$

### Multivariate case
- Assume $p(x | \omega_i) \sim N(\mu_i, \Sigma_i)$, where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of the $d$-dimensional Gaussian distribution for class $\omega_i$.
$$ p(x | \omega_i) = \frac{1}{(2 \pi)^{d/2} |\Sigma_i|^{1/2}} \exp \left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right) $$
- Discriminant function:
$$ g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i) = -\frac{d}{2} \ln (2 \pi) - \frac{1}{2} \ln |\Sigma_i| - \frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) $$

### Special Cases
1. $\Sigma_i = \sigma^2 I$ (covariance matrices are equal and proportional to the identity)
   - Dropping terms that do not depend on the class: $g_i(x) = -\frac{\|x - \mu_i\|^2}{2 \sigma^2} + \ln P(\omega_i)$
   - With equal priors, this amounts to choosing the class whose mean is closest to $x$ (minimum Euclidean distance classifier).
2. $\Sigma_i = \Sigma$ (covariance matrices are equal but not necessarily diagonal)
   - $g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)$
   - With equal priors, this amounts to choosing the class whose mean is closest to $x$ in terms of the Mahalanobis distance.
3. $\Sigma_i$ is arbitrary (covariance matrices differ across classes)
   - Use the general discriminant function.
   - The decision boundaries become quadratic and can be more complex.

## Naive Bayes Classifier
- Assumes that the features are conditionally independent given the class label:
$$ p(x | \omega_i) = \prod_{j=1}^d p(x_j | \omega_i) $$
- Discriminant function:
$$ g_i(x) = \sum_{j=1}^d \ln p(x_j | \omega_i) + \ln P(\omega_i) $$
- Simplifies the estimation of the class-conditional probability densities: each feature's distribution is estimated separately.
- Often used with discrete features (e.g., bag-of-words in text classification).
- Can perform surprisingly well even when the independence assumption is violated.
- A Gaussian version is sketched in the code below.
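To tie the Gaussian discriminant and the naive Bayes assumption together, here is a minimal NumPy sketch (not from the lecture; the function and variable names are my own) that estimates per-class priors, means, and covariances and classifies with $g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)$. Setting `naive=True` keeps only the diagonal of each covariance matrix, which corresponds to the conditional-independence assumption (a Gaussian naive Bayes).

```python
import numpy as np

def fit_gaussian_classifier(X, y, naive=False):
    """Estimate P(omega_i), mu_i and Sigma_i for each class from training data.

    If naive=True, off-diagonal covariance entries are dropped, i.e. the
    features are treated as conditionally independent given the class.
    """
    params = {}
    n = len(y)
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularized
        if naive:
            Sigma = np.diag(np.diag(Sigma))
        params[c] = (len(Xc) / n, mu, Sigma)  # (prior, mean, covariance)
    return params

def discriminant(x, prior, mu, Sigma):
    """g_i(x) = -d/2 ln(2 pi) - 1/2 ln|Sigma| - 1/2 (x-mu)^T Sigma^{-1} (x-mu) + ln P(omega_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)
            + np.log(prior))

def predict(params, x):
    """Choose the class with the largest discriminant value."""
    return max(params, key=lambda c: discriminant(x, *params[c]))

# Tiny usage example with made-up 2-D data for two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_classifier(X, y, naive=True)
print(predict(params, np.array([2.5, 2.8])))  # expected: 1
```

The small term added to each covariance matrix is only there to keep it invertible on tiny samples; with `naive=False` the same code implements the general (arbitrary $\Sigma_i$) case from the Special Cases above.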