

Course: CSCE 5215 Machine Learning
Professor: Zeenat Tariq
Activity 7

What is a Naive Bayes Classifier?

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other.

Use of Naive Bayes

The Naive Bayes algorithm is used for classification problems and is widely used in text classification. In text classification tasks the data is high dimensional, since each word represents one feature. It is used in spam filtering, sentiment detection, rating classification, etc. The main advantage of Naive Bayes is its speed: it is fast to train, and making predictions is easy even with high-dimensional data. The model predicts the probability that an instance belongs to a class given a set of feature values, so it is a probabilistic classifier. It is called "naive" because it assumes that each feature in the model is independent of the existence of every other feature; in other words, each feature contributes to the prediction with no relation to the others. In the real world this condition is rarely satisfied. The algorithm uses Bayes' theorem for both training and prediction.

```python
path = 'https://images.prismic.io/turing/65a5400a7a5e8b1120d58951_real_world_applications_th
from IPython.display import Image
Image(url=path)
```

Types of Naive Bayes

The three types of Naive Bayes are as follows (a minimal usage sketch of all three appears after the list):

- Gaussian Naive Bayes: used for classifying continuous data that has a normal (Gaussian) distribution. It is fitted by finding the mean and standard deviation of each feature within each class.
- Multinomial Naive Bayes: a specialized version of Naive Bayes designed for text documents. It models word counts and adjusts its calculations to deal with them.
- Bernoulli Naive Bayes: based on the Bernoulli distribution, it accepts only binary feature values, i.e., 0 or 1.
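All three variants share the same scikit-learn interface. The snippet below is a minimal sketch of how each one would be instantiated and fitted; the tiny arrays here are made up purely for illustration and are not part of the activity's datasets.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Tiny illustrative dataset: 4 samples, 3 features, 2 classes
X_cont = np.array([[1.2, 0.7, 3.1],    # continuous values  -> GaussianNB
                   [0.9, 0.8, 2.9],
                   [4.5, 3.2, 0.1],
                   [4.7, 3.0, 0.2]])
X_counts = np.array([[3, 0, 1],        # word counts        -> MultinomialNB
                     [2, 1, 0],
                     [0, 4, 2],
                     [0, 3, 3]])
X_binary = (X_counts > 0).astype(int)  # presence/absence   -> BernoulliNB
y = np.array([0, 0, 1, 1])

for name, model, X in [('Gaussian', GaussianNB(), X_cont),
                       ('Multinomial', MultinomialNB(), X_counts),
                       ('Bernoulli', BernoulliNB(), X_binary)]:
    model.fit(X, y)
    print(name, model.predict(X))
```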
""" # Bayes' Theorem formula posterior_A_given_B = (likelihood_B_given_A * prior_A) / likelihood_B return posterior_A_given_B # Example values prior_A = 0.01 # P(A): Prior probability, for example, the probability of havi likelihood_B_given_A = 0.9 # P(B|A): Likelihood, the probability of a positive test resul likelihood_B = 0.1 # P(B): Total probability of a positive test result (including # Calculate the posterior probability posterior_A_given_B = bayes_theorem(prior_A, likelihood_B_given_A, likelihood_B) # Print the result print(f"Posterior Probability P(A|B): {posterior_A_given_B:.5f}") Posterior Probability P(A|B): 0.09000 keyboard_arrow_down Implementation of Gaussian Naive Bayes Classifier What is gaussian naive bayes ? Gaussian Naive Bayes (GNB) is a variant of the Naive Bayes classifier, which is particularly suited for datasets where the features are continuous and are assumed to follow a Gaussian (normal) distribution. Unlike Multinomial Naive Bayes, which is used for discrete data (like word counts in text classification), Gaussian Naive Bayes is ideal for real-valued data https://colab.research.google.com/drive/1X3YEpGX_SsjwY2zpTKaWZntDUlzMfDFV#printMode=true 4/19 10/29/24, 10:11 AM Activity_7 (1).ipynb - Colab Continuos data Continuous data refers to a type of quantitative data that can take any value within a given range. This means that the data can be measured on a continuum or scale and can have an infinite number of possible values. Continuous data is typically associated with measurements and can include fractions and decimals. keyboard_arrow_down Data Preprocessing Basic preprocessing steps: Import libraries Import the dataset Split the data into train and test Let us implement code on iris dataset: import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.metrics import accuracy_score, classification_report iris = sns.load_dataset('iris') print(iris.head()) X = iris.drop(columns='species') # Features: SepalLength, SepalWidth, PetalLength, PetalWid y = iris['species'] # Target: Species (setosa, versicolor, virginica) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa keyboard_arrow_down Train and predict the model Let us train the model on gaussian classifier. The model can be trained as follows: https://colab.research.google.com/drive/1X3YEpGX_SsjwY2zpTKaWZntDUlzMfDFV#printMode=true 5/19 10/29/24, 10:11 AM Activity_7 (1).ipynb - Colab gnb = GaussianNB() var_smoothing : It is a smoothing parameter that accounts for numerical stability by adding a small value to the variance. This avoids division by zero in the likelihood calculation for Gaussian Naive Bayes. 
Task 1

Implement Gaussian Naive Bayes on the wine dataset to analyze the quality.

Implementation of Multinomial Naive Bayes

What is Multinomial Naive Bayes?

Multinomial Naive Bayes is a variant of the Naive Bayes classification algorithm designed for classification tasks involving discrete features; it is particularly suited to text classification and natural language processing tasks. The model applies Bayes' theorem with the multinomial distribution, which is used to model count or frequency data.

Discrete Data

Discrete data can take on only specific, separate values and cannot be divided into smaller increments. This type of data is countable and typically involves integers.

Data Preprocessing

Basic preprocessing steps:

- Import libraries
- Import the dataset
- Perform vectorization
- Split the data into train and test sets

Let us implement the code on a small text dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

data = {
    'text': [
        'I love programming in Python',
        'Python is great for data science',
        'I enjoy learning new programming languages',
        'The weather is nice today',
        'Today is a beautiful day',
        'I hate being stuck in traffic',
        'Traffic jams are frustrating',
        'I love sunny days',
        'The movie was boring',
        'I dislike horror movies'
    ],
    'label': ['positive', 'positive', 'positive', 'neutral', 'neutral',
              'negative', 'negative', 'positive', 'negative', 'negative']
}

df = pd.DataFrame(data)
X = df['text']   # Features (text data)
y = df['label']  # Target variable (labels)
print(df)
```

Output:
```
                                          text     label
0                 I love programming in Python  positive
1             Python is great for data science  positive
2   I enjoy learning new programming languages  positive
3                    The weather is nice today   neutral
4                     Today is a beautiful day   neutral
5                I hate being stuck in traffic  negative
6                  Traffic jams are frustrating  negative
7                             I love sunny days  positive
8                          The movie was boring  negative
9                       I dislike horror movies  negative
```

Count Vectorizer

Count Vectorizer is a feature extraction method used in natural language processing (NLP) to convert a collection of text documents into a matrix of token counts. It is a fundamental step in preparing text data for machine learning algorithms, particularly in text classification tasks.

```python
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.3, random_state=42)
```
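To see what the vectorizer actually produced, it can help to print the learned vocabulary and the resulting count matrix. This is a small optional sketch; get_feature_names_out() is the method for listing the vocabulary in recent scikit-learn versions (older versions use get_feature_names()).

```python
# Inspect the vocabulary learned from the corpus and the token-count matrix
print(vectorizer.get_feature_names_out())   # one column per unique word
print(X_vectorized.toarray()[:3])           # counts for the first three documents
```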
Task 2

Apply count vectorization on the data below:

```python
documents = [
    "The cat sat on the chair",
    "The dog chased the car",
    "The cat chased the mouse"
]
```

Visualize text data

Word Cloud

This visualization represents the frequency of words in a corpus. Common words appear larger and bolder.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_text = ' '.join(df['text'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.title('Word Cloud of Text Data', fontsize=20)
plt.show()
```

Task 3

Visualize the text data from the newsgroup dataset in seaborn.

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='train', categories=['comp.graphics', 'comp.os.ms-windows.misc
```

Train and predict the model

Let us train the model with the multinomial classifier. The model can be trained as follows:

```python
mnb = MultinomialNB()
```

- alpha: the additive smoothing parameter, used to prevent zero probabilities in the likelihood estimation for features that are not present in the training data.
- fit_prior: determines whether to learn class priors from the training data or to use uniform class priors.

The model can be analyzed with the following metrics:

- Accuracy
- Precision
- Recall
- F1-score

```python
mnb = MultinomialNB(alpha=0.5, fit_prior=True)
mnb.fit(X_train, y_train)

y_pred = mnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```

Output:
```
Accuracy: 0.8889
              precision    recall  f1-score   support

           0       0.90      1.00      0.95        19
           1       0.95      0.86      0.90        21
           2       0.79      0.79      0.79        14

    accuracy                           0.89        54
   macro avg       0.88      0.88      0.88        54
weighted avg       0.89      0.89      0.89        54
```

Task 4

On the same dataset, implement the Multinomial Naive Bayes classifier.

Implementation of Bernoulli Naive Bayes

What is Bernoulli Naive Bayes?

Bernoulli Naive Bayes is a variation of the Naive Bayes classification algorithm designed for binary and discrete data. It is particularly effective for text classification tasks where the features (typically words) are treated as binary values indicating whether a word is present or absent in a document.

Let us implement Bernoulli Naive Bayes on binary data.

Data preprocessing

Here we create a synthetic binary dataset, following the steps below.

Binarization: since Bernoulli Naive Bayes works with binary data, the continuous features must be binarized. One option is the Binarizer with the median of each feature as the threshold, so that values greater than the median are converted to 1 and values less than or equal to the median are converted to 0. (A minimal median-threshold sketch follows; the synthetic-data code further below uses a simpler fixed threshold of 0.)
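The sketch below illustrates the median-threshold idea described above; the small random array is made up for illustration only. Note that scikit-learn's Binarizer accepts only a single scalar threshold, so one way to threshold each feature at its own median is to subtract the per-feature medians first and then binarize at 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Small illustrative array of continuous features (5 samples, 3 features)
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(5, 3))

# Subtract each feature's median, then threshold the centered values at 0:
# value > median  ->  1, otherwise 0.
medians = np.median(X_cont, axis=0)
X_median_binarized = Binarizer(threshold=0.0).fit_transform(X_cont - medians)

print(X_median_binarized)
```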
Basic preprocessing steps:

- Import libraries
- Import the dataset
- Split the data into train and test sets

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Create a synthetic binary dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
                           n_clusters_per_class=1, random_state=42, n_classes=2)

# Convert the dataset to binary (0s and 1s)
X_binary = (X > 0).astype(int)  # Convert features to binary

df = pd.DataFrame(X_binary, columns=[f'feature_{i+1}' for i in range(X_binary.shape[1])])
df['target'] = y
print(df.head())

# Split the data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X_binary, y, test_size=0.3, random_state=42)

# Check the size of the splits
print(f'Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}')
```

Output:
```
   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8  feature_9  feature_10  target
0          0          1          0          1          1          1          0          1          1           1       0
1          0          0          1          1          0          1          1          1          1           1       0
2          0          0          0          0          1          0          0          0          0           1       1
3          0          0          1          0          1          1          0          1          0           0       1
4          0          0          1          0          1          1          0          1          0           1       1

Training samples: 700, Testing samples: 300
```

Task 5

Apply preprocessing on the breast cancer dataset and binarize the data as well.

Train and predict the model

The model can be trained as follows:

```python
model = BernoulliNB()
```

- alpha: similar to alpha in Multinomial Naive Bayes, it controls the smoothing of probabilities to handle unseen features.
- binarize: threshold for converting features into binary (0 or 1). Feature values above this threshold are treated as 1 (present), otherwise as 0 (absent).
- fit_prior: determines whether to learn class priors from the training data or assume uniform class priors.

The model can be analyzed with the following metrics:

- Accuracy
- Precision
- Recall
- F1-score

```python
model = BernoulliNB(alpha=0.5, binarize=0.1)

# Train the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print(classification_report(y_test, y_pred))
```

Output:
```
Accuracy: 0.9100
              precision    recall  f1-score   support

           0       0.90      0.92      0.91       148
           1       0.92      0.90      0.91       152

    accuracy                           0.91       300
   macro avg       0.91      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300
```

Task 6

Implement Bernoulli Naive Bayes on the breast cancer dataset.

Zero probability

Zero probability in Naive Bayes refers to the problem that occurs when a feature in the test data (or any input data) has not been observed in the training data for a particular class. This leads to a zero probability in the likelihood computation, which can cause the entire probability for that class to become zero.

How can we solve this?

Zero probability can be solved using Laplace smoothing. Laplace smoothing, also known as additive smoothing, is a technique used to handle the zero-probability problem in probabilistic models like Naive Bayes. In such models, if a word or feature doesn't appear in the training data for a specific class, its probability becomes zero for that class, which can be problematic when making predictions. A small worked sketch of the smoothing formula follows.
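As a rough illustration of how additive smoothing works (the word list and toy counts below are made up), the smoothed likelihood of a word w in a class c is (count(w, c) + alpha) / (total word count in c + alpha × vocabulary size), so unseen words keep a small non-zero probability:

```python
import numpy as np

alpha = 1.0                                # Laplace smoothing parameter
vocab = ['graphics', 'render', 'orbit', 'launch']
counts_in_class = np.array([3, 2, 0, 0])   # toy word counts for one class

# Unsmoothed estimates: unseen words get probability exactly 0
unsmoothed = counts_in_class / counts_in_class.sum()

# Laplace-smoothed estimates: every word keeps a small non-zero probability
smoothed = (counts_in_class + alpha) / (counts_in_class.sum() + alpha * len(vocab))

for w, p0, p1 in zip(vocab, unsmoothed, smoothed):
    print(f"{w:10s}  unsmoothed={p0:.3f}  smoothed={p1:.3f}")
```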
Let us observe this on the newsgroup dataset, using the two groups sci.space and comp.graphics.

Data Preprocessing

Let us follow the basic preprocessing steps:

- Import libraries
- Import the dataset
- Apply the count vectorizer
- Split the data into train and test sets

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

categories = ['comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers'))

X_data = newsgroups.data
y_labels = newsgroups.target
print("Target names:", newsgroups.target_names)

vectorizer = CountVectorizer(stop_words='english', binary=False)
X_features = vectorizer.fit_transform(X_data)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.3, random_state=42)

print(f"Training set has {X_train.shape[0]} samples and {X_train.shape[1]} features.")
```

Output:
```
Target names: ['comp.graphics', 'sci.space']
Training set has 823 samples and 19198 features.
```

Train the data with no smoothing

To train Multinomial Naive Bayes with no Laplace smoothing, set alpha = 0:

```python
nb_no_smoothing = MultinomialNB(alpha=0.0)
nb_no_smoothing.fit(X_train, y_train)

y_pred_no_smoothing = nb_no_smoothing.predict(X_test)

accuracy_no_smoothing = accuracy_score(y_test, y_pred_no_smoothing)
print(f"Accuracy without Laplace Smoothing: {accuracy_no_smoothing:.4f}")
```

Output:
```
Accuracy without Laplace Smoothing: 0.5226
/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py:890: RuntimeWarning: divide by zero encountered in log
  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(
```

If any words in the test sentence were unseen during training for a particular class, the model may output a zero probability for that class, potentially leading to bias in favor of the class with more familiar words. This illustrates the zero-probability issue that can occur in Naive Bayes models when smoothing techniques are not applied.

Let us test a sample sentence:

```python
test_sentence = ["space exploration in graphics and 3D modeling"]
X_test_example = vectorizer.transform(test_sentence)

pred_no_smoothing = nb_no_smoothing.predict(X_test_example)
probs_no_smoothing = nb_no_smoothing.predict_proba(X_test_example)

print(f"Predicted Class (No Smoothing): {newsgroups.target_names[pred_no_smoothing[0]]}")
print(f"Class Probabilities (No Smoothing): {probs_no_smoothing}")
```

Output:
```
Predicted Class (No Smoothing): sci.space
Class Probabilities (No Smoothing): [[0. 1.]]
```

Task 7

Implement the no-smoothing model on the data below.
Record your observations.

```python
test_sentence = ["The latest advancements in autonomous vehicles and space exploration"]
```

Train the data with smoothing

To train Multinomial Naive Bayes with Laplace smoothing, set alpha = 1:

```python
nb_with_smoothing = MultinomialNB(alpha=1.0)
nb_with_smoothing.fit(X_train, y_train)

y_pred_smoothing = nb_with_smoothing.predict(X_test)

accuracy_smoothing = accuracy_score(y_test, y_pred_smoothing)
print(f"Accuracy with Laplace Smoothing: {accuracy_smoothing:.4f}")
```

With smoothing applied, the model mitigates the zero-probability issue: even if certain words from the test sentence were not present in the training data for a particular class, they are still assigned a small non-zero probability, allowing for a more robust prediction. This approach enhances the model's ability to generalize to unseen data by reducing bias toward known words.

Let us test on the sample sentence:

```python
test_sentence = ["space exploration in graphics and 3D modeling"]
X_test_example = vectorizer.transform(test_sentence)

pred_with_smoothing = nb_with_smoothing.predict(X_test_example)
probs_with_smoothing = nb_with_smoothing.predict_proba(X_test_example)

print(f"Predicted Class (With Smoothing): {newsgroups.target_names[pred_with_smoothing[0]]}")
print(f"Class Probabilities (With Smoothing): {probs_with_smoothing}")
```

Output:
```
Predicted Class (With Smoothing): comp.graphics
Class Probabilities (With Smoothing): [[0.99474133 0.00525867]]
```

Task 8

Implement the smoothing model on the data below. Record your observations.

```python
test_sentence = ["The latest advancements in autonomous vehicles and space exploration"]
```

Task 9

Implement Multinomial Naive Bayes with smoothing and without smoothing on the data below:

```python
data = {
    'text': [
        "Congratulations! You've won a free ticket to Bahamas!",
        "Important information regarding your account",
        "Free entry in 2 a weekly competition",
        "Click this link to claim your prize",
        "Your appointment is scheduled for tomorrow",
        "You have received a warning from the bank",
        "Get paid to work from home",
        "Meeting at 10 AM tomorrow",
        "Exclusive offer just for you",
        "Your loan application has been approved"
    ],
    'label': [
        "spam", "ham", "spam", "spam", "ham",
        "ham", "spam", "ham", "spam", "ham"
    ]
}
```

Practice

On the iris dataset, follow the steps below:

- Import the libraries
- Preprocess the data
- Train the data on Gaussian NB
- Train the model on Multinomial NB
- Train the model on Bernoulli NB
- Evaluate the models

```python
#1
import ?? as sns
import pandas as ??
from sklearn.model_selection import ??
from ??.naive_bayes import GaussianNB, MultinomialNB, ??
from sklearn.preprocessing import ??

#2
iris = sns.??('??')
print(iris.??())

X = ??.drop(columns='??')
?? = iris['??']

X_train, ??, y_train, y_test = ??(X, y, test_size=0.3, ??=42)
```
