Machine Learning for Business Applications Lecture PDF
Document Details
TUM
2024
Maximilian Schiffer
Summary
This document is a lecture on machine learning for business applications, focusing on Naïve Bayes and Bayesian networks. The lecture is part of the Winter Semester 2024/25 at the TUM School of Management.
Full Transcript
Machine Learning for Business Applications
Naïve Bayes & Bayesian Networks – Lecture A.1
Prof. Dr. Maximilian Schiffer
Professorship of Business Analytics & Intelligent Systems
TUM School of Management
Munich Data Science Institute
Technische Universität München
Winter Semester 2024/25

Agenda
▪ Recap Introduction to ML for BA
▪ Introduction to Probabilistic Models
▪ Bayes Theorem
▪ Naïve Bayes Classifier
▪ Bayesian (Belief) Networks
▪ Summary

Recap Introduction to ML for BA

Recap: Intro ML for BA
Key takeaways
▪ We discussed Machine Learning (ML) and its historical developments.
▪ We defined features, target variables, and different feature types.
▪ We formally defined the three basic ML problems: classification, clustering, and regression, additionally differentiating between supervised and unsupervised learning.
▪ Finally, we discussed essentials and training strategies in the realm of ML.

In today's lecture, we will recap basic probabilities, learn about conditional probabilities, and use these to classify new observations.

Introduction to Probabilistic Models

Probabilistic Models
Definition: Models that use probability distributions to predict outcomes and quantify uncertainty by leveraging the principles of probability theory. They allow for decision making even when we do not have complete information.

Characteristics of probabilistic models:
▪ Quantifying uncertainty in prediction or estimation
▪ Incorporating prior knowledge
▪ Handling complex relationships and multiple sources of uncertainty

Applications of probabilistic models for classification tasks:
▪ Spam detection: using word frequency and presence to calculate the probability that an email is spam given the words it contains
▪ Sentiment analysis: analyzing text features to assign probabilities to sentiments based on training data
▪ Credit risk assessment: using applicant information, economic factors, and loan characteristics to calculate the probability that an applicant defaults

Recap: General Set Notation

Set notation    Name            Meaning
$A \cup B$      Union           Everything that is in at least one of the two sets
$A \cap B$      Intersection    What is common to both sets
$\bar{A}$       Complement      Everything not in that set
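
The three set operations can be tried out with Python's built-in sets. This is only a minimal sketch: the sets `A`, `B`, and the `universe` of ten customer IDs are made up for illustration and do not come from the lecture.

```python
# Hypothetical customer IDs, used only for illustration.
universe = set(range(1, 11))   # everything that can occur
A = {1, 2, 3, 4}               # e.g., customers who bought product A
B = {3, 4, 5, 6}               # e.g., customers who bought product B

union = A | B                  # in A, in B, or in both
intersection = A & B           # in A and in B
complement_A = universe - A    # everything that is not in A

print(union)         # {1, 2, 3, 4, 5, 6}
print(intersection)  # {3, 4}
print(complement_A)  # {5, 6, 7, 8, 9, 10}
```

Note that the complement is only defined relative to a universe, which is why `universe` has to be stated explicitly.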

Recap: Probability Notation
Probability is the likelihood of an event occurring. Probabilities range from 0 to 1. We denote the probability of an event $A$ as $P(A)$.

Notation           Name                      Meaning
$P(A)$             Event A                   The probability of event A happening
$P(\bar{A})$       Complement                The probability of event A not happening
$P(A \cup B)$      Union                     The probability of event A or event B happening
$P(A \cap B)$      Intersection              The probability of event A and event B happening
$P(A \mid B)$      Conditional Probability   Probability of event A happening given that event B has occurred

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Conditional Probability
Independent Events: If two events do not influence each other, the probability that $A$ occurs stays the same, independent of $B$.
▪ Probability of A given B: $P(A \mid B) = P(A)$
▪ Probability of A and B: $P(A \cap B) = P(A)\,P(B)$

Dependent Events: Product rule for conditional probability, to express dependent events, i.e., events that influence each other.
▪ Probability of A given B: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
▪ Probability of A and B, determined by probability of A given B times probability of B: $P(A \cap B) = P(A \mid B)\,P(B)$
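
To make the difference between independent and dependent events concrete, the following sketch enumerates all outcomes of two fair dice and evaluates the definitions above with exact fractions. The choice of dice and of the three events is an assumption made for illustration; the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), where event is a predicate over a single outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def sum_at_least_ten(o):
    return o[0] + o[1] >= 10

# Independent events: knowing the first die does not change the second.
p_indep = cond_prob(second_is_six, first_is_six)

# Dependent events: a six on the first die makes a high sum more likely.
p_sum = prob(sum_at_least_ten)
p_sum_given_six = cond_prob(sum_at_least_ten, first_is_six)

# Product rule: P(A and B) = P(A | B) * P(B).
p_both = prob(lambda o: sum_at_least_ten(o) and first_is_six(o))

print(p_indep, prob(second_is_six))   # 1/6 1/6
print(p_sum, p_sum_given_six)         # 1/6 1/2
print(p_both == p_sum_given_six * prob(first_is_six))  # True
```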

Bayes Theorem

Bayes Theorem - Motivation
OLD LEVEL OF BELIEF (prior odds) × strength of new evidence = NEW LEVEL OF BELIEF (posterior odds)

The motivation behind Bayes' theorem is to update or revise probabilities based on new evidence or observations. It provides a way to calculate the probability of a hypothesis or event given the evidence, by incorporating prior knowledge or beliefs about the hypothesis and updating them with new data, leading to more accurate and informed predictions.

Bayes Theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
▪ $P(A \mid B)$: posterior probability, conditional probability of $A$ given $B$
▪ $P(B \mid A)$: conditional probability of $B$ given $A$
▪ $P(A)$: probability of $A$
▪ $P(B)$: probability of $B$

Bayes Theorem provides a way to update the probability of a hypothesis based on new evidence. The theorem considers both the initial probability of the hypothesis and how likely the new evidence is.

Bayes Theorem – Visual Example
Your friend Alex is very tidy and loves organizing. What is the probability that Alex's job is farmer vs. the probability that Alex works as a librarian?
Observing the jobs of 100 randomly selected people (absolute distribution, with the relative distribution within each job in parentheses):

          Librarians    Farmers
Tidy      8  (80%)      27 (30%)
Messy     2  (20%)      63 (70%)

Example: Bayes Theorem with known Probabilities
Probabilities of $A$ and $B$ are known. We look at an example of spectators at a soccer match, where some are fans of the home team and some support the away team; some of the spectators are children.

Number of spectators / occurrences                 Child    No child (Adult)    Sum
Supports the away team                             2        3                   5
Does not support the away team (home team)         6        9                   15
Sum                                                8        12                  20

We can utilize Bayes Theorem to calculate conditional probabilities under varying levels of information:
▪ Under full information: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
▪ If information on $P(A \cap B)$ is unknown, we utilize Bayes Theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
▪ If additionally the probability of $B$ is unknown, we additionally utilize the law of total probability: $P(B) = P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})$
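
The three levels of information can be checked numerically against the spectator table. The sketch below assumes that $A$ stands for "supports the away team" and $B$ for "is a child"; this assignment is an assumption, since the slide's own symbols are not recoverable. The last line applies the same counting logic to the tidy-Alex example.

```python
from fractions import Fraction

# Spectator counts from the table above: (team, age group) -> count.
counts = {
    ("away", "child"): 2, ("away", "adult"): 3,
    ("home", "child"): 6, ("home", "adult"): 9,
}
total = sum(counts.values())

# Full information: read P(away | child) directly from the counts.
n_child = counts[("away", "child")] + counts[("home", "child")]
p_direct = Fraction(counts[("away", "child")], n_child)

# Ingredients for Bayes Theorem.
n_away = counts[("away", "child")] + counts[("away", "adult")]
n_home = total - n_away
p_away = Fraction(n_away, total)
p_home = 1 - p_away
p_child_given_away = Fraction(counts[("away", "child")], n_away)
p_child_given_home = Fraction(counts[("home", "child")], n_home)

# Law of total probability for the evidence P(child).
p_child = p_child_given_away * p_away + p_child_given_home * p_home

# Bayes Theorem: P(away | child) = P(child | away) * P(away) / P(child).
p_bayes = p_child_given_away * p_away / p_child

# Tidy Alex: 8 tidy librarians and 27 tidy farmers among the 100 people.
p_librarian_given_tidy = Fraction(8, 8 + 27)

print(p_direct, p_bayes, p_child, p_librarian_given_tidy)  # 1/4 1/4 2/5 8/35
```

Both routes give the same conditional probability, and the Alex example shows why the prior matters: although librarians are far more often tidy, a tidy person is still more likely to be a farmer because there are many more farmers.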

Naïve Bayes Classifier

Naïve Bayes
Idea
▪ We want to classify vectors of discrete-valued features $\mathbf{x} \in \{1, \dots, K\}^D$, where $K$ is the number of values for each feature and $D$ is the number of features.
▪ To this end, we use Bayes' theorem to calculate the probability of a particular class or category given a set of features, i.e., to describe $P(y = c \mid \mathbf{x})$.
▪ Formally, we determine a complete probability distribution for each class, which describes the likelihood for any point to be in the respective class.

Naïve Assumptions:
▪ All features are equally important
▪ All features are independent, i.e., knowing the value of a particular feature does not imply knowledge about the value of another feature

Naïve Bayes – Independence Assumption
Assume that features are conditionally independent given the class label:
$$P(\mathbf{x} \mid y = c) = \prod_{j=1}^{D} P(x_j \mid y = c)$$
Example: Rolling two dice, one may assume that the two dice behave independently of each other. Looking at the result of one die will not tell you anything about the result of the second die.
Why do we worry about this? If we want to compute the conditional probability $P(\mathbf{x} \mid y)$ for every observation $\mathbf{x}$, we may not be able to observe every combination of $\mathbf{x}$ and $y$.

Naïve Bayes – Conditional Independence
Probabilities of some events are not independent, e.g., the probability of a delayed train ($D$) and arriving late at work ($A$) are not independent (but positively correlated).
Probability that both $D$ and $A$ happen: $P(D \cap A) \neq P(D)\,P(A)$
However, these events may be conditionally independent if we have additional knowledge about the situation, e.g., that there is rainy weather ($R$): $P(D \cap A \mid R) = P(D \mid R)\,P(A \mid R)$
In this setting, the rainy weather explains the dependence between delayed train and arriving late at work.
[Figure: graph with the nodes "Rainy weather", "Delayed train", "Arriving late at work", and "Streets are wet"]

Naïve Bayes – Continuous Example
Task: Distinguish cats from dogs based on their size
▪ Classes: dog, cat
▪ Features: height [cm], weight [kg]
▪ Training examples: 4 dogs and 12 cats
Given probabilities: the class priors $P(\text{dog})$ and $P(\text{cat})$
Model for dogs:
▪ Height ~ Gaussian with mean $\mu$ and variance $\sigma^2$, estimated from the dog examples as sample mean and sample variance
▪ Weight ~ Gaussian, estimated in the same way
▪ Assume that height and weight are independent
Model for cats:
▪ Same, using the mean and variance estimated from the cat examples

Naïve Bayes – Continuous Example cont'd
Recap: Gaussian Distribution
Probability density function, with distribution mean $\mu$ and standard deviation $\sigma$:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
[Figure: Gaussian density over height (cm), with the height of an unknown animal marked]
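
A possible implementation of the cat-vs-dog classifier is sketched below. The means and standard deviations in `params` are hypothetical, because the values from the slide are not available, and the priors 4/16 and 12/16 are assumed to follow from the 4 dogs and 12 cats in the training set. The function names are illustrative as well.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical (mean, standard deviation) per class and feature.
# Priors assumed from the class frequencies: 4 dogs and 12 cats.
params = {
    "dog": {"prior": 4 / 16, "height": (50.0, 10.0), "weight": (20.0, 6.0)},
    "cat": {"prior": 12 / 16, "height": (25.0, 4.0), "weight": (4.5, 1.0)},
}

def posterior(height, weight):
    """P(class | height, weight), treating height and weight as independent given the class."""
    scores = {
        c: p["prior"] * gaussian_pdf(height, *p["height"]) * gaussian_pdf(weight, *p["weight"])
        for c, p in params.items()
    }
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

small_animal = posterior(height=24.0, weight=4.0)
large_animal = posterior(height=55.0, weight=25.0)
print(max(small_animal, key=small_animal.get))  # cat
print(max(large_animal, key=large_animal.get))  # dog
```

The product of the two densities is exactly where the naïve independence assumption enters; a model without it would need a joint distribution of height and weight per class.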

Naïve Bayes - Discrete Example, spam
Task: Separate spam from valid E-Mails
▪ Classes: spam, valid
▪ Features: words

These are the already received E-Mails:

No.   E-Mail                     Class
01    “send us your password”    spam
02    “review your password”     valid
03    “send us your account”     spam
04    “send us your review”      valid
05    “review us”                spam
06    “send your password”       spam

Probabilities per class: P(spam) = 4/6; P(valid) = 2/6

Conditional probabilities per class:

Word        spam    valid
password    2/4     1/2
review      1/4     2/2
send        3/4     1/2
us          3/4     1/2
your        3/4     2/2
account     1/4     0/2

We receive a new E-Mail; how do we classify it? “review us now”

Naïve Bayes - Discrete Example cont'd
We want to distinguish spam from valid mails; considered features are the expressions used in the mail. Can we apply the classical formulation of the Naïve Bayes Operator? What would that look like? Do we have all the information?
For “review us now”, we replace the terms of the formula with known information.
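
The example can be reproduced with a few lines of code. The sketch below makes two simplifying assumptions that are not taken from the slides: only the words that occur in the new E-Mail enter the product, and words never seen in training (here "now") are skipped. The optional `alpha` argument adds a pseudo-count to every word count (additive smoothing); with `alpha=0` the raw relative frequencies from the table are used. All function names are hypothetical.

```python
from fractions import Fraction

emails = [
    ("send us your password", "spam"),
    ("review your password", "valid"),
    ("send us your account", "spam"),
    ("send us your review", "valid"),
    ("review us", "spam"),
    ("send your password", "spam"),
]

classes = sorted({label for _, label in emails})
n_mails = {c: sum(1 for _, label in emails if label == c) for c in classes}
vocabulary = {word for text, _ in emails for word in text.split()}

def word_prob(word, c, alpha=0):
    """P(word occurs | class c); alpha is an integer pseudo-count (0 = no smoothing)."""
    hits = sum(1 for text, label in emails if label == c and word in text.split())
    return Fraction(hits + alpha, n_mails[c] + 2 * alpha)

def classify(text, alpha=0):
    """Posterior over classes, using only known words that occur in the text."""
    words = [w for w in text.split() if w in vocabulary]  # skips unseen words such as "now"
    scores = {}
    for c in classes:
        score = Fraction(n_mails[c], len(emails))  # class prior
        for w in words:
            score *= word_prob(w, c, alpha)
        scores[c] = score
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

posterior = classify("review us now")
print(posterior["spam"], posterior["valid"])   # 3/7 4/7
print(word_prob("account", "valid"))           # 0
print(word_prob("account", "valid", alpha=1))  # 1/4
```

Under these assumptions the new E-Mail is classified as valid. The zero estimate for "account" in valid mails shows that a single unobserved combination can force a whole product to zero, which the pseudo-count avoids.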

Zero-frequency Problem
Getting back to our example for spam detection: we would classify any mail containing the word “account” as spam, because we have 0 observations where a mail containing “account” was classified as valid. This problem occurs very frequently.*
Solution approach: Never allow zero probabilities by applying Laplace smoothing
▪ Add a small positive number to all counts: $P(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{count}(c) + 2}$
▪ We add 2 to the denominator because, for any word, a mail can take on either of two values (it contains the word or it does not)
* Zipf's Law: in any given text, roughly 50% of all words occur only once

Excursion: Laplace Smoothing
Laplace Smoothing or Additive Smoothing ensures that no event has an estimated zero probability by assigning a small non-zero probability to unseen events (assuming categorical data). It is motivated by the sunrise problem.*
▪ Laplace's estimate: pretend that you saw every outcome once more than you actually did
$$P_{LAP}(x) = \frac{c(x) + 1}{N + |X|}$$
where $c(x)$ is the number of times $x$ was observed, $N$ is the number of observations, and $|X|$ is the amount of possible values of $x$
▪ Extended Laplace's estimate: pretend that you saw every outcome $k$ more times than you actually did
$$P_{LAP,k}(x) = \frac{c(x) + k}{N + k\,|X|}$$
▪ Laplace Smoothing for Conditionals:
$$P_{LAP,k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k\,|X|}$$
* Estimating the likelihood that the sun will rise tomorrow. Even given a large sample of days where the sun rises, we still cannot be completely sure that the sun will rise tomorrow.

Assumed Independence
The assumption holds that every word contributes independently to the likelihood that an E-Mail is classified as spam.
Therefore, the Naïve Bayes Classifier can be fooled by adding lots of valid words into a spam E-Mail, thus decreasing the likelihood that it is flagged as spam.
Solution approach: Feature Modeling can help to mitigate the problem.
▪ Using a feature vector, we can filter out words and ignore them for the classification.
▪ Disregarding common words like “the” that are equally likely in spam and valid E-Mails is a good start.

Missing Values
Assume that for some feature we do not have a value, e.g., some test was not performed on a patient, or some weather data is not available for a certain geolocation.
In many books on statistical models (e.g., ) the joint distribution is denoted with commas: $P(A, B) = P(A \cap B)$
Solution approach: We utilize the conditional independence assumption that allows us to ignore the missing feature and instead compute the likelihood for the observation based on known features.
The theoretical justification for this approach is as follows: if the value of feature $x_j$ is missing, we sum over all of its possible values,
$$P(x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_D \mid y) = \sum_{x_j} \prod_{i=1}^{D} P(x_i \mid y) = \prod_{i \neq j} P(x_i \mid y)$$
because $\sum_{x_j} P(x_j \mid y) = 1$.

Missing Values Example
3 coin tosses
Observed: the results of the first and the third coin toss
Problem: The result of the second coin toss has not properly been logged during the data collection
Question: What is the probability for the observed event?
Solution: We know all possible outcomes of the second coin toss. Thus, $P(x_1, x_3) = P(x_1, \text{Head}, x_3) + P(x_1, \text{Tail}, x_3)$
[Figure: probability tree of three coin tosses, branching into Head and Tail at every toss]
Instead of ignoring the missing data, we incorporate all known information, which enables us to analyze the overall probability even with incomplete information.
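
The coin example can be verified by summing over the unknown outcome and comparing the result with simply dropping the missing toss. The sketch assumes a fair coin and the observation Head, unknown, Tail; both are assumptions for illustration, as is the use of `None` to mark the unlogged toss.

```python
from fractions import Fraction
from itertools import product

# Assumed fair coin; each toss is independent of the others.
p = {"Head": Fraction(1, 2), "Tail": Fraction(1, 2)}

def joint(sequence):
    """Probability of a fully observed sequence of independent tosses."""
    result = Fraction(1)
    for outcome in sequence:
        result *= p[outcome]
    return result

def prob_with_missing(observed):
    """Sum the joint probability over all possible values of the missing tosses (None)."""
    missing = [i for i, outcome in enumerate(observed) if outcome is None]
    total = Fraction(0)
    for fill in product(p, repeat=len(missing)):
        sequence = list(observed)
        for index, value in zip(missing, fill):
            sequence[index] = value
        total += joint(sequence)
    return total

def prob_ignoring_missing(observed):
    """Multiply only the factors of the tosses that were actually logged."""
    result = Fraction(1)
    for outcome in observed:
        if outcome is not None:
            result *= p[outcome]
    return result

observed = ("Head", None, "Tail")
print(prob_with_missing(observed))      # 1/4
print(prob_ignoring_missing(observed))  # 1/4
```

Both functions agree because the probabilities of the missing toss sum to 1, which is the same argument that justifies skipping a missing feature in Naïve Bayes.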

Benefits and Drawbacks of Naïve Bayes
Naïve Bayes assumes conditional independence where Bayes theorem does not. This means that all input features in Naïve Bayes are treated as independent of each other.
Benefits of Naïve Bayes
▪ Robust to isolated noise points
▪ Handles missing values by ignoring the instance during probability estimate calculations
▪ Robust to irrelevant features
Drawbacks of Naïve Bayes
▪ Adding too many redundant (e.g., identical) features can cause problems
▪ Time complexity of calculating both the conditional probabilities and the class depends on the number of instances, the number of classes, and the number of features
However: independence assumptions may not hold for some features → Bayesian Belief Networks

Bayesian (Belief) Networks

Recap: Graph Terminology (cf. , p. 309)
[Figure: two example graphs, each with the nodes 1 to 5]

Bayesian (Belief) Networks
Bayesian Belief Networks (BBNs, also called Bayesian Nets) model the conditional dependency between some features.
Thus, BBNs are less constraining and more powerful than Naïve Bayes models that always assume feature independence. BBNs allow us to combine prior knowledge about variable dependencies with patterns learnt from observed training data.
We represent BBNs as directed acyclic graphs (DAG) in which every node represents a feature:
▪ directed: connections of nodes with directions
▪ acyclic: no directed cycles
Edges between nodes (i.e., features) represent the local conditional dependencies between the features.
[Figure: example DAG with the nodes 1 to 5]

Definition of Bayesian Networks
Bayesian Network: Let $X_1, \dots, X_n$ be random features. A Bayesian Network is a directed acyclic graph (DAG) that specifies a joint distribution over $X_1, \dots, X_n$ as a product of local conditional distributions. For each node $X_i$, a local conditional distribution $P(X_i \mid \mathrm{Pa}(X_i))$ exists. Given we know the set of parents $\mathrm{Pa}(X_i)$ for each feature $X_i$, the resulting joint probability distribution can be written as:
$$P(X_1 \cap \dots \cap X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
Recall: you will also encounter the joint distribution denoted by the following: $P(X_1, \dots, X_n)$

Recap: Chain Rule
The chain rule of probability allows us to represent a joint distribution as follows:
$$P(X_1 \cap \dots \cap X_n) = \prod_{i=1}^{n} P(X_i \mid X_{1:i-1})$$
where $n$ is the number of variables and $X_{1:i-1}$ denotes the set $\{X_1, \dots, X_{i-1}\}$. The order of the variables can be interchanged in any way.
Consider a simple DAG with the nodes $A$, $B$, and $C$. Using the chain rule, we determine the joint probability distribution as:
$$P(A \cap B \cap C) = P(A)\,P(B \mid A)\,P(C \mid A \cap B)$$
where each factor only needs to condition on the parents of the respective node in the DAG. This decomposition aligns with the structure of the Bayesian Network and allows us to compute the joint probability efficiently.
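
The factorization into local conditional distributions can be written down generically. The sketch below assumes a chain A → B → C with made-up probabilities; neither the structure nor the numbers are taken from the slide, and the names `parents`, `cpt`, and `joint` are hypothetical.

```python
from itertools import product

# Assumed structure: A -> B -> C (each node lists its parents).
parents = {"A": (), "B": ("A",), "C": ("B",)}

# Hypothetical CPTs: cpt[node][parent values] = P(node = True | parents).
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True,): 0.6, (False,): 0.2},
}

def local_prob(node, assignment):
    """P(node = assignment[node] | values of its parents in the assignment)."""
    p_true = cpt[node][tuple(assignment[parent] for parent in parents[node])]
    return p_true if assignment[node] else 1.0 - p_true

def joint(assignment):
    """Joint probability as the product of all local conditional distributions."""
    result = 1.0
    for node in parents:
        result *= local_prob(node, assignment)
    return result

all_assignments = [
    dict(zip(parents, values))
    for values in product([True, False], repeat=len(parents))
]
total = sum(joint(a) for a in all_assignments)
p_all_true = joint({"A": True, "B": True, "C": True})  # 0.3 * 0.8 * 0.6
```

Instead of one table with 2^3 entries for the full joint distribution, the network only stores 1 + 2 + 2 conditional probabilities, and the joint probabilities still sum to 1.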

Recap: Conditional Probability Tables
Conditional probability tables (CPTs) yield the probabilities for an event (dependent variable) for every possible realization of its determining variables. For example, we display the probability that the grass in our garden is wet (dependent variable) depending on the weather during the previous night and whether the sprinkler system was activated in the morning.

P(Wet Grass | Sprinkler ∩ Weather)
Sprinkler    Weather    Wet     Dry     Sum
On           Sunny      0.95    0.05    1
On           Rainy      0.91    0.09    1
Off          Sunny      0.29    0.71    1
Off          Rainy      0.52    0.48    1

▪ Rows must sum up to 1
▪ Columns do not necessarily sum up to 1

Example for Bayesian Network: Alarm
[Figure: DAG with the edges Burglary (B) → Alarm (A), Earthquake (E) → Alarm (A), Alarm (A) → John Calls (J), and Alarm (A) → Mary Calls (M)]
Intuition: A (partly) determines M
Setting
▪ Your house has an alarm system against burglary
▪ Sometimes earthquakes trigger the alarm
▪ Your neighbors Mary & John do not know each other
▪ They call you if they hear the alarm (no guarantee)
▪ Earthquake and Burglary are independent and can both happen
Question: What is the probability that someone breaks into your house and neither John nor Mary calls you?
Approach
1. Find the joint probability distribution by factorization (i.e., chain rule)
2. Determine the relevant probability

Example for Bayesian Network: Alarm (cont'd)
1. Find the joint probability distribution by factorization (i.e., chain rule):
$$P(B \cap E \cap A \cap J \cap M) = P(B)\,P(E)\,P(A \mid B \cap E)\,P(J \mid A)\,P(M \mid A)$$

P(B)
True     False
0.001    0.999

P(E)
True     False
0.002    0.998

P(A | B ∩ E)
B    E    True     False
T    T    0.95     0.05
T    F    0.94     0.06
F    T    0.29     0.71
F    F    0.001    0.999

P(J | A)
A    True    False
T    0.90    0.10
F    0.05    0.95

P(M | A)
A    True    False
T    0.70    0.30
F    0.01    0.99

2. Determine the relevant probability using the conditional probability tables above.

By expressing the joint probability distribution as a product of local conditional distributions, we are able to model dependencies between variables.
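
One way to carry out step 2 is sketched below. It reads the question as asking for P(B = true, J = false, M = false) and sums the factorized joint distribution over the unobserved variables E and A. This reading of the question is an assumption, as are the helper names; the numbers are the CPT entries from the tables above.

```python
from itertools import product

# CPT entries from the slides; each table stores the probability of "True".
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_TRUE = {  # key: (burglary, earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J_TRUE = {True: 0.90, False: 0.05}  # key: alarm
P_M_TRUE = {True: 0.70, False: 0.01}  # key: alarm

def bernoulli(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (
        P_B[b] * P_E[e]
        * bernoulli(P_A_TRUE[(b, e)], a)
        * bernoulli(P_J_TRUE[a], j)
        * bernoulli(P_M_TRUE[a], m)
    )

# Burglary happens, neither John nor Mary calls; sum out earthquake and alarm.
p_query = sum(
    joint(True, e, a, False, False)
    for e, a in product([True, False], repeat=2)
)
print(round(p_query, 10))  # 8.46118e-05
```

The result is small mainly because a burglary itself is rare (0.001); given a burglary, the chance that neither neighbor calls is about 8.5% under these tables.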

Bayesian Belief Networks Summary
Definition: BBNs are graphical models that represent probabilistic relationships among variables using a directed acyclic graph (DAG)
Key components:
▪ Nodes: represent random variables or features
▪ Edges: represent conditional dependencies between variables
▪ Conditional probability tables: quantify the relationships between connected nodes
Advantages:
▪ Ability to handle dependencies among features
▪ Less constraining than the assumption on conditional independence in Naïve Bayes
▪ BBNs combine prior knowledge about variable dependencies with observed training data
Disadvantages / Challenges:
▪ Learning Bayesian Belief Networks is computationally complex → Subject of current research

Summary

Naïve Bayes vs Bayesian Belief Networks
Independence Assumption
▪ Naïve Bayes: yes, assumes feature independence given the class
▪ Bayesian Belief Networks: no, models conditional dependencies explicitly
Computational Complexity
▪ Naïve Bayes: low
▪ Bayesian Belief Networks: moderate to high
Structure
▪ Naïve Bayes: no explicit structure needed, straightforward application of Bayes' Theorem
▪ Bayesian Belief Networks: yes, uses directed acyclic graphs (DAGs) to represent relationships
Suitable for
▪ Naïve Bayes: high-dimensional, simple problems
▪ Bayesian Belief Networks: complex relationships, interdependence
Use Cases
▪ Naïve Bayes: spam filtering, sentiment analysis, simple recommendation systems
▪ Bayesian Belief Networks: medical diagnosis, risk assessment, complex decision-making systems

Key Takeaways
▪ We studied Bayes Theorem, describing how to update a belief given some evidence
▪ We defined how to use the Naïve Bayes Classifier, e.g., for spam detection
▪ We discussed the potential pitfalls of Naïve Bayes
▪ Finally, we studied Bayesian (Belief) Networks, represented as directed acyclic graphs

In the next lecture, we will learn about an additional classification method based on decision trees.

References
▪ https://dogsbestlife.com/dog-fun/dogs-or-cats/
▪ https://scipython.com/blog/visualizing-the-bivariate-gaussian-distribution/
▪ Example adapted from https://media.ed.ac.uk/media/Naïve+Bayes+for+Spam+Detection/1_xyjwz2z1
▪ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/
▪ Laplace, Pierre-Simon (1814). A Philosophical Essay on Probabilities. Translated by Truscott, F. W.; Emory, F. L. John Wiley & Son and Chapman & Hall.
▪ Example adapted from https://people.cs.pitt.edu/~milos/courses/cs2740/Lectures/class19.pdf
▪ Murphy, K.P., Machine Learning: A Probabilistic Perspective, MIT Press, 2012.