Machine Learning for Business Analytics 2024 PDF
Summary
This document is a presentation about machine learning for business analytics. It covers topics such as classification, clustering, and different machine learning problems, along with mathematical concepts like random variables, probability, and expected values.
Full Transcript
Machine Learning for Business Analytics 2024
Lecturers: Dr. Marc Hilbert, Dr. Andrii Kleshchonok. Language of instruction: English. 20.10.24 Business Intelligence - Introduction

Part 1: Introduction to Business Analytics

Define your task: Prediction, Clustering, Classification, Anomaly Detection? Define objectives, error metrics, performance standards.
Collect Data: Set up the data stream (storage, input flow, parallelization, Hadoop).
Preprocessing: Noise/outlier filtering; completing missing data (histograms, interpolation); normalization (scaling the data).
Dimensionality Reduction / Feature Selection: Choose features to use/extract from the data (PCA/LDA/LLE/GDA).
Choose Algorithm: Consider goals, questions, tractability.
Experimental Design: train/validate/test data sets, cross-validation.
Run it!: Deployment.

Classification vs clustering
Classification: supervised; uses labelled data; requires a training phase; domain sensitive; easy to evaluate (you know the correct answer). Examples: Naive Bayes, KNN, SVM, Decision Trees, Random Forests.
Clustering: unsupervised; uses unlabelled data; organizes patterns w.r.t. an optimization criterion; requires a definition of similarity; hard to evaluate. Examples: K-means, Fuzzy C-means, Hierarchical Clustering, DBScan.

Examples of ML problems
- Predict how much a customer would spend in online retail.
- Explore which kinds of customers there are in online retail.
- Find the category to which a particular item in an online store belongs.
- Suggest the next item a user might want to buy online.

Part 2: Elements of statistics

Random variable
[Figure: distribution of a random variable over the range -3 to 3]

Description of random variables
A random variable x takes on a defined set of values with different probabilities. Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over (the "frequentist" view).
Discrete random variables have a countable number of outcomes. Examples: dead/alive, treatment/placebo, dice, counts, etc.
Continuous random variables have an infinite continuum of possible values. Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.
A probability function maps the possible values of x against their respective probabilities of occurrence, p(x). p(x) is a number from 0 to 1.0. The area under a probability function is always 1.

Continuous case
The probability function that accompanies a continuous random variable is a continuous mathematical function that integrates to 1. For example, recall the negative exponential function (in probability, this is called an "exponential distribution"):
p(x) = e^{-x}, \quad x \ge 0
This function integrates to 1:
\int_0^{+\infty} e^{-x}\,dx = \left[-e^{-x}\right]_0^{+\infty} = 1

Continuous case: "probability density function" (pdf)
[Figure: plot of the pdf p(x) = e^{-x}]
The probability that x takes any exact particular value (such as 2) is 0; we can only assign probabilities to possible ranges of x.
All probability distributions are characterized by an expected value (mean) and a variance (standard deviation squared).

Mean or expectation value
Discrete case: E(X) = \sum_i x_i \, p(x_i)
Continuous case: E(X) = \int_{-\infty}^{+\infty} x \, p(x) \, dx

Variance: \sigma^2 = Var(X) = E[(X - \mu)^2]
"The expected (or average) squared distance (or deviation) from the mean."
Discrete case: Var(X) = \sum_i (x_i - \mu)^2 \, p(x_i)
Continuous case: Var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 \, p(x) \, dx
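The expectation and variance definitions above can be checked numerically. Below is a minimal sketch, not part of the original slides, using a fair die for the discrete case and the exponential pdf p(x) = e^{-x} for the continuous case; the example values are illustrative only.

# Sketch: expectation and variance of a discrete random variable (a fair die),
# plus a numerical check that the exponential pdf p(x) = e^{-x} integrates to 1.
import numpy as np
from scipy import integrate

# Discrete case: fair six-sided die
x = np.arange(1, 7)                  # outcomes 1..6
p = np.full(6, 1 / 6)                # each outcome occurs with probability 1/6
mean = np.sum(x * p)                 # E(X) = sum_i x_i p(x_i) = 3.5
var = np.sum((x - mean) ** 2 * p)    # Var(X) = sum_i (x_i - mu)^2 p(x_i), about 2.92
print(mean, var)

# Continuous case: the area under p(x) = e^{-x} on [0, inf) is 1
area, _ = integrate.quad(lambda t: np.exp(-t), 0, np.inf)
print(area)                          # approximately 1.0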
Normal distribution
The Normal Distribution: changing μ shifts the distribution left or right; changing σ increases or decreases the spread.

The Normal Distribution as a mathematical function (pdf):
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
This is a bell-shaped curve with different centers and spreads depending on μ and σ.

The Normal PDF
It's a probability function, so no matter what the values of μ and σ, it must integrate to 1:
\int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = 1

The normal distribution is defined by its mean and standard deviation:
E(X) = \mu = \int_{-\infty}^{+\infty} x\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
Var(X) = \sigma^2 = \int_{-\infty}^{+\infty} (x-\mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
Standard Deviation(X) = σ

CENTRAL LIMIT THEOREM: Within a population we collect random samples of size n. The mean of the sample x̄ varies around the mean of the population μ with a standard deviation equal to σ/√n, where σ is the standard deviation of the population. As n increases, the sampling distribution of x̄ is increasingly concentrated around μ and becomes closer and closer to a Gaussian distribution.

68-95-99.7 Rule: 68% of the data lie within one standard deviation of the mean, 95% within two, and 99.7% within three.

Confidence interval
Based on Student's t value, which depends on the sample size and the confidence level.

Testing of hypotheses
p-value: the probability of an observed result arising by chance. An informal interpretation based on a significance level of 10% may be:
– p ≤ 0.01: very strong presumption against the null hypothesis
– 0.01 < p ≤ 0.05: strong presumption against the null hypothesis
– 0.05 < p ≤ 0.1: low presumption against the null hypothesis
– p > 0.1: no presumption against the null hypothesis

Anomaly detection
Example of an ML problem: find potential scams in an online retail shop.

A/B testing
Test decision with Fisher's exact test of the null hypothesis that there are no non-random associations between the two categorical variables, against the alternative that there is a non-random association.
Underlining links: does underlining increase or decrease the clickthrough rate?

Correlation
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not imply independence)
[Figure: scatter plots with correlation coefficients ranging from 1 through 0.67, 0.01, -0.67 to -1]

Linear Correlation
[Figures: linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship between X and Y]
Slides from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.

Linear Regression Model
1. The relationship between the variables is a linear function:
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
where Y_i is the dependent (response) variable, X_i the independent (explanatory) variable, \beta_0 the Y-intercept, \beta_1 the slope, and \epsilon_i the random error.

Estimating Parameters: Least Squares Method
1. 'Best fit' means the differences between the actual Y values and the predicted Y values are a minimum. The differences can be both positive and negative, so square the errors:
\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \tilde{\epsilon}_i^2
2. Least squares (LS) minimizes this sum of squared differences (errors), the SSE.
[Figure: least squares graphically, a regression line with the residuals \tilde{\epsilon}_i that LS minimizes]
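As a small illustration of the A/B test above, the sketch below runs Fisher's exact test in Python with SciPy. The click counts are invented for illustration; they are not from the slides.

# Sketch: Fisher's exact test for the underlining A/B test (hypothetical counts).
from scipy.stats import fisher_exact

#                clicked  not clicked
table = [[48, 952],   # variant A: underlined links (hypothetical)
         [30, 970]]   # variant B: plain links      (hypothetical)

odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(odds_ratio, p_value)
# On the informal scale above, p <= 0.05 would count as a strong presumption
# against the null hypothesis of no association.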
Residual Analysis for Linearity
[Figure: Y vs. x and residuals vs. x; a curved residual pattern indicates "Not Linear", a patternless band indicates "Linear"]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.

Residual Analysis for Homoscedasticity
[Figure: Y vs. x and residuals vs. x; a fanning spread indicates non-constant variance, an even band indicates constant variance]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.

Residual Analysis for Independence
[Figure: residuals vs. X for not independent vs. independent residuals]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.

Estimating Parameters: Classification
Confusion matrix/crosstabs. Calculate four quantities:
True Positives (TP): answer = YES, model said YES
True Negatives (TN): answer = NO, model said NO
False Positives (FP): answer = NO, model said YES
False Negatives (FN): answer = YES, model said NO

Confusion matrix:
              Model: YES   Model: NO
Actual: YES   TP           FN
Actual: NO    FP           TN

Underfitting and overfitting
underfitting – good fit – overfitting

OVERFITTING
Modeling techniques tend to overfit the data. Multiple regression:
- Every time you add a variable to the regression, the model's R² goes up.
- Naive interpretation: every additional predictive variable helps to explain yet more of the target's variance. But that can't be true!
- Left to its own devices, multiple regression will fit too many patterns.
- A reason why modeling requires subject-matter expertise.

OVERFITTING
Error on the dataset used to fit the model can be misleading: it doesn't predict future performance. Too much complexity can diminish the model's accuracy on future data. This is sometimes called the bias-variance tradeoff.

How to check if a model fit is good?

OVERFITTING
What are the consequences of overfitting? "Overfitted models will have high R² values, but will perform poorly in predicting out-of-sample cases."

CROSS-VALIDATION
In cross-validation the original sample is split into two parts. One part is called the training (or derivation) sample, and the other part is called the validation (or validation + testing) sample.
1) What portion of the sample should be in each part? If the sample size is very large, it is often best to split the sample in half. For smaller samples, it is more conventional to split the sample such that 2/3 of the observations are in the derivation sample and 1/3 are in the validation sample.

CROSS-VALIDATION
2) How should the sample be split? The most common approach is to divide the sample randomly, thus theoretically eliminating any systematic differences. One alternative is to define matched pairs of subjects in the original sample and to assign one member of each pair to the derivation sample and the other to the validation sample.
Modeling of the data uses one part only. The model selected for this part is then used to predict the values in the other part of the data. A valid model should show good predictive accuracy.
One thing that R-squared offers no protection against is overfitting. Cross-validation, on the other hand, by allowing us to have cases in our testing set that are different from the cases in our training set, inherently offers protection against overfitting.
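The random 2/3 derivation / 1/3 validation split described above can be written in a few lines. Below is a minimal sketch, not from the slides, using scikit-learn; the feature matrix X and target y are placeholder data.

# Sketch: a random 2/3 derivation / 1/3 validation split, plus 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(0)
X = rng.random((90, 3))     # placeholder feature matrix
y = rng.random(90)          # placeholder target

# 2/3 of the observations go to the derivation sample, 1/3 to the validation sample
X_deriv, X_valid, y_deriv, y_valid = train_test_split(X, y, test_size=1/3, random_state=0)
print(len(X_deriv), len(X_valid))   # 60, 30

# K-fold cross-validation: each observation is used for validation exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")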
CROSS VALIDATION – THE IDEAL PROCEDURE
1. Divide the data into three sets: training, validation and test sets.
2. Find the optimal model on the training set, and use the validation set to check its predictive capability.
3. See how well the model can predict the test set.
4. The test error gives an unbiased estimate of the predictive power of the model.

Train-test split
Train, Test, Cross Validation split
K-Fold Cross Validation split
[Figures: illustrations of these data-splitting schemes]

Part 4: Introduction to artificial neural nets

ANN: goal and design
An ANN is specified by:
– Neuron model: the information processing unit of the NN.
– Architecture: a set of neurons and links connecting neurons. Each link features a weight.
– Learning algorithm: used for training the NN by modifying the weights in order to solve the particular learning task correctly on the training examples.
The goal is to obtain an ANN that generalizes well, i.e., that behaves correctly on new examples.

Neuron model
Neuron model [McCulloch & Pitts]

Some neural activation functions
Linear: f(x) = x
Sigmoid: f(x) = \frac{1}{1 + e^{-x}}
Hyperbolic tangent: f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}
Gaussian: f(x) = a e^{-bx^2}

Single-layer perceptron
[Figure: single-layer perceptron with inputs x_1 ... x_n, weights w_1 ... w_n, threshold θ and a single output]

The perceptron
The aim of the perceptron is to classify inputs x_1, ..., x_n into one of two classes, say A1 and A2. In the case of an elementary perceptron, the n-dimensional space is divided by a hyperplane into two decision regions. The hyperplane is defined by the linearly separable function
\sum_{i=1}^{n} x_i w_i - \theta = 0

Two-dimensional plots of basic logical operations
OR operator:  0 OR 0 = 0, 0 OR 1 = 1, 1 OR 0 = 1, 1 OR 1 = 1
AND operator: 0 AND 0 = 0, 0 AND 1 = 0, 1 AND 0 = 0, 1 AND 1 = 1
XOR operator: 0 XOR 0 = 0, 0 XOR 1 = 1, 1 XOR 0 = 1, 1 XOR 1 = 0
A single-layer perceptron is capable of learning linearly separable problems (such as OR and AND), but not more (XOR is not linearly separable).

Multi-layer ANNs and backpropagation

Multi-layer ANNs
A multi-layer ANN is an ANN with one or more hidden layers. The ANN consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons. The input signals are propagated in a forward direction on a layer-by-layer basis.

Multi-layer feed forward ANNs [I]
[Figure: two-layer feed-forward network with weighted connections]
x_t = h_2\!\left( w_0 + \sum_{j} w_j \, h_1\!\left( w_{0j} + \sum_{i} w_{ij} \, x_{t-i} \right) \right) + \epsilon_t

Multi-layer feed forward ANNs [II]
ANNs with only two hidden layers are capable of representing an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy. The optimal size of the hidden layer(s) is not known in advance and is determined heuristically. A hidden layer "hides" its desired output: the output is determined by the layer itself.
Why more layers? Complexity within the data; memory; numerous predictors; training performance.

Learning in a multi-layer ANN
Learning in a multilayer network proceeds basically the same way as for a perceptron. The training set of input patterns is presented to the ANN. The ANN computes its output pattern, and if there is an error, i.e., a difference between the actual and target output patterns, the weights are adjusted to reduce this error.
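To make the linear-separability point above concrete, here is a small sketch, not from the slides, using scikit-learn: a single-layer perceptron learns AND but cannot learn XOR, while a network with one hidden layer typically can. The layer size, activation and solver choices are assumptions for illustration.

# Sketch: a single-layer perceptron learns AND (linearly separable) but not XOR;
# one hidden layer is enough for XOR.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

print(Perceptron(max_iter=1000).fit(X, y_and).score(X, y_and))  # 1.0: AND is linearly separable
print(Perceptron(max_iter=1000).fit(X, y_xor).score(X, y_xor))  # below 1.0: XOR is not

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', max_iter=5000, random_state=0)
print(mlp.fit(X, y_xor).score(X, y_xor))                        # typically 1.0 with a hidden layer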
ANN mean squared error
The ANN error is the discrepancy between the desired and the actual output, measured by the sum of squared errors over all the instances (training examples):
x_{MSE} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 = \frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}
x_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (actual_i - desired_i)^2
x_{MSE} = \frac{1}{n} \sum_{x} \| y(x) - a^L(x) \|^2

ANN root mean squared error
x_{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 } = \sqrt{ \frac{x_1^2 + x_2^2 + \dots + x_n^2}{n} }
x_{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (actual_i - desired_i)^2 }
x_{RMSE} = \sqrt{ \frac{1}{n} \sum_{x} \| y(x) - a^L(x) \|^2 }
The aim of a learning procedure is to minimize the error.

Back propagation algorithm
The back propagation algorithm (as any other training algorithm) searches for weight values that minimize the error over the set of training examples. Back propagation consists of the repeated application of the following two passes:
1. Forward pass: first, a training input pattern is presented to the network input layer. The ANN propagates the input pattern forward from layer to layer until the output pattern is generated by the output layer.
2. Backward pass: if this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated. This is done by computing the local gradient of each neuron.

Gradient descent [I]
The back propagation algorithm seeks to reduce the ANN's total error by computing the gradient of the error surface at its current point, and adjusting the weights in the ANN to descend on the error surface.

Gradient descent [II]
[Figures: an ANN stuck in a local minimum vs. an ANN converging to the global minimum with momentum]

Overfitting of ANNs
Parameters:
– number of hidden neurons
– initial weights
– activation function (e.g., sigmoid)
– learning rate
– momentum
Real problem: the increase in flexibility due to an increasing number of neurons may result in overfitting.

Training and test data set
Training data (in-sample error); test data (out-of-sample error).

Goodness-of-fit of ANN
Similar to measures for linear regression models. A measure of the quality of the fit of the ANN to the data set X is R², defined by
R^2(X) = 1 - \frac{ \frac{1}{n} \sum_{i=1}^{n} (a_i - d_i)^2 }{ \sigma^2 }
where
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^2 \quad \text{and} \quad \bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i
The closer R² is to 1, the better the ANN fits the data. Distinguish between in-sample and out-of-sample R².

MSE related with training over time

Advantages/disadvantages of ANNs
Advantages: efficient; inherently massively parallel; robust; can deal with incomplete/noisy data; fault-tolerant (still works when part of the ANN fails); user-friendly; learning instead of programming.
Disadvantages: difficult to design; no clear design rules for arbitrary applications; difficult to assess the internal operation (what tasks are performed by different parts of the ANN); black-box nature; hard to interpret the final model as well as the relations between input and output.

Part 5: Python implementation

Components of dense ANN training in Keras
1. Data – Housing data set
2. Problem statement – regression or classification
3. Preprocessing and scaling – standard scaling, one-hot encoding of categorical features
4. Architecture: number of layers, neurons (nodes); activations – ReLU, Softmax; dropout layers; loss function – RMSE, MSE, cross-entropy
5. Training parameters: optimizer – ADAM; batch size; number of epochs
6. Evaluation metrics, learning curves
7. Analysis of errors and residuals
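A small numerical sketch, not from the slides, of the error and goodness-of-fit measures above; the actual/desired values are invented, and R² is computed exactly as defined on the slide (with σ² the variance of the actual outputs a_i).

# Sketch: MSE, RMSE and the slide's R² measure, computed with NumPy (invented values).
import numpy as np

actual  = np.array([2.1, 3.9, 6.2, 8.0])   # ANN outputs a_i (hypothetical)
desired = np.array([2.0, 4.0, 6.0, 8.5])   # target values d_i (hypothetical)

mse  = np.mean((actual - desired) ** 2)    # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
r2   = 1.0 - mse / np.var(actual)          # R² as defined above, sigma^2 = var of the a_i
print(mse, rmse, r2)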
Dropout Layers
Randomly remove nodes on the forward pass, so that on the backward pass their weights are not updated. Dropout trains an ensemble of all sub-networks. Dropout is similar to bagging, but the sub-networks share the same parameters and only a fraction of the sub-networks is trained, whereas in bagging all models are independent. Dropout leads to an uncertainty estimation of the predictions (Srivastava et al., 2014).
[Figure: a fully connected NN vs. a NN with dropout]

Components of dense ANN training in Keras (recap of the list above)

Classification implementing in Keras
[Figure: fully connected NN with 28x28 inputs, two hidden layers of 500 units, and a softmax output y0 ... y9]

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()             # layers are sequentially added
model.add(Dense(500, input_dim=28*28))
model.add(Activation('relu'))    # alternatives: softplus, softsign, relu, tanh, hard_sigmoid
model.add(Dropout(0.2))
model.add(Dense(500))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=20)

High-level Language Model Overview
Hello world! Bonjour le monde!
https://huggingface.co/blog/large-language-models
https://observablehq.com/@sorami/sizes-of-large-language-models

Intuition behind LLM training
Fill in the blank in prompts such as "I eat pasta with cheese and ____", "European union includes ____", "I saw a ____". The model assigns a probability to every possible continuation, e.g. mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., "and" 1e-100.

Autoregressive Models vs. Autoencoding Models
Autoregressive models predict the future token: "Pasta was hot. He couldn't ___."
Autoencoding models predict a token based on past and future context: "He could not finish the entire ___ of pasta."

LLM capabilities (a short usage sketch follows at the end of this transcript)
1. Classify text (e.g., sentiment, specific topic categories)
2. Recognize named entities (e.g., people, locations, dates)
3. Tag parts of speech (e.g., noun, verb, adjective)
4. Question-answer (e.g., find the answer within a provided context)
5. Summarize (short summary that preserves key concepts)
6. Paraphrase (rewrite in a different way while retaining meaning)
7. Complete (predict likely next words)
8. Translate (one language to another; human or code, if in the training data)
9. Generate (again, can be code if in the training data)
10. Chat (engage in extended conversation)
11. Generate speech, music, image, video (multimodal)
12. Creative writing (e.g., poetry, prose)
…

GPT
Abbreviation: Generative (autoregressive), Pre-trained (zero-/one-/few-shot learning on many tasks), Transformer; + Chat (human feedback loop)

Part 6: exam information
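Referring back to the LLM capabilities list above: a minimal sketch, not from the slides, of capability 1 (text classification) with a pretrained model from the Hugging Face transformers library. The pipeline's default sentiment model and the example sentence are assumptions for illustration.

# Sketch: sentiment classification with a pretrained model (pipeline default).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pretrained model
print(classifier("Machine learning for business analytics is great fun."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]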