Machine Learning Cribsheet (COMS30035) Weeks 1-5 PDF


Document Details

2024

James Cussens

Tags

machine learning, supervised learning, regression, artificial intelligence

Summary

This document is a cribsheet for a machine learning class (COMS30035), covering weeks 1–5, dated October 11, 2024. It explains fundamental machine learning concepts such as supervised learning, unsupervised learning, and regression, providing definitions and explanations of these key terms for a learner.

Full Transcript


Cribsheet for Machine Learning (COMS30035) Weeks 1–5
James Cussens
October 11, 2024

1 Machine Learning Principles

You need to know what the following terms mean:

Unsupervised learning
- A type of machine learning where the model is trained on unlabelled data.
- The goal is to discover hidden patterns or structures in the data.
- E.g., clustering algorithms (k-means).

Supervised learning
- A type of machine learning where the model is trained on labelled data.
- The model learns to map inputs to outputs based on the provided examples.
- E.g., linear regression, logistic regression, support vector machines.

Regression
- A predictive modelling task where the goal is to predict a continuous outcome based on input variables.

Classification
- A task where the model assigns input data to discrete categories or classes.

Underfitting
- When a model is too simple to capture the underlying pattern in the data, resulting in poor performance on both training and test data.
- Solution: increase model complexity or add more features.

Overfitting
- When a model performs well on the training data but poorly on the test data, because it has learned noise and details specific to the training data.
- Solution: gather more data, apply cross-validation.

Model selection
- The process of selecting the most appropriate machine learning model for a given task.
- Involves comparing models, evaluating performance, and choosing the one that generalises best to unseen data.
- E.g., cross-validation.

Training dataset
- The portion of data used to train the model.
- The model learns the relationships between inputs and outputs from this dataset.

Validation dataset
- A separate set of data used to tune model parameters (e.g., hyperparameters).
- The model does not train on this data, but it helps prevent overfitting by assessing how well the model generalises.

Test dataset
- A dataset used to evaluate the performance of the trained model.
- Provides an unbiased assessment of how well the model performs on unseen data.

Cross-validation
- A technique for evaluating models by splitting the data into multiple folds (S).
- The model is trained on S − 1 folds and validated on the remaining fold.
- Helps ensure the model generalises well to new data.
- E.g., k-fold cross-validation (see the sketch after this list).

No free lunch theorem
- No one machine learning algorithm works best for all problems.
- The performance of an algorithm depends on the specific task and dataset.

Model parameters
- Internal variables of the model that are learned from the training data (e.g., weights).
- Define the model's behaviour and how it makes predictions.

Parametric model
- A model that assumes a specific form for the function mapping inputs to outputs and has a fixed number of parameters, regardless of the size of the dataset.
- E.g., linear regression, logistic regression.

Nonparametric model
- A model that does not assume a fixed form for the function and can grow in complexity as more data is provided.
- Has an increasing number of parameters as the data grows.

Likelihood function
- A function that measures how likely it is that a given set of parameters explains the observed data.

Maximum likelihood estimation (MLE)
- A method for estimating the parameters of a model by maximising the likelihood function.
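For concreteness, here is a minimal sketch of S-fold cross-validation (not part of the original cribsheet; the train_and_score callback and the data are placeholders you would supply):

    import numpy as np

    def s_fold_cross_validation(X, y, train_and_score, S=5, seed=0):
        """Split the data into S folds; train on S-1 folds, validate
        on the held-out fold, and return the mean validation score."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(X))
        folds = np.array_split(indices, S)
        scores = []
        for i in range(S):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
            scores.append(train_and_score(X[train_idx], y[train_idx],
                                          X[val_idx], y[val_idx]))
        return np.mean(scores)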
2 Linear Regression

You need to know that the linear regression model p(y|x, w) is given by:

    y = w^T x + ε    (1)

where ε has a Gaussian distribution with mean 0: ε ~ N(0, σ²).

You need to know what a bias parameter is and how in (1) it was included in the parameter vector w by the addition of a 'dummy' variable which always has the value 1.
- The bias term is a constant value that shifts the entire regression. To include it, we add a dummy variable with value 1, so the model takes the form y = w_0 · 1 + w_1 x_1 + ..., which ensures the model can fit datasets whose best-fit line does not pass through the origin.

You need to understand that we can apply linear regression to a feature vector φ(x) rather than the original data x:

    y = w^T φ(x) + ε    (2)

- Nonlinear relationships between variables are hard to capture in the original space, so we use a feature map φ(x) to better capture the relationship.

You need to know:
1. what the least-squares problem for linear regression is: finding the best-fit line that minimises the sum of squared differences;
2. that the solution to this problem has a closed form, w = (X^T X)^{-1} X^T y (but you don't need to memorise this closed form);
3. that the least-squares solution is also the maximum likelihood solution (you do not need to be able to prove this). When we assume that the error ε is normally distributed, solving the least-squares problem is equivalent to MLE: minimising the squared error maximises the likelihood.
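As an illustrative sketch (synthetic data, assuming numpy), the closed-form least-squares solution with a dummy bias column:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(50, 1))
    y = 2.0 - 3.0 * x[:, 0] + rng.normal(0, 0.1, size=50)  # noisy line

    # Design matrix with a 'dummy' column of ones for the bias term.
    X = np.hstack([np.ones((len(x), 1)), x])

    # Closed-form least-squares solution w = (X^T X)^{-1} X^T y,
    # which is also the MLE under Gaussian noise.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)  # approximately [2.0, -3.0]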
3 Linear Discriminant

You need to know that when there are two classes the linear discriminant computes y = w^T x for input x and assigns x to class C1 if y ≥ 0 and class C2 otherwise.
- Used to find the decision boundary.

Parameters are 'learnt' by assuming that: (1) data for each class have a Gaussian distribution; (2) these two Gaussian distributions have the same covariance matrix. Parameters can then be found by applying MLE.

4 Logistic Regression

Used to obtain the probability of class C_k.

You need to know that the logistic sigmoid function (sometimes called just the logistic function) is:

    σ(a) = 1 / (1 + exp(−a))

You need to know that the logistic regression model for two classes is:

    p(C1|x) = σ(w^T x)
    p(C2|x) = 1 − p(C1|x)    (3)

You need to know that the MLE parameters for logistic regression can be found by gradient descent (see the sketch after this list):
1. Define the likelihood: p(t|X, w) = Π_n y_n^{t_n} (1 − y_n)^{1 − t_n}, where y_n = p(C1|x_n).
2. Take the negative logarithm: −ln p(t|X, w) = −Σ_n [t_n ln y_n + (1 − t_n) ln(1 − y_n)].
3. Calculate the derivative with respect to w: ∂(−ln p(t|X, w))/∂w = Σ_n (y_n − t_n) x_n.
4. Update w using gradient descent: move w a small amount in the direction that decreases the negative log-likelihood (i.e., increases the likelihood).
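A minimal sketch of these four steps (assuming numpy; the learning rate and step count are placeholder choices):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic(X, t, lr=0.1, n_steps=1000):
        """Gradient descent on the negative log-likelihood.
        X: (N, D) design matrix (include a dummy 1s column for the bias),
        t: (N,) binary targets in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            y = sigmoid(X @ w)          # y_n = p(C1 | x_n)
            grad = X.T @ (y - t)        # sum_n (y_n - t_n) x_n
            w -= lr * grad / len(t)
        return w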
5 Neural networks

You need to know what the following terms mean:

Weights
- Parameters that define the strength of connections between neurons.

Activation function
- A mathematical function applied to the output of each neuron to introduce non-linearity into the network.
- E.g., ReLU, sigmoid.

Input layer
- The first layer of the neural network, which receives the input data.
- Each neuron in this layer represents a feature of the input.

Hidden layer
- Layers of neurons between the input and output layers.
- Perform computations on the input data to learn complex patterns.

Output layer
- The final layer of a neural network, which produces the predicted output.
- E.g., class labels (classification), continuous values (regression).

Cost function / Loss function
- A function that measures how far the network's predictions are from the actual target values.
- Guides the learning process by providing feedback to update the weights (backpropagation).
- E.g., mean squared error (regression), cross-entropy (classification).

Forward pass
- The process of passing input data through the network, layer by layer, to compute the predicted output.
- This is how the model makes predictions.

Backward pass
- The process of propagating the error from the output layer back through the network, layer by layer, to compute gradients.

Backpropagation
- An algorithm for efficiently computing the gradients of the loss function with respect to the weights in a neural network.
- Enables the backward pass by propagating the error from the output layer back through the network.

Vanishing gradient problem
- A problem that occurs during backpropagation when the gradients become extremely small, making it difficult to update the weights in the earlier layers of the network.
- Common with activation functions like the sigmoid.

Exploding gradient problem
- Occurs when gradients become excessively large during backpropagation, causing the weights to update too much, leading to instability.

Gradient clipping
- A technique used to prevent exploding gradients by capping the gradients during backpropagation, ensuring they stay within a reasonable range (e.g., rescaling g′ = g · min(1, c/‖g‖) for some constant c).

Avoiding the exploding/vanishing gradient problem:

Non-saturating activation functions
- Activation functions that avoid the vanishing gradient problem by not saturating.
- E.g., ReLU: f(x) = max(0, x) (though it has its own "dead ReLU" problem).

Residual layer / network
- A type of architecture that includes skip connections: F_i(x) = F(x) + x.
- Allows the gradient to flow more easily through deep networks.
- Helps mitigate the vanishing gradient problem and makes training deep networks easier.

Parameter initialisation
- The process of setting initial values of the network's weights before training begins (helps avoid being trapped in poor local minima).

Avoiding overfitting:

Early stopping
- A regularisation technique that stops training once the model's performance on a validation set starts to degrade, preventing overfitting.

Weight decay
- A regularisation technique where a penalty term is added to the cost function to discourage large weight values, helping to prevent overfitting: L = L(data) + λ w^T w, where L(data) is the original loss on the data.

Dropout
- A regularisation technique that randomly drops out a fraction of the neurons during training, forcing the network to learn more robust and distributed representations.
- E.g., p = 0.3 gives each neuron a 30% chance of being turned off.

Attention mechanisms
- Give different parts of the input different weights based on their importance, allowing the model to focus on specific parts of the input at each step of output generation.

You need to know that a unit j in a neural network (but not in the input layer) computes a value z_j by first computing a_j, a weighted sum of its inputs (from the previous layer), and then sending a_j to some nonlinear activation function h:

    a_j = Σ_i w_ji z_i    (4)
    z_j = h(a_j)          (5)

You need to know the backpropagation formula:

    δ_j = h′(a_j) Σ_k w_kj δ_k    (7)

and be able to explain what each of the symbols in this formula represents:
- a_j: the preactivation value of unit j;
- z_i: the output of neuron i in the previous layer;
- w_ji: the weight between neuron i and neuron j (and w_kj the weight between j and the next-layer neuron k);
- h (and its derivative h′): the activation function of neuron j;
- δ_k: the error term propagated back from the next-layer neuron k.
- In the forward pass, values flow z_i → z_j → z_k; in the backward pass, the error terms propagate δ_k → δ_j → δ_i.
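A minimal sketch of equations (4), (5), and (7) for a one-hidden-layer network with a squared-error loss (not from the cribsheet; shapes and data are illustrative):

    import numpy as np

    def h(a):          # activation function (sigmoid here)
        return 1.0 / (1.0 + np.exp(-a))

    def h_prime(a):    # its derivative
        s = h(a)
        return s * (1.0 - s)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)            # input (D = 3)
    W1 = rng.normal(size=(4, 3))      # input -> hidden weights
    W2 = rng.normal(size=(1, 4))      # hidden -> output weights

    # Forward pass: a_j = sum_i w_ji z_i, z_j = h(a_j)   (eqs. 4-5)
    a1 = W1 @ x
    z1 = h(a1)
    y = W2 @ z1                       # linear output unit (regression)

    # Backward pass for squared error E = (y - t)^2 / 2:
    t = 1.0
    delta2 = y - t                           # error at the output unit
    delta1 = h_prime(a1) * (W2.T @ delta2)   # eq. (7): delta_j = h'(a_j) sum_k w_kj delta_k
    grad_W2 = np.outer(delta2, z1)           # dE/dw_kj = delta_k z_j
    grad_W1 = np.outer(delta1, x)            # dE/dw_ji = delta_j z_i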
6 Trees

You need to know...

what a classification tree is
- A type of decision tree used for categorical outcomes (classification).
- The goal is to predict the class of an input based on feature values.
- Partitions the data based on feature values.
- Each internal node represents a test on a feature.
- Each leaf node represents a class label.

what a regression tree is
- Used for predicting continuous outcomes (regression).
- Has a real-number output.
- The structure of the tree is similar to that of a classification tree, but the predictions at the leaf nodes are continuous values.

how trees partition the input space
- Decision trees partition the input space by applying recursive binary splits on the features.
- The process continues down the tree, with each split further refining the partition of the input space until the data points at the leaves are either homogeneous (classification) or have minimal variance (regression).

that trees are a nonparametric method
- A tree can grow in complexity depending on the data.
- Trees can adapt to complex patterns in the data, but are prone to overfitting if not properly controlled.

how the standard CART algorithm for learning trees works, including the final pruning stage
- Algorithm (see the sketch after this list):
  (1) Start from the root node (the entire dataset).
  (2) Run an exhaustive search over possible variables and thresholds for a new node. For each variable and threshold:
      - compute the average of the target variable for each leaf of the proposed node (regression);
      - compute the error if we stop adding nodes here: Gini impurity Q(T) = Σ_k p_k(1 − p_k) for classification, or mean squared error er(T) = Σ_n (t_n − y_n)² for regression.
  (3) Choose which of the D input variables to split on, along with the threshold θ, so as to best minimise the error.
  (4) Add a new node for the chosen variable and threshold.
  (5) Repeat (2) recursively until a stopping condition is met (e.g., a minimum number of samples in a node, or reaching a maximum depth).
  (6) Prune back the tree to remove branches that do not reduce the error by more than a small tolerance value.
- Pruning:
  - Involves cutting back a decision tree to prevent overfitting, with the optimal prediction y_τ made at each leaf τ.
  - Reduces the size of the tree by removing nodes that don't significantly reduce the error.
  - Stopping based on residual error alone cannot capture the case where continuing to grow the tree keeps reducing the error; therefore, we grow a large tree based on a different criterion and prune it back.
  - Steps:
    (1) Start with the fully grown tree T0.
    (2) Consider pruning each node in T0 by combining branches to obtain a simpler tree T.
    (3) Compute a criterion C(T) = Σ_τ er(T_τ) + λ|T| [goal: minimise the criterion], which evaluates the performance of the pruned tree while penalising the number of leaves |T|.
    (4) Accept the pruning if it does not worsen C(T).
- Impurity measures:
  - Gini impurity: simpler and more efficient; often used when accuracy is the primary goal.
  - Cross-entropy: a measure of how mixed the data in a node is in terms of class distribution; used when a more sensitive measure of node impurity is needed; is differentiable.
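A minimal sketch of steps (2)-(3) of CART for classification, using the Gini impurity (an illustration, not a full tree learner):

    import numpy as np

    def gini(labels):
        """Gini impurity Q = sum_k p_k (1 - p_k) of a set of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return np.sum(p * (1.0 - p))

    def best_split(X, y):
        """Exhaustive CART-style search over variables and thresholds,
        choosing the split that minimises the weighted Gini impurity."""
        best = (None, None, np.inf)
        for d in range(X.shape[1]):
            for theta in np.unique(X[:, d]):
                left, right = y[X[:, d] <= theta], y[X[:, d] > theta]
                if len(left) == 0 or len(right) == 0:
                    continue
                err = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
                if err < best[2]:
                    best = (d, theta, err)
        return best  # (variable index, threshold, weighted impurity)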
7 Kernels and SVMs

You need to know what the following terms mean:

Kernel function
- k(x, x′) = φ(x)^T φ(x′): symmetric; represents the similarity between points.
- Used in SVMs to enable the algorithm to operate in a higher-dimensional space without explicitly calculating the coordinates in that space.
- Computes the inner product of two data points in the feature space.
- Allows the SVM to handle nonlinear relationships.
- E.g., linear kernel, radial basis function (RBF) kernel, stationary kernel, homogeneous kernel.

Kernel trick (for non-linear classification)
- Allows SVMs to efficiently perform classification in higher-dimensional spaces by using a kernel function without explicitly mapping the data to that space: only the kernel evaluations k(x_1, x), ..., k(x_N, x) are needed, never φ(x) itself.
- Transforms the input data into a higher-dimensional space, where a linear decision boundary can be found even if the data are not linearly separable in the original space.

Dual parameter vector (a)
- a = (K + λ I_N)^{-1} t, where K is the N×N Gram matrix, I_N is the N×N identity matrix, and t is the vector of target values.
- Controls how much influence each training example has on the final solution.
- Indicates whether a particular data point is a support vector or not.
- Involves solving the optimisation problem in terms of dual variables (Lagrange multipliers), rather than directly solving for the weights of the classifier.
- Computationally efficient.

Gram matrix
- A matrix of kernel evaluations between all pairs of data points: K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m).
- Symmetric positive semidefinite (v^T K v ≥ 0).
- For a dataset with n points, the Gram matrix is an n × n matrix where each entry K(i, j) is the kernel function applied to data points x_i and x_j.

Margin
- The distance between the decision boundary and the closest data points from each class.
- SVMs aim to maximise the margin, which results in a more robust classifier that generalises well to unseen data.

Support vectors
- Data points that lie closest to the decision boundary.
- Critical, as they determine the position of the decision boundary.
- They are the only data points that influence the final model.

Dual representation
- y(x) = w^T φ(x) = a^T Φ φ(x), using the dual parameters a in place of w.
- Converting the primal optimisation problem into its dual form using Lagrange multipliers.
- Handles constraints more efficiently, and is easier to solve in high-dimensional spaces.

Design matrix
- Φ = [φ(x_1), φ(x_2), ..., φ(x_N)]^T: a matrix containing the feature vectors of the training points.
- Simplifies the representation of the input data.

Soft margin
- t_n(w^T φ(x_n) + b) ≥ 1 − ξ_n, with slack variables ξ_n ≥ 0: ξ_n = 0 means correctly classified; 0 < ξ_n ≤ 1 means correctly classified but lying within the margin; ξ_n > 1 means misclassified.
- Used with SVMs to allow some data points to be misclassified or to lie within the margin.
- The slack variables measure the degree of misclassification or margin violation, and are penalised in the optimisation process to prevent too many violations.

Role of the regularisation parameter in soft margins
- Controls the trade-off between maximising the margin and minimising classification errors.
- A smaller C results in a wider margin but allows more misclassifications.
- A larger C emphasises classifying all training examples correctly, but can lead to overfitting.

Multiclass classification
- One-versus-one: train on all possible pairs of classes [k(k−1)/2 classifiers], e.g., SVC, NuSVC.
- One-versus-the-rest: distinguish between each class and the rest of the classes [k classifiers], e.g., LinearSVC.
- RBF kernel (nonparametric): k(x, x′) = exp(−γ ‖x − x′‖²).

Maximising margin problem
- The goal is to find a hyperplane (decision boundary) that separates the different classes in a dataset while maximising the margin: argmin_w ½‖w‖².
- A quadratic programming problem.
- In the dual representation (after kernel substitution): maximise L̃(a) = Σ_n a_n − ½ Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 for n = 1, ..., N and Σ_n a_n t_n = 0.

Testing after training (SVM)
- After obtaining the optimal w and b, we test the model by making a prediction for a new data point: y(x) = Σ_n a_n t_n k(x_n, x) + b.
- For a_n = 0, the point does not appear in the sum and plays no role in making predictions for new data.
- Support vectors have a_n > 0, and the constraint t_n y(x_n) = 1 ensures that the data point is well classified and lies on the margin.
- a_n tells you how much a particular training point matters in determining the SVM's decision boundary.
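A small sketch tying these pieces together (synthetic data; the explicit Gram matrix is built only to illustrate K_nm = k(x_n, x_m), while scikit-learn's SVC, mentioned above, handles the dual optimisation internally):

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-class data (placeholder).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
    t = np.array([0] * 20 + [1] * 20)

    # RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2); the Gram matrix
    # K_nm = k(x_n, x_m) is symmetric positive semidefinite.
    gamma = 1.0
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq_dists)

    # Soft-margin SVM: C controls the margin/error trade-off.
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, t)
    print(clf.support_)   # indices of the support vectors (a_n > 0)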
8 Probabilistic Graphical Models

You need to know what the following terms mean:

Directed acyclic graph (DAG)
- A graph where all edges are directed and there are no cycles.
- Represents the conditional dependencies between random variables.

Conditional independence
- When two random variables are independent of each other given a third variable.
- A DAG encodes these independence relations in a Bayesian network.

Prior / posterior distribution
- Prior distribution P(θ): the belief about a parameter before observing any data; expresses uncertainty about a parameter before you see any evidence (data).
- Posterior distribution P(θ|data): updates the prior belief after observing the data.
- The posterior combines the prior distribution and the likelihood via Bayes' theorem: P(H|E) = P(E|H)P(H) / P(E).

Bayesian network
- A type of PGM that represents the joint probability distribution of a set of random variables using a DAG.
- Each node represents a random variable, and directed edges represent conditional dependencies.

Bayesian polynomial regression
- p(t, w | X, α, σ²) = p(w|α) Π_{n=1}^{N} p(t_n | w, x_n, σ²), where X is the input, α is a hyperparameter of the prior over the weights, σ² is the noise variance, p(w|α) is the prior, and the product over n is the likelihood.
- An approach to regression where the goal is to find the posterior distribution over the polynomial coefficients (weights) instead of point estimates.
- Adds uncertainty by treating the polynomial coefficients as random variables with a prior distribution.

the structure of a Bayesian network
- The DAG structure.

the parameters of a Bayesian network
- The conditional probability distributions.

child (in a Bayesian network)
- A node that has an incoming directed edge from another node.
- A → B: B is a child of A.

parent (in a Bayesian network)
- A node that has an outgoing directed edge to another node.
- A → B: A is a parent of B.

descendant (in a Bayesian network)
- Any node that can be reached by following a directed path starting from the given node.
- A → B → C: C is a descendant of A.

path (in a Bayesian network)
- A sequence of edges connecting two nodes in a Bayesian network.
- Paths can go in any direction (not just along the edges' directions).

collider (in a Bayesian network)
- A node on a path where two directed edges point into it.

Blocked path
- A path is blocked if there is a node along the path that d-separates the two variables.
- A path between two variables can be blocked, given a conditioning set S, when:
  - there exists a collider on the path that is not in S (and none of its descendants is in S), or
  - there exists a non-collider on the path that is in S.

factorisation of a joint probability distribution defined by the structure of a given Bayesian network
- P(X_1, X_2, ..., X_D) = Π_i P(X_i | parents(X_i)): each variable is conditioned on its parents (see the sketch after this section).

how to use plate notation to compactly represent a Bayesian network
- A plate represents a set of nodes t_1, ..., t_N, all of which have w as their single parent: the node is drawn once, with the plate indicating repetition over n.

Naïve Bayes
- Observed variables are assumed independent conditional on the class variable z.
- A specific application of Bayes' theorem for classification.
- z is the class label (with a prior), x_nd is the d-th feature of the n-th data point, and θ denotes the parameters; all observed variables are independent of each other given the class.

Latent variable model
- Latent variables: hidden, unobserved variables that influence the observed data but are not directly measurable.
- The goal is to explain observed variables in terms of latent variables.
- Used when we suspect that underlying factors influence the observed data but aren't directly accessible.

how to translate a machine learning model (described in words) to a Bayesian network
- Involves representing the variables and their dependencies using nodes and edges.
- The structure of the model, including the assumptions about conditional dependencies, is encoded as a DAG.
- Example BN representations of ML models:
  (1) Differentially private Bayesian linear regression: ensures that the output of an algorithm doesn't reveal too much information about any individual data point in the dataset.
  (2) Causal representation learning: high-dimensional representations of causal variables (environmental and external conditions) help determine the behaviour of a dynamic causal system over time.

how to use d-separation to check for conditional independence relations in a Bayesian network
- D-separation is a criterion for determining whether two nodes in a Bayesian network are conditionally independent given a set of other nodes.
- Two nodes are d-separated by a set of nodes S if all paths between them are blocked by S.
- If d-separated, the two nodes are conditionally independent given S.

Hierarchical linear regression
- p(θ, μ, σ, y) = p(μ) p(σ) Π_i p(θ_i | μ, σ) p(y_i | θ_i): prior distributions over the hyperparameters μ and σ, with each θ_i drawn from a common distribution given μ and σ, and the likelihood of each y_i depending on θ_i.
- Each individual (e.g., location) has its own parameter, but the parameters are not entirely independent, since each is drawn from a common distribution.
- Data collected at a particular location influence the posterior distribution for other locations, but only to a degree.
- Gives a balance between location-specific learning and using data from all locations.

Standard regression
- p(θ, y) = p(θ) Π_{i=1}^{k} p(y_i | θ): a prior on the single parameter θ and a likelihood term for each y_i.
- A single parameter θ is assumed to apply to the entire dataset.
- Assumes the same relationship holds across the entire dataset.

Separate regression
- p(θ, y) = Π_i p(θ_i) p(y_i | θ_i): a prior for each data point's own parameter, with the likelihood of each y_i depending on its own θ_i.
- Each data point has its own set of parameters.
- Each y_i is modelled independently with its own parameters.
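A minimal sketch of the factorisation defined by a DAG, using a hypothetical three-node network A → B ← C (B is a collider; all numbers are made up for illustration):

    import itertools

    # Joint factorises as P(A, B, C) = P(A) P(C) P(B | A, C).
    P_A = [0.6, 0.4]                       # P(A = a)
    P_C = [0.7, 0.3]                       # P(C = c)
    P_B = {(a, c): [1 - 0.2 * (a + c + 1), 0.2 * (a + c + 1)]
           for a in (0, 1) for c in (0, 1)}  # P(B = b | A = a, C = c)

    def joint(a, b, c):
        return P_A[a] * P_C[c] * P_B[(a, c)][b]

    # Marginal P(B = 1) by summing the joint over the other variables:
    p_b1 = sum(joint(a, 1, c) for a, c in itertools.product((0, 1), (0, 1)))
    print(p_b1)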
10 Sampling and MCMC

You need to know what the following terms mean:

Univariate sampling
- Sampling from a distribution with only one variable.
- E.g., Gaussian (normal), binomial.

Ancestral sampling
- Involves sampling from a joint distribution in a way that respects the conditional dependencies of a probabilistic model: each node is sampled given its already-sampled parents.
- E.g., Bayesian networks.

Rejection sampling
- A method to sample from complex distributions by generating samples from a simpler distribution and rejecting those that don't meet a certain condition.
- Often inefficient.

Marginal distribution
- P(B, E) = Σ_{A,C,D} P(A, B, C, D, E): the probabilities of a subset of variables, obtained by summing over the other variables in the joint distribution.

Conditional distribution
- The probability of a variable given that another variable takes a certain value.
- E.g., for P(B, D | E = 1): (1) sample from the joint distribution P(B, D, E); (2) reject any samples where E ≠ 1.

Markov chain
- A sequence of random variables where the probability of each variable depends only on the previous one.
- Sampling from a Markov chain is a special case of ancestral sampling.

Homogeneous Markov chain
- If the state transition probabilities are constant over time for all nodes, the chain is homogeneous.
- State transition probabilities represent the probability of transitioning from the current state z_m to z_{m+1}.
- This probability relies only on the current state.

Initial distribution (in a Markov chain)
- We have an initial distribution p(z_1): a probability distribution over the possible states at the very start of the chain.
- Defines the starting point of the Markov process by assigning probabilities to each possible state that the system can begin in.

Transition distribution (in a Markov chain)
- The probabilities of moving from one state to another in a single time step.
- Defines how the system transitions between different states over time.

Markov chain Monte Carlo (MCMC)
- Used to approximate distributions by constructing a Markov chain.
- The idea is to sample from this chain and use the samples to estimate expected values of interest.
- Used for complex distributions when direct sampling is challenging.
- Samples are not independently drawn, as the method relies on a Markov chain where each new sample is generated based on the previous sample.

Metropolis-Hastings algorithm
- A widely used algorithm in MCMC for generating a sequence of samples from a target probability distribution.
- Defines a single transition probability distribution for a homogeneous Markov chain.
- A way of constructing a Markov chain that enables us to generate a sequence of samples from a distribution that is difficult to sample from directly.
- Algorithm (see the sketch after this list):
  (1) Start with an initial state.
  (2) At each step, a new potential sample x* is proposed based on the current state x; this comes from the proposal distribution q(x*|x).
  (3) The new proposed sample is either accepted or rejected based on an acceptance probability:
      A(x*, x) = min(1, p(x*) q(x|x*) / (p(x) q(x*|x)))
      where p(x) is the target distribution we want to sample from and q(x*|x) is the proposal distribution (which suggests the next candidate state). The proposal x* is accepted with probability A(x*, x); otherwise it is rejected and the chain stays at x.

Metropolis algorithm
- A simplified MH algorithm where the proposal distribution is symmetric: q(x*|x) = q(x|x*).
- The acceptance probability then simplifies to A(x*, x) = min(1, p̃(x*)/p̃(x)), so the target need only be known up to a normalisation constant (p̃ denotes the unnormalised target).
- If p(x*) ≥ p(x), the proposed state is always accepted; if p(x*) < p(x), x* is accepted with probability p(x*)/p(x). This allows the chain to explore low-probability regions.

Target probability distribution
- The distribution we are trying to approximate or sample from.
- Typically a complex posterior distribution that arises in Bayesian inference.

Proposal distribution
- The distribution used to propose new states in the Markov chain.
- The choice of proposal distribution can significantly affect the efficiency of the algorithm and how quickly it converges to the target distribution.

Acceptance probability
- A = min(1, p(x*) q(x|x*) / (p(x) q(x*|x))): the probability used to accept or reject a proposed state in the MH algorithm.
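A minimal sketch of the Metropolis algorithm (symmetric Gaussian proposal, so A = min(1, p(x*)/p(x)); the N(0, 1) target, step size, and chain length are illustrative choices):

    import numpy as np

    def metropolis(log_p, x0, n_samples=10000, step=1.0, seed=0):
        """Metropolis sampler with symmetric proposal q(x*|x) = N(x*|x, step^2).
        log_p: log of the (possibly unnormalised) target density."""
        rng = np.random.default_rng(seed)
        x = x0
        samples = []
        for _ in range(n_samples):
            x_star = x + step * rng.normal()     # propose from q(x*|x)
            log_A = log_p(x_star) - log_p(x)     # log acceptance ratio
            if np.log(rng.uniform()) < log_A:    # accept with prob min(1, ratio)
                x = x_star
            samples.append(x)
        return np.array(samples)

    # Example: sample from N(0, 1); discard a burn-in prefix before use.
    chain = metropolis(lambda x: -0.5 * x**2, x0=5.0)
    samples = chain[1000:]                       # burn-in discarded
    print(samples.mean(), samples.var())         # roughly 0 and 1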
Burn-in
- The first (early) set of samples is often discarded, because the Markov chain takes time to reach the target distribution.
- Samples from this period are far from, and do not yet reflect, the target distribution.

Convergence (in the context of MCMC)
- The point where the samples generated by the Markov chain approximate the target distribution.
- Use the R̂ value to check for convergence.

how a sample from a distribution can be used to approximate an expected value defined by that distribution
- Once we have generated samples from a distribution (after MCMC has converged), we can use those samples to estimate properties of the target distribution.
- E[f] = ∫ f(z) p(z) dz: the expected value of f(z) with respect to the target distribution p(z).
- Computing this integral directly is often infeasible, so we approximate it using the samples generated by MCMC: draw samples z^(1), ..., z^(L) from p(z), evaluate f at each, and calculate the average (the Monte Carlo estimate f̂ = (1/L) Σ_l f(z^(l))).
- E.g., f(z) = z² with z ~ N(0, 1): draw samples z^(1) = 0.3, z^(2) = 0.5, ..., compute f(z^(1)) = 0.09, f(z^(2)) = 0.25, ..., and calculate the average.
- Note the samples are dependent; after the Markov chain has converged, the distribution of the samples will approximate p(z).

that in MCMC we sample from a sequence of distributions and that the samples are not independent
- Samples are drawn sequentially.
- Each sample depends on the previous one due to the Markov property.
- Samples are not independent, unlike in traditional random sampling.
- The next sample is determined by transitioning from the current sample according to the Markov chain's rules.
- Result: samples tend to be correlated (autocorrelated), which reduces the number of effectively independent samples.

why we throw away samples in the burn-in
- The burn-in is the early period of the MCMC chain when the chain is still far from the target distribution.
- During this phase, the samples may not represent the target distribution well, as the chain is still moving from its initial state toward convergence.

why we typically run several chains when doing MCMC
- Running multiple independent chains can be useful for diagnosing convergence in MCMC.
- Since each chain starts from a different initial state, it gives a better idea of whether the Markov chain has fully explored the state space and converged to the target.
- If all chains converge to the same distribution, the chains have mixed well and reached the target distribution.
- If they give different results, the chains may not have converged yet.

that R̂ is a value computed from an MCMC run used to check for convergence; if the run has been successful (i.e. there has been convergence) it will be close to 1
- R̂ compares the variance between multiple chains to the variance within each chain.
- If R̂ ≈ 1, the chains have likely converged to the same distribution and the samples can be trusted.
- If R̂ > 1, the chains have not yet converged, and more iterations may be needed.
- Helps ensure that the samples you're using for estimation truly reflect the target distribution.
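A sketch of the between/within-chain comparison behind R̂ (a simplified Gelman-Rubin statistic, not the exact formula from any particular package), followed by a Monte Carlo estimate:

    import numpy as np

    def r_hat(chains):
        """Simplified R-hat: compares between-chain variance (B)
        to within-chain variance (W). chains: array of shape (M, L)."""
        M, L = chains.shape
        chain_means = chains.mean(axis=1)
        B = L * chain_means.var(ddof=1)            # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
        var_plus = (L - 1) / L * W + B / L         # pooled variance estimate
        return np.sqrt(var_plus / W)               # close to 1 at convergence

    # E.g., several chains from the Metropolis sketch above, each started
    # from a different initial state and with its burn-in removed:
    # chains = np.stack([metropolis(log_p, x0, seed=s)[1000:]
    #                    for s, x0 in enumerate([-5.0, 0.0, 5.0])])
    # print(r_hat(chains))       # ~1 if the run has converged
    # print((chains**2).mean())  # Monte Carlo estimate of E[f], f(z) = z^2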
11 k-means and Gaussian mixtures

You need to know what the following terms mean:

Clustering
- Refers to the process of dividing data points into groups such that points within a cluster are more similar to each other than to points in other clusters.
- k-means clustering: a common method of hard clustering, where each point belongs to exactly one cluster.

Soft clustering
- Data points are not assigned to a single cluster, but instead have a probability of belonging to each cluster.
- E.g., the Gaussian mixture model (allows data points to belong to multiple clusters).

Gaussian mixture model (GMM)
- Likelihood: p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k); equivalently p(x) = Σ_z p(z) p(x|z), where π_k is the mixing coefficient.
- A probabilistic model that assumes all data points are generated from a mixture of several Gaussian distributions.
- Each Gaussian represents a different cluster.
- Performs soft clustering by assigning a probability for each data point belonging to each cluster.
- Problems: (1) possible singularities; (2) symmetry/nonidentifiability; (3) no closed form for the MLE.

Mixing coefficient (π)
- π_k = p(z_k = 1): the probability that a data point belongs to cluster k.
- Represents the weight or proportion of the k-th Gaussian component in the mixture model.
- These coefficients sum to 1: Σ_k π_k = 1.

Latent variable (z)
- A variable that is not directly observed, but rather inferred from other observable data.
- Represents the hidden class or cluster assignment of each data point.
- For each data point, the latent variable z_i indicates which Gaussian component the point likely belongs to, although this is probabilistic in GMMs.

Responsibility (in the context of a mixture model)
- γ(z_nk) = p(z_k = 1 | x_n) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j): one component's weighted Gaussian likelihood divided by the sum over all components.
- A term that represents how much each Gaussian component in the mixture model is responsible for generating a given data point.
- For a data point x_n, γ(z_nk) is the probability that the k-th Gaussian component generated x_n.

You need to know...

how the k-means algorithm works
- Iteratively assigns data points to clusters by: (1) minimising the within-cluster variance; (2) updating the centroids; (3) reassigning points based on the new centroids.
- Optimisation: r_nk = 1 if k = argmin_j ‖x_n − μ_j‖² (data point x_n assigned to cluster k) and 0 otherwise; μ_k = Σ_n r_nk x_n / Σ_n r_nk. (See the sketch below.)

that one can do soft clustering by applying MLE to a Gaussian mixture model
- EM algorithm:
  - Alternates between estimating responsibilities (E-step) and maximising the likelihood by updating the parameters (M-step) until convergence.
  - Results in soft clustering, where each point is probabilistically assigned to clusters.
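A minimal sketch of plain k-means (the initialisation scheme and iteration cap are placeholder choices):

    import numpy as np

    def k_means(X, K, n_iters=100, seed=0):
        """Alternate between assigning points to the nearest centroid
        (r_nk) and recomputing centroids (mu_k) until nothing moves."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), K, replace=False)]   # initial centroids
        for _ in range(n_iters):
            # Assignment step: r_nk = 1 iff k = argmin_j ||x_n - mu_j||^2
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            # Update step: mu_k = mean of the points assigned to cluster k
            new_mu = np.array([X[assign == k].mean(axis=0)
                               if (assign == k).any() else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        return mu, assign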
12 The EM algorithm

You need to know that the EM algorithm is an iterative algorithm that attempts to find a value of θ that maximises the log-likelihood ln p(X|θ), where X is the observed data.
- θ:
  - represents the parameters of the model that you are trying to estimate;
  - in a GMM, typically includes the means, covariances, and mixing coefficients for each Gaussian component k.
- Log-likelihood maximisation:
  - EM is an iterative process that maximises the log-likelihood of the observed data X, given the model parameters θ.
  - The goal is to find θ that maximises ln p(X|θ).
  - The algorithm adjusts θ so that the model best fits the observed data.
- EM algorithm: the log-likelihood decomposes as ln p(X|θ′) = L(q, θ′) + KL(q ‖ p), where q(Z) is a distribution over the latent variables and p = p(Z|X, θ′).
  - E-step:
    - Calculate the expected value of the log-likelihood function, given the current estimate of the parameters θ′.
    - We set q(Z) = p(Z|X, θ′), which maximises L(q, θ′) by making KL(q ‖ p) = 0, because KL(q ‖ p) = 0 when q = p.
    - This increases L(q, θ′) but does not increase ln p(X|θ′): we haven't changed the parameters yet.
  - M-step:
    - Update the parameters θ′ → θ″ to maximise L(q, θ), keeping q fixed. This increases ln p(X|θ), since KL(q ‖ p) ≥ 0: changing p from p(Z|X, θ′) to p(Z|X, θ″) causes KL(q ‖ p) to increase from 0 to some positive value.

You need to know that there is no guarantee that the EM algorithm will succeed in maximising the log-likelihood; it may converge to a local maximum which is not a global maximum of the log-likelihood function.
- EM is not guaranteed to find the global maximum of the log-likelihood function.
- Instead, the algorithm may converge to a local maximum rather than the global maximum.
- Since EM uses an iterative process, it depends heavily on the initial values of the parameters.
- If the starting values are close to a local maximum, the algorithm can become trapped there.
- Therefore, we run the EM algorithm multiple times with different initialisations to increase the chances of finding the global maximum.
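A minimal sketch of EM for a Gaussian mixture (assuming numpy and scipy; initialisation and iteration count are placeholder choices, and a production version would guard against the singularities noted above, e.g. by adding a small ridge to each covariance):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        """EM for a GMM: the E-step computes responsibilities gamma(z_nk);
        the M-step updates pi_k, mu_k, Sigma_k by maximum likelihood."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        pi = np.full(K, 1.0 / K)
        mu = X[rng.choice(N, K, replace=False)]
        Sigma = np.stack([np.eye(D)] * K)
        for _ in range(n_iters):
            # E-step: gamma_nk = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)
            gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                              for k in range(K)], axis=1)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M-step: re-estimate the parameters from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        return pi, mu, Sigma, gamma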
