Stats AI II Final Review PDF
Summary
This document contains lecture notes for Stats AI II, covering topics such as survival analysis, regression models with survival response, hazard functions, Cox's proportional hazards model, and various machine learning algorithms including PCA, gradient descent, and neural networks. The notes also discuss clustering methods, recurrent neural networks (RNNs), and transformers.
Full Transcript
Lecture 3-4 - Survival Analysis

Survival analysis concerns a special kind of outcome variable: the time until an event occurs.
○ Important complication: some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored.

Survival and Censoring Times
○ For each individual, we suppose that there is a true failure or event time T, as well as a true censoring time C.
○ The survival time T represents the time at which the event of interest occurs; the censoring time C is the time at which censoring occurs.
○ If the event occurs before censoring, then we observe the true survival time T; if censoring occurs before the event, then we observe the censoring time C. (Check for equations on this page: slide 4.)

The Kaplan-Meier Estimate
○ Named after Edward Kaplan and Paul Meier.
○ Each point in the solid curve shows the estimated probability of surviving past the time indicated on the horizontal axis.

The Log-Rank Test
○ d_k are the unique death times among the non-censored patients, r_k is the number of patients at risk at time d_k, and q_k is the number of patients who died at time d_k.
○ We further define r_1k and r_2k to be the number of patients in Groups 1 and 2, respectively, who are at risk at time d_k, and q_1k and q_2k to be the number in each group who died at time d_k. Note that r_1k + r_2k = r_k and q_1k + q_2k = q_k.

Details of the Test Statistic
○ At each time d_k, we construct a 2x2 table of counts of the form shown above.
○ Note that if the death times are unique (i.e. no two individuals die at the same time), then one of q_1k and q_2k equals one and the other equals zero.
○ Check the table on slide 20.

The Log-Rank Test: Main Idea
○ Definitely write this down on the flashcard.

The Final Result
○ Check out the final formula (slide 22).
○ When the sample size is large, the log-rank test statistic W has approximately a standard normal distribution. This can be used to compute a p-value for the null hypothesis that there is no difference between the survival curves in the two groups.
○ The log-rank test is closely related to Cox's proportional hazards model.

Regression Models with a Survival Response
○ Consider the task of fitting a regression model to survival data: we wish to predict the true survival time T.
○ Since the observed quantity Y = min(T, C) is positive and may have a long right tail, we might be tempted to fit a linear regression of log(Y) on X. Censoring again creates a problem, though.
○ To overcome this, make use of a sequential construction, similar to the idea behind the Kaplan-Meier survival curve.

The Hazard Function
○ Also called the hazard rate or the force of mortality.
○ It is the death rate in the instant after time t, given survival up to that time. Write down this formula! (It is collected, along with the other flagged formulas, in the reference block at the end of this lecture's notes.)
○ The hazard function is the basis for the proportional hazards model.

Cox's Proportional Hazards Model
○ Assumes a specific form for the hazard: write down the formula.
○ Baseline hazard: the hazard function for an individual with features x_i1 = ... = x_ip = 0.
○ The name arises from the fact that the hazard function for an individual with feature vector x_i is some unknown function h_0(t) times a factor that depends only on the features (write the rest of this down; slide 27).
○ What does it mean that the baseline hazard function h_0(t) is unspecified?
  - We make no assumptions about its functional form.
  - We allow the instantaneous probability of death at time t, given that one has survived at least until time t, to take any form.
  - This means the model is very flexible and can capture a wide range of relationships between the covariates and survival time.
○ Our only assumption is that a one-unit increase in x_ij corresponds to an increase in h(t | x_i) by a factor of exp(beta_j).
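As a reference for the formulas flagged above with "write down", these are the standard forms in ISLR-style notation; they should be checked against the slides, since the course may use slightly different symbols.

\[
Y = \min(T, C), \qquad
\delta = \begin{cases} 1 & \text{if } T \le C \ (\text{event observed}) \\ 0 & \text{if } T > C \ (\text{censored}) \end{cases}
\]
\[
\hat{S}(d_k) = \prod_{j=1}^{k} \frac{r_j - q_j}{r_j} \qquad \text{(Kaplan-Meier estimator)}
\]
\[
W = \frac{\sum_{k=1}^{K} \bigl( q_{1k} - \mathbb{E}[q_{1k}] \bigr)}{\sqrt{\sum_{k=1}^{K} \operatorname{Var}(q_{1k})}},
\qquad \mathbb{E}[q_{1k}] = \frac{r_{1k}}{r_k}\, q_k \qquad \text{(log-rank statistic)}
\]
\[
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t} \qquad \text{(hazard function)}
\]
\[
h(t \mid x_i) = h_0(t) \exp\Bigl( \sum_{j=1}^{p} x_{ij} \beta_j \Bigr) \qquad \text{(Cox proportional hazards)}
\]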
Partial Likelihood
○ The magic of Cox's proportional hazards model lies in the fact that it is possible to estimate the coefficients without having to specify the form of h_0(t).
○ To accomplish this, make use of the same "sequential in time" logic that we used to derive the Kaplan-Meier survival curve and the log-rank test.
○ Then the total hazard at failure time y_i for the at-risk observations is... Write this down!!! Write down in flashcards all of the equations you feel will be on the exam.

Lecture 5: PCA

Unsupervised Learning
○ Machine learning with unlabeled data.
○ Goal: discover interesting things about the measurements.
  - Is there an informative way to visualize the data?
  - Can we discover subgroups among the variables or among the observations?
○ Two methods:
  - Principal components analysis and its variants.
  - Clustering: a broad class of methods for discovering unknown subgroups in data.

The Challenge of Unsupervised Learning
○ More subjective than supervised learning: there is no simple goal for the analysis, such as prediction of a response.
○ Techniques for unsupervised learning are of growing importance in a number of fields:
  - Subgroups of breast cancer patients grouped by their gene expression measurements.
  - Groups of shoppers characterized by their browsing and purchase histories.
  - Movies grouped by the ratings assigned by movie viewers.
○ Advantage: it is often easier to obtain unlabeled data (from a lab instrument or a computer) than labeled data.

Low-Dimensional Space in a High-Dimensional Dataset
○ A key task in big data analytics is pattern recognition; throughout, we assume there exist some patterns in the dataset.
○ If a large dataset can be described by a few patterns, then the data points are concentrated around some lower-dimensional space. Lasso regression, which leads to a sparse solution, is one way to find a low-dimensional space: the one spanned by the non-zero features.
○ The process of finding a low-dimensional space in a high-dimensional dataset is called dimensionality reduction.

Principal Component Analysis
○ Finds the most significant component (weighted direction) that describes the dataset, then finds the 2nd most significant component that describes the rest of the dataset.
  - To describe the rest of the dataset, the 2nd component is set perpendicular to the 1st component.
○ By repeating these steps, we find n significant components for an n-feature dataset, essentially finding a new coordinate system that better represents the dataset.
○ If we do a "sparse coding" that keeps only the first few significant components, only the least important components are lost. That is, we have the dataset embedded in the low-dimensional subspace formed by the first few significant components; therefore, dimensionality is reduced.

How to Find the Significant Components
○ Several ways (see the sketch after this lecture's notes):
  - Least squares is probably the least efficient yet easiest-to-understand method: use it to find the most significant direction, take that direction off from the dataset, and use least squares again to find the 2nd direction.
  - Put the dataset in matrix form A. We can find the PCA via the eigenvalue decomposition of A's covariance matrix, AA^T.
  - SVD. To reduce dimensionality, we keep only the first few singular values and set the rest to 0. Eliminate the columns corresponding to the zero singular values to save space.
○ If we keep only the first k singular values, we call it the rank-k truncated SVD. This finds the first k significant components of A.
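For reference, a minimal NumPy sketch (not course code) of the ideas above: PCA via the SVD of the centered data, the covariance-eigenvector view, and the rank-k truncated SVD. The synthetic data, the choice k = 2, and all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))   # roughly rank-5 data: 100 samples x 20 features

A_centered = A - A.mean(axis=0)                 # center each feature
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

k = 2                                           # keep only the first k significant components
scores = U[:, :k] * s[:k]                       # low-dimensional coordinates of each sample
components = Vt[:k]                             # the k most significant directions (rows)
A_k = scores @ components                       # rank-k truncated-SVD approximation of the centered data

# Equivalent view: the top eigenvectors of the covariance matrix give the same directions (up to sign).
cov = A_centered.T @ A_centered / (A.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order: the last k columns correspond to Vt[:k]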
Lecture 6 - RPCA

Low-Rank Approximation
○ Math terminology:
  - Rank: the number of non-zero singular values of a matrix.
  - Low-rank matrix: a matrix that has only a few non-zero singular values.
○ Truncated SVD: a low-rank matrix approximation, i.e. finding the closest low-rank matrix to the original matrix. Write down the equation.
  - Discuss the Frobenius norm on the board; it has many equivalent definitions.
○ PCA ~ low-rank matrix approximation.

PCA
○ Tolerates small noise very well, but is over-sensitive to outliers. This is true for many other methods; it is a common issue for l2-based loss functions.
○ One single extreme outlier can completely change the PCA result.

Robust PCA
○ Designed to overcome this outlier issue.
○ We observe a corrupted data matrix Y = X + S, where X is a low-rank matrix and S is a sparse matrix.
○ Tolerates outliers by putting them into the sparse matrix.

Regularization for Low Rank
○ The original RPCA problem is hard to solve directly because it is non-convex.
○ In Lasso, we use the l1 norm to encourage a sparse solution.
○ For a low-rank matrix, the regularization term is the nuclear norm: the sum of the singular values.
○ With this relaxation, the formula becomes a much easier convex problem. Write down the formula.
○ There are also many non-convex algorithms for RPCA.

Many Applications
○ Video background subtraction
○ Face modeling
○ Image alignment
○ Feature identification
○ Community detection

Lecture 7: Matrix Completion

Matrix Completion
○ Task: try to recover a matrix with missing entries.
○ Different applications:
  - Netflix problem
  - Image inpainting
  - Positioning from distance
  - Link prediction
  - Structure-from-motion
  - Multitask learning

The Formula
○ Write down the formula.

Factorization-Based Formulation
○ Also write this down.
○ Advantages:
  - Additional regularization on the factors can be imposed.
  - The number of unknowns is reduced.
○ Disadvantages:
  - Non-convex optimization.
  - You have to know the exact rank.

Random Sampling
○ The "too sparse" problem is handled by the incoherence assumption; RPCA requires this assumption too.
○ For example 2, we require the observations to be randomly sampled:
  - Uniform with replacement
  - Uniform without replacement
  - Bernoulli
○ Mathematically speaking, it requires at least r(2n - r) samples to recover an n x n rank-r matrix.
○ State-of-the-art methods take on the order of nr log^2(n) random samples.

CUR Sampling
○ For an n x n rank-r matrix X:
  - C: a column submatrix of X
  - R: a row submatrix of X
  - U: the intersection of C and R
○ Theorem: if rank(U) = rank(X), then X = C U^+ R, where U^+ is the pseudo-inverse of U. (A small numerical check appears after this lecture's notes.)
○ Basically, this is matrix completion with column and row sampling.
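A small numerical check of the CUR theorem stated above, under the assumption that the sampled intersection U has the same rank as X. The matrix size, the number of sampled rows/columns, and all names are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 3
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # an exactly rank-r matrix

col_idx = rng.choice(n, size=2 * r, replace=False)      # sample a few columns...
row_idx = rng.choice(n, size=2 * r, replace=False)      # ...and a few rows
C = X[:, col_idx]                                       # column submatrix
R = X[row_idx, :]                                       # row submatrix
U = X[np.ix_(row_idx, col_idx)]                         # intersection of the sampled rows and columns

X_hat = C @ np.linalg.pinv(U) @ R                       # CUR reconstruction: X = C U^+ R
print(np.allclose(X_hat, X))                            # True whenever rank(U) == rank(X)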
Lecture 8-9 - Gradient Descent

○ It is good for finding global minima/maxima if the function is convex.
○ It is also good for finding local minima/maxima if the function is non-convex.
○ It is used for optimizing many models in machine learning:
  - Neural networks
  - Linear regression
  - Logistic regression
  - Support vector machines

Convexity
○ Make sure to know this; ask classmates.
○ All (mathematically well-defined) norms are convex!

Gradient Descent
○ The gradient is the direction in which the function value increases/decreases most steeply at the current point.
○ Idea: start somewhere, then take steps along the negative gradient at the current position (for minimization) until convergence.
○ Convergence: when the change between two steps is < epsilon, or when the gradient is near 0.

Iterative Algorithm
○ The update: write this down, the same update as last semester, where alpha is called the stepsize.

Potential Issues
○ Convexity.
○ Stepsize: GD requires one parameter, the stepsize.
  - Too small = slow convergence.
  - Too big = divergence.
  - Theoretically speaking, convergence requires alpha < 2/L, where L is the so-called Lipschitz constant (of the gradient) that describes how smooth the function is.

Second-Order Methods
○ Refers to methods that use information from the second derivative: Newton's method, BFGS.
○ GD is a first-order method.
○ Second-order methods can use the second derivative to pick an "optimal" step size and to escape from saddle points; they typically converge faster.
○ However, there is a trade-off: the computational cost.
  - The first derivative of f: R^n -> R is an n-dimensional vector.
  - The second derivative of f: R^n -> R is an n x n matrix, called the Hessian.

Stochastic Gradient Descent
○ Most loss functions in machine learning problems are separable: write this down.
○ For example: least squares, logistic, cross-entropy. Write all of these down as well.
○ Stochastic gradient: make sure to write this down, you will need it.
○ Algorithm (see the sketch after this block of notes):
  - Given the training set {(x_i, y_i) | i = 1, ..., N}
  - Initialize theta.
  - For the k-th iteration: draw a random subset B and update using the formula. Write it down!!!
○ If |B| = 1, then only one sample is used at a time; this is common in network training.

Learning Rate
○ Stepsize alpha(k): SGD with a constant stepsize does not converge.
○ Make sure to check this out; might not need it, but just in case.
○ Typical strategy: start with a large stepsize and gradually decrease it, alpha(k) -> 0.

Perspectives on SGD
○ The classical optimization literature has the following observations when comparing SGD to GD on convex problems:
  - SGD offers a trade-off between accuracy and efficiency: more iterations, but less gradient evaluation per iteration.
  - Noise is a by-product.
○ Recent studies of SGD for non-convex problems found that, for training deep neural networks:
  - SGD works, and finds a solution faster.
  - SGD finds a better local minimum.
  - Noise matters and helps.
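A minimal sketch, not the course code, contrasting the full-batch GD update theta <- theta - alpha * grad f(theta) with minibatch SGD using a decreasing stepsize, on the least-squares loss mentioned above. The synthetic data, stepsizes, batch size, and iteration counts are illustrative.

import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 10
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=N)

def grad_full(theta):                           # gradient of f(theta) = (1/(2N)) * ||X theta - y||^2
    return X.T @ (X @ theta - y) / N

alpha = 0.1                                     # fixed stepsize for GD
theta = np.zeros(d)
for k in range(200):                            # GD: theta <- theta - alpha * grad f(theta)
    theta = theta - alpha * grad_full(theta)

theta_sgd = np.zeros(d)
for k in range(2000):                           # SGD: use a random minibatch B each iteration
    B = rng.choice(N, size=32, replace=False)
    g = X[B].T @ (X[B] @ theta_sgd - y[B]) / len(B)
    alpha_k = 0.1 / (1 + 0.01 * k)              # decreasing stepsize, since constant-stepsize SGD does not converge
    theta_sgd = theta_sgd - alpha_k * g

print(np.linalg.norm(theta - theta_true), np.linalg.norm(theta_sgd - theta_true))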
Constrained Minimization Problem
○ For an unconstrained minimization problem, any x in R^n can be a solution.
○ For a constrained minimization problem, the solution has to be inside a given set Q ⊆ R^n.

Projected Gradient Descent
○ We can use GD for a constrained problem, but we need one additional step: projection.
○ Consider a constraint set Q. Starting from an initial point x0 ∈ Q, PGD iterates the following equation until a stopping condition is met:
  x(k+1) = P_Q( x(k) - α ∇f(x(k)) )
○ P_Q(.) is the projection operator: it finds the point y ∈ Q that is closest to the input point. Thus the projection operator can itself be viewed as an optimization problem: write this down.

Convex Set
○ If Q is a convex set, the projection has a unique solution.
○ If Q is non-convex, the solution to the projection may not be unique: it can give more than one solution.
○ A set is convex if, given any two points in the set, it contains the whole line segment that joins them.

Understanding the Geometry of Projection
○ Consider a convex set Q and a point x0 ∈ Q. As x0 ∈ Q, the closest point to x0 in Q is x0 itself: the distance between a point and itself is zero.
○ Now consider a point x0 ∉ Q.
  - The circles are l2 norm balls centered at x0 with different radii; points on these circles are equidistant from x0 (with a different l2 distance on each circle).
  - Note that some points on the blue circle are inside Q. The point inside Q which is closest to x0 is the point where the l2 norm ball "touches" Q.
  - In the example, the blue point y is the solution to P_Q(x0) = argmin_{y ∈ Q} (1/2) ||x0 - y||^2.
  - Note that the projection is orthogonal: the blue point y is always on a straight line that is tangent to the norm ball and Q.

PGD
○ An "economical" algorithm if the projection is easy to compute. This is not true for general Q, and there are lots of constraint sets that are very difficult to project onto.
○ Examples: ||X||_0 and others (see the slides).

Definite Matrices
○ M ∈ R^{n x n} is Symmetric Positive Definite (SPD) if x^T M x > 0 for all x ≠ 0; equivalently, M is SPD if all its eigenvalues are positive.
○ A strictly convex function has an SPD Hessian at all points.

More on Definite Matrices
○ M ∈ R^{n x n} is Symmetric Positive Semi-Definite (PSD) if x^T M x >= 0 for all x ∈ R^n; all of its eigenvalues are non-negative. For a Hessian this is denoted ∇²f(x*) ⪰ 0.
○ ∇²f(x*) ⪰ 0 and ∇f(x*) = 0 indicate that x* is a minimizer, but maybe not a strict one.
○ A convex, but not necessarily strictly convex, function has a PSD Hessian at all points.

Neural Net Training
○ Goal: determine how to change the weights to get the correct output (a large change in a weight should produce a large change in the error).
○ Approach:
  - Compute the actual output y.
  - Compare it to the desired output y*.
  - Determine the effect of each weight w on the error = y* - y.
  - Adjust the weights.

Cost (Loss, Error) Function
○ Backpropagation: the weights are the parameters to change.
○ Backpropagation computes the current output, then works backward to correct the error.
○ If the function is smooth, use gradient descent.
○ Linear functions (including the identity function) are not useful as activations, because a combination of linear functions is still linear.

Lecture 16 - CNN

Fully Connected Model
○ We learned that a fully connected model is good for learning a small model.

Consider Learning an Image
○ Some patterns are much smaller than the whole image: a small region can be represented with fewer parameters.
○ The same patterns appear in different places, so they can be compressed. What about training a lot of such "small" detectors, where each detector must "move around"? They can be compressed to the same parameters.

Convolutional Neural Network (CNN)
○ A CNN is a neural network with some convolutional layers (and some other layers).
○ A convolutional layer has a number of filters that perform the convolution operation.

Convolution Operator: f * g
○ Express each function in terms of a dummy variable tau.
○ Reflect one of the functions: g(tau) -> g(-tau).
○ Add a time offset t, which allows g(-tau) to slide along the tau-axis. As t increases, g(t - tau) is g(-tau) shifted along the tau-axis toward +infinity by the amount t.
○ Start t at -infinity and slide it all the way to +infinity. Wherever the two functions intersect, find the integral of their product. In other words, at time t, compute the area under the function f(tau) weighted by the weighting function g(t - tau).

Convolution Operator
○ Formal definitions:
  - Continuous: (f * g)(t) = ∫ f(τ) g(t - τ) dτ
  - Discrete: (f * g)[t] = Σ_τ f[τ] g[t - τ]

Convolution Kernels
○ Take f as our input signal; g is often termed the kernel, mask, or filter.
○ Different kernels can extract different information from the input: blurring, sharpening, embossing, edge detection. (A small discrete example follows below.)
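A minimal sketch of the discrete definition above, (f * g)[t] = Σ_τ f[τ] g[t - τ], implemented directly and checked against np.convolve. The signal and the 2-tap averaging kernel are illustrative choices.

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])               # input signal
g = np.array([0.5, 0.5])                          # kernel (a simple 2-tap blur)

def conv_discrete(f, g):
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]     # g is reflected and slid past f
    return out

print(conv_discrete(f, g))
print(np.convolve(f, g))                          # same result ("full" convolution)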
A Conventional Application: Haar Wavelet Transform
○ An image can be sparsified after applying the wavelet transform: image denoising, image compression.
○ Load the blurred (coarse) image first, then fill in the multi-level texture details.

Back to NNs: Pooling Layer
○ Borrowing the multi-scale idea, a pooling layer downsamples the volume spatially, independently in each depth slice of the input volume.

Two Popular Pooling Methods in Deep Learning
○ Max pooling & average pooling.

LeNet-5 for MNIST
○ Instead of designing the filters by hand, we learn the filters (and weights) in a CNN through training.
○ Note that the illustration omits the activation functions.

U-Net for Image Segmentation
○ This was designed by the same group of people who studied the Haar wavelet transform.

Some Other Well-Known CNNs
○ VGG-16
○ GoogleNet
○ ResNet

Lecture 17 - RNN

Sequential Data
○ Sometimes the sequence of the data matters: text generation, stock price prediction.
○ Simple solution: feedforward neural networks? They have a fixed input/output size and a fixed number of steps.

Recurrent Neural Networks
○ RNNs are networks with loops, allowing information to persist.
○ They have memory that keeps track of the information observed so far.
○ They map from the entire history of previous inputs to each output.
○ They handle sequential data.
○ Recurrent networks offer a lot of flexibility.
○ x_t is the input at time t.
○ h_t is the hidden state (memory) at time t (sometimes we use s for hidden states).
○ y_t is the output at time t.
○ theta, theta_x, and theta_y are distinct weights; the weights are the same at all time steps (sometimes we use W for weights too).
○ RNNs can be thought of as multiple copies of the same network, each passing a message to a successor.
○ The same function and the same set of parameters are used at every time step; they are called recurrent because they perform the same task for each input. (See the sketch below.)

Back-Propagation Through Time (BPTT)
○ Using the generalized back-propagation algorithm, one can obtain the so-called Back-Propagation Through Time algorithm.
○ The recurrent model is represented as a multi-layer one (with an unbounded number of layers), and backpropagation is applied to the unrolled model.
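A minimal sketch of the recurrence described above, h_t = tanh(W_xh x_t + W_hh h_{t-1}) and y_t = W_hy h_t, showing that the same weights are reused at every time step. All dimensions, weight names, and the random inputs are illustrative.

import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid, d_out, T = 4, 8, 3, 10
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))

x_seq = rng.normal(size=(T, d_in))               # one input sequence of length T
h = np.zeros(d_hid)                              # initial hidden state (the "memory")
outputs = []
for t in range(T):                               # the SAME weights are applied at every step
    h = np.tanh(W_xh @ x_seq[t] + W_hh @ h)      # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    outputs.append(W_hy @ h)                     # y_t = W_hy h_t

outputs = np.stack(outputs)                      # shape (T, d_out)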
Vanishing/Exploding Gradients
○ In RNNs, during the gradient back-propagation phase, the gradient signal can end up being multiplied many times.
○ If the gradients are large: exploding gradients, and learning diverges.
○ If the gradients are small: vanishing gradients, and learning becomes very slow or stops.
○ Many solutions have been proposed:
  - Truncated Backpropagation Through Time (for both)
  - Clip the gradients to a certain max value (for exploding gradients)
  - The Long Short-Term Memory (LSTM) architecture (for vanishing gradients)

Truncated Backpropagation Through Time
○ Run forward and backward through chunks of the sequence instead of the whole sequence.
○ Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.

Gradient Clipping

The Problem of Long-Term Dependencies
○ RNNs connect previous information to the present task:
  - This may be enough for predicting the next word in "the clouds are in the sky".
  - It may not be enough when more context is needed: "I grew up in France… I speak fluent French."

Long Short-Term Memory Networks
○ LSTM networks are RNNs capable of learning long-term dependencies.
○ A memory cell uses logistic and linear units with multiplicative interactions:
  - Information gets into the cell whenever its input gate is on.
  - Information is thrown away from the cell whenever its forget gate is off.
  - Information can be read from the cell by turning on its output gate.
○ We define the LSTM unit at each step t to be a collection of vectors in R^d.

The Core Idea: Cell State (Memory Cell)
○ Information can flow along the memory cell unchanged.
○ Information can be removed from or written to the memory cell, regulated by gates.

Gates
○ A way to optionally let information through:
  - A sigmoid layer outputs a number between 0 and 1, deciding how much of each component should be let through.
  - A pointwise multiplication operation applies the decision.

Forget Gate
○ A sigmoid layer, the forget gate, decides which values of the memory cell to reset.

Input Gate
○ A sigmoid layer, the input gate, decides which values of the memory cell to write to.

Vector of New Candidate Values
○ A tanh layer creates a vector of new candidate values ĉt to write to the memory cell.

Memory Cell Update
○ The previous steps decided which values of the memory cell to reset and overwrite; now the LSTM applies those decisions to the memory cell.

Output Gate
○ A sigmoid layer, the output gate, decides which values of the memory cell to output.

Output Update
○ The memory cell goes through tanh and is multiplied by the output gate.

Variants on LSTM
○ Let the gate layers look at the memory cell (peephole connections).
○ Use coupled forget and input gates: instead of separately deciding what to forget and what to add, make those decisions together.
○ Gated Recurrent Unit (GRU):
  - Combines the forget and input gates into a single update gate.
  - Merges the memory cell and the hidden state.
(A single-step LSTM sketch appears below.)
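A minimal sketch of one LSTM step following the gate-by-gate description above: forget gate, input gate, tanh candidate, memory-cell update, output gate. The weight shapes, the concatenated-input convention [h_{t-1}; x_t], and all names are illustrative choices, not the slide notation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)                     # forget gate: which cell values to keep vs. reset
    i = sigmoid(Wi @ z + bi)                     # input gate: which cell values to write to
    c_tilde = np.tanh(Wc @ z + bc)               # vector of new candidate values
    c = f * c_prev + i * c_tilde                 # memory-cell update
    o = sigmoid(Wo @ z + bo)                     # output gate: which cell values to expose
    h = o * np.tanh(c)                           # output update
    return h, c

rng = np.random.default_rng(4)
d_in, d_hid = 4, 8
params = [rng.normal(scale=0.1, size=(d_hid, d_hid + d_in)) for _ in range(4)] + \
         [np.zeros(d_hid) for _ in range(4)]
h, c = np.zeros(d_hid), np.zeros(d_hid)
for t in range(5):                               # run a short random sequence through the cell
    h, c = lstm_step(rng.normal(size=d_in), h, c, params)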
Lecture 18 - Transformer
(Topics: Encoder, Decoder, Attention, Transformers)

Encoder (E) - Decoder (D)
○ RNN: the input sequence is transformed into the output sequence in a one-to-one fashion, handling one element at a time; but the input and output don't necessarily have the same length.
○ Goal of E-D: develop an architecture capable of generating contextually appropriate, arbitrary-length output sequences.
○ Applications: machine translation, summarization, question answering, dialogue modeling.

RNN for Translation
○ Training data are parallel text, e.g., English/French.
○ Build an RNN language model on the concatenation of source and target.
○ This technique is called autoregressive generation, since the word generated at each time step is conditioned on the word generated by the network at the previous step.

(Simple) Encoder-Decoder Networks
○ Limiting design choices:
  - E and D are assumed to have the same internal structure (here, RNNs).
  - The final state of E is the only context available to D.
  - This context is only available to D as its initial hidden state.
○ The encoder generates a contextualized representation of the input (its last state).
○ The decoder takes that state and autoregressively generates a sequence of outputs.

General Encoder-Decoder Networks
○ Abstracting away from these choices:
  - Encoder: accepts an input sequence x_1:n and generates a corresponding sequence of contextualized representations h^E_1:n.
  - Context vector c: a function of h^E_1:n that conveys the essence of the input to the decoder.
  - Decoder: accepts c as input and generates an arbitrary-length sequence of hidden states h^D_1:m, from which a corresponding sequence of output states y_1:m can be obtained.

Popular Architectural Choices for the Encoder
○ A widely used encoder design: stacked Bi-LSTMs.
○ Contextualized representations for each step: the hidden states from the top layers of the forward and backward passes.

Decoder: Basic Design
○ Produce an output sequence one element at a time, until an end-of-sequence marker is generated.
○ This incremental process is guided by the context provided by the encoder as well as by any items generated for earlier states by the decoder.

Decoder: How the Output y Is Chosen
○ Sample from the softmax distribution (OK for generating novel output, not OK for, e.g., translation or summarization).
○ Take the most likely output (doesn't guarantee that the individual choices make sense together).
○ Beam search:
  1. Decode the 4 most likely "words" from the initial state.
  2. Feed each of those into the decoder and keep the 4 most likely two-word sequences.
  3. Feed the most recent word into the decoder and keep the 4 most likely three-word sequences.
  4. When EOS is generated, stop that sequence and reduce the beam by 1.

Attention: Self-Attention in NLP
○ Self-attention learns that "it" can refer to different entities in different contexts.

Self-Attention
○ While processing each word, self-attention allows the model to look at other positions in the input sequence for clues that help build a better encoding of this word.
○ Step 1: create three vectors from each of the encoder's input vectors: a Query, a Key, and a Value (typically of smaller dimension), by multiplying the embedding by three matrices that are learned during training.
○ Step 2: calculate a score (like we have seen for regular attention): how much focus to place on other parts of the input sequence as we encode a word at a certain position. Take the dot product of the query vector with the key vector of the respective word we're scoring.
  - Processing self-attention for the word "Thinking" in position #1, the first score would be the dot product of q1 and k1; the second score would be the dot product of q1 and k2.
○ Step 3: divide the score by the square root of the dimension of the key vectors (more stable gradients).
○ Step 4: pass the result through a softmax operation (all scores positive, and they add up to 1). The softmax score determines how much each word will be expressed at this position.
○ Step 5: sum up the weighted value vectors. This produces the output of the self-attention layer at this position.
○ More details:
  - What we have seen for one word is done for all words (using matrices).
  - We also need to encode the position of the words.
  - It is improved using a mechanism called "multi-headed" attention (kind of like multiple filters in a CNN).
(A numerical sketch of these five steps appears below.)
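A minimal single-head sketch of steps 1-5 above in NumPy: Q/K/V projections, dot-product scores, scaling by sqrt(d_k), a row-wise softmax, and the weighted sum of value vectors. The number of tokens, the dimensions, and the random weights are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, d_model, d_k = 4, 16, 8                       # n tokens, embedding width, query/key/value width
X = rng.normal(size=(n, d_model))                # one embedding per position
W_Q = rng.normal(scale=0.1, size=(d_model, d_k))
W_K = rng.normal(scale=0.1, size=(d_model, d_k))
W_V = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # step 1: queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                  # steps 2-3: dot products, scaled for stable gradients
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)   # step 4: softmax over each row
Z = weights @ V                                  # step 5: weighted sum of the value vectors
print(Z.shape)                                   # (n, d_k): one contextualized vector per position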
Transformers: High-Level Architecture
○ The encoders are all identical in structure (yet they do not share weights).
○ Each one is broken down into two sub-layers:
  - The self-attention sub-layer helps the encoder look at other words in the input sequence as it encodes a specific word.
  - The outputs of the self-attention sub-layer are fed to a feed-forward neural network; the exact same one is independently applied to each position.

Key Property of the Transformer
○ The word in each position flows through its own path in the encoder.
○ There are dependencies between these paths in the self-attention layer.
○ The feed-forward layer does NOT have those dependencies, so the various paths can be executed in parallel (see the sketch below).

The Decoder Side
○ Relies on most of the concepts from the encoder side.
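A closing sketch for the position-wise feed-forward sub-layer and the parallel-paths property above. The two-layer ReLU form is an assumed common choice (the slides may use a different form); the dimensions and weights are illustrative. It checks that applying the same network to the whole matrix equals applying it position by position, which is why those paths can run in parallel.

import numpy as np

rng = np.random.default_rng(6)
n, d_model, d_ff = 4, 16, 32
Z = rng.normal(size=(n, d_model))                # output of the self-attention sub-layer (one row per position)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

ffn_out = np.maximum(0, Z @ W1 + b1) @ W2 + b2   # the exact same weights applied to every row (position)
row_by_row = np.stack([np.maximum(0, Z[i] @ W1 + b1) @ W2 + b2 for i in range(n)])
print(np.allclose(ffn_out, row_by_row))          # True: positions are processed independently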