Datascience 144 A2 PDF
This document provides an introduction to data science concepts, including data types, summary statistics, visualisation, and modelling techniques, based on lecture notes for the module.
R:
- Open-source
- Wide availability of libraries - fast to incorporate new concepts
- Release names linked to the Peanuts cartoon
- Command-line interface
RStudio:
- Integrated Development Environment (IDE) for R
Notebooks:
- Notebooks = documents that contain a mix of code (and its output) and formattable text
- Important tool for reproducible data analysis
R (coding language):
- Inspired by S (also a programming language) -> R is the modern version of S

- CRISP-DM: Cross-Industry Standard Process for Data Mining. In DS141, we will frame what we do in the context of CRISP-DM
- Wickham & Grolemund: model of the tools needed in a typical data science project
- PPDAC: Problem, Plan, Data, Analysis, Conclusions
- O'Neil & Schutt: their view of the data science process

Lecture 4:
Dataset organised in a data matrix:
- Row = observation
- Col = variable
Observation: unit on which we measure data -> animals, plants, people
Variables: particular features of the observations -> attributes, features
Dataset = set of n observations; each observation = p variables

Structured vs unstructured data:
Structured data -> data that can be stored in a table, with every observation having the same structure
- Easy to store, organise, search, order, merge
Turning data into structured data involves:
- Merging data from different sources
- Cleaning data
- Deciding which attributes should be included
Unstructured data -> observations do not all have the same structure
- Eg. webpages, emails, texts, images
- Structured data can be extracted from unstructured data

Data types:
-> the type of the variable
- Determines which analysis methods can be used
- Eg. (in R): logical, numeric, integer, character, complex, factor

Summary of a dataset:
Range:
-> difference between max and min
Median:
-> the "middle" value
- Arrange ascending (small to large)
- Odd number of values: the middle value
- Even: average of the two middle values
Mean:
-> a.k.a. the average
Mode:
-> most frequently occurring value (there can be no mode, or multiple modes)
Mean, mode and median are all measures of central tendency
-> what a "typical" value is likely to be for a particular variable

Dispersion:
-> how much the values are scattered/spread out
- Range
- Interquartile range
- Variance
- Standard deviation
Quartiles:
-> divide the set into 4 equal parts
IQR = Q3 - Q1
Variance:
-> mean of the squared deviations
-> need to summarise the deviations for a set of observations; can't simply add + and - deviations (they cancel out) -> square the deviations
Standard deviation:
-> "average distance" of observations from the mean
-> square root of the variance (makes interpretation easier - original unit of measure)
Outliers:
-> extreme cases = significantly different from the other observations
-> values more than 3 standard deviations from the mean
- The mean is heavily influenced by outliers -> rather use the median then

Relationship between variables:
Correlation:
-> strength of the (linear) relationship between the values of 2 variables
- Between -1 and 1: 1 = perfect positive, 0 = none, -1 = perfect negative

Lecture 5:
Data visualisation:
-> plots, charts, graphs
-> give a good overview, point out errors, help communicate data
Shape of data:
- Distribution of data = described by the arithmetic mean and standard deviation
- Shape of the distribution:
Histogram:
- Choosing the number of bins b is important
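A minimal base-R sketch of these summary measures and a histogram (the vector x is made-up data for illustration):

# summary measures on a small made-up numeric vector
x <- c(2, 4, 4, 5, 7, 9, 12, 13, 13, 31)

range(x); diff(range(x))    # min and max; range = max - min
median(x)                   # "middle" value
mean(x)                     # average
table(x)                    # frequencies; the mode is the most frequent value
quantile(x)                 # quartiles
IQR(x)                      # Q3 - Q1
var(x)                      # variance (squared deviations, divided by n - 1)
sd(x)                       # standard deviation = square root of the variance
cor(x, 2 * x + rnorm(10))   # correlation between two variables (between -1 and 1)
hist(x, breaks = 5)         # histogram; 'breaks' suggests the number of bins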
Lecture 6:
Importance of visualisation?
1. Exploratory data analysis -> get to know the data
2. Error detection -> outliers, cleaning issues, erroneous assumptions
3. Communication -> present findings in a meaningful (pretty) way

What can make a graph misleading?
- Leaving gaps / changing the scale (especially on the vertical axis)
- Emphasising some sections unfairly
- Distorting areas
- Use of 3-D charts
- Pictograms
- Unjust extrapolation

Lecture 7:
Models:
-> a representation of some aspect of reality
-> based on some underlying assumptions (consider/test the validity of the assumptions)

Patterns, variation and covariation in data:
- Patterns = clues about the relationship between variables
- Variation = creates uncertainty
- Covariation = reduces uncertainty
- If 2 variables vary together = use the value of one to make a better prediction about the value of the other

Signal and noise:
What we can see/measure has 2 parts:
1. A predictable mathematical form
2. A random contribution that can't be explained
-> signal and noise
Finding a pattern = attempting to find the signal = learning from the data.
Thus statistical learning = a set of tools for understanding data

Supervised vs unsupervised learning:
1. Supervised = a set of data labelled with the correct answers to learn from -> target variable
2. Unsupervised = no labels to learn from -> identify natural groups based on similarities

Basic types of models:
1. Classification (supervised) -> predict which of a set of classes the observation belongs to
2. Regression (supervised) -> predict the numerical value of some target variable for that observation
3. Clustering (unsupervised) -> group observations together by similarity

Statistical learning:
Estimating f:
- Use the values of X to predict the values of Y
- Aim is to estimate the unknown function f. Why? Prediction and inference

Parametric methods:
- Make an assumption about the form of f
- Assume Y = a + bX
- Bad: the estimated parameters won't match the true form of f exactly
- Good: simplicity and interpretability
Non-parametric methods:
- No explicit assumption about the form of f
- Want to find an f as close to the data points as possible
- Bad: requires more data for accuracy
- Good: can fit a wider range of possible shapes
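A small R sketch contrasting the two approaches on simulated data (the sine-shaped true f and all variable names are assumptions for illustration):

# parametric (straight line via lm) vs non-parametric (loess smoother)
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)   # the true f is nonlinear

par_fit <- lm(y ~ x)      # parametric: assumes Y = a + bX
np_fit  <- loess(y ~ x)   # non-parametric: no fixed form for f

plot(x, y)
abline(par_fit, col = "red")                              # simple, interpretable, but rigid
lines(sort(x), predict(np_fit)[order(x)], col = "blue")   # flexible, but needs more data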
Lecture 8:
Supervised segmentation:
- Segment the data into subgroups (distinct target variables)
- Observations within a subgroup should have similar values for the target variable
- We don't know the value of the target variable when the model is implemented:
  -> use other attributes for the segmentation
  -> an attribute should be informative (reduce uncertainty)
Does an attribute contain important info about the target variable?
-> choose attributes mathematically
-> rank variables based on how good they are at predicting the value of the target variable
- If the variable doesn't directly correlate to the outcome = not an informative attribute

Splitting rules:
-> require a mathematical formula measuring how well each attribute splits a set of observations into groups, with respect to a chosen target variable

Purity measures:
1. Entropy
-> measures how impure a set is with respect to a property of interest (measures disorder)
-> measured in bits
Calculating entropy:

Information gain:
-> measures how much an attribute decreases entropy (decreases impurity) over the segmentation it creates
-> change in entropy due to any amount of new information being added
-> higher information gain = more informative
Calculation:
- Original set = parent set
- The attribute on which we split has k values
- The split creates k children sets (c = a child set)
- If we knew the value of the attribute, how much would it increase our knowledge of the value of the target variable?

Lecture 9:
Decision tree methodology:
1. Start at the top (root node)
2. Decide which attribute to split on (to maximise information gain) = interior node
   -> the child nodes of the new split should vary less than the observations in the root node
3. Repeat the process -> after every split, the child nodes show less variation (w.r.t. the target variable)
Continue until all child nodes are pure (own group / no real variation anymore) = leaf node

Advantages of decision trees:
- Easy to understand
- Induction procedures = elegant
- Easy to describe the rules
- Relatively efficient
- Robust to many common data problems

Recursive partitioning:
- Choose attributes to split on by testing all of them and selecting whichever yields the PUREST subgroups
Stop?
- When all nodes are pure
- When we run out of attributes to split on
- When it gets too complex = "prune" it back
  -> specify a minimum number of observations that must be present in a leaf node
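A base-R sketch of the entropy and information-gain calculations described above (the formulas in the comments are the standard ones; the tiny spam/loans example data is made up):

# entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i, in bits
entropy <- function(target) {
  p <- table(target) / length(target)   # class proportions in the set
  -sum(p * log2(p))
}

# information gain = entropy(parent) - weighted sum of entropy(children)
info_gain <- function(target, attribute) {
  children <- tapply(target, attribute,
                     function(subset) length(subset) / length(target) * entropy(subset))
  entropy(target) - sum(children)
}

# hypothetical example: does "contains_loans" help predict spam?
spam           <- c("yes", "yes", "no", "no", "no", "yes", "no", "no")
contains_loans <- c("yes", "yes", "no", "no", "yes", "yes", "no", "no")
entropy(spam)                    # impurity of the parent set
info_gain(spam, contains_loans)  # how much the split reduces entropy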
Lecture 10:
Bayes' Rule:
Example: spam
- Contains certain words
- Length of the subject line
- A lot of exclamation marks / punctuation
- Many capital letters
** Can't have too many rules for spam, otherwise normal emails will incorrectly be classified as spam

Bayes' rule for spam filters:
-> obtaining a revised estimate of the probability of one event in the light of evidence provided by another
Example of how Bayes' rule helps:
- Without Bayes' rule = initially a 20% probability of spam
- With Bayes' rule, the probability of spam given the word "loans" = 80%
- Thus it helps narrow down spam a lot

Problem:
- Only takes one word into account
- Several words / other features = more accurate
- Likelihood based on simultaneous occurrence
- When doing the calculation with multiple words at once = need lots of memory to store probabilities for all possible intersecting events
Thus we need a simplifying assumption:
-> Naïve Bayes
-> assumption of conditional independence

Some practical problems (and some proposed solutions):
1. Many of the probabilities P(Xi | S) and P(Xi | S^c) will be extremely small
   -> multiplying lots of numbers close to 0 = a problem, called arithmetic underflow
   -> Solution: taking logs
2. The training data may not contain all of the possible options
   -> this causes a 0 probability which overrides the other probabilities
   -> Solution: the Laplace estimator

Laplace estimator:
-> adds a small number to each of the counts in the frequency table (normally +1 to every count and to the total)
= 0/20 becomes 1/24 if 4 words are used for the test
= no more zero probabilities

Frequency tables:
- Used to learn from the data in Naïve Bayes
- Categorical attributes
- If numerical = discretisation -> putting the numbers into bins (like in histograms)
  - Visualising the data helps to see natural bins
  - Quantiles can also be used
  - Too few bins = can obscure important trends
  - Too many bins = noise / unimportant info

Advantages of Naïve Bayes:
- Efficient in terms of memory space and computation time
- Performs well despite the strict independence assumption (not that big an impact)
- Works on small and large datasets
- Incremental learner
- Easy to use the estimated probability for prediction
Disadvantages:
- Relies on an assumption that is not often true in practice
- Not ideal for datasets with many numerical features
- Estimated probabilities are less reliable than the predicted classes
  (better for ranking, where the specific value is not relevant = only the class is needed)
Naïve Bayes is the basis of many personalized spam detection systems, such as the one in Mozilla's Thunderbird.

Lecture 11: (other classifiers)
Linear classifiers:
○ Unlike a decision tree's splits, split the dataset using a straight line (y = mx + c) and classify a client according to which side of the line they fall on
○ Called linear because the decision boundary is a linear combination (weighted sum) of the attributes

Logistic regression:
-> always gives an output between 0 and 1
-> eg. X = balance, p(X) = the predicted probability of default
-> will always produce an S-shaped curve
-> to get the odds of something: odds = p(X) / (1 - p(X))
-> parameters are estimated using maximum likelihood (not in this module)
Default or not? Decide which predicted probability classifies as default and convert it to a table (eg. >= 0.5 as default)

Linear discriminant analysis (LDA):
-> maximises separability between categories
-> creates a new axis that:
   1. maximises the distance between the means
   2. minimises the scatter (variation)
-> under certain conditions, LDA estimates are more stable than logistic regression
-> best when the target variable has more than 2 classes
-> the fitting procedure used to obtain the parameters is the difference between LDA and logistic regression
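A minimal R sketch of the logistic-regression idea above on simulated data (the balance/default variable names and all numbers are made up for illustration, not the lecture's dataset):

# logistic regression with glm(); output is a probability between 0 and 1
set.seed(1)
balance <- runif(500, 0, 2500)
p_true  <- 1 / (1 + exp(-(-6 + 0.004 * balance)))   # an S-shaped true p(X)
default <- factor(rbinom(500, 1, p_true), levels = c(0, 1), labels = c("no", "yes"))

fit   <- glm(default ~ balance, family = binomial)  # fitted by maximum likelihood
p_hat <- predict(fit, type = "response")            # predicted probabilities
pred  <- ifelse(p_hat >= 0.5, "yes", "no")          # eg. >= 0.5 classified as default
table(predicted = pred, actual = default)           # convert to a table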
Support Vector Machine (SVM):
-> popular and versatile
-> tries to fit the "fattest" bar possible between the classes
-> linear decision boundary = the line through the centre of the bar
-> margin = the width of the "fat" bar = want to maximise the margin
-> penalty points if the data is not linearly separated perfectly by the bar; the penalty is proportional to the distance from the boundary
-> the error function is called hinge loss
-> best fit = a balance between a fat margin and a low total error penalty
-> nonlinear SVM = uses different features (functions of the original features), so that a linear discriminant with the new features is a nonlinear discriminant with the original features

k-Nearest Neighbours (kNN):
-> measures the distance between observations
-> makes a prediction based on the classes of those neighbours
-> the choice of k (number of neighbours) influences the decision boundary

Conclusion:
- Different methods produce different boundaries = they optimise different functions
- No one method dominates:
  - Logistic regression / LDA when the decision boundaries are linear
  - kNN when the decision boundaries are more complicated

From the textbook:
Parameterized model:
-> the goal of the data mining is to tune the parameters so that the model fits the data as well as possible
-> parameter learning or parametric modelling
Optimizing an objective function:
-> define an objective function that represents our goal
-> it can be calculated for a particular set of weights and data
-> we choose the objective function based on faith and experience = difficult to match the true goal
-> find the optimal values of the weights by maximising or minimising the objective function
-> weights can only be interpreted as importance indicators if the attribute values have been normalised, so that they are all in the same range
Choice of objective function is one of the most important and fundamental ideas in data science.
Linear discriminant functions for scoring:
-> if we do not need a precise probability estimate, we can use a score that ranks cases by their likelihood of belonging to one class or the other
-> linear discriminant functions give us such a ranking for free
-> f(x) itself (the output of the linear discriminant function) gives an intuitively satisfying ranking of the instances by their (estimated) likelihood of belonging to the class of interest
-> f(x) is small when x is near the boundary, large when x is far from the boundary, and 0 on the boundary

Lecture 12:
Revisiting spam filters:
Text is difficult to deal with:
- Unstructured data
- "Dirty" data (spelling mistakes, grammar mistakes, random punctuation, etc.)
- Context matters
- Different languages

Text mining:
-> the process of deriving information (typically numerical) from unstructured text data, by considering attributes like patterns, topics and keywords
-> goal: get the text data into a format that can be used as input for data mining algorithms

Terminology:
- Document (observation): one piece of text (eg. an email)
- Terms / tokens (attributes): components of a document (typically words)
- Corpus (dataset): a collection of documents
- Document-Term Matrix (DTM): data representation -> rows = documents, cols = terms
- Term-Document Matrix (TDM): transpose of the DTM -> rows = terms, cols = documents

Bag of words:
- Transform unstructured into structured = from a sequence of words in free-form text into feature-vector form
- Treat each document as a collection of individual words
- Ignore grammar, word order, sentence structure, punctuation
- Each word = potentially important (the words don't yet have a value)
- Advantage: straightforward, inexpensive and works well in practice
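A small base-R sketch of the bag-of-words idea: building a tiny document-term matrix for a made-up three-document corpus (no text-mining packages assumed):

# rows = documents, columns = terms, cells = term counts
corpus <- c(doc1 = "cheap loans apply now",
            doc2 = "meeting moved to monday",
            doc3 = "apply now for cheap cheap loans")

tokens <- strsplit(tolower(corpus), "\\s+")    # tokenise each document into words
terms  <- sort(unique(unlist(tokens)))         # the vocabulary (all distinct terms)
dtm    <- t(sapply(tokens, function(w) table(factor(w, levels = terms))))
dtm                                            # the document-term matrix (DTM); t(dtm) is the TDM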
Tokenisation:
-> give the words a value
1. Binary term frequency
2. Term frequency (TF)
3. Inverse document frequency (IDF)
4. Term frequency-inverse document frequency (TFIDF)

1. Binary term frequency:
-> records the presence or absence of a term in a document
   1: the term is present in the document
   0: not present
- Bad: doesn't take importance (frequency) into account

2. Term frequency (TF):
-> measures the prevalence of a word in a particular document
- The number indicates the occurrences of the term in the document (can be normalised w.r.t. document length)

3. Inverse document frequency (IDF):
-> weights terms based on how common they are in the entire corpus
-> terms shouldn't be too rare or too common = impose upper/lower limits on term frequency
-> better: consider the distribution of terms over the entire corpus

4. Term frequency-inverse document frequency (TFIDF):
-> a combination of TF (term frequency) and IDF (inverse document frequency)
-> calculated at document level (not corpus level like IDF)
Thus: TFIDF = a very common value representation for terms, but not necessarily optimal

Apply normalisation?
- Takes document length into account
- Use relative counts instead of absolute term frequency counts (term frequency / total nr of terms in the doc)
- Experiment with different representations to see which produces the best results

Bag of (individual) words:
Advantages:
○ Simple, no linguistic analysis required, performs well
Disadvantages:
○ Word order not considered

n-gram sequences:
-> word order can convey information (thus important)
-> make sequences of adjacent words the terms
-> often used in combination, eg. "bag of n-grams up to three":
   - Individual words = features
   - Adjacent word pairs
   - Adjacent word triples
-> Advantages:
   - Easy to generate
   - No linguistic knowledge required
-> Disadvantages:
   - Greatly increases the size of the feature set
   - Many n-grams will be extremely rare

Named entity extraction:
-> common named entities can be important
-> a word on its own = useless, but together with another word = unique and interesting
   Eg. game of thrones vs Game of Thrones
-> bag of words / n-grams may not capture these
-> knowledge intensive: needs lots of data / manual coding (based on domain knowledge)
-> entity extractors are also available for specific contexts

Text representation in R:

Lecture 13:
Visualising text data:
1. Case normalisation
   -> typically case doesn't matter
   -> normalise case by converting everything to lower case
2. Stemming
   -> need to treat related terms as a single concept (tenses, singular/plural etc. = the same)
   -> stemming = strip words to their root form (eg. by removing suffixes)
3. Stopwords
   -> occur frequently
   -> language specific
   -> do not provide useful information
   -> removed during pre-processing of the text data
4. Other pre-processing in R:
   - Remove punctuation, numbers and unnecessary whitespace
Caution: context is important!!
- Special stopwords can be imported (eg. book titles)
- Numbers can convey meaning (eg. Web 2.0, 4GB)
- Abbreviations (eg. IR)
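A base-R sketch combining the pre-processing steps above (case normalisation, punctuation removal) with the TF, IDF and TFIDF weights; the corpus and the particular IDF formula used (log of the number of documents over the document frequency) are illustrative assumptions:

docs <- c("Cheap loans, apply now!", "Loans meeting on Monday.", "Apply now for cheap loans")
docs <- gsub("[[:punct:]]", "", tolower(docs))   # case normalisation + remove punctuation

tokens <- strsplit(docs, "\\s+")
terms  <- sort(unique(unlist(tokens)))
tf     <- t(sapply(tokens, function(w) table(factor(w, levels = terms))))  # term counts per document

tf_norm <- tf / rowSums(tf)                  # relative counts: adjust for document length
idf     <- log(nrow(tf) / colSums(tf > 0))   # common terms get a low weight, rare terms a high one
tfidf   <- sweep(tf_norm, 2, idf, `*`)       # TFIDF = TF weighted by IDF
round(tfidf, 3)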
Lecture 14:
Correlation:
- The correlation coefficient measures the direction and strength of a linear relationship between 2 variables
- Value between -1 and 1
- Correlation does not imply causation (one doesn't cause the other)
A correlation can arise without direct causation in different ways:
- Reverse causation (windmills and wind)
- Lurking variables (an unseen factor, eg. ice cream sales and drownings: swimming is the unseen factor)
- Spurious correlation (appears to be causal but is not, eg. eating bananas and getting a divorce)

Regression analysis:
- Assume the target variable is influenced by other attributes
- Use info from the attributes to predict/describe changes in the target variable
- Deterministic relationship: the value of the target variable is uniquely determined by the values of the attributes, eg. p = mv (physics)
- The relationship is often stochastic:
  ○ Some relevant attributes are omitted
  ○ Real-life randomness (noise)

Simple linear regression:
- 1 target variable; 1 attribute
- Assume that f(x) is linear (now we just need to calculate the B's, called the parameters of the model)
- We don't know the true values of the parameters, so we use the data to produce estimates
- Difference between the predicted value and the actual value = residual
- Attempt to find a line which is as close as possible to the data points
- A common approach to find the closest line is ordinary least squares (OLS)
  -> minimises the sum of the squared differences between the observed and predicted values (SSE)
Worked example (with n = 5, mean(x) = 3, mean(y) = 2):
  s_xy = (sum of xy - n * mean(x) * mean(y)) / (n - 1) = (37 - 5(3)(2)) / (5 - 1) = 1.75
  s_x^2 = (sum of x^2 - n * mean(x)^2) / (n - 1) = (55 - 5(3)^2) / (5 - 1) = 2.5
  b1 = s_xy / s_x^2 = 1.75 / 2.5 = 0.7
  b0 = mean(y) - b1 * mean(x) = 2 - (0.7)(3) = -0.1

Model assumptions:
The random component relates to the errors of estimation. Four basic assumptions about the general form of the probability distribution of the error term:
1. The mean of the probability distribution of the error is 0
2. Its variance is constant for all settings of x
3. The probability distribution of the error is normal
4. The errors associated with any 2 different observations are independent
   = the error for one Y value has no effect on the errors of the other Y values

Extensions to multiple linear regression:
- Has p attributes, so it takes longer to find the estimates (have to solve p + 1 simultaneous linear equations)
- Same assumptions about the error term
- In R this is easy (see the sketch below)
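A short R sketch of simple and multiple linear regression with lm(), on simulated data (variable names and values are illustrative, not the course dataset):

set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 1 + 0.7 * x1 - 0.3 * x2 + rnorm(50, sd = 0.5)

simple   <- lm(y ~ x1)        # one attribute: least-squares estimates of b0 and b1
multiple <- lm(y ~ x1 + x2)   # p attributes: p + 1 coefficients

coef(simple)                  # parameter estimates
resid(simple)[1:5]            # residuals = observed - predicted
summary(multiple)             # estimates, residual standard error, R-squared, etc.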
Lecture 15:
Similarity:
-> if 2 entities are similar they often share characteristics
- Find new customers that are similar to current best customers
- Classify entities based on similar entities
- Group similar entities together into clusters
- Recommend similar products or services
To discuss similarity between entities we need a data representation
- Use Euclidean distance to compare the similarity of one pair of instances to that of another pair
- Unitless = no meaningful interpretation on its own

Different types of distances:
- Euclidean = as the crow flies
- Manhattan = taxicab distance (city block distance)
- Minkowski = generalised form of the Manhattan and Euclidean distances
- Canberra = weighted version of Manhattan
- Maximum distance = the maximum distance between two components of the two entities
- Jaccard distance = complement of the proportion of characteristics shared by two entities
For text classification:
- Cosine distance
- Edit distance

1. Manhattan distance = sum of the pairwise (absolute) distances
2. Jaccard distance
   ○ Treats objects as sets of characteristics
   ○ Considers the union and intersection of the sets
   Jaccard index:
   ○ proportion of all characteristics that are shared between the 2 objects
   Jaccard distance:
   ○ 1 - Jaccard index

Nearest neighbours:
-> shortest distance between entities
-> from A to B = use B - A everywhere

k-Nearest Neighbours (kNN):
-> k = the number of nearest neighbours considered
-> when a new observation is added = find its nearest neighbours in the dataset
-> use the target values of the nearest neighbours to help make a prediction for the new observation

Choice of k:
-> any value of k from 1 to n is technically possible
-> 2-class problem = use odd numbers to prevent ties (in the case of a majority voting scheme)
-> larger values of k imply estimates that are more smoothed out among the neighbours
-> mostly a trial-and-error approach
k = n:
-> uses the entire dataset -> predicts the majority class in the case of classification
k = 1:
-> has the danger of modelling noise

Combining votes:
-> most obvious = majority voting
-> alternatively, determine a weight for the contribution of each nearest neighbour; with weighting, the choice of k is not so important
- Categorical variables should be encoded numerically (eg. 1 = Owner, 2 = Renter, 3 = Other)
- Numerical values should be rescaled, so that eg. a difference of 10 in age counts the same as a difference of 10 in income
  -> use for example z-score standardisation / min-max normalisation
- Important not just to classify a new example, but also to estimate a probability (kNN can be used for this too)
- kNN for classification = majority vote of the targets
- kNN for regression = average or median of the targets
  -> eg. to predict income using 3NN: predict David's income as the average of the incomes of Rachael, John and Norah
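A base-R sketch of the distance calculations and a "by hand" kNN majority vote (the small training table, the new observation and k = 3 are made up for illustration):

train <- data.frame(age    = c(25, 30, 45, 50, 23, 40),
                    income = c(40, 55, 90, 120, 35, 80),
                    owner  = c("no", "no", "yes", "yes", "no", "yes"))

# rescale the numeric attributes first (z-score standardisation)
X   <- scale(train[, c("age", "income")])
new <- scale(matrix(c(35, 60), nrow = 1),
             center = attr(X, "scaled:center"), scale = attr(X, "scaled:scale"))

diffs <- sweep(X, 2, as.numeric(new))   # subtract the new point from every training point
d_euc <- sqrt(rowSums(diffs^2))         # Euclidean distance
d_man <- rowSums(abs(diffs))            # Manhattan (city block) distance

k  <- 3
nn <- order(d_euc)[1:k]                                   # the k nearest neighbours
names(sort(table(train$owner[nn]), decreasing = TRUE))[1] # majority vote prediction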
- kNN = considered a lazy learning algorithm (not really learning from the data, just storing the training data)
  - No abstracted model (means predictions can be slow)
  - a.k.a. instance-based learning
- Non-parametric method:
  ○ Limits the ability to understand how the classifier uses the data
  ○ Natural patterns are found, instead of trying to fit the data to a preconceived functional form

Strengths and weaknesses of kNN:
Strengths:
- Simple
- Effective
- No assumptions about the underlying data necessary
- Training is fast
Weaknesses:
- No interpretable model
- Selection of an appropriate k can be difficult
- Classification of new observations can be slow
- Categorical attributes and missing data need pre-processing

Curse of dimensionality:
-> if working with many dimensions, even the nearest neighbour can be far away
-> dimensionality reduction = a good idea when using kNN
-> feature selection = one way of doing this
   - Filters
   - Wrappers
   - Embedded methods

Lecture 16:
Recap:
- Clustering = a form of unsupervised learning
- Aim = find groups of entities where entities within a group are similar, but entities in different groups are not so similar

Hierarchical clustering:
Dendrogram:
- Shows the hierarchy of clusters
- As y increases, clusters are merged, until there is one single cluster at the top
- Bottom-up / agglomerative clustering (each entity starts as its own cluster)
- "Clip" the dendrogram with a horizontal line to get clusterings with different numbers of clusters
- Easy to see the groupings before deciding on the number of clusters
- Clusters are merged based on a chosen similarity/distance function
- Merges near the bottom = very similar; merges near the top = quite different
- Can't draw conclusions about similarity from proximity along the x axis
- Draw conclusions based on the location on the y axis

Hierarchical clustering algorithm:
1. Decide on the similarity measure to be used (often Euclidean distance)
2. Start with n individual clusters
3. Merge the 2 entities (clusters) that are most similar (n - 1 clusters)
4. Merge the next 2 clusters that are most similar (n - 2 clusters)
5. Repeat until all observations are in one single cluster

Linkage functions:
- Define the dissimilarity between 2 groups of observations
Types:
○ Single (smallest pairwise distance) - not preferred (extended, trailing clusters)
  ▪ Nearest neighbour
  ▪ The distance between any two clusters is the shortest distance from any point in one cluster to any point in the other
○ Complete (largest pairwise distance)
  ▪ Furthest neighbour
  ▪ The distance between any two clusters is the maximum distance from any point in one cluster to any point in the other
○ Average (average pairwise distance)
  ▪ The distance between any two clusters is the average distance from all individuals in one cluster to all individuals in the other
- Complete/average = preferred -> more balanced clusters
○ Other types: Ward, centroid
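A base-R sketch of agglomerative hierarchical clustering on simulated two-group data, showing the choice of distance and linkage and how the dendrogram is "clipped" (all values are made up):

set.seed(3)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # one group of points
           matrix(rnorm(20, mean = 4), ncol = 2))   # a second, separated group

d  <- dist(x, method = "euclidean")    # step 1: choose a similarity/distance measure
hc <- hclust(d, method = "complete")   # complete linkage ("single", "average", "ward.D2" also possible)

plot(hc)               # dendrogram: merge height on the y axis
cutree(hc, k = 2)      # clip the tree to obtain a 2-cluster solution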
Lecture 17:
k-means clustering:
- Focus on the clusters themselves
- Each cluster is represented by its centroid
- k-means = the most popular centroid-based clustering algorithm
Algorithm:
1. Specify the value of k (nr of clusters required)
2. Randomly assign a value from 1 to k to each of the observations (initial cluster assignment)
   a. Compute the cluster centroid for each of the k clusters
   b. Assign each observation to the cluster whose centroid is closest = get the distance from the observation to all k cluster centroids
   c. Repeat until the cluster assignments stop changing (fully sorted)
- The clustering depends highly on the starting configuration (initial centroid locations)
  -> best to run k-means multiple times, with different initial assignments
- The choice of k is important
  -> experiment with different values of k
  -> compare results from different clusterings by calculating the clusters' distortion
     = sum of squared differences between each data point and its corresponding centroid
     -> lowest distortion = best
- No single right answer; clustering solutions are often useful to highlight interesting aspects of the data
- Supervised learning can be used for generating cluster descriptions, by combining unsupervised and supervised learning: the cluster assignments are used as labels for the examples, and a supervised learning algorithm generates a classifier for each cluster. This classifier provides a differential description of what distinguishes each cluster from the others. The clusters are treated as classes, and a decision tree is used to describe the differences between clusters. The result is a concise, intelligible description that helps to differentiate the clusters effectively.

Lecture 18:
Evaluating classification models:
Measures we will consider:
- Accuracy
- Precision
- Recall
- F-measure
- Specificity
- Also: misclassification error rate, positive predictive value, negative predictive value, sensitivity, true positive rate, true negative rate, false positive rate

1. Accuracy:
-> proportion of correct decisions
- Can also be calculated as 1 - misclassification error rate
Potential problems with relying on accuracy:
○ A model can be accurate for a specific dataset but still not be useful, eg. 95% accuracy from always saying a person is healthy
○ The class distribution of the target variable may be unbalanced, or skewed. Such an imbalance can have an impact when we attempt to maximise the accuracy of a classifier
○ Instead of accuracy, other measures should be considered, especially when the cost of different errors (false positives and false negatives) varies significantly

Confusion matrices:
○ Consider a binary classification (+ and -)
○ Should not only focus on correct decisions
  - Distinguish between correctly identified positives and correctly identified negatives
  - Also distinguish between the 2 possible types of incorrect decisions, i.e. a positive classified as negative and vice versa
○ The different types of correct and incorrect decisions can be summarised in a confusion matrix, which compares the actual labels to the predicted labels

2. Precision
-> how often the classifier is correct when it predicts a positive outcome
○ Measures accuracy over the cases predicted to be positive
○ Probability that the classifier is correct when predicting a positive outcome
○ = positive predictive value (the negative predictive value is the same idea for negative predictions)

3. Recall
-> how often the classifier is correct on all positive instances
○ Probability that a positive case will be correctly classified as such
○ = true positive rate / sensitivity

4. F-measure
-> precision vs recall
○ Sometimes one is more important than the other, eg. medical diagnosis = recall, recommender systems = precision
○ Inherent trade-off between recall and precision
○ Harmonic mean of precision and recall
○ (In the lecture's worked example, F-measure = 0.3333)

5. Specificity
-> true negative rate
-> = 1 - false positive rate
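A base-R sketch computing a confusion matrix and the measures above from hypothetical predicted and actual labels:

actual    <- factor(c("pos","pos","pos","neg","neg","neg","neg","neg","pos","neg"),
                    levels = c("pos", "neg"))
predicted <- factor(c("pos","neg","pos","neg","neg","pos","neg","neg","neg","neg"),
                    levels = c("pos", "neg"))

cm <- table(predicted, actual)           # confusion matrix: predicted vs actual labels
TP <- cm["pos", "pos"]; FP <- cm["pos", "neg"]
FN <- cm["neg", "pos"]; TN <- cm["neg", "neg"]

accuracy    <- (TP + TN) / sum(cm)                            # proportion of correct decisions
precision   <- TP / (TP + FP)                                 # positive predictive value
recall      <- TP / (TP + FN)                                 # true positive rate / sensitivity
f_measure   <- 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
specificity <- TN / (TN + FP)                                 # true negative rate = 1 - false positive rate
c(accuracy = accuracy, precision = precision, recall = recall,
  F = f_measure, specificity = specificity)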
Issues to keep in mind:
- Baseline performance
- Unbalanced classes
  ○ Artificially rebalancing
  ○ Unequal costs

Textbook:
In discussing classifiers:
- "Positive" examples refer to bad outcomes worthy of attention or alarm
- "Negative" examples refer to normal or good outcomes that are uninteresting or benign
- This terminology helps in maintaining a general orientation across different domains. Eg. a medical test detecting a disease is a positive result, while a healthy result is negative. Fraud detection works similarly, where the fraud cases are the positives.
- The positive class is often rare compared to the negative class, leading to more false positive errors, but the cost of false negative errors is higher.

Lecture 19:
A confusion matrix will not work if we want to predict a numerical outcome.

Simple linear regression:
Residual standard error (RSE):
-> estimate of the standard deviation of the error term
-> the average amount that the response will deviate from the true regression line
-> a way of measuring the lack of fit of the model to the data
- Small RSE:
  ○ Predictions close to the true values
  ○ Model fits the data well
- Large RSE:
  ○ Model doesn't fit the data well
(small vs large must be judged in the context of the data)

Coefficient of determination (R^2):
- Always between 0 and 1 (independent of the scale of Y)
- A measure of the proportion of variability in Y that can be explained using X
- Close to 1 = a large proportion of the variability in Y has been explained by the regression model (good fit)
- Close to 0 = the regression model didn't explain much of the variability in the response
- In simple linear regression, R^2 is the square of the correlation between X and Y
- Include the variable with the 2nd-highest correlation too: in the lecture example, R^2 increased from 0.7262 to 0.7596
- R^2 will always increase if another variable is added to the model

Adjusted R^2:
- Adjusts for the number of variables in the model
- Only increases if the added variable improves the model fit more than would have been expected by chance
- Use when comparing regression models of different sizes

Cluster models:
- Supervised learning: check model fit by comparing predicted vs actual values
- Unsupervised = not possible
- Should try different clusterings and choose the one with the most useful/interpretable solution
- Also look for patterns which consistently emerge
- When thinking of a suitable value for k:
  ○ Try different values of k
  ○ Plot the within-cluster sum of squares (a measure of the variability of observations within each cluster)
  ○ Plot the between-cluster sum of squares (measures the variation between all the clusters)
  ○ Small values = compact clusters, large values = spread-out clusters
  ○ Look for the "elbow point"
  ○ Use prior knowledge of the context of the problem to help select a value for k (look at cluster sizes & means)
"Elbow point" example (see the sketch below):
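A base-R sketch of the elbow-point idea: run kmeans() for several values of k (with multiple random starts) and plot the within-cluster sum of squares against k; the three-group data is simulated for illustration:

set.seed(4)
x <- rbind(cbind(rnorm(30, 0), rnorm(30, 0)),
           cbind(rnorm(30, 5), rnorm(30, 5)),
           cbind(rnorm(30, 0), rnorm(30, 8)))

wss <- sapply(1:8, function(k)
  kmeans(x, centers = k, nstart = 20)$tot.withinss)   # nstart = multiple initial assignments

plot(1:8, wss, type = "b",
     xlab = "k (number of clusters)",
     ylab = "Within-cluster sum of squares")   # the bend ("elbow") suggests a suitable k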
Lecture 20:
Signal and noise:
- When fitting a model, we want to find the signal in the data
- Use past data to produce a model
- Use its predictions for new, unseen cases
- Making the model more complex can increase accuracy, but it will not necessarily generalise well to unseen data
- Overfitting: allowing the model to follow the training data too closely will lead to a decline in predictive ability
  ○ Modelling noise rather than signal

Overfitting:
- Need to manage the trade-off between model complexity and the possibility of overfitting
Goldilocks zone:
- Want to include enough attributes, but not too many
- Building a model that is too complex for the data can also cause overfitting

Holdout data:
- Hold a portion of the data separate
  ○ Use the model to predict the (known) target values for the holdout set
  ○ Can see how the model performs on data it has not seen before (the part that was held out)
- Portion used to build the model = training data
- Portion kept aside = test data
- A model overfit to the training data should perform poorly on the test data
- Thus good performance on the test data = confidence that the model is not overfit
- Normally an 80/20 split

Fitting graph/curve:
- Visualises the difference between accuracy (or another measure) on the training data vs the test data
- The chance of overfitting increases as more flexibility/complexity is allowed
- Complexity on the x axis and the evaluation measure on the y axis
- Each point on the curve = a specific combination of complexity and evaluation measure

Overfitting in trees:
- Can keep segmenting the data until all nodes are pure (each leaf contains only a single observation)
- Gives perfect classification on the training data
- Complex tree = many nodes
- Better to artificially limit the size of the tree / prune the tree back

Validation data:
- The test data should only be used once
  ○ Only to assess the performance of the final model
  ○ Shouldn't use it to choose among models or to "tune" a model
- Validation data is used in such cases
  ○ Split the data into 3 sections: training and 2 holdout sets (a possible split is 80/10/10)
  ○ Validation data is used to choose the model and to find tuning values
  ○ The test set is used at the end to assess the model

Cross-validation:
- Sometimes data is limited = use cross-validation instead
- Also gives info about how performance varies across datasets
- Split the training dataset into several subsets of the same size (each subset is called a fold)
  Eg. 10-fold cross-validation:
  ○ Train 10 models, with each of the 10 subsets/folds in turn acting as the validation data
  ○ Compute the evaluation measures (eg. accuracy) for each validation set
  ○ Average the measure over the 10 validation sets
  (a holdout and cross-validation sketch follows at the end of this lecture's notes)

Learning curves:
- Visualise generalisation performance against the amount of training data
- Nr of training cases on the x axis, evaluation measure on the y axis
- Typically steep initially -> the model first finds the most apparent regularities
- Accuracy increases as the data increases
- The marginal advance decreases as the data increases (less steep curve)
For the data in the lecture example:
- Logistic regression = better generalisation accuracy for smaller training-set sizes
- The learning curve for logistic regression levels off faster

Fitting graph vs learning curve:
- Learning curve: generalisation performance vs the amount of training data
- Fitting graph: generalisation performance vs model complexity; also shows the performance on the training data, for a fixed amount of training data

Regularization:
- Allow complex models
- Add a penalty term to the objective function
- The penalty term takes a higher value when the model is more complex
- Simple models thus have an advantage (they are penalised less for complexity)

Bias-variance trade-off:
- Captures the tension between model complexity and predictive accuracy
- Bias leads to underfit models
- Models performing poorly on both the training and the test data could be underfit
- Mistaking noise for signal produces overfit models (with high variance)
- Models with high variance are sensitive to small fluctuations in the training data
  ○ Using a different sample of training data will result in a significantly different model
  ○ Such models typically do much better on the training data than on the test data
- Bias-variance trade-off:
  ○ Using eg. regularisation to build a less complex model can significantly reduce variance, but can lead to higher bias
  ○ Thus the trade-off should be managed!
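A base-R sketch of an 80/20 holdout split and 10-fold cross-validation (referred to above), using a logistic regression on simulated data; all names and numbers are illustrative:

set.seed(5)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.8 * dat$x1 - 0.5 * dat$x2))

# 80/20 train/test split
idx   <- sample(n, size = 0.8 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]
fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
pred  <- as.numeric(predict(fit, test, type = "response") >= 0.5)
mean(pred == test$y)                      # accuracy on unseen (test) data

# 10-fold cross-validation on the training data
folds  <- sample(rep(1:10, length.out = nrow(train)))   # random fold assignment
cv_acc <- sapply(1:10, function(k) {
  f <- glm(y ~ x1 + x2, data = train[folds != k, ], family = binomial)
  p <- as.numeric(predict(f, train[folds == k, ], type = "response") >= 0.5)
  mean(p == train$y[folds == k])
})
mean(cv_acc); sd(cv_acc)                  # average performance and how it varies across folds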
Textbook:
- Note that as model complexity increases, the training error keeps coming down (make sure you understand why!)
  Why: imagine trying to draw a line through a scatterplot of points. A basic straight line (simple model) might not fit all the points perfectly, resulting in higher training error. A more complex curve (complex model) can fit the points much better, reducing the training error, but it might end up being too wavy and specific to the training points, losing its ability to generalise.
- Base rate classifier: in a regression context, the base rate model would be one which simply predicts the mean of the target variable for every observation.
- Cross-validation also enables us to obtain statistics (such as the mean and variance) of the estimated performance, so that we can understand how the performance is expected to vary across datasets. This is critical for assessing confidence in the performance estimate.
- To avoid overfitting in decision trees, we control their complexity using two main strategies:
  ○ Stopping early: halting tree growth before it becomes too complex by setting a minimum number of instances required at each leaf, ensuring that the model doesn't overfit to small subsets of the data.
  ○ Pruning: allowing the tree to grow fully and then cutting back branches and leaves that don't contribute to improved accuracy. This is done iteratively until further pruning would reduce accuracy.
  Both methods help to prevent large, overly complex trees that fit the training data too closely, ensuring better generalisation to new data. Hypothesis tests can also be used to determine whether splits in the tree are statistically significant, further aiding in controlling tree complexity.
- Regularization helps to avoid overfitting by balancing model complexity and fit to the data. In methods like logistic regression, regularization introduces a penalty for complexity into the optimization process. Two common types of regularization are L2 (ridge regression), which penalizes large weights, and L1 (lasso), which can perform feature selection by zeroing out coefficients. Regularization aims to optimize both fit and simplicity, and the balance between the two is controlled by a parameter. This parameter can be chosen using techniques like cross-validation, ensuring a good balance between data fit and model complexity, and preventing overfitting.
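A small R sketch of the L2 (ridge) penalty computed directly from its closed form, to show how a larger penalty parameter shrinks the weights; this is an illustration of the penalty idea in a plain linear-regression setting with made-up data, not the textbook's logistic-regression formulation:

set.seed(6)
n <- 100
X <- cbind(1, scale(matrix(rnorm(n * 3), ncol = 3)))   # intercept + 3 standardised attributes
y <- X %*% c(2, 1.5, -1, 0.2) + rnorm(n)

ridge <- function(X, y, lambda) {
  penalty <- lambda * diag(ncol(X))
  penalty[1, 1] <- 0                             # don't penalise the intercept
  drop(solve(t(X) %*% X + penalty, t(X) %*% y))  # (X'X + lambda * I)^(-1) X'y
}

round(cbind(lambda_0   = ridge(X, y, 0),        # ordinary least squares (no penalty)
            lambda_10  = ridge(X, y, 10),
            lambda_100 = ridge(X, y, 100)), 3)  # the penalised weights shrink toward 0 as lambda grows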