DSCI 100 Revision Guides
Summary
This document contains revision guides for DSCI 100, focusing on data science and the R programming language. It reviews loading, cleaning, and visualizing data, classification and regression with tidymodels, clustering, and statistical inference in R, with numerous worked examples.
Chapter 1
This chapter introduces data science and the R programming language, focusing on an example data analysis to demonstrate the basics of loading, cleaning, and visualizing data using R.
- Learning Objectives:
→ Identify the different types of data analysis questions and categorize them accordingly.
→ Load the tidyverse package into R.
→ Read tabular data with read_csv.
→ Create new variables and objects in R using the assignment symbol (<-).
- Using relative paths is generally recommended because they make your code more portable and less prone to errors when shared or used on different computers.
- URLs (Uniform Resource Locators) are used to locate resources on the internet.
Reading Tabular Data from Plain Text Files:
- Tabular data is structured in a rectangular format like a spreadsheet, and plain text files store data in a simple text format.
- read_csv: This function from the tidyverse package is used to read comma-separated value (.csv) files, where commas separate the columns.
→ It's important to note that read_csv has default expectations, such as the presence of column names and commas as delimiters. If these defaults are not met, the data may not be read correctly.
- Skipping rows: The skip argument in read_csv is used to skip a specified number of rows at the beginning of a file, often containing metadata or information not intended for analysis.
- read_tsv: This function reads tab-separated value (.tsv) files, using tabs as delimiters instead of commas.
- read_delim: This function is a more flexible option, allowing you to specify the delimiter used in the file. It can be used for both comma- and tab-separated files and other formats by specifying the delimiter using the delim argument.
→ When the data lacks column names, read_delim assigns generic names like X1, X2, etc. You can rename these columns using the rename function from the dplyr package for better clarity and organization.
Reading Tabular Data from Microsoft Excel Files:
- Excel files (.xlsx) store data differently than plain text files, including additional features like fonts, formatting, and multiple sheets.
- read_excel: This function from the readxl package is used to read data from Excel spreadsheets.
⇒ If an Excel file has multiple sheets, you can specify the sheet to read using the sheet argument. You can also specify particular cell ranges using the range argument (see the sketch below).
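The following sketch pulls these reading functions together. It is a minimal illustration, not code from the course: the file paths, sheet name, cell range, and column names are hypothetical placeholders.

    library(tidyverse)
    library(readxl)

    # Skip two metadata lines at the top of the file, then read the CSV
    marketing <- read_csv("data/marketing.csv", skip = 2)

    # A tab-separated file with no header row: read_delim assigns X1, X2, ...;
    # rename them to something meaningful with dplyr's rename
    happiness <- read_delim("data/happiness.tsv", delim = "\t", col_names = FALSE) |>
      rename(country = X1, score = X2)

    # Read a specific sheet and cell range from an Excel workbook
    sales <- read_excel("data/sales.xlsx", sheet = "2023", range = "A1:D100")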
Reading Data from Databases:
- Relational databases are advantageous for storing large datasets, enabling multiple users to work on a project, and ensuring data integrity.
- SQLite is a simple, self-contained database system often used for local storage and access.
→ dbConnect: This function from the DBI (Database Interface) package establishes a connection to a database, enabling R to send commands to it.
→ dbListTables: Lists all the tables present in a connected database.
→ tbl: From the dbplyr package, this function creates a reference to a database table, allowing you to interact with the data as if it were a data frame in R.
⇒ Importantly, tbl does not immediately retrieve the entire data into R, but creates a reference that enables efficient data manipulation using SQL queries in the background.
⇒ collect: This function retrieves the data from a database reference into an R data frame. Use caution with collect, as retrieving very large datasets can be time-consuming or even crash R. It's often better to filter and select data within the database before using collect to reduce the size of the retrieved data (see the sketch below).
- PostgreSQL (Postgres) is a powerful, open-source relational database system commonly used when the database is accessed over a network. Connecting to a Postgres database using dbConnect requires additional information, such as the database name (dbname) and host location.
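A hedged sketch of this database workflow, assuming a local SQLite file data/flights.db containing a table named flights (both hypothetical); the RSQLite package supplies the SQLite driver.

    library(DBI)
    library(dbplyr)
    library(dplyr)

    # Connect to a local SQLite database file
    con <- dbConnect(RSQLite::SQLite(), "data/flights.db")
    dbListTables(con)  # list the tables the database contains

    # tbl() only creates a reference; no rows are pulled into R yet
    flights_db <- tbl(con, "flights")

    # Filter and select inside the database, then collect the small result
    delayed <- flights_db |>
      filter(dep_delay > 60) |>
      select(carrier, dep_delay) |>
      collect()

    dbDisconnect(con)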
Data Wrangling Functions:
- summarize: Calculates summary statistics for a data frame; often used with group_by to calculate summaries for different groups within a data frame.
- across: Used with summarize or mutate to apply a function to multiple columns simultaneously.
- map family: Applies a function to elements of a vector, list, or data frame, returning results in various formats (list, vector, data frame).
- rowwise: Applies functions across columns within each row, often used with mutate to create new variables based on multiple columns within the same row.
Chapter 4
This chapter discusses the principles and techniques of effective data visualization, emphasizing how to choose, refine, explain, and save visualizations to effectively communicate insights from data. It primarily focuses on using the ggplot2 package in R for creating various types of visualizations.
Key Concepts in Data Visualization:
- Purpose-Driven Visualization: The primary goal of visualization is to answer a specific question about a dataset. A well-designed visualization should clearly convey the answer to the question without unnecessary distractions.
- Choosing the Right Visualization Type: The choice of visualization depends on the data type and the question being asked. The sources introduce four common types of visualizations:
⇒ Scatter plots: Visualize the relationship between two quantitative variables.
⇒ Line plots: Visualize trends over time or any ordered quantity.
⇒ Bar plots: Visualize comparisons of amounts across different categories.
⇒ Histograms: Visualize the distribution of a single quantitative variable, showing the frequency of values within specific ranges (bins).
- Visualizations to Avoid: Certain types of visualizations are generally considered ineffective or easily replaced with better alternatives:
⇒ Pie charts: Difficult to accurately compare pie slice sizes. Bar plots are usually a better choice.
⇒ 3-D visualizations: Can be misleading and hard to interpret in a 2-D format.
⇒ Tables for numerical comparisons: Humans are better at processing visual information than text and numbers. Bar plots are typically more effective.
Creating Visualizations Using ggplot2 in R:
- ggplot2 Basics (see the sketch below):
⇒ Start by specifying the data frame to visualize (ggplot(data_frame, aes(...))).
⇒ Aesthetic mapping (aes) defines how columns in the data frame map to visual properties (e.g., x, y, color, shape).
Refining Visualizations to Enhance Clarity:
- Convey the Message:
⇒ The visualization should directly answer the question being asked.
⇒ Use clear legends and labels to make the visualization understandable without relying on external text.
⇒ Ensure text, symbols, and lines are large enough for easy readability.
⇒ Data points should be clearly visible and not obscured by other elements.
⇒ Use color schemes that are accessible to people with color blindness (consider using tools like ColorBrewer).
- Minimize Noise:
⇒ Use colors sparingly to avoid distractions and false patterns.
⇒ Address overplotting (overlapping data points) to prevent obscuring data patterns.
⇒ Adjust the plot area to fit the data appropriately.
⇒ Avoid manipulating axes to exaggerate small differences.
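A minimal sketch of these ggplot2 basics, using the penguins data frame from the palmerpenguins package (the same package appears in Chapter 9's clustering example); the semi-transparency is one simple way to address overplotting.

    library(ggplot2)
    library(palmerpenguins)  # provides the penguins data frame

    # Scatter plot: map flipper length to x, bill length to y, species to color
    ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
      geom_point(alpha = 0.6) +  # semi-transparent points reduce overplotting
      labs(x = "Flipper length (mm)",
           y = "Bill length (mm)",
           color = "Species")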
Chapter 5
This chapter introduces the fundamental concepts of classification, the K-nearest neighbors algorithm, and data preprocessing techniques for building effective classifiers using the tidymodels framework in R.
- The example data analysis predicts whether a tumor is benign or malignant.
- Each observation represents a tumor image with features like radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension.
- The data set contains 357 benign (63%) and 212 malignant (37%) tumor observations.
Distance Between Points
- Euclidean distance is used to determine the "nearest" neighbors.
- For two observations a and b with two predictor variables x and y:
⇒ Distance = √((a_x − b_x)² + (a_y − b_y)²)
- The formula extends to multiple predictor variables by summing the squared differences for each variable.
K-Nearest Neighbors with tidymodels
- The tidymodels package simplifies the implementation of the K-nearest neighbors algorithm in R.
- Key steps:
1. Specify the model: Use nearest_neighbor to define the K-nearest neighbors model, specifying the number of neighbors (neighbors) and the distance weighting function (weight_func).
2. Set the engine: Specify the package for training the model using set_engine (e.g., "kknn" for the kknn package).
3. Set the mode: Indicate that it's a classification problem using set_mode("classification").
4. Fit the model: Use fit to train the model on the data, specifying the target variable and predictors.
5. Predict on new data: Use predict to classify new observations based on the trained model.
Data Preprocessing with tidymodels
- Centering and scaling:
⇒ Standardizing data (mean of 0 and standard deviation of 1) ensures that variables with larger scales don't disproportionately influence the distance calculations.
⇒ Use step_scale and step_center in a recipe to standardize predictor variables.
- Balancing:
⇒ Class imbalance occurs when one class is much more common than another, potentially biasing the classifier.
⇒ Oversampling the rare class by replicating observations can help address this issue.
⇒ Use step_upsample in a recipe to oversample the minority class.
Using Workflows in tidymodels
- A workflow chains together multiple data analysis steps, including data preprocessing and model fitting.
- Key steps (see the sketch below):
1. Create a workflow object using workflow().
2. Add the preprocessing recipe using add_recipe.
3. Add the model specification using add_model.
4. Fit the workflow to the data using fit.
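A sketch tying these steps together. It assumes a training data frame cancer_train with a categorical Class column and numeric predictors Radius and Texture, plus new observations in cancer_new — all hypothetical names for illustration. For imbalanced data, step_upsample (from the themis package) could be added to the recipe.

    library(tidymodels)

    # Model specification: 5 nearest neighbors, unweighted ("rectangular") votes
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
      set_engine("kknn") |>
      set_mode("classification")

    # Recipe: standardize the predictors so neither dominates the distance
    knn_recipe <- recipe(Class ~ Radius + Texture, data = cancer_train) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

    # Workflow: chain preprocessing and model, then fit on the training data
    knn_fit <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_spec) |>
      fit(data = cancer_train)

    # Classify new observations
    predict(knn_fit, new_data = cancer_new)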
Chapter 6
This chapter focuses on evaluating and improving the accuracy of classifiers, building upon the concepts introduced in Chapter 5, specifically addressing the evaluation of classifier accuracy, techniques for enhancing performance, and the concept of tuning parameters.
Evaluating Accuracy
- Importance of New Data: A classifier's effectiveness lies in its ability to make accurate predictions on data it hasn't encountered during training.
- Train/Test Split: To evaluate accuracy, the data is split into a training set (used for model building) and a test set (used for assessing performance).
- Prediction Accuracy: This metric measures the proportion of correct predictions made by the classifier.
⇒ Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
- Confusion Matrix: This table provides a detailed breakdown of correct and incorrect predictions, showing the counts for each combination of predicted and true labels.
Randomness and Seeds
- Reproducibility: Using randomness in analyses can compromise reproducibility.
- Random Seed: R's random number generator relies on a seed value to produce sequences of numbers. Setting the seed using set.seed() ensures consistent results across multiple runs of the analysis.
- Best Practice: Set the seed once at the beginning of the analysis to maintain reproducibility.
Evaluating Accuracy with tidymodels
- Data Splitting: Use initial_split() to split the data into training and test sets, ensuring stratification (maintaining class proportions) by setting the strata argument to the class label variable.
- Preprocessing: Standardize predictor variables using step_scale() and step_center() in a recipe.
- Training: Use workflow(), add_recipe(), add_model(), and fit() to train the classifier on the training data.
- Prediction: Use predict() and bind_cols() to predict labels for the test set and combine them with the original test data.
- Accuracy Calculation: Use metrics() to calculate the accuracy of the classifier on the test data, filtering for the accuracy metric.
- Confusion Matrix: Use conf_mat() to generate the confusion matrix for the classifier.
Critical Analysis of Performance
- Contextual Accuracy: The acceptable accuracy level depends on the specific application and the consequences of misclassifications.
- Majority Classifier Baseline: The majority classifier always predicts the most frequent class from the training data. Comparing the classifier's accuracy to this baseline provides a basic performance benchmark.
- Importance of Confusion Matrix: In addition to accuracy, the confusion matrix helps analyze the types of errors made by the classifier.
Tuning the Classifier
- Parameters: Most predictive models have parameters that influence their behavior.
- Tuning: Selecting optimal parameter values to maximize the classifier's performance on unseen data.
- Validation Set: A subset of the training data used to evaluate the classifier's performance during tuning.
- Cross-Validation: A technique involving multiple train/validation splits to obtain more robust performance estimates.
⇒ K-Fold Cross-Validation: Divides the training data into K folds, using each fold as the validation set once.
- Parameter Selection: Tune the classifier by trying different parameter values and selecting the one that yields the best cross-validation accuracy.
Tuning in tidymodels (see the sketch below)
- Cross-Validation Splitting: Use vfold_cv() to create cross-validation folds.
- tune() Placeholder: Specify parameters to be tuned using tune() in the model specification.
- Grid Search: Create a data frame of parameter values to try (grid) and use tune_grid() to evaluate performance for each combination.
- Accuracy Visualization: Plot accuracy against parameter values to identify optimal settings.
Underfitting and Overfitting
- Underfitting: Occurs when the model is too simple and doesn't capture the underlying patterns in the data.
- Overfitting: Occurs when the model is too complex and learns the training data too well, including noise, leading to poor generalization to new data.
- Balance: Finding the right balance between underfitting and overfitting is crucial for achieving optimal model performance.
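A hedged sketch consolidating this tuning workflow — cross-validation picks the K that balances underfitting and overfitting. It reuses the hypothetical cancer_train data frame and knn_recipe from the Chapter 5 sketch; the fold count and K grid are illustrative choices.

    library(tidymodels)

    set.seed(1234)  # set once so the fold assignments are reproducible

    # 5-fold cross-validation, stratified on the class label
    cancer_folds <- vfold_cv(cancer_train, v = 5, strata = Class)

    # Mark `neighbors` as the parameter to tune
    knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                      neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

    # Grid of K values to try
    k_grid <- tibble(neighbors = seq(1, 15, by = 2))

    # Evaluate every K on every fold, then pull out the accuracy estimates
    knn_results <- workflow() |>
      add_recipe(knn_recipe) |>   # recipe from the earlier sketch
      add_model(knn_tune_spec) |>
      tune_grid(resamples = cancer_folds, grid = k_grid) |>
      collect_metrics() |>
      filter(.metric == "accuracy")

    # Plot estimated accuracy against K to identify the best setting
    ggplot(knn_results, aes(x = neighbors, y = mean)) +
      geom_point() +
      geom_line()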
Chapter 7
KNN Regression Example
- An initial scatter plot visualizes the relationship between house size (predictor variable) and sale price (response variable), revealing a positive correlation.
- A small sample of 30 data points illustrates how KNN regression works.
- The scenario involves predicting the sale price of a 2,000 square-foot house.
- With no exact match for that size in the sample, the 5 nearest neighbors (based on square footage) are identified.
- The average sale price of these 5 neighbors is used as the predicted price for the 2,000 square-foot house.
Training, Evaluation, and Tuning
- The chapter emphasizes the importance of splitting the data into training and test sets for evaluating model performance on unseen data.
- Root Mean Square Prediction Error (RMSPE) is used to assess prediction accuracy in regression.
- Cross-validation helps select the optimal number of neighbors (K) by minimizing RMSPE.
- Standardization (scaling and centering) of predictor variables is recommended even with a single predictor.
- The tidymodels workflow simplifies model training, tuning, and evaluation in R.
Underfitting and Overfitting
- Similar to classification, KNN regression is susceptible to underfitting and overfitting depending on the value of K.
- Underfitting occurs with a large K, resulting in a model that's too simple and doesn't capture the data's trend.
- Overfitting occurs with a small K, leading to a model that's too complex and sensitive to noise in the training data.
- Cross-validation helps find the optimal K that balances underfitting and overfitting, achieving good generalization to new data.
Evaluating on the Test Set
- The final model, trained on the entire training set with the optimal K found through cross-validation, is evaluated on the test set using RMSPE.
- The test RMSPE provides an estimate of the model's prediction error on unseen data.
Multivariable KNN Regression
- KNN regression can incorporate multiple predictor variables.
- Scaling and centering predictors becomes essential in this case to prevent variables with larger scales from dominating distance calculations.
- Predictor variable selection is crucial, as irrelevant variables can negatively impact model performance.
- The example demonstrates using house size and the number of bedrooms to predict sale price.
- Visualizing predictions with multiple predictors requires a 3D surface instead of a 2D line.
Strengths and Weaknesses of KNN Regression
→ Strengths:
1. Simple and intuitive algorithm.
2. Requires few assumptions about the data distribution.
3. Works well with non-linear relationships.
→ Weaknesses:
1. Computationally expensive for large datasets.
2. Performance may degrade with many predictors.
3. Limited extrapolation ability beyond the range of training data.
Chapter 8
This chapter introduces linear regression as an alternative to KNN regression for predicting numerical variables. Linear regression offers several advantages over KNN, including better performance with large datasets, faster computation, and interpretability. The chapter covers its implementation in R, advantages and disadvantages compared to KNN regression, and potential issues like multicollinearity and outliers. It also touches upon the broader applications of regression beyond prediction.
Simple Linear Regression
- Simple linear regression involves predicting a numerical response variable using a single predictor variable.
- It fits a straight line through the training data to minimize the average squared vertical distance between the line and the data points. This line is called the line of best fit.
- The equation for the line of best fit is: house sale price = β₀ + β₁ ⋅ (house size), where:
○ β₀ is the vertical intercept (the price when the house size is 0).
○ β₁ is the slope (how quickly the price increases as the house size increases).
- The coefficients β₀ and β₁ are determined by minimizing the average squared vertical distances, a method known as least squares.
- Linear regression can make predictions for any value of the predictor, but it's generally advisable to avoid extrapolating beyond the range of the observed data, as the relationship might not hold true outside that range.
- RMSPE (Root Mean Square Prediction Error) is used to evaluate the predictive accuracy of the model.
Linear Regression in R
The tidymodels framework in R provides tools for performing linear regression. Key steps include (see the sketch below):
○ Creating a linear_reg() model specification with the lm engine.
○ Defining a recipe using recipe(), specifying the relationship between the response and predictor variables.
○ Building a workflow using workflow(), adding the recipe and model, and fitting the model to the training data using fit().
○ Predicting on the test data using predict().
○ Evaluating performance using metrics().
The geom_smooth() function can be used to visualize the line of best fit on a scatter plot.
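A sketch of these linear regression steps, assuming hypothetical training and test data frames housing_train and housing_test with columns price (response) and sqft (predictor).

    library(tidymodels)

    # Model specification: ordinary least squares via the lm engine
    lm_spec <- linear_reg() |>
      set_engine("lm") |>
      set_mode("regression")

    # Recipe: model sale price as a function of house size
    lm_recipe <- recipe(price ~ sqft, data = housing_train)

    # Fit the workflow on the training data
    lm_fit <- workflow() |>
      add_recipe(lm_recipe) |>
      add_model(lm_spec) |>
      fit(data = housing_train)

    # Evaluate on the test set; the reported rmse plays the role of RMSPE
    predict(lm_fit, housing_test) |>
      bind_cols(housing_test) |>
      metrics(truth = price, estimate = .pred)

    # Overlay the line of best fit on a scatter plot of the training data
    ggplot(housing_train, aes(x = sqft, y = price)) +
      geom_point(alpha = 0.4) +
      geom_smooth(method = "lm", se = FALSE)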
Chapter 9
This chapter introduces clustering as an unsupervised data analysis technique used to separate a dataset into subgroups (clusters) of related data. Unlike supervised techniques like classification and regression, clustering doesn't rely on a response variable with known labels or values. The chapter focuses on the K-means clustering algorithm and explores techniques for choosing the optimal number of clusters, including the elbow method.
What is Clustering?
- Clustering aims to group data points based on their similarity, measured using distance metrics like Euclidean distance.
- Applications:
○ Grouping documents by topic.
○ Separating human genetic information into ancestral sub-populations.
○ Segmenting online customers based on purchasing behaviors.
Clustering vs. Classification/Regression
- Clustering is unsupervised: No response variable labels or values are used to guide the grouping process.
- Classification/Regression are supervised: They leverage labeled data to predict categorical labels or numerical values.
- Advantages of Clustering:
○ No need for data annotation.
○ Can uncover patterns in data without pre-existing labels.
- Disadvantages of Clustering:
○ Evaluating cluster quality can be challenging.
○ No single "best" evaluation method exists.
K-Means Clustering
- K-means is a widely used algorithm that iteratively groups data into K clusters.
- Steps:
○ Initialization: Randomly assign data points to K clusters.
○ Center Update: Calculate the cluster centers (means of each variable for data points within the cluster).
○ Label Update: Reassign each data point to the cluster with the nearest center (using Euclidean distance).
○ Iteration: Repeat the center and label updates until cluster assignments stabilize.
- Cluster Quality: Measured by within-cluster sum-of-squared-distances (WSSD), representing the spread of data points around their cluster center.
○ Lower total WSSD indicates better clustering.
Choosing the Number of Clusters (K)
- Elbow Method: Plot the total WSSD against different values of K.
○ The "elbow" point on the plot, where the decrease in total WSSD levels off, suggests an appropriate K value.
- Challenges:
○ K-means can get "stuck" in suboptimal solutions due to random initialization.
○ Random Restarts: Run K-means multiple times with different random initializations and choose the clustering with the lowest total WSSD.
○ Use the nstart argument in the kmeans function in R to specify the number of restarts.
Data Pre-processing for K-Means
- Standardization: Scaling and centering variables is crucial to prevent variables with larger scales from dominating distance calculations.
○ Use the scale function in R to standardize data.
K-Means in R (see the consolidated sketch at the end of this guide)
- Use the kmeans function with arguments:
○ x: The data frame containing the data to cluster.
○ centers: The number of clusters (K).
○ nstart: The number of random restarts.
- Use the broom package functions:
○ augment: Adds cluster assignments to the data frame.
○ glance: Extracts clustering statistics, including total WSSD.
- Visualize clusters using colored scatter plots (e.g., with ggplot2).
Illustrative Example
The chapter utilizes the penguin_data dataset from the palmerpenguins R package, focusing on penguin bill and flipper lengths (standardized) to identify potential penguin subtypes. Figures and explanations throughout the chapter illustrate the K-means algorithm, the impact of different K values, and the elbow method for choosing K.
Advantages, Limitations, and Assumptions of K-Means
The chapter mentions advantages and limitations of K-means but doesn't explicitly detail them within the provided excerpt. It does note that the basic K-means algorithm assumes quantitative data and uses Euclidean distance for similarity calculations. For non-quantitative data, other clustering algorithms or distance metrics are needed.
K-Nearest Neighbors (K-NN) - Strengths and Weaknesses
- Strengths:
1. Simplicity and intuitiveness.
2. Minimal assumptions about data distribution.
3. Works for both binary and multi-class classification problems.
- Weaknesses:
Chapter 10
This chapter introduces the concept of statistical inference, which involves drawing conclusions about an unknown population based on a sample of data from that population. It explains why sampling is necessary and discusses two key techniques: point estimation and interval estimation.
Why Sampling?
- It is often impractical, costly, or impossible to collect data on an entire population. Instead, we can collect data on a smaller sample (subset) of the population and use it to estimate characteristics of the whole population.
- This process of using a sample to make conclusions about the population is called statistical inference.
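Finally, the consolidated K-means sketch referenced in Chapter 9 above. It is a minimal illustration using the penguins data frame from palmerpenguins (the chapter refers to its version as penguin_data); K = 3 and nstart = 10 are illustrative choices, not values taken from the course.

    library(tidyverse)
    library(broom)
    library(palmerpenguins)

    set.seed(1234)  # K-means starts from random assignments

    # Standardize the two variables so neither dominates the distance
    penguins_std <- penguins |>
      drop_na(bill_length_mm, flipper_length_mm) |>
      select(bill_length_mm, flipper_length_mm) |>
      mutate(across(everything(), ~ as.numeric(scale(.x))))

    # Run K-means with K = 3 clusters and 10 random restarts
    penguin_clusters <- kmeans(penguins_std, centers = 3, nstart = 10)

    # augment() attaches the cluster labels; plot them as a colored scatter plot
    augment(penguin_clusters, penguins_std) |>
      ggplot(aes(x = flipper_length_mm, y = bill_length_mm, color = .cluster)) +
      geom_point()

    # glance() reports the total within-cluster sum of squares (tot.withinss);
    # computing it for several K values gives the points for an elbow plot
    glance(penguin_clusters)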