DSCI 100 Revision Guides PDF

Summary

This document covers revision guides for DSCI 100, a course on data science and R programming. It summarizes techniques for loading, cleaning, wrangling, and visualizing data in R, classification and regression with tidymodels, K-means clustering, and statistical inference, and it includes numerous examples of these data analysis concepts.

Full Transcript

Chapter 1

This chapter introduces data science and the R programming language, focusing on an example data analysis to demonstrate the basics of loading, cleaning, and visualizing data using R.

- Learning Objectives:
  → Identify the different types of data analysis questions and categorize them accordingly.
  → Load the tidyverse package into R.
  → Read tabular data with read_csv.
  → Create new variables and objects in R using the assignment symbol (<-).

Chapter 2

This chapter covers reading tabular data into R from plain text files, Microsoft Excel spreadsheets, and databases.

Reading Tabular Data from Plain Text Files:
- File locations:
  ⇒ Using relative paths is generally recommended because they make your code more portable and less prone to errors when shared or used on different computers.
  ⇒ URLs (Uniform Resource Locators) are used to locate resources on the internet.
- Tabular data is structured in a rectangular format like a spreadsheet, and plain text files store data in a simple text format.
- read_csv: This function from the tidyverse package is used to read comma-separated value (.csv) files, where commas separate the columns.
  → It's important to note that read_csv has default expectations, such as the presence of column names and commas as delimiters. If these defaults are not met, the data may not be read correctly.
- Skipping rows: The skip argument in read_csv is used to skip a specified number of rows at the beginning of a file, often containing metadata or information not intended for analysis.
- read_tsv: This function reads tab-separated value (.tsv) files, using tabs as delimiters instead of commas.
- read_delim: This function is a more flexible option, allowing you to specify the delimiter used in the file. It can be used for both comma- and tab-separated files and other formats by specifying the delimiter using the delim argument.
  → When the data lacks column names, read_delim assigns generic names like X1, X2, etc. You can rename these columns using the rename function from the dplyr package for better clarity and organization.

Reading Tabular Data from Microsoft Excel Files:
- Excel files (.xlsx) store data differently than plain text files, including additional features like fonts, formatting, and multiple sheets.
- read_excel: This function from the readxl package is used to read data from Excel spreadsheets.
  ⇒ If an Excel file has multiple sheets, you can specify the sheet to read using the sheet argument. You can also specify particular cell ranges using the range argument.
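A minimal sketch pulling these pieces together: load the tidyverse, read delimited files, and read an Excel sheet. The file names (measurements.csv, measurements.tsv, report.xlsx) and column names are hypothetical placeholders for illustration, not files from the course.

    library(tidyverse)
    library(readxl)

    # Read a .csv file, skipping two metadata lines at the top (hypothetical file)
    dat_csv <- read_csv("measurements.csv", skip = 2)

    # Read a tab-separated file; read_delim does the same with an explicit delimiter
    dat_tsv <- read_tsv("measurements.tsv")
    dat_raw <- read_delim("measurements.tsv", delim = "\t", col_names = FALSE)

    # Rename the generic X1, X2 names assigned when a file has no header row
    dat_named <- dat_raw |> rename(height = X1, width = X2)

    # Read a specific sheet and cell range from an Excel workbook (hypothetical file)
    dat_xlsx <- read_excel("report.xlsx", sheet = "data", range = "A1:D100")
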
Reading Data from Databases:
- Relational databases are advantageous for storing large datasets, enabling multiple users to work on a project, and ensuring data integrity.
- SQLite is a simple, self-contained database system often used for local storage and access.
- PostgreSQL (Postgres) is a powerful, open-source relational database system commonly used for databases accessed over a network. Connecting to a Postgres database using dbConnect requires additional information, such as the database name (dbname) and host location (host).
- dbConnect: This function from the DBI (Database Interface) package establishes a connection to a database, enabling R to send commands to it.
- dbListTables: Lists all the tables present in a connected database.
- tbl: From the dbplyr package, this function creates a reference to a database table, allowing you to interact with the data as if it were a data frame in R.
  → Importantly, tbl does not immediately retrieve the entire data set into R; it creates a reference that enables efficient data manipulation using SQL queries in the background.
- collect: This function retrieves the data from a database reference into an R data frame. Use caution with collect, as retrieving very large datasets can be time-consuming or even crash R. It's often better to filter and select data within the database before using collect to reduce the size of the retrieved data.
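A minimal sketch of this database workflow, assuming a hypothetical SQLite file library.db containing a table named books with title, author, and year columns.

    library(DBI)
    library(RSQLite)
    library(dbplyr)
    library(dplyr)

    con <- dbConnect(RSQLite::SQLite(), "library.db")
    dbListTables(con)                      # list the tables in the database

    books <- tbl(con, "books")             # lazy reference; nothing is downloaded yet

    # Filter and select inside the database, then pull only the result into R
    recent <- books |>
      filter(year >= 2010) |>
      select(title, author, year) |>
      collect()

    dbDisconnect(con)
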
Chapter 3

This chapter covers cleaning and wrangling data in R, using tidyverse functions that summarize, transform, and iterate over data frames.

- summarize: Computes summary statistics for a data frame. Often used with group_by to calculate summaries for different groups within a data frame.
- across: Used with summarize or mutate to apply a function to multiple columns simultaneously.
- map family: Applies a function to elements of a vector, list, or data frame, returning results in various formats (list, vector, data frame).
- rowwise: Applies functions across columns within each row, often used with mutate to create new variables based on multiple columns within the same row.
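A minimal sketch of these wrangling functions on a small made-up data frame of city weather measurements; the data and column names are invented for illustration.

    library(tidyverse)

    weather <- tibble(
      city    = c("Vancouver", "Vancouver", "Kelowna", "Kelowna"),
      month   = c("Jan", "Jul", "Jan", "Jul"),
      temp_c  = c(4.1, 18.0, -2.5, 21.3),
      rain_mm = c(168, 35, 29, 30)
    )

    # group_by + summarize: one summary row per city
    weather |>
      group_by(city) |>
      summarize(mean_temp = mean(temp_c))

    # across: apply the same function to several columns at once
    weather |>
      group_by(city) |>
      summarize(across(c(temp_c, rain_mm), mean))

    # rowwise + mutate: combine values within each row
    weather |>
      rowwise() |>
      mutate(avg_measure = mean(c(temp_c, rain_mm)))

    # map_dbl: apply a function to each column, returning a numeric vector
    map_dbl(weather |> select(temp_c, rain_mm), mean)
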
Chapter 4

This chapter discusses the principles and techniques of effective data visualization, emphasizing how to choose, refine, explain, and save visualizations to effectively communicate insights from data. It primarily focuses on using the ggplot2 package in R for creating various types of visualizations.

Key Concepts in Data Visualization:
- Purpose-Driven Visualization: The primary goal of visualization is to answer a specific question about a dataset. A well-designed visualization should clearly convey the answer to the question without unnecessary distractions.
- Choosing the Right Visualization Type: The choice of visualization depends on the data type and the question being asked. The sources introduce four common types of visualizations:
  ⇒ Scatter plots: Visualize the relationship between two quantitative variables.
  ⇒ Line plots: Visualize trends over time or any ordered quantity.
  ⇒ Bar plots: Visualize comparisons of amounts across different categories.
  ⇒ Histograms: Visualize the distribution of a single quantitative variable, showing the frequency of values within specific ranges (bins).
- Visualizations to Avoid: Certain types of visualizations are generally considered ineffective or easily replaced with better alternatives:
  ⇒ Pie charts: Difficult to accurately compare pie slice sizes. Bar plots are usually a better choice.
  ⇒ 3-D visualizations: Can be misleading and hard to interpret in a 2-D format.
  ⇒ Tables for numerical comparisons: Humans are better at processing visual information than text and numbers. Bar plots are typically more effective.

Refining Visualizations to Enhance Clarity:
- Convey the Message:
  ⇒ The visualization should directly answer the question being asked.
  ⇒ Use clear legends and labels to make the visualization understandable without relying on external text.
  ⇒ Ensure text, symbols, and lines are large enough for easy readability.
  ⇒ Data points should be clearly visible and not obscured by other elements.
  ⇒ Use color schemes that are accessible to people with color blindness (consider using tools like ColorBrewer).
- Minimize Noise:
  ⇒ Use colors sparingly to avoid distractions and false patterns.
  ⇒ Address overplotting (overlapping data points) to prevent obscuring data patterns.
  ⇒ Adjust the plot area to fit the data appropriately.
  ⇒ Avoid manipulating axes to exaggerate small differences.

Creating Visualizations Using ggplot2 in R:
- ggplot2 Basics:
  ⇒ Start by specifying the data frame to visualize (ggplot(data_frame, aes(...))).
  ⇒ Aesthetic mapping (aes) defines how columns in the data frame map to visual properties (e.g., x, y, color, shape).
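A minimal ggplot2 sketch applying these basics: a scatter plot with an aesthetic mapping, readable labels, and text sized for legibility. The toy cars_df data frame is invented for illustration.

    library(tidyverse)

    # Hypothetical toy data: engine size vs. fuel efficiency for a few cars
    cars_df <- tibble(
      engine_l = c(1.6, 2.0, 2.4, 3.0, 3.5, 5.0),
      mpg      = c(38, 34, 30, 26, 24, 17),
      type     = c("compact", "compact", "sedan", "sedan", "suv", "suv")
    )

    ggplot(cars_df, aes(x = engine_l, y = mpg, color = type)) +
      geom_point(size = 3, alpha = 0.7) +       # visible points; transparency helps with overplotting
      labs(x = "Engine size (litres)",
           y = "Fuel efficiency (miles per gallon)",
           color = "Vehicle type") +
      theme(text = element_text(size = 14))     # make text large enough to read easily
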
Chapter 5

This chapter introduces the fundamental concepts of classification, the K-nearest neighbors algorithm, and data preprocessing techniques for building effective classifiers using the tidymodels framework in R.

Example Data Set:
- The running example predicts whether a tumor is benign or malignant.
- Each observation represents a tumor image with features like radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension.
- The data set contains 357 benign (63%) and 212 malignant (37%) tumor observations.

Distance Between Points:
- Euclidean distance is used to determine the "nearest" neighbors.
- For two observations a and b with two predictor variables x and y:
  Distance = √((a_x − b_x)² + (a_y − b_y)²)
- The formula extends to multiple predictor variables by summing the squared differences for each variable.

K-Nearest Neighbors with tidymodels:
- The tidymodels package simplifies the implementation of the K-nearest neighbors algorithm in R.
- Key steps (see the code sketch below):
  1. Specify the model: Use nearest_neighbor to define the K-nearest neighbors model, specifying the number of neighbors (neighbors) and the distance weighting function (weight_func).
  2. Set the engine: Specify the package for training the model using set_engine (e.g., "kknn" for the kknn package).
  3. Set the mode: Indicate that it's a classification problem using set_mode("classification").
  4. Fit the model: Use fit to train the model on the data, specifying the target variable and predictors.
  5. Predict on new data: Use predict to classify new observations based on the trained model.

Data Preprocessing with tidymodels:
- Centering and scaling:
  ⇒ Standardizing data (mean of 0 and standard deviation of 1) ensures that variables with larger scales don't disproportionately influence the distance calculations.
  ⇒ Use step_scale and step_center in a recipe to standardize predictor variables.
- Balancing:
  ⇒ Class imbalance occurs when one class is much more common than another, potentially biasing the classifier.
  ⇒ Oversampling the rare class by replicating observations can help address this issue.
  ⇒ Use step_upsample in a recipe to oversample the minority class.

Using Workflows in tidymodels:
- A workflow chains together multiple data analysis steps, including data preprocessing and model fitting.
- Key steps:
  1. Create a workflow object using workflow().
  2. Add the preprocessing recipe using add_recipe.
  3. Add the model specification using add_model.
  4. Fit the workflow to the data using fit.
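A minimal sketch of the five model-specification steps, the preprocessing recipe, and the workflow described above. The data frames and columns (cancer_train, new_tumors, Class, Radius, Texture) are hypothetical placeholders; step_upsample comes from the themis extension package.

    library(tidymodels)
    library(themis)

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
      set_engine("kknn") |>
      set_mode("classification")

    cancer_recipe <- recipe(Class ~ Radius + Texture, data = cancer_train) |>
      step_scale(all_predictors()) |>   # standardize: standard deviation of 1
      step_center(all_predictors()) |>  # standardize: mean of 0
      step_upsample(Class)              # oversample the rare class

    knn_fit <- workflow() |>
      add_recipe(cancer_recipe) |>
      add_model(knn_spec) |>
      fit(data = cancer_train)

    # Classify new observations (hypothetical data frame) with the trained workflow
    predict(knn_fit, new_data = new_tumors)
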
Chapter 6

This chapter focuses on evaluating and improving the accuracy of classifiers, building upon the concepts introduced in Chapter 5, specifically addressing the evaluation of classifier accuracy, techniques for enhancing performance, and the concept of tuning parameters.

Evaluating Accuracy:
- Importance of New Data: A classifier's effectiveness lies in its ability to make accurate predictions on data it hasn't encountered during training.
- Train/Test Split: To evaluate accuracy, the data is split into a training set (used for model building) and a test set (used for assessing performance).
- Prediction Accuracy: This metric measures the proportion of correct predictions made by the classifier.
  ⇒ Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
- Confusion Matrix: This table provides a detailed breakdown of correct and incorrect predictions, showing the counts for each combination of predicted and true labels.

Randomness and Seeds:
- Reproducibility: Using randomness in analyses can compromise reproducibility.
- Random Seed: R's random number generator relies on a seed value to produce sequences of numbers. Setting the seed using set.seed() ensures consistent results across multiple runs of the analysis.
- Best Practice: Set the seed once at the beginning of the analysis to maintain reproducibility.

Evaluating Accuracy with tidymodels:
- Data Splitting: Use initial_split() to split the data into training and test sets, ensuring stratification (maintaining class proportions) by setting the strata argument to the class label variable.
- Preprocessing: Standardize predictor variables using step_scale() and step_center() in a recipe.
- Training: Use workflow(), add_recipe(), add_model(), and fit() to train the classifier on the training data.
- Prediction: Use predict() and bind_cols() to predict labels for the test set and combine them with the original test data.
- Accuracy Calculation: Use metrics() to calculate the accuracy of the classifier on the test data, filtering for the accuracy metric.
- Confusion Matrix: Use conf_mat() to generate the confusion matrix for the classifier.

Critical Analysis of Performance:
- Contextual Accuracy: The acceptable accuracy level depends on the specific application and the consequences of misclassifications.
- Majority Classifier Baseline: The majority classifier always predicts the most frequent class from the training data. Comparing the classifier's accuracy to this baseline provides a basic performance benchmark.
- Importance of the Confusion Matrix: In addition to accuracy, the confusion matrix helps analyze the types of errors made by the classifier.

Tuning the Classifier:
- Parameters: Most predictive models have parameters that influence their behavior.
- Tuning: Selecting optimal parameter values to maximize the classifier's performance on unseen data.
- Validation Set: A subset of the training data used to evaluate the classifier's performance during tuning.
- Cross-Validation: A technique involving multiple train/validation splits to obtain more robust performance estimates.
  ⇒ K-Fold Cross-Validation: Divides the training data into K folds, using each fold as the validation set once.
- Parameter Selection: Tune the classifier by trying different parameter values and selecting the one that yields the best cross-validation accuracy.

Tuning in tidymodels (see the code sketch below):
- Cross-Validation Splitting: Use vfold_cv() to create cross-validation folds.
- tune() Placeholder: Specify parameters to be tuned using tune() in the model specification.
- Grid Search: Create a data frame of parameter values to try (grid) and use tune_grid() to evaluate performance for each combination.
- Accuracy Visualization: Plot accuracy against parameter values to identify optimal settings.

Underfitting and Overfitting:
- Underfitting: Occurs when the model is too simple and doesn't capture the underlying patterns in the data.
- Overfitting: Occurs when the model is too complex and learns the training data too well, including noise, leading to poor generalization to new data.
- Balance: Finding the right balance between underfitting and overfitting is crucial for achieving optimal model performance.

K-Nearest Neighbors (K-NN) – Strengths and Weaknesses:
- Strengths:
  1. Simplicity and intuitiveness.
  2. Minimal assumptions about data distribution.
  3. Works for both binary and multi-class classification problems.
- Weaknesses:
  1. Computationally expensive for large datasets.
  2. Performance may degrade with many predictors.
  3. Performance may suffer when classes are imbalanced (see the balancing techniques in Chapter 5).
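A minimal sketch of stratified splitting, 5-fold cross-validation, and grid tuning for the number of neighbors. The cancer data frame and Class, Radius, Texture columns are hypothetical placeholders.

    library(tidymodels)

    set.seed(1234)  # set the seed once at the start for reproducibility

    cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
    cancer_train <- training(cancer_split)
    cancer_test  <- testing(cancer_split)

    # Mark neighbors as a parameter to tune
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

    cancer_recipe <- recipe(Class ~ Radius + Texture, data = cancer_train) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

    cancer_folds <- vfold_cv(cancer_train, v = 5, strata = Class)
    k_grid <- tibble(neighbors = seq(1, 15, by = 2))

    tuning_results <- workflow() |>
      add_recipe(cancer_recipe) |>
      add_model(knn_spec) |>
      tune_grid(resamples = cancer_folds, grid = k_grid) |>
      collect_metrics()

    # Accuracy for each K; plotting accuracy vs. neighbors helps pick the best K
    tuning_results |> filter(.metric == "accuracy")
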
Chapter 7

This chapter introduces K-nearest neighbors (KNN) regression for predicting a numerical response variable, using a running example of predicting house sale prices from house size.

KNN Regression Example:
- An initial scatter plot visualizes the relationship between house size (predictor variable) and sale price (response variable), revealing a positive correlation.
- A small sample of 30 data points illustrates how KNN regression works.
- The scenario involves predicting the sale price of a 2,000 square-foot house.
- With no exact match for that size in the sample, the 5 nearest neighbors (based on square footage) are identified.
- The average sale price of these 5 neighbors is used as the predicted price for the 2,000 square-foot house.

Training, Evaluation, and Tuning:
- The chapter emphasizes the importance of splitting the data into training and test sets for evaluating model performance on unseen data.
- Root Mean Square Prediction Error (RMSPE) is used to assess prediction accuracy in regression.
- Cross-validation helps select the optimal number of neighbors (K) by minimizing RMSPE.
- Standardization (scaling and centering) of predictor variables is recommended even with a single predictor.
- The tidymodels workflow simplifies model training, tuning, and evaluation in R.

Underfitting and Overfitting:
- Similar to classification, KNN regression is susceptible to underfitting and overfitting depending on the value of K.
- Underfitting occurs with a large K, resulting in a model that's too simple and doesn't capture the data's trend.
- Overfitting occurs with a small K, leading to a model that's too complex and sensitive to noise in the training data.
- Cross-validation helps find the optimal K that balances underfitting and overfitting, achieving good generalization to new data.

Evaluating on the Test Set:
- The final model, trained on the entire training set with the optimal K found through cross-validation, is evaluated on the test set using RMSPE.
- The test RMSPE provides an estimate of the model's prediction error on unseen data.

Multivariable KNN Regression:
- KNN regression can incorporate multiple predictor variables.
- Scaling and centering predictors becomes essential in this case to prevent variables with larger scales from dominating distance calculations.
- Predictor variable selection is crucial, as irrelevant variables can negatively impact model performance.
- The example demonstrates using house size and the number of bedrooms to predict sale price.
- Visualizing predictions with multiple predictors requires a 3D surface instead of a 2D line.

Strengths and Weaknesses of KNN Regression:
→ Strengths:
  1. Simple and intuitive algorithm.
  2. Requires few assumptions about the data distribution.
  3. Works well with non-linear relationships.
→ Weaknesses:
  1. Computationally expensive for large datasets.
  2. Performance may degrade with many predictors.
  3. Limited extrapolation ability beyond the range of training data.

Chapter 8

This chapter introduces linear regression as an alternative to KNN regression for predicting numerical variables. Linear regression offers several advantages over KNN, including better performance with large datasets, faster computation, and interpretability. The chapter covers its implementation in R, advantages and disadvantages compared to KNN regression, and potential issues like multicollinearity and outliers. It also touches upon the broader applications of regression beyond prediction.

Simple Linear Regression:
- Simple linear regression involves predicting a numerical response variable using a single predictor variable.
- It fits a straight line through the training data to minimize the average squared vertical distance between the line and the data points. This line is called the line of best fit.
- The equation for the line of best fit is: house sale price = β0 + β1 ⋅ (house size), where:
  ○ β0 is the vertical intercept (the price when the house size is 0).
  ○ β1 is the slope (how quickly the price increases as the house size increases).
- The coefficients β0 and β1 are determined by minimizing the average squared vertical distances, a method known as least squares.
- Linear regression can make predictions for any value of the predictor, but it's generally advisable to avoid extrapolating beyond the range of the observed data, as the relationship might not hold true outside that range.
- RMSPE (Root Mean Square Prediction Error) is used to evaluate the predictive accuracy of the model.

Linear Regression in R:
- The tidymodels framework in R provides tools for performing linear regression.
- Key steps include (see the code sketch below):
  ○ Creating a linear_reg() model specification with the lm engine.
  ○ Defining a recipe using recipe(), specifying the relationship between the response and predictor variables.
  ○ Building a workflow using workflow(), adding the recipe and model, and fitting the model to the training data using fit().
  ○ Predicting on the test data using predict().
  ○ Evaluating performance using metrics().
- The geom_smooth() function can be used to visualize the line of best fit on a scatter plot.
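A minimal sketch contrasting the KNN regression and linear regression model specifications, then fitting and evaluating the linear model. The housing_train and housing_test data frames and the price and sqft columns are hypothetical placeholders.

    library(tidymodels)

    # KNN regression: same spec as classification, but with mode "regression"
    knn_reg_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
      set_engine("kknn") |>
      set_mode("regression")

    # Simple linear regression with the lm engine
    lm_spec <- linear_reg() |>
      set_engine("lm") |>
      set_mode("regression")

    housing_recipe <- recipe(price ~ sqft, data = housing_train)

    lm_fit <- workflow() |>
      add_recipe(housing_recipe) |>
      add_model(lm_spec) |>
      fit(data = housing_train)

    # Predict on the test set and compute error metrics (rmse here plays the role of RMSPE)
    predict(lm_fit, housing_test) |>
      bind_cols(housing_test) |>
      metrics(truth = price, estimate = .pred)
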
Chapter 9

This chapter introduces clustering as an unsupervised data analysis technique used to separate a dataset into subgroups (clusters) of related data. Unlike supervised techniques like classification and regression, clustering doesn't rely on a response variable with known labels or values. The chapter focuses on the K-means clustering algorithm and explores techniques for choosing the optimal number of clusters, including the elbow method.

What is Clustering?
- Clustering aims to group data points based on their similarity, measured using distance metrics like Euclidean distance.
- Applications:
  ○ Grouping documents by topic.
  ○ Separating human genetic information into ancestral sub-populations.
  ○ Segmenting online customers based on purchasing behaviors.

Clustering vs. Classification/Regression:
- Clustering is unsupervised: No response variable labels or values are used to guide the grouping process.
- Classification/Regression are supervised: They leverage labeled data to predict categorical labels or numerical values.
- Advantages of clustering:
  ○ No need for data annotation.
  ○ Can uncover patterns in data without pre-existing labels.
- Disadvantages of clustering:
  ○ Evaluating cluster quality can be challenging.
  ○ No single "best" evaluation method exists.

K-Means Clustering:
- K-means is a widely used algorithm that iteratively groups data into K clusters.
- Steps:
  1. Initialization: Randomly assign data points to K clusters.
  2. Center update: Calculate the cluster centers (the mean of each variable for the data points within the cluster).
  3. Label update: Reassign each data point to the cluster with the nearest center (using Euclidean distance).
  4. Iteration: Repeat steps 2 and 3 until cluster assignments stabilize.
- Cluster Quality: Measured by the within-cluster sum-of-squared-distances (WSSD), representing the spread of data points around their cluster center.
  ○ Lower total WSSD indicates better clustering.

Choosing the Number of Clusters (K):
- Elbow Method: Plot the total WSSD against different values of K.
  ○ The "elbow" point on the plot, where the decrease in total WSSD levels off, suggests an appropriate K value.
- Challenges:
  ○ K-means can get "stuck" in suboptimal solutions due to random initialization.
  ○ Random Restarts: Run K-means multiple times with different random initializations and choose the clustering with the lowest total WSSD.
  ○ Use the nstart argument in the kmeans function in R to specify the number of restarts.

Data Pre-processing for K-means:
- Standardization: Scaling and centering variables is crucial to prevent variables with larger scales from dominating distance calculations.
  ○ Use the scale function in R to standardize data.

K-means in R (see the code sketch below):
- Use the kmeans function with arguments:
  ○ data: Data frame containing the data to cluster.
  ○ centers: The number of clusters (K).
  ○ nstart: Number of random restarts.
- Use the broom package functions:
  ○ augment: Adds cluster assignments to the data frame.
  ○ glance: Extracts clustering statistics, including total WSSD.
- Visualize clusters using colored scatter plots (e.g., with ggplot2).

Illustrative Example:
- The chapter utilizes the penguin_data dataset from the palmerpenguins R package, focusing on penguin bill and flipper lengths (standardized) to identify potential penguin subtypes. Figures and explanations throughout the chapter illustrate the K-means algorithm, the impact of different K values, and the elbow method for choosing K.

Advantages, Limitations, and Assumptions of K-Means:
- The chapter mentions advantages and limitations of K-means but doesn't explicitly detail them within the provided excerpt. It does note that the basic K-means algorithm assumes quantitative data and uses Euclidean distance for similarity calculations. For non-quantitative data, other clustering algorithms or distance metrics are needed.
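A minimal sketch of K-means with kmeans and the broom helpers, plus a simple elbow-method loop. The penguins_std data frame (a standardized, numeric-only table of bill and flipper lengths) is a hypothetical placeholder.

    library(tidyverse)
    library(broom)

    set.seed(2023)

    penguin_clust <- kmeans(penguins_std, centers = 3, nstart = 10)

    # augment: attach the cluster assignment (.cluster) to each observation
    clustered <- augment(penguin_clust, penguins_std)

    ggplot(clustered, aes(x = bill_length, y = flipper_length, color = .cluster)) +
      geom_point()

    # glance: one-row summary, including the total within-cluster SS (tot.withinss)
    glance(penguin_clust)

    # Elbow method: total WSSD for K = 1..9
    elbow_stats <- tibble(k = 1:9) |>
      rowwise() |>
      mutate(total_wssd = glance(kmeans(penguins_std, centers = k, nstart = 10))$tot.withinss)

    ggplot(elbow_stats, aes(x = k, y = total_wssd)) +
      geom_point() +
      geom_line()
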

Chapter 10

This chapter introduces the concept of statistical inference, which involves drawing conclusions about an unknown population based on a sample of data from that population. It explains why sampling is necessary and discusses two key techniques: point estimation and interval estimation.

Why Sampling?
- It is often impractical, costly, or impossible to collect data on an entire population.
- Instead, we can collect data on a smaller sample (subset) of the population and use it to estimate characteristics of the whole population.
- This process of using a sample to make conclusions about the population is called statistical inference.
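A minimal sketch of point estimation from a single random sample, assuming a hypothetical population data frame airbnb with a room_type column.

    library(tidyverse)

    set.seed(4321)

    # Draw a random sample of 40 observations from the population (hypothetical data)
    one_sample <- airbnb |>
      slice_sample(n = 40)

    # Point estimate: the sample proportion of listings that are entire homes
    one_sample |>
      summarize(prop_entire_home = mean(room_type == "Entire home/apt"))
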