Data Science Key Concepts

Questions and Answers

Which of the following is NOT a typical characteristic of a good Key Performance Indicator (KPI)?

  • It is aligned with the objective or goal of the business problem.
  • It is measurable given the available data and assumptions.
  • It is complex and difficult to understand to ensure rigor. (correct)
  • It directly measures progress toward achieving a business goal.

Stratification ensures a successful model if the average of the training set is significantly different from the average of the validation set.

False (B)

What type of data analysis involves finding patterns in data without a pre-defined target variable?

unsupervised techniques

A model with the lowest __________ strikes a balance between bias and variance.

RMSE

What is the primary purpose of using stepwise selection in the context of overfitting?

To identify and drop weak predictors, thereby reducing model flexibility. (B)

A p-value greater than 0.05 for a coefficient estimate indicates that the coefficient is statistically significant.

False (B)

In the context of variables, what is another term for a categorical variable?

factor variable

In the context of data analysis, a __________ is used for a right-skewed distribution.

log transformation

Match the following terms with their definitions:

  • Sensitivity = True positives / Real positives
  • Specificity = True negatives / Real negatives
  • Precision = True positives / Predicted positives
  • Accuracy = (True positives + True negatives) / All

Which of the following is a disadvantage of using a scatterplot for bivariate analysis when dealing with a large number of data points?

It is difficult to see how many points are in a certain range and identify the overall trend. (A)

In a multiple linear regression, a residual vs. predicted value plot is used to determine homoscedasticity, where the variance of errors is non-constant.

False (B)

In the context of MLR assumptions, what does a qq plot help to check?

normality of residuals

If predictor variables are highly correlated with each other, they exhibit ________, which is a violation and causes problems if they are used together in a model.

collinearity

What should you do if a model performs excellently on the training set but significantly worse on the test set?

Conclude that the model is likely overfitted. (A)

In boosted trees, increasing the learning rate always decreases the test RMSE, leading to better model performance.

False (B)

In the model coefficient table, what kind of link is used, given that the family is specified as Gamma with a log link?

log

Unlike stepwise selection, __________ fits all predictors simultaneously to optimize a loss function.

shrinkage

Which statement is true regarding stepwise selection? (Choose the best answer)

Backward with AIC produces more predictors. (D)

As Lambda increases, variance increases and flexibility reduces.

False (B)

What kind of model is used if alpha is between 0 and 1?

Elastic Net Regression

The __________ determines how the mean (expected value of the target) changes in response to changes in the predictors.

link function

Match the reasons with the distributions:

  • Normal = Allows both positive and negative values
  • Poisson = Count variables; integer values
  • Gamma = When the target variable takes only positive values
  • Binomial = When the response variable is binary

What kind of model is used if there is a linear relationship between predictor and target variable? (Choose the best answer)

GLM (D)

Weights assign a constant value to each observation in the model

False (B)

What is chosen to minimize total SSE (impurity) in regression trees?

the split

Flashcards

Key Performance Indicator (KPI)

Variable used to measure the success of a business goal or objective.

Stratification

Dividing a dataset into smaller groups based on predictor variables, maintaining original proportions in each group.

Target Leakage

When predictor variables inadvertently contain information about the target variable used in training.

Oversampling

Adjusting a dataset to have an equal representation of minority and majority classes by duplicating minority observations.

Unstructured Data

Data that lacks a predefined format, offering qualitative insights but requiring more processing.

Unsupervised Techniques

Techniques that identify data patterns without a predefined target variable.

Variance

Measures how much a model's shape changes with new training data.

Bias

Measures how close the modeled distribution is to the actual distribution.

Overfitting

A model that performs well on training data but poorly on unseen data.

Underfitting

A model that is too simple to capture the underlying patterns in the data.

Categorical Variable

A variable whose numeric values don't have a meaningful order or scale.

Log Transformation

Compresses larger values and expands smaller values to produce a more symmetric distribution; reduces outlier impact.

Binarization

Converting a variable into binary form for clustering or stepwise selection.

Sensitivity

True positives / real positives (TP / (TP + FN)). Measures a model's ability to detect positive values correctly.

Specificity

True negatives / real negatives (TN / (TN + FP)). Measures a model's ability to detect negative values correctly.

Accuracy

% of correct predictions = (TN+TP) / all

Precision

True positives / predicted positives (TP / (TP + FP)). Measures the accuracy of positive predictions.

Residual Plot

Used to determine if a model suffers from bias or heteroscedasticity.

QQ Plot

A plot to check if data is normally distributed.

Bivariate analysis

Examines the relationship between two variables.

Coefficient Estimates

The values OLS seeks by minimizing the sum of squared errors.

Interaction Term

An additional effect of one predictor that depends on another predictor, on top of the main effects.

Forward Selection

Start with no predictors, then add them one at a time based on a performance metric.

Backward Selection

Start with all predictors, then remove them one at a time based on a performance metric.

Regularization

A technique that addresses overfitting by penalizing (shrinking) coefficient estimates.

Study Notes

  • KPI stands for Key Performance Indicator

Qualities of a useful KPI

  • Measurable with available data
  • Aligned with the business objective
  • Directly measures goal achievement, e.g., using revenue as a measure of business profitability

Stratified Sampling

  • Stratified samples are created by defining strata (combinations of predictor variables) and then sampling randomly within each stratum
  • Proportional sample sizes that reflect the original dataset's population proportions should be used for each stratum
  • Stratification is successful when the averages of the training and validation sets are similar, as in the sketch below
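
Below is a minimal sketch of a stratified train/validation split using scikit-learn; the DataFrame `df` and the stratification column "region" are hypothetical stand-ins for the strata described above.

```python
# Minimal sketch: stratified train/validation split with scikit-learn.
# The DataFrame `df` and the "region" strata column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east", "west", "east", "west"],
    "target": [1.0, 2.0, 1.5, 2.5, 1.2, 2.2, 1.1, 2.4],
})

train, valid = train_test_split(
    df, test_size=0.25, stratify=df["region"], random_state=42
)

# Each stratum keeps roughly its original proportion in both splits,
# and the target means of the two splits should be similar.
print(train["region"].value_counts(normalize=True))
print(train["target"].mean(), valid["target"].mean())
```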

Target Leakage

  • Occurs when predictor variables include information directly related to the target variable
  • E.g., using the target variable as an input during training

Balanced Dataset

  • Achieved through oversampling a minority class until it has equal representation to the majority class
  • Oversampling duplicates minority class observations, as sketched below
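
A minimal sketch of oversampling by duplication with pandas; the "fraud" flag and the 8:2 class split are hypothetical.

```python
# Minimal sketch: duplicate minority-class rows until the classes are balanced.
import pandas as pd

df = pd.DataFrame({"x": range(10), "fraud": [0] * 8 + [1] * 2})

minority = df[df["fraud"] == 1]
majority = df[df["fraud"] == 0]

# Sample the minority class with replacement up to the majority count,
# then shuffle the combined, balanced dataset.
oversampled = minority.sample(n=len(majority), replace=True, random_state=1)
balanced = pd.concat([majority, oversampled]).sample(frac=1, random_state=1)

print(balanced["fraud"].value_counts())  # both classes now have 8 rows
```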

Unstructured Data

  • Provides qualitative information but demands complex processing and more resources

Unsupervised Techniques

  • Analyse data without a defined target variable
  • PCA and clustering are useful for feature creation
  • Aids in model selection and improves prediction accuracy by identifying patterns between predictors

Developing Analysis to Estimate a Target Variable Includes

  • Filtering irrelevant information to focus on flights over 300 miles when analysing beverages
  • Identifying key variables such as what to predict as the target variable
  • Determining which predictors influence target variables
  • Measuring average daily passengers on flights into and out of Washington state to gauge market size

Missing Data

  • Missing data can be handled by (both approaches sketched below):
    - Replacing missing values with the mean
    - Removing rows or columns with disproportionate amounts of missing data
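
A minimal pandas sketch of both approaches; the columns "income" and "mostly_missing" are hypothetical.

```python
# Minimal sketch: two common ways to handle missing data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, 58_000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Option 1: replace missing numeric values with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Option 2: drop a column (or rows) with a disproportionate share of missing values.
df = df.drop(columns=["mostly_missing"])
```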

Bias-Variance Tradeoff

  • Considers that as one decreases, the other increases
  • Variance reflects model sensitivity to changes in training data
  • Bias reflects how close modeled and actual distributions are
  • Higher flexibility coincides with higher variance and lower bias
  • High variance may indicate that the model is too flexible
  • High bias may indicate that the model is insufficiently flexible
  • A model with the lowest Root Mean Squared Error strikes a balance

Overfitting and Underfitting

Overfitting

  • Can be addressed by removing weak predictors through stepwise selection
  • Can be addressed by shrinking coefficients through regularization with lasso or elastic net

Underfitting

  • Can be addressed by adding interaction terms to describe patterns between two or more predictors

P-Value

  • If a coefficient estimate has a p-value less than 0.05, it is statistically significant

Numeric vs Categorical Variables

  • Factor variables are preferred when inputs lack meaningful order or scale
  • Factor variables can be preferred due to arbitrary ordering of clusters
  • Factors can force the model to ignore the number values and treat items as different groups
  • When there is not a monotonic relationship between the predictor and target variable, a factor variable can be used
  • Factor variables may introduce too many levels, complicating the model
  • Bivariate analysis helps determine if a predictor variable should be numeric or categorical

Factor Variables

  • If a level lacks enough observations, low representation might create unreliable statistics
  • Combining a level that has too few observations with another level can obscure its true effect on the target variable

Log Transformation

  • Compresses larger values and expands smaller values to achieve more symmetric distribution
  • Used for dealing with right-skewed data
  • Cannot log-transform non-positive values
  • Reduces impacts from outliers

Disadvantages of Log Transforms

  • Makes interpreting model coefficients more challenging
  • May not necessarily improve model performance
  • Can exhibit a spike at 0
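
A minimal sketch of a log transform applied to a hypothetical right-skewed series; it is only valid here because every value is strictly positive.

```python
# Minimal sketch: log-transforming a right-skewed, strictly positive variable.
import numpy as np
import pandas as pd

claims = pd.Series([120, 150, 180, 240, 400, 950, 8200], dtype=float)

log_claims = np.log(claims)              # valid only because all values are > 0
print(claims.skew(), log_claims.skew())  # skewness drops after the transform

# Predictions made on the log scale must be exponentiated back, which is part
# of why coefficients become harder to interpret after the transform.
```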

Binarization

Clustering Context

  • For 4 clusters, choose a base level and create dummy variables for the other clusters, setting each dummy to 1 if the observation is in that cluster and 0 otherwise

Stepwise Selection Context

  • Each factor variable level is treated separately which may mean only some levels appear in the chosen model
  • When a dummy variable is removed, its level merges with the base level, since neither is distinguished by a dummy variable in the model
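
A minimal sketch of binarizing a hypothetical 4-level cluster label with pandas, holding one level out as the base.

```python
# Minimal sketch: dummy variables for a 4-level cluster label.
import pandas as pd

df = pd.DataFrame({"cluster": ["A", "B", "C", "D", "B", "A"]})

# drop_first=True keeps cluster A as the base level; the other three levels
# each get their own 0/1 dummy column.
dummies = pd.get_dummies(df["cluster"], prefix="cluster", drop_first=True)
print(dummies.columns.tolist())  # ['cluster_B', 'cluster_C', 'cluster_D']
```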

Accuracy Metrics - Confusion Matrix

  • Used for classification problems, not regression

Sensitivity

  • Represents true positives divided by real positives
  • TP / (TP + FN)
  • High sensitivity means the model is effective at identifying actual positives rather than misclassifying them

Specificity

  • Represents true negatives divided by real negatives
  • TN / (TN + FP)

Accuracy and Related Rates

  • Accuracy: the percentage of correct predictions, (TP + TN) divided by all outcomes
  • Classification error rate: the percentage of wrong predictions, 1 minus accuracy
  • False positive rate: FP / (FP + TN), also equal to 1 minus specificity

Precision

  • Percentage of positive predictions that are true positives
  • TP / (TP + FP)
  • High precision denotes the model effectively classifies positive outcomes when they are positive

Statistical Goals

  • Focus on sensitivity to accurately predict positives
  • Focus on specificity to accurately predict negatives
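
A minimal sketch computing these metrics from a confusion matrix with scikit-learn; the label vectors are hypothetical.

```python
# Minimal sketch: confusion-matrix metrics for a binary classifier.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # true positives / real positives
specificity = tn / (tn + fp)          # true negatives / real negatives
precision   = tp / (tp + fp)          # true positives / predicted positives
accuracy    = (tp + tn) / (tp + tn + fp + fn)
fpr         = 1 - specificity         # false positive rate
```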

Lowering Positive Response Cutoff

  • Allows more positive predictions and lowers negative predictions, increasing sensitivity but decreasing specificity

Exploratory Data Analysis: Univariate & Bivariate Techniques

Univariate Numerical Analysis

  • Includes mean, variance, quantiles, and frequency

Univariate Graphical Analysis

  • Includes histograms, bar charts, and box plots

Bivariate Numerical Analysis

  • Includes correlation, statistics by level, and frequency tables

Bivariate Graphical Analysis

  • Includes scatter plots and side-by-side plots (histograms, bar charts, and box plots split by level)

Bivariate Analysis

Scatterplots

  • Show the full range of observations
  • Can be hard to interpret when too many data points are present

Side-by-side box plots

  • Show the distribution of each x range and easily show outliers
  • Conceal observation quantities in each range of x

Highly Correlated Variables

  • Predictors that are highly correlated with the target are very predictive
  • Correlated predictors exhibit collinearity which can be an issue if they are used together in a model

MLR Assumptions

  • Ordinary Least Squares seeks estimates that minimize the Sum of Squared Errors

Residual vs Predicted Value Plot

  • Detects bias and heteroscedasticity
Ideal Look
  • Mean of errors is zero
  • Variance of errors is constant indicating homoscedasticity
  • Errors are independent
Residuals Plot
  • Balanced around 0 indicates no bias
  • Constant spread of residuals
  • Points that appear random suggests no obvious trend
If the Residuals Spread Out
  • The normal distribution and identity link may not be appropriate

More MLR Assumptions

  • Errors are normally distributed; in addition, no predictor can be perfectly correlated with another
More MLR Violations
  • A non-zero average of residuals (bias)
  • Heteroscedasticity
  • Dependent non-normal errors
  • Outliers
  • Collinearity
  • Too many predictors, indicating high dimensionality

QQ Plot

  • Checks for deviations from the superimposed line, where deviations indicate outliers and data that is not normally distributed
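
A minimal sketch of both diagnostic plots using statsmodels and matplotlib on hypothetical simulated data.

```python
# Minimal sketch: residual-vs-predicted plot and Q-Q plot for an OLS fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 2.0 + 0.5 * df["x"] + rng.normal(0, 1, 200)

X = sm.add_constant(df[["x"]])
fit = sm.OLS(df["y"], X).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fit.fittedvalues, fit.resid, s=10)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="Predicted", ylabel="Residual", title="Residuals vs predicted")

# Points close to the reference line suggest roughly normal errors.
sm.qqplot(fit.resid, line="45", fit=True, ax=axes[1])
plt.tight_layout()
plt.show()
```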

MLR Shortcoming

  • Allows for negative predictions, which is not always appropriate

Predicted vs. Actual Plot Interpretation

  • Points on the red line indicate the predicted values accurately match the data
  • Points above the red line suggest the model underestimates
  • Points below the red line suggest the model overestimates

Model Choice: Weak Model Indicators

  • Predictor variables are highly correlated
  • There is a non-linear relationship with the target variable
  • Missing data for a substantial share of observations

Data With Issues

  • Replacing a predictor's missing values with a constant (its mean) can lead to perfect collinearity
  • For example, an interaction term B·x1·x2 becomes B·c·x1, a constant multiple of x1, when x2 is replaced with its constant mean c
Scatterplots
  • Can be used to see if there is a linear monotonic relationship between the two variables
  • Otherwise, a non-linear (e.g., quadratic) relationship may exist

Linear Monotonic Relationship

  • Desirable for GLM, or else the model will only fit the values that are close to zero

Model Improvements

  • When the histogram is right skewed, log-transforming makes for a better fit
  • A numeric variable can be changed to a factor for a better fit that tolerates a non-linear or non-monotonic relationship
  • Factor variables can increase complexity
  • Adding an interaction term improves the model
  • Adding variables can also improve the model by explaining more data patterns
  • Removing levels or noise factors can improve the model
  • Removing data for 2020 impacted by COVID-19 could be a step toward model improvement

Untransformed Model

  • May lack a linear monotonic relationship
  • Can result in a fitted line with a slight uptrend

Transformed Model

  • Shows a monotonic tendency on either side
  • Enables a better fit to the data
  • Results in more accurate predictions

ROC and AUC Metrics

  • Both evaluate the performance of a classification model

ROC

  • ROC plots all possible combinations of TPR (sensitivity) and FPR (one minus specificity) for different cutoff values

AUC

  • AUC measures the performance of the model as the area under the ROC curve
Similar AUC Scores Between Different Models
  • Indicate the models have a similar ability to distinguish between the two classes
ROC Metric
  • Helpful when comparing weighted versus unweighted models
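
A minimal sketch of a ROC curve and AUC with scikit-learn; the validation labels and predicted probabilities are hypothetical.

```python
# Minimal sketch: ROC curve and AUC for a binary classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_valid    = [0, 0, 1, 1, 0, 1, 0, 1]
prob_valid = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.5, 0.9]

fpr, tpr, cutoffs = roc_curve(y_valid, prob_valid)  # TPR = sensitivity, FPR = 1 - specificity
auc = roc_auc_score(y_valid, prob_valid)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")            # no-skill reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```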

Training vs. validation data (test data)

  • Training data is what is used to fit the model
  • Validation data is used to calculate metrics of the model
  • Comparing performance on validation data gives a better picture of out-of-sample error than training metrics alone
  • If splits are not available, conduct cross validation instead

Flexibility Parameter vs. RMSE graph

  • Training error is typically lower than test error at each flexibility level
  • If training results are much better than test results, it indicates overfitting
Boosted Trees
  • Each subsequent tree corrects errors from the prior trees
  • Increasing the learning rate increases flexibility but does not always decrease test RMSE
  • Do not use the test set to tune hyperparameters; reserve it for the final assessment

Using K-fold Cross Validation

  • Helps tune hyperparameters while keeping the test set for the final assessment, as in the sketch below
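
A minimal sketch of tuning a boosted tree's learning rate with 5-fold cross-validation on the training data only, keeping the test set for the final assessment; the synthetic data and parameter grid are illustrative assumptions.

```python
# Minimal sketch: K-fold CV for hyperparameter tuning, test set for final assessment.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"learning_rate": [0.01, 0.05, 0.1, 0.3]},
    scoring="neg_root_mean_squared_error",
    cv=5,  # 5-fold cross-validation on the training data only
)
grid.fit(X_train, y_train)

# The untouched test set is used once, for the final assessment.
best = grid.best_estimator_
test_rmse = np.sqrt(mean_squared_error(y_test, best.predict(X_test)))
print(grid.best_params_, round(test_rmse, 2))
```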

Interaction Variables

  • A scatterplot can suggest an interaction between variables, such as institutional control (public vs. private) versus cost to attend
  • The interaction appears as a different slope for each of the two levels

Interactions

  • Indicate that the effect of one predictor depends on the level of another
  • Without an interaction, the slope would be the same for every level
  • With an interaction, each level is allowed to have its own slope, as in the sketch below
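
A minimal sketch of an interaction term in a statsmodels formula; the columns "control", "cost", and "tuition" are hypothetical, echoing the example above.

```python
# Minimal sketch: an interaction lets each level of a factor have its own slope.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "control": rng.choice(["public", "private"], size=200),
    "cost": rng.uniform(10, 60, size=200),
})
slope = np.where(df["control"] == "public", 0.5, 1.2)  # different slope per level
df["tuition"] = 5 + slope * df["cost"] + rng.normal(0, 2, size=200)

# "cost * control" expands to cost + control + cost:control (the interaction term).
fit = smf.ols("tuition ~ cost * control", data=df).fit()
print(fit.params)
```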

Interpretation of Interaction and variable impacts

  • In the example, discounts have a positive effect and the passengers variable has the largest impact
  • The interaction helps isolate the positive effects before identifying the negative effects associated with passengers

Model Coefficient Table

Linear Model

  • In the example, if the school is public, the loan coefficient implies about a 3.4% difference relative to for-profit and other institutions
  • public institutions may receive grant funding

GLM and Gamma Log

  • The link can be read from the model specification, e.g., a Gamma family with a log link, as in the sketch below
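
A minimal sketch of fitting a Gamma GLM with a log link in statsmodels on simulated data; with a log link, exponentiating a coefficient gives its multiplicative effect on the expected target.

```python
# Minimal sketch: GLM with Gamma family and log link.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 1, 300)})
mu = np.exp(1.0 + 0.8 * df["x"])                # positive mean on the log-link scale
df["y"] = rng.gamma(shape=2.0, scale=mu / 2.0)  # strictly positive target

fit = smf.glm(
    "y ~ x", data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(np.exp(fit.params))  # multiplicative effect of each predictor on the mean
```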

Model with Private for Profit Institution

  • In the example, the institution has 50% of undergraduates on Pell Grants, and the effect is read from the exponentiated coefficient (exp of 0.006)
  • A p-value larger than 0.05 means the coefficient may not be statistically significant
  • Coefficient estimates are still needed for interpretation and can change if the variable is treated as numeric

Statistical Errors

  • Model-fitting errors can result from perfectly correlated predictors, i.e., collinearity
  • Imputing missing values with the mean can itself create perfect collinearity
  • Be careful when predictors are nearly collinear; check them before relying on stepwise selection

stepwise selection

  • Stepwise selection and shrinkage both reduce complexity as methods to address overfitting
  • To minimize overfitting, candidate models should be compared against alternatives with different numbers of predictors

stepwise selection directions

  • Forward selection is iterative: it starts with no predictors and adds them based on a performance metric
  • Backward selection starts with all predictors and removes them based on the metric
  • Backward and stepwise selection require fitting the full set of predictors to start the optimization
  • Backward selection may miss some candidate models that forward selection would consider

AIC Metric

  • Represents goodness of fit with a penalty for complexity, reflecting how well the model explains the data

Lower AIC Values

  • Are better: AIC balances the model's fit (error decreasing as predictors are added) against the complexity penalty (which increases); a forward-selection sketch follows below
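
A minimal sketch of forward selection driven by AIC, using statsmodels OLS; the candidate predictors x1–x3 and the simulated data are hypothetical.

```python
# Minimal sketch: forward selection, adding the predictor that lowers AIC the most.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=200)  # x3 is pure noise

selected, remaining = [], ["x1", "x2", "x3"]
best_aic = smf.ols("y ~ 1", data=df).fit().aic  # intercept-only model

while remaining:
    # AIC of every one-variable extension of the current model
    trials = {v: smf.ols("y ~ " + " + ".join(selected + [v]), data=df).fit().aic
              for v in remaining}
    candidate = min(trials, key=trials.get)
    if trials[candidate] >= best_aic:
        break  # no addition lowers AIC any further
    best_aic = trials[candidate]
    selected.append(candidate)
    remaining.remove(candidate)

print(selected, round(best_aic, 1))
```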

For Backward and Stepwise Selection

  • Removing a predictor also removes any interaction terms that include it from consideration
  • Run forward and backward selection with AIC and compare the resulting models

Selected variables should show as

  • More predictive and precise, e.g., lower error bars in graphs

Techniques to see what is best for the model

  • Variable selection narrows the model to the predictors that carry signal
  • Regularization penalizes coefficients and can remove variables
  • Increasing Lambda reduces the impact of the penalized coefficients

regularization

  • Addresses overfitting in the model by penalizing coefficients
  • Helps when coefficient estimates are very large or unstable
  • Can cause underfitting if the penalty shrinks the coefficients too far
  • Shrinks coefficient estimates toward (near) zero

ridge and lasso regression

  • Both minimize a penalized loss function; elastic net is a weighted combination of the two penalties

  • Dummy variables that have already been created can be included as predictors

  • If coefficients need to be removed entirely (variable selection), lasso must be used

  • Ridge shrinks estimates while keeping a similar structure, leaving every variable in the model

  • Lasso can shrink coefficients exactly to zero, removing variables and performing selection (see the sketch below)
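
A minimal scikit-learn sketch contrasting ridge, lasso, and elastic net on synthetic data; note that scikit-learn's `alpha` argument plays the role of Lambda in these notes, while `l1_ratio` corresponds to the alpha that mixes the two penalties.

```python
# Minimal sketch: ridge keeps every predictor, lasso can zero some out entirely.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize before penalizing

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # between ridge and lasso

print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # ridge rarely produces exact zeros
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # lasso drops weak predictors
print("elastic net zero coefficients:", np.sum(enet.coef_ == 0))
```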

GLM

  • Key assumption: the target variable follows a distribution in the exponential family
  • The link function relates the linear combination of predictors to the mean of the target

data selection

  • If the model requires certain variables, there could be a bias between the levels
  • Choose a distribution and link that prevent invalid predictions of the target
  • The target distribution may not always be normal

distribution type selection

  • Poisson is good for count data
  • Gamma is better for positive continuous data

function type selection

  • The identity link works for targets that can take any value
  • It does not account for (restrict) the sign of predictions

link type use

  • Identity link: allows predictions in any real units

  • Log link: restricts predictions to positive numbers only, matching a positive target

  • Logit link: used for probabilities, keeping predictions between 0 and 1

  • A linear (identity) link is simpler and easier to interpret, but can produce invalid predictions

  • A log link makes it easier to see multiplicative relationships across variables for both the target and the model

To decide between trees and a GLM

  • Trees can perform better when there is no clear trend and the relationship differs across levels or types
  • If the relationship is continuous and monotonic, use a GLM

Weights impact

  • Weights reflect the level of importance (e.g., exposure) of each observation
  • If models perform similarly, consider how the observations are weighted

Decision Tree notes

  • When interpreting a tree, each node indicates the number of observations it contains and the predicted value for that subset of the data, as in the sketch below
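
A minimal sketch of a small regression tree on synthetic data; the printed rules show each split, and `n_node_samples` gives the observation count at each node.

```python
# Minimal sketch: a depth-2 regression tree whose splits minimize total SSE.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Split rules and the predicted (mean) value at each leaf:
print(export_text(tree, feature_names=["x0", "x1", "x2"]))
# Number of observations reaching each node:
print(tree.tree_.n_node_samples)
```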
