Data Science Key Concepts

Questions and Answers

Which of the following is NOT a typical characteristic of a good Key Performance Indicator (KPI)?

  • It is aligned with the objective or goal of the business problem.
  • It is measurable given the available data and assumptions.
  • It is complex and difficult to understand to ensure rigor. (correct)
  • It directly measures progress toward achieving a business goal.

Stratification ensures a successful model if the average of the training set is significantly different from the average of the validation set.

False (B)

What type of data analysis involves finding patterns in data without a pre-defined target variable?

unsupervised techniques

A model with the lowest __________ strikes a balance between bias and variance.

RMSE

What is the primary purpose of using stepwise selection in the context of overfitting?

To identify and drop weak predictors, thereby reducing model flexibility. (B)

A p-value greater than 0.05 for a coefficient estimate indicates that the coefficient is statistically significant.

False (B)

In the context of variables, what is another term for a categorical variable?

factor variable

In the context of data analysis, a __________ is used for a right-skewed distribution.

log transformation

Match the following terms with their definitions:

  • Sensitivity = True positives / Real positives
  • Specificity = True negatives / Real negatives
  • Precision = True positives / Predicted positives
  • Accuracy = (True positives + True negatives) / All

Which of the following is a disadvantage of using a scatterplot for bivariate analysis when dealing with a large number of data points?

It is difficult to see how many points are in a certain range and identify the overall trend. (A)

In a multiple linear regression, a residual vs. predicted value plot is used to determine homoscedasticity, where the variance of errors is non-constant.

False (B)

In the context of MLR assumptions, what does a qq plot help to check?

normality of residuals

If predictor variables are highly correlated with each other, they exhibit ________, which is a violation and causes problems if they are used together in a model.

collinearity

What should you do if a model performs excellently on the training set but significantly worse on the test set?

Conclude that the model is likely overfitted. (A)

In boosted trees, increasing the learning rate always decreases the test RMSE, leading to better model performance.

False (B)

In the model coefficient table, what kind of link is used, given that the family is specified as Gamma with a log link?

log

Unlike stepwise selection, __________ fits all predictors simultaneously to optimize a loss function.

shrinkage

Which statement is true regarding stepwise selection? (Choose the best answer)

Backward with AIC produces more predictors. (D)

As Lambda increases, variance increases and flexibility reduces.

False (B)

What kind of model is used if alpha is between 0 and 1?

Elastic Net Regression

The __________ determines how the mean (expected value of the target) changes in response to changes in the predictors.

link function

Match the reasons with the distributions:

  • Normal = Allows both positive and negative values
  • Poisson = Count variables; integer values
  • Gamma = When the target variable takes only positive values
  • Binomial = When the response variable is binary

What kind of model is used if there is a linear relationship between predictor and target variable? (Choose the best answer)

GLM (D)

Weights assign a constant value to each observation in the model

False (B)

What is chosen to minimize total SSE (impurity) in regression trees?

the split

Flashcards

Key Performance Indicator (KPI)

Variable used to measure the success of a business goal or objective.

Stratification

Dividing a dataset into smaller groups based on predictor variables, maintaining original proportions in each group.

Target Leakage

When predictor variables inadvertently contain information about the target variable used in training.

Oversampling

Adjusting a dataset to have an equal representation of minority and majority classes by duplicating minority observations.

Unstructured Data

Data that lacks a predefined format, offering qualitative insights but requiring more processing.

Unsupervised Techniques

Techniques that identify data patterns without a predefined target variable.

Variance

Measures how much a model's shape changes with new training data.

Bias

Measures how close the modeled distribution is to the actual distribution.

Overfitting

A model that performs well on training data but poorly on unseen data.

Underfitting

A model that is too simple to capture the underlying patterns in the data.

Categorical Variable

A variable whose numeric values don't have a meaningful order or scale.

Log Transformation

Compresses larger values and expands smaller values to produce a more symmetric distribution; reduces outlier impact.

Binarization

Converting a variable into binary form for clustering or stepwise selection.

Sensitivity

True positives / real positives (TP / (TP + FN)). Measures a model's ability to detect positive values correctly.

Specificity

True negatives / real negatives (TN / (TN + FP)). Measures a model's ability to detect negative values correctly.

Accuracy

% of correct predictions = (TN+TP) / all

Precision

True positives / predicted positives (TP / (TP + FP)). Measures the accuracy of positive predictions.

Residual Plot

Used to determine if a model suffers from bias or heteroscedasticity.

QQ Plot

A plot to check if data is normally distributed.

Bivariate analysis

Examines the relationship between two variables.

Coefficient Estimates

The values OLS seeks by minimizing the sum of squared errors.

Interaction Term

An additional effect of one predictor that depends on another predictor, on top of the main effects.

Forward Selection

Start with no predictors, then add them one at a time based on a performance metric.

Backward Selection

Start with all predictors, then remove them one at a time based on a performance metric.

Regularization

A technique that addresses overfitting by penalizing (shrinking) coefficient estimates.

Study Notes

  • KPI stands for Key Performance Indicator

Qualities of a useful KPI

  • Measurable with available data
  • Aligned with the business objective
  • Directly measures goal achievement, e.g., using revenue as a measure of business profitability

Stratified Sampling

  • Stratified samples are created by defining strata (combinations of predictor variables) and then sampling randomly within each stratum
  • Proportional sample sizes that reflect the original dataset's population proportions should be used for each stratum
  • Stratification is successful when the averages of the training and validation sets are similar, as in the sketch below
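
Below is a minimal sketch of a stratified train/validation split using scikit-learn; the DataFrame `df` and the stratification column "region" are hypothetical stand-ins for the strata described above.

```python
# Minimal sketch: stratified train/validation split with scikit-learn.
# The DataFrame `df` and the "region" strata column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east", "west", "east", "west"],
    "target": [1.0, 2.0, 1.5, 2.5, 1.2, 2.2, 1.1, 2.4],
})

train, valid = train_test_split(
    df, test_size=0.25, stratify=df["region"], random_state=42
)

# Each stratum keeps roughly its original proportion in both splits,
# and the target means of the two splits should be similar.
print(train["region"].value_counts(normalize=True))
print(train["target"].mean(), valid["target"].mean())
```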

Target Leakage

  • Occurs when predictor variables include information directly related to the target variable
  • E.g., using the target variable as an input during training

Balanced Dataset

  • Achieved through oversampling a minority class until it has equal representation to the majority class
  • Oversampling duplicates minority class observations, as sketched below
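
A minimal sketch of oversampling by duplication with pandas; the "fraud" flag and the 8:2 class split are hypothetical.

```python
# Minimal sketch: duplicate minority-class rows until the classes are balanced.
import pandas as pd

df = pd.DataFrame({"x": range(10), "fraud": [0] * 8 + [1] * 2})

minority = df[df["fraud"] == 1]
majority = df[df["fraud"] == 0]

# Sample the minority class with replacement up to the majority count,
# then shuffle the combined, balanced dataset.
oversampled = minority.sample(n=len(majority), replace=True, random_state=1)
balanced = pd.concat([majority, oversampled]).sample(frac=1, random_state=1)

print(balanced["fraud"].value_counts())  # both classes now have 8 rows
```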

Unstructured Data

  • Provides qualitative information but demands complex processing and more resources

Unsupervised Techniques

  • Analyse data without a defined target variable
  • PCA and clustering are useful for feature creation
  • Aids in model selection and improves prediction accuracy by identifying patterns between predictors

Developing Analysis to Estimate a Target Variable Includes

  • Filtering irrelevant information to focus on flights over 300 miles when analysing beverages
  • Identifying key variables such as what to predict as the target variable
  • Determining which predictors influence target variables
  • Measuring average daily passengers on flights into and out of Washington state to gauge market size

Missing Data

  • Missing data can be handled by (both approaches sketched below):
    - Replacing missing values with the mean
    - Removing rows or columns with disproportionate amounts of missing data
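
A minimal pandas sketch of both approaches; the columns "income" and "mostly_missing" are hypothetical.

```python
# Minimal sketch: two common ways to handle missing data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, 58_000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Option 1: replace missing numeric values with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Option 2: drop a column (or rows) with a disproportionate share of missing values.
df = df.drop(columns=["mostly_missing"])
```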

Bias-Variance Tradeoff

  • Considers that as one decreases, the other increases
  • Variance reflects model sensitivity to changes in training data
  • Bias reflects how close modeled and actual distributions are
  • Higher flexibility coincides with higher variance and lower bias
  • High variance may indicate that the model is too flexible
  • High bias may indicate that the model is insufficiently flexible
  • A model with the lowest Root Mean Squared Error strikes a balance

Overfitting and Underfitting

Overfitting

  • Can be addressed by removing weak predictors through stepwise selection
  • Can be addressed by shrinking coefficients through regularization with lasso or elastic net

Underfitting

  • Can be addressed by adding interaction terms to describe patterns between two or more predictors

P-Value

  • If a coefficient estimate has a p-value less than 0.05, it is statistically significant

Numeric vs Categorical Variables

  • Factor variables are preferred when inputs lack meaningful order or scale
  • Factor variables can be preferred due to arbitrary ordering of clusters
  • Factors can force the model to ignore the number values and treat items as different groups
  • When there is not a monotonic relationship between the predictor and target variable, a factor variable can be used
  • Factor variables may introduce too many levels, complicating the model
  • Bivariate analysis helps determine if a predictor variable should be numeric or categorical

Factor Variables

  • If a level lacks enough observations, low representation might create unreliable statistics
  • Combining a level that has too few observations with another level can obscure its true effect on the target variable

Log Transformation

  • Compresses larger values and expands smaller values to achieve more symmetric distribution
  • Used for dealing with right-skewed data
  • Cannot log-transform non-positive values
  • Reduces impacts from outliers

Disadvantages of Log Transforms

  • Makes interpreting model coefficients more challenging
  • May not necessarily improve model performance
  • Can exhibit a spike at 0
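
A minimal sketch of a log transform applied to a hypothetical right-skewed series; it is only valid here because every value is strictly positive.

```python
# Minimal sketch: log-transforming a right-skewed, strictly positive variable.
import numpy as np
import pandas as pd

claims = pd.Series([120, 150, 180, 240, 400, 950, 8200], dtype=float)

log_claims = np.log(claims)              # valid only because all values are > 0
print(claims.skew(), log_claims.skew())  # skewness drops after the transform

# Predictions made on the log scale must be exponentiated back, which is part
# of why coefficients become harder to interpret after the transform.
```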

Binarization

Clustering Context

  • For 4 clusters, choose a base level and create dummy variables for the other clusters, setting each dummy to 1 if the observation is in that cluster and 0 otherwise

Stepwise Selection Context

  • Each factor variable level is treated separately which may mean only some levels appear in the chosen model
  • When a dummy variable is removed, its level merges with the base level, since neither is distinguished by a dummy variable in the model
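
A minimal sketch of binarizing a hypothetical 4-level cluster label with pandas, holding one level out as the base.

```python
# Minimal sketch: dummy variables for a 4-level cluster label.
import pandas as pd

df = pd.DataFrame({"cluster": ["A", "B", "C", "D", "B", "A"]})

# drop_first=True keeps cluster A as the base level; the other three levels
# each get their own 0/1 dummy column.
dummies = pd.get_dummies(df["cluster"], prefix="cluster", drop_first=True)
print(dummies.columns.tolist())  # ['cluster_B', 'cluster_C', 'cluster_D']
```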

Accuracy Metrics - Confusion Matrix

  • Used for classification problems, not regression

Sensitivity

  • Represents true positives divided by real positives
  • TP / (TP + FN)
  • High sensitivity means the model is effective at identifying actual positives rather than misclassifying them

Specificity

  • Represents true negatives divided by real negatives
  • TN / (TN + FP)

Accuracy and Related Rates

  • Accuracy: the percentage of correct predictions, (TP + TN) divided by all outcomes
  • Classification error rate: the percentage of wrong predictions, 1 minus accuracy
  • False positive rate: FP / (FP + TN), also equal to 1 minus specificity

Precision

  • Percentage of positive predictions that are true positives
  • TP / (TP + FP)
  • High precision denotes the model effectively classifies positive outcomes when they are positive

Statistical Goals

  • Focus on sensitivity to accurately predict positives
  • Focus on specificity to accurately predict negatives
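
A minimal sketch computing these metrics from a confusion matrix with scikit-learn; the label vectors are hypothetical.

```python
# Minimal sketch: confusion-matrix metrics for a binary classifier.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # true positives / real positives
specificity = tn / (tn + fp)          # true negatives / real negatives
precision   = tp / (tp + fp)          # true positives / predicted positives
accuracy    = (tp + tn) / (tp + tn + fp + fn)
fpr         = 1 - specificity         # false positive rate
```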

Lowering Positive Response Cutoff

  • Allows more positive predictions and lowers negative predictions, increasing sensitivity but decreasing specificity

Exploratory Data Analysis: Univariate & Bivariate Techniques

Univariate Numerical Analysis

  • Includes mean, variance, quantiles, and frequency

Univariate Graphical Analysis

  • Includes histograms, bar charts, and box plots

Bivariate Numerical Analysis

  • Includes correlation, statistics by level, and frequency tables

Bivariate Graphical Analysis

  • Includes scatter plots and side-by-side plots (histograms, bar charts, and box plots split by level)

Bivariate Analysis

Scatterplots

  • Show the full range of observations
  • Can be hard to interpret when too many data points are present

Side-by-side box plots

  • Show the distribution of each x range and easily show outliers
  • Conceal observation quantities in each range of x

Highly Correlated Variables

  • Predictors that are highly correlated with the target are very predictive
  • Correlated predictors exhibit collinearity which can be an issue if they are used together in a model

MLR Assumptions

  • Ordinary Least Squares seeks estimates that minimize the Sum of Squared Errors

Residual vs Predicted Value Plot

  • Detects bias and heteroscedasticity
Ideal Look
  • Mean of errors is zero
  • Variance of errors is constant indicating homoscedasticity
  • Errors are independent
Residuals Plot
  • Balanced around 0 indicates no bias
  • Constant spread of residuals
  • Points that appear random suggests no obvious trend
If the Residuals Spread Out
  • The normal distribution and identity link may not be appropriate

More MLR Assumptions

  • Errors are normally distributed; in addition, no predictor can be perfectly correlated with another
More MLR Violations
  • A non-zero average of residuals (bias)
  • Heteroscedasticity
  • Dependent non-normal errors
  • Outliers
  • Collinearity
  • Too many predictors, indicating high dimensionality

QQ Plot

  • Checks for deviations from the superimposed line, where deviations indicate outliers and data that is not normally distributed
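
A minimal sketch of both diagnostic plots using statsmodels and matplotlib on hypothetical simulated data.

```python
# Minimal sketch: residual-vs-predicted plot and Q-Q plot for an OLS fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 2.0 + 0.5 * df["x"] + rng.normal(0, 1, 200)

X = sm.add_constant(df[["x"]])
fit = sm.OLS(df["y"], X).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fit.fittedvalues, fit.resid, s=10)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="Predicted", ylabel="Residual", title="Residuals vs predicted")

# Points close to the reference line suggest roughly normal errors.
sm.qqplot(fit.resid, line="45", fit=True, ax=axes[1])
plt.tight_layout()
plt.show()
```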

MLR Shortcoming

  • Allows for negative predictions, which is not always appropriate

Predicted vs. Actual Plot Interpretation

  • Points on the red line indicate the predicted values accurately match the data
  • Points above the red line suggest the model underestimates
  • Points below the red line suggest the model overestimates

Model Choice: Weak Model Indicators

  • Predictor variables are highly correlated
  • There is a non-linear relationship with the target variable
  • Missing data for a substantial share of observations

Data With Issues

  • Replacing a predictor's missing values with a constant (its mean) can lead to perfect collinearity
  • For example, an interaction term B·x1·x2 becomes B·c·x1, a constant multiple of x1, when x2 is replaced with its constant mean c
Scatterplots
  • Can be used to see if there is a linear monotonic relationship between the two variables
  • Otherwise, a non-linear (e.g., quadratic) relationship may exist

Linear Monotonic Relationship

  • Desirable for GLM, or else the model will only fit the values that are close to zero

Model Improvements

  • When the histogram is right skewed, log-transforming makes for a better fit
  • A numeric variable can be changed to a factor for a better fit that tolerates a non-linear or non-monotonic relationship
  • Factor variables can increase complexity
  • Adding an interaction term improves the model
  • Adding variables can also improve the model by explaining more data patterns
  • Removing levels or noise factors can improve the model
  • Removing data for 2020 impacted by COVID-19 could be a step toward model improvement

Untransformed Model

  • May lack a linear monotonic relationship
  • Can result in a fitted line with a slight uptrend

Transformed Model

  • Shows a monotonic tendency on either side
  • Enables a better fit to the data
  • Results in more accurate predictions

ROC and AUC Metrics

  • Both evaluate the performance of a classification model

ROC

  • ROC plots all possible combinations of TPR (sensitivity) and FPR (one minus specificity) for different cutoff values

AUC

  • AUC measures the performance of the model as the area under the ROC curve
Similar AUC Scores Between Different Models
  • Indicate the models have a similar ability to distinguish between the two classes
ROC Metric
  • Helpful when comparing weighted versus unweighted models
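
A minimal sketch of a ROC curve and AUC with scikit-learn; the validation labels and predicted probabilities are hypothetical.

```python
# Minimal sketch: ROC curve and AUC for a binary classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_valid    = [0, 0, 1, 1, 0, 1, 0, 1]
prob_valid = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.5, 0.9]

fpr, tpr, cutoffs = roc_curve(y_valid, prob_valid)  # TPR = sensitivity, FPR = 1 - specificity
auc = roc_auc_score(y_valid, prob_valid)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")            # no-skill reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```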

Training vs. validation data (test data)

  • Training data is what is used to fit the model
  • Validation data is used to calculate metrics of the model
  • Comparing performance on validation data gives a better picture of out-of-sample error than training metrics alone
  • If splits are not available, conduct cross validation instead

Flexibility Parameter vs. RMSE graph

  • Training error is typically lower than test error at each flexibility level
  • If training results are much better than test results, it indicates overfitting
Boosted Trees
  • Each subsequent tree corrects errors from the prior trees
  • Increasing the learning rate increases flexibility but does not always decrease test RMSE
  • Do not use the test set to tune hyperparameters; reserve it for the final assessment

Using K-fold Cross Validation

  • Helps tune hyperparameters while keeping the test set for the final assessment, as in the sketch below
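
A minimal sketch of tuning a boosted tree's learning rate with 5-fold cross-validation on the training data only, keeping the test set for the final assessment; the synthetic data and parameter grid are illustrative assumptions.

```python
# Minimal sketch: K-fold CV for hyperparameter tuning, test set for final assessment.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"learning_rate": [0.01, 0.05, 0.1, 0.3]},
    scoring="neg_root_mean_squared_error",
    cv=5,  # 5-fold cross-validation on the training data only
)
grid.fit(X_train, y_train)

# The untouched test set is used once, for the final assessment.
best = grid.best_estimator_
test_rmse = np.sqrt(mean_squared_error(y_test, best.predict(X_test)))
print(grid.best_params_, round(test_rmse, 2))
```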

Interaction Variables

  • A scatterplot can suggest an interaction between variables, such as institutional control (public vs. private) versus cost to attend
  • The interaction appears as a different slope for each of the two levels

Interactions

  • Indicate that the effect of one predictor depends on the level of another
  • Without an interaction, the slope would be the same for every level
  • With an interaction, each level is allowed to have its own slope, as in the sketch below
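
A minimal sketch of an interaction term in a statsmodels formula; the columns "control", "cost", and "tuition" are hypothetical, echoing the example above.

```python
# Minimal sketch: an interaction lets each level of a factor have its own slope.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "control": rng.choice(["public", "private"], size=200),
    "cost": rng.uniform(10, 60, size=200),
})
slope = np.where(df["control"] == "public", 0.5, 1.2)  # different slope per level
df["tuition"] = 5 + slope * df["cost"] + rng.normal(0, 2, size=200)

# "cost * control" expands to cost + control + cost:control (the interaction term).
fit = smf.ols("tuition ~ cost * control", data=df).fit()
print(fit.params)
```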

Interpretation of Interaction and variable impacts

  • In the example, discounts have a positive effect and the passengers variable has the largest impact
  • The interaction helps isolate the positive effects before identifying the negative effects associated with passengers

Model Coefficient Table

Linear Model

  • In the example, if the school is public, the loan coefficient implies about a 3.4% difference relative to for-profit and other institutions
  • public institutions may receive grant funding

GLM and Gamma Log

  • The link can be read from the model specification, e.g., a Gamma family with a log link, as in the sketch below
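
A minimal sketch of fitting a Gamma GLM with a log link in statsmodels on simulated data; with a log link, exponentiating a coefficient gives its multiplicative effect on the expected target.

```python
# Minimal sketch: GLM with Gamma family and log link.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 1, 300)})
mu = np.exp(1.0 + 0.8 * df["x"])                # positive mean on the log-link scale
df["y"] = rng.gamma(shape=2.0, scale=mu / 2.0)  # strictly positive target

fit = smf.glm(
    "y ~ x", data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(np.exp(fit.params))  # multiplicative effect of each predictor on the mean
```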

Model with Private for Profit Institution

  • In the example, the institution has 50% of undergraduates on Pell Grants, and the effect is read from the exponentiated coefficient (exp of 0.006)
  • A p-value larger than 0.05 means the coefficient may not be statistically significant
  • Coefficient estimates are still needed for interpretation and can change if the variable is treated as numeric

Statistical Errors

  • Model-fitting errors can result from perfectly correlated predictors, i.e., collinearity
  • Imputing missing values with the mean can itself create perfect collinearity
  • Be careful when predictors are nearly collinear; check them before relying on stepwise selection

stepwise selection

  • Stepwise selection and shrinkage both reduce complexity as methods to address overfitting
  • To minimize overfitting, candidate models should be compared against alternatives with different numbers of predictors

stepwise selection directions

  • Forward selection is iterative: it starts with no predictors and adds them based on a performance metric
  • Backward selection starts with all predictors and removes them based on the metric
  • Backward and stepwise selection require fitting the full set of predictors to start the optimization
  • Backward selection may miss some candidate models that forward selection would consider

AIC Metric

  • Represents goodness of fit with a penalty for complexity, reflecting how well the model explains the data

Lower AIC Values

  • Are better: AIC balances the model's fit (error decreasing as predictors are added) against the complexity penalty (which increases); a forward-selection sketch follows below
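
A minimal sketch of forward selection driven by AIC, using statsmodels OLS; the candidate predictors x1–x3 and the simulated data are hypothetical.

```python
# Minimal sketch: forward selection, adding the predictor that lowers AIC the most.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=200)  # x3 is pure noise

selected, remaining = [], ["x1", "x2", "x3"]
best_aic = smf.ols("y ~ 1", data=df).fit().aic  # intercept-only model

while remaining:
    # AIC of every one-variable extension of the current model
    trials = {v: smf.ols("y ~ " + " + ".join(selected + [v]), data=df).fit().aic
              for v in remaining}
    candidate = min(trials, key=trials.get)
    if trials[candidate] >= best_aic:
        break  # no addition lowers AIC any further
    best_aic = trials[candidate]
    selected.append(candidate)
    remaining.remove(candidate)

print(selected, round(best_aic, 1))
```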

For Backward and Stepwise Selection

  • Removing a predictor also removes any interaction terms that include it from consideration
  • Run forward and backward selection with AIC and compare the resulting models

Selected variables should show as

  • More predictive and precise, e.g., lower error bars in graphs

Techniques to see what is best for the model

  • Variable selection narrows the model to the predictors that carry signal
  • Regularization penalizes coefficients and can remove variables
  • Increasing Lambda reduces the impact of the penalized coefficients

regularization

  • Addresses overfitting in the model by penalizing coefficients
  • Helps when coefficient estimates are very large or unstable
  • Can cause underfitting if the penalty shrinks the coefficients too far
  • Shrinks coefficient estimates toward (near) zero

ridge and lasso regression

  • Both minimize a penalized loss function; elastic net is a weighted combination of the two penalties

  • Dummy variables that have already been created can be included as predictors

  • If coefficients need to be removed entirely (variable selection), lasso must be used

  • Ridge shrinks estimates while keeping a similar structure, leaving every variable in the model

  • Lasso can shrink coefficients exactly to zero, removing variables and performing selection (see the sketch below)
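
A minimal scikit-learn sketch contrasting ridge, lasso, and elastic net on synthetic data; note that scikit-learn's `alpha` argument plays the role of Lambda in these notes, while `l1_ratio` corresponds to the alpha that mixes the two penalties.

```python
# Minimal sketch: ridge keeps every predictor, lasso can zero some out entirely.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize before penalizing

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # between ridge and lasso

print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # ridge rarely produces exact zeros
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # lasso drops weak predictors
print("elastic net zero coefficients:", np.sum(enet.coef_ == 0))
```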

GLM

  • Key assumption: the target variable follows a distribution in the exponential family
  • The link function relates the linear combination of predictors to the mean of the target

data selection

  • If the model requires certain variables, there could be a bias between the levels
  • Choose a distribution and link that prevent invalid predictions of the target
  • The target distribution may not always be normal

distribution type selection

  • Poisson is good for count data
  • Gamma is better for positive continuous data

function type selection

  • The identity link works for targets that can take any value
  • It does not account for (restrict) the sign of predictions

link type use

  • Identity link: allows predictions in any real units

  • Log link: restricts predictions to positive numbers only, matching a positive target

  • Logit link: used for probabilities, keeping predictions between 0 and 1

  • A linear (identity) link is simpler and easier to interpret, but can produce invalid predictions

  • A log link makes it easier to see multiplicative relationships across variables for both the target and the model

To decide between trees and a GLM

  • Trees can perform better when there is no clear trend and the relationship differs across levels or types
  • If the relationship is continuous and monotonic, use a GLM

Weights impact

  • Weights reflect the level of importance (e.g., exposure) of each observation
  • If models perform similarly, consider how the observations are weighted

Decision Tree notes

  • When interpreting a tree, each node indicates the number of observations it contains and the predicted value for that subset of the data, as in the sketch below
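
A minimal sketch of a small regression tree on synthetic data; the printed rules show each split, and `n_node_samples` gives the observation count at each node.

```python
# Minimal sketch: a depth-2 regression tree whose splits minimize total SSE.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Split rules and the predicted (mean) value at each leaf:
print(export_text(tree, feature_names=["x0", "x1", "x2"]))
# Number of observations reaching each node:
print(tree.tree_.n_node_samples)
```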
