Questions and Answers
Which of the following is NOT a typical characteristic of a good Key Performance Indicator (KPI)?
- It is aligned with the objective or goal of the business problem.
- It is measurable given the available data and assumptions.
- It is complex and difficult to understand to ensure rigor. (correct)
- It directly measures progress toward achieving a business goal.
Stratification ensures a successful model if the average of the training set is significantly different from the average of the validation set.
False (B)
What type of data analysis involves finding patterns in data without a pre-defined target variable?
Unsupervised techniques
A model with the lowest __________ strikes a balance between bias and variance.
Root Mean Squared Error (RMSE)
What is the primary purpose of using stepwise selection in the context of overfitting?
To remove weak predictors from the model
A p-value greater than 0.05 for a coefficient estimate indicates that the coefficient is statistically significant.
False
In the context of variables, what is another term for a categorical variable?
Factor variable
In the context of data analysis, __________ is used for right skewed distribution.
Log transformation
Match the following terms with their definitions:
Which of the following is a disadvantage of using a scatterplot for bivariate analysis when dealing with a large number of data points?
It can be hard to interpret when too many data points are present.
In a multiple linear regression, a residual vs. predicted value plot is used to determine homoscedasticity, where the variance of errors is non-constant.
False
In the context of MLR assumptions, what does a QQ plot help to check?
Whether the errors are normally distributed
If predictor variables are highly correlated to each other, they exhibit __________, which is a violation and can cause problems if used together in a model.
Collinearity
What should you do if a model performs excellently on the training set but significantly worse on the test set?
Reduce the model's flexibility, e.g., by removing weak predictors (stepwise selection) or shrinking coefficients (regularization)
In boosted trees, increasing the learning rate always decreases the test RMSE, leading to better model performance.
False
Fill in the blank: In the model coefficients table, what kind of link is used given that it is Gamma with log link?
Log link
In stepwise selection, __________ fits all predictors simultaneously to optimize a loss function.
Which statement is true regarding stepwise selection? (Choose the best answer)
As Lambda increases, variance increases and flexibility reduces.
False
What kind of model is used if the alpha is between 0 and 1?
Elastic net
The __________ determines how the mean (expected value of the target) changes in response to changes in a predictor.
Link function
Match the reasons with the distributions:
What kind of model is used if there is a linear relationship between predictor and target variable? (Choose the best answer)
Weights assign a constant value to each observation in the model.
False
What variable minimizes total SSE (impurity) in regression trees?
Flashcards
Key Performance Indicator (KPI)
Variable used to measure the success of a business goal or objective.
Stratification
Dividing a dataset into smaller groups based on predictor variables, maintaining original proportions in each group.
Target Leakage
When predictor variables inadvertently contain information about the target variable used in training.
Oversampling
Duplicating minority-class observations until the classes are balanced.
Unstructured Data
Data without a predefined format; provides qualitative information but demands complex processing.
Unsupervised Techniques
Methods that find patterns in data without a pre-defined target variable (e.g., PCA, clustering).
Variance
A model's sensitivity to changes in the training data.
Bias
How far the modeled distribution is from the actual distribution.
Overfitting
When a model is too flexible and fits noise in the training data, performing poorly on new data.
Underfitting
When a model is insufficiently flexible to capture the patterns in the data.
Categorical Variable
A variable that takes a limited set of levels; also called a factor variable.
Log Transformation
Compresses larger values and expands smaller values to make a right-skewed distribution more symmetric.
Binarization
Converting a categorical variable into dummy (0/1) variables relative to a base level.
Sensitivity
TP / (TP + FN): the proportion of actual positives correctly identified.
Specificity
TN / (TN + FP): the proportion of actual negatives correctly identified.
Accuracy
(TP + TN) divided by all outcomes: the percentage of correct predictions.
Precision
TP / (TP + FP): the percentage of positive predictions that are true positives.
Residual Plot
Plot of residuals against predicted values, used to detect bias and heteroscedasticity.
QQ Plot
Plot comparing residual quantiles to normal quantiles to check the normality of errors.
Bivariate Analysis
Examining the relationship between two variables, e.g., via correlation or scatter plots.
Coefficient Estimates
Fitted model parameters; a p-value below 0.05 indicates statistical significance.
Interaction Term
A term capturing how the effect of one predictor depends on the value of another.
Forward Selection
Stepwise method that starts with no predictors and adds them iteratively.
Backward Selection
Stepwise method that starts with all predictors and removes them iteratively.
Regularization
Shrinking coefficients via a penalty (e.g., lasso, ridge, elastic net) to reduce overfitting.
Study Notes
- KPI stands for Key Performance Indicator
Qualities of a useful KPI
- Measurable with available data
- Aligned with the business objective
- Directly measures progress toward the goal (e.g., revenue as a measure of business profitability)
Stratified Sampling
- Stratified samples are created by defining strata from combinations of predictor variables, then sampling randomly within each stratum
- Proportional sample sizes that reflect original dataset population proportions should be used for each stratum
- Stratification is successful when the average of training and validation sets are similar
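The sampling procedure above can be sketched in plain Python (the helper name `stratified_split` and the 75/25 split below are illustrative, not from the notes):

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.75, seed=42):
    """Split rows into train/validation sets, preserving the proportion
    of each stratum (the value returned by `key`) in both sets."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, valid = [], []
    for members in strata.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# 80 "A" rows and 20 "B" rows: both splits keep the original 4:1 ratio.
data = [{"grp": "A", "y": i} for i in range(80)] + \
       [{"grp": "B", "y": i} for i in range(20)]
train, valid = stratified_split(data, key=lambda r: r["grp"])
```

Because each stratum is sampled in proportion, the training and validation averages tend to be similar, which is the success criterion the notes describe.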
Target Leakage
- Occurs when predictor variables include information directly related to the target variable
- E.g., using the target variable as an input during training
Balanced Dataset
- Achieved through oversampling a minority class until it has equal representation to the majority class
- Oversampling duplicates minority class observations
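A minimal sketch of oversampling by duplication (the `is_fraud` label and the class sizes are invented for illustration):

```python
import random

def oversample_minority(rows, label="is_fraud", seed=0):
    """Duplicate randomly chosen minority-class rows until both
    classes have equal representation."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label] == 1]
    neg = [r for r in rows if r[label] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

# 5 positive vs 95 negative rows: after oversampling, 95 of each.
data = [{"is_fraud": 1}] * 5 + [{"is_fraud": 0}] * 95
balanced = oversample_minority(data)
```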
Unstructured Data
- Provides qualitative information but demands complex processing and more resources
Unsupervised Techniques
- Analyse data without a defined target variable
- PCA and clustering are useful for feature creation
- Aids in model selection and improves prediction accuracy by identifying patterns between predictors
Developing Analysis to Estimate a Target Variable Includes
- Filtering irrelevant information, e.g., focusing on flights over 300 miles when analysing beverage sales
- Identifying key variables such as what to predict as the target variable
- Determining which predictors influence target variables
- Measuring average daily passengers on flights into and out of Washington state to gauge market size
Missing Data
- Missing data can be handled by:
  - Replacing with the mean
  - Removing rows or columns with disproportionate amounts of missing data
Bias-Variance Tradeoff
- Considers that as one decreases, the other increases
- Variance reflects model sensitivity to changes in training data
- Bias reflects how close modeled and actual distributions are
- Higher flexibility coincides with higher variance and lower bias
- Variance may indicate that the model is too flexible
- Bias may indicate that the model is insufficiently flexible
- A model with the lowest Root Mean Squared Error strikes a balance
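The RMSE referred to above can be computed directly (a generic definition, not tied to any particular model in the notes):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: the square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Perfect predictions give 0; larger residuals inflate the metric quadratically.
print(rmse([3, 5, 7], [2, 5, 9]))  # sqrt((1 + 0 + 4) / 3)
```

An overly flexible model tends to have low training RMSE but high validation RMSE; the balanced model minimizes the latter.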
Overfitting and Underfitting
Overfitting
- Can be addressed by removing weak predictors through stepwise selection
- Can be addressed by shrinking coefficients through regularization with lasso or elastic net
Underfitting
- Can be addressed by adding interaction terms to describe patterns between two or more predictors
P-Value
- If a coefficient estimate has a p-value less than 0.05, it is statistically significant
Numeric vs Categorical Variables
- Factor variables are preferred when inputs lack meaningful order or scale
- Factor variables can be preferred due to arbitrary ordering of clusters
- Factors can force the model to ignore the number values and treat items as different groups
- When there is not a monotonic relationship between the predictor and target variable, a factor variable can be used
- Factor variables may introduce too many levels, complicating the model
- Bivariate analysis helps determine if a predictor variable should be numeric or categorical
Factor Variables
- If a level lacks enough observations, low representation might create unreliable statistics
- Combining a level that has too few observations with another level can obscure its true effect on the target variable
Log Transformation
- Compresses larger values and expands smaller values to achieve more symmetric distribution
- Used for dealing with right-skewed data
- Cannot log-transform non-positive values
- Reduces impacts from outliers
Disadvantages of Log Transforms
- Makes interpreting model coefficients more challenging
- May not necessarily improve model performance
- Can exhibit a spike at 0
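The log transform described above can be sketched with its constraint on non-positive values (the optional `shift` workaround for zeros is a common convention, not something the notes prescribe):

```python
import math

def log_transform(values, shift=0.0):
    """Apply log(x + shift). A positive shift (e.g., 1) is one common
    workaround for zeros, but values <= -shift still cannot be transformed."""
    if any(v + shift <= 0 for v in values):
        raise ValueError("log transform requires strictly positive inputs")
    return [math.log(v + shift) for v in values]

skewed = [1, 10, 100, 1000]      # right-skewed: each value 10x the last
print(log_transform(skewed))      # evenly spaced after the transform
```

The multiplicative gaps become additive, which is exactly the compression of large values the notes describe.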
Binarization
Clustering Context
- With 4 clusters, choose a base level and create dummy variables for the other clusters, setting each dummy to 1 if the observation belongs to that cluster and 0 otherwise
Stepwise Selection Context
- Each factor variable level is treated separately which may mean only some levels appear in the chosen model
- When a dummy variable is removed, its level merges into the base level
Accuracy Metrics - Confusion Matrix
- Used for classification problems, not regression
Sensitivity
- Represents true positives divided by real positives
- TP / (TP + FN)
- High sensitivity means the model is effective at identifying actual positives rather than misclassifying them
Specificity
- Represents true negatives divided by real negatives
- TN / (TN + FP)
- Accuracy determined as the percentage of correct predictions, or true positives plus true negatives divided by all outcomes
- Classification error rate determined by percentage of wrong predictions, 1 minus accuracy
- False Positive Rate: False positives divided by false positives plus true negatives, also equal to 1 minus specificity
Precision
- Percentage of positive predictions that are true positives
- TP / (TP + FP)
- High precision denotes the model effectively classifies positive outcomes when they are positive
Statistical Goals
- Focus on sensitivity to accurately predict positives
- Focus on specificity to accurately predict negatives
Lowering Positive Response Cutoff
- Allows more positive predictions and lowers negative predictions, increasing sensitivity but decreasing specificity
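The formulas above can be collected into one helper (the function name and the example counts are illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics exactly as defined in the notes."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / total,
        "error_rate":  1 - (tp + tn) / total,
        "fpr":         fp / (fp + tn),   # = 1 - specificity
    }

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Checking the identities here (error rate = 1 - accuracy, FPR = 1 - specificity) is a quick way to verify a hand-built confusion matrix.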
Exploratory Data Analysis: Univariate & Bivariate Techniques
Univariate Numerical Analysis
- Includes mean, variance, quantiles, and frequency
Univariate Graphical Analysis
- Includes histograms, bar charts, and box plots
Bivariate Numerical Analysis
- Includes correlation and statistics by level, frequency
Bivariate Graphical Analysis
- Includes scatter plots and side-by-side plots which enable histograms, bar charts and box plots
Bivariate Analysis
Scatterplots
- Show the full range of observations
- Can be hard to interpret when too many data points are present
Side-by-side box plots
- Show the distribution of each x range and easily show outliers
- Conceal observation quantities in each range of x
Highly Correlated Variables
- Predictors highly correlated with the target tend to be predictive
- Correlated predictors exhibit collinearity which can be an issue if they are used together in a model
MLR Assumptions
- Ordinary Least Squares seeks estimates that minimize the Sum of Squared Errors
Residual vs Predicted Value Plot
- Detects bias and heteroscedasticity
Ideal Look
- Mean of errors is zero
- Variance of errors is constant indicating homoscedasticity
- Errors are independent
Residuals Plot
- Balanced around 0 indicates no bias
- Constant spread of residuals
- Points that appear random suggests no obvious trend
If the Residuals Fan Out
- The normal distribution and identity link may not be appropriate
More MLR Assumptions
- Errors are normally distributed; in addition, predictors cannot be perfectly correlated with one another
More MLR Violations
- A non-zero average of residuals can occur
- Heteroscedasticity
- Dependent non-normal errors
- Outliers
- Collinearity
- Too many predictors, indicating high dimensionality
QQ Plot
- Checks for deviations from superimposed lines, where deviations indicate outliers and data that is not normally distributed
MLR Shortcoming
- Allows for negative predictions, which is not always appropriate
Predicted vs. Actual Plot Interpretation
- Points on the red line indicate the predicted values accurately match the data
- Points above the red line suggest the model underestimates
- Points below the red line suggest the model overestimates
Model Choice: Weak Model Indicators
- Predictor variables are highly correlated
- There is a non-linear relationship with the target variable
- Missing data for a good amount of observations
Data With Issues
- Replacing a predictor's missing values with a constant (its mean) can lead to perfect collinearity
- An interaction term Bx1x2 becomes Bcx1 when x2 is replaced with the constant mean c, making it perfectly collinear with x1
Scatterplots
- Can be used to see whether there is a linear monotonic relationship between the two variables
- Otherwise, the relationship may be non-linear (e.g., quadratic)
Linear Monotonic Relationship
- Desirable for a GLM; otherwise the model may only fit well for values close to zero
Model Improvements
- When the histogram is right skewed, log-transforming makes for a better fit
- Numeric variables can be changed to factors for a better fit that tolerates non-linear or non-monotonic relationships
- Factor variables can increase complexity
- Adding an interaction term improves the model
- Adding variables can also improve the model by explaining more data patterns
- Removing levels or noise factors can improve the model
- Removing data for 2020 impacted by COVID-19 could be a step toward model improvement
Untransformed Model
- May lack a linear monotonic relationship
- Can result in a fitted line with a slight uptrend
Transformed Model
- Shows a monotonic tendency
- Enables a better fit to the data
- Produces more accurate predictions
ROC and AUC Metrics
- They both evaluate the classification problem
ROC
- ROC plots all possible combinations of TPR (sensitivity) and FPR (one minus specificity) for different cutoff values
AUC
- AUC summarizes model performance as the area under the ROC curve
Similar AUC scores between different models
- Suggest the models have a similar ability to distinguish between the two classes
ROC Metric
- Helpful when comparing weighted versus unweighted models
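A from-scratch sketch of how ROC points and AUC are computed from scores and labels (the `>=` cutoff convention and the trapezoid rule are common choices; this is an illustration, not a specific library's implementation):

```python
def roc_auc(scores, labels):
    """Return (FPR, TPR) points for every cutoff, plus AUC by the trapezoid rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        preds = [s >= cutoff for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        points.append((fp / neg, tp / pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

# Perfectly separated scores give AUC = 1.0.
_, auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Sweeping the cutoff from high to low traces the curve from (0, 0) toward (1, 1), which is the sensitivity/specificity trade-off noted earlier.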
Training vs. validation data (test data)
- Training data is what is used to fit the model
- Validation data is used to calculate metrics of the model
- Comparing performance on validation data gives a better estimate of how the model generalizes beyond the training data
- If splits are not available, conduct cross validation instead
Flexibility Parameter vs. RMSE graph
- Training error is typically lower than test error at each level of flexibility
- A large gap between training and test performance indicates overfitting
Boosted Trees
- Each subsequent tree corrects the errors of the prior trees
- Increasing the learning rate does not always improve test RMSE
- The test set gives an estimate, not an exact measure, of out-of-sample error
- Do not use the test set to tune hyperparameters; reserve it for the final assessment
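A toy illustration of the boosting idea using one-split stumps fit to residuals (a deliberately simplified sketch with made-up data, not any library's implementation):

```python
def fit_stump(x, y):
    """Best single-split regression stump on 1-D data (minimum total SSE)."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((yi - lm) ** 2 for yi in left)
               + sum((yi - rm) ** 2 for yi in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi < t else rm

def boost(x, y, n_trees, lr=0.1):
    """Squared-error boosting: each stump is fit to the current residuals
    and added with weight lr (the learning rate)."""
    base = sum(y) / len(y)
    pred = [base] * len(x)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x, y = [1, 2, 3, 4, 5, 6], [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]
few, many = boost(x, y, 3), boost(x, y, 200)
sse = lambda model: sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y))
```

Training error keeps falling as trees are added, which is why training performance alone cannot be used to tune the ensemble.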
Using K-fold Cross Validation
- Helps tune hyper parameters and keep test assessments separate
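K-fold index generation can be sketched as follows (the round-robin fold assignment is one simple choice among many):

```python
def kfold_splits(n, k=5):
    """Partition indices 0..n-1 into k folds; each fold serves once as the
    validation set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# Every observation is validated exactly once across the k folds.
splits = list(kfold_splits(10, k=5))
```

Hyperparameters are tuned on these folds, so the held-out test set stays untouched until the final assessment.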
Interaction Variables
- A scatterplot showing different slopes across levels of a factor (e.g., control type versus cost to attend) indicates an interaction
Interactions
- Indicate that the effect of one predictor depends on the level of another
- Without an interaction, the slope would be the same for every level
- An interaction allows each level to have its own slope coefficient
Interpretation of Interactions and Variable Impacts
- Discounts have positive effects, and passenger counts have the largest impact
- Isolate the positive effects first, then identify the negative effects involving passengers
Model Coefficient Table
Linear Model
- If the school is public, the estimate implies roughly a 3.4% difference in loans relative to for-profit and other institutions
- Public institutions may receive grant funding
GLM with Gamma Family and Log Link
- The family and link can be read directly from the model output (here, Gamma with log link)
Model with Private For-Profit Institutions
- Under a log link, coefficients are interpreted by exponentiating the estimate (e.g., exp(0.006))
- A p-value larger than 0.05 means the coefficient is not statistically significant
- Estimates can change if a factor variable is instead treated as numeric
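Interpreting a coefficient under a log link can be shown numerically (the 0.006 estimate is the one mentioned above; the helper name is my own):

```python
import math

def log_link_effect(coef):
    """Under a log link, a one-unit increase in a predictor multiplies the
    predicted mean by exp(coef); return that effect as a percentage change."""
    return (math.exp(coef) - 1) * 100

# A coefficient of 0.006 implies roughly a 0.6% increase in the mean.
print(round(log_link_effect(0.006), 2))
```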
Statistical Errors
- Errors can result from perfectly correlated predictors (collinearity)
- A factor level with too few observations produces unreliable estimates
- Replacing missing values with the mean can itself create perfect collinearity
- When predictors are nearly collinear, use care; stepwise selection can help decide which to keep
Stepwise Selection
- Reduces model complexity by adding or removing predictors one at a time, helping to minimize overfitting
- Forward selection starts with no predictors and adds them iteratively
- Backward selection starts with all predictors and removes them iteratively
- Shrinkage (regularization), by contrast, fits all predictors simultaneously to optimize a loss function
- Forward and backward selection may each miss models the other would find
AIC Metric
- Represents goodness of fit penalized for model complexity
- Lower values are better: the error term decreases as predictors are added, while the complexity penalty increases
- Backward and stepwise selection can use AIC to decide which terms, including interactions, to remove
- The selected variables should make the model more precise and parsimonious
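The AIC trade-off can be illustrated from its definition, AIC = 2k - 2·ln(L) (the log-likelihood values below are invented for illustration):

```python
def aic(log_likelihood, k):
    """AIC = 2k - 2*ln(L): smaller is better; the 2k term penalizes each
    extra parameter, offsetting improvements in fit."""
    return 2 * k - 2 * log_likelihood

# A model that fits slightly better but needs two more parameters can still lose.
simple = aic(log_likelihood=-100.0, k=3)   # 206.0
complex_ = aic(log_likelihood=-99.5, k=5)  # 209.0
```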
Regularization
- Shrinks coefficient estimates by adding a penalty, scaled by Lambda, to the loss function
- Penalizing coefficients reduces the impact of weak variables and can remove them entirely
- Helps when coefficient estimates are unstable or inflated
- Too much shrinkage drives coefficients toward zero and can cause underfitting
Ridge and Lasso Regression
- Both shrink coefficients toward zero by penalizing their size
- Ridge shrinks coefficients toward, but never exactly to, zero, so all variables remain in the model
- Lasso can shrink coefficients exactly to zero, removing variables and performing variable selection
- When predictors are highly correlated, ridge tends to give them similar estimates, while lasso tends to select one of them
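The contrast between ridge and lasso can be illustrated with the closed-form shrinkage rules that hold in the orthonormal-design special case (a textbook simplification, not the general fitting algorithm):

```python
def ridge_shrink(b_ols, lam):
    """Ridge estimate under an orthonormal design: shrinks toward zero
    but never reaches it exactly."""
    return b_ols / (1 + lam)

def lasso_shrink(b_ols, lam):
    """Lasso (soft-thresholding) under an orthonormal design: can set a
    coefficient exactly to zero, performing variable selection."""
    sign = 1.0 if b_ols >= 0 else -1.0
    return sign * max(abs(b_ols) - lam, 0.0)

# A weak coefficient: ridge keeps it small but nonzero; lasso drops it.
print(ridge_shrink(0.3, 1.0), lasso_shrink(0.3, 0.5))
```

This is why lasso performs variable selection while ridge merely dampens all coefficients.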
GLM
- Key assumptions: the target's distribution belongs to the exponential family, and a link function relates the mean of the target to a linear combination of the predictors
Distribution Selection
- The target's distribution may not be normal
- Poisson is good for counts
- Gamma is better for positive, right-skewed continuous data
Link Function Selection
- The identity link keeps predictions in the target's original units but does not restrict their sign
- The log link restricts predictions to positive numbers and makes predictor effects multiplicative
- The logit link keeps predictions between 0 and 1, appropriate for probabilities
- A linear (identity-link) model is easy to interpret but can produce out-of-range predictions; a log link can better capture relationships across variables
Choosing Between Trees and GLMs
- Trees can perform better when relationships are non-linear or show no clear trend
- If the relationship is linear and the target is continuous, use a GLM
Weights
- Weights assign a level of importance to each observation in the model
- When models perform similarly, consider how observations are weighted
Decision Tree Notes
- In the tree output, each node shows the predicted value and the number (or proportion) of observations reaching it