Questions and Answers
Which measure of central tendency is the midpoint of a data set when arranged in ascending order?
What is the main purpose of a box plot in data visualization?
Which of the following distributions is best suited for modeling the number of events in a fixed interval of time or space?
What is the purpose of the Central Limit Theorem in statistics?
In hypothesis testing, what does a type II error refer to?
Which test is typically used to evaluate whether there is a significant difference between the means of three or more groups?
What is the primary purpose of confidence intervals in statistics?
In correlation analysis, what does a Pearson correlation coefficient of -1 indicate?
Which method is commonly used to detect multicollinearity in regression analysis?
What is the primary objective of bootstrapping in statistics?
What does the standard deviation measure in a data set?
Which statistical test is appropriate for comparing the means of two independent groups?
Which component is NOT typically included in a logistic regression model?
Which sampling technique involves dividing the population into subgroups and then taking a sample from each subgroup?
What is the primary goal of conducting a power analysis in statistics?
What can be inferred from a p-value of 0.07 in hypothesis testing?
Which of the following is a key characteristic of a normal distribution?
What technique is used to handle missing data by estimating what the missing value could be based on available data?
In time series analysis, what does stationarity refer to?
Which of the following techniques is used to identify and manage outliers in a dataset?
What does a higher variance in a dataset indicate?
Which of the following is a key principle of the Central Limit Theorem?
In a two-way ANOVA, which of the following does NOT represent a factor?
Which test is primarily used to determine the association between two categorical variables?
What is a common use of logistic regression?
What is the main purpose of using the bootstrap resampling technique?
In time series analysis, what does 'differencing' aim to achieve?
What is a key advantage of using Principal Component Analysis (PCA)?
Which method is typically applied to test for independence in categorical data?
What is the role of the ACF plot in time series analysis?
What is the purpose of logistic regression?
Which of the following sampling techniques involves previously defining subgroups within the population?
What does the term 'heteroscedasticity' refer to in regression analysis?
In the context of time series analysis, what does the term 'autocorrelation' describe?
Which of the following best describes the Central Limit Theorem?
What is the purpose of the Chi-Square Test for independence?
Which analysis technique is primarily used for dimensionality reduction?
In hypothesis testing, what does the p-value represent?
What is the main purpose of conducting a power analysis?
What does 'bootstrap resampling' help to achieve in statistics?
Which technique effectively assesses the strength and direction of a linear relationship between two continuous variables?
In which scenario would a Poisson distribution be most appropriately applied?
What is a primary limitation of using a t-test for comparing means?
Which model allows for the analysis of relationships between multiple independent variables and a categorical outcome?
What is a fundamental requirement for conducting a valid ANOVA analysis?
What is the main purpose of conducting residual analysis in regression?
In Bayesian statistics, what does Bayes' theorem primarily allow for?
Which technique is specifically used for dimensionality reduction in high-dimensional datasets?
What best describes the purpose of a Monte Carlo simulation in statistics?
Which of the following correctly describes the role of the chi-square test in statistics?
What does the maximum likelihood estimation (MLE) method primarily focus on?
In regression analysis, what does multicollinearity specifically refer to?
What is the primary focus of conducting a time series analysis on a dataset?
What is the main goal of using a logistic regression model?
Which of the following best defines the term 'heteroscedasticity' in regression context?
In Bayesian statistics, what interpretation does the prior distribution have?
What characterizes the use of the Mann-Whitney U test?
What is the primary objective of factor analysis?
Which of the following statements best describes the role of the Autocorrelation Function (ACF) in time series analysis?
What is the primary limitation of the Chi-Square Test for independence?
In a logistic regression model, the coefficients represent changes in which of the following?
Which of the following scenarios is best suited for applying a Poisson regression model?
What does a p-value represent in the context of a hypothesis test?
In time series analysis, what is the main purpose of the Autocorrelation Function (ACF)?
Which condition is necessary for the Central Limit Theorem to hold true?
What does the term 'multicollinearity' indicate in multiple regression analysis?
What is a key characteristic of a Bayesian network?
What does the term 'heteroscedasticity' signify in regression analysis?
Which method is considered non-parametric and is often used for comparing two related samples?
Study Notes
Descriptive Statistics
- Mean: Average of a dataset, calculated by summing all values and dividing by the total number of values.
- Median: Middle value when a dataset is ordered from least to greatest.
- Mode: Most frequent value in a dataset.
- Range: Difference between the highest and lowest values in a dataset.
- Variance: Measures the spread of data points around the mean.
- Standard Deviation: Square root of the variance, indicating the average distance of data points from the mean.
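To make these measures concrete, here is a minimal Python sketch using the standard library's statistics module; the sample values are invented purely for illustration.

```python
# Minimal sketch: computing the summary measures above with the standard library.
# The sample values are made up purely for illustration.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 5, 8]

mean = statistics.mean(data)              # sum of values / number of values
median = statistics.median(data)          # middle value of the sorted data
mode = statistics.mode(data)              # most frequent value
data_range = max(data) - min(data)        # highest minus lowest value
variance = statistics.pvariance(data)     # population variance (mean squared deviation)
std_dev = statistics.pstdev(data)         # population standard deviation (sqrt of variance)

print(mean, median, mode, data_range, variance, std_dev)
```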
Data Visualization
- Histograms: Graphical representation of the distribution of numerical data, using bars to represent the frequency of data within specific intervals.
- Box Plots: Visual representation of the distribution of data, showing quartiles, median, and outliers.
- Bar Charts: Used to compare categorical data, with bars representing the frequency or magnitude of each category.
- Scatter Plots: Visual representation of the relationship between two variables, showing the data points as dots on a two-dimensional plot.
Probability Basics
- Definition: Probability is the likelihood of an event occurring.
- Rules of Probability:
- Probability of an event is between 0 and 1.
- Sum of probabilities of all possible outcomes is 1.
- Probability Distributions: Functions that describe the probabilities of all possible outcomes of a random variable.
Probability Distributions
- Normal Distribution: Bell-shaped distribution, characterized by mean and standard deviation.
- Binomial Distribution: Discrete distribution for the number of successes in a fixed number of independent trials.
- Poisson Distribution: Discrete distribution for the number of events occurring in a fixed interval of time or space.
- Exponential Distribution: Continuous distribution describing the time between events in a Poisson process.
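As a rough illustration of the binomial and Poisson formulas, the following Python sketch evaluates their probability mass functions directly; the parameter values are invented.

```python
# Minimal sketch: evaluating the binomial and Poisson probability mass functions
# directly from their formulas. The parameter values are illustrative only.
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) events in a fixed interval when events occur at average rate lam."""
    return exp(-lam) * lam**k / factorial(k)

print(binomial_pmf(3, n=10, p=0.5))   # probability of exactly 3 successes in 10 trials
print(poisson_pmf(2, lam=4.0))        # probability of exactly 2 events when 4 are expected
```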
Sampling Techniques
- Simple Random Sampling: Every member of the population has an equal chance of being selected.
- Stratified Sampling: Population is divided into subgroups (strata), and a random sample is taken from each stratum.
- Cluster Sampling: Population is divided into clusters, and a random sample of clusters is selected.
Central Limit Theorem
- States that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population distribution.
Hypothesis Testing
- Null Hypothesis: A statement about the population that is assumed true and that the test seeks evidence against.
- Alternative Hypothesis: A statement that we are trying to support.
- Type I Error: Rejecting the null hypothesis when it is true.
- Type II Error: Failing to reject the null hypothesis when it is false.
P-Value and Significance Levels
- P-Value: Probability of obtaining the observed results if the null hypothesis is true.
- Significance Level: Threshold for rejecting the null hypothesis, typically set at 0.05.
Z-Test and T-Test
- Z-Test: Used to test hypotheses about population means when the population standard deviation is known.
- T-Test: Used to test hypotheses about population means when the population standard deviation is unknown.
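A minimal sketch of a two-sample t-test, assuming SciPy is available; the group measurements are invented, and Welch's variant is used so equal variances need not be assumed.

```python
# Minimal sketch: a two-sample t-test with SciPy (assumes scipy is installed;
# the group data are invented). Welch's version drops the equal-variance assumption.
from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]
group_b = [4.5, 4.7, 4.4, 4.9, 4.6, 4.8]

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)  # reject H0 (equal means) if p_value < the chosen significance level
```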
ANOVA (Analysis of Variance)
- One-Way ANOVA: Used to compare means of two or more groups on a single factor.
- Two-Way ANOVA: Used to compare means of two or more groups on two or more factors.
Chi-Square Test
- Test for Independence: Used to determine if there is a relationship between two categorical variables.
- Goodness of Fit: Used to test if observed frequencies match expected frequencies.
Confidence Intervals
- Constructing: Calculated using sample statistics and a confidence level.
- Interpreting: Represents a range of values within which the population parameter is likely to lie.
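A minimal sketch of a large-sample 95% confidence interval for a mean, using the normal approximation (z = 1.96); the sample values are invented.

```python
# Minimal sketch: a 95% confidence interval for a mean using the normal
# approximation (large-sample z value of 1.96). Data values are illustrative.
import math
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 12.4]
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```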
Correlation
- Pearson Correlation Coefficient: Measures the linear association between two continuous variables.
- Spearman Correlation Coefficient: Measures the monotonic association between two variables, regardless of linearity.
Simple Linear Regression
- Model Fitting: Uses a straight line to model the relationship between a dependent variable and an independent variable.
- Interpretation: The slope represents the change in the dependent variable for every unit change in the independent variable.
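A minimal sketch of fitting the least-squares line with the closed-form slope and intercept; the x and y values are invented.

```python
# Minimal sketch: fitting a simple linear regression by ordinary least squares,
# using the closed-form slope and intercept. The x/y values are invented.
import statistics

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.3, 6.2, 8.1, 9.8, 12.2]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

print(slope, intercept)          # change in y per unit x, and predicted y at x = 0
print(intercept + slope * 7)     # prediction for a new x value
```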
Multiple Linear Regression
- Model Fitting: Uses a linear equation to model the relationship between a dependent variable and multiple independent variables.
- Assumptions: Linearity, normality, homoscedasticity.
- Interpretation: Regression coefficients represent the change in the dependent variable for every unit change in the corresponding independent variable, holding other variables constant.
Logistic Regression
- Binary Outcome Modeling: Used to model the probability of a binary outcome (e.g., success or failure) based on one or more predictor variables.
- Interpretation: Odds ratios represent the change in the odds of the outcome for every unit change in the corresponding predictor variable.
Residual Analysis
- Checking Model Assumptions: Examining the residuals (differences between observed values and predicted values) to assess whether the model assumptions hold.
Multicollinearity
- Detection: Using Variance Inflation Factor (VIF) to identify high correlation between independent variables.
- Remedies: Removing one of the highly correlated variables, using principal component analysis, or ridge regression.
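As a rough sketch of VIF-based detection, the following computes each predictor's VIF from the R-squared of regressing it on the other predictors; it assumes NumPy is available, and the data are invented to be nearly collinear.

```python
# Minimal sketch: variance inflation factors computed from the R-squared of
# regressing each predictor on the others (assumes numpy; the data are invented).
import numpy as np

X = np.array([
    [1.0, 2.0, 2.1],
    [2.0, 4.1, 3.9],
    [3.0, 6.2, 6.1],
    [4.0, 7.9, 8.2],
    [5.0, 10.1, 9.8],
])  # columns = predictors; columns 1 and 2 are nearly multiples of column 0

def vif(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

for j in range(X.shape[1]):
    print(f"VIF for predictor {j}: {vif(X, j):.1f}")  # values above ~5-10 suggest multicollinearity
```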
Heteroscedasticity
- Understanding: Unequal variance of residuals across different values of the independent variable.
- Detecting: Visual inspection of residual plots, using Breusch-Pagan test, or White test.
Time Series Analysis
- Decomposition: Separating a time series into trend, seasonal, and random components.
- Trend: Long-term pattern in the data.
- Seasonality: Periodic fluctuations in the data.
Stationarity in Time Series
- Testing: Using statistical tests (e.g., Augmented Dickey-Fuller test) to determine if the time series is stationary.
- Transforming Data: Using differencing or other transformations to make a non-stationary time series stationary.
Autocorrelation and Partial Autocorrelation
- ACF Plots: Show the correlation between a time series and its lagged values.
- PACF Plots: Show the correlation between a time series and its lagged values, controlling for the effects of intervening lags.
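A minimal sketch of the quantity an ACF plot displays: the lag-k autocorrelation computed directly from its definition, on an invented series.

```python
# Minimal sketch: the lag-k autocorrelation of a series, computed directly from
# its definition. The series values are invented for illustration.
import statistics

series = [3.0, 3.4, 3.1, 3.8, 4.0, 3.7, 4.2, 4.5, 4.1, 4.8]

def autocorrelation(x, k):
    """Correlation between x[t] and x[t-k], using the overall mean and variance."""
    n = len(x)
    mean = statistics.mean(x)
    denom = sum((xi - mean) ** 2 for xi in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom

for lag in range(1, 4):
    print(lag, round(autocorrelation(series, lag), 3))   # an ACF plot shows these values per lag
```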
ARIMA Models
- Autoregressive Integrated Moving Average models: Time series forecasting models that use past values of the time series to predict future values.
Forecasting Techniques
- Exponential Smoothing: Forecasting technique that assigns weights to past observations, giving more weight to recent observations.
- ARIMA Forecasting: Forecasting technique based on ARIMA models.
Non-Parametric Tests
- Mann-Whitney U Test: Used to compare two independent groups on a ranked variable.
- Wilcoxon Signed-Rank Test: Used to compare two related groups on a ranked variable.
- Kruskal-Wallis Test: Used to compare two or more independent groups on a ranked variable.
Principal Component Analysis (PCA)
- Dimensionality Reduction: Technique used to reduce the number of variables in a dataset while preserving as much of the variance as possible.
Factor Analysis
- Identifying Latent Variables: Uncovering underlying factors that explain the correlations among observed variables.
Cluster Analysis
- K-Means Clustering: Dividing data points into clusters based on their distance from cluster centroids.
- Hierarchical Clustering: Building a hierarchy of clusters by iteratively merging or splitting clusters.
Discriminant Analysis
- Linear and Quadratic Discriminant Analysis: Used to classify observations into two or more groups based on predictor variables.
Bayesian Statistics
- Bayes’ Theorem: A mathematical formula that updates prior beliefs based on new evidence.
- Bayesian Inference: Using Bayes’ theorem to draw inferences about parameters or hypotheses.
Markov Chains
- Transition Matrices: Represent the probabilities of transitioning between different states in a system.
- Steady-State Probabilities: Probabilities of being in each state after a long time.
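A minimal sketch of steady-state probabilities for a small two-state chain, approximated by repeatedly applying an invented transition matrix.

```python
# Minimal sketch: approximating steady-state probabilities of a two-state Markov
# chain by repeatedly applying the transition matrix. The matrix is illustrative.
transition = [
    [0.9, 0.1],   # P(stay in state 0), P(move 0 -> 1)
    [0.5, 0.5],   # P(move 1 -> 0),     P(stay in state 1)
]

state = [1.0, 0.0]            # start with certainty in state 0
for _ in range(1000):         # after many steps the distribution stops changing
    state = [
        state[0] * transition[0][0] + state[1] * transition[1][0],
        state[0] * transition[0][1] + state[1] * transition[1][1],
    ]

print(state)   # approaches the steady-state probabilities (here 5/6 and 1/6)
```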
Maximum Likelihood Estimation (MLE)
- Estimating Parameters: Finding the parameter values that maximize the likelihood of observing the data.
Monte Carlo Simulations
- Applications in Probability and Statistics: Using random numbers to simulate events and estimate probabilities or statistical properties.
Bootstrap Resampling
- Understanding and Applying: Resampling with replacement from the original sample to estimate the distribution of a statistic.
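A minimal sketch of a bootstrap percentile interval for a mean; the sample values, number of resamples, and seed are invented.

```python
# Minimal sketch: a bootstrap percentile confidence interval for the mean,
# built by resampling with replacement. Sample values and seed are illustrative.
import random
import statistics

random.seed(0)
sample = [14, 9, 11, 12, 15, 10, 13, 12, 11, 16]

boot_means = []
for _ in range(5000):
    resample = random.choices(sample, k=len(sample))   # draw with replacement
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means))]
print(lower, upper)   # approximate 95% interval for the population mean
```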
Survival Analysis
- Kaplan-Meier Curves: Used to estimate the survival probability over time.
- Cox Proportional Hazards Model: Used to model the relationship between covariates and the hazard rate.
Log-Linear Models
- Analysis of Categorical Data and Contingency Tables: Used to model the relationships between categorical variables.
Poisson Regression
- Modeling Count Data: Used to model the expected count of events based on predictor variables.
Generalized Linear Models (GLM)
- Overview and Applications: Framework for modeling various types of data (e.g., count, binary, continuous) using linear models.
Random Effects Models
- Mixed Models and Repeated Measures Analysis: Used to analyze data with both fixed and random effects.
Multivariate Analysis
- MANOVA (Multivariate Analysis of Variance): Extension of ANOVA for multiple dependent variables.
- Multivariate Regression: Regression models with multiple dependent variables.
Experimental Design
- Completely Randomized Designs: Assigning treatments to experimental units randomly.
- Block Designs: Grouping experimental units into blocks based on a blocking factor.
- Factorial Designs: Investigating the effects of multiple factors and their interactions.
Power Analysis
- Calculating and Interpreting Statistical Power: The probability of detecting a true effect.
Meta-Analysis
- Combining Results from Multiple Studies: Used to synthesize evidence from multiple studies on the same topic.
Causal Inference
- Concepts of Causality: Establishing a causal relationship between variables.
- Confounding: A variable that influences both the exposure and the outcome.
- Randomized Controlled Trials (RCTs): Experimental designs that aim to minimize confounding.
Bayesian Networks
- Probabilistic Graphical Models: Used to represent probabilistic relationships between variables.
Propensity Score Matching
- Techniques for Causal Inference in Observational Studies: Using propensity scores to create comparable groups for causal analysis.
Machine Learning Basics
- Overview of Classification, Regression, and Clustering Techniques: Algorithms used for predicting outcomes, identifying relationships, and grouping data.
Outlier Detection
- Identifying and Managing Outliers: Unusual data points that may distort statistical analyses.
Model Validation Techniques
- Cross-validation: Splitting the data into training and validation sets to assess model performance.
- Train-test Split: Similar to cross-validation, but the data is divided just once into a training set and a test set.
- ROC Curves: Visual representation of a model's ability to discriminate between classes.
Data Imputation Methods
- Techniques for Handling Missing Data: Replacing missing values with estimated values.
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Replacing missing values with the values of nearest neighbors.
Descriptive Statistics
- Mean: The average of a dataset, calculated by summing all values and dividing by the number of values.
- Median: The middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
- Mode: The value that appears most frequently in a dataset. There can be multiple modes if multiple values appear with the same highest frequency.
- Range: The difference between the highest and lowest values in a dataset.
- Variance: A measure of how spread out the data is around the mean. It's calculated as the average of the squared differences between each value and the mean.
- Standard Deviation: The square root of the variance. It provides a measure of how much the data points deviate from the mean, in the same units as the original data.
Data Visualization
- Histograms: Graphical representation of the distribution of a single continuous variable, using bars to show the frequency or relative frequency of values within each interval.
- Box Plots: Visual representation of the distribution of a single continuous variable, showing the median, quartiles, and potential outliers.
- Bar Charts: Graphical representation of the frequency or relative frequency of categorical data, using bars for each category.
- Scatter Plots: Graphical representation of the relationship between two continuous variables, using dots to represent each data point.
Probability Basics
- Probability: The likelihood or chance that an event will occur.
- Probability Rules:
- Probability of an event: Ranges between 0 and 1.
- Sum of probabilities: The sum of probabilities of all possible outcomes of an event equals 1.
- Conditional probability: Probability of an event occurring, given that another event has already occurred.
- Probability Distributions: Mathematical functions describing the probability of different outcomes for a random variable.
Probability Distributions
- Normal Distribution: A bell-shaped, symmetrical distribution with mean and standard deviation as parameters, widely used in statistics.
- Binomial Distribution: Describes the probability of a specific number of successes in a fixed number of independent trials, with each trial having only two possible outcomes (success or failure).
- Poisson Distribution: Describes the probability of a specific number of events occurring in a fixed interval of time or space, assuming events occur independently and at a constant rate.
- Exponential Distribution: Describes the probability of the time until an event occurs, assuming events occur independently and at a constant rate.
Sampling Techniques
- Simple Random Sampling: Each member of the population has an equal chance of being selected.
- Stratified Sampling: The population is divided into subgroups (strata) and a random sample is taken from each stratum.
- Cluster Sampling: The population is divided into clusters, and a random sample of clusters is selected. All members of the selected clusters are included in the sample.
Central Limit Theorem
- States that for a large enough sample size, the distribution of sample means will be approximately normal, regardless of the original population distribution.
- This is essential for statistical inference, allowing us to use normal distributions to infer about population parameters based on sample data.
Hypothesis Testing
- Null Hypothesis (H0): A statement about the population parameter that we want to test.
- Alternative Hypothesis (Ha): A statement that contradicts the null hypothesis.
- Type I Error: Rejecting the null hypothesis when it is actually true.
- Type II Error: Failing to reject the null hypothesis when it is false.
P-Value and Significance Levels
- P-value: The probability of observing the data or more extreme data, if the null hypothesis is true.
- Significance Level (α): A predetermined threshold typically set to 0.05, representing the maximum acceptable probability of making a Type I error.
- If the p-value is less than the significance level, the null hypothesis is rejected.
Z-Test and T-Test
- Z-Test: Used to test hypotheses about population means when the population standard deviation is known.
- T-Test: Used to test hypotheses about population means when the population standard deviation is unknown.
ANOVA
- Analysis of Variance (ANOVA): A statistical method used to compare the means of two or more groups.
- One-way ANOVA: Compares means of groups based on a single factor.
- Two-way ANOVA: Compares means of groups based on two or more factors.
Chi-Square Test
- Chi-Square Test: Used to examine relationships between categorical variables, assessing whether observed frequencies differ significantly from expected frequencies.
- Test for Independence: Determines whether there is a relationship between two categorical variables.
- Goodness of Fit: Determines whether a sample distribution fits a hypothesized theoretical distribution.
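A minimal sketch of the chi-square statistic for a 2x2 contingency table, comparing observed counts with the counts expected under independence; the counts are invented.

```python
# Minimal sketch: the chi-square statistic for a 2x2 contingency table, comparing
# observed counts with the counts expected under independence. Counts are invented.
observed = [
    [30, 10],   # e.g. group A: outcome yes / no
    [20, 40],   # e.g. group B: outcome yes / no
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

print(chi_square)   # compare against a chi-square distribution with (r-1)(c-1) df
```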
Confidence Intervals
- Confidence Interval: A range of values that is likely to contain the true population parameter.
- Confidence Level: Represents the probability that the confidence interval will contain the true population parameter.
Correlation
- Correlation: Measures the strength and direction of the linear relationship between two variables.
- Pearson Correlation Coefficient: Measures the linear correlation between two continuous variables.
- Spearman Correlation Coefficient: Measures the monotonic relationship between two variables, regardless of whether it's linear or non-linear.
Simple Linear Regression
- Simple Linear Regression: A statistical method used to model the linear relationship between two variables, with one variable being the predictor and the other being the response.
- Model Fitting: Finding the best-fitting line based on the data points.
- Interpretation: Understanding the relationship between the two variables, including the slope (change in response per unit change in predictor) and the intercept (predicted response when the predictor is zero).
Multiple Linear Regression
- Multiple Linear Regression: Models the linear relationship between a response variable and two or more predictor variables.
- Model Fitting: Finding the best-fitting plane (or hyperplane in higher dimensions) based on the data points.
- Assumptions: Assumptions include linearity, normality, and homoscedasticity.
Logistic Regression
- Logistic Regression: Used to model the probability of a binary outcome (success or failure) based on predictor variables.
- Interpretation: Understanding the relationship between predictor variables and the probability of the outcome.
Residual Analysis
- Residuals: Differences between the observed values and the predicted values from a regression model.
- Checking Model Assumptions: Analyzing residuals to ensure that the assumptions of the model are met (e.g., linearity, normality, homoscedasticity).
- Diagnostics: Using residual plots to identify patterns or deviations from the model assumptions.
Multicollinearity
- Multicollinearity: The presence of high correlations between predictor variables in a multiple regression model.
- Detection: Using variance inflation factors (VIFs) to identify multicollinearity.
- Remedies: Removing correlated variables, combining them, or using techniques like ridge regression or lasso regression.
Heteroscedasticity
- Heteroscedasticity: The variability of the residuals is not equal across the range of predictor variables.
- Detection: Using residual plots to identify non-constant variances.
- Remedies: Transforming the data or using robust standard errors.
Time Series Analysis
- Time Series Data: Data collected over time, often ordered sequentially.
- Decomposition: Breaking down a time series into trend, seasonality, and random components.
- Trend: The general long-term movement of the data.
- Seasonality: Regular patterns in the data that occur at specific times of the year.
Stationarity in Time Series
- Stationary Time Series: A time series with constant mean, variance, and autocorrelation over time.
- Testing: Using statistical tests like the Augmented Dickey-Fuller (ADF) test to check for stationarity.
- Transforming Data: Applying techniques like differencing to make a non-stationary time series stationary.
Autocorrelation and Partial Autocorrelation
- Autocorrelation Function (ACF): Measures the correlation between a time series and its lagged values.
- Partial Autocorrelation Function (PACF): Measures the correlation between a time series and its lagged values, controlling for the effects of intermediate lags.
- ACF and PACF Plots: Used to identify the order of ARIMA models.
ARIMA Models
- ARIMA (Autoregressive Integrated Moving Average): A class of statistical models used to forecast time series data.
- Autoregressive (AR): Model that uses past values of the time series as predictors.
- Integrated (I): Uses differencing to make the time series stationary.
- Moving Average (MA): Uses past forecast errors as predictors.
Forecasting Techniques
- Exponential Smoothing: A statistical technique used to forecast time series data by giving more weight to recent observations.
- ARIMA Forecasting: Uses ARIMA models to generate future forecasts for time series data.
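A minimal sketch of simple exponential smoothing on an invented series, with an illustrative smoothing factor alpha.

```python
# Minimal sketch: simple exponential smoothing, where each smoothed value is a
# weighted average of the latest observation and the previous smoothed value.
# The series and the smoothing factor alpha are illustrative.
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
alpha = 0.3                      # higher alpha gives more weight to recent observations

smoothed = [series[0]]           # initialise with the first observation
for value in series[1:]:
    smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])

print(smoothed[-1])              # one-step-ahead forecast for the next period
```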
Non-Parametric Tests
- Non-Parametric Tests: Statistical tests that don't assume a specific distribution for the data.
- Mann-Whitney U Test: Used to compare two independent groups when the data is not normally distributed.
- Wilcoxon Signed-Rank Test: Used to compare two related groups when the data is not normally distributed.
- Kruskal-Wallis Test: Used to compare the medians of three or more groups when the data is not normally distributed.
Principal Component Analysis (PCA)
- PCA: A technique used to reduce the dimensionality of a dataset by identifying principal components, which are linear combinations of the original variables that capture most of the variance in the data.
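A rough sketch of PCA via an eigendecomposition of the covariance matrix, assuming NumPy is available; the two-variable dataset is invented.

```python
# Minimal sketch: PCA via an eigendecomposition of the covariance matrix
# (assumes numpy; the data are invented). The principal components are the
# eigenvectors, ordered by the variance (eigenvalue) they explain.
import numpy as np

X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

X_centered = X - X.mean(axis=0)                 # centre each variable
cov = np.cov(X_centered, rowvar=False)          # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: symmetric matrices, ascending order

order = np.argsort(eigvals)[::-1]               # sort components by explained variance
components = eigvecs[:, order]
explained = eigvals[order] / eigvals.sum()

scores = X_centered @ components                # data projected onto the components
print(explained)                                # proportion of variance per component
print(scores[:3, 0])                            # first few scores on the first component
```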
Factor Analysis
- Factor Analysis: A statistical method used to identify latent variables (also called factors) that underlie the observed variables.
Cluster Analysis
- Cluster Analysis: A technique for grouping data points into clusters based on their similarities.
- K-Means Clustering: A popular algorithm for clustering data points into k clusters.
- Hierarchical Clustering: An algorithm that builds a hierarchy of clusters, starting with individual data points and merging them into larger clusters.
Discriminant Analysis
- Discriminant Analysis: Used to classify observations into predefined groups based on predictor variables.
- Linear Discriminant Analysis (LDA): Assumes the groups share a common covariance matrix.
- Quadratic Discriminant Analysis (QDA): Allows each group its own covariance matrix.
Bayesian Statistics
- Bayes' Theorem: A mathematical formula that relates prior probabilities, likelihoods, and posterior probabilities.
- Bayesian Inference: Using Bayes' theorem to update beliefs about a hypothesis based on new data.
Markov Chains
- Markov Chain: A mathematical model describing a sequence of events where the probability of each event depends only on the previous event.
- Transition Matrices: Used to represent the probabilities of transitioning between states in a Markov chain.
- Steady-State Probabilities: Probabilities of being in each state after a long period of time.
Maximum Likelihood Estimation (MLE)
- Maximum Likelihood Estimation (MLE): A method used to estimate the parameters of a statistical model by finding the values that maximize the likelihood function.
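A minimal sketch of MLE for a Poisson rate: a coarse grid search over the log-likelihood recovers the closed-form estimate (the sample mean); the observed counts are invented.

```python
# Minimal sketch: maximum likelihood estimation of a Poisson rate. A coarse grid
# search over the log-likelihood agrees with the closed-form MLE (the sample mean).
# The observed counts are invented for illustration.
from math import factorial, log

counts = [2, 3, 1, 4, 2, 2, 5, 3, 2, 3]

def log_likelihood(lam, data):
    return sum(k * log(lam) - lam - log(factorial(k)) for k in data)

grid = [i / 100 for i in range(1, 1001)]            # candidate rates 0.01 .. 10.00
best = max(grid, key=lambda lam: log_likelihood(lam, counts))

print(best)                          # close to 2.7 ...
print(sum(counts) / len(counts))     # ... the closed-form MLE: the sample mean
```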
Monte Carlo Simulations
- Monte Carlo Simulations: A technique that uses random sampling to estimate the probability distribution of an outcome.
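A minimal sketch of a Monte Carlo estimate: simulating dice rolls to approximate a probability whose exact value is known; the trial count and seed are illustrative.

```python
# Minimal sketch: a Monte Carlo estimate of a probability - here, the chance that
# the sum of two dice exceeds 9. The number of simulated trials is illustrative.
import random

random.seed(1)
trials = 100_000
hits = sum(1 for _ in range(trials)
           if random.randint(1, 6) + random.randint(1, 6) > 9)

print(hits / trials)   # estimate converges to the exact value of 6/36, about 0.167
```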
Bootstrap Resampling
- Bootstrap Resampling: A statistical method for estimating the sampling distribution of a statistic by repeatedly resampling with replacement from the original data.
Survival Analysis
- Survival Analysis: A statistical methodology used to analyze data where the outcome of interest is time until an event occurs.
- Kaplan-Meier Curves: Used to estimate the survival function over time.
- Cox Proportional Hazards Model: Regression model used to estimate the effect of covariates on the hazard rate.
Log-Linear Models
- Log-Linear Models: Statistical models used to analyze data from contingency tables, modeling the relationships between categorical variables.
Poisson Regression
- Poisson Regression: Used to model count data, assuming a Poisson distribution for the response variable.
Generalized Linear Models (GLM)
- Generalized Linear Models (GLM): A framework for modeling a variety of statistical models, including linear regression, logistic regression, and Poisson regression.
Random Effects Models
- Random Effects Models: Statistical models where the effects of interest are random variables.
- Mixed Models: Models that include both fixed effects and random effects.
- Repeated Measures Analysis: A technique used to analyze data collected on the same individuals over time.
Multivariate Analysis
- Multivariate Analysis: Statistical methods used to analyze data with multiple dependent variables.
- MANOVA (Multivariate Analysis of Variance): Used to compare the means of two or more groups on multiple dependent variables.
- Multivariate Regression: Used to model the relationship between multiple dependent variables and multiple predictor variables.
Experimental Design
- Experimental Design: The process of planning and conducting experiments to collect data for statistical analysis.
- Completely Randomized Designs: A design where units are randomly assigned to treatment groups.
- Block Designs: A design where units are grouped into blocks, and units within each block are then randomly assigned to treatment groups.
- Factorial Designs: A design where multiple factors are simultaneously varied.
Power Analysis
- Power Analysis: A method for determining the sample size needed to detect a statistically significant effect.
- Statistical Power: The probability of correctly rejecting the null hypothesis when it is false.
Meta-Analysis
- Meta-Analysis: A statistical method used to combine results from multiple studies.
Causal Inference
- Causality: Determining the causal relationships between variables.
- Confounding: A situation where the effect of one variable on another is confounded by the influence of a third variable.
- Randomized Controlled Trials (RCTs): The gold standard for establishing causality, where participants are randomly assigned to treatment groups.
Bayesian Networks
- Bayesian Networks: Probabilistic graphical models that represent relationships between variables using directed acyclic graphs.
Propensity Score Matching
- Propensity Score Matching: A technique used to create comparable treatment and control groups in observational studies by matching individuals based on their predicted probability of receiving the treatment.
Machine Learning Basics
- Machine Learning: A field of computer science focused on developing algorithms that can learn from data.
- Classification: Supervised learning problems where the goal is to predict a categorical outcome.
- Regression: Supervised learning problems where the goal is to predict a continuous outcome.
- Clustering: Unsupervised learning problems where the goal is to group data points into clusters.
Outlier Detection
- Outliers: Data points that are significantly different from other data points.
- Identifying Outliers: Using statistical methods and visual inspection of data.
- Managing Outliers: Removing, transforming, or replacing outliers depending on the cause and context.
Model Validation Techniques
- Model Validation: Evaluating the performance of a model on unseen data.
- Cross-Validation: A technique for splitting the data into multiple folds and using each fold as a test set while training on the remaining folds.
- Train-Test Split: Dividing the data into training and testing sets.
- ROC Curves: Used to evaluate the performance of classification models by plotting the true positive rate against the false positive rate.
Data Imputation Methods
- Data Imputation: A technique for filling in missing data values.
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Using the values of k nearest neighbors to impute missing values.
Descriptive Statistics
- Mean: The average of a dataset, calculated by summing all values and dividing by the number of values.
- Median: The middle value in a sorted dataset, dividing the dataset into two halves with equal numbers of values.
- Mode: The most frequent value in a dataset.
- Range: The difference between the highest and lowest values in a dataset.
- Variance: A measure of how spread out the data is from the mean, calculated as the average squared deviation from the mean.
- Standard Deviation: The square root of the variance, providing a more interpretable measure of data spread.
Data Visualization
- Histograms: Visualize the distribution of a single variable by showing the frequency of each value or range of values.
- Box Plots: Summarize the distribution of a single variable using quartiles, median, and outliers.
- Bar Charts: Display the frequency or value of categorical data, often grouped by categories.
- Scatter Plots: Show the relationship between two variables, plotting each data point as a dot.
Probability Basics
- Probability: The likelihood of an event occurring, expressed as a number between 0 and 1.
- Rules of Probability:
- Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
- Multiplication Rule: P(A and B) = P(A) * P(B|A)
- Probability Distributions: Functions that describe the probability of different outcomes for a random variable.
Probability Distributions
- Normal Distribution: A symmetrical bell-shaped distribution, commonly used in statistics due to its frequent occurrence in nature and applications.
- Binomial Distribution: Describes the probability of a certain number of successes in a fixed number of independent trials, with each trial having two possible outcomes.
- Poisson Distribution: Models the probability of a certain number of events occurring in a fixed interval of time or space, assuming events occur independently and at a constant rate.
- Exponential Distribution: Represents the waiting time until an event occurs, with the probability density decreasing exponentially over time.
Sampling Techniques
- Simple Random Sampling: Each member of the population has an equal chance of being selected.
- Stratified Sampling: The population is divided into subgroups (strata) and a random sample is taken from each stratum.
- Cluster Sampling: The population is divided into groups (clusters), and a random sample of clusters is selected.
Central Limit Theorem
- Understanding: States that the distribution of sample means will approach a normal distribution as the sample size increases, regardless of the underlying distribution of the population.
- Applications: Enables statistical inference about population parameters based on sample data, even with unknown population distributions.
Hypothesis Testing
- Null Hypothesis: A statement about the population parameter that is assumed to be true.
- Alternative Hypothesis: A statement that contradicts the null hypothesis.
- Type I Error: Rejecting the null hypothesis when it is actually true.
- Type II Error: Failing to reject the null hypothesis when it is actually false.
P-Value and Significance Levels
- P-Value: The probability of obtaining the observed results if the null hypothesis were true.
- Significance Level: A threshold value used to determine whether to reject the null hypothesis. If the p-value is less than the significance level, then the null hypothesis is rejected.
Z-Test and T-Test
- Differences: Z-tests are used when the population standard deviation is known, while t-tests are used when it is unknown.
- Applications: Both tests are used to compare population means or proportions; Z-tests are typically reserved for large samples (or a known standard deviation), while t-tests account for the extra uncertainty of estimating the standard deviation from a smaller sample.
ANOVA (Analysis of Variance)
- One-Way ANOVA: Tests for differences in means between multiple groups, considering only one factor.
- Two-Way ANOVA: Tests for differences in means between multiple groups, considering two or more factors.
Chi-Square Test
- Test for Independence: Determines whether there is an association between two categorical variables.
- Goodness of Fit: Tests whether observed data matches an expected distribution.
Confidence Intervals
- Constructing: Used to estimate the range of values where the true population parameter is likely to lie.
- Interpreting: A 95% confidence interval means that, if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true population parameter.
Correlation
- Pearson Correlation Coefficient: Measures the linear relationship between two continuous variables, ranging from -1 to 1.
- Spearman Correlation Coefficient: Measures the monotonic relationship between two variables, regardless of their linearity.
Simple Linear Regression
- Model Fitting: Uses a straight line to describe the relationship between a dependent variable and an independent variable.
- Interpretation: Uses the slope and intercept of the regression line to understand the relationship between the variables.
Multiple Linear Regression
- Model Fitting: Extends simple linear regression to include multiple independent variables.
- Assumptions: Assumes linearity, normality, and homoscedasticity of errors.
- Interpretation: Uses regression coefficients to understand the individual effect of each independent variable on the dependent variable.
Logistic Regression
- Binary Outcome Modelling: Predicts the probability of a binary outcome (e.g., yes/no, success/failure) based on one or more independent variables.
- Interpretation: Uses the odds ratio to understand the effect of each independent variable on the odds of the outcome occurring.
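A minimal sketch of fitting a logistic regression with scikit-learn, assuming it is installed; the study-hours and pass/fail data are invented.

```python
# Minimal sketch: fitting a logistic regression for a binary outcome with
# scikit-learn (assumes scikit-learn is installed; the hours/pass data are invented).
import math
from sklearn.linear_model import LogisticRegression

hours = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(hours, passed)

coef = model.coef_[0][0]
print(math.exp(coef))                         # odds ratio per extra hour of study
print(model.predict_proba([[3.0]])[0][1])     # predicted probability of passing at 3 hours
```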
Residual Analysis
- Checking Model Assumptions: Examines the residuals from a model to assess linearity, homoscedasticity, and independence of errors.
- Diagnostics: Visual examination of residuals can reveal problems with model fit and suggest potential remedies.
Multicollinearity
- Detection: Occurs when two or more independent variables are highly correlated, making it difficult to isolate their individual effects.
- Remedies: Reducing multicollinearity can involve removing redundant variables or using techniques like principal component analysis.
Heteroscedasticity
- Understanding: Occurs when the variability of the errors is not constant across all values of the independent variable.
- Detecting: Can be identified by examining plots of residuals against predicted values.
Time Series Analysis
- Decomposition: Separates a time series into its underlying components: trend, seasonality, and noise.
- Trend: The long-term pattern in the data, often represented by a line or curve.
- Seasonality: A recurring pattern in the data that occurs at regular intervals, often related to time of year or day of the week.
Stationarity in Time Series
- Testing: Determining whether the statistical properties of the data remain constant over time.
- Transforming Data: Non-stationary data can be transformed to achieve stationarity, often by using differencing.
Autocorrelation and Partial Autocorrelation
- ACF (Autocorrelation Function) Plots: Show the correlation between a time series and its past values at different lags.
- PACF (Partial Autocorrelation Function) Plots: Show the correlation between a time series and its past values, controlling for the effect of intervening values.
ARIMA Models
- Autoregressive Integrated Moving Average (ARIMA) Models: Time series forecasting models that use past values of the time series to predict future values.
Forecasting Techniques
- Exponential Smoothing: A method that uses a weighted average of past values to predict future values, giving more weight to recent values.
- ARIMA Forecasting: Uses an ARIMA model to forecast future values based on historical data.
Non-Parametric Tests
- Mann-Whitney U Test: Compares the medians of two independent groups when the data is not normally distributed.
- Wilcoxon Signed-Rank Test: Compares the medians of two paired groups when the data is not normally distributed.
- Kruskal-Wallis Test: Extends the Mann-Whitney U test to more than two groups.
Principal Component Analysis (PCA)
- Dimensionality Reduction Technique: Transforms a set of correlated variables into a smaller set of uncorrelated variables (principal components), while preserving as much variance as possible.
Factor Analysis
- Identifying Latent Variables: A statistical method used to identify underlying factors that explain observed correlations among a set of variables.
Cluster Analysis
- K-Means Clustering: An algorithm that partitions data points into k predefined clusters, minimizing the sum of squared distances between points and their assigned cluster centers.
- Hierarchical Clustering: A method that creates a nested hierarchy of clusters, starting with individual data points and merging them into larger clusters based on similarity.
Discriminant Analysis
- Linear and Quadratic Discriminant Analysis: Statistical techniques used to classify data points into two or more groups, based on a set of predictor variables.
Bayesian Statistics
- Bayes’ Theorem: A mathematical formula that describes how to update the probability of an event based on new evidence.
- Bayesian Inference: A statistical approach that uses Bayes’ theorem to estimate the probability of unknown parameters based on observed data.
Markov Chains
- Transition Matrices: Matrices that describe the probabilities of transitions between different states in a Markov chain.
- Steady-State Probabilities: The long-term probabilities of being in each state of the Markov chain, after a sufficient number of transitions.
Maximum Likelihood Estimation (MLE)
- Estimating Parameters: A statistical method that finds the values of model parameters that are most likely to have generated the observed data.
- Interpreting Results: The estimated parameters provide information about the underlying process that generated the data.
Monte Carlo Simulations
- Applications in Probability and Statistics: Used to estimate probabilities, simulate random events, and explore the properties of statistical models.
Bootstrap Resampling
- Understanding: A technique that involves repeatedly sampling with replacement from the original dataset to create multiple resampled datasets.
- Applying Bootstrap Techniques: Used to estimate confidence intervals, standard errors, and other statistical quantities.
Survival Analysis
- Kaplan-Meier Curves: Visualize the survival probability over time for a group of individuals.
- Cox Proportional Hazards Model: A statistical model used to analyze time-to-event data, taking into account the relative hazards for different groups of individuals.
Log-Linear Models
- Analysis of Categorical Data: Used to analyze the relationship between multiple categorical variables.
- Contingency Tables: Tables that summarize observed frequencies for combinations of categorical variables.
Poisson Regression
- Modeling Count Data: A type of generalized linear model used to predict the expected count of events based on predictor variables.
Generalized Linear Models (GLM)
- Overview: A flexible class of statistical models that can be used to analyze data with different types of distributions.
- Applications: Used for modeling binary, count, and continuous data, among others.
Random Effects Models
- Mixed Models and Repeated Measures Analysis: Statistical models that account for both random and fixed effects, often used for analyzing data with multiple measurements per individual.
Multivariate Analysis
- MANOVA (Multivariate Analysis of Variance): Extends ANOVA to multiple dependent variables, testing for differences in means between groups on multiple response variables.
- Multivariate Regression: Used to predict multiple dependent variables simultaneously, based on one or more predictor variables.
Experimental Design
- Completely Randomized Designs: A basic experimental design where treatments are randomly assigned to experimental units.
- Block Designs: Designs that group experimental units into blocks based on similarity, reducing variability within blocks.
- Factorial Designs: Designs that investigate the effects of multiple factors and their interactions.
Power Analysis
- Calculating and Interpreting Statistical Power: Determining the probability of detecting a significant effect if it truly exists.
Meta-Analysis
- Combining Results from Multiple Studies: A systematic method for combining the results of multiple independent studies to estimate an overall effect size.
Causal Inference
- Concepts of Causality: Understanding the relationship between cause and effect.
- Confounding: Variables that are associated with both the exposure and the outcome, potentially obscuring the causal relationship.
- Randomized Controlled Trials: Gold-standard experimental design for establishing causality, where participants are randomly assigned to treatment groups.
Bayesian Networks
- Probabilistic Graphical Models: Representation of probabilistic relationships among variables in the form of a directed acyclic graph.
Propensity Score Matching
- Techniques for Causal Inference in Observational Studies: Matching individuals in a treatment group with individuals in a control group with similar characteristics, controlling for confounding variables.
Machine Learning Basics
- Overview of Classification, Regression, and Clustering Techniques: Machine learning algorithms used for predicting categorical, continuous, and group memberships, respectively.
Outlier Detection
- Identifying and Managing Outliers: Values in a dataset that significantly differ from other values, potentially affecting analysis results.
Model Validation Techniques
- Cross-Validation: A technique for assessing the performance of a model on unseen data, by splitting the data into training and validation sets.
- Train-Test Split: Another method of model validation, where a portion of the data is set aside for testing after the model is trained on the remaining data.
- ROC Curves: Visualize the performance of a classification model, showing the trade-off between true positive rate and false positive rate.
Data Imputation Methods
- Techniques for Handling Missing Data: Replacing missing values with plausible estimates based on the available data.
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Replacing missing values using the values from nearest neighbors in the data.
Descriptive Statistics
- Mean: Average of a dataset, calculated by summing all values and dividing by the number of values.
- Median: Middle value in a sorted dataset.
- Mode: Most frequent value in a dataset.
- Range: Difference between the highest and lowest value in a dataset.
- Variance: Measures the spread of data points around the mean.
- Standard Deviation: Square root of the variance, representing the average distance of data points from the mean.
Data Visualization
- Histograms: Visual representation of the distribution of numerical data, using bars to show the frequency of values within intervals.
- Box Plots: Visual representation of the distribution of numerical data, showing the median, quartiles, and outliers.
- Bar Charts: Visual comparison of categorical data, using bars to represent the frequency or value of each category.
- Scatter Plots: Visual representation of the relationship between two numerical variables, showing the correlation between them.
Probability Basics
- Probability: The chance of an event occurring, measured on a scale from 0 to 1.
- Rules of Probability:
- Addition Rule: Probability of either of two events happening is the sum of their individual probabilities minus the probability of both happening.
- Multiplication Rule: Probability of two events happening is the product of their individual probabilities (given they're independent).
- Probability Distributions: Mathematical function describing the probabilities of different outcomes for a random variable.
Probability Distributions
- Normal Distribution: Bell-shaped, symmetrical distribution characterized by its mean and standard deviation.
- Binomial Distribution: Discrete distribution representing the probability of a given number of successes in a fixed number of independent trials.
- Poisson Distribution: Discrete distribution representing the probability of a specific number of events occurring in a fixed interval of time or space.
- Exponential Distribution: Continuous distribution representing the time between events in a Poisson process.
Sampling Techniques
- Simple Random Sampling: Every individual in the population has an equal chance of being selected.
- Stratified Sampling: Population is divided into strata based on characteristics, then samples are randomly selected from each stratum.
- Cluster Sampling: Population is divided into clusters, and a random sample of clusters is selected, with all individuals within selected clusters included in the sample.
Central Limit Theorem
- Key Idea: The distribution of sample means will approximate a normal distribution, regardless of the population distribution, as the sample size increases.
- Applications: Allows us to use normal distribution-based tests for inferences about population parameters even if the population is not normally distributed.
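A rough simulation of this idea, assuming NumPy is available: sample means drawn from a skewed (exponential) population still pile up in an approximately normal shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples of size 50 from a clearly non-normal (exponential) population
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]

# The distribution of sample means is approximately normal:
# its centre is near the population mean (2.0) and its spread shrinks as sample size grows.
print(np.mean(sample_means), np.std(sample_means))
```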
Hypothesis Testing
- Null Hypothesis: Statement about the population parameter that is assumed to be true.
- Alternative Hypothesis: Statement contradicting the null hypothesis; the claim we seek evidence to support.
- Type I Error: Rejecting the null hypothesis when it is actually true.
- Type II Error: Failing to reject the null hypothesis when it is false.
- p-Value: Probability of obtaining results as extreme as observed, assuming the null hypothesis is true.
Z-Test & T-Test
- Z-Test: Used for testing hypotheses about population parameters when the population standard deviation is known.
- T-Test: Used for testing hypotheses about population parameters when the population standard deviation is unknown and estimated from the sample.
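A minimal two-sample t-test sketch on made-up measurements, assuming SciPy is available:

```python
from scipy import stats

# Hypothetical measurements from two independent groups
group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

# Two-sample t-test (population standard deviations unknown, estimated from the samples)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value suggests a difference in means
```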
ANOVA (Analysis of Variance)
- One-Way ANOVA: Used to compare the means of two or more groups defined by a single factor (independent variable).
- Two-Way ANOVA: Used to compare group means when there are two factors, allowing their interaction to be examined.
Chi-Square Test
- Test for Independence: Determines if there's a relationship between two categorical variables.
- Goodness of Fit Test: Checks if the observed frequencies of data match the expected frequencies.
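A small sketch of the test for independence on a hypothetical contingency table, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table: rows = group, columns = outcome
observed = np.array([[30, 10],
                     [20, 40]])

# Test whether the two categorical variables are independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)
```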
Confidence Intervals
- Confidence Intervals: Range of values likely to contain the true population parameter with a certain level of confidence.
- Interpretation: Based on the sample data, we are confident that the true population parameter lies within the interval.
Correlation
- Pearson Correlation Coefficient: Measures the linear relationship between two continuous variables.
- Spearman Correlation Coefficient: Measures the monotonic relationship between two variables (not necessarily linear).
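A brief sketch computing both coefficients on made-up data, assuming SciPy is available:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r, p_linear = stats.pearsonr(x, y)    # linear relationship
rho, p_mono = stats.spearmanr(x, y)   # monotonic relationship (rank-based)
print(r, rho)                         # values near +1 indicate a strong positive relationship
```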
Simple Linear Regression
- Model Fitting: Finding the best linear relationship between an independent variable and a dependent variable.
- Interpretation: Estimating the effect of the independent variable on the dependent variable.
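A minimal fit-and-interpret sketch on hypothetical data, assuming SciPy is available:

```python
from scipy import stats

x = [1, 2, 3, 4, 5]             # independent variable
y = [2.0, 4.1, 6.2, 7.9, 10.1]  # dependent variable

result = stats.linregress(x, y)
# slope: estimated change in y for a one-unit change in x
# intercept: predicted y when x = 0
print(result.slope, result.intercept, result.rvalue ** 2)
```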
Multiple Linear Regression
- Model Fitting: Finding the best linear relationship between multiple independent variables and a dependent variable.
- Assumptions: Linearity, normality, homogeneity of variance, and independence of errors.
- Interpretation: Estimating the independent effects of each independent variable on the dependent variable.
Logistic Regression
- Binary Outcome Modeling: Predicting the probability of a binary (yes/no) outcome based on independent variables.
- Interpretation: Determining the relationship between independent variables and the likelihood of the binary outcome (e.g., odds ratio).
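A small sketch using scikit-learn (assumed available) on made-up data, including the odds-ratio interpretation of a coefficient:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one predictor and a binary outcome
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# exp(coefficient) is the odds ratio for a one-unit increase in the predictor
odds_ratio = np.exp(model.coef_[0][0])
print(odds_ratio, model.predict_proba([[3.5]]))  # predicted probability near the boundary
```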
Residual Analysis
- Checking Model Assumptions: Examining the residuals (the difference between predicted and actual values) to see if the model assumptions are met.
- Diagnostics: Identifying potential issues with the model, such as non-linearity, heteroscedasticity, and outliers.
Multicollinearity
- Detection: High correlation among independent variables.
- Remedies: Removing redundant variables, combining variables, or using techniques like Principal Component Analysis.
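One common detection tool is the variance inflation factor (VIF); a rough sketch, assuming statsmodels and NumPy are available and using deliberately correlated synthetic predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 * 0.9 + rng.normal(scale=0.1, size=100)  # deliberately correlated with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # VIF values well above ~5-10 flag problematic multicollinearity
```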
Heteroscedasticity
- Understanding: Unequal variance of errors across the range of independent variables.
- Detection: Observing patterns in residuals, using tests like the Breusch-Pagan test.
Time Series Analysis
- Decomposition: Separating the time series into trend, seasonality, and random fluctuations.
- Trend: Long-term pattern in the data.
- Seasonality: Regular patterns in the data that repeat over time.
Stationarity in Time Series
- Testing: Checking if the mean, variance, and autocorrelation of the time series remain constant over time.
- Transforming Data: Techniques like differencing to make the time series stationary.
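A short sketch of testing for stationarity with the Augmented Dickey-Fuller test and then differencing, assuming statsmodels and NumPy are available and using a synthetic random walk:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(size=200))  # a random walk: non-stationary

adf_stat, p_value = adfuller(series)[:2]
print(p_value)  # large p-value -> fail to reject non-stationarity

differenced = np.diff(series)    # first differencing
print(adfuller(differenced)[1])  # p-value is now small -> consistent with stationarity
```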
Autocorrelation and Partial Autocorrelation
- ACF (Autocorrelation Function): Measures the correlation between values at different lags in the time series.
- PACF (Partial Autocorrelation Function): Measures the correlation between values at different lags, controlling for the effect of intervening lags.
ARIMA Models
- Autoregressive Integrated Moving Average: Statistical model used for forecasting time series data.
- Components: Autoregressive (AR), Integrated (I), and Moving Average (MA).
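A minimal fit-and-forecast sketch on a synthetic series, assuming statsmodels is available; the order (1, 1, 1) is an arbitrary illustrative choice:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
series = np.cumsum(rng.normal(size=120))  # hypothetical non-stationary series

# ARIMA(1, 1, 1): one AR term, first differencing, one MA term
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))  # forecast the next five values
```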
Forecasting Techniques
- Exponential Smoothing: Using weighted averages of past values to forecast future values.
- ARIMA Forecasting: Forecasting using ARIMA models to capture patterns in the time series data.
Non-Parametric Tests
- Mann-Whitney U Test: Comparing the distributions of two independent groups.
- Wilcoxon Signed-Rank Test: Comparing the distributions of two related groups.
- Kruskal-Wallis Test: Comparing the distributions of three or more independent groups.
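A compact sketch of all three tests on made-up samples, assuming SciPy is available:

```python
from scipy import stats

group1 = [3, 5, 4, 6, 7, 5]
group2 = [8, 9, 7, 10, 9, 8]
group3 = [5, 6, 7, 6, 5, 7]
before = [10, 12, 11, 14, 13]
after = [12, 14, 12, 15, 15]

print(stats.mannwhitneyu(group1, group2))     # two independent groups
print(stats.wilcoxon(before, after))          # two related (paired) groups
print(stats.kruskal(group1, group2, group3))  # three or more independent groups
```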
Principal Component Analysis (PCA)
- Dimensionality Reduction: Reducing the number of variables while retaining as much information as possible.
- Technique: Transforming original variables into a smaller set of uncorrelated variables (principal components).
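A brief sketch, assuming scikit-learn and NumPy are available, where four synthetic correlated features are reduced to two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
base = rng.normal(size=(100, 2))
# Four correlated features built from two underlying directions plus noise
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=100)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # most of the variance is captured by two components
X_reduced = pca.transform(X)          # 100 x 2 representation of the original 100 x 4 data
```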
Factor Analysis
- Identifying Latent Variables: Exploring underlying constructs (latent variables) that explain the relationships between observed variables.
Cluster Analysis
- K-Means Clustering: Partitioning data points into k clusters based on minimizing the sum of squared distances between points and their cluster centers.
- Hierarchical Clustering: Creating a hierarchy of clusters, starting with individual data points and merging them based on similarity.
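A small sketch of both approaches on two synthetic clouds of points, assuming scikit-learn and NumPy are available:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(5)
# Two hypothetical groups of points centred at (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(kmeans_labels[:5], hier_labels[:5])
```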
Discriminant Analysis
- Linear Discriminant Analysis (LDA): Classifying data points into groups based on a linear combination of variables.
- Quadratic Discriminant Analysis (QDA): Classifying data points into groups based on a quadratic combination of variables.
Bayesian Statistics
- Bayes’ Theorem: Calculating the probability of an event based on prior knowledge and new evidence.
- Bayesian Inference: Updating beliefs about parameters based on observed data.
Markov Chains
- Transition Matrices: Represent the probabilities of moving from one state to another in a system.
- Steady-State Probabilities: Long-run probabilities of being in each state.
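A minimal sketch of finding steady-state probabilities by repeatedly applying a (made-up) transition matrix, assuming NumPy is available:

```python
import numpy as np

# Transition matrix: row i gives the probabilities of moving from state i to each state
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

state = np.array([1.0, 0.0])  # start in state 0
for _ in range(1000):         # repeated transitions converge to the steady state
    state = state @ P

print(state)  # long-run probabilities of being in each state (~[0.833, 0.167])
```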
Maximum Likelihood Estimation (MLE)
- Estimating Parameters: Finding the values of parameters that maximize the likelihood of observing the data.
- Interpreting Results: Determining the most likely values of parameters based on the data.
Monte Carlo Simulations
- Applications: Estimating probabilities, evaluating statistical models, and solving complex problems.
- Technique: Generating random samples to approximate the distribution of a variable.
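A classic toy illustration, assuming NumPy is available: estimating pi by sampling random points in the unit square.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

# Fraction of random points in the unit square that fall inside the quarter circle
x, y = rng.random(n), rng.random(n)
inside = (x**2 + y**2) <= 1.0
print(4 * inside.mean())  # approaches 3.14159... as n grows
```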
Bootstrap Resampling
- Understanding: Sampling with replacement from the original data to create multiple datasets.
- Applying Techniques: Estimating confidence intervals, testing hypotheses, and validating models.
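A percentile-bootstrap sketch for a confidence interval around the mean, assuming NumPy is available and using made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 6.1, 5.0])

# Resample with replacement many times and record the statistic of interest (the mean)
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(10_000)]

# Percentile bootstrap 95% confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)
```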
Survival Analysis
- Kaplan-Meier Curves: Estimating the survival probability of a population over time.
- Cox Proportional Hazards Model: Modeling the hazard rate (risk of an event) as a function of covariates.
Log-Linear Models
- Analysis of Categorical Data: Modeling relationships between categorical variables.
- Contingency Tables: Summarizing frequencies of categorical variables.
Poisson Regression
- Modeling Count Data: Predicts the count of events based on independent variables.
Generalized Linear Models (GLM)
- Overview: Extending linear regression to models with non-normal responses.
- Applications: Predicting binary outcomes, count data, and other types of responses.
Random Effects Models
- Mixed Models: Models with both fixed and random effects.
- Repeated Measures Analysis: Analyzing data where the same individuals are measured multiple times.
Multivariate Analysis
- MANOVA (Multivariate Analysis of Variance): Comparing means of two or more groups on multiple dependent variables.
- Multivariate Regression: Predicting multiple dependent variables using multiple independent variables.
Experimental Design
- Completely Randomized Designs: Randomly assigning subjects to treatment groups.
- Block Designs: Creating blocks of subjects with similar characteristics before random assignment to treatment groups.
- Factorial Designs: Manipulating multiple independent variables simultaneously to study their interactions.
Power Analysis
- Calculating Statistical Power: Determining the probability of detecting a true effect.
- Interpreting Power: Ensuring that the study has enough power to detect statistically significant effects.
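A small sketch of a power calculation for a two-sample t-test, assuming statsmodels is available; the effect size, alpha, and power values are illustrative choices:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# with 80% power at a 5% significance level, for a two-sample t-test
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_per_group)  # roughly 64 per group
```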
Meta-Analysis
- Combining Results: Combining results from multiple studies to estimate the overall effect of a treatment or intervention.
Causal Inference
- Concepts of Causality: Identifying cause-and-effect relationships.
- Confounding: Extraneous variables that can influence both the independent and dependent variables.
- Randomized Controlled Trials: Experiments that randomly assign participants to treatment and control groups to minimize confounding.
Bayesian Networks
- Probabilistic Graphical Models: Representing relationships between variables using a directed acyclic graph.
Propensity Score Matching
- Techniques for Causal Inference: Matching individuals in observational studies based on their propensity scores (the probability of being treated).
Machine Learning Basics
- Classification: Predicting categorical outcomes (e.g., spam/not spam).
- Regression: Predicting continuous outcomes (e.g., house prices).
- Clustering: Grouping data points based on similarity.
Outlier Detection
- Identifying Outliers: Data points that deviate significantly from the rest of the data.
- Managing Outliers: Removing, transforming, or adjusting outliers depending on the context.
Model Validation Techniques
- Cross-Validation: Dividing the data into multiple folds and using different folds for training and testing the model.
- Train-Test Split: Dividing the data into training and test sets to evaluate the model's performance on unseen data.
- ROC Curves: Visualizing the trade-off between true positive rate and false positive rate for different classification thresholds.
Data Imputation Methods
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Replacing missing values using the values of k nearest neighbors.
Descriptive Statistics
- Mean: Average of a dataset calculated by summing all values and dividing by the number of values.
- Median: Middle value in a sorted dataset.
- Mode: Most frequently occurring value in a dataset.
- Range: Difference between the highest and lowest value in a dataset.
- Variance: Measures how spread out the data is from the mean.
- Standard Deviation: Square root of variance, provides a measure of the typical deviation of data from the mean.
Data Visualization
- Histograms: Represent the frequency distribution of continuous data.
- Box Plots: Display the five-number summary (minimum, first quartile, median, third quartile, maximum).
- Bar Charts: Used for visualizing categorical data, comparing the sizes of different groups.
- Scatter Plots: Show the relationship between two variables.
Probability Basics
- Probability: Measure of how likely an event is to occur.
- Rules of Probability:
- Addition Rule: For mutually exclusive events, the probability of either event occurring is the sum of their individual probabilities.
- Multiplication Rule: For independent events, the probability of both events occurring is the product of their individual probabilities.
- Probability Distributions: Mathematical functions that describe the probability of different outcomes for a random variable.
Probability Distributions
- Normal Distribution: Bell-shaped, symmetrical distribution with mean, median, and mode all equal.
- Binomial Distribution: Represents the probability of a certain number of successes in a fixed number of trials, with each trial having two possible outcomes.
- Poisson Distribution: Used for modeling the number of events occurring in a fixed time interval or space.
- Exponential Distribution: Describes the time between events in a Poisson process.
Sampling Techniques
- Simple Random Sampling: Every member of the population has an equal chance of being selected.
- Stratified Sampling: Population is divided into subgroups (strata), and random samples are drawn from each stratum.
- Cluster Sampling: Population is divided into clusters, and a random sample of clusters is selected.
Central Limit Theorem
- States that the distribution of sample means will be approximately normal, regardless of the population distribution, as long as the sample size is large enough.
- Important for making inferences about a population based on a sample.
Hypothesis Testing
- Null Hypothesis (H0): Statement about the population parameter that is assumed true; we test whether the data provide evidence against it.
- Alternative Hypothesis (H1): Statement about the population parameter that we are trying to prove.
- Type I Error (α): Rejecting the null hypothesis when it is true.
- Type II Error (β): Failing to reject the null hypothesis when it is false.
P-Value and Significance Levels
- P-value: Probability of observing data as extreme as the observed data, assuming the null hypothesis is true.
- Significance Level (α): Threshold for rejecting the null hypothesis. Typically set to 0.05.
Z-Test and T-Test
- Z-test: Used to test hypotheses about population means when the population standard deviation is known.
- T-test: Used to test hypotheses about population means when the population standard deviation is unknown, particularly with small samples.
ANOVA (Analysis of Variance)
- Used to compare the means of two or more groups.
- One-way ANOVA: Used to compare the means of two or more groups on one factor variable.
- Two-way ANOVA: Used to compare the means of two or more groups on two or more factor variables.
Chi-Square Test
- Test for Independence: Used to determine if there is a relationship between two categorical variables.
- Goodness of Fit Test: Used to determine if a sample distribution matches a hypothesized distribution.
Confidence Intervals
- Constructing Confidence Intervals: Range of values that is likely to contain the true population parameter.
- Interpreting Confidence Intervals: Provides a range of plausible values for the true population parameter.
Correlation
- Pearson Correlation Coefficient: Measures the linear relationship between two continuous variables.
- Spearman Correlation Coefficient: Measures the monotonic relationship between two variables.
Simple Linear Regression
- Model Fitting: Finding the best-fitting line that describes the linear relationship between two variables.
- Interpretation: Using the model to predict the value of one variable based on the value of the other variable.
Multiple Linear Regression
- Model Fitting: Finding the best-fitting plane that describes the linear relationship between a response variable and two or more predictor variables.
- Assumptions: Linearity, independence, homoscedasticity, and normality of errors.
- Interpretation: Using the model to predict the response variable from the values of the predictor variables, and determining which predictor variables are significant.
Logistic Regression
- Binary Outcome Modeling: Used for predicting the probability of a binary outcome (e.g., success vs. failure).
- Interpretation: Estimating the odds ratio, which represents the change in the odds of the outcome for a one-unit change in the predictor variable.
Residual Analysis
- Checking Model Assumptions: Examining the residuals (the difference between the observed and predicted values) to see if the model assumptions are met.
- Diagnostics: Using residual plots to identify patterns that suggest problems with the model.
Multicollinearity
- Detection: Using variance inflation factors (VIF) to determine if predictor variables are highly correlated.
- Remedies: Removing redundant variables, creating new variables (e.g., principal components), or using ridge regression.
Heteroscedasticity
- Understanding: Non-constant variance of the residuals.
- Detecting Heteroscedasticity: Examining residual plots to see if the variance of the residuals is changing across the range of predicted values.
Time Series Analysis
- Decomposition: Separating a time series into its components (trend, seasonality, and noise).
- Trend: Long-term pattern in the data.
- Seasonality: Cyclical pattern that repeats over time.
Stationarity in Time Series
- Testing: Determining if the properties of the time series are constant over time (e.g., mean, variance).
- Transforming Data: Applying transformations (e.g., differencing) to make the time series stationary if it is not.
Autocorrelation and Partial Autocorrelation
- ACF Plots: Shows the correlation between the time series and its lagged values.
- PACF Plots: Shows the partial correlation between the time series and its lagged values, controlling for the effects of intermediate lags.
ARIMA Models
- Autoregressive Integrated Moving Average Models: Time series models that use past values of the time series to predict future values.
Forecasting Techniques
- Exponential Smoothing: Using a weighted average of past values to forecast future values.
- ARIMA Forecasting: Using ARIMA models to forecast future values.
Non-Parametric Tests
- Mann-Whitney U test: Used to compare two independent groups when the data is not normally distributed.
- Wilcoxon signed-rank test: Used to compare two dependent groups when the data is not normally distributed.
- Kruskal-Wallis test: Used to compare three or more groups when the data is not normally distributed.
Principal Component Analysis (PCA)
- Dimensionality Reduction Technique: Reducing the number of variables in a dataset while retaining as much information as possible.
Factor Analysis
- Identifying Latent Variables: Identifying underlying factors that explain the relationships between observed variables.
Cluster Analysis
- K-means Clustering: Grouping data points into k clusters, with each data point belonging to the cluster with the nearest mean.
- Hierarchical Clustering: Building a hierarchy of clusters by successively merging or splitting clusters.
Discriminant Analysis
- Linear and Quadratic Discriminant Analysis: Classifying data points into groups based on a linear or quadratic function of predictor variables.
Bayesian Statistics
- Bayes’ Theorem: Updating prior beliefs about a parameter based on new data.
- Bayesian Inference: Using Bayes’ theorem to make inferences about parameters.
Markov Chains
- Transition Matrices: Matrices that represent the probabilities of transitions between states in a Markov chain.
- Steady-State Probabilities: Long-run probabilities of being in each state in a Markov chain.
Maximum Likelihood Estimation (MLE)
- Estimating Parameters: Finding the values of parameters that maximize the likelihood of observing the data.
- Interpreting Results: Estimating the values of population parameters based on the data.
Monte Carlo Simulations
- Applications in Probability and Statistics: Using computer simulations to generate random samples from a distribution and estimate probabilities or parameters.
Bootstrap Resampling
- Understanding: Creating multiple samples with replacement from the original data to estimate the sampling distribution of a statistic.
- Applying Bootstrap Techniques: Using bootstrap resampling to construct confidence intervals or perform hypothesis tests.
Survival Analysis
- Kaplan-Meier Curves: Plots that show the proportion of individuals surviving over time.
- Cox Proportional Hazards Model: Model that estimates the hazard rate (the instantaneous risk of an event) as a function of covariates.
Log-Linear Models
- Analysis of Categorical Data and Contingency Tables: Models that use logarithms of expected frequencies to analyze the relationships between categorical variables.
Poisson Regression
- Modeling Count Data: Used for predicting the count of events occurring in a fixed time interval or space.
Generalized Linear Models (GLM)
- Overview: General framework for modeling a response variable that has a distribution from the exponential family.
- Applications: Variety of applications, including Poisson regression, logistic regression, and gamma regression.
Random Effects Models
- Mixed Models and Repeated Measures Analysis: Models that include both fixed and random effects to account for variability between subjects or groups.
Multivariate Analysis
- MANOVA (Multivariate Analysis of Variance): Comparing the means of two or more groups on multiple dependent variables.
- Multivariate Regression: Predicting multiple dependent variables based on one or more predictor variables.
Experimental Design
- Completely Randomized Designs: Randomly assigning subjects to treatment groups.
- Block Designs: Grouping subjects into blocks based on a common characteristic.
- Factorial Designs: Studying the effects of two or more factors on a response variable.
Power Analysis
- Calculating and Interpreting Statistical Power: Calculating the probability of detecting a statistically significant difference when one exists.
Meta-Analysis
- Combining Results from Multiple Studies: Combining the results of multiple studies to obtain an overall estimate of an effect size.
Causal Inference
- Concepts of Causality: Understanding what it means to cause an outcome.
- Confounding: When a third variable affects both the exposure and the outcome, leading to a spurious association.
- Randomized Controlled Trials: Gold standard for determining causal relationships.
Bayesian Networks
- Probabilistic Graphical Models: Graphical representation of the probabilistic relationships between variables.
Propensity Score Matching
- Techniques for Causal Inference in Observational Studies: Matching individuals in a treatment group with individuals in a control group who have similar propensity scores, which represent the probability of receiving treatment.
Machine Learning Basics
- Overview of Classification, Regression, and Clustering Techniques: Techniques used for predicting categorical outcomes, continuous outcomes, and grouping data points.
Outlier Detection
- Identifying and Managing Outliers: Using various methods to identify and treat outliers in a dataset.
Model Validation Techniques
- Cross-Validation: Dividing the data into multiple folds and using a portion of the data to train the model and the remaining portion to evaluate its performance.
- Train-Test Split: Dividing the data into a training set and a testing set to evaluate the model’s performance.
- ROC Curves: Plots that show the tradeoff between true positive rate and false positive rate at different thresholds.
Data Imputation Methods
- Techniques for Handling Missing Data: Filling in missing values using various methods:
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Replacing missing values with the values from the k nearest neighbors in the dataset.
Descriptive Statistics
- Mean: The average of a dataset, calculated by summing all values and dividing by the number of values.
- Median: The middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
- Mode: The value that appears most frequently in a dataset. A dataset can have multiple modes or no mode.
- Range: The difference between the highest and lowest values in a dataset.
- Variance: A measure of how spread out the data is from the mean. Calculated as the average of the squared differences between each data point and the mean.
- Standard Deviation: The square root of the variance, providing a measure of data spread in the same units as the original data.
Data Visualization
- Histograms: Graphical representation of the distribution of numerical data.
- Box Plots: Provides a summary of a dataset's distribution, displaying the median, quartiles, and potential outliers.
- Bar Charts: Used to display categorical data, comparing frequencies or proportions across different categories.
- Scatter Plots: Shows the relationship between two variables, indicating potential trends or correlations.
Probability Basics
- Probability: The likelihood of an event occurring, expressed as a value between 0 (impossible) and 1 (certain).
- Rules of Probability:
- Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
- Multiplication Rule: P(A and B) = P(A) * P(B|A) (where B|A represents the probability of B given A)
- Probability Distributions: Mathematical functions that describe the probabilities of different outcomes in an experiment.
Probability Distributions
- Normal Distribution: Bell-shaped distribution characterized by its mean and standard deviation. Widely used in statistics.
- Binomial Distribution: Describes the probability of a certain number of successes in a fixed number of independent trials.
- Poisson Distribution: Models the probability of a particular number of events occurring in a fixed interval of time or space.
- Exponential Distribution: Represents the time between events in a Poisson process.
Sampling Techniques
- Simple Random Sampling: Every member of the population has an equal chance of being selected.
- Stratified Sampling: Population is divided into subgroups based on characteristics, and samples are randomly drawn from each subgroup.
- Cluster Sampling: Population is divided into clusters; entire clusters are randomly selected, and individuals within the chosen clusters are sampled.
Central Limit Theorem
- Central Limit Theorem: States that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population distribution.
- Applications: Enables statistical inference and hypothesis testing, even when the population distribution is unknown.
Hypothesis Testing
- Null Hypothesis: A statement about the population parameter that is assumed to be true.
- Alternative Hypothesis: A statement that contradicts the null hypothesis.
- Type I Error: Rejecting the null hypothesis when it is true (false positive).
- Type II Error: Failing to reject the null hypothesis when it is false (false negative).
P-Value and Significance Levels
- P-Value: The probability of obtaining results as extreme as the observed results, assuming the null hypothesis is true.
- Significance Level (α): A predetermined threshold for rejecting the null hypothesis. Typically set at 0.05, meaning there is a 5% chance of rejecting the null hypothesis when it is actually true.
Z-Test and T-Test
- Z-Test: Used to compare the mean of a sample to a known population mean, assuming a known population standard deviation.
- T-Test: Used to compare the means of two samples, or to compare the mean of a sample to a known population mean, when the population standard deviation is unknown.
ANOVA (Analysis of Variance)
- One-Way ANOVA: Tests for differences in means between two or more groups with a single independent variable.
- Two-Way ANOVA: Tests for differences in means between two or more groups with two or more independent variables.
Chi-Square Test
- Chi-Square Test of Independence: Tests whether there is a relationship between two categorical variables.
- Chi-Square Test for Goodness of Fit: Tests whether observed frequencies match expected frequencies from a theoretical distribution.
Confidence Intervals
- Confidence Intervals: A range of values that is likely to contain the true population parameter.
- Interpretation: A 95% confidence interval means that if the sampling procedure were repeated many times, about 95% of the resulting intervals would contain the true population parameter.
Correlation
- Pearson Correlation Coefficient: Measures the strength and direction of the linear relationship between two continuous variables.
- Spearman Correlation Coefficient: Measures the strength and direction of the monotonic relationship (not necessarily linear) between two variables.
Simple Linear Regression
- Simple Linear Regression: A statistical technique used to model the linear relationship between a dependent variable and a single independent variable.
- Model Fitting: Finding the line of best fit that minimizes the sum of squared differences between the actual and predicted values.
- Interpretation: The slope of the regression line indicates the change in the dependent variable for a one-unit change in the independent variable.
Multiple Linear Regression
- Multiple Linear Regression: Models the relationship between a dependent variable and two or more independent variables.
- Assumptions: Linearity, independence of errors, homoscedasticity, and normality of errors.
- Interpretation: The coefficients of the independent variables indicate the change in the dependent variable for a one-unit change in each independent variable, holding all other variables constant.
Logistic Regression
- Logistic Regression: A statistical method for predicting the probability of a binary outcome.
- Interpretation: The coefficients of the independent variables indicate the change in the log-odds of the outcome for a one-unit change in each independent variable.
Residual Analysis
- Residual Analysis: Examining the residuals (the difference between the actual and predicted values) to assess model assumptions.
- Diagnostics: Identifying potential issues like non-linearity, heteroscedasticity, and outliers.
Multicollinearity
- Multicollinearity: A situation where two or more independent variables in a regression model are highly correlated.
- Detection: Using variance inflation factor (VIF) to identify multicollinearity.
- Remedies: Removing one of the correlated variables or using dimensionality reduction techniques.
Heteroscedasticity
- Heteroscedasticity: Violation of the assumption in regression models that the variance of the error term is constant across all levels of the independent variables.
- Understanding: Non-constant variance can lead to inaccurate parameter estimates and confidence intervals.
- Detection: Visual inspection of residual plots and statistical tests.
Time Series Analysis
- Decomposition: Breaking down a time series into its components: trend, seasonality, and randomness.
- Trend: The long-term pattern in the time series.
- Seasonality: Regular, recurring patterns in the time series, often related to time of year.
Stationarity in Time Series
- Stationarity: Time series is stationary if its statistical properties (mean, variance, and autocorrelation) do not change over time.
- Testing: Using statistical tests like the Augmented Dickey-Fuller (ADF) test.
- Transforming Data: Techniques like differencing can help make a non-stationary time series stationary.
Autocorrelation and Partial Autocorrelation
- Autocorrelation (ACF): Measures the correlation between values in a time series at different time lags.
- Partial Autocorrelation (PACF): Measures the correlation between values in a time series at different time lags, after removing the effects of intervening lags.
- ACF and PACF Plots: Visual representations of autocorrelation and partial autocorrelation, used to identify the order of an ARIMA model.
ARIMA Models
- ARIMA Models: Autoregressive Integrated Moving Average models used to forecast time series data.
- Components:
- AR (Autoregressive): Uses previous values of the time series to predict future values.
- I (Integrated): Incorporates differencing to make the time series stationary.
- MA (Moving Average): Uses past forecast errors to improve future forecasts.
Forecasting Techniques
- Exponential Smoothing: A method for smoothing time series data and forecasting future values.
- ARIMA Forecasting: Uses ARIMA models to predict future values in a time series.
Non-Parametric Tests
- Mann-Whitney U Test: Compares the medians of two independent groups when the data is not normally distributed.
- Wilcoxon Signed-Rank Test: Compares the medians of two dependent groups (paired data) when the data is not normally distributed.
- Kruskal-Wallis Test: Compares the medians of three or more independent groups when the data is not normally distributed.
Principal Component Analysis (PCA)
- PCA: Dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables (principal components).
- Applications: Data visualization, feature extraction, and noise reduction.
Factor Analysis
- Factor Analysis: A statistical method for identifying underlying latent variables (factors) that explain the correlations between observed variables.
Cluster Analysis
- K-Means Clustering: An algorithm that partitions data points into k clusters based on their distance to cluster centroids.
- Hierarchical Clustering: Constructs a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity or dissimilarity.
Discriminant Analysis
- Discriminant Analysis: A statistical method for classifying observations into predefined groups based on a set of predictor variables.
- Linear Discriminant Analysis (LDA): Assumes that the groups have equal covariance matrices.
- Quadratic Discriminant Analysis (QDA): Does not assume equal covariance matrices.
Bayesian Statistics
- Bayes' Theorem: A mathematical formula that updates the probability of an event given new evidence.
- Bayesian Inference: A method for updating beliefs about a parameter in light of new data.
Markov Chains
- Markov Chains: A stochastic process where the probability of future events depends only on the present state, not past states.
- Transition Matrices: Represent the probabilities of moving between different states in a Markov chain.
- Steady-State Probabilities: Probabilities of being in each state in the long-run, as the Markov chain evolves.
Maximum Likelihood Estimation (MLE)
- MLE: A method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.
- Applications: Estimating the parameters of probability distributions.
Monte Carlo Simulations
- Monte Carlo Simulations: Using random numbers to simulate a process or model repeatedly to estimate probabilities or other quantities of interest.
- Applications: Probability and statistics, finance, and engineering.
Bootstrap Resampling
- Bootstrap Resampling: A resampling technique that involves drawing multiple samples with replacement from the original data to estimate the sampling distribution of a statistic.
- Applications: Estimating confidence intervals, hypothesis testing, and model validation.
Survival Analysis
- Survival Analysis: A statistical method for analyzing data where the outcome of interest is the time until an event occurs.
- Kaplan-Meier Curves: Graphical representations of the survival function, showing the probability of surviving beyond a certain time point.
- Cox Proportional Hazards Model: A regression model that estimates the effect of predictor variables on the hazard rate of an event.
Log-Linear Models
- Log-Linear Models: Used for analyzing categorical data and contingency tables.
- Applications: Examining associations between categorical variables and testing for independence.
Poisson Regression
- Poisson Regression: A statistical method for modeling count data, where the dependent variable represents the number of occurrences of an event.
Generalized Linear Models (GLM)
- GLM: A generalization of linear regression that allows for different distributions for the dependent variable (e.g., Poisson, binomial).
- Applications: Modeling data that does not follow a normal distribution.
Random Effects Models
- Random Effects Models (Mixed Models): Statistical models that include both fixed effects (constant across individuals) and random effects (vary across individuals).
- Repeated Measures Analysis: Used when multiple measurements are taken from the same individuals over time.
Multivariate Analysis
- MANOVA (Multivariate Analysis of Variance): Tests for differences in means between two or more groups on multiple dependent variables.
- Multivariate Regression: Regression analysis with multiple dependent variables.
Experimental Design
- Completely Randomized Designs: Participants are randomly assigned to treatment groups.
- Block Designs: Participants are grouped into blocks based on a characteristic, and then randomly assigned to treatments within each block.
- Factorial Designs: Study the effects of multiple factors and their interactions.
Power Analysis
- Power Analysis: Determining the sample size needed to detect a statistically significant effect with a given probability.
- Statistical Power: The probability of correctly rejecting the null hypothesis when it is false.
Meta-Analysis
- Meta-Analysis: A statistical technique for combining the results of multiple studies to estimate an overall effect size.
Causal Inference
- Causality: The relationship between an event (cause) and its effect.
- Confounding: A third variable that influences both the cause and the effect, leading to a spurious association.
- Randomized Controlled Trials (RCTs): Gold standard for establishing causality, where participants are randomly assigned to treatment groups.
Bayesian Networks
- Bayesian Networks: Probabilistic graphical models that represent relationships between variables.
- Applications: Machine learning, causal inference, expert systems.
Propensity Score Matching
- Propensity Score Matching: Techniques for causal inference in observational studies, where participants are not randomly assigned to treatment groups.
- Propensity Score: The probability of receiving the treatment, estimated based on observed characteristics.
Machine Learning Basics
- Classification: Predicting the category or class to which an observation belongs.
- Regression: Predicting a continuous outcome.
- Clustering: Grouping similar data points together.
Outlier Detection
- Outliers: Data points that are significantly different from other data points in the dataset.
- Identifying Outliers: Using box plots, scatter plots, and statistical measures like the Z-score.
- Managing Outliers: Removing them, transforming them, or using robust statistical methods that are less sensitive to outliers.
Model Validation Techniques
- Cross-Validation: A technique for evaluating the performance of a machine learning model on unseen data.
- Train-Test Split: Dividing the data into training and testing sets to evaluate model performance.
- ROC Curves: Used to assess the performance of binary classification models, showing the trade-off between true positive rate and false positive rate.
Data Imputation Methods
- Data Imputation: Techniques for handling missing data by filling in the missing values with plausible estimates.
- Mean/Mode Imputation: Replacing missing values with the mean or mode of the variable.
- KNN Imputation: Using the values of nearest neighbors to impute missing values.
Descriptive Statistics
- Mean: Average of a dataset
- Median: Middle value in a sorted dataset
- Mode: Most frequent value in a dataset
- Range: Difference between the highest and lowest values
- Variance: Measures how spread out data is from the mean
- Standard Deviation: Square root of variance, providing a measure of data dispersion
Data Visualization
- Histograms: Visualize the distribution of numerical data
- Box Plots: Summarize key features of a dataset: median, quartiles, outliers
- Bar Charts: Compare categorical data using bars
- Scatter Plots: Show the relationship between two numerical variables
Probability Basics
- Probability: Measures the likelihood of an event occurring
- Rules of probability: Addition rule, multiplication rule, conditional probability, etc.
- Probability Distributions: Mathematical functions that describe the likelihood of different outcomes
Probability Distributions
- Normal Distribution: Bell-shaped distribution, commonly used in statistics
- Binomial Distribution: Describes the probability of successes in a fixed number of trials
- Poisson Distribution: Models the probability of a certain number of events occurring in a fixed interval
- Exponential Distribution: Describes the time between events in a Poisson process
Sampling Techniques
- Simple Random Sampling: Each member of the population has an equal chance of being selected
- Stratified Sampling: Population is divided into strata, and samples are drawn from each stratum
- Cluster Sampling: Population is divided into clusters, and a random sample of clusters is selected
Central Limit Theorem
- States that the distribution of sample means approaches a normal distribution as sample size increases
- Important for making inferences about populations using sample data
Hypothesis Testing
- Null Hypothesis (H0): Statement that there is no effect or difference
- Alternative Hypothesis (H1): Statement that there is an effect or difference
- Type I Error: Rejecting H0 when it is true
- Type II Error: Failing to reject H0 when it is false
P-Value and Significance Levels
- P-value: Probability of observing the data if H0 is true
- Significance Level (α): Threshold for rejecting H0
- If the P-value is less than α, then H0 is rejected
Z-Test and T-Test
- Z-Test: Used to compare means when the population standard deviation is known
- T-Test: Used to compare means when the population standard deviation is unknown
ANOVA (Analysis of Variance)
- One-Way ANOVA: Compares means of two or more groups for one independent variable
- Two-Way ANOVA: Compares means of two or more groups for two or more independent variables
Chi-Square Test
- Test for Independence: Investigates the relationship between two categorical variables
- Goodness of Fit Test: Compares observed frequencies to expected frequencies
Confidence Intervals
- Confidence Interval: Range of values that is likely to contain the true population parameter
- Constructed at a specific confidence level (e.g., 95%)
Correlation
- Pearson Correlation: Measures the strength and direction of the linear relationship between two numerical variables
- Spearman Correlation: Measures the strength and direction of the monotonic relationship between two variables
Simple Linear Regression
- Models the linear relationship between a dependent variable and an independent variable
- Fits a line to the data to minimize the sum of squared errors
Multiple Linear Regression
- Models the relationship between a dependent variable and multiple independent variables
- Assumes linearity, independence of errors, constant variance (homoscedasticity)
Logistic Regression
- Models the probability of a binary outcome (e.g., 0 or 1) based on independent variables
- Uses a sigmoid function to predict probabilities
Residual Analysis
- Examines the residuals (differences between observed and predicted values)
- Helps to assess model assumptions and identify potential problems
Multicollinearity
- Occurs when independent variables are highly correlated
- Can inflate the variance of regression coefficients and make interpretation difficult
- Can be detected using the Variance Inflation Factor (VIF)
Heteroscedasticity
- Occurs when the variance of the errors is not constant across the range of independent variables
- Can violate assumptions of linear regression and make the model unreliable
Time Series Analysis
- Analyzes data collected over time
- Decomposes time series into trend, seasonality, cycle, and random components
- Important for forecasting and understanding trends
Stationarity in Time Series
- Time series is stationary when its statistical properties (mean, variance, autocorrelation) are constant over time
- Can be tested using statistical tests (e.g., Dickey-Fuller test)
- Data can be transformed (e.g., differencing) to achieve stationarity
Autocorrelation and Partial Autocorrelation
- Autocorrelation Function (ACF): Measures the correlation between a time series and lagged versions of itself
- Partial Autocorrelation Function (PACF): Measures the correlation between a time series and lagged versions of itself, controlling for the effects of intervening lags
ARIMA Models
- Autoregressive Integrated Moving Average (ARIMA) models are widely used for time series forecasting
- Combine autoregressive (AR), integrated (I), and moving average (MA) components to capture dependencies in the data
Forecasting Techniques
- Exponential Smoothing: Uses a weighted average of past observations to forecast future values
- ARIMA models can also be used for forecasting
Non-Parametric Tests
- Do not assume a specific distribution for the data
- Mann-Whitney U test: Compares two independent groups
- Wilcoxon signed-rank test: Compares two dependent groups
- Kruskal-Wallis test: Compares more than two groups
Principal Component Analysis (PCA)
- Dimensionality reduction technique that transforms data into a smaller set of uncorrelated variables (principal components)
- Used for data visualization, feature extraction in machine learning
Factor Analysis
- Identifies underlying latent variables that explain the relationships between observed variables
- Used to simplify complex datasets and understand underlying constructs
Cluster Analysis
- Groups data points into clusters based on similarity
- K-means clustering: An iterative algorithm that partitions data into k clusters
- Hierarchical clustering: Builds a hierarchy of clusters based on distances between data points
Discriminant Analysis
- Predicts group membership based on independent variables
- Linear Discriminant Analysis (LDA): Separates groups with linear decision boundaries, assuming equal group covariances
- Quadratic Discriminant Analysis (QDA): Allows curved (quadratic) decision boundaries, letting each group have its own covariance
Bayesian Statistics
- Bayes' Theorem: Updates prior beliefs with new evidence to obtain posterior probabilities
- Allows for incorporating prior knowledge into statistical analysis
Markov Chains
- Models sequences of events where the probability of an event depends only on the previous event
- Used to analyze systems with discrete states and transitions between states
Maximum Likelihood Estimation (MLE)
- Finds the parameter values that maximize the likelihood of observing the data
- Widely used for estimating parameters in statistical models
Monte Carlo Simulations
- Randomly generates data from a specified distribution to estimate probabilities or other quantities
- Used for risk assessment, optimization, and other applications
Bootstrap Resampling
- Resamples data with replacement to estimate the distribution of a statistic
- Helpful for constructing confidence intervals and performing hypothesis tests when assumptions are not met
Survival Analysis
- Analyzes the time until an event occurs (e.g., death, failure)
- Kaplan-Meier curves: Estimate the survival probability over time
- Cox proportional hazards model: Models the hazard rate of an event as a function of independent variables
Log-Linear Models
- Analyze categorical data and contingency tables
- Use logarithms to model the relationship between variables
Poisson Regression
- Models count data (e.g., number of events) as a function of independent variables
- Assumes a Poisson distribution for the response variable
Generalized Linear Models (GLM)
- General framework that encompasses a variety of statistical models
- Allows for different response distributions and link functions
Random Effects Models
- Mixed models that include both fixed and random effects
- Used to analyze data with repeated measures or clustered effects
Multivariate Analysis
- Analyzes multiple dependent variables simultaneously
- MANOVA (Multivariate Analysis of Variance): Compares means of multiple dependent variables across groups
- Multivariate Regression: Predicts multiple dependent variables based on independent variables
Experimental Design
- Systematic methods for planning and conducting experiments
- Completely randomized designs: Random assignment of participants to treatment groups
- Block designs: Control for extraneous variables by grouping participants into blocks
- Factorial designs: Investigate the effects of multiple independent variables
Power Analysis
- Determines the minimum sample size needed to detect a statistically significant effect
- Helps to ensure that an experiment has sufficient power to answer the research question
Meta-Analysis
- Combines results from multiple studies to obtain a more precise estimate of an effect
- Can be used to synthesize research evidence and assess the overall strength of an effect
Causal Inference
- Assessing the causal relationship between variables
- Confounding variables: Factors that can influence both the cause and the effect
- Randomized controlled trials (RCTs): Gold standard for establishing causality
Bayesian Networks
- Probabilistic graphical models that represent relationships between variables using a directed acyclic graph
- Used for reasoning under uncertainty, predictive modeling, and causal inference
Propensity Score Matching
- Technique for causal inference in observational studies
- Uses propensity scores (the probability of receiving treatment) to match treated and control groups
Machine Learning Basics
- Classification: Predicting a categorical outcome (e.g., spam or not spam)
- Regression: Predicting a numerical outcome (e.g., house price)
- Clustering: Grouping data points into clusters based on similarity
Outlier Detection
- Identifying data points that are significantly different from the rest of the data
- Can affect model accuracy and bias results
Model Validation Techniques
- Cross-validation: Splits data into training and validation sets to assess model performance
- Train-test split: Similar to cross-validation but with a single split
- ROC curves: Visualize the trade-off between true positive rate and false positive rate
Data Imputation Methods
- Filling in missing data values using various methods
- Mean/mode imputation: Replaces missing values with the mean or mode
- KNN imputation: Uses the k nearest neighbors to impute missing values
Description
Test your understanding of key concepts in descriptive statistics and data visualization. This quiz covers essential measures such as mean, median, mode, and various visual tools like histograms and bar charts. Ensure you grasp the foundational elements that are critical for data analysis.