Bivariate Associations/Analysis - Session 3
===========================================

*Bivariate analysis* aims to determine whether there is a statistical link between two variables and, if so, how strong that link is and in which direction it runs. Here are some of the most common types of bivariate analysis:

- *Correlation Analysis:* One of the most widely used techniques for examining the relationship between two **continuous** variables. The Pearson correlation coefficient measures the strength and direction of a linear relationship. A value close to 1 indicates a strong positive correlation, a value close to -1 a strong negative correlation, and a value close to 0 little to no linear correlation.
- *Scatterplots:* Scatterplots are graphical representations of data points in a two-dimensional space. They are particularly useful for visualizing the relationship between two **continuous** variables. The pattern of points can provide insights into the nature of the relationship.
- *Covariance:* Covariance is a statistical measure of how two **continuous** variables change together. It indicates the direction of the linear relationship (positive or negative) but, unlike a correlation coefficient, is not standardized.
- *Chi-Square Test:* When dealing with two **categorical** variables, the chi-square test of independence is a common choice. It helps determine whether there is a significant association between the two variables. For example, it can be used to analyze whether there is a relationship between gender and voting preferences.
- *T-Test:* The t-test is used to compare the means of two groups on a **continuous** variable. For instance, you might use a t-test to determine whether the mean test scores of students taught with two different teaching methods differ significantly.
- *Analysis of Variance (ANOVA):* ANOVA is an extension of the t-test and is used when there are more than two groups to compare.
It assesses whether there are statistically significant differences among the means of three or more groups.
- *Regression Analysis:* Bivariate regression analysis models the relationship between one dependent variable and one independent variable and can be used for prediction. For example, you might use simple linear regression to predict how changes in temperature (independent variable) affect ice cream sales (dependent variable).

1. Contingency Table
--------------------

Contingency tables (cross-tables) summarize the frequencies of combinations of categorical variables. They are used for descriptive analysis of associations.

Example: two-dimensional table with variables X and Y:

- The dependent variable (y) is typically placed in the rows and the independent variable (x) in the columns. Note that the example table below has X in the rows and Y in the columns.
- Row sums and column sums represent the marginal distributions.
- Expected frequency for a cell: (row total × column total) / grand total n, i.e. $f_{e,ij} = \frac{f_{i\cdot} \times f_{\cdot j}}{n}$, where $f_{i\cdot}$ is the sum of row i and $f_{\cdot j}$ is the sum of column j.

  Gender (X) \ Agreement (Y)    Agree   Disagree   Sum (Row)
  ----------------------------- ------- ---------- -----------
  Male                          40      10         50
  Female                        30      20         50
  Sum (Column)                  70      30         100

- (Conditional) cell frequencies: the distribution of the y-values within each category of x.

**Column percentages (%)**: divide each cell by its column sum. The cells of a column then add up to 100%.

**Row percentages (%)**: divide each cell by its row sum. The cells of a row then add up to 100%.

- Indifference matrix: comparison of the observed frequencies ($f_{b,ij}$) with the expected frequencies ($f_{e,ij}$) via the differences $f_{b,ij} - f_{e,ij}$.

**Statistical independence:**

- Two variables are statistically independent if, for all cells, the observed frequency ($f_{b,ij}$) equals the expected frequency ($f_{e,ij}$).
- If deviations exist, a statistical correlation is present.

**2. Chi-square (χ²)**

The χ² test assesses the independence of two categorical variables:

- Null hypothesis (H0): the variables are independent.
- If the calculated χ² exceeds the critical χ², or if p < α (e.g., 0.05), H0 is rejected.

Example: observed and expected frequencies are compared using the χ² formula.

Formula: χ² is the sum of the squared deviations between observed and expected frequencies, each divided by the expected frequency. The differences are squared because the raw differences would otherwise sum to 0.

$$\chi^2 = \sum_{i} \sum_{j} \frac{(f_{b,ij} - f_{e,ij})^2}{f_{e,ij}}$$

The sums run over all rows i and all columns j; $f_e$ = expected values, $f_b$ = observed values.

Example: (10 - 23.53)²/23.53 + ... + (5 - 18.43)²/18.43 ~ 0.51

Interpretation:

- **χ² = 0** if expected and observed frequencies are the same for all combinations; x and y are then statistically independent.
- **χ² > 0** if observed and expected frequencies differ; the larger χ², the stronger the deviation from independence.
- (!) χ² depends not only on the strength of the relationship but also on the table format and on n, so a value χ² > 0 is hard to interpret on its own.

**Properties of χ²:**

- (+) Starting point for measures of correlation in a contingency table (see section 3).
- (-) χ² is often used only as a test statistic in statistical hypothesis testing (χ² test).
- (-) Problem: χ² depends on the number of cases (the more cases, the higher χ²) and therefore says nothing about the strength of the correlation.

3. Measures of Correlation
--------------------------

Different measures are used depending on the scale of the data:

**- Nominal**: Cramer's V, Phi coefficient, Lambda.

**Nominal: Cramer's V**

- Formula (standard definition): $V = \sqrt{\frac{\chi^2}{n \cdot (\min(r, c) - 1)}}$, where n is the number of cases, r the number of rows and c the number of columns.
- Interpretation: Cramer's V standardizes χ² for nominal variables; values range from 0 (no correlation) to 1 (perfect correlation).
- Properties: (+) Corrects χ² by taking the number of cases into account. Other χ²-based measures are phi and Pearson's C; Cramer's V is the most general one.

**Nominal (binary): Phi**

Measure of correlation for two binary (dichotomous) variables, i.e. for a 2x2 table.

Formula (standard definition): $\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$, where a and b are the cells of the first row and c and d the cells of the second row.

Interpretation: 0 = no relationship; -1/+1 = perfect negative/positive relationship.

Properties:

- Phi is only appropriate for binary variables.
- The absolute value indicates the strength of the relationship. The sign indicates its direction, which is only meaningful relative to how the two categories of each variable are coded.

**Nominal: Lambda**

Asymmetrical measure of association for nominal variables. Asymmetrical means that the dependent variable (DV) has to be specified, in contrast to a symmetrical measure, where no DV has to be defined.

Formula (standard definition): $\lambda = \frac{E_1 - E_2}{E_1}$, where $E_1$ is the number of errors made when predicting the dependent variable without the independent variable and $E_2$ the number of errors made when the independent variable is used.

Interpretation:

- Values between 0 and 1: 0 = no correlation; 1 = perfect statistical correlation.

Properties:

- PRE measure (proportional reduction in error): compares the errors made in predicting the dependent variable while ignoring the independent variable with the errors made when the prediction uses information about the independent variable.
- Only measures the strength of the relationship between two variables, not its direction.

**- Ordinal: Rank correlation coefficients**: Spearman's rho, Kendall's tau-a, tau-b, gamma (for formulas, see the slides).

**Idea behind**: Ordinal and metric data can be ordered. To examine the degree of correlation between two such data series, one can sort one series and then check to what extent the second series has "also ordered" itself.

**Usage**: For variables measured on an ordinal scale, for a small number of observations, or for variables that are not normally distributed.

**Interpretation**:

- +1 = perfect positive monotonic correlation: higher ranks on one variable go with higher ranks on the other.
- -1 = perfect negative monotonic correlation: higher ranks on one variable go with lower ranks on the other.
- 0 = no monotonic correlation between the variables.
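
The nominal and ordinal measures above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, Cramer's V and phi use the standard textbook definitions, Spearman's rho is computed as a Pearson correlation on average ranks, and the two rating series at the end are made-up values. The 2x2 table is the gender by agreement table from section 1.

```python
import math

def expected_frequencies(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

def cramers_v(table):
    """Standard definition: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    smaller_dim = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (smaller_dim - 1)))

def phi_2x2(table):
    """Signed phi for a 2x2 table with cells a, b (row 1) and c, d (row 2)."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Observed table from section 1: rows = Male, Female; columns = Agree, Disagree.
observed = [[40, 10], [30, 20]]
chi2 = chi_square(observed)    # expected counts are 35 and 15 in each row
v = cramers_v(observed)
phi = phi_2x2(observed)
print(round(chi2, 2), round(v, 3), round(phi, 3))  # 4.76 0.218 0.218

# Hypothetical ordinal ratings from two raters (illustrative values only).
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 1, 4, 3, 5]
print(round(spearman_rho(rater_a, rater_b), 2))    # 0.8
```

For a 2x2 table the absolute value of phi and Cramer's V coincide, which is why both print 0.218 here. Multiplying every cell by 10 would multiply χ² by 10 while leaving V unchanged, which illustrates why χ² alone says little about the strength of an association.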

**- Metric: Bravais-Pearson's correlation coefficient (r).**

Ratio between the covariance of two variables and the product of their standard deviations: $r = \frac{\mathrm{cov}(x, y)}{s_x \, s_y}$.

Properties:

- (+) Measures the strength and direction of the linear relationship between two variables.
- (-) If the relationship between the variables is not linear, the correlation coefficient does not adequately represent its strength.
- (-) Only for metric/interval-level variables that are at least approximately normally distributed.

**Interpretation of the correlation coefficient**

-1 ≤ r ≤ +1

- r = -1: perfect negative correlation
- r = +1: perfect positive correlation
- r = 0: the two characteristics are uncorrelated

Conventional categorization of |r| (depends on discipline!):

- 0.1 < |r| < 0.5: weak linear relationship
- 0.5 ≤ |r| < 0.8: moderate linear relationship
- r "detects" only linear relationships (use scatter plots to see other relationships).

**4. Covariance & Correlation:**

- **Covariance**: direction of the relationship; no limits; not standardized; depends on the scale of the variables.
- **Correlation**: direction and strength of the relationship; ranges between -1 and 1; standardized and therefore independent of the scale of the variables (see z-transformation).
- Example: Education and income are positively correlated.

**5. Correlation & Causality**

Correlation: statistical measure that describes the size and direction of a relationship between (two) variables. E.g., education and income; height and weight.

Causation: one event is the result of the occurrence of the other event, i.e. a cause-and-effect relationship. E.g., smoking and lung cancer; regular exercise and physical health; hours of study and academic performance; higher carbon emissions and higher global warming.

(!) A correlation does not necessarily imply causation: correlation is a necessary but not a sufficient condition for causation, whereas causation implies correlation.

Example: Smoking and lung cancer have a causal relationship.
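
A correlation without any direct causal link between x and y can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not part of the original notes: a hypothetical variable z drives both x and y, Pearson's r is computed as covariance divided by the product of the standard deviations, and the first-order partial correlation is used as one standard way of controlling for z. All names, the seed, and the noise levels are chosen for illustration only.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def std_dev(x):
    return math.sqrt(covariance(x, x))

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated data: z influences both x and y; x and y do not influence each other.
rng = random.Random(42)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)                # roughly 0.8 by construction
r_xy_given_z = partial_r(x, y, z)     # close to 0 once z is controlled for

# Rescaling x (e.g. a change of units) changes the covariance but not r.
x_scaled = [100 * xi for xi in x]
print(round(r_xy, 2), round(r_xy_given_z, 2))
print(round(covariance(x, y), 2), round(covariance(x_scaled, y), 2))
print(round(pearson_r(x_scaled, y), 2))
```

The raw correlation between x and y is high even though neither affects the other, and it largely disappears once z is held constant. The last lines show the contrast from section 4: the covariance grows by the scaling factor of 100, while r stays the same.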

To conclude from a correlation that a causal relationship exists, at least the following conditions have to be fulfilled in the social sciences:

1) X and Y must be **correlated**.
2) X **precedes** Y in time (! however: anticipation is possible).
3) There is a **theoretical explanation** for the observed correlation, and alternative explanations are excluded (the correlation remains even when controlling for third variables).

It is not possible to deduce from a correlation coefficient whether:

- x caused y (x → y) or y caused x (y → x). The correlation coefficient is **symmetrical**.
- an interaction between x and y exists, i.e., the strength of the correlation between x and y is influenced by another variable.
- an (unobserved) third variable affects both variables and makes them seem causally related when they are not (illusory or spurious correlation, "Scheinkorrelation").

**6. Bivariate Graphics**

Graphs visually represent relationships between two variables and include:

- Scatter plots: show patterns in metric data.
- Bar charts: summarize frequencies.
- Line graphs: illustrate trends.
- Pie chart: displays proportions or percentages of a whole, showing how a single category or set of categories contributes to the total.
- Box plot: visualizes the distribution of data through quartiles, highlighting the median, range, and potential outliers, and thus summarizes the spread and central tendency of the data.

Advantages of graphs:

- Graphs contain more information than single measures.
- They are intuitively understandable.
- Humans can visually recognize patterns in complex graphics.

Example: A scatter plot of weekly working hours vs. age reveals no strong pattern.

**Tips for creating (bivariate) graphics**

- Present only 2 dimensions (in exceptional cases 3).
- **Clear** and understandable presentation:
    - Do not overload the graphic.
    - Use colors, but do not overdo it.
    - No 3D effects, shadows, etc.
- Choose meaningful **labels** for titles, axes, categories, etc.
- Specify the **data source**.