Summary

This presentation covers parametric tests, including correlation, covariance, linear regression, t-tests, and ANOVA. It details their concepts, formulas, and limitations.

Full Transcript


PARAMETRIC TESTS: The relationship between x and y

Correlation: is there a relationship between two variables?
Regression: how well does an independent variable predict a dependent variable?

CORRELATION ≠ CAUSATION. In order to infer causality, you must manipulate the independent variable and observe the effect on the dependent variable.

Scattergrams
[Figure: three scatterplots of y against x, illustrating positive correlation, negative correlation, and no correlation.]

Variance vs Covariance
Do two variables change together?

Variance gives information on the variability of a single variable:

$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$

Covariance gives information on the degree to which two variables vary together:

$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Note how similar the covariance is to the variance: the equation simply multiplies x's error scores by y's error scores instead of squaring x's error scores.

When X and Y increase together: cov(x, y) is positive.
When X increases as Y decreases: cov(x, y) is negative.
When there is no constant relationship: cov(x, y) = 0.

Problem with covariance: the value obtained depends on the size of the data's standard deviations. If they are large, the value will be greater than if they are small, even when the relationship between x and y is exactly the same in the two datasets.

Example of how the covariance value relies on variance:

          High variance data           Low variance data
Subject    x     y   x err × y err     x    y   x err × y err
   1      101   100      2500          54   53        9
   2       81    80       900          53   52        4
   3       61    60       100          52   51        1
   4       51    50         0          51   50        0
   5       41    40       100          50   49        1
   6       21    20       900          49   48        4
   7        1     0      2500          48   47        9
 Mean      51    50                    51   50
 Sum of x err × y err:   7000                        28
 Covariance:          1166.67                      4.67

Solution: Pearson's r
On its own, the covariance value does not really tell us anything, so the solution is to standardise this measure. Pearson's r standardises the covariance by dividing it by the product of the standard deviations of X and Y:

$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$

Pearson's r continued
Writing out the covariance gives

$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$

which is equivalent to the mean product of z-scores:

$r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$

Limitations of r
When r = 1 or r = −1, we can predict y from x with certainty: all data points lie on a straight line y = ax + b.
The r we calculate is really r̂: r denotes the true correlation of the whole population, while r̂ is the estimate of r based on our data.
r is very sensitive to extreme values.
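As a sanity check on the worked example, here is a minimal sketch in plain Python (the function name sample_covariance is my own) that reproduces both covariance values from the table:

```python
def sample_covariance(xs, ys):
    """cov(x, y) = sum of (x_i - x̄)(y_i - ȳ) over n - 1, mirroring the formula above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# The seven (x, y) pairs from the high- and low-variance tables above.
high_x, high_y = [101, 81, 61, 51, 41, 21, 1], [100, 80, 60, 50, 40, 20, 0]
low_x, low_y = [54, 53, 52, 51, 50, 49, 48], [53, 52, 51, 50, 49, 48, 47]

print(sample_covariance(high_x, high_y))  # ≈ 1166.67
print(sample_covariance(low_x, low_y))    # ≈ 4.67
```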
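A companion sketch of Pearson's r, using Python's standard statistics module, makes the standardisation point concrete: the two datasets have wildly different covariances but an identical r of 1.0, because in both tables y is an exact linear function of x:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """r = cov(x, y) / (s_x * s_y): the covariance rescaled by both standard deviations."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))

print(pearson_r([101, 81, 61, 51, 41, 21, 1], [100, 80, 60, 50, 40, 20, 0]))  # 1.0
print(pearson_r([54, 53, 52, 51, 50, 49, 48], [53, 52, 51, 50, 49, 48, 47]))  # 1.0
```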
Regression
Correlation tells you whether there is an association between x and y, but it doesn't describe the relationship or allow you to predict one variable from the other. To do this we need regression.

Best-fit line
The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x. Here a is the slope, b is the intercept, ŷ is the predicted value, y_i is the true value, and ε = y_i − ŷ is the residual error. The best line is the one that minimises the distance between the data and the fitted line, i.e. the residuals.

Least squares regression
To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line).
Model line: ŷ = ax + b (a = slope, b = intercept)
Residual: ε = y − ŷ
Sum of squared residuals: Σ(y − ŷ)²
We must find the values of a and b that minimise Σ(y − ŷ)².

The solution
Doing this gives the following equation for the slope:

$a = \frac{r\, s_y}{s_x}$

where r is the correlation coefficient of x and y, s_y is the standard deviation of y, and s_x is the standard deviation of x. From this you can see that:
- A low correlation coefficient gives a flatter slope (small value of a).
- A large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a).
- A large spread of x, i.e. a high standard deviation, results in a flatter slope (small value of a).

The solution, continued
Our model equation is ŷ = ax + b. The line must pass through the mean point, so ȳ = a x̄ + b, giving b = ȳ − a x̄. Substituting our equation for a gives:

$b = \bar{y} - \frac{r\, s_y}{s_x}\,\bar{x}$

Back to the model

$\hat{y} = ax + b = \frac{r\, s_y}{s_x}\,x + \bar{y} - \frac{r\, s_y}{s_x}\,\bar{x}$

which rearranges to:

$\hat{y} = \frac{r\, s_y}{s_x}\,(x - \bar{x}) + \bar{y}$

If the correlation is zero, we simply predict the mean of y for every value of x, and our regression line is just a flat line crossing the y-axis at ȳ. But this isn't very useful. We can calculate the regression line for any data; the important question is how well the line fits the data, i.e. how good it is at predicting y from x.

How good is our model?
Total variance of y:

$s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y}$

Variance of the predicted values ŷ (the variance explained by our regression model):

$s_{\hat{y}}^2 = \frac{\sum (\hat{y} - \bar{y})^2}{n - 1} = \frac{SS_{pred}}{df_{\hat{y}}}$

Error variance (the variance of the error between our predicted y values and the actual y values, and thus the variance in y that is NOT explained by the regression model):

$s_{error}^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{SS_{er}}{df_{er}}$

How good is our model, continued
Since $s_{\hat{y}}^2 = r^2 s_y^2$, inserting this into $s_y^2 = s_{\hat{y}}^2 + s_{error}^2$ and rearranging gives:

$s_{error}^2 = s_y^2 - r^2 s_y^2 = s_y^2 (1 - r^2)$

From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction.

General Linear Model
Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

y = ax + b + ε

A General Linear Model is just any model that describes the data in terms of a straight line.
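To make the slope and intercept formulas concrete, here is a small sketch on made-up data (the numbers are illustrative only); it assumes Python 3.10+ for statistics.correlation:

```python
from statistics import mean, stdev, correlation  # correlation requires Python 3.10+

# Made-up sample: five observations with a roughly linear trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

r = correlation(x, y)
a = r * stdev(y) / stdev(x)   # slope:     a = r * s_y / s_x
b = mean(y) - a * mean(x)     # intercept: b = ȳ - a·x̄ (the line passes through the means)

print(f"ŷ = {a:.3f}x + {b:.3f}")
print(a * 6 + b)  # predicted ŷ at a hypothetical new value x = 6
```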
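Continuing that sketch, we can verify the variance-explained relationship numerically: for a least-squares fit, the total sum of squares splits into an explained part and an error part, and the explained proportion equals r²:

```python
# Check SS_y = SS_pred + SS_error and SS_pred / SS_y = r² on the toy data above.
y_hat = [a * xi + b for xi in x]
my = mean(y)
ss_y = sum((yi - my) ** 2 for yi in y)
ss_pred = sum((yh - my) ** 2 for yh in y_hat)
ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(ss_y, ss_pred + ss_err)  # equal, up to float rounding
print(ss_pred / ss_y, r ** 2)  # both give the proportion of variance explained
```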
COMPARATIVE TESTS

t(ea) for Two: testing between the means of different groups
When you want to know whether there is a difference between two groups in the mean, use a t-test. Why can't we just use the raw difference in scores? Because we have to take the variability into account:

t = difference between group means / sampling variability

One-sample t test
Evaluates whether the mean of a test variable is significantly different from a constant (the test value). The test value typically represents a neutral point, e.g. the midpoint on the test variable, or the average value of the test variable based on past research.

Example of a one-sample t-test
Is the starting salary of company A ($17,016.09) the same as the national average starting salary ($20,000)?
Null hypothesis: starting salary of company A = national average.
Alternative hypothesis: starting salary of company A ≠ national average.

Review
Standard deviation: a measure of the dispersion or spread of scores in a distribution of scores.
Standard error of the mean: the standard deviation of the sampling distribution; how much the mean would be expected to vary if the differences were due only to error variance.
Significance test: a statistical test to determine how likely it is that the observed characteristics of the samples have occurred by chance alone in the population from which the samples were selected, often reported as a p value.

z and t
z score: a standardized score. The z distribution is the normal curve with mean z = 0; 95% of the people in a given sample (or population) have z-scores between −1.96 and 1.96.
The t distribution is an adjustment of the z distribution for sample size: the sampling distribution has a flatter shape with small samples.

Confidence interval
A range of values of a sample statistic that is likely (at a given level of probability, i.e. confidence level) to contain a population parameter. The interval will include that population parameter a certain percentage of the time, equal to the confidence level.

Confidence interval for a difference, and the hypothesis test
When the value 0 is not included in the interval, 0 (no difference) is not a plausible population value. It appears unlikely that the true difference between company A's salary average and the national salary average is 0; therefore, company A's salary average is significantly different from the national salary average.

Independent-sample t test
Evaluates the difference between the means of two independent groups. Also called a "between-groups t test".
H0: μ1 = μ2
H1: μ1 ≠ μ2

Paired-sample t test
Evaluates whether the mean of the differences between the paired variables is significantly different from zero. Applicable to (1) repeated measures and (2) matched subjects. Also called a "within-subject t test" or "repeated-measures t test".
H0: μd = 0
H1: μd ≠ 0

Analysis of Variance (ANOVA)
An inferential statistical procedure used to test the null hypothesis that the means of two or more populations are equal to each other. The test statistic for ANOVA is the F-test (named for R. A. Fisher, the creator of the statistic).

t test vs. ANOVA
t-test: compares two groups; tests the null hypothesis that the two populations have the same average.
ANOVA: compares more than two groups; tests the null hypothesis that all of the populations have the same average.

ANOVA example
Curricula A, B, C. You want to know what the average score on the test of computer operations would have been:
- if the entire population of 4th graders in the school system had been taught using Curriculum A;
- what the population average would have been had they been taught using Curriculum B;
- what the population average would have been had they been taught using Curriculum C.
Null hypothesis: the population averages would have been identical regardless of the curriculum used.
Alternative hypothesis: the population averages differ for at least one pair of the populations.

ANOVA: the F-ratio
The variation in the averages of these samples, from one sample to the next, is compared to the variation among individual observations within each of the samples. A statistic termed an F-ratio is computed, summarizing the variation among sample averages relative to the variation among individual observations within samples. This F-statistic is compared to tabulated critical values that correspond to selected alpha levels. If the computed value of the F-statistic is larger than the critical value, the null hypothesis of equal population averages is rejected in favor of the alternative that the population averages differ.

Interpreting significance
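A minimal sketch of the one-sample t test, assuming SciPy is available; the salary figures are invented for illustration, not the actual data behind the $17,016.09 average quoted above:

```python
from scipy import stats  # assumes SciPy is installed

# Hypothetical starting salaries for company A — illustrative numbers only.
salaries = [17500, 16200, 18100, 15900, 17800, 16600, 17000, 16900]

t, p = stats.ttest_1samp(salaries, popmean=20000)
print(f"t = {t:.2f}, p = {p:.4g}")  # a small p rejects H0: mean salary = $20,000
```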
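The z and t cutoffs discussed above are easy to check numerically (again assuming SciPy):

```python
from scipy import stats

# The middle 95% of the z distribution lies within ±1.96; the t cutoff is wider
# for small samples and approaches 1.96 as the degrees of freedom grow.
print(stats.norm.ppf(0.975))       # ≈ 1.96
print(stats.t.ppf(0.975, df=5))    # ≈ 2.57
print(stats.t.ppf(0.975, df=100))  # ≈ 1.98
```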
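And a sketch of the confidence-interval reading of the same example: if $20,000 falls outside the 95% interval for company A's mean, the one-sample test above is significant at the 0.05 level (same invented salary data):

```python
import numpy as np
from scipy import stats

salaries = np.array([17500, 16200, 18100, 15900, 17800, 16600, 17000, 16900])
se = salaries.std(ddof=1) / np.sqrt(len(salaries))  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(salaries) - 1, loc=salaries.mean(), scale=se)
print(lo, hi)  # $20,000 outside [lo, hi] → significantly different from the test value
```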
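For the independent-sample and paired-sample designs, SciPy's ttest_ind and ttest_rel map directly onto the two tests described above; the data here are synthetic:

```python
import numpy as np
from scipy import stats  # assumes SciPy is available

rng = np.random.default_rng(0)

# Independent groups (between-groups design): two separate sets of subjects.
group_1 = rng.normal(50, 10, size=30)
group_2 = rng.normal(55, 10, size=30)
t_ind, p_ind = stats.ttest_ind(group_1, group_2)
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4g}")

# Paired / repeated measures (within-subject design): each subject measured twice.
before = rng.normal(50, 10, size=20)
after = before + rng.normal(3, 2, size=20)  # H0: mean difference = 0
t_rel, p_rel = stats.ttest_rel(before, after)
print(f"paired:      t = {t_rel:.2f}, p = {p_rel:.4g}")
```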
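Finally, a sketch of a one-way ANOVA for the curriculum example, with invented test scores for the three groups:

```python
from scipy import stats

# Hypothetical 4th-grade test scores under curricula A, B, and C.
curriculum_a = [78, 82, 75, 80, 85, 77]
curriculum_b = [72, 70, 74, 69, 73, 71]
curriculum_c = [80, 83, 79, 84, 81, 86]

f, p = stats.f_oneway(curriculum_a, curriculum_b, curriculum_c)
print(f"F = {f:.2f}, p = {p:.4g}")  # p < alpha → reject H0 of equal population means
```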
