PSYC 3090 Exam 1 PDF - Applied Statistics, Research Questions, and Statistical Analysis

Document Details

Uploaded by DeservingAmericium3698

Clemson University

Summary

This PDF document appears to be an exam paper for a PSYC 3090 course, likely at the undergraduate level. It covers topics in applied statistics, research methodology, and the analysis of research questions, including independent and dependent variables, and examples. The document explores concepts in statistical analysis such as descriptive and inferential stats, discussing examples and applications.

Full Transcript

Course mainly focuses on applied statistics
- Stats is a tool that helps in testing research questions
- Examples
  o Does text messaging while driving create unsafe roads?
- Turn research question into a statistical question
  o Research question > statistical question
  o Example: have 40 participants each drive in a driving simulator while text messaging, and a different 40 participants each drive in the same simulator but with no cell phone
  o For each driver, record the number of times that the driver deviated from their lane
  o Compute the average lane deviation for participants in the:
    § Text messaging group
    § No cell phone group
  o Using a driving simulator lab
- Independent vs. Dependent Variables
  o The variable manipulated by a researcher is the IV. In our example: manipulated text messaging
  o The outcome, or the variable that is measured, is the DV. In our example: measured the number of lane deviations
  o IV > DV
  o Text messaging > lane deviations
- Confounding Variables
  o Want to be sure that no factors other than the one we manipulated could be influencing our dependent variable (i.e., lane deviations)
  o In our example, what about the: Age of the drivers? Course that the drivers went through (winding roads vs. straight roads)?
  o Ways to deal with confounding include holding variables constant, matching, and random assignment
    § Holding constant: similar age group/weather conditions
    § Matching: one 25 y/o white male from a mid SES in each group
    § Random assignment: could be the best solution
- Analysis of Statistical Question
  o Was the average number of lane deviations greater for those in the Text Messaging Group than for those in the No Cell Phone Group? (Statistical Question)
  o We apply statistical procedures to our data.
    (In this case, we use a two-sample t-test, which we will learn about in this class.)
- Conclusion
  o Statistical Conclusion > Research Conclusion
    § Perhaps we find that the average number of lane deviations is greater for those in the Text Messaging Group than for those in the No Cell Phone Group. (Statistical Conclusion)
    § Text messaging while driving tends to create unsafe roads. (Research Conclusion)
- Overall Process
  o Research Question > Statistical Question > Data Collection & Analysis > Statistical Conclusion > Research Conclusion

Population vs. Sample
- Population vs. Sample
  o Population: all of the observations about which an investigator wishes to draw conclusions
    § $\mu$ = mean
    § $\sigma^2$ = variance
  o Sample: a subset of the population
    § $\bar{Y}$ or $\bar{X}$ (y-bar/x-bar) = mean
    § $s^2$ = variance
- Parameter vs. Statistic
  o Parameter: index used to describe some characteristic of a population
  o Statistic: index used to describe some characteristic of a sample

Statistics
- Descriptive Statistics vs. Inferential Statistics
  o Descriptive Statistics: used to organize and summarize observations
  o Inferential Statistics: using statistics calculated from a sample to draw conclusions about the population
- Large, random samples yield statistics that better approximate the characteristics of the population than other types of samples (e.g., small, nonrandom samples)

Other Terminology
- Variable vs. Constant
- Discrete or Continuous
- Qualitative or Quantitative

Some Synonyms
- In experiments, independent variable & factor are often used interchangeably
- In nonexperiments, instead of independent variable, researchers may use the terms predictor & explanatory variable interchangeably
- Dependent variable = outcome, response, criterion

Levels of Measurement
- Measurement: assigning numbers to observations
- Nominal: labels that are mutually exclusive & collectively exhaustive (MECE). Each person, object, or event should be assigned to a unique label.
  o MECE: once a person is in one label, they can't be in another (mutually exclusive), and every person fits some label (collectively exhaustive).
    § Example: female, male: female = 1; male = 2
- Ordinal: labels still MECE, but also indicate order of magnitude (more or less of some characteristic)
  o Example: A, B, C, D, F, or Freshman, Sophomore, etc., or ranked 1st, 2nd, 3rd, etc.
- Interval: labels still MECE, with order of magnitude, but also equal intervals. Equal differences between numbers reflect equal magnitude differences between the corresponding classes
  o Consider the Fahrenheit scale: we can say that the difference between 80F & 40F equals the difference between 60F & 20F. But we cannot unambiguously say that 80F is twice as hot as 40F. The 0 point on the Fahrenheit scale is not a true zero point
- Ratio: still MECE, order of magnitude, & equal intervals. But there is also an absolute 0 point. The ratio between measurements has meaning.
  o Example: Kelvin scale, height, weight

Review:
- On a job search website, job seekers can narrow down their US-based job search by region of the country.
  o Example: South, Midwest, etc.
- Region of the country would be nominal data

Some Graphical Displays
- Frequency distributions
  o Work for qualitative/quantitative data
  o Can be grouped (each row is a range of numbers, e.g., 20-30, with a midpoint) or ungrouped (each row is assigned one number, e.g., 21, 22, 23, etc.)
- Histogram
  o Shape depends on the frequency table/data
- Stem-and-Leaf Plot
  o For quantitative data. Retains the original data
    § Leaves are the last significant digit
    § Stems are the remaining digits
    § To correctly interpret, check the key
    § Ex.
      6|8 means 68; 1|7 means 1.7
    § We don't lose any info
- Shapes of Frequency Distributions
  o J-shaped distribution
    § Majority of data piled up towards the end of the scale
  o Positively skewed distribution
    § Hump towards the left
    § More data at the lower end of the scale
  o Negatively skewed distribution
    § Hump towards the right
    § More data at the higher end of the scale
  o Rectangular distribution
  o Bimodal distribution
    § 2 major groups of data: one on the lower end and the other on the higher end
  o Bell-shaped distribution
    § Data piled in the middle

Central Tendency

Measures of Central Tendency
- Measures of Central Tendency: indices which represent the center value of a set of observations
  o Mode
  o Median
  o Mean

Mode (Mo)
- Score/data value that occurs most frequently
- Possible to have more than one mode (i.e., bimodal, multimodal)
  o 7,12,6,2,9,7,5,2: bimodal, 2 & 7
- Can also be used with qualitative data (i.e., nominal) like eye color, blood type, race

Median (Mdn)
- Point in the (ordered) distribution of scores that divides the data into two groups having equal frequency (50th percentile)
- If n is an odd number, Mdn is the middle-ranked value.
  If n is an even number, Mdn is the average of the two middle-ranked values
  o 2,3,5,8,9,11,12: 8 is the Mdn
- Sensitive only to the number of scores above & below it, not the values of the actual scores
- Tends to be used to represent the center for positively or negatively skewed data

Mean
- Arithmetic mean; sum of the scores divided by the total number of scores
  o 3,6,5,9: mean is 23/4 = 5.75
- Balance point of a distribution
  o Imagine a seesaw & the scores of a distribution spread along the board like bricks, with one brick per score
  o The fulcrum is placed at the mean so that the seesaw will be in perfect balance
- Has very good mathematical properties and is generally quite stable from sample to sample
- Sensitive to extreme scores/outliers

Score Transformations
- If we add a constant number to each score in a distribution, the distribution shifts by the amount of the constant
  o The mean will shift by the same amount
- If we multiply (or divide) each score in a distribution by a constant, the mean will be multiplied (or divided) by the same constant

Variability

Variability
- Do the scores in a distribution cluster around a central point or do the scores spread around it?
  o Measures of variability
    § Range
    § Semi-interquartile range
    § Variance
    § Standard deviation

Range
- Difference between the highest (maximum) and lowest (minimum) score in the distribution
  o 8,5,5,3,13,14,18,23: range is 23 - 3 = 20
  o 48,42,37,53,57: range is 57 - 37 = 20
- Crude measure of variability & it depends only on two scores
- A single outlier substantially influences this measure of variability

Semi-Interquartile Range
- Half of the range of the middle 50% of the scores
- Q = (Q3 - Q1)/2
- Less sensitive to extreme scores

Deviation Scores
- For two other measures of variability, we need deviation scores
- For a given distribution of scores, calculate the mean. Then subtract the mean from each score.
- Recall: Do the scores in a distribution cluster around a central point or do the scores spread around it?
- Now we have scores that show how far a given score is from a central point.
- Might seem reasonable to take the average of the deviation scores
  o Problem: what is the sum of the deviation scores? (It is always 0.)
- A way to work around this: square the deviation scores

Variance
- Variance of a population (p. 61)
  o Average Squared Deviation from the Mean; N = whole population
    § $\sigma_Y^2 = \frac{\sum (Y - \mu_Y)^2}{N}$
- Unbiased variance of a sample (p. 188); n = sample of the population
    § $s_Y^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1}$

Standard Deviation
- Standard deviation of a population (p. 63)
  o Square root of the average squared deviation from the mean
    § $\sigma_Y = \sqrt{\frac{\sum (Y - \mu_Y)^2}{N}}$
- Unbiased standard deviation of a sample (p. 188)
    § $s_Y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}}$
- Example:

    Y    $Y - \bar{Y}$    $(Y - \bar{Y})^2$
    1        -3               9
    5         1               1
    7         3               9
    3        -1               1
    $\bar{Y} = 4$

  o $s_Y^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1}$
  o $s_Y^2 = \frac{9 + 1 + 9 + 1}{4 - 1} = 6.667$
  o $s_Y = \sqrt{s_Y^2} = 2.582$
  o The standard deviation is the square root of the variance; the variance is the standard deviation squared

Score Transformations
- Adding a constant number to each score in a distribution does not affect any of the measures of variability
- If we multiply (or divide) each score in a distribution by a constant:
  o The standard deviation will be multiplied (or divided) by the absolute value of the constant
  o The variance will be multiplied (or divided) by the squared constant
    § Ex.

           Y       (Y)(4)    (Y)(-4)
           7        28        -28
           6        24        -24
           3        12        -12
           8        32        -32
          10        40        -40
           5        20        -20
    Mean:  6.5      26        -26
    s:     2.429     9.716      9.716
    s^2:   5.9      94.4       94.4

    § s: was multiplied by 4 (the absolute value of the constant)
    § s^2: was multiplied by 4^2 = 16

Statistical Reasoning
- For a distribution of scores:
  o Can the variance be a negative number?
  o Can the standard deviation be a negative number?
- Think of what the distribution of scores would look like if all the scores clustered around the mean vs. spread around the mean

Standard Scores & Normal Curve

Standard Scores (z scores)
- Your raw score on a test was 346. What does this mean?
- Raw scores are not very informative
- Need a frame of reference.
  Use the:
  o Mean
  o Standard deviation
- Now, we can state the position of a raw score relative to the mean in standard deviation units
- z score in a population: use the deviation from the population mean ($\mu_Y$)
  o $z = \frac{Y - \mu_Y}{\sigma_Y}$
- z score in a sample: use the deviation from the sample mean ($\bar{Y}$)
  o $z = \frac{Y - \bar{Y}}{s_Y}$
- Example:
  o $\bar{Y} = 49.44444$
  o $s_Y = 12.21793$
  o $z = \frac{Y - \bar{Y}}{s_Y}$
  o Y = 47
    § $z = \frac{47 - 49.44444}{12.21793} = -.200$
  o Y = 58
    § $z = \frac{58 - 49.44444}{12.21793} = .700$
- When we convert a set of raw scores to z scores:
  o The mean of the z scores will equal 0
  o The standard deviation of the z scores will equal 1
    § By necessity, what is the variance of the z scores? 1 (since 1^2 = 1)
  o The shape of the new distribution of scores will not differ from the shape of the original distribution of scores
- Sign of the z score tells us something useful (i.e., above (+) or below (-) the mean)
- The absolute value of z tells us the distance between the score and the mean in standard deviation units

Other Kinds of Standard Scores

    Mean    Std Dev    Test
     50       10       California Psych Inventory
    100       16       Stanford-Binet Intell Scale
    100       15       Wechsler Intell Scale

Standard Scores & Normal Curve
- Given a z score:
  o $z = \frac{Y - \bar{Y}}{s_Y}$
  o We can convert a set of scores to have any mean or standard deviation we would like
    § $Y = \bar{Y} + z(s_Y)$, where $\bar{Y}$ and $s_Y$ are the desired mean and standard deviation
- If a distribution of scores approximately follows a normal distribution, we can use z scores to find:
  o The % (proportion) of individuals above or below a score
  o The % (proportion) of individuals between a pair of scores
  o A score above or below which a certain % (proportion) of the total scores fall
- Normal curve refers to a family of curves
  o Bell-shaped
  o Symmetric
  o Unimodal
  o Continuous
  o Area under any normal curve sums to 1.0
  o We will focus on the standard normal curve

Example 1:
- It is known that cognitive ability scores follow a normal curve with $\mu = 100$ & $\sigma = 15$. What if we wanted to know the proportion of individuals with a cognitive ability score greater than 130?
  o Convert the raw score, 130, into a z score (z = 2)
    § $z = \frac{Y - \mu}{\sigma} = \frac{130 - 100}{15} = 2$
  o In Table A, the area beyond z = 2 is about .0228, so roughly 2.3% of individuals score above 130

Example 2:
- On the previous scale, it is known that SAT scores follow a normal distribution with $\mu = 500$ & $\sigma = 100$. We want to know what raw score (Y) a student would need in order to be in the top 15% of the SAT distribution
  o Partition the standard normal curve such that 15% of the distribution is to the right of a particular z & 85% is to the left of the same z
  o In Table A, look under the column labeled "AREA BEYOND z" to get as close as possible to 15% (.15)
  o It is about z = 1.04
  o Then, use Y = 500 + (1.04)(100) = 604
  o To be in the top 15%, you will need at least a 604

Example 3
- 3000 entering freshmen at a university are given an entrance exam in mathematics. Scores are normally distributed with $\bar{Y} = 100$ and s = 20. If the university decides to place all students scoring above 120 into honors math, how many students will be placed into the class?
  o Change the raw score to a z score: z = (120-100)/20 = 1.0
  o What proportion of scores fall above z = 1.0? (.1587)
  o Area under the curve is proportional to the frequency of scores. Compute:
    § (.1587)(3000) = 476 freshmen will be placed in honors math

Example 4
- Related to Example 3. Students who score lower than 85 will be placed into remedial math. How many scored between 85 and 120 and will not be placed into either remedial math or honors math?
  o Convert both raw scores to z scores:
    § z = (120-100)/20 = 1.0
    § z = (85-100)/20 = -0.75
  o Determine the area:
    § Area between z = -.75 and the mean: .2734
    § Area between the mean and z = 1.0: .3413
    § Total area: .2734 + .3413 = .6147
  o (.6147)(3000) = 1,844 students scored between 85 and 120 on the entrance exam

Example 5
- A standardized math achievement test for 6th graders is administered in SC. A score below 40 makes a student eligible for remedial math tutoring. It is known that $\mu = 50$ & $\sigma = 8$.
  Assuming that the distribution of scores in SC is approximately normal, what percentage of 6th graders will be eligible for tutoring?
  o Convert the raw score to a z score:
    § z = (40-50)/8 = -1.25
  o Remember that there are no negative z values in Table A; by symmetry, use the area beyond z = 1.25 (.1056)
  o 10.56% of 6th graders will be eligible for tutoring

Correlation

Correlation
- A measure of linear relationship between two variables
- We need pairs of scores for each participant to calculate the correlation coefficient (r)
- Values of r range between -1.0 and 1.0
- The correlation coefficient is attributed to Pearson
- Sign of r indicates whether the correlation is positive or negative
  o r = 0.65
  o r = -0.73
- Absolute value of r indicates the degree of linear relationship
  o r = 0.65 and r = -0.65 have the same degree of linear relationship. One is simply positive & the other is negative
  o Which represents a stronger degree of linear relationship?
    § r = 0.79 or r = -0.85 (r = -0.85 is stronger because it is closer to -1 than 0.79 is to +1)
- Cohen's conventions (ignore the sign)
  o r = .1 (small)
  o r = .3 (medium)
  o r = .5 (large)
- Predict, for each pair of variables, the direction of the relationship (positive, negative, none)
  o Height and shoe size: positive
  o Cholesterol and intelligence: none
  o Hours of physical activity per week and body fat %: negative
  o Number of fruits and vegetables eaten per day and risk of heart disease: negative
- We will calculate correlation using Equation 7.3
  o $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{SS_X \cdot SS_Y}}$
    § $SS_X$: sum of squares for X
    § $SS_Y$: sum of squares for Y
    § SCP: sum of cross products, $\sum (X - \bar{X})(Y - \bar{Y})$

Example:
- I administer a quiz (X) ten days before Exam 1. Then, I obtain Exam 1 scores (Y). Is there a linear relationship between your quiz scores and your scores on Exam 1?

    Quiz (X)   Exam 1 (Y)   $X - \bar{X}$   $Y - \bar{Y}$   $(X - \bar{X})^2$   $(Y - \bar{Y})^2$   $(X - \bar{X})(Y - \bar{Y})$
       29          47          -2.429         -24.286            5.898             589.796               58.980
       34          93           2.571          21.714            6.612             471.510               55.837
       27          49          -4.429         -22.286           19.612             496.653               98.694
       34          98           2.571          26.714            6.612             713.653               68.694
       33          83           1.571          11.714            2.469             137.224               18.408
       31          59          -0.429         -12.286            0.184             150.939                5.265
       32          70           0.571          -1.286            0.327               1.653               -0.735

    $\bar{X} = 31.429$, $\bar{Y} = 71.286$, $SS_X = 41.714$, $SS_Y = 2561.429$, $SCP = 305.143$

  o $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{SS_X \cdot SS_Y}}$
  o $r = \frac{305.143}{\sqrt{(41.714)(2561.429)}}$
  o $r = .93$: strong correlation
- There appears to be a strong positive correlation between quiz scores & exam scores (Pearson's r = .93). Specifically, as quiz scores increase (decrease), exam scores tend to increase (decrease).

Another Correlation Example
- A mental health counselor has a sample of 11 clients and measured Self-Esteem and Negative Affect
- The counselor wants to know whether there is a linear relationship between Self-Esteem and Negative Affect
  o Mean of Self-Esteem = 4.272727
  o Mean of Negative Affect = 2.545454
  o SS for Self-Esteem = 28.1818
  o SS for Negative Affect = 22.7273
  o SCP = -15.636364
  o r = -.618 (strong negative correlation: as one goes up, the other tends to go down)

Spearman's Rank-Order Correlation
- It is the regular correlation (Pearson's r), but applied to data that have been "properly" ranked
- X has to be ranked. Y has to be ranked
- Instead of pairs of scores, we have pairs of ranks for each participant
- Applying the usual r to the newly ranked data is called Spearman's r ($r_s$)
- Spearman developed a specialized formula to calculate the correlation when the data were in ranks
- If there are repeats (ties) in the data, add up the initial ranks given to the tied scores, divide by the number of tied scores, and give each tied score that new rank

     X    Initial rank    $R_X$
    11         1           1
    13         2           3
    15         5           5
    13         3           3
    13         4           3
    21         8           8
    19         6           6.5
    19         7           6.5

- 13 appears 3 times and is given the ranks 2, 3, 4. Add the ranks up: 2+3+4=9.
  Divide 9 by 3 (the number of times "13" appears) and give each the new rank (3)
- 19 appears twice and is given the ranks 6 and 7. Add them up: 6+7=13. Divide 13 by 2 (the number of times "19" appears) and give each the new rank (6.5)

Correlation Does not Prove Causation
- Ice cream sales and aggravated assaults are correlated
  o Seasons (a third variable could drive both)
- Satisfaction with one's job is correlated with job performance
  o Both could affect each other

More on Correlation
- Linear relationships only
- Affected by the range of talent (also range restriction)
  o As range restriction increases, this tends to lower (attenuate) correlation coefficients. The observed correlation tends to understate the actual relationship
- Discontinuities in the distributions tend to make the relationship appear stronger than it actually is
- The correlation between the same two variables will fluctuate from one sample to the next
- Because correlations can be computed between pairs of variables, it is conventional to display correlations in a matrix
  o Correlation matrix
- If there are p variables, there will be p(p-1)/2 unique correlations
- Correlations are often presented in a table along with means & standard deviations

Example
- Among a group of 10-yr-old children with intellectual disabilities, it is found that the correlation between IQ and reading achievement is 0.25. On a school-wide basis, the correlation is 0.50. What explanation do you suggest?
  o Range restriction: focusing only on 10 y/o children with intellectual disabilities narrows the range of IQ scores compared with the whole school, which lowers the correlation

(Simple Linear) Regression

Some Substantive Examples
- Using a sample of salespeople, can we predict monthly sales (Y) based on extraversion scores (X)?
- Based on a person's mood (+/-) (X), can we predict whether a person will behave altruistically (Y)?
- Does social dominance orientation (X) predict rejection of science (Y)?
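
Before moving on to prediction, the correlation calculations from the previous section can be checked numerically. The following is only a minimal sketch in plain Python, not part of the course materials; the function names `pearson_r` and `average_ranks` are made up for illustration. It computes r as SCP divided by the square root of SS_X times SS_Y, and ranks scores so that tied scores share the average of their initial ranks.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r computed as SCP / sqrt(SS_X * SS_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)   # sum of squares for X
    ss_y = sum((yi - mean_y) ** 2 for yi in y)   # sum of squares for Y
    scp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return scp / sqrt(ss_x * ss_y)

def average_ranks(scores):
    """Rank from lowest (1) to highest; tied scores get the mean of their initial ranks."""
    positions = {}
    for rank, value in enumerate(sorted(scores), start=1):
        positions.setdefault(value, []).append(rank)
    return [sum(positions[v]) / len(positions[v]) for v in scores]

# Quiz (X) and Exam 1 (Y) scores from the worked example
quiz = [29, 34, 27, 34, 33, 31, 32]
exam = [47, 93, 49, 98, 83, 59, 70]
print(round(pearson_r(quiz, exam), 2))  # 0.93

# X scores from the ranking example
x = [11, 13, 15, 13, 13, 21, 19, 19]
print(average_ranks(x))  # [1.0, 3.0, 5.0, 3.0, 3.0, 8.0, 6.5, 6.5]
```

Under the definition given in the notes, Spearman's r would then be `pearson_r(average_ranks(x), average_ranks(y))` for paired scores x and y.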

Prediction Using (Simple Linear) Regression
- One outcome (Y)
- One predictor (X)
- We want to predict Y using X
- We assume that Y and X are linearly related
- Find the "best-fitting" line (a regression line) using a regression equation
- What is the "best-fitting" line?

Notation
- X used to represent actual scores on X
- Y used to represent actual scores on Y
- Y' used to represent predicted scores on Y
  o Also referred to as fitted values
  o Each observation will have a predicted score
- $d_Y = (Y - Y')$ is a discrepancy (or residual)
  o Each observation will have a discrepancy

Least Squares Criterion
- When we sum all the squared discrepancies (squared residuals), we prefer the smallest possible sum
  o Sum-of-squares error (or sum-of-squares residual)
- The line that accomplishes this is the "best-fitting" line
- Fortunately, we do not need to iteratively try out different lines (each with a different y-intercept and slope) in an effort to find the "best-fitting" line

Regression Equation (raw score)
- Y' = bX + a
- We say that Y is "regressed on" X
  o $b = r \left( \frac{s_Y}{s_X} \right)$
  o $a = \bar{Y} - b\bar{X}$

Example with Raw Data

     X     Y
    19    41
    20    39
    18    38
    21    39
    22    41
    19    39
    17    38
    21    40

    $\bar{X} = 19.625$, $\bar{Y} = 39.375$, $s_X = 1.685$, $s_Y = 1.188$, $r = .651$

- $b = r \left( \frac{s_Y}{s_X} \right)$
  o $= .651 \left( \frac{1.188}{1.685} \right)$
  o $= .459$
- $a = \bar{Y} - b\bar{X}$
  o $= 39.375 - (.459)(19.625)$
  o $= 30.36$
- Y' = bX + a
  o Y' = .46X + 30.36

Making Predictions
- Using this equation, we can "plug in" values for X and calculate predicted values on Y. If a person had a score on X = 18, what is his/her predicted Y? What if a person had a score on X = 22?
  o Y' = .46(18) + 30.36 = 38.64
  o Y' = .46(22) + 30.36 = 40.48
- Simple linear regression is used in many areas of psychology
- A developmental psychologist might develop a regression equation that uses scores on a screening tool (X) to predict developmental delays (Y) among children with autism spectrum disorder
- A military psychologist might use spatial visualization (X) to predict task performance (Y) while operating a drone

Do Not Extrapolate
- When making predictions, be sure to use only values for X that are within the bounds of the original X scores
- In other words, use scores on X that are:
  o Greater than or equal to the minimum of X and
  o Less than or equal to the maximum of X
- Thus, if the X scores range from 3.5 to 52.1, which of the following would be extrapolating?
  o X = 47 (within the range, so not extrapolating)
  o X = 1.9 (below the minimum of 3.5, so extrapolating)

Standard Error of Estimate**
- How much do the actual Y values vary around the predicted Y scores (Y')? We want an index of the amount of error in our prediction
- Desirable if this index:
  o Equaled 0 when there was NO ERROR in prediction
  o Was a positive number if there was prediction error
- What if we just summed the discrepancies (residuals) & divided by n, i.e., took the mean of the discrepancies?
  o Problem: the sum equals 0
- Equation:
  o $s_{Y \cdot X} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}} = \sqrt{\frac{\sum d_Y^2}{n - 2}}$
- Notice how the numerator resembles a deviation score. Instead of actual score minus the mean, it is actual score minus the predicted score
  o Reminder: we will divide by (n-2)
- If it is precisely 0, what does this tell us about how the predicted scores on Y are related to the actual scores on Y?
  o There is no error in the prediction

Homoscedasticity
- The spread of the actual scores (Y) around the regression line (predicted scores, Y') is about the same across the line
- Sometimes referred to as the constant variance assumption b/c the residuals should have a similar variance across the values of the predicted scores and X values

Prediction Does not Prove Causation
- Just as correlation does not prove causation, prediction does not prove causation
  o Just b/c I can predict job performance (Y) from job satisfaction (X), this does not mean that satisfaction with my job caused me to perform my job better
  o It is quite possible that performing my job (Y) well led to reinforcers like a pay increase and a flexible schedule which, in turn, caused me to be more satisfied with my job (X)

Regression toward the Mean
- A variable that is extreme on its first measurement will tend to be closer to the center of the distribution when measured a second time
  o Health: In a screening program for hypertension, only individuals with high blood pressure are asked to return a second time. On average, the second measure will be less than the first
  o Genetics: If you are very tall (short), it is likely that your offspring will also be tall (short), but generally not as tall (short) as you.

Proportion of Explained Variance
- Coefficient of Determination: proportion of variance in Y that can be explained by (or is attributed to) differences on X
  o $r^2 = \frac{SS_{Y'}}{SS_Y}$
- If r = .4, then $r^2 = .16$.
- The stronger the correlation between Y and X, the greater will be the proportion of explained variance in Y due to X

Regression Equation (standard score form)
- Pairs of scores on X and Y
- Convert all X scores to z scores
- Convert all Y scores to z scores
  o What is the mean of the z scores for X?
    § 0
  o What is the mean of the z scores for Y?
    § 0
- Calculate a regression equation using these z scores
  o $z_{Y'} = r z_X$
  o The slope is the correlation between X and Y
  o What is the Y intercept?
    § For the standardized regression equation, the Y intercept is 0

Concluding Comments
- Simple linear regression: predict a quantitative outcome (Y) using a quantitative predictor (X)
  o Does time spent in dialectical behavior therapy (X) predict emotion regulation (Y)?
  o We would expect a positive slope
- Standard error of estimate measures how much variability there is in the residuals
- Heteroscedasticity > bad. Inflates the standard error of estimate
- Strength of relationship indexed by $r^2$
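
The regression steps from the raw-data example (slope, intercept, prediction without extrapolating, and standard error of estimate) can be tied together in a short script. This is a hedged sketch in plain Python rather than the course's own procedure: the function names are hypothetical, and the slope is computed as SCP / SS_X, which is algebraically the same as r(s_Y / s_X).

```python
from math import sqrt

def fit_line(x, y):
    """Return slope b and intercept a of the least-squares line Y' = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    scp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = scp / ss_x               # algebraically equal to r * (s_Y / s_X)
    a = mean_y - b * mean_x
    return b, a

def predict(x_new, b, a, x_min, x_max):
    """Predicted Y', refusing X values outside the observed range (no extrapolation)."""
    if not (x_min <= x_new <= x_max):
        raise ValueError("X is outside the observed range; do not extrapolate")
    return b * x_new + a

def standard_error_of_estimate(x, y, b, a):
    """Square root of the summed squared residuals divided by (n - 2)."""
    residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
    return sqrt(sum(d ** 2 for d in residuals) / (len(x) - 2))

# Raw data from the regression example
x = [19, 20, 18, 21, 22, 19, 17, 21]
y = [41, 39, 38, 39, 41, 39, 38, 40]

b, a = fit_line(x, y)
print(round(b, 3), round(a, 2))  # 0.459 30.36
print(predict(18, b, a, min(x), max(x)))
print(standard_error_of_estimate(x, y, b, a))
```

Because the script keeps the unrounded slope, its predictions differ slightly from the hand calculations, which use the rounded slope of .46 (for example, about 38.63 rather than 38.64 at X = 18).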
