Quantitative Research Methods in Political Science Lecture 3 PDF

Quantitative Research Methods in Political Science
Lecture 3: Descriptive Statistics and Measures of Central Tendency and Dispersion
Course Instructor: Michael E. Campbell
Course Number: PSCI 2702(A)
Date: 09/19/2024

Descriptive Statistics

Statistics are used to summarize information about a variable or variables quickly and effectively.
Two types of descriptive statistics:
Univariate statistics summarize or describe the distribution of a single variable.
Bivariate (or multivariate) statistics summarize or describe the relationship between two or more variables.

Proportions and Percentages

Used to standardize raw data and to compare parts of a whole or groups of different sizes.
Standardization: transforming the unit of measurement so that a value can be compared to other values on a common scale. For example:
Proportions are always on a scale of 1.00.
Percentages are always on a scale of 100.

Proportions

The formula for proportions is:
$p = \frac{f}{n}$
In this equation:
f is the number of cases in any one category
n is the number of cases in all categories

Percentages

The formula for percentages is:
$\% = \frac{f}{n} \times 100$
In this equation:
f is the number of cases in any one category
n is the number of cases in all categories

Proportions and Percentages Example

Most Popular Ways for Canadians to Celebrate St. Valentine’s Day

Ways to Celebrate                      Frequency   Proportion   Percentage
Going to a restaurant                  312         0.3995       39.95
Romantic evening at home               110         0.1408       14.08
Giving a gift/card/flowers/chocolate   92          0.1178       11.78
Going on a trip                        22          0.0282       2.82
Going out dancing                      7           0.0090       0.90
Other                                  74          0.0947       9.47
Don’t know                             164         0.2100       21.00

Proportions and Percentages Example Cont’d

Proportion of “Romantic Evening at Home”:
$p = \frac{110}{781} = 0.1408$
Therefore, “Out of the 781 people surveyed, the proportion who prefer celebrating St. Valentine’s Day with a romantic evening at home is 0.1408.”

Percentage of “Romantic Evening at Home”:
$\% = \frac{110}{781} \times 100 = 14.08$
Therefore, “Out of 781 people surveyed, approximately 14.08% of respondents prefer a romantic evening at home.”

Comparing Groups of Different Sizes Example

In the absence of a standardized statistic, it is very hard to compare relative group sizes because the groups themselves might differ in size.
For instance, do more males or females pursue studies in business, …

Comparing Groups of Different Sizes Example Cont’d

“Calculating percentages eliminates the difference in size of the two groups by standardizing both distributions on a base of 100” (Healey, Donoghue, and Prus 2023, 42).
For instance, now we can see that a higher percentage of males (25.57%) study business, management, and …

Guidelines on Use of Proportions and Percentages

1. If you have fewer than 20 cases, report frequencies.
With a small number of cases, percentages can change drastically with only minor changes in the data. The more observations there are, the smaller the impact of each additional observation on proportions and percentages.
2. Always report the number of observations (n) along with proportions and percentages.
This allows the reader to judge the adequacy of the sample size and helps to prevent the use of misleading statistics.
3. Proportions and percentages can be reported for variables at all three levels of measurement.

Ratios and Rates

Ratios allow us to “compare the categories of a variable in terms of relative frequency” (Healey, Donoghue, and Prus 2023, 45).
We do not standardize when we calculate ratios (or rates).

Ratios

Ratios “tell us exactly how much one category outnumbers (or is outnumbered) by the other” (Healey, Donoghue, and Prus 2023, 47).
The formula for a ratio is:
$\text{Ratio} = \frac{f_1}{f_2}$
In this equation:
f1 = the number of cases in the first category
f2 = the number of cases in the second category

Ratio Example

Q: How would you describe your feelings towards the Sponsorship Scandal?

Response     Frequency
Angry        3351
Not Angry    630

What is the ratio of Canadians who are angry about the Sponsorship Scandal to Canadians who are not angry?
$\text{Ratio} = \frac{3351}{630} = 5.32$
This means that “for every Canadian who is not angry about the Sponsorship Scandal, there are 5.32 Canadians who are angry about the scandal.”

Ratio Example Cont’d

Ratios are often multiplied by some power of 10 (to eliminate decimal points):
5.32 x 100 = 532
Therefore, “for every 100 Canadians who are not angry about the Sponsorship Scandal, there are 532 Canadians who are angry about the scandal.”
For greater clarity, the units of comparison are usually stated explicitly.
Based on units of one, the ratio of Canadians who are angry to Canadians who are not angry would be expressed as 5.32:1.
Based on hundreds, this would be expressed as 532:100.

Rates

The formula for rates is:
$\text{Rate} = \frac{f_{\text{actual}}}{f_{\text{possible}}}$
In this equation:
f actual is the number of actual occurrences of a phenomenon
f possible is the number of possible occurrences of the phenomenon
Rates are also multiplied by some power of 10 to eliminate decimal points.
A country’s death rate is a commonly used rate in research.
To determine the death rate, divide the number of deaths (f actual) in a country by the total population (f possible).

Rates Example

In 2019, there were 284 082 deaths in Canada (f actual).
The population of Canada was 37 590 000 (f possible).
The death rate for Canada in 2019 would be:
$\text{Rate} = \frac{284\,082}{37\,590\,000} = 0.0076$
This leaves us with 0.0076 (very small).

Rates Example Cont’d

To resolve this, multiply by 1000 (common with death rates):
$\text{Rate} = \frac{284\,082}{37\,590\,000} \times 1000 = 7.56$
Therefore, for every 1000 Canadians, there were 7.56 deaths in 2019.

Frequency Distributions

Frequency distributions are different from instruments.
Instruments are measurement tools.
A frequency distribution “is a table that summarizes the distribution of a variable’s values by reporting the number of cases contained in each category of the variable” (Healey, Donoghue, and Prus 2023, 49).
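The standardizing statistics covered so far, together with the frequency distribution just defined, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`proportion`, `percentage`, `ratio`, `rate`, `frequency_table`) are made up for this sketch and do not come from the lecture or the textbook, and the employment data at the bottom are an illustrative list of raw responses. The remaining numbers are the ones used in the examples above.

```python
from collections import Counter


def proportion(f, n):
    """Cases in one category (f) divided by cases in all categories (n)."""
    return f / n


def percentage(f, n):
    """Same comparison as a proportion, but on a base of 100."""
    return 100 * f / n


def ratio(f1, f2):
    """How many cases of the first category exist per case of the second."""
    return f1 / f2


def rate(f_actual, f_possible, per=1000):
    """Actual occurrences per `per` possible occurrences (hypothetical default: 1000)."""
    return f_actual / f_possible * per


def frequency_table(values):
    """Map each category to (frequency, percentage) for a list of raw values."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, percentage(f, n)) for category, f in counts.items()}


# Valentine's Day example: 110 of 781 respondents chose a romantic evening at home.
print(round(proportion(110, 781), 4))
print(round(percentage(110, 781), 2))

# Sponsorship Scandal example: 3351 angry versus 630 not angry.
print(round(ratio(3351, 630), 2))

# Death rate example: deaths per 1000 Canadians in 2019.
print(round(rate(284082, 37590000), 2))

# Illustrative raw data: 36 employed and 14 unemployed respondents.
responses = ["Employed"] * 36 + ["Unemployed"] * 14
for category, (f, pct) in frequency_table(responses).items():
    print(category, f, pct)
```

Keeping `n` as an explicit argument mirrors the guideline above that the number of observations should always be reported alongside proportions and percentages.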
Frequency distributions are a useful way of organizing and presenting data.
They are also one of the first steps in data analysis.

Frequency Distribution at Nominal Level

Nominal-Level Variable Frequency Table (Employment Status)

Employment Status   Frequency
Employed            36
Unemployed          14

Frequency Distribution at Nominal Level Cont’d

Here is a distribution table reporting the types of electoral system in a country.
Majoritarian systems are coded as 0.00
Proportional systems are coded as 1.00
Mixed systems are coded as 2.00
Other systems are coded as 3.00
See instrument on page 89-

Frequency Distribution at Nominal Level Cont’d

We see the same table with value labels now (it is the exact same information as before).
We see there are 40 observations (N) and 139 missing values.
The Frequency column tells us the number of observations for each category.
The Percent column tells us the percentage of cases per category, but it does not omit missing cases.
The Valid Percent column corrects for this.

Frequency Distribution at Ordinal Level

Ordinal-Level Variable Frequency Table (Satisfaction with Meal)

How Satisfied Were You with Your Meal?   Frequency
Very satisfied                           15
Somewhat satisfied                       25
Somewhat dissatisfied                    7
Very dissatisfied                        3

Frequency Distribution at Ordinal Level Cont’d

179 valid cases (0 missing).
Notice that the “Percent” and “Valid Percent” columns are now the same.
This is because there are no missing cases.
Public Sector Corrupt Exchanges are “Extremely Common” in 11 countries.
Public Sector Corrupt Exchanges are “Extremely Common” in 6.1% of countries.
Public Sector Corrupt Exchanges are “Extremely …

Frequency Distribution for Interval-Ratio Level

A lot of information is presented here…
45 valid cases.
No repeating values, so each case equals 2.2% of valid cases.
Using the cumulative percentage column, we can make statements like “In 20% of countries in our sample, voter turnout is 58.77% or less.”
But these types of …

Frequency Distribution at Interval-Ratio Level Cont’d

To make the distribution table more manageable, researchers sometimes collapse data into groups.
This is the same information as in the previous table.
Now, we have four categories.
2.2% of countries in the sample have “Very Low” voter turnout (below 25.9%).
11.1% of countries in the sample have “Low” voter turnout (between 26 and 50.9%).

Graphs and Charts

Provide a visual representation of data.
Graphs and charts are often used by researchers to present their data in ways that are less confusing than just presenting statistics.
These give us a general sense of the overall shape of the distribution.
These give us a general sense of the way the data are dispersed.

Pie Chart

Very simple and intuitive.
Used when there are few categories.
Rarely used in quantitative academic research.
[Pie chart: “Election Turnout”; slices labelled 2%, 11%, 36%, and 51%; legend: Very Low, Low, High, Very High]

Bar Chart

Categories of the variable run along the horizontal axis.
Frequencies (or percentages) run along the vertical axis.
Bar charts are used when there are four or five categories.
[Bar chart: “Voter Turnout”; bars labelled 2%, 11%, 36%, and 51%; vertical axis from 0% to 60%; categories: Very Low, Low, High, Very High]

Histogram

Most appropriate for continuous interval-ratio data.
Categories of scores are contiguous (the bars touch each other).
Here, rather than categories of voter turnout, we can use the original frequency distribution.
A line can be placed atop the bars to get a better sense of how the data are dispersed.

Measures of Central Tendency and Dispersion

Measures of Central Tendency allow us to describe data so we can identify the typical or average case in a distribution.
They are statistics that help us summarize data so we can describe the most common scores, the middle case, or the average of all cases combined.
MCT statistics “reduce huge arrays of data to a single, easily understood number” (Healey, Donoghue, and Prus 2023, 80).
Three Measures of Central Tendency:
1. Mode
2. Median
3. Mean
Measures of Dispersion give us some idea about the level of heterogeneity (or how much variety) there is in a distribution.

Measures of Central Tendency and Dispersion Cont’d

“For a full description of scores, measures of central tendency must be paired with measures of dispersion” (Healey, Donoghue, and Prus 2023, 80).
Measures of dispersion provide information about “the amount of variety, diversity, or heterogeneity within a distribution of scores” (Healey, Donoghue, and Prus 2023, 80).
The best Measures of Dispersion will:
1. use all the scores in a distribution, meaning the statistic will be computed using all the information that is available
2. describe the average or typical deviation of the scores and give us an idea of how far the scores are from one another or from the center of the distribution
3. increase in value as the distribution of scores becomes more diverse

Data Dispersion

Think about data dispersion in terms of the homogeneity or heterogeneity of data.
The less spread out the data, the less dispersed they are, meaning they are more homogeneous (See Essay Exam).
The more spread out the data, the more dispersed they are, meaning they are more heterogeneous (See Multiple-Choice Exam).
Please note that the average is the same in this example, despite differences in variability.

The Mode

The mode is the most frequently occurring value in a variable.
Example: you have a set of scores (8, 9, 9, 14, 17, 20). The mode is 9.
The mode represents the variable’s largest category.
It is the only measure of central tendency that can be used with nominal data.
The mode in the above example is “Proportional” electoral systems.

Mode Limitations

1. Some distributions have no mode at all (no repeating values). Other distributions have multiple modes (many repeating values).
2. With ordinal and interval-ratio data, the modal score “may not be central to the distribution as a whole” (Healey, Donoghue, and Prus 2023, 83).
This suggests that the most common score may not be “typical” in identifying the center of a distribution.
Although the mode is 93 in the above example, it does not accurately convey the distribution, because the vast majority of students scored below that.

Index of Qualitative Variation (IQV)

The IQV is the only measure of dispersion that can be used with nominal-level data.
The IQV is “the ratio of the amount of variation actually observed in a distribution of scores to the maximum variation that could exist in a distribution” (Healey, Donoghue, and Prus 2023, 83).
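A minimal sketch of the mode and the IQV in Python follows. Two things here are assumptions rather than material from the slides: the helper names `modes` and `iqv` are hypothetical, and the IQV is computed with the percentage-based form $\text{IQV} = \frac{k(100^2 - \sum \text{Pct}^2)}{100^2 (k - 1)}$, where k is the number of categories and Pct are the category percentages. The mode helper relies on `statistics.multimode`, which returns every value tied for most frequent, so it exposes the "no mode" and "multiple modes" limitations directly.

```python
from statistics import multimode


def modes(values):
    """Return all most-frequent values; several entries means no single mode."""
    return multimode(values)


def iqv(percentages):
    """Index of Qualitative Variation from category percentages (assumed to sum to 100)."""
    k = len(percentages)
    if k < 2:
        raise ValueError("IQV needs at least two categories")
    sum_of_squares = sum(p ** 2 for p in percentages)
    return k * (100 ** 2 - sum_of_squares) / (100 ** 2 * (k - 1))


# The lecture's mode example: 9 is the only repeated score.
print(modes([8, 9, 9, 14, 17, 20]))

# With no repeating values, every score is tied, so no score stands out as the mode.
print(modes([58, 72, 75]))

# All cases in one category: no variation.
print(iqv([100, 0]))

# Cases split evenly across two categories: maximum variation.
print(iqv([50, 50]))
```

Because the function takes percentages rather than raw counts, distributions from samples of different sizes can be compared on the same 0.00 to 1.00 scale.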
IQV ranges from 0.00 (no variation) to 1.00 (maximum variation).

IQV Cont’d

If everyone in Canada were Indigenous (or non-Indigenous), the IQV would be 0.00 (no variation).
If 50% of the population were Indigenous and 50% non-Indigenous, the IQV would be 1.00 (maximum variation).
Looking at the percentages in the table, we see that the Indigenous population has been increasing over time.

IQV Cont’d

The formula for the IQV is:
$\text{IQV} = \frac{k(100^2 - \sum \text{Pct}^2)}{100^2 (k - 1)}$
In this equation:
$k$ is the number of variable response categories
$\sum \text{Pct}^2$ is the sum of squared percentages of cases in the variable’s response categories

IQV Cont’d

IQV for 1996: 0.109
IQV for 2006: 0.144
IQV for 2016: 0.185
A higher IQV means more dispersion in the data.
As you see, the IQV increases over time.
Variation increases from 0.109 (10.9%) to 0.185 (18.5%).
The Indigenous population is increasing.

The Median

The median represents the exact center score in a distribution.
Exactly half the cases will fall below the median, and half the cases above.
Example: Median household income in Ottawa is $88 000.
This means that 50% of households make less than this, and 50% make more.

Median Cont’d

Let’s say you have data on exam scores from a class of 11 people.

Exam Score   Frequency
58           2
72           3
75           1
79           2
80           1
87           1
96           1
Total        11

To determine the median, you must order the cases from lowest to highest (or highest to lowest).

Median Cont’d

Now that all scores are in order:

Exam Score   Frequency
58           1
58           1
72           1
72           1
72           1
75           1
79           1
79           1
80           1
87           1
96           1
Total        11

With an odd number of cases, the median is the score of the middle case.
To identify the middle case, take the sample size, add 1, then divide by 2.
In this case it is (11+1)/2 = 6, so the median is the score of the 6th case.
In this example, the median is 75 (half the cases are below it and half are above it).
But this was an odd number of cases…
What happens if you have an even number?
Median Cont’d

If you have an even number of cases, the median is exactly halfway between the scores of the two middle cases.

Exam Score   Frequency
58           1
72           1
72           1
72           1
75           1
79           1
79           1
80           1
87           1
96           1
Total        10

To identify the middle cases, divide N by 2 (this gives you the first middle case): (10/2) = 5
Then, add 1 to that number (this gives you the second middle case): (5+1) = 6
Therefore, the center scores here are 75 and 79 (there are four cases above and four cases below them).
Add those numbers and divide by 2:
(75+79) / 2 = 77 (the median is 77)
The median can be used for interval level variables, but is preferred for …
Note: because the median requires scores to …

Range

The range (and interquartile range) are measures of dispersion generally used for ordinal level data.
Range: “defined as the difference between or interval between the highest score (H) and the lowest score (L) in a distribution” (Healey, Donoghue, and Prus 2023, 88).
The formula for the range is:
$R = H - L$
In this equation:
H is the highest score
L is the lowest score

Range Example #1

Age of Students in a Class

Age   Frequency
18    2
19    2
20    1
21    2
22    1
24    1
26    1

R = 26 - 18
R = 8
With this information we could say “The age range of students in the course spans 8 years, with the youngest being 18 years old and the oldest being 26.”
However, the range is calculated using only the highest and lowest scores and can be misleading if there is an outlier.

Range Example #2

Let’s say that there was a retiree in the course, with an age of 65.

Age of Students in a Classroom (with outlier)

Age   Frequency
18    2
19    2
20    1
21    2
22    1
33    1
65    1

R = 65 - 18
R = 47
As you can see, the range is now misleading, suggesting the data are more dispersed than they actually are.
This is the result of the outlier (65).
Therefore, the range is limited in what it can tell us about the distribution of data.

Interquartile Range

Quartiles divide the ordered observations of a variable into four segments.
The first quartile (Q1) is the first 25% of cases, the second is the second 25% of cases, and so on.
There are four quartiles in total, each representing 25% of cases in a variable.

Interquartile Range Cont’d

The formula for the interquartile range is:
$Q = Q_3 - Q_1$
In this equation:
Q3 is the value of the third quartile
Q1 is the value of the first quartile

Interquartile Range Cont’d

Q1 is the point below which 25% of the cases fall and above which 75% of the cases fall.
Q3 is the point below which 75% of the cases fall and above which 25% of the cases fall.
L represents the lowest score.
H represents the highest score.
Q is the range of the middle 50% of cases in a distribution.
It uses only the middle 50% of cases, so it is not affected by outliers.

Interquartile Range Cont’d

We know the range is 47.
Now, you need to calculate Q (requiring you to identify Q1 and Q3).

Age of Students in Class (with outlier)

Age   Frequency
18    1
18    1
19    1
19    1
20    1
21    1
21    1
22    1
33    1
65    1

To do this, find the median in the ordered data.
With n = 10, n/2 = 5 and (n/2) + 1 = 6, so the 5th and 6th cases are the center values.
Median = (20+21)/2 = 20.5

Interquartile Range Cont’d

You can now divide the data into 2 sets.
Remember, 50% of cases fall above the median and 50% below.

50% below the Median (20.5)   50% above the Median (20.5)
18                            21
18                            21
19                            22
19                            33
20                            65

Find the median for each set.
Q1 = the median for the lower half of the data
Q3 = the median for the upper half of the data
Therefore…
Q1 = 19
Q3 = 22

Interquartile Range Cont’d

$Q = Q_3 - Q_1 = 22 - 19 = 3$
Therefore, the interquartile range is 3, as opposed to a range of 47.
It is better than the range, because it is not influenced by outliers.
Therefore, “the interquartile range is a more useful measure of dispersion than the range because it is not affected by extreme scores, or outliers” (Healey, Donoghue, and Prus 2023, 90).

The Mean (or Arithmetic Average)

The formula for the mean is:
$\bar{X} = \frac{\sum X_i}{n}$
In this equation:
$\bar{X}$ = the sample mean
$\sum X_i$ = the sum of scores
$n$ = the number of cases in the sample
The sum of scores means that you add all of the scores in a distribution.
$X_i$ represents each individual score.

A Note on Notation

If we were calculating for a population and not a sample, the notation would change.
The formula for the mean would be:
$\mu = \frac{\sum X_i}{N}$
In this equation:
$\mu$ = the population mean
$\sum X_i$ = the sum of scores
$N$ = the number of cases in the population

Characteristics of the Mean

1. The first characteristic is that the sum of differences will always add up to 0.
The mean is the center of the distribution.
Unlike the median, it is the point around which all scores cancel out.
Symbolically, this is represented by:
$\sum (X_i - \bar{X}) = 0$
This means that if we take each score in a distribution, subtract the mean from it, and add all of those differences, the sum will always be 0.

Characteristics of the Mean Cont’d

Imagine you have 5 exam scores in a sample: 65, 73, 77, 85, 90. Their sum is 390, so the mean is 390 / 5 = 78.

Sum of Differences

Score ($X_i$)       Difference ($X_i - \bar{X}$)
65                  65 - 78 = -13
73                  73 - 78 = -5
77                  77 - 78 = -1
85                  85 - 78 = 7
90                  90 - 78 = 12
$\sum X_i$ = 390    $\sum (X_i - \bar{X})$ = 0

As you see, the total of the negative differences (-19) is exactly equal in size to the total of the positive differences (19).

Characteristics of the Mean Cont’d

2. The second characteristic is the least-squares principle.
It is expressed by the following statement:
$\sum (X_i - \bar{X})^2 = \text{minimum}$
This suggests that the mean is the point in a distribution around which the variation of scores (as indicated by squared differences) is minimized.
In other words, if we square the differences between the scores and the mean and add them together, the resultant sum will be less than the sum of squared differences between the scores and any other point in the distribution.

Least-Squares Principle Example

Five exam sample scores: 65, 73, 77, 85, 90
We know the differences between these scores and the mean are -13, -5, -1, 7, 12.
If we square these differences and add them, we get a total of 388.
If we use any number besides the mean, the result will always be higher.

Least-Squares Principle Cont’d

Score   Score - 78       Squared         Score - 77       Squared
65      65 - 78 = -13    (-13)² = 169    65 - 77 = -12    (-12)² = 144
73      73 - 78 = -5     (-5)² = 25      73 - 77 = -4     (-4)² = 16
77      77 - 78 = -1     (-1)² = 1       77 - 77 = 0      (0)² = 0
85      85 - 78 = 7      (7)² = 49       85 - 77 = 8      (8)² = 64
90      90 - 78 = 12     (12)² = 144     90 - 77 = 13     (13)² = 169
Sum     390 (scores)     388                              393

The least-squares principle tells us that the mean (or average) is closer to all of the scores than the other measures of central tendency.

Characteristics of the Mean Cont’d

3. The third characteristic is that every score in the distribution affects the mean.
Neither the mode nor the median is affected by every score.
The mean is calculated using all the information available to us.
This is both an advantage and a disadvantage (because outliers can make the mean misleading).
For example, take the same scores as in the previous example, but change the last score to 500.
Now we have 65, 73, 77, 85, 500.
The median is still 77.
But now the mean is 160 (800 / 5).
The mean will always be pulled in the direction of outliers.

Symmetrical Distribution (Unskewed)

The median and mean will only have the same value when the data are symmetrical.

Positive and Negative Skews

A Positive Skew: when a distribution has some extremely high scores (high value outliers), the mean will always have a greater value than the median.
A Negative Skew: when a distribution has extremely low scores (low value outliers), the mean is lower in value than the median.
“A quick comparison of the median and the mean always tells you if a distribution is skewed and the direction of the skew” (Healey, Donoghue, and Prus 2023, 97).

A Note on Mean and Skew

If you have data with a positive or negative skew, the median is a better measure of central tendency to use.
If the distribution is symmetrical, then the mean is the most appropriate measure of central tendency.
Computing the mean requires addition and division, therefore it should only be used with interval-ratio variables.

Variance and Standard Deviation

Unlike the range and interquartile range, variance and standard deviation use all of the scores in a distribution.
We must identify the distance between each score in a distribution and the mean.
These distances are known as deviations (and deviations increase in size as the data become more heterogeneous).
If the scores are more homogeneous, they will be clustered around the mean and the deviations will be smaller.
[Figure labels: Deviations smaller; Deviations larger]

Using Deviations to Calculate Variance and Standard Deviation

We can use deviations of scores to calculate useful statistics.
But we know the sum of deviations is always 0.
So, we need to square the deviations.
However, the higher the number of cases, the higher the sum of squared deviations.

Scores ($X_i$)      Deviations ($X_i - \bar{X}$)   Deviations Squared ($(X_i - \bar{X})^2$)
65                  65 - 78 = -13                  (-13)² = 169
73                  73 - 78 = -5                   (-5)² = 25
77                  77 - 78 = -1                   (-1)² = 1
85                  85 - 78 = 7                    (7)² = 49
90                  90 - 78 = 12                   (12)² = 144
$\sum X_i$ = 390                                   $\sum (X_i - \bar{X})^2$ = 388

Variance

Variance takes the sum of squared deviations and divides it by the number of cases, thereby standardizing distributions of different sizes.
Formula for variance:
$s^2 = \frac{\sum (X_i - \bar{X})^2}{n}$
In this equation:
$X_i$ is the score
$\bar{X}$ is the sample mean
$n$ is the number of cases in a sample

Standard Deviation

To compute the standard deviation, you take the square root of the variance.
The formula for standard deviation is:
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$
In this equation:
$X_i$ is the score
$\bar{X}$ is the sample mean
$n$ is the number of cases in a sample

Standard Deviation Cont’d

Our sum of squares was 388, with 5 cases in the sample. Therefore…
$s = \sqrt{\frac{388}{5}} = \sqrt{77.6} = 8.81$
The standard deviation is 8.81.

Interpreting the Standard Deviation

Standard deviation is a very important statistic (required for understanding the Normal Curve).
It is also an index of variability.
A larger standard deviation represents more dispersion; a smaller standard deviation represents less dispersion.
Its lowest possible value is 0.00 (no variation in the data).

Interpreting Standard Deviation Example

Data represent daily temperatures for Calgary, Alberta, and Gander, Newfoundland and Labrador, for the month of January.

             Calgary   Gander
$\bar{X}$    -7.1      -7.1
$s$          4.5       1.8

The mean for each city is -7.1 Celsius.
But the standard deviation is higher in Calgary (why?)
Because there is a greater level of variation in day-to-day temperature in Calgary than in Gander.
Calgary might have had some hotter and colder days, while Gander has …

Interpreting Standard Deviation Example Cont’d

Think back to the exam type example.
Here, the means are the same.
But the standard deviation would be larger for the Multiple-Choice Exam, because the data are more dispersed (i.e., more people got scores that were higher or lower on the exam).

Selecting Appropriate Measures of Central Tendency and Dispersion

In general, select the MCT and MD based on level of measurement.
But keep in mind how the data are dispersed.
For example, the mean and standard deviation are best for interval data, unless outliers exist.
In such cases, the median and interquartile range will give you a more accurate description of the data.
For this reason, researchers usually present more than one MCT and MD.
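Since researchers usually report more than one measure, the calculations from this lecture can be gathered into one summary, sketched below in Python. The function name `describe` and the dictionary keys are hypothetical, chosen only for this illustration. Two choices follow the slides rather than common software defaults, and are assumptions of this sketch: the variance divides by n (so `statistics.pvariance` and `statistics.pstdev` are used), and the quartiles are found by taking the median of the lower and upper halves of the ordered data.

```python
from statistics import mean, median, pstdev, pvariance


def describe(scores):
    """Summarize interval-ratio scores with several measures at once."""
    data = sorted(scores)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two scores")
    lower_half = data[: n // 2]        # cases below the median
    upper_half = data[(n + 1) // 2:]   # cases above the median (middle case excluded when n is odd)
    q1 = median(lower_half)
    q3 = median(upper_half)
    return {
        "mean": mean(data),
        "median": median(data),
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": pvariance(data),   # sum of squared deviations divided by n
        "std_dev": pstdev(data),       # square root of that variance
    }


# Exam scores from the variance and standard deviation example.
exam = describe([65, 73, 77, 85, 90])
print(exam["mean"], exam["variance"], round(exam["std_dev"], 2))

# Ages with the retiree outlier: the range is inflated, the IQR is not.
ages = describe([18, 18, 19, 19, 20, 21, 21, 22, 33, 65])
print(ages["range"], ages["median"], ages["q1"], ages["q3"], ages["iqr"])

# Replacing the top score with 500 moves the mean but not the median.
outlier = describe([65, 73, 77, 85, 500])
print(outlier["mean"], outlier["median"])
```

Other quartile conventions exist, so statistical packages may report slightly different Q1 and Q3 values than the halves method used here.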
