Introduction to Data Analysis: Research Knowledge Base

Summary

This document provides an introduction to data analysis. It covers foundational topics including conclusion validity, data preparation, and descriptive statistics, and it sets the stage for the inferential analysis of the major research designs covered in the following chapter. The text aims to build statistical literacy rather than teach statistics in depth.

Full Transcript

Introduction to data analysis: Research Knowledge Base William Trochim Foundations of Data Analysis If there ever was something called a “research hurdle,” then congratulations on having crossed over! By the time you get to the analysis of your data, most of the really difficult work has been done. It’s much harder to come up with and define a research problem; obtain IRB approval; develop and implement a sampling plan; conceptualize, test, and operationalize your measures; develop a design; and, collect a clean and complete data set. If you have done this work well, then the analysis of the data should be relatively straightforward. The main idea to keep in mind is that the process of data analysis has important implications for a key validity type—conclusion validity. Conclusion validity refers to the extent to which conclusions or inferences regarding relationships between the major variables in your research (e.g., the treatment and the outcome) are warranted. The next thing to note is that in most social science research, data analysis involves three major steps, performed roughly in this order: Foundations of Data Analysis 1. Data preparation involves logging the data in; making a codebook; entering the data into the computer; checking the data for accuracy; transforming the data; and, developing and documenting a database that integrates all of your measures. 2. Descriptive statistics describe the basic features of the data in a study. They provide meaningful summaries about the sample so that potential patterns might emerge from the data. Together with graphical analysis, they form the basis of virtually every form of quantitative analysis. With descriptive statistics, you are simply describing what the data show. Following the discussion of conclusion validity, you will learn about the basics of descriptive analysis in the rest of the chapter. Often, the descriptive statistics produced early on are voluminous because we first need to examine each variable individually. We need to know what the distributions of numbers look like, whether we have issues to deal with like extremely large or small values (“outliers”), and whether we have problems with missing data. The first step is usually a descriptive summary of characteristics of your sample. Then you move on to describing the basic characteristics of your study variables (the measures). You carefully select and organize these statistics into summary tables and graphs that show only the most relevant or important information. This is especially critical so that you don’t “miss the forest for the trees.” If you present too much detail, the reader may not be able to follow the central line of the results. More extensive analysis details are appropriately relegated to appendices—reserving only the most critical analysis summaries for the body of the report itself. Foundations of Data Analysis 3. Inferential statistical analysis tests your specific research hypotheses. In descriptive and some relational studies, you may find that simple descriptive summaries like means, standard deviations, and correlations provide you with all of the information you need to answer your research question. In experimental and quasi-experimental designs, you will need to use more complex methods to determine whether the program or treatment has a statistically detectable effect. We save this type of statistical analysis for the next chapter—Chapter 12, “Inferential Analysis.” We’ll remind you right now that this is not a statistics text. 
We’ll cover lots of statistics, some elementary and some advanced, but we are not trying to teach you statistics here. Instead, we are trying to get you to think about data analysis and how it fits into the broader context of your research. Another goal is to help you achieve some measure of “statistical literacy,” so that you understand the data analyses you need to think about, to set the stage for developing your own proposal. We’ll begin this chapter by discussing conclusion validity. After all, the point of any study is to reach a valid conclusion. This will give you an understanding of some of the key principles involved in data analysis. Then we’ll cover the often-overlooked issue of data preparation. This includes all of the steps involved in cleaning and organizing the data for analysis. We then introduce the basic descriptive statistics and consider some general analysis issues. This sets the stage for consideration of the statistical analysis of the major research designs in Chapter 12. Conclusion Validity Conclusion validity is the degree to which conclusions you reach about relationships in your data are reasonable. The emphasis here is on the term relationships. The definition suggests that we are most concerned about conclusion validity when we are looking at how two or more variables are associated with each other. It should be noted that conclusion validity is also relevant in qualitative research. For example, in an observational field study of homeless adolescents, a researcher might, based on field notes, see a pattern that suggests that teenagers on the street who use drugs are more likely to be involved in more complex social networks and to interact with a more varied group of people than non drug users. The relationship in this case would be between drug use and social network complexity. Even though this conclusion or inference may be based entirely on qualitative (observational or impressionistic) data, the conclusion validity of that relationship can still be assessed—that is, whether it is reasonable to conclude that there is a relationship between the two variables. Conclusion Validity Similarly, in quantitative research, if you’re doing a study that looks at the relationship between socioeconomic status (SES) and attitudes about capital punishment, you eventually want to reach some conclusion. Based on your data, you might conclude that there is a positive relationship— that persons with higher SES tend to be more in favor of capital punishment, whereas those with lower SES tend to be more opposed. Conclusion validity in this case is the degree to which that conclusion or inference is credible or believable. Conclusion validity is relevant whenever you are looking at relationships between variables, including cause-and-effect relationships. Since causal relationships are the purview of internal validity (see Chapter 8), we have a potential confusion that we should clear up: How, then, do you distinguish conclusion validity from internal validity? Remember from the discussion of internal validity in Chapter 8 that in order to have a strong degree of internal validity in a causal study, three attributes must be present: covariation, temporal precedence of the presumed cause occurring prior to the presumed effect, and the absence of plausible alternative explanations. The first of these, covariation, is the degree to which the cause and effect are related. 
So when you are trying to establish internal validity, you first need to establish that the cause-effect relationship has conclusion validity. If our study is not concerned with cause-effect relationships, internal validity is irrelevant but conclusion validity still needs to be addressed. Therefore, conclusion validity is only concerned with whether or not a relationship exists; internal validity goes beyond that and assesses whether one variable in the relationship can be said to cause the other. So, in a sense, conclusion validity is needed before we can establish internal validity (but not the other way around). Conclusion Validity For instance, in a program evaluation, you might conclude that there is a positive relationship between your educational program and achievement test scores because students in the program get higher scores and students who are not in the program get lower ones. Conclusion validity in this case is concerned with how reasonable your conclusion is. However, it is possible to conclude that, while a relationship exists between the program and outcome, the program itself didn’t necessarily cause the outcome. Perhaps some other factor, and not the program, was responsible for the outcome in this study. The observed differences in the outcome could be due to the fact that the program group was smarter than the comparison group to begin with. Observed posttest differences between these groups could be due to this initial difference and not due to the result of your program. This issue—the possibility that some factor other than your program caused the outcome—is what internal validity is all about. It is possible that in a study you not only can conclude that your program and outcome are related (conclusion validity exists) but also that the outcome may have been caused by some factor other than the program (you don’t have internal validity). Conclusion Validity One underappreciated aspect of conclusion validity has to do with the context of the study. When we think about conclusion validity, we need to think about the context in which the research was carried out. Consider an example of two studies, one reporting a statistically significant and large effect and the other with a statistically significant but small effect. On the surface, the larger effect might seem to be more important. However, some relatively small effects can be very important, and some relatively big effects much less impressive in the context of real life. For example, a small effect on mortality (that is, life and death) would be more valuable than a large effect on something like a taste comparison of two kinds of frozen pizza (though we should not underestimate the importance of pizza quality, especially to its fanatics!). As you can see, conclusion validity is not purely a statistical issue; it is a matter of proper interpretation of the evidence a study produces. Data analysis is much more than just drawing conclusions based simply on probability values associated with the null hypothesis (Cumming, 2012; Kline, 2013). We are now seeing a broader and more complete view of analysis in a context that includes power, measurement precision, effect sizes, confidence intervals, replication, and practical and clinical significance—all of which should enhance conclusion validity. We will have more to say about what “significance” means in Chapter 12. 
Threats to Conclusion Validity Threats to conclusion validity are any factors that can lead you to reach an incorrect conclusion about a relationship in your observations. You can essentially make two kinds of errors when talking about relationships: You conclude that there is a relationship when in fact there is not. (You’re seeing things that aren’t there!) This error is known as a Type I Error. This type of error is also referred to as a “false alarm” or a “false positive.” You conclude that there is no relationship when in fact there is. (You missed the relationship or didn’t see it.) This error is known as a Type II Error. This kind of error is also thought of as a “miss” or “false negative.” Threats to Conclusion Validity We’ll classify the specific threats by the type of error with which they are associated. Type I Error: Finding a Relationship When There Is Not One (or Seeing Things That Aren’t There) Level of Statistical Significance: In statistical analysis, you attempt to determine the probability of whether your finding is either a real one or a chance event. You then compare this probability to a set criterion called the alpha level and decide whether to accept the statistical result as evidence that there is a relationship. In the social sciences, researchers conventionally use the rather arbitrary value, known as the.05 level of significance, as the criterion to decide whether their result is credible or could be considered a fluke. Essentially, the value.05 means that the result you got could be expected to occur by chance at least five times out of every 100 times you ran the statistical analysis, if the null hypothesis is in reality true. Threats to Conclusion Validity In other words, if you meet this criterion, it means that the probability that the conclusion about a significant result is false (Type I error) is at most 5 percent. As you can imagine, if you were to choose a bigger criterion as your alpha level, you would then increase your chances of committing a Type I error. For example, a.10 level of significance means that the result you got could be expected to occur by chance at least ten times out of every 100 times you ran the statistical analysis, when the null hypothesis is actually true. So, as compared to a 5 percent level of significance, a 10 percent level of significance increases the probability that the conclusion about a significant result is false. Therefore, because it is up to your discretion, it is important to choose the significance level carefully—as far as reducing the potential for making a Type I error goes, the lower the alpha level, the better it is. Threats to Conclusion Validity Fishing and Error Rate Problem: In anything but the most trivial research study, the researcher spends a considerable amount of time analyzing the data for relationships. Of course, it’s important to conduct a thorough analysis, but most people are well aware of the fact that if you torture data long enough, you can usually turn up results that support or corroborate your hypotheses. In more everyday terms, you are “fishing” for a specific result by analyzing the data repeatedly under slightly differing conditions or assumptions. Let’s think about why this is a problem. A major assumption that underlies most statistical analyses is that each analysis is independent of the other. However, that is usually not true when you conduct multiple analyses of the same data in the same study. 
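To make the alpha-level idea above concrete, here is a minimal simulation sketch in Python (not part of the original text; the group sizes and the number of simulated studies are arbitrary choices). Both groups are drawn from the same population, so the null hypothesis is true and any "significant" t-test result is a false alarm; the false-alarm rate should hover near the chosen alpha of .05.

```python
# Minimal sketch (not from the chapter): simulating the Type I error rate.
# With alpha = .05 and a true null hypothesis (no group difference),
# roughly 5% of tests should come out "significant" by chance alone.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.05
n_sims, n_per_group = 10_000, 30

false_alarms = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_group)   # both groups drawn from the
    b = rng.normal(0, 1, n_per_group)   # same population: the null is true
    _, p = ttest_ind(a, b)
    if p < alpha:
        false_alarms += 1

print(false_alarms / n_sims)  # ~0.05, the nominal Type I error rate
```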
For instance, let’s say you conduct 20 statistical tests and for each one you use the .05 level criterion for deciding whether you are observing a relationship. For each test, the odds are 5 out of 100 that you will see a relationship even if there is not one there. Odds of 5 out of 100 are equal to the fraction 5/100 which is also equal to 1 out of 20. Now, in this study, you conduct 20 separate analyses. Let’s say that you find that of the twenty results, only one is statistically significant at the .05 level. Does that mean you have found a real relationship? If you had only done the one analysis, you might conclude that you found a relationship in that result. However, if you did 20 analyses, you would expect to find one of them significant by chance alone, even if no real relationship exists in the data. This threat to conclusion validity is called the fishing and the error rate problem. Threats to Conclusion Validity The basic problem is that, like a fisherman, you repeatedly try to catch something—each analysis is like another cast into the water, and you keep going until you “catch” one. In fishing, it doesn’t matter how many times you try to catch one (depending on your patience). In statistical analysis, it does matter because if you report just the one you caught, we would not know how many other analyses were run in order to catch that one. Actually, maybe this is also true for fishing. If you go out and catch one fish, the degree to which we think the fishing is “good” depends on whether you got that fish with the first cast or had to cast all day long to get it! Instead, when you conduct multiple analyses, you should adjust the error rate (the significance level or alpha level) to reflect the number of analyses you are doing, and this should be planned from the time you develop your hypothesis. The bottom line is that you are more likely to see a relationship when there isn’t one if you keep reanalyzing your data and don’t take your fishing into account when drawing your conclusions. Threats to Conclusion Validity Type II Error: Finding No Relationship When There Is One (or, Missing the Needle in the Haystack) Small Effect Size: When you’re looking for the needle in the haystack, you essentially have two basic problems: the tiny needle and too much hay. You can think of this as a signal-to-noise ratio problem. What you are observing in research is composed of two major components: the signal, or the relationship you are trying to see; and the noise, or all of the factors that interfere with what you are looking at. This ratio of the signal to the noise (or needle to haystack) in your research is often called the effect size. Threats to Conclusion Validity Sources of Noise: There are several important sources of noise, each of which can be considered a threat to conclusion validity. One important threat is low reliability of measures (see the section “Reliability” in Chapter 5, “Introduction to Measurement”). This can be caused by many factors, including poor question wording, bad instrument design or layout, illegibility of field notes, and so on. In studies where you are evaluating a program, you can introduce noise through poor reliability of treatment implementation. If the program doesn’t follow the prescribed procedures or is inconsistently carried out, it will be harder to see relationships between the program and other factors like the outcomes. Noise caused by random irrelevancies in the setting can also obscure your ability to see a relationship.
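Returning to the fishing example above, the error-rate arithmetic can be sketched directly (this calculation is not from the text). The Bonferroni adjustment shown is one common way to "adjust the error rate to reflect the number of analyses"; the chapter does not prescribe a specific method, and the arithmetic assumes the 20 tests are independent.

```python
# Minimal sketch (not from the chapter): why "fishing" inflates error rates,
# and one common fix (the Bonferroni adjustment), assuming independent tests.
alpha = 0.05
n_tests = 20

# Chance of at least one false positive if all 20 null hypotheses are true:
p_any_false_alarm = 1 - (1 - alpha) ** n_tests
print(round(p_any_false_alarm, 3))   # ~0.642, far above the nominal .05

# Bonferroni: test each comparison at alpha / n_tests instead.
adjusted_alpha = alpha / n_tests
print(adjusted_alpha)                # 0.0025
print(round(1 - (1 - adjusted_alpha) ** n_tests, 3))  # ~0.049, back near .05
```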
For example, in a classroom context, the traffic outside the room, disturbances in the hallway, and countless other irrelevant events can distract the researcher or the participants. The types of people you have in your study can also make it harder to see relationships. The threat here is due to the random heterogeneity of respondents. If you have a diverse group of respondents, group members are likely to vary more widely on your measures or observations. Some of their variability may be related to the phenomenon you are looking at, but at least part of it is likely to constitute individual differences that are irrelevant to the relationship you observe. All of these threats add variability into the research context and contribute to the noise relative to the signal of the relationship you are looking for. Threats to Conclusion Validity Source of Weak Signal: Noise is only one part of the problem. You also have to consider the issue of the signal—the true strength of the relationship. A low-strength intervention could be a potential cause for a weak signal. For example, suppose you want to test the effectiveness of an after-school tutoring program on student test scores. One way to attain a strong signal, and thereby a larger effect size, is to design the intervention research with the strongest possible “dose” of the treatment. Thus, in this particular case, you would want to ensure that teachers delivering the after- school program are well trained and spend a significant amount of time delivering it. The main idea here is that if the intervention is effective, then a stronger dose will make it easier for us to detect the effect. That is, if the program is effective, then a higher dose of the intervention (two hours of tutoring) will create a bigger contrast in the treatment and control group outcomes compared to a weaker dose of the program (one hour of tutoring). Problems That Can Lead to Either Conclusion Error Every analysis is based on a variety of assumptions about the nature of the data, the procedures you use to conduct the analysis, and the match between these two. If you are not sensitive to the assumptions behind your analysis, you are likely to draw incorrect conclusions about relationships. In quantitative research, this threat is referred to as violated assumptions of statistical tests. For instance, many statistical analyses are based on the assumption that the data are distributed normally—that the population from which data are drawn would be distributed according to a normal or bell- shaped curve. If that assumption is not true for your data and you use that statistical test, you are likely to get an incorrect estimate of the true relationship. It’s not always possible to predict what type of error you might make—seeing a relationship that isn’t there or missing one that is. Similarly, if your analysis assumes random selection and/or random assignment, but if that assumption is not true, then you can end up increasing your likelihood of making Type I or Type II errors Problems That Can Lead to Either Conclusion Error Similar problems can occur in qualitative research as well. There are assumptions, some of which you may not even realize, behind all qualitative methods. For instance, in interview situations you might assume that the respondents are free to say anything they wish. 
If that is not true—if the respondent is under covert pressure from supervisors to respond in a certain way—you may erroneously see relationships in the responses that aren’t real and/or miss ones that are. The threats discussed in this section illustrate some of the major difficulties and traps that are involved in one of the most basic areas of research—deciding whether there is a relationship in your data or observations. So, how do you attempt to deal with these threats? The following section details a number of strategies for improving conclusion validity by minimizing or eliminating these threats. Improving Conclusion Validity So let’s say you have a potential problem ensuring that you reach credible conclusions about relationships in your data. What can you do about it? In general, minimizing the various threats that increase the likelihood of making Type I and Type II errors (discussed in the previous section) can help improve the overall conclusion validity of your research. One thing that can strengthen conclusion validity relates to improving statistical power. The concept of statistical power is central to conclusion validity and is related to both Type I and Type II errors—but more directly to Type II error. Statistical power is technically defined as the probability that you will conclude there is a relationship when in fact there is one. In other words, power is the odds of correctly finding the needle in the haystack. Like any probability, statistical power can be described as a number between 0 and 1. For instance, if we have statistical power of.8, then it means that the odds are 80 out of 100 that we will detect a relationship when it is really there. Or, it’s the same as saying that the chances are 80 out of 100 that we will find the needle that’s in the haystack. We want statistical power to be as high as possible. Power will be lower in our study if there is either more noise (a bigger haystack), a weaker signal (a smaller needle) or both. So, improving statistical power in your study usually involves important trade-offs and additional costs. Improving Conclusion Validity The rule of thumb in social research is that you want statistical power to be at least.8 in value. Several factors interact to affect power. Here are some general guidelines you can follow in designing your study that will help improve statistical power and thus the conclusion validity of your study. Increase the sample size: One thing you can usually do is collect more information— use a larger sample size. Of course, you have to weigh the gain in power against the time and expense of having more participants or gathering more data. There are now dedicated power analysis programs as well as an array of web-based calculators to enable you to determine a specific sample size for your analysis. These estimators ask you to identify your design, input your alpha level and the smallest effect that would be important to be able to see (sometimes referred to as the “minimally important difference”), and they provide you with a sample size estimate for a given level of power. Improving Conclusion Validity Increase the level of significance: If you were to increase your risk of making a Type I error— increase the chance that you will find a relationship when it’s not there—you would be improving statistical power of your study. In practical terms, you can do that statistically by raising the alpha level or level of significance. 
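The link between sample size and statistical power described above can be illustrated with a small simulation (a sketch, not from the text). The assumed effect size of half a standard deviation, the candidate sample sizes, and the number of simulated studies are illustrative assumptions, not values given in the chapter; the point is simply that power climbs toward the .8 rule of thumb as the sample grows.

```python
# Minimal sketch (not from the chapter): estimating statistical power by
# simulation for a two-group comparison. The effect size (0.5 SD) and the
# sample sizes below are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def estimated_power(n_per_group, effect_size=0.5, alpha=0.05, n_sims=5_000):
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)  # a real difference exists
        _, p = ttest_ind(treated, control)
        hits += p < alpha
    return hits / n_sims  # proportion of simulated studies that detect the effect

for n in (20, 50, 64, 100):
    print(n, round(estimated_power(n), 2))  # power rises with n; roughly .8 near n = 64
```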
For instance, instead of using a.05 significance level, you might use.10 as your cutoff point. However, as you probably realize, this represents a trade- off. Because increasing the level of significance also makes it more likely for a Type I error to occur (which negatively affects conclusion validity), we recommend that you first try other steps to improve statistical power. Improving Conclusion Validity Increase the effect size: Because the effect size is a ratio of the signal of the relationship to the noise in the context, there are two broad strategies here. To raise the signal, you can increase the salience of the relationship itself. This is especially true in experimental studies where you are looking at the effects of a program or treatment. If you increase the dosage of the program (for example, increase the hours spent in training or the number of training sessions), it should be easier to see an effect. The other option is to decrease the noise (or, put another way, increase reliability). In general, you can improve reliability by doing a better job of constructing measurement instruments, by increasing the number of questions on a scale, or by reducing situational distractions in the measurement context. When you improve reliability, you reduce noise, which increases your statistical power and improves conclusion validity. Similarly, you can also reduce noise by ensuring good implementation. You accomplish this by training program operators and standardizing the protocols for administering the program and measuring the results. Data Preparation Now that you understand the basic concept of conclusion validity, it’s time to discuss how we actually carry out data analysis. The first step in this process is data preparation. Data preparation involves acquiring or collecting the data; checking the data for accuracy; entering the data into the computer; transforming the data; and developing and documenting a database structure that integrates the various measures. Logging the Data In any research project, you might have data coming from several different sources at different times, such as: Survey returns Coded interview data Pretest or posttest data Observational data Logging the Data In all but the simplest of studies, you need to set up a procedure for logging the information and keeping track of it until you are ready to do a comprehensive data analysis. Different researchers differ in how they keep track of incoming data. In most cases, you will want to set up a database that enables you to assess, at any time, which data are already entered and which still need to be entered. You could do this with any standard computerized spreadsheet (Microsoft Excel) or database (Microsoft Access, Filemaker) program. You can also accomplish this by using standard statistical programs (for example, SPSS, SAS, Minitab, Datadesk, etc.) by running simple descriptive analyses to get reports on data status. It is also critical that the data analyst retains and archives the original data records—returned surveys, field notes, test protocols, and so on—for a reasonable period of time. Most professional researchers retain such records for at least five to seven years. For important or expensive studies, the original data might be stored in a formal data archive. The data analyst should always be able to trace a result from a data analysis back to the original forms on which the data were collected. 
Most IRBs now require researchers to keep information that could identify a participant separate from the data files. All data and consent forms should be kept in a secure location with password protection and encryption whenever possible. A database for logging incoming data is a critical component in good research recordkeeping. Checking the Data for Accuracy As soon as you receive the data, you should screen them for accuracy. In some circumstances, doing this right away allows you to go back to the sample to clarify any problems or errors. You should ask the following questions as part of this initial data screening: Are the responses legible/readable? Are all important questions answered? Are the responses complete? Is all relevant contextual information included (for example, date, time, place, and researcher)? In most social research, the quality of data collection is a major issue. Ensuring that the data-collection process does not contribute inaccuracies helps ensure the overall quality of subsequent analyses. Developing a Database Structure The database structure is the system you use to store the data for the study so that it can be accessed in subsequent data analyses. You might use the same structure you used for logging in the data; or in large, complex studies, you might have one structure for logging data and another for storing it. As mentioned previously, there are generally two options for storing data on a computer: database programs and statistical programs. Usually database programs are the more complex of the two to learn and operate, but generally they allow you greater flexibility in manipulating the data. In every research project, you should generate a printed codebook that describes each variable in the data and indicates where and how it can be accessed. Minimally the codebook should include the following items for each variable: Variable name Variable description Variable format (number, data, text) Instrument/method of collection Date collected Respondent or group Variable location (in database) Notes The codebook is an indispensable tool for the analysis team. Together with the database, it should provide comprehensive documentation that enables other researchers who might subsequently want to analyze the data to do so without any additional information. Entering the Data into the Computer If you administer an electronic survey then you don’t have to enter the data into a computer—the data are already in the computer! However, if you decide to use paper measures, you can enter data into a computer in a variety of ways. Probably the easiest is to just type in the data directly. You could enter the data in a word processor (like Microsoft Word) or in a spreadsheet program (like Microsoft Excel). You could also use a database or a statistical program for data entry. Note that most statistical programs allow you to import spreadsheets directly into data files for the purposes of analysis, so sometimes it makes more sense to enter the data in a spreadsheet (the “database” it will be stored in) and then import it into the statistical program. It is very important that your data file is arranged by a unique ID number for each case. Each case’s ID number on the computer should be recorded on its paper form (so that you can trace the data back to the original if you need to). When you key-in the data, each case (or each unique ID number) would typically have its own separate row. Each column usually represents a different variable. 
If you input your data in a word processor, you usually need to use a delimiter value to separate one variable from the next on a line. In some cases, you might use the comma, in what is known as a comma-separated file. For instance, if two people entered 1-to-5 ratings on five different items, we might enter their data into a word processing program like this:

1,4,3,2,5,4
2,3,5,4,1,2

Entering the Data into the Computer Here, the first number in each row is the ID number for the respondent. The remaining numbers are their five 1-to-5 ratings. You would save this file as a text file (not in the word processor’s native file format) and then it could be imported directly into a statistics package or database. The only problem with this is that you may have times when one of your qualitative variables can include a comma within a variable. For instance, if you have a variable that includes city and state you might have a respondent who is from “New York, NY.” If you enter this as is into the word processor and save it as a comma delimited file, there will be a problem when you import it into a statistics program or spreadsheet because the comma will incorrectly split that field into two. So, in cases like this, you either need to not include comma values in the variable or you need to use a delimiter that will not be confusing. In this case, you might use a tab to delimit each variable within a line. Consider the above example with the addition of the city/state variable where we use a tab (→) to delimit the fields:

1→New York, NY→4→3→2→5→4
2→Boston, MA→3→5→4→1→2

Entering the Data into the Computer When you transfer your text-based input into a spreadsheet or a statistical program for analysis purposes, the delimiter will tell the program that it has just finished reading the value for one variable and that now the value for the next variable is coming up. To ensure a high level of data accuracy for quantitative data, you can use a procedure called double entry. In this procedure, you enter the data once. Then, you use a special program that allows you to enter the data a second time and then checks the second entries against the first. If there is a discrepancy, the program immediately notifies you and enables you to determine which is the correct entry. This double-entry procedure significantly reduces entry errors. However, these double-entry programs are not widely available and require some training. Entering the Data into the Computer An alternative is to enter the data once and set up a procedure for checking the data for accuracy. These procedures might include rules that limit the data that can be entered into the program, typically thought of as a “validation rule.” For example, you might set up a rule indicating that values for the variable Gender can only be 1 or 2. Then, if you accidentally try to enter a 3, the program will not accept it. Once data are entered, you might spot-check records on a random basis. In cases where you have two different data-entry workers entering the same data into computer files, you can use Microsoft Word’s “Document Compare” functionality to compare the data files. The disadvantage of this approach is that it can only compare entries when they are completed (unlike traditional double-entry verifiers that stop the data enterer immediately when a discrepancy is detected). If you do not have a program with built-in validation rules, you need to examine the data carefully so that you can check that all the data fall within acceptable limits and boundaries.
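As an illustration of the validation rules and double-entry comparison described above (a sketch, not the chapter's procedure), the following Python functions check a tab-delimited file against simple rules and flag rows where two independent data-entry passes disagree. The file names and column names are hypothetical.

```python
# Minimal sketch (not from the chapter): simple validation rules and a
# double-entry comparison for tab-delimited files. Column and file names
# are hypothetical.
import csv

RULES = {
    "gender": lambda v: v in {"1", "2"},                 # only 1 or 2 allowed
    "rating": lambda v: v.isdigit() and 1 <= int(v) <= 5  # 1-to-5 response
}

def validate(path):
    """Return (row number, column, value) for every entry that breaks a rule."""
    problems = []
    with open(path, newline="") as f:
        for row_num, row in enumerate(csv.DictReader(f, delimiter="\t"), start=2):
            for col, rule in RULES.items():
                if not rule(row.get(col, "")):
                    problems.append((row_num, col, row.get(col, "")))
    return problems  # e.g., [(14, "rating", "7")]

def compare_entries(path_a, path_b):
    """Flag row numbers where two independent data-entry passes disagree."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = list(csv.reader(fa, delimiter="\t"))
        rows_b = list(csv.reader(fb, delimiter="\t"))
    return [i for i, (a, b) in enumerate(zip(rows_a, rows_b), start=1) if a != b]
```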
For instance, simple summary reports would enable you to spot whether there are persons whose age is 601 or whether anyone entered a 7 where you expected a 1-to-5 response. Data Transformations After the data are entered, it is often necessary to transform the original data into variables that are more usable. There are a variety of transformations that you might perform. The following are some of the more common ones: Missing values: Many analysis programs automatically treat blank values as missing. In others, you need to designate specific values to represent missing values. For instance, you might use a value that could not be valid for your variable (e.g., -99) to indicate that the item is missing. You need to check the specific analysis program you are using to determine how to handle missing values, and be sure that the program correctly identifies the missing values so they are not accidentally included in your analysis. Some measures come with scoring manuals that include procedures for prorating scales in which less than 100 percent of the data are available. In other cases, no such procedures exist and the researcher must decide how to handle missing data. Many articles and several books have been written on how to best estimate a missing value from other available data, as well as when estimation of missing values is not advisable. You may not need to delete entire participants from your study just because they have missing values, especially if values are not missing at random. Data Transformations Item reversals: On scales and surveys, the use of reversal items (see Chapter 6, “Scales, Tests and Indexes”) can help reduce the possibility of a response set. When you analyze the data, you want all scores for questions or scale items to be in the same direction, where high scores mean the same thing and low scores mean the same thing. In such cases, you may have to reverse the ratings for some of the scale items to get them in the same direction as the others. For instance, let’s say you had a five-point response scale for a self-esteem measure where 1 meant strongly disagree and 5 meant strongly agree. One item is “I generally feel good about myself.” If respondents strongly agree with this item, they will put a 5, and this value would be indicative of higher self-esteem. Alternatively, consider an item like “Sometimes I feel like I’m not worth much as a person.” Here, if a respondent strongly agrees by rating this a 5, it would indicate low self-esteem. To compare these two items, you would reverse the scores. (Probably you’d reverse the latter item so that higher values always indicate higher self-esteem.) You want a transformation where, if the original value was 1, it’s changed to 5; 2 is changed to 4; 3 remains the same; 4 is changed to 2; and 5 is changed to 1. Although you could program these changes as separate statements in most programs, it’s easier to do this with a simple formula like the following: new value = (high value + 1) - original value. In our example, the high value for the scale is 5; so to get the new (transformed) scale value, you simply subtract the original value on each reversal item from 6 (that is, 5 + 1). Scale and subscale totals: After you transform any individual scale items, you will often want to add or average across individual items to get scores for any subscales and a total score for the scale. Categories: You may want to collapse one or more variables into categories. For instance, you may want to collapse income estimates (in dollar amounts) into income ranges.
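A minimal sketch of the missing-value and item-reversal transformations described above, using pandas; the file name, the column names, and the -99 missing-value code are assumptions for illustration, not from the chapter.

```python
# Minimal sketch (not from the chapter): flag -99 as missing and reverse-score
# an item on a 1-to-5 scale. File and column names are hypothetical.
import pandas as pd

# Treat -99 as missing at read time so it never enters the analysis as a value.
df = pd.read_csv("selfesteem.csv", na_values=[-99])

# Reverse-score a negatively worded item: new value = (high value + 1) - original,
# so 1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1.
HIGH = 5
df["item4_r"] = (HIGH + 1) - df["item4"]

# Scale total after reversal (rows with a missing item stay missing here).
df["selfesteem_total"] = df[["item1", "item2", "item3", "item4_r"]].sum(axis=1, skipna=False)
```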
Data Transformations Variable transformations: In order to meet assumptions of certain statistical methods, we often need to transform particular variables. Depending on the data you have, you might transform them by expressing them in logarithm or square-root form. For example, if data on a particular variable are skewed in the positive direction, then taking its square root can make it look closer to a normal distribution—a key assumption for many statistical analyses. You should be careful to check if your transformation produced an erroneous value— for example, in the above case, if you proposed transforming a variable with some negative values (one cannot take a square root of a negative number!). Finally, you should be careful in interpreting the results of your statistical analysis when using transformed variables—remember, with transformed variables, you are no longer analyzing the relationship between the original variable and some other variable on the same scale that you started with. Descriptive Statistics Descriptive statistics describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data. Descriptive statistics present quantitative descriptions in a manageable form. In a research study, you may have many measures, or you might measure a large number of people on any given measure. Descriptive statistics help you summarize large amounts of data in a sensible way. Each descriptive statistic reduces data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is the number of hits divided by the number of times at bat (reported to three significant digits). A batter who is hitting.333 is getting a hit one time in every three at-bats. One batting 0.250 is hitting one time in four. The single number describes a large number of discrete events. Or, consider the scourge of many students: the grade-point average (GPA). This single number describes the general performance of a student across a potentially wide range of course experiences. Descriptive Statistics Every time you try to describe a large set of observations with a single indicator, you run the risk of distorting the original data or losing important detail (see Figure 11.1 ). The batting average doesn’t tell you whether batters hit home runs or singles. It doesn’t tell whether they’ve been in a slump or on a streak. The GPAs don’t tell you whether the students were in difficult courses or easy ones, or whether the courses were in their major field or in other disciplines. Even given these limitations, descriptive statistics provide a powerful summary that enables comparisons across people or other units. A single variable has three major characteristics that are typically described: The distribution The central tendency The dispersion In most situations, you would describe all three of these characteristics for each of the variables in your study. Descriptive statistics The Distribution The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution lists every value of a variable and the number of persons who had each value. 
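To illustrate the square-root and log transformations mentioned above, here is a small sketch (not from the text) applied to a made-up, positively skewed variable, with the guard against negative values that the chapter warns about.

```python
# Minimal sketch (not from the chapter): square-root and log transformations
# for a positively skewed variable, with a check for negative values.
import numpy as np

income = np.array([12_000, 18_500, 22_000, 31_000, 45_000, 250_000], dtype=float)

if (income < 0).any():
    raise ValueError("Square root is undefined for negative values; "
                     "consider shifting the variable or using another transform.")

sqrt_income = np.sqrt(income)
log_income = np.log(income)   # another common choice for positive, skewed data

# Remember: results are now on the transformed scale, not the original dollars.
print(sqrt_income.round(1))
print(log_income.round(2))
```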
For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, you describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that you can list each one and summarize how many sample cases had the value. But what do you do for a variable like income or GPA? These variables have a large number of possible values, with relatively few people having each one. In this case, you group the raw scores into categories according to ranges of values. For instance, you might look at GPA according to the letter-grade ranges, or you might group income into four or five ranges of income values. One of the most common ways to describe a single variable is with a frequency distribution. Depending on the particular variable, all of the data values might be represented, or you might group the values into categories first. For example, with age, price, or temperature variables, it is usually not sensible to determine the frequencies for each value. The Distribution Rather, the values are grouped into ranges and the frequencies determined. In many situations, we are able to determine that a distribution is approximately normal by examining a graph. But we can also use statistical estimates of skew (leaning toward one end or the other) and kurtosis (peaks and flatness of the distribution) to make judgments about deviations from a normal distribution. Frequency distributions can be depicted in two ways, as a table or as a graph. Figure 11.2 shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 11.3. This type of graph is often referred to as a histogram or bar chart. Distributions can also be displayed using percentages. For example, you could use percentages to describe the following: Percentage of people in different income levels Percentage of people in different age ranges Percentage of people in different ranges of standardized test scores. (Figure 11.2: a frequency distribution in table form. Figure 11.3: a frequency distribution bar chart.) Central Tendency The central tendency of a distribution is an estimate of the center of a distribution of values. There are three major types of estimates of central tendency: Mean Median Mode The mean or average is probably the most commonly used method of describing central tendency. To compute the mean, all you do is add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam. Consider the test score values: 15, 20, 21, 20, 36, 15, 25, 15. The sum of these eight values is 167, so the mean is 167/8 = 20.875. Central tendency The median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order and then locate the score in the center of the sample. For example, if there are 499 scores in the list, score number 250 would be the median. If you order the eight scores shown previously, you would get 15, 15, 15, 20, 20, 21, 25, 36. There are eight scores and score number 4 and number 5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you find the value midway between them to determine the median.
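As a quick check on the numbers above, the frequency counts, mean, and median can be computed with the Python standard library (a sketch, not part of the chapter):

```python
# Minimal sketch (not from the chapter): the chapter's eight test scores,
# a simple frequency distribution, and the mean and median.
from collections import Counter
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(Counter(scores))            # frequencies: {15: 3, 20: 2, 21: 1, 36: 1, 25: 1}
print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0 (average of the 4th and 5th ordered scores)
```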
The mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown previously and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions, there is more than one modal value. For instance, in a bimodal distribution, two values occur most frequently. Notice that for the same set of eight scores, we got three different values—20.875, 20, and 15—for the mean, median, and mode, respectively. If the distribution is truly normal (bell-shaped), the mean, median, and mode are all equal to each other. Dispersion or Variability Dispersion refers to the spread of the values around the central tendency. The two common measures of dispersion are the range and the standard deviation. The range is simply the highest value minus the lowest value. In the previous example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21. The standard deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation the set of scores has to the mean of the variable. Again let’s take the set of scores: 15, 20, 21, 20, 36, 15, 25, 15. To compute the standard deviation, you first find the distance between each value and the mean. You know from before that the mean for the data in this example is 20.875. So, the differences from the mean are: -5.875, -0.875, 0.125, -0.875, 15.125, -5.875, 4.125, and -5.875. Notice that values that are below the mean have negative discrepancies and values above it have positive ones. Next, you square each discrepancy: 34.515625, 0.765625, 0.015625, 0.765625, 228.765625, 34.515625, 17.015625, and 34.515625. Now, you take these squares and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875. Next, you divide this sum by the number of scores minus 1. Here, the result is 350.875/7 = 50.125. This value is known as the variance. To get the standard deviation, you take the square root of the variance (remember that you squared the deviations earlier). This would be SQRT(50.125) = 7.079901129253. Although this computation may seem convoluted, it’s actually quite simple and is automatically computed in statistical programs. To see this, consider the formula for the standard deviation shown in Figure 11.4. In the top part of the ratio, the numerator, notice that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, you take the number of scores minus 1. The ratio is the variance and the square root is the standard deviation. In English, the standard deviation is described as follows: The square root of the sum of the squared deviations from the mean divided by the number of scores minus one. (Figure 11.4: the standard deviation formula.) Although you can calculate these univariate statistics by hand, it becomes quite tedious when you have more than a few values and variables. For instance, we put the eight scores into a commonly used statistics program (SPSS) and got the results shown in Table 11.1. This table confirms the calculations we did by hand previously. The standard deviation allows you to reach some conclusions about specific scores in your distribution.
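The calculation just described can be reproduced step by step in a few lines (a sketch, not the chapter's own figure or output):

```python
# Minimal sketch (not from the chapter): reproducing the standard deviation
# calculation described in the text, step by step.
import math
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = sum(scores) / len(scores)                 # 20.875
deviations = [x - mean for x in scores]          # distances from the mean
sum_of_squares = sum(d**2 for d in deviations)   # 350.875
variance = sum_of_squares / (len(scores) - 1)    # 50.125
std_dev = math.sqrt(variance)                    # 7.0799...

print(round(std_dev, 4))                   # 7.0799
print(round(statistics.stdev(scores), 4))  # same result from the standard library
```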
Assuming that the distribution of scores is normal or bell-shaped (or close to it), you can reach conclusions like the following: Approximately 68 percent of the scores in the sample fall within one standard deviation of the mean. Approximately 95 percent of the scores in the sample fall within two standard deviations of the mean. Approximately 99 percent of the scores in the sample fall within three standard deviations of the mean. For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, you can use the statement listed previously to estimate that approximately 95 percent of the scores will fall in the range of 20.875 - (2 X 7.0799) to 20.875 + (2 X 7.0799) or between 6.7152 and 35.0348. This kind of information is critical in enabling you to compare the performance of individuals on one variable with their performance on another, even when the variables are measured on entirely different scales. Correlation Correlation is one of the most common and useful measures in statistics. A correlation is a single number that describes the degree of relationship between two variables. Let’s work through an example to show how this statistic is computed. Correlation example Let’s assume that you want to look at the relationship between two variables, height and self-esteem. Perhaps you have a hypothesis that how tall you are is related to your self-esteem. (Incidentally, we don’t think you have to worry about the direction of causality here; it’s not likely that self-esteem causes your height.) Let’s say you collect some information on twenty individuals—all male. (The average height differs for males and females; so, to keep this example simple, we’ll just use males.) Height is measured in inches. Self-esteem is measured based on the average of 10 1-to-5 rating items (where higher scores mean higher self-esteem). See Table 11.2 for the data for the 20 cases. (Don’t take this too seriously; we made this data up to illustrate what correlation is.) Now, let’s take a quick look at the bar chart (histogram) for each variable (see Figure 11.5). Table 11.3 shows the descriptive statistics. Finally, look at the simple bivariate (two-variable) plot (see Figure 11.7). You should immediately see in the bivariate plot that the relationship between the variables is a positive one, because as you move from lower to higher on one variable, the values on the other variable tend to move from lower to higher as well. If you were to fit a single straight line through the dots, it would have a positive slope or move up from left to right. (If you can’t see the positive relationship, review the section “Types of Relationships” in Chapter 1.) Since the correlation is nothing more than a quantitative estimate of the relationship, you would expect a positive correlation. What does a positive relationship mean in this context? It means that, in general, higher scores on one variable tend to be paired with higher scores on the other, and that lower scores on one variable tend to be paired with lower scores on the other. You should confirm visually that this is generally true in the plot in Figure 11.7 (bivariate plot for the example correlation calculation). Calculating the Correlation Now you’re ready to compute the correlation value. The formula for the correlation is shown in Figure 11.8. The symbol r stands for the correlation.
Through the magic of mathematics, it turns out that r will always be between -1.0 and +1.0. If the correlation is negative, you have a negative relationship; if it’s positive, the relationship is positive. (Pretty clever, huh?) You don’t need to know how we came up with this formula unless you want to be a statistician. But you probably will need to know how the formula relates to real data—how you can use the formula to compute the correlation. Let’s look at the data you need for the formula. Table 11.4 shows the original data with the other necessary columns. The first three columns are the same as those in Table 11.2. The next three columns are simple computations based on the height and self-esteem data in the first three columns. The bottom row consists of the sum of each column. This is all the information you need to compute the correlation. Figure 11.9 shows the values from the bottom row of the table (where N is 20 people) as they are related to the symbols in the formula. Now, when you plug these values into the formula in Figure 11.8, you get the following. (We show it here tediously, one step at a time in Figure 11.10.) So, the correlation for the 20 cases is .73, which is a fairly strong positive relationship. It seems like there is a relationship between height and self-esteem, at least in this made-up data! Testing the Significance of a Correlation After you’ve computed a correlation, you can determine the probability that the observed correlation occurred by chance. That is, you can conduct a significance test. Most often, you are interested in determining the probability that the correlation is a real one and not a chance occurrence. When you are interested in that, you are testing the mutually exclusive hypotheses: the null hypothesis, that the correlation is zero (r = 0), and the alternative hypothesis, that it is not (r ≠ 0). In effect, you are testing whether the real correlation is zero or not. If you are doing your analysis by hand, the easiest way to test this hypothesis is to look up a table of critical values of r online or in a statistics text. As in all hypothesis testing, you need to determine first the significance level you will use for the test. Here, we’ll use the common significance level of alpha = .05. This means that we are conducting a test where the odds that the correlation occurred by chance are no more than 5 out of 100. Before we look up the critical value in a table, we also have to compute the degrees of freedom or df. The df for a correlation is simply equal to N - 2 or, in this example, is 20 - 2 = 18. Finally, we have to decide whether we are doing a one-tailed or two-tailed test (see the discussion in Chapter 1, “Foundations”). In this example, since we have no strong prior theory to suggest whether the relationship between height and self-esteem would be positive or negative, we’ll opt for the two-tailed test. With these three pieces of information—the significance level (alpha = .05), degrees of freedom (df = 18), and type of test (two-tailed)—we can now test the significance of the correlation we found. Testing the Significance of a Correlation When we look up this value in the handy little table, we find that the critical value is .4438. This means that if our correlation is greater than .4438 or less than -.4438 (remember, this is a two-tailed test) we can conclude that the odds are less than 5 out of 100 that this is a chance occurrence.
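As a sketch of the calculation and significance test described above (not the chapter's own figures or data), the following code implements a standard raw-score form of the Pearson formula and checks it against scipy. The twenty height and self-esteem values are made up for illustration; they are not the Table 11.2 data.

```python
# Minimal sketch (not from the chapter): a raw-score form of the Pearson
# correlation, checked against scipy. The data below are made up.
import math
from scipy.stats import pearsonr

height      = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
               68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
self_esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

def pearson_r(x, y):
    """r = [N*sum(xy) - sum(x)*sum(y)] / sqrt([N*sum(x^2) - sum(x)^2][N*sum(y^2) - sum(y)^2])"""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2, sum_y2 = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
    )

r = pearson_r(height, self_esteem)
r_check, p_value = pearsonr(height, self_esteem)   # same r, plus a p-value
print(round(r, 2), round(r_check, 2), round(p_value, 4))
# With df = n - 2 = 18 and a two-tailed alpha of .05, the correlation is
# significant if it falls beyond the critical value of about .44.
```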
Since our correlation of .73 is actually quite a bit higher, we conclude that it is not a chance finding and that the correlation is statistically significant (given the parameters of the test) and different from no correlation (r = 0). We can reject the null hypothesis and accept the alternative—we have a statistically significant correlation. Not only would we conclude that this estimate of the correlation is statistically significant, we would also conclude that the relationship between height and self-esteem is a strong one. Conventionally, a correlation over .50 would be considered large or strong. The Correlation Matrix All we’ve shown you so far is how to compute a correlation between two variables. In most studies, you usually have more than two variables. Let’s say you have a study with ten interval-level variables and you want to estimate the relationships among all of them (between all possible pairs of variables). In this instance, you have forty-five unique correlations to estimate (more later about how we knew that). You could do the computations just completed forty-five times to obtain the correlations, or you could use just about any statistics program to automatically compute all forty-five with a simple click of the mouse. We used a simple statistics program to generate random data for ten variables with twenty cases (persons) for each variable. Then, we told the program to compute the correlations among these variables. The results are shown in Table 11.5. This type of table is called a correlation matrix. It lists the variable names (in this case, C1 through C10) down the first column and across the first row. The diagonal of a correlation matrix (the numbers that go from the upper-left corner to the lower right) always consists of ones because these are the correlations between each variable and itself (and a variable is always perfectly correlated with itself). The statistical program we used shows only the lower triangle of the correlation matrix. The Correlation Matrix In every correlation matrix, there are two triangles: the values below and to the left of the diagonal (lower triangle) and above and to the right of the diagonal (upper triangle). There is no reason to print both triangles because the two triangles of a correlation matrix are always mirror images of each other. (The correlation of variable x with variable y is always equal to the correlation of variable y with variable x.) When a matrix has this mirror-image quality above and below the diagonal, it is referred to as a symmetric matrix. A correlation matrix is always a symmetric matrix. To locate the correlation for any pair of variables, find the value in the table for the row and column intersection for those two variables. For instance, to find the correlation between variables C5 and C2, look for where row C2 and column C5 is (in this case, it’s blank because it falls in the upper triangle area) and where row C5 and column C2 is and, in the second case, the correlation is -.166. Okay, so how did we know that there are forty-five unique correlations when there are ten variables? There’s a simple little formula that tells how many pairs (correlations) there are for any number of variables (see Figure 11.11): the number of pairs is N(N - 1)/2, where N is the number of variables. In the example, we had 10 variables, so we know we have (10 * 9)/2 = 90/2 = 45 pairs. Other Correlations The specific type of correlation we’ve illustrated here is known as the Pearson product moment correlation, named for its inventor, Karl Pearson.
It is appropriate when both variables are measured at an interval level (see the discussion of level of measurement in Chapter 5, “Introduction to Measurement”). However, there are other types of correlations for other circumstances. For instance, if you have two ordinal variables, you could use the Spearman Rank Order Correlation (rho) or the Kendall Rank Order Correlation (tau). When one measure is a continuous, interval level one, and the other is dichotomous (two-category), you can use the Point-Biserial Correlation. Statistical programs will allow you to select which type of correlation you want to use. The formulas for these various correlations differ because of the type of data you’re feeding into the formulas, but the idea is the same; they estimate the relationship between two variables as a number between -1 and +1.
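Finally, a brief sketch (not from the text) of how a statistics package can produce the correlation matrix and the alternative correlation types mentioned above. The random data and the small ordinal and dichotomous examples are made up for illustration.

```python
# Minimal sketch (not from the chapter): a correlation matrix for ten random
# variables, shown as a lower triangle, plus other correlation types
# mentioned in the text. All data here are made up.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, kendalltau, pointbiserialr

n_vars, n_cases = 10, 20
print(n_vars * (n_vars - 1) // 2)         # 45 unique pairs, as in the text

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(n_cases, n_vars)),
                    columns=[f"C{i}" for i in range(1, n_vars + 1)])
corr = data.corr()                         # full, symmetric Pearson matrix
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool)))  # blank the upper triangle
print(lower.round(3))

# Other correlation types for other kinds of data:
rank_a, rank_b = [1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5]   # two ordinal variables
print(spearmanr(rank_a, rank_b))           # Spearman rank-order correlation (rho)
print(kendalltau(rank_a, rank_b))          # Kendall rank-order correlation (tau)

group = [0, 0, 0, 1, 1, 1]                                 # dichotomous variable
score = [12.0, 14.5, 11.0, 18.2, 16.9, 17.4]               # interval-level variable
print(pointbiserialr(group, score))        # point-biserial correlation
```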
