Summary

This chapter from a research methods textbook discusses randomized experiments, highlighting their role in establishing causality. It covers controlling and manipulating variables, random assignment, threats to internal validity, and several alternative experimental designs. Keywords: randomized experiments, research methods, social sciences.

Full Transcript


Chapter 10: Randomized Experiments

Contents:
Controlling and Manipulating Variables
Random Assignment
Independent Variables that Vary Within and Between Participants
Threats to Internal Validity
  Selection
  Maturation
  History
  Instrumentation
  Mortality
  Selection by Maturation
Illustrating Threats to Internal Validity with a Research Example
  Selection
  Selection by Maturation
  Maturation
  History
  Instrumentation
  Mortality
Construct Validity of Independent Variables in a Randomized Experiment
Alternative Experimental Designs
  Design 1: Randomized Two-Group Design
  Design 2: Pretest–Posttest Two-Group Design
  Design 3: Solomon Four-Group Design
  Design 4: Between-Participants Factorial Design
Repeated Measures Designs
Analyzing Data from Experimental Designs
Strengths and Weaknesses of Randomized Experiments
  Experimental Artifacts
  External Validity
  The Problem of College Sophomores in the Laboratory
  The Failure of Experiments to Provide Useful Descriptive Data
Summary

Research Methods in Social Relations, Eighth Edition. Geoffrey Maruyama and Carey S. Ryan. © 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc. Companion Website: www.wiley.com/go/maruyama

Research designs differ in many ways. Some are more efficient than others, some make fewer demands on the researcher or on participants, and some take more time to implement. Recall from Chapter 2 that research designs also differ in internal validity, that is, the extent to which one can draw causal conclusions about the effect of one variable on another. Indeed, one of the most important distinctions among designs is how effectively they rule out threats to internal validity. In this chapter, we consider the designs that are highest in internal validity: randomized experiments.
Randomized experiments are highly specialized tools, and like any tool they are excellent for some jobs and poor for others. They are ideally suited for the task of causal analysis, that is, for determining whether differences in one variable cause (as opposed to simply relate to) differences in some other variable. Randomized experiments are used a great deal in the social sciences. Yet they also have their weaknesses. In this chapter, we describe their strengths and weaknesses and show how they differ from other research designs. The chief strength of randomized experiments, their internal validity, requires that researchers be able to decide which participants are assigned to which levels of the variable that is believed to be the causal variable. Experiments also require the researcher to minimize the effects of extraneous variables that might otherwise be confounded with the causal variable of interest. Frequently, this control, although maximizing internal validity, compromises construct and external validity.

Indeed, the sort of control that is necessary for a randomized experiment is most easily achieved in laboratory settings, where extraneous variables can be controlled so only the variable of interest changes across conditions. By controlling variables that may vary in natural settings, experiments typically differ from real-life, everyday settings in ways that limit our ability to generalize research conclusions to other settings. In other words, external validity can be compromised. Randomized experiments can also be conducted in real-world settings. However, real-world or field experiments, although often stronger in external validity, tend to be weaker in internal validity because it is usually more difficult to control the effects of extraneous factors. To see this more clearly, let us examine the kind of control a randomized experiment requires.
Controlling and Manipulating Variables

All research requires the manipulation or measurement of variables. As defined earlier, variables represent the constructs that researchers are interested in theoretically. They represent those things that researchers want to study and draw conclusions about. For instance, if we want to understand why people vote as they do, voting preference would be a variable that must be measured. Variables, as the name suggests, must vary; that is, they must have at least two values. Therefore, to understand voting preferences, we must study people whose voting preferences vary; some people prefer one candidate and other people prefer another. If everyone in the study preferred the same candidate, voting preference would no longer be a variable; it would be a constant.

Most research involves at least two variables, one usually called the independent variable and one the dependent variable. The independent variable is the variable we believe to have a causal influence on our outcome variables, and it is the variable in experimental research that the researcher manipulates. The dependent variable is the outcome variable; its values depend on the independent variables. In most cases, whether something is considered an independent or dependent variable depends on the research questions being asked. For example, if we wanted to demonstrate that teachers who hold high expectations for their students treat those students with greater nonverbal warmth, teacher expectations would be the independent variable and teacher warmth would be the dependent variable. But if our research looked instead at the influence of teacher nonverbal warmth on students’ subsequent achievement, teacher warmth would be the independent variable and student achievement would be the dependent variable.
Clearly there are many variables besides teacher expectations or teacher nonverbal warmth that might also influence children’s achievement. Family members’ education, income, and expectations for children and children’s motivation, health, and attendance would also influence achievement. If we wanted to be able to predict children’s achievement, we would try to include as many of these variables as possible in our research. If, on the other hand, we wanted to understand the influence of a single variable, to see whether it has a causal effect on children’s achievement, we would try to control all the other variables. Isolating and controlling other possible independent variables is the strategy followed in experimental research.

Experimenters ask questions, such as “What is the effect of others’ expectations on children’s achievement?” Notice that the question refers to a variable an experimenter can possibly control – expectations. Experimenters study variables that either they or someone else can manipulate, such as the timing, content, or strength of others’ expectations. Thus, they can decide to vary whether others have positive or negative expectations about a target; that is, they can control which individuals have positive expectations and which have negative expectations of a particular child or children, for example, by telling some individuals that a child has high ability and telling others that the child has low ability. Variables that can be so controlled are called experimental independent variables or, more simply, experimental variables. Other sorts of independent variables, such as religion, income, education, gender, ethnicity, and personality traits, are all variables that people bring with them to a study and are virtually impossible to manipulate. These sorts of independent variables are called individual difference or subject variables. They are properties that people already possess. In contrast, experimental variables can be manipulated.
This is a major difference between experimental and non-experimental or quasi-experimental research. Experimenters can control the variables whose effects they wish to study; they can control who is exposed and how they are exposed to those variables.

Why is it necessary to isolate and control extraneous variables and manipulate the independent variable to maximize internal validity? To answer this question, consider a study of the influence of television advertising on voting preferences. Assume we were not interested in the effects of education, religion, parents’ political preferences, or any other independent variables on voters’ choices. All we wanted to know was whether seeing a particular television advertisement influenced voting preferences. Now suppose we did not manipulate who watched the television advertisement and who did not. Suppose that instead, we simply found some individuals who saw the advertisement and some who did not, and we asked them about their preferred candidate. If we found a difference in candidate preferences between those who saw the advertisement and those who did not, we could never be sure that this difference was due to having watched the advertisement or not. Instead, it might be due to a host of other individual difference variables on which the two groups happen to differ, such as political views or education.

We might try to conduct our research with individuals who all had the same political views and education. That is, we might try to equate the two groups on these individual difference variables or control these variables by holding them constant. In the social sciences, however, controlling for other variables by holding them constant is never sufficient to ensure internal validity. Even if we controlled political views and education, there might be many other individual difference variables that differ between the two groups.
And any of these might also influence candidate preference. For instance, the two groups might differ in how late they stayed up at night, in how many children they had, in their age, or in what they preferred to eat for breakfast. Any of these other variables might be responsible for the difference between the two groups in candidate preference.

In other words, any preexisting difference between groups could serve as a plausible alternative explanation for our conclusion that exposure to advertisements affects voting behavior. Some rival explanations are more plausible than others (e.g., it would be difficult to think of a reason that breakfast preference would determine voting behavior), but any systematic difference between groups other than the experimental variables threatens the study’s internal validity. And, because no study measures all extraneous variables, there are many possible rival explanations that cannot be dismissed, as well as alternative explanations that might account for the results but are never even considered. For example, among people there is an unknown, but surely large, number of variables that might be responsible for voting choices (or almost any other dependent variable in which we might be interested).

Thus, in social science research one can never maximize internal validity or reach causal conclusions about the effect of an independent variable simply by making sure that research participants do not differ on a limited number of other potentially confounding variables. There are simply too many other variables that would need to be controlled. The solution to this problem is to conduct a randomized experiment in which individuals are randomly assigned to levels of the independent variable. In our study of the effects of televised advertisements on voting preferences, we would randomly determine whether each individual did or did not watch advertisements featuring the political candidates.
Obviously, to conduct this sort of randomized experiment, we must be the ones to decide who watches the advertisement and who does not. The researcher must be able to manipulate the independent variable.

Random Assignment

Random assignment is the only way to equate two or more groups on all possible individual difference variables at the start of the research. This step is essential for drawing causal inferences about the effects of an experimental independent variable because the experimenter must be reasonably confident that the differences that appear at the end of the experiment between two treatment groups are the result of the treatments and not of some preexisting differences between the groups.

Random assignment (also called randomization) is not the same as random sampling, which we covered in detail in Chapter 9. Random sampling is the procedure we might use to select participants for the study. It serves not to equate two or more experimental groups but to make sure that participants are representative of a larger population. As discussed in Chapter 9, random sampling allows us to say that what we have found to be true for a particular sample is likely to be true of people in the larger population from which the sample was drawn. It maximizes the external validity of research. In contrast, random assignment is a procedure we use after we have a sample of participants and before we expose them to a treatment or independent variable. It is a way of assigning participants to the levels of the independent variable so that the groups do not differ when the study begins. Random assignment ensures that all participants have an equal chance of being assigned to the various experimental conditions. Dividing participants on the basis of their arrival at the laboratory or alphabetically would not be truly random, as it does not ensure equal probabilities of assignment to condition and could introduce systematic biases in the data.
For example, personality differences might be associated with early versus late arrival at the laboratory, and thus we would not want our experimental groups to differ on those traits. Random assignment requires a truly random process, such as the use of a random numbers table, a computerized random number generator, or the flip of a coin in the case of an experiment that has only two conditions. (Use of random numbers tables and random number generators is discussed in Chapter 9.) Random assignment enables us to say that X caused Y with some certainty. That is, random assignment maximizes the internal validity of research.

Exercise: Random assignment and random sampling

Random assignment and random sampling are sometimes difficult to keep distinct, for they both are important for drawing inferences from research. Discuss them in your group, and see what you can create that helps everyone in your group to distinguish between them and keep them straight.

To appreciate what random assignment accomplishes and why, consider a new example – determining whether students learn more about research methods through a traditional lecture format or through an interactive web-based format. Assume that a company was able to bring together a random sample of undergraduate students from across the country to participate in the study. We would therefore not have to worry about the representativeness of the students. Our only concern would be determining whether students learn more from traditional lectures or the web-based format. The best way to design this experiment is to assign students randomly to one of two conditions: group L, which takes the traditional lecture-based course, and group W, which takes the web-based version of the course. We measure how much they learn by giving them all the same examination at the end of the semester.
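The two random procedures in this design can be sketched in code. The following Python snippet is a minimal illustration, not a prescribed procedure from the text; the function name and student labels are our own. It first draws a random sample from a larger pool (random sampling, which addresses representativeness) and then randomly assigns the sampled students to group L or group W with equal group sizes (random assignment, which addresses internal validity).

```python
import random

def randomly_assign(participants, conditions=("L", "W"), seed=None):
    # Shuffle a balanced list of condition labels so every participant
    # has an equal chance of each condition while group sizes stay equal
    # (a constrained random assignment, as described in the text).
    rng = random.Random(seed)
    labels = [conditions[i % len(conditions)] for i in range(len(participants))]
    rng.shuffle(labels)
    return dict(zip(participants, labels))

# Random sampling: select 40 students from a larger population.
population = [f"student_{i}" for i in range(1000)]
sample = random.Random(0).sample(population, 40)

# Random assignment: split the sampled students into the lecture (L)
# and web (W) conditions.
assignment = randomly_assign(sample, seed=0)
group_L = [s for s, c in assignment.items() if c == "L"]
group_W = [s for s, c in assignment.items() if c == "W"]
```

Note the two distinct steps: the sampling call decides who is in the study at all, while the shuffle decides which condition each sampled student experiences; keeping those two steps separate is exactly the distinction the exercise above asks you to articulate.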
Assume that the web and lecture versions of the course cover exactly the same material and that the final examination is a valid measure of how much people know about research methods. If our sample were large enough or if we did the study repeatedly, we could be confident, because of random assignment, that the two groups of students – those attending lectures and those doing the web course – were equivalent in all possible ways. To appreciate this statement, suppose we randomly assigned the students by flipping a coin. About half of the students would get “heads” and about half would get “tails.” With a large enough sample, would we expect all the students with light hair to be in one group and all the students with darker hair in the other? Of course not, for every student regardless of hair color has the same likelihood of being in group L and of being in group W. Chances are that there would be a mixture of dark- and light-haired students in both groups. Would we expect all the men to get “tails” and all the women to get “heads”? Of course not. On average, the two groups would include both women and men. In fact, as a result of random assignment, on average, the two groups would be equivalent in all possible ways. We would not expect them to differ on any individual difference variables. Thus, random assignment and randomized experimental research designs control for all possible individual difference variables that could interfere with our ability to reach causal conclusions about the effect of the independent variable. Of course, if we had only two students in the study, one female and one male, the student attending the traditional lecture course would differ in gender from the student taking the web course whether we randomly assigned them to conditions or not. But if we were to do this study over and over again (randomly assigning the two students each time), across the studies, gender would not be confounded with the independent variable. 
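The claim that random assignment equates groups on average can be checked with a small simulation. This sketch is our own illustration (a generic "trait" score stands in for any individual difference variable, such as hair color or gender in the examples above): it coin-flips each participant into one of two groups, repeats the whole study many times, and averages the resulting group differences.

```python
import random
import statistics

def average_group_difference(n_participants=100, n_studies=2000, seed=1):
    # Each simulated study coin-flips participants into two groups and
    # records the difference in group means on a preexisting trait.
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_studies):
        group_a, group_b = [], []
        for _ in range(n_participants):
            trait = rng.gauss(100, 15)  # a preexisting individual difference
            (group_a if rng.random() < 0.5 else group_b).append(trait)
        if group_a and group_b:  # guard against an empty group by chance
            diffs.append(statistics.mean(group_a) - statistics.mean(group_b))
    # Any single study can show a chance difference, but the average
    # difference across many simulated studies is close to zero.
    return statistics.mean(diffs)
```

In any one simulated study the two group means can differ noticeably, mirroring the two-student example in the text, but the average difference across thousands of simulated studies is essentially zero.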
Thus, random assignment works on average, given a large enough sample or given a sufficient number of times that a study is conducted. In any one study, with a limited sample size, there can be differences between the experimental groups simply by chance. But on average, if we did the study over and over again, all such differences would disappear. As was illustrated with a sample of two participants, random assignment works only on average. Nevertheless, it is the only procedure that can ensure the equivalence of people across experimental conditions and is therefore critical for maximizing internal validity.

Because random assignment works on average, it is important for researchers to be cognizant of the possibility of failures of randomization, which are much more likely with small samples. In other words, in any single study, it is possible that particular kinds of participants will not be evenly distributed across experimental conditions (e.g., all the African American participants might end up in one condition, or there might be disproportionately more women in one group than another). Failure of randomization is a problem because it means that despite random assignment, the groups were not equivalent at the beginning of the study and therefore the internal validity of the study is compromised. Thus, it is often wise to measure important individual difference variables or administer pretests for important dependent measures. Analyses comparing experimental groups on these individual difference and pretest variables should ideally reveal no differences between experimental conditions. If the analyses indicate a failure of randomization, the researcher is in a difficult position. It would not be acceptable to shuffle the experimental assignment of participants until the groups “looked” equivalent, as that undermines the whole point of randomization.
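One simple way to screen pretest data for a possible failure of randomization is to compute a standardized mean difference between conditions. This sketch is our own illustration (the function name, the toy scores, and the screening threshold are all hypothetical, and a rough effect-size screen is no substitute for a proper significance test):

```python
import statistics

def standardized_difference(pretest_a, pretest_b):
    # Cohen's-d-style effect size: the difference between the two group
    # means on the pretest, scaled by the pooled standard deviation.
    # Values near 0 are consistent with successful randomization.
    pooled_sd = statistics.pstdev(pretest_a + pretest_b)
    return (statistics.mean(pretest_a) - statistics.mean(pretest_b)) / pooled_sd

# Hypothetical pretest scores for two randomly assigned groups.
treatment = [72, 68, 75, 71, 69, 74]
control = [70, 73, 67, 72, 74, 69]
d = standardized_difference(treatment, control)
flag = abs(d) > 0.5  # arbitrary screening threshold for follow-up
```

A large standardized difference would prompt the kind of follow-up the text describes; it would not license reshuffling participants after the fact.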
Replicating the study – perhaps with a larger sample to minimize the likelihood of a failure of randomization – is often the only satisfactory solution.

Independent Variables that Vary Within and Between Participants

In the research examples presented so far, the independent variable has varied between research participants. That is, some individuals were exposed to the political advertisements and some were not; some students went to the lectures and some took the course on the web. There are many independent variables, however, that can be manipulated within participants, which usually results in a more efficient research design.

Consider another example: We want to examine whether using the term “climate change” or the term “global warming” in an online advertisement affects people’s concern for the environment. We must decide how to manipulate the independent variable. We consider two options. We can assign some individuals to the “climate change” condition and others to the “global warming” condition. If a total of 40 individuals were available, there would be 20 participants in each of the two conditions of the study (i.e., if we were lucky or if we constrained our random assignment procedure so that the two conditions would have equal numbers of participants). The second option for manipulating the independent variable involves measuring all 40 participants’ concern for the environment in both conditions, once after reading the message with “climate change” and once after reading the message with “global warming.” In this case, the independent variable varies within participants rather than between them. Rather than some participants being in one condition and some in the other, all participants are in both conditions. This is called a within-participants or repeated measures design.
We introduce this distinction between independent variables that vary within participants and those that vary between participants because it helps to clarify the list of threats to internal validity that we consider in the next section of the chapter. At this point, it is only necessary to understand the distinction and also what random assignment means in the two cases. When the independent variable varies between participants, participants are randomly assigned to one condition or the other so that participants in the two conditions will not, on average, differ on any variables that are not part of the experimental procedure. When the independent variable varies within participants, random assignment involves randomly determining the order in which each participant is exposed to the levels of the independent variable.

Suppose we did not do this. Instead, all 40 participants first rated their concern for the environment after the “global warming” message and then after the “climate change” message. Participants might read the first message more carefully than the second, or deduce the purpose of the study after the first message (we discuss this issue further later in this chapter), or become tired after reading the second message, or notice characteristics of the room or the experimenter in between the two ratings. All of these other things might influence participants’ ratings. If we then found a difference in environmental concern, we could not be sure whether it was due to our experimental manipulation or to any or all of the other differences between the two times at which participants provided their ratings. It is simple to overcome this problem by randomly assigning half of the participants to read the “global warming” message first and the “climate change” message second and the other half to read the “climate change” message first and the “global warming” message second.
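The order-randomization step just described can be sketched as follows. This is a minimal Python illustration (the function and variable names are our own): each participant is assigned one of the two possible message orders, with the orders used equally often and randomly shuffled over who gets which.

```python
import itertools
import random

def counterbalanced_orders(participants, conditions, seed=None):
    # Enumerate every possible presentation order, cycle through the
    # orders so each is used equally often, then shuffle which
    # participant receives which order.
    rng = random.Random(seed)
    all_orders = list(itertools.permutations(conditions))
    schedule = [all_orders[i % len(all_orders)] for i in range(len(participants))]
    rng.shuffle(schedule)
    return dict(zip(participants, schedule))

orders = counterbalanced_orders(
    [f"participant_{i}" for i in range(40)],
    ("global warming", "climate change"),
    seed=7,
)
# Exactly 20 participants read the "global warming" message first and
# 20 read the "climate change" message first.
n_gw_first = sum(1 for order in orders.values() if order[0] == "global warming")
```

With more than two conditions, fully enumerating every order grows factorially, so researchers often use only a subset of orders; the sketch above handles the two-condition case in the example.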
In this way, differences in, for example, attention or fatigue would not account for differences in ratings between the two levels of the independent variable. This practice of varying the order of experimental conditions across participants in a repeated measures design is called counterbalancing. Counterbalancing is important not only because it helps to assure internal validity but also because it controls for possible contamination or carryover effects between experimental conditions, for example, the possibility that people would become more concerned after reading the second message simply as a result of greater exposure to environmental issues.

To recap, when the independent variable varies between participants, we randomly assign each participant to one condition or the other. When the independent variable varies within participants, each participant is measured under each condition, and we must then randomly assign participants to experience the various conditions in different orders. Later we consider some of the advantages and disadvantages of using designs in which the independent variable varies within rather than between participants.

Threats to Internal Validity

Making causal inferences is what doctors do when they try to diagnose the cause of a patient’s pain or what detectives do when they attempt to identify the cause of a crime. The researcher, doctor, and detective must each rule out a list of alternative explanations to arrive at the most probable cause. The alternative explanations are threats to the internal validity of the research proposition. The strength of randomized experiments is that through randomization these threats are, on average, eliminated. If we use a research design other than a randomized experiment, these threats make causal inference very difficult indeed. Six such threats to internal validity are defined in this section.
Other threats exist as well; fuller discussions of them can be found in the classic primer on research design by Campbell and Stanley (1963) as well as in work by Shadish, Cook, and Campbell (2002) and Judd and Kenny (1981).

Selection

Selection refers to any preexisting differences between individuals in the different experimental conditions that can influence the dependent variable. As should be clear from our earlier discussion, selection is a threat to validity whenever participants are not randomly assigned to conditions. An extreme example makes it obvious why selection poses such a serious threat to internal validity: Pretend a researcher is interested in testing the hypothesis that bungee jumping has mental health benefits. The researcher explains the study to a group of prospective participants, and all those who want to try bungee jumping are placed in the experimental group, whereas the participants who decide they would rather not bungee jump are placed in the control group. Of course, no experienced researcher would design a study in this manner. Clearly, the two groups differ in many important ways, including risk-taking tendencies and attitudes toward bungee jumping. It would be impossible to determine at the conclusion of the study whether any differences between the two groups were caused by the bungee jumping itself or these preexisting differences.

The threat posed by selection is obvious in cases like these, where participants self-select into experimental conditions. Less obvious, but just as much of a threat, are cases where participants do not self-select, but rather preexisting groups are used that could differ on any number of characteristics. For example, a researcher might wish to test a new method of increasing sales productivity. The method might be used in one car dealership and monthly sales compared to those of a neighboring dealership using traditional sales techniques.
The inferential difficulty is that the sales personnel at the two dealerships might differ in many respects that affect their productivity above and beyond the effects due to the experimental sales techniques. The same threat to validity occurs when selecting participants for various treatment conditions from different businesses, health facilities, or schools. In short, unless participants have been randomly assigned to an experimental condition, selection is always a threat, and in most cases causal inferences about the independent variable cannot be drawn.

Maturation

Maturation involves any naturally occurring process within persons that could cause a change in their behavior. Examples include fatigue, boredom, growth, or intellectual development. A rather obvious example of the threat to internal validity caused by maturation would be a study testing the effects of intensive speech therapy on 2-year-old children over a 6-month period. At the conclusion of the study, the children are pronouncing words significantly more clearly than at the beginning. Of course, children of that age are experiencing rapid improvement in their speech as a natural function of language development, and they might have improved just as much without the speech therapy. Fortunately, such biases can be easily cured by including a control group (i.e., children who do not receive intensive therapy) in the study. Normal developmental changes should affect participants in both the experimental and control groups to the same degree.

In addition to developmental changes that occur in individuals over extended periods of time, maturation also refers to short-term changes that can occur within an experimental session. Consider a reaction time task in which participants are instructed to attend to a complicated visual pattern and press a button whenever a certain stimulus shape appears.
Participants are likely to become fatigued or bored after doing the task for 30 or 40 minutes. Reaction latencies probably will increase and more errors will be committed toward the end of the study, posing an inferential problem if all of the experimental trials are located in the beginning or end of the session. This type of maturational bias, too, has a simple cure: Present stimuli in a counterbalanced order so that all types of stimuli are evenly distributed across the experimental session.

History

History refers to any event that coincides with the independent variable and could affect the dependent variable. It could be a major historical event that occurs in the political, economic, or cultural lives of the people we are studying, or it could be a minor event that occurs during the course of an experimental session – such as a disruption in the procedures because of equipment failure, a fire alarm going off, a participant in a group session behaving inappropriately, or an interruption from any unwanted source. In short, history is anything that happens during the course of the study that is unrelated to the independent variable yet affects the dependent variable. Sometimes the event actually has historical significance, but the threat to validity refers to any event that provides a competing explanation for findings.

Imagine that a researcher is conducting a study of attitudes toward the death penalty. A well-publicized execution that takes place during data collection could affect participants’ attitudes in ways unintended by the researcher. History threatens not only the internal validity but also the external validity of this study. The threat to internal validity caused by such events can be remedied by including a control group, as participants in both conditions should be aware of and affected by the execution to the same extent, so we can infer that any additional differences between the groups were caused by the independent variable.
The threat to external validity, however, is not so easily removed. There is no way of knowing whether participants would have reacted similarly to the dependent measures if they had not been exposed to the publicity surrounding the execution.

With respect to unique events that occur during the course of a study, the degree of damage caused depends on how the data are gathered. If participants are run individually, such events would presumably affect only one participant, whose data could be dropped from analyses. At worst, the disruption would merely add a slight amount of noise or random error to the data. Intrasession history becomes more problematic when participants are run in group sessions, as is often done to gather data quickly and easily. Malfunctioning computers or other equipment can disrupt or ruin a significant portion of a study, and the researcher is faced with the unfortunate choice of either dropping the affected data or contending with considerable noise in the data analyses. Moreover, if the data are gathered in such a way that all of the experimental participants are run in one group and the control participants in another, history becomes an insurmountable threat to internal validity. If any differences are obtained between conditions, we do not know whether it is because of the independent variable or because of the unique events that occurred within each group. For that reason, if participants are run in groups out of necessity or convenience, researchers should ensure that the different experimental conditions are represented within each group if at all possible. If that is not possible (e.g., if the manipulation must be administered orally), then multiple groups for each condition – as many as possible – should be run so that potential history effects can be statistically evaluated.

Instrumentation

Instrumentation is any change that occurs over time in measurement procedures or devices.
If researchers purposefully change their measuring procedures because they have discovered a “better” way to collect data, or if observers gradually become more experienced or careless, these changes could have effects that might be confused with those of the independent variable. As was the case with maturation, this problem is a particular threat to internal validity if the various experimental conditions are run at different times. Instrumentation is a bias that should not happen. The cure is careful training and monitoring of observers or measurement procedures and ensuring that the order of experimental conditions is counterbalanced or randomized throughout the course of the study. Pilot testing also helps, for weaknesses in the measurement instrument often can be identified by asking participants what they inferred from each question and why they answered the way they did.

Mortality

Mortality refers to any attrition of participants from a study. If some participants do not return for a posttest or if participants in a control group are more difficult to recruit than participants in a treatment group, these differential recruitment and attrition rates could create differences that are confused with effects of the independent variable. Take, for example, a study testing an experimental drug designed to help people quit smoking. The drug reduces the desire for nicotine substantially, but it has the rather distressing side effect of causing unrelenting diarrhea. As a result, 70% of the treatment group drops out of the study. At the conclusion of the study, the remaining 30% in the treatment group are smoking significantly fewer cigarettes than the participants in the control group (none of whom dropped out). Can we conclude that the new drug works? No, because the 30% who stayed in the study are almost certainly no longer equivalent to the control group participants on important variables such as motivation.
Presumably, the people who are truly motivated and committed to quitting smoking are the ones who would keep taking the drug despite the troubling side effects. The opposite case is also possible, where participants in the control condition see no benefits and withdraw from a study in proportions much greater than those in the treatment condition. Mortality is particularly problematic in longitudinal research, in which data are gathered at multiple points in time. Imagine an intervention study designed to improve school success for at-risk adolescents. At the five-year follow-up, 45% of the original sample could not be located. Analyses of the data showed significant gains in achievement test scores; however, the high mortality rate precludes us from drawing the causal inference that the gains were caused by the intervention. The adolescents who could not be located for the follow-up probably include those whose performance was worse; perhaps the reason they could not be located is that they dropped out of school or were even incarcerated. Mortality always presents a threat to external validity; at the end of a study, we are only able to conclude that our results are representative of the kinds of individuals who are likely to finish the study. The greater the mortality, the less representative our final participant sample becomes. On the other hand, differential mortality, that is, mortality rates that differ across experimental groups, creates a threat to internal validity. Because the experimental groups are no longer equivalent except for the independent variable, we cannot determine whether it was the independent variable that caused any group differences or the other ways in which the groups differ. Thus, it is always important to look for problems caused by mortality and differential mortality.
At the conclusion of data gathering, researchers can count the number of participant dropouts and determine whether the number varies systematically across conditions. Unlike some of the other threats to internal validity, there is no easy cure for mortality. There are steps researchers can take to reduce mortality in a longitudinal study, such as keeping in close contact with participants by sending them newsletters or birthday cards or obtaining the names and addresses of contact persons who would be expected to know how to find participants over the duration of the study. With adequate preparation and effort, attrition in even a 10-year study can be kept under 20%. Differential mortality is harder to prevent, especially if the experimental treatment involves aspects that make it substantially more or less desirable than the control group’s experience, or if the treatment affects mortality (which likely would be a dependent variable in such research). The antismoking drug with the unpleasant side effects would naturally result in more dropouts than the control condition, as would a study in which the experimental participants must agree to experience painful electric shocks. In such cases, the problem of differential mortality can be addressed by carefully designing the experience of the control participants to be equally desirable or aversive. Making both conditions equivalently aversive might increase the total amount of participant mortality (as more people in the control group will likely drop out of the study), but the reduced external validity might be a necessary cost to pay for increasing the internal validity of the study.

Selection by Maturation

Selection by maturation occurs when there are differences between individuals in the treatment groups that produce changes in the groups at different rates. Differences in spontaneous changes across the different groups can be confused with effects of the treatment or the independent variable.
For example, imagine we were conducting social skills training groups for preadolescents. If analyses showed significant improvements in girls but not in boys, we might be tempted to conclude that our treatment is effective for girls; however, an alternative explanation is that girls simply mature socially earlier than boys and that our treatment had no effect at all. Random assignment is the best way to address this threat to internal validity. When participants are randomly assigned to treatment groups, any variability in rates of maturation is spread across all groups to an equivalent degree. When random assignment is not feasible, one can attempt to assess the degree of change that would have been expected in the absence of the treatment. For example, one might ask comparison groups of female and male preadolescents (i.e., individuals who seem similar to the research participants, such as female and male preadolescents in a similar school) to complete the same measures as the research participants and at the same points in time. Hopefully, you will not encounter this type of challenging situation in most studies, for the solutions are complicated. One is a cross-sequential design in which the treatment is delivered at two different times, with data being collected at three time points. The first data collection would be a pretest for both treatment groups, the second would occur after one group gets the treatment but before the second group gets it, and the third after both groups have had the treatment. Each group would provide both a treatment condition and a control condition, and only if the impact of the treatment were the same for the two groups could this threat be eliminated.
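Random assignment itself is mechanically simple. The sketch below (participant IDs and condition names are invented for illustration) shuffles a participant list and deals it round-robin into conditions, so that every participant has an equal chance of each assignment and group sizes stay balanced:

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle the participant list, then deal it round-robin into the
    conditions; chance alone determines who ends up in which group."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, participant in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(participant)
    return groups

# 40 hypothetical preadolescents dealt into two equal groups
groups = randomly_assign(range(40), ["treatment", "control"], seed=1)
```

Because assignment depends only on the shuffle, preexisting differences – including differing rates of maturation – are, in the long run, spread evenly across the groups.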
Illustrating Threats to Internal Validity with a Research Example

Some of the threats to internal validity are particularly troublesome when the independent or treatment variable varies between participants (selection and selection by maturation). Others are likely to be more problematic when there is no control group or the treatment variable varies within participants (history, maturation, instrumentation, mortality). To help understand these threats and how they are in fact threats if a randomized experiment is not conducted, consider the following examples of nonexperimental research designs. Imagine that the trustees of an educational foundation want to know whether receiving a liberal arts education actually makes people more liberal. The researcher decides to answer the question by comparing the people she knew in high school who went to college with those who did not. A high school reunion provided the opportunity to make some observations. These observations revealed that the researcher’s high school friends who had not gone to college were more politically conservative than the people who had gone to college. The independent or treatment variable in this example is whether the participants went to college. The researcher wishes to ascertain whether this independent variable has a causal effect on political attitudes, the dependent variable. Because some of her classmates went to college and others did not, and because she was comparing the political leanings of these two groups of classmates, the independent variable varies between classmates. Obviously, classmates were not randomly assigned to the levels of the independent variable: The researcher did not flip a coin for each classmate at high school graduation to determine who went off to college and who did not. The internal validity of this design is particularly threatened by selection and by selection by maturation.
Selection

We need to consider the possibility that the people who did not go to college and the people who did were different types of people to begin with. They might have had different political attitudes even before their educational paths diverged. Because they were not randomly assigned to the college and no-college groups but rather selected their own paths or had their paths selected for them by admissions committees, parents, school counselors, and other advisors, there is no guarantee that they were similar at the outset. Such selection effects are serious threats to the internal validity of studies in which there is not random assignment. Whenever people select their own treatments or are selected by others for treatments or end up in different treatment groups by some unknown process instead of by random assignment, we have no assurance that the people in different groups were equivalent before exposure to the independent variable. Chances are they were not, because the very fact that they selected or were selected for different treatments indicates that they were different types of people, with different preferences, abilities, or other characteristics that made them seem more suitable for one treatment rather than another.

Selection by Maturation

Now suppose the researcher had thought about the selection threat and tried to eliminate it. Perhaps she gathered some information from the past about her classmates in an attempt to show that the political leanings of those who went to college were the same as the leanings of those who did not go to college back when both groups were finishing high school. Suppose she found some records of interviews about the then-current presidential election and became convinced that at that time the two groups of classmates had approximately the same political attitudes.
The researcher then argued that selection was no longer a threat and that causal conclusions about the impact of college on political attitudes were more defensible. The problem is that the researcher also needs to consider the possibility that the two groups of classmates would have grown apart in their political leanings even if one of the groups had not gone to college. Even if the two groups were the same politically during senior year in high school, they probably still were very different groups in other ways; their political attitudes might have been changing at different rates and in different directions even if, for some reason, the college-bound classmates had not in fact enrolled in college. Realizing the impossibility of eliminating these two internal validity threats, the researcher decides to gather some more data, using a research design that she hopes will not be as subject to the threats of selection and selection by maturation. This time she decides to follow a group of students as they go to college for the first time. She initially measures their political leanings when they graduated from high school and then again two years later, after their sophomore year in college. Again, the researcher finds that the political attitudes after two years in college were more liberal than they were two years earlier. On the plus side, this research design has effectively eliminated the threats of selection and selection by maturation because this time the researcher is not comparing two different groups of individuals. Rather, she is looking at the political attitudes of the same students at two different times: before they went to college and after they had been there two years. Note that the research design is now a repeated measures design in which the independent variable varies within participants.
That is, the political attitudes of each individual are measured twice, once before and once during college. Obviously, the researcher has not randomly assigned individuals to different orders. That is, everyone’s attitudes were measured the first time without having been to college and the second time after having been there for two years. This is an example of a study in which counterbalancing the order of experimental conditions is impossible. Thus, the design is not a randomized experimental one, and its internal validity is threatened in the following ways.

Maturation

We need to consider the obvious problem that individuals might simply change in their political attitudes as they mature or grow older. Accordingly, the individuals followed for two years might have developed more liberal political attitudes even if they had not gone to college during those two years, simply as a function of growing older.

History

We also need to consider the possibility that different sorts of historical events were taking place at the two times and everyone at the second time might have been more liberal than they were at the first, whether or not they went to college in the interim. It is well known that the American populace as a whole seems to change in its political leanings across time. The decade of the 1960s was characterized by relatively liberal sentiments throughout the country compared to the 1950s. Thus, historical events change everyone’s outlook, and such events might have affected the outlook of the students over the two-year period that they were followed.

Instrumentation

We need to consider whether the way in which the dependent variable was measured changed from the first time to the second. Perhaps after the first phase of data collection, the researcher thought of better ways to ask the political attitude questions for the second phase of data collection.
If the measurement procedures were not exactly the same at the two times, differences in the measurement instruments could be responsible for the differences in attitudes obtained.

Mortality

In all probability, after a two-year period, the researcher was not able to gather data successfully from all of the individuals who participated in the first phase of the study. Some of them, despite the best efforts of the researcher to track them down, might have moved away, become ill, or for other reasons become unreachable. It is possible that these individuals happen to be those with the most conservative political attitudes. Thus, the relative liberalness of those from whom data were gathered the second time, after two years in college, might be due not to their becoming more liberal but to the fact that the most conservative individuals were not included in the sample at the end of the study.

Construct Validity of Independent Variables in a Randomized Experiment

A randomized experiment requires the experimenter to be able to control or manipulate the independent variable so that a random assignment rule can be used. To do so, a researcher must first define the independent variable and create an operational definition of it. As discussed in Chapter 7, an operational definition is the procedure used to manipulate or measure the variables of the study. All research contains operational definitions of abstract concepts; they are not unique to laboratory experiments. Sometimes the operational definition of an independent variable is clear and straightforward. For example, if we are interested in the impact of jury size on verdicts, the independent variable is easy to conceptualize and manipulate: We simply compare juries composed of 6 versus 12 jurors. In many cases, however, the operational definition of a given independent variable is not so straightforward.
The independent variable might be complex, requiring a complex operational definition; or it might be a general construct that can be operationally defined in any number of ways, and the researcher must choose an operational definition that is valid, practical, and convincing. Consider, for example, urban stress. Urban stress is a complex, abstract notion that could be measured or manipulated in a number of ways. We could ask people living in large cities to rate how stressed they feel on a 7-point scale, which might be adequate if we were interested in stress as a dependent variable. But we could not readily use that operational definition if we wanted to manipulate stress. Glass and Singer (1972), in their classic work on urban stress, manipulated stress in the laboratory by subjecting participants to noise that was either controllable or uncontrollable. Their reasoning was that uncontrollable noise in the laboratory is a noxious stimulus that could serve as a substitute for urban social stressors. In order for their findings to be compelling, however, they had to convince others that it was reasonable to equate uncontrollable noise with the abstract construct urban stress. In Chapter 7, we discussed the importance of construct validity, that is, making sure that our variables capture the constructs we wish to measure. The same problems of assessing validity pertain to variables that are experimentally manipulated. For instance, instead of measuring anxiety, a researcher might manipulate people’s anxiety, creating high anxiety in some persons and low anxiety in others. Or rather than measure existing levels of motivation, an experimenter might manipulate motivation by giving some participants instructions that motivate them to do well on a task and giving others instructions that cause them to care little about their performance.
When researchers create rather than measure levels of an independent variable, we call it a manipulated variable. When independent variables are obtained by measurements rather than manipulations, a researcher uses the same logic for assessing reliability and validity that is used for dependent variables. With manipulated independent variables, however, the researcher must test the manipulation by measuring its effects to determine its construct validity. For instance, a researcher might try to create high motivation by telling some participants that their performance on a task is an indicator of their intelligence and create low motivation by telling others that the task is a measure of willingness to practice dull tasks. The researcher needs to know whether these instructions really created different levels of motivation. Perhaps the manipulation resulted in no differences in motivation because participants did not believe the instructions or because participants wanted to do well regardless of the task. Perhaps the researcher created differences – but in anxiety instead of motivation. To demonstrate the validity of the manipulation, the researcher must also measure participants’ motivation after the instructions – in the same experiment, if possible, or in a separate experiment if including the additional measure in the primary experiment would undermine the effect of the manipulation on the primary measures of interest. If those who received the “intelligence test” instructions rate the task as more important and their desire to do well as higher than do those who get the “willingness to practice dull tasks” instructions, the researcher would have some evidence that the manipulation was successful. However, the instructions might also have created different levels of anxiety along with levels of motivation. We would like to see evidence that only motivation, and not anxiety, was manipulated by the instructions.
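A rough sketch of how such evidence might be tallied follows; the 7-point ratings are invented for illustration. Convergent evidence is a large gap between instruction conditions on the motivation measure; discriminant evidence is a near-zero gap on the anxiety measure:

```python
from statistics import mean

# Hypothetical 7-point ratings gathered after the instructions
motivation = {"intelligence_test": [6, 7, 6, 5, 7],
              "dull_task":         [3, 2, 4, 3, 2]}
anxiety    = {"intelligence_test": [4, 3, 4, 4, 3],
              "dull_task":         [4, 4, 3, 4, 3]}

def condition_gap(ratings):
    """Difference between the two instruction conditions' mean ratings."""
    return mean(ratings["intelligence_test"]) - mean(ratings["dull_task"])

motivation_gap = condition_gap(motivation)  # convergent: should be large
anxiety_gap = condition_gap(anxiety)        # discriminant: should be near zero
```

In practice a researcher would test these gaps inferentially rather than eyeball the means, but the logic of the comparison is the same.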
To do this, a researcher would have to demonstrate both the discriminant and convergent validity of the manipulation. When researchers demonstrate the validity of their manipulated variables, they generally obtain a measure of the independent variable construct after they have manipulated it. This is called a manipulation check, and it can provide evidence of the convergent validity of the manipulation. More than one variable might also be manipulated, as in a factorial design, which we discuss later in this chapter. In such designs, obtaining evidence for the construct validity of a manipulated independent variable requires (1) that the manipulated variable affect its corresponding manipulation check measure (e.g., that our motivation manipulation affects our measure of motivation) and (2) that the effect of that manipulated variable on the manipulation check measure does not depend on, or interact with, the other manipulated variable (e.g., that the effect of our motivation manipulation on our motivation measure is the same whether people are low or high on the other manipulated variable). Researchers rarely take the further step of demonstrating discriminant validity by showing that their manipulation has not created different levels of some other variables, for example, that it has not changed participants’ anxiety levels. Occasionally, however, when research includes this additional step, it becomes all the more persuasive. Manipulation checks are discussed in greater detail in Chapter 5. Constructing good operational definitions requires appropriate and accurate procedures to measure and manipulate variables. The art of finding suitable procedures cannot be taught with a set of rules but is acquired through experience. The real test of external validity for both measured and manipulated variables, however, rests on the confirmation of the findings in other settings.
In short, the best test of external validity is a replication of a study – a demonstration that the results can be repeated with different participants, procedures, experimenters, and operational definitions.

Alternative Experimental Designs

We already have briefly discussed two alternative designs for randomized experiments. In one, participants were randomly assigned to the levels of the independent variable and each participant was measured only once. In this case, the independent variable is said to vary between participants. In the other design, the independent variable varies within participants and each participant is measured under every level of the independent variable; the order of exposure to those levels is randomly determined. These two designs, both of which we talk about in more detail in this section, are not the only alternatives. A variety of other randomized experimental designs are also possible and useful. We use the following notation to describe different research designs:

X = a treatment, an independent variable, a cause
O = an observation, a dependent variable, an effect
R = participants who have been randomly assigned to the treatment condition

Design 1: Randomized Two-Group Design

R    X1    O1
     X2    O2

Participants are randomly assigned to the experimental treatment group (X1) or to a control group (X2). This is the design discussed earlier in which the independent variable varies between participants and has only two levels: treatment and control. The word “treatment” is simply verbal shorthand to identify the group that experienced the variable of interest; it does not necessarily imply some kind of intervention designed to help people.
This design contains all the bare essentials for a randomized experiment:

    random assignment
    treatment and no-treatment groups
    observations after the treatment

We must have at least two groups to know whether the treatment had an effect, and we must randomly assign individuals to groups so that the groups will be, on average, equivalent before treatment. Then we can attribute any posttreatment differences to the experimental treatment. We can rule out several rival explanations or threats to internal validity by using this design. We know that any posttreatment differences are not the result of a selection threat (barring any failure of randomization) because participants were randomly assigned rather than self-selected or systematically assigned to the two groups. We know also that the posttreatment differences are not a product of maturation because the two groups should have matured (e.g., aged or fatigued) at the same rate if they were tested at the same intervals after random assignment. We can rule out other alternative explanations not just by referring to random assignment but also by looking carefully at the experimental procedures to see whether it is plausible that the treatment group might have been exposed to some other events (historical events in the outside world or events within the experimental session) that the no-treatment group did not experience. If not, we can eliminate history as a rival explanation. If the two groups were tested or observed under similar circumstances, we can eliminate instrumentation differences as an explanation. Once we have eliminated these alternative explanations, we can feel quite confident that the experimental treatment caused any observed difference between the two groups (O1 and O2).
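One simple way to test for a difference between O1 and O2, sketched here with invented posttest scores, is a permutation test: if the treatment had no effect, randomly relabeling which scores came from which group should often produce a mean difference as large as the one observed:

```python
import random
from statistics import mean

def permutation_test(group1, group2, n_perm=5000, seed=0):
    """Proportion of random relabelings whose absolute mean difference
    is at least as large as the observed difference between groups."""
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(group1)]) - mean(pooled[len(group1):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical posttest scores: O1 (treatment) versus O2 (control)
p_value = permutation_test([14, 16, 15, 17, 18], [10, 11, 9, 12, 10])
```

The permutation test is only one of several analyses suitable for this design; it is attractive here because its logic mirrors the random assignment that justifies the design in the first place.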
Design 2: Pretest–Posttest Two-Group Design

R    O1    X1    O2
     O3    X2    O4

This design has an additional set of tests or observations of the dependent variable, called pretests, before the experimental treatment. Pretests have several advantages. They provide a check on the randomization and let the experimenter see whether the groups were equivalent before the treatment. Pretests also provide a more sensitive test of the effects of the treatment by letting participants serve as their own comparison. Instead of comparing only O2 and O4, the experimenter can compare the difference between participants’ pretest and posttest scores. In other words, O2 minus O1 can be compared with O4 minus O3. Because participants’ pretest scores all differ from one another and their posttest scores reflect some of these preexisting individual differences, the experimenter gains precision by making these sorts of comparisons rather than simply comparing O2 and O4. Researchers can compare change scores or include pretest scores as covariates in the analyses. To understand the benefits of this design, suppose two people were randomly assigned to different groups in an experiment on weight loss; Person A was assigned to the no-treatment control group and Person B to the weight-loss treatment group. If Person A weighed 130 pounds on the pretest and 130 pounds on the posttest, it is clear that being in the control group did not affect Person A’s weight. If Person B weighed 160 pounds on the pretest and 150 pounds on the posttest, it is plausible that the treatment caused Person B to lose 10 pounds. However, if the experimenter did not take pretest measures and looked only at the posttest weights, Person B’s 150 pounds compared to Person A’s 130 would make the treatment look bad. Therefore, having pretest information in this pretest–posttest two-group design gives the experimenter a more precise measure of treatment effects.
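The weight-loss example can be worked through directly; the arithmetic shows why change scores rescue the comparison:

```python
# Pretest and posttest weights (pounds) from the example above
person_a = {"pretest": 130, "posttest": 130}  # control group
person_b = {"pretest": 160, "posttest": 150}  # treatment group

def change(scores):
    """Posttest minus pretest: the participant serves as their own baseline."""
    return scores["posttest"] - scores["pretest"]

# Posttest-only comparison misleads: the treated person looks 20 pounds heavier
posttest_only = person_b["posttest"] - person_a["posttest"]

# Change-score comparison recovers the 10-pound treatment effect
change_difference = change(person_b) - change(person_a)
```

With real samples, the same logic applies to group means: O2 minus O1 compared with O4 minus O3.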
The pretest also has some disadvantages, however. It can sensitize participants to the purpose of the experiment and bias their posttest scores. If this occurs for the experimental and control groups alike, their posttest scores should be equally elevated or depressed; pretesting alone would then not be an alternative explanation for a difference between O2 and O4. However, the pretest might affect the treatment group differently from the control group; this would appear as a difference on the posttest and would be indistinguishable from a difference produced by the treatment alone. In sum, when pretesting affects both experimental conditions equally, it is a threat to external validity; participants’ responses to the second testing are not representative of how people would respond if they had not been given a pretest. When pretesting affects the experimental groups differentially, however, it becomes a threat to internal validity. Unfortunately, this kind of differential effect of pretesting is common. Take, for example, a persuasion study trying to change people’s attitudes toward capital punishment. A pretest asking for participants’ attitudes about capital punishment probably will alert participants to the focus of the study and thus make them particularly sensitive to the persuasion manipulation. Alert participants might realize that the experimenter is interested in capital punishment and that the persuasive message is supposed to change their attitudes, perhaps causing participants to change their responses on the posttest in an attempt to please the experimenter. Participants in the control group, however, who do not receive a persuasive message on the topic of capital punishment, do not realize that capital punishment is the focus of the experiment and hence feel no demand to change their attitudes on the posttest. Design 2 provides no solution to this problem.
Experimenters must therefore decide whether this is a plausible occurrence for any particular study. If it is plausible, they should avoid this design in favor of the simpler Design 1 or opt for the more complex Design 3 described next.

Design 3: Solomon Four-Group Design

R    O1    X1    O2    (Design 2)
     O3    X2    O4
           X1    O5    (Design 1)
           X2    O6

The third design combines Designs 1 and 2. With this design an experimenter can test decisively whether the posttest differences were caused by the treatment, the pretest, or the combination of treatment plus pretest. Design 3 is an expensive design because it requires four groups of participants to test the effects of only two levels of a treatment. The four groups are needed because in addition to the treatment and control groups, there are pretested and non-pretested groups. This design offers the separate advantages of Design 1 – no interference from pretesting effects – and Design 2 – greater precision from the pretest scores as baselines against which to measure the effects of the treatment. In addition, it enables the experimenter to see whether the combination of pretesting plus treatment produces an effect that is different from what we would expect if we simply added the separate effects of pretesting and treatment. Such combinations, if they are different from the sum of the two individual effects, are called interaction effects. They are similar to what occurs when two natural elements combine to produce a new effect, as hydrogen and oxygen together produce a new compound, water. The whole is different from or greater than the simple sum of the parts. For many problems studied by social scientists, interaction effects are important, for variables can affect other variables in complex ways. We need more than two-group designs to study these, and we need more than one independent variable because an interaction results from a combination of two or more causes, or independent variables.
Designs with two or more independent variables are called factorial designs.

Design 4: Between-Participants Factorial Design

        X1  Y1  O1
        X1  Y2  O2
R
        X2  Y1  O3
        X2  Y2  O4

The X is one independent variable; the Y is another. In a factorial design, two or more independent variables are presented in combination. The entire design contains every possible combination of the independent variables (also known as factors; hence the name, factorial design). If there are more than two independent variables and if each has more than two values, the design rapidly mushrooms because each additional variable or value greatly increases the number of conditions. We illustrate this fact using tables, which are the form most commonly used to diagram factorial designs. Table 10.1 illustrates the combination of two factors, or independent variables. In the language of experimental design, we call this a 2 × 2 (where the “×” is read out loud as “by”) factorial design, which means there are two factors and each has two levels. In this particular example, the two factors are feedback (positive/negative) and confederate gender (male/female). If we added a third factor, we would double the number of conditions if the additional factor also had two levels, triple it if the new factor had three levels, and so on. For instance, if we added the relative status of the confederate as another factor to the two factors in Table 10.1 and used three status categories – lower, same, or higher than the participant – we would have a 2 × 2 × 3 (between-participants) factorial design, with 12 conditions, shown in Table 10.2. This 12-cell design is much more complex than the original 2 × 2. It is triple the size and, therefore, either requires three times as many participants or spreads the same number of participants thinner, with one-third the number in each condition.
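The way conditions multiply can be seen by enumerating every combination of factor levels. A minimal Python illustration, using the feedback, gender, and status factors from this running example (the code itself is purely illustrative):

```python
from itertools import product

# Factor levels from the running example.
feedback = ["negative", "positive"]               # Factor X
confederate_gender = ["male", "female"]           # Factor Y
confederate_status = ["lower", "same", "higher"]  # Factor Z

# A 2 x 2 design: every combination of feedback and gender.
two_by_two = list(product(feedback, confederate_gender))
print(len(two_by_two))  # 4

# Adding a three-level factor triples the design: 2 x 2 x 3 = 12 conditions.
two_by_two_by_three = list(product(feedback, confederate_gender, confederate_status))
print(len(two_by_two_by_three))  # 12
```

Each tuple produced by `product` corresponds to one cell of the design, which is why each added factor multiplies, rather than adds to, the number of conditions.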
Table 10.1 A 2 × 2 Factorial Design

                        Factor Y: Confederate Gender
  Factor X: Feedback    Male                Female
  Negative              Negative, Male      Negative, Female
  Positive              Positive, Male      Positive, Female

Table 10.2 A 2 × 2 × 3 Factorial Design

                                                     Factor Y: Confederate Gender
  Factor X: Feedback   Factor Z: Confederate Status  Male (M)   Female (F)
  Negative (N)         Lower (L)                     N, L, M    N, L, F
                       Same (S)                      N, S, M    N, S, F
                       Higher (H)                    N, H, M    N, H, F
  Positive (P)         Lower (L)                     P, L, M    P, L, F
                       Same (S)                      P, S, M    P, S, F
                       Higher (H)                    P, H, M    P, H, F

The advantage of a factorial design involving more than a single independent variable is that we can examine interaction effects involving multiple independent variables in addition to the separate or main effects of those variables by themselves. Suppose we were interested in two independent variables, X and Y, and their effects on some dependent variable, O. We could design two different experiments, one in which we randomly assigned participants to levels of X and then examined effects on O, and a second in which we randomly assigned participants to levels of Y and examined effects on O. Alternatively, we could create a factorial design and randomly assign participants to all of the X−Y combinations of levels: X1−Y1, X1−Y2, X2−Y1, and X2−Y2. The advantage of this factorial design over the two separate single-factor experiments is that we can ask whether the effect of one of the independent variables is qualified by the other independent variable. If it is, the two independent variables are said to “interact” in producing O. Then, we cannot simply talk about the effect of X on O because the effect of X on O depends on the level of Y. Similarly, we cannot simply talk about the effect of Y on O because that effect depends on the level of X. We must talk about their joint or interactive effects on O. To describe an interaction effect more concretely, we consider a published example of a 2 × 2 design (Sinclair & Kunda, 2000, Study 2).
The study examined how participants responded to receiving either positive or negative feedback from a male or female manager. Participants were told that the study was a collaborative venture on the part of the university with local businesses to train personnel managers. Participants were asked to respond orally to an interpersonal skills test while a manager-in-training (actually a videotaped accomplice of the experimenter posing as a research participant) was allegedly listening from another room. Following the task, participants were shown one of four videotapes showing the alleged manager-in-training giving an evaluation of participants’ interpersonal skills. Half of the time the person on the videotape was female and half the time the person was male. Half of the time the feedback given was positive and half the time the feedback was negative. Participants were then asked to rate how skilled the manager was at evaluating them. This 2 × 2 (Manager Gender × Feedback) factorial design is depicted in Table 10.1. Notice that the gender variable in this study refers not to participants’ gender but to the people whom the participants rated. This distinction is important. Participants’ age and gender are characteristics they bring with them rather than experimental conditions to which people can be randomly assigned. The portion of a study that examines such individual difference variables is therefore not a true experiment. In contrast, the gender of an actor or stimulus person to whom participants respond is an experimental variable, because participants can be randomly assigned to interact with or observe a male or female actor. The researchers combined two independent variables – manager gender and the valence of the feedback – because they were particularly interested in the effect of the combination.
The dependent variable was participants’ ratings of how skilled they thought the confederate manager-in-training was. The experimenters expected that the effects of the confederate’s feedback would depend not only on whether the feedback was positive or negative but also on whether the feedback was delivered by a man or a woman. Table 10.3 displays the results of the experiment. Higher scores mean that participants evaluated the confederate’s skill more positively.

Table 10.3 Ratings of Male and Female Confederates who Delivered Positive or Negative Feedback

                   Confederate Gender
  Feedback         Male              Female
  Negative         M = 8.0           M = 6.8             M = 7.4 (Mean for negative)
  Positive         M = 8.8           M = 9.2             M = 9.0 (Mean for positive)
                   M = 8.4           M = 8.0
                   (Mean for males)  (Mean for females)

Let us review the effects that we can examine with a factorial design. All factorial designs provide information about the separate main effects of each independent variable and the interaction effects among the independent variables. The main effect shows whether one independent variable has an effect when we average across the levels of any other variable. Do not be misled by the term “main effect”; it does not mean the most important or primary result but rather the effect of one independent variable averaging across the other. In Table 10.3, the main effects for each variable are shown in the margins. Looking first at the results for the feedback factor, we can see that participants who received positive feedback rated the confederate much more positively (average = 9.0) than did participants who received negative feedback (average = 7.4). And, indeed, statistical analysis confirmed that this difference was statistically significant (i.e., not likely a chance occurrence).
Looking at the results for the confederate gender factor, we see that male confederates are rated slightly higher (average = 8.4) than are female confederates (average = 8.0), but the difference is small and not statistically significant. The primary benefit of the factorial design is that it shows us how independent variables interact in their effects on dependent variables. In other words, we can test interaction effects, that is, whether the effect of one independent variable on the dependent variable depends on, or is moderated by, the other independent variable. An independent variable or factor that alters the effect of another independent variable or factor on the dependent variable is referred to as a moderator variable. The decision about which variable moderates the other is drawn from theory, not just from the statistical outcomes. The values inside the four cells of Table 10.3 show the interaction effect. Interactions are often depicted in graphical form for ease of interpretation. Figure 10.1 displays the data from Table 10.3 in the form of a figure. The non-parallel lines indicate that there is an interaction that the statistical analysis revealed to be significant. The interaction indicates that the effect of feedback was negligible when the confederate was a man but considerable when the confederate was a woman. In other words, just because the main effect of confederate gender was not significant does not mean it had no effect; instead, participants rated the male confederate as equally skilled whether he had delivered flattering or unflattering feedback, but they rated the female confederate as more skilled when she had just delivered flattering feedback than when she had delivered negative feedback.
Thus, one could say that the effect of confederate gender on participants’ ratings was moderated by (or depended on) the valence of the feedback.

[Figure 10.1 Effect of the Interaction of Confederate Gender and Valence of Feedback on Participants’ Evaluations of the Confederate’s Skill at Providing Feedback. Vertical axis: mean rating (0–10); horizontal axis: male versus female confederate; separate lines for negative and positive feedback.]

Notice that testing an interaction is equivalent to asking whether two differences are different from each other. In this case, for example, the interaction indicates that the positive–negative feedback difference in ratings of the male confederate is different from the positive–negative feedback difference in ratings of the female confederate. Notice also that the same interaction can be interpreted in more than one way. In this case, for example, the interaction can also be interpreted as indicating that ratings of the male confederate were higher than those of the female confederate when the feedback was negative, but somewhat lower than those of the female confederate when the feedback was positive. The effect of one factor (e.g., male vs. female confederate) looking at only one level of another factor (e.g., positive feedback only) is known as a simple effect. The interaction thus tells us whether two simple effects differ from each other. But an interaction does not usually tell us whether either simple effect is significantly different from zero. In our example, the simple effect of feedback valence (positive vs. negative) for the female confederate must be significantly different from zero because the feedback valence difference for the male confederate is 0.8 and the interaction tells us that the two differ. However, it is not clear whether the simple effect of male versus female confederate was significant when feedback was negative.
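The marginal means, simple effects, and the difference of differences that defines the interaction can all be recomputed from the four cell means in Table 10.3. A quick Python check (the means come from the table; the code is only an illustration, not the analysis the authors ran):

```python
# Cell means from Table 10.3: (feedback, confederate gender) -> mean rating
means = {
    ("negative", "male"): 8.0, ("negative", "female"): 6.8,
    ("positive", "male"): 8.8, ("positive", "female"): 9.2,
}

# Marginal (main-effect) means: average across the other factor.
mean_negative = (means["negative", "male"] + means["negative", "female"]) / 2
mean_positive = (means["positive", "male"] + means["positive", "female"]) / 2
mean_male = (means["negative", "male"] + means["positive", "male"]) / 2
mean_female = (means["negative", "female"] + means["positive", "female"]) / 2
print(round(mean_negative, 1), round(mean_positive, 1))  # 7.4 9.0
print(round(mean_male, 1), round(mean_female, 1))        # 8.4 8.0

# Simple effects: the male-female difference at each level of feedback.
diff_negative = means["negative", "male"] - means["negative", "female"]  # 1.2
diff_positive = means["positive", "male"] - means["positive", "female"]  # -0.4

# The interaction is the difference between these two simple effects.
print(round(diff_negative - diff_positive, 1))  # 1.6
```

Note that a nonzero difference of differences in the sample does not by itself establish significance; that judgment requires the inferential tests described in the text.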
The significant interaction tells us that the male–female difference of 1.2 in the negative feedback condition differs significantly from the male–female difference of −.40 in the positive feedback condition, but we cannot be sure whether 1.2 differs from 0. Researchers therefore often report the results of additional “simple effects” significance tests to aid in the interpretation of interactions. They sometimes also interpret the same interaction in more than one way as the interpretations can yield different insights about the nature of the phenomenon being investigated. Interaction effects require more complex theoretical explanations than do main effects. Researchers must sufficiently develop their theories to explain why effects of one independent variable are different at different levels of the other independent variable. This complexity, however, is also one of the major strengths of factorial designs: By including more than one independent variable, the researcher is better able to identify and understand multiple and complex causes of a dependent variable. Thus, a major reason to use factorial designs is to test for interaction effects. Another reason is to be able to generalize the effects of one variable across levels of another variable. For instance, if we wanted to study the effects of being able to control noise (variable 1) on people’s ability to solve puzzles, we might vary the type of puzzle as a second independent variable. This would enable us to demonstrate that people perform better on not just one but two (or more) types of puzzles (variable 2) when they can control the noise in their environment. We add the second variable not because we expect it to make a difference but to demonstrate that it makes no difference. A third reason to include more than one independent variable in an experiment is to study the separate effects of the variables.
We might design a factorial study even if we expect to find only two main effects and no interaction because we can test the two main effects more efficiently and with fewer total participants in a factorial design than we could with two separate studies.

Repeated Measures Designs

Earlier we discussed the fact that experimental or independent variables could be manipulated within as well as between participants. Rather than assign different people to different treatments, the experimenter exposes the same persons to multiple treatments. Each participant is repeatedly treated and tested, and the variations caused by different treatments appear within the same person rather than between different groups of people. Such designs are randomized experimental designs as long as we randomly assign participants to be exposed to the various conditions in different orders. Not all independent variables can be used in repeated measures designs, just as not all variables can be manipulated experimentally. We earlier made the distinction between manipulated experimental variables and individual difference variables. Manipulated variables are designed by the experimenter and participants can be randomly assigned to manipulated treatments. In contrast, individual difference variables, such as age, height, personality traits, gender, race, and so on, come with participants. Individual difference variables impose restrictions on research design as well as analysis because they cannot be used as within-participants or repeated measures factors. When factors can be varied within participants, experimenters can use a design that requires fewer participants and provides more sensitive measures of the effects of a variable. For instance, if we wanted to study how quickly men and women can solve puzzles that are labeled “masculine problem” and “feminine problem,” we could use either a between-participants or a within-participants design.
The participants’ gender is an individual difference variable and must be a between-participants factor. The label on the puzzle could be either a between-participants or within-participants factor. If it were between participants and we wished to have 15 observations in each condition, 60 participants would be required, as shown in Table 10.4. The 60 observations would come from 60 different people.

Table 10.4 Illustration of the Number of Participants Needed for a Between-Participants Design

                          Gender Labeling of the Task
  Participant’s Gender    Masculine       Feminine
  Male                    n = 15 men      n = 15 men
  Female                  n = 15 women    n = 15 women
  Total N = 60 participants

We could, however, make the gender labeling of the task a within-participants factor and have each participant solve both a “masculine” and “feminine” labeled puzzle. In this case, as shown in Table 10.5, we would need only 30 participants, 15 men and 15 women, to get the same number of observations in each condition because each person would solve two puzzles.

Table 10.5 Illustration of the Number of Participants Needed for a Within-Participants Design

                          Gender Labeling of the Task
  Participant’s Gender    Masculine       Feminine
  Male                    15 men          (the same 15 men)
  Female                  15 women        (the same 15 women)
  Total N = 30 participants

Note that we now have one repeated measures or within-participants factor and one between-participants factor. Designs that include both within-participants (e.g., masculine vs. feminine puzzle) and between-participants factors (e.g., participant gender) are known as mixed models. In this case, however, participant gender is not manipulated and thus gender cannot be said to cause differences in the dependent variable. The other efficient feature of repeated measures designs is the precision gained by using participants as their own comparisons. Like the pretest observations of the pretest–posttest two-group design, the repeated measures give us individual baselines for each participant.
The 15 men who solve the “masculine” puzzle in Table 10.4 might vary widely in the time they require. One might solve the puzzle in 10 seconds and another might take 10 minutes. If each person takes one minute longer to solve the “feminine” than the “masculine” puzzle, it would not appear as a noticeable difference between the two puzzle groups if we used a between-participants design, but it might be a noticeable difference in a repeated measures design. In other words, repeated measures are more statistically powerful than are between-participants designs because analyses do not just compare differences between groups, but compare changes of each individual within groups. Recall from our discussion of statistical conclusion validity in Chapter 2 that statistical power refers to the likelihood of seeing an effect if the effect truly exists. Again, individual difference variables cannot be used with repeated measures; not even all manipulated variables are suitable as within-participants or repeated measures variables. Some manipulated variables would arouse participants’ suspicions about the purposes of the experiment. For instance, suppose we tried to use the ethnicity or gender of job applicants as a within-participants variable. If we presented prospective employers with two hypothetical job applications and résumés in which everything was identical except the ethnicity or gender of the applicant, the prospective employers could see immediately that we were testing to see whether they practice race or sex discrimination in hiring. Similarly, asking the same participants to judge the same message twice, once using the term “global warming” and once using “climate change,” as in our earlier example, would make the purpose of the experiment quite obvious. Researchers therefore sometimes use “filler tasks,” that is, asking participants to complete additional tasks to disguise the true purpose of the experiment.
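The power point made earlier in this section, that a constant within-person difference can be swamped by between-person variability, can be illustrated with a toy computation. The solve times below are hypothetical; only Python’s standard library is used:

```python
from statistics import mean, stdev

# Hypothetical solve times (seconds) for five participants on the
# "masculine"-labeled puzzle; people differ widely from one another.
masculine = [10, 45, 120, 300, 600]

# Suppose each person takes exactly 60 seconds longer on the
# "feminine"-labeled puzzle, as in the one-minute example in the text.
feminine = [t + 60 for t in masculine]

# Between-participants view: the 60-second effect is real but is
# dwarfed by person-to-person variability.
print(mean(feminine) - mean(masculine))  # 60.0
print(round(stdev(masculine)))           # 243 -- huge relative to 60

# Within-participants view: each person serves as their own baseline,
# so the effect is perfectly consistent across people.
diffs = [f - m for f, m in zip(feminine, masculine)]
print(diffs)         # [60, 60, 60, 60, 60]
print(stdev(diffs))  # 0.0 -- no within-person noise at all
```

With real data the within-person differences would not be perfectly constant, but the comparison shows why analyses of individual change scores can detect effects that group comparisons miss.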
Other variables are not suitable for repeated measures designs if they produce long-lasting effects that would carry over from one testing to the next. For instance, if we tried to compare the effects of alcohol and hallucinogenic drugs on drivers’ reaction times, we would not have them drink alcohol, give them a driver’s test, and then give them hallucinogenic drugs immediately after for a second test. In addition to the obvious ethical problems of administering drugs to experimental participants, we also would encounter practical problems. If we use repeated measures designs, we must be sure the effects of the first level of a treatment are gone before we try to administer subsequent levels. For this reason, repeated measures designs are generally not appropriate for examining factors that affect learning. Consider, for example, an experiment to determine which of two teaching strategies is more effective in helping participants learn a task or acquire certain knowledge. Once participants have learned the task or acquired the knowledge, they are unlikely to unlearn it.

Analyzing Data from Experimental Designs

Because data from experimental designs are typically scores that reflect the effects of treatments, the appropriate approaches for analyzing data are those that compare the means of the different treatment conditions. In the simplest instances of two groups, the analysis could be a t-test. More generally, however, analyses use a form of analysis of variance (ANOVA). Variations of the approach include those where there are multiple dependent variables, called multivariate analysis of variance (MANOVA), and those where there may be control variables whose effects are eliminated before looking at the mean differences, called analysis of covariance (ANCOVA). Perhaps most common are those ANCOVAs that control for pretest scores when looking at posttest scores.
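In the two-group case, the t-test and ANOVA routes give identical answers (F = t²). A hand computation with hypothetical posttest scores, sketched in Python from the standard textbook formulas:

```python
from statistics import mean

# Hypothetical posttest scores for a two-group experiment.
treatment = [8.0, 9.0, 7.5, 8.5, 9.5]
control = [6.0, 7.0, 6.5, 7.5, 5.5]

n1, n2 = len(treatment), len(control)
m1, m2 = mean(treatment), mean(control)

def ss(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

# Independent-samples t-test with a pooled variance estimate.
pooled_var = (ss(treatment) + ss(control)) / (n1 + n2 - 2)
t = (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA on the same data.
grand = mean(treatment + control)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
# With two groups, df_between = 1, so MS_between = SS_between;
# the within-groups mean square equals the pooled variance.
F = ss_between / pooled_var

print(round(F, 6) == round(t ** 2, 6))  # True
```

The same identity is what makes the regression (general linear model) route interchangeable with ANOVA for experimental data: coding the two groups as a predictor and testing its slope reproduces the same F.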
Sometimes researchers use a general linear model approach, which involves using regression approaches to produce the mean comparisons. The methods may look different, but are actually the same, as are t-test and ANOVA analyses of the same data. They can also easily accommodate cases in which the number of participants varies across conditions (see, e.g., Judd, McClelland, & Ryan, 2009).

Strengths and Weaknesses of Randomized Experiments

We have emphasized the strengths of randomized experiments. By randomly assigning people to experimental conditions, experimenters can be confident that subsequent differences on the dependent variable are caused, on average, by the treatments rather than preexisting differences among groups of people. Manipulated experimental variables, unlike individual difference variables, enable experimenters to conclude “This caused that.” No experimenter can be 100% sure that “this” experimental treatment was the cause of “that” effect, as there is always the possibility of a failure of randomization or undetected artifact, but randomized experiments can rule out many alternative explanations. Yet randomized experiments are not without their weaknesses. In this section, we describe some of the major drawbacks of randomized experiments. It is important to keep in mind, however, that these drawbacks are not inevitable condemnations of experimental designs. Not all experiments have these limitations and not all non-experimental studies are without them.

Experimental Artifacts

One set of extraneous variables that undermines the validity of research conclusions is artifacts. In research design, the word artifact refers to an unintended effect on the dependent variable that is caused by some feature of the experimental setting other than the independent variable.
Even with selection, history, maturation, instrumentation, and the other threats to internal validity taken care of, the results of research might not be true effects of the experimental treatment but instead be artifacts, or effects of some extraneous variables. For instance, experimenters can unwittingly influence their participants to behave in ways that confirm the hypothesis, particularly if the participants want to please the experimenter. Findings that result from such attempts are artifactual in the sense that they do not represent participants’ true responses to the independent variables of interest. Another example would be when participants respond in socially desirable ways rather than in ways that represent what they really believe and would do. A detailed discussion of such artifacts and their potential threats is given in Chapter 5, where we consider laboratory research in detail. Artifacts can occur regardless of the research design that is used. In laboratory settings, artifacts are just as likely when non-experimental or quasi-experimental designs are used. Artifacts can also occur in the field if, for example, the independent variable is confounded or covaries with some unintended aspect of the field setting.

External Validity

Experimental designs and procedures maximize the internal validity of research – they enable the researcher to rule out most rival explanations or threats to internal validity. There can be a trade-off, however. Experimenters might maximize internal validity at the expense of the external validity or generalizability of the results. Because many randomized experiments are conducted in laboratory settings (although they need not be), we might ask whether the findings extend beyond the laboratory. Can the experimenter talk about these phenomena in the world outside, or do they appear only in highly controlled and sometimes artificial conditions?
A common criticism of laboratory experiments in particular is that they are poor representations of natural processes. Some laboratory experiments, like Glass and Singer’s (1972) studies of noise, use remote analogues of real-world variables, like urban stress. Although some readers criticize such analogues as being artificial, we also can argue that the artificial conditions in these experiments are more effective ways to study the problem than are some more realistic conditions. The laboratory noise and laboratory measures of physiological and cognitive effects are all substitutes for the real phenomena; they are analogues and therefore artificial. Being artificial is not necessarily a disadvantage, however. Some laboratory analogues are more effective than their realistic but mundane counterparts and therefore make the research more persuasive, an issue we discuss in Chapter 5. In the final analysis, how realistic or generalizable any treatments and effects are can be discovered only by trying to replicate the findings in another setting.

The Problem of College Sophomores in the Laboratory

A third major criticism of experiments is not about the methods but about the subject populations. It questions the representativeness of typical research participants, who particularly in psychology often are college students participating in research to fulfill course requirements. Are college students representative of the larger population? It depends on how one defines the larger population as well as the particular research question. For many research purposes college sophomores are often considered to be no different from anyone else. For instance, to study a physiological variable such as the eye blink response, we might be able to assume that what is true for 18-year-old college students is also true for 6-year-old elementary school students and 40-year-old employees.
And some researchers argue that the particular population being studied is largely irrelevant if the desire is to establish that a causal relationship occurs between two variables. Once the existence of the relationship has been established, then researchers can worry about how generalizable the findings are and what their implications are. However, researchers who study social processes, such as the effects of contact with ethnic outgroup members on prejudice, would certainly be wise to study more heterogeneous people – in the same or different studies. Similarly, to study the effects of an economic variable, such as tax incentives for purchasing energy efficient appliances, it would be wise to include people with a range of incomes and some practical experience with the issue. Sears (1986) documented social psychologists’ overreliance on college students as participants, providing a compelling critique of the practice. More recently, other researchers have questioned our tendency to overgeneralize findings from research that typically relies on research participants in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies (Chiao & Cheon, 2010; Henrich, Heine, & Norenzayan, 2010). Many research conclusions that have been considered universal – even those concerning basic perceptual processes, such as visual illusions – appear to depend on one’s experiences. In short, the unique characteristics of our research participants can indeed affect our findings. What can be done about this problem? First, in many instances, researchers could reduce their reliance on college students and include other populations in their studies. The convenience of the college “subject pool” is hard for researchers to resist; many university-based researchers decide that they prefer to increase the volume of their research by using inexpensive and readily available college undergraduates, even at the cost of external validity.
Fortunately, the reliance on college undergraduates appears to be declining, partly as a result of technology that makes diverse populations more accessible. For example, the ability to develop cross-cultural research collaborations has improved dramatically and some studies can be conducted su
