Chapter 15
EVALUATION STUDIES: FROM CONTROLLED TO NATURAL SETTINGS

15.1 Introduction
15.2 Usability Testing
15.3 Conducting Experiments
15.4 Field Studies

Objectives
The main goals of the chapter are to accomplish the following:
• Explain how to do usability testing.
• Outline the basics of experimental design.
• Describe how to do field studies.

15.1 Introduction
Imagine that you have designed a new app to allow school children ages 9 or 10 and their parents to share caring for the class hamster over the school holidays. The app will schedule which children are responsible for the hamster and when, and it will also record when it is fed. The app will also provide detailed instructions about when the hamster is scheduled to go to another family and the arrangements about when and where it will be handed over. In addition, both teachers and parents will be able to access the schedule and send and leave messages for each other. How would you find out whether the children, their teacher, and their parents can use the app effectively and whether it is satisfying to use? What evaluation methods would you employ?

In this chapter, we describe evaluation studies that take place in a spectrum of settings, from controlled laboratories to natural settings. Within this range we focus on the following:
• Usability testing, which takes place in usability labs and other controlled lab-like settings
• Experiments, which take place in research labs
• Field studies, which take place in natural settings, such as people's homes, schools, work, and leisure environments

Source: http://geek-and-poke.com. Licensed under CC-BY 3.0

15.2 Usability Testing
The usability of products has traditionally been tested in controlled laboratory settings. This approach emphasizes how usable a product is. Initially, it was most commonly used to evaluate desktop applications, such as websites, word processors, and search tools. It is also important now, however, to test the usability of apps and other digital products. Performing usability testing in a laboratory, or in a temporarily assigned controlled environment, enables designers to control what users do and allows them to control the environmental and social influences that might impact the user's performance. The goal is to test whether the product being developed is usable by the intended user in order to achieve the tasks for which it was designed and whether users are satisfied with their experience. For some products, such as games, designers will also want to know whether their product is enjoyable and fun to use. (Chapter 1, "What Is Interaction Design?," discusses usability and user experience goals.)

15.2.1 Methods, Tasks, and Users
Collecting data about users' performance on predefined tasks is a central component of usability testing. As mentioned in Chapter 14, "Introducing Evaluation," a combination of methods is often used to collect data. The data includes video recordings of the users, including their facial expressions, logged keystrokes, and mouse and other movements, such as swiping and dragging objects. Sometimes, participants are asked to describe what they are thinking and doing out loud (the "think-aloud" technique) while carrying out tasks as a way of revealing what they are thinking and planning.
In addition, a user satisfaction questionnaire is used to find out how users actually feel about using the product by asking them to rate it using a number of scales after they interact with it. Structured or semistructured interviews may also be conducted with users to collect additional information about what they liked and did not like about the product. Sometimes, designers also collect data about how the product is used in the field.

Examples of the tasks that are typically given to users include searching for information, reading different typefaces (for example, Helvetica and Times), navigating through different menus, and uploading apps. Performance times and the number of the different types of actions carried out by users are the two main performance measures. Obtaining these two measures involves recording the time it takes typical users to complete a task, such as finding a website, and the number of errors that users make, such as selecting incorrect menu options when creating a visual display. The following quantitative performance measures, which were identified in the late 1990s, are still used as a baseline for collecting user performance data (Wixon and Wilson, 1997):
• Number of users completing a task successfully
• Time to complete a task
• Time to complete a task after a specified time away from the product
• Number and type of errors per task
• Number of errors per unit of time
• Number of navigations to online help or manuals
• Number of users making a particular error

A key concern when doing usability testing is the number of users that should be involved: early research suggests that 5 to 12 is an acceptable number (Dumas and Redish, 1999), though more is often regarded as being better because the results represent a larger and often broader selection of the target user population. However, sometimes it is reasonable to involve fewer users when there are budget and schedule constraints. For instance, quick feedback about a design idea, such as the initial placement of a logo on a website, can be obtained from only two or three users reporting on how quickly they spot the logo and whether they like its design. Sometimes, more users can be involved early on by distributing an initial questionnaire online to collect information about users' concerns. The main concerns can then be examined in more detail in a follow-up lab-based study with a small number of typical users.

This link provides a practical introduction to usability testing and describes how it relates to UX design: https://icons8.com/articles/usability-practical-definitionux-design/.

15.2.2 Labs and Equipment
Many large companies, such as Microsoft, Google, and Apple, test their products in custom-built usability labs that consist of a main testing lab with recording equipment and an observation room where the designers can watch what is going on and how the data collected is being analyzed. There may also be a reception area where users can wait, a storage area, and a viewing room for observers. These lab spaces can be arranged to superficially mimic features of the real world. For example, when testing an office product or a product for use in a hotel reception area, the lab can be set up to resemble those environments.
The lab is soundproofed and windowless, and co-workers and other workplace and social distractions are eliminated so that the users can concentrate on the tasks that have been set up for them to perform. While controlled environments like these enable researchers to capture data about users' uninterrupted performance, the impact that real-world interruptions can have on usability is not captured.

Typically, there are two to three wall-mounted video cameras that record the users' behavior, such as hand movements, facial expressions, and general body language. Microphones are placed near where the participants will be sitting to record their comments. Video and other data is fed through to monitors in an observation room, which is usually separated from the main lab or workroom by a one-way mirror so that designers can watch what participants are doing but not be seen by the participants. The observation room can be a small auditorium with rows of seats at different levels or, more simply, a small backroom consisting of a row of chairs facing the monitors. Figure 15.1 shows a typical arrangement in which designers in an observation room are watching a usability test through a one-way mirror, as well as watching the data being recorded on a video monitor.

Figure 15.1 A usability laboratory in which designers watch participants on a monitor and through a one-way mirror. Source: Helen Sharp

Usability labs can be expensive and labor-intensive to run and maintain. Therefore, less expensive and more versatile alternatives started to become popular in the early and mid-1990s. The development of mobile and remote usability testing equipment also corresponded with the need to do more testing in small companies and in other venues. Mobile usability equipment typically includes video cameras, laptops, eye-tracking devices, and other measuring equipment that can be set up temporarily in an office or other space, converting it into a makeshift usability laboratory. An advantage of this approach is that the equipment can be taken into work settings, enabling testing to be done on-site, which makes it less artificial and more convenient for the participants.

An increasing number of products are specifically designed for performing mobile evaluations. Some are referred to as lab-in-a-box or lab-in-a-suitcase because they pack away neatly into a convenient carrying case. The portable lab equipment typically consists of off-the-shelf components that plug into a laptop that can record video directly to hard disk, eye-trackers (some of which take the form of glasses for recording the user's gaze, as shown in Figure 15.2), and facial recognition systems for recording changes in the user's emotional responses.

Figure 15.2 The Tobii Glasses Mobile Eye-Tracking System. Source: Dalton et al. (2015), p. 3891. Reproduced with permission of ACM Publications

An example of a recent study in which eye-tracking glasses were used to record the eye gaze of people in a shopping mall is reported by Nick Dalton and his colleagues (Dalton et al., 2015). The goal of this study was to find out whether shoppers pay attention to large-format plasma screen displays when wandering around a large shopping mall in London. The displays varied in size, and some contained information about directions to different parts of the mall, while others contained advertisements.
Twenty-two participants (10 males and 12 females, aged 19 to 73 years old) took part in the study in which they were asked to carry out a typical shopping task while wearing Tobii Glasses Mobile Eye Tracking glasses (see Figure 15.2). These participants were told that the researchers were investigating what people look at while shopping; no mention was made of the displays. Each participant was paid £10 to participate in the study. They were also told that there would be a prize drawing after the study and that participants who won would receive a gift of up to £100 in value. Their task was to find one or more items that they would purchase if they won the prize drawing. The researchers did this so that the study was an ecologically valid in-the-wild shopping task, in which the participants focused on shopping for items that they wanted.

As the participants moved around the mall, their gaze was recorded and analyzed to determine the percentage of time that they were looking at different things. This was done by using software that converted eye-gaze movements so that they could be overlaid on a video of the scene. The researchers then coded the participants' gazes based on where they were looking (for instance, at the architecture of the mall, products, people, signage, large text, or displays). Several other quantitative and qualitative analyses were also performed. The findings from these analyses revealed that participants looked at displays, particularly large plasma screens, more than had been previously reported in earlier studies by other researchers.

Another trend in usability testing is to conduct remote, unmoderated usability testing in which users perform a set of tasks with a product in their own setting, and their interactions are logged remotely (Madathil and Greenstein, 2011). An advantage of this approach is that many users can be tested at the same time in real-world settings, and the logged data can be automatically compiled for data analysis. For example, clicks can be tracked and counted per page when users search for specific information on websites. This approach is particularly popular in large companies such as Microsoft and Google and in companies specializing in user testing (for example, Userzoom.com) that test products used across the world. With remote testing, large numbers of participants can be recruited who are able to participate at convenient times within their own time zones. As more and more products are designed for global markets, designers and researchers appreciate this flexibility. Remote testing also allows individuals with disabilities to be involved, as they can work from their own homes (Petrie et al., 2006).

15.2.3 Case Study: Testing the Usability of the iPad
When Apple's iPad first came onto the market, usability specialists Raluca Budiu and Jakob Nielsen from the Nielsen Norman Group conducted user tests to evaluate participants' interactions with websites and apps specifically designed for the iPad (Budiu and Nielsen, 2010). This classic study is presented here because it illustrates how usability tests are carried out and the types of modifications that are made to accommodate real-world constraints, such as having a limited amount of time to evaluate the iPad as it came onto the market.
Completing the study quickly was important because Raluca Budiu and Jakob Nielsen wanted to get feedback to the third-party developers who were creating apps and websites for the iPad. These developers were designing products with little or no contact with the iPad developers at Apple, who needed to keep details about the design of the iPad secret until it was launched. There was also considerable "hype" among the general public and others before the launch, so many people were eager to know if the iPad would really live up to expectations. Because of the need for a quick first study, and to make the results public around the time of the iPad launch, a second study was carried out in 2011, a year later, to examine some additional usability issues. (Reports of both studies are available on the Nielsen Norman Group website, which suggests reading the second study first. However, in this case study, the reports are discussed in chronological order: http://www.nngroup.com/reports/ipad-app-and-website-usability.)

15.2.3.1 iPad Usability: First Findings from User Testing
In the first study of iPad usability, Raluca Budiu and Jakob Nielsen (Budiu and Nielsen, 2010) used two usability evaluation methods: usability testing with think-aloud, in which users said what they were doing and thinking as they did it (discussed earlier in Chapter 8, "Data Gathering"), and an expert review, which will be discussed in the next chapter. A key question they asked was whether user expectations were different for the iPad as compared to the iPhone. They focused on this issue because a previous study of the iPhone showed that people preferred using apps to browsing the web because the latter was slow and cumbersome at that time. They wondered whether this would be the same for the iPad, where the screen was larger and web pages were more similar to how they appeared on the laptops or desktop computers that most people were accustomed to using at the time.

The usability testing was carried out in two cities in the United States: Fremont, California, and Chicago, Illinois. The test sessions were similar: the goal of both was to understand the typical usability issues that users encounter when using applications and accessing websites on the iPad. Seven participants were recruited. All were experienced iPhone users who had owned their phones for at least three months and who had used a variety of apps. One reason for selecting participants who used iPhones was that they would have had previous experience in using apps and the web with an interaction style similar to that of the iPad. The participants were considered to be typical users who represented the range of those who might purchase an iPad. Two participants were in their 20s, three were in their 30s, one was in their 50s, and one was in their 60s. Three were males, and four were females. Before taking part, the participants were asked to read and sign an informed consent form agreeing to the terms and conditions of the study.
This form described the following:
• What the participant would be asked to do
• The length of time needed for the study
• The compensation that would be offered for participating in the study
• The participants' right to withdraw from the study at any time
• A promise that the person's identity would not be disclosed
• An agreement that the data collected from each participant would be confidential and would not be made available to marketers or anyone other than the researchers

The Tests
The session started with participants being invited to explore any application they found interesting on the iPad. They were asked to comment on what they were looking for or reading, what they liked and disliked about a site, and what made it easy or difficult for them to carry out a task. A moderator sat next to each participant, observed, and took notes. The sessions were video-recorded, and they lasted about 90 minutes each. Participants worked on their own. After exploring the iPad, the participants were asked by the researchers to open specific apps or websites, explore them, and then carry out one or more tasks as they would have if they were on their own. Each participant was assigned the tasks in a random order. All of the apps that were tested were designed specifically for the iPad, but for some tasks the users were asked to do the same task on a website that was not specifically designed for the iPad. For these tasks, the researchers took care to balance the presentation order so that the app would be the first presented for some participants and the website would be first presented for others. More than 60 tasks were chosen from more than 32 different sites. Examples are shown in Table 15.1.

Table 15.1 Examples of some of the user tests used in the iPad evaluation (adapted from Budiu and Nielsen, 2010)
• iBook: Download a free copy of Alice's Adventures in Wonderland and read through the first few pages.
• Craigslist: Find some free mulch for your garden.
• Time Magazine: Browse through the magazine, and find the best pictures of the week.
• Epicurious: You want to make an apple pie tonight. Find a recipe and see what you need to buy in order to prepare it.
• Kayak: You are planning a trip to Death Valley in May this year. Find a hotel located in the park or close to the park.
Source: http://www.nngroup.com/reports/ipad-app-and-website-usability. Used courtesy of the Nielsen Norman Group

ACTIVITY 15.1
1. What was the main purpose of this study?
2. What aspects are considered to be important for good usability and user experience in this study?
3. How representative do you consider the tasks outlined in Table 15.1 to be for a typical iPad user?

Comment
1. The main purpose of the study was to find out how participants interacted with the iPad by examining how they interacted with the apps and websites that they used on the iPad. The findings were intended to help designers and developers determine whether specific websites need to be developed for the iPad.
2. The definition of usability in Chapter 1 suggests that the iPad should be efficient, effective, safe, easy to learn, easy to remember, and have good utility (that is, good usability). The definition of user experience suggests that it should also support creativity and be motivating, helpful, and satisfying to use (that is, to offer a good user experience).
The iPad is designed for the general public, so the range of users is broad in terms of age and experience with technology.
3. The tasks are a small sample of the total set prepared by the researchers. They cover shopping, reading, planning, and finding a recipe, which are common activities that people engage in during their everyday lives.

The Equipment
The testing was done using a setup (see Figure 15.3) similar to the mobile usability kit described earlier. A camera recorded the participant's interactions and gestures when using the iPad and streamed the recording to a laptop computer. A webcam was also used to record the expressions on the participants' faces and their think-aloud commentary. The laptop ran software called Morae, which synchronized these two data streams. Up to three observers (including the moderator sitting next to the participant) watched the video streams (rather than observing the participants directly) on their laptops situated on the table so that they did not invade the participants' personal space.

Figure 15.3 The setup used in the Chicago usability testing sessions. Source: http://www.nngroup.com/reports/ipad-app-and-website-usability. Used courtesy of the Nielsen Norman Group

Usability Problems
The main findings from the study showed that the participants were able to interact with websites on the iPad but that the experience was not optimal. For example, links on the pages were often too small to tap on reliably, and the fonts were sometimes difficult to read. The various usability problems identified in the study were classified according to a number of well-known interaction design principles and concepts, including mental models, navigation, quality of images, problems of using a touchscreen with small target areas, lack of affordances, getting lost in the application, effects of changing orientations, working memory, and feedback received. Getting lost in an application is an old but important problem for designers of digital products, and some participants got lost because they tapped the iPad too much, could not find a back button, and could not get back to the home page. One participant said, ". . . I like having everything there [on the home page]. That's just how my brain works" (Budiu and Nielsen, 2010, p. 58). Other problems arose because applications appeared differently in the two views possible on the iPad: portrait and landscape.

Interpreting and Presenting the Data
Based on the findings of their study, Budiu and Nielsen made a number of recommendations, including supporting standard navigation. The results of the study were written up as a report that was made publicly available to app developers and the general public. It provided a summary of key findings for the general public as well as specific details of the problems the participants had with the iPad so that developers could decide whether to make specific websites and apps for the iPad. While revealing how usable websites and apps are on the iPad, this user testing did not address how the iPad would be used in people's everyday lives. That would require a field study in which observations would be made of how people use iPads in their own homes, at school, in the gym, and when traveling, but this did not happen because of a lack of time.

ACTIVITY 15.2
1. Was the selection of participants for the iPad study appropriate? Justify your comments.
2. What might have been some of the problems with asking participants to think out loud as they completed the tasks?

Comments
1. The researchers tried to get a representative set of participants across an age and gender range with similar skill levels, that is, participants who had already used an iPhone. Ideally, it would have been good to have had additional participants to see whether the findings were generalizable across the broad range of users for whom the iPad was designed. However, it was important to do the study as quickly as possible and get the results out to developers and to the general public.
2. If a person is concentrating hard on a task, it can be difficult to talk at the same time. This can be overcome by asking participants to work in pairs so that they talk to each other about the problems that they encounter.

15.2.3.2 iPad Usability: Year One
Having rushed to get their first report out when the iPad first came onto the market, Raluca Budiu and Jakob Nielsen did more tests a year later in 2011. Even though many of their recommendations (for example, designing apps with back buttons, broader use of search, and direct access to news articles by touching headlines on the front page) had been implemented, there were still some problems. For example, users accidentally touched something and couldn't find their way back to their starting point. There were also magazine apps that required many steps to access a table of contents, which led users to make mistakes when navigating through the magazine.

Normally, a second usability study would not be done just a year after the first. However, the first study was done with participants who did not have direct experience with an iPad. A year later, the researchers were able to recruit participants with at least two months' experience of using an iPad. Another reason for doing a second study so close to the first one was that many of the apps and websites with usability problems were developed without the direct involvement of the Apple iPad development team due to the need for secrecy until the iPad was officially launched onto the market. This time, user testing was done with 16 iPad users, half men and half women. Fourteen were between 25 and 50 years of age, and two were older than 50. The new findings included splash screens that became boring after a while for regular users, too much information on the screen, fonts that were too small, and swiping the wrong items when several options were presented on-screen.

The first set of tests in 2010 illustrates how the researchers had to adapt their testing method to fit within a tight time period. Designers and researchers often have to modify how they go about user testing for a number of reasons. For example, in a study in Namibia, the researchers reported that questionnaires did not work well because the participants gave the responses that they thought the researchers wanted to hear (Paterson et al., 2011). However, "the interviews and observations, revealed that many participants were unable to solve all tasks and that many struggled . . . Without the interviews and observations many issues would not have been laid open during the usability evaluation" (Paterson et al., 2011, p. 245). This experience suggests that using multiple methods can reveal different usability problems. Even more important, it illustrates the importance of not taking for granted that a method used with one group of participants will work with another group, particularly when working with people from different cultures.
For another example of usability testing, see the report entitled "Case Study: Iterative Design and Prototype Testing of the NN/g Homepage" by Kathryn Whitenton and Sarah Gibbons from the Nielsen Norman Group (August 26, 2018), which describes how user testing with prototypes is integrated into the design process. You can view this report at: https://www.nngroup.com/articles/case-study-iterative-design-prototyping/.

A video illustrates the usability problems that a woman had when navigating a website to find the best deal for renting a car in her neighborhood. It illustrates how usability testing can be done in person by a designer sitting with a participant. The video is called Rocket Surgery Made Easy by Steve Krug, and you can view it here: https://www.youtube.com/watch?v=QckIzHC99Xc.

15.3 Conducting Experiments
In research contexts, specific hypotheses are tested that make a prediction about the way users will perform with an interface. The benefits are more rigor and confidence that one interface feature is easier to understand or faster to use than another. An example of a hypothesis is that it is easier to select options from context menus (that is, menus that provide options related to the context determined by the users' previous choices) than from cascading menus. Hypotheses are often based on a theory, such as Fitts' Law (see Chapter 16, "Evaluation: Inspections, Analytics, and Models"), or on previous research findings. Specific measurements provide a way of testing the hypothesis. In the previous example, the accuracy of selecting menu options could be compared by counting the number of errors made by participants when selecting from each menu type.

15.3.1 Hypotheses Testing
Typically, a hypothesis involves examining a relationship between two things, called variables. Variables can be independent or dependent. An independent variable is what the researcher manipulates (that is, selects), and in the previous example it is the menu type. The other variable is called the dependent variable, and in our example this is the time taken to select an option. It is a measure of user performance and, if our hypothesis is correct, will vary depending on the type of menu.

When setting up a hypothesis to test the effect of the independent variable(s) on the dependent variable, it is normal to derive a null hypothesis and an alternative one. The null hypothesis in our example would state that there is no difference in the time it takes users to select items (that is, the selection time) between context and cascading menus. The alternative hypothesis would state that there is a difference between the two regarding selection time. When a difference is specified but not its direction, it is called a two-tailed hypothesis. This is because it can be interpreted in two ways: either it is faster to select options from the context menu or from the cascading menu. Alternatively, the hypothesis can be stated in terms of one effect. This is called a one-tailed hypothesis, and it would state that "it is faster to select options from context menus," or vice versa. A one-tailed hypothesis would be preferred if there was a strong reason to believe it to be the case. A two-tailed hypothesis would be chosen if there was no reason or theory that could be used to support the case that the predicted effect would go one way or the other.
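Written out formally, the menu example looks like the following. This notation is only a sketch and does not appear in the text; the symbols mu_context and mu_cascading, standing for the mean selection times with each menu type, are introduced here purely for illustration.

```latex
% Null hypothesis: menu type has no effect on mean selection time
H_0 : \mu_{\text{context}} = \mu_{\text{cascading}}

% Two-tailed alternative: the mean selection times differ, in either direction
H_1 : \mu_{\text{context}} \neq \mu_{\text{cascading}}

% One-tailed alternative: selection times are shorter with context menus
H_1 : \mu_{\text{context}} < \mu_{\text{cascading}}
```

Choosing between the last two forms is exactly the one-tailed versus two-tailed decision just described.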
You might ask why you need a null hypothesis, since it seems to be the opposite of what the experimenter wants to find out. It is put forward so that the data can reject a statement without necessarily supporting the opposite statement. If the experimental data shows a big difference between selection times for the two menu types, then the null hypothesis that the menu type has no effect on selection time can be rejected, which is different from saying that there is an effect. Conversely, if there is no difference between the two, then the null hypothesis cannot be rejected (that is, the claim that it is faster to select options from context menus is not supported).

To test a hypothesis, the researcher has to set up the conditions and find ways to keep other variables constant to prevent them from influencing the findings. This is called the experimental design. Examples of other variables that need to be kept constant for both types of menus might include size and screen resolution. For example, if the text is in 10-point font size in one condition and 14-point font size in the other, then it could be this difference that causes the effect (that is, differences in selection speed are due to font size). More than one condition can also be compared with the control, for example, Condition 1 = context menu, Condition 2 = cascading menu, and Condition 3 = scrolling.

Sometimes, a researcher might want to investigate the relationship between two independent variables, for example, age and educational background. A hypothesis might be that young people are faster at searching the web than older people and that those with a scientific background are more effective at searching the web. An experiment would be set up to measure the time it takes to complete the task and the number of searches carried out. The analysis of the data would focus on the effects of the main variables (age and background) and also look for any interactions among them.

Hypothesis testing can also be extended to include even more variables, but this makes the experimental design more complex. An example is testing the effects of age and educational background on user performance for two methods of web searching: one using a search engine and the other manually navigating through links on a website. Again, the goal is to test the effects of the main variables (age, educational background, and web searching method) and to look for any interactions among them. However, as the number of variables in an experimental design increases, it becomes more difficult to work out from the data what is causing the results.

15.3.2 Experimental Design
A concern in experimental design is to determine which participants to involve for which conditions in an experiment. The experience of participating in one condition will affect the performance of those participants if they are asked to participate in another condition. For example, having learned about the way the heart works using multimedia, if one group of participants was exposed to the same learning material via another medium, for instance, virtual reality, and another group of participants was not, the participants who had the additional exposure to the material would have an unfair advantage. Furthermore, it would create bias if the participants in one condition within the same experiment had seen the content and the others had not.
The reason for this is that those who had the additional exposure to the content would have had more time to learn about the topic, and this would increase their chances of answering more questions correctly. In some experimental designs, however, it is possible to use the same participants for all conditions without letting such training effects bias the results. The names given to the different designs are different-participant design, same-participant design, and matched-pairs design.

In different-participant design, a single group of participants is allocated randomly to each of the experimental conditions so that different participants perform in different conditions. Another term used for this experimental design is between-subjects design. An advantage is that there are no ordering or training effects caused by the influence of participants' experience on one set of tasks on their performance on the next set, as each participant only ever performs under one condition. A disadvantage is that large numbers of participants are needed so that the effect of any individual differences among participants, such as differences in experience and expertise, is minimized. Randomly allocating the participants, and pretesting to identify any participants that differ strongly from the others, can help.

In same-participant design (also called within-subjects design), all participants perform in all conditions, so only half the number of participants is needed; the main reason for this design is to lessen the impact of individual differences and to see how performance varies across conditions for each participant. It is important to ensure that the order in which participants perform tasks in this setup does not bias the results. For example, if there are two tasks, A and B, half the participants should do task A followed by task B, and the other half should do task B followed by task A. This is known as counterbalancing. Counterbalancing neutralizes possible unfair effects of learning from the first task, known as the order effect.

In matched-pairs design (also known as pair-wise design), participants are matched in pairs based on certain user characteristics such as expertise and gender. Each pair is then randomly allocated to each experimental condition. A problem with this arrangement is that other important variables that have not been considered may influence the results. For example, experience in using the web could influence the results of tests to evaluate the navigability of a website. Therefore, web expertise would be a good criterion for matching participants. The advantages and disadvantages of using different experimental designs are summarized in Table 15.2.

Table 15.2 The advantages and disadvantages of different allocations of participants to conditions

Different participants (between-subjects design)
Advantages: No order effects.
Disadvantages: Many participants are needed. Individual differences among participants are a problem, which can be offset to some extent by randomly assigning participants to groups.

Same participants (within-subjects design)
Advantages: Eliminates individual differences between experimental conditions.
Disadvantages: Need to counterbalance to avoid ordering effects.

Matched participants (pair-wise design)
Advantages: No order effects. The effects of individual differences are reduced.
Disadvantages: Can never be sure that subjects are matched across variables that might affect performance.
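The allocation schemes summarized in Table 15.2 are mechanical enough to script. The following is a minimal sketch, not taken from the book, of how participants might be assigned for a between-subjects comparison of the two menu types and how task order could be counterbalanced in a within-subjects design; the participant labels and variable names are invented for illustration.

```python
import random

participants = [f"P{i:02d}" for i in range(1, 13)]  # 12 hypothetical participants
conditions = ["context menu", "cascading menu"]

# Between-subjects: shuffle, then split, so each participant performs in
# exactly one condition (no order effects, but more participants are needed).
random.shuffle(participants)
half = len(participants) // 2
between_subjects = {
    conditions[0]: participants[:half],
    conditions[1]: participants[half:],
}

# Within-subjects: every participant performs in both conditions, so the order
# is counterbalanced: half do A then B, and the other half do B then A.
within_subjects = {
    participant: (conditions if i % 2 == 0 else conditions[::-1])
    for i, participant in enumerate(participants)
}

print(between_subjects)
print(within_subjects)
```

In practice a matched-pairs allocation would add a step that sorts participants by the matching criterion (for example, web expertise) before pairing and randomly splitting each pair, but the overall pattern is the same.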
The data collected to measure user performance on the tasks set in an experiment usually includes response times for subtasks, total times to complete a task, and number of errors per task. Analyzing the data involves comparing the performance data obtained across the different conditions. The response times, errors, and so on are averaged across conditions to see whether there are any marked differences. Statistical tests are then used, such as t-tests, which statistically compare the differences between the conditions to reveal whether these are significant. For example, a t-test will reveal whether it is faster to select options from context or cascading menus.

15.3.3 Statistics: t-tests
There are many types of statistics that can be used to test the probability of a result occurring by chance, but t-tests are the most widely used statistical test in HCI and related fields, such as psychology. The scores, for example, the time taken by each participant to select items from a menu in each condition (that is, context and cascading menus), are used to compute the means (x̄) and standard deviations (SDs). The standard deviation is a statistical measure of the spread or variability around the mean. The t-test uses a simple equation to test the significance of the difference between the means for the two conditions. If they are significantly different from each other, we can reject the null hypothesis and in so doing infer that the alternative hypothesis holds.

A typical t-test result that compared menu selection times for two groups with 9 and 12 participants each might be as follows:

t = 4.53, p < 0.05, df = 19

The t-value of 4.53 is the score derived from applying the t-test; df stands for degrees of freedom, which represents the number of values in the conditions that are free to vary. This is a complex concept that we will not explain here other than to mention how it is derived and that it is always written as part of the result of a t-test. The df value is calculated by summing the number of participants in one condition minus 1 and the number of participants in the other condition minus 1. That is, df = (Na - 1) + (Nb - 1), where Na is the number of participants in one condition and Nb is the number of participants in the other condition. In our example, df = (9 - 1) + (12 - 1) = 19. p is the probability that the effect found occurred by chance. So, when p < 0.05, it means that the effect found is probably not due to chance and that there is only a 5 percent possibility that it could be by chance. In other words, there most likely is a difference between the two conditions. Typically, a value of p < 0.05 is considered good enough to reject the null hypothesis, although lower levels of p are more convincing, for instance, p < 0.01, where the effect found is even less likely to be due to chance, there being only a 1 percent chance of that being the case.
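As a worked illustration of the kind of result quoted above, the selection times from the two menu conditions can be compared with an independent-samples t-test. The sketch below uses invented timing data (it is not the data behind the t = 4.53 example) and assumes the SciPy library is available.

```python
from scipy import stats

# Hypothetical selection times in seconds: 9 participants used context menus
# and 12 used cascading menus (a different-participant design).
context_menu = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 1.9]
cascading_menu = [3.0, 2.8, 3.4, 2.9, 3.1, 3.3, 2.7, 3.2, 3.0, 2.9, 3.5, 3.1]

# Degrees of freedom as described in the text: (Na - 1) + (Nb - 1).
df = (len(context_menu) - 1) + (len(cascading_menu) - 1)

# Independent-samples t-test comparing the two condition means.
t_value, p_value = stats.ttest_ind(context_menu, cascading_menu)

print(f"t = {t_value:.2f}, p = {p_value:.4f}, df = {df}")
if p_value < 0.05:
    print("Reject the null hypothesis: menu type affects selection time.")
else:
    print("The null hypothesis cannot be rejected.")
```

With a within-subjects design, a paired test would be used instead, but the overall logic of comparing the condition means and checking p against 0.05 is the same.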
15.4 Field Studies
Increasingly, more evaluation studies are being done in natural settings with either little or no control imposed on participants' activities. This change is largely a response to technologies being developed for use outside office settings. For example, mobile, ambient, IoT, and other technologies are now available for use in the home, outdoors, and in public places. Typically, field studies are conducted to evaluate these user experiences.

As mentioned in Chapter 14, evaluations conducted in natural settings are very different from those conducted in controlled environments, where tasks are set and completed in an orderly way. In contrast, studies in natural settings tend to be messy in the sense that activities often overlap and are constantly interrupted by events that are not predicted or controlled, such as phone calls, texts, rain if the study is outside, and people coming and going. This follows the way that people interact with products in their everyday messy worlds, which is generally different from how they perform on fixed tasks in a laboratory setting. Evaluating how people think about, interact with, and integrate products within the settings in which they will ultimately be used gives a better sense of how successful the products will be in the real world. The trade-off is that it is harder to test specific hypotheses about an interface because many environmental factors that influence the interaction cannot be controlled. Therefore, it is not possible to account, with the same degree of certainty, for how people react to or use a product as can be done in controlled settings like laboratories. This makes it more difficult to determine what causes a particular type of behavior or what is problematic about the usability of a product. Instead, qualitative accounts and descriptions of people's behavior and activities are obtained that reveal how they used the product and reacted to its design.

Field studies can range in time from just a few minutes to a period of several months or even years. Data is collected primarily by observing and interviewing people, such as by collecting video, audio, field notes, and photos to record what occurs in the chosen setting. In addition, participants may be asked to fill out paper-based or electronic diaries, which run on smartphones, tablets, or other handheld devices, at particular points during the day. The kinds of reports that can be of interest include being interrupted during an ongoing activity, encountering a problem when interacting with a product, or being in a particular location, as well as how, when, and whether they return to the task that was interrupted. This technique is based on the experience sampling method (ESM), discussed in Chapter 8, which is often used in healthcare (Price et al., 2018). Data on the frequency and patterns of certain daily activities, such as the monitoring of eating and drinking habits, or social interactions like phone and face-to-face conversations, are often recorded. Software running on the smartphones triggers messages to study participants at certain intervals, requesting them to answer questions or fill out dynamic forms and checklists. These might include recording what they are doing, what they are feeling like at a particular time, where they are, or how many conversations they have had in the last hour.

As in any kind of evaluation, when conducting a field study, deciding whether to tell the people being observed, or asked to record information, that they are being studied and how long the study or session will last is more difficult than in a laboratory situation. For example, when studying people's interactions with an ambient display, or with the displays in the shopping mall described earlier (Dalton et al., 2015), telling them that they are part of a study will likely change the way they behave.
Similarly, if people are using an online street map while walking in a city, their interactions may take only a few seconds, so informing them that they are being studied would disrupt their behavior. It is also important to ensure the privacy of participants in field studies. For example, participants in field studies that run over a period of weeks or months should be informed about the study and asked to sign an informed consent form in the usual way, as mentioned in Chapter 14. In studies that last for a long time, such as those in people's homes, the designers will need to work out and agree with the participants what part of the activity is to be recorded and how. For example, if the designers want to set up cameras, they need to be situated unobtrusively, and participants need to be informed in advance about where the cameras will be and when they will be recording their activities. The designers will also need to work out in advance what to do if the prototype or product breaks down. Can the participants be instructed to fix the problem themselves, or will the designers need to be called in? Security arrangements will also need to be made if expensive or precious equipment is being evaluated in a public place. Other practical issues may also need to be considered depending on the location, product being evaluated, and the participants in the study.

The study in which the Ethnobot (Tallyn et al., 2018) was used to collect information about what users did and how they felt while walking around at the Royal Highland Show in Scotland (discussed in Chapter 14) was an example of a field study. A wide range of other studies have explored how new technologies have been used and adopted by people in their own cultures and settings. By adopted, we mean how the participants use, integrate, and adapt the technology to suit their needs, desires, and ways of living. The findings from studies in natural settings are typically reported in the form of vignettes, excerpts, critical incidents, patterns of behavior, and narratives to show how the products are being used, adopted, and integrated into their surroundings.

15.4.1 In-the-Wild Studies
For several years now, it has become increasingly popular to conduct in-the-wild studies to determine how people use and persist in using a range of new technologies or prototypes in situ. The term in-the-wild reflects the context of the study, in which new technologies are deployed and evaluated in natural settings (Rogers, 2011). Instead of developing solutions that fit in with existing practices and settings, researchers often explore new technological possibilities that can change and even disrupt participants' behavior. Opportunities are created, interventions are installed, and different ways of behaving are encouraged. A key concern is how people react, change, and integrate the technology into their everyday lives. The outcome of conducting in-the-wild studies for different periods and at different intervals can be revealing, demonstrating quite different results from those arising out of lab studies. Comparisons of findings from lab studies and in-the-wild studies have revealed that while many usability issues can be uncovered in a lab study, the way the technology is actually used can be difficult to discern.
These aspects include how users approach the new technology, the kinds of benefits that they can derive from it, how they use it in everyday contexts, and its sustained use over time (Rogers et al., 2013; Kjeldskov and Skov, 2014; Harjuniemi and Häkkila, 2018). The next case study describes a field study in which the researchers evaluated a pain-monitoring device with patients who had just had surgery.

CASE STUDY: A field study of a pain monitoring device
Monitoring patients' pain and ensuring that the amount of pain experienced by them after surgery is tolerable is an important part of helping patients to recover. However, accurate pain monitoring is a known problem among physicians, nurses, and caregivers. Collecting scheduled pain readings takes time, and it can be difficult because patients may be asleep or may not want to be bothered. Typically, pain is managed in hospitals by nurses asking patients to rate their pain on a 1–10 scale, which is then recorded by the nurse in the patients' records.

Before launching on the field study that is the focus of our case study, Blaine Price and his colleagues (Price et al., 2018) had already spent a considerable amount of time observing patients in hospitals and talking with nurses. They had also carried out usability tests to ensure that the design of Painpad, a pain-monitoring tangible device for patients to report their pain levels, was functioning properly. For example, they checked the usability of the display, the appropriateness of the device covering for the hospital environment, and whether the LED display was working and was readable. In other words, they ensured that they had a well-functioning prototype for the field study that they planned to carry out.

The goal of the field study was to evaluate the use of Painpad by patients recovering from ambulatory surgery (total hip or knee replacement) in the natural environments of two UK hospitals. Painpad (see Figure 15.4) enables patients to monitor their own pain levels by pressing the keys on the pad to record their pain rating. The researchers were interested in many aspects related to how patients interacted with Painpad, particularly in how robust and easy it was to use in the hospital environments. They also wanted to see whether the patients rated their pain every two hours as they should do and how the patients' ratings using Painpad compared with the ratings that the nurses collected. They also looked for insights about the preferences and needs of the older patients who used Painpad and for design insights around visibility, customizability, ease of operation, and the contextual factors that affected its usability in hospital environments.

Figure 15.4 Painpad, a tangible device for inpatient self-logging of pain. Source: Price et al. (2018). Reproduced with permission of ACM Publications

Data Collection and Participants
Two studies were conducted that involved 54 people (31 in one study and 23 in another). Data screening excluded participants who did not provide data using Painpad or for whom the nurses did not collect data that could be compared with the Painpad data. Because of the confidential nature of the study, ethical considerations were carefully applied to ensure that the data was stored securely and that the patients' privacy was assured. Thirteen of the patients were male, and 41 were female. They ranged in age from 32–88, with mean and median ages of 64.6 and 64.5.
The time they spent in the hospital ranged from 1–7 days, with an average stay of 2–3 days. After returning from surgery, the patients were each given a Painpad that stayed by the side of their bed. Patients were encouraged to use it at their earliest convenience. The Painpad was programmed to prompt the patients to report their pain levels every two hours. This two-hour interval was based on the hospital's desired clinical target for collecting pain data. Each time a pain rating was due, alternating red and green lights flashed on the Painpad for up to five minutes, and an audio notification of a few seconds sounded. The patients' pain rating was automatically time-stamped by the Painpad and stored in a secure database.

In addition to the pain scores collected using Painpad, the nurses also collected verbal pain scores from the patients every two hours. These scores were entered into the patients' charts, later entered into a database by a senior staff nurse, and made available to the researchers for comparison with the Painpad data. When the patients were ready to leave the second hospital mentioned, they were given a short questionnaire that asked whether Painpad was easy to use, how often they made mistakes using it, and whether they noticed the flashing light and sound notifications. They were also asked to rate how satisfied they were with Painpad on a 1–5 Likert rating scale and to make any other comments that they wanted to share about their experience in a free text field.

Data Analysis and Presentation
Three types of data analysis were used by the researchers. They examined how satisfied the patients were with Painpad based on the questionnaire responses, how the patients complied with the bi-hourly requests to rate their pain on Painpad, and how the data collected with Painpad compared with the data collected by the nurses.

Nineteen fully completed satisfaction questionnaires were collected. They indicated that Painpad was well received and easy to use (mean rating 4.63 on a scale of 1–5, where 5 was the highest rating) and that it was easy to remember to use it. Sixteen of the respondents commented that they never made an error entering their pain ratings, the aesthetics of Painpad were rated as "good," and participants were "mostly satisfied" with it. Responses to the flashing lights that were intended to draw patients' attention to Painpad were polarized. Most patients noticed the lights most of the time, while others only noticed the lights sometimes, and three patients said they did not notice them at all. The effectiveness of the sound alert received a middle rating; some patients thought it was "too loud and annoying," and others thought it was too soft. More nuanced reactions and ideas were collected from the free-text response box on the questionnaire. For example, one patient (P49) wrote, "I think it is useful for monitoring the pattern of pain over the day which can be changeable." Patient P52 commented, "A day-to-day chart might be helpful." Some patients, who had limited dexterity or other challenges, reported that their ability to use Painpad was compromised because Painpad was sometimes hard to reach or to hear.

After removing duplicate entries, there were 824 pain scores provided by the patients using Painpad compared with 645 scores collected by the nurses. This indicated that the patients recorded more pain scores than would typically be collected in the hospital by nurses. To examine how the patients complied with using Painpad every two hours, compared with the scores collected by the nurses, the researchers had to define acceptable time ranges of compliance. For example, they accepted all of the scores that were submitted 15 minutes before and 15 minutes after the bi-hourly time schedule for reporting pain scores.
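A check like this, which treats a pain score as compliant if it falls within 15 minutes on either side of a scheduled two-hour prompt, is straightforward to express in code. The sketch below is illustrative only: the timestamps are invented and the function is not the researchers' actual analysis pipeline.

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=15)
INTERVAL = timedelta(hours=2)

def count_compliant_scores(score_times, first_prompt, last_prompt):
    """Count scores logged within +/- 15 minutes of a bi-hourly prompt."""
    # Build the schedule of two-hourly prompts for the patient's stay.
    prompts = []
    t = first_prompt
    while t <= last_prompt:
        prompts.append(t)
        t += INTERVAL
    # A score counts as compliant if it is close enough to any scheduled prompt.
    return sum(
        1 for score in score_times
        if any(abs(score - prompt) <= TOLERANCE for prompt in prompts)
    )

# Invented example: prompts run from 08:00 to 20:00 and the patient logs three scores.
first = datetime(2018, 5, 1, 8, 0)
last = datetime(2018, 5, 1, 20, 0)
logged = [
    datetime(2018, 5, 1, 8, 10),   # 10 minutes after the 08:00 prompt: compliant
    datetime(2018, 5, 1, 11, 0),   # an hour from the 10:00 and 12:00 prompts: not compliant
    datetime(2018, 5, 1, 13, 50),  # 10 minutes before the 14:00 prompt: compliant
]

print(count_compliant_scores(logged, first, last))  # prints 2
```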
This analysis showed that the Painpad scores indicated stronger compliance with the two-hour schedule than the scores collected by the nurses. Overall, the evaluation of Painpad indicated that it was a successful device for collecting patients' pain scores in hospitals. Of course, there are still more questions for Blaine Price and his team to investigate. An obvious one is this: "Why did the patients give more pain scores and adhere more strongly to the scheduled pain recording times with Painpad than with the nurses?"

ACTIVITY 15.3
1. Why do you think Painpad was evaluated in the field rather than in a controlled laboratory setting?
2. Two types of data were collected in the field study: pain ratings and user satisfaction questionnaires. What does each type contribute to our understanding of the design of Painpad?

Comment
1. The researchers wanted to find out how Painpad would be used by patients who had just had ambulatory surgery. They wanted to know whether the patients liked using Painpad, whether they liked its design, and what problems they experienced when using it over a period of several days within hospital settings. During the early development of Painpad, the researchers carried out several usability evaluations to check that it was suitable for testing in real hospital environments. It is not possible to do a similar evaluation in a laboratory because it would be difficult, if not impossible, to create the realistic and often unpredictable events that happen in hospitals (for example, visitors coming into the ward, conversations with doctors and nurses, and so forth). Furthermore, the kind of pain that patients experience after surgery does not occur, nor can it be simulated, in participants in lab studies. The researchers had already evaluated Painpad's usability, and now they wanted to see how it was used in hospitals.
2. Two kinds of data were collected. Pain data was logged on Painpad and recorded independently by the nurses every two hours. This data enabled the researchers to compare the pain data recorded using Painpad with the data collected by the nurses. A user satisfaction questionnaire was also given to some of the patients. The patients answered questions by selecting a rating from a Likert scale. The patients were also invited to give comments and suggestions in a free text box. These comments helped the researchers to get a more nuanced view of the patients' needs, likes, and dislikes. For example, they learned that some patients were hampered from taking full advantage of Painpad because of other problems, such as poor hearing and restricted movement.

15.4.2 Other Perspectives
Field studies may also be conducted where a behavior of interest to the researchers reveals itself only after using a particular type of software for a long time, such as a complex design program or data visualization tool.
For example, the expected changes in user problem-solving strategies using a sophisticated visualization tool for knowledge discovery may emerge only after days or weeks of active use because it takes time for users to become familiar, confident, and competent with the tool (Shneiderman and Plaisant, 2006). To evaluate the efficacy of such tools, users are best studied in realistic settings in their own workplaces so that they can deal with their own data and set their own agenda for extracting insights relevant to their professional goals. These long evaluations of how experts learn and interact with tools for complex tasks typically start with an initial interview in which the researchers check that the participant has a problem to work on, available data, and a schedule for completion. These are fundamental attributes that have to be present for the evaluation to proceed. Then the participant will get an introductory training session with the tool, followed by 2–4 weeks of novice usage, followed by 2–4 weeks of mature usage, leading to a semistructured exit interview. Additional assistance may be provided by the researcher as needed, thereby reducing the traditional separation between researcher and participant, but this close connection enables the researcher to develop a deeper understanding of the users' struggles and successes wit