Chapter 14 I N T R O D U C I N G E VA L U AT I O N 14.1 Introduction 14.2 The Why, What, Where, and When of Evaluation 14.3 Types of Evaluation 14.4 Evaluation Case Studies 14.5 What Did We Learn from the Case Studies? 14.6 Other Issues to Consider When Doing Evaluation Objectives The main goals of this chapter are to accomplish the following: • Explain the key concepts and terms used in evaluation. • Introduce a range of different types of evaluation methods. • Show how different evaluation methods are used for different purposes at different stages of the design process and in different contexts of use. • Show how evaluation methods are mixed and modified to meet the demands of evaluating novel systems. • Discuss some of the practical challenges of doing evaluation. • Illustrate through short case studies how methods discussed in more depth in Chapters 8, 9, and 10 are used in evaluation and describe some methods that are specific to evaluation. • Provide an overview of methods that are discussed in detail in the next two chapters. 14.1 Introduction Imagine that you designed an app for teenagers to share music, gossip, and photos. You prototyped your first design and implemented the core functionality. How would you find out whether it would appeal to them and whether they will use it? You would need to evaluate it—but how? This chapter presents an introduction to the main types of evaluation and the methods that you can use to evaluate design prototypes and design concepts. 496 14 I N T R O D U C I N G E VA L U AT I O N Evaluation is integral to the design process. It involves collecting and analyzing data about users’ or potential users’ experiences when interacting with a design artifact such as a screen sketch, prototype, app, computer system, or component of a computer system. A central goal of evaluation is to improve the artifact’s design. Evaluation focuses on both the usability of the system (that is, how easy it is to learn and to use) and on the users’ experiences when interacting with it (for example, how satisfying, enjoyable, or motivating the interaction is). Devices such as smartphones, iPads, and e-readers, together with the pervasiveness of mobile apps and the emergence of IoT devices, have heightened awareness about usability and interaction design. However, many designers still assume that if they and their colleagues can use a product and find it attractive, others will too. The problem with this assumption is that designers may then design only for themselves. Evaluation enables them to check that their design is appropriate and acceptable for the target user population. There are many different evaluation methods. Which to use depends on the goals of the evaluation. Evaluations can occur in a range of places such as in labs, people’s homes, outdoors, and work settings. Evaluations usually involve observing participants and measuring their performance during usability testing, experiments, or field studies in order to evaluate the design or design concept. There are other methods, however, that do not involve participants directly, such as modeling users’ behavior and analytics. Modeling users’ behavior provides an approximation of what users might do when interacting with an interface; these models are often done as a quick way of assessing the potential of different interface configurations. Analytics provide a way of examining the performance of an already existing product, such as a website, so that it can be improved. 
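To give a flavor of what analytics involves at its simplest, the sketch below shows how a handful of basic metrics (visits per page, average time on a page, and a crude bounce rate) might be computed from a click log. It is a minimal, hypothetical illustration: the log format, field names, and values are invented for the example, and real analytics services such as Google Analytics do far more, but the underlying idea of aggregating logged interactions is the same.

```python
# Minimal sketch of web-analytics-style aggregation over a click log.
# The log format, field names, and values are hypothetical, for illustration only.
from collections import defaultdict
from statistics import mean

# Each record: (visitor_id, page, seconds_spent_on_page)
click_log = [
    ("v1", "home", 42), ("v1", "products", 95),
    ("v2", "home", 12),
    ("v3", "home", 63), ("v3", "checkout", 120),
]

visits_per_page = defaultdict(int)   # number of visits each page received
time_on_page = defaultdict(list)     # dwell times, averaged below
pages_by_visitor = defaultdict(set)  # which pages each visitor saw

for visitor, page, seconds in click_log:
    visits_per_page[page] += 1
    time_on_page[page].append(seconds)
    pages_by_visitor[visitor].add(page)

# Treat a visitor who only ever viewed one page as having "bounced".
bounce_rate = sum(1 for pages in pages_by_visitor.values()
                  if len(pages) == 1) / len(pages_by_visitor)

print("Visits per page:", dict(visits_per_page))
print("Average time on home page (s):", mean(time_on_page["home"]))
print("Bounce rate:", round(bounce_rate, 2))
```

Metrics like these reveal what users do with an existing product, which is why analytics are good for tracking use but not for explaining why users behave as they do.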
The level of control on what is evaluated varies; sometimes there is none, such as for studies in the wild, and in others there is considerable control over which tasks are performed and the context, such as in experiments. In this chapter, we discuss why evaluation is important, what needs to be evaluated, where evaluation should take place, and when in the product lifecycle evaluation is needed. Some examples of different types of evaluation studies are then illustrated by short case studies. 14.2 The Why, What, Where, and When of Evaluation Conducting evaluations involves understanding not only why evaluation is important but also what aspects to evaluate, where evaluation should take place, and when to evaluate. 14.2.1 Why Evaluate? User experience involves all aspects of the user’s interaction with the product. Nowadays users expect much more than just a usable system—they also look for a pleasing and engaging experience from more products. Simplicity and elegance are valued so that the product is a joy to own and use. From a business and marketing perspective, well-designed products sell. Hence, there are good reasons for companies to invest in evaluating the design of products. Designers can focus on real problems and the needs of different user groups and make informed decisions about the design, rather than on debating what each other likes or dislikes. It also enables problems to be fixed before the product goes on sale. 14.2 T H E W H Y, W H A T, W H E R E , A N D W H E N O F E V A L U A T I O N Source: © Glasbergen. Reproduced with permission of Glasbergen Cartoon Service ACTIVITY 14.1 Identify two adults and two teenagers prepared to talk with you about their Facebook usage (these may be family members or friends). Ask them questions such as these: How often do you look at Facebook each day? How many photos do you post? What kind of photos do you have in your albums? What kind of photo do you have as your profile picture? How often do you change it? How many Facebook friends do you have? What books and music do you list? Are you a member of any groups? Comment As you may know, some teenagers are leaving Facebook, while some adults, often parents and grandparents, continue to be avid users, so you may have had to approach several teenagers before finding two who would be worth talking to. Having found people who use Facebook, you probably heard about different patterns of use between the adults and the teenagers. Teenagers are more likely to upload selfies and photos of places they have just visited on sites such as Instagram or that they sent to friends on WhatsApp. Adults tend to spend time discussing family issues, the latest trends in fashion, news, and politics. They frequently post pictures and descriptions for family members and friends about where they went on vacation or of their children and grandchildren. After doing this activity, you should be aware that different kinds of users may use the same software in different ways. It is therefore important to include a range of different types of users in your evaluations. 14.2.2 What to Evaluate What to evaluate ranges from low-tech prototypes to complete systems, from a particular screen function to the whole workflow, and from aesthetic design to safety features. Developers of a new web browser may want to know whether users find items faster with their product. Developers of an ambient display may be interested in whether it changes people’s 497 498 14 I N T R O D U C I N G E VA L U AT I O N behavior. 
Game app developers will want to know how engaging and fun their games are compared with those of their competitors and how long users will play them. Government authorities may ask if a computerized system for controlling traffic lights results in fewer accidents or if a website complies with the standards required for users with disabilities. Makers of a toy may ask whether 6-year-olds can manipulate the controls, whether they are engaged by its furry cover, and whether the toy is safe for children. A company that develops personal, digital music players may want to know whether people from different age groups and living in different countries like the size, color, and shape of the casing. A software company may want to assess market reaction to its new home page design. A developer of smartphone apps for promoting environmental sustainability in the home may want to know if their designs are enticing and whether users continue to use their app. Different types of evaluations will be needed depending on the type of product, the prototype or design concept, and the value of the evaluation to the designers, developers, and users. In the end, the main criteria are whether the design does what the users need and want it to do; that is, will they use it? ACTIVITY 14.2 What aspects would you want to evaluate for the following systems: 1. A personal music service? 2. A website for selling clothes? Comment 1. You would need to discover how well users can select tracks from potentially thousands of tunes and whether they can easily add and store new music. 2. Navigation would be a core concern for both examples. Users of a personal music service will want to find tracks to select quickly. Users wanting to buy clothes will want to move quickly among pages displaying clothes, comparing them, and purchasing them. In addition, do the clothes look attractive enough to buy? Other core aspects include how trustworthy and how secure the procedure is for taking customer credit card details. 14.2.3 Where to Evaluate Where evaluation takes place depends on what is being evaluated. Some characteristics, such as web accessibility, are generally evaluated in a lab because it provides the control necessary to investigate systematically whether all of the requirements are met. This is also true for design choices, such as choosing the size and layout of keys for a small handheld device for playing games. User experience aspects, such as whether children enjoy playing with a new toy and for how long before they get bored, can be evaluated more effectively in natural settings, which are often referred to as in-the-wild studies. Unlike a lab study, seeing children play in a natural setting will reveal when the children get bored and stop playing with the toy. In a lab study, the children are told what to do, so the UX researchers cannot easily see how the children naturally engage with the toy and when they get 14.2 T H E W H Y, W H A T, W H E R E , A N D W H E N O F E V A L U A T I O N bored. Of course, the UX researchers can ask the children whether they like it or not, but sometimes children will not say what they really think because they are afraid of causing offense. Remote studies of online behavior, such as social networking, can be conducted to evaluate natural interactions of participants in the context of their interaction, for example, in their own homes or place of work. 
Living labs (see Box 14.1) have also been built that are a compromise between the artificial, controlled context of a lab and the natural, uncontrolled nature of in-the-wild studies. They provide the setting of a particular type of environment, such as the home, a workplace, or a gym, while also giving the ability to control, measure, and record activities through embedding technology in them. ACTIVITY 14.3 A company is developing a new car seat to monitor whether a person is starting to fall asleep while driving and to provide a wake-up call using olfactory and haptic feedback. Where would you evaluate it? Comment It would be initially important to conduct lab-based experiments using a car simulator to see the effectiveness of the new type of feedback—in a safe setting, of course! You would also need to find a way to try to get the participants to fall asleep at the wheel. Once established as an effective mechanism, you would then need to test it in a more natural setting, such as a race track, airfield, or safe training circuit for new drivers, which can be controlled by the experimenter using a dual-control car. 14.2.4 When to Evaluate The stage in the product lifecycle when evaluation takes place depends on the type of product and the development process being followed. For example, the product being developed could be a new concept, or it could be an upgrade to an existing product. It could also be a product in a rapidly changing market that needs to be evaluated to see how well the design meets current and predicted market needs. If the product is new, then considerable time is usually invested in market research and discovering user requirements. Once these requirements have been established, they are used to create initial sketches, a storyboard, a series of screens, or a prototype of the design ideas. These are then evaluated to see whether the designers have interpreted the users’ requirements correctly and embodied them in their designs appropriately. The designs will be modified according to the evaluation feedback and new prototypes developed and subsequently evaluated. When evaluations are conducted during design to check that a product continues to meet users’ needs, they are known as formative evaluations. Formative evaluations cover a broad range of design processes, from the development of early sketches and prototypes through to tweaking and perfecting a nearly finished design. 499 500 14 I N T R O D U C I N G E VA L U AT I O N Evaluations that are carried out to assess the success of a finished product are known as summative evaluations. If the product is being upgraded, then the evaluation may not focus on discovering new requirements but may instead evaluate the existing product to ascertain what needs improving. Features are then often added, which can result in new usability problems. At other times, attention is focused on improving specific aspects, such as enhanced navigation. As discussed in earlier chapters, rapid iterations of product development that embed evaluations into short cycles of design, build, and test (evaluate) are common. In these cases, the evaluation effort may be almost continuous across the product’s development and deployment lifetime. For example, this approach is sometimes adopted for government websites that provide information about Social Security, pensions, and citizens’ voting rights. 
Many agencies, such as the National Institute of Standards and Technology (NIST) in the United States, the International Standards Organization (ISO), and the British Standards Institute (BSI) set standards by which particular types of products, such as aircraft navigation systems and consumer products that have safety implications for users, have to be evaluated. The Web Content Accessibility Guidelines (WCAG) 2.1 describe how to design websites so that they are accessible. WCAG 2.1 is discussed in more detail in Box 16.2. 14.3 Types of Evaluation We classify evaluations into three broad categories, depending on the setting, user involvement, and level of control. These are as follows: • Controlled settings directly involving users (examples are usability labs and research labs): Users’ activities are controlled to test hypotheses and measure or observe certain behaviors. The main methods are usability testing and experiments. • Natural settings involving users (examples are online communities and products that are used in public places): There is little or no control of users’ activities to determine how the product would be used in the real world. The main method used is field studies (for example in-the-wild studies). • Any settings not directly involving users: Consultants and researchers critique, predict, and model aspects of the interface to identify the most obvious usability problems. The range of methods includes inspections, heuristics, walk-throughs, models, and analytics. There are pros and cons of each evaluation category. For example, lab-based studies are good at revealing usability problems, but they are poor at capturing context of use; field studies are good at demonstrating how people use technologies in their intended setting, but they are often time-consuming and more difficult to conduct (Rogers et al., 2013); and modeling and predicting approaches are relatively quick to perform, but they can miss unpredictable usability problems and subtle aspects of the user experience. Similarly, analytics are good for tracking the use of a website but are not good for finding out how users feel about a new color scheme or why they behave as they do. Deciding on which evaluation approach to use is determined by the goals of the project and on how much control is needed to find out whether an interface or device meets those goals and can be used effectively. For example, in the case of the music service mentioned earlier, this includes finding out how users use it, whether they like it, and what problems they experience with the functions. In turn, this requires determining how they carry out various tasks using 14.3 T Y P E S O F E VA L U AT I O N the interface operations. A degree of control is needed when designing the evaluation study to ensure participants try all of the tasks and operations for which the service is designed. 14.3.1 Controlled Settings Involving Users Experiments and user tests are designed to control what users do, when they do it, and for how long. They are designed to reduce outside influences and distractions that might affect the results, such as people talking in the background. The approach has been extensively and successfully used to evaluate software applications running on laptops and other devices where participants can be seated in front of them to perform a set of tasks. 
Usability Testing This approach to evaluating user interfaces involves collecting data using a combination of methods in a controlled setting, for example, experiments that follow basic experimental design, observation, interviews, and questionnaires. Often, usability testing is conducted in labs, although increasingly interviews and other forms of data collection are being done remotely via phone and digital communication (for instance, through Skype or Zoom) or in natural settings. The primary goal is to determine whether an interface is usable by the intended user population to carry out the tasks for which it was designed. This involves investigating how typical users perform on typical tasks. By typical, we mean the users for whom the system is designed (for example, teenagers, adults, and so on) and the activities that it is designed for them to be able to do (such as, purchasing the latest fashions). It often involves comparing the number and kinds of errors that users make between versions and recording the time that it takes them to complete the task. As users perform the tasks, they may be recorded on video. Their interactions with the software may also be recorded, usually by logging software. User satisfaction questionnaires and interviews can also be used to elicit users’ opinions about how they liked the experience of using the system. This data can be supplemented by observation at product sites to collect evidence about how the product is being used in the workplace or in other environments. Observing users’ reactions to an interactive product has helped developers reach an understanding of usability issues, which would be difficult for them to glean simply by reading reports or listening to presentations. The qualitative and quantitative data that is collected using these different techniques are used in conjunction with each other to form conclusions about how well a product meets the needs of its users. Usability testing is a fundamental, essential HCI process. For many years, usability testing has been a staple of companies, which is used in the development of standard products that go through many generations, such as word processing systems, databases, and spreadsheets (Johnson, 2014; Krug, 2014; Redish, 2012). The findings from usability testing are often summarized in a usability specification that enables developers to test future prototypes or versions of the product against it. Optimal performance levels and minimal levels of acceptance are generally specified, and current levels are noted. Changes in the design can then be implemented, such as navigation structure, use of terms, and how the system responds to users. These changes can then be tracked. While usability testing is well established in UX design, it has also started to gain more prominence in other fields such as healthcare, particularly as mobile devices take an increasingly central role (Schnall et al., 2018) in hospitals and for monitoring one’s own health (for instance, Fitbit and the Polar series, Apple Watch, and so forth). Another trend reported by Kathryn Whitenton and Sarah Gibbons (2018) from the Nielsen/Norman (NN/g) Usability 501 502 14 I N T R O D U C I N G E VA L U AT I O N Consulting Group is that while usability guidelines have tended to be stable over time, audience expectations change. 
For example, Whitenton and Gibbons report that since the last major redesign of the NN/g home page a few years ago, both the content and the audience’s expectations about the attractiveness of the visual design have evolved. However, they stress that even though visual design has assumed a bigger and more important role in design, it should never replace or compromise basic usability. Users still need to be able to carry out their tasks effectively and efficiently. ACTIVITY 14.4 Look at Figure 14.1, which shows two similarly priced devices for recording activity and measuring heart rate: (a) Fitbit Charge and (b) Polar A370. Assume that you are considering buying one of these devices. What usability issues would you want to know about, and what aesthetic design issues would be important to you when deciding which one to purchase? (a) (b) Figure 14.1 Devices for monitoring activity and heart rate (a) Fitbit Charge and (b) Polar A370 Source: Figure 14.1 (a) Fitbit (b) Polar Comment There are several usability issues to consider. Some that you might be particularly interested in finding out about include how comfortable the device is to wear, how clearly the information is presented, what other information is presented (for example, time), how long the battery lasts before it needs to be recharged, and so forth. Most important of all might be how accurate the device is, particularly for recording heart rate if that is a concern for you. Since these devices are worn on your wrist, they can be considered to be fashion items. Therefore, you might want to know whether they are available in different colors, whether they are bulky and likely to rub on clothes and cause damage, and whether they are discrete or clearly noticeable. 14.3 T Y P E S O F E VA L U AT I O N Experiments are typically conducted in research labs at universities or commercial labs to test such hypotheses. These are the most controlled evaluation settings, where researchers try to remove any extraneous variables that may interfere with the participant’s performance. The reason for this is so that they can reliably say that the findings arising from the experiment are due to the particular interface feature being measured. For example, in an experiment comparing which is the best way for users to enter text when using an iPad or other type of tablet interface, researchers would control all other aspects of the setting to ensure that they do not affect the user’s performance. These include providing the same instructions to all of the participants, using the same tablet interface, and asking the participants to do the same tasks. Depending on the available functionality, conditions that might be compared could be typing using a virtual keyboard, typing using a physical keyboard, and swiping using a virtual keyboard. The goal of the experiment would be to test whether one type of text input is better than the others in terms of speed of typing and number of errors. A number of participants would be brought into the lab separately to carry out the predefined set of text entry tasks, and their performance would be measured in terms of time consumed and how many errors are made, for example, selecting the wrong letter. The data collected would then be analyzed to determine whether the scores for each condition were significantly different. 
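To make the analysis step concrete, the sketch below shows one way the text-entry data might be compared. It is a hypothetical illustration only: the timing values are invented, the condition is treated as a between-subjects factor, and a one-way ANOVA with a follow-up t-test is just one defensible choice; a real study would pick the test to match its actual design (for instance, a repeated-measures analysis if every participant tried every input method).

```python
# Hypothetical sketch: comparing task-completion times (in seconds) for three
# text-entry conditions. All values are invented for illustration.
from scipy import stats

virtual_typing  = [48, 52, 47, 55, 50, 49, 53, 51]  # typing on a virtual keyboard
physical_typing = [44, 46, 43, 48, 45, 47, 44, 46]  # typing on a physical keyboard
virtual_swiping = [58, 61, 57, 63, 60, 59, 62, 58]  # swiping on a virtual keyboard

# One-way ANOVA: is there any difference among the three condition means?
f_stat, p_value = stats.f_oneway(virtual_typing, physical_typing, virtual_swiping)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    # Follow up with pairwise comparisons; a real analysis would correct for
    # multiple comparisons (e.g., Bonferroni) and also analyze error counts.
    t_stat, p_pair = stats.ttest_ind(virtual_typing, physical_typing)
    print(f"Virtual vs. physical typing: t = {t_stat:.2f}, p = {p_pair:.4f}")
```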
If the performance measures obtained for the virtual keyboard were significantly faster than those for the other two and had the least number of errors, one could say that this method of text entry is the best. Testing in a laboratory may also be done when it is too disruptive to evaluate a design in a natural setting, such as in a military conflict. BOX 14.1 Living Labs Living labs have been developed to evaluate people’s everyday lives, which would be simply too difficult to assess in usability labs, for example, to investigate people’s habits and routines over a period of several months. An early example of a living lab was the Aware Home (Abowd et al., 2000) in which the house was embedded with a complex network of sensors and audio/video recording devices that recorded the occupants’ movements throughout the house and their use of technology. This enabled their behavior to be monitored and analyzed, for example, their routines and deviations. A primary motivation was to evaluate how real families would respond and adapt to such a setup over a period of several months. However, it proved difficult to get families to agree to leave their own homes and live in a living lab home for that long. Ambient-assisted homes have also been developed where a network of sensors is embedded throughout someone’s home rather than in a special, customized building. One rationale is to enable disabled and elderly people to lead safe and independent lives by providing a nonintrusive system that can remotely monitor and provide alerts to caregivers in the event of an accident, illness, or unusual activities (see Fernández-Luque et al., 2009). The term living lab is also used to describe innovation networks in which people gather in person and virtually to explore and form commercial research and development collaborations (Ley et al., 2017). Nowadays, many living labs have become more like commercial enterprises, which offer facilities, infrastructure, and access to participating communities, bringing together users, developers, researchers, and other stakeholders. Living labs are being developed that form an 503 504 14 I N T R O D U C I N G E VA L U AT I O N integral part of a smart building that can be adapted for different conditions in order to investigate the effects of different configurations of lighting, heating, and other building features on the inhabitant’s comfort, work productivity, stress levels, and well-being. The Smart Living Lab in Switzerland, for example, is developing an urban block, including office buildings, apartments, and a school, to provide an infrastructure that offers opportunities for researchers to investigate different kinds of human experiences within built environments (Verma et al., 2017). Some of these spaces are large and may house hundreds and even thousands of people. People themselves are also being fitted with mobile and wearable devices that measure heart rate, activity levels, and so on, which can then be aggregated to assess the health of a population (for example, students at a university) over long periods of time. Hofte et al. (2009) call this a mobile living lab approach, noting how it enables more people to be studied for longer periods and at the times and locations where observation by researchers is difficult. 
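As a simple illustration of the kind of aggregation such a mobile living lab might perform, the sketch below averages daily resting heart rates across a small group of wearers to give a population-level summary. The participant IDs, dates, and readings are invented, and a real deployment would involve far more data as well as consent management and anonymization.

```python
# Hypothetical sketch: aggregating wearable heart-rate readings across wearers.
# Participant IDs, dates, and values are invented for illustration.
from collections import defaultdict
from statistics import mean

# Each record: (participant_id, date, resting_heart_rate_bpm)
readings = [
    ("p01", "2024-05-01", 62), ("p02", "2024-05-01", 71), ("p03", "2024-05-01", 58),
    ("p01", "2024-05-02", 64), ("p02", "2024-05-02", 69), ("p03", "2024-05-02", 60),
]

per_day = defaultdict(list)
for participant, day, bpm in readings:
    per_day[day].append(bpm)

# Population-level summary: mean resting heart rate per day.
for day in sorted(per_day):
    print(f"{day}: mean resting HR = {mean(per_day[day]):.1f} bpm "
          f"({len(per_day[day])} wearers)")
```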
Citizen science, in which volunteers work with scientists to collect data on a scientific research issue, such as biodiversity (for instance, iNaturalist.org), monitoring the flowering times of plants over tens or hundreds of years (Primak, 2014), or identifying galaxies online (for example, https://www.zooniverse.org/projects/zookeeper/galaxy-zoo/), can also be thought of as a type of living lab, especially when the behavior of the participants, their use of technology, and the design of that technology are also being studied. Lab in the Wild (http://www .LabintheWild.org), for example, is an online site that hosts volunteers who participate in a range of projects. Researchers analyzed more than 8,000 comments from volunteers involved in four experiments. They concluded that such online sites have potential as online research labs (Oliveira et al., 2017) that can be studied over time and hence form a type of living lab. DILEMMA Is a Living Lab Really a Lab? The concept of a living lab differs from a traditional view of a lab insofar as it is trying to be both natural and experimental and the goal is to bring the lab into the home (or other natural setting) or online. The dilemma is how artificial to make the more natural setting; where does the balance lie in setting it up to enable the right level of control to conduct research and evaluation without losing the sense of it being natural? 14.3.2 Natural Settings Involving Users The goal of field studies is to evaluate products with users in their natural settings. Field studies are used primarily to • Help identify opportunities for new technology • Establish the requirements for a new design • Facilitate the introduction of technology or inform deployment of existing technology in new contexts 14.3 T Y P E S O F E VA L U AT I O N Methods that are typically used are observation, interviews, and interaction logging (see Chapters 8 and 9). The data takes the form of events and conversations that are recorded by the researchers as notes, or through audio or video recording, or by the participants as diaries and notes. The goal is to be unobtrusive and not to affect what people do during the evaluation. However, it is inevitable that some methods will influence how people behave. For example, diary studies require people to document their activities or feelings at certain times, and this can make them reflect on and possibly change their behavior. During the last 15 years, there has been a trend toward conducting in-the-wild studies. These are essentially user studies that look at how new technologies or prototypes have been deployed and used by people in various settings, such as outdoors, in public places, and in homes. Sometimes, a prototype that is deployed is called a disruptive technology, where the aim is to determine how it displaces an existing technology or practice. In moving into the wild, researchers inevitably have to give up control of what is being evaluated in order to observe how people approach and use—or don’t use—technologies in their everyday lives. For example, a researcher might be interested in observing how a new mobile navigation device will be used in urban environments. To conduct the study, they would need to recruit people who are willing to use the device for a few weeks or months in their natural surroundings. They might then tell the participants what they can do with the device. 
Other than that, it is up to the participants to decide how to use it and when, as they move among work or school, home, and other places. The downside of handing over control is that it makes it difficult to anticipate what is going to happen and to be present when something interesting does happen. This is in contrast to usability testing where there is always an investigator or camera at hand to record events. Instead, the researcher has to rely on the participants recording and reflecting on how they use the product, by writing up their experiences in diaries, filling in online forms, and/or taking part in intermittent interviews. Field studies can also be virtual, where observations take place in multiuser games such as World of Warcraft, online communities, chat rooms, and so on. A goal of this kind of field study is to examine the kinds of social processes that occur in them, such as collaboration, confrontation, and cooperation. The researcher typically becomes a participant and does not control the interactions (see Chapters 8 and 9). Virtual field studies have also become popular in the geological and biological sciences because they can supplement studies in the field. Increasingly, online is partnered with a real-world experience so that researchers and students get the best of both situations (Cliffe, 2017). 14.3.3 Any Settings Not Involving Users Evaluations that take place without involving users are conducted in settings where the researcher has to imagine or model how an interface is likely to be used. Inspection methods are commonly employed to predict user behavior and to identify usability problems based on knowledge of usability, users’ behavior, the contexts in which the system will be used, and the kinds of activities that users undertake. Examples include heuristic evaluation that applies knowledge of typical users guided by rules of thumb and walkthroughs that involve stepping through a scenario or answering a set of questions for a detailed prototype. Other techniques include analytics and models. The original heuristics used in heuristic evaluation were for screen-based applications (Nielsen and Mack, 1994; Nielsen and Tahir, 2002). These have been adapted to develop new sets of heuristics for evaluating web-based products, mobile systems, collaborative 505 506 14 I N T R O D U C I N G E VA L U AT I O N technologies, computerized toys, information visualizations (Forsell and Johansson, 2010), and other new types of systems. One of the problems with using heuristics is that designers can sometimes be led astray by findings that are not as accurate as they appeared to be at first (Tomlin, 2010). This problem can arise from different sources, such as a lack of experience and the biases of UX researchers who conduct the heuristic evaluations. Cognitive walk-throughs involve simulating a user’s problem-solving process at each step in the human-computer dialogue and checking to see how users progress from step to step in these interactions (see Wharton et al., 1994 in Nielsen and Mack, 1994). During the last 15 years, cognitive walk-throughs have been used to evaluate smartphones (Jadhav et al., 2013), large displays, and other applications, such as public displays (Parker et al., 2017). A key feature of cognitive walk-throughs is that they focus on evaluating designs for ease of learning. Analytics is a technique for logging and analyzing data either at a customer’s site or remotely. 
Web analytics is the measurement, collection, analysis, and reporting of Internet data to understand and optimize web usage. Examples of web analytics include the number of visitors to a website home page over a particular time period, the average time users spend on the home page, which other pages they visit, or whether they leave after visiting the homepage. For example, Google provides a commonly used approach for collecting analytics data that is a particularly useful method for evaluating design features of a website (https:// marketingplatform.google.com/about/analytics/). As part of the massive open online courses (MOOCs) and open educational resources (OERs) movement, learning analytics has evolved and gained prominence for assessing the learning that takes place in these environments. The Open University in the United Kingdom, along with others, has published widely on this topic, describing how learning analytics are useful for guiding course and program design and for evaluating the impact of pedagogical decision-making (Toetenel and Bart, 2016). This web page provides information about learning analytics and learning design: https://iet.open.ac.uk/themes/learning-analytics-and-learning-design Models have been used primarily for comparing the efficacy of different interfaces for the same application, for example, the optimal arrangement and location of features. A wellknown approach uses Fitts’ law to predict the time it takes to reach a target using a pointing device (MacKenzie, 1995) or using the keys on a mobile device or game controller (Ramcharitar and Teather, 2017). 14.3.4 Selecting and Combining Methods The three broad categories identified previously provide a general framework to guide the selection of evaluation methods. Often, combinations of methods are used across the categories to obtain a richer understanding. For example, sometimes usability testing conducted in labs is combined with observations in natural settings to identify the range of usability problems and find out how users typically use a product. 14.4 E VA L U AT I O N C A S E S T U D I E S There are both pros and cons for controlled and uncontrolled settings. The benefits of controlled settings include being able to test hypotheses about specific features of the interface where the results can be generalized to the wider population. A benefit of uncontrolled settings is that unexpected data can be obtained that provides quite different insights into people’s perceptions and their experiences of using, interacting, or communicating through the new technologies in the context of their everyday and working lives. 14.3.5 Opportunistic Evaluations 14.4 Evaluation Case Studies Evaluations may be detailed, planned studies, or opportunistic explorations. The latter explorations are generally done early in the design process to provide designers with feedback quickly about a design idea. Getting this kind of feedback early in the design process is important because it confirms whether it is worth proceeding to develop an idea into a prototype. Typically, these early evaluations are informal and do not require many resources. For example, the designers may recruit a few local users and ask their opinions. Getting feedback this early in design provides feedback early on when it is easier to make changes to an evolving design. Opportunistic evaluations with users can also be conducted to hone the target audience so that subsequent evaluation studies can be more focused. 
Opportunistic evaluations can also be conducted in addition to more formal evaluations. Two contrasting case studies are described in this section to illustrate how evaluations can take place in different settings with different amounts of control over users' activities. The first case study (section 14.4.1) describes a classic experiment that tested whether it was more exciting playing against a computer versus playing against a friend in a collaborative computer game (Mandryk and Inkpen, 2004). Though it was published more than 15 years ago, we are keeping this case study in this edition of the book because it provides a concise and clear description of the variety of measures that were used in the experiment. The second case study (section 14.4.2) describes an ethnographic field study in which a bot, known as Ethnobot, was developed to prompt participants to answer questions about their experiences while walking around a large outdoor show (Tallyn et al., 2018).

14.4.1 Case Study 1: An Experiment Investigating a Computer Game

For games to be successful, they must engage and challenge users. Criteria for evaluating these aspects of the user experience are therefore needed. In this case study, physiological responses were used to evaluate users' experiences when playing against a friend and when playing alone against the computer (Mandryk and Inkpen, 2004). Regan Mandryk and Kori Inkpen conjectured that physiological indicators could be an effective way of measuring a player's experience. Specifically, they designed an experiment to evaluate the participants' engagement while playing an online ice-hockey game. Ten participants, who were experienced game players, took part in the experiment. During the experiment, sensors were placed on the participants to collect physiological data. The data collected included measurements of the moisture produced by sweat glands of their hands and feet and changes in heart and breathing rates. In addition, they videoed the participants and asked them to complete user satisfaction questionnaires at the end of the experiment. To reduce the effects of learning, half of the participants played first against a friend and then against the computer, and the other half played against the computer first. Figure 14.2 shows the setup for recording data while the participants were playing the game.

Figure 14.2 The display shows the physiological data (top right), two participants, and a screen of the game they played. Source: Mandryk and Inkpen (2004). Physiological Indicators for the Evaluation of Co-located Collaborative Play, CSCW'2004, pp. 102–111. Reproduced with permission of ACM Publications

Results from the user satisfaction questionnaire revealed that the mean ratings on a 1–5 scale for each item indicated that playing against a friend was the favored experience (Table 14.1). Data recorded from the physiological responses was compared for the two conditions and in general revealed higher levels of excitement when participants played against a friend than when they played against the computer. The physiological recordings were also compared across participants and, in general, indicated the same trend. Figure 14.3 shows a comparison for two participants.

              Playing Against Computer      Playing Against Friend
              Mean        St. Dev.          Mean        St. Dev.
Boring        2.3         0.949             1.7         0.949
Challenging   3.6         1.08              3.9         0.994
Easy          2.7         0.823             2.5         0.850
Engaging      3.8         0.422             4.3         0.675
Exciting      3.5         0.527             4.1         0.568
Frustrating   2.8         1.14              2.5         0.850
Fun           3.9         0.738             4.6         0.699

Table 14.1 Mean subjective ratings given on a user satisfaction questionnaire using a five-point scale, in which 1 is lowest and 5 is highest for the 10 players

[Figure 14.3: two charts of skin response over time (seconds); (a) participant 2 when a goal is scored and (b) participant 9 during a fight sequence (with "Fight begin" and "Fight end" marked), each comparing the Friend and Computer conditions]

Figure 14.3 (a) A participant's skin response when scoring a goal against a friend versus against the computer, and (b) another participant's response when engaging in a hockey fight against a friend versus against the computer Source: Mandryk and Inkpen (2004). Physiological Indicators for the Evaluation of Co-located Collaborative Play, CSCW'2004, pp. 102–111. Reproduced with permission of ACM Publications

Identifying strongly with an experience state is indicated by a higher mean. The standard deviation indicates the spread of the results around the mean. Low values indicate little variation in participants' responses; high values indicate more variation. Because of individual differences in physiological data, it was not possible to compare directly the means of the two sets of data collected: subjective questionnaires and physiological measures. However, by normalizing the results, it was possible to correlate the results across individuals. This indicated that the physiological data gathering and analysis methods were effective for evaluating levels of challenge and engagement. Although not perfect, these two kinds of measures offer a way of going beyond traditional usability testing in an experimental setting to get a deeper understanding of user experience goals.

ACTIVITY 14.5
1. What kind of setting was used in this experiment?
2. How much control did the researchers exert?
3. Which types of data were collected?

Comment
1. The experiment took place in a research lab, which is a controlled setting.
2. The evaluation was strongly controlled by the evaluators. They specified which of the two gaming conditions was assigned to each participant. The participants also had sensors placed on them to collect physiological data as they played the game, for example, to monitor changes in heart rate and breathing.
3. Physiological measures of the participants while playing the game were collected together with data collected afterward using a user satisfaction questionnaire that asked questions about how satisfied they were with the game and how much they enjoyed it.

14.4.2 Case Study 2: Gathering Ethnographic Data at the Royal Highland Show

Field observations, including in-the-wild and ethnographic studies, provide data about how users interact with technology in their natural environments. Such studies often provide insights not available in lab settings. However, it can be difficult to collect participants' thoughts, feelings, and opinions as they move about in their everyday lives. Usually, it involves observations and asking them to reflect after an event, for example through interviews and diaries.
In this case study, a novel evaluation approach—a live chatbot—was used to address this gap by collecting data about people’s experiences, impressions, and feelings as they visited and moved around the Royal Highland Show (RHS) (Tallyn et al., 2018). The RHS is a large agricultural show that runs every June in Scotland. The chatbot, known as Ethnobot, was designed as an app that runs on a smartphone. In particular, Ethnobot was programmed to ask participants pre-established questions as they wandered around the show and to prompt them to expand on their answers and take photos. It also directed them to particular parts of the show that the researchers thought would interest the participants. This strategy also allowed the researchers to collect data from all of the participants in the same place. Interviews were also conducted by human researchers to supplement the data collected online by the Ethnobot. The overall purpose of the study was to find out about participants’ experiences of, and feelings about, the show and of using Ethnobot. The researchers also wanted to compare the data collected by the Ethnobot with the interview data collected by the human researchers. The study consisted of four data collection sessions using the Ethnobot over two days and involved 13 participants, who ranged in age and came from diverse backgrounds. One session occurred in the early afternoon and the other in the late afternoon on each day of the study. Each session lasted several hours. To participate in the study, each participant was given a smartphone and shown how to use the Ethnobot app (Figure 14.4), which they could experience on their own or in groups as they wished. Two main types of data were collected. • The participants’ online responses to a short list of pre-established questions that they answered by selecting from a list of prewritten comments (for example, “I enjoyed something” or “I learned something”) presented by the Ethnobot in the form of buttons called experience buttons, and the participants’ additional open-ended, online comments and photos that they offered in response to prompts for more information from Ethnobot. The participants could contribute this data at any time during the session. • The participants’ responses to researchers’ in-person interview questions. These questions focused on the participants’ experiences that were not recorded by the Ethnobot, and their reactions to using the Ethnobot. A lot of data was collected that had to be analyzed. The pre-established comments collected in the Ethnobot chatlogs were analyzed quantitatively by counting the responses. The in-person interviews were audio-recorded and transcribed for analysis, and that involved coding them, which was done by two researchers who cross-checked each other’s analysis for consistency. The open-ended online comments were analyzed in a similar way to the inperson interview data. 14.4 E VA L U AT I O N C A S E S T U D I E S Figure 14.4 The Ethnobot used at the Royal Highland Show in Scotland. Notice that the Ethnobot directed participant Billy to a particular place (that is, Aberdeenshire Village). Next, Ethnobot asks “. . . What’s going on?” and the screen shows five of the experience buttons from which Billy needs to select a response Source: Tallyn et al. (2018). 
Reproduced with permission of ACM Publications

Overall, the analyses revealed that participants spent an average of 120 minutes with the Ethnobot on each session and recorded an average of 71 responses, while submitting an average of 12 photos. In general, participants responded well to prompting by the Ethnobot and were eager to add more information. For example, P9 said, "I really enjoyed going around and taking pictures and [to the question] 'have you got something to add' [said] yeah! I have, I always say 'yes' . . ." A total of 435 pre-established responses were collected, including 70 that were about what the participants did or experienced (see Figure 14.5). The most frequent response was "I learned something" followed by "I tried something" and "I enjoyed something." Some participants also supplied photos to illustrate their experiences. When the researchers asked the participants about their reactions to selecting prewritten comments, eight participants remarked that they were rather restrictive and that they would like more flexibility to answer the questions. For example, P12 said, "maybe there should have been more options, in terms of your reactions to the different parts of the show." However, in general participants enjoyed their experience of the RHS and of using Ethnobot.

[Figure 14.5: bar chart of prewritten experience responses: I learned something (20), I tried something (17), I enjoyed something (16), I didn't like something (9), I bought something (8), I experienced something (0)]

Figure 14.5 The number of prewritten experience responses submitted by participants to the pre-established questions that Ethnobot asked them about their experiences Source: Tallyn et al. (2018). Reproduced with permission of ACM Publications

When the researchers compared the data collected by Ethnobot with that from the interviews collected by the human researchers, they found that the participants provided more detail about their experiences and feelings in response to the in-person interview questions than to those presented by Ethnobot. Based on the findings of this study, the researchers concluded that while there are some challenges to using a bot to collect in-the-wild evaluation data, there are also advantages, particularly when researchers cannot be present or when the study involves collecting data from participants on the move or in places that are hard for researchers to access. Collecting data with a bot and supplementing it with data collected by human researchers appears to offer a good solution in circumstances such as these.

ACTIVITY 14.6
1. What kind of setting was used in this evaluation?
2. How much control did the researchers exert?
3. Which types of data were collected?

Comment
1. The evaluation took place in a natural outdoor setting at the RHS.
2. The researchers imposed less control on the participants than in the previous case study, but the Ethnobot was programmed to ask specific questions, and a range of responses was provided from which participants selected. The Ethnobot was also programmed to request additional information and photos. In addition, the Ethnobot was programmed to guide the participants to particular areas of the show, although some participants ignored this guidance and went where they pleased.
3. The Ethnobot collected answers to a specific set of predetermined questions (closed questions) and prompted participants for additional information and photographs.
In addition, participants were interviewed by the researchers using semi-structured, open-ended interviews. The data collected was qualitative, but counts of the response categories produced quantitative data (see Figure 14.5). Some demographic data was also quantitative (for instance, participants' ages, gender, and so forth), which is provided in the full paper (Tallyn et al., 2018).

BOX 14.2 Crowdsourcing

The Internet makes it possible to gain access to hundreds of thousands of participants who will perform tasks or provide feedback on a design or experimental task quickly and almost immediately. Mechanical Turk is a service hosted by Amazon that has thousands of people registered (known as Turkers), who have volunteered to take part by performing various activities online, known as human intelligence tasks (HITs), for a very small reward. HITs are submitted by researchers or companies that pay a few cents for simple tasks (such as tagging pictures) to a few dollars (for taking part in an experiment). Advantages of using crowdsourcing in HCI are that it is more flexible, relatively inexpensive, and often much quicker to enroll participants than traditional lab studies. Another benefit is that many more participants can be recruited. Early in the history of online crowdsourcing, Jeff Heer and Michael Bostock (2010) used it to determine how reliable it was to ask random people over the Internet to take part in an experiment. Using Mechanical Turk, they asked the Turkers to perform a series of perception tasks using different visual display techniques. A large number agreed, enabling them to analyze their results statistically and generalize from their findings. They also developed a short set of test questions that generated 2,880 responses. They then compared their findings from using crowdsourcing with those reported in published lab-based experiments. They found that while the results from their study using Turkers showed wider variance than in the lab study, the overall results across the studies were the same. They also found that the total cost of their experiment with Turkers was one-sixth the cost of a typical lab study involving the same number of people. While these results are important, online crowdsourcing studies have also raised ethical questions about whether Turkers are being appropriately rewarded and acknowledged—an important question that continues to be discussed (see, for example, the Brookings Institution article by Vanessa Williamson, 2016). Since Jeff Heer and Michael Bostock's 2010 study, crowdsourcing has become increasingly popular and has been used in a wide range of applications, including collecting design ideas (ideation) for developing a citizen science app (Maher et al., 2014); managing volunteers for disaster relief (Ludwig et al., 2016); and delivering packages (Kim, 2015). Both the number and diversity of useful contributions and ideas generated make crowdsourcing particularly attractive for getting timely feedback from the public. For example, in a study to collect community input on improving the design of a street intersection, a system called CommunityCrit was used to collect opinions from members of the community and to draw on their skills and availability (Mahyar et al., 2018). Those who contributed were empowered by getting to see the planning process. Citizen science is also a form of crowdsourcing.
Originally dating back to the days of Aristotle and Darwin, the data was collected by humans, who were sometimes referred to as sensors. During the last 10 years, the volume of data collected has increased substantially, leveraged by technology, particularly smartphones, and a range of other digital devices (Preece, 2017). For example, iSpotNature.org, iNaturalist.com, and eBird.com are apps that are used across the world for collecting biodiversity data and data about bird behavior. These examples illustrate how crowdsourcing can be a powerful tool for improving, enhancing, and scaling up a wide range of tasks. Crowdsourcing makes it possible to recruit participants to generate a large pool of potential ideas, collect data, and make other useful inputs that would be difficult to achieve in other ways. Several companies, including Google, Facebook, and IDEO, use crowdsourcing to try ideas and to gather evaluation feedback about designs. 14.5 What Did We Learn from the Case Studies? The case studies along with Box 14.1 and Box 14.2 provide examples of how different evaluation methods are used in different physical settings that involve users in different ways to answer various kinds of questions. They demonstrate how researchers exercise different levels of control in different settings. The case studies also show how it is necessary to be creative when working with innovative systems and when dealing with constraints created by the evaluation setting (for example, online, distributed, or outdoors where people are on the move) and the technology being evaluated. In addition, the case studies and boxes discussed illustrate how to do the following: • Observe users in the lab and in natural settings • Develop different data collection and analysis techniques to evaluate user experience goals, such as challenge and engagement and people on the move • Run experiments on the Internet using crowdsourcing, thereby reaching many more participants while being straightforward to run • Recruit a large number of participants who contribute to a wide range of projects with different goals using crowdsourcing BOX 14.3 The Language of Evaluation Sometimes terms describing evaluation are used interchangeably and have different meanings. To avoid this confusion, we define some of these terms here in alphabetical order. (You may find that other books use different terms.) Analytics Data analytics refers to examining large volumes of raw data with the purpose of drawing inferences about a situation or a design. Web analytics is commonly used to measure website traffic through analyzing users’ click data. Analytical evaluation This type of evaluation models and predicts user behavior. This term has been used to refer to heuristic evaluation, walk-throughs, modeling, and analytics. 14.5 W H AT D I D W E L E A R N F R O M T H E C A S E S T U D I E S ? Bias The results of an evaluation are distorted. This can happen for several reasons. For example, selecting a population of users who have already had experience with the new system and describing their performance as if they were new users. Controlled experiment This is a study that is conducted to test hypotheses about some aspect of an interface or other dimension. Aspects that are controlled typically include the task that participants are asked to perform, the amount of time available to complete the tasks, and the environment in which the evaluation study occurs. 
Crowdsourcing This can be done in person (as was typical in citizen science for decades) or online via the web and mobile apps. Crowdsourcing provides the opportunity for hundreds, thousands, or even millions of people to evaluate a product or take part in an experiment. The crowd may be asked to perform a particular evaluation task using a new product or to rate or comment on the product. Ecological validity This is a particular kind of validity that concerns how the environment in which an evaluation is conducted influences or even distorts the results. Expert review or crit This is an evaluation method in which someone (or several people) with usability expertise and knowledge of the user population reviews a product looking for potential problems. Field study This type of evaluation study is done in a natural environment such as in a person’s home or in a work or leisure place. Formative evaluation This type of evaluation is done during design to check that the product fulfills requirements and continues to meet users’ needs. Heuristic evaluation This is an evaluation method in which knowledge of typical users is applied, often guided by heuristics, to identify usability problems. Informed consent form This form describes what a participant in an evaluation study will be asked to do, what will happen to the data collected about them, and their rights while involved in the study. In-the-wild study This is a t