Performance Appraisal - Chapter 5 - Industrial/Organizational Psychology
Paul E. Levy (2017)

Summary
This chapter from Paul E. Levy's 2017 book on industrial-organizational psychology discusses the uses, formats, and legal issues surrounding performance appraisals. It details the role of I/O psychology in managing employee performance.
CHAPTER 5
Performance Appraisal

CHAPTER OUTLINE
Uses of Performance Appraisal
The Role of I/O Psychology in Performance Appraisal
Sources of Performance Ratings
Rating Formats
Rating Errors
Rater Considerations
Contemporary Performance Appraisal Research
PRACTITIONER FORUM: Elaine Pulakos
Legal Issues in Performance Appraisal
I/O TODAY: Technology, Performance Measurement, and Privacy
Summary

LEARNING OBJECTIVES
This chapter should help you understand:
The purposes of performance appraisal
The various formats used in the evaluation of performance
The effect of rating errors on the appraisal process
How the broader context in which performance appraisal takes place has various and diverse implications for performance appraisal and organizational functioning
The role of employee development in the performance management process
The important role played by legal issues in the performance appraisal process
The complexities involved in giving and receiving performance feedback

The process of evaluating employees' performance—performance appraisal—is the focus of this chapter. Having our performance appraised is something we've all experienced at one time or another. Of course, we may not have fond memories of our fathers telling us that we didn't weed the vegetable garden well enough or of our Psychology 100 professor telling us that our test scores so far gave us only a C in the course. The types of performance appraisal just described are examples of negative appraisals. But sometimes we did a nice job mowing the lawn or writing a term paper—instances in which we probably received positive appraisals. There are many domains in which we are held accountable for our performance, but in this chapter we will talk specifically about the role of performance appraisal in organizational life—in which it plays a very important role indeed. In this domain, performance appraisal is defined as the systematic review and evaluation of job performance, as well as the provision of performance feedback. However, as you'll soon see, performance appraisal is a lot more interesting than that definition makes it sound.

performance appraisal: Systematic review and evaluation of job performance.

USES OF PERFORMANCE APPRAISAL

Performance appraisal is one of the most important processes conducted in organizations. It has many purposes, of which the three most significant are discussed here. First, performance appraisals are used to make important personnel decisions, such as who gets promoted, fired, demoted, or laid off; who gets a large raise, a small raise, or no raise at all; and so on. In efficient organizations, these decisions are not made haphazardly; they are made on the basis of performance appraisal data. Second, performance appraisals are used for developmental purposes. Employees are informed of their performance strengths and weaknesses so that they can be proud of what they are doing well and can focus their efforts on the areas that need some work. On the whole, organizations benefit when employees perform better, and performance appraisal data are used to help employees become better performers. In addition, organizations are interested in seeing their employees advance within the company to other important jobs—an outcome that performance appraisal can facilitate. For example, an employee may be told that she needs to improve her interpersonal skills so that she will be eligible when the next promotion becomes available.
A third purpose of performance appraisal is what I'll call documentation of organizational decisions—a purpose that has recently evolved out of personnel decisions and the growing area of personnel law. Now that companies are very aware of the possibility of being sued over personnel business decisions, managers are increasingly using performance appraisals to document employees' performance patterns over time. In cases in which employees are fired for inadequate performance, the organization—if it has kept careful track—can point to detailed accounts of the employees' inferior performance, making it difficult for them to claim that they were fired without just cause. (We will discuss legal issues in performance appraisal later in this chapter.)

A CLOSER LOOK
Performance can be measured in many different ways. How has your performance been measured or recognized in any jobs that you've had, and what made that process useful for you?

On the other hand, performance appraisals that are not carefully developed and implemented can have negative repercussions for both the organization and its employees. For instance, a poorly conceived appraisal system could get the wrong person promoted, transferred, or fired. It could cause feelings of inequity on the part of good employees who erroneously receive smaller raises than bad employees. It could lead to lawsuits in which the company has a very weak defense for why a particular individual was not promoted. Also, it could result in disgruntled employees who decrease their effort, ignore the feedback, or look for other jobs. Even customers are poorly served when an ineffective appraisal system causes employees to operate at less than their peak level of efficiency. Indeed, an ineffective performance appraisal system has widespread implications for everyone involved with the organization, which is why performance appraisal has received so much research attention (for reviews, see Levy & Williams, 2004; Murphy & Cleveland, 1995). A good performance appraisal system is well received by ratees, is based on carefully documented behaviors, is focused on important performance criteria, is inclusive of many perspectives, and is forward looking with a focus on improvement.

THE ROLE OF I/O PSYCHOLOGY IN PERFORMANCE APPRAISAL

I/O psychologists play a significant role in the area of performance appraisal. They are often hired to help develop and implement performance appraisal systems. I/O psychologists have measurement expertise, as well as a background in both human resources and organizational psychology—areas of knowledge that are integral to successful performance appraisals. Many companies have I/O psychologists in their HR departments who are responsible for what is known as performance management, a motivational system of individual performance improvement (DeNisi & Pritchard, 2006). This system typically includes (1) objective goal setting, (2) continuous coaching and feedback, (3) performance appraisal, and (4) developmental planning. The key points here are twofold: These four components are linked to the company's goals and objectives, and the system is implemented on a continuous cycle rather than just once per year. Some research has proposed that when an organization is profitable, it is typically more willing to reinvest in human resource practices like performance management, which then impacts employees, supervisors, and the organization itself (den Hartog, Boselie, & Paauwe, 2004).
Remember (see Figure 3.1) that performance appraisal stems directly from the job analysis. Performance criteria are identified by the job analysis and used as the central element of the performance appraisal system. Without a careful job analysis, we would likely end up with unimportant or job-irrelevant criteria and would appraise performance on the wrong criterion dimensions.

performance management: A system of individual performance improvement that typically includes (1) objective goal setting, (2) continuous coaching and feedback, (3) performance appraisal, and (4) developmental planning.

coaching: One-on-one collaborative relationship in which an individual provides performance-related guidance to an employee.

Researchers who specialize in performance appraisal pursue research questions like the following: (1) What is the best format or rating scale for performance appraisals? (2) To what extent do rater errors and biases affect the appraisal process? (3) How should raters be trained so that they can avoid these errors and biases? (4) What major contextual variables affect the appraisal process? (5) How important is the organizational context or culture in the appraisal process? (6) What factors affect how ratees and raters react to performance appraisal? In addition to addressing such questions, I/O psychology, as an empirically based applied discipline, attempts to use basic psychological principles to help organizations develop and implement motivating, fair, accurate, and user-friendly appraisal systems.

Sources of Performance Ratings

Performance feedback can be generated and delivered by various sources. Traditionally, supervisors were charged with conducting the performance appraisal and delivering the performance feedback. This top-down approach continues to be very popular, and it is quite common for an organization to include this type of appraisal as part of the performance management process. However, there are other, more contemporary approaches for the use of feedback sources, with multisource feedback being the most prevalent.

Multisource Feedback

This method, sometimes called 360-degree feedback (Fletcher & Baldry, 2015), involves multiple raters at various levels of the organization who evaluate and provide feedback to a target employee. As presented in Figure 5.1, these multiple sources typically include subordinates (or direct reports), peers, supervisors, customers and clients, and even self-ratings. Note the variation that exists between the different raters for each dimension.

360-degree feedback: A method of performance appraisal in which multiple raters at various levels of the organization evaluate a target employee and the employee is provided with feedback from these multiple sources.

FIGURE 5.1 360-Degree Development Report. This sample figure from a 360-degree report shows the scores provided by four raters (and one aggregate score) across eleven "leadership" dimensions on a 5-point scale. Source: Human Resource Decisions, Inc.

These systems are becoming increasingly important to the modern organization in terms of performance assessment and management (Hoffman, Lance, Bynum, & Gentry, 2010). Many companies—such as Home Depot, Procter & Gamble, General Electric, Intel, and Boeing, among others—are using them for a variety of purposes that are consistent with both greater employee expectations and the more sophisticated organizations of the 21st century. Three basic assumptions are held by advocates of 360-degree feedback systems.
First, when multiple raters are used, the participants are happier because they are involved in the process—and this calls to mind the importance of participation alluded to earlier. Second, and perhaps more important, when multiple raters from different levels of the organization rate the same target employee, the idiosyncrasies (and biases) of any single rater are overcome. For instance, if my supervisor doesn't like me and rates me severely for that reason, additional ratings from other individuals that aren't severe should overcome my supervisor's rating and highlight the possibility that there may be a problem with that rating. Third, multiple raters bring with them multiple perspectives on the target employee, allowing for a broader and more accurate view of performance. For instance, universities often require that students rate faculty teaching because it is believed that they have a valuable perspective to share about teaching effectiveness. These are sometimes called upward appraisal ratings because they refer to ratings provided by individuals whose status is, in an organizational-hierarchy sense, below that of the ratee (Atwater, Waldman, Atwater, & Cartier, 2000). A recent study that included traditional supervisor ratings, peer ratings, and subordinate (upward) ratings demonstrated that these perspectives differ from each other regarding managerial competencies and how they predict managerial effectiveness (Semeijn, Van der Heijden, & Van der Lee, 2014). The authors concluded that the multisource approach is beneficial in assessing both competencies and effectiveness.

upward appraisal ratings: Ratings provided by individuals whose status, in an organizational-hierarchy sense, is below that of the ratees.

Although 360-degree feedback has been used in organizations for a few decades, research efforts to understand it have only recently begun to catch up with practitioner usage. Until recently, the majority of empirical work on 360-degree feedback focused on measurement properties of the ratings, such as the extent of agreement among rating sources (Levy & Williams, 2004). However, we are starting to see considerably more research on broader contextual issues in 360-degree feedback. In addition, there has been an influx of research focused on applications to the health field (doctors, nurses, medical residents) and education (teachers, principals). A recent study examined the use of 360-degree feedback with 385 surgeons (Nurudeen et al., 2015). Very high percentages of surgeons reported that the feedback they received was accurate and that they made changes to their practice as a result. Similarly, a large percentage of department heads reported that they believed the feedback provided to their surgeons was accurate. Almost 75% of the participants (raters and ratees) found the process valuable, and over 80% were willing to participate in future 360-degree evaluations. There is even a new app (called Healthcare Supervision Logbook) that allows doctors who are training medical students to provide feedback to the students following a clinical session. The trainees can use the app to provide feedback about the curriculum and training to the doctor trainers (Gray, Hood, & Farrell, 2015). It can also be used to gather patient and peer feedback.
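To make the aggregation side of multisource feedback concrete, here is a minimal sketch in Python (not from the text) of how ratings from several sources might be averaged and compared for one target employee. The dimensions, sources, and numbers are all hypothetical, and real 360-degree systems are far more elaborate.

```python
from statistics import mean

# Hypothetical multisource (360-degree) ratings for one target employee
# on a 5-point scale, keyed by dimension and then by rating source.
ratings = {
    "leadership": {
        "self": 4.5,
        "supervisor": 3.0,
        "peers": [3.5, 4.0],
        "subordinates": [2.5, 3.0, 3.5],
    },
    "communication": {
        "self": 4.0,
        "supervisor": 4.0,
        "peers": [4.5, 4.0],
        "subordinates": [4.0, 4.5, 3.5],
    },
}

def summarize(dim_ratings):
    """Average each source's ratings and compute the self-other gap."""
    by_source = {
        source: mean(value) if isinstance(value, list) else value
        for source, value in dim_ratings.items()
    }
    others = [v for s, v in by_source.items() if s != "self"]
    by_source["self_other_gap"] = by_source["self"] - mean(others)
    return by_source

for dimension, dim_ratings in ratings.items():
    summary = summarize(dim_ratings)
    print(dimension, {k: round(v, 2) for k, v in summary.items()})
```

The self-other gap computed at the end is exactly the kind of discrepancy at issue in the study of school principals described next.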
In another recent investigation, this time in the education arena, principals were evaluated with a 360-degree instrument and researchers studied how the principals reacted to conflicting feedback among sources (Goldring, Mavrogordato, & Haynes, 2015). They found that principals experienced cognitive dissonance (that is, discomfort and tension) when the teacher ratings of their performance were lower than their own ratings of their performance. The authors assert that principals are motivated to reduce the dissonance and that they can do so either by working harder and improving how they do their jobs or by discounting the teacher ratings. The researchers suggest that principals need to be trained in how to receive, evaluate, and use the feedback they are given.

One firm, Human Resource Decisions, Inc., employs a 360-degree feedback system that consists of four main tools: (1) a 360-degree development questionnaire, which is administered to multiple rating sources and measures 13 skill dimensions, such as leadership and business acumen; (2) a 360-degree feedback report, which provides the results of the ratings from the various sources, as well as some summary information about ratees' strengths and weaknesses; (3) a development workbook, which helps employees work with and understand the feedback report; and (4) a development guide, which provides suggested readings and activities to improve skills in the targeted areas. This 360-degree system has been used by many companies across a variety of industries.

What does the future hold for 360-degree feedback? Consistent with this chapter's theme of focusing on the social context of appraisal is a list of recommendations for implementing 360-degree feedback: (1) being honest about how the ratings will be used, (2) helping employees interpret and deal with the ratings, and (3) avoiding the presentation of too much information (DeNisi & Kluger, 2000). The frequency with which 360-degree feedback is used in organizations suggests that it will continue to play an important role. Recent research suggests that the following important issues will continue to attract attention: construct validity of ratings, determinants of multisource ratings (such as rating purpose, liking, and personality), and the effect of multisource ratings on employee attitudes and development.

New Challenges in Telework

Another recent trend in organizations is the increased frequency of telework (see Chapter 11 for information on telework and worker well-being), in which employees work from home or some other remote location, a practice that is growing rapidly (Golden, Barnes-Farrell, & Mascharka, 2009). This telecommuting results in altered forms of communication. Many employee–employee and employee–supervisor interactions don't take place face-to-face but via e-mails, phone calls, faxes, text messages, conference calls, and so on. This arrangement suggests that supervisors doing performance appraisal must rely on indirect sources of performance information, like gathering information from those who work directly with the employee and reviewing written documentation of work, instead of direct interactions and on-the-job observations. Recent research has demonstrated that when provided with both direct and indirect performance information, supervisors rely more on direct performance information (Golden et al., 2009).
Supervisors' tendencies to downplay the indirect information that is common in telework suggest potential room for performance appraisal errors and ineffectiveness.

telework: Working arrangements in which employees enjoy flexibility in work hours and/or location.

Of course, one important aspect of telecommuting is its relationship with performance. Although there hasn't been a lot of empirical evidence in this area, there is some very recent research. For instance, a survey of 273 supervisor–subordinate dyads found that telecommuting did have a positive effect on performance. The strength of this effect was moderated by various factors, but it's important to note that the effect was never negative (Golden & Gajendran, 2014). That is to say, telecommuting may have a weak or strong positive effect on performance depending on various situational characteristics, but it never has a negative effect. As more evidence of the positive effect of telecommuting on performance, Gajendran, Harrison, and Delaney-Klinger (2015) studied 323 supervisor–subordinate dyads and found that telecommuting was positively associated with both task performance and contextual performance. Further, the effect of telecommuting on performance was strengthened when there was a positive relationship between the subordinate and supervisor. When an employee who experiences a favorable subordinate–supervisor relationship is given the freedom to telecommute, that employee will assume more autonomy, which will be reflected in higher levels of task and contextual performance. For a complete and broad review of telecommuting, I encourage you to read Allen, Golden, and Shockley (2015), who discuss this arrangement from a historical perspective and in terms of a broad array of organizational outcomes.

A CLOSER LOOK
What factors might play a role in determining whether telecommuting has a positive or negative effect on performance?

Rating Formats

When it comes time for an evaluator to appraise someone's performance, he or she typically uses some type of rating form. There are quite a few options here; in this section, we will discuss the ones most frequently employed.

Graphic Rating Scales

Graphic rating scales are among the oldest formats used in the evaluation of performance. These scales consist of a number of traits or behaviors (e.g., dependability), and the rater is asked to judge how much of each particular trait the ratee possesses or where on this dimension the ratee falls with respect to organizational expectations. Today, graphic rating scales usually include numerical/verbal anchors at various points along the scale, such as "1/below expectations," "4/meets expectations," and "7/exceeds expectations," and the score is whatever number is circled. Graphic rating scales are commonly used in organizations due, in part, to the ease with which they can be developed and used. Figure 5.2 provides an example of a graphic rating scale—in this case, one that appraises the extent to which employees are "following procedures."

FIGURE 5.2 Graphic Rating Scale for "Following Procedures"

Behaviorally Anchored Rating Scales

Behaviorally anchored rating scales, or BARS (Smith & Kendall, 1963), are similar to graphic rating scales except that they provide actual behavioral descriptions as anchors along the scale. An example of a 9-point BARS for a nuclear power plant operator's tendency to "follow procedures" is shown in Figure 5.3.

BARS: A performance appraisal format that uses behavioral descriptors for evaluation.
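Before turning to how BARS are built, it may help to see what one looks like as a data structure. This is a minimal sketch, under the assumption that a BARS is essentially a mapping from scale points to behavioral anchors; the high and low anchors paraphrase the Figure 5.3 examples quoted below, and the midpoint anchor is invented for illustration.

```python
# A BARS rendered as a simple data structure. The 9 and 1 anchors paraphrase
# the critical incidents quoted in the text for "following procedures"; the
# midpoint anchor is hypothetical, added for illustration.
bars_following_procedures = {
    9: "Never deviates from the procedures outlined for a particular task and "
       "takes the time to do things according to the employee manual.",
    5: "Generally follows established procedures, improvising only on routine "
       "tasks.",  # hypothetical midpoint anchor
    1: "Takes shortcuts around established procedures at every opportunity.",
}

def anchor_for(score: int) -> str:
    """Return the behavioral anchor closest to a rater's 1-9 score."""
    nearest = min(bars_following_procedures, key=lambda level: abs(level - score))
    return bars_following_procedures[nearest]

print(anchor_for(8))  # prints the high-effectiveness anchor
```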
BARS are perhaps best known for the painstaking process involved in their development (see Smith & Kendall, 1963). This process can be summarized as taking place in five steps (Bernardin & Beatty, 1984). First, a group of participants—such as employees, supervisors, or subject matter experts (SMEs)—identifies and carefully defines several dimensions as important for the job ("follows procedures," "works in a timely manner," etc.). Second, another group of participants generates a series of behavioral examples of job performance (similar to the items in Figure 5.3) for each dimension. These behavioral examples are called critical incidents. Participants are encouraged to write critical incidents at high-, medium-, and low-effectiveness levels for each dimension. In Figure 5.3, for example, "Never deviates from the procedures outlined for a particular task and takes the time to do things according to the employee manual" represents high effectiveness, whereas "Takes shortcuts around established procedures at every opportunity" represents low effectiveness.

critical incidents: Examples of job performance used in behaviorally anchored rating scales or job-analytic approaches.

FIGURE 5.3 Behaviorally Anchored Rating Scale for "Following Procedures"

Third, yet another group of participants is asked to sort these critical incidents into the appropriate dimensions. During this retranslation stage, the goal is to make sure that the examples generated for each dimension are unambiguously associated with that dimension. Usually a criterion, such as 80%, is used to weed out items that are not clearly related to a particular dimension: If fewer than 80% of the participants place the critical incident in the correct dimension, the item is dropped. A fourth group of participants then rates each remaining behavioral example on its effectiveness for the associated dimension. This rating is usually done on a 5- or 7-point scale. Any item with a large standard deviation is eliminated because a large standard deviation indicates that some respondents think the item represents effective performance, whereas others think it represents ineffective performance—obviously, a problematic situation. Fifth, items that specifically represent performance levels on each dimension are chosen from the acceptable pool of items, and BARS are developed and administered. This very deliberate and thorough development process is both the biggest strength and the biggest weakness of BARS. Usually such a detailed process involving at least four different groups of participants results in a useful and relevant scale. However, the costs in both time and money are too high for many organizations, so they often employ a simple graphic scale instead.

Checklists

Checklists are another popular format for performance appraisals. Here, raters are asked to read a large number of behavioral statements and to check off each behavior that the employee exhibits. One example from this category is the weighted checklist, which includes a series of items that have previously been weighted as to importance or effectiveness; specifically, some items are indicative of desirable behavior, whereas others are indicative of undesirable behavior. Figure 5.4 presents an example of a weighted checklist for the evaluation of a computer and information systems manager. In a real-life situation, the scale shown in Figure 5.4 would be modified in two ways.
First, the items would be scrambled so as not to be in numerical order; second, the scale score column would not be part of the form. To administer this type of scale, raters would simply check the items that apply; in our example, the sum of items checked would be the computer and information systems manager's performance appraisal score. Note that as more and more of the negative items are checked, the employee's summed rating gets lower and lower.

FIGURE 5.4 A Weighted Checklist

Forced-choice checklists are also used by organizations, though not as frequently as weighted checklists. Here, raters are asked to choose two items from a group of four that best describe the target employee. All four appear on the surface to be favorable, but the items have been developed and validated such that only two are actually good discriminators between effective and ineffective performers. The purpose of this approach is to reduce purposeful bias or distortion on the part of raters. In other words, because all the items appear favorable and the raters don't know which two are truly indicative of good performance, they cannot intentionally give someone high or low ratings. One drawback to this approach is that some raters don't like it because they feel as though they've lost control over the rating process. How would you feel if you had to choose two statements that describe your poorly performing subordinate, but all four statements seem positive?

Researchers have recently developed an appraisal format that appears to reduce the bias sometimes associated with other approaches, but without the negative reactions on the part of the raters. Borman and his colleagues developed the computerized adaptive rating scale (CARS), which, although still relatively new to the performance appraisal field, seems to be more sound from a measurement perspective than are other approaches; it also provides more discriminability—that is, the scale does a better job of differentiating between effective and ineffective performers (Borman et al., 2001; Schneider, Goff, Anderson, & Borman, 2003). CARS is an ideal point response method (Drasgow, Chernyshenko, & Stark, 2010) in which raters are given two statements about performance, with one slightly above average and one slightly below average. The rater is asked to choose the one that best reflects the ratee's performance. This is followed by two more statements—one more favorable and one less favorable than the point just chosen—and the rater chooses the best option. This continues until all the relevant items have been presented and the best performance estimate/rating has been determined (Borman, 2010).

Employee Comparison Procedures

The final category of rating formats, employee comparison procedures, involves evaluation of ratees with respect to how they measure up to or compare with other employees. One example of this type of format is rank-ordering, whereby several employees are ranked from best to worst. Rank-ordering can be particularly useful for making promotion decisions and discriminating the very best employee from the rest. A second example, paired comparisons, involves the comparison of each employee with every other employee. If a manager has only three employees to evaluate, this isn't too difficult a task. (Think about doing this for each of your instructors this semester—comparing each to every one of the others.) However, as the number of ratees increases, so does the complexity of the task.
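To see how quickly paired comparisons multiply, here is a minimal sketch (with hypothetical names and judgments) that enumerates every pair and tallies wins into a rank order; the Technical Tip below gives the closed-form count.

```python
from itertools import combinations

employees = ["Ann", "Ben", "Cal", "Dee"]  # hypothetical ratees

# Every pair a rater must judge; the count grows as N(N - 1)/2.
pairs = list(combinations(employees, 2))
print(len(pairs))  # 6 comparisons for 4 employees; 45 for 10

# Tallying wins across (hypothetical) pairwise judgments yields a rank order.
judgments = [("Ann", "Ben"), ("Ann", "Cal"), ("Dee", "Ann"),
             ("Ben", "Cal"), ("Dee", "Ben"), ("Dee", "Cal")]
wins = {e: 0 for e in employees}
for better, worse in judgments:
    wins[better] += 1
print(sorted(wins, key=wins.get, reverse=True))  # ['Dee', 'Ann', 'Ben', 'Cal']
```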
Although this method is one way to arrive at a "clear winner" among employees, it obviously becomes very cumbersome as the number of employees grows.

TECHNICAL TIP
The formula N(N – 1)/2 can be used to calculate the total number of comparisons. For example, whereas 3 employees would involve 3 comparisons, 10 employees would involve 45!

A third example is called forced distribution. Here, raters are instructed to "force" a designated proportion of ratees into each of five to seven categories. A similar procedure is grading on the normal curve, whereby teachers assign grades based on meeting the normal curve percentages (i.e., 68% of the grades assigned are Cs, 13.5% are Bs, 13.5% are Ds, 2.5% are As, and 2.5% are Fs). Sometimes organizations require supervisors to use the same sort of procedure, resulting in the categorization of one-third of the subordinates as average, one-third as below average, and one-third as above average (see the Time article "Rank and Fire" by Greenwald, 2001). This is often done because performance ratings are tied to raises and, with a limited pool of money for raises, the company wants to make sure that not too many employees are rated as eligible for a raise and to potentially remove inferior employees from the organization. To appreciate how employees might feel about this approach, think about how you would feel if your psychology instructor told you that you were to be graded on a normal curve. In other words, if you had an average of 95 in the course but 3% of your classmates had an average of 96 or better, you would not receive an A because only 3% can get As. Needless to say, forced distribution is not a popular approach among ratees, whether students or employees, but it was very popular among Fortune 500 companies in the 1990s and early to mid-2000s. It's estimated that 20% of these organizations employed this practice in the mid-2000s (Grote, 2005), with companies such as General Electric, 3M, Texas Instruments, Microsoft, Ford, Goodyear, and Hewlett-Packard among the most ardent supporters. However, some very public lawsuits over the use of these systems have created a controversy regarding the extent to which underrepresented groups tend to be disproportionately ranked in the low category, resulting in adverse impact against these groups (Giumetti, Schroeder, & Switzer, 2015). Adverse impact is an important concept in personnel law and may indicate illegal discrimination against a particular group (see Chapter 7 for a more detailed discussion of this issue). Many thoughtful papers and analyses have pointed out other weaknesses in the forced distribution approach, including the negative reactions of both raters and ratees; the fallacy of forcing some employees into the bottom rung regardless of their true performance; and problems with continuity, where, for example, an employee who was rated average or above average will eventually fall to the bottom rung as the workforce is strengthened and perceptions of effective performance change. For these and other reasons, we have recently seen many more companies back away from this approach. You can continue to follow these issues in the popular press, such as in a 2014 piece on the Wall Street Journal website called "It's Official: Forced Ranking Is Dead" (Deloitte, 2014).
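As a minimal sketch of the mechanics (names and scores are hypothetical), here is the one-third forced split described above; note how the bottom category must be filled regardless of absolute performance.

```python
# Hypothetical scores for nine ratees, forced into thirds as described above.
scores = {"Ann": 95, "Ben": 91, "Cal": 88, "Dee": 84, "Eve": 81,
          "Fay": 78, "Gus": 74, "Hal": 70, "Ivy": 63}

ranked = sorted(scores, key=scores.get, reverse=True)
third = len(ranked) // 3
categories = {
    "above average": ranked[:third],
    "average": ranked[third:2 * third],
    "below average": ranked[2 * third:],
}
print(categories)
# Gus, Hal, and Ivy land in the bottom third no matter how well they
# actually performed; the distribution, not the behavior, drives the label.
```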
Contemporary Trends in Rating Formats

Although there has been less research and less written about rating formats in recent years, there are a couple of interesting contemporary trends. First, although most rating formats are quantitative in nature (i.e., performance is rated in terms of numbers), most of these formats also incorporate narrative comments (e.g., your performance on this dimension is above average, but you need to enhance your administrative skills) in some way. It is only recently that researchers have begun examining these narrative comments. Some work has found that supervisors' and subordinates' comments are considerably clearer than the comments from peers (Gillespie, Rose, & Robinson, 2006). A recent conceptual paper proposes a framework for the use of narrative comments in the performance appraisal process (Brutus, 2010) and identifies important characteristics, such as the specificity or breadth of the comments, as well as the processes that raters and ratees employ in the use of these narrative comments. This paper provides a framework that can be used to help structure a research program focused on gaining insight into the significance of these and other important characteristics of narrative comments.

Second, additional outside-the-box thinking has resulted in the use of feedforward interviews (FFIs) for performance appraisal (Kluger & Nir, 2010). This idea is borrowed from appreciative inquiry (see Chapter 14), a part of the positive psychology movement. The FFI is proposed to replace the traditional performance appraisal interview and to facilitate positive change by focusing on employees' strengths, rather than weaknesses, and also to enhance the relationship between the rater and ratee. While promising, there is very little empirical research on the effectiveness of the FFI. However, one new study trained customer service managers in an FFI approach that focused employees' attention on positive work experiences in which goals were met and success was attained. Researchers found that the performance of subordinates whose managers employed FFI improved more over a four-month period than did that of subordinates whose managers used more traditional performance appraisal techniques (Budworth, Latham, & Manroop, 2015). Obviously, more research is needed, but there is some potential for organizations to consider feedforward as an alternative to feedback or in combination with it.

An Evaluation of the Various Alternative Methods

Since the 1950s, perhaps no area in performance appraisal has received more research attention than rating type or format. Even in recent years, research has looked at how personality and format may impact performance ratings (Yun, Donahue, Dudley, & McFarland, 2005) as well as how providing ratings via e-mail versus face-to-face meetings impacts the process (Kurtzberg, Naquin, & Belkin, 2005). Researchers have argued that no single format is clearly superior to the others across all evaluative dimensions (Landy & Farr, 1980; Murphy & Cleveland, 1995), though a recent paper that reviewed research on BARS makes a case for the superiority of that approach on some dimensions (Debnath, Lee, & Tandon, 2015). I will make no attempt to argue for one approach over another. Table 5.1 provides a summary of the advantages and disadvantages of the four major types discussed here. You can use this to decide for yourself which format you would use to evaluate your employees; in some cases, your choice would depend on the situation. Note, however, that companies currently often use a graphic rating scale with various behavioral anchors—basically a hybrid of a graphic rating scale and a BARS.
It's also important to note that there has been a good bit of discussion in popular press outlets like NPR and the Huffington Post about abolishing performance appraisal altogether. This suggestion, however, seems to ignore the important uses of performance appraisal in personnel decisions, employee development, and legal documentation, as outlined at the outset of this chapter. Effective procedures in all three of these areas are necessary for successful organizations. Most organizations value performance appraisal as an important part of the performance management system, but I/O practitioners and researchers alike agree that there is great room for improvement in appraisal systems and the implementation of those systems. I would conclude that this is an example of where we shouldn't throw out the baby with the bathwater.

TABLE 5.1 Summary of Appraisal Formats

Graphic rating scales
Advantages: 1. Easy to develop. 2. Easy to use.
Disadvantages: 1. Lack of precision in dimensions. 2. Lack of precision in anchors.

BARS
Advantages: 1. Precise and well-defined scales—good for coaching. 2. Well received by raters and ratees.
Disadvantages: 1. Time and money intensive. 2. No evidence that it is more accurate than other formats.

Checklists
Advantages: 1. Easy to develop. 2. Easy to use.
Disadvantages: 1. Rater errors such as halo, leniency, and severity are quite frequent.

Employee comparison methods
Advantages: 1. Precise rankings are possible. 2. Useful for making administrative rewards on a limited basis.
Disadvantages: 1. Time intensive. 2. Not well received by raters (paired comparison) or ratees (forced distribution).

Rating Errors

Evaluating another individual's performance accurately and fairly is not an easy thing to do; moreover, errors often result from this process. An understanding of these errors is important to appreciating the complexities of performance appraisal. Research in cognitive psychology has shed considerable light on how the human brain processes information while making decisions. This research has provided I/O psychologists with valuable information that has been applied to the performance appraisal process.

Cognitive Processes

In a typical company situation, once or twice a year supervisors have to recall specific performance incidents relating to each employee, somehow integrate those performance incidents into a comprehensible whole, arrive at an overall evaluation, and, finally, endorse a number or category that represents the employee's performance over that period. Furthermore, this has to be done for each of 6 to 12 different employees, and perhaps more! With the trend toward flatter organizations, the number of subordinates for each supervisor is increasing steadily, making the task that much more difficult. Although more complex cognitive-processing models of performance appraisal have been developed (see Hodgkinson & Healey, 2008; Landy & Farr, 1980), all such models are consistent with the scheme depicted in Figure 5.5. This figure includes two examples of potential error or bias that may come into play at each of the steps shown (see the left and right columns). The first step in this model is the observation of employees' behaviors. In many situations, this is done well; the rater may observe a large portion of the ratee's behavior, as when a grocery store manager observes his subordinates' performance every day. In other situations, however, raters are unable to observe ratees' performance directly.
For instance, professors are often evaluated by their department heads, even though the department heads have little opportunity to view the professors engaging in job-related tasks like teaching.

FIGURE 5.5 Cognitive-Processing Model of Performance Appraisal

Second, the observed behavior must be encoded, which means that the behavior must be cognitively packaged in such a way that the rater is able to store it. If an observed behavior is encoded incorrectly (e.g., an adequate behavior is somehow encoded as inadequate), the appraisal rating will be affected at a later time. Third, after encoding, the behavior must be stored in long-term memory. Because it is unreasonable to expect that anyone could perfectly store all the relevant performance incidents, some important incidents may not get stored at all. Fourth, when the appraisal review is being conducted, the stored information must be retrieved from memory. In many situations, the rater cannot retrieve some of the important information, leading to an appraisal rating that is based on an inadequate sample of behavior. Also, since performance reviews are difficult and time-consuming, it's not unusual for raters to retrieve irrelevant information and use it as the basis for the performance rating. Think about doing a review of your subordinate's administrative assistant—someone with whom you have very little interaction. In doing this review, you would certainly get input from your subordinate; but later, when doing the final evaluation, you might recall a memo from your subordinate that explained an expensive mix-up in his office in terms of scheduling conflicts. You may consider this when doing the administrative assistant's evaluation (after all, it makes sense that he would be in charge of scheduling), even though you don't know whether this mix-up was his fault. For that matter, he may have been the one who caught the problem and saved the company money. If you stored irrelevant information, you may unwittingly use it later in making a performance judgment. Finally, the rater has to integrate all this information and come to a final rating. If the rater has done a good job of observing, encoding, storing, and retrieving relevant performance information, the integration step should be relatively easy. However, sometimes raters let attitudes and feelings cloud their judgments. If your boyfriend, girlfriend, spouse, or even a close acquaintance worked under your direct supervision, could you be objective in arriving at a performance judgment that would be used in a promotion decision? You'd like to think that you could, but many of us probably couldn't. This is one reason that many companies frown on hiring both members of a married couple or, if they do hire them, make sure to limit their workplace interaction. In one case that I know of, the U.S. military accepted a married couple—two attorneys—into the army as captains, guaranteeing that they would be posted to the same geographical region; but the military also guaranteed that the couple would not be assigned to the same base, thereby avoiding potential problems with favoritism and bias. I/O psychologists have been developing ways to help raters avoid the cognitive errors involved in performance appraisal for the last 35 years. Let's consider some of the most common of these errors.

Halo

One error that has received a great deal of attention is called halo.
Halo results from either (1) a rater's tendency to use his or her global evaluation of a ratee in making dimension-specific ratings for that ratee or (2) a rater's unwillingness to discriminate between independent dimensions of a ratee's performance (Saal, Downey, & Lahey, 1980). Halo effects can be positive or negative. A retail manager may evaluate a salesclerk as being very good at connecting with potential customers because she knows that the clerk is very good at keeping the shelves neat and stocked, even though competence on this dimension doesn't really indicate that he is competent in dealing with customers. The manager has generalized her evaluation of this clerk from one dimension to another. Similarly, I may assume a student is not going to do well in my class because he hasn't done well in his physics classes.

halo: The rating error that results from either (1) a rater's tendency to use his or her global evaluation of a ratee in making dimension-specific ratings for that ratee or (2) a rater's unwillingness to discriminate between independent dimensions of a ratee's performance.

A CLOSER LOOK
What is the difference between halo and true halo?

The early research in this area assumed that all halo was error. However, some people are competent across all dimensions—leadership, communication skills, motivating employees, completing paperwork, and so on. Traditionally, I/O psychologists noted high correlations across ratings on these dimensions and concluded that there was a great deal of halo error. However, from a hiring standpoint (see Chapters 6 and 7), organizations tend to target those applicants they believe will be "good at everything." And, indeed, there are employees in all organizations who do seem to be good performers on all performance dimensions. The point here is that some halo results from accurate intercorrelations among performance dimensions (see Goffin, Jelley, & Wagner, 2003; Murphy & Jako, 1989; Murphy & Reynolds, 1988)—what we call true halo.

true halo: Halo that results from accurate intercorrelations among performance dimensions rather than from rating error.

TECHNICAL TIP
We discussed the normal distribution in Chapter 2, pointing out that many qualities, such as intelligence and normal personality characteristics, are distributed in a bell-like shape, with most of the data in the middle and much less of the data in the extremes of the distribution.

It is also possible that extremely low intercorrelations among performance dimensions (what some have called negative halo because performance on the dimensions seems to be unrelated) may reflect inaccuracy in ratings just as much as strong intercorrelations (positive halo) do. In a study that reanalyzed some existing data, a curvilinear relationship between halo and accuracy emerged such that both positive and negative halo were found to reduce accuracy (Thomas, Palmer, & Feldman, 2009). In other words, both very strong and very weak associations between performance dimensions may indicate low accuracy. So, we know that halo exists and we understand it better now than we did 30 years ago, but it is complex. Sometimes halo is reflective of true performance, and sometimes either positive or negative halo may reflect inaccuracy in ratings. Rating errors such as leniency, central tendency, and severity (discussed below) are categorized as distributional errors because they result from a mismatch between actual rating distributions and expected rating distributions.
In other words, the grouping of ratings is much farther toward one end of the distribution or much closer to the middle than what we assume the true distribution to be. Performance is one of those qualities that we expect to be distributed "normally."

distributional errors: Rating errors, such as severity, central tendency, and leniency, that result from a mismatch between actual rating distributions and expected rating distributions.

Leniency

Raters commit the error of leniency when (1) the mean of their ratings across ratees is higher than the mean of all ratees across all raters or (2) the mean of their ratings is higher than the midpoint of the scale. In other words, if your boss rates her employees higher than all the other bosses rate their employees, or if she gives ratings with a mean of 4 on a 5-point scale, she would be described as a lenient rater. Raters may be lenient because they like their employees or want to be liked. They may think that giving everyone favorable ratings will keep peace in the workplace or that doing so will make them look good as supervisors who have high-performing subordinates. In fact, in a laboratory study examining leniency effects, researchers found that when raters were held accountable to their supervisors (as well as to ratees), they were less lenient in their ratings than when they were held accountable only to the ratees (Curtis, Harvey, & Ravden, 2005). This surely has implications for how organizations might want to structure performance appraisal processes.

leniency: The rating error that results when (1) the mean of one's ratings across ratees is higher than the mean of all ratees across all raters or (2) the mean of one's ratings is higher than the midpoint of the scale.

As with halo, we need to be careful in assuming that distributional errors are really errors. Indeed, it is possible, perhaps likely, that some supervisors really do have better employees or work groups than others, resulting in more favorable ratings that are accurate rather than lenient. For example, the airline chosen as the best in the nation with respect to service should show higher performance ratings for its service employees than other airlines. So what might appear to be the result of leniency among the evaluators of this airline would in fact be the result of accurate evaluation. Research also suggests that personality can have an impact on one's tendency to be lenient. Specifically, individuals categorized as "agreeable" have been shown to be more lenient than those categorized as "conscientious" (Bernardin, Cooke, & Villanova, 2000; Bernardin, Tyler, & Villanova, 2009).

Central Tendency

Raters who use only the midpoint of the scale in rating their employees commit the error of central tendency. An example is the professor who gives almost everyone a C for the course. In some cases, the raters are lazy and find it easier to give everyone an average rating than to spend the extra time necessary to review employees' performance so as to differentiate between good and poor workers. In other cases, the raters don't know how well each of their subordinates has performed (perhaps because they are new to the work group or just don't see their employees' on-the-job behavior very often) and take the easy way out by opting for average ratings for everyone.
Some research suggests that central tendency error is sometimes a result of the rating scale itself and that simpler semantic differential scales (e.g., ranging from "effective employee" to "ineffective employee") result in a considerable amount of this bias (Yu, Albaum, & Swenson, 2003).

central tendency: The tendency to use only the midpoint of the scale in rating one's employees.

Of course, a given work group or department may be populated largely by average employees. In fact, the normal distribution suggests that most employees really are average, so it is reasonable to have a large percentage of employees rated as such. At times, then, central tendency is a rating error; at other times, though, it simply reflects the actual distribution of performance, which is largely centered around "average."

Severity

Less frequent than leniency and central tendency is the rating error of severity, which is committed by raters who tend to use only the low end of the scale or to give consistently lower ratings to their employees than other raters do. Some supervisors intentionally give low ratings to employees because they believe that doing so motivates them (you will see when we get to Chapter 9, on motivation, that this strategy is not likely to work), or keeps them from getting too cocky, or provides a baseline from which new employees can improve. For some, severity represents an attempt to maintain the impression of being tough and in charge—but what tends to happen is that such raters lose, rather than gain, the respect of their subordinates.

severity: The tendency to use only the low end of the scale or to give consistently lower ratings to one's employees than other raters do.

Some work groups include a larger number of low performers than other work groups, so low ratings from the supervisor of such work groups may be accurate rather than "severe." Thus, although we don't see these terms in the literature, we could speak of true leniency, true central tendency, and true severity in much the same way as we speak of true halo. For any given situation, though, it is difficult to determine whether the ratings are affected by rating errors or are an accurate reflection of performance. The chief problem stemming from distributional errors is that the ratings do not adequately discriminate between effective and ineffective performers. In such cases, the majority of ratees are lumped together in the bottom, middle, or top of the distribution. This general problem is often referred to as range restriction because only a small part of the scale range is used in the ratings. The difficulty for the organization is that it intends to use performance rating information for personnel decisions such as promotions, raises, transfers, layoffs, and other terminations; but if all the employees are rated similarly (whether as a result of central tendency, leniency, or severity), the ratings do not help in making these personnel decisions. For instance, if everyone is rated in the middle of the scale, who gets promoted? If everyone is rated very favorably, who gets the big raise? If everyone is rated as ineffective, who gets fired? Employee morale is also affected by nondiscriminating ratings, in that employees who believe they are good employees will feel slighted because their reviews are no better than those of employees whom they view as much less effective.
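To make these distributional patterns concrete, here is a minimal sketch (with hypothetical ratings) of how one might screen raters using the definitions given above: leniency and severity as means above or below the scale midpoint or the grand mean, and range restriction as ratings that barely vary. Keep in mind that, as discussed, flagged patterns may still reflect true performance rather than error.

```python
from statistics import mean, stdev

SCALE_MIDPOINT = 3  # 5-point scale

# Hypothetical ratings that three raters gave their own work groups.
raters = {
    "rater_A": [5, 5, 4, 5, 4, 5],
    "rater_B": [3, 3, 3, 3, 3, 3],
    "rater_C": [1, 2, 1, 2, 1, 1],
}

grand_mean = mean(r for group in raters.values() for r in group)

for name, group in raters.items():
    m, sd = mean(group), stdev(group)
    flags = []
    if m > SCALE_MIDPOINT or m > grand_mean:
        flags.append("possible leniency")
    if m < SCALE_MIDPOINT or m < grand_mean:
        flags.append("possible severity")
    if sd < 0.5:  # ratings barely vary: range restriction
        flags.append("range restriction")
    print(f"{name}: mean={m:.2f}, sd={sd:.2f}, flags={flags}")
```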
Think about this: Have you ever been passed over for a promotion, only to discover that the person who was promoted, though less deserving than you, was rated similarly to you? You likely experienced feelings of injustice, which may have affected not only your subsequent on-the-job performance but also your attitude. These additional implications of nondiscriminating ratings will be discussed in more detail in Chapters 10 and 11.

Other Errors

Many other rating errors are discussed in the literature, but I will touch on just three more. One is recency error, whereby raters heavily weight their most recent interactions with or observations of the ratee. A supervisor who, in rating his subordinate, largely ignores nine months of superior performance and bases his evaluation on only the past three months of less-than-adequate performance is making a recency error. This is similar to the somewhat misguided belief in organizations that all that matters is the question "What have you done for me lately?" Another error of note, called first impression error or primacy effect, is the opposite of recency error. Here, raters pay an inordinate amount of attention to their initial experiences with the ratee. A construction foreman may think back to that first day when the new electrician helped out on the site at a crucial time and use this as the basis for his evaluation of the electrician while largely ignoring some major performance problems over the past few months. First impressions tend to be heavily weighted in our everyday lives, as when we form friendships with people with whom we just seem to "hit it off" from the very beginning; they are also used, sometimes ineffectively, in performance appraisal. Finally, there is the similar-to-me error, which occurs when raters tend to give more favorable ratings to ratees who are very much like themselves. We know from social psychology that people tend to make friends with and like being around people who are much like themselves. An old English proverb states: "Birds of a feather flock together." A similar effect occurs in performance appraisal situations, resulting in more favorable ratings of employees similar to the rater than of those dissimilar.

Rater Considerations

We've talked at length about characteristics of the rating—that is, formats and errors. In this section, we consider important performance appraisal elements that revolve around the rater—namely, rater training, rater goals, and rater accountability.

Rater Training

We have just discussed some of the common errors that raters make in evaluating the performance of others. An important question asked by I/O researchers and practitioners alike is whether rater training can reduce such errors and improve the rating process (Hauenstein, 1998; Schleicher, Day, Mayes, & Riggio, 2002). There are two main types of rater training in the performance appraisal area. One, known as Rater Error Training (RET), was originally developed to reduce the incidence of rater errors (Spool, 1978). The focus was on describing errors like halo to raters and showing them how to avoid making such errors. The assumption was that by reducing the errors, RET could increase accuracy, the degree to which performance ratings match the ratee's true performance level.

Rater Error Training (RET): A type of training originally developed to reduce rater errors by focusing on describing errors like halo to raters and showing them how to avoid making such errors.
As suggested early on in the development of this approach, however, RET can indeed reduce errors, but accuracy is not necessarily improved (Bernardin & Pence, 1980). In fact, studies have shown that accuracy sometimes decreases as a function of reducing error (e.g., Bernardin & Pence, 1980). How can this be? Well, recall our discussion of halo. When raters are instructed not to allow for halo, they are in effect being taught that there is no relationship among performance dimensions and that their ratings across dimensions should not be correlated. But, in many cases, there is true halo, and the ratings should be correlated. Thus, rater training may have reduced the correlation across dimensions that we used to assume was error, resulting in artificially uncorrelated ratings that are now inaccurate.

A second type of rater training, called Frame of Reference (FOR) Training, was designed by John Bernardin and his colleagues to enhance raters' observational and categorization skills (Bernardin & Beatty, 1984; Bernardin & Buckley, 1981). Bernardin's belief was that to improve the accuracy of performance ratings, raters have to be provided with a frame of reference for defining performance levels that is consistent across raters and consistent with how the organization defines the various levels of performance. In other words, for a fast-food employee, on the dimension of cleanliness, all raters must know that, on a 5-point scale, a 5 would be indicated by the following behaviors:

Table top is wiped down in between runs of burgers.
Ketchup and mustard guns are always placed back into their cylinders and never left sitting on the table.
Buns are kept wrapped in plastic in between runs of burgers.
Floor around production area is swept at least once per hour and mopped once in the morning and afternoon.
All items in the walk-in refrigerators are placed on the appropriate shelves, and the walk-ins are organized, swept, and mopped periodically.

Frame of Reference (FOR) Training: A type of training designed to enhance raters' observational and categorization skills so that all raters share a common view and understanding of performance levels, in order to improve rater accuracy.

FOR training attempts to make that description part of all raters' performance schema for the level of "5/exceptional" performance. The hope is that by etching this performance exemplar in the raters' minds, the training will render each rater better able to use it consistently when observing, encoding, storing, retrieving, and integrating behaviors in arriving at a final rating. The goal is to "calibrate" raters so that a score of 5 from one rater means the same as a score of 5 from any other rater. Popular procedures for FOR training have since been developed (Pulakos, 1984, 1986). In these, raters are given descriptions of the dimensions and rating scales, which are also read aloud by the trainer. The trainer then describes ratee behaviors that are representative of different performance levels on each scale. Raters are typically shown a series of videotaped practice vignettes in which individuals (stimulus persons, or ratees) are performing job tasks. Raters evaluate the stimulus persons on the scales; then the trainer discusses the raters' ratings and provides feedback about what ratings should have been made for each stimulus person. A detailed discussion ensues about the reasons for the various ratings.
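Although the text does not prescribe a particular statistic, one simple way to express the "calibration" goal of FOR training is to compare each trainee's ratings of the practice vignettes against the trainer's target ratings. Here is a minimal sketch with hypothetical data.

```python
from statistics import mean

# Hypothetical FOR-training practice data: the trainer's target ("true")
# ratings for three videotaped vignettes, plus two trainees' ratings,
# all on a 5-point scale.
target = {"vignette_1": 5, "vignette_2": 2, "vignette_3": 4}
trainees = {
    "rater_A": {"vignette_1": 5, "vignette_2": 3, "vignette_3": 4},
    "rater_B": {"vignette_1": 3, "vignette_2": 1, "vignette_3": 2},
}

# One simple calibration index: mean absolute deviation from the target
# ratings; smaller values mean the rater's frame of reference is closer
# to the organization's definition of each performance level.
for rater, ratings in trainees.items():
    mad = mean(abs(ratings[v] - target[v]) for v in target)
    print(f"{rater}: mean absolute deviation = {mad:.2f}")
# rater_A is nearly calibrated (0.33); rater_B rates consistently severely (1.67).
```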
Research has consistently shown that FOR training improves appraisal accuracy (Sulsky & Day, 1994), and it is generally recognized as the most effective approach for improving rater accuracy (Meriac, Gorman, & Macan, 2015). A couple of recent studies have suggested that combining FOR training with behavioral observation training (BOT), which focuses on teaching raters how to watch for certain behaviors and avoid behavioral observation errors, may improve the recognition or recall of performance behaviors (Noonan & Sulsky, 2001; Roch & O'Sullivan, 2003). One current concern, however, is that very little field data are available on the FOR technique. Most of the research showing that FOR training improves accuracy has been conducted with students in laboratory settings (see Uggerslev & Sulsky, 2008). For instance,