Evaluating Selection Techniques and Decisions
Summary
This document provides a comprehensive overview of evaluating selection techniques and decisions, focusing on the characteristics of effective selection techniques and the four types of reliability (test-retest, alternate-forms, internal, and scorer). It includes examples and formulas for determining how useful a test is.
Full Transcript
EVALUATING SELECTION TECHNIQUES AND DECISIONS: A Comprehensive Overview

CHARACTERISTICS OF EFFECTIVE SELECTION TECHNIQUES
Effective selection techniques have five characteristics: they are reliable, valid, cost-efficient, fair, and legally defensible.

RELIABILITY
Reliability is the extent to which a score from a test or from an evaluation is consistent and free from error. It is an essential characteristic of an effective measure. Test reliability is determined in four ways: test-retest reliability, alternate-forms reliability, internal reliability, and scorer reliability.

TEST-RETEST RELIABILITY
Definition: the extent to which repeated administration of the same test will achieve similar results. The scores from the first administration of the test are correlated with scores from the second to determine whether they are similar. If they are, the test is said to have temporal stability: the test scores are stable across time and not highly susceptible to random daily conditions such as illness, fatigue, stress, or uncomfortable testing conditions.

Temporal stability: the consistency of test scores across time.

There is no standard amount of time that should elapse between the two administrations of the test. However, the interval should be long enough that the specific test answers have not been memorized, but short enough that the person has not changed significantly. Typical time intervals between test administrations range from three days to three months. Usually, the longer the time interval, the lower the reliability coefficient. The typical test-retest reliability coefficient for tests used by organizations is .86 (Hood, 2001).

Test-retest reliability is not appropriate for all kinds of tests. It would not make sense to measure the test-retest reliability of a test designed to measure short-term moods or feelings. For example, the State-Trait Anxiety Inventory measures two types of anxiety: trait anxiety, the amount of anxiety an individual normally has all the time, and state anxiety, the amount of anxiety an individual has at any given moment. For the test to be useful, it is important for the measure of trait anxiety, but not the measure of state anxiety, to have temporal stability.

ALTERNATE-FORMS RELIABILITY
Definition: the extent to which two forms of the same test are similar. With this method, two forms of the same test are constructed. As shown in Table 6.1, a sample of 100 people is administered both forms of the test; half of the sample first receives Form A and the other half first receives Form B. This counterbalancing of test-taking order is designed to eliminate any effects that taking one form of the test first may have on scores on the second form.

Counterbalancing: a method of controlling for order effects by giving half of a sample Test A first, followed by Test B, and giving the other half of the sample Test B first, followed by Test A.

The scores on the two forms are then correlated to determine whether they are similar. If they are, the test is said to have form stability (the extent to which the scores on two forms of a test are similar). If there is a high probability that people will take a test more than once, two forms of the test are needed to reduce the potential advantage to individuals who take the test a second time.
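Both test-retest and alternate-forms reliability come down to correlating two sets of scores from the same people. The following is a minimal Python sketch, not taken from the chapter: the score lists are invented for illustration, and the reliability coefficient is simply the Pearson correlation between the two administrations.

```python
import numpy as np

# Hypothetical scores for the same 10 people on two administrations
# (or on Form A and Form B of the same test).
first_administration  = [22, 30, 28, 35, 19, 27, 31, 24, 29, 33]
second_administration = [24, 29, 27, 36, 21, 26, 30, 25, 28, 34]

# The reliability coefficient is the Pearson correlation between
# the two sets of scores.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest (temporal stability) coefficient: r = {r:.2f}")
# A value near the .86 typical of organizational tests (Hood, 2001)
# would indicate good temporal stability.
```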
ALTERNATE-FORMS RELIABILITY (CONTINUED)
A meta-analysis by Hausknecht, Halpert, Di Paolo, and Moriarty Gerard (2007) found that applicants retaking the same cognitive ability test will increase their scores about twice as much (d = .46) as applicants taking an alternate form of the cognitive ability test (d = .24). Not surprisingly, the longer the interval between the two test administrations, the lower the gain in test scores. It should be noted that the Hausknecht et al. meta-analysis was limited to cognitive ability tests. It appears that with knowledge tests, retaking the test will still increase test scores, but the increase is at the same level whether the second test is the same test or an alternate form of it (Raymond, Neustel, & Anderson, 2007).

Time intervals: for test-retest reliability, the interval typically ranges from 3 days to 3 months. For alternate-forms reliability, the interval should be as short as possible; longer intervals (e.g., 3 weeks) can obscure the cause of a low correlation, which could reflect a problem with either form stability or temporal stability.

EQUIVALENCE OF TEST FORMS
Two requirements must be met: the two forms must be correlated (Clause et al., 1998), and both forms should have the same mean and standard deviation. In Table 6.2, for example, there is a perfect correlation between Form A and Form B, but Form B's average score is 2 points higher than Form A's. The perfect correlation means the scores are parallel, but the different means mean the forms are not equivalent. The action needed is to revise the test forms or to use different standards when interpreting results from each form.

IMPACT OF TEST CHANGES
Changes to a test (e.g., item order, question examples, administration method, time limits) can modify its reliability, validity, and difficulty, although most studies show minimal impact of alternate-form differences on test outcomes. Meta-analyses indicate that computer administration yields scores similar to paper-and-pencil administration (Dwight & Feigelson, 2000; Kingston, 2009; Larson, 2001). One key study found that web-administered tests produced lower scores but better reliability than paper-and-pencil tests (Ployhart et al., 2003). African Americans score higher on video-based tests than on traditional formats; no similar effect was found for Whites (Chan & Schmitt, 1997).

INTERNAL RELIABILITY
Definition: the extent to which similar items are answered in similar ways, referred to as internal consistency; it measures item stability.

Item stability: the extent to which responses to the same test items are consistent.

Internal consistency measures how consistently an applicant responds to similar items (e.g., items measuring the same personality trait or ability). Two factors influence internal consistency:
- Item homogeneity: the extent to which the test items measure the same construct.
- Test length: longer tests generally yield higher internal consistency (more items mean less impact from careless errors).

Example (final examinations): a 3-item test covering 3 chapters may yield low reliability because of its diverse content; breaking the exam into homogeneous groups of items increases reliability.

METHODS TO ASSESS INTERNAL CONSISTENCY
- Split-half method: splits the test items into two groups, correlates the scores on the two halves, and adjusts the result using the Spearman-Brown formula.
- Spearman-Brown prophecy formula: used to correct the reliability coefficient resulting from the split-half method.
- Cronbach's coefficient alpha: the most common measure; it reflects the reliability across all possible item combinations and is used for tests with interval or ratio scale items.
- Kuder-Richardson Formula 20 (K-R 20): used for dichotomous items (e.g., true-false).
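As a concrete illustration of the split-half and coefficient alpha computations listed above, here is a short Python sketch. It is not from the chapter: the item-response matrix is invented, and the formulas used are the standard Spearman-Brown correction and Cronbach's alpha.

```python
import numpy as np

# Hypothetical item responses: rows = 6 test takers, columns = 6 items
# scored on a 1-5 scale.
items = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 2, 2, 1],
    [4, 4, 5, 4, 4, 4],
])

# Split-half method: correlate scores on the odd items with scores on
# the even items, then apply the Spearman-Brown prophecy formula.
odd_total  = items[:, 0::2].sum(axis=1)
even_total = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_total, even_total)[0, 1]
split_half = (2 * r_half) / (1 + r_half)   # Spearman-Brown correction

# Cronbach's coefficient alpha for interval/ratio items.
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Split-half (Spearman-Brown corrected): {split_half:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
# For dichotomous (true-false) items, K-R 20 is the special case of
# alpha in which each item variance is p * (1 - p).
```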
INTERNAL RELIABILITY: KEY STATISTICS
The median internal reliability coefficient for tests is .81, and coefficient alpha is the most commonly reported measure of internal reliability (Hogan et al., 2003).

SCORER RELIABILITY
Definition: the extent to which two people scoring a test agree on the test score, or the extent to which a test is scored correctly. Scorer reliability is important for ensuring accurate scoring. Problems arise with subjective tests (e.g., projective tests) and even with objective tests. Research findings: Allard et al. (1995) found that 53% of hand-scored personality tests contained scoring errors, 19% of which affected the resulting diagnoses; Goddard et al. (2004) found that 12% of hand-scored interest inventories contained errors, 64% of which could change the career advice given.

INTERRATER RELIABILITY
Interrater reliability is the consistency between different scorers (e.g., interviewers, supervisors). Example: judges on talent shows such as American Idol.

EVALUATING TEST RELIABILITY
Two key factors should be considered:
1. The reliability coefficient: it can be obtained from your own data, from test manuals, or from journal articles, and should be compared with the coefficients typically obtained for similar tests.
2. The test population: consider whether the reliability was established with a population similar to your test takers. For example, the NEO personality scales showed lower reliability for men and students than for women and adults (Caruso, 2003).

VALIDITY
Validity is the degree to which inferences from test scores are justified by the evidence. A test's reliability does not imply validity; instead, reliability has a necessary but not sufficient relationship with validity.

FIVE STRATEGIES TO INVESTIGATE TEST VALIDITY
1. Content validity: measures how well test items sample the intended content. Example: a final exam must fairly cover the material from the specified chapters. In employment settings, content validity is determined through a job analysis that identifies the relevant KSAOs (knowledge, skills, abilities, and other characteristics).
2. Criterion validity: the extent to which a test score is related to some measure of job performance. A criterion is a measure of job performance, such as attendance, productivity, or a supervisor rating. There are two types: concurrent validity, in which the test is given to current employees and their scores are correlated with their performance, and predictive validity, in which the test is given to job applicants and their scores are compared with their future performance. It is ideal to use a broad range of scores to obtain stronger validity coefficients. (A short computational sketch follows this list.)
3. Construct validity: evaluates whether a test measures the intended construct. Methods include convergent validity (the test correlates well with measures of similar constructs), discriminant validity (the test shows low correlations with measures of different constructs), and known-group validity (scores are compared across groups known to differ on the trait).
4. Validity generalization: the extent to which a test valid for one job is also valid for similar jobs elsewhere; it is supported by meta-analysis and job analysis.
5. Synthetic validity: assumes that tests predicting specific job components can be applied to similar jobs that share those components.
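As a rough illustration of the criterion validity idea (item 2 above), the Python sketch below correlates hypothetical test scores with hypothetical supervisor ratings, as in a concurrent validity study. The data are invented, and scipy's pearsonr is used simply to obtain the validity coefficient and its p-value; none of these numbers come from the chapter.

```python
from scipy.stats import pearsonr

# Hypothetical concurrent validity study: test scores of current
# employees and their supervisor performance ratings (1-10).
test_scores        = [55, 62, 47, 70, 58, 66, 51, 73, 60, 68]
supervisor_ratings = [ 6,  7,  5,  9,  6,  8,  5,  8,  7,  8]

validity, p_value = pearsonr(test_scores, supervisor_ratings)
print(f"Criterion validity coefficient: r = {validity:.2f} (p = {p_value:.3f})")
# In practice, validity coefficients of roughly .20 to .30 are common and
# can be statistically significant with large enough samples.
```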
VALIDITY MEASUREMENT METHODS
Content validity: assesses whether a test covers the content relevant to the job. It is often sufficient for straightforward tasks (e.g., typing tests for clerical positions). The "next-door-neighbor rule" is suggested for evaluating its sufficiency: would the content linkage stand up to scrutiny in court?

Criterion validity: evaluates how well a test predicts job performance. It is more critical when the link between the test and job performance is not obvious (e.g., using a critical thinking test for police officers). However, obtaining a significant validity coefficient can be challenging, and an unsuccessful study could weaken the test's legal defensibility.

Face validity: while not a formal measure of validity, face validity pertains to how relevant a test appears to be to the people who take it. High face validity can enhance test-taker motivation and acceptance and reduce the likelihood of legal challenges. However, face validity alone does not ensure accuracy.

CHALLENGES AND CONSIDERATIONS
Validity is job-specific; a test may be valid for one position but not another. Small correlation coefficients (often around .20 to .30) can be statistically significant but may not seem persuasive in real-world applications. Conducting criterion validity studies can also be risky if the outcomes are not favorable.

FINDING RELIABILITY AND VALIDITY INFORMATION
Resources such as the Mental Measurements Yearbook and Tests in Print provide comprehensive reliability and validity data for a wide range of tests.

COST EFFICIENCY
When multiple tests have similar validity, cost-effectiveness becomes essential. For example, a cheaper, quicker group-administered test may be preferable to a more expensive individually administered test with similar predictive power.

ADVANCEMENTS IN TESTING
More organizations are adopting computer-assisted (online) testing for efficiency and cost savings. Computer-adaptive testing (CAT) adjusts question difficulty based on previous answers, improving the testing experience. In essence, the choice of validity measurement method and the implementation of testing practices depend on the specific context and objectives of the assessment; understanding these nuances can guide effective selection practices in organizations.

ESTABLISHING THE USEFULNESS OF A SELECTION DEVICE
"Even when a test is both reliable and valid, it is not necessarily useful."

FORMULAS TO DETERMINE HOW USEFUL A TEST IS
1. TAYLOR-RUSSELL TABLES
Tables designed to estimate the percentage of future employees who will be successful on the job if an organization uses a particular test.

Philosophy of the Taylor-Russell tables: a test will be useful to an organization if (1) the test is valid, (2) the organization can be selective in its hiring because it has more applicants than openings, and (3) there are plenty of current employees who are not performing well, so there is room for improvement.

HOW TO USE THE TAYLOR-RUSSELL TABLES
To use the Taylor-Russell tables, three pieces of information must be obtained.

1. The test's criterion validity coefficient. This can be obtained in two ways: by conducting a criterion validity study in which test scores are correlated with some measure of job performance, or through validity generalization if the organization wants to know whether testing is likely to be useful before investing the time and money in a criterion validity study. To estimate the validity coefficient an organization might obtain, one of the coefficients from Table 5.2 in the previous chapter is used. The higher the validity coefficient, the greater the possibility that the test will be useful.

2. The selection ratio, which is the percentage of applicants an organization must hire. It is determined by the following formula:

   selection ratio = number of openings (people hired) / number of applicants

The lower the selection ratio, the greater the potential usefulness of the test. (A worked sketch covering the selection ratio and the Taylor-Russell lookup appears after the discussion of the base rate below.)
3. The base rate of current performance, which is the percentage of employees currently on the job who are considered successful. It can usually be obtained in one of two ways:
(1) Employees are split into two equal groups based on their criterion scores. The base rate using this method is always .50, because one half of the employees are considered satisfactory. This is the simplest method but the least accurate.
(2) A criterion score is chosen above which all employees are considered successful. For example, at one real estate agency, any agent who sells more than Php700,000 of properties makes a profit for the agency after training and operating expenses have been deducted. In this case, agents selling more than Php700,000 of properties would be considered successes because they made money for the company, and agents selling less would be considered failures because they cost the company more money than they brought in. In this example there is a clear point at which an employee can be considered a success. Most of the time, however, there is no such clear point, and managers subjectively choose a point on the criterion that they feel separates successful from unsuccessful employees. This method is more meaningful than the first one.

After the (1) validity, (2) selection ratio, and (3) base rate figures have been obtained, the Taylor-Russell tables (Table 6.4) are consulted.

EXAMPLE OF THE TAYLOR-RUSSELL TABLES
Suppose we have a test validity of .40, a selection ratio of .30, and a base rate of .50. How many future employees are likely to be considered successful? According to the tables, if the organization uses that particular selection test, 69% of future employees are likely to be considered successful. This figure is compared with the previous base rate of .50, indicating a 38% increase in successful employees (.19/.50 = .38).

2. PROPORTION OF CORRECT DECISIONS
The proportion of correct decisions is a utility method that compares the percentage of times a selection decision was accurate with the percentage of successful employees. It is easier to compute but less accurate than the Taylor-Russell tables. Only two pieces of information are needed: the employees' test scores and their scores on the criterion. The two scores from each employee are graphed on a chart similar to that in Figure 6.1. Lines are drawn from the point on the y-axis (the criterion score) that represents a successful employee and from the point on the x-axis that represents the lowest test score of a hired applicant. These lines divide the scores into four quadrants: Quadrant I represents employees who scored poorly on the test but performed well on the job; Quadrant II represents employees who scored well on the test and were successful on the job; Quadrant III represents employees who scored high on the test yet did poorly on the job; and Quadrant IV represents employees who scored low on the test and did poorly on the job. If a test is a good predictor of performance, there should be more points in Quadrants II and IV, because the points in the other two quadrants represent "predictive failures": in Quadrants I and III there is no correspondence between test scores and criterion scores.
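To make the Taylor-Russell example concrete, here is a minimal Python sketch. Only the single table value used in the example above (validity .40, selection ratio .30, base rate .50, giving .69) is included; a real implementation would need the full Taylor-Russell tables (Table 6.4), and the function and dictionary names are illustrative, not from the chapter.

```python
# Illustrative lookup containing only the value used in the example above;
# the complete Taylor-Russell tables hold an entry for every combination
# of base rate, validity, and selection ratio.
TAYLOR_RUSSELL = {
    (0.50, 0.40, 0.30): 0.69,   # (base rate, validity, selection ratio)
}

def expected_success_rate(base_rate, validity, openings, applicants):
    """Look up the expected percentage of successful future hires."""
    selection_ratio = openings / applicants
    return TAYLOR_RUSSELL[(base_rate, validity, round(selection_ratio, 2))]

base_rate = 0.50
new_rate = expected_success_rate(base_rate, validity=0.40, openings=30, applicants=100)
increase = (new_rate - base_rate) / base_rate
print(f"Expected successful employees: {new_rate:.0%}")   # 69%
print(f"Improvement over the base rate: {increase:.0%}")  # 38%
```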
PROPORTION OF CORRECT DECISIONS: FORMULAS
1. Effectiveness. To estimate the test's effectiveness, the number of points in each quadrant is totaled, and the following formula is used:

   proportion of correct decisions = (points in Quadrants II and IV) / (total points in all quadrants)

The resulting number represents the percentage of time that we expect to be accurate in making a selection decision in the future.

2. Improvement. To determine whether this is an improvement, we use the following formula:

   baseline = (points in Quadrants I and II) / (total points in all quadrants)

If the percentage from the first formula is higher than that from the second, our proposed test should increase selection accuracy. If not, it is probably better to stick with the selection method currently used.

EXAMPLE
Based on the chart described above (Figure 6.1), there are 5 data points in Quadrant I, 10 in Quadrant II, 4 in Quadrant III, and 11 in Quadrant IV. The percentage of time we expect to be accurate in the future is (10 + 11) / 30 = .70. To compare this figure with the selection test we were previously using, we compute the satisfactory performance baseline: (5 + 10) / 30 = .50.

Comparing the two percentages:
PCD effectiveness: .70
PCD baseline: .50
(.70 − .50) / .50 = .40, or 40%.
Using the new test would therefore result in a 40% increase in selection accuracy over the selection method previously used.
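The proportion-of-correct-decisions arithmetic above can be expressed in a few lines of Python. This sketch uses the quadrant counts from the example; the function name is illustrative rather than anything defined in the chapter.

```python
def proportion_correct_decisions(q1, q2, q3, q4):
    """Return (PCD effectiveness, baseline) from the four quadrant counts.

    q1: low test score / good performance    q2: high test score / good performance
    q3: high test score / poor performance   q4: low test score / poor performance
    """
    total = q1 + q2 + q3 + q4
    effectiveness = (q2 + q4) / total   # correct predictions (Quadrants II and IV)
    baseline      = (q1 + q2) / total   # proportion of satisfactory performers
    return effectiveness, baseline

# Quadrant counts from the example: 5, 10, 4, and 11 points.
pcd, baseline = proportion_correct_decisions(5, 10, 4, 11)
improvement = (pcd - baseline) / baseline
print(f"PCD effectiveness: {pcd:.2f}")         # 0.70
print(f"Baseline:          {baseline:.2f}")    # 0.50
print(f"Gain in accuracy:  {improvement:.0%}") # 40%
```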
3. LAWSHE TABLES
Tables that use the base rate, the test validity, and the applicant's percentile on a test to determine the probability of future success for that applicant. The Taylor-Russell tables were designed to determine the overall impact of a testing procedure, but we often need to know the probability that a particular applicant will be successful; this is what the Lawshe tables provide.

To use the Lawshe tables, three pieces of information are needed: (1) the validity coefficient and (2) the base rate are found in the same way as for the Taylor-Russell tables; (3) the third piece of information is the applicant's test score, specifically whether the person scored in the top 20%, the next 20%, the middle 20%, the next lowest 20%, or the bottom 20%. Once we have all three pieces of information, the Lawshe tables (Table 6.5) are examined.

For our example, we have a base rate of .50, a validity of .40, and an applicant who scored third highest out of 10. What is the chance of this applicant being successful? First, we locate the table with the base rate of .50. Then we locate the appropriate score category at the top of the chart: our applicant scored third highest out of 10 applicants, so she falls in the second category, the next highest one-fifth (20%). Using the validity of .40, we locate the intersection of the validity row and the test score column and find 59. This means that the applicant has a 59% chance of being a successful employee.

BROGDEN-CRONBACH-GLESER UTILITY FORMULA
The utility formula is a method of ascertaining the extent to which an organization will benefit from the use of a particular selection system.

HOW TO USE THE UTILITY FORMULA
I/O psychologists have devised a fairly simple utility formula to estimate the monetary savings to an organization. To use this formula, five items of information must be known:
1. The number of employees hired per year (n).
2. The average tenure (t): the average amount of time that employees in the position tend to stay with the company.
3. The test validity (r).
4. The standard deviation of performance in dollars (SDy).
5. The mean standardized predictor score of selected applicants (m).

UTILITY FORMULA EXAMPLE
Suppose we administer a test of mental ability to a group of 100 applicants and hire the 10 with the highest scores. The average score of the 10 hired applicants was 34.6, the average test score of the other 90 applicants was 28.4, and the standard deviation of all test scores was 8.3. The desired figure would be:

   m = (34.6 − 28.4) / 8.3 = .75

The second way to find m is to compute the proportion of applicants who are hired and then use a conversion table such as that in Table 6.6 to convert the proportion into a standard score. Using the previous example, the proportion of applicants hired would be 10/100 = .10. From Table 6.6, we see that the standard score (m) associated with a selection ratio of .10 is 1.76.

To determine the savings to the company, we use the following formula:

   savings = (n)(t)(r)(SDy)(m) − cost of testing

As an example, suppose we hire 10 auditors per year, the average person in this position stays two years, the validity coefficient is .30, the average annual salary for the position is $30,000, and we have 50 applicants for 10 openings. Plugging the resulting values into the formula shows that, after accounting for the cost of testing, using this particular test instead of selecting employees by chance will save the company $100,300 over the two years that auditors typically stay with the organization. Because a company seldom selects employees by chance, the same formula should be used with the validity of the selection method the company currently uses (interview, psychological test, references, and so on), and the result of that computation should then be subtracted from the first.
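The Python sketch below works through the utility computation. The transcript omits the intermediate figures for the auditor example; the values filled in here (SDy taken as 40% of salary, m of about 1.40 for a .20 selection ratio from a conversion table like Table 6.6, and a testing cost of $10 per applicant) are assumptions that happen to reproduce the $100,300 figure reported above, so treat them as illustrative rather than authoritative.

```python
def utility_savings(n, t, r, sd_y, m, cost_of_testing):
    """Brogden-Cronbach-Gleser savings: (n)(t)(r)(SDy)(m) - cost of testing."""
    return n * t * r * sd_y * m - cost_of_testing

# First way to estimate m: from the actual test scores in the example.
m_from_scores = (34.6 - 28.4) / 8.3          # ~0.75

# Second way: from the selection ratio via a conversion table (Table 6.6).
# Only the .10 value is given in the chapter; the .20 entry is an assumption.
M_FROM_SELECTION_RATIO = {0.10: 1.76, 0.20: 1.40}

# Auditor example: 10 hires/year, 2-year tenure, validity .30,
# salary $30,000, 50 applicants for 10 openings.
selection_ratio = 10 / 50                    # 0.20
m = M_FROM_SELECTION_RATIO[selection_ratio]
sd_y = 0.40 * 30_000                         # assumed 40%-of-salary estimate of SDy
cost = 50 * 10                               # assumed $10 testing cost per applicant

print(f"Estimated savings: ${utility_savings(10, 2, 0.30, sd_y, m, cost):,.0f}")
# -> Estimated savings: $100,300
```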
DETERMINING THE FAIRNESS OF A TEST
"After the test is determined to be reliable and valid and to have utility for an organization, the next step is to ensure that the test is fair and unbiased." Although there is disagreement among I/O psychologists regarding the definition of test fairness, most professionals agree that one must consider potential race, gender, disability, and other cultural differences in both the content of the test (measurement bias) and the way in which scores from the test predict job performance (predictive bias).

1. MEASUREMENT BIAS
Measurement bias refers to technical aspects of a test. A test is considered to have measurement bias if there are group differences (e.g., sex, race, or age) in test scores that are unrelated to the construct being measured. For example, if race differences on a test of logic are due to vocabulary words found more often in the White culture than in the African American culture, and those words are not important to performance of the job in question, the test might be considered to have measurement bias and thus not be fair in that particular situation. The statistical methods for determining measurement bias can be very complicated. From a legal perspective, however, if differences in test scores result in one group (e.g., men) being selected at a significantly higher rate than another (e.g., women), adverse impact is said to have occurred, and the burden is on the organization using the test to prove that the test is valid.

2. PREDICTIVE BIAS
Predictive bias refers to situations in which the predicted level of job success falsely favors one group (e.g., men) over another (e.g., women). That is, a test would have predictive bias if men scored higher on the test than women but the job performance of women was equal to or better than that of men.

The first form of predictive bias is single-group validity, meaning that the test significantly predicts performance for one group and not others. For example, a test of reading ability might predict the performance of White clerks but not of African American clerks. To test for single-group validity, separate correlations are computed between the test and the criterion for each group, as shown in the sketch below. If both correlations are significant, the test does not exhibit single-group validity and it passes this fairness hurdle. If only one of the correlations is significant, the test is considered fair for only that one group. Single-group validity is very rare and is usually the result of small sample sizes and other methodological problems. Where it occurs, an organization has two choices: (1) it can disregard the finding, because research indicates that single-group validity probably occurred by chance, or (2) it can stop using the test. Disregarding it is probably the most appropriate choice, given that most I/O psychologists believe single-group validity occurs only by chance. As evidence of this, try to think of a logical reason a test would predict differently for African Americans than for Whites, or differently for males than for females. Why would a test of intelligence predict performance for males but not for females? Why would a personality inventory predict performance for African Americans but not for Whites? There may be many cultural reasons why two groups score differently on a test (e.g., educational opportunities, socioeconomic status), but finding a logical reason that the test would predict differently for the two groups is difficult.
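To show what "separate correlations for each group" looks like in practice, here is a rough Python sketch of a single-group validity check. The data are invented, and the .05 significance criterion is the conventional choice rather than anything specified in the chapter.

```python
from scipy.stats import pearsonr

def group_validity(scores_by_group, criterion_by_group, alpha=0.05):
    """Compute the test-criterion correlation separately for each group."""
    for group in scores_by_group:
        r, p = pearsonr(scores_by_group[group], criterion_by_group[group])
        flag = "significant" if p < alpha else "not significant"
        print(f"{group}: r = {r:.2f} ({flag}, p = {p:.3f})")

# Hypothetical test scores and supervisor ratings for two groups.
scores = {
    "Group 1": [48, 55, 61, 42, 66, 58, 51, 70, 45, 63],
    "Group 2": [50, 57, 44, 62, 68, 47, 59, 53, 65, 41],
}
ratings = {
    "Group 1": [5, 6, 7, 4, 8, 6, 5, 9, 4, 7],
    "Group 2": [6, 6, 5, 7, 8, 5, 6, 6, 7, 4],
}

# If the correlation is significant for both groups, the test passes this
# fairness hurdle; if it is significant for only one group, single-group
# validity is present (which is rare and usually due to chance).
group_validity(scores, ratings)
```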
The second form of predictive bias is differential validity: the characteristic of a test that significantly predicts a criterion for two groups, such as both minorities and nonminorities, but predicts significantly better for one of the two groups. With differential validity the test is valid for both groups, but it is more valid for one than for the other. Single-group validity and differential validity are easily confused, but there is a big difference between the two: with single-group validity, the test is valid for only one group; with differential validity, it is valid for both groups but more valid for one. Differential validity is also rare. When it does occur, it is usually in occupations dominated by a single sex, the tests are most valid for the dominant sex, and the tests overpredict minority performance.

If differential validity occurs, the organization has two choices. The first is not to use the test. Usually, however, this is not a good option: finding a test that is valid is difficult, and throwing away a good test would be a shame. The second option is to use the test with separate regression equations for each group. Because applicants do not realize that the test is scored differently, this avoids the public relations problems that occur with the use of separate tests. However, the 1991 Civil Rights Act prohibits score adjustments based on race or gender; as a result, using separate equations may be statistically acceptable but would not be legally defensible.

PERCEPTION OF FAIRNESS
Another important aspect of test fairness is the perception of fairness held by the applicants taking the test. A test may have neither measurement bias nor predictive bias, yet applicants might perceive the test itself, or the way in which it is administered, as unfair. Factors that can affect applicants' perceptions of fairness include (1) the difficulty of the test, (2) the amount of time allowed to complete the test, (3) the face validity of the test items, (4) the manner in which hiring decisions are made from the test scores, (5) policies about retaking the test, and (6) the way in which requests for testing accommodations for disabilities were handled.

DETERMINING THE FAIRNESS OF A TEST: RECAP
Once a test has been determined to be reliable and valid and to have utility for an organization, the next step is to ensure that it is fair and unbiased, considering both measurement bias and predictive bias. From a legal perspective, adverse impact is an employment practice that results in members of a protected class being negatively affected at a higher rate than members of the majority class; adverse impact is usually determined by the four-fifths rule.
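Since the four-fifths rule is mentioned but not worked through here, the following Python sketch shows the usual computation: the selection rate of the protected group is compared with that of the majority group, and a ratio below 80% flags potential adverse impact. The applicant counts are invented for illustration.

```python
def four_fifths_check(hired_a, applied_a, hired_b, applied_b):
    """Compare selection rates; a ratio below 0.80 suggests adverse impact.

    Group A is the group with the lower selection rate (e.g., a protected
    class); Group B is the comparison (majority) group.
    """
    rate_a = hired_a / applied_a
    rate_b = hired_b / applied_b
    ratio = rate_a / rate_b
    return rate_a, rate_b, ratio, ratio < 0.80

# Hypothetical applicant flow: 10 of 50 minority applicants hired,
# 30 of 100 majority applicants hired.
rate_a, rate_b, ratio, adverse = four_fifths_check(10, 50, 30, 100)
print(f"Selection rates: {rate_a:.0%} vs {rate_b:.0%}; ratio = {ratio:.2f}")
print("Potential adverse impact" if adverse else "Passes the four-fifths rule")
# 20% / 30% = 0.67, which is below 0.80, so this example would be flagged.
```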
MAKING THE HIRING DECISION
After valid and fair selection tests have been administered to a group of applicants, a final decision must be made as to which applicant or applicants to hire.

UNADJUSTED TOP-DOWN SELECTION
With top-down selection, applicants are rank-ordered on the basis of their test scores. Selection is then made by starting with the highest score and moving down until all openings have been filled. In a compensatory approach to top-down selection, the assumption is that if multiple test scores are used, a low score on one test can be compensated for by a high score on another.

RULE OF THREE
A technique often used in the public sector is the rule of three (or rule of five), in which the names of the top three scorers are given to the person making the hiring decision (e.g., a police chief or HR director). This person can then choose any of the three based on the immediate needs of the employer. This method ensures that the person hired will be well qualified but provides more choice than does top-down selection.

PASSING SCORES
Passing scores are a means for reducing adverse impact and increasing flexibility. With this system, an organization determines the lowest score on a test that is associated with acceptable performance on the job. Note the distinct difference between top-down selection and passing scores: with top-down selection, the question is, "Who will perform the best in the future?"; with passing scores, the question becomes, "Who will be able to perform at an acceptable level in the future?"

If there is more than one test for which we have passing scores, a decision must be made between a multiple-cutoff approach and a multiple-hurdle approach. Both approaches are used when one score cannot compensate for another or when the relationship between the selection test and performance is not linear. With a multiple-cutoff approach, the applicants are administered all of the tests at one time; if they fail any of the tests (fall below the passing score), they are not considered further for employment. One problem with the multiple-cutoff approach is cost: if an applicant passes only three out of four tests, he will not be hired, but the organization has paid for the applicant to take all four tests. To reduce the costs associated with applicants failing one or more tests, multiple-hurdle approaches are often used. With a multiple-hurdle approach, the applicant is administered one test at a time, usually beginning with the least expensive. Applicants who fail a test are eliminated from further consideration and take no more tests. Applicants who pass all of the tests are then administered the linearly related tests, and the applicants with the top scores on those tests are hired.

BANDING
Banding takes into consideration the degree of error associated with any test score. Even though one applicant might score two points higher than another, the two-point difference might be the result of chance (error) rather than actual differences in ability. The question then becomes, "How many points apart do two applicants have to be before we say their test scores are significantly different?" We can answer this question using a statistic called the standard error of measurement (SEM). To compute this statistic, we obtain the reliability and the standard deviation (SD) of the test, either from the test catalog or by computing them ourselves from the actual test scores. This information is then plugged into the following formula:

   SEM = SD × √(1 − reliability)
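As a small illustration of the banding logic, the sketch below computes the SEM from a test's standard deviation and reliability and then compares two applicants. Treating scores within roughly 1.96 SEMs of each other as not meaningfully different is a common 95%-confidence convention assumed here for illustration, not a rule stated in the chapter, and the test values are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def within_band(score_a, score_b, sem, z=1.96):
    """Treat two scores as equivalent if they differ by less than z * SEM.

    The 1.96 multiplier (a 95% confidence convention) is an assumption for
    illustration; organizations set their own banding rules.
    """
    return abs(score_a - score_b) < z * sem

# Hypothetical test: standard deviation 8.3, reliability .86.
sem = standard_error_of_measurement(sd=8.3, reliability=0.86)
print(f"SEM = {sem:.2f}")

# Two applicants two points apart on the test.
print("Scores treated as equivalent:", within_band(88, 86, sem))
```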
CHAPTER SUMMARY
In this chapter you learned:
The reliability of a test can be measured in four ways: (a) the test-retest method, which measures temporal stability; (b) the alternate-forms method, which measures form stability; (c) the internal consistency method (split-half, K-R 20, and coefficient alpha), which measures item homogeneity; and (d) scorer reliability, which reflects how consistently and accurately a test is scored.
Tests can be validated using five approaches: content, criterion, construct, known group, and face.
Information about tests can be obtained from such sources as the Mental Measurements Yearbook.
The utility of a test can be determined using the Taylor-Russell tables, the Lawshe tables, the proportion of correct decisions, and utility formulas.
The fairness of a test can be determined by testing for adverse impact, single-group validity, and differential validity.
Selection decisions can be made in four ways: top-down, rule of three, top-down with banding, or passing scores.

THANK YOU