Utility Analysis (Lesson 5.2)

Summary

This document details utility analysis, particularly in the context of psychometrics. It breaks down factors affecting a test's usefulness, including its psychometric soundness, associated costs, and potential benefits. It also explains utility analysis as a broad family of techniques for evaluating the cost-benefit relationship of assessments.

Full Transcript

**[Utility]** In the language of psychometrics, *utility* (also referred to as *test utility*) refers to how useful a test or battery of tests is. More specifically, it refers to the practical value of using a test to aid in decision making.

[Factors that affect a test's utility]

1. Psychometric soundness. A test is said to be psychometrically sound for a particular purpose if its reliability and validity coefficients are acceptably high. How can an index of utility be distinguished from an index of reliability or validity? The short answer is as follows: an index of reliability can tell us something about how consistently a test measures what it measures, and an index of validity can tell us something about whether a test measures what it purports to measure, but an index of utility can tell us something about the practical value of the information derived from scores on the test. Test scores are said to have utility if their use in a particular situation helps us make better decisions---better, that is, in the sense of being more cost-effective. A test that is reliable or valid is not automatically useful. For example, a skin patch was developed to determine whether people were taking drugs. In the study, results from the patch had a significant direct relationship with urine-test results, which suggests the patch can detect drug use. Even so, the researchers could not conclude that the patch was useful, because it could easily be tampered with---by not applying it properly or by putting it on another person.

2. Costs. In the context of testing, cost refers to disadvantages, losses, or expenses in both economic and noneconomic terms. Economic costs can be related to expenses associated with testing.
If testing is to be conducted, then it may be necessary to allocate funds to purchase (1) a particular test, (2) a supply of blank test protocols, and (3) computerized test processing, scoring, and interpretation from the test publisher or some independent service. Associated costs of testing may come in the form of (1) payment to professional personnel and staff involved in test administration, scoring, and interpretation; (2) facility rental, mortgage, and/or other charges related to use of the test facility; and (3) insurance, legal, accounting, licensing, and other routine costs of doing business. In some settings, such as private clinics, these costs may be offset by revenue, such as fees paid by testtakers. In other settings, such as research organizations, these costs will be paid from the test user's funds, which may in turn derive from sources such as private donations or government grants.

Noneconomic costs could include loss of public confidence in organizations that fail to conduct proper testing and assessment. Imagine if a company did not assess its applicants: the public would have no basis for regarding its employees as competent. Another noneconomic cost would be the safety of workers and of the public if such an organization provided services like transportation or healthcare. We also cannot forget the time and effort that test developers and users expend in testing and assessment.

3. Benefits. For a test to be useful, the benefits must outweigh the costs. Benefits refer to profits, gains, or advantages in both economic and noneconomic terms. As an example of an economic benefit, if a new personnel testing program results in the selection of employees who produce significantly more than other employees, then the program will have been responsible for greater productivity on the part of the new employees. This greater productivity may lead to greater overall company profits.
If a new method of quality control in a food-processing plant results in higher-quality products and less product being discarded as waste, the net result will be greater profits for the company. Noneconomic benefits could include increases in workers' performance, reductions in accidents, reductions in worker turnover, and a good work environment.

[Utility Analysis]

Utility analysis is a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment. Note that this definition uses the phrase "family of techniques." This is because a utility analysis is not one specific technique used for one specific objective. Rather, *utility analysis* is an umbrella term covering various possible methods, each requiring various kinds of data as input and yielding various kinds of output. The purpose of utility analysis is to evaluate whether the benefits of a test outweigh its costs. In evaluating a test, utility analysis can help in making decisions such as whether:

- one test is preferable to another test for a specific purpose;
- one tool of assessment (such as a test) is preferable to another tool of assessment (such as behavioral observation) for a specific purpose;
- the addition of one or more tests (or other tools of assessment) to those already in use is preferable for a specific purpose;
- no testing or assessment is preferable to any testing or assessment.

*General Approaches to Utility Analysis*

1. Expectancy data. An expectancy table can provide an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure---an interval that may be categorized as "passing," "acceptable," or "failing."
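As a rough illustration of how such an expectancy table can be tallied from paired (test score, criterion) data, here is a minimal Python sketch. The function name, the interval boundaries, and the sample data are all hypothetical, chosen only to mirror the kind of score/grade pairing discussed here:

```python
from collections import defaultdict

def expectancy_table(pairs, score_bins, criterion_bins):
    """Tally paired (test score, criterion) data into row percentages.

    pairs: iterable of (test_score, criterion_value) tuples.
    score_bins / criterion_bins: (low, high) intervals, low inclusive,
    high exclusive.  Returns {score_bin: (row_percentages, N)}.
    """
    counts = {sb: defaultdict(int) for sb in score_bins}
    for score, crit in pairs:
        for sb in score_bins:
            if sb[0] <= score < sb[1]:
                for cb in criterion_bins:
                    if cb[0] <= crit < cb[1]:
                        counts[sb][cb] += 1
    table = {}
    for sb in score_bins:
        n = sum(counts[sb].values())  # N for this score interval
        row = {cb: (100 * counts[sb][cb] / n if n else 0.0)
               for cb in criterion_bins}
        table[sb] = (row, n)
    return table

# Hypothetical data: (test score, course grade) pairs
data = [(12, 65), (15, 72), (18, 74), (35, 78), (38, 85), (55, 88)]
tbl = expectancy_table(data,
                      score_bins=[(0, 20), (20, 40), (40, 60)],
                      criterion_bins=[(60, 70), (70, 80), (80, 90)])
```

Each row of the result gives, for one test-score interval, the percentage of people falling in each criterion interval, which is exactly the information an expectancy table displays.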
For example, with regard to the utility of a new and experimental personnel test in a corporate setting, an expectancy table can provide vital information to decision-makers. An expectancy table might indicate, for example, that the higher a worker's score on the new test, the greater the probability that the worker will be judged successful. In other words, the test is working as it should, and by instituting the new test on a permanent basis, the company could reasonably expect to improve its productivity. An expectancy chart is a graphic representation of an expectancy table. Look at the examples below.

[Figure: an organized scatterplot of test scores against criterion grades]

The image above is an organized scatterplot, which is the first step in making the expectancy table found below. In the scatterplot, the number of points is counted per grid cell (it is up to you to place gridlines according to the intervals you want to use). The values in pink and in parentheses are percentages. For example, on the language-score axis (X-axis), 30 people got a score between 0 and 20. Out of those 30, 11 got a grade between 60 and 70, 17 a grade between 70 and 80, and 2 a grade between 80 and 90.

[Figure: the resulting expectancy table (media/image2.png)]

The image above is an expectancy table. The numbers are in percent. The N in the 6th column is the total number of individuals per category; as in our example a while ago, 30 people got a test score below 20. It is up to the personnel to determine which intervals are considered "passing," "failing," or "excellent," if they wish.

[Figure: an expectancy chart relating interview/test-score ratings to supervisor-rated production]

The chart above is an example of an expectancy chart. The process of making it is like the examples above; the only difference is that the table is presented as a graph. In this example, the ratings are based on the intervals (as in the table).
Ratings here are based on interviews and test scores, while production was evaluated by the supervisor through observation. For example, of all the people who had excellent ratings, 94% were observed to have satisfactory production and 6% unsatisfactory. This could signal a problem with the interview or test scores: how is it that some had excellent ratings, yet their production was still unsatisfactory?

2. Taylor-Russell Tables. These tables provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection. They estimate the percentage of employees hired by use of a particular test who will be successful at their jobs, given different combinations of three variables: the test's validity, the selection ratio used, and the base rate. The value assigned for the test's validity is the computed validity coefficient. The *selection ratio* is a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired. For instance, if there are 50 positions and 100 applicants, then the selection ratio is 50/100, or .50. As used here, *base rate* refers to the percentage of people hired under the existing system for a particular position. If, for example, a firm employs 25 computer programmers and 20 are considered successful, the base rate would be .80. With knowledge of the validity coefficient of a particular test along with the selection ratio, reference to the Taylor-Russell tables provides the personnel officer with an estimate of how much using the test would improve selection over existing methods. An example can be seen below.

[Figure: a sample Taylor-Russell table (media/image4.png)]

[Some practical considerations]

1. Pool of job applicants. The utility of a test may be overestimated because top scorers on the test may not actually accept the job.
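The selection ratio and base rate that feed into the Taylor-Russell tables are simple proportions. A minimal sketch of that arithmetic follows; the function names are illustrative, and the Taylor-Russell lookup itself requires the published tables, which are not reproduced here:

```python
def selection_ratio(n_positions, n_applicants):
    """Proportion of available applicants who can be hired."""
    return n_positions / n_applicants

def base_rate(n_successful, n_employed):
    """Proportion of people hired under the existing system
    who are considered successful."""
    return n_successful / n_employed

# The examples from the text: 50 positions for 100 applicants,
# and 20 successful computer programmers out of 25 employed.
sr = selection_ratio(50, 100)   # 0.50
br = base_rate(20, 25)          # 0.80
```

With a validity coefficient in hand, these two values index the row and column of the appropriate Taylor-Russell table, which then gives the expected proportion of successful hires under the new test.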
Remember that since they are top scorers, they have better skills and abilities, which make them sought after in the job market.

2. Cut score in use. Cut scores (or cutoff scores) are reference points derived as a result of a judgment and used to divide a set of data into two or more classifications, with some action taken based on these classifications.

   a. Relative cut score (norm-referenced cut score)
      - A reference point based on norm-related considerations rather than on a set standard.
      - Example: an instructor may say that the top 10% will receive a perfect score on the test. Who falls in the top 10% depends on, or is relative to, the performance of the group.

   b. Fixed cut score (absolute cut score)
      - A reference point typically set with reference to a judgment concerning the minimum level of proficiency required to be included in a classification.
      - Example: you must get a grade of 75 to pass and move on; this has nothing to do with the performance of your classmates.

   c. Multiple cut scores
      - The use of two or more cut scores with reference to one predictor for the purpose of categorizing testtakers.
      - Example: if your score reaches a certain cut score, you are classified accordingly, as in the letter-grading systems of other countries (A, B, C, D, E, and F). In employment, a cut score can be set for each task that you need to pass to be considered for the position.

   d. Multiple hurdles (multistage)
      - A cut score is placed on each predictor, to make sure that testtakers have a minimum level of skill at each stage.
      - Example: in employment, before you can proceed to the 2nd stage of application, you need to pass stage 1 first. In school, if you are applying for a scholarship, you need to pass the screening of the application form before you can continue to the processing of requirements.

   e. Compensatory model of selection
      - High scores on one test or stage can balance out, or compensate for, low scores on other tests or stages.
      - Different weights can be given to tests or stages in the application.

[Methods for Setting Cut Scores]

1. The Angoff Method
   - Developed by William Angoff.
   - Can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability.
   - When used for purposes of personnel selection, experts in the area estimate the probability that testtakers with at least minimal competence for the position would answer each test item correctly. As applied to determining whether or not testtakers possess a particular trait, attribute, or ability, an expert panel makes judgments concerning the way a person with that trait, attribute, or ability would respond to test items. In both cases, the judgments of the experts are averaged to yield the cut score for the test. Persons who score at or above the cut score are considered high enough in the ability to be hired, or sufficiently high in the trait, attribute, or ability of interest.
   - Cannot be used if the judges or raters do not agree, that is, if inter-rater reliability is low.

2. The Known Groups Method (method of contrasting groups)
   - Entails collecting data on the predictor of interest from groups known to possess, and known not to possess, the trait, ability, or attribute of interest. Based on an analysis of these data, a cut score is set on the test at the point that best discriminates between the two groups' test performance.
   - Example:

[Figure: score distributions of the passing and failing groups on a Math placement test (media/image6.png)]

   - The illustration above was used to establish a cut score on a Math placement test. To do this, researchers gave the test to incoming freshmen and held the results until the 1st semester finished. They used the students' 1st-semester grades to categorize them as passing or failing, then plotted the test scores accordingly.
   - To determine the cut score, find the score at the point of least difference between the two groups.
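One way to read the "point of least difference" rule computationally is as a search for the cut score that misclassifies the fewest people across the two known groups. This is an illustrative interpretation only, with hypothetical scores:

```python
def known_groups_cut(pass_scores, fail_scores):
    """Pick the cut score that misclassifies the fewest people.

    Scoring at or above the cut predicts 'pass'; below it predicts 'fail'.
    Returns (best_cut, number_misclassified).
    """
    candidates = sorted(set(pass_scores) | set(fail_scores))
    best_cut, best_errors = None, float("inf")
    for c in candidates:
        errors = (sum(s < c for s in pass_scores)      # passers predicted to fail
                  + sum(s >= c for s in fail_scores))  # failers predicted to pass
        if errors < best_errors:
            best_cut, best_errors = c, errors
    return best_cut, best_errors

# Hypothetical placement-test scores for the two contrasted groups
passing = [55, 60, 62, 70, 75, 80]
failing = [30, 35, 40, 48, 58]
cut, errs = known_groups_cut(passing, failing)
```

The returned cut sits where the two score distributions overlap least, which is the graphical "point of least difference" the method describes.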
   - Problem: determining which groups to compare. In the example, could it be students with grades of 90 and above versus those with grades of 74 and below only? On a depression scale, how depressed should a testtaker be to be included in the groups being compared?

3. IRT-Based Methods
   - Cut scores are set based on the testtaker's performance across all the items on the test; some portion of the total number of items must be scored "correct" (or in a way that indicates the testtaker possesses the target trait or attribute) in order for the testtaker to "pass" the test (or be deemed to possess the targeted trait or attribute).
   - In the IRT framework, each item is associated with a particular level of difficulty. To "pass" the test, the testtaker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score.
   - Examples:
     - A technique that has found application in setting cut scores for licensing examinations is the *item-mapping method*. It entails arranging items in a histogram, with each column containing items according to their established difficulty value. Judges who have been trained regarding the minimal competence required for licensure are presented with sample items from each column and are asked whether a minimally competent licensed individual would answer those items correctly about half the time. If so, that difficulty level is set as the cut score; if not, the process continues until the appropriate difficulty level has been selected. Typically, the process involves several rounds of judgments in which experts may receive feedback on how their ratings compare to those of the other experts.
     - An IRT-based method of setting cut scores that is more typically used in academic applications is the *bookmark method*.
Use of this method begins with training experts regarding the minimal knowledge, skills, and/or abilities that testtakers should possess in order to "pass." After this training, the experts are given a book of items, one item printed per page, with the items arranged in ascending order of difficulty. Each expert then places a "bookmark" between the two pages (that is, the two items) deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not. The bookmark serves as the cut score. Additional rounds of bookmarking with the same or other judges may take place as necessary. Feedback regarding placement may be provided, and discussion among the experts about the bookmark placements may be allowed. In the end, the level of difficulty to use as the cut score is decided by the test developers.
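The arithmetic behind a single round of bookmarking can be sketched as follows. Real bookmark studies involve training, several rounds, feedback, and discussion, and the final choice rests with the test developers; the median rule, function name, and all values below are illustrative assumptions only:

```python
import statistics

def bookmark_cut(item_difficulties, bookmark_positions):
    """Sketch of one round of the bookmark method.

    item_difficulties: item difficulties sorted ascending (the 'book').
    bookmark_positions: for each judge, the index of the first item a
    minimally competent testtaker is NOT expected to answer correctly.
    Here the judges' placements are summarized by the median bookmark;
    the difficulty of the item just below it is taken as the cut score.
    """
    mark = statistics.median_low(bookmark_positions)  # an actual placement
    return mark, item_difficulties[mark - 1]

# Hypothetical book of 7 items, already ordered by difficulty,
# and bookmark placements from three hypothetical judges.
difficulties = [0.8, 1.1, 1.4, 1.9, 2.3, 2.8, 3.4]
marks = [3, 4, 4]
n_required, cut_difficulty = bookmark_cut(difficulties, marks)
```

Using `median_low` rather than the mean guarantees the summary corresponds to a bookmark a judge actually placed, so the cut always falls between two real pages of the book.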
