
Unit 2 BSP 311.pdf




Unit 2: Basic Psychometric Concepts
- Test construction
- Item analysis
- Reliability: meaning and types
- Validity: meaning and types
- Norms

Process of test development
1. Define the test (comprehensive literature review)
2. Scaling and item writing
3. Item analysis
4. Revising the test (items, format, options, cross-validation, feedback from examinees)
5. Publishing the test (technical manual and user manual)

1. Literature Review
Define the test through an operational definition.
Operational definition: a definition that gives an obvious, precise, and communicable meaning to a concept, ensuring that comprehensive knowledge of the idea is measured and applied within a particular set of circumstances. Two important points are:
- How to measure
- Application (context-based application)
Interviews with respondents and focus groups help reach an agreed-upon definition of the construct and generate items at the preliminary stage.

2. Item Writing
- Table of specifications: a specification of all relevant items representing the domains of a variable
- Homogeneous vs. heterogeneous items
- Item difficulty and discrimination
- Format: choice of words, sentence structure, and formatting
- Relevance, clarity, simplicity
- Avoidance of ambiguity

SCALING METHODS
Test developers select a scaling method that is optimally suited to the manner in which they have conceptualized the trait(s) measured by their test:
- Nominal scale
- Ordinal scale
- Interval scale
- Ratio scale

3. Item Analysis
A process for selecting the final set of questions from a large pool of items. It determines which items should be retained, which revised, and which thrown out. The key statistics are item difficulty, the item-reliability index, the item-validity index, and the item-discrimination index.

ITEM DIFFICULTY (Pi)
The proportion of examinees in a large tryout sample who get the item correct. It varies from 0 to 1 (e.g., Pi = .3 is a difficult item, Pi = .5 is of medium difficulty, Pi = .7 is an easy item).
The choice of items depends on the purpose of the test and the type of examinees (normal, talented, or highly intelligent groups).
The optimal level of item difficulty can be computed from the formula (1.0 + g)/2, where g is the chance success level. Thus, for a four-option multiple-choice item, the chance success level is .25, and the optimal level of item difficulty would be (1.0 + .25)/2, or about .63.
If a test is to be used for selection of an extreme group by means of a cutting score, it may be desirable to select items with difficulty levels outside the .3 to .7 range.

ITEM RELIABILITY
A test developer may desire an instrument with a high level of internal consistency in which the items are reasonably homogeneous. A simple way to determine whether an individual item "hangs together" with the remaining test items is to correlate scores on that item with scores on the total test. The point-biserial correlation is used for dichotomously scored (e.g., yes/no, right/wrong) items. The higher the point-biserial correlation riT between an individual item and the total score, the more useful the item is from the standpoint of internal consistency.

ITEM RELIABILITY INDEX
For dichotomous items, the index is the product of the item-total correlation (riT) and the item's standard deviation (Si), i.e., (Si)(riT). The larger an item's reliability index (that is, the greater its dispersion and its correlation with the total score), the more useful the item, because it is better able to discriminate between examinees. By computing the item-reliability index for every item in the preliminary test, we can eliminate the "outlier" items that have the lowest values on this index. Such items possess poor internal consistency or weak dispersion of scores and therefore do not contribute to the goals of measurement.
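Below is a minimal sketch (not part of the original notes) of how the two item statistics above could be computed from a 0/1 scored response matrix; the function names and toy data are illustrative assumptions.

```python
import numpy as np

def item_difficulty(responses):
    """Pi for each item: proportion of examinees who answered correctly."""
    return responses.mean(axis=0)

def item_reliability_index(responses):
    """(Si)(riT): item standard deviation times the item-total (point-biserial) correlation."""
    total = responses.sum(axis=1)              # each examinee's total test score
    s_i = responses.std(axis=0)                # standard deviation of each item
    r_iT = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                     for j in range(responses.shape[1])])
    return s_i * r_iT

# Toy tryout sample: 6 examinees x 4 dichotomously scored items (1 = correct).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])
print("Item difficulty (Pi):", item_difficulty(scores))
print("Item-reliability index:", item_reliability_index(scores))

g = 0.25                                       # chance success level for a 4-option item
print("Optimal difficulty:", (1.0 + g) / 2)    # (1.0 + g)/2 = 0.625
```

Items whose reliability index falls well below the rest would be candidates for revision or removal, per the "outlier" rule described above.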
ITEM VALIDITY INDEX
It is important that a test possess the highest possible concurrent or predictive validity. The higher the point-biserial correlation (riC) between scores on an individual item and the criterion score, the more useful the item is from the standpoint of predictive validity. The index reflects the concurrent and predictive validity of an item.
- Point-biserial correlation between the item score and the score on the criterion variable = (riC)
- Item-validity index = (Si)(riC)
The index is useful for identifying ineffective items that are unable to predict the criterion.

ITEM CHARACTERISTIC CURVE
An item characteristic curve (ICC) is a graphical display of the relationship between the probability of a correct response and the examinee's position on the underlying trait measured by the test. A good item has a positive ICC slope. If the ability to solve a particular item is normally distributed, the ICC will resemble a normal ogive (curve a in Figure 4.8). The desired shape of the ICC depends on the purpose of the test. The underlying theory of the ICC is also known as item response theory or latent trait theory.

ITEM DISCRIMINATION
An effective test item is one that discriminates between high scorers and low scorers on the entire test. An item-discrimination index is a statistical index of how efficiently an item discriminates between persons who obtain high and low scores on the entire test. The item-discrimination index for a test item is calculated from the formula d = (U - L)/N, where U is the number of examinees in the upper range who answered the item correctly, L is the number of examinees in the lower range who answered the item correctly, and N is the total number of examinees in the upper or lower range (a sketch of this computation follows).
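As a rough illustration of the discrimination index just defined, here is a small Python sketch (an assumption, not from the source); the common convention of taking the top and bottom 27 percent of scorers as the upper and lower groups is also an assumption.

```python
import numpy as np

def discrimination_index(item_correct, total_scores, fraction=0.27):
    """d = (U - L) / N, using the top and bottom `fraction` of total scorers."""
    n = len(total_scores)
    k = max(1, int(round(fraction * n)))   # N: size of each extreme group
    order = np.argsort(total_scores)       # examinees sorted by total score
    lower, upper = order[:k], order[-k:]
    U = item_correct[upper].sum()          # correct answers in the upper group
    L = item_correct[lower].sum()          # correct answers in the lower group
    return (U - L) / k

# Toy example: one item (1 = correct) and total test scores for 10 examinees.
item = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
totals = np.array([48, 45, 44, 40, 38, 35, 30, 47, 22, 20])
print("d =", discrimination_index(item, totals))   # 1.0 here: a highly discriminating item
```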
4. Revising the Test
The purpose of item analysis, discussed previously, is to identify unproductive items in the preliminary test so that they can be revised, eliminated, or replaced. Collect new data from a second tryout sample and repeat the item-analysis process anew.
- Cross validation: the practice of using the original regression equation in a new sample to determine whether the test predicts the criterion as well as it did in the original sample. Items are cross-validated in the second tryout sample.
- Validity shrinkage: a common discovery in cross-validation research is that a test predicts the relevant criterion less accurately with the new sample of examinees than with the original tryout sample.
- Feedback from examinees.

5. Publishing the Test
- Production of testing materials
- Technical manual and user's manual

Validity
➔ Validity defines the meaning of test scores.
➔ Validity refers to what a test score means.
➔ The validity of a test is the extent to which it measures what it claims to measure.
➔ A new instrument must fulfill the purpose for which it is designed.
➔ A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful (Standards for Educational and Psychological Testing, 1999).
➔ Appropriate means correct; meaningful means the score should be interpreted using a psychometrically sound manual; and the interpretation must be useful/significant.
➔ Traditionally, the different ways of accumulating validity evidence have been grouped into three categories: content validity, criterion-related validity, and construct validity.

Content Validity
Content validity is determined by the degree to which the questions, tasks, or items on a test are representative of the universe of behavior the test was designed to sample.
➔ If the sample (specific items on the test) is representative of the population (all possible items), then the test possesses content validity.
➔ Is the content of the items, taken as a whole, related to the domain being measured?
➔ Content validity is more difficult to assure when the test measures an ill-defined trait.
➔ One index: content validity = D / (A + B + C + D), where D is the number of items that two expert judges both rate as strongly relevant, and A, B, and C are the remaining cells of the two-judge rating table.
➔ Face validity: a test has face validity if it looks valid to test users, examiners, and especially the examinees. Face validity is really a matter of social acceptability and not a technical form of validity in the same category as content, criterion-related, or construct validity (Nevo, 1985).

Criterion-Related Validity
Criterion-related validity is demonstrated when a test is shown to be effective in estimating an examinee's performance on some outcome measure. In this context, the variable of primary interest is the outcome measure, called a criterion.
➔ The test is compared against an already established criterion measure.
➔ Concurrent validity: the criterion measures are obtained at approximately the same time as the test scores.
➔ Predictive validity: the criterion measures are obtained in the future, usually months or years after the test scores are obtained.

Construct Validity
A construct is a theoretical, intangible quality or trait in which individuals differ. Construct validity refers to the appropriateness of inferences about the underlying construct. Does the test relate to an underlying theory? How accurately has the theory been translated into the actual measure? Example: intelligence is an underlying construct.

RELIABILITY
Reliability refers to the attribute of consistency in measurement. Reliability refers to the consistency of scores obtained by the same individual when re-examined with the test on different occasions, with different sets of equivalent items, or under other variable examining conditions (Anastasi & Urbina, 1997). Reliability refers to the self-correlation of the test, that is, the consistency of scores from one set of measurements to another.

Classical Test Theory
Also called the theory of true and error scores (Charles Spearman). The basic starting point of the classical theory of measurement is the idea that test scores result from the influence of two factors:
1. Factors that contribute to consistency. These consist entirely of the stable attributes of the individual, which the examiner is trying to measure.
2. Factors that contribute to inconsistency. These include characteristics of the individual, test, or situation that have nothing to do with the attribute being measured but that nonetheless affect test scores.
X = T + e, where X is the obtained score, T is the true score, and e is the error.
❖ Unsystematic measurement error: its effects are unpredictable and inconsistent.
❖ Systematic measurement error: arises when, unknown to the test developer, a test consistently measures something other than the trait for which it was intended.

TEST-RETEST RELIABILITY
Estimated by administering the identical test twice to the same group of heterogeneous and representative subjects. If the test is perfectly reliable, each person's second score will be completely predictable from his or her first score. The two sets of scores, when correlated, give the value of the reliability coefficient, also known as the temporal stability coefficient (a sketch follows).
Disadvantages:
1. Time consuming.
2. Assumes that the examinee's physical and psychological state remains unchanged across both testing situations.
Sources of error variance: time sampling, maturational changes.
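A minimal sketch (illustrative, not from the notes) of estimating test-retest reliability as the Pearson correlation between two administrations; the score vectors are made-up data.

```python
import numpy as np

# Scores for the same group of examinees on two occasions (made-up data).
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])
time2 = np.array([14, 17, 27, 29, 21, 16, 30, 19])

# The test-retest (temporal stability) coefficient is the Pearson correlation
# between the two sets of scores.
r_tt = np.corrcoef(time1, time2)[0, 1]
print("Test-retest reliability:", round(r_tt, 3))
```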
ALTERNATE-FORMS RELIABILITY
Test developers produce two forms of the same test. These alternate forms are independently constructed to meet the same specifications, often on an item-by-item basis. Thus, alternate forms of a test incorporate similar content and cover the same range and level of difficulty in items. Alternate forms of a test possess similar statistical and normative properties. Estimates of alternate-forms reliability are derived by administering both forms to the same group and correlating the two sets of scores.
Disadvantages:
1. Item-sampling differences become a source of error variance.
2. Quite expensive.

INTERNAL CONSISTENCY RELIABILITY
In this method, the psychometrician seeks to determine whether the test items tend to show a consistent interrelatedness. Ways of estimating internal consistency reliability include (see the sketch after these methods):
1. Split-half reliability
2. Coefficient alpha
3. The Kuder-Richardson estimate of reliability
4. Interscorer reliability

1. Split-Half Reliability
We obtain an estimate of split-half reliability by correlating the pairs of scores obtained from equivalent halves of a test administered only once to a representative sample of examinees. Split-half approaches often yield higher estimates of reliability than test-retest approaches. The major challenge with split-half reliability is dividing the test into two nearly equivalent halves. The half-test correlation is then stepped up to a full-test estimate using the Spearman-Brown formula: full-test reliability = 2r / (1 + r), where r is the half-test correlation.
Disadvantage: it lacks precision.

2. Coefficient Alpha (Cronbach, 1951)
Coefficient alpha may be thought of as the mean of all possible split-half coefficients, corrected by the Spearman-Brown formula. The formula for coefficient alpha is α = [k/(k - 1)] × [1 - (Σ si² / sX²)], where k is the number of items, si² is the variance of item i, and sX² is the variance of total scores.

3. The Kuder-Richardson Estimate of Reliability
Their formula, applicable to tests with dichotomously scored items, is generally referred to as Kuder-Richardson formula 20 or, simply, KR-20.

4. Interscorer Reliability
A sample of tests is independently scored by two or more examiners, and the scores for pairs of examiners are then correlated. Interscorer reliability supplements other reliability estimates but does not replace them.
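The sketch below (an illustration using assumed toy data, not part of the source) shows split-half reliability with the Spearman-Brown correction and coefficient alpha; for 0/1 items, alpha is numerically equivalent to KR-20.

```python
import numpy as np

def spearman_brown(r_half):
    """Full-test reliability estimated from the half-test correlation r."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(scores):
    """alpha = [k/(k-1)] * [1 - (sum of item variances / variance of total scores)]."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

# Toy data: 6 examinees x 4 dichotomously scored items.
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

# Split-half: correlate odd-item and even-item half scores, then step up.
odd_half = scores[:, ::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
print("Split-half (Spearman-Brown corrected):", round(spearman_brown(r_half), 3))
print("Coefficient alpha (= KR-20 for 0/1 items):", round(cronbach_alpha(scores), 3))
```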
NORMS
A norm group consists of a sample of examinees who are representative of the population for whom the test is intended. The essential objective of test standardization is to determine the distribution of raw scores in the norm group so that the test developer can publish derived scores known as norms. Norms come in many varieties, for example, percentile ranks, age equivalents, grade equivalents, or standard scores. In general, norms indicate an examinee's standing on the test relative to the performance of other persons of the same age, grade, sex, and so on.

Raw scores
Apply the following statistical concepts to the raw scores:
1. Frequency distribution
2. Measures of central tendency: mean, median, and mode
3. Measures of variability: standard deviation
4. Normal distribution curve
5. Skewness

Raw scores are transformed to the following derived scores (see the sketch at the end of this section):
1. Percentiles and percentile ranks: a percentile expresses the percentage of persons in the standardization sample who scored below a specific raw score. They depict an individual's position with respect to the sample. Drawback: they distort the underlying measurement scale, especially at the extremes.
2. Standard scores: a standard score uses the standard deviation of the total distribution of raw scores as the fundamental unit of measurement. The standard score expresses the distance from the mean in standard deviation units.
3. T scores and other standardized scores: standardized scores are always expressed as positive whole numbers. One popular kind of standardized score is the T score, which has a mean of 50 and a standard deviation of 10. T-score scales are especially common with personality tests.
4. Normalized standard scores.
5. Stanine scale: in a stanine scale, all raw scores are converted to a single-digit system of scores ranging from 1 to 9. The mean of stanine scores is always 5, and the standard deviation is approximately 2. The transformation from raw scores to stanines is simple: the scores are ranked from lowest to highest, the bottom 4 percent of scores convert to a stanine of 1, the next 7 percent convert to a stanine of 2, and so on.
- Stens: Canfield (1951) proposed the 10-unit sten scale, with 5 units above and 5 units below the mean.
- C scale: Guilford and Fruchter (1978) proposed the C scale, consisting of 11 units.

Age norms
An age norm depicts the level of test performance for each separate age group in the normative sample. The purpose of age norms is to facilitate same-aged comparisons. With age norms, the performance of an examinee is interpreted in relation to standardization subjects of the same age.

Grade norms
A grade norm depicts the level of test performance for each separate grade in the normative sample. Grade norms are rarely used with ability tests. However, these norms are especially useful in school settings when reporting the achievement levels of schoolchildren.

Local norms
Local norms are derived from representative local examinees, as opposed to a national sample.

Subgroup norms
Subgroup norms consist of the scores obtained from an identified subgroup (e.g., African Americans, Hispanics, females), as opposed to a diversified national sample. Subgroups can be formed with respect to sex, ethnic background, geographical region, urban versus rural environment, socioeconomic level, and many other factors.

Expectancy table
One practical form that norms may take is an expectancy table. An expectancy table portrays the established relationship between test scores and expected outcome on a relevant task (Harmon, 1989). Expectancy tables are especially useful with predictor tests used to forecast well-defined criteria. For example, an expectancy table could depict the relationship between scores on a scholastic aptitude test (predictor) and subsequent college grade point average (criterion).
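To tie the norms section together, here is a small Python sketch (illustrative assumptions throughout, including the made-up raw scores and the percentile-rank-based stanine mapping) converting raw scores to percentile ranks, T scores, and stanines.

```python
import numpy as np

def percentile_ranks(raw):
    """Percentage of the norm group scoring below each raw score."""
    raw = np.asarray(raw, dtype=float)
    return np.array([(raw < x).mean() * 100 for x in raw])

def t_scores(raw):
    """Standard (z) scores rescaled to a mean of 50 and an SD of 10."""
    raw = np.asarray(raw, dtype=float)
    z = (raw - raw.mean()) / raw.std()
    return 50 + 10 * z

def stanines(raw):
    """Map percentile ranks onto the 4-7-12-17-20-17-12-7-4 percent stanine bands."""
    cuts = np.cumsum([4, 7, 12, 17, 20, 17, 12, 7])   # cumulative percent boundaries
    return np.searchsorted(cuts, percentile_ranks(raw), side="right") + 1

# Made-up norm-group raw scores.
raw = np.array([22, 35, 41, 47, 52, 58, 63, 70, 78, 85])
print("Percentile ranks:", percentile_ranks(raw))
print("T scores:", np.round(t_scores(raw), 1))
print("Stanines:", stanines(raw))
```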
