Psychological Testing and Assessment Notes PDF

Summary

These comprehensive notes cover various aspects of psychological assessment and testing, including scales of measurement, reliability, validity, and different types of tests used in educational, clinical, counseling, and business settings. Key figures and historical developments in psychological testing are also mentioned, alongside statistical methods used in psychometrics.

Full Transcript

Psychological Assessment Psychological Testing and History Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) Testing and Assessment Asse...

Psychological Assessment Psychological Testing and History Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) Testing and Assessment Assessor is the key to the process of selecting o 1905: Alfred Binet published a test designed to tests and/or other tools of evaluation help place Paris school children in appropriate Requires an educated selection of tools of classes evaluation, skill in evaluation, and thoughtful o Testing – refer to everythingꟷfrom organization and integration of data administration of test to interpretation of the test Entails logical problem-solving that brings to bear scores many sources of data assigned to answer the ▪ Once used to describe the group of screening referral question individuals of thousands of military recruits o Test – measuring device or procedure o Psychological Assessment – gathering and o Psychological Test – device or procedure integration of psychology-related data for the designed to measure variables related to purpose of making psychological evaluation psychology ▪ Educational – evaluate abilities and skills ▪ Content – subject matter relevant in school context ▪ Format – form, plan, structure, ▪ Retrospective – draw conclusions about arrangement, layout psychological aspects of a person as they ▪ Item – a specific stimulus to which a person existed at some point in time prior to the responds overtly and this response is being assessment scored or evaluated ▪ Remote – subject is not in physical proximity ▪ Administration Procedures – one-to-one to the person conducting the evaluation basis or group administration ▪ Ecological Momentary – “in the moment” ▪ Score – code or summary of statement, evaluation of specific problems and related usually but not necessarily numerical in cognitive and behavioral variables at the very nature, but reflects an evaluation of time and place that they occur performance on a test ▪ Collaborative – the assessor and assesee ▪ Scoring – the process of assigning scores to may work as “partners” from initial contact performances through final feedback ▪ Cut-Score – reference point derived by ▪ Therapeutic – therapeutic self-discovery and judgement and used to divide a set of data new understanding are encouraged into two or more classification ▪ Dynamic – describe interactive approach to ▪ Psychometric Soundness – technical quality psychological assessment that usually ▪ Psychometrics – science of psychological follows the model: evaluation > intervention measurement of some sort > evaluation ▪ Psychometrist or Psychometrician – refer to o Psychological Testing – process of measuring professional who uses, analyzes, and psychology-related variables by means of interprets psychological data devices or procedures designed to obtain a o Achievement Test – measurement of the sample of behavior previous learning Testing o Aptitude – refers to the potential for learning or Usually numerical in nature acquiring a specific skill Could be individual or by group in administration o Intelligence – refers to a person’s general Test administrators can be interchangeable potential to solve problems, adapt to changing without affecting the evaluation environments, abstract thinking, and profit from Requires technician-like skills in terms of experience administration and scoring o Human Ability – considerable overlap of Yield a test score or a series of test score achievement, aptitude, and intelligence test Assessment o Structured Personality tests – provide Answers the referral question through the use of statement, usually self-report, and require the different tools of evaluation subject to choose between two or more Administered individually alternative responses Psychological Assessment Psychological Testing and History Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) o Projective Personality Tests – unstructured, and b. extent to which they understand and agree to the stimulus or response are ambiguous the rationale for the assessment o Interview – method of gathering information c. capacity and willingness to cooperate through direct communication involving d. amount of physical or emotional distress reciprocal exchange e. amount of physical discomfort ▪ Panel Interview (Board Interview) – more f. alertness level than one interviewer participates in the g. predisposed to agree or disagree when assessment presented with stimulus statements ▪ Motivational Interview – used by counselors h. received prior coaching and clinicians to gather information about i. portraying themselves in good or bad light some problematic behavior, while j. “luckiness” or have “bad luck” on multiple- simultaneously attempting to address it choice achievement test therapeutically o Psychological Autopsy – on the basis of archival o Portfolio – samples of one’s ability and records, artifacts, and interviews previously accomplishment conducted with the deceased assess or people o Case History Data – refers to records, who knew him or her transcripts, and other accounts in written, o Other parties: organizations, companies, pictorial, or other form that preserve archival government that could sponsor the information, official and informal accounts, and development of the test other data and items relevant to an assessee What? ▪ Case study – a report or illustrative account Educational Setting concerning a person or an event that was o Achievement Test – evaluates accomplishment compiled on the basis of case history data or the degree of learning that has taken place ▪ Groupthink – result of the varied forces that o Diagnostic Test – refers to a tool of assessment drive decision-makers to reach a consensus used to help narrow down and identify areas of o Behavioral Observation – monitoring of actions deficit to be targeted for intervention of others or oneself by visual or electronic ▪ Diagnosis – description or conclusion means while recording quantitative and/or reached on the basis of evidence and opinion qualitative information regarding those actions o Informal Evaluation – nonsystematic ▪ Naturalistic Observation – observe humans assessment that leads to the formation of an in natural setting opinion or attitude o Role Play – defined as acting an improvised or Clinical Settings partially improvised part in a stimulated o Used to help screen for or diagnose behavior situation problems ▪ Role Play Test – assesses are directed to act o Tests could be intelligence tests, personality as if they are in a particular situation tests, neuropsychological tests, or other o Other tools include: computer, physiological specialized instruments devices (biofeedback devices) o Usually individualized Who, What, Why, How, and Where? Counseling Settings Who? o May occur in environments as diverse as school, o Test Developers – create tests or other methods prisons, and governmental or privately owned of assessment institutions o Test User – clinicians, counselors, o Goal: improve the client in terms of adjustment, psychologists, HR personnel, consumer productivity, or some related variables psychologists, experimental psychologists, and Geriatric Settings social psychologists o Quality of Life – variables related to perceived o Test taker – taking the test stress, loneliness, sources of satisfaction, o Test takers vary in terms of: personal values, quality of living conditions, and a. amount of test anxiety quality of friendships and other social support Psychological Assessment Psychological Testing and History Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) o Dementia – loss of cognitive functioning that assessee or by means of alternative methods occurs as the result of damage to or loss of brain designed to measure same variables cells o Accommodation – adaptation of a test, o Pseudodementia – a severe depression that procedure, or situation of one test for another, to mimics dementia make the assessment more suitable for an Business and Military Settings assessee with an exceptional needs o A wide range of achievement, aptitude, interest, Where? motivational, and other tests may be employed in o Test Catalogues – contain only a brief the decision to hire as well as in related description of the test and seldom contain the decisions regarding promotions, transfers, job kind of detailed technical information that a satisfaction, etc. prospective user might require o Psychological Tests involves the engineering o Test Manuals – detailed information concerning and design of products and environment the development of a particular test and o Can work in marketing to help “diagnose” what technical information relating to it should be can be improved with the brand found in the test manual Governmental and Organizational Credentialing o Others: professional books, journals, online o Governmental licensing, certification, or general databases credentialing of professionals History Academic Research Settings o Testing programs was first held in China as early o Conducting any sort of research typically entails as 2200 B.C.E. for Civil Service measurement of some kind, and any o 1733: Abraham De Moivre introduced the basic academician who ever hopes to publish notion of sampling error research should ideally have a sound knowledge o 1859: Charles Darwin argued that chance of measurement principles and tools of variation in species would be selected or assessment rejected by nature according to adaptivity and Other Settings survival value o Judiciary, program evaluation ▪ Humans had descended from the Ape as a How? result of such genetic variations o Test users should only use tests that are o 1869: Francis Galton explored and quantify necessary and appropriate for the individual individual differences of people being tested ▪ Classified people according to the “natural o Test user must be prepared and suitably trained gifts” and to ascertain their “deviation from an to administer the test properly average” o Protocol – refers to the form or sheet or booklet ▪ Pioneered the use of a statistical concept on which a testtaker’s responses are entered central to psychological experimentation and o Rapport – working relationship between the testing: the coefficient of correlation examiner and the examinee o Karl Pearson developed Product-Moment o Test users who have responsibility for Correlation Technique interpreting scores or other test results have an o First experimental psychology laboratory was obligation to do so in accordance with founded by Wilhelm Wundt in Germany established procedures and ethical guidelines ▪ Wundt focused on how similar people are and o Alternate Assessment – for children who, as a viewed individual differences as frustrating result of a disability, could not otherwise source of error participate in state- and district-wide o James McKeen Cattell – coined the term Mental assessments Test ▪ An evaluative or diagnostic procedure or o Charles Spearman – originated the concept of process that varies from the usual, test reliability as well as building mathematical customary, or standardized way a framework for the statistical technique of factor measurement is derived, either by virtue of analysis some special accommodation made to the Psychological Assessment Psychological Testing and History Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) o Victor Henri – collaborated with Alfred Binet on o Hermann Rorschach – developed Rorschach papers suggesting how mental tests could be Inkblot test used to measure higher mental processes o Henry Murray & Christiana Morgan – developed o Emil Kraepelin – early experimentation with the Thematic Apperception Test word association technique as a formal test o 1943: Minnesota Multiphasic Personality ▪ One of the founding founders of modern Inventory was published psychiatry ▪ to use empirical methods to determine the ▪ Classification and diagnosis of mental meaning of a test response disorders o Factor Analysis – method of finding the minimum ▪ Dementia Praecox number of dimensions (factors) to account for a o Lightner Witmer – “Little know founder of large number of variables Clinical Psychology” o J.R. Guilford – made the first serious attempt to ▪ Founded the first psychological clinic in US use factor analytic technique in the development o 1895: Alfred Binet and Victor Henri published of a structured personality test several articles in which they argued for the o Raymond Cattell – introduced 16PF measurement of abilities o Beginning of 1980s, several major branches of o 1905: Alfred Binet and Theodore Simon published applied psychology emerged such as the first intelligence test designed to help neuropsych, health psych, forensic psych, and identify Paris schoolchildren with ID child psych ▪ Considered standardization sample end ▪ Representative Sample – one that comprises individuals similar to those for whom the test is to be used ▪ 1908: Mental Age was determined ▪ L.M. Terman revised Binet’s test for US use o 1939: David Wechsler introduced Adult Intelligence Test ▪ Intelligence was the aggregate or the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment (Weschler, 1939) o Binet devised his intelligence test into group intelligence test in response to the US Military’s need for screening of recruits for WWI o Lewis M. Terman, Robert M. Yerkes, and others developed Army Tests for recruits ▪ Army Alpha – for literate folks ▪ Army Beta – for illiterate folks o Robert Woodworth was assigned the task of developing a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits ▪ Disguised as Personal Data sheet o Woodworth Psychoneurotic Inventory – first self-report measure of personality to identify soldiers at risk for shell shock o Projective Test – one in which an individual is assumed to project onto some ambiguous stimulus his or her own unique feelings Psychological Assessment Statistics Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) Scales of Measurement 4. Ratio – has true zero point (if the score is zero, it o Measurement – the act of assigning numbers or means none/null) symbols to characteristics of things according to ▪ Easiest to manipulate rules Describing Data o Descriptive Statistics – methods used to provide o Distribution – defined as a set of test scores concise description of a collection of quantitative arrayed for recording or study information o Raw Scores – straightforward, unmodified o Inferential Statistics – method used to make accounting of performance that is usually inferences from observations of a small group of numerical people known as sample to a larger group of o Frequency Distribution – all scores are listed individuals known as population alongside the number of times each score o Magnitude – the property of “moreness” occurred o Equal Intervals – the difference between two o Independent Variable – being manipulated in the points at any place on the scale has the same study meaning as the difference between two other o Quasi-Independent Variable – nonmanipulated points that differ by the same number of scale variable to designate groups units ▪ Factor – for ANOVA o Absolute 0 – when nothing of the property being o Post-Hoc Tests – used in ANOVA to determine measured exists which mean differences are significantly o Scale – a set of numbers who properties model different empirical properties of the objects to which the o Tukey’s HSD test – allows the compute a single numbers are assigned value that determines the minimum difference ▪ Continuous Scale – takes on any value within between treatment means that is necessary for the range and the possible value within that significance range is infinite Measures of Central Tendency ▪ Discrete Scale – can be counted; has distinct, o Measures of Central Tendency – statistics that countable values indicates the average or midmost score o Error – refers to the collective influence of all the between the extreme scores in a distribution factors on a test score or measurement beyond ▪ Goal: Identify the most typical or those specifically measured by the test or representative of entire group measurement o Mean – the average of all the raw scores ▪ Measurement with continuous scale always ▪ Equal to the sum of the observations divided involve with error by the number of observations o Four Levels of the scales of Measurement: ▪ Interval and ratio data (when normal 1. Nominal – involve classification or distribution) categorization based on one or more ▪ Point of least squares distinguishing characteristics ▪ Balance point for the distribution ▪ Label and categorize observations but do not o Median – the middle score of the distribution make any quantitative distinctions between ▪ Ordinal, Interval, Ratio observations ▪ Useful in cases where relatively few scores ▪ mode fall at the high end of the distribution or 2. Ordinal – rank ordering on some characteristics relatively few scores fall at the low end of the is also permissible distribution ▪ median ▪ In other words, for extreme scores, use 3. Interval – contains equal intervals, has no median absolute zero point (even negative values have ▪ Identical for sample and population interpretation to it) ▪ Also used when there has an unknown or ▪ Zero value does not mean it represents none undetermined score Psychological Assessment Statistics Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) ▪ Used in “open-ended” categories (e.g., 5 or more, more than 8, at least 10) ▪ For ordinal data o Mode – most frequently occurring score in the distribution ▪ Bimodal Distribution – if there are two scores that occur with highest frequency ▪ Not commonly used ▪ Useful in analyses of qualitative or verbal nature ▪ For nominal scales, discrete variables ▪ Value of the mode gives an indication of the shape of the distribution as well as a measure of central tendency Measures of Variability o Symmetrical Distribution – right side of the o Variability – an indication how scores in a graph is mirror image of the left side distribution are scattered or dispersed ▪ Has only one mode and it is in the center of the o Measures of Variability – statistics that describe distribution the amount of variation in a distribution ▪ Mean = median = mode o Range – equal to the difference between highest o Skewness – nature and extent to which and the lowest score symmetry is absent ▪ Provides a quick but gross description of the o Positive Skewed – few scores fall the high end of spread of scores the distribution ▪ When its value is based on extreme scores of ▪ The exam is difficult the distribution, the resulting description of ▪ More items that was easier would have been variation may be understated or overstated desirable in order to better discriminate at o Quartile – dividing points between the four the lower end of the distribution of test scores quarters in the distribution ▪ Specific point ▪ Quarter – refers to an interval ▪ Interquartile Range – measure of variability equal to the difference between Q3 and Q1 ▪ Semi-interquartile Range – equal to the interquartile range divided by 2 o Standard Deviation – equal to the square root of the average squared deviations about the mean ▪ Equal to the square root of the variance ▪ Variance – equal to the arithmetic mean of the ▪ Mean > Median > Mode squares of the differences between the o Negative Skewed – when relatively few of the scores in a distribution and their mean scores fall at the low end of the distribution ▪ Distance from the mean ▪ The exam is easy Normal Curve ▪ More items of a higher level of difficulty would o Also known as Gaussian Curve make it possible to better discriminate o Bell-shaped, smooth, mathematically defined between scores at the upper end of the curve that is highest at its center distribution o Asymptotically = approaches but never touches the axis o Tail – 2 – 3 standard deviations above and below the mean Psychological Assessment Statistics Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) ▪ Raw score that fell in the mean has T of 50 ▪ Raw score 5 standard deviations about the mean would be equal to a T of 100 ▪ No negative values ▪ Used when the population or variance is unknown o Stanine – a method of scaling test scores on a nine-point standard scale with a mean of five (5) and a standard deviation of two (2) o Linear Transformation – one that retains a direct ▪ Mean < Median < Mode numerical relationship to the original raw score o Skewed is associated with abnormal, perhaps o Nonlinear Transformation – required when the because the skewed distribution deviates from data under consideration are not normally the symmetrical or so-called normal distributed distribution o Normalizing the distribution involves stretching o Kurtosis – steepness if a distribution in its center the skewed curve into the shape of a normal ▪ Platykurtic – relatively flat curve and creating a corresponding scale of ▪ Leptokurtic – relatively peaked standard scores, a scale that is technically ▪ Mesokurtic – somewhere in the middle referred to as Normalized Standard Score Scale o Generally preferrable to fine-tune the test according to difficulty or other relevant variables so that the resulting distribution will approximate the normal curve o STEN – standard to ten; divides a scale into 10 units ▪ High Kurtosis = high peak and fatter tails ▪ Lower Kurtosis = rounded peak and thinner tails Standard Scores o Standard Score – raw score that has been converted from one scale to another scale Hypothesis Testing o Z-Scores – results from the conversion of a raw o Statistical method that uses a sample data to score into a number indicating how many SD evaluate a hypothesis about a population units the raw score is below or above the mean o Alternative Hypothesis – states there is a of the distribution change, difference, or relationships ▪ Identify and describe the exact location of o Null Hypothesis – no change, no difference, or no each score in a distribution relationship ▪ Standardize an entire distribution o Alpha Level or Level of Significance – used to ▪ Zero plus or minus one scale define concept of “very unlikely” in a hypothesis ▪ Have negative values test ▪ Requires that we know the value of the o Critical Region – composed of extreme values variance to compute the standard error that are very unlikely to be obtained if the null o T-Scores – a scale with a mean set at 50 and a hypothesis is true standard deviation set at 10 o If sample data fall in the critical region, the null ▪ Fifty plus or minus 10 scale hypothesis is rejected ▪ 5 standard deviation below the mean would be equal to a t-score of 0 Psychological Assessment Statistics Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) o The alpha level for a hypothesis test is the ▪ Coefficient of Determination – an indication of probability that the test will lead to a Type I error how much variance is shared by the X- and Y- o Directional Hypothesis Test or One-Tailed Test – variables statistical hypotheses specify either an increase o Spearman Rho/Rank-Order Correlation or a decrease in the population mean Coefficient/Rank-Difference Correlation o T-Test – used to test hypotheses about an Coefficient – frequently used if the sample size is unknown population mean and variance small and when both sets of measurement are in ▪ Can be used in “before and after” type of ordinal research ▪ Developed by Charles Spearman ▪ Sample must consist of independent Pearson R observationsꟷthat is, if there is not Interval/ratio + interval/ratio consistent, predictable relationship between Two continuous variables the first observation and the second Spearman Rho ▪ The population that is sampled must be Ordinal + Ordinal normal Point-Biserial Coefficient ▪ If not normal distribution, use a large sample Dichotomous Two nominal + continuous variable (interval/ratio) Kendall’s Coefficient 3 or more rank/ordinals (ratings) Ordinal + ordinal + ordinal Correlation and Inference Phi or Fourfold Coefficient o Correlation Coefficient – number that provides Nominal + nominal us with an index of the strength of the All Dichotomous variables relationship between two things Rank Biserial Correlation o Correlation – an expression of the degree and direction of correspondence between two things Nominal + Ordinal (Rating) ▪ + & - = direction Tetrachoric Correlation R ▪ Number anywhere to -1 to 1 = magnitude Continuous + Continuous but both are measured ▪ Positive – same direction, either both going as Nominal (e.g., Passed or Not Passed, rather up or both going down than grades itself) ▪ Negative – Inverse Direction, either DV is up o Outlier – extremely atypical point located at a and IV goes down or IV goes up and DV goes relatively long distance from the rest of the down coordinate points in a scatterplot ▪ 0 = no correlation o Regression Analysis – used for prediction ▪ Predict the values of a dependent or response variable based on values of at least one independent or explanatory variable ▪ Residual – the difference between an observed value of the response variable and the value of the response variable predicted from the regression line ▪ The Principle of Least Squares ▪ Standard Error of Estimate – standard deviation of the residuals in regression o Pearson r/Pearson Correlation analysis Coefficient/Pearson Product-Moment ▪ Slope – determines how much the Y variable Coefficient of Correlation – used when two changes when X is increased by 1 point variables being correlated are continuous and o T-Test (Independent) – comparison or linear determining differences ▪ Devised by Karl Pearson Psychological Assessment Statistics Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) ▪ 2 different groups/independent samples + interval/ratio scales (continuous varriables) ▪ Equal Variance – 2 groups are equal ▪ Unequal Variance – groups are unequal o T-test (Dependent)/Paired Test – two groups nominal (either matched or repeated measures) + continuous scales o One-Way ANOVA – 3 or more IV, 1 DV comparison of differences o Two-Way ANOVA – 2 IV, 1 DV o Critical Value – reject the null and accept the alternative if [ obtained value > critical value ] o P-Value (Probability Value) – reject null and accept alternative if [ p-value < alpha level ] Norms o Norms – refer to the performances by defined groups on a particular tests o Age-Related Norms – certain tests have different normative groups for particular age groups o Tracking – tendency to stay at about the same level relative to one’s peers o Norm-Referenced Tests – compares each person with the norm o Criterion-Referenced Tests – describes specific types of skills, tasks, or knowledge that the test taker can demonstrate end Psychological Assessment Assumptions about Psychological Testing and Assessment Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) Assumption 1: Psychological Traits and States Exist provide insight to it, to gauge the strength of that o Trait – any distinguishable, relatively enduring trait way in which one individual varies from another o Measuring traits and states means of a test ▪ Permit people predict the present from the entails developing not only appropriate tests past items but also appropriate ways to score the test ▪ Characteristic patterns of thinking, feeling, and interpret the results and behaving that generalize across similar o Cumulative Scoring – assumption that the more situations, differ systematically between the testtaker responds in a particular direction individuals, and remain rather stable across keyed by the test manual as correct or time consistent with a particular trait, the higher that ▪ Psychological Trait – intelligence, specific testtaker is presumed to be on the targeted intellectual abilities, cognitive style, ability or trait adjustment, interests, attitudes, sexual Assumption 3: Test-Related Behavior Predicts Non- orientation and preferences, Test-Related Behavior psychopathology, etc. o The tasks in some tests mimics the actual o States – distinguish one person from another behaviors that the test user is attempting to but are relatively less enduring understand ▪ Characteristic pattern of thinking, feeling, o Such tests only yield a sample of the behavior and behaving in a concrete situation at a that can be expected to be emitted under nontest specific moment in time conditions ▪ Identify those behaviors that can be Assumption 4: Test and Other Measurement controlled by manipulating the situation Techniques have strengths and weaknesses o Psychological Traits exists as construct o Competent test users understand and ▪ Construct – an informed, scientific concept appreciate the limitations of the test they use as developed or constructed to explain a well as how those limitations might be behavior, inferred from overt behavior compensated for by data from other sources ▪ Overt Behavior – an observable action or the Assumption 5: Various Sources of Error are part of product of an observable action the Assessment Process o Trait is not expected to be manifested in behavior o Error – refers to something that is more than 100% of the time expected; it is component of the measurement o Whether a trait manifests itself in observable process behavior, and to what degree it manifests, is ▪ Refers to a long-standing assumption that presumed to depend not only on the strength of factors other than what a test attempts to the trait in the individual but also on the nature of measure will influence performance on the the action (situation-dependent) test o Context within which behavior occurs also plays ▪ Error Variance – the component of a test a role in helping us select appropriate trait terms score attributable to sources other than the for observed behaviors trait or ability measured o Definition of trait and state also refer to a way in o Potential Sources of error variance: which one individual varies from another 1. Assessors o Assessors may make comparisons among 2. Measuring Instruments people who, because of their membership in 3. Random errors such as luck some group or for any number of other reasons, o Classical Test Theory – each testtaker has true are decidedly not average score on a test that would be obtained but for the Assumption 2: Psychological Traits and States can be action of measurement error Quantified and Measured Assumption 6: Testing and Assessment can be o Once the trait, state or other construct has been conducted in a Fair and Unbiased Manner defined to be measured, a test developer o Despite best efforts of many professionals, consider the types of item content that would fairness-related questions and problems do occasionally rise Psychological Assessment Assumptions about Psychological Testing and Assessment Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) o In al questions about tests with regards to testtakers in a given period of time rather than fairness, it is important to keep in mind that tests norms obtained by formal sampling methods are tools ꟷthey can be used properly or o Standardization – the process of administering a improperly test to a representative sample of testtakers for Assumption 7: Testing and Assessment Benefit the purpose of establishing norms Society o Sample – a portion of the universe of people that o Considering the many critical decisions that are represents the whole population based on testing and assessment procedures, o Sampling – process of selecting sample we can readily appreciate the need for tests Probability Sampling What is a “Good Test”? Random sampling, randomization is used to o Includes clear instructions for administration, select samples scoring, and interpretation; offered economy in Simple Random Sampling the time and money it took to administer, score, Every element in the population has an equal and interpret; and, measures what it purports to chance of being selected as part of the sample measure Easy, cheap, remove all risk of bias o Reliability – consistency of the measuring tool Systematic Sampling ▪ The precision with which test measurers and Every nth item or person after is picked the extent to which error is present in Researcher can choose the interval at which measurements items are picked o Validity – measure what it is supposed to Stratified Sampling measure Random selection within predefined groups o A good test is one that trained examiners can More risk of bias due to stratifying administer, score, and interpret with a minimum Cluster Sampling difficulty Groups rather than individual units of the target ▪ Yields actionable results that will ultimately population are selected randomly benefit individual testtakers or society at Non-Probability Sampling large Researchers pick items or individual based on Norms their research goals or knowledge o Norm-Referenced Testing and Assessment – method of evaluation and a way of deriving Convenience Sampling meaning from test scores by evaluating an Selected based on their availability individual testtaker’s score and comparing it to Quota Sampling scores of a group testtakers Achieve a spread across the target population by ▪ Yield information on a testtaker’s standing or specifying who should be recruited for a survey ranking relative to some comparison group of according to certain groups or criteria testtakers Purposive Sampling o Norms – usual, average, normal, standard, Chosen consciously based on their knowledge expected, or typical and understanding of the research question at ▪ Test performance data of a particular group hand or their goals of testtakers that are designed for use as a Snowball or Referral Sampling reference when evaluating or interpreting People recruited to be part of a sample are asked individual test scores to invite those they know to take part, who are then o Normative Sample – group of people whose asked to invite their friends and family and so on performance on a particular test is analyzed for Helpful when the researcher doesn’t know very reference in evaluating the performance of much about the target population and has no easy individual testtakers way to contact or access them o Norming – process of deriving norms o After obtaining sample for standardization, the o User Norms or Program Norms – consists or test developer will administer the test according descriptive statistics based on a group of to the standard set of instructions that will be Psychological Assessment Assumptions about Psychological Testing and Assessment Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) used with the test and also describe the recommended setting for giving the test o Percentile – an expression of the percentage of people whose score on a test or measure falls before a particular raw score ▪ Percentage Correct – refers to the distribution of raw scores, specifically, to the number of items that were answered correctly multiplied by 100 and divided by the total number of items o Age Norms – average performance of different samples of testtakers who were at various ages at the time the test was administered o Grade Norms – developed by administering the test to representative samples of children over a range of consecutive grade levels ▪ Developmental Norms – norms developed on the basis of any trait, ability, skill, or other characteristics that is presumed to develop, deteriorate, or otherwise affected by chronological age, school grade, or stage of life o National Norms – derived from a normative sample that was nationally representative of the population at the time norming study was conducted o Local Norms – provide normative information with respect to the local population’s performance on some test o Fixed Reference Group Scoring System – the distribution of scores obtained on the test from one group of testtakers (fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test o Criterion-Referenced Tests and Assessment – method of evaluation and a way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (criterion) o Domain- or Content-Referenced Testing and Assessment – how scores relate to a particular content area or domain end Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) Reliability test scoreꟷand thus the reliability can be o Dependability or consistency affected o Consistency of the instrument o Measurement Error – all of the factors o A test may be reliable in one context and associated with the process of measuring some unreliable in another variable, other than the variable being measured o Reliability Coefficient – index of reliability, a o Random Error – source of error in measuring a proportion that indicates the ratio between the targeted variable caused by unpredictable true score variance on a test and the total fluctuations and inconsistencies of other variance variables in the measurement process o Classical Test Theory – a score on an ability tests ▪ “Noise” is presumed to reflect not only the testtaker’s ▪ E.g., physical events that happened while true score on the ability being measured but also test is happening error o Systematic Error – source of error in a ▪ Errors of measurement are random measuring a variable that is typically constant or o Error – refers to the component of the observed proportionate to what is presumed to be the true test score that does not have to do with the value of the variable being measured testtaker’s ability o Sources of Error Variance: a. Item Sampling/Content Sampling – refer to variation among items within a test as well as to variation among items between tests ▪ The extent to which testtaker’s score is affected by the content sampled on a test and by the way the content is sampled is a source of error variance b. Test Administration ▪ Testtaker’s motivation or attention, environment, etc. ▪ Testtaker variables and Examiner-related Variables c. Test Scoring and Interpretation ▪ Type I – “false-positive”; an investigator ▪ May employ objective-type items amenable rejects a null hypothesis that is true to computer scoring of well-documented ▪ Type II – “false-negative”; fails to reject null reliability hypothesis that is false in the population ▪ If subjectivity is involved in scoring, ▪ Can reduce the likelihood of type 1 and 2 Reliability Estimates errors by increasing the sample size Test-Retest Reliability o Variance – useful in describing sources of test o Time Sampling score variability o An estimate of reliability obtained by correlating ▪ True Variance – variance from true pairs of scores from the same people on two differences different administrations of the test ▪ Error Variance – variance from irrelevant, o Appropriate when evaluating the reliability of a random sources test that purports to measure something that is o Reliability refers to the proportion of total relatively stable such as personality trait variance attributed to true variance o The longer the time that passes, the greater the o The greater the proportion of the total variance likelihood that the reliability coefficient will be attributed to true variance, the more reliable the lower test o Coefficient of Stability - when the interval o Error variance may increase or decrease a test between testing is greater than 6 months score by varying amounts, consistency of the Parallel Forms and Alternate Forms Reliability Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) o Item Sampling o Randomly assign items to one or the other half of o Coefficient of Equivalence – the degree of the test or assign odd-numbered items to one relationship between various forms of test can half and even-numbered to the other half (odd- be evaluated by means of an alternate forms or even reliability) parallel forms coefficient of reliability o Divide the test by content so that each half o Parallel Forms – each form of the test, the contains items equivalent with respect to means and the variances are equal content and difficulty ▪ Same items, different o Spearman-Brown Formula – allows a test positionings/numberings developer or user to estimate internal ▪ Parallel Forms Reliability – estimate of the consistency reliability from a correlation of two extent to which item sampling and other halves of a test errors have affect test scores on version of o Reliability of the test is affected by the length. the same test when, for each form of the test, Usually, reliability increases as length increases the means and variances of observed test o Spearman-Brown may be used to estimate the scores are equal effect of the shortening on the test’s reliability o Alternate Forms – simply different version of a o SBF also be used to determine the number of test that has been constructed so as to be items needed to attain a desired level of parallel reliability ▪ Alternate Forms Reliability – estimate of the o If the reliability of the original test is relatively extent to which these different forms of the low, then it may be impractical to increase the no. same test have been affected by sampling of items, so they should develop a suitable error, or other error alternative o Two administrations with the same group are o Or increase reliability by creating new items, required clarifying the test instructions, or simplifying the o Test scores may be affected by factors such as scoring rules motivation, fatigue, or intervening events such Inter-item Consistency as practice, learning, or therapy o Refers to the degree of correlation among all the o Some testtaker might do better on a specific items on a scale form of a test but not a function of their true o Calculated from a single administration of a ability but simply bec of the particular items that single form of a test were selected for inclusion in the test o Useful in assessing Homogeneity o Minimizes the effect of memory for the content of ▪ Homogeneity – if a test contains items that a previously administered form of the test measure a single trait (unifactorial) o Certain traits are presumed to relatively stable ▪ Heterogeneity – degree to which a test in people measures different factors (more than one o The means and the variances of the observed trait); source of error variance scores are equal for two forms o More homogenous = higher inter-item Internal Consistency consistency Split-Half Reliability o KR-20 – used for inter-item consistency of o Obtained by correlating two pairs of scores dichotomous items obtained from equivalent halves of a single test o KR-21 – if all the items have the same degree of administered once difficulty (speed tests) o Useful when it is impractical or undesirable to o Coefficient Alpha – appropriate for use on tests assess reliability with two tests or to administer containing non-dichotomous items a test twice ▪ Help answer questions about how similar o Simply diving the test in the middle is not sets of data are recommended because it is likely that this ▪ Check consistency across terms of an procedure would spuriously raise or lower the instrument with responses with varying reliability coefficient credit Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) o Average Proportional Distance – measure used ▪ As individual differences decrease, a to evaluate internal consistency of a test that traditional measure of reliability would also focuses on the degree of differences that exists decrease, regardless of the stability of between item scores individual performance ▪ Not connected to the number of items on a o Classical Test Theory – everyone has a “true measure score” on test Inter-scorer Reliability ▪ True Score – genuinely reflects an o The degree of agreement or consistency individual’s ability level as measured by a between two or more scorers with regard to a particular test particular measure o Domain Sampling Theory – estimate the extent o Used for coding nonverbal behavior to which specific sources of variation under o Coefficient of Inter-scorer Reliability defined conditions are contributing to the test o Observer Differences scores o Kappa Statistics is used ▪ Considers problem created by using a limited ▪ Fleiss Kappa – determine the level of number of items to represent a larger and agreement between two or more raters when more complicated construct the method of assessment is measured on a ▪ Test reliability is conceived of as an objective categorical scale; best way; more than 2 measure of how precisely the test score raters assesses the domain from which the test ▪ Cohen’s Kappa – each classify N items into C draws a sample mutually exclusive categories; rates the ▪ Generalizability Theory – based on the idea same thing, corrected for how often that the that a person’s test scores vary from testing raters may agree by chance; only 2 raters to testing because of the variables in the Using and Interpreting Coefficient of Reliability testing situations o Tests designed to measure one factor ▪ Universe – test situation (Homogenous) are expected to have high degree ▪ Facets – number of items in the test, amount of internal consistency and vice versa of review, and the purpose of test o Dynamic – trait, state, or ability presumed to be administration ever-changing as a function of situational and ▪ According to Generalizability Theory, given cognitive experience the exact same conditions of all the facets in o Static – barely changing or relatively the universe, the exact same test score unchanging should be obtained (Universe score) o Restriction of range or Restriction of variance – ▪ Decision Study – developers examine the if the variance of either variable in a usefulness of test scores in helping the test correlational analysis is restricted by the user make decisions sampling procedure used, then the resulting o Item Response Theory – the probability that a correlation coefficient tends to be lower person with X ability will be able to perform at a o Power Tests – when time limit is long enough to level of Y in a test allow test takers to attempt all times ▪ Latent-Trait Theory o Speed Tests – generally contains items of ▪ The computer is used to focus on the range of uniform level of difficulty with time limit item difficulty that helps assess an ▪ Reliability should be based on performance individual’s ability level from two independent testing periods using ▪ Difficulty – attribute of not being easily test-retest and alternate-forms or split- accomplished, solved, or comprehended half-reliability ▪ Discrimination – degree to which an item o Criterion-Referenced Tests – designed to differentiates among people with higher or provide an indication of where a testtaker stands lower levels of the trait, ability or etc. with respect to some variable or criterion ▪ Dichotomous – can be answered with only one of two alternative responses Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) ▪ Polytomous – 3 or more alternative o Content Validity – describes a judgement of how responses adequately a test samples behavior Reliability and Individual Scores representative of the universe of behavior that o Standard Error of Measurement – provide a the test was designed to sample measure of the precision of an observed test o When the proportion of the material covered by score the test approximates the proportion of material ▪ Standard deviation of errors as the basic covered in the course measure of error o Test Blueprint – a plan regarding the types of ▪ Provides an estimate of the amount of error information to be covered by the items, the no. of inherent in an observed score or items tapping each area of coverage, the measurement organization of the items, and so forth ▪ Higher reliability, lower SEM Criterion-Related Validity ▪ Used to estimate or infer the extent to which o Criterion-Related Validity – a judgement of how an observed score deviates from a true score adequately a test score can be used to infer an ▪ Standard Error of a Score individual’s most probable standing on some ▪ Confidence Interval – a range or band of test measure of interestꟷthe measure of interest scores that is likely to contain true scores being criterion o Standard Error of the Difference – can aid a test o Criterion – standard on which a judgement or user in determining how large a difference decision may be made should be before it is considered statistically ▪ Characteristics: relevant, valid, significant uncontaminated o Standard Error of Estimate – refers to the ▪ Criterion Contamination – occurs when the standard error of the difference between the criterion measure includes aspects of predicted and observed values performance that are not part of the job or when the measure is affected by “construct- irrelevant ” (Messick, 1989) factors that are not part of the criterion construct Concurrent Validity o If the test scores obtained at about the same time as the criterion measures are obtained o Extent to which test scores may be used to estimate an individual’s present standing on a Validity criterion o Validity – a judgment or estimate of how well a o Economically efficient test measures what it supposed to measure Predictive Validity ▪ Evidence about the appropriateness of o Measures of the relationship between test inferences drawn from test scores scores and a criterion measure obtained at a ▪ Inferences – logical result or deduction future time ▪ May diminish as the culture or times change o Researchers must take into consideration the o Validation – the process of gathering and base rate of the occurrence of the variable, both evaluating evidence about validity as that variable exists in the general population o Validation Studies – yield insights regarding a and as it exists in the sample being studies particular population of testtakers as compared o Base Rate – the extent to which a particular trait, to the norming sample described in a test behavior, characteristic, or attribute exist in the manual population o Face Validity – a test appears to measure to the o Hit Rate – defined as the proportion of people a person being tested than to what the test test accurately identifies possessing a actually measures particular trait, behavior, etc. Content Validity Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) o Miss Rate – fails to identify having that particular would have different test scores than those characteristic who really possesses that construct o False Positive – miss; the test predicted that they o Convergent Evidence – if scores on the test do possess a particular trait but actually not undergoing construct validation tend to highly o False Negative – miss; the test predicted they do correlated with another established, validated not possess a particular trait but actually do test that measures the same construct o Validity Coefficient – correlation coefficient that o Discriminant Evidence – a validity coefficient provides a measure of the relationship between showing little relationship between test scores test scores and scores on the criterion measure and/or other variables with which scores on the ▪ Usually, Pearson R is used, however other test being construct-validated should not be correlation coefficients could be used correlated depends on the type of data o Factor Analysis – designed to identify factors or ▪ Affected by restriction or inflation of range specific variables that are typically attributes, ▪ Validity Coefficient need to be large enough to characteristics, or dimensions on which people enable the test user to make accurate may differ decisions within the unique context in which a ▪ Employed as data reduction method test is being used ▪ Identify the factor or factors in common o Incremental Validity – the degree to which an between test scores on subscales within a additional predictor explains something about particular test the criterion measure that is not explained by ▪ Explanatory FA – estimating or extracting predictors already in use factors; deciding how many factors must be Construct Validity retained o Construct Validity – judgement about the ▪ Confirmatory FA – researchers test the appropriateness of inferences drawn from test degree to which a hypothetical model fits the scores regarding individual standing on variable actual data called construct ▪ Factor Loading – conveys info about the o Construct – an informed, scientific idea extent to which the factor determine the test developed or hypothesized to describe or score or scores explain behavior Validity, Bias, and Fairness ▪ Unobservable, presupposed traits that may o Bias – factor inherent in a test that invoke to describe test behavior or criterion systematically prevents accurate, impartial performance measurement o One way a test developer can improve the ▪ Prejudice, preferential treatment homogeneity of a test containing dichotomous ▪ Prevention during test dev through a items is by eliminating items that do not show procedure called Estimated True Score significant correlation coefficients with total test Transformation scores o Rating – numerical or verbal judgement that o If it is an academic test and high scorers on the places a person or an attribute along a entire test for some reason tended to get that continuum identified by a scale of numerical or particular item wrong while low scorers got it word descriptors known as Rating Scale right, then the item is obviously not a good one ▪ Rating Error – intentional or unintentional o Some constructs lend themselves more readily misuse of the scale than others to predictions of change over time ▪ Leniency Error – rater is lenient in scoring o Method of Contrasted Groups – demonstrate (Generosity Error) that scores on the test vary in a predictable way ▪ Severity Error – rater is strict in scoring as a function of membership in a group ▪ Central Tendency Error – rater’s rating would ▪ If a test is a valid measure of a particular tend to cluster in the middle of the rating construct, then the scores from the group of scale people who does not have that construct Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) ▪ One way to overcome rating errors is to use scores on a criterion measure – passing, rankings acceptable, failing ▪ Halo Effect – tendency to give high score due o Might indicate future behaviors, then if to failure to discriminate among conceptually successful, the test is working as it should distinct and potentially independent aspects o Taylor-Russel Tables – provide an estimate of of a ratee’s behavior the extent to which inclusion of a particular test o Fairness – the extent to which a test is used in an in the selection system will improve selection impartial, just, and equitable way o Selection Ratio – numerical value that reflects o Attempting to define the validity of the test will be the relationship between the number of people futile if the test is NOT reliable to be hired and the number of people available to be hired o Base Rate – percentage of people hired under the existing system for a particular position o One limitation of Taylor-Russel Tables is that the relationship between the predictor (test) and criterion must be linear o Naylor-Shine Tables – entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established Utility procedures o Utility – usefulness or practical value of testing Brogden-Cronbach-Gleser Formula to improve efficiency o Used to calculate the dollar amount of a utility o Can tell us something about the practical value gain resulting from the use of a particular of the information derived from scores on the selection instrument test o Utility Gain – estimate of the benefit of using a o Helps us make better decisions particular test o Higher criterion-related validity = higher utility o Productivity Gains – an estimated increase in o One of the most basic elements in utility analysis work output is financial cost of the selection device Some Practical Considerations o Cost – disadvantages, losses, or expenses both o High performing applicants may have been economic and noneconomic terms offered in other companies as well o Benefit – profits, gains or advantages o The more complex the job, the more people o The cost of test administration can be well worth differ on how well or poorly they do that job it if the results is certain noneconomic benefits o Cut Score – reference point derived as a result Utility Analysis of a judgement and used to divide a set of data o Utility Analysis – family of techniques that entail into two or more classifications a cost-benefit analysis designed to yield ▪ Relative Cut Score – reference point based information relevant to a decision about the on norm-related considerations (norm- usefulness and/or practical value of a tool of referenced); e.g, NMAT assessment ▪ Fixed Cut Scores – set with reference to a How is Utility Analysis Conducted? judgement concerning minimum level of Expectancy Data proficiency required; e.g., Board Exams o Expectancy table – provide an indication that a ▪ Multiple Cut Scores – refers to the use of testtaker will score within some interval of two or more cut scores with reference to Psychological Assessment Reliability, Validity, Utility Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013) one predictor for the purpose of categorization ▪ Multiple Hurdle – multi-stage selection process, a cut score is in place for each predictor ▪ Compensatory Model of Selection – assumption that high scores on one attribute can compensate for lower scores Methods for Setting Cut Scores o Angoff Method – setting fixed cut scores ▪ low interrater reliability o Known Groups Method – collection of data on the predictor of interest from group known to possess and not possess a trait of interest ▪ The determination of where to set cutoff score is inherently affected by the composition of contrasting groups o IRT-Based Methods – cut scores are typically set based on testtaker’s performance across all the items on the test ▪ Item-Mapping Method – arrangement of items in histogram, with each column containing items with deemed to be equivalent value ▪ Bookmark Method – expert places “bookmark” between the two pages that are deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who are not o Method of Predictive Yield – took into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores o Discriminant Analysis – shed light on the relationship between identified variables and two naturally occurring groups end Psychological Assessment Test Development Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) Test Conceptualization ▪ Multidimensional – more than one o Test Development – an umbrella term for all that dimension goes into the process of creating a test ▪ Comparative and Categorical o Test Conceptualization – brain storming of ideas ▪ Rating Scale – grouping of words, about what kind of test a developer wants to statements, or symbols on which judgments publish of the strength of a particular trait are o Questions to ponder on when conceptualizing indicated by the testtaker for new tests: ▪ Summative Scale – final score is obtained by 1. What is the test designed to measure? summing the ratings across all the items 2. What is the objective? ▪ Likert Scale – scale attitudes, usually 3. Is there a need for this kind of test? reliable; 4. Who will use the test? ▪ Thurstone Scale - involves the collection of a 5. Who will take the test? variety of different statements about a 6. What content will the test cover? phenomenon which are ranked by an expert 7. How will the test be administered? panel in order to develop the questionnaire 8. What is the ideal format of the test? ▪ Method of Paired Comparisons – produces 9. Should more than one form of test be developed? ordinal data by presenting with pairs of two 10. What special training will be required of test stimuli which they are asked to compare users for administering or interpreting the test? ▪ Comparative Scaling – entails judgments of 11. What types of responses will be required of a stimulus in comparison with every other testtaker’s? stimulus on the scale 12. Who benefits from an administration of this test? ▪ Categorical Scaling – stimuli are placed into 13. Is there potential harm? one of two or more alternative categories 14. How will meaning be attributed to scores on this that differ quantitatively with respect to test? some continuum o Pilot Work/Pilot Study/Pilot Research – ▪ Guttman Scale – yields ordinal-level preliminary research surrounding the creation measures of a prototype of the test o Item Pool – reservoir or well from which the ▪ Attempts to determine how best to measure a items will or will not be drawn for the final targeted construct version of the test ▪ Entail lit reviews and experimentation, ▪ A comprehensive sampling provides a basis creation, revision, and deletion of preliminary for content validity of the final version of the items test Test Construction ▪ The test developer may write a large number o Test Construction – stage in the process that of items from personal experience or entails writing test items, revisions, formatting, academic acquaintance with the subject setting scoring rules matter or experts o Scaling – process of setting rules for assigning o Item Format – form, plan, structure, numbers in measurement arrangement, and layout of individual test items ▪ Process by which a measuring device is ▪ Selected-Response Format – require assigned and calibrated and by which testtakers to select response from a set of numbersꟷscale valuesꟷare assigned to alternative responses different amounts of the trait, attribute, or Multiple-Choice Format characteristic being measured Has three elements: stem (question), a correct ▪ Age-Based – age is of critical interest option, and several incorrect alternatives ▪ Grade-Based – grade is of critical interest (distractors or foils) ▪ Stanine – if all raw score of the test are to be Should’ve one correct answer, has grammatically transformed into scores that range from 1-9 parallel alternatives, similar length, alternatives ▪ Unidimensional – only one dimension is that fit grammatically with the stem, avoid presumed to underlie the ratings Psychological Assessment Test Development Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018) ridiculous distractors, not excessively long, “all of ▪ Floor Effects – occurs when there is some the above”, “none of the above” lower limit on a survey or questionnaire and a Probability of getting the correct answer is 25% large percentage of respondents score near Matching Item this lower limit (testtakers have low scores) Test taker is presented with two columns: ▪ Ceiling Effects – occurs when there is some Premises and Responses upper limit on a survey or questionnaire and a Should be fairly short and to the point and only one large percentage of respondents score near premise would match to one response this upper limit (testtakers have high scores) Binary Choice o Item Branching – ability of the computer to tailor True-False Item the content and order of presentation of items on Usually takes the form of a sentence that requires the basis of responses to previous items the testtaker to indicate whether the statement is o Cumulative Scoring – the higher score one or is not a fact achieved on the test, the higher the testtaker is Contains single idea and not subject to debate on the ability that the test purports to measure Probability of obtaining the correct answer is 50% o Class Scoring/Category Scoring – testtaker ▪ Constructed-Response Format – requires responses earn credit toward placement in a testtakers to supply or to create the correct particular class or category with other answer, not merely selecting it testtakers who pattern of responses is presumably similar in some way Completion Item o Ipsative Scoring – comparing a testtaker’s score Requires the examinee to provide a word or phrase on one scale within a test to another scale within that completes a sentence that same test Should be worded properly so that the correct o Semantic Differential Rating Technique - answer is specific measures an individual's unique, perceived Short-answer item meaning of an object, a word, or an individual; Should be written clearly enough that the testtaker usually essay type, open-ended format can respond succinctly, with short answer Test Tryout Essay Item o The test should be tried out on people who are Respond by writing a composition similar in critical respects to the people for Allows creative integration and expression of the whom the test was designed material o An informal rule of thumb should be no fewer Tends to focus on a more limited area than can be that 5 and preferably as many as 10 for each item covered in the same amount of time when using a (the more, the better) series of selected-response items or completion o Risk of using few subjects = phantom factors items emerge Subject to scoring and inter-scorer differences o Should be executed under conditions as o Item Banks – relatively large and easily identical as possible accessible collection of test questions o Pseudobulbar Affect – neurological disorder o Computerized Adaptive Testing – refers to an characterized by frequent involuntary outburst interactive, computer administered test-taking of laughing or crying that may or may not be process wherein items presented to the appropriate to the situation testtaker are based in part on the testtaker’s o A good test item