Full Transcript

INTRODUCTION
Test – a measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior. A test measures only a sample of behavior, and error is always associated with a sampling process
Item – a specific stimulus to which a person responds overtly; this response can be evaluated. Hence, items are the questions provided in the test
Psychological Test – aka Educational Test, a set of items designed to measure characteristics of human beings that pertain to behavior. It is a standardized measure of a sample of a person’s behavior (by Anastasi)
Types of behavior that a test measures
Overt behaviors – an individual’s observable activity
Covert behaviors – take place within an individual and cannot be directly observed (e.g. feelings)
Scales – relate raw scores on test items to some defined theoretical or empirical distribution
Types of Tests
Based on the number of examinees
Group tests – can be administered to more than one person at a time by a single examiner
Individual tests – can be given to only one person at a time by an examiner
Based on what it measures
Ability test – can be in the form of achievement, aptitude, speed, and power
Achievement tests – measure previous learning
Aptitude tests – measure the potential for learning or for acquiring skills
Maximum Performance Test – called as such because they test what you can achieve when you are making maximum effort
Speed Test – the items in this test are homogeneous and easy. However, the time allowed is so short that few are able to finish the test. It is concerned with how many questions you can answer in a given time period
Power Test – a test that presents a smaller number of items that are more complex than those in speed tests.
The highest level at which a person can succeed is of greater interest than his speed on easy tasks Personality tests – measures typical behavior-traits, temperaments, and disposition Structured Personality test – provide a self-report statements, and require the subject to choose between two or more alternatives such as “true or false” or “1-10” (Likert scale) Projective Personality test – unstructured tests. Either the stimulus (test materials) or the response, or both, are ambiguous Based on user qualifications Level A – lowest level, college students etc. Level B – Psychometricians and college graduates supervised by an expert Level C – Psychologists, Ph.D. in Psychology (Projective tests) Psychological Testing – aka Psychometrics, the systematic use of tests to quantify psychophysical behavior, abilities, and problems and to make predictions about psychological performance. The main use of these tests is to evaluate individual differences or variations among individuals (to differentiate among those taking the test) Historical of Psychological Testing Chinese already had a sophisticated civil service testing program 4000 years ago Han Dynasty – application of Test Batteries for civil law, military affairs, etc. Test Batteries or Battery of Tests – two or more tests used in conjunction Individual Differences Charles Darwin – his theory of Natural Selection provided the impetus to assess individual differences among humans using tests. He is also regarded as the Father of Comparative Psychology. Sir Francis Galton – half-cousin of Charles Darwin. The one who laid the foundation of correlation and percentile. Also, the first to establish the Anthropometric Laboratory in England, 1884. He continued the study of Darwin in individual differences. Considered to be as the Father of Mental Testing James McKeen Cattell – coined the term “mental test”. The founder of Mental Testing Movement Experimental Psychology J.E. 
Herbart – used the mathematical models of the minds E.H. Weber – attempted to demonstrate the existence of “psychological threshold”, the minimum stimulus necessary to activate a sensory system Weber’s Law - states that the just-noticeable difference between two stimuli is proportional to the magnitude of the stimuli (if you sense a change in weight of .5 lbs on a 5 pound dumbbell, you ought to feel the extra pound added to a ten pound dumbbell) G.T. Fechner – devised the “Fechner’s Law”, states that the strength of a sensation grows as the logarithm of the stimulus intensity. Weber and Fechner are considered as the grandfather of psychology Wilhelm Wundt – the father of modern psychology, laid the foundation of modern psychology. Edouard Seguin – developed the Seguin Form Board Test. He also opened the world’s first school for the mentally retarded. Inspiration for Maria Montessori Jean Esquirol – emphasized language as the best criterion for mental ability (Language Tests). Presented the retardation hierarchy Intelligence Standardized Achievement Tests The Simon-Binet Intelligence Scale was first published in 1905, it was designed to identify intellectually subnormal individuals Alfred Binet – father of Intelligence testing Representative Sample/Standardization Sample – a sample that comprises individuals similar to those for whom the test is to be used (hence, the population) Mental Age – introduced in the 1908 Simon-Binet scale. 
It is a measurement of a child’s performance on the test relative to other children of that particular age group
Ceiling age – the lowest year level at which the examinee fails all the subtests of the scale
Basal age – the highest year level at which the examinee passes all the subtests of the scale
Mental age – computed by adding to the basal age the number of months of credit received for each subtest passed up to the ceiling age: MA = Basal Age + Credit (until ceiling age)
Stanford-Binet Intelligence Scale – developed by Lewis Terman, who used William Stern’s concept of the Intelligence Quotient: IQ = (MA/CA) x 100
William Stern – Concept of IQ
World War I
Robert Yerkes – president of the APA during WWI; headed the committee (together with Arthur Otis) that created two structured group tests of human abilities: Army Alpha and Army Beta. The Army Alpha required reading ability, while the Army Beta measured the intelligence of illiterate adults
Wechsler’s Contribution
David Wechsler published the Wechsler-Bellevue Intelligence Scale in 1939, challenging the supremacy of the Binet-Simon scale. The Wechsler scale produced several scores, like the Performance IQ. It overcame some weaknesses of the Binet test by adding non-verbal scales
Personality Tests
Traits – relatively enduring dispositions or tendencies to feel, act, and think in a certain manner across situations, which distinguish one individual from another
The Woodworth Personal Data Sheet or Woodworth’s Psychoneurotic Inventory – the first personality test, used during WWI
The Rorschach Inkblot Test – published by Hermann Rorschach in Switzerland in 1921. Sam Beck, David Levy, and Exner developed techniques for administering, scoring, and interpreting the Rorschach Test
Thematic Apperception Test – developed by Christiana Morgan and Henry Murray in 1935.
Composed of 32 cards, but 20 is the suggested number of cards to be administered, 2 day administration with 1 day interval MMPI – use empirical methods to determine the meaning of the test response Factor Analysis – a method of finding the minimum number of dimensions (characteristics, attributes) called factors, to account for a large number of variables. R.B. Cattell’s 16 Personality Factor Questionnaire – he utilized the method of factor analysis to come up with this test NORMS AND BASIC STATISTICS FOR TESTING Inferences – logical deductions about events that cannot be observed directly Exploratory Data Analysis – the detective work of gathering and displaying clues. An approach to analyzing data that sets to summarize their main characteristics, often with visual method. (termed by statistician John Tukey) Confirmatory Data Analysis – when the clues are evaluated against rigid statistical rules Descriptive statistics – methods used to provide a concise description of a collection of quantitative information Inferential statistics – methods used to make inferences from observations of a sample to the population Scales of measurement Measurement – assigning numbers to qualities (to quantify qualities) Properties of Scales Magnitude – the property of “moreness”. It allows us to say that a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance. Equal Intervals – the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units. Example of which is the length in a ruler. IQ scores doesn’t possess this property for 45 and 50 is different from 100 and 105, even both of them seems to have an interval of 5 points. 
When a scale has this property, the relationship between the measured units and some outcome can be described as a linear equation in the form: Y = a + bX
Absolute Zero – an absolute zero is obtained when nothing of the property being measured exists. Zero means nothing, the absence of the variable being measured
Types of Scales
Nominal Scale – not a scale at all; its only purpose is to name objects. Heavily used if data is qualitative rather than quantitative
Ordinal Scale – a scale with the property of magnitude. Allows you to rank individuals but not to say anything about the meaning of the differences between the ranks. IQ is on an ordinal scale
Interval Scale – has the properties of magnitude and equal intervals. An example is temperature in degrees Celsius or Fahrenheit
Ratio Scale – possesses all of the properties of scales. Temperature in kelvins is a good example, as is speed in mph
Correlational Statistics
Pearson’s Product Moment Correlation Coefficient (rxy) – used if the two variables are both interval, both ratio, or one interval and one ratio
Eta Coefficient – measure of nonlinear association. For linear relationships, eta equals the correlation coefficient (Pearson's r). For nonlinear relationships it is greater; hence the difference between eta and r is a measure of the extent of nonlinearity of the relationship
Spearman’s rho (rs or rho) – used if both variables are ordinal
Point biserial (rpb) – used if one variable is nominal (a true dichotomy) and the other is interval
Biserial Correlation – used if one variable is an artificial dichotomy (nominal) and the other is interval
Rank-Biserial correlation – used for dichotomous nominal data versus rankings (ordinal data)
Phi Coefficient (phi) aka Mean Square Contingency – used if both variables are nominal (true dichotomies)
Tetrachoric Correlation coefficient – used when both variables are artificially dichotomous
McNemar’s Test – statistical test used on paired nominal data.
(2 matched groups/repeated groups with 2 nominal data) Cochran’s Q Test – provides a method for testing for differences between three or more matched sets of frequencies or proportions. Kendall’s Coefficient of Concordance – a non-parametrical statistics for test of agreement among sets of rankings. For 2 or more ordinal data Chi Square test – a non-parametric test for 2 or more nominal data. Used as a test of independence, but also as a goodness of fit test Chi-Square Test for Goodness of Fit – the individuals are classified into categories and we want to know what proportion of the population is in each category. Uses sample data to test hypotheses about the shape or proportion of a population distribution. The test determines how well the obtained sample proportions fit the population proportions specified by the null hypothesis Chi Square Test for Independence – uses the frequency data from a sample to evaluate the relationship between two variables in the population The Gaussian Curve or the Normal Curve (aka the Symmetrical Binomial Probability Distribution) Properties of the Normal Curve Symmetrically bell shaped distribution Mean=mode=median Kurtosis proper is equal to 3, while kurtosis excess is 0 Kurtosis – the degree of peakedness in a distribution. Mesokurtic – the peak is neither high nor low, it serves as a baseline for platykurtic and leptokurtic. (Normal distribution) Leptokurtic – kurtosis is greater than mesokurtic distributions. Peaks that are tall and very thin Platykurtic – kurtosis is lower than mesokurtic distributions. Peaks that are wide and short. Results if the extreme scores are slightly the same as the mean score Asymptotic to the x-axis, means the curve comes infinitely close to the x-axis without touching it. 
No value of y is zero
The curve is continuous
The inflection points are at ±1σ from the mean, µ
Skewness = 0
Skewness is a measure of symmetry, or more precisely, the lack of symmetry
Positively skewed – the tail is on the positive side of the distribution, which means that there is a high frequency of low scores and few high scores
Negatively skewed – the tail is on the negative side of the distribution, which means that there is a high frequency of high scores and few low scores
Percentile and Percentile Ranks
Percentile rank – answers the question “what percent of scores fall below a particular score?”
Percentile – the specific score or point within the distribution. It indicates the particular score below which a defined percentage of scores fall
Mean – arithmetic average score in the distribution
Standard deviation – an approximation of the average deviation around the mean
Variance – the average squared deviation around the mean (just square the value of the standard deviation and you’ll get the variance)
Transforming Raw Scores
Z-Score – the difference between a score and the mean, divided by the standard deviation. It is the deviation of a score from the mean in standard deviation units. Z = (X-M)/SD. It has a mean of 0 and an SD of 1
McCall’s T – has a mean of 50 and an SD of 10
To convert raw scores into T-scores, DIQ, Stanines etc.:
Convert the raw score into a Z-score
After conversion, use the formula X (the norm you want, ex. T-score, Dev. IQ) = SD(Z) + M, where M and SD are the mean and standard deviation of the target scale. For example, Z = 1.0 corresponds to T = 10(1.0) + 50 = 60
Deviation IQ – mean of 100 and SD of 15
Stanine (standard nine) – mean of 5 and SD of 2. Developed by the US Air Force during WWII
Sten – Standard ten
Quartiles and Deciles – pertain to the division of the percentile scale into groups
Quartiles – points that divide the frequency distribution into equal fourths. The second quartile (the 50th percentile) is the median
Interquartile range – the interval of scores bounded by the 25th percentile and 75th percentile.
It is the middle 50% of the distribution Deciles – mark 10% rather than 25%. Norms – refer to the performances by defined groups on particular tests. Within group norms – allows you to compare your score to a standardized sample that took the test. Developmental Norms/Age-related norms – doesn’t allow you to compare oneself from another, instead, your score is only compared to your own score (from infancy to childhood for example) Tracking – tendency to stay at the same level to one’s peers is known as tracking. (height and weight for example) Norm-referenced tests – tests that compares each person with a norm. encourages competition among children Criterion-referenced tests – describes the specific types of skills, tasks, or knowledge that the test taker can demonstrate. Result of the test is not compared, but they would be employed to design an individualized program of instruction. Thus, this tests identify problems that can be remedied CORRELATION AND REGRESSION Bivariate distributions – two scores for each individual Scatterplot diagram – a picture of the relationship between two variables Correlation Correlation coefficient – a mathematical index that describes both the magnitude and the direction of the relationship Regression – used to make predictions about scores on one variable from knowledge of scores on another variable Regression line or the line of best fit– the best-fitting straight line through a set of points in a scatter diagram. Found using the principle of least squares, which minimizes the squared deviation around the regression line. Formula: Y’ = a + bX Where Y’ – is the predicted value of Y every time X changes b – the regression coefficient. 
The slope of the regression line
Sum of squares – defined as the sum of the squared deviations around the mean
Covariance – used to express how much two measures covary, or go together
The slope describes how much change is expected in Y each time X increases by one unit
a – the intercept, is the value of Y when X is 0. In other words, it is the point at which the regression line crosses the Y-axis
X – any value selected along the x-axis
Residual – the difference between the observed score (Y) and the predicted score (Y’)
Principle of least squares – the line of best fit is obtained by keeping the squared residuals as small as possible. The distance between the best-fitting line and each point in the scatterplot needs to be as small as possible, so that when all the squared distances are added, the smallest possible value is obtained. Stated as: Σ(Y − Y’)² is at a minimum
Pearson Product Moment Correlation Coefficient – the ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable. Ranges from -1.0 to +1.0
Testing the Statistical Significance of r using the t-distribution (without the use of a P- or Sig-value)
Degrees of freedom – defined as the sample size minus two, or df = N − 2
$$t = r\sqrt{\frac{N - 2}{1 - r^{2}}}$$
After we have obtained the computed t, we locate the critical (expected) value of t in the appendix (t-distribution) based on the degrees of freedom, the significance level of 0.05, and whether it is a two- or one-tailed test. The correlation is significant if the computed t exceeds the critical value
Regression Plots – pictures that show the relationship between variables
Normative data – data in the middle level of the distribution, the data most observed. It is the information gained from representative groups.
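The least-squares line and the t-test for r described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five data points are made up, and the function names `fit_line` and `t_statistic` are hypothetical helpers, not part of the source material.

```python
import math

def fit_line(xs, ys):
    """Return intercept a, slope b, and Pearson r for Y' = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                    # slope (regression coefficient)
    a = mean_y - b * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    return a, b, r

def t_statistic(r, n):
    """t = r * sqrt((N - 2) / (1 - r^2)), with df = N - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up illustrative data (an assumption, not from the source)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
t = t_statistic(r, len(xs))

print(f"Y' = {a:.2f} + {b:.2f}X, r = {r:.3f}, t = {t:.3f}, df = {len(xs) - 2}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

For these points the slope is 0.6, the intercept is 2.2, r is about 0.775, and t is about 2.12 with df = 3. The two-tailed critical value at the 0.05 level for df = 3 is about 3.182, so in a sample this small the correlation would not be declared significant.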
Other correlation coefficients Spearman’ rho – finding the association between two sets of ranks Dichotomous variables – variables that have two levels (either true or false, male or female, correct or incorrect) True Dichotomous – called as such because they naturally form two categories (ex. Male or Female, Atheist or Theist etc.) False Dichotomous – they reflect an underlying continuous scale forced into a dichotomy Statistics used in Dichotomous variables Biserial correlation – expresses the relationship between a continuous variable and an artificial dichotomous variable Point biserial correlation – finding the association between a continuous variable and a true dichotomous variable Phi coefficient – used when both variables are dichotomous and at least one of them is a true dichotomous variable Tetrachoric Correlation – used if both dichotomous variables are artificial (substitute to phi coefficient; for 2 artificial dichotomous variables) Terms and Issues in the use of Correlation Residual – the difference between the predicted and observed values Symbolically, Y – Y’ Properties The sum of all residuals is always equals to 0 The squared residuals is the smallest value according to the principle of least squares Standard Error of Estimate (Syx) – the standard deviation of the residuals It is the measure of accuracy of prediction If the standard error of estimate is relatively small, prediction is most accurate. As it becomes larger, the prediction becomes less accurate Coefficient of Determination (r2) – equals to the correlation coefficient squared This value tells us the proportion of the total variation in scores on Y that we know as a function of information about X Ex. If the r of SAT and academic performance is 0.40, when we square it, we will obtain 0.16. 
This means that we can explain 16% of the variation in academic performance by knowing the SAT scores (16% of the variance in academic performance can be accounted for by the SAT scores)
Coefficient of Alienation – the measure of non-association between two variables. Calculated as Coeff. of A = √(1 − r²). For example, if the r of SAT and Academic Performance is 0.40, then √(1 − 0.40²) = √0.84 ≈ 0.92. Therefore, there is a high degree of non-association between the SAT and Academic Performance
Shrinkage – the amount of decrease observed when a regression equation is created for one population and then applied to another
Cross Validation – using the regression equation to predict performance in a group of subjects other than the ones from which the equation was derived
Correlation-causation problem
Correlation doesn’t imply causation
Third Variable Explanation – a third variable, an external influence other than the two variables, might cause the relationship or association
Restricted Range Problem – if the variability is restricted, then significant correlations are difficult to find
Multivariate Analysis – considers the relationship among combinations of three or more variables
Linear Combinations of variables: Y’ = a + b1X1 + b2X2 + …… + bkXk
Multiple Regression – to find the linear combination of three or more variables that provides the best prediction of a single variable (ex. 
Age, GPA, Rating by professor as predictors of success in law school)
Standardized Regression coefficients (aka B’s or Betas) – the coefficients for the variables are called beta weights when the variables are expressed in Z-units
Raw Regression coefficients (aka b’s) – used if we did not convert the raw scores into standardized scores (hence we use the variables’ original units)
Multiple regression is appropriate when the criterion variable is continuous (not nominal)
Discriminant Analysis – used when the task is to find the linear combination of variables that provides maximum discrimination between categories (or nominal data). Ex. A set of measures used to predict success or failure on a particular performance evaluation
Multiple Discriminant Analysis – used if we want to determine the categorization into more than two categories
Factor Analysis – used to study the interrelationships among a set of variables without reference to a criterion. It is a data reduction technique
Linear combinations of variables are known as principal components or factors
Factor loadings – correlations between the original items and the factors
Methods of rotation – called as such because the transformational methods involve rotating the axes in the space created by the factors
Transformational methods – transforming the variables in a way that pushes the factor loadings toward the high or the low extreme
Variable X | Variable Y | Correlational Statistic
Interval/Ratio | Interval/Ratio | Pearson’s r
Ordinal | Ordinal | Spearman’s rho
Nominal (True Dichotomy) | Interval/Ratio | Point Biserial
Nominal (Artificial Dichotomy) | Interval/Ratio | Biserial
Nominal | Ordinal | Rank Biserial
Nominal (True Dichotomy) | Nominal (True or Artificial Dichotomy) | Phi Coefficient
Nominal (Artificial Dichotomy) | Nominal (Artificial Dichotomy) | Tetrachoric Correlation
RELIABILITY
Spearman’s Early Studies
Charles Spearman
The advanced development of reliability assessment dates back to the early work of Charles Spearman.
The Father of Classical Test Theory Other contributors in the advancement of reliability assessment Karl Pearson – proposed the product moment correlation Abraham De Moivre – introduced the basic notion of sampling error Sampling error – the error caused by observing a sample instead of the whole population. The sampling error is the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter  Basics of Test Score Theory by Charles Spearman Classical Test Theory assumes that each person has a true score that would be obtained if there were no errors in measurement. The observed score has two components; a true score and an error component. Thus, error is the difference between the observed score and the true score Major assumption – errors of measurement is random It also proposed that the distribution of random errors is bell-shaped The center of the distribution represent the true score, and the dispersion around the mean displays the distribution of sampling errors Standard error of measurement σmeas it is the standard deviation of errors it tells us how much a score varies from the true score Domain Sampling Model – another central concept in classical test theory This model considers the problems created by using a limited number of items to represent a larger, more complicated construct As the sample gets larger, it represents the domain more and more accurately As a result, the greater the number of items, the higher the reliability Item Response Theory The study of test and item scores based on assumptions concerning the mathematical relationship between abilities (or other hypothesized traits) and item responses. Other names and subsets include Item Characteristic Curve Theory, Latent Trait Theory, Rasch Model, 2PL Model, 3PL model and the Birnbaum model. 
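Before turning to the figure, the 2PL model named above can be sketched numerically. The snippet below is a minimal illustration, assuming the standard two-parameter logistic form P(θ) = 1 / (1 + e^(−a(θ − b))); the parameter names `a` (discrimination) and `b` (difficulty), the function name `icc_2pl`, and the example values are assumptions for illustration, and the optional scaling constant some texts include is omitted.

```python
import math

def icc_2pl(theta, a=1.0, b=0.0):
    """Probability of a correct response under a 2PL item characteristic curve.

    theta: test-taker ability
    a: item discrimination (assumed example parameter)
    b: item difficulty (assumed example parameter)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Trace the s-shaped curve for one hypothetical item
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = icc_2pl(theta, a=1.2, b=0.0)
    print(f"theta = {theta:+.1f}  P(correct) = {p:.3f}")
```

When ability equals the item difficulty (θ = b), the probability of a correct response is exactly 0.5, and it rises toward 1 as ability increases, which is the s-shape the transcript describes.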
In the following figure, the x-axis represents student ability and the y-axis represents the probability of a correct response to one test item. The s-shaped curve, then, shows the probabilities of a correct response for students with different ability (theta) levels. One of the basic assumptions in IRT is that the latent ability of a test-taker is independent of the content of a test. The relationship between the probability of answering an item correctly and the ability of a test-taker can be modeled in different ways depending on the nature of the test (Hambleton et al., 1991). It is common to assume unidimensionality, i.e. that the items in a test measure one single latent ability. According to IRT, a test-taker with high ability should have a high probability of answering an item correctly
Figure: item characteristic curve (x-axis: ability, theta; y-axis: probability of a correct response)
Models of Reliability
Most reliability coefficients are correlations, but we can also express reliability as a mathematically equivalent ratio
$$r = \frac{\sigma_{T}^{2}}{\sigma_{X}^{2}}$$
The ratio of the variance of the true scores on a test to the variance of the observed scores
Test-retest Method – Time sampling
Used to evaluate the error associated with administering a test at two different times. This type of analysis is of value only when we measure traits that do not change over time
Tests that measure some constantly changing characteristic are not appropriate for test-retest evaluation. It only applies to measures of stable traits
Consider the carryover effect – occurs when the first testing session influences scores from the second session. Some people remember the items
Practice effect – a type of carryover effect; some skills improve with practice
Parallel Forms Method or Equivalent Forms – Item Sampling
Compare two equivalent forms of a test that measure the same attribute.
The two forms use different items, but the difficulty level is the same Few tests possess this reliability aspect Internal Consistency Split-half method A test is given and divided into halves that are scored separately 50-50 or odd-even systems To correct for half-length tests, apply the Spearman-Brown Formula This will allow you to estimate what the correlation between the two halves would have been if each half been the length of the whole $$corrected\ r = \ \frac{2r}{1 + r}$$ Using this formula, the reliability coefficient will increase Spearman-brown formula is not advisable if the two halves of the test have unequal variances. To answer this, Cronbach’s coefficient alpha can be used Kuder-Richardson 20 Formula Used for calculating the reliability of a test in which items are dichotomous, scored 0 or 1 (usually for right and wrong) Difficulty – refers to the percentage of test takers who pass the item All the items of a test must have equal difficulty, or that the average difficulty level is 50% Cronbach’s Alpha Used for calculating the reliability of a test in which items are continuous (ex. Likert scale) Coefficient alpha is the most general method of finding estimates of reliability through internal consistency All of the measures of internal consistency evaluate the extent to which the different items on a test measures the same ability or trait Interscorer/Interrater/Interobserver/Interjudge Kappa Statistic Introduced by J. Cohen (only as a measure of agreement bet. 
two judges who rate an object using a nominal scale)
It assesses the level of agreement among several observers
The value of the kappa statistic ranges from 1 (perfect agreement) to -1 (less agreement than can be expected on the basis of chance alone)
We can also use the Phi coefficient (mean square contingency coefficient) as an approximation of the coefficient for the agreement between two observers
Sources of errors
Time sampling – the same test given at different points in time may produce different scores, even if given to the same test takers. Assess using the test-retest method
Item Sampling – assess using the alternate form method
Internal consistency – refers to the intercorrelations among items within the same test. Assessed using the split-half method, KR-20, or Alpha coefficient
How to compensate for a low reliability coefficient
Increase the no. of test items
The Spearman-Brown Prophecy Formula is used to estimate how long a test must be to achieve a desired level of reliability
Prophecy Formula
$$N = \frac{r_{\text{desired}}(1 - r_{\text{observed}})}{r_{\text{observed}}(1 - r_{\text{desired}})}$$
After obtaining N, we multiply it by the number of items in the current test. For example, raising reliability from .70 to .90 gives N = .90(.30) / (.70(.10)) = .27/.07 ≈ 3.86, so the test would need to be about 3.86 times its current length
Factor and Item Analysis
Perform factor analysis to check whether each item measures the same construct
Tests are most reliable if they are unidimensional
Discriminability analysis – a form of item analysis. Examines the correlation between each item and the total score of the test. If the correlation between the performance on a single item and the total test score is low, the item should be excluded (for example because it is too easy or too difficult)
Correction for Attenuation
Allows us to estimate what the correlation between two measures would have been if they had not been measured with error.
VALIDITY
Validity – the extent to which the test measures what it purports to measure.
It is the agreement between a test score or measure and the quality it is believed to measure Aspects of Validity Face validity – it is the mere appearance that a measure has validity. It is not a validity at all because it does not offer evidence to support conclusions drawn from test scores. Face validity is important because it helps motivate the test takers to answer the questions Content validity evidence – considers the adequacy of representation of the conceptual domain the test is designed to cover. Besides face validity, it is more logical rather than statistical Table of Specification – used in achievement tests in educational setting (quizzes, exams, etc.) to check whether the test adequately sample the domain being represented. Construct underrepresentation – describes as the failure to capture important components of the construct Construct-irrelevance variance – occurs when scores are influenced by factors irrelevant to the construct (Ex. Factors such as reading comprehension, illness or test anxiety) Criterion validity evidence – it tells us how well a test corresponds with a particular criterion. It is provided by high correlations between the test and a well-defined criterion. A criterion is the standard against which a test is compared. The reason for gathering criterion-validity evidence is that the test or measure is to serve as a “stand-in” for the measure we are really interested in. Predictive Validity Evidence – the forecasting function of the test. It is composed of a “predictor variable” (the test), and the criterion (any standard) Concurrent Validity Evidence – comes from assessments of the simultaneous relationship between the test and the criterion. The measures and criterion are taken at the same time. 
Another use of this approach occurs when a person does not know how he or she will respond to the criterion measure. Validity Coefficient – tells the extent to which the test is valid for making statements about the criterion. Validity coefficients in the range of .30 to .40 are commonly considered high. Construct-Related Evidence for Validity – established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it. This process is required when “no criterion or universe of content is accepted as entirely adequate to define the quality to be measured”. It is concerned with what a test ‘means’. D.T. Campbell and Fiske introduced the concepts of convergent and discriminant (divergent) validity. Convergent Validity – obtained when a measure correlates well with other tests believed to measure the same construct. Unlike criterion validity, which has a well-defined criterion, convergent validity has none. Because no well-defined criterion is involved, the meaning of the test comes to be defined by the variables it can be shown to be associated with. Discriminant or divergent validity – a demonstration of uniqueness.
To demonstrate this, a test should have low correlations with measures of unrelated constructs; this is evidence for what a test does not measure. All validation is one, and in a sense all is construct validation. Reliability and Validity Attempting to establish validity will be futile if the test is not reliable. The maximum validity coefficient between two variables is equal to the square root of the product of their reliabilities: $$r_{12\max} = \sqrt{r_{11}\,r_{22}}$$ where $r_{11}$ and $r_{22}$ are the reliabilities of the two variables. For example, reliabilities of .80 and .50 cap the validity coefficient at about .63. WRITING AND EVALUATING TEST Item writing Define clearly what you want to measure. Generate an item pool. Avoid exceptionally long items. Keep the level of reading difficulty appropriate for those who will complete the scale. Avoid double-barreled items that convey two or more ideas at the same time. Consider mixing positively and negatively worded items. Item Formats Dichotomous Format – offers two alternatives for each item. It is simple, easy to administer, and quick to score, but it requires absolute judgment from the test taker. Polytomous or Polychotomous Format – each item has more than two alternatives. Distractors – the incorrect choices in a test that uses the polytomous format. Correction for Guessing – the assumption is that all wrong answers are guessed wrong and that all correct answers are obtained by either knowledge or guessing: $$\text{corrected score} = R - \frac{W}{N - 1}$$ R – the number of correct responses W – the number of wrong responses N – the number of choices for each item Likert Format – uses response options such as strongly disagree, disagree, agree, and strongly agree. Category Format – similar to the Likert Format but uses an even greater number of choices (e.g., a 10-point rating system). Adjective Checklist – a subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself. Q-Sort Technique – a subject is given statements and asked to sort them into nine piles. Item Analysis – a general term for a set of methods used to evaluate test items.
It is one of the most important aspects of test construction. Item Difficulty – used for achievement or ability tests. It is defined as the proportion of people who get a particular item correct. An item that is answered correctly by 100% of the examinees offers little value because it does not discriminate among individuals. To find the optimal difficulty level for an item: add 1 to chance performance (which depends on the number of choices per item; with 4 choices per item, chance performance is ¼ = 0.25), then divide the sum by two. With four choices, (1.00 + 0.25) / 2 = 0.625. For most tests, item difficulties should range from 0.30 to 0.70. Item Discriminability – determines whether the people who have done well on a particular item have also done well on the whole test. The point biserial and extreme group methods are used to evaluate the discriminability of test items. Extreme Group Method This method compares people who have done well with those who have done poorly on a test. Discrimination index – the difference between the proportion of the high-scoring group and the proportion of the low-scoring group who answer the item correctly (for example, students in the top third versus students in the bottom third). Point Biserial Used for the correlation between one dichotomous variable and one continuous variable. Item Characteristic Curve – the total test score is plotted on the x-axis, and the proportion of examinees who get the item correct is plotted on the y-axis. Item response theory – according to these approaches, each item on a test has its own item characteristic curve that describes the probability of getting that particular item right or wrong given the ability level of each test taker. Total test score = internal criterion Criterion-referenced test – compares performance with some clearly defined criterion for learning “Testing” vs.
“Assessment” Psychological Assessment – the gathering and integration of psychology-related data for the purpose of making a psychological evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures. Psychological Testing – the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior. Approaches to Psychological Assessment Collaborative Psychological Assessment – the assessor and the assessee work as “partners” from initial contact through final feedback. Therapeutic Psychological Assessment – a variety of collaborative psychological assessment that includes an element of therapy as part of the process. Dynamic Assessment – refers to an interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention, and (3) evaluation. It provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation. The Tools of Psychological Assessment Test – may be simply defined as a measuring device or procedure. Psychological test – refers to a device or procedure designed to measure variables related to psychology (e.g. intelligence, personality, interests, attitudes, or values). Psychological tests may differ with respect to a number of variables, such as content, format, administration procedures, scoring and interpretation, and technical quality. Content – the subject matter of the test. For example, an intelligence test vs. a personality test OR a personality test based on psychoanalytic orientation vs.
another personality test based on a behavioral perspective (even though both measure the same construct, they differ because of their respective theoretical backgrounds). Format – pertains to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits. Format is also used to refer to the form in which a test is administered: computerized, paper-and-pencil, or some other form. Administration Procedures – tests can be administered on a one-to-one basis (which requires an active and knowledgeable test administrator) or by group (which may not even require the test administrator to be present while the testtakers independently complete the tasks). Scoring and interpretation Score – a code or summary statement that reflects an evaluation of performance on a test. Scoring – the process of assigning such evaluative codes or statements to performance on tests. In the world of psychological assessment, tests yield different types of scores. Some scores are just an accumulation of responses, while some are derived from more elaborate procedures. Tests differ widely in terms of guidelines for scoring and interpretation. Some tests are self-scored, some are scored by computer, and some are scored by a trained examiner. Most tests come with test manuals that are explicit about scoring and interpretation. Other tests, like the Rorschach Inkblot Test, are sold with no manual at all (the purchaser selects and uses one of many available guides for administration, scoring, and interpretation). Technical Quality – pertains to a test’s psychometric soundness. Psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure.
In addition, a test must possess some Psychometric Utility, or the usefulness or practical value that a test has for a particular purpose Interview – a method of gathering information through direct communication involving reciprocal exchange Interviews differ with regards to many variables, such as their purpose, length, and nature. Panel interview – involves multiple interviewers. Advantage – minimizes the idiosyncratic biases of a lone interviewer Disadvantage – costly; the use of multiple interviewers may not be even justified Portfolio – a sample of one’s ability and accomplishment. These are work products (ranging from art works to blueprints of a building) that may be used as a tool of evaluation Case History Data – refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information and other data and items relevant to the assessee. Examples: files from schools, hospitals, employers, religious institutions, and other agencies. Also includes letters, photos and family albums, newspapers clippings, movies, audiotapes etc. Case History or Case Study – a report or illustrative account concerning a person or an event that was compiled on the basis of case history data Behavioral Observations – defined as monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions Naturalistic Observation – observing behaviors in their natural setting, that is, setting in which behavior would typically be expected to occur Role-Play Tests – a tool of assessment wherein examinees are directed to act as if they were in a particular situation, and then be evaluated with regard to their expressed thoughts, behaviors, abilities etc. Who are the Parties in the assessment enterprise? 
The Test Developer – the ones who create, develop, publish, and distribute tests or other methods of assessment The Test User – tests are used by a wide range of professionals, including clinicians, counselors, school psychologists, human resource personnel, social psychologists etc. The Testtaker – anyone who is the subject of an assessment or an evaluation. Even a deceased individual can be considered an assessee Psychological Autopsy – a reconstruction of a deceased individual’s psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee or people who knew him or her In what types of settings are assessments conducted? Educational settings – tests such as achievement and diagnostic tests are administered in these settings Clinical settings – include public, private, or military hospitals, inpatient and outpatient clinics, and other institutions Counseling settings Geriatric settings Business and military settings PWDs Entails accommodation or alternate assessment Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs Alternate Assessment – an evaluative procedure or process that varies from the usual or standardized way a measurement is derived, either by virtue of some accommodation made to the assessee or by means of alternative methods designed to measure the same variable/s Reference Sources – sources for authoritative information about published tests and assessment-related issues Test Catalogues – most readily accessible reference source. Usually contain only a brief description of the test and seldom a detailed technical information about the test. Its main purpose is to sell the test Test Manuals – contains the detailed information concerning the development of a particular test and technical information relating to it. 
Reference Volumes – provides a “one-stop shopping” for a great deal of test-related information. Mental Measurement Yearbook (MMY) – compiled by Oscar Buros in 1933. It is an authoritative compilation of test reviews. Its latest edition was published in 2010 (18th edition MMY) Journal Articles – contain reviews of the test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either research or an applied context Online databases like PsycINFO, ClinPSYC, PsycSCAN, PsycARTICLES, HAPI and many more CHAPTER 2: Historical and Legal/Ethical Considerations Historical Perspective Early Antecedents It is believed that tests and testing programs first came into being in China as early as 2200 B.C.E. Testing was instituted as a means of selecting who, of many applicants, would obtain governmental jobs. Greco-Roman writings indicate some attempts (back then) to categorize people in terms of personality types The 19th Century Charles Darwin’s book On the Origin of Species by Means of Natural Selection was published in 1859 and spurred scientific interest in individual differences. Darwin’s writings on individual differences kindled interest in research on heredity by his half cousin, Sir Francis Galton Sir Francis Galton’s Contributions Aspired to classify people according to their “natural gifts” and to ascertain their deviation from the average Contributed to the development of contemporary tools of psychological assessment, including questionnaires, rating scales, and self-report inventories Pioneered the use of a statistical concept central to psychological experimentation and testing, the coefficient of correlation Displayed his Anthropometric Laboratory at an exhibition in London in 1884. A person could be measured in terms of height, arm span, weight, breathing capacity, color discrimination and many more. 
Wilhelm Wundt: The Father of Modern Psychology Established the first experimental psychology laboratory at the University of Leipzig Formulated a general description of human abilities with respect to variables such as reaction time, perception, and attention span Wundt focused on how people are similar, in contrast with Galton’s view He viewed individual differences as a frustrating source of error in experimentation Wundt’s students James McKeen Cattell Coined the term “Mental Test” in an 1890 publication Charles Spearman Founded the concept of reliability The one who builds the mathematical framework for the statistical technique of factor analysis Victor Henri Collaborated with Alfred Binet and suggested that mental tests could be used to measure higher mental processes Emil Kraepelin Early experimenter with the word association technique as a formal test Lightner Witmer Cited as the “little-known founder of clinical psychology” Founded the first psychological clinic at the University of Pennsylvania The 20th Century The measure of intelligence In 1905, Alfred Binet and collaborator Theodore Simon published a 30-item “measuring scale of intelligence” designed to help identify mentally retarded Paris schoolchildren In 1939, David Wechsler introduced a test designed to measure adult intelligence. Group Intelligence Testing Group intelligence tests came into being in the United States in response to the military’s need for an efficient method of screening the intellectual ability of WWI recruits. The Army Alpha and the Army Beta came into being The measurement of personality Woodworth’s Personal Data Sheet by Robert S. 
Woodworth A measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruit Answerable by “Yes” or “No” Woodworth’s Psychoneurotic Inventory Developed by Robert Woodworth after the war for civilian use It was based on his Personal Data Sheet During his time, it was the most widely used measure of personality The Rorschach Inkblot Test A projective test developed by Hermann Rorschach Relied on inkblots as its stimuli Henry Murray and Christiana Morgan Popularized the use of pictures as projective stimuli Created the Thematic Apperception Test Respondents are typically asked to tell a story about the pictures The responses are then analyzed in terms of what needs and motivations the respondents may be projecting onto the ambiguous pictures Legal and Ethical Considerations Ethics – a body of principles of right, proper, or good conduct Code of Professional Ethics – defines the “standard of care” expected of members of that profession Standard of Care – the level at which the average, reasonable, and prudent professional would provide diagnostic or therapeutic services under the same or similar conditions Psychological assessment has been affected in numerous and important ways by the activities of the legislative, executive, and judicial branches of governments. Below are some landmark legislation and litigation Litigation – court-mediated resolution of legal matters of a civil, criminal or administrative nature. Sometimes referred as “judge-made law” because it typically comes in the form of a ruling by a court Concerns of the Profession Test-User Qualification In 1950, a report called “Ethical Standards for the Distribution of Psychological Tests and Diagnostic Aids” was published by the APA. 
This report defined three (3) levels of tests in terms of the degree to which the test’s use required knowledge of testing and psychology Level A: Tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working (for instance, achievement or proficiency tests) Level B: Tests or aids that require some technical knowledge of test construction and use of supporting psychological and educational fields such as statistics, individual differences, psychology of adjustment, personnel psychology, and guidance (e.g., aptitude tests, and adjustment inventories applicable to normal populations) Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields together with supervised experience in the use of these devices (for instance, projective tests, individual mental tests) The Rights of Testtakers The right of Informed consent Testtakers have a right to know why they are being tested, how the data will be used, and what information will be released to whom. With full knowledge of such information, testtakers give their informed consent If testtakers is incapable of providing an informed consent to testing, such consent may be obtained from a parent or a legal representative (written, not in oral form) For the issue of deception, the general rule is: deception is allowed if the knowledge of the participant about the study’s hypothesis might irrevocably contaminate the test data. 
The Ethical Principles of Psychologists and Code of Conduct provides that psychologists: Do not use deception unless it is absolutely necessary. Do not use deception at all if it will cause participants emotional distress. Fully debrief the participants if deception occurred. The right to be informed of test findings Testtakers have the right to be informed, in language they can understand, of the nature of the findings with respect to a test they have taken. They are also entitled to know what recommendations are being made as a consequence of the test data. The right to privacy and confidentiality Privacy Right – this concept recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions. The information withheld is termed “Privileged”; it is the information that is protected by law from disclosure in a legal proceeding. Privilege is not absolute. The court can deem the disclosure of certain information necessary and can order the disclosure of that information. Confidentiality It may be distinguished from privilege in that “confidentiality concerns matter of communication outside the courtroom, privilege protects clients from disclosure in judicial proceedings”. In some rare cases the psychologist may be ethically compelled to disclose information if doing so will prevent harm either to the client or to some third party. Here, the preservation of life would be deemed an objective more important than the nonrevelation of privileged information (e.g., the Tarasoff case). Test users must take reasonable precautions to safeguard test controls. The right to the least stigmatizing label This can be best explained by the Iverson v.
Frandsen case A student was referred to as a “High Grade Moron” in a report sent to a public school by a school psychologist. CHAPTER 3: A Statistics Refresher Scales of Measurement Terms in measurement Measurement – the act of assigning numbers or symbols to characteristics of things according to rules. Scale – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned. Error – the collective influence of all of the factors on a test score or measurement beyond those specifically measured by the test or measurement. Types of scale based on the type of variable being measured Discrete scales – scales used to measure discrete variables (e.g., male or female). Continuous scales – scales used to measure continuous variables. Four major types/levels of scale Nominal Scales – the simplest form of measurement. These scales involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories. Ordinal Scales – like nominal scales, these permit classification; in addition, rank ordering on some characteristic is permissible with this scale. Alfred Binet strongly believed that the data derived from an intelligence test are ordinal in nature.
He emphasized that what he tried to do with his test was not to measure people but merely to classify and rank people on the basis of their performance on the tasks. Ordinal scales imply nothing about how much greater one ranking is than another because the numbers do not indicate units of measurement. These scales have no absolute zero point. Interval Scales Permit categorization and ranking, and contain equal intervals between numbers. Contain no absolute zero point. Ratio Scales In addition to all the properties of the first three scales, a ratio scale has a true zero point. Measurement scales in Psychology The ordinal level of measurement is most frequently used in psychology. As Kerlinger put it: “Intelligence, aptitude, and personality test scores are, basically and strictly speaking, ordinal”. Why would psychologists want to treat their assessment data as interval when those data would be better described as ordinal? The attraction of interval measurement for users of psychological tests is the flexibility with which such data can be manipulated statistically. Describing Data Frequency Distribution – all scores are listed alongside the number of times each score occurred. Often referred to as a simple frequency distribution to indicate that individual scores have been used and the data have not been grouped. Grouped Frequency Distribution – here, test-score intervals (aka class intervals) replace the actual test scores. Graphical Representations Histogram – a graph with vertical lines drawn at the true limits of each score (or class interval), forming a series of contiguous rectangles.
Test scores are placed along the X-axis, while the numbers indicative of frequency are placed along the Y-axis. Bar Graph – numbers indicative of frequency appear on the Y-axis, and reference to some categorization appears on the X-axis. Frequency Polygon – expressed by a continuous line connecting the points where test scores or class intervals meet frequencies (plotted on the same axes as a histogram). Measures of Central Tendency – a statistic that indicates the average or midmost score between the extreme scores in a distribution. The Arithmetic Mean – equal to the sum of the observations (test scores) divided by the number of observations. It is typically the most appropriate measure of central tendency for interval or ratio data when the distributions are believed to be approximately normal. The Median – defined as the middle score of the distribution. It is an appropriate measure of central tendency for ordinal, interval, and ratio data. It may be a particularly useful measure of central tendency in cases where relatively few scores fall at the high end of the distribution or relatively few scores fall at the low end of the distribution. The Mode – the most frequently occurring score. Appropriate for nominal data. Useful in analyses of a qualitative or verbal nature. Measures of Variability – an indication of how scores in a distribution are scattered or dispersed. Range – equal to the difference between the highest and the lowest score. Because the range is based entirely on the values of the lowest and highest scores, one extreme score can radically alter the value of the range. The interquartile and the semi-interquartile range Quartiles – the dividing points between the four quarters of the distribution (a quartile is a specific point, whereas a quarter refers to an interval). Interquartile range – a measure of variability equal to the difference between Q3 and Q1.
It is an ordinal statistic. Semi-interquartile range – equal to the interquartile range divided by 2. Average Deviation This is rarely used because the deletion of the algebraic signs renders it a useless measure for any further operations. Standard Deviation – equal to the square root of the average squared deviations around the mean. In short, it is the square root of the variance. To make meaningful interpretations, the test-score distribution should be approximately normal. It is a very useful measure of variation because each individual score’s distance from the mean of the distribution is factored into its computation. Skewness – the nature or extent to which symmetry is absent. It is an indication of how the measurements in a distribution are distributed. A distribution has a positive skew when relatively few of the scores fall at the high end of the distribution. Positively skewed examination results may indicate that the test was too difficult. More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores. In a positively skewed distribution, the distance between Q3 and Q2 is greater than the distance between Q2 and Q1. A distribution has a negative skew when relatively few of the scores fall at the low end of the distribution. Negatively skewed results may indicate that the test was too easy. In this case, more items of a higher difficulty level would make it possible to better discriminate between scores at the upper end of the distribution. In a negatively skewed distribution, the distance between Q3 and Q2 is less than the distance between Q2 and Q1. Kurtosis – refers to the steepness of a distribution in its center. Platykurtic – relatively flat distributions Leptokurtic – relatively peaked distributions Mesokurtic – somewhere in the

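The measures of central tendency and variability described in the statistics refresher above can be tried out on a small set of scores. The following is only an illustrative sketch, not part of the original notes: the function name `describe`, the sample scores, and the choice of the "inclusive" quartile convention are all assumptions made for the example (other quartile conventions give slightly different cut points). The standard deviation here follows the definition in the notes, the square root of the average squared deviation around the mean, so it is the population form. The skew label is only the rough quartile-distance check mentioned in the notes, not a formal skewness statistic.

```python
import statistics


def describe(scores):
    """Summarize a list of test scores with the statistics named in the notes.

    Hypothetical helper for illustration; requires Python 3.8 or later.
    """
    ordered = sorted(scores)

    # Quartiles are the cut points Q1, Q2 (the median), and Q3.
    # The "inclusive" method is one of several accepted conventions.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1

    # Rough skew check from the notes: compare Q3 - Q2 with Q2 - Q1.
    upper_gap, lower_gap = q3 - q2, q2 - q1
    if upper_gap > lower_gap:
        skew = "positive"
    elif upper_gap < lower_gap:
        skew = "negative"
    else:
        skew = "symmetric"

    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "iqr": iqr,
        "semi_iqr": iqr / 2,
        # Population standard deviation: sqrt of the average squared deviation.
        "sd": statistics.pstdev(ordered),
        "skew": skew,
    }


if __name__ == "__main__":
    # Made-up scores with one high outlier, so few scores fall at the high end.
    sample_scores = [1, 2, 3, 4, 5, 5, 8, 10, 16]
    for name, value in describe(sample_scores).items():
        print(f"{name}: {value}")
```

With the made-up scores above, the single high score of 16 pulls the mean above the median and stretches the range, while the median and the interquartile range are unaffected by how extreme that one score is. This is the contrast the notes draw between the range and the quartile-based measures.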