2080 Introduction to Test and Measurement PDF

Summary

This document provides an introduction to test and measurement, covering measurement, psychometrics, data science, and types of tests (individual and group tests, achievement tests, aptitude tests, intelligence tests, and personality tests). It also presents a brief history of measurement, along with an overview of psychophysics and early intelligence tests. The focus is on foundational knowledge for understanding these concepts.

Full Transcript


2080 Introduction to Test and Measurement

Week 1

Measurement is the act of identifying properties of an object.
- The advancement of psychometrics (psychological measurement) led to advances in related disciplines such as statistics and biometrics (biostatistics).
- Psychometrics led the way for what we now call data science, where information on millions of humans was collected and analyzed to make national policy decisions (military recruitment, assignment, etc.).

How do we know a test is good?
- Does the test measure what I think it is measuring?
- Does the test prioritize certain people?
- Does the test work well each year?
- Does the test, on average, produce the same amount of error?

Observable Behaviours, Unobservable Attributes
- We seek to identify the properties of a psychological object (called a latent construct).
- Latent - cannot observe it. Construct - a domain of behaviours.
- Example: Educational Achievement. How should we assess it? The process is carried out by a measuring device called a test.

Psychological and Behavioural Testing
- A great deal of resources has been put into the creation of tests. Many research projects produce new tests or modifications of previous tests.
- Tests are commonly used to make critical decisions. Examples: entrance to medical school (MCAT); entrance to American colleges (SAT).

Types of Tests
- Individual tests - given to one person at a time.
- Group tests - given to a group of people simultaneously.
- Achievement tests - assess prior learning.
- Aptitude tests - assess potential to learn.
- Intelligence tests - assess the ability to solve problems and think abstractly.

Personality Tests
- Seek to assess overt and covert dispositions.
- Structured - accept or reject statements about one's self; usually self-report.
- Projective - reactions to ambiguous stimuli are recorded and interpreted. They aim to reveal aspects of the unconscious mind and assume responses reflect individual characteristics.

Tests
- A test is a technique to quantify a behaviour. A test is a systematic procedure for comparing the behaviour of two people (Cronbach).
- Tests are composed of items. An item is a specific stimulus that produces a response. Multiple items are often required to properly assess the construct.
- The science of evaluating the characteristics of psychological tests is known as psychometrics.

Criterion vs Norm-Referenced
- Criterion-referenced tests - decisions are made in comparison to a cut-off score.
- Norm-referenced tests - a score is compared to some reference sample (the expected score from the population). The test taker needs to be part of the reference population!
- There is lots of overlap between the two types of tests.

A Brief History of Measurement
- Tests have been recorded in use from as early as 4,000 years ago in China for the selection of talent.
- Earlier than 500 BCE, Confucius argued people were different from each other: "their nature might be similar, but behaviours are far apart". Mencius (372-289 BCE) believed differences among individuals were measurable.
- Some evidence suggests the Xia Dynasty (2070-1600 BCE) used competitions of strength to select officers; by the Zhou Dynasty (1046-256 BCE), tests had expanded to include judgements of good conduct and manners.

Darwinian Roots
- Charles Darwin, in The Origin of Species (1859), describes individual differences in animals.
- Sir Francis Galton (1883) suggests there are differences between individuals and argues some traits make some people more fit to survive than others.
Galton had a fascination with measuring obscure concepts, such as the efficacy of prayer.

Galton
- Arguably the first to have a large focus on mental differences between individuals. Believed anything is measurable.
- Came up with the term psychometry, later evolving into psychometrics. Credited as the father of psychometrics.
- An early adopter of applying statistics to psychological measurement. Most of Galton's measures were unsuccessful!
- Galton's work would be expanded on by James Cattell (1890), leading to the first modern test.

German Psychophysics
- Concerned with determining thresholds for observing changes in physical stimuli. Example: how much more intense does a light bulb need to be in order for the human eye to detect the change?
- Herbart, Weber, Fechner, and Wundt are the key names of this period. Wundt is credited as the father of psychology as a science!
- Had a large impact on the creation of behavioural tests (such as our working-memory example) and led to the important psychometrician L. L. Thurstone.

Early Intelligence and Standardized Tests
- 1905 - The Binet-Simon Scale, the first intelligence test to create a standardized sample (of 50 children), which was called a standard condition.
- The scale was repeatedly revised over the years. By 1908 the number of items had doubled and the standard sample was now 200 children! The test compared mental age with chronological age.
- In 1916, another major revision (renamed the Stanford-Binet Intelligence Scale) changed many items and increased the standard sample to 1,000 individuals.

Personality Tests: 1920-1940s
- Around World War II, personality tests began to be implemented to assess stable traits. Traits are relatively enduring dispositions that distinguish one person from another.
- These tests drew a great number of criticisms. After World War II came the creation of projective tests such as the Rorschach Test.

Return of the Test: 1980s
- In the 1980s, many fields of psychology began to emerge: neuropsychology, health psychology, forensic psychology, child psychology.
- These emergent fields brought excitement back into the creation and assessment of tests as new perspectives entered the field. Expansions in the statistical sciences continued to push measurement theory forward.
- Testing continues today to be a major part of psychology and of other fields such as organizational sciences, health sciences, and education.

Variable Types
- The most common definition is credited to Stevens (1946). Four major variable types: Ratio, Interval, Ordinal, Nominal.
- Differences are based on four principles: identity, order, equal intervals / the property of quantity, and an absolute zero.

Ratio Data
- Continuous; can rank the values; zero is meaningful.
- What does this mean? We can accurately describe differences between values. We can apply addition, subtraction, multiplication, and division to the values without losing interpretation.
- Examples: personal income, age, time.

Interval Data
- Continuous; can rank the values; zero is not meaningful.
- What does this mean? We can apply addition and subtraction only. We cannot multiply or divide, as the interpretation would be meaningless.
- Example: Fahrenheit temperature.

Discrete Data
- Can be thought of as a subtype of continuous data. Values cannot be fractions - whole numbers only.
- The full continuum is not measurable, and interpretation should keep this in mind.
- Examples: number of children, count data.

Ordinal Data
- Categorical; can rank/order the categories; zero is not meaningful.
- What does this mean? We cannot usually accurately determine the distance between categories.
- We should be cautious applying mathematical operations to the data.
- Examples: income categories, Likert scales.

Nominal Data
- Categorical; cannot rank/order the categories; zero is not meaningful.
- What does this mean? We should not apply mathematical operations to the data.
- Examples: ethnicity, marital status, province of residence.

Units of Analysis
- The property of quantity / equal intervals dictates that the unit of measure be clearly defined. This way we know a 10-unit object is exactly 5 units larger than a 5-unit object (as 5 + 5 = 10).
- How do we know the precise unit of measurement for a psychological object? We say that for a psychological measure the unit is arbitrary in size but is linked to the construct in a non-arbitrary way, with an assumed equal distance.

Additivity
- A second critical assumption is that the unit of measure does not change in size while the units are being counted. We cannot know this for constructs, so we must assume it.
- Example: Knowledge of Psychometrics. Question: "Who is the father of psychometrics?" Question: "When should one use an orthogonal rotation over a promax rotation when constructing a multidimensional scale?" (Two items from the same domain can demand very different amounts of knowledge, so one "unit" of correct answering is not constant.)

Frequency Tables
- Single-variable (univariate) tables: the left side (rows) indicates the categories of the nominal or interval variable; the columns indicate various types of information: frequency (count), relative frequency (percentage), and cumulative frequency (the sum of the frequencies of categories up to this point).
- Two or more variables (crosstabulations): both column and row represent variables. Each cell holds the requested statistic or, in some programs, multiple statistics.

Relative Frequency
- Relative frequency = (number of times the value occurs) / (number of observations in the dataset).
- Suppose we have 200 observations on x = number of prior arrests. If 70 of these x values are 1, then the relative frequency of (x = 1) = 70 / 200 = 0.35. We can transform this into a percentage: 0.35 × 100 = 35%.
- Another name for relative frequency is proportion (p).

Measures of Central Tendency
- Statistics used to understand the average of a population. The three most common: mean (arithmetic mean), median, mode.
- The mean is an average produced from all the values in the sample. The median is another type of average, representing the exact middle point of the sample.

Dispersion
- Dispersion is the variation among values. The greater the dispersion, the greater the range of scores and the more differences that could exist between scores.
- If dispersion is very high, central tendency measures become less meaningful for describing the distribution, as they may not capture it precisely enough.

Range
- The simplest measure of dispersion. Range = maximum value - minimum value.

Deviation
- A measure of distance from the mean. Deviation = value - mean. Not particularly useful on its own.
- Variance is our most commonly reported measure of dispersion. It is more complicated to calculate than the deviation, but it characterizes deviations above and below the mean, fully capturing how much variation exists in the data.
- We take the square root of the variance to get a second measure, the standard deviation.
- As a whole, the variance (or standard deviation), along with the mean, are the two most commonly used estimates in descriptive measurement statistics.

Distributions
- Distributions have a unique appearance based upon the relationships between the central tendencies and the measures of dispersion.
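To make the formulas above concrete, here is a minimal sketch in Python. It reuses the running arrest-count example; the values other than the seventy 1s are invented for illustration.

```python
# Descriptive statistics for the running example: x = number of prior arrests.
# Toy data: 200 observations, 70 of which equal 1 (the other values are made up).
from collections import Counter

x = [1] * 70 + [0] * 90 + [2] * 30 + [3] * 10  # 200 observations total

# Relative frequency (proportion) of each value
n = len(x)
counts = Counter(x)
rel_freq = {value: count / n for value, count in counts.items()}
print(rel_freq[1])  # 70 / 200 = 0.35, i.e. 35%

# Central tendency: the arithmetic mean
mean = sum(x) / n

# Dispersion: deviations, variance, and standard deviation
deviations = [value - mean for value in x]
variance = sum(d ** 2 for d in deviations) / n  # population variance
std_dev = variance ** 0.5

# Range: maximum value minus minimum value
value_range = max(x) - min(x)
print(mean, variance, std_dev, value_range)
```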
Standard Normal
- A special form of the normal distribution where the mean is 0 and the standard deviation is 1.

Properties of Correlation (r)
1. The value of r does not depend on which variable we designate X or Y.
2. The value of r is independent of the scales of X and Y.
3. -1 ≤ r ≤ +1.

Questionnaire Length
- Time limits can impact reliability, and in non-restricted surveys, too long a time to completion will reduce reliability. A rule of thumb is that a 20-item questionnaire should take no longer than 6 minutes to complete.
- You need to pilot test the questionnaire before obtaining a final version. Expect to include at least 50% more questions in the pilot version than in the final version; the pilot version will therefore be longer.
- Example: "The GRIMS was intended as a short questionnaire for use with both distressed and non-distressed couples. As we hoped to achieve a final scale of around 30 items, we planned a pilot version with 100 items".

Number of Items
- Once you have assigned weights to the cells of your blueprint and decided on the total number of items you require for the pilot questionnaire, you can work out how many items to write for each cell: multiply the percentages in a row/column by the total number of items (see the worked sketch at the end of this section).

3) Select Population of Interest
- Identify the individuals who should be targeted by the questionnaire and on whom you hope to gather information. This may be straightforward.
- Sampling - the selection of elements, following prescribed rules, from a defined population.
- Population - the collection of elements sharing a defining characteristic. In testing, the elements are our test takers.

Sampling
Things to keep in mind:
1) Who should the sample consist of?
2) How credible is this group as a representative of the population of interest?
3) What obstacles may we encounter when obtaining our sample?
4) How can we address or avoid the aforementioned obstacles?

Sampling Methods
- Sampling can be either probabilistic or non-probabilistic.
- Non-probabilistic sampling - individuals are selected based on some criterion; there is no defined probability of selecting a person. Example: student volunteers agreeing to take a test.
- Probabilistic sampling - each person has a nonzero chance of being selected and the selection process is random.
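Returning to the Number of Items step above, here is a minimal sketch of the blueprint arithmetic. The content areas, weights, and item counts are hypothetical, not from the course.

```python
# Allocating pilot items to blueprint cells: multiply each cell's weight
# (its percentage of the questionnaire) by the total number of pilot items.
# The content areas and weights below are invented for illustration.
blueprint_weights = {
    "content area A": 0.40,
    "content area B": 0.35,
    "content area C": 0.25,
}
total_pilot_items = 45  # e.g. 50% more than a planned 30-item final version

items_per_cell = {
    area: round(weight * total_pilot_items)
    for area, weight in blueprint_weights.items()
}
print(items_per_cell)  # {'content area A': 18, 'content area B': 16, 'content area C': 11}
```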
4) Item Construction
- A test is the sum of its parts - so good items make good tests! There are many decisions to make along the way in item construction.
- A major decision is the choice of item format, also called the scale of the item / item scale. Different scales have different purposes and different pros and cons.

Item Formats
- The format is also called the scale of the item. The process of transforming and modifying the mathematical properties of an item is called scaling.

Alternate-Choice / Dichotomous
- Only two response options: True / False.
- Most commonly used in knowledge-based questions for assessment. Can also be used in personality questionnaires, as in the item: "I never keep a lucky charm".
- Advantages: simplicity, ease of administration, ease of scoring.
- Requires absolute judgement - respondents must indicate with certainty that the item is 100% true or 100% false. There is no middle of the fence.
- Dichotomous items have some disadvantages: the correct response tends to be easier to select; nuance is missed by requiring only absolute answers; and there is an increased probability of selecting the correct answer by chance.
- To increase the reliability of a test, you tend to need many more dichotomous items than items of other formats.

Multiple-Choice / Polytomous
- Polytomous items have more than two response options. Example: multiple-choice tests.
- A multiple-choice item consists of two parts: 1) the stem - a statement or question that contains the problem; and 2) the options - a list of possible responses, of which one is correct or best and the others are distractors.
- Often four or five response options are used to reduce the probability of guessing the answer. This gives a good balance of ease of administration and scoring without too much risk of selection by chance or issues of reliability.
- Wrong answers on a polytomous item are called distractors. The choice of distractors is the largest difficulty with polytomous items. The effects of guessing are reduced by the inclusion of additional distractors.
- The disadvantages of multiple choice are the time and skill required to write good items. A common problem is that not all response options are equally good distractors; often one or more distractors are so unlikely to be correct that they are never treated as possible answers.

Distractors
- Psychometric theory suggests that increasing the number of distractors will increase item reliability, which in turn increases test reliability. This only works if perfect distractors are created - not typically the case in practice. On average, an item tends to have only 2 good distractors.
- Bad distractors reduce item reliability and, in turn, test reliability.
- Rodriguez (2005) showed, in an analysis of 80 years of research, that four response options (3 distractors) is optimal for reliability in practice.

Correction for Chance
- We can correct a score for guessing. Example: someone gets 14 items correct on a 20-item test with four response options; the 6 wrong answers yield a penalty of 6/3 = 2, so the corrected score is 14 - 2 = 12.
- The chance-corrected score penalizes test takers by assuming some items they got correct were answered correctly by chance. The assumption is that all response options have an equal chance of selection.
- If a test uses chance-corrected scoring, the test taker should almost never guess an answer. The formula does not include omitted answers: if instead the person above had answered only 3 questions incorrectly and left 3 questions blank, the chance-corrected score would be 14 - (3/3) = 14 - 1 = 13.
- Statistically speaking, the only time it makes sense to guess on a chance-corrected test is when you have narrowed the response options down to 2. (A sketch of this scoring rule appears at the end of this section.)

Rating-Scale Items
- Similar to polytomous items in having multiple responses, but here the responses lie on an ordinal scale along a continuum. It is common to use 7 or fewer points.
- Example: "I am not a superstitious person": A. Strongly Disagree B. Disagree C. Agree D. Strongly Agree

Likert Scales
- The previous example was a special rating scale called a Likert scale. Likert scales measure the degree of agreement a person has with a statement.
- It is common to see Likert-like items on a similar scale, such as: Not at All, Not Often, Neutral, Often, Very Often.
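Here is a minimal sketch of the chance-corrected scoring rule discussed above, reproducing the two worked examples. The function and variable names are mine.

```python
# Correction for chance: corrected = right - wrong / (k - 1),
# where k is the number of response options. Omitted items are not counted.
def chance_corrected_score(n_right: int, n_wrong: int, n_options: int) -> float:
    """Penalize correct answers assumed to have occurred by guessing."""
    return n_right - n_wrong / (n_options - 1)

# 20-item test, four options, 14 right and 6 wrong -> 14 - 6/3 = 12
print(chance_corrected_score(14, 6, 4))   # 12.0

# Same 14 right, but 3 wrong and 3 omitted -> 14 - 3/3 = 13
print(chance_corrected_score(14, 3, 4))   # 13.0
```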
Likert Items and Correlations
- Likert items (and rating-scale items in general) are ordinal. As such, the use of a Pearson correlation (intended for continuous variables) is inappropriate and will underestimate the correlation.
- Two options: the Spearman rank correlation or the polychoric correlation. (A comparison sketch appears at the end of this section.)

Rating-Scale Items
- The advantage of rating-scale items is that you can capture a wider range and a more precise measurement of where a person falls on the content continuum.
- The disadvantages tend to involve response behaviours: interpretational differences in words such as "frequent" or "often"; some individuals always selecting specific responses such as "Strongly Agree"; and some individuals always selecting the neutral option. A lot of effort goes into dealing with response behaviours.
- The choice of wording reflects the domain of interest. A personality or mood questionnaire might require responses in terms of the options "not at all", "somewhat", and "very much". Attitude questionnaires generally consist of statements about an attitude object followed by the options "strongly agree", "agree", "uncertain", "disagree", "strongly disagree". For clinical-symptom questionnaires you might find options relating to frequency of occurrence, such as "always", "sometimes", "occasionally", "hardly ever", and "never".

Rating-Scale Items: GRIMS Example
- "Rating-scale items are the most appropriate scale of relationship state. The GRIMS items are presented as statements to which the respondents are asked to respond with 'strongly agree', 'agree', 'disagree', or 'strongly disagree'. This spread of options allows strength of feeling to affect scores. The items are forced choice, i.e. there is no 'don't know' category."

Categorical Formatted Items
- A special type of rating-scale format for larger continuums (typically 7-10 points). Often not all points on the continuum are labelled. Example: "On a scale from 1-10, how hungry are you?" These items tend to be useful for discriminating more finely between individuals.
- Research suggests that the reliability and validity of the test will increase if each value on a categorical formatted item is labelled, rather than only the end points. This is very hard to do with 10+ response values. If only the end points are labelled, they need to be well defined to help respondents select the appropriate spot on the continuum.
- Select the number of points that will let you discriminate between individuals as much as needed; if you do not need fine discrimination, fewer response options are better. Psychometric theory indicates that fewer than four response options will reduce the reliability of the item, and seven response options is the point where gains in reliability begin to diminish.
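A minimal sketch contrasting Pearson and Spearman correlations on ordinal Likert responses, assuming scipy is available. The responses are invented; a polychoric correlation needs a dedicated package and is not shown.

```python
# Likert items are ordinal, so a rank-based correlation such as Spearman's
# is preferred over Pearson's r.
from scipy.stats import pearsonr, spearmanr

# Hypothetical responses from ten people to two 4-point Likert items
item_a = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
item_b = [1, 1, 2, 2, 3, 4, 3, 4, 4, 4]

r_pearson, _ = pearsonr(item_a, item_b)
r_spearman, _ = spearmanr(item_a, item_b)
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {r_spearman:.2f}")
```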
Adjective Checklist
- An alternative to dichotomous items is the adjective checklist, historically common in personality measurement.
- A list of adjectives is provided, and the respondent is asked to circle the adjectives that best describe themselves. This is equivalent to asking a dichotomous question for each adjective.
- We take all the adjectives and look for patterns in the responses: is there a pattern that suggests a certain prototypical personality?

Q-Sort
- Adjective checklists are not used often anymore due to their simplicity and increased risk of error (what if a respondent by chance simply misses reading one of the adjectives?).
- An alternative is the Q-sort. In a Q-sort, choices are listed on cards and individuals place the cards in piles. There can be anywhere from 2-10 piles (there are no rules). Piles are based on the degree to which the respondent agrees with the word on the card: does this card reflect me, and by how much?
- The frequency of placed cards and the types of cards then paint a picture of the person. Usually the end piles (the extremes) are the most interesting!

All Questionnaires
1) Match your blueprint!
2) Write all items clearly and simply.
3) Avoid irrelevant material and keep the options short.
4) Each item should ask only one question or make only one statement. This avoids what are called double-barrelled items.
5) Generate a pool of items so you have multiple choices to pull from.
6) Avoid statements written in the past tense.
7) Avoid words that are absolutes, such as only, just, always, or none.
8) Where possible, avoid subjective words such as "frequently", as these may be interpreted differently by different respondents.
9) It is important that all options function as feasible responses - i.e., that none be clearly wrong or irrelevant and therefore unlikely to be chosen.
10) Avoid statements that would be selected by everyone.
11) Keep the reading level of the test appropriate for the individuals who will be taking it.
12) Items should be sensitive to ethnic and cultural differences, and items should be updated over time to ensure they reflect changes in the intended population.
13) Correct spelling and grammar are essential!

Person-Based Questionnaires
- Person-based questionnaires need to take different responding behaviours into account when creating items.
- Acquiescence - the tendency to agree with items regardless of their content. It is usually necessary to reverse some of the items; for example, the item "I am satisfied with our relationship" can be reversed to "I am dissatisfied with our relationship".
- Social desirability - the tendency to respond to an item in a socially acceptable manner.
- Indecisiveness - the tendency to use the "don't know" or "uncertain" response option.

Item Order
A few common strategies:
1) For knowledge-based questionnaires, order the items in increasing difficulty.
2) Order the items based on content areas.
3) Randomly place items, making adjustments to ensure a given content area does not occur too often (subjective).

Background Information
- Include headings and give sufficient space for respondents to fill out their name, age, gender, or whatever other background information you require.
- It is strongly recommended that you include the date on which the questionnaire is completed, especially if it is to be administered again.
Instructions
- The instructions must be clear and unambiguous. They should tell the respondent how to choose a response and how to indicate the chosen response in the questionnaire.
- Information that is likely to increase compliance, such as information on confidentiality, should be stressed.

Scoring
- To score the questionnaire, you will need to allocate a score to each response option.
- For knowledge-based questionnaires, it is common to give the correct or best option for each item a score of 1 and the distractor options a score of 0. The higher the total score, the better the performance.
- For person-based questionnaires, scores should be allocated to response options along a continuous scale. Commonly, the larger the number, the more positive or frequent the response option. For example: always = 5, usually = 4, occasionally = 3, hardly ever = 2, never = 1.

Reverse Scaling / Coding
- To add items together, they must be in the same numeric direction: positive or negative.
- For example, if one item is "In the last week, how often were you happy?" and another is "How often do you have feelings of sadness?", then a response of Not at All on one item is equivalent to Very Often on the other. (See the sketch at the end of this section.)

Week 3: Selection of Items

Blueprint
- Through all of this we form a blueprint and plan exactly how the items and the shape of the questionnaire will look and work. During this step we choose a response format and assemble an initial item pool.
- Guided by considerations from the first step, researchers write or seek out items that seem psychologically relevant to the intended construct.

Pilot Testing
- Items should be administered to respondents representing the likely target population, in a manner reflecting the likely administration context.
- First, pilot testing can reveal obvious problems through respondent feedback or observation. For example, respondents might require more time, or they might express confusion or frustration, prompting revisions.
- Second, this step produces data for the next step of scale construction: evaluation of the item pool's psychometric properties and quality.

Item Writing
- One important recommendation is that items should be written with clarity. The most important task is generating items that are easily understood by potential respondents.
- Items and instructions that are relatively clear and simple are likely to be understood by respondents and to require little cognitive effort, enhancing respondents' ability and motivation to provide psychologically meaningful responses.
- Tips: avoid complex words; avoid double-negative wording; avoid double-barrelled items.

Item Analysis
- Item analysis of the data collected in a pilot study, used to select the best items for the final version of your questionnaire, involves examining the facility and the discrimination of each item. For knowledge-based multiple-choice items, it is also important to look at the distractors.
- Other considerations will also be discussed, such as readability, dimensionality, and size of scale.

Item-Analysis Table
- The first step is to create an item-analysis table with each column (a, b, c, d, etc.) representing an item and each row (1, 2, 3, 4, 5) representing a respondent.
- For knowledge-based items, insert a 1 in each cell for which the respondent gave the correct answer and a 0 for each incorrect answer.
- Then add up the scores to give a total score for each row (i.e., for each respondent) and for each column (i.e., the total score for each item).
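A minimal sketch of reverse coding and of the item-analysis table just described, assuming numpy is available. The data and names are illustrative.

```python
import numpy as np

# Reverse coding: on a scale from lo to hi, reversed = (lo + hi) - score.
def reverse_code(score: int, lo: int = 1, hi: int = 5) -> int:
    """Flip a rating so all items point in the same numeric direction."""
    return (lo + hi) - score

print(reverse_code(5))  # a 5 ("Very Often") becomes a 1 ("Not at All")

# Item-analysis table for knowledge-based items: rows = respondents,
# columns = items, cells = 1 (correct) or 0 (incorrect).
table = np.array([
    [1, 0, 1, 1],   # respondent 1
    [1, 1, 1, 0],   # respondent 2
    [0, 0, 1, 1],   # respondent 3
    [1, 1, 1, 1],   # respondent 4
    [1, 0, 0, 0],   # respondent 5
])
respondent_totals = table.sum(axis=1)  # total score per row (person)
item_totals = table.sum(axis=0)        # total score per column (item)
print(respondent_totals, item_totals)
```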
Facility / Difficulty
- Most questionnaires are designed to differentiate between respondents according to whatever knowledge or characteristic is being measured. A good item, therefore, is one for which different respondents give different responses.
- The facility index gives an indication of the extent to which all respondents answer an item in the same way. If they do, the item is redundant and it is important to get rid of it.

Facility Index
- For person-based questionnaires, the facility index reflects the notion that there is some unknown latent threshold a person needs to pass before they will select a specific response.
- For knowledge-based questionnaires, the facility index is calculated by dividing the number of respondents who obtain the correct response to an item by the total number of respondents. For example, if 100 people write a test and 70 obtain the correct response to a specific item, the facility index is 70/100 = 0.70.
- Ideally, the facility index for each item should lie between 0.25 and 0.75, averaging 0.5 across the entire questionnaire. A facility index below 0.25 indicates the item is too difficult, as very few respondents obtain the correct response; above 0.75, the item is too easy, as most respondents obtain the correct response. The facility index should ideally be higher than chance.

Optimal Difficulty
- Formula: Optimal Difficulty = (1 + Chance) / 2, i.e., halfway between the chance level and a perfect score.
- Example: a True-False item (2 options, chance = 0.5): Optimal Difficulty = (1 + 0.5) / 2 = 1.5 / 2 = 0.75.

Facility Index - Person-Based
- For a person-based questionnaire, items likely have values greater than 1. For example, if the response options for each item are strongly agree, agree, disagree, and strongly disagree, the item values may be 1, 2, 3, or 4.
- Insert the actual score for each item into the item-analysis table, and remember to reverse-score any opposite-direction items.
- The facility index for a person-based item is calculated by summing the scores for the item across respondents, then dividing this total by the number of respondents.
- Always check the distribution of responses. If the facility index is 2 on a four-option item because everyone selected option 2, the item is not useful, as everyone is selecting the same choice - the same issue as everyone selecting an extreme.
- Caveat: items designed to diagnose certain groups may be expected to score in ways that disagree with these guidelines.

Discrimination
- Discrimination is the ability of each item to discriminate among respondents according to whatever the questionnaire is measuring. Items should be selected if they measure the same knowledge or characteristic as the other items in the questionnaire.
- Discrimination is measured by correlating each item with the total score obtained by summing all the other items in the questionnaire, excluding the item in question. Some people include the item in the total score to ease calculation, but this only works if the item makes up a very small proportion of the total score; including the item in the correlation will overinflate the relationship between the item and the total score.
- The correct choice of correlation is a biserial or point-biserial correlation, as the items are either dichotomous or ordinal. (A sketch follows.)
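A minimal sketch computing the facility index and a corrected item-total discrimination for the knowledge-based case, reusing the 0/1 item-analysis table idea from above. With 0/1 items, numpy's Pearson correlation equals the point-biserial; a dedicated biserial routine is not shown.

```python
import numpy as np

# Rows = respondents, columns = items; 1 = correct, 0 = incorrect.
table = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
])

# Facility index per item: proportion of respondents answering correctly.
facility = table.mean(axis=0)
print(facility)  # flag items outside the 0.25-0.75 band

# Corrected item-total discrimination: correlate each item with the total
# of all *other* items, excluding the item itself to avoid overinflation.
n_items = table.shape[1]
for j in range(n_items):
    rest_total = table.sum(axis=1) - table[:, j]
    r = np.corrcoef(table[:, j], rest_total)[0, 1]
    print(f"item {j}: corrected item-total r = {r:.2f}")
```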
Discrimination
- Generally, a minimum correlation of 0.20 is considered acceptable; some argue the value should be 0.30 or above.
- Items with negative or zero correlations are always excluded, as they indicate the item either goes in the opposite direction or is unrelated to the other items.

Item Characteristic Curves
- A visual approach to examining discrimination. In an item characteristic curve, the total score is broken into a number of bins. The number of bins should represent the groups we want to discriminate between.

ICC Matrix
- A tool for plotting both difficulty and discrimination across many items. Lines can be placed at the recommended cut-offs to identify which items are deemed good vs bad.

Distractors
- It is valuable to look at the use of distractor options by respondents who do not choose the correct or best option, to ensure that each distractor is endorsed by a similar proportion of respondents. Each distractor should have an equal probability of being selected.
- Items whose distractor options are not chosen in roughly equal proportions are considered to be functioning improperly and should be considered for exclusion from the final questionnaire.

Modification
- If you identify poorly functioning items, you may decide to modify an item's wording or response options to try to "fix" it. The problem is that you typically cannot test whether your adjustment has fixed the item without conducting a new pilot study.
- Hence, plan to use 50%+ more items than you will need for the final survey, as you will throw many items out.

Total Items
- When deciding which items to include in the final version of the questionnaire, you will have to take many factors into account and balance them against each other.
- In addition to facility, discrimination, distractors, and so on, you will need to consider the number of items you require for the final version, which again is typically between 12-30 items for many questionnaires.

Reliability and Validity: A Primer
- Reliability is an estimate of the accuracy of a questionnaire (the most basic definition).

Cronbach's Alpha and Split-Half Reliability
- The first method is to calculate a statistic called Cronbach's alpha, a measure of the internal consistency of the questionnaire.
- The second method is the split-half reliability. Here, the questionnaire is divided into two halves - typically into even- and odd-numbered items, though any random split can be used - and the correlation between the halves is calculated. The argument is that if the two halves produce a strong positive correlation, that is evidence of reliability. (A sketch of both methods appears below.)

Reliability and the Pilot Sample
- The pilot sample is crucial in the estimation of reliability. If the pilot sample is small, a few individuals may respond in non-normal ways, and their responding behaviour may influence the item correlations, item scores, and in turn the reliability.
- The rule of thumb is that the greater the number of respondents, the better the estimate of reliability. At least 50 respondents are recommended for a pilot study; some recommend 200+ respondents for better estimates.
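A minimal sketch of both reliability estimates, assuming numpy and a respondents-by-items score matrix. The alpha formula is the standard one; the data and names are invented.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency: alpha = k/(k-1) * (1 - sum(item var) / var(total))."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def split_half_r(scores: np.ndarray) -> float:
    """Correlate the totals of the odd- and even-numbered items."""
    odd_total = scores[:, 0::2].sum(axis=1)
    even_total = scores[:, 1::2].sum(axis=1)
    return np.corrcoef(odd_total, even_total)[0, 1]

# Hypothetical pilot data: 6 respondents x 4 Likert-type items
scores = np.array([
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 3],
    [4, 4, 5, 4],
])
print(cronbach_alpha(scores), split_half_r(scores))
```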
Reliability and Item Analysis
- Some argue that the sample used for item analysis should be different from the sample used for estimating reliability: because items are selected for correlating highly within that same sample, reliability estimated on it will be overestimated.
- One option is to collect a larger pilot sample and split it (70:30 or 50:50), using one split for item analysis and the remainder for reliability estimation.
- Generally, you want a minimum reliability of 0.70 for person-based and 0.80 for knowledge-based questionnaires.

Validity
- Face validity - not a true type of validity, though some argue it is important to consider if you want your questionnaire to have good uptake. Face validity describes the appearance of the questionnaire to respondents: does the questionnaire look as if it measures what it claims to measure?
- Content validity - the relationship between the content and the purpose of the questionnaire. Is there a good match between the test specification and the items?

Standardization and Scaling
- Standardization involves obtaining scores on the final version of your questionnaire from appropriate groups of respondents. These scores are called norms.
- Large numbers of respondents must be carefully selected for the standardization group according to clearly specified criteria for the norms to be meaningful. Norms can be obtained from pilot-study data, but this is generally criticized.

Representativeness
- It is important to include as many respondents as possible in the standardization group and to ensure they are truly representative. A minimum of several hundred is generally required, but again this depends on the population targeted.

Reporting
- Once you have obtained your norm sample, provide information about the norms and sample characteristics as evidence of representativeness.
- You generally want to provide the mean test score and standard deviation, stratified by any meaningful groups, such as gender. The mean score for the standardization group is the average of the scores of the respondents in that group; the standard deviation is a measure of the amount of variation in the group.

Standard Scores
- Once you know the mean and standard deviation of the standardization group, you can calculate, for each person, how many standard deviations their score differs from the mean. This figure ranges between about -3.00 and 3.00 and is called the standard score, or z-score.
- One advantage of the z-score is that anyone who understands how it is calculated can immediately interpret someone's standard score in terms of how they compare with everyone in the standardization sample. A z-score of 0 is right at the average; a z-score of 1.00 is one standard deviation above the mean; a z-score of -1.50 is one and a half standard deviations below the mean.

Standardized Scores
- Assume individuals do not know about standard scores or what a standard deviation is. It is common to transform the score to make it more interpretable. These are called standardized scores. Note: these are different from standard scores.

T-Scores
- A commonly used standardized score is the T-score. Note: this is different from the t-score obtained during a t-test in statistics.
- To obtain a T-score, multiply the standard score by 10, add 50, and round to the nearest whole number. This means any z-score larger than -5.0 yields a positive value. Since z-scores of ±3 capture around 99.9% of a sample, it would be very rare to obtain someone with a z-score lower than -5.0 or higher than 5.0.
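A minimal sketch of the standard-score and T-score computations just described. The raw score, norm-group mean, and standard deviation are invented.

```python
# From raw score to z-score to T-score, per the definitions above.
def z_score(raw: float, mean: float, sd: float) -> float:
    """Standard score: distance from the mean in standard-deviation units."""
    return (raw - mean) / sd

def t_score(z: float) -> int:
    """T-score: 10 * z + 50, rounded to the nearest whole number."""
    return round(10 * z + 50)

# Hypothetical norm group: mean 24.0, SD 4.0; a respondent scores 30.
z = z_score(30, 24.0, 4.0)   # 1.5 -> one and a half SDs above the mean
print(z, t_score(z))          # 1.5, 65
```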
Stanines
- A commonly used standardized score for person-based tests is the stanine. To obtain a stanine, multiply the z-score by 2, add 5, and round to the nearest integer. This produces (usually) an integer between 1 and 9.

IQ Test Standardized Scores
- IQ tests produce standardized scores by multiplying the z-score by 15 and adding 100. So a z-score of 2 (two standard deviations above the mean) gives an IQ of 2 × 15 + 100 = 30 + 100 = 130.

Scaling
- The production of total scores, standardized scores, etc. is the scaling of our responses. Scaling is the process of measuring objects in a way that maximizes precision, objectivity, and communication.
- A scaling model provides the operational framework for assigning numbers to objects and transforms qualitative constructs into measurable metrics. Scaling is a critical concept in psychometrics.
- In essence, scaling provides a way to mathematically understand a stimulus-response relationship. We apply a mathematical function that allows us to better understand the stimulus (our test / test items) and the response (a person's total score) and to contextualize it in some way.

Subject-Centered vs Response-Centered Scaling
- We can break psychological scaling down further. In response-centered scaling, responses are scaled to place a subject along a psychological continuum based on the strength of the psychological trait they possess. This type of scaling is used in item response theory; scores are inferred rather than simply calculated by summing the items.
- What we have mostly discussed so far is subject-centered scaling, in which we sum or average all items. It is the simpler type of scaling and by far the most common.

Dimensionality
- Related to validity. A scale's dimensionality, also called its factor structure, reflects the number and nature of the variables assessed by its items.
- Brief scales are appealing, particularly when participants cannot be burdened with long scales. Unfortunately, brief scales have important psychometric costs: their psychometric quality might be (and likely is) poor or even unknown. Traditional reliability theory suggests that reliability is relatively weak for brief scales, all else being equal.

The Interview as a Test
- An interview is essentially a verbal test. We ask questions and gather information, codifying responses into categories and assigning numbers to summarize the individual.
- This means interviews can be subjected to the same concepts as paper-based or computerized tests, and we can establish the reliability and validity of an interview.

Good Interviews
1) Social facilitation - a phenomenon whereby the mood of an interviewee can influence the mood of the interviewer, and vice versa.
2) Have a good attitude when conducting an interview. Social psychology has shown that interpersonal influence (the degree to which one person can influence another) is related to interpersonal attraction (the degree to which people share a feeling). It is therefore important to be warm, accepting, open, and understanding when conducting an interview.
3) It is important that the interviewee not feel judged or uncomfortable.
4) Limit probing.
5) Unless it is a structured interview, focus on open-ended questions.
6) Keep responses flowing so the interview continues to move forward.
