Lecture 3 & 4 - Standardization and Scoring - MHD PDF

Document Details


Uploaded by AmusingGiant

Kuwait University

Dr. Mohammed Nadar, Dr. Naser Alotaibi, Dr. Mohammad Alshehab

Tags

standardization, psychometrics, reliability, measurement

Summary

These lecture notes cover standardization and scoring in measurement, including key aspects of psychometrics and assessment. The material explains what standardized tests are, why they are important, the main types of reliability and validity, and the correlation coefficient.

Full Transcript


Slide 1: Part I - The Evaluation Process: Introduction to Standardization & Psychometrics
Dr. Mohammed Nadar, Dr. Naser Alotaibi, Dr. Mohammad Alshehab

Slide 2: This presentation
Standardized tests; correlation and variable relationships; psychometric analysis; reliability; validity; sensitivity vs specificity.

Slide 3: What is a Standardized Test?
A measurement tool that is published and has been designed for a specific purpose, for use with a particular population.
What is included in a standardized test? Standardized tests:
- Have gone through rigorous development
- Have clear instructions for administration and scoring
- Have psychometric data
- Imply uniformity of procedure in administering and scoring the test

Slide 4: What is a Standardized Test? (continued)
Uniformity of procedure in administering and scoring a test... why is it important?
- Diagnostic: concerned with the presence and severity of a problem
- Descriptive: concerned with the details of the current state of the problem
- Comparative: concerned with progress, or the status of the current state of the problem
- Predictive: concerned with the prognosis of the condition/performance
- Conclusive: concerned with the effectiveness of the outcome of an intervention

Slide 5: What is a Standardized Test? Psychometrics
Psychometrics is the field of study concerned with the construction and validation of measurement instruments:
- Reliability: consistency/reproducibility of its scores
- Validity: true representation of the measured construct
- Responsiveness: sensitivity to detect change
Psychometric evaluation represents quality (goodness). Both demonstrate accuracy and generalizability (WHY?!)

Slide 6: Why a Standardized Test?
Both demonstrate accuracy and generalizability because standardized tests provide:
- Objectivity: independence from the opinions (bias) of the examiner
- Quantification: numerical precision, with discriminative and interpretive quantification of the degree of differences in the measured domains
- Communication: enhanced interprofessional use and understanding
- Scientific generalization: general applicability for evaluating a service, intervention, or outcome

Slide 7: What is Correlation?
A correlation is a relationship between two variables. The data can be represented by the pairs (x, y), where x is the independent variable (e.g. gender; the cause) and y is the dependent variable (e.g. score; the effect, affected by x).
Examples of variables that may be correlated:
- Height and shoe size
- SAT score and GPA (grade point average)
- Number of cigarettes smoked and lung capacity

Slide 8: Positive Relationship
- Height and shoe size
- Years in school and salary expectations

Slide 9: Negative Relationship
- Classes absent and final grade
- Self-esteem and paranoia

Slide 10: No Relationship
- Pet owners and % college educators
- IQ and height

Slide 11: Correlation Coefficient
Indicates the extent to which two variables are related. It can range from -1 to +1 (-1 to 0 for inverse relationships, 0 to +1 for positive relationships): a positive correlation coefficient indicates a positive relationship, and a negative correlation coefficient indicates an inverse relationship.
Sample correlation coefficient: r (computed from a sample, i.e. part of the population)
Population correlation coefficient: ρ (used to generalize to the population)

Slide 12: Correlation Coefficient Interpretation
Coefficient range    Strength of relationship
0.00 - 0.20          Very low
0.20 - 0.40          Low
0.40 - 0.60          Moderate
0.60 - 0.80          High
0.80 - 1.00          Very high
(A short computational sketch of r follows this slide.)
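To make the correlation coefficient concrete, here is a minimal sketch of computing a sample Pearson r for paired observations and labelling its strength with the table above. The paired data (hours studied vs. exam score) and the helper names are invented for illustration; they are not from the lecture.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)

def strength(r):
    """Label |r| using the interpretation table from slide 12."""
    a = abs(r)
    if a < 0.20: return "Very low"
    if a < 0.40: return "Low"
    if a < 0.60: return "Moderate"
    if a < 0.80: return "High"
    return "Very high"

hours  = [2, 4, 5, 7, 8, 10]       # independent variable (x)
scores = [55, 60, 62, 70, 74, 80]  # dependent variable (y)

r = pearson_r(hours, scores)
print(f"r = {r:.2f} ({strength(r)})")
```

In practice the same value would usually come from a statistics package (e.g. scipy.stats.pearsonr); the point here is only that r summarizes how tightly the (x, y) pairs cluster around a straight line.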
Slide 13: Reliability
Dependability (consistency or stability) of measurement.
1. Will the measurement process generate consistent information? This requires repetition (i.e. test-retest reliability), with no intervention between administrations.
2. The extent to which testing captures a person's "true score":
Observed score = True score + Error score
- Observed score: the actual score on the exam
- True score: the individual's accurate ability
- Error score: the difference between observed and true scores

Slide 14: Types of Reliability
1. Test-retest reliability: used to assess the consistency of a measure from one time to another.
2. Alternate-forms reliability: used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
3. Internal consistency: used to assess the consistency of results across items within a test.
4. Rater reliability: used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon (intra-rater vs inter-rater reliability).

Slide 15: 1. Test-Retest Reliability
Administer the assessment on two occasions, separated by a time interval, and determine the correlation between the first and second scores. A score of r = .9 means the test is stable across time. Test-retest reliability is useful only when the trait being measured is expected to be stable over time; it is not appropriate for measuring mood (e.g. sadness).

Slide 16: 1. Test-Retest Reliability (continued)
Test-retest reliability can be confounded by several factors:
- Actual change in the trait being measured between the two sessions
- The examinee may remember some of the test items: the retest should not be so soon that items can be remembered, or in some cases the test should be re-administered only after a distraction between the first and second administrations

Slide 17: 2. Alternate-Form Reliability
Also called "parallel forms reliability". Procedure:
1. Create equivalent forms: two distinct but parallel forms of the same assessment (administered at different times), similar in content, difficulty level, type of assessment, etc.
2. Give the two equivalent forms of the test to the same person in close succession.
The reliability coefficient is the correlation between the scores obtained on the two forms (items of the same construct). E.g. SAT, GMAT, TOEFL.

Slide 18: 3. Internal Consistency
Split-half reliability: obtaining similar scores on different items that measure the same variable. Procedure:
1. Split the test into two equivalent halves, e.g. odd items vs even items.
2. Give the test once.
3. Score the two equivalent halves.
4. Correlate the scores on one half with the scores on the other.
If the patient scores the same on both halves, the test is said to have internal consistency. (A short computational sketch of split-half reliability follows slide 20.)

Slide 19: 4. Rater Consistency
The test produces the same results with different observers or raters.
- Intra-rater reliability: the same evaluator generates similar results.
- Inter-rater reliability: different evaluators generate similar results.
Raters must adhere to the standardized procedure.

Slide 20: Acceptable Reliability
- r = 0.70 - 0.79: below-average acceptable
- r = 0.80 - 0.89: average acceptable
- r = 0.90 - 1.00: above-average acceptable
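The split-half procedure from slide 18 can be illustrated in a few lines of code. A minimal sketch, assuming an invented table of item responses (rows are examinees, columns are items scored 0/1); the Spearman-Brown correction at the end is a common companion step that is not on the slide.

```python
# Split-half reliability sketch: score the odd items and the even items
# separately for each examinee, then correlate the two half-scores.
# The response matrix below is invented for illustration.
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1],
])

odd_half  = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
print(f"split-half correlation: r = {r_halves:.2f}")

# Spearman-Brown correction (an addition, not covered on the slide): estimates
# the reliability of the full-length test from the correlation of its halves.
full_length = (2 * r_halves) / (1 + r_halves)
print(f"Spearman-Brown corrected reliability: {full_length:.2f}")
```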
Slide 21: Validity
Does the instrument measure what it claims to measure? E.g. a tape measure is a valid device for measuring length, but is it valid for measuring volume? Validity cannot be directly measured; it must be inferred from evidence. Would everyone agree about what we are measuring? Validity is a matter of degree (how valid?).

Slide 22: Validity
- Clarity (comprehension): items are clear and understandable
- Meaningfulness: how well it describes the trait being measured
- Appropriateness (relevance): suitable for the population to be evaluated
- Usefulness: does it make sense to do a given assessment with a given person within a given context?
Note: Content Validity Index (expert rating of test items), e.g. rating each item of the test on a 1-4 scale: 1 = not relevant, 2 = less relevant, 3 = mostly relevant, 4 = completely relevant. (A short computational sketch of a content validity index follows slide 29.)

Slide 23: 1. Face Validity
Establishes whether the APPEARANCE of a tool seems to measure the proposed construct (a logical, not statistical, judgment). It is the weakest form of validity and is not testable: a logical judgment that the assessment appears to be valid.

Slide 24: 2. Content Validity
The degree to which an assessment measures what it is supposed to measure. E.g. 1: are we measuring pain (and not numbness or discomfort)? E.g. 2: are we measuring depression (and not anxiety or panic)? Content validity also concerns the comprehensiveness of an assessment: the inclusion of items that fully represent the attribute being measured. E.g. how closely does the content of the exam questions relate to the content of the curriculum? Does your "Human Development" exam measure what you learned in the Human Development course?

Slide 25: How to Establish Content Validity?
Procedure:
1. Define the concept to be measured
2. Search the literature to see how this concept is represented
3. Generate items which might measure the concept
4. Have a panel of experts review the items
Example objectives: at the end of the lecture, the student will be able to:
1. Explain what 'stars' are
2. Discuss the types of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our star and all other stars

Slide 26: How to Measure Content Validity?
There are no statistics that demonstrate content validity; it is a logical judgment of the appropriateness of the test content against assessment objectives, literature-supported criteria, and so on. E.g. NBCOT: items on the exam reflect current practice. E.g. a fieldwork evaluation should reflect students' knowledge of the OT skills they were meant to learn. Other examples: MMSE, DASH, ...

Slide 27: 3. Criterion-Related Validity
The degree to which content on a test (the predictor) correlates with performance on another relevant measure (the criterion).
1. Concurrent validity: scores from one test are correlated with an external criterion.
2. Predictive validity: the effectiveness of a test in predicting an individual's performance in specific activities. E.g. if you taught skills relating to public speaking and had students take a test on it, the test can be validated by looking at how it relates to students' actual public-speaking performance outside the classroom.

Slide 28: Two Types of Criterion Validity
1. Concurrent validity: the extent to which the results of the instrument in question agree with other measures of the same traits (the criterion). Compare the instrument with the "gold standard", i.e. an already validated instrument.

Slide 29: Two Types of Criterion Validity (continued)
2. Predictive validity: how well performance on a test predicts future performance on some valued measure (the criterion); the accuracy of an instrument in predicting a future event or a patient's future performance. E.g. a Reading Readiness Test might be used to predict students' achievement in reading. E.g. Glasgow Coma Score and patterns of brain activity. GPA and later success?
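The Content Validity Index mentioned on slide 22 can be computed directly from expert ratings. A minimal sketch, assuming the common convention that a rating of 3 or 4 on the slide's 1-4 scale counts as "relevant"; the item names and ratings are invented for illustration.

```python
# Item-level Content Validity Index (I-CVI): the proportion of experts who rate
# an item 3 ("mostly relevant") or 4 ("completely relevant"). S-CVI/Ave is the
# average of the item-level indices. The ratings below are invented.

ratings = {                       # item -> one 1-4 rating per expert
    "item 1": [4, 4, 3, 4, 3],
    "item 2": [3, 4, 4, 4, 4],
    "item 3": [2, 3, 1, 2, 3],    # a weak item
    "item 4": [4, 3, 4, 4, 4],
}

i_cvi = {}
for item, scores in ratings.items():
    relevant = sum(1 for s in scores if s >= 3)   # experts rating 3 or 4
    i_cvi[item] = relevant / len(scores)
    print(f"{item}: I-CVI = {i_cvi[item]:.2f}")

s_cvi_ave = sum(i_cvi.values()) / len(i_cvi)
print(f"S-CVI/Ave = {s_cvi_ave:.2f}")
```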
Slide 30: Other Considerations for Validity: Sensitivity and Specificity
- Sensitivity: of the people who HAVE the disease, what proportion was the test sensitive enough to detect? Sensitivity = true positives / (true positives + false negatives).
- Specificity: of the people who do NOT have the disease, what proportion was the test specific enough to exclude? Specificity = true negatives / (true negatives + false positives).
The 2x2 table on the slide cross-classifies test results (positive/negative) against true status (diseased/healthy) into true positives, false positives, false negatives, and true negatives. Worked example from the slide: sensitivity = 8 / (8 + 4) ≈ 67%; specificity = 7 / (7 + 2) ≈ 78%. (A short computational sketch of sensitivity and specificity follows slide 39, at the end of these notes.)

Slide 31: Other Considerations
- Ceiling effect: responses exceed the capacity of the measuring instrument; responses are "off the scale" because the test is too easy (not challenging).
- Floor (basement) effect: responses are below the threshold of the measuring instrument because the test is too difficult (too challenging).
Remedies: use established evaluation instruments; use sensitive, precise measures.

Slide 32: Factors That Can Lower Validity
- Unclear directions (e.g. non-standardised administration)
- Difficult reading vocabulary and sentence structure (e.g. instruction-related)
- Ambiguity in statements (e.g. instruction-related)
- Inadequate time limits (e.g. between testing sessions)
- Inappropriate level of difficulty (e.g. in the testing phase)
- Poorly constructed test items (e.g. unrelated items)
- Test items inappropriate for the outcomes being measured (e.g. not comprehensive enough to measure the construct)
- Tests that are too short (e.g. not comprehensive enough)
- Identifiable patterns of answers (e.g. that increase bias)
- Administration and scoring (e.g. not clearly optimised)

Slide 33: NON-STANDARDIZED ASSESSMENTS

Slide 34: What Is a Non-Standardized Test?
Assessments that provide the therapist with information but have no precise comparison to a norm or a criterion. There are no manuals or standard directions, and the results are usually descriptive. They might be structured assessments constructed to provide guidelines for the content and process of the assessment, but whose psychometric properties have not been researched and documented.
Examples: informal interviewing; many questionnaires; observations (e.g. greeting a person and walking with them to the clinic); the discussions you might have with a patient before each session ("How are you?", "Have you gotten a chance to practice your driving since the last time we met?", etc.).

Slide 35: Why Use Non-Standardized Methods?
- Pragmatic reasons (time, materials, etc.)
- The person or issue does not fit a "standard" criterion: what if the patient doesn't speak English?
- Using a standardized assessment with a non-standardized approach, e.g. using an adult assessment with a child
- Disability status may make standardized testing difficult: CP is a heterogeneous condition, so the standardization process is difficult; standardize using STRATA of the population (e.g. hemiplegic CP)
- The assessment might need to be more contextual (e.g. cultural): a cooking assessment using a wood fire instead of a stove in an area of poverty; using chopsticks instead of a fork and knife

Slide 36: Limitations of Non-Standardized Assessment
- You cannot be certain of validity/reliability
- Not good for pre-post measurement (responsiveness?)
- More difficult to communicate to others, whereas numbers are easy to communicate
- Cannot compare to a standard

Slide 37: See you in the lab! Any questions?

Slide 38: Validity & Reliability Lab Exercises
Validity: face validity (choose a tool for the exercise).
Reliability: measuring the distance between two dots; flexible tape measure vs retractable tape measure.

Slide 39: Validity & Reliability Lab Exercises
Validity: face validity (choose a tool for the exercise).
Validity vs reliability (graph): categorize examples as reliable or not reliable and valid or not valid on a graph whose axes are reliability and validity.
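To tie slide 30's definitions to the 2x2 table, here is a minimal sketch of computing sensitivity and specificity. The cell counts are inferred from the fractions quoted on that slide (8 true positives, 4 false negatives, 7 true negatives, 2 false positives); the function names are illustrative, not from any particular library.

```python
# Sensitivity and specificity from 2x2 table cell counts (see slide 30).

def sensitivity(tp: int, fn: int) -> float:
    """Proportion of diseased people that the test correctly detects."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of healthy people that the test correctly excludes."""
    return tn / (tn + fp)

tp, fn = 8, 4   # diseased people: test positive / test negative
tn, fp = 7, 2   # healthy people:  test negative / test positive

print(f"sensitivity = {sensitivity(tp, fn):.0%}")  # 8 / 12 ≈ 67%
print(f"specificity = {specificity(tn, fp):.0%}")  # 7 / 9  ≈ 78%
```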
