Professional Education Assessment PDF
Document Details
Romblon State University
Jay-Ar G. Beloy, CSPE, LPT
Summary
This document is a set of educational materials covering professional education assessment, including different types of tests and their purposes. It includes details of testing, measurement, assessment, and evaluation, focusing broadly on the theory and application of instructional methods in professional education.
Full Transcript
PROFESSIONAL EDUCATION ASSESSMENT
JAY-AR G. BELOY, CSPE, LPT

TESTING, MEASUREMENT, ASSESSMENT, EVALUATION

TESTING
✓ A TEST is an instrument designed to measure any characteristic, quality, ability, knowledge, or skill.
✓ It comprises items in the area it is designed to measure.
✓ TESTING is when you employ the test (tool).

MEASUREMENT
✓ A process of QUANTIFYING THE DEGREE to which someone or something possesses a given trait, i.e., a quality, characteristic, or feature.
✓ ASSIGNING OF NUMBERS to a performance, product, skill, or behavior of a student, based on a predetermined procedure or set of criteria.
✓ Assigning of numbers to the results of a test or other type of assessment.
✓ Awarding points for a particular aspect of an essay or performance.

ASSESSMENT
✓ Assessment is the process of gathering information through tests and measurement.
✓ Assessment as a PRODUCT (e.g., a set of questions or tasks) is designed to elicit a predetermined behavior, unique performance, or a product from a student.
✓ Assessment as a PROCESS refers to the collection, interpretation, and use of qualitative and quantitative information to assist teachers in their educational decision-making.
✓ Assessment is a prerequisite to evaluation. It provides the information that enables the evaluation to take place.

EVALUATION
✓ A process of making a judgement about the quality of a performance, product, skill, or behavior of a student. (PASS or FAIL?)
✓ Includes using some basis to judge worth or value.
✓ It involves judgement about the desirability of changes in the students.

PURPOSES OF ASSESSMENT
✓ Assessment FOR Learning (ensuring that learning takes place)
✓ Assessment OF Learning (grading)
✓ Assessment AS Learning (self-assessment)

ASSESSMENT FOR LEARNING
Includes three types of assessment done before and during instruction. This ensures that the students are learning.
PLACEMENT, DIAGNOSTIC, FORMATIVE

PLACEMENT
✓ Done PRIOR to instruction.
✓ Its purpose is to assess the needs of the learners as a basis for planning relevant instruction.
✓ The results of this assessment place students in specific learning groups to facilitate teaching and learning.

DIAGNOSTIC
✓ Done BEFORE instruction.
✓ This is a form of PRE-ASSESSMENT that allows a teacher to determine an individual student's PRIOR knowledge, including MISCONCEPTIONS, before instruction.
✓ Used to diagnose what students already know and do not yet know in order to guide instruction.
✓ This is also used to determine students' recurring or persistent difficulties.
✓ It helps formulate a plan for detailed REMEDIAL INSTRUCTION.

FORMATIVE
✓ Done DURING instruction.
✓ Meant to ENSURE that learning takes place.
✓ Provides the teacher with information regarding how well the learning objectives of a given learning activity are being met.
✓ Teachers monitor student learning to get ongoing feedback to improve their teaching and for the students to improve their learning.
✓ Helps students identify strengths and weaknesses and target areas that need work.
✓ Results of this assessment are communicated clearly and promptly to the students.

ASSESSMENT OF LEARNING
This is done after instruction. Was the set learning outcome attained after the instruction?

SUMMATIVE ASSESSMENT
✓ It is used to evaluate student learning at the END of a defined instructional period.
✓ Its results reveal whether or not instruction has successfully achieved the curriculum outcomes.
✓ The results of summative assessments are the BASES for GRADES and reports to parents.
✓ The results are communicated to the students, parents, and other stakeholders for decision making.
✓ It is also a powerful factor that could pave the way for educational reforms.

ASSESSMENT AS LEARNING
This is done for the teachers to understand their role of assessing FOR and OF learning.
SELF-ASSESSMENT
✓ This is done for teachers to understand and perform well their role of assessing FOR and OF learning.
✓ It requires teachers to undergo training on how to assess learning and to be equipped with the competencies needed in performing their work as assessors.
✓ The ultimate goal of Assessment AS Learning is for learners to become self-directed and independent.

MODE OF ASSESSMENT: TRADITIONAL, AUTHENTIC, PORTFOLIO

TRADITIONAL
✓ Decontextualized Assessment is teacher-centered and focuses on declarative and/or procedural knowledge in artificial situations detached from the real world (indirect evidence).
✓ Includes paper-and-pencil tests.
✓ Paper-and-pencil tests are either SELECTED RESPONSE type (alternate response, multiple choice, and matching type) or CONSTRUCTED RESPONSE type (short answer, essay, problem solving, completion tests).
✓ Focuses on the MASTERY OF KNOWLEDGE.

AUTHENTIC
✓ Contextualized/Performance/Non-Traditional Assessment is student-centered; it uses REAL-LIFE tasks and requires students to utilize higher-order thinking skills.
✓ It is authentic because students' knowledge and skills are assessed in a context that approximates the real world or real life as closely as possible.
✓ It can be in the form of a PROCESS (procedure) or a PRODUCT (concrete output), focusing on the demonstration and performance of tasks in the real world.

Authentic Assessment COMPLEMENTS Traditional Assessment! Learners cannot perform real-world tasks if they have not mastered the basic knowledge and skills.

PORTFOLIO
✓ From portare (carry) and foglio (sheet of paper).
✓ It is a systematic and organized COLLECTION of a student's work that demonstrates the student's skills and accomplishments.
✓ It is a purposeful collection of work that tells the story of the student's progress and achievement in relation to a purpose.
✓ It also involves gathering multiple indicators of the student's progress.
✓ It should contain the following: (a) student participation in selecting contents, (b) criteria for selection, (c) criteria for judging merit, and (d) evidence of self-reflection.

EVIDENCES OF LEARNING PLACED IN PORTFOLIO: ARTIFACTS, REPRODUCTIONS, ATTESTATIONS, PRODUCTIONS

ARTIFACTS
✓ Documents or products produced as a result of ACADEMIC CLASSROOM WORK.
✓ Example: student papers and homework.

REPRODUCTIONS
✓ Documentation of a student's work OUTSIDE the classroom.
✓ Example: special projects like a Capstone, and a student's description of an interview with the Chairman of the Education Committee in the Municipal Council.

ATTESTATIONS
✓ Documentation by the teacher or another responsible person that attests to the student's progress.
✓ Example: teacher's evaluative notes about the student's oral defense.

PRODUCTIONS
✓ Documents that the student himself/herself prepares.
✓ Includes:
1. Goal Statements: What does the student want to do about his/her portfolio?
2. Reflections: What are the student's reflections about his/her work?
3. Captions: These are the student's description and explanation of each piece of work contained in the portfolio.

TYPES OF PORTFOLIO: ASSESSMENT, DEVELOPMENT, DISPLAY

ASSESSMENT PORTFOLIO
✓ Also called PROCESS PORTFOLIO.
✓ Intended to document what a student has learned based on the intended learning outcomes.
✓ It includes students' reflections.
✓ The result of an assessment portfolio informs both the classroom teacher and the student of the EXTENT to which the learning outcomes have been ATTAINED.
✓ Meant to diagnose the student's learning.

DEVELOPMENT PORTFOLIO
✓ Also called GROWTH/DOCUMENTATION/WORKING PORTFOLIO.
✓ This portfolio takes time to complete.
✓ Consists of a student's work over an extended time frame to reveal the student's progress in meeting learning targets.
✓ Provides CONCRETE evidence of HOW MUCH a student has changed or developed over time.
✓ Documents the student's COGNITIVE and PSYCHOMOTOR progress in learning.
✓ Example: To see how much a Kindergarten pupil has improved in his/her skill in writing his/her name, one needs to compare the written name from the beginning of the school year with that of the middle and end of the school year.

DISPLAY PORTFOLIO
✓ Also called BEST WORK/SHOWCASE PORTFOLIO.
✓ Presents the MOST OUTSTANDING work.
✓ Documents the student's proof of BEST efforts with respect to learning outcomes.
✓ It may include evidence of student activities BEYOND school (e.g., a story written at home).
✓ More selective than the Growth Portfolio.
✓ Very useful for parent-teacher conferences, the student's future teachers, admission to college, and even future job applications to supplement other information.
✓ Leads students to celebrate learning because they present the BEST PRODUCT or the BEST PERFORMANCE of the student.

SCORING RUBRIC (necessary for authentic assessment)
It is a scoring guide used to assess performance (process or product) against a set of criteria that includes descriptions of levels of performance quality on the criteria. It is typically employed when a judgment of quality is required and may be used to evaluate a broad range of subjects and activities. A scoring rubric can be useful in grading essays or in evaluating projects such as scrapbooks, portfolios, etc.
PARTS OF A SCORING RUBRIC
✓ Coherent sets of criteria
✓ Descriptions of the levels of performance for these criteria

Recitation Rubric (Corpuz & Cuartel, 2021)
Criterion (weight): Level 1 / Level 2 / Level 3
Number of Appropriate Hand Gestures (×1): 1–4 / 5–9 / 10–12
Appropriate Facial Expression (×1): Lots of inappropriate facial expression / Few inappropriate facial expressions / No apparent inappropriate facial expression
Voice Inflection (×2): Monotone voice used / Can vary voice inflection with difficulty / Can easily vary voice inflection
Incorporate Proper Ambiance through Feelings in the Voice (×3): Recitation contains very little feelings / Recitation has some feelings / Recitation fully captures ambiance through feelings in the voice

TYPES OF SCORING RUBRIC: HOLISTIC, ANALYTIC

HOLISTIC
✓ A Holistic Rubric describes the overall quality, global picture, or gross judgement of a performance or product.
✓ There is only one rating given to the entire work or performance.
✓ It allows fast assessment, provides one score to describe the overall performance or quality of work, and can indicate the general strengths and weaknesses of the work or performance.
✓ It does not clearly describe the degree to which each criterion is satisfied by the performance or product, and it does not permit differential weighting of the qualities of a product or a performance.
Recitation Rubric (Corpuz & Cuartel, 2021)
Rating: Description
3 – Excellent Speaker:
▪ Includes 10–12 changes in hand gestures
▪ No apparent inappropriate facial expression
▪ Can easily vary voice inflection
▪ Recitation fully captures ambiance through feelings in the voice
2 – Good Speaker:
▪ Includes 5–9 changes in hand gestures
▪ Few inappropriate facial expressions
▪ Can vary voice inflection with difficulty
▪ Recitation has some feelings
1 – Poor Speaker:
▪ Includes 1–4 changes in hand gestures
▪ Lots of inappropriate facial expression
▪ Monotone voice used
▪ Recitation contains very little feelings

ANALYTIC
✓ An Analytic Rubric, also known as a Dimensional or Multiple Rating rubric, describes the quality of a performance or product in terms of identified dimensions and/or criteria, which are rated independently to give a better picture of the quality of the work or performance.
✓ It clearly describes the degree to which each criterion is satisfied by the performance or product, permits differential weighting of the qualities of a product or a performance, and helps pinpoint specific areas of strength and weakness.
✓ It is more time-consuming to use and more difficult to construct.
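The differential weighting that an analytic rubric permits can be sketched in Python. This is only an illustration: the weights (×1, ×1, ×2, ×3) come from the Recitation Rubric, but the criterion keys and the rule "total = sum of weight × level" are assumptions, since the slides do not state how the weighted ratings are combined.

```python
# Weights from the Recitation Rubric; the dictionary keys are hypothetical names.
WEIGHTS = {
    "hand_gestures": 1,
    "facial_expression": 1,
    "voice_inflection": 2,
    "ambiance": 3,
}


def analytic_score(ratings, weights=WEIGHTS, max_level=3):
    """Return (weighted total, maximum possible total) for one student.

    Assumption: each criterion is rated 1..max_level and the total is the
    sum of weight * level over all criteria.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the rubric criteria")
    for criterion, level in ratings.items():
        if not 1 <= level <= max_level:
            raise ValueError(f"level for {criterion} must be 1..{max_level}")
    total = sum(weights[c] * ratings[c] for c in weights)
    maximum = max_level * sum(weights.values())
    return total, maximum


# Hypothetical ratings for one student.
ratings = {
    "hand_gestures": 3,
    "facial_expression": 2,
    "voice_inflection": 2,
    "ambiance": 3,
}
total, maximum = analytic_score(ratings)
print(total, maximum)  # 18 21
```

Because each criterion is rated separately, the per-criterion levels remain available for feedback, which is what lets an analytic rubric pinpoint specific strengths and weaknesses.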
Recitation Rubric (Corpuz & Cuartel, 2021)
Criterion (weight): Level 1 / Level 2 / Level 3
Number of Appropriate Hand Gestures (×1): 1–4 / 5–9 / 10–12
Appropriate Facial Expression (×1): Lots of inappropriate facial expression / Few inappropriate facial expressions / No apparent inappropriate facial expression
Voice Inflection (×2): Monotone voice used / Can vary voice inflection with difficulty / Can easily vary voice inflection
Incorporate Proper Ambiance through Feelings in the Voice (×3): Recitation contains very little feelings / Recitation has some feelings / Recitation fully captures ambiance through feelings in the voice

PRINCIPLES OF HIGH-QUALITY CLASSROOM ASSESSMENT
✓ Principle 01: Clear and Appropriate Learning Targets
✓ Principle 02: Appropriate Methods
✓ Principle 03: Balanced
✓ Principle 04: Validity
✓ Principle 05: Reliability

Minor Principles of Assessment
✓ Administrability
✓ Scoreability
✓ Interpretability
✓ Economy

PRINCIPLE 01: CLEAR AND APPROPRIATE LEARNING TARGETS
Learning targets should be clearly stated, specific, and centered on what is truly important (on point, not broad).

Learning Targets (McMillan, 2007; Stiggins, 2007)
Knowledge (content of the book): Student's mastery of substantive subject matter
Reasoning (critical thinking): Student's ability to use knowledge to reason and solve problems
Skills (demonstrate): Student's ability to demonstrate achievement-related skills
Products (concrete proof): Student's ability to create achievement-related products
Affective/Disposition (behavior): Student's attainment of affective states such as attitudes, values, interest, and self-efficacy

PRINCIPLE 02: APPROPRIATE METHODS
Objective Supply (No Options!): Short Answer, Completion Test. LT: Knowledge
Objective Selection (There are Options!): Multiple Choice, Matching, True/False. LT: Knowledge
Essay: Restricted Response (limited content), Extended Response (no limitation; any ideas). LT: Reasoning
Performance Based: Presentations, Papers, Projects, Athletics, Demonstrations, Exhibitions, Portfolios. LT: Skills, Products
Oral Question: Oral Examination, Conferences, Interviews. LT: Affective, Reasoning
Observation: Informal, Formal, Obtrusive, Unobtrusive. LT: Skills
Self-Report: Attitude Survey, Sociometric Devices, Questionnaires, Inventories. LT: Affective

TYPES OF TESTS ACCORDING TO FORMAT: SELECTIVE, SUPPLY, ESSAY

SELECTIVE
✓ SELECTIVE TYPE provides choices for the answer.
o Multiple Choice: consists of a stem, which describes the problem, and 3 or more alternatives, which give the suggested solutions. The incorrect alternatives are the distractors, jokers, or foils.
o True-False or Alternative Response: consists of a declarative statement that one has to mark true or false, right or wrong, correct or incorrect, yes or no, fact or opinion, or the like.
o Matching Type: consists of two parallel columns: Column A, the column of premises for which a match is sought; Column B, the column of responses from which the selection is made.

SUPPLY
✓ SUPPLY TEST
o Short Answer: uses a direct question that can be answered by a word, a phrase, a number, or a symbol.
o Completion Test: consists of an incomplete statement.

ESSAY
✓ ESSAY TYPE
o Restricted Response: limits the content of the response by restricting the scope of the topic.
o Extended Response: allows the students to select any factual information that they think is pertinent and to organize their answers in accordance with their best judgment.

GUIDELINES FOR CONSTRUCTING TEST ITEMS
Use an essay test when students are going to PLAN, ORGANIZE, and EXPLAIN their ideas through writing. Note: essays are prone to BLUFFING!
Essays are appropriate when:
✓ the group to be tested is small and the test is not to be used again;
✓ you wish to encourage and reward the development of the student's skill in WRITING (spelling, vocabulary, and sentence construction);
✓ you are more interested in exploring the student's ATTITUDES (point of view, belief, outlook in life, perspective) than in measuring his/her academic achievement;
✓ you are more confident of your ability as a critical and fair reader than as an imaginative writer of good objective test items.

ESSAY TEST ITEMS
A classroom essay test consists of a small number of questions in which the student is expected to demonstrate his/her ability to: (1) recall factual knowledge; (2) organize this knowledge; and (3) present the knowledge in a logical, integrated answer to the question.

Classifications of Essay Test
✓ Extended-response essay item (uncontrolled)
✓ Limited-response or short-answer essay item (controlled)

Example of Extended-Response Essay Items:
Explain the difference between the S-R (Stimulus-Response) and the S-O-R (Stimulus-Organism-Response) theories of personality. Include in your answer the following:
a. Brief description of both theories
b. Supporters of both theories
c. Research methods used to study each of the two theories. (20 pts)

Example of Short-Answer Essay Items:
Identify research methods used to study the S-R (Stimulus-Response) and the S-O-R (Stimulus-Organism-Response) theories of personality. (10 pts)

GUIDELINES FOR CONSTRUCTING TEST ITEMS
When to use Objective Test Items?
Objective Test Items are appropriate when:
✓ The group to be tested is large and the test may be reused;
✓ Highly reliable test scores must be obtained as efficiently as possible;
✓ Impartiality of evaluation, absolute fairness, and freedom from possible test scoring influences (e.g., fatigue, lack of anonymity) are essential;
✓ You are more confident of your ability to express objective test items clearly than of your ability to judge essay test answers correctly;
✓ There is more pressure for speedy reporting of scores than for speedy test preparation.

Note: a STANDARDIZED TEST is created by experts for a large group of test takers.

OBJECTIVE TESTS
Objective tests are assessments designed to measure a student's knowledge or skills through questions with clear, unambiguous answers. These tests are objective because they require little to no subjective judgment when scoring. Each question has a specific correct answer, making grading straightforward and consistent.
✓ Multiple Choice Items
✓ True-False Choice Items
✓ Matching Test Items
✓ Completion Test Items

MULTIPLE CHOICE ITEMS
It consists of:
✓ STEM, which identifies the question or problem
✓ Response Alternatives or Options
✓ Correct Answer

Example:
Which of the following is a chemical change? (STEM)
a. Evaporation of alcohol
b. Freezing of water
c. Burning of oil
d. Melting of wax
(Alternatives)

What is a flaw of the test item?
The total enrollees of the College of Education for A.Y. 2024–2025 First Semester is about?
a. 2,000
b. 10,104
c. 1,236
d. 967

a. Options not sequenced (should run highest to lowest or lowest to highest)
b. Insignificant
c. Irrelevant
d. Limited options

What is a flaw of the test item?
Which element is most important?
a. Carbon
b. Hydrogen
c. Oxygen
d. Potassium

a. Opinionated
b. No single answer
c. Irrelevant
d. Limited options

TRUE-FALSE CHOICE ITEMS
True-false test items are typically used to measure the ability to identify whether or not statements of fact are correct.
The basic format is simply a declarative statement that the student must judge as true or false. Modifications of this basic form have the student respond "yes" or "no", or "agree" or "disagree".

Three Forms:
✓ Simple: consists of only two choices
✓ Complex: consists of more than two choices
✓ Compound: consists of two choices plus a conditional completion response

Three Forms: Examples
✓ Simple: The acquisition of morality is a developmental process. True False
✓ Complex: The acquisition of morality is a developmental process. True False Opinion
✓ Compound: The acquisition of morality is a developmental process. True False If the statement is false, what makes it false?

Which guideline was not observed in the formulation of this item?
Graciano Lopez Jaena was the first editor of La Solidaridad and author of El Filibusterismo.
a. Express a single idea in each test item
b. Base true-false items upon statements that are absolutely true without exceptions
c. Avoid the use of extreme modifiers or qualifiers
d. Avoid the use of unfamiliar vocabulary

Which is an improved version of this True-False test item?
With mandatory Kindergarten and Grades 1 to 12, the Philippines has one of the best if not the best educational system in the world.
a. According to Columnist Cruz, with mandatory Kindergarten and Grades 1 to 12, the Philippines has one of the best, if not the best, educational system in the world.
b. The K to 12 Program makes the Philippine educational system one of the best in the world.
c. The K to 12 program makes the Philippine educational system one of the longest in the world.
d. The K to 12 program makes the Philippine educational system the longest and the best in the world.

MATCHING TEST ITEMS
In general, matching items consist of a column of stimuli presented on the left side of the exam page and a column of responses placed on the right side of the page.
Students are required to match the response associated with a given stimulus.

Notes:
✓ The number of stimuli should be less than the number of responses.
✓ Stimuli should be homogeneous (of the same kind) and not more than 10.
✓ Responses should include distractors.

✓ Perfect Matching Type (one-one)
✓ Imperfect Matching Type (one-many)

Which rule in test construction is violated by this test?
Directions: Match the following in order to complete the sentences on the left.
__ 1. Plato insisted that government was
__ 2. Machiavelli wrote about
__ 3. Hobbes argued that human nature made absolute Monarchy
__ 4. Marx was a German and economist who founded
A. The Prince.
B. desirable and inevitable
C. a science requiring experts.
D. organized along industrial lines
E. Communism

a. Avoid grammatical or other clues to the correct response
b. The premise should be in Column 2
c. The options should be more
d. The options should be numbered

Which is a way to improve the matching test below?
COLUMN A
__ 1. Measure of relationship
__ 2. Measure of central tendency
__ 3. Binet-Simon
__ 4. Statistical test of mean difference
__ 5. Measure of variability
COLUMN B
a. Mean
b. Standard deviation
c. rho
d. t-ratio
e. Intelligence testing

a. Add five items in both columns
b. Add one or two items in the left column
c. Add one or two items in the right column
d. Add ten items in both columns to make the test more comprehensive

COMPLETION TEST ITEMS
Completion items require the student to answer a question or to finish an incomplete statement by filling in a blank with the correct word or phrase.
Example: According to Freud, personality is made up of three major systems, the _____, the _____, and the _____.

Notes:
✓ Blanks should be of equal length.
✓ Blanks should be at the end of the statement.
✓ Each blank should have only one absolute answer.

What is wrong with this completion test item?
___________ is a part of speech that describes a verb and an adjective.
a. It is extremely easy.
b. It is pure recall.
c. The blank is at the beginning of the sentence.
d. The sentence is quite short.

What is faulty with the test items?
1. The process by which plants manufacture their own food is ___________.
2. Water from the roots rises to the tip of the leaf by way of the ________.
a. They are extremely difficult questions for Grade 6.
b. The second question suggests the answer to the first.
c. The lengths of the blanks suggest the answer.
d. They are recall items.

PRINCIPLE 03: BALANCED
A balanced assessment sets TARGETS in all DOMAINS OF LEARNING (cognitive, affective, and psychomotor) or DOMAINS OF INTELLIGENCE (verbal-linguistic, logical-mathematical, bodily-kinesthetic, visual-spatial, musical-rhythmic, interpersonal-social, intrapersonal-introspection, physical world-natural, existential-spiritual). A balanced assessment makes use of both TRADITIONAL and AUTHENTIC ASSESSMENT.

PRINCIPLE 04: VALIDITY
Validity is the degree to which the assessment instrument measures what it intends to measure. It also refers to the usefulness of the instrument for a given purpose. It is the most important criterion of a good assessment instrument.

Ways in Establishing Validity (FCC): FACE, CONTENT, CRITERION

FACE VALIDITY
✓ Face Validity is established by examining the physical appearance of the instrument.

CONTENT VALIDITY
✓ Content Validity is established through a careful and critical examination of the objectives of the assessment so that it reflects the curricular objectives.
✓ A Table of Specification is used to ensure the content validity of an instrument.
✓ Content Validity is based on the objectives and learning outcomes.

CRITERION VALIDITY
✓ Criterion-related Validity is established statistically: the set of scores produced by the measuring instrument is correlated with the scores obtained on another external predictor or measure.
✓ There will be two instruments: one that is already valid and one that is to be correlated with it.
✓ It has two purposes: (1) Concurrent Validity, and (2) Predictive Validity.
✓ Concurrent Validity: describes the present status of the individual by correlating the sets of scores obtained from two measures given concurrently, that is, at the same time or about the same time.
✓ Predictive Validity: describes the future performance of an individual by correlating the sets of scores obtained from two measures given at a longer time interval.

Example of Concurrent Validity: Relate the reading test result with the pupil's average grades in reading given by the teacher.
Example of Predictive Validity: The entrance examination scores in a test administered to a freshman class at the beginning of the school year are correlated with the average grades at the end of the school year.

FACTORS INFLUENCING THE VALIDITY OF AN ASSESSMENT
✓ Unclear Directions: Directions that do not clearly indicate to the students how to respond to the task and how to record the responses tend to reduce validity.
✓ Reading Vocabulary and Sentence Structure Too Difficult: Vocabulary and sentence structure that are too complicated for the student turn the task into an assessment of reading comprehension, thus altering the meaning of the assessment result.
✓ Ambiguity: Ambiguous statements in assessment tasks contribute to misinterpretation and confusion. Ambiguity sometimes confuses the better students more than it does the poor students.
✓ Inadequate Time Limits: Time limits that do not provide students with enough time to consider the tasks and provide thoughtful responses can reduce the validity of the interpretation of results.
✓ Overemphasis on easy-to-assess aspects of a domain can lead to neglect of important but harder-to-assess areas: It is simpler to create test questions that measure factual recall than to assess deeper conceptual understanding or higher-order thinking. Therefore, it is crucial to avoid underrepresenting complex aspects of achievement.
✓ Test Items Inappropriate for the Outcomes Being Measured: Attempting to measure understanding, thinking skills, and other complex types of achievement with test forms that are appropriate only for measuring factual knowledge will invalidate the result.
✓ Poorly Constructed Test Items: Test items that unintentionally provide clues to the answer tend to measure the students' alertness in detecting clues rather than their mastery of the skills or knowledge the test is intended to measure.
✓ Test Too Short: If a test is too short to provide a representative sample of the performance, it will have reduced validity, as it may not accurately measure the intended knowledge or skills.
✓ Improper Arrangement of Items: Test items are typically arranged in order of difficulty, with the easiest items first. Placing difficult items first in the test may cause students to spend too much time on these and prevent them from reaching items they could easily answer. Improper arrangement may also influence validity by having a detrimental effect on student motivation.
✓ Identifiable Pattern of Answers: Placing correct answers in some systematic pattern enables students to guess the answers to some items more easily, and this lowers validity.

PRINCIPLE 05: RELIABILITY
This refers to the consistency of scores obtained by the same person when retested using the same instrument or its parallel form, or when compared with other students who took the same test.

Ways in Establishing Reliability (TETSK): TEST-RETEST, EQUIVALENT FORMS, TEST-RETEST WITH EQUIVALENT FORMS (TEST-RETEST+), SPLIT HALF, KUDER-RICHARDSON

TEST-RETEST
✓ Procedure: In test-retest, give a test (exactly the same items; can be shuffled) twice to the same group, with any time interval between tests from several minutes to several years.
✓ Type of Reliability Measure: Measure of Stability
✓ Statistical Treatment: Pearson r

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

✓ r = +1.00: Perfect Positive Correlation. The test results are perfectly consistent between the two administrations, showing excellent test-retest reliability.
✓ r = 0: No Correlation. The scores from the two administrations are completely unrelated, indicating no reliability.
✓ r = −1.00: Perfect Negative Correlation. This would be unusual and would indicate that higher scores in the first test are associated with lower scores in the second, suggesting poor test-retest reliability.

GENERAL GUIDELINES FOR RELIABILITY INTERPRETATION (TEST-RETEST)
r ≥ 0.80: High Reliability (test scores are very stable over time)
0.60 ≤ r < 0.80: Moderate Reliability (acceptable but could be better)
0.40 ≤ r < 0.60: Low Reliability (concerns about the test's stability over time)
r < 0.40: Very Low Reliability (the test likely doesn't measure consistently across time)

If you calculate r = 0.85, this means the test has high stability over time, indicating that the test results are consistent and reliable.

EQUIVALENT FORMS
✓ Procedure: In equivalent forms, give parallel forms of the test with a close time interval between forms.
✓ Type of Reliability Measure: Measure of Equivalence
✓ Statistical Treatment: Pearson r

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

✓ r = +1.00: Perfect Positive Correlation. The two forms of the test are perfectly equivalent and measure the same construct very reliably.
✓ r = 0: No Correlation. The two forms are not related and likely do not measure the same construct consistently.
✓ r = −1.00: Perfect Negative Correlation. This would be unusual and would indicate that as scores on one form increase, scores on the other decrease, suggesting the forms are not equivalent.
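The Pearson r computation and the interpretation bands used in this section can be sketched in Python. The function names and the sample scores are hypothetical; the formula is the raw-score form of Pearson r, and the cut-offs (0.80, 0.60, 0.40) are the ones listed in the guidelines.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r from raw scores, using the computational (raw-score) formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired lists with at least 2 scores")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either set of scores has no variance")
    return numerator / denominator


def interpret_reliability(r):
    """Map a coefficient to the bands given in the interpretation guidelines."""
    if r >= 0.80:
        return "High"
    if r >= 0.60:
        return "Moderate"
    if r >= 0.40:
        return "Low"
    return "Very Low"


# Hypothetical scores of five students on two administrations of a test.
first = [10, 12, 15, 18, 20]
second = [11, 13, 14, 19, 20]
r = pearson_r(first, second)
print(round(r, 2), interpret_reliability(r))
```

The same function applies to test-retest, equivalent forms, and split-half data; only the source of the paired scores changes.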
GENERAL GUIDELINES FOR RELIABILITY INTERPRETATION (EQUIVALENT FORMS)
r ≥ 0.80: High Reliability (the forms are highly equivalent and measure consistently)
0.60 ≤ r < 0.80: Moderate Reliability (acceptable, but some variability between forms)
0.40 ≤ r < 0.60: Low Reliability (the forms may not be measuring the same construct consistently)
r < 0.40: Very Low Reliability (the forms are likely not equivalent in measuring the construct)

If you calculate r = 0.75, this suggests moderate equivalence, meaning that the two forms of the test are fairly similar but not identical in terms of their measurement accuracy.

TEST-RETEST WITH EQUIVALENT FORMS (TEST-RETEST+)
✓ Procedure: In test-retest with equivalent forms, give parallel forms of the test with an increased time interval between forms.
✓ Type of Reliability Measure: Measure of Stability and Equivalence
✓ Statistical Treatment: Pearson r

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

✓ r = +1.00: Perfect Positive Correlation. The test scores are completely stable over time, and the two parallel forms are fully equivalent, meaning the test measures the same construct consistently across time and forms.
✓ r = 0: No Correlation. There is no relationship between the scores from the two test administrations, indicating poor stability and/or equivalence.
✓ r = −1.00: Perfect Negative Correlation. This suggests that higher scores on one test form correspond to lower scores on the other, indicating poor reliability and equivalence.

GENERAL GUIDELINES FOR RELIABILITY INTERPRETATION (TEST-RETEST+)
r ≥ 0.80: High Reliability (indicates both strong stability over time and strong equivalence)
0.60 ≤ r < 0.80: Moderate Reliability (suggests some consistency over time and between forms, but some variability exists)
0.40 ≤ r < 0.60: Low Reliability (the test forms may not be fully equivalent or the test may not be stable over time)
r < 0.40: Very Low Reliability (the test is neither stable over time nor equivalent)

If r = 0.45, this suggests low stability and/or equivalence, indicating that the test scores fluctuate over time or the parallel forms are not fully equivalent.

SPLIT HALF
✓ Procedure: In split half, give a test once. Score equivalent halves of the test, e.g., odd- and even-numbered items.
✓ Type of Reliability Measure: Measure of Internal Consistency
✓ Statistical Treatment: Pearson r and the Spearman-Brown Formula

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

$$r_{SB} = \frac{2r}{1 + r}$$

✓ r = +1.00: Perfect Positive Correlation. Both halves of the test are highly consistent in measuring the same construct.
✓ r = 0: No Correlation. The two halves measure unrelated constructs.
✓ r = −1.00: Perfect Negative Correlation. High scores on one half correspond to low scores on the other, indicating inconsistency.

SPEARMAN-BROWN ADJUSTED RELIABILITY
r_SB ≥ 0.80: High Reliability (the test items are internally consistent and measure the same construct well)
0.60 ≤ r_SB < 0.80: Moderate Reliability (the two halves are somewhat consistent, but some variability exists)
0.40 ≤ r_SB < 0.60: Low Reliability (the two halves may not be measuring the same construct consistently)
r_SB < 0.40: Very Low Reliability (the two halves are largely inconsistent)

Suppose r = 0.70. Use the Spearman-Brown formula to estimate the reliability of the entire test:

$$r_{SB} = \frac{2r}{1 + r} = \frac{2(0.70)}{1 + 0.70} = \frac{1.40}{1.70} \approx 0.82$$

In this example, r_SB = 0.82, which indicates high internal consistency, meaning the test items are highly consistent in measuring the same construct.

KUDER-RICHARDSON
✓ Procedure: In Kuder-Richardson, give the test once, then correlate the proportion/percentage of the students passing and not passing a given item.
✓ Type of Reliability Measure: Measure of Internal Consistency
✓ Statistical Treatment: Kuder-Richardson Formula 20 and 21

$$KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma^{2}}\right) \qquad KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k - M)}{k \cdot \sigma^{2}}\right)$$

Here k is the number of items, p and q are the proportions of students who answered an item correctly and incorrectly, σ² is the variance of the total test scores, and M is the mean score.

✓ The KR-20 formula is more precise because it takes into account the proportion of students passing and failing each item and the variance of the total test scores.
✓ The KR-21 formula is a simplified version of KR-20. It is used when all items are assumed to have equal difficulty (or when the item variances are not available). It provides an estimate of reliability but is less accurate than KR-20 when item difficulty varies widely.

INTERPRETATION OF KR-20 AND KR-21 RESULTS
KR ≥ 0.80: High Internal Consistency (the test items work together well to measure the same construct)
0.60 ≤ KR < 0.80: Moderate Internal Consistency (the items are somewhat consistent, but some improvement could be made)
0.40 ≤ KR < 0.60: Low Internal Consistency (the items may not be effectively measuring the same construct)
KR < 0.40: Very Low Internal Consistency (the items likely measure different constructs, and the test may need substantial revision)

KR-20 Example
A test has 20 items. For each item, calculate p (the proportion of students who answered correctly) and q (the proportion who answered incorrectly). Suppose the sum Σpq = 0.15 and the total test score variance σ² = 0.20.

$$KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma^{2}}\right) = \frac{20}{20-1}\left(1 - \frac{0.15}{0.20}\right) \approx 0.263$$

In this case, KR-20 ≈ 0.263, which falls below 0.40 and therefore indicates very low internal consistency.

KR-21 Example
A test has 20 items. Suppose the total test score variance σ² = 5 and the mean score M = 12.

$$KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k - M)}{k \cdot \sigma^{2}}\right) = \frac{20}{20-1}\left(1 - \frac{12(20 - 12)}{20 \cdot 5}\right) \approx 0.042$$

This gives a very low internal consistency of KR-21 ≈ 0.042, suggesting the test items are not consistent.

KR-20 is preferred when item difficulty varies.
KR-21 is a quicker, rougher estimate when item difficulty is assumed to be consistent.

IMPROVING TEST RELIABILITY
✓ Test Length: In general, a longer test is more reliable than a shorter one because longer tests sample the instructional objectives more adequately.
✓ Spread of Scores: The type of students taking the test can influence reliability. A group of students with heterogeneous ability will produce a larger spread of test scores than a group with homogeneous ability.
✓ Item Difficulty: In general, tests composed of items of moderate or average difficulty (0.30 to 0.70) will contribute more to reliability than those composed primarily of easy or very difficult items.
✓ Item Discrimination: In general, tests composed of more discriminating items will have greater reliability than those composed of less discriminating items. A discriminating item can differentiate high-performing from low-performing students.
✓ Time Limits: Adding a time factor may improve reliability for lower-cognitive test items. Since not all students function at the same pace, a time factor adds another criterion to the test that causes discrimination, thus improving reliability. Teachers should not, however, arbitrarily impose a time limit. For higher-level cognitive test items, the imposition of a time limit may defeat the intended purpose of the items.

ITEM ANALYSIS
This refers to the process of examining the students' responses to each item in the test. An item may have (1) desirable or (2) undesirable characteristics. Desirable items are retained for subsequent use, while undesirable items are revised or rejected.

Three Criteria in Determining the Desirability of an Item
✓ Difficulty of an Item
✓ Discriminating Power of an Item
✓ Measures of Attractiveness

ITEM DIFFICULTY
The percent of students who answer a particular test item correctly.
$$\text{Difficulty Index } (DI) = \frac{\text{number of students who got the item correct}}{\text{total number of students}} = \frac{CO}{TO}$$

Index | Range | Difficulty Level
0% - 20% | 0.00 - 0.20 | Very Difficult
21% - 40% | 0.21 - 0.40 | Difficult
41% - 60% | 0.41 - 0.60 | Moderately Difficult
61% - 80% | 0.61 - 0.80 | Easy
81% - 100% | 0.81 - 1.00 | Very Easy
Acceptable Difficulty Index: 26% - 75% or 0.26 - 0.75

Problem 01: There are 50 students who answered item x, 30 of whom answered the item correctly. What is the difficulty index?
Answer: 30/50 = 60% or 0.60 (Moderately Difficult)

Problem 02: Get the difficulty index of each item.
Question | A | B | C | D
1 | 0 | 3 | 24* | 3
2 | 12* | 13 | 3 | 2
*denotes correct answer
Answer: Each item was answered by 30 students. Q1: 24/30 = 80% or 0.80 (Easy); Q2: 12/30 = 40% or 0.40 (Difficult)

DISCRIMINATION INDEX
It is the difference between the proportion of high-performing students who got the item right and the proportion of low-performing students who got the item right. The discrimination index is the degree to which the item discriminates between the high-performing and low-performing groups (upper and lower 27%).

$$D = DI_{\text{upper group}} - DI_{\text{lower group}}$$

Discrimination Index | Range | Item Evaluation
40% and above | 0.40 and above | Very Good Item
30% - 39% | 0.30 - 0.39 | Reasonably Good Item (but possibly subject to improvement)
20% - 29% | 0.20 - 0.29 | Marginal Item (usually needing and being subject to improvement)
19% and below | 0.19 and below | Poor Item (to be rejected or improved by revision)
Acceptable Discrimination Index: 20% and above or 0.20 and above

TYPES OF DISCRIMINATION
✓ Positive Discrimination: The proportion of students who got an item right in the upper group is greater than in the lower group.
✓ Negative Discrimination: The proportion of students who got an item right in the lower group is greater than in the upper group.
✓ Zero Discrimination: The proportions of students who got the item right in the upper-performing group and the low-performing group are equal.
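The difficulty and discrimination computations can be sketched in Python as follows. This is only an illustration under stated assumptions: all function names are hypothetical, indices are rounded to two decimals before being matched to the table bands, and the upper-group and lower-group counts in the last example are made up for demonstration.

```python
def difficulty_index(num_correct, num_students):
    """Proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    return num_correct / num_students


def difficulty_level(di):
    """Classify a difficulty index using the table bands."""
    di = round(di, 2)  # assumption: compare at two-decimal precision
    if di <= 0.20:
        return "Very Difficult"
    if di <= 0.40:
        return "Difficult"
    if di <= 0.60:
        return "Moderately Difficult"
    if di <= 0.80:
        return "Easy"
    return "Very Easy"


def discrimination_index(upper_correct, upper_size, lower_correct, lower_size):
    """D = DI of the upper group minus DI of the lower group."""
    return (difficulty_index(upper_correct, upper_size)
            - difficulty_index(lower_correct, lower_size))


def item_evaluation(d):
    """Classify a discrimination index using the table bands."""
    d = round(d, 2)
    if d >= 0.40:
        return "Very Good Item"
    if d >= 0.30:
        return "Reasonably Good Item"
    if d >= 0.20:
        return "Marginal Item"
    return "Poor Item"


def discrimination_type(d):
    """Positive, negative, or zero discrimination."""
    d = round(d, 2)
    if d > 0:
        return "Positive"
    if d < 0:
        return "Negative"
    return "Zero"


# Problem 02 from the slides: option counts per question and the keyed answer.
items = {
    1: ({"A": 0, "B": 3, "C": 24, "D": 3}, "C"),
    2: ({"A": 12, "B": 13, "C": 3, "D": 2}, "A"),
}
for number, (counts, key) in items.items():
    di = difficulty_index(counts[key], sum(counts.values()))
    print(number, round(di, 2), difficulty_level(di))

# Hypothetical item: 9 of 10 upper-group and 4 of 10 lower-group students correct.
d = discrimination_index(9, 10, 4, 10)
print(round(d, 2), item_evaluation(d), discrimination_type(d))
```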
CASES OF ZERO DISCRIMINATION
✓ Case 01: An equal number of students from the lower and upper groups got the item right.
✓ Case 02: The item is too easy; all students got the item right.
✓ Case 03: The item is too difficult; no student got the item right.

Problem 01: Thirty students are divided into two groups: 15 students in the lower group and 15 students in the upper group. In the upper group, 12 students answered item x correctly, whereas in the lower group only 6 students answered item x correctly. What is the discrimination index of item x?

$$D = DI_{\text{upper group}} - DI_{\text{lower group}} = \frac{12}{15} - \frac{6}{15} = 0.80 - 0.40 = 0.40$$

Item x is a very good item.

WHEN TO RETAIN, REVISE, OR REJECT ITEMS?
DIFFICULTY INDEX (0.26 - 0.75) | DISCRIMINATION INDEX (0.20 and above) | DECISION
YES | YES | RETAIN
YES | NO | REVISE
NO | YES | REVISE
NO | NO | REJECT

MINOR PRINCIPLES OF ASSESSMENT
✓ Administrability: The test should be easy to administer; the directions should clearly indicate how students should respond to the test or task items and how much time they should spend on each test item or on the whole test.
✓ Scorability: The test should be easy to score; the directions for scoring are clear and the points for each correct answer are specified.
✓ Interpretability: Test scores can easily be interpreted and described in terms of the specific tasks that a student can perform or their relative position in a clearly defined group.
✓ Economy: The test should be inexpensive to give in terms of the time and effort spent on its administration, and answer sheets must be provided so the test can be given from time to time.

MEASURES OF CENTRAL TENDENCY
✓ It is a single value that is used to identify the center of the data.
✓ It is a single value used to describe the performance of the group.
✓ It is thought of as the typical value in the set of scores.
✓ It tends to lie within the center of the scores when they are arranged from lowest to highest or vice versa.

MMM: MEAN, MEDIAN, MODE

MEAN
✓ The mean represents the score of the group as a whole.
It is the most stable and reliable measure of central tendency and refers to the arithmetic average. The mean is used when the data are at the interval or ratio level of measurement and the frequency distribution is regular, symmetrical, or normal (that is, when there are no outliers). It is very easy to compute but is easily affected by outliers (very high or very low scores). It is used to compute other measures such as the standard deviation, coefficient of variation, skewness, and z-score.

$$\bar{x} = \frac{\sum x}{n}$$

When to use the Mean?
✓ When it is desired to give each score equal weight in determining the central tendency
✓ When it is desired to find the measure of central tendency which has the highest reliability
✓ When it is desired to compute the standard deviation and the coefficient of correlation later on

Problem: Fifteen students took a 25-item quiz in Mathematics 1. The highest score is 25 and the lowest score is 10. Here are the raw scores: 25, 20, 18, 18, 17, 15, 15, 14, 14, 13, 12, 12, 12, 10, 10. Find the mean of the scores.

MEDIAN
✓ Median refers to the centermost score when the scores in the distribution are arranged according to magnitude. It is used when the middlemost score is desired, when the data are at the ordinal level of measurement, when the frequency distribution is irregular or skewed, and when there are extreme scores. The median is reliable when some scores are extremely high or low. It is not affected by extreme scores because it is a positional measure. However, it may not be an actual observation in the data set.

When to use the Median?
✓ When a quick and easily computed measure of central tendency is desired
✓ When there are extreme scores, such as a few very high scores or a few very low scores, which could affect the mean disproportionately

Median of Ungrouped Data
✓ Arrange the scores from lowest to highest or highest to lowest.
✓ Determine the middle score in the distribution if n is an odd number, and get the average of the two middle scores if n is an even number.
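As a quick check on the mean and median procedures above, here is a minimal Python sketch applied to the Mathematics 1 quiz scores. The function names are hypothetical, and the second list, which swaps the top score for an invented outlier of 100, is an assumption added only to show how differently the two measures react to an extreme score.

```python
def mean(scores):
    """Arithmetic average: sum of the scores divided by n."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def median(scores):
    """Middle score of the ordered data (average of the two middle scores if n is even)."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Raw scores from the Mathematics 1 problem (n = 15).
quiz_scores = [25, 20, 18, 18, 17, 15, 15, 14, 14, 13, 12, 12, 12, 10, 10]
print(mean(quiz_scores), median(quiz_scores))  # prints: 15.0 14

# Hypothetical variation: replace the top score (25) with an outlier of 100.
with_outlier = [100] + quiz_scores[1:]
print(mean(with_outlier), median(with_outlier))  # prints: 20.0 14
```

The outlier pulls the mean from 15 up to 20 while the median stays at 14, which is the behavior the slides describe for a positional measure.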
MODE
✓ Mode refers to the score/s that occur most frequently or with the greatest concentration in the score distribution. It is used when the data are at the nominal level of measurement, when a quick answer is needed, and when the score distribution is normal. The mode can be used for qualitative as well as quantitative data. It may not be unique, and at times it may not exist. The mode is not affected by extreme values.

Normal/symmetrical distribution: mean and mode
Irregular/skewed distribution: median

When to use the Mode?
✓ When it is desired to find the score that occurs most often
✓ When it is desired to find the measure of central tendency that has the greatest concentration

Problem: Find the mode and identify its classification.
Section A: 25, 24, 24, 20, 20, 20, 16, 12, 10, 7
Section B: 25, 24, 24, 20, 18, 18, 17, 10, 9, 7

MEASURES OF VARIABILITY
✓ It is a single value that is used to describe the spread of scores in a distribution, that is, how far the scores lie above or below the measure of central tendency.

RQS: RANGE, QUARTILE DEVIATION, STANDARD DEVIATION

RANGE
✓ Range is the difference between the highest score and the lowest score in the data set. It is used when the distribution is normal, when the data are at the interval or ratio level of measurement, and when a quick answer is needed. It is a rough estimate of variation or dispersion. Only two scores are needed to compute its value, and it is very easy to compute. Just like the mean, the range can be easily affected by extreme scores.

$$R = \text{Highest Score} - \text{Lowest Score}$$

QUARTILE DEVIATION
✓ Quartile Deviation refers to the average deviation of the third and the first quartile from the value of the median. It indicates the distance we need to go above and below the median to include the middle 50% of the scores. It is based on the range of the middle 50% of the scores, instead of the range of the entire set. It is used when the data are at the ordinal level of measurement and the score distribution is irregular or skewed.
It reduces the influence of extreme scores since it considers only the middle 50% of the scores. It is not as easy to calculate as the range.

$$QD = \frac{Q_3 - Q_1}{2}$$

STANDARD DEVIATION
✓ Standard Deviation is the most stable and important measure of variability and is commonly used, particularly in research. It refers to the average distance of the scores from the mean. It is used when the data are at the interval or ratio level of measurement and when the distribution is normal.

$$\sigma = SD = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}}$$

MEASURES OF VARIABILITY: INTERPRETATION
✓ Small Variability: closer, clustered, homogeneous, scores are less varied, scores are tightly bunched together, concentrated
✓ Large Variability: dispersed, scattered, spread apart, far from each other, heterogeneous, scores are varied

MEASURES OF SKEWNESS
✓ Skewness describes the degree of departure of the scores from symmetry. A symmetric distribution is one in which the frequency of high scores is equal to the frequency of low scores. Skewness tells us about the performance of the students.

$$SK = \frac{3(\text{mean} - \text{median})}{SD}$$

CLASSIFICATIONS OF SKEWNESS
✓ Positively Skewed: SK > 0
✓ Negatively Skewed: SK < 0
✓ Normally Distributed: SK = 0

POSITIVELY SKEWED
mean > median > mode
✓ It is a distribution where the thin end (tail) of the graph goes to the right part of the curve. This happens when most of the students' scores are below the mean. It tells you only that the takers performed poorly, not the reason why they did poorly in the examination.
✓ Reasons for poor performance: (1) ineffective teaching method and instruction, (2) students' unpreparedness, (3) test items are very difficult, (4) not enough time to answer the test items.
Mnemonic: PR-DL (Positive: Right tail; Difficult test, Low scores)

NEGATIVELY SKEWED
mean < median < mode
✓ It is a distribution where the thin end (tail) of the graph goes to the left part of the curve. This happens when most of the students' scores are above the mean.
It tells you only that the takers performed excellently, not the reason why they did well in the examination.
✓ Reasons for high scores: (1) students are smart, (2) enough time to finish the examination, (3) test items are very easy, (4) effective instruction, (5) students have prepared for the examination.
Mnemonic: NL-EH (Negative: Left tail; Easy test, High scores)

NORMAL DISTRIBUTION
mean = median = mode
✓ It is a special kind of symmetric distribution that can be determined using the values of the mean and standard deviation.

Properties of Normal Distribution
✓ The curve has a single peak, meaning the distribution is unimodal.
✓ It is a bell-shaped curve.
✓ It is symmetrical about the mean.
✓ The end tails of the curve can be extended indefinitely on both sides.
✓ The shape of the curve depends on the values of the mean and standard deviation.

MEASURES OF RELATIVE POSITION
✓ It indicates where a score is in relation to all other scores in the distribution. These measures make it possible to compare the performance of an individual on two or more different tests.

PZTS: PERCENTILE, Z-SCORE, T-SCORE, STANINE

ASSESSMENT APPROACHES: NORM-REFERENCED AND CRITERION-REFERENCED

NORM-REFERENCED
✓ Norm-referenced assessments compare an individual's performance to a group, usually a national or peer-based average. These assessments rank students.
✓ There is competition for a limited percentage of high scores, and some examinees are certain to pass.

CRITERION-REFERENCED
✓ Criterion-referenced assessments measure a student's performance against a set of specific criteria or standards, without comparing them to others.
✓ There is no competition for a limited percentage of high scores; all or none may pass.

PERCENTILE RANK
✓ Percentile Rank is the percentage of the scores in the frequency distribution which are lower. That is, it is the percentage of the examinees in the norm group who scored below the score of interest (Crocker & Algina, 1986).
It is used to clarify the interpretation of scores on standardized tests.

Percentile Rank | Descriptive Terms
96 or above | Very High; Superior
86 - 95 | High; Excellent
76 - 85 | Above Average; Good
26 - 75 | Average; Satisfactory or Fair
16 - 25 | Below Average; Slightly Weak
6 - 15 | Low; Weak
5 or below | Very Low; Very Weak

Example: Maria's raw score in English class is 82, which is equal to the 90th percentile. This means that 90% of Maria's classmates got a score lower than 82. Maria surpassed 90% of her classmates. She belongs to the upper 10% of the class.

Z-SCORE
✓ Z-score is the number of standard deviation units a score is above or below the mean of a given distribution. A positive z-score measures the number of standard deviations a score is above the mean, while a negative z-score measures the number of standard deviations a score is below the mean.

$$z = \frac{x - \mu}{SD}$$

Problem: Jemar had a score of 59. He is compared with other examinees; the mean and standard deviation of this group are 50 and 5, respectively. What is Jemar's z-score?

$$z = \frac{x - \mu}{SD} = \frac{59 - 50}{5} = \frac{9}{5} = 1.8$$

T-SCORE
✓ T-score tells the location of a score in a normal distribution having a mean of 50 and a standard deviation of 10.

$$T = 10z + 50$$

T > 50: excellent performance
T < 50: poor performance

Problem: Ben had a score of 49. He is compared with other examinees; the mean and standard deviation of this group are 42 and 6, respectively. What is Ben's T-score?

$$z = \frac{x - \mu}{SD} = \frac{49 - 42}{6} = \frac{7}{6} \approx 1.17$$

$$T = 10(1.17) + 50 = 11.7 + 50 = 61.7$$

Since 61.7 > 50, Ben's performance is excellent.

STANINES
✓ Stanine, also known as standard nine, tells the location of a raw score in a specific segment of a normal distribution that is divided into nine segments, numbered from low (1) to high (9).

$$s = 1.96z + 5$$

Problem: Bryan had a z-score of 1.9. What is Bryan's stanine score?

$$s = 1.96z + 5 = 1.96(1.9) + 5 = 3.724 + 5 = 8.724 \approx 9$$

STANINE | % IN STANINE | DESCRIPTION
1 | 4% | Very Poor
2 | 7% | Poor
3 | 12% | Below Average
4 | 17% | Slightly Below Average
5 | 20% | Average
6 | 17% | Slightly Above Average
7 | 12% | Above Average
8 | 7% | Superior
9 | 4% | Very Superior

MEASURES OF RELATIONSHIP
✓ This measures the degree of relationship or correlation between two variables (e.g., academic achievement and motivation). The relationship between the results of two administrations of a test determines the reliability of the instrument. The greater the degree of relationship, the more reliable the test.
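Returning to the standard-score conversions worked out for Jemar, Ben, and Bryan, the three formulas can be sketched in Python. The function names are hypothetical. Two choices here are assumptions rather than anything stated in the slides: stanines are rounded half up to the nearest whole number, and the result is clamped to the 1 to 9 scale.

```python
import math


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (x - mean) / sd


def t_score(z):
    """T-score: rescales z to a mean of 50 and a standard deviation of 10."""
    return 10 * z + 50


def stanine(z):
    """Stanine from z using the slide formula s = 1.96z + 5."""
    raw = 1.96 * z + 5
    rounded = math.floor(raw + 0.5)  # assumption: round half up
    return max(1, min(9, rounded))   # assumption: clamp to the 1..9 scale


# Jemar: score 59, group mean 50, SD 5.
print(z_score(59, 50, 5))

# Ben: score 49, group mean 42, SD 6. The slides round z to 1.17 first;
# here z is kept unrounded and only the final T-score is rounded.
print(round(t_score(z_score(49, 42, 6)), 1))

# Bryan: z = 1.9.
print(stanine(1.9))
```

Keeping z unrounded until the last step avoids compounding rounding error; for Ben the result still rounds to the same one-decimal T-score as the worked solution.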