Educational Testing, Measurement, Assessment, and Evaluation Notes PDF

Summary

These notes provide definitions and explanations of testing, measurement, assessment, and evaluation in education. The document details various types of assessments and their purposes. It also discusses how these processes are used to make decisions about student learning.

Full Transcript

**TOPIC 1: INTRODUCTION TO EDUCATIONAL TESTING, MEASUREMENT, ASSESSMENT AND EVALUATION**

**DEFINITIONS OF IMPORTANT TERMS**

**TESTING**

- Is a systematic process of measuring achievement or attributes using a test.

**TEST**

- Is a type of measurement device in which students or other individuals are exposed to a set of tasks to obtain a mark or score.

**MEASUREMENT**

- Is a systematic process of assigning numbers to a person or object (which represent a quantity of an attribute of a person) according to a specific rule.
- Is a process used to determine the extent to which a certain characteristic is present in a learner; it can be qualitative or quantitative.

**EVALUATION**

- Is a process of making judgements about the value or worth of something, e.g. the quality of people's performance.
- Is a systematic process used to gather information in order to make judgements.
- Is a process of critically examining a problem and passing judgement about it.
- Is a process of collecting, analysing and interpreting data and then passing judgement or making decisions about that data.
- *Is a value judgement or decision based on the score or marks obtained by a person for his/her performance on a particular task.*

**ASSESSMENT**

- Is a process of collecting, analysing and interpreting data for the purpose of specifying and verifying problems and making decisions about learners.
- Is a process of collecting data for the purpose of specifying or verifying problems and making decisions about individual students. It answers the question: how well does the individual perform?
- Is a process that involves a variety of procedures used to gain insight into students' learning (for example observations, projects, assignments, research work, etc.) so that the teacher can formulate a value judgement (e.g. very poor, very good, excellent) concerning students' learning progress.
- Is a process of analysing and interpreting information about the behaviour of a learner in the affective, cognitive and psychomotor domains in relation to an established standard of achievement, e.g. A=Excellent, B=Good, C=Average, D=Below average. For example, 80% in *measurement* can be valued as Excellent in *assessment*.

**Distinction among measurement, assessment and evaluation**

| **Measurement** | **Assessment** | **Evaluation** |
|---|---|---|
| 20% | Very Poor | Recommend that the student should repeat. The teacher will conduct *re-teaching*. |
| 50% | Fair | Recommend that the student must be remediated. The teacher will conduct *remediation*. |
| 80% | Very Good | Recommend that the student must be given more challenging and complex work than the one he has just excelled in. The teacher will conduct *enrichment*. |

- When a teacher gives students a test (testing), students' scores, e.g. 60% in a C.A., are measurement; such marks provide data for the teacher to make judgements about the students' performance and the success or failure of the programme, e.g. poor, good (evaluation).
- Assessment entails the processes and procedures of testing, measurement and evaluation.

**Measurement**

- Is the process of determining the degree or extent to which a given characteristic is present in an object, person, people, system or event (Pandaeli, 1995).
- Is the assigning of numbers to the results of a test or other type of assignment according to a specific rule.

**Educational Measurement:** Educational measurement may be defined as the procedure for assigning numbers (usually scores) to a specified attribute or characteristic of a person in such a way that the numbers describe the degree to which the person possesses the attribute (Nitko, 2004).

**Attributes/characteristics that can be measured may be:**

1. concrete (physical), e.g. length, mass, volume, speed
2. abstract (non-physical), e.g. intelligence, achievement or performance, personality, motivation, attitudes, aptitudes or abilities, creativity, interests, etc.

**Non-physical measurement:** Is a process of assigning a number to individual members (objects, persons, people, systems or events) for the purpose of indicating differences among them in the degree to which they possess the characteristic being measured.

**NB:** Measurement involves quantitative expressions, such as allocating numbers (e.g. 60%, 70%, 80%, 90%) to the outcome of a test, to measure "how much" the student understood the topic, i.e. to measure achievement.

**Assessment:**

- Concerns the use of a variety of procedures to obtain information about student performance that will help in the formation of value judgements concerning learning progress.
- Is a process that involves a variety of procedures used to gain insight into students' learning (for example observations, projects, assignments, research work, etc.) so that the teacher can formulate a value judgement (e.g. very poor, very good, excellent) concerning students' learning progress.
- Is a process of analysing and interpreting information about the behaviour of a learner in the affective, cognitive and psychomotor domains in relation to an established standard of achievement, e.g. A=Excellent, B=Good, C=Average, D=Below average. For example, 80% in *measurement* can be valued as Excellent in *assessment*.
- Is a process of collecting data for the purpose of specifying or verifying problems and making decisions about individual students. It answers the question: how well does the individual perform?

**Educational Assessment:** It is a process of obtaining information that can be used to make decisions about students, curricula, educational programs and policies (Nitko, 2004). There are a number of tools and techniques that may be used to obtain the required information. These include informal observations, formal observations, pen-and-paper tests, students' performance on given tasks, research assignment reports, oral questions, analyses of students' records, etc.

**General Purposes for Assessment**

To:

a. determine the appropriateness/effectiveness of teaching methods
b. determine the grouping of students for more effective learning
c. determine students' level of readiness for the next learning experience
d. determine the extent to which learning objectives have been attained
e. determine the types or nature of learning difficulties students are encountering
f. provide feedback to students about their learning progress
g. direct students to the key learning outcomes they are expected to master
h. grade and credential/certificate students
i. motivate students to work harder
j. select students for admission into some programs (selection & placement)
k. help students identify possible career paths they may pursue (career guidance & counselling)
l. expose problems/weaknesses in learning (diagnosis)
m. reveal differences in quality
n. assist selection
o. maintain standards
p. test claims that people make
q. test how much is known about something supposedly learnt
r. monitor teaching
s. motivate pupils and teachers
t. measure specific abilities such as IQ, reading and writing skills, etc.
u. discriminate between children's abilities
v. predict the ability of individual children for particular courses/careers
w. select children for further education (Linn & Gronlund, 2002; Nitko, 2004)

**Types of Assessment**

| **Assessment Type** | **Purpose** |
|---|---|
| Baseline (Placement) | To determine student performance at the beginning of instruction |
| Formative | To monitor learning progress during instruction |
| Diagnostic | To diagnose (detect) learning difficulties during instruction |
| Continuous | To establish cumulative performance records for certification/credentialing at the end of instruction |
| Summative | To assess achievement at the end of instruction |

**Why Do We Need Educational Assessment?**

Ideally, educational assessments are more than just tests. When done well, they are powerful learning tools for students as well as evaluation tools for educators. Here are some of the benefits of a good educational assessment:

- It helps educators track students' progress so they can identify anyone who is struggling and provide remediation.
- It provides feedback to students about their own performance, which they can use to improve their knowledge and skills further.
- It motivates students, as they know they will be evaluated at the end of each module or course.
- It helps educators set learning objectives and outcomes and determine the best ways to help students reach their goals.
- It can be used to improve the curriculum.
- It can be used to evaluate teachers' and school systems' performance, as well as the effectiveness of different teaching practices.

**What Makes an Educational Assessment "Good"?**

Here are two basic principles of quality educational assessment:

1. Assessments must be based on defined objectives and outcomes: Before assessments can be used as a valid measuring tool for knowledge and skills, the desired knowledge and skills must first be clearly articulated.
2. Assessments must be valid: Validity refers to the extent to which an assessment actually measures the knowledge or skills it is supposed to. The reason there are so many types of assessments is that there is no one type that can validly measure all kinds of learning.

**Evaluation:**

- It is a process of collecting, analysing and interpreting data and then passing judgement or making decisions about that data.
- It is the process of assigning worth or value to the available information so as to enable the teacher to choose the best among available decision alternatives.
- It is making a judgement to determine the quality or worth of a performance. It answers the question: what is its worth? For example,
when the teacher says "this is a good mark", she has passed judgement on the mark. Evaluation can be a process of making a value judgement about the worth of a student's product or performance. Evaluation can be made using test scores, but it can also be done without using any objective information from assessments (Nitko, 2004). Evaluations are the bases for making decisions in education.

**General purposes of evaluation**

To:

a. determine the relative effectiveness of the programme in terms of students' behavioural output
b. make reliable decisions about educational planning
c. ascertain the worth of time, energy and resources invested in a programme
d. identify students' growth or lack of growth in acquiring desirable knowledge, skills, attitudes and societal values
e. help teachers determine the effectiveness of their teaching techniques and learning materials
f. help motivate students to want to learn more as they discover their progress or lack of progress in given tasks
g. encourage students to develop a sense of discipline and systematic study habits
h. provide educational administrators with adequate information about teachers' effectiveness and school needs
i. acquaint parents or guardians with their children's performances
j. identify problems that might hinder or prevent the achievement of set goals
k. predict the general trend in the development of the teaching-learning process
l. ensure an economical and efficient management of scarce resources
m. provide an objective basis for determining the promotion of students from one class to another as well as the award of certificates
n. provide a just basis for determining at what level of education the possessor of a certificate should enter a career

**Characteristics**

Evaluation is influenced by the results of measurement, e.g. students' marks. In education, it is done at two basic levels:

a. Programme level: evaluation is used to determine whether a programme has or has not been successfully implemented, by answering questions such as: Is the programme content of desirable quality? Are intended learning outcomes achieved? Does the programme continue to be effective?
b. Student level: evaluation is used to determine how well a student is performing in a programme. Through oral questions, paper-pencil tests, manipulative skills tests, discussions, tutorials, individualized instruction, assignments, projects, etc., the student is gradually guided towards a desired goal.

**Types of Evaluation**

There are two general types of evaluation:

**Formative evaluation**

This is making judgements about the quality or worth of instructional materials, teaching methods (instructional procedures), curricula, or educational programs at their design or development stage (Nitko, 2004). Teachers also do formative evaluation when reviewing lessons or learning materials based on information obtained from their previous use. Formative evaluation is a form of testing, measurement and evaluation done during the course of the program, for example weekly quizzes, end-of-month tests, exercises, group work, etc.

**Purposes of formative evaluation**

It:

a. provides continuous feedback to both students and teachers concerning learning successes and failures
b. provides reinforcement of successful learning and identifies specific errors that are in need of correction, thus providing a basis for modifying instruction
c. monitors students' progress and identifies learning errors.
**Summative evaluation**

- This is making judgements about the quality or worth of already-completed instructional materials, instructional procedures, curricula or educational programs in the attainment of set goals and objectives.
- Teachers normally participate in evaluations of syllabi and school programs.
- Summative evaluation is a form of testing, measurement and evaluation done at the end of the course of instruction or program, for example final examinations, end-of-term examinations, etc.
- It is the measurement of students' achievement at the end of an instructional program.
- It occurs at the end of the program or course and determines its overall effectiveness. Summative evaluation includes achievement tests, rating scales and evaluation of students' products.

**Purposes of summative evaluation**

It:

a. determines the extent to which the instructional goals have been achieved
b. is used to assign course grades/certification of students' achievement
c. provides information for judging the appropriateness of the course objectives and the effectiveness of the instruction
d. screens learners
e. groups learners, especially on a final performance
f. summarizes how well a group or a student has performed on a set of learning goals or objectives
g. provides information that teachers use to determine grades and inform students and their parents.

NB: If the assignment is meant for obtaining information about students' learning, for planning purposes, it is formative. If the assignment is to determine final achievement, it is summative.

**TOPIC 2: EDUCATIONAL TEST**

**The meaning of Educational Test**

- Pandaeli (1995) defines a test as an instrument, tool or systematic procedure for measuring how much an individual can perform tasks or solve problems by posing a set of questions, so that the teacher can then assign a score or mark.
- Pandaeli (1995) also defines a test as an assessment tool in which questions are asked of learners to determine their degree of learning.
- A test is an instrument or systematic procedure for observing and describing one or more characteristics of a student using a numerical scale or a classification scheme (Nitko, 2004).
- A test is a particular type of assessment that typically consists of a set of questions administered during a fixed period of time under reasonably comparable conditions for all students.
- An [educational test](https://www.proprofs.com/quiz-school/solutions/test-maker-software/) or exam is used to examine someone's knowledge of something to determine what they know or have learned. The goal of testing is to measure the level of skill or knowledge that has been acquired.

**Testing**

Testing is a systematic process of measuring achievement or other characteristics using a test. It is a systematic procedure for observing or describing traits of a learner with the help of a numerical scale. Simply put, it is the process of administering a test.

**TYPES OF TESTS**

**Diagnostic test**

a. used for determining the problems experienced by students in a particular learning area. Test results will indicate areas of difficulty for a student which require remedial teaching.
b. is a pre-test done before instruction, before the program is implemented, to establish students' strengths and weaknesses so that the teacher can determine the best way of helping each student according to their weaknesses and strengths.

**Achievement test**

a. can be standardized or teacher-made.
b. measures what a student has learnt in a specific subject area, e.g. a topic or unit test.
c. measures learners' ability in a specific skill and relates the results to learners of different ages.
d. used to measure academic level in a topic or course.
e. is a form of test in which questions centre around common errors made by learners.
f. shows the strengths and weaknesses of learners.
g. helps to identify areas of difficulty encountered by learners and enables the teacher to take appropriate corrective measures.
h. used to measure skills possessed by students at a point in time.
i. used to measure strengths and weaknesses of students when scores are used on an NRT.
j. when used with aptitude measures, achievement tests are used to evaluate the effectiveness of programmes or courses and identify students with learning difficulties.
k. used to ensure quality across different regions in a country with regard to critical skills taught, hence setting minimum academic standards for promotion and graduation.
l. used to ensure uniformity in content taught.
m. used to exempt students from instruction in subjects they have already learnt.
n. can function as diagnostic tests.

**Aptitude test**

a. also known as a proficiency test; measures how well a learner can perform a given task, e.g. interviews, entry examinations. The learner's talent is assessed and predictions made of the capacity to cope with the course.
b. measures a student's potential to perform a specific skill; can be used to predict areas of success in future.
c. tests a person's ability to carry out a task, or the potential to perform a task.
d. checks an inherent ability of an individual to acquire a skill or a particular type of knowledge.

**Attitude/personality test**

a. identifies the dominant traits of the learner so as to classify his or her personality and provide the kind of learning best suited for him or her.
b. may be employed to assess learners' interests, values, beliefs and general personality traits.

**Performance test**

a. made to measure motor skills such as manipulation of objects, e.g. practicals including **HE, AGRIC, PE, D&T**, etc.
b. evaluates both the effectiveness of the process or procedure and the product resulting from a task.

**Power test**

a. consists of items ordered in levels of difficulty in order to test students' knowledge. It has generous time limits to allow most students to attempt every item.

**Speed test**
a. aims to measure the speed and accuracy of students' performance within a severely restricted time.
b. test items are easy.

**Standardized test**

a. designed by test experts working with curriculum experts and teachers.
b. administered and scored under standard and uniform conditions to allow fair comparisons.

**PURPOSES (USES) OF EDUCATIONAL TESTS**

**At micro level (classroom level)**

To:

a. determine how much one has learnt, i.e. mastery of the instructional process.
b. detect learning errors, difficulties and misconceptions so that remediation can be applied.
c. evaluate teaching methods, teaching aids and the content taught.
d. draw the attention of teachers and learners to areas of content that need revision or re-teaching.
e. motivate learners to work harder.
f. predict learner performance in final exams.
g. influence choice of careers.
h. screen students for certain subjects.

**At macro level (national)**

To:

a. select students for placement into further studies or employment.
b. provide certificates to completers.
c. determine the achievement of educational goals.
d. provide information to parents, employers, etc.

**CHARACTERISTICS OF A GOOD TEST**

There are several characteristics or qualities of a good test:

a. ***Validity*** - A test is considered valid when it measures what it is supposed to measure.
b. ***Reliability*** - A test is reliable if it measures what it purports to measure consistently. On a reliable test, you can be confident that someone will get more or less the same score on different occasions or when it is used by different people.
c. ***Objectivity*** - A test is said to be objective if it is free from personal biases in interpreting its scope (coverage) as well as in scoring the responses. The test should be such that the resulting scores (marks) are not influenced by the scorer's judgement or opinion. This means that, in this respect, a test with objective items (e.g. multiple choice) is better than one with subjective items (essays).
d. ***Discriminating Power*** - The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. It should contain questions of different difficulty levels (see the sketch after this list).
e. ***Practicability/Usability*** - The practicability of a test depends upon: i) administrative ease, ii) scoring ease, iii) interpretative ease, iv) economic affordability (cost of testing), v) availability of equivalent or comparable forms.
f. ***Variety of Items/Tasks***
g. ***Error Free*** - A good test should be free from grammatical errors, typing errors, etc.
h. ***Measurability*** - It should measure the objectives to be achieved.
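To make the "Discriminating Power" characteristic above concrete, here is a minimal sketch (Python, with made-up scores; the data and variable names are illustrative, not from the original notes) of the commonly used item discrimination index, D = (number correct in the upper group - number correct in the lower group) / size of one group:

```python
# Item discrimination index sketch (hypothetical data, for illustration).
# Each pair: (student's total test score, 1/0 on the single item analysed).
results = [(92, 1), (85, 1), (81, 1), (74, 0), (70, 1),
           (65, 0), (58, 1), (51, 0), (44, 0), (30, 0)]

# Rank students by total score, then take the top and bottom thirds.
results.sort(key=lambda pair: pair[0], reverse=True)
n = len(results) // 3
upper, lower = results[:n], results[-n:]

correct_upper = sum(item for _, item in upper)
correct_lower = sum(item for _, item in lower)

# D ranges from -1 to +1: a higher D means the item better separates
# the upper group from the lower group.
D = (correct_upper - correct_lower) / n
print(f"Discrimination index D = {D:.2f}")  # -> 1.00 for this data
```

An item answered correctly mostly by high scorers (D near +1) discriminates well; an item answered equally often by both groups (D near 0) does not.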
**TEST VALIDITY**

**Types of Test Validity**

1. **Content Validity** - It refers to the degree to which a test (achievement test) represents the instructional objectives. It determines the degree of overlap between what has been taught and what is tested, i.e. comparing the content of the test with the objectives of the course. In content validity, the criterion is the judgement of subject-matter experts.

2. **Criterion-related validity** - Tests (i.e. psychological and educational tests) are validated by relating test scores to performance on some criterion/established standard. It has two components: predictive validity and concurrent validity.

   a. **Predictive validity:** is a measure of the ability of a test (aptitude test) to predict future behaviour. It measures the relationship between test performance and subsequent performance, e.g. a learner who performs well in PSLE is likely to perform well in JC. The criterion does not become available until some time after the test has been administered.

   b. **Concurrent validity:** is the relationship between a test (i.e. personality test) and other measures of the same behaviour. The criterion is either available at the time of testing or will become available at a later time. When the criterion measure (e.g. score, rating or classification) is available at the time of testing, then the concurrent validity of the test is determined. When junior secondary schools sit for the JC Examination, schools have already been given the academic ratings/standard criterion from the Botswana Examinations Council (BEC), such as A = 80 and above, B = 70 and above, C = 60 and above, D = 50 and above. They also have the rubric usually used by BEC. Schools use these academic ratings/standard criteria from BEC to grade students at school level during end-of-month and end-of-term examinations, as is the case during terminal/year-three final examinations. If students' scripts are marked using the same rubric from BEC, and students' marks/scores are rated using the same academic ratings/standard criteria from BEC, then we can validate tests at school level and assert that they are at the standard of a certain authority, in this instance BEC. If students use the rubric at school level, sit for exams under conditions the same as BEC's, and there is high agreement between the school's ratings/standard criteria and the JC Final Examination results, then we say the JC Examination has high concurrent validity. Students are made to write mock exams set using the standard of BEC, and if students pass those tests, we assume that they will pass PSLE, JC or COSC.

3. **Construct validity** - Construct validity is validating/verifying whether a student who passed a particular test indeed possesses the construct/trait/variable, i.e. how well performance on a test can measure or explain the traits possessed by a particular student. Construct validity is achieved by using the marks/scores of a student from a test that has been filtered through content validity and criterion-related validity, to check whether the marks/scores indeed reflect the amount of the construct/trait/variable possessed by the same student. For example, if a learner scores high in an IQ test, how well does the high score allow us to conclude that the learner indeed has the construct/trait/variable of high intelligence? The teacher may validate this by subjecting the same student to a very difficult test; if the student scores, say, 98% on that test while the entire class scores below 50%, we may conclude that the student is indeed intelligent. Similarly, if a learner scores high marks in the topic of Electronics in Physics, how well does the high score allow us to conclude that the learner has the construct/trait/variable of the ability to repair gadgets that require electronics knowledge, such as cellphones? The teacher may validate this by subjecting the student to a practical in which cellphones are presented to students to fix.
If students are able to fix the cellphones, then the test was valid; but if they are unable to, then the test they passed was not valid evidence that they have mastered methods of repairing cellphones in electronics. Likewise, if a learner scores high in a Guidance and Counselling test, how well does the high score allow us to conclude that the learner has the construct/trait/variable of the ability to guide and counsel in a real-life situation? The teacher may validate this by subjecting the student to a mock practical on Guidance and Counselling, as we did in Hall L16 during the drama/role-play presentations by various small groups. In those presentations your teacher was able to distinguish students with the ability to execute guidance and counselling skills from those with good content in their heads/marks but lacking the construct/trait/variable of good skill execution. Similarly, if a learner scores high in a Moral Education test, how well does the high score allow us to conclude that the learner has the construct/trait/variable of the ability to uphold high moral standards in a real-life situation? It may be worth observing the behaviour of a student who passed Moral Education or Religious Education with distinction, to check whether such marks indeed reveal the ability to exhibit or manifest the same construct/trait/variable of upholding good habits in real life. If students passed the Testing, Measurement and Evaluation module test and yet you are worried that they will not use or apply the module when they get to various secondary schools, then you measure the construct/variable/trait of the affective domain (attitude towards the module) by adding questions such as these to the test: What is your attitude towards the Testing, Measurement and Evaluation module? Should secondary school teachers apply the skills they learnt from the Testing, Measurement and Evaluation module? The two preceding questions are suited to measuring the attitude construct/variable/trait in an **ATTITUDE/PERSONALITY TEST**, which:

- identifies the dominant traits of the learner so as to classify his or her personality and provide the kind of learning best suited for him or her;
- may be employed to assess learners' interests, values, beliefs and general personality traits.

**FACTORS THAT CAN AFFECT THE VALIDITY OF A TEST**

a. **Factors in the test itself**

   i. the difficulty level of the language used in writing the test items: unclear items
   ii. vague test instructions
   iii. test items not linked to instructional objectives
   iv. test items that are too difficult; grammar/typing errors
   v. unclear directions
   vi. reading vocabulary and sentence structure that are too difficult
   vii. ambiguous statements in the test
   viii. inadequate time limits
   ix. dominance of lower-order thinking questions (recall, comprehension, etc.) over higher-order ones (application, analysis, evaluation and synthesis)
   x. test items inappropriate for the cognitive level they are supposed to measure (recall, comprehension, application, analysis, evaluation, synthesis)
   xi. poorly constructed items, e.g. items that may provide clues to the correct answer
   xii. a test that is too short
   xiii. improper arrangement of test items, i.e. placing difficult items at the beginning of the test
   xiv. an identifiable pattern of answers, i.e. placing correct answers in a systematic pattern, e.g. T, T, F, F; A, A, B, B, C, C, D, D; etc.

b. **Other factors**
   i. the teaching-learning process: the knowledge and skills students gained in class
   ii. administration and scoring of the test: insufficient time allowed, cheating by students, and errors in scoring have a detrimental effect on validity
   iii. students' personal factors: test anxiety, emotional disturbances such as grief, etc.
   iv. the nature of the group of students being tested: age, gender, ability level (Linn & Gronlund, 2000)

**NB:** These factors can strongly influence the performance of learners in the test.

**Ways in which the validity of a test may be improved**

a. match test items with the instructional objectives.
b. match test items with students' maturity.
c. ensure that the test items are matched to the objectives; this gives content validity.
d. ensure that the test items are set according to the cognitive levels of the content objectives covered.
e. ensure that the test covers a variety of test items. Usually **not** more than three items are set under each cognitive level.
f. vary test items across the levels of Bloom's taxonomy.
g. use easy language.
h. the language used in the test should be at the level of the learners.
i. instructions should be clearly stated.
j. the difficulty indices of the test items should be at intermediate values, e.g. starting with easy questions and ending with difficult ones (knowledge to evaluation).
k. validate the test by giving the test draft to a colleague(s) to correct or comment on the following: i. content, ii. arrangement of items, iii. variety of levels of items (cognitive domain levels), iv. language.
l. prepare the final version of the test by doing the following: i. make amendments if necessary, ii. arrange test items, iii. write instructions.
m. give clear instructions.
n. corrections should be made before students start writing.

**B: TEST RELIABILITY**

**Types of Test Reliability**

The common types are: test-retest reliability, parallel (equivalent) forms reliability, and split-half (internal consistency) reliability.

a. **Test-Retest Reliability**

The same test is administered to the same group of learners on two different occasions, and the correlation between the two sets of scores is computed. Its limitations include:

   i. **Practice effect:** learners might have practised the test items
   ii. **Memory effect:** learners might still remember how they solved the problems during the first administration of the test
   iii. **Differential learning:** learners might respond better to the test items because of maturation or learning other things

b. **Parallel (Equivalent) Forms Reliability**

Two tests that are equivalent (i.e. containing the same kinds of items of equal difficulty from the same content area, objectives covered, item format and time allowed) are administered on two different occasions to a group of learners, and the correlation between the two sets of scores is computed.

**Limitation of Parallel Forms Reliability**

a. It is not easy to ensure that the test items are really parallel.

c. **Split-Half Reliability**

A test is administered once to a group of learners, but when scoring it is divided into two parts: (a) scores of odd-numbered items and scores of even-numbered items, (b) using the top half and bottom half, (c) random selection, etc. The scores of the two halves are then correlated and the reliability of the test is computed.

**Limitation of Split-Half Reliability**

a. It is difficult to make the two halves of the test equivalent or parallel.

**Random errors** could be a result of:

a. fatigue/tiredness
b. time of the test (e.g. morning or evening)
c. practice effect
d. level of motivation of the learners in the subject
e. socio-cultural variables
f. testing conditions
g. quality of marking
h. marking errors
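The split-half procedure described above can be expressed in a few lines. Here is a minimal sketch (Python; the item-score matrix is hypothetical): each learner's odd-item and even-item half totals are correlated with Pearson's r, and the half-test correlation is stepped up to full-length reliability with the standard Spearman-Brown correction, r_full = 2r / (1 + r).

```python
# Split-half reliability sketch: hypothetical item-score matrix,
# one row per learner, one column per test item (1 = correct, 0 = wrong).
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Each learner's total on odd-numbered items vs even-numbered items.
odd_totals = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

r_half = pearson_r(odd_totals, even_totals)
# Spearman-Brown correction: estimated reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, split-half reliability = {r_full:.2f}")
```

The Spearman-Brown step also explains the first improvement strategy below: lengthening a test (with comparable items) generally raises its reliability.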
**FACTORS THAT CAN AFFECT THE RELIABILITY OF A TEST**

- The length of the assessment: a longer assessment generally produces more reliable results.
- The suitability of the questions or tasks for the students being assessed.
- The phrasing and terminology of the questions.
- The consistency in test administration: for example, the length of time given for the assessment and the instructions given to students before the test.
- The design of the marking schedule and the moderation of marking procedures.
- The readiness of students for the assessment: for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.

**Improving the Reliability of a Test**

a. **Increase the number of test items** - A test with many items is generally more reliable than a shorter test because it provides a potentially more representative sample of a learner's performance. Fewer test items might result in ties among learners (i.e. learners getting the same scores). This would result in a low standard deviation and hence low test reliability.

b. **Increase the discrimination indices of the test items** - Setting item difficulty at intermediate levels would increase the standard deviation of the test scores and hence increase the reliability of the test.

c. **Increase the homogeneity of the test** - A test that measures only learners' mastery of Instructional Design would be more reliable than an equivalent test that attempts to measure learners' mastery of Teacher Education, Human Learning, Curriculum Design, Sociology of Education, etc.

d. **Control the conditions of test administration** - If the conditions vary widely from testing to testing, the results of the test will reflect these variations and hence will be less reliable than if conditions remained the same. The testing conditions should avoid creating extra and unnecessary stress for learners beyond that normally experienced in a test situation. Distractions should be kept to a minimum and the teacher's direct intervention should be kept to a minimum. The physical environment (seating, lighting, ventilation, temperature, etc.) should be made as comfortable as possible.

**TOPIC 3: CATEGORIES OF TESTING / STYLES OF TESTING**

**NORM-REFERENCED TESTING (NRT)**

Norm-referenced testing is where students compete against the marks of other students instead of competing against a set standard.

**Characteristics of a Norm-Referenced Test**

a. it compares a person's score against the scores of a group of people who have already taken the same exam, called the "norming group".
b. scores form a normal distribution curve in which the mean is 50%.
c. it uses broad content, with a large amount of content covered in a given time, e.g. testing 20 objectives in NRT versus two or more objectives in CRT.
d. it favours teacher-centred approaches.
e. test items are chosen to promote variance (the spread of scores).
f. scores may be reported as percentile ranks, grade equivalents, stanines or normal curve equivalents (see the sketch below).
g. the test may be biased towards a certain ability group (average performers) to yield a normal distribution.
h. there is usually a time limit for the test.
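As referenced in characteristic (f) above, here is a minimal sketch (Python, with an invented norming group) of one common norm-referenced report, the percentile rank: the percentage of the norming group scoring below a given raw score, counting ties as half.

```python
# Norm-referenced reporting sketch: percentile rank of a raw score
# within a hypothetical norming group of 15 learners.
norm_group = [34, 41, 45, 50, 52, 55, 55, 60, 63, 67, 70, 74, 78, 81, 88]

def percentile_rank(score, group):
    """Percent of the norming group below the score, counting ties as half."""
    below = sum(1 for s in group if s < score)
    ties = sum(1 for s in group if s == score)
    return 100.0 * (below + 0.5 * ties) / len(group)

# A raw score of 55: 5 learners scored below it and 2 tied with it,
# so its percentile rank is (5 + 1) / 15 = 40.0.
print(percentile_rank(55, norm_group))  # -> 40.0
```

The point of the sketch is that the reported number says nothing about what the learner can do; it only locates the score within the norming group, which is exactly the NRT/CRT contrast drawn in this topic.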
**Advantages of a Norm-Referenced Test**

a. it is useful for testing and evaluating course objectives.
b. it stimulates competition amongst learners, which can result in hard work.
c. it is good for assessing a wide range of abilities in a large group of learners.
d. it is used for classifying or selecting learners for rewards, further schooling or places, because learners are rank-ordered.
e. it compares students' performance.
f. it is cheap and easy to administer.
g. students may not be disadvantaged by poor instruction.

**Disadvantages of a Norm-Referenced Test**

a. it is difficult to set a test on all the objectives covered in a term or year; therefore NRT samples some objectives and leaves others untested.
b. it does not emphasise the mastery of skills and content; rather, the teacher teaches to cover the syllabus.
c. it discriminates between learners, so the competition between low and high achievers may have far-reaching effects on the teaching and learning process, e.g. low morale, animosity between students, school drop-outs.
d. the time spent on all the topics is the same for all learners and the content is covered at the same pace; therefore there is no room for remediation.
e. the grades do not say what the learners can or cannot do, so NRT can give unreliable information for career choices.
f. it is fair but less informative regarding which parts of the curriculum were attained.
g. it can only measure a limited part of a subject area.
h. there is too much focus on memorization and routine procedures.
i. teachers may have lower expectations of lower achievers.
j. scores do not tell us what students learnt.

**Uses of a Norm-Referenced Test at Classroom Level (Micro Level)**

a. grading
b. ranking positions
c. selection for rewards
d. measuring general ability in certain subject areas
e. assessing a range of abilities in a large group
f. selecting top candidates for limited spaces or opportunities

**Uses of a Norm-Referenced Test at Macro Level (National Level)**

a. grading
b. certification
c. ranking of schools
d. selection for employment and for further education

**CRITERION-REFERENCED TESTING (CRT)**

It is a system of mark interpretation in which a learner's mark is interpreted by comparing it with, or referencing it to, a criterion or absolute (set) standard.

**Characteristics of a Criterion-Referenced Test**

a. it is designed to measure the mastery of tasks in a given domain. It is a test in which a teacher measures the learner's ability to meet a certain standard, e.g. if the set standard is 80%, the teacher will consider the work well done if the learner scores 80% or above.
b. the learner's performance is interpreted in relation to the set standard.
c. it focuses on a narrow area or domain of the syllabus, enabling testing to be more frequent than in NRT; this produces reliable Continuous Assessment.
d. a representative sample of objectives should be indicated on the Test Blue Print (Table of Specifications), because the test should be valid (cover stated objectives) and reliable (students should obtain nearly the same score on a similar test).
e. it provides useful information on learners' ability in particular content areas.
f. it focuses on instructional objectives and is related to learner-centred approaches.
g. it diagnoses learning shortcomings; learners are given more practice until they succeed.
h. it is useful in predicting the performance of learners in the final examinations.
i. it encourages cooperative learning, because learners are not compared against each other as in NRT, and the information is useful in making decisions such as: grading (A, B, C, D, Fail); recommendation for special-needs education; recommendation for guidance and counselling; recommendation for enrichment; recommendation for remediation; etc.
j. the information is also useful for decisions such as selection of learners for promotion to upper classes/further studies, selecting students who are ready for extra work (*enrichment*), and helping slow learners to catch up (*remediation*).
k. it measures how well test takers have mastered a particular body of knowledge or set of skills.
l. test items are chosen to reflect the criterion behaviour.
m. scores are used to sort students into categories.
n. the test is not biased towards any ability group; the focus is on the criterion behaviour.
o. there may be flexibility in the time given to complete the test.

**Advantages of a Criterion-Referenced Test**

a. scores or results indicate mastery of the subject area.
b. tests can be easily developed at classroom level.
c. students have a better understanding of how they are performing in class.
d. it is non-competitive.

**Disadvantages of a Criterion-Referenced Test**

a. due to the emphasis on the mastery of objectives, more time is spent on sharpening learners to achieve given tasks. This may mean that the syllabus is not covered on time.
b. the teacher may be over-burdened helping learners with different learning abilities.
c. paying more attention to low achievers may lead to high achievers being bored because of having less work to do.
d. it kills a competitive classroom climate, which has positive results such as hard work, recognition and best performance.
e. the test does not tell what the learner knows in relation to his/her peers.
f. scores or results cannot be generalised beyond a certain course or programme.
g. setting criteria for assessment requires considerable knowledge and teaching experience of the subject.
h. teaching may be focused only on the criteria or qualification skills.
i. students may be penalised for poor instruction.
j. it may not be applicable if few objectives have been covered.

**Uses of a Criterion-Referenced Test at Classroom Level (Micro Level)**

a. grading
b. selection
c. measuring mastery of skills
d. determining whether students have the prerequisite skills to start a new unit
e. assessing affective and psychomotor domain objectives
f. grouping students for instruction (learning activities)

**Uses of a Criterion-Referenced Test at Macro Level (National Level)**

a. certification
b. selection for further studies
c. employment

**Similarities Between Norm-Referenced and Criterion-Referenced Tests**

They both:

a. require specification of the achievement domain to be measured.
b. require a relevant and representative sample of test items.
c. use the same rules for item writing.
d. are judged by the same qualities of goodness (validity and reliability).
e. are useful in the educational assessment of learners.

**Differences Between Norm-Referenced and Criterion-Referenced Tests**

| **Norm-Referenced Test** | **Criterion-Referenced Test** |
|---|---|
| The learner competes with the marks of other students, not with a set standard. | The learner's performance is interpreted in relation to a set standard; the learner competes with the set standard. |
| Typically covers a large domain of learning tasks, with just a few items measuring each specific task. | Typically focuses on a limited domain of learning tasks, with a relatively large number of items measuring each specific task. |
| Emphasises discrimination amongst learners in terms of relative level of learning. | Emphasises description of what learning tasks a learner can or cannot perform. |
| Favours items of average difficulty and typically omits very easy or very hard items. | Matches item difficulty to the learning tasks, without altering item difficulty or omitting easy/hard items. |
| Interpretation of test results requires a clearly defined group. | Interpretation requires a clearly defined and delimited achievement domain. |

**TOPIC 7: TEST CONSTRUCTION AND ADMINISTRATION**

**Test Construction**

It is the process of setting test items from the content of a course module. Construction of test items needs a great deal of skill, time and planning. A well-constructed test should produce the expected results, which is why the test is seen as part of the classroom.

**Test Planning**

It involves the following:

a. outlining the content and specific learning objectives.
b. choosing what will be covered under each combination of content and specific learning objectives.
c. assigning a percentage of the total test to each content area and learning objective, and getting an estimate of the total number of items.
d. drawing a Table of Specifications or **Test Blue Print** (a detailed plan for the content coverage of a test).
e. choosing the type of item format to be used and an estimate of the number of such items per cell of the test blueprint.

**Sources of test items: Test Item Banks**

Test item banks are collections of test items organised according to types of questions, namely:

a. multiple choice test items
b. matching test items
c. completion test items
d. short answer (structured) test items
e. essay (free response) questions
f. etc.

Item banks make it easy to assemble a test.

**Test Assembling**

Test assembling involves:

a. packaging the test
b. reproducing the test

**Guidelines for packaging the test:**

a. group together all items of similar format
b. arrange items from easy to hard
c. space items well for easy reading
d. keep items and their options on the same page
e. place diagrams, maps, pictures or other illustrations above the items to which they refer
f. check that your answer keys form random patterns
g. determine how you want students to present answers
h. make sure your answer sheet or booklet has spaces for student name/candidate number, class/year group, etc.
i. make sure directions/instructions for each item format are very clear
j. proofread the test to correct grammatical and typing errors

**Guidelines for reproducing (duplicating) the test:**

a. find out if the duplicating machine is working
b. specify duplicating instructions
c. collect and check that all the specified copies and the original are collected
d. seal the copies in envelopes and store them in a safe place

**Test Blue Print**

a. a test blueprint is a table showing the number of items that will be asked under each topic of the content and the levels of the learning objectives (learning outcomes).
b. these objectives come from the Cognitive, Affective or Psychomotor domain. For most tests constructed in schools, the objectives will be from the Cognitive Domain.
c. it is also called a Table of Specifications because it specifies the proportions (weightings) of questions from each topic and objective level.
d. it ensures that the test items constructed cover the different levels of the cognitive domain.
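The arithmetic behind a blueprint is simple proportional allocation. Here is a minimal sketch (Python, not part of the original notes) using the same topics and weightings as the worked example that follows; it computes the number of items per topic from the total test length:

```python
# Test blueprint (table of specifications) sketch: allocate items to
# topics in proportion to their weighting, as in the example below.
total_items = 10
topic_weights = {
    "Meaning": 0.20,
    "Factors influencing learning": 0.30,
    "Classical conditioning": 0.30,
    "Operant conditioning": 0.10,
    "Gestalts": 0.10,
}

for topic, weight in topic_weights.items():
    items = round(weight * total_items)  # items allocated to this topic
    print(f"{topic}: {items} item(s) ({weight:.0%})")
```

The same proportional split is then applied across the cognitive levels, which is what fills the cells of the table below.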
**Example of a Table of Specifications (Test Blue Print)**

**SUBJECT:** Human Learning

| **Cognitive level** | **Knowledge** | **Comprehension** | **Application** | **Analysis** | **Synthesis** | **Evaluation** | **%** |
|---|---|---|---|---|---|---|---|
| Meaning | 1.1.1 (2) | | | | | | 20% (2) |
| Factors influencing learning | | 1.1.2 (2) | | 1.1.3 (1) | | | 30% (3) |
| Classical conditioning | | | 2.2.1 (2) | | 2.2. (1) | | 30% (3) |
| Operant conditioning | 3.3.4 (1) | | | | | | 10% (1) |
| Gestalts | | | | | | 4.3.2 (1) | 10% (1) |
| **Total** | **3 (30%)** | **2 (20%)** | **2 (20%)** | **1 (10%)** | **1 (10%)** | **1 (10%)** | **100% (10)** |

**Purposes of the components of the Table of Specifications**

a. **Content (topics or sub-topics)** - Test items are set from the content taught, e.g. the *Human Learning* topic content. This gives content validity.
b. **Instructional/specific objectives** - Ensures that the test items are matched to the objectives. This gives content validity.
c. **Cognitive levels** - Ensures that the test items are set according to the cognitive levels of the content objectives covered. It also ensures that the test covers a variety of test items. Usually **not** more than three items are set under each cognitive level.
d. **Number of test items per objective** - Indicates how many test items are set from each objective and topic.
e. **Number of test items per cognitive level** - Indicates how many test items fall under each cognitive level.
f. **Total number and percentage of test items** - Indicates the overall total and percentage of the test items.

**Writing test items**

When writing test items, you should match them with the instructional objectives. Consider the following:

a. the type of test (objective or subjective) that would measure the achievement of a given objective
b. the level of the learners
c. the level of language to use

**Validate the test** - give the test draft to a colleague(s) to correct or comment on the following:

a. content
b. arrangement of items
c. variety of levels of items (cognitive domain levels)
d. language

**Prepare the final version of the test**

a. make amendments if necessary
b. arrange test items
c. write instructions

**Factors to consider before test administration (giving the test items to learners to write)**

Test administration covers all activities undertaken just before and during the time students are writing the test. The following are guidelines for administering a test:

a. secure a room suitable for the number of students that will be writing the test; it should have enough suitable furniture and be well ventilated and well lit.
b. maintain a positive attitude and maximize achievement motivation among students.
c. make sure all students are notified of the test in advance to avoid surprises (test announcement), covering:
   i. date
   ii. type of test (objective or subjective)
   iii. area of coverage
   iv. venue (free from noise, well furnished, spacious, clean, good ventilation, lighting, etc.)
   v. time and duration (morning, afternoon, evening; minutes, hours)
   vi. motivation, to create a test feeling in learners, and wish them **GOOD LUCK**
   vii. materials needed; also remove materials on the walls that might give clues
d. just before students start the test, clarify the rules that will apply during the writing of the test.
e. during the test, make sure you do the following:
f. remind students to fill in the information in the spaces provided on the front page of the answer sheet or booklet (names, class, etc.)
g. ask them to read the instructions on the cover/front page of the question paper and check for any missing pages
h. ask them to start answering
i. let them know the duration of the test
j. monitor students as they write
k. make sure there are no distractions, whether noise from the surroundings or from the students themselves
l. give them warning of how much time is left, especially towards the end
m. once the test period is over, ask students to stop and collect all answer sheets/booklets

Further guidelines:

a. make corrections, if any, before writing begins
b. make sure learners are well spaced
c. the teacher should take a supervisory role
d. assist learners who are in need of help
e. watch out for cheating/disturbances
f. time should be observed strictly (announce the time left at intervals)
g. minimize moving up and down, because it could be disturbing
h. do not stand and read learners' papers while they are working, because some may feel fear, doubt, etc.
i. all external materials, whether related or not related to the test, should be left outside, e.g. books, cell phones, calculators, etc.

**Features of a Standardized Test**

a. it is ***valid***, that is to say it measures what it is intended to measure
b. it is ***reliable***, that is to say it must yield the same results over and over again if it is re-administered
c. scoring of the test is standardized and constant from class to class
d. it is administered at the same time under the same conditions
e. it measures content common to the majority of students
f. it is constructed by experts
g. the interpretation of scores is usually compared to a national norm

**TEST ITEMS**

**Types of test items**

There are many types of test items. They are usually classified into two main categories: objective tests and subjective tests.

**Objective test items**

An objective test is a test on which two or more scorers can easily agree on whether an answer is correct or incorrect. Objective items do not require or allow learners to express their own ideas or opinions, and equally (or similarly) competent learners would obtain the same score. Some learners may perform well through guesswork. Examples: true/false, matching, multiple choice, etc.

**Characteristics of an objective test**

a. each item has a pre-determined answer
b. items are highly structured
c. non-teachers and machines can score them because they are easy to mark
d. learners do not write at length when answering them
e. they encourage reading skills

**Types of objective test items**

a. **Supply type items (fill-in type)**

These are suitable for measuring achievement at the lower levels of the cognitive domain. There are two types of supply items to focus on: the short answer type and the completion type. In constructing supply items, consider the following:

   i. the required answer should be brief, e.g. a phrase, a word, a number, etc.
   ii. the number of blanks should be small, ideally one per statement
   iii. the blanks should be approximately equal in length

**Example:** What is the name of the instrument used for measuring air pressure? Or: The instrument for measuring air pressure is called................

b. **Selection type items**

   i. **Alternate response items**

These items consist of a statement of fact, principle, etc. The learners are required to respond with a "yes/no", "correct/incorrect" or "true/false" type of answer. They are good for measuring achievement at the lower levels of the cognitive domain.
The weakness of these items is that there is a 50% chance of guessing the correct answer. In constructing alternate response items, consider the following:

- items must measure important learning objectives
- statements must be definitely true or false
- avoid double negative statements
- answers should not form a predictable pattern
- use short statements and simple language
- avoid the use of negative statements
- avoid long and complex statements
- the statements should be approximately the same length
- there should be as many true/correct/yes statements as there are false/incorrect/no statements
- each statement should include only one important idea

**Example:** Sir Seretse Khama was the first president of Botswana. True/False. Or: Was Sir Seretse Khama the first president of Botswana? Yes/No

**Advantages of Alternate Response Items**

a. easy to construct
b. easy to score
c. wide syllabus coverage

**Disadvantages of Alternate Response Items**

a. easy to answer correctly by guessing
b. can assess trivial facts
c. test lower-order objectives
d. can be ambiguous

   ii. **Matching type items**

In this type of item there are two lists: a list of **premises** and a list of **responses**. They are good for measuring the ability of learners to associate related facts. In constructing the items, consider the following:

a. provide clear directions/instructions, e.g. "match the African countries with their capital cities"
b. avoid long directions
c. use numbers to identify the premises and letters to identify the responses
d. avoid using incomplete sentences as premises
e. arrange items in numerical or alphabetical order
f. put the premises and responses (options) on the same page
g. provide more responses than premises
h. items must measure important objectives
i. responses must be plausible for the premises
j. limit items to 5 or 6 per set

Example:

| **PREMISES** | **RESPONSES** |
|---|---|
| 1. Botswana | A. Lusaka |
| 2. Ghana | B. Pretoria |
| 3. Nigeria | C. Lagos |
| 4. South Africa | D. Gaborone |
| 5. Zambia | E. Accra |

**Advantages of Matching Type Items**

a. can measure a number of objectives
b. easy to score

**Disadvantages of Matching Type Items**

a. test basic skills or lower-order objectives
b. it is difficult to write responses that are equally plausible for all premises
c. responses can provide clues

   iii. **Multiple choice items**

A multiple choice item consists of a **STEM** and **3-5 OPTIONS**. One of these options is the correct answer while the other options are **DISTRACTERS** (incorrect). It is the most used type of item in objective tests because:

a. it can be used for measuring achievement at all the cognitive domain levels (from Knowledge to Evaluation)
b. it can sample behaviour and content better than all other item types

The stem introduces the main idea; it can be in the form of a question or an incomplete statement. It should be short and clear. Options/alternatives are the responses to the item. The correct response is called the **KEY**. The misleading options are called **DISTRACTERS**. Distracters should be plausible to learners, so that they hide the key, yet definitely incorrect. In constructing the item, consider the following:

a. the stem should measure an important learning outcome
b. the stem of the item should contain a single, clearly stated problem
c. there should be only one key
d. the distracters should be plausible
e. avoid using absolutes: "always", "none", "never", etc.
f. avoid window dressing such as "all of the above" or "none of the above" in distracters
g. negative statements should be avoided; if one is used, the negating word should be underlined, bolded or capitalized, e.g. Which of the following is **NOT** a noun?
h. there should be grammatical consistency between the stem and all the options
i. the options should be listed one below the other beneath the stem
j. familiar and simple language should be used
k. if options are related to time, such as years, they should be arranged chronologically
l. keys should be scattered randomly across the items
m. duplication of textbook wording should be avoided (rephrase)
n. items should not require the personal opinions of learners, e.g. Which Botswana president contributed most to the country's economy?
o. options should be homogeneous, e.g. all should be names of rivers, or all should be in the singular/plural form
p. concepts repeated in the options should be included in the stem, e.g. Which of the following tests evaluates a person's capability to perform a task? A. Aptitude test B. Attitude test C. Diagnostic test
q. options should be independent of each other to avoid linking and cluing
r. no option should be a subset of another, e.g. The number of Batswana infected with HIV was... A. Over 200 B. Over 350 C. Over 530 D. Over 600 (any number over 600 is also over 200, 350 and 530)

**Advantages of Multiple Choice Items**

a. items take less time to answer
b. can cover a wide range of objectives
c. a substantial amount of course material can be sampled
d. the effect of guessing is relatively lower
e. a wide range of knowledge, skills and attitudes can be tested in a relatively short time
f. a large amount of information can be covered
g. it is easy to administer, mark and score; scoring is objective, and non-teaching individuals and machines can score it
h. guessing can be minimized by increasing the number of distracters
i. it promotes the reading skill

**Disadvantages of Multiple Choice Items**

a. time consuming to construct
b. does not measure writing ability, creativity and organisation of ideas
c. students with poor reading skills may not do well
d. it teaches the low-level skill of recognizing the key rather than constructing an answer
e. learners may rely on guessing and can pass even without instructional knowledge
f. it is difficult to construct items for some objectives
g. a learner with a low attainment level in English may fail because sometimes the difference between options is small
h. it makes learners believe that there is always a single solution to a problem, and this undermines divergent thinking
i. without a proper seating arrangement learners can illegally share information

**Subjective Test Items**

A subjective test is a test in which it is very difficult for two or more scorers to agree on whether the answer to a test item is correct or incorrect. It is a supply item type that gives the learner greater freedom of response than the simple short answer and completion items. The scores are influenced by the ideas/opinions or judgment of the individual doing the scoring (marking). Examples of such items include completion, structured, essay, practical and aural (listening) tests, etc.

1. **Constructing Completion Items**

a. questions should address important course objectives
b. answers to questions should be a single word or a phrase
c. direct questions are preferable to incomplete sentences
d. omit only key words - avoid a situation where the sense of the sentence is impaired
e. blank spaces should be near the end of the sentence
f. where applicable, indicate units if the required answer is numerical
g. the key or missing word/phrase must be definitely correct

**Advantages of Completion Items**
a. easy to construct
b. guessing is eliminated
c. a large amount of content can be covered

**Disadvantages of Completion Items**

a. slightly more time consuming to construct than selection items
b. difficult to score
c. may encourage bluffing
d. measure lower order objectives

2. **Constructing Structured (Restricted Response) Items**

a. questions should address important course objectives
b. answers to questions should be a phrase or a single sentence
c. questions should be clear and direct
d. avoid using optional items
e. indicate the maximum score for each question - present a scoring rubric

**Advantages of Restricted Response Items**

a. easy to construct
b. guessing is eliminated
c. a large amount of content can be covered

**Disadvantages of Restricted Response Items**

a. slightly more time consuming to construct
b. difficult to score
c. may encourage bluffing
d. measure lower order objectives

**Constructing Essay Test Items**

In essay items learners compose the expected response, most commonly in written form; because of that, essays vary in quality and merit.

**Essay Test item**

There are no limits placed on learners except on length. It has low test reliability. It is the most appropriate item for measuring complex learning outcomes such as the ability to select, organize, recognize, integrate and express ideas. Observe the following when constructing essay test items:

a. ensure that questions assess important learning outcomes (objectives)
b. ensure that questions give clear directions for the required responses; ask questions in a direct and explicit manner - students should know whether they have answered the question fully or not
c. questions should be complex enough to allow multiple correct answers
d. questions should allow self-regulated learning
e. provide information on the scoring criteria (scoring rubric)

**Advantages of The Essay Test Item**

a. allow free response
b. eliminate guessing
c. good for testing a small number of students
d. take a short time to construct
e. assess divergent and critical thinking (complex cognitive skills)
f. easy to design
g. promote creativity of learners
h. encourage independent thinking
i. promote the writing skill
j. teach learners to organize, integrate and synthesize knowledge
k. test different levels of knowledge, skills and attitudes

**Disadvantages of Essay Test Item**

a. difficult to score
b. time consuming for teachers and students
c. too much emphasis on writing
d. subject to bluffing responses
e. it usually tests a small area of the content covered or to be covered; therefore, some instructional objectives are left out
f. have low reliability because the marks given depend on the opinion of the marker
g. writing and scoring essays demand a lot of time
h. marking is also influenced by factors such as poor grammar, handwriting, etc.
i. learners who are unable to express themselves adequately perform badly in examinations
j. scoring can be unreliable, highly subjective and inconsistent; there is also the problem of the halo effect - the tendency to rate learners on the basis of a first impression
k. markers can be influenced by other factors such as mood or their impression of the handwriting of a particular learner

**Points to consider when constructing subjective test items**

In constructing subjective items, the following should be considered:

a. items should be restricted to the measurement of complex learning outcomes
b. the activity to be done by the learners should be made clear, to reduce to a minimum the variations in responses due to differences in interpretation of the item
c. the learners should be given an idea of the length (number of pages), the time to be spent on the item and the number of marks awarded to each section: introduction, development, conclusion, references (if any), etc.

**Ways of Reducing Subjectivity** *(Achieving Some Level of Objectivity)* **When Marking or Scoring Essay Test Items**

a. prepare a marking key - this shows the structure of the essay and the mark distribution per section
b. mark without knowing the owner of the paper; where possible allow learners to use admission or examination numbers, not their names
c. mark one item across all the scripts before marking the next item to reduce the halo effect
d. allow colleague(s) to moderate your marking
e. marks for content should be set apart from those relating to grammar, organization, expression, spelling and punctuation
f. ensure the essay items are well constructed to promote favourable marking
g. construct a good marking scheme/rubric to ensure uniformity and fairness across all the scripts marked
h. assign points to the question's various parts to maintain consistency
i. compare the quality of the papers marked for the same question to ensure fairness and accuracy in grading
j. grade all responses to one question before moving to the next - this prevents the quality of a student's answer to one question from influencing the marker's reaction to the student's other answers
k. shuffle the papers after marking each question so that the same paper is not always graded first, in the middle or last
l. use students' ID numbers instead of their names to conceal their identity
m. the marker should not allow his or her feelings/emotions to influence the marking
n. avoid distractions when marking
o. have well trained markers

**Marking Methods to Reduce Subjectivity/ Types of Marking to Reduce Subjectivity**

A model answer is drawn up by the teacher or marker to reflect the responses expected from the learners.

a. **Analytic Method/ Analytic scoring rubric**

i. the teacher goes through the model answer to identify its essential components, because the different performances or criteria are rated separately to check whether the student handled each of them well. For example, in FOE we look at the following performances: introduction; body of the essay; conclusion; layout; sequence. In mathematics, a task might be rated for organization, quality of ideas and clarity of expression. For example, in long problems such as calculating the mean, median, mode, range, variance and standard deviation using grouped (interval) data, a student may be awarded marks for finding the frequency; finding the cumulative frequency; finding the mid-value; giving the proper equation for the mean, median, mode, range, variance and standard deviation; and then marks for giving the correct answer for each.

ii. the components are allocated marks according to their value, because an analytic rubric enables the teacher to focus on one characteristic/performance of a response at a time

iii. the marks allocated to each characteristic/performance are then added to provide the overall essay mark

**Advantages of The Analytic Method**

a. the teacher or marker can make comments on each component, and this can help learners get clearer feedback about the strengths and weaknesses of their responses
b. the teacher is able to identify areas where the learners are weak

**Disadvantages of The Analytic Method**

The essay is judged in terms of its components/parts and not as a whole.

**b) Holistic Method/ Holistic Scoring Rubric**

i. the teacher reads through several scripts to get a feel of the possible responses to the test item, which enables the teacher to construct an acceptable model answer. Holistic marking is based more on the teacher's overall impression of students' scripts or works.

ii. the teacher reads the scripts and places them according to their quality, for example into **A, B, C** categories; as such, a holistic scoring rubric yields a single overall score that takes account of the entire response. This means the works of several students can be awarded the same mark if the marker perceives that they fall within the same quality band.

iii. the scripts are compared against each other, and those of similar quality can be awarded the same grade. This mostly happens in subjects such as Art, Project writing, Agriculture, etc.

**Advantages of the Holistic Method**

a. by viewing/judging an essay as a complete work, the teacher is able to appreciate the creativity of learners
b. categorizing essays according to their quality can improve reliability
c. teachers are able to appreciate the internal consistency of an essay

**Disadvantages of the Holistic Method**

a. it is time consuming when marking
b. the teacher's knowledge of the learners is likely to influence the groups into which their work is placed, which may not reflect its true quality

**TOPIC 8: ITEM ANALYSIS**

**TEST ITEM ANALYSIS (MEASURES OF ITEM DIFFICULTY AND DISCRIMINATION)**

a. "Test analysis is a general term which covers a wide range of statistical and analytical techniques which may be applied for the purpose of improving tests, examinations, quizzes and other mental measurement devices." (Harper & Chauhan, 1974).
b. it is the process of determining the effectiveness of a test item by analysing students' responses to it.
c. the quality of a test item is determined through the test item analysis process.

Item analysis:

a. determines the quality or effectiveness of a test item
b. provides data for remedial teaching
c. enhances teachers' skills in test construction
d. helps in the general improvement of classroom instruction
e. helps teachers identify faulty test items

Test item analysis also helps answer the following important questions:

a. Did the test items function as intended?
b. Were the items appropriately difficult?
c. Were the items free of irrelevant clues and other defects?
d. Was each distractor (for Multiple Choice Items) effective?

**Item Analysis Factors or Indices**

1. **Item Difficulty Index (P-value)**

Checks the suitability of test items; ensures validity and reliability of test items (quality assurance)

2. **Item Discrimination Index / Point Biserial Index (D-value)**

Checks the suitability of test items; ensures validity and reliability of test items (quality assurance)

1. **Item Difficulty Index (P-value)**

The index is the proportion of testees who got a test item correct out of the total number of testees who took the test containing the item. The item difficulty index (P-value) formula is as follows:

P = R ÷ T, where R is the number of testees who answered the item correctly and T is the total number of testees who took the test.

The index has a range of between 0.00 and 1.00. P-values near 0.00 indicate that the test item is very difficult, while P-values near 1.00 indicate that the test item was very easy. Ideal test items (items of moderate difficulty) have a P-value of between 0.3 and 0.7 (especially for norm-referenced tests).
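To make the P-value computation concrete, here is a minimal sketch in Python (not part of the original notes); the function name and the 0/1 score data are illustrative assumptions only.

```python
def difficulty_index(item_scores):
    """P-value: testees who got the item correct / all testees who took it."""
    return sum(item_scores) / len(item_scores)

# Hypothetical 0/1 scores of 10 testees on one item (1 = correct)
item_scores = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
print(difficulty_index(item_scores))  # 0.6 -> moderate (ideal 0.3-0.7 band)
```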
  **p-value**   **What it means**
  ------------- --------------------------------------------------------------------------------------------
  0             Extremely difficult; no one got the answer correct.
  \< 0.3        Very hard question; most students got it wrong. Consider changing or removing the question
  0.3 to 0.9    Medium (moderate) difficulty - may be acceptable
  \> 0.9        Very easy; almost everyone gets it correct
  1.0           Extremely easy; everyone gets the question correct

**Common Reasons for Poor Item Difficulty (too easy or too difficult)**

a. **Too Easy**

i. well known content
ii. item has been exposed (leaked) and shared
iii. a clue to the answer appears in the question paper
iv. distractors are poor

b. **Too Difficult**

i. content not taught or not understood
ii. poorly worded item
iii. question scored wrongly (wrong answer key)
iv. two choices that are both right
v. the question appeared towards the end of a timed test, so students ran out of time

**2). Discrimination Index (D-value) (Point Biserial Index - PBI)**

i. it is an index score that indicates the ability of a test item to effectively differentiate between the performance of higher achievers and lower achievers
ii. also called the Validity Index (VI); the index ranges between -1.00 and +1.00
iii. an item that can discriminate between high achievers and low achievers effectively is said to have high discriminating power
iv. an item with high discriminating power is more suitable for its purpose in a test, i.e. the higher the D-value/PBI, the better the item
v. the ideal D-value (acceptable range) is +0.20 or greater (\>/= 0.20)
vi. D-values/PBI correlate with P-values: very easy or very difficult test items have poor discriminating power, while moderately difficult items (P-values around 0.3-0.7) have high discriminating power

**Procedure for Determining Discriminating Power**

Rank the marked scripts by total score and separate them into:

i. scripts for high performers
ii. scripts for moderate performers
iii. scripts for low performers

The D-value for an item is the proportion of high performers who answered it correctly minus the proportion of low performers who answered it correctly.

**Interpretation of D-value (PBI)**

A D-value of +0.20 or greater for the key is acceptable; values near 0.00 show that the item discriminates poorly; negative values for the key indicate a faulty item or a wrong scoring key, while negative values for distractors show that they are attracting mainly low performers, as intended.

**Possible Reasons for Low Discrimination Power**

a. item is very easy
b. item is very hard (difficult)
c. answer in the scoring key is wrong
d. question poorly written
e. high performing students are overthinking the question
f. question is measuring a different construct
g. low sample size

**Possible Reasons for High Discrimination Power**

a. moderately difficult questions; questions are good and relevant to the content

**Eg 1**

A well-discriminating objective item (MCI) shows a high D-value for the correct option and negative D-values for the distractors.

**Eg 2**

Below are the D-value analysis results of an objective item (MCI). The correct response is indicated with an asterisk (\*). The item did not discriminate well: the D-value for the key is positive but very low. Options C and D are bad distractors. The positive D-value for distractor D indicates that the 2 students who selected option D were high performing students on the test.

  **Response**   **Frequency**   **%**   **D-Value (PBI)**   **Interpretation**
  -------------- --------------- ------- ------------------- ---------------------------------
  A\*            76              92.68   0.04                Too easy
  B              4               4.88    -0.08               Good distractor
  C              0               0       \-                  Bad distractor; must be changed
  D              2               2.44    0.04                Bad distractor; must be changed

**Eg 3**

Below are the D-value analysis results of an objective item (MCI). The correct response is indicated with an asterisk (\*). A discrimination index of +0.34 is acceptable for the correct option. No one chose option B, and so it must be changed.

  Response   Frequency   \%     D-Value (PBI)   Interpretation
  ---------- ----------- ------ --------------- ---------------------------------
  A          3           3.66   -0.21           Good distractor
  B          0           0      \-              Not acceptable; must be changed
  C          2           2.44   -0.26           Good distractor
  D\*        77          93.9   0.34            Very easy
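The analyses in Eg 1-3 can be automated. The sketch below (not part of the original notes) computes an upper/lower-group D-value for the keyed answer and an option-by-option point-biserial like the tables above; the data, the thirds-based grouping and the helper names are illustrative assumptions.

```python
from statistics import mean, pstdev

def upper_lower_d(item_correct, totals):
    """D = P(high performers correct) - P(low performers correct).

    Scripts are ranked by total score and split into thirds; the middle
    group is ignored, as in the high/moderate/low procedure above.
    """
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = len(order) // 3
    low = [item_correct[i] for i in order[:n]]
    high = [item_correct[i] for i in order[-n:]]
    return mean(high) - mean(low)

def option_pbi(choices, totals, option):
    """Point-biserial between choosing `option` and the total test score."""
    s = pstdev(totals)
    picked = [t for c, t in zip(choices, totals) if c == option]
    others = [t for c, t in zip(choices, totals) if c != option]
    if not picked or not others or s == 0:
        return None  # PBI undefined, e.g. when the option's frequency is 0
    p = len(picked) / len(choices)
    return (mean(picked) - mean(others)) / s * (p * (1 - p)) ** 0.5

# Hypothetical item (key = A): each student's chosen option and total score
choices = ["A", "A", "B", "A", "C", "A", "A", "B", "A", "D"]
totals  = [45, 42, 20, 39, 18, 44, 37, 25, 41, 43]

key_correct = [1 if c == "A" else 0 for c in choices]
print(round(upper_lower_d(key_correct, totals), 2))  # positive -> good item

for opt in "ABCD":
    freq = choices.count(opt)
    pbi = option_pbi(choices, totals, opt)
    pbi_txt = "-" if pbi is None else f"{pbi:+.2f}"
    # Key should be positive; a positive PBI on a distractor flags a bad
    # distractor, as with option D in Eg 2 above.
    print(opt, freq, pbi_txt)
```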
**TOPIC 5: CONTINUOUS ASSESSMENT**

**CONTINUOUS ASSESSMENT (C.A)**

Teachers continuously observe and test learners to obtain information about their academic progress. Continuous assessment is an activity used to appraise learners' performance on a regular basis - an on-going activity to monitor learner progress in terms of specific learning outcomes. Regular assessment of learners' progress is part and parcel of classroom instruction.

**Purposes of Continuous Assessment (C.A)**

a. it is useful for assessing achievement, selecting students who are ready for enrichment, and diagnosing students who should go to the next level, remain in the current level or repeat the previous level
b. it provides feedback about the learners' progress, hence allows for future action to assist the learners
c. it motivates learners in their learning
d. it provides a record of progress which can assist teachers, other learners and other stakeholders
e. it provides a statement of current attainment / assesses learners' current *achievement*
f. it assesses learners' readiness for future learning / *diagnoses* students who should go to the next level, remain in the current level or repeat the previous level
g. it provides evidence of teacher and school effectiveness
h. it helps the teacher to know the learners' strengths and weaknesses, interests, abilities, etc.; this is vital as teachers need this information to help learners, e.g. through *enrichment*
i. the quality of learners' performance helps the teacher to judge his teaching methods, teaching aids, scoring, etc.
j. it enables the teacher to communicate effectively with parents, learners and other stakeholders about learning performance, i.e. it can provide a true reflection of learners' performance over a given period of time
k. the information is useful in making decisions such as:

i. selection of learners for promotion to upper classes/further studies, etc. / selecting students who are ready for extra work, called *enrichment*
ii. recommendation for special needs education, guidance and counselling

**Disadvantages of Continuous Assessment (C.A)**

a. when used alone it may not show the actual ability of the learner
b. normally it is given after covering a small portion of content, and learners may forget it thereafter
c. it is time consuming
d. it causes frustration if the learner is getting low marks
e. it needs more materials, e.g. files, file tags, manila papers, A4 papers, computers, etc.
f. it needs a place for display if necessary

**\ **

**TOPIC 10: RECORD KEEPING IN ASSESSMENT**

**RECORD KEEPING IN A SCHOOL**

**Purposes of keeping records in the teaching and learning process / in a secondary school**

a. it's a form of feedback about the learner's progress and is of importance because of its contribution to motivation and further progress
b. it's given to parents in the form of written reports or verbally during meetings, and it exists in permanent form. The report should have comments that are fair, valid, meaningful to the reader and likely to have a positive effect on future progress
c. it provides a useful basis from which reports to others can be made, e.g. to learners, parents and other stakeholders
d. it highlights any cause for concern if the learner's performance shows a clear decline compared to previous progress
e. it facilitates the planning of future work for each learner by building upon previous progress, e.g. giving remedial work
f. it is useful in general decision making in relation to the school as well as to the programme (level of language used)
g. it ensures that the curriculum matches the students' needs and abilities

**Purposes or uses of a test item bank in a primary school:**

- Questions in a test bank can be categorized according to subject, topic, question difficulty and question type, and these questions can then be kept readily accessible and organized.
- The teacher can generate a random selection of questions from the test item bank and will be able to set class quizzes, mock tests and examinations.
- Designing different types of questions can be a tedious process; a test item bank can come to the rescue to save time and alleviate teachers' stress.
- The teacher improves his or her productivity by spending more time choosing good questions from the test item bank rather than designing new questions.
- Once a teacher has put questions into a test bank, he or she can use test-making software or manual test creation to easily generate different tests using the questions saved in the test item bank.

**TOPIC 6: INTERPRETATION AND ANALYSIS OF RESULTS**

**INTERPRETATIONS OF TEST RESULTS**

**Mark interpretation**

a. describing marks using statistical techniques
b. deriving meaning from test marks
c. using tables, graphs, averages, percentages, etc. to give meaning to scores
d. comparing scores against a set standard, against the performance of the class group as a whole, or against the student's own potential performance

**Measures of Central Tendency**

A measure of central tendency, or measure of location, describes the point in a distribution (of scores) that represents the average or most typical value. The most common measures of central tendency are the Mean, the Median and the Mode.

**Importance of mark interpretation / Uses of measures of central tendency:**

- There is a portion in the scheme book called the "record of work", where teachers record the work covered in their lessons every two weeks. Teachers use measures of central tendency to analyze tests and, based on their analysis, record the following in the "record of work":

a. Number of students who got a *Quality Pass* (A: 80% and above; B: 70% and above; C: 60% and above) in a test
b. Number of students who got a *Quantity Pass* (D: 50% to 59%) in a test
c. Number of students who got a *Fail* (E: 39% and below) in a test
d. Establish students' performance rate in terms of quality pass, quantity pass and fail
e. Teachers use the mean/average, median, mode, range, variance and standard deviation to compare a learner with other learners, a class with other classes, and a school with other schools.

**Ways of interpreting results**

a. Measures of central tendency are some of the most common descriptive statistics used in classrooms to compute and interpret data
b. The central tendency of a distribution (set of scores) is a point where the scores seem to converge, giving a typical/central score
c. The measures of central tendency are the **mean**, the **median** and the **mode**

**Levels of measurement**

1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale

**Nominal scale**

a. categories should be clearly distinct
b. no unit or value should be left uncategorized
c. every unit or value should belong to a certain category
d. all units or values should be classified or categorized

For example: religion, gender, geographical background, age.

**Limitations of the nominal scale**

We can count cases and make statements about categories such as religion, gender, geographical background or age, but we cannot order/rank the categories to which the units/values belong. We can only say that there are those who are influenced by religion, gender, geographical background or age when they say that the school should be closed, but we cannot order or rank those categories.

**Ordinal scale**

Meets all the requirements of the nominal scale, but here we are also able to rank the categories of the scale, e.g. by saying: this one is good, this one is better, this one is best.

**Limitations of the ordinal scale**

Even though we can rank the categories, we cannot say how big the differences between the categories are.

**Interval scale**

The interval scale takes care of the limitations of the nominal scale and the ordinal scale because it has the following:

a. we can order/rank the categories to which the units/values belong
b. we can say how big the differences between the categories are
c. the interval scale requires equal intervals between the categories; all intervals should be of equal size. For example, if you study age, intervals such as 20-24, 25-29, 30-34 are correct because they are of equal size, while intervals such as 20-25, 26-35 are wrong because they are not.

**Limitations of the interval scale**

On an interval scale the zero point is arbitrary, so ratios of values are not meaningful. For example, with temperature in degrees Celsius we cannot say that 30° is twice as hot as 15°, because 0° does not mean the complete absence of temperature.

**Ratio scale**

The ratio scale meets all the requirements of the nominal, ordinal and interval scales and overcomes their limitations. The ratio scale:

a. allows us to express differences between units
b. has a non-arbitrary (true) zero, unlike the interval scale; with a true zero we can, for example, say that a person of 30 years is twice as old as a person of 15 years

**Mean**

The mean is the sum of all the values in a set, divided by the number of values. The mean of a whole population is usually denoted by µ, while the mean of a sample is usually denoted by x̄. (Note that this is the arithmetic mean; there are other means, which will be discussed later.)

a. The strength of the mean is that all scores are taken into account in its calculation, and because of that it gives a better picture of class performance
b. Its weakness is that extreme scores tend to make the mean unrealistic, e.g. extremely high scores push it up and extremely low scores pull it down
c. One solution is to omit extreme scores when calculating the mean

**Calculating the Mean**

Mean (x̄) = sum of all scores ÷ number of scores, i.e. x̄ = Σx ÷ n. For example, for the marks 10, 3, 2, 5 and 6: x̄ = 26 ÷ 5 = 5.2.

**Mean Average: Benefits of the mean**

- Best measure for a symmetrical distribution of data where there are no extremes in the dispersion of data.
- Most reliable because it is influenced by all the data in the study, i.e. all students' marks.
- Good for interval and ratio data.
- It is the only measure of central tendency where the sum of the deviations/differences from the mean is always zero.

Importance of mean and variance

- How far from the mean/average/centre mark are the values dispersed?
- If many students have their marks dispersed above the mean, the teacher can say that most students mastered the content taught during the lesson and that there is therefore no need for re-teaching or remediation, only enrichment.
- If more students' marks are dispersed below the mean, the teacher should know that there is a need for remediation, or for re-teaching the whole class if its performance is far below the mean with a large difference/deviation.

**Median/counting arithmetic average/middle score**

- The **median** is the second most frequently used measure of central tendency. Its strength is that it is not affected by extreme scores (not all score values enter its calculation).
- The median is also referred to as the 50^th^ percentile because it represents the point that divides the scores into two halves.
- It is a measure of position rather than of magnitude, e.g. 1, 2, 3, 4, 5, 6, 7. It refers to the score in the middle when the numbers are arranged in sequence.
- Order the scores from smallest to highest.
- Counting: if the number of scores is odd, the median is the middle score.
- Calculating: if the number of scores is even, the two middle scores are added and divided by two to get the median.

### Calculating The Median

- The median is the middle value in an ordered array of numbers.
- For an array with an odd number of terms, the median is the middle number.
- For an array with an even number of terms, the median is the average of the two middle numbers.
- The following steps are used to determine the median.
- STEP 1. Arrange the observations in an ordered data array.
- STEP 2. For an odd number of terms, find the middle term of the ordered array. It is the median.
- STEP 3. For an even number of terms, find the average of the middle two terms. This average is the median.
- ***Another way to locate the median is by finding the term (n+1)/2 in an ordered array of numbers***

**Example 1.**

  **STUDENTS NAME**   **MARKS**
  ------------------- -----------
  Modise              10
  John                3
  Jane                2
  Goodboy             5
  Timothy             6

Therefore, ordered: 2, 3, **5**, 6, 10. The median (middle number) is **5**.

**Example 2**

  **STUDENTS NAME**   **MARKS**
  ------------------- -----------
  Modise              10
  John                3
  Jane                2
  Goodboy             8
  Timothy             6
  Gosiame             4

Therefore, ordered: 2, 3, **4, 6**, 8, 10. The median is (4 + 6) ÷ 2 = **5**.

**Limitations of the Median**

- Does not account for extreme scores.
- Not algebraically defined.
- Not appropriate for nominal data.

**Mode**

d. Mode: the most frequently occurring score in a given score distribution
e. The mode tells us that the largest number of learners made that score
f. A score distribution with 1 mode is called uni-modal
g. A score distribution with 2 modes is called bi-modal
h. A score distribution with more than two modes is called multi-modal

### Determining The Mode

- The mode is the most frequently occurring value in a set of data.
- A set of data with one mode is said to be uni-modal.
- In the case of a tie for the most frequently occurring value, **two modes** are listed. Then the data are said to be **bimodal**. If a set of data is not exactly bimodal but contains two values that are more dominant than others, some researchers take the liberty of referring to the data set as bimodal even without an exact tie for the mode.
- Data sets with more than two modes are referred to as multimodal.
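The mean, median and mode calculations above are easy to check with Python's statistics module. The sketch below (not part of the original notes) reuses the marks from Examples 1 and 2; the extra scores list for the mode is a hypothetical illustration.

```python
from statistics import mean, median, multimode

marks = [10, 3, 2, 5, 6]            # Example 1 data from the notes
print(mean(marks))                  # (10+3+2+5+6)/5 = 5.2
print(median(marks))                # ordered: 2,3,5,6,10 -> middle = 5

marks_even = [10, 3, 2, 8, 6, 4]    # Example 2 data from the notes
print(median(marks_even))           # ordered: 2,3,4,6,8,10 -> (4+6)/2 = 5.0

scores = [7, 7, 8, 9, 9]            # hypothetical marks
print(multimode(scores))            # [7, 9] -> a bi-modal distribution
```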
**Limitations and delimitations of the mode**

- It is the least reliable measure because the other scores are not involved in its calculation.
- Its strength is that it can be used to quickly estimate the central tendency.

**\ **

**MEASURES OF VARIABILITY/ SPREAD/ DISPERSION**

**Variability** is the term used to describe how spread out or dispersed scores are within a distribution.

- It is the estimates of variability that help teachers to understand how scores are compressed or dispersed in a distribution, i.e. how far the scores are from the mean.
- Scores can be compressed/expanded below/above a measure of central tendency (the mean).
- Measures of variability include the range, variance, quartile deviation and standard deviation.

**MEASURES OF VARIABILITY**

Measures of variability describe the spread or dispersion of a set of data. These include the Range, the Variance and the Standard Deviation. When used together with measures of location or central tendency, a complete description of a set of data can be attained.

- Measures of variability

a. Range

The range is the difference between the largest and smallest values of a set. The range of a set is simple to calculate, but is not very useful because it depends on the extreme values, which may be distorted.

b. Variance

The variance is a measure of how items are dispersed about their mean.

c. Standard deviation

The standard deviation σ (or s for a sample) is the square root of the variance.

**RANGE**

The range is the difference between the lowest and the highest score in the distribution.

Formula: R = H - L, e.g. R = 70% - 20% = 50%

***Calculating the Range***

The range is the difference between the largest value and the smallest value. In other words, it is the value obtained by subtracting the smallest number from the largest number in an ordered array of numbers.

Range = Largest value - Smallest value

***Limitations of the range***

- It is an unstable way of determining the spread/variability/dispersion of scores because it depends on only two scores.
- It is affected by extreme scores and can therefore produce misleading results. In most cases the quartile deviation is used instead, because it is not influenced by extreme scores, although only half of the scores are used in its calculation.

**Variance**

Variance is the average of the sum of squared deviations about the mean of a set of values (numbers). It measures the average degree to which each point differs from the mean.

Formula: Variance = Σ(x - x̄)² ÷ N, where x̄ is the mean and N is the number of scores; equivalently, Variance = (Σx² ÷ N) - x̄², as used in Example 2 below.

ACTIVITY: Find the variance for the following set of scores: 5, 9, 16, 17, 18

Mean = (5 + 9 + 16 + 17 + 18) ÷ 5 = 65 ÷ 5 = 13

Sum of squared deviations = (-8)² + (-4)² + 3² + 4² + 5² = 64 + 16 + 9 + 16 + 25 = 130

Variance = 130 ÷ 5 = 26

**Example 2:**

For example: 22, 33, 20, 21

**STEP 1**: Find the mean of the set. 22 + 33 + 20 + 21 = 96; 96 ÷ 4 = 24. The mean of the set is **24**.

**STEP 2**: Square all the values in the set and add them. 22^2^ + 33^2^ + 20^2^ + 21^2^ = 484 + 1089 + 400 + 441 = 2414

**STEP 3**: Divide the value found in Step 2 by the number of values in the set. 2414 ÷ 4 = 603.5

**STEP 4**: Subtract the square of the mean from the value found in Step 3. 603.5 - (24)^2^ = 603.5 - 576 = 27.5

The variance is **27.5**

**STANDARD DEVIATION (S.D)**

- It tells how scores vary around the mean, i.e. how the scores deviate from the mean.
- Scores can deviate positively from the mean (higher than the mean) or negatively from the mean (lower than the mean).
- S.D is more reliable than both the range and the quartile deviation because it includes all the scores in a distribution in its calculation.
- S.D is usually reported along with the mean and is our best estimate of variability. It is determined by subtracting the mean from each raw score to obtain a deviation score (x), squaring each deviation score, summing the squared deviation scores, dividing the sum by the number of scores, and then taking the square root of the result.
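The variance and standard deviation steps described above can be checked with a short Python sketch (not part of the original notes), using the activity's scores; the function name is an illustrative assumption.

```python
from math import sqrt

def variance(scores):
    """Average of the squared deviations about the mean (population form)."""
    m = sum(scores) / len(scores)                    # step 1: the mean
    squared_devs = [(x - m) ** 2 for x in scores]    # step 2: squared deviations
    return sum(squared_devs) / len(scores)           # step 3: their average

scores = [5, 9, 16, 17, 18]       # activity data; mean = 13
print(variance(scores))           # (64+16+9+16+25)/5 = 26.0
print(sqrt(variance(scores)))     # standard deviation = sqrt(26) ~ 5.10
```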
