# Classroom-Based Evaluation 3
Summary
This document provides an introduction to classroom-based evaluation, focusing on second language evaluation. It covers various aspects of evaluation, including placement decisions, the context of second language evaluation, instructional objectives, and methods of information collection.
Full Transcript
# Introduction to Evaluation

## Second Language Evaluation
- Involves many different kinds of decisions:
  - Placement of students
  - Levels
  - Instruction
  - Textbooks/materials
  - Homework
  - Objectives/plans
  - Grading
- Parents, other teachers, and non-instructional educational professionals take part in evaluation.
- Students themselves are also participants in evaluation.
- **Important:** Teachers' language proficiency, their professional qualifications (instructional experience and attitudes), and their availability can affect instructional methods.
- Tests can be a great help in collecting information for evaluation:
  - Pre-tests
  - Observations, comments, entries in students' journals, school records, parents, and medical reports.

## Coming to Terms with Evaluation
- Evaluation is primarily about decision making.
- **Information:** Test results, health, content, etc.
- **Interpretation:** What do the results mean?
- **Decision making:** Decide what to do.

## The Context of Second Language Evaluation
- **Input factors:**
  - Student needs and abilities
  - Time
  - Attitudes
  - Resources
  - Facilities
  - Support
  - Teacher abilities
- **Instructional purpose** (important!)
- **Instructional plans**
- **Instructional practices**
- **Outcomes**
- **Instruction:**
  - The purpose identifies the objectives of instruction - the "why".
  - The plans describe the means of attaining those objectives - the "how".
  - The practices are what actually takes place in the classroom - the "what".
- **Community values and attitudes toward learning can have a significant effect.**
- **Input factors:** Current theories about teaching and learning may influence the effectiveness of the instructional approach of an SL course.

## Instructional Objectives
- Goals that you as a teacher aim for when teaching.
- Provide criteria for assessing the outcomes of your teaching.

### Types of Instructional Objectives
1. **Language:** Language skills that learners are expected to acquire.
2. **Strategic:** Strategies for communicating, learning, and critical thinking.
3. **Socioaffective:** Changes in learners' attitudes or social behaviours that result from classroom instruction.
4. **Philosophical:** Changes in attitudes, values, and beliefs of a more general nature than socioaffective ones.
5. **Methods/Process:** Methods, processes, experiences, materials, activities, or other aspects of instruction (opportunities or experiences).

## Language Objectives
- Are associated with the entire course of instruction as well as with individual units and lessons.
- We refer to the former as **course objectives** and to the latter as **unit and lesson objectives**.

### Using Language Objectives in Evaluation
- Assess the adequacy of course objectives relative to students' needs, background characteristics, and goals.
- Explicit course objectives help students understand how they will be able to use the language.

## Well-Stated and Practical Course Objectives
Strive for the following five characteristics:
1. Course objectives should be general.
2. They should refer to a single domain of language performance.
3. Within a given language domain, course objectives should not overlap; they should be as independent as possible.
4. They should refer to student performance, not to teacher performance.
5. Course objectives refer to products of learning and to processes associated with language performance.

## Instructional Plans
- Specify what should be taught, and when and how it should be taught.
- Blueprints for achieving course objectives.
- **Effective teaching requires detailed plans for an entire course: a syllabus.**

### Aspects of Instructional Plans
1. **Content (objectives):**
   - Specific language content
   - General themes
   - Situations
   - Tasks
   - Communicative functions
   - Linguistic structures
2. **Organization:**
   - Materials and equipment (textbooks, audio books, etc.)
   - Activities and roles:
     - What the students are doing
     - How the students are grouped
     - How the activities are organized in the classroom

## Instructional Practices
- Include the actual strategies, materials, activities, and tasks used by teachers and students.
- What is planned may not always occur in the classroom.

## Input Factors
- SL teaching and learning are affected by a variety of factors from outside the classroom itself.
- Instructional objectives, plans, and practices should be compatible with input factors.
- Input factors cannot be changed, but they should be taken into consideration.

## A Framework for Evaluation
- **The context for classroom-based evaluation:**
  - **The first step:** Identify the purpose for evaluation (the decision-making process).
  - **The second step:** Collect information (assessment).
  - **The third step:** Interpret the information (in context).
  - **The final step:** A decision can be made about how to proceed - ACTION!
- **How decisions are made in the classroom, using Figure 1 as a frame of reference:**
  - **Changing purposes:** Mismatch between students' needs and the objectives; the objectives need to be redesigned to match students' needs. Objectives are judged against students' needs.
  - **Changing plans:** Mismatch between the teacher's guide and the students' textbooks. Plans are judged against objectives.
  - **Changing practices:** Practices are judged against plans.
  - **Students as decision makers:** Students can be partners in planning instruction (student goals).

## A Strategy for Classroom-Based Evaluation
- Identify potential problems and decide on actions to resolve them.

## Planning Evaluation
- **Evaluation needs:**
  - Who will use the results of assessment, and for what purposes?
  - **High-stakes exams:** Used to make important decisions about students.
  - **Low-stakes exams:** Do not affect so many people.
  - The users are mainly teachers, learners themselves (self-assessment), educational administrators, and parents.
- **What will I assess?**
  - Mastery of academic skills
  - Students' understanding of content/subject matter
  - Students' achievement in learning
  - Processes or factors influencing achievement (lack of interest, motivation, attitudes, teaching styles, strategies)
  - Student backgrounds
- **When will I assess?**
  - On a continuous basis: formative evaluation
  - At the end of instructional units: summative assessment
  - At the end of the course/year: for grading or promotion to the next level
- **Five components of a plan:**
  1. A list of people who need assessment information.
  2. A description of the kinds of information they need.
  3. A description of assessment activities.
  4. A schedule for conducting assessment.
  5. A description of record-keeping methods to be used.
- **Formative assessment:** Even the most informal style of assessment can be formative.
- **Assessment for learning:** formative assessment
- **Assessment of learning:** summative assessment
- **Assessment as learning:** self-assessment

## C4: Collecting Information
- **Assessment:**
  - Formative assessment
  - Summative assessment
- **Formative assessment:** Most of our classroom assessment is formative.
  - Evaluating students in the process of "forming" their competencies and skills, in order to help them improve.
  - Delivery (by the teacher)
  - Internalization (by the student)
  - Informal assessment is formative.
  - Primary focus: the ongoing development of the learner's language.
- **Summative assessment:** Aims to measure, or summarize, what has been learned; it occurs at the end of a course or unit of instruction.
  - Final exams and general proficiency exams are examples.
- **Types of information:**
  - **Qualitative:** e.g., helping students to improve their accent.
  - **Quantitative:** e.g., the average reading speed of a grade 10 student is 68 words per minute.

## Qualities of Information
- **Practicality:**
  - Administration time: time needed to collect the information.
  - Compilation time: time needed to score and interpret the information.
  - Administrator qualifications: Are they qualified to use this method?
  - Acceptability: acceptable to students, parents, and the community.
  - Cost: affordable.
- **Consistency:**
  - Stability
  - Qualitative
- **Validity:** The extent to which the information is relevant.
  - Without validity, gathering information is, at best, a waste of time.

## Reliability
- **Defining reliability:** Reliability is concerned with freedom from nonsystematic fluctuation.
  - The difference between two observers is called assessor-related variability, or rater reliability.
  - The unreliability of the person about whom information is being collected is object-related or person-related reliability.
  - The unreliability of the procedures used for collecting information is instrument-related variability.
- **Estimating reliability:** Reliability is a matter of degree.
  - We cannot obtain information with perfect reliability.
  - It is usually expressed by indices ranging from .00 to 1.00.
  - It can only be estimated, not truly calculated.
  - High reliability is desirable.
  - There are practical ways of enhancing reliability in classroom evaluations.

## Validity
- **Defining validity:** The extent to which the information you collect actually reflects the characteristic or attribute you want to know about.
- **What you want to know exactly:**
  - Perfect validity
  - Perfect reliability
- Validity is generally reported as an index that ranges from .00 to 1.00.
  - It, too, can only be estimated.
- **An assessment instrument or procedure can be only as valid as it is reliable.**
- **A test can be reliable without being valid.**
- **Estimating validity:** Validity cannot be assessed directly.

### Types of Validity
1. **Content relevance:** Assessed logically, by carefully and systematically examining whether the method and content of the assessment procedure are representative of the kinds of language skills you want to assess. Content relevance can be characterized as high, moderate, or low, but it cannot be quantified.
2. **Criterion-relatedness:** The extent to which information about some attribute or quality assessed by one method correlates with, or is related to, information about the same or a related quality assessed by a different method. It is expressed as a correlation coefficient that ranges from .00 (no correlation) to ±1.00 (perfect correlation); a short computational sketch appears after the list of information-collection methods below.
3. **Construct validity:** Probably the most difficult type to understand and the least useful for classroom-based evaluation. Construct validation is most useful when you do not know the exact content of the quality or attribute you want to assess, thereby ruling out the use of content validity.

## Methods of Information Collection
- Observation, conferences, portfolios, questionnaires and interviews, dialogue journals, tests.
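The correlation coefficients referred to above, for both reliability indices and criterion-relatedness, are ordinarily Pearson correlations between two sets of scores. As a minimal illustration, the Python sketch below uses invented scores for five hypothetical students to compute a criterion-related validity coefficient; the same calculation applied to two administrations of the same test would give a test-retest reliability estimate. None of the numbers come from the document.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two score lists (-1.00 .. +1.00)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for the same five students:
classroom_test = [62, 71, 55, 80, 68]   # the assessment method being evaluated
criterion      = [60, 75, 52, 84, 70]   # an independent, trusted assessment of the same skill

print(f"criterion-related validity coefficient: {pearson(classroom_test, criterion):.2f}")
# prints about .99 here, i.e. a high level of agreement
```

A value near 1.00 indicates a high level of agreement between the classroom measure and the criterion; a value near .00 indicates little agreement.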
## Hughes: Validity
- **Validity:** The degree to which a test accurately measures what it claims to measure.
- We try to create a test whose scores maximize the contribution of the construct in question and minimize the contribution of irrelevant factors:
  - anxiety, noise, physical conditions, technical issues...
- **Without validity, test results are meaningless.**

### Types of Validity
1. **Face validity:** Unlikely to be accurate.
2. **Content validity:**
   - Over- or under-representation of objectives.
   - A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc., with which it is meant to be concerned. A test would have content validity only if it included a proper sample of the relevant structures.
   - In order to judge whether or not a test has content validity, we need a specification of the skills, structures, etc. that it is meant to cover.
3. **Criterion-related validity:**
   - **Concurrent validity** ("current" validity)
   - **Predictive validity** ("future" validity)
   - The aim is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability.
   - **Concurrent validity** is established when the test and the criterion are administered at about the same time ("high level of agreement" vs. "little agreement").
   - **How the level of agreement is measured:** By comparing the two sets of scores, which yields a "validity coefficient" (a mathematical measure of similarity).
     - **Perfect agreement:** a validity coefficient of 1.
     - **Total lack of agreement:** a validity coefficient of zero.
   - **Predictive validity** concerns the degree to which a test can predict candidates' future performance.
     - Examples: placement exams, proficiency tests.

## Construct Validity
- Internal structure
- Convergent-discriminant evidence
- Alignment with theory
- A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure.
- The word "construct" refers to any underlying ability (or trait) which is hypothesized in a theory of language ability.
- Construct validation is a research activity: theories are put to the test and are confirmed, modified, or abandoned.
- Direct testing vs. indirect testing of skills.
- **Face validity:** A test is said to have face validity if it looks as if it measures what it is supposed to measure.
  - Think-aloud: real-time reports
  - Retrospection: test-takers recall their thoughts after the test (post-test)
- **Validity in scoring:** Scoring must be aligned with the specific ability being assessed.

## Reliability
- **Reliability:** Refers to the consistency of test scores across different test administrations or scoring sessions; reliability is concerned with freedom from nonsystematic fluctuation.
- Reliable scores imply that results would be similar if the test were repeated under the same conditions.
- **Consistency** in testing ensures that test scores are trustworthy and not influenced by irrelevant factors.

## Three General Sources of Unreliability
1. **Instrument-related reliability:** This source of unreliability resides in the procedures used for collecting information.
2. **Assessor-related reliability (rater reliability):** This source of unreliability has to do with instability, or nonsystematic fluctuation, in the person or among the people collecting the information.
3. **Object-related or person-related reliability:** This source of unreliability concerns the person about whom information is being collected.

## Why Reliability Matters
- **Importance:** Reliability is essential to making fair and valid inferences about a test-taker's abilities.
- **Example:** Inconsistent scores undermine the trustworthiness of a test, making it difficult to judge a learner's true proficiency.
- **Note:** Without reliability, any conclusions drawn from a test may be inaccurate.

## Intro to the Reliability Coefficient
- A numerical value (0 to 1) that quantifies reliability; 1 indicates perfect reliability.
- High-stakes tests require higher reliability (e.g., above .90), while lower-stakes tests may accept a lower coefficient.
- It can only be estimated, not truly calculated.
- There are practical ways of enhancing reliability in classroom evaluations.

### Types of Reliability
1. **Test-retest reliability:**
   - Measures how consistent scores are when the same test is administered on different occasions.
   - Scores can vary because of memory effects if the retest comes too soon, or because of changes in ability if it is too delayed.
   - Test-retest reliability is often difficult to achieve in practice.
2. **Alternate-forms reliability:**
   - Assesses reliability by using two equivalent forms of the same test.
   - Alternate forms reduce the influence of memory but require highly similar test versions.
3. **Internal consistency reliability:**
   - Consistency within a single test, often measured by the split-half method.
   - Test items are divided into two halves, and the scores for the two halves are correlated.
   - Requires well-matched items in each half to ensure an accurate estimate of internal consistency.
- **Standard Error of Measurement (SEM):**
  - Estimates the range within which a test-taker's "true score" likely falls.
  - Example: with an SEM of 5, a score of 60 means the true score is likely between 55 and 65, with 68% certainty.
  - SEM helps identify the margin of error in test scores and guides decision-making.
  - The calculation of the standard error of measurement is based on the reliability coefficient and a measure of the spread of all the scores on the test (the greater the reliability coefficient, the smaller the standard error of measurement).

## Scorer Reliability
- **Scorer reliability:** The consistency of scores across different scorers, or across multiple scoring occasions by the same scorer.
- **Types:**
  - **Intra-scorer reliability:** the same scorer, on different occasions.
  - **Inter-scorer reliability:** different scorers.
- Essential in subjective tests (e.g., essays).

## How to Make Tests More Reliable?
- Reliable tests require a consistent test structure and sufficient sampling of skills.
  - Example: a longer reading comprehension test with varied questions enhances reliability.
- Longer tests with independent items generally provide more reliable scores (see the sketch below).
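To connect the reliability coefficient to two practical questions raised above (how much a longer test helps, and how far an observed score may sit from the learner's true score), here is a minimal Python sketch. It uses the Spearman-Brown prophecy formula, a standard psychometric formula for lengthened tests that the text above does not name, and the usual SEM formula, SEM = SD * sqrt(1 - reliability). The reliability of .75 and score standard deviation of 10 are assumed figures, chosen so that the SEM comes out at 5, matching the example given earlier.

```python
import math

def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`
    (e.g. 2.0 = twice as many comparable, independent items)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def standard_error_of_measurement(reliability: float, score_sd: float) -> float:
    """SEM = SD * sqrt(1 - reliability): the typical distance between an
    observed score and the underlying 'true score'."""
    return score_sd * math.sqrt(1 - reliability)

# Illustrative (assumed) figures: a test with reliability .75 and a score SD of 10.
r_original = 0.75
print(f"Reliability if the test is doubled in length: {spearman_brown(r_original, 2.0):.2f}")  # ~0.86

sem = standard_error_of_measurement(r_original, score_sd=10)
print(f"SEM = {sem:.1f}")  # 5.0, as in the example above
observed = 60
print(f"A score of {observed} -> true score roughly between "
      f"{observed - sem:.0f} and {observed + sem:.0f} (about 68% certainty)")
```

Doubling the number of comparable, independent items raises the estimated reliability from .75 to about .86, which is the quantitative counterpart of "sampling behaviour sufficiently" in the list that follows.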
## Ways to Improve Test Reliability
1. **Sampling behaviour sufficiently:** The more items or tasks included, the more reliable the test (like taking multiple shots in archery). Reliability improves when multiple samples of a skill are included in a test.
2. **Excluding non-discriminating items:** Items that do not differentiate well between stronger and weaker students reduce reliability. Removing or revising such items improves the test's ability to differentiate between ability levels.
3. **Minimizing test-taker freedom:** Too much freedom in responses lowers reliability, as responses vary widely. Narrowing the choices in questions improves score consistency across testing occasions.
4. **Writing clear and unambiguous items:** Ambiguity in test items can confuse test-takers and reduce reliability. Careful item wording prevents misunderstanding and enhances test reliability.
5. **Providing explicit instructions:** Clear instructions reduce variability in responses due to misunderstanding. Written instructions should be reviewed and clarified to avoid misinterpretation.
6. **Ensuring legibility and layout quality:** Poorly formatted tests add unnecessary difficulty and reduce reliability. Well-laid-out tests ensure that students focus on content, not readability.
7. **Familiarizing students with the test format:** Familiarity with the format helps students focus on content rather than on navigating the structure, and it improves the consistency of student performance.
8. **Maintaining consistent testing conditions:** Conditions should be uniform to avoid variations in student performance due to external factors. Consistency in administration supports accurate, reliable results.
9. **Using objective scoring methods:** Objective items such as multiple choice increase scoring reliability but may not suit all tests. Use objective items where feasible, but ensure they align with the construct being measured.
10. **As direct as possible:** Candidates should not be given a choice of items, and they should be limited in the way they are allowed to respond. Scoring compositions that are all on one topic will be more reliable than if candidates are allowed to choose from six topics, as has been the case in some well-known tests.
11. **Providing a detailed scoring key:** A clear scoring key ensures consistent scoring, especially for subjective items. Detailed keys reduce scorer bias and enhance reliability.
12. **Training and coordinating scorers:** Proper scorer training minimizes inconsistencies in scoring subjective responses. Training and coordination are crucial, especially in high-stakes testing.
13. **Using candidate numbers for anonymity:** Anonymity reduces bias, especially when scorers know candidates personally. Assign numbers to candidates to maintain impartiality in scoring.
14. **Employing multiple, independent scorers:** Independent scoring by multiple scorers increases score reliability, especially for subjective tests. Discrepancies between scores can be resolved by a third scorer.

## Balancing Reliability and Validity
- Reliability is essential but should not compromise validity; the test must still measure the intended construct.
- Achieving the right balance between reliability and validity is crucial in test design.

## Remember: Validity & Reliability
1. **An assessment instrument or procedure can be only as valid as it is reliable.** Worded differently, inconsistency in a measurement procedure reduces validity: a "noisy" instrument reduces validity.
2. **Nonsystematic effects on the answers reduce the validity of the information.** If only 90 percent of your students answer your question the way you intended, then you know the amount of time spent in the United States for only 90 percent of them.
3. **For example, we said earlier that attitudes toward age might also influence some students' responses to questionnaires.** If such attitudes are operating, the validity of people's answers to a question about age is reduced by whatever amount reflects the influence of attitudes toward age reporting, as compared with the influence of the desire to be truthful.
4. **Finally, a test can be reliable without being valid for its intended purpose.**
   - Reliable, not valid
   - Both reliable and valid
   - Unreliable and invalid
   - Unreliable, but valid
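As a closing illustration of the first case above ("reliable, not valid"), here is a small Python sketch with wholly invented scores: a narrow test produces nearly identical results on two administrations (high test-retest reliability) yet barely relates to a dependable criterion measure of overall proficiency (low validity). All names and numbers are hypothetical.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for five learners.
true_ability  = [50, 60, 70, 80, 90]   # dependable criterion measure of overall proficiency
narrow_test_1 = [70, 55, 80, 60, 72]   # a narrow test, first administration
narrow_test_2 = [71, 54, 80, 61, 71]   # the same test repeated shortly afterwards

reliability = pearson(narrow_test_1, narrow_test_2)  # very high: scores are consistent
validity = pearson(narrow_test_1, true_ability)      # close to zero: wrong construct

print(f"reliability (test-retest): {reliability:.2f}")  # about 0.99 -> 'reliable'
print(f"validity (vs. criterion):  {validity:.2f}")     # about 0.14 -> 'not valid'
```

The reverse case, a valid but unreliable instrument, cannot really occur, which is exactly the point of "an instrument can be only as valid as it is reliable."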