The Promises and Pitfalls of ChatGPT as a Feedback Provider in Higher Education: An Exploratory Study of Prompt Engineering and the Quality of AI-Driven Feedback

Lucas Jasper Jacobsen (Leuphana Universität Lüneburg) & Kira Elena Weber (Universität Hamburg). Both authors contributed equally to this work.

Keywords: AI; ChatGPT; Feedback; Higher Education; Learning Goals; Prompt Engineering; Prompt Manual; Teacher Education

Abstract

ChatGPT, with version GPT-4, is currently the most powerful generative pretrained transformer on the market. To date, however, there is a lack of empirical studies investigating the quality and use of ChatGPT in higher education. We therefore address the following research questions: What kind of prompt is needed to ensure high-quality AI feedback in higher education? What are the differences between novice, expert, and AI feedback in terms of feedback quality and content accuracy? To test our research questions, we formulated a learning goal containing three errors and developed a theory-based manual for determining prompt quality. Based on this manual, we developed three prompts of varying quality and used them to generate feedback with ChatGPT. We also gave the best prompt to novices and experts and asked them to formulate feedback. Our results showed, first, that only the prompt with the highest prompt quality generated almost consistently high-quality feedback. Second, both expert and AI feedback showed significantly higher quality than novice feedback, and AI feedback was not only less time-consuming but also of higher quality than expert feedback in the categories explanation, questions, and specificity. In conclusion, feedback generated with ChatGPT can be an economical and high-quality alternative to expert feedback. However, our findings point to the relevance of a manual for generating prompts to ensure both the quality of the prompt and the quality of the output. Moreover, ethical and data-related questions regarding the future use of ChatGPT in higher education need to be discussed.

1. Introduction

Generative artificial intelligence (GenAI) is among the most potent structures in the field of machine learning (Abukmeil et al., 2021). This method enables the generation of new multimedia content (Baidoo-Anu & Ansah, 2023). One of the most influential frameworks within GenAI is that of Generative Pretrained Transformers (GPT). GPT models generate human-like text structures in a wide variety of languages on almost any topic (Bronstein et al., 2021). Probably the best-known model, ChatGPT, with its current version GPT-4, is currently the most powerful GPT on the market. ChatGPT gained more than one million subscribers within a week of its initial release on November 30, 2022 (Baidoo-Anu & Ansah, 2023). Although ChatGPT has since been used in a wide variety of academic contexts (Stojanov, 2023), there is a lack of comprehensive empirical research examining the quality and use of these innovative systems in higher education (HE) (Crompton & Burke, 2023). A recent study at Stanford University (Demszky et al., 2023) used a sample of over 1,000 faculty members and over 12,000 students to show that faculty members who received automated formative artificial intelligence (AI) feedback were significantly more accepting of student input, resulting in an increase in student satisfaction with the course.
In general, feedback is considered an integral part of educational processes in HE but as Henderson and colleagues (2019) point out “Feedback is a topic of hot 2 debate in universities. Everyone agrees that it is important. However, students report a lot of dissatisfaction: they don’t get what they want from the comments they receive on their work and they don’t find it timely. Teaching staff find it burdensome, are concerned that students do not engage with it and wonder whether the effort they put in is worthwhile” (Henderson et al., 2019, p. 3). In this regard, providing feedback represents both a potential and a challenge (Carless & Boud, 2018), as feedback can also result in negative consequences for learning processes (Kluger & DeNisi, 1996). Therefore, feedback in HE needs to ensure high quality which can be characterized by certain criteria such as concreteness, activation, and empathy (Prilop et al., 2019). Unfortunately, there is often a lack of human and financial resources to provide high-quality feedback (Demszky et al., 2023), which is why AI feedback can potentially optimize teaching and learning processes in HE institutions. Based on current developments and findings it can be inferred that Artificial Intelligence in Education (AIEd) offers the opportunity for more personalized, flexible, inclusive, and engaging learning (Luckin et al., 2016). Luckin et al. (2016) postulate the emergence of individual tutoring systems (ITS) due to AIEd, providing learning activities that are optimally matched to a student's cognitive needs and provide accurate and timely feedback without requiring the presence of a face-to- face instructor. Following up on these ideas, this study will specifically look at the potential of AI feedback by analyzing feedback from the most potent GPT on the market to date. In the present study, we address the following questions: 1. What kind of prompt is needed to ensure high quality AI feedback? 2. What are the differences between novice, expert and AI feedback with regard to the feedback quality? 3 2. Theoretical Background 2.1 AI in Higher Education Although AIEd has been around for about three decades, educators struggle with how to use AIEd for pedagogical purposes and its implications for teaching and learning in HE (Zawacki- Richter et al., 2019). The field of AIEd is growing and developing rapidly, and there is an urgent need to improve academic understanding of its potential and limitations (Crompton & Burke, 2023). One of the most recent developments in AIEd is the introduction of ChatGPT, a text-generating AI model that was made publicly available in late 2022. This tool has the potential to fulfill several use codes identified in previous studies, such as helping students write papers, helping faculty grade student work, and facilitating communication with students (Crompton & Burke, 2023). AIEd, including tools like ChatGPT, has been argued to increase student engagement, collaboration, and accessibility of education (Cotton et al., 2023). It has been used for various purposes, including assessment/evaluation, predicting, AI assistance, intelligent tutoring, and managing student learning (Crompton & Burke, 2023). The extraordinary abilities of ChatGPT to perform complex tasks within the field of education have elicited mixed feelings among educators (Baidoo-Anu & Ansah, 2023). Therefore, the question is whether ChatGPT and related GenAI are the future of teaching and learning or a threat to education (Baidoo-Anu & Ansah, 2023). 
However, empirical research on this new technology, particularly in the context of HE, is still in its infancy, and more research is needed. Consequently, caution is advised when using ChatGPT as a learning aid, as our understanding of its potential and constraints, as well as human interaction and perception of such technologies, is still evolving (Stojanov, 2023). In conclusion, the role of AIEd, particularly ChatGPT, in HE is a complex and multifaceted issue. The technology promises to have profound impacts on HE. These impacts bring with them both opportunities for enhancement and challenges that must be addressed. Realizing the full potential of AI in HE requires 4 continuous research, open ethical discussions, and cooperative efforts among researchers, educators, technologists, and policymakers. This paper contributes to the ongoing discourse by developing a theory driven manual for analyzing the quality of prompts, thereby ensuring a high quality output. Furthermore, we critically examine the quality of AI feedback in HE, especially in comparison to novice and expert feedback. 2.1.1 Prompt Engineering for generative AI in Higher Education For the use of AI, and ChatGPT in particular, in HE, it is crucial to address the relevance of prompt engineering. Therefore, research must answer the question of how to write prompts that yield high quality output in GenAI. In simple terms, prompt engineering is the process of designing effective questions or stimuli, known as "prompts," for AI language models. The aim is to get clear, relevant answers. It's similar to fine-tuning questions for AI to produce desired results. Although prompt engineering is a fairly new research topic, findings consistently suggest that the success of AI language models, like ChatGPT, is not merely determined by their foundational algorithms or training data. Equally crucial is the clarity and accuracy of the prompts they receive (Lo, 2023; ChatGPT & Ekin, 2023). Up until now a limited number of studies highlighted different aspects of prompt engineering (e.g. ChatGPT & Enkin 2023; Lo 2023). For example Kipp (2023) points out that four primary elements (context, question, format, and examples) should serve as modular guidelines for constructing effective prompts. Enkin (2023) proposed five influencing factors for prompt selection (user intent, model understanding, domain specificity, clarity and specificity, and constraints). In addition, there is the CLEAR framework by Lo (2023) which comprises five key elements that a prompt should embody to be effective. According to this framework, a prompt should be concise, logical, explicit, adaptive, and reflective. To sum it up, we already know a lot about good prompts. However, to our knowledge there has been no manual for analyzing the quality of a prompt and no additional examinations if these guidelines really improve the output of AI language 5 models. This study aims to deliver such a manual and investigate if there are differences regarding the output when feeding ChatGPT with different kinds of prompts. 2.2 Feedback Feedback is widely recognized as an integral component of individual and institutional learning and developmental processes (Behnke, 2016; Prilop et al., 2019) and therefore a crucial component in HE (Henderson et al., 2019). 
Feedback is characterized as the information offered to an individual concerning their current performance to foster improvement in future endeavors (Narciss, 2013) and individuals usually struggle to effectively reflect on, manage, and adjust their actions or tasks in the absence of appropriate feedback (Behnke, 2016). Consequently, there has been an increasing "opening" to the utilization of feedback information in recent years in teacher education and training (Funk, 2016, p. 41). Pre-service teachers receive feedback after either actual classroom practice or specific skill training, which could be from peers with a similar knowledge base (novices) or from experts holding knowledge authority (Lu, 2010). But even if the incorporation of feedback sessions in teacher education is becoming increasingly prevalent (Weber et al., 2018; Kraft et al., 2018), feedback designs are often compromised, because feedback from novices is not as high in quality as expert feedback (Prilop et al., 2019a) and educators (experts) frequently express concerns regarding the insufficiency of time for delivering high quality feedback. 2.2.1 Feedback Quality Ericsson, Krampe, and Tesch-Römer (1993) underscore that substantial enhancements in performance are achievable only through high-quality feedback. Also, Prilop et al. (2021) underline that the quality of peer feedback holds significant importance to ensure its acceptance and to facilitate the continuous development of professional competencies among teachers. With regard to the quality of feedback, Prilop et al. (2019; 2021a) elaborated criteria of 6 effective peer feedback for teachers based on various studies in other domains (e.g., Gielen & De Wever, 2015; Prins, Sluijsmans & Kirschner, 2006). Summarizing these criteria, effective feedback should consistently be specific, empathetic, and engaging (Prilop, Weber & Kleinknecht, 2020). On a cognitive level (specific and engaging), numerous studies (e.g., Strijbos et al., 2010) suggest that effective feedback should incorporate both evaluative and tutorial components. Those providing feedback should therefore assess a particular situation with a firm emphasis on content, offer and explain alternative actions, and pose engaging questions. At the affective-motivational level (empathetic), the delivery of feedback is crucial. Ultimately, according to Prins et al. (2006), effective peer feedback should be presented in the first person. This approach suggests that the feedback is a subjective viewpoint open to dialogue, rather than an indisputable fact. In our previous research (Prilop et al., 2021a), we found that critiques should always be counterbalanced by positive evaluations. Regarding these criteria of high quality feedback, few studies (Prins et al., 2006; Prilop et al., 2019a) examined the impact of expertise on feedback quality by analyzing feedback of novices in contrast to experts. 2.2.2 Novice and Expert Feedback Hattie and Timperley (2007) emphasize that feedback can be provided by different agents, such as experts or novices. The disparity in the quality of experts vs. novices feedback has been systematically examined by a limited number of studies to date. Prins, Sluijsmans, and Kirschner (2006) distinguished expert feedback from novice feedback in medical education. They found that experts utilized more criteria, provided more situation-specific comments and positive remarks, and frequently adopted a first-person perspective style. 
They also observed that a significant portion of novices either didn't pose any reflective questions (59%) or failed to offer alternative suggestions (44%). Similar observations were made in the domain of teacher education by Prilop et al. (2019a). They reported that expert feedback was more specific, 7 question-rich, and first-person perspective oriented than pre-service teachers' feedback at the bachelor level. Pre-service teachers seldom included specific descriptions of teaching situations in their feedback and barely utilized activating questions. To sum it up, expert feedback seems to be of higher quality than novice feedback. However, the provision of adaptive feedback is resource intensive for HE teachers if done manually for every learner’s task solution and accordingly, experts in HE often struggle to provide high quality feedback due to lacking resources (Henderson et al., 2019). Automating adaptive feedback on the learners’ task processing to make process-oriented, adaptive feedback accessible to numerous learners is a potential solution (Sailer et al., 2023), but until now, we don't know if AI feedback is qualitatively equivalent to expert feedback in HE. 2.2.3 Conducting Feedback with AI One of the significant advancements in AIEd is the use of learning analytics for automated feedback. Several HE institutions have begun applying learning analytics to evaluate crucial aspects of the learning process and pedagogical practice (Tsai et al. 2020). Recent research has explored the use of large language models (LLMs) for generating automatic adaptive feedback. For example, Zhu et al. (2020) looked into an AI-powered feedback system that includes automated scoring technologies within a high school climate activity task, finding that the feedback assisted students in refining their scientific arguments. Sailer et al. (2023) investigated the effects of automatic adaptive feedback, based on artificial neural networks, on pre-service teachers’ diagnostic reasoning. The study found that adaptive feedback facilitates pre-service teachers’ quality of justifications in written assignments, but not their diagnostic accuracy. Moreover, static feedback had detrimental effects on the learning process in dyads. Additionally, Bernius et al. (2022) applied Natural Language Processing based models to generate feedback for textual student responses in extensive courses. This approach reduced the grading effort by up to 85% and was perceived by the students as being of high precision 8 and improved quality. Accordingly, Kasneci et al. (2023) point out that LLMs can aid university and high school teachers in research and writing tasks, such as seminar works, paper writing, and providing feedback to students. This assistance can make these tasks more efficient and effective and can greatly reduce the amount of time teachers spend on tasks related to providing personalized feedback to students (Kasneci et al., 2023). While AI feedback seems promising according to Sailer et al. (2023) and Kasneci et al. (2023), until now, there are no empirical findings regarding the quality of AI feedback. This gap in the literature is what our study aims to address. By focusing on the analysis of AI feedback in contrast to expert and novice feedback, we hope to contribute to the understanding of its efficacy and potential applications in HE. 3. Aim of the Study Looking at previous studies, Zawacki-Richter et al. 
(2019) highlighted the lack of critical reflection on the challenges and risks of AIEd, the weak connection to theoretical pedagogical perspectives, and the need for further exploration of ethical and educational approaches in the application of AIEd in higher education. Hence, this paper seeks to enhance our understanding of these issues by addressing the following research questions:

a. What kind of prompt do we need to ensure high-quality AI feedback?
b. Are there differences between novice, expert, and AI feedback in terms of feedback quality?

We address the above-mentioned research gaps regarding the use of AI in education by linking our research to theoretical pedagogical perspectives (formulating learning goals and giving feedback) and critically reflecting on the findings, particularly their pedagogical and ethical implications. Figure 1 shows our heuristic working model, which includes the quality of prompts, the quality of feedback, and potential outcomes that should be investigated in future studies.

Figure 1: Heuristic Working Model adapted from Narciss (2008) and Pekrun et al. (2023)

4. Method

4.1 Development of a Theory-Driven Prompt Manual

Following Wittwer et al. (2020), we first formulated a learning goal with three types of errors (learning goal: "Students will recognize a right triangle and understand the Pythagorean theorem"; types of errors: no activity verb, instructional rather than learning goal, and multiple learning goals in a single statement). Then we developed a theory-driven coding manual for analyzing prompt quality for GenAI (e.g., ChatGPT). To achieve the best results, we integrated various prompt engineering approaches. Our prompt design was influenced by Prof. Dr. Michael Kipp's 2023 lecture on prompting for AI, in which he highlighted four key elements for every prompt: 1) context, 2) question, 3) format, and 4) examples. With these modular guidelines in mind, we considered the five factors influencing prompt selection formulated by ChatGPT and Ekin (2023): 1) user intent, 2) model understanding, 3) domain specificity, 4) clarity and specificity, and 5) constraints. In the last step, we incorporated Leo S. Lo's (2023) CLEAR framework to shape the content within each prompt module. Lo's framework likewise consists of five components: a prompt should be 1) concise, 2) logical, 3) explicit, 4) adaptive, and 5) reflective.
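Purely as an illustration (not part of the original study), the manual's scoring logic can be expressed in a few lines of Python; the criterion names follow Table 1 below, while the example codes are hypothetical:

```python
# Minimal sketch: scoring a prompt against the eight criteria of the prompt manual.
# Each criterion is coded 0 (suboptimal), 1 (average), or 2 (good); the maximum is 16.
# The concrete codes below are hypothetical illustration values, not data from the study.

CRITERIA = [
    "role", "target_audience", "channel", "mission",
    "format_and_constraints", "conciseness", "domain_specificity", "logic",
]

def score_prompt(codes: dict[str, int]) -> int:
    """Sum the 0-2 codes for all eight criteria of the prompt manual."""
    for criterion in CRITERIA:
        if codes.get(criterion) not in (0, 1, 2):
            raise ValueError(f"criterion '{criterion}' must be coded 0, 1, or 2")
    return sum(codes[criterion] for criterion in CRITERIA)

# Hypothetical coding of a prompt of intermediate quality (8 of 16 points):
example_codes = {
    "role": 0, "target_audience": 1, "channel": 0, "mission": 2,
    "format_and_constraints": 2, "conciseness": 1, "domain_specificity": 1, "logic": 1,
}
print(score_prompt(example_codes))  # -> 8
```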
Table 1: Prompt manual to ensure the development of high-quality prompts.

Context – Role
Good (2): The role of ChatGPT and the role of the person asking the question are explained.
Average (1): Only the role of ChatGPT is explained.
Suboptimal (0): Neither the role of ChatGPT nor the role of the person asking the question is explained.

Context – Target audience
Good (2): There is a clearly defined and described target audience.
Average (1): The target audience is roughly described.
Suboptimal (0): The target audience is not specified.

Context – Channel
Good (2): The channel is clearly described.
Average (1): The channel is roughly described.
Suboptimal (0): The channel is not mentioned.

Mission – Mission/Question
Good (2): The mission given to the AI is clearly described.
Average (1): The mission given to the AI is roughly described.
Suboptimal (0): The mission given to the AI is not clear.

Clarity and specificity – Format and constraints
Good (2): Stylistic properties as well as length specifications are described.
Average (1): Either stylistic properties or a length specification is given.
Suboptimal (0): Neither stylistic properties nor length specifications are given.

Conciseness
Good (2): The prompt contains only information that is directly related and relevant to the mission/question; it is clear and concise.
Average (1): The prompt is concise, with very little superfluous information.
Suboptimal (0): The prompt contains a lot of information that is irrelevant to the output.

Domain specificity
Good (2): Technical terms are used correctly and give the LLM the opportunity to refer back to them in the answer.
Average (1): Technical terms are used sporadically or without explanation.
Suboptimal (0): No specific vocabulary that is relevant to the subject area of the question is used.

Logic
Good (2): The prompt has a very good reading flow, internal logical coherence, a very coherent sequence of information, and a clearly understandable connection of content and mission.
Average (1): The prompt fulfills only parts of the conditions of the coding "2".
Suboptimal (0): The prompt is illogically constructed.

Subsequently, we used our prompting manual to develop three prompts of varying quality (bad, intermediate, good) and prompted ChatGPT to formulate feedback on the above-mentioned learning goal. An intriguing question emerges from this endeavor: Do the prompts deemed superior according to our manual consistently yield superior results?
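The feedback in this study was generated with ChatGPT (GPT-4) itself. Purely as an illustrative sketch, and not the procedure reported here, repeated generation of this kind could also be scripted with the OpenAI Python client; the model name, loop size, and parameters below are assumptions:

```python
# Illustrative sketch only: scripted generation of several feedback texts per prompt.
# The study itself used ChatGPT directly; client setup, model name, and parameters
# here are assumptions for demonstration purposes.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

LEARNING_GOAL = ("Students will recognize a right triangle "
                 "and understand the Pythagorean theorem.")

def generate_feedback(prompt: str, n: int = 20, model: str = "gpt-4") -> list[str]:
    """Request n independent feedback texts for the same prompt."""
    feedbacks = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        feedbacks.append(response.choices[0].message.content)
    return feedbacks

# Example with a low-quality prompt similar to prompt 1 in Section 5.1:
prompt_1 = (f'Please give feedback for the following learning goal: "{LEARNING_GOAL}" '
            "The feedback should be 150 - 250 words")
low_quality_feedbacks = generate_feedback(prompt_1, n=20)
```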
In order to analyze the quality of the AI feedback and to answer our first research question, we used another coding scheme, which we adapted from Prilop et al. (2019), Prins et al. (2006), and Wu & Schunn (2021).

4.2 Assessment of Feedback Quality

We performed a quantitative feedback analysis to gain information about the quality of the AI feedback. For this purpose, we adapted a coding scheme developed by Prilop et al. (2019) and derived from the feedback quality index of Prins et al. (2006). In this approach, each feedback served as a unit of analysis, allowing for a comprehensive evaluation of its content, as suggested by Prins et al. (2006). The original coding scheme is divided into six distinct categories: assessment criteria, specificity, suggestions, questions, use of first-person perspective, and valence (positive/negative). High-quality feedback is assigned a code of '2,' average feedback receives a '1,' and feedback deemed sub-optimal is coded as '0.' A more detailed explanation of the coding manual and the coding process can be found in Prilop et al. (2020). In addition to the original manual, we added three more categories: Errors, Explanations, and Explanations of suggestions. Looking into the AI feedback, it became evident that an Error category had to be taken into account, primarily because of ChatGPT's tendency to hallucinate (Alkaissi & McFarlane, 2023; Ji et al., 2022). This is the only category where points are deducted. Because of the domain-specific nature of learning goals, we added the category of Explanation, following the coding manual of Wu and Schunn (2021). Furthermore, we split up the category of Suggestions into two categories: Presence of Suggestion and Explanation of Suggestion. This allowed for more accurate coding of the feedback (see Table 2 for the coding manual and the inter-coder reliability).

Table 2: Content analysis of feedback quality: categories, examples, and inter-coder reliability (Fleiss kappa).

Assessment criteria (κ = .81)
Good (2): Aspects of a good learning goal are addressed using technical terms/theoretical models.
Average (1): Aspects of a good learning goal are addressed without technical terms/theoretical models.
Sub-optimal (0): Aspects of a good learning goal are not addressed.
Good feedback example: "However, the learning goal, as currently stated, has room for improvement. The verb 'recognize' is on the lower end of Bloom's taxonomy; it's more about recall than application or analysis." (AI feedback 3.30)

Specificity (κ = .81)
Good (2): All three error types are named and explicitly explained.
Average (1): Two types of errors are named and explicitly explained.
Sub-optimal (0): One type of error is named and explicitly explained.
Good feedback example: "Your goal contains two separate objectives: […] Next, the verbs you've chosen, 'recognize' and 'understand,' are a bit vague in the context of Bloom's taxonomy […] And how do you envision this learning goal relating back to the learner? […]" (AI feedback 3.28)

Explanation (κ = .86)
Good (2): A detailed explanation is given as to why the aspects of a good learning goal are relevant.
Average (1): A brief explanation is given of why the aspects of a good learning goal are relevant.
Sub-optimal (0): No explanation is given as to why the aspects of a good learning goal are relevant.
Good feedback example: "According to best practices, it's beneficial to focus on just one learning goal at a time. This makes it clearer for both you and the students, streamlining the assessment process." (AI feedback 3.14)

Presence of suggestions (κ = .86)
Good (2): Alternatives for improvement are suggested in a cognitively stimulating way.
Average (1): Alternatives are presented in concrete terms.
Sub-optimal (0): No alternatives are named.
Good feedback example: "A more targeted learning goal will focus on just one of these. Which one is your priority?" (AI feedback 3.28)

Explanation of suggestions (κ = .82)
Good (2): Alternatives are explained in detail.
Average (1): Alternatives are briefly explained.
Sub-optimal (0): Alternatives are not explained.
Good feedback example: "This would align the goal more closely with achieving deeper understanding and skill utilization. […] This goal is learner-centered, contains only one focus, and involves higher-level thinking skills. It also makes the intended learning outcome clear." (AI feedback 3.30)

Errors (κ = .90)
Code -2: The feedback includes several content errors regarding learning goals.
Code -1: The feedback includes one error regarding learning goals.
Code 0: The feedback does not include errors regarding learning goals.

Questions (κ = 1.00)
Good (2): Activating question posed.
Average (1): Clarifying question posed.
Sub-optimal (0): No questions posed.
Good feedback example: "So, what specific skill or understanding are you hoping your students will gain by the end of this lesson?" (AI feedback 3.28)

First person (κ = .90)
Good (2): Written in first person throughout the feedback.
Average (1): Occasionally written in first person.
Sub-optimal (0): Not written in first person.
Good feedback example: "I appreciate the effort you've put into formulating this learning goal for your future teachers. […] Let me share my thoughts with you. Firstly, I noticed […]" (AI feedback 3.23)

Valence (κ = .76)
Good (2): Equilibrium of positive and negative feedback.
Average (1): Mainly positive feedback.
Sub-optimal (0): Mainly negative feedback.
Good feedback example: "I don't think this learning goal is well worded. [...] However, I like that your learning goal is formulated in a very clear and structured way." (Novice feedback 13)
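To make the aggregation of the adapted scheme concrete: the eight quality categories each contribute 0–2 points (a maximum of 16), while Errors can only deduct points. The following minimal sketch (our illustration; the field names and example codes are hypothetical) computes such a total score:

```python
# Minimal sketch: aggregating the codes of one piece of feedback into a total quality score.
# Eight categories are coded 0-2 (maximum 16); "errors" is coded 0, -1, or -2 and only deducts.
# Field names and the example values are illustrative, not data from the study.

QUALITY_CATEGORIES = [
    "assessment_criteria", "specificity", "explanation",
    "presence_of_suggestions", "explanation_of_suggestions",
    "questions", "first_person", "valence",
]

def total_feedback_quality(codes: dict[str, int]) -> int:
    """Sum the eight 0-2 quality codes and apply the error deduction (0 to -2)."""
    quality = sum(codes[category] for category in QUALITY_CATEGORIES)
    return quality + codes.get("errors", 0)  # errors is non-positive

# Hypothetical coding of a high-quality feedback (12 of 16 points):
example = {
    "assessment_criteria": 2, "specificity": 2, "explanation": 1,
    "presence_of_suggestions": 2, "explanation_of_suggestions": 1,
    "questions": 2, "first_person": 2, "valence": 0, "errors": 0,
}
print(total_feedback_quality(example))  # -> 12
```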
4.3 Coding of the Feedback

Subsequently, the AI feedback (20 AI feedbacks generated with a low-quality prompt, 20 generated with a middle-quality prompt, and 20 generated with a high-quality prompt) was coded by three trained coders. These coders were student employees who underwent training from a member of the research team. To prepare for the task, all three of them initially coded a sample set of 20 feedback comments chosen at random. Any discrepancies between the coders were then discussed and resolved, as outlined in Zottmann et al. (2013). After this preliminary step, each coder was randomly allocated feedback comments to code. To quantify the level of agreement between the coders, we used Fleiss kappa (κ) as per the methodology of Fleiss & Cohen (1973). The coding process resulted in substantial kappa values, as detailed in Table 2, indicating that the content had been reliably coded. Based on our analysis of the different AI feedback, it became evident which prompt yielded favorable results. Thereupon, we presented the learning goal to 30 pre-service teachers in their fourth semester (novices) as well as to seven teacher educators, two professors of school education, one teacher seminar instructor, and one school principal (experts), and asked them to also formulate feedback based on the high-quality AI prompt. This feedback was then coded by the same coders.
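Agreement coefficients of this kind can be computed, for example, with statsmodels; the following sketch is our illustration with made-up ratings for three coders on one 0–2 category, not the study's data:

```python
# Illustrative sketch: Fleiss' kappa for three coders rating the same feedback texts
# on one 0-2 category. The ratings below are made up; the study reports its kappa
# values per category in Table 2.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = feedback texts (subjects), columns = the three trained coders, values = codes 0-2
ratings = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 2, 2],
    [1, 1, 1],
])

# aggregate_raters converts raw ratings into a subjects x categories count table
table, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.2f}")
```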
4.4 Method of Analyses

In the first step, we used our prompt manual to analyze the prompt quality of our three different prompts. We then analyzed differences between AI feedback (n = 30), expert feedback (n = 11), and novice feedback (n = 30) (independent variables) concerning the different subdimensions of feedback quality (dependent variables) with one-way analyses of variance, followed by Bonferroni post hoc tests. All statistical calculations were performed using SPSS 26, and we set the significance level at p <.05 for all tests.

5. Results

5.1 Differences between Prompts and their Output

Regarding the first research question, we fed ChatGPT with different types of prompts and analyzed the outcome concerning the quality as well as the correctness of the feedback. The first prompt, "Please give feedback for the following learning goal: 'Students will recognize a right triangle and understand the Pythagorean theorem.' The feedback should be 150 - 250 words", reached a low quality (M = 5 points of 16 regarding the prompt manual). The second prompt entailed more details in contrast to the first prompt: "Please give feedback for the following learning goal: 'Students will recognize a right triangle and understand the Pythagorean theorem.' Use the criteria of good feedback. Your feedback should be concrete, empathic and activating. We give you some criteria for a good learning goal: A good learning goal contains an activity verb, is related to the learner, contains only one learning goal and refers to the learning outcome. Your feedback should be about 150-250 words". Accordingly, this prompt reached a slightly better prompt quality according to our manual (8 points of 16). The third prompt had the highest prompt quality (15 of 16 possible points): "I want you to be a tough critic with professional feedback. I am a lecturer at an institute for educational sciences and train future teachers. I want you to give feedback on a learning goal that is used for teachers' progress plans. The feedback should meet certain criteria. The criteria are: the feedback should be concrete, empathic and activating. Ask stimulating questions. Phrase feedback in terms of first-person messages. Refer to the content of the learning goal. Explain your evaluation. I will give you some criteria for a successful learning goal. Include them in your feedback. A good learning goal contains an action verb, please consider Bloom's taxonomy of action verbs. A good learning goal is related to the learner, contains only one learning goal, relates to the learning outcome, is concrete, and connects content and goal. The tone of the text should sound like you are a befriended teacher. The feedback should be 150 - 250 words and written in continuous text. Ask me first what learning goal I want feedback on. When you feel that you know all the necessary contexts, think step by step how to formulate your feedback. The feedback should be exclusively about the formulated learning goal."

We generated 20 feedbacks for each prompt and coded them with our feedback quality manual. In order to compare the feedback, we conducted an ANOVA with Bonferroni posthoc tests. Our results showed significant differences between the prompts regarding their feedback quality for all subdimensions except valence and presence of suggestions (for more details about descriptive data see Table 3). Bonferroni adjusted posthoc tests revealed that the feedback generated with prompt 3 (the most sophisticated prompt) performed significantly (p <.001) better in the subcategory assessment criteria than prompt 1 (MDiff = 1.50, 95%-CI [1.10, 1.90]) and prompt 2 (MDiff = 0.90, 95%-CI [0.50, 1.30]). We found the same effect for the categories explanation (prompt 1: MDiff = 0.75, 95%-CI [0.41, 1.09], p <.001; prompt 2: MDiff = 0.40, 95%-CI [0.06, 0.74], p <.05), first person (prompt 1: MDiff = 1.05, 95%-CI [0.63, 1.47], p <.001; prompt 2: MDiff = 0.95, 95%-CI [0.53, 1.37], p <.001) and questions (prompt 1: MDiff = 0.70, 95%-CI [0.28, 1.12], p <.001; prompt 2: MDiff = 1.00, 95%-CI [0.58, 1.42], p <.001). Furthermore, the feedback generated with prompt 3 performed significantly (p <.001) better than prompt 1 for the categories explanation of suggestion (MDiff = 0.60, 95%-CI [0.23, 0.97]) and specificity (MDiff = 1.25, 95%-CI [0.90, 1.60]). Looking at the category errors, we found that prompt 2 generated significantly (p <.001) more errors than prompt 1 (MDiff = -0.85, 95%-CI [-1.34, -0.36]) and prompt 3 (MDiff = -0.95, 95%-CI [-1.44, -0.46]).

Table 3: Quality of the feedback generated with the three different prompts.

Concreteness
| | Assessment criteria (M / SD / Min. / Max.) | Explanation (M / SD / Min. / Max.) |
| Prompt 1 | 0.45 / 0.76 / 0 / 2 | 0.25 / 0.44 / 0 / 1 |
| Prompt 2 | 1.05 / 0.39 / 0 / 2 | 0.60 / 0.50 / 0 / 1 |
| Prompt 3 | 1.95 / 0.22 / 1 / 2 | 1.00 / 0.32 / 0 / 2 |

Empathy
| | First person (M / SD / Min. / Max.) | Valence (M / SD / Min. / Max.) |
| Prompt 1 | 0.00 / 0.00 / 0 / 0 | 0.85 / 0.56 / 0 / 2 |
| Prompt 2 | 0.10 / 0.45 / 0 / 2 | 1.00 / 0.00 / 1 / 1 |
| Prompt 3 | 1.05 / 0.83 / 0 / 2 | 1.00 / 0.00 / 1 / 1 |

Activation
| | Questions (M / SD / Min. / Max.) | Presence of suggestions for improvement (M / SD / Min. / Max.) | Explanation of suggestions (M / SD / Min.) |
| Prompt 1 | 1.20 / 0.52 / 0 / 2 | 1.15 / 0.75 / 0 / 2 | 0.50 / 0.51 / 0 |
| Prompt 2 | 0.90 / 0.72 / 0 / 2 | 1.15 / 0.37 / 1 / 2 | 1.25 / 0.55 / 0 |
| Prompt 3 | 1.90 / 0.31 / 1 / 2 | 1.50 / 0.51 / 1 / 2 | 1.10 / 0.31 / 1 |

Correctness
| | Specificity (M / SD / Min. / Max.) | Errors (M / SD / Min. / Max.) |
| Prompt 1 | 0.10 / 0.30 / 0 / 1 | -0.40 / 0.50 / -1 / 0 |
| Prompt 2 | 1.05 / 0.39 / 0 / 2 | -1.25 / 0.79 / -2 / 0 |
| Prompt 3 | 1.35 / 0.59 / 0 / 2 | -0.30 / 0.57 / -2 / 0 |
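All analyses reported here were run in SPSS 26. As a rough illustration only, an equivalent one-way ANOVA with Bonferroni-corrected pairwise comparisons could be set up in Python as follows (the score arrays are placeholders, not the reported data):

```python
# Illustrative sketch: one-way ANOVA across the three prompts for one feedback-quality
# subdimension, followed by Bonferroni-corrected pairwise t-tests. The study used SPSS 26;
# the score arrays below are placeholders, not the reported data.
from itertools import combinations
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

groups = {
    "prompt_1": np.array([0, 1, 0, 1, 0, 2, 0, 1, 0, 0]),  # placeholder codes (0-2)
    "prompt_2": np.array([1, 1, 1, 2, 1, 1, 0, 1, 1, 2]),
    "prompt_3": np.array([2, 2, 2, 2, 1, 2, 2, 2, 2, 2]),
}

# Omnibus one-way ANOVA
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Pairwise comparisons with Bonferroni correction
pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p:.4f}, significant = {sig}")
```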
These quantitative results can be illustrated with the following examples, which show a feedback created with low prompt quality (prompt 1) and a feedback created with the most sophisticated prompt (prompt 3). The feedback created with prompt 1 received zero points from our trained coders in terms of its feedback quality. Although some errors are described very roughly and aspects of a good learning goal are somewhat implicitly outlined, it would be impossible for a novice to use the information provided in this specific feedback to improve the formulated learning goal. The feedback generated with prompt 3 was coded for quality with a score of 12 out of a possible 16. The high-quality parts of the feedback are annotated and explained in the feedback itself (comments in brackets). This illustration should not be regarded as an in-depth qualitative analysis, but instead illustrates how the AI feedback differs when ChatGPT is fed with prompts of varying quality.

AI feedback which was generated with prompt 1:

The learning goal "Students will recognize a right triangle and understand the Pythagorean theorem" is clear but could benefit from further specificity. Recognizing right triangles and understanding the Pythagorean theorem are two distinct skills, and clarifying the level of understanding and application expected can enhance the goal. While the goal is relevant to mathematics education, it would be beneficial to outline how success will be measured. Breaking the goal into subgoals, such as identification, understanding, and application, could help in creating appropriate assessments. Finally, aligning this goal with students' grade level and setting a specific time frame might provide a more structured educational pathway. This feedback aims to enhance the clarity and measurability of the goal, ensuring it is tailored to the students' needs. [→ ChatGPT does not write in first person throughout the feedback; ChatGPT provides no alternatives, poses no questions, and recognizes only one error]

AI feedback which was generated with prompt 3:

I appreciate the effort you've put into formulating this learning goal. It's clear that you aim to teach a foundational concept in geometry that has a wide range of applications. Let me share my thoughts with you. [→ ChatGPT writes in first person throughout the feedback]

Firstly, I noticed that the goal contains two distinct objectives: recognizing right triangles and understanding the Pythagorean theorem. [→ ChatGPT recognizes the first error: multiple learning goals in a single statement] This could lead to a scattered focus. [→ ChatGPT explains the relevance of learning goals] How might you refine it to contain just one learning goal? [→ ChatGPT suggests alternatives in a cognitively stimulating way and poses a question]

In terms of action verbs, "recognize" and "understand" are quite broad and don't lend themselves to easy assessment. [→ ChatGPT recognizes the second error: no activity verb] Bloom's taxonomy might help you choose more specific verbs that clarify what you want students to do. [→ ChatGPT uses theoretical models to address good learning goals] Perhaps "identify" for recognizing right triangles and "apply" for using the Pythagorean theorem would be more measurable and actionable? [→ ChatGPT presents alternatives]

The goal is aimed at the learner, which is a positive, but I wonder if it's concrete enough. How will you measure "understanding" in this context? Could you link this more closely to a specific learning outcome, like solving a problem using the Pythagorean theorem? [→ ChatGPT recognizes the third error: instructional rather than learning goal; and poses a question]

Lastly, the connection between content and goal is somewhat present but could be more explicit. Could the goal focus more on what students will actually do with their knowledge about right triangles and the Pythagorean theorem in practice? [→ ChatGPT poses a thought-provoking question]

So, what do you think about these points? How would you modify the learning goal to make it more specific, actionable, and closely related to measurable outcomes? [→ ChatGPT poses thought-provoking questions]
5.2 Differences between Novice, AI and Expert Feedback

The prompt with the highest quality (prompt 3) was given to pre-service teachers and experts (for information about the experts, see 4.3) in order to compare the AI feedback with novice and expert feedback. We again conducted an ANOVA with Bonferroni posthoc tests. Our results showed significant differences between the groups regarding their feedback quality for all subdimensions except the category empathy with the subdimensions valence and first person (for more details about descriptive data see Table 4). Bonferroni adjusted posthoc tests confirmed the results of prior studies (for example Prilop et al., 2021; Weber et al., 2019) and showed that expert feedback was more concrete, more activating, and more correct, but not more empathetic, than that of novices. Expert feedback showed significantly higher quality (p <.001) in the subcategories assessment criteria, explanation, questions, presence of suggestions, explanation of suggestions, and specificity. The comparison between novice and AI feedback showed that AI feedback outperformed novice feedback in all subcategories except valence and first person. Regarding the difference between AI and expert feedback, the Bonferroni adjusted posthoc tests revealed that the AI feedback had a higher quality than expert feedback in the subcategories explanation (MDiff = 0.46, 95%-CI [0.17, 0.74], p <.001), questions (MDiff = 0.50, 95%-CI [0.07, 0.93], p <.05) and specificity (MDiff = 0.96, 95%-CI [0.52, 1.41]).

Table 4: Quality of the novice, expert, and AI feedback.

Concreteness
| | Assessment criteria (M / SD / Min. / Max.) | Explanation (M / SD / Min. / Max.) |
| Peers | 0.63 / 0.81 / 0 / 2 | 0.10 / 0.31 / 0 / 1 |
| Experts | 1.64 / 0.51 / 1 / 2 | 0.55 / 0.52 / 0 / 1 |
| AI | 1.97 / 0.18 / 1 / 2 | 1.00 / 0.26 / 0 / 2 |

Empathy
| | First person (M / SD / Min. / Max.) | Valence (M / SD / Min. / Max.) |
| Peers | 1.10 / 0.71 / 0 / 2 | 1.10 / 0.30 / 1 / 2 |
| Experts | 1.18 / 0.60 / 0 / 2 | 1.25 / 0.50 / 1 / 2 |
| AI | 1.10 / 0.76 / 0 / 2 | 1.00 / 0.39 / 0 / 2 |

Activation
| | Questions (M / SD / Min. / Max.) | Presence of suggestions for improvement (M / SD / Min. / Max.) | Explanation of suggestions (M / SD / Min. / Max.) |
| Peers | 0.17 / 0.38 / 0 / 1 | 0.87 / 0.82 / 0 / 2 | 0.30 / 0.54 / 0 / 2 |
| Experts | 1.36 / 0.81 / 0 / 2 | 1.73 / 0.47 / 1 / 2 | 0.82 / 0.60 / 0 / 2 |
| AI | 1.86 / 0.44 / 0 / 2 | 1.57 / 0.50 / 1 / 2 | 1.13 / 0.35 / 1 / 2 |

Correctness
| | Specificity (M / SD / Min. / Max.) | Errors (M / SD / Min. / Max.) |
| Peers | 0.17 / 0.38 / 0 / 1 | -0.73 / 0.87 / -2 / 0 |
| Experts | 0.64 / 0.67 / 0 / 2 | -0.18 / 0.60 / -2 / 0 |
| AI | 1.60 / 0.56 / 0 / 2 | -0.17 / 0.46 / -2 / 0 |
6. Discussion

The findings of this study offer compelling insights into the utility and effectiveness of AI-generated feedback, specifically ChatGPT, in HE. Currently, novice feedback, in the form of peer feedback, is often used in HE, but it is not always conducive to learning (Kluger & DeNisi, 1996). Moreover, experts are challenged to provide high-quality feedback in HE due to a lack of human and financial resources (Demszky et al., 2023). AI feedback can provide an enriching and at the same time economical alternative here. A particularly promising result of our study is that feedback generated by ChatGPT surpassed novice feedback in quality and even rivaled that of experts. Moreover, our study underlines the importance of prompting when using ChatGPT.

In our first research question, we wanted to know what kind of prompt is needed for high-quality AI feedback. One key finding of our study was the critical role played by the quality of the prompt in determining the quality of AI-generated feedback. While AI can indeed generate high-quality feedback, the output is dependent on the context, mission, specificity, and clarity of the prompts provided. The study revealed that only the prompt with the highest quality could induce ChatGPT to generate consistently high-quality feedback. Looking at the category errors, prompt 2 was revealed to be a wolf in sheep's clothing, with good stylistic properties but significantly more errors than prompt 1 and more errors than any other prompt or feedback provider in this study. This illustrates the potential of GenAI to hallucinate (Alkaissi & McFarlane, 2023; Ji et al., 2022) and underscores the importance of careful, theory-driven design of prompts. Crafting high-quality prompts is itself a skill that educators need to master, necessitating a manual or guidelines. In our study, we designed a prompt manual which could and should be used by educators who work with ChatGPT. However, relying on a manual to create prompts may introduce another layer of complexity, and therefore future studies in this area are needed.

With regard to research question 2, our study supports previous findings (Prilop et al., 2021; Weber et al., 2019) that expert feedback is of higher quality than novice feedback. We found that experts outperform pre-service teachers in the categories concreteness, activation, and correctness, but not in the category empathy. The same holds when we compare AI and novice feedback. By comparing AI feedback with expert feedback, we complement these findings and offer new insights into feedback processes in HE. Our results show that AI feedback can outperform expert feedback in the categories explanation, questions, and specificity. This stands as a testament to the transformative potential of AI in educational settings, offering the promise of scalable, high-quality feedback that could revolutionize the way educators assess student work. Furthermore, the AI-generated feedback was produced in significantly less time than expert feedback (in our study, ChatGPT produced an average of 49 pieces of feedback in the time an expert needed to produce one), heralding efficiency gains that could free up educators for more personalized or creative pedagogical endeavors. However, considering our proposed heuristic model, researchers should investigate in future studies how AI-driven feedback is perceived by students and whether students' learning experiences and learning gains can be enhanced by AI feedback.
This is an endeavor we intend to pursue. Overall, our findings lend credence to the promise of AI-based systems like ChatGPT as a viable alternative to expert feedback in HE. However, these promises come with caveats, particularly concerning prompt quality, data ethics, and the nuanced intricacies that human experts bring to the educational table. Moreover, they are intertwined with substantial challenges and pitfalls that demand academic and ethical scrutiny. The surprising finding that ChatGPT not only generated feedback more quickly but also more accurately than human experts opens new avenues for efficiency in higher education. However, we must temper this excitement by considering the scope and limitations of AI. While AI can quickly analyze and generate feedback based on set parameters, it lacks the nuanced understanding of individual learner psychology, needs, and the socio-cultural context within which learning occurs. It is vital to recognize that expertise is not solely a function of accurate or quick feedback. Experts bring a depth of experience, professional judgment, and a personal 25 touch to their interactions with students. These qualities are currently beyond the reach of AI systems, including ChatGPT, and may prove irreplaceable in educational settings that value not just the transfer of knowledge, but also the building of relationships and character. And even if efficiency and quality are the only benchmarks, there was one outlier with multiple errors among the 20 feedbacks generated by the highest quality prompt. This leads us to the hypothesis that experts are still needed but that their tasks are shifted from providing feedback to monitor and maybe revise AI-Feedback. In future studies, it should be investigated how the quality of expert feedback can be enhanced by using ChatGPT and how this intertwined approach is perceived by students and educators in HE. Going beyond the promise of efficiency and quality, and considering Russel and Norvig’s (2010) statement, that every researcher in the field of artificial intelligence ought to be attentive to the ethical ramifications of their projects, it becomes evident that the ethical and data-related dimensions of ChatGPT cannot be ignored in HE. While the AI is not subjectively biased, the data on which it is trained on could have inherent biases. Moreover, there are potential concerns about data security, privacy, and intellectual property, particularly in a learning environment where sensitive information may be discussed. As educators and policy-makers think about implementing ChatGPT into HE, these ethical questions need careful attention and possibly regulatory oversight. To sum it up, we come to the same conclusion as Zawacki-Richter et al. (2019): “We should not strive for what is technically possible, but always ask ourselves what makes pedagogical sense” (p. 21). 6.2 Limitations and Implications The current study takes an in-depth look at the efficacy of ChatGPT, specifically the GPT-4 version, as a tool for generating feedback in HE. However, there are also some limitations. An important limitation of our study that warrants discussion is the restricted focus on a single learning goal and a limited set of errors for which feedback was generated. This narrow scope may limit the generalizability of our findings. 
While we found that ChatGPT outperforms both 26 novices and experts in providing high-quality feedback for the specific errors we examined, it remains an open question whether these findings would hold true across a broader range of academic subjects and tasks in HE. Educational settings are diverse, encompassing a wide array of subjects, each with their own unique types of content and forms of assessment. Therefore, it would be risky to assume that the efficacy of ChatGPT in our context would be universally applicable across all educational environments. Future research should aim to diversify the types of tasks and the corresponding feedback. This would provide a more comprehensive understanding of where AI-based feedback tools like ChatGPT can be most effectively and appropriately utilized in HE. Until such broader research is conducted, the application of our findings should be considered preliminary and best suited for contexts similar to the one we studied. Another point which should be considered as a practical implication, is that the relevance of prompt engineering may create a barrier to entry for educators less familiar with the nuances of designing effective prompts, thereby necessitating further training or guidance. 6.3 Conclusion In conclusion, ChatGPT presents a compelling case for being incorporated as a tool for feedback in higher education, with both quality and efficiency as its major selling points. However, its application is not without pitfalls. In summary, we find that ChatGPT has the potential to be a useful tool, provided that educators are skilled in prompt engineering and adept at utilizing the tool for optimal results. Or as Azaria et al. (2023) point out with the title of their article “ChatGPT is a Remarkable Tool - For Experts”. The dependence on prompt quality, ethical challenges, and the irreplaceable nuanced inputs from human experts make it a tool to be used cautiously. Future research should explore these dimensions in more detail, possibly leading to a balanced hybrid approach that combines the strengths of both AI and human expertise in educational feedback mechanisms. The endeavor to incorporate AI in higher education is not a question of replacement but of augmentation. How we navigate this balance 27 will determine the efficacy of such technological solutions in truly enriching the educational landscape. 28 List of abbreviations AI = Artificial Intelligence AIEd = Artificial Intelligence in Education GenAI = Generative Artificial Intelligence GPT = Generative Pretrained Transformer HE = Higher Education ITS = Individual Tutoring System LLM = Large Language Model Acknowledgements We would like to thank the pre-service teachers and experts in our study, as well as the coders, for their efforts. Thank you for participating in the study. Funding This study received no external funding. Availability of data and materials All data generated or analyzed during this study are either included in this published article or can be made available by the authors upon request. Consent for publication Not applicable. Author Affiliations Lucas Jasper Jacobsen: Leuphana Universität Lüneburg, Universitätsallee 1, 21335 Lüneburg, [email protected] Kira Elena Weber: Universität Hamburg, Von-Melle-Park 8, 20146 Hamburg, [email protected] 29 Authors’ contributions LJ supervised the work of student staff, particularly training coders to code with the feedback quality manual. 
LJ developed the theory-driven manual for assessing prompt quality, contacted experts, collected their feedback, and contributed substantially to the writing of this article. KW performed the calculations using IBM SPSS Statistics, developed the heuristic working model, contacted novices and experts, collected their feedback, and contributed substantially to the writing of this article. The authors declare that each author has made a substantial contribution to this article, has approved the submitted version of this article and has agreed to be personally accountable for the author’s own contributions. Competing interests The authors declare that they have no competing interests. Ethics approval and consent to participate In Germany, the criteria set forth by the German Research Foundation (DFG) stipulate that a study must obtain ethical clearance if it subjects participants to significant emotional or physical stress, doesn't fully disclose the study's purpose, involves patients, or includes procedures like functional magnetic resonance imaging or transcranial magnetic stimulation. Our research did not meet any of these conditions, so it was not necessary for us to seek ethical approval. The pre-service teachers as well as the experts provided the feedback voluntarily. Moreover, all participants were informed about the study’s purpose and confidentiality as well as data protection information. 30 References Abukmeil, M., Ferrari, S., Genovese, A., Piuri, V., & Scotti, F. (2021). A Survey of Unsupervised Generative Models for Exploratory Data Analysis and Representation Learning. ACM Computing Surveys, 54(5), 1–40. https://doi.org/10.1145/3450963 Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. 2023 Feb 19;15(2):e35179. doi: 10.7759/cureus.35179. Azaria, A., Azoulay, R., & Reches, S. (2023). ChatGPT is a Remarkable Tool--For Experts. https://doi.org/10.48550/arXiv.2306.03102 Baidoo-Anu, D., & Ansah, L. O. (2023). Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning. https://www.researchgate.net/publication/369385210 Behnke, K. (2016). Umgang mit Feedback im Kontext Schule. Springer Fachmedien Wiesbaden. Bernius, J. P., Krusche, S. & Bruegge, B. (2022): Machine learning based feedback on textual student answers in large courses. Computers and Education: Artificial Intelligence 3 100081. https://doi.org/10.1016/j.caeai.2022.100081 ChatGPT, & Ekin, S. (2023). Prompt Engineering for ChatGPT: A quick quide to techniques, tips and best practice. doi.org/10.36227/techrxiv.22683919 Crompton, H., & Burke, D. (2023). Artificial intelligence in higher education: the state of the field. International Journal of Educational Technology in Higher Education, 20(1). https://doi.org/10.1186/s41239-023-00392-8 Cotton, D. R. E., Cotton, P. A. & Shipway J. R. (2023). Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innovations in Education and Teaching International. https://doi.org/10.1080/14703297.2023.2190148 Demszky, D., Liu, J., Hill, H. C., Jurafsky, D., & Piech, C. (2023). Can Automated Feedback Improve Teachers’ Uptake of Student Ideas? Evidence From a Randomized Controlled Trial in a Large-Scale Online Course. Educational Evaluation and Policy Analysis, 016237372311692. https://doi.org/10.3102/01623737231169270 Ericsson, K. A., Krampe, R. T., & Tesch-Römer, C. (1993). 
The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100(3), 363–406. 31 Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613–619. https://doi.org/10.1177/001316447303300309 Funk, C. M. (2016). Kollegiales Feedback aus der Perspektive von Lehrpersonen. Springer Fachmedien Wiesbaden. Gielen, M., & De Wever, B. (2015). Structuring peer assessment: Comparing the impact of the degree of structure on peer feedback content. Computers in Human Behavior, 52, 315–325. Hammerness, K. M., Darling-Hammond, L., Bransford, J., Berliner, D. C., Cochran-Smith, M., McDonald, M., & Zeichner, K. M. (2005). How teachers learn and develop. In Darling-Hammond, L., Bransford, J., LePage, P., Hammerness, K., and Duffy, H. (Eds.), Preparing Teachers for a Changing World: What teachers should learn and be able to do (pp. 358-389). San Francisco: Jossey-Bass. Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81– 112. Henderson, M., Ajjawi, R., Boud, D., & Molloy, E. (Eds.). (2019). The Impact of Feedback in Higher Education: Improving Assessment Outcomes for Learners. Springer International Publishing. https://doi.org/10.1007/978-3-030-25112-3 Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, J., Dai, W., Madotto, A. & Fung, P. (2022): Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55 (12), 1–38. https://doi.org/10.48550/arXiv.2202.03629 Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, (...), Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, https://doi.org/10.1016/j.lindif.2023.102274 Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284. Kraft, M. A., Blazar, D., & Hogan, D. (2018). The effect of teacher coaching on instruction and achievement: A meta-analysis of the causal evidence. Review of Educational Research, 88(4), 547–588. 32 Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., (...), Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education, Learning and Individual Differences, 103, https://doi.org/10.1016/j.lindif.2023.102274 Kipp, Michael (2023). Wie sag ich’s meiner KI? Hintergründe und Prinzipien zum #Prompting bei #ChatGPT, https://www.youtube.com/watch?v=cfl7q1llkso&t=2382s. Accessed 18 May 2023. Krause, G. (2019). Training zur Förderung von Kompetenzen für die Arbeit mit Videofeedback. In: Uhde, G. & Thies, B. (Eds). Kompetenzentwicklung im Lehramtsstudium durch professionelles Training (pp.83–108). https://doi.org/10.24355/dbbs.084-201901231126-0 Lo, L. S. (2023). The CLEAR path: A framework for enhancing information literacy through prompt engineering, The Journal of Academic Librarianship, 49(4). 
https://doi.org/10.1016/j.acalib.2023.102720. Lu, H.-L. (2010). Research on peer-coaching in preservice teacher education – a review of literature. Teaching and Teacher Education, 26(4), 748–753. Luckin, R., Holmes, W., Griffiths, M. & Forcier, L. B. (2016). Intelligence Unleashed. An argument for AI in Education. London: Pearson. Narciss, S. (2008). Feedback strategies for interactive learning tasks. In J. M. Spector, M. D. Merrill, J. J. G. van Merrienboer, & M. P. Driscoll (Eds.), Handbook of research on educational communications and technology (3rd ed., pp. 125e144). Mahaw, NJ: Lawrence Erlbaum Associates. Narciss, S. (2013). Designing and evaluating tutoring feedback strategies for digital learning environments on the basis of the interactive feedback model. Digital Education Review, 23. Pekrun, R., Marsh, H. W., Elliot, A. J., Stockinger, K., Perry, R. P., Vogl, E., Goetz, T., van Tilburg, W. A. P., Lüdtke, O., & Vispoel, W. P. (2023). A three-dimensional taxonomy of achievement emotions. Journal of Personality and Social Psychology, 124(1), 145–178. https://doi.org/10.1037/pspp0000448 Prilop, C. N., Weber, K., & Kleinknecht, M. (2019). Entwicklung eines video- und textbasierten Instruments zur Messung kollegialer Feedbackkompetenz von Lehrkräften [Development of a video- and text-based instrument for the assessment of teachers' peer feedback competence]. In T. Ehmke, P. Kuhl, & M. Pietsch (Eds.), Lehrer. Bildung. Gestalten: Beiträge zur empirischen Forschung in der Lehrerbildung (pp. 153-163). Weinheim Basel: Beltz Juventa Verlag. 33 Prilop, C. N., Weber, K. E., & Kleinknecht, M. (2020). Effects of digital video-based feedback environments on pre-service teachers’ feedback competence. Computers in Human Behavior, 102, 120– 131. https://doi.org/10.1016/j.chb.2019.08.011 Prins, F., Sluijsmans, D., & Kirschner, P. A. (2006). Feedback for general practitioners in training: Quality, styles and preferences. Advances in Health Sciences Education, 11, 289–303. Russel, S., & Norvig, P. (2010). Artificial intelligence - a modern approach. New Jersey: Pearson Education. Sailer, M., Bauer, E., Hofmann, R., Kiesewetter, J., Glas, Julia., Gurevych, I. & Fischer, F. (2023): Adaptive feedback from artificial neural networks facilitates pre-service teachers’ diagnostic reasoning in simulation-based learning. Learning and Instruction 83. https://doi.org/10.1016/j.learninstruc.2022.101620 Stojanov, A. (2023). Learning with ChatGPT 3.5 as a more knowledgeable other: an autoethnographic study. International Journal of Educational Technology in Higher Education, 20(1). https://doi.org/10.1186/s41239-023-00404-7 Salzmann, P. (2015). Lernen durch kollegiales Feedback: die Sicht von Lehrpersonen und Schulleitungen in der Berufsbildung. Waxmann Verlag. Strahm, P. (2008). Qualität durch systematisches Feedback. Grundlagen, Einblicke, Werkzeuge. Bern: Schulverlag. Strijbos, J.W., Narciss, S., & Dünnebier, K. (2010). Peer feedback content and sender’s competence level in academic writing revision tasks: Are they critical for feedback perceptions and efficiency? Learning and Instruction, 20(4), 291-303. Tsai, Y.-S., Rates, D., Moreno-Marcos, P. M., Muñoz-Merino, P. J., Jivet, I., Scheffel, M., … Gašević, D. (2020). Learning analytics in European higher education—Trends and barriers. Computers & Education, 155. doi:10.1016/j.compedu.2020.103933 Weber, K. E., Gold, B., Prilop, C. N. & Kleinknecht, M. (2018a). 
Promoting pre-service teachers’ professional vision of classroom management during practical school training: Effects of a structured online- and video-based self-reflection and feedback intervention. Teaching and Teacher Education, 76, 39-49. https://doi.org/10.1016/j.tate.2018.08.008 Weber, K. E., Prilop, C. N., Glimm, K. & Kleinknecht, M. (2018b). Video-, Text- oder Live-Coaching? Konzeption und Erprobung neuer Formate der Praktikumsbegleitung. Herausforderung 34 Lehrer_innenbildung - Zeitschrift zur Konzeption, Gestaltung und Diskussion, 1(0), 90-119. https://doi.org/10.4119/hlz-2384 Wittwer, J., Kratschmayr, L., & Voss, T. (2020). Wie gut erkennen Lehrkräfte typische Fehler in der Formulierung von Lernzielen?. Unterrichtswissenschaft, 48(1), 113-128. https://doi.org/10.1007/s42010-019-00056-5 Wu, Y., & Schunn, C. D. (2021). From plans to actions: A process model for why feedback features influence feedback implementation. Instructional Science, 49(3), 365-394. Zawacki-Richter, O., Marín, V. I., Bond, M. & Gouveneur, F. (2019): Systematic review of research on artificial intelligence applications in higher education – where are the educators? Int J Educ Technol High Educ 16 (1). https://doi.org/10.1186/s41239-019-0171-0 Zhu, M., Liu, O, L., Lee, H.-S. (2020). The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing, Computers & Education, 143, https://doi.org/10.1016/j.compedu.2019.103668 Zottmann, J. M., Stegmann, K., Strijbos, J.-W., Vogel, F., Wecker, C., & Fischer, F. (2013). Computer- supported collaborative learning with digital video cases in teacher education: The impact of teaching experience on knowledge convergence. Computers in Human Behavior (5), 2100–2108. 35