Education Hazards of Generative AI
Document Details
Benjamin Riley, Paul Bruno
Summary
This document provides a basic scientific overview of how large-language models (LLMs) work and connects this knowledge to practical implications for educators, aiming to highlight potential hazards in educational use. It focuses on areas where misunderstandings of how LLMs function may lead to ineffective or even harmful educational practices.
Full Transcript
EDUCATION HAZARDS of GENERATIVE AI

www.CognitiveResonance.net

ABOUT THIS DOCUMENT

The Education Hazards of Generative AI provides a basic scientific overview of how large-language models (LLMs) work and connects this knowledge to practical implications for educators. This document is intended as a resource for teachers, principals, school district administrators, parents, students, policymakers, and anyone else thinking about using generative AI for educational purposes.

The widespread commercial deployment of LLMs, also referred to as chatbots in this document, has generated a tremendous amount of excitement, including in education. Already, teachers and administrators report using chatbots with increasing frequency. There is no shortage of hype about how LLMs will "revolutionize" education. But although there are promising use cases for LLMs in education, there are also potential education hazards involved with using them. Chatbots are tools and, as with any tool, the failure to understand how they work may result in using them for purposes they are not well-suited for. This document highlights areas of concern where misconceptions about how LLMs function may lead to ineffective or even harmful educational practices.

The Education Hazards of Generative AI is intended as an introductory overview and is far from comprehensive. As a technology product, LLMs are continually being updated by the companies that deploy them, and our scientific understanding of how they function continues to evolve. That notwithstanding, and because educators are the professionals who bear the ultimate responsibility for instruction and student learning, we hope that this document is helpful in making decisions about whether or how to use generative AI in education today.

This document was co-authored by Benjamin Riley (Cognitive Resonance) and Paul Bruno (University of Illinois Urbana-Champaign). We are grateful to Amber Willis, Blake Harvard, Dan Willingham, Dylan Kane, Efrat Furst, Geoff Vaughan, Jane Rosenzweig, Jasmine Lane, Michael Pershan, Peter Greene, Sarah Oberle, Sean Trott, and Tom Mullaney for providing feedback on pre-publication drafts, along with other anonymous reviewers.

Citation: Riley, B., & Bruno, P. (2024). Education hazards of generative AI. Cognitive Resonance. https://www.cognitiveresonance.net/

What are large-language models designed to do? Predict text.

LLMs are statistical models that take text as input and then generate text as output.1 They are designed to address the following scenario: "Here's a fragment of text. Tell me how this fragment might go on. According to your model of the statistics of human language, what words are likely to come next?"2

To make their predictions, LLMs treat text as a series of tokens, which can be thought of as the LLM's vocabulary.3 During training, an LLM develops the capacity to predict what text should follow a given prompt based on the frequency and relationship of the tokens included in the data it's been given. This is done through the use of an artificial neural network, which can be thought of as a series of mathematical functions that adjust the statistical weights between tokens.4 LLMs store these statistical relationships but not their training data – they are not search engines.5

LLMS ARE NOT SEARCH ENGINES
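To make this prediction machinery concrete, here is a minimal sketch in Python of next-token prediction. The tiny corpus and the bigram-counting approach are our own illustration: real LLMs learn statistics over subword tokens using a neural network with billions of weights, but the underlying idea is the same, namely storing statistics about which tokens tend to follow which, not storing the text itself.

```python
from collections import Counter, defaultdict

# A tiny invented "training corpus." Real LLMs train on trillions of tokens.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each token follows each other token (a bigram model).
following = defaultdict(Counter)
for current_token, next_token in zip(corpus, corpus[1:]):
    following[current_token][next_token] += 1

def predict_next(token):
    """Return the continuation seen most often during training."""
    candidates = following[token]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # 'cat' -- the most frequent continuation
print(predict_next("sat"))  # 'on'

# Note: the model stores only counts (statistics), not the corpus itself,
# and nothing checks whether a continuation is true or appropriate.
```

Sampling from these statistics, rather than always taking the single most frequent continuation, is one reason the same prompt can produce different responses on different occasions.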
Interactions with LLMs feel conversational, and it is natural to impute human qualities to LLMs – e.g., that an LLM "knows" or "understands" something. But this can make LLMs seem more authoritative than they really are.6 LLMs do not determine what response would be best suited to your particular needs, and they do not necessarily produce responses that are true. It's better to think of LLMs as role-playing entities that imitate intelligence.7 By design, they are intended to be helpful, and they try to do this by offering plausible responses to prompts they have been given. But their responses are often wrong.

1. Embers of autoregression: Understanding large language models through the problem they are trained to solve
2. Talking about large language models
3. Tokenization in large language models, explained
4. Large language models, explained with a minimum of math and jargon
5. Training is not the same as chatting: ChatGPT and other LLMs don't remember everything you say
6. Muddles about models
7. Role-play with large language models; Imitation intelligence

EDUCATION HAZARDS POSED BY AI'S DESIGN

Lesson planning

LLMs may not correctly predict what sequence of lessons would effectively build the knowledge of students.

Generating instructional materials and assessments

LLMs may propose content that is based on common misconceptions about how students learn if those misconceptions are prevalent in the data used to train the model – for example, that students have different learning styles, or are left-brained/right-brained.8

Grading and feedback on student work

When providing feedback on essays, LLMs may not focus on the aspects of student work that are most important from a teacher's point of view. For example, LLMs may provide feedback on essays that primarily focuses on grammar or overall essay structure, even if the teacher is primarily concerned with the underlying ideas and concepts expressed by students. LLMs may also add up points or calculate percentages incorrectly if given a rubric.

Tutoring

LLMs may provide answers to students that are factually incorrect. The responses they provide may also vary depending on how they have been prompted. For example, because LLMs do not perform computations like calculators do, but instead predict likely responses to prompts, they may generate computational errors that a calculator would never make. This could result in students being misled or confused by their interactions with an LLM-based tutor.

For administrators

LLMs may not produce content that is aligned with administrators' strategic objectives for their own schools and staff. For example, LLM-generated job descriptions or classroom observation notes may not emphasize the skills or attributes that administrators want in their staff. Such LLM-generated content also may not align with applicable laws, regulations, or collective bargaining agreements, which could result in legal liability.

8. The science of learning

Do large-language models learn the way that humans do? No.

The most powerful LLMs are trained on a huge amount of data. But LLMs are not continuously learning from their interactions with humans, and they do not distinguish useful data from misleading data during their training.

To produce their output, LLMs are trained on data that has been produced by humans, mostly from digital sources found on the Internet, including Wikipedia, academic articles, news stories, books, and computer code.9 Although the data used to train LLMs is vast, it does not encompass knowledge that is not encoded digitally.
Further, the models available in the United States are primarily trained in English, and on data that academics label "WEIRD": Western, Educated, Industrialized, Rich, and Democratic. LLMs are thus exposed to a biased sample of cultural practices and values.10 The commercial companies that make the largest and most well-known LLMs no longer disclose the precise data that they use to train their models. There are pending lawsuits against some of these companies for violating copyright and other laws protecting intellectual property.11

LLMS ARE EXPOSED TO A BIASED SAMPLE OF CULTURAL PRACTICES AND VALUES

After being trained on large data sets, LLMs are fine-tuned by the commercial companies that produce them. This process can include testing the model with human users and asking them to provide feedback on the model's output, and then using this feedback to adjust the statistical weights the model uses to make text predictions.12 We do not know precisely how individual companies fine-tune their LLMs.
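The feedback-adjusts-weights idea can be sketched in miniature. The Python toy below is our own invented simplification, not any company's actual procedure: it scores responses with just two hand-picked features and nudges those weights with one gradient step on a pairwise preference loss whenever a human rater prefers one response over another.

```python
import math

# Toy "model" that scores a response using two hand-picked features:
# weights[0] rewards length, weights[1] rewards hedging words.
# Real fine-tuning adjusts billions of neural-network weights, not two.
weights = [0.0, 0.0]

def features(response):
    hedges = sum(word in ("may", "might", "perhaps") for word in response.split())
    return [len(response.split()), hedges]

def score(response):
    return sum(w * f for w, f in zip(weights, features(response)))

def update_from_feedback(preferred, rejected, lr=0.1):
    """Nudge the weights so the human-preferred response scores higher:
    one gradient step on a pairwise (Bradley-Terry style) preference loss."""
    margin = score(preferred) - score(rejected)
    pressure = 1 / (1 + math.exp(margin))  # large when model and rater disagree
    for i, (fp, fr) in enumerate(zip(features(preferred), features(rejected))):
        weights[i] += lr * pressure * (fp - fr)

# One round of human feedback: a rater preferred the more cautious answer.
update_from_feedback(
    preferred="The evidence suggests this may be true",
    rejected="This is definitely true",
)
print(weights)  # both weights move toward the preferred response's features
```

Even in this toy, notice that the model ends up rewarding whatever surface features the raters happened to prefer; at scale, this is one route by which raters' preferences and biases shape a model's output.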
Numerous scholars have called attention to existing and potential biases in LLMs that result from how they are trained.13

9. A philosophical introduction to language models – part I: Continuity with classic debates
10. GPT-4 is WEIRD – What should we do about it?; Which humans?
11. Copyright safety for generative AI
12. A philosophical introduction to language models
13. On hate scaling laws for data-swamps; On the dangers of stochastic parrots: Can language models be too big?

EDUCATION HAZARDS POSED BY AI'S TRAINING

Lesson planning

Given that many instructional materials found online are low-quality, LLMs may not have been trained on high-quality lesson plans or on lesson plans aligned with specific content standards.14

Generating instructional materials and assessments

The materials LLMs generate may not align to the needs of culturally or linguistically diverse students if the LLM training data does not include text from these students' communities. Data on the Internet is not representative of knowledge globally. LLM-based materials may also include inaccurate or false information that's prevalent in the data used to train the LLM. This could include misinformation about content (e.g., the biological processes of evolution) or misconceptions about how students learn (e.g., that lessons need to be differentiated for "visual learners").15

Grading essays and providing feedback

LLMs may not recognize student creativity if the student's work does not align to the data they have been trained on.

Tutoring

In general, LLMs do not learn from their interactions with students. The LLM's capabilities are almost entirely derived from its training data. This means that LLMs may not adapt to the specific needs of the students they are tutoring.

For administrators and policymakers

Administrators should emphasize to teachers and others using LLM-created materials that educators are responsible for the validity and usefulness of the materials they choose to use, including in personnel evaluation contexts. Likewise, if school administrators encourage or mandate the use of LLM-based tools by teachers and other educators, administrators should be held responsible for the validity and usefulness of those resources.

14. The supplemental curriculum bazaar: Is what's online any good?
15. Ask the cognitive scientist: Does tailoring instruction to "learning styles" help students learn?

Can large-language models reason? Not like humans.

With careful prompting, LLMs can produce text that appears to involve higher-order thinking skills, but it's better to see LLMs as skilled at pattern matching. Be wary of any definitive statement suggesting that LLMs are currently capable of reasoning the way that humans do – the evidence does not support such strong claims right now.

An important discovery with LLMs is that the quality of their output improves when they are prompted in specific ways. One method is called chain-of-thought prompting, where LLMs are given an example of a correctly solved problem first and then asked to solve a similar problem.16 This is similar to using worked examples to help students understand a concept before asking them to solve a related problem.17 Another method is simply to ask LLMs to "think step by step," which can also improve performance.18
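To make these two prompting methods concrete, the sketch below assembles both kinds of prompts as plain strings. The arithmetic problems are our own invented examples; the resulting text would be pasted into or sent to whatever chatbot is being used.

```python
# Chain-of-thought prompting: show one correctly worked example first
# (much like a worked example in teaching), then pose the new problem.
worked_example = (
    "Q: A class has 4 tables with 6 students at each table. How many students?\n"
    "A: There are 4 tables, and each table seats 6 students.\n"
    "   4 x 6 = 24. The answer is 24.\n"
)
new_problem = "Q: A library has 7 shelves with 9 books on each shelf. How many books?\n"

chain_of_thought_prompt = worked_example + "\n" + new_problem + "A:"

# Zero-shot variant: no worked example, just an instruction to reason
# in steps, which has also been found to improve output quality.
step_by_step_prompt = new_problem + "A: Let's think step by step."

print(chain_of_thought_prompt)
print(step_by_step_prompt)
```

In both cases the prompt biases the model toward producing step-like text that resembles its training examples of solved problems; as the next paragraph explains, this improves output without establishing that the model reasons as humans do.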
Because LLMs can produce text related to complex ideas after careful prompting, some claim that they are already as intelligent as humans.19 But across a wide range of novel tasks that fall outside the sort of data that LLMs are trained upon, LLMs perform significantly worse than humans.20 At present, the weight of the evidence suggests LLMs produce their output predominantly by pattern matching inputted text to data they have been trained upon, rather than reasoning in human fashion.21

LLMS PRODUCE THEIR OUTPUT BY PATTERN MATCHING, NOT REASONING

16. Chain-of-thought prompting elicits reasoning in large language models
17. How to make worked examples work
18. Large language models are zero-shot reasoners
19. Why the godfather of A.I. fears what he's built
20. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting; Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks; Embers of autoregression; The reversal curse: LLMs trained on "A is B" fail to learn "B is A"
21. Can large language models reason?; Language models don't always say what they think

EDUCATION HAZARDS POSED BY AI'S LACK OF REASONING

Generating lesson plans, instructional materials, and assessments

Educators should provide LLMs with examples of high-quality content when prompting them to produce new material. The more complex the topic, the more risk there is that LLMs will produce plausible but factually incorrect materials.

Grading essays and providing feedback

Educators who choose to use LLMs to give feedback to students should provide LLMs with a range of examples of student work aligned to scoring rubrics, as well as a range of human feedback. Results should be carefully monitored by educators.

Tutoring

Students should understand that providing more context on their thinking when prompting an LLM may help improve the quality of its response. Teachers should review LLM output related to complex concepts. Rather than assuming LLMs can reason about a prompt, the user should model the type of reasoning they are looking for. LLMs may produce explanations of complex ideas that sound plausible to students but are not logical or coherent.

For administrators and policymakers

Professional development for educators on how to use LLMs should not focus exclusively or even primarily on "prompt engineering." Instead, to the extent time and resources are dedicated to LLM usage at all, they may be better spent by first building general knowledge of how LLMs function, and then rigorously evaluating the impact they have on student learning or other important outcomes.

Does AI make the content we teach in schools obsolete? No.

A bedrock principle of cognitive science is that humans understand new ideas based on ideas we already know, i.e., knowledge that is stored in our long-term memory.22 Knowledge cannot be outsourced to AI, and students who have not built a broad base of knowledge will not be able to make the best use of this new technology. Educators should continue to focus on building student knowledge across all subjects.

Not long ago, it was common to hear some educators ask, "why teach it if students can Google it?" With the commercial deployment of chatbots, some educators ask, "why teach it if students can have generative AI do it?" Cognitive science provides an unequivocal answer to these questions: Students need to develop a broad base of knowledge – in their heads – to learn new ideas and navigate the world they experience.23 The fact that chatbots can generate essays, summarize ideas, and create other things is an impressive technological achievement, but it does not affect how our minds work.

What's more, effective use of LLMs requires the user to possess existing background knowledge and expertise.24 Applying this knowledge when interacting with an LLM can lead to productive, co-created output, but there are no shortcuts – if students lack this knowledge, their ability to make use of this technology will be severely limited.25

EFFECTIVE USE OF LLMS REQUIRES HUMAN KNOWLEDGE AND EXPERTISE

22. The science of learning
23. How knowledge helps: It speeds and strengthens reading comprehension, learning—and thinking
24. Transcript: Ezra Klein interviews Ethan Mollick
25. Learning that doesn't stick

EDUCATION HAZARDS POSED BY AI SUPPLANTING KNOWLEDGE AND SKILLS

Reading

LLMs can quickly provide summaries of complex texts, and both educators and students will be tempted to use them for this purpose. But reading complex ideas and thinking about their implications is vital to building knowledge that can be applied in other contexts. Educators should avoid using LLM-created materials that create less-challenging texts for certain students – differentiating texts this way is inequitable and may cause long-term educational harm.

Writing

Educators should be particularly careful regarding the use of AI for writing tasks. In many cases, the purpose of a writing assignment is to make students think effortfully and develop their ideas through the writing process. When students rely on chatbots to assist their writing, they can miss opportunities to learn how to think critically, to assess ideas, and to consider alternate viewpoints.26 But educators should not use AI detection tools to prevent students from using chatbots to write their essays – such tools are unreliable, and often are discriminatory against non-native English speakers.27

Science, history, and other content subjects

LLMs are known to "hallucinate," that is, they can generate text that sounds plausible but does not accord with the truth.28 For subjects where there is a great deal of factually accurate text available on the Internet, chatbots may produce output that we would recognize as true. But if information on a subject is sparse, or if there is misinformation on the topic online, LLMs may produce output that isn't true. Educators cannot rely on LLMs to be factually accurate, and thus will need to fact-check any materials they create.

Mathematics

Unlike computers or calculators, LLMs generally do not solve math problems by applying formal rules of mathematics. Instead, they treat math-related prompts as text, and then predict what text to produce as output. This can lead to outputs that "sound right" but sometimes include mathematical or logical errors that students may miss. Educators will need to monitor LLM output carefully if used for math instruction.

Computer science

The implications of LLMs for computer science education are highly uncertain. The commercial deployment of LLMs does not necessarily imply that students need to focus on coding to effectively participate in society and the economy. For instance, AI tools may be able to do some of the work currently performed by human software engineers, reducing demand for coding skills in the labor market. At the same time, students may benefit from learning about the computer science underlying LLMs – such as the information contained in this document – so that they can use LLM-related tools more effectively.

For school administrators and policymakers

Education leaders should avoid oversimplifications such as "if AI can do this, we don't need to teach it anymore" or "AI will be a part of jobs of the future so we should let students use AI today." Likewise, leaders should refrain from funding professional development for teachers that suggests AI, or any other technology, supplants the need for educators to build student knowledge across the traditional range of subjects in school. Think carefully about the value of a skill before eliminating it from standards or curriculum.

26. When the friction is the point
27. GPT detectors are biased against non-native English writers
28. Hallucinations, errors, and dreams; What teachers should know about the ELIZA effect

Will large-language models become smarter than humans? No one knows.

The claim that LLMs are on the verge of matching human cognitive ability in the near term, or that they will soon be "smarter" or more effective than human educators, is not well-supported by existing scientific evidence. Educators should engage with LLMs as they currently function and be skeptical of speculative claims about the future capabilities of this technology.

Some of the commercial companies that produce LLMs say they are working to create "artificial general intelligence," AI systems that are as intelligent as (or more intelligent than) humans.29 Whether this is possible is the subject of heated debate in the AI research community. On the one hand, the rate of performance improvement of LLMs might be slowing down, and most models seem to be converging on the same level of capabilities.30 There is growing evidence that large-language models are limited to producing output related to their training data and cannot reliably "generalize" to novel situations they haven't been trained on, and some scholars suggest they never will be able to do so.31 On the other hand, some argue that adding new data and more computing power to LLMs will continue to drive exponential improvement in their capabilities. And others believe that as future generative AI models incorporate an increasingly wide range of input beyond just human language, human-like intelligence will result.32 But no one can predict what the future holds. Be skeptical of those who claim otherwise.
EDUCATION HAZARDS POSED BY SPECULATION ABOUT AI'S FUTURE

For teachers and students

Educators and students should not assume LLMs will continually improve – the better approach is to engage with this technology as it exists now to explore its strengths and limitations. Engaging with AI does not necessarily mean integrating it into instructional practice.

For school administrators and policymakers

Leaders should not invest time and resources to incorporate AI in schools based on assumptions about what the future will bring. Nor should they drastically alter curricula to prepare students for an "AI world." We simply do not know what such a world will look like or what it will require of future citizens.

29. Planning for AGI and beyond
30. AI scaling myths; Two years later, deep learning is still faced with the same fundamental challenges
31. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks; The generative AI paradox: 'What it can create, it may not understand'; LLMs have failed every single benchmark and experiment focused on generalization, since their inception; AI and the Everything in the Whole Wide World Benchmark
32. Will scaling work?