Scoping Review of Reinforcement Learning in Education
Document Details
Simon Fraser University
Bahar Memarian, Tenzin Doleck
Summary
This paper provides a scoping review of reinforcement learning (RL) in education. It examines the role and characteristics of RL, a sub-branch of machine learning, within an educational context. The paper explores the potential impact of RL on teaching and learning, and the associated pedagogical paradigms and biases.
Full Transcript
Computers and Education Open 6 (2024) 100175
Contents lists available at ScienceDirect
Computers and Education Open
Journal homepage: www.sciencedirect.com/journal/computers-and-education-open

A scoping review of reinforcement learning in education
Bahar Memarian *, Tenzin Doleck
Faculty of Education, Simon Fraser University, Vancouver, British Columbia, Canada

Keywords: Reinforcement learning; Education; Artificial intelligence; Machine learning

Abstract: The use of Artificial Intelligence (AI) and Machine Learning algorithms is surging in education. One of these methods, called Reinforcement Learning (RL), may be considered more general and less rigid by changing its learning through interactions with the environment and specifically the inputs received as rewards and punishments. Given that education has shifted towards a constructivist approach and uses technology such as algorithms in its making (e.g., instructional design, delivery, assessment, and feedback), we are interested in taking stock of the effect RL may play in today's teaching and learning. We conduct a scoping review of the literature on RL in education. This work aims to open discussions on the pedagogical paradigm of RL and various types of bias introduced in teaching and learning.

1. Introduction to reinforcement learning (RL)

Through a scoping review and synthesis of the literature, this paper aims to examine the role and characteristics of Reinforcement Learning, or RL, a sub-branch of machine learning techniques in education. The RL method allows an agent (who could be artificially intelligent) to learn in an interactive environment through the set of positive rewards or punishments (negative rewards) received as feedback from the environment. We contend that education today is reformed to adopt a socially constructive paradigm, enabling learners to construct knowledge through interaction based on their prior experiences. RL, however, is inherently rooted in a behaviorist paradigm, seeking punishments and rewards to shape learning. This section provides background information on RL and works done to date. The next section shares our review of the literature followed by our synthesis of findings and considerations for future work.

1.1. Background

Reinforcement learning problems may seem second nature to humans. It may be the first way we visualize learning. This could be due to evolutionary reasons. When infants, we get a sense of what actions we may or may not perform through the environment. An infant playing with a pet for the first time, unaware of the nature of the pet, may poke and view the pet strangely. Upon enough interactions with the pet, the infant learns what actions they may or may not perform on the pet. Trial and error learning conditioned through rewards and punishments is an underlying, and perhaps rudimentary, approach to learning. It is often the case that the family and society establish rules and norms for appropriate living. Therefore, as humans, we practice the notion of RL in our upbringing.

While the notion of RL is familiar to humans, it may not be readily understood in an academic context. RL in research concerns more specifically a set of machine learning algorithms that come to make an agent (who may be a human or artificially intelligent agent) interact with and learn from an environment through a set of positive or negative rewards. The field of education has advanced to incorporate machine learning and artificial intelligence algorithms with the hopes of enhancing the teaching and learning experience. Some authors denote that RL can bring a significant shift in the way the students approach and engage with learning [5,6]. A review of the literature by Fahad Mon et al. and Singla et al. presents the progress education may make in light of RL. The reviews provide that there is a clear application of RL in education and present areas of use. Yet, the reviews also offer several challenges that need to be understood.
Most pertinently, there is a lack of perspectives on the educational, socioemotional, and ethical considerations of RL use in education. These gaps motivate our work to examine the current state and challenges of RL in education, followed by our synthesis and considerations/recommendations for the future. The contribution of this work is in examining the societal and ethical considerations when using RL for different educational purposes.

* Corresponding author. E-mail address: [email protected] (B. Memarian). https://doi.org/10.1016/j.caeo.2024.100175. Received 10 November 2023; received in revised form 11 March 2024; accepted 27 March 2024; available online 28 March 2024. 2666-5573/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1.2. Overview and characteristics of RL

RL may be better understood through a preview of machine learning. Two key types of machine learning are supervised and unsupervised learning [7,8]. Supervised learning, as the name suggests, is directed at how to learn from the environment primarily through labeled data sets. Classification and regression are two common methods used in supervised learning. Unsupervised learning, on the other hand, lacks instruction and labels on data sets. As such, the unsupervised learning algorithm discovers how grouping needs to be done and predicts where each data set lies. Clustering is a common method used in unsupervised learning. There is also another type of algorithm known as semi-supervised learning, with the most prominent example being RL. RL is a machine learning algorithm that enables an agent to learn by using feedback from its actions and experiences in the form of rewards and punishments provided in an interactive environment.

RL is more general than supervised and unsupervised learning. It aims to learn through interaction with the environment and to achieve a goal. As such, the behavior of RL is dependent on the feedback it receives from the environment and whether that feedback is a reward or punishment to the algorithm [9–12]. Based on the valence of the feedback, the algorithm adjusts its behaviors accordingly to maximize the reward feedback from the environment and minimize the punishment feedback. The algorithm may be coded further to wait for a larger reward at time t + 1 rather than an instant reward at time t. This is to increase the chances of a higher cumulative reward throughout the learning period.

We can further view semi-supervised learning through a systems lens. In a block diagram view, we consider the system's inputs and outputs and aim to gain optimal outputs through the inputs provided to the system. In supervised learning, the system is closed loop and the error feedback serves as the supervision and instructional component. In RL, on the other hand, the system receives, besides its regular inputs, signals that are often numerical and are either in the form of reward or punishment. These signals may be noisy. Given that the environment is also constantly changing, it is upon the RL algorithm to appropriately revise and adjust so the outputs become more optimal, or in other words, maximize rewards and minimize punishments.

The classification and evaluation of RL algorithms are often performed at a technical level, dependent on the rigor and complexity of the RL algorithms [14–17]. In a simpler and preliminary sense, however, reinforcement learning comprises a set of features to solve a range of problems (e.g., well-understood, not well-understood, and hard-to-solve problems).

A high-level block diagram of RL is presented in Fig. 1. An agent is placed in an interactive environment. Through the agent's actions, the environment can sense the quality of the agent's actions and provide feedback in the form of a set of rewards that may be positive or negative. The agent can also make observations on the environment after each action. As such, the key components of RL are known to be the following:

- States: The observations on the environment the agent can make upon acting in the environment
- Action: What an agent performs on the environment based on the state observations
- Reward: The feedback the environment provides to the agent to shape their actions. A reward can be either positive (rewards) or negative (punishment).

A basic RL model may thus use the mentioned components in the following way to solve a problem:

- Observe state, S_t
- Decide on an action
- Act, A_t
- Observe the new state, S_t+1
- Observe reward, R_t+1
- Learn from the experience.
- Repeat
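The interaction cycle listed above can be made concrete with a short sketch. The `Environment` and `Agent` classes below, their toy dynamics, and the discount factor are hypothetical placeholders and are not drawn from any of the reviewed studies; the sketch only illustrates the observe-act-reward-learn loop and the accumulation of a (possibly discounted) reward over a learning period.

```python
# Minimal sketch of the RL interaction loop described above.
# `Environment` and `Agent` are hypothetical stand-ins: a tutoring system,
# a serious game, or any other concrete setting would fill them in.

import random


class Environment:
    """Toy environment: the 'state' is a counter the agent tries to raise."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Apply the action, return the new state and a reward signal.
        self.state += action                      # action is -1 or +1
        reward = 1.0 if action == 1 else -1.0     # positive vs. negative reward
        return self.state, reward


class Agent:
    """Toy agent: remembers which action last paid off."""

    def __init__(self):
        self.preferred_action = random.choice([-1, 1])

    def act(self, state):
        return self.preferred_action

    def learn(self, state, action, reward, next_state):
        # Crude update rule: switch preference after a punishment.
        if reward < 0:
            self.preferred_action = -action


env, agent = Environment(), Agent()
state, total_return, gamma = env.state, 0.0, 0.9   # gamma discounts future rewards

for t in range(10):                              # repeat over a learning period
    action = agent.act(state)                    # decide on an action
    next_state, reward = env.step(action)        # act, observe new state and reward
    agent.learn(state, action, reward, next_state)  # learn from the experience
    total_return += (gamma ** t) * reward        # cumulative (discounted) reward
    state = next_state

print("discounted return:", total_return)
```

In this toy loop the agent quickly settles on the rewarded action; the same skeleton underlies the more elaborate algorithms (Q-Learning, DQN, hierarchical RL) surveyed later in the review.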
2. Methods

This section presents our method and scoping review of the literature surrounding RL in education. An overview of our search process can be seen in Fig. 2. To better understand the landscape of RL in education, specifically considering artificial intelligence, we searched the following string in Web of Science (WoS):

Reinforcement Learning in education AND (AI OR Artificial Intelligence)

WoS is a comprehensive database and hence was selected as our search source. After the screening of abstracts and titles followed by full-text screening, a total of 15 studies were included in the review. Since our focus is on reinforcement learning and AI, we included these terms via the AND operator. No time framework was selected.

Our initial inclusion criteria (see Table 1) contained review articles containing AI and reinforcement learning in their title or abstract, regardless of the academic level (i.e., including both K-12 and university settings). Our initial exclusion criteria included publications focused on non-higher education industries, or studies not written in English. Our secondary screening examined the content of the manuscripts and explored if a description surrounding the use of reinforcement learning with AI was provided. Studies that had a one-time and irrelevant mention of AI and reinforcement learning (e.g., the research focus of the author) were subsequently excluded. A total of 363 articles were sought for retrieval.

2.1. Research questions

We aim to examine the following research question:
RQ: How are the characteristics of RL noted in the reviewed studies?

2.2. Data analysis

For the data analysis, we carry out data extraction and open coding.
upon acting in the environment First, we provide a background overview of the reviewed studies, sharing background information such as: article title academic level year and type of publication and countries of affiliation Second, we code and share the summary of studies by exploring the area, context, and type of RL employed. In the discussion, we summarize the trends emerging from the review of RL in education. We further present any reported challenges identi fied by the reviewed studies and synthesize future directions for RL in Fig. 1. High-level block diagram of RL. education based on the findings from the reviewed studies. 2 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 Fig. 2. PRISMA chart. USA (N = 3). Italy, Germany, Australia, Brazil, Spain, Japan, and Egypt Table 1 each had one publication (N = 1). Inclusion and exclusion criteria. Inclusion Exclusion 3. Results Initial Review articles containing AI Publications focused on non- and reinforcement learning in higher education industries, or In this section, we aim to summarize the characteristics of reviewed their title or abstract, regardless studies not written in English. of the academic level (i.e., studies and present a high-level summary of each work. A coded sum including both K-12 and mary of the studies is presented in Table 3. university settings) In terms of area, the most commonly mentioned terms across the Secondary if a description surrounding the Studies that had a one-time and reviewed studies are learning followed by assistant, games, and serious. use of reinforcement learning irrelevant mention of AI and with AI is provided. reinforcement learning (e.g., the Less frequently noted terms include real-world, entertain, AI players, research focus of the author). feedback, embodied, MOOCs, robotic, environment, and so on. In terms of context, the most frequent terms described across the reviewed studies are adaptive, followed by intelligent, tutoring, and 2.3. Summary of background demographics of the reviewed studies system. The less frequently noted terms are learning embedded, online, Petri, VARK, behavior, gamification, automatic, and so on. An overview of the background demographics of the reviewed In terms of types of RL, the most frequent terms described across the studies is presented in Table 2. As shown in Table 2, most of the reviewed studies are hierarchical and classification. The less frequently reviewed studies did not specify an academic level but rather kept it noted terms are tree, temporal, ontology base, random, dynamic, nat open-ended. The most prominent keywords among the reviewed studies ural, learning, mixed-initiative adaptive, forest, and so on. were reinforcement learning, education, game, and hierarchical, signi Yet, as can be seen in Table 3, the 15 reviewed studies each had their fying the role such terms play in current research on reinforcement particular area, context, and RL type of their own. We thus next provide learning and AI in education. The number of publications per year was a detailed and holistic analysis of each of the reviewed studies. In doing one in 2009, 2016, and 2020. In 2021 and 2021, the number of publi so we come to learn about the current topic explored and summarize the cations was two per year. In 2022, the number of publications was 3 per reported challenges and our synthesis of future considerations in the year. In 2023 the number of publications was five per year. Hence, we discussion section. 
see an increased yearly rate of publications from 2020. Conference pa Bellotti et al. focus on the design of serious games that support pers were more frequent (N = 8) followed by journal articles (N = 7). the attainment of necessary knowledge and skills. More specifically, the China had the highest number of publications (N = 5), followed by the authors propose a new design approach for the sandbox serious game 3 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 Table 2 Table 2 (continued ) Demographics of reviewed studies. Refs. Academic Article Keywords Year Type: Country Refs. Academic Article Keywords Year Type: Country level C– –Conf, level C– –Conf, J=Journal J=Journal reinforcement Focus on Genetic algorithms; 2009 J Italy learning taxonomy student, level reinforcement Focus on Dynamic learner 2023 J Egypt not disclosed learning (RL); student, level model; serious games; not disclosed Gamification; technology- Learning style; enhanced Primary school education; user mathematics; modeling Reinforcement Focus on the Dynamic Difficulty 2022 C Australia learning game player, Adjustment (DDA); Focus on Not disclosed 2018 C China the level not DDA deep mathematic disclosed reinforcement problem learning; solving, a Mathematical DDA; level not DDA in education; disclosed Gamification in Focus on the Reinforcement 2020 C Japan education game player, learning; artificial Focus on AI; Human-machine 2021 J Spain the level not intelligence; auditoriums Interaction; disclosed machine learning; with 200 Learning; game people, a level Multimedia; Virtual Focus on an Pediatrics; Robots; 2023 J USA not disclosed Assistance intelligent Natural languages; Focus on an Not disclosed 2018 C USA agent, a level Semantics; intelligent not disclosed Programming agent profession; Robot navigating a sensing systems; space, level Context modeling; not disclosed Evolutionary Focus on Not disclosed 2023 J Germany robotics; student, level explainable not disclosed artificial University Intelligent Tutoring 2022 C Brazil intelligence (AI); students System; Software learning and working on Maintenance; adaptive systems; software Reinforcement natural language development Learning; Q- programming; Learning robotic architecture; Focus on Behavior Tree; 2016 C China robotic baby; intelligent Reinforcement semantic agents, a level Learning; Game Al; representation not disclosed Agent; Raven Focus on Critical decisions; 2022 C USA student, level Reinforcement class to decouple content from the delivery approach during the game not disclosed learning; Student play. The approach entails modeling a sandbox serious game as a hier choice archy of missions/tasks that maximize learning objectives. The policy is University Online education; 2023 C China level MOOCs learned by an experience engine based on genetic computation and recommendation; reinforcement learning, along with the game engine. The authors find Meta-learning; that the approach enables pedagogical experts to insert, after the design Hierarchical of a game, quantitative specifications for effective knowledge/skill reinforcement learning; Learning acquisition. to rank Bonti et al. aim to challenge the traditional “one system fits all” Focus on the Games; Task 2023 J China view of education by implementing an Adaptive Training Framework game player, analysis; Problem- based on AI techniques through a Dynamic Difficulty Adjustment (DDA) the level not solving; Petri nets; agent. 
The authors uniquely apply DDA to purely educational platforms disclosed Random forests; Training; Radio intended for use in higher education. frequency; Learning Cobos-Guzman et al. propose the graphic design of a novel optimization; Petri virtual assistant for improving communication between the presenter nets (PNs); serious and the audience in presentation scenarios (e.g., auditoriums with 200 game (SG) Focus on Hierarchical 2021 J China people). The assistant has four levels of interaction and uses an AI sys decision- reinforcement tem to activate different interaction modes. The four levels of attention making, a learning; subtask considered by the virtual assistant are normal conditions, keeping silent, level not discovery; skill a low level of distraction, and a high level of distraction. When a level of disclosed discovery; attention is recognized by the AI agent, a positive or negative interaction hierarchical reinforcement of the virtual assistant is presented to increase the audience’s attention learning survey; levels. The graphic design of the virtual assistant relies on non hierarchical anthropomorphic forms with “live” characteristics (eye, mouth, and cable arm). Das et al. present a new AI task – Embodied Question Answering 4 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 Table 3 This gap can be addressed by intelligent tutoring systems that can Overview of studies. enable, adapt, and automate teaching-learning processes. The intelligent Area Context Type of RL Refs. tutoring system is assumed to work by characterizing three types of knowledge: the content, the student, and the teaching approaches. Their Serious games Adaptive experience Hierarchy of tasks Education Adaptive Training Dynamic Difficulty content recommendation engine uses the Q-Learning algorithm, a Framework Adjustment agent Reinforcement Learning (RL) AI-based technique. The modeling is done Virtual Assistant Improve Ontology-based knowledge through an analysis of computer science curricula in Brazil in univer communication representation sities and the Association for Computing Machinery Task Force docu Embodied QA Scene navigation Hierarchical model Real-world Improved Guided RL ment (2013). The Q-Learning algorithm refines its decision model robotics performance according to the results of previous recommendations. In their work, the Game Adaptive behavior Behavior tree and authors note that in the Q-Learning configuration, the recommended environment modeling reinforcement learning actions represent a Didactic Material (DM), and the states represent the Software Intelligent tutoring Recommendation model scores of the students. maintenance system based on Q-Learning RL algorithm, Fu et al. developed a reinforcement learning behavior tree Learning Intelligent tutoring Student-tutor mixed- framework for game AI to provide reasoning while considering learning. Assistant system initiative decision-making A behavior tree is a directed tree consisting of nodes and edges. Three MOOCs Personalized online Meta hierarchical RL node types of behavior trees are presented as shown in Table 4. In the education Serious games Learning-embedded Learning mechanisms (i.e., behavior tree, selectors will choose one option from its child nodes to attribute Petri net RL and random forest tick. 
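The Q-Learning recommendation engine described above for Francisco and Silva's intelligent tutoring system maps actions to didactic materials and states to student scores. The tabular sketch below is a hypothetical illustration of that setup; the state and action encodings, reward shaping, and hyperparameters are assumptions for illustration only, not details reported in the paper.

```python
# Hypothetical tabular Q-Learning sketch for a didactic-material recommender:
# states ~ discretized student scores, actions ~ didactic materials (DMs).

from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2                 # assumed hyperparameters
ACTIONS = ["DM_video", "DM_quiz", "DM_reading"]       # hypothetical materials

Q = defaultdict(float)                                # Q[(state, action)] -> value


def choose_material(score_bucket):
    """Epsilon-greedy choice of the next didactic material to recommend."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(score_bucket, a)])


def update(state, action, reward, next_state):
    """Q-Learning update: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])


# Example interaction: a student in the "low" score bucket improves to "medium"
# after working through the recommended material, yielding a positive reward.
state = "low"
action = choose_material(state)
reward, next_state = 1.0, "medium"                    # assumed reward: score gain
update(state, action, reward, next_state)
print(action, Q[(state, action)])
```

The refinement the authors describe, in which the decision model is adjusted according to the results of previous recommendations, corresponds to repeatedly applying an update of this kind as new student scores arrive.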
The algorithm follows an order from left to right to find the first (LAPN) model classification) success of the nodes that can be performed, which inevitably gives nodes Hierarchical Challenging Classification of hierarchical on the left the higher weights. The authors suggest this may not be reinforcement decision-making reinforcement learning reasonable and instead attaching the weight to the child nodes may be a learning Learning and VARK presentation AI-based Adaptive more natural approach. Yet, the process of assigning weights can be feedback or gamification Personalized Platform for difficult as it requires the full evaluation of the system. The authors thus Effective and Advanced propose using Reinforcement Learning to optimize the weighting of the Learning (APPEAL) selectors. Results of experimentation showed that the combination of Math word Automatic solver Deep RL behavior tree and reinforcement learning may be used as the framework problems AI player to RL method Temporal difference (TD) of adaptive behavior modeling in the game environment. entertain method Ju et al. propose a student-tutor mixed-initiative (ST-MI) human players decision-making framework that better supports the learning of Robotic baby English and Chinese Natural language RL low-performing (i.e., requiring intelligent tutors to make more of the learning pedagogical decisions) and high-performing (i.e., requiring the learner to make more of the pedagogical decision) learners. The findings of the (EmbodiedQA). An agent is spawned at a location (random) in a 3D authors’ empirical study show that an ST-MI significantly improved environment and asked a question (‘What color is the car in the pre student learning gains more than an Expert-designed, tutor-driven sented scene?’). To answer questions, the agent first has to intelligently pedagogical policy on an ITS. Further, the ST-MI framework was found navigate and explore the environment, gather information through to offer low performers the same benefits as the Expert policy, while the first-person (egocentric) vision, and then answer the question (‘e.g., the benefits for high performers were significantly greater than the Expert color of the car in the presented scene is orange’). Such a task requires policy. several AI skills, such as active perception, language understanding, Li et al. explore the need for personalized online education for goal-driven navigation, commonsense reasoning, and grounding of the development of Massive Open Online Courses or MOOCs. The au language into actions, among other things. The authors develop a novel thors highlight the gap of weak “course embeddings” in existing solu neural hierarchical model that decomposes navigation into a ‘planner’ – tions that recommend MOOC courses via deep learning models. that selects actions or a direction – and a ‘controller’ – that selects a Furthermore, existing algorithms depend on the scope of each course velocity and executes the primitive actions a variable number of times – without considering the needs of individual learners. To address these before returning control to the planner. They initialize the agent via gaps, the authors propose a Meta hierarchical Reinforced Learning to imitation learning and fine-tune it using reinforcement learning for the rank approach (MRLtr) consisting of a Meta Hierarchical Reinforcement goal of answering questions. 
The authors develop evaluation protocols Learning pre-trained mechanism and a gradient boosting ranking for EmbodiedQA and evaluate the developed agent in the House3D method. MRLtr comprises three steps: (1) Reinforced User Profiling with virtual environment. Additionally, the authors collect human demon Item Filtering that removes the noisy courses with hierarchical rein strations by connecting workers on Amazon Mechanical Turk to this forcement learning, (2) End-to-end pre-training with meta-enhancing, environment to remotely control embodied agents. adopting a gradient-based meta-learning approach to search for a bet Esser et al. propose the concept of guided RL to enable a sys ter embedding representation, and (3) Gradient Boosting with Order tematic approach toward accelerating the training process and Promoting, promoting the course recommendation order through a improving settings of real-world robotics. The authors share a taxonomy LightGBM-based ranking regressor. Both offline and online experiments of structure-guided RL, describe available approaches, and evaluate are conducted, revealing that MRLtr can achieve superior performance. their efficacy in terms of robotic features such as efficiency and effec Liang et al. examine serious games and game states as learning tiveness. In their view, guided RL: “describes the integration of addi tional knowledge into the learning process to accelerate and improve success for real-world robotics deployment” (p. 68). Thus, additional Table 4 Behavior tree node types adapted from Fu et al.. knowledge can be integrated at various phases of the guided RL pipeline. Francisco and Silva explore the use of recommendation systems Node type Succeeds Fails Node type in software maintenance courses as an intelligent tutoring system that Selector If one child succeeds If all children fail Selector would guide students in their learning. Software maintenance plays a Sequence If all children succeed If one child fails Sequence huge part in costs and requires students’ knowledge and proficiency. Parallel If at least N children If more than M-N children Parallel succeed fail 5 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 processes that maximize student learning. The authors propose the use the next state s′, and receives a reward r. Then, the value V(s) of the state of a learning-embedded attribute Petri net (LAPN) model based on Petri is updated as follows: V (s) ← (1 − α)V (s) + α(r + γV (s)) where α is the Networks, Reinforcement Learning, and Random Forest Classification. learning rate parameter, and γ is the discount rate parameter. The AI Serious games are assumed to be problem-solving goals partitioned into player engages in numerous games against its opponent, learning from small tasks, each of which may contain content knowledge that is part of outcomes. The performance of the proposed AI player is assessed by the problem-solving process. Furthermore, it is assumed that each small having it compete against 121 different types of computer players task is represented as a mini-game or a quick activity. These tasks are instead of human players with limited gaming skills. then ordered in a way that the completion of each task allows access to Zhu et al. present the formal definition of a robotic baby, which more tasks and progresses towards completion of the problem-solving starts with no prior knowledge and learns interactively and incremen process. 
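The value update quoted above for Yaguchi et al.'s AI game player, V(s) ← (1 − α)V(s) + α(r + γV(s′)), is the standard temporal-difference (TD(0)) rule. A minimal sketch follows; the state names, reward, and parameter values are illustrative assumptions rather than settings from the study.

```python
# TD(0) state-value update, V(s) <- (1 - alpha) * V(s) + alpha * (r + gamma * V(s')),
# as described for the AI game player above; states and rewards are illustrative.

from collections import defaultdict

alpha, gamma = 0.1, 0.95          # learning rate and discount rate parameters
V = defaultdict(float)            # state-value table, defaults to 0.0


def td_update(s, r, s_next):
    """Apply one TD(0) update after observing reward r and next state s_next."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])


# One observed transition: from state "mid_game" the player receives reward 1.0
# (e.g., for keeping the match close) and moves to state "end_game".
td_update("mid_game", 1.0, "end_game")
print(V["mid_game"])   # 0.1 after a single update from a zero-initialized table
```

Repeating such updates over many games is what lets the AI player gradually calibrate its play against different opponents, as the study describes.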
Consequently, students much have acquired the content tally through adaptation to its environment. The authors analyze the knowledge of each task to advance. In their work, the authors consider baby’s architecture from a a system engineering perspective. The au two types of stimuli: in-game stimuli such as hints/supports, and thors present “the capabilities of the robotic baby in natural language external stimuli such as physiological changes such as posture, gaze acquisition and semantic parsing in English and Chinese, as well as in movement, and speech. The two types of stimuli together can help natural language grounding, natural language reinforcement learning, determine how to guide learners (i.e., which task to tackle and in what natural language programming, and system introspection for explain order) and how to promote problem-solving (i.e., which content ability” (p. 1). Specifically, for natural language reinforcement learning, knowledge to acquire and which help session to complete). Their work the robotic baby can receive reward signals in English and Chinese, presents: adjusting its behavior accordingly. Results of testing reveal that with natural language reinforcement signals, the robotic baby can alter its - “VARK adaptive personalized content presentation on one hand or strategies for the same command and enhance its performance. The gamification on the other hand. authors illustrate the robotic baby’s education in both a distributed - Adaptive personalized content presentation employs a DQN RL AI embodiment robotic setting and a multiagent robotic setting, demon implementation. strating the transfer of knowledge between human and robotic entities - Adaptive personalized exercises difficulty scaffolding and navigation as well as among robotic babies themselves. Additionally, the authors (through skipping/ hiding less difficult exercises) and adaptive present the mechanism of direct knowledge inheritance between robotic feedback (through hints and messages), as well as reattempting ex babies and its benefits in the evolution of the robotic baby. ercises till mastery employing an online rule-based decision making based on student interactions.” (p. 25) 4. Discussion Pateria et al. present a review of Hierarchical Reinforcement This scoping review aimed to delve deeper into the application of Learning or HRL. The HRL method enables the division of challenging Reinforcement Learning or RL algorithms in improving teaching, long-horizon decision-making into simpler subtasks. The authors clas learning, and educational systems as a whole. Typically, RL algorithms sify approaches along three independent dimensions: (1) Approaches derive knowledge through the interaction between an agent and its with subtask discovery or without it. (2) Approaches for training single environment. The intelligence of the agent can be molded based on the agent or multiple agents. (3) Approaches for learning a single task or rewards and penalties it receives from the environment. multiple tasks. The authors suggest there can be eight possible divisions (octants) in this three-dimensional space, with the majority of the HRL 4.1. Summary of findings approaches falling into six of these divisions: (1) Single agent, single task, without subtask discovery, (2) Single agent, single task, with Our review of the literature on RL in education and AI identified only subtask discovery, (3) Multiple agents, single task, without subtask 15 focused studies on this topic. 
This highlights the necessity for further discovery, (4) Multiple agents, single task, with subtask discovery, (5) research in this area. A demographic analysis of these studies demon Single agent, multiple tasks, without subtask discovery, 6) Single agent, strated that even within the subset of 15, academic levels were often left multiple tasks, with subtask discovery. unspecified and kept open-ended. Additionally, we observed a steady Sayed et al. propose an AI-based Adaptive Personalized Plat increase in the number of publications per year on this topic, high form for an Effective and Advanced Learning (APPEAL) platform. The lighting the increased attention RL is receiving in education. However, platform incorporates a novel combination of Visual/Aural/Read, we should note that existing research may still be in its nascent stages, as Write/Kinesthetic (VARK) presentation or gamification in two modes more conference proceedings are generated as opposed to journal arti with scaffolding through skipping/hiding and reattempting. The modes cles and the research is polarized by certain locations (e.g., USA or include VARK and gamification, and facilitating adaptive exercise nav China). igation and feedback. The platform accomplishes by leveraging Deep When examining our research question: How are the characteristics Q-Network Reinforcement Learning (DQN-RL) and an online rule-based of RL noted in the reviewed studies? We find that even among the decision-making implementation. limited articles (N = 15) available, the areas, contexts, and uses of RL Wang et al. focus on designing an automatic solver for math tend to be variable and different from one another. This calls for more word problems. The authors apply deep reinforcement learning to solve discussion and characterization of RL in education in light of AI. Our arithmetic word problems. The proposed MathDQN is tailored from the review of the studies informed challenges remaining and our synthesis of general deep reinforcement learning framework. The work encompasses the literature which is described next. the design of states, actions, and reward functions, along with a feed-forward neural network as the deep Q-network. 4.2. Summary of challenges Yaguchi et al. propose a reinforcement learning method for automatically designing an AI player to entertain human players, Several challenges and needs are presented in the reviewed studies. especially those who are not proficient at playing games, in one-to-one Examples of noted challenges are described below: games. One approach to achieve this is to calibrate the AI player to be There are still gaps in user studies and mechanisms to assess how the neither too strong nor too weak through reward and opponent settings. system can improve knowledge/skill acquisition in serious games. To facilitate rapid learning for the AI player, the authors adopt the Serious games allow users to engage in activities that help them practice Temporal Difference (TD) method which uses state values. In the and acquire skills beyond leisurely purposes. In such contexts, learning TD method, the AI player at a state s takes an action, and progresses to becomes more informal and hidden, and thus interventions are needed 6 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 to appropriately assess user performance and learning. detrimental to their long-term progress. 
Consequently, students may opt There are small sample sizes in the studies and there is a need to to collaborate in groups and engage with the RL algorithm in a more collect data beyond the time and levels (e.g., individuals and cohorts) balanced and holistic manner. This idea may stem from established. For example, various factors may influence the punishments and theories such as Nash equilibrium. Students may find that their rewards of RL algorithms. Characterizing and accounting for such fac interaction with RL in relation (one may argue to a competing) group tors is thus a delicate task that needs further research. has more disadvantages than facing the RL as a group instead. With this Achieving a high level of attention in learners may be heightened for approach, however, another question may arise and that is: extroverted individuals and reduced for introverted individuals Cobos- Guzman et al.. Recognizing the internal and personal influencers - If a reward or punishment for a group of learners is deemed less of the RL functionality and logic is an important consideration. costly than each learner individually, then how well can RL meet an Using real-world experimental scenarios and data is needed [25,31]. individual’s needs in a group while assuming maximum reward for More discussion is needed to understand suitable and ethical areas in the group? which RL can be applied and studied in education with AI. A cascade of simple states may lead to an exponential explosion, so This is, perhaps, an important economic question that needs to be an emergent issue is the construction of states [26,29]. This means a lot investigated considering RL and humans in teaching and learning. Such of RL may be replaced by simplifications and aggregate summarizations a view may suggest that there may exist an economic view on the value of states. An important consideration is how to address the approxi of punishments and rewards. Students may discover that engaging with mations of simplified states or the storage and management of the RL algorithms within a group setting carries fewer drawbacks and yields exponential growth of states. greater returns compared to interacting with them individually. Different demographic groups (e.g., high versus low-performing Are economic views toward learning productive? students) may end up receiving different treatments, putting the Moreover, students may start to monetize their learning experiences, learning gains of some groups at an added advantage. An under which could subsequently influence their learning trajectory and qual standing of how to make RL integration a universal and equitable ity. For example, students may realize that an immediate reward holds experience among different user groups, and especially among under greater value than a promised but uncertain future reward. They may, as represented and well-represented groups, is needed. groups and individuals, gravitate towards behaviors that enhance their likelihood of receiving more rewards and facing fewer penalties for 4.3. Synthesis of the literature actions taken in the educational setting. Students may further begin to create economic metrics for their learning. For example, they might The review of the literature demonstrated extensive research on the perceive a form of inflation in learning rewards, where a smaller im teaching of machine learning algorithms, such as reinforcement learning mediate reward carries more weight than an inflated one in the future. 
or RL, but there is little work on the use of RL for improving teaching, Besides the tendency of learners wishing to strategize their learning learning, and educational systems overall. We identified 15 articles that to get the best outcomes from the RL algorithm, some roadblocks may be addressed the use of RL for educational purposes and examined their presented by the environment or the inadequately designed algorithm areas, contexts, and the use of RL algorithms. Overall, we found that itself. reinforcement learning is most frequently used in serious games, as they What if rewards and punishments are biased towards certain de are inherently depend on states and actions from and feedback to the mographic groups? player. The algorithm is prone to dispensing unequal punishments or re We find that there are challenges or questions that need to be wards to learners, particularly if it is built on or updated with biased addressed when seeking to integrate behaviorist RL models with historical data or skewed learning interactions in the classroom. Imagine constructivist teaching and learning. The pedagogical paradigm of RL is a learning environment where student teams consist of a mix of eth behaviorist and assumes it holds true for the environment and all users nicities and language levels. The most extroverted and fluent English-. This may come into conflict with the desired constructivist speaking student is likely to dominate discussions or have greater in teaching and learning paradigm. The constructivist paradigm encour fluence within the group, resulting in the concetration of group data ages interaction and knowledge-building among individuals , around that individual. Another prevalent scenario is the use of male- viewing learning as not having an inherently right or wrong approach. identifying data in healthcare studies and findings, some of which are In the behaviorist paradigm, on the other hand, it is believed that ab starkly different (e.g., Women versus Men’s health) for both genders. solute knowledge exists, and learning is seen as producing either right or What if there is a policy of deceit, illusion, or misconception in the wrong responses. In a learning environment that promotes constructivist educational setting? learning but uses RL, which inherently aims to personalize the envi Another roadblock to learning with RL algorithms is when the de ronment through behaviorist-like rewards and punishments, conflicts, velopers of the algorithm, or actors playing a role in the learning envi confusions misunderstandings, or prone likely to arise. Below, we pro ronment resort to deceiving the algorithm by manipulating or reversing vide some challenges (italicized) and a short explanation of each. the allocation of rewards and punishments. In such a scenario, the (When) to reward or penalize social learning? learner, for example, may unknowingly engage in actions with the al The teacher and students tak a constructivist approach and discus gorithm to maximize rewards, only to discover that those very in sions, which may often result in unclear points, errors, and the need for teractions are categorized as punishible according to the algorithm. clarifications. The RL algorithm, on the other hand, may attribute What if our perceptions of rewards and punishments are wrong? 
negative weights to questions and answers that generate more inquiries One aspect contributing to misconceptions about RL involves our and areas of confusion, while rewarding concise questions that do not potentially inaccurate perceptions regarding the definitions and roles or provide significant insight to the students. Now one may argue that RL rewards and punishments in the learning process. Learning typically could be programmed to incentivize more questions or those that occurs when errors are made, and a learner actively seeks to resolve stimulate discussion, while penalizing brief questions and responses. confusion and enhance their learning. In this view, learning happens However, this approach might lead to excessive and basic questioning when there is an error. In the world of RL, errors may be equated with practices, without necessarily resolving the specific confusions (or punishments, whereas advancements in learning could be likened to misconceptions) that students need addressed. rewards. As such learners may naively think they should strive for zero Is there an equilibrium that comes to the learner’s true advantage? punishments and maximal rewards. However, in reality, rewards are Learners may realize that personal rewards or punishments can be attained through acknowledging and rectifying mistakes, leading to 7 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 educational growth. As such an area of challenge could be learner’s (e.g., points awarded versus deducted) to reshape their learning prog misinterpretations of learning progress and errors caused by notions of ress. A human environment on the other hand (e.g., instructor) is ex rewards and punishments. pected to receive and review action outcomes and decide on the type and Can RL truly address bias in educational settings? degree of rewards to be provided to the agent. Note that the agent may One mitigation strategy that may be employed is to tailor rewards not only be a student but environment to be a teacher. A student, for and punishments to suit the diverse demographics of individuals or example, may be perceived as the environment when conducting self or groups. For instance, under-represented minority groups may exhibit peer assessment or taking the role of tutoring an AI agent. Moreover, it varying levels of engagement compared to well-represented groups. Yet, should be noted that negative rewards do no necessarily equate to pu such a realization could present a double-edged dilemma. On the one nitive measures such as deductions or poor grades. Rather, they repre hand, if the reinforcement algorithm becomes agnostic to the back sent a complementary facet to positive rewards, facilitating operational ground (e.g., demographics) of learners, it risks overlooking the chal perspectives such as actions and counteractions or divergent feedback lenges faced by under-represented minorities or perpetuating bias. On approaches. As such the entirety of reinforcement learning for the other hand, tailoring the algorithm’s responses based on learners’ enhancing educational purposes may be understood as mechanisms that backgrounds may be perceived as discriminatory or result in an unre break down complex and time-consuming decisions that entail types of alistic pursuit of algorithmic fairness. As such an important question that consequences into a set of states and actions. remains is whether more equity is achieved by normalizing versus realizing learners’ backgrounds. 5. 
Conclusion Can RL truly be tailored to its learning environment? We should further note potential disturbances caused by the learning From a pedagogical perspective, it is important that learning pro environment. For example: cesses steer clear of penalization and punitive measurs that revolve around labeling students as incorrect. A constructive feedback - Take a noisy and extremely changing learning environment. In such approach is favored where learning errors are reformulated as areas that a scenario, does RL learn to not learn from the experiences and re need improvement. While the RL algorithm may inherently operate wards? And if the environment is uncertain can RL ever be certain through a reward and punishment approach, it is important to reorient about rewards and punishments and may randomness or luck be of these rewards and punishments towards educational strategies that the better fit? algorithm can leverage or avoid to make social learning more constructivist and less behaviorist. In natural settings without RL present, human learners may engage Learners learn both individually and as part of a social group. The in learning not solely to optimize their rewards but out of a genuine demographics of the group and learners may impact the quality of desire for knowledge. The motive for the human could be to gain a better learning in both settings. As such rewards and punishments are likely to understanding of the real needs of the environment and different de differ when learners interact with an RL algorithm as individuals versus mographics. Thus, we may need to consider how to imbue RL with a groups. Established theories such as Nash equilibrium reason that human-centered perspective, so that it can better align with the values working with the algorithm as a group may come to benefit the group and needs of learners as they engage and exchange insights within this more than individual interactions. Yet, such theories assume that the framework. environment is unaware of such laws and that they are consistent under both individual and collective interactions. 4.4. Revisiting the high-level block diagram of RL One possibility is, therefore, to educate both learners and rein forcement algorithms that receiving punishments as groups versus in With the proliferation of digitalized forms of education, reenacting dividuals have different weights and learning gains, so students do not pedagogy with the use of artificial intelligence agents, besides human see this as an opportunity to hack their learning. Further, such a scenario educators and students, is becoming feasible. Fig. 3 attempts to show may help students make more careful decisions about the groups and how AI may be integrated as part of an RL system. Note that AI may be educational settings they put themselves in. Such recognition may present as part of the agent and/or environment. That is: further help students to make less economic views towards learning and want to learn rather than hack learning to get more rewards. This - There may exist an AI agent, a human agent, or both. consideration, however, is in itself a delicate task and difficult to achieve - There may exist an AI environment, a real environment (e.g., human, without stepping into the world of biases. or live creatures), or both. Undoing rewards and punishments that tend to certain demographic groups may not just be accomplished by including more demographi Together these may make 9 agent and environment scenarios cally diverse historical data. 
While considered a step in the right di possible. As a result, the notions and variables of RL may need to be rection, the use of corrected historical data may not be enough to redefined in light of AI and human learners. For example, a human as an address bias in educational and other social settings. This is because the agent (e.g., student) is expected to receive positive and negative rewards inclusion of demographically diverse data may only present a picture of Fig. 3. Re-envisioning reinforcement learning with AI. 8 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 performances and characteristics of different demographics but may not Singla, A., Rafferty, A.N., Radanovic, G., & Heffernan, N.T. (2021). Reinforcement learning for education: opportunities and challenges. ArXiv Preprint. identify what humanistic rewards and punishments were put in place, Ayodele TO. Types of machine learning algorithms. New Adv Mach Learn 2010;3: and the RL algorithm may be unaware of what needs to be undone. 19–48. Additionally, individuals and groups who serve as actors in the educa Bonaccorso G. Machine learning algorithms. Packt Publishing Ltd; 2017. tional setting may come to perceive and misinterpret constructive re Bennane A. Adaptive educational software by applying reinforcement learning. Inform Educ Int J 2013;12(1):13–27. wards and punishments as biased. Iglesias A, Martínez P, Aler R, Fernández F. Learning teaching strategies in an We thus find that bias can manifest itself in several ways when adaptive and intelligent educational system through reinforcement learning. Appl learners interact with RL in educational environments. Examples Intell 2009;31:89–106. Martínez-Tenor Á, Cruz-Martín A, Fernández-Madrigal JA. Teaching machine include, but are not limited to, the clashes of paradigms between RL and learning in robotics interactively: the case of reinforcement learning with Lego® educational settings, the role of deception and differing interests in re Mindstorms. Interact Learn Environ 2019;27(3):293–306. wards and punishments across individuals and demographic groups, and Narvekar S, Peng B, Leonetti M, Sinapov J, Taylor ME, Stone P. Curriculum learning for reinforcement learning domains: a framework and survey. J Mach the impact of changing environments or actors on learning gains and Learn Res 2020;21(1):82–7431. 73. penalties. Future work may consider potentials and challenges, as well Meyn S. Control systems and reinforcement learning. Cambridge University Press; as pedagogical considerations, of RL in education. 2022. Akanksha E, Sharma N, Gulati K. Review on reinforcement learning, research We acknowledge that our study has limitations, including the reli evolution, and scope of application. In: Proceedings of the 5th international ance on a single database and the examination of RL in educational conference on computing methodologies and communication (ICCMC); 2021. contexts with AI primarily in formal settings. This means that areas such p. 1416–23. AlMahamid F, Grolinger K. Reinforcement learning algorithms: an overview and as internships, industry experiences, and the use of RL in specialized classification. In: Proceedings of the IEEE Canadian conference on electrical and industries were not represented in our scoping review. We attempted to computer engineering (CCECE); 2021. p. 1–7. address bias in our work by charting and coding characteristics of Insa-Cabrera J, Dowe DL, Hernández-Orallo J. 
Evaluating a reinforcement learning reviewed studies and reflecting on the reported limitations. However, algorithm with a general intelligence test. In: Proceedings of the conference of the Spanish association for artificial intelligence; 2011. p. 1–11. we acknowledge that our synthesis may have been influenced by our Jordan S, Chandak Y, Cohen D, Zhang M, Thomas P. Evaluating the performance of research experiences at the intersection of AI’s technological and soci reinforcement learning algorithms. In: Proceedings of the international conference etal needs. Therefore, we present a review that is more focused on so on machine learning; 2020. p. 4962–73. Boulesnane A, Meshoul S. Reinforcement learning for dynamic optimization cietal and application needs rather than purely mathematical and problems. In: Proceedings of the genetic and evolutionary computation conference technological applications. companion; 2021. p. 201–2. Thrun S, Littman ML. Reinforcement learning: an introduction. AI Mag 2000;21(1). 103–103. Funding Bellotti F, Berta R, De Gloria A, Primavera L. Adaptive experience engine for serious games. IEEE Trans Comput Intell AI Games 2009;1(4):264–80. https://doi. This work was supported by the Canada Research Chair Program and org/10.1109/TCIAIG.2009.2035923. WE - Science Citation Index Expanded (SCI- EXPANDED. the Canada Foundation for Innovation. Bonti A, Palaparthi M, Jiang XM, Pham T. TuneIn: framework design and implementation for education using dynamic difficulty adjustment based on deep Disclosure statement reinforcement learning and mathematical approach. AD HOC NETWORKS AND TOOLS FOR IT, ADHOCNETS 2021 (Vol. 428, Issues 13th EAI International Conference on Ad Hoc Networks (ADHOCNETS). In: Proceedings of the 16th EAI The authors declared no potential conflicts of interest concerning the international conference on tools for design, implementation and verification of research, authorship, and/or publication of this paper. emerging information technologies (TRIDENTCOM); 2022. p. 229–41. https://doi. org/10.1007/978-3-030-98005-4_17. WE - Conference Proceedings Citation Index - Science (CPCI-S). About the authors Cobos-Guzman S, Nuere S, De Miguel L, Konig C. Design of a virtual assistant to improve interaction between the audience and the presenter. Int J Interact Multimed Artif Intell 2021;7(2):232–40. https://doi.org/10.9781/ Dr. Bahar Memarian is a postdoctoral researcher at Simon Fraser ijimai.2021.08.017. WE - Science Citation Index Expanded (SCI-EXPANDED. University. Dr. Tenzin Doleck is Assistant Professor and Canada Das A, Datta S, Gkioxari G, Lee S, Parikh D, Batra D. Embodied question answering. Research Chair at Simon Fraser University. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2018. p. 1–10. https://doi.org/10.1109/CVPR.2018.00008 (Issue 31st IEEE/CVF conference on computer vision and pattern recognition CRediT authorship contribution statement (CVPR)WE - Conference Proceedings Citation Index - Science (CPCI-S). Esser J, Bach N, Jestel C, Urbann O, Kerner S. Guided reinforcement learning a review and evaluation for efficient and effective real-world robotics. IEEE Robot Bahar Memarian: Writing – review & editing, Writing – original Autom Mag 2023;30(2):67–85. https://doi.org/10.1109/MRA.2022.3207664. WE draft, Visualization, Project administration, Methodology, Investigation, - Science Citation Index Expanded (SCI-EXPANDED). Formal analysis, Data curation, Conceptualization. 
Tenzin Doleck: Francisco RE, Silva FD. A Recommendation module based on reinforcement Writing – review & editing, Visualization, Software, Conceptualization. learning to an intelligent tutoring system for software maintenance. M. Cukurova, N. Rummel, D. Gillet, B. McLaren, & J. Uhomoibhi (editors), Proceedings of the CSEDU. In: Proceedings of the 14th international conference on computer Declaration of competing interest supported education - VOL 1 (issue 14th international conference on computer supported education (CSEDU); 2022. pp. 322–329). 10.5220/0011083900003182 WE - Conference Proceedings Citation Index - Science (CPCI-S) WE - Conference The authors declare that they have no known competing financial Proceedings Citation Index - Social Science & Humanities (CPCI-SSH). interests or personal relationships that could have appeared to influence Fu YC, Qin L, Yin QJ. A reinforcement learning behavior tree framework for game Al. T. Hu & X. Lee (editors). In: Proceedings of the international conference on the work reported in this paper. economics, social science, arts, education and management engineering; 2016. p. 573–9 (ESSAEME) (Vol. 71, Issues 2nd International Conference on Economics, References Social Science, Arts, Education and Management Engineering (ESSAEME)WE- Conference Proceedings Citation Inde). Ju S, Yang X, Barnes T, Chi M. Student-tutor mixed-initiative decision-making Wiering MA, Van Otterlo M. Reinforcement learning: state of the art. Springer; supported by deep reinforcement learning. M. M. Rodrigo, N. Matsuda, A. I. 2012 (Vol. 12, Issue 3). Cristea, & V. Dimitrova (editors), Artificial intelligence in education, PT I (Vol. Ertmer P, Newby T. Behaviorism, cognitivism, constructivism: comparing critical 13355, Issue. In: Proceedings of the 23rd international conference on artificial features from an instructional design perspective. Perform Improv Q 2013;26(2): intelligence in education (AIED); 2022. p. 440–52. https://doi.org/10.1007/978-3- 43–71. https://doi.org/10.1111/j.1937-8327.1993.tb00605.x. 031-11644-5_36. WE - Conference Proceedings Citation Index - Science (CPCI-S) Sutton RS, Barto AG. Reinforcement learning: an introduction. Robotica 1999;17 WE - Conference Proceedings Citation Index - Social Science & Humanities (CPCI- (2):229–35. SSH). Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT Press; 2018. Li YC, Xiong HY, Kong LH, Zhang R, Dou DJ, Chen GH. Meta hierarchical Fahad Mon B, Wasfi A, Hayajneh M, Slim A, Abu Ali N. Reinforcement learning in reinforced learning to rank for recommendation: a comprehensive study in education: a literature review. Informatics 2023;10(3):74. 9 B. Memarian and T. Doleck Computers and Education Open 6 (2024) 100175 MOOCs. In: Proceedings of the machine learning and knowledge discovery in Conference / 8th AAAI Symposium on Educational Advances in Artificial databases. 13718; 2023. p. 302–17. https://doi.org/10.1007/978-3-031-26422-1_ IntelligenceWE-Conference Proceedings Citation In). 19. ECML PKDD 2022, PT VI (Vol. Yaguchi T, Iima H. Design of an artificial game entertainer by reinforcement Liang J, Tang Y, Hare R, Wu B, Wang FY. A learning-embedded attributed petri net learning. In: Proceedings of the IEEE conference on games; 2020. p. 588–91 (IEEE to optimize student learning in a serious game. IEEE Trans Comput Soc Syst 2023; COG 2020) (Issue IEEE Conference on Games (IEEE CoG)WE-Conference 10(3):869–77. https://doi.org/10.1109/TCSS.2021.3132355. Proceedings Citation Inde). 
Pateria S, Subagdja B, Tan AH, Quek C. Hierarchical reinforcement learning: a Zhu HQ, Wilson S, Feron E. The design, education and evolution of a robotic baby comprehensive survey. ACM Comput Surv 2021;54(5). https://doi.org/10.1145/ IEEE Trans Robot 2023;39(3):2488–507. https://doi.org/10.1109/ 3453160. WE - Science Citation Index Expanded (SCI-EXPANDED). TRO.2023.3240619. Sayed WS, Noeman AM, Abdellatif A, Abdelrazek M, Badawy MG, Hamed A, El- Shawky D, Badawi A. A reinforcement learning-based adaptive learning system. In: Tantawy S. AI-based adaptive personalized content presentation and exercises The international conference on advanced machine learning technologies and navigation for an effective and engaging E-learning platform. Multimed Tools Appl applications (AMLTA2018). Springer International Publishing; 2018. p. 221–31. 2023;82(3):3303–33. https://doi.org/10.1007/s11042-022-13076-8. Silvetti M, Verguts T. Reinforcement learning, high-level cognition, and the human Wang L, Zhang DX, Gao LL, Song JK, Guo L, Shen HT. MathDQN: solving brain. Neuroimaging Cogn Clin Neurosci 2012:283–96. arithmetic word problems via deep reinforcement learning. In: Proceedings of the Rantzen AJ. Constructivism, direct realism and the nature of error. Theory Psychol 32th AAAI conference on artificial intelligence /thirtieth innovative applications of 1993;3(2):147–71. artificial intelligence conference / eighth AAAI symposium on educational Nash JF. Equilibrium points in n-person games. Natl Acad Sci 1950;36:48–9. advances in artificial intelligence; 2018. p. 5545–52 (Issue 32nd AAAI Conference Bransford JD, Brown AL, Cocking RR. How people learn: brain, mind, experience, on Artificial Intelligence / 30th Innovative Applications of Artificial Intelligence and school: expanded edition. National Academy Press; 2000. https://doi.org/ 10.17226/6160. 10