Automated Question and Answer Generation from Texts using Text-to-Text Transformers
Thapar Institute of Engineering and Technology
Rupali Goyal, Parteek Kumar, V. P. Singh
Summary
This research article presents an automated system that generates several types of question–answer pairs from a given text, including subjective, Boolean, fill-in-the-blank, and multiple-choice questions. The system fine-tunes the T5 transformer model for this task. Its performance is evaluated on benchmark datasets (SQuAD, QuAC, BoolQ) using automated metrics such as BLEU, ROUGE, and METEOR. The authors report better BLEU-4 and METEOR scores and greater diversity than previous models, and describe the generated questions and answers as grammatically accurate and contextually relevant.
Full Transcript
Arabian Journal for Science and Engineering (2024) 49:3027–3041
https://doi.org/10.1007/s13369-023-07840-7

RESEARCH ARTICLE-COMPUTER ENGINEERING AND COMPUTER SCIENCE

Automated Question and Answer Generation from Texts using Text-to-Text Transformers

Rupali Goyal1 · Parteek Kumar1 · V. P. Singh1

Received: 8 September 2022 / Accepted: 20 March 2023 / Published online: 3 May 2023
© King Fahd University of Petroleum & Minerals 2023

Abstract

Automatic question generation and automatic question answering from text is a fundamental academic tool that serves a wide range of purposes, including self-study, coursework, educational assessment, and many more. Manual construction of questions is a time-consuming and complicated process that requires experience, whereas automating the process diminishes the costs of manual question creation and fulfills the need for a persistent supply of questions for the tutors and self-evaluators. This paper uses an encoder–decoder architecture-based text-to-text transfer transformer (T5) intending to generate several types of question–answer pairs over a given context, including subjective question–answers having short and long answers, fill-in-the-blanks-type question–answers, Boolean answer (yes-or-no)-type questions, and multiple-choice question-answers. The model has been evaluated on benchmark datasets (SQuAD, QuAC, and BoolQ) over automated metrics (BLEU, ROUGE, METEOR, F1, and accuracy). The model outperformed previous baseline models, with 18.87 and 25.24 scores over BLEU-4 and METEOR metrics, respectively. The paper also demonstrates that the proposed system efficiently generates question–answer pairs where the baseline approaches struggled. The evaluation analysis also shows that the generated question–answer pairs are comparable with existing systems and even better in terms of diversity. Also, the generated questions are grammatically and contextually correct, and the answer generated matches the question in the textual context.
Keywords Question answering · Automatic question–answer generation · Multiple-choice question–answers · Fill-in-the-blanks · Subjective question–answer · Boolean answer–questions

Rupali Goyal [email protected]
1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab, India

1 Introduction

Automatic question generation and question answering is critical and plays a key role in many application domains, such as education [1–3], personal assistance, and healthcare. Manual question generation requires a lot of effort and resources and is costly and time-consuming. Traditional classrooms involve periodic tests, quizzes, and exams, along with the impromptu questions asked by the instructor during or after every session. This enables the learner to gauge their understanding and the instructor to gauge the effectiveness of their lessons. But creation and selection of questions is a time-consuming task. Creating good-quality questions is a complex process that requires training and experience. Also, due to the unprecedented convenience of Massive Open Online Courses (MOOCs), a large number of students have shifted to this self-learning paradigm of education to learn new concepts or aid their classroom courses. However, most online classrooms lack relevant and sufficient exercises to test the students due to the paucity of time on the creator's side. Automating question creation has therefore become a pressing need. As a result, there has been much interest in building an autonomous system for generating questions and corresponding answers during the last decade.

Questions are an integral part of learning and a fundamental tool in education. They can be used for querying more information or making sure that the class is engaged. Questions provide learners with feedback about their understanding and misconceptions, offer the opportunity to practice retrieving information from memory, highlight the important learning material and help learners focus on it, motivate learners to engage in learning activities, and reinforce core concepts through repetition. Questions are broadly divided into two categories: objective questions and subjective questions. The objective question asks individuals to choose the correct response from two to four alternatives or to provide a word/phrase to complete a sentence or to answer a question. The most prevalent types of objective question in education are multiple-choice, true–false, and fill-in-the-blank. The subjective questions, on the other hand, demand a response in terms of explanation that enables individuals to construct and create a response in their own wording. Long and short answer-type questions are the two well-known examples of subjective questions. However, for an effective assessment, both subjective and objective questions are necessary, and the proposed system has the capability of generating these questions.

The answers are extracted based on the information accumulated from the given context. The answers are categorized based on different tasks. For objective questions, the output is a word or phrase from context, while the multiple-choice approach requires selecting the correct answer from a list of potential answers. In the subjective question answering technique, a subsequence of the provided context is extracted as a response for questions with the short answer, and for long answer-type questions, the free answering techniques are used.

Traditionally, rule-based approaches and statistical approaches [13, 14] have been used for this task. The creation of rules and templates is extremely expensive, lacks diversity, and is hard to generalize on different domains. Deep neural networks have been shown to outperform earlier methods in most areas of natural language generation, but at the cost of heavy processing and a considerable amount of training. However, with the decrease in hardware cost and increased availability (backed by numerous cloud platforms), the transition to neural techniques has been quite favorable. The developments in deep neural networks, such as memory networks, attention mechanisms, and copy mechanisms, have shown promising results for the question and answer generation task. However, generating diverse question–answer pairs from the text remains a significant challenge. This paper addresses this challenge and has proposed a system to automatically generate a variety of question–answer pairs over a given passage.

These approaches required sequential processing, making training a very tedious and time-consuming task, but transformers allowed parallelization of tasks by taking the entire sequence as input instead of token by token, which made them popular. Transformer-based networks pre-trained on a large corpus and then fine-tuned on a specific downstream task have become the norm for a variety of natural language tasks. Several variations of transformers have been introduced in the past years. Transformers like GPT are based on a left-to-right decoder, whereas BERT is based on a deep bidirectional transformer encoder. On the other hand, BART and T5 transformers use encoder–decoder architecture. This paper adopts the conventional transformer-based sequence-to-sequence structure. The proposed framework is built upon a text-to-text transfer learning model. The training objective of this model is end-to-end question–answer pair generation based on the given context. The key objective of this paper is the use of T5 to automatically generate sentence and paragraph-level question–answer pairs with diverse perspectives over a given context. Figure 1 shows the overview of the proposed question–answer pair generation system.

Fig. 1 Block diagram for automatic question–answer pair generation from text

The proposed question–answer pair generation system takes a context passage as input and creates a list of questions on content knowledge with the extracted answers as output. The approach starts with loading textual context over which question–answer pairs have to be generated. The text uploaded can be a single-sentence or multi-sentence passage. This text is then passed to the next block, where it is pre-processed and split into sentences using a sentential tokenizer. These text tokens are then fed to the next block to obtain the position of each sentence in the context. These positions are attained using positional embeddings. Then, the sentence that contains answers is identified and highlighted using the <hl> token, and task prefixes are appended to the context with the highlighted sentence. The length padding or truncation is performed before applying the fine-tuned T5 model for answer extraction. After applying the model, a list of answers is obtained and joined. The embeddings for these extracted answers are obtained along with their corresponding positional embeddings. The task prefixes are appended and the length padding or truncation is performed. Then, the fine-tuned T5 model is applied to generate the questions that match the text in the context. Finally, a list of question–answer pairs is obtained that are grammatically and contextually correct. These generated question–answer pairs facilitate teachers in developing instructional content for various domains. Also, automating the process of test creation helps tutors for academic purposes and students with their self-evaluation.

The rest of this paper is organized as follows: Section 2 presents a brief overview of the existing approaches to question generation and answer extraction tasks. The architecture of the question–answer generation system and the design of each module is discussed in Sect. 3. The algorithms behind the implementation of each module are also explored. The evaluation results obtained for the system are presented in Sect. 4. Section 5 draws the conclusion along with the future scope of work.

2 Related Work

Traditionally, a rule-based approach was suggested to construct question templates and manually create questions from a given text [15, 16]. The creation of rules and templates is extremely expensive, lacks diversity, and is hard to generalize on different domains. Recently, deep neural network-based approaches [17, 18] have been proposed that address the issues of rule-based approaches to generate question-answers without handcrafting rules. These approaches include recurrent neural networks (RNNs), which encompass long short-term memory (LSTM) networks, gated recurrent units (GRUs), and several variations of these. Therefore, there has been a clear shift from rule-based and statistical methods of natural language processing to deep learning methods and now to transformers due to the excellent results obtained from the latter. Neural networks have been shown to outperform earlier methods in most areas of natural language generation, but at the cost of heavy processing and a considerable amount of training. However, with the decrease in hardware cost and increased availability (backed by numerous cloud platforms), the transition to neural techniques has been quite favorable. With advancements like memory networks, attention mechanisms, and copy mechanisms, deep neural networks have also shown promising results for the question–answer generation task.

Du et al. pioneered the automated generation of questions using a deep sequence-to-sequence neural model. For answer-aware question generation, firstly, the positions of the answer span are extracted from the input sentence, and then, the answer-specific questions are generated. Most existing studies [19, 22] use an encoder–decoder framework with an attention mechanism [23, 24]. However, different models incorporate answer information through different strategies, such as first detecting the question-worthy answer and then generating the answer-aware question. Similarly, other strategies like embedding the relative distance between the answer and the context words, a key-phrase extractor for the key answer candidates [27–29], an answer position indicator [8, 30], and so on are also incorporated. Du and Cardie used gated coreference knowledge for paragraph-level answer-aware question generation. Zhao et al. proposed a seq2seq network having a gated self-attention encoder and a maxout pointer decoder for paragraph-level single question generation. Nema et al. proposed refine networks for the question generation task. Kim et al. used deep neural networks to generate questions using answer-separation. Lopez et al. used a decoder-only transformer for paragraph-level question generation. Liu used seq2seq-based algorithms with copy mechanism and attention mechanism for question generation. These works have utilized different approaches for the question–answer generation task but are limited in capabilities as compared to our proposed approach.

The introduction of attention gave way to transformers. All previous approaches required sequential processing, making training a very tedious and time-consuming task, but transformers allowed parallelization of tasks by taking the entire sequence as input instead of token by token, making them popular. Several variations of transformers have been introduced in the past years. Transformers like GPT are based on a left-to-right decoder, whereas BERT is based on a deep bidirectional transformer encoder. On the other hand, BART and T5 [14, 36] transformers use encoder–decoder architecture. The current state of the art leverages transformer-based models [10, 11, 13, 37] for question–answer generation. Transfer learning models focus on preserving knowledge obtained from training on one problem and applying it to a related but different problem [11, 14]. Dehghani et al. use transformer architecture for open-domain question answering. Lopez et al. used GPT-2 transformer architecture to generate questions over a given paragraph.

Most of the existing question–answer generation models had been proposed over a single-sentence context and can generate only three types of WH-question (what, who, where). These systems use handcrafted rules for question–answer generation and falter on complex sentences. Also, the existing neural models generate one question per context. The previous works on question generation also use transformers, but they are either encoder or decoder based. Some research works use an encoder–decoder-based approach but lack diversity of the generated question–answer pairs. In this paper, an encoder–decoder architecture-based transformer model has been used for generating question–answer pairs given context. The context can be a single-sentence or multi-sentence passage. The proposed system has the capability to scale well over different perspectives and styles of question–answers. The creation of a system to facilitate teachers and students in generating instructional content for various domains, which can serve multiple uses but also consumes a lot of time when produced manually, is the primary motivation of this work. For an effective assessment, both subjective and objective questions are necessary, and the proposed system has the capability of generating these questions. A significant problem faced by instructors engaged in frontline instruction of classes is the lack of time for creating good-quality instruction content. The generation of diverse and grammatically correct question–answer pairs from the text remains a significant challenge. This paper addresses these challenges and has proposed a system to automatically generate a variety of question–answer pairs over a given passage.

2.1 Contributions of the Paper

The proposed question–answer pair generation system automatically generates sentence and paragraph-level question–answer pairs with diverse perspectives over a given context. The main contributions of this paper are as follows:

- A system that acts as a one-stop destination for generating subjective as well as objective-type questions has been proposed.
- The proposed system is capable of handling fill-in-the-blank, multiple-choice, Boolean, and long/short answers.
- This work fine-tunes a text-to-text transfer learning (T5) transformer to generate question–answer pairs over three large-scale benchmark datasets: SQuAD, QuAC, and BoolQ.
- The proposed system has been evaluated over automated metrics, such as BLEU, ROUGE, METEOR, F1 score, and accuracy, and has outperformed state-of-the-art baselines over BLEU-4 and METEOR metrics with scores of 18.87 and 25.24, respectively.
- A comparative analysis has been performed with the existing models to show the effectiveness of our proposed question–answer generation system.
- The questions generated by our approach are grammatically and contextually correct and the answer generated matches the question in the textual context.

3 Proposed Model Architecture

The proposed approach utilizes a stacked encoder–decoder with text-to-text transfer learning to generate various question–answers. Figure 2 illustrates that the T5 model has been fine-tuned using different settings on multiple tasks, such as QA, answer extraction, and QG.

Fig. 2 Multi-task fine-tuning of the pre-trained T5 model: (1) QA task uses text and question pair as input and generates answer as output, (2) QG task uses answer highlighted text as input and generates question as output, and (3) answer extraction task uses sentence highlighted text as input and extracts a list of answers with separator as output

The task has been formally described using Eq. (1). Let t denote the text, a denote the answer which is in the span of t, and let q denote the question targeting the answer a. Given a text t, the task is to model

P(q, a | t) = P(a | t) P(q | a, t)    (1)

The first factor of the equation, P(a | t), is modeled as an answer extraction task. In this way, the question generation task has been carried out without explicitly providing answers and the model has been trained to search for answers in the given text. For this task, text and answer pairs have been used where text is used as an input and answers are used as targets for training an answer extraction task. For the second factor of the equation, P(q | a, t), the answer and text pairs are utilized as inputs and the question targeting the given answer is used as the target in training. In the generation part, the answer extraction is done before question generation.

The framework for the automatic question generation and question answering is shown in Fig. 3. The proposed approach utilizes transfer learning with a text-to-text transfer transformer (T5) that can handle long-term dependencies well. This T5 model is a unified framework-based encoder–decoder model that converts every problem into a text-to-text format. It has been trained on a variety of supervised and unsupervised tasks. For unsupervised tasks, it has been trained on the Colossal Clean Crawled Corpus (C4), a 750 GB dataset, and for supervised tasks, several well-known datasets were utilized.

Fig. 3 Framework for automatic question and answer generation from text

Deep transformer-based networks pre-trained on a large corpus and then fine-tuned on a specific downstream task have become the norm for a variety of natural language tasks. In this paper, the task of question–answer generation is addressed. Specifically, when given a context, the model is entrusted with generating appropriate questions and corresponding answers to the generated question.

The context input is first aligned as a specific input sequence by adding a special token [CLS]. The context input sequence can either be a single sentence or a group of sentences. A special token [SEP] is introduced between the tokens of two consecutive sentences to separate information from different sentences. In addition, a learned embedding is added to every token to denote which sentence it belongs to. The sum of token embeddings and position embeddings is the input representation of a given token. The first encoder in the encoder-stack receives these embeddings of the input sequence. The encoder then transforms the result and propagates it to the next encoder. To get a better understanding of a certain word in the sequence, a self-attention layer is used. This layer allows the model to look at the other words in the input sequence. All the decoders in the decoder-stack receive the output from the last encoder in the encoder-stack. The encoder–decoder attention layer in the decoder-stack enables the decoder to focus on the pertinent segments of the input sequence. The keywords or keyphrases are obtained as output. These keywords are then mapped with context sentences to obtain corresponding questions. The proposed system finally outputs the extracted answers and generated questions as question–answer pairs. Figure 4 represents the flow diagram of this approach.

Fig. 4 Flow diagram for automatic question and answer generation approach

The flow of the approach starts with loading textual context over which question–answer pairs have to be generated. The context is then split into sentences using a sentential tokenizer. The position of each sentence in the context is attained using positional embeddings. Then, the sentence that contains answers is identified and highlighted using the <hl> token. After this, a task prefix (extract answer) is appended to the context with the highlighted sentence. Prior to applying the fine-tuned T5 model for answer extraction, length padding or truncation is done. After applying the model, a list of answers is obtained, which are joined using the <sep> token. These answer embeddings and positional embeddings are obtained. After this, a task prefix (generate questions) is appended to the context and the length padding or truncation is done. Then, the fine-tuned T5 model is applied to generate the questions that match the text in the context. Finally, a list of question–answer pairs is obtained that are grammatically and contextually correct. The architectural diagram of each encoder and decoder layer used is shown in Fig. 5.

The encoder and decoder component consists of a stack of six encoder layers on top of each other and a stack of the same number of decoder layers. Each encoder comprises two sub-layers: a self-attention layer and a feed-forward layer. The input to the encoder is first fed to a self-attention layer. This layer helps the encoder look at all the words in the input sentence as it encodes a specific word. The normalization layer is applied to each subcomponent layer where the activations are rescaled. A residual skip connection adds each sub-layer's input to its output. Within the feed-forward layer, a dropout is applied on the attention weights, skip connection, and at the input and output of the stack. The resultant outputs are fed to a feed-forward neural network (the second sub-layer of the encoder). The same feed-forward network is independently applied to each position and repeats the dropouts and normalization. The decoder has similar layers to the encoder and an additional attention layer between them. This additional attention layer helps the decoder focus on relevant parts of the input sentence.

3.1 Implementation Details

This paper uses the pre-trained T5-small model with 60 M parameters. The most notable feature of the transfer learning model used is its text-to-text nature, which enables it to learn any natural language task without altering the loss functions and hyperparameters. All these models are trained using the same hyperparameters with different data preparation. All the training and data preparation has been done on Google Colab. PyTorch is used for training and developing neural models. A grid search was used to identify hyperparameters, including learning rate, optimizer type, and the number of training epochs, and the set of parameters that attained the overall best scores in all metrics was selected. The batch size for training and evaluation is 32. The model is trained with a learning rate of 1e−4 for 18 epochs. The beam search technique is used for sequence decoding with a beam of size 4. The experiments described are implemented using Hugging Face's Transformers library. For fine-tuning, a special class, the Trainer class, is used, which simplifies and abstracts the complex training procedure and is optimized for training transformer models. The training arguments are decided according to the model to be trained, and a trainer is instantiated with those arguments. All models trained for the system use the same training script but different datasets and different training parameters. The algorithms used in the prediction phase for different models in the form of pseudo-codes along with explanations are mentioned below.

3.1.1 Subjective Question–Answer (Long and Short) Generation

For subjective question–answer generation, the process is divided into two tasks. The first task is to seek potential answers from the text given. The Answer-Extractor model is used for this task, which has been trained to extract answers from a given context. Each sentence from the input text is recognized as a separate input. These sentences are then formatted to be preceded by the label extract answers and passed to the generate function, which feeds these formatted texts to the model and retrieves the answers. For the second task, the extracted answers from the first task are mapped to the original context sentences, and a highlight token <hl> is added at the start and the end answer positions. This mapped text with highlight tokens is then formatted to contain the generate question label, which is then passed to the generate function of the model for generating questions according to the highlighted tokens. Finally, a list of the generated questions with corresponding answers is obtained. The pseudo-code for subjective question–answer generation is given below.
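The two-stage procedure described for subjective question–answer generation can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the exact prefix strings ("extract answers:" and "generate question:"), all helper names, and the regex-based sentence splitter are hypothetical, and the two fine-tuned T5 models are represented by plain callables (extract_fn and question_fn) that map an input string to an output string.

```python
import re
from typing import Callable, List, Optional, Tuple

HL = "<hl>"    # highlight token named in the text
SEP = "<sep>"  # separator token that joins extracted answers


def split_sentences(text: str) -> List[str]:
    """Naive splitter standing in for the sentential tokenizer (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def answer_extraction_input(sentences: List[str], idx: int) -> str:
    """Highlight sentence `idx` within the context and add the task prefix."""
    marked = [f"{HL} {s} {HL}" if i == idx else s for i, s in enumerate(sentences)]
    return "extract answers: " + " ".join(marked)  # prefix wording is assumed


def parse_answers(model_output: str) -> List[str]:
    """Split a <sep>-joined model output into a list of non-empty answers."""
    return [a.strip() for a in model_output.split(SEP) if a.strip()]


def question_generation_input(context: str, answer: str) -> Optional[str]:
    """Wrap the first occurrence of `answer` in highlight tokens, or return None."""
    start = context.find(answer)
    if start < 0:
        return None  # the answer cannot be mapped back to the context
    end = start + len(answer)
    return f"generate question: {context[:start]}{HL} {answer} {HL}{context[end:]}"


def generate_qa_pairs(
    context: str,
    extract_fn: Callable[[str], str],   # stands in for the answer-extraction model
    question_fn: Callable[[str], str],  # stands in for the question-generation model
) -> List[Tuple[str, str]]:
    """Stage 1 extracts answers per sentence; stage 2 asks a question per answer."""
    sentences = split_sentences(context)
    pairs: List[Tuple[str, str]] = []
    for i in range(len(sentences)):
        raw = extract_fn(answer_extraction_input(sentences, i))
        for answer in parse_answers(raw):
            qg_input = question_generation_input(context, answer)
            if qg_input is not None:
                pairs.append((question_fn(qg_input), answer))
    return pairs
```

With real models, extract_fn and question_fn would wrap calls to the fine-tuned answer-extraction and question-generation models; they are left abstract here so that the input formatting and answer mapping can be checked in isolation. The order of the two stages mirrors Eq. (1): answers are proposed from the text first, and each question is then conditioned on the text and one highlighted answer.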
5 Architecture of each encoder and decoder Layer 123 3034 Arabian Journal for Science and Engineering (2024) 49:3027–3041 3.1.2 Fill-in-the-Blanks-Type Question–Answer Generation answers are then mapped back to the original input context, and dashes or blanks are introduced at the positions of the For the generation of fill-in-the-blanks-type question–an- answer phrase. Finally, both the generated fill-in-the-blank swers, the first task is to recognize the potential spots for and its answer phrase are appended together as output. The introducing blanks in the sentences. This is done using the below mentioned is the pseudo-code for fill-in-the-blanks Answer-Extractor model, which has been trained to extract generation. answers from a given context. Each sentence from the input text is formatted by appending the label extract answers and then passed to the generate function, which feeds these for- matted texts to the model and retrieves the answers. The 123 Arabian Journal for Science and Engineering (2024) 49:3027–3041 3035 3.1.3 Boolean Answer–Question Generation ated along with the correct answer. For generating distractors, the Sense2Vec module is used. The extracted answers hav- For generating Boolean questions having true/false or yes/no- ing considerable equivalent distractors are selected from the type answers, firstly the input text is sentence tokenized. Each list of extracted answers and any answer which do not have sentence from the input text is recognized as a separate input. a specific number of distractors is discarded. The selected These sentences are then formatted such that each sentence is answers are mapped to the original context sentences, and a appended with the Boolean label. These labeled data are then highlight token < hl > is added at the start and the end answer passed to the generate function, which feeds these format- positions. 
This mapped text with highlight tokens is then for- ted texts to the model to generate Boolean-type questions. matted to contain the generate question label which is then The pseudo-code for Boolean answer–question generation is passed to the generate function of the model for generating attributed below. questions according to the highlighted tokens. 3.1.4 Multiple-Choice Question–Answer Generation The final output contains the generated question, cor- For generating multiple-choice question-answers, the first rect answer, and a list of distractors. The pseudo-code for task is to seek potential answers from the text given. Each multiple-choice question–answer generation is given below. sentence from the input text is recognized as a separate input. These sentences are then formatted to be preceded by the label extract answers and passed to the generate func- tion, which feeds these formatted texts to the model and retrieves the answers. The distractors are also to be gener- 123 3036 Arabian Journal for Science and Engineering (2024) 49:3027–3041 The system automatically generates different types generation, SQuAD and BoolQ datasets are used. And for of question-answers over a given text. Figure 6 shows subjective question–answer generation, QuAC and SQuAD the generated question–answer pairs using the above- datasets are used. mentioned algorithms (Subjective Question–Answer Gen- eration, Boolean Answer–Question Generation, Multiple- 4.2 Metrics Used Choice Question–Answer Generation, and Fill-in-the- Blanks-type Question–Answer Generation). These algo- The performance of the question–answer generation system rithms individually generate more than one relevant ques- is assessed using the metrics listed below. tion–answer pair over the same context. ROUGE-L measures recall that by how many words the predicted and reference sentences are similar using 4 Evaluation longest common subsequence-based statistics. 
BLEU measures precision that scores word similar- The proposed system is evaluated to analyze the quality of ity between candidate and reference sentences. BLEU-1, generated question–answers. The performance of the system BLEU-2, BLEU-3, and BLEU-4, use 1-g to 4-g, respec- is evaluated on various datasets and metrics as mentioned tively, for calculation. below. METEOR is based on the harmonic mean of recall and precision, with recall weighted higher than precision. F1 Score measures the average overlap between the 4.1 Dataset Used ground truth and the predicted answer. The proposed model has been trained on three datasets, i.e., SQuAD , QuAC , and BoolQ to generate dif- ferent types of objective and subjective question-answers. Table 1 lists the statistics for these datasets. These datasets have been pre-processing by removing the questions which do not have any answers. For the objective question–answer 123 Arabian Journal for Science and Engineering (2024) 49:3027–3041 3037 Fig. 6 Sample, diverse question–answer pairs generated by the system over the same given text: (1) questions with short answers, (2) Boolean answer–questions, (3) fill-in-the-blanks, (4) multiple-choice answer–questions, and (5) questions with multi-line answers 4.3 Evaluation Results Table 2 Model evaluation scores on different datasets Dataset Evaluation metric Evaluation score The model is evaluated on SQuAD, QuAC, and BoolQ datasets over the above-mentioned metrics. Table 2 lists the SQuAD ROUGE-L 40.64 metrics and corresponding scores on different datasets. 
SQuAD     BLEU-1              41.74
SQuAD     BLEU-2              30.81
SQuAD     BLEU-3              23.81
SQuAD     BLEU-4              18.87
SQuAD     METEOR              25.24
QuAC      F1                  62.71
BoolQ     Accuracy            64.80

Table 1 Statistics of the datasets
                                        SQuAD    QuAC     BoolQ
#Training                               59,819   60,674   9427
#Validation                             3127     5142     3270
#Test                                   3000     5140     3270
Average length of context (in words)    121      396      108 tokens
Average length of question (in words)   10       7        8.9 tokens
Average length of answer (in words)     3        14       Boolean

The metric scores reflect how similar the generated questions are to the ones in the training dataset. Through persistent and careful observation, it has been found that the questions and answers formed by the models are almost always grammatically sound and successfully capture the inherent context of the given paragraphs.

4.4 Evaluation Analysis with Existing Models

The proposed automatic question–answer generation system has been compared with the existing models. The evaluation results of the proposed system and the other existing models are shown in Table 3, which shows that the proposed approach outperforms the existing approaches on the BLEU-4 and METEOR metrics.

Table 3 Evaluation of our model with existing approaches
Model               BLEU-4   METEOR
Du and Cardie       15.16    19.12
Lopez et al.        8.26     21.2
Nema et al.         16.99    21.10
Zhao et al.         16.85    20.62
Kim et al.          16.17    –
Liu                 13.86    –
Proposed approach   18.87    25.24

The results listed in Table 3 are those that the authors report in their respective papers. These papers have used different approaches for question–answer generation tasks but are limited in capabilities compared to the proposed approach. Du and Cardie used gated coreference knowledge for paragraph-level answer-aware question generation. Zhao et al. proposed a seq2seq network with a gated self-attention encoder and a maxout pointer decoder for paragraph-level single-question generation, but it lacks diversity. Nema et al. proposed refine networks for the question generation task, and Kim et al. used deep neural networks to generate questions using answer separation, but neither is capable of generating the corresponding answers over the same context. Lopez et al. used a decoder-only transformer for paragraph-level question generation. Liu used seq2seq-based algorithms with copy and attention mechanisms for question generation. In contrast, the proposed approach uses the T5 transformer architecture to generate question–answer pairs of the selected type such that the generated questions and answers are grammatically and contextually correct. The proposed system can generate multiple question–answer pairs over the same sentence- or paragraph-level context, and the generated answer matches the question in the textual context. It can generate a variety of question–answer pairs, including subjective, Boolean (yes/no, true/false), fill-in-the-blanks, and multiple-choice question–answer pairs. The generated question–answers are comparable to those of the baseline question–answer generation models and perform well across the majority of question–answer domains, as shown in Table 4. The question–answer pairs generated by our system show diversity over the same given textual passage. The generated question–answer pairs were also shown to a small group of individuals, and the majority of the questions were found to be grammatically sound and acceptable over the given context.

Table 4 Analysis of generated question-answers by existing models and by our system with three examples

Example 1
Text 1: There is, I think, humor here which does not translate well from English into sanity
Question–Answer generated by Lovenia, Limanta, and Gunawan 2018:
Q: What does not do?
A: sanity
Question–Answer generated by our system:
Q: What does not translate well from English into sanity?
Answer: humor
Q: What does not translate well into sanity?
-> Humor (correct) -> Dark Humour -> Slapstick -> Goofiness
Q: What does humor translate well from English into?
-> Sanity (correct) -> Own Sake -> Life -> Mental Stability
Q: There is, I think, _____ here which does not translate well from English into sanity
Answer: humor

Example 2
Text 2: Liberated by Napoleon’s army in 1806, Warsaw was made the capital of the newly created Duchy of Warsaw
Question–Answer generated by Nema et al. 2019:
Q: Who liberated Warsaw in 1806?
Question–Answer generated by our system:
Q: When was Warsaw liberated?
Answer: 1806
Q: Who liberated Warsaw in 1806?
-> Napolean (correct) -> Genghis -> Military Genius -> Augustus
Q: Is Warsaw the capital of the new Duchy ?
Q: What was the capital of the newly created Duchy of Warsaw?
-> Warsaw (correct) -> Berlin -> Moscow -> Vilnius
Q: What was the name of the newly created city of Warsaw?
-> Duchy (correct) -> Duchies -> Kingdom Title -> Vassal

Example 3
Text 3: Teaching may be carried out informally, within the family, which is called homeschooling, or in the wider community. Formal teaching may be carried out by paid professionals. Such professionals enjoy a status in some societies on a par with physicians, lawyers, engineers, and accountants (Chartered or CPA)
Question–Answer generated by Lopez et al. 2020:
Q: What is a profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession
Question–Answer generated by our system:
Q: What is the name of the family that may taught informally?
Answer: homeschooling
Q: Teaching may be carried out informally, within the family, which is called ________
Answer: homeschooling
Q: Who may be responsible for the formal teaching?
-> Paid Professionals (correct) -> Entertainers -> Amateurs -> Decision Makers
Q: Formal teaching may be carried out by ________________
Answer: paid professionals
Q: What is usually carried out informally?
-> Teaching (correct) -> Teach -> Learning -> College Course

5 Conclusion and Future Scope

In this paper, an encoder–decoder architecture-based text-to-text transfer transformer (T5) has been used to automatically generate sentence- and paragraph-level question–answer pairs with diverse perspectives over a given context. The system outperforms the existing baseline question–answer generation models on the BLEU-4 and METEOR evaluation metrics, with scores of 18.87 and 25.24, respectively. The paper also shows that the proposed system efficiently generates question–answer pairs over contexts where the baseline approaches failed. It is believed that the question–answer generation system makes the automatic creation of tests easier and helps students with their self-evaluation and tutors with their academic work. Question–answers can prompt impromptu class discussions and hence help tutors reinforce the concepts of their students. Also, such an application enables students to test their understanding whenever required, helping them evaluate themselves as well as eradicating the fear generally associated with tests and quizzes.

The system addresses the need for variation in tests, as it lets users choose among a variety of question–answer types, including short and long answer-type questions, Boolean answer-type questions, multiple-choice question–answers, and fill-in-the-blanks-type question–answers. The system can generate question–answer pairs over a single sentence and over a paragraph. The application helps instructors save time by providing them with an easy, efficient, and reliable source of questions with corresponding answers. The saved time can then be used to focus on other important academic activities, personal care, and well-being. Practicing questions helps boost the morale of students and helps them gain confidence in their field of study.

Further, the work is likely to be extended to include local Indian languages such as Hindi and Punjabi. More powerful language models such as T5-large, T5-3B, and T5-11B will be experimented with in the near future. The question–answer generation algorithms will also be made faster, with more depth and diversity. It is believed that the proposed question and answer generation system will benefit its users.

Funding The authors have no funding to report.

Data Availability The data used to support the findings of this study are included within the article.

Declarations

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

Consent for publication This work is original and has not been published elsewhere nor is it currently under consideration for publication elsewhere.

Ethical Approval The author declares that this article complies with the ethical standard.

References

1. Agarwal, M.; Mannem, P.: Automatic gap-fill question generation from text Books. In: Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–64 (2011)
2. Kumar, G.; Banchs, R.; D’Haro, L.F.: RevUP: automatic gap-fill question generation from educational texts. Assoc. Comput. Linguist. (2015). https://doi.org/10.3115/v1/w15-0618
3. Baha, T.A.I.T.; Hajji, M.E.L.; Es-Saady, Y.; Fadili, H.: Towards highly adaptive Edu-Chatbot. Procedia Comput. Sci. 198, 397–403 (2021). https://doi.org/10.1016/j.procs.2021.12.260
4. Gao, S.; Ren, Z.; Zhao, Y.; Zhao, D.; Yin, D.; Yan, R.: Product-aware answer generation in E-commerce question-answering. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 429–437. ACM, New York, NY (2019)
5. Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; Fan, W.: On the generation of medical question-answer pairs. Proc. AAAI Conf. Artif. Intell. 34, 8822–8829 (2020). https://doi.org/10.1609/aaai.v34i05.6410
6. Liu, S.; Zhang, X.; Zhang, S.; Wang, H.; Zhang, W.: Neural machine reading comprehension: methods and trends. Appl. Sci. 9, 3698 (2019). https://doi.org/10.3390/app9183698
7. Weston, J.; Chopra, S.; Bordes, A.: Memory networks. In: 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc. pp. 1–15 (2015)
8. Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; Zhou, M.: Neural question generation from text: A preliminary study. In: Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 10619 LNAI, pp. 662–671 (2018). https://doi.org/10.1007/978-3-319-73618-1_56
9. Kumar, V.; Ramakrishnan, G.; Li, Y.F.: Putting the horse before the cart: a generator-evaluator framework for question generation from text. In: CoNLL 2019 - 23rd Conf. Comput. Nat. Lang. Learn. Proc. Conf. pp. 812–821 (2019). https://doi.org/10.18653/v1/k19-1076
10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I.: Attention is all you need. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. CA, USA. arXiv:1706.03762v5 (2017)
11. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v2 4171–4186 (2019). https://doi.org/10.18653/v1/N19-1423
12. Radford, A.; Narasimhan, K.: Improving language understanding by generative pre-training (2018)
13. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. 7871–7880 (2020). https://doi.org/10.18653/v1/2020.acl-main.703
14. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. 1–67. arXiv:1910.10683v3 (2020)
15. Chali, Y.; Hasan, S.A.: Towards topic-to-question generation. Comput. Linguist. (2015). https://doi.org/10.1162/COLI
16. Danon, G.; Last, M.: A syntactic approach to domain-specific automatic question generation (2017)
17. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. Methods Nat. Lang. Process. (EMNLP) Assoc. Comput. Linguist. Empir. (2014). https://doi.org/10.1128/jcm.28.9.2159-.1990
18. Bahdanau, D.; Cho, K.; Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc. pp. 1–15 (2015)
19. Du, X.; Shao, J.; Cardie, C.: Learning to ask: neural question generation for reading comprehension. arXiv:1705.00106v1 (2017)
20. Upadhya, B.A.; Udupa, S.; Kamath, S.S.: Deep neural network models for question classification in community question-answering forums. In: 2019 10th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2019, pp. 6–11. IEEE (2019)
21. Wang, R.; Panju, M.; Gohari, M.: Classification-based RNN machine translation using GRUs. 1–7 (2017)
22. Serban, I.V.; García-Durán, A.; Gulcehre, C.; Ahn, S.; Chandar, S.; Courville, A.; Bengio, Y.: Generating factoid questions with recurrent neural networks: the 30m factoid question-answer corpus. In: Proc. 54th Annu. Meet. Assoc. Comput. Linguist., vol. 1, pp. 588–598 (2016). https://doi.org/10.18653/v1/P16-1056
23. Du, X.; Cardie, C.: Harvesting paragraph-level question-answer pairs from wikipedia. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1907–1917. Association for Computational Linguistics, Stroudsburg, PA (2018)
24. Song, L.; Wang, Z.; Hamza, W.; Zhang, Y.; Gildea, D.: Leveraging context information for natural question generation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2, pp. 569–574. Association for Computational Linguistics, Stroudsburg, PA (2018)
25. Du, X.; Cardie, C.: Identifying where to focus in reading comprehension for neural question generation. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 2067–2073. Association for Computational Linguistics, Stroudsburg, PA (2017)
26. Sun, X.; Liu, J.; Lyu, Y.; He, W.; Ma, Y.; Wang, S.: Answer-focused and position-aware neural question generation. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 3930–3939. Association for Computational Linguistics, Stroudsburg, PA (2018)
27. Meng, R.; Zhao, S.; Han, S.; He, D.; Brusilovsky, P.; Chi, Y.: Deep keyphrase generation. In: Proc. 55th Annu. Meet. Assoc. Comput. Linguist., vol. 1, pp. 582–592 (2017). https://doi.org/10.18653/v1/P17-1054
28. Subramanian, S.; Wang, T.; Yuan, X.; Zhang, S.; Trischler, A.; Bengio, Y.: Neural models for key phrase extraction and question generation. In: Proceedings of the Workshop on Machine Reading for Question Answering, pp. 78–88. Association for Computational Linguistics, Stroudsburg, PA (2018)
29. Willis, A.; Davis, G.; Ruan, S.; Manoharan, L.; Landay, J.; Brunskill, E.: Key phrase extraction for generating educational question-answer pairs. In: Proc. Sixth ACM Conf. Learn. @ Scale. pp. 1–10 (2019). https://doi.org/10.1145/3330430.3333636
30. Liu, B.; Zhao, M.; Niu, D.; Lai, K.; He, Y.; Wei, H.; Xu, Y.: Learning to generate questions by learning what not to generate. World Wide Web Conf. - WWW ’1, pp. 1106–1118 (2019). https://doi.org/10.1145/3308558.3313737
31. Zhao, Y.; Ni, X.; Ding, Y.; Ke, Q.: Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In: Proc. 2018 Conf. Empir. Methods Nat. Lang. Process. EMNLP 2018, pp. 3901–3910 (2018). https://doi.org/10.18653/v1/d18-1424
32. Nema, P.; Mohankumar, A.K.; Khapra, M.M.; Srinivasan, B.V.; Ravindran, B.: Let’s ask again: refine network for automatic question generation. In: Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Jt. Conf. Nat. Lang. Process. pp. 3312–3321 (2019). https://doi.org/10.18653/v1/D19-1326
33. Kim, Y.; Lee, H.; Shin, J.; Jung, K.: Improving neural question generation using answer separation. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019)
34. Lopez, L.E.; Cruz, D.K.; Cruz, J.C.B.; Cheng, C.: Simplifying Paragraph-level Question Generation via Transformer Language Models. (2020)
35. Liu, B.: Neural question generation based on Seq2Seq. In: Proceedings of the 2020 5th International Conference on Mathematics and Artificial Intelligence, pp. 119–123 (2020). https://doi.org/10.1145/3395260.3395275
36. Akyon, F.C.; Cavusoglu, D.; Cengiz, C.; Altinuc, S.O.; Temizel, A.: Automated question generation and question answering from Turkish texts using text-to-text transformers. 1–14 (2021). https://doi.org/10.3906/elk-Automated
37. Nassiri, K.; Akhloufi, M.: Transformer models used for text-based question answering systems. Appl. Intell. (2022). https://doi.org/10.1007/s10489-022-04052-8
38. Bashath, S.; Perera, N.; Tripathi, S.; Manjang, K.; Dehmer, M.; Streib, F.E.: A data-centric review of deep transfer learning with applications to text data. Inf. Sci. (Ny) 585, 498–528 (2022). https://doi.org/10.1016/j.ins.2021.11.061
39. Dehghani, M.; Azarbonyad, H.; Kamps, J.; De Rijke, M.: Learning to transform, combine, and reason in open domain question answering. In: CEUR Workshop Proc., vol. 2491, pp. 681–689 (2019). https://doi.org/10.1145/3289600.3291012
40. Lovenia, H.; Limanta, F.; Gunawan, A.: Automatic question-answer pairs generation from text. Acad. Edu. (2018). https://doi.org/10.13140/RG.2.2.33776.92162
41. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; Rush, A.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6
42. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proc. 2016 Conf. Empir. Methods Nat. Lang. Process. pp. 2383–2392 (2016). https://doi.org/10.18653/v1/D16-1264
43. Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.; Choi, Y.; Liang, P.; Zettlemoyer, L.: QuAC: question answering in context. In: Proc. 2018 Conf. Empir. Methods Nat. Lang. Process., pp. 2174–2184 (2018). https://doi.org/10.18653/v1/D18-1241
44. Clark, C.; Lee, K.; Chang, M.; Kwiatkowski, T.; Collins, M.; Toutanova, K.: BoolQ: exploring the surprising difficulty of natural yes/no questions. In: Proc. 2019 Conf. North., pp. 2924–2936 (2019). https://doi.org/10.18653/v1/N19-1300
45. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
46. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318. ACL (2002)
47. Lavie, A.; Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation (2005)

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
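As an illustration of the ROUGE-L and F1 metrics described in Section 4.2, the following is a minimal sketch and not the evaluation code used in the paper. It assumes whitespace tokenization, lowercasing, a single reference, and a recall weight beta = 1.2 for ROUGE-L; all of these are illustrative choices, and the function names are hypothetical.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall more (assumed value)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


def token_f1(prediction, truth):
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the paid professionals", "paid professionals")` has precision 2/3 and recall 1, so the extra token lowers the score even though the reference answer is fully covered.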
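The BLEU-1 to BLEU-4 scores discussed in Section 4.2 combine clipped n-gram precisions with a brevity penalty. The sketch below shows that computation for one candidate and one reference, assuming whitespace tokenization, uniform n-gram weights, and no smoothing; it is an illustrative simplification with hypothetical function names, not the scoring script behind the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidates shorter than the reference are penalised.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_mean)
```

Calling `bleu(candidate, reference, max_n=1)` gives a BLEU-1-style score and `max_n=4` a BLEU-4-style score. Corpus-level BLEU, as usually reported, aggregates n-gram counts over all sentence pairs rather than averaging sentence scores.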
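Section 3.1.4 describes formatting each sentence with an "extract answers" label and formatting highlighted text with a "generate question" label before passing it to the model's generate function. A rough sketch of that input formatting follows. The exact label strings, the `<hl>` highlight token, the naive sentence splitter, and all function names are assumptions made for illustration; the model call itself is omitted.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def answer_extraction_inputs(text, label="extract answers"):
    """One model input per sentence, each preceded by the (assumed) label."""
    return [f"{label}: {sentence}" for sentence in split_sentences(text)]


def question_generation_input(context, answer, label="generate question", hl="<hl>"):
    """Wrap the first occurrence of the answer in highlight tokens and
    prepend the (assumed) question-generation label."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer span not found in context")
    end = start + len(answer)
    highlighted = f"{context[:start]}{hl} {answer} {hl}{context[end:]}"
    return f"{label}: {highlighted}"
```

Each string returned by these helpers would then be tokenized and passed to the fine-tuned model; distractor generation for the multiple-choice case is a separate step that this sketch does not cover.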