Learning to Automatically Generate Fill-In-The-Blank Quizzes


Edison Marrese-Taylor, Ai Nakajima, Yutaka Matsuo, Ono Yuichi


Summary

This academic paper formalizes the problem of automatically generating fill-in-the-blank quizzes and proposes concrete deep learning models for it. An empirical study is based on data from a language learning platform.

Full Transcript


Learning to Automatically Generate Fill-In-The-Blank Quizzes

Edison Marrese-Taylor, Ai Nakajima, Yutaka Matsuo
Graduate School of Engineering, The University of Tokyo
emarrese,ainakajima,[email protected]

Yuichi Ono
Center for Education of Global Communication, University of Tsukuba
[email protected]

Abstract

In this paper we formalize the problem of automatic fill-in-the-blank question generation using two standard NLP machine learning schemes, proposing concrete deep learning models for each. We present an empirical study based on data obtained from a language learning platform, showing that both of our proposed settings offer promising results.

1 Introduction

With the advent of the Web 2.0, regular users were able to share, remix and distribute content very easily. As a result of this process, the Web became a rich interconnected set of heterogeneous data sources. Being in a standard format, it is suitable for many tasks involving knowledge extraction and representation. For example, efforts have been made to design games with the purpose of semi-automating a wide range of knowledge transfer tasks, such as educational quizzes, by leveraging this kind of data.

In particular, quizzes based on multiple choice questions (MCQs) have proved efficient for judging students' knowledge. However, manual construction of such questions often results in a time-consuming and labor-intensive task.

Fill-in-the-blank questions, where a sentence is given with one or more blanks in it, either with or without alternatives to fill in those blanks, have gained research attention recently. In this kind of question, as opposed to MCQs, there is no need to generate a WH-style question derived from the text. This means that the target sentence can simply be picked from a document on a corresponding topic of interest, which makes the task easier to automate.

Fill-in-the-blank questions in their multiple-choice answer version, often referred to as cloze questions (CQ), are commonly used for evaluating the proficiency of language learners, including official tests such as TOEIC and TOEFL (Sakaguchi et al., 2013). They have also been used to test students' knowledge of English in using the correct verbs (Sumita et al., 2005), prepositions (Lee and Seneff, 2007) and adjectives (Lin et al., 2007). Pino et al. (2008) and Smith et al. (2010) generated questions to evaluate students' vocabulary.

The main problem in CQ generation is that it is generally not easy to come up with appropriate distractors —incorrect options— without rich experience. Existing approaches are mostly based on domain-specific templates, whose elaboration relies on experts. Lately, approaches based on discriminative methods, which rely on annotated training data, have also appeared. Ultimately, these settings prevent end-users from participating in the elaboration process, limiting the diversity and variation of quizzes that the system may offer.

In this work we formalize the problem of automatic fill-in-the-blank question generation and present an empirical study using deep learning models for it in the context of language learning. Our study is based on data obtained from our language learning platform (Nakajima and Tomimatsu, 2013; Ono and Nakajima; Ono et al., 2017), where users can create their own quizzes by utilizing freely available and open-licensed video content on the Web. In the platform, automatic quiz creation currently relies on hand-crafted features and rules, making the process difficult to adapt. Our goal is to effectively provide an adaptive learning experience in terms of style and difficulty, and thus better serve users' needs (Lin et al., 2015). In this context, we study the ability of our proposed architectures to learn to generate quizzes based on data derived from the interaction of users with the platform.
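As a concrete illustration of the task described above — picking a sentence from a source document and blanking one or more of its tokens — the following is a minimal sketch in Python; the function name and the blank marker are our own, not part of any system described in the paper:

```python
def render_quiz(tokens, blank_positions, blank="____"):
    """Render a fill-in-the-blank item: replace the tokens at the given
    1-based positions with a blank marker and return (question, answers)."""
    chosen = set(blank_positions)
    question = " ".join(blank if i in chosen else tok
                        for i, tok in enumerate(tokens, start=1))
    answers = [tokens[i - 1] for i in sorted(chosen)]
    return question, answers

# Blank the verb of a tokenized sentence picked from a source document.
q, a = render_quiz(["The", "dog", "is", "barking"], [4])
# q == "The dog is ____", a == ["barking"]
```

Multi-blank quizzes follow the same shape by passing several positions, which mirrors the one-blank-at-a-time setting the paper formalizes next.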
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 152–156, Melbourne, Australia, July 19, 2018. © 2018 Association for Computational Linguistics

2 Related Work

The problem of fill-in-the-blank question generation has been studied in the past by several authors. Perhaps the earliest approach is by Sumita et al. (2005), who proposed a cloze question generation system which focuses on distractor generation, using search engines to automatically measure English proficiency. In the same research line, we also find the work of Lee and Seneff (2007), Lin et al. (2007) and Pino et al. (2008). In this context, the work of Goto et al. (2009) probably represents the first effort in applying machine learning techniques to multiple-choice cloze question generation. The authors propose an approach that uses conditional random fields (Lafferty et al., 2001) based on hand-crafted features such as word POS tags.

More recent approaches also focus on the problem of distractor selection or generation, but apply it to different domains. For example, Narendra and Agarwal (2013) present a system which adopts a semi-structured approach to generate CQs by making use of a knowledge base extracted from a Cricket portal. On the other hand, Lin et al. (2015) present a generic semi-automatic system for quiz generation using linked data and textual descriptions of RDF resources. The system seems to be the first that can be controlled by difficulty level. The authors tested it using an on-line dataset about wildlife provided by the BBC. Kumar et al. (2015) present an automatic approach to CQ generation for student self-assessment.

Finally, the work of Sakaguchi et al. (2013) presents a discriminative approach based on SVM classifiers for distractor generation and selection using a large-scale language learners' corpus. The SVM classifier works at the word level and takes a sentence in which the target word appears, choosing a verb as the best distractor given the context. Again, the SVM is based on human-engineered features such as n-grams, lemmas and dependency tags.

Compared to the approaches above, our take is different, since we work on fill-in-the-blank question generation without multiple-choice answers. Therefore, our problem focuses on word selection —the word to blank— given a sentence, rather than on distractor generation. To the best of our knowledge, our system is also the first to use representation learning for this task.

3 Proposed Approach

We formalize the problem of automatic fill-in-the-blank quiz generation using two different perspectives. These are designed to match specific machine learning schemes that are well-defined in the literature. In both cases, we consider a training corpus of N pairs (S_n, C_n), n = 1, ..., N, where S_n = s_1, ..., s_L(S_n) is a sequence of L(S_n) tokens and C_n ∈ [1, L(S_n)] is an index that indicates the position that should be blanked inside S_n.

This setting allows us to train from examples of single blank-annotated sentences. In this way, in order to obtain a sentence with several blanks, multiple passes over the model are required. This approach works in a way analogous to humans, where blanks are provided one at a time.

3.1 AQG as Sequence Labeling

Firstly, we model AQG as a sequence labeling problem.
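The (S_n, C_n) pairs of the formalization above map directly to per-token binary training labels. A minimal sketch in plain Python (the helper name is ours, for illustration only):

```python
def make_label_sequence(tokens, blank_index):
    """Build the binary label sequence Y_n for a sentence S_n:
    y_i = 1 only at the 1-based blanked position C_n, else 0."""
    if not 1 <= blank_index <= len(tokens):
        raise ValueError("blank index must lie in [1, L(S_n)]")
    return [1 if i == blank_index else 0 for i in range(1, len(tokens) + 1)]

# A single training pair (S_n, C_n): blank the 4th token.
tokens = ["The", "dog", "is", "barking"]
labels = make_label_sequence(tokens, 4)
# labels == [0, 0, 0, 1]
```

Exactly one position carries the positive class, which is why the paper later evaluates this scheme with precision/recall over the positive class rather than raw accuracy.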
Formally, for an embedded input sequence S_n = s_1, ..., s_L(n), we build the corresponding label sequence by simply creating a one-hot vector of size L(S_n) for the given class C_n. This vector can be seen as a sequence of binary classes, Y_n = y_1, ..., y_L(n), where only one item (the one in position C_n) belongs to the positive class. Given this setting, the conditional probability of an output label is modeled as follows:

  p(y | s) ∝ Π_{i=1..n} ŷ_i                                  (1)
  ŷ_i = H(y_{i−1}, y_i, s_i)                                 (2)

where, in our case, the function H is modeled using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). Each predicted label distribution ŷ_i is then calculated using the following formulas:

  →h_i = LSTM_fw(→h_{i−1}, x_i)                              (3)
  ←h_i = LSTM_bw(←h_{i+1}, x_i)                              (4)
  ŷ_i = softmax([→h_i ; ←h_i])                               (5)

The loss function is the average cross entropy for the mini-batch:

  L(θ) = −(1/n) Σ_{i=1..n} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]   (6)

Figure 1 summarizes the proposed model.

[Figure 1: Our sequence labeling model based on an LSTM for AQG. The diagram shows the input "The dog is barking" feeding hidden states h(1)–h(4), which emit the label sequence O O O BLANK.]

3.2 AQG as Sequence Classification

In this case, since the output of the model is a position in the input sequence S_n, the size of the output dictionary for C_n is variable and depends on S_n. Regular sequence classification models use a softmax distribution over a fixed output dictionary to compute p(C_n | S_n), and are therefore not suitable for our case. We thus propose to use an attention-based approach that allows us to have a variable-size dictionary for the output softmax, in a way akin to Pointer Networks (Vinyals et al., 2015). More formally, given an embedded input vector sequence S_n = s_1, ..., s_L(n), we use a bidirectional LSTM to first obtain a dense representation of each input token:

  →h_i = LSTM_fw(→h_{i−1}, x_i)                              (7)
  ←h_i = LSTM_bw(←h_{i+1}, x_i)                              (8)
  h_i = [→h_i ; ←h_i]                                        (9)

We later use pooling techniques, including max and mean, to obtain a summarized representation h̄ of the input sequence, or simply take the last hidden state as a drop-in replacement. After this, we add a global content-based attention layer, which we use to compare that summarized vector to each hidden state h_i. Concretely,

  u_i = v⊤ W [h_i ; h̄]                                       (10)
  p(C_n | S_n) = softmax(u)                                  (11)

where W and v are learnable parameters of the model, and the softmax normalizes the vector u to be an output distribution over a dictionary of size L(S_n). Then, for a given sentence S_k, the goal of our model is to predict the most likely position C* ∈ [1, L(S_k)] of the next word to be blanked. Figure 2 summarizes the proposed model graphically.

[Figure 2: Our sequence classification model, based on an LSTM for AQG. The diagram shows the same input feeding hidden states h(1)–h(4) and attention weights A(1)–A(4), from which the BLANK position is selected.]

4 Empirical Study

Although the hand-crafted rule-based system currently used in our language learning platform offers good results in general, we are interested in developing a more flexible approach that is easier to tailor depending on the case. In particular, in an adaptive learning setting, where the goal is resource allocation according to the unique needs of each learner, rule-based methods for AQG appear to have insufficient flexibility and adaptability to accurately model the features of each learner or teacher.

With this point in mind, this section presents an empirical study using state-of-the-art deep learning approaches for the problem of AQG. In particular, the objective is to test to what extent our proposed models are able to encode the behavior of the rule-based system. Ultimately, we hope that these can be used for a smooth transition from the current human-engineered feature-based system to a fully user-experience-based regime.

In Natural Language Processing, deep models have succeeded in large part because they learn and use their own continuous numeric representational systems for words and sentences. In particular, distributed representations (Hinton, 1984) applied to words (Mikolov et al., 2013) have meant a major breakthrough. All our models start with random word embeddings; we leave the usage of other pre-trained vectors for future work.

Using our platform, we extracted anonymized user interaction data in the manner of real quizzes generated for a collection of several input video sources. We obtained a corpus of approximately
Even lion single-quiz question training examples were with these limits, convergence is faster than in the derived. We split this dataset using the regular previous model, so we only trained the the classi- 70/10/20 partition for training, validation and test- fier for up to 5 epochs. Again we use a word em- ing. bedding and hidden state of 300, and add dropout As the system required the input sentences to be with drop probability of 0.2 before and after the tokenized and makes use of features such as word LSTM. Our results for different pooling strategies pos-tags and such, the sentences in our dataset are showed no noticeable performance difference in processed using CoreNLP (Manning et al., 2014). preliminary experiments, so we report results us- We also extract user-specific and quiz-specific in- ing the last hidden state. formation, including word-level learning records For development and evaluation we used accu- of the user, such as the number of times the learner racy over the validation and test set, respectively. made a mistake on that word, or whether the Table 2 below summarizes our obtained result, we learner looked up the word in the dictionary. In can see that model was able to obtain a maximum this study, however, we restrain our model to only accuracy of approximately 89% on the validation look at word embeddings as input. and testing sets. We use the same data pre-processing for all of our models. We build the vocabulary using the Set Loss Accuracy train partition of our dataset with a minimum fre- Valid 101.80 89.17 quency of 1. We do not keep cases and obtain an Test 102.30 89.31 unknown vocabulary of size 2,029, and a total vo- Table 2: Results of the seq. classification ap- cabulary size of 66,431 tokens. proach. 
4.1 Sequence Labeling We use a 2-layer bidirectional LSTM, which we 5 Conclusions train using Adam Kingma and Ba (2014) with a learning rate of 0.001, clipping the gradient of our In this paper we have formalized the problem of parameters to a maximum norm of 5. We use automatic fill-on-the-blanks quiz generation us- a word embedding size and hidden state size of ing two well-defined learning schemes: sequence 300 and add dropout (Srivastava et al., 2014) be- classification and sequence labeling. We have also fore and after the LSTM, using a drop probability proposed concrete architectures based on LSTMs of 0.2. We train our model for up to 10 epochs. to tackle the problem in both cases. Training lasts for about 3 hours. We have presented an empirical study in which we test the proposed architectures in the context For evaluation, as accuracy would be extremely of a language learning platform. Our results show unbalanced given the nature of the blanking that both the0 proposed training schemes seem scheme —there is only one positive-class example to offer fairly good results, with an Accuracy/F1- on each sentence— we use Precision, Recall and score of nearly 90%. We think this sets a clear F1-Score over the positive class for development future research direction, showing that it is possi- and evaluation. Table 1 summarizes our obtained ble to transition from a heavily hand-crafted ap- results. proach for AQG to a learning-based approach on Set Loss Prec. Recall F1-Score the base of examples derived from the platform on Valid 0.0037 88.35 88.81 88.58 unlabeled data. This is specially important in the Test 0.0037 88.56 88.34 88.80 context of adaptive learning, where the goal is to effectively provide an tailored and flexible experi- Table 1: Results of the seq. labeling approach. 
ence in terms of style and difficulty For future work, we would like to use differ- ent pre-trained word embeddings as well as other 4.2 Sequence Classification features derived from the input sentence to further In this case, we again use use a 2-layer bidirec- improve our results. We would also like to test tional LSTM, which we train using Adam with a the power of the models in capturing different quiz learning rate of 0.001, also clipping the gradient styles from real questions created by professors. 155 References Annamaneni Narendra and Manish Agarwal. 2013. Automatic cloze-questions generation. In Proceed- Takuya Goto, Tomoko Kojiri, Toyohide Watanabe, To- ings of the International Conference Recent Ad- moharu Iwata, and Takeshi Yamada. 2009. An au- vances in Natural Language Processing RANLP tomatic generation of multiple-choice cloze ques- 2013, pages 511–515. tions based on statistical learning. In Proceedings of the 17th International Conference on Computers Yuichi Ono and Ai Nakajima. Automatic quiz genera- in Education, pages 415–422. Asia-Pacific Society tor and use of open educational web videos for en- for Computers in Education. glish as general academic purpose. In Proceedings of the 23rd International Conference on Computers Geoffrey E Hinton. 1984. Distributed representations. in Education, pages 559–568. Asia-Pacific Society for Computers in Education. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, Yuichi Ono, Ai Nakajima, and Manabu Ishihara. 2017. 9(8):1735–1780. Motivational effects of a game-based automatic quiz generator using online educational resources for Diederik P. Kingma and Jimmy Ba. 2014. Adam: A japanese efl learners. In Society for Information method for stochastic optimization. CoRR. Technology and Teacher Education International G. Kumar, R. E. Banchs, and L. F. D’Haro. 2015. Au- Conference. 
tomatic fill-the-blank question generator for student Juan Pino, Michael Heilman, and Maxine Eskenazi. self-assessment. In 2015 IEEE Frontiers in Educa- 2008. A selection strategy to improve cloze ques- tion Conference (FIE), pages 1–3. tion quality. In Proceedings of the Workshop on In- John D. Lafferty, Andrew McCallum, and Fernando telligent Tutoring Systems for Ill-Defined Domains. C. N. Pereira. 2001. Conditional Random Fields: 9th International Conference on Intelligent Tutoring Probabilistic Models for Segmenting and Label- Systems, Montreal, Canada, pages 22–32. ing Sequence Data. In Proceedings of the Eigh- Keisuke Sakaguchi, Yuki Arase, and Mamoru Ko- teenth International Conference on Machine Learn- machi. 2013. Discriminative approach to fill-in-the- ing, ICML ’01, pages 282–289, San Francisco, CA, blank quiz generation for language learners. In Pro- USA. Morgan Kaufmann Publishers Inc. ceedings of the 51st Annual Meeting of the Associa- John Lee and Stephanie Seneff. 2007. Automatic gen- tion for Computational Linguistics (Volume 2: Short eration of cloze items for prepositions. In Eighth Papers), volume 2, pages 238–242. Annual Conference of the International Speech Simon Smith, P. V. S. Avinesh, and Adam Kilgar- Communication Association. riff. 2010. Gap-fill Tests for Language Learners: Chenghua Lin, Dong Liu, Wei Pang, and Zhe Wang. Corpus-Driven Item Generation. 2015. Sherlock: A Semi-automatic Framework for Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Quiz Generation Using a Hybrid Semantic Similar- Ilya Sutskever, and Ruslan Salakhutdinov. 2014. ity Measure. Cognitive Computation, 7(6):667–679. Dropout: A simple way to prevent neural networks Yi-Chien Lin, Li-Chun Sung, and Meng Chang Chen. from overfitting. The Journal of Machine Learning 2007. An automatic multiple-choice question gen- Research, 15(1):1929–1958. eration scheme for english adjective understanding. 
Eiichiro Sumita, Fumiaki Sugaya, and Seiichi Ya- In Workshop on Modeling, Management and Gener- mamoto. 2005. Measuring Non-native Speakers’ ation of Problems/Questions in eLearning, the 15th Proficiency of English by Using a Test with International Conference on Computers in Educa- Automatically-generated Fill-in-the-blank Ques- tion (ICCE 2007), pages 137–142. tions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, Christopher D. Manning, Mihai Surdeanu, John Bauer, EdAppsNLP 05, pages 61–68, Stroudsburg, PA, Jenny Finkel, Steven J. Bethard, and David Mc- USA. Association for Computational Linguistics. Closky. 2014. The Stanford CoreNLP natural lan- guage processing toolkit. In Association for Compu- Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. tational Linguistics (ACL) System Demonstrations. 2015. Pointer networks. In Advances in Neural In- formation Processing Systems, pages 2692–2700. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed Representa- tions of Words and Phrases and their Composition- ality. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc. Ai Nakajima and Kiyoshi Tomimatsu. 2013. New po- tential of e-learning by re-utilizing open content on- line. In International Conference on Human Inter- face and the Management of Information. 156
