Indosum_A_New_Benchmark_Dataset_for_Indonesian_Text_Summarization.pdf

I NDO S UM: A New Benchmark Dataset for Indonesian Text Summarization Kemal Kurniawan Samuel Louvan Kata Research Team Fondazione Bruno Kessler...

I NDO S UM: A New Kemal Kurniawan Kata Research Team Kata.ai Jakarta, Indonesia [email protected] [email protected] dataset publicly available. In short, the contribution of this work is two-fold: dataset that 1) I NDO S UM, a large dataset for text summarization in Indonesian that is compiled from online news articles and publicly available. 2) Evaluation of state-of-the-art extractive summariza- tion methods on the dataset using ROUGE as the the previous standard metric for text summarization. The state-of-the-art result on the dataset, although impres- sive, is still significantly lower than the maximum possible ROUGE score. This result suggests that the dataset is suf- online under ficiently challenging to be used as evaluation benchmark for future research on Indonesian text summarization. Indonesian; II. R ELATED WORK Fachrurrozi et al. proposed some scoring methods and used them with TF-IDF to rank and summarize news articles. Another work used latent Dirichlet allocation coupled with genetic algorithm to produce summaries for online news articles. Simple methods like naive Bayes has also been used for Indonesian news summarization , although for English, naive Bayes has been used almost two decades earlier. A more recent work employed a summarization algorithm called TextTeaser with some predefined features for news articles as well. Slamet et al. used TF-IDF to convert sentences into vectors, and their similarities are then computed against another vector obtained from some keywords. They used these similarity scores to extract important sentences as the summary. Unfortunately, all these work do not seem to be evaluated using ROUGE, despite being the standard metric for text summarization research. An example of Indonesian text summarization research which used ROUGE is. They employed the best method on TAC 2011 competition for news dataset and achieved ROUGE-2 scores that are close to that of hu- mans. However, their dataset consists of only 56 articles which is very small, and the dataset is not available publicly. An attempt to make a public summarization dataset has been done in. They compiled a chat dataset along with its summary, which has both the extractive and abstractive versions. This work is a good step toward standardizing summarization research for Indonesian. However, to the best of our knowledge, for news dataset, there has not been a publicly available dataset, let alone a standard. 215 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply. Bond bocor film ini, agen rahasia 007 berhenti menjadi mata-mata Inggris demi menikah dengan ::::::::::::::::::::::::::::::::::::: jatuh cinta dan menikah dengan perempuan yang dicintai,” tutur seorang sumber yang PageSix.com. Madeleine Swann yang diperankan oleh Lea Seydoux. sekuel Spectre pada 2015 silam. yang bercerita pernikahan James Bond sejak 1969. ::::::::::::: Bond menikahi Tracy Draco yang diperankan Diana Rigg. ::::::::::::::::::: lama setelah Daniel Craig mengumumkan bakal kembali memerankan tokoh agen 007. film ini, agen rahasia 007 berhenti menjadi mata-mata Inggris demi menikah dengan yang bercerita pernikahan James Bond sejak 1969. Bond menikahi Tracy Draco. story was leaked movie production, the secret agent 007 stopped being an English spy to marry a woman whom :::::::::::::::::::::::::::::::::::::::: fell in love and married a woman that he loved,” said a source who is close to the production Madeleine Swann who is played by Lea Seydoux. the sequel Spectre in 2015. that tells about James Bond’s marriage since 1969. ::::::::::: James Bond married Tracy Draco who was played by Diana Rigg. :::::::::::::::::::::::: movie was leaked not long after Daniel Craig announced that he would play agent 007 production, the secret agent 007 stopped being an English spy to marry a woman whom that tells about James Bond’s marriage since 1969. James Bond marries Tracy Draco. and their English translations. Underlined sentences are the extractive summary obtained by concert, list of award nominations, and so on. Since such a list is never included in the summary, we truncated such articles so that the number of paragraphs are at most two standard deviations away from the mean.3 For each fold, the mean and standard deviation were estimated from the training set. We discarded articles whose summary is too long since we do not want lengthy summaries anyway. The cutoff length is defined by the upper limit of the Tukey’s boxplot, where for each fold, the quartiles were estimated from the training set. After removing such articles, we ended up with roughly 19K articles in total. The complete statistics of the corpus is shown in Table I. Since the gold summaries provided by Shortir are abstractive, we needed to label the sentences in the article for training the supervised extractive summarizers. We fol- lowed Nallapati et al. to make these labeled sentences (called oracles hereinafter) using their greedy algorithm. The idea is to maximize the ROUGE score between the labeled sentences and the abstractive gold summary. Although the provided gold summaries are abstractive, in this work we focused on extractive summarization because we think research on this area are more mature, especially for Indonesian, and thus starting with extractive summarization is a logical first step toward standardizing Indonesian text summarization research. 3 We assume the number of paragraphs exhibits a Gaussian distribution. Conference on Asian Language Processing (IALP) 216 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply. Table I C ORPUS STATISTICS Fold 2 Fold 3 Fold 4 Fold 5 test train dev test train dev test train dev test train dev test 749 3762 14290 747 3737 14272 750 3752 14266 747 3761 10.49 10.83 10.47 10.47 10.57 10.61 10.52 10.37 10.49 10.49 10.23 10.54 1.75 1.75 1.75 1.75 1.74 1.73 1.74 1.73 1.77 1.75 1.79 1.74 18.87 18.71 19.00 18.89 18.95 18.90 18.88 19.27 18.82 18.92 18.81 18.82 3.47 3.50 3.47 3.48 3.44 3.46 3.48 3.40 3.48 3.47 3.54 3.48 19.60 19.54 19.58 19.57 19.77 19.65 19.58 19.92 19.60 19.63 19.05 19.57 4) T EXT R ANK, which is very similar to L EX R ANK but computes sentence similarity based on the number of common tokens. For the non-neural supervised methods, we compared: text 1) BAYES, which represents each sentence as a feature vector and uses naive Bayes to classify them. Four features are used: whether the sentence has less than 5 words, whether the sentence contains signature words, its position in the document, and its position in the paragraph. To obtain the signature words, TF-IDF are used. The original paper com- putes TF-IDF score on multi-word tokens that are identified automatically using mutual information. We did not do this identification, so our TF-IDF computation operates on word tokens. 2) H MM, which uses hidden Markov model where states correspond to whether the sentence should be extracted. Gaussian distribution is used as the emission probability distribution, where each sentence is represented as a feature vector. Four features are used: its position in the paragraph, number of terms, sum of probability of terms in the document, and sum of probability of terms in a baseline document. We used a precomputed TF table for the last feature. The original work uses QR decomposition for sentence selection but our implementation does not. We simply ranked the sentences by their scores and picked the top 3 as the summary. 3) M AX E NT, which represents each sentence as a fea- ture vector and leverages maximum entropy model to compute the probability of a sentence should be extracted. Several features are used: word pairs, sentence length, previous sentence length, sentence position, and whether the sentence is at the start of a paragraph. The original approach puts a prior distribution over the labels but we put the prior on the weights instead. Our implementation still agrees with the original because we employed a bias feature which should be able to learn the prior label distribution. subsumption As for the neural supervised method, we evaluated the paper N EURAL S UM using the original implementation by an approx- the authors.6 We modified their implementation slightly to overlap allow for evaluating the model with ROUGE. Note that all the methods are extractive. Our implementation code for 6 https://github.com/cheng6076/NeuralSum Conference on Asian Language Processing (IALP) 217 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply. every N EURAL S UM scenario scores are still considerably lower than O RACLE, hinting that it can be improved further. Moreover, initializing with FAST T EXT pre-trained embedding slightly lowers the scores, although they are still within one standard deviation. This finding suggests that the effect of FAST T EXT pre-trained embedding is unclear for our case. B. Out-of-domain results Since Indonesian is a low-resource language, collecting in-domain dataset for any task (including summarization) can be difficult. Therefore, we experimented with out-of- domain scenario to see if N EURAL S UM can be used easily for a new use case for which the dataset is scarce or non- existent. Concretely, we trained the best N EURAL S UM (with word embedding size of 300) on articles belonging to category c1 and evaluated its performance on articles belonging to category c2 for all categories c1 and c2. As we have a total of 6 categories, we have 36 domain pairs to experiment on. To reduce computational cost, we used only the articles from the first fold and did not tune any hyperparameters. We note that this decision might undermine the generalizability of conclusions drawn from these out-of-domain experiments. Nonetheless, we feel that the results can still be a useful guidance for future work. As comparisons, we also evaluated L EAD -3, O RACLE, and the best unsupervised method, L EX R ANK. For L EX R ANK, we used the best hyperparameter that we found in the previous experiment for the first fold. We only report the ROUGE-1 scores. Table III shows the result of this experiment. We see that almost all the results outperform the L EAD -3 baseline, which means that for out-of-domain cases, N EURAL S UM can summarize not just by selecting some leading sentences from the original text. Almost all N EURAL S UM results also outperform L EX R ANK, sug- gesting that when there is no in-domain training data, training N EURAL S UM on out-of-domain data may yield better performance than using an unsupervised model like L EX R ANK. Looking at the best results, we observe that they all are the out-of-domain cases. In other words, training on out-of-domain data is surprisingly better than on in-domain data. For example, for Sport as the target domain, the best model is trained on Headline as the source domain. In fact, using Headline as the source domain yields the best result in 3 out of 6 target domains. We suspect that this phenomenon is because of the simi- larity between the corpus of the two domain. Specifically, training on Headline yields the best result most of the time because news from any domain can be headlines. Further investigation on this issue might leverage domain similarity metrics proposed in. Next, comparing the best N EURAL S UM performance on each target domain to O RACLE, we still see quite a large gap. This gap hints that N EURAL S UM can still be improved further, probably by lifting the limitations of our experiment setup (e.g., tuning the hyperparameters for each domain pair). Conference on Asian Language Processing (IALP) 218 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply. Table II ROUGE-1, ROUGE-2, AND ROUGE-L, AVERAGED OVER 5 FOLDS Method R-1 R-2 R-L 79.27 (0.25) 72.52 (0.35) 78.82 (0.28) 62.86 (0.34) 54.50 (0.41) 62.10 (0.37) 35.96 (0.18) 20.19 (0.31) 33.77 (0.18) 41.37 (0.19) 28.43 (0.25) 39.64 (0.19) 62.86 (0.35) 54.44 (0.44) 62.10 (0.37) 42.87 (0.29) 29.02 (0.35) 41.01 (0.31) 62.70 (0.39) 54.32 (0.46) 61.93 (0.41) 17.62 (0.11) 4.70 (0.11) 15.89 (0.11) 50.94 (0.42) 44.33 (0.50) 50.26 (0.44) 67.60 (1.25) 61.16 (1.53) 66.86 (1.30) N EURAL S UM 300 emb. size 67.96 (0.46) 61.65 (0.48) 67.24 (0.47) N EURAL S UM + FAST T EXT 67.78 (0.69) 61.37 (0.93) 67.05 (0.72) Table III SCORE OF ROUGE-1 FOR THE OUT- OF - DOMAIN EXPERIMENT Target dom. Entertainment Inspiration Sport Showbiz Headline Tech 75.59 81.19 77.65 78.33 80.52 80.09 51.27 52.12 67.56 65.05 65.21 50.01 51.41 50.78 67.52 65.01 65.19 50.01 52.51 53.15 72.51 67.01 67.63 51.81 52.51 52.71 72.51 67.01 68.02 51.67 52.41 53.85 72.51 66.62 68.48 50.89 53.65 49.86 72.51 67.81 70.88 51.22 52.80 55.07 72.53 67.17 71.59 50.92 50.39 47.93 62.43 56.93 63.44 48.00 approach can also be interesting directions for future avenue. Other tasks such as further investigation on the out-of-domain issue, human evaluation, or even extending the corpus to include more than one summary per article are worth exploring as well. ACKNOWLEDGMENT Our We thank anonymous reviewers for their helpful feed- back. We acknowledge the support from Shortir and Tempo. Lastly, we also thank Muhammad Pratikto and Ahmad Rizqi Meydiarso for their relentless support. R EFERENCES consists D. Das and A. F. Martins, “A survey on automatic text summarization,” Literature Survey for the Language and Statistics II course at CMU, vol. 4, pp. 192–195, 2007. C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text Summarization Branches out: Proceed- ings of the ACL-04 Workshop, vol. 8. Barcelona, Spain, 2004. A. Najibullah, “Indonesian Text Summarization based on Naı̈ve Bayes Method,” in Proceeding of the Interna- tional Seminar and Conference 2015: The Golden Triangle (Indonesia-India-Tiongkok) Interrelations in Religion, Sci- ence, Culture, and Economic, Semarang, Indonesia, 2015, p. 12. Conference on Asian Language Processing (IALP) 219 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply. extraction, utility-based evaluation, and user studies,” in Proceedings of the 2000 NAACL-ANLP Workshop on Au- tomatic Summarization. Association for Computational Linguistics, 2000, pp. 21–30. R. Mihalcea and P. Tarau, “Textrank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004. J. Conroy and D. O’Leary, “Text summarization via hidden Markov model and pivoted QR matrix decomposition,” 2001. M. Osborne, “Using maximum entropy for sentence ex- traction,” in Proceedings of the Workshop on Automatic Summarization (Including DUC 2002). Philadelphia: Association for Computational Linguistics, Jul. 2002. F. Tala, J. Kamps, K. E. Müller, and R. de M, “The impact of stemming on information retrieval in Bahasa Indonesia,” Studia Logica - An International Journal for Symbolic Logic - SLOGICA, Jan. 2003. C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Compu- tational Linguistics, 2003, pp. 71–78. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “En- riching word vectors with subword information,” arXiv preprint arXiv:1607.04606, 2016. S. Ruder and B. Plank, “Learning to select data for transfer learning with Bayesian Optimization,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Jul. 2017, pp. 372–382. S. Narayan, N. Papasarantopoulos, M. Lapata, and S. B. Cohen, “Neural Extractive Summarization with Side Infor- mation,” CoRR, vol. abs/1704.04530, 2017. A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in Proceed- ings of the 2015 Conference on Empirical Methods in Nat- ural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015, pp. 379–389. R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: SIGNLL, 2016. A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in Pro- ceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Van- couver, Canada: Association for Computational Linguistics, 2017, pp. 1073–1083. R. Paulus, C. Xiong, and R. Socher, “A Deep Reinforced Model for Abstractive Summarization,” arXiv:1705.04304 [cs], May 2017. 6, pp. 1606–1618, using in Pro- ACM SIGIR Information semantic analysis in Proc. lexical Journal of pp. 457–479, 2004. Sentence Conference on Asian Language Processing (IALP) 220 Negeri Semarang. Downloaded on June 11,2024 at 06:57:00 UTC from IEEE Xplore. Restrictions apply.

Indosum_A_New_Benchmark_Dataset_for_Indonesian_Text_Summarization.pdf

Document Details

Related

Full Transcript

Upgrade to continue