KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 16, NO. 6, Jun. 2022 1778
Copyright ⓒ 2022 KSII

An Efficient Machine Learning-based Text Summarization in the Malayalam Language

Rosna P Haroon1*, Abdul Gafur M2, Barakkath Nisha U3
1 Assistant Professor, Department of CSE, Ilahia College of Engineering and Technology, APJ Abdul Kalam Technological University, Kerala, India.
2 Professor & Principal, Ilahia College of Engineering and Technology, APJ Abdul Kalam Technological University, Kerala, India.
3 Associate Professor, Department of IT, Sri Krishna College of Engineering and Technology, Tamilnadu, India.
*Corresponding author: Rosna P Haroon

Received January 29, 2022; revised March 24, 2022; revised April 23, 2022; accepted May 11, 2022; published June 30, 2022

Abstract

Automatic text summarization is a procedure that condenses a large body of text into a shorter version that retains the significant information. Malayalam is one of the more complex languages spoken in parts of India, most commonly in Kerala and Lakshadweep. Natural language processing work in Malayalam is relatively scarce because of the complexity of the language and the shortage of available resources. In this paper, we propose an approach to text summarization of Malayalam documents that trains a model based on the Support Vector Machine classification algorithm. Different features of the text are taken into account when training the machine so that the system can output the most important information from the input text. The classifier sorts sentences into separate classes (most important, important, average, and least significant), and based on this the machine creates a summary of the input document. The user can select a compression ratio, and the system outputs a summary of that fraction of the document.
The model's performance is measured using Malayalam documents of different genres as well as documents from the same domain. The model is evaluated with the content evaluation measures precision, recall, F-score, and relative utility. The precision and recall values obtained indicate that the model is reliable and more relevant than the other summarizers it was compared with.

Keywords: Malayalam Text Summarization, Supervised Machine Learning, SVM, Text Mining, Sentence Extraction, Summary Generation.

http://doi.org/10.3837/tiis.2022.06.001
ISSN : 1976-7277

1. Introduction

Summarization plays an important role in day-to-day life. Since most of us lead busy, tightly scheduled lives, we want the important information at our fingertips without reading a very long text. With the introduction of artificial intelligence, computers can automate many human activities, which reduces the time people have to spend on them. Summarization is one such task. Because summarization deals with language, it falls under the subarea of AI known as Natural Language Processing (NLP). English is widely accepted as a universal language for communication, so the majority of NLP work focuses on English, and datasets and corpora are more readily available for it. This is not the case for other languages.
Here we have implemented a learning-based extractive summarizer for the Malayalam language that provides better precision and recall rates than the other summarizers implemented so far. Moreover, a trained summarizer of this kind is lacking for Malayalam, which is where the importance of our work lies. Even though technology has advanced and everything is available at our fingertips within seconds, it has still not reached the common people in our state because of the language gap. This is one of the main motivations behind the topic.

Text summarization is one of the predominant applications of natural language processing: the system finds the gist of the text in a document. It can be performed in two main ways. In extractive summarization, a shortened version of the document is obtained by picking its most important sentences. In abstractive summarization, the sentences are regenerated with the help of paraphrasing and natural language generation techniques. The following example illustrates the difference between extractive and abstractive summarization:

Fig. 1. Extractive and Abstractive summary of a sample text.

As seen in the above example, abstractive summarization produces a newly worded sentence that conveys the important information in the given paragraph, whereas extractive summarization produces a summary by extracting the important sentences from the paragraph. The majority of research works in Malayalam adopt extractive rather than abstractive summarization. Several approaches, including statistical scoring and semantic graphs, are used there to find the most important sentences in the document.
A machine learning-based algorithm is lacking in Malayalam text summarization, and we address this gap with a supervised machine learning technique. The major contributions of this paper are:
• An efficient extractive summarizer for the Malayalam language.
• A machine learning-based extractive summarizer.
• An extractive summarizer with better precision and recall rates.

Here we implement Support Vector Machine (SVM)-based learning to perform the summarization process. The Support Vector Machine is a supervised machine learning classifier that has proved efficient in many classification tasks. SVMs can be categorized as linear or nonlinear according to how the hyperplane separates the data: if the data can be separated by a straight line the SVM is linear; otherwise it is nonlinear.

2. Related Work

An extensive literature on text summarization using machine learning approaches is available for the English language. For a language like Malayalam, however, the methods developed for other languages cannot be applied directly, because there are many syntactic differences between the languages. From the preprocessing phase to the final step, the task is more complicated for languages like Malayalam. Here we review machine learning-based text summarizers in languages other than Malayalam, along with some text summarizers available for Malayalam.

Joel Larocca Neto et al., in their paper “Automatic text summarization using machine learning approach”, propose an ML-based classifier for the English language that incorporates features such as mean Term Frequency-Inverse Sentence Frequency (TF-ISF), sentence length, and sentence position. They employed two classification algorithms, Naïve Bayes and C4.5, for training. In their comparison, Naïve Bayes produced better results across compression rates, while the C4.5 predictions were poor.
Nikitha Desai and Pranchi Shah implemented a supervised machine learning model for the Hindi language and analyzed the summarizer under different experimental setups. Accuracy was calculated for different combinations of the selected feature vectors, and the system shows an average accuracy of 72% when more features are included in the feature vector. The accuracy of the system increases as the number of features considered for summarization increases.

Chintan Shah and Anjali Jivani, in “An Automatic Text Summarization on Naive Bayes Classifier Using Latent Semantic Analysis”, describe a summarizer based on latent semantic analysis and trained with a Naïve Bayes classifier. The semantic similarity between text fragments is measured using Latent Semantic Analysis. They use statistical methods such as SVD (Singular Value Decomposition) to show the relationship among words and sentences. Important concepts are selected from the SVM model by recursive feature elimination, and the concepts are ranked by their order of elimination. The model is trained using a Naïve Bayes classifier.

Nedunchelian Ramanujan et al. proposed a timestamp-based approach with a Naïve Bayes classifier for multi-document summarization. A value is assigned to each sentence according to its chronological position in the document, and this value is taken as the timestamp. A number of relevant sentences are selected for the summary based on the scores obtained from the chosen features, and they are then ordered by timestamp. This results in an ordered and coherent summary. They also carried out a comparative study of the proposed methods, including the timestamp approach, on the MEAD platform.
In another work, the authors implemented an extractive text summarizer using a deep learning modified neural network classifier. An entropy value is calculated for each relevant feature, and the values are classified into two classes: highest entropy and lowest entropy. Sentences that fall into the highest entropy class are included in the summary output. The dataset used for the performance analysis is the Document Understanding Conference (DUC) dataset, and the performance varies with the size of the files used. The results show that this method achieves a higher accuracy rate than other Artificial Neural Network schemes. Different machine learning approaches for text summarization have also been surveyed, with the methodologies used so far listed in tabular form along with the datasets used and remarks.

For Malayalam documents, summarization works are very few, and the implemented works have focused on statistical scoring and graph-based approaches. The vector space model for a Malayalam summarizer is a statistical method for extractive summarization that prioritizes sentences with the help of cosine similarity; the highest-scoring sentences are selected for the summary. A graph-based method for Malayalam documents has been proposed in which sentences are represented as nodes and weights are calculated using similarity measures. The minimum spanning tree Malayalam summarizer creates a semantic graph from the input document and then reduces the graph using the minimum spanning tree concept by creating repetitive subgraphs. A clustering technique using self-organizing maps (SOM) has also been implemented for Malayalam summarization: an extractive summarizer scores sentences using relevance analysis and context-aware measures and forms clusters using SOM. Relevant sentences are selected from the clusters using the algorithm proposed by the researcher.
Both theoretical and practical evaluations are carried out in this method to check the accuracy of the model.

Evaluation of a text summarizer is also important in determining the accuracy of the generated output. There are intrinsic and extrinsic measures for summarization. Text quality evaluation and content evaluation are intrinsic, while task-based evaluation schemes such as question answering and information retrieval are extrinsic. Text quality evaluation considers quality in terms of grammar, non-redundancy, referential clarity, structure, and coherence. For a summarizer system, the most important evaluation is content evaluation. Measures such as precision, recall, F-score, relative utility, and ROUGE N-gram matching are the ones researchers use most often to evaluate their systems. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a frequently used evaluation strategy in which consecutive tokens are compared. The ROUGE score is computed from the overlap of N-grams between the human-written summary and the system-generated summary; the higher the overlap, the higher the score. Here N may be 1, 2, or more, giving the measures ROUGE-1, ROUGE-2, and so on. ROUGE is not well suited to abstractive summarizers, however, since it does not consider semantic meaning or factual accuracy. Besides N-grams, alternatives such as the longest common subsequence can also be used in ROUGE evaluation.

A multi-document summarization system that combines statistical score features with a modified page ranking algorithm has been proposed. After the summary of each document is obtained, it is subjected to Maximum Marginal Relevance to produce the final summary. An abstractive summarizer for Malayalam based on an attention mechanism has also been proposed.
It produces regenerated sentences in the summary, but it does not support long-range dependencies between sentences. A robust document similarity metric has been proposed and used for clustering documents; as a similarity measure for documents, it may also contribute to summarization work. A three-way clustering scheme has been used to find the relationship between data items and clusters, and the same paper also suggests a multi-view clustering technique that customizes the K-means algorithm. Another work describes a multi-view data clustering scheme based on non-negative matrix factorization and proposes a solution from diverse views that preserves the geometrical structure of the data.

From the literature, it is evident that no trained model has been tried for Malayalam extractive text summarization. The proposed model focuses on such a trained model, using an SVM classification algorithm to select the prioritized sentences for the summary output. The related work indicates that the support vector machine provides better performance than other classification algorithms. Relative utility is also considered here as an evaluation measure to determine the accuracy of the output, which other Malayalam NLP researchers have not done.

Table 1. Summary of Text summarization Papers in Literature Review

Methodology Proposed | Datasets Used | Measurement Metric Used
Machine learning based method (Naïve Bayes, SVM) for English/Hindi language | Manual (200 Documents); Hindi News domain (130 articles); Manual (Text corpus from different articles, 10 Nos); Manual (20 Documents) | Precision and Recall; Accuracy by counting correctly classified sentence; ROUGE 2.0 Evaluation kit
Sentence Ranking Method | 50 selected News Articles in Malayalam | ROUGE-1 and ROUGE-2
Graph based method/Minimum Spanning Tree | Manual | Precision, Recall and F-score
Vector Space Model | Manual | Precision, Recall
Self organizing maps and entity recognition | Manual (Articles from Manoramaonline) | Sentence rank evaluation, question game evaluation, keyword association
Hierarchical encoder/decoder architecture | CNN/Daily mail Data Set | ROUGE-1, ROUGE-2 and ROUGE-L
Multi document summarization with statistical score and MMR | Manual | ROUGE-1 and ROUGE-2
Attention based Mechanism for abstractive summary | Translated BBC News Repository | ROUGE-1, ROUGE-2 and ROUGE-L

3. Proposed Methodology

A machine learning-based text summarizer for the Malayalam language is proposed here. The framework we discuss handles a single input document. For a machine learning model, accuracy depends on the quality of the training given to the model as well as on the learning algorithm adopted. The system is trained with the Support Vector Machine algorithm. Fig. 2 shows the architecture of the machine learning-based text summarizer. The input is given as a text document and the system outputs an extractive summary of that input. The input document first undergoes a preprocessing phase, which includes text segmentation, tokenization, stop word removal, and stemming.
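As a rough illustration of the preprocessing phase just listed, the following minimal Python sketch segments a text into sentences, tokenizes at word level, removes stop words, and stems by dictionary lookup. The function names, the punctuation-based sentence boundary rule, and the tiny `STEM_LOOKUP` table are assumptions made for illustration only; the stop words and the single stem pair are taken from the examples given in this paper, and a practical system would need far larger resources.

```python
import re

# A few stop words quoted as examples in this paper; a real list is much longer.
STOP_WORDS = {"ആണ്", "അത്", "ഇത്", "അവ"}

# Hypothetical stem table: an assumption standing in for a stemming corpus.
STEM_LOOKUP = {"വാഹനമാണ്": "വാഹനം"}

# Assumed sentence boundary: '.', '?' or '!' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")
# A token is any run of characters that is not whitespace or punctuation.
TOKEN = re.compile(r"""[^\s.,;:?!"'“”«»()]+""")


def segment(text):
    """Split a document into sentences at the assumed boundary."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Word-level tokenization with punctuation stripped."""
    return TOKEN.findall(sentence)


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    processed = []
    for sentence in segment(text):
        tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
        processed.append([STEM_LOOKUP.get(t, t) for t in tokens])
    return processed
```

Keeping the three stages as separate functions mirrors the module boundaries of the architecture, so each stage can be replaced independently, for example by a proper Malayalam stemmer.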
The features described in Section 3.3 are then extracted from the segmented units, and the machine is trained on those features so that it predicts a summary close to the one a human would produce.

Fig. 2. Architecture of the proposed model. Blocks: input text document, text segmentizer (segmented text/sentences), pre-processing (pre-processed text), feature extractor, ranking model using SVM classifier (classified text), summary generator, document summary.

3.1 Text Segmentizer

The accepted input document is segmented into sentences here. Sentences are identified in the document by applying a sentence boundary condition.

Fig. 3. Sample input to text segmentizer

Fig. 4. Output of text segmentizer for the input given in Fig. 3

3.2 Pre-Processing Module

Sentences extracted by the text segmentizer are subjected to preprocessing. The following tasks are performed in the preprocessing stage:

1. Tokenization: Tokens are the basic building units of natural language. They may be words, sub-words, or characters. For example, for the sentence [...], the word-level tokens are [...]; sub-word tokens can be like വാഹനം, ആണ്; and character-level tokens will be like വാ-ഹ-ന-മാ-ണ്. Word-level tokenization is performed in the proposed work.

2. Stop word Removal: Certain words in the document contribute little meaning to the sentence and mostly occur as grammatical constructs. Such words can be removed from the text, since they cost extra storage and processing time. Words such as ആണ്, അത്, ഇത്, അങ്ങനെ etc. are removed from the text to reduce the space and time complexity.

3. Stemming: Words may occur in different inflected forms. Stemming finds the base form of a word without any inflections.
For example, വാഹനമാണ് will result in the stem വാഹനം. Stemming is difficult in Malayalam since it is an agglutinative language: affixes can be appended to a word one after another, and after removing one affix the remainder must still be a valid word. Here we use a separate data corpus to find the stems of different words.

3.3 Feature Extractor

After preprocessing, the features described below are extracted from the text and used to train the model. For text summarization, many features can be considered, both textual and statistical. Our text summarizer is trained with the following features:

Step 1: Number of key phrases in the Sentence
The most frequently occurring words/phrases in the sentences are called key phrases. Sentences are ranked according to the key phrases they contain. The rank is computed as the ratio of the number of key phrases in the sentence to the total number of words/phrases in the longest sentence of the text document.

Step 2: Position of the sentence in the input document
The location of a sentence within the document plays a significant role in summarization. Writers usually place the important concepts in the opening paragraphs and the conclusion in the last paragraph of the document. More weight is given to sentences that fall in these positions.

Step 3: Position of the sentence in the paragraph
The previous feature considers the overall position of a sentence within the document. In this feature, we consider the location of the sentence within the paragraph itself.
As mentioned earlier, the initial sentences of a paragraph carry more of its concept, so the trained model gives more importance to the sentences that come first in the paragraph.

Step 4: Numerical information in the sentence
When choosing index terms, numbers are usually given little credit, since they are vague without surrounding context. For a text summary, however, numerical data plays a major role, since it may represent a date, a year, or an important count. In such cases it should definitely be included in the summary. Segmented sentences that contain numerical information are ranked by the ratio of the number of numerical items in the sentence to the total number of words in that sentence.

Step 5: Presence of guillemets in the sentence
The presence of quotation marks is also a salient feature in producing text summaries. In a language like Malayalam, important concepts are usually quoted, and they should not be left out of the summary. Such units are therefore ranked by how many words are quoted out of the total words in the sentence.

Step 6: Length of the sentence
A score is assigned to each sentence by comparing the number of words in the sentence with the number of words in the longest sentence of the document. Generally, shorter sentences convey less information. Very long sentences are also given a low weight, since they may contain unnecessary detail.

3.4 Ranking Model & Summary Generator

Based on the features extracted in the previous step, the model is trained to classify the sentences into four groups: VVI (Very-Very Important), VI (Very Important), I (Important), and LI (Least Important). The Support Vector Machine is the algorithm used here for classification.
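The six features of Section 3.3 can be sketched in Python as follows. This is only an illustrative reading of the descriptions above, not the implementation used in the paper: the function and feature names are hypothetical, the ratios for key phrases, numerical data, quoted words, and length follow the text, and the linear decay used for the two position features is an assumption, since no position formula is given here (it does not, for instance, give extra weight to concluding sentences).

```python
import re

FEATURE_NAMES = ("key_phrase", "doc_position", "par_position",
                 "numeric", "quoted", "length")

# Text enclosed in straight quotes, curly quotes, or guillemets.
QUOTE_RE = re.compile(r'["“«](.+?)["”»]')
PUNCT = '.,;:?!"“”«»()'


def sentence_features(sentence, key_phrases, longest_len,
                      doc_pos, doc_total, par_pos, par_total):
    """Return the six feature values for one sentence.

    doc_pos / par_pos are zero-based indices of the sentence in the
    document / paragraph; doc_total / par_total are the sentence counts.
    longest_len is the word count of the longest sentence in the document.
    """
    words = [w.strip(PUNCT) for w in sentence.split()]
    words = [w for w in words if w]
    n = len(words)
    if n == 0 or longest_len == 0:
        return dict.fromkeys(FEATURE_NAMES, 0.0)

    quoted_words = sum(len(m.split()) for m in QUOTE_RE.findall(sentence))
    numeric_words = sum(any(c.isdigit() for c in w) for w in words)

    return {
        # Step 1: key phrases relative to the longest sentence.
        "key_phrase": sum(w in key_phrases for w in words) / longest_len,
        # Steps 2 and 3: ASSUMED linear decay, earlier sentences score higher.
        "doc_position": (doc_total - doc_pos) / doc_total,
        "par_position": (par_total - par_pos) / par_total,
        # Step 4: numerical items relative to words in this sentence.
        "numeric": numeric_words / n,
        # Step 5: quoted words relative to words in this sentence.
        "quoted": quoted_words / n,
        # Step 6: sentence length relative to the longest sentence.
        "length": n / longest_len,
    }
```

Each sentence then yields a six-element vector, which is the form a classifier such as an SVM expects as input.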
A score is calculated from the features obtained by the feature extractor module. This score gives the relevance of each sentence, and the sentences are grouped into four classes according to the priority of the text segment. The summary document is produced from the most relevant sentences identified by the ranking module. The user can specify the percentage of the document that the summary should contain. Based on this threshold, the required number of sentences is selected from the four classes in the order VVI, VI, I, LI, as ranked by the trained model. The following algorithm describes the process.

Algorithm 1
Input: Text document of any genre.
Method: SVM classifier-based algorithm for training the model
Output: Extracted text document summary
Begin
Input the text document
Perform preprocessing: Break the document into separate sentences
For (sen=1; sen [...] =ns in VVI class.
If ns has not reached the limit, select the sentences from the VI class
If ns has not reached the limit, select the sentences from the I class
If ns has not reached the limit, select the sentences from the LI class
End

4. Implementation of Summarizer

The summarizer is implemented in the Python programming language. An interface is also developed with the web framework Django, which makes it simple to summarize a document. Python is one of the best options for natural language processing tasks, since it offers many NLP tools and libraries that help the programmer pre-process unstructured input text easily. It also integrates with other languages, and its syntax is easy to understand, even for a beginner in programming. Initially, we select the document to be summarized. Consider the following document as the input document.
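The selection stage of Algorithm 1, which takes sentences class by class until the compression limit is reached, can be sketched as below. This is a minimal sketch under stated assumptions rather than the authors' code: the function name is hypothetical, the limit is rounded up and is at least one sentence, ties within a class are broken by document order, and the selected sentences are returned in their original order. None of these choices is specified in the text.

```python
import math

# Class priority as described in the text: most important first.
CLASS_PRIORITY = ("VVI", "VI", "I", "LI")


def select_summary(sentences, labels, compression):
    """Pick sentences for the summary by class priority.

    sentences:   sentences in document order
    labels:      predicted class for each sentence (one of CLASS_PRIORITY)
    compression: fraction of the document to keep, e.g. 0.25 for 25%
    """
    # ASSUMPTION: round the sentence budget up, keep at least one sentence.
    limit = max(1, math.ceil(len(sentences) * compression))

    chosen = []
    for cls in CLASS_PRIORITY:
        for index, label in enumerate(labels):
            if len(chosen) >= limit:
                break
            if label == cls:
                chosen.append(index)

    # ASSUMPTION: restore document order so the summary reads coherently.
    return [sentences[i] for i in sorted(chosen)]
```

For example, with eight sentences and a compression ratio of 0.25, the budget is two sentences, so only the VVI class is consulted if it holds at least two sentences.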
Fig. 5. Sample Input document

The given document is first subjected to text pre-processing. Sentences are separated by a sentence splitter, and stop words such as അത്, ഇത്, അവ, അവിടെ etc. are removed. The text then passes through the stemmer, after which it is in pre-processed form and ready for the learning operations. Feature scores are calculated next, based on the features described in Section 3.3. The following table shows the feature values obtained for the above text document.

Table 2. Feature values obtained for sample document

Paragraph Number: 1 | Number of sentences: 3 | Feature values of the sentences:
f1-[1.0, 0.6666666666666666, 0.3333333333333333]
f2-[0.9821428571428571, 0.9642857142857143, 0.9464285714285714]
f3-[0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0]
f5-[0.6024590163934426, 0.7581967213114754, 0.30327868852459017]
f6-[0.0, 0.0, 0.3333333333333333]

Paragraph Number: 2 | Number of sentences: 7 | Feature values of the sentences:
f1-[1.0, 0.8571428571428571, 0.7142857142857143, 0.5714285714285714, 0.42857142857142855, 0.2857142857142857, 0.14285714285714285]
f2-[0.9285714285714286, 0.9107142857142857, 0.8928571428571429, 0.875, 0.8571428571428571, 0.8392857142857143, 0.8214285714285714]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.36475409836065575, 0.18442622950819673, 0.36475409836065575, 0.3729508196721312, 0.30327868852459017, 0.4016393442622951, 0.4016393442622951]
f6-[0.0, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.0, 0.14285714285714285, 0.0]

Paragraph Number: 3 | Number of sentences: 12 | Feature values of the sentences:
f1-[1.0, 0.9166666666666666, 0.8333333333333334, 0.75, 0.6666666666666666, 0.5833333333333334, 0.5, 0.4166666666666667, 0.3333333333333333, 0.25, 0.16666666666666666, 0.08333333333333333]
f2-[0.8035714285714286, 0.7857142857142857, 0.7678571428571429, 0.75, 0.7321428571428571, 0.7142857142857143, 0.6964285714285714, 0.6785714285714286, 0.6607142857142857, 0.6428571428571429, 0.625, 0.6071428571428571]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.13114754098360656, 0.319672131147541, 0.2336065573770492, 0.36885245901639346, 0.3114754098360656, 0.24180327868852458, 0.5409836065573771, 0.4426229508196721, 0.29508196721311475, 0.3114754098360656, 0.29918032786885246, 0.22540983606557377]
f6-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08333333333333333]

Paragraph Number: 4 | Number of sentences: 8 | Feature values of the sentences:
f1-[1.0, 0.8888888888888888, 0.7777777777777778, 0.6666666666666666, 0.5555555555555556, 0.4444444444444444, 0.3333333333333333, 0.2222222222222222, 0.1111111111111111]
f2-[0.5892857142857143, 0.5714285714285714, 0.5535714285714286, 0.5357142857142857, 0.5178571428571429, 0.5, 0.48214285714285715, 0.4642857142857143, 0.44642857142857145]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.39344262295081966, 0.3155737704918033, 0.3975409836065574, 0.3360655737704918, 0.0860655737704918, 0.5819672131147541, 0.1885245901639344, 0.16393442622950818, 0.1885245901639344]
f6-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Paragraph Number: 5 | Number of sentences: 7 | Feature values of the sentences:
f1-[1.0, 0.8571428571428571, 0.7142857142857143, 0.5714285714285714, 0.42857142857142855, 0.2857142857142857, 0.14285714285714285]
f2-[0.42857142857142855, 0.4107142857142857, 0.39285714285714285, 0.375, 0.35714285714285715, 0.3392857142857143, 0.32142857142857145]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.26229508196721313, 0.12704918032786885, 0.819672131147541, 0.1885245901639344, 0.4057377049180328, 0.26639344262295084, 0.040983606557377046]
f6-[0.0, 0.0, 0.14285714285714285, 0.0, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285]

Paragraph Number: 6 | Number of sentences: 4 | Feature values of the sentences:
f1-[1.0, 0.75, 0.5, 0.25]
f2-[0.30357142857142855, 0.2857142857142857, 0.26785714285714285, 0.25]
f3-[0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0]
f5-[0.6844262295081968, 0.1885245901639344, 0.7295081967213115, 0.3237704918032787]
f6-[0.75, 0.25, 0.0, 0.0]

Paragraph Number: 7 | Number of sentences: 7 | Feature values of the sentences:
f1-[1.0, 0.8571428571428571, 0.7142857142857143, 0.5714285714285714, 0.42857142857142855, 0.2857142857142857, 0.14285714285714285]
f2-[0.23214285714285715, 0.21428571428571427, 0.19642857142857142, 0.17857142857142858, 0.16071428571428573, 0.14285714285714285, 0.125]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.6844262295081968, 0.22131147540983606, 0.3360655737704918, 1.0, 0.45491803278688525, 0.5901639344262295, 0.4057377049180328]
f6-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Paragraph Number: 8 | Number of sentences: 7 | Feature values of the sentences:
f1-[1.0, 0.8571428571428571, 0.7142857142857143, 0.5714285714285714, 0.42857142857142855, 0.2857142857142857, 0.14285714285714285]
f2-[0.10714285714285714, 0.08928571428571429, 0.07142857142857142, 0.05357142857142857, 0.03571428571428571, 0.017857142857142856, 0.0]
f3-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f4-[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f5-[0.0942622950819672, 0.29918032786885246, 0.639344262295082, 0.48360655737704916, 0.4672131147540984, 0.47950819672131145, 0.2581967213114754]
f6-[0.14285714285714285, 0.0, 0.0, 0.0, 0.14285714285714285, 0.0, 0.0]

The SVM classifier predicts the class of each sentence and groups the sentences into four classes: VVI (Very-Very Important), VI (Very Important), I (Important), and LI (Least Important). The user provides a compression ratio as a percentage, and the system produces the summary according to that compression factor. For example, if the compression is given as 25%, only one quarter of the original document appears in the summary. The highest preference is given to sentences in the SVM class VVI, followed in order by VI, I, and LI. To train the model, thousands of documents from different genres were selected. We created our own dataset to train the model, since no such training datasets exist for Malayalam. Data were collected from different news portals, travel vlogs, historical vlogs, etc. With a compression ratio of 30%, the above sample document yields the text summary shown below:

Fig. 6. Output document corresponding to Fig. 5

5. Experimental Classification Results and Analysis

As mentioned in Section 4, the implementation is done in Python. For training and testing, we created our own dataset for the Malayalam language, which includes documents from different genres such as news articles, travel vlogs, and historical and geographical documents. For evaluation as well, Malayalam documents from different genres were collected. The summarizer is evaluated using measures such as Precision, Recall, and F1-Score. These measures quantify how many of the ideal sentences are included in the generated summary.
For this purpose, we take the summary generated by our machine learning-based summarizer and the summary generated by a human. Consider:

N(System Summary) = number of sentences in the final summary generated by the system.
N(Manual Summary) = number of sentences in the summary generated by a human.
N(System Summary ∩ Manual Summary) = number of sentences common to the system-generated summary and the human-generated summary.

Precision can be computed as

$$P = \frac{N(\text{System Summary} \cap \text{Manual Summary})}{N(\text{System Summary})} \tag{1}$$

Recall can be computed as

$$R = \frac{N(\text{System Summary} \cap \text{Manual Summary})}{N(\text{Manual Summary})} \tag{2}$$

The F1-score is another evaluation figure we use to assess the accuracy of our system. It is the harmonic mean of the precision and recall of the model. An F1-score of 1 indicates a perfect model. It can be computed using the following formula:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$

The F-score can be adjusted to give more weight to precision or to recall, depending on the model. This is called the Fβ measure and can be computed from the following formula:

$$F_\beta = \frac{(1 + \beta^2)\,(P \cdot R)}{(\beta^2 \cdot P) + R} \tag{4}$$

Here β is the weighting factor: with formula (4), β > 1 gives more weight to recall, and β < 1 favours precision.
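The measures in equations (1) to (4) can be sketched in a few lines of Python. This is an illustrative sketch, with hypothetical function names and the assumption that each summary is represented as a collection of sentence identifiers (for example, sentence indices in the source document), so that the intersection counts the sentences the two summaries share.

```python
def precision_recall(system_summary, manual_summary):
    """Equations (1) and (2): sentence-level precision and recall."""
    system, manual = set(system_summary), set(manual_summary)
    common = len(system & manual)
    precision = common / len(system) if system else 0.0
    recall = common / len(manual) if manual else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Equation (4); beta = 1 reduces it to the F1-score of equation (3)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Usage: the system picked sentences 1-4, the human picked 2, 3 and 5.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 5])
f1 = f_beta(p, r)            # harmonic mean of p and r
f2 = f_beta(p, r, beta=2.0)  # beta above 1 weights recall more heavily
```

Because recall is higher than precision in this usage example, the β = 2 score comes out higher than the β = 0.5 score, which illustrates the direction of the weighting.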
