Learning to Classify Text
Document Details
International Burch University
Dželila MEHANOVIĆ
Summary
This document provides an introduction to natural language processing, focusing on learning to classify text. It examines supervised classification techniques for various tasks, such as spam detection and topic identification. It features examples and code snippets for gender identification within this context.
Full Transcript
Introduction to Natural Language Processing: Learning To Classify Text
Assist. Prof. Dr. Dželila MEHANOVIĆ

Learning To Classify Text
The goal of this chapter is to answer the following questions: How can we identify particular features of language data that are important for classifying it? How can we construct models of language that can be used to perform language processing tasks automatically?

Supervised Classification
Classification is the task of choosing the correct class label for a given input. Some examples of classification tasks are: deciding whether an email is spam or not; deciding what the topic of a news article is, from a fixed list of topic areas such as "sports", "technology", and "politics"; and deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input. The framework used by supervised classification is shown in the accompanying slide figure.

Gender Identification
Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. A feature extractor function builds a dictionary containing relevant information about a given name (see the sketch at the end of this section).

The dictionary that is returned by this function is called a feature set and maps from features' names to their values. Feature names are case-sensitive strings that typically provide a short, human-readable description of the feature. Feature values are values with simple types, such as Booleans, numbers, and strings. Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier. Let's test it out on some names that did not appear in its training data.

Observe that these character names from The Matrix are correctly classified. We can systematically evaluate the classifier on a much larger quantity of unseen data. Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders.

This listing shows that the names in the training set that end in a are female 35 times more often than they are male, but names that end in k are male 31 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships. When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.
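To make the pipeline above concrete, here is a minimal sketch of the name-gender classifier described in these slides, assuming NLTK and its names corpus are available; the example names, the 500-item test split, and the exact printed outputs are illustrative and will vary with the random shuffle.

import random
import nltk
from nltk.corpus import names  # may require nltk.download('names')

# Feature extractor: encode only the final letter of the name.
def gender_features(word):
    return {'last_letter': word[-1]}

# Prepare a list of examples and corresponding class labels.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Extract features and divide the feature sets into training and test sets.
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a naive Bayes classifier and try it on names not seen during training.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features('Neo')))      # e.g. 'male'
print(classifier.classify(gender_features('Trinity')))  # e.g. 'female'

# Evaluate on the held-out test set and inspect the most informative features.
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

# For large corpora, avoid materialising every feature set in memory:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])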
Choosing The Right Features
Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful.

However, there are usually limits to the number of features that you should use with a given learning algorithm: if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of the training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set. The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. The division of the corpus data into these subsets is shown in the accompanying slide figure.

Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it on the dev-test set. Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders. We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly. The names classifier that we have built generates about 100 errors on the dev-test corpus.

Looking through this list of errors makes it clear that some suffixes of more than one letter can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. We therefore adjust our feature extractor to include features for two-letter suffixes (see the sketch after this section).

Rebuilding the classifier with the new feature extractor, we see that the performance on the dev-test dataset improves by almost three percentage points (from 76.6% to 78.1%). This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, we should select a different dev-test/training split.
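The following is a sketch of the dev-test split, the error-analysis loop, and the two-letter-suffix feature extractor described above, assuming NLTK and its names corpus; the split sizes, the gender_features2 name, and the number of errors printed are illustrative.

import random
import nltk
from nltk.corpus import names

# Refined feature extractor: one- and two-letter suffixes of the name.
def gender_features2(name):
    return {'suffix1': name[-1:], 'suffix2': name[-2:]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Subdivide the development set into a training set and a dev-test set,
# keeping a separate test set for the final evaluation.
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(gender_features2(n), g) for (n, g) in train_names]
devtest_set = [(gender_features2(n), g) for (n, g) in devtest_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

# Error analysis: collect the dev-test names the classifier gets wrong.
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

# Inspect a sample of the errors to look for new suffix patterns.
for (tag, guess, name) in sorted(errors)[:20]:
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))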
Document Classification
Using corpora in which documents have been labeled with categories, we can build classifiers that will automatically tag new documents with appropriate category labels. First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2,000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document.

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we can use show_most_informative_features() to find out which features the classifier found to be most informative.

Part-Of-Speech Tagging
Previously, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal makeup of the word. However, that regular expression tagger had to be handcrafted. Instead, we can train a classifier to work out which suffixes are most informative. Let's begin by finding the most common suffixes.

Next, we'll define a feature extractor function that checks a given word for these suffixes. Feature extraction functions behave like tinted glasses, highlighting some of the properties (colors) in our data while making it impossible to see other properties. The classifier will rely exclusively on these highlighted properties when determining how to label inputs. In this case, the classifier will make its decisions based only on information about which of the common suffixes (if any) a given word has.

Now that we've defined our feature extractor, we can use it to train a new "decision tree" classifier. One nice feature of decision tree models is that they are often fairly easy to interpret. We can even instruct NLTK to print them out as pseudocode.

Here, we can see that the classifier begins by checking whether a word ends with a comma; if so, then it will receive the special tag ",". Next, the classifier checks whether the word ends in "the", in which case it's almost certainly a determiner. This "suffix" gets used early by the decision tree because the word the is so common. Continuing on, the classifier checks whether the word ends in s. If so, then it's most likely to receive the verb tag VBZ (unless it's the word is, which has the special tag BEZ); if not, then it's most likely a noun (unless it's the punctuation mark "."). The actual classifier contains further nested if-then statements below the ones shown here, but the depth=4 argument displays only the top portion of the decision tree.

Exploiting Context
By augmenting the feature extraction function, we could modify this part-of-speech tagger to leverage a variety of other word-internal features, such as the length of the word, the number of syllables it contains, or its prefix. But contextual features often provide powerful clues about the correct tag; for example, when tagging the word fly, knowing that the previous word is a allows us to determine that it is functioning as a noun, not a verb. In order to accommodate features that depend on a word's context, we must revise the pattern that we used to define our feature extractor. Instead of just passing in the word to be tagged, we will pass in a complete (untagged) sentence, along with the index of the target word (see the sketch below).
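Here is a minimal sketch of such a context-aware feature extractor and a classifier-based tagger trained on it, assuming NLTK and the Brown corpus; the suffix lengths, the prev-word feature, and the 10% test split are illustrative, and training on the full news category can take some time.

import nltk
from nltk.corpus import brown

# Contextual feature extractor: the word's suffixes plus the previous word.
# The whole (untagged) sentence and the target index are passed in together.
def pos_features(sentence, i):
    features = {
        'suffix(1)': sentence[i][-1:],
        'suffix(2)': sentence[i][-2:],
        'suffix(3)': sentence[i][-3:],
    }
    if i == 0:
        features['prev-word'] = '<START>'
    else:
        features['prev-word'] = sentence[i - 1]
    return features

# Build (featureset, tag) pairs from the tagged Brown news sentences.
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

# Train and evaluate a naive Bayes tagger on a held-out portion.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))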
Evaluation
In order to decide whether a classification model is accurately capturing a pattern, we must evaluate that model. The result of this evaluation is important for deciding how trustworthy the model is, and for what purposes we can use it. Evaluation can also be an effective tool for guiding us in making future improvements to the model.

The Test Set
Most evaluation techniques calculate a score for a model by comparing the labels that it generates for the inputs in a test set (or evaluation set) with the correct labels for those inputs. This test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.

When building the test set, there is often a trade-off between the amount of data available for testing and the amount available for training. For classification tasks that have a small number of well-balanced labels and a diverse test set, a meaningful evaluation can be performed with as few as 100 evaluation instances. But if a classification task has a large number of labels or includes very infrequent labels, then the size of the test set should be chosen to ensure that the least frequent label occurs at least 50 times.

Another consideration when choosing the test set is the degree of similarity between instances in the test set and those in the development set. Consider the part-of-speech tagging task. At one extreme, we could create the training set and test set by randomly assigning sentences from a data source that reflects a single genre, such as news. In this case, our test set will be very similar to our training set. Because the two sets are taken from the same genre, we cannot be confident that evaluation results would generalize to other genres. What's worse, because of the call to random.shuffle(), the test set contains sentences that are taken from the same documents that were used for training. If there is any consistent pattern within a document (say, if a given word appears with a particular part-of-speech tag especially frequently), then that pattern will be reflected in both the development set and the test set.

A somewhat better approach is to ensure that the training set and test set are taken from different documents. If we want to perform a more stringent evaluation, we can draw the test set from documents that are less closely related to those in the training set. If we build a classifier that performs well on such a test set, then we can be confident that it has the power to generalize well beyond the data on which it was trained.

Accuracy
The simplest metric that can be used to evaluate a classifier, accuracy, measures the percentage of inputs in the test set that the classifier correctly labeled. For example, a name-gender classifier that predicts the correct gender for 60 names in a test set containing 80 names would have an accuracy of 60/80 = 75%. The function nltk.classify.accuracy() will calculate the accuracy of a classifier model on a given test set, as illustrated in the sketch below.
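The following is a sketch of the three splitting strategies just described, using the Brown corpus as the data source; the 10% split and the choice of news and fiction as example genres are illustrative.

import random
import nltk
from nltk.corpus import brown

# Strategy 1: shuffle sentences from a single genre
# (the test set ends up very similar to the training set).
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

# Strategy 2: split by document, so training and test sentences
# come from different files of the same genre.
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_sents = brown.tagged_sents(file_ids[size:])
test_sents = brown.tagged_sents(file_ids[:size])

# Strategy 3 (most stringent): draw the test set from a different genre.
train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='fiction')

# Whichever split is used, a trained classifier is then scored on held-out
# (featureset, label) pairs with nltk.classify.accuracy(classifier, test_set).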
When interpreting the accuracy score of a classifier, it is important to consider the frequencies of the individual class labels in the test set. For example, consider a classifier that determines the correct word sense for each occurrence of the word bank. If we evaluate this classifier on financial newswire text, then we may find that the financial-institution sense appears 19 times out of 20. In that case, an accuracy of 95% would hardly be impressive, since we could achieve the same accuracy with a model that always returns the financial-institution sense. However, if we instead evaluate the classifier on a more balanced corpus, where the most frequent word sense has a frequency of 40%, then a 95% accuracy score would be a much more positive result.

Precision And Recall
Another case where accuracy scores can be misleading is "search" tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%. It is therefore conventional to employ a different set of measures for search tasks, based on the number of items in each of four categories: True positives are relevant items that we correctly identified as relevant. True negatives are irrelevant items that we correctly identified as irrelevant. False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant. False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.

Given these four numbers, we can define the following metrics: Precision, which indicates how many of the items that we identified were relevant, is TP/(TP + FP). Recall, which indicates how many of the relevant items we identified, is TP/(TP + FN). The F-measure (or F-score), which combines precision and recall into a single score, is defined to be their harmonic mean: (2 × Precision × Recall)/(Precision + Recall).

Confusion Matrices
When performing classification tasks with three or more labels, it can be informative to subdivide the errors made by the model based on which types of mistake it made. A confusion matrix is a table in which each cell [i,j] indicates how often label j was predicted when the correct label was i. Thus, the diagonal entries (cells [i,i]) indicate labels that were correctly predicted, and the off-diagonal entries indicate errors.

Cross-Validation
In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if a limited amount of annotated data is available. One solution to this problem is to perform multiple evaluations on different test sets, then combine the scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds. For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data and is therefore quite reliable.

Thank you
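To make the cross-validation procedure above concrete, here is a minimal sketch of N-fold cross-validation applied to the name-gender classifier from earlier in these slides; the number of folds and the feature extractor are illustrative.

import random
import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(n, 'male') for n in names.words('male.txt')] +
                 [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled_names)
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]

# N-fold cross-validation: each fold serves once as the test set,
# and the model is trained on all of the remaining data.
N = 10
fold_size = len(featuresets) // N
scores = []
for i in range(N):
    test_fold = featuresets[i * fold_size:(i + 1) * fold_size]
    train_folds = featuresets[:i * fold_size] + featuresets[(i + 1) * fold_size:]
    classifier = nltk.NaiveBayesClassifier.train(train_folds)
    scores.append(nltk.classify.accuracy(classifier, test_fold))

# Combine the per-fold scores into a single, more reliable estimate.
print(sum(scores) / len(scores))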