
Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification (Wiley, 2001)




Contents

Chapter 1: Introduction
1.1 Machine Perception
1.2 An Example
  1.2.1 Related fields
1.3 The Sub-problems of Pattern Classification
  1.3.1 Feature Extraction
  1.3.2 Noise
  1.3.3 Overfitting
  1.3.4 Model Selection
  1.3.5 Prior Knowledge
  1.3.6 Missing Features
  1.3.7 Mereology
  1.3.8 Segmentation
  1.3.9 Context
  1.3.10 Invariances
  1.3.11 Evidence Pooling
  1.3.12 Costs and Risks
  1.3.13 Computational Complexity
1.4 Learning and Adaptation
  1.4.1 Supervised Learning
  1.4.2 Unsupervised Learning
  1.4.3 Reinforcement Learning
1.5 Conclusion
Summary by Chapters
Bibliographical and Historical Remarks
Bibliography
Index

Chapter 1: Introduction

The ease with which we recognize a face, understand spoken words, read handwritten characters, identify our car keys in our pocket by feel, and decide whether an apple is ripe by its smell belies the astoundingly complex processes that underlie these acts of pattern recognition. Pattern recognition — the act of taking in raw data and taking an action based on the "category" of the pattern — has been crucial for our survival, and over the past tens of millions of years we have evolved highly sophisticated neural and cognitive systems for such tasks.

1.1 Machine Perception

It is natural that we should seek to design and build machines that can recognize patterns. From automated speech recognition, fingerprint identification, optical character recognition, DNA sequence identification and much more, it is clear that reliable, accurate pattern recognition by machine would be immensely useful. Moreover, in solving the myriad problems required to build such systems, we gain deeper understanding and appreciation for pattern recognition systems in the natural world — most particularly in humans. For some applications, such as speech and visual recognition, our design efforts may in fact be influenced by knowledge of how these are solved in nature, both in the algorithms we employ and in the design of special-purpose hardware.

1.2 An Example

To illustrate the complexity of some of the types of problems involved, let us consider the following imaginary and somewhat fanciful example. Suppose that a fish packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species. As a pilot project it is decided to try to separate sea bass from salmon using optical sensing. We set up a camera, take some sample images and begin to note some physical differences between the two types of fish — length, lightness, width, number and shape of fins, position of the mouth, and so on — and these suggest features to explore for use in our classifier. We also notice noise or variations in the
images — variations in lighting, position of the fish on the conveyor, even "static" due to the electronics of the camera itself.

Given that there truly are differences between the population of sea bass and that of salmon, we view them as having different models — different descriptions, which are typically mathematical in form. The overarching goal and approach in pattern classification is to hypothesize the class of these models, process the sensed data to eliminate noise (not due to the models), and for any sensed pattern choose the model that corresponds best. Any techniques that further this aim should be in the conceptual toolbox of the designer of pattern recognition systems.

Our prototype system to perform this very specific task might well have the form shown in Fig. 1.1. First the camera captures an image of the fish. Next, the camera's signals are preprocessed to simplify subsequent operations without losing relevant information. In particular, we might use a segmentation operation in which the images of different fish are somehow isolated from one another and from the background. The information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain "features" or "properties." These features (or, more precisely, the values of these features) are then passed to a classifier that evaluates the evidence presented and makes a final decision as to the species. The preprocessor might automatically adjust for average light level, or threshold the image to remove the background of the conveyor belt, and so forth.

For the moment let us pass over how the images of the fish might be segmented and consider how the feature extractor and classifier might be designed. Suppose somebody at the fish plant tells us that a sea bass is generally longer than a salmon. These, then, give us our tentative models for the fish: sea bass have some typical length, and this is greater than that for salmon. Then length becomes an obvious feature, and we might attempt to classify the fish merely by seeing whether or not the length l of a fish exceeds some critical value l*. To choose l* we could obtain some design or training samples of the different types of fish, (somehow) make length measurements, and inspect the results.

Suppose that we do this, and obtain the histograms shown in Fig. 1.2. These disappointing histograms bear out the statement that sea bass are somewhat longer than salmon, on average, but it is clear that this single criterion is quite poor; no matter how we choose l*, we cannot reliably separate sea bass from salmon by length alone.

Discouraged, but undeterred by these unpromising results, we try another feature — the average lightness of the fish scales. Now we are very careful to eliminate variations in illumination, since they can only obscure the models and corrupt our new classifier. The resulting histograms, shown in Fig. 1.3, are much more satisfactory — the classes are much better separated.

So far we have tacitly assumed that the consequences of our actions are equally costly: deciding the fish was a sea bass when in fact it was a salmon was just as undesirable as the converse. Such a symmetry in the cost is often, but not invariably, the case.
For instance, as a fish packing company we may know that our customers easily accept occasional pieces of tasty salmon in their cans labeled "sea bass," but they object vigorously if a piece of sea bass appears in their cans labeled "salmon." If we want to stay in business, we should adjust our decision boundary to avoid antagonizing our customers, even if it means that more salmon makes its way into the cans of sea bass. In this case, then, we should move our decision boundary x* to smaller values of lightness, thereby reducing the number of sea bass that are classified as salmon (Fig. 1.3). The more our customers object to getting sea bass with their salmon — i.e., the more costly this type of error — the lower we should set the decision threshold x* in Fig. 1.3.

Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed, then the features extracted and finally the classification emitted (here either "salmon" or "sea bass"). Although the information flow is often chosen to be from the source to the classifier ("bottom-up"), some systems employ "top-down" flow as well, in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction.

Figure 1.2: Histograms for the length feature for the two categories. No single threshold value l* (decision boundary) will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value l* marked will lead to the smallest number of errors, on average.

Figure 1.3: Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average.

Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield.

Even if we know the costs associated with our decisions and choose the optimal decision boundary x*, we may be dissatisfied with the resulting performance. Our first impulse might be to seek yet a different feature on which to separate the fish. Let us assume, though, that no other single visual feature yields better performance than that based on lightness. To improve recognition, then, we must resort to the use of more than one feature at a time. In our search for other features, we might try to capitalize on the observation that sea bass are typically wider than salmon.
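Before moving to two features, the cost-sensitive threshold choice just described can be made concrete with a minimal sketch. Everything here is hypothetical — the lightness readings, the cost numbers, and the helper names are made up for illustration, not taken from the text:

```python
import numpy as np

# Hypothetical lightness readings for labeled training fish.
salmon = np.array([2.1, 2.8, 3.0, 3.5, 4.2, 4.4, 5.0])
sea_bass = np.array([4.8, 5.5, 6.1, 6.4, 7.0, 7.7, 8.3])

# Asymmetric costs: calling a sea bass "salmon" is the expensive error.
COST_BASS_AS_SALMON = 5.0   # sea bass ends up in a "salmon" can
COST_SALMON_AS_BASS = 1.0   # salmon ends up in a "sea bass" can

def total_cost(threshold):
    """Rule: classify as salmon if lightness < threshold, else sea bass."""
    bass_errors = np.sum(sea_bass < threshold)    # bass classified as salmon
    salmon_errors = np.sum(salmon >= threshold)   # salmon classified as bass
    return (COST_BASS_AS_SALMON * bass_errors
            + COST_SALMON_AS_BASS * salmon_errors)

# Sweep candidate thresholds and keep the cheapest.
candidates = np.linspace(2.0, 8.5, 200)
best = min(candidates, key=total_cost)
print(f"cost-minimizing threshold x* = {best:.2f}")
```

Raising the cost of the bass-as-salmon error pushes the chosen x* toward smaller lightness values — exactly the adjustment argued for above.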
Now we have two features for classifying fish — the lightness x1 and the width x2. If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x in a two-dimensional feature space, where

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

Our problem now is to partition the feature space into two regions, where for all patterns in one region we will call the fish a sea bass, and for all points in the other we will call it a salmon. Suppose that we measure the feature vectors for our samples and obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule for separating the fish: Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise.

Figure 1.4: The two features of lightness and width for sea bass and salmon. The dark line might serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors.

This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (as expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance, if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost in attaining more features is of no concern, might we ever have too many features?
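As a sketch of the two-feature rule, the fragment below classifies a fish with a linear boundary in the (lightness, width) plane. The boundary coefficients are made up — they merely stand in for the dark line of Fig. 1.4 and are not fitted to any data from the book:

```python
# Hypothetical boundary width = a * lightness + c, standing in for the
# dark line of Fig. 1.4; the coefficients are invented for illustration.
a, c = -0.5, 20.0

def classify(x1, x2):
    """x1 = lightness, x2 = width; sea bass lie above the line."""
    return "sea bass" if x2 > a * x1 + c else "salmon"

print(classify(7.0, 18.0))  # above the line  -> sea bass
print(classify(3.0, 17.0))  # below the line  -> salmon
```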
Suppose that other features are too expensive or too difficult to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line. In that case all the training patterns would be separated perfectly, as shown in Fig. 1.5. With such a "solution," though, our satisfaction would be premature because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be "tuned" to the particular training samples, rather than to some underlying characteristics or true model of all the sea bass and salmon that will have to be separated.

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass.

Naturally, one approach would be to get more training samples for obtaining a better estimate of the true underlying characteristics, for instance the probability distributions of the categories. In most pattern recognition problems, however, the amount of such data we can obtain easily is often quite limited. Even with a vast amount of training data in a continuous feature space, though, if we followed the approach in Fig. 1.5 our classifier would give a horrendously complicated decision boundary — one that would be unlikely to do well on novel patterns.

Rather, then, we might seek to "simplify" the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with the slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns.* But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers? How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition.

Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier.

* The philosophical underpinnings of this approach derive from William of Occam (1284–1347?), who advocated favoring simpler explanations over those that are needlessly complicated — Entia non sunt multiplicanda praeter necessitatem ("Entities are not to be multiplied without necessity"). Decisions based on overly complex models often lead to lower accuracy of the classifier.
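The generalization issue can be seen numerically in a few lines. Below, on synthetic two-class data (invented for this sketch, not the book's), a maximally flexible rule — 1-nearest-neighbor, whose boundary can separate any consistent training set perfectly — scores 100% on its training samples yet typically does worse on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class "fish" in a 2-D feature space; the classes overlap.
n = 40
X0 = rng.normal([3.0, 19.0], 1.0, size=(n, 2))   # salmon-like points
X1 = rng.normal([6.0, 17.0], 1.0, size=(n, 2))   # sea-bass-like points
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

idx = rng.permutation(2 * n)
train, test = idx[:60], idx[60:]

def nn_predict(Xq):
    """1-nearest-neighbor: an extremely 'complex' decision boundary."""
    d = np.linalg.norm(Xq[:, None, :] - X[train][None, :, :], axis=2)
    return y[train][np.argmin(d, axis=1)]

train_acc = np.mean(nn_predict(X[train]) == y[train])   # always 1.0
test_acc = np.mean(nn_predict(X[test]) == y[test])      # typically lower
print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

The perfect training score is exactly the premature "solution" of Fig. 1.5; the gap to the test score is the failure of generalization.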
If instead the model consists of some set of crisp logical rules, then we employ the methods of syntactic pattern recognition, where rules or grammars describe our decision. For ex- ample we might wish to classify an English sentence as grammatical or not, and here statistical descriptions (word frequencies, word correlations, etc.) are inapapropriate. It was necessary in our fish example to choose our features carefully, and hence achieve a representation (as in Fig. 1.6) that enabled reasonably successful pattern classification. A central aspect in virtually every pattern recognition problem is that of achieving such a “good” representation, one in which the structural relationships among the components is simply and naturally revealed, and one in which the true (unknown) model of the patterns can be expressed. In some cases patterns should be represented as vectors of real-valued numbers, in others ordered lists of attributes, in yet others descriptions of parts and their relations, and so forth. We seek a represen- 10 CHAPTER 1. INTRODUCTION tation in which the patterns that lead to the same action are somehow “close” to one another, yet “far” from those that demand a different action. The extent to which we create or learn a proper representation and how we quantify near and far apart will determine the success of our pattern classifier. A number of additional characteris- tics are desirable for the representation. We might wish to favor a small number of features, which might lead to simpler decision regions, and a classifier easier to train. We might also wish to have features that are robust, i.e., relatively insensitive to noise or other errors. In practical applications we may need the classifier to act quickly, or use few electronic components, memory or processing steps. A central technique, when we have insufficient training data, is to incorporate knowledge of the problem domain. Indeed the less the training data the more impor- tant is such knowledge, for instance how the patterns themselves were produced. One analysis method that takes this notion to its logical extreme is that of analysis by synthesis, by where in the ideal case one has a model of how each pattern is generated. Con- synthesis sider speech recognition. Amidst the manifest acoustic variability among the possible “dee”s that might be uttered by different people, one thing they have in common is that they were all produced by lowering the jaw slightly, opening the mouth, placing the tongue tip against the roof of the mouth after a certain delay, and so on. We might assume that “all” the acoustic variation is due to the happenstance of whether the talker is male or female, old or young, with different overall pitches, and so forth. At some deep level, such a “physiological” model (or so-called “motor” model) for production of the utterances is appropriate, and different (say) from that for “doo” and indeed all other utterances. If this underlying model of production can be deter- mined from the sound (and that is a very big if ), then we can classify the utterance by how it was produced. That is to say, the production representation may be the “best” representation for classification. Our pattern recognition systems should then analyze (and hence classify) the input pattern based on how one would have to synthesize that pattern. The trick is, of course, to recover the generating parameters from the sensed pattern. 
Consider the difficulty in making a recognizer of all types of chairs — standard office chair, contemporary living-room chair, beanbag chair, and so forth — based on an image. Given the astounding variety in the number of legs, material, shape, and so on, we might despair of ever finding a representation that reveals the unity within the class of chair. Perhaps the only such unifying aspect of chairs is functional: a chair is a stable artifact that supports a human sitter, including back support. Thus we might try to deduce such functional properties from the image; the property "can support a human sitter" is very indirectly related to the orientation of the larger surfaces, and would need to be answered in the affirmative even for a beanbag chair. Of course, this requires some reasoning about the properties and naturally touches upon computer vision rather than pattern recognition proper.

Without going to such extremes, many real-world pattern recognition systems seek to incorporate at least some knowledge about the method of production of the patterns or their functional use in order to ensure a good representation, though of course the goal of the representation is classification, not reproduction. For instance, in optical character recognition (OCR) one might confidently assume that handwritten characters are written as a sequence of strokes, and first try to recover a stroke representation from the sensed image, and then deduce the character from the identified strokes.

1.2.1 Related fields

Pattern classification differs from classical statistical hypothesis testing, wherein the sensed data are used to decide whether or not to reject a null hypothesis in favor of some alternative hypothesis. Roughly speaking, if the probability of obtaining the data given some null hypothesis falls below a "significance" threshold, we reject the null hypothesis in favor of the alternative. For typical values of this criterion, there is a strong bias or predilection in favor of the null hypothesis; even though the alternative hypothesis may be more probable, we might not be able to reject the null hypothesis. Hypothesis testing is often used to determine whether a drug is effective, where the null hypothesis is that it has no effect. Hypothesis testing might be used to determine whether the fish on the conveyor belt belong to a single class (the null hypothesis) or from two classes (the alternative). In contrast, given some data, pattern classification seeks to find the most probable hypothesis from a set of hypotheses — "this fish is probably a salmon."

Pattern classification differs, too, from image processing. In image processing, the input is an image and the output is an image. Image processing steps often include rotation, contrast enhancement, and other transformations which preserve all the original information. Feature extraction, such as finding the peaks and valleys of the intensity, loses information (but hopefully preserves everything relevant to the task at hand).

As just described, feature extraction takes in a pattern and produces feature values. The number of features is virtually always chosen to be fewer than the total necessary to describe the complete target of interest, and this leads to a loss in information. In acts of associative memory, the system takes in a pattern and emits another pattern which is representative of a general group of patterns.
It thus reduces the information somewhat, but rarely to the extent that pattern classification does. In short, because of the crucial role of a decision in pattern recognition, it is fundamentally an information reduction process. The classification step represents an even more radical loss of information, reducing the original several thousand bits representing all the color of each of several thousand pixels down to just a few bits representing the chosen category (a single bit in our fish example).

1.3 The Sub-problems of Pattern Classification

We have alluded to some of the issues in pattern classification and we now turn to a more explicit list of them. In practice, these typically require the bulk of the research and development effort. Many are domain or problem specific, and their solution will depend upon the knowledge and insights of the designer. Nevertheless, a few are of sufficient generality, difficulty, and interest that they warrant explicit consideration.

1.3.1 Feature Extraction

The conceptual boundary between feature extraction and classification proper is somewhat arbitrary: an ideal feature extractor would yield a representation that makes the job of the classifier trivial; conversely, an omnipotent classifier would not need the help of a sophisticated feature extractor. The distinction is forced upon us for practical, rather than theoretical, reasons. Generally speaking, the task of feature extraction is much more problem and domain dependent than is classification proper, and thus requires knowledge of the domain. A good feature extractor for sorting fish would surely be of little use for identifying fingerprints, or classifying photomicrographs of blood cells. How do we know which features are most promising? Are there ways to automatically learn which features are best for the classifier? How many shall we use?

1.3.2 Noise

The lighting of the fish may vary, there could be shadows cast by neighboring equipment, the conveyor belt might shake — all reducing the reliability of the feature values actually measured. We define noise in very general terms: any property of the sensed pattern due not to the true underlying model but instead to randomness in the world or the sensors. All non-trivial decision and pattern recognition problems involve noise in some form. In some cases it is due to the transduction in the signal, and we may consign to our preprocessor the role of cleaning up the signal, as for instance visual noise in our video camera viewing the fish. An important problem is knowing somehow whether the variation in some signal is due to noise or instead to complex underlying models of the fish. How then can we use this information to improve our classifier?

1.3.3 Overfitting

In going from Fig. 1.4 to Fig. 1.5 in our fish classification problem, we were, implicitly, using a more complex model of sea bass and of salmon. That is, we were adjusting the complexity of our classifier. While an overly complex model may allow perfect classification of the training samples, it is unlikely to give good classification of novel patterns — a situation known as overfitting. One of the most important areas of research in statistical pattern classification is determining how to adjust the complexity of the model — not so simple that it cannot explain the differences between the categories, yet not so complex as to give poor classification on novel patterns. Are there principled methods for finding the best (intermediate) complexity for a classifier?
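One widely used partial answer (not developed in this chapter) is to hold out some labeled data and pick the complexity that does best on it. The sketch below illustrates the idea with synthetic data and NumPy polynomial fitting standing in for a family of classifiers of increasing complexity; it is an illustration of the principle, not the book's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labeled data, split into a training and a validation set.
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(40)
x_tr, y_tr = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

def val_error(degree):
    """Fit a model of the given complexity; score it on held-out data."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

# Choose the intermediate complexity that generalizes best.
best = min(range(1, 10), key=val_error)
print(f"validation-selected polynomial degree: {best}")
```

Too low a degree cannot explain the structure in the data; too high a degree fits the training noise; the held-out error is smallest somewhere in between.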
1.3.4 Model Selection

We might have been unsatisfied with the performance of our fish classifier in Figs. 1.4 & 1.5, and thus jumped to an entirely different class of model, for instance one based on some function of the number and position of the fins, the color of the eyes, the weight, the shape of the mouth, and so on. How do we know when a hypothesized model differs significantly from the true model underlying our patterns, and thus a new model is needed? In short, how are we to know to reject a class of models and try another one? Are we as designers reduced to random and tedious trial and error in model selection, never really knowing whether we can expect improved performance? Or might there be principled methods for knowing when to jettison one class of models and invoke another? Can we automate the process?

1.3.5 Prior Knowledge

In one limited sense, we have already seen how prior knowledge — about the lightness of the different fish categories — helped in the design of a classifier by suggesting a promising feature. Incorporating prior knowledge can be far more subtle and difficult. In some applications the knowledge ultimately derives from information about the production of the patterns, as we saw in analysis-by-synthesis. In others the knowledge may be about the form of the underlying categories, or specific attributes of the patterns, such as the fact that a face has two eyes, one nose, and so on.

1.3.6 Missing Features

Suppose that during classification, the value of one of the features cannot be determined, for example the width of the fish because of occlusion by another fish (i.e., the other fish is in the way). How should the categorizer compensate? Since our two-feature recognizer never had a single-variable threshold value x* determined in anticipation of the possible absence of a feature (cf. Fig. 1.3), how shall it make the best decision using only the feature present? The naive method, of merely assuming that the value of the missing feature is zero or the average of the values for the training patterns, is provably non-optimal. Likewise we occasionally have missing features during the creation or learning of our recognizer. How should we train a classifier or use one when some features are missing?

1.3.7 Mereology

We effortlessly read a simple word such as BEATS. But consider this: Why didn't we read instead other words that are perfectly good subsets of the full pattern, such as BE, BEAT, EAT, AT, and EATS? Why don't they enter our minds, unless explicitly brought to our attention? Or when we saw the B, why didn't we read a P or an I, which are "there" within the B? Conversely, how is it that we can read the two unsegmented words in POLOPONY — without placing the entire input into a single word category? This is the problem of subsets and supersets — formally part of mereology, the study of part/whole relationships. It is closely related to that of prior knowledge and segmentation. In short, how do we recognize or group together the "proper" number of elements — neither too few nor too many? It appears as though the best classifiers try to incorporate as much of the input into the categorization as "makes sense," but not too much. How can this be done?

1.3.8 Segmentation

In our fish example, we have tacitly assumed that the fish were isolated, separate on the conveyor belt.
In practice, they would often be abutting or overlapping, and our system would have to determine where one fish ends and the next begins — the individual patterns have to be segmented. If we have already recognized the fish then it would be easier to segment them. But how can we segment the images before they have been categorized, or categorize them before they have been segmented? It seems we need a way to know when we have switched from one model to another, or to know when we just have background or "no category." How can this be done?

Segmentation is one of the deepest problems in automated speech recognition. We might seek to recognize the individual sounds (e.g., phonemes, such as "ss," "k," ...), and then put them together to determine the word. But consider two nonsense words, "sklee" and "skloo." Speak them aloud and notice that for "skloo" you push your lips forward (so-called "rounding," in anticipation of the upcoming "oo") before you utter the "ss." Such rounding influences the sound of the "ss," lowering the frequency spectrum compared to the "ss" sound in "sklee" — a phenomenon known as anticipatory coarticulation. Thus, the "oo" phoneme reveals its presence in the "ss" earlier than the "k" and "l" which nominally occur before the "oo" itself! How do we segment the "oo" phoneme from the others when they are so manifestly intermingled? Or should we even try? Perhaps we are focusing on groupings of the wrong size, and the most useful unit for recognition is somewhat larger, as we saw in subsets and supersets (Sect. 1.3.7). A related problem occurs in connected cursive handwritten character recognition: How do we know where one character "ends" and the next one "begins"?

1.3.9 Context

We might be able to use context — input-dependent information other than from the target pattern itself — to improve our recognizer. For instance, it might be known for our fish packing plant that if we are getting a sequence of salmon, it is highly likely that the next fish will be a salmon (since it probably comes from a boat that just returned from a fishing area rich in salmon). Thus, if after a long series of salmon our recognizer detects an ambiguous pattern (i.e., one very close to the nominal decision boundary), it may nevertheless be best to categorize it too as a salmon. We shall see how such a simple correlation among patterns — the most elementary form of context — might be used to improve recognition. But how, precisely, should we incorporate such information? Context can be highly complex and abstract. The utterance "jeetyet?" may seem nonsensical, unless you hear it spoken by a friend in the context of the cafeteria at lunchtime — "did you eat yet?" How can such a visual and temporal context influence your speech recognition?

1.3.10 Invariances

In seeking to achieve an optimal representation for a particular pattern classification task, we confront the problem of invariances. In our fish example, the absolute position on the conveyor belt is irrelevant to the category, and thus our representation should be insensitive to the absolute position of the fish. Here we seek a representation that is invariant to the transformation of translation (in either horizontal or vertical directions). Likewise, in a speech recognition problem, it might be required only that we be able to distinguish between utterances regardless of the particular moment they were uttered; here the "translation" invariance we must ensure is in time.
The "model parameters" describing the orientation of our fish on the conveyor belt are horrendously complicated — due as they are to the sloshing of water, the bumping of neighboring fish, the shape of the fish net, etc. — and thus we give up hope of ever trying to use them. These parameters are irrelevant to the model parameters that interest us anyway, i.e., the ones associated with the differences between the fish categories. Thus here we try to build a classifier that is invariant to transformations such as rotation.

Orientation. The orientation of the fish on the conveyor belt is irrelevant to its category. Here the transformation of concern is a two-dimensional rotation about the camera's line of sight. A more general invariance would be for rotations about an arbitrary line in three dimensions. The image of even such a "simple" object as a coffee cup undergoes radical variation as the cup is rotated to an arbitrary angle — the handle may become hidden, the bottom of the inside volume may come into view, the circular lip may appear oval or a straight line or even be obscured, and so forth. How might we ensure that our pattern recognizer is invariant to such complex changes?

Size. The overall size of an image may be irrelevant for categorization. Such differences might be due to variation in the range to the object; alternatively we may be genuinely unconcerned with differences between sizes — a young, small salmon is still a salmon.

Rate. For patterns that have inherent temporal variation, we may want our recognizer to be insensitive to the rate at which the pattern evolves. Thus a slow hand wave and a fast hand wave may be considered equivalent. Rate variation is a deep problem in speech recognition, of course; not only do different individuals talk at different rates, but even a single talker may vary in rate, causing the speech signal to change in complex ways. Likewise, cursive handwriting varies in complex ways as the writer speeds up — the placement of dots on the i's, and cross bars on the t's and f's, are the first casualties of rate increase, while the appearance of l's and e's are relatively inviolate. How can we make a recognizer that changes its representations for some categories differently from that for others under such rate variation?

Deformation. A large number of highly complex transformations arise in pattern recognition, and many are domain specific. We might wish to make our handwritten optical character recognizer insensitive to the overall thickness of the pen line, for instance. Far more severe are transformations such as non-rigid deformations that arise in three-dimensional object recognition, such as the radical variation in the image of your hand as you grasp an object or snap your fingers. Similarly, variations in illumination or the complex effects of cast shadows may need to be taken into account.

Discrete symmetry. The symmetries just described are continuous — the pattern can be translated, rotated, sped up, or deformed by an arbitrary amount. In some pattern recognition applications other — discrete — symmetries are relevant, such as flips left-to-right or top-to-bottom.

In all of these invariances the problem arises: How do we determine whether an invariance is present? How do we efficiently incorporate such knowledge into our recognizer?

1.3.11 Evidence Pooling

In our fish example we saw how using multiple features could lead to improved recognition.
We might imagine that we could do better if we had several component classifiers. If these categorizers agree on a particular pattern, there is no difficulty. But suppose they disagree. How should a "super" classifier pool the evidence from the component recognizers to achieve the best decision?

Imagine calling in ten experts to determine whether a particular fish is diseased or not. While nine agree that the fish is healthy, one expert does not. Who is right? It may be that the lone dissenter is the only one familiar with the particular very rare symptoms in the fish, and is in fact correct. How would the "super" categorizer know when to base a decision on a minority opinion, even from an expert in one small domain who is not well qualified to judge throughout a broad range of problems?

1.3.12 Costs and Risks

We should realize that a classifier rarely exists in a vacuum. Instead, it is generally to be used to recommend actions (put this fish in this bucket, put that fish in that bucket), each action having an associated cost or risk. Conceptually, the simplest such risk is the classification error: what percentage of new patterns are assigned to the wrong category. However, the notion of risk is far more general, as we shall see. We often design our classifier to recommend actions that minimize some total expected cost or risk. Thus, in some sense, the notion of category itself derives from the cost or task. How do we incorporate knowledge about such risks, and how will they affect our classification decision?

Finally, can we estimate the total risk and thus tell whether our classifier is acceptable even before we field it? Can we estimate the lowest possible risk of any classifier, to see how close ours comes to this ideal, or whether the problem is simply too hard overall?

1.3.13 Computational Complexity

Some pattern recognition problems can be solved using algorithms that are highly impractical. For instance, we might try to hand label all possible 20 × 20 binary pixel images with a category label for optical character recognition, and use table lookup to classify incoming patterns. Although we might achieve error-free recognition, the labeling time and storage requirements would be quite prohibitive, since it would require labeling each of $2^{20\times 20} \approx 10^{120}$ patterns.
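As a quick check of that count (a worked line added here, not from the text): each of the 400 pixels is independently 0 or 1, so

$$2^{20\times 20} = 2^{400} = 10^{400\,\log_{10} 2} \approx 10^{400 \times 0.301} \approx 10^{120.4},$$

vastly more patterns than the roughly $10^{80}$ atoms often quoted for the observable universe.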
Thus the computational complexity of different algorithms is of importance, especially for practical applications. In more general terms, we may ask how an algorithm scales as a function of the number of feature dimensions, or the number of patterns, or the number of categories. What is the tradeoff between computational ease and performance? In some problems we know we can design an excellent recognizer, but not within the engineering constraints. How can we optimize within such constraints? We are typically less concerned with the complexity of learning, which is done in the laboratory, than with the complexity of making a decision, which is done with the fielded application. While computational complexity generally correlates with the complexity of the hypothesized model of the patterns, these two notions are conceptually different.

This section has catalogued some of the central problems in classification. It has been found that the most effective methods for developing classifiers involve learning from examples, i.e., from a set of patterns whose category is known. Throughout this book, we shall see again and again how methods of learning relate to these central problems, and are essential in the building of classifiers.

1.4 Learning and Adaptation

In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Because nearly all practical or interesting pattern recognition problems are so hard that we cannot guess the best classification decision ahead of time, we shall spend the great majority of our time here considering learning. Creating classifiers then involves positing some general form of model, or form of the classifier, and using training patterns to learn or estimate the unknown parameters of the model. Learning refers to some form of algorithm for reducing the error on a set of training data. A range of gradient descent algorithms that alter a classifier's parameters in order to reduce an error measure now permeate the field of statistical pattern recognition, and these will demand a great deal of our attention. Learning comes in several general forms.

1.4.1 Supervised Learning

In supervised learning, a teacher provides a category label or cost for each pattern in a training set, and we seek to reduce the sum of the costs for these patterns. How can we be sure that a particular learning algorithm is powerful enough to learn the solution to a given problem and that it will be stable to parameter variations? How can we determine whether it will converge in finite time, or scale reasonably with the number of training patterns, the number of input features, or the perplexity of the problem? How can we ensure that the learning algorithm appropriately favors "simple" solutions (as in Fig. 1.6) rather than complicated ones (as in Fig. 1.5)?

1.4.2 Unsupervised Learning

In unsupervised learning or clustering there is no explicit teacher, and the system forms clusters or "natural groupings" of the input patterns. "Natural" is always defined explicitly or implicitly in the clustering system itself; given a particular set of patterns or cost function, different clustering algorithms lead to different clusters. Often the user will set the hypothesized number of different clusters ahead of time, but how should this be done? How do we avoid inappropriate representations?

1.4.3 Reinforcement Learning

The most typical way to train a classifier is to present an input, compute its tentative category label, and use the known target category label to improve the classifier. For instance, in optical character recognition, the input might be an image of a character, the actual output of the classifier the category label "R," and the desired output a "B." In reinforcement learning, or learning with a critic, no desired category signal is given; instead, the only teaching feedback is that the tentative category is right or wrong. This is analogous to a critic who merely states that something is right or wrong, but does not say specifically how it is wrong. (Thus only binary feedback is given to the classifier; reinforcement learning also describes the case where a single scalar signal, say some number between 0 and 1, is given by the teacher.) In pattern classification, it is most common that such reinforcement is binary — either the tentative decision is correct or it is not. (Of course, if our problem involves just two categories and equal costs for errors, then learning with a critic is equivalent to standard supervised learning.) How can the system learn which are important from such non-specific feedback?
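Section 1.4 describes learning as reducing training error by gradient descent on a classifier's parameters. The fragment below is a minimal illustration of that idea — a linear model trained with a squared-error measure on synthetic data. Both the model and the data are chosen here only for concreteness, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-class training data in two features, labels -1 / +1.
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]

w, b = np.zeros(2), 0.0   # classifier parameters
lr = 0.05                 # learning rate (step size)

for _ in range(200):
    pred = X @ w + b                   # current outputs
    err = pred - y                     # error on the training set
    w -= lr * (X.T @ err) / len(y)     # gradient step on the weights
    b -= lr * err.mean()               # gradient step on the bias

accuracy = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy after gradient descent: {accuracy:.2f}")
```

Each pass moves the parameters a small step opposite the gradient of the error measure; more refined versions of this loop recur throughout statistical pattern recognition.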
1.5 Conclusion

At this point the reader may be overwhelmed by the number, complexity and magnitude of these sub-problems. Further, these sub-problems are rarely addressed in isolation, and they are invariably interrelated. Thus, for instance, in seeking to reduce the complexity of our classifier, we might affect its ability to deal with invariance. We point out, though, that the good news is at least three-fold: 1) there is an "existence proof" that many of these problems can indeed be solved — as demonstrated by humans and other biological systems; 2) mathematical theories solving some of these problems have in fact been discovered; and finally 3) there remain many fascinating unsolved problems providing opportunities for progress.

Summary by Chapters

The overall organization of this book is to address first those cases where a great deal of information about the models is known (such as the probability densities, category labels, ...) and to move, chapter by chapter, toward problems where the forms of the distributions are unknown and even the category membership of training patterns is unknown. We begin in Chap. ?? (Bayes Decision Theory) by considering the ideal case in which the probability structure underlying the categories is known perfectly. While this sort of situation rarely occurs in practice, it permits us to determine the optimal (Bayes) classifier against which we can compare all other methods. Moreover, in some problems it enables us to predict the error we will get when we generalize to novel patterns. In Chap. ?? (Maximum Likelihood and Bayesian Parameter Estimation) we address the case when the full probability structure underlying the categories is not known, but the general forms of their distributions are — i.e., the models. Thus the uncertainty about a probability distribution is represented by the values of some unknown parameters, and we seek to determine these parameters to attain the best categorization. In Chap. ?? (Nonparametric Techniques) we move yet further from the Bayesian ideal, and assume that we have no prior parameterized knowledge about the underlying probability structure; in essence our classification will be based on information provided by training samples alone. Classic techniques such as the nearest-neighbor algorithm and potential functions play an important role here.

We then in Chap. ?? (Linear Discriminant Functions) return somewhat toward the general approach of parameter estimation. We shall assume that the so-called "discriminant functions" are of a very particular form — viz., linear — in order to derive a class of incremental training rules. Next, in Chap. ?? (Nonlinear Discriminants and Neural Networks) we see how some of the ideas from such linear discriminants can be extended to a class of very powerful algorithms such as backpropagation and others for multilayer neural networks; these neural techniques have a range of useful properties that have made them a mainstay in contemporary pattern recognition research. In Chap. ?? (Stochastic Methods) we discuss simulated annealing by the Boltzmann learning algorithm and other stochastic methods. We explore the behavior of such algorithms with regard to the matter of local minima that can plague other neural methods. Chapter ?? (Non-metric Methods) moves beyond models that are statistical in nature to ones that can be best described by (logical) rules.
Here we discuss tree-based algorithms such as CART (which can also be applied to statistical data) and syntactic methods, such as grammar-based ones, which rest on crisp rules. Chapter ?? (Theory of Learning) is both the most important chapter and the most difficult one in the book. Some of the results described there, such as the notion of capacity, degrees of freedom, the relationship between expected error and training set size, and computational complexity, are subtle but nevertheless crucial both theoretically and practically. In some sense, the other chapters can only be fully understood (or used) in light of the results presented here; you cannot expect to solve important pattern classification problems without using the material from this chapter. We conclude in Chap. ?? (Unsupervised Learning and Clustering) by addressing the case when input training patterns are not labeled, and our recognizer must determine the cluster structure. We also treat a related problem, that of learning with a critic, in which the teacher provides only a single bit of information during the presentation of a training pattern — "yes," that the classification provided by the recognizer is correct, or "no," it isn't. Here algorithms for reinforcement learning will be presented.

Bibliographical and Historical Remarks

Classification is among the first crucial steps in making sense of the blooming, buzzing confusion of sensory data that intelligent systems confront. In the western world, the foundations of pattern recognition can be traced to Plato [2], later extended by Aristotle [1], who distinguished between an "essential property" (which would be shared by all members in a class or "natural kind," as he put it) and an "accidental property" (which could differ among members in the class). Pattern recognition can be cast as the problem of finding such essential properties of a category. It has been a central theme in the discipline of philosophical epistemology, the study of the nature of knowledge. A more modern treatment of some philosophical problems of pattern recognition, relating to the technical matter in the current book, can be found in [22, 4, 18]. In the eastern world, the first Zen patriarch, Bodhidharma [3], would point at things and demand students to answer "What is that?" as a way of confronting the deepest issues in mind, the identity of objects, and the nature of classification and decision. A delightful and particularly insightful book on the foundations of artificial intelligence, including pattern recognition, is [9]. Early technical treatments by Minsky [14] and Rosenfeld [16] are still valuable, as are a number of overviews and reference books. The modern literature on decision theory and pattern recognition is now overwhelming, and comprises dozens of journals, thousands of books and conference proceedings, and innumerable articles; it continues to grow rapidly. While some disciplines, such as statistics, machine learning and neural networks, expand the foundations of pattern recognition, others, such as computer vision [6, 19] and speech recognition, rely on it heavily. Perceptual psychology, cognitive science, psychobiology and neuroscience analyze how pattern recognition is achieved in humans and other animals. The extreme view that everything in human cognition — including rule-following and logic — can be reduced to pattern recognition is presented in [13].
Pattern recognition techniques have been applied in virtually every scientific and technical discipline.

Bibliography

[1] Aristotle, Robin Waterfield, and David Bostock. Physics. Oxford University Press, Oxford, UK, 1996.
[2] Allan Bloom. The Republic of Plato. Basic Books, New York, NY, 2nd edition, 1991.
[3] Bodhidharma. The Zen Teachings of Bodhidharma. North Point Press, San Francisco, CA, 1989.
[4] Mikhail M. Bongard. Pattern Recognition. Spartan Books, Washington, D.C., 1970.
[5] Chi-hau Chen, Louis François Pau, and Patrick S. P. Wang, editors. Handbook of Pattern Recognition & Computer Vision. World Scientific, Singapore, 2nd edition, 1993.
[6] Marty Fischler and Oscar Firschein. Readings in Computer Vision: Issues, Problems, Principles and Paradigms. Morgan Kaufmann, San Mateo, CA, 1987.
[7] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, NY, 2nd edition, 1990.
[8] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991.
[9] Douglas Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, New York, NY, 1979.
[10] Eric R. Kandel and James H. Schwartz. Principles of Neural Science. Elsevier, New York, NY, 2nd edition, 1985.
[11] Immanuel Kant. Critique of Pure Reason. Prometheus Books, New York, NY, 1990.
[12] George F. Luger. Cognitive Science: The Science of Intelligent Systems. Academic Press, New York, NY, 1994.
[13] Howard Margolis. Patterns, Thinking, and Cognition: A Theory of Judgement. University of Chicago Press, Chicago, IL, 1987.
[14] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IEEE, 49:8–30, 1961.
[15] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
[16] Azriel Rosenfeld. Picture Processing by Computer. Academic Press, New York, NY, 1969.
[17] Jude W. Shavlik and Thomas G. Dietterich, editors. Readings in Machine Learning. Morgan Kaufmann, San Mateo, CA, 1990.
[18] Brian Cantwell Smith. On the Origin of Objects. MIT Press, Cambridge, MA, 1996.
[19] Louise Stark and Kevin Bowyer. Generic Object Recognition Using Form & Function. World Scientific, River Edge, NJ, 1996.
[20] Donald R. Tveter. The Pattern Recognition Basis of Artificial Intelligence. IEEE Press, New York, NY, 1998.
[21] William R. Uttal. The Psychobiology of Sensory Coding. HarperCollins, New York, NY, 1973.
[22] Satoshi Watanabe. Knowing and Guessing: A Quantitative Study of Inference and Information. John Wiley, New York, NY, 1969.
Contents

Chapter 2: Bayesian Decision Theory
2.1 Introduction
2.2 Bayesian Decision Theory — Continuous Features
  2.2.1 Two-Category Classification
2.3 Minimum-Error-Rate Classification
  2.3.1 *Minimax Criterion
  2.3.2 *Neyman-Pearson Criterion
2.4 Classifiers, Discriminants and Decision Surfaces
  2.4.1 The Multi-Category Case
  2.4.2 The Two-Category Case
2.5 The Normal Density
  2.5.1 Univariate Density
  2.5.2 Multivariate Density
2.6 Discriminant Functions for the Normal Density
  2.6.1 Case 1: Σi = σ²I
  2.6.2 Case 2: Σi = Σ
  2.6.3 Case 3: Σi = arbitrary
  Example 1: Decisions for Gaussian data
2.7 *Error Probabilities and Integrals
2.8 *Error Bounds for Normal Densities
2.8.1 Chernoff Bound ........................... 31
2.8.2 Bhattacharyya Bound ....................... 32
Example 2: Error bounds ......................... 33
2.8.3 Signal Detection Theory and Operating Characteristics .... 33
2.9 Bayes Decision Theory — Discrete Features ............... 36
2.9.1 Independent Binary Features ................... 36
Example 3: Bayes decisions for binary data ............... 38
2.10 *Missing and Noisy Features ....................... 39
2.10.1 Missing Features .......................... 39
2.10.2 Noisy Features ........................... 40
2.11 *Compound Bayes Decision Theory and Context ............ 41
Summary ..................................... 42
Bibliographical and Historical Remarks .................... 43
Problems ..................................... 44
Computer exercises ................................ 59
Bibliography ................................... 61
Index ....................................... 65

Chapter 2

Bayesian decision theory

2.1 Introduction

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions. It makes the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known. In this chapter we develop the fundamentals of this theory, and show how it can be viewed as being simply a formalization of common-sense procedures; in subsequent chapters we will consider the problems that arise when the probabilistic structure is not completely known.

While we will give a quite general, abstract development of Bayesian decision theory in Sect. ??, we begin our discussion with a specific example. Let us reconsider the hypothetical problem posed in Chap. ?? of designing a classifier to separate two kinds of fish: sea bass and salmon. Suppose that an observer watching fish arrive along the conveyor belt finds it hard to predict what type will emerge next, and that the sequence of types of fish appears to be random. In decision-theoretic terminology we would say that as each fish emerges nature is in one or the other of the two possible states: either the fish is a sea bass or the fish is a salmon. We let ω denote the state of nature, with ω = ω1 for sea bass and ω = ω2 for salmon. Because the state of nature is so unpredictable, we consider ω to be a variable that must be described probabilistically.

If the catch produced as much sea bass as salmon, we would say that the next fish is equally likely to be sea bass or salmon. More generally, we assume that there is some a priori probability (or simply prior) P(ω1) that the next fish is sea bass, and some prior probability P(ω2) that it is salmon. If we assume there are no other types of fish relevant here, then P(ω1) and P(ω2) sum to one. These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or salmon before the fish actually appears. They might, for instance, depend upon the time of year or the choice of fishing area.

Suppose for a moment that we were forced to make a decision about the type of fish that will appear next without being allowed to see it. For the moment, we shall
assume that any incorrect classification entails the same cost or consequence, and that the only information we are allowed to use is the value of the prior probabilities. If a decision must be made with so little information, it seems logical to use the following decision rule: Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.

This rule makes sense if we are to judge just one fish, but if we are to judge many fish, using this rule repeatedly may seem a bit strange. After all, we would always make the same decision even though we know that both types of fish will appear. How well it works depends upon the values of the prior probabilities. If P(ω1) is very much greater than P(ω2), our decision in favor of ω1 will be right most of the time. If P(ω1) = P(ω2), we have only a fifty-fifty chance of being right. In general, the probability of error is the smaller of P(ω1) and P(ω2), and we shall see later that under these conditions no other decision rule can yield a larger probability of being right.

In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier. Different fish will yield different lightness readings, and we express this variability in probabilistic terms; we consider x to be a continuous random variable whose distribution depends on the state of nature, and is expressed as p(x|ω1).* This is the class-conditional probability density function — the probability density function for x given that the state of nature is ω1. (It is also sometimes called the state-conditional probability density.) Strictly speaking, the probability density function p(x|ω1) should be written as pX(x|ω1) to indicate that we are speaking about a particular density function for the random variable X. This more elaborate subscripted notation makes it clear that pX(·) and pY(·) denote two different functions, a fact that is obscured when writing p(x) and p(y). Since this potential confusion rarely arises in practice, we have elected to adopt the simpler notation. (Readers who are unsure of our notation or who would like to review probability theory should see Appendix ??.) The difference between p(x|ω1) and p(x|ω2) then describes the difference in lightness between populations of sea bass and salmon (Fig. 2.1).

Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj). Suppose further that we measure the lightness of a fish and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature — that is, the category of the fish? We note first that the (joint) probability density of finding a pattern that is in category ωj and has feature value x can be written two ways: p(ωj, x) = P(ωj|x)p(x) = p(x|ωj)P(ωj). Rearranging these leads us to the answer to our question, which is called Bayes' formula:

$$P(\omega_j|x) = \frac{p(x|\omega_j)P(\omega_j)}{p(x)}, \qquad (1)$$

where in this case of two categories

$$p(x) = \sum_{j=1}^{2} p(x|\omega_j)P(\omega_j). \qquad (2)$$

Bayes' formula can be expressed informally in English by saying that

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}. \qquad (3)$$

* We generally use an upper-case P(·) to denote a probability mass function and a lower-case p(·) to denote a probability density function.
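To make Eqs. 1–3 concrete, the following is a minimal sketch in Python. The Gaussian class-conditional densities and their parameters are hypothetical — chosen only to resemble the curves of Fig. 2.1, not values given in the text — while the priors match those used in Fig. 2.2.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2); the Gaussian
# forms and parameters are assumptions, not taken from the text.
p_x_given_w1 = norm(loc=12.0, scale=1.0).pdf   # sea bass lightness
p_x_given_w2 = norm(loc=14.0, scale=1.0).pdf   # salmon lightness

P_w1, P_w2 = 2/3, 1/3                          # priors, as in Fig. 2.2

def posteriors(x):
    """Bayes' formula (Eq. 1): posterior = likelihood * prior / evidence."""
    joint = np.array([p_x_given_w1(x) * P_w1,
                      p_x_given_w2(x) * P_w2])
    evidence = joint.sum()                     # p(x), Eq. 2
    return joint / evidence

print(posteriors(13.0))                        # the two posteriors sum to one
```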
Bayes' formula shows that by observing the value of x we can convert the prior probability P(ωj) to the a posteriori (or posterior) probability P(ωj|x) — the probability of the state of nature being ωj given that feature value x has been measured. We call p(x|ωj) the likelihood of ωj with respect to x (a term chosen to indicate that, other things being equal, the category ωj for which p(x|ωj) is large is more "likely" to be the true category). Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one, as all good probabilities must. The variation of P(ωj|x) with x is illustrated in Fig. 2.2 for the case P(ω1) = 2/3 and P(ω2) = 1/3.

Figure 2.1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.

If we have an observation x for which P(ω1|x) is greater than P(ω2|x), we would naturally be inclined to decide that the true state of nature is ω1. Similarly, if P(ω2|x) is greater than P(ω1|x), we would be inclined to choose ω2. To justify this decision procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

$$P(\text{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1. \end{cases} \qquad (4)$$

Clearly, for a given x we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise. Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\, dx = \int_{-\infty}^{\infty} P(\text{error}|x)\, p(x)\, dx \qquad (5)$$

and if for every x we ensure that P(error|x) is as small as possible, then the integral must be as small as possible. Thus we have justified the following Bayes' decision rule for minimizing the probability of error:

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2,  (6)

and under this rule Eq. 4 becomes

$$P(\text{error}|x) = \min\,[P(\omega_1|x),\, P(\omega_2|x)]. \qquad (7)$$

Figure 2.2: Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.

This form of the decision rule emphasizes the role of the posterior probabilities. By using Eq. 1, we can instead express the rule in terms of the conditional and prior probabilities. First note that the evidence, p(x), in Eq. 1 is unimportant as far as making a decision is concerned. It is basically just a scale factor that states how frequently we will actually measure a pattern with feature value x; its presence in Eq. 1 assures us that P(ω1|x) + P(ω2|x) = 1.
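A minimal continuation of the sketch above illustrates Eqs. 5–7 numerically; it reuses the hypothetical densities and priors, so the numbers it produces are illustrative only and will not match the particular curves of Fig. 2.2.

```python
def decide_min_error(x):
    """Eqs. 6-7: pick the larger posterior; the conditional error
    probability is the smaller one."""
    post = posteriors(x)                        # from the sketch above
    return int(post.argmax()), post.min()

# Average probability of error, Eq. 5, by numerical integration. Since
# min(posterior) * p(x) = min(joint), we can integrate the smaller joint
# density directly:
xs = np.linspace(5.0, 21.0, 4001)
joint1 = p_x_given_w1(xs) * P_w1
joint2 = p_x_given_w2(xs) * P_w2
P_error = np.trapz(np.minimum(joint1, joint2), xs)
```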
By eliminating this scale factor, we obtain the following completely equivalent decision rule:

Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.  (8)

Some additional insight can be obtained by considering a few special cases. If for some x we have p(x|ω1) = p(x|ω2), then that particular observation gives us no information about the state of nature; in this case, the decision hinges entirely on the prior probabilities. On the other hand, if P(ω1) = P(ω2), then the states of nature are equally probable; in this case the decision is based entirely on the likelihoods p(x|ωj). In general, both of these factors are important in making a decision, and the Bayes decision rule combines them to achieve the minimum probability of error.

2.2 Bayesian Decision Theory – Continuous Features

We shall now formalize the ideas just considered, and generalize them in four ways:

- by allowing the use of more than one feature
- by allowing more than two states of nature
- by allowing actions other than merely deciding the state of nature
- by introducing a loss function more general than the probability of error.

These generalizations and their attendant notational complexities should not obscure the central points illustrated in our simple example. Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where x is in a d-dimensional Euclidean space Rd, called the feature space. Allowing more than two states of nature provides us with a useful generalization for a small notational expense. Allowing actions other than classification primarily allows the possibility of rejection, i.e., of refusing to make a decision in close cases; this is a useful option if being indecisive is not too costly. Formally, the loss function states exactly how costly each action is, and is used to convert a probability determination into a decision. Cost functions let us treat situations in which some kinds of classification mistakes are more costly than others, although we often discuss the simplest case, where all errors are equally costly.

With this as a preamble, let us begin the more formal treatment. Let ω1, ..., ωc be the finite set of c states of nature ("categories") and α1, ..., αa be the finite set of a possible actions. The loss function λ(αi|ωj) describes the loss incurred for taking action αi when the state of nature is ωj. Let the feature vector x be a d-component vector-valued random variable, and let p(x|ωj) be the state-conditional probability density function for x — the probability density function for x conditioned on ωj being the true state of nature. As before, P(ωj) describes the prior probability that nature is in state ωj. Then the posterior probability P(ωj|x) can be computed from p(x|ωj) by Bayes' formula:

$$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)P(\omega_j)}{p(\mathbf{x})}, \qquad (9)$$

where the evidence is now

$$p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)P(\omega_j). \qquad (10)$$

Suppose that we observe a particular x and that we contemplate taking action αi. If the true state of nature is ωj, by definition we will incur the loss λ(αi|ωj). Since P(ωj|x) is the probability that the true state of nature is ωj, the expected loss associated with taking action αi is merely

$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}). \qquad (11)$$

In decision-theoretic terminology, an expected loss is called a risk, and R(αi|x) is called the conditional risk.
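Eq. 11 is just a matrix-vector product, which the sketch below makes explicit. The loss matrix here is hypothetical, chosen only for illustration; the posteriors could come from the earlier sketch.

```python
import numpy as np

# Hypothetical loss matrix: lam[i, j] = lambda(a_i | w_j), the loss for
# taking action a_i when the true state is w_j (values illustrative only).
lam = np.array([[0.0, 2.0],    # a1: decide w1
                [1.0, 0.0]])   # a2: decide w2

def conditional_risks(post):
    """Eq. 11: R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x), one entry per
    action, computed as a matrix-vector product."""
    return lam @ np.asarray(post)

def bayes_action(post):
    """Take the action whose conditional risk is smallest."""
    return int(np.argmin(conditional_risks(post)))
```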
Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk. We shall now show that this Bayes decision procedure actually provides the optimal performance on an overall risk.

Stated formally, our problem is to find a decision rule against P(ωj) that minimizes the overall risk. A general decision rule is a function α(x) that tells us which action to take for every possible observation. To be more specific, for every x the decision function α(x) assumes one of the a values α1, ..., αa. The overall risk R is the expected loss associated with a given decision rule. Since R(αi|x) is the conditional risk associated with action αi, and since the decision rule specifies the action, the overall risk is given by

$$R = \int R(\alpha(\mathbf{x})|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, \qquad (12)$$

where dx is our notation for a d-space volume element, and where the integral extends over the entire feature space. Clearly, if α(x) is chosen so that R(α(x)|x) is as small as possible for every x, then the overall risk will be minimized. This justifies the following statement of the Bayes decision rule: To minimize the overall risk, compute the conditional risk

$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}) \qquad (13)$$

for i = 1, ..., a and select the action αi for which R(αi|x) is minimum.* The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.

2.2.1 Two-Category Classification

Let us consider these results when applied to the special case of two-category classification problems. Here action α1 corresponds to deciding that the true state of nature is ω1, and action α2 corresponds to deciding that it is ω2. For notational simplicity, let λij = λ(αi|ωj) be the loss incurred for deciding ωi when the true state of nature is ωj. If we write out the conditional risk given by Eq. 13, we obtain

$$R(\alpha_1|\mathbf{x}) = \lambda_{11}P(\omega_1|\mathbf{x}) + \lambda_{12}P(\omega_2|\mathbf{x}) \quad \text{and} \quad R(\alpha_2|\mathbf{x}) = \lambda_{21}P(\omega_1|\mathbf{x}) + \lambda_{22}P(\omega_2|\mathbf{x}). \qquad (14)$$

There are a variety of ways of expressing the minimum-risk decision rule, each having its own minor advantages. The fundamental rule is to decide ω1 if R(α1|x) < R(α2|x). In terms of the posterior probabilities, we decide ω1 if

$$(\lambda_{21} - \lambda_{11})P(\omega_1|\mathbf{x}) > (\lambda_{12} - \lambda_{22})P(\omega_2|\mathbf{x}). \qquad (15)$$

Ordinarily, the loss incurred for making an error is greater than the loss incurred for being correct, and both of the factors λ21 − λ11 and λ12 − λ22 are positive. Thus in practice, our decision is generally determined by the more likely state of nature, although we must scale the posterior probabilities by the loss differences. By employing Bayes' formula, we can replace the posterior probabilities by the prior probabilities and the conditional densities. This results in the equivalent rule, to decide ω1 if

$$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x}|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x}|\omega_2)P(\omega_2), \qquad (16)$$

and ω2 otherwise. Another alternative, which follows at once under the reasonable assumption that λ21 > λ11, is to decide ω1 if

$$\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}. \qquad (17)$$

This form of the decision rule focuses on the x-dependence of the probability densities. We can consider p(x|ωj) a function of ωj (i.e., the likelihood function), and then form the likelihood ratio p(x|ω1)/p(x|ω2). Thus the Bayes decision rule can be interpreted as calling for deciding ω1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.

* Note that if more than one action minimizes R(α|x), it does not matter which of these actions is taken, and any convenient tie-breaking rule can be used.
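The likelihood-ratio form of Eq. 17 is a one-line test in code. This sketch reuses the hypothetical densities, priors, and loss matrix introduced above, and assumes λ21 > λ11 as the text does.

```python
def decide_by_likelihood_ratio(x):
    """Eq. 17: decide w1 when the likelihood ratio exceeds a threshold
    that does not depend on x. Uses the hypothetical densities, priors,
    and loss matrix from the sketches above; assumes lam[1,0] > lam[0,0]."""
    ratio = p_x_given_w1(x) / p_x_given_w2(x)
    theta = ((lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0])) * (P_w2 / P_w1)
    return 'w1' if ratio > theta else 'w2'
```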
2.3 Minimum-Error-Rate Classification

In classification problems, each state of nature is usually associated with a different one of the c classes, and the action αi is usually interpreted as the decision that the true state of nature is ωi. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j, and in error if i ≠ j. If errors are to be avoided, it is natural to seek a decision rule that minimizes the probability of error, i.e., the error rate. The loss function of interest for this case is hence the so-called symmetrical or zero-one loss function,

$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c. \qquad (18)$$

This loss function assigns no loss to a correct decision, and assigns a unit loss to any error; thus, all errors are equally costly.* The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is

$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}) = \sum_{j \neq i} P(\omega_j|\mathbf{x}) = 1 - P(\omega_i|\mathbf{x}) \qquad (19)$$

and P(ωi|x) is the conditional probability that action αi is correct. The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability P(ωi|x). In other words, for minimum error rate:

Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.  (20)

This is the same rule as in Eq. 6.

We saw in Fig. 2.2 some class-conditional probability densities and the posterior probabilities; Fig. 2.3 shows the likelihood ratio p(x|ω1)/p(x|ω2) for the same case. In general, this ratio can range between zero and infinity. The threshold value θa marked is from the same prior probabilities but with a zero-one loss function. Notice that this leads to the same decision boundaries as in Fig. 2.2, as it must. If we penalize mistakes in classifying ω2 patterns as ω1 more than the converse (i.e., λ12 > λ21), then Eq. 17 leads to the larger threshold θb marked. Note that the range of x values for which we classify a pattern as ω1 gets smaller, as it should.

Figure 2.3: The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse (i.e., λ12 > λ21), we get the larger threshold θb, and hence R1 becomes smaller.

* We note that other loss functions, such as quadratic and linear-difference, find greater use in regression tasks, where there is a natural ordering on the predictions and we can meaningfully penalize predictions that are "more wrong" than others.
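The collapse of risk minimization into posterior maximization under zero-one loss (Eqs. 18–20) can be checked directly, as in this small sketch (the helper names are ours, not the text's; the posteriors are assumed to sum to one):

```python
import numpy as np

def zero_one_loss(c):
    """Eq. 18: unit loss for every error, no loss for a correct decision."""
    return np.ones((c, c)) - np.eye(c)

def min_error_decision(post):
    """With zero-one loss the conditional risk (Eq. 19) is 1 - P(w_i|x),
    so minimizing risk is the same as maximizing the posterior (Eq. 20)."""
    post = np.asarray(post)                     # posteriors, summing to one
    risks = zero_one_loss(len(post)) @ post     # equals 1 - post
    assert np.allclose(risks, 1.0 - post)
    return int(post.argmax())
```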
2.3.1 *Minimax Criterion

Sometimes we must design our classifier to perform well over a range of prior probabilities. For instance, in our fish categorization problem we can imagine that whereas the physical properties of lightness and width of each type of fish remain constant, the prior probabilities might vary widely and in an unpredictable way, or alternatively we want to use the classifier in a different plant where we do not know the prior probabilities. A reasonable approach is then to design our classifier so that the worst overall risk for any value of the priors is as small as possible — that is, to minimize the maximum possible overall risk.

In order to understand this, we let R1 denote that (as yet unknown) region in feature space where the classifier decides ω1, and likewise for R2 and ω2, and then write our overall risk Eq. 12 in terms of conditional risks:

$$R = \int_{\mathcal{R}_1} \left[\lambda_{11}P(\omega_1)\,p(\mathbf{x}|\omega_1) + \lambda_{12}P(\omega_2)\,p(\mathbf{x}|\omega_2)\right] d\mathbf{x} + \int_{\mathcal{R}_2} \left[\lambda_{21}P(\omega_1)\,p(\mathbf{x}|\omega_1) + \lambda_{22}P(\omega_2)\,p(\mathbf{x}|\omega_2)\right] d\mathbf{x}. \qquad (21)$$

We use the fact that P(ω2) = 1 − P(ω1) and that $\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_1)\, d\mathbf{x} = 1 - \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x}$ to rewrite the risk as:

$$R(P(\omega_1)) = \underbrace{\lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x}}_{=\ R_{mm},\ \text{minimax risk}} + P(\omega_1)\underbrace{\left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x}\right]}_{=\ 0\ \text{for minimax solution}} \qquad (22)$$

This equation shows that once the decision boundary is set (i.e., R1 and R2 determined), the overall risk is linear in P(ω1). If we can find a boundary such that the constant of proportionality is 0, then the risk is independent of the priors. This is the minimax solution, and the minimax risk, Rmm, can be read from Eq. 22:

$$R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x} = \lambda_{11} + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x}. \qquad (23)$$

Figure 2.4 illustrates the approach. Briefly stated, we search for the prior for which the Bayes risk is maximum; the corresponding decision boundary gives the minimax solution. The value of the minimax risk, Rmm, is hence equal to the worst Bayes risk. In practice, finding the decision boundary for minimax risk may be difficult, particularly when distributions are complicated. Nevertheless, in some cases the boundary can be determined analytically (Problem 3).

The minimax criterion finds greater use in game theory than it does in traditional pattern recognition. In game theory, you have a hostile opponent who can be expected to take an action maximally detrimental to you. Thus it makes great sense for you to take an action (e.g., make a classification) where your costs — due to your opponent's subsequent actions — are minimized.

Figure 2.4: The curve at the bottom shows the minimum (Bayes) error as a function of prior probability P(ω1) in a two-category classification problem of fixed distributions. For each value of the priors (e.g., P(ω1) = 0.25) there is a corresponding optimal decision boundary and associated Bayes error rate. For any (fixed) such boundary, if the priors are then changed, the probability of error will change as a linear function of P(ω1) (shown by the dashed line). The maximum such error will occur at an extreme value of the prior, here at P(ω1) = 1. To minimize the maximum of such error, we should design our decision boundary for the maximum Bayes error (here P(ω1) = 0.6), and thus the error will not change as a function of prior, as shown by the solid red horizontal line.
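For a one-dimensional problem with a single decision threshold, the minimax condition of Eq. 22 can be solved numerically. This sketch again uses our hypothetical equal-variance Gaussians and zero-one loss; under those assumptions the bracketed coefficient of P(ω1) vanishes exactly when the two conditional error masses are equal.

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Hypothetical equal-variance Gaussians (see the earlier sketches), with
# zero-one loss and single-threshold regions R1 = {x < t}, R2 = {x >= t}.
F1 = norm(loc=12.0, scale=1.0).cdf   # class w1 (lower mean)
F2 = norm(loc=14.0, scale=1.0).cdf   # class w2 (higher mean)

# For zero-one loss, the coefficient of P(w1) in Eq. 22 is zero when
#   integral_{R2} p(x|w1) dx  =  integral_{R1} p(x|w2) dx.
t_mm = brentq(lambda t: (1.0 - F1(t)) - F2(t), 5.0, 21.0)

# Eq. 23 with lambda_11 = lambda_22 = 0 and lambda_12 = lambda_21 = 1:
R_mm = F2(t_mm)   # minimax risk, now independent of the priors
```

With these symmetric assumptions the root lands midway between the means, t = 13, and the risk stays at Rmm for every value of P(ω1).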
2.3.2 *Neyman-Pearson Criterion

In some problems, we may wish to minimize the overall risk subject to a constraint; for instance, we might wish to minimize the total risk subject to the constraint $\int R(\alpha_i|\mathbf{x})\, d\mathbf{x} < \text{constant}$ for some particular i. Such a constraint might arise when there is a fixed resource that accompanies one particular action αi, or when we must not misclassify patterns from a particular state of nature ωi at more than some limited frequency. For instance, in our fish example, there might be some government regulation that we must not misclassify more than 1% of salmon as sea bass. We might then seek a decision that minimizes the chance of classifying a sea bass as a salmon subject to this constraint.
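For the one-dimensional, single-threshold setting of our hypothetical Gaussians, this constrained problem has a simple numerical solution: fix the threshold so the constrained error rate is exactly at its limit, which (for densities with a monotone likelihood ratio, as here) minimizes the other error. This is a sketch under those assumptions, not a general Neyman-Pearson solver.

```python
from scipy.stats import norm

# Hypothetical densities once more; R1 = {x < t} decides "sea bass" (w1).
alpha = 0.01   # allowed rate for misclassifying salmon (w2) as sea bass

# P(decide w1 | salmon) = integral_{R1} p(x|w2) dx = F2(t); set it to alpha.
t_np = norm(loc=14.0, scale=1.0).ppf(alpha)

# Among all thresholds satisfying the constraint, this one minimizes the
# complementary error P(decide w2 | sea bass) = 1 - F1(t_np).
miss_sea_bass = 1.0 - norm(loc=12.0, scale=1.0).cdf(t_np)
```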
