

5 Fitness, Selection, and Population Management

As explained in Chap. 3, there are two fundamental forces that form the basis of evolutionary systems: variation and selection. In this chapter we discuss the EA components behind the second one. Having discussed some typical population management models and selection operators, we then go on to explicitly look at some situations where diversity is needed, such as multimodal problems, and some approaches to population management, and to altering selection, that have been proposed to increase useful diversity.

5.1 Population Management Models

In the previous chapter we focused on the way that potential solutions are represented to give a population of diverse individuals, and on the way that variation (recombination and mutation) operators work on those individuals to yield offspring. These offspring will generally inherit some of their parents' properties but also differ slightly from them, providing new potential solutions to be evaluated. We now turn our attention to the second important element of the evolutionary process – the differential survival of individuals to compete for resources and take part in reproduction, based on their relative fitness.

Two different models of population management are found in the literature: the generational model and the steady-state model. The generational model is the one used in the example in Sect. 3.3. In each generation we begin with a population of size µ, from which a mating pool of parents is selected. Every member of the pool is a copy of something in the population, but the proportions will probably differ, with (usually) more copies of the 'better' parents. Next, λ offspring are created from the mating pool by the application of variation operators, and evaluated. After each generation, the whole population is replaced by µ individuals selected from its offspring, which is called the next generation. In the model typically used within the Simple Genetic Algorithm, the population, mating pool and offspring are all the same size, so that each generation is replaced by all of its offspring. This restriction is not necessary: for example, in the (µ, λ) Evolution Strategy an excess of offspring is created (typically λ/µ is in the range 5–7), from which the next generation is selected on the basis of fitness.

In the steady-state model, the entire population is not changed at once, but rather a part of it. In this case, λ (< µ) old individuals are replaced by λ new ones, the offspring. The proportion of the population that is replaced is called the generational gap, and is equal to λ/µ. Since its introduction in Whitley's GENITOR algorithm [460], the steady-state model has been widely studied and applied [105, 354, 442], often with λ = 1.

At this stage it is worth reiterating that the operators responsible for this competitive element of population management work on the basis of an individual's fitness. As a direct consequence, these selection and replacement operators work independently of the problem representation chosen.
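To make the distinction concrete, here is a minimal Python sketch of one cycle of each model. The helper names (fitness, select_parents, make_offspring) are illustrative placeholders, not part of the book or of any library; survivor selection is fitness-based truncation in the generational step and replace-worst in the steady-state step, which are just two of the possible choices discussed later in this chapter.

def generational_step(population, fitness, mu, lam, select_parents, make_offspring):
    # mu parents produce lam offspring; the next generation of mu individuals
    # is chosen from the offspring alone (lam = mu in the Simple GA)
    mating_pool = select_parents(population, lam)
    offspring = [make_offspring(mating_pool) for _ in range(lam)]
    return sorted(offspring, key=fitness, reverse=True)[:mu]

def steady_state_step(population, fitness, lam, select_parents, make_offspring):
    # only lam (< mu) individuals are replaced, here the worst lam,
    # so the generational gap is lam / mu
    mating_pool = select_parents(population, lam)
    offspring = [make_offspring(mating_pool) for _ in range(lam)]
    survivors = sorted(population, key=fitness, reverse=True)[:len(population) - lam]
    return survivors + offspring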
As was seen in the general description of an evolutionary algorithm at the start of Chap. 3, there are two points in the evolutionary cycle at which fitness-based competition can occur: during selection to take part in mating, and during the selection of individuals to survive into the next generation. We begin by describing the most commonly used methods for parent selection, but note that many of these can also be applied during the survivor selection phase.

As a final preliminary, please note that we will adopt the convention that we are trying to maximise fitness, and that fitness values are not negative. Often problems are expressed in terms of an objective function to be minimised, and sometimes negative fitness values occur. However, in all cases these can be mapped into the desired form by using an appropriate transformation.

5.2 Parent Selection

5.2.1 Fitness Proportional Selection

The principles of fitness proportional selection (FPS) were described in the simple example in Sect. 3.3. Recall that for each choice, the probability that an individual i is selected for mating depends on its absolute fitness value compared to the absolute fitness values of the rest of the population. Observing that the sum of the probabilities over the whole population must equal 1, the selection probability of individual i using FPS is

    P_FPS(i) = f_i / Σ_{j=1}^{µ} f_j .

This selection mechanism was introduced in [220] and has been the topic of intensive study ever since, not least because it happens to be particularly amenable to theoretical analysis. However, it has been recognised that there are some problems with this selection mechanism:

• Outstanding individuals take over the entire population very quickly. This tends to focus the search process, and makes it less likely that the algorithm will thoroughly search the space of possible solutions, where better solutions may exist. This phenomenon is often observed in early generations, when many of the randomly created individuals will have low fitness, and is known as premature convergence.
• When fitness values are all very close together, there is almost no selection pressure, so selection is almost uniformly random, and having a slightly better fitness is not very 'useful' to an individual. Therefore, later in a run, when some convergence has taken place and the worst individuals are gone, it is typically observed that the mean population fitness only increases very slowly.
• The mechanism behaves differently if the fitness function is transposed.

This last point is illustrated in Table 5.1, which shows three individuals and a fitness function with f(A) = 1, f(B) = 4, and f(C) = 5. Transposing this fitness function changes the selection probabilities, while the shape of the fitness landscape, and hence the location of the optimum, remains the same.

Individual | Fitness for f | Sel. prob. for f | Fitness for f+10 | Sel. prob. for f+10 | Fitness for f+100 | Sel. prob. for f+100
A          | 1             | 0.1              | 11               | 0.275               | 101               | 0.326
B          | 4             | 0.4              | 14               | 0.35                | 104               | 0.335
C          | 5             | 0.5              | 15               | 0.375               | 105               | 0.339
Sum        | 10            | 1.0              | 40               | 1.0                 | 310               | 1.0

Table 5.1. Transposing the fitness function changes selection probabilities for fitness-proportionate selection
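The following short Python sketch computes the FPS probabilities and reproduces the transposition effect of Table 5.1; the function name and layout are ours, for illustration only.

def fps_probabilities(fitnesses):
    # P_FPS(i) = f_i / sum_j f_j (assumes non-negative fitnesses)
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

f = [1, 4, 5]                                  # individuals A, B, C
for offset in (0, 10, 100):
    probs = fps_probabilities([fi + offset for fi in f])
    print(offset, [round(p, 3) for p in probs])
# offset 0   -> [0.1, 0.4, 0.5]
# offset 10  -> [0.275, 0.35, 0.375]
# offset 100 -> [0.326, 0.335, 0.339]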
To avoid the second two problems with FPS, a procedure known as windowing is often used. Under this scheme, fitness differentials are maintained by subtracting from the raw fitness f(x) a value β^t, which depends in some way on the recent search history, and so can change over time (hence the superscript t). The simplest approach is just to subtract the value of the least-fit member of the current population P^t, by setting

    β^t = min_{y ∈ P^t} f(y) .

This value may fluctuate quite rapidly, so one alternative is to use a running average over the last few generations. Another well-known approach is sigma scaling [189], which incorporates information about the mean f̄ and standard deviation σ_f of fitnesses in the population:

    f'(x) = max(f(x) − (f̄ − c · σ_f), 0) ,

where c is a constant value, usually set to 2.

5.2.2 Ranking Selection

Rank-based selection is another method that was inspired by the observed drawbacks of fitness proportionate selection [32]. It preserves a constant selection pressure by sorting the population on the basis of fitness, and then allocating selection probabilities to individuals according to their rank, rather than according to their actual fitness values. Let us assume that the ranks are numbered so that an individual's rank denotes how many worse solutions are in the population, so the best has rank µ − 1 and the worst has rank 0. The mapping from rank number to selection probability can be done in many ways, for example, linearly or exponentially decreasing. As with FPS above, and any selection scheme, we insist that the sum over the population of the selection probabilities must be unity – that is, we must select one of the parents.

The usual formula for calculating the selection probability for linear ranking schemes is parameterised by a value s (1 < s ≤ 2). In the case of a generational EA, where µ = λ, this can be interpreted as the expected number of offspring allotted to the fittest individual. Since this individual has rank µ − 1, and the worst has rank 0, the selection probability for an individual of rank i is:

    P_lin-rank(i) = (2 − s)/µ + 2i(s − 1)/(µ(µ − 1)) .

Note that the first term will be constant for all individuals (it is there to ensure the probabilities add to one). Since the second term will be zero for the worst individual (with rank i = 0), the first term can be thought of as the 'baseline' probability of selecting that individual. In Table 5.2 we show an example of how the selection probabilities differ for a population of µ = 3 different individuals with fitness proportionate and rank-based selection with different values of s.

Individual | Fitness | Rank | P_selFP | P_selLR (s = 2) | P_selLR (s = 1.5)
A          | 1       | 0    | 0.1     | 0               | 0.167
B          | 4       | 1    | 0.4     | 0.33            | 0.33
C          | 5       | 2    | 0.5     | 0.67            | 0.5
Sum        | 10      |      | 1.0     | 1.0             | 1.0

Table 5.2. Fitness proportionate (FP) versus linear ranking (LR) selection

When the mapping from rank to selection probabilities is linear, only limited selection pressure can be applied. This arises from the assumption that, on average, an individual of median fitness should have one chance to be reproduced, which in turn imposes a maximum value of s = 2. (Since the scaling is linear, letting s > 2 would require the worst to have a negative selection probability if the probabilities are to sum to unity.) If a higher selection pressure is required, i.e., more emphasis on selecting individuals of above-average fitness, an exponential ranking scheme is often used, of the form:

    P_exp-rank(i) = (1 − e^{−i}) / c .

The normalisation factor c is chosen so that the sum of the probabilities is unity, i.e., it is a function of the population size.
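As a quick illustration, the probabilities in Table 5.2 can be reproduced with a few lines of Python; the function names are ours, and ranks follow the convention above (worst = 0, best = µ − 1).

import math

def linear_ranking(mu, s=1.5):
    # P_lin-rank(i) = (2-s)/mu + 2i(s-1)/(mu(mu-1)), for 1 < s <= 2
    return [(2 - s) / mu + 2 * i * (s - 1) / (mu * (mu - 1)) for i in range(mu)]

def exponential_ranking(mu):
    # P_exp-rank(i) = (1 - e^-i) / c, with c normalising the sum to 1
    weights = [1 - math.exp(-i) for i in range(mu)]
    c = sum(weights)
    return [w / c for w in weights]

print([round(p, 3) for p in linear_ranking(3, s=2)])    # [0.0, 0.333, 0.667]
print([round(p, 3) for p in linear_ranking(3, s=1.5)])  # [0.167, 0.333, 0.5]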
5.2.3 Implementing Selection Probabilities

The description above provides two alternative schemes for deciding a probability distribution that defines the likelihood of each individual in the population being selected for reproduction. In an ideal world, the mating pool of parents taking part in recombination would have exactly the same proportions as this selection probability distribution. This would mean that the number of copies of any given individual would be given by its selection probability, multiplied by the size of the mating pool. However, in practice this is not possible because of the finite size of the population, i.e., when we do this multiplication, we typically find that some individuals have an expected number of copies which is noninteger – whereas of course in practice we need to select complete individuals. In other words, the mating pool of parents is sampled from the selection probability distribution, but will not in general accurately reflect it, as was seen in the example in Sect. 3.3.

The simplest way of achieving this sampling is known as the roulette wheel algorithm. Conceptually this is the same as repeatedly spinning a one-armed roulette wheel, where the sizes of the holes reflect the selection probabilities. In general, the algorithm can be applied to select λ members from the set of µ parents into a mating pool. To illustrate the workings of this algorithm, we will assume some order over the population (ranking or random) from 1 to µ, so that we can calculate the cumulative probability distribution, which is a list of values [a_1, a_2, ..., a_µ] such that a_i = Σ_{k=1}^{i} P_sel(k), where P_sel is defined by the selection distribution – fitness proportionate or ranking. Note that this implies a_µ = 1. The outline of the algorithm is given in Fig. 5.1.

BEGIN
  /* Given the cumulative probability distribution a */
  /* and assuming we wish to select λ members of the mating pool */
  set current_member = 1;
  WHILE ( current_member ≤ λ ) DO
    Pick a random value r uniformly from [0, 1];
    set i = 1;
    WHILE ( a[i] < r ) DO
      set i = i + 1;
    OD
    set mating_pool[current_member] = parents[i];
    set current_member = current_member + 1;
  OD
END

Fig. 5.1. Pseudocode for the roulette wheel algorithm

Despite its inherent simplicity, it has been recognised that the roulette wheel algorithm does not in fact give a particularly good sample of the required distribution. Whenever more than one sample is to be drawn from the distribution – for instance λ – the use of the stochastic universal sampling (SUS) algorithm [32] is preferred. Conceptually, this is equivalent to making one spin of a wheel with λ equally spaced arms, rather than λ spins of a one-armed wheel. Given the same list of cumulative selection probabilities [a_1, a_2, ..., a_µ], it selects the mating pool as described in Fig. 5.2.

BEGIN
  /* Given the cumulative probability distribution a */
  /* and assuming we wish to select λ members of the mating pool */
  set current_member = i = 1;
  Pick a random value r uniformly from [0, 1/λ];
  WHILE ( current_member ≤ λ ) DO
    WHILE ( r ≤ a[i] ) DO
      set mating_pool[current_member] = parents[i];
      set r = r + 1/λ;
      set current_member = current_member + 1;
    OD
    set i = i + 1;
  OD
END

Fig. 5.2. Pseudocode for the stochastic universal sampling algorithm making λ selections
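Both figures translate readily into Python. The sketch below is one possible rendering, not the book's reference implementation; it takes the cumulative distribution a as input, returns indices into the parent list, and adds a small guard against floating-point round-off in the last cumulative entry.

import random
from itertools import accumulate

def roulette_wheel(a, lam):
    # lam independent spins of a one-armed wheel
    selected = []
    for _ in range(lam):
        r = random.random()
        i = 0
        while i < len(a) - 1 and a[i] < r:
            i += 1
        selected.append(i)
    return selected

def sus(a, lam):
    # one spin of a wheel with lam equally spaced arms
    selected = []
    r = random.uniform(0, 1 / lam)
    i = 0
    while len(selected) < lam:
        while i < len(a) - 1 and r > a[i]:   # advance to the section holding r
            i += 1
        selected.append(i)
        r += 1 / lam                          # move to the next arm
    return selected

probs = [0.1, 0.4, 0.5]                       # any parent selection distribution
a = list(accumulate(probs))                   # cumulative: [0.1, 0.5, 1.0]
print(sus(a, 10))                             # index i appears ~10 * probs[i] times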
Since the value of the variable r is initialised in the range [0, 1/λ] and increases by an amount 1/λ every time a selection is made, it is guaranteed that the number of copies made of each parent i is at least the integer part of λ · P_sel(i) and is no more than one greater. Finally, we should note that with minor changes to the code, SUS can be used to make any number of selections from the parents, and in the case of making just one selection, it is the same as the roulette wheel.

5.2.4 Tournament Selection

The previous two selection methods and the algorithms used to sample from their probability distributions relied on a knowledge of the entire population. However, in certain situations, for example, if the population size is very large, or if the population is distributed in some way (perhaps on a parallel system), obtaining this knowledge is either highly time consuming or at worst impossible. Furthermore, both methods assume that fitness is a quantifiable measure (based on some explicit objective function to be optimised), which may not be valid. Think, for instance, of an application evolving game playing strategies. In this case we might not be able to quantify the strength of a given individual (strategy) in isolation, but we can compare any two of them by simulating a game played by these strategies as opponents. Similar situations occur also in evolutionary design and evolutionary art applications [48, 49]. In these the user typically makes a subjective selection by comparing individuals representing designs or pieces of art, rather than using a quantitative measure to assign fitness, cf. Sect. 14.1.

Tournament selection is an operator with the useful property that it does not require any global knowledge of the population, nor a quantifiable measure of quality. Instead it only relies on an ordering relation that can compare and rank any two individuals. It is therefore conceptually simple and fast to implement and apply. The application of tournament selection to select λ members of a pool of µ individuals works according to the procedure shown in Fig. 5.3.

BEGIN
  /* Assume we wish to select λ members of a pool of µ individuals */
  set current_member = 1;
  WHILE ( current_member ≤ λ ) DO
    Pick k individuals randomly, with or without replacement;
    Compare these k individuals and select the best of them;
    Denote this individual as i;
    set mating_pool[current_member] = i;
    set current_member = current_member + 1;
  OD
END

Fig. 5.3. Pseudocode for the tournament selection algorithm
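A possible Python rendering of Fig. 5.3 is given below; note that only the pairwise comparison `better` is ever needed, never a global fitness value. The function signature is ours, for illustration.

import random

def tournament_selection(population, lam, k, better, replacement=True):
    # select lam individuals, each the winner among k random entrants
    pool = []
    for _ in range(lam):
        if replacement:
            entrants = [random.choice(population) for _ in range(k)]
        else:
            entrants = random.sample(population, k)
        winner = entrants[0]
        for challenger in entrants[1:]:
            if better(challenger, winner):
                winner = challenger
        pool.append(winner)
    return pool

# example: maximising a numeric fitness stored in a dict
pop = [{"x": i, "fitness": i * i % 7} for i in range(10)]
pool = tournament_selection(pop, lam=5, k=2,
                            better=lambda a, b: a["fitness"] > b["fitness"])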
Because tournament selection looks at relative rather than absolute fitness, it has the same properties as ranking schemes in terms of invariance to translation and transposition of the fitness function. The probability that an individual will be selected as the result of a tournament depends on four factors, namely:

• Its rank in the population. Effectively this is estimated without the need for sorting the whole population.
• The tournament size k. The larger the tournament, the greater the chance that it will contain members of above-average fitness, and the less that it will consist entirely of low-fitness members. Thus the probability of selecting a high-fitness member increases, and that of selecting a low-fitness member decreases, as k is increased. Hence we say that increasing k increases the selection pressure.
• The probability p that the most fit member of the tournament is selected. Usually this is 1 (deterministic tournaments), but stochastic versions are also used with p < 1. Since this makes it more likely that a less-fit member will be selected, decreasing p will decrease the selection pressure.
• Whether individuals are chosen with or without replacement. In the second case, with deterministic tournaments, the k − 1 least-fit members of the population can never be selected, since the other member of the tournament will be fitter. However, if the tournament candidates are picked with replacement, it is always possible for even the least-fit member of the population to be selected, since with probability 1/µ^k > 0 all tournament candidates will be copies of that member.

These properties of tournament selection were characterised in [20, 58], and it was shown [190] that for binary (k = 2) tournaments with parameter p the expected time for a single individual of high fitness to take over the population is the same as that for linear ranking with s = 2p. However, since λ tournaments are required to produce λ selections, it suffers from the same problems as the roulette wheel algorithm, in that the outcomes can show a high variance from the theoretical probability distribution. Despite this drawback, tournament selection is perhaps the most widely used selection operator in some EC dialects (in particular, Genetic Algorithms), due to its extreme simplicity and the fact that the selection pressure is easy to control by varying the tournament size k.

5.2.5 Uniform Parent Selection

In some dialects of EC it is common to use mechanisms such that each individual has the same chance to be selected. At first sight this might appear to suggest that there is no selection pressure in the algorithm, which would indeed be true if it were not coupled with a strong fitness-based survivor selection mechanism. In Evolutionary Programming, usually there is no recombination, only mutation, and parent selection is deterministic. In particular, each parent produces exactly one child by mutation. Evolution Strategies are also usually implemented with uniform random selection of parents into the mating pool, i.e., for each 1 ≤ i ≤ µ we have P_uniform(i) = 1/µ.

5.2.6 Overselection for Large Populations

In some cases it may be desirable to work with extremely large populations. Sometimes this could be for technical reasons – for example, there has been a lot of interest in implementing EAs using graphics cards (GPUs), which offer similar speed-up to clusters or supercomputers, but at much lower cost. However, achieving the maximum potential speed-up typically depends on having a large population on each processing node. Regardless of the implementation details, if the potential search space is enormous it might be a good idea to use a large population to avoid 'missing' promising regions in the initial random generation, and thereafter to maintain the diversity needed to support exploration. For example, in Genetic Programming it is not unusual to use population sizes of several thousands: in 1994 [254] used 1000; in 1996 [7] used 128,000; and in 1999 [255] used 1,120,000 individuals. In the latter case, a method called overselection is often used for population sizes of 1000 and above. In this method, the population is first ranked by fitness and then divided into two groups, the top x% in one and the remaining (100 − x)% in the other.
When parents are selected, 80% of the selection operations choose from the first group, and the other 20% from the second. Koza [252] provides rule-of-thumb values for x depending on the population size, as shown in Table 5.3. As can be seen, the number of individuals from which the majority of parents are chosen stays constant, i.e., the selection pressure increases dramatically for larger populations.

Population size | Proportion of population in fitter group (x)
1000            | 32%
2000            | 16%
4000            | 8%
8000            | 4%

Table 5.3. Rule-of-thumb values for overselection: proportion of ranked population in the fitter subpopulation, from which the majority of parents are selected
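A minimal sketch of this scheme in Python might look as follows, assuming the 80/20 split described above; the function name and the fallback when the second group is empty are our own choices.

import random

def overselect(population, fitness, x_percent):
    # pick one parent: 80% of draws from the fittest x%, 20% from the rest
    ranked = sorted(population, key=fitness, reverse=True)
    cut = max(1, int(len(ranked) * x_percent / 100))
    group1, group2 = ranked[:cut], ranked[cut:]
    if random.random() < 0.8 or not group2:
        return random.choice(group1)
    return random.choice(group2)

# e.g., for a population of 2000, Table 5.3 suggests x_percent = 16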
5.3 Survivor Selection

The survivor selection mechanism is responsible for managing the process of reducing the working memory of the EA from a set of µ parents and λ offspring to a set of µ individuals forming the next generation. In principle, any of the mechanisms introduced for parent selection could also be used for selecting survivors. However, over the history of EC a number of special survivor selection strategies have been suggested and are widely used. As explained in Sect. 3.2.6, this step in the main evolutionary cycle is also called replacement. In the present section we often use this latter term to be consistent with the literature. Replacement strategies can be categorised according to whether they discriminate on the basis of the fitness or the age of individuals.

5.3.1 Age-Based Replacement

The basis of these schemes is that the fitness of individuals is not taken into account during the selection of which individuals to replace in the population. Instead, they are designed so that each individual exists in the population for the same number of EA iterations. This does not preclude the possibility that copies of highly fit individuals might persist in the population, but for this to happen they must be chosen at least once in the selection phase and then survive the recombination and mutation stages without being modified. Note that since fitness is not taken into account, the mean, and even the best, fitness of any given generation may be lower than that of its predecessor. While slightly counterintuitive, this is not a problem as long as it does not happen too often, and may even be beneficial if the population is concentrated around a local optimum. A net increase in the mean fitness over time therefore relies on (i) having sufficient selection pressure when selecting parents into the mating pool, and (ii) using variation operators that are not too disruptive.

Age-based replacement is the strategy used in the simple Genetic Algorithm. Since the number of offspring produced is the same as the number of parents (µ = λ), each individual exists for just one cycle, and the parents are simply discarded, to be replaced by the entire set of offspring. This is the generational model, but in fact this replacement strategy can also be implemented in a steady-state setting with overlapping populations (λ < µ), right to the other extreme where a single offspring is created and inserted in the population in each cycle. In this case the strategy takes the form of a first-in-first-out (FIFO) queue. An alternative method of age-based replacement for steady-state GAs is to randomly select a parent for replacement. A straightforward mathematical argument based on the population size being fixed tells us that this probabilistic strategy has the same mean effect – that is, on average individuals live for µ iterations. De Jong and Sarma [105] investigated this strategy experimentally, and found that the algorithm showed higher variance in performance than a comparable generational GA. Smith and Vavak [400] showed that this was because the random strategy is far more likely to lose the best member of the population than a delete-oldest (FIFO) strategy. For these reasons the random replacement strategy is not recommended.

5.3.2 Fitness-Based Replacement

A wide variety of strategies based on fitness have been proposed for choosing which µ of the µ parents plus λ offspring should go forward to the next generation. Some also take age into account.

Replace worst (GENITOR). In this scheme the worst λ members of the population are selected for replacement. Although this can lead to very rapid improvements in the mean population fitness, it can also lead to premature convergence as the population tends to rapidly focus on the fittest member currently present. For this reason it is commonly used in conjunction with large populations and/or a "no duplicates" policy.

Elitism. This scheme is commonly used in conjunction with age-based and stochastic fitness-based replacement schemes, to prevent the loss of the current fittest member of the population. In essence a trace is kept of the current fittest member, and it is always kept in the population. Thus if it is chosen in the group to be replaced, and none of the offspring being inserted into the population has equal or better fitness, then it is kept and one of the offspring is discarded.

Round-robin tournament. This mechanism was introduced within Evolutionary Programming, where it is applied to choose µ survivors. However, in principle, it can also be used to select λ parents from a given population of µ. The method works by holding pairwise tournament competitions in round-robin format, where each individual is evaluated against q others randomly chosen from the merged parent and offspring populations. For each comparison, a "win" is assigned if the individual is better than its opponent. After finishing all tournaments, the µ individuals with the greatest number of wins are selected. Typically, q = 10 is recommended in Evolutionary Programming. It is worth noting that this stochastic variant of selection allows less-fit solutions to be selected if they had a lucky draw of opponents. As the value of q increases this chance becomes more and more unlikely, until in the limit the mechanism becomes a deterministic (µ + µ) selection.
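A possible Python sketch of this scheme is shown below; ties are broken arbitrarily, and the "win" rule (better or equal scores a point) is one reasonable reading, not a prescription from the text.

import random

def round_robin_survivors(parents, offspring, fitness, mu, q=10):
    # each candidate meets q randomly drawn opponents from the merged pool
    candidates = parents + offspring
    wins = [sum(1 for _ in range(q)
                if fitness(ind) >= fitness(random.choice(candidates)))
            for ind in candidates]
    # keep the mu candidates with the greatest number of wins
    order = sorted(range(len(candidates)), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order[:mu]]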
(µ + λ) selection. The name and notation of (µ + λ) selection come from Evolution Strategies. In general, it refers to the case where the sets of offspring and parents are merged and ranked according to (estimated) fitness, and then the top µ are kept to form the next generation. This strategy can be seen as a generalisation of the GENITOR method (µ > λ) and of the round-robin tournament in Evolutionary Programming (µ = λ). In Evolution Strategies λ > µ, with a great offspring surplus (typically λ/µ ≈ 5–7) that induces a large selection pressure.

(µ, λ) selection. The (µ, λ) strategy, also used in Evolution Strategies, typically creates λ > µ children from a population of µ parents. This method works on a mixture of age and fitness. The age component means that all the parents are discarded, so no individual is kept for more than one generation (although of course copies of it might exist later). The fitness component comes from the fact that the λ offspring are ranked according to their fitness, and the best µ form the next generation. In Evolution Strategies, (µ, λ) selection is generally preferred over (µ + λ) selection for the following reasons:

• The (µ, λ) strategy discards all parents and is therefore in principle able to leave (small) local optima. This may be advantageous in a multimodal search space with many local optima.
• If the fitness function is not fixed, but changes in time, (µ + λ) selection preserves outdated solutions, so it is not able to follow the moving optimum well.
• (µ + λ) selection hinders the self-adaptation mechanism used to adapt strategy parameters, cf. Sect. 6.2.
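Both schemes reduce to a sort followed by truncation, as the following sketch (assuming maximisation and a caller-supplied numeric fitness) shows.

def plus_selection(parents, offspring, fitness, mu):
    # (mu + lambda): parents and offspring compete together
    return sorted(parents + offspring, key=fitness, reverse=True)[:mu]

def comma_selection(offspring, fitness, mu):
    # (mu, lambda): all parents are discarded; requires lambda >= mu
    return sorted(offspring, key=fitness, reverse=True)[:mu]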
5.4 Selection Pressure

Throughout this chapter we have referred rather informally to the notion of selection pressure, using the intuitive description that as selection pressure increases, fitter solutions are more likely to survive, or be chosen as parents, and less-fit solutions are correspondingly less likely. A number of measures have been proposed for quantifying this, and studied theoretically, of which the best known is the takeover time. The takeover time τ* of a given selection mechanism is defined as the number of generations it takes until the application of selection completely fills the population with copies of the best individual, given one copy initially. Goldberg and Deb [190] showed that

    τ* = ln λ / ln(λ/µ) .

For a typical evolution strategy with µ = 15 and λ = 100, this results in τ* ≈ 2. For fitness proportional selection in a genetic algorithm it is τ* = λ ln λ, resulting in τ* ≈ 460 for population size λ = 100. Other authors have extended this analysis to other strategies in generational and steady-state population models [79, 400]; Rudolph applied it to different population structures such as rings [360], and also considered a range of other measures of selection operators' performance, such as the 'Diversity Indicator'. Other measures of selection pressure have been proposed, including the 'Expected Loss of Diversity' [310], which is the expected change in the number of diverse solutions after µ selection events, and, from theoretical biology, the 'Selection Intensity', which is the expected relative increase in mean population fitness after applying a selection operator.

While these measures can help in understanding the effect of different strategies, they can also be rather misleading, since they consider selection alone, rather than in the context of variation operators providing diversity. Smith [390] derived mathematical expressions for a number of these indicators considering a wide range of replacement strategies in steady-state EAs. Experiments bore out the analytic results, and a benchmark comparison using well-known test problems showed that both the mean and variance of the takeover time could correctly predict the relative ordering of the mean and variance of the time taken to first locate the global optimum. However, for many applications of EAs the most important measure is the quality of the best solution found, and also possibly the diversity of good solutions discovered. Smith's results showed that in fact none of the theoretical measures were particularly indicative of the relative performance of different algorithms in these terms.
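These figures are easy to verify numerically; the following few lines are just a sanity check of the formulas above.

import math

def takeover_es(mu, lam):
    # Goldberg and Deb: tau* = ln(lambda) / ln(lambda/mu)
    return math.log(lam) / math.log(lam / mu)

def takeover_fps(lam):
    # fitness proportional selection: tau* = lambda * ln(lambda)
    return lam * math.log(lam)

print(takeover_es(15, 100))   # ~2.43, i.e., roughly the two generations quoted above
print(takeover_fps(100))      # ~460.5 generations for lambda = 100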
5.5 Multimodal Problems, Selection, and the Need for Diversity

5.5.1 Multimodal Problems

In Sects. 2.3.1 and 3.5 we introduced the concept of multimodal search landscapes and local optima. We discussed how effective search relies on the preservation of sufficient diversity, to allow both exploitation of learned information (by investigating regions containing the high-fitness solutions discovered so far) and exploration, in order to uncover new high-fitness regions. Multimodality is a typical aspect of the type of problems for which EAs are often employed, either in an attempt to locate the global optimum (particularly when a local optimum has the largest basin of attraction), or to identify a number of high-fitness solutions corresponding to various local optima. The latter situation can often arise, for example, when the fitness function used by the EA does not completely specify the underlying problem. An example of this might be the design of a new widget, where the parameters of the fitness function may change during the design process, as progressively more refined and detailed models are used as decisions such as the choice of materials, etc., are made. In this situation it is valuable to be able to examine a number of possible options: first so as to permit room for human aesthetic judgements, and second because it is probably desirable to use solutions from niches with broader peaks rather than from a sharp peak. This is because the latter may be overfitted (that is, overly specialised) to the current fitness function and may not be as good once the fitness function is refined.

The population-based nature of EAs holds out much promise for identifying multiple optima; however, in practice the finite population size, when coupled with recombination between any parents (known as panmictic mixing), leads to the phenomenon known as genetic drift and eventual convergence around one optimum. The reasons for this can easily be seen: imagine that we have two equally fit niches, and a population of 100 individuals originally equally divided between them. Eventually, because of the random effects in selection, it is likely that we will obtain a parent population consisting of 49 of one sort and 51 of the other. Ignoring the effects of recombination and mutation, in the next generation the probabilities of selecting individuals from the two niches are now 0.49 and 0.51 respectively, i.e., we are increasingly likely to select individuals from the second niche. This effect increases as the two subpopulations become unbalanced, until eventually we end up with only one niche represented in the population.

5.5.2 Characterising Selection and Population Management Approaches for Preserving Diversity

A number of mechanisms have been proposed to aid the use of EAs on multimodal problems. These can be broadly separated into two camps: explicit approaches, in which specific changes are made to operators in order to preserve diversity, and implicit approaches, in which a framework is used that permits, but does not guarantee, the preservation of diverse solutions. Before describing these it is useful to clarify exactly what we mean by 'diversity' and 'space'. Just as biological evolution takes place on a geographic surface, but can also be considered to occur on an adaptive landscape, so we can define a number of spaces within which evolutionary algorithms operate:

• Genotype Space: We may perceive the set of representable solutions as a genotype space and define some distance metric on it. This can be a natural distance metric in that space (e.g., the Manhattan distance) or one based on some fundamental move operator. Typical move operators include a single bit-flip for binary spaces, a single inversion for adjacency-based permutation problems, and a single swap for order-based permutation problems.
• Phenotype Space: This is the end result: a search space whose structure is based on distance metrics between solutions. The neighbourhood structure in this space may bear little relationship to that in the genotype space, according to the complexity of the representation–solution mapping.
• Algorithmic Space: This is the equivalent of the geographical space on which life on Earth has evolved. Effectively we are considering that the working memory of the EA, that is, the population of candidate solutions, can be structured in some way. This spatial structure could be either conceptual or real: for example, a population might be split over a number of processors or cores.

Explicit approaches to diversity maintenance based on measures of either genotype or phenotype space include Fitness Sharing (Sect. 5.5.3), Crowding (Sect. 5.5.4), and Speciation (Sect. 5.5.5), all of which work by affecting the probability distributions used by selection. Implicit approaches to diversity maintenance based on the concept of algorithmic space include Island Model EAs (Sect. 5.5.6) and Cellular EAs (Sect. 5.5.7).

5.5.3 Fitness Sharing

This scheme is based upon the idea that the number of individuals within a given niche is controlled by sharing their fitness immediately prior to selection, in an attempt to allocate individuals to niches in proportion to the niche fitness [193]. In practice the scheme considers each possible pairing of individuals i and j within the population (including i with itself) and calculates a distance d(i, j) between them according to some distance metric (phenotypic is preferred if possible, else genotypic, e.g., Hamming distance for binary representations). The fitness F of each individual i is then adjusted according to the number of individuals falling within some prespecified distance σ_share, using a power-law sharing function:

    F'(i) = F(i) / Σ_j sh(d(i, j)) ,

where the sharing function sh(d) is a function of the distance d, given by

    sh(d) = 1 − (d/σ_share)^α   if d ≤ σ_share,
    sh(d) = 0                   otherwise.

The constant value α determines the shape of the sharing function: for α = 1 the function is linear, but for values greater than this the effect of similar individuals in reducing a solution's fitness falls off more rapidly with distance. The value of the share radius σ_share determines both how many niches can be maintained and the granularity with which different niches can be discriminated. Deb [114] gives some suggestions for how this might be set if the number of niches is known in advance, but clearly this is not always the case. In [110] he suggests that a default value in the range 5–10 should be used.

We should point out that the use of fitness proportionate selection is implicit within the fitness-sharing method. In this case there exists a stable distribution of solutions amongst the niches when solutions from each peak have the same effective fitness F'. Since the niche fitness is F'_k = F_k / n_k, in this stable distribution each niche k contains a number of solutions n_k proportional to the niche fitness F_k (this assumes, for the sake of simplicity, that all solutions within a given niche lie at its optimal point, at zero distance from each other). This point is illustrated in Fig. 5.4. Studies have indicated that the use of alternative selection methods does not lead to the formation and preservation of stable subpopulations in niches [324].
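A direct Python sketch of the sharing computation follows; `distance` and `fitness` are caller-supplied placeholders. Note that the niche count is always at least 1, because each individual is compared with itself and sh(0) = 1.

def shared_fitness(population, fitness, distance, sigma_share, alpha=1.0):
    # returns the adjusted fitness F'(i) for every individual
    def sh(d):
        return 1 - (d / sigma_share) ** alpha if d <= sigma_share else 0.0
    adjusted = []
    for i in population:
        niche_count = sum(sh(distance(i, j)) for j in population)  # includes i itself
        adjusted.append(fitness(i) / niche_count)
    return adjusted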
5.5.4 Crowding

The crowding algorithm was first suggested in De Jong's thesis [102] as a way of preserving diversity by ensuring that new individuals replaced similar members of the population. The original scheme worked in a steady-state setting (the number of new individuals generated in each step was 20% of the population size). When a new offspring is inserted into the population, a number of members of the parent population – called the Crowding Factor (CF); De Jong used CF = 2 – are chosen at random, and the offspring then replaces the most similar of them. A number of problems were found with this approach, and Mahfoud has suggested an improvement called deterministic crowding [278]. This algorithm relies on the fact that offspring are likely to be similar to their parents, and proceeds as follows:

1. The parent population is randomly paired.
2. Each pair produces two offspring via recombination.
3. These offspring are mutated and then evaluated.
4. The four pairwise distances between offspring and parents are calculated.
5. Each offspring then competes for survival in a tournament with one parent, so that the intercompetition distances are minimised. In other words, denoting the parents as p, the offspring as o, and using the subscript to indicate tournament pairing, d(p1, o1) + d(p2, o2) < d(p1, o2) + d(p2, o1).

The net result of all this is that offspring tend to compete for survival with the most similar parent, so subpopulations are preserved in niches, but their size does not depend on fitness; rather, the population is equally distributed amongst the peaks available. Fig. 5.4 illustrates this point in comparison with the distribution achieved under fitness sharing.

[Figure omitted.] Fig. 5.4. Idealised population distributions under fitness sharing (top) and crowding (bottom). There are five peaks in the landscape with fitnesses (5, 4, 3, 2, 1) and the population size is 15. Fitness sharing allocates individuals to peaks in proportion to their fitness, whereas crowding distributes the population evenly amongst the peaks.
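The following sketch implements one pairing step of deterministic crowding; `recombine`, `mutate`, `fitness` and `distance` are placeholders, and we let the parent win ties, which is one possible convention.

def deterministic_crowding_pair(p1, p2, recombine, mutate, fitness, distance):
    o1, o2 = recombine(p1, p2)
    o1, o2 = mutate(o1), mutate(o2)
    # pair each offspring with its closer parent (minimise summed distances)
    if distance(p1, o1) + distance(p2, o2) <= distance(p1, o2) + distance(p2, o1):
        pairs = [(p1, o1), (p2, o2)]
    else:
        pairs = [(p1, o2), (p2, o1)]
    # each offspring competes only with its paired parent
    return [o if fitness(o) > fitness(p) else p for p, o in pairs]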
5.5.5 Automatic Speciation Using Mating Restrictions

The automatic speciation approach imposes mating restrictions based on some aspect of the candidate solutions (or their genotypes) defining them as belonging to different species. The population contains multiple species, and during parent selection for recombination individuals will only mate with others from the same (or a similar) species. The biological analogy becomes particularly clear when we note that some authors refer to the aspect controlling reproductive opportunities as an individual's 'plumage' [401].

A number of schemes have been proposed to implement speciation, which can be divided into two main approaches. In the first, speciation is based on the solution (or its representation), e.g., Deb's phenotype (genotype)-restricted mating [109, 114, 401]. The alternative approach is to add some elements, such as tags, to the genotype that code for the individual's species, rather than representing part of the solution. See [62, 109, 409] for implementations, noting that many of these ideas were previously suggested by other authors. These tags are usually randomly initialised and subject to recombination and mutation. Common to both approaches is the idea that once an individual has been selected to be a parent, the choice of mate involves the use of a pairwise distance metric (in phenotype or genotype space as appropriate), with potential mates being rejected beyond a certain distance. Note that in the tag scheme, there is initially no guarantee that individuals with similar tags will represent similar solutions, although after a few generations selection will usually take care of this problem. Neither is there any guarantee that different species will contain different solutions, although Spears goes some way towards rectifying this by also using the tags to perform fitness sharing [409], and even without this Deb reported improved performance compared to a standard GA [109]. Similarly, although the phenotype-based speciation scheme does not guarantee diversity maintenance, when used in conjunction with fitness sharing it was reported to give better results than fitness sharing on its own [114].
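As a sketch, mate selection under such a restriction might look as follows; the threshold sigma_mate and the fallback to an unrestricted choice when no conspecific is found are illustrative decisions, not prescriptions from the literature cited above.

import random

def select_mate(first_parent, population, distance, sigma_mate):
    # candidate mates beyond sigma_mate in (geno/pheno)type distance are rejected
    candidates = [ind for ind in population
                  if ind is not first_parent
                  and distance(first_parent, ind) <= sigma_mate]
    if not candidates:               # no conspecifics: fall back to anyone
        candidates = [ind for ind in population if ind is not first_parent]
    return random.choice(candidates)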
5.5.6 Running Multiple Populations in Tandem: Island Model EAs

The idea of evolving multiple populations in tandem is also known as island model EAs, parallel EAs, and, more precisely, coarse-grain parallel EAs. These schemes attracted great interest in the 1980s when parallel computing became popular [87, 88, 274, 339, 356, 425], and they are still applicable on MIMD systems such as computing clusters. Of course, they can equally well be implemented on a single-processor architecture, without the time speed-up. The essential idea is to run multiple populations in parallel, in some kind of communication structure. The communication structure is usually a ring or a torus, but in principle any form is possible, and sometimes it is determined by the architecture of the parallel system, e.g., a hypercube [425]. After a (usually fixed) number of generations (known as an epoch), a number of individuals are selected from each population to be exchanged with others from neighbouring populations – this can be thought of as migration.

In [284] this approach is discussed in the context of Eldredge and Gould's theory of punctuated equilibria [154] and the exploration–exploitation tradeoff. They suggest that during the epochs between communication, when each subpopulation is evolving independently of the others, exploitation occurs, so that the subpopulations each explore the search space around the fitter solutions that they contain. When communication takes place, the injection of individuals of potentially high fitness, and with (possibly) radically different genotypes, facilitates exploration, particularly as recombination happens between the two different solutions. Whilst extremely attractive in theory, it is obvious that there are no guarantees per se that the different subpopulations are actually exploring different regions of the search space. One possibility is clearly to achieve a start at this through a careful initialisation process, but even if this is used, there are a number of parameters that have been shown to affect the ability of this technique to explore different peaks and obtain good results, even when only a single solution is desired as the end result.

A number of detailed studies have been made of the effects of different parameters and implementations of this basic scheme (see, e.g., the earlier references in this section, and [276] for a more recent treatment), but of course we must bear in mind that the results obtained may be problem dependent, and so we will restrict ourselves to commenting on a few important facets (a sketch combining one set of these choices follows the list):

• How often to exchange individuals? The essential problem here is that if the communication occurs too frequently, then all subpopulations will converge to the same solution. Equally, if it is done too infrequently, and one or more subpopulations has converged quickly in the vicinity of a peak, then significant amounts of computational effort may be wasted. Most authors have used epoch lengths in the range 25–150 generations. An elegant alternative strategy proposed in [284] is to organise communication adaptively, that is to say, to stop the evolution in each subpopulation when no improvement has been observed for, say, 25 generations.
• How many, and which, individuals to exchange? Many authors have found that in order to prevent too rapid convergence to the same solution, it is better to exchange a small number of solutions between subpopulations – usually 2–5. Once the amount of communication has been decided, it is necessary to specify which individuals are selected from each population to be exchanged. Clearly this can be done either by some fitness-based selection mechanism (e.g., "copy-best" [339], "pick-from-fittest-half" [425]) or at random [87]. It must also be decided whether the individuals being exchanged are effectively moved from one population to another, thus (assuming a symmetrical communication structure) maintaining subpopulation sizes, or whether they are merely copied, in which case each subpopulation must then undergo some kind of survivor selection mechanism. The choices of how many and which individuals to exchange will evidently affect the tendency of the subpopulations to converge to the same solution. A random, rather than fitness-based, selection strategy is less likely to lead to takeover of one population by a new high-fitness migrant, and exchanging more solutions also leads to faster mixing and possible takeover. However, the extent to which these factors affect the behaviour is clearly tied to the epoch length, since if this is long enough to permit fitness convergence then all of the solutions contained within a given subpopulation are likely to be genotypically very similar, so the selection method used becomes less important.
• How to divide the population into subpopulations? The general rule here appears to be that, provided a certain (problem-dependent) minimum subpopulation size is respected, more subpopulations usually give better results. This clearly fits in with our understanding, since if each subpopulation is exploring a different peak (the ideal scenario), the more peaks explored, the more likely it is that one of them will contain the global optimum.
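The skeleton below sketches one way of combining these choices, assuming a ring topology, "copy-best" emigrant selection, and replace-worst immigration; all the names are illustrative, and `evolve_epoch` stands for one epoch of any EA run on a population in place.

def island_model(islands, evolve_epoch, fitness, epochs, n_migrants=2):
    # islands: list of separate populations
    for _ in range(epochs):
        for pop in islands:
            evolve_epoch(pop)                      # independent evolution
        # "copy-best" migration around a ring
        outgoing = [sorted(pop, key=fitness, reverse=True)[:n_migrants]
                    for pop in islands]
        for k, migrants in enumerate(outgoing):
            neighbour = islands[(k + 1) % len(islands)]
            neighbour.sort(key=fitness)            # worst first
            neighbour[:n_migrants] = list(migrants)  # immigrants replace the worst
    return islands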
Finally, it is worth mentioning that it is perfectly possible to use different algorithmic parameters on different islands. Thus in the injection island models the subpopulations are arranged hierarchically, with each level operating at a different granularity of representation. Equally, parameters such as the choice of recombination or mutation operator and their associated parameters, or even the subpopulation sizes, might differ between subpopulations [148, 367].

5.5.7 Spatial Distribution Within One Population: Cellular EAs

In the previous section we described the implementation of a population structure in the form of a number of subpopulations with occasional communication. In this section we describe an alternative model whereby a single population is considered to be split into a larger number of smaller overlapping subpopulations (demes) by being distributed within algorithmic space. We can consider this to be equivalent to the situation whereby biological individuals are separated, only mating and competing for survival with those within a certain distance of them. To take a simple example from the days of less-rapid transport, a person might only have been able to marry and have children with someone from their own or surrounding villages. Thus, should a new gene for, say, telekinesis evolve, even if it offers a huge evolutionary advantage, at first it will only spread to surrounding villages. In the next generation it might spread to those surrounding them, and so on, only slowly diffusing or percolating throughout the society.

This effect is implemented by considering each member of the population to exist on a different point on a grid, and only permitting recombination and selection with neighbours; hence the common names of parallel EAs [195, 311], fine-grain parallel EAs [281], diffusion model EAs [451], distributed EAs [225] and, more commonly nowadays, cellular EAs [456, 5]. There have been a great many differing implementations of this form of EA, but we can broadly outline the algorithm as follows (a sketch of one instance follows the list):

1. The current population is conceptually distributed on a (usually toroidal) grid, with one individual per node.
2. For each node we have defined a deme (neighbourhood). This is usually the same for all nodes, e.g., for a neighbourhood size of nine on a square lattice, we take the node and all of its immediate neighbours.
3. In each generation we consider each deme in turn and perform the following operations within it:
   • Select two solutions from the nodes in the deme that will act as parents.
   • Generate an offspring via recombination.
   • Mutate, then evaluate the offspring.
   • Select one solution residing on a node in the deme and replace it with the new offspring.
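The following sketch renders one sequential, replace-if-better instance of this outline in Python; the deme is the 3×3 neighbourhood on a toroidal grid and parent selection within the deme is random, both of which are just one of the possible choices discussed in the next paragraph.

import random

def cellular_generation(grid, fitness, recombine, mutate):
    # grid: 2-D list of individuals, one per node
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # the deme: the node plus its eight immediate (toroidal) neighbours
            deme = [grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            p1, p2 = random.sample(deme, 2)            # parent selection in the deme
            child = mutate(recombine(p1, p2))
            if fitness(child) > fitness(grid[r][c]):   # "replace current if better"
                grid[r][c] = child
    return grid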
Within this general structure there is scope for considerable differences in implementation. The ASPARAGOS algorithm [195, 311] uses a ladder topology rather than a lattice, and also performs a hill-climbing step after mutation. Several algorithms implemented on massively parallel SIMD or SPMD machines use asynchronous updates in step 3, rather than the sequential mode suggested above (a good discussion of this issue can be found in [338]). The selection of parents might be fitness-based [95] or random (or one of each [281]), and often one parent is taken to be the one residing on the central node of the deme. When fitness-based selection is used, it is usually a local implementation of a well-known global scheme such as fitness proportionate or tournament selection. De Jong and Sarma [106] analysed a number of such schemes and found that local selection techniques generally exhibited less selection pressure than their global versions. While it is common to replace the central node of the deme, again fitness-based or random selection has been used to choose the individual to be replaced, or a combination such as "replace current solution if better" [195]. White and Pettey reported results suggesting that the use of fitness in the survivor selection is preferred [451]. A good recent treatment and discussion can be found in the book [5].

For exercises and recommended reading for this chapter, please visit www.evolutionarycomputation.org.