17 MAKING COMPLEX DECISIONS In which we examine methods for deciding what to do today, given that we may decide again tomorrow. In this chapter, we address the computational issues involved in making decisions in a stochas- tic environment. Whereas Chapter 16 was concerned with one-shot or episodic decision problems, in which the utility of each action’s outcome was well known, we are concerned SEQUENTIAL DECISION PROBLEM here with sequential decision problems, in which the agent’s utility depends on a sequence of decisions. Sequential decision problems incorporate utilities, uncertainty, and sensing, and include search and planning problems as special cases. Section 17.1 explains how se- quential decision problems are defined, and Sections 17.2 and 17.3 explain how they can be solved to produce optimal behavior that balances the risks and rewards of acting in an uncertain environment. Section 17.4 extends these ideas to the case of partially observable environments, and Section 17.4.3 develops a complete design for decision-theoretic agents in partially observable environments, combining dynamic Bayesian networks from Chapter 15 with decision networks from Chapter 16. The second part of the chapter covers environments with multiple agents. In such en- vironments, the notion of optimal behavior is complicated by the interactions among the agents. Section 17.5 introduces the main ideas of game theory, including the idea that ra- tional agents might need to behave randomly. Section 17.6 looks at how multiagent systems can be designed so that multiple agents can achieve a common goal. 17.1 S EQUENTIAL D ECISION P ROBLEMS Suppose that an agent is situated in the 4 × 3 environment shown in Figure 17.1(a). Beginning in the start state, it must choose an action at each time step. The interaction with the environ- ment terminates when the agent reaches one of the goal states, marked +1 or –1. Just as for search problems, the actions available to the agent in each state are given by ACTIONS (s), sometimes abbreviated to A(s); in the 4 × 3 environment, the actions in every state are Up, Down, Left, and Right. We assume for now that the environment is fully observable, so that the agent always knows where it is. 645 646 Chapter 17. Making Complex Decisions 3 +1 0.8 0.1 0.1 2 –1 1 START 1 2 3 4 (a) (b) Figure 17.1 (a) A simple 4 × 3 environment that presents the agent with a sequential decision problem. (b) Illustration of the transition model of the environment: the “intended” outcome occurs with probability 0.8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. The two terminal states have reward +1 and –1, respectively, and all other states have a reward of –0.04. If the environment were deterministic, a solution would be easy: [Up, Up, Right, Right, Right]. Unfortunately, the environment won’t always go along with this solution, because the actions are unreliable. The particular model of stochastic motion that we adopt is illustrated in Figure 17.1(b). Each action achieves the intended effect with probability 0.8, but the rest of the time, the action moves the agent at right angles to the intended direction. Furthermore, if the agent bumps into a wall, it stays in the same square. For example, from the start square (1,1), the action Up moves the agent to (1,2) with probability 0.8, but with probability 0.1, it moves right to (2,1), and with probability 0.1, it moves left, bumps into the wall, and stays in (1,1). 
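This transition model is small enough to write out directly. The following minimal Python sketch (the names STATES, TERMINALS, ACTIONS, move and P are mine, not the book's) enumerates the successor distribution for the 4 × 3 world: the intended move with probability 0.8, the two right-angle moves with probability 0.1 each, with bumps into walls or the grid edge leaving the agent in place.

# Transition model for the 4 x 3 world of Figure 17.1 (illustrative sketch).
# States are (x, y) with x in 1..4, y in 1..3; (2,2) is the interior wall.
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}
TERMINALS = {(4, 3), (4, 2)}
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}

def right_angles(a):
    """The two actions at right angles to the intended action a."""
    return {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
            'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}[a]

def move(s, a):
    """Deterministic effect of action a in state s; bumping into a wall or the
    grid edge leaves the agent where it is."""
    dx, dy = ACTIONS[a]
    s2 = (s[0] + dx, s[1] + dy)
    return s2 if s2 in STATES else s

def P(s, a):
    """Return {s': probability} for doing a in s, as in Figure 17.1(b)."""
    dist = {}
    for prob, actual in [(0.8, a)] + [(0.1, b) for b in right_angles(a)]:
        s2 = move(s, actual)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist

# From the start square (1,1), Up reaches (1,2) with probability 0.8, (2,1) with 0.1,
# and stays in (1,1) with 0.1 (the Left deviation bumps into the grid edge):
print(P((1, 1), 'Up'))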
In such an environment, the sequence [Up, Up, Right, Right, Right] goes up around the barrier and reaches the goal state at (4,3) with probability 0.8^5 = 0.32768. There is also a small chance of accidentally reaching the goal by going the other way around, with probability 0.1^4 × 0.8, for a grand total of 0.32776. (See also Exercise 17.1.)

As in Chapter 3, the transition model (or just "model," whenever no confusion can arise) describes the outcome of each action in each state. Here, the outcome is stochastic, so we write P(s' | s, a) to denote the probability of reaching state s' if action a is done in state s. We will assume that transitions are Markovian in the sense of Chapter 15, that is, the probability of reaching s' from s depends only on s and not on the history of earlier states. For now, you can think of P(s' | s, a) as a big three-dimensional table containing probabilities. Later, in Section 17.4.3, we will see that the transition model can be represented as a dynamic Bayesian network, just as in Chapter 15.

To complete the definition of the task environment, we must specify the utility function for the agent. Because the decision problem is sequential, the utility function will depend on a sequence of states—an environment history—rather than on a single state. Later in this section, we investigate how such utility functions can be specified in general; for now, we simply stipulate that in each state s, the agent receives a reward R(s), which may be positive or negative, but must be bounded. For our particular example, the reward is −0.04 in all states except the terminal states (which have rewards +1 and −1). The utility of an environment history is just (for now) the sum of the rewards received. For example, if the agent reaches the +1 state after 10 steps, its total utility will be 0.6. The negative reward of −0.04 gives the agent an incentive to reach (4,3) quickly, so our environment is a stochastic generalization of the search problems of Chapter 3. Another way of saying this is that the agent does not enjoy living in this environment and so wants to leave as soon as possible.

To sum up: a sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process, or MDP, and consists of a set of states (with an initial state s0); a set ACTIONS(s) of actions in each state; a transition model P(s' | s, a); and a reward function R(s).

The next question is, what does a solution to the problem look like? We have seen that any fixed action sequence won't solve the problem, because the agent might end up in a state other than the goal. Therefore, a solution must specify what the agent should do for any state that the agent might reach. A solution of this kind is called a policy. It is traditional to denote a policy by π, and π(s) is the action recommended by the policy π for state s. If the agent has a complete policy, then no matter what the outcome of any action, the agent will always know what to do next. Each time a given policy is executed starting from the initial state, the stochastic nature of the environment may lead to a different environment history. The quality of a policy is therefore measured by the expected utility of the possible environment histories generated by that policy. An optimal policy is a policy that yields the highest expected utility.
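To make the MDP definition concrete, the sketch below bundles the components just listed into one structure and represents a policy as a plain mapping from states to actions, reusing the STATES, TERMINALS and P helpers from the previous sketch. The class and variable names are mine, and the policy dictionary is a hand transcription consistent with the optimal policy of Figure 17.2(a).

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Tuple[int, int]

@dataclass
class MDP:
    """A Markov decision process: states, actions A(s), transition model P(s'|s,a)
    and reward R(s), plus a discount factor (1.0 means additive rewards)."""
    states: set
    terminals: set
    actions_fn: Callable[[State], list]                       # A(s)
    transition: Callable[[State, str], Dict[State, float]]    # P(s'|s,a) as a dict
    reward: Callable[[State], float]                          # R(s)
    gamma: float = 1.0

def A(s):
    """In the 4 x 3 world every nonterminal state has the same four actions."""
    return [] if s in TERMINALS else ['Up', 'Down', 'Left', 'Right']

def R(s):
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s, -0.04)

grid_mdp = MDP(STATES, TERMINALS, A, P, R, gamma=1.0)

# A policy is just a dictionary from nonterminal states to actions; this one is
# consistent with the "long way round" optimal policy of Figure 17.2(a):
policy_fig_17_2a = {(1, 1): 'Up',   (1, 2): 'Up',    (1, 3): 'Right',
                    (2, 1): 'Left', (2, 3): 'Right',
                    (3, 1): 'Left', (3, 2): 'Up',    (3, 3): 'Right',
                    (4, 1): 'Left'}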
We use π ∗ to denote an optimal policy. Given π ∗ , the agent decides what to do by consulting its current percept, which tells it the current state s, and then executing the action π ∗ (s). A policy represents the agent function explicitly and is therefore a description of a simple reflex agent, computed from the information used for a utility-based agent. An optimal policy for the world of Figure 17.1 is shown in Figure 17.2(a). Notice that, because the cost of taking a step is fairly small compared with the penalty for ending up in (4,2) by accident, the optimal policy for the state (3,1) is conservative. The policy recommends taking the long way round, rather than taking the shortcut and thereby risking entering (4,2). The balance of risk and reward changes depending on the value of R(s) for the nonter- minal states. Figure 17.2(b) shows optimal policies for four different ranges of R(s). When R(s) ≤ −1.6284, life is so painful that the agent heads straight for the nearest exit, even if the exit is worth –1. When −0.4278 ≤ R(s) ≤ −0.0850, life is quite unpleasant; the agent takes the shortest route to the +1 state and is willing to risk falling into the –1 state by acci- dent. In particular, the agent takes the shortcut from (3,1). When life is only slightly dreary (−0.0221 < R(s) < 0), the optimal policy takes no risks at all. In (4,1) and (3,2), the agent heads directly away from the –1 state so that it cannot fall in by accident, even though this means banging its head against the wall quite a few times. Finally, if R(s) > 0, then life is positively enjoyable and the agent avoids both exits. As long as the actions in (4,1), (3,2), 1 Some definitions of MDPs allow the reward to depend on the action and outcome too, so the reward function is R(s, a, s ). This simplifies the description of some environments but does not change the problem in any fundamental way, as shown in Exercise 17.4. 648 Chapter 17. Making Complex Decisions +1 +1 –1 –1 3 +1 R(s) < –1.6284 – 0.4278 < R(s) < – 0.0850 2 –1 +1 +1 1 –1 –1 1 2 3 4 – 0.0221 < R(s) < 0 R(s) > 0 (a) (b) Figure 17.2 (a) An optimal policy for the stochastic environment with R(s) = − 0.04 in the nonterminal states. (b) Optimal policies for four different ranges of R(s). and (3,3) are as shown, every policy is optimal, and the agent obtains infinite total reward be- cause it never enters a terminal state. Surprisingly, it turns out that there are six other optimal policies for various ranges of R(s); Exercise 17.5 asks you to find them. The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems; moreover, it is a characteristic of many real-world decision problems. For this reason, MDPs have been studied in several fields, including AI, operations research, economics, and control theory. Dozens of algorithms have been proposed for calculating optimal policies. In sections 17.2 and 17.3 we describe two of the most important algorithm families. First, however, we must complete our investigation of utilities and policies for sequential decision problems. 17.1.1 Utilities over time In the MDP example in Figure 17.1, the performance of the agent was measured by a sum of rewards for the states visited. This choice of performance measure is not arbitrary, but it is not the only possibility for the utility function on environment histories, which we write as Uh ([s0 , s1 ,... , sn ]). 
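One way to get a feel for Uh, and for the expected utility that ranks policies, is to sample environment histories under a fixed policy and average their (additive) utilities. The minimal sketch below assumes the P, R, TERMINALS and policy definitions from the earlier sketches; it illustrates the definition rather than the solution algorithms of Sections 17.2 and 17.3.

import random

def sample_history(policy, P, R, start=(1, 1), max_steps=1000):
    """Follow the policy from the start state, sampling successors from P(s'|s,a);
    return the list of rewards received along the environment history."""
    s, rewards = start, []
    for _ in range(max_steps):
        rewards.append(R(s))
        if s in TERMINALS:
            break
        dist = P(s, policy[s])
        s = random.choices(list(dist), weights=list(dist.values()))[0]
    return rewards

def estimate_policy_utility(policy, P, R, n=10000):
    """Monte Carlo estimate of the expected (additive) utility of the histories
    generated by the policy; the policy must cover every nonterminal state."""
    return sum(sum(sample_history(policy, P, R)) for _ in range(n)) / n

print(round(estimate_policy_utility(policy_fig_17_2a, P, R), 2))
# Roughly 0.7, in line with the utility of the start state shown later in Figure 17.3.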
Our analysis draws on multiattribute utility theory (Section 16.4) and is somewhat technical; the impatient reader may wish to skip to the next section. FINITE HORIZON The first question to answer is whether there is a finite horizon or an infinite horizon INFINITE HORIZON for decision making. A finite horizon means that there is a fixed time N after which nothing matters—the game is over, so to speak. Thus, Uh ([s0 , s1 ,... , sN +k ]) = Uh ([s0 , s1 ,... , sN ]) for all k > 0. For example, suppose an agent starts at (3,1) in the 4 × 3 world of Figure 17.1, and suppose that N = 3. Then, to have any chance of reaching the +1 state, the agent must head directly for it, and the optimal action is to go Up. On the other hand, if N = 100, then there is plenty of time to take the safe route by going Left. So, with a finite horizon, Section 17.1. Sequential Decision Problems 649 the optimal action in a given state could change over time. We say that the optimal policy NONSTATIONARY POLICY for a finite horizon is nonstationary. With no fixed time limit, on the other hand, there is no reason to behave differently in the same state at different times. Hence, the optimal ac- STATIONARY POLICY tion depends only on the current state, and the optimal policy is stationary. Policies for the infinite-horizon case are therefore simpler than those for the finite-horizon case, and we deal mainly with the infinite-horizon case in this chapter. (We will see later that for partially ob- servable environments, the infinite-horizon case is not so simple.) Note that “infinite horizon” does not necessarily mean that all state sequences are infinite; it just means that there is no fixed deadline. In particular, there can be finite state sequences in an infinite-horizon MDP containing a terminal state. The next question we must decide is how to calculate the utility of state sequences. In the terminology of multiattribute utility theory, each state si can be viewed as an attribute of the state sequence [s0 , s1 , s2...]. To obtain a simple expression in terms of the attributes, we will need to make some sort of preference-independence assumption. The most natural as- STATIONARY PREFERENCE sumption is that the agent’s preferences between state sequences are stationary. Stationarity for preferences means the following: if two state sequences [s0 , s1 , s2 ,...] and [s0 , s1 , s2 ,...] begin with the same state (i.e., s0 = s0 ), then the two sequences should be preference-ordered the same way as the sequences [s1 , s2 ,...] and [s1 , s2 ,...]. In English, this means that if you prefer one future to another starting tomorrow, then you should still prefer that future if it were to start today instead. Stationarity is a fairly innocuous-looking assumption with very strong consequences: it turns out that under stationarity there are just two coherent ways to assign utilities to sequences: ADDITIVE REWARD 1. Additive rewards: The utility of a state sequence is Uh ([s0 , s1 , s2 ,...]) = R(s0 ) + R(s1 ) + R(s2 ) + · · ·. The 4 × 3 world in Figure 17.1 uses additive rewards. Notice that additivity was used implicitly in our use of path cost functions in heuristic search algorithms (Chapter 3). DISCOUNTED REWARD 2. Discounted rewards: The utility of a state sequence is Uh ([s0 , s1 , s2 ,...]) = R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · · , DISCOUNT FACTOR where the discount factor γ is a number between 0 and 1. The discount factor describes the preference of an agent for current rewards over future rewards. 
When γ is close to 0, rewards in the distant future are viewed as insignificant. When γ is 1, discounted rewards are exactly equivalent to additive rewards, so additive rewards are a special case of discounted rewards. Discounting appears to be a good model of both animal and human preferences over time. A discount factor of γ is equivalent to an interest rate of (1/γ) − 1. For reasons that will shortly become clear, we assume discounted rewards in the remainder of the chapter, although sometimes we allow γ = 1. Lurking beneath our choice of infinite horizons is a problem: if the environment does not contain a terminal state, or if the agent never reaches one, then all environment histories will be infinitely long, and utilities with additive, undiscounted rewards will generally be 650 Chapter 17. Making Complex Decisions infinite. While we can agree that+∞ is better than −∞, comparing two state sequences with +∞ utility is more difficult. There are three solutions, two of which we have seen already: 1. With discounted rewards, the utility of an infinite sequence is finite. In fact, if γ < 1 and rewards are bounded by ±Rmax , we have ∞ ∞ Uh ([s0 , s1 , s2 ,...]) = γ R(st ) ≤ t γ t Rmax = Rmax /(1 − γ) , (17.1) t=0 t=0 using the standard formula for the sum of an infinite geometric series. 2. If the environment contains terminal states and if the agent is guaranteed to get to one eventually, then we will never need to compare infinite sequences. A policy that is PROPER POLICY guaranteed to reach a terminal state is called a proper policy. With proper policies, we can use γ = 1 (i.e., additive rewards). The first three policies shown in Figure 17.2(b) are proper, but the fourth is improper. It gains infinite total reward by staying away from the terminal states when the reward for the nonterminal states is positive. The existence of improper policies can cause the standard algorithms for solving MDPs to fail with additive rewards, and so provides a good reason for using discounted rewards. AVERAGE REWARD 3. Infinite sequences can be compared in terms of the average reward obtained per time step. Suppose that square (1,1) in the 4 × 3 world has a reward of 0.1 while the other nonterminal states have a reward of 0.01. Then a policy that does its best to stay in (1,1) will have higher average reward than one that stays elsewhere. Average reward is a useful criterion for some problems, but the analysis of average-reward algorithms is beyond the scope of this book. In sum, discounted rewards present the fewest difficulties in evaluating state sequences. 17.1.2 Optimal policies and the utilities of states Having decided that the utility of a given state sequence is the sum of discounted rewards obtained during the sequence, we can compare policies by comparing the expected utilities obtained when executing them. We assume the agent is in some initial state s and define St (a random variable) to be the state the agent reaches at time t when executing a particular policy π. (Obviously, S0 = s, the state the agent is in now.) The probability distribution over state sequences S1 , S2 ,... , is determined by the initial state s, the policy π, and the transition model for the environment. The expected utility obtained by executing π starting in s is given by "∞ # U π (s) = E γ t R(St ) , (17.2) t=0 where the expectation is with respect to the probability distribution over state sequences de- termined by s and π. 
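Both reward schemes, and the Rmax/(1 − γ) bound of Equation (17.1), are easy to check numerically. A small sketch (the function name is mine):

def discounted_utility(rewards, gamma=1.0):
    """U_h([s0, s1, s2, ...]) = R(s0) + gamma R(s1) + gamma^2 R(s2) + ...;
    gamma = 1 gives the additive-reward special case."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The text's example: the +1 exit is reached after 10 steps (ten rewards of -0.04, then +1).
history = [-0.04] * 10 + [1.0]
print(discounted_utility(history))         # approximately 0.6 (additive rewards)
print(discounted_utility(history, 0.9))    # less, because the +1 is discounted

# With rewards bounded by Rmax, even an infinite discounted sum is at most
# Rmax / (1 - gamma); here 1 / (1 - 0.9) = 10, as in Equation (17.1).
print(discounted_utility([1.0] * 10000, 0.9))   # close to the bound of 10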
Now, out of all the policies the agent could choose to execute starting in s, one (or more) will have higher expected utilities than all the others. We’ll use πs∗ to denote one of these policies: πs∗ = argmax U π (s). (17.3) π Section 17.1. Sequential Decision Problems 651 Remember that πs∗ is a policy, so it recommends an action for every state; its connection with s in particular is that it’s an optimal policy when s is the starting state. A remarkable consequence of using discounted utilities with infinite horizons is that the optimal policy is independent of the starting state. (Of course, the action sequence won’t be independent; remember that a policy is a function specifying an action for each state.) This fact seems intuitively obvious: if policy πa∗ is optimal starting in a and policy πb∗ is optimal starting in b, then, when they reach a third state c, there’s no good reason for them to disagree with each other, or with πc∗ , about what to do next.2 So we can simply write π ∗ for an optimal policy. ∗ Given this definition, the true utility of a state is just U π (s)—that is, the expected sum of discounted rewards if the agent executes an optimal policy. We write this as U (s), matching the notation used in Chapter 16 for the utility of an outcome. Notice that U (s) and R(s) are quite different quantities; R(s) is the “short term” reward for being in s, whereas U (s) is the “long term” total reward from s onward. Figure 17.3 shows the utilities for the 4 × 3 world. Notice that the utilities are higher for states closer to the +1 exit, because fewer steps are required to reach the exit. 3 0.812 0.868 0.918 +1 2 0.762 0.660 –1 1 0.705 0.655 0.611 0.388 1 2 3 4 Figure 17.3 The utilities of the states in the 4 × 3 world, calculated with γ = 1 and R(s) = − 0.04 for nonterminal states. The utility function U (s) allows the agent to select actions by using the principle of maximum expected utility from Chapter 16—that is, choose the action that maximizes the expected utility of the subsequent state: π ∗ (s) = argmax P (s | s, a)U (s ). (17.4) a∈A(s) s The next two sections describe algorithms for finding optimal policies. 2 Although this seems obvious, it does not hold for finite-horizon policies or for other ways of combining rewards over time. The proof follows directly from the uniqueness of the utility function on states, as shown in Section 17.2. 652 Chapter 17. Making Complex Decisions 17.2 VALUE I TERATION VALUE ITERATION In this section, we present an algorithm, called value iteration, for calculating an optimal policy. The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state. 17.2.1 The Bellman equation for utilities Section 17.1.2 defined the utility of being in a state as the expected sum of discounted rewards from that point onwards. From this, it follows that there is a direct relationship between the utility of a state and the utility of its neighbors: the utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action. That is, the utility of a state is given by U (s) = R(s) + γ max P (s | s, a)U (s ). (17.5) a∈A(s) s BELLMAN EQUATION This is called the Bellman equation, after Richard Bellman (1957). The utilities of the states—defined by Equation (17.2) as the expected utility of subsequent state sequences—are solutions of the set of Bellman equations. 
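Equation (17.4) turns a table of state utilities, such as the one in Figure 17.3, into behavior by one-step look-ahead. The sketch below reuses the P, A, STATES and TERMINALS helpers and the policy dictionary from the earlier sketches; the utility values are the rounded numbers printed in Figure 17.3.

# Utilities from Figure 17.3 (gamma = 1, R(s) = -0.04 for nonterminals), rounded.
U_fig = {(1, 1): 0.705, (1, 2): 0.762, (1, 3): 0.812,
         (2, 1): 0.655, (2, 3): 0.868,
         (3, 1): 0.611, (3, 2): 0.660, (3, 3): 0.918,
         (4, 1): 0.388, (4, 2): -1.0, (4, 3): 1.0}

def expected_utility(s, a, U):
    """Sum over s' of P(s'|s,a) U(s')."""
    return sum(p * U[s2] for s2, p in P(s, a).items())

def best_policy(U):
    """Equation (17.4): in each nonterminal state, pick the action that maximizes
    the expected utility of the next state."""
    return {s: max(A(s), key=lambda a: expected_utility(s, a, U))
            for s in STATES if s not in TERMINALS}

print(best_policy(U_fig)[(1, 1)])               # 'Up'
print(best_policy(U_fig) == policy_fig_17_2a)   # True: recovers the policy sketched above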
In fact, they are the unique solutions, as we show in Section 17.2.3. Let us look at one of the Bellman equations for the 4 × 3 world. The equation for the state (1,1) is U (1, 1) = −0.04 + γ max[ 0.8U (1, 2) + 0.1U (2, 1) + 0.1U (1, 1), (Up) 0.9U (1, 1) + 0.1U (1, 2), (Left ) 0.9U (1, 1) + 0.1U (2, 1), (Down) 0.8U (2, 1) + 0.1U (1, 2) + 0.1U (1, 1) ]. (Right ) When we plug in the numbers from Figure 17.3, we find that Up is the best action. 17.2.2 The value iteration algorithm The Bellman equation is the basis of the value iteration algorithm for solving MDPs. If there are n possible states, then there are n Bellman equations, one for each state. The n equations contain n unknowns—the utilities of the states. So we would like to solve these simultaneous equations to find the utilities. There is one problem: the equations are nonlinear, because the “max” operator is not a linear operator. Whereas systems of linear equations can be solved quickly using linear algebra techniques, systems of nonlinear equations are more problematic. One thing to try is an iterative approach. We start with arbitrary initial values for the utilities, calculate the right-hand side of the equation, and plug it into the left-hand side—thereby updating the utility of each state from the utilities of its neighbors. We repeat this until we reach an equilibrium. Let Ui (s) be the utility value for state s at the ith iteration. The iteration BELLMAN UPDATE step, called a Bellman update, looks like this: Ui+1 (s) ← R(s) + γ max P (s | s, a)Ui (s ) , (17.6) a∈A(s) s Section 17.2. Value Iteration 653 function VALUE -I TERATION(mdp, ) returns a utility function inputs: mdp, an MDP with states S , actions A(s), transition model P (s | s, a), rewards R(s), discount γ , the maximum error allowed in the utility of any state local variables: U , U , vectors of utilities for states in S , initially zero δ, the maximum change in the utility of any state in an iteration repeat U ← U ; δ ← 0 for each state s in S do U [s] ← R(s) + γ max P (s | s, a) U [s ] a ∈ A(s) s if |U [s] − U [s]| > δ then δ ← |U [s] − U [s]| until δ < (1 − γ)/γ return U Figure 17.4 The value iteration algorithm for calculating utilities of states. The termina- tion condition is from Equation (17.8). 1e+07 1 (4,3) c = 0.0001 (3,3) 1e+06 c = 0.001 0.8 c = 0.01 Iterations required (1,1) 100000 c = 0.1 Utility estimates 0.6 (3,1) 10000 0.4 (4,1) 1000 0.2 100 0 10 -0.2 1 0 5 10 15 20 25 30 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Number of iterations Discount factor γ (a) (b) Figure 17.5 (a) Graph showing the evolution of the utilities of selected states using value iteration. (b) The number of value iterations k required to guarantee an error of at most = c · Rmax , for different values of c, as a function of the discount factor γ. where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium (see Section 17.2.3), in which case the final utility values must be solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy (obtained using Equation (17.4)) is optimal. The algorithm, called VALUE -I TERATION , is shown in Figure 17.4. We can apply value iteration to the 4 × 3 world in Figure 17.1(a). Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a). Notice how the states at differ- 654 Chapter 17. 
Making Complex Decisions ent distances from (4,3) accumulate negative reward until a path is found to (4,3), whereupon the utilities start to increase. We can think of the value iteration algorithm as propagating information through the state space by means of local updates. 17.2.3 Convergence of value iteration We said that value iteration eventually converges to a unique set of solutions of the Bellman equations. In this section, we explain why this happens. We introduce some useful mathe- matical ideas along the way, and we obtain some methods for assessing the error in the utility function returned when the algorithm is terminated early; this is useful because it means that we don’t have to run forever. This section is quite technical. The basic concept used in showing that value iteration converges is the notion of a con- CONTRACTION traction. Roughly speaking, a contraction is a function of one argument that, when applied to two different inputs in turn, produces two output values that are “closer together,” by at least some constant factor, than the original inputs. For example, the function “divide by two” is a contraction, because, after we divide any two numbers by two, their difference is halved. Notice that the “divide by two” function has a fixed point, namely zero, that is unchanged by the application of the function. From this example, we can discern two important properties of contractions: A contraction has only one fixed point; if there were two fixed points they would not get closer together when the function was applied, so it would not be a contraction. When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit. Now, suppose we view the Bellman update (Equation (17.6)) as an operator B that is applied simultaneously to update the utility of every state. Let Ui denote the vector of utilities for all the states at the ith iteration. Then the Bellman update equation can be written as Ui+1 ← B Ui. MAX NORM Next, we need a way to measure distances between utility vectors. We will use the max norm, which measures the “length” of a vector by the absolute value of its biggest component: ||U || = max |U (s)|. s With this definition, the “distance” between two vectors, ||U − U ||, is the maximum dif- ference between any two corresponding elements. The main result of this section is the following: Let Ui and Ui be any two utility vectors. Then we have ||B Ui − B Ui || ≤ γ ||Ui − Ui ||. (17.7) That is, the Bellman update is a contraction by a factor of γ on the space of utility vectors. (Exercise 17.6 provides some guidance on proving this claim.) Hence, from the properties of contractions in general, it follows that value iteration always converges to a unique solution of the Bellman equations whenever γ < 1. Section 17.2. Value Iteration 655 We can also use the contraction property to analyze the rate of convergence to a solu- tion. In particular, we can replace Ui in Equation (17.7) with the true utilities U , for which B U = U. Then we obtain the inequality ||B Ui − U || ≤ γ ||Ui − U ||. So, if we view ||Ui − U || as the error in the estimate Ui , we see that the error is reduced by a factor of at least γ on each iteration. This means that value iteration converges exponentially fast. 
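A direct Python rendering of VALUE-ITERATION from Figure 17.4 follows, again using the grid-world helpers sketched earlier (the helper names are mine). The termination test δ ≤ ε(1 − γ)/γ is the one from Equation (17.8) and assumes γ < 1.

def value_iteration(states, terminals, A, P, R, gamma, eps=1e-3):
    """Value iteration (Figure 17.4): apply the Bellman update to every state
    until the largest change delta satisfies the bound of Equation (17.8)."""
    U = {s: 0.0 for s in states}
    while True:
        U_old, delta = dict(U), 0.0
        for s in states:
            q = [sum(p * U_old[s2] for s2, p in P(s, a).items()) for a in A(s)]
            U[s] = R(s) + gamma * (max(q) if q else 0.0)   # terminals have no actions
            delta = max(delta, abs(U[s] - U_old[s]))
        if delta <= eps * (1 - gamma) / gamma:
            return U

U = value_iteration(STATES, TERMINALS, A, P, R, gamma=0.999)
print(round(U[(1, 1)], 3))   # roughly 0.70, close to the 0.705 of Figure 17.3 (gamma = 1 there)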
We can calculate the number of iterations required to reach a specified error bound as follows: First, recall from Equation (17.1) that the utilities of all states are bounded by ±Rmax /(1 − γ). This means that the maximum initial error ||U0 − U || ≤ 2Rmax /(1 − γ). Suppose we run for N iterations to reach an error of at most . Then, because the error is reduced by at least γ each time, we require γ N · 2Rmax /(1 − γ) ≤ . Taking logs, we find N = (log(2Rmax /(1 − γ))/ log(1/γ)) iterations suffice. Figure 17.5(b) shows how N varies with γ, for different values of the ratio /Rmax. The good news is that, because of the exponentially fast convergence, N does not depend much on the ratio /Rmax. The bad news is that N grows rapidly as γ becomes close to 1. We can get fast convergence if we make γ small, but this effectively gives the agent a short horizon and could miss the long-term effects of the agent’s actions. The error bound in the preceding paragraph gives some idea of the factors influencing the run time of the algorithm, but is sometimes overly conservative as a method of deciding when to stop the iteration. For the latter purpose, we can use a bound relating the error to the size of the Bellman update on any given iteration. From the contraction property (Equation (17.7)), it can be shown that if the update is small (i.e., no state’s utility changes by much), then the error, compared with the true utility function, also is small. More precisely, if ||Ui+1 − Ui || < (1 − γ)/γ then ||Ui+1 − U || < . (17.8) This is the termination condition used in the VALUE -I TERATION algorithm of Figure 17.4. So far, we have analyzed the error in the utility function returned by the value iteration algorithm. What the agent really cares about, however, is how well it will do if it makes its decisions on the basis of this utility function. Suppose that after i iterations of value iteration, the agent has an estimate Ui of the true utility U and obtains the MEU policy πi based on one-step look-ahead using Ui (as in Equation (17.4)). Will the resulting behavior be nearly as good as the optimal behavior? This is a crucial question for any real agent, and it turns out that the answer is yes. U πi (s) is the utility obtained if πi is executed starting in s, and the POLICY LOSS policy loss ||U πi − U || is the most the agent can lose by executing πi instead of the optimal policy π ∗. The policy loss of πi is connected to the error in Ui by the following inequality: if ||Ui − U || < then ||U πi − U || < 2γ/(1 − γ). (17.9) In practice, it often occurs that πi becomes optimal long before Ui has converged. Figure 17.6 shows how the maximum error in Ui and the policy loss approach zero as the value iteration process proceeds for the 4 × 3 environment with γ = 0.9. The policy πi is optimal when i = 4, even though the maximum error in Ui is still 0.46. Now we have everything we need to use value iteration in practice. We know that it converges to the correct utilities, we can bound the error in the utility estimates if we 656 Chapter 17. Making Complex Decisions stop after a finite number of iterations, and we can bound the policy loss that results from executing the corresponding MEU policy. As a final note, all of the results in this section depend on discounting with γ < 1. If γ = 1 and the environment contains terminal states, then a similar set of convergence results and error bounds can be derived whenever certain technical conditions are satisfied. 
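Solving γ^N · 2Rmax/(1 − γ) ≤ ε for N gives N = ceil(log(2Rmax/(ε(1 − γ))) / log(1/γ)) iterations. A few lines reproduce the qualitative behavior of Figure 17.5(b):

from math import ceil, log

def iterations_needed(gamma, eps, r_max=1.0):
    """Smallest N with gamma^N * 2*Rmax/(1-gamma) <= eps."""
    return ceil(log(2 * r_max / (eps * (1 - gamma))) / log(1 / gamma))

for gamma in (0.5, 0.9, 0.99, 0.999):
    print(gamma, iterations_needed(gamma, eps=0.001))   # N grows rapidly as gamma -> 1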
17.3 P OLICY I TERATION In the previous section, we observed that it is possible to get an optimal policy even when the utility function estimate is inaccurate. If one action is clearly better than all others, then the exact magnitude of the utilities on the states involved need not be precise. This insight POLICY ITERATION suggests an alternative way to find optimal policies. The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0 : POLICY EVALUATION Policy evaluation: given a policy πi , calculate Ui = U πi , the utility of each state if πi were to be executed. POLICY IMPROVEMENT Policy improvement: Calculate a new MEU policy πi+1 , using one-step look-ahead based on Ui (as in Equation (17.4)). The algorithm terminates when the policy improvement step yields no change in the utilities. At this point, we know that the utility function Ui is a fixed point of the Bellman update, so it is a solution to the Bellman equations, and πi must be an optimal policy. Because there are only finitely many policies for a finite state space, and each iteration can be shown to yield a better policy, policy iteration must terminate. The algorithm is shown in Figure 17.7. The policy improvement step is obviously straightforward, but how do we implement the P OLICY-E VALUATION routine? It turns out that doing so is much simpler than solving the standard Bellman equations (which is what value iteration does), because the action in each state is fixed by the policy. At the ith iteration, the policy πi specifies the action πi (s) in 1 Max error 0.8 Policy loss Max error/Policy loss 0.6 0.4 0.2 0 0 2 4 6 8 10 12 14 Number of iterations Figure 17.6 The maximum error ||Ui − U || of the utility estimates and the policy loss ||U πi − U ||, as a function of the number of iterations of value iteration. Section 17.3. Policy Iteration 657 state s. This means that we have a simplified version of the Bellman equation (17.5) relating the utility of s (under πi ) to the utilities of its neighbors: Ui (s) = R(s) + γ P (s | s, πi (s))Ui (s ). (17.10) s For example, suppose πi is the policy shown in Figure 17.2(a). Then we have πi (1, 1) = Up, πi (1, 2) = Up, and so on, and the simplified Bellman equations are Ui (1, 1) = −0.04 + 0.8Ui (1, 2) + 0.1Ui (1, 1) + 0.1Ui (2, 1) , Ui (1, 2) = −0.04 + 0.8Ui (1, 3) + 0.2Ui (1, 2) ,... The important point is that these equations are linear, because the “max” operator has been removed. For n states, we have n linear equations with n unknowns, which can be solved exactly in time O(n3 ) by standard linear algebra methods. For small state spaces, policy evaluation using exact solution methods is often the most efficient approach. For large state spaces, O(n3 ) time might be prohibitive. Fortunately, it is not necessary to do exact policy evaluation. Instead, we can perform some number of simplified value iteration steps (simplified because the policy is fixed) to give a reasonably good approximation of the utilities. The simplified Bellman update for this process is Ui+1 (s) ← R(s) + γ P (s | s, πi (s))Ui (s ) , s and this is repeated k times to produce the next utility estimate. The resulting algorithm is MODIFIED POLICY ITERATION called modified policy iteration. It is often much more efficient than standard policy iteration or value iteration. 
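Policy iteration is also short to write down. The sketch below does exact policy evaluation by solving the linear equations (17.10) with numpy and then performs the greedy improvement step; replacing the exact solve with k sweeps of the simplified Bellman update would give modified policy iteration. It again leans on the grid-world helpers sketched earlier.

import numpy as np

def policy_evaluation(pi, states, A, P, R, gamma):
    """Solve U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') exactly (Equation 17.10)."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    M, b = np.eye(n), np.zeros(n)
    for s in states:
        b[idx[s]] = R(s)
        if A(s):                                   # nonterminal states
            for s2, p in P(s, pi[s]).items():
                M[idx[s], idx[s2]] -= gamma * p
    U = np.linalg.solve(M, b)
    return {s: U[idx[s]] for s in states}

def policy_iteration(states, A, P, R, gamma):
    pi = {s: A(s)[0] for s in states if A(s)}      # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, states, A, P, R, gamma)
        unchanged = True
        for s in pi:
            q = {a: sum(p * U[s2] for s2, p in P(s, a).items()) for a in A(s)}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]] + 1e-12:         # strict improvement only
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

pi_star, U_star = policy_iteration(sorted(STATES), A, P, R, gamma=0.999)
print(pi_star[(3, 1)])   # 'Left': the conservative long way round, as in Figure 17.2(a)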
function P OLICY-I TERATION(mdp) returns a policy inputs: mdp, an MDP with states S , actions A(s), transition model P (s | s, a) local variables: U , a vector of utilities for states in S , initially zero π, a policy vector indexed by state, initially random repeat U ← P OLICY-E VALUATION(π, U , mdp) unchanged ? ← true for each state s in S do if max P (s | s, a) U [s ] > P (s | s, π[s]) U [s ] then do a ∈ A(s) s s π[s] ← argmax P (s | s, a) U [s ] a ∈ A(s) s unchanged ? ← false until unchanged ? return π Figure 17.7 The policy iteration algorithm for calculating an optimal policy. 658 Chapter 17. Making Complex Decisions The algorithms we have described so far require updating the utility or policy for all states at once. It turns out that this is not strictly necessary. In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or sim- plified value iteration) to that subset. This very general algorithm is called asynchronous ASYNCHRONOUS POLICY ITERATION policy iteration. Given certain conditions on the initial policy and initial utility function, asynchronous policy iteration is guaranteed to converge to an optimal policy. The freedom to choose any states to work on means that we can design much more efficient heuristic algorithms—for example, algorithms that concentrate on updating the values of states that are likely to be reached by a good policy. This makes a lot of sense in real life: if one has no intention of throwing oneself off a cliff, one should not spend time worrying about the exact value of the resulting states. 17.4 PARTIALLY O BSERVABLE MDP S The description of Markov decision processes in Section 17.1 assumed that the environment was fully observable. With this assumption, the agent always knows which state it is in. This, combined with the Markov assumption for the transition model, means that the optimal policy depends only on the current state. When the environment is only partially observable, the situation is, one might say, much less clear. The agent does not necessarily know which state it is in, so it cannot execute the action π(s) recommended for that state. Furthermore, the utility of a state s and the optimal action in s depend not just on s, but also on how much the PARTIALLY OBSERVABLE MDP agent knows when it is in s. For these reasons, partially observable MDPs (or POMDPs— pronounced “pom-dee-pees”) are usually viewed as much more difficult than ordinary MDPs. We cannot avoid POMDPs, however, because the real world is one. 17.4.1 Definition of POMDPs To get a handle on POMDPs, we must first define them properly. A POMDP has the same elements as an MDP—the transition model P (s | s, a), actions A(s), and reward function R(s)—but, like the partially observable search problems of Section 4.4, it also has a sensor model P (e | s). Here, as in Chapter 15, the sensor model specifies the probability of perceiv- ing evidence e in state s.3 For example, we can convert the 4 × 3 world of Figure 17.1 into a POMDP by adding a noisy or partial sensor instead of assuming that the agent knows its location exactly. Such a sensor might measure the number of adjacent walls, which happens to be 2 in all the nonterminal squares except for those in the third column, where the value is 1; a noisy version might give the wrong value with probability 0.1. 
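The noisy wall-counting sensor just described can be written down as a sensor model P(e | s). In the sketch below the true count is 2 everywhere except the third column, where it is 1, as in the text; how the 0.1 error probability is assigned to the wrong reading is my assumption.

def wall_count(s):
    """True number of adjacent walls: 2 everywhere except the third column (x = 3)."""
    return 1 if s[0] == 3 else 2

def sensor_model(e, s, noise=0.1):
    """P(e | s): report the true wall count with probability 0.9 and the other
    value (1 versus 2) with probability 0.1 -- an assumed error model."""
    truth = wall_count(s)
    if e == truth:
        return 1.0 - noise
    if e in (1, 2):
        return noise
    return 0.0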
In Chapters 4 and 11, we studied nondeterministic and partially observable planning problems and identified the belief state—the set of actual states the agent might be in—as a key concept for describing and calculating solutions. In POMDPs, the belief state b becomes a probability distribution over all possible states, just as in Chapter 15. For example, the initial 3 As with the reward function for MDPs, the sensor model can also depend on the action and outcome state, but again this change is not fundamental. Section 17.4. Partially Observable MDPs 659 belief state for the 4 × 3 POMDP could be the uniform distribution over the nine nonterminal states, i.e., 19 , 19 , 19 , 19 , 19 , 19 , 19 , 19 , 19 , 0, 0. We write b(s) for the probability assigned to the actual state s by belief state b. The agent can calculate its current belief state as the conditional probability distribution over the actual states given the sequence of percepts and actions so far. This is essentially the filtering task described in Chapter 15. The basic recursive filtering equation (15.5 on page 572) shows how to calculate the new belief state from the previous belief state and the new evidence. For POMDPs, we also have an action to consider, but the result is essentially the same. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by b (s ) = α P (e | s ) P (s | s, a)b(s) , s where α is a normalizing constant that makes the belief state sum to 1. By analogy with the update operator for filtering (page 572), we can write this as b = F ORWARD(b, a, e). (17.11) In the 4 × 3 POMDP, suppose the agent moves Left and its sensor reports 1 adjacent wall; then it’s quite likely (although not guaranteed, because both the motion and the sensor are noisy) that the agent is now in (3,1). Exercise 17.13 asks you to calculate the exact probability values for the new belief state. The fundamental insight required to understand POMDPs is this: the optimal action depends only on the agent’s current belief state. That is, the optimal policy can be described by a mapping π ∗ (b) from belief states to actions. It does not depend on the actual state the agent is in. This is a good thing, because the agent does not know its actual state; all it knows is the belief state. Hence, the decision cycle of a POMDP agent can be broken down into the following three steps: 1. Given the current belief state b, execute the action a = π ∗ (b). 2. Receive percept e. 3. Set the current belief state to F ORWARD(b, a, e) and repeat. Now we can think of POMDPs as requiring a search in belief-state space, just like the meth- ods for sensorless and contingency problems in Chapter 4. The main difference is that the POMDP belief-state space is continuous, because a POMDP belief state is a probability dis- tribution. For example, a belief state for the 4 × 3 world is a point in an 11-dimensional continuous space. An action changes the belief state, not just the physical state. Hence, the action is evaluated at least in part according to the information the agent acquires as a result. POMDPs therefore include the value of information (Section 16.6) as one component of the decision problem. Let’s look more carefully at the outcome of actions. In particular, let’s calculate the probability that an agent in belief state b reaches belief state b after executing action a. 
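Before carrying out that calculation, note that Equation (17.11), the FORWARD update, is only a few lines of code once the transition and sensor models are in hand. A sketch, reusing P, sensor_model, STATES and TERMINALS from above and representing a belief state as a dictionary of probabilities:

def forward(b, a, e):
    """Equation (17.11): b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) b(s)."""
    predicted = {}
    for s, p in b.items():
        if p == 0.0:
            continue
        dist = {s: 1.0} if s in TERMINALS else P(s, a)   # terminals absorb (an assumption)
        for s2, q in dist.items():
            predicted[s2] = predicted.get(s2, 0.0) + q * p
    unnormalized = {s2: sensor_model(e, s2) * q for s2, q in predicted.items()}
    z = sum(unnormalized.values())
    return {s2: v / z for s2, v in unnormalized.items()}

# Uniform initial belief over the nine nonterminal states, as in the text:
b0 = {s: (0.0 if s in TERMINALS else 1 / 9) for s in STATES}
b1 = forward(b0, 'Left', e=1)
print(sorted(b1.items(), key=lambda kv: -kv[1])[:3])
# After moving Left and sensing one wall, the probability mass concentrates on the
# third column, with (3,1) among the most likely locations.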
Now, if we knew the action and the subsequent percept, then Equation (17.11) would provide a deterministic update to the belief state: b = F ORWARD (b, a, e). Of course, the subsequent percept is not yet known, so the agent might arrive in one of several possible belief states b , depending on the percept that is received. The probability of perceiving e, given that a was 660 Chapter 17. Making Complex Decisions performed starting in belief state b, is given by summing over all the actual states s that the agent might reach: P (e|a, b) = P (e|a, s , b)P (s |a, b) s = P (e | s )P (s |a, b) s = P (e | s ) P (s | s, a)b(s). s s Let us write the probability of reaching b from b, given action a, as P (b | b, a)). Then that gives us P (b | b, a) = P (b |a, b) = P (b |e, a, b)P (e|a, b) e = P (b |e, a, b) P (e | s ) P (s | s, a)b(s) , (17.12) e s s where P (b |e, a, b) is 1 if b = F ORWARD (b, a, e) and 0 otherwise. Equation (17.12) can be viewed as defining a transition model for the belief-state space. We can also define a reward function for belief states (i.e., the expected reward for the actual states the agent might be in): ρ(b) = b(s)R(s). s Together, P (b | b, a) and ρ(b) define an observable MDP on the space of belief states. Fur- thermore, it can be shown that an optimal policy for this MDP, π ∗ (b), is also an optimal policy for the original POMDP. In other words, solving a POMDP on a physical state space can be reduced to solving an MDP on the corresponding belief-state space. This fact is perhaps less surprising if we remember that the belief state is always observable to the agent, by definition. Notice that, although we have reduced POMDPs to MDPs, the MDP we obtain has a continuous (and usually high-dimensional) state space. None of the MDP algorithms de- scribed in Sections 17.2 and 17.3 applies directly to such MDPs. The next two subsec- tions describe a value iteration algorithm designed specifically for POMDPs and an online decision-making algorithm, similar to those developed for games in Chapter 5. 17.4.2 Value iteration for POMDPs Section 17.2 described a value iteration algorithm that computed one utility value for each state. With infinitely many belief states, we need to be more creative. Consider an optimal policy π ∗ and its application in a specific belief state b: the policy generates an action, then, for each subsequent percept, the belief state is updated and a new action is generated, and so on. For this specific b, therefore, the policy is exactly equivalent to a conditional plan, as de- fined in Chapter 4 for nondeterministic and partially observable problems. Instead of thinking about policies, let us think about conditional plans and how the expected utility of executing a fixed conditional plan varies with the initial belief state. We make two observations: Section 17.4. Partially Observable MDPs 661 1. Let the utility of executing a fixed conditional plan p starting in physical state s be αp (s). Then the expected utility of executing p in belief state b is just s b(s)αp (s), or b · αp if we think of them both as vectors. Hence, the expected utility of a fixed conditional plan varies linearly with b; that is, it corresponds to a hyperplane in belief space. 2. At any given belief state b, the optimal policy will choose to execute the conditional plan with highest expected utility; and the expected utility of b under the optimal policy is just the utility of that conditional plan: ∗ U (b) = U π (b) = max b · αp. 
p π∗ If the optimal policy chooses to execute p starting at b, then it is reasonable to expect that it might choose to execute p in belief states that are very close to b; in fact, if we bound the depth of the conditional plans, then there are only finitely many such plans and the continuous space of belief states will generally be divided into regions, each corresponding to a particular conditional plan that is optimal in that region. From these two observations, we see that the utility function U (b) on belief states, being the maximum of a collection of hyperplanes, will be piecewise linear and convex. To illustrate this, we use a simple two-state world. The states are labeled 0 and 1, with R(0) = 0 and R(1) = 1. There are two actions: Stay stays put with probability 0.9 and Go switches to the other state with probability 0.9. For now we will assume the discount factor γ = 1. The sensor reports the correct state with probability 0.6. Obviously, the agent should Stay when it thinks it’s in state 1 and Go when it thinks it’s in state 0. The advantage of a two-state world is that the belief space can be viewed as one- dimensional, because the two probabilities must sum to 1. In Figure 17.8(a), the x-axis represents the belief state, defined by b(1), the probability of being in state 1. Now let us con- sider the one-step plans [Stay] and [Go], each of which receives the reward for the current state followed by the (discounted) reward for the state reached after the action: α[Stay] (0) = R(0) + γ(0.9R(0) + 0.1R(1)) = 0.1 α[Stay] (1) = R(1) + γ(0.9R(1) + 0.1R(0)) = 1.9 α[Go] (0) = R(0) + γ(0.9R(1) + 0.1R(0)) = 0.9 α[Go] (1) = R(1) + γ(0.9R(0) + 0.1R(1)) = 1.1 The hyperplanes (lines, in this case) for b·α[Stay] and b·α[Go] are shown in Figure 17.8(a) and their maximum is shown in bold. The bold line therefore represents the utility function for the finite-horizon problem that allows just one action, and in each “piece” of the piecewise linear utility function the optimal action is the first action of the corresponding conditional plan. In this case, the optimal one-step policy is to Stay when b(1) > 0.5 and Go otherwise. Once we have utilities αp (s) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each possible first action, each possible subsequent percept, and then each way of choosing a depth-1 plan to execute for each percept: [Stay; if Percept = 0 then Stay else Stay] [Stay; if Percept = 0 then Stay else Go]... 662 Chapter 17. Making Complex Decisions 3 3 2.5 2.5 2 2 Utility Utility 1.5 [Stay] 1.5 [Go] 1 1 0.5 0.5 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Probability of state 1 Probability of state 1 (a) (b) 3 7.5 2.5 7 2 6.5 Utility Utility 1.5 6 1 5.5 0.5 5 0 4.5 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Probability of state 1 Probability of state 1 (c) (d) Figure 17.8 (a) Utility of two one-step plans as a function of the initial belief state b(1) for the two-state world, with the corresponding utility function shown in bold. (b) Utilities for 8 distinct two-step plans. (c) Utilities for four undominated two-step plans. (d) Utility function for optimal eight-step plans. There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.8(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief DOMINATED PLAN space—we say these plans are dominated, and they need not be considered further. 
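The one-step α vectors just computed for the two-state world can be checked with a few lines of code; the model (Stay and Go succeed with probability 0.9, the sensor is correct with probability 0.6, R(0) = 0, R(1) = 1, γ = 1) is the one given in the text, and the helper names are mine.

# Two-state POMDP from the text: states 0 and 1, R(0) = 0, R(1) = 1, gamma = 1.
GAMMA = 1.0
R2 = {0: 0.0, 1: 1.0}
T2 = {'Stay': {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
      'Go':   {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}}}   # P(s'|s,a)

def alpha_one_step(a):
    """alpha_[a](s) = R(s) + gamma * sum_s' P(s'|s,a) R(s') for the one-step plan [a]."""
    return {s: R2[s] + GAMMA * sum(p * R2[s2] for s2, p in T2[a][s].items())
            for s in (0, 1)}

print(alpha_one_step('Stay'))   # {0: 0.1, 1: 1.9}, matching the text
print(alpha_one_step('Go'))     # {0: 0.9, 1: 1.1}

def plan_utility(alpha, b1):
    """Expected utility b . alpha of a plan at the belief state b = (1 - b1, b1)."""
    return (1 - b1) * alpha[0] + b1 * alpha[1]

# The optimal one-step choice is Go for b(1) below 0.5 and Stay above it
# (the tie at exactly 0.5 is broken toward Stay here):
for b1 in (0.25, 0.5, 0.75):
    print(b1, max(['Stay', 'Go'], key=lambda a: plan_utility(alpha_one_step(a), b1)))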
There are four undominated plans, each of which is optimal in a specific region, as shown in Fig- ure 17.8(c). The regions partition the belief-state space. We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is a and whose depth-d − 1 subplan for percept e is p.e; then αp (s) = R(s) + γ P (s | s, a) P (e | s )αp.e (s ). (17.13) s e This recursion naturally gives us a value iteration algorithm, which is sketched in Figure 17.9. The structure of the algorithm and its error analysis are similar to those of the basic value iter- ation algorithm in Figure 17.4 on page 653; the main difference is that instead of computing one utility number for each state, POMDP-VALUE -I TERATION maintains a collection of Section 17.4. Partially Observable MDPs 663 function POMDP-VALUE -I TERATION(pomdp, ) returns a utility function inputs: pomdp, a POMDP with states S , actions A(s), transition model P (s | s, a), sensor model P (e | s), rewards R(s), discount γ , the maximum error allowed in the utility of any state local variables: U , U , sets of plans p with associated utility vectors αp U ← a set containing just the empty plan [ ], with α[ ] (s) = R(s) repeat U ←U U ← the set of all plans consisting of an action and, for each possible next percept, a plan in U with utility vectors computed according to Equation (17.13) U ← R EMOVE -D OMINATED -P LANS(U ) until M AX -D IFFERENCE(U , U ) < (1 − γ)/γ return U Figure 17.9 A high-level sketch of the value iteration algorithm for POMDPs. The R EMOVE -D OMINATED -P LANS step and M AX -D IFFERENCE test are typically implemented as linear programs. undominated plans with their utility hyperplanes. The algorithm’s complexity depends pri- marily on how many plans get generated. Given |A| actions and |E| possible observations, it d−1 is easy to show that there are |A|O(|E| ) distinct depth-d plans. Even for the lowly two-state world with d = 8, the exact number is 2255. The elimination of dominated plans is essential for reducing this doubly exponential growth: the number of undominated plans with d = 8 is just 144. The utility function for these 144 plans is shown in Figure 17.8(d). Notice that even though state 0 has lower utility than state 1, the intermediate belief states have even lower utility because the agent lacks the information needed to choose a good action. This is why information has value in the sense defined in Section 16.6 and optimal policies in POMDPs often include information-gathering actions. Given such a utility function, an executable policy can be extracted by looking at which hyperplane is optimal at any given belief state b and executing the first action of the corre- sponding plan. In Figure 17.8(d), the corresponding optimal policy is still the same as for depth-1 plans: Stay when b(1) > 0.5 and Go otherwise. In practice, the value iteration algorithm in Figure 17.9 is hopelessly inefficient for larger problems—even the 4 × 3 POMDP is too hard. The main reason is that, given n con- ditional plans at level d, the algorithm constructs |A| · n|E| conditional plans at level d + 1 before eliminating the dominated ones. Since the 1970s, when this algorithm was developed, there have been several advances including more efficient forms of value iteration and various kinds of policy iteration algorithms. Some of these are discussed in the notes at the end of the chapter. 
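Equation (17.13) and the generate-and-prune loop of Figure 17.9 can be sketched directly for the two-state world, reusing R2, T2 and GAMMA from the previous sketch. Exact removal of dominated plans uses linear programs, as the figure caption notes; the version below substitutes a cheaper approximate test, keeping a plan only if it is best somewhere on a fine grid of belief states, which is my simplification rather than the book's method.

O2 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.4, 1: 0.6}}   # sensor model P(e | s'): correct w.p. 0.6

def alpha(action, subplans):
    """Equation (17.13): alpha_p(s) = R(s) + gamma sum_s' P(s'|s,a) sum_e P(e|s') alpha_p.e(s'),
    where subplans[e] is the alpha vector of the depth-(d-1) subplan chosen for percept e."""
    return {s: R2[s] + GAMMA * sum(p * sum(O2[s2][e] * subplans[e][s2] for e in (0, 1))
                                   for s2, p in T2[action][s].items())
            for s in (0, 1)}

def pomdp_value_iteration(depth, grid=101):
    """Alpha vectors of the (approximately) undominated conditional plans of the given depth."""
    plans = [{0: R2[0], 1: R2[1]}]                 # the empty plan, with alpha_[](s) = R(s)
    beliefs = [i / (grid - 1) for i in range(grid)]
    for _ in range(depth):
        candidates = [alpha(a, {0: p0, 1: p1})
                      for a in ('Stay', 'Go') for p0 in plans for p1 in plans]
        # Approximate pruning: keep a plan only if it is best at some grid belief state.
        keep = {max(range(len(candidates)),
                    key=lambda i: (1 - b) * candidates[i][0] + b * candidates[i][1])
                for b in beliefs}
        plans = [candidates[i] for i in sorted(keep)]
    return plans

print(len(pomdp_value_iteration(2)))   # 4: the undominated two-step plans of Figure 17.8(c)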
For general POMDPs, however, finding optimal policies is very difficult (PSPACE- hard, in fact—i.e., very hard indeed). Problems with a few dozen states are often infeasible. The next section describes a different, approximate method for solving POMDPs, one based on look-ahead search. 664 Chapter 17. Making Complex Decisions At–2 At–1 At At+1 At+2 Xt–1 Xt Xt+1 Xt+2 Xt+3 Ut+3 Rt–1 Rt Rt+1 Rt+2 Et–1 Et Et+1 Et+2 Et+3 Figure 17.10 The generic structure of a dynamic decision network. Variables with known values are shaded. The current time is t and the agent must decide what to do—that is, choose a value for At. The network has been unrolled into the future for three steps and represents future rewards, as well as the utility of the state at the look-ahead horizon. 17.4.3 Online agents for POMDPs In this section, we outline a simple approach to agent design for partially observable, stochas- tic environments. The basic elements of the design are already familiar: The transition and sensor models are represented by a dynamic Bayesian network (DBN), as described in Chapter 15. The dynamic Bayesian network is extended with decision and utility nodes, as used in decision networks in Chapter 16. The resulting model is called a dynamic decision DYNAMIC DECISION NETWORK network, or DDN. A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation. Decisions are made by projecting forward possible action sequences and choosing the best one. DBNs are factored representations in the terminology of Chapter 2; they typically have an exponential complexity advantage over atomic representations and can model quite sub- stantial real-world problems. The agent design is therefore a practical implementation of the utility-based agent sketched in Chapter 2. In the DBN, the single state St becomes a set of state variables Xt , and there may be multiple evidence variables Et. We will use At to refer to the action at time t, so the transition model becomes P(Xt+1 |Xt , At ) and the sensor model becomes P(Et |Xt ). We will use Rt to refer to the reward received at time t and Ut to refer to the utility of the state at time t. (Both of these are random variables.) With this notation, a dynamic decision network looks like the one shown in Figure 17.10. Dynamic decision networks can be used as inputs for any POMDP algorithm, including those for value and policy iteration methods. In this section, we focus on look-ahead methods that project action sequences forward from the current belief state in much the same way as do the game-playing algorithms of Chapter 5. The network in Figure 17.10 has been projected three steps into the future; the current and future decisions A and the future observations Section 17.4. Partially Observable MDPs 665 At in P(Xt | E1:t) Et+1............... At+1 in P(Xt+1 | E1:t+1)............ Et+2............ At+2 in P(Xt+2 | E1:t+2)............ Et+3............ U(Xt+3)... 10 4 6 3 Figure 17.11 Part of the look-ahead solution of the DDN in Figure 17.10. Each decision will be taken in the belief state indicated. E and rewards R are all unknown. Notice that the network includes nodes for the rewards for Xt+1 and Xt+2 , but the utility for Xt+3. This is because the agent must maximize the (discounted) sum of all future rewards, and U (Xt+3 ) represents the reward for Xt+3 and all subsequent rewards. 
As in Chapter 5, we assume that U is available only in some approximate form: if exact utility values were available, look-ahead beyond depth 1 would be unnecessary. Figure 17.11 shows part of the search tree corresponding to the three-step look-ahead DDN in Figure 17.10. Each of the triangular nodes is a belief state in which the agent makes a decision At+i for i = 0, 1, 2,.... The round (chance) nodes correspond to choices by the environment, namely, what evidence Et+i arrives. Notice that there are no chance nodes corresponding to the action outcomes; this is because the belief-state update for an action is deterministic regardless of the actual outcome. The belief state at each triangular node can be computed by applying a filtering al- gorithm to the sequence of percepts and actions leading to it. In this way, the algorithm takes into account the fact that, for decision At+i , the agent will have available percepts Et+1 ,... , Et+i , even though at time t it does not know what those percepts will be. In this way, a decision-theoretic agent automatically takes into account the value of information and will execute information-gathering actions where appropriate. A decision can be extracted from the search tree by backing up the utility values from the leaves, taking an average at the chance nodes and taking the maximum at the decision nodes. This is similar to the E XPECTIMINIMAX algorithm for game trees with chance nodes, except that (1) there can also be rewards at non-leaf states and (2) the decision nodes corre- spond to belief states rather than actual states. The time complexity of an exhaustive search to depth d is O(|A|d · |E|d ), where |A| is the number of available actions and |E| is the num- ber of possible percepts. (Notice that this is far less than the number of depth-d conditional 666 Chapter 17. Making Complex Decisions plans generated by value iteration.) For problems in which the discount factor γ is not too close to 1, a shallow search is often good enough to give near-optimal decisions. It is also possible to approximate the averaging step at the chance nodes, by sampling from the set of possible percepts instead of summing over all possible percepts. There are various other ways of finding good approximate solutions quickly, but we defer them to Chapter 21. Decision-theoretic agents based on dynamic decision networks have a number of advan- tages compared with other, simpler agent designs presented in earlier chapters. In particular, they handle partially observable, uncertain environments and can easily revise their “plans” to handle unexpected evidence. With appropriate sensor models, they can handle sensor failure and can plan to gather information. They exhibit “graceful degradation” under time pressure and in complex environments, using various approximation techniques. So what is missing? One defect of our DDN-based algorithm is its reliance on forward search through state space, rather than using the hierarchical and other advanced planning techniques described in Chap- ter 11. There have been attempts to extend these techniques into the probabilistic domain, but so far they have proved to be inefficient. A second, related problem is the basically proposi- tional nature of the DDN language. We would like to be able to extend some of the ideas for first-order probabilistic languages to the problem of decision making. Current research has shown that this extension is possible and has significant benefits, as discussed in the notes at the end of the chapter. 
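The look-ahead procedure of Figure 17.11 is essentially expectimax over belief states: maximize over actions, average over percepts weighted by P(e | a, b), collect rewards along the way, and fall back on an approximate utility estimate at the horizon. The sketch below reuses the forward, sensor_model, P, R, TERMINALS and b0 definitions from the earlier sketches; the horizon estimate rho(b) is a crude placeholder of my own, standing in for whatever approximate U is available.

def rho(b):
    """Expected immediate reward of a belief state: sum_s b(s) R(s)."""
    return sum(p * R(s) for s, p in b.items())

def percept_probability(e, a, b):
    """P(e | a, b) = sum_s' P(e|s') sum_s P(s'|s,a) b(s), with terminals absorbing."""
    total = 0.0
    for s, p in b.items():
        if p == 0.0:
            continue
        dist = {s: 1.0} if s in TERMINALS else P(s, a)
        for s2, q in dist.items():
            total += sensor_model(e, s2) * q * p
    return total

def lookahead(b, depth, gamma=0.999, percepts=(1, 2)):
    """Depth-limited expectimax over belief states (cf. Figure 17.11): maximize over
    actions, average over percepts, and use rho(b) as a crude stand-in for the
    approximate utility estimate at the horizon."""
    if depth == 0:
        return rho(b), None
    best_value, best_action = float('-inf'), None
    for a in ('Up', 'Down', 'Left', 'Right'):
        value = rho(b)
        for e in percepts:
            pe = percept_probability(e, a, b)
            if pe > 0:
                value += gamma * pe * lookahead(forward(b, a, e), depth - 1, gamma, percepts)[0]
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

print(lookahead(b0, depth=2)[1])   # the action chosen for the uniform initial belief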
17.5 Decisions with Multiple Agents: Game Theory

This chapter has concentrated on making decisions in uncertain environments. But what if the uncertainty is due to other agents and the decisions they make? And what if the decisions of those agents are in turn influenced by our decisions? We addressed this question once before, when we studied games in Chapter 5. There, however, we were primarily concerned with turn-taking games in fully observable environments, for which minimax search can be used to find optimal moves. In this section we study the aspects of game theory that analyze games with simultaneous moves and other sources of partial observability. (Game theorists use the terms perfect information and imperfect information rather than fully and partially observable.) Game theory can be used in at least two ways:

1. Agent design: Game theory can analyze the agent's decisions and compute the expected utility for each decision (under the assumption that other agents are acting optimally according to game theory). For example, in the game two-finger Morra, two players, O and E, simultaneously display one or two fingers. Let the total number of fingers be f. If f is odd, O collects f dollars from E; and if f is even, E collects f dollars from O. Game theory can determine the best strategy against a rational player and the expected return for each player.[4]

2. Mechanism design: When an environment is inhabited by many agents, it might be possible to define the rules of the environment (i.e., the game that the agents must play) so that the collective good of all agents is maximized when each agent adopts the game-theoretic solution that maximizes its own utility. For example, game theory can help design the protocols for a collection of Internet traffic routers so that each router has an incentive to act in such a way that global throughput is maximized. Mechanism design can also be used to construct intelligent multiagent systems that solve complex problems in a distributed fashion.

[4] Morra is a recreational version of an inspection game. In such games, an inspector chooses a day to inspect a facility (such as a restaurant or a biological weapons plant), and the facility operator chooses a day to hide all the nasty stuff. The inspector wins if the days are different, and the facility operator wins if they are the same.

17.5.1 Single-move games

We start by considering a restricted set of games: ones where all players take action simultaneously and the result of the game is based on this single set of actions. (Actually, it is not crucial that the actions take place at exactly the same time; what matters is that no player has knowledge of the other players' choices.) The restriction to a single move (and the very use of the word "game") might make this seem trivial, but in fact, game theory is serious business. It is used in decision-making situations including the auctioning of oil drilling rights and wireless frequency spectrum rights, bankruptcy proceedings, product development and pricing decisions, and national defense: situations involving billions of dollars and hundreds of thousands of lives. A single-move game is defined by three components:

- Players or agents who will be making decisions. Two-player games have received the most attention, although n-player games for n > 2 are also common. We give players capitalized names, like Alice and Bob or O and E.
- Actions that the players can choose. We will give actions lowercase names, like one or testify. The players may or may not have the same set of actions available.

- A payoff function that gives the utility to each player for each combination of actions by all the players. For single-move games the payoff function can be represented by a matrix, a representation known as the strategic form (also called normal form). The payoff matrix for two-finger Morra is as follows:

                     O: one              O: two
    E: one           E = +2, O = −2      E = −3, O = +3
    E: two           E = −3, O = +3      E = +4, O = −4

For example, the lower-right corner shows that when player O chooses action two and E also chooses two, the payoff is +4 for E and −4 for O.

Each player in a game must adopt and then execute a strategy (which is the name used in game theory for a policy). A pure strategy is a deterministic policy; for a single-move game, a pure strategy is just a single action. For many games an agent can do better with a mixed strategy, which is a randomized policy that selects actions according to a probability distribution. The mixed strategy that chooses action a with probability p and action b otherwise is written [p: a; (1 − p): b]. For example, a mixed strategy for two-finger Morra might be [0.5: one; 0.5: two]. A strategy profile is an assignment of a strategy to each player; given the strategy profile, the game's outcome is a numeric value for each player.

A solution to a game is a strategy profile in which each player adopts a rational strategy. We will see that the most important issue in game theory is to define what "rational" means when each agent chooses only part of the strategy profile that determines the outcome. It is important to realize that outcomes are actual results of playing a game, while solutions are theoretical constructs used to analyze a game. We will see that some games have a solution only in mixed strategies. But that does not mean that a player must literally be adopting a mixed strategy to be rational.

Consider the following story: Two alleged burglars, Alice and Bob, are caught red-handed near the scene of a burglary and are interrogated separately. A prosecutor offers each a deal: if you testify against your partner as the leader of a burglary ring, you'll go free for being the cooperative one, while your partner will serve 10 years in prison. However, if you both testify against each other, you'll both get 5 years. Alice and Bob also know that if both refuse to testify they will serve only 1 year each for the lesser charge of possessing stolen property. Now Alice and Bob face the so-called prisoner's dilemma: should they testify or refuse? Being rational agents, Alice and Bob each want to maximize their own expected utility. Let's assume that Alice is callously unconcerned about her partner's fate, so her utility decreases in proportion to the number of years she will spend in prison, regardless of what happens to Bob. Bob feels exactly the same way. To help reach a rational decision, they both construct the following payoff matrix:

                     Alice: testify      Alice: refuse
    Bob: testify     A = −5, B = −5      A = −10, B = 0
    Bob: refuse      A = 0, B = −10      A = −1, B = −1

Alice analyzes the payoff matrix as follows: "Suppose Bob testifies. Then I get 5 years if I testify and 10 years if I don't, so in that case testifying is better.
On the other hand, if Bob refuses, then I get 0 years if I testify and 1 year if I refuse, so in that case as well testifying is better. So in either case, it's better for me to testify, so that's what I must do."

Alice has discovered that testify is a dominant strategy for the game. We say that a strategy s for player p strongly dominates strategy s′ if the outcome for s is better for p than the outcome for s′, for every choice of strategies by the other player(s). Strategy s weakly dominates s′ if s is better than s′ on at least one strategy profile and no worse on any other. A dominant strategy is a strategy that dominates all others. It is irrational to play a dominated strategy, and irrational not to play a dominant strategy if one exists. Being rational, Alice chooses the dominant strategy. We need just a bit more terminology: we say that an outcome is Pareto optimal[5] if there is no other outcome that all players would prefer. An outcome is Pareto dominated by another outcome if all players would prefer the other outcome.

[5] Pareto optimality is named after the economist Vilfredo Pareto (1848–1923).

If Alice is clever as well as rational, she will continue to reason as follows: Bob's dominant strategy is also to testify. Therefore, he will testify and we will both get five years. When each player has a dominant strategy, the combination of those strategies is called a dominant strategy equilibrium. In general, a strategy profile forms an equilibrium if no player can benefit by switching strategies, given that every other player sticks with the same strategy. An equilibrium is essentially a local optimum in the space of policies; it is the top of a peak that slopes downward along every dimension, where a dimension corresponds to a player's strategy choices.

The mathematician John Nash (1928–2015) proved that every game has at least one equilibrium. The general concept of equilibrium is now called Nash equilibrium in his honor. Clearly, a dominant strategy equilibrium is a Nash equilibrium (Exercise 17.16), but some games have Nash equilibria but no dominant strategies.

The dilemma in the prisoner's dilemma is that the equilibrium outcome is worse for both players than the outcome they would get if they both refused to testify. In other words, (testify, testify) is Pareto dominated by the (−1, −1) outcome of (refuse, refuse). Is there any way for Alice and Bob to arrive at the (−1, −1) outcome? It is certainly an allowable option for both of them to refuse to testify, but it is hard to see how rational agents can get there, given the definition of the game. Either player contemplating playing refuse will realize that he or she would do better by playing testify. That is the attractive power of an equilibrium point. Game theorists agree that being a Nash equilibrium is a necessary condition for being a solution, although they disagree about whether it is a sufficient condition.

It is easy enough to get to the (refuse, refuse) solution if we modify the game. For example, we could change to a repeated game in which the players know that they will meet again. Or the agents might have moral beliefs that encourage cooperation and fairness. That means they have a different utility function, necessitating a different payoff matrix, making it a different game.
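For small strategic-form games, the definitions above (strong dominance, dominant strategies, and pure-strategy equilibria) can be checked mechanically. The sketch below is our own illustration rather than anything from the text: it encodes a two-player payoff matrix as a dictionary from strategy profiles to payoff pairs, finds strongly dominant strategies, and enumerates pure-strategy Nash equilibria for the prisoner's dilemma. It does not look for mixed-strategy equilibria.

```python
# Prisoner's dilemma payoffs, indexed by (Alice's action, Bob's action);
# each entry is (Alice's utility, Bob's utility): years in prison, negated.
PRISONERS_DILEMMA = {
    ('testify', 'testify'): (-5, -5),
    ('testify', 'refuse'):  (0, -10),
    ('refuse',  'testify'): (-10, 0),
    ('refuse',  'refuse'):  (-1, -1),
}

def strategies(payoffs, player):
    """All pure strategies of `player`, which indexes the profile (0 = Alice, 1 = Bob here)."""
    return sorted({profile[player] for profile in payoffs})

def payoff(payoffs, own, other, player):
    """Utility to `player` when playing `own` against the opponent's `other`."""
    profile = (own, other) if player == 0 else (other, own)
    return payoffs[profile][player]

def dominant_strategy(payoffs, player):
    """Return a strategy that strongly dominates every other strategy of `player`, or None."""
    own, other = strategies(payoffs, player), strategies(payoffs, 1 - player)
    for s in own:
        if all(payoff(payoffs, s, o, player) > payoff(payoffs, s2, o, player)
               for s2 in own if s2 != s
               for o in other):
            return s
    return None

def pure_nash_equilibria(payoffs):
    """Profiles from which no single player can gain by switching strategies unilaterally."""
    return [profile for profile in payoffs
            if all(payoff(payoffs, profile[p], profile[1 - p], p) >=
                   payoff(payoffs, s2, profile[1 - p], p)
                   for p in (0, 1)
                   for s2 in strategies(payoffs, p))]

print(dominant_strategy(PRISONERS_DILEMMA, 0))   # 'testify' (Alice)
print(dominant_strategy(PRISONERS_DILEMMA, 1))   # 'testify' (Bob)
print(pure_nash_equilibria(PRISONERS_DILEMMA))   # [('testify', 'testify')]
```

For games without a dominant strategy, dominant_strategy returns None while pure_nash_equilibria may still return one or more profiles, or none at all, in which case a solution exists only in mixed strategies, as noted above.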
We will see later that agents with limited computational powers, rather than the ability to reason absolutely rationally, can reach non-equilibrium outcomes, as can an agent that knows that the other agent has limited rationality. In each case, we are considering a different game than the one described by the payoff matrix above.

Now let's look at a game that has no dominant strategy. Acme, a video game console manufacturer, has to decide whether its next game machine will use Blu-ray discs or DVDs. Meanwhile, the video game software producer Best needs to decide whether to produce its next game on Blu-ray or DVD. The profits for both will be positive if they agree and negative if they disagree, as shown in the following payoff matrix:

                     Acme: bluray        Acme: dvd
    Best: bluray     A = +9, B = +9      A = −4, B = −1
    Best: dvd        A = −3, B = −1      A = +5, B = +5

There is no dominant strategy equilibrium for this game, but there are two Nash equilibria: (bluray, bluray) and (dvd, dvd). We know these are Nash equilibria because if either player unilaterally moves to a different strategy, that player will be worse off. Now the agents have a problem: there are multiple acceptable solutions, but if each agent aims for a different solution, then both agents will suffer. How can they agree on a solution? One answer is that both should choose the Pareto-optimal solution (bluray, bluray); that is, we can res