Dopamine reward prediction error coding

Wolfram Schultz, MD, FRS

Author affiliation: Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom. Address for correspondence: Wolfram Schultz, Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge CB2 3DY, United Kingdom (email: [email protected]).

Dialogues Clin Neurosci. 2016;18:23-32. © 2016, AICH – Servier Research Group.

Keywords: neuron; substantia nigra; ventral tegmental area; striatum; neurophysiology; dopamine; reward; prediction

Abstract

Reward prediction errors consist of the differences between received and predicted rewards. They are crucial for basic forms of learning about rewards and make us strive for more rewards, an evolutionarily beneficial trait. Most dopamine neurons in the midbrain of humans, monkeys, and rodents signal a reward prediction error; they are activated by more reward than predicted (positive prediction error), remain at baseline activity for fully predicted rewards, and show depressed activity with less reward than predicted (negative prediction error). The dopamine signal increases nonlinearly with reward value and codes formal economic utility. Drugs of addiction generate, hijack, and amplify the dopamine reward signal and induce exaggerated, uncontrolled dopamine effects on neuronal plasticity. The striatum, amygdala, and frontal cortex also show reward prediction error coding, but only in subpopulations of neurons. Thus, the important concept of reward prediction errors is implemented in neuronal hardware.

Introduction

I am standing in front of a drink-dispensing machine in Japan that seems to allow me to buy six different types of drinks, but I cannot read the words. I have a low expectation that pressing a particular button will deliver my preferred blackcurrant juice (a chance of one in six). So I just press the second button from the right, and then a blue can appears with a familiar logo that happens to be exactly the drink I want. That is a pleasant surprise, better than expected. What would I do the next time I want the same blackcurrant juice from the machine? Of course, press the second button from the right. Thus, my surprise directs my behavior to a specific button. I have learned something, and I will keep pressing the same button as long as the same can comes out. However, a couple of weeks later, I press that same button again, but another, less preferred can appears. Unpleasant surprise; somebody must have filled the dispenser differently. Where is my preferred can? I press another couple of buttons until my blue can comes out. And of course I will press that button again the next time I want that blackcurrant juice, and hopefully all will go well.

What happened? The first button press delivered my preferred can. This pleasant surprise is what we call a positive reward prediction error. "Error" refers to the difference between the can that came out and the low expectation of getting exactly that one, irrespective of whether I made an error or something else went wrong. "Reward" is any object or stimulus that I like and of which I want more. "Reward prediction error" then means the difference between the reward I get and the reward that was predicted. Numerically, the prediction error on my first press was 1 minus 1/6, the difference between what I got and what I reasonably expected. Once I get the same can again and again for the same button press, I get no more surprises; there is no prediction error, I don't change my behavior, and thus I learn nothing more about these buttons. But what about the wrong can coming out 2 weeks later? I had the firm expectation of my preferred blackcurrant juice but, unpleasant surprise, the can that came out was not the one I preferred. I experienced a negative prediction error, the difference between the nonpreferred, lower-valued can and the expected preferred can. At the end of the exercise, I have learned where to get my preferred blackcurrant juice, and the prediction errors helped me to learn where to find it. Even if all this sounds arcane, it is the formal description of my Japanese experience. This is what this article is about: how our brains process reward prediction errors to help us get our drinks, and all the other rewards and good things in life.
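Stated compactly (a notational sketch of the definition just given, not an equation from the original article), the prediction error δ is the signed difference between received and predicted reward:

```latex
\[
\delta = R_{\text{received}} - R_{\text{predicted}}
\]
```

For the first button press, δ = 1 − 1/6 = 5/6 > 0, a positive prediction error; once the same can arrives reliably, δ = 1 − 1 = 0 and nothing more is learned; when the wrong can appears, δ < 0.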
Reward prediction errors for learning

Rewards produce learning. Pavlov's dog hears a bell, sees a sausage, and salivates. If this is done often enough, the dog will salivate merely on hearing the bell.1 We say that the bell predicts the sausage, and that is why the dog salivates. This type of learning occurs automatically, without the dog doing anything except being awake. Operant conditioning, another basic form of learning, requires the animal's participation. Thorndike's cat runs around a cage until it happens to press a latch and suddenly gets out and can eat.2 The food is great, and the cat presses again, and again. Operant learning requires the subject's own action; otherwise no reward will come and no learning will occur. Pavlovian and operant learning constitute the building blocks for behavioral reactions to rewards.

Both learning forms involve prediction errors.3 To understand prediction errors, we distinguish between a prediction about a future reward, or no prediction (which is also a prediction, but a poorly defined one), and the subsequent reward. Then we compare the reward with the prediction; the reward is either better than, equal to, or worse than its prediction. The future behavior will change depending on the experienced difference between the reward and its prediction, the prediction error (Figure 1). If the reward differs from its prediction, a prediction error exists, and we should update the prediction and change our behavior (red in Figure 1). Specifically, if the reward is better than predicted (positive prediction error), which is what we all want, the prediction becomes better and we will do more of the behavior that resulted in that reward. If the reward is worse than predicted (negative prediction error), which nobody wants, the prediction becomes worse and we will avoid this the next time around. In both cases, our prediction and behavior change; we are learning. By contrast, if the reward is exactly as predicted (blue in Figure 1), there is no prediction error, and we keep our prediction and behavior unchanged; we learn nothing. The intuition behind prediction-error learning is that we often learn by making mistakes. Although mistakes are usually poorly regarded, they nevertheless help us to get a task right in the end and obtain a reward. If no further error occurs, the behavior will not change until the next error. This applies to learning for obtaining rewards as well as to learning movements.

Figure 1. Scheme of learning by prediction error. The agent uses a prediction and receives an outcome. Red: a prediction error exists when the reward differs from its prediction, and the prediction is updated. Blue: no error exists when the outcome matches the prediction; the prediction is kept and the behavior remains unchanged.
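The update logic of Figure 1 can be written as a few lines of code. The following is a minimal illustrative sketch (not code from the article), using a Rescorla-Wagner-style rule in which the prediction moves toward the received reward by a fraction alpha of the prediction error; the initial value of 1/6 and the learning rate are arbitrary choices for the vending-machine example.

```python
# Minimal sketch of learning by prediction error (the Figure 1 scheme).
# All parameter values are illustrative, not taken from the article.

def update_prediction(prediction, reward, alpha=0.2):
    """Rescorla-Wagner-style update: shift the prediction toward the reward."""
    error = reward - prediction      # positive, zero, or negative prediction error
    return prediction + alpha * error, error

prediction = 1.0 / 6.0               # low initial expectation: one chance in six
for trial in range(10):
    prediction, error = update_prediction(prediction, reward=1.0)
    print(f"trial {trial}: error = {error:+.3f}, prediction = {prediction:.3f}")

# The error shrinks toward zero as the prediction approaches the reward;
# once the reward is fully predicted, neither the prediction nor the
# behavior changes any further, exactly as in the blue branch of Figure 1.
```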
The whole learning mechanism works because we want positive prediction errors and hate negative prediction errors. This is apparently a mechanism built in by evolution that pushes us to always want more and never want less. This is what drives life and evolution, and makes us buy a bigger car when our neighbors muscle up on their cars (the neighbors' average car size serves as a reference that is equivalent to a prediction). Even a Buddhist, who hates wanting and craving for material goods and unattainable goals, wants more happiness rather than less. Thus, the study of reward prediction errors touches the fundamental conditions of life.

Reward is in the brain

The study of reward processing in the brain started when Olds and Milner4 introduced electrodes into the brains of rats and delivered small electric currents when the animals pressed a lever. Such currents elicit action potentials in thousands of neurons within a millimeter around the electrode. Olds and Milner placed electrodes in different brain regions, and in some of these regions they found a remarkable effect: the rats pressed more to get more electrical stimulation of their brains. The animals were so fascinated by the lever pressing that they forgot to eat and drink for a while. Not even a female rat could distract a male. It seemed that there was nothing better than this brain stimulation. The surge of action potentials in neurons incited the animals again and again to press the lever, a typical manifestation of the function of rewards in learning and approach behavior. Olds and Milner had found a physical correlate for reward in the brain!
Subsequent studies showed that about half of the effective locations for electrical self-stimulation are connected with dopamine neurons.5 Dopamine neurons are located in the midbrain, just behind the mouth, between the ears; there are about a million in humans, 200 000 in monkeys, and 20 000 in rats. They extend their axons several millimeters into the striatum, frontal cortex, amygdala, and several other brain regions. The self-stimulation data demonstrate that the action potentials of dopamine neurons induce learning and approach behavior, thus linking brain function in a causal way to behavior.

But do dopamine neurons generate action potentials when a reward is encountered, without being stimulated by electric currents? The answer is yes.6 Showing a human, a monkey, or a rat money, food, or liquid makes the large majority of their dopamine neurons produce action potentials similar to those that occur during electrical self-stimulation. The higher the reward, the stronger the dopamine response. A closer look reveals that the dopamine neurons respond not only when the animal receives a reward but also when a stimulus, such as a light, picture, or sound, predicts a reward. Such reward-predicting stimuli are conditioned rewards and have effects on learning and approach behavior similar to those of real rewards. Dopamine neurons treat reward predictors and real rewards in a similar way, as events that are valuable for the individual. In addition, predictive stimuli allow animals to plan ahead and make informed decisions. Thus, in signaling both rewards and reward-predicting stimuli, dopamine neurons provide information about past and future rewards that is helpful for learning and decision-making.

Reward prediction errors in dopamine neurons

However, the dopamine response shows something else. The response to the reward itself disappears when the reward is predicted. But if more than the predicted reward occurs, the dopamine neurons show stronger responses. By contrast, their activity decreases if no reward, or less reward than predicted, occurs. The dopamine response thus reflects a reward prediction error and can be described by the simple difference between obtained and predicted reward (Figure 2). At the time of the reward, more reward than predicted induces a positive dopamine response (excitation or activation), as much reward as predicted induces no response, and less reward than predicted leads to a negative response (depression of activity).7,8 These responses exist not only in monkeys but are also found in dopamine neurons in humans9 and rodents.10,11 Thus, dopamine neurons don't just respond to any old reward: they respond only to rewards that differ from their prediction.

Figure 2. Reward prediction error responses at the time of reward (right) and of reward-predicting visual stimuli (left in the bottom two graphs). The dopamine neuron is activated by an unpredicted reward eliciting a positive reward prediction error (+ error, top), shows no response to a fully predicted reward eliciting no prediction error (0 error, middle), and is depressed by the omission of a predicted reward eliciting a negative prediction error (- error, bottom). Reproduced from ref 8: Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593-1599. Copyright © 1997 American Association for the Advancement of Science.

But that's not all. The dopamine response is transferred to the next preceding reward-predicting stimulus, and ultimately to the first predictive stimulus (Figure 3A).7,12 The longer the time between the first stimulus and the final reward, the smaller the dopamine response, as subjective reward value becomes lower with greater delays, a phenomenon known as temporal discounting; accordingly, dopamine responses decrease in specific temporal discounting tests.13 The response to the reward-predicting stimulus itself depends on the prediction of that stimulus, in the same way as the response to the reward. Thus, dopamine neurons respond to reward-predicting stimuli in the same way as to rewards, only with slightly less intensity, which allows them to use predictive information for teaching even earlier stimuli and actions.
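The delay effect can be summarized with a standard discounted-value formula (a textbook formalization, not one given in the article; TD models typically use this exponential form, while empirical discounting, including the dopamine data of ref 13, is often better fit by hyperbolic functions):

```latex
\[
V = \gamma^{\,d} A, \qquad 0 < \gamma < 1
\]
```

Here A is the reward amount, d the stimulus-reward delay in time steps, and γ the per-step discount factor; longer delays yield smaller predicted values and, correspondingly, smaller dopamine responses to the predictive stimulus.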
In this way, dopamine signals can be useful for learning long chains of events. Processing prediction errors rather than full information about an environmental event saves neuronal information processing14 and, in the case of rewards, excites neurons with larger-than-predicted rewards.

The dopamine reward prediction error response occurs pretty much in the way conceptualized by the Rescorla-Wagner model,3 which describes reward learning driven by prediction errors. In addition, the dopamine prediction error signal with reward-predicting stimuli corresponds well to the teaching term of temporal difference (TD) learning, a derivative of the Rescorla-Wagner model.15 Indeed, dopamine responses in simple and complex tasks correlate very well with formal TD models (Figure 3B).8,16 The existence of such neuronal error signals suggests that some brain processes operate on the principle of error learning. The dopamine error signal could be a teaching signal that affects neuronal plasticity in brain structures involved in reward learning, including the striatum, frontal cortex, and amygdala. The error signal also serves an important function in economic decisions, because it helps to update the value signals for the different choice options.

Figure 3. (A) Stepwise transfer of the dopamine response from the reward to the first reward-predicting stimulus. CS2: earliest reward-predicting stimulus; CS1: subsequent reward-predicting stimulus. From ref 12. (B) Reward prediction error responses in a sequential stimulus-reward task (gray bars) closely parallel the prediction errors of a formal temporal difference (TD) reinforcement model (black lines). Averaged population responses of 26 dopamine neurons follow the reward probabilities at each sequence step (numbers indicate reward probabilities in %) and match the time course of TD prediction errors. Reproduced from ref 16: Enomoto K, Matsumoto N, Nakai S, et al. Dopamine neurons learn to encode the long-term value of multiple future rewards. Proc Natl Acad Sci U S A. 2011;108:15462-15467. Copyright © 2011 National Academy of Sciences.
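The TD teaching term can be made concrete with a small tabular simulation. This is an illustrative sketch of standard TD(0) learning (the textbook form of ref 15, not the models fitted in refs 8 and 16): over repeated trials, the prediction error migrates from the time of the reward back toward the earliest predictive stimulus, as in Figure 3A.

```python
# Illustrative TD(0) sketch: each trial is the sequence CS2 -> CS1 -> reward.
# Learning rate and discount factor are arbitrary demonstration values.

V = {"CS2": 0.0, "CS1": 0.0, "end": 0.0}   # learned state values
alpha, gamma = 0.3, 0.95

for trial in range(60):
    errors = []
    for state, next_state in [("CS2", "CS1"), ("CS1", "end")]:
        r = 1.0 if next_state == "end" else 0.0       # juice arrives at trial end
        delta = r + gamma * V[next_state] - V[state]  # TD prediction error
        V[state] += alpha * delta
        errors.append(delta)
    if trial in (0, 5, 59):
        print(trial, [f"{d:+.2f}" for d in errors],
              {s: f"{v:.2f}" for s, v in V.items()})

# Early trials: the error is large at reward time and absent at CS2.
# Late trials: the reward-time error is near zero and value has propagated
# back to CS2, so an unpredicted appearance of CS2 now carries the response.
```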
Multiple components in dopamine responses

If we look closely, the form of the dopamine response can be quite tricky, not unlike that of other neuronal responses in the brain. With careful analysis,17 or when demanding stimuli require extended processing time,18 two response components become visible (Figure 4). An initial, unselective component registers any object that appears in the environment, including a reward. This salience response occurs with all kinds of stimuli, including punishers and neutral stimuli, and seems to simply alert the neurons to possible rewards in the environment. It subsides within a few tens to hundreds of milliseconds, once the neurons have properly identified the object and its reward value. Thus, the neurons code salience only in a transitory manner. Then a second, selective response component becomes identifiable, which reflects only the reward information (as a reward prediction error). From this point on, the dopamine neurons represent only reward information.

With these two components, the dopamine neurons start processing an encountered stimulus or object before they even know whether it is a reward, which gives them precious time to prepare a potential behavioral reaction; the preparation can be cancelled if the object turns out not to be a reward. Also, the attentional characteristics of the initial response enhance the subsequent processing of reward information. Overall, this mechanism affords a gain in speed and accuracy without major cost.

Before this two-component response structure had been identified, some interpretations considered only the attentional response, which we now know concerns only the initial response component. Without considering the second, reward response component, the whole dopamine response appeared to be simply a salience signal,19 and the function in reward prediction error coding was missed. Experiments on aversive stimuli reported some dopamine activations,20,21 which were later found to reflect the physical rather than the aversive stimulus components.17 Together, these considerations led to assumptions of primarily salience coding in dopamine neurons,22 which can probably now be laid to rest.

Figure 4. Schematic of the two phasic dopamine response components. The initial component (blue) detects the event before its value has been identified; it increases with sensory impact (physical salience), generalization to rewarded stimuli, reward context, and novelty (novelty/surprise salience). The second component (red) codes reward value (as a reward utility prediction error). The two components become more distinct with more demanding stimuli. Adapted from the data graph in ref 17: Fiorillo CD, Song MR, Yun SR. Multiphasic temporal dynamics in responses of midbrain dopamine neurons to appetitive and aversive stimuli. J Neurosci. 2013;33:4710-4725. Copyright © 2013 Society for Neuroscience.

Risky rewards, subjective value, and formal economic utility

Rewards are only sure in the laboratory. In real life, rewards are risky. We go to the pub expecting to meet friends and have a pint of beer. But we don't know whether the friends will actually be there this evening, or whether the pub might have run out of our favorite beer. Risk is usually considered something bad, and in the case of rewards, the risk is associated with the possibility of not getting the best reward we expect. We won't give animals beer in the laboratory, but we can test reward risk nicely by drawing on economic theory. The simplest and most confound-free risky rewards can be tested with binary gambles in which either a large or a small reward occurs with equal probability of P=0.5:23 I get either the large or the small reward with equal chance, but not both. Larger risk is modeled by increasing the large reward and reducing the small reward by equal amounts, thus keeping the mean reward constant. Monkeys like fruit juices and prefer such risky rewards over safe rewards with the same mean when the juice amount is low; they are risk seekers. Accordingly, they prefer the more widely spread gamble over the less spread one; thus they satisfy what is referred to as second-order stochastic dominance, a basic economic construct for assessing the integration of risk into subjective value. However, with larger juice amounts, monkeys prefer the safe reward over the gamble; they are risk avoiders, just as humans often are with larger rewards (Figure 5A). Thus, monkeys show meaningful choices among risky rewards.
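The gamble construction can be stated concretely. This is an illustrative sketch (the juice amounts are arbitrary, not the experimental values): a mean-preserving spread widens both outcomes equally, increasing variance, and hence risk, while leaving the expected value unchanged.

```python
# Illustrative sketch of the binary, equiprobable gamble design (P = 0.5 each).

def gamble_stats(low, high):
    """Expected value and variance of a 50/50 gamble over two juice amounts (mL)."""
    ev = 0.5 * low + 0.5 * high
    var = 0.5 * (low - ev) ** 2 + 0.5 * (high - ev) ** 2
    return ev, var

print(gamble_stats(0.2, 0.6))   # approx (0.4, 0.04)
# Mean-preserving spread: shift both outcomes outward by the same 0.1 mL.
print(gamble_stats(0.1, 0.7))   # approx (0.4, 0.09): same mean, larger risk
```

A risk seeker's certainty equivalent then lies above 0.4 mL (CE > EV) and a risk avoider's below it (CE < EV), the two cases shown in Figure 5A.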
The binary reward risk is fully characterized by the statistical variance. However, as good a test as it is for decisions under risk, it does not fully capture everyday risks, which often involve asymmetric probability distributions and thus skewness risk. Why do gamblers often have health insurance? They prefer positive skewness risk, a small but real chance of winning large amounts, while avoiding negative skewness risk, the small but real chance of a costly medical treatment. Future risk tests should include skewness risk to model more real-life risk scenarios.

The study of risky rewards allows us to address two important questions for dopamine neurons: the incorporation of risk into subjective reward value, and the construction of a formal, mathematical economic utility function, which provides the most theory-constrained definition of subjective reward value for economic decisions.24,25 Dopamine neurons show larger responses to risky compared with safe rewards in the low range, in the same direction as the animal's preferences; thus dopamine neurons follow second-order stochastic dominance. Formal economic utility can be inferred from risky gambles and constitutes an internal measure of reward value for an individual; it is measured in utils rather than in milliliters or pounds, euros, or dollars. It can be measured from choices under risk,26 using the fractile chaining procedure.27 Monkeys show nonlinear utility functions that are compatible with risk seeking at small juice amounts and risk avoidance at larger amounts. Importantly, dopamine neurons show the same nonlinear response increases with unpredicted rewards and risky gambles (Figure 5B). Thus, dopamine responses signal formal economic utility, the best-characterized measure of reward value; the dopamine reward prediction error response is in fact a utility prediction error response. This is the first utility signal ever observed in the brain, and to the economist it identifies a physical implementation of the artificial construct of utility.

Figure 5. (A) Top: testing risky rewards. An animal chooses between a safe reward whose amount is adjusted by the experimenter (left) and a fixed, binary, equiprobable gamble (right). The height of each bar indicates juice volume; two bars indicate occurrence of each indicated amount with P=0.5 (risky reward). Bottom: two psychophysical assessments of risk attitude in a monkey. CE indicates the amount of safe reward at choice indifference against the gamble (certainty equivalent); EV is the expected value of the gamble. CE > EV suggests risk seeking (the subjective gamble value CE exceeds the objective gamble value EV; left), whereas CE < EV indicates risk avoidance (right). (B) Positive utility prediction error responses to unpredicted juice rewards. Red: utility function derived from binary, equiprobable gambles. Black: corresponding nonlinear increase of the population response (n=14 dopamine neurons) in the same animal. A and B reproduced from ref 25: Stauffer WR, Lak A, Schultz W. Dopamine reward prediction error responses reflect marginal utility. Curr Biol. 2014;24:2491-2500. Copyright © 2014 Cell Press.
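The fractile chaining procedure can be outlined in code. This is a schematic sketch, not the experimental procedure of refs 26 and 27: the certainty equivalent here is computed from a hypothetical S-shaped utility function, whereas in the experiments it is measured behaviorally as the safe amount at choice indifference; successive 50/50 gambles then pin down the reward amounts at (normalized) utility levels 0.5, 0.25, 0.75, and so on.

```python
# Schematic sketch of fractile chaining over a 0-1.2 mL reward range.
# 'utility' is a hypothetical S-shaped function (convex then concave),
# NOT the measured monkey utility function of ref 25.

def utility(x):
    return x ** 2 / (x ** 2 + 0.25 ** 2)

def certainty_equivalent(low, high):
    """Amount whose utility equals the gamble's expected utility (grid search)."""
    target = 0.5 * utility(low) + 0.5 * utility(high)
    xs = [i * 1.2 / 10000 for i in range(10001)]
    return min(xs, key=lambda x: abs(utility(x) - target))

ce_mid = certainty_equivalent(0.0, 1.2)       # amount at utility level 0.5
ce_low = certainty_equivalent(0.0, ce_mid)    # amount at utility level 0.25
ce_high = certainty_equivalent(ce_mid, 1.2)   # amount at utility level 0.75
print(f"u=0.25 at {ce_low:.2f} mL, u=0.50 at {ce_mid:.2f} mL, u=0.75 at {ce_high:.2f} mL")

# In the behavioral experiment each certainty equivalent comes from choices;
# plotting utility level against measured CE traces out the utility function.
```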
Dopamine reward prediction errors in human imaging

Dopamine reward prediction error signals are not limited to animals; they occur also in humans. Besides the electrophysiological study mentioned above,9 hundreds of human neuroimaging studies demonstrate reward prediction error signals in the main reward structures,28 including the ventral striatum.29,30 The signal reflects the dopamine response31 and occurs in striatal and frontal dopamine terminal areas rather than in midbrain cell body regions, presumably because it reflects summed postsynaptic potentials. Thus, the responses in the dopamine-receiving striatum seen with human neuroimaging demonstrate the existence of neural reward prediction error signals in the human brain and their convenient measurement with noninvasive procedures.

General consequences of dopamine prediction error signaling

We know that dopamine stimulation generates learning and approach behavior.4 We also know that encountering a better-than-predicted reward stimulates dopamine neurons. Thus, the dopamine stimulation arising from a natural reward may directly induce behavioral learning and actions. Every time we see a reward, the responses of our dopamine neurons affect our behavior. They are like "little devils" in our brain that drive us to rewards! This becomes even more troubling because of a particular dopamine response characteristic, namely the positive dopamine response (activation) to positive prediction errors: the dopamine activation occurs when we get more reward than predicted. But any reward we receive automatically updates the prediction, and the previously larger-than-predicted reward becomes the norm and no longer triggers a dopamine prediction error surge. The next same reward starts from the higher prediction and hence induces less or no prediction error response. To continue getting the same prediction error, and thus the same dopamine stimulation, requires getting a bigger reward every time. The little devil not only drives us towards rewards, it drives us towards ever-increasing rewards.

The dopamine prediction error response may belong to a mechanism that underlies our drive for always wanting more reward. This mechanism would explain why we need ever higher rewards and are never satisfied with what we have. We want another car, not only because the neighbors have one but because we have become accustomed to our current one. Only a better, or at least a new, car would lead to a dopamine response, and that drives us to buy one. I have enough of my old clothes, even if they are still in very good condition, and therefore I go shopping. What the neighbors have, I want also, but better. The wife of 7 years is no longer good enough, so we need a new one, or at least another mistress. The house needs to be further decorated or at least rewallpapered, or we just buy a bigger one. And we need a summer house. There is no end to the ever-increasing needs for rewards. And all that because of the little dopamine neurons with their positive reward prediction error responses!
Dopamine mechanism of drug addiction

Dopamine neurons are even more devilish than explained so far. They are at the root of addictions to drugs, food, and gambling. We know, for example, that dopamine mechanisms are overstimulated by cocaine, amphetamine, methamphetamine, nicotine, and alcohol. These substances seem to hijack the neuronal systems that have evolved for processing natural rewards. Only this stimulation is not limited by the sensory receptors that process environmental information, because the drugs act directly on the brain via the blood vessels. Also, the drug effects mimic a positive dopamine reward prediction error: they are not compared against a prediction, and thus induce continuing strong dopamine stimulation of postsynaptic receptors, whereas evolving predictions would normally have prevented such stimulation.32 The overstimulation resulting from this unfiltered impact and the continuing positive prediction-error-like effect is difficult for the neurons to handle, as nothing in their evolution has prepared them for it, and some brains cannot cope with the overstimulation and become addicted. We have less information about the mechanisms underlying gambling and food addiction, but we know that food and gambling, with their strong sensory stimulation and prospect of large gains, activate dopamine-rich brain areas in humans and dopamine neurons in animals.
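The computational account cited here (ref 32) can be sketched briefly. The following is an illustrative reading of Redish's TD-based model, not code from either paper: the drug contributes a dopamine surge D that the learned prediction cannot cancel, so the effective "prediction error" never falls to zero and the drug-associated value keeps growing.

```python
# Illustrative sketch after ref 32 (Redish 2004): a TD error with a
# non-compensable drug component. All numbers are arbitrary.

def td_error(r, v_next, v, gamma=0.95):
    return r + gamma * v_next - v            # ordinary prediction error

def drug_td_error(r, v_next, v, D=0.5, gamma=0.95):
    # The pharmacological surge D sets a floor: predictions can reduce
    # the error only down to D, never to zero.
    return max(td_error(r, v_next, v, gamma) + D, D)

v = 0.0
for trial in range(200):
    v += 0.1 * drug_td_error(r=1.0, v_next=0.0, v=v)
print(f"drug-associated value after 200 trials: {v:.1f}")
# With the ordinary td_error, v would converge to r = 1.0 and learning
# would stop; with the drug floor, v keeps climbing without bound.
```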
Non-dopamine reward prediction errors

The dopamine reward prediction error signal is propagated along the widely divergent dopamine axons to the terminal areas in the striatum, frontal cortex, and amygdala, where they innervate basically all (in the striatum) or large proportions of postsynaptic neurons. Thus, the rather homogeneous dopamine reward signal influences heterogeneous postsynaptic neurons and thereby affects diverse postsynaptic functions. However, reward prediction error signals occur also in other reward structures of the brain. Lateral habenula neurons show bidirectional reward prediction error signals that are sign-inverted relative to dopamine responses and may affect dopamine neurons via inhibitory midbrain reticular neurons.33 Select groups of phasically and tonically firing neurons in the striatum and globus pallidus code positive and negative reward prediction errors bidirectionally.34-36 Some neurons in the amygdala display separate, bidirectional error coding for reward and punishment.37 In the cortex, select neurons in the anterior cingulate cortex38,39 and the supplementary eye field40 code reward prediction errors. All of these reward prediction error signals are bidirectional; they show opposite changes to positive versus negative prediction errors. The reward prediction error responses in these subcortical and cortical neurons, with their specific neuronal connections, are unlikely to serve reinforcement processes via divergent anatomical projections; rather, they would affect plasticity at specifically connected postsynaptic neurons.

Conclusions

The discovery that the most powerful and best-characterized reward signal in the brain reflects reward prediction errors rather than the simple occurrence of rewards is very surprising, but it is also indicative of the role rewards play in behavior. Rather than signaling every reward as it appears in the environment, dopamine responses represent the crucial term underlying basic, error-driven learning mechanisms for reward. The existence of the error signal validates error-driven learning rules by demonstrating their implementation in neuronal hardware. The additional characteristic of economic utility coding conforms to the most advanced definition of subjective reward value and suggests a role in economic decision mechanisms. Having a neuronal correlate for a positive reward prediction error in our brain may explain why we strive for ever-greater rewards, a behavior that is surely helpful for surviving competition in evolution, but that also generates frustrations and inequalities that endanger individual well-being and the social fabric.

References

1. Pavlov PI. Conditioned Reflexes. London, UK: Oxford University Press; 1927.
2. Thorndike EL. Animal Intelligence: Experimental Studies. New York, NY: MacMillan; 1911.
3. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: Black AH, Prokasy WF, eds. Classical Conditioning II: Current Research and Theory. New York, NY: Appleton-Century-Crofts; 1972:64-99.
4. Olds J, Milner P. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J Comp Physiol Psychol. 1954;47:419-427.
5. Corbett D, Wise RA. Intracranial self-stimulation in relation to the ascending dopaminergic systems of the midbrain: a moveable microelectrode study. Brain Res. 1980;185:1-15.
6. Schultz W. Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. J Neurophysiol. 1986;56:1439-1462.
7. Schultz W, Apicella P, Ljungberg T. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J Neurosci. 1993;13:900-913.
8. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593-1599.
9. Zaghloul KA, Blanco JA, Weidemann CT, et al. Human substantia nigra neurons encode unexpected financial rewards. Science. 2009;323:1496-1499.
10. Pan W-X, Schmidt R, Wickens JR, Hyland BI. Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. J Neurosci. 2005;25:6235-6242.
11. Cohen JY, Haesler S, Vong L, Lowell BB, Uchida N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature. 2012;482:85-88.
12. Schultz W. Predictive reward signal of dopamine neurons. J Neurophysiol. 1998;80:1-27.
13. Kobayashi S, Schultz W. Influence of reward delays on responses of dopamine neurons. J Neurosci. 2008;28:7837-7846.
14. Rao RPN, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2:79-87.
15. Sutton RS, Barto AG. Toward a modern theory of adaptive networks: expectation and prediction. Psychol Rev. 1981;88:135-170.
16. Enomoto K, Matsumoto N, Nakai S, et al. Dopamine neurons learn to encode the long-term value of multiple future rewards. Proc Natl Acad Sci U S A. 2011;108:15462-15467.
17. Fiorillo CD, Song MR, Yun SR. Multiphasic temporal dynamics in responses of midbrain dopamine neurons to appetitive and aversive stimuli. J Neurosci. 2013;33:4710-4725.
18. Nomoto K, Schultz W, Watanabe T, Sakagami M. Temporally extended dopamine responses to perceptually demanding reward-predictive stimuli. J Neurosci. 2010;30:10692-10702.
19. Redgrave P, Prescott TJ, Gurney K. Is the short-latency dopamine response too short to signal reward? Trends Neurosci. 1999;22:146-151.
20. Mirenowicz J, Schultz W. Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature. 1996;379:449-451.
21. Matsumoto M, Hikosaka O. Two types of dopamine neuron distinctively convey positive and negative motivational signals. Nature. 2009;459:837-841.
22. Kapur S. Psychosis as a state of aberrant salience: a framework linking biology, phenomenology, and pharmacology in schizophrenia. Am J Psychiatry. 2003;160:13-23.
23. Rothschild M, Stiglitz JE. Increasing risk: I. A definition. J Econ Theory. 1970;2:225-243.
24. Lak A, Stauffer WR, Schultz W. Dopamine prediction error responses integrate subjective value from different reward dimensions. Proc Natl Acad Sci U S A. 2014;111:2343-2348.
25. Stauffer WR, Lak A, Schultz W. Dopamine reward prediction error responses reflect marginal utility. Curr Biol. 2014;24:2491-2500.
26. von Neumann J, Morgenstern O. The Theory of Games and Economic Behavior. Princeton, NJ: Princeton University Press; 1944.
27. Machina MJ. Choice under uncertainty: problems solved and unsolved. J Econ Perspect. 1987;1:121-154.
28. Thut G, Schultz W, Roelcke U, Nienhusmeier M, Maguire RP, Leenders KL. Activation of the human brain by monetary reward. NeuroReport. 1997;8:1225-1228.
29. O'Doherty J, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329-337.
30. McClure SM, Berns GS, Montague PR. Temporal prediction errors in a passive learning task activate human striatum. Neuron. 2003;38:339-346.
31. Pessiglione M, Seymour B, Flandin G, Dolan RJ, Frith CD. Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature. 2006;442:1042-1045.
32. Redish AD. Addiction as a computational process gone awry. Science. 2004;306:1944-1947.
33. Matsumoto M, Hikosaka O. Lateral habenula as a source of negative reward signals in dopamine neurons. Nature. 2007;447:1111-1115.
34. Hong S, Hikosaka O. The globus pallidus sends reward-related signals to the lateral habenula. Neuron. 2008;60:720-729.
35. Apicella P, Deffains M, Ravel S, Legallet E. Tonically active neurons in the striatum differentiate between delivery and omission of expected reward in a probabilistic task context. Eur J Neurosci. 2009;30:515-526.
36. Ding L, Gold JI. Caudate encodes multiple computations for perceptual decisions. J Neurosci. 2010;30:15747-15759.
37. Belova MA, Paton JJ, Morrison SE, Salzman CD. Expectation modulates neural responses to pleasant and aversive stimuli in primate amygdala. Neuron. 2007;55:970-984.
38. Seo H, Lee D. Temporal filtering of reward signals in the dorsal anterior cingulate cortex during a mixed-strategy game. J Neurosci. 2007;27:8366-8377.
39. Kennerley SW, Behrens TEJ, Wallis JD. Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nat Neurosci. 2011;14:1581-1589.
40. So N-Y, Stuphorn V. Supplementary eye field encodes reward prediction error. J Neurosci. 2012;32:2950-2963.