The Death of a Beautiful Theory? Dopamine And Reward Prediction Error

Very early in the history of artificial intelligence research, it was apparent that cognitive agents needed to be able to maximize reward by changing their behavior. But this leads to a "credit-assignment" problem: how does the agent know which of its actions led to the reward? An early solution was to select the behavior with the maximal predicted reward, and to later adjust the likelihood of that behavior according to whether it ultimately led to the anticipated reward. These "temporal-difference" errors in reward prediction were first implemented in a 1950s checkers-playing program, before exploding in popularity some 30 years later.

This repopularization seemed to originate from a tantalizing discovery: the brain's most ancient structures were releasing dopamine in exactly the way predicted by temporal-difference learning algorithms. First, dopamine release in the ventral tegmental area (VTA) decreased in response to stimuli that were repeatedly presented without a reward - as though dopamine levels "dipped" to signal the overprediction (and under-delivery) of a reward. Second, dopamine release abruptly spiked in response to stimuli that were suddenly paired with a reward - as though dopamine were signaling the underprediction (and over-delivery) of a reward. Finally, when a previously rewarded stimulus was no longer rewarded, dopamine levels dipped, again suggesting overprediction and under-delivery of reward.
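To make the spike-and-dip pattern concrete, here is a minimal sketch of a temporal-difference learner run through a simple conditioning trial. It is only an illustration under stated assumptions, not the model used in any particular study: each time step after the cue is treated as its own state, the cue's onset is assumed to be unpredictable (so steps before it carry no prediction), and the step counts, learning rate `alpha`, and discount `gamma` are arbitrary choices.

```python
def run_trial(values, cue_step, reward_step, reward, alpha=0.1, gamma=1.0, learn=True):
    """Run one conditioning trial; return the TD error registered at each time step."""
    deltas = [0.0] * len(values)
    for t in range(len(values) - 1):
        # Reward (if any) is delivered on arrival at reward_step.
        r = reward if t + 1 == reward_step else 0.0
        # Assumption: steps before the cue carry no prediction,
        # because the cue's onset cannot be anticipated.
        v_now = values[t] if t >= cue_step else 0.0
        v_next = values[t + 1] if t + 1 >= cue_step else 0.0
        # TD error: what was received plus what is now predicted,
        # minus what was predicted a moment ago.
        delta = r + gamma * v_next - v_now
        deltas[t + 1] = delta
        if learn and t >= cue_step:
            values[t] += alpha * delta
    return deltas


def simulate(n_steps=10, cue_step=2, reward_step=6, n_training_trials=500):
    """Return TD errors for the first trial, a well-trained trial, and an omission trial."""
    values = [0.0] * n_steps
    first = run_trial(values, cue_step, reward_step, reward=1.0)
    for _ in range(n_training_trials):
        trained = run_trial(values, cue_step, reward_step, reward=1.0)
    # Reward withheld after training; learning is switched off to probe the error only.
    omitted = run_trial(values, cue_step, reward_step, reward=0.0, learn=False)
    return first, trained, omitted


first, trained, omitted = simulate()
for label, deltas in [("first", first), ("trained", trained), ("omitted", omitted)]:
    print(label, [round(d, 2) for d in deltas])
```

In this sketch the error is positive at the time of an unexpected reward, migrates to the cue once the reward is predicted, and turns negative at the expected reward time when the reward is withheld.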

Thus, a beautiful computational theory was garnering support from some unusually beautiful data in neuroscience. Dopamine appeared to rise for items that predicted a reward, to drop for items that predicted an absence of reward, and to show no response to neutral stimuli. But as noted by Thomas Huxley, in science "many a beautiful theory has been destroyed by an ugly fact."

These ugly facts are presented in Redgrave and Gurney's new Nature Reviews Neuroscience (NRN) article, which is circulating in the field of computational neuroscience. Among the ugliest:

1) Dopamine spikes in response to novel items which have never been paired with reward, and thus have no predictive value.

2) The latency and duration of dopamine spikes are constant across species, experiments, stimulus modality and stimulus complexity. In contrast, reward prediction should take longer to establish in some situations than others - for example, reward prediction may be slower for more complex stimuli.

3) The dopamine signal actually occurs before animals have even been able to fixate on a stimulus - this calls into question the extent to which this signal is mechanistically capable of the "reward prediction error" function.

4) VTA dopamine neurons fire at the same time as (and possibly even before) object recognition is completed in the inferotemporal cortex, and simultaneously with visual responses in the striatum and subthalamic nucleus. It seems unlikely that the VTA can perform both object recognition and reward prediction error.

5) The most likely visual signal to these VTA neurons may originate from the superior colliculus, a region that is sensitive to spatial changes but not to the kinds of features that would be involved in object processing per se.

6) Many of the experiments showing the apparent dopaminergic-coding of reward prediction error had stimuli that differed not only in reward value but also in spatial location. Therefore, data in support of reward prediction error is confounded with hypotheses involving spatial selectivity.

Redgrave & Gurney suggest that VTA dopamine neurons fire too quickly and with too little detailed visual input to actually accomplish the calculation of errors in reward prediction. They advocate an alternative theory in which temporal prediction is still key, but instead of encoding reward prediction, dopamine neurons are actually signalling the "reinforcement of actions/movements that immediately precede a biologically salient event."

To understand this claim, consider Redgrave & Gurney's point that "most temporally unexpected transient events in nature are also spatially unpredictable." The theory is basically that a system notes its own uncertainty, via the spatial reorientation in the superior colliculus, and attempts to reduce that uncertainty by pairing a running record of previous movements with the unexpected event.
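One way to picture "pairing a running record of previous movements with the unexpected event" is a short, decaying trace of recent actions that gets credited whenever a salient event arrives. The sketch below is only one possible reading, not Redgrave & Gurney's model: the class name `ActionTrace`, the decay factor, the trace length, and the action labels are all hypothetical.

```python
from collections import deque


class ActionTrace:
    """Keeps a short running record of recent actions and credits them
    when an unexpected salient event occurs (hypothetical illustration)."""

    def __init__(self, decay=0.5, max_len=5):
        self.decay = decay                   # how quickly credit fades with recency
        self.recent = deque(maxlen=max_len)  # running record of the latest actions
        self.weights = {}                    # accumulated reinforcement per action

    def act(self, action):
        self.recent.append(action)

    def salient_event(self, strength=1.0):
        # The most recent action gets the most credit; earlier ones get less.
        credit = strength
        for action in reversed(self.recent):
            self.weights[action] = self.weights.get(action, 0.0) + credit
            credit *= self.decay


trace = ActionTrace()
for action in ["groom", "sniff", "press_lever"]:
    trace.act(action)
trace.salient_event()  # e.g. an unexpected flash of light
print(trace.weights)
```

Note that nothing here evaluates whether the event was rewarding; the event only needs to be detected, which is the property that would let such a signal operate at very short latencies.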

Although this alternative theory is intriguing, there is not an abundance of evidence supporting it: it seems to me more like a pastiche of fragments from the apparently broken "reward prediction error" hypothesis.

We should also be cautious in discarding any theory as powerful as the reward prediction error hypothesis on the basis of null evidence: in this case, we simply don't know how reward prediction error could be calculated so quickly. This kind of theoretical arrogance ("we don't know how it could be calculated, so it isn't calculated") is particularly dangerous in computational neuroscience - the whole point of which is to identify possible mechanisms of neural information processing.

Of course, this article may ultimately be seen as the obituary of yet another beautiful theory killed by science. What's your prediction?

(1) the brain's most ancient structures release dopamine when the subject gets published in peer-reviewed journal;

(2) the brain's most ancient structures release dopamine when the subject gets published in conference proceedings, such as IJCAI and other AI venues, even if nothing journal-quality is present; this hijacks the brain to do conferences instead of peer reviewed journals;

(3) the brain's most ancient structures release dopamine when the subject gets quoted in press releases associated with raising Venture Capital for software start-ups; this hijacks the brain to forget about conferences and focus on the rise of Expert Systems instead of AI, and then on the dotcom boom, which busted, and now is rising again as Web 2.0 or whatever;

(4) AI true believers never understood microneuroanatomy and neurochemistry anyway;

(5) John McCarthy, who coined the very term "Artificial Intelligence," has told me that it was a very misleading term, which he now regrets;

(6) For a while Cybernetics was hot, and had neuro people working with computer people and Math people, but that Mexican Institute of Cardiology money stopped flowing;

(7) AI conferences were real fun when I went to them in the 1960s and 1970s; then Artificial Life conferences were fun in the same interdisciplinary way; now Complex Systems conferences are fun. Each has its own jargon, paradigm, predictions, and tenuous linkage between brain and computer models.

Are these reported results to be considered mere "anomalies" or the crisis of a dying paradigm, in Thomas Kuhn's terms?

"1) Dopamine spikes in response to novel items which have never been paired with reward, and thus have no predictive value."

This argument seems weak to me. Novel events could be interpreted as rewards, just like any other motivationally relevant event. The only difference is that a novelty reward might be more complicated to compute than, say, a food reward. Specifically, a novelty detection system would probably be based on prediction errors (e.g., the difference between the prior and posterior, in terms of Bayesian inference).

Under the dopamine prediction error theory, unexpected events would generate "dopamine spikes in response to novel items which have never been paired with reward." The predictive value of this signal lies in its ability to predict future information gain (i.e. reduction in prediction errors over time), a very valuable commodity to any intelligent agent.
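As a rough sketch of how a novelty reward could be computed from the difference between prior and posterior, the snippet below scores each observation by how far it shifts the agent's predictive distribution over stimuli. This is a simplification under stated assumptions: it uses the KL divergence between predictive distributions built from simple counts, the stimulus names and counts are hypothetical, and the weighting `beta` of novelty against extrinsic reward is an arbitrary choice.

```python
import math


def predictive(counts):
    """Turn observation counts (including pseudo-counts) into probabilities."""
    total = sum(counts.values())
    return {stimulus: n / total for stimulus, n in counts.items()}


def bayesian_surprise(counts, observed):
    """KL divergence of the posterior predictive from the prior predictive
    after seeing `observed` once. Larger values mean a bigger belief update."""
    prior = predictive(counts)
    updated = dict(counts)
    updated[observed] += 1
    posterior = predictive(updated)
    return sum(posterior[s] * math.log(posterior[s] / prior[s]) for s in counts)


def total_reward(extrinsic, counts, observed, beta=1.0):
    """Extrinsic reward plus a novelty bonus (assumed additive combination)."""
    return extrinsic + beta * bayesian_surprise(counts, observed)


# Hypothetical history: two familiar stimuli and one never seen before
# (each entry includes a pseudo-count of 1 so no probability is zero).
counts = {"tone": 50, "light": 50, "flash": 1}
print("familiar:", bayesian_surprise(counts, "tone"))
print("novel:   ", bayesian_surprise(counts, "flash"))
```

Under these assumptions a never-before-seen stimulus yields a larger signal than a familiar one even when the extrinsic reward is zero, which is the pattern described in point 1.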

Hey Tyler - I agree completely, that point is not very persuasive. I take it you're not likely to abandon the reward prediction error hypothesis so easily? :)

I do wonder to what extent the alternative is functionally interchangeable and/or mutually exclusive with the RPE hypothesis.

By Chris Chatham on 11 Apr 2007

Reward Prediction Error:

I believe that it's safe to say that reward prediction error is part of a "monitoring" process that goes on during decision-making. For example, in the Iowa Gambling Task (IGT), a person deciding which deck to choose from is constantly monitoring (from past trials) which deck provides the most beneficial and the most disadvantageous outcomes. That individual has to predict whether he or she will get a good card, and reward prediction error is likely to occur because the outcome is a matter of chance and probability.

I did an essay on decision-making and the ventromedial prefrontal cortex (VMPFC) and came to some conclusions.

1) I believe that one cannot categorize any process in the brain as an "event," as in a computational model where everything happens as a binary "go or no-go" kind of thing. Instead, one would have to see it as a process that involves not one but several "go or no-go" events. This is what lets it qualify as a process and not an event. So, I think that point number (1) is not very applicable to the nature of dopaminergic spikes, because it takes longer than a single event to predict a reward.

2) I came across this finding. Quoting from my essay: "Perhaps one alternative explanation that has been sidelined is the neurotransmitter dopamine. By running PET imaging techniques on methamphetamine and cocaine addicted patients, Volkow et al. (1993) have demonstrated that there is a low number of a certain type of dopamine receptor within the regions of the OFC and ACd. The low number of dopamine receptors within the region has been shown to be related to the loss of control over addiction (Adinoff, 2004)." This was actually about a finding by Adinoff (2004) that lower levels of D4 in the PFC area caused impairments to the decision-making abilities of drug addicts. This caused me to think that dopamine might be more than just something stimulated by reward. Rather, dopamine could be seen as a sort of lubricant to allow "smoother activation" of the PFC for decision-making functions.

In conclusion:
1) There may be other functions that spike dopamine levels as well, which I believe could account for arguments (1), (3), (4) and (6).

2) Maybe we're just looking at the wrong parts of the brain to determine dopamine's role in reward prediction error?

"I take it you're not likely to abandon the reward prediction error hypothesis so easily?"

I'm mainly interested in modifying/extending parts of the theory that are broken/missing. The core component, temporal difference learning, seems like a good initial model of dopamine activity. But it doesn't account for all the data. (As discussed above, in its simplest form, TD learning doesn't explain dopamine activity during novel events, which requires a new interpretation of novel events as curiosity/novelty rewards.)

"dopamine could be seen as a sort of lubricant to allow 'smoother activation' of the PFC for decision-making functions"

I agree. If we interpret dopamine activity as a general signal of unexpected, motivationally-relevant information, it fits nicely into the role of gating salient information into working memory in the PFC.

Agree on that first point; novel stimuli are intrinsically motivating, so the absence of a reaction would have been more surprising.

Also, the superior colliculus is too early for object detection and segmentation, but is known to drive early attention based on the appearance of simple features, and again it would have been more surprising had there been no motivation effect.

Hmm... isn't the idea of "staying alive for the next 5 minutes" a reward in itself? If I were a computer program, then I would probably be in suspended animation until the next "problem to be solved" is encountered, and no "spikes" of learning activity would be expected. But since we are all "continuous" beings, each and every moment of our lifespan is unique, i.e., never encountered before. Therefore every moment presents a unique situation for survival, and thus an opportunity for learning, no matter how small. I don't see how seemingly non-synchronized dopamine firings debunk its credit-assignment worthiness.

It could also be that the dopamine is suggestive of novelty to help with conflict detection.

We know that prefrontal neurons in the rat and monkey respond selectively to general task attributes -- sort of like strategy. (They can also respond selectively to reward-cue pairings depending on the area recorded and the nature of the task.)

Perhaps the activity in the dopamine neurons doesn't signal reward, but rather something unexpected, because unexpected things usually indicate that a modification of strategy is necessary to maximize rewards.

In this sense, dopamine is permissive of changes in activity to increase reward, not indicative of the reward itself.

It would solve the issue of timing because novelty detection doesn't require object identification nearly to the degree that reward evaluation would.

Hmm... maybe we can't look at dopamine levels in general. Maybe there's a clue in the "continuous" spikes in dopamine, because it's actually the different types of dopamine (i.e. D1, D2, D3, etc.) spiking at different times and making it seem as though it's dopamine in general spiking.

Maybe by determining which type of dopamine is involved (because each type is more concentrated in a specific area of the brain), we can narrow down the area in which RPE actually occurs?

There are now over 2860 peer-reviewed PubMed papers on the DRD2 gene alone since the 1990 JAMA paper by Blum et al.; compared to that, no evidence seems very weak. But as the original discoverer, along with Ernest Noble, of the association of the DRD2 A1 allele and severe alcoholism, I am always open to new ideas, even if we are proven wrong.

Kenneth Blum, PhD.

By Kenneth Blum on 16 Apr 2010