S1 Appendix
We will prove that if $p + b \neq 1$, then there is a transition effect on the results of model-based agents. As explained in the Methods, if each initial-state action transitions to a different final state with the same probability, then the probability $P(\text{left} \mid s_i)$ of choosing left at the initial state $s_i$ is given by

$$P(\text{left} \mid s_i) = \frac{1}{1 + \exp[-K(p - b)]} = \operatorname{logit}^{-1} K(p - b), \tag{17}$$

where $K \geq 0$ is a constant that depends on the transition probabilities and the exploration-exploitation parameter.
According to the model-based reinforcement learning rule (Equation 14), if the agent chooses left, then experiences a common transition to pink and receives 1 reward, the stay probability $p_\text{stay}$ (of choosing left again in the next trial) is given by

$$p_\text{stay} = \operatorname{logit}^{-1} K[(1 - \alpha)p + \alpha - b]; \tag{18}$$

if instead the agent experiences a rare transition to blue and receives 1 reward, $p_\text{stay}$ is given by

$$p_\text{stay} = \operatorname{logit}^{-1} K[p - (1 - \alpha)b - \alpha]; \tag{19}$$

if the agent experiences a common transition to pink and receives 0 rewards, $p_\text{stay}$ is given by

$$p_\text{stay} = \operatorname{logit}^{-1} K[(1 - \alpha)p - b]; \tag{20}$$

and if the agent experiences a rare transition to blue and receives 0 rewards, $p_\text{stay}$ is given by

$$p_\text{stay} = \operatorname{logit}^{-1} K[p - (1 - \alpha)b]. \tag{21}$$
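As an illustration, the four cases in Equations 18 to 21 can be evaluated numerically. The sketch below is ours, not part of the original analysis: the function names and the parameter values are hypothetical, and it assumes that $p$ and $b$ are the values the agent assigns to pink and blue and that $\alpha$ is the learning rate, as described in the Methods.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic) function, as used in Equation 17."""
    return 1.0 / (1.0 + math.exp(-x))


def stay_probabilities(p, b, alpha, K):
    """Stay probabilities after choosing left, following Equations 18-21.

    Keys are (transition, reward) pairs describing the previous trial.
    The argument names are illustrative assumptions, not the authors' code.
    """
    return {
        ("common", 1): inv_logit(K * ((1 - alpha) * p + alpha - b)),  # Eq. 18
        ("rare", 1): inv_logit(K * (p - (1 - alpha) * b - alpha)),    # Eq. 19
        ("common", 0): inv_logit(K * ((1 - alpha) * p - b)),          # Eq. 20
        ("rare", 0): inv_logit(K * (p - (1 - alpha) * b)),            # Eq. 21
    }


# Hypothetical parameter values, chosen only so that p + b = 1.
probs = stay_probabilities(p=0.5, b=0.5, alpha=0.5, K=2.0)
for outcome, value in probs.items():
    print(outcome, round(value, 4))
```

With these values $p + b = 1$, and the stay probability after a rewarded common transition equals that after an unrewarded rare transition, so the stay probabilities show no main effect of transition.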
The logistic regression model, on the other hand, determines $p_\text{stay}$ as a function of $x_r$ ($x_r = +1$ for 1 reward, $x_r = -1$ for 0 rewards in the previous trial) and $x_t$ ($x_t = +1$ for a common transition, $x_t = -1$ for a rare transition in the previous trial):

$$p_\text{stay} = \operatorname{logit}^{-1}(\beta_0 + \beta_r x_r + \beta_t x_t + \beta_{r \times t} x_r x_t). \tag{22}$$
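Since $\operatorname{logit}^{-1}$ is a one-to-one function, this implies that

$$\begin{align}
K[(1 - \alpha)p + \alpha - b] &= \beta_0 + \beta_r + \beta_t + \beta_{r \times t}, \tag{23} \\
K[p - (1 - \alpha)b - \alpha] &= \beta_0 + \beta_r - \beta_t - \beta_{r \times t}, \tag{24} \\
K[(1 - \alpha)p - b] &= \beta_0 - \beta_r + \beta_t - \beta_{r \times t}, \tag{25} \\
K[p - (1 - \alpha)b] &= \beta_0 - \beta_r - \beta_t + \beta_{r \times t}. \tag{26}
\end{align}$$

Solving this system for $\beta_0$, $\beta_r$, $\beta_t$, and $\beta_{r \times t}$ yields

$$\begin{align}
\beta_0 &= K \left(1 - \frac{\alpha}{2}\right)(p - b), \tag{27} \\
\beta_r &= 0, \tag{28} \\
\beta_t &= \frac{\alpha}{2} K (1 - p - b), \tag{29} \\
\beta_{r \times t} &= \frac{\alpha K}{2}, \tag{30}
\end{align}$$

which implies that if $\alpha > 0$, $K > 0$ and $p + b \neq 1$, then $\beta_t \neq 0$. This proof assumes that the agent chose left, but the same can be proved if the agent chose right, as in this example “left,” “right,” “pink,” and “blue” are arbitrary.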
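
The solution of the linear system can be checked with exact arithmetic. The following is a minimal sketch under our own assumptions: the function names and test values are hypothetical, and the coefficients are recovered by adding and subtracting the four log-odds expressions with the signs of $x_r$, $x_t$ and $x_r x_t$, then compared with the closed forms for $\beta_0$, $\beta_r$, $\beta_t$ and $\beta_{r \times t}$.

```python
from fractions import Fraction


def regression_coefficients(p, b, alpha, K):
    """Solve the 4x4 system by signed averaging of the four log-odds."""
    # Log-odds of staying for each (transition, reward) outcome.
    y_cr = K * ((1 - alpha) * p + alpha - b)  # common, 1 reward  (x_r=+1, x_t=+1)
    y_rr = K * (p - (1 - alpha) * b - alpha)  # rare, 1 reward    (x_r=+1, x_t=-1)
    y_cn = K * ((1 - alpha) * p - b)          # common, 0 rewards (x_r=-1, x_t=+1)
    y_rn = K * (p - (1 - alpha) * b)          # rare, 0 rewards   (x_r=-1, x_t=-1)
    beta_0 = (y_cr + y_rr + y_cn + y_rn) / 4
    beta_r = (y_cr + y_rr - y_cn - y_rn) / 4
    beta_t = (y_cr - y_rr + y_cn - y_rn) / 4
    beta_rt = (y_cr - y_rr - y_cn + y_rn) / 4
    return beta_0, beta_r, beta_t, beta_rt


def closed_form(p, b, alpha, K):
    """Closed-form coefficients stated in the text."""
    return (
        K * (1 - alpha / 2) * (p - b),
        0,
        alpha / 2 * K * (1 - p - b),
        alpha * K / 2,
    )


# Hypothetical values with p + b != 1, so a transition effect is expected.
args = (Fraction(7, 10), Fraction(1, 2), Fraction(1, 2), Fraction(2))
print(regression_coefficients(*args))
print(regression_coefficients(*args) == closed_form(*args))
```

Using `Fraction` avoids floating-point rounding, so the two results can be compared for exact equality. Setting $p + b = 1$ in the same functions gives $\beta_t = 0$.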