3  Framework

Multi-agent environment interface (MAEi)

Figure 3.1: Multi-Agent Environment Interface (MAEi)

The basis for the learning dynamics is the multi-agent environment interface (MAEi) (), which itself is based, in its most basic form, on the formal framework of stochastic games, also known as Markov games (), consisting of the elements $N, S, A, T, R$.

In an MAEi, $N \in \mathbb{N}$ agents reside in an environment of $Z \in \mathbb{N}$ states $S=(S_1,\dots,S_Z)$. In each state $s$, each agent $i \in \{1,\dots,N\}$ has a maximum of $M \in \mathbb{N}$ available actions $A^i=(A^i_1,\dots,A^i_M)$ to choose from. $A=\prod_i A^i$ is the joint-action set, where $\prod_i$ denotes the Cartesian product over the sets indexed by $i$. Agents choose their actions simultaneously. A joint action is denoted by $a=(a^1,\dots,a^N) \in A$. With $a^{-i}=(a^1,\dots,a^{i-1},a^{i+1},\dots,a^N)$ we denote the joint action except agent $i$'s, and we write the joint action in which agent $i$ chooses $a^i$ and all other agents choose $a^{-i}$ as $a^i a^{-i}$. We chose an equal number of actions for all states and agents out of notational convenience.

The transition function $T: S \times A \times S \rightarrow [0,1]$ determines the probabilistic state change. $T(s,a,s')$ is the transition probability from current state $s$ to next state $s'$ under joint action $a$. Throughout this work, we restrict ourselves to ergodic environments without absorbing states.

The reward function $R: S \times A \times S \rightarrow \mathbb{R}^N$ maps the triple of current state $s$, joint action $a$, and next state $s'$ to an immediate reward scalar for each agent. $R^i(s,a,s')$ is the reward agent $i$ receives. Note that the reward function is often defined as depending only on the current state and joint action, $R^i(s,a)$. Our formulation maps onto this variant by averaging out the transition probabilities towards the next state according to $R^i(s,a)=\sum_{s'} T(s,a,s')\, R^i(s,a,s')$.

In principle, agents could condition their probabilities of choosing an action on the entire history of past play. However, doing so is not only cognitively demanding; it also requires that agents observe all other agents' actions. Therefore, we focus our analysis on simple, so-called Markov strategies, with which agents choose their actions based only on the current state: $X^i: S \times A^i \rightarrow [0,1]$. $X^i(s,a^i)$ is the probability that agent $i$ chooses action $a^i$ given the environment is in state $s$. We denote the joint strategy by $X = X(s,a) = \prod_i X^i(s,a^i): S \times A \rightarrow [0,1]^N$.
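To make this notation concrete, the following minimal sketch (not part of pyCRLD) stores a joint Markov strategy as a numpy array X[i, s, a] with each action row summing to one; later sketches in this chapter reuse this convention.

import numpy as np

# A joint Markov strategy stored as X[i, s, a] = X^i(s, a); each row over actions sums to one.
# Here: a uniform strategy for N agents, Z states, and M actions (a sketch, not pyCRLD's internal format).
N, Z, M = 2, 2, 2
X = np.full((N, Z, M), 1.0 / M)
assert np.allclose(X.sum(axis=-1), 1.0)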

Ecological Tipping Environment

We illustrate an application of the multi-agent environment interface by specifying a concrete environment that allows studying the prospects of collective action under environmental tipping elements ().

Figure 3.2: Ecological Tipping Environment

It is available in the pyCRLD Python package via:

from pyCRLD.Environments.EcologicalPublicGood import EcologicalPublicGood as EPG
env = EPG(N=2, f=1.2, c=5, m=-5, qc=0.2, qr=0.01)

The environmental state set consists of two states, a degraded one, $g$, and a prosperous one, $p$: $S=\{g,p\}$.

env.Sset
['g', 'p']

In each state $s \in S$, each agent $i \in \{1,\dots,N\}$ can choose from their action set between either cooperation or defection, $A^i=\{c,d\}$.

env.Aset
[['c', 'd'], ['c', 'd']]

We denote the number of cooperating (defecting) agents by $N_c$ ($N_d = N - N_c$).

A collapse from the prosperous state to the degraded state occurs with transition probability $T(p,a,g)=\frac{N_d}{N} q_c$, with $q_c \in [0,1]$ being the collapse leverage parameter, indicating how much impact a defecting agent exerts on the environment. Thus, the environment remains within the prosperous state with probability $T(p,a,p)=1-T(p,a,g)$.

In the degraded state, we set the recovery to occur with probability $T(g,a,p)=q_r$, independent of the agents' actions. The parameter $q_r$ sets the recovery probability, and the probability that the environment remains degraded is, thus, $T(g,a,g)=1-T(g,a,p)$.

env.T.round(4)
array([[[[0.99, 0.01],
         [0.99, 0.01]],

        [[0.99, 0.01],
         [0.99, 0.01]]],


       [[[0.  , 1.  ],
         [0.1 , 0.9 ]],

        [[0.1 , 0.9 ],
         [0.2 , 0.8 ]]]])
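As a cross-check, the following minimal numpy sketch (not part of pyCRLD) rebuilds this transition tensor from the two formulas above, assuming the index order [current state, action of agent 1, action of agent 2, next state], with states ordered as in env.Sset and actions as in env.Aset:

import numpy as np
from itertools import product

N, q_c, q_r = 2, 0.2, 0.01
T = np.zeros((2, 2, 2, 2))                # [state, a1, a2, next state]; states: 0='g', 1='p'; actions: 0='c', 1='d'
for a1, a2 in product(range(2), repeat=2):
    N_d = a1 + a2                         # number of defectors
    T[0, a1, a2] = [1 - q_r, q_r]         # degraded 'g': recovery with probability q_r, action-independent
    p_collapse = N_d / N * q_c            # prosperous 'p': collapse with probability (N_d / N) * q_c
    T[1, a1, a2] = [p_collapse, 1 - p_collapse]

print(np.allclose(T, env.T))              # should print True for the env constructed above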

Rewards in the prosperous state follow the standard public good game,

$$R^i(p, a^i a^{-i}, p) = \begin{cases} f c \frac{N_c}{N} - c & \text{if } a^i = c, \\ f c \frac{N_c}{N} & \text{if } a^i = d, \end{cases}$$

where $c$ denotes the cost of cooperation and $f$ the cooperation synergy factor.

env.R[0, 1, :, :, 1]
array([[ 1., -2.],
       [ 3.,  0.]])
env.R[1, 1, :, :, 1]
array([[ 1.,  3.],
       [-2.,  0.]])

However, when a state transition involves the degraded state, $g$, the agents receive an environmental collapse impact, $m<0$,

$$R^i(p,a,g) = R^i(g,a,p) = R^i(g,a,g) = m, \quad \text{for all } a, i.$$
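Analogously, a minimal sketch (not part of pyCRLD) rebuilds the reward tensor for the parameters used above, assuming the index order [agent, state, action of agent 1, action of agent 2, next state] seen in the slices above:

import numpy as np
from itertools import product

N, f, c, m = 2, 1.2, 5, -5
R = np.full((N, 2, 2, 2, 2), float(m))    # [agent, state, a1, a2, next state]; default: collapse impact m
for a1, a2 in product(range(2), repeat=2):
    N_c = (a1 == 0) + (a2 == 0)           # number of cooperators (action 0 = 'c')
    payoff = f * c * N_c / N              # public-good payoff in the prosperous state
    R[0, 1, a1, a2, 1] = payoff - (c if a1 == 0 else 0)   # agent 1 (index 0), p -> p
    R[1, 1, a1, a2, 1] = payoff - (c if a2 == 0 else 0)   # agent 2 (index 1), p -> p

print(np.allclose(R, env.R))              # should print True for the env constructed above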

For illustration purposes, we set the model's parameters as $N=2$, $f=1.2$, $c=5$, $m=-5$, $q_c=0.2$, and $q_r=0.01$:

env = EPG(N=2, f=1.2, c=5, m=-5, qc=0.2, qr=0.01)

Reinforcement learning

Learning helps agents adjust their behavior to changes in their environment, whether these come from other agents or from external factors. This is essential when the future is unpredictable, unknown, and complex, such that detailed pre-planning is doomed to fail.

In particular, reinforcement learning is a trial-and-error method of mapping situations to actions to maximize a numerical reward signal (). When rewards are a delayed consequence of current actions, so-called temporal-difference or reward-prediction learning has been particularly influential (). This type of learning summarizes the difference between value estimates from past and present experiences into a reward-prediction error, which is then used to adapt the current behavior to gain more rewards over time. There also exist remarkable similarities between computational reinforcement learning and the results of neuroscientific experiments (). Dopamine conveys reward-prediction errors to brain structures where learning and decision-making occur (). This dopamine reward-prediction error signal constitutes a potential neuronal substrate for the essential economic decision quantity of utility ().

In the following, we present the essential elements of the reinforcement learning update.

Gain

We assume that at each time step t, each agent i strives to maximize its exponentially discounted sum of future rewards,

$$G^i_t = N^i \sum_{k=0}^{\infty} (\gamma^i)^k\, r^i_{t+k}, \tag{3.1}$$

where $r^i_{t+k}$ is the reward agent $i$ receives at time step $t+k$, and $\gamma^i \in [0,1)$ is the discount factor of agent $i$. The discount factor regulates how much an agent cares for future rewards: $\gamma^i$ close to $1$ means that it cares for the future almost as much as for the present, and $\gamma^i$ close to $0$ means that it cares almost only for immediate rewards. $N^i$ denotes a normalization constant. It is either $1$ or $(1-\gamma^i)$. While machine learning researchers often use $N^i=1$, the pre-factor $N^i=(1-\gamma^i)$ has the advantage of normalizing the gains, $G^i_t$, to be on the same numerical scale as the rewards.
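To see the effect of this normalization, consider a constant reward stream, $r^i_{t+k}=r$ for all $k$; with $N^i=(1-\gamma^i)$ the gain then equals the reward itself,

$$G^i_t = (1-\gamma^i)\sum_{k=0}^{\infty}(\gamma^i)^k\, r = (1-\gamma^i)\,\frac{r}{1-\gamma^i} = r.$$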

Value functions

Given a joint strategy $X$, we define the state values, $V^i_X(s)$, as the expected gain, $G^i_t$, when starting in state $s$ and then following the joint strategy, $X$,

$$V^i_X(s) = \mathbb{E}_X\big[G^i_t \,\big|\, s_t = s\big]. \tag{3.2}$$

Analogously, we define the state-action values, $Q^i_X(s,a)$, as the expected gain, $G^i_t$, when starting in state $s$, executing action $a$, and then following the joint strategy, $X$,

$$Q^i_X(s,a) = \mathbb{E}_X\big[G^i_t \,\big|\, s_t = s, a^i_t = a\big]. \tag{3.3}$$

From (3.1) and (3.2), we can obtain the famous Bellman equation as follows, denoting the next state as $s'$,

$$\begin{aligned}
V^i_X(s) &= \mathbb{E}_X\big[G^i_t \,\big|\, s_t = s\big] \\
&= \mathbb{E}_X\Big[N^i \textstyle\sum_{k=0}^{\infty} (\gamma^i)^k\, r^i_{t+k} \,\Big|\, s_t = s\Big] \\
&= \mathbb{E}_X\Big[N^i r^i_t + N^i \gamma^i \textstyle\sum_{k=0}^{\infty} (\gamma^i)^k\, r^i_{t+1+k} \,\Big|\, s_t = s\Big] \\
&= \mathbb{E}_X\big[N^i r^i_t + \gamma^i V^i_X(s') \,\big|\, s_t = s\big] \\
&= \mathbb{E}_X\big[N^i R^i(s,a,s') + \gamma^i V^i_X(s') \,\big|\, s_t = s\big].
\end{aligned}$$

Analogously, we can write for the state-action values,

$$Q^i_X(s,a) = \mathbb{E}_X\big[N^i R^i(s,a,s') + \gamma^i V^i_X(s') \,\big|\, s_t = s, a^i_t = a\big]. \tag{3.4}$$

Thus, the value function can be expressed via a recursive relationship. The value of a state equals the discounted value of the next state ($\gamma^i V^i_X(s')$) plus the reward the agent receives along the way, properly normalized ($N^i R^i(s,a,s')$). This recursion will come in useful for learning (see below).

Strategy function

In general, reinforcement learning agents do not know the true state and state-action values, $V^i_X(s)$ and $Q^i_X(s,a)$. Instead, they hold variable beliefs about the quality of each available action in each state, $Q^i_t(s,a)$. The higher the value an agent believes an action brings, the more likely it is to choose it. We parameterize the agents' behavior according to the soft-max strategy function,

$$X^i_t(s,a) = \frac{e^{\beta^i Q^i_t(s,a)}}{\sum_b e^{\beta^i Q^i_t(s,b)}}, \tag{3.5}$$

where the intensity-of-choice parameters, $\beta^i \in \mathbb{R}^+$, regulate the exploration-exploitation trade-off. For high $\beta^i$, agents exploit their learned knowledge about the environment, leaning toward actions with high estimated state-action values. For low $\beta^i$, agents are more likely to deviate from these high-value actions to explore the environment further, with the chance of finding actions that eventually lead to even higher values. This soft-max strategy function can be motivated by the maximum-entropy principle (), which states that the current strategy of an agent should follow a distribution that maximizes entropy subject to the current beliefs about the qualities $Q^i_t(s,a)$ (; ).
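A minimal numpy sketch of the soft-max strategy function (3.5), not pyCRLD's implementation, for quality estimates stored as an array Q[s, a]:

import numpy as np

def softmax_strategy(Q, beta):
    """Map quality estimates Q[s, a] to choice probabilities X[s, a] as in (3.5)."""
    logits = beta * (Q - Q.max(axis=-1, keepdims=True))   # shift for numerical stability
    expQ = np.exp(logits)
    return expQ / expQ.sum(axis=-1, keepdims=True)

Subtracting the row-wise maximum leaves the probabilities unchanged because the soft-max is invariant under additive shifts that are constant in the actions.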

Learning

Learning means updating the quality estimates, $Q^i_t(s,a)$, with the current reward-prediction error, $\delta^i_t(s,a)$, after selecting action $a_t$ in state $s_t$ according to

$$Q^i_{t+1}(s_t,a_t) = Q^i_t(s_t,a_t) + \alpha^i \delta^i_t(s_t,a_t), \tag{3.6}$$

where $\alpha^i \in (0,1)$ is the learning rate of agent $i$, which regulates how much new information the agent uses for the update. The reward-prediction error, $\delta^i_t(s_t,a_t)$, equals the difference between the new quality estimate, $N^i r^i_t + \gamma^i Q^i_n(s_{t+1})$, and the current quality estimate, $Q^i_c(s_t)$,

$$\delta^i_t(s_t,a_t) = N^i r^i_t + \gamma^i Q^i_n(s_{t+1},a_{t+1}) - Q^i_c(s_t,a_t), \tag{3.7}$$

where $Q^i_n$ represents the quality estimate of the next state and $Q^i_c$ represents the quality estimate of the current state. Depending on how we choose $Q^i_n$ and $Q^i_c$, we recover various well-known temporal-difference reinforcement learning update schemes ().

Variants

For example, if $Q^i_n = Q^i_c = Q^i_t$, we obtain the so-called SARSA update,

$$\delta^i_t(s_t,a_t) = N^i r^i_t + \gamma^i Q^i_t(s_{t+1},a_{t+1}) - Q^i_t(s_t,a_t).$$

If $Q^i_n = \max_b Q^i_t(s_{t+1},b)$ and $Q^i_c = Q^i_t$, we obtain the famous Q-learning update,

$$\delta^i_t(s_t,a_t) = N^i r^i_t + \gamma^i \max_b Q^i_t(s_{t+1},b) - Q^i_t(s_t,a_t).$$

And if $Q^i_n = Q^i_c = V^i_t$ is a separate state-value estimate, we obtain an actor-critic update,

$$\delta^i_t(s_t,a_t) = N^i r^i_t + \gamma^i V^i_t(s_{t+1}) - V^i_t(s_t).$$
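The three variants differ only in how the reward-prediction error is formed. A minimal sketch (hypothetical helper functions, not part of pyCRLD), with Q[s, a] and V[s] the agent's current estimates, Ni the normalization constant, and gamma the discount factor:

def delta_sarsa(Q, s, a, r, s_next, a_next, Ni, gamma):
    # SARSA: bootstrap with the actually selected next action
    return Ni * r + gamma * Q[s_next, a_next] - Q[s, a]

def delta_q_learning(Q, s, a, r, s_next, Ni, gamma):
    # Q-learning: bootstrap with the greedy next action
    return Ni * r + gamma * Q[s_next].max() - Q[s, a]

def delta_actor_critic(V, s, r, s_next, Ni, gamma):
    # actor-critic: bootstrap with a separate state-value estimate
    return Ni * r + gamma * V[s_next] - V[s]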

Collective Reinforcement Learning Dynamics (CRLD)

Motivation

In the previous section, we saw how to derive temporal-difference reward-prediction reinforcement learning from first principles. Agents strive to improve their discounted sum of future rewards () while acting according to the maximum entropy principle (). However, using these standard reinforcement learning algorithms directly for modeling also comes with some challenges:

  • First of all, the learning is highly stochastic, since, in general, all agents' strategies, $X^i(s,a)$, and the environment's transition function, $T(s,a,s')$, are probability distributions.
  • This stochasticity can sometimes make it hard to explain why a phenomenon occurred in a simulation.
  • Reinforcement learning is also very sample-inefficient, meaning it can take the agents a long time to learn something.
  • Thus, learning simulations are computationally intense, since one requires many simulation runs to make sense of the stochasticity, each of which takes a long time because of the sample inefficiency.

How can we address these challenges? In the previous section, we saw that we could express different reward-prediction learning variants by formulating different reward-prediction errors, $\delta$. The essential idea of the collective reinforcement learning dynamics approach is to replace the individual sample realizations of the reward-prediction error with its strategy average plus a small error term,

$$\delta \rightarrow \bar{\delta} + \epsilon.$$

Thus, collective reinforcement learning dynamics describe how agents with access to (a good approximation of) the strategy-average reward-prediction error would learn. There are at least three interpretations to motivate how the agents can obtain the strategy averages:

  • The agents are batch learners. They store experiences (state observations, rewards, actions, next state observations) inside a memory batch and replay these experiences to make the learning more stable. In the limit of an infinite memory batch, the error term vanishes, $\epsilon \rightarrow 0$ ().
  • The agents learn on two different time scales. On one time scale, the agents interact with the environment, collecting experiences and integrating them to improve their quality estimates while keeping their strategies fixed. On the other time scale, they use the accumulated experiences to adapt their strategy. In the limit of a complete time scale separation, having infinite experiences between two strategy updates, the error term vanishes, $\epsilon \rightarrow 0$ ().
  • The agents have a model of how the environment works, including how the other agents behave currently, but not how the other agents learn. This model can be used to stabilize learning. In the limit of a perfect model (and sufficient cognitive resources), the error term vanishes, $\epsilon \rightarrow 0$.

In the following, we focus on the idealized case of a vanishing error term, $\epsilon \rightarrow 0$.

Derivation

We start by combining (3.5) and (3.6) to obtain the joint strategy update,

$$X^i_{t+1}(s,a) = \frac{X^i_t(s,a)\, \exp\!\big(\alpha^i \beta^i \bar{\delta}^i(s,a)\big)}{\sum_b X^i_t(s,b)\, \exp\!\big(\alpha^i \beta^i \bar{\delta}^i(s,b)\big)}, \tag{3.8}$$

where we have also replaced the sample reward-prediction error, $\delta^i_t(s,a)$, with its strategy average, $\bar{\delta}^i(s,a)$. Thus, in the remainder, we can focus on obtaining the strategy-average reward-prediction error, $\bar{\delta}^i(s,a) = \delta^i_{X_t}(s,a)$. We equip a symbol with a straight bar on top to denote the averaging with the current joint strategy $X_t$. From (3.7), we see that we need to construct the strategy-average reward, the strategy-average value of the next state, and the strategy-average value of the current state.

This suggests summarizing the product of the learning rate $\alpha^i$ and the intensity-of-choice $\beta^i$ into an effective learning rate $\eta^i$. If we restate the denominator as $\bar{Z}^i(s) = \sum_b X^i_t(s,b)\, \exp(\alpha^i \beta^i \bar{\delta}^i(s,b))$, we recover exactly the form used in the main text,

$$X^i_{t+1}(s,a) = \frac{1}{\bar{Z}^i(s)}\, X^i_t(s,a)\, \exp\!\big(\eta^i \bar{\delta}^i(s,a)\big).$$
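A minimal numpy sketch of this update map for a single agent (assuming the strategy-average reward-prediction error has already been computed; not pyCRLD's implementation):

import numpy as np

def strategy_update(X_i, delta_bar_i, eta_i):
    """One CRLD strategy update step for one agent.
    X_i[s, a]: current strategy, delta_bar_i[s, a]: strategy-average reward-prediction error."""
    unnorm = X_i * np.exp(eta_i * delta_bar_i)            # numerator of the update
    return unnorm / unnorm.sum(axis=-1, keepdims=True)    # divide by Zbar^i(s)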

Rewards

The strategy-average version of the current reward is obtained by considering each agent $i$ taking action $a$ in state $s$ when all other agents $j$ act according to their strategy $X^j(s,a^j)$, causing the environment to transition to the next state $s'$ with probability $T(s,a,s')$, during which agent $i$ receives the reward $R^i(s,a,s')$. Mathematically, we write,

$$\bar{R}^i(s,a) = \sum_{a^{-i}} \sum_{s'} \prod_{j \neq i} X^j(s,a^j)\, T(s,a,s')\, R^i(s,a,s').$$
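For the two-agent case, and assuming strategies stored as X[j, s, a_j] (as in the earlier sketch) together with the tensor index orders of env.T and env.R, the strategy-average reward of the first agent can be written as a single einsum (a sketch, not pyCRLD's implementation):

import numpy as np

def avg_reward_first_agent(X, T, R):
    """Strategy-average reward Rbar^1(s, a) in a two-agent environment.
    X[j, s, a_j]: joint strategy, T[s, a1, a2, s']: transitions, R[i, s, a1, a2, s']: rewards."""
    # average over the other agent's action (b) and the next state (t)
    return np.einsum('sb,sabt,sabt->sa', X[1], T, R[0])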

Next values

The strategy average of the following state value is likewise computed by averaging over all actions of the other agents and following states.

We start with the simplest learning variant, actor-critic learning. For each agent $i$, state $s$, and action $a$, all other agents $j \neq i$ choose their action $a^j$ with probability $X^j(s,a^j)$. Consequently, the environment transitions to the next state $s'$ with probability $T(s,a,s')$. At $s'$, the agent estimates the quality of the next state to be $\bar{V}^i(s')$. Mathematically, we write,

$${}^{n}\bar{Q}^i(s,a) = \sum_{a^{-i}} \sum_{s'} \prod_{j \neq i} X^j(s,a^j)\, T(s,a,s')\, \bar{V}^i(s').$$

We obtain the strategy-average value estimate of the following state precisely as the state values of the following state, $\bar{V}^i(s) = V^i_X(s)$, as defined in (3.2). We compute them by writing the Bellman equation, $\bar{V}^i(s) = N^i \bar{R}^i(s) + \gamma^i \sum_{s'} \bar{T}(s,s')\, \bar{V}^i(s')$, in matrix form, $\bar{V}^i = N^i \bar{R}^i + \gamma^i \bar{T} \bar{V}^i$, which allows us to bring all state-value variables to one side and solve through a matrix inversion, $\bar{V}^i = N^i (\mathbb{1}_Z - \gamma^i \bar{T})^{-1} \bar{R}^i$.

Here, $\bar{R}^i(s)$ is the strategy-average reward agent $i$ receives in state $s$. It is computed by averaging over all agents' strategies, $X^j(s,a^j)$, and the state transitions, $T(s,a,s')$,

$$\bar{R}^i(s) = \sum_{a} \sum_{s'} \prod_{j} X^j(s,a^j)\, T(s,a,s')\, R^i(s,a,s').$$

And $\bar{T}(s,s')$ are the strategy-average transition probabilities. They are computed by averaging over all agents' strategies, $X^j(s,a^j)$,

$$\bar{T}(s,s') = \sum_{a} \prod_{j} X^j(s,a^j)\, T(s,a,s').$$

Last, $\mathbb{1}_Z$ is the $Z$-by-$Z$ identity matrix.
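A minimal numpy sketch of this computation (assuming $\bar{R}^i$ and $\bar{T}$ have already been averaged as above; not pyCRLD's implementation):

import numpy as np

def avg_state_values(Rbar, Tbar, gamma, Ni):
    """Solve Vbar = Ni * Rbar + gamma * Tbar @ Vbar for Vbar.
    Rbar[s]: strategy-average rewards per state, Tbar[s, s']: strategy-average transitions."""
    Z = Tbar.shape[0]
    return Ni * np.linalg.solve(np.eye(Z) - gamma * Tbar, Rbar)

Using np.linalg.solve avoids forming the inverse explicitly but is mathematically equivalent to the matrix inversion in the text.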

For SARSA learning, the strategy average of the following state value reads,

$${}^{n}\bar{Q}^i(s,a) = \sum_{a^{-i}} \sum_{s'} \prod_{j \neq i} X^j(s,a^j)\, T(s,a,s') \sum_{a^i} X^i(s',a^i)\, \bar{Q}^i(s',a^i),$$

where we replace $Q^i_t(s_{t+1},a_{t+1})$ by the strategy-average next-state next-action value $\sum_{a^i} X^i(s',a^i)\, \bar{Q}^i(s',a^i)$.

Here, the strategy-average state-action values, $\bar{Q}^i(s,a) = Q^i_X(s,a)$, are exactly the state-action values defined in (3.3). We compute them exactly as (3.4) prescribes,

$$\bar{Q}^i(s,a) = N^i \bar{R}^i(s,a) + \gamma^i \sum_{s'} \bar{T}^i(s,a,s')\, \bar{V}^i(s'),$$

where $\bar{T}^i(s,a,s')$ is the strategy-average transition model from the perspective of agent $i$. It can be computed by averaging out all other agents' strategies from the transition tensor,

$$\bar{T}^i(s,a,s') = \sum_{a^{-i}} \prod_{j \neq i} X^j(s,a^j)\, T(s,a,s').$$
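Continuing the two-agent sketch, $\bar{T}^1$ and $\bar{Q}^1$ follow from the same kind of averaging (again assuming the index conventions introduced earlier; not pyCRLD's implementation):

import numpy as np

def avg_transitions_first_agent(X, T):
    """Strategy-average transition model Tbar^1(s, a, s') of the first agent."""
    return np.einsum('sb,sabt->sat', X[1], T)

def avg_state_action_values_first_agent(Rbar_sa, Tbar_1, Vbar, gamma, Ni):
    """Strategy-average state-action values Qbar^1(s, a) as in the equation above."""
    return Ni * Rbar_sa + gamma * np.einsum('sat,t->sa', Tbar_1, Vbar)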

However, it is easy to show that $\sum_{a^i} X^i(s',a^i)\, \bar{Q}^i(s',a^i) = \bar{V}^i(s')$, and thus, the strategy-average next-state values of SARSA and actor-critic learning are indeed identical.

Current values

The strategy average of the current state value in the reward-prediction error of actor-critic learning, $\bar{V}^i(s)$, is, for each agent $i$ and state $s$, constant in the actions. Thus, it does not affect the joint strategy update (3.8).

The state-action value of the current state, $Q^i_t(s_t,a_t)$, in SARSA learning becomes $\frac{1}{\beta^i} \ln X^i(s,a)$ in the strategy-average reward-prediction error and can be seen as a regularization term. We can derive it by inverting (3.5), $Q^i_t(s,a) = \frac{1}{\beta^i} \ln X^i_t(s,a) + \frac{1}{\beta^i} \ln\!\big(\sum_b e^{\beta^i Q^i_t(s,b)}\big)$, and realizing that the dynamics induced by (3.8) are invariant under additive transformations that are constant in the actions.

Reward-prediction error

Together, the strategy-average reward-prediction error for actor-critic learning reads

$$\bar{\delta}^i(s,a) = N^i \bar{R}^i(s,a) + \gamma^i\, {}^{n}\bar{Q}^i(s,a) = \bar{Q}^i(s,a),$$

and the strategy-average actor-critic learning dynamics, thus,

$$X^i_{t+1}(s,a) = \frac{X^i_t(s,a)\, \exp\!\big(\alpha^i \beta^i \bar{Q}^i(s,a)\big)}{\sum_b X^i_t(s,b)\, \exp\!\big(\alpha^i \beta^i \bar{Q}^i(s,b)\big)}.$$

With $\alpha^i \beta^i \bar{Q}^i(s,a)$ being the fitness of agent $i$'s action $a$ in state $s$, these dynamics are exactly equivalent to the alternative replicator dynamics in discrete time ().

For SARSA learning, the strategy-average reward-prediction error reads

$$\bar{\delta}^i(s,a) = N^i \bar{R}^i(s,a) + \gamma^i\, {}^{n}\bar{Q}^i(s,a) - \frac{1}{\beta^i} \ln X^i(s,a) = \bar{Q}^i(s,a) - \frac{1}{\beta^i} \ln X^i(s,a),$$

and the strategy-average SARSA learning dynamics, thus,

$$X^i_{t+1}(s,a) = \frac{X^i_t(s,a)\, \exp\!\big(\alpha^i \big(\beta^i \bar{Q}^i(s,a) - \ln X^i(s,a)\big)\big)}{\sum_b X^i_t(s,b)\, \exp\!\big(\alpha^i \big(\beta^i \bar{Q}^i(s,b) - \ln X^i(s,b)\big)\big)}.$$
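Putting the pieces together, one discrete-time step of these strategy-average dynamics for a single agent can be sketched as follows (for actor-critic learning $\bar{\delta}^i$ is simply $\bar{Q}^i$; for SARSA the $-\frac{1}{\beta^i}\ln X^i$ term is added; strategies are assumed strictly positive; not pyCRLD's implementation):

import numpy as np

def crld_step(X_i, Qbar_i, alpha, beta, variant='sarsa'):
    """One step of the strategy-average learning dynamics for one agent.
    X_i[s, a]: current strategy, Qbar_i[s, a]: strategy-average state-action values."""
    if variant == 'sarsa':
        delta_bar = Qbar_i - np.log(X_i) / beta   # regularized reward-prediction error
    else:                                         # actor-critic
        delta_bar = Qbar_i
    unnorm = X_i * np.exp(alpha * beta * delta_bar)
    return unnorm / unnorm.sum(axis=-1, keepdims=True)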