2  Literature background

Complex Systems Science (CSS)

Complex systems are generally out of equilibrium, with many interacting components, feedback, and couplings between components and levels (e.g., Levin, 2002). Emergent, collective behavior at the macro level is surprising and hard to predict. An extensive repertoire of strategies and interaction types at the component level makes for multiple emergent patterns and functions at the macroscale; the state of the art largely treats each pattern independently. Consequently, there is little understanding of the overall degree of collectivity (Daniels et al., 2021). Power-law and heavy-tailed distributions can lead to consequential ‘black swan’ and second- and third-order effects (De Marzo et al., 2022). Some complex systems sit near a critical point at which small perturbations can cause a phase transition or reconfiguration because of long-range correlations (Mora & Bialek, 2011), including in finite, relatively small systems such as many animal societies and human groups (Daniels et al., 2017).

CSS methods are diverse. Those most relevant to understanding the emergence of cooperative collectives include:

  • nonlinear dynamics, to study temporal oscillations and couplings in time, such as how individuals synchronize their activities (Sarfati et al., 2021);
  • stochastic differential equations, to study, for example, how noise influences transitions between disordered and ordered states (Jhawar et al., 2020);
  • approaches from statistical mechanics, to study collective behavior in space, such as how swarms (Buhl et al., 2006) and flocks (Bialek et al., 2012) choose trajectories;
  • game theory (Hofbauer & Sigmund, 1998; Nowak, 2006);
  • network theory (Newman, 2003), to quantify interaction structure and, for example, how individuals make decisions under the influence of others in uncertain environments (Kleshnina et al., 2023);
  • cellular automata (Wolfram, 1994), to gain insight from toy models into the relationship between rule complexity and pattern formation (see the sketch following this list);
  • agent-based modeling (Epstein & Axtell, 1996);
  • the physics of information, to identify the mechanisms supporting information processing and quantify their efficiency, with the goal of understanding how energy and information processing interact to shape collective effects (Kempes et al., 2017); and
  • information theory, to identify and quantify the contribution of higher-order interactions to macroscale effects (Rosas et al., 2019; Tekin et al., 2017, 2018), to quantify the overall degree of collectivity (Daniels et al., 2016), and to build unifying frameworks, leveraging ideas from predictive coding (Darriba & Waszak, 2018; Friston, 2018; Rao & Ballard, 1999), active inference and the free-energy principle (Buckley et al., 2017), and the information theory of individuality (D. Krakauer et al., 2020), for formalizing the role of uncertainty reduction (also called surprise minimization (Heins et al., 2023)) in micro-macro relationships and in entity formation and evolution.
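To make the cellular-automaton bullet concrete, here is a minimal sketch of an elementary cellular automaton in Python; the choice of rule 110 (a famously simple rule that generates complex, computation-capable patterns) and the grid size are arbitrary illustrative choices:

```python
import numpy as np

def ca_step(state, rule):
    """One update of an elementary cellular automaton. `rule` is an
    integer 0-255 whose i-th bit gives the output for the neighborhood
    whose (left, center, right) cells encode the number i in binary."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    idx = 4 * left + 2 * state + right      # neighborhood as 0..7
    table = (rule >> np.arange(8)) & 1      # the rule's lookup table
    return table[idx]

state = np.zeros(101, dtype=int)
state[50] = 1                               # a single seed cell
for _ in range(50):
    print("".join(" #"[c] for c in state))
    state = ca_step(state, rule=110)
```

Despite the eight-entry lookup table, the printed space-time diagram exhibits interacting localized structures, illustrating how little rule complexity is needed for rich pattern formation.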

We expect information-theoretic approaches emphasizing uncertainty reduction to be particularly productive for informing the development of a ‘strategic statistical mechanics’. Such a framework would combine powerful probabilistic approaches from statistical physics and information theory for deriving micro-macro maps with logical principles from theoretical computer science and the study of inference, in order to capture the robust and optimal design of how strategies interact in social circuits to support cooperation at scale. For example, a recent paper on surprise minimization (Heins et al., 2023) makes substantial progress in this direction. The authors develop a modeling framework to capture spatial collective behavior with inference-capable agents. The agents can estimate hidden causes of their sensations and adapt their position in space to minimize surprise (prediction error). The authors then study the relationship between individual inference and the emergence of collective states like cohesion and milling. Next steps include 1) exploring the pros and cons of active inference vs. MARL for encoding cognition into agents, 2) combining this approach with inductive game theory, in which probabilistic strategies are empirically grounded and extracted from time series (DeDeo et al., 2010; D. C. Krakauer et al., 2010), and 3) studying how collective states and their transitions are computed from the social circuits (Brush et al., 2018; DeDeo et al., 2010; Ramos-Fernandez et al., 2020) that form as strategies are updated and consolidate into slow variables, which reduce social uncertainty and permit accelerated rates of adaptation (Flack, 2017). With these extensions, it should be possible to begin to deduce the organizational and algorithmic principles underlying the emergence of micro-macro maps in collective information processing systems through feed-forward effects and downward causation. These principles will likely inform the conditions under which collective, cooperative intelligence emerges at scale.
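As a deliberately stripped-down caricature of the surprise-minimization mechanism (not the Heins et al. (2023) model, which uses full active-inference machinery), the sketch below has each agent treat the group centroid as the hidden cause of its sensations and descend the gradient of its squared prediction error; cohesion then emerges without any explicit attraction rule. All numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(-5, 5, size=(20, 2))   # 20 agents on a plane

eta = 0.1                                # gradient step size (arbitrary)
for t in range(200):
    centroid = pos.mean(axis=0)          # each agent's estimate of the
                                         # hidden cause of its sensations
    error = pos - centroid               # prediction error per agent
    # Descend the gradient of squared prediction error ('surprise'),
    # perturbed by sensory noise.
    pos -= eta * error + 0.05 * rng.normal(size=pos.shape)

print("mean distance to centroid:",
      np.linalg.norm(pos - pos.mean(axis=0), axis=1).mean())
```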

It is worth noting that in the models developed in the surprise-minimization, inductive game theory, and some of the collective computation work described above, individuals are tracked, making these models, in a sense, agent-based models. However, these approaches distinguish themselves by incorporating agents that model the world in a restricted but cognitively principled manner, by parameterizing the models using probabilistic strategies obtained directly from data, and by leveraging the rigor of powerful probabilistic modeling frameworks from the statistical physics and dynamical systems traditions. In the more conventional agent-based modeling community, there have also been attempts to develop a more rigorous axiomatic approach, e.g., based on symmetries and bifurcation theory (Franci et al., 2022; Park et al., 2021) or on game theory and control (Marden & Shamma, 2018).

Work within the game theory and cultural evolution CSS sub-communities has made strides in understanding the social and cultural dynamics resulting from interacting boundedly rational agents with a finite computational budget. This work focuses on social and cultural learning mechanisms that allow agents to improve their behavior over time (Arthur, 1994, 2014; Holland & Miller, 1991). Game theoretic models in this tradition aim to explain emerging cooperation from simple yet plausible mechanisms. For example, the famous strategy tit-for-tat, which merely reciprocates what the opponent did in the previous turn, is surprisingly successful against much more complicated strategies (Axelrod & Hamilton, 1981). Its success can be attributed to its ability to control payoffs, ensuring that it receives the same score as the opponent, regardless of the complexity of the opponent’s strategy. The “zero-determinant” strategies later discovered by Press & Dyson (2012) provide a vast generalization of this phenomenon, allowing for extortion and generosity, in addition to more equitable relationships like that of tit-for-tat. These strategies have also encouraged a more geometric view of behavior in CSS (Hilbe, Chatterjee, et al., 2018), moving away from purely mechanistic descriptions.
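Tit-for-tat’s payoff-control property is easy to check numerically. Below is a minimal sketch with the conventional payoff values and an arbitrary random opponent; the cumulative score difference is bounded by \(T-S\) regardless of what the opponent does, so the average payoffs equalize over long games:

```python
import random

T, R, P, S = 5, 3, 1, 0                      # conventional PD payoffs

def payoff(a, b):
    return {("C", "C"): (R, R), ("C", "D"): (S, T),
            ("D", "C"): (T, S), ("D", "D"): (P, P)}[(a, b)]

def play_tft(opponent, rounds=10_000):
    """Tit-for-tat against an arbitrary opponent strategy."""
    my_move, scores = "C", [0, 0]
    for _ in range(rounds):
        their_move = opponent()
        p1, p2 = payoff(my_move, their_move)
        scores[0] += p1
        scores[1] += p2
        my_move = their_move                 # reciprocate next round
    return scores

random.seed(1)
tft_score, opp_score = play_tft(lambda: random.choice("CD"))
print(tft_score, opp_score, "difference:", opp_score - tft_score)
```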

An unsatisfactory facet of many of these game theoretic and cultural and social evolution models is that “cooperation” is based on an atomic action with the property that more cooperation translates to better social welfare. This interchangeability is likely part of the reason for the widespread focus on mechanisms for increasing the level of cooperation within a system. However, even for the most basic model of a conflict of interest, the repeated Prisoner’s Dilemma, it can be the case that high levels of “cooperation” are suboptimal for individuals and the collective, e.g., when agents are better off alternating “cooperation” and “defection” over time (relative to always cooperating) (McAvoy et al., 2022). Along these lines, nonlinearities produce counter-intuitive or hard-to-predict dynamics, meaning that it is essential to consider not only the level of cooperation but also the specific collective states or social outcomes that emerge from alternative strategic configurations and game structures. With their simplifying assumptions, these approaches are suitable for gaining insight into null expectations for baseline conditions but are more limited in utility when tackling cooperation at scale in complex environments composed of cognitively complex, error-prone agents (McNamara, 2013).

Humans routinely deviate from the behavior predicted by the economic model of Homo economicus (Camerer, 2011). Yet, they are also more sophisticated than assumed in many simple evolutionary game theory models. They are capable of foresight, have a theory of mind, make inferences about their environment, and can adapt their behavior correspondingly. For example, in the most common evolutionary game theory models, individuals from a large population are randomly matched with other population members to play a static game. Those individuals who are more successful (because they employ better strategies) are more likely to be imitated. Imitation-based models are most appropriate when interactions are symmetric in the sense that individuals coincide in their feasible actions and payoffs. However, the paradigm is more challenging to motivate among heterogeneous and diverse actors and when behaviors cannot be observed directly and must be inferred before they are imitated. Moreover, extending this paradigm to account for other forms of cognition that intelligent individuals typically employ when revising their strategies is not straightforward. Finally, the act of imitation itself may be learned and, therefore, subject to cognitive constraints and learning dynamics (Team et al., 2022).
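For concreteness, here is a minimal sketch of this imitation paradigm using the standard pairwise-comparison (Fermi) update rule; the population size, selection strength, and payoff values are arbitrary choices, and in a prisoner’s dilemma the process drives cooperation extinct, as expected:

```python
import numpy as np

rng = np.random.default_rng(2)
T, R, P, S = 5, 3, 1, 0
N, beta = 100, 1.0                   # population size, selection strength
strategy = rng.integers(0, 2, N)     # 0 = defect, 1 = cooperate

def avg_payoff(i):
    """Expected payoff of i against a randomly matched co-player."""
    coop = np.delete(strategy, i).mean()    # cooperator share among others
    if strategy[i]:
        return R * coop + S * (1 - coop)
    return T * coop + P * (1 - coop)

for _ in range(5_000):
    i, j = rng.choice(N, size=2, replace=False)
    # Fermi rule: i imitates j with a probability that increases with
    # their payoff difference.
    p = 1 / (1 + np.exp(-beta * (avg_payoff(j) - avg_payoff(i))))
    if rng.random() < p:
        strategy[i] = strategy[j]

print("final cooperator fraction:", strategy.mean())
```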

There is a clear interest in adapting game theoretic and cultural evolution models to accommodate these nuances (Hauser et al., 2019; Hilbe, Šimsa, et al., 2018; McNamara et al., 2021; X. Wang & Fu, 2020). As with the uncertainty reduction and collective computation approaches discussed above, considering how MARL could inform such models has great potential to unleash novel ways of modeling complex systems to tackle the challenges of collective cooperation in more complex settings.

Multi-Agent Reinforcement Learning (MARL)

In a typical MARL setting, each agent observes (part of) the current state of the environment and then takes an action, after which it observes (part of) the new state of the environment and receives a reward indicating how desirable the previous “state-action-state” transition was. Over time, the agents update their strategies (mappings from observation histories to probability distributions over their action spaces) to increase the long-term amount of reward they receive (Busoniu et al., 2008). In this work, we employ a broad definition of reinforcement learning (RL), including various individual-based update mechanisms. However, we exclude strategy update processes based on social reward comparisons, such as typical evolution models and explicit social learning. Eventually, we are interested in how processes such as social learning, opinion formation, and collective action can emerge from individual learning agents.
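This loop can be made concrete with a generic tabular Q-learner, sketched below; the hyperparameter values are placeholders, and deep MARL replaces the table with a neural network. In a multi-agent loop, one such learner per agent updates from its own (observation, action, reward, next observation) tuples.

```python
import numpy as np

class QAgent:
    """Tabular Q-learner: maps observations to action values and updates
    them from observed state-action-reward-state transitions."""
    def __init__(self, n_obs, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.q = np.zeros((n_obs, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.rng = np.random.default_rng()

    def act(self, obs):
        if self.rng.random() < self.eps:            # explore
            return int(self.rng.integers(self.q.shape[1]))
        return int(self.q[obs].argmax())            # exploit

    def update(self, obs, action, reward, next_obs):
        # Temporal-difference update: move the estimate toward the
        # reward-prediction target.
        target = reward + self.gamma * self.q[next_obs].max()
        self.q[obs, action] += self.alpha * (target - self.q[obs, action])

agent = QAgent(n_obs=4, n_actions=2)
a = agent.act(0)
agent.update(0, a, reward=1.0, next_obs=1)
```

The `target - q` term in `update` is the reward-prediction error underlying the temporal-difference ideas discussed next.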

Modern MARL is inspired by work in several fields, including neuroscience, psychology, economics, and machine learning (Bush & Mosteller, 1951; Cross, 1973; Dayan & Niv, 2008; Erev & Roth, 1998; Fudenberg & Levine, 1998; Roth & Erev, 1995; Sutton & Barto, 2018). For example, the commonly used idea of temporal-difference learning is based upon reward-prediction errors, common to humans, other animals, and machines (Botvinick et al., 2020; Gunawardena, 2022; Schultz et al., 1997). In recent years, these traditional ideas have been combined with advances in machine learning – in particular, deep learning – to produce spectacular successes in various domains (Berner et al., 2019; Silver et al., 2016; Vinyals et al., 2019).

Studies of cooperation in MARL fall under the umbrella of Cooperative AI (Dafoe et al., 2021). They can be divided based on whether the underlying game is fully cooperative (i.e., where all agents share the same goal) or mixed-sum (as opposed to zero-sum, which describes fully competitive situations). MARL as a field does not have a unique goal (Shoham et al., 2007). For example, some works aim to obtain game-theoretic equilibria via MARL, while others ask which learning rules are in equilibrium with one another in a specific environment. Despite this variety, the overarching aim of Cooperative AI is to improve the cooperative capabilities of AI systems, increasing joint welfare by prescribing how agents should (learn to) act. Such learning algorithms should ideally generalize to novel situations and scale to high-dimensional environments. A vital advantage of the MARL paradigm is that it can easily accommodate heterogeneous actors. Extending machine learning interpretability techniques to MARL is an ongoing effort to advance the understanding of MARL systems (Grupen et al., 2022; Lovering et al., 2022; McGrath et al., 2022).

Methodologically, the focus often lies on designing novel algorithmic features to improve the cooperativeness of RL algorithms in large-scale environments. For example, algorithms may be equipped with abilities such as sending each other messages (Foerster et al., 2016), making commitments (Christoffersen et al., 2022; Hughes et al., 2020), or transferring rewards to others (Lupu & Precup, 2020; W. Z. Wang et al., 2021). Algorithms are evaluated for their ability to produce agents and multi-agent systems that can generalize, i.e., perform well under conditions they never saw during training, such as situations where they must interact with unfamiliar AI social partners (Leibo et al., 2021; Stone et al., 2010) or humans (Carroll et al., 2019; (FAIR)† et al., 2022; Strouse et al., 2021). Measuring generalization against a fixed set of test scenarios allows researchers to compare the performance of MARL algorithms despite incompatibilities in their training. In contrast to CSS studies, cooperation is typically not an available action to choose from. Instead, a cooperative strategy must be learned from scratch (Leibo et al., 2017), and performance is measured by total social welfare.

However, on their own, MARL simulation studies are of limited use for obtaining analytically reliable insights into how collective cooperation emerges from complex human and machine behavior in dynamic environments. They often require significant computational resources, while the space to be explored suffers from the curse of dimensionality. Moreover, they are typically highly stochastic, and their results can be difficult to interpret (Hernandez-Leal et al., 2019). We believe that a unified framework combining methods from CSS and MARL could fill this gap.

Exemplary works on the learning dynamics of cooperation

The study of cooperation has not been at the center of Collective Reinforcement Learning Dynamics (CRLD) studies. Here we list some notable examples from mathematical biology and sociology.

  • L. Panait, K. Tuyls, S. Luke, Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. J. Mach. Learn. Res. 9, 423–457 (2008).
  • S. S. Izquierdo, L. R. Izquierdo, N. M. Gotts, Reinforcement learning dynamics in social dilemmas. J. Artif. Soc. Soc. Simul. 11, 1 (2008).
  • M. Wunder, M. L. Littman, M. Babes, “Classes of multiagent Q-learning dynamics with epsilon-greedy exploration” in ICML (2010).
  • N. Masuda, M. Nakamura, Numerical analysis of a reinforcement learning model with the dynamic aspiration level in the iterated Prisoner’s dilemma. J. Theor. Biol. 278, 55–62 (2011).
  • S. Tanabe, N. Masuda, Evolution of cooperation facilitated by reinforcement learning with adaptive aspiration levels. J. Theor. Biol. 293, 151–160 (2012).
  • T. Ezaki, Y. Horita, M. Takezawa, N. Masuda, Reinforcement learning explains conditional cooperation and its moody cousin. PLoS Comput. Biol. 12, e1005034 (2016).
  • S. Dridi, E. Akçay, Learning to cooperate: The evolution of social rewards in repeated interactions. Am. Nat. 191, 58–73 (2018).
  • O. Leimar, J. M. McNamara, Learning leads to bounded rationality and the evolution of cognitive bias in public goods games. Sci. Rep. 9, 16319 (2019).
  • W. Barfuss, J. F. Donges, V. V. Vasconcelos, J. Kurths, S. A. Levin, Caring for the future can turn tragedy into comedy for long-term collective action under risk of collapse. Proc. Natl. Acad. Sci. U.S.A. 117, 12915–12922 (2020).
  • W. Z. Wang et al., “Emergent prosociality in multi-agent games through gifting” in Twenty-Ninth International Joint Conference on Artificial Intelligence (2021), vol. 1, pp. 434–442.
  • L. Wang et al., Lévy noise promotes cooperation in the Prisoner’s dilemma game with reinforcement learning. Nonlinear Dyn. 108, 1837–1845 (2022).
  • W. Barfuss, J. M. Meylahn, Intrinsic fluctuations of reinforcement learning promote cooperation. Sci. Rep. 13, 1309 (2023).

On cooperation and social dilemmas

In CSS, cooperation is frequently defined mechanistically. A cooperative act might involve colluding with a co-conspirator to remain quiet under interrogation (Poundstone, 2011), paying a cost to provide a benefit to another (e.g., measured in currency, time, or reproductive success) (Sigmund, 2010), or provisioning a public good or resource (Fehr & Gächter, 2000; Ostrom et al., 1992). Game theory allows such behavior to be modeled using abstract payoffs. Dawes (1980) summarizes a social dilemma among \(N\) agents, each with two atomic actions, \(C\) (‘cooperate’) or \(D\) (‘defect’), as follows:

  • (i) the payoff when all cooperate exceeds that when all defect and
  • (ii) regardless of the composition of the group, a cooperator can always improve their own payoff by switching to defection.

A simple example is the prisoner’s dilemma, which takes place in a collective of \(N=2\) agents. With payoffs defined by the matrix \[ \begin{array}{c|cc} \text{} & \text{C} & \text{D} \\ \hline \text{C} & R & S \\ \text{D} & T & P \\ \end{array} \tag{2.1}\] a social dilemma requires \(T>R>P>S\), which is precisely the definition of a prisoner’s dilemma (Axelrod, 1984).
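Instantiating Dawes’ conditions (i) and (ii) for the two-player matrix game (2.1) makes this equivalence explicit; a minimal check:

```python
def dawes_social_dilemma(T, R, P, S):
    """Dawes' conditions (i) and (ii) for the two-player game (2.1)."""
    all_c_beats_all_d = R > P                  # condition (i)
    defection_tempts = (T > R) and (P > S)     # condition (ii): switching
                                               # to D pays against C and D
    return all_c_beats_all_d and defection_tempts   # i.e., T > R > P > S

assert dawes_social_dilemma(5, 3, 1, 0)        # canonical prisoner's dilemma
assert not dawes_social_dilemma(3, 5, 1, 0)    # R > T: no temptation to defect
```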

Cooperation becomes a graded quantity when a social dilemma is repeated, although it is still based on (atomic) cooperative actions in each round. As Leibo et al. (2017) note, what constitutes cooperation in spatially and/or temporally extended environments is more complicated and cannot be determined by mere reduction to a prisoner’s dilemma via empirical game-theoretic analysis (EGTA). EGTA is an approach to game theory that combines expert modeling with empirical data of gameplay. High-dimensional game models are reduced to so-called meta-games via a small set of heuristic strategies. The meta-game, or empirical game, is a simplified model of the high-dimensional game that is used to gain an improved qualitative understanding of the complex multi-agent interaction (Tuyls et al., 2019).

Avoiding mechanistic considerations altogether, a useful way of thinking about cooperation is in terms of how a collective can jointly achieve higher payoffs, particularly when individual agents cannot force such outcomes. Suppose that the outcome \(r^{\ast}\in\mathbb{R}^{N}\) is supported in Nash equilibrium. By the definition of a Nash equilibrium, no agent can improve its payoff through unilateral deviations in its policy. Therefore, if \(r\in\mathbb{R}^{N}\) is another outcome for which \(r_{i}\geqslant r_{i}^{\ast}\) for all \(i=1,\dots ,N\), with at least one inequality strict, then no agent that would strictly benefit when the collective moves from \(r^{\ast}\) to \(r\) can force this outcome, even though all agents would be at least as well off in \(r\) as in \(r^{\ast}\). Doing so is said to require ‘cooperation’ (Cohen, 1998).
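To make this concrete, the sketch below finds the pure Nash equilibria of the game (2.1) by brute force and tests Pareto dominance; the payoff values are the canonical prisoner’s dilemma choices:

```python
import numpy as np

def pure_nash(A, B):
    """Pure-strategy Nash equilibria of a bimatrix game (A: row payoffs)."""
    return [(i, j) for i in range(A.shape[0]) for j in range(A.shape[1])
            if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()]

def pareto_dominates(r, r_star):
    r, r_star = np.asarray(r), np.asarray(r_star)
    return bool((r >= r_star).all() and (r > r_star).any())

T, R, P, S = 5, 3, 1, 0
A = np.array([[R, S], [T, P]])   # row player's payoffs, actions (C, D)
B = A.T                          # symmetric game
print(pure_nash(A, B))           # [(1, 1)]: mutual defection is the unique NE
print(pareto_dominates((R, R), (P, P)))   # True: reaching (R, R) from the
                                          # equilibrium requires 'cooperation'
```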

Thinking of cooperation in this way hearkens back to the notion of a social dilemma. If \(\left(D,D\right)\) is a Nash equilibrium, then \(P>S\). Due to this inequality, neither \(\left(C,D\right)\) nor \(\left(D,C\right)\) can Pareto-dominate \(\left(D,D\right)\), so for ‘cooperation’ to exist it must be the case that \(R>P\). One possibility for the remaining payoff is \(T\leqslant R\), in which case \(\left(C,C\right)\) is also a Nash equilibrium. Such is the case in the stag hunt game (Skyrms, 2004). Although this situation describes an interaction in which social welfare can be improved via cooperation, it is not strictly a social dilemma by the definition used above, because the incentives of the individuals and the pair are not opposed. Rather, it represents an equilibrium selection problem. If instead \(T>R\), then \(T>R>P>S\), the defining inequalities of a prisoner’s dilemma.

Importantly, the Pareto-dominated outcome (\(r^{\ast}\) above) need not be supported in Nash equilibrium in order to define a relevant notion of cooperation. Instead, one might impose the condition that there exists no sequence of unilateral, individually rational deviations leading from \(r^{\ast}\) to an outcome that Pareto-dominates \(r^{\ast}\). For the game depicted in Equation 2.1, this condition allows \(S\geqslant P\) as long as \(T>R\). Such is the case in the snowdrift game (Sugden, 2004), in which two drivers are stuck on either side of a snowdrift blocking the road and must decide who clears it. In contrast to the prisoner’s dilemma, a driver is still better off cooperating (clearing the snowdrift) even when the other driver does nothing. One would nevertheless prefer to have the co-player do all of the clearing rather than collaborate. A simple example in which ‘cooperation’ does not exist, even under this relaxed definition, is the harmony game (Hauert, 2002), which satisfies \(R>T>S>P\) and possesses the property that the unique Nash equilibrium, \(\left(C,C\right)\), is also Pareto-efficient.

The prisoner’s dilemma, and more generally the definition of Dawes (1980), characterize ‘strict’ social dilemmas. There are also ‘weaker’ social dilemmas describing conflicts of interest to lesser degrees. Again using the game in Equation 2.1, Hauert et al. (2006) stipulate that a weak social dilemma should satisfy

  • (i) \(R>P\);
  • (ii) \(T>S\); and
  • (iii) \(R>S\) and \(T>P\).

The intuition behind these conditions is that (i) the payoff for mutual cooperation should exceed that of mutual defection; (ii) in mixed groups, the payoff to defectors should exceed that of cooperators; and (iii) regardless of what action a focal agent takes, they are better off when the co-player cooperates than when the co-player defects. The harmony game satisfies these inequalities, so it is considered a weak social dilemma even though it admits no notion of ‘cooperation’ under the definition of a ‘strict’ social dilemma. In addition to the prisoner’s dilemma and the harmony game, the remaining two weak social dilemmas are the snowdrift and stag hunt games. As one might expect, the behavior of a weak social dilemma in CSS depends on which of these classes it falls under (Hauert & Doebeli, 2004), not merely on its status as a weak social dilemma.
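Under conditions (i)–(iii), the four classes are separated by the signs of \(T-R\) and \(P-S\); a compact sketch (boundary cases with ties are assigned arbitrarily here):

```python
def weak_dilemma_class(T, R, P, S):
    """Classify a game satisfying conditions (i)-(iii) of Hauert et al."""
    assert R > P and T > S and R > S and T > P   # conditions (i)-(iii)
    if T > R:
        return "prisoner's dilemma" if P > S else "snowdrift"
    return "stag hunt" if P > S else "harmony"

print(weak_dilemma_class(5, 3, 1, 0))   # prisoner's dilemma: T>R, P>S
print(weak_dilemma_class(3, 2, 0, 1))   # snowdrift: T>R, S>P
print(weak_dilemma_class(4, 5, 2, 1))   # stag hunt: R>T, P>S
print(weak_dilemma_class(2, 5, 0, 1))   # harmony: R>T>S>P
```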

However, even in strict social dilemmas, we caution that the presence of alternative actions can destabilize conflicts of interest. For example, suppose that in addition to the actions \(C\) and \(D\) in a prisoner’s dilemma, each player can take action \(G\), which is interpreted as avoiding the prisoner’s dilemma and instead collecting a pot of gold (at no cost). If both players have separate pots of gold available to collect and the value of this gold exceeds all of the prisoner’s dilemma payoffs, then the unique Nash equilibrium of this augmented game is \(\left(G,G\right)\), which is also Pareto-efficient. As in the harmony game, there is no strict notion of ‘cooperation’ in the sense of Pareto dominance. Most importantly, there is no conflict of interest and thus no strict social dilemma. It is irrelevant that there are options \(C\) and \(D\) such that \(T>R>P>S\); this ‘embedded’ game is merely a decoy. Only when the action \(G\) is unavailable or unknown would the agents view this interaction as a social dilemma. In this sense, social dilemmas need not be preserved upon inclusion into larger games. In this example, one can easily recognize the option \(G\) as trivializing the game, but in realistic applications, especially those involving EGTA, it might be entirely unclear whether there are true conflicts of interest. Intriguingly, the augmented game described above could still be considered a sequential social dilemma (Leibo et al., 2017), owing to the fact that the reference policies representing cooperation and defection can be chosen freely (and thus can represent policies in a smaller, embedded game).
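The claim about the augmented game can be verified by brute force. In the sketch below, the payoff of 0 for a player who plays \(C\) or \(D\) while the co-player collects gold is an arbitrary modeling assumption; any value strictly below \(G\) yields the same conclusion:

```python
import numpy as np

T, R, P, S, G = 5, 3, 1, 0, 10     # G: the value of each player's pot of gold

# Row player's payoffs with actions ordered (C, D, G); playing against a
# gold-collector pays 0 here (illustrative assumption only).
A = np.array([[R, S, 0],
              [T, P, 0],
              [G, G, G]])
B = A.T                             # symmetric game

nash = [(i, j) for i in range(3) for j in range(3)
        if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()]
print(nash)                         # [(2, 2)]: both collect the gold
```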

Along these lines, the reduction to matrix games via EGTA could result in too much averaging with respect to social dilemmas. One might instead map a stochastic game not to a matrix game but to a down-sampled stochastic game with a smaller number of ‘salient’ states. A simple example would be two agents interacting in a grid world, with two colors distributed throughout the grid according to some distribution. The two players drift through the space via independent, unbiased random walks. When they appear on neighboring tiles, they play one of two matrix games, a prisoner’s dilemma or a harmony game, depending on whether the tiles have the same or different colors. There are then three relevant matrix games: the two played on neighboring tiles and one ‘null’ game, in which rewards are zero, played whenever the agents are on non-neighboring tiles. While one may view this game as having a large state space based on the agents’ positions on the grid, the scenario can also be described exactly as a three-state game whose transitions are governed by a hidden Markov model (due to the structure of the grid and the configuration of its colors). Nonetheless, by averaging appropriately, one might expect to obtain a useful approximation via a Markovian stochastic game with just three states. It is an open question whether further reduction to a matrix game would wash out important features of this spatially extended game.
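A toy instantiation of this scenario (the grid size, color distribution, torus boundary, and lazy random walk are all arbitrary choices) estimates how often each of the three ‘salient’ states is occupied:

```python
import numpy as np

rng = np.random.default_rng(3)
L = 10
color = rng.integers(0, 2, size=(L, L))     # two tile colors, iid uniform
pos = rng.integers(0, L, size=(2, 2))       # the two walkers' (x, y)

counts = {"PD": 0, "harmony": 0, "null": 0}
for _ in range(10_000):
    pos = (pos + rng.integers(-1, 2, size=(2, 2))) % L  # lazy walks on a torus
    if np.abs(pos[0] - pos[1]).sum() == 1:              # neighboring tiles
        same = color[tuple(pos[0])] == color[tuple(pos[1])]
        counts["PD" if same else "harmony"] += 1        # which game is played
    else:
        counts["null"] += 1                             # zero-reward state

print(counts)   # empirical occupancy of the three 'salient' states
```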

Regarding cooperation and the goals of CSS, one potentially misleading aspect of CSS models, which attention to the goals of MARL could help correct, is that social welfare, even in strict social dilemmas, need not be a monotonic function of the level of cooperation. The goal should not always be ‘more cooperation’ in a mechanistic sense. In the prisoner’s dilemma, many studies in CSS make the simplifying assumption that \(R>\left(S+T\right) /2\), which implies that a socially efficient outcome can be attained by full mutual cooperation. However, there are also prisoner’s dilemmas for which \(R<\left(S+T\right) /2\), in which case both agents can do better by agreeing on a strategy of alternation: one agent cooperates in even time steps only, while the other cooperates in odd time steps only. Moving from the mutually cooperative outcome of \(\left(R, R\right)\) to the Pareto-dominant outcome of \(\left(\left(S+T\right) /2, \left(S+T\right) /2\right)\) requires ‘cooperation’, despite the fact that the latter involves a lower level of the atomic action ‘cooperate’ than the former. Thus, what constitutes a cooperative strategy in a temporally extended social dilemma can be decoupled from what constitutes a cooperative action in the underlying stage game, an observation that has not fully penetrated CSS (McAvoy et al., 2022) despite being understood in MARL (Leibo et al., 2017).
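A quick check with illustrative payoffs chosen so that \(R<\left(S+T\right)/2\):

```python
T, R, P, S = 8, 3, 1, 0          # a prisoner's dilemma with R < (S + T) / 2

rounds = 1_000
mutual_cooperation = R * rounds        # both play C every round
alternation = (S + T) / 2 * rounds     # out-of-phase C/D: each player
                                       # averages (S + T) / 2 per round
print(mutual_cooperation, alternation) # 3000 vs 4000.0: alternation wins
```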

In summary, what constitutes ‘cooperation’ depends on the context. In both CSS and MARL, seemingly isolated systems of agents can involve externalities that affect how an interaction is characterized and understood. If a cooperative social dilemma is actually a zero-sum game among \(N\) players and the environment, with the environment becoming depleted as social welfare increases, then the ‘goals’ in such an environment are ambiguous. Agents might also transition between such states and those involving the possibility of true surpluses. Complicating matters further, agents could transition among states involving different numbers of agents, including those with only a single agent and the environment. In turn, an agent can reasonably hold many different conceptions of what ‘cooperation’ means, even on short timescales. Rolling such ephemeral interactions into ‘cooperative strategies’ is only more complicated.