Cooperative AI

The world may soon contain many advanced AI systems frequently interacting with humans and with each other. Can we create a solid game-theoretic foundation for reasoning about these interactions to prevent catastrophic conflict and incentivize cooperation?

Mentor

Anthony DiGiovanni
Researcher, Center on Long-Term Risk

Anthony’s research aims to mitigate catastrophic AI conflict by understanding and promoting the use of safe Pareto improvements among competing AI systems, focusing on the theoretical conditions under which agents adopt cooperative solutions.

  • Anthony DiGiovanni is a researcher at the Center on Long-Term Risk, leading the Conceptual stream. His research is on modeling causes of catastrophic conflict between AGIs, and mapping the conditions for interventions to counterfactually prevent AGI conflict. Anthony focuses on deconfusing the “commitment races” problem and developing the game theory of “safe Pareto improvements.” Example public-facing work: “Responses to apparent rationalist confusions about game / decision theory,” “Safe Pareto Improvements for Expected Utility Maximizers in Program Games.”

    LinkedIn | Scholar | LessWrong

  • Background on my research agenda

    AGIs might be motivated to make catastrophic, irreversible decisions, e.g. when delegating to successor agents, if they perceive themselves to be under strategic pressure from competition with other AGIs. This is the commitment races problem. We’d like to understand which decision-theoretic properties and (lack of) capabilities would mitigate AGIs’ cooperation failures due to commitment races, in order to help prioritize interventions on how AIs are trained and used.
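
    To make the strategic structure concrete, here is a minimal toy sketch in Python (hypothetical payoffs, purely illustrative and not taken from any of the cited work) of why unilaterally committing to an aggressive demand looks tempting even though mutual commitment locks in the worst outcome for both sides:

    ```python
    # Toy "commitment race" payoff structure in a Chicken-like demand game.
    # Payoffs are made up for illustration. Each agent either commits to an
    # aggressive demand or stays flexible; committing against a flexible
    # opponent pays off, but simultaneous incompatible commitments lock in
    # the conflict outcome.
    from itertools import product

    COMMIT, FLEX = "commit", "flex"

    # Payoff tuples are (player 0, player 1).
    PAYOFFS = {
        (COMMIT, COMMIT): (-10, -10),  # incompatible commitments -> conflict
        (COMMIT, FLEX):   (3, 1),      # the committer extracts the larger share
        (FLEX,   COMMIT): (1, 3),
        (FLEX,   FLEX):   (2, 2),      # neither commits -> fair bargain
    }

    def best_response(opponent_action, player):
        """Return the action maximizing this player's payoff against a fixed opponent action."""
        def payoff(action):
            profile = (action, opponent_action) if player == 0 else (opponent_action, action)
            return PAYOFFS[profile][player]
        return max((COMMIT, FLEX), key=payoff)

    for profile in product((COMMIT, FLEX), repeat=2):
        print(profile, PAYOFFS[profile])

    print(best_response(FLEX, 0))     # 'commit': tempting if the other side stays flexible
    print(PAYOFFS[(COMMIT, COMMIT)])  # (-10, -10): what happens if both race to commit
    ```

    The one-shot table only captures the payoff structure; real commitment races also involve dynamics (who can commit first, and how credibly) that this sketch abstracts away.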

    One of the most important capabilities for avoiding conflict is the ability to implement safe Pareto improvements (SPIs) (Oesterheld and Conitzer, 2022). The high-level motivation for SPIs is: even if agents who get into conflict would all rather have cooperated, they might worry (under their uncertainty before the conflict) that being more willing to cooperate makes them more exploitable. SPIs are designed to avoid this problem for agents with high transparency: agents who use SPIs will, when they would otherwise get into conflict, instead agree to a more cooperative outcome without changing their relative bargaining power (thus avoiding exploitability).
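
    As a toy illustration of this idea, here is a minimal sketch (my own, with made-up payoffs; it follows only the high-level construction in Oesterheld and Conitzer (2022), not their formal program-game setting): the conflict cell of a demand game is remapped to a cooperative outcome, every other cell is left untouched so relative bargaining positions are preserved, and a cell-by-cell check confirms the result is a weak Pareto improvement however the game is played.

    ```python
    # Toy safe Pareto improvement (SPI) on a 2x2 demand game. Payoffs are
    # illustrative. The transformed game is identical to the original except
    # that the mutually-dispreferred conflict cell is remapped to the fair
    # split, so no other outcome (and hence no relative bargaining position)
    # changes.
    HIGH, LOW = "demand_high", "demand_low"

    ORIGINAL = {
        (HIGH, HIGH): (-5, -5),  # both demand the high share -> costly conflict
        (HIGH, LOW):  (3, 1),
        (LOW,  HIGH): (1, 3),
        (LOW,  LOW):  (2, 2),
    }

    TRANSFORMED = dict(ORIGINAL)
    TRANSFORMED[(HIGH, HIGH)] = (2, 2)  # only the conflict outcome is replaced

    def weakly_pareto_improves(original, transformed):
        """True if every strategy profile gives every player at least as much
        payoff in the transformed game as in the original game."""
        return all(
            transformed[profile][i] >= original[profile][i]
            for profile in original
            for i in (0, 1)
        )

    def unchanged_off_conflict(original, transformed, conflict=(HIGH, HIGH)):
        """True if all non-conflict cells are untouched, i.e. relative
        bargaining power outside the conflict outcome is preserved."""
        return all(
            original[profile] == transformed[profile]
            for profile in original
            if profile != conflict
        )

    print(weakly_pareto_improves(ORIGINAL, TRANSFORMED))   # True
    print(unchanged_off_conflict(ORIGINAL, TRANSFORMED))   # True
    ```

    The actual SPI constructions in the cited papers work through instructions given to representatives (programs) rather than by literally editing a payoff matrix; the matrix edit here just makes the "more cooperative outcome without changing relative bargaining power" condition easy to check.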

    Research topics

    I’d be most keen on supervising projects about the following:

    • Mapping concrete conditions for agents to (want to) use SPIs: We have theoretical results on the sufficient conditions under which agents who are capable of using SPIs prefer to use them. (See Oesterheld and Conitzer (2022) and DiGiovanni and Clifton (2024).) The next steps are: 1) What are the specific ways in which these sufficient conditions (e.g., assumptions about the agents’ beliefs) plausibly might not hold? 2) What does it look like for prosaic AIs to satisfy or violate these sufficient conditions?

    • Understanding SPI implementation in early human-AI institutions: SPIs have been well-studied conceptually in the context of single agents, but it’s less clear how they would work in a system of many humans and AI assistants. This question may be a high priority because a) it’s relatively time-sensitive (if AIs face strategic pressures early in takeoff), and b) compared to a coherent agent, a system of decision-makers is less likely to use SPIs by default.

    But I will also consider applications on other topics related to the agenda sketched above.

  • Qualifications

    • Necessary:

    • Ideal:

      • Some familiarity with decision theory

      • Familiarity with prosaic AI development and alignment

  • Mentorship structure

    • Weekly meetings (30–60 min), mostly for giving high-level direction

    • Comments on substance of documents (i.e., soundness and clarity of key arguments), mostly not on low-level points or writing style

    • I’ll be mostly hands-off otherwise