Cooperative AI

The world may soon contain many advanced AI systems frequently interacting with humans and with each other. Can we create a solid game-theoretic foundation for reasoning about these interactions to prevent catastrophic conflict and incentivize cooperation?

Mentors

Anthony DiGiovanni
Researcher, Center on Long-Term Risk

Anthony’s research aims to mitigate catastrophic AI conflict by understanding and promoting the use of safe Pareto improvements among competing AI systems, focusing on the theoretical conditions that encourage cooperative outcomes.

  • Anthony DiGiovanni is a researcher at the Center on Long-Term Risk, leading the Conceptual stream. His research is on modeling causes of catastrophic conflict between AGIs, and mapping the conditions for interventions to counterfactually prevent AGI conflict. Anthony focuses on deconfusing the “commitment races” problem and developing the game theory of “safe Pareto improvements.” Example public-facing work: “Responses to apparent rationalist confusions about game / decision theory,” “Safe Pareto Improvements for Expected Utility Maximizers in Program Games.”

    LinkedIn | Scholar | LessWrong

  • Background on my research agenda

    AGIs might be motivated to make catastrophic, irreversible decisions, e.g. in delegating to successor agents, when they perceive themselves as being under strategic pressure from competition with other AGIs. This is the commitment races problem. We’d like to understand which decision-theoretic properties and (lack of) capabilities would mitigate AGIs’ cooperation failures due to commitment races, to help prioritize interventions on how AIs are trained and used.
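
    To make the strategic pressure concrete, here is a minimal toy sketch (the payoff numbers and the demand-game framing are invented for illustration, not taken from any cited work): each agent can irreversibly lock in an aggressive demand before learning anything about the other, and committing looks individually attractive precisely when it is collectively disastrous.

    ```python
    # Toy commitment-race sketch (hypothetical payoffs, illustration only).
    # Each agent decides whether to irreversibly commit to an aggressive demand
    # before observing the other. Committing early captures most of the surplus,
    # but mutually incompatible commitments lock in a conflict both sides disprefer.

    COMMIT, WAIT = "commit", "wait"

    # (row payoff, column payoff)
    PAYOFFS = {
        (COMMIT, COMMIT): (-10, -10),  # incompatible commitments: catastrophic conflict
        (COMMIT, WAIT):   (3, 1),      # early committer extracts most of the surplus
        (WAIT,   COMMIT): (1, 3),
        (WAIT,   WAIT):   (2, 2),      # ordinary bargaining, roughly even split
    }

    def best_response(p_opponent_commits: float) -> str:
        """Expected-value best response for the row agent, given its belief that
        the other agent has (or will have) committed."""
        p = p_opponent_commits
        ev_commit = p * PAYOFFS[(COMMIT, COMMIT)][0] + (1 - p) * PAYOFFS[(COMMIT, WAIT)][0]
        ev_wait = p * PAYOFFS[(WAIT, COMMIT)][0] + (1 - p) * PAYOFFS[(WAIT, WAIT)][0]
        return COMMIT if ev_commit > ev_wait else WAIT

    # If each side believes the other is unlikely to have committed yet, committing
    # looks best to both, which is exactly how both can end up in the (-10, -10) cell.
    print(best_response(0.05))  # prints "commit"
    ```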

    One of the most important capabilities for avoiding conflict is the ability to implement safe Pareto improvements (SPIs) (Oesterheld and Conitzer, 2022). The high-level motivation for SPIs is: even if agents who get into conflict would all rather have cooperated, they might worry (under their uncertainty before the conflict) that being more willing to cooperate makes them more exploitable. SPIs are designed to avoid this problem for agents with high mutual transparency: agents who use SPIs will, when they would otherwise get into conflict, instead agree to a more cooperative outcome without changing their relative bargaining power (thus avoiding exploitability).
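
    As a minimal sketch of the "safe" part, here is a toy Chicken-style demand game (hypothetical payoffs and names, in the spirit of Oesterheld and Conitzer (2022), not an example from the cited papers). The SPI only relabels the catastrophic conflict cell with an outcome both players weakly prefer, and it applies only if both players opt in, so adopting it cannot be exploited; the cited papers address the further conditions ensuring the relabeling does not itself change how the agents play.

    ```python
    # Toy SPI sketch (hypothetical payoffs, illustration only).

    DARE, SWERVE = "dare", "swerve"

    # Original game G: (row payoff, column payoff)
    G = {
        (DARE, DARE):     (-10, -10),  # catastrophic conflict
        (DARE, SWERVE):   (3, 1),
        (SWERVE, DARE):   (1, 3),
        (SWERVE, SWERVE): (2, 2),
    }

    # Transformed game G': identical except the conflict cell is replaced by a
    # milder "stand-down" outcome that both players weakly prefer.
    G_SPI = dict(G)
    G_SPI[(DARE, DARE)] = (0, 0)

    def payoff(actions, row_opts_in: bool, col_opts_in: bool):
        """Use the transformed table only if both players adopted the SPI;
        otherwise fall back to the original game, so opting in is not exploitable."""
        table = G_SPI if (row_opts_in and col_opts_in) else G
        return table[actions]

    # The "safe" property: cell by cell, the transformed payoffs weakly dominate the
    # originals, so holding the agents' behavior fixed, nobody is made worse off.
    assert all(
        G_SPI[cell][0] >= G[cell][0] and G_SPI[cell][1] >= G[cell][1]
        for cell in G
    )
    print(payoff((DARE, DARE), row_opts_in=True, col_opts_in=True))   # (0, 0)
    print(payoff((DARE, DARE), row_opts_in=True, col_opts_in=False))  # (-10, -10)
    ```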

    Research topics

    I’d be most keen on supervising projects about the following:

    • Mapping concrete conditions for agents to (want to) use SPIs: We have theoretical results on the sufficient conditions under which agents who are capable of using SPIs prefer to use them. (See Oesterheld and Conitzer (2022) and DiGiovanni and Clifton (2024).) The next steps are to ask: 1) In which specific ways might these sufficient conditions (e.g., assumptions about the agents’ beliefs) plausibly fail to hold? 2) What would it look like for prosaic AIs to satisfy or violate these conditions?

    • Understanding SPI implementation in early human-AI institutions: SPIs have been well-studied conceptually in the context of single agents, but it’s less clear how they would work in a system of many humans and AI assistants. This question may be a high priority because a) it’s relatively time-sensitive (if AIs face strategic pressures early in takeoff), and b) compared to a coherent agent, a system of decision-makers is less likely to use SPIs by default.

    But I will also consider applications on other topics related to the agenda sketched above.

  • Qualifications

    • Necessary:

    • Ideal:

      • Some familiarity with decision theory

      • Familiarity with prosaic AI development and alignment

  • Mentorship structure

    • Weekly meetings (30–60 min), mostly for giving high-level direction

    • Comments on the substance of documents (i.e., soundness and clarity of key arguments), mostly not on low-level points or writing style

    • I’ll be mostly hands-off otherwise

Christian Schroeder de Witt
Postdoctoral Researcher, University of Oxford

Christian specializes in AI safety and multi-agent security, with foundational contributions to secure steganography and reinforcement learning.

  • Dr. Christian Schroeder de Witt is a leading researcher in foundational AI and information security, celebrated for breakthroughs in the 25+ year-old challenge of perfectly secure steganography and for the development of illusory attacks on reinforcement learning agents. During his Ph.D., he helped establish the field of cooperative deep multi-agent reinforcement learning, contributing to the creation of several popular algorithms and standard benchmark environments. He is currently a postdoc at the University of Oxford's Torr Vision Group and was previously a visiting researcher with Turing Award winner Prof. Yoshua Bengio at MILA (Quebec).

    His academic credentials include distinguished master's degrees in Physics and in Computer Science from Oxford, where he made significant contributions to categorical quantum mechanics. His work has gained international acclaim, including recognition in Quanta Magazine, Scientific American, and Bruce Schneier’s Security Blog, and his selection in 2021 as a "30 under 35 rising strategist (Europe)" by the Schmidt Futures International Strategy Forum and the European Council on Foreign Relations.

  • I am a deep multi-agent learning / Cooperative AI researcher now working on AI Safety (manipulation, deception, ELK) and Multi-Agent Security (collusion, illusory attacks). My other research interests are Information Theory, Security, (Deep, Multi-Agent) Reinforcement Learning, and Agent-Based Modeling. You may view more of my past research at Google Scholar.

  • The key qualities I’m looking for in scholars are passion, autonomy, and kindness.

    Some of my students and mentees include:

    • Linas Nasvytis (MSc Statistics student, now Research Fellow at Harvard University (Psychology and ML))

    • Yat Long Lo (MSc Computer Science student, winner of the Tony Hoare Prize for the best MSc thesis in Computer Science; now at the Dyson Robot Learning Lab)

    • Khaulat Abdulhakeem (mentee, now an MS student in Education Data Science at Stanford University)

    • Eshaan Agrawal (mentee and collaborator, now ORISE Fellow at the Department of Energy)

Tsvi Benson-Tilsen
Researcher, MIRI

Tsvi’s research focuses on understanding core concepts of agency, mind, and goal-pursuit in order to make progress on AI intent alignment, combining philosophical rigor with the practical constraints and desiderata essential for developing corrigible strong minds.

  • Tsvi Benson-Tilsen works on the foundations of rational agency, including logical uncertainty, logical counterfactuals, and reflectively stable decision making, as well as other questions of AI alignment. Before joining MIRI as a full-time researcher, he collaborated on “Logical Induction”. Tsvi holds a BSc in Mathematics with honors from the University of Chicago, and is on leave from the UC Berkeley Group in Logic and the Methodology of Science PhD program. Tsvi joined MIRI in June 2017.

  • The project is to apply speculative analytic philosophy to core concepts about agency, mind, and goal-pursuit, paving the way to address the hard problem of AGI intent alignment. We'll bring in criteria (constraints and desiderata) from the nature of agency and from the engineering goal of creating a corrigible strong mind. Example constraint: by default, if a mind is highly capable then it also quickly increases its capabilities. Example desideratum: for some mental element to determine a mind's ultimate effects, it has to be stable under pressures from reflective self-modification; so we'd like a concept of core effect-determiners that describes mental elements which can be stable. We'll look at the demands that these criteria make on our concepts, and find better concepts. See "A hermeneutic net for agency."

    • Software engineering isn't very relevant. Knowledge about machine learning isn't very relevant. A strong math background isn't directly needed but would be helpful for the thought patterns. Math content knowledge that's somewhat relevant: mainly classical agent foundations topics (logic, probability, games, decision theory, algorithmic complexity).

    • The project will be centered around doing standard analytic philosophy, but with much more impatience to get to the core of things, and with more willingness to radically deconstruct preconceived ideas to set the stage for creating righter concepts. So a prerequisite is to have already been seriously struggling with philosophical questions around mind / agency / language / ontology / value. Having struggled with a bit of Quine, Wittgenstein, Anscombe, Fodor, Deacon, Heidegger, Bergson, Lakoff, etc. is some positive indicator.

    • You should be able and happy to "buy in to" abstract arguments, while helping keep them grounded in concrete examples and desiderata. E.g. "concepts are grounded in counterfactuals and counterfactuals are grounded in one's own possible actions" should seem relevant, like the sort of thing you might have or form opinions about (after some clarification).

    • Understanding of "the MIRI view" on AGI X-risk is helpful, and practically speaking, some significant overlap with that view is probably needed to hit the ground running--to be aimed at the hard problems of intent alignment. Indeed: Applicants must be interested in the hard problem of AGI intent alignment.

    • Applicants must know and/or be willing to learn from me that unexamined deference will render them unable to make any key progress.