Developmental Interpretability

Singular learning theory (SLT) offers a principled scientific approach to detecting phase transitions during ML training. Can we develop methods to identify, understand, and ultimately prevent the formation of dangerous capabilities and harmful values?

Mentors

Research projects

  • Timaeus focuses on two primary avenues of research: singular learning theory and developmental interpretability.

    Singular Learning Theory (SLT) is a mathematical theory of neural networks and other hierarchical learning machines established by Sumio Watanabe. It has yielded concepts like the learning coefficient and insights into phase transitions in Bayesian learning. Further advances in singular learning theory look set to yield new tools for alignment.

    Developmental Interpretability (“devinterp”) is an approach to understanding the emergence of structure in neural networks, which is informed by singular learning theory but also draws on mechanistic interpretability and ideas from statistical physics and developmental biology. The key idea is that phase transitions organize learning and that detecting, locating, and understanding these transitions could pave a road to evaluation tools that prevent the development of dangerous capabilities, values, and behaviors.

    For an overview of how these fields fit together, see the closing talk from the 2023 Primer on SLT and alignment. Here, Daniel and Jesse sketch a vision for how singular learning theory and developmental interpretability could contribute to the alignment portfolio.
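To make the learning coefficient mentioned above concrete, here is a minimal, illustrative sketch of estimating a local learning coefficient with SGLD sampling around a trained minimum. This is not the devinterp library's implementation; the model (one-parameter linear regression), the WBIC-style inverse temperature, and all hyperparameters (localization strength, step size, step counts) are assumptions chosen to keep the toy example stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = 2x + noise, fit with a one-parameter model y_hat = w * x.
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)

def loss(w):
    # Empirical loss L_n(w): mean squared error over the dataset.
    return float(np.mean((y - w * x) ** 2))

def grad(w):
    # Gradient of L_n(w) with respect to w.
    return float(np.mean(-2.0 * x * (y - w * x)))

# Step 1: find a local minimum w* (here, the least-squares solution).
w_star = float(np.dot(x, y) / np.dot(x, x))

# Step 2: SGLD sampling near w* at inverse temperature beta, with a weak
# quadratic term (strength gamma) keeping samples localized at w*.
beta = 1.0 / np.log(n)  # WBIC-style inverse temperature (assumed choice)
gamma = 1.0             # localization strength (assumed hyperparameter)
eps = 1e-4              # SGLD step size (assumed hyperparameter)
num_steps, burn_in = 3000, 500

w = w_star
losses = []
for t in range(num_steps):
    drift = -(eps / 2.0) * (beta * n * grad(w) + gamma * (w - w_star))
    w += drift + np.sqrt(eps) * rng.normal()
    if t >= burn_in:
        losses.append(loss(w))

# Step 3: the local learning coefficient estimate,
# lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)).
lambda_hat = n * beta * (np.mean(losses) - loss(w_star))
print(f"estimated local learning coefficient: {lambda_hat:.2f}")
```

For a regular (non-singular) one-parameter model like this, theory predicts a learning coefficient of d/2 = 0.5, so the estimate should land in that vicinity; for singular models the coefficient can be strictly smaller, which is what makes it an interesting probe of structure.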

Personal Fit

  • Research

    We recommend checking out some of the publications, blog posts, or lectures we’ve put out to determine if this research area is something for you.

    SLT and devinterp are a natural fit for people with a background in mathematics (especially algebraic geometry, empirical process theory, dynamical systems, and learning theory), physics (especially statistical physics, catastrophe theory, and condensed matter), programming language theory, and developmental biology or neuroscience.

    Having a background in one of these fields is, however, not a prerequisite to apply. We also don’t expect you to necessarily be familiar with SLT. If anything, we’re prioritizing applicants with more of a software development background who are familiar with PyTorch and able to get ML experiments up and running quickly on their own.

    Mentorship

    Mentorship will take place remotely via the devinterp Discord. You’ll meet once a week in a group setting with Daniel and more regularly (one-on-one) with Jesse and the other people working on your project.

Selection Questions

  • Similar to Neel’s, our selection task involves spending 10 hours (up to 16, for fairness) investigating a small open problem in developmental interpretability and writing up your progress in a Google Doc that you share with us.

    To help you get started, check out the example notebooks in the devinterp repo. For inspiration, see the list of project ideas on the devinterp site or the discussions in the devinterp Discord (especially the #open-problems channel). Feel free to propose your own ideas or to copy wholesale from the examples. We don’t expect you to solve the problem, to make much progress at all, or to have any prior familiarity with devinterp.

    If you’re also applying to Neel’s stream, you’re welcome to submit that application to our stream if it is relevant. Do note that while mechinterp and devinterp overlap substantially, they’re not the same: not every project in one category translates to the other.

    See Neel's info doc for more on the kinds of applications/problems he's interested in.